diff --git a/docs/superpowers/specs/2026-03-17-infrastructure-overview-design.md b/docs/superpowers/specs/2026-03-17-infrastructure-overview-design.md
new file mode 100644
index 00000000..f6045a44
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-17-infrastructure-overview-design.md
@@ -0,0 +1,455 @@
+# Infrastructure Overview — Admin Pages Design
+
+**Date:** 2026-03-17
+**Status:** Approved
+**Scope:** Phase 1 implementation; full vision documented with Phase 2+ sections marked
+
+## Overview
+
+Add Database and OpenSearch admin pages to the Cameleer3 Server UI, allowing administrators to monitor subsystem health, inspect metrics, and perform basic maintenance actions. Restructure admin navigation from a single OIDC page to a sidebar sub-menu with dedicated pages per concern.
+
+## Goals
+
+- Give admins real-time visibility into PostgreSQL and OpenSearch health, performance, and storage
+- Enable basic maintenance actions (kill queries, delete indices) without SSH/kubectl access
+- Provide configurable thresholds for visual status indicators (green/yellow/red)
+- Establish a database-backed audit log for all admin actions (SOC2 compliance foundation)
+- Design for future expansion (VACUUM, reindex, OPERATOR role) without requiring restructuring
+
+## Non-Goals (Phase 1)
+
+- Database maintenance actions (VACUUM ANALYZE, Reindex)
+- OpenSearch bulk operations (Force Reindex All, Flush)
+- OPERATOR role with restricted permissions
+- TimescaleDB-specific features (hypertable stats, continuous aggregate status)
+- Alerting or notifications beyond visual indicators
+
+---
+
+## 1. Admin Navigation Restructuring
+
+### Current State
+
+Single gear icon at bottom of `AppSidebar` linking directly to `/admin/oidc`.
+
+### New Structure
+
+The gear icon expands/collapses an admin sub-menu in the sidebar:
+
+```
+── Apps ──────────────
+   app-1
+   app-2
+── Admin (gear icon) ─
+   Database    → /admin/database
+   OpenSearch  → /admin/opensearch
+   Audit Log   → /admin/audit
+   OIDC        → /admin/oidc
+   Users       → (future)
+```
+
+- Admin section visible only to users with `ADMIN` role
+- Section collapsed by default; state persisted in localStorage
+- Active sub-item highlighted
+- `/admin` redirects to `/admin/database`
+- Existing `OidcAdminPage` unchanged functionally, re-routed from being the sole admin page to a sub-page
+
+---
+
+## 2. Database Page (`/admin/database`)
+
+### Header
+
+- Connection status badge (green/red)
+- PostgreSQL version (with TimescaleDB extension noted if present)
+- Host and schema name
+- Manual refresh button (refreshes all sections)
+
+### Connection Pool Section
+
+- Visual bar showing active connections vs. max pool size
+- Metrics: active, idle, pending, max wait time
+- Status badge based on configurable threshold (% of pool in use)
+- Source: HikariCP pool MXBean
+- **Auto-refreshes every 15 seconds**
+
+### Table Sizes Section
+
+- Table with columns: Table, Rows, Size, Index Size
+- All application tables listed (executions, processor_executions, route_diagrams, agent_metrics, users, oidc_config, admin_thresholds)
+- Summary row: total data size, total index size
+- Source: `pg_stat_user_tables` + `pg_relation_size`
+- **Manual refresh only** (expensive query)
+
+### Active Queries Section
+
+- Table with columns: PID, Duration, State, Query (truncated), Action
+- Queries > warning threshold highlighted yellow, > critical threshold highlighted red
+- Kill button per row → calls `pg_terminate_backend(pid)`
+- Kill requires confirmation dialog
+- After kill, query list refreshes automatically
+- Source: `pg_stat_activity`
+- **Auto-refreshes every 15 seconds**
+
+### Maintenance Section (Phase 2 — Visible but Disabled)
+
+- Buttons: Run VACUUM ANALYZE, Reindex Tables
+- Greyed out with tooltip: "Available in a future release"
+
+### Thresholds Section
+
+- Collapsible, collapsed by default
+- Configurable values:
+  - Connection pool usage: warning % and critical %
+  - Query duration: warning seconds and critical seconds
+- Save button persists to database
+
+---
+
+## 3. OpenSearch Page (`/admin/opensearch`)
+
+### Header
+
+- Cluster health badge (green/yellow/red — maps directly to OpenSearch cluster health)
+- OpenSearch version
+- Node count
+- Host URL
+- Manual refresh button
+
+### Indexing Pipeline Section
+
+- Visual bar showing queue depth vs. max queue size
+- Metrics: queue depth, failed document count, debounce interval, indexing rate (docs/s), time since last indexed
+- Status badge based on configurable thresholds
+- Source: `SearchIndexer` internal stats (exposed via `SearchIndexerStats` interface)
+- **Auto-refreshes every 15 seconds**
+
+### Indices Section
+
+- **Search/filter** by index name pattern (text input)
+- **Filter by health** — All / Green / Yellow / Red dropdown
+- **Sortable columns** — Name, Docs, Size, Health, Shards (click column header)
+- **Pagination** — 10 per page, server-side
+- **Summary row** above table — total index count, total docs, total storage
+- **Delete button** (trash icon) per row:
+  - Confirmation dialog: "Delete index `{name}`? This cannot be undone."
+  - User must type the index name to confirm
+  - After deletion, table and summary refresh
+- Table columns: Index, Docs, Size, Health, Shards (primary/replica)
+- Source: OpenSearch `_cat/indices` API
+- **Manual refresh only**
+
+### Performance Section
+
+- Metrics: query cache hit rate, request cache hit rate, average search latency, average indexing latency, JVM heap used (visual bar with used/max)
+- Source: OpenSearch `_nodes/stats` API
+- **Auto-refreshes every 15 seconds**
+
+### Operations Section (Phase 2 — Visible but Disabled)
+
+- Buttons: Force Reindex All, Flush Index, Delete Index (bulk via checkbox selection)
+- Greyed out with tooltip: "Available in a future release"
+
+### Thresholds Section
+
+- Collapsible, collapsed by default
+- Configurable values:
+  - Cluster health: warning level, critical level
+  - Queue depth: warning count, critical count
+  - JVM heap usage: warning %, critical %
+  - Failed docs: warning count, critical count
+- Save button persists to database
+
+---
+
+## 4. Audit Log Page (`/admin/audit`)
+
+### Purpose
+
+Database-backed audit trail of all administrative actions across the system. Provides SOC2-compliant evidence of who did what, when, and from where. The audit log is append-only — entries cannot be modified or deleted through the UI or API.
+
+### Header
+
+- Total event count
+- Date range selector (default: last 7 days)
+
+### Audit Log Table
+
+```
+┌─ Audit Log ────────────────────────────────────────────────┐
+│ Date range: [2026-03-10] to [2026-03-17]                   │
+│ [User: All ▾] [Category: All ▾] [Search: ________]         │
+│                                                            │
+│ Timestamp            User   Category   Action        Target│
+│ 2026-03-17 14:32:01  admin  INFRA      kill_query    PID 42│
+│ 2026-03-17 14:28:15  admin  INFRA      delete_idx    exec-…│
+│ 2026-03-17 12:01:44  admin  CONFIG     update        oidc  │
+│ 2026-03-17 09:15:22  jdoe   AUTH       login               │
+│ 2026-03-16 18:45:00  admin  USER_MGMT  update_roles  u:5   │
+│ ...                                                        │
+│                                                            │
+│ ◀ 1 2 3 ... 12 ▶                       Showing 1-25 of 294 │
+└────────────────────────────────────────────────────────────┘
+```
+
+- **Filterable** by user, category, date range
+- **Searchable** by free text (matches action, target, detail)
+- **Sortable** by timestamp (default: newest first)
+- **Pagination** — 25 per page, server-side
+- **Detail expansion** — click a row to expand and show full `detail` JSON
+- **Read-only** — no edit or delete actions available (compliance requirement)
+- **Export** (Phase 2) — CSV/JSON download for auditors
+
+### Audit Categories
+
+| Category | Actions Logged |
+|----------|---------------|
+| `INFRA` | kill_query, delete_index, update_thresholds |
+| `AUTH` | login, login_oidc, logout, login_failed |
+| `USER_MGMT` | create_user, update_roles, delete_user |
+| `CONFIG` | update_oidc, delete_oidc, test_oidc |
+
+### What Gets Logged
+
+Every admin action across the system, not just infrastructure pages:
+
+- **Infrastructure:** kill query, delete OpenSearch index, save thresholds
+- **OIDC:** save config, delete config, test connection
+- **User management:** update roles, delete user
+- **Authentication:** login (success and failure), OIDC login, logout
+
+### Audit Record Fields
+
+| Field | Description |
+|-------|-------------|
+| `timestamp` | When the action occurred (server time, UTC) |
+| `username` | Authenticated user who performed the action |
+| `action` | Machine-readable action name (e.g., `kill_query`, `delete_index`) |
+| `category` | Grouping: `INFRA`, `AUTH`, `USER_MGMT`, `CONFIG` |
+| `target` | What was acted on (e.g., PID, index name, user ID) |
+| `detail` | JSONB with action-specific context (e.g., query text for killed query, old/new roles for role change) |
+| `result` | `SUCCESS` or `FAILURE` |
+| `ip_address` | Client IP address from the request |
+
+### Backend Implementation
+
+- `AuditService` — central service injected into all admin controllers
+- Single method: `log(action, category, target, detail, result)`
+- Extracts username and IP from `SecurityContextHolder` and `HttpServletRequest`
+- Writes to both the `audit_log` table AND SLF4J (belt and suspenders)
+- Async write option not used — audit must be synchronous for compliance guarantees
+
+---
+
+## 5. Backend API
+
+All endpoints under `/api/v1/admin/` — secured by existing Spring Security filter chain (`ROLE_ADMIN` required). Controllers additionally annotated with `@PreAuthorize("hasRole('ADMIN')")` for defense-in-depth.
+
+### Database Endpoints
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET` | `/admin/database/status` | Version, host, schema, connection state |
+| `GET` | `/admin/database/pool` | Active, idle, pending, max wait (HikariCP) |
+| `GET` | `/admin/database/tables` | Table names, row counts, data sizes, index sizes |
+| `GET` | `/admin/database/queries` | Active queries: pid, duration, state, SQL |
+| `POST` | `/admin/database/queries/{pid}/kill` | Terminate query via `pg_terminate_backend` |
+
+### OpenSearch Endpoints
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET` | `/admin/opensearch/status` | Version, host, cluster health, node count |
+| `GET` | `/admin/opensearch/pipeline` | Queue depth, failed count, debounce, rate, last indexed |
+| `GET` | `/admin/opensearch/indices` | Paginated, sortable, filterable index list |
+| `DELETE` | `/admin/opensearch/indices/{name}` | Delete specific index (with audit log) |
+| `GET` | `/admin/opensearch/performance` | Cache rates, latencies, JVM heap |
+
+#### Indices Query Parameters
+
+| Param | Type | Default | Description |
+|-------|------|---------|-------------|
+| `search` | string | — | Filter by index name pattern |
+| `health` | enum | `ALL` | Filter by health: ALL, GREEN, YELLOW, RED |
+| `sort` | string | `name` | Sort field: name, docs, size, health |
+| `order` | enum | `asc` | Sort direction: asc, desc |
+| `page` | int | `0` | Page number (zero-based) |
+| `size` | int | `10` | Page size |
+
+### Audit Log Endpoints
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET` | `/admin/audit` | Paginated, filterable audit log entries |
+
+#### Audit Log Query Parameters
+
+| Param | Type | Default | Description |
+|-------|------|---------|-------------|
+| `username` | string | — | Filter by username |
+| `category` | enum | — | Filter by category: INFRA, AUTH, USER_MGMT, CONFIG |
+| `search` | string | — | Free text search across action, target, detail |
+| `from` | ISO date | 7 days ago | Start of date range |
+| `to` | ISO date | now | End of date range |
+| `sort` | string | `timestamp` | Sort field |
+| `order` | enum | `desc` | Sort direction: asc, desc |
+| `page` | int | `0` | Page number (zero-based) |
+| `size` | int | `25` | Page size |
+
+### Thresholds Endpoints
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET` | `/admin/thresholds` | All configured thresholds |
+| `PUT` | `/admin/thresholds` | Save thresholds (database + OpenSearch in one payload) |
+
+### Thresholds Payload
+
+```json
+{
+  "database": {
+    "connectionPoolWarning": 80,
+    "connectionPoolCritical": 95,
+    "queryDurationWarning": 1.0,
+    "queryDurationCritical": 10.0
+  },
+  "opensearch": {
+    "clusterHealthWarning": "YELLOW",
+    "clusterHealthCritical": "RED",
+    "queueDepthWarning": 100,
+    "queueDepthCritical": 500,
+    "jvmHeapWarning": 75,
+    "jvmHeapCritical": 90,
+    "failedDocsWarning": 1,
+    "failedDocsCritical": 10
+  }
+}
+```
+
+---
+
+## 6. Security
+
+### Enforcement Layers
+
+1. **Spring Security filter chain** — `/api/v1/admin/**` requires `ROLE_ADMIN` (existing configuration)
+2. **Controller annotation** — `@PreAuthorize("hasRole('ADMIN')")` on each controller class (defense-in-depth)
+3. **UI role check** — sidebar admin section hidden for non-admin users (cosmetic only, not a security boundary)
+
+### Audit Logging
+
+All admin actions are persisted to the `audit_log` database table (see Section 4 and Section 7 — Data Storage) AND logged via SLF4J at INFO level. The database record is the source of truth for compliance; the SLF4J log provides operational visibility.
+
+The `AuditService` is injected into all admin controllers (infrastructure, OIDC, user management) and the authentication flow. See Section 4 (Audit Log Page) for full details on what is logged and the record structure.
+
+### Future: OPERATOR Role (Phase 2+)
+
+Design anticipates a read-only `OPERATOR` role:
+- Can view all monitoring data
+- Cannot perform destructive actions (kill, delete)
+- Implementation: method-level `@PreAuthorize` on action endpoints, UI conditionally disables buttons based on role
+
+---
+
+## 7. Data Storage
+
+### New Flyway Migration: V9
+
+```sql
+CREATE TABLE admin_thresholds (
+    id INTEGER PRIMARY KEY DEFAULT 1,
+    config JSONB NOT NULL DEFAULT '{}',
+    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
+    updated_by TEXT NOT NULL,
+    CONSTRAINT single_row CHECK (id = 1)
+);
+
+CREATE TABLE audit_log (
+    id BIGSERIAL PRIMARY KEY,
+    timestamp TIMESTAMPTZ NOT NULL DEFAULT now(),
+    username TEXT NOT NULL,
+    action TEXT NOT NULL,
+    category TEXT NOT NULL,
+    target TEXT,
+    detail JSONB,
+    result TEXT NOT NULL,
+    ip_address TEXT
+);
+
+CREATE INDEX idx_audit_log_timestamp ON audit_log (timestamp DESC);
+CREATE INDEX idx_audit_log_username ON audit_log (username);
+CREATE INDEX idx_audit_log_category ON audit_log (category);
+```
+
+**admin_thresholds:**
+- Single-row table (same pattern as `oidc_config`)
+- JSON column for flexibility — adding new thresholds doesn't require schema changes
+- Tracks who last updated and when
+
+**audit_log:**
+- Append-only table — no UPDATE or DELETE exposed via API
+- Indexed on timestamp (primary query axis), username, and category for filtered views
+- JSONB `detail` column holds action-specific context without schema changes
+- No foreign key to `users` table — username is denormalized so audit records survive user deletion
+
+---
+
+## 8. Frontend Architecture
+
+### New Files
+
+| File | Purpose |
+|------|---------|
+| `pages/admin/DatabaseAdminPage.tsx` | Database monitoring and management |
+| `pages/admin/OpenSearchAdminPage.tsx` | OpenSearch monitoring and management |
+| `pages/admin/AuditLogPage.tsx` | Audit log viewer |
+| `api/queries/admin/database.ts` | React Query hooks for database endpoints |
+| `api/queries/admin/opensearch.ts` | React Query hooks for OpenSearch endpoints |
+| `api/queries/admin/thresholds.ts` | React Query hooks for threshold endpoints |
+| `api/queries/admin/audit.ts` | React Query hooks for audit log endpoint |
+| `components/admin/StatusBadge.tsx` | Color-coded status indicator (green/yellow/red) |
+| `components/admin/RefreshableCard.tsx` | Card with manual refresh button + optional auto-refresh |
+| `components/admin/ConfirmDeleteDialog.tsx` | Confirmation dialog requiring name input for destructive actions |
+
+### Modified Files
+
+| File | Change |
+|------|--------|
+| `components/layout/AppSidebar.tsx` | Refactor admin section to collapsible sub-menu with multiple items |
+| `router.tsx` | Add routes for `/admin/database`, `/admin/opensearch`, `/admin/audit`, redirect `/admin` |
+| `SpaForwardController.java` | Ensure `/admin/*` forwarding covers new routes |
+
+### Auto-Refresh Strategy
+
+- React Query `refetchInterval: 15000` on lightweight endpoints (pool, queries, pipeline, performance)
+- Heavy endpoints (tables, indices) use `refetchInterval: false` — manual refresh only
+- Refresh button calls `queryClient.invalidateQueries` for all queries on that page
+
+---
+
+## 9. Implementation Phases
+
+### Phase 1 (Current Scope)
+
+1. Admin sidebar restructuring
+2. Database page — all monitoring sections + kill query
+3. OpenSearch page — all monitoring sections + delete index
+4. Threshold configuration (both pages)
+5. Audit log — database-backed audit trail + admin viewer page
+6. Retrofit audit logging into existing admin controllers (OIDC, user management) and auth flow
+7. Backend endpoints with RBAC enforcement
+8. Flyway migration V9 for thresholds + audit_log tables
+
+### Phase 2
+
+- Database maintenance actions (VACUUM ANALYZE, Reindex)
+- OpenSearch operations (Force Reindex All, Flush)
+- Bulk index operations (checkbox selection)
+- Audit log CSV/JSON export for auditors
+- OPERATOR role with view-only permissions
+
+### Phase 3
+
+- TimescaleDB-aware metrics (hypertable chunks, continuous aggregate status, compression)
+- Historical trend charts for key metrics
+- Alerting/notification system
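
Reviewer note on threshold evaluation: the spec defines the thresholds payload and the green/yellow/red badges but does not spell out how one maps to the other. The following is a minimal TypeScript sketch of one possible mapping, not the planned implementation. The names `Status`, `DatabaseThresholds`, `statusFor`, and `poolStatus` are hypothetical, and the inclusive (`>=`) comparison is an assumption, since the spec does not say whether a value equal to a threshold trips it.

```typescript
// Hypothetical types mirroring the "database" part of the Thresholds Payload.
type Status = "green" | "yellow" | "red";

interface DatabaseThresholds {
  connectionPoolWarning: number;  // percent of pool in use
  connectionPoolCritical: number; // percent of pool in use
  queryDurationWarning: number;   // seconds
  queryDurationCritical: number;  // seconds
}

// Map a numeric reading to a badge color.
// Assumption: a value equal to a threshold already trips that threshold.
function statusFor(value: number, warning: number, critical: number): Status {
  if (value >= critical) return "red";
  if (value >= warning) return "yellow";
  return "green";
}

// Express active connections as a percentage of the maximum pool size,
// then compare against the configured pool thresholds.
function poolStatus(active: number, max: number, t: DatabaseThresholds): Status {
  const usedPercent = max > 0 ? (active / max) * 100 : 0;
  return statusFor(usedPercent, t.connectionPoolWarning, t.connectionPoolCritical);
}

// Values copied from the example payload in Section 5.
const defaults: DatabaseThresholds = {
  connectionPoolWarning: 80,
  connectionPoolCritical: 95,
  queryDurationWarning: 1.0,
  queryDurationCritical: 10.0,
};

console.log(poolStatus(5, 10, defaults)); // 50% of the pool in use
console.log(poolStatus(9, 10, defaults)); // 90% of the pool in use
console.log(statusFor(12.5, defaults.queryDurationWarning, defaults.queryDurationCritical));
```

With the example payload values, the three calls evaluate to `green`, `yellow`, and `red`. Keeping the comparison in a single pure function would let the same logic serve pool usage, query duration, queue depth, and JVM heap; cluster health uses enum levels rather than numbers and would need its own mapping.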