# Infrastructure Overview — Admin Pages Design
**Date:** 2026-03-17
**Status:** Approved
**Scope:** Phase 1 implementation; full vision documented with Phase 2+ sections marked
## Overview
Add Database and OpenSearch admin pages to the Cameleer Server UI, allowing administrators to monitor subsystem health, inspect metrics, and perform basic maintenance actions. Restructure admin navigation from a single OIDC page to a sidebar sub-menu with dedicated pages per concern.
## Goals
- Give admins real-time visibility into PostgreSQL and OpenSearch health, performance, and storage
- Enable basic maintenance actions (kill queries, delete indices) without SSH/kubectl access
- Provide configurable thresholds for visual status indicators (green/yellow/red)
- Establish a database-backed audit log for all admin actions (SOC2 compliance foundation)
- Design for future expansion (VACUUM, reindex, OPERATOR role) without requiring restructuring
## Non-Goals (Phase 1)
- Database maintenance actions (VACUUM ANALYZE, Reindex)
- OpenSearch bulk operations (Force Reindex All, Flush)
- OPERATOR role with restricted permissions
- TimescaleDB-specific features (hypertable stats, continuous aggregate status)
- Alerting or notifications beyond visual indicators
---
## 1. Admin Navigation Restructuring
### Current State
Single gear icon at bottom of `AppSidebar` linking directly to `/admin/oidc`.
### New Structure
The gear icon expands/collapses an admin sub-menu in the sidebar:
```
── Apps ──────────────
   app-1
   app-2
── Admin (gear icon) ─
   Database    → /admin/database
   OpenSearch  → /admin/opensearch
   Audit Log   → /admin/audit
   OIDC        → /admin/oidc
   Users       → /admin/users (backend exists, UI page is future scope)
```
- Admin section visible only to users with `ADMIN` role
- Section collapsed by default; state persisted in localStorage
- Active sub-item highlighted
- `/admin` redirects to `/admin/database`
- Existing `OidcAdminPage` unchanged functionally, re-routed from being the sole admin page to a sub-page
---
## 2. Database Page (`/admin/database`)
### Header
- Connection status badge (green/red)
- PostgreSQL version (with TimescaleDB extension noted if present)
- Host and schema name
- Manual refresh button (refreshes all sections)
### Connection Pool Section
- Visual bar showing active connections vs. max pool size
- Metrics: active, idle, pending, max wait time
- Status badge based on configurable threshold (% of pool in use)
- Source: HikariCP pool MXBean
- **Auto-refreshes every 15 seconds**
### Table Sizes Section
- Table with columns: Table, Rows, Size, Index Size
- All application tables listed (executions, processor_executions, route_diagrams, agent_metrics, users, oidc_config, admin_thresholds)
- Summary row: total data size, total index size
- Source: `pg_stat_user_tables` + `pg_relation_size`
- **Manual refresh only** (expensive query)
### Active Queries Section
- Table with columns: PID, Duration, State, Query (truncated), Action
- Queries > warning threshold highlighted yellow, > critical threshold highlighted red
- Kill button per row → calls `pg_terminate_backend(pid)`, which ends the whole backend session, not only the running statement
- Kill requires confirmation dialog
- After kill, query list refreshes automatically
- Source: `pg_stat_activity`
- **Auto-refreshes every 15 seconds**
### Maintenance Section (Phase 2 — Visible but Disabled)
- Buttons: Run VACUUM ANALYZE, Reindex Tables
- Greyed out with tooltip: "Available in a future release"
### Thresholds Section
- Collapsible, collapsed by default
- Configurable values:
  - Connection pool usage: warning % and critical %
  - Query duration: warning seconds and critical seconds
- Save button persists to database
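
As a rough illustration of how the configured thresholds could drive the status badges on this page, the sketch below maps a pool usage percentage or a query duration to green/yellow/red. The names `BadgeStatus`, `ThresholdPair`, and `DatabaseBadges` are hypothetical and not part of the existing codebase. The strict "greater than" comparison follows the "> warning threshold" wording in the Active Queries section; applying the same rule to the connection pool is an assumption.

```java
import java.time.Duration;

/** Badge colours shown on the admin pages. */
enum BadgeStatus { GREEN, YELLOW, RED }

/** Hypothetical holder for one warning/critical threshold pair. */
record ThresholdPair(double warning, double critical) {
    ThresholdPair {
        if (warning > critical) {
            throw new IllegalArgumentException("warning must be <= critical");
        }
    }

    /** A value strictly above a threshold trips it. */
    BadgeStatus classify(double value) {
        if (value > critical) return BadgeStatus.RED;
        if (value > warning) return BadgeStatus.YELLOW;
        return BadgeStatus.GREEN;
    }
}

final class DatabaseBadges {
    private DatabaseBadges() {}

    /** Pool usage as a percentage of the maximum pool size. */
    static double poolUsagePercent(int active, int maxPoolSize) {
        if (maxPoolSize <= 0) {
            throw new IllegalArgumentException("maxPoolSize must be positive");
        }
        return 100.0 * active / maxPoolSize;
    }

    /** Badge for the connection pool, given thresholds expressed in percent. */
    static BadgeStatus poolStatus(int active, int maxPoolSize, ThresholdPair percent) {
        return percent.classify(poolUsagePercent(active, maxPoolSize));
    }

    /** Row highlight for an active query, given thresholds expressed in seconds. */
    static BadgeStatus queryStatus(Duration running, ThresholdPair seconds) {
        return seconds.classify(running.toMillis() / 1000.0);
    }
}
```

With the default payload values (80/95 for the pool, 1.0/10.0 seconds for queries), a pool with 9 of 10 connections active would render yellow and a query running for 30 seconds would render red.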
---
## 3. OpenSearch Page (`/admin/opensearch`)
### Header
- Cluster health badge (green/yellow/red — maps directly to OpenSearch cluster health)
- OpenSearch version
- Node count
- Host URL
- Manual refresh button
### Indexing Pipeline Section
- Visual bar showing queue depth vs. max queue size
- Metrics: queue depth, failed document count, debounce interval, indexing rate (docs/s), time since last indexed
- Status badge based on configurable thresholds
- Source: `SearchIndexer` internal stats, exposed via a new `SearchIndexerStats` interface in `cameleer-server-core`
- **Auto-refreshes every 15 seconds**
**Implementation note:** `SearchIndexer` currently has no stats API. This requires adding:
- `AtomicLong failedCount` — incremented on indexing errors
- `AtomicLong indexedCount` — incremented on successful index operations
- `volatile Instant lastIndexedAt` — updated after each successful batch
- Rate calculation via a sliding window counter (e.g., count delta over last 15s interval)
- `SearchIndexerStats` interface in core module with getters for all above, implemented by `SearchIndexer`
- `queueDepth` and `maxQueueSize` already derivable from the internal `BlockingQueue`
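
The instrumentation listed above could look roughly like the following. This is a minimal sketch, not the actual `SearchIndexer`: the class `InstrumentedIndexer`, its hook methods (`onBatchIndexed`, `onIndexingError`, `sampleRate`), and the getter names on `SearchIndexerStats` are all assumptions made for illustration. The rate is computed as a count delta between two samples, so a scheduler would need to call `sampleRate` once per interval.

```java
import java.time.Instant;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Read-only view of indexer health; getter names are illustrative. */
interface SearchIndexerStats {
    int queueDepth();
    int maxQueueSize();
    long failedCount();
    long indexedCount();
    Instant lastIndexedAt();          // null until the first successful batch
    double indexingRatePerSecond();
}

/** Stand-in for the real indexer, reduced to its counters. */
final class InstrumentedIndexer implements SearchIndexerStats {
    private final BlockingQueue<String> queue;
    private final int capacity;
    private final AtomicLong failedCount = new AtomicLong();
    private final AtomicLong indexedCount = new AtomicLong();
    private volatile Instant lastIndexedAt;

    // Rate window state, guarded by 'this'.
    private long windowStartCount;
    private long windowStartNanos;
    private double lastRate;

    InstrumentedIndexer(int capacity, long nowNanos) {
        this.capacity = capacity;
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.windowStartNanos = nowNanos;
    }

    /** Returns false when the queue is full. */
    boolean enqueue(String docId) { return queue.offer(docId); }

    void onBatchIndexed(int docs, Instant now) {
        indexedCount.addAndGet(docs);
        lastIndexedAt = now;
    }

    void onIndexingError(int docs) { failedCount.addAndGet(docs); }

    /** Call once per sampling interval (e.g. every 15 s) to refresh the rate. */
    synchronized void sampleRate(long nowNanos) {
        long count = indexedCount.get();
        long elapsedNanos = nowNanos - windowStartNanos;
        if (elapsedNanos > 0) {
            lastRate = (count - windowStartCount) * 1_000_000_000.0 / elapsedNanos;
        }
        windowStartCount = count;
        windowStartNanos = nowNanos;
    }

    @Override public int queueDepth() { return queue.size(); }
    @Override public int maxQueueSize() { return capacity; }
    @Override public long failedCount() { return failedCount.get(); }
    @Override public long indexedCount() { return indexedCount.get(); }
    @Override public Instant lastIndexedAt() { return lastIndexedAt; }
    @Override public synchronized double indexingRatePerSecond() { return lastRate; }
}
```

Passing the clock value in as a parameter (rather than calling `System.nanoTime()` inside the class) keeps the rate calculation deterministic in tests; production code would likely pass `System.nanoTime()` from a scheduled task.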
### Indices Section
- **Search/filter** by index name pattern (text input)
- **Filter by health** — All / Green / Yellow / Red dropdown
- **Sortable columns** — Name, Docs, Size, Health, Shards (click column header)
- **Pagination** — 10 per page, server-side
- **Summary row** above table — total index count, total docs, total storage
- **Delete button** (trash icon) per row:
  - Confirmation dialog: "Delete index `{name}`? This cannot be undone."
  - User must type the index name to confirm
  - After deletion, table and summary refresh
- Table columns: Index, Docs, Size, Health, Shards (primary/replica)
- Source: OpenSearch `_cat/indices` API
- **Manual refresh only**
### Performance Section
- Metrics: query cache hit rate, request cache hit rate, average search latency, average indexing latency, JVM heap used (visual bar with used/max)
- Source: OpenSearch `_nodes/stats` API
- **Auto-refreshes every 15 seconds**
### Operations Section (Phase 2 — Visible but Disabled)
- Buttons: Force Reindex All, Flush Index, Delete Index (bulk via checkbox selection)
- Greyed out with tooltip: "Available in a future release"
### Thresholds Section
- Collapsible, collapsed by default
- Configurable values:
  - Cluster health: warning level, critical level
  - Queue depth: warning count, critical count
  - JVM heap usage: warning %, critical %
  - Failed docs: warning count, critical count
- Save button persists to database
---
## 4. Audit Log Page (`/admin/audit`)
### Purpose
Database-backed audit trail of all administrative actions across the system. Provides SOC2-compliant evidence of who did what, when, and from where. The audit log is append-only — entries cannot be modified or deleted through the UI or API.
### Header
- Total event count
- Date range selector (default: last 7 days)
### Audit Log Table
```
┌─ Audit Log ──────────────────────────────────────────────────┐
│ Date range: [2026-03-10] to [2026-03-17]                     │
│ [User: All ▾]  [Category: All ▾]  [Search: ________]         │
│                                                              │
│ Timestamp            User   Category   Action        Target  │
│ 2026-03-17 14:32:01  admin  INFRA      kill_query    PID 42  │
│ 2026-03-17 14:28:15  admin  INFRA      delete_idx    exec-…  │
│ 2026-03-17 12:01:44  admin  CONFIG     update        oidc    │
│ 2026-03-17 09:15:22  jdoe   AUTH       login                 │
│ 2026-03-16 18:45:00  admin  USER_MGMT  update_roles  u:5     │
│ ...                                                          │
│                                                              │
│ ◀ 1 2 3 ... 12 ▶                        Showing 1-25 of 294  │
└──────────────────────────────────────────────────────────────┘
```
- **Filterable** by user, category, date range
- **Searchable** by free text (matches `action` and `target` columns only — JSONB `detail` excluded from text search for performance)
- **Sortable** by timestamp (default: newest first)
- **Pagination** — 25 per page, server-side
- **Detail expansion** — click a row to expand and show full `detail` JSON
- **Read-only** — no edit or delete actions available (compliance requirement)
- **Export** (Phase 2) — CSV/JSON download for auditors
### Audit Categories
| Category | Actions Logged |
|----------|---------------|
| `INFRA` | kill_query, delete_index, update_thresholds |
| `AUTH` | login, login_oidc, logout, login_failed |
| `USER_MGMT` | create_user, update_roles, delete_user |
| `CONFIG` | update_oidc, delete_oidc, test_oidc |
### What Gets Logged
Every admin action across the system, not just infrastructure pages:
- **Infrastructure:** kill query, delete OpenSearch index, save thresholds
- **OIDC:** save config, delete config, test connection
- **User management:** update roles, delete user
- **Authentication:** login (success and failure), OIDC login, logout
### Audit Record Fields
| Field | Description |
|-------|-------------|
| `timestamp` | When the action occurred (server time, UTC) |
| `username` | Authenticated user who performed the action |
| `action` | Machine-readable action name (e.g., `kill_query`, `delete_index`) |
| `category` | Grouping: `INFRA`, `AUTH`, `USER_MGMT`, `CONFIG` |
| `target` | What was acted on (e.g., PID, index name, user ID) |
| `detail` | JSONB with action-specific context (e.g., query text for killed query, old/new roles for role change) |
| `result` | `SUCCESS` or `FAILURE` |
| `ip_address` | Client IP address from the request |
| `user_agent` | Browser/client identification from request header (SOC2 forensics) |
### Backend Implementation
- `AuditService` — central service in `cameleer-server-core`, injected into all admin controllers via direct method calls (no AOP/interceptor — consistent with existing controller style)
- Primary method: `log(action, category, target, detail, result)` — extracts username and IP from `SecurityContextHolder` and `HttpServletRequest`
- Overloaded method for pre-auth contexts: `log(username, action, category, target, detail, result, request)` — used by auth controllers where `SecurityContext` is not yet populated (login success/failure)
- Captures `user_agent` from `HttpServletRequest` header
- Writes to both the `audit_log` table AND SLF4J (belt and suspenders)
- Async write option not used — audit must be synchronous for compliance guarantees
- Retrofit into existing controllers: add `auditService.log(...)` calls to `OidcConfigAdminController` (save, delete, test) and `UserAdminController` (update roles, delete user), and auth controllers (login, OIDC login, logout, failed login)
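
To make the shape of the audit path concrete, here is a minimal, framework-free sketch of a synchronous audit writer. It is not the real `AuditService`: `AuditEntry`, `AuditSink`, and `SimpleAuditService` are hypothetical names, `System.Logger` stands in for SLF4J, and the request-derived values (IP address, user agent) are passed in as plain strings rather than read from `HttpServletRequest`. Only the explicit-username variant is shown; the primary method would resolve the username from the security context and delegate to it.

```java
import java.time.Clock;
import java.time.Instant;
import java.util.Objects;

enum AuditCategory { INFRA, AUTH, USER_MGMT, CONFIG }

enum AuditResult { SUCCESS, FAILURE }

/** Mirrors the audit record fields; detail is kept as a raw JSON string here. */
record AuditEntry(Instant timestamp, String username, String action,
                  AuditCategory category, String target, String detailJson,
                  AuditResult result, String ipAddress, String userAgent) {
    AuditEntry {
        Objects.requireNonNull(timestamp, "timestamp");
        Objects.requireNonNull(username, "username");
        Objects.requireNonNull(action, "action");
        Objects.requireNonNull(category, "category");
        Objects.requireNonNull(result, "result");
    }
}

/** Hypothetical persistence port; a real one would INSERT into audit_log. */
interface AuditSink {
    void append(AuditEntry entry);
}

final class SimpleAuditService {
    private static final System.Logger LOG = System.getLogger("audit");

    private final AuditSink sink;
    private final Clock clock;

    SimpleAuditService(AuditSink sink, Clock clock) {
        this.sink = Objects.requireNonNull(sink);
        this.clock = Objects.requireNonNull(clock);
    }

    /** Pre-auth style variant: the caller supplies username and request metadata. */
    void log(String username, String action, AuditCategory category, String target,
             String detailJson, AuditResult result, String ipAddress, String userAgent) {
        AuditEntry entry = new AuditEntry(Instant.now(clock), username, action, category,
                target, detailJson, result, ipAddress, userAgent);
        // Synchronous write first: if the sink throws, the caller sees the failure.
        sink.append(entry);
        // Operational log line second; the stored record remains the source of truth.
        LOG.log(System.Logger.Level.INFO,
                "audit user={0} category={1} action={2} target={3} result={4}",
                username, category, action, target, result);
    }
}
```

Writing to the sink before logging means a failed database write surfaces as an exception in the calling controller instead of leaving a log line with no matching row.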
---
## 5. Backend API
All endpoints live under `/api/v1/admin/` and are secured by the existing Spring Security filter chain (`ROLE_ADMIN` required). Controllers are additionally annotated with `@PreAuthorize("hasRole('ADMIN')")` for defense-in-depth. Paths in the tables below are shown relative to `/api/v1`.
### Database Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/admin/database/status` | Version, host, schema, connection state |
| `GET` | `/admin/database/pool` | Active, idle, pending, max wait (HikariCP) |
| `GET` | `/admin/database/tables` | Table names, row counts, data sizes, index sizes |
| `GET` | `/admin/database/queries` | Active queries: pid, duration, state, SQL |
| `POST` | `/admin/database/queries/{pid}/kill` | Terminate query via `pg_terminate_backend` |
### OpenSearch Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/admin/opensearch/status` | Version, host, cluster health, node count |
| `GET` | `/admin/opensearch/pipeline` | Queue depth, failed count, debounce, rate, last indexed |
| `GET` | `/admin/opensearch/indices` | Paginated, sortable, filterable index list |
| `DELETE` | `/admin/opensearch/indices/{name}` | Delete specific index (with audit log) |
| `GET` | `/admin/opensearch/performance` | Cache rates, latencies, JVM heap |
**OpenSearch client:** All admin OpenSearch endpoints reuse the existing `OpenSearchClient` bean configured in `OpenSearchConfig.java`. No separate client or credentials needed — the admin endpoints call cluster-level APIs (`_cluster/health`, `_cat/indices`, `_nodes/stats`) using the same connection.
#### Indices Query Parameters
| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `search` | string | — | Filter by index name pattern |
| `health` | enum | `ALL` | Filter by health: ALL, GREEN, YELLOW, RED |
| `sort` | string | `name` | Sort field: name, docs, size, health |
| `order` | enum | `asc` | Sort direction: asc, desc |
| `page` | int | `0` | Page number (zero-based) |
| `size` | int | `10` | Page size |
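
The sketch below shows one way the filter, sort, and pagination parameters could be applied to an index list that has already been fetched from OpenSearch. It is an in-memory illustration only: `IndexInfo`, `IndexPage`, and `IndexListQuery` are hypothetical names, the `search` parameter is treated as a case-insensitive substring match (the spec only says "name pattern"), a `null` health stands for `ALL`, and the 100-item cap is the limit stated under Pagination Limits.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

enum Health { GREEN, YELLOW, RED }

record IndexInfo(String name, long docs, long sizeBytes, Health health) {}

/** One page of results plus the total match count for the pager. */
record IndexPage(List<IndexInfo> items, long total) {}

final class IndexListQuery {
    static final int MAX_PAGE_SIZE = 100;

    private IndexListQuery() {}

    static IndexPage run(List<IndexInfo> all, String search, Health health,
                         String sort, boolean descending, int page, int size) {
        // Unknown sort fields fall back to sorting by name.
        Comparator<IndexInfo> order = switch (sort) {
            case "docs" -> Comparator.comparingLong(IndexInfo::docs);
            case "size" -> Comparator.comparingLong(IndexInfo::sizeBytes);
            case "health" -> Comparator.comparing(IndexInfo::health);
            default -> Comparator.comparing(IndexInfo::name);
        };
        if (descending) {
            order = order.reversed();
        }

        String needle = search == null ? "" : search.toLowerCase(Locale.ROOT);
        List<IndexInfo> matches = all.stream()
                .filter(i -> i.name().toLowerCase(Locale.ROOT).contains(needle))
                .filter(i -> health == null || i.health() == health)
                .sorted(order)
                .toList();

        // Clamp the page size to [1, 100] and the page number to >= 0.
        int pageSize = Math.max(1, Math.min(size, MAX_PAGE_SIZE));
        long offset = (long) Math.max(0, page) * pageSize;
        List<IndexInfo> items = matches.stream().skip(offset).limit(pageSize).toList();
        return new IndexPage(items, matches.size());
    }
}
```

Returning the total match count alongside the page lets the UI render the pager and summary without a second request.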
### Audit Log Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/admin/audit` | Paginated, filterable audit log entries |
#### Audit Log Query Parameters
| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `username` | string | — | Filter by username |
| `category` | enum | — | Filter by category: INFRA, AUTH, USER_MGMT, CONFIG |
| `search` | string | — | Free text search across `action` and `target` columns (not JSONB `detail`) |
| `from` | ISO date | 7 days ago | Start of date range |
| `to` | ISO date | now | End of date range |
| `sort` | string | `timestamp` | Sort field |
| `order` | enum | `desc` | Sort direction: asc, desc |
| `page` | int | `0` | Page number (zero-based) |
| `size` | int | `25` | Page size |
### Thresholds Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/admin/thresholds` | All configured thresholds |
| `PUT` | `/admin/thresholds` | Save thresholds (database + OpenSearch in one payload) |
### Thresholds Payload
```json
{
  "database": {
    "connectionPoolWarning": 80,
    "connectionPoolCritical": 95,
    "queryDurationWarning": 1.0,
    "queryDurationCritical": 10.0
  },
  "opensearch": {
    "clusterHealthWarning": "YELLOW",
    "clusterHealthCritical": "RED",
    "queueDepthWarning": 100,
    "queueDepthCritical": 500,
    "jvmHeapWarning": 75,
    "jvmHeapCritical": 90,
    "failedDocsWarning": 1,
    "failedDocsCritical": 10
  }
}
```
### Thresholds Validation Rules
- Warning must be <= critical for all numeric threshold pairs
- Percentage values must be between 0 and 100
- Duration values must be > 0
- `clusterHealthWarning` must be less severe than `clusterHealthCritical` (GREEN < YELLOW < RED)
- Backend returns 400 Bad Request with field-level error messages on validation failure
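
The rules above can be sketched as a small validator that returns field-level messages, which a controller could then turn into a 400 response. The record and class names (`DatabaseThresholds`, `OpenSearchThresholds`, `ThresholdValidator`) and the dotted error keys are assumptions for illustration; the field names follow the payload shown earlier.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

record DatabaseThresholds(int connectionPoolWarning, int connectionPoolCritical,
                          double queryDurationWarning, double queryDurationCritical) {}

record OpenSearchThresholds(String clusterHealthWarning, String clusterHealthCritical,
                            int queueDepthWarning, int queueDepthCritical,
                            int jvmHeapWarning, int jvmHeapCritical,
                            int failedDocsWarning, int failedDocsCritical) {}

final class ThresholdValidator {
    private static final List<String> SEVERITY = List.of("GREEN", "YELLOW", "RED");

    private ThresholdValidator() {}

    /** Returns field name -> message; an empty map means the payload is valid. */
    static Map<String, String> validate(DatabaseThresholds db, OpenSearchThresholds os) {
        Map<String, String> errors = new LinkedHashMap<>();

        percent(errors, "database.connectionPoolWarning", db.connectionPoolWarning());
        percent(errors, "database.connectionPoolCritical", db.connectionPoolCritical());
        ordered(errors, "database.connectionPoolWarning",
                db.connectionPoolWarning(), db.connectionPoolCritical());

        positive(errors, "database.queryDurationWarning", db.queryDurationWarning());
        positive(errors, "database.queryDurationCritical", db.queryDurationCritical());
        ordered(errors, "database.queryDurationWarning",
                db.queryDurationWarning(), db.queryDurationCritical());

        int warn = severity(os.clusterHealthWarning());
        int crit = severity(os.clusterHealthCritical());
        if (warn < 0) errors.put("opensearch.clusterHealthWarning", "must be GREEN, YELLOW or RED");
        if (crit < 0) errors.put("opensearch.clusterHealthCritical", "must be GREEN, YELLOW or RED");
        if (warn >= 0 && crit >= 0 && warn >= crit) {
            errors.put("opensearch.clusterHealthWarning",
                    "must be less severe than clusterHealthCritical");
        }

        ordered(errors, "opensearch.queueDepthWarning",
                os.queueDepthWarning(), os.queueDepthCritical());
        percent(errors, "opensearch.jvmHeapWarning", os.jvmHeapWarning());
        percent(errors, "opensearch.jvmHeapCritical", os.jvmHeapCritical());
        ordered(errors, "opensearch.jvmHeapWarning", os.jvmHeapWarning(), os.jvmHeapCritical());
        ordered(errors, "opensearch.failedDocsWarning",
                os.failedDocsWarning(), os.failedDocsCritical());
        return errors;
    }

    private static int severity(String health) {
        return health == null ? -1 : SEVERITY.indexOf(health);
    }

    private static void percent(Map<String, String> errors, String field, int value) {
        if (value < 0 || value > 100) errors.put(field, "must be between 0 and 100");
    }

    private static void positive(Map<String, String> errors, String field, double value) {
        if (!(value > 0)) errors.put(field, "must be greater than 0");
    }

    /** Reports on the warning field, without overwriting an earlier range error. */
    private static void ordered(Map<String, String> errors, String warningField,
                                double warning, double critical) {
        if (warning > critical) errors.putIfAbsent(warningField, "must be <= the critical value");
    }
}
```

Collecting every violation in one pass, rather than failing on the first, lets the UI highlight all invalid fields after a single save attempt.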
### Pagination Limits
- All paginated endpoints enforce a maximum page size of 100 (`min(requested, 100)`)
- Applies to: indices listing, audit log
### Error Responses
All new endpoints return errors in a consistent shape:
```json
{
"status": 404,
"error": "Not Found",
"message": "No active query with PID 12345"
}
```
Specific error cases:
- `POST /admin/database/queries/{pid}/kill` — 404 if PID not found, 500 if `pg_terminate_backend` fails
- `DELETE /admin/opensearch/indices/{name}` — 404 if index not found, 502 if OpenSearch unreachable
- `GET /admin/database/status` — returns 200 with `"connected": false` if database is unreachable (not 503), so the frontend can render a red status badge rather than an error state
- `GET /admin/opensearch/status` — returns 200 with `"clusterHealth": "UNREACHABLE"` if OpenSearch is down
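
The "200 with `connected: false`" behaviour of the status endpoint could be implemented by catching the probe failure and folding it into the response body, roughly as sketched below. `DatabaseStatusResponse` and `StatusProbe` are hypothetical names, and every field other than `connected` is an assumed shape; the version lookup is passed in as a `Callable` so the sketch stays independent of JDBC.

```java
import java.util.concurrent.Callable;

/** Assumed response body for GET /admin/database/status. */
record DatabaseStatusResponse(boolean connected, String version, String host, String schema) {

    static DatabaseStatusResponse unreachable(String host, String schema) {
        return new DatabaseStatusResponse(false, null, host, schema);
    }
}

final class StatusProbe {
    private StatusProbe() {}

    /**
     * Runs the version lookup and converts any failure into a "not connected"
     * body, so the caller can always answer with HTTP 200.
     */
    static DatabaseStatusResponse probe(String host, String schema,
                                        Callable<String> versionLookup) {
        try {
            return new DatabaseStatusResponse(true, versionLookup.call(), host, schema);
        } catch (Exception e) {
            return DatabaseStatusResponse.unreachable(host, schema);
        }
    }
}
```

The same pattern would apply to the OpenSearch status endpoint, with `"clusterHealth": "UNREACHABLE"` taking the place of `connected: false`.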
---
## 6. Security
### Enforcement Layers
1. **Spring Security filter chain:** `/api/v1/admin/**` requires `ROLE_ADMIN` (existing configuration)
2. **Controller annotation:** `@PreAuthorize("hasRole('ADMIN')")` on each controller class (defense-in-depth). This is a new convention, so existing controllers (`OidcConfigAdminController`, `UserAdminController`) must be retrofitted with this annotation as part of Phase 1.
3. **`@EnableMethodSecurity`:** must be added to `SecurityConfig.java` to activate `@PreAuthorize` processing (prerequisite for layer 2)
4. **UI role check:** sidebar admin section hidden for non-admin users (cosmetic only, not a security boundary)
### Audit Logging
All admin actions are persisted to the `audit_log` database table (see Section 4 and Section 7 — Data Storage) AND logged via SLF4J at INFO level. The database record is the source of truth for compliance; the SLF4J log provides operational visibility.
The `AuditService` is injected into all admin controllers (infrastructure, OIDC, user management) and the authentication flow. See Section 4 (Audit Log Page) for full details on what is logged and the record structure.
### Future: OPERATOR Role (Phase 2+)
Design anticipates a read-only `OPERATOR` role:
- Can view all monitoring data
- Cannot perform destructive actions (kill, delete)
- Implementation: method-level `@PreAuthorize` on action endpoints, UI conditionally disables buttons based on role
---
## 7. Data Storage
### Flyway Migration V9: Admin Thresholds
```sql
CREATE TABLE admin_thresholds (
    id         INTEGER     PRIMARY KEY DEFAULT 1,
    config     JSONB       NOT NULL DEFAULT '{}',
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_by TEXT        NOT NULL,
    CONSTRAINT single_row CHECK (id = 1)
);
```
- Single-row table using `CHECK (id = 1)` constraint — stricter than the `oidc_config` pattern (which uses a text PK defaulting to `'default'` without a constraint). The CHECK approach is preferred going forward as it explicitly prevents multiple rows.
- JSONB column for flexibility: adding new thresholds doesn't require schema changes
- Tracks who last updated and when
### Flyway Migration V10: Audit Log
```sql
CREATE TABLE audit_log (
    id         BIGSERIAL   PRIMARY KEY,
    timestamp  TIMESTAMPTZ NOT NULL DEFAULT now(),
    username   TEXT        NOT NULL,
    action     TEXT        NOT NULL,
    category   TEXT        NOT NULL,
    target     TEXT,
    detail     JSONB,
    result     TEXT        NOT NULL,
    ip_address TEXT,
    user_agent TEXT
);
CREATE INDEX idx_audit_log_timestamp ON audit_log (timestamp DESC);
CREATE INDEX idx_audit_log_username ON audit_log (username);
CREATE INDEX idx_audit_log_category ON audit_log (category);
CREATE INDEX idx_audit_log_action ON audit_log (action);
CREATE INDEX idx_audit_log_target ON audit_log (target);
```
- Separate migration from thresholds so they can be developed and tested independently
- Append-only table — no UPDATE or DELETE exposed via API
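- Indexed on timestamp (primary query axis), username, category, action, and target for filtered views. Free-text search uses `ILIKE` on the `action` and `target` columns; note that plain B-tree indexes do not accelerate `ILIKE` patterns with a leading wildcard, so a trigram (`pg_trgm`) index may be needed if search becomes slow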
- JSONB `detail` column holds action-specific context without schema changes (not searched via free text — use row expansion for detail inspection)
- `user_agent` field captures client identification for forensic analysis (SOC2)
- No foreign key to `users` table — username is denormalized so audit records survive user deletion
- **Retention:** unbounded in Phase 1. Phase 2+ should add a retention/archival strategy (e.g., TimescaleDB hypertable with retention policy, or periodic archive to cold storage). Typical SOC2 retention is 7 years.
---
## 8. Frontend Architecture
### New Files
| File | Purpose |
|------|---------|
| `pages/admin/DatabaseAdminPage.tsx` | Database monitoring and management |
| `pages/admin/OpenSearchAdminPage.tsx` | OpenSearch monitoring and management |
| `pages/admin/AuditLogPage.tsx` | Audit log viewer |
| `api/queries/admin/database.ts` | React Query hooks for database endpoints |
| `api/queries/admin/opensearch.ts` | React Query hooks for OpenSearch endpoints |
| `api/queries/admin/thresholds.ts` | React Query hooks for threshold endpoints |
| `api/queries/admin/audit.ts` | React Query hooks for audit log endpoint |
| `components/admin/StatusBadge.tsx` | Color-coded status indicator (green/yellow/red) |
| `components/admin/RefreshableCard.tsx` | Card with manual refresh button + optional auto-refresh |
| `components/admin/ConfirmDeleteDialog.tsx` | Confirmation dialog requiring name input for destructive actions |
### Modified Files
| File | Change |
|------|--------|
| `components/layout/AppSidebar.tsx` | Refactor admin section to collapsible sub-menu with multiple items |
| `router.tsx` | Add routes for `/admin/database`, `/admin/opensearch`, `/admin/audit`, redirect `/admin` |
| `SpaForwardController.java` | Existing `/admin/{path:[^\\.]*}` pattern already covers single-segment routes — no change needed unless deeper routes are added |
### Auto-Refresh Strategy
- React Query `refetchInterval: 15000` on lightweight endpoints (pool, queries, pipeline, performance)
- Heavy endpoints (tables, indices) use `refetchInterval: false` — manual refresh only
- Refresh button calls `queryClient.invalidateQueries` for all queries on that page
---
## 9. Implementation Phases
### Phase 1 (Current Scope)
1. Admin sidebar restructuring
2. Database page — all monitoring sections + kill query
3. OpenSearch page — all monitoring sections + delete index
4. Threshold configuration (both pages)
5. Audit log — database-backed audit trail + admin viewer page
6. Retrofit audit logging into existing admin controllers (OIDC, user management) and auth flow
7. Backend endpoints with RBAC enforcement
8. Flyway migrations V9 (thresholds) and V10 (audit_log)
9. `SearchIndexerStats` interface and `SearchIndexer` stats instrumentation
10. `@EnableMethodSecurity` + `@PreAuthorize` retrofit on existing admin controllers
### Phase 2
- Database maintenance actions (VACUUM ANALYZE, Reindex)
- OpenSearch operations (Force Reindex All, Flush)
- Bulk index operations (checkbox selection)
- Audit log CSV/JSON export for auditors
- Audit log retention/archival strategy (targeting the 7-year retention noted in Section 7)
- OPERATOR role with view-only permissions
### Phase 3
- TimescaleDB-aware metrics (hypertable chunks, continuous aggregate status, compression)
- Historical trend charts for key metrics
- Alerting/notification system