docs: add infrastructure overview design spec

Covers admin navigation restructuring, database/OpenSearch monitoring pages,
configurable thresholds, database-backed audit log (SOC2), and phased
implementation plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hsiegeln
2026-03-17 14:55:47 +01:00
parent fc412f7251
commit 2bcbff3ee6

# Infrastructure Overview — Admin Pages Design
**Date:** 2026-03-17
**Status:** Approved
**Scope:** Phase 1 implementation; full vision documented with Phase 2+ sections marked
## Overview
Add Database and OpenSearch admin pages to the Cameleer3 Server UI, allowing administrators to monitor subsystem health, inspect metrics, and perform basic maintenance actions. Restructure admin navigation from a single OIDC page to a sidebar sub-menu with dedicated pages per concern.
## Goals
- Give admins real-time visibility into PostgreSQL and OpenSearch health, performance, and storage
- Enable basic maintenance actions (kill queries, delete indices) without SSH/kubectl access
- Provide configurable thresholds for visual status indicators (green/yellow/red)
- Establish a database-backed audit log for all admin actions (SOC2 compliance foundation)
- Design for future expansion (VACUUM, reindex, OPERATOR role) without requiring restructuring
## Non-Goals (Phase 1)
- Database maintenance actions (VACUUM ANALYZE, Reindex)
- OpenSearch bulk operations (Force Reindex All, Flush)
- OPERATOR role with restricted permissions
- TimescaleDB-specific features (hypertable stats, continuous aggregate status)
- Alerting or notifications beyond visual indicators
---
## 1. Admin Navigation Restructuring
### Current State
Single gear icon at bottom of `AppSidebar` linking directly to `/admin/oidc`.
### New Structure
The gear icon expands/collapses an admin sub-menu in the sidebar:
```
── Apps ──────────────
   app-1
   app-2
── Admin (gear icon) ─
   Database    → /admin/database
   OpenSearch  → /admin/opensearch
   Audit Log   → /admin/audit
   OIDC        → /admin/oidc
   Users       → (future)
```
- Admin section visible only to users with `ADMIN` role
- Section collapsed by default; state persisted in localStorage
- Active sub-item highlighted
- `/admin` redirects to `/admin/database`
- Existing `OidcAdminPage` is functionally unchanged; it moves from being the sole admin page to one sub-page of the admin section
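
The persisted collapse state described above could be handled along these lines. This is a minimal sketch, not the actual sidebar code: the storage key, the `KeyValueStore` interface, and the `MemoryStore` stand-in are all hypothetical names introduced for illustration. In the browser, `window.localStorage` satisfies the same two-method shape.

```typescript
// Minimal subset of the Web Storage API, so the logic can run outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical storage key; the spec does not name one.
const ADMIN_MENU_KEY = "cameleer.adminMenu.collapsed";

// Returns the persisted state. A missing or unrecognized value falls back to
// collapsed, which is the default the spec asks for.
function loadAdminMenuCollapsed(store: KeyValueStore): boolean {
  const raw = store.getItem(ADMIN_MENU_KEY);
  return raw === null ? true : raw !== "false";
}

// Persists the state as the string "true" or "false".
function saveAdminMenuCollapsed(store: KeyValueStore, collapsed: boolean): void {
  store.setItem(ADMIN_MENU_KEY, String(collapsed));
}

// In-memory stand-in for window.localStorage, for demonstration only.
class MemoryStore implements KeyValueStore {
  private readonly data = new Map<string, string>();

  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }

  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}
```

Passing the store in as a parameter keeps the logic testable without a DOM; the sidebar component would simply pass `window.localStorage`.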
---
## 2. Database Page (`/admin/database`)
### Header
- Connection status badge (green/red)
- PostgreSQL version (with TimescaleDB extension noted if present)
- Host and schema name
- Manual refresh button (refreshes all sections)
### Connection Pool Section
- Visual bar showing active connections vs. max pool size
- Metrics: active, idle, pending, max wait time
- Status badge based on configurable threshold (% of pool in use)
- Source: HikariCP pool MXBean
- **Auto-refreshes every 15 seconds**
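
As an illustration of the pool status badge, the sketch below maps pool usage to a badge color. It assumes the thresholds are inclusive (usage at or above the warning value turns yellow), which the spec does not state; `PoolStats`, `PercentThresholds`, and `poolStatus` are hypothetical names.

```typescript
type Status = "green" | "yellow" | "red";

// Shape assumed for the pool metrics; field names are illustrative.
interface PoolStats {
  active: number;
  idle: number;
  pending: number;
  max: number;
}

// Warning and critical levels, expressed as a percentage of the pool in use.
interface PercentThresholds {
  warning: number;
  critical: number;
}

// Percentage of the pool currently in use; an empty or unknown pool counts as 0.
function poolUsagePercent(pool: PoolStats): number {
  if (pool.max <= 0) return 0;
  return (pool.active * 100) / pool.max;
}

// Assumption: comparisons are inclusive, so hitting a threshold exactly
// already changes the badge color.
function poolStatus(pool: PoolStats, thresholds: PercentThresholds): Status {
  const usage = poolUsagePercent(pool);
  if (usage >= thresholds.critical) return "red";
  if (usage >= thresholds.warning) return "yellow";
  return "green";
}
```

With the default thresholds from the payload in Section 5 (80 and 95), a pool of 10 connections would turn yellow at 8 active connections and red at 10.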
### Table Sizes Section
- Table with columns: Table, Rows, Size, Index Size
- All application tables listed (executions, processor_executions, route_diagrams, agent_metrics, users, oidc_config, admin_thresholds, audit_log)
- Summary row: total data size, total index size
- Source: `pg_stat_user_tables` + `pg_relation_size`
- **Manual refresh only** (expensive query)
### Active Queries Section
- Table with columns: PID, Duration, State, Query (truncated), Action
- Queries > warning threshold highlighted yellow, > critical threshold highlighted red
- Kill button per row → calls `pg_terminate_backend(pid)`, which ends the whole backend session, not only the running statement
- Kill requires confirmation dialog
- After kill, query list refreshes automatically
- Source: `pg_stat_activity`
- **Auto-refreshes every 15 seconds**
### Maintenance Section (Phase 2 — Visible but Disabled)
- Buttons: Run VACUUM ANALYZE, Reindex Tables
- Greyed out with tooltip: "Available in a future release"
### Thresholds Section
- Collapsible, collapsed by default
- Configurable values:
- Connection pool usage: warning % and critical %
- Query duration: warning seconds and critical seconds
- Save button persists to database
---
## 3. OpenSearch Page (`/admin/opensearch`)
### Header
- Cluster health badge (green/yellow/red — maps directly to OpenSearch cluster health)
- OpenSearch version
- Node count
- Host URL
- Manual refresh button
### Indexing Pipeline Section
- Visual bar showing queue depth vs. max queue size
- Metrics: queue depth, failed document count, debounce interval, indexing rate (docs/s), time since last indexed
- Status badge based on configurable thresholds
- Source: `SearchIndexer` internal stats (exposed via `SearchIndexerStats` interface)
- **Auto-refreshes every 15 seconds**
### Indices Section
- **Search/filter** by index name pattern (text input)
- **Filter by health** — All / Green / Yellow / Red dropdown
- **Sortable columns** — Name, Docs, Size, Health, Shards (click column header)
- **Pagination** — 10 per page, server-side
- **Summary row** above table — total index count, total docs, total storage
- **Delete button** (trash icon) per row:
- Confirmation dialog: "Delete index `{name}`? This cannot be undone."
- User must type the index name to confirm
- After deletion, table and summary refresh
- Table columns: Index, Docs, Size, Health, Shards (primary/replica)
- Source: OpenSearch `_cat/indices` API
- **Manual refresh only**
### Performance Section
- Metrics: query cache hit rate, request cache hit rate, average search latency, average indexing latency, JVM heap used (visual bar with used/max)
- Source: OpenSearch `_nodes/stats` API
- **Auto-refreshes every 15 seconds**
### Operations Section (Phase 2 — Visible but Disabled)
- Buttons: Force Reindex All, Flush Index, Delete Index (bulk via checkbox selection)
- Greyed out with tooltip: "Available in a future release"
### Thresholds Section
- Collapsible, collapsed by default
- Configurable values:
- Cluster health: warning level, critical level
- Queue depth: warning count, critical count
- JVM heap usage: warning %, critical %
- Failed docs: warning count, critical count
- Save button persists to database
---
## 4. Audit Log Page (`/admin/audit`)
### Purpose
Database-backed audit trail of all administrative actions across the system. Provides SOC2-compliant evidence of who did what, when, and from where. The audit log is append-only — entries cannot be modified or deleted through the UI or API.
### Header
- Total event count
- Date range selector (default: last 7 days)
### Audit Log Table
```
┌─ Audit Log ──────────────────────────────────────────────────────┐
│ Date range: [2026-03-10] to [2026-03-17]                         │
│ [User: All ▾]  [Category: All ▾]  [Search: ________]             │
│                                                                  │
│ Timestamp            User   Category   Action        Target      │
│ 2026-03-17 14:32:01  admin  INFRA      kill_query    PID 42      │
│ 2026-03-17 14:28:15  admin  INFRA      delete_idx    exec-…      │
│ 2026-03-17 12:01:44  admin  CONFIG     update        oidc        │
│ 2026-03-17 09:15:22  jdoe   AUTH       login                     │
│ 2026-03-16 18:45:00  admin  USER_MGMT  update_roles  u:5         │
│ ...                                                              │
│                                                                  │
│ ◀ 1 2 3 ... 12 ▶                             Showing 1-25 of 294 │
└──────────────────────────────────────────────────────────────────┘
```
- **Filterable** by user, category, date range
- **Searchable** by free text (matches action, target, detail)
- **Sortable** by timestamp (default: newest first)
- **Pagination** — 25 per page, server-side
- **Detail expansion** — click a row to expand and show full `detail` JSON
- **Read-only** — no edit or delete actions available (compliance requirement)
- **Export** (Phase 2) — CSV/JSON download for auditors
### Audit Categories
| Category | Actions Logged |
|----------|---------------|
| `INFRA` | kill_query, delete_index, update_thresholds |
| `AUTH` | login, login_oidc, logout, login_failed |
| `USER_MGMT` | create_user, update_roles, delete_user |
| `CONFIG` | update_oidc, delete_oidc, test_oidc |
### What Gets Logged
Every admin action across the system, not just infrastructure pages:
- **Infrastructure:** kill query, delete OpenSearch index, save thresholds
- **OIDC:** save config, delete config, test connection
- **User management:** update roles, delete user
- **Authentication:** login (success and failure), OIDC login, logout
### Audit Record Fields
| Field | Description |
|-------|-------------|
| `timestamp` | When the action occurred (server time, UTC) |
| `username` | Authenticated user who performed the action |
| `action` | Machine-readable action name (e.g., `kill_query`, `delete_index`) |
| `category` | Grouping: `INFRA`, `AUTH`, `USER_MGMT`, `CONFIG` |
| `target` | What was acted on (e.g., PID, index name, user ID) |
| `detail` | JSONB with action-specific context (e.g., query text for killed query, old/new roles for role change) |
| `result` | `SUCCESS` or `FAILURE` |
| `ip_address` | Client IP address from the request |
### Backend Implementation
- `AuditService` — central service injected into all admin controllers
- Single method: `log(action, category, target, detail, result)`
- Extracts username and IP from `SecurityContextHolder` and `HttpServletRequest`
- Writes to both the `audit_log` table AND SLF4J (belt and suspenders)
- Async write option not used — audit must be synchronous for compliance guarantees
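
The service contract above can be sketched as follows. The real `AuditService` is a Spring bean writing to PostgreSQL; this is only an in-memory illustration of the same shape (single `log` method, context-derived username and IP, synchronous dual write, no update or delete). `InMemoryAuditLog`, `RequestContext`, and the log line format are hypothetical.

```typescript
type AuditCategory = "INFRA" | "AUTH" | "USER_MGMT" | "CONFIG";
type AuditResult = "SUCCESS" | "FAILURE";

// Mirrors the record fields listed in the spec.
interface AuditRecord {
  timestamp: string; // ISO-8601, UTC
  username: string;
  action: string;
  category: AuditCategory;
  target: string | null;
  detail: Record<string, unknown> | null;
  result: AuditResult;
  ipAddress: string | null;
}

// Stand-in for what the real service would read from SecurityContextHolder
// and HttpServletRequest.
interface RequestContext {
  username: string;
  ipAddress: string | null;
}

class InMemoryAuditLog {
  private readonly records: AuditRecord[] = [];
  private readonly logLines: string[] = [];
  private readonly context: () => RequestContext;
  private readonly now: () => Date;

  constructor(context: () => RequestContext, now: () => Date = () => new Date()) {
    this.context = context;
    this.now = now;
  }

  // Synchronous on purpose: the call returns only after both writes happened.
  log(
    action: string,
    category: AuditCategory,
    target: string | null,
    detail: Record<string, unknown> | null,
    result: AuditResult,
  ): AuditRecord {
    const ctx = this.context();
    const record: AuditRecord = Object.freeze({
      timestamp: this.now().toISOString(),
      username: ctx.username,
      action,
      category,
      target,
      detail,
      result,
      ipAddress: ctx.ipAddress,
    });
    this.records.push(record); // stands in for the audit_log INSERT
    this.logLines.push(        // stands in for the SLF4J INFO line
      `AUDIT ${record.username} ${category}/${action} target=${target ?? "-"} result=${result}`,
    );
    return record;
  }

  // Read-only views; there is deliberately no update or delete method.
  entries(): readonly AuditRecord[] {
    return [...this.records];
  }

  lines(): readonly string[] {
    return [...this.logLines];
  }
}
```

Callers pass only the action-specific arguments; who acted and from where is resolved inside the service, so a controller cannot accidentally log the wrong user.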
---
## 5. Backend API
All endpoints live under `/api/v1/admin/` and are secured by the existing Spring Security filter chain (`ROLE_ADMIN` required). Paths in the tables below are relative to `/api/v1`. Controllers are additionally annotated with `@PreAuthorize("hasRole('ADMIN')")` for defense-in-depth.
### Database Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/admin/database/status` | Version, host, schema, connection state |
| `GET` | `/admin/database/pool` | Active, idle, pending, max wait (HikariCP) |
| `GET` | `/admin/database/tables` | Table names, row counts, data sizes, index sizes |
| `GET` | `/admin/database/queries` | Active queries: pid, duration, state, SQL |
| `POST` | `/admin/database/queries/{pid}/kill` | Terminate query via `pg_terminate_backend` |
### OpenSearch Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/admin/opensearch/status` | Version, host, cluster health, node count |
| `GET` | `/admin/opensearch/pipeline` | Queue depth, failed count, debounce, rate, last indexed |
| `GET` | `/admin/opensearch/indices` | Paginated, sortable, filterable index list |
| `DELETE` | `/admin/opensearch/indices/{name}` | Delete specific index (with audit log) |
| `GET` | `/admin/opensearch/performance` | Cache rates, latencies, JVM heap |
#### Indices Query Parameters
| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `search` | string | — | Filter by index name pattern |
| `health` | enum | `ALL` | Filter by health: ALL, GREEN, YELLOW, RED |
| `sort` | string | `name` | Sort field: name, docs, size, health |
| `order` | enum | `asc` | Sort direction: asc, desc |
| `page` | int | `0` | Page number (zero-based) |
| `size` | int | `10` | Page size |
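
To make the semantics of these parameters concrete, the sketch below applies them to an in-memory list. It is an illustration only: the real endpoint runs server-side in Java, and the spec does not define the pattern syntax for `search`, so a case-insensitive substring match is assumed here. `IndexRow`, `IndicesQuery`, `Page`, and `queryIndices` are hypothetical names.

```typescript
type Health = "GREEN" | "YELLOW" | "RED";

interface IndexRow {
  name: string;
  docs: number;
  sizeBytes: number;
  health: Health;
}

// Mirrors the query parameters in the table above; all are optional.
interface IndicesQuery {
  search?: string;
  health?: Health | "ALL";
  sort?: "name" | "docs" | "size" | "health";
  order?: "asc" | "desc";
  page?: number;
  size?: number;
}

interface Page<T> {
  items: T[];
  total: number; // number of rows after filtering, before paging
  page: number;
  size: number;
}

// Assumption: health sorts by severity rather than alphabetically.
const HEALTH_RANK: Record<Health, number> = { GREEN: 0, YELLOW: 1, RED: 2 };

function queryIndices(rows: readonly IndexRow[], query: IndicesQuery = {}): Page<IndexRow> {
  const { search = "", health = "ALL", sort = "name", order = "asc", page = 0, size = 10 } = query;
  const needle = search.toLowerCase();

  // Filter first, so the total reflects the filtered set.
  const filtered = rows.filter(
    (row) => row.name.toLowerCase().includes(needle) && (health === "ALL" || row.health === health),
  );

  const compare = (a: IndexRow, b: IndexRow): number => {
    switch (sort) {
      case "docs":
        return a.docs - b.docs;
      case "size":
        return a.sizeBytes - b.sizeBytes;
      case "health":
        return HEALTH_RANK[a.health] - HEALTH_RANK[b.health];
      default:
        return a.name.localeCompare(b.name);
    }
  };
  filtered.sort((a, b) => (order === "asc" ? compare(a, b) : compare(b, a)));

  // Zero-based paging, as in the parameter table.
  const start = page * size;
  return { items: filtered.slice(start, start + size), total: filtered.length, page, size };
}
```

Filtering before sorting and paging means `total` can drive the pager directly, for example to render the page count.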
### Audit Log Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/admin/audit` | Paginated, filterable audit log entries |
#### Audit Log Query Parameters
| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `username` | string | — | Filter by username |
| `category` | enum | — | Filter by category: INFRA, AUTH, USER_MGMT, CONFIG |
| `search` | string | — | Free text search across action, target, detail |
| `from` | ISO date | 7 days ago | Start of date range |
| `to` | ISO date | now | End of date range |
| `sort` | string | `timestamp` | Sort field |
| `order` | enum | `desc` | Sort direction: asc, desc |
| `page` | int | `0` | Page number (zero-based) |
| `size` | int | `25` | Page size |
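
The defaults in this table could be resolved as sketched below. The rejection of an inverted or unparseable date range is an assumption, since the spec does not say how invalid input is handled; `AuditQueryParams`, `ResolvedAuditQuery`, and `resolveAuditQuery` are hypothetical names.

```typescript
type AuditCategory = "INFRA" | "AUTH" | "USER_MGMT" | "CONFIG";

// Raw query parameters as they might arrive on the request; all optional.
interface AuditQueryParams {
  username?: string;
  category?: AuditCategory;
  search?: string;
  from?: string; // ISO date
  to?: string;   // ISO date
  sort?: string;
  order?: "asc" | "desc";
  page?: number;
  size?: number;
}

interface ResolvedAuditQuery {
  username: string | null;
  category: AuditCategory | null;
  search: string | null;
  from: Date;
  to: Date;
  sort: string;
  order: "asc" | "desc";
  page: number;
  size: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Fills in the documented defaults: last 7 days, newest first, 25 per page.
// `now` is passed in so the result is deterministic and testable.
function resolveAuditQuery(params: AuditQueryParams, now: Date): ResolvedAuditQuery {
  const to = params.to !== undefined ? new Date(params.to) : now;
  const from = params.from !== undefined ? new Date(params.from) : new Date(now.getTime() - 7 * DAY_MS);

  // Assumption: invalid ranges are rejected rather than silently corrected.
  if (Number.isNaN(from.getTime()) || Number.isNaN(to.getTime())) {
    throw new RangeError("from/to must be valid ISO dates");
  }
  if (from.getTime() > to.getTime()) {
    throw new RangeError("from must not be after to");
  }

  return {
    username: params.username ?? null,
    category: params.category ?? null,
    search: params.search ?? null,
    from,
    to,
    sort: params.sort ?? "timestamp",
    order: params.order ?? "desc",
    page: params.page ?? 0,
    size: params.size ?? 25,
  };
}
```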
### Thresholds Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/admin/thresholds` | All configured thresholds |
| `PUT` | `/admin/thresholds` | Save thresholds (database + OpenSearch in one payload) |
### Thresholds Payload
```json
{
  "database": {
    "connectionPoolWarning": 80,
    "connectionPoolCritical": 95,
    "queryDurationWarning": 1.0,
    "queryDurationCritical": 10.0
  },
  "opensearch": {
    "clusterHealthWarning": "YELLOW",
    "clusterHealthCritical": "RED",
    "queueDepthWarning": 100,
    "queueDepthCritical": 500,
    "jvmHeapWarning": 75,
    "jvmHeapCritical": 90,
    "failedDocsWarning": 1,
    "failedDocsCritical": 10
  }
}
```
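
A typed view of this payload, with a basic sanity check, might look like the sketch below. The validation rules (warning strictly below critical, percentages within 0 to 100) are assumptions that seem natural for this payload but are not stated in the spec; `validateThresholds` is a hypothetical helper.

```typescript
type ClusterHealth = "GREEN" | "YELLOW" | "RED";

interface DatabaseThresholds {
  connectionPoolWarning: number;  // percent
  connectionPoolCritical: number; // percent
  queryDurationWarning: number;   // seconds
  queryDurationCritical: number;  // seconds
}

interface OpenSearchThresholds {
  clusterHealthWarning: ClusterHealth;
  clusterHealthCritical: ClusterHealth;
  queueDepthWarning: number;
  queueDepthCritical: number;
  jvmHeapWarning: number;  // percent
  jvmHeapCritical: number; // percent
  failedDocsWarning: number;
  failedDocsCritical: number;
}

interface ThresholdsPayload {
  database: DatabaseThresholds;
  opensearch: OpenSearchThresholds;
}

const HEALTH_RANK: Record<ClusterHealth, number> = { GREEN: 0, YELLOW: 1, RED: 2 };

// Returns a list of human-readable problems; an empty list means the payload
// passed every (assumed) rule.
function validateThresholds(payload: ThresholdsPayload): string[] {
  const errors: string[] = [];
  const db = payload.database;
  const os = payload.opensearch;

  // Each warning level must sit strictly below its critical level.
  const pairs: Array<[string, number, number]> = [
    ["database.connectionPool", db.connectionPoolWarning, db.connectionPoolCritical],
    ["database.queryDuration", db.queryDurationWarning, db.queryDurationCritical],
    ["opensearch.clusterHealth", HEALTH_RANK[os.clusterHealthWarning], HEALTH_RANK[os.clusterHealthCritical]],
    ["opensearch.queueDepth", os.queueDepthWarning, os.queueDepthCritical],
    ["opensearch.jvmHeap", os.jvmHeapWarning, os.jvmHeapCritical],
    ["opensearch.failedDocs", os.failedDocsWarning, os.failedDocsCritical],
  ];
  for (const [name, warning, critical] of pairs) {
    if (!(warning < critical)) errors.push(`${name}: warning must be below critical`);
  }

  // Percentage-valued thresholds must stay within 0..100.
  const percentages: Array<[string, number]> = [
    ["database.connectionPoolWarning", db.connectionPoolWarning],
    ["database.connectionPoolCritical", db.connectionPoolCritical],
    ["opensearch.jvmHeapWarning", os.jvmHeapWarning],
    ["opensearch.jvmHeapCritical", os.jvmHeapCritical],
  ];
  for (const [name, value] of percentages) {
    if (value < 0 || value > 100) errors.push(`${name}: must be between 0 and 100`);
  }
  return errors;
}
```

Returning all problems at once, rather than throwing on the first, lets the Thresholds form show every invalid field in a single save attempt.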
---
## 6. Security
### Enforcement Layers
1. **Spring Security filter chain:** `/api/v1/admin/**` requires `ROLE_ADMIN` (existing configuration)
2. **Controller annotation:** `@PreAuthorize("hasRole('ADMIN')")` on each controller class (defense-in-depth)
3. **UI role check:** sidebar admin section hidden for non-admin users (cosmetic only, not a security boundary)
### Audit Logging
All admin actions are persisted to the `audit_log` database table (see Section 4 and Section 7 — Data Storage) AND logged via SLF4J at INFO level. The database record is the source of truth for compliance; the SLF4J log provides operational visibility.
The `AuditService` is injected into all admin controllers (infrastructure, OIDC, user management) and the authentication flow. See Section 4 (Audit Log Page) for full details on what is logged and the record structure.
### Future: OPERATOR Role (Phase 2+)
Design anticipates a read-only `OPERATOR` role:
- Can view all monitoring data
- Cannot perform destructive actions (kill, delete)
- Implementation: method-level `@PreAuthorize` on action endpoints, UI conditionally disables buttons based on role
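
On the UI side, the role-based button disabling could reduce to a check like the one below. This is a sketch of the planned Phase 2+ behavior only: the action names are taken from the audit categories in Section 4, and `canPerform` is a hypothetical helper. As noted above, such a check is cosmetic; enforcement stays on the server.

```typescript
type Role = "ADMIN" | "OPERATOR";

// Actions the UI may offer; the two destructive ones come from the spec
// ("kill, delete"), the view action is illustrative.
type AdminAction = "view_monitoring" | "kill_query" | "delete_index";

const DESTRUCTIVE_ACTIONS: ReadonlySet<AdminAction> = new Set<AdminAction>([
  "kill_query",
  "delete_index",
]);

// ADMIN may do everything; OPERATOR may only view; anyone else gets nothing.
function canPerform(roles: readonly Role[], action: AdminAction): boolean {
  if (roles.includes("ADMIN")) return true;
  if (roles.includes("OPERATOR")) return !DESTRUCTIVE_ACTIONS.has(action);
  return false;
}
```

A Kill button would then render as `disabled={!canPerform(roles, "kill_query")}`, while the matching `@PreAuthorize` on the endpoint remains the actual security boundary.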
---
## 7. Data Storage
### New Flyway Migration: V9
```sql
CREATE TABLE admin_thresholds (
    id          INTEGER PRIMARY KEY DEFAULT 1,
    config      JSONB NOT NULL DEFAULT '{}',
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_by  TEXT NOT NULL,
    CONSTRAINT single_row CHECK (id = 1)
);

CREATE TABLE audit_log (
    id          BIGSERIAL PRIMARY KEY,
    timestamp   TIMESTAMPTZ NOT NULL DEFAULT now(),
    username    TEXT NOT NULL,
    action      TEXT NOT NULL,
    category    TEXT NOT NULL,
    target      TEXT,
    detail      JSONB,
    result      TEXT NOT NULL,
    ip_address  TEXT
);

CREATE INDEX idx_audit_log_timestamp ON audit_log (timestamp DESC);
CREATE INDEX idx_audit_log_username ON audit_log (username);
CREATE INDEX idx_audit_log_category ON audit_log (category);
```
**admin_thresholds:**
- Single-row table (same pattern as `oidc_config`)
- JSONB column for flexibility: adding new thresholds doesn't require schema changes
- Tracks who last updated and when
**audit_log:**
- Append-only table — no UPDATE or DELETE exposed via API
- Indexed on timestamp (primary query axis), username, and category for filtered views
- JSONB `detail` column holds action-specific context without schema changes
- No foreign key to `users` table — username is denormalized so audit records survive user deletion
---
## 8. Frontend Architecture
### New Files
| File | Purpose |
|------|---------|
| `pages/admin/DatabaseAdminPage.tsx` | Database monitoring and management |
| `pages/admin/OpenSearchAdminPage.tsx` | OpenSearch monitoring and management |
| `pages/admin/AuditLogPage.tsx` | Audit log viewer |
| `api/queries/admin/database.ts` | React Query hooks for database endpoints |
| `api/queries/admin/opensearch.ts` | React Query hooks for OpenSearch endpoints |
| `api/queries/admin/thresholds.ts` | React Query hooks for threshold endpoints |
| `api/queries/admin/audit.ts` | React Query hooks for audit log endpoint |
| `components/admin/StatusBadge.tsx` | Color-coded status indicator (green/yellow/red) |
| `components/admin/RefreshableCard.tsx` | Card with manual refresh button + optional auto-refresh |
| `components/admin/ConfirmDeleteDialog.tsx` | Confirmation dialog requiring name input for destructive actions |
### Modified Files
| File | Change |
|------|--------|
| `components/layout/AppSidebar.tsx` | Refactor admin section to collapsible sub-menu with multiple items |
| `router.tsx` | Add routes for `/admin/database`, `/admin/opensearch`, `/admin/audit`, redirect `/admin` |
| `SpaForwardController.java` | Ensure `/admin/*` forwarding covers new routes |
### Auto-Refresh Strategy
- React Query `refetchInterval: 15000` on lightweight endpoints (pool, queries, pipeline, performance)
- Heavy endpoints (tables, indices) use `refetchInterval: false` — manual refresh only
- Refresh button calls `queryClient.invalidateQueries` for all queries on that page
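
The split between auto-refreshing and manual-only endpoints can be captured in one small lookup, sketched below. The endpoint names follow the list above; `refetchIntervalFor` and `AdminEndpoint` are hypothetical names, not existing code.

```typescript
// Endpoints named in the refresh strategy above.
type AdminEndpoint = "pool" | "queries" | "pipeline" | "performance" | "tables" | "indices";

const AUTO_REFRESH_MS = 15000;

// Lightweight endpoints that are cheap enough to poll.
const LIGHTWEIGHT_ENDPOINTS: ReadonlySet<AdminEndpoint> = new Set<AdminEndpoint>([
  "pool",
  "queries",
  "pipeline",
  "performance",
]);

// Returns a polling interval in milliseconds, or false for manual refresh only.
function refetchIntervalFor(endpoint: AdminEndpoint): number | false {
  return LIGHTWEIGHT_ENDPOINTS.has(endpoint) ? AUTO_REFRESH_MS : false;
}
```

The return type matches what React Query accepts for `refetchInterval`, so each hook could pass `refetchInterval: refetchIntervalFor("pool")` and keep the polling policy in one place.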
---
## 9. Implementation Phases
### Phase 1 (Current Scope)
1. Admin sidebar restructuring
2. Database page — all monitoring sections + kill query
3. OpenSearch page — all monitoring sections + delete index
4. Threshold configuration (both pages)
5. Audit log — database-backed audit trail + admin viewer page
6. Retrofit audit logging into existing admin controllers (OIDC, user management) and auth flow
7. Backend endpoints with RBAC enforcement
8. Flyway migration V9 for thresholds + audit_log tables
### Phase 2
- Database maintenance actions (VACUUM ANALYZE, Reindex)
- OpenSearch operations (Force Reindex All, Flush)
- Bulk index operations (checkbox selection)
- Audit log CSV/JSON export for auditors
- OPERATOR role with view-only permissions
### Phase 3
- TimescaleDB-aware metrics (hypertable chunks, continuous aggregate status, compression)
- Historical trend charts for key metrics
- Alerting/notification system