Infrastructure Overview — Admin Pages Design

Date: 2026-03-17
Status: Approved
Scope: Phase 1 implementation; full vision documented with Phase 2+ sections marked

Overview

Add Database and OpenSearch admin pages to the Cameleer Server UI, allowing administrators to monitor subsystem health, inspect metrics, and perform basic maintenance actions. Restructure admin navigation from a single OIDC page to a sidebar sub-menu with dedicated pages per concern.

Goals

  • Give admins real-time visibility into PostgreSQL and OpenSearch health, performance, and storage
  • Enable basic maintenance actions (kill queries, delete indices) without SSH/kubectl access
  • Provide configurable thresholds for visual status indicators (green/yellow/red)
  • Establish a database-backed audit log for all admin actions (SOC2 compliance foundation)
  • Design for future expansion (VACUUM, reindex, OPERATOR role) without requiring restructuring

Non-Goals (Phase 1)

  • Database maintenance actions (VACUUM ANALYZE, Reindex)
  • OpenSearch bulk operations (Force Reindex All, Flush)
  • OPERATOR role with restricted permissions
  • TimescaleDB-specific features (hypertable stats, continuous aggregate status)
  • Alerting or notifications beyond visual indicators

1. Admin Navigation Restructuring

Current State

Single gear icon at bottom of AppSidebar linking directly to /admin/oidc.

New Structure

The gear icon expands/collapses an admin sub-menu in the sidebar:

── Apps ──────────────
  app-1
  app-2
── Admin (gear icon) ─
  Database              → /admin/database
  OpenSearch            → /admin/opensearch
  Audit Log             → /admin/audit
  OIDC                  → /admin/oidc
  Users                 → /admin/users (backend exists, UI page is future scope)

  • Admin section visible only to users with ADMIN role
  • Section collapsed by default; state persisted in localStorage
  • Active sub-item highlighted
  • /admin redirects to /admin/database
  • Existing OidcAdminPage is functionally unchanged; it is re-routed from being the sole admin page to a sub-page

2. Database Page (/admin/database)

Header

  • Connection status badge (green/red)
  • PostgreSQL version (with TimescaleDB extension noted if present)
  • Host and schema name
  • Manual refresh button (refreshes all sections)

Connection Pool Section

  • Visual bar showing active connections vs. max pool size
  • Metrics: active, idle, pending, max wait time
  • Status badge based on configurable threshold (% of pool in use)
  • Source: HikariCP pool MXBean
  • Auto-refreshes every 15 seconds

Table Sizes Section

  • Table with columns: Table, Rows, Size, Index Size
  • All application tables listed (executions, processor_executions, route_diagrams, agent_metrics, users, oidc_config, admin_thresholds)
  • Summary row: total data size, total index size
  • Source: pg_stat_user_tables + pg_relation_size
  • Manual refresh only (expensive query)

Active Queries Section

  • Table with columns: PID, Duration, State, Query (truncated), Action
  • Queries > warning threshold highlighted yellow, > critical threshold highlighted red
  • Kill button per row → calls pg_terminate_backend(pid), which ends the whole backend session running the query (not just the current statement)
  • Kill requires confirmation dialog
  • After kill, query list refreshes automatically
  • Source: pg_stat_activity
  • Auto-refreshes every 15 seconds

Maintenance Section (Phase 2 — Visible but Disabled)

  • Buttons: Run VACUUM ANALYZE, Reindex Tables
  • Greyed out with tooltip: "Available in a future release"

Thresholds Section

  • Collapsible, collapsed by default
  • Configurable values:
    • Connection pool usage: warning % and critical %
    • Query duration: warning seconds and critical seconds
  • Save button persists to database
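
The mapping from a measured value to a badge color can be sketched as follows. This is a minimal illustration, not the project's implementation: the names BadgeStatus and ThresholdEvaluator are hypothetical, and treating a value equal to a threshold as having crossed it is an assumption, since the spec does not say whether the comparison is inclusive.

```java
/** Badge colors used by the status indicators (hypothetical name). */
enum BadgeStatus { GREEN, YELLOW, RED }

/** Hypothetical helper that maps a metric value to a badge color. */
final class ThresholdEvaluator {
    private ThresholdEvaluator() {}

    /**
     * Returns RED at or above the critical threshold, YELLOW at or above the
     * warning threshold, and GREEN otherwise. The inclusive comparison is an
     * assumption, not something the spec states.
     */
    static BadgeStatus evaluate(double value, double warning, double critical) {
        if (value >= critical) return BadgeStatus.RED;
        if (value >= warning) return BadgeStatus.YELLOW;
        return BadgeStatus.GREEN;
    }

    /** Connection pool usage as a percentage of the maximum pool size. */
    static double poolUsagePercent(int active, int max) {
        if (max <= 0) throw new IllegalArgumentException("max must be positive");
        return 100.0 * active / max;
    }
}
```

With the example thresholds from the payload in Section 5 (warning 80, critical 95), a pool with 8 of 10 connections active would evaluate to YELLOW under this sketch.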

3. OpenSearch Page (/admin/opensearch)

Header

  • Cluster health badge (green/yellow/red — maps directly to OpenSearch cluster health)
  • OpenSearch version
  • Node count
  • Host URL
  • Manual refresh button

Indexing Pipeline Section

  • Visual bar showing queue depth vs. max queue size
  • Metrics: queue depth, failed document count, debounce interval, indexing rate (docs/s), time since last indexed
  • Status badge based on configurable thresholds
  • Source: SearchIndexer internal stats, exposed via a new SearchIndexerStats interface in cameleer-server-core
  • Auto-refreshes every 15 seconds

Implementation note: SearchIndexer currently has no stats API. This requires adding:

  • AtomicLong failedCount — incremented on indexing errors
  • AtomicLong indexedCount — incremented on successful index operations
  • volatile Instant lastIndexedAt — updated after each successful batch
  • Rate calculation via a sliding window counter (e.g., count delta over last 15s interval)
  • SearchIndexerStats interface in core module with getters for all above, implemented by SearchIndexer
  • queueDepth and maxQueueSize already derivable from the internal BlockingQueue
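
The counters listed above could look roughly like the sketch below. Only SearchIndexerStats, failedCount, indexedCount, and lastIndexedAt come from this spec; the getter names, the IndexerCounters class, and the explicit sampleRate call are assumptions made for illustration. The rate is computed as the count delta between two samples, which is one simple reading of the "sliding window" idea, not necessarily the final design.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;

/** Read-only view of indexer statistics. Getter names are assumptions. */
interface SearchIndexerStats {
    long getFailedCount();
    long getIndexedCount();
    Instant getLastIndexedAt();   // null until the first successful batch
    double getIndexingRate();     // docs per second over the last sample window
}

/** Hypothetical holder; in the spec, SearchIndexer itself implements the interface. */
final class IndexerCounters implements SearchIndexerStats {
    private final AtomicLong failedCount = new AtomicLong();
    private final AtomicLong indexedCount = new AtomicLong();
    private volatile Instant lastIndexedAt;

    // State for the rate calculation: count and clock at the previous sample.
    private long lastSampleCount;
    private long lastSampleNanos;
    private volatile double ratePerSecond;

    IndexerCounters(long nowNanos) {
        this.lastSampleNanos = nowNanos;
    }

    /** Called after each successfully indexed batch. */
    void recordBatchSuccess(int docs, Instant now) {
        indexedCount.addAndGet(docs);
        lastIndexedAt = now;
    }

    /** Called when an indexing operation fails. */
    void recordFailure() {
        failedCount.incrementAndGet();
    }

    /** Call on a fixed schedule (e.g. every 15 s) to refresh the rate from the count delta. */
    synchronized void sampleRate(long nowNanos) {
        long count = indexedCount.get();
        long elapsedNanos = nowNanos - lastSampleNanos;
        if (elapsedNanos > 0) {
            ratePerSecond = (count - lastSampleCount) * 1_000_000_000.0 / elapsedNanos;
        }
        lastSampleCount = count;
        lastSampleNanos = nowNanos;
    }

    @Override public long getFailedCount() { return failedCount.get(); }
    @Override public long getIndexedCount() { return indexedCount.get(); }
    @Override public Instant getLastIndexedAt() { return lastIndexedAt; }
    @Override public double getIndexingRate() { return ratePerSecond; }
}
```

Passing the clock value in as a parameter (for example from System.nanoTime()) keeps the sketch deterministic in tests; a real implementation might read the clock internally instead.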

Indices Section

  • Search/filter by index name pattern (text input)
  • Filter by health — All / Green / Yellow / Red dropdown
  • Sortable columns — Name, Docs, Size, Health, Shards (click column header)
  • Pagination — 10 per page, server-side
  • Summary row above table — total index count, total docs, total storage
  • Delete button (trash icon) per row:
    • Confirmation dialog: "Delete index {name}? This cannot be undone."
    • User must type the index name to confirm
    • After deletion, table and summary refresh
  • Table columns: Index, Docs, Size, Health, Shards (primary/replica)
  • Source: OpenSearch _cat/indices API
  • Manual refresh only

Performance Section

  • Metrics: query cache hit rate, request cache hit rate, average search latency, average indexing latency, JVM heap used (visual bar with used/max)
  • Source: OpenSearch _nodes/stats API
  • Auto-refreshes every 15 seconds

Operations Section (Phase 2 — Visible but Disabled)

  • Buttons: Force Reindex All, Flush Index, Delete Index (bulk via checkbox selection)
  • Greyed out with tooltip: "Available in a future release"

Thresholds Section

  • Collapsible, collapsed by default
  • Configurable values:
    • Cluster health: warning level, critical level
    • Queue depth: warning count, critical count
    • JVM heap usage: warning %, critical %
    • Failed docs: warning count, critical count
  • Save button persists to database

4. Audit Log Page (/admin/audit)

Purpose

Database-backed audit trail of all administrative actions across the system. Provides SOC2-compliant evidence of who did what, when, and from where. The audit log is append-only — entries cannot be modified or deleted through the UI or API.

Header

  • Total event count
  • Date range selector (default: last 7 days)

Audit Log Table

┌─ Audit Log ────────────────────────────────────────────────┐
│ Date range: [2026-03-10] to [2026-03-17]                   │
│ [User: All ▾]  [Category: All ▾]  [Search: ________]      │
│                                                            │
│ Timestamp            User     Category   Action     Target │
│ 2026-03-17 14:32:01  admin    INFRA      kill_query PID 42 │
│ 2026-03-17 14:28:15  admin    INFRA      delete_idx exec-… │
│ 2026-03-17 12:01:44  admin    CONFIG     update     oidc   │
│ 2026-03-17 09:15:22  jdoe     AUTH       login             │
│ 2026-03-16 18:45:00  admin    USER_MGMT  update_roles u:5  │
│ ...                                                        │
│                                                            │
│ ◀ 1  2  3  ...  12 ▶              Showing 1-25 of 294     │
└────────────────────────────────────────────────────────────┘

  • Filterable by user, category, date range
  • Searchable by free text (matches action and target columns only — JSONB detail excluded from text search for performance)
  • Sortable by timestamp (default: newest first)
  • Pagination — 25 per page, server-side
  • Detail expansion — click a row to expand and show full detail JSON
  • Read-only — no edit or delete actions available (compliance requirement)
  • Export (Phase 2) — CSV/JSON download for auditors

Audit Categories

| Category | Actions Logged |
|---|---|
| INFRA | kill_query, delete_index, update_thresholds |
| AUTH | login, login_oidc, logout, login_failed |
| USER_MGMT | create_user, update_roles, delete_user |
| CONFIG | update_oidc, delete_oidc, test_oidc |

What Gets Logged

Every admin action across the system, not just infrastructure pages:

  • Infrastructure: kill query, delete OpenSearch index, save thresholds
  • OIDC: save config, delete config, test connection
  • User management: update roles, delete user
  • Authentication: login (success and failure), OIDC login, logout

Audit Record Fields

| Field | Description |
|---|---|
| timestamp | When the action occurred (server time, UTC) |
| username | Authenticated user who performed the action |
| action | Machine-readable action name (e.g., kill_query, delete_index) |
| category | Grouping: INFRA, AUTH, USER_MGMT, CONFIG |
| target | What was acted on (e.g., PID, index name, user ID) |
| detail | JSONB with action-specific context (e.g., query text for killed query, old/new roles for role change) |
| result | SUCCESS or FAILURE |
| ip_address | Client IP address from the request |
| user_agent | Browser/client identification from request header (SOC2 forensics) |

Backend Implementation

  • AuditService — central service in cameleer-server-core, injected into all admin controllers via direct method calls (no AOP/interceptor — consistent with existing controller style)
  • Primary method: log(action, category, target, detail, result) — extracts username and IP from SecurityContextHolder and HttpServletRequest
  • Overloaded method for pre-auth contexts: log(username, action, category, target, detail, result, request) — used by auth controllers where SecurityContext is not yet populated (login success/failure)
  • Captures user_agent from HttpServletRequest header
  • Writes to both the audit_log table AND SLF4J (belt and suspenders)
  • Async write option not used — audit must be synchronous for compliance guarantees
  • Retrofit into existing controllers: add auditService.log(...) calls to OidcConfigAdminController (save, delete, test), UserAdminController (update roles, delete user), and the auth controllers (login, OIDC login, logout, failed login)
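
A dependency-free sketch of the write path described above is shown below. It is an illustration under stated assumptions, not the actual AuditService: AuditRecord, AuditRepository, InMemoryAuditRepository, and AuditServiceSketch are hypothetical names, java.util.logging stands in for SLF4J, and the IP address and user agent are passed as plain strings where the real service would read them from HttpServletRequest.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/** One row of the audit_log table (hypothetical type; fields follow the record fields above). */
record AuditRecord(Instant timestamp, String username, String action, String category,
                   String target, String detailJson, String result,
                   String ipAddress, String userAgent) {}

/** Hypothetical persistence port; a real implementation would INSERT into audit_log. */
interface AuditRepository {
    void insert(AuditRecord entry);
}

/** In-memory stand-in used only to make this sketch runnable. */
final class InMemoryAuditRepository implements AuditRepository {
    final List<AuditRecord> rows = new ArrayList<>();
    @Override public void insert(AuditRecord entry) { rows.add(entry); }
}

final class AuditServiceSketch {
    // java.util.logging is an assumption for this sketch; the spec names SLF4J.
    private static final Logger LOG = Logger.getLogger("audit");
    private final AuditRepository repository;

    AuditServiceSketch(AuditRepository repository) {
        this.repository = repository;
    }

    /** Mirrors the pre-auth overload: the caller supplies username and request metadata. */
    void log(String username, String action, String category, String target,
             String detailJson, String result, String ipAddress, String userAgent) {
        AuditRecord entry = new AuditRecord(Instant.now(), username, action, category,
                target, detailJson, result, ipAddress, userAgent);
        // Synchronous write: the method returns only after the row has been stored.
        repository.insert(entry);
        // Second sink for operational visibility.
        LOG.info(() -> String.format("AUDIT user=%s category=%s action=%s target=%s result=%s",
                username, category, action, target, result));
    }
}
```

Writing to the repository before logging means a failed database insert surfaces as an exception to the caller, which matches the stated preference for synchronous audit writes.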

5. Backend API

All endpoints live under /api/v1/admin/ and are secured by the existing Spring Security filter chain (ROLE_ADMIN required). Controllers are additionally annotated with @PreAuthorize("hasRole('ADMIN')") for defense-in-depth. Paths in the tables below are relative to /api/v1.

Database Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /admin/database/status | Version, host, schema, connection state |
| GET | /admin/database/pool | Active, idle, pending, max wait (HikariCP) |
| GET | /admin/database/tables | Table names, row counts, data sizes, index sizes |
| GET | /admin/database/queries | Active queries: pid, duration, state, SQL |
| POST | /admin/database/queries/{pid}/kill | Terminate query via pg_terminate_backend |

OpenSearch Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /admin/opensearch/status | Version, host, cluster health, node count |
| GET | /admin/opensearch/pipeline | Queue depth, failed count, debounce, rate, last indexed |
| GET | /admin/opensearch/indices | Paginated, sortable, filterable index list |
| DELETE | /admin/opensearch/indices/{name} | Delete specific index (with audit log) |
| GET | /admin/opensearch/performance | Cache rates, latencies, JVM heap |

OpenSearch client: All admin OpenSearch endpoints reuse the existing OpenSearchClient bean configured in OpenSearchConfig.java. No separate client or credentials needed — the admin endpoints call cluster-level APIs (_cluster/health, _cat/indices, _nodes/stats) using the same connection.

Indices Query Parameters

| Param | Type | Default | Description |
|---|---|---|---|
| search | string | | Filter by index name pattern |
| health | enum | ALL | Filter by health: ALL, GREEN, YELLOW, RED |
| sort | string | name | Sort field: name, docs, size, health |
| order | enum | asc | Sort direction: asc, desc |
| page | int | 0 | Page number (zero-based) |
| size | int | 10 | Page size |

Audit Log Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /admin/audit | Paginated, filterable audit log entries |

Audit Log Query Parameters

| Param | Type | Default | Description |
|---|---|---|---|
| username | string | | Filter by username |
| category | enum | | Filter by category: INFRA, AUTH, USER_MGMT, CONFIG |
| search | string | | Free text search across action and target columns (not JSONB detail) |
| from | ISO date | 7 days ago | Start of date range |
| to | ISO date | now | End of date range |
| sort | string | timestamp | Sort field |
| order | enum | desc | Sort direction: asc, desc |
| page | int | 0 | Page number (zero-based) |
| size | int | 25 | Page size |

Thresholds Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /admin/thresholds | All configured thresholds |
| PUT | /admin/thresholds | Save thresholds (database + OpenSearch in one payload) |

Thresholds Payload

{
  "database": {
    "connectionPoolWarning": 80,
    "connectionPoolCritical": 95,
    "queryDurationWarning": 1.0,
    "queryDurationCritical": 10.0
  },
  "opensearch": {
    "clusterHealthWarning": "YELLOW",
    "clusterHealthCritical": "RED",
    "queueDepthWarning": 100,
    "queueDepthCritical": 500,
    "jvmHeapWarning": 75,
    "jvmHeapCritical": 90,
    "failedDocsWarning": 1,
    "failedDocsCritical": 10
  }
}

Thresholds Validation Rules

  • Warning must be <= critical for all numeric threshold pairs
  • Percentage values must be between 0 and 100
  • Duration values must be > 0
  • clusterHealthWarning must be less severe than clusterHealthCritical (GREEN < YELLOW < RED)
  • Backend returns 400 Bad Request with field-level error messages on validation failure
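
The rules above can be sketched as a validator that collects field-level messages, which a controller could then return in a 400 response. The type names (DatabaseThresholds, ClusterHealthLevel, ThresholdValidator), the dotted field keys, and the message wording are all hypothetical; only the field names inside the record come from the example payload. For brevity the sketch covers the database block and the cluster health pair only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical type mirroring the "database" object of the thresholds payload. */
record DatabaseThresholds(int connectionPoolWarning, int connectionPoolCritical,
                          double queryDurationWarning, double queryDurationCritical) {}

/** Declaration order encodes severity: GREEN < YELLOW < RED. */
enum ClusterHealthLevel { GREEN, YELLOW, RED }

final class ThresholdValidator {
    private ThresholdValidator() {}

    /** Returns a map of field name to error message; an empty map means the input is valid. */
    static Map<String, String> validate(DatabaseThresholds db,
                                        ClusterHealthLevel healthWarning,
                                        ClusterHealthLevel healthCritical) {
        Map<String, String> errors = new LinkedHashMap<>();

        checkPercent(errors, "database.connectionPoolWarning", db.connectionPoolWarning());
        checkPercent(errors, "database.connectionPoolCritical", db.connectionPoolCritical());
        if (db.connectionPoolWarning() > db.connectionPoolCritical()) {
            errors.putIfAbsent("database.connectionPoolWarning",
                    "must be <= connectionPoolCritical");
        }

        if (db.queryDurationWarning() <= 0) {
            errors.put("database.queryDurationWarning", "must be > 0");
        }
        if (db.queryDurationCritical() <= 0) {
            errors.put("database.queryDurationCritical", "must be > 0");
        }
        if (db.queryDurationWarning() > db.queryDurationCritical()) {
            errors.putIfAbsent("database.queryDurationWarning",
                    "must be <= queryDurationCritical");
        }

        // Warning must be strictly less severe than critical.
        if (healthWarning.ordinal() >= healthCritical.ordinal()) {
            errors.put("opensearch.clusterHealthWarning",
                    "must be less severe than clusterHealthCritical");
        }
        return errors;
    }

    private static void checkPercent(Map<String, String> errors, String field, int value) {
        if (value < 0 || value > 100) {
            errors.put(field, "must be between 0 and 100");
        }
    }
}
```

Using putIfAbsent for the pair checks keeps the first, more specific message when a field fails more than one rule.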

Pagination Limits

  • All paginated endpoints enforce a maximum page size of 100 (min(requested, 100))
  • Applies to: indices listing, audit log
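
As a rough sketch, the indices query parameters and the page-size cap could be applied to an in-memory list of index rows as shown below. IndexRow, IndexPage, and IndexQuery are hypothetical names, matching the search term as a case-insensitive substring is an assumption (the spec only says "name pattern"), and sorting health alphabetically is a simplification.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical row type for one entry of the index listing. */
record IndexRow(String name, long docs, long sizeBytes, String health) {}

/** Hypothetical page wrapper: the slice plus enough metadata to render pagination. */
record IndexPage(List<IndexRow> items, int page, int size, long totalItems) {}

final class IndexQuery {
    static final int MAX_PAGE_SIZE = 100;

    private IndexQuery() {}

    static IndexPage query(List<IndexRow> all, String search, String health,
                           String sort, boolean descending, int page, int requestedSize) {
        // Enforce the cap: min(requested, 100), and never less than one row per page.
        int size = Math.max(1, Math.min(requestedSize, MAX_PAGE_SIZE));
        int safePage = Math.max(page, 0);

        Comparator<IndexRow> cmp = switch (sort == null ? "name" : sort) {
            case "docs" -> Comparator.comparingLong(IndexRow::docs);
            case "size" -> Comparator.comparingLong(IndexRow::sizeBytes);
            case "health" -> Comparator.comparing(IndexRow::health); // alphabetical, a simplification
            default -> Comparator.comparing(IndexRow::name);
        };
        if (descending) {
            cmp = cmp.reversed();
        }

        String needle = search == null ? "" : search.toLowerCase(Locale.ROOT);
        List<IndexRow> filtered = all.stream()
                .filter(r -> r.name().toLowerCase(Locale.ROOT).contains(needle))
                .filter(r -> health == null || health.equalsIgnoreCase("ALL")
                        || r.health().equalsIgnoreCase(health))
                .sorted(cmp)
                .toList();

        // Zero-based page; an out-of-range page yields an empty slice.
        int from = (int) Math.min((long) safePage * size, filtered.size());
        int to = Math.min(from + size, filtered.size());
        return new IndexPage(filtered.subList(from, to), safePage, size, filtered.size());
    }
}
```

Returning the clamped size in the response lets the client see that a request for, say, 1000 rows was reduced to 100.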

Error Responses

All new endpoints return errors in a consistent shape:

{
  "status": 404,
  "error": "Not Found",
  "message": "No active query with PID 12345"
}

Specific error cases:

  • POST /admin/database/queries/{pid}/kill — 404 if PID not found, 500 if pg_terminate_backend fails
  • DELETE /admin/opensearch/indices/{name} — 404 if index not found, 502 if OpenSearch unreachable
  • GET /admin/database/status — returns 200 with "connected": false if database is unreachable (not 503), so the frontend can render a red status badge rather than an error state
  • GET /admin/opensearch/status — returns 200 with "clusterHealth": "UNREACHABLE" if OpenSearch is down
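
The "200 with connected: false" behaviour of the database status endpoint can be sketched as a probe that turns a failed lookup into a regular payload instead of an error. DatabaseStatusView and StatusProbe are hypothetical names, and the version lookup is passed in as a Callable so the sketch stays free of JDBC; the real endpoint would run its own query.

```java
import java.util.concurrent.Callable;

/** Hypothetical response body for GET /admin/database/status (subset of fields). */
record DatabaseStatusView(boolean connected, String version) {}

final class StatusProbe {
    private StatusProbe() {}

    /**
     * Runs the supplied version lookup. Any failure is reported as
     * connected=false so the caller can still answer with HTTP 200.
     */
    static DatabaseStatusView probe(Callable<String> versionLookup) {
        try {
            return new DatabaseStatusView(true, versionLookup.call());
        } catch (Exception e) {
            return new DatabaseStatusView(false, null);
        }
    }
}
```

The same pattern would apply to the OpenSearch status endpoint, with "UNREACHABLE" as the fallback cluster health value.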

6. Security

Enforcement Layers

  1. Spring Security filter chain: /api/v1/admin/** requires ROLE_ADMIN (existing configuration)
  2. Controller annotation: @PreAuthorize("hasRole('ADMIN')") on each controller class (defense-in-depth). This is a new convention; existing controllers (OidcConfigAdminController, UserAdminController) must be retrofitted with this annotation as part of Phase 1.
  3. @EnableMethodSecurity: must be added to SecurityConfig.java to activate @PreAuthorize processing (prerequisite for layer 2)
  4. UI role check: sidebar admin section hidden for non-admin users (cosmetic only, not a security boundary)

Audit Logging

All admin actions are persisted to the audit_log database table (see Section 4 and Section 7 — Data Storage) AND logged via SLF4J at INFO level. The database record is the source of truth for compliance; the SLF4J log provides operational visibility.

The AuditService is injected into all admin controllers (infrastructure, OIDC, user management) and the authentication flow. See Section 4 (Audit Log Page) for full details on what is logged and the record structure.

Future: OPERATOR Role (Phase 2+)

Design anticipates a read-only OPERATOR role:

  • Can view all monitoring data
  • Cannot perform destructive actions (kill, delete)
  • Implementation: method-level @PreAuthorize on action endpoints, UI conditionally disables buttons based on role

7. Data Storage

Flyway Migration V9: Admin Thresholds

CREATE TABLE admin_thresholds (
    id          INTEGER PRIMARY KEY DEFAULT 1,
    config      JSONB NOT NULL DEFAULT '{}',
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_by  TEXT NOT NULL,
    CONSTRAINT  single_row CHECK (id = 1)
);

  • Single-row table using a CHECK (id = 1) constraint, stricter than the oidc_config pattern (which uses a text PK defaulting to 'default' without a constraint). The CHECK approach is preferred going forward as it explicitly prevents multiple rows.
  • JSONB column for flexibility: adding new thresholds doesn't require schema changes
  • Tracks who last updated and when

Flyway Migration V10: Audit Log

CREATE TABLE audit_log (
    id          BIGSERIAL PRIMARY KEY,
    timestamp   TIMESTAMPTZ NOT NULL DEFAULT now(),
    username    TEXT NOT NULL,
    action      TEXT NOT NULL,
    category    TEXT NOT NULL,
    target      TEXT,
    detail      JSONB,
    result      TEXT NOT NULL,
    ip_address  TEXT,
    user_agent  TEXT
);

CREATE INDEX idx_audit_log_timestamp ON audit_log (timestamp DESC);
CREATE INDEX idx_audit_log_username ON audit_log (username);
CREATE INDEX idx_audit_log_category ON audit_log (category);
CREATE INDEX idx_audit_log_action ON audit_log (action);
CREATE INDEX idx_audit_log_target ON audit_log (target);

  • Separate migration from thresholds so they can be developed and tested independently
  • Append-only table — no UPDATE or DELETE exposed via API
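  • Indexed on timestamp (primary query axis), username, category, action, and target for filtered views. Free-text search uses ILIKE on the action and target columns; note that a substring pattern such as '%kill%' cannot use these B-tree indexes, so a trigram (pg_trgm) index may be needed if the table grows large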
  • JSONB detail column holds action-specific context without schema changes (not searched via free text — use row expansion for detail inspection)
  • user_agent field captures client identification for forensic analysis (SOC2)
  • No foreign key to users table — username is denormalized so audit records survive user deletion
  • Retention: unbounded in Phase 1. Phase 2+ should add a retention/archival strategy (e.g., TimescaleDB hypertable with retention policy, or periodic archive to cold storage). Typical SOC2 retention is 7 years.

8. Frontend Architecture

New Files

| File | Purpose |
|---|---|
| pages/admin/DatabaseAdminPage.tsx | Database monitoring and management |
| pages/admin/OpenSearchAdminPage.tsx | OpenSearch monitoring and management |
| pages/admin/AuditLogPage.tsx | Audit log viewer |
| api/queries/admin/database.ts | React Query hooks for database endpoints |
| api/queries/admin/opensearch.ts | React Query hooks for OpenSearch endpoints |
| api/queries/admin/thresholds.ts | React Query hooks for threshold endpoints |
| api/queries/admin/audit.ts | React Query hooks for audit log endpoint |
| components/admin/StatusBadge.tsx | Color-coded status indicator (green/yellow/red) |
| components/admin/RefreshableCard.tsx | Card with manual refresh button + optional auto-refresh |
| components/admin/ConfirmDeleteDialog.tsx | Confirmation dialog requiring name input for destructive actions |

Modified Files

| File | Change |
|---|---|
| components/layout/AppSidebar.tsx | Refactor admin section to collapsible sub-menu with multiple items |
| router.tsx | Add routes for /admin/database, /admin/opensearch, /admin/audit, redirect /admin |
| SpaForwardController.java | Existing /admin/{path:[^\\.]*} pattern already covers single-segment routes; no change needed unless deeper routes are added |

Auto-Refresh Strategy

  • React Query refetchInterval: 15000 on lightweight endpoints (pool, queries, pipeline, performance)
  • Heavy endpoints (tables, indices) use refetchInterval: false — manual refresh only
  • Refresh button calls queryClient.invalidateQueries for all queries on that page

9. Implementation Phases

Phase 1 (Current Scope)

  1. Admin sidebar restructuring
  2. Database page — all monitoring sections + kill query
  3. OpenSearch page — all monitoring sections + delete index
  4. Threshold configuration (both pages)
  5. Audit log — database-backed audit trail + admin viewer page
  6. Retrofit audit logging into existing admin controllers (OIDC, user management) and auth flow
  7. Backend endpoints with RBAC enforcement
  8. Flyway migrations V9 (thresholds) and V10 (audit_log)
  9. SearchIndexerStats interface and SearchIndexer stats instrumentation
  10. @EnableMethodSecurity + @PreAuthorize retrofit on existing admin controllers

Phase 2

  • Database maintenance actions (VACUUM ANALYZE, Reindex)
  • OpenSearch operations (Force Reindex All, Flush)
  • Bulk index operations (checkbox selection)
  • Audit log CSV/JSON export for auditors
  • Audit log retention/archival strategy (7-year retention target; see Section 7)
  • OPERATOR role with view-only permissions

Phase 3

  • TimescaleDB-aware metrics (hypertable chunks, continuous aggregate status, compression)
  • Historical trend charts for key metrics
  • Alerting/notification system