diff --git a/docs/SERVER-CAPABILITIES.md b/docs/SERVER-CAPABILITIES.md
new file mode 100644
index 00000000..1d31a4c3
--- /dev/null
+++ b/docs/SERVER-CAPABILITIES.md
@@ -0,0 +1,421 @@
+# Cameleer3 Server — Capabilities Reference
+
+> Standalone reference for systems integrating with or managing Cameleer3 Server instances.
+> Generated 2026-04-04. Source of truth: the codebase and OpenAPI spec at `/api/v1/api-docs`.
+
+## What It Does
+
+Cameleer3 Server is an observability platform for Apache Camel applications. It receives execution traces, metrics, logs, and route diagrams from instrumented Camel agents, stores them in ClickHouse, and serves a web UI for searching, visualizing, and controlling routes.
+
+**Core capabilities:**
+- Real-time execution tracing with processor-level detail
+- Full-text search across executions, logs, and attributes
+- Route topology diagrams with live execution overlays
+- Application configuration push via SSE
+- Route control (start/stop/suspend) and exchange replay
+- Agent lifecycle management with auto-heal on server restart
+- RBAC with local users, groups, roles, and OIDC federation
+- Multi-tenant isolation (one tenant per server instance)
+
+---
+
+## Multi-Tenancy Model
+
+Each server instance serves exactly one tenant. Multiple tenants share infrastructure but are isolated at the data layer.
+
+| Concern | Isolation |
+|---------|-----------|
+| PostgreSQL | Schema-per-tenant (`?currentSchema=tenant_{id}`) |
+| ClickHouse | Shared DB, `tenant_id` column on all tables, partitioned by `(tenant_id, toYYYYMM(timestamp))` |
+| Configuration | `CAMELEER_TENANT_ID` env var (default: `"default"`) |
+| Agents | Each agent belongs to one tenant, one environment |
+
+**Environments** (dev/staging/prod) are first-class within a tenant. Agents send `environmentId` at registration and in every heartbeat. The UI filters by environment. JWT tokens carry an `env` claim for persistence across restarts.
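
The isolation rules in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `postgres_url` and `clickhouse_partition` are hypothetical and not part of the server, the base JDBC URL is assumed to carry no query string yet, and the partition helper assumes timestamps are evaluated in UTC.

```python
from datetime import datetime, timezone


def postgres_url(base_url: str, tenant_id: str) -> str:
    """Append the schema-per-tenant parameter (?currentSchema=tenant_{id}).

    Hypothetical helper: assumes base_url has no query string yet.
    """
    return f"{base_url}?currentSchema=tenant_{tenant_id}"


def clickhouse_partition(tenant_id: str, ts: datetime) -> tuple[str, int]:
    """Mirror the partition key (tenant_id, toYYYYMM(timestamp)).

    toYYYYMM yields the number year * 100 + month; UTC is an assumption here.
    """
    utc = ts.astimezone(timezone.utc)
    return (tenant_id, utc.year * 100 + utc.month)


if __name__ == "__main__":
    print(postgres_url("jdbc:postgresql://localhost:5432/cameleer3", "acme"))
    print(clickhouse_partition("acme", datetime(2026, 4, 4, tzinfo=timezone.utc)))
```

Two tenants writing in the same month land in different partitions because the tenant ID is the first element of the key.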
+
+---
+
+## Agent Protocol
+
+### Lifecycle
+
+```
+Register (bootstrap token) → Receive JWT + SSE URL
+    ↓
+Connect SSE ← Receive commands (config-update, deep-trace, replay, route-control)
+    ↓
+Heartbeat (every 30s) → Send capabilities, environmentId, routeStates
+    ↓
+Deregister (graceful shutdown)
+```
+
+### State Machine
+
+```
+LIVE ──(no heartbeat for 90s)──→ STALE ──(300s more)──→ DEAD
+  ↑                                 │
+  └────(heartbeat arrives)──────────┘
+```
+
+Thresholds are configurable via `agent-registry.*` properties.
+
+### Registration
+
+**`POST /api/v1/agents/register`** — requires bootstrap token in `Authorization: Bearer` header.
+
+Request:
+```json
+{
+  "instanceId": "agent-abc-123",
+  "displayName": "Order Service #1",
+  "applicationId": "order-service",
+  "environmentId": "production",
+  "version": "3.2.1",
+  "routeIds": ["processOrder", "handlePayment"],
+  "capabilities": { "replay": true, "routeControl": true }
+}
+```
+
+Response:
+```json
+{
+  "instanceId": "agent-abc-123",
+  "eventStreamUrl": "/api/v1/agents/agent-abc-123/events",
+  "heartbeatIntervalMs": 30000,
+  "signingPublicKeyBase64": "",
+  "accessToken": "",
+  "refreshToken": ""
+}
+```
+
+### Heartbeat
+
+**`POST /api/v1/agents/{id}/heartbeat`** — JWT auth.
+
+```json
+{
+  "capabilities": { "replay": true, "routeControl": true },
+  "environmentId": "production",
+  "routeStates": { "processOrder": "Started", "handlePayment": "Suspended" }
+}
+```
+
+Auto-heals after server restart: if agent not in registry, re-registers from JWT claims + heartbeat body. Environment priority: heartbeat `environmentId` > JWT `env` claim > `"default"`.
+
+### SSE Event Stream
+
+**`GET /api/v1/agents/{id}/events`** — long-lived SSE connection. Keepalive ping every 15s.
+
+Event types pushed to agents: `config-update`, `deep-trace`, `replay`, `set-traced-processors`, `test-expression`, `route-control`.
+
+### Token Refresh
+
+**`POST /api/v1/agents/{id}/refresh`** — public endpoint, validates refresh token.
+
+```json
+{ "refreshToken": "" }
+```
+
+Returns new `accessToken` + `refreshToken`. Preserves roles, application, and environment from the original token.
+
+---
+
+## Data Ingestion
+
+All ingestion endpoints require JWT with `AGENT` role.
+
+| Endpoint | Data | Notes |
+|----------|------|-------|
+| `POST /api/v1/data/executions` | Execution chunks (route + processor traces) | Buffered, flushed periodically |
+| `POST /api/v1/data/diagrams` | Route graph definitions | Single or array |
+| `POST /api/v1/data/events` | Agent lifecycle events | Triggers registry state transitions |
+| `POST /api/v1/data/logs` | Application log batches | Buffered, 503 if buffer full |
+| `POST /api/v1/data/metrics` | Metrics snapshots | Buffered, 503 if buffer full |
+
+---
+
+## Command System
+
+Commands are delivered to agents via SSE. Three dispatch modes:
+
+| Mode | Endpoint | Behavior |
+|------|----------|----------|
+| Single agent | `POST /api/v1/agents/{id}/commands` | Async (202), DELIVERED or PENDING |
+| Group (application) | `POST /api/v1/agents/groups/{group}/commands` | Sync wait (10s), returns per-agent results |
+| Broadcast (all LIVE) | `POST /api/v1/agents/commands` | Fire-and-forget (202) |
+
+**Command types:** `config-update`, `deep-trace`, `replay`, `set-traced-processors`, `test-expression`, `route-control`
+
+**Replay** has a dedicated sync endpoint: `POST /api/v1/agents/{id}/replay` (30s timeout, returns result or 504).
+
+**Acknowledgment:** `POST /api/v1/agents/{id}/commands/{commandId}/ack` — agent confirms receipt with status/message/data.
+
+---
+
+## Query & Analytics API
+
+All query endpoints require JWT with `VIEWER` role or higher.
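
As a rough illustration of what "`VIEWER` role or higher" could mean, the sketch below ranks the three user roles in the order implied by the RBAC Roles table in the Security section (`OPERATOR` includes `VIEWER`, `ADMIN` includes `OPERATOR`). The `ROLE_RANK` map and `has_role_or_higher` function are hypothetical names, and treating `AGENT` as outside the hierarchy is an assumption rather than confirmed server behavior.

```python
# Hypothetical ranking derived from the RBAC Roles table:
# OPERATOR = VIEWER + more, ADMIN = OPERATOR + more.
ROLE_RANK = {"VIEWER": 1, "OPERATOR": 2, "ADMIN": 3}


def has_role_or_higher(token_roles: list[str], required: str) -> bool:
    """Return True if any role in the JWT `roles` claim meets the required rank.

    Roles missing from ROLE_RANK (for example AGENT) get rank 0, so they
    never satisfy a user-role requirement in this sketch.
    """
    needed = ROLE_RANK[required]
    return any(ROLE_RANK.get(role, 0) >= needed for role in token_roles)


if __name__ == "__main__":
    print(has_role_or_higher(["OPERATOR"], "VIEWER"))
    print(has_role_or_higher(["AGENT"], "VIEWER"))
```

A numeric rank keeps the check to one comparison; a real deployment with custom roles created through the admin API would need a different scheme.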
+
+### Execution Search
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/v1/search/executions` | Search by status, time, text, route, app, environment |
+| `POST /api/v1/search/executions` | Advanced search with full filter object |
+| `GET /api/v1/executions/{id}` | Execution detail with processor tree |
+| `GET /api/v1/executions/{id}/processors/by-id/{pid}/snapshot` | Exchange data at processor |
+
+### Statistics & Analytics
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/v1/search/stats` | Aggregated stats (P99, error rate, SLA compliance) |
+| `GET /api/v1/search/stats/timeseries` | Bucketed time-series |
+| `GET /api/v1/search/stats/timeseries/by-app` | Time series grouped by application |
+| `GET /api/v1/search/stats/timeseries/by-route` | Time series grouped by route |
+| `GET /api/v1/search/stats/punchcard` | Transaction heatmap (weekday x hour) |
+| `GET /api/v1/search/errors/top` | Top N errors with velocity trends |
+| `GET /api/v1/search/attributes/keys` | Distinct attribute key names |
+
+### Route Catalog & Metrics
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/v1/routes/catalog` | Applications with routes, agents, health |
+| `GET /api/v1/routes/metrics` | Per-route performance (TPS, P99, error rate) |
+| `GET /api/v1/routes/metrics/processors` | Per-processor metrics for a route |
+
+### Logs
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/v1/logs` | Cursor-based log search with level aggregation |
+
+### Diagrams
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/v1/diagrams` | Find diagram by application + routeId |
+| `GET /api/v1/diagrams/{hash}/render` | SVG or JSON layout |
+
+### Agent Monitoring
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/v1/agents` | List agents (filter by status, app, environment) |
+| `GET /api/v1/agents/events-log` | Agent lifecycle event history |
+| `GET /api/v1/agents/{id}/metrics` | Agent-level metrics time series |
+
+---
+
+## Application Configuration
+
+| Endpoint | Role | Description |
+|----------|------|-------------|
+| `GET /api/v1/config` | VIEWER | List all app configs |
+| `GET /api/v1/config/{app}` | VIEWER | Get config (returns defaults if none stored) |
+| `PUT /api/v1/config/{app}` | OPERATOR | Save config + push to all LIVE agents |
+| `GET /api/v1/config/{app}/processor-routes` | VIEWER | Processor-to-route mapping |
+| `POST /api/v1/config/{app}/test-expression` | VIEWER | Test Camel expression via live agent |
+
+Config fields: `metricsEnabled`, `samplingRate`, `tracedProcessors`, `logLevels`, `engineLevel`, `payloadCaptureMode`, `version`.
+
+---
+
+## Security
+
+### Authentication
+
+| Method | Endpoint | Purpose |
+|--------|----------|---------|
+| Bootstrap token | `POST /agents/register` | One-time agent registration |
+| Local credentials | `POST /auth/login` | UI login (username/password) |
+| OIDC code exchange | `POST /auth/oidc/callback` | External identity provider |
+| Token refresh | `POST /auth/refresh` | UI token refresh |
+| Token refresh | `POST /agents/{id}/refresh` | Agent token refresh |
+
+### JWT Structure
+
+- Algorithm: HMAC-SHA256
+- Access token: 1 hour (configurable)
+- Refresh token: 7 days (configurable)
+- Claims: `sub` (agent ID or `user:`), `group` (application), `env` (environment), `roles` (array), `type` (access/refresh)
+
+### RBAC Roles
+
+| Role | Permissions |
+|------|-------------|
+| `AGENT` | Data ingestion, heartbeat, SSE, command ack |
+| `VIEWER` | Read-only: executions, search, diagrams, metrics, logs, config |
+| `OPERATOR` | VIEWER + send commands, modify config, replay |
+| `ADMIN` | OPERATOR + user/group/role management, OIDC config, database admin |
+
+### Ed25519 Config Signing
+
+Server derives an Ed25519 keypair deterministically from the JWT secret. Public key is shared with agents at registration. Config-update payloads are signed so agents can verify authenticity.
+
+### OIDC Integration
+
+Configured via admin API (`/api/v1/admin/oidc`). Supports any OpenID Connect provider. Features: role claim extraction (supports nested paths like `realm_access.roles`), auto-signup, configurable display name claim, constant-time token rotation via dual bootstrap tokens.
+
+---
+
+## Admin API
+
+All admin endpoints require `ADMIN` role. Prefix: `/api/v1/admin/`.
+
+### User Management
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/users` | GET | List all users |
+| `/users` | POST | Create local user |
+| `/users/{id}` | GET/PUT/DELETE | Get/update/delete user |
+| `/users/{id}/password` | POST | Reset password |
+| `/users/{id}/roles/{roleId}` | POST/DELETE | Assign/remove role |
+| `/users/{id}/groups/{groupId}` | POST/DELETE | Add/remove from group |
+
+### Group & Role Management
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/groups` | GET/POST | List/create groups |
+| `/groups/{id}` | GET/PUT/DELETE | Manage group (cycle detection on parent change) |
+| `/groups/{id}/roles/{roleId}` | POST/DELETE | Assign/remove role from group |
+| `/roles` | GET/POST | List/create roles |
+| `/roles/{id}` | GET/PUT/DELETE | Manage role (system roles protected) |
+| `/rbac/stats` | GET | RBAC statistics |
+
+### Infrastructure
+
+| Endpoint | Description |
+|----------|-------------|
+| `/database/status` | PostgreSQL version, schema, health |
+| `/database/pool` | HikariCP connection pool stats |
+| `/database/tables` | Table sizes and row counts |
+| `/database/queries` | Active queries (with kill) |
+| `/clickhouse/status` | ClickHouse version, uptime |
+| `/clickhouse/tables` | Table info, row counts, sizes |
+| `/clickhouse/performance` | Disk, memory, compression, partitions |
+| `/clickhouse/queries` | Active ClickHouse queries |
+| `/clickhouse/pipeline` | Ingestion pipeline stats |
+
+### Settings & Configuration
+
+| Endpoint | Description |
+|----------|-------------|
+| `/app-settings` | Per-application settings (CRUD) |
+| `/thresholds` | Monitoring threshold configuration |
+| `/oidc` | OIDC provider configuration (CRUD + test) |
+| `/audit` | Paginated audit log search |
+| `/usage` | UI usage analytics (ClickHouse) |
+
+---
+
+## Storage
+
+### PostgreSQL
+
+Used for RBAC, configuration, and audit. Schema-per-tenant isolation via `?currentSchema=tenant_{id}`.
+
+Tables: `users`, `groups`, `roles`, `user_roles`, `user_groups`, `group_roles`, `server_config`, `application_config`, `audit_log`.
+
+Flyway migrations (V1-V11) manage schema evolution.
+
+### ClickHouse
+
+Used for all observability data. Schema managed by `ClickHouseSchemaInitializer` (idempotent on startup).
+
+| Table | Engine | Purpose | TTL |
+|-------|--------|---------|-----|
+| `executions` | ReplacingMergeTree | Route execution records | 365d |
+| `processor_executions` | MergeTree | Per-processor trace data | 365d |
+| `agent_events` | MergeTree | Agent lifecycle audit trail | 365d |
+| `route_diagrams` | ReplacingMergeTree | Route graph definitions | - |
+| `logs` | MergeTree | Application logs | 365d |
+| `usage_events` | MergeTree | UI action tracking | 90d |
+| `stats_1m_all` | AggregatingMergeTree | Global 1-minute rollups | - |
+| `stats_1m_app` | AggregatingMergeTree | Per-application rollups | - |
+| `stats_1m_route` | AggregatingMergeTree | Per-route rollups | - |
+| `stats_1m_processor` | AggregatingMergeTree | Per-processor-type rollups | - |
+| `stats_1m_processor_detail` | AggregatingMergeTree | Per-processor-instance rollups | - |
+
+All tables include `tenant_id` and `environment` columns. Partitioned by `(tenant_id, toYYYYMM(timestamp))`.
+
+Stats tables are fed by Materialized Views from base tables. Query with `-Merge()` combinators (e.g., `countMerge(total_count)`).
+
+---
+
+## Deployment
+
+### Container Image
+
+Multi-stage Docker build: Maven 3.9 + JDK 17 (build) → JRE 17 (runtime). Port 8081.
+
+Registry: `gitea.siegeln.net/cameleer/cameleer3-server`
+
+### Infrastructure Requirements
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| PostgreSQL | 16+ | RBAC, config, audit |
+| ClickHouse | 24.12+ | All observability data |
+
+### Required Environment Variables
+
+| Variable | Required | Default | Purpose |
+|----------|----------|---------|---------|
+| `CAMELEER_AUTH_TOKEN` | Yes | - | Bootstrap token for agent registration |
+| `CAMELEER_JWT_SECRET` | Recommended | Random (ephemeral) | JWT signing secret |
+| `CAMELEER_TENANT_ID` | No | `default` | Tenant identifier |
+| `CAMELEER_UI_USER` | No | `admin` | Default admin username |
+| `CAMELEER_UI_PASSWORD` | No | `admin` | Default admin password |
+| `CAMELEER_UI_ORIGIN` | No | `http://localhost:5173` | CORS allowed origin |
+| `CLICKHOUSE_URL` | No | `jdbc:clickhouse://localhost:8123/cameleer` | ClickHouse JDBC URL |
+| `CLICKHOUSE_USERNAME` | No | `default` | ClickHouse user |
+| `CLICKHOUSE_PASSWORD` | No | (empty) | ClickHouse password |
+| `SPRING_DATASOURCE_URL` | No | `jdbc:postgresql://localhost:5432/cameleer3` | PostgreSQL JDBC URL |
+| `SPRING_DATASOURCE_USERNAME` | No | `cameleer` | PostgreSQL user |
+| `SPRING_DATASOURCE_PASSWORD` | No | `cameleer_dev` | PostgreSQL password |
+| `CAMELEER_DB_SCHEMA` | No | `public` | PostgreSQL schema name |
+
+### Health Probes
+
+- **Endpoint:** `GET /api/v1/health` (public, no auth)
+- **Liveness:** 30s initial delay, 10s period
+- **Readiness:** 10s initial delay, 5s period
+
+### Ingestion Tuning
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `INGESTION_BUFFER_CAPACITY` | 50000 | Ring buffer size |
+| `INGESTION_BATCH_SIZE` | 5000 | Flush batch size |
+| `INGESTION_FLUSH_INTERVAL_MS` | 5000 | Periodic flush interval |
+
+### Agent Registry Tuning
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `AGENT_REGISTRY_STALE_THRESHOLD_MS` | 90000 | Heartbeat miss → STALE |
+| `AGENT_REGISTRY_DEAD_THRESHOLD_MS` | 300000 | STALE duration → DEAD |
+| `AGENT_REGISTRY_PING_INTERVAL_MS` | 15000 | SSE keepalive interval |
+| `AGENT_REGISTRY_COMMAND_EXPIRY_MS` | 60000 | Pending command TTL |
+
+---
+
+## Public Endpoints (No Auth)
+
+These endpoints do not require authentication:
+
+- `GET /api/v1/health`
+- `POST /api/v1/agents/register` (requires bootstrap token)
+- `POST /api/v1/agents/*/refresh`
+- `POST /api/v1/auth/login`
+- `POST /api/v1/auth/refresh`
+- `GET /api/v1/auth/oidc/config`
+- `POST /api/v1/auth/oidc/callback`
+- `GET /api/v1/api-docs/**` (OpenAPI spec)
+- `GET /swagger-ui.html` (Swagger UI)
+- Static resources: `/`, `/index.html`, `/config.js`, `/favicon.svg`, `/assets/**`
+
+All other endpoints require a valid JWT with appropriate role.
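
A client or gateway that needs to know whether a request can skip the JWT could mirror the public endpoint list with a small matcher. The sketch below is a hypothetical helper, not the server's own security configuration. It assumes ant-style globbing where `*` matches exactly one path segment and a trailing `/**` matches the prefix itself and anything beneath it, and it assumes static resources are fetched with `GET`, which the list does not state.

```python
import re

# (method, pattern) pairs copied from the public endpoint list above.
# GET for the static resources is an assumption.
PUBLIC_ENDPOINTS = [
    ("GET", "/api/v1/health"),
    ("POST", "/api/v1/agents/register"),
    ("POST", "/api/v1/agents/*/refresh"),
    ("POST", "/api/v1/auth/login"),
    ("POST", "/api/v1/auth/refresh"),
    ("GET", "/api/v1/auth/oidc/config"),
    ("POST", "/api/v1/auth/oidc/callback"),
    ("GET", "/api/v1/api-docs/**"),
    ("GET", "/swagger-ui.html"),
    ("GET", "/"),
    ("GET", "/index.html"),
    ("GET", "/config.js"),
    ("GET", "/favicon.svg"),
    ("GET", "/assets/**"),
]


def _segments_to_regex(pattern: str) -> str:
    """Escape each path segment literally; a lone `*` matches one segment."""
    return "/".join(
        "[^/]+" if seg == "*" else re.escape(seg) for seg in pattern.split("/")
    )


def _glob_to_regex(pattern: str) -> re.Pattern:
    """Compile a pattern; a trailing `/**` also matches any deeper path."""
    if pattern.endswith("/**"):
        body = _segments_to_regex(pattern[:-3]) + r"(/.*)?"
    else:
        body = _segments_to_regex(pattern)
    return re.compile("^" + body + "$")


_COMPILED = [(method, _glob_to_regex(p)) for method, p in PUBLIC_ENDPOINTS]


def is_public(method: str, path: str) -> bool:
    """Return True if the request matches an entry in the public list."""
    return any(m == method and rx.match(path) for m, rx in _COMPILED)


if __name__ == "__main__":
    print(is_public("POST", "/api/v1/agents/agent-abc-123/refresh"))
    print(is_public("GET", "/api/v1/agents"))
```

Note that `POST /api/v1/agents/register` is public only in the sense that no JWT is needed; the bootstrap token is still required.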