# Cameleer3 Server — Capabilities Reference
> Standalone reference for systems integrating with or managing Cameleer3 Server instances.
> Generated 2026-04-04. Source of truth: the codebase and OpenAPI spec at `/api/v1/api-docs`.
## What It Does
Cameleer3 Server is an observability platform for Apache Camel applications. It receives execution traces, metrics, logs, and route diagrams from instrumented Camel agents, stores them in ClickHouse, and serves a web UI for searching, visualizing, and controlling routes.
**Core capabilities:**
- Real-time execution tracing with processor-level detail
- Full-text search across executions, logs, and attributes
- Route topology diagrams with live execution overlays
- Application configuration push via SSE
- Route control (start/stop/suspend) and exchange replay
- Agent lifecycle management with auto-heal on server restart
- RBAC with local users, groups, roles, and OIDC federation
- Multi-tenant isolation (one tenant per server instance)
---
## Multi-Tenancy Model
Each server instance serves exactly one tenant. Instances for different tenants can share the same PostgreSQL and ClickHouse infrastructure; isolation is enforced at the data layer.
| Concern | Isolation |
|---------|-----------|
| PostgreSQL | Schema-per-tenant (`?currentSchema=tenant_{id}`) |
| ClickHouse | Shared DB, `tenant_id` column on all tables, partitioned by `(tenant_id, toYYYYMM(timestamp))` |
| Configuration | `CAMELEER_TENANT_ID` env var (default: `"default"`) |
| Agents | Each agent belongs to one tenant, one environment |
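
The PostgreSQL row of the table can be sketched as a small helper. This is a minimal illustration, assuming the schema name is `tenant_` followed by the tenant ID and that an explicit override (the `CAMELEER_DB_SCHEMA` variable listed under Deployment) takes precedence. The function names `resolve_schema` and `jdbc_url` are hypothetical and not part of the server.

```python
import os
from typing import Optional


def resolve_schema(tenant_id: str = "default", override: Optional[str] = None) -> str:
    """Return the PostgreSQL schema for a tenant (hypothetical helper)."""
    # An explicit override wins; otherwise derive tenant_<id>.
    return override if override else f"tenant_{tenant_id}"


def jdbc_url(base_url: str, schema: str) -> str:
    """Append the currentSchema parameter to a JDBC URL."""
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}currentSchema={schema}"


if __name__ == "__main__":
    tenant = os.environ.get("CAMELEER_TENANT_ID", "default")
    schema = resolve_schema(tenant, os.environ.get("CAMELEER_DB_SCHEMA"))
    print(jdbc_url("jdbc:postgresql://localhost:5432/cameleer3", schema))
```
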
**Environments** (dev/staging/prod) are first-class within a tenant. Agents send `environmentId` at registration and in every heartbeat, and the UI filters by environment. JWTs carry an `env` claim so that an agent's environment survives a server restart.

---
## Agent Protocol
### Lifecycle
```
Register (bootstrap token) → Receive JWT + SSE URL
Connect SSE ← Receive commands (config-update, deep-trace, replay, route-control)
Heartbeat (every 30s) → Send capabilities, environmentId, routeStates
Deregister (graceful shutdown)
```
### State Machine
```
LIVE ──(no heartbeat for 90s)──→ STALE ──(300s more)──→ DEAD
 ↑                                 │
 └────(heartbeat arrives)──────────┘
```
Thresholds are configurable via `agent-registry.*` properties.
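
The state machine can be modelled as a pure function of the time since the last heartbeat. This is a minimal sketch using the default thresholds (90s, then a further 300s); the name `classify` and the choice to treat the exact threshold as already crossed are assumptions, not the server's implementation.

```python
from enum import Enum


class AgentState(Enum):
    LIVE = "LIVE"
    STALE = "STALE"
    DEAD = "DEAD"


def classify(ms_since_heartbeat: int,
             stale_threshold_ms: int = 90_000,
             dead_threshold_ms: int = 300_000) -> AgentState:
    """Map the silence since the last heartbeat to an agent state."""
    if ms_since_heartbeat < stale_threshold_ms:
        return AgentState.LIVE
    # The dead threshold counts from the moment the agent became STALE,
    # so DEAD is reached after stale + dead milliseconds of silence.
    if ms_since_heartbeat < stale_threshold_ms + dead_threshold_ms:
        return AgentState.STALE
    return AgentState.DEAD
```

A new heartbeat resets the elapsed time to zero, which is how a STALE agent returns to LIVE in this model.
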
### Registration
**`POST /api/v1/agents/register`** — requires bootstrap token in `Authorization: Bearer` header.
Request:
```json
{
"instanceId": "agent-abc-123",
"displayName": "Order Service #1",
"applicationId": "order-service",
"environmentId": "production",
"version": "3.2.1",
"routeIds": ["processOrder", "handlePayment"],
"capabilities": { "replay": true, "routeControl": true }
}
```
Response:
```json
{
"instanceId": "agent-abc-123",
"eventStreamUrl": "/api/v1/agents/agent-abc-123/events",
"heartbeatIntervalMs": 30000,
"signingPublicKeyBase64": "<ed25519-public-key>",
"accessToken": "<jwt>",
"refreshToken": "<jwt>"
}
```
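
A registration call could be assembled with the Python standard library roughly as follows. This is a sketch under assumptions: the base URL, the `Content-Type` header, and the helper name `build_register_request` are illustrative and not taken from the server documentation.

```python
import json
import urllib.request


def build_register_request(base_url: str, bootstrap_token: str,
                           agent: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for agent registration."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/agents/register",
        data=json.dumps(agent).encode("utf-8"),
        headers={
            # The bootstrap token goes into the Bearer header.
            "Authorization": f"Bearer {bootstrap_token}",
            # Assumed: the server expects a JSON request body.
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the request to `urllib.request.urlopen` would perform the call; the `accessToken` from the JSON response is then presented on the JWT-protected agent endpoints.
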
### Heartbeat
**`POST /api/v1/agents/{id}/heartbeat`** — JWT auth.
```json
{
"capabilities": { "replay": true, "routeControl": true },
"environmentId": "production",
"routeStates": { "processOrder": "Started", "handlePayment": "Suspended" }
}
```
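The server auto-heals after a restart: if a heartbeat arrives from an agent that is not in the registry, the agent is re-registered from its JWT claims and the heartbeat body. The environment is resolved in priority order: heartbeat `environmentId`, then the JWT `env` claim, then `"default"`.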
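
The environment priority can be written as a short fallback chain. This is a minimal sketch; the function name is hypothetical, and treating an empty string the same as a missing value is an assumption.

```python
from typing import Optional


def resolve_environment(heartbeat_env: Optional[str], jwt_env: Optional[str]) -> str:
    """Pick the environment: heartbeat body first, then JWT claim, then the default."""
    for candidate in (heartbeat_env, jwt_env):
        if candidate:  # assumption: None and "" both count as "not provided"
            return candidate
    return "default"
```
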
### SSE Event Stream
**`GET /api/v1/agents/{id}/events`** — long-lived SSE connection. Keepalive ping every 15s.
Event types pushed to agents: `config-update`, `deep-trace`, `replay`, `set-traced-processors`, `test-expression`, `route-control`.
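
On the agent side, the stream can be consumed with a small parser for the standard SSE framing (`event:` and `data:` fields, with a blank line ending each event). This is a sketch only: how the server encodes its keepalive ping and the shape of each event payload are not specified here, so the parser simply skips comment lines and returns the payload as raw text.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event_type = "message"  # SSE default when no event: field is sent
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the event; events without data are dropped.
            if data_lines:
                yield {"event": event_type, "data": "\n".join(data_lines)}
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, commonly used as a keepalive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)
```

A caller would dispatch on the `event` value, for example routing `config-update` and `route-control` to separate handlers.
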
### Token Refresh
**`POST /api/v1/agents/{id}/refresh`** — public endpoint, validates refresh token.
```json
{ "refreshToken": "<refresh-jwt>" }
```
Returns new `accessToken` + `refreshToken`. Preserves roles, application, and environment from the original token.

---
## Data Ingestion
All ingestion endpoints require a JWT with the `AGENT` role.
| Endpoint | Data | Notes |
|----------|------|-------|
| `POST /api/v1/data/executions` | Execution chunks (route + processor traces) | Buffered, flushed periodically |
| `POST /api/v1/data/diagrams` | Route graph definitions | Single or array |
| `POST /api/v1/data/events` | Agent lifecycle events | Triggers registry state transitions |
| `POST /api/v1/data/logs` | Application log batches | Buffered, 503 if buffer full |
| `POST /api/v1/data/metrics` | Metrics snapshots | Buffered, 503 if buffer full |
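
Because the log and metrics endpoints answer 503 when the buffer is full, a client may want to retry with a growing delay. The sketch below shows one possible policy (exponential backoff); the retry count, delays, and the `send` callable are assumptions on the client side, not behavior defined by the server.

```python
import time
from typing import Callable


def send_with_backoff(send: Callable[[], int],
                      max_attempts: int = 5,
                      base_delay_s: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns something other than 503 or attempts run out.

    send() performs one HTTP POST and returns the status code.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if status != 503:
            break
        # Delay doubles on every retry: base, 2*base, 4*base, ...
        sleep(base_delay_s * 2 ** (attempt - 1))
        status = send()
    return status
```

The `sleep` parameter is injectable so the policy can be exercised in tests without waiting.
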
---
## Command System
Commands are delivered to agents via SSE. There are three dispatch modes:
| Mode | Endpoint | Behavior |
|------|----------|----------|
| Single agent | `POST /api/v1/agents/{id}/commands` | Async (202), DELIVERED or PENDING |
| Group (application) | `POST /api/v1/agents/groups/{group}/commands` | Sync wait (10s), returns per-agent results |
| Broadcast (all LIVE) | `POST /api/v1/agents/commands` | Fire-and-forget (202) |
**Command types:** `config-update`, `deep-trace`, `replay`, `set-traced-processors`, `test-expression`, `route-control`
**Replay** has a dedicated sync endpoint: `POST /api/v1/agents/{id}/replay` (30s timeout, returns result or 504).
**Acknowledgment:** `POST /api/v1/agents/{id}/commands/{commandId}/ack` — agent confirms receipt with status/message/data.
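
The three dispatch modes differ only in the path, which a client could select as sketched below. The helper names are hypothetical, and the command request body is deliberately left out because its schema is not described in this section.

```python
from typing import Optional
from urllib.parse import quote


def command_endpoint(agent_id: Optional[str] = None, group: Optional[str] = None) -> str:
    """Return the dispatch path: single agent, group, or broadcast."""
    if agent_id and group:
        raise ValueError("target either one agent or one group, not both")
    if agent_id:
        return f"/api/v1/agents/{quote(agent_id, safe='')}/commands"
    if group:
        return f"/api/v1/agents/groups/{quote(group, safe='')}/commands"
    return "/api/v1/agents/commands"  # broadcast to all LIVE agents


def ack_endpoint(agent_id: str, command_id: str) -> str:
    """Return the path an agent uses to acknowledge a command."""
    return (f"/api/v1/agents/{quote(agent_id, safe='')}"
            f"/commands/{quote(command_id, safe='')}/ack")
```
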

---
## Query & Analytics API
All query endpoints require a JWT with the `VIEWER` role or higher.
### Execution Search
| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/search/executions` | Search by status, time, text, route, app, environment |
| `POST /api/v1/search/executions` | Advanced search with full filter object |
| `GET /api/v1/executions/{id}` | Execution detail with processor tree |
| `GET /api/v1/executions/{id}/processors/by-id/{pid}/snapshot` | Exchange data at processor |
### Statistics & Analytics
| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/search/stats` | Aggregated stats (P99, error rate, SLA compliance) |
| `GET /api/v1/search/stats/timeseries` | Bucketed time-series |
| `GET /api/v1/search/stats/timeseries/by-app` | Time series grouped by application |
| `GET /api/v1/search/stats/timeseries/by-route` | Time series grouped by route |
| `GET /api/v1/search/stats/punchcard` | Transaction heatmap (weekday × hour) |
| `GET /api/v1/search/errors/top` | Top N errors with velocity trends |
| `GET /api/v1/search/attributes/keys` | Distinct attribute key names |
### Route Catalog & Metrics
| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/routes/catalog` | Applications with routes, agents, health |
| `GET /api/v1/routes/metrics` | Per-route performance (TPS, P99, error rate) |
| `GET /api/v1/routes/metrics/processors` | Per-processor metrics for a route |
### Logs
| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/logs` | Cursor-based log search with level aggregation |
### Diagrams
| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/diagrams` | Find diagram by application + routeId |
| `GET /api/v1/diagrams/{hash}/render` | SVG or JSON layout |
### Agent Monitoring
| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/agents` | List agents (filter by status, app, environment) |
| `GET /api/v1/agents/events-log` | Agent lifecycle event history |
| `GET /api/v1/agents/{id}/metrics` | Agent-level metrics time series |
---
## Application Configuration
| Endpoint | Role | Description |
|----------|------|-------------|
| `GET /api/v1/config` | VIEWER | List all app configs |
| `GET /api/v1/config/{app}` | VIEWER | Get config (returns defaults if none stored) |
| `PUT /api/v1/config/{app}` | OPERATOR | Save config + push to all LIVE agents |
| `GET /api/v1/config/{app}/processor-routes` | VIEWER | Processor-to-route mapping |
| `POST /api/v1/config/{app}/test-expression` | VIEWER | Test Camel expression via live agent |
Config fields: `metricsEnabled`, `samplingRate`, `tracedProcessors`, `logLevels`, `engineLevel`, `payloadCaptureMode`, `version`.

---
## Security
### Authentication
| Method | Endpoint | Purpose |
|--------|----------|---------|
| Bootstrap token | `POST /agents/register` | One-time agent registration |
| Local credentials | `POST /auth/login` | UI login (username/password) |
| OIDC code exchange | `POST /auth/oidc/callback` | External identity provider |
| Token refresh | `POST /auth/refresh` | UI token refresh |
| Token refresh | `POST /agents/{id}/refresh` | Agent token refresh |
### JWT Structure
- Algorithm: HMAC-SHA256
- Access token: 1 hour (configurable)
- Refresh token: 7 days (configurable)
- Claims: `sub` (agent ID or `user:<username>`), `group` (application), `env` (environment), `roles` (array), `type` (access/refresh)
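
To make the token structure concrete, the following sketch signs and verifies an HMAC-SHA256 (HS256) JWT carrying the claims listed above, using only the standard library. It illustrates the format, not the server's code: the helper names are hypothetical, and expiry and header checks are omitted for brevity.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that JWTs strip from base64url segments.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Return header.payload.signature for the given claims."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_hs256(token: str, secret: bytes) -> dict:
    """Check the signature and return the claims; raise ValueError on mismatch."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("invalid signature")
    return json.loads(_b64url_decode(signing_input.split(".")[1]))
```

A production verifier would also reject expired tokens and confirm that `type` matches the expected use (access or refresh).
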
### RBAC Roles
| Role | Permissions |
|------|-------------|
| `AGENT` | Data ingestion, heartbeat, SSE, command ack |
| `VIEWER` | Read-only: executions, search, diagrams, metrics, logs, config |
| `OPERATOR` | VIEWER + send commands, modify config, replay |
| `ADMIN` | OPERATOR + user/group/role management, OIDC config, database admin |
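
The cumulative roles in the table can be expressed as a lookup of implied roles. This is a minimal sketch; keeping `AGENT` outside the VIEWER/OPERATOR/ADMIN chain is an assumption read from the table, and the names are hypothetical.

```python
from typing import Dict, Iterable, Set

# Each role maps to the set of roles whose permissions it includes.
IMPLIED_ROLES: Dict[str, Set[str]] = {
    "AGENT": {"AGENT"},
    "VIEWER": {"VIEWER"},
    "OPERATOR": {"OPERATOR", "VIEWER"},
    "ADMIN": {"ADMIN", "OPERATOR", "VIEWER"},
}


def is_allowed(granted_roles: Iterable[str], required_role: str) -> bool:
    """Return True if any granted role includes the required role."""
    return any(required_role in IMPLIED_ROLES.get(role, set())
               for role in granted_roles)
```
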
### Ed25519 Config Signing
The server derives an Ed25519 keypair deterministically from the JWT secret. The public key is shared with agents at registration (`signingPublicKeyBase64`). Config-update payloads are signed so agents can verify their authenticity.
### OIDC Integration
Configured via admin API (`/api/v1/admin/oidc`). Supports any OpenID Connect provider. Features: role claim extraction (supports nested paths like `realm_access.roles`), auto-signup, configurable display name claim, constant-time token rotation via dual bootstrap tokens.
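
Role extraction from a nested claim path such as `realm_access.roles` amounts to walking a dotted path through the token claims. The sketch below is an illustration with a hypothetical function name; accepting either a single string or a list at the end of the path is an assumption.

```python
from typing import Any, List


def extract_roles(claims: dict, path: str) -> List[str]:
    """Follow a dotted path through nested claims and return the roles found."""
    node: Any = claims
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return []  # path does not exist in this token
        node = node[key]
    if isinstance(node, str):
        return [node]
    if isinstance(node, list):
        return [str(item) for item in node]
    return []
```
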

---
## Admin API
All admin endpoints require the `ADMIN` role. Prefix: `/api/v1/admin/`.
### User Management
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/users` | GET | List all users |
| `/users` | POST | Create local user |
| `/users/{id}` | GET/PUT/DELETE | Get/update/delete user |
| `/users/{id}/password` | POST | Reset password |
| `/users/{id}/roles/{roleId}` | POST/DELETE | Assign/remove role |
| `/users/{id}/groups/{groupId}` | POST/DELETE | Add/remove from group |
### Group & Role Management
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/groups` | GET/POST | List/create groups |
| `/groups/{id}` | GET/PUT/DELETE | Manage group (cycle detection on parent change) |
| `/groups/{id}/roles/{roleId}` | POST/DELETE | Assign/remove role from group |
| `/roles` | GET/POST | List/create roles |
| `/roles/{id}` | GET/PUT/DELETE | Manage role (system roles protected) |
| `/rbac/stats` | GET | RBAC statistics |
### Infrastructure
| Endpoint | Description |
|----------|-------------|
| `/database/status` | PostgreSQL version, schema, health |
| `/database/pool` | HikariCP connection pool stats |
| `/database/tables` | Table sizes and row counts |
| `/database/queries` | Active queries (with kill) |
| `/clickhouse/status` | ClickHouse version, uptime |
| `/clickhouse/tables` | Table info, row counts, sizes |
| `/clickhouse/performance` | Disk, memory, compression, partitions |
| `/clickhouse/queries` | Active ClickHouse queries |
| `/clickhouse/pipeline` | Ingestion pipeline stats |
### Settings & Configuration
| Endpoint | Description |
|----------|-------------|
| `/app-settings` | Per-application settings (CRUD) |
| `/thresholds` | Monitoring threshold configuration |
| `/oidc` | OIDC provider configuration (CRUD + test) |
| `/audit` | Paginated audit log search |
| `/usage` | UI usage analytics (ClickHouse) |
---
## Storage
### PostgreSQL
Used for RBAC, configuration, and audit. Schema-per-tenant isolation via `?currentSchema=tenant_{id}`.
Tables: `users`, `groups`, `roles`, `user_roles`, `user_groups`, `group_roles`, `server_config`, `application_config`, `audit_log`.
Flyway migrations (V1-V11) manage schema evolution. With Flyway's `create-schemas: true`, the tenant schema is created automatically on first startup.
### ClickHouse
Used for all observability data. Schema managed by `ClickHouseSchemaInitializer` (idempotent on startup).
| Table | Engine | Purpose | TTL |
|-------|--------|---------|-----|
| `executions` | ReplacingMergeTree | Route execution records | 365d |
| `processor_executions` | MergeTree | Per-processor trace data | 365d |
| `agent_events` | MergeTree | Agent lifecycle audit trail | 365d |
| `route_diagrams` | ReplacingMergeTree | Route graph definitions | - |
| `logs` | MergeTree | Application logs | 365d |
| `usage_events` | MergeTree | UI action tracking | 90d |
| `stats_1m_all` | AggregatingMergeTree | Global 1-minute rollups | - |
| `stats_1m_app` | AggregatingMergeTree | Per-application rollups | - |
| `stats_1m_route` | AggregatingMergeTree | Per-route rollups | - |
| `stats_1m_processor` | AggregatingMergeTree | Per-processor-type rollups | - |
| `stats_1m_processor_detail` | AggregatingMergeTree | Per-processor-instance rollups | - |
All tables include `tenant_id` and `environment` columns. Partitioned by `(tenant_id, toYYYYMM(timestamp))`.
Stats tables are fed by Materialized Views from base tables. Query with `-Merge()` combinators (e.g., `countMerge(total_count)`).
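
As an illustration of the `-Merge()` pattern, the helper below builds a query that finalizes the `total_count` aggregate state for one tenant. It is a sketch under assumptions: only the table names, `total_count`, and `tenant_id` come from this document, while the `String` parameter type and the helper name are hypothetical.

```python
STATS_TABLES = {
    "stats_1m_all",
    "stats_1m_app",
    "stats_1m_route",
    "stats_1m_processor",
    "stats_1m_processor_detail",
}


def total_count_query(table: str) -> str:
    """Return SQL that merges the total_count aggregate state for one tenant."""
    if table not in STATS_TABLES:
        raise ValueError(f"unknown stats table: {table}")
    return (
        f"SELECT countMerge(total_count) AS total "
        f"FROM {table} "
        # ClickHouse query-parameter placeholder, bound by the client at run time.
        "WHERE tenant_id = {tenant_id:String}"
    )
```

Filtering on `tenant_id` keeps the query scoped to one tenant's data.
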

---
## Deployment
### Container Image
Multi-stage Docker build: Maven 3.9 + JDK 17 (build) → JRE 17 (runtime). Port 8081.
Registry: `gitea.siegeln.net/cameleer/cameleer3-server`
### Infrastructure Requirements
| Component | Version | Purpose |
|-----------|---------|---------|
| PostgreSQL | 16+ | RBAC, config, audit |
| ClickHouse | 24.12+ | All observability data |
### Required Environment Variables
| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `CAMELEER_AUTH_TOKEN` | Yes | - | Bootstrap token for agent registration |
| `CAMELEER_JWT_SECRET` | Recommended | Random (ephemeral) | JWT signing secret |
| `CAMELEER_TENANT_ID` | No | `default` | Tenant identifier |
| `CAMELEER_UI_USER` | No | `admin` | Default admin username |
| `CAMELEER_UI_PASSWORD` | No | `admin` | Default admin password |
| `CAMELEER_UI_ORIGIN` | No | `http://localhost:5173` | CORS allowed origin |
| `CLICKHOUSE_URL` | No | `jdbc:clickhouse://localhost:8123/cameleer` | ClickHouse JDBC URL |
| `CLICKHOUSE_USERNAME` | No | `default` | ClickHouse user |
| `CLICKHOUSE_PASSWORD` | No | (empty) | ClickHouse password |
| `SPRING_DATASOURCE_URL` | No | `jdbc:postgresql://localhost:5432/cameleer3` | PostgreSQL JDBC URL |
| `SPRING_DATASOURCE_USERNAME` | No | `cameleer` | PostgreSQL user |
| `SPRING_DATASOURCE_PASSWORD` | No | `cameleer_dev` | PostgreSQL password |
| `CAMELEER_DB_SCHEMA` | No | `tenant_{CAMELEER_TENANT_ID}` | PostgreSQL schema (override for feature branches) |
### Health Probes
- **Endpoint:** `GET /api/v1/health` (public, no auth)
- **Liveness:** 30s initial delay, 10s period
- **Readiness:** 10s initial delay, 5s period
### Ingestion Tuning
| Variable | Default | Purpose |
|----------|---------|---------|
| `INGESTION_BUFFER_CAPACITY` | 50000 | Ring buffer size |
| `INGESTION_BATCH_SIZE` | 5000 | Flush batch size |
| `INGESTION_FLUSH_INTERVAL_MS` | 5000 | Periodic flush interval |
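
How capacity and batch size interact can be shown with a toy bounded buffer. This is not the server's implementation: the class and method names are hypothetical, and whether a periodic flush drains one batch or several is an assumption left open here.

```python
from collections import deque
from typing import Any, Deque, List


class IngestionBuffer:
    """Toy bounded buffer: rejects when full, drains in fixed-size batches."""

    def __init__(self, capacity: int = 50_000, batch_size: int = 5_000) -> None:
        self.capacity = capacity
        self.batch_size = batch_size
        self._items: Deque[Any] = deque()

    def offer(self, item: Any) -> bool:
        """Queue an item; False means full (an endpoint would answer 503)."""
        if len(self._items) >= self.capacity:
            return False
        self._items.append(item)
        return True

    def drain_batch(self) -> List[Any]:
        """Remove and return up to batch_size items, oldest first."""
        count = min(self.batch_size, len(self._items))
        return [self._items.popleft() for _ in range(count)]
```

A flush timer firing every `INGESTION_FLUSH_INTERVAL_MS` would call `drain_batch` and write the result to storage.
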
### Agent Registry Tuning
| Variable | Default | Purpose |
|----------|---------|---------|
| `AGENT_REGISTRY_STALE_THRESHOLD_MS` | 90000 | Heartbeat miss → STALE |
| `AGENT_REGISTRY_DEAD_THRESHOLD_MS` | 300000 | STALE duration → DEAD |
| `AGENT_REGISTRY_PING_INTERVAL_MS` | 15000 | SSE keepalive interval |
| `AGENT_REGISTRY_COMMAND_EXPIRY_MS` | 60000 | Pending command TTL |
---
## Public Endpoints (No Auth)
These endpoints do not require a JWT:
- `GET /api/v1/health`
- `POST /api/v1/agents/register` (requires bootstrap token)
- `POST /api/v1/agents/*/refresh`
- `POST /api/v1/auth/login`
- `POST /api/v1/auth/refresh`
- `GET /api/v1/auth/oidc/config`
- `POST /api/v1/auth/oidc/callback`
- `GET /api/v1/api-docs/**` (OpenAPI spec)
- `GET /swagger-ui.html` (Swagger UI)
- Static resources: `/`, `/index.html`, `/config.js`, `/favicon.svg`, `/assets/**`
All other endpoints require a valid JWT with an appropriate role.