docs: add SERVER-CAPABILITIES.md for SaaS integration reference

Comprehensive standalone document covering API surface, agent protocol, security, storage, multi-tenancy, deployment, and configuration — designed for external systems (like the SaaS orchestration layer) that need to understand and manage Cameleer3 Server instances.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New file: `docs/SERVER-CAPABILITIES.md` (+421 lines)

# Cameleer3 Server — Capabilities Reference

> Standalone reference for systems integrating with or managing Cameleer3 Server instances.
>
> Generated 2026-04-04. Source of truth: the codebase and the OpenAPI spec at `/api/v1/api-docs`.

## What It Does

Cameleer3 Server is an observability platform for Apache Camel applications. It receives execution traces, metrics, logs, and route diagrams from instrumented Camel agents. It stores this observability data in ClickHouse (RBAC, configuration, and audit data live in PostgreSQL) and serves a web UI for searching, visualizing, and controlling routes.

**Core capabilities:**

- Real-time execution tracing with processor-level detail
- Full-text search across executions, logs, and attributes
- Route topology diagrams with live execution overlays
- Application configuration push via SSE
- Route control (start/stop/suspend) and exchange replay
- Agent lifecycle management with auto-heal after a server restart
- RBAC with local users, groups, roles, and OIDC federation
- Multi-tenant isolation (one tenant per server instance)

---

## Multi-Tenancy Model

Each server instance serves exactly one tenant. Multiple tenants can share the same PostgreSQL and ClickHouse infrastructure but are isolated at the data layer.

| Concern | Isolation mechanism |
|---------|---------------------|
| PostgreSQL | Schema per tenant (`?currentSchema=tenant_{id}` on the JDBC URL) |
| ClickHouse | Shared database; every table has a `tenant_id` column and is partitioned by `(tenant_id, toYYYYMM(timestamp))` |
| Configuration | `CAMELEER_TENANT_ID` env var (default: `"default"`) |
| Agents | Each agent belongs to exactly one tenant and one environment |
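
The sketch below is an illustration only. It shows how a single tenant ID maps onto both isolation mechanisms in the table: the PostgreSQL schema appended to a JDBC URL, and the `(tenant_id, toYYYYMM(timestamp))` partition a ClickHouse row lands in. The helper names and the identifier check are hypothetical. The sketch also assumes timestamps are evaluated in UTC, whereas ClickHouse's `toYYYYMM` uses the column's or server's time zone.

```python
from datetime import datetime, timezone
import re

# Hypothetical guard: restrict tenant IDs to identifier characters so the
# derived schema name never needs quoting. Not taken from the server code.
_TENANT_ID = re.compile(r"^[A-Za-z0-9_]+$")


def tenant_schema(tenant_id: str) -> str:
    """PostgreSQL schema name for a tenant, following the tenant_{id} pattern."""
    if not _TENANT_ID.match(tenant_id):
        raise ValueError(f"unsupported tenant id: {tenant_id!r}")
    return f"tenant_{tenant_id}"


def tenant_jdbc_url(base_url: str, tenant_id: str) -> str:
    """Append currentSchema=tenant_{id} to a PostgreSQL JDBC URL."""
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}currentSchema={tenant_schema(tenant_id)}"


def clickhouse_partition(tenant_id: str, ts: datetime) -> tuple:
    """Partition key (tenant_id, toYYYYMM(ts)), evaluated in UTC."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
    ts = ts.astimezone(timezone.utc)
    return (tenant_id, ts.year * 100 + ts.month)
```

Two tenants writing in the same month still land in different partitions, because `tenant_id` leads the key.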

**Environments** (dev/staging/prod) are first-class within a tenant. Agents send `environmentId` at registration and in every heartbeat, and the UI filters by environment. Agent JWTs also carry an `env` claim, so an agent's environment can be recovered after a server restart.

---

## Agent Protocol

### Lifecycle

```
Register (bootstrap token) → Receive JWT + SSE URL
    ↓
Connect SSE ← Receive commands (config-update, deep-trace, replay, route-control)
    ↓
Heartbeat (every 30s) → Send capabilities, environmentId, routeStates
    ↓
Deregister (graceful shutdown)
```

### State Machine

```
LIVE ──(no heartbeat for 90s)──→ STALE ──(300s more)──→ DEAD
  ↑                                │
  └────(heartbeat arrives)─────────┘
```

Both thresholds are configurable via the `agent-registry.*` properties (see Agent Registry Tuning). With the defaults, an agent that stops sending heartbeats becomes STALE after 90s and DEAD after 390s (90s + 300s). A heartbeat from a STALE agent returns it to LIVE.
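
Here is a minimal sketch of the transitions above. An external system could use it to predict an agent's status from the time of its last heartbeat. It rests on two assumptions: that the DEAD threshold is measured from the moment the agent became STALE (as "300s more" suggests), and that a state changes as soon as its threshold is reached. The server evaluates states on its own schedule, so observed transitions may lag slightly.

```python
from enum import Enum

# Defaults from the Agent Registry Tuning table (milliseconds).
STALE_THRESHOLD_MS = 90_000   # AGENT_REGISTRY_STALE_THRESHOLD_MS
DEAD_THRESHOLD_MS = 300_000   # AGENT_REGISTRY_DEAD_THRESHOLD_MS


class AgentState(Enum):
    LIVE = "LIVE"
    STALE = "STALE"
    DEAD = "DEAD"


def expected_state(ms_since_last_heartbeat: int,
                   stale_ms: int = STALE_THRESHOLD_MS,
                   dead_ms: int = DEAD_THRESHOLD_MS) -> AgentState:
    """Predict the registry state from the time since the last heartbeat."""
    if ms_since_last_heartbeat < stale_ms:
        return AgentState.LIVE
    # DEAD is reached dead_ms *after* the agent became STALE.
    if ms_since_last_heartbeat < stale_ms + dead_ms:
        return AgentState.STALE
    return AgentState.DEAD
```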

### Registration

**`POST /api/v1/agents/register`** — requires the bootstrap token (`CAMELEER_AUTH_TOKEN`) in the `Authorization: Bearer` header.

Request:

```json
{
  "instanceId": "agent-abc-123",
  "displayName": "Order Service #1",
  "applicationId": "order-service",
  "environmentId": "production",
  "version": "3.2.1",
  "routeIds": ["processOrder", "handlePayment"],
  "capabilities": { "replay": true, "routeControl": true }
}
```

Response:

```json
{
  "instanceId": "agent-abc-123",
  "eventStreamUrl": "/api/v1/agents/agent-abc-123/events",
  "heartbeatIntervalMs": 30000,
  "signingPublicKeyBase64": "<ed25519-public-key>",
  "accessToken": "<jwt>",
  "refreshToken": "<jwt>"
}
```
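
The registration exchange can be sketched with the Python standard library. The sketch only builds the request and interprets the response; nothing is sent until you pass the request to `urllib.request.urlopen`. The base URL, the `RegisteredAgent` type, and the helper names are illustrative assumptions. The JSON field names come from the examples above. The sketch also assumes the server is served at the root of `base_url`, since `urljoin` with an absolute path drops any base path.

```python
import json
import urllib.request
from dataclasses import dataclass
from urllib.parse import urljoin


def build_register_request(base_url: str, bootstrap_token: str,
                           registration: dict) -> urllib.request.Request:
    """POST /api/v1/agents/register with the bootstrap token as a Bearer token."""
    return urllib.request.Request(
        urljoin(base_url, "/api/v1/agents/register"),
        data=json.dumps(registration).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {bootstrap_token}",
            "Content-Type": "application/json",
        },
    )


@dataclass
class RegisteredAgent:
    instance_id: str
    event_stream_url: str        # absolute URL for the SSE connection
    heartbeat_interval_s: float
    access_token: str
    refresh_token: str
    signing_public_key_b64: str


def parse_register_response(base_url: str, body: bytes) -> RegisteredAgent:
    """Turn the registration response into connection settings."""
    data = json.loads(body)
    return RegisteredAgent(
        instance_id=data["instanceId"],
        # eventStreamUrl is returned as a server-relative path.
        event_stream_url=urljoin(base_url, data["eventStreamUrl"]),
        heartbeat_interval_s=data["heartbeatIntervalMs"] / 1000,
        access_token=data["accessToken"],
        refresh_token=data["refreshToken"],
        signing_public_key_b64=data["signingPublicKeyBase64"],
    )
```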

### Heartbeat

**`POST /api/v1/agents/{id}/heartbeat`** — requires the agent's JWT access token.

Request body:

```json
{
  "capabilities": { "replay": true, "routeControl": true },
  "environmentId": "production",
  "routeStates": { "processOrder": "Started", "handlePayment": "Suspended" }
}
```

**Auto-heal after a server restart:** if the heartbeating agent is not in the registry, the server re-registers it from its JWT claims and the heartbeat body. The environment is resolved in priority order: the heartbeat `environmentId`, then the JWT `env` claim, then `"default"`.
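
The environment fallback chain is a first-non-empty lookup. The sketch below assumes that a missing value and an empty string both count as "not provided"; the exact emptiness rule is not documented here.

```python
from typing import Optional


def resolve_environment(heartbeat_environment_id: Optional[str],
                        jwt_env_claim: Optional[str]) -> str:
    """Heartbeat environmentId > JWT env claim > "default"."""
    for candidate in (heartbeat_environment_id, jwt_env_claim):
        if candidate:  # None and "" both count as absent in this sketch
            return candidate
    return "default"
```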

### SSE Event Stream

**`GET /api/v1/agents/{id}/events`** — long-lived SSE connection. The server sends a keepalive ping every 15s (`AGENT_REGISTRY_PING_INTERVAL_MS`).

Event types pushed to agents: `config-update`, `deep-trace`, `replay`, `set-traced-processors`, `test-expression`, `route-control`.
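
The sketch below parses a standard `text/event-stream` (following the WHATWG server-sent events format) into `(event type, data)` pairs and keeps only the command types listed above. It assumes the server puts the command type in the SSE `event:` field and a JSON payload in `data:`. The payload shapes used in the tests are hypothetical. Comment lines (starting with `:`) are a common way to send keepalives, and the parser skips them. A keepalive sent as a named event would instead be filtered out as an unknown type.

```python
import json
from typing import Iterable, Iterator, Tuple

COMMAND_TYPES = {
    "config-update", "deep-trace", "replay",
    "set-traced-processors", "test-expression", "route-control",
}


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_type, data) for each complete event in an SSE stream."""
    event_type, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line dispatches the buffered event
            if data_lines:
                yield event_type, "\n".join(data_lines)
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):  # comment line, e.g. a keepalive
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)
        # "id" and "retry" fields are ignored in this sketch


def commands(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Keep only known command types and decode their JSON payloads."""
    for event_type, data in parse_sse(lines):
        if event_type in COMMAND_TYPES:
            yield event_type, json.loads(data)
```

To use it, feed `commands()` the decoded lines of the HTTP response body, for example by iterating over the object returned by `urlopen` and decoding each line as UTF-8.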

### Token Refresh

**`POST /api/v1/agents/{id}/refresh`** — public endpoint (no access token needed); the refresh token in the request body is validated instead.

```json
{ "refreshToken": "<refresh-jwt>" }
```

Returns a new `accessToken` and `refreshToken`. The new tokens keep the roles, application, and environment of the original token.
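
Every refresh replaces both tokens, so a client should store the new refresh token along with the new access token. A minimal sketch follows, with a hypothetical `AgentTokens` holder. It builds the request but leaves sending it (for example with `urllib.request.urlopen`) to the caller.

```python
import json
import urllib.request
from dataclasses import dataclass
from urllib.parse import quote, urljoin


@dataclass
class AgentTokens:
    access_token: str
    refresh_token: str

    def refresh_request(self, base_url: str, agent_id: str) -> urllib.request.Request:
        """POST /api/v1/agents/{id}/refresh with the current refresh token."""
        path = f"/api/v1/agents/{quote(agent_id, safe='')}/refresh"
        return urllib.request.Request(
            urljoin(base_url, path),
            data=json.dumps({"refreshToken": self.refresh_token}).encode("utf-8"),
            method="POST",
            headers={"Content-Type": "application/json"},  # no Authorization header needed
        )

    def rotate(self, response_body: bytes) -> None:
        """Replace both tokens with the ones returned by the server."""
        data = json.loads(response_body)
        self.access_token = data["accessToken"]
        self.refresh_token = data["refreshToken"]
```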

---

## Data Ingestion

All ingestion endpoints require a JWT with the `AGENT` role.

| Endpoint | Data | Notes |
|----------|------|-------|
| `POST /api/v1/data/executions` | Execution chunks (route + processor traces) | Buffered, flushed periodically |
| `POST /api/v1/data/diagrams` | Route graph definitions | Accepts a single object or an array |
| `POST /api/v1/data/events` | Agent lifecycle events | Triggers registry state transitions |
| `POST /api/v1/data/logs` | Application log batches | Buffered; returns 503 when the buffer is full |
| `POST /api/v1/data/metrics` | Metrics snapshots | Buffered; returns 503 when the buffer is full |
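
The logs and metrics endpoints return 503 when the server's buffer is full, so a sender should treat 503 as "retry later" rather than as a hard failure. The sketch below uses exponential backoff with full jitter. `send` is a hypothetical callable that performs one POST and returns the HTTP status. The attempt count and delays are illustrative, not values recommended by the server. What to do with data that still gets a 503 on the last attempt (drop or spool it) is left to the caller.

```python
import random
import time
from typing import Callable


def send_with_backoff(send: Callable[[], int],
                      max_attempts: int = 5,
                      base_delay_s: float = 0.5,
                      max_delay_s: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() (which returns an HTTP status) and retry while it returns 503.

    Any other status is returned immediately, so 4xx errors are not retried.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 0
    for attempt in range(max_attempts):
        status = send()
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Exponential cap, then full jitter within [0, cap].
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))
    return status  # still 503 after the last attempt
```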

---

## Command System

Commands are delivered to agents over their SSE event streams. There are three dispatch modes:

| Mode | Endpoint | Behavior |
|------|----------|----------|
| Single agent | `POST /api/v1/agents/{id}/commands` | Asynchronous (202); command status is DELIVERED or PENDING |
| Group (application) | `POST /api/v1/agents/groups/{group}/commands` | Synchronous; waits up to 10s and returns per-agent results |
| Broadcast (all LIVE agents) | `POST /api/v1/agents/commands` | Fire-and-forget (202) |

PENDING commands expire after `AGENT_REGISTRY_COMMAND_EXPIRY_MS` (default 60s).

**Command types:** `config-update`, `deep-trace`, `replay`, `set-traced-processors`, `test-expression`, `route-control`

**Replay** has a dedicated synchronous endpoint: `POST /api/v1/agents/{id}/replay`. It waits up to 30s and returns the result, or 504 on timeout.

**Acknowledgment:** `POST /api/v1/agents/{id}/commands/{commandId}/ack` — the agent confirms receipt with a status, a message, and data.
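
Choosing among the dispatch modes comes down to choosing the endpoint, as in the sketch below. The helper names are hypothetical. The command body schema is not shown in this document, so the sketch only builds URLs; see the OpenAPI spec for the payload.

```python
from typing import Optional
from urllib.parse import quote


def command_url(base_url: str, *, agent_id: Optional[str] = None,
                application: Optional[str] = None) -> str:
    """Pick the dispatch endpoint: single agent, application group, or broadcast."""
    if agent_id and application:
        raise ValueError("target either one agent or one application, not both")
    root = base_url.rstrip("/") + "/api/v1/agents"
    if agent_id:
        return f"{root}/{quote(agent_id, safe='')}/commands"  # async, 202
    if application:
        return f"{root}/groups/{quote(application, safe='')}/commands"  # waits up to 10s
    return f"{root}/commands"  # broadcast to all LIVE agents, 202


def replay_url(base_url: str, agent_id: str) -> str:
    """Dedicated synchronous replay endpoint (30s server-side wait, 504 on timeout)."""
    return f"{base_url.rstrip('/')}/api/v1/agents/{quote(agent_id, safe='')}/replay"
```

When calling the group or replay endpoints, set the HTTP client timeout above the server-side wait (10s and 30s respectively). That way the server's answer, including a 504 from replay, arrives before the client gives up.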

---

## Query & Analytics API

All query endpoints require a JWT with the `VIEWER` role or higher.

### Execution Search

| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/search/executions` | Search by status, time range, text, route, application, and environment |
| `POST /api/v1/search/executions` | Advanced search with a full filter object |
| `GET /api/v1/executions/{id}` | Execution detail with processor tree |
| `GET /api/v1/executions/{id}/processors/by-id/{pid}/snapshot` | Exchange data captured at a given processor |

### Statistics & Analytics

| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/search/stats` | Aggregated stats (P99, error rate, SLA compliance) |
| `GET /api/v1/search/stats/timeseries` | Bucketed time series |
| `GET /api/v1/search/stats/timeseries/by-app` | Time series grouped by application |
| `GET /api/v1/search/stats/timeseries/by-route` | Time series grouped by route |
| `GET /api/v1/search/stats/punchcard` | Transaction heatmap (weekday × hour) |
| `GET /api/v1/search/errors/top` | Top N errors with velocity trends |
| `GET /api/v1/search/attributes/keys` | Distinct attribute key names |

### Route Catalog & Metrics

| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/routes/catalog` | Applications with their routes, agents, and health |
| `GET /api/v1/routes/metrics` | Per-route performance (TPS, P99, error rate) |
| `GET /api/v1/routes/metrics/processors` | Per-processor metrics for a route |

### Logs

| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/logs` | Cursor-based log search with level aggregation |
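
With cursor-based search, a client pages forward by passing back the cursor from the previous response. This document does not list the query-parameter or response-field names for the cursor (check the OpenAPI spec). The sketch therefore takes a hypothetical `fetch_page(cursor)` callable that returns `(entries, next_cursor)` and implements only the paging loop.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

LogPage = Tuple[List[Dict[str, Any]], Optional[str]]


def iter_log_entries(fetch_page: Callable[[Optional[str]], LogPage],
                     max_pages: int = 1_000) -> Iterator[Dict[str, Any]]:
    """Follow cursors until the server stops returning one."""
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard cap guards against a cursor that never ends
        entries, cursor = fetch_page(cursor)
        yield from entries
        if not cursor:
            return
```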

### Diagrams

| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/diagrams` | Find a diagram by application and `routeId` |
| `GET /api/v1/diagrams/{hash}/render` | Render a diagram as SVG or as a JSON layout |

### Agent Monitoring

| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/agents` | List agents (filter by status, application, environment) |
| `GET /api/v1/agents/events-log` | Agent lifecycle event history |
| `GET /api/v1/agents/{id}/metrics` | Agent-level metrics time series |

---

## Application Configuration

| Endpoint | Role | Description |
|----------|------|-------------|
| `GET /api/v1/config` | VIEWER | List all application configs |
| `GET /api/v1/config/{app}` | VIEWER | Get an application's config (returns defaults if none is stored) |
| `PUT /api/v1/config/{app}` | OPERATOR | Save the config and push it to all LIVE agents |
| `GET /api/v1/config/{app}/processor-routes` | VIEWER | Processor-to-route mapping |
| `POST /api/v1/config/{app}/test-expression` | VIEWER | Test a Camel expression on a live agent |

Config fields: `metricsEnabled`, `samplingRate`, `tracedProcessors`, `logLevels`, `engineLevel`, `payloadCaptureMode`, `version`.
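
Because `GET /api/v1/config/{app}` returns defaults when nothing is stored, a client can change one setting with a read-modify-write: fetch the config, change the fields it cares about, and `PUT` the whole object back. The sketch below assumes `PUT` expects the complete config object. `get_config` and `put_config` are hypothetical callables wrapping the two HTTP calls (the `PUT` needs the `OPERATOR` role), and the field values in the tests are illustrative.

```python
import copy
from typing import Any, Callable, Dict

Config = Dict[str, Any]

# Fields listed in this document. `version` is passed through untouched
# because its semantics are not described here.
EDITABLE_FIELDS = {
    "metricsEnabled", "samplingRate", "tracedProcessors",
    "logLevels", "engineLevel", "payloadCaptureMode",
}


def update_app_config(get_config: Callable[[str], Config],
                      put_config: Callable[[str, Config], None],
                      app: str, **changes: Any) -> Config:
    """Read-modify-write: GET the current config, apply changes, PUT it back."""
    unknown = set(changes) - EDITABLE_FIELDS
    if unknown:
        raise ValueError(f"unknown or read-only fields: {sorted(unknown)}")
    current = get_config(app)         # GET /api/v1/config/{app}; defaults if none stored
    updated = copy.deepcopy(current)  # never mutate the caller's copy
    updated.update(changes)
    put_config(app, updated)          # PUT /api/v1/config/{app}; server pushes to LIVE agents
    return updated
```

Two concurrent writers can still overwrite each other's changes. This document does not say whether the `version` field guards against that.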

---

## Security

### Authentication

Paths are relative to `/api/v1`.

| Method | Endpoint | Purpose |
|--------|----------|---------|
| Bootstrap token | `POST /agents/register` | One-time agent registration |
| Local credentials | `POST /auth/login` | UI login (username/password) |
| OIDC code exchange | `POST /auth/oidc/callback` | Login via an external identity provider |
| Token refresh | `POST /auth/refresh` | UI token refresh |
| Token refresh | `POST /agents/{id}/refresh` | Agent token refresh |

### JWT Structure

- Algorithm: HMAC-SHA256 (HS256), keyed with `CAMELEER_JWT_SECRET`
- Access token lifetime: 1 hour (configurable)
- Refresh token lifetime: 7 days (configurable)
- Claims: `sub` (agent ID or `user:<username>`), `group` (application), `env` (environment), `roles` (array), `type` (`access` or `refresh`)

### RBAC Roles

| Role | Permissions |
|------|-------------|
| `AGENT` | Data ingestion, heartbeat, SSE, command ack |
| `VIEWER` | Read-only: executions, search, diagrams, metrics, logs, config |
| `OPERATOR` | VIEWER + send commands, modify config, replay |
| `ADMIN` | OPERATOR + user/group/role management, OIDC config, database admin |
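
The user roles form a chain (VIEWER, then OPERATOR, then ADMIN), with AGENT alongside it. An integrating system that mirrors these permissions, for example to hide UI actions, could evaluate them as in the sketch below. The sketch assumes `ADMIN` does *not* implicitly include `AGENT`, since the table does not say so. The server's own enforcement remains authoritative.

```python
from typing import Iterable

# OPERATOR includes VIEWER, ADMIN includes OPERATOR (per the table above).
_HIERARCHY = ["VIEWER", "OPERATOR", "ADMIN"]


def has_role(granted: Iterable[str], required: str) -> bool:
    """True if any granted role equals or outranks the required role."""
    held = set(granted)
    if required in held:
        return True
    if required not in _HIERARCHY:
        return False  # AGENT (or any unknown role) must be granted explicitly
    needed = _HIERARCHY.index(required)
    return any(_HIERARCHY.index(role) >= needed
               for role in held if role in _HIERARCHY)
```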

### Ed25519 Config Signing

The server derives an Ed25519 key pair deterministically from the JWT secret and shares the public key with agents at registration (`signingPublicKeyBase64`). `config-update` payloads are signed so agents can verify that they came from the server. Because the key pair is derived from the JWT secret, leaving `CAMELEER_JWT_SECRET` unset (random, ephemeral) yields a different key pair after every restart.

### OIDC Integration

Configured via the admin API (`/api/v1/admin/oidc`); works with any OpenID Connect provider. Features:

- Role claim extraction, including nested paths such as `realm_access.roles`
- Auto-signup
- Configurable display-name claim
- Constant-time token rotation via dual bootstrap tokens
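
A nested role-claim path such as `realm_access.roles` (where Keycloak places realm roles) is resolved by walking the token's claims one key at a time. The sketch below shows that lookup. It does not cover how the server maps provider role names onto Cameleer roles, and a claim whose name itself contains a dot cannot be addressed this way.

```python
from typing import Any, List, Mapping


def extract_roles(claims: Mapping[str, Any], path: str) -> List[str]:
    """Resolve a dotted claim path and normalise the result to a list of strings."""
    node: Any = claims
    for key in path.split("."):
        if not isinstance(node, Mapping) or key not in node:
            return []  # missing path: no roles
        node = node[key]
    if isinstance(node, str):
        return [node]  # single-valued claim
    if isinstance(node, list):
        return [role for role in node if isinstance(role, str)]
    return []
```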

---

## Admin API

All admin endpoints require the `ADMIN` role. Paths below are relative to `/api/v1/admin`.

### User Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/users` | GET | List all users |
| `/users` | POST | Create local user |
| `/users/{id}` | GET/PUT/DELETE | Get/update/delete user |
| `/users/{id}/password` | POST | Reset password |
| `/users/{id}/roles/{roleId}` | POST/DELETE | Assign/remove role |
| `/users/{id}/groups/{groupId}` | POST/DELETE | Add to/remove from group |

### Group & Role Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/groups` | GET/POST | List/create groups |
| `/groups/{id}` | GET/PUT/DELETE | Get/update/delete group (parent changes are checked for cycles) |
| `/groups/{id}/roles/{roleId}` | POST/DELETE | Assign/remove role from group |
| `/roles` | GET/POST | List/create roles |
| `/roles/{id}` | GET/PUT/DELETE | Get/update/delete role (system roles are protected) |
| `/rbac/stats` | GET | RBAC statistics |
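
Group nesting creates a parent chain. A parent change is safe only if the new parent is neither the group itself nor one of its descendants, and walking up from the proposed parent detects that in O(depth). The sketch below assumes each group has at most one parent (as "parent change" suggests) and uses a hypothetical `parents` map from group ID to parent ID. It is not the server's implementation.

```python
from typing import Dict, Optional


def creates_cycle(parents: Dict[str, Optional[str]], group_id: str,
                  new_parent: Optional[str]) -> bool:
    """Would setting parents[group_id] = new_parent introduce a cycle?"""
    seen = set()
    node = new_parent
    while node is not None:
        if node == group_id:
            return True   # group_id would become its own ancestor
        if node in seen:
            return True   # existing data already loops; refuse to build on it
        seen.add(node)
        node = parents.get(node)
    return False
```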

### Infrastructure

| Endpoint | Description |
|----------|-------------|
| `/database/status` | PostgreSQL version, schema, health |
| `/database/pool` | HikariCP connection pool stats |
| `/database/tables` | Table sizes and row counts |
| `/database/queries` | Active PostgreSQL queries (can be killed) |
| `/clickhouse/status` | ClickHouse version, uptime |
| `/clickhouse/tables` | Table info, row counts, sizes |
| `/clickhouse/performance` | Disk, memory, compression, partitions |
| `/clickhouse/queries` | Active ClickHouse queries |
| `/clickhouse/pipeline` | Ingestion pipeline stats |

### Settings & Configuration

| Endpoint | Description |
|----------|-------------|
| `/app-settings` | Per-application settings (CRUD) |
| `/thresholds` | Monitoring threshold configuration |
| `/oidc` | OIDC provider configuration (CRUD + test) |
| `/audit` | Paginated audit log search |
| `/usage` | UI usage analytics (stored in ClickHouse) |

---

## Storage

### PostgreSQL

Used for RBAC, configuration, and audit data. Each tenant gets its own schema, selected via `?currentSchema=tenant_{id}`.

Tables: `users`, `groups`, `roles`, `user_roles`, `user_groups`, `group_roles`, `server_config`, `application_config`, `audit_log`.

Flyway migrations (V1 through V11) manage schema evolution.

### ClickHouse

Used for all observability data. The schema is created by `ClickHouseSchemaInitializer`, which runs idempotently on startup.

| Table | Engine | Purpose | TTL |
|-------|--------|---------|-----|
| `executions` | ReplacingMergeTree | Route execution records | 365d |
| `processor_executions` | MergeTree | Per-processor trace data | 365d |
| `agent_events` | MergeTree | Agent lifecycle audit trail | 365d |
| `route_diagrams` | ReplacingMergeTree | Route graph definitions | none |
| `logs` | MergeTree | Application logs | 365d |
| `usage_events` | MergeTree | UI action tracking | 90d |
| `stats_1m_all` | AggregatingMergeTree | Global 1-minute rollups | none |
| `stats_1m_app` | AggregatingMergeTree | Per-application 1-minute rollups | none |
| `stats_1m_route` | AggregatingMergeTree | Per-route 1-minute rollups | none |
| `stats_1m_processor` | AggregatingMergeTree | Per-processor-type 1-minute rollups | none |
| `stats_1m_processor_detail` | AggregatingMergeTree | Per-processor-instance 1-minute rollups | none |

All tables include `tenant_id` and `environment` columns and are partitioned by `(tenant_id, toYYYYMM(timestamp))`.

The `stats_1m_*` tables are populated by materialized views over the base tables and store partial aggregate states. They must therefore be read with `-Merge` combinators (e.g., `countMerge(total_count)`).

---

## Deployment

### Container Image

Multi-stage Docker build: Maven 3.9 + JDK 17 (build stage) → JRE 17 (runtime stage). The server listens on port 8081.

Registry: `gitea.siegeln.net/cameleer/cameleer3-server`

### Infrastructure Requirements

| Component | Version | Purpose |
|-----------|---------|---------|
| PostgreSQL | 16+ | RBAC, config, audit |
| ClickHouse | 24.12+ | All observability data |

### Environment Variables

| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `CAMELEER_AUTH_TOKEN` | Yes | (none) | Bootstrap token for agent registration |
| `CAMELEER_JWT_SECRET` | Recommended | Random (ephemeral) | JWT signing secret; also seeds the Ed25519 config-signing key |
| `CAMELEER_TENANT_ID` | No | `default` | Tenant identifier |
| `CAMELEER_UI_USER` | No | `admin` | Default admin username |
| `CAMELEER_UI_PASSWORD` | No | `admin` | Default admin password |
| `CAMELEER_UI_ORIGIN` | No | `http://localhost:5173` | CORS allowed origin |
| `CLICKHOUSE_URL` | No | `jdbc:clickhouse://localhost:8123/cameleer` | ClickHouse JDBC URL |
| `CLICKHOUSE_USERNAME` | No | `default` | ClickHouse user |
| `CLICKHOUSE_PASSWORD` | No | (empty) | ClickHouse password |
| `SPRING_DATASOURCE_URL` | No | `jdbc:postgresql://localhost:5432/cameleer3` | PostgreSQL JDBC URL |
| `SPRING_DATASOURCE_USERNAME` | No | `cameleer` | PostgreSQL user |
| `SPRING_DATASOURCE_PASSWORD` | No | `cameleer_dev` | PostgreSQL password |
| `CAMELEER_DB_SCHEMA` | No | `public` | PostgreSQL schema name |
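
A SaaS orchestrator that provisions instances can check an environment before starting a container. The sketch below mirrors the table. It fails fast without `CAMELEER_AUTH_TOKEN`. It warns when `CAMELEER_JWT_SECRET` is unset: the secret is then random, so tokens issued before a restart stop validating, and because the Ed25519 key is derived from it, the signing key changes too. It also warns when the default admin password is kept. These checks and messages are this sketch's own policy, not server behavior.

```python
from typing import Dict, List, Mapping, Tuple

# Defaults from the table above (secrets excluded).
DEFAULTS: Dict[str, str] = {
    "CAMELEER_TENANT_ID": "default",
    "CAMELEER_UI_USER": "admin",
    "CAMELEER_UI_PASSWORD": "admin",
    "CAMELEER_UI_ORIGIN": "http://localhost:5173",
    "CLICKHOUSE_URL": "jdbc:clickhouse://localhost:8123/cameleer",
    "CLICKHOUSE_USERNAME": "default",
    "CLICKHOUSE_PASSWORD": "",
    "SPRING_DATASOURCE_URL": "jdbc:postgresql://localhost:5432/cameleer3",
    "SPRING_DATASOURCE_USERNAME": "cameleer",
    "SPRING_DATASOURCE_PASSWORD": "cameleer_dev",
    "CAMELEER_DB_SCHEMA": "public",
}


def preflight(env: Mapping[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return the effective settings plus warnings; raise if a required value is missing."""
    if not env.get("CAMELEER_AUTH_TOKEN"):
        raise ValueError("CAMELEER_AUTH_TOKEN is required (agent bootstrap token)")
    effective = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    effective["CAMELEER_AUTH_TOKEN"] = env["CAMELEER_AUTH_TOKEN"]
    warnings: List[str] = []
    if env.get("CAMELEER_JWT_SECRET"):
        effective["CAMELEER_JWT_SECRET"] = env["CAMELEER_JWT_SECRET"]
    else:
        warnings.append("CAMELEER_JWT_SECRET unset: tokens issued before a restart "
                        "stop validating and the Ed25519 signing key changes")
    if effective["CAMELEER_UI_PASSWORD"] == DEFAULTS["CAMELEER_UI_PASSWORD"]:
        warnings.append("CAMELEER_UI_PASSWORD is still the default 'admin'")
    return effective, warnings
```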

### Health Probes

- **Endpoint:** `GET /api/v1/health` (public, no auth)
- **Liveness:** 30s initial delay, 10s period
- **Readiness:** 10s initial delay, 5s period
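
A provisioning workflow can mirror the readiness timings (10s initial delay, 5s period) while waiting for a new instance, before routing agents or users to it. The sketch treats any 200 response from `GET /api/v1/health` as healthy, since the response body format is not described here. The overall timeout is an illustrative choice. `sleep` and `clock` are injectable so the loop can be tested without waiting.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_probe(base_url: str, timeout_s: float = 2.0) -> Callable[[], bool]:
    """Probe GET /api/v1/health; any 200 response counts as healthy."""
    url = base_url.rstrip("/") + "/api/v1/health"

    def probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):  # includes HTTP error statuses
            return False

    return probe


def wait_until_healthy(probe: Callable[[], bool], initial_delay_s: float = 10.0,
                       period_s: float = 5.0, timeout_s: float = 120.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll like the readiness probe: wait, then check every period until timeout."""
    start = clock()
    sleep(initial_delay_s)
    while True:
        if probe():
            return True
        if clock() - start + period_s > timeout_s:
            return False
        sleep(period_s)
```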

### Ingestion Tuning

| Variable | Default | Purpose |
|----------|---------|---------|
| `INGESTION_BUFFER_CAPACITY` | 50000 | Ring buffer size |
| `INGESTION_BATCH_SIZE` | 5000 | Flush batch size |
| `INGESTION_FLUSH_INTERVAL_MS` | 5000 | Periodic flush interval (ms) |
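
To see how these settings interact, consider a simplified model. Writes are accepted until the buffer holds `INGESTION_BUFFER_CAPACITY` entries; past that point, the buffered endpoints answer 503. A flusher then removes up to `INGESTION_BATCH_SIZE` entries per flush. In this sketch a bounded FIFO stands in for the server's ring buffer, and each request is assumed to be accepted all-or-nothing. Scheduling flushes every `INGESTION_FLUSH_INTERVAL_MS` is left out. This is not the server's code.

```python
from collections import deque
from typing import Deque, Generic, List, Sequence, TypeVar

T = TypeVar("T")


class BoundedIngestionBuffer(Generic[T]):
    """Bounded FIFO that rejects writes when full and drains in fixed-size batches."""

    def __init__(self, capacity: int = 50_000, batch_size: int = 5_000) -> None:
        self.capacity = capacity      # INGESTION_BUFFER_CAPACITY
        self.batch_size = batch_size  # INGESTION_BATCH_SIZE
        self._items: Deque[T] = deque()

    def __len__(self) -> int:
        return len(self._items)

    def offer(self, batch: Sequence[T]) -> bool:
        """Accept the whole batch or none of it; False maps to HTTP 503."""
        if len(self._items) + len(batch) > self.capacity:
            return False
        self._items.extend(batch)
        return True

    def drain_batch(self) -> List[T]:
        """Remove up to batch_size items in arrival order, e.g. on each flush tick."""
        count = min(self.batch_size, len(self._items))
        return [self._items.popleft() for _ in range(count)]
```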

### Agent Registry Tuning

| Variable | Default | Purpose |
|----------|---------|---------|
| `AGENT_REGISTRY_STALE_THRESHOLD_MS` | 90000 | Time without a heartbeat before LIVE → STALE |
| `AGENT_REGISTRY_DEAD_THRESHOLD_MS` | 300000 | Time spent in STALE before STALE → DEAD |
| `AGENT_REGISTRY_PING_INTERVAL_MS` | 15000 | SSE keepalive interval |
| `AGENT_REGISTRY_COMMAND_EXPIRY_MS` | 60000 | Pending command TTL |

---

## Public Endpoints (No Auth)

These endpoints do not require a JWT:

- `GET /api/v1/health`
- `POST /api/v1/agents/register` (requires the bootstrap token instead)
- `POST /api/v1/agents/*/refresh` (validates the refresh token in the body)
- `POST /api/v1/auth/login`
- `POST /api/v1/auth/refresh`
- `GET /api/v1/auth/oidc/config`
- `POST /api/v1/auth/oidc/callback`
- `GET /api/v1/api-docs/**` (OpenAPI spec)
- `GET /swagger-ui.html` (Swagger UI)
- Static resources: `/`, `/index.html`, `/config.js`, `/favicon.svg`, `/assets/**`

All other endpoints require a valid JWT with the appropriate role.
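
A gateway or SaaS proxy in front of the server may want to know which routes can skip JWT validation. The sketch below encodes the list above with a small Ant-style matcher, since the `*`/`**` notation resembles Spring-style path patterns. It supports only a subset of that syntax: `*` matches one path segment, and a trailing `/**` matches any suffix. It also assumes static resources are served over GET. The server's security configuration stays authoritative, and a mirror like this can drift from it.

```python
import re
from typing import List, Tuple

# (method, pattern) pairs from the list above. Static resources are assumed to be GET.
PUBLIC_ENDPOINTS: List[Tuple[str, str]] = [
    ("GET", "/api/v1/health"),
    ("POST", "/api/v1/agents/register"),
    ("POST", "/api/v1/agents/*/refresh"),
    ("POST", "/api/v1/auth/login"),
    ("POST", "/api/v1/auth/refresh"),
    ("GET", "/api/v1/auth/oidc/config"),
    ("POST", "/api/v1/auth/oidc/callback"),
    ("GET", "/api/v1/api-docs/**"),
    ("GET", "/swagger-ui.html"),
    ("GET", "/"),
    ("GET", "/index.html"),
    ("GET", "/config.js"),
    ("GET", "/favicon.svg"),
    ("GET", "/assets/**"),
]


def _compile(pattern: str) -> re.Pattern:
    """'*' matches one path segment; a trailing '/**' matches the prefix or any suffix."""
    suffix = ""
    if pattern.endswith("/**"):
        pattern, suffix = pattern[:-3], "(?:/.*)?"
    segments = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + "[^/]+".join(segments) + suffix + "$")


_COMPILED = [(method, _compile(pattern)) for method, pattern in PUBLIC_ENDPOINTS]


def is_public(method: str, path: str) -> bool:
    """True if the request can be served without a JWT."""
    path = path.split("?", 1)[0]  # ignore any query string
    return any(m == method.upper() and rx.match(path) for m, rx in _COMPILED)
```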