hsiegeln 48ce75bf38 feat(server): persist server self-metrics into ClickHouse

Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-23 23:20:45 +02:00

21 KiB

Cameleer Server — Capabilities Reference

Standalone reference for systems integrating with or managing Cameleer Server instances. Generated 2026-04-04. Source of truth: the codebase and OpenAPI spec at /api/v1/api-docs.

What It Does

Cameleer Server is an observability platform for Apache Camel applications. It receives execution traces, metrics, logs, and route diagrams from instrumented Camel agents, stores them in ClickHouse, and serves a web UI for searching, visualizing, and controlling routes.

Core capabilities:

Real-time execution tracing with processor-level detail
Full-text search across executions, logs, and attributes
Route topology diagrams with live execution overlays
Application configuration push via SSE
Route control (start/stop/suspend) and exchange replay
Agent lifecycle management with auto-heal on server restart
RBAC with local users, groups, roles, and OIDC federation
Multi-tenant isolation (one tenant per server instance)

Multi-Tenancy Model

Each server instance serves exactly one tenant. Multiple tenants share infrastructure but are isolated at the data layer.

Concern	Isolation
PostgreSQL	Schema-per-tenant (`?currentSchema=tenant_{id}`)
ClickHouse	Shared DB, `tenant_id` column on all tables, partitioned by `(tenant_id, toYYYYMM(timestamp))`
Configuration	`CAMELEER_SERVER_TENANT_ID` env var (default: `"default"`)
Agents	Each agent belongs to one tenant, one environment

Environments (dev/staging/prod) are first-class within a tenant. Agents send environmentId at registration and in every heartbeat. The UI filters by environment. JWTs carry an env claim so that an agent's environment can be recovered after a server restart.

Agent Protocol

Lifecycle

Register (bootstrap token) → Receive JWT + SSE URL
    ↓
Connect SSE ← Receive commands (config-update, deep-trace, replay, route-control)
    ↓
Heartbeat (every 30s) → Send capabilities, environmentId, routeStates
    ↓
Deregister (graceful shutdown)

State Machine

LIVE ──(no heartbeat for 90s)──→ STALE ──(300s more)──→ DEAD
  ↑                                 │
  └────(heartbeat arrives)──────────┘

Thresholds are configurable via cameleer.server.agentregistry.* properties.
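
The transitions above can be sketched as a pure function of the time since the last heartbeat. This is a minimal illustration, not the server's implementation: the names `classify` and `AgentState` are hypothetical, the boundary handling (exactly 90 s counts as STALE) is an assumption, and recovery of a DEAD agent is not modeled.

```python
from enum import Enum

# Defaults taken from the thresholds documented above.
STALE_THRESHOLD_MS = 90_000   # no heartbeat for 90 s -> STALE
DEAD_THRESHOLD_MS = 300_000   # STALE for a further 300 s -> DEAD


class AgentState(Enum):
    LIVE = "LIVE"
    STALE = "STALE"
    DEAD = "DEAD"


def classify(ms_since_heartbeat: int,
             stale_ms: int = STALE_THRESHOLD_MS,
             dead_ms: int = DEAD_THRESHOLD_MS) -> AgentState:
    """Classify an agent by the time elapsed since its last heartbeat."""
    if ms_since_heartbeat < stale_ms:
        return AgentState.LIVE
    # DEAD is reached only after the agent has been STALE for dead_ms more.
    if ms_since_heartbeat < stale_ms + dead_ms:
        return AgentState.STALE
    return AgentState.DEAD
```

A fresh heartbeat resets the elapsed time to zero, which is how a STALE agent returns to LIVE in this model.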

Registration

POST /api/v1/agents/register — requires bootstrap token in Authorization: Bearer header.

Request:

{
  "instanceId": "agent-abc-123",
  "applicationId": "order-service",
  "environmentId": "production",
  "version": "3.2.1",
  "routeIds": ["processOrder", "handlePayment"],
  "capabilities": { "replay": true, "routeControl": true }
}

Response:

{
  "instanceId": "agent-abc-123",
  "eventStreamUrl": "/api/v1/agents/agent-abc-123/events",
  "heartbeatIntervalMs": 30000,
  "signingPublicKeyBase64": "<ed25519-public-key>",
  "accessToken": "<jwt>",
  "refreshToken": "<jwt>"
}
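
As a rough sketch, a client could assemble the registration call like this. The helper name `build_register_request` is hypothetical, the `Content-Type` header is an assumption, and the base URL only reuses the documented port 8081. The request is built but not sent.

```python
import json
import urllib.request


def build_register_request(base_url: str,
                           bootstrap_token: str) -> urllib.request.Request:
    """Build (but do not send) the agent registration request."""
    body = {
        "instanceId": "agent-abc-123",
        "applicationId": "order-service",
        "environmentId": "production",
        "version": "3.2.1",
        "routeIds": ["processOrder", "handlePayment"],
        "capabilities": {"replay": True, "routeControl": True},
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/agents/register",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The bootstrap token goes in the Authorization: Bearer header.
            "Authorization": f"Bearer {bootstrap_token}",
            # Assumed; the section above does not state the content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would perform the call; the response fields `accessToken` and `refreshToken` are then used for all later requests.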

Heartbeat

POST /api/v1/agents/{id}/heartbeat — JWT auth.

{
  "capabilities": { "replay": true, "routeControl": true },
  "environmentId": "production",
  "routeStates": { "processOrder": "Started", "handlePayment": "Suspended" }
}

Heartbeats auto-heal the registry after a server restart: if the agent is not in the registry, the server re-registers it from the JWT claims and the heartbeat body. Environment priority: heartbeat environmentId > JWT env claim > "default".
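
The environment priority can be illustrated with a small function. This is only a sketch: `resolve_environment` is a hypothetical name, and treating empty or non-string values as absent is an assumption.

```python
from typing import Mapping


def resolve_environment(heartbeat: Mapping[str, object],
                        jwt_claims: Mapping[str, object]) -> str:
    """Pick the environment: heartbeat body first, then JWT, then "default"."""
    for candidate in (heartbeat.get("environmentId"), jwt_claims.get("env")):
        # Assumption: empty or non-string values count as "not provided".
        if isinstance(candidate, str) and candidate:
            return candidate
    return "default"
```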

SSE Event Stream

GET /api/v1/agents/{id}/events — long-lived SSE connection. Keepalive ping every 15s.

Event types pushed to agents: config-update, deep-trace, replay, set-traced-processors, test-expression, route-control.
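
For orientation, a minimal parser for standard SSE framing is sketched below. It assumes the stream uses the usual `event:` and `data:` fields and that keepalives arrive as comment lines starting with `:`; neither detail is confirmed by this document, and the payloads in the usage example are made up.

```python
from typing import Iterable, Iterator, List, Tuple


def _field_value(line: str, name: str) -> str:
    """Return the text after "name:", dropping one optional leading space."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event: str = "message"
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, commonly used for keepalive pings
        elif line.startswith("event:"):
            event = _field_value(line, "event")
        elif line.startswith("data:"):
            data.append(_field_value(line, "data"))
```

Usage, with hypothetical payloads:

```python
frames = ["event: config-update\n", 'data: {"samplingRate": 0.5}\n', "\n", ": ping\n", "\n"]
for event, data in parse_sse(frames):
    print(event, data)
```

An event that is still incomplete when the stream ends is dropped, which matches how SSE clients normally treat a broken connection.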

Token Refresh

POST /api/v1/agents/{id}/refresh — public endpoint, validates refresh token.

{ "refreshToken": "<refresh-jwt>" }

Returns new accessToken + refreshToken. Preserves roles, application, and environment from the original token.

Data Ingestion

All ingestion endpoints require JWT with AGENT role.

Endpoint	Data	Notes
`POST /api/v1/data/executions`	Execution chunks (route + processor traces)	Buffered, flushed periodically
`POST /api/v1/data/diagrams`	Route graph definitions	Single or array
`POST /api/v1/data/events`	Agent lifecycle events	Triggers registry state transitions
`POST /api/v1/data/logs`	Log entries (JSON array, `source`: app/agent)	Buffered, 503 if buffer full
`POST /api/v1/data/metrics`	Metrics snapshots	Buffered, 503 if buffer full
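
Because the logs and metrics endpoints answer 503 when the buffer is full, a sender will usually want to retry. The sketch below shows one possible client-side policy, exponential backoff; the policy, the function name `send_with_backoff`, and its parameters are assumptions rather than documented agent behavior. `send` is any callable that performs the POST and returns the HTTP status code.

```python
import time
from typing import Callable


def send_with_backoff(send: Callable[[], int],
                      max_attempts: int = 5,
                      base_delay_s: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() until it returns a non-503 status or attempts run out.

    Returns True when the final status is 2xx, False otherwise.
    """
    for attempt in range(max_attempts):
        status = send()
        if status != 503:
            return 200 <= status < 300
        if attempt < max_attempts - 1:
            # Delay doubles on each retry: 0.5 s, 1 s, 2 s, ...
            sleep(base_delay_s * (2 ** attempt))
    return False
```

Injecting `sleep` keeps the function easy to test without real delays.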

Command System

Commands are delivered to agents via SSE. Three dispatch modes:

Mode	Endpoint	Behavior
Single agent	`POST /api/v1/agents/{id}/commands`	Async (202), DELIVERED or PENDING
Group (application)	`POST /api/v1/agents/groups/{group}/commands`	Sync wait (10s), returns per-agent results
Broadcast (all LIVE)	`POST /api/v1/agents/commands`	Fire-and-forget (202)

Command types: config-update, deep-trace, replay, set-traced-processors, test-expression, route-control

Replay has a dedicated sync endpoint: POST /api/v1/agents/{id}/replay (30s timeout, returns result or 504).

Acknowledgment: POST /api/v1/agents/{id}/commands/{commandId}/ack — agent confirms receipt with status/message/data.
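
The three dispatch modes map to three URL shapes. A small helper, sketched here with the hypothetical name `command_url` and hypothetical mode labels, makes the mapping explicit:

```python
from typing import Optional
from urllib.parse import quote


def command_url(mode: str, target: Optional[str] = None) -> str:
    """Return the command endpoint path for a dispatch mode.

    mode is "single" (target = agent ID), "group" (target = application),
    or "broadcast" (no target).
    """
    if mode == "broadcast":
        return "/api/v1/agents/commands"
    if not target:
        raise ValueError(f"mode {mode!r} requires a target")
    encoded = quote(target, safe="")  # escape characters unsafe in a path
    if mode == "single":
        return f"/api/v1/agents/{encoded}/commands"
    if mode == "group":
        return f"/api/v1/agents/groups/{encoded}/commands"
    raise ValueError(f"unknown dispatch mode: {mode!r}")
```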

Query & Analytics API

All query endpoints require JWT with VIEWER role or higher.

Execution Search

Endpoint	Description
`GET /api/v1/search/executions`	Search by status, time, text, route, app, environment
`POST /api/v1/search/executions`	Advanced search with full filter object
`GET /api/v1/executions/{id}`	Execution detail with processor tree
`GET /api/v1/executions/{id}/processors/by-id/{pid}/snapshot`	Exchange data at processor

Statistics & Analytics

Endpoint	Description
`GET /api/v1/search/stats`	Aggregated stats (P99, error rate, SLA compliance)
`GET /api/v1/search/stats/timeseries`	Bucketed time-series
`GET /api/v1/search/stats/timeseries/by-app`	Time series grouped by application
`GET /api/v1/search/stats/timeseries/by-route`	Time series grouped by route
`GET /api/v1/search/stats/punchcard`	Transaction heatmap (weekday x hour)
`GET /api/v1/search/errors/top`	Top N errors with velocity trends
`GET /api/v1/search/attributes/keys`	Distinct attribute key names

Route Catalog & Metrics

Endpoint	Description
`GET /api/v1/routes/catalog`	Applications with routes, agents, health
`GET /api/v1/routes/metrics`	Per-route performance (TPS, P99, error rate)
`GET /api/v1/routes/metrics/processors`	Per-processor metrics for a route

Logs

Endpoint	Description
`GET /api/v1/logs`	Cursor-based log search with level aggregation. Filters: `source` (app/agent), `application`, `agentId`, `exchangeId`, `level`, `logger`, `q` (text), `environment`, time range
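
A log query URL can be assembled from the filters named above. The sketch uses only those filter names; the time-range and cursor parameter names are not given in this document, so they are left out. `logs_url` is a hypothetical helper, and the `ERROR` level value in the usage line is an assumption.

```python
from urllib.parse import urlencode

# Filter names as listed in the table above.
LOG_FILTERS = ("source", "application", "agentId", "exchangeId",
               "level", "logger", "q", "environment")


def logs_url(base_url: str, **filters: str) -> str:
    """Build a GET /api/v1/logs URL from the documented filters."""
    unknown = set(filters) - set(LOG_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    # Emit parameters in a fixed order so URLs are reproducible.
    query = urlencode({k: filters[k] for k in LOG_FILTERS if k in filters})
    return f"{base_url}/api/v1/logs" + (f"?{query}" if query else "")
```

For example, `logs_url("http://localhost:8081", application="order-service", level="ERROR", q="timeout")` searches one application's logs for a text match.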

Diagrams

Endpoint	Description
`GET /api/v1/diagrams`	Find diagram by application + routeId
`GET /api/v1/diagrams/{hash}/render`	SVG or JSON layout

Agent Monitoring

Endpoint	Description
`GET /api/v1/agents`	List agents (filter by status, app, environment)
`GET /api/v1/agents/events-log`	Agent lifecycle event history
`GET /api/v1/agents/{id}/metrics`	Agent-level metrics time series

Server Self-Metrics

The server snapshots its own Micrometer registry into ClickHouse every 60 s (table server_metrics) — JVM, HTTP, DB pools, agent/ingestion business metrics, and alerting metrics. Use this instead of running an external Prometheus when building a server-health dashboard. The live scrape endpoint /api/v1/prometheus remains available for traditional scraping.

See docs/server-self-metrics.md for the full metric catalog, suggested panels, and example queries.

Application Configuration

Endpoint	Role	Description
`GET /api/v1/config`	VIEWER	List all app configs
`GET /api/v1/config/{app}`	VIEWER	Get config (returns defaults if none stored)
`PUT /api/v1/config/{app}`	OPERATOR	Save config + push to all LIVE agents
`GET /api/v1/config/{app}/processor-routes`	VIEWER	Processor-to-route mapping
`POST /api/v1/config/{app}/test-expression`	VIEWER	Test Camel expression via live agent

Config fields: metricsEnabled, samplingRate, tracedProcessors, logLevels, engineLevel, payloadCaptureMode, version.

Security

Authentication

Method	Endpoint	Purpose
Bootstrap token	`POST /api/v1/agents/register`	One-time agent registration
Local credentials	`POST /api/v1/auth/login`	UI login (username/password)
OIDC code exchange	`POST /api/v1/auth/oidc/callback`	External identity provider
OIDC access token	Bearer token in Authorization header	SaaS M2M / external OIDC
Token refresh	`POST /api/v1/auth/refresh`	UI token refresh
Token refresh	`POST /api/v1/agents/{id}/refresh`	Agent token refresh

JWT Structure

Algorithm: HMAC-SHA256
Access token: 1 hour (configurable)
Refresh token: 7 days (configurable)
Claims: sub (agent ID or user:<username>), group (application), env (environment), roles (array), type (access/refresh)

RBAC Roles

Role	Permissions
`AGENT`	Data ingestion, heartbeat, SSE, command ack
`VIEWER`	Read-only: executions, search, diagrams, metrics, logs, config
`OPERATOR`	VIEWER + send commands, modify config, replay
`ADMIN`	OPERATOR + user/group/role management, OIDC config, database admin

UI Role Gating

The UI enforces role-based visibility (backend ACLs remain the authoritative check):

UI element	VIEWER	OPERATOR	ADMIN
Exchanges, Dashboard, Runtime, Logs	Yes	Yes	Yes
Config tab	Read-only	Edit	Edit
Route control bar	Hidden	Yes	Yes
Diagram node toolbar	Hidden	Yes	Yes
Admin sidebar section	Hidden	Hidden	Yes
Admin pages (`/admin/*`)	Redirect to `/`	Redirect to `/`	Yes

Config tab is a main tab alongside Exchanges/Dashboard/Runtime/Logs. Navigation: /config shows all-app config table; /config/:appId filters to that app with detail panel open. Sidebar clicks while on Config stay on the config tab — route clicks resolve to the parent app's config (config is per-app).

Ed25519 Config Signing

Server derives an Ed25519 keypair deterministically from the JWT secret. Public key is shared with agents at registration. Config-update payloads are signed so agents can verify authenticity.

OIDC Integration

Configured via the admin API (/api/v1/admin/oidc) or the admin UI. Supports any OpenID Connect provider. The backend is a confidential client (client_secret authentication, no PKCE) and supports ES384 (Logto default), ES256, and RS256.

Features:

Configurable user ID claim (userIdClaim, default sub; other choices include email and preferred_username)
Role claim extraction from access_token, then id_token (supports nested paths like realm_access.roles and space-delimited scope strings)
Auto-signup (provisions new users on first OIDC login)
Configurable display name claim
Constant-time token rotation via dual bootstrap tokens
RFC 8707 resource indicators (audience config)

Role handling: directly assigned system roles are overwritten on every OIDC login, falling back to defaultRoles when OIDC returns none. The sync uses getDirectRolesForUser, so group-inherited roles are never touched. Roles are normalized via SystemRole.normalizeScope() (case-insensitive, strips the server: prefix). Shared OIDC infrastructure (discovery, JWK source, algorithm set) is centralized in OidcProviderHelper.

SSO Auto-Redirect

When OIDC is configured and enabled, the login page automatically redirects to the OIDC provider with prompt=none for silent SSO. If the user has an active provider session, they are signed in without seeing a login form. If the provider returns consent_required (first login, scopes not yet granted), the flow retries without prompt=none so the user can grant consent once. If it returns login_required (no provider session), the UI falls back to the login form. Use /login?local to bypass the auto-redirect. Logout always ends at /login?local, either via the OIDC end_session_endpoint (with post_logout_redirect_uri) or as a direct fallback, which prevents SSO re-login loops.
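
The branching in this flow can be summarized as a small decision function. This is a sketch of the described behavior only: the function name and the returned labels are hypothetical, and sending any other provider error to the login form is an assumption.

```python
from typing import Optional


def next_step(error: Optional[str], used_prompt_none: bool) -> str:
    """Decide what the login flow does after the provider responds.

    error is the OIDC error code from the callback, or None on success.
    """
    if error is None:
        return "signed-in"
    if error == "consent_required" and used_prompt_none:
        # Retry once interactively so the user can grant consent.
        return "retry-without-prompt-none"
    # login_required means there is no provider session; other errors are
    # sent to the login form as well (assumption).
    return "show-login-form"
```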

OIDC Resource Server

When CAMELEER_SERVER_SECURITY_OIDCISSUERURI is configured, the server accepts external access tokens (e.g., Logto M2M tokens) in addition to internal HMAC JWTs. Dual-path validation: tries internal HMAC first, falls back to OIDC JWKS validation. Supports ES384, ES256, and RS256 algorithms. Handles RFC 9068 at+jwt token type.

Role mapping is case-insensitive and accepts both bare and server:-prefixed names:

Scope/claim value	Maps to
`admin`, `server:admin`, `Server:Admin`	ADMIN
`operator`, `server:operator`	OPERATOR
`viewer`, `server:viewer`	VIEWER

This applies to both M2M tokens (scope claim) and OIDC user login (configurable rolesClaim from id_token). The server: prefix allows dedicated API resource scopes without colliding with other platform scopes.
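
The mapping rules can be mirrored in a few lines. The sketch below follows the table above rather than the actual SystemRole.normalizeScope() source, which is not shown here; `roles_from_scope` is a hypothetical helper for space-delimited scope strings.

```python
from typing import List, Optional

SYSTEM_ROLES = ("ADMIN", "OPERATOR", "VIEWER")


def normalize_scope(value: str) -> Optional[str]:
    """Map a scope or claim value to a system role, or None if unrelated."""
    name = value.strip().lower()
    if name.startswith("server:"):
        name = name[len("server:"):]
    role = name.upper()
    return role if role in SYSTEM_ROLES else None


def roles_from_scope(scope: str) -> List[str]:
    """Extract system roles from a space-delimited scope string."""
    roles = []
    for part in scope.split():
        role = normalize_scope(part)
        if role is not None and role not in roles:
            roles.append(role)
    return roles
```

Scopes that do not name a system role, such as `openid` or `profile`, are ignored.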

Variable	Purpose
`CAMELEER_SERVER_SECURITY_OIDCISSUERURI`	OIDC issuer URI for token validation (e.g., `https://auth.example.com/oidc`)
`CAMELEER_SERVER_SECURITY_OIDCJWKSETURI`	Direct JWKS URL (e.g., `http://cameleer-logto:3001/oidc/jwks`) — use when public issuer isn't reachable from inside containers
`CAMELEER_SERVER_SECURITY_OIDCAUDIENCE`	Expected audience (API resource indicator)
`CAMELEER_SERVER_SECURITY_OIDCTLSSKIPVERIFY`	Skip TLS certificate verification for OIDC calls (default `false`) — use when provider has a self-signed CA

Logto is proxy-aware (TRUST_PROXY_HEADER=1). The LOGTO_ENDPOINT env var sets the public-facing URL used in OIDC discovery, issuer URI, and redirect URLs. Logto requires its own subdomain (not a path prefix).

Admin API

All admin endpoints require ADMIN role. Prefix: /api/v1/admin/.

User Management

Endpoint	Method	Description
`/users`	GET	List all users
`/users`	POST	Create local user
`/users/{id}`	GET/PUT/DELETE	Get/update/delete user
`/users/{id}/password`	POST	Reset password
`/users/{id}/roles/{roleId}`	POST/DELETE	Assign/remove role
`/users/{id}/groups/{groupId}`	POST/DELETE	Add/remove from group

Group & Role Management

Endpoint	Method	Description
`/groups`	GET/POST	List/create groups
`/groups/{id}`	GET/PUT/DELETE	Manage group (cycle detection on parent change)
`/groups/{id}/roles/{roleId}`	POST/DELETE	Assign/remove role from group
`/roles`	GET/POST	List/create roles
`/roles/{id}`	GET/PUT/DELETE	Manage role (system roles protected)
`/rbac/stats`	GET	RBAC statistics

Infrastructure

Endpoint	Description
`/database/status`	PostgreSQL version, schema, health
`/database/pool`	HikariCP connection pool stats
`/database/tables`	Table sizes and row counts
`/database/queries`	Active queries (with kill)
`/clickhouse/status`	ClickHouse version, uptime
`/clickhouse/tables`	Table info, row counts, sizes
`/clickhouse/performance`	Disk, memory, compression, partitions
`/clickhouse/queries`	Active ClickHouse queries
`/clickhouse/pipeline`	Ingestion pipeline stats

Settings & Configuration

Endpoint	Description
`/app-settings`	Per-application settings (CRUD)
`/thresholds`	Monitoring threshold configuration
`/oidc`	OIDC provider configuration (CRUD + test)
`/audit`	Paginated audit log search
`/usage`	UI usage analytics (ClickHouse)

Storage

PostgreSQL

Used for RBAC, configuration, and audit. Schema-per-tenant isolation via ?currentSchema=tenant_{id}.

Tables: users, groups, roles, user_roles, user_groups, group_roles, server_config, application_config, audit_log.

Flyway migrations (V1-V11) manage schema evolution.

ClickHouse

Used for all observability data. Schema managed by ClickHouseSchemaInitializer (idempotent on startup).

Table	Engine	Purpose	TTL
`executions`	ReplacingMergeTree	Route execution records	365d
`processor_executions`	MergeTree	Per-processor trace data	365d
`agent_events`	MergeTree	Agent lifecycle audit trail	365d
`route_diagrams`	ReplacingMergeTree	Route graph definitions	-
`logs`	MergeTree	Application + agent logs (`source` column: app/agent, `mdc` Map)	365d
`usage_events`	MergeTree	UI action tracking	90d
`stats_1m_all`	AggregatingMergeTree	Global 1-minute rollups	-
`stats_1m_app`	AggregatingMergeTree	Per-application rollups	-
`stats_1m_route`	AggregatingMergeTree	Per-route rollups	-
`stats_1m_processor`	AggregatingMergeTree	Per-processor-type rollups	-
`stats_1m_processor_detail`	AggregatingMergeTree	Per-processor-instance rollups	-

All tables include tenant_id and environment columns. Partitioned by (tenant_id, toYYYYMM(timestamp)).

Stats tables are fed by Materialized Views from base tables. Query with -Merge() combinators (e.g., countMerge(total_count)).

Deployment

Container Image

Multi-stage Docker build: Maven 3.9 + JDK 17 (build) → JRE 17 (runtime). Port 8081. No default credentials baked in — all database config comes from env vars at runtime.

Registry: gitea.siegeln.net/cameleer/cameleer-server

Infrastructure Requirements

Component	Version	Purpose
PostgreSQL	16+	RBAC, config, audit
ClickHouse	24.12+	All observability data

Required Environment Variables

Variable	Required	Default	Purpose
`CAMELEER_SERVER_SECURITY_BOOTSTRAPTOKEN`	Yes	-	Bootstrap token for agent registration
`CAMELEER_SERVER_SECURITY_JWTSECRET`	Recommended	Random (ephemeral)	JWT signing secret
`CAMELEER_SERVER_TENANT_ID`	No	`default`	Tenant identifier
`CAMELEER_SERVER_SECURITY_UIUSER`	No	`admin`	Default admin username
`CAMELEER_SERVER_SECURITY_UIPASSWORD`	No	`admin`	Default admin password
`CAMELEER_SERVER_SECURITY_UIORIGIN`	No	`http://localhost:5173`	CORS allowed origin (single, legacy)
`CAMELEER_SERVER_SECURITY_CORSALLOWEDORIGINS`	No	(empty)	Comma-separated CORS origins — overrides `UIORIGIN` when set
`CAMELEER_SERVER_CLICKHOUSE_URL`	No	`jdbc:clickhouse://localhost:8123/cameleer`	ClickHouse JDBC URL
`CAMELEER_SERVER_CLICKHOUSE_USERNAME`	No	`default`	ClickHouse user
`CAMELEER_SERVER_CLICKHOUSE_PASSWORD`	No	(empty)	ClickHouse password
`SPRING_DATASOURCE_URL`	No	`jdbc:postgresql://localhost:5432/cameleer`	PostgreSQL JDBC URL
`SPRING_DATASOURCE_USERNAME`	No	`cameleer`	PostgreSQL user
`SPRING_DATASOURCE_PASSWORD`	No	`cameleer_dev`	PostgreSQL password
`CAMELEER_SERVER_INGESTION_BODYSIZELIMIT`	No	`16384`	Max body size per execution (bytes)
`CAMELEER_SERVER_SECURITY_OIDCISSUERURI`	No	(empty)	OIDC issuer URI — enables resource server mode for M2M tokens
`CAMELEER_SERVER_SECURITY_OIDCJWKSETURI`	No	(empty)	Direct JWKS URL — bypasses OIDC discovery for container networking
`CAMELEER_SERVER_SECURITY_OIDCAUDIENCE`	No	(empty)	Expected JWT audience (API resource indicator)
`CAMELEER_SERVER_SECURITY_OIDCTLSSKIPVERIFY`	No	`false`	Skip TLS cert verification for OIDC calls (self-signed CAs)
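
The interplay of the two CORS variables in the table can be sketched as follows. The function name `allowed_origins` is hypothetical, and trimming whitespace around each entry is an assumption.

```python
from typing import List, Mapping


def allowed_origins(env: Mapping[str, str]) -> List[str]:
    """Resolve CORS origins: the comma-separated list wins when set."""
    raw = env.get("CAMELEER_SERVER_SECURITY_CORSALLOWEDORIGINS", "")
    origins = [o.strip() for o in raw.split(",") if o.strip()]
    if origins:
        return origins
    # Fall back to the single legacy origin and its documented default.
    return [env.get("CAMELEER_SERVER_SECURITY_UIORIGIN", "http://localhost:5173")]
```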

Health Probes

Endpoint: GET /api/v1/health (public, no auth)
Liveness: 30s initial delay, 10s period
Readiness: 10s initial delay, 5s period

Ingestion Tuning

Variable	Default	Purpose
`CAMELEER_SERVER_INGESTION_BUFFERCAPACITY`	50000	Ring buffer size
`CAMELEER_SERVER_INGESTION_BATCHSIZE`	5000	Flush batch size
`CAMELEER_SERVER_INGESTION_FLUSHINTERVALMS`	5000	Periodic flush interval
`CAMELEER_SERVER_INGESTION_BODYSIZELIMIT`	16384	Max body size per execution (bytes)

Agent Registry Tuning

Variable	Default	Purpose
`CAMELEER_SERVER_AGENTREGISTRY_STALETHRESHOLDMS`	90000	Heartbeat miss → STALE
`CAMELEER_SERVER_AGENTREGISTRY_DEADTHRESHOLDMS`	300000	STALE duration → DEAD
`CAMELEER_SERVER_AGENTREGISTRY_PINGINTERVALMS`	15000	SSE keepalive interval
`CAMELEER_SERVER_AGENTREGISTRY_COMMANDEXPIRYMS`	60000	Pending command TTL

Public Endpoints (No Auth)

These endpoints do not require a JWT:

GET /api/v1/health
POST /api/v1/agents/register (requires bootstrap token)
POST /api/v1/agents/*/refresh
POST /api/v1/auth/login
POST /api/v1/auth/refresh
GET /api/v1/auth/oidc/config
POST /api/v1/auth/oidc/callback
GET /api/v1/api-docs/** (OpenAPI spec)
GET /swagger-ui.html (Swagger UI)
Static resources: /, /index.html, /config.js, /favicon.svg, /assets/**

All other endpoints require a valid JWT with appropriate role.
