cameleer-server/.claude/rules/metrics.md
hsiegeln 48ce75bf38 feat(server): persist server self-metrics into ClickHouse
Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:20:45 +02:00


paths
cameleer-server-app/**/metrics/**
cameleer-server-app/**/ServerMetrics*
ui/src/pages/RuntimeTab/**
ui/src/pages/DashboardTab/**

Prometheus Metrics

The server exposes /api/v1/prometheus (unauthenticated, Prometheus text format). Spring Boot Actuator provides JVM, GC, thread pool, and http.server.requests metrics automatically. Business metrics come from the ServerMetrics component.

The same MeterRegistry is also snapshotted to ClickHouse every 60 s by ServerMetricsSnapshotScheduler (see "Server self-metrics persistence" at the bottom of this file) — so historical server-health data survives restarts without an external Prometheus.

Gauges (auto-polled)

| Metric | Tags | Source |
| --- | --- | --- |
| cameleer.agents.connected | state (live, stale, dead, shutdown) | AgentRegistryService.findByState() |
| cameleer.agents.sse.active | | SseConnectionManager.getConnectionCount() |
| cameleer.ingestion.buffer.size | type (execution, processor, log, metrics) | WriteBuffer.size() |
| cameleer.ingestion.accumulator.pending | | ChunkAccumulator.getPendingCount() |

Counters

| Metric | Tags | Instrumented in |
| --- | --- | --- |
| cameleer.ingestion.drops | reason (buffer_full, no_agent, no_identity) | LogIngestionController |
| cameleer.agents.transitions | transition (went_stale, went_dead, recovered) | AgentLifecycleMonitor |
| cameleer.deployments.outcome | status (running, failed, degraded) | DeploymentExecutor |
| cameleer.auth.failures | reason (invalid_token, revoked, oidc_rejected) | JwtAuthenticationFilter |

Timers

| Metric | Tags | Instrumented in |
| --- | --- | --- |
| cameleer.ingestion.flush.duration | type (execution, processor, log) | ExecutionFlushScheduler |
| cameleer.deployments.duration | | DeploymentExecutor |

Agent container Prometheus labels (set by PrometheusLabelBuilder at deploy time)

| Runtime Type | prometheus.path | prometheus.port |
| --- | --- | --- |
| spring-boot | /actuator/prometheus | 8081 |
| quarkus / native | /q/metrics | 9000 |
| plain-java | /metrics | 9464 |

All containers also get prometheus.scrape=true. These labels enable Prometheus docker_sd_configs auto-discovery.
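
The runtime-to-label mapping above can be sketched as a small lookup. This is an illustration only, not the actual PrometheusLabelBuilder: the class name PrometheusLabelSketch, the method labelsFor, and the treatment of "quarkus" and "native" as two separate runtime identifiers are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the real PrometheusLabelBuilder may be structured differently.
final class PrometheusLabelSketch {

    /** Returns the container labels for a runtime type, following the table above. */
    static Map<String, String> labelsFor(String runtimeType) {
        String path;
        int port;
        switch (runtimeType) {
            case "spring-boot":
                path = "/actuator/prometheus";
                port = 8081;
                break;
            case "quarkus": // assumption: "quarkus / native" covers two identifiers
            case "native":
                path = "/q/metrics";
                port = 9000;
                break;
            case "plain-java":
                path = "/metrics";
                port = 9464;
                break;
            default:
                throw new IllegalArgumentException("Unknown runtime type: " + runtimeType);
        }
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("prometheus.scrape", "true"); // set on every container
        labels.put("prometheus.path", path);
        labels.put("prometheus.port", String.valueOf(port));
        return labels;
    }
}
```

Rejecting unknown runtime types, rather than silently defaulting, is a design choice of this sketch; it keeps a mislabeled container from being scraped on the wrong path.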

Agent Metric Names (Micrometer)

Agents send MetricsSnapshot records with Micrometer-convention metric names. The server stores them generically (ClickHouse agent_metrics.metric_name). The UI references specific names in AgentInstance.tsx for JVM charts.

JVM metrics (used by UI)

| Metric name | UI usage |
| --- | --- |
| process.cpu.usage.value | CPU % stat card + chart |
| jvm.memory.used.value | Heap MB stat card + chart (tags: area=heap) |
| jvm.memory.max.value | Heap max for % calculation (tags: area=heap) |
| jvm.threads.live.value | Thread count chart |
| jvm.gc.pause.total_time | GC time chart |
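
As a rough illustration of the heap figures above, the sketch below derives a heap percentage and a megabyte value from jvm.memory.used.value and jvm.memory.max.value. It assumes both values are in bytes (Micrometer's base unit for JVM memory meters) and have already been summed over the area=heap series. The class and method names are hypothetical, and the UI code in AgentInstance.tsx may compute these differently.

```java
// Hypothetical helper; not taken from the UI or server code.
final class HeapUsageSketch {

    static final double BYTES_PER_MB = 1024.0 * 1024.0;

    /** Converts a byte value to megabytes for display. */
    static double toMegabytes(double bytes) {
        return bytes / BYTES_PER_MB;
    }

    /**
     * Returns heap usage as a percentage of the maximum.
     * Returns NaN when the maximum is zero or negative, since the JVM
     * reports -1 for a memory pool whose maximum is undefined.
     */
    static double heapPercent(double usedBytes, double maxBytes) {
        if (maxBytes <= 0) {
            return Double.NaN;
        }
        return 100.0 * usedBytes / maxBytes;
    }
}
```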

Camel route metrics (stored, queried by dashboard)

| Metric name | Type | Tags |
| --- | --- | --- |
| camel.exchanges.succeeded.count | counter | routeId, camelContext |
| camel.exchanges.failed.count | counter | routeId, camelContext |
| camel.exchanges.total.count | counter | routeId, camelContext |
| camel.exchanges.failures.handled.count | counter | routeId, camelContext |
| camel.route.policy.count | count | routeId, camelContext |
| camel.route.policy.total_time | total | routeId, camelContext |
| camel.route.policy.max | gauge | routeId, camelContext |
| camel.routes.running.value | gauge | |

Mean processing time = camel.route.policy.total_time / camel.route.policy.count. Min processing time is not available (Micrometer does not track minimums).
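
The mean formula above can be sketched as follows. The class and method names are hypothetical. The result carries whatever time unit camel.route.policy.total_time is stored in, which this document does not specify.

```java
import java.util.OptionalDouble;

// Hypothetical helper illustrating the formula; not part of the server code.
final class RouteTimingSketch {

    /**
     * Mean processing time = total_time / count.
     * Returns empty when no exchanges were recorded, to avoid dividing by zero.
     */
    static OptionalDouble meanProcessingTime(double totalTime, double count) {
        if (count <= 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(totalTime / count);
    }
}
```

For example, a total_time of 1200 with a count of 4 yields a mean of 300 in the same unit as total_time.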

Cameleer agent metrics

| Metric name | Type | Tags |
| --- | --- | --- |
| cameleer.chunks.exported.count | counter | instanceId |
| cameleer.chunks.dropped.count | counter | instanceId, reason |
| cameleer.sse.reconnects.count | counter | instanceId |
| cameleer.taps.evaluated.count | counter | instanceId |
| cameleer.metrics.exported.count | counter | instanceId |

Server self-metrics persistence

ServerMetricsSnapshotScheduler walks MeterRegistry.getMeters() every 60 s (configurable via cameleer.server.self-metrics.interval-ms) and writes one row per Micrometer Measurement to the ClickHouse server_metrics table. Full registry is captured — Spring Boot Actuator series (jvm.*, process.*, http.server.requests, hikaricp.*, jdbc.*, tomcat.*, logback.events, system.*) plus cameleer.* and alerting_*.

Table (cameleer-server-app/src/main/resources/clickhouse/init.sql):

server_metrics(tenant_id, collected_at, server_instance_id,
               metric_name, metric_type, statistic, metric_value,
               tags Map(String,String), server_received_at)
  • metric_type — lowercase Micrometer Meter.Type (counter, gauge, timer, distribution_summary, long_task_timer, other)
  • statistic — Micrometer Statistic.getTagValueRepresentation() (value, count, total, total_time, max, mean, active_tasks, duration). Timers emit 3 rows per tick (count + total_time + max); gauges/counters emit 1 (statistic='value' or 'count').
  • No environment column — the server is env-agnostic.
  • tenant_id threaded from cameleer.server.tenant.id (single-tenant per server).
  • server_instance_id resolved once at boot by ServerInstanceIdConfig (property → HOSTNAME → localhost → UUID fallback). Rotates across restarts so counter resets are unambiguous.
  • TTL: 90 days (vs 365 for agent_metrics). Write-only in v1 — no query endpoint or UI page. Inspect via ClickHouse admin: /api/v1/admin/clickhouse/query or direct SQL.
  • Toggle off entirely with cameleer.server.self-metrics.enabled=false (uses @ConditionalOnProperty).
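
The server_instance_id fallback chain (property → HOSTNAME → localhost → UUID) might look roughly like the sketch below. This is not the actual ServerInstanceIdConfig. The class and method names are hypothetical, and reading "localhost" as the local host name from InetAddress is an assumption, as is treating blank values as missing.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.UUID;

// Hypothetical sketch of a first-usable-value fallback chain.
final class InstanceIdSketch {

    /**
     * Resolves an instance ID from, in order: the configured property value,
     * the HOSTNAME environment value, the local host name, and a random UUID.
     */
    static String resolve(String configured, String hostnameEnv) {
        if (isUsable(configured)) {
            return configured.trim();
        }
        if (isUsable(hostnameEnv)) {
            return hostnameEnv.trim();
        }
        try {
            // Assumption: "localhost" in the chain means the resolved local host name.
            String local = InetAddress.getLocalHost().getHostName();
            if (isUsable(local)) {
                return local;
            }
        } catch (UnknownHostException e) {
            // Host name could not be resolved; fall through to the UUID.
        }
        return UUID.randomUUID().toString();
    }

    private static boolean isUsable(String value) {
        return value != null && !value.isBlank();
    }
}
```

A caller would pass the configured value and System.getenv("HOSTNAME"), for example InstanceIdSketch.resolve(configuredId, System.getenv("HOSTNAME")), where configuredId stands in for whatever property the server actually reads.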