cameleer-server/docs/server-self-metrics.md
hsiegeln 48ce75bf38 feat(server): persist server self-metrics into ClickHouse
Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:20:45 +02:00


Server Self-Metrics — Reference for Dashboard Builders

This is the reference for the SaaS team building the server-health dashboard. It documents the server_metrics ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.

tl;dr — Every 60 s, every meter in the server's Micrometer registry (all cameleer.*, all alerting_*, and the full Spring Boot Actuator set) is written into ClickHouse as one row per (meter, statistic) pair. No external Prometheus required.


Table schema

server_metrics (
    tenant_id          LowCardinality(String) DEFAULT 'default',
    collected_at       DateTime64(3),
    server_instance_id LowCardinality(String),
    metric_name        LowCardinality(String),
    metric_type        LowCardinality(String),   -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic          LowCardinality(String) DEFAULT 'value',
    metric_value       Float64,
    tags               Map(String, String) DEFAULT map(),
    server_received_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE

What each column means

| Column | Notes |
| --- | --- |
| tenant_id | Always filter by this. One tenant per server deployment. |
| server_instance_id | Stable id per server process: property → HOSTNAME env → DNS → random UUID. Rotates on restart, so counters restart cleanly. |
| metric_name | Raw Micrometer meter name. Dots, not underscores. |
| metric_type | Lowercase Micrometer Meter.Type. |
| statistic | Which Measurement this row is. Counters/gauges → value or count. Timers → three rows per tick: count, total_time (or total), max. Distribution summaries → same shape. |
| metric_value | Float64. Non-finite values (NaN / ±∞) are dropped before insert. |
| tags | Map(String, String). Micrometer tags copied verbatim. |
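To see the (meter, statistic) fan-out in practice, a query like this (a sketch; assumes the default tenant) lists recent rows for one timer — each tick produces a count, a total_time (or total), and a max row per tag set:

```sql
-- Recent rows for one timer: expect statistic values
-- 'count', 'total_time' (or 'total'), and 'max' per tick and tag set.
SELECT collected_at, statistic, metric_value, tags['type'] AS flush_type
FROM server_metrics
WHERE tenant_id = 'default'
  AND metric_name = 'cameleer.ingestion.flush.duration'
ORDER BY collected_at DESC, statistic
LIMIT 12;
```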

Counter semantics (important)

Counters are cumulative totals since meter registration, same convention as Prometheus. To get a rate, compute a delta within a server_instance_id:

SELECT
    toStartOfMinute(collected_at) AS minute,
    metric_value - lagInFrame(metric_value, 1, metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
    ) AS per_minute_delta
FROM server_metrics
WHERE metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;

On restart the server_instance_id rotates, so lagInFrame() partitioned by server_instance_id gives monotonic segments without fighting counter resets. (ClickHouse spells the lag window function lagInFrame, not LAG.)

Retention

90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.


How to query

Via the admin ClickHouse endpoint

POST /api/v1/admin/clickhouse/query
Authorization: Bearer <admin-jwt>
Content-Type: text/plain

SELECT metric_name, statistic, count()
FROM server_metrics
WHERE collected_at >= now() - INTERVAL 1 HOUR
GROUP BY 1, 2 ORDER BY 1, 2

Requires infrastructureendpoints=true and the ADMIN role. For a SaaS control plane you will likely want a dedicated read-only CH user scoped to this table — the /api/v1/admin/clickhouse/query path is a human-facing admin tool, not a programmatic API.

Direct ClickHouse access

Read directly from ClickHouse with a read-only user (GRANT SELECT ON cameleer.server_metrics TO dashboard_ro). All queries must filter by tenant_id.
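A minimal sketch of that dedicated read-only user. The user name dashboard_ro comes from the GRANT above; the cameleer database name and the password mechanism are assumptions — adapt to your deployment:

```sql
-- Read-only ClickHouse user scoped to the metrics table (sketch)
CREATE USER dashboard_ro IDENTIFIED WITH sha256_password BY '<strong-password>'
    SETTINGS readonly = 1;
GRANT SELECT ON cameleer.server_metrics TO dashboard_ro;
```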


Metric catalog

Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.

Cameleer business metrics — agent + ingestion

Source: cameleer-server-app/.../metrics/ServerMetrics.java.

| Metric | Type | Statistic | Tags | Meaning |
| --- | --- | --- | --- | --- |
| cameleer.agents.connected | gauge | value | state (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| cameleer.agents.sse.active | gauge | value | | Active SSE connections (command channel) |
| cameleer.agents.transitions | counter | count | transition (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| cameleer.ingestion.buffer.size | gauge | value | type (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| cameleer.ingestion.accumulator.pending | gauge | value | | Unfinalized execution chunks in the accumulator |
| cameleer.ingestion.drops | counter | count | reason (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| cameleer.ingestion.flush.duration | timer | count, total_time/total, max | type (execution/processor/log) | Flush latency per type |

Cameleer business metrics — deploy + auth

| Metric | Type | Statistic | Tags | Meaning |
| --- | --- | --- | --- | --- |
| cameleer.deployments.outcome | counter | count | status (running/failed/degraded) | Deploy outcome tally since boot |
| cameleer.deployments.duration | timer | count, total_time/total, max | | End-to-end deploy latency |
| cameleer.auth.failures | counter | count | reason (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |

Alerting subsystem metrics

Source: cameleer-server-app/.../alerting/metrics/AlertingMetrics.java.

| Metric | Type | Statistic | Tags | Meaning |
| --- | --- | --- | --- | --- |
| alerting_rules_total | gauge | value | state (enabled/disabled) | Cached 30 s from PostgreSQL alert_rules |
| alerting_instances_total | gauge | value | state (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL alert_instances |
| alerting_eval_errors_total | counter | count | kind (condition kind) | Evaluator exceptions per kind |
| alerting_circuit_opened_total | counter | count | kind | Circuit-breaker open transitions per kind |
| alerting_eval_duration_seconds | timer | count, total_time/total, max | kind | Per-kind evaluation latency |
| alerting_webhook_delivery_duration_seconds | timer | count, total_time/total, max | | Outbound webhook POST latency |
| alerting_notifications_total | counter | count | status (sent/failed/retry/giving_up) | Notification outcomes |

JVM — memory, GC, threads, classes

From Spring Boot Actuator (JvmMemoryMetrics, JvmGcMetrics, JvmThreadMetrics, ClassLoaderMetrics).

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| jvm.memory.used | gauge | area (heap/nonheap), id (pool name) | Bytes used per pool |
| jvm.memory.committed | gauge | area, id | Bytes committed per pool |
| jvm.memory.max | gauge | area, id | Pool max |
| jvm.memory.usage.after.gc | gauge | area, id | Usage right after the last collection |
| jvm.buffer.memory.used | gauge | id (direct/mapped) | NIO buffer bytes |
| jvm.buffer.count | gauge | id | NIO buffer count |
| jvm.buffer.total.capacity | gauge | id | NIO buffer capacity |
| jvm.threads.live | gauge | | Current live thread count |
| jvm.threads.daemon | gauge | | Current daemon thread count |
| jvm.threads.peak | gauge | | Peak thread count since start |
| jvm.threads.started | counter | | Cumulative threads started |
| jvm.threads.states | gauge | state (runnable/blocked/waiting/…) | Threads per state |
| jvm.classes.loaded | gauge | | Currently-loaded classes |
| jvm.classes.unloaded | counter | | Cumulative unloaded classes |
| jvm.gc.pause | timer | action, cause | Stop-the-world pause times — watch max |
| jvm.gc.concurrent.phase.time | timer | action, cause | Concurrent-phase durations (G1/ZGC) |
| jvm.gc.memory.allocated | counter | | Bytes allocated in the young gen |
| jvm.gc.memory.promoted | counter | | Bytes promoted to old gen |
| jvm.gc.overhead | gauge | | Fraction of CPU spent in GC (0–1) |
| jvm.gc.live.data.size | gauge | | Live data after last collection |
| jvm.gc.max.data.size | gauge | | Max old-gen size |
| jvm.info | gauge | vendor, runtime, version | Constant 1.0; tags carry the real info |

Process and system

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| process.cpu.usage | gauge | | CPU share consumed by this JVM (0–1) |
| process.cpu.time | gauge | | Cumulative CPU time (ns) |
| process.uptime | gauge | | ms since start |
| process.start.time | gauge | | Epoch start |
| process.files.open | gauge | | Open FDs |
| process.files.max | gauge | | FD ulimit |
| system.cpu.count | gauge | | Cores visible to the JVM |
| system.cpu.usage | gauge | | System-wide CPU (0–1) |
| system.load.average.1m | gauge | | 1-min load (Unix only) |
| disk.free | gauge | path | Free bytes on the mount that holds the JAR |
| disk.total | gauge | path | Total bytes |

HTTP server

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| http.server.requests | timer | method, uri, status, outcome, exception | Inbound HTTP: count, total_time/total, max |
| http.server.requests.active | long_task_timer | method, uri | In-flight requests — active_tasks statistic |

uri is the Spring-templated path (/api/v1/environments/{envSlug}/apps/{appSlug}), not the raw URL — cardinality stays bounded.

Tomcat

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| tomcat.sessions.active.current | gauge | | Currently active sessions |
| tomcat.sessions.active.max | gauge | | Max concurrent sessions observed |
| tomcat.sessions.alive.max | gauge | | Longest session lifetime (s) |
| tomcat.sessions.created | counter | | Cumulative session creates |
| tomcat.sessions.expired | counter | | Cumulative expirations |
| tomcat.sessions.rejected | counter | | Session creates refused |
| tomcat.threads.current | gauge | name | Connector thread count |
| tomcat.threads.busy | gauge | name | Connector threads currently serving a request |
| tomcat.threads.config.max | gauge | name | Configured max |

HikariCP (PostgreSQL pool)

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| hikaricp.connections | gauge | pool | Total connections |
| hikaricp.connections.active | gauge | pool | In-use |
| hikaricp.connections.idle | gauge | pool | Idle |
| hikaricp.connections.pending | gauge | pool | Threads waiting for a connection |
| hikaricp.connections.min | gauge | pool | Configured min |
| hikaricp.connections.max | gauge | pool | Configured max |
| hikaricp.connections.creation | timer | pool | Time to open a new connection |
| hikaricp.connections.acquire | timer | pool | Time to acquire from the pool |
| hikaricp.connections.usage | timer | pool | Time a connection was in use |
| hikaricp.connections.timeout | counter | pool | Pool acquisition timeouts — any non-zero rate is a problem |

Pools are named. You'll see HikariPool-1 (PostgreSQL) and a separate pool for ClickHouse (clickHouseJdbcTemplate).

JDBC generic

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| jdbc.connections.min | gauge | name | Same data as Hikari, surfaced generically |
| jdbc.connections.max | gauge | name | |
| jdbc.connections.active | gauge | name | |
| jdbc.connections.idle | gauge | name | |

Logging

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| logback.events | counter | level (error/warn/info/debug/trace) | Log events emitted since start — {level=error} is a useful panel |

Spring Boot lifecycle

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| application.started.time | timer | main.application.class | Cold-start duration |
| application.ready.time | timer | main.application.class | Time to ready |

Flyway

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| flyway.migrations | gauge | | Number of migrations applied (current schema) |

Executor pools (if any @Async executors exist)

When a ThreadPoolTaskExecutor bean is registered and tagged, Micrometer adds:

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| executor.active | gauge | name | Currently-running tasks |
| executor.queued | gauge | name | Queued tasks |
| executor.queue.remaining | gauge | name | Queue headroom |
| executor.pool.size | gauge | name | Current pool size |
| executor.pool.core | gauge | name | Core size |
| executor.pool.max | gauge | name | Max size |
| executor.completed | counter | name | Completed tasks |

Suggested dashboard panels

The shortlist below gives you a working health dashboard with 17 panels across five rows. All queries assume tenant_id is a dashboard variable.

Row: server health (top of dashboard)

  1. Agents by state — stacked area.

    SELECT toStartOfMinute(collected_at) AS t, tags['state'] AS state, avg(metric_value) AS agents
    FROM server_metrics
    WHERE tenant_id = {tenant} AND metric_name = 'cameleer.agents.connected'
      AND collected_at >= {from} AND collected_at < {to}
    GROUP BY t, state ORDER BY t;
    
  2. Ingestion buffer depth — line chart by type. Use cameleer.ingestion.buffer.size, same shape as above.

  3. Ingestion drops per minute — bar chart (per-minute delta).

    WITH sorted AS (
      SELECT toStartOfMinute(collected_at) AS minute,
             tags['reason'] AS reason,
             server_instance_id,
             max(metric_value) AS cumulative
      FROM server_metrics
      WHERE tenant_id = {tenant} AND metric_name = 'cameleer.ingestion.drops'
        AND statistic = 'count' AND collected_at >= {from} AND collected_at < {to}
      GROUP BY minute, reason, server_instance_id
    )
    SELECT minute, reason,
           cumulative - lagInFrame(cumulative, 1, cumulative) OVER (
             PARTITION BY reason, server_instance_id ORDER BY minute
           ) AS drops_per_minute
    FROM sorted ORDER BY minute;
    
  4. Auth failures per minute — same shape as drops, split by reason.

Row: JVM

  1. Heap used vs committed vs max — area chart. Filter metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max') with tags['area'] = 'heap', sum across pool ids.

  2. CPU % — line. process.cpu.usage and system.cpu.usage.

  3. GC pause p99 + max — jvm.gc.pause with statistic max, grouped by tags['cause'].

  4. Thread count — jvm.threads.live, jvm.threads.daemon, jvm.threads.peak.
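Panel 1 (heap used vs committed vs max) spelled out as a sketch. It assumes the 60 s collection cadence, so each series contributes roughly one sample per minute and summing across pool ids per minute is safe:

```sql
SELECT
    toStartOfMinute(collected_at) AS t,
    metric_name,
    sum(metric_value) AS bytes   -- sum across heap pool ids
FROM server_metrics
WHERE tenant_id = {tenant}
  AND metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')
  AND tags['area'] = 'heap'
  AND collected_at >= {from} AND collected_at < {to}
GROUP BY t, metric_name
ORDER BY t;
```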

Row: HTTP + DB

  1. HTTP p99 by URI — use http.server.requests with statistic='max' as a rough p99 proxy, or total_time/count for mean. Group by tags['uri']. Filter tags['outcome'] = 'SUCCESS'.

  2. HTTP error rate — count where tags['status'] starts with 5, divided by total.

  3. HikariCP pool saturation — overlay hikaricp.connections.active and hikaricp.connections.pending. If pending > 0 sustained, the pool is too small.

  4. Hikari acquire timeouts per minute — delta of hikaricp.connections.timeout. Any non-zero rate is a red flag.
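Panel 2 (HTTP error rate) as a sketch, reusing the lagInFrame delta pattern from the drops query: compute per-series counter deltas first, then the 5xx share per minute. Dashboard variables {tenant}/{from}/{to} as elsewhere in this doc:

```sql
WITH deltas AS (
    SELECT
        toStartOfMinute(collected_at) AS minute,
        tags['status'] AS status,
        metric_value - lagInFrame(metric_value, 1, metric_value) OVER (
            PARTITION BY server_instance_id, tags ORDER BY collected_at
        ) AS d
    FROM server_metrics
    WHERE tenant_id = {tenant}
      AND metric_name = 'http.server.requests'
      AND statistic = 'count'
      AND collected_at >= {from} AND collected_at < {to}
)
SELECT
    minute,
    sumIf(d, status LIKE '5%') / nullIf(sum(d), 0) AS error_rate
FROM deltas
GROUP BY minute
ORDER BY minute;
```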

Row: alerting (collapsible)

  1. Alerting instances by state — alerting_instances_total stacked by tags['state'].

  2. Eval errors per minute by kind — delta of alerting_eval_errors_total by tags['kind'].

  3. Webhook delivery p99 — alerting_webhook_delivery_duration_seconds with statistic='max'.

Row: deployments (runtime-enabled only)

  1. Deploy outcomes last 24 h — counter delta of cameleer.deployments.outcome grouped by tags['status'].

  2. Deploy duration p99 — cameleer.deployments.duration with statistic='max' (or total_time/count for mean).


Notes for the dashboard implementer

  • Always filter by tenant_id. It's the first column in the sort key; queries that skip it scan the entire table.
  • Prefer predicate pushdown on metric_name + statistic. Both are LowCardinality, so metric_name = 'x' AND statistic = 'count' is cheap.
  • Treat server_instance_id as a natural partition for counter math. Never compute deltas across it — you'll get negative numbers on restart.
  • total_time vs total. SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect total_time. Tests may write total. When in doubt, accept either.
  • Cardinality warning: http.server.requests tags include uri and status. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without @PathVariable, you'll see explosion here. Monitor count(DISTINCT concat(metric_name, toString(tags))) and alert if it spikes.
  • The dashboard should be read-only. No one writes into server_metrics except the server itself — there's no API to push or delete rows.
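The cardinality watchdog suggested above, as a ready-to-paste sketch (the 1-day window is an assumption; tune it to your alerting cadence):

```sql
SELECT count(DISTINCT concat(metric_name, toString(tags))) AS active_series
FROM server_metrics
WHERE tenant_id = {tenant}
  AND collected_at >= now() - INTERVAL 1 DAY;
```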

Changelog

  • 2026-04-23 — initial write. Write-only in v1 (no REST endpoint or admin page). Reach out to the server team before building a write-back path; we'd rather cut a proper API than have the dashboard hit ClickHouse directly forever.