cameleer-server/docs/server-self-metrics.md
hsiegeln 48ce75bf38 feat(server): persist server self-metrics into ClickHouse
Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:20:45 +02:00


Server Self-Metrics — Reference for Dashboard Builders

This is the reference for the SaaS team building the server-health dashboard. It documents the server_metrics ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.

tl;dr — Every 60 s, every meter in the server's Micrometer registry (all cameleer.*, all alerting_*, and the full Spring Boot Actuator set) is written into ClickHouse as one row per (meter, statistic) pair. No external Prometheus required.


Table schema

server_metrics (
    tenant_id          LowCardinality(String) DEFAULT 'default',
    collected_at       DateTime64(3),
    server_instance_id LowCardinality(String),
    metric_name        LowCardinality(String),
    metric_type        LowCardinality(String),   -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic          LowCardinality(String) DEFAULT 'value',
    metric_value       Float64,
    tags               Map(String, String) DEFAULT map(),
    server_received_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE

What each column means

| Column | Notes |
| --- | --- |
| tenant_id | Always filter by this. One tenant per server deployment. |
| server_instance_id | Stable id per server process: property → HOSTNAME env → DNS → random UUID. Rotates on restart, so counters restart cleanly. |
| metric_name | Raw Micrometer meter name. Dots, not underscores. |
| metric_type | Lowercase Micrometer Meter.Type. |
| statistic | Which Measurement this row is. Counters/gauges → value or count. Timers → three rows per tick: count, total_time (or total), max. Distribution summaries → same shape. |
| metric_value | Float64. Non-finite values (NaN / ±∞) are dropped before insert. |
| tags | Map(String, String). Micrometer tags copied verbatim. |
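To see the (meter, statistic) fan-out in practice, a query like this (a sketch; assumes the default tenant) lists recent rows for one timer — each tick produces a count, a total_time (or total), and a max row per tag set:

```sql
-- Recent rows for one timer: expect statistic values
-- 'count', 'total_time' (or 'total'), and 'max' per tick and tag set.
SELECT collected_at, statistic, metric_value, tags['type'] AS flush_type
FROM server_metrics
WHERE tenant_id = 'default'
  AND metric_name = 'cameleer.ingestion.flush.duration'
ORDER BY collected_at DESC, statistic
LIMIT 12;
```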

Counter semantics (important)

Counters are cumulative totals since meter registration, same convention as Prometheus. To get a rate, compute a delta within a server_instance_id:

SELECT
    toStartOfMinute(collected_at) AS minute,
    metric_value - lagInFrame(metric_value, 1, metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
    ) AS per_minute_delta
FROM server_metrics
WHERE metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;

On restart the server_instance_id rotates, so lagInFrame() partitioned by server_instance_id gives monotonic segments without fighting counter resets. (ClickHouse spells the lag window function lagInFrame, not LAG.)

Retention

90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.


How to query

Via the admin ClickHouse endpoint

POST /api/v1/admin/clickhouse/query
Authorization: Bearer <admin-jwt>
Content-Type: text/plain

SELECT metric_name, statistic, count()
FROM server_metrics
WHERE collected_at >= now() - INTERVAL 1 HOUR
GROUP BY 1, 2 ORDER BY 1, 2

Requires infrastructureendpoints=true and the ADMIN role. For a SaaS control plane you will likely want a dedicated read-only CH user scoped to this table — the /api/v1/admin/clickhouse/query path is a human-facing admin tool, not a programmatic API.

Direct ClickHouse access

Read directly from ClickHouse with a read-only user (GRANT SELECT ON cameleer.server_metrics TO dashboard_ro). All queries must filter by tenant_id.
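A minimal sketch of that dedicated read-only user. The user name dashboard_ro comes from the GRANT above; the cameleer database name and the password mechanism are assumptions — adapt to your deployment:

```sql
-- Read-only ClickHouse user scoped to the metrics table (sketch)
CREATE USER dashboard_ro IDENTIFIED WITH sha256_password BY '<strong-password>'
    SETTINGS readonly = 1;
GRANT SELECT ON cameleer.server_metrics TO dashboard_ro;
```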


Metric catalog

Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.

Cameleer business metrics — agent + ingestion

Source: cameleer-server-app/.../metrics/ServerMetrics.java.

| Metric | Type | Statistic | Tags | Meaning |
| --- | --- | --- | --- | --- |
| cameleer.agents.connected | gauge | value | state (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| cameleer.agents.sse.active | gauge | value | | Active SSE connections (command channel) |
| cameleer.agents.transitions | counter | count | transition (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| cameleer.ingestion.buffer.size | gauge | value | type (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| cameleer.ingestion.accumulator.pending | gauge | value | | Unfinalized execution chunks in the accumulator |
| cameleer.ingestion.drops | counter | count | reason (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| cameleer.ingestion.flush.duration | timer | count, total_time/total, max | type (execution/processor/log) | Flush latency per type |

Cameleer business metrics — deploy + auth

| Metric | Type | Statistic | Tags | Meaning |
| --- | --- | --- | --- | --- |
| cameleer.deployments.outcome | counter | count | status (running/failed/degraded) | Deploy outcome tally since boot |
| cameleer.deployments.duration | timer | count, total_time/total, max | | End-to-end deploy latency |
| cameleer.auth.failures | counter | count | reason (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |

Alerting subsystem metrics

Source: cameleer-server-app/.../alerting/metrics/AlertingMetrics.java.

| Metric | Type | Statistic | Tags | Meaning |
| --- | --- | --- | --- | --- |
| alerting_rules_total | gauge | value | state (enabled/disabled) | Cached 30 s from PostgreSQL alert_rules |
| alerting_instances_total | gauge | value | state (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL alert_instances |
| alerting_eval_errors_total | counter | count | kind (condition kind) | Evaluator exceptions per kind |
| alerting_circuit_opened_total | counter | count | kind | Circuit-breaker open transitions per kind |
| alerting_eval_duration_seconds | timer | count, total_time/total, max | kind | Per-kind evaluation latency |
| alerting_webhook_delivery_duration_seconds | timer | count, total_time/total, max | | Outbound webhook POST latency |
| alerting_notifications_total | counter | count | status (sent/failed/retry/giving_up) | Notification outcomes |

JVM — memory, GC, threads, classes

From Spring Boot Actuator (JvmMemoryMetrics, JvmGcMetrics, JvmThreadMetrics, ClassLoaderMetrics).

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| jvm.memory.used | gauge | area (heap/nonheap), id (pool name) | Bytes used per pool |
| jvm.memory.committed | gauge | area, id | Bytes committed per pool |
| jvm.memory.max | gauge | area, id | Pool max |
| jvm.memory.usage.after.gc | gauge | area, id | Usage right after the last collection |
| jvm.buffer.memory.used | gauge | id (direct/mapped) | NIO buffer bytes |
| jvm.buffer.count | gauge | id | NIO buffer count |
| jvm.buffer.total.capacity | gauge | id | NIO buffer capacity |
| jvm.threads.live | gauge | | Current live thread count |
| jvm.threads.daemon | gauge | | Current daemon thread count |
| jvm.threads.peak | gauge | | Peak thread count since start |
| jvm.threads.started | counter | | Cumulative threads started |
| jvm.threads.states | gauge | state (runnable/blocked/waiting/…) | Threads per state |
| jvm.classes.loaded | gauge | | Currently-loaded classes |
| jvm.classes.unloaded | counter | | Cumulative unloaded classes |
| jvm.gc.pause | timer | action, cause | Stop-the-world pause times — watch max |
| jvm.gc.concurrent.phase.time | timer | action, cause | Concurrent-phase durations (G1/ZGC) |
| jvm.gc.memory.allocated | counter | | Bytes allocated in the young gen |
| jvm.gc.memory.promoted | counter | | Bytes promoted to old gen |
| jvm.gc.overhead | gauge | | Fraction of CPU spent in GC (0–1) |
| jvm.gc.live.data.size | gauge | | Live data after last collection |
| jvm.gc.max.data.size | gauge | | Max old-gen size |
| jvm.info | gauge | vendor, runtime, version | Constant 1.0; tags carry the real info |

Process and system

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| process.cpu.usage | gauge | | CPU share consumed by this JVM (0–1) |
| process.cpu.time | gauge | | Cumulative CPU time (ns) |
| process.uptime | gauge | | ms since start |
| process.start.time | gauge | | Epoch start |
| process.files.open | gauge | | Open FDs |
| process.files.max | gauge | | FD ulimit |
| system.cpu.count | gauge | | Cores visible to the JVM |
| system.cpu.usage | gauge | | System-wide CPU (0–1) |
| system.load.average.1m | gauge | | 1-min load (Unix only) |
| disk.free | gauge | path | Free bytes on the mount that holds the JAR |
| disk.total | gauge | path | Total bytes |

HTTP server

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| http.server.requests | timer | method, uri, status, outcome, exception | Inbound HTTP: count, total_time/total, max |
| http.server.requests.active | long_task_timer | method, uri | In-flight requests — active_tasks statistic |

uri is the Spring-templated path (/api/v1/environments/{envSlug}/apps/{appSlug}), not the raw URL — cardinality stays bounded.

Tomcat

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| tomcat.sessions.active.current | gauge | | Currently active sessions |
| tomcat.sessions.active.max | gauge | | Max concurrent sessions observed |
| tomcat.sessions.alive.max | gauge | | Longest session lifetime (s) |
| tomcat.sessions.created | counter | | Cumulative session creates |
| tomcat.sessions.expired | counter | | Cumulative expirations |
| tomcat.sessions.rejected | counter | | Session creates refused |
| tomcat.threads.current | gauge | name | Connector thread count |
| tomcat.threads.busy | gauge | name | Connector threads currently serving a request |
| tomcat.threads.config.max | gauge | name | Configured max |

HikariCP (PostgreSQL pool)

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| hikaricp.connections | gauge | pool | Total connections |
| hikaricp.connections.active | gauge | pool | In-use |
| hikaricp.connections.idle | gauge | pool | Idle |
| hikaricp.connections.pending | gauge | pool | Threads waiting for a connection |
| hikaricp.connections.min | gauge | pool | Configured min |
| hikaricp.connections.max | gauge | pool | Configured max |
| hikaricp.connections.creation | timer | pool | Time to open a new connection |
| hikaricp.connections.acquire | timer | pool | Time to acquire from the pool |
| hikaricp.connections.usage | timer | pool | Time a connection was in use |
| hikaricp.connections.timeout | counter | pool | Pool acquisition timeouts — any non-zero rate is a problem |

Pools are named. You'll see HikariPool-1 (PostgreSQL) and a separate pool for ClickHouse (clickHouseJdbcTemplate).

JDBC generic

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| jdbc.connections.min | gauge | name | Same data as Hikari, surfaced generically |
| jdbc.connections.max | gauge | name | |
| jdbc.connections.active | gauge | name | |
| jdbc.connections.idle | gauge | name | |

Logging

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| logback.events | counter | level (error/warn/info/debug/trace) | Log events emitted since start — {level=error} is a useful panel |

Spring Boot lifecycle

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| application.started.time | timer | main.application.class | Cold-start duration |
| application.ready.time | timer | main.application.class | Time to ready |

Flyway

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| flyway.migrations | gauge | | Number of migrations applied (current schema) |

Executor pools (if any @Async executors exist)

When a ThreadPoolTaskExecutor bean is registered and tagged, Micrometer adds:

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| executor.active | gauge | name | Currently-running tasks |
| executor.queued | gauge | name | Queued tasks |
| executor.queue.remaining | gauge | name | Queue headroom |
| executor.pool.size | gauge | name | Current pool size |
| executor.pool.core | gauge | name | Core size |
| executor.pool.max | gauge | name | Max size |
| executor.completed | counter | name | Completed tasks |

Suggested dashboard panels

The shortlist below gives you a working health dashboard with 17 panels across five rows. All queries assume tenant_id is a dashboard variable.

Row: server health (top of dashboard)

  1. Agents by state — stacked area.

    SELECT toStartOfMinute(collected_at) AS t, tags['state'] AS state, avg(metric_value) AS agents
    FROM server_metrics
    WHERE tenant_id = {tenant} AND metric_name = 'cameleer.agents.connected'
      AND collected_at >= {from} AND collected_at < {to}
    GROUP BY t, state ORDER BY t;
    
  2. Ingestion buffer depth — line chart by type. Use cameleer.ingestion.buffer.size, same shape as above.

  3. Ingestion drops per minute — bar chart (per-minute delta).

    WITH sorted AS (
      SELECT toStartOfMinute(collected_at) AS minute,
             tags['reason'] AS reason,
             server_instance_id,
             max(metric_value) AS cumulative
      FROM server_metrics
      WHERE tenant_id = {tenant} AND metric_name = 'cameleer.ingestion.drops'
        AND statistic = 'count' AND collected_at >= {from} AND collected_at < {to}
      GROUP BY minute, reason, server_instance_id
    )
    SELECT minute, reason,
           cumulative - lagInFrame(cumulative, 1, cumulative) OVER (
             PARTITION BY reason, server_instance_id ORDER BY minute
           ) AS drops_per_minute
    FROM sorted ORDER BY minute;
    
  4. Auth failures per minute — same shape as drops, split by reason.

Row: JVM

  1. Heap used vs committed vs max — area chart. Filter metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max') with tags['area'] = 'heap', sum across pool ids.

  2. CPU % — line. process.cpu.usage and system.cpu.usage.

  3. GC pause p99 + max — jvm.gc.pause with statistic max, grouped by tags['cause'].

  4. Thread count — jvm.threads.live, jvm.threads.daemon, jvm.threads.peak.
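Panel 1 (heap used vs committed vs max) spelled out as a sketch. It assumes the 60 s collection cadence, so each series contributes roughly one sample per minute and summing across pool ids per minute is safe:

```sql
SELECT
    toStartOfMinute(collected_at) AS t,
    metric_name,
    sum(metric_value) AS bytes   -- sum across heap pool ids
FROM server_metrics
WHERE tenant_id = {tenant}
  AND metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')
  AND tags['area'] = 'heap'
  AND collected_at >= {from} AND collected_at < {to}
GROUP BY t, metric_name
ORDER BY t;
```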

Row: HTTP + DB

  1. HTTP p99 by URI — use http.server.requests with statistic='max' as a rough p99 proxy, or total_time/count for mean. Group by tags['uri']. Filter tags['outcome'] = 'SUCCESS'.

  2. HTTP error rate — count where tags['status'] starts with 5, divided by total.

  3. HikariCP pool saturation — overlay hikaricp.connections.active and hikaricp.connections.pending. If pending > 0 sustained, the pool is too small.

  4. Hikari acquire timeouts per minute — delta of hikaricp.connections.timeout. Any non-zero rate is a red flag.
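Panel 2 (HTTP error rate) as a sketch, reusing the lagInFrame delta pattern from the drops query: compute per-series counter deltas first, then the 5xx share per minute. Dashboard variables {tenant}/{from}/{to} as elsewhere in this doc:

```sql
WITH deltas AS (
    SELECT
        toStartOfMinute(collected_at) AS minute,
        tags['status'] AS status,
        metric_value - lagInFrame(metric_value, 1, metric_value) OVER (
            PARTITION BY server_instance_id, tags ORDER BY collected_at
        ) AS d
    FROM server_metrics
    WHERE tenant_id = {tenant}
      AND metric_name = 'http.server.requests'
      AND statistic = 'count'
      AND collected_at >= {from} AND collected_at < {to}
)
SELECT
    minute,
    sumIf(d, status LIKE '5%') / nullIf(sum(d), 0) AS error_rate
FROM deltas
GROUP BY minute
ORDER BY minute;
```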

Row: alerting (collapsible)

  1. Alerting instances by state — alerting_instances_total stacked by tags['state'].

  2. Eval errors per minute by kind — delta of alerting_eval_errors_total by tags['kind'].

  3. Webhook delivery p99 — alerting_webhook_delivery_duration_seconds with statistic='max'.

Row: deployments (runtime-enabled only)

  1. Deploy outcomes last 24 h — counter delta of cameleer.deployments.outcome grouped by tags['status'].

  2. Deploy duration p99 — cameleer.deployments.duration with statistic='max' (or total_time/count for mean).


Notes for the dashboard implementer

  • Always filter by tenant_id. It's the first column in the sort key; queries that skip it scan the entire table.
  • Prefer predicate pushdown on metric_name + statistic. Both are LowCardinality, so metric_name = 'x' AND statistic = 'count' is cheap.
  • Treat server_instance_id as a natural partition for counter math. Never compute deltas across it — you'll get negative numbers on restart.
  • total_time vs total. SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect total_time. Tests may write total. When in doubt, accept either.
  • Cardinality warning: http.server.requests tags include uri and status. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without @PathVariable, you'll see explosion here. Monitor count(DISTINCT concat(metric_name, toString(tags))) and alert if it spikes.
  • The dashboard should be read-only. No one writes into server_metrics except the server itself — there's no API to push or delete rows.
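The cardinality watchdog suggested above, as a ready-to-paste sketch (the 1-day window is an assumption; tune it to your alerting cadence):

```sql
SELECT count(DISTINCT concat(metric_name, toString(tags))) AS active_series
FROM server_metrics
WHERE tenant_id = {tenant}
  AND collected_at >= now() - INTERVAL 1 DAY;
```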

Changelog

  • 2026-04-23 — initial write. Write-only in v1 (no REST endpoint or admin page). Reach out to the server team before building a write-back path; we'd rather cut a proper API than have the dashboard hit ClickHouse directly forever.