Snapshot the full Micrometer registry (cameleer business metrics, alerting metrics, and Spring Boot Actuator defaults) every 60s into a new server_metrics table so server health survives restarts without an external Prometheus. Includes a dashboard-builder reference for the SaaS team. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server Self-Metrics — Reference for Dashboard Builders
This is the reference for the SaaS team building the server-health dashboard. It documents the server_metrics ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.
tl;dr — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.
Table schema
```sql
CREATE TABLE server_metrics (
    tenant_id          LowCardinality(String) DEFAULT 'default',
    collected_at       DateTime64(3),
    server_instance_id LowCardinality(String),
    metric_name        LowCardinality(String),
    metric_type        LowCardinality(String), -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic          LowCardinality(String) DEFAULT 'value',
    metric_value       Float64,
    tags               Map(String, String) DEFAULT map(),
    server_received_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```
What each column means
| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Stable id per server process: property → HOSTNAME env → DNS → random UUID. Rotates on restart, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Dots, not underscores. |
| `metric_type` | Lowercase Micrometer `Meter.Type`. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | Float64. Non-finite values (NaN / ±∞) are dropped before insert. |
| `tags` | Map(String, String). Micrometer tags copied verbatim. |
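To make the row shape concrete, here is an illustrative Python sketch of how one snapshot tick fans a meter out into `(statistic, value)` rows, including the non-finite filter described above. This is not the server's actual Java code; `meter_to_rows` and its arguments are hypothetical names.

```python
import math
from datetime import datetime, timezone

def meter_to_rows(name, meter_type, measurements, tags, instance_id, tenant_id="default"):
    """Fan one meter out into one row per (meter, statistic) pair.

    `measurements` maps statistic name -> value; a timer tick, for example,
    yields {"count": ..., "total_time": ..., "max": ...}.
    """
    collected_at = datetime.now(timezone.utc)
    rows = []
    for statistic, value in measurements.items():
        if not math.isfinite(value):  # mirrors the server's NaN / ±inf drop
            continue
        rows.append({
            "tenant_id": tenant_id,
            "collected_at": collected_at,
            "server_instance_id": instance_id,
            "metric_name": name,
            "metric_type": meter_type,
            "statistic": statistic,
            "metric_value": float(value),
            "tags": dict(tags),
        })
    return rows

rows = meter_to_rows(
    "cameleer.ingestion.flush.duration", "timer",
    {"count": 42.0, "total_time": 1.87, "max": float("nan")},
    {"type": "execution"}, "srv-a1b2",  # hypothetical instance id
)
# The NaN "max" measurement is filtered out, leaving two rows.
```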
Counter semantics (important)
Counters are cumulative totals since meter registration, same convention as Prometheus. To get a rate, compute a delta within a server_instance_id:
```sql
SELECT
    toStartOfMinute(collected_at) AS minute,
    metric_value - lagInFrame(metric_value, 1, metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = 'default'
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```
On restart the `server_instance_id` rotates, so a simple `lagInFrame()` partitioned by `server_instance_id` gives monotonic segments without fighting counter resets.
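The same delta rule can be sketched client-side. A hedged Python illustration (the `per_instance_deltas` helper is hypothetical, not part of any shipped client): deltas are computed only within a `server_instance_id`, so a restart starts a fresh segment instead of producing a negative delta.

```python
def per_instance_deltas(samples):
    """Compute counter deltas, never crossing a server_instance_id.

    `samples` is a list of (server_instance_id, collected_at, value),
    assumed sorted by collected_at. On restart the instance id rotates,
    so each id yields its own monotonic segment.
    """
    last = {}     # instance_id -> previous cumulative value
    deltas = []
    for instance_id, ts, value in samples:
        prev = last.get(instance_id)
        if prev is not None:
            deltas.append((ts, value - prev))
        last[instance_id] = value
    return deltas

samples = [
    ("srv-old", 1, 10.0), ("srv-old", 2, 14.0),  # +4 within the old instance
    ("srv-new", 3, 0.0),  ("srv-new", 4, 3.0),   # restart: new segment, +3
]
# → [(2, 4.0), (4, 3.0)] — no spurious negative delta at the restart boundary
```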
Retention
90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.
How to query
Via the admin ClickHouse endpoint
```
POST /api/v1/admin/clickhouse/query
Authorization: Bearer <admin-jwt>
Content-Type: text/plain

SELECT metric_name, statistic, count()
FROM server_metrics
WHERE collected_at >= now() - INTERVAL 1 HOUR
GROUP BY 1, 2 ORDER BY 1, 2
```
Requires `infrastructureendpoints=true` and the ADMIN role. For a SaaS control plane you will likely want a dedicated read-only ClickHouse user scoped to this table — the `/api/v1/admin/clickhouse/query` path is a human-facing admin tool, not a programmatic API.
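A minimal sketch of assembling that request programmatically, should you need it for ad-hoc tooling. The host name is a placeholder and `build_admin_query` is a hypothetical helper; the key detail is that the body is the raw SQL string (`text/plain`), not JSON.

```python
def build_admin_query(base_url, admin_jwt, sql):
    """Assemble the POST for the admin ClickHouse query endpoint.

    The body is the raw SQL string (Content-Type: text/plain), not JSON.
    base_url and admin_jwt are deployment-specific.
    """
    return {
        "url": base_url.rstrip("/") + "/api/v1/admin/clickhouse/query",
        "headers": {
            "Authorization": f"Bearer {admin_jwt}",
            "Content-Type": "text/plain",
        },
        "body": sql,
    }

req = build_admin_query(
    "https://cameleer.example.com",  # hypothetical host
    "<admin-jwt>",
    "SELECT count() FROM server_metrics WHERE collected_at >= now() - INTERVAL 1 HOUR",
)
# Send with any HTTP client, e.g.
# requests.post(req["url"], headers=req["headers"], data=req["body"])
```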
Direct JDBC (recommended for the dashboard)
Read directly from ClickHouse with a read-only user (`GRANT SELECT ON cameleer.server_metrics TO dashboard_ro`). All queries must filter by `tenant_id`.
Metric catalog
Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.
Cameleer business metrics — agent + ingestion
Source: cameleer-server-app/.../metrics/ServerMetrics.java.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.agents.connected` | gauge | value | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| `cameleer.agents.sse.active` | gauge | value | — | Active SSE connections (command channel) |
| `cameleer.agents.transitions` | counter | count | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| `cameleer.ingestion.buffer.size` | gauge | value | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| `cameleer.ingestion.accumulator.pending` | gauge | value | — | Unfinalized execution chunks in the accumulator |
| `cameleer.ingestion.drops` | counter | count | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| `cameleer.ingestion.flush.duration` | timer | count, total_time/total, max | `type` (execution/processor/log) | Flush latency per type |
Cameleer business metrics — deploy + auth
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.deployments.outcome` | counter | count | `status` (running/failed/degraded) | Deploy outcome tally since boot |
| `cameleer.deployments.duration` | timer | count, total_time/total, max | — | End-to-end deploy latency |
| `cameleer.auth.failures` | counter | count | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |
Alerting subsystem metrics
Source: cameleer-server-app/.../alerting/metrics/AlertingMetrics.java.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `alerting_rules_total` | gauge | value | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` |
| `alerting_instances_total` | gauge | value | `state` (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL `alert_instances` |
| `alerting_eval_errors_total` | counter | count | `kind` (condition kind) | Evaluator exceptions per kind |
| `alerting_circuit_opened_total` | counter | count | `kind` | Circuit-breaker open transitions per kind |
| `alerting_eval_duration_seconds` | timer | count, total_time/total, max | `kind` | Per-kind evaluation latency |
| `alerting_webhook_delivery_duration_seconds` | timer | count, total_time/total, max | — | Outbound webhook POST latency |
| `alerting_notifications_total` | counter | count | `status` (sent/failed/retry/giving_up) | Notification outcomes |
JVM — memory, GC, threads, classes
From Spring Boot Actuator (JvmMemoryMetrics, JvmGcMetrics, JvmThreadMetrics, ClassLoaderMetrics).
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
| `jvm.threads.live` | gauge | — | Current live thread count |
| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
| `jvm.threads.peak` | gauge | — | Peak thread count since start |
| `jvm.threads.started` | counter | — | Cumulative threads started |
| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch max |
| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant 1.0; tags carry the real info |
Process and system
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
| `process.uptime` | gauge | — | ms since start |
| `process.start.time` | gauge | — | Epoch start |
| `process.files.open` | gauge | — | Open FDs |
| `process.files.max` | gauge | — | FD ulimit |
| `system.cpu.count` | gauge | — | Cores visible to the JVM |
| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
| `disk.total` | gauge | `path` | Total bytes |
HTTP server
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: count, total_time/total, max |
| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |
`uri` is the Spring-templated path (`/api/v1/environments/{envSlug}/apps/{appSlug}`), not the raw URL — cardinality stays bounded.
Tomcat
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
| `tomcat.sessions.created` | counter | — | Cumulative session creates |
| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
| `tomcat.sessions.rejected` | counter | — | Session creates refused |
| `tomcat.threads.current` | gauge | `name` | Connector thread count |
| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
| `tomcat.threads.config.max` | gauge | `name` | Configured max |
HikariCP (PostgreSQL pool)
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `hikaricp.connections` | gauge | `pool` | Total connections |
| `hikaricp.connections.active` | gauge | `pool` | In-use |
| `hikaricp.connections.idle` | gauge | `pool` | Idle |
| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
| `hikaricp.connections.min` | gauge | `pool` | Configured min |
| `hikaricp.connections.max` | gauge | `pool` | Configured max |
| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |
Pools are named. You'll see `HikariPool-1` (PostgreSQL) and a separate pool for ClickHouse (`clickHouseJdbcTemplate`).
JDBC generic
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
| `jdbc.connections.max` | gauge | `name` | |
| `jdbc.connections.active` | gauge | `name` | |
| `jdbc.connections.idle` | gauge | `name` | |
Logging
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |
Spring Boot lifecycle
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `application.started.time` | timer | `main.application.class` | Cold-start duration |
| `application.ready.time` | timer | `main.application.class` | Time to ready |
Flyway
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |
Executor pools (if any @Async executors exist)
When a ThreadPoolTaskExecutor bean is registered and tagged, Micrometer adds:
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `executor.active` | gauge | `name` | Currently-running tasks |
| `executor.queued` | gauge | `name` | Queued tasks |
| `executor.queue.remaining` | gauge | `name` | Queue headroom |
| `executor.pool.size` | gauge | `name` | Current pool size |
| `executor.pool.core` | gauge | `name` | Core size |
| `executor.pool.max` | gauge | `name` | Max size |
| `executor.completed` | counter | `name` | Completed tasks |
Suggested dashboard panels
The shortlist below gives you a working health dashboard with ~12 panels. All queries assume tenant_id is a dashboard variable.
Row: server health (top of dashboard)
- **Agents by state** — stacked area.

  ```sql
  SELECT
      toStartOfMinute(collected_at) AS t,
      tags['state'] AS state,
      avg(metric_value) AS agent_count
  FROM server_metrics
  WHERE tenant_id = {tenant}
    AND metric_name = 'cameleer.agents.connected'
    AND collected_at >= {from} AND collected_at < {to}
  GROUP BY t, state
  ORDER BY t;
  ```

- **Ingestion buffer depth** — line chart by `type`. Use `cameleer.ingestion.buffer.size`, same shape as above.

- **Ingestion drops per minute** — bar chart (per-minute delta).

  ```sql
  WITH sorted AS (
      SELECT
          toStartOfMinute(collected_at) AS minute,
          tags['reason'] AS reason,
          server_instance_id,
          max(metric_value) AS cumulative
      FROM server_metrics
      WHERE tenant_id = {tenant}
        AND metric_name = 'cameleer.ingestion.drops'
        AND statistic = 'count'
        AND collected_at >= {from} AND collected_at < {to}
      GROUP BY minute, reason, server_instance_id
  )
  SELECT
      minute,
      reason,
      cumulative - lagInFrame(cumulative, 1, cumulative) OVER (
          PARTITION BY reason, server_instance_id
          ORDER BY minute
      ) AS drops_per_minute
  FROM sorted
  ORDER BY minute;
  ```

- **Auth failures per minute** — same shape as drops, split by `reason`.
Row: JVM
- **Heap used vs committed vs max** — area chart. Filter `metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')` with `tags['area'] = 'heap'`, sum across pool `id`s.
- **CPU %** — line. `process.cpu.usage` and `system.cpu.usage`.
- **GC pause p99 + max** — `jvm.gc.pause` with statistic `max`, grouped by `tags['cause']`.
- **Thread count** — `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`.
Row: HTTP + DB
- **HTTP p99 by URI** — use `http.server.requests` with `statistic='max'` as a rough p99 proxy, or `total_time/count` for mean. Group by `tags['uri']`. Filter `tags['outcome'] = 'SUCCESS'`.
- **HTTP error rate** — count where `tags['status']` starts with 5, divided by total.
- **HikariCP pool saturation** — overlay `hikaricp.connections.active` and `hikaricp.connections.pending`. If `pending > 0` sustained, the pool is too small.
- **Hikari acquire timeouts per minute** — delta of `hikaricp.connections.timeout`. Any non-zero rate is a red flag.
Row: alerting (collapsible)
- **Alerting instances by state** — `alerting_instances_total` stacked by `tags['state']`.
- **Eval errors per minute by kind** — delta of `alerting_eval_errors_total` by `tags['kind']`.
- **Webhook delivery p99** — `alerting_webhook_delivery_duration_seconds` with `statistic='max'`.
Row: deployments (runtime-enabled only)
- **Deploy outcomes last 24 h** — counter delta of `cameleer.deployments.outcome` grouped by `tags['status']`.
- **Deploy duration p99** — `cameleer.deployments.duration` with `statistic='max'` (or `total_time/count` for mean).
Notes for the dashboard implementer
- Always filter by `tenant_id`. It's the first column in the sort key; queries that skip it scan the entire table.
- Prefer predicate pushdown on `metric_name` + `statistic`. Both are `LowCardinality`, so `metric_name = 'x' AND statistic = 'count'` is cheap.
- Treat `server_instance_id` as a natural partition for counter math. Never compute deltas across it — you'll get negative numbers on restart.
- `total_time` vs `total`: SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. Tests may write `total`. When in doubt, accept either.
- Cardinality warning: `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see explosion here. Monitor `count(DISTINCT concat(metric_name, toString(tags)))` and alert if it spikes.
- The dashboard should be read-only. No one writes into `server_metrics` except the server itself — there's no API to push or delete rows.
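The cardinality check from the notes can also be prototyped client-side. A hedged Python sketch (`series_cardinality` is a hypothetical helper that mirrors the `count(DISTINCT concat(metric_name, toString(tags)))` expression): two rows belong to the same series only if both the metric name and the full tag set match.

```python
def series_cardinality(rows):
    """Count distinct (metric_name, tags) series in a sample of rows."""
    seen = set()
    for metric_name, tags in rows:
        # Sort tag items so insertion order doesn't create false distinct series.
        seen.add((metric_name, tuple(sorted(tags.items()))))
    return len(seen)

rows = [
    ("http.server.requests", {"uri": "/api/v1/apps/{appSlug}", "status": "200"}),
    ("http.server.requests", {"uri": "/api/v1/apps/{appSlug}", "status": "500"}),
    ("http.server.requests", {"uri": "/api/v1/apps/{appSlug}", "status": "200"}),  # duplicate series
]
# → 2 distinct series; alert if this number spikes over time
```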
Changelog
- 2026-04-23 — initial write. Write-only in v1 (no REST endpoint or admin page). Reach out to the server team before building a write-back path; we'd rather cut a proper API than have the dashboard hit ClickHouse directly forever.