cameleer-server/docs/server-self-metrics.md
hsiegeln 48ce75bf38 feat(server): persist server self-metrics into ClickHouse
Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:20:45 +02:00

# Server Self-Metrics — Reference for Dashboard Builders
This is the reference for the SaaS team building the server-health dashboard. It documents the `server_metrics` ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.
> **tl;dr** — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.
---
## Table schema
```sql
server_metrics (
    tenant_id           LowCardinality(String) DEFAULT 'default',
    collected_at        DateTime64(3),
    server_instance_id  LowCardinality(String),
    metric_name         LowCardinality(String),
    metric_type         LowCardinality(String),  -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic           LowCardinality(String) DEFAULT 'value',
    metric_value        Float64,
    tags                Map(String, String) DEFAULT map(),
    server_received_at  DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```
### What each column means
| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Stable id per server process: property → `HOSTNAME` env → DNS → random UUID. **Rotates on restart**, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Dots, not underscores. |
| `metric_type` | Lowercase Micrometer `Meter.Type`. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. |
| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. |
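To see the row shape concretely, inspect one timer meter for a single tag value (filtering on the `execution` flush type documented below keeps it to one series):

```sql
-- One timer series: each 60 s tick lands as three rows (count, total_time/total, max).
SELECT collected_at, statistic, metric_value
FROM server_metrics
WHERE tenant_id = {tenant}
  AND metric_name = 'cameleer.ingestion.flush.duration'
  AND tags['type'] = 'execution'
ORDER BY collected_at DESC, statistic
LIMIT 6;  -- the two most recent ticks
```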
### Counter semantics (important)
Counters are **cumulative totals since meter registration**, same convention as Prometheus. To get a rate, compute a delta within a `server_instance_id`:
```sql
SELECT
    toStartOfMinute(collected_at) AS minute,
    metric_value - lagInFrame(metric_value, 1, metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = {tenant}  -- always scope to a tenant
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```
On restart the `server_instance_id` rotates, so a delta via `lagInFrame()` partitioned by `server_instance_id` (ClickHouse's frame-aware lag; there is no bare `LAG()`) gives monotonic segments without fighting counter resets — the third argument makes the first row of each segment a zero delta instead of a spike.
### Retention
90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.
---
## How to query
### Via the admin ClickHouse endpoint
```
POST /api/v1/admin/clickhouse/query
Authorization: Bearer <admin-jwt>
Content-Type: text/plain

SELECT metric_name, statistic, count()
FROM server_metrics
WHERE collected_at >= now() - INTERVAL 1 HOUR
GROUP BY 1, 2 ORDER BY 1, 2
```
Requires `infrastructureendpoints=true` and the `ADMIN` role. For a SaaS control plane you will likely want a dedicated read-only CH user scoped to this table — the `/api/v1/admin/clickhouse/query` path is a human-facing admin tool, not a programmatic API.
### Direct JDBC (recommended for the dashboard)
Read directly from ClickHouse (read-only user, `GRANT SELECT ON cameleer.server_metrics TO dashboard_ro`). All queries must filter by `tenant_id`.
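A minimal sketch of provisioning that user, assuming the database is named `cameleer` and you manage users directly in ClickHouse (names and auth method are placeholders; adapt to however your deployment provisions users):

```sql
-- Hypothetical read-only user for the dashboard; '<strong-password>' is a placeholder.
CREATE USER IF NOT EXISTS dashboard_ro IDENTIFIED WITH sha256_password BY '<strong-password>';
GRANT SELECT ON cameleer.server_metrics TO dashboard_ro;

-- Optionally cap blast radius: force read-only mode and a query time budget.
ALTER USER dashboard_ro SETTINGS readonly = 1, max_execution_time = 30;
```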
---
## Metric catalog
Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.
### Cameleer business metrics — agent + ingestion
Source: `cameleer-server-app/.../metrics/ServerMetrics.java`.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) |
| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator |
| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type |
### Cameleer business metrics — deploy + auth
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot |
| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency |
| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |
### Alerting subsystem metrics
Source: `cameleer-server-app/.../alerting/metrics/AlertingMetrics.java`.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` |
| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL `alert_instances` |
| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind |
| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind |
| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency |
| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency |
| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes |
### JVM — memory, GC, threads, classes
From Spring Boot Actuator (`JvmMemoryMetrics`, `JvmGcMetrics`, `JvmThreadMetrics`, `ClassLoaderMetrics`).
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
| `jvm.threads.live` | gauge | — | Current live thread count |
| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
| `jvm.threads.peak` | gauge | — | Peak thread count since start |
| `jvm.threads.started` | counter | — | Cumulative threads started |
| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch `max` |
| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant `1.0`; tags carry the real info |
### Process and system
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
| `process.uptime` | gauge | — | ms since start |
| `process.start.time` | gauge | — | Epoch start |
| `process.files.open` | gauge | — | Open FDs |
| `process.files.max` | gauge | — | FD ulimit |
| `system.cpu.count` | gauge | — | Cores visible to the JVM |
| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
| `disk.total` | gauge | `path` | Total bytes |
### HTTP server
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: count, total_time/total, max |
| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |
`uri` is the Spring-templated path (`/api/v1/environments/{envSlug}/apps/{appSlug}`), not the raw URL — cardinality stays bounded.
### Tomcat
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
| `tomcat.sessions.created` | counter | — | Cumulative session creates |
| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
| `tomcat.sessions.rejected` | counter | — | Session creates refused |
| `tomcat.threads.current` | gauge | `name` | Connector thread count |
| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
| `tomcat.threads.config.max` | gauge | `name` | Configured max |
### HikariCP (PostgreSQL pool)
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `hikaricp.connections` | gauge | `pool` | Total connections |
| `hikaricp.connections.active` | gauge | `pool` | In-use |
| `hikaricp.connections.idle` | gauge | `pool` | Idle |
| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
| `hikaricp.connections.min` | gauge | `pool` | Configured min |
| `hikaricp.connections.max` | gauge | `pool` | Configured max |
| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |
Pools are named. You'll see `HikariPool-1` (PostgreSQL) and a separate pool for ClickHouse (`clickHouseJdbcTemplate`).
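If you are unsure which pool names your deployment actually emits, a quick discovery query lists them (scoped to a tenant, last day of data):

```sql
SELECT DISTINCT tags['pool'] AS pool
FROM server_metrics
WHERE tenant_id = {tenant}
  AND metric_name = 'hikaricp.connections'
  AND collected_at >= now() - INTERVAL 1 DAY;
```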
### JDBC generic
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
| `jdbc.connections.max` | gauge | `name` | |
| `jdbc.connections.active` | gauge | `name` | |
| `jdbc.connections.idle` | gauge | `name` | |
### Logging
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |
### Spring Boot lifecycle
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `application.started.time` | timer | `main.application.class` | Cold-start duration |
| `application.ready.time` | timer | `main.application.class` | Time to ready |
### Flyway
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |
### Executor pools (if any `@Async` executors exist)
When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds:
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `executor.active` | gauge | `name` | Currently-running tasks |
| `executor.queued` | gauge | `name` | Queued tasks |
| `executor.queue.remaining` | gauge | `name` | Queue headroom |
| `executor.pool.size` | gauge | `name` | Current pool size |
| `executor.pool.core` | gauge | `name` | Core size |
| `executor.pool.max` | gauge | `name` | Max size |
| `executor.completed` | counter | `name` | Completed tasks |
---
## Suggested dashboard panels
The shortlist below (17 panels across five rows) gives you a working health dashboard. All queries assume `tenant_id` is a dashboard variable.
### Row: server health (top of dashboard)
1. **Agents by state** — stacked area.
```sql
SELECT
    toStartOfMinute(collected_at) AS t,
    tags['state'] AS state,
    avg(metric_value) AS count
FROM server_metrics
WHERE tenant_id = {tenant} AND metric_name = 'cameleer.agents.connected'
  AND collected_at >= {from} AND collected_at < {to}
GROUP BY t, state
ORDER BY t;
```
2. **Ingestion buffer depth** — line chart by `type`. Use `cameleer.ingestion.buffer.size` with the same query shape as panel 1.
3. **Ingestion drops per minute** — bar chart (per-minute delta).
```sql
WITH sorted AS (
    SELECT
        toStartOfMinute(collected_at) AS minute,
        tags['reason'] AS reason,
        server_instance_id,
        max(metric_value) AS cumulative
    FROM server_metrics
    WHERE tenant_id = {tenant} AND metric_name = 'cameleer.ingestion.drops'
      AND statistic = 'count' AND collected_at >= {from} AND collected_at < {to}
    GROUP BY minute, reason, server_instance_id
)
SELECT
    minute, reason,
    cumulative - lagInFrame(cumulative, 1, cumulative) OVER (
        PARTITION BY reason, server_instance_id ORDER BY minute
    ) AS drops_per_minute
FROM sorted
ORDER BY minute;
```
4. **Auth failures per minute** — same shape as drops, split by `reason`.
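Spelled out, panel 4 is the same delta pattern as the drops query in panel 3 — only the metric name and tag key change:

```sql
WITH sorted AS (
    SELECT
        toStartOfMinute(collected_at) AS minute,
        tags['reason'] AS reason,
        server_instance_id,
        max(metric_value) AS cumulative
    FROM server_metrics
    WHERE tenant_id = {tenant} AND metric_name = 'cameleer.auth.failures'
      AND statistic = 'count' AND collected_at >= {from} AND collected_at < {to}
    GROUP BY minute, reason, server_instance_id
)
SELECT
    minute, reason,
    cumulative - lagInFrame(cumulative, 1, cumulative) OVER (
        PARTITION BY reason, server_instance_id ORDER BY minute
    ) AS failures_per_minute
FROM sorted
ORDER BY minute;
```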
### Row: JVM
5. **Heap used vs committed vs max** — area chart. Filter `metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')` with `tags['area'] = 'heap'`, sum across pool `id`s.
6. **CPU %** — line. `process.cpu.usage` and `system.cpu.usage`.
7. **GC pause p99 + max** — `jvm.gc.pause` with statistic `max`, grouped by `tags['cause']`.
8. **Thread count** — `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`.
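Panel 5 spelled out as a query — one series per metric, summed across heap pool `id`s (this assumes one snapshot per minute, which holds at the 60 s collection interval):

```sql
SELECT
    toStartOfMinute(collected_at) AS t,
    metric_name,
    sum(metric_value) AS bytes  -- sum across pool ids within the heap area
FROM server_metrics
WHERE tenant_id = {tenant}
  AND metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')
  AND tags['area'] = 'heap'
  AND collected_at >= {from} AND collected_at < {to}
GROUP BY t, metric_name
ORDER BY t;
```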
### Row: HTTP + DB
9. **HTTP p99 by URI** — use `http.server.requests` with `statistic='max'` as a rough p99 proxy, or `total_time/count` for mean. Group by `tags['uri']`. Filter `tags['outcome'] = 'SUCCESS'`.
10. **HTTP error rate** — count where `tags['status']` starts with `5`, divided by total.
11. **HikariCP pool saturation** — overlay `hikaricp.connections.active` and `hikaricp.connections.pending`. If `pending > 0` sustained, the pool is too small.
12. **Hikari acquire timeouts per minute** — delta of `hikaricp.connections.timeout`. Any non-zero rate is a red flag.
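Panel 11 as a concrete query, overlaying active and pending per pool (both are gauges, so no delta math is needed):

```sql
SELECT
    toStartOfMinute(collected_at) AS t,
    tags['pool'] AS pool,
    metric_name,
    max(metric_value) AS value
FROM server_metrics
WHERE tenant_id = {tenant}
  AND metric_name IN ('hikaricp.connections.active', 'hikaricp.connections.pending')
  AND collected_at >= {from} AND collected_at < {to}
GROUP BY t, pool, metric_name
ORDER BY t;
```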
### Row: alerting (collapsible)
13. **Alerting instances by state** — `alerting_instances_total` stacked by `tags['state']`.
14. **Eval errors per minute by kind** — delta of `alerting_eval_errors_total` by `tags['kind']`.
15. **Webhook delivery p99** — `alerting_webhook_delivery_duration_seconds` with `statistic='max'`.
### Row: deployments (runtime-enabled only)
16. **Deploy outcomes last 24 h** — counter delta of `cameleer.deployments.outcome` grouped by `tags['status']`.
17. **Deploy duration p99** — `cameleer.deployments.duration` with `statistic='max'` (or `total_time/count` for mean).
---
## Notes for the dashboard implementer
- **Always filter by `tenant_id`.** It's the first column in the sort key; queries that skip it scan the entire table.
- **Prefer predicate pushdown on `metric_name` + `statistic`.** Both are `LowCardinality`, so `metric_name = 'x' AND statistic = 'count'` is cheap.
- **Treat `server_instance_id` as a natural partition for counter math.** Never compute deltas across it — you'll get negative numbers on restart.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. Tests may write `total`. When in doubt, accept either.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see explosion here. Monitor `count(DISTINCT concat(metric_name, toString(tags)))` and alert if it spikes.
- **The dashboard should be read-only.** No one writes into `server_metrics` except the server itself — there's no API to push or delete rows.
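The cardinality watchdog from the note above, as a runnable sketch — distinct series per day, to eyeball (or alert on) sudden jumps; tune the lookback and threshold to your deployment:

```sql
-- Distinct (metric_name, tags) series seen per day; alert if this spikes.
SELECT
    toDate(collected_at) AS day,
    uniqExact(concat(metric_name, toString(tags))) AS distinct_series
FROM server_metrics
WHERE tenant_id = {tenant}
  AND collected_at >= now() - INTERVAL 14 DAY
GROUP BY day
ORDER BY day;
```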
---
## Changelog
- 2026-04-23 — initial write. Write-only in v1 (no REST endpoint or admin page). Reach out to the server team before building a write-back path; we'd rather cut a proper API than have the dashboard hit ClickHouse directly forever.