# Server Self-Metrics — Reference for Dashboard Builders

This is the reference for the SaaS team building the server-health dashboard. It documents the `server_metrics` ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.

> **tl;dr** — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.

---

## Table schema

```sql
server_metrics (
    tenant_id          LowCardinality(String) DEFAULT 'default',
    collected_at       DateTime64(3),
    server_instance_id LowCardinality(String),
    metric_name        LowCardinality(String),
    metric_type        LowCardinality(String),   -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic          LowCardinality(String) DEFAULT 'value',
    metric_value       Float64,
    tags               Map(String, String) DEFAULT map(),
    server_received_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```

### What each column means

| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Stable id per server process: property → `HOSTNAME` env → DNS → random UUID. **Rotates on restart**, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Most meters use dots (`cameleer.agents.connected`); the `alerting_*` meters keep their underscore names. |
| `metric_type` | Lowercase Micrometer `Meter.Type`. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. |
| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. |

### Counter semantics (important)

Counters are **cumulative totals since meter registration**, same convention as Prometheus. To get a rate, compute a delta within a `server_instance_id`:

```sql
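SELECT
    toStartOfMinute(collected_at) AS minute,
    -- Current sample minus the previous sample of the same series.
    -- The first sample of a series has an empty frame, so any() falls back to 0
    -- and the delta equals the full cumulative value; drop that row if it
    -- distorts the chart.
    metric_value - any(metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = 'default'        -- always filter by tenant
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;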
```

On restart the `server_instance_id` rotates, so a previous-row lookup partitioned by `server_instance_id` (`lagInFrame()` in ClickHouse, the counterpart of SQL `LAG()`) yields monotonic segments and you never have to handle counter resets.

### Retention

90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.

---

## How to query

### Via the admin ClickHouse endpoint

```
POST /api/v1/admin/clickhouse/query
Authorization: Bearer <admin-jwt>
Content-Type: text/plain

SELECT metric_name, statistic, count()
FROM server_metrics
WHERE collected_at >= now() - INTERVAL 1 HOUR
GROUP BY 1, 2 ORDER BY 1, 2
```

Requires `infrastructureendpoints=true` and the `ADMIN` role. For a SaaS control plane you will likely want a dedicated read-only CH user scoped to this table — the `/api/v1/admin/clickhouse/query` path is a human-facing admin tool, not a programmatic API.

### Direct JDBC (recommended for the dashboard)

Read directly from ClickHouse with a dedicated read-only user (`GRANT SELECT ON cameleer.server_metrics TO dashboard_ro`). All queries must filter by `tenant_id`.
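
A minimal sketch of provisioning that user. The user name and the grant come from the text above; the password literal and the `readonly = 1` setting are assumptions to adapt to your own security setup.

```sql
-- Assumption: 'change-me' is a placeholder, supply your own secret.
-- Assumption: readonly = 1 blocks writes and setting changes for this user.
CREATE USER IF NOT EXISTS dashboard_ro
    IDENTIFIED WITH sha256_password BY 'change-me'
    SETTINGS readonly = 1;

-- Scope the user to the metrics table only.
GRANT SELECT ON cameleer.server_metrics TO dashboard_ro;
```

Some clients set session-level settings on connect; if yours does, `readonly = 2` (reads plus setting changes) may be the better fit.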

---

## Metric catalog

Every series below is populated. Names follow Micrometer conventions (dots, not underscores), except the `alerting_*` meters, which are registered with underscore names. Use these as the starting point for dashboard panels: pick the handful you care about and ignore the rest.
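
To check which series your deployment actually writes, a discovery query along these lines may help. It is a sketch: `{tenant}` stands for your tenant id, and the one-hour window is an arbitrary choice.

```sql
-- List every meter seen in the last hour with its statistics and tag keys.
SELECT
    metric_name,
    metric_type,
    arraySort(groupUniqArray(statistic))          AS statistics,
    arraySort(groupUniqArrayArray(mapKeys(tags))) AS tag_keys,
    max(collected_at)                             AS last_seen
FROM server_metrics
WHERE tenant_id = {tenant}
  AND collected_at >= now() - INTERVAL 1 HOUR
GROUP BY metric_name, metric_type
ORDER BY metric_name;
```

A meter that appears in the catalog but not in this result has simply not been registered yet, for example an executor pool that does not exist in your build.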

### Cameleer business metrics — agent + ingestion

Source: `cameleer-server-app/.../metrics/ServerMetrics.java`.

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) |
| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator |
| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type |

### Cameleer business metrics — deploy + auth

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot |
| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency |
| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |

### Alerting subsystem metrics

Source: `cameleer-server-app/.../alerting/metrics/AlertingMetrics.java`.

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Alert rules per state, cached for 30 s from the PostgreSQL `alert_rules` table |
| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) | Alert instances per state, cached for 30 s from the PostgreSQL `alert_instances` table |
| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind |
| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind |
| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency |
| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency |
| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes |

### JVM — memory, GC, threads, classes

From Spring Boot Actuator (`JvmMemoryMetrics`, `JvmGcMetrics`, `JvmThreadMetrics`, `ClassLoaderMetrics`).

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
| `jvm.threads.live` | gauge | — | Current live thread count |
| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
| `jvm.threads.peak` | gauge | — | Peak thread count since start |
| `jvm.threads.started` | counter | — | Cumulative threads started |
| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch `max` |
| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant `1.0`; tags carry the real info |

### Process and system

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
| `process.uptime` | gauge | — | ms since start |
| `process.start.time` | gauge | — | Epoch start |
| `process.files.open` | gauge | — | Open FDs |
| `process.files.max` | gauge | — | FD ulimit |
| `system.cpu.count` | gauge | — | Cores visible to the JVM |
| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
| `disk.total` | gauge | `path` | Total bytes |

### HTTP server

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: count, total_time/total, max |
| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |

`uri` is the Spring-templated path (`/api/v1/environments/{envSlug}/apps/{appSlug}`), not the raw URL — cardinality stays bounded.
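
Because a timer's `count` and `total_time` rows are cumulative, a per-minute mean latency needs the delta of both. The query below is one possible sketch: it assumes `{tenant}`, `{from}`, and `{to}` are dashboard variables, accepts either `total_time` or `total`, and reports latency in whatever time unit the server writes.

```sql
WITH per_series AS (
    -- One cumulative (count, total time) pair per minute for each series.
    SELECT toStartOfMinute(collected_at) AS minute,
           server_instance_id,
           tags['uri']    AS uri,
           toString(tags) AS series,
           maxIf(metric_value, statistic = 'count')                  AS cum_count,
           maxIf(metric_value, statistic IN ('total_time', 'total')) AS cum_time
    FROM server_metrics
    WHERE tenant_id = {tenant}
      AND metric_name = 'http.server.requests'
      AND statistic IN ('count', 'total_time', 'total')
      AND collected_at >= {from} AND collected_at < {to}
    GROUP BY minute, server_instance_id, uri, series
),
deltas AS (
    -- Per-minute increase; the first sample of each series yields 0.
    SELECT minute, uri,
           cum_count - lagInFrame(cum_count, 1, cum_count) OVER w AS req_delta,
           cum_time  - lagInFrame(cum_time,  1, cum_time)  OVER w AS time_delta
    FROM per_series
    WINDOW w AS (PARTITION BY server_instance_id, series ORDER BY minute)
)
SELECT minute, uri,
       sum(req_delta) AS requests,
       if(requests > 0, sum(time_delta) / requests, NULL) AS mean_latency
FROM deltas
GROUP BY minute, uri
ORDER BY minute, uri;
```

Partitioning by `server_instance_id` and the full tag set keeps each delta inside one monotonic series, so restarts never produce negative values.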

### Tomcat

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
| `tomcat.sessions.created` | counter | — | Cumulative session creates |
| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
| `tomcat.sessions.rejected` | counter | — | Session creates refused |
| `tomcat.threads.current` | gauge | `name` | Connector thread count |
| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
| `tomcat.threads.config.max` | gauge | `name` | Configured max |

### HikariCP (PostgreSQL pool)

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `hikaricp.connections` | gauge | `pool` | Total connections |
| `hikaricp.connections.active` | gauge | `pool` | In-use |
| `hikaricp.connections.idle` | gauge | `pool` | Idle |
| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
| `hikaricp.connections.min` | gauge | `pool` | Configured min |
| `hikaricp.connections.max` | gauge | `pool` | Configured max |
| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |

Pools are named. You'll see `HikariPool-1` (PostgreSQL) and a separate pool for ClickHouse (`clickHouseJdbcTemplate`).

### JDBC generic

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
| `jdbc.connections.max` | gauge | `name` | |
| `jdbc.connections.active` | gauge | `name` | |
| `jdbc.connections.idle` | gauge | `name` | |

### Logging

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |

### Spring Boot lifecycle

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `application.started.time` | timer | `main.application.class` | Cold-start duration |
| `application.ready.time` | timer | `main.application.class` | Time to ready |

### Flyway

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |

### Executor pools (if any `@Async` executors exist)

When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds:

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `executor.active` | gauge | `name` | Currently-running tasks |
| `executor.queued` | gauge | `name` | Queued tasks |
| `executor.queue.remaining` | gauge | `name` | Queue headroom |
| `executor.pool.size` | gauge | `name` | Current pool size |
| `executor.pool.core` | gauge | `name` | Core size |
| `executor.pool.max` | gauge | `name` | Max size |
| `executor.completed` | counter | `name` | Completed tasks |

---

## Suggested dashboard panels

The shortlist below gives you a working health dashboard with 17 panels across five rows. All queries assume `{tenant}`, `{from}`, and `{to}` are dashboard variables holding the `tenant_id` and the time range.

### Row: server health (top of dashboard)

1. **Agents by state** — stacked area.
   ```sql
   SELECT toStartOfMinute(collected_at) AS t, tags['state'] AS state, avg(metric_value) AS count
   FROM server_metrics
   WHERE tenant_id = {tenant} AND metric_name = 'cameleer.agents.connected'
     AND collected_at >= {from} AND collected_at < {to}
   GROUP BY t, state ORDER BY t;
   ```

2. **Ingestion buffer depth** — line chart by `type`. Query `cameleer.ingestion.buffer.size` with the same shape as panel 1, grouping by `tags['type']` instead of `tags['state']`.

3. **Ingestion drops per minute** — bar chart (per-minute delta).
   ```sql
   WITH sorted AS (
     SELECT toStartOfMinute(collected_at) AS minute,
            tags['reason'] AS reason,
            server_instance_id,
            max(metric_value) AS cumulative
     FROM server_metrics
     WHERE tenant_id = {tenant} AND metric_name = 'cameleer.ingestion.drops'
       AND statistic = 'count' AND collected_at >= {from} AND collected_at < {to}
     GROUP BY minute, reason, server_instance_id
   )
   SELECT minute, reason,
          cumulative - lagInFrame(cumulative, 1, cumulative) OVER (
            PARTITION BY reason, server_instance_id ORDER BY minute
          ) AS drops_per_minute
   FROM sorted ORDER BY minute;
   ```

4. **Auth failures per minute** — same shape as drops, split by `reason`.
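
For panel 2, the "same shape" query might look like the sketch below. The `depth_max` column is an addition not described above, included on the assumption that short spikes matter more than the per-minute average.

```sql
SELECT toStartOfMinute(collected_at) AS t,
       tags['type']      AS buffer_type,
       avg(metric_value) AS depth_avg,
       max(metric_value) AS depth_max   -- assumption: keeps short spikes visible
FROM server_metrics
WHERE tenant_id = {tenant} AND metric_name = 'cameleer.ingestion.buffer.size'
  AND collected_at >= {from} AND collected_at < {to}
GROUP BY t, buffer_type ORDER BY t, buffer_type;
```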

### Row: JVM

5. **Heap used vs committed vs max** — area chart. Filter `metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')` with `tags['area'] = 'heap'`, sum across pool `id`s.

6. **CPU %** — line. `process.cpu.usage` and `system.cpu.usage`.

7. **GC pause max** — `jvm.gc.pause` with statistic `max`, grouped by `tags['cause']`. The table stores no percentiles, so `max` stands in for p99.

8. **Thread count** — `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`.
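
Panel 5 could be sketched as follows. The inner query averages each heap pool per minute and the outer query sums the pools. The `metric_value >= 0` filter is an assumption: `jvm.memory.max` can report -1 for pools without a defined maximum, which would otherwise skew the sum.

```sql
SELECT t, server_instance_id, metric_name,
       sum(pool_bytes) AS heap_bytes
FROM (
    -- Per-minute average for every individual heap pool.
    SELECT toStartOfMinute(collected_at) AS t,
           server_instance_id,
           metric_name,
           tags['id']        AS pool,
           avg(metric_value) AS pool_bytes
    FROM server_metrics
    WHERE tenant_id = {tenant}
      AND metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')
      AND tags['area'] = 'heap'
      AND metric_value >= 0          -- skip pools with an undefined max
      AND collected_at >= {from} AND collected_at < {to}
    GROUP BY t, server_instance_id, metric_name, pool
)
GROUP BY t, server_instance_id, metric_name
ORDER BY t, metric_name;
```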

### Row: HTTP + DB

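9. **HTTP p99 by URI** — use `http.server.requests` with `statistic='max'` as a rough p99 proxy. For the mean, divide the delta of `total_time` by the delta of `count` over the same interval; both are cumulative, so the raw ratio `total_time/count` is the mean since meter registration. Group by `tags['uri']`. Filter `tags['outcome'] = 'SUCCESS'`.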

10. **HTTP error rate** — count where `tags['status']` starts with `5`, divided by total.

11. **HikariCP pool saturation** — overlay `hikaricp.connections.active` and `hikaricp.connections.pending`. If `pending > 0` sustained, the pool is too small.

12. **Hikari acquire timeouts per minute** — delta of `hikaricp.connections.timeout`. Any non-zero rate is a red flag.
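
One way to compute panel 10, sketched under the same dashboard-variable assumptions. Request counts are cumulative, so the query first derives per-minute deltas for each series and only then splits them by status.

```sql
WITH per_series AS (
    -- Cumulative request count per minute for every tag combination.
    SELECT toStartOfMinute(collected_at) AS minute,
           server_instance_id,
           tags['status']    AS status,
           toString(tags)    AS series,
           max(metric_value) AS cumulative
    FROM server_metrics
    WHERE tenant_id = {tenant}
      AND metric_name = 'http.server.requests'
      AND statistic = 'count'
      AND collected_at >= {from} AND collected_at < {to}
    GROUP BY minute, server_instance_id, status, series
),
deltas AS (
    -- Requests served in each minute; the first sample of a series yields 0.
    SELECT minute, status,
           cumulative - lagInFrame(cumulative, 1, cumulative) OVER (
               PARTITION BY server_instance_id, series ORDER BY minute
           ) AS req_delta
    FROM per_series
)
SELECT minute,
       sumIf(req_delta, startsWith(status, '5')) AS errors,
       sum(req_delta)                            AS total,
       if(total > 0, errors / total, 0)          AS error_rate
FROM deltas
GROUP BY minute
ORDER BY minute;
```

A series that first appears inside the window contributes 0 for its first minute, so the earliest requests to a new URI or status are undercounted.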

### Row: alerting (collapsible)

13. **Alerting instances by state** — `alerting_instances_total` stacked by `tags['state']`.

14. **Eval errors per minute by kind** — delta of `alerting_eval_errors_total` by `tags['kind']`.

15. **Webhook delivery p99** — `alerting_webhook_delivery_duration_seconds` with `statistic='max'` as a rough p99 proxy.

### Row: deployments (runtime-enabled only)

16. **Deploy outcomes last 24 h** — counter delta of `cameleer.deployments.outcome` grouped by `tags['status']`.

17. **Deploy duration p99** — `cameleer.deployments.duration` with `statistic='max'` as a rough p99 proxy (or `total_time/count` for mean).
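
For panel 16, a window-wide counter delta can be taken per instance as `max - min` and then summed. This is a sketch rather than the definitive query; the 24-hour window is hard-coded to match the panel title.

```sql
SELECT status, sum(delta) AS deployments
FROM (
    -- Increase of the cumulative counter within one server process.
    SELECT tags['status'] AS status,
           server_instance_id,
           max(metric_value) - min(metric_value) AS delta
    FROM server_metrics
    WHERE tenant_id = {tenant}
      AND metric_name = 'cameleer.deployments.outcome'
      AND statistic = 'count'
      AND collected_at >= now() - INTERVAL 24 HOUR
    GROUP BY status, server_instance_id
)
GROUP BY status
ORDER BY status;
```

Deployments that finished before an instance's first sample in the window are not counted, so treat the result as a lower bound after a restart.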

---

## Notes for the dashboard implementer

- **Always filter by `tenant_id`.** It's the first column in the sort key; queries that skip it scan the entire table.
- **Prefer predicate pushdown on `metric_name` + `statistic`.** Both are `LowCardinality`, so `metric_name = 'x' AND statistic = 'count'` is cheap.
- **Treat `server_instance_id` as a natural partition for counter math.** Never compute deltas across it — you'll get negative numbers on restart.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. Tests may write `total`. When in doubt, accept either.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, series cardinality will explode here. Monitor `count(DISTINCT concat(metric_name, toString(tags)))` and alert if it spikes.
- **The dashboard should be read-only.** No one writes into `server_metrics` except the server itself — there's no API to push or delete rows.
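
When the cardinality check from the list above fires, a per-meter breakdown like this sketch can point at the offender. The one-hour window and the `LIMIT` are arbitrary choices.

```sql
-- Distinct tag combinations per meter over the last hour, largest first.
SELECT metric_name,
       count(DISTINCT toString(tags)) AS tag_sets
FROM server_metrics
WHERE tenant_id = {tenant}
  AND collected_at >= now() - INTERVAL 1 HOUR
GROUP BY metric_name
ORDER BY tag_sets DESC
LIMIT 20;
```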

---

## Changelog

- 2026-04-23 — initial write. Write-only in v1 (no REST endpoint or admin page). Reach out to the server team before building a write-back path; we'd rather cut a proper API than have the dashboard hit ClickHouse directly forever.