# Server Self-Metrics — Reference for Dashboard Builders
This is the reference for the SaaS team building the server-health dashboard. It documents the `server_metrics` ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.
> **tl;dr** — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.
```sql
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```
### What each column means
| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Id that stays stable for the lifetime of one server process, resolved in order: property → `HOSTNAME` env → DNS → random UUID. **Rotates on restart**, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Dots, not underscores. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. |
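To make the row shape concrete, here is a minimal Python sketch of how one meter snapshot fans out into one row per `(meter, statistic)` pair. It illustrates the table layout only and is not the server's writer code; the function name `to_rows` and the dict-based row are hypothetical.

```python
import math

def to_rows(tenant_id, server_instance_id, metric_name, tags, measurements, collected_at):
    """Flatten one meter snapshot into one row per statistic (hypothetical helper)."""
    rows = []
    for statistic, value in measurements.items():
        if not math.isfinite(value):
            continue  # NaN / +-inf never reach the table
        rows.append({
            "tenant_id": tenant_id,
            "collected_at": collected_at,
            "server_instance_id": server_instance_id,
            "metric_name": metric_name,   # raw Micrometer name, dots not underscores
            "statistic": statistic,       # e.g. count, total_time, max
            "metric_value": float(value),
            "tags": dict(tags),
        })
    return rows
```

A timer whose `max` is NaN on a given tick would therefore produce two rows (`count`, `total_time`) instead of three.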
Counters are **cumulative totals since meter registration**, same convention as Prometheus. To get a rate, compute a delta within a `server_instance_id`:
```sql
SELECT
    toStartOfMinute(collected_at) AS minute,
    -- current cumulative value minus the previous sample of the same series
    metric_value - any(metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = {tenant_id:String}  -- query parameter; always filter by tenant
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```
On restart the `server_instance_id` rotates, so a one-row lag partitioned by `server_instance_id` (the `any(...) OVER (... ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)` frame above) yields monotonic segments and never has to handle a counter reset.
### Retention
90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.
Use the REST API — `/api/v1/admin/server-metrics/**`. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard `/api/v1/admin/**` RBAC gate).
### `GET /catalog`
Enumerate every `metric_name` observed in a window, with its `metric_type`, the set of statistics emitted, and the union of tag keys.
`from`/`to` are optional; default is the last 1 h.
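As a rough illustration, the sketch below builds a catalog URL in Python. It assumes the endpoint accepts `from`/`to` as ISO-8601 UTC query parameters, in the same form as the `/instances` example in this document; the helper name `catalog_url` is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE = "/api/v1/admin/server-metrics"

def _iso(dt):
    # Render as UTC with a trailing Z, e.g. 2026-04-22T00:00:00Z
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def catalog_url(from_=None, to=None):
    """Build the catalog URL; omitted bounds fall back to the server default (last 1 h)."""
    params = {}
    if from_ is not None:
        params["from"] = _iso(from_)
    if to is not None:
        params["to"] = _iso(to)
    query = urlencode(params, safe=":")  # keep colons readable in timestamps
    return f"{BASE}/catalog" + (f"?{query}" if query else "")
```

Calling `catalog_url()` with no arguments leaves the window to the server, which is usually what a metric picker wants.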
### `GET /instances`
Enumerate the `server_instance_id` values that wrote at least one sample in the window, with `firstSeen` / `lastSeen`. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.
```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```
### `POST /query`
| Field | Type | Required | Notes |
|---|---|---|---|
| `stepSeconds` | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. |
| `filterTags` | map<string,string> | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to `sum` of per-instance deltas). |
| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |
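The `mode=delta` semantics can be sketched in a few lines of Python. This is an illustration of "positive-clipped per-instance differences, summed across instances" and not the server's implementation; in particular, skipping the first sample of each instance (it has no predecessor) and labelling each delta with the bucket of the later sample are assumptions.

```python
from collections import defaultdict

def delta_series(samples, step_seconds=60):
    """samples: iterable of (server_instance_id, epoch_seconds, cumulative_value)."""
    by_instance = defaultdict(list)
    for instance_id, ts, value in samples:
        by_instance[instance_id].append((ts, value))

    buckets = defaultdict(float)
    for points in by_instance.values():
        points.sort()
        # Walk consecutive pairs within one instance only, so a restart
        # (new instance id) never produces a negative jump.
        for (_, prev), (ts, cur) in zip(points, points[1:]):
            delta = max(cur - prev, 0.0)               # positive clipping
            buckets[ts - ts % step_seconds] += delta   # sum across instances
    return sorted(buckets.items())
```

Because differences never cross an instance boundary, a counter that restarts from zero under a new `server_instance_id` simply starts a new segment.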
#### Validation errors
Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers:
- unsafe characters in identifiers
- `from ≥ to` or range > 31 days
- `stepSeconds` outside [10, 3600]
- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)
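A dashboard can mirror the cheaper of these checks before sending a request. The Python sketch below is a client-side approximation under stated assumptions: the function name `validate` and the error strings are hypothetical, the tag-key pattern is the one documented for `groupByTags`, and the 500-series cap cannot be checked without running the query.

```python
import re
from datetime import timedelta

TAG_KEY = re.compile(r"[a-zA-Z0-9._]+")
MAX_RANGE = timedelta(days=31)

def validate(from_, to, step_seconds=60, group_by_tags=()):
    """Return a list of problems; an empty list means the request looks acceptable."""
    errors = []
    if from_ >= to:
        errors.append("from must be earlier than to")
    elif to - from_ > MAX_RANGE:
        errors.append("range exceeds 31 days")
    if not 10 <= step_seconds <= 3600:
        errors.append("stepSeconds outside [10, 3600]")
    for key in group_by_tags:
        if not TAG_KEY.fullmatch(key):
            errors.append(f"unsafe tag key: {key!r}")
    return errors
```

The server remains the authority; a pre-check like this only saves a round trip and lets the UI show a friendlier message than a raw 400.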
### Direct ClickHouse (fallback)
If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for `/api/v1/admin/clickhouse/query` (`infrastructureendpoints=true`, ADMIN) or a dedicated read-only CH user scoped to `server_metrics`. All direct queries must filter by `tenant_id`.
Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.
Below are 17 panels, each expressed as a single `POST /api/v1/admin/server-metrics/query` body. Tenant is implicit in the JWT — the server filters by tenant server-side. `{from}` and `{to}` are dashboard variables.
8. **Thread count** — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak` each with `statistic=value, aggregation=avg, mode=raw`.
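As a hedged illustration of the thread-count panel, the Python sketch below builds one request body per overlay line. The `statistic`, `aggregation`, and `mode` values come from the panel description; the body field names `metric`, `from`, and `to`, and the choice of one request per line, are assumptions and should be checked against the actual request schema.

```python
import json

THREAD_METRICS = ["jvm.threads.live", "jvm.threads.daemon", "jvm.threads.peak"]

def thread_count_bodies(from_, to):
    """Return one JSON body per overlay line (field names partly assumed)."""
    return [
        json.dumps({
            "metric": name,        # assumed field name for the meter
            "statistic": "value",
            "aggregation": "avg",
            "mode": "raw",
            "from": from_,         # assumed field names for the window
            "to": to,
        })
        for name in THREAD_METRICS
    ]
```

Each body would be sent to `POST /api/v1/admin/server-metrics/query` with `{from}` and `{to}` substituted from the dashboard variables.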
- **Use the REST API.** The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
- **The dashboard is read-only.** There's no write path — only the server writes into `server_metrics`.
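
For readers working from raw rows, the `total_time` vs `total` note can be handled with a small fallback. The Python sketch below computes a timer mean from one tick's statistics, accepting either name; it is an illustration and not the server's `statistic=mean` code, and the helper name `timer_mean` is hypothetical. It returns the lifetime mean of the cumulative values, which may differ from whatever per-bucket mean the API derives.

```python
def timer_mean(stats):
    """stats: mapping of statistic name to value for one timer at one tick."""
    count = stats.get("count", 0.0)
    # Prefer the production name, fall back to the SimpleMeterRegistry one.
    total = stats.get("total_time", stats.get("total"))
    if not count or total is None:
        return None  # no observations yet, or duration row missing
    return total / count
```

Returning `None` rather than `0` for an empty timer keeps "no data" distinguishable from "instant responses" on a graph.
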
- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.