feat(server): REST API over server_metrics for SaaS dashboards
Adds /api/v1/admin/server-metrics/{catalog,instances,query} so SaaS control
planes can build the server-health dashboard without direct ClickHouse
access. One generic /query endpoint covers every panel in the
server-self-metrics doc: aggregation (avg/sum/max/min/latest), group-by-tag,
filter-by-tag, counter-delta mode with per-server_instance_id rotation
handling, and a derived 'mean' statistic for timers. Regex-validated
identifiers, parameterised literals, 31-day range cap, 500-series response
cap. ADMIN-only via the existing /api/v1/admin/** RBAC gate. Docs updated:
all 17 suggested panels now expressed as single-endpoint queries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -66,24 +66,126 @@ On restart the `server_instance_id` rotates, so a simple `LAG()` partitioned by
## How to query

### Via the REST API

Use the REST API — `/api/v1/admin/server-metrics/**`. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard `/api/v1/admin/**` RBAC gate).

### `GET /catalog`

Enumerate every `metric_name` observed in a window, with its `metric_type`, the set of statistics emitted, and the union of tag keys.

```
GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer <admin-jwt>
```

```json
[
  {
    "metricName": "cameleer.agents.connected",
    "metricType": "gauge",
    "statistics": ["value"],
    "tagKeys": ["state"]
  },
  {
    "metricName": "cameleer.ingestion.drops",
    "metricType": "counter",
    "statistics": ["count"],
    "tagKeys": ["reason"]
  },
  ...
]
```

`from`/`to` are optional; default is the last 1 h.

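Since `from`/`to` are optional, a client that wants to display the effective window can mirror the one-hour default when building the URL. A minimal sketch — `catalog_url` and the base URL are hypothetical; the server applies its own default regardless of what the client assumes:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def catalog_url(base, frm=None, to=None):
    """Build the catalog URL, mirroring the server's default window
    (last 1 h) when from/to are omitted. Client-side sketch only."""
    if to is None:
        to = datetime.now(timezone.utc)
    if frm is None:
        frm = to - timedelta(hours=1)
    fmt = lambda d: d.strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{base}/api/v1/admin/server-metrics/catalog?" + urlencode(
        {"from": fmt(frm), "to": fmt(to)})

# Fixed `to` for a reproducible example.
url = catalog_url("https://cameleer.example",
                  to=datetime(2026, 4, 23, tzinfo=timezone.utc))
print(url)
```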
### `GET /instances`

Enumerate the `server_instance_id` values that wrote at least one sample in the window, with `firstSeen` / `lastSeen`. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.

```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```

```json
[
  { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
  { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
]
```

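To turn this into restart annotations, a client can treat the `firstSeen` of every instance after the earliest as a restart marker. A sketch under the assumption that instances run sequentially rather than side by side — `restart_annotations` is a hypothetical helper, not part of the API:

```python
from datetime import datetime

def restart_annotations(instances):
    """Given GET /instances output, return the timestamps to annotate
    as restarts: the firstSeen of every instance after the earliest.
    Assumes sequential instances (no rolling/overlapping deploys)."""
    parse = lambda s: datetime.fromisoformat(s.replace("Z", "+00:00"))
    ordered = sorted(instances, key=lambda i: parse(i["firstSeen"]))
    return [i["firstSeen"] for i in ordered[1:]]

# Sample response from the doc above.
instances = [
    {"serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z"},
    {"serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z"},
]
print(restart_annotations(instances))  # ['2026-04-22T14:30:00Z']
```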
### `POST /query` — generic time-series

The workhorse. One endpoint covers every panel in the dashboard.

```
POST /api/v1/admin/server-metrics/query
Authorization: Bearer <admin-jwt>
Content-Type: application/json
```

Request body:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "from": "2026-04-22T00:00:00Z",
  "to": "2026-04-23T00:00:00Z",
  "stepSeconds": 60,
  "groupByTags": ["reason"],
  "filterTags": { },
  "aggregation": "sum",
  "mode": "delta",
  "serverInstanceIds": null
}
```

Response:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "aggregation": "sum",
  "mode": "delta",
  "stepSeconds": 60,
  "series": [
    {
      "tags": { "reason": "buffer_full" },
      "points": [
        { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
        { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
        { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
      ]
    }
  ]
}
```

#### Request field reference

| Field | Type | Required | Description |
|---|---|---|---|
| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. |
| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time \| total) / sum(count)` per bucket. |
| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from ≤ 31 days`. |
| `stepSeconds` | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. |
| `filterTags` | map<string,string> | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to `sum` of per-instance deltas). |
| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |
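The `mode=delta` semantics described in the table can be sketched client-side to reason about them: per-instance positively clipped differences, then summed across instances. This is an illustrative model of the behaviour, not the server's actual query:

```python
from collections import defaultdict

def delta_mode(samples):
    """samples: list of (bucket, server_instance_id, cumulative_value).
    Returns {bucket: summed positive deltas}. Deltas are computed within
    each server_instance_id only and clipped at zero, so a counter reset
    on restart never produces a negative spike."""
    by_instance = defaultdict(list)
    for bucket, inst, value in samples:
        by_instance[inst].append((bucket, value))
    out = defaultdict(float)
    for rows in by_instance.values():
        rows.sort()
        prev = None
        for bucket, value in rows:
            if prev is not None:
                out[bucket] += max(0.0, value - prev)
            prev = value
    return dict(out)

# srv-a restarts between buckets 2 and 3; srv-b replaces it at counter 0.
samples = [
    (0, "srv-a", 10.0), (1, "srv-a", 15.0), (2, "srv-a", 20.0),
    (2, "srv-b", 0.0),  (3, "srv-b", 3.0),
]
print(delta_mode(samples))  # {1: 5.0, 2: 5.0, 3: 3.0}
```

Note how the restart contributes no negative delta: srv-b's first sample only seeds its own partition.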
#### Validation errors

Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers:

- unsafe characters in identifiers
- `from ≥ to` or range > 31 days
- `stepSeconds` outside [10, 3600]
- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)

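A client can mirror these triggers before sending the request and save a round-trip. A sketch that treats out-of-range `stepSeconds` as rejected, as listed here — `query_errors` is a hypothetical helper and the server remains authoritative:

```python
import re
from datetime import datetime, timedelta

IDENT = re.compile(r"^[a-zA-Z0-9._]+$")  # same identifier regex as the API

def query_errors(body):
    """Collect the complaints the server's 400 response would raise.
    Client-side mirror only."""
    errors = []
    if not IDENT.match(body.get("metric", "")):
        errors.append("unsafe characters in metric")
    for key in body.get("groupByTags") or []:
        if not IDENT.match(key):
            errors.append("unsafe characters in tag key")
    frm = datetime.fromisoformat(body["from"].replace("Z", "+00:00"))
    to = datetime.fromisoformat(body["to"].replace("Z", "+00:00"))
    if frm >= to:
        errors.append("from >= to")
    elif to - frm > timedelta(days=31):
        errors.append("range > 31 days")
    step = body.get("stepSeconds", 60)
    if not 10 <= step <= 3600:
        errors.append("stepSeconds outside [10, 3600]")
    return errors

ok = {"metric": "cameleer.ingestion.drops",
      "from": "2026-04-22T00:00:00Z", "to": "2026-04-23T00:00:00Z",
      "stepSeconds": 60}
print(query_errors(ok))  # []
```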
### Direct ClickHouse (fallback)

If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for `/api/v1/admin/clickhouse/query` (`infrastructureendpoints=true`, ADMIN) or a dedicated read-only CH user scoped to `server_metrics`. All direct queries must filter by `tenant_id`.

---

@@ -258,89 +360,150 @@ When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds:

## Suggested dashboard panels

Below are 17 panels, each expressed as a single `POST /api/v1/admin/server-metrics/query` body. Tenant is implicit in the JWT — the server filters by tenant server-side. `{from}` and `{to}` are dashboard variables.

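A dashboard can keep each panel below as a template and substitute the variables at render time. A minimal sketch — `render_panel` is a hypothetical helper; `{from}`/`{to}` substitution happens client-side since the API expects concrete instants:

```python
def render_panel(template, frm, to):
    """Return a copy of a panel body with the {from}/{to} dashboard
    variables replaced. Does not mutate the template."""
    def fill(v):
        if isinstance(v, str):
            return v.replace("{from}", frm).replace("{to}", to)
        if isinstance(v, dict):
            return {k: fill(x) for k, x in v.items()}
        if isinstance(v, list):
            return [fill(x) for x in v]
        return v
    return fill(template)

# Panel 1 below, as a template.
panel = {"metric": "cameleer.agents.connected", "statistic": "value",
         "from": "{from}", "to": "{to}", "stepSeconds": 60,
         "groupByTags": ["state"], "aggregation": "avg", "mode": "raw"}
body = render_panel(panel, "2026-04-22T00:00:00Z", "2026-04-23T00:00:00Z")
print(body["from"])  # 2026-04-22T00:00:00Z
```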
### Row: server health (top of dashboard)

1. **Agents by state** — stacked area.
```json
{ "metric": "cameleer.agents.connected", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
```

2. **Ingestion buffer depth by type** — line chart.
```json
{ "metric": "cameleer.ingestion.buffer.size", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
```

3. **Ingestion drops per minute** — bar chart.
```json
{ "metric": "cameleer.ingestion.drops", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["reason"], "mode": "delta" }
```

4. **Auth failures per minute** — same shape as drops, grouped by `reason`.
```json
{ "metric": "cameleer.auth.failures", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["reason"], "mode": "delta" }
```

### Row: JVM

5. **Heap used vs committed vs max** — area chart (three overlay queries).
```json
{ "metric": "jvm.memory.used", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
```
Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`.

6. **CPU %** — line.
```json
{ "metric": "process.cpu.usage", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
```
Overlay with `"metric": "system.cpu.usage"`.

7. **GC pause — max per cause**.
```json
{ "metric": "jvm.gc.pause", "statistic": "max",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
```

8. **Thread count** — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`, each with `statistic=value, aggregation=avg, mode=raw`.

### Row: HTTP + DB

9. **HTTP mean latency by URI** — top-N URIs.
```json
{ "metric": "http.server.requests", "statistic": "mean",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" },
  "aggregation": "avg", "mode": "raw" }
```
For a p99 proxy, repeat with `"statistic": "max"`.

10. **HTTP error rate** — two queries, divide client-side: total requests and 5xx requests.
```json
{ "metric": "http.server.requests", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "mode": "delta", "aggregation": "sum" }
```
Then for the 5xx series, add `"filterTags": { "outcome": "SERVER_ERROR" }` and divide.

11. **HikariCP pool saturation** — overlay two queries.
```json
{ "metric": "hikaricp.connections.active", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
```
Overlay with `"metric": "hikaricp.connections.pending"`. Sustained `pending > 0` means the pool is too small.

12. **Hikari acquire timeouts per minute** — any non-zero rate is a red flag.
```json
{ "metric": "hikaricp.connections.timeout", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["pool"], "mode": "delta" }
```

### Row: alerting (collapsible)

13. **Alerting instances by state** — stacked.
```json
{ "metric": "alerting_instances_total", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
```

14. **Eval errors per minute by kind**.
```json
{ "metric": "alerting_eval_errors_total", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["kind"], "mode": "delta" }
```

15. **Webhook delivery — max per minute**.
```json
{ "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "aggregation": "max", "mode": "raw" }
```

### Row: deployments (runtime-enabled only)

16. **Deploy outcomes per hour**.
```json
{ "metric": "cameleer.deployments.outcome", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 3600,
  "groupByTags": ["status"], "mode": "delta" }
```

17. **Deploy duration mean**.
```json
{ "metric": "cameleer.deployments.duration", "statistic": "mean",
  "from": "{from}", "to": "{to}", "stepSeconds": 300,
  "aggregation": "avg", "mode": "raw" }
```
For a p99 proxy, repeat with `"statistic": "max"`.

---

## Notes for the dashboard implementer

- **Always filter by `tenant_id`.** It's the first column in the sort key; queries that skip it scan the entire table.
- **Prefer predicate pushdown on `metric_name` + `statistic`.** Both are `LowCardinality`, so `metric_name = 'x' AND statistic = 'count'` is cheap.
- **Treat `server_instance_id` as a natural partition for counter math.** Never compute deltas across it — you'll get negative numbers on restart.
- **Use the REST API.** The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
- **The dashboard is read-only.** There's no write path — only the server writes into `server_metrics`.

---

## Changelog

- 2026-04-23 — initial write. Write-only backend.
- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.