Adds /api/v1/admin/server-metrics/{catalog,instances,query} so SaaS control
planes can build the server-health dashboard without direct ClickHouse
access. One generic /query endpoint covers every panel in the
server-self-metrics doc: aggregation (avg/sum/max/min/latest), group-by-tag,
filter-by-tag, counter-delta mode with per-server_instance_id rotation
handling, and a derived 'mean' statistic for timers. Regex-validated
identifiers, parameterised literals, 31-day range cap, 500-series response
cap. ADMIN-only via the existing /api/v1/admin/** RBAC gate. Docs updated:
all 17 suggested panels now expressed as single-endpoint queries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Server Self-Metrics — Reference for Dashboard Builders

This is the reference for the SaaS team building the server-health dashboard. It documents the `server_metrics` ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.

> **tl;dr** — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.

---

## Table schema

```sql
server_metrics (
    tenant_id           LowCardinality(String) DEFAULT 'default',
    collected_at        DateTime64(3),
    server_instance_id  LowCardinality(String),
    metric_name         LowCardinality(String),
    metric_type         LowCardinality(String),  -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic           LowCardinality(String) DEFAULT 'value',
    metric_value        Float64,
    tags                Map(String, String) DEFAULT map(),
    server_received_at  DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```

### What each column means

| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Stable id per server process: property → `HOSTNAME` env → DNS → random UUID. **Rotates on restart**, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Dots, not underscores. |
| `metric_type` | Lowercase Micrometer `Meter.Type`. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. |
| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. |

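To make the timer shape concrete, here is what a single 60 s tick for one timer meter looks like as rows. The metric values are invented for illustration; the row shape follows the `statistic` column described above:

```python
# One collection tick for timer cameleer.ingestion.flush.duration {type=log}
# lands as three rows, one per Micrometer Measurement (values invented):
rows = [
    {"metric_name": "cameleer.ingestion.flush.duration", "statistic": "count",      "metric_value": 42.0},
    {"metric_name": "cameleer.ingestion.flush.duration", "statistic": "total_time", "metric_value": 3.7},
    {"metric_name": "cameleer.ingestion.flush.duration", "statistic": "max",        "metric_value": 0.4},
]
```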
### Counter semantics (important)

Counters are **cumulative totals since meter registration**, same convention as Prometheus. To get a rate, compute a delta within a `server_instance_id` (remember to filter by `tenant_id`, as with every direct query):

```sql
SELECT
    toStartOfMinute(collected_at) AS minute,
    greatest(
        metric_value - any(metric_value) OVER (
            PARTITION BY server_instance_id, metric_name, tags
            ORDER BY collected_at
            ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
        ),
        0  -- clip so an in-instance counter reset never yields a negative delta
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = {tenant:String}
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```

On restart the `server_instance_id` rotates, so a lagged difference partitioned by `server_instance_id` (emulated above with `any() OVER … ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING`, since ClickHouse has no plain `LAG()`; `lagInFrame` also works) yields monotonic segments without fighting counter resets.

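The same restart-safe delta logic can be sketched client-side over already-fetched rows. This is a minimal illustration of what delta mode does, not the server's implementation:

```python
def counter_deltas(samples):
    """samples: (collected_at, server_instance_id, metric_value) tuples in
    time order. Deltas are computed per instance and clipped at zero, so an
    instance-id rotation on restart never produces a negative spike."""
    last = {}   # instance -> previous cumulative value
    out = []
    for ts, instance, value in samples:
        prev = last.get(instance)
        if prev is not None:
            out.append((ts, max(value - prev, 0.0)))
        last[instance] = value
    return out
```

The first sample of each instance has no predecessor and is skipped, which is exactly why a restart (new instance id) starts a fresh monotonic segment instead of a huge negative delta.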
### Retention

90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.

---

## How to query

Use the REST API — `/api/v1/admin/server-metrics/**`. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard `/api/v1/admin/**` RBAC gate).

### `GET /catalog`

Enumerate every `metric_name` observed in a window, with its `metric_type`, the set of statistics emitted, and the union of tag keys.

```
GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer <admin-jwt>
```

```json
[
  {
    "metricName": "cameleer.agents.connected",
    "metricType": "gauge",
    "statistics": ["value"],
    "tagKeys": ["state"]
  },
  {
    "metricName": "cameleer.ingestion.drops",
    "metricType": "counter",
    "statistics": ["count"],
    "tagKeys": ["reason"]
  },
  ...
]
```

`from`/`to` are optional; default is the last 1 h.

### `GET /instances`

Enumerate the `server_instance_id` values that wrote at least one sample in the window, with `firstSeen` / `lastSeen`. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.

```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```

```json
[
  { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
  { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
]
```

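Turning that payload into restart markers is a one-liner. A client-side sketch (field names taken from the example response above; ISO-8601 UTC strings sort lexicographically, so no date parsing is needed):

```python
def restart_annotations(instances):
    """Given the /instances payload (list of dicts with serverInstanceId /
    firstSeen / lastSeen), return the firstSeen of every instance except the
    earliest one -- each marks a probable server restart to annotate."""
    ordered = sorted(instances, key=lambda i: i["firstSeen"])
    return [i["firstSeen"] for i in ordered[1:]]
```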
### `POST /query` — generic time-series

The workhorse. One endpoint covers every panel in the dashboard.

```
POST /api/v1/admin/server-metrics/query
Authorization: Bearer <admin-jwt>
Content-Type: application/json
```

Request body:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "from": "2026-04-22T00:00:00Z",
  "to": "2026-04-23T00:00:00Z",
  "stepSeconds": 60,
  "groupByTags": ["reason"],
  "filterTags": { },
  "aggregation": "sum",
  "mode": "delta",
  "serverInstanceIds": null
}
```

Response:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "aggregation": "sum",
  "mode": "delta",
  "stepSeconds": 60,
  "series": [
    {
      "tags": { "reason": "buffer_full" },
      "points": [
        { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
        { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
        { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
      ]
    }
  ]
}
```

#### Request field reference

| Field | Type | Required | Description |
|---|---|---|---|
| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. |
| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time \| total) / sum(count)` per bucket. |
| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from ≤ 31 days`. |
| `stepSeconds` | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. |
| `filterTags` | map<string,string> | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to `sum` of per-instance deltas). |
| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |

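These rules are cheap to mirror client-side so a dashboard can reject a bad panel config before the round trip. A sketch under the documented rules (this is a hypothetical helper, not the server's validator):

```python
import re
from datetime import datetime, timedelta

IDENT = re.compile(r"^[a-zA-Z0-9._]+$")

def prevalidate(body):
    """Sanity-check a /query body against the documented rules: identifier
    regex, from < to, 31-day range cap, stepSeconds clamped to [10, 3600]."""
    if not IDENT.match(body["metric"]):
        raise ValueError("unsafe characters in metric")
    for key in body.get("groupByTags") or []:
        if not IDENT.match(key):
            raise ValueError(f"unsafe characters in tag key: {key}")
    frm = datetime.fromisoformat(body["from"].replace("Z", "+00:00"))
    to = datetime.fromisoformat(body["to"].replace("Z", "+00:00"))
    if frm >= to or to - frm > timedelta(days=31):
        raise ValueError("bad range")
    # the server clamps stepSeconds; clamp the same way locally
    body["stepSeconds"] = min(max(body.get("stepSeconds", 60), 10), 3600)
    return body
```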
#### Validation errors

Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers:

- unsafe characters in identifiers
- `from ≥ to` or range > 31 days
- `stepSeconds` outside [10, 3600]
- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)

### Direct ClickHouse (fallback)

If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for `/api/v1/admin/clickhouse/query` (`infrastructureendpoints=true`, ADMIN) or a dedicated read-only CH user scoped to `server_metrics`. All direct queries must filter by `tenant_id`.

---

## Metric catalog

Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.

### Cameleer business metrics — agent + ingestion

Source: `cameleer-server-app/.../metrics/ServerMetrics.java`.

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) |
| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator |
| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type |

### Cameleer business metrics — deploy + auth

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot |
| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency |
| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |

### Alerting subsystem metrics

Source: `cameleer-server-app/.../alerting/metrics/AlertingMetrics.java`.

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` |
| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL `alert_instances` |
| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind |
| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind |
| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency |
| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency |
| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes |

### JVM — memory, GC, threads, classes

From Spring Boot Actuator (`JvmMemoryMetrics`, `JvmGcMetrics`, `JvmThreadMetrics`, `ClassLoaderMetrics`).

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
| `jvm.threads.live` | gauge | — | Current live thread count |
| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
| `jvm.threads.peak` | gauge | — | Peak thread count since start |
| `jvm.threads.started` | counter | — | Cumulative threads started |
| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch `max` |
| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant `1.0`; tags carry the real info |

### Process and system

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
| `process.uptime` | gauge | — | ms since start |
| `process.start.time` | gauge | — | Epoch start |
| `process.files.open` | gauge | — | Open FDs |
| `process.files.max` | gauge | — | FD ulimit |
| `system.cpu.count` | gauge | — | Cores visible to the JVM |
| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
| `disk.total` | gauge | `path` | Total bytes |

### HTTP server

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: count, total_time/total, max |
| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |

`uri` is the Spring-templated path (`/api/v1/environments/{envSlug}/apps/{appSlug}`), not the raw URL — cardinality stays bounded.

### Tomcat

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
| `tomcat.sessions.created` | counter | — | Cumulative session creates |
| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
| `tomcat.sessions.rejected` | counter | — | Session creates refused |
| `tomcat.threads.current` | gauge | `name` | Connector thread count |
| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
| `tomcat.threads.config.max` | gauge | `name` | Configured max |

### HikariCP (PostgreSQL pool)

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `hikaricp.connections` | gauge | `pool` | Total connections |
| `hikaricp.connections.active` | gauge | `pool` | In-use |
| `hikaricp.connections.idle` | gauge | `pool` | Idle |
| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
| `hikaricp.connections.min` | gauge | `pool` | Configured min |
| `hikaricp.connections.max` | gauge | `pool` | Configured max |
| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |

Pools are named. You'll see `HikariPool-1` (PostgreSQL) and a separate pool for ClickHouse (`clickHouseJdbcTemplate`).

### JDBC generic

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
| `jdbc.connections.max` | gauge | `name` | |
| `jdbc.connections.active` | gauge | `name` | |
| `jdbc.connections.idle` | gauge | `name` | |

### Logging

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |

### Spring Boot lifecycle

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `application.started.time` | timer | `main.application.class` | Cold-start duration |
| `application.ready.time` | timer | `main.application.class` | Time to ready |

### Flyway

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |

### Executor pools (if any `@Async` executors exist)

When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds:

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `executor.active` | gauge | `name` | Currently-running tasks |
| `executor.queued` | gauge | `name` | Queued tasks |
| `executor.queue.remaining` | gauge | `name` | Queue headroom |
| `executor.pool.size` | gauge | `name` | Current pool size |
| `executor.pool.core` | gauge | `name` | Core size |
| `executor.pool.max` | gauge | `name` | Max size |
| `executor.completed` | counter | `name` | Completed tasks |

---

## Suggested dashboard panels

Below are 17 panels, each expressed as a single `POST /api/v1/admin/server-metrics/query` body. Tenant is implicit in the JWT — the server filters by tenant server-side. `{from}` and `{to}` are dashboard variables.

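All 17 bodies share one shape, so a small client-side helper keeps panel definitions declarative. A sketch (`panel_body` is a hypothetical helper, not part of the API):

```python
def panel_body(metric, statistic, frm, to, step=60, **opts):
    """Build a /query body with the shared defaults; opts may carry
    groupByTags, filterTags, aggregation, mode, serverInstanceIds."""
    body = {"metric": metric, "statistic": statistic,
            "from": frm, "to": to, "stepSeconds": step}
    body.update(opts)
    return body
```

For example, `panel_body("cameleer.ingestion.drops", "count", "{from}", "{to}", groupByTags=["reason"], mode="delta")` reproduces panel 3 below.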
### Row: server health (top of dashboard)

1. **Agents by state** — stacked area.
```json
{ "metric": "cameleer.agents.connected", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
```

2. **Ingestion buffer depth by type** — line chart.
```json
{ "metric": "cameleer.ingestion.buffer.size", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
```

3. **Ingestion drops per minute** — bar chart.
```json
{ "metric": "cameleer.ingestion.drops", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["reason"], "mode": "delta" }
```

4. **Auth failures per minute** — same shape as drops, grouped by `reason`.
```json
{ "metric": "cameleer.auth.failures", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["reason"], "mode": "delta" }
```

### Row: JVM

5. **Heap used vs committed vs max** — area chart (three overlay queries).
```json
{ "metric": "jvm.memory.used", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
```
Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`.

6. **CPU %** — line.
```json
{ "metric": "process.cpu.usage", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
```
Overlay with `"metric": "system.cpu.usage"`.

7. **GC pause — max per cause**.
```json
{ "metric": "jvm.gc.pause", "statistic": "max",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
```

8. **Thread count** — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`, each with `statistic=value, aggregation=avg, mode=raw`.

### Row: HTTP + DB

9. **HTTP mean latency by URI** — top-N URIs.
```json
{ "metric": "http.server.requests", "statistic": "mean",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" },
  "aggregation": "avg", "mode": "raw" }
```
For a p99 proxy, repeat with `"statistic": "max"`.

10. **HTTP error rate** — two queries, divide client-side: total requests and 5xx requests.
```json
{ "metric": "http.server.requests", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "mode": "delta", "aggregation": "sum" }
```
Then for the 5xx series, add `"filterTags": { "outcome": "SERVER_ERROR" }` and divide.

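The client-side division can be sketched as a point-wise merge of the two `/query` responses, keyed on the `t` field from the response format shown earlier (buckets with a zero or missing total are treated as 0% rather than dividing by zero):

```python
def ratio_series(numerator, denominator):
    """Point-wise divide two series of {"t": ..., "v": ...} points, matching
    by timestamp. Empty or zero-denominator buckets yield 0.0 -- here, that
    turns 5xx deltas over total deltas into an error-rate series."""
    denom = {p["t"]: p["v"] for p in denominator}
    return [
        {"t": p["t"], "v": (p["v"] / denom[p["t"]]) if denom.get(p["t"]) else 0.0}
        for p in numerator
    ]
```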
11. **HikariCP pool saturation** — overlay two queries.
```json
{ "metric": "hikaricp.connections.active", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
```
Overlay with `"metric": "hikaricp.connections.pending"`.

12. **Hikari acquire timeouts per minute**.
```json
{ "metric": "hikaricp.connections.timeout", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["pool"], "mode": "delta" }
```

### Row: alerting (collapsible)

13. **Alerting instances by state** — stacked.
```json
{ "metric": "alerting_instances_total", "statistic": "value",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
```

14. **Eval errors per minute by kind**.
```json
{ "metric": "alerting_eval_errors_total", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "groupByTags": ["kind"], "mode": "delta" }
```

15. **Webhook delivery — max per minute**.
```json
{ "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max",
  "from": "{from}", "to": "{to}", "stepSeconds": 60,
  "aggregation": "max", "mode": "raw" }
```

### Row: deployments (runtime-enabled only)

16. **Deploy outcomes per hour**.
```json
{ "metric": "cameleer.deployments.outcome", "statistic": "count",
  "from": "{from}", "to": "{to}", "stepSeconds": 3600,
  "groupByTags": ["status"], "mode": "delta" }
```

17. **Deploy duration mean**.
```json
{ "metric": "cameleer.deployments.duration", "statistic": "mean",
  "from": "{from}", "to": "{to}", "stepSeconds": 300,
  "aggregation": "avg", "mode": "raw" }
```
For a p99 proxy, repeat with `"statistic": "max"`.

---

## Notes for the dashboard implementer

- **Use the REST API.** The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the `statistic` name for a Timer's cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see an explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
- **The dashboard is read-only.** There's no write path — only the server writes into `server_metrics`.

---

## Changelog

- 2026-04-23 — initial write. Write-only backend.
- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.