# Server Self-Metrics — Reference for Dashboard Builders

This is the reference for anyone building a server-health dashboard on top of the Cameleer server. It documents the `server_metrics` ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.

> **tl;dr** — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.

---

## Built-in admin dashboard

The server ships a ready-to-use dashboard at **`/admin/server-metrics`** in the web UI. It renders the 17 panels listed below using `ThemedChart` from the design system. The window is driven by the app-wide time-range control in the TopBar (the same one used by Exchanges, Dashboard, and Runtime), so every panel automatically reflects the range you've selected globally.

Visibility mirrors the Database and ClickHouse admin pages:

- Requires the `ADMIN` role.
- Hidden when `cameleer.server.security.infrastructureendpoints=false` (both the backend endpoints and the sidebar entry disappear).

Use this page for single-tenant installs and dev/staging — it's the fastest path to "is the server healthy right now?". For multi-tenant control planes, cross-environment rollups, or embedding metrics inside an existing operations console, call the REST API below instead.
---

## Table schema

```sql
server_metrics (
    tenant_id           LowCardinality(String) DEFAULT 'default',
    collected_at        DateTime64(3),
    server_instance_id  LowCardinality(String),
    metric_name         LowCardinality(String),
    metric_type         LowCardinality(String),  -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic           LowCardinality(String) DEFAULT 'value',
    metric_value        Float64,
    tags                Map(String, String) DEFAULT map(),
    server_received_at  DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```

### What each column means

| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Stable id per server process: property → `HOSTNAME` env → DNS → random UUID. **Rotates on restart**, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Dots, not underscores. |
| `metric_type` | Lowercase Micrometer `Meter.Type`. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. |
| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. |

### Counter semantics (important)

Counters are **cumulative totals since meter registration**, same convention as Prometheus.
To get a rate, compute a delta within a `server_instance_id`:

```sql
SELECT
    toStartOfMinute(collected_at) AS minute,
    metric_value - any(metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = 'default'   -- always filter by tenant
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```

On restart the `server_instance_id` rotates, so a simple `LAG()` partitioned by `server_instance_id` gives monotonic segments without fighting counter resets.

### Retention

90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.

---

## How to query

Use the REST API — `/api/v1/admin/server-metrics/**`. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard `/api/v1/admin/**` RBAC gate).

### `GET /catalog`

Enumerate every `metric_name` observed in a window, with its `metric_type`, the set of statistics emitted, and the union of tag keys.

```
GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer
```

```json
[
  { "metricName": "cameleer.agents.connected", "metricType": "gauge", "statistics": ["value"], "tagKeys": ["state"] },
  { "metricName": "cameleer.ingestion.drops", "metricType": "counter", "statistics": ["count"], "tagKeys": ["reason"] },
  ...
]
```

`from`/`to` are optional; the default is the last 1 h.

### `GET /instances`

Enumerate the `server_instance_id` values that wrote at least one sample in the window, with `firstSeen` / `lastSeen`. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.
```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```

```json
[
  { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
  { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
]
```

### `POST /query` — generic time-series

The workhorse. One endpoint covers every panel in the dashboard.

```
POST /api/v1/admin/server-metrics/query
Authorization: Bearer
Content-Type: application/json
```

Request body:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "from": "2026-04-22T00:00:00Z",
  "to": "2026-04-23T00:00:00Z",
  "stepSeconds": 60,
  "groupByTags": ["reason"],
  "filterTags": { },
  "aggregation": "sum",
  "mode": "delta",
  "serverInstanceIds": null
}
```

Response:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "aggregation": "sum",
  "mode": "delta",
  "stepSeconds": 60,
  "series": [
    {
      "tags": { "reason": "buffer_full" },
      "points": [
        { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
        { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
        { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
      ]
    }
  ]
}
```

#### Request field reference

| Field | Type | Required | Description |
|---|---|---|---|
| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. |
| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time \| total) / sum(count)` per bucket. |
| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from ≤ 31 days`. |
| `stepSeconds` | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag-key regex `^[a-zA-Z0-9._]+$`. |
| `filterTags` | map | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to `sum` of per-instance deltas). |
| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |

#### Validation errors

Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers:

- unsafe characters in identifiers
- `from ≥ to` or range > 31 days
- `stepSeconds` outside [10, 3600]
- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)

### Direct ClickHouse (fallback)

If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for `/api/v1/admin/clickhouse/query` (`infrastructureendpoints=true`, ADMIN) or a dedicated read-only CH user scoped to `server_metrics`. All direct queries must filter by `tenant_id`.

---

## Metric catalog

Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.

### Cameleer business metrics — agent + ingestion

Source: `cameleer-server-app/.../metrics/ServerMetrics.java`.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) |
| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator |
| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type |

### Cameleer business metrics — deploy + auth

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot |
| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency |
| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |

### Alerting subsystem metrics

Source: `cameleer-server-app/.../alerting/metrics/AlertingMetrics.java`.

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` |
| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL `alert_instances` |
| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind |
| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind |
| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency |
| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency |
| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes |

### JVM — memory, GC, threads, classes

From Spring Boot Actuator (`JvmMemoryMetrics`, `JvmGcMetrics`, `JvmThreadMetrics`, `ClassLoaderMetrics`).

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
| `jvm.threads.live` | gauge | — | Current live thread count |
| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
| `jvm.threads.peak` | gauge | — | Peak thread count since start |
| `jvm.threads.started` | counter | — | Cumulative threads started |
| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch `max` |
| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant `1.0`; tags carry the real info |

### Process and system

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
| `process.uptime` | gauge | — | ms since start |
| `process.start.time` | gauge | — | Epoch start |
| `process.files.open` | gauge | — | Open FDs |
| `process.files.max` | gauge | — | FD ulimit |
| `system.cpu.count` | gauge | — | Cores visible to the JVM |
| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
| `disk.total` | gauge | `path` | Total bytes |

### HTTP server

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: `count`, `total_time`/`total`, `max` |
| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |

`uri` is the Spring-templated path (`/api/v1/environments/{envSlug}/apps/{appSlug}`), not the raw URL — cardinality stays bounded.
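Ratio panels built on these series (such as the HTTP error-rate panel suggested later in this reference) divide one `/query` series by another client-side. A minimal TypeScript sketch of that per-bucket join, assuming the point shape from the `/query` response — the `errorRate` helper itself is illustrative, not part of the shipped UI:

```typescript
// Point mirrors the { t, v } entries in a /query response series.
interface Point {
  t: string; // ISO-8601 bucket start
  v: number;
}

// Divide an error-count delta series by a total-count delta series,
// bucket by bucket. Buckets missing from the error series count as zero
// errors; empty buckets (0 requests) yield a 0 rate rather than NaN.
function errorRate(total: Point[], errors: Point[]): Point[] {
  const errByBucket = new Map<string, number>();
  for (const p of errors) errByBucket.set(p.t, p.v);
  return total.map((p) => ({
    t: p.t,
    v: p.v > 0 ? (errByBucket.get(p.t) ?? 0) / p.v : 0,
  }));
}
```

Because both inputs come from the same `/query` call shape (same `from`/`to`/`stepSeconds`), the bucket timestamps line up and the join is a plain map lookup.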
### Tomcat

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
| `tomcat.sessions.created` | counter | — | Cumulative session creates |
| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
| `tomcat.sessions.rejected` | counter | — | Session creates refused |
| `tomcat.threads.current` | gauge | `name` | Connector thread count |
| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
| `tomcat.threads.config.max` | gauge | `name` | Configured max |

### HikariCP (PostgreSQL pool)

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `hikaricp.connections` | gauge | `pool` | Total connections |
| `hikaricp.connections.active` | gauge | `pool` | In use |
| `hikaricp.connections.idle` | gauge | `pool` | Idle |
| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
| `hikaricp.connections.min` | gauge | `pool` | Configured min |
| `hikaricp.connections.max` | gauge | `pool` | Configured max |
| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |

Pools are named. You'll see `HikariPool-1` (PostgreSQL) and a separate pool for ClickHouse (`clickHouseJdbcTemplate`).
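For a single saturation number per pool, a dashboard can join `hikaricp.connections.active` with `hikaricp.connections.max` into a 0–1 utilization series. A TypeScript sketch under the same point shape as the `/query` response; the helper name is an illustrative assumption:

```typescript
// Point mirrors the { t, v } entries in a /query response series.
interface Point {
  t: string; // ISO-8601 bucket start
  v: number;
}

// Join active-connection samples with configured-max samples into a
// utilization fraction per bucket. Buckets without a usable max sample
// are dropped rather than guessed.
function poolUtilization(active: Point[], max: Point[]): Point[] {
  const maxByBucket = new Map<string, number>();
  for (const p of max) maxByBucket.set(p.t, p.v);
  return active.flatMap((p) => {
    const m = maxByBucket.get(p.t);
    return m !== undefined && m > 0 ? [{ t: p.t, v: p.v / m }] : [];
  });
}
```

Run it once per `pool` tag value, since the two metrics are grouped the same way.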
### JDBC generic

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
| `jdbc.connections.max` | gauge | `name` | |
| `jdbc.connections.active` | gauge | `name` | |
| `jdbc.connections.idle` | gauge | `name` | |

### Logging

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |

### Spring Boot lifecycle

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `application.started.time` | timer | `main.application.class` | Cold-start duration |
| `application.ready.time` | timer | `main.application.class` | Time to ready |

### Flyway

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |

### Executor pools (if any `@Async` executors exist)

When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds:

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `executor.active` | gauge | `name` | Currently-running tasks |
| `executor.queued` | gauge | `name` | Queued tasks |
| `executor.queue.remaining` | gauge | `name` | Queue headroom |
| `executor.pool.size` | gauge | `name` | Current pool size |
| `executor.pool.core` | gauge | `name` | Core size |
| `executor.pool.max` | gauge | `name` | Max size |
| `executor.completed` | counter | `name` | Completed tasks |

---

## Suggested dashboard panels

Below are 17 panels, each expressed as a single `POST /api/v1/admin/server-metrics/query` body. Tenant is implicit in the JWT — the server filters by tenant server-side. `{from}` and `{to}` are dashboard variables.

### Row: server health (top of dashboard)

1. **Agents by state** — stacked area.
    ```json
    { "metric": "cameleer.agents.connected", "statistic": "value", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
    ```

2.  **Ingestion buffer depth by type** — line chart.

    ```json
    { "metric": "cameleer.ingestion.buffer.size", "statistic": "value", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
    ```

3.  **Ingestion drops per minute** — bar chart.

    ```json
    { "metric": "cameleer.ingestion.drops", "statistic": "count", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "groupByTags": ["reason"], "mode": "delta" }
    ```

4.  **Auth failures per minute** — same shape as drops, grouped by `reason`.

    ```json
    { "metric": "cameleer.auth.failures", "statistic": "count", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "groupByTags": ["reason"], "mode": "delta" }
    ```

### Row: JVM

5.  **Heap used vs committed vs max** — area chart (three overlay queries).

    ```json
    { "metric": "jvm.memory.used", "statistic": "value", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
    ```

    Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`.

6.  **CPU %** — line.

    ```json
    { "metric": "process.cpu.usage", "statistic": "value", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
    ```

    Overlay with `"metric": "system.cpu.usage"`.

7.  **GC pause — max per cause.**

    ```json
    { "metric": "jvm.gc.pause", "statistic": "max", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
    ```

8.  **Thread count** — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, and `jvm.threads.peak`, each with `statistic=value`, `aggregation=avg`, `mode=raw`.

### Row: HTTP + DB

9.  **HTTP mean latency by URI** — top-N URIs.
    ```json
    { "metric": "http.server.requests", "statistic": "mean", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" },
      "aggregation": "avg", "mode": "raw" }
    ```

    For a p99 proxy, repeat with `"statistic": "max"`.

10. **HTTP error rate** — two queries, divide client-side: total requests and 5xx requests.

    ```json
    { "metric": "http.server.requests", "statistic": "count", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "mode": "delta", "aggregation": "sum" }
    ```

    Then, for the 5xx series, add `"filterTags": { "outcome": "SERVER_ERROR" }` and divide.

11. **HikariCP pool saturation** — overlay two queries.

    ```json
    { "metric": "hikaricp.connections.active", "statistic": "value", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
    ```

    Overlay with `"metric": "hikaricp.connections.pending"`.

12. **Hikari acquire timeouts per minute.**

    ```json
    { "metric": "hikaricp.connections.timeout", "statistic": "count", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "groupByTags": ["pool"], "mode": "delta" }
    ```

### Row: alerting (collapsible)

13. **Alerting instances by state** — stacked.

    ```json
    { "metric": "alerting_instances_total", "statistic": "value", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
    ```

14. **Eval errors per minute by kind.**

    ```json
    { "metric": "alerting_eval_errors_total", "statistic": "count", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "groupByTags": ["kind"], "mode": "delta" }
    ```

15. **Webhook delivery — max per minute.**

    ```json
    { "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max", "from": "{from}", "to": "{to}",
      "stepSeconds": 60, "aggregation": "max", "mode": "raw" }
    ```

### Row: deployments (runtime-enabled only)

16. **Deploy outcomes per hour.**
    ```json
    { "metric": "cameleer.deployments.outcome", "statistic": "count", "from": "{from}", "to": "{to}",
      "stepSeconds": 3600, "groupByTags": ["status"], "mode": "delta" }
    ```

17. **Deploy duration mean.**

    ```json
    { "metric": "cameleer.deployments.duration", "statistic": "mean", "from": "{from}", "to": "{to}",
      "stepSeconds": 300, "aggregation": "avg", "mode": "raw" }
    ```

    For a p99 proxy, repeat with `"statistic": "max"`.

---

## Notes for the dashboard implementer

- **Use the REST API.** The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for a Timer's cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see an explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
- **The dashboard is read-only.** There's no write path — only the server writes into `server_metrics`.

---

## Changelog

- 2026-04-23 — initial write. Write-only backend.
- 2026-04-23 — added the generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels are now expressed as single-endpoint queries.
- 2026-04-24 — shipped the built-in `/admin/server-metrics` UI dashboard. Gated by `infrastructureendpoints` + ADMIN, identical visibility to `/admin/{database,clickhouse}`. Source: `ui/src/pages/Admin/ServerMetricsAdminPage.tsx`.
- 2026-04-24 — the dashboard now uses the global time-range control (`useGlobalFilters`) instead of a page-local picker. Bucket size auto-scales with the selected window (10 s → 1 h). Query hooks now take a `ServerMetricsRange = { from: Date; to: Date }` instead of a `windowSeconds` number, so they work for any absolute or rolling range the TopBar supplies.
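The bucket auto-scaling noted in the last changelog entry can be sketched as a pure function: derive `stepSeconds` from the absolute range, then clamp to the API's documented [10, 3600] bounds. The target point count below is an illustrative assumption, not necessarily the value the shipped hooks use:

```typescript
// Sketch: pick a stepSeconds for POST /query from an absolute range.
// Aims for roughly `targetPoints` buckets per panel (illustrative
// default), clamped to the API's [10, 3600] bounds.
function autoStepSeconds(from: Date, to: Date, targetPoints = 300): number {
  const rangeSeconds = (to.getTime() - from.getTime()) / 1000;
  const raw = Math.ceil(rangeSeconds / targetPoints);
  return Math.min(3600, Math.max(10, raw));
}
```

Under these assumptions a 1 h window yields 12 s buckets, short windows clamp up to 10 s, and a full 31-day window clamps down to the 1 h maximum — matching the 10 s → 1 h span described above.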