Server Self-Metrics — Reference for Dashboard Builders
This is the reference for anyone building a server-health dashboard on top of the Cameleer server. It documents the server_metrics ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.
tl;dr — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per (meter, statistic) pair. No external Prometheus required.
Built-in admin dashboard
The server ships a ready-to-use dashboard at /admin/server-metrics in the web UI. It renders the 17 panels listed below using ThemedChart from the design system, with a time-range selector (15 min / 1 h / 6 h / 24 h / 7 d) and live auto-refresh. Visibility mirrors the Database and ClickHouse admin pages:
- Requires the `ADMIN` role.
- Hidden when `cameleer.server.security.infrastructureendpoints=false` (both the backend endpoints and the sidebar entry disappear).
Use this page for single-tenant installs and dev/staging — it's the fastest path to "is the server healthy right now?". For multi-tenant control planes, cross-environment rollups, or embedding metrics inside an existing operations console, call the REST API below instead.
Table schema
```sql
server_metrics (
    tenant_id           LowCardinality(String) DEFAULT 'default',
    collected_at        DateTime64(3),
    server_instance_id  LowCardinality(String),
    metric_name         LowCardinality(String),
    metric_type         LowCardinality(String),  -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic           LowCardinality(String) DEFAULT 'value',
    metric_value        Float64,
    tags                Map(String, String) DEFAULT map(),
    server_received_at  DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```
What each column means
| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Stable id per server process: property → `HOSTNAME` env → DNS → random UUID. Rotates on restart, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Dots, not underscores. |
| `metric_type` | Lowercase Micrometer `Meter.Type`. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. |
| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. |
Counter semantics (important)
Counters are cumulative totals since meter registration, same convention as Prometheus. To get a rate, compute a delta within a server_instance_id:
```sql
SELECT
    toStartOfMinute(collected_at) AS minute,
    metric_value - any(metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = 'default'          -- always filter by tenant
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```
On restart the `server_instance_id` rotates, so a previous-sample lookup partitioned by `server_instance_id` (the `any(...) OVER (ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)` window above, ClickHouse's stand-in for `LAG()`) gives monotonic segments without fighting counter resets.
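If you need the same math client-side, the per-instance, positive-clipped delta logic can be sketched in a few lines of Python; the function name and sample shape are illustrative, not part of any shipped client:

```python
def positive_clipped_deltas(samples):
    """samples: list of (server_instance_id, timestamp, cumulative_value),
    ordered by timestamp. Returns (timestamp, delta) pairs. The first sample
    of each instance yields no delta (there is nothing to diff against), and
    negative diffs are clipped to 0 so counter resets never go negative."""
    last = {}
    out = []
    for instance, ts, value in samples:
        if instance in last:
            out.append((ts, max(0.0, value - last[instance])))
        last[instance] = value
    return out

samples = [
    ("srv-a", 0, 100.0), ("srv-a", 60, 105.0),   # +5
    ("srv-a", 120, 105.0),                        # +0
    ("srv-b", 180, 2.0),                          # restart: new instance, no delta
    ("srv-b", 240, 7.0),                          # +5
]
print(positive_clipped_deltas(samples))
# [(60, 5.0), (120, 0.0), (240, 5.0)]
```

Because the partition key is the instance id, a restart simply starts a new monotonic segment instead of producing a huge negative spike.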
Retention
90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.
How to query
Use the REST API — /api/v1/admin/server-metrics/**. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard /api/v1/admin/** RBAC gate).
GET /catalog
Enumerate every metric_name observed in a window, with its metric_type, the set of statistics emitted, and the union of tag keys.
```
GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer <admin-jwt>
```

```json
[
  {
    "metricName": "cameleer.agents.connected",
    "metricType": "gauge",
    "statistics": ["value"],
    "tagKeys": ["state"]
  },
  {
    "metricName": "cameleer.ingestion.drops",
    "metricType": "counter",
    "statistics": ["count"],
    "tagKeys": ["reason"]
  },
  ...
]
```
from/to are optional; default is the last 1 h.
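As an illustration, a dashboard's metric picker could group catalog entries by type. `group_catalog_by_type` is a hypothetical helper, not part of the API:

```python
def group_catalog_by_type(entries):
    """Group parsed /catalog entries into {metricType: sorted metric names},
    e.g. to populate a grouped dropdown in a dashboard builder."""
    grouped = {}
    for e in entries:
        grouped.setdefault(e["metricType"], []).append(e["metricName"])
    return {t: sorted(names) for t, names in grouped.items()}

catalog = [
    {"metricName": "cameleer.agents.connected", "metricType": "gauge",
     "statistics": ["value"], "tagKeys": ["state"]},
    {"metricName": "cameleer.ingestion.drops", "metricType": "counter",
     "statistics": ["count"], "tagKeys": ["reason"]},
]
print(group_catalog_by_type(catalog))
# {'gauge': ['cameleer.agents.connected'], 'counter': ['cameleer.ingestion.drops']}
```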
GET /instances
Enumerate the server_instance_id values that wrote at least one sample in the window, with firstSeen / lastSeen. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.
```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```

```json
[
  { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
  { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
]
```
POST /query — generic time-series
The workhorse. One endpoint covers every panel in the dashboard.
```
POST /api/v1/admin/server-metrics/query
Authorization: Bearer <admin-jwt>
Content-Type: application/json
```

Request body:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "from": "2026-04-22T00:00:00Z",
  "to": "2026-04-23T00:00:00Z",
  "stepSeconds": 60,
  "groupByTags": ["reason"],
  "filterTags": { },
  "aggregation": "sum",
  "mode": "delta",
  "serverInstanceIds": null
}
```

Response:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "aggregation": "sum",
  "mode": "delta",
  "stepSeconds": 60,
  "series": [
    {
      "tags": { "reason": "buffer_full" },
      "points": [
        { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
        { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
        { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
      ]
    }
  ]
}
```
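A small helper can flatten this response into per-series point lists for a chart library. The helper name and label convention below are my own, a sketch rather than shipped code:

```python
def to_chart_series(response):
    """Flatten a POST /query response into {label: [(iso_ts, value), ...]},
    labelling each series by its tag map ('reason=buffer_full'), or by the
    metric name when the series has no tags."""
    series = {}
    for s in response["series"]:
        label = ",".join(f"{k}={v}" for k, v in sorted(s["tags"].items())) \
            or response["metric"]
        series[label] = [(p["t"], p["v"]) for p in s["points"]]
    return series

resp = {
    "metric": "cameleer.ingestion.drops",
    "series": [{
        "tags": {"reason": "buffer_full"},
        "points": [{"t": "2026-04-22T00:00:00.000Z", "v": 0.0},
                   {"t": "2026-04-22T00:01:00.000Z", "v": 5.0}],
    }],
}
print(to_chart_series(resp))
# {'reason=buffer_full': [('2026-04-22T00:00:00.000Z', 0.0), ('2026-04-22T00:01:00.000Z', 5.0)]}
```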
Request field reference
| Field | Type | Required | Description |
|---|---|---|---|
| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. |
| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time \| total) / sum(count)` per bucket. |
| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from` ≤ 31 days. |
| `stepSeconds` | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. |
| `filterTags` | map<string,string> | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to sum of per-instance deltas). |
| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |
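For intuition, raw-mode bucketing with the within-bucket reducers can be sketched as follows. This is an illustrative reading of the field reference above, not the server's actual implementation:

```python
REDUCERS = {
    "avg": lambda vs: sum(vs) / len(vs),
    "sum": sum,
    "max": max,
    "min": min,
    "latest": lambda vs: vs[-1],   # samples arrive in time order
}

def bucketize(points, start_epoch, step_seconds, aggregation="avg"):
    """points: time-ordered list of (epoch_seconds, value).
    Assigns each sample to its stepSeconds-aligned bucket, then applies
    the chosen within-bucket reducer. Returns {bucket_start: value}."""
    buckets = {}
    for t, v in points:
        b = start_epoch + step_seconds * ((t - start_epoch) // step_seconds)
        buckets.setdefault(b, []).append(v)
    reduce_fn = REDUCERS[aggregation]
    return {b: reduce_fn(vs) for b, vs in buckets.items()}

pts = [(0, 1.0), (30, 3.0), (60, 10.0)]
print(bucketize(pts, 0, 60, "avg"))   # {0: 2.0, 60: 10.0}
print(bucketize(pts, 0, 60, "max"))   # {0: 3.0, 60: 10.0}
```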
Validation errors
Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers:
- unsafe characters in identifiers
- `from ≥ to` or range > 31 days
- `stepSeconds` outside [10, 3600]
- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)
Direct ClickHouse (fallback)
If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for /api/v1/admin/clickhouse/query (infrastructureendpoints=true, ADMIN) or a dedicated read-only CH user scoped to server_metrics. All direct queries must filter by tenant_id.
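As a sketch of this fallback path, here is one way to assemble a tenant-scoped quantile query as a SQL string. The helper is hypothetical, and note the caveat: rows in `server_metrics` are already per-tick aggregates, so a quantile over them is only a coarse proxy for request-level percentiles:

```python
def p99_latency_sql(metric, tenant_id="default"):
    """Build a ClickHouse quantile query the generic /query endpoint can't
    express, scoped to one tenant as the doc requires.
    NOTE: interpolate only trusted, validated identifiers here; for user
    input, prefer server-side bound parameters instead."""
    return f"""
SELECT
    toStartOfMinute(collected_at) AS bucket,
    quantile(0.99)(metric_value) AS p99
FROM server_metrics
WHERE tenant_id = '{tenant_id}'
  AND metric_name = '{metric}'
  AND statistic = 'max'
GROUP BY bucket
ORDER BY bucket
""".strip()

sql = p99_latency_sql("cameleer.ingestion.flush.duration")
print("tenant_id = 'default'" in sql)  # True
```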
Metric catalog
Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.
Cameleer business metrics — agent + ingestion
Source: cameleer-server-app/.../metrics/ServerMetrics.java.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) |
| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator |
| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type |
Cameleer business metrics — deploy + auth
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot |
| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency |
| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |
Alerting subsystem metrics
Source: cameleer-server-app/.../alerting/metrics/AlertingMetrics.java.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` |
| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL `alert_instances` |
| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind |
| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind |
| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency |
| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency |
| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes |
JVM — memory, GC, threads, classes
From Spring Boot Actuator (JvmMemoryMetrics, JvmGcMetrics, JvmThreadMetrics, ClassLoaderMetrics).
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
| `jvm.threads.live` | gauge | — | Current live thread count |
| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
| `jvm.threads.peak` | gauge | — | Peak thread count since start |
| `jvm.threads.started` | counter | — | Cumulative threads started |
| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch max |
| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant 1.0; tags carry the real info |
Process and system
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
| `process.uptime` | gauge | — | ms since start |
| `process.start.time` | gauge | — | Epoch start |
| `process.files.open` | gauge | — | Open FDs |
| `process.files.max` | gauge | — | FD ulimit |
| `system.cpu.count` | gauge | — | Cores visible to the JVM |
| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
| `disk.total` | gauge | `path` | Total bytes |
HTTP server
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: `count`, `total_time`/`total`, `max` |
| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |
uri is the Spring-templated path (/api/v1/environments/{envSlug}/apps/{appSlug}), not the raw URL — cardinality stays bounded.
Tomcat
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
| `tomcat.sessions.created` | counter | — | Cumulative session creates |
| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
| `tomcat.sessions.rejected` | counter | — | Session creates refused |
| `tomcat.threads.current` | gauge | `name` | Connector thread count |
| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
| `tomcat.threads.config.max` | gauge | `name` | Configured max |
HikariCP (PostgreSQL pool)
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `hikaricp.connections` | gauge | `pool` | Total connections |
| `hikaricp.connections.active` | gauge | `pool` | In-use |
| `hikaricp.connections.idle` | gauge | `pool` | Idle |
| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
| `hikaricp.connections.min` | gauge | `pool` | Configured min |
| `hikaricp.connections.max` | gauge | `pool` | Configured max |
| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |
Pools are named. You'll see HikariPool-1 (PostgreSQL) and a separate pool for ClickHouse (clickHouseJdbcTemplate).
JDBC generic
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
| `jdbc.connections.max` | gauge | `name` | |
| `jdbc.connections.active` | gauge | `name` | |
| `jdbc.connections.idle` | gauge | `name` | |
Logging
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |
Spring Boot lifecycle
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `application.started.time` | timer | `main.application.class` | Cold-start duration |
| `application.ready.time` | timer | `main.application.class` | Time to ready |
Flyway
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |
Executor pools (if any @Async executors exist)
When a ThreadPoolTaskExecutor bean is registered and tagged, Micrometer adds:
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `executor.active` | gauge | `name` | Currently-running tasks |
| `executor.queued` | gauge | `name` | Queued tasks |
| `executor.queue.remaining` | gauge | `name` | Queue headroom |
| `executor.pool.size` | gauge | `name` | Current pool size |
| `executor.pool.core` | gauge | `name` | Core size |
| `executor.pool.max` | gauge | `name` | Max size |
| `executor.completed` | counter | `name` | Completed tasks |
Suggested dashboard panels
Below are 17 panels, each expressed as a single POST /api/v1/admin/server-metrics/query body. Tenant is implicit in the JWT — the server filters by tenant server-side. {from} and {to} are dashboard variables.
Row: server health (top of dashboard)
- Agents by state — stacked area.

  ```json
  { "metric": "cameleer.agents.connected", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
  ```

- Ingestion buffer depth by type — line chart.

  ```json
  { "metric": "cameleer.ingestion.buffer.size", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
  ```

- Ingestion drops per minute — bar chart.

  ```json
  { "metric": "cameleer.ingestion.drops", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["reason"], "mode": "delta" }
  ```

- Auth failures per minute — same shape as drops, grouped by `reason`.

  ```json
  { "metric": "cameleer.auth.failures", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["reason"], "mode": "delta" }
  ```
Row: JVM
- Heap used vs committed vs max — area chart (three overlay queries).

  ```json
  { "metric": "jvm.memory.used", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
  ```

  Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`.

- CPU % — line.

  ```json
  { "metric": "process.cpu.usage", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
  ```

  Overlay with `"metric": "system.cpu.usage"`.

- GC pause — max per cause.

  ```json
  { "metric": "jvm.gc.pause", "statistic": "max", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
  ```

- Thread count — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`, each with `statistic=value`, `aggregation=avg`, `mode=raw`.
Row: HTTP + DB
- HTTP mean latency by URI — top-N URIs.

  ```json
  { "metric": "http.server.requests", "statistic": "mean", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" }, "aggregation": "avg", "mode": "raw" }
  ```

  For a p99 proxy, repeat with `"statistic": "max"`.

- HTTP error rate — two queries, divide client-side: total requests and 5xx requests.

  ```json
  { "metric": "http.server.requests", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "mode": "delta", "aggregation": "sum" }
  ```

  Then for the 5xx series, add `"filterTags": { "outcome": "SERVER_ERROR" }` and divide.

- HikariCP pool saturation — overlay two queries.

  ```json
  { "metric": "hikaricp.connections.active", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
  ```

  Overlay with `"metric": "hikaricp.connections.pending"`.

- Hikari acquire timeouts per minute.

  ```json
  { "metric": "hikaricp.connections.timeout", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["pool"], "mode": "delta" }
  ```
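The "divide client-side" step in the error-rate panel is easy to get wrong around empty buckets. A minimal sketch, with a hypothetical helper name and a zero-traffic-means-zero-rate convention of my own:

```python
def error_rate(total_points, error_points):
    """Divide two /query delta series point-wise by timestamp.
    Buckets with zero (or missing) total traffic yield a rate of 0.0
    rather than a division-by-zero."""
    totals = {p["t"]: p["v"] for p in total_points}
    return [
        {"t": p["t"], "v": (p["v"] / totals[p["t"]]) if totals.get(p["t"]) else 0.0}
        for p in error_points
    ]

total = [{"t": "00:00", "v": 200.0}, {"t": "00:01", "v": 0.0}]
errors = [{"t": "00:00", "v": 10.0}, {"t": "00:01", "v": 0.0}]
print(error_rate(total, errors))
# [{'t': '00:00', 'v': 0.05}, {'t': '00:01', 'v': 0.0}]
```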
Row: alerting (collapsible)
- Alerting instances by state — stacked.

  ```json
  { "metric": "alerting_instances_total", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
  ```

- Eval errors per minute by kind.

  ```json
  { "metric": "alerting_eval_errors_total", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["kind"], "mode": "delta" }
  ```

- Webhook delivery — max per minute.

  ```json
  { "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max", "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "max", "mode": "raw" }
  ```
Row: deployments (runtime-enabled only)
- Deploy outcomes per hour.

  ```json
  { "metric": "cameleer.deployments.outcome", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 3600, "groupByTags": ["status"], "mode": "delta" }
  ```

- Deploy duration mean.

  ```json
  { "metric": "cameleer.deployments.duration", "statistic": "mean", "from": "{from}", "to": "{to}", "stepSeconds": 300, "aggregation": "avg", "mode": "raw" }
  ```

  For a p99 proxy, repeat with `"statistic": "max"`.
Notes for the dashboard implementer
- Use the REST API. The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- `total_time` vs `total`: SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for a Timer's cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- Cardinality warning: `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see an explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
- The dashboard is read-only. There's no write path — only the server writes into `server_metrics`.
Changelog
- 2026-04-23 — initial write. Write-only backend.
- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.
- 2026-04-24 — shipped the built-in `/admin/server-metrics` UI dashboard. Gated by `infrastructureendpoints` + ADMIN, identical visibility to `/admin/{database,clickhouse}`. Source: `ui/src/pages/Admin/ServerMetricsAdminPage.tsx`.