Server Self-Metrics — Reference for Dashboard Builders
This is the reference for anyone building a server-health dashboard on top of the Cameleer server. It documents the server_metrics ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.
tl;dr — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per (meter, statistic) pair. No external Prometheus required.
Built-in admin dashboard
The server ships a ready-to-use dashboard at /admin/server-metrics in the web UI. It renders the 17 panels listed below using ThemedChart from the design system, with a time-range selector (15 min / 1 h / 6 h / 24 h / 7 d) and live auto-refresh. Visibility mirrors the Database and ClickHouse admin pages:
- Requires the `ADMIN` role.
- Hidden when `cameleer.server.security.infrastructureendpoints=false` (both the backend endpoints and the sidebar entry disappear).
Use this page for single-tenant installs and dev/staging — it's the fastest path to "is the server healthy right now?". For multi-tenant control planes, cross-environment rollups, or embedding metrics inside an existing operations console, call the REST API below instead.
Table schema
```sql
server_metrics (
    tenant_id           LowCardinality(String) DEFAULT 'default',
    collected_at        DateTime64(3),
    server_instance_id  LowCardinality(String),
    metric_name         LowCardinality(String),
    metric_type         LowCardinality(String),  -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic           LowCardinality(String) DEFAULT 'value',
    metric_value        Float64,
    tags                Map(String, String) DEFAULT map(),
    server_received_at  DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```
What each column means
| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Stable id per server process: property → `HOSTNAME` env → DNS → random UUID. Rotates on restart, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Dots, not underscores. |
| `metric_type` | Lowercase Micrometer `Meter.Type`. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. |
| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. |
Counter semantics (important)
Counters are cumulative totals since meter registration, same convention as Prometheus. To get a rate, compute a delta within a server_instance_id:
```sql
SELECT
    toStartOfMinute(collected_at) AS minute,
    metric_value - any(metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = 'default'          -- always filter by tenant
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```
On restart the `server_instance_id` rotates, so a previous-sample lookup partitioned by `server_instance_id` (the `any(...) OVER (ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)` window above, ClickHouse's stand-in for `LAG()`) gives monotonic segments without fighting counter resets.
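If you need the same math client-side, the per-instance, positive-clipped delta logic can be sketched in a few lines of Python; the function name and sample shape are illustrative, not part of any shipped client:

```python
def positive_clipped_deltas(samples):
    """samples: list of (server_instance_id, timestamp, cumulative_value),
    ordered by timestamp. Returns (timestamp, delta) pairs. The first sample
    of each instance yields no delta (there is nothing to diff against), and
    negative diffs are clipped to 0 so counter resets never go negative."""
    last = {}
    out = []
    for instance, ts, value in samples:
        if instance in last:
            out.append((ts, max(0.0, value - last[instance])))
        last[instance] = value
    return out

samples = [
    ("srv-a", 0, 100.0), ("srv-a", 60, 105.0),   # +5
    ("srv-a", 120, 105.0),                        # +0
    ("srv-b", 180, 2.0),                          # restart: new instance, no delta
    ("srv-b", 240, 7.0),                          # +5
]
print(positive_clipped_deltas(samples))
# [(60, 5.0), (120, 0.0), (240, 5.0)]
```

Because the partition key is the instance id, a restart simply starts a new monotonic segment instead of producing a huge negative spike.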
Retention
90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.
How to query
Use the REST API — /api/v1/admin/server-metrics/**. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard /api/v1/admin/** RBAC gate).
GET /catalog
Enumerate every metric_name observed in a window, with its metric_type, the set of statistics emitted, and the union of tag keys.
```
GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer <admin-jwt>
```

```json
[
  {
    "metricName": "cameleer.agents.connected",
    "metricType": "gauge",
    "statistics": ["value"],
    "tagKeys": ["state"]
  },
  {
    "metricName": "cameleer.ingestion.drops",
    "metricType": "counter",
    "statistics": ["count"],
    "tagKeys": ["reason"]
  },
  ...
]
```
from/to are optional; default is the last 1 h.
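As an illustration, a dashboard's metric picker could group catalog entries by type. `group_catalog_by_type` is a hypothetical helper, not part of the API:

```python
def group_catalog_by_type(entries):
    """Group parsed /catalog entries into {metricType: sorted metric names},
    e.g. to populate a grouped dropdown in a dashboard builder."""
    grouped = {}
    for e in entries:
        grouped.setdefault(e["metricType"], []).append(e["metricName"])
    return {t: sorted(names) for t, names in grouped.items()}

catalog = [
    {"metricName": "cameleer.agents.connected", "metricType": "gauge",
     "statistics": ["value"], "tagKeys": ["state"]},
    {"metricName": "cameleer.ingestion.drops", "metricType": "counter",
     "statistics": ["count"], "tagKeys": ["reason"]},
]
print(group_catalog_by_type(catalog))
# {'gauge': ['cameleer.agents.connected'], 'counter': ['cameleer.ingestion.drops']}
```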
GET /instances
Enumerate the server_instance_id values that wrote at least one sample in the window, with firstSeen / lastSeen. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.
```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```

```json
[
  { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
  { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
]
```
POST /query — generic time-series
The workhorse. One endpoint covers every panel in the dashboard.
```
POST /api/v1/admin/server-metrics/query
Authorization: Bearer <admin-jwt>
Content-Type: application/json
```

Request body:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "from": "2026-04-22T00:00:00Z",
  "to": "2026-04-23T00:00:00Z",
  "stepSeconds": 60,
  "groupByTags": ["reason"],
  "filterTags": { },
  "aggregation": "sum",
  "mode": "delta",
  "serverInstanceIds": null
}
```

Response:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "aggregation": "sum",
  "mode": "delta",
  "stepSeconds": 60,
  "series": [
    {
      "tags": { "reason": "buffer_full" },
      "points": [
        { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
        { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
        { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
      ]
    }
  ]
}
```
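A small helper can flatten this response into per-series point lists for a chart library. The helper name and label convention below are my own, a sketch rather than shipped code:

```python
def to_chart_series(response):
    """Flatten a POST /query response into {label: [(iso_ts, value), ...]},
    labelling each series by its tag map ('reason=buffer_full'), or by the
    metric name when the series has no tags."""
    series = {}
    for s in response["series"]:
        label = ",".join(f"{k}={v}" for k, v in sorted(s["tags"].items())) \
            or response["metric"]
        series[label] = [(p["t"], p["v"]) for p in s["points"]]
    return series

resp = {
    "metric": "cameleer.ingestion.drops",
    "series": [{
        "tags": {"reason": "buffer_full"},
        "points": [{"t": "2026-04-22T00:00:00.000Z", "v": 0.0},
                   {"t": "2026-04-22T00:01:00.000Z", "v": 5.0}],
    }],
}
print(to_chart_series(resp))
# {'reason=buffer_full': [('2026-04-22T00:00:00.000Z', 0.0), ('2026-04-22T00:01:00.000Z', 5.0)]}
```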
Request field reference
| Field | Type | Required | Description |
|---|---|---|---|
| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. |
| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time \| total) / sum(count)` per bucket. |
| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from` ≤ 31 days. |
| `stepSeconds` | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. |
| `filterTags` | map<string,string> | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to sum of per-instance deltas). |
| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |
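For intuition, raw-mode bucketing with the within-bucket reducers can be sketched as follows. This is an illustrative reading of the field reference above, not the server's actual implementation:

```python
REDUCERS = {
    "avg": lambda vs: sum(vs) / len(vs),
    "sum": sum,
    "max": max,
    "min": min,
    "latest": lambda vs: vs[-1],   # samples arrive in time order
}

def bucketize(points, start_epoch, step_seconds, aggregation="avg"):
    """points: time-ordered list of (epoch_seconds, value).
    Assigns each sample to its stepSeconds-aligned bucket, then applies
    the chosen within-bucket reducer. Returns {bucket_start: value}."""
    buckets = {}
    for t, v in points:
        b = start_epoch + step_seconds * ((t - start_epoch) // step_seconds)
        buckets.setdefault(b, []).append(v)
    reduce_fn = REDUCERS[aggregation]
    return {b: reduce_fn(vs) for b, vs in buckets.items()}

pts = [(0, 1.0), (30, 3.0), (60, 10.0)]
print(bucketize(pts, 0, 60, "avg"))   # {0: 2.0, 60: 10.0}
print(bucketize(pts, 0, 60, "max"))   # {0: 3.0, 60: 10.0}
```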
Validation errors
Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers:
- unsafe characters in identifiers
- `from ≥ to` or range > 31 days
- `stepSeconds` outside [10, 3600]
- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)
Direct ClickHouse (fallback)
If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for /api/v1/admin/clickhouse/query (infrastructureendpoints=true, ADMIN) or a dedicated read-only CH user scoped to server_metrics. All direct queries must filter by tenant_id.
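As a sketch of this fallback path, here is one way to assemble a tenant-scoped quantile query as a SQL string. The helper is hypothetical, and note the caveat: rows in `server_metrics` are already per-tick aggregates, so a quantile over them is only a coarse proxy for request-level percentiles:

```python
def p99_latency_sql(metric, tenant_id="default"):
    """Build a ClickHouse quantile query the generic /query endpoint can't
    express, scoped to one tenant as the doc requires.
    NOTE: interpolate only trusted, validated identifiers here; for user
    input, prefer server-side bound parameters instead."""
    return f"""
SELECT
    toStartOfMinute(collected_at) AS bucket,
    quantile(0.99)(metric_value) AS p99
FROM server_metrics
WHERE tenant_id = '{tenant_id}'
  AND metric_name = '{metric}'
  AND statistic = 'max'
GROUP BY bucket
ORDER BY bucket
""".strip()

sql = p99_latency_sql("cameleer.ingestion.flush.duration")
print("tenant_id = 'default'" in sql)  # True
```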
Metric catalog
Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.
Cameleer business metrics — agent + ingestion
Source: cameleer-server-app/.../metrics/ServerMetrics.java.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) |
| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator |
| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type |
Cameleer business metrics — deploy + auth
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot |
| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency |
| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |
Alerting subsystem metrics
Source: cameleer-server-app/.../alerting/metrics/AlertingMetrics.java.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` |
| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL `alert_instances` |
| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind |
| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind |
| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency |
| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency |
| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes |
JVM — memory, GC, threads, classes
From Spring Boot Actuator (JvmMemoryMetrics, JvmGcMetrics, JvmThreadMetrics, ClassLoaderMetrics).
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
| `jvm.threads.live` | gauge | — | Current live thread count |
| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
| `jvm.threads.peak` | gauge | — | Peak thread count since start |
| `jvm.threads.started` | counter | — | Cumulative threads started |
| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch max |
| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant 1.0; tags carry the real info |
Process and system
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
| `process.uptime` | gauge | — | ms since start |
| `process.start.time` | gauge | — | Epoch start |
| `process.files.open` | gauge | — | Open FDs |
| `process.files.max` | gauge | — | FD ulimit |
| `system.cpu.count` | gauge | — | Cores visible to the JVM |
| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
| `disk.total` | gauge | `path` | Total bytes |
HTTP server
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: `count`, `total_time`/`total`, `max` |
| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |
uri is the Spring-templated path (/api/v1/environments/{envSlug}/apps/{appSlug}), not the raw URL — cardinality stays bounded.
Tomcat
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
| `tomcat.sessions.created` | counter | — | Cumulative session creates |
| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
| `tomcat.sessions.rejected` | counter | — | Session creates refused |
| `tomcat.threads.current` | gauge | `name` | Connector thread count |
| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
| `tomcat.threads.config.max` | gauge | `name` | Configured max |
HikariCP (PostgreSQL pool)
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `hikaricp.connections` | gauge | `pool` | Total connections |
| `hikaricp.connections.active` | gauge | `pool` | In-use |
| `hikaricp.connections.idle` | gauge | `pool` | Idle |
| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
| `hikaricp.connections.min` | gauge | `pool` | Configured min |
| `hikaricp.connections.max` | gauge | `pool` | Configured max |
| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |
Pools are named. You'll see HikariPool-1 (PostgreSQL) and a separate pool for ClickHouse (clickHouseJdbcTemplate).
JDBC generic
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
| `jdbc.connections.max` | gauge | `name` | |
| `jdbc.connections.active` | gauge | `name` | |
| `jdbc.connections.idle` | gauge | `name` | |
Logging
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |
Spring Boot lifecycle
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `application.started.time` | timer | `main.application.class` | Cold-start duration |
| `application.ready.time` | timer | `main.application.class` | Time to ready |
Flyway
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |
Executor pools (if any @Async executors exist)
When a ThreadPoolTaskExecutor bean is registered and tagged, Micrometer adds:
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `executor.active` | gauge | `name` | Currently-running tasks |
| `executor.queued` | gauge | `name` | Queued tasks |
| `executor.queue.remaining` | gauge | `name` | Queue headroom |
| `executor.pool.size` | gauge | `name` | Current pool size |
| `executor.pool.core` | gauge | `name` | Core size |
| `executor.pool.max` | gauge | `name` | Max size |
| `executor.completed` | counter | `name` | Completed tasks |
Suggested dashboard panels
Below are 17 panels, each expressed as a single POST /api/v1/admin/server-metrics/query body. Tenant is implicit in the JWT — the server filters by tenant server-side. {from} and {to} are dashboard variables.
Row: server health (top of dashboard)
- Agents by state — stacked area.

  ```json
  { "metric": "cameleer.agents.connected", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
  ```

- Ingestion buffer depth by type — line chart.

  ```json
  { "metric": "cameleer.ingestion.buffer.size", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
  ```

- Ingestion drops per minute — bar chart.

  ```json
  { "metric": "cameleer.ingestion.drops", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["reason"], "mode": "delta" }
  ```

- Auth failures per minute — same shape as drops, grouped by `reason`.

  ```json
  { "metric": "cameleer.auth.failures", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["reason"], "mode": "delta" }
  ```
Row: JVM
- Heap used vs committed vs max — area chart (three overlay queries).

  ```json
  { "metric": "jvm.memory.used", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
  ```

  Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`.

- CPU % — line.

  ```json
  { "metric": "process.cpu.usage", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
  ```

  Overlay with `"metric": "system.cpu.usage"`.

- GC pause — max per cause.

  ```json
  { "metric": "jvm.gc.pause", "statistic": "max", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
  ```

- Thread count — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`, each with `statistic=value`, `aggregation=avg`, `mode=raw`.
Row: HTTP + DB
- HTTP mean latency by URI — top-N URIs.

  ```json
  { "metric": "http.server.requests", "statistic": "mean", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" }, "aggregation": "avg", "mode": "raw" }
  ```

  For a p99 proxy, repeat with `"statistic": "max"`.

- HTTP error rate — two queries, divide client-side: total requests and 5xx requests.

  ```json
  { "metric": "http.server.requests", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "mode": "delta", "aggregation": "sum" }
  ```

  Then for the 5xx series, add `"filterTags": { "outcome": "SERVER_ERROR" }` and divide.

- HikariCP pool saturation — overlay two queries.

  ```json
  { "metric": "hikaricp.connections.active", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
  ```

  Overlay with `"metric": "hikaricp.connections.pending"`.

- Hikari acquire timeouts per minute.

  ```json
  { "metric": "hikaricp.connections.timeout", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["pool"], "mode": "delta" }
  ```
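The "divide client-side" step in the error-rate panel is easy to get wrong around empty buckets. A minimal sketch, with a hypothetical helper name and a zero-traffic-means-zero-rate convention of my own:

```python
def error_rate(total_points, error_points):
    """Divide two /query delta series point-wise by timestamp.
    Buckets with zero (or missing) total traffic yield a rate of 0.0
    rather than a division-by-zero."""
    totals = {p["t"]: p["v"] for p in total_points}
    return [
        {"t": p["t"], "v": (p["v"] / totals[p["t"]]) if totals.get(p["t"]) else 0.0}
        for p in error_points
    ]

total = [{"t": "00:00", "v": 200.0}, {"t": "00:01", "v": 0.0}]
errors = [{"t": "00:00", "v": 10.0}, {"t": "00:01", "v": 0.0}]
print(error_rate(total, errors))
# [{'t': '00:00', 'v': 0.05}, {'t': '00:01', 'v': 0.0}]
```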
Row: alerting (collapsible)
- Alerting instances by state — stacked.

  ```json
  { "metric": "alerting_instances_total", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
  ```

- Eval errors per minute by kind.

  ```json
  { "metric": "alerting_eval_errors_total", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["kind"], "mode": "delta" }
  ```

- Webhook delivery — max per minute.

  ```json
  { "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max", "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "max", "mode": "raw" }
  ```
Row: deployments (runtime-enabled only)
- Deploy outcomes per hour.

  ```json
  { "metric": "cameleer.deployments.outcome", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 3600, "groupByTags": ["status"], "mode": "delta" }
  ```

- Deploy duration mean.

  ```json
  { "metric": "cameleer.deployments.duration", "statistic": "mean", "from": "{from}", "to": "{to}", "stepSeconds": 300, "aggregation": "avg", "mode": "raw" }
  ```

  For a p99 proxy, repeat with `"statistic": "max"`.
Notes for the dashboard implementer
- Use the REST API. The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- `total_time` vs `total`: SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for a Timer's cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- Cardinality warning: `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see an explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
- The dashboard is read-only. There's no write path — only the server writes into `server_metrics`.
Changelog
- 2026-04-23 — initial write. Write-only backend.
- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.
- 2026-04-24 — shipped the built-in `/admin/server-metrics` UI dashboard. Gated by `infrastructureendpoints` + ADMIN, identical visibility to `/admin/{database,clickhouse}`. Source: `ui/src/pages/Admin/ServerMetricsAdminPage.tsx`.