Server Self-Metrics — Reference for Dashboard Builders
This is the reference for anyone building a server-health dashboard on top of the Cameleer server. It documents the server_metrics ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.
tl;dr — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.
Built-in admin dashboard
The server ships a ready-to-use dashboard at /admin/server-metrics in the web UI. It renders the 17 panels listed below using ThemedChart from the design system. The window is driven by the app-wide time-range control in the TopBar (same one used by Exchanges, Dashboard, and Runtime), so every panel automatically reflects the range you've selected globally. Visibility mirrors the Database and ClickHouse admin pages:
- Requires the `ADMIN` role.
- Hidden when `cameleer.server.security.infrastructureendpoints=false` (both the backend endpoints and the sidebar entry disappear).
Use this page for single-tenant installs and dev/staging — it's the fastest path to "is the server healthy right now?". For multi-tenant control planes, cross-environment rollups, or embedding metrics inside an existing operations console, call the REST API below instead.
Table schema
```sql
server_metrics (
  tenant_id          LowCardinality(String) DEFAULT 'default',
  collected_at       DateTime64(3),
  server_instance_id LowCardinality(String),
  metric_name        LowCardinality(String),
  metric_type        LowCardinality(String), -- counter|gauge|timer|distribution_summary|long_task_timer|other
  statistic          LowCardinality(String) DEFAULT 'value',
  metric_value       Float64,
  tags               Map(String, String) DEFAULT map(),
  server_received_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```
What each column means
| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Stable id per server process: property → `HOSTNAME` env → DNS → random UUID. Rotates on restart, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Dots, not underscores. |
| `metric_type` | Lowercase Micrometer `Meter.Type`. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | Float64. Non-finite values (NaN / ±∞) are dropped before insert. |
| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. |
Counter semantics (important)
Counters are cumulative totals since meter registration, same convention as Prometheus. To get a rate, compute a delta within a server_instance_id:
```sql
SELECT
  toStartOfMinute(collected_at) AS minute,
  metric_value - any(metric_value) OVER (
    PARTITION BY server_instance_id, metric_name, tags
    ORDER BY collected_at
    ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
  ) AS per_minute_delta
FROM server_metrics
WHERE metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```
On restart the server_instance_id rotates, so a simple LAG() partitioned by server_instance_id gives monotonic segments without fighting counter resets.
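If you end up doing this math client-side instead, the positive-clipped per-instance delta that the REST API's `mode=delta` performs can be sketched as follows. The `Sample` type and `counterDeltas` helper are illustrative, not part of any shipped API:

```typescript
type Sample = { instanceId: string; t: number; value: number };

// Within each server_instance_id, subtract the previous cumulative value and
// clip negatives to zero. Because partitions are per-instance, an instance
// rotation on restart never produces a negative spike.
function counterDeltas(samples: Sample[]): Sample[] {
  const byInstance = new Map<string, Sample[]>();
  for (const s of samples) {
    const bucket = byInstance.get(s.instanceId) ?? [];
    bucket.push(s);
    byInstance.set(s.instanceId, bucket);
  }
  const out: Sample[] = [];
  for (const series of byInstance.values()) {
    series.sort((a, b) => a.t - b.t);
    for (let i = 1; i < series.length; i++) {
      const d = series[i].value - series[i - 1].value;
      out.push({ ...series[i], value: Math.max(0, d) });
    }
  }
  return out.sort((a, b) => a.t - b.t);
}
```

Note that the first sample of each instance produces no delta point, which is exactly the "monotonic segments" behavior described above.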
Retention
90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.
How to query
Use the REST API — /api/v1/admin/server-metrics/**. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard /api/v1/admin/** RBAC gate).
GET /catalog
Enumerate every metric_name observed in a window, with its metric_type, the set of statistics emitted, and the union of tag keys.
```
GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer <admin-jwt>
```

```json
[
  {
    "metricName": "cameleer.agents.connected",
    "metricType": "gauge",
    "statistics": ["value"],
    "tagKeys": ["state"]
  },
  {
    "metricName": "cameleer.ingestion.drops",
    "metricType": "counter",
    "statistics": ["count"],
    "tagKeys": ["reason"]
  },
  ...
]
```
from/to are optional; default is the last 1 h.
GET /instances
Enumerate the server_instance_id values that wrote at least one sample in the window, with firstSeen / lastSeen. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.
```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```

```json
[
  { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
  { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
]
```
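Turning this response into restart annotations can be sketched like this; every instance's `firstSeen` except the earliest one marks a restart. The `InstanceWindow` type and `restartMarkers` helper are illustrative:

```typescript
type InstanceWindow = { serverInstanceId: string; firstSeen: string; lastSeen: string };

// Sort instances by firstSeen; every firstSeen after the oldest instance's
// is the moment a replacement process started writing, i.e. a restart.
function restartMarkers(instances: InstanceWindow[]): string[] {
  return [...instances]
    .sort((a, b) => a.firstSeen.localeCompare(b.firstSeen))
    .slice(1)
    .map((i) => i.firstSeen);
}
```

For the example response above this yields a single marker at `2026-04-22T14:30:00Z`, when `srv-prod-b` took over from `srv-prod-a`.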
POST /query — generic time-series
The workhorse. One endpoint covers every panel in the dashboard.
```
POST /api/v1/admin/server-metrics/query
Authorization: Bearer <admin-jwt>
Content-Type: application/json
```
Request body:
```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "from": "2026-04-22T00:00:00Z",
  "to": "2026-04-23T00:00:00Z",
  "stepSeconds": 60,
  "groupByTags": ["reason"],
  "filterTags": { },
  "aggregation": "sum",
  "mode": "delta",
  "serverInstanceIds": null
}
```
Response:
```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "aggregation": "sum",
  "mode": "delta",
  "stepSeconds": 60,
  "series": [
    {
      "tags": { "reason": "buffer_full" },
      "points": [
        { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
        { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
        { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
      ]
    }
  ]
}
```
Request field reference
| Field | Type | Required | Description |
|---|---|---|---|
| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. |
| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time or total) / sum(count)` per bucket. |
| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from` ≤ 31 days. |
| `stepSeconds` | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. |
| `filterTags` | map<string,string> | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to sum of per-instance deltas). |
| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |
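The derived `mean` statistic can also be reproduced client-side from raw bucket rows. A minimal sketch, handling both `total_time` and `total` spellings; the `BucketRow` type and `derivedMean` helper are illustrative:

```typescript
// Rows of one timer within one bucket: a count row plus whichever
// cumulative-duration row the registry emitted ("total_time" or "total").
type BucketRow = { statistic: string; value: number };

// mean = sum(total_time or total) / sum(count); null for empty buckets
// instead of dividing by zero.
function derivedMean(rows: BucketRow[]): number | null {
  const sum = (stats: string[]) =>
    rows.filter((r) => stats.includes(r.statistic)).reduce((a, r) => a + r.value, 0);
  const count = sum(["count"]);
  if (count === 0) return null;
  return sum(["total_time", "total"]) / count;
}
```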
Validation errors
Any IllegalArgumentException surfaces as 400 Bad Request with `{"error": "…"}`. Triggers:
- unsafe characters in identifiers
- `from ≥ to` or range > 31 days
- `stepSeconds` outside [10, 3600]
- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)
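A dashboard can fail fast before hitting the 400 by pre-checking the same rules client-side. A sketch mirroring the documented checks; `validateQuery` is a hypothetical helper, not a shipped API (and it cannot pre-check series cardinality, which only the server knows):

```typescript
const IDENT = /^[a-zA-Z0-9._]+$/;

// Mirrors the documented server-side rules: identifier charset, window
// ordering, the 31-day cap, and the stepSeconds clamp. Returns a list of
// problems; empty means the request should pass server validation.
function validateQuery(q: { metric: string; from: string; to: string; stepSeconds?: number }): string[] {
  const errors: string[] = [];
  if (!IDENT.test(q.metric)) errors.push("unsafe characters in metric");
  const from = Date.parse(q.from);
  const to = Date.parse(q.to);
  if (!(from < to)) errors.push("from must be before to");
  else if (to - from > 31 * 24 * 3600 * 1000) errors.push("range exceeds 31 days");
  const step = q.stepSeconds ?? 60;
  if (step < 10 || step > 3600) errors.push("stepSeconds outside [10, 3600]");
  return errors;
}
```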
Direct ClickHouse (fallback)
If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for /api/v1/admin/clickhouse/query (infrastructureendpoints=true, ADMIN) or a dedicated read-only CH user scoped to server_metrics. All direct queries must filter by tenant_id.
Metric catalog
Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.
Cameleer business metrics — agent + ingestion
Source: cameleer-server-app/.../metrics/ServerMetrics.java.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) |
| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator |
| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type |
Cameleer business metrics — deploy + auth
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot |
| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency |
| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |
Alerting subsystem metrics
Source: cameleer-server-app/.../alerting/metrics/AlertingMetrics.java.
| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` |
| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL `alert_instances` |
| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind |
| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind |
| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency |
| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency |
| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes |
JVM — memory, GC, threads, classes
From Spring Boot Actuator (JvmMemoryMetrics, JvmGcMetrics, JvmThreadMetrics, ClassLoaderMetrics).
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
| `jvm.threads.live` | gauge | — | Current live thread count |
| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
| `jvm.threads.peak` | gauge | — | Peak thread count since start |
| `jvm.threads.started` | counter | — | Cumulative threads started |
| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch `max` |
| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant 1.0; tags carry the real info |
Process and system
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
| `process.uptime` | gauge | — | ms since start |
| `process.start.time` | gauge | — | Epoch start |
| `process.files.open` | gauge | — | Open FDs |
| `process.files.max` | gauge | — | FD ulimit |
| `system.cpu.count` | gauge | — | Cores visible to the JVM |
| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
| `disk.total` | gauge | `path` | Total bytes |
HTTP server
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: `count`, `total_time`/`total`, `max` |
| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |

`uri` is the Spring-templated path (`/api/v1/environments/{envSlug}/apps/{appSlug}`), not the raw URL — cardinality stays bounded.
Tomcat
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
| `tomcat.sessions.created` | counter | — | Cumulative session creates |
| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
| `tomcat.sessions.rejected` | counter | — | Session creates refused |
| `tomcat.threads.current` | gauge | `name` | Connector thread count |
| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
| `tomcat.threads.config.max` | gauge | `name` | Configured max |
HikariCP (PostgreSQL pool)
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `hikaricp.connections` | gauge | `pool` | Total connections |
| `hikaricp.connections.active` | gauge | `pool` | In-use |
| `hikaricp.connections.idle` | gauge | `pool` | Idle |
| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
| `hikaricp.connections.min` | gauge | `pool` | Configured min |
| `hikaricp.connections.max` | gauge | `pool` | Configured max |
| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |

Pools are named. You'll see `HikariPool-1` (PostgreSQL) and a separate pool for ClickHouse (`clickHouseJdbcTemplate`).
JDBC generic
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
| `jdbc.connections.max` | gauge | `name` | |
| `jdbc.connections.active` | gauge | `name` | |
| `jdbc.connections.idle` | gauge | `name` | |
Logging
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |
Spring Boot lifecycle
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `application.started.time` | timer | `main.application.class` | Cold-start duration |
| `application.ready.time` | timer | `main.application.class` | Time to ready |
Flyway
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |
Executor pools (if any @Async executors exist)
When a ThreadPoolTaskExecutor bean is registered and tagged, Micrometer adds:
| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `executor.active` | gauge | `name` | Currently-running tasks |
| `executor.queued` | gauge | `name` | Queued tasks |
| `executor.queue.remaining` | gauge | `name` | Queue headroom |
| `executor.pool.size` | gauge | `name` | Current pool size |
| `executor.pool.core` | gauge | `name` | Core size |
| `executor.pool.max` | gauge | `name` | Max size |
| `executor.completed` | counter | `name` | Completed tasks |
Suggested dashboard panels
Below are 17 panels, each expressed as a single POST /api/v1/admin/server-metrics/query body. Tenant is implicit in the JWT — the server filters by tenant server-side. {from} and {to} are dashboard variables.
Row: server health (top of dashboard)
- Agents by state — stacked area.

  ```json
  { "metric": "cameleer.agents.connected", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
  ```

- Ingestion buffer depth by type — line chart.

  ```json
  { "metric": "cameleer.ingestion.buffer.size", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
  ```

- Ingestion drops per minute — bar chart.

  ```json
  { "metric": "cameleer.ingestion.drops", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["reason"], "mode": "delta" }
  ```

- Auth failures per minute — same shape as drops, grouped by `reason`.

  ```json
  { "metric": "cameleer.auth.failures", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["reason"], "mode": "delta" }
  ```
Row: JVM
- Heap used vs committed vs max — area chart (three overlay queries).

  ```json
  { "metric": "jvm.memory.used", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
  ```

  Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`.

- CPU % — line.

  ```json
  { "metric": "process.cpu.usage", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
  ```

  Overlay with `"metric": "system.cpu.usage"`.

- GC pause — max per cause.

  ```json
  { "metric": "jvm.gc.pause", "statistic": "max", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
  ```

- Thread count — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`, each with `statistic=value`, `aggregation=avg`, `mode=raw`.
Row: HTTP + DB
- HTTP mean latency by URI — top-N URIs.

  ```json
  { "metric": "http.server.requests", "statistic": "mean", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" }, "aggregation": "avg", "mode": "raw" }
  ```

  For a p99 proxy, repeat with `"statistic": "max"`.

- HTTP error rate — two queries, divide client-side: total requests and 5xx requests.

  ```json
  { "metric": "http.server.requests", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "mode": "delta", "aggregation": "sum" }
  ```

  Then for the 5xx series, add `"filterTags": { "outcome": "SERVER_ERROR" }` and divide.

- HikariCP pool saturation — overlay two queries.

  ```json
  { "metric": "hikaricp.connections.active", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
  ```

  Overlay with `"metric": "hikaricp.connections.pending"`.

- Hikari acquire timeouts per minute.

  ```json
  { "metric": "hikaricp.connections.timeout", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["pool"], "mode": "delta" }
  ```
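The client-side division for the error-rate panel can be sketched like this. The `Point` shape matches the `/query` response points; `errorRate` is an illustrative helper:

```typescript
type Point = { t: string; v: number };

// Aligns the 5xx delta series with the all-requests delta series by bucket
// timestamp and emits an error ratio per bucket (0 when there is no traffic
// or no matching 5xx bucket).
function errorRate(total: Point[], errors: Point[]): Point[] {
  const errByT = new Map(errors.map((p) => [p.t, p.v] as [string, number]));
  return total.map((p) => ({ t: p.t, v: p.v > 0 ? (errByT.get(p.t) ?? 0) / p.v : 0 }));
}
```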
Row: alerting (collapsible)
- Alerting instances by state — stacked.

  ```json
  { "metric": "alerting_instances_total", "statistic": "value", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
  ```

- Eval errors per minute by kind.

  ```json
  { "metric": "alerting_eval_errors_total", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 60, "groupByTags": ["kind"], "mode": "delta" }
  ```

- Webhook delivery — max per minute.

  ```json
  { "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max", "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "max", "mode": "raw" }
  ```
Row: deployments (runtime-enabled only)
- Deploy outcomes per hour.

  ```json
  { "metric": "cameleer.deployments.outcome", "statistic": "count", "from": "{from}", "to": "{to}", "stepSeconds": 3600, "groupByTags": ["status"], "mode": "delta" }
  ```

- Deploy duration mean.

  ```json
  { "metric": "cameleer.deployments.duration", "statistic": "mean", "from": "{from}", "to": "{to}", "stepSeconds": 300, "aggregation": "avg", "mode": "raw" }
  ```

  For a p99 proxy, repeat with `"statistic": "max"`.
Notes for the dashboard implementer
- Use the REST API. The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- `total_time` vs `total`. SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- Cardinality warning: `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
- The dashboard is read-only. There's no write path — only the server writes into `server_metrics`.
Changelog
- 2026-04-23 — initial write. Write-only backend.
- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.
- 2026-04-24 — shipped the built-in `/admin/server-metrics` UI dashboard. Gated by `infrastructureendpoints` + ADMIN, identical visibility to `/admin/{database,clickhouse}`. Source: `ui/src/pages/Admin/ServerMetricsAdminPage.tsx`.
- 2026-04-24 — dashboard now uses the global time-range control (`useGlobalFilters`) instead of a page-local picker. Bucket size auto-scales with the selected window (10 s → 1 h). Query hooks now take a `ServerMetricsRange = { from: Date; to: Date }` instead of a `windowSeconds` number so they work for any absolute or rolling range the TopBar supplies.
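The bucket auto-scaling in the last entry can be sketched as follows. The two endpoints (10 s for windows up to 30 min, 1 h beyond 48 h) are the documented behavior; the intermediate breakpoints here are illustrative assumptions, and `stepSecondsFor` as written is a sketch, not the shipped implementation:

```typescript
// Maps the selected window length to a bucket size inside the API's
// [10, 3600] clamp. Endpoints (<=30 min -> 10 s, >48 h -> 1 h) are
// documented; the middle breakpoints are assumptions for illustration.
function stepSecondsFor(windowSeconds: number): number {
  if (windowSeconds <= 30 * 60) return 10;     // up to 30 min
  if (windowSeconds <= 3 * 3600) return 60;    // up to 3 h (assumed)
  if (windowSeconds <= 12 * 3600) return 300;  // up to 12 h (assumed)
  if (windowSeconds <= 48 * 3600) return 900;  // up to 48 h (assumed)
  return 3600;                                 // beyond 48 h
}
```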