
Server Self-Metrics — Reference for Dashboard Builders

This is the reference for anyone building a server-health dashboard on top of the Cameleer server. It documents the server_metrics ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.

tl;dr — Every 60 s, every meter in the server's Micrometer registry (all cameleer.*, all alerting_*, and the full Spring Boot Actuator set) is written into ClickHouse as one row per (meter, statistic) pair. No external Prometheus required.


Built-in admin dashboard

The server ships a ready-to-use dashboard at /admin/server-metrics in the web UI. It renders the 17 panels listed below using ThemedChart from the design system. The window is driven by the app-wide time-range control in the TopBar (same one used by Exchanges, Dashboard, and Runtime), so every panel automatically reflects the range you've selected globally. Visibility mirrors the Database and ClickHouse admin pages:

  • Requires the ADMIN role.
  • Hidden when cameleer.server.security.infrastructureendpoints=false (both the backend endpoints and the sidebar entry disappear).

Use this page for single-tenant installs and dev/staging — it's the fastest path to "is the server healthy right now?". For multi-tenant control planes, cross-environment rollups, or embedding metrics inside an existing operations console, call the REST API below instead.
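The page derives its query bucket size from the selected window so every panel keeps a roughly constant point count. A sketch of such a chooser follows; `step_seconds_for` and the intermediate thresholds are illustrative, and only the 10 s floor and 3600 s ceiling come from the API's documented clamp:

```python
def step_seconds_for(window_seconds: int) -> int:
    """Pick a bucket size that keeps a panel near a few hundred points.
    Intermediate thresholds are illustrative; the result always lies in
    the API's documented stepSeconds clamp of [10, 3600]."""
    for limit, step in [
        (30 * 60, 10),     # up to 30 min  -> 10 s buckets
        (3 * 3600, 60),    # up to 3 h     -> 1 min
        (12 * 3600, 300),  # up to 12 h    -> 5 min
        (48 * 3600, 900),  # up to 48 h    -> 15 min
    ]:
        if window_seconds <= limit:
            return step
    return 3600            # multi-day windows -> 1 h buckets (API max)

print(step_seconds_for(15 * 60))    # 10
print(step_seconds_for(24 * 3600))  # 900
```

A 15-minute window gets 10 s buckets (90 points); a week-long window gets 1 h buckets (168 points).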


Table schema

server_metrics (
    tenant_id          LowCardinality(String) DEFAULT 'default',
    collected_at       DateTime64(3),
    server_instance_id LowCardinality(String),
    metric_name        LowCardinality(String),
    metric_type        LowCardinality(String),   -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic          LowCardinality(String) DEFAULT 'value',
    metric_value       Float64,
    tags               Map(String, String) DEFAULT map(),
    server_received_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE

What each column means

| Column | Notes |
| --- | --- |
| tenant_id | Always filter by this. One tenant per server deployment. |
| server_instance_id | Stable id per server process: property → HOSTNAME env → DNS → random UUID. Rotates on restart, so counters restart cleanly. |
| metric_name | Raw Micrometer meter name. Dots, not underscores. |
| metric_type | Lowercase Micrometer Meter.Type. |
| statistic | Which Measurement this row is. Counters → count, gauges → value. Timers → three rows per tick: count, total_time (or total), max. Distribution summaries → same shape. |
| metric_value | Float64. Non-finite values (NaN / ±∞) are dropped before insert. |
| tags | Map(String, String). Micrometer tags copied verbatim. |

Counter semantics (important)

Counters are cumulative totals since meter registration, same convention as Prometheus. To get a rate, compute a delta within a server_instance_id:

SELECT
    toStartOfMinute(collected_at) AS minute,
    metric_value - any(metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = 'default'
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;

On restart the server_instance_id rotates, so a simple previous-row delta (ClickHouse's lagInFrame(), or the any(...) OVER frame trick above) partitioned by server_instance_id gives monotonic segments without fighting counter resets.
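The positive-clipped, per-instance delta that mode=delta applies server-side can be reproduced client-side when post-processing raw samples. A minimal sketch (`counter_deltas` is a hypothetical helper, not the server's code):

```python
from collections import defaultdict

def counter_deltas(rows):
    """rows: (server_instance_id, ts, cumulative_value) tuples, any order.
    Returns [(ts, delta)] summed across instances. Deltas are computed
    within each instance and clipped at 0, so instance rotation on
    restart never produces a spurious negative spike."""
    by_instance = defaultdict(list)
    for instance, ts, value in rows:
        by_instance[instance].append((ts, value))
    totals = defaultdict(float)
    for samples in by_instance.values():
        samples.sort()
        for (_, prev), (ts, cur) in zip(samples, samples[1:]):
            totals[ts] += max(cur - prev, 0.0)  # positive-clipped delta
    return sorted(totals.items())

# A restart at t=3: srv-a stops, srv-b starts its counter from 0.
rows = [
    ("srv-a", 1, 10.0), ("srv-a", 2, 15.0), ("srv-a", 3, 18.0),
    ("srv-b", 3, 0.0), ("srv-b", 4, 4.0),
]
print(counter_deltas(rows))  # [(2, 5.0), (3, 3.0), (4, 4.0)]
```

Note how the t=3 bucket shows the real +3 from srv-a rather than the -18 a naive global diff would produce.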

Retention

90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.


How to query

Use the REST API — /api/v1/admin/server-metrics/**. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard /api/v1/admin/** RBAC gate).

GET /catalog

Enumerate every metric_name observed in a window, with its metric_type, the set of statistics emitted, and the union of tag keys.

GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer <admin-jwt>
[
  {
    "metricName": "cameleer.agents.connected",
    "metricType": "gauge",
    "statistics": ["value"],
    "tagKeys": ["state"]
  },
  {
    "metricName": "cameleer.ingestion.drops",
    "metricType": "counter",
    "statistics": ["count"],
    "tagKeys": ["reason"]
  },
  ...
]

from/to are optional; default is the last 1 h.

GET /instances

Enumerate the server_instance_id values that wrote at least one sample in the window, with firstSeen / lastSeen. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.

GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
[
  { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
  { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
]
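Restart annotations fall out of this response directly: every instance's firstSeen except the oldest one marks a likely restart or failover boundary. A hypothetical helper over the JSON shape above:

```python
from datetime import datetime

def restart_markers(instances):
    """instances: list of dicts shaped like the /instances response.
    Returns the firstSeen timestamps of every instance except the
    oldest; each is a likely restart/failover boundary to annotate."""
    parse = lambda s: datetime.fromisoformat(s.replace("Z", "+00:00"))
    starts = sorted(parse(i["firstSeen"]) for i in instances)
    return starts[1:]  # the oldest start is just "the window began"

instances = [
    {"serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z",
     "lastSeen": "2026-04-23T00:00:00Z"},
    {"serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z",
     "lastSeen": "2026-04-22T14:25:00Z"},
]
print(restart_markers(instances))  # one marker, at 14:30 on 2026-04-22
```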

POST /query — generic time-series

The workhorse. One endpoint covers every panel in the dashboard.

POST /api/v1/admin/server-metrics/query
Authorization: Bearer <admin-jwt>
Content-Type: application/json

Request body:

{
  "metric":          "cameleer.ingestion.drops",
  "statistic":       "count",
  "from":            "2026-04-22T00:00:00Z",
  "to":              "2026-04-23T00:00:00Z",
  "stepSeconds":     60,
  "groupByTags":     ["reason"],
  "filterTags":      { },
  "aggregation":     "sum",
  "mode":            "delta",
  "serverInstanceIds": null
}

Response:

{
  "metric":      "cameleer.ingestion.drops",
  "statistic":   "count",
  "aggregation": "sum",
  "mode":        "delta",
  "stepSeconds": 60,
  "series": [
    {
      "tags":   { "reason": "buffer_full" },
      "points": [
        { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
        { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
        { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
      ]
    }
  ]
}

Request field reference

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| metric | string | yes | Metric name. Regex ^[a-zA-Z0-9._]+$. |
| statistic | string | no | value / count / total / total_time / max / mean. mean is a derived statistic for timers: sum(total_time \| total) / sum(count) per bucket. |
| from, to | ISO-8601 instant | yes | Half-open window. to − from ≤ 31 days. |
| stepSeconds | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
| groupByTags | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex ^[a-zA-Z0-9._]+$. |
| filterTags | map<string,string> | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
| aggregation | string | no | Within-bucket reducer for raw mode: avg (default), sum, max, min, latest. For mode=delta this controls cross-instance aggregation (defaults to sum of per-instance deltas). |
| mode | string | no | raw (default) or delta. Delta mode computes per-server_instance_id positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| serverInstanceIds | string[] | no | Allow-list. When null or empty, every instance in the window is included. |

Validation errors

Any IllegalArgumentException surfaces as 400 Bad Request with {"error": "…"}. Triggers:

  • unsafe characters in identifiers
  • from ≥ to or range > 31 days
  • stepSeconds outside [10, 3600]
  • result cardinality > 500 series (reduce groupByTags or tighten filterTags)
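These triggers can be mirrored client-side so an obviously invalid panel fails before the round-trip. A sketch assuming exactly the documented rules (`preflight` is a hypothetical helper, not a server API):

```python
import re
from datetime import datetime, timedelta

IDENT = re.compile(r"^[a-zA-Z0-9._]+$")

def preflight(body: dict) -> dict:
    """Check a /query request body against the documented 400 triggers."""
    if not IDENT.match(body["metric"]):
        raise ValueError("unsafe characters in identifiers")
    for key in body.get("groupByTags") or []:
        if not IDENT.match(key):
            raise ValueError("unsafe characters in identifiers")
    frm = datetime.fromisoformat(body["from"].replace("Z", "+00:00"))
    to = datetime.fromisoformat(body["to"].replace("Z", "+00:00"))
    if frm >= to:
        raise ValueError("from >= to")
    if to - frm > timedelta(days=31):
        raise ValueError("range > 31 days")
    step = body.get("stepSeconds", 60)
    if not 10 <= step <= 3600:
        raise ValueError("stepSeconds outside [10, 3600]")
    return body

body = preflight({"metric": "cameleer.ingestion.drops", "statistic": "count",
                  "from": "2026-04-22T00:00:00Z", "to": "2026-04-23T00:00:00Z"})
print(body["metric"])  # cameleer.ingestion.drops
```

Series-cardinality overflow can only be detected server-side, so that 400 still has to be handled at response time.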

Direct ClickHouse (fallback)

If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for /api/v1/admin/clickhouse/query (infrastructureendpoints=true, ADMIN) or a dedicated read-only CH user scoped to server_metrics. All direct queries must filter by tenant_id.


Metric catalog

Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.

Cameleer business metrics — agent + ingestion

Source: cameleer-server-app/.../metrics/ServerMetrics.java.

| Metric | Type | Statistic | Tags | Meaning |
| --- | --- | --- | --- | --- |
| cameleer.agents.connected | gauge | value | state (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| cameleer.agents.sse.active | gauge | value | | Active SSE connections (command channel) |
| cameleer.agents.transitions | counter | count | transition (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| cameleer.ingestion.buffer.size | gauge | value | type (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| cameleer.ingestion.accumulator.pending | gauge | value | | Unfinalized execution chunks in the accumulator |
| cameleer.ingestion.drops | counter | count | reason (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| cameleer.ingestion.flush.duration | timer | count, total_time/total, max | type (execution/processor/log) | Flush latency per type |

Cameleer business metrics — deploy + auth

| Metric | Type | Statistic | Tags | Meaning |
| --- | --- | --- | --- | --- |
| cameleer.deployments.outcome | counter | count | status (running/failed/degraded) | Deploy outcome tally since boot |
| cameleer.deployments.duration | timer | count, total_time/total, max | | End-to-end deploy latency |
| cameleer.auth.failures | counter | count | reason (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |

Alerting subsystem metrics

Source: cameleer-server-app/.../alerting/metrics/AlertingMetrics.java.

| Metric | Type | Statistic | Tags | Meaning |
| --- | --- | --- | --- | --- |
| alerting_rules_total | gauge | value | state (enabled/disabled) | Cached 30 s from PostgreSQL alert_rules |
| alerting_instances_total | gauge | value | state (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL alert_instances |
| alerting_eval_errors_total | counter | count | kind (condition kind) | Evaluator exceptions per kind |
| alerting_circuit_opened_total | counter | count | kind | Circuit-breaker open transitions per kind |
| alerting_eval_duration_seconds | timer | count, total_time/total, max | kind | Per-kind evaluation latency |
| alerting_webhook_delivery_duration_seconds | timer | count, total_time/total, max | | Outbound webhook POST latency |
| alerting_notifications_total | counter | count | status (sent/failed/retry/giving_up) | Notification outcomes |

JVM — memory, GC, threads, classes

From Spring Boot Actuator (JvmMemoryMetrics, JvmGcMetrics, JvmThreadMetrics, ClassLoaderMetrics).

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| jvm.memory.used | gauge | area (heap/nonheap), id (pool name) | Bytes used per pool |
| jvm.memory.committed | gauge | area, id | Bytes committed per pool |
| jvm.memory.max | gauge | area, id | Pool max |
| jvm.memory.usage.after.gc | gauge | area, id | Usage right after the last collection |
| jvm.buffer.memory.used | gauge | id (direct/mapped) | NIO buffer bytes |
| jvm.buffer.count | gauge | id | NIO buffer count |
| jvm.buffer.total.capacity | gauge | id | NIO buffer capacity |
| jvm.threads.live | gauge | | Current live thread count |
| jvm.threads.daemon | gauge | | Current daemon thread count |
| jvm.threads.peak | gauge | | Peak thread count since start |
| jvm.threads.started | counter | | Cumulative threads started |
| jvm.threads.states | gauge | state (runnable/blocked/waiting/…) | Threads per state |
| jvm.classes.loaded | gauge | | Currently-loaded classes |
| jvm.classes.unloaded | counter | | Cumulative unloaded classes |
| jvm.gc.pause | timer | action, cause | Stop-the-world pause times — watch max |
| jvm.gc.concurrent.phase.time | timer | action, cause | Concurrent-phase durations (G1/ZGC) |
| jvm.gc.memory.allocated | counter | | Bytes allocated in the young gen |
| jvm.gc.memory.promoted | counter | | Bytes promoted to old gen |
| jvm.gc.overhead | gauge | | Fraction of CPU spent in GC (0–1) |
| jvm.gc.live.data.size | gauge | | Live data after last collection |
| jvm.gc.max.data.size | gauge | | Max old-gen size |
| jvm.info | gauge | vendor, runtime, version | Constant 1.0; tags carry the real info |

Process and system

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| process.cpu.usage | gauge | | CPU share consumed by this JVM (0–1) |
| process.cpu.time | gauge | | Cumulative CPU time (ns) |
| process.uptime | gauge | | ms since start |
| process.start.time | gauge | | Epoch start |
| process.files.open | gauge | | Open FDs |
| process.files.max | gauge | | FD ulimit |
| system.cpu.count | gauge | | Cores visible to the JVM |
| system.cpu.usage | gauge | | System-wide CPU (0–1) |
| system.load.average.1m | gauge | | 1-min load (Unix only) |
| disk.free | gauge | path | Free bytes on the mount that holds the JAR |
| disk.total | gauge | path | Total bytes |

HTTP server

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| http.server.requests | timer | method, uri, status, outcome, exception | Inbound HTTP: count, total_time/total, max |
| http.server.requests.active | long_task_timer | method, uri | In-flight requests — active_tasks statistic |

uri is the Spring-templated path (/api/v1/environments/{envSlug}/apps/{appSlug}), not the raw URL — cardinality stays bounded.

Tomcat

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| tomcat.sessions.active.current | gauge | | Currently active sessions |
| tomcat.sessions.active.max | gauge | | Max concurrent sessions observed |
| tomcat.sessions.alive.max | gauge | | Longest session lifetime (s) |
| tomcat.sessions.created | counter | | Cumulative session creates |
| tomcat.sessions.expired | counter | | Cumulative expirations |
| tomcat.sessions.rejected | counter | | Session creates refused |
| tomcat.threads.current | gauge | name | Connector thread count |
| tomcat.threads.busy | gauge | name | Connector threads currently serving a request |
| tomcat.threads.config.max | gauge | name | Configured max |

HikariCP (PostgreSQL pool)

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| hikaricp.connections | gauge | pool | Total connections |
| hikaricp.connections.active | gauge | pool | In-use |
| hikaricp.connections.idle | gauge | pool | Idle |
| hikaricp.connections.pending | gauge | pool | Threads waiting for a connection |
| hikaricp.connections.min | gauge | pool | Configured min |
| hikaricp.connections.max | gauge | pool | Configured max |
| hikaricp.connections.creation | timer | pool | Time to open a new connection |
| hikaricp.connections.acquire | timer | pool | Time to acquire from the pool |
| hikaricp.connections.usage | timer | pool | Time a connection was in use |
| hikaricp.connections.timeout | counter | pool | Pool acquisition timeouts — any non-zero rate is a problem |

Pools are named. You'll see HikariPool-1 (PostgreSQL) and a separate pool for ClickHouse (clickHouseJdbcTemplate).

JDBC generic

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| jdbc.connections.min | gauge | name | Same data as Hikari, surfaced generically |
| jdbc.connections.max | gauge | name | |
| jdbc.connections.active | gauge | name | |
| jdbc.connections.idle | gauge | name | |

Logging

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| logback.events | counter | level (error/warn/info/debug/trace) | Log events emitted since start — {level=error} is a useful panel |

Spring Boot lifecycle

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| application.started.time | timer | main.application.class | Cold-start duration |
| application.ready.time | timer | main.application.class | Time to ready |

Flyway

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| flyway.migrations | gauge | | Number of migrations applied (current schema) |

Executor pools (if any @Async executors exist)

When a ThreadPoolTaskExecutor bean is registered and tagged, Micrometer adds:

| Metric | Type | Tags | Meaning |
| --- | --- | --- | --- |
| executor.active | gauge | name | Currently-running tasks |
| executor.queued | gauge | name | Queued tasks |
| executor.queue.remaining | gauge | name | Queue headroom |
| executor.pool.size | gauge | name | Current pool size |
| executor.pool.core | gauge | name | Core size |
| executor.pool.max | gauge | name | Max size |
| executor.completed | counter | name | Completed tasks |

Suggested dashboard panels

Below are 17 panels, each expressed as a single POST /api/v1/admin/server-metrics/query body. Tenant is implicit in the JWT — the server filters by tenant server-side. {from} and {to} are dashboard variables.

Row: server health (top of dashboard)

  1. Agents by state — stacked area.

    { "metric": "cameleer.agents.connected", "statistic": "value",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
    
  2. Ingestion buffer depth by type — line chart.

    { "metric": "cameleer.ingestion.buffer.size", "statistic": "value",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
    
  3. Ingestion drops per minute — bar chart.

    { "metric": "cameleer.ingestion.drops", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["reason"], "mode": "delta" }
    
  4. Auth failures per minute — same shape as drops, grouped by reason.

    { "metric": "cameleer.auth.failures", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["reason"], "mode": "delta" }
    

Row: JVM

  1. Heap used vs committed vs max — area chart (three overlay queries).

    { "metric": "jvm.memory.used", "statistic": "value",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
    

    Repeat with "metric": "jvm.memory.committed" and "metric": "jvm.memory.max".

  2. CPU % — line.

    { "metric": "process.cpu.usage", "statistic": "value",
      "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
    

    Overlay with "metric": "system.cpu.usage".

  3. GC pause — max per cause.

    { "metric": "jvm.gc.pause", "statistic": "max",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
    
  4. Thread count — three overlay lines: jvm.threads.live, jvm.threads.daemon, jvm.threads.peak each with statistic=value, aggregation=avg, mode=raw.

Row: HTTP + DB

  1. HTTP mean latency by URI — top-N URIs.

    { "metric": "http.server.requests", "statistic": "mean",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" },
      "aggregation": "avg", "mode": "raw" }
    

    For p99 proxy, repeat with "statistic": "max".

  2. HTTP error rate — two queries, divide client-side: total requests and 5xx requests.

    { "metric": "http.server.requests", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "mode": "delta", "aggregation": "sum" }
    

    Then for the 5xx series, add "filterTags": { "outcome": "SERVER_ERROR" } and divide.

  3. HikariCP pool saturation — overlay two queries.

    { "metric": "hikaricp.connections.active", "statistic": "value",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
    

    Overlay with "metric": "hikaricp.connections.pending".

  4. Hikari acquire timeouts per minute.

    { "metric": "hikaricp.connections.timeout", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["pool"], "mode": "delta" }
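The client-side division for panel 2 (error rate) can be sketched as follows; `error_rate` is a hypothetical helper over points shaped like the /query response, treating buckets absent from the 5xx series as zero errors:

```python
def error_rate(total_points, error_points):
    """Join the total and 5xx series on timestamp and divide.
    Buckets with no 5xx point count as zero errors; empty total
    buckets yield a 0.0 rate rather than dividing by zero."""
    errors = {p["t"]: p["v"] for p in error_points}
    return [
        {"t": p["t"], "v": (errors.get(p["t"], 0.0) / p["v"]) if p["v"] else 0.0}
        for p in total_points
    ]

total = [{"t": "00:00", "v": 200.0}, {"t": "00:01", "v": 0.0}]
errs = [{"t": "00:00", "v": 10.0}]
print(error_rate(total, errs))
# [{'t': '00:00', 'v': 0.05}, {'t': '00:01', 'v': 0.0}]
```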
    

Row: alerting (collapsible)

  1. Alerting instances by state — stacked.

    { "metric": "alerting_instances_total", "statistic": "value",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
    
  2. Eval errors per minute by kind.

    { "metric": "alerting_eval_errors_total", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["kind"], "mode": "delta" }
    
  3. Webhook delivery — max per minute.

    { "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "aggregation": "max", "mode": "raw" }
    

Row: deployments (runtime-enabled only)

  1. Deploy outcomes per hour.

    { "metric": "cameleer.deployments.outcome", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 3600,
      "groupByTags": ["status"], "mode": "delta" }
    
  2. Deploy duration mean.

    { "metric": "cameleer.deployments.duration", "statistic": "mean",
      "from": "{from}", "to": "{to}", "stepSeconds": 300,
      "aggregation": "avg", "mode": "raw" }
    

    For p99 proxy, repeat with "statistic": "max".


Notes for the dashboard implementer

  • Use the REST API. The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
  • total_time vs total. SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect total_time. The derived statistic=mean handles both transparently.
  • Cardinality warning: http.server.requests tags include uri and status. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without @PathVariable, you'll see explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
  • The dashboard is read-only. There's no write path — only the server writes into server_metrics.
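When you fetch count and total_time as two raw series instead of using the derived statistic, the same per-bucket mean can be computed client-side (`derived_mean` is a hypothetical helper; it works the same whether the duration series was emitted as total_time or total):

```python
def derived_mean(count_points, total_points):
    """mean = total duration / count per bucket, matching the server's
    derived 'mean' statistic. Inputs are /query-shaped point lists;
    buckets with a zero or missing count are skipped."""
    counts = {p["t"]: p["v"] for p in count_points}
    return [
        {"t": p["t"], "v": p["v"] / counts[p["t"]]}
        for p in total_points if counts.get(p["t"])
    ]

counts = [{"t": "00:00", "v": 4.0}]
totals = [{"t": "00:00", "v": 2.0}]   # seconds of total_time in the bucket
print(derived_mean(counts, totals))   # [{'t': '00:00', 'v': 0.5}]
```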

Changelog

  • 2026-04-23 — initial write. Write-only backend.
  • 2026-04-23 — added generic REST API (/api/v1/admin/server-metrics/{catalog,instances,query}) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.
  • 2026-04-24 — shipped the built-in /admin/server-metrics UI dashboard. Gated by infrastructureendpoints + ADMIN, identical visibility to /admin/{database,clickhouse}. Source: ui/src/pages/Admin/ServerMetricsAdminPage.tsx.
  • 2026-04-24 — dashboard now uses the global time-range control (useGlobalFilters) instead of a page-local picker. Bucket size auto-scales with the selected window (10 s → 1 h). Query hooks now take a ServerMetricsRange = { from: Date; to: Date } instead of a windowSeconds number so they work for any absolute or rolling range the TopBar supplies.