feat(server): persist server self-metrics into ClickHouse
Snapshot the full Micrometer registry (cameleer business metrics, alerting metrics, and Spring Boot Actuator defaults) every 60 s into a new server_metrics table so server health survives restarts without an external Prometheus. Includes a dashboard-builder reference for the SaaS team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -129,6 +129,8 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale
## metrics/ — Prometheus observability

- `ServerMetrics` — centralized business metrics: gauges (agents by state, SSE connections, buffer depths), counters (ingestion drops, agent transitions, deployment outcomes, auth failures), timers (flush duration, deployment duration). Exposed via `/api/v1/prometheus`.
- `ServerInstanceIdConfig` — `@Configuration`, exposes `@Bean("serverInstanceId") String`. Resolution precedence: `cameleer.server.instance-id` property → `HOSTNAME` env → `InetAddress.getLocalHost()` → random UUID. Fixed at boot; may rotate across restarts, so counter resets are unambiguous.
- `ServerMetricsSnapshotScheduler` — `@Scheduled(fixedDelayString = "${cameleer.server.self-metrics.interval-ms:60000}")`. Walks `MeterRegistry.getMeters()` each tick, emits one `ServerMetricSample` per `Measurement` (Timer/DistributionSummary produce multiple rows per meter — one per Micrometer `Statistic`). Skips non-finite values; logs and swallows store failures. Disabled via `cameleer.server.self-metrics.enabled=false` (`@ConditionalOnProperty`). Write-only — no query endpoint yet; inspect via `/api/v1/admin/clickhouse/query`.

## storage/ — PostgreSQL repositories (JdbcTemplate)
@@ -145,6 +147,7 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale
- `ClickHouseDiagramStore`, `ClickHouseAgentEventRepository`
- `ClickHouseUsageTracker` — usage_events for billing
- `ClickHouseRouteCatalogStore` — persistent route catalog with first_seen cache, warm-loaded on startup
- `ClickHouseServerMetricsStore` — periodic dumps of the server's own Micrometer registry into the `server_metrics` table. Tenant-stamped (bound at the scheduler, not the bean); no `environment` column (server straddles envs). Batch-insert via `JdbcTemplate.batchUpdate` with `Map(String, String)` tag binding. Written by `ServerMetricsSnapshotScheduler`; query via `/api/v1/admin/clickhouse/query` (no dedicated endpoint yet).

## search/ — ClickHouse search and log stores
@@ -8,7 +8,9 @@ paths:
# Prometheus Metrics

Server exposes `/api/v1/prometheus` (unauthenticated, Prometheus text format). Spring Boot Actuator provides JVM, GC, thread pool, and `http.server.requests` metrics automatically. Business metrics come from the `ServerMetrics` component.

The same `MeterRegistry` is also snapshotted to ClickHouse every 60 s by `ServerMetricsSnapshotScheduler` (see "Server self-metrics persistence" at the bottom of this file) — so historical server-health data survives restarts without an external Prometheus.

## Gauges (auto-polled)
@@ -83,3 +85,23 @@ Mean processing time = `camel.route.policy.total_time / camel.route.policy.count
| `cameleer.sse.reconnects.count` | counter | `instanceId` |
| `cameleer.taps.evaluated.count` | counter | `instanceId` |
| `cameleer.metrics.exported.count` | counter | `instanceId` |

## Server self-metrics persistence

`ServerMetricsSnapshotScheduler` walks `MeterRegistry.getMeters()` every 60 s (configurable via `cameleer.server.self-metrics.interval-ms`) and writes one row per Micrometer `Measurement` to the ClickHouse `server_metrics` table. The full registry is captured — Spring Boot Actuator series (`jvm.*`, `process.*`, `http.server.requests`, `hikaricp.*`, `jdbc.*`, `tomcat.*`, `logback.events`, `system.*`) plus `cameleer.*` and `alerting_*`.

**Table** (`cameleer-server-app/src/main/resources/clickhouse/init.sql`):

```
server_metrics(tenant_id, collected_at, server_instance_id,
               metric_name, metric_type, statistic, metric_value,
               tags Map(String,String), server_received_at)
```

- `metric_type` — lowercase Micrometer `Meter.Type` (counter, gauge, timer, distribution_summary, long_task_timer, other)
- `statistic` — Micrometer `Statistic.getTagValueRepresentation()` (value, count, total, total_time, max, mean, active_tasks, duration). Timers emit 3 rows per tick (count + total_time + max); gauges/counters emit 1 (`statistic='value'` or `'count'`).
- No `environment` column — the server is env-agnostic.
- `tenant_id` threaded from `cameleer.server.tenant.id` (single-tenant per server).
- `server_instance_id` resolved once at boot by `ServerInstanceIdConfig` (property → HOSTNAME → localhost → UUID fallback). Rotates across restarts so counter resets are unambiguous.
- TTL: 90 days (vs 365 for `agent_metrics`). Write-only in v1 — no query endpoint or UI page. Inspect via ClickHouse admin: `/api/v1/admin/clickhouse/query` or direct SQL.
- Toggle off entirely with `cameleer.server.self-metrics.enabled=false` (uses `@ConditionalOnProperty`).
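Since counter rows are cumulative, dashboards compute rates themselves. A minimal sketch against the columns above — the instance id is a placeholder, and the minute bucketing is an arbitrary choice:

```sql
-- Per-minute increase of one cumulative counter for one server instance.
-- max() per bucket tolerates multiple samples per minute; negatives from a
-- counter reset are clamped to zero.
SELECT
    bucket,
    greatest(value - lagInFrame(value) OVER (ORDER BY bucket), 0) AS per_minute
FROM
(
    SELECT
        toStartOfMinute(collected_at) AS bucket,
        max(metric_value)             AS value
    FROM server_metrics
    WHERE metric_name = 'cameleer.ingestion.drops'  -- any counter series
      AND statistic = 'count'
      AND server_instance_id = 'srv-1'              -- placeholder id
    GROUP BY bucket
)
ORDER BY bucket;
```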
@@ -9,6 +9,7 @@ import com.cameleer.server.app.storage.ClickHouseRouteCatalogStore;
import com.cameleer.server.core.storage.RouteCatalogStore;
import com.cameleer.server.app.storage.ClickHouseMetricsQueryStore;
import com.cameleer.server.app.storage.ClickHouseMetricsStore;
import com.cameleer.server.app.storage.ClickHouseServerMetricsStore;
import com.cameleer.server.app.storage.ClickHouseStatsStore;
import com.cameleer.server.core.admin.AuditRepository;
import com.cameleer.server.core.admin.AuditService;
@@ -67,6 +68,12 @@ public class StorageBeanConfig {
        return new ClickHouseMetricsQueryStore(tenantProperties.getId(), clickHouseJdbc);
    }

    @Bean
    public ServerMetricsStore clickHouseServerMetricsStore(
            @Qualifier("clickHouseJdbcTemplate") JdbcTemplate clickHouseJdbc) {
        return new ClickHouseServerMetricsStore(clickHouseJdbc);
    }

    // ── Execution Store ──────────────────────────────────────────────────

    @Bean
@@ -0,0 +1,63 @@
package com.cameleer.server.app.metrics;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.UUID;

/**
 * Resolves a stable identifier for this server process, used as the
 * {@code server_instance_id} on every server_metrics sample. The value is
 * fixed at boot, so counters restart cleanly whenever the id rotates.
 *
 * <p>Precedence:
 * <ol>
 *   <li>{@code cameleer.server.instance-id} property / {@code CAMELEER_SERVER_INSTANCE_ID} env
 *   <li>{@code HOSTNAME} env (populated by Docker/Kubernetes)
 *   <li>{@link InetAddress#getLocalHost()} hostname
 *   <li>Random UUID (fallback — only hit when DNS and env are both silent)
 * </ol>
 */
@Configuration
public class ServerInstanceIdConfig {

    private static final Logger log = LoggerFactory.getLogger(ServerInstanceIdConfig.class);

    @Bean("serverInstanceId")
    public String serverInstanceId(
            @Value("${cameleer.server.instance-id:}") String configuredId) {
        if (!isBlank(configuredId)) {
            log.info("Server instance id resolved from configuration: {}", configuredId);
            return configuredId;
        }

        String hostnameEnv = System.getenv("HOSTNAME");
        if (!isBlank(hostnameEnv)) {
            log.info("Server instance id resolved from HOSTNAME env: {}", hostnameEnv);
            return hostnameEnv;
        }

        try {
            String localHost = InetAddress.getLocalHost().getHostName();
            if (!isBlank(localHost)) {
                log.info("Server instance id resolved from localhost lookup: {}", localHost);
                return localHost;
            }
        } catch (UnknownHostException e) {
            log.debug("InetAddress.getLocalHost() failed, falling back to UUID: {}", e.getMessage());
        }

        String fallback = UUID.randomUUID().toString();
        log.warn("Server instance id could not be resolved; using random UUID {}", fallback);
        return fallback;
    }

    private static boolean isBlank(String s) {
        return s == null || s.isBlank();
    }
}
@@ -0,0 +1,106 @@
package com.cameleer.server.app.metrics;

import com.cameleer.server.core.storage.ServerMetricsStore;
import com.cameleer.server.core.storage.model.ServerMetricSample;
import io.micrometer.core.instrument.Measurement;
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tag;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/**
 * Periodically snapshots every meter in the server's {@link MeterRegistry}
 * and writes the result to ClickHouse via {@link ServerMetricsStore}. This
 * gives us historical server-health data (buffer depths, agent transitions,
 * flush latency, JVM memory, HTTP response counts, etc.) without requiring
 * an external Prometheus.
 *
 * <p>Each Micrometer {@link Meter#measure() measurement} becomes one row, so
 * a single Timer produces rows for {@code count}, {@code total_time}, and
 * {@code max} each tick. Counter values are cumulative since meter
 * registration (Prometheus convention) — callers compute rate() themselves.
 *
 * <p>Disabled via {@code cameleer.server.self-metrics.enabled=false}.
 */
@Component
@ConditionalOnProperty(
        prefix = "cameleer.server.self-metrics",
        name = "enabled",
        havingValue = "true",
        matchIfMissing = true)
public class ServerMetricsSnapshotScheduler {

    private static final Logger log = LoggerFactory.getLogger(ServerMetricsSnapshotScheduler.class);

    private final MeterRegistry registry;
    private final ServerMetricsStore store;
    private final String tenantId;
    private final String serverInstanceId;

    public ServerMetricsSnapshotScheduler(
            MeterRegistry registry,
            ServerMetricsStore store,
            @Value("${cameleer.server.tenant.id:default}") String tenantId,
            @Qualifier("serverInstanceId") String serverInstanceId) {
        this.registry = registry;
        this.store = store;
        this.tenantId = tenantId;
        this.serverInstanceId = serverInstanceId;
    }

    @Scheduled(fixedDelayString = "${cameleer.server.self-metrics.interval-ms:60000}",
            initialDelayString = "${cameleer.server.self-metrics.interval-ms:60000}")
    public void snapshot() {
        try {
            Instant now = Instant.now();
            List<ServerMetricSample> batch = new ArrayList<>();

            for (Meter meter : registry.getMeters()) {
                Meter.Id id = meter.getId();
                Map<String, String> tags = flattenTags(id.getTagsAsIterable());
                // Locale.ROOT avoids locale-sensitive lowercasing of the type name.
                String type = id.getType().name().toLowerCase(Locale.ROOT);

                for (Measurement m : meter.measure()) {
                    double v = m.getValue();
                    if (!Double.isFinite(v)) continue;
                    batch.add(new ServerMetricSample(
                            tenantId,
                            now,
                            serverInstanceId,
                            id.getName(),
                            type,
                            m.getStatistic().getTagValueRepresentation(),
                            v,
                            tags));
                }
            }

            if (!batch.isEmpty()) {
                store.insertBatch(batch);
                log.debug("Persisted {} server self-metric samples", batch.size());
            }
        } catch (Exception e) {
            log.warn("Server self-metrics snapshot failed: {}", e.getMessage());
        }
    }

    private static Map<String, String> flattenTags(Iterable<Tag> tags) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Tag t : tags) {
            out.put(t.getKey(), t.getValue());
        }
        return out;
    }
}
@@ -0,0 +1,46 @@
package com.cameleer.server.app.storage;

import com.cameleer.server.core.storage.ServerMetricsStore;
import com.cameleer.server.core.storage.model.ServerMetricSample;
import org.springframework.jdbc.core.JdbcTemplate;

import java.sql.Timestamp;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClickHouseServerMetricsStore implements ServerMetricsStore {

    private final JdbcTemplate jdbc;

    public ClickHouseServerMetricsStore(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Override
    public void insertBatch(List<ServerMetricSample> samples) {
        if (samples.isEmpty()) return;

        jdbc.batchUpdate("""
                INSERT INTO server_metrics
                  (tenant_id, collected_at, server_instance_id, metric_name,
                   metric_type, statistic, metric_value, tags)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
                """,
                samples.stream().map(s -> new Object[]{
                        s.tenantId(),
                        Timestamp.from(s.collectedAt()),
                        s.serverInstanceId(),
                        s.metricName(),
                        s.metricType(),
                        s.statistic(),
                        s.value(),
                        tagsToClickHouseMap(s.tags())
                }).toList());
    }

    // Normalize null/empty tags to a fresh empty map so the JDBC layer never sees null.
    private Map<String, String> tagsToClickHouseMap(Map<String, String> tags) {
        if (tags == null || tags.isEmpty()) return new HashMap<>();
        return new HashMap<>(tags);
    }
}
@@ -112,6 +112,10 @@ cameleer:
      url: ${CAMELEER_SERVER_CLICKHOUSE_URL:jdbc:clickhouse://localhost:8123/cameleer}
      username: ${CAMELEER_SERVER_CLICKHOUSE_USERNAME:default}
      password: ${CAMELEER_SERVER_CLICKHOUSE_PASSWORD:}
    self-metrics:
      enabled: ${CAMELEER_SERVER_SELFMETRICS_ENABLED:true}
      interval-ms: ${CAMELEER_SERVER_SELFMETRICS_INTERVALMS:60000}
    instance-id: ${CAMELEER_SERVER_INSTANCE_ID:}

springdoc:
  api-docs:
@@ -401,6 +401,29 @@ CREATE TABLE IF NOT EXISTS route_catalog (
ENGINE = ReplacingMergeTree(last_seen)
ORDER BY (tenant_id, environment, application_id, route_id);

-- ── Server Self-Metrics ────────────────────────────────────────────────
-- Periodic snapshot of the server's own Micrometer registry (written by
-- ServerMetricsSnapshotScheduler). No `environment` column — the server
-- straddles environments. `statistic` distinguishes Timer/DistributionSummary
-- sub-measurements (count, total_time, max, mean) from plain counter/gauge values.

CREATE TABLE IF NOT EXISTS server_metrics (
    tenant_id          LowCardinality(String) DEFAULT 'default',
    collected_at       DateTime64(3),
    server_instance_id LowCardinality(String),
    metric_name        LowCardinality(String),
    metric_type        LowCardinality(String),
    statistic          LowCardinality(String) DEFAULT 'value',
    metric_value       Float64,
    tags               Map(String, String) DEFAULT map(),
    server_received_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
SETTINGS index_granularity = 8192;

-- insert_id tiebreak for keyset pagination (fixes same-millisecond cursor collision).
-- IF NOT EXISTS on ADD COLUMN is idempotent. MATERIALIZE COLUMN is a background mutation,
-- effectively a no-op once all parts are already materialized.
@@ -0,0 +1,130 @@
package com.cameleer.server.app.metrics;

import com.cameleer.server.core.storage.ServerMetricsStore;
import com.cameleer.server.core.storage.model.ServerMetricSample;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import org.junit.jupiter.api.Test;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

import static org.assertj.core.api.Assertions.assertThat;

class ServerMetricsSnapshotSchedulerTest {

    @Test
    void snapshot_capturesCounterGaugeAndTimerMeasurements() {
        MeterRegistry registry = new SimpleMeterRegistry();

        Counter counter = Counter.builder("cameleer.test.counter")
                .tag("env", "dev")
                .register(registry);
        counter.increment(3);

        AtomicInteger gaugeSource = new AtomicInteger(42);
        Gauge.builder("cameleer.test.gauge", gaugeSource, AtomicInteger::doubleValue)
                .register(registry);

        Timer timer = Timer.builder("cameleer.test.timer").register(registry);
        timer.record(Duration.ofMillis(5));
        timer.record(Duration.ofMillis(15));

        RecordingStore store = new RecordingStore();
        ServerMetricsSnapshotScheduler scheduler =
                new ServerMetricsSnapshotScheduler(registry, store, "tenant-7", "server-A");

        scheduler.snapshot();

        assertThat(store.batches).hasSize(1);
        List<ServerMetricSample> samples = store.batches.get(0);

        // Every sample is stamped with tenant + instance + finite value
        assertThat(samples).allSatisfy(s -> {
            assertThat(s.tenantId()).isEqualTo("tenant-7");
            assertThat(s.serverInstanceId()).isEqualTo("server-A");
            assertThat(Double.isFinite(s.value())).isTrue();
            assertThat(s.collectedAt()).isNotNull();
        });

        // Counter -> 1 row with statistic=count, value=3, tag propagated
        List<ServerMetricSample> counterRows = samples.stream()
                .filter(s -> s.metricName().equals("cameleer.test.counter"))
                .toList();
        assertThat(counterRows).hasSize(1);
        assertThat(counterRows.get(0).statistic()).isEqualTo("count");
        assertThat(counterRows.get(0).metricType()).isEqualTo("counter");
        assertThat(counterRows.get(0).value()).isEqualTo(3.0);
        assertThat(counterRows.get(0).tags()).containsEntry("env", "dev");

        // Gauge -> 1 row with statistic=value
        List<ServerMetricSample> gaugeRows = samples.stream()
                .filter(s -> s.metricName().equals("cameleer.test.gauge"))
                .toList();
        assertThat(gaugeRows).hasSize(1);
        assertThat(gaugeRows.get(0).statistic()).isEqualTo("value");
        assertThat(gaugeRows.get(0).metricType()).isEqualTo("gauge");
        assertThat(gaugeRows.get(0).value()).isEqualTo(42.0);

        // Timer -> emits multiple statistics (count, total_time, max)
        List<ServerMetricSample> timerRows = samples.stream()
                .filter(s -> s.metricName().equals("cameleer.test.timer"))
                .toList();
        assertThat(timerRows).isNotEmpty();
        // SimpleMeterRegistry emits Statistic.TOTAL ("total"); other registries (Prometheus)
        // emit TOTAL_TIME ("total_time"). Accept either so the test isn't registry-coupled.
        assertThat(timerRows).extracting(ServerMetricSample::statistic)
                .contains("count", "max");
        assertThat(timerRows).extracting(ServerMetricSample::statistic)
                .containsAnyOf("total_time", "total");
        assertThat(timerRows).allSatisfy(s ->
                assertThat(s.metricType()).isEqualTo("timer"));
        ServerMetricSample count = timerRows.stream()
                .filter(s -> s.statistic().equals("count"))
                .findFirst().orElseThrow();
        assertThat(count.value()).isEqualTo(2.0);
    }

    @Test
    void snapshot_withEmptyRegistry_doesNotWriteBatch() {
        // SimpleMeterRegistry auto-registers no meters, so the registry starts empty.
        MeterRegistry registry = new SimpleMeterRegistry();
        RecordingStore store = new RecordingStore();
        ServerMetricsSnapshotScheduler scheduler =
                new ServerMetricsSnapshotScheduler(registry, store, "t", "s");

        scheduler.snapshot();

        assertThat(store.batches).isEmpty();
    }

    @Test
    void snapshot_swallowsStoreFailures() {
        MeterRegistry registry = new SimpleMeterRegistry();
        Counter.builder("cameleer.test").register(registry).increment();

        ServerMetricsStore throwingStore = batch -> {
            throw new RuntimeException("clickhouse down");
        };

        ServerMetricsSnapshotScheduler scheduler =
                new ServerMetricsSnapshotScheduler(registry, throwingStore, "t", "s");

        // Must not propagate — the scheduler thread would otherwise die.
        scheduler.snapshot();
    }

    private static final class RecordingStore implements ServerMetricsStore {
        final List<List<ServerMetricSample>> batches = new ArrayList<>();

        @Override
        public void insertBatch(List<ServerMetricSample> samples) {
            batches.add(List.copyOf(samples));
        }
    }
}
@@ -0,0 +1,117 @@
package com.cameleer.server.app.storage;

import com.cameleer.server.core.storage.model.ServerMetricSample;
import com.zaxxer.hikari.HikariDataSource;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.springframework.jdbc.core.JdbcTemplate;
import org.testcontainers.clickhouse.ClickHouseContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

import java.time.Instant;
import java.util.List;
import java.util.Map;

import static org.assertj.core.api.Assertions.assertThat;

@Testcontainers
class ClickHouseServerMetricsStoreIT {

    @Container
    static final ClickHouseContainer clickhouse =
            new ClickHouseContainer("clickhouse/clickhouse-server:24.12");

    private JdbcTemplate jdbc;
    private ClickHouseServerMetricsStore store;

    @BeforeEach
    void setUp() {
        HikariDataSource ds = new HikariDataSource();
        ds.setJdbcUrl(clickhouse.getJdbcUrl());
        ds.setUsername(clickhouse.getUsername());
        ds.setPassword(clickhouse.getPassword());

        jdbc = new JdbcTemplate(ds);

        jdbc.execute("""
                CREATE TABLE IF NOT EXISTS server_metrics (
                    tenant_id          LowCardinality(String) DEFAULT 'default',
                    collected_at       DateTime64(3),
                    server_instance_id LowCardinality(String),
                    metric_name        LowCardinality(String),
                    metric_type        LowCardinality(String),
                    statistic          LowCardinality(String) DEFAULT 'value',
                    metric_value       Float64,
                    tags               Map(String, String) DEFAULT map(),
                    server_received_at DateTime64(3) DEFAULT now64(3)
                )
                ENGINE = MergeTree()
                ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
                """);

        jdbc.execute("TRUNCATE TABLE server_metrics");

        store = new ClickHouseServerMetricsStore(jdbc);
    }

    @Test
    void insertBatch_roundTripsAllColumns() {
        Instant ts = Instant.parse("2026-04-23T12:00:00Z");
        store.insertBatch(List.of(
                new ServerMetricSample("tenant-a", ts, "srv-1",
                        "cameleer.ingestion.drops", "counter", "count", 17.0,
                        Map.of("reason", "buffer_full")),
                new ServerMetricSample("tenant-a", ts, "srv-1",
                        "jvm.memory.used", "gauge", "value", 1_048_576.0,
                        Map.of("area", "heap", "id", "G1 Eden Space"))
        ));

        Integer count = jdbc.queryForObject(
                "SELECT count() FROM server_metrics WHERE tenant_id = 'tenant-a'",
                Integer.class);
        assertThat(count).isEqualTo(2);

        Double dropsValue = jdbc.queryForObject(
                """
                SELECT metric_value FROM server_metrics
                WHERE tenant_id = 'tenant-a'
                  AND server_instance_id = 'srv-1'
                  AND metric_name = 'cameleer.ingestion.drops'
                  AND statistic = 'count'
                """,
                Double.class);
        assertThat(dropsValue).isEqualTo(17.0);

        String heapArea = jdbc.queryForObject(
                """
                SELECT tags['area'] FROM server_metrics
                WHERE tenant_id = 'tenant-a'
                  AND metric_name = 'jvm.memory.used'
                """,
                String.class);
        assertThat(heapArea).isEqualTo("heap");
    }

    @Test
    void insertBatch_emptyList_doesNothing() {
        store.insertBatch(List.of());

        Integer count = jdbc.queryForObject(
                "SELECT count() FROM server_metrics", Integer.class);
        assertThat(count).isEqualTo(0);
    }

    @Test
    void insertBatch_nullTags_storesEmptyMap() {
        store.insertBatch(List.of(
                new ServerMetricSample("default", Instant.parse("2026-04-23T12:00:00Z"),
                        "srv-2", "process.cpu.usage", "gauge", "value", 0.12, null)
        ));

        Integer count = jdbc.queryForObject(
                "SELECT count() FROM server_metrics WHERE server_instance_id = 'srv-2'",
                Integer.class);
        assertThat(count).isEqualTo(1);
    }
}
@@ -0,0 +1,16 @@
package com.cameleer.server.core.storage;

import com.cameleer.server.core.storage.model.ServerMetricSample;

import java.util.List;

/**
 * Sink for periodic snapshots of the server's own Micrometer meter registry.
 * Implementations persist the samples (e.g. to ClickHouse) so server
 * self-metrics survive restarts and can be queried historically without an
 * external Prometheus.
 */
public interface ServerMetricsStore {

    void insertBatch(List<ServerMetricSample> samples);
}
@@ -0,0 +1,23 @@
package com.cameleer.server.core.storage.model;

import java.time.Instant;
import java.util.Map;

/**
 * A single sample of the server's own Micrometer registry, captured by a
 * scheduled snapshot and destined for the ClickHouse {@code server_metrics}
 * table. One {@code ServerMetricSample} per Micrometer {@code Measurement},
 * so Timers and DistributionSummaries produce multiple samples per tick
 * (distinguished by {@link #statistic()}).
 */
public record ServerMetricSample(
        String tenantId,
        Instant collectedAt,
        String serverInstanceId,
        String metricName,
        String metricType,
        String statistic,
        double value,
        Map<String, String> tags
) {
}
@@ -204,6 +204,12 @@ All query endpoints require JWT with `VIEWER` role or higher.
| `GET /api/v1/agents/events-log` | Agent lifecycle event history |
| `GET /api/v1/agents/{id}/metrics` | Agent-level metrics time series |

### Server Self-Metrics

The server snapshots its own Micrometer registry into ClickHouse every 60 s (table `server_metrics`) — JVM, HTTP, DB pools, agent/ingestion business metrics, and alerting metrics. Use this instead of running an external Prometheus when building a server-health dashboard. The live scrape endpoint `/api/v1/prometheus` remains available for traditional scraping.

See [`docs/server-self-metrics.md`](./server-self-metrics.md) for the full metric catalog, suggested panels, and example queries.
---

## Application Configuration
346 docs/server-self-metrics.md Normal file
@@ -0,0 +1,346 @@
# Server Self-Metrics — Reference for Dashboard Builders

This is the reference for the SaaS team building the server-health dashboard. It documents the `server_metrics` ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.

> **tl;dr** — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.

---

## Table schema

```sql
server_metrics (
    tenant_id           LowCardinality(String) DEFAULT 'default',
    collected_at        DateTime64(3),
    server_instance_id  LowCardinality(String),
    metric_name         LowCardinality(String),
    metric_type         LowCardinality(String),  -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic           LowCardinality(String) DEFAULT 'value',
    metric_value        Float64,
    tags                Map(String, String) DEFAULT map(),
    server_received_at  DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```

### What each column means

| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Stable id per server process: property → `HOSTNAME` env → DNS → random UUID. **Rotates on restart**, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Dots, not underscores. |
| `metric_type` | Lowercase Micrometer `Meter.Type`. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. |
| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. |
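The `statistic` fan-out and the non-finite drop rule can be sketched in plain Java. This is an illustrative stand-in, not the server's actual snapshot code; the helper name `timerRows` is ours. A single Timer contributes one row per statistic, and any NaN or infinite measurement is dropped before it reaches the insert batch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StatisticFanOut {
    // One (statistic -> value) entry per row that a single Timer produces
    // each tick, mirroring the table contract: count, total_time (or total), max.
    static Map<String, Double> timerRows(double count, double totalTime, double max) {
        Map<String, Double> rows = new LinkedHashMap<>();
        rows.put("count", count);
        rows.put("total_time", totalTime);
        rows.put("max", max);
        // Drop non-finite measurements, matching the "dropped before insert" rule.
        rows.values().removeIf(v -> !Double.isFinite(v));
        return rows;
    }

    public static void main(String[] args) {
        // A timer that observed 3 requests totalling 0.9 s; max happens to be NaN.
        System.out.println(timerRows(3.0, 0.9, Double.NaN)); // {count=3.0, total_time=0.9}
    }
}
```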
### Counter semantics (important)

Counters are **cumulative totals since meter registration**, same convention as Prometheus. To get a rate, compute a delta within a `server_instance_id` (and, as always, filter by `tenant_id`):

```sql
SELECT
    toStartOfMinute(collected_at) AS minute,
    metric_value - any(metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = 'default'
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```

On restart the `server_instance_id` rotates, so a simple `LAG()` partitioned by `server_instance_id` gives monotonic segments without fighting counter resets.
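The same delta logic in client-side Java, should the dashboard ever compute rates itself (a sketch; `CounterDeltas` and its types are hypothetical). The key property: keying the previous value by `server_instance_id` means a restart starts a fresh segment instead of producing a negative delta.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CounterDeltas {
    record Sample(String serverInstanceId, double cumulative) {}

    // Per-sample deltas of a cumulative counter, computed within each
    // server_instance_id. The first sample seen for an instance yields no
    // delta, so a restart (new instance id) never produces a negative value.
    static List<Double> deltas(List<Sample> ordered) {
        Map<String, Double> last = new HashMap<>();
        List<Double> out = new ArrayList<>();
        for (Sample s : ordered) {
            Double prev = last.put(s.serverInstanceId(), s.cumulative());
            if (prev != null) {
                out.add(s.cumulative() - prev);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
                new Sample("srv-a", 10), new Sample("srv-a", 14),  // +4 within srv-a
                new Sample("srv-b", 0), new Sample("srv-b", 3));   // restart: fresh segment, +3
        System.out.println(deltas(samples)); // [4.0, 3.0]
    }
}
```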
### Retention

90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.

---

## How to query

### Via the admin ClickHouse endpoint

```
POST /api/v1/admin/clickhouse/query
Authorization: Bearer <admin-jwt>
Content-Type: text/plain

SELECT metric_name, statistic, count()
FROM server_metrics
WHERE collected_at >= now() - INTERVAL 1 HOUR
GROUP BY 1, 2 ORDER BY 1, 2
```

Requires `infrastructureendpoints=true` and the `ADMIN` role. For a SaaS control plane you will likely want a dedicated read-only CH user scoped to this table — the `/api/v1/admin/clickhouse/query` path is a human-facing admin tool, not a programmatic API.

### Direct JDBC (recommended for the dashboard)

Read directly from ClickHouse (read-only user, `GRANT SELECT ON cameleer.server_metrics TO dashboard_ro`). All queries must filter by `tenant_id`.

---
## Metric catalog

Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.

### Cameleer business metrics — agent + ingestion

Source: `cameleer-server-app/.../metrics/ServerMetrics.java`.

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) |
| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator |
| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type |

### Cameleer business metrics — deploy + auth

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot |
| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency |
| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |

### Alerting subsystem metrics

Source: `cameleer-server-app/.../alerting/metrics/AlertingMetrics.java`.

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` |
| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL `alert_instances` |
| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind |
| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind |
| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency |
| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency |
| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes |

### JVM — memory, GC, threads, classes

From Spring Boot Actuator (`JvmMemoryMetrics`, `JvmGcMetrics`, `JvmThreadMetrics`, `ClassLoaderMetrics`).

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
| `jvm.threads.live` | gauge | — | Current live thread count |
| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
| `jvm.threads.peak` | gauge | — | Peak thread count since start |
| `jvm.threads.started` | counter | — | Cumulative threads started |
| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch `max` |
| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant `1.0`; tags carry the real info |
### Process and system

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
| `process.uptime` | gauge | — | ms since start |
| `process.start.time` | gauge | — | Epoch start |
| `process.files.open` | gauge | — | Open FDs |
| `process.files.max` | gauge | — | FD ulimit |
| `system.cpu.count` | gauge | — | Cores visible to the JVM |
| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
| `disk.total` | gauge | `path` | Total bytes |

### HTTP server

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: `count`, `total_time`/`total`, `max` |
| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |

`uri` is the Spring-templated path (`/api/v1/environments/{envSlug}/apps/{appSlug}`), not the raw URL — cardinality stays bounded.

### Tomcat

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
| `tomcat.sessions.created` | counter | — | Cumulative session creates |
| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
| `tomcat.sessions.rejected` | counter | — | Session creates refused |
| `tomcat.threads.current` | gauge | `name` | Connector thread count |
| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
| `tomcat.threads.config.max` | gauge | `name` | Configured max |

### HikariCP (PostgreSQL pool)

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `hikaricp.connections` | gauge | `pool` | Total connections |
| `hikaricp.connections.active` | gauge | `pool` | In-use |
| `hikaricp.connections.idle` | gauge | `pool` | Idle |
| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
| `hikaricp.connections.min` | gauge | `pool` | Configured min |
| `hikaricp.connections.max` | gauge | `pool` | Configured max |
| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |

Pools are named. You'll see `HikariPool-1` (PostgreSQL) and a separate pool for ClickHouse (`clickHouseJdbcTemplate`).

### JDBC generic

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
| `jdbc.connections.max` | gauge | `name` | |
| `jdbc.connections.active` | gauge | `name` | |
| `jdbc.connections.idle` | gauge | `name` | |

### Logging

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |

### Spring Boot lifecycle

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `application.started.time` | timer | `main.application.class` | Cold-start duration |
| `application.ready.time` | timer | `main.application.class` | Time to ready |

### Flyway

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |

### Executor pools (if any `@Async` executors exist)

When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds:

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `executor.active` | gauge | `name` | Currently-running tasks |
| `executor.queued` | gauge | `name` | Queued tasks |
| `executor.queue.remaining` | gauge | `name` | Queue headroom |
| `executor.pool.size` | gauge | `name` | Current pool size |
| `executor.pool.core` | gauge | `name` | Core size |
| `executor.pool.max` | gauge | `name` | Max size |
| `executor.completed` | counter | `name` | Completed tasks |
---

## Suggested dashboard panels

The shortlist below gives you a working health dashboard with ~12 panels. All queries assume `tenant_id` is a dashboard variable.

### Row: server health (top of dashboard)

1. **Agents by state** — stacked area.

   ```sql
   SELECT toStartOfMinute(collected_at) AS t, tags['state'] AS state, avg(metric_value) AS count
   FROM server_metrics
   WHERE tenant_id = {tenant} AND metric_name = 'cameleer.agents.connected'
     AND collected_at >= {from} AND collected_at < {to}
   GROUP BY t, state ORDER BY t;
   ```

2. **Ingestion buffer depth** — line chart by `type`. Use `cameleer.ingestion.buffer.size`, same shape as above.

3. **Ingestion drops per minute** — bar chart (per-minute delta).

   ```sql
   WITH sorted AS (
       SELECT toStartOfMinute(collected_at) AS minute,
              tags['reason'] AS reason,
              server_instance_id,
              max(metric_value) AS cumulative
       FROM server_metrics
       WHERE tenant_id = {tenant} AND metric_name = 'cameleer.ingestion.drops'
         AND statistic = 'count' AND collected_at >= {from} AND collected_at < {to}
       GROUP BY minute, reason, server_instance_id
   )
   SELECT minute, reason,
          cumulative - lagInFrame(cumulative, 1, cumulative) OVER (
              PARTITION BY reason, server_instance_id ORDER BY minute
          ) AS drops_per_minute
   FROM sorted ORDER BY minute;
   ```

4. **Auth failures per minute** — same shape as drops, split by `reason`.

### Row: JVM

5. **Heap used vs committed vs max** — area chart. Filter `metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')` with `tags['area'] = 'heap'`, sum across pool `id`s.

6. **CPU %** — line. `process.cpu.usage` and `system.cpu.usage`.

7. **GC pause p99 + max** — `jvm.gc.pause` with statistic `max`, grouped by `tags['cause']`.

8. **Thread count** — `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`.

### Row: HTTP + DB

9. **HTTP p99 by URI** — use `http.server.requests` with `statistic = 'max'` as a rough p99 proxy, or `total_time/count` for mean. Group by `tags['uri']`. Filter `tags['outcome'] = 'SUCCESS'`.

10. **HTTP error rate** — count where `tags['status']` starts with `5`, divided by total.
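Panel 10's ratio is easy to get wrong off cumulative counters, so here is the arithmetic in a plain-Java sketch (the helper name `errorRate` and the input shape are ours): it assumes the per-status counts have already been reduced to deltas over the dashboard window.

```java
import java.util.Map;

public class HttpErrorRate {
    // Fraction of requests whose status tag starts with '5', given
    // per-status request counts over the dashboard's time window.
    static double errorRate(Map<String, Long> countsByStatus) {
        long total = countsByStatus.values().stream().mapToLong(Long::longValue).sum();
        if (total == 0) return 0.0;
        long errors = countsByStatus.entrySet().stream()
                .filter(e -> e.getKey().startsWith("5"))
                .mapToLong(Map.Entry::getValue)
                .sum();
        return (double) errors / total;
    }

    public static void main(String[] args) {
        System.out.println(errorRate(Map.of("200", 97L, "404", 1L, "500", 2L))); // 0.02
    }
}
```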
11. **HikariCP pool saturation** — overlay `hikaricp.connections.active` and `hikaricp.connections.pending`. If `pending > 0` sustained, the pool is too small.

12. **Hikari acquire timeouts per minute** — delta of `hikaricp.connections.timeout`. Any non-zero rate is a red flag.

### Row: alerting (collapsible)

13. **Alerting instances by state** — `alerting_instances_total` stacked by `tags['state']`.

14. **Eval errors per minute by kind** — delta of `alerting_eval_errors_total` by `tags['kind']`.

15. **Webhook delivery p99** — `alerting_webhook_delivery_duration_seconds` with `statistic = 'max'`.

### Row: deployments (runtime-enabled only)

16. **Deploy outcomes last 24 h** — counter delta of `cameleer.deployments.outcome` grouped by `tags['status']`.

17. **Deploy duration p99** — `cameleer.deployments.duration` with `statistic = 'max'` (or `total_time/count` for mean).

---

## Notes for the dashboard implementer

- **Always filter by `tenant_id`.** It's the first column in the sort key; queries that skip it scan the entire table.
- **Prefer predicate pushdown on `metric_name` + `statistic`.** Both are `LowCardinality`, so `metric_name = 'x' AND statistic = 'count'` is cheap.
- **Treat `server_instance_id` as a natural partition for counter math.** Never compute deltas across it — you'll get negative numbers on restart.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the statistic name for a Timer's cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. Tests may write `total`. When in doubt, accept either.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see an explosion here. Monitor `count(DISTINCT concat(metric_name, toString(tags)))` and alert if it spikes.
- **The dashboard should be read-only.** No one writes into `server_metrics` except the server itself — there's no API to push or delete rows.
---

## Changelog

- 2026-04-23 — initial write. Write-only in v1 (no REST endpoint or admin page). Reach out to the server team before building a write-back path; we'd rather cut a proper API than have the dashboard hit ClickHouse directly forever.