feat(server): REST API over server_metrics for SaaS dashboards

Adds /api/v1/admin/server-metrics/{catalog,instances,query} so SaaS control
planes can build the server-health dashboard without direct ClickHouse
access. One generic /query endpoint covers every panel in the
server-self-metrics doc: aggregation (avg/sum/max/min/latest), group-by-tag,
filter-by-tag, counter-delta mode with per-server_instance_id rotation
handling, and a derived 'mean' statistic for timers. Regex-validated
identifiers, parameterised literals, 31-day range cap, 500-series response
cap. ADMIN-only via the existing /api/v1/admin/** RBAC gate. Docs updated:
all 17 suggested panels now expressed as single-endpoint queries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hsiegeln
2026-04-23 23:41:02 +02:00
parent 64608a7677
commit d58c8cde2e
13 changed files with 1235 additions and 59 deletions

@@ -109,6 +109,7 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale
- `UsageAnalyticsController` — GET `/api/v1/admin/usage` (ClickHouse `usage_events`).
- `ClickHouseAdminController` — GET `/api/v1/admin/clickhouse/**` (conditional on `infrastructureendpoints` flag).
- `DatabaseAdminController` — GET `/api/v1/admin/database/**` (conditional on `infrastructureendpoints` flag).
- `ServerMetricsAdminController` — `/api/v1/admin/server-metrics/**`. GET `/catalog`, GET `/instances`, POST `/query`. Generic read API over the `server_metrics` ClickHouse table so SaaS dashboards don't need direct CH access. Delegates to `ServerMetricsQueryStore` (impl `ClickHouseServerMetricsQueryStore`). Validation: metric/tag regex `^[a-zA-Z0-9._]+$`, statistic regex `^[a-z_]+$`, `to - from ≤ 31 days`, stepSeconds ∈ [10, 3600], response capped at 500 series. `IllegalArgumentException` → 400. `/query` supports `raw` + `delta` modes (delta does per-`server_instance_id` positive-clipped differences, then aggregates across instances). Derived `statistic=mean` for timers computes `sum(total|total_time)/sum(count)` per bucket.
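
The delta-mode rule described above (per-instance positive-clipped differences, then a sum across instances) can be pictured with a small plain-Java sketch. It is illustrative only: the class, record, and method names are hypothetical, and the store itself does this work in ClickHouse SQL rather than in Java. It assumes, as the integration test's comments do, that the first sample of an instance contributes a delta of 0.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: mirrors the delta-mode arithmetic in plain Java. */
final class DeltaSketch {

    /** One bucketed counter reading for a single server instance (hypothetical type). */
    record Sample(String instanceId, long bucketEpochSecond, double value) {}

    /**
     * Per instance: delta = max(0, current - previous). The first sample of an
     * instance has no predecessor and is assumed to contribute 0. Deltas are
     * then summed per bucket across instances.
     */
    static Map<Long, Double> deltas(List<Sample> samplesOrderedByBucket) {
        Map<String, Double> previousByInstance = new LinkedHashMap<>();
        Map<Long, Double> totalByBucket = new TreeMap<>();
        for (Sample s : samplesOrderedByBucket) {
            // put() returns the previous reading for this instance, or null if none.
            Double previous = previousByInstance.put(s.instanceId(), s.value());
            double delta = previous == null ? 0.0 : Math.max(0.0, s.value() - previous);
            totalByBucket.merge(s.bucketEpochSecond(), delta, Double::sum);
        }
        return totalByBucket;
    }

    public static void main(String[] args) {
        // Same shape as the integration-test seed: srv-A climbs 0, 5, 10,
        // then srv-B starts over at 0 and reaches 2.
        List<Sample> samples = List.of(
                new Sample("srv-A", 0, 0.0),
                new Sample("srv-A", 60, 5.0),
                new Sample("srv-A", 120, 10.0),
                new Sample("srv-B", 180, 0.0),
                new Sample("srv-B", 240, 2.0));
        System.out.println(deltas(samples));
    }
}
```

Because each instance is tracked separately, the restart from srv-A (at 10) to srv-B (at 0) never shows up as a negative step; the five deltas add up to 12.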
### Other (flat)
@@ -147,7 +148,8 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale
- `ClickHouseDiagramStore`, `ClickHouseAgentEventRepository`
- `ClickHouseUsageTracker` — usage_events for billing
- `ClickHouseRouteCatalogStore` — persistent route catalog with first_seen cache, warm-loaded on startup
- `ClickHouseServerMetricsStore` — periodic dumps of the server's own Micrometer registry into the `server_metrics` table. Tenant-stamped (bound at the scheduler, not the bean); no `environment` column (server straddles envs). Batch-insert via `JdbcTemplate.batchUpdate` with `Map(String, String)` tag binding. Written by `ServerMetricsSnapshotScheduler`.
- `ClickHouseServerMetricsQueryStore` — read side of `server_metrics` for dashboards. Implements `ServerMetricsQueryStore`. `catalog(from,to)` returns name+type+statistics+tagKeys, `listInstances(from,to)` returns server_instance_ids with first/last seen, `query(request)` builds bucketed time-series with `raw` or `delta` mode and supports a derived `mean` statistic for timers. All identifier inputs regex-validated; tenant_id always bound; max range 31 days; series count capped at 500. Exposed via `ServerMetricsAdminController`.
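
As a rough illustration of the derived `mean` statistic mentioned above, the sketch below divides the summed `total`/`total_time` rows by the summed `count` rows in each bucket. The class, record, and method names are hypothetical, and the real computation happens in ClickHouse SQL; returning an empty value when the count is zero is this sketch's stand-in for the `nullIf(..., 0)` guard in that SQL.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalDouble;
import java.util.TreeMap;

/** Illustrative only: per-bucket timer mean = sum(total | total_time) / sum(count). */
final class TimerMeanSketch {

    /** One stored timer row, reduced to the fields the mean needs (hypothetical type). */
    record Row(long bucketEpochSecond, String statistic, double value) {}

    static Map<Long, OptionalDouble> meanPerBucket(List<Row> rows) {
        // Per bucket: index 0 accumulates total time, index 1 accumulates count.
        Map<Long, double[]> sums = new TreeMap<>();
        for (Row r : rows) {
            double[] acc = sums.computeIfAbsent(r.bucketEpochSecond(), k -> new double[2]);
            switch (r.statistic()) {
                case "total", "total_time" -> acc[0] += r.value();
                case "count" -> acc[1] += r.value();
                default -> { /* other statistics do not feed the mean */ }
            }
        }
        Map<Long, OptionalDouble> result = new TreeMap<>();
        sums.forEach((bucket, acc) -> result.put(bucket,
                acc[1] == 0.0 ? OptionalDouble.empty() : OptionalDouble.of(acc[0] / acc[1])));
        return result;
    }

    public static void main(String[] args) {
        // Values taken from the integration-test seed for the flush-duration timer.
        List<Row> rows = List.of(
                new Row(0, "count", 2.0),
                new Row(0, "total_time", 30.0),
                new Row(60, "count", 4.0),
                new Row(60, "total_time", 100.0));
        System.out.println(meanPerBucket(rows));
    }
}
```

With the seeded values, the first bucket works out to 30 / 2 = 15 and the second to 100 / 4 = 25.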
## search/ — ClickHouse search and log stores

@@ -9,6 +9,7 @@ import com.cameleer.server.app.storage.ClickHouseRouteCatalogStore;
import com.cameleer.server.core.storage.RouteCatalogStore;
import com.cameleer.server.app.storage.ClickHouseMetricsQueryStore;
import com.cameleer.server.app.storage.ClickHouseMetricsStore;
import com.cameleer.server.app.storage.ClickHouseServerMetricsQueryStore;
import com.cameleer.server.app.storage.ClickHouseServerMetricsStore;
import com.cameleer.server.app.storage.ClickHouseStatsStore;
import com.cameleer.server.core.admin.AuditRepository;
@@ -74,6 +75,13 @@ public class StorageBeanConfig {
return new ClickHouseServerMetricsStore(clickHouseJdbc);
}
@Bean
public ServerMetricsQueryStore clickHouseServerMetricsQueryStore(
TenantProperties tenantProperties,
@Qualifier("clickHouseJdbcTemplate") JdbcTemplate clickHouseJdbc) {
return new ClickHouseServerMetricsQueryStore(tenantProperties.getId(), clickHouseJdbc);
}
// ── Execution Store ──────────────────────────────────────────────────
@Bean

@@ -0,0 +1,135 @@
package com.cameleer.server.app.controller;
import com.cameleer.server.core.storage.ServerMetricsQueryStore;
import com.cameleer.server.core.storage.model.ServerInstanceInfo;
import com.cameleer.server.core.storage.model.ServerMetricCatalogEntry;
import com.cameleer.server.core.storage.model.ServerMetricQueryRequest;
import com.cameleer.server.core.storage.model.ServerMetricQueryResponse;
import io.swagger.v3.oas.annotations.Operation;
import io.swagger.v3.oas.annotations.tags.Tag;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import java.time.Instant;
import java.util.List;
import java.util.Map;
/**
* Generic read API over the ClickHouse {@code server_metrics} table. Lets
* SaaS control planes build server-health dashboards without requiring direct
* ClickHouse access.
*
* <p>Three endpoints cover all 17 panels in {@code docs/server-self-metrics.md}:
* <ul>
* <li>{@code GET /catalog} — discover available metric names, types, statistics, and tags</li>
* <li>{@code POST /query} — generic time-series query with aggregation, grouping, filtering, and counter-delta mode</li>
* <li>{@code GET /instances} — list server instances (useful for partitioning counter math)</li>
* </ul>
*
* <p>Protected by the {@code /api/v1/admin/**} catch-all in {@code SecurityConfig} — requires ADMIN role.
*/
@RestController
@RequestMapping("/api/v1/admin/server-metrics")
@Tag(name = "Server Self-Metrics",
description = "Read API over the server's own Micrometer registry snapshots for dashboards")
public class ServerMetricsAdminController {
/** Default lookback window for catalog/instances when from/to are omitted. */
private static final long DEFAULT_LOOKBACK_SECONDS = 3_600L;
private final ServerMetricsQueryStore store;
public ServerMetricsAdminController(ServerMetricsQueryStore store) {
this.store = store;
}
@GetMapping("/catalog")
@Operation(summary = "List metric names observed in the window",
description = "For each metric_name, returns metric_type, the set of statistics emitted, and the union of tag keys.")
public ResponseEntity<List<ServerMetricCatalogEntry>> catalog(
@RequestParam(required = false) String from,
@RequestParam(required = false) String to) {
Instant[] window = resolveWindow(from, to);
return ResponseEntity.ok(store.catalog(window[0], window[1]));
}
@GetMapping("/instances")
@Operation(summary = "List server_instance_id values observed in the window",
description = "Returns first/last seen timestamps — use to partition counter-delta computations.")
public ResponseEntity<List<ServerInstanceInfo>> instances(
@RequestParam(required = false) String from,
@RequestParam(required = false) String to) {
Instant[] window = resolveWindow(from, to);
return ResponseEntity.ok(store.listInstances(window[0], window[1]));
}
@PostMapping("/query")
@Operation(summary = "Generic time-series query",
description = "Returns bucketed series for a single metric_name. Supports aggregation (avg/sum/max/min/latest), group-by-tag, filter-by-tag, counter delta mode, and a derived 'mean' statistic for timers.")
public ResponseEntity<ServerMetricQueryResponse> query(@RequestBody QueryBody body) {
ServerMetricQueryRequest request = new ServerMetricQueryRequest(
body.metric(),
body.statistic(),
parseInstant(body.from(), "from"),
parseInstant(body.to(), "to"),
body.stepSeconds(),
body.groupByTags(),
body.filterTags(),
body.aggregation(),
body.mode(),
body.serverInstanceIds());
return ResponseEntity.ok(store.query(request));
}
@ExceptionHandler(IllegalArgumentException.class)
public ResponseEntity<Map<String, String>> handleBadRequest(IllegalArgumentException e) {
return ResponseEntity.badRequest().body(Map.of("error", e.getMessage()));
}
private static Instant[] resolveWindow(String from, String to) {
Instant toI = to != null ? parseInstant(to, "to") : Instant.now();
Instant fromI = from != null
? parseInstant(from, "from")
: toI.minusSeconds(DEFAULT_LOOKBACK_SECONDS);
if (!fromI.isBefore(toI)) {
throw new IllegalArgumentException("from must be strictly before to");
}
return new Instant[]{fromI, toI};
}
private static Instant parseInstant(String raw, String field) {
if (raw == null || raw.isBlank()) {
throw new IllegalArgumentException(field + " is required");
}
try {
return Instant.parse(raw);
} catch (Exception e) {
throw new IllegalArgumentException(
field + " must be an ISO-8601 instant (e.g. 2026-04-23T10:00:00Z)");
}
}
/**
* Request body for {@link #query(QueryBody)}. Uses ISO-8601 strings on
* the wire so the OpenAPI schema stays language-neutral.
*/
public record QueryBody(
String metric,
String statistic,
String from,
String to,
Integer stepSeconds,
List<String> groupByTags,
Map<String, String> filterTags,
String aggregation,
String mode,
List<String> serverInstanceIds
) {
}
}
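
For orientation, a caller might assemble a `POST /query` request along the following lines. This is a sketch under assumptions, not part of the commit: the class name, the base URL, and the bearer-token header are hypothetical (the diff only shows that the endpoint sits behind the ADMIN gate), while the JSON fields follow the `QueryBody` record above. The request is built but never sent, so the snippet runs without a server.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustrative only: builds (but does not send) a query request. */
final class ServerMetricsQueryClientSketch {

    /** baseUrl and bearerToken are caller-supplied; the Bearer scheme is an assumption. */
    static HttpRequest buildQuery(String baseUrl, String bearerToken) {
        // Field names follow the QueryBody record; timestamps are ISO-8601 instants.
        String json = """
                {
                  "metric": "cameleer.agents.connected",
                  "statistic": "value",
                  "from": "2026-04-23T09:59:00Z",
                  "to": "2026-04-23T10:02:00Z",
                  "stepSeconds": 60,
                  "groupByTags": ["state"],
                  "aggregation": "avg",
                  "mode": "raw"
                }
                """;
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/v1/admin/server-metrics/query"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + bearerToken)
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical host and token, for illustration only.
        HttpRequest request = buildQuery("http://localhost:8081", "example-token");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Passing the result to `java.net.http.HttpClient#send` would perform the call; a 400 response carries an `error` field, per the controller's `IllegalArgumentException` handler.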

@@ -0,0 +1,408 @@
package com.cameleer.server.app.storage;
import com.cameleer.server.core.storage.ServerMetricsQueryStore;
import com.cameleer.server.core.storage.model.ServerInstanceInfo;
import com.cameleer.server.core.storage.model.ServerMetricCatalogEntry;
import com.cameleer.server.core.storage.model.ServerMetricPoint;
import com.cameleer.server.core.storage.model.ServerMetricQueryRequest;
import com.cameleer.server.core.storage.model.ServerMetricQueryResponse;
import com.cameleer.server.core.storage.model.ServerMetricSeries;
import org.springframework.jdbc.core.JdbcTemplate;
import java.sql.Array;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;
/**
* ClickHouse-backed {@link ServerMetricsQueryStore}.
*
* <p>Safety rules for every query:
* <ul>
* <li>tenant_id always bound as a parameter — no cross-tenant reads.</li>
* <li>Identifier-like inputs (metric name, statistic, tag keys,
* aggregation, mode) are validated: names by regex, aggregation and
* mode against fixed allow-lists. Tag keys are also bound as JDBC
* parameters in {@code tags[?]} map lookups, so they never become part
* of the SQL text.</li>
* <li>Literal values ({@code from}, {@code to}, tag filter values,
* server_instance_id allow-list) always go through {@code ?}.</li>
* <li>The time range is capped at {@link #MAX_RANGE}.</li>
* <li>Result cardinality is capped at {@link #MAX_SERIES} series.</li>
* </ul>
*/
public class ClickHouseServerMetricsQueryStore implements ServerMetricsQueryStore {
private static final Pattern SAFE_IDENTIFIER = Pattern.compile("^[a-zA-Z0-9._]+$");
private static final Pattern SAFE_STATISTIC = Pattern.compile("^[a-z_]+$");
private static final Set<String> AGGREGATIONS = Set.of("avg", "sum", "max", "min", "latest");
private static final Set<String> MODES = Set.of("raw", "delta");
/** Maximum {@code to - from} window accepted by the API. */
static final Duration MAX_RANGE = Duration.ofDays(31);
/** Accepted bounds and default for {@code stepSeconds}; out-of-range values are rejected, not clamped. */
static final int MIN_STEP = 10;
static final int MAX_STEP = 3600;
static final int DEFAULT_STEP = 60;
/** Defence against group-by explosion — limit the series count per response. */
static final int MAX_SERIES = 500;
private final String tenantId;
private final JdbcTemplate jdbc;
public ClickHouseServerMetricsQueryStore(String tenantId, JdbcTemplate jdbc) {
this.tenantId = tenantId;
this.jdbc = jdbc;
}
// ── catalog ─────────────────────────────────────────────────────────
@Override
public List<ServerMetricCatalogEntry> catalog(Instant from, Instant to) {
requireRange(from, to);
String sql = """
SELECT
metric_name,
any(metric_type) AS metric_type,
arraySort(groupUniqArray(statistic)) AS statistics,
arraySort(arrayDistinct(arrayFlatten(groupArray(mapKeys(tags))))) AS tag_keys
FROM server_metrics
WHERE tenant_id = ?
AND collected_at >= ?
AND collected_at < ?
GROUP BY metric_name
ORDER BY metric_name
""";
return jdbc.query(sql, (rs, n) -> new ServerMetricCatalogEntry(
rs.getString("metric_name"),
rs.getString("metric_type"),
arrayToStringList(rs.getArray("statistics")),
arrayToStringList(rs.getArray("tag_keys"))
), tenantId, Timestamp.from(from), Timestamp.from(to));
}
// ── instances ───────────────────────────────────────────────────────
@Override
public List<ServerInstanceInfo> listInstances(Instant from, Instant to) {
requireRange(from, to);
String sql = """
SELECT
server_instance_id,
min(collected_at) AS first_seen,
max(collected_at) AS last_seen
FROM server_metrics
WHERE tenant_id = ?
AND collected_at >= ?
AND collected_at < ?
GROUP BY server_instance_id
ORDER BY last_seen DESC
""";
return jdbc.query(sql, (rs, n) -> new ServerInstanceInfo(
rs.getString("server_instance_id"),
rs.getTimestamp("first_seen").toInstant(),
rs.getTimestamp("last_seen").toInstant()
), tenantId, Timestamp.from(from), Timestamp.from(to));
}
// ── query ───────────────────────────────────────────────────────────
@Override
public ServerMetricQueryResponse query(ServerMetricQueryRequest request) {
if (request == null) throw new IllegalArgumentException("request is required");
String metric = requireSafeIdentifier(request.metric(), "metric");
requireRange(request.from(), request.to());
String aggregation = request.aggregation() != null ? request.aggregation().toLowerCase() : "avg";
if (!AGGREGATIONS.contains(aggregation)) {
throw new IllegalArgumentException("aggregation must be one of " + AGGREGATIONS);
}
String mode = request.mode() != null ? request.mode().toLowerCase() : "raw";
if (!MODES.contains(mode)) {
throw new IllegalArgumentException("mode must be one of " + MODES);
}
int step = request.stepSeconds() != null ? request.stepSeconds() : DEFAULT_STEP;
if (step < MIN_STEP || step > MAX_STEP) {
throw new IllegalArgumentException(
"stepSeconds must be in [" + MIN_STEP + "," + MAX_STEP + "]");
}
String statistic = request.statistic();
if (statistic != null && !SAFE_STATISTIC.matcher(statistic).matches()) {
throw new IllegalArgumentException("statistic contains unsafe characters");
}
List<String> groupByTags = request.groupByTags() != null
? request.groupByTags() : List.of();
for (String t : groupByTags) requireSafeIdentifier(t, "groupByTag");
Map<String, String> filterTags = request.filterTags() != null
? request.filterTags() : Map.of();
for (String t : filterTags.keySet()) requireSafeIdentifier(t, "filterTag key");
List<String> instanceAllowList = request.serverInstanceIds() != null
? request.serverInstanceIds() : List.of();
boolean isDelta = "delta".equals(mode);
boolean isMean = "mean".equals(statistic);
String sql = isDelta
? buildDeltaSql(step, groupByTags, filterTags, instanceAllowList, statistic, isMean)
: buildRawSql(step, groupByTags, filterTags, instanceAllowList,
statistic, aggregation, isMean);
List<Object> params = buildParams(groupByTags, metric, statistic, isMean,
request.from(), request.to(),
filterTags, instanceAllowList);
List<Row> rows = jdbc.query(sql, (rs, n) -> {
int idx = 1;
Instant bucket = rs.getTimestamp(idx++).toInstant();
List<String> tagValues = new ArrayList<>(groupByTags.size());
for (int g = 0; g < groupByTags.size(); g++) {
tagValues.add(rs.getString(idx++));
}
double value = rs.getDouble(idx);
return new Row(bucket, tagValues, value);
}, params.toArray());
return assembleSeries(rows, metric, statistic, aggregation, mode, step, groupByTags);
}
// ── SQL builders ────────────────────────────────────────────────────
/**
* Builds a single-pass SQL for raw mode:
* <pre>{@code
* SELECT bucket, tag0, ..., <agg>(metric_value) AS value
* FROM server_metrics WHERE ...
* GROUP BY bucket, tag0, ...
* ORDER BY bucket, tag0, ...
* }</pre>
* For {@code statistic=mean}, replaces the aggregate with
* {@code sumIf(metric_value, statistic IN ('total','total_time')) / nullIf(sumIf(metric_value, statistic='count'), 0)}.
*/
private String buildRawSql(int step, List<String> groupByTags,
Map<String, String> filterTags,
List<String> instanceAllowList,
String statistic, String aggregation, boolean isMean) {
StringBuilder s = new StringBuilder(512);
s.append("SELECT\n toDateTime64(toStartOfInterval(collected_at, INTERVAL ")
.append(step).append(" SECOND), 3) AS bucket");
for (int i = 0; i < groupByTags.size(); i++) {
s.append(",\n tags[?] AS tag").append(i);
}
s.append(",\n ").append(isMean ? meanExpr() : scalarAggExpr(aggregation))
.append(" AS value\nFROM server_metrics\n");
appendWhereClause(s, filterTags, instanceAllowList, statistic, isMean);
s.append("GROUP BY bucket");
for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
s.append("\nORDER BY bucket");
for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
return s.toString();
}
/**
* Builds a three-level SQL for delta mode. The inner query collapses each
* (bucket, instance, tag-group) to one row via {@code max(metric_value)}
* (or the derived mean expression when {@code statistic=mean}); the middle
* query computes positive-clipped per-instance differences with a window
* function; the outer query sums across instances.
*/
private String buildDeltaSql(int step, List<String> groupByTags,
Map<String, String> filterTags,
List<String> instanceAllowList,
String statistic, boolean isMean) {
StringBuilder s = new StringBuilder(1024);
s.append("SELECT bucket");
for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
s.append(", sum(delta) AS value FROM (\n");
// Middle: per-instance positive-clipped delta using window.
s.append(" SELECT bucket");
for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
s.append(", server_instance_id, greatest(0, value - coalesce(any(value) OVER (")
.append("PARTITION BY server_instance_id");
for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
s.append(" ORDER BY bucket ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), value)) AS delta FROM (\n");
// Inner: one representative value per (bucket, instance, tag-group).
s.append(" SELECT\n toDateTime64(toStartOfInterval(collected_at, INTERVAL ")
.append(step).append(" SECOND), 3) AS bucket,\n server_instance_id");
for (int i = 0; i < groupByTags.size(); i++) {
s.append(",\n tags[?] AS tag").append(i);
}
s.append(",\n ").append(isMean ? meanExpr() : "max(metric_value)")
.append(" AS value\n FROM server_metrics\n");
appendWhereClause(s, filterTags, instanceAllowList, statistic, isMean);
s.append(" GROUP BY bucket, server_instance_id");
for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
s.append("\n ) AS bucketed\n) AS deltas\n");
s.append("GROUP BY bucket");
for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
s.append("\nORDER BY bucket");
for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
return s.toString();
}
/**
* WHERE clause shared by both raw and delta SQL shapes. Appended at the
* correct indent under either the single {@code FROM server_metrics}
* (raw) or the innermost one (delta).
*/
private void appendWhereClause(StringBuilder s, Map<String, String> filterTags,
List<String> instanceAllowList,
String statistic, boolean isMean) {
s.append(" WHERE tenant_id = ?\n")
.append(" AND metric_name = ?\n");
if (isMean) {
s.append(" AND statistic IN ('count', 'total', 'total_time')\n");
} else if (statistic != null) {
s.append(" AND statistic = ?\n");
}
s.append(" AND collected_at >= ?\n")
.append(" AND collected_at < ?\n");
for (int i = 0; i < filterTags.size(); i++) {
s.append(" AND tags[?] = ?\n");
}
if (!instanceAllowList.isEmpty()) {
s.append(" AND server_instance_id IN (")
.append("?,".repeat(instanceAllowList.size() - 1)).append("?)\n");
}
}
/**
* Positional params shared by raw and delta queries. The order is the same
* for both shapes: each emits its {@code tags[?]} select-list placeholders
* once, ahead of the single WHERE clause written by
* {@link #appendWhereClause}.
*/
private List<Object> buildParams(List<String> groupByTags, String metric,
String statistic, boolean isMean,
Instant from, Instant to,
Map<String, String> filterTags,
List<String> instanceAllowList) {
List<Object> params = new ArrayList<>();
// SELECT-list tags[?] placeholders
params.addAll(groupByTags);
// WHERE
params.add(tenantId);
params.add(metric);
if (!isMean && statistic != null) params.add(statistic);
params.add(Timestamp.from(from));
params.add(Timestamp.from(to));
for (Map.Entry<String, String> e : filterTags.entrySet()) {
params.add(e.getKey());
params.add(e.getValue());
}
params.addAll(instanceAllowList);
return params;
}
private static String scalarAggExpr(String aggregation) {
return switch (aggregation) {
case "avg" -> "avg(metric_value)";
case "sum" -> "sum(metric_value)";
case "max" -> "max(metric_value)";
case "min" -> "min(metric_value)";
case "latest" -> "argMax(metric_value, collected_at)";
default -> throw new IllegalStateException("unreachable: " + aggregation);
};
}
private static String meanExpr() {
return "sumIf(metric_value, statistic IN ('total', 'total_time'))"
+ " / nullIf(sumIf(metric_value, statistic = 'count'), 0)";
}
// ── response assembly ───────────────────────────────────────────────
private ServerMetricQueryResponse assembleSeries(
List<Row> rows, String metric, String statistic,
String aggregation, String mode, int step, List<String> groupByTags) {
Map<List<String>, List<ServerMetricPoint>> bySignature = new LinkedHashMap<>();
for (Row r : rows) {
if (Double.isNaN(r.value) || Double.isInfinite(r.value)) continue;
bySignature.computeIfAbsent(r.tagValues, k -> new ArrayList<>())
.add(new ServerMetricPoint(r.bucket, r.value));
}
if (bySignature.size() > MAX_SERIES) {
throw new IllegalArgumentException(
"query produced " + bySignature.size()
+ " series; reduce groupByTags or tighten filterTags (max "
+ MAX_SERIES + ")");
}
List<ServerMetricSeries> series = new ArrayList<>(bySignature.size());
for (Map.Entry<List<String>, List<ServerMetricPoint>> e : bySignature.entrySet()) {
Map<String, String> tags = new LinkedHashMap<>();
for (int i = 0; i < groupByTags.size(); i++) {
tags.put(groupByTags.get(i), e.getKey().get(i));
}
series.add(new ServerMetricSeries(Collections.unmodifiableMap(tags), e.getValue()));
}
return new ServerMetricQueryResponse(metric,
statistic != null ? statistic : "value",
aggregation, mode, step, series);
}
// ── helpers ─────────────────────────────────────────────────────────
private static void requireRange(Instant from, Instant to) {
if (from == null || to == null) {
throw new IllegalArgumentException("from and to are required");
}
if (!from.isBefore(to)) {
throw new IllegalArgumentException("from must be strictly before to");
}
if (Duration.between(from, to).compareTo(MAX_RANGE) > 0) {
throw new IllegalArgumentException(
"time range exceeds maximum of " + MAX_RANGE.toDays() + " days");
}
}
private static String requireSafeIdentifier(String value, String field) {
if (value == null || value.isBlank()) {
throw new IllegalArgumentException(field + " is required");
}
if (!SAFE_IDENTIFIER.matcher(value).matches()) {
throw new IllegalArgumentException(
field + " contains unsafe characters (allowed: [a-zA-Z0-9._])");
}
return value;
}
private static List<String> arrayToStringList(Array array) {
if (array == null) return List.of();
try {
Object[] values = (Object[]) array.getArray();
Set<String> sorted = new TreeSet<>();
for (Object v : values) {
if (v != null) sorted.add(v.toString());
}
return List.copyOf(sorted);
} catch (Exception e) {
return List.of();
} finally {
try { array.free(); } catch (Exception ignore) { }
}
}
private record Row(Instant bucket, List<String> tagValues, double value) {
}
}
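
The positional binding used by the store depends on one invariant: every `?` in the SQL text has exactly one parameter, in the same order (select-list tag keys first, then the WHERE values). The sketch below restates that invariant on a deliberately abbreviated query. All names are hypothetical, plain strings stand in for the timestamps, and the SQL is shortened for illustration rather than copied from the store.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: placeholders and params must line up one-to-one. */
final class PlaceholderOrderSketch {

    /** Abbreviated SQL: select-list tags[?] first, then the WHERE placeholders. */
    static String sql(List<String> groupByTags, String statistic,
                      Map<String, String> filterTags, List<String> instanceIds) {
        StringBuilder s = new StringBuilder("SELECT bucket");
        for (int i = 0; i < groupByTags.size(); i++) s.append(", tags[?] AS tag").append(i);
        s.append(" FROM server_metrics WHERE tenant_id = ? AND metric_name = ?");
        if (statistic != null) s.append(" AND statistic = ?");
        s.append(" AND collected_at >= ? AND collected_at < ?");
        for (int i = 0; i < filterTags.size(); i++) s.append(" AND tags[?] = ?");
        if (!instanceIds.isEmpty()) {
            s.append(" AND server_instance_id IN (")
             .append("?,".repeat(instanceIds.size() - 1)).append("?)");
        }
        return s.toString();
    }

    /** Params in the same textual order as the placeholders emitted by sql(). */
    static List<Object> params(List<String> groupByTags, String tenantId, String metric,
                               String statistic, String from, String to,
                               Map<String, String> filterTags, List<String> instanceIds) {
        List<Object> p = new ArrayList<>(groupByTags);
        p.add(tenantId);
        p.add(metric);
        if (statistic != null) p.add(statistic);
        p.add(from);
        p.add(to);
        filterTags.forEach((k, v) -> { p.add(k); p.add(v); });
        p.addAll(instanceIds);
        return p;
    }

    static long countPlaceholders(String sql) {
        return sql.chars().filter(c -> c == '?').count();
    }

    public static void main(String[] args) {
        Map<String, String> filters = new LinkedHashMap<>();
        filters.put("reason", "buffer_full");
        List<String> tags = List.of("state");
        List<String> ids = List.of("srv-A", "srv-B");
        String text = sql(tags, "count", filters, ids);
        List<Object> p = params(tags, "default", "cameleer.ingestion.drops", "count",
                "2026-04-23T09:59:00Z", "2026-04-23T10:05:00Z", filters, ids);
        System.out.println(countPlaceholders(text) + " placeholders, " + p.size() + " params");
    }
}
```

A check of this kind is cheap to keep in a unit test, since a mismatch between the two builders would otherwise only surface as a driver error at query time.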

@@ -0,0 +1,314 @@
package com.cameleer.server.app.controller;
import com.cameleer.server.app.AbstractPostgresIT;
import com.cameleer.server.app.TestSecurityHelper;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.web.client.TestRestTemplate;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.Map;
import static org.assertj.core.api.Assertions.assertThat;
class ServerMetricsAdminControllerIT extends AbstractPostgresIT {
@Autowired
private TestRestTemplate restTemplate;
@Autowired
private TestSecurityHelper securityHelper;
private final ObjectMapper mapper = new ObjectMapper();
private HttpHeaders adminJson;
private HttpHeaders adminGet;
private HttpHeaders viewerGet;
@BeforeEach
void seedAndAuth() {
adminJson = securityHelper.adminHeaders();
adminGet = securityHelper.authHeadersNoBody(securityHelper.adminToken());
viewerGet = securityHelper.authHeadersNoBody(securityHelper.viewerToken());
// Fresh rows for each test. The ClickHouse JdbcTemplate is a separate
// Spring bean, so seed through clickhouseJdbc(), which resolves the same
// JdbcTemplate the store uses (the ClickHouseConfig bean).
org.springframework.jdbc.core.JdbcTemplate ch = clickhouseJdbc();
ch.execute("TRUNCATE TABLE server_metrics");
Instant t0 = Instant.parse("2026-04-23T10:00:00Z");
// Gauge: cameleer.agents.connected, two states, two buckets.
insert(ch, "default", t0, "srv-A", "cameleer.agents.connected", "gauge", "value", 3.0,
Map.of("state", "live"));
insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.agents.connected", "gauge", "value", 4.0,
Map.of("state", "live"));
insert(ch, "default", t0, "srv-A", "cameleer.agents.connected", "gauge", "value", 1.0,
Map.of("state", "stale"));
insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.agents.connected", "gauge", "value", 0.0,
Map.of("state", "stale"));
// Counter: cumulative drops, +5 per minute on srv-A.
insert(ch, "default", t0, "srv-A", "cameleer.ingestion.drops", "counter", "count", 0.0, Map.of("reason", "buffer_full"));
insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.ingestion.drops", "counter", "count", 5.0, Map.of("reason", "buffer_full"));
insert(ch, "default", t0.plusSeconds(120), "srv-A", "cameleer.ingestion.drops", "counter", "count", 10.0, Map.of("reason", "buffer_full"));
// Simulated restart to srv-B: counter resets to 0, then climbs to 2.
insert(ch, "default", t0.plusSeconds(180), "srv-B", "cameleer.ingestion.drops", "counter", "count", 0.0, Map.of("reason", "buffer_full"));
insert(ch, "default", t0.plusSeconds(240), "srv-B", "cameleer.ingestion.drops", "counter", "count", 2.0, Map.of("reason", "buffer_full"));
// Timer mean inputs: two buckets, (count=2, total_time=30) then (count=4, total_time=100).
insert(ch, "default", t0, "srv-A", "cameleer.ingestion.flush.duration", "timer", "count", 2.0, Map.of("type", "execution"));
insert(ch, "default", t0, "srv-A", "cameleer.ingestion.flush.duration", "timer", "total_time", 30.0, Map.of("type", "execution"));
insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.ingestion.flush.duration", "timer", "count", 4.0, Map.of("type", "execution"));
insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.ingestion.flush.duration", "timer", "total_time", 100.0, Map.of("type", "execution"));
}
// ── catalog ─────────────────────────────────────────────────────────
@Test
void catalog_listsSeededMetricsWithStatisticsAndTagKeys() throws Exception {
ResponseEntity<String> r = restTemplate.exchange(
"/api/v1/admin/server-metrics/catalog?from=2026-04-23T09:00:00Z&to=2026-04-23T11:00:00Z",
HttpMethod.GET, new HttpEntity<>(adminGet), String.class);
assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK);
JsonNode body = mapper.readTree(r.getBody());
assertThat(body.isArray()).isTrue();
JsonNode drops = findByField(body, "metricName", "cameleer.ingestion.drops");
assertThat(drops.get("metricType").asText()).isEqualTo("counter");
assertThat(asStringList(drops.get("statistics"))).contains("count");
assertThat(asStringList(drops.get("tagKeys"))).contains("reason");
JsonNode timer = findByField(body, "metricName", "cameleer.ingestion.flush.duration");
assertThat(asStringList(timer.get("statistics"))).contains("count", "total_time");
}
// ── instances ───────────────────────────────────────────────────────
@Test
void instances_listsDistinctServerInstanceIdsWithFirstAndLastSeen() throws Exception {
ResponseEntity<String> r = restTemplate.exchange(
"/api/v1/admin/server-metrics/instances?from=2026-04-23T09:00:00Z&to=2026-04-23T11:00:00Z",
HttpMethod.GET, new HttpEntity<>(adminGet), String.class);
assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK);
JsonNode body = mapper.readTree(r.getBody());
assertThat(body.isArray()).isTrue();
assertThat(body.size()).isEqualTo(2);
// Ordered by last_seen DESC — srv-B saw a later row.
assertThat(body.get(0).get("serverInstanceId").asText()).isEqualTo("srv-B");
assertThat(body.get(1).get("serverInstanceId").asText()).isEqualTo("srv-A");
}
// ── query — gauge with group-by-tag ─────────────────────────────────
@Test
void query_gaugeWithGroupByTag_returnsSeriesPerTagValue() throws Exception {
String requestBody = """
{
"metric": "cameleer.agents.connected",
"statistic": "value",
"from": "2026-04-23T09:59:00Z",
"to": "2026-04-23T10:02:00Z",
"stepSeconds": 60,
"groupByTags": ["state"],
"aggregation": "avg",
"mode": "raw"
}
""";
ResponseEntity<String> r = restTemplate.postForEntity(
"/api/v1/admin/server-metrics/query",
new HttpEntity<>(requestBody, adminJson), String.class);
assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK);
JsonNode body = mapper.readTree(r.getBody());
assertThat(body.get("metric").asText()).isEqualTo("cameleer.agents.connected");
assertThat(body.get("statistic").asText()).isEqualTo("value");
assertThat(body.get("mode").asText()).isEqualTo("raw");
assertThat(body.get("stepSeconds").asInt()).isEqualTo(60);
JsonNode series = body.get("series");
assertThat(series.isArray()).isTrue();
assertThat(series.size()).isEqualTo(2);
JsonNode live = findByTag(series, "state", "live");
assertThat(live.get("points").size()).isEqualTo(2);
assertThat(live.get("points").get(0).get("v").asDouble()).isEqualTo(3.0);
assertThat(live.get("points").get(1).get("v").asDouble()).isEqualTo(4.0);
}
// ── query — counter delta across instance rotation ──────────────────
@Test
void query_counterDelta_clipsNegativesAcrossInstanceRotation() throws Exception {
String requestBody = """
{
"metric": "cameleer.ingestion.drops",
"statistic": "count",
"from": "2026-04-23T09:59:00Z",
"to": "2026-04-23T10:05:00Z",
"stepSeconds": 60,
"groupByTags": ["reason"],
"aggregation": "sum",
"mode": "delta"
}
""";
ResponseEntity<String> r = restTemplate.postForEntity(
"/api/v1/admin/server-metrics/query",
new HttpEntity<>(requestBody, adminJson), String.class);
assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK);
JsonNode body = mapper.readTree(r.getBody());
JsonNode reason = findByTag(body.get("series"), "reason", "buffer_full");
// Deltas: 0 (first bucket on srv-A), 5, 5, 0 (first on srv-B, clipped), 2.
// Sum across the window should be 12 if we tally all positive deltas.
double sum = 0;
for (JsonNode p : reason.get("points")) sum += p.get("v").asDouble();
assertThat(sum).isEqualTo(12.0);
// No individual point may be negative.
for (JsonNode p : reason.get("points")) {
assertThat(p.get("v").asDouble()).isGreaterThanOrEqualTo(0.0);
}
}
// ── query — derived 'mean' statistic for timers ─────────────────────
@Test
void query_timerMeanStatistic_computesTotalOverCountPerBucket() throws Exception {
String requestBody = """
{
"metric": "cameleer.ingestion.flush.duration",
"statistic": "mean",
"from": "2026-04-23T09:59:00Z",
"to": "2026-04-23T10:02:00Z",
"stepSeconds": 60,
"groupByTags": ["type"],
"aggregation": "avg",
"mode": "raw"
}
""";
ResponseEntity<String> r = restTemplate.postForEntity(
"/api/v1/admin/server-metrics/query",
new HttpEntity<>(requestBody, adminJson), String.class);
assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK);
JsonNode body = mapper.readTree(r.getBody());
JsonNode points = findByTag(body.get("series"), "type", "execution").get("points");
// Bucket 0: 30 / 2 = 15.0
// Bucket 1: 100 / 4 = 25.0
assertThat(points.get(0).get("v").asDouble()).isEqualTo(15.0);
assertThat(points.get(1).get("v").asDouble()).isEqualTo(25.0);
}
// ── query — input validation ────────────────────────────────────────
@Test
void query_rejectsUnsafeMetricName() {
String requestBody = """
{
"metric": "cameleer.agents; DROP TABLE server_metrics",
"from": "2026-04-23T09:59:00Z",
"to": "2026-04-23T10:02:00Z"
}
""";
ResponseEntity<String> r = restTemplate.postForEntity(
"/api/v1/admin/server-metrics/query",
new HttpEntity<>(requestBody, adminJson), String.class);
assertThat(r.getStatusCode()).isEqualTo(HttpStatus.BAD_REQUEST);
}
@Test
void query_rejectsRangeBeyondMax() {
String requestBody = """
{
"metric": "cameleer.agents.connected",
"from": "2026-01-01T00:00:00Z",
"to": "2026-04-23T00:00:00Z"
}
""";
ResponseEntity<String> r = restTemplate.postForEntity(
"/api/v1/admin/server-metrics/query",
new HttpEntity<>(requestBody, adminJson), String.class);
assertThat(r.getStatusCode()).isEqualTo(HttpStatus.BAD_REQUEST);
}
// ── authorization ───────────────────────────────────────────────────
@Test
void allEndpoints_requireAdminRole() {
ResponseEntity<String> catalog = restTemplate.exchange(
"/api/v1/admin/server-metrics/catalog",
HttpMethod.GET, new HttpEntity<>(viewerGet), String.class);
assertThat(catalog.getStatusCode()).isEqualTo(HttpStatus.FORBIDDEN);
ResponseEntity<String> instances = restTemplate.exchange(
"/api/v1/admin/server-metrics/instances",
HttpMethod.GET, new HttpEntity<>(viewerGet), String.class);
assertThat(instances.getStatusCode()).isEqualTo(HttpStatus.FORBIDDEN);
HttpHeaders viewerPost = securityHelper.authHeaders(securityHelper.viewerToken());
ResponseEntity<String> query = restTemplate.exchange(
"/api/v1/admin/server-metrics/query",
HttpMethod.POST, new HttpEntity<>("{}", viewerPost), String.class);
assertThat(query.getStatusCode()).isEqualTo(HttpStatus.FORBIDDEN);
}
// ── helpers ─────────────────────────────────────────────────────────
private org.springframework.jdbc.core.JdbcTemplate clickhouseJdbc() {
return org.springframework.test.util.AopTestUtils.getTargetObject(
applicationContext.getBean("clickHouseJdbcTemplate"));
}
@Autowired
private org.springframework.context.ApplicationContext applicationContext;
private static void insert(org.springframework.jdbc.core.JdbcTemplate jdbc,
String tenantId, Instant collectedAt, String serverInstanceId,
String metricName, String metricType, String statistic,
double value, Map<String, String> tags) {
jdbc.update("""
INSERT INTO server_metrics
(tenant_id, collected_at, server_instance_id,
metric_name, metric_type, statistic, metric_value, tags)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""",
tenantId, Timestamp.from(collectedAt), serverInstanceId,
metricName, metricType, statistic, value, tags);
}
private static JsonNode findByField(JsonNode array, String field, String value) {
for (JsonNode n : array) {
if (value.equals(n.path(field).asText())) return n;
}
throw new AssertionError("no element with " + field + "=" + value);
}
private static JsonNode findByTag(JsonNode seriesArray, String tagKey, String tagValue) {
for (JsonNode s : seriesArray) {
if (tagValue.equals(s.path("tags").path(tagKey).asText())) return s;
}
throw new AssertionError("no series with tag " + tagKey + "=" + tagValue);
}
private static java.util.List<String> asStringList(JsonNode arr) {
java.util.List<String> out = new java.util.ArrayList<>();
if (arr != null) for (JsonNode n : arr) out.add(n.asText());
return out;
}
}

@@ -0,0 +1,36 @@
package com.cameleer.server.core.storage;
import com.cameleer.server.core.storage.model.ServerInstanceInfo;
import com.cameleer.server.core.storage.model.ServerMetricCatalogEntry;
import com.cameleer.server.core.storage.model.ServerMetricQueryRequest;
import com.cameleer.server.core.storage.model.ServerMetricQueryResponse;
import java.time.Instant;
import java.util.List;
/**
* Read-side access to the ClickHouse {@code server_metrics} table. Exposed
* to dashboards through {@code /api/v1/admin/server-metrics/**} so SaaS
* control planes don't need direct ClickHouse access.
*/
public interface ServerMetricsQueryStore {
/**
* Catalog of metric names observed in {@code [from, to)} along with their
* type, the set of statistics emitted, and the union of tag keys seen.
*/
List<ServerMetricCatalogEntry> catalog(Instant from, Instant to);
/**
* Distinct {@code server_instance_id} values that wrote at least one
* sample in {@code [from, to)}, with first/last seen timestamps.
*/
List<ServerInstanceInfo> listInstances(Instant from, Instant to);
/**
* Generic time-series query. See {@link ServerMetricQueryRequest} for
* request semantics. Implementations must enforce input validation and
* reject unsafe inputs with {@link IllegalArgumentException}.
*/
ServerMetricQueryResponse query(ServerMetricQueryRequest request);
}

@@ -0,0 +1,15 @@
package com.cameleer.server.core.storage.model;
import java.time.Instant;
/**
* One row of the {@code /api/v1/admin/server-metrics/instances} response.
* Used by dashboards to partition counter-delta computations across server
* process boundaries (each boot rotates the id).
*/
public record ServerInstanceInfo(
String serverInstanceId,
Instant firstSeen,
Instant lastSeen
) {
}

@@ -0,0 +1,17 @@
package com.cameleer.server.core.storage.model;
import java.util.List;
/**
* One row of the {@code /api/v1/admin/server-metrics/catalog} response.
* Surfaces the set of statistics and tag keys observed for a metric across
* the requested window, so dashboards can build selectors without ClickHouse
* access.
*/
public record ServerMetricCatalogEntry(
String metricName,
String metricType,
List<String> statistics,
List<String> tagKeys
) {
}

@@ -0,0 +1,10 @@
package com.cameleer.server.core.storage.model;
import java.time.Instant;
/** One {@code (bucket, value)} point of a server-metrics series. */
public record ServerMetricPoint(
Instant t,
double v
) {
}

@@ -0,0 +1,40 @@
package com.cameleer.server.core.storage.model;
import java.time.Instant;
import java.util.List;
import java.util.Map;
/**
* Request contract for the generic server-metrics time-series query.
*
* <p>{@code aggregation} controls how multiple samples within a bucket
* collapse: {@code avg|sum|max|min|latest}. {@code mode} controls counter
* handling: {@code raw} returns values as stored (cumulative for counters),
* {@code delta} returns per-bucket positive-clipped differences computed
* per {@code server_instance_id}.
*
* <p>{@code statistic} filters which Micrometer sub-measurement to read
* ({@code value} / {@code count} / {@code total_time} / {@code total} /
* {@code max} / {@code mean}). {@code mean} is a derived statistic for
* timers: {@code sum(total_time|total) / sum(count)} per bucket.
*
* <p>{@code groupByTags} splits the output into one series per unique tag
* combination. {@code filterTags} narrows the input to samples whose tag
* map matches every entry.
*
* <p>{@code serverInstanceIds} is an optional allow-list. When null or
* empty, all instances observed in the window are included.
*/
public record ServerMetricQueryRequest(
String metric,
String statistic,
Instant from,
Instant to,
Integer stepSeconds,
List<String> groupByTags,
Map<String, String> filterTags,
String aggregation,
String mode,
List<String> serverInstanceIds
) {
}

@@ -0,0 +1,14 @@
package com.cameleer.server.core.storage.model;
import java.util.List;
/** Response of the generic server-metrics time-series query. */
public record ServerMetricQueryResponse(
String metric,
String statistic,
String aggregation,
String mode,
int stepSeconds,
List<ServerMetricSeries> series
) {
}

@@ -0,0 +1,14 @@
package com.cameleer.server.core.storage.model;
import java.util.List;
import java.util.Map;
/**
* One series of the server-metrics query response, identified by its
* {@link #tags} group (empty map when the query had no {@code groupByTags}).
*/
public record ServerMetricSeries(
Map<String, String> tags,
List<ServerMetricPoint> points
) {
}

@@ -66,24 +66,126 @@ On restart the `server_instance_id` rotates, so a simple `LAG()` partitioned by
## How to query
### Via the admin ClickHouse endpoint Use the REST API — `/api/v1/admin/server-metrics/**`. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard `/api/v1/admin/**` RBAC gate).
### `GET /catalog`
Enumerate every `metric_name` observed in a window, with its `metric_type`, the set of statistics emitted, and the union of tag keys.
``` ```
POST /api/v1/admin/clickhouse/query GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer <admin-jwt>
Content-Type: text/plain
SELECT metric_name, statistic, count()
FROM server_metrics
WHERE collected_at >= now() - INTERVAL 1 HOUR
GROUP BY 1, 2 ORDER BY 1, 2
``` ```
Requires `infrastructureendpoints=true` and the `ADMIN` role. For a SaaS control plane you will likely want a dedicated read-only CH user scoped to this table — the `/api/v1/admin/clickhouse/query` path is a human-facing admin tool, not a programmatic API. ```json
[
{
"metricName": "cameleer.agents.connected",
"metricType": "gauge",
"statistics": ["value"],
"tagKeys": ["state"]
},
{
"metricName": "cameleer.ingestion.drops",
"metricType": "counter",
"statistics": ["count"],
"tagKeys": ["reason"]
},
...
]
```
### Direct JDBC (recommended for the dashboard) `from`/`to` are optional; default is the last 1 h.
Read directly from ClickHouse (read-only user, `GRANT SELECT ON cameleer.server_metrics TO dashboard_ro`). All queries must filter by `tenant_id`. ### `GET /instances`
Enumerate the `server_instance_id` values that wrote at least one sample in the window, with `firstSeen` / `lastSeen`. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.
```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```
```json
[
{ "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
{ "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
]
```
### `POST /query` — generic time-series
The workhorse. One endpoint covers every panel in the dashboard.
```
POST /api/v1/admin/server-metrics/query
Authorization: Bearer <admin-jwt>
Content-Type: application/json
```
Request body:
```json
{
"metric": "cameleer.ingestion.drops",
"statistic": "count",
"from": "2026-04-22T00:00:00Z",
"to": "2026-04-23T00:00:00Z",
"stepSeconds": 60,
"groupByTags": ["reason"],
"filterTags": { },
"aggregation": "sum",
"mode": "delta",
"serverInstanceIds": null
}
```
Response:
```json
{
"metric": "cameleer.ingestion.drops",
"statistic": "count",
"aggregation": "sum",
"mode": "delta",
"stepSeconds": 60,
"series": [
{
"tags": { "reason": "buffer_full" },
"points": [
{ "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
{ "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
{ "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
]
}
]
}
```
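The request above could be issued from a JVM-based control plane roughly as follows. This is a minimal sketch using the JDK's `java.net.http` API; the class name, the helper names, the 10-second timeout, and the base URL passed in by the caller are illustrative assumptions, not part of the server.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper: builds the POST /query request without sending it.
final class ServerMetricsQueryClientSketch {

    /** JSON body for the "drops per minute" query shown above. */
    static String dropsPerMinuteBody(String fromIso, String toIso) {
        return """
                {
                  "metric": "cameleer.ingestion.drops",
                  "statistic": "count",
                  "from": "%s",
                  "to": "%s",
                  "stepSeconds": 60,
                  "groupByTags": ["reason"],
                  "mode": "delta"
                }
                """.formatted(fromIso, toIso);
    }

    /** baseUrl and adminJwt are supplied by the caller; both are assumptions of this sketch. */
    static HttpRequest buildQueryRequest(String baseUrl, String adminJwt, String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/v1/admin/server-metrics/query"))
                .timeout(Duration.ofSeconds(10)) // arbitrary client-side timeout
                .header("Authorization", "Bearer " + adminJwt)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }
}
```

Sending it is then a matter of `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and parsing the `series` array from the response body.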
#### Request field reference
| Field | Type | Required | Description |
|---|---|---|---|
| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. |
| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time \| total) / sum(count)` per bucket. |
| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from ≤ 31 days`. |
| `stepSeconds` | int | no | Bucket size in seconds. Must be within [10, 3600]; values outside that range are rejected with 400. Default 60. |
| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. |
| `filterTags` | map<string,string> | no | Narrow to samples whose tag map contains every entry. Keys must match the tag-key regex; values are bound as query parameters, so they cannot inject SQL. |
| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to `sum` of per-instance deltas). |
| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |
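The `mode=delta` behaviour described in the table can be pictured with a small in-memory sketch. This is not the server's ClickHouse implementation; the `Sample` record and method names are hypothetical, and the choice to emit `0` for the first bucket of each instance follows the expectation stated in the integration test in this change.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: per-instance positive-clipped deltas, summed across instances.
final class CounterDeltaSketch {

    /** One cumulative counter sample: writer instance, bucket start, cumulative value. */
    record Sample(String serverInstanceId, long bucketEpochSecond, double cumulative) {}

    static TreeMap<Long, Double> deltaPerBucket(List<Sample> samples) {
        // Partition by server_instance_id; keep the highest cumulative value per bucket.
        Map<String, TreeMap<Long, Double>> byInstance = new TreeMap<>();
        for (Sample s : samples) {
            byInstance.computeIfAbsent(s.serverInstanceId(), k -> new TreeMap<>())
                    .merge(s.bucketEpochSecond(), s.cumulative(), Double::max);
        }
        TreeMap<Long, Double> out = new TreeMap<>();
        for (TreeMap<Long, Double> series : byInstance.values()) {
            Double previous = null;
            for (Map.Entry<Long, Double> e : series.entrySet()) {
                // First bucket of an instance has no predecessor, so it contributes 0;
                // negative differences are clipped to 0.
                double delta = previous == null ? 0.0 : Math.max(0.0, e.getValue() - previous);
                out.merge(e.getKey(), delta, Double::sum); // aggregate across instances
                previous = e.getValue();
            }
        }
        return out;
    }
}
```

Because the differences are taken inside each instance's own series, a restart (new `server_instance_id`, counter back at zero) never produces a negative point.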
#### Validation errors
Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers:
- unsafe characters in identifiers
- `from ≥ to` or range > 31 days
- `stepSeconds` outside [10, 3600]
- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)
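A dashboard backend may want to pre-check requests before sending them, so that obvious mistakes never reach the server. The sketch below mirrors the documented rules (identifier regex, `from < to`, 31-day cap, step bounds); the class and method names are hypothetical and the server's own validation remains authoritative. The 500-series cap cannot be checked client-side and is omitted.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.regex.Pattern;

// Client-side pre-check mirroring the documented rules; not the server's code.
final class QueryValidationSketch {

    private static final Pattern IDENTIFIER = Pattern.compile("^[a-zA-Z0-9._]+$");
    private static final Duration MAX_RANGE = Duration.ofDays(31);

    static void validate(String metric, Instant from, Instant to, int stepSeconds) {
        if (metric == null || !IDENTIFIER.matcher(metric).matches()) {
            throw new IllegalArgumentException("unsafe characters in metric name");
        }
        if (from == null || to == null || !from.isBefore(to)) {
            throw new IllegalArgumentException("from must be before to");
        }
        if (Duration.between(from, to).compareTo(MAX_RANGE) > 0) {
            throw new IllegalArgumentException("range exceeds 31 days");
        }
        if (stepSeconds < 10 || stepSeconds > 3600) {
            throw new IllegalArgumentException("stepSeconds must be in [10, 3600]");
        }
    }

    /** Convenience wrapper: true when validate() would not throw. */
    static boolean isValid(String metric, Instant from, Instant to, int stepSeconds) {
        try {
            validate(metric, from, to, stepSeconds);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```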
### Direct ClickHouse (fallback)
If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for `/api/v1/admin/clickhouse/query` (`infrastructureendpoints=true`, ADMIN) or a dedicated read-only CH user scoped to `server_metrics`. All direct queries must filter by `tenant_id`.
---
@@ -258,89 +360,150 @@ When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds:
## Suggested dashboard panels
The shortlist below gives you a working health dashboard with ~12 panels. All queries assume `tenant_id` is a dashboard variable. Below are 17 panels, each expressed as a single `POST /api/v1/admin/server-metrics/query` body. Tenant is implicit in the JWT — the server filters by tenant server-side. `{from}` and `{to}` are dashboard variables.
### Row: server health (top of dashboard)
1. **Agents by state** — stacked area. 1. **Agents by state** — stacked area.
```sql ```json
SELECT toStartOfMinute(collected_at) AS t, tags['state'] AS state, avg(metric_value) AS count { "metric": "cameleer.agents.connected", "statistic": "value",
FROM server_metrics "from": "{from}", "to": "{to}", "stepSeconds": 60,
WHERE tenant_id = {tenant} AND metric_name = 'cameleer.agents.connected' "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
AND collected_at >= {from} AND collected_at < {to}
GROUP BY t, state ORDER BY t;
``` ```
2. **Ingestion buffer depth** — line chart by `type`. Use `cameleer.ingestion.buffer.size` same shape as above. 2. **Ingestion buffer depth by type** — line chart.
```json
3. **Ingestion drops per minute** — bar chart (per-minute delta). { "metric": "cameleer.ingestion.buffer.size", "statistic": "value",
```sql "from": "{from}", "to": "{to}", "stepSeconds": 60,
WITH sorted AS ( "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
SELECT toStartOfMinute(collected_at) AS minute,
tags['reason'] AS reason,
server_instance_id,
max(metric_value) AS cumulative
FROM server_metrics
WHERE tenant_id = {tenant} AND metric_name = 'cameleer.ingestion.drops'
AND statistic = 'count' AND collected_at >= {from} AND collected_at < {to}
GROUP BY minute, reason, server_instance_id
)
SELECT minute, reason,
cumulative - lagInFrame(cumulative, 1, cumulative) OVER (
PARTITION BY reason, server_instance_id ORDER BY minute
) AS drops_per_minute
FROM sorted ORDER BY minute;
``` ```
4. **Auth failures per minute** — same shape as drops, split by `reason`. 3. **Ingestion drops per minute** — bar chart.
```json
{ "metric": "cameleer.ingestion.drops", "statistic": "count",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["reason"], "mode": "delta" }
```
4. **Auth failures per minute** — same shape as drops, grouped by `reason`.
```json
{ "metric": "cameleer.auth.failures", "statistic": "count",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["reason"], "mode": "delta" }
```
### Row: JVM
5. **Heap used vs committed vs max** — area chart. Filter `metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')` with `tags['area'] = 'heap'`, sum across pool `id`s. 5. **Heap used vs committed vs max** — area chart (three overlay queries).
```json
{ "metric": "jvm.memory.used", "statistic": "value",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
```
Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`.
6. **CPU %** — line. `process.cpu.usage` and `system.cpu.usage`. 6. **CPU %** — line.
```json
{ "metric": "process.cpu.usage", "statistic": "value",
"from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
```
Overlay with `"metric": "system.cpu.usage"`.
7. **GC pause p99 + max** — `jvm.gc.pause` with statistic `max`, grouped by `tags['cause']`. 7. **GC pause — max per cause**.
```json
{ "metric": "jvm.gc.pause", "statistic": "max",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
```
8. **Thread count** — `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`. 8. **Thread count** — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak` each with `statistic=value, aggregation=avg, mode=raw`.
### Row: HTTP + DB
9. **HTTP p99 by URI** — use `http.server.requests` with `statistic='max'` as a rough p99 proxy, or `total_time/count` for mean. Group by `tags['uri']`. Filter `tags['outcome'] = 'SUCCESS'`. 9. **HTTP mean latency by URI** — top-N URIs.
```json
{ "metric": "http.server.requests", "statistic": "mean",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" },
"aggregation": "avg", "mode": "raw" }
```
For a rough p99 proxy, repeat with `"statistic": "max"`. Note that `max` is an upper bound on the percentile, not a true p99.
10. **HTTP error rate** — count where `tags['status']` starts with `5`, divided by total. 10. **HTTP error rate** — two queries, divide client-side: total requests and 5xx requests.
```json
{ "metric": "http.server.requests", "statistic": "count",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"mode": "delta", "aggregation": "sum" }
```
Then, for the 5xx series, send the same query with `"filterTags": { "outcome": "SERVER_ERROR" }` added, and divide the 5xx series by the total series bucket by bucket.
11. **HikariCP pool saturation** — overlay `hikaricp.connections.active` and `hikaricp.connections.pending`. If `pending > 0` sustained, the pool is too small. 11. **HikariCP pool saturation** — overlay two queries.
```json
{ "metric": "hikaricp.connections.active", "statistic": "value",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
```
Overlay with `"metric": "hikaricp.connections.pending"`.
12. **Hikari acquire timeouts per minute** — delta of `hikaricp.connections.timeout`. Any non-zero rate is a red flag. 12. **Hikari acquire timeouts per minute**.
```json
{ "metric": "hikaricp.connections.timeout", "statistic": "count",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["pool"], "mode": "delta" }
```
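The client-side division for panel 10 can be sketched as below, assuming each response series has already been flattened into a map from the point's `t` timestamp to its `v` value. The class name and the map representation are illustrative; treating an empty bucket (zero total requests) as a rate of `0.0` is a design choice of this sketch, not something the API prescribes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper: divides the 5xx delta series by the total delta series.
final class ErrorRateSketch {

    static Map<String, Double> errorRate(Map<String, Double> totalByBucket,
                                         Map<String, Double> errorsByBucket) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : totalByBucket.entrySet()) {
            double total = e.getValue();
            // A bucket with no 5xx point means no server errors in that bucket.
            double errors = errorsByBucket.getOrDefault(e.getKey(), 0.0);
            // Avoid dividing by zero when there was no traffic at all.
            out.put(e.getKey(), total > 0.0 ? errors / total : 0.0);
        }
        return out;
    }
}
```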
### Row: alerting (collapsible)
13. **Alerting instances by state** — `alerting_instances_total` stacked by `tags['state']`. 13. **Alerting instances by state** — stacked.
```json
{ "metric": "alerting_instances_total", "statistic": "value",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
```
14. **Eval errors per minute by kind** — delta of `alerting_eval_errors_total` by `tags['kind']`. 14. **Eval errors per minute by kind**.
```json
{ "metric": "alerting_eval_errors_total", "statistic": "count",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["kind"], "mode": "delta" }
```
15. **Webhook delivery p99** — `alerting_webhook_delivery_duration_seconds` with `statistic='max'`. 15. **Webhook delivery — max per minute**.
```json
{ "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"aggregation": "max", "mode": "raw" }
```
### Row: deployments (runtime-enabled only)
16. **Deploy outcomes last 24 h** — counter delta of `cameleer.deployments.outcome` grouped by `tags['status']`. 16. **Deploy outcomes per hour**.
```json
{ "metric": "cameleer.deployments.outcome", "statistic": "count",
"from": "{from}", "to": "{to}", "stepSeconds": 3600,
"groupByTags": ["status"], "mode": "delta" }
```
17. **Deploy duration p99** — `cameleer.deployments.duration` with `statistic='max'` (or `total_time/count` for mean). 17. **Deploy duration mean**.
```json
{ "metric": "cameleer.deployments.duration", "statistic": "mean",
"from": "{from}", "to": "{to}", "stepSeconds": 300,
"aggregation": "avg", "mode": "raw" }
```
For a rough p99 proxy, repeat with `"statistic": "max"`. Note that `max` is an upper bound on the percentile, not a true p99.
---
## Notes for the dashboard implementer
- **Always filter by `tenant_id`.** It's the first column in the sort key; queries that skip it scan the entire table. - **Use the REST API.** The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- **Prefer predicate pushdown on `metric_name` + `statistic`.** Both are `LowCardinality`, so `metric_name = 'x' AND statistic = 'count'` is cheap. - **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- **Treat `server_instance_id` as a natural partition for counter math.** Never compute deltas across it you'll get negative numbers on restart. - **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. Tests may write `total`. When in doubt, accept either. - **The dashboard is read-only.** There's no write path — only the server writes into `server_metrics`.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see explosion here. Monitor `count(DISTINCT concat(metric_name, toString(tags)))` and alert if it spikes.
- **The dashboard should be read-only.** No one writes into `server_metrics` except the server itself — there's no API to push or delete rows.
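
To make the `total_time` vs `total` note concrete: the derived `statistic=mean` is documented as `sum(total_time | total) / sum(count)` per bucket. A minimal sketch of that arithmetic, accepting whichever cumulative-duration field is present, might look like this. The `TimerBucket` record is hypothetical, and returning `0.0` for a bucket with no observations is an assumption of the sketch.

```java
// Illustrative arithmetic for the derived timer mean; not the server's query code.
final class TimerMeanSketch {

    /** One bucket's timer sub-measurements; either duration field may be null. */
    record TimerBucket(Double totalTime, Double total, double count) {}

    static double mean(TimerBucket b) {
        if (b.count() <= 0.0) {
            return 0.0; // assumption: no observations yields 0 rather than NaN
        }
        // Prefer total_time (Prometheus-style naming), fall back to total.
        double duration = b.totalTime() != null
                ? b.totalTime()
                : (b.total() != null ? b.total() : 0.0);
        return duration / b.count();
    }
}
```

With the values used in the integration test, a bucket with a total of 30 and a count of 2 yields 15.0, and a bucket with a total of 100 and a count of 4 yields 25.0.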
---
## Changelog
- 2026-04-23 — initial write. Write-only in v1 (no REST endpoint or admin page). Reach out to the server team before building a write-back path; we'd rather cut a proper API than have the dashboard hit ClickHouse directly forever. - 2026-04-23 — initial write. Write-only backend.
- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.