diff --git a/.claude/rules/app-classes.md b/.claude/rules/app-classes.md
index f10bccbc..971643fb 100644
--- a/.claude/rules/app-classes.md
+++ b/.claude/rules/app-classes.md
@@ -109,6 +109,7 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale
 - `UsageAnalyticsController` — GET `/api/v1/admin/usage` (ClickHouse `usage_events`).
 - `ClickHouseAdminController` — GET `/api/v1/admin/clickhouse/**` (conditional on `infrastructureendpoints` flag).
 - `DatabaseAdminController` — GET `/api/v1/admin/database/**` (conditional on `infrastructureendpoints` flag).
+- `ServerMetricsAdminController` — `/api/v1/admin/server-metrics/**`. GET `/catalog`, GET `/instances`, POST `/query`. Generic read API over the `server_metrics` ClickHouse table so SaaS dashboards don't need direct CH access. Delegates to `ServerMetricsQueryStore` (impl `ClickHouseServerMetricsQueryStore`). Validation: metric/tag regex `^[a-zA-Z0-9._]+$`, statistic regex `^[a-z_]+$`, `to - from ≤ 31 days`, stepSeconds ∈ [10, 3600], response capped at 500 series. `IllegalArgumentException` → 400. `/query` supports `raw` + `delta` modes (delta does per-`server_instance_id` positive-clipped differences, then aggregates across instances). Derived `statistic=mean` for timers computes `sum(total|total_time)/sum(count)` per bucket.
 
 ### Other (flat)
@@ -147,7 +148,8 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale
 - `ClickHouseDiagramStore`, `ClickHouseAgentEventRepository`
 - `ClickHouseUsageTracker` — usage_events for billing
 - `ClickHouseRouteCatalogStore` — persistent route catalog with first_seen cache, warm-loaded on startup
-- `ClickHouseServerMetricsStore` — periodic dumps of the server's own Micrometer registry into the `server_metrics` table. Tenant-stamped (bound at the scheduler, not the bean); no `environment` column (server straddles envs). Batch-insert via `JdbcTemplate.batchUpdate` with `Map(String, String)` tag binding. Written by `ServerMetricsSnapshotScheduler`, query via `/api/v1/admin/clickhouse/query` (no dedicated endpoint yet).
+- `ClickHouseServerMetricsStore` — periodic dumps of the server's own Micrometer registry into the `server_metrics` table. Tenant-stamped (bound at the scheduler, not the bean); no `environment` column (server straddles envs). Batch-insert via `JdbcTemplate.batchUpdate` with `Map(String, String)` tag binding. Written by `ServerMetricsSnapshotScheduler`.
+- `ClickHouseServerMetricsQueryStore` — read side of `server_metrics` for dashboards. Implements `ServerMetricsQueryStore`. `catalog(from,to)` returns name+type+statistics+tagKeys, `listInstances(from,to)` returns server_instance_ids with first/last seen, `query(request)` builds bucketed time-series with `raw` or `delta` mode and supports a derived `mean` statistic for timers. All identifier inputs regex-validated; tenant_id always bound; max range 31 days; series count capped at 500. Exposed via `ServerMetricsAdminController`.
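The delta mode described above (per-`server_instance_id` positive-clipped differences, summed across instances) can be sketched outside SQL to make the counter-reset behavior concrete. A minimal sketch, not the actual store code — the class name `DeltaSketch`, method `sumDeltas`, and the sample values are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DeltaSketch {

    // Positive-clipped per-instance deltas, then summed across instances,
    // mirroring the delta-mode query shape: the first sample of an instance
    // contributes 0 (no previous row), and a counter reset (value drops)
    // is clipped to 0 instead of producing a negative delta.
    static double sumDeltas(Map<String, double[]> cumulativeByInstance) {
        double total = 0;
        for (double[] samples : cumulativeByInstance.values()) {
            Double prev = null;
            for (double v : samples) {
                double delta = (prev == null) ? 0 : Math.max(0, v - prev);
                total += delta;
                prev = v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, double[]> byInstance = new LinkedHashMap<>();
        byInstance.put("srv-A", new double[]{0, 5, 10}); // +5 per bucket
        byInstance.put("srv-B", new double[]{0, 2});     // restart: counter starts over
        System.out.println(sumDeltas(byInstance));       // 12.0
    }
}
```

This matches the seeded scenario in the integration test below: srv-A contributes 0+5+5, srv-B contributes 0+2, total 12.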
 
 ## search/ — ClickHouse search and log stores
diff --git a/cameleer-server-app/src/main/java/com/cameleer/server/app/config/StorageBeanConfig.java b/cameleer-server-app/src/main/java/com/cameleer/server/app/config/StorageBeanConfig.java
index e8996f49..6df6a51b 100644
--- a/cameleer-server-app/src/main/java/com/cameleer/server/app/config/StorageBeanConfig.java
+++ b/cameleer-server-app/src/main/java/com/cameleer/server/app/config/StorageBeanConfig.java
@@ -9,6 +9,7 @@ import com.cameleer.server.app.storage.ClickHouseRouteCatalogStore;
 import com.cameleer.server.core.storage.RouteCatalogStore;
 import com.cameleer.server.app.storage.ClickHouseMetricsQueryStore;
 import com.cameleer.server.app.storage.ClickHouseMetricsStore;
+import com.cameleer.server.app.storage.ClickHouseServerMetricsQueryStore;
 import com.cameleer.server.app.storage.ClickHouseServerMetricsStore;
 import com.cameleer.server.app.storage.ClickHouseStatsStore;
 import com.cameleer.server.core.admin.AuditRepository;
@@ -74,6 +75,13 @@ public class StorageBeanConfig {
         return new ClickHouseServerMetricsStore(clickHouseJdbc);
     }
 
+    @Bean
+    public ServerMetricsQueryStore clickHouseServerMetricsQueryStore(
+            TenantProperties tenantProperties,
+            @Qualifier("clickHouseJdbcTemplate") JdbcTemplate clickHouseJdbc) {
+        return new ClickHouseServerMetricsQueryStore(tenantProperties.getId(), clickHouseJdbc);
+    }
+
     // ── Execution Store ──────────────────────────────────────────────────
 
     @Bean
diff --git a/cameleer-server-app/src/main/java/com/cameleer/server/app/controller/ServerMetricsAdminController.java b/cameleer-server-app/src/main/java/com/cameleer/server/app/controller/ServerMetricsAdminController.java
new file mode 100644
index 00000000..676dbd8c
--- /dev/null
+++ b/cameleer-server-app/src/main/java/com/cameleer/server/app/controller/ServerMetricsAdminController.java
@@ -0,0 +1,135 @@
+package com.cameleer.server.app.controller;
+
+import com.cameleer.server.core.storage.ServerMetricsQueryStore;
+import com.cameleer.server.core.storage.model.ServerInstanceInfo;
+import com.cameleer.server.core.storage.model.ServerMetricCatalogEntry;
+import com.cameleer.server.core.storage.model.ServerMetricQueryRequest;
+import com.cameleer.server.core.storage.model.ServerMetricQueryResponse;
+import io.swagger.v3.oas.annotations.Operation;
+import io.swagger.v3.oas.annotations.tags.Tag;
+import org.springframework.http.ResponseEntity;
+import org.springframework.web.bind.annotation.ExceptionHandler;
+import org.springframework.web.bind.annotation.GetMapping;
+import org.springframework.web.bind.annotation.PostMapping;
+import org.springframework.web.bind.annotation.RequestBody;
+import org.springframework.web.bind.annotation.RequestMapping;
+import org.springframework.web.bind.annotation.RequestParam;
+import org.springframework.web.bind.annotation.RestController;
+
+import java.time.Instant;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Generic read API over the ClickHouse {@code server_metrics} table. Lets
+ * SaaS control planes build server-health dashboards without requiring direct
+ * ClickHouse access.
+ *
+ * <p>Three endpoints cover all 17 panels in {@code docs/server-self-metrics.md}:
+ * <ul>
+ *   <li>{@code GET /catalog} — metric names, types, statistics, and tag keys</li>
+ *   <li>{@code GET /instances} — {@code server_instance_id}s with first/last seen</li>
+ *   <li>{@code POST /query} — bucketed time-series in {@code raw} or {@code delta} mode</li>
+ * </ul>
+ *
+ * <p>Protected by the {@code /api/v1/admin/**} catch-all in {@code SecurityConfig} — requires ADMIN role.
+ */
+@RestController
+@RequestMapping("/api/v1/admin/server-metrics")
+@Tag(name = "Server Self-Metrics",
+        description = "Read API over the server's own Micrometer registry snapshots for dashboards")
+public class ServerMetricsAdminController {
+
+    /** Default lookback window for catalog/instances when from/to are omitted. */
+    private static final long DEFAULT_LOOKBACK_SECONDS = 3_600L;
+
+    private final ServerMetricsQueryStore store;
+
+    public ServerMetricsAdminController(ServerMetricsQueryStore store) {
+        this.store = store;
+    }
+
+    @GetMapping("/catalog")
+    @Operation(summary = "List metric names observed in the window",
+            description = "For each metric_name, returns metric_type, the set of statistics emitted, and the union of tag keys.")
+    public ResponseEntity<List<ServerMetricCatalogEntry>> catalog(
+            @RequestParam(required = false) String from,
+            @RequestParam(required = false) String to) {
+        Instant[] window = resolveWindow(from, to);
+        return ResponseEntity.ok(store.catalog(window[0], window[1]));
+    }
+
+    @GetMapping("/instances")
+    @Operation(summary = "List server_instance_id values observed in the window",
+            description = "Returns first/last seen timestamps — use to partition counter-delta computations.")
+    public ResponseEntity<List<ServerInstanceInfo>> instances(
+            @RequestParam(required = false) String from,
+            @RequestParam(required = false) String to) {
+        Instant[] window = resolveWindow(from, to);
+        return ResponseEntity.ok(store.listInstances(window[0], window[1]));
+    }
+
+    @PostMapping("/query")
+    @Operation(summary = "Generic time-series query",
+            description = "Returns bucketed series for a single metric_name. Supports aggregation (avg/sum/max/min/latest), group-by-tag, filter-by-tag, counter delta mode, and a derived 'mean' statistic for timers.")
+    public ResponseEntity<ServerMetricQueryResponse> query(@RequestBody QueryBody body) {
+        ServerMetricQueryRequest request = new ServerMetricQueryRequest(
+                body.metric(),
+                body.statistic(),
+                parseInstant(body.from(), "from"),
+                parseInstant(body.to(), "to"),
+                body.stepSeconds(),
+                body.groupByTags(),
+                body.filterTags(),
+                body.aggregation(),
+                body.mode(),
+                body.serverInstanceIds());
+        return ResponseEntity.ok(store.query(request));
+    }
+
+    @ExceptionHandler(IllegalArgumentException.class)
+    public ResponseEntity<Map<String, String>> handleBadRequest(IllegalArgumentException e) {
+        return ResponseEntity.badRequest().body(Map.of("error", e.getMessage()));
+    }
+
+    private static Instant[] resolveWindow(String from, String to) {
+        Instant toI = to != null ? parseInstant(to, "to") : Instant.now();
+        Instant fromI = from != null
+                ? parseInstant(from, "from")
+                : toI.minusSeconds(DEFAULT_LOOKBACK_SECONDS);
+        if (!fromI.isBefore(toI)) {
+            throw new IllegalArgumentException("from must be strictly before to");
+        }
+        return new Instant[]{fromI, toI};
+    }
+
+    private static Instant parseInstant(String raw, String field) {
+        if (raw == null || raw.isBlank()) {
+            throw new IllegalArgumentException(field + " is required");
+        }
+        try {
+            return Instant.parse(raw);
+        } catch (Exception e) {
+            throw new IllegalArgumentException(
+                    field + " must be an ISO-8601 instant (e.g. 2026-04-23T10:00:00Z)");
+        }
+    }
+
+    /**
+     * Request body for {@link #query(QueryBody)}. Uses ISO-8601 strings on
+     * the wire so the OpenAPI schema stays language-neutral.
+     */
+    public record QueryBody(
+            String metric,
+            String statistic,
+            String from,
+            String to,
+            Integer stepSeconds,
+            List<String> groupByTags,
+            Map<String, String> filterTags,
+            String aggregation,
+            String mode,
+            List<String> serverInstanceIds
+    ) {
+    }
+}
diff --git a/cameleer-server-app/src/main/java/com/cameleer/server/app/storage/ClickHouseServerMetricsQueryStore.java b/cameleer-server-app/src/main/java/com/cameleer/server/app/storage/ClickHouseServerMetricsQueryStore.java
new file mode 100644
index 00000000..0c5773d2
--- /dev/null
+++ b/cameleer-server-app/src/main/java/com/cameleer/server/app/storage/ClickHouseServerMetricsQueryStore.java
@@ -0,0 +1,408 @@
+package com.cameleer.server.app.storage;
+
+import com.cameleer.server.core.storage.ServerMetricsQueryStore;
+import com.cameleer.server.core.storage.model.ServerInstanceInfo;
+import com.cameleer.server.core.storage.model.ServerMetricCatalogEntry;
+import com.cameleer.server.core.storage.model.ServerMetricPoint;
+import com.cameleer.server.core.storage.model.ServerMetricQueryRequest;
+import com.cameleer.server.core.storage.model.ServerMetricQueryResponse;
+import com.cameleer.server.core.storage.model.ServerMetricSeries;
+import org.springframework.jdbc.core.JdbcTemplate;
+
+import java.sql.Array;
+import java.sql.Timestamp;
+import java.time.Duration;
+import java.time.Instant;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.regex.Pattern;
+
+/**
+ * ClickHouse-backed {@link ServerMetricsQueryStore}.
+ *
+ * <p>Safety rules for every query:
+ * <ul>
+ *   <li>identifier inputs (metric, tag keys, statistic) are regex-validated</li>
+ *   <li>{@code tenant_id} is always bound</li>
+ *   <li>{@code to - from} is capped at {@link #MAX_RANGE}</li>
+ *   <li>at most {@link #MAX_SERIES} series per response</li>
+ * </ul>
+ */
+public class ClickHouseServerMetricsQueryStore implements ServerMetricsQueryStore {
+
+    private static final Pattern SAFE_IDENTIFIER = Pattern.compile("^[a-zA-Z0-9._]+$");
+    private static final Pattern SAFE_STATISTIC = Pattern.compile("^[a-z_]+$");
+
+    private static final Set<String> AGGREGATIONS = Set.of("avg", "sum", "max", "min", "latest");
+    private static final Set<String> MODES = Set.of("raw", "delta");
+
+    /** Maximum {@code to - from} window accepted by the API. */
+    static final Duration MAX_RANGE = Duration.ofDays(31);
+
+    /** Clamp bounds and default for {@code stepSeconds}. */
+    static final int MIN_STEP = 10;
+    static final int MAX_STEP = 3600;
+    static final int DEFAULT_STEP = 60;
+
+    /** Defence against group-by explosion — limit the series count per response. */
+    static final int MAX_SERIES = 500;
+
+    private final String tenantId;
+    private final JdbcTemplate jdbc;
+
+    public ClickHouseServerMetricsQueryStore(String tenantId, JdbcTemplate jdbc) {
+        this.tenantId = tenantId;
+        this.jdbc = jdbc;
+    }
+
+    // ── catalog ─────────────────────────────────────────────────────────
+
+    @Override
+    public List<ServerMetricCatalogEntry> catalog(Instant from, Instant to) {
+        requireRange(from, to);
+        String sql = """
+                SELECT
+                    metric_name,
+                    any(metric_type) AS metric_type,
+                    arraySort(groupUniqArray(statistic)) AS statistics,
+                    arraySort(arrayDistinct(arrayFlatten(groupArray(mapKeys(tags))))) AS tag_keys
+                FROM server_metrics
+                WHERE tenant_id = ?
+                  AND collected_at >= ?
+                  AND collected_at < ?
+                GROUP BY metric_name
+                ORDER BY metric_name
+                """;
+        return jdbc.query(sql, (rs, n) -> new ServerMetricCatalogEntry(
+                rs.getString("metric_name"),
+                rs.getString("metric_type"),
+                arrayToStringList(rs.getArray("statistics")),
+                arrayToStringList(rs.getArray("tag_keys"))
+        ), tenantId, Timestamp.from(from), Timestamp.from(to));
+    }
+
+    // ── instances ───────────────────────────────────────────────────────
+
+    @Override
+    public List<ServerInstanceInfo> listInstances(Instant from, Instant to) {
+        requireRange(from, to);
+        String sql = """
+                SELECT
+                    server_instance_id,
+                    min(collected_at) AS first_seen,
+                    max(collected_at) AS last_seen
+                FROM server_metrics
+                WHERE tenant_id = ?
+                  AND collected_at >= ?
+                  AND collected_at < ?
+                GROUP BY server_instance_id
+                ORDER BY last_seen DESC
+                """;
+        return jdbc.query(sql, (rs, n) -> new ServerInstanceInfo(
+                rs.getString("server_instance_id"),
+                rs.getTimestamp("first_seen").toInstant(),
+                rs.getTimestamp("last_seen").toInstant()
+        ), tenantId, Timestamp.from(from), Timestamp.from(to));
+    }
+
+    // ── query ───────────────────────────────────────────────────────────
+
+    @Override
+    public ServerMetricQueryResponse query(ServerMetricQueryRequest request) {
+        if (request == null) throw new IllegalArgumentException("request is required");
+        String metric = requireSafeIdentifier(request.metric(), "metric");
+        requireRange(request.from(), request.to());
+
+        String aggregation = request.aggregation() != null ? request.aggregation().toLowerCase() : "avg";
+        if (!AGGREGATIONS.contains(aggregation)) {
+            throw new IllegalArgumentException("aggregation must be one of " + AGGREGATIONS);
+        }
+
+        String mode = request.mode() != null ? request.mode().toLowerCase() : "raw";
+        if (!MODES.contains(mode)) {
+            throw new IllegalArgumentException("mode must be one of " + MODES);
+        }
+
+        int step = request.stepSeconds() != null ? request.stepSeconds() : DEFAULT_STEP;
+        if (step < MIN_STEP || step > MAX_STEP) {
+            throw new IllegalArgumentException(
+                    "stepSeconds must be in [" + MIN_STEP + "," + MAX_STEP + "]");
+        }
+
+        String statistic = request.statistic();
+        if (statistic != null && !SAFE_STATISTIC.matcher(statistic).matches()) {
+            throw new IllegalArgumentException("statistic contains unsafe characters");
+        }
+
+        List<String> groupByTags = request.groupByTags() != null
+                ? request.groupByTags() : List.of();
+        for (String t : groupByTags) requireSafeIdentifier(t, "groupByTag");
+
+        Map<String, String> filterTags = request.filterTags() != null
+                ? request.filterTags() : Map.of();
+        for (String t : filterTags.keySet()) requireSafeIdentifier(t, "filterTag key");
+
+        List<String> instanceAllowList = request.serverInstanceIds() != null
+                ? request.serverInstanceIds() : List.of();
+
+        boolean isDelta = "delta".equals(mode);
+        boolean isMean = "mean".equals(statistic);
+
+        String sql = isDelta
+                ? buildDeltaSql(step, groupByTags, filterTags, instanceAllowList, statistic, isMean)
+                : buildRawSql(step, groupByTags, filterTags, instanceAllowList,
+                        statistic, aggregation, isMean);
+
+        List<Object> params = buildParams(groupByTags, metric, statistic, isMean,
+                request.from(), request.to(),
+                filterTags, instanceAllowList);
+
+        List<Row> rows = jdbc.query(sql, (rs, n) -> {
+            int idx = 1;
+            Instant bucket = rs.getTimestamp(idx++).toInstant();
+            List<String> tagValues = new ArrayList<>(groupByTags.size());
+            for (int g = 0; g < groupByTags.size(); g++) {
+                tagValues.add(rs.getString(idx++));
+            }
+            double value = rs.getDouble(idx);
+            return new Row(bucket, tagValues, value);
+        }, params.toArray());
+
+        return assembleSeries(rows, metric, statistic, aggregation, mode, step, groupByTags);
+    }
+
+    // ── SQL builders ────────────────────────────────────────────────────
+
+    /**
+     * Builds a single-pass SQL for raw mode:
+     * <pre>{@code
+     * SELECT bucket, tag0, ..., agg(metric_value) AS value
+     * FROM server_metrics WHERE ...
+     * GROUP BY bucket, tag0, ...
+     * ORDER BY bucket, tag0, ...
+     * }</pre>
+     * For {@code statistic=mean}, replaces the aggregate with
+     * {@code sumIf(value, statistic IN ('total','total_time')) / nullIf(sumIf(value, statistic='count'), 0)}.
+     */
+    private String buildRawSql(int step, List<String> groupByTags,
+                               Map<String, String> filterTags,
+                               List<String> instanceAllowList,
+                               String statistic, String aggregation, boolean isMean) {
+        StringBuilder s = new StringBuilder(512);
+        s.append("SELECT\n  toDateTime64(toStartOfInterval(collected_at, INTERVAL ")
+                .append(step).append(" SECOND), 3) AS bucket");
+        for (int i = 0; i < groupByTags.size(); i++) {
+            s.append(",\n  tags[?] AS tag").append(i);
+        }
+        s.append(",\n  ").append(isMean ? meanExpr() : scalarAggExpr(aggregation))
+                .append(" AS value\nFROM server_metrics\n");
+        appendWhereClause(s, filterTags, instanceAllowList, statistic, isMean);
+        s.append("GROUP BY bucket");
+        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
+        s.append("\nORDER BY bucket");
+        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
+        return s.toString();
+    }
+
+    /**
+     * Builds a three-level SQL for delta mode. Inner fills one
+     * (bucket, instance, tag-group) row via {@code max(metric_value)};
+     * middle computes positive-clipped per-instance differences via a
+     * window function; outer sums across instances.
+     */
+    private String buildDeltaSql(int step, List<String> groupByTags,
+                                 Map<String, String> filterTags,
+                                 List<String> instanceAllowList,
+                                 String statistic, boolean isMean) {
+        StringBuilder s = new StringBuilder(1024);
+        s.append("SELECT bucket");
+        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
+        s.append(", sum(delta) AS value FROM (\n");
+
+        // Middle: per-instance positive-clipped delta using window.
+        s.append("  SELECT bucket");
+        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
+        s.append(", server_instance_id, greatest(0, value - coalesce(any(value) OVER (")
+                .append("PARTITION BY server_instance_id");
+        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
+        s.append(" ORDER BY bucket ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), value)) AS delta FROM (\n");
+
+        // Inner: one representative value per (bucket, instance, tag-group).
+        s.append("    SELECT\n      toDateTime64(toStartOfInterval(collected_at, INTERVAL ")
+                .append(step).append(" SECOND), 3) AS bucket,\n      server_instance_id");
+        for (int i = 0; i < groupByTags.size(); i++) {
+            s.append(",\n      tags[?] AS tag").append(i);
+        }
+        s.append(",\n      ").append(isMean ? meanExpr() : "max(metric_value)")
+                .append(" AS value\n    FROM server_metrics\n");
+        appendWhereClause(s, filterTags, instanceAllowList, statistic, isMean);
+        s.append("    GROUP BY bucket, server_instance_id");
+        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
+        s.append("\n  ) AS bucketed\n) AS deltas\n");
+
+        s.append("GROUP BY bucket");
+        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
+        s.append("\nORDER BY bucket");
+        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
+        return s.toString();
+    }
+
+    /**
+     * WHERE clause shared by both raw and delta SQL shapes. Appended at the
+     * correct indent under either the single {@code FROM server_metrics}
+     * (raw) or the innermost one (delta).
+     */
+    private void appendWhereClause(StringBuilder s, Map<String, String> filterTags,
+                                   List<String> instanceAllowList,
+                                   String statistic, boolean isMean) {
+        s.append("WHERE tenant_id = ?\n")
+                .append("  AND metric_name = ?\n");
+        if (isMean) {
+            s.append("  AND statistic IN ('count', 'total', 'total_time')\n");
+        } else if (statistic != null) {
+            s.append("  AND statistic = ?\n");
+        }
+        s.append("  AND collected_at >= ?\n")
+                .append("  AND collected_at < ?\n");
+        for (int i = 0; i < filterTags.size(); i++) {
+            s.append("  AND tags[?] = ?\n");
+        }
+        if (!instanceAllowList.isEmpty()) {
+            s.append("  AND server_instance_id IN (")
+                    .append("?,".repeat(instanceAllowList.size() - 1)).append("?)\n");
+        }
+    }
+
+    /**
+     * SQL-positional params for both raw and delta queries (same relative
+     * order because the WHERE clause is emitted by {@link #appendWhereClause}
+     * only once, with the {@code tags[?]} select-list placeholders appearing
+     * earlier in the SQL text).
+     */
+    private List<Object> buildParams(List<String> groupByTags, String metric,
+                                     String statistic, boolean isMean,
+                                     Instant from, Instant to,
+                                     Map<String, String> filterTags,
+                                     List<String> instanceAllowList) {
+        List<Object> params = new ArrayList<>();
+        // SELECT-list tags[?] placeholders
+        params.addAll(groupByTags);
+        // WHERE
+        params.add(tenantId);
+        params.add(metric);
+        if (!isMean && statistic != null) params.add(statistic);
+        params.add(Timestamp.from(from));
+        params.add(Timestamp.from(to));
+        for (Map.Entry<String, String> e : filterTags.entrySet()) {
+            params.add(e.getKey());
+            params.add(e.getValue());
+        }
+        params.addAll(instanceAllowList);
+        return params;
+    }
+
+    private static String scalarAggExpr(String aggregation) {
+        return switch (aggregation) {
+            case "avg" -> "avg(metric_value)";
+            case "sum" -> "sum(metric_value)";
+            case "max" -> "max(metric_value)";
+            case "min" -> "min(metric_value)";
+            case "latest" -> "argMax(metric_value, collected_at)";
+            default -> throw new IllegalStateException("unreachable: " + aggregation);
+        };
+    }
+
+    private static String meanExpr() {
+        return "sumIf(metric_value, statistic IN ('total', 'total_time'))"
+                + " / nullIf(sumIf(metric_value, statistic = 'count'), 0)";
+    }
+
+    // ── response assembly ───────────────────────────────────────────────
+
+    private ServerMetricQueryResponse assembleSeries(
+            List<Row> rows, String metric, String statistic,
+            String aggregation, String mode, int step, List<String> groupByTags) {
+
+        Map<List<String>, List<ServerMetricPoint>> bySignature = new LinkedHashMap<>();
+        for (Row r : rows) {
+            if (Double.isNaN(r.value) || Double.isInfinite(r.value)) continue;
+            bySignature.computeIfAbsent(r.tagValues, k -> new ArrayList<>())
+                    .add(new ServerMetricPoint(r.bucket, r.value));
+        }
+
+        if (bySignature.size() > MAX_SERIES) {
+            throw new IllegalArgumentException(
+                    "query produced " + bySignature.size()
+                            + " series; reduce groupByTags or tighten filterTags (max "
+                            + MAX_SERIES + ")");
+        }
+
+        List<ServerMetricSeries> series = new ArrayList<>(bySignature.size());
+        for (Map.Entry<List<String>, List<ServerMetricPoint>> e : bySignature.entrySet()) {
+            Map<String, String> tags = new LinkedHashMap<>();
+            for (int i = 0; i < groupByTags.size(); i++) {
+                tags.put(groupByTags.get(i), e.getKey().get(i));
+            }
+            series.add(new ServerMetricSeries(Collections.unmodifiableMap(tags), e.getValue()));
+        }
+
+        return new ServerMetricQueryResponse(metric,
+                statistic != null ? statistic : "value",
+                aggregation, mode, step, series);
+    }
+
+    // ── helpers ─────────────────────────────────────────────────────────
+
+    private static void requireRange(Instant from, Instant to) {
+        if (from == null || to == null) {
+            throw new IllegalArgumentException("from and to are required");
+        }
+        if (!from.isBefore(to)) {
+            throw new IllegalArgumentException("from must be strictly before to");
+        }
+        if (Duration.between(from, to).compareTo(MAX_RANGE) > 0) {
+            throw new IllegalArgumentException(
+                    "time range exceeds maximum of " + MAX_RANGE.toDays() + " days");
+        }
+    }
+
+    private static String requireSafeIdentifier(String value, String field) {
+        if (value == null || value.isBlank()) {
+            throw new IllegalArgumentException(field + " is required");
+        }
+        if (!SAFE_IDENTIFIER.matcher(value).matches()) {
+            throw new IllegalArgumentException(
+                    field + " contains unsafe characters (allowed: [a-zA-Z0-9._])");
+        }
+        return value;
+    }
+
+    private static List<String> arrayToStringList(Array array) {
+        if (array == null) return List.of();
+        try {
+            Object[] values = (Object[]) array.getArray();
+            Set<String> sorted = new TreeSet<>();
+            for (Object v : values) {
+                if (v != null) sorted.add(v.toString());
+            }
+            return List.copyOf(sorted);
+        } catch (Exception e) {
+            return List.of();
+        } finally {
+            try { array.free(); } catch (Exception ignore) { }
+        }
+    }
+
+    private record Row(Instant bucket, List<String> tagValues, double value) {
+    }
+}
diff --git a/cameleer-server-app/src/test/java/com/cameleer/server/app/controller/ServerMetricsAdminControllerIT.java b/cameleer-server-app/src/test/java/com/cameleer/server/app/controller/ServerMetricsAdminControllerIT.java
new file mode 100644
index 00000000..8a80b103
--- /dev/null
+++ b/cameleer-server-app/src/test/java/com/cameleer/server/app/controller/ServerMetricsAdminControllerIT.java
@@ -0,0 +1,314 @@
+package com.cameleer.server.app.controller;
+
+import com.cameleer.server.app.AbstractPostgresIT; +import com.cameleer.server.app.TestSecurityHelper; +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.springframework.beans.factory.annotation.Autowired; +import org.springframework.boot.test.web.client.TestRestTemplate; +import org.springframework.http.HttpEntity; +import org.springframework.http.HttpHeaders; +import org.springframework.http.HttpMethod; +import org.springframework.http.HttpStatus; +import org.springframework.http.ResponseEntity; + +import java.sql.Timestamp; +import java.time.Instant; +import java.util.Map; + +import static org.assertj.core.api.Assertions.assertThat; + +class ServerMetricsAdminControllerIT extends AbstractPostgresIT { + + @Autowired + private TestRestTemplate restTemplate; + + @Autowired + private TestSecurityHelper securityHelper; + + private final ObjectMapper mapper = new ObjectMapper(); + + private HttpHeaders adminJson; + private HttpHeaders adminGet; + private HttpHeaders viewerGet; + + @BeforeEach + void seedAndAuth() { + adminJson = securityHelper.adminHeaders(); + adminGet = securityHelper.authHeadersNoBody(securityHelper.adminToken()); + viewerGet = securityHelper.authHeadersNoBody(securityHelper.viewerToken()); + + // Fresh rows for each test. The Spring-context ClickHouse JdbcTemplate + // lives in a different bean; reach for it here by executing through + // the same JdbcTemplate used by the store via the ClickHouseConfig bean. + org.springframework.jdbc.core.JdbcTemplate ch = clickhouseJdbc(); + ch.execute("TRUNCATE TABLE server_metrics"); + + Instant t0 = Instant.parse("2026-04-23T10:00:00Z"); + // Gauge: cameleer.agents.connected, two states, two buckets. 
+ insert(ch, "default", t0, "srv-A", "cameleer.agents.connected", "gauge", "value", 3.0, + Map.of("state", "live")); + insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.agents.connected", "gauge", "value", 4.0, + Map.of("state", "live")); + insert(ch, "default", t0, "srv-A", "cameleer.agents.connected", "gauge", "value", 1.0, + Map.of("state", "stale")); + insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.agents.connected", "gauge", "value", 0.0, + Map.of("state", "stale")); + + // Counter: cumulative drops, +5 per minute on srv-A. + insert(ch, "default", t0, "srv-A", "cameleer.ingestion.drops", "counter", "count", 0.0, Map.of("reason", "buffer_full")); + insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.ingestion.drops", "counter", "count", 5.0, Map.of("reason", "buffer_full")); + insert(ch, "default", t0.plusSeconds(120), "srv-A", "cameleer.ingestion.drops", "counter", "count", 10.0, Map.of("reason", "buffer_full")); + // Simulated restart to srv-B: counter resets to 0, then climbs to 2. + insert(ch, "default", t0.plusSeconds(180), "srv-B", "cameleer.ingestion.drops", "counter", "count", 0.0, Map.of("reason", "buffer_full")); + insert(ch, "default", t0.plusSeconds(240), "srv-B", "cameleer.ingestion.drops", "counter", "count", 2.0, Map.of("reason", "buffer_full")); + + // Timer mean inputs: two buckets, 2 samples each (count=2, total_time=30). 
+ insert(ch, "default", t0, "srv-A", "cameleer.ingestion.flush.duration", "timer", "count", 2.0, Map.of("type", "execution")); + insert(ch, "default", t0, "srv-A", "cameleer.ingestion.flush.duration", "timer", "total_time", 30.0, Map.of("type", "execution")); + insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.ingestion.flush.duration", "timer", "count", 4.0, Map.of("type", "execution")); + insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.ingestion.flush.duration", "timer", "total_time", 100.0, Map.of("type", "execution")); + } + + // ── catalog ───────────────────────────────────────────────────────── + + @Test + void catalog_listsSeededMetricsWithStatisticsAndTagKeys() throws Exception { + ResponseEntity r = restTemplate.exchange( + "/api/v1/admin/server-metrics/catalog?from=2026-04-23T09:00:00Z&to=2026-04-23T11:00:00Z", + HttpMethod.GET, new HttpEntity<>(adminGet), String.class); + assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK); + + JsonNode body = mapper.readTree(r.getBody()); + assertThat(body.isArray()).isTrue(); + + JsonNode drops = findByField(body, "metricName", "cameleer.ingestion.drops"); + assertThat(drops.get("metricType").asText()).isEqualTo("counter"); + assertThat(asStringList(drops.get("statistics"))).contains("count"); + assertThat(asStringList(drops.get("tagKeys"))).contains("reason"); + + JsonNode timer = findByField(body, "metricName", "cameleer.ingestion.flush.duration"); + assertThat(asStringList(timer.get("statistics"))).contains("count", "total_time"); + } + + // ── instances ─────────────────────────────────────────────────────── + + @Test + void instances_listsDistinctServerInstanceIdsWithFirstAndLastSeen() throws Exception { + ResponseEntity r = restTemplate.exchange( + "/api/v1/admin/server-metrics/instances?from=2026-04-23T09:00:00Z&to=2026-04-23T11:00:00Z", + HttpMethod.GET, new HttpEntity<>(adminGet), String.class); + assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK); + + JsonNode body = 
mapper.readTree(r.getBody()); + assertThat(body.isArray()).isTrue(); + assertThat(body.size()).isEqualTo(2); + // Ordered by last_seen DESC — srv-B saw a later row. + assertThat(body.get(0).get("serverInstanceId").asText()).isEqualTo("srv-B"); + assertThat(body.get(1).get("serverInstanceId").asText()).isEqualTo("srv-A"); + } + + // ── query — gauge with group-by-tag ───────────────────────────────── + + @Test + void query_gaugeWithGroupByTag_returnsSeriesPerTagValue() throws Exception { + String requestBody = """ + { + "metric": "cameleer.agents.connected", + "statistic": "value", + "from": "2026-04-23T09:59:00Z", + "to": "2026-04-23T10:02:00Z", + "stepSeconds": 60, + "groupByTags": ["state"], + "aggregation": "avg", + "mode": "raw" + } + """; + + ResponseEntity r = restTemplate.postForEntity( + "/api/v1/admin/server-metrics/query", + new HttpEntity<>(requestBody, adminJson), String.class); + assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK); + + JsonNode body = mapper.readTree(r.getBody()); + assertThat(body.get("metric").asText()).isEqualTo("cameleer.agents.connected"); + assertThat(body.get("statistic").asText()).isEqualTo("value"); + assertThat(body.get("mode").asText()).isEqualTo("raw"); + assertThat(body.get("stepSeconds").asInt()).isEqualTo(60); + + JsonNode series = body.get("series"); + assertThat(series.isArray()).isTrue(); + assertThat(series.size()).isEqualTo(2); + + JsonNode live = findByTag(series, "state", "live"); + assertThat(live.get("points").size()).isEqualTo(2); + assertThat(live.get("points").get(0).get("v").asDouble()).isEqualTo(3.0); + assertThat(live.get("points").get(1).get("v").asDouble()).isEqualTo(4.0); + } + + // ── query — counter delta across instance rotation ────────────────── + + @Test + void query_counterDelta_clipsNegativesAcrossInstanceRotation() throws Exception { + String requestBody = """ + { + "metric": "cameleer.ingestion.drops", + "statistic": "count", + "from": "2026-04-23T09:59:00Z", + "to": "2026-04-23T10:05:00Z", 
+ "stepSeconds": 60, + "groupByTags": ["reason"], + "aggregation": "sum", + "mode": "delta" + } + """; + + ResponseEntity<String> r = restTemplate.postForEntity( + "/api/v1/admin/server-metrics/query", + new HttpEntity<>(requestBody, adminJson), String.class); + assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK); + + JsonNode body = mapper.readTree(r.getBody()); + JsonNode reason = findByTag(body.get("series"), "reason", "buffer_full"); + // Deltas: 0 (first bucket on srv-A), 5, 5, 0 (first on srv-B, clipped), 2. + // Sum across the window should be 12 if we tally all positive deltas. + double sum = 0; + for (JsonNode p : reason.get("points")) sum += p.get("v").asDouble(); + assertThat(sum).isEqualTo(12.0); + // No individual point may be negative. + for (JsonNode p : reason.get("points")) { + assertThat(p.get("v").asDouble()).isGreaterThanOrEqualTo(0.0); + } + } + + // ── query — derived 'mean' statistic for timers ───────────────────── + + @Test + void query_timerMeanStatistic_computesTotalOverCountPerBucket() throws Exception { + String requestBody = """ + { + "metric": "cameleer.ingestion.flush.duration", + "statistic": "mean", + "from": "2026-04-23T09:59:00Z", + "to": "2026-04-23T10:02:00Z", + "stepSeconds": 60, + "groupByTags": ["type"], + "aggregation": "avg", + "mode": "raw" + } + """; + + ResponseEntity<String> r = restTemplate.postForEntity( + "/api/v1/admin/server-metrics/query", + new HttpEntity<>(requestBody, adminJson), String.class); + assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK); + + JsonNode body = mapper.readTree(r.getBody()); + JsonNode points = findByTag(body.get("series"), "type", "execution").get("points"); + // Bucket 0: 30 / 2 = 15.0 + // Bucket 1: 100 / 4 = 25.0 + assertThat(points.get(0).get("v").asDouble()).isEqualTo(15.0); + assertThat(points.get(1).get("v").asDouble()).isEqualTo(25.0); + } + + // ── query — input validation ──────────────────────────────────────── + + @Test + void query_rejectsUnsafeMetricName() { + String requestBody =
""" + { + "metric": "cameleer.agents; DROP TABLE server_metrics", + "from": "2026-04-23T09:59:00Z", + "to": "2026-04-23T10:02:00Z" + } + """; + + ResponseEntity<String> r = restTemplate.postForEntity( + "/api/v1/admin/server-metrics/query", + new HttpEntity<>(requestBody, adminJson), String.class); + assertThat(r.getStatusCode()).isEqualTo(HttpStatus.BAD_REQUEST); + } + + @Test + void query_rejectsRangeBeyondMax() { + String requestBody = """ + { + "metric": "cameleer.agents.connected", + "from": "2026-01-01T00:00:00Z", + "to": "2026-04-23T00:00:00Z" + } + """; + + ResponseEntity<String> r = restTemplate.postForEntity( + "/api/v1/admin/server-metrics/query", + new HttpEntity<>(requestBody, adminJson), String.class); + assertThat(r.getStatusCode()).isEqualTo(HttpStatus.BAD_REQUEST); + } + + // ── authorization ─────────────────────────────────────────────────── + + @Test + void allEndpoints_requireAdminRole() { + ResponseEntity<String> catalog = restTemplate.exchange( + "/api/v1/admin/server-metrics/catalog", + HttpMethod.GET, new HttpEntity<>(viewerGet), String.class); + assertThat(catalog.getStatusCode()).isEqualTo(HttpStatus.FORBIDDEN); + + ResponseEntity<String> instances = restTemplate.exchange( + "/api/v1/admin/server-metrics/instances", + HttpMethod.GET, new HttpEntity<>(viewerGet), String.class); + assertThat(instances.getStatusCode()).isEqualTo(HttpStatus.FORBIDDEN); + + HttpHeaders viewerPost = securityHelper.authHeaders(securityHelper.viewerToken()); + ResponseEntity<String> query = restTemplate.exchange( + "/api/v1/admin/server-metrics/query", + HttpMethod.POST, new HttpEntity<>("{}", viewerPost), String.class); + assertThat(query.getStatusCode()).isEqualTo(HttpStatus.FORBIDDEN); + } + + // ── helpers ───────────────────────────────────────────────────────── + + private org.springframework.jdbc.core.JdbcTemplate clickhouseJdbc() { + return org.springframework.test.util.AopTestUtils.getTargetObject( + applicationContext.getBean("clickHouseJdbcTemplate")); + } + + @Autowired + private
org.springframework.context.ApplicationContext applicationContext; + + private static void insert(org.springframework.jdbc.core.JdbcTemplate jdbc, + String tenantId, Instant collectedAt, String serverInstanceId, + String metricName, String metricType, String statistic, + double value, Map<String, String> tags) { + jdbc.update(""" + INSERT INTO server_metrics + (tenant_id, collected_at, server_instance_id, + metric_name, metric_type, statistic, metric_value, tags) + VALUES (?, ?, ?, ?, ?, ?, ?, ?) + """, + tenantId, Timestamp.from(collectedAt), serverInstanceId, + metricName, metricType, statistic, value, tags); + } + + private static JsonNode findByField(JsonNode array, String field, String value) { + for (JsonNode n : array) { + if (value.equals(n.path(field).asText())) return n; + } + throw new AssertionError("no element with " + field + "=" + value); + } + + private static JsonNode findByTag(JsonNode seriesArray, String tagKey, String tagValue) { + for (JsonNode s : seriesArray) { + if (tagValue.equals(s.path("tags").path(tagKey).asText())) return s; + } + throw new AssertionError("no series with tag " + tagKey + "=" + tagValue); + } + + private static java.util.List<String> asStringList(JsonNode arr) { + java.util.List<String> out = new java.util.ArrayList<>(); + if (arr != null) for (JsonNode n : arr) out.add(n.asText()); + return out; + } +} diff --git a/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/ServerMetricsQueryStore.java b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/ServerMetricsQueryStore.java new file mode 100644 index 00000000..de77fece --- /dev/null +++ b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/ServerMetricsQueryStore.java @@ -0,0 +1,36 @@ +package com.cameleer.server.core.storage; + +import com.cameleer.server.core.storage.model.ServerInstanceInfo; +import com.cameleer.server.core.storage.model.ServerMetricCatalogEntry; +import com.cameleer.server.core.storage.model.ServerMetricQueryRequest; +import
com.cameleer.server.core.storage.model.ServerMetricQueryResponse; + +import java.time.Instant; +import java.util.List; + +/** + * Read-side access to the ClickHouse {@code server_metrics} table. Exposed + * to dashboards through {@code /api/v1/admin/server-metrics/**} so SaaS + * control planes don't need direct ClickHouse access. + */ +public interface ServerMetricsQueryStore { + + /** + * Catalog of metric names observed in {@code [from, to)} along with their + * type, the set of statistics emitted, and the union of tag keys seen. + */ + List<ServerMetricCatalogEntry> catalog(Instant from, Instant to); + + /** + * Distinct {@code server_instance_id} values that wrote at least one + * sample in {@code [from, to)}, with first/last seen timestamps. + */ + List<ServerInstanceInfo> listInstances(Instant from, Instant to); + + /** + * Generic time-series query. See {@link ServerMetricQueryRequest} for + * request semantics. Implementations must enforce input validation and + * reject unsafe inputs with {@link IllegalArgumentException}. + */ + ServerMetricQueryResponse query(ServerMetricQueryRequest request); +} diff --git a/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerInstanceInfo.java b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerInstanceInfo.java new file mode 100644 index 00000000..2cf563db --- /dev/null +++ b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerInstanceInfo.java @@ -0,0 +1,15 @@ +package com.cameleer.server.core.storage.model; + +import java.time.Instant; + +/** + * One row of the {@code /api/v1/admin/server-metrics/instances} response. + * Used by dashboards to partition counter-delta computations across server + * process boundaries (each boot rotates the id).
+ */ +public record ServerInstanceInfo( + String serverInstanceId, + Instant firstSeen, + Instant lastSeen +) { +} diff --git a/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricCatalogEntry.java b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricCatalogEntry.java new file mode 100644 index 00000000..8750332c --- /dev/null +++ b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricCatalogEntry.java @@ -0,0 +1,17 @@ +package com.cameleer.server.core.storage.model; + +import java.util.List; + +/** + * One row of the {@code /api/v1/admin/server-metrics/catalog} response. + * Surfaces the set of statistics and tag keys observed for a metric across + * the requested window, so dashboards can build selectors without ClickHouse + * access. + */ +public record ServerMetricCatalogEntry( + String metricName, + String metricType, + List<String> statistics, + List<String> tagKeys +) { +} diff --git a/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricPoint.java b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricPoint.java new file mode 100644 index 00000000..d2d5c98f --- /dev/null +++ b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricPoint.java @@ -0,0 +1,10 @@ +package com.cameleer.server.core.storage.model; + +import java.time.Instant; + +/** One {@code (bucket, value)} point of a server-metrics series.
*/ +public record ServerMetricPoint( + Instant t, + double v +) { +} diff --git a/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricQueryRequest.java b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricQueryRequest.java new file mode 100644 index 00000000..08041efd --- /dev/null +++ b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricQueryRequest.java @@ -0,0 +1,40 @@ +package com.cameleer.server.core.storage.model; + +import java.time.Instant; +import java.util.List; +import java.util.Map; + +/** + * Request contract for the generic server-metrics time-series query. + * + * <p>{@code aggregation} controls how multiple samples within a bucket + * collapse: {@code avg|sum|max|min|latest}. {@code mode} controls counter + * handling: {@code raw} returns values as stored (cumulative for counters), + * {@code delta} returns per-bucket positive-clipped differences computed + * per {@code server_instance_id}. + * + * <p>{@code statistic} filters which Micrometer sub-measurement to read + * ({@code value} / {@code count} / {@code total_time} / {@code total} / + * {@code max} / {@code mean}). {@code mean} is a derived statistic for + * timers: {@code sum(total_time|total) / sum(count)} per bucket. + * + * <p>{@code groupByTags} splits the output into one series per unique tag + * combination. {@code filterTags} narrows the input to samples whose tag + * map matches every entry. + * + * <p>
{@code serverInstanceIds} is an optional allow-list. When null or + * empty, all instances observed in the window are included. + */ +public record ServerMetricQueryRequest( + String metric, + String statistic, + Instant from, + Instant to, + Integer stepSeconds, + List<String> groupByTags, + Map<String, String> filterTags, + String aggregation, + String mode, + List<String> serverInstanceIds +) { +} diff --git a/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricQueryResponse.java b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricQueryResponse.java new file mode 100644 index 00000000..157d266f --- /dev/null +++ b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricQueryResponse.java @@ -0,0 +1,14 @@ +package com.cameleer.server.core.storage.model; + +import java.util.List; + +/** Response of the generic server-metrics time-series query. */ +public record ServerMetricQueryResponse( + String metric, + String statistic, + String aggregation, + String mode, + int stepSeconds, + List<ServerMetricSeries> series +) { +} diff --git a/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricSeries.java b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricSeries.java new file mode 100644 index 00000000..0d56fd7a --- /dev/null +++ b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricSeries.java @@ -0,0 +1,14 @@ +package com.cameleer.server.core.storage.model; + +import java.util.List; +import java.util.Map; + +/** + * One series of the server-metrics query response, identified by its + * {@link #tags} group (empty map when the query had no {@code groupByTags}).
+ */ +public record ServerMetricSeries( + Map<String, String> tags, + List<ServerMetricPoint> points +) { +} diff --git a/docs/server-self-metrics.md b/docs/server-self-metrics.md index 38a267b7..742ef8cd 100644 --- a/docs/server-self-metrics.md +++ b/docs/server-self-metrics.md @@ -66,24 +66,126 @@ On restart the `server_instance_id` rotates, so a simple `LAG()` partitioned by ## How to query -### Via the admin ClickHouse endpoint +Use the REST API — `/api/v1/admin/server-metrics/**`. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard `/api/v1/admin/**` RBAC gate). + +### `GET /catalog` + +Enumerate every `metric_name` observed in a window, with its `metric_type`, the set of statistics emitted, and the union of tag keys. ``` -POST /api/v1/admin/clickhouse/query +GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z Authorization: Bearer <token> -Content-Type: text/plain - -SELECT metric_name, statistic, count() -FROM server_metrics -WHERE collected_at >= now() - INTERVAL 1 HOUR -GROUP BY 1, 2 ORDER BY 1, 2 ``` -Requires `infrastructureendpoints=true` and the `ADMIN` role. For a SaaS control plane you will likely want a dedicated read-only CH user scoped to this table — the `/api/v1/admin/clickhouse/query` path is a human-facing admin tool, not a programmatic API. +```json +[ + { + "metricName": "cameleer.agents.connected", + "metricType": "gauge", + "statistics": ["value"], + "tagKeys": ["state"] + }, + { + "metricName": "cameleer.ingestion.drops", + "metricType": "counter", + "statistics": ["count"], + "tagKeys": ["reason"] + }, + ... +] +``` -### Direct JDBC (recommended for the dashboard) +`from`/`to` are optional; default is the last 1 h. -Read directly from ClickHouse (read-only user, `GRANT SELECT ON cameleer.server_metrics TO dashboard_ro`). All queries must filter by `tenant_id`.
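+The catalog response carries enough to pre-select a sensible default statistic per metric in a dashboard UI. A minimal plain-Java sketch (the `CatalogDefaults` helper and its type-to-statistic mapping are hypothetical illustrations, not part of the server API):
+
+```java
+import java.util.List;
+
+/** Hypothetical dashboard-side helper: pick a default statistic per catalog entry. */
+public class CatalogDefaults {
+
+    /** Local mirror of the catalog entry shape, for illustration only. */
+    record CatalogEntry(String metricName, String metricType, List<String> statistics) {}
+
+    /** Counters read cumulative count; timers prefer the derived mean; everything else reads value. */
+    static String defaultStatistic(CatalogEntry e) {
+        return switch (e.metricType()) {
+            case "counter" -> "count";
+            case "timer" -> e.statistics().contains("count") ? "mean" : "max";
+            default -> "value"; // gauges and anything unrecognized
+        };
+    }
+
+    public static void main(String[] args) {
+        var gauge = new CatalogEntry("cameleer.agents.connected", "gauge", List.of("value"));
+        var counter = new CatalogEntry("cameleer.ingestion.drops", "counter", List.of("count"));
+        System.out.println(defaultStatistic(gauge));   // value
+        System.out.println(defaultStatistic(counter)); // count
+    }
+}
+```
+
+Anything unrecognized falls back to `value`, which matches how gauges are queried elsewhere in this doc.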
+### `GET /instances` + +Enumerate the `server_instance_id` values that wrote at least one sample in the window, with `firstSeen` / `lastSeen`. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions. + +``` +GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z +``` + +```json +[ + { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" }, + { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" } +] +``` + +### `POST /query` — generic time-series + +The workhorse. One endpoint covers every panel in the dashboard. + +``` +POST /api/v1/admin/server-metrics/query +Authorization: Bearer +Content-Type: application/json +``` + +Request body: + +```json +{ + "metric": "cameleer.ingestion.drops", + "statistic": "count", + "from": "2026-04-22T00:00:00Z", + "to": "2026-04-23T00:00:00Z", + "stepSeconds": 60, + "groupByTags": ["reason"], + "filterTags": { }, + "aggregation": "sum", + "mode": "delta", + "serverInstanceIds": null +} +``` + +Response: + +```json +{ + "metric": "cameleer.ingestion.drops", + "statistic": "count", + "aggregation": "sum", + "mode": "delta", + "stepSeconds": 60, + "series": [ + { + "tags": { "reason": "buffer_full" }, + "points": [ + { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 }, + { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 }, + { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 } + ] + } + ] +} +``` + +#### Request field reference + +| Field | Type | Required | Description | +|---|---|---|---| +| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. | +| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time \| total) / sum(count)` per bucket. | +| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from ≤ 31 days`. | +| `stepSeconds` | int | no | Bucket size. 
Must be within [10, 3600]; values outside are rejected with 400. Default 60. | + | `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. | + | `filterTags` | map | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. | + | `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to `sum` of per-instance deltas). | + | `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. | + | `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. | + +#### Validation errors + +Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers: +- unsafe characters in identifiers +- `from ≥ to` or range > 31 days +- `stepSeconds` outside [10, 3600] +- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`) + +### Direct ClickHouse (fallback) + +If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for `/api/v1/admin/clickhouse/query` (`infrastructureendpoints=true`, ADMIN) or a dedicated read-only CH user scoped to `server_metrics`. All direct queries must filter by `tenant_id`. --- @@ -258,89 +360,150 @@ When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds: ## Suggested dashboard panels -The shortlist below gives you a working health dashboard with ~12 panels. All queries assume `tenant_id` is a dashboard variable. +Below are 17 panels, each driven by `POST /api/v1/admin/server-metrics/query` bodies (overlay panels repeat the query with a different metric). Tenant is implicit in the JWT — the server filters by tenant server-side.
`{from}` and `{to}` are dashboard variables. ### Row: server health (top of dashboard) 1. **Agents by state** — stacked area. - ```sql - SELECT toStartOfMinute(collected_at) AS t, tags['state'] AS state, avg(metric_value) AS count - FROM server_metrics - WHERE tenant_id = {tenant} AND metric_name = 'cameleer.agents.connected' - AND collected_at >= {from} AND collected_at < {to} - GROUP BY t, state ORDER BY t; + ```json + { "metric": "cameleer.agents.connected", "statistic": "value", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" } ``` -2. **Ingestion buffer depth** — line chart by `type`. Use `cameleer.ingestion.buffer.size` same shape as above. - -3. **Ingestion drops per minute** — bar chart (per-minute delta). - ```sql - WITH sorted AS ( - SELECT toStartOfMinute(collected_at) AS minute, - tags['reason'] AS reason, - server_instance_id, - max(metric_value) AS cumulative - FROM server_metrics - WHERE tenant_id = {tenant} AND metric_name = 'cameleer.ingestion.drops' - AND statistic = 'count' AND collected_at >= {from} AND collected_at < {to} - GROUP BY minute, reason, server_instance_id - ) - SELECT minute, reason, - cumulative - lagInFrame(cumulative, 1, cumulative) OVER ( - PARTITION BY reason, server_instance_id ORDER BY minute - ) AS drops_per_minute - FROM sorted ORDER BY minute; +2. **Ingestion buffer depth by type** — line chart. + ```json + { "metric": "cameleer.ingestion.buffer.size", "statistic": "value", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" } ``` -4. **Auth failures per minute** — same shape as drops, split by `reason`. +3. **Ingestion drops per minute** — bar chart. + ```json + { "metric": "cameleer.ingestion.drops", "statistic": "count", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "groupByTags": ["reason"], "mode": "delta" } + ``` + +4. 
**Auth failures per minute** — same shape as drops, grouped by `reason`. + ```json + { "metric": "cameleer.auth.failures", "statistic": "count", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "groupByTags": ["reason"], "mode": "delta" } + ``` ### Row: JVM -5. **Heap used vs committed vs max** — area chart. Filter `metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')` with `tags['area'] = 'heap'`, sum across pool `id`s. +5. **Heap used vs committed vs max** — area chart (three overlay queries). + ```json + { "metric": "jvm.memory.used", "statistic": "value", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" } + ``` + Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`. -6. **CPU %** — line. `process.cpu.usage` and `system.cpu.usage`. +6. **CPU %** — line. + ```json + { "metric": "process.cpu.usage", "statistic": "value", + "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" } + ``` + Overlay with `"metric": "system.cpu.usage"`. -7. **GC pause p99 + max** — `jvm.gc.pause` with statistic `max`, grouped by `tags['cause']`. +7. **GC pause — max per cause**. + ```json + { "metric": "jvm.gc.pause", "statistic": "max", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "groupByTags": ["cause"], "aggregation": "max", "mode": "raw" } + ``` -8. **Thread count** — `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`. +8. **Thread count** — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak` each with `statistic=value, aggregation=avg, mode=raw`. ### Row: HTTP + DB -9. **HTTP p99 by URI** — use `http.server.requests` with `statistic='max'` as a rough p99 proxy, or `total_time/count` for mean. Group by `tags['uri']`. Filter `tags['outcome'] = 'SUCCESS'`. +9. **HTTP mean latency by URI** — top-N URIs. 
+ ```json + { "metric": "http.server.requests", "statistic": "mean", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" }, + "aggregation": "avg", "mode": "raw" } + ``` + For p99 proxy, repeat with `"statistic": "max"`. -10. **HTTP error rate** — count where `tags['status']` starts with `5`, divided by total. +10. **HTTP error rate** — two queries, divide client-side: total requests and 5xx requests. + ```json + { "metric": "http.server.requests", "statistic": "count", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "mode": "delta", "aggregation": "sum" } + ``` + Then for the 5xx series, add `"filterTags": { "outcome": "SERVER_ERROR" }` and divide. -11. **HikariCP pool saturation** — overlay `hikaricp.connections.active` and `hikaricp.connections.pending`. If `pending > 0` sustained, the pool is too small. +11. **HikariCP pool saturation** — overlay two queries. + ```json + { "metric": "hikaricp.connections.active", "statistic": "value", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" } + ``` + Overlay with `"metric": "hikaricp.connections.pending"`. -12. **Hikari acquire timeouts per minute** — delta of `hikaricp.connections.timeout`. Any non-zero rate is a red flag. +12. **Hikari acquire timeouts per minute**. + ```json + { "metric": "hikaricp.connections.timeout", "statistic": "count", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "groupByTags": ["pool"], "mode": "delta" } + ``` ### Row: alerting (collapsible) -13. **Alerting instances by state** — `alerting_instances_total` stacked by `tags['state']`. +13. **Alerting instances by state** — stacked. + ```json + { "metric": "alerting_instances_total", "statistic": "value", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" } + ``` -14. 
**Eval errors per minute by kind** — delta of `alerting_eval_errors_total` by `tags['kind']`. +14. **Eval errors per minute by kind**. + ```json + { "metric": "alerting_eval_errors_total", "statistic": "count", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "groupByTags": ["kind"], "mode": "delta" } + ``` -15. **Webhook delivery p99** — `alerting_webhook_delivery_duration_seconds` with `statistic='max'`. +15. **Webhook delivery — max per minute**. + ```json + { "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max", + "from": "{from}", "to": "{to}", "stepSeconds": 60, + "aggregation": "max", "mode": "raw" } + ``` ### Row: deployments (runtime-enabled only) -16. **Deploy outcomes last 24 h** — counter delta of `cameleer.deployments.outcome` grouped by `tags['status']`. +16. **Deploy outcomes per hour**. + ```json + { "metric": "cameleer.deployments.outcome", "statistic": "count", + "from": "{from}", "to": "{to}", "stepSeconds": 3600, + "groupByTags": ["status"], "mode": "delta" } + ``` -17. **Deploy duration p99** — `cameleer.deployments.duration` with `statistic='max'` (or `total_time/count` for mean). +17. **Deploy duration mean**. + ```json + { "metric": "cameleer.deployments.duration", "statistic": "mean", + "from": "{from}", "to": "{to}", "stepSeconds": 300, + "aggregation": "avg", "mode": "raw" } + ``` + For p99 proxy, repeat with `"statistic": "max"`. --- ## Notes for the dashboard implementer -- **Always filter by `tenant_id`.** It's the first column in the sort key; queries that skip it scan the entire table. -- **Prefer predicate pushdown on `metric_name` + `statistic`.** Both are `LowCardinality`, so `metric_name = 'x' AND statistic = 'count'` is cheap. -- **Treat `server_instance_id` as a natural partition for counter math.** Never compute deltas across it — you'll get negative numbers on restart. 
-- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. Tests may write `total`. When in doubt, accept either. -- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see explosion here. Monitor `count(DISTINCT concat(metric_name, toString(tags)))` and alert if it spikes. -- **The dashboard should be read-only.** No one writes into `server_metrics` except the server itself — there's no API to push or delete rows. +- **Use the REST API.** The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express. +- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently. +- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it. +- **The dashboard is read-only.** There's no write path — only the server writes into `server_metrics`. --- ## Changelog -- 2026-04-23 — initial write. Write-only in v1 (no REST endpoint or admin page). Reach out to the server team before building a write-back path; we'd rather cut a proper API than have the dashboard hit ClickHouse directly forever. +- 2026-04-23 — initial write. Write-only backend. 
+- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.
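+For reference, the `mode=delta` arithmetic described throughout this doc (the first sample per instance yields 0, negative differences clip to 0, and per-instance deltas then aggregate across instances) can be reproduced client-side for sanity-checking a panel. A plain-Java sketch of the math only; the `DeltaSketch` class and its method names are illustrative, not the server's implementation:
+
+```java
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+/** Illustrative sketch of delta-mode counter math (not the server code). */
+public class DeltaSketch {
+
+    /** Per-instance deltas: the first sample yields 0; negative diffs clip to 0. */
+    static List<Double> positiveClippedDeltas(List<Double> cumulative) {
+        List<Double> out = new ArrayList<>();
+        Double prev = null;
+        for (double v : cumulative) {
+            out.add(prev == null ? 0.0 : Math.max(0.0, v - prev));
+            prev = v;
+        }
+        return out;
+    }
+
+    /** Sum the positive-clipped deltas of every instance's cumulative series. */
+    static double totalDeltas(Map<String, List<Double>> cumulativeByInstance) {
+        double sum = 0;
+        for (List<Double> series : cumulativeByInstance.values()) {
+            for (double d : positiveClippedDeltas(series)) sum += d;
+        }
+        return sum;
+    }
+
+    public static void main(String[] args) {
+        // A hypothetical rotation: srv-A counts 10, 15, 20, then the process
+        // restarts as srv-B and the counter resets to 0, then reaches 2.
+        Map<String, List<Double>> byInstance = new LinkedHashMap<>();
+        byInstance.put("srv-A", List.of(10.0, 15.0, 20.0));
+        byInstance.put("srv-B", List.of(0.0, 2.0));
+        System.out.println(totalDeltas(byInstance)); // 5 + 5 + 2 = 12.0
+    }
+}
+```
+
+Computing the delta per instance first is what keeps the restart from producing a negative spike: the naive cross-instance difference (0 − 20 = −20) never enters the sum.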