feat(server): REST API over server_metrics for SaaS dashboards
Adds /api/v1/admin/server-metrics/{catalog,instances,query} so SaaS control
planes can build the server-health dashboard without direct ClickHouse
access. One generic /query endpoint covers every panel in the
server-self-metrics doc: aggregation (avg/sum/max/min/latest), group-by-tag,
filter-by-tag, counter-delta mode with per-server_instance_id rotation
handling, and a derived 'mean' statistic for timers. Regex-validated
identifiers, parameterised literals, 31-day range cap, 500-series response
cap. ADMIN-only via the existing /api/v1/admin/** RBAC gate. Docs updated:
all 17 suggested panels now expressed as single-endpoint queries.
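
For illustration, a delta-mode request body for the new POST /query endpoint
(field names follow the QueryBody record in this commit; metric and tag
names are hypothetical):

```json
{
  "metric": "http.server.requests",
  "statistic": "count",
  "from": "2026-04-23T10:00:00Z",
  "to": "2026-04-23T11:00:00Z",
  "stepSeconds": 60,
  "groupByTags": ["status"],
  "filterTags": {"uri": "/api/v1/projects"},
  "mode": "delta"
}
```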
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -109,6 +109,7 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale
 - `UsageAnalyticsController` — GET `/api/v1/admin/usage` (ClickHouse `usage_events`).
 - `ClickHouseAdminController` — GET `/api/v1/admin/clickhouse/**` (conditional on `infrastructureendpoints` flag).
 - `DatabaseAdminController` — GET `/api/v1/admin/database/**` (conditional on `infrastructureendpoints` flag).
+- `ServerMetricsAdminController` — `/api/v1/admin/server-metrics/**`. GET `/catalog`, GET `/instances`, POST `/query`. Generic read API over the `server_metrics` ClickHouse table so SaaS dashboards don't need direct CH access. Delegates to `ServerMetricsQueryStore` (impl `ClickHouseServerMetricsQueryStore`). Validation: metric/tag regex `^[a-zA-Z0-9._]+$`, statistic regex `^[a-z_]+$`, `to - from ≤ 31 days`, stepSeconds ∈ [10, 3600], response capped at 500 series. `IllegalArgumentException` → 400. `/query` supports `raw` + `delta` modes (delta does per-`server_instance_id` positive-clipped differences, then aggregates across instances). Derived `statistic=mean` for timers computes `sum(total|total_time)/sum(count)` per bucket.
 
 ### Other (flat)
 
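
The counter-delta rotation handling described above can be sketched in plain Java. This is a simplified model of what the SQL does, not the production code: each instance's cumulative samples become zero-clipped differences (so a restart contributes 0 rather than a large negative spike), and the per-instance deltas are then summed bucket by bucket.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: rotation-safe counter deltas, summed across server instances. */
class CounterDeltaSketch {

    /**
     * Per-instance cumulative samples keyed by bucket (epoch seconds).
     * Returns per-bucket deltas clipped at zero; the first bucket has no
     * predecessor and yields 0, mirroring coalesce(prev, value).
     */
    static Map<Long, Double> positiveClippedDeltas(Map<Long, Double> cumulative) {
        Map<Long, Double> deltas = new TreeMap<>();
        Double prev = null;
        for (Map.Entry<Long, Double> e : new TreeMap<>(cumulative).entrySet()) {
            double delta = prev == null ? 0.0 : Math.max(0.0, e.getValue() - prev);
            deltas.put(e.getKey(), delta);
            prev = e.getValue();
        }
        return deltas;
    }

    /** Sum the per-instance deltas bucket-by-bucket, as the outer query does. */
    static Map<Long, Double> sumAcrossInstances(List<Map<Long, Double>> perInstance) {
        Map<Long, Double> total = new TreeMap<>();
        for (Map<Long, Double> deltas : perInstance) {
            deltas.forEach((bucket, d) -> total.merge(bucket, d, Double::sum));
        }
        return total;
    }
}
```

A counter that resets mid-window (e.g. 100 → 150 → 30 after a restart) contributes 0, 50, 0 rather than 0, 50, −120.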
@@ -147,7 +148,8 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale
 - `ClickHouseDiagramStore`, `ClickHouseAgentEventRepository`
 - `ClickHouseUsageTracker` — usage_events for billing
 - `ClickHouseRouteCatalogStore` — persistent route catalog with first_seen cache, warm-loaded on startup
-- `ClickHouseServerMetricsStore` — periodic dumps of the server's own Micrometer registry into the `server_metrics` table. Tenant-stamped (bound at the scheduler, not the bean); no `environment` column (server straddles envs). Batch-insert via `JdbcTemplate.batchUpdate` with `Map(String, String)` tag binding. Written by `ServerMetricsSnapshotScheduler`, query via `/api/v1/admin/clickhouse/query` (no dedicated endpoint yet).
+- `ClickHouseServerMetricsStore` — periodic dumps of the server's own Micrometer registry into the `server_metrics` table. Tenant-stamped (bound at the scheduler, not the bean); no `environment` column (server straddles envs). Batch-insert via `JdbcTemplate.batchUpdate` with `Map(String, String)` tag binding. Written by `ServerMetricsSnapshotScheduler`.
+- `ClickHouseServerMetricsQueryStore` — read side of `server_metrics` for dashboards. Implements `ServerMetricsQueryStore`. `catalog(from,to)` returns name+type+statistics+tagKeys, `listInstances(from,to)` returns server_instance_ids with first/last seen, `query(request)` builds bucketed time-series with `raw` or `delta` mode and supports a derived `mean` statistic for timers. All identifier inputs regex-validated; tenant_id always bound; max range 31 days; series count capped at 500. Exposed via `ServerMetricsAdminController`.
 
 ## search/ — ClickHouse search and log stores
 
@@ -9,6 +9,7 @@ import com.cameleer.server.app.storage.ClickHouseRouteCatalogStore;
 import com.cameleer.server.core.storage.RouteCatalogStore;
 import com.cameleer.server.app.storage.ClickHouseMetricsQueryStore;
 import com.cameleer.server.app.storage.ClickHouseMetricsStore;
+import com.cameleer.server.app.storage.ClickHouseServerMetricsQueryStore;
 import com.cameleer.server.app.storage.ClickHouseServerMetricsStore;
 import com.cameleer.server.app.storage.ClickHouseStatsStore;
 import com.cameleer.server.core.admin.AuditRepository;
@@ -74,6 +75,13 @@ public class StorageBeanConfig {
         return new ClickHouseServerMetricsStore(clickHouseJdbc);
     }
 
+    @Bean
+    public ServerMetricsQueryStore clickHouseServerMetricsQueryStore(
+            TenantProperties tenantProperties,
+            @Qualifier("clickHouseJdbcTemplate") JdbcTemplate clickHouseJdbc) {
+        return new ClickHouseServerMetricsQueryStore(tenantProperties.getId(), clickHouseJdbc);
+    }
+
     // ── Execution Store ──────────────────────────────────────────────────
 
     @Bean
@@ -0,0 +1,135 @@
package com.cameleer.server.app.controller;

import com.cameleer.server.core.storage.ServerMetricsQueryStore;
import com.cameleer.server.core.storage.model.ServerInstanceInfo;
import com.cameleer.server.core.storage.model.ServerMetricCatalogEntry;
import com.cameleer.server.core.storage.model.ServerMetricQueryRequest;
import com.cameleer.server.core.storage.model.ServerMetricQueryResponse;
import io.swagger.v3.oas.annotations.Operation;
import io.swagger.v3.oas.annotations.tags.Tag;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.time.Instant;
import java.util.List;
import java.util.Map;

/**
 * Generic read API over the ClickHouse {@code server_metrics} table. Lets
 * SaaS control planes build server-health dashboards without requiring direct
 * ClickHouse access.
 *
 * <p>Three endpoints cover all 17 panels in {@code docs/server-self-metrics.md}:
 * <ul>
 *   <li>{@code GET /catalog} — discover available metric names, types, statistics, and tags</li>
 *   <li>{@code POST /query} — generic time-series query with aggregation, grouping, filtering, and counter-delta mode</li>
 *   <li>{@code GET /instances} — list server instances (useful for partitioning counter math)</li>
 * </ul>
 *
 * <p>Protected by the {@code /api/v1/admin/**} catch-all in {@code SecurityConfig} — requires ADMIN role.
 */
@RestController
@RequestMapping("/api/v1/admin/server-metrics")
@Tag(name = "Server Self-Metrics",
        description = "Read API over the server's own Micrometer registry snapshots for dashboards")
public class ServerMetricsAdminController {

    /** Default lookback window for catalog/instances when from/to are omitted. */
    private static final long DEFAULT_LOOKBACK_SECONDS = 3_600L;

    private final ServerMetricsQueryStore store;

    public ServerMetricsAdminController(ServerMetricsQueryStore store) {
        this.store = store;
    }

    @GetMapping("/catalog")
    @Operation(summary = "List metric names observed in the window",
            description = "For each metric_name, returns metric_type, the set of statistics emitted, and the union of tag keys.")
    public ResponseEntity<List<ServerMetricCatalogEntry>> catalog(
            @RequestParam(required = false) String from,
            @RequestParam(required = false) String to) {
        Instant[] window = resolveWindow(from, to);
        return ResponseEntity.ok(store.catalog(window[0], window[1]));
    }

    @GetMapping("/instances")
    @Operation(summary = "List server_instance_id values observed in the window",
            description = "Returns first/last seen timestamps — use to partition counter-delta computations.")
    public ResponseEntity<List<ServerInstanceInfo>> instances(
            @RequestParam(required = false) String from,
            @RequestParam(required = false) String to) {
        Instant[] window = resolveWindow(from, to);
        return ResponseEntity.ok(store.listInstances(window[0], window[1]));
    }

    @PostMapping("/query")
    @Operation(summary = "Generic time-series query",
            description = "Returns bucketed series for a single metric_name. Supports aggregation (avg/sum/max/min/latest), group-by-tag, filter-by-tag, counter delta mode, and a derived 'mean' statistic for timers.")
    public ResponseEntity<ServerMetricQueryResponse> query(@RequestBody QueryBody body) {
        ServerMetricQueryRequest request = new ServerMetricQueryRequest(
                body.metric(),
                body.statistic(),
                parseInstant(body.from(), "from"),
                parseInstant(body.to(), "to"),
                body.stepSeconds(),
                body.groupByTags(),
                body.filterTags(),
                body.aggregation(),
                body.mode(),
                body.serverInstanceIds());
        return ResponseEntity.ok(store.query(request));
    }

    @ExceptionHandler(IllegalArgumentException.class)
    public ResponseEntity<Map<String, String>> handleBadRequest(IllegalArgumentException e) {
        return ResponseEntity.badRequest().body(Map.of("error", e.getMessage()));
    }

    private static Instant[] resolveWindow(String from, String to) {
        Instant toI = to != null ? parseInstant(to, "to") : Instant.now();
        Instant fromI = from != null
                ? parseInstant(from, "from")
                : toI.minusSeconds(DEFAULT_LOOKBACK_SECONDS);
        if (!fromI.isBefore(toI)) {
            throw new IllegalArgumentException("from must be strictly before to");
        }
        return new Instant[]{fromI, toI};
    }

    private static Instant parseInstant(String raw, String field) {
        if (raw == null || raw.isBlank()) {
            throw new IllegalArgumentException(field + " is required");
        }
        try {
            return Instant.parse(raw);
        } catch (Exception e) {
            throw new IllegalArgumentException(
                    field + " must be an ISO-8601 instant (e.g. 2026-04-23T10:00:00Z)");
        }
    }

    /**
     * Request body for {@link #query(QueryBody)}. Uses ISO-8601 strings on
     * the wire so the OpenAPI schema stays language-neutral.
     */
    public record QueryBody(
            String metric,
            String statistic,
            String from,
            String to,
            Integer stepSeconds,
            List<String> groupByTags,
            Map<String, String> filterTags,
            String aggregation,
            String mode,
            List<String> serverInstanceIds
    ) {
    }
}
@@ -0,0 +1,408 @@
package com.cameleer.server.app.storage;

import com.cameleer.server.core.storage.ServerMetricsQueryStore;
import com.cameleer.server.core.storage.model.ServerInstanceInfo;
import com.cameleer.server.core.storage.model.ServerMetricCatalogEntry;
import com.cameleer.server.core.storage.model.ServerMetricPoint;
import com.cameleer.server.core.storage.model.ServerMetricQueryRequest;
import com.cameleer.server.core.storage.model.ServerMetricQueryResponse;
import com.cameleer.server.core.storage.model.ServerMetricSeries;
import org.springframework.jdbc.core.JdbcTemplate;

import java.sql.Array;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

/**
 * ClickHouse-backed {@link ServerMetricsQueryStore}.
 *
 * <p>Safety rules for every query:
 * <ul>
 *   <li>tenant_id always bound as a parameter — no cross-tenant reads.</li>
 *   <li>Identifier-like inputs (metric name, statistic, tag keys,
 *       aggregation, mode) are regex-validated. Tag keys flow through the
 *       query as JDBC parameter-bound values of {@code tags[?]} map lookups,
 *       so even with a "safe" regex they cannot inject SQL.</li>
 *   <li>Literal values ({@code from}, {@code to}, tag filter values,
 *       server_instance_id allow-list) always go through {@code ?}.</li>
 *   <li>The time range is capped at {@link #MAX_RANGE}.</li>
 *   <li>Result cardinality is capped at {@link #MAX_SERIES} series.</li>
 * </ul>
 */
public class ClickHouseServerMetricsQueryStore implements ServerMetricsQueryStore {

    private static final Pattern SAFE_IDENTIFIER = Pattern.compile("^[a-zA-Z0-9._]+$");
    private static final Pattern SAFE_STATISTIC = Pattern.compile("^[a-z_]+$");

    private static final Set<String> AGGREGATIONS = Set.of("avg", "sum", "max", "min", "latest");
    private static final Set<String> MODES = Set.of("raw", "delta");

    /** Maximum {@code to - from} window accepted by the API. */
    static final Duration MAX_RANGE = Duration.ofDays(31);

    /** Clamp bounds and default for {@code stepSeconds}. */
    static final int MIN_STEP = 10;
    static final int MAX_STEP = 3600;
    static final int DEFAULT_STEP = 60;

    /** Defence against group-by explosion — limit the series count per response. */
    static final int MAX_SERIES = 500;

    private final String tenantId;
    private final JdbcTemplate jdbc;

    public ClickHouseServerMetricsQueryStore(String tenantId, JdbcTemplate jdbc) {
        this.tenantId = tenantId;
        this.jdbc = jdbc;
    }

    // ── catalog ─────────────────────────────────────────────────────────

    @Override
    public List<ServerMetricCatalogEntry> catalog(Instant from, Instant to) {
        requireRange(from, to);
        String sql = """
                SELECT
                    metric_name,
                    any(metric_type) AS metric_type,
                    arraySort(groupUniqArray(statistic)) AS statistics,
                    arraySort(arrayDistinct(arrayFlatten(groupArray(mapKeys(tags))))) AS tag_keys
                FROM server_metrics
                WHERE tenant_id = ?
                  AND collected_at >= ?
                  AND collected_at < ?
                GROUP BY metric_name
                ORDER BY metric_name
                """;
        return jdbc.query(sql, (rs, n) -> new ServerMetricCatalogEntry(
                rs.getString("metric_name"),
                rs.getString("metric_type"),
                arrayToStringList(rs.getArray("statistics")),
                arrayToStringList(rs.getArray("tag_keys"))
        ), tenantId, Timestamp.from(from), Timestamp.from(to));
    }

    // ── instances ───────────────────────────────────────────────────────

    @Override
    public List<ServerInstanceInfo> listInstances(Instant from, Instant to) {
        requireRange(from, to);
        String sql = """
                SELECT
                    server_instance_id,
                    min(collected_at) AS first_seen,
                    max(collected_at) AS last_seen
                FROM server_metrics
                WHERE tenant_id = ?
                  AND collected_at >= ?
                  AND collected_at < ?
                GROUP BY server_instance_id
                ORDER BY last_seen DESC
                """;
        return jdbc.query(sql, (rs, n) -> new ServerInstanceInfo(
                rs.getString("server_instance_id"),
                rs.getTimestamp("first_seen").toInstant(),
                rs.getTimestamp("last_seen").toInstant()
        ), tenantId, Timestamp.from(from), Timestamp.from(to));
    }

    // ── query ───────────────────────────────────────────────────────────

    @Override
    public ServerMetricQueryResponse query(ServerMetricQueryRequest request) {
        if (request == null) throw new IllegalArgumentException("request is required");
        String metric = requireSafeIdentifier(request.metric(), "metric");
        requireRange(request.from(), request.to());

        String aggregation = request.aggregation() != null ? request.aggregation().toLowerCase() : "avg";
        if (!AGGREGATIONS.contains(aggregation)) {
            throw new IllegalArgumentException("aggregation must be one of " + AGGREGATIONS);
        }

        String mode = request.mode() != null ? request.mode().toLowerCase() : "raw";
        if (!MODES.contains(mode)) {
            throw new IllegalArgumentException("mode must be one of " + MODES);
        }

        int step = request.stepSeconds() != null ? request.stepSeconds() : DEFAULT_STEP;
        if (step < MIN_STEP || step > MAX_STEP) {
            throw new IllegalArgumentException(
                    "stepSeconds must be in [" + MIN_STEP + "," + MAX_STEP + "]");
        }

        String statistic = request.statistic();
        if (statistic != null && !SAFE_STATISTIC.matcher(statistic).matches()) {
            throw new IllegalArgumentException("statistic contains unsafe characters");
        }

        List<String> groupByTags = request.groupByTags() != null
                ? request.groupByTags() : List.of();
        for (String t : groupByTags) requireSafeIdentifier(t, "groupByTag");

        Map<String, String> filterTags = request.filterTags() != null
                ? request.filterTags() : Map.of();
        for (String t : filterTags.keySet()) requireSafeIdentifier(t, "filterTag key");

        List<String> instanceAllowList = request.serverInstanceIds() != null
                ? request.serverInstanceIds() : List.of();

        boolean isDelta = "delta".equals(mode);
        boolean isMean = "mean".equals(statistic);

        String sql = isDelta
                ? buildDeltaSql(step, groupByTags, filterTags, instanceAllowList, statistic, isMean)
                : buildRawSql(step, groupByTags, filterTags, instanceAllowList,
                        statistic, aggregation, isMean);

        List<Object> params = buildParams(groupByTags, metric, statistic, isMean,
                request.from(), request.to(),
                filterTags, instanceAllowList);

        List<Row> rows = jdbc.query(sql, (rs, n) -> {
            int idx = 1;
            Instant bucket = rs.getTimestamp(idx++).toInstant();
            List<String> tagValues = new ArrayList<>(groupByTags.size());
            for (int g = 0; g < groupByTags.size(); g++) {
                tagValues.add(rs.getString(idx++));
            }
            double value = rs.getDouble(idx);
            return new Row(bucket, tagValues, value);
        }, params.toArray());

        return assembleSeries(rows, metric, statistic, aggregation, mode, step, groupByTags);
    }

    // ── SQL builders ────────────────────────────────────────────────────

    /**
     * Builds a single-pass SQL for raw mode:
     * <pre>{@code
     * SELECT bucket, tag0, ..., <agg>(metric_value) AS value
     * FROM server_metrics WHERE ...
     * GROUP BY bucket, tag0, ...
     * ORDER BY bucket, tag0, ...
     * }</pre>
     * For {@code statistic=mean}, replaces the aggregate with
     * {@code sumIf(value, statistic IN ('total','total_time')) / nullIf(sumIf(value, statistic='count'), 0)}.
     */
    private String buildRawSql(int step, List<String> groupByTags,
                               Map<String, String> filterTags,
                               List<String> instanceAllowList,
                               String statistic, String aggregation, boolean isMean) {
        StringBuilder s = new StringBuilder(512);
        s.append("SELECT\n  toDateTime64(toStartOfInterval(collected_at, INTERVAL ")
                .append(step).append(" SECOND), 3) AS bucket");
        for (int i = 0; i < groupByTags.size(); i++) {
            s.append(",\n  tags[?] AS tag").append(i);
        }
        s.append(",\n  ").append(isMean ? meanExpr() : scalarAggExpr(aggregation))
                .append(" AS value\nFROM server_metrics\n");
        appendWhereClause(s, filterTags, instanceAllowList, statistic, isMean);
        s.append("GROUP BY bucket");
        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
        s.append("\nORDER BY bucket");
        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
        return s.toString();
    }

    /**
     * Builds a three-level SQL for delta mode. Inner fills one
     * (bucket, instance, tag-group) row via {@code max(metric_value)};
     * middle computes positive-clipped per-instance differences via a
     * window function; outer sums across instances.
     */
    private String buildDeltaSql(int step, List<String> groupByTags,
                                 Map<String, String> filterTags,
                                 List<String> instanceAllowList,
                                 String statistic, boolean isMean) {
        StringBuilder s = new StringBuilder(1024);
        s.append("SELECT bucket");
        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
        s.append(", sum(delta) AS value FROM (\n");

        // Middle: per-instance positive-clipped delta using window.
        s.append("  SELECT bucket");
        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
        s.append(", server_instance_id, greatest(0, value - coalesce(any(value) OVER (")
                .append("PARTITION BY server_instance_id");
        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
        s.append(" ORDER BY bucket ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), value)) AS delta FROM (\n");

        // Inner: one representative value per (bucket, instance, tag-group).
        s.append("    SELECT\n      toDateTime64(toStartOfInterval(collected_at, INTERVAL ")
                .append(step).append(" SECOND), 3) AS bucket,\n      server_instance_id");
        for (int i = 0; i < groupByTags.size(); i++) {
            s.append(",\n      tags[?] AS tag").append(i);
        }
        s.append(",\n      ").append(isMean ? meanExpr() : "max(metric_value)")
                .append(" AS value\n    FROM server_metrics\n");
        appendWhereClause(s, filterTags, instanceAllowList, statistic, isMean);
        s.append("    GROUP BY bucket, server_instance_id");
        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
        s.append("\n  ) AS bucketed\n) AS deltas\n");

        s.append("GROUP BY bucket");
        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
        s.append("\nORDER BY bucket");
        for (int i = 0; i < groupByTags.size(); i++) s.append(", tag").append(i);
        return s.toString();
    }

    /**
     * WHERE clause shared by both raw and delta SQL shapes. Appended at the
     * correct indent under either the single {@code FROM server_metrics}
     * (raw) or the innermost one (delta).
     */
    private void appendWhereClause(StringBuilder s, Map<String, String> filterTags,
                                   List<String> instanceAllowList,
                                   String statistic, boolean isMean) {
        s.append("  WHERE tenant_id = ?\n")
                .append("    AND metric_name = ?\n");
        if (isMean) {
            s.append("    AND statistic IN ('count', 'total', 'total_time')\n");
        } else if (statistic != null) {
            s.append("    AND statistic = ?\n");
        }
        s.append("    AND collected_at >= ?\n")
                .append("    AND collected_at < ?\n");
        for (int i = 0; i < filterTags.size(); i++) {
            s.append("    AND tags[?] = ?\n");
        }
        if (!instanceAllowList.isEmpty()) {
            s.append("    AND server_instance_id IN (")
                    .append("?,".repeat(instanceAllowList.size() - 1)).append("?)\n");
        }
    }

    /**
     * SQL-positional params for both raw and delta queries (same relative
     * order because the WHERE clause is emitted by {@link #appendWhereClause}
     * only once, with the {@code tags[?]} select-list placeholders appearing
     * earlier in the SQL text).
     */
    private List<Object> buildParams(List<String> groupByTags, String metric,
                                     String statistic, boolean isMean,
                                     Instant from, Instant to,
                                     Map<String, String> filterTags,
                                     List<String> instanceAllowList) {
        List<Object> params = new ArrayList<>();
        // SELECT-list tags[?] placeholders
        params.addAll(groupByTags);
        // WHERE
        params.add(tenantId);
        params.add(metric);
        if (!isMean && statistic != null) params.add(statistic);
        params.add(Timestamp.from(from));
        params.add(Timestamp.from(to));
        for (Map.Entry<String, String> e : filterTags.entrySet()) {
            params.add(e.getKey());
            params.add(e.getValue());
        }
        params.addAll(instanceAllowList);
        return params;
    }

    private static String scalarAggExpr(String aggregation) {
        return switch (aggregation) {
            case "avg" -> "avg(metric_value)";
            case "sum" -> "sum(metric_value)";
            case "max" -> "max(metric_value)";
            case "min" -> "min(metric_value)";
            case "latest" -> "argMax(metric_value, collected_at)";
            default -> throw new IllegalStateException("unreachable: " + aggregation);
        };
    }

    private static String meanExpr() {
        return "sumIf(metric_value, statistic IN ('total', 'total_time'))"
                + " / nullIf(sumIf(metric_value, statistic = 'count'), 0)";
    }

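
A minimal Java sketch of the arithmetic behind the derived timer mean (illustrative only; `OptionalDouble.empty()` stands in for the SQL `NULL` that `nullIf(..., 0)` produces when a bucket recorded zero invocations):

```java
import java.util.OptionalDouble;

/** Sketch: the derived 'mean' statistic for timers, per time bucket. */
class TimerMeanSketch {

    /**
     * mean = sum(total|total_time samples) / sum(count samples) for one
     * bucket; empty when no timer invocations landed in the bucket.
     */
    static OptionalDouble bucketMean(double[] totals, double[] counts) {
        double totalSum = 0, countSum = 0;
        for (double t : totals) totalSum += t;   // summed total_time samples
        for (double c : counts) countSum += c;   // summed count samples
        return countSum == 0
                ? OptionalDouble.empty()
                : OptionalDouble.of(totalSum / countSum);
    }
}
```

Dividing summed totals by summed counts (rather than averaging per-sample means) weights the result by invocation count across instances.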
    // ── response assembly ───────────────────────────────────────────────

    private ServerMetricQueryResponse assembleSeries(
            List<Row> rows, String metric, String statistic,
            String aggregation, String mode, int step, List<String> groupByTags) {

        Map<List<String>, List<ServerMetricPoint>> bySignature = new LinkedHashMap<>();
        for (Row r : rows) {
            if (Double.isNaN(r.value()) || Double.isInfinite(r.value())) continue;
            bySignature.computeIfAbsent(r.tagValues(), k -> new ArrayList<>())
                    .add(new ServerMetricPoint(r.bucket(), r.value()));
        }

        if (bySignature.size() > MAX_SERIES) {
            throw new IllegalArgumentException(
                    "query produced " + bySignature.size()
                            + " series; reduce groupByTags or tighten filterTags (max "
                            + MAX_SERIES + ")");
        }

        List<ServerMetricSeries> series = new ArrayList<>(bySignature.size());
        for (Map.Entry<List<String>, List<ServerMetricPoint>> e : bySignature.entrySet()) {
            Map<String, String> tags = new LinkedHashMap<>();
            for (int i = 0; i < groupByTags.size(); i++) {
                tags.put(groupByTags.get(i), e.getKey().get(i));
            }
            series.add(new ServerMetricSeries(Collections.unmodifiableMap(tags), e.getValue()));
        }

        return new ServerMetricQueryResponse(metric,
                statistic != null ? statistic : "value",
                aggregation, mode, step, series);
    }

    // ── helpers ─────────────────────────────────────────────────────────

    private static void requireRange(Instant from, Instant to) {
        if (from == null || to == null) {
            throw new IllegalArgumentException("from and to are required");
        }
        if (!from.isBefore(to)) {
            throw new IllegalArgumentException("from must be strictly before to");
        }
        if (Duration.between(from, to).compareTo(MAX_RANGE) > 0) {
            throw new IllegalArgumentException(
                    "time range exceeds maximum of " + MAX_RANGE.toDays() + " days");
        }
    }

    private static String requireSafeIdentifier(String value, String field) {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException(field + " is required");
        }
        if (!SAFE_IDENTIFIER.matcher(value).matches()) {
            throw new IllegalArgumentException(
                    field + " contains unsafe characters (allowed: [a-zA-Z0-9._])");
        }
        return value;
    }

    private static List<String> arrayToStringList(Array array) {
        if (array == null) return List.of();
        try {
            Object[] values = (Object[]) array.getArray();
            Set<String> sorted = new TreeSet<>();
            for (Object v : values) {
                if (v != null) sorted.add(v.toString());
            }
            return List.copyOf(sorted);
        } catch (Exception e) {
            return List.of();
        } finally {
            try { array.free(); } catch (Exception ignore) { }
        }
    }

    private record Row(Instant bucket, List<String> tagValues, double value) {
    }
}
@@ -0,0 +1,314 @@
package com.cameleer.server.app.controller;

import com.cameleer.server.app.AbstractPostgresIT;
import com.cameleer.server.app.TestSecurityHelper;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.web.client.TestRestTemplate;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;

import java.sql.Timestamp;
import java.time.Instant;
import java.util.Map;

import static org.assertj.core.api.Assertions.assertThat;

class ServerMetricsAdminControllerIT extends AbstractPostgresIT {

    @Autowired
    private TestRestTemplate restTemplate;

    @Autowired
    private TestSecurityHelper securityHelper;

    @Autowired
    private org.springframework.context.ApplicationContext applicationContext;

    private final ObjectMapper mapper = new ObjectMapper();

    private HttpHeaders adminJson;
    private HttpHeaders adminGet;
    private HttpHeaders viewerGet;

    @BeforeEach
    void seedAndAuth() {
        adminJson = securityHelper.adminHeaders();
        adminGet = securityHelper.authHeadersNoBody(securityHelper.adminToken());
        viewerGet = securityHelper.authHeadersNoBody(securityHelper.viewerToken());

        // Fresh rows for each test. Seed through the same ClickHouse
        // JdbcTemplate bean the store reads from (see clickhouseJdbc() below),
        // so writes and reads share one connection configuration.
        org.springframework.jdbc.core.JdbcTemplate ch = clickhouseJdbc();
        ch.execute("TRUNCATE TABLE server_metrics");

        Instant t0 = Instant.parse("2026-04-23T10:00:00Z");
        // Gauge: cameleer.agents.connected, two states, two buckets.
        insert(ch, "default", t0, "srv-A", "cameleer.agents.connected", "gauge", "value", 3.0,
                Map.of("state", "live"));
        insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.agents.connected", "gauge", "value", 4.0,
                Map.of("state", "live"));
        insert(ch, "default", t0, "srv-A", "cameleer.agents.connected", "gauge", "value", 1.0,
                Map.of("state", "stale"));
        insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.agents.connected", "gauge", "value", 0.0,
                Map.of("state", "stale"));

        // Counter: cumulative drops, +5 per minute on srv-A.
        insert(ch, "default", t0, "srv-A", "cameleer.ingestion.drops", "counter", "count", 0.0, Map.of("reason", "buffer_full"));
        insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.ingestion.drops", "counter", "count", 5.0, Map.of("reason", "buffer_full"));
        insert(ch, "default", t0.plusSeconds(120), "srv-A", "cameleer.ingestion.drops", "counter", "count", 10.0, Map.of("reason", "buffer_full"));
        // Simulated restart (rotation to srv-B): counter resets to 0, then climbs to 2.
        insert(ch, "default", t0.plusSeconds(180), "srv-B", "cameleer.ingestion.drops", "counter", "count", 0.0, Map.of("reason", "buffer_full"));
        insert(ch, "default", t0.plusSeconds(240), "srv-B", "cameleer.ingestion.drops", "counter", "count", 2.0, Map.of("reason", "buffer_full"));

        // Timer mean inputs: bucket 0 has count=2 / total_time=30,
        // bucket 1 has count=4 / total_time=100.
        insert(ch, "default", t0, "srv-A", "cameleer.ingestion.flush.duration", "timer", "count", 2.0, Map.of("type", "execution"));
        insert(ch, "default", t0, "srv-A", "cameleer.ingestion.flush.duration", "timer", "total_time", 30.0, Map.of("type", "execution"));
        insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.ingestion.flush.duration", "timer", "count", 4.0, Map.of("type", "execution"));
        insert(ch, "default", t0.plusSeconds(60), "srv-A", "cameleer.ingestion.flush.duration", "timer", "total_time", 100.0, Map.of("type", "execution"));
    }

    // ── catalog ─────────────────────────────────────────────────────────

    @Test
    void catalog_listsSeededMetricsWithStatisticsAndTagKeys() throws Exception {
        ResponseEntity<String> r = restTemplate.exchange(
                "/api/v1/admin/server-metrics/catalog?from=2026-04-23T09:00:00Z&to=2026-04-23T11:00:00Z",
                HttpMethod.GET, new HttpEntity<>(adminGet), String.class);
        assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK);

        JsonNode body = mapper.readTree(r.getBody());
        assertThat(body.isArray()).isTrue();

        JsonNode drops = findByField(body, "metricName", "cameleer.ingestion.drops");
        assertThat(drops.get("metricType").asText()).isEqualTo("counter");
        assertThat(asStringList(drops.get("statistics"))).contains("count");
        assertThat(asStringList(drops.get("tagKeys"))).contains("reason");

        JsonNode timer = findByField(body, "metricName", "cameleer.ingestion.flush.duration");
        assertThat(asStringList(timer.get("statistics"))).contains("count", "total_time");
    }

    // ── instances ───────────────────────────────────────────────────────

    @Test
    void instances_listsDistinctServerInstanceIdsWithFirstAndLastSeen() throws Exception {
        ResponseEntity<String> r = restTemplate.exchange(
                "/api/v1/admin/server-metrics/instances?from=2026-04-23T09:00:00Z&to=2026-04-23T11:00:00Z",
                HttpMethod.GET, new HttpEntity<>(adminGet), String.class);
        assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK);

        JsonNode body = mapper.readTree(r.getBody());
        assertThat(body.isArray()).isTrue();
        assertThat(body.size()).isEqualTo(2);
        // Ordered by last_seen DESC — srv-B wrote the later row.
        assertThat(body.get(0).get("serverInstanceId").asText()).isEqualTo("srv-B");
        assertThat(body.get(1).get("serverInstanceId").asText()).isEqualTo("srv-A");
    }

    // ── query — gauge with group-by-tag ─────────────────────────────────

    @Test
    void query_gaugeWithGroupByTag_returnsSeriesPerTagValue() throws Exception {
        String requestBody = """
                {
                  "metric": "cameleer.agents.connected",
                  "statistic": "value",
                  "from": "2026-04-23T09:59:00Z",
                  "to": "2026-04-23T10:02:00Z",
                  "stepSeconds": 60,
                  "groupByTags": ["state"],
                  "aggregation": "avg",
                  "mode": "raw"
                }
                """;

        ResponseEntity<String> r = restTemplate.postForEntity(
                "/api/v1/admin/server-metrics/query",
                new HttpEntity<>(requestBody, adminJson), String.class);
        assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK);

        JsonNode body = mapper.readTree(r.getBody());
        assertThat(body.get("metric").asText()).isEqualTo("cameleer.agents.connected");
        assertThat(body.get("statistic").asText()).isEqualTo("value");
        assertThat(body.get("mode").asText()).isEqualTo("raw");
        assertThat(body.get("stepSeconds").asInt()).isEqualTo(60);

        JsonNode series = body.get("series");
        assertThat(series.isArray()).isTrue();
        assertThat(series.size()).isEqualTo(2);

        JsonNode live = findByTag(series, "state", "live");
        assertThat(live.get("points").size()).isEqualTo(2);
        assertThat(live.get("points").get(0).get("v").asDouble()).isEqualTo(3.0);
        assertThat(live.get("points").get(1).get("v").asDouble()).isEqualTo(4.0);
    }

    // ── query — counter delta across instance rotation ──────────────────

    @Test
    void query_counterDelta_clipsNegativesAcrossInstanceRotation() throws Exception {
        String requestBody = """
                {
                  "metric": "cameleer.ingestion.drops",
                  "statistic": "count",
                  "from": "2026-04-23T09:59:00Z",
                  "to": "2026-04-23T10:05:00Z",
                  "stepSeconds": 60,
                  "groupByTags": ["reason"],
                  "aggregation": "sum",
                  "mode": "delta"
                }
                """;

        ResponseEntity<String> r = restTemplate.postForEntity(
                "/api/v1/admin/server-metrics/query",
                new HttpEntity<>(requestBody, adminJson), String.class);
        assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK);

        JsonNode body = mapper.readTree(r.getBody());
        JsonNode reason = findByTag(body.get("series"), "reason", "buffer_full");
        // Deltas: 0 (first bucket on srv-A), 5, 5, 0 (first on srv-B, clipped), 2.
        // Sum across the window should be 12 if we tally all positive deltas.
        double sum = 0;
        for (JsonNode p : reason.get("points")) sum += p.get("v").asDouble();
        assertThat(sum).isEqualTo(12.0);
        // No individual point may be negative.
        for (JsonNode p : reason.get("points")) {
            assertThat(p.get("v").asDouble()).isGreaterThanOrEqualTo(0.0);
        }
    }

    // ── query — derived 'mean' statistic for timers ─────────────────────

    @Test
    void query_timerMeanStatistic_computesTotalOverCountPerBucket() throws Exception {
        String requestBody = """
                {
                  "metric": "cameleer.ingestion.flush.duration",
                  "statistic": "mean",
                  "from": "2026-04-23T09:59:00Z",
                  "to": "2026-04-23T10:02:00Z",
                  "stepSeconds": 60,
                  "groupByTags": ["type"],
                  "aggregation": "avg",
                  "mode": "raw"
                }
                """;

        ResponseEntity<String> r = restTemplate.postForEntity(
                "/api/v1/admin/server-metrics/query",
                new HttpEntity<>(requestBody, adminJson), String.class);
        assertThat(r.getStatusCode()).isEqualTo(HttpStatus.OK);

        JsonNode body = mapper.readTree(r.getBody());
        JsonNode points = findByTag(body.get("series"), "type", "execution").get("points");
        // Bucket 0: 30 / 2 = 15.0
        // Bucket 1: 100 / 4 = 25.0
        assertThat(points.get(0).get("v").asDouble()).isEqualTo(15.0);
        assertThat(points.get(1).get("v").asDouble()).isEqualTo(25.0);
    }

    // ── query — input validation ────────────────────────────────────────

    @Test
    void query_rejectsUnsafeMetricName() {
        String requestBody = """
                {
                  "metric": "cameleer.agents; DROP TABLE server_metrics",
                  "from": "2026-04-23T09:59:00Z",
                  "to": "2026-04-23T10:02:00Z"
                }
                """;

        ResponseEntity<String> r = restTemplate.postForEntity(
                "/api/v1/admin/server-metrics/query",
                new HttpEntity<>(requestBody, adminJson), String.class);
        assertThat(r.getStatusCode()).isEqualTo(HttpStatus.BAD_REQUEST);
    }

    @Test
    void query_rejectsRangeBeyondMax() {
        String requestBody = """
                {
                  "metric": "cameleer.agents.connected",
                  "from": "2026-01-01T00:00:00Z",
                  "to": "2026-04-23T00:00:00Z"
                }
                """;

        ResponseEntity<String> r = restTemplate.postForEntity(
                "/api/v1/admin/server-metrics/query",
                new HttpEntity<>(requestBody, adminJson), String.class);
        assertThat(r.getStatusCode()).isEqualTo(HttpStatus.BAD_REQUEST);
    }

    // ── authorization ───────────────────────────────────────────────────

    @Test
    void allEndpoints_requireAdminRole() {
        ResponseEntity<String> catalog = restTemplate.exchange(
                "/api/v1/admin/server-metrics/catalog",
                HttpMethod.GET, new HttpEntity<>(viewerGet), String.class);
        assertThat(catalog.getStatusCode()).isEqualTo(HttpStatus.FORBIDDEN);

        ResponseEntity<String> instances = restTemplate.exchange(
                "/api/v1/admin/server-metrics/instances",
                HttpMethod.GET, new HttpEntity<>(viewerGet), String.class);
        assertThat(instances.getStatusCode()).isEqualTo(HttpStatus.FORBIDDEN);

        HttpHeaders viewerPost = securityHelper.authHeaders(securityHelper.viewerToken());
        ResponseEntity<String> query = restTemplate.exchange(
                "/api/v1/admin/server-metrics/query",
                HttpMethod.POST, new HttpEntity<>("{}", viewerPost), String.class);
        assertThat(query.getStatusCode()).isEqualTo(HttpStatus.FORBIDDEN);
    }

    // ── helpers ─────────────────────────────────────────────────────────

    private org.springframework.jdbc.core.JdbcTemplate clickhouseJdbc() {
        return org.springframework.test.util.AopTestUtils.getTargetObject(
                applicationContext.getBean("clickHouseJdbcTemplate"));
    }

    private static void insert(org.springframework.jdbc.core.JdbcTemplate jdbc,
            String tenantId, Instant collectedAt, String serverInstanceId,
            String metricName, String metricType, String statistic,
            double value, Map<String, String> tags) {
        jdbc.update("""
                INSERT INTO server_metrics
                    (tenant_id, collected_at, server_instance_id,
                     metric_name, metric_type, statistic, metric_value, tags)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
                """,
                tenantId, Timestamp.from(collectedAt), serverInstanceId,
                metricName, metricType, statistic, value, tags);
    }

    private static JsonNode findByField(JsonNode array, String field, String value) {
        for (JsonNode n : array) {
            if (value.equals(n.path(field).asText())) return n;
        }
        throw new AssertionError("no element with " + field + "=" + value);
    }

    private static JsonNode findByTag(JsonNode seriesArray, String tagKey, String tagValue) {
        for (JsonNode s : seriesArray) {
            if (tagValue.equals(s.path("tags").path(tagKey).asText())) return s;
        }
        throw new AssertionError("no series with tag " + tagKey + "=" + tagValue);
    }

    private static java.util.List<String> asStringList(JsonNode arr) {
        java.util.List<String> out = new java.util.ArrayList<>();
        if (arr != null) for (JsonNode n : arr) out.add(n.asText());
        return out;
    }
}
@@ -0,0 +1,36 @@
package com.cameleer.server.core.storage;

import com.cameleer.server.core.storage.model.ServerInstanceInfo;
import com.cameleer.server.core.storage.model.ServerMetricCatalogEntry;
import com.cameleer.server.core.storage.model.ServerMetricQueryRequest;
import com.cameleer.server.core.storage.model.ServerMetricQueryResponse;

import java.time.Instant;
import java.util.List;

/**
 * Read-side access to the ClickHouse {@code server_metrics} table. Exposed
 * to dashboards through {@code /api/v1/admin/server-metrics/**} so SaaS
 * control planes don't need direct ClickHouse access.
 */
public interface ServerMetricsQueryStore {

    /**
     * Catalog of metric names observed in {@code [from, to)} along with their
     * type, the set of statistics emitted, and the union of tag keys seen.
     */
    List<ServerMetricCatalogEntry> catalog(Instant from, Instant to);

    /**
     * Distinct {@code server_instance_id} values that wrote at least one
     * sample in {@code [from, to)}, with first/last seen timestamps.
     */
    List<ServerInstanceInfo> listInstances(Instant from, Instant to);

    /**
     * Generic time-series query. See {@link ServerMetricQueryRequest} for
     * request semantics. Implementations must enforce input validation and
     * reject unsafe inputs with {@link IllegalArgumentException}.
     */
    ServerMetricQueryResponse query(ServerMetricQueryRequest request);
}
@@ -0,0 +1,15 @@
package com.cameleer.server.core.storage.model;

import java.time.Instant;

/**
 * One row of the {@code /api/v1/admin/server-metrics/instances} response.
 * Used by dashboards to partition counter-delta computations across server
 * process boundaries (each boot rotates the id).
 */
public record ServerInstanceInfo(
        String serverInstanceId,
        Instant firstSeen,
        Instant lastSeen
) {
}
@@ -0,0 +1,17 @@
package com.cameleer.server.core.storage.model;

import java.util.List;

/**
 * One row of the {@code /api/v1/admin/server-metrics/catalog} response.
 * Surfaces the set of statistics and tag keys observed for a metric across
 * the requested window, so dashboards can build selectors without ClickHouse
 * access.
 */
public record ServerMetricCatalogEntry(
        String metricName,
        String metricType,
        List<String> statistics,
        List<String> tagKeys
) {
}
@@ -0,0 +1,10 @@
package com.cameleer.server.core.storage.model;

import java.time.Instant;

/** One {@code (bucket, value)} point of a server-metrics series. */
public record ServerMetricPoint(
        Instant t,
        double v
) {
}
@@ -0,0 +1,40 @@
package com.cameleer.server.core.storage.model;

import java.time.Instant;
import java.util.List;
import java.util.Map;

/**
 * Request contract for the generic server-metrics time-series query.
 *
 * <p>{@code aggregation} controls how multiple samples within a bucket
 * collapse: {@code avg|sum|max|min|latest}. {@code mode} controls counter
 * handling: {@code raw} returns values as stored (cumulative for counters),
 * {@code delta} returns per-bucket positive-clipped differences computed
 * per {@code server_instance_id}.
 *
 * <p>{@code statistic} filters which Micrometer sub-measurement to read
 * ({@code value} / {@code count} / {@code total_time} / {@code total} /
 * {@code max} / {@code mean}). {@code mean} is a derived statistic for
 * timers: {@code sum(total_time|total) / sum(count)} per bucket.
 *
 * <p>{@code groupByTags} splits the output into one series per unique tag
 * combination. {@code filterTags} narrows the input to samples whose tag
 * map matches every entry.
 *
 * <p>{@code serverInstanceIds} is an optional allow-list. When null or
 * empty, all instances observed in the window are included.
 */
public record ServerMetricQueryRequest(
        String metric,
        String statistic,
        Instant from,
        Instant to,
        Integer stepSeconds,
        List<String> groupByTags,
        Map<String, String> filterTags,
        String aggregation,
        String mode,
        List<String> serverInstanceIds
) {
}
@@ -0,0 +1,14 @@
package com.cameleer.server.core.storage.model;

import java.util.List;

/** Response of the generic server-metrics time-series query. */
public record ServerMetricQueryResponse(
        String metric,
        String statistic,
        String aggregation,
        String mode,
        int stepSeconds,
        List<ServerMetricSeries> series
) {
}
@@ -0,0 +1,14 @@
package com.cameleer.server.core.storage.model;

import java.util.List;
import java.util.Map;

/**
 * One series of the server-metrics query response, identified by its
 * {@link #tags} group (empty map when the query had no {@code groupByTags}).
 */
public record ServerMetricSeries(
        Map<String, String> tags,
        List<ServerMetricPoint> points
) {
}
@@ -66,24 +66,126 @@ On restart the `server_instance_id` rotates, so a simple `LAG()` partitioned by
## How to query

Use the REST API — `/api/v1/admin/server-metrics/**`. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard `/api/v1/admin/**` RBAC gate).

### `GET /catalog`

Enumerate every `metric_name` observed in a window, with its `metric_type`, the set of statistics emitted, and the union of tag keys.

```
GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer <admin-jwt>
```

```json
[
  {
    "metricName": "cameleer.agents.connected",
    "metricType": "gauge",
    "statistics": ["value"],
    "tagKeys": ["state"]
  },
  {
    "metricName": "cameleer.ingestion.drops",
    "metricType": "counter",
    "statistics": ["count"],
    "tagKeys": ["reason"]
  },
  ...
]
```

`from`/`to` are optional; the default window is the last hour.
### `GET /instances`

Enumerate the `server_instance_id` values that wrote at least one sample in the window, with `firstSeen` / `lastSeen`. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.

```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```

```json
[
  { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
  { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
]
```
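As a sketch of that restart-annotation usage — a hypothetical client-side helper, not part of the server — any instance whose `firstSeen` falls strictly inside the charted window marks a boot (and so an id rotation) you can annotate:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical dashboard helper: derive restart markers from /instances rows. */
public class RestartMarkers {

    /** firstSeen strictly inside (from, to) means the instance booted mid-window. */
    static List<Instant> restartsInWindow(List<Map.Entry<String, Instant>> instanceFirstSeen,
                                          Instant from, Instant to) {
        List<Instant> markers = new ArrayList<>();
        for (Map.Entry<String, Instant> e : instanceFirstSeen) {
            Instant firstSeen = e.getValue();
            if (firstSeen.isAfter(from) && firstSeen.isBefore(to)) {
                markers.add(firstSeen);
            }
        }
        return markers;
    }
}
```

Against the example response above, only `srv-prod-b` (first seen mid-window) would produce a marker; `srv-prod-a` was already running at the window start.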
### `POST /query` — generic time-series

The workhorse. One endpoint covers every panel in the dashboard.

```
POST /api/v1/admin/server-metrics/query
Authorization: Bearer <admin-jwt>
Content-Type: application/json
```

Request body:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "from": "2026-04-22T00:00:00Z",
  "to": "2026-04-23T00:00:00Z",
  "stepSeconds": 60,
  "groupByTags": ["reason"],
  "filterTags": { },
  "aggregation": "sum",
  "mode": "delta",
  "serverInstanceIds": null
}
```

Response:

```json
{
  "metric": "cameleer.ingestion.drops",
  "statistic": "count",
  "aggregation": "sum",
  "mode": "delta",
  "stepSeconds": 60,
  "series": [
    {
      "tags": { "reason": "buffer_full" },
      "points": [
        { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
        { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
        { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
      ]
    }
  ]
}
```

#### Request field reference

| Field | Type | Required | Description |
|---|---|---|---|
| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. |
| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time \| total) / sum(count)` per bucket. |
| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from ≤ 31 days`. |
| `stepSeconds` | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. |
| `filterTags` | map<string,string> | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to `sum` of per-instance deltas). |
| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |
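The `mode=delta` semantics can be sketched in a few lines — an illustrative Java reproduction (class and method names are hypothetical, not the server's implementation): per instance, take successive differences of the cumulative counter and clip negatives to zero, then sum the per-instance deltas bucket-by-bucket.

```java
import java.util.List;

/** Illustrative sketch of mode=delta semantics (names are hypothetical). */
public class DeltaSketch {

    /** Positive-clipped successive differences of one cumulative counter series. */
    static double[] positiveClippedDeltas(double[] cumulative) {
        double[] out = new double[cumulative.length];
        for (int i = 1; i < cumulative.length; i++) {
            // A reset (restart) makes the raw difference negative; clip it to 0.
            out[i] = Math.max(0.0, cumulative[i] - cumulative[i - 1]);
        }
        return out; // out[0] stays 0: no previous sample to diff against
    }

    /** Default cross-instance aggregation: sum per-instance deltas per bucket. */
    static double[] sumAcrossInstances(List<double[]> perInstanceDeltas, int buckets) {
        double[] out = new double[buckets];
        for (double[] deltas : perInstanceDeltas) {
            for (int i = 0; i < deltas.length && i < buckets; i++) {
                out[i] += deltas[i];
            }
        }
        return out;
    }
}
```

On the real table a restart shows up as a new `server_instance_id` rather than an in-series reset, which is why the server diffs per instance before aggregating; the clipping additionally guards against any reset inside one instance's series.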
#### Validation errors

Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers:
- unsafe characters in identifiers
- `from ≥ to` or range > 31 days
- `stepSeconds` outside [10, 3600]
- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)
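For completeness, a minimal client sketch against `/query` using the JDK `HttpClient` — base URL and token are placeholders, and the class is a hypothetical illustration rather than a shipped SDK; a 400 response carries the validation message described above in its body:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical minimal client for POST /api/v1/admin/server-metrics/query. */
public class ServerMetricsClient {

    /** Build the authenticated JSON request; pure function, easy to test. */
    static HttpRequest buildQueryRequest(String baseUrl, String adminJwt, String requestBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/v1/admin/server-metrics/query"))
                .header("Authorization", "Bearer " + adminJwt)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody))
                .build();
    }

    /** 200 → time-series payload; 400 → {"error": "…"} validation failure. */
    static HttpResponse<String> send(HttpRequest request) throws Exception {
        return HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```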
### Direct ClickHouse (fallback)

If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for `/api/v1/admin/clickhouse/query` (`infrastructureendpoints=true`, ADMIN) or a dedicated read-only CH user scoped to `server_metrics`. All direct queries must filter by `tenant_id`.

---
@@ -258,89 +360,150 @@ When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds:
|
|||||||
|
|
||||||
## Suggested dashboard panels
|
## Suggested dashboard panels
|
||||||
|
|
||||||
The shortlist below gives you a working health dashboard with ~12 panels. All queries assume `tenant_id` is a dashboard variable.
|
Below are 17 panels, each expressed as a single `POST /api/v1/admin/server-metrics/query` body. Tenant is implicit in the JWT — the server filters by tenant server-side. `{from}` and `{to}` are dashboard variables.
|
||||||
|
|
||||||
### Row: server health (top of dashboard)
|
### Row: server health (top of dashboard)
|
||||||
|
|
||||||
1. **Agents by state** — stacked area.
|
1. **Agents by state** — stacked area.
|
||||||
```sql
|
```json
|
||||||
SELECT toStartOfMinute(collected_at) AS t, tags['state'] AS state, avg(metric_value) AS count
|
{ "metric": "cameleer.agents.connected", "statistic": "value",
|
||||||
FROM server_metrics
|
"from": "{from}", "to": "{to}", "stepSeconds": 60,
|
||||||
WHERE tenant_id = {tenant} AND metric_name = 'cameleer.agents.connected'
|
"groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
|
||||||
AND collected_at >= {from} AND collected_at < {to}
|
|
||||||
GROUP BY t, state ORDER BY t;
|
|
||||||
```
|
```
|
||||||
|
|
||||||
2. **Ingestion buffer depth** — line chart by `type`. Use `cameleer.ingestion.buffer.size` same shape as above.
|
2. **Ingestion buffer depth by type** — line chart.
|
||||||
|
```json
|
||||||
3. **Ingestion drops per minute** — bar chart (per-minute delta).
|
{ "metric": "cameleer.ingestion.buffer.size", "statistic": "value",
|
||||||
```sql
|
"from": "{from}", "to": "{to}", "stepSeconds": 60,
|
||||||
WITH sorted AS (
|
"groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
|
||||||
SELECT toStartOfMinute(collected_at) AS minute,
|
|
||||||
tags['reason'] AS reason,
|
|
||||||
server_instance_id,
|
|
||||||
max(metric_value) AS cumulative
|
|
||||||
FROM server_metrics
|
|
||||||
WHERE tenant_id = {tenant} AND metric_name = 'cameleer.ingestion.drops'
|
|
||||||
AND statistic = 'count' AND collected_at >= {from} AND collected_at < {to}
|
|
||||||
GROUP BY minute, reason, server_instance_id
|
|
||||||
)
|
|
||||||
SELECT minute, reason,
|
|
||||||
cumulative - lagInFrame(cumulative, 1, cumulative) OVER (
|
|
||||||
PARTITION BY reason, server_instance_id ORDER BY minute
|
|
||||||
) AS drops_per_minute
|
|
||||||
FROM sorted ORDER BY minute;
|
|
||||||
```
|
```
|
||||||
|
|
||||||
4. **Auth failures per minute** — same shape as drops, split by `reason`.
|
3. **Ingestion drops per minute** — bar chart.
|
||||||
|
```json
|
||||||
|
{ "metric": "cameleer.ingestion.drops", "statistic": "count",
|
||||||
|
"from": "{from}", "to": "{to}", "stepSeconds": 60,
|
||||||
|
"groupByTags": ["reason"], "mode": "delta" }
|
||||||
|
```
|
||||||
|
|
||||||
|
4. **Auth failures per minute** — same shape as drops, grouped by `reason`.
|
||||||
|
```json
|
||||||
|
{ "metric": "cameleer.auth.failures", "statistic": "count",
|
||||||
|
"from": "{from}", "to": "{to}", "stepSeconds": 60,
|
||||||
|
"groupByTags": ["reason"], "mode": "delta" }
|
||||||
|
```

### Row: JVM

5. **Heap used vs committed vs max** — area chart (three overlay queries).

```json
{ "metric": "jvm.memory.used", "statistic": "value",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
```

Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`.

6. **CPU %** — line.

```json
{ "metric": "process.cpu.usage", "statistic": "value",
"from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
```

Overlay with `"metric": "system.cpu.usage"`.

7. **GC pause — max per cause**.

```json
{ "metric": "jvm.gc.pause", "statistic": "max",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
```

8. **Thread count** — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`, each with `statistic=value, aggregation=avg, mode=raw`.
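
Following the same pattern as the other panels, the `jvm.threads.live` query would look like this (the other two overlays differ only in `metric`):

```json
{ "metric": "jvm.threads.live", "statistic": "value",
"from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
```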

### Row: HTTP + DB

9. **HTTP mean latency by URI** — top-N URIs.

```json
{ "metric": "http.server.requests", "statistic": "mean",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" },
"aggregation": "avg", "mode": "raw" }
```

For a p99 proxy, repeat with `"statistic": "max"`.
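
The derived `mean` statistic divides the timer's cumulative duration by its cumulative count per bucket. A minimal sketch of the arithmetic (illustrative only, not the server's implementation):

```python
def derived_mean(total_time_sum: float, count_sum: float) -> float:
    """Mean latency for one bucket: sum(total_time) / sum(count).

    Buckets with no requests yield 0.0 instead of dividing by zero.
    """
    return total_time_sum / count_sum if count_sum else 0.0

# 12.5 s of cumulative request time over 250 requests -> 0.05 s mean.
mean_latency = derived_mean(12.5, 250)
```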

10. **HTTP error rate** — two queries, divide client-side: total requests and 5xx requests.

```json
{ "metric": "http.server.requests", "statistic": "count",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"mode": "delta", "aggregation": "sum" }
```

Then for the 5xx series, add `"filterTags": { "outcome": "SERVER_ERROR" }` and divide.
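
The client-side division can be sketched as follows. This assumes each series has been flattened to `(timestamp, value)` pairs; adapt to the actual response payload:

```python
def error_rate(total_points, error_points):
    """Join two (timestamp, value) series and divide, guarding empty buckets.

    total_points / error_points are the per-bucket request counts from the
    'all requests' and 'SERVER_ERROR' delta queries. Buckets with zero total
    report a rate of 0.0 rather than dividing by zero.
    """
    errors = dict(error_points)
    return [
        (t, (errors.get(t, 0.0) / v) if v else 0.0)
        for t, v in total_points
    ]

# Example: 100 requests with 5 errors at t=0, an idle bucket at t=60.
rates = error_rate([(0, 100.0), (60, 0.0)], [(0, 5.0)])
```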

11. **HikariCP pool saturation** — overlay two queries. Sustained `pending > 0` means the pool is too small.

```json
{ "metric": "hikaricp.connections.active", "statistic": "value",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
```

Overlay with `"metric": "hikaricp.connections.pending"`.

12. **Hikari acquire timeouts per minute** — any non-zero rate is a red flag.

```json
{ "metric": "hikaricp.connections.timeout", "statistic": "count",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["pool"], "mode": "delta" }
```

### Row: alerting (collapsible)

13. **Alerting instances by state** — stacked.

```json
{ "metric": "alerting_instances_total", "statistic": "value",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
```

14. **Eval errors per minute by kind**.

```json
{ "metric": "alerting_eval_errors_total", "statistic": "count",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"groupByTags": ["kind"], "mode": "delta" }
```

15. **Webhook delivery — max per minute**.

```json
{ "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max",
"from": "{from}", "to": "{to}", "stepSeconds": 60,
"aggregation": "max", "mode": "raw" }
```

### Row: deployments (runtime-enabled only)

16. **Deploy outcomes per hour**.

```json
{ "metric": "cameleer.deployments.outcome", "statistic": "count",
"from": "{from}", "to": "{to}", "stepSeconds": 3600,
"groupByTags": ["status"], "mode": "delta" }
```

17. **Deploy duration mean**.

```json
{ "metric": "cameleer.deployments.duration", "statistic": "mean",
"from": "{from}", "to": "{to}", "stepSeconds": 300,
"aggregation": "avg", "mode": "raw" }
```

For a p99 proxy, repeat with `"statistic": "max"`.

---

## Notes for the dashboard implementer

- **Use the REST API.** The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for a Timer's cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see an explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
- **The dashboard is read-only.** There's no write path — only the server writes into `server_metrics`.
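
For reference, `mode=delta` turns cumulative counters into per-bucket increases: differences are taken per `server_instance_id`, clipped at zero so a counter reset on restart doesn't produce a negative spike, then summed across instances. An illustrative sketch of that logic (not the server's actual implementation):

```python
from collections import defaultdict

def delta_series(rows):
    """Cumulative counter samples -> per-bucket increases.

    rows: iterable of (bucket, server_instance_id, cumulative_value),
    ordered by bucket within each instance. Differences are computed per
    instance and clipped at zero, then summed across instances. The first
    sample per instance has no predecessor and contributes nothing.
    """
    last = {}                 # instance -> previous cumulative value
    out = defaultdict(float)  # bucket -> summed delta
    for bucket, instance, value in rows:
        prev = last.get(instance)
        if prev is not None:
            out[bucket] += max(0.0, value - prev)
        last[instance] = value
    return dict(out)

# Two instances; instance "b" restarts (counter drops 50 -> 3) in bucket 120,
# so its reset is clipped to 0 and only instance "a" contributes there.
deltas = delta_series([
    (60, "a", 10.0), (60, "b", 50.0),
    (120, "a", 14.0), (120, "b", 3.0),
])
```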

---

## Changelog

- 2026-04-23 — initial write. Write-only backend.
- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.