diff --git a/.claude/rules/app-classes.md b/.claude/rules/app-classes.md index f4fd2e91..f10bccbc 100644 --- a/.claude/rules/app-classes.md +++ b/.claude/rules/app-classes.md @@ -129,6 +129,8 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale ## metrics/ — Prometheus observability - `ServerMetrics` — centralized business metrics: gauges (agents by state, SSE connections, buffer depths), counters (ingestion drops, agent transitions, deployment outcomes, auth failures), timers (flush duration, deployment duration). Exposed via `/api/v1/prometheus`. +- `ServerInstanceIdConfig` — `@Configuration`, exposes `@Bean("serverInstanceId") String`. Resolution precedence: `cameleer.server.instance-id` property → `HOSTNAME` env → `InetAddress.getLocalHost()` → random UUID. Fixed at boot; rotates across restarts so counters restart cleanly. +- `ServerMetricsSnapshotScheduler` — `@Scheduled(fixedDelayString = "${cameleer.server.self-metrics.interval-ms:60000}")`. Walks `MeterRegistry.getMeters()` each tick, emits one `ServerMetricSample` per `Measurement` (Timer/DistributionSummary produce multiple rows per meter — one per Micrometer `Statistic`). Skips non-finite values; logs and swallows store failures. Disabled via `cameleer.server.self-metrics.enabled=false` (`@ConditionalOnProperty`). Write-only — no query endpoint yet; inspect via `/api/v1/admin/clickhouse/query`. ## storage/ — PostgreSQL repositories (JdbcTemplate) @@ -145,6 +147,7 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale - `ClickHouseDiagramStore`, `ClickHouseAgentEventRepository` - `ClickHouseUsageTracker` — usage_events for billing - `ClickHouseRouteCatalogStore` — persistent route catalog with first_seen cache, warm-loaded on startup +- `ClickHouseServerMetricsStore` — periodic dumps of the server's own Micrometer registry into the `server_metrics` table. 
Tenant-stamped (bound at the scheduler, not the bean); no `environment` column (server straddles envs). Batch-insert via `JdbcTemplate.batchUpdate` with `Map(String, String)` tag binding. Written by `ServerMetricsSnapshotScheduler`, query via `/api/v1/admin/clickhouse/query` (no dedicated endpoint yet). ## search/ — ClickHouse search and log stores diff --git a/.claude/rules/metrics.md b/.claude/rules/metrics.md index a2f0f365..295e47af 100644 --- a/.claude/rules/metrics.md +++ b/.claude/rules/metrics.md @@ -8,7 +8,9 @@ paths: # Prometheus Metrics -Server exposes `/api/v1/prometheus` (unauthenticated, Prometheus text format). Spring Boot Actuator provides JVM, GC, thread pool, and `http.server.requests` metrics automatically. Business metrics via `ServerMetrics` component: +Server exposes `/api/v1/prometheus` (unauthenticated, Prometheus text format). Spring Boot Actuator provides JVM, GC, thread pool, and `http.server.requests` metrics automatically. Business metrics via `ServerMetrics` component. + +The same `MeterRegistry` is also snapshotted to ClickHouse every 60 s by `ServerMetricsSnapshotScheduler` (see "Server self-metrics persistence" at the bottom of this file) — so historical server-health data survives restarts without an external Prometheus. ## Gauges (auto-polled) @@ -83,3 +85,23 @@ Mean processing time = `camel.route.policy.total_time / camel.route.policy.count | `cameleer.sse.reconnects.count` | counter | `instanceId` | | `cameleer.taps.evaluated.count` | counter | `instanceId` | | `cameleer.metrics.exported.count` | counter | `instanceId` | + +## Server self-metrics persistence + +`ServerMetricsSnapshotScheduler` walks `MeterRegistry.getMeters()` every 60 s (configurable via `cameleer.server.self-metrics.interval-ms`) and writes one row per Micrometer `Measurement` to the ClickHouse `server_metrics` table. 
Full registry is captured — Spring Boot Actuator series (`jvm.*`, `process.*`, `http.server.requests`, `hikaricp.*`, `jdbc.*`, `tomcat.*`, `logback.events`, `system.*`) plus `cameleer.*` and `alerting_*`. + +**Table** (`cameleer-server-app/src/main/resources/clickhouse/init.sql`): + +``` +server_metrics(tenant_id, collected_at, server_instance_id, + metric_name, metric_type, statistic, metric_value, + tags Map(String,String), server_received_at) +``` + +- `metric_type` — lowercase Micrometer `Meter.Type` (counter, gauge, timer, distribution_summary, long_task_timer, other) +- `statistic` — Micrometer `Statistic.getTagValueRepresentation()` (value, count, total, total_time, max, mean, active_tasks, duration). Timers emit 3 rows per tick (count + total_time + max); gauges/counters emit 1 (`statistic='value'` or `'count'`). +- No `environment` column — the server is env-agnostic. +- `tenant_id` threaded from `cameleer.server.tenant.id` (single-tenant per server). +- `server_instance_id` resolved once at boot by `ServerInstanceIdConfig` (property → HOSTNAME → localhost → UUID fallback). Rotates across restarts so counter resets are unambiguous. +- TTL: 90 days (vs 365 for `agent_metrics`). Write-only in v1 — no query endpoint or UI page. Inspect via ClickHouse admin: `/api/v1/admin/clickhouse/query` or direct SQL. +- Toggle off entirely with `cameleer.server.self-metrics.enabled=false` (uses `@ConditionalOnProperty`). 
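The one-row-per-`Measurement` fan-out can be sketched in isolation. This is a minimal illustration with stand-in record types — `Measurement` and `Row` here are simplified placeholders, not the real Micrometer or Cameleer classes:

```java
import java.util.ArrayList;
import java.util.List;

public class RowFanOutSketch {

    // Stand-ins for Micrometer's Measurement and a server_metrics row (illustrative only).
    record Measurement(String statistic, double value) {}
    record Row(String metricName, String statistic, double value) {}

    // One row per measurement; non-finite values are skipped, mirroring the scheduler.
    static List<Row> snapshot(String meterName, List<Measurement> measurements) {
        List<Row> rows = new ArrayList<>();
        for (Measurement m : measurements) {
            if (!Double.isFinite(m.value())) continue;
            rows.add(new Row(meterName, m.statistic(), m.value()));
        }
        return rows;
    }

    public static void main(String[] args) {
        // A timer-like meter exposing count/total_time/max fans out to three rows per tick.
        List<Row> rows = snapshot("cameleer.ingestion.flush.duration", List.of(
                new Measurement("count", 12),
                new Measurement("total_time", 0.340),
                new Measurement("max", 0.051),
                new Measurement("mean", Double.NaN))); // non-finite: dropped before insert
        System.out.println(rows.size()); // prints 3
    }
}
```

The non-finite filter is why the table needs no null handling: a gauge that has no value yet reports NaN and simply produces no row that tick.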
diff --git a/cameleer-server-app/src/main/java/com/cameleer/server/app/config/StorageBeanConfig.java b/cameleer-server-app/src/main/java/com/cameleer/server/app/config/StorageBeanConfig.java index 6bb91336..e8996f49 100644 --- a/cameleer-server-app/src/main/java/com/cameleer/server/app/config/StorageBeanConfig.java +++ b/cameleer-server-app/src/main/java/com/cameleer/server/app/config/StorageBeanConfig.java @@ -9,6 +9,7 @@ import com.cameleer.server.app.storage.ClickHouseRouteCatalogStore; import com.cameleer.server.core.storage.RouteCatalogStore; import com.cameleer.server.app.storage.ClickHouseMetricsQueryStore; import com.cameleer.server.app.storage.ClickHouseMetricsStore; +import com.cameleer.server.app.storage.ClickHouseServerMetricsStore; import com.cameleer.server.app.storage.ClickHouseStatsStore; import com.cameleer.server.core.admin.AuditRepository; import com.cameleer.server.core.admin.AuditService; @@ -67,6 +68,12 @@ public class StorageBeanConfig { return new ClickHouseMetricsQueryStore(tenantProperties.getId(), clickHouseJdbc); } + @Bean + public ServerMetricsStore clickHouseServerMetricsStore( + @Qualifier("clickHouseJdbcTemplate") JdbcTemplate clickHouseJdbc) { + return new ClickHouseServerMetricsStore(clickHouseJdbc); + } + // ── Execution Store ────────────────────────────────────────────────── @Bean diff --git a/cameleer-server-app/src/main/java/com/cameleer/server/app/metrics/ServerInstanceIdConfig.java b/cameleer-server-app/src/main/java/com/cameleer/server/app/metrics/ServerInstanceIdConfig.java new file mode 100644 index 00000000..2c44e159 --- /dev/null +++ b/cameleer-server-app/src/main/java/com/cameleer/server/app/metrics/ServerInstanceIdConfig.java @@ -0,0 +1,63 @@ +package com.cameleer.server.app.metrics; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import org.springframework.beans.factory.annotation.Value; +import org.springframework.context.annotation.Bean; +import org.springframework.context.annotation.Configuration; + 
+import java.net.InetAddress; +import java.net.UnknownHostException; +import java.util.UUID; + +/** + * Resolves a stable identifier for this server process, used as the + * {@code server_instance_id} on every server_metrics sample. The value is + * fixed at boot, so counters restart cleanly whenever the id rotates. + * + *

<p>Precedence:
 + * <pre>
 + *   1. {@code cameleer.server.instance-id} property / {@code CAMELEER_SERVER_INSTANCE_ID} env
 + *   2. {@code HOSTNAME} env (populated by Docker/Kubernetes)
 + *   3. {@link InetAddress#getLocalHost()} hostname
 + *   4. Random UUID (fallback — only hit when DNS and env are both silent)
 + * </pre>
+ */ +@Configuration +public class ServerInstanceIdConfig { + + private static final Logger log = LoggerFactory.getLogger(ServerInstanceIdConfig.class); + + @Bean("serverInstanceId") + public String serverInstanceId( + @Value("${cameleer.server.instance-id:}") String configuredId) { + if (!isBlank(configuredId)) { + log.info("Server instance id resolved from configuration: {}", configuredId); + return configuredId; + } + + String hostnameEnv = System.getenv("HOSTNAME"); + if (!isBlank(hostnameEnv)) { + log.info("Server instance id resolved from HOSTNAME env: {}", hostnameEnv); + return hostnameEnv; + } + + try { + String localHost = InetAddress.getLocalHost().getHostName(); + if (!isBlank(localHost)) { + log.info("Server instance id resolved from localhost lookup: {}", localHost); + return localHost; + } + } catch (UnknownHostException e) { + log.debug("InetAddress.getLocalHost() failed, falling back to UUID: {}", e.getMessage()); + } + + String fallback = UUID.randomUUID().toString(); + log.warn("Server instance id could not be resolved; using random UUID {}", fallback); + return fallback; + } + + private static boolean isBlank(String s) { + return s == null || s.isBlank(); + } +} diff --git a/cameleer-server-app/src/main/java/com/cameleer/server/app/metrics/ServerMetricsSnapshotScheduler.java b/cameleer-server-app/src/main/java/com/cameleer/server/app/metrics/ServerMetricsSnapshotScheduler.java new file mode 100644 index 00000000..2d483aae --- /dev/null +++ b/cameleer-server-app/src/main/java/com/cameleer/server/app/metrics/ServerMetricsSnapshotScheduler.java @@ -0,0 +1,106 @@ +package com.cameleer.server.app.metrics; + +import com.cameleer.server.core.storage.ServerMetricsStore; +import com.cameleer.server.core.storage.model.ServerMetricSample; +import io.micrometer.core.instrument.Measurement; +import io.micrometer.core.instrument.Meter; +import io.micrometer.core.instrument.MeterRegistry; +import io.micrometer.core.instrument.Tag; +import org.slf4j.Logger; 
+import org.slf4j.LoggerFactory; +import org.springframework.beans.factory.annotation.Qualifier; +import org.springframework.beans.factory.annotation.Value; +import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty; +import org.springframework.scheduling.annotation.Scheduled; +import org.springframework.stereotype.Component; + +import java.time.Instant; +import java.util.ArrayList; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; + +/** + * Periodically snapshots every meter in the server's {@link MeterRegistry} + * and writes the result to ClickHouse via {@link ServerMetricsStore}. This + * gives us historical server-health data (buffer depths, agent transitions, + * flush latency, JVM memory, HTTP response counts, etc.) without requiring + * an external Prometheus. + * + *

<p>Each Micrometer {@link Meter#measure() measurement} becomes one row, so
 + * a single Timer produces rows for {@code count}, {@code total_time}, and
 + * {@code max} each tick. Counter values are cumulative since meter
 + * registration (Prometheus convention) — callers compute rate() themselves.
 + *
 + *

<p>Disabled via {@code cameleer.server.self-metrics.enabled=false}.
+ */
+@Component
+@ConditionalOnProperty(
+        prefix = "cameleer.server.self-metrics",
+        name = "enabled",
+        havingValue = "true",
+        matchIfMissing = true)
+public class ServerMetricsSnapshotScheduler {
+
+    private static final Logger log = LoggerFactory.getLogger(ServerMetricsSnapshotScheduler.class);
+
+    private final MeterRegistry registry;
+    private final ServerMetricsStore store;
+    private final String tenantId;
+    private final String serverInstanceId;
+
+    public ServerMetricsSnapshotScheduler(
+            MeterRegistry registry,
+            ServerMetricsStore store,
+            @Value("${cameleer.server.tenant.id:default}") String tenantId,
+            @Qualifier("serverInstanceId") String serverInstanceId) {
+        this.registry = registry;
+        this.store = store;
+        this.tenantId = tenantId;
+        this.serverInstanceId = serverInstanceId;
+    }
+
+    @Scheduled(fixedDelayString = "${cameleer.server.self-metrics.interval-ms:60000}",
+            initialDelayString = "${cameleer.server.self-metrics.interval-ms:60000}")
+    public void snapshot() {
+        try {
+            Instant now = Instant.now();
+            List<ServerMetricSample> batch = new ArrayList<>();
+
+            for (Meter meter : registry.getMeters()) {
+                Meter.Id id = meter.getId();
+                Map<String, String> tags = flattenTags(id.getTagsAsIterable());
+                String type = id.getType().name().toLowerCase();
+
+                for (Measurement m : meter.measure()) {
+                    double v = m.getValue();
+                    if (!Double.isFinite(v)) continue;
+                    batch.add(new ServerMetricSample(
+                            tenantId,
+                            now,
+                            serverInstanceId,
+                            id.getName(),
+                            type,
+                            m.getStatistic().getTagValueRepresentation(),
+                            v,
+                            tags));
+                }
+            }
+
+            if (!batch.isEmpty()) {
+                store.insertBatch(batch);
+                log.debug("Persisted {} server self-metric samples", batch.size());
+            }
+        } catch (Exception e) {
+            log.warn("Server self-metrics snapshot failed: {}", e.getMessage());
+        }
+    }
+
+    private static Map<String, String> flattenTags(Iterable<Tag> tags) {
+        Map<String, String> out = new LinkedHashMap<>();
+        for (Tag t : tags) {
+            out.put(t.getKey(), t.getValue());
+        }
+        return out;
+    }
+}
diff --git a/cameleer-server-app/src/main/java/com/cameleer/server/app/storage/ClickHouseServerMetricsStore.java b/cameleer-server-app/src/main/java/com/cameleer/server/app/storage/ClickHouseServerMetricsStore.java
new file mode 100644
index 00000000..3230f3e7
--- /dev/null
+++ b/cameleer-server-app/src/main/java/com/cameleer/server/app/storage/ClickHouseServerMetricsStore.java
@@ -0,0 +1,46 @@
+package com.cameleer.server.app.storage;
+
+import com.cameleer.server.core.storage.ServerMetricsStore;
+import com.cameleer.server.core.storage.model.ServerMetricSample;
+import org.springframework.jdbc.core.JdbcTemplate;
+
+import java.sql.Timestamp;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class ClickHouseServerMetricsStore implements ServerMetricsStore {
+
+    private final JdbcTemplate jdbc;
+
+    public ClickHouseServerMetricsStore(JdbcTemplate jdbc) {
+        this.jdbc = jdbc;
+    }
+
+    @Override
+    public void insertBatch(List<ServerMetricSample> samples) {
+        if (samples.isEmpty()) return;
+
+        jdbc.batchUpdate("""
+                INSERT INTO server_metrics
+                    (tenant_id, collected_at, server_instance_id, metric_name,
+                     metric_type, statistic, metric_value, tags)
+                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
+ """, + samples.stream().map(s -> new Object[]{ + s.tenantId(), + Timestamp.from(s.collectedAt()), + s.serverInstanceId(), + s.metricName(), + s.metricType(), + s.statistic(), + s.value(), + tagsToClickHouseMap(s.tags()) + }).toList()); + } + + private Map tagsToClickHouseMap(Map tags) { + if (tags == null || tags.isEmpty()) return new HashMap<>(); + return new HashMap<>(tags); + } +} diff --git a/cameleer-server-app/src/main/resources/application.yml b/cameleer-server-app/src/main/resources/application.yml index 671d8029..3ded9adf 100644 --- a/cameleer-server-app/src/main/resources/application.yml +++ b/cameleer-server-app/src/main/resources/application.yml @@ -112,6 +112,10 @@ cameleer: url: ${CAMELEER_SERVER_CLICKHOUSE_URL:jdbc:clickhouse://localhost:8123/cameleer} username: ${CAMELEER_SERVER_CLICKHOUSE_USERNAME:default} password: ${CAMELEER_SERVER_CLICKHOUSE_PASSWORD:} + self-metrics: + enabled: ${CAMELEER_SERVER_SELFMETRICS_ENABLED:true} + interval-ms: ${CAMELEER_SERVER_SELFMETRICS_INTERVALMS:60000} + instance-id: ${CAMELEER_SERVER_INSTANCE_ID:} springdoc: api-docs: diff --git a/cameleer-server-app/src/main/resources/clickhouse/init.sql b/cameleer-server-app/src/main/resources/clickhouse/init.sql index 598ac6ec..ee65dd21 100644 --- a/cameleer-server-app/src/main/resources/clickhouse/init.sql +++ b/cameleer-server-app/src/main/resources/clickhouse/init.sql @@ -401,6 +401,29 @@ CREATE TABLE IF NOT EXISTS route_catalog ( ENGINE = ReplacingMergeTree(last_seen) ORDER BY (tenant_id, environment, application_id, route_id); +-- ── Server Self-Metrics ──────────────────────────────────────────────── +-- Periodic snapshot of the server's own Micrometer registry (written by +-- ServerMetricsSnapshotScheduler). No `environment` column — the server +-- straddles environments. `statistic` distinguishes Timer/DistributionSummary +-- sub-measurements (count, total_time, max, mean) from plain counter/gauge values. 
+ +CREATE TABLE IF NOT EXISTS server_metrics ( + tenant_id LowCardinality(String) DEFAULT 'default', + collected_at DateTime64(3), + server_instance_id LowCardinality(String), + metric_name LowCardinality(String), + metric_type LowCardinality(String), + statistic LowCardinality(String) DEFAULT 'value', + metric_value Float64, + tags Map(String, String) DEFAULT map(), + server_received_at DateTime64(3) DEFAULT now64(3) +) +ENGINE = MergeTree() +PARTITION BY (tenant_id, toYYYYMM(collected_at)) +ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic) +TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE +SETTINGS index_granularity = 8192; + -- insert_id tiebreak for keyset pagination (fixes same-millisecond cursor collision). -- IF NOT EXISTS on ADD COLUMN is idempotent. MATERIALIZE COLUMN is a background mutation, -- effectively a no-op once all parts are already materialized. diff --git a/cameleer-server-app/src/test/java/com/cameleer/server/app/metrics/ServerMetricsSnapshotSchedulerTest.java b/cameleer-server-app/src/test/java/com/cameleer/server/app/metrics/ServerMetricsSnapshotSchedulerTest.java new file mode 100644 index 00000000..63147616 --- /dev/null +++ b/cameleer-server-app/src/test/java/com/cameleer/server/app/metrics/ServerMetricsSnapshotSchedulerTest.java @@ -0,0 +1,130 @@ +package com.cameleer.server.app.metrics; + +import com.cameleer.server.core.storage.ServerMetricsStore; +import com.cameleer.server.core.storage.model.ServerMetricSample; +import io.micrometer.core.instrument.Counter; +import io.micrometer.core.instrument.Gauge; +import io.micrometer.core.instrument.MeterRegistry; +import io.micrometer.core.instrument.Timer; +import io.micrometer.core.instrument.simple.SimpleMeterRegistry; +import org.junit.jupiter.api.Test; + +import java.time.Duration; +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.atomic.AtomicInteger; + +import static org.assertj.core.api.Assertions.assertThat; + +class 
ServerMetricsSnapshotSchedulerTest {
+
+    @Test
+    void snapshot_capturesCounterGaugeAndTimerMeasurements() {
+        MeterRegistry registry = new SimpleMeterRegistry();
+
+        Counter counter = Counter.builder("cameleer.test.counter")
+                .tag("env", "dev")
+                .register(registry);
+        counter.increment(3);
+
+        AtomicInteger gaugeSource = new AtomicInteger(42);
+        Gauge.builder("cameleer.test.gauge", gaugeSource, AtomicInteger::doubleValue)
+                .register(registry);
+
+        Timer timer = Timer.builder("cameleer.test.timer").register(registry);
+        timer.record(Duration.ofMillis(5));
+        timer.record(Duration.ofMillis(15));
+
+        RecordingStore store = new RecordingStore();
+        ServerMetricsSnapshotScheduler scheduler =
+                new ServerMetricsSnapshotScheduler(registry, store, "tenant-7", "server-A");
+
+        scheduler.snapshot();
+
+        assertThat(store.batches).hasSize(1);
+        List<ServerMetricSample> samples = store.batches.get(0);
+
+        // Every sample is stamped with tenant + instance + finite value
+        assertThat(samples).allSatisfy(s -> {
+            assertThat(s.tenantId()).isEqualTo("tenant-7");
+            assertThat(s.serverInstanceId()).isEqualTo("server-A");
+            assertThat(Double.isFinite(s.value())).isTrue();
+            assertThat(s.collectedAt()).isNotNull();
+        });
+
+        // Counter -> 1 row with statistic=count, value=3, tag propagated
+        List<ServerMetricSample> counterRows = samples.stream()
+                .filter(s -> s.metricName().equals("cameleer.test.counter"))
+                .toList();
+        assertThat(counterRows).hasSize(1);
+        assertThat(counterRows.get(0).statistic()).isEqualTo("count");
+        assertThat(counterRows.get(0).metricType()).isEqualTo("counter");
+        assertThat(counterRows.get(0).value()).isEqualTo(3.0);
+        assertThat(counterRows.get(0).tags()).containsEntry("env", "dev");
+
+        // Gauge -> 1 row with statistic=value
+        List<ServerMetricSample> gaugeRows = samples.stream()
+                .filter(s -> s.metricName().equals("cameleer.test.gauge"))
+                .toList();
+        assertThat(gaugeRows).hasSize(1);
+        assertThat(gaugeRows.get(0).statistic()).isEqualTo("value");
+        assertThat(gaugeRows.get(0).metricType()).isEqualTo("gauge");
assertThat(gaugeRows.get(0).value()).isEqualTo(42.0);
+
+        // Timer -> emits multiple statistics (count, total_time, max)
+        List<ServerMetricSample> timerRows = samples.stream()
+                .filter(s -> s.metricName().equals("cameleer.test.timer"))
+                .toList();
+        assertThat(timerRows).isNotEmpty();
+        // SimpleMeterRegistry emits Statistic.TOTAL ("total"); other registries (Prometheus)
+        // emit TOTAL_TIME ("total_time"). Accept either so the test isn't registry-coupled.
+        assertThat(timerRows).extracting(ServerMetricSample::statistic)
+                .contains("count", "max");
+        assertThat(timerRows).extracting(ServerMetricSample::statistic)
+                .containsAnyOf("total_time", "total");
+        assertThat(timerRows).allSatisfy(s ->
+                assertThat(s.metricType()).isEqualTo("timer"));
+        ServerMetricSample count = timerRows.stream()
+                .filter(s -> s.statistic().equals("count"))
+                .findFirst().orElseThrow();
+        assertThat(count.value()).isEqualTo(2.0);
+    }
+
+    @Test
+    void snapshot_withEmptyRegistry_doesNotWriteBatch() {
+        MeterRegistry registry = new SimpleMeterRegistry();
+        // Force removal of any auto-registered meters (SimpleMeterRegistry has none by default).
+        RecordingStore store = new RecordingStore();
+        ServerMetricsSnapshotScheduler scheduler =
+                new ServerMetricsSnapshotScheduler(registry, store, "t", "s");
+
+        scheduler.snapshot();
+
+        assertThat(store.batches).isEmpty();
+    }
+
+    @Test
+    void snapshot_swallowsStoreFailures() {
+        MeterRegistry registry = new SimpleMeterRegistry();
+        Counter.builder("cameleer.test").register(registry).increment();
+
+        ServerMetricsStore throwingStore = batch -> {
+            throw new RuntimeException("clickhouse down");
+        };
+
+        ServerMetricsSnapshotScheduler scheduler =
+                new ServerMetricsSnapshotScheduler(registry, throwingStore, "t", "s");
+
+        // Must not propagate — the scheduler thread would otherwise die.
+        scheduler.snapshot();
+    }
+
+    private static final class RecordingStore implements ServerMetricsStore {
+        final List<List<ServerMetricSample>> batches = new ArrayList<>();
+
+        @Override
+        public void insertBatch(List<ServerMetricSample> samples) {
+            batches.add(List.copyOf(samples));
+        }
+    }
+}
diff --git a/cameleer-server-app/src/test/java/com/cameleer/server/app/storage/ClickHouseServerMetricsStoreIT.java b/cameleer-server-app/src/test/java/com/cameleer/server/app/storage/ClickHouseServerMetricsStoreIT.java
new file mode 100644
index 00000000..e37abddb
--- /dev/null
+++ b/cameleer-server-app/src/test/java/com/cameleer/server/app/storage/ClickHouseServerMetricsStoreIT.java
@@ -0,0 +1,117 @@
+package com.cameleer.server.app.storage;
+
+import com.cameleer.server.core.storage.model.ServerMetricSample;
+import com.zaxxer.hikari.HikariDataSource;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.springframework.jdbc.core.JdbcTemplate;
+import org.testcontainers.clickhouse.ClickHouseContainer;
+import org.testcontainers.junit.jupiter.Container;
+import org.testcontainers.junit.jupiter.Testcontainers;
+
+import java.time.Instant;
+import java.util.List;
+import java.util.Map;
+
+import static org.assertj.core.api.Assertions.assertThat;
+
+@Testcontainers
+class ClickHouseServerMetricsStoreIT {
+
+    @Container
+    static final ClickHouseContainer clickhouse =
+            new ClickHouseContainer("clickhouse/clickhouse-server:24.12");
+
+    private JdbcTemplate jdbc;
+    private ClickHouseServerMetricsStore store;
+
+    @BeforeEach
+    void setUp() {
+        HikariDataSource ds = new HikariDataSource();
+        ds.setJdbcUrl(clickhouse.getJdbcUrl());
+        ds.setUsername(clickhouse.getUsername());
+        ds.setPassword(clickhouse.getPassword());
+
+        jdbc = new JdbcTemplate(ds);
+
+        jdbc.execute("""
+                CREATE TABLE IF NOT EXISTS server_metrics (
+                    tenant_id LowCardinality(String) DEFAULT 'default',
+                    collected_at DateTime64(3),
+                    server_instance_id LowCardinality(String),
+                    metric_name LowCardinality(String),
metric_type LowCardinality(String), + statistic LowCardinality(String) DEFAULT 'value', + metric_value Float64, + tags Map(String, String) DEFAULT map(), + server_received_at DateTime64(3) DEFAULT now64(3) + ) + ENGINE = MergeTree() + ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic) + """); + + jdbc.execute("TRUNCATE TABLE server_metrics"); + + store = new ClickHouseServerMetricsStore(jdbc); + } + + @Test + void insertBatch_roundTripsAllColumns() { + Instant ts = Instant.parse("2026-04-23T12:00:00Z"); + store.insertBatch(List.of( + new ServerMetricSample("tenant-a", ts, "srv-1", + "cameleer.ingestion.drops", "counter", "count", 17.0, + Map.of("reason", "buffer_full")), + new ServerMetricSample("tenant-a", ts, "srv-1", + "jvm.memory.used", "gauge", "value", 1_048_576.0, + Map.of("area", "heap", "id", "G1 Eden Space")) + )); + + Integer count = jdbc.queryForObject( + "SELECT count() FROM server_metrics WHERE tenant_id = 'tenant-a'", + Integer.class); + assertThat(count).isEqualTo(2); + + Double dropsValue = jdbc.queryForObject( + """ + SELECT metric_value FROM server_metrics + WHERE tenant_id = 'tenant-a' + AND server_instance_id = 'srv-1' + AND metric_name = 'cameleer.ingestion.drops' + AND statistic = 'count' + """, + Double.class); + assertThat(dropsValue).isEqualTo(17.0); + + String heapArea = jdbc.queryForObject( + """ + SELECT tags['area'] FROM server_metrics + WHERE tenant_id = 'tenant-a' + AND metric_name = 'jvm.memory.used' + """, + String.class); + assertThat(heapArea).isEqualTo("heap"); + } + + @Test + void insertBatch_emptyList_doesNothing() { + store.insertBatch(List.of()); + + Integer count = jdbc.queryForObject( + "SELECT count() FROM server_metrics", Integer.class); + assertThat(count).isEqualTo(0); + } + + @Test + void insertBatch_nullTags_storesEmptyMap() { + store.insertBatch(List.of( + new ServerMetricSample("default", Instant.parse("2026-04-23T12:00:00Z"), + "srv-2", "process.cpu.usage", "gauge", "value", 0.12, null) 
+        ));
+
+        Integer count = jdbc.queryForObject(
+                "SELECT count() FROM server_metrics WHERE server_instance_id = 'srv-2'",
+                Integer.class);
+        assertThat(count).isEqualTo(1);
+    }
+}
diff --git a/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/ServerMetricsStore.java b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/ServerMetricsStore.java
new file mode 100644
index 00000000..70db4e5f
--- /dev/null
+++ b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/ServerMetricsStore.java
@@ -0,0 +1,16 @@
+package com.cameleer.server.core.storage;
+
+import com.cameleer.server.core.storage.model.ServerMetricSample;
+
+import java.util.List;
+
+/**
+ * Sink for periodic snapshots of the server's own Micrometer meter registry.
+ * Implementations persist the samples (e.g. to ClickHouse) so server
+ * self-metrics survive restarts and can be queried historically without an
+ * external Prometheus.
+ */
+public interface ServerMetricsStore {
+
+    void insertBatch(List<ServerMetricSample> samples);
+}
diff --git a/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricSample.java b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricSample.java
new file mode 100644
index 00000000..50bbb5d4
--- /dev/null
+++ b/cameleer-server-core/src/main/java/com/cameleer/server/core/storage/model/ServerMetricSample.java
@@ -0,0 +1,23 @@
+package com.cameleer.server.core.storage.model;
+
+import java.time.Instant;
+import java.util.Map;
+
+/**
+ * A single sample of the server's own Micrometer registry, captured by a
+ * scheduled snapshot and destined for the ClickHouse {@code server_metrics}
+ * table. One {@code ServerMetricSample} per Micrometer {@code Measurement},
+ * so Timers and DistributionSummaries produce multiple samples per tick
+ * (distinguished by {@link #statistic()}).
+ */
+public record ServerMetricSample(
+        String tenantId,
+        Instant collectedAt,
+        String serverInstanceId,
+        String metricName,
+        String metricType,
+        String statistic,
+        double value,
+        Map<String, String> tags
+) {
+}
diff --git a/docs/SERVER-CAPABILITIES.md b/docs/SERVER-CAPABILITIES.md
index 11fb3cf6..2ccada24 100644
--- a/docs/SERVER-CAPABILITIES.md
+++ b/docs/SERVER-CAPABILITIES.md
@@ -204,6 +204,12 @@ All query endpoints require JWT with `VIEWER` role or higher.
 | `GET /api/v1/agents/events-log` | Agent lifecycle event history |
 | `GET /api/v1/agents/{id}/metrics` | Agent-level metrics time series |

+### Server Self-Metrics
+
+The server snapshots its own Micrometer registry into ClickHouse every 60 s (table `server_metrics`) — JVM, HTTP, DB pools, agent/ingestion business metrics, and alerting metrics. Use this instead of running an external Prometheus when building a server-health dashboard. The live scrape endpoint `/api/v1/prometheus` remains available for traditional scraping.
+
+See [`docs/server-self-metrics.md`](./server-self-metrics.md) for the full metric catalog, suggested panels, and example queries.
+
 ---

 ## Application Configuration
diff --git a/docs/server-self-metrics.md b/docs/server-self-metrics.md
new file mode 100644
index 00000000..38a267b7
--- /dev/null
+++ b/docs/server-self-metrics.md
@@ -0,0 +1,346 @@
+# Server Self-Metrics — Reference for Dashboard Builders
+
+This is the reference for the SaaS team building the server-health dashboard. It documents the `server_metrics` ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.
+
+> **tl;dr** — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.
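
Counters land in the table as cumulative totals, so any consumer reading it from JVM code (rather than SQL) needs the same segment-aware delta logic described under "Counter semantics" below. A self-contained sketch — `CounterPoint` is an illustrative type, not part of the codebase, and the input is assumed sorted by `collected_at`:

```java
import java.util.ArrayList;
import java.util.List;

public class CounterDeltaSketch {

    // One cumulative counter sample, as read back from server_metrics (illustrative type).
    record CounterPoint(String instanceId, long epochSec, double cumulative) {}

    // Per-interval deltas; a change of server_instance_id means the server restarted
    // and the counter reset, so start a fresh segment instead of emitting a bogus
    // negative delta.
    static List<Double> deltas(List<CounterPoint> points) {
        List<Double> out = new ArrayList<>();
        CounterPoint prev = null;
        for (CounterPoint p : points) {
            if (prev != null && prev.instanceId().equals(p.instanceId())) {
                out.add(p.cumulative() - prev.cumulative());
            }
            prev = p;
        }
        return out;
    }

    public static void main(String[] args) {
        List<CounterPoint> pts = List.of(
                new CounterPoint("srv-A", 0, 10),
                new CounterPoint("srv-A", 60, 17),   // +7 in this interval
                new CounterPoint("srv-B", 120, 2),   // restart: new segment, no delta
                new CounterPoint("srv-B", 180, 5));  // +3
        System.out.println(deltas(pts)); // prints [7.0, 3.0]
    }
}
```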
+ +--- + +## Table schema + +```sql +server_metrics ( + tenant_id LowCardinality(String) DEFAULT 'default', + collected_at DateTime64(3), + server_instance_id LowCardinality(String), + metric_name LowCardinality(String), + metric_type LowCardinality(String), -- counter|gauge|timer|distribution_summary|long_task_timer|other + statistic LowCardinality(String) DEFAULT 'value', + metric_value Float64, + tags Map(String, String) DEFAULT map(), + server_received_at DateTime64(3) DEFAULT now64(3) +) +ENGINE = MergeTree() +PARTITION BY (tenant_id, toYYYYMM(collected_at)) +ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic) +TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE +``` + +### What each column means + +| Column | Notes | +|---|---| +| `tenant_id` | Always filter by this. One tenant per server deployment. | +| `server_instance_id` | Stable id per server process: property → `HOSTNAME` env → DNS → random UUID. **Rotates on restart**, so counters restart cleanly. | +| `metric_name` | Raw Micrometer meter name. Dots, not underscores. | +| `metric_type` | Lowercase Micrometer `Meter.Type`. | +| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. | +| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. | +| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. | + +### Counter semantics (important) + +Counters are **cumulative totals since meter registration**, same convention as Prometheus. 
To get a rate, compute a delta within a `server_instance_id`: + +```sql +SELECT + toStartOfMinute(collected_at) AS minute, + metric_value - any(metric_value) OVER ( + PARTITION BY server_instance_id, metric_name, tags + ORDER BY collected_at + ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING + ) AS per_minute_delta +FROM server_metrics +WHERE metric_name = 'cameleer.ingestion.drops' + AND statistic = 'count' +ORDER BY minute; +``` + +On restart the `server_instance_id` rotates, so a simple `LAG()` partitioned by `server_instance_id` gives monotonic segments without fighting counter resets. + +### Retention + +90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more. + +--- + +## How to query + +### Via the admin ClickHouse endpoint + +``` +POST /api/v1/admin/clickhouse/query +Authorization: Bearer +Content-Type: text/plain + +SELECT metric_name, statistic, count() +FROM server_metrics +WHERE collected_at >= now() - INTERVAL 1 HOUR +GROUP BY 1, 2 ORDER BY 1, 2 +``` + +Requires `infrastructureendpoints=true` and the `ADMIN` role. For a SaaS control plane you will likely want a dedicated read-only CH user scoped to this table — the `/api/v1/admin/clickhouse/query` path is a human-facing admin tool, not a programmatic API. + +### Direct JDBC (recommended for the dashboard) + +Read directly from ClickHouse (read-only user, `GRANT SELECT ON cameleer.server_metrics TO dashboard_ro`). All queries must filter by `tenant_id`. + +--- + +## Metric catalog + +Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest. + +### Cameleer business metrics — agent + ingestion + +Source: `cameleer-server-app/.../metrics/ServerMetrics.java`. 
+ +| Metric | Type | Statistic | Tags | Meaning | +|---|---|---|---|---| +| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state | +| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) | +| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions | +| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging | +| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator | +| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. | +| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type | + +### Cameleer business metrics — deploy + auth + +| Metric | Type | Statistic | Tags | Meaning | +|---|---|---|---|---| +| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot | +| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency | +| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes | + +### Alerting subsystem metrics + +Source: `cameleer-server-app/.../alerting/metrics/AlertingMetrics.java`. + +| Metric | Type | Statistic | Tags | Meaning | +|---|---|---|---|---| +| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` | +| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) 
| Cached 30 s from PostgreSQL `alert_instances` | +| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind | +| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind | +| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency | +| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency | +| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes | + +### JVM — memory, GC, threads, classes + +From Spring Boot Actuator (`JvmMemoryMetrics`, `JvmGcMetrics`, `JvmThreadMetrics`, `ClassLoaderMetrics`). + +| Metric | Type | Tags | Meaning | +|---|---|---|---| +| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool | +| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool | +| `jvm.memory.max` | gauge | `area`, `id` | Pool max | +| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection | +| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes | +| `jvm.buffer.count` | gauge | `id` | NIO buffer count | +| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity | +| `jvm.threads.live` | gauge | — | Current live thread count | +| `jvm.threads.daemon` | gauge | — | Current daemon thread count | +| `jvm.threads.peak` | gauge | — | Peak thread count since start | +| `jvm.threads.started` | counter | — | Cumulative threads started | +| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state | +| `jvm.classes.loaded` | gauge | — | Currently-loaded classes | +| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes | +| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch `max` | +| 
`jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) | +| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen | +| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen | +| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) | +| `jvm.gc.live.data.size` | gauge | — | Live data after last collection | +| `jvm.gc.max.data.size` | gauge | — | Max old-gen size | +| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant `1.0`; tags carry the real info | + +### Process and system + +| Metric | Type | Tags | Meaning | +|---|---|---|---| +| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) | +| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) | +| `process.uptime` | gauge | — | ms since start | +| `process.start.time` | gauge | — | Epoch start | +| `process.files.open` | gauge | — | Open FDs | +| `process.files.max` | gauge | — | FD ulimit | +| `system.cpu.count` | gauge | — | Cores visible to the JVM | +| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) | +| `system.load.average.1m` | gauge | — | 1-min load (Unix only) | +| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR | +| `disk.total` | gauge | `path` | Total bytes | + +### HTTP server + +| Metric | Type | Tags | Meaning | +|---|---|---|---| +| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: count, total_time/total, max | +| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic | + +`uri` is the Spring-templated path (`/api/v1/environments/{envSlug}/apps/{appSlug}`), not the raw URL — cardinality stays bounded. 
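+
+A hedged sketch of the timer math behind these rows (sample numbers are invented, not from a real deployment): windowed mean latency for `http.server.requests` is the delta of `total_time` divided by the delta of `count` between two snapshot ticks, computed within one `server_instance_id` segment.
+
+```python
+# Sketch: mean http.server.requests latency between two snapshot ticks.
+# Each dict maps statistic -> metric_value for one uri/instance; the
+# values are made up for illustration.
+
+def mean_latency(earlier: dict, later: dict) -> float:
+    """Delta of total_time over delta of count between two ticks."""
+    d_count = later["count"] - earlier["count"]
+    d_total = later["total_time"] - earlier["total_time"]
+    return d_total / d_count if d_count > 0 else 0.0
+
+t0 = {"count": 100.0, "total_time": 12.5, "max": 0.9}  # seconds
+t1 = {"count": 160.0, "total_time": 21.5, "max": 1.1}
+
+print(mean_latency(t0, t1))  # 9.0 s over 60 requests -> 0.15
+```
+
+The same delta-over-delta shape works in SQL; `max` cannot be combined this way because it is a per-interval observation, not a cumulative total.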
+ +### Tomcat + +| Metric | Type | Tags | Meaning | +|---|---|---|---| +| `tomcat.sessions.active.current` | gauge | — | Currently active sessions | +| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed | +| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) | +| `tomcat.sessions.created` | counter | — | Cumulative session creates | +| `tomcat.sessions.expired` | counter | — | Cumulative expirations | +| `tomcat.sessions.rejected` | counter | — | Session creates refused | +| `tomcat.threads.current` | gauge | `name` | Connector thread count | +| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request | +| `tomcat.threads.config.max` | gauge | `name` | Configured max | + +### HikariCP (PostgreSQL pool) + +| Metric | Type | Tags | Meaning | +|---|---|---|---| +| `hikaricp.connections` | gauge | `pool` | Total connections | +| `hikaricp.connections.active` | gauge | `pool` | In-use | +| `hikaricp.connections.idle` | gauge | `pool` | Idle | +| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection | +| `hikaricp.connections.min` | gauge | `pool` | Configured min | +| `hikaricp.connections.max` | gauge | `pool` | Configured max | +| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection | +| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool | +| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use | +| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem | + +Pools are named. You'll see `HikariPool-1` (PostgreSQL) and a separate pool for ClickHouse (`clickHouseJdbcTemplate`). 
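+
+A minimal sketch of how a panel alert might consume the pending gauge (the window length here is an arbitrary illustration, not a tuned recommendation): treat the pool as saturated only when `hikaricp.connections.pending` stays above zero across several consecutive snapshot ticks, so one transient blip doesn't page anyone.
+
+```python
+# Sketch: flag sustained Hikari pool saturation from gauge samples.
+# Input: chronological hikaricp.connections.pending values for one pool.
+
+def sustained_pending(samples: list, ticks: int = 3) -> bool:
+    """True if pending > 0 for at least `ticks` consecutive snapshots."""
+    run = 0
+    for value in samples:
+        run = run + 1 if value > 0 else 0
+        if run >= ticks:
+            return True
+    return False
+
+print(sustained_pending([0, 2, 3, 1, 0]))  # True: three consecutive non-zero ticks
+print(sustained_pending([0, 1, 0, 1, 0]))  # False: never sustained
+```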
+ +### JDBC generic + +| Metric | Type | Tags | Meaning | +|---|---|---|---| +| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically | +| `jdbc.connections.max` | gauge | `name` | | +| `jdbc.connections.active` | gauge | `name` | | +| `jdbc.connections.idle` | gauge | `name` | | + +### Logging + +| Metric | Type | Tags | Meaning | +|---|---|---|---| +| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel | + +### Spring Boot lifecycle + +| Metric | Type | Tags | Meaning | +|---|---|---|---| +| `application.started.time` | timer | `main.application.class` | Cold-start duration | +| `application.ready.time` | timer | `main.application.class` | Time to ready | + +### Flyway + +| Metric | Type | Tags | Meaning | +|---|---|---|---| +| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) | + +### Executor pools (if any `@Async` executors exist) + +When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds: + +| Metric | Type | Tags | Meaning | +|---|---|---|---| +| `executor.active` | gauge | `name` | Currently-running tasks | +| `executor.queued` | gauge | `name` | Queued tasks | +| `executor.queue.remaining` | gauge | `name` | Queue headroom | +| `executor.pool.size` | gauge | `name` | Current pool size | +| `executor.pool.core` | gauge | `name` | Core size | +| `executor.pool.max` | gauge | `name` | Max size | +| `executor.completed` | counter | `name` | Completed tasks | + +--- + +## Suggested dashboard panels + +The shortlist below gives you a working health dashboard with ~12 panels. All queries assume `tenant_id` is a dashboard variable. + +### Row: server health (top of dashboard) + +1. **Agents by state** — stacked area. 
+ ```sql + SELECT toStartOfMinute(collected_at) AS t, tags['state'] AS state, avg(metric_value) AS count + FROM server_metrics + WHERE tenant_id = {tenant} AND metric_name = 'cameleer.agents.connected' + AND collected_at >= {from} AND collected_at < {to} + GROUP BY t, state ORDER BY t; + ``` + +2. **Ingestion buffer depth** — line chart by `type`. Use `cameleer.ingestion.buffer.size` same shape as above. + +3. **Ingestion drops per minute** — bar chart (per-minute delta). + ```sql + WITH sorted AS ( + SELECT toStartOfMinute(collected_at) AS minute, + tags['reason'] AS reason, + server_instance_id, + max(metric_value) AS cumulative + FROM server_metrics + WHERE tenant_id = {tenant} AND metric_name = 'cameleer.ingestion.drops' + AND statistic = 'count' AND collected_at >= {from} AND collected_at < {to} + GROUP BY minute, reason, server_instance_id + ) + SELECT minute, reason, + cumulative - lagInFrame(cumulative, 1, cumulative) OVER ( + PARTITION BY reason, server_instance_id ORDER BY minute + ) AS drops_per_minute + FROM sorted ORDER BY minute; + ``` + +4. **Auth failures per minute** — same shape as drops, split by `reason`. + +### Row: JVM + +5. **Heap used vs committed vs max** — area chart. Filter `metric_name IN ('jvm.memory.used', 'jvm.memory.committed', 'jvm.memory.max')` with `tags['area'] = 'heap'`, sum across pool `id`s. + +6. **CPU %** — line. `process.cpu.usage` and `system.cpu.usage`. + +7. **GC pause p99 + max** — `jvm.gc.pause` with statistic `max`, grouped by `tags['cause']`. + +8. **Thread count** — `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`. + +### Row: HTTP + DB + +9. **HTTP p99 by URI** — use `http.server.requests` with `statistic='max'` as a rough p99 proxy, or `total_time/count` for mean. Group by `tags['uri']`. Filter `tags['outcome'] = 'SUCCESS'`. + +10. **HTTP error rate** — count where `tags['status']` starts with `5`, divided by total. + +11. 
**HikariCP pool saturation** — overlay `hikaricp.connections.active` and `hikaricp.connections.pending`. If `pending > 0` is sustained, the pool is too small.
+
+12. **Hikari acquire timeouts per minute** — delta of `hikaricp.connections.timeout`. Any non-zero rate is a red flag.
+
+### Row: alerting (collapsible)
+
+13. **Alerting instances by state** — `alerting_instances_total` stacked by `tags['state']`.
+
+14. **Eval errors per minute by kind** — delta of `alerting_eval_errors_total` by `tags['kind']`.
+
+15. **Webhook delivery p99** — `alerting_webhook_delivery_duration_seconds` with `statistic='max'`.
+
+### Row: deployments (runtime-enabled only)
+
+16. **Deploy outcomes last 24 h** — counter delta of `cameleer.deployments.outcome` grouped by `tags['status']`.
+
+17. **Deploy duration p99** — `cameleer.deployments.duration` with `statistic='max'` (or `total_time/count` for mean).
+
+---
+
+## Notes for the dashboard implementer
+
+- **Always filter by `tenant_id`.** It's the first column in the sort key; queries that skip it scan the entire table.
+- **Prefer predicate pushdown on `metric_name` + `statistic`.** Both are `LowCardinality`, so `metric_name = 'x' AND statistic = 'count'` is cheap.
+- **Treat `server_instance_id` as a natural partition for counter math.** Never compute deltas across it — you'll get negative numbers on restart.
+- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the `statistic` string for a Timer's cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. Tests may write `total`. When in doubt, accept either.
+- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see a cardinality explosion here. Monitor `count(DISTINCT concat(metric_name, toString(tags)))` and alert if it spikes.
+- **The dashboard should be read-only.** No one writes into `server_metrics` except the server itself — there's no API to push or delete rows. + +--- + +## Changelog + +- 2026-04-23 — initial write. Write-only in v1 (no REST endpoint or admin page). Reach out to the server team before building a write-back path; we'd rather cut a proper API than have the dashboard hit ClickHouse directly forever.
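+
+The counter-math rule from the implementer notes — never subtract across `server_instance_id` — can also be applied client-side when a dashboard post-processes query results. A minimal sketch with made-up sample rows:
+
+```python
+# Sketch: restart-safe counter deltas computed client-side.
+# Mirrors the lagInFrame query: group by server_instance_id so a rotated
+# id after a restart starts a fresh monotonic segment instead of producing
+# a negative delta. Rows are (server_instance_id, cumulative_value).
+
+def deltas(rows: list) -> list:
+    """Per-tick increments; the first tick of each instance segment is 0."""
+    out, prev = [], {}
+    for instance, value in rows:
+        out.append(value - prev.get(instance, value))
+        prev[instance] = value
+    return out
+
+rows = [("a", 10.0), ("a", 25.0), ("b", 3.0), ("b", 7.0)]  # "b" = post-restart id
+print(deltas(rows))  # [0.0, 15.0, 0.0, 4.0]
+```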