---
paths:
- "cameleer-server-app/**/metrics/**"
- "cameleer-server-app/**/ServerMetrics*"
- "ui/src/pages/RuntimeTab/**"
- "ui/src/pages/DashboardTab/**"
---
# Prometheus Metrics
The server exposes `/api/v1/prometheus` (unauthenticated, Prometheus text format). Spring Boot Actuator provides JVM, GC, thread pool, and `http.server.requests` metrics automatically. Business metrics come from the `ServerMetrics` component.
The same `MeterRegistry` is also snapshotted to ClickHouse every 60 s by `ServerMetricsSnapshotScheduler` (see "Server self-metrics persistence" at the bottom of this file) — so historical server-health data survives restarts without an external Prometheus.
## Gauges (auto-polled)
| Metric | Tags | Source |
|--------|------|--------|
| `cameleer.agents.connected` | `state` (live, stale, dead, shutdown) | `AgentRegistryService.findByState()` |
| `cameleer.agents.sse.active` | — | `SseConnectionManager.getConnectionCount()` |
| `cameleer.ingestion.buffer.size` | `type` (execution, processor, log, metrics) | `WriteBuffer.size()` |
| `cameleer.ingestion.accumulator.pending` | — | `ChunkAccumulator.getPendingCount()` |
## Counters
| Metric | Tags | Instrumented in |
|--------|------|-----------------|
| `cameleer.ingestion.drops` | `reason` (buffer_full, no_agent, no_identity) | `LogIngestionController` |
| `cameleer.agents.transitions` | `transition` (went_stale, went_dead, recovered) | `AgentLifecycleMonitor` |
| `cameleer.deployments.outcome` | `status` (running, failed, degraded) | `DeploymentExecutor` |
| `cameleer.auth.failures` | `reason` (invalid_token, revoked, oidc_rejected) | `JwtAuthenticationFilter` |
## Timers
| Metric | Tags | Instrumented in |
|--------|------|-----------------|
| `cameleer.ingestion.flush.duration` | `type` (execution, processor, log) | `ExecutionFlushScheduler` |
| `cameleer.deployments.duration` | — | `DeploymentExecutor` |
## Agent container Prometheus labels (set by `PrometheusLabelBuilder` at deploy time)
| Runtime Type | `prometheus.path` | `prometheus.port` |
|---|---|---|
| `spring-boot` | `/actuator/prometheus` | `8081` |
| `quarkus` / `native` | `/q/metrics` | `9000` |
| `plain-java` | `/metrics` | `9464` |
All containers also get `prometheus.scrape=true`. These labels enable Prometheus `docker_sd_configs` auto-discovery.
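The mapping in the table above can be sketched as a small lookup. This is only an illustration of the table, not the real `PrometheusLabelBuilder`: the class name `PrometheusLabelSketch`, the method `labelsFor`, and the choice to reject unknown runtime types are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PrometheusLabelSketch {

    /**
     * Returns the container labels for a runtime type, per the table above.
     * Hypothetical helper; the real builder's signature is not shown in this file.
     */
    static Map<String, String> labelsFor(String runtimeType) {
        String path;
        String port;
        switch (runtimeType) {
            case "spring-boot":
                path = "/actuator/prometheus";
                port = "8081";
                break;
            case "quarkus":
            case "native":
                path = "/q/metrics";
                port = "9000";
                break;
            case "plain-java":
                path = "/metrics";
                port = "9464";
                break;
            default:
                // Assumption: unknown runtime types are rejected rather than defaulted.
                throw new IllegalArgumentException("Unknown runtime type: " + runtimeType);
        }
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("prometheus.scrape", "true"); // set on every container
        labels.put("prometheus.path", path);
        labels.put("prometheus.port", port);
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(labelsFor("quarkus"));
    }
}
```

A `LinkedHashMap` keeps the labels in insertion order, which makes the printed output stable when comparing deployments.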
## Agent Metric Names (Micrometer)
Agents send `MetricsSnapshot` records with Micrometer-convention metric names. The server stores them generically (ClickHouse `agent_metrics.metric_name`). The UI references specific names in `AgentInstance.tsx` for JVM charts.
### JVM metrics (used by UI)
| Metric name | UI usage |
|---|---|
| `process.cpu.usage.value` | CPU % stat card + chart |
| `jvm.memory.used.value` | Heap MB stat card + chart (tags: `area=heap`) |
| `jvm.memory.max.value` | Heap max for % calculation (tags: `area=heap`) |
| `jvm.threads.live.value` | Thread count chart |
| `jvm.gc.pause.total_time` | GC time chart |
### Camel route metrics (stored, queried by dashboard)
| Metric name | Type | Tags |
|---|---|---|
| `camel.exchanges.succeeded.count` | counter | `routeId`, `camelContext` |
| `camel.exchanges.failed.count` | counter | `routeId`, `camelContext` |
| `camel.exchanges.total.count` | counter | `routeId`, `camelContext` |
| `camel.exchanges.failures.handled.count` | counter | `routeId`, `camelContext` |
| `camel.route.policy.count` | count | `routeId`, `camelContext` |
| `camel.route.policy.total_time` | total | `routeId`, `camelContext` |
| `camel.route.policy.max` | gauge | `routeId`, `camelContext` |
| `camel.routes.running.value` | gauge | — |
Mean processing time = `camel.route.policy.total_time / camel.route.policy.count`. Min processing time is not available (Micrometer does not track minimums).
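The mean formula above can be written out as a minimal sketch. The class and method names are hypothetical, and `intervalMean` assumes the stored `total_time` and `count` values are cumulative with no counter reset between the two samples, which this file does not confirm. The result is in whatever time unit `total_time` is stored in.

```java
class RoutePolicyMeanSketch {

    /** Mean processing time = total_time / count; NaN when nothing was recorded. */
    static double mean(double totalTime, double count) {
        return count > 0 ? totalTime / count : Double.NaN;
    }

    /**
     * Mean over the interval between two samples.
     * Assumes cumulative values and no reset between the samples.
     */
    static double intervalMean(double totalPrev, double countPrev,
                               double totalNow, double countNow) {
        return mean(totalNow - totalPrev, countNow - countPrev);
    }

    public static void main(String[] args) {
        System.out.println(mean(1200.0, 4.0));                      // prints 300.0
        System.out.println(intervalMean(1200.0, 4.0, 2000.0, 8.0)); // prints 200.0
    }
}
```

Returning `NaN` for a zero count avoids a division by zero and lets a chart show a gap instead of a misleading `0`.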
### Cameleer agent metrics
| Metric name | Type | Tags |
|---|---|---|
| `cameleer.chunks.exported.count` | counter | `instanceId` |
| `cameleer.chunks.dropped.count` | counter | `instanceId`, `reason` |
| `cameleer.sse.reconnects.count` | counter | `instanceId` |
| `cameleer.taps.evaluated.count` | counter | `instanceId` |
| `cameleer.metrics.exported.count` | counter | `instanceId` |
## Server self-metrics persistence
`ServerMetricsSnapshotScheduler` walks `MeterRegistry.getMeters()` every 60 s (configurable via `cameleer.server.self-metrics.interval-ms`) and writes one row per Micrometer `Measurement` to the ClickHouse `server_metrics` table. The full registry is captured: Spring Boot Actuator series (`jvm.*`, `process.*`, `http.server.requests`, `hikaricp.*`, `jdbc.*`, `tomcat.*`, `logback.events`, `system.*`) plus `cameleer.*` and `alerting_*`.
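The "one row per `Measurement`" flattening might look roughly like the sketch below. To stay self-contained it uses plain records as stand-ins for Micrometer's `Meter` and `Measurement` types. The record names, the `toRows` method, and the sample values are illustrative assumptions, not the scheduler's actual code, and only a subset of the `server_metrics` columns is modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SnapshotRowSketch {

    /** Stand-in for a Micrometer Measurement: statistic name plus value. */
    record Measurement(String statistic, double value) {}

    /** Stand-in for a Micrometer Meter: name, type, tags, and its measurements. */
    record Meter(String name, String type, Map<String, String> tags,
                 List<Measurement> measurements) {}

    /** One row for server_metrics (subset of columns only). */
    record Row(String metricName, String metricType, String statistic,
               double metricValue, Map<String, String> tags) {}

    /** Emits one row per measurement of every meter. */
    static List<Row> toRows(List<Meter> meters) {
        List<Row> rows = new ArrayList<>();
        for (Meter meter : meters) {
            for (Measurement m : meter.measurements()) {
                rows.add(new Row(meter.name(), meter.type(), m.statistic(),
                                 m.value(), meter.tags()));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        List<Meter> meters = List.of(
            new Meter("cameleer.agents.sse.active", "gauge", Map.of(),
                List.of(new Measurement("value", 3))),
            new Meter("cameleer.deployments.duration", "timer", Map.of(),
                List.of(new Measurement("count", 2),
                        new Measurement("total_time", 41.5),
                        new Measurement("max", 30.0))));
        toRows(meters).forEach(System.out::println);
    }
}
```

With this input the gauge yields one row and the timer yields three, so each tick writes four rows for these two meters.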
**Table** (`cameleer-server-app/src/main/resources/clickhouse/init.sql`):
```
server_metrics(tenant_id, collected_at, server_instance_id,
metric_name, metric_type, statistic, metric_value,
tags Map(String,String), server_received_at)
```
- `metric_type` — lowercase Micrometer `Meter.Type` (counter, gauge, timer, distribution_summary, long_task_timer, other)
- `statistic` — Micrometer `Statistic.getTagValueRepresentation()` (value, count, total, total_time, max, mean, active_tasks, duration). Timers emit 3 rows per tick (count + total_time + max); gauges and counters emit 1 row each (`statistic='value'` for gauges, `'count'` for counters).
- No `environment` column — the server is env-agnostic.
- `tenant_id` threaded from `cameleer.server.tenant.id` (single-tenant per server).
- `server_instance_id` resolved once at boot by `ServerInstanceIdConfig` (property → HOSTNAME → localhost → UUID fallback). Rotates across restarts so counter resets are unambiguous.
- TTL: 90 days (vs 365 for `agent_metrics`). Write-only in v1 — no query endpoint or UI page. Inspect via ClickHouse admin: `/api/v1/admin/clickhouse/query` or direct SQL.
- Toggle off entirely with `cameleer.server.self-metrics.enabled=false` (uses `@ConditionalOnProperty`).
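The `server_instance_id` fallback chain (property → HOSTNAME → localhost → UUID) can be sketched as a first-non-blank lookup. This is not the real `ServerInstanceIdConfig`: the property name `cameleer.server.instance-id` is hypothetical, and reading the "localhost" step as `InetAddress.getLocalHost().getHostName()` is an assumption.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.UUID;
import java.util.function.Supplier;

class InstanceIdSketch {

    /** Returns the first non-blank candidate, or a random UUID if none applies. */
    static String resolve(List<Supplier<String>> candidates) {
        for (Supplier<String> candidate : candidates) {
            String value = candidate.get();
            if (value != null && !value.isBlank()) {
                return value;
            }
        }
        return UUID.randomUUID().toString();
    }

    /** Assumed meaning of the "localhost" step: the local host name, or null. */
    static String localHostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        List<Supplier<String>> chain = List.of(
            () -> System.getProperty("cameleer.server.instance-id"), // hypothetical name
            () -> System.getenv("HOSTNAME"),
            InstanceIdSketch::localHostName);
        System.out.println(resolve(chain));
    }
}
```

Passing the candidates as suppliers keeps the order explicit and lets each step be tested without touching the real environment.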