docs/server-self-metrics.md

# Server Self-Metrics — Reference for Dashboard Builders

This is the reference for anyone building a server-health dashboard on top of the Cameleer server. It documents the `server_metrics` ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.

> **tl;dr** — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.

---

## Built-in admin dashboard

The server ships a ready-to-use dashboard at **`/admin/server-metrics`** in the web UI. It renders the 17 panels listed below using `ThemedChart` from the design system. The window is driven by the app-wide time-range control in the TopBar (same one used by Exchanges, Dashboard, and Runtime), so every panel automatically reflects the range you've selected globally. Visibility mirrors the Database and ClickHouse admin pages:

- Requires the `ADMIN` role.
- Hidden when `cameleer.server.security.infrastructureendpoints=false` (both the backend endpoints and the sidebar entry disappear).

Use this page for single-tenant installs and dev/staging — it's the fastest path to "is the server healthy right now?". For multi-tenant control planes, cross-environment rollups, or embedding metrics inside an existing operations console, call the REST API below instead.

---

## Table schema

```sql
server_metrics (
    tenant_id          LowCardinality(String) DEFAULT 'default',
    collected_at       DateTime64(3),
    server_instance_id LowCardinality(String),
    metric_name        LowCardinality(String),
    metric_type        LowCardinality(String),   -- counter|gauge|timer|distribution_summary|long_task_timer|other
    statistic          LowCardinality(String) DEFAULT 'value',
    metric_value       Float64,
    tags               Map(String, String) DEFAULT map(),
    server_received_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY (tenant_id, toYYYYMM(collected_at))
ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
```

### What each column means

| Column | Notes |
|---|---|
| `tenant_id` | Always filter by this. One tenant per server deployment. |
| `server_instance_id` | Stable ID per server process, resolved in order: property → `HOSTNAME` env → DNS → random UUID. **Rotates on restart**, so counters restart cleanly. |
| `metric_name` | Raw Micrometer meter name. Dots, not underscores, except for the `alerting_*` meters, which keep their underscore names. |
| `metric_type` | Lowercase Micrometer `Meter.Type`. |
| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. |
| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. |

### Counter semantics (important)

Counters are **cumulative totals since meter registration**, same convention as Prometheus. To get a rate, compute a delta within a `server_instance_id`:

```sql
SELECT
    toStartOfMinute(collected_at) AS minute,
    -- Subtract the previous sample of the same series (one-row lag window).
    metric_value - any(metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = 'default'                      -- always filter by tenant
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```

On restart the `server_instance_id` rotates, so a `LAG()`-style window partitioned by `server_instance_id` (the `any(...) OVER (... 1 PRECEDING)` form in the query above) yields monotonic segments and never has to handle counter resets.
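
If you pull raw counter samples yourself, the same per-instance delta can be computed client-side. The sketch below is illustrative only: the `CounterSample` and `DeltaPoint` shapes and the `counterDeltas` name are hypothetical, and clipping negative differences to zero is an assumption borrowed from the "positive-clipped" behaviour of `mode=delta`.

```typescript
// Hypothetical shape of one raw counter sample read from server_metrics.
interface CounterSample {
  serverInstanceId: string;
  collectedAt: string; // ISO-8601 instant
  value: number;       // cumulative total since meter registration
}

interface DeltaPoint {
  serverInstanceId: string;
  collectedAt: string;
  delta: number;
}

// Computes differences between consecutive samples of the same instance.
// The first sample of each instance has no predecessor and emits nothing;
// negative differences are clipped to 0 (assumption, see lead-in).
function counterDeltas(samples: CounterSample[]): DeltaPoint[] {
  const byInstance: { [id: string]: CounterSample[] } = {};
  for (const s of samples) {
    const list = byInstance[s.serverInstanceId] || [];
    list.push(s);
    byInstance[s.serverInstanceId] = list;
  }

  const out: DeltaPoint[] = [];
  for (const id of Object.keys(byInstance)) {
    const ordered = (byInstance[id] || [])
      .slice()
      .sort((a, b) => Date.parse(a.collectedAt) - Date.parse(b.collectedAt));
    let prev: CounterSample | undefined;
    for (const cur of ordered) {
      if (prev !== undefined) {
        out.push({
          serverInstanceId: id,
          collectedAt: cur.collectedAt,
          delta: Math.max(0, cur.value - prev.value),
        });
      }
      prev = cur;
    }
  }
  return out;
}
```

Because samples are grouped by instance before differencing, a restart never produces a delta that spans two processes.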

### Retention

90 days, TTL-enforced. Long-term trend analysis is out of scope — ship raw data to an external warehouse if you need more.

---

## How to query

Use the REST API — `/api/v1/admin/server-metrics/**`. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard `/api/v1/admin/**` RBAC gate).

### `GET /catalog`

Enumerate every `metric_name` observed in a window, with its `metric_type`, the set of statistics emitted, and the union of tag keys.

```
GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer <admin-jwt>
```

```json
[
  {
    "metricName": "cameleer.agents.connected",
    "metricType": "gauge",
    "statistics": ["value"],
    "tagKeys": ["state"]
  },
  {
    "metricName": "cameleer.ingestion.drops",
    "metricType": "counter",
    "statistics": ["count"],
    "tagKeys": ["reason"]
  },
  ...
]
```

`from`/`to` are optional; default is the last 1 h.
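
A catalog entry carries enough information to pick sensible query defaults for a metric picker. The sketch below is one possible mapping, not something the server does: `CatalogEntry` mirrors the response fields shown above, while `defaultQueryFor` and its rules (counters as `count` in delta mode, timers as the derived `mean`, everything else raw) are assumptions modelled on the suggested panels in this document.

```typescript
// Mirrors one element of the GET /catalog response.
interface CatalogEntry {
  metricName: string;
  metricType: string;
  statistics: string[];
  tagKeys: string[];
}

// Hypothetical subset of a POST /query body.
interface QueryDefaults {
  metric: string;
  statistic: string;
  mode: "raw" | "delta";
}

// Picks starting values for a query against the given metric.
function defaultQueryFor(entry: CatalogEntry): QueryDefaults {
  if (entry.metricType === "counter") {
    // Counters are cumulative, so a rate-style view needs delta mode.
    return { metric: entry.metricName, statistic: "count", mode: "delta" };
  }
  if (entry.metricType === "timer") {
    // "mean" is the derived timer statistic described in the field reference.
    return { metric: entry.metricName, statistic: "mean", mode: "raw" };
  }
  // Gauges and anything else: first advertised statistic, read as-is.
  return {
    metric: entry.metricName,
    statistic: entry.statistics[0] || "value",
    mode: "raw",
  };
}
```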

### `GET /instances`

Enumerate the `server_instance_id` values that wrote at least one sample in the window, with `firstSeen` / `lastSeen`. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.

```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```

```json
[
  { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
  { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
]
```
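
To annotate restarts on a graph, the response can be turned into one marker per hand-over between instances. This is a minimal sketch with hypothetical names (`restartMarkers`, `RestartMarker`), and it assumes a single server process runs at a time; with several concurrent replicas, a new `firstSeen` does not necessarily mean a restart.

```typescript
// Mirrors one element of the GET /instances response.
interface InstanceInfo {
  serverInstanceId: string;
  firstSeen: string;
  lastSeen: string;
}

// Hypothetical annotation: the moment a new instance took over.
interface RestartMarker {
  at: string;
  from: string;
  to: string;
}

// Orders instances by firstSeen and emits a marker for every instance
// after the earliest one.
function restartMarkers(instances: InstanceInfo[]): RestartMarker[] {
  const ordered = instances
    .slice()
    .sort((a, b) => Date.parse(a.firstSeen) - Date.parse(b.firstSeen));
  const markers: RestartMarker[] = [];
  let prev: InstanceInfo | undefined;
  for (const cur of ordered) {
    if (prev !== undefined) {
      markers.push({
        at: cur.firstSeen,
        from: prev.serverInstanceId,
        to: cur.serverInstanceId,
      });
    }
    prev = cur;
  }
  return markers;
}
```

Applied to the sample response above, this yields a single marker at `2026-04-22T14:30:00Z` for the hand-over from `srv-prod-a` to `srv-prod-b`.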

### `POST /query` — generic time-series

The workhorse. One endpoint covers every panel in the dashboard.

```
POST /api/v1/admin/server-metrics/query
Authorization: Bearer <admin-jwt>
Content-Type: application/json
```

Request body:

```json
{
  "metric":          "cameleer.ingestion.drops",
  "statistic":       "count",
  "from":            "2026-04-22T00:00:00Z",
  "to":              "2026-04-23T00:00:00Z",
  "stepSeconds":     60,
  "groupByTags":     ["reason"],
  "filterTags":      { },
  "aggregation":     "sum",
  "mode":            "delta",
  "serverInstanceIds": null
}
```

Response:

```json
{
  "metric":      "cameleer.ingestion.drops",
  "statistic":   "count",
  "aggregation": "sum",
  "mode":        "delta",
  "stepSeconds": 60,
  "series": [
    {
      "tags":   { "reason": "buffer_full" },
      "points": [
        { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
        { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
        { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
      ]
    }
  ]
}
```

#### Request field reference

| Field | Type | Required | Description |
|---|---|---|---|
| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. |
| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time \| total) / sum(count)` per bucket. |
| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from ≤ 31 days`. |
| `stepSeconds` | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. |
| `filterTags` | map<string,string> | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to `sum` of per-instance deltas). |
| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |
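
The derived `mean` statistic is the only field above that involves arithmetic, so here is the formula spelled out for a single bucket. The sketch only illustrates `sum(total_time | total) / sum(count)`; the `StatRow` shape and `bucketMean` name are hypothetical, and returning `null` for an empty bucket is an assumption rather than documented server behaviour.

```typescript
// Hypothetical view of the timer rows that fall into one bucket.
interface StatRow {
  statistic: string;
  value: number;
}

// Mean latency for one bucket: total duration divided by sample count.
// Accepts both spellings of the duration statistic; other statistics
// (for example "max") are ignored.
function bucketMean(rows: StatRow[]): number | null {
  let total = 0;
  let count = 0;
  for (const row of rows) {
    if (row.statistic === "total_time" || row.statistic === "total") {
      total += row.value;
    } else if (row.statistic === "count") {
      count += row.value;
    }
  }
  // Avoid dividing by zero when nothing was recorded in the bucket.
  return count > 0 ? total / count : null;
}
```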

#### Validation errors

Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers:
- unsafe characters in identifiers
- `from ≥ to` or range > 31 days
- `stepSeconds` outside [10, 3600]
- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)
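
A dashboard can catch most of these before sending the request. The following is a sketch of client-side pre-validation that mirrors the documented rules; `QueryRequest`, `validateQuery`, and the error strings are hypothetical, and the server remains the authority (the 500-series cap, for instance, can only be checked server-side).

```typescript
// Hypothetical typing of the POST /query body, following the field reference.
interface QueryRequest {
  metric: string;
  statistic?: string;
  from: string;
  to: string;
  stepSeconds?: number;
  groupByTags?: string[];
  filterTags?: { [key: string]: string };
  aggregation?: "avg" | "sum" | "max" | "min" | "latest";
  mode?: "raw" | "delta";
  serverInstanceIds?: string[] | null;
}

const IDENTIFIER = /^[a-zA-Z0-9._]+$/;
const MAX_RANGE_MS = 31 * 24 * 60 * 60 * 1000;

// Returns a list of problems; an empty list means the request looks valid.
function validateQuery(q: QueryRequest): string[] {
  const errors: string[] = [];
  if (!IDENTIFIER.test(q.metric)) {
    errors.push("metric contains unsafe characters");
  }
  for (const key of q.groupByTags || []) {
    if (!IDENTIFIER.test(key)) {
      errors.push("groupByTags key '" + key + "' contains unsafe characters");
    }
  }
  const from = Date.parse(q.from);
  const to = Date.parse(q.to);
  if (isNaN(from) || isNaN(to)) {
    errors.push("from/to must be ISO-8601 instants");
  } else if (from >= to) {
    errors.push("from must be before to");
  } else if (to - from > MAX_RANGE_MS) {
    errors.push("range exceeds 31 days");
  }
  if (q.stepSeconds !== undefined && (q.stepSeconds < 10 || q.stepSeconds > 3600)) {
    errors.push("stepSeconds outside [10, 3600]");
  }
  return errors;
}
```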

### Direct ClickHouse (fallback)

If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for `/api/v1/admin/clickhouse/query` (`infrastructureendpoints=true`, ADMIN) or a dedicated read-only CH user scoped to `server_metrics`. All direct queries must filter by `tenant_id`.

---

## Metric catalog

Every series below is populated. Names follow Micrometer conventions (dots, not underscores); the `alerting_*` meters are the exception and keep their underscore names. Use these as the starting point for dashboard panels: pick the handful you care about and ignore the rest.

### Cameleer business metrics — agent + ingestion

Source: `cameleer-server-app/.../metrics/ServerMetrics.java`.

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) |
| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator |
| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type |

### Cameleer business metrics — deploy + auth

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot |
| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency |
| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |

### Alerting subsystem metrics

Source: `cameleer-server-app/.../alerting/metrics/AlertingMetrics.java`.

| Metric | Type | Statistic | Tags | Meaning |
|---|---|---|---|---|
| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` |
| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL `alert_instances` |
| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind |
| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind |
| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency |
| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency |
| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes |

### JVM — memory, GC, threads, classes

From Spring Boot Actuator (`JvmMemoryMetrics`, `JvmGcMetrics`, `JvmThreadMetrics`, `ClassLoaderMetrics`).

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
| `jvm.threads.live` | gauge | — | Current live thread count |
| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
| `jvm.threads.peak` | gauge | — | Peak thread count since start |
| `jvm.threads.started` | counter | — | Cumulative threads started |
| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch `max` |
| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant `1.0`; tags carry the real info |

### Process and system

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
| `process.uptime` | gauge | — | ms since start |
| `process.start.time` | gauge | — | Epoch start |
| `process.files.open` | gauge | — | Open FDs |
| `process.files.max` | gauge | — | FD ulimit |
| `system.cpu.count` | gauge | — | Cores visible to the JVM |
| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
| `disk.total` | gauge | `path` | Total bytes |

### HTTP server

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: count, total_time/total, max |
| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |

`uri` is the Spring-templated path (`/api/v1/environments/{envSlug}/apps/{appSlug}`), not the raw URL — cardinality stays bounded.

### Tomcat

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
| `tomcat.sessions.created` | counter | — | Cumulative session creates |
| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
| `tomcat.sessions.rejected` | counter | — | Session creates refused |
| `tomcat.threads.current` | gauge | `name` | Connector thread count |
| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
| `tomcat.threads.config.max` | gauge | `name` | Configured max |

### HikariCP (PostgreSQL pool)

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `hikaricp.connections` | gauge | `pool` | Total connections |
| `hikaricp.connections.active` | gauge | `pool` | In-use |
| `hikaricp.connections.idle` | gauge | `pool` | Idle |
| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
| `hikaricp.connections.min` | gauge | `pool` | Configured min |
| `hikaricp.connections.max` | gauge | `pool` | Configured max |
| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |

Pools are named. You'll see `HikariPool-1` (PostgreSQL) and a separate pool for ClickHouse (`clickHouseJdbcTemplate`).

### JDBC generic

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
| `jdbc.connections.max` | gauge | `name` | |
| `jdbc.connections.active` | gauge | `name` | |
| `jdbc.connections.idle` | gauge | `name` | |

### Logging

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |

### Spring Boot lifecycle

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `application.started.time` | timer | `main.application.class` | Cold-start duration |
| `application.ready.time` | timer | `main.application.class` | Time to ready |

### Flyway

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |

### Executor pools (if any `@Async` executors exist)

When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds:

| Metric | Type | Tags | Meaning |
|---|---|---|---|
| `executor.active` | gauge | `name` | Currently-running tasks |
| `executor.queued` | gauge | `name` | Queued tasks |
| `executor.queue.remaining` | gauge | `name` | Queue headroom |
| `executor.pool.size` | gauge | `name` | Current pool size |
| `executor.pool.core` | gauge | `name` | Core size |
| `executor.pool.max` | gauge | `name` | Max size |
| `executor.completed` | counter | `name` | Completed tasks |

---

## Suggested dashboard panels

Below are 17 panels, each expressed as a single `POST /api/v1/admin/server-metrics/query` body. Tenant is implicit in the JWT — the server filters by tenant server-side. `{from}` and `{to}` are dashboard variables.
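
How the `{from}` and `{to}` variables get filled in is up to the dashboard. As one hedged illustration, the helper below copies a panel body and replaces the two placeholders with ISO-8601 instants; `PanelTemplate` and `bindPanel` are hypothetical names, not part of the server or the built-in UI.

```typescript
// A panel body as written in this document, placeholders included.
type PanelTemplate = { [key: string]: unknown };

// Returns a copy of the template with "{from}" and "{to}" replaced by
// the given range; all other fields are copied unchanged.
function bindPanel(template: PanelTemplate, from: Date, to: Date): PanelTemplate {
  const fromIso = from.toISOString();
  const toIso = to.toISOString();
  const bound: PanelTemplate = {};
  for (const key of Object.keys(template)) {
    const value = template[key];
    if (value === "{from}") {
      bound[key] = fromIso;
    } else if (value === "{to}") {
      bound[key] = toIso;
    } else {
      bound[key] = value;
    }
  }
  return bound;
}
```

The result can be serialised with `JSON.stringify` and sent as the request body.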

### Row: server health (top of dashboard)

1. **Agents by state** — stacked area.
   ```json
   { "metric": "cameleer.agents.connected", "statistic": "value",
     "from": "{from}", "to": "{to}", "stepSeconds": 60,
     "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
   ```

2. **Ingestion buffer depth by type** — line chart.
   ```json
   { "metric": "cameleer.ingestion.buffer.size", "statistic": "value",
     "from": "{from}", "to": "{to}", "stepSeconds": 60,
     "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
   ```

3. **Ingestion drops per minute** — bar chart.
   ```json
   { "metric": "cameleer.ingestion.drops", "statistic": "count",
     "from": "{from}", "to": "{to}", "stepSeconds": 60,
     "groupByTags": ["reason"], "mode": "delta" }
   ```

4. **Auth failures per minute** — same shape as drops, grouped by `reason`.
   ```json
   { "metric": "cameleer.auth.failures", "statistic": "count",
     "from": "{from}", "to": "{to}", "stepSeconds": 60,
     "groupByTags": ["reason"], "mode": "delta" }
   ```

### Row: JVM

5. **Heap used vs committed vs max** — area chart (three overlay queries).
   ```json
   { "metric": "jvm.memory.used", "statistic": "value",
     "from": "{from}", "to": "{to}", "stepSeconds": 60,
     "filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
   ```
   Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`.

6. **CPU %** — line.
   ```json
   { "metric": "process.cpu.usage", "statistic": "value",
     "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
   ```
   Overlay with `"metric": "system.cpu.usage"`.

7. **GC pause — max per cause**.
   ```json
   { "metric": "jvm.gc.pause", "statistic": "max",
     "from": "{from}", "to": "{to}", "stepSeconds": 60,
     "groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
   ```

8. **Thread count** — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak` each with `statistic=value, aggregation=avg, mode=raw`.

### Row: HTTP + DB

9. **HTTP mean latency by URI** — top-N URIs.
   ```json
   { "metric": "http.server.requests", "statistic": "mean",
     "from": "{from}", "to": "{to}", "stepSeconds": 60,
     "groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" },
     "aggregation": "avg", "mode": "raw" }
   ```
   For a rough p99 proxy, repeat with `"statistic": "max"`. Note that `max` tracks the slowest observed request, not a true percentile.

10. **HTTP error rate** — two queries, divide client-side: total requests and 5xx requests.
    ```json
    { "metric": "http.server.requests", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "mode": "delta", "aggregation": "sum" }
    ```
    Then for the 5xx series, add `"filterTags": { "outcome": "SERVER_ERROR" }` and divide.

11. **HikariCP pool saturation** — overlay two queries.
    ```json
    { "metric": "hikaricp.connections.active", "statistic": "value",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
    ```
    Overlay with `"metric": "hikaricp.connections.pending"`.

12. **Hikari acquire timeouts per minute**.
    ```json
    { "metric": "hikaricp.connections.timeout", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["pool"], "mode": "delta" }
    ```
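
The client-side division for the HTTP error rate panel (panel 10) can be sketched as follows. It assumes each query returned a single series whose points use the `{ t, v }` shape from the `/query` response; `errorRate` is a hypothetical helper, and skipping buckets with no traffic (rather than reporting them as 0 %) is a design choice you may want to change.

```typescript
// One point of a /query series.
interface Point {
  t: string;
  v: number;
}

// Divides the 5xx series by the total series, matching points by timestamp.
// Buckets without any requests are skipped so the chart shows a gap there.
function errorRate(total: Point[], errors: Point[]): Point[] {
  const errorsByTime: { [t: string]: number } = {};
  for (const p of errors) {
    errorsByTime[p.t] = p.v;
  }
  const out: Point[] = [];
  for (const p of total) {
    if (p.v <= 0) {
      continue; // no traffic in this bucket
    }
    // A bucket missing from the error series means zero errors.
    const failed = errorsByTime[p.t] || 0;
    out.push({ t: p.t, v: failed / p.v });
  }
  return out;
}
```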

### Row: alerting (collapsible)

13. **Alerting instances by state** — stacked.
    ```json
    { "metric": "alerting_instances_total", "statistic": "value",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
    ```

14. **Eval errors per minute by kind**.
    ```json
    { "metric": "alerting_eval_errors_total", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["kind"], "mode": "delta" }
    ```

15. **Webhook delivery — max per minute**.
    ```json
    { "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "aggregation": "max", "mode": "raw" }
    ```

### Row: deployments (runtime-enabled only)

16. **Deploy outcomes per hour**.
    ```json
    { "metric": "cameleer.deployments.outcome", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 3600,
      "groupByTags": ["status"], "mode": "delta" }
    ```

17. **Deploy duration mean**.
    ```json
    { "metric": "cameleer.deployments.duration", "statistic": "mean",
      "from": "{from}", "to": "{to}", "stepSeconds": 300,
      "aggregation": "avg", "mode": "raw" }
    ```
    For a rough p99 proxy, repeat with `"statistic": "max"`. Note that `max` tracks the slowest observed deploy, not a true percentile.

---

## Notes for the dashboard implementer

- **Use the REST API.** The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see explosion here. The API caps responses at 500 series; you'll get a 400 if you blow past it.
- **The dashboard is read-only.** There's no write path — only the server writes into `server_metrics`.

---

## Changelog

- 2026-04-23 — initial write. Write-only backend.
- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.
- 2026-04-24 — shipped the built-in `/admin/server-metrics` UI dashboard. Gated by `infrastructureendpoints` + ADMIN, identical visibility to `/admin/{database,clickhouse}`. Source: `ui/src/pages/Admin/ServerMetricsAdminPage.tsx`.
- 2026-04-24 — dashboard now uses the global time-range control (`useGlobalFilters`) instead of a page-local picker. Bucket size auto-scales with the selected window (10 s → 1 h). Query hooks now take a `ServerMetricsRange = { from: Date; to: Date }` instead of a `windowSeconds` number so they work for any absolute or rolling range the TopBar supplies.
-												feat(server): persist server self-metrics into ClickHouse

Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-23 23:20:45 +02:00
+								# Server Self-Metrics — Reference for Dashboard Builders
-												docs(server-metrics): document the built-in admin dashboard

SERVER-CAPABILITIES.md now lists the two consumption paths (UI + REST API)
side-by-side with visibility rules; the dashboard-builder doc leads with a
"Built-in admin dashboard" section and a 2026-04-24 changelog entry so
first-time readers know they don't have to build anything before seeing
server health.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-24 09:05:22 +02:00
+								This is the reference for anyone building a server-health dashboard on top of the Cameleer server. It documents the `server_metrics` ClickHouse table, every series you can expect to find in it, and the queries we recommend for each dashboard panel.
-												feat(server): persist server self-metrics into ClickHouse

Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-23 23:20:45 +02:00
 								> **tl;dr** — Every 60 s, every meter in the server's Micrometer registry (all `cameleer.*`, all `alerting_*`, and the full Spring Boot Actuator set) is written into ClickHouse as one row per `(meter, statistic)` pair. No external Prometheus required.
 								---
-												docs(server-metrics): document the built-in admin dashboard

SERVER-CAPABILITIES.md now lists the two consumption paths (UI + REST API)
side-by-side with visibility rules; the dashboard-builder doc leads with a
"Built-in admin dashboard" section and a 2026-04-24 changelog entry so
first-time readers know they don't have to build anything before seeing
server health.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-24 09:05:22 +02:00
+								## Built-in admin dashboard
-												refactor(ui): server metrics page uses global time range

Drop the page-local DS Select window picker. Drive from() / to() off
useGlobalFilters().timeRange so the dashboard tracks the same TopBar range
as Exchanges / Dashboard / Runtime. Bucket size auto-scales via
stepSecondsFor(windowSeconds) (10 s for ≤30 min → 1 h for >48 h). Query
hooks now take ServerMetricsRange = { from: Date; to: Date } instead of a
windowSeconds number, so they support arbitrary absolute or rolling ranges
the TopBar may supply (not just "now − N"). Toolbar collapses to just the
server-instance badges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-24 09:19:20 +02:00
+								The server ships a ready-to-use dashboard at **`/admin/server-metrics`** in the web UI. It renders the 17 panels listed below using `ThemedChart` from the design system. The window is driven by the app-wide time-range control in the TopBar (same one used by Exchanges, Dashboard, and Runtime), so every panel automatically reflects the range you've selected globally. Visibility mirrors the Database and ClickHouse admin pages:
-												docs(server-metrics): document the built-in admin dashboard

SERVER-CAPABILITIES.md now lists the two consumption paths (UI + REST API)
side-by-side with visibility rules; the dashboard-builder doc leads with a
"Built-in admin dashboard" section and a 2026-04-24 changelog entry so
first-time readers know they don't have to build anything before seeing
server health.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-24 09:05:22 +02:00
 								- Requires the `ADMIN` role.
 								- Hidden when `cameleer.server.security.infrastructureendpoints=false` (both the backend endpoints and the sidebar entry disappear).
 								Use this page for single-tenant installs and dev/staging — it's the fastest path to "is the server healthy right now?". For multi-tenant control planes, cross-environment rollups, or embedding metrics inside an existing operations console, call the REST API below instead.
 								---
-												feat(server): persist server self-metrics into ClickHouse

Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-23 23:20:45 +02:00
+								## Table schema
 								```sql
 								server_metrics (
 								    tenant_id          LowCardinality(String) DEFAULT 'default',
 								    collected_at       DateTime64(3),
 								    server_instance_id LowCardinality(String),
 								    metric_name        LowCardinality(String),
 								    metric_type        LowCardinality(String),   -- counter|gauge|timer|distribution_summary|long_task_timer|other
 								    statistic          LowCardinality(String) DEFAULT 'value',
 								    metric_value       Float64,
 								    tags               Map(String, String) DEFAULT map(),
 								    server_received_at DateTime64(3) DEFAULT now64(3)
 								)
 								ENGINE = MergeTree()
 								PARTITION BY (tenant_id, toYYYYMM(collected_at))
 								ORDER BY (tenant_id, collected_at, server_instance_id, metric_name, statistic)
 								TTL toDateTime(collected_at) + INTERVAL 90 DAY DELETE
 								```
 								### What each column means
 								| Column | Notes |
 								|---|---|
 								| `tenant_id` | Always filter by this. One tenant per server deployment. |
 								| `server_instance_id` | Stable id per server process: property → `HOSTNAME` env → DNS → random UUID. **Rotates on restart**, so counters restart cleanly. |
 								| `metric_name` | Raw Micrometer meter name. Dots, not underscores. |
 								| `metric_type` | Lowercase Micrometer `Meter.Type`. |
 								| `statistic` | Which `Measurement` this row is. Counters/gauges → `value` or `count`. Timers → three rows per tick: `count`, `total_time` (or `total`), `max`. Distribution summaries → same shape. |
 								| `metric_value` | `Float64`. Non-finite values (NaN / ±∞) are dropped before insert. |
 								| `tags` | `Map(String, String)`. Micrometer tags copied verbatim. |
 								### Counter semantics (important)
 								Counters are **cumulative totals since meter registration**, same convention as Prometheus. To get a rate, compute a delta within a `server_instance_id`:
```sql
SELECT
    toStartOfMinute(collected_at) AS minute,
    -- any() over a one-row frame reads the previous sample of the same series
    metric_value - any(metric_value) OVER (
        PARTITION BY server_instance_id, metric_name, tags
        ORDER BY collected_at
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS per_minute_delta
FROM server_metrics
WHERE tenant_id = 'default'  -- always filter by tenant; substitute your own id
  AND metric_name = 'cameleer.ingestion.drops'
  AND statistic = 'count'
ORDER BY minute;
```
 								On restart the `server_instance_id` rotates, so a simple `LAG()` partitioned by `server_instance_id` gives monotonic segments without fighting counter resets.
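The same idea can be sketched client-side in Python. This is an illustration under stated assumptions, not the server's implementation: the `per_instance_deltas` helper, the tuple layout, and the sample values are hypothetical. Skipping the first sample of each instance and clipping negative differences to zero are choices made for this sketch.

```python
from collections import defaultdict


def per_instance_deltas(samples):
    """samples: iterable of (collected_at, server_instance_id, value) for one counter series.

    Returns {collected_at: summed delta}. Deltas are computed within each
    server_instance_id and clipped at zero; the first sample of an instance
    has no predecessor and contributes nothing.
    """
    by_instance = defaultdict(list)
    for ts, instance, value in samples:
        by_instance[instance].append((ts, value))

    out = defaultdict(float)
    for points in by_instance.values():
        points.sort()  # ISO-8601 strings in one format sort chronologically
        for (_, prev), (ts, cur) in zip(points, points[1:]):
            out[ts] += max(cur - prev, 0.0)
    return dict(sorted(out.items()))


samples = [
    ("2026-04-22T14:23:00Z", "srv-prod-a", 10.0),
    ("2026-04-22T14:24:00Z", "srv-prod-a", 15.0),
    ("2026-04-22T14:25:00Z", "srv-prod-a", 15.0),
    ("2026-04-22T14:30:00Z", "srv-prod-b", 0.0),  # new instance id after a restart
    ("2026-04-22T14:31:00Z", "srv-prod-b", 3.0),
]
deltas = per_instance_deltas(samples)
print(deltas)
```

Because the partition key is the instance id, the drop from 15 to 0 across the restart never appears as a negative delta.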
 								### Retention
90 days, TTL-enforced. Long-term trend analysis is out of scope; ship raw data to an external warehouse if you need more.
 								---
## How to query
Use the REST API: `/api/v1/admin/server-metrics/**`. It does the tenant filter, range bounding, counter-delta math, and input validation for you, so the dashboard never needs direct ClickHouse access. ADMIN role required (standard `/api/v1/admin/**` RBAC gate).
 								### `GET /catalog`
Enumerate every `metric_name` observed in a window, with its `metric_type`, the set of statistics emitted, and the union of tag keys.
```
GET /api/v1/admin/server-metrics/catalog?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
Authorization: Bearer <admin-jwt>
```

```json
 								[
 								  {
 								    "metricName": "cameleer.agents.connected",
 								    "metricType": "gauge",
 								    "statistics": ["value"],
 								    "tagKeys": ["state"]
 								  },
 								  {
 								    "metricName": "cameleer.ingestion.drops",
 								    "metricType": "counter",
 								    "statistics": ["count"],
 								    "tagKeys": ["reason"]
 								  },
 								  ...
 								]
 								```
 								`from`/`to` are optional; default is the last 1 h.
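A dashboard will typically fetch the catalog once and index it by name. The Python sketch below shows one way to build the request URL and index the response. The `catalog_url` and `index_catalog` helpers and the `cameleer.example.com` host are assumptions for illustration; the sample payload is the one shown above.

```python
import json
from urllib.parse import urlencode


def catalog_url(base_url, start=None, end=None):
    """Build the /catalog URL; from/to are optional, so omit them when unset."""
    params = {}
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    url = base_url.rstrip("/") + "/api/v1/admin/server-metrics/catalog"
    return url + ("?" + urlencode(params) if params else "")


def index_catalog(entries):
    """Map metricName -> catalog entry for quick lookups when wiring panels."""
    return {entry["metricName"]: entry for entry in entries}


# Sample response body, copied from the example above.
entries = json.loads("""
[
  {"metricName": "cameleer.agents.connected", "metricType": "gauge",
   "statistics": ["value"], "tagKeys": ["state"]},
  {"metricName": "cameleer.ingestion.drops", "metricType": "counter",
   "statistics": ["count"], "tagKeys": ["reason"]}
]
""")
by_name = index_catalog(entries)
print(catalog_url("https://cameleer.example.com", "2026-04-22T00:00:00Z", "2026-04-23T00:00:00Z"))
print(by_name["cameleer.ingestion.drops"]["tagKeys"])
```

`urlencode` percent-encodes the colons in the timestamps, which is the safe form for a query string.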
 								### `GET /instances`
 								Enumerate the `server_instance_id` values that wrote at least one sample in the window, with `firstSeen` / `lastSeen`. Use this when you need to annotate restarts on a graph or reason about counter-delta partitions.
 								```
GET /api/v1/admin/server-metrics/instances?from=2026-04-22T00:00:00Z&to=2026-04-23T00:00:00Z
```

```json
 								[
 								  { "serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z" },
 								  { "serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z" }
 								]
 								```
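To annotate restarts on a graph, one option is to treat the `firstSeen` of every instance after the earliest one as a marker. The Python sketch below does that for the sample response above. The `restart_markers` helper is hypothetical, and the heuristic assumes a single server process at a time; with several concurrent instances a new id may simply be a scale-out.

```python
from datetime import datetime


def parse_instant(text):
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalise it.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def restart_markers(instances):
    """Return the firstSeen of every instance except the earliest, oldest first."""
    ordered = sorted(instances, key=lambda inst: parse_instant(inst["firstSeen"]))
    return [inst["firstSeen"] for inst in ordered[1:]]


# Sample response body, copied from the example above.
instances = [
    {"serverInstanceId": "srv-prod-b", "firstSeen": "2026-04-22T14:30:00Z", "lastSeen": "2026-04-23T00:00:00Z"},
    {"serverInstanceId": "srv-prod-a", "firstSeen": "2026-04-22T00:00:00Z", "lastSeen": "2026-04-22T14:25:00Z"},
]
markers = restart_markers(instances)
print(markers)
```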
 								### `POST /query` — generic time-series
 								The workhorse. One endpoint covers every panel in the dashboard.
 								```
 								POST /api/v1/admin/server-metrics/query
 								Authorization: Bearer <admin-jwt>
Content-Type: application/json
```
Request body:
 								```json
 								{
 								  "metric":          "cameleer.ingestion.drops",
 								  "statistic":       "count",
 								  "from":            "2026-04-22T00:00:00Z",
 								  "to":              "2026-04-23T00:00:00Z",
 								  "stepSeconds":     60,
 								  "groupByTags":     ["reason"],
 								  "filterTags":      { },
 								  "aggregation":     "sum",
 								  "mode":            "delta",
 								  "serverInstanceIds": null
 								}
 								```
 								Response:
 								```json
 								{
 								  "metric":      "cameleer.ingestion.drops",
 								  "statistic":   "count",
 								  "aggregation": "sum",
 								  "mode":        "delta",
 								  "stepSeconds": 60,
 								  "series": [
 								    {
 								      "tags":   { "reason": "buffer_full" },
 								      "points": [
 								        { "t": "2026-04-22T00:00:00.000Z", "v": 0.0 },
 								        { "t": "2026-04-22T00:01:00.000Z", "v": 5.0 },
 								        { "t": "2026-04-22T00:02:00.000Z", "v": 5.0 }
 								      ]
 								    }
 								  ]
 								}
 								```
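Most charting code wants flat rows rather than nested series. A minimal Python sketch of that conversion follows. The `flatten_series` helper and its label format are assumptions for illustration; the field names (`series`, `tags`, `points`, `t`, `v`) are taken from the response above.

```python
def flatten_series(response):
    """Return a list of (label, t, v) tuples from a /query response."""
    rows = []
    for series in response.get("series", []):
        tags = series.get("tags", {})
        # Build a stable label such as "reason=buffer_full"; untagged series get "(all)".
        label = ",".join(f"{k}={v}" for k, v in sorted(tags.items())) or "(all)"
        for point in series.get("points", []):
            rows.append((label, point["t"], point["v"]))
    return rows


# Trimmed copy of the sample response above.
response = {
    "metric": "cameleer.ingestion.drops",
    "series": [
        {
            "tags": {"reason": "buffer_full"},
            "points": [
                {"t": "2026-04-22T00:00:00.000Z", "v": 0.0},
                {"t": "2026-04-22T00:01:00.000Z", "v": 5.0},
                {"t": "2026-04-22T00:02:00.000Z", "v": 5.0},
            ],
        }
    ],
}
rows = flatten_series(response)
print(rows[1])
```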
#### Request field reference
 								| Field | Type | Required | Description |
 								|---|---|---|---|
 								| `metric` | string | yes | Metric name. Regex `^[a-zA-Z0-9._]+$`. |
 								| `statistic` | string | no | `value` / `count` / `total` / `total_time` / `max` / `mean`. `mean` is a derived statistic for timers: `sum(total_time \| total) / sum(count)` per bucket. |
 								| `from`, `to` | ISO-8601 instant | yes | Half-open window. `to - from ≤ 31 days`. |
 								| `stepSeconds` | int | no | Bucket size. Clamped to [10, 3600]. Default 60. |
 								| `groupByTags` | string[] | no | Emit one series per unique combination of these tag values. Tag keys regex `^[a-zA-Z0-9._]+$`. |
 								| `filterTags` | map<string,string> | no | Narrow to samples whose tag map contains every entry. Values bound via parameter — no injection. |
 								| `aggregation` | string | no | Within-bucket reducer for raw mode: `avg` (default), `sum`, `max`, `min`, `latest`. For `mode=delta` this controls cross-instance aggregation (defaults to `sum` of per-instance deltas). |
 								| `mode` | string | no | `raw` (default) or `delta`. Delta mode computes per-`server_instance_id` positive-clipped differences and then aggregates across instances — so you get a rate-like time series that survives server restarts. |
 								| `serverInstanceIds` | string[] | no | Allow-list. When null or empty, every instance in the window is included. |
 								#### Validation errors
 								Any `IllegalArgumentException` surfaces as `400 Bad Request` with `{"error": "…"}`. Triggers:
 								- unsafe characters in identifiers
 								- `from ≥ to` or range > 31 days
 								- `stepSeconds` outside [10, 3600]
 								- result cardinality > 500 series (reduce `groupByTags` or tighten `filterTags`)
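A client can catch most of these before sending the request. The Python sketch below mirrors the documented rules; it is a convenience under stated assumptions, not the server's validator. The `check_query` helper and its messages are hypothetical, and the 500-series cap cannot be checked client-side because it depends on the data.

```python
import re
from datetime import datetime, timedelta

IDENTIFIER = re.compile(r"[a-zA-Z0-9._]+")  # same character class as the documented regex
MAX_RANGE = timedelta(days=31)


def _instant(text):
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def check_query(body):
    """Return a list of problems; an empty list means these client-side checks pass."""
    problems = []
    if not IDENTIFIER.fullmatch(body.get("metric", "")):
        problems.append("metric has unsafe characters")
    for key in body.get("groupByTags") or []:
        if not IDENTIFIER.fullmatch(key):
            problems.append(f"groupByTags key {key!r} has unsafe characters")
    start, end = _instant(body["from"]), _instant(body["to"])
    if start >= end:
        problems.append("from must be before to")
    elif end - start > MAX_RANGE:
        problems.append("range exceeds 31 days")
    step = body.get("stepSeconds", 60)
    if not 10 <= step <= 3600:
        problems.append("stepSeconds outside [10, 3600]")
    return problems


good = {"metric": "cameleer.ingestion.drops", "from": "2026-04-22T00:00:00Z",
        "to": "2026-04-23T00:00:00Z", "stepSeconds": 60, "groupByTags": ["reason"]}
bad = dict(good, metric="drops; DROP", to="2026-06-01T00:00:00Z", stepSeconds=5)
print(check_query(good))
print(check_query(bad))
```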
 								### Direct ClickHouse (fallback)
If you need something the generic query can't express (complex joins, percentile aggregates, materialized-view rollups), reach for `/api/v1/admin/clickhouse/query` (`infrastructureendpoints=true`, ADMIN) or a dedicated read-only CH user scoped to `server_metrics`. All direct queries must filter by `tenant_id`.
 								---
 								## Metric catalog
 								Every series below is populated. Names follow Micrometer conventions (dots, not underscores). Use these as the starting point for dashboard panels — pick the handful you care about, ignore the rest.
 								### Cameleer business metrics — agent + ingestion
 								Source: `cameleer-server-app/.../metrics/ServerMetrics.java`.
 								| Metric | Type | Statistic | Tags | Meaning |
 								|---|---|---|---|---|
 								| `cameleer.agents.connected` | gauge | `value` | `state` (live/stale/dead/shutdown) | Count of agents in each lifecycle state |
 								| `cameleer.agents.sse.active` | gauge | `value` | — | Active SSE connections (command channel) |
 								| `cameleer.agents.transitions` | counter | `count` | `transition` (went_stale/went_dead/recovered) | Cumulative lifecycle transitions |
 								| `cameleer.ingestion.buffer.size` | gauge | `value` | `type` (execution/processor/log/metrics) | Write buffer depth — spikes mean ingestion is lagging |
 								| `cameleer.ingestion.accumulator.pending` | gauge | `value` | — | Unfinalized execution chunks in the accumulator |
 								| `cameleer.ingestion.drops` | counter | `count` | `reason` (buffer_full/no_agent/no_identity) | Dropped payloads. Any non-zero rate here is bad. |
 								| `cameleer.ingestion.flush.duration` | timer | `count`, `total_time`/`total`, `max` | `type` (execution/processor/log) | Flush latency per type |
 								### Cameleer business metrics — deploy + auth
 								| Metric | Type | Statistic | Tags | Meaning |
 								|---|---|---|---|---|
 								| `cameleer.deployments.outcome` | counter | `count` | `status` (running/failed/degraded) | Deploy outcome tally since boot |
 								| `cameleer.deployments.duration` | timer | `count`, `total_time`/`total`, `max` | — | End-to-end deploy latency |
 								| `cameleer.auth.failures` | counter | `count` | `reason` (invalid_token/revoked/oidc_rejected) | Auth failure breakdown — watch for spikes |
 								### Alerting subsystem metrics
 								Source: `cameleer-server-app/.../alerting/metrics/AlertingMetrics.java`.
 								| Metric | Type | Statistic | Tags | Meaning |
 								|---|---|---|---|---|
 								| `alerting_rules_total` | gauge | `value` | `state` (enabled/disabled) | Cached 30 s from PostgreSQL `alert_rules` |
 								| `alerting_instances_total` | gauge | `value` | `state` (firing/resolved/ack'd etc.) | Cached 30 s from PostgreSQL `alert_instances` |
 								| `alerting_eval_errors_total` | counter | `count` | `kind` (condition kind) | Evaluator exceptions per kind |
 								| `alerting_circuit_opened_total` | counter | `count` | `kind` | Circuit-breaker open transitions per kind |
 								| `alerting_eval_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | `kind` | Per-kind evaluation latency |
 								| `alerting_webhook_delivery_duration_seconds` | timer | `count`, `total_time`/`total`, `max` | — | Outbound webhook POST latency |
 								| `alerting_notifications_total` | counter | `count` | `status` (sent/failed/retry/giving_up) | Notification outcomes |
 								### JVM — memory, GC, threads, classes
 								From Spring Boot Actuator (`JvmMemoryMetrics`, `JvmGcMetrics`, `JvmThreadMetrics`, `ClassLoaderMetrics`).
 								| Metric | Type | Tags | Meaning |
 								|---|---|---|---|
 								| `jvm.memory.used` | gauge | `area` (heap/nonheap), `id` (pool name) | Bytes used per pool |
 								| `jvm.memory.committed` | gauge | `area`, `id` | Bytes committed per pool |
 								| `jvm.memory.max` | gauge | `area`, `id` | Pool max |
 								| `jvm.memory.usage.after.gc` | gauge | `area`, `id` | Usage right after the last collection |
 								| `jvm.buffer.memory.used` | gauge | `id` (direct/mapped) | NIO buffer bytes |
 								| `jvm.buffer.count` | gauge | `id` | NIO buffer count |
 								| `jvm.buffer.total.capacity` | gauge | `id` | NIO buffer capacity |
 								| `jvm.threads.live` | gauge | — | Current live thread count |
 								| `jvm.threads.daemon` | gauge | — | Current daemon thread count |
 								| `jvm.threads.peak` | gauge | — | Peak thread count since start |
 								| `jvm.threads.started` | counter | — | Cumulative threads started |
 								| `jvm.threads.states` | gauge | `state` (runnable/blocked/waiting/…) | Threads per state |
 								| `jvm.classes.loaded` | gauge | — | Currently-loaded classes |
 								| `jvm.classes.unloaded` | counter | — | Cumulative unloaded classes |
 								| `jvm.gc.pause` | timer | `action`, `cause` | Stop-the-world pause times — watch `max` |
 								| `jvm.gc.concurrent.phase.time` | timer | `action`, `cause` | Concurrent-phase durations (G1/ZGC) |
 								| `jvm.gc.memory.allocated` | counter | — | Bytes allocated in the young gen |
 								| `jvm.gc.memory.promoted` | counter | — | Bytes promoted to old gen |
 								| `jvm.gc.overhead` | gauge | — | Fraction of CPU spent in GC (0–1) |
 								| `jvm.gc.live.data.size` | gauge | — | Live data after last collection |
 								| `jvm.gc.max.data.size` | gauge | — | Max old-gen size |
 								| `jvm.info` | gauge | `vendor`, `runtime`, `version` | Constant `1.0`; tags carry the real info |
 								### Process and system
 								| Metric | Type | Tags | Meaning |
 								|---|---|---|---|
 								| `process.cpu.usage` | gauge | — | CPU share consumed by this JVM (0–1) |
 								| `process.cpu.time` | gauge | — | Cumulative CPU time (ns) |
 								| `process.uptime` | gauge | — | ms since start |
 								| `process.start.time` | gauge | — | Epoch start |
 								| `process.files.open` | gauge | — | Open FDs |
 								| `process.files.max` | gauge | — | FD ulimit |
 								| `system.cpu.count` | gauge | — | Cores visible to the JVM |
 								| `system.cpu.usage` | gauge | — | System-wide CPU (0–1) |
 								| `system.load.average.1m` | gauge | — | 1-min load (Unix only) |
 								| `disk.free` | gauge | `path` | Free bytes on the mount that holds the JAR |
 								| `disk.total` | gauge | `path` | Total bytes |
 								### HTTP server
 								| Metric | Type | Tags | Meaning |
 								|---|---|---|---|
 								| `http.server.requests` | timer | `method`, `uri`, `status`, `outcome`, `exception` | Inbound HTTP: count, total_time/total, max |
 								| `http.server.requests.active` | long_task_timer | `method`, `uri` | In-flight requests — `active_tasks` statistic |
 								`uri` is the Spring-templated path (`/api/v1/environments/{envSlug}/apps/{appSlug}`), not the raw URL — cardinality stays bounded.
 								### Tomcat
 								| Metric | Type | Tags | Meaning |
 								|---|---|---|---|
 								| `tomcat.sessions.active.current` | gauge | — | Currently active sessions |
 								| `tomcat.sessions.active.max` | gauge | — | Max concurrent sessions observed |
 								| `tomcat.sessions.alive.max` | gauge | — | Longest session lifetime (s) |
 								| `tomcat.sessions.created` | counter | — | Cumulative session creates |
 								| `tomcat.sessions.expired` | counter | — | Cumulative expirations |
 								| `tomcat.sessions.rejected` | counter | — | Session creates refused |
 								| `tomcat.threads.current` | gauge | `name` | Connector thread count |
 								| `tomcat.threads.busy` | gauge | `name` | Connector threads currently serving a request |
 								| `tomcat.threads.config.max` | gauge | `name` | Configured max |
 								### HikariCP (PostgreSQL pool)
 								| Metric | Type | Tags | Meaning |
 								|---|---|---|---|
 								| `hikaricp.connections` | gauge | `pool` | Total connections |
 								| `hikaricp.connections.active` | gauge | `pool` | In-use |
 								| `hikaricp.connections.idle` | gauge | `pool` | Idle |
 								| `hikaricp.connections.pending` | gauge | `pool` | Threads waiting for a connection |
 								| `hikaricp.connections.min` | gauge | `pool` | Configured min |
 								| `hikaricp.connections.max` | gauge | `pool` | Configured max |
 								| `hikaricp.connections.creation` | timer | `pool` | Time to open a new connection |
 								| `hikaricp.connections.acquire` | timer | `pool` | Time to acquire from the pool |
 								| `hikaricp.connections.usage` | timer | `pool` | Time a connection was in use |
 								| `hikaricp.connections.timeout` | counter | `pool` | Pool acquisition timeouts — any non-zero rate is a problem |
 								Pools are named. You'll see `HikariPool-1` (PostgreSQL) and a separate pool for ClickHouse (`clickHouseJdbcTemplate`).
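One derived value that may be useful on a dashboard is pool utilisation, that is `hikaricp.connections.active` divided by `hikaricp.connections.max` for the same pool. The Python sketch below pairs points from two query results by timestamp. The `utilisation` helper and the sample points are hypothetical; it assumes both series were fetched with the same window and `stepSeconds`.

```python
def utilisation(active_points, max_points):
    """Pair points by timestamp and return [(t, active / max)].

    Timestamps missing from max_points, and non-positive max values, are skipped.
    """
    max_by_t = {p["t"]: p["v"] for p in max_points}
    out = []
    for point in active_points:
        cap = max_by_t.get(point["t"])
        if cap is not None and cap > 0:
            out.append((point["t"], point["v"] / cap))
    return out


# Invented points in the shape of the /query response "points" arrays.
active = [{"t": "2026-04-22T00:00:00.000Z", "v": 5.0},
          {"t": "2026-04-22T00:01:00.000Z", "v": 10.0},
          {"t": "2026-04-22T00:02:00.000Z", "v": 2.0}]
maximum = [{"t": "2026-04-22T00:00:00.000Z", "v": 10.0},
           {"t": "2026-04-22T00:01:00.000Z", "v": 10.0}]
ratios = utilisation(active, maximum)
print(ratios)
```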
 								### JDBC generic
 								| Metric | Type | Tags | Meaning |
 								|---|---|---|---|
 								| `jdbc.connections.min` | gauge | `name` | Same data as Hikari, surfaced generically |
 								| `jdbc.connections.max` | gauge | `name` | |
 								| `jdbc.connections.active` | gauge | `name` | |
 								| `jdbc.connections.idle` | gauge | `name` | |
 								### Logging
 								| Metric | Type | Tags | Meaning |
 								|---|---|---|---|
 								| `logback.events` | counter | `level` (error/warn/info/debug/trace) | Log events emitted since start — `{level=error}` is a useful panel |
 								### Spring Boot lifecycle
 								| Metric | Type | Tags | Meaning |
 								|---|---|---|---|
 								| `application.started.time` | timer | `main.application.class` | Cold-start duration |
 								| `application.ready.time` | timer | `main.application.class` | Time to ready |
 								### Flyway
 								| Metric | Type | Tags | Meaning |
 								|---|---|---|---|
 								| `flyway.migrations` | gauge | — | Number of migrations applied (current schema) |
 								### Executor pools (if any `@Async` executors exist)
 								When a `ThreadPoolTaskExecutor` bean is registered and tagged, Micrometer adds:
 								| Metric | Type | Tags | Meaning |
 								|---|---|---|---|
 								| `executor.active` | gauge | `name` | Currently-running tasks |
 								| `executor.queued` | gauge | `name` | Queued tasks |
 								| `executor.queue.remaining` | gauge | `name` | Queue headroom |
 								| `executor.pool.size` | gauge | `name` | Current pool size |
 								| `executor.pool.core` | gauge | `name` | Core size |
 								| `executor.pool.max` | gauge | `name` | Max size |
 								| `executor.completed` | counter | `name` | Completed tasks |
 								---
## Suggested dashboard panels
Below are 17 panels, each expressed as a single `POST /api/v1/admin/server-metrics/query` body. Tenant is implicit in the JWT; the server filters by tenant server-side. `{from}` and `{to}` are dashboard variables.
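How the `{from}` and `{to}` variables get substituted depends on your dashboard tool. As a minimal sketch in Python, a panel body can be kept as a template and copied with the window filled in before each request. The `fill_window` helper is hypothetical; the template is the "Agents by state" body from this section.

```python
import copy


def fill_window(template, start, end):
    """Return a copy of a panel body with the from/to placeholders replaced."""
    body = copy.deepcopy(template)  # leave the shared template untouched
    body["from"] = start
    body["to"] = end
    return body


agents_by_state = {
    "metric": "cameleer.agents.connected", "statistic": "value",
    "from": "{from}", "to": "{to}", "stepSeconds": 60,
    "groupByTags": ["state"], "aggregation": "avg", "mode": "raw",
}
body = fill_window(agents_by_state, "2026-04-22T00:00:00Z", "2026-04-23T00:00:00Z")
print(body["from"], body["to"])
```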
 								### Row: server health (top of dashboard)
1. **Agents by state**: stacked area.
   ```json
 								   { "metric": "cameleer.agents.connected", "statistic": "value",
 								     "from": "{from}", "to": "{to}", "stepSeconds": 60,
 								     "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
   ```
2. **Ingestion buffer depth by type**: line chart.
 								   ```json
 								   { "metric": "cameleer.ingestion.buffer.size", "statistic": "value",
 								     "from": "{from}", "to": "{to}", "stepSeconds": 60,
 								     "groupByTags": ["type"], "aggregation": "avg", "mode": "raw" }
   ```
3. **Ingestion drops per minute**: bar chart.
 								   ```json
 								   { "metric": "cameleer.ingestion.drops", "statistic": "count",
 								     "from": "{from}", "to": "{to}", "stepSeconds": 60,
 								     "groupByTags": ["reason"], "mode": "delta" }
 								   ```
4. **Auth failures per minute**: same shape as drops, grouped by `reason`.
 								   ```json
 								   { "metric": "cameleer.auth.failures", "statistic": "count",
 								     "from": "{from}", "to": "{to}", "stepSeconds": 60,
 								     "groupByTags": ["reason"], "mode": "delta" }
 								   ```
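
To make `"mode": "delta"` more concrete, the sketch below shows the general technique of turning cumulative counter readings into per-interval increases, treating a drop in value as a counter reset after a restart. It is an illustration only and not the server's implementation; `Sample` and `counterDeltas` are hypothetical names, and the server already does this work for you when you request `delta` mode.

```typescript
// Hypothetical shape: one cumulative counter reading from a single server instance.
interface Sample {
  t: number;     // epoch milliseconds
  value: number; // cumulative count at time t
}

// Convert cumulative readings into per-interval increases.
// A reading lower than its predecessor is treated as a counter reset,
// so the new value itself is taken as the increase for that interval.
function counterDeltas(samples: Sample[]): Sample[] {
  const sorted = [...samples].sort((a, b) => a.t - b.t);
  const out: Sample[] = [];
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1].value;
    const cur = sorted[i].value;
    out.push({ t: sorted[i].t, value: cur >= prev ? cur - prev : cur });
  }
  return out;
}

const deltas = counterDeltas([
  { t: 0, value: 10 },
  { t: 60_000, value: 14 },
  { t: 120_000, value: 2 }, // lower than before: assume the instance restarted
  { t: 180_000, value: 5 },
]);
console.log(deltas.map((d) => d.value));
```

With several server instances, the same logic would be applied per instance before summing, so one instance restarting does not distort the combined series.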

### Row: JVM

5. **Heap used vs committed vs max** — area chart (three overlay queries).
   ```json
   { "metric": "jvm.memory.used", "statistic": "value",
     "from": "{from}", "to": "{to}", "stepSeconds": 60,
     "filterTags": { "area": "heap" }, "aggregation": "sum", "mode": "raw" }
   ```
   Repeat with `"metric": "jvm.memory.committed"` and `"metric": "jvm.memory.max"`.
6. **CPU %** — line.
   ```json
   { "metric": "process.cpu.usage", "statistic": "value",
     "from": "{from}", "to": "{to}", "stepSeconds": 60, "aggregation": "avg", "mode": "raw" }
   ```
   Overlay with `"metric": "system.cpu.usage"`.
7. **GC pause — max per cause**.
   ```json
   { "metric": "jvm.gc.pause", "statistic": "max",
     "from": "{from}", "to": "{to}", "stepSeconds": 60,
     "groupByTags": ["cause"], "aggregation": "max", "mode": "raw" }
   ```
8. **Thread count** — three overlay lines: `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak` each with `statistic=value, aggregation=avg, mode=raw`.
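
Several JVM panels are the same query repeated over a few metric names. As a rough sketch, the request bodies for the heap overlay could be generated instead of hand-written. The `MetricQuery` type and `heapOverlayQueries` helper are hypothetical; only the field names and values are taken from the JSON examples above, and the `{from}` / `{to}` placeholders are left unresolved.

```typescript
// Hypothetical request-body type, limited to the fields used in this document's examples.
interface MetricQuery {
  metric: string;
  statistic: string;
  from: string;
  to: string;
  stepSeconds: number;
  filterTags?: Record<string, string>;
  groupByTags?: string[];
  aggregation?: string;
  mode: "raw" | "delta";
}

// Build one query per heap metric so the three series can be overlaid in a single chart.
function heapOverlayQueries(from: string, to: string, stepSeconds = 60): MetricQuery[] {
  const metrics = ["jvm.memory.used", "jvm.memory.committed", "jvm.memory.max"];
  return metrics.map((metric): MetricQuery => ({
    metric,
    statistic: "value",
    from,
    to,
    stepSeconds,
    filterTags: { area: "heap" },
    aggregation: "sum",
    mode: "raw",
  }));
}

const heapQueries = heapOverlayQueries("{from}", "{to}");
console.log(JSON.stringify(heapQueries, null, 2));
```

The thread-count panel follows the same pattern with a different metric list, `aggregation: "avg"`, and no `filterTags`.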

### Row: HTTP + DB

9. **HTTP mean latency by URI** — top-N URIs.
   ```json
   { "metric": "http.server.requests", "statistic": "mean",
     "from": "{from}", "to": "{to}", "stepSeconds": 60,
     "groupByTags": ["uri"], "filterTags": { "outcome": "SUCCESS" },
     "aggregation": "avg", "mode": "raw" }
   ```
   For p99 proxy, repeat with `"statistic": "max"`.
10. **HTTP error rate** — two queries, divide client-side: total requests and 5xx requests.
    ```json
    { "metric": "http.server.requests", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "mode": "delta", "aggregation": "sum" }
    ```
    Then for the 5xx series, add `"filterTags": { "outcome": "SERVER_ERROR" }` and divide.
11. **HikariCP pool saturation** — overlay two queries.
    ```json
    { "metric": "hikaricp.connections.active", "statistic": "value",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["pool"], "aggregation": "avg", "mode": "raw" }
    ```
    Overlay with `"metric": "hikaricp.connections.pending"`.
12. **Hikari acquire timeouts per minute**.
    ```json
    { "metric": "hikaricp.connections.timeout", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["pool"], "mode": "delta" }
    ```
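
The client-side division for the HTTP error rate panel can be sketched as follows. The response shape is an assumption, since this section does not show what `/query` returns: `Point` (a bucket timestamp plus a value) and `errorRate` are hypothetical names. Buckets with no 5xx sample count as zero errors, and buckets with no traffic yield a rate of 0 rather than dividing by zero.

```typescript
// Assumed shape of one bucket in a returned series (not confirmed by this document).
interface Point {
  t: string; // bucket timestamp
  v: number; // value for that bucket
}

// Divide the 5xx series by the total series, bucket by bucket.
function errorRate(total: Point[], errors: Point[]): Point[] {
  const errorsByBucket = new Map<string, number>();
  for (const p of errors) {
    errorsByBucket.set(p.t, p.v);
  }
  return total.map((p) => ({
    t: p.t,
    v: p.v > 0 ? (errorsByBucket.get(p.t) ?? 0) / p.v : 0,
  }));
}

const rate = errorRate(
  [
    { t: "10:00", v: 100 },
    { t: "10:01", v: 0 },
    { t: "10:02", v: 50 },
  ],
  [{ t: "10:00", v: 5 }],
);
console.log(rate);
```

Joining on the bucket timestamp rather than on array index matters here, because the 5xx series may simply be missing buckets in which no errors occurred.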

### Row: alerting (collapsible)

13. **Alerting instances by state** — stacked.
    ```json
    { "metric": "alerting_instances_total", "statistic": "value",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["state"], "aggregation": "avg", "mode": "raw" }
    ```
14. **Eval errors per minute by kind**.
    ```json
    { "metric": "alerting_eval_errors_total", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "groupByTags": ["kind"], "mode": "delta" }
    ```
15. **Webhook delivery — max per minute**.
    ```json
    { "metric": "alerting_webhook_delivery_duration_seconds", "statistic": "max",
      "from": "{from}", "to": "{to}", "stepSeconds": 60,
      "aggregation": "max", "mode": "raw" }
    ```

### Row: deployments (runtime-enabled only)

16. **Deploy outcomes per hour**.
    ```json
    { "metric": "cameleer.deployments.outcome", "statistic": "count",
      "from": "{from}", "to": "{to}", "stepSeconds": 3600,
      "groupByTags": ["status"], "mode": "delta" }
    ```
17. **Deploy duration mean**.
    ```json
    { "metric": "cameleer.deployments.duration", "statistic": "mean",
      "from": "{from}", "to": "{to}", "stepSeconds": 300,
      "aggregation": "avg", "mode": "raw" }
    ```
    For p99 proxy, repeat with `"statistic": "max"`.

---

## Notes for the dashboard implementer

- **Use the REST API.** The server handles tenant filtering, counter deltas, range bounds, and input validation. Direct ClickHouse is a fallback for the handful of cases the generic query can't express.
- **`total_time` vs `total`.** SimpleMeterRegistry and PrometheusMeterRegistry disagree on the tag value for Timer cumulative duration. The server uses PrometheusMeterRegistry in production, so expect `total_time`. The derived `statistic=mean` handles both transparently.
- **Cardinality warning:** `http.server.requests` tags include `uri` and `status`. The server templates URIs, but if someone adds an endpoint that embeds a high-cardinality path segment without `@PathVariable`, you'll see a series explosion here. The API caps responses at 500 series; you'll get a 400 if you exceed it.
- **The dashboard is read-only.** There's no write path — only the server writes into `server_metrics`.
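
If you do fall back to raw rows and need a timer mean yourself, one plausible derivation is cumulative duration divided by count, accepting either spelling of the duration statistic. This is a sketch of that idea and not the server's actual `statistic=mean` logic; `StatRow` and `timerMean` are hypothetical names.

```typescript
// Hypothetical shape: one (statistic, value) pair for a single timer at a single collection time.
interface StatRow {
  statistic: string;
  value: number;
}

// Mean duration = cumulative duration / count, accepting "total_time" or "total".
// Returns null when either part is missing or when nothing has been recorded yet.
function timerMean(rows: StatRow[]): number | null {
  const count = rows.find((r) => r.statistic === "count")?.value;
  const total = rows.find(
    (r) => r.statistic === "total_time" || r.statistic === "total",
  )?.value;
  if (count === undefined || total === undefined || count === 0) {
    return null;
  }
  return total / count;
}

const prometheusStyle = timerMean([
  { statistic: "count", value: 4 },
  { statistic: "total_time", value: 12 },
]);
const simpleStyle = timerMean([
  { statistic: "count", value: 3 },
  { statistic: "total", value: 9 },
]);
console.log(prometheusStyle, simpleStyle);
```

Note that this yields the mean over the timer's whole lifetime; a per-bucket mean would first take deltas of both the duration and the count.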

---

## Changelog

- 2026-04-23 — initial write. Write-only backend.
- 2026-04-23 — added generic REST API (`/api/v1/admin/server-metrics/{catalog,instances,query}`) so dashboards don't need direct ClickHouse access. All 17 suggested panels now expressed as single-endpoint queries.
- 2026-04-24 — shipped the built-in `/admin/server-metrics` UI dashboard. Gated by `infrastructureendpoints` + ADMIN, identical visibility to `/admin/{database,clickhouse}`. Source: `ui/src/pages/Admin/ServerMetricsAdminPage.tsx`.
- 2026-04-24 — dashboard now uses the global time-range control (`useGlobalFilters`) instead of a page-local picker. Bucket size auto-scales with the selected window (10 s → 1 h). Query hooks now take a `ServerMetricsRange = { from: Date; to: Date }` instead of a `windowSeconds` number so they work for any absolute or rolling range the TopBar supplies.