docs/alerting.md

# Alerting — Admin Guide

Cameleer's alerting system provides rule-based monitoring over the observability data the server already collects: route metrics, exchange outcomes, agent state, deployment state, application logs, and JVM metrics. It is a "good enough" baseline for operational awareness. For on-call rotation, escalation policies, and incident management, integrate with PagerDuty or OpsGenie via a webhook rule — Cameleer handles the HTTP POST, they handle the rest.

> For full architectural detail see `docs/superpowers/plans/2026-04-19-alerting-02-backend.md` and the spec at `docs/superpowers/specs/2026-04-19-alerting-design.md`.

---

## Condition Kinds

Six condition kinds are supported. Every rule belongs to exactly one environment.

### ROUTE_METRIC

Fires when a computed route metric crosses a threshold over a rolling window.

```json
{
  "name": "High error rate on orders",
  "severity": "CRITICAL",
  "conditionKind": "ROUTE_METRIC",
  "condition": {
    "kind": "ROUTE_METRIC",
    "scope": { "appSlug": "orders-service" },
    "metric": "ERROR_RATE",
    "comparator": "GT",
    "threshold": 0.05,
    "windowSeconds": 300
  },
  "evaluationIntervalSeconds": 60
}
```

Available metrics: `ERROR_RATE`, `THROUGHPUT`, `AVG_DURATION_MS`, `P99_LATENCY_MS`, `ERROR_COUNT`.
Comparators: `GT`, `GTE`, `LT`, `LTE`, `EQ`.
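As a rough illustration of how a comparator and threshold combine, the sketch below maps the comparator names listed above onto Python operators. The function names are hypothetical, and treating `ERROR_RATE` as a 0 to 1 fraction (failed / total) is an assumption suggested by the `0.05` threshold in the example, not something the server is confirmed to do.

```python
import operator

# Comparator names come from the list above; this mapping is illustrative only.
COMPARATORS = {
    "GT": operator.gt,
    "GTE": operator.ge,
    "LT": operator.lt,
    "LTE": operator.le,
    "EQ": operator.eq,
}


def error_rate(failed: int, total: int) -> float:
    """Assumed definition: fraction of failed exchanges in the window."""
    return failed / total if total else 0.0


def crosses_threshold(value: float, comparator: str, threshold: float) -> bool:
    """Return True when `value <comparator> threshold` holds."""
    return COMPARATORS[comparator](value, threshold)


# 7 failures out of 100 exchanges against the example rule (GT 0.05)
fires = crosses_threshold(error_rate(7, 100), "GT", 0.05)
```

Note that `GT` is strict in this sketch: a value exactly equal to the threshold does not fire, while `GTE` would.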

### EXCHANGE_MATCH

Fires on exchanges matching a filter. Two firing modes — pick the one that matches your operational intent.

#### `fireMode: COUNT_IN_WINDOW`

One alert when the count of matching exchanges in a rolling window crosses a threshold. Aggregation-style: good for "more than 3 payment failures in 10 minutes."

```json
{
  "name": "Payment failures spike",
  "severity": "WARNING",
  "conditionKind": "EXCHANGE_MATCH",
  "condition": {
    "kind": "EXCHANGE_MATCH",
    "scope": { "appSlug": "payment-service", "routeId": "processPayment" },
    "filter": { "status": "FAILED", "attributes": { "payment.type": "card" } },
    "fireMode": "COUNT_IN_WINDOW",
    "threshold": 3,
    "windowSeconds": 600
  }
}
```

#### `fireMode: PER_EXCHANGE`

One alert per distinct failed exchange — **exactly once**. Each failure produces its own `AlertInstance` and its own notification. The Inbox contains one row per failed exchange, never a duplicate, across ticks or process restarts. Good for "page me for every failed order regardless of rate."

```json
{
  "name": "Any order failure",
  "severity": "CRITICAL",
  "conditionKind": "EXCHANGE_MATCH",
  "condition": {
    "kind": "EXCHANGE_MATCH",
    "scope": { "appSlug": "orders-service" },
    "filter": { "status": "FAILED" },
    "fireMode": "PER_EXCHANGE"
  }
}
```

PER_EXCHANGE rules accept a narrower set of options. The server rejects inconsistent combinations at save time with HTTP 400:

| Field | PER_EXCHANGE | COUNT_IN_WINDOW |
|---|---|---|
| `threshold`, `windowSeconds` | must be absent / zero | required, positive |
| `reNotifyMinutes` | must be 0 (fires once; re-notify does not apply) | optional |
| `forDurationSeconds` | must be 0 | optional |
| `scope`, `filter`, `severity`, notification template, `webhooks` / `targets` | standard | standard |

Additionally, any rule (any `conditionKind`) with **both** empty `webhooks` and empty `targets` is rejected — a rule that notifies no one is always a misconfiguration.
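The constraints in the table and the paragraph above can be pre-checked on the client before saving. The following is a minimal sketch, not the server's validator: `validate_rule` is a hypothetical helper, and placing `reNotifyMinutes` and `forDurationSeconds` at the top level of the rule (rather than inside `condition`) is an assumption.

```python
def validate_rule(rule: dict) -> list[str]:
    """Return a list of problems; an empty list means the rule looks coherent."""
    errors = []

    # Applies to every condition kind: a rule must notify someone.
    if not rule.get("webhooks") and not rule.get("targets"):
        errors.append("rule needs at least one webhook or target")

    condition = rule.get("condition", {})
    if condition.get("kind") != "EXCHANGE_MATCH":
        return errors

    fire_mode = condition.get("fireMode")
    if fire_mode == "PER_EXCHANGE":
        # Absent, null, and 0 are all treated as "not set".
        for field in ("threshold", "windowSeconds"):
            if condition.get(field):
                errors.append(f"{field} must be absent or zero")
        for field in ("reNotifyMinutes", "forDurationSeconds"):  # assumed location
            if rule.get(field):
                errors.append(f"{field} must be 0")
    elif fire_mode == "COUNT_IN_WINDOW":
        for field in ("threshold", "windowSeconds"):
            value = condition.get(field)
            if not isinstance(value, (int, float)) or value <= 0:
                errors.append(f"{field} is required and must be positive")
    return errors
```

A client-side check like this only saves a round trip; the server remains the authority on what it accepts.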

**Exactly-once guarantee: scope.** The server creates one `AlertInstance` and one PENDING `AlertNotification` per exchange, and this holds across evaluator ticks and process restarts. HTTP webhook delivery is still at-least-once under transient failure; for Slack and similar, include `{{alert.id}}` in the message template so the consumer can dedup.
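Because webhook delivery is at-least-once, a receiver may see the same alert twice. The sketch below shows one way a consumer could drop repeats by keying on the `{{alert.id}}` value it extracts from the payload. `AlertDeduplicator` is a hypothetical name, and the in-memory set is an assumption made for brevity; a real consumer would probably persist seen IDs so that dedup survives its own restarts.

```python
class AlertDeduplicator:
    """Remembers alert IDs that were already processed (in memory only)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def accept(self, alert_id: str) -> bool:
        """Return True the first time an ID is seen, False on any redelivery."""
        if alert_id in self._seen:
            return False
        self._seen.add(alert_id)
        return True


dedup = AlertDeduplicator()
first = dedup.accept("3f1c9a52-0000-0000-0000-000000000001")   # new alert
again = dedup.accept("3f1c9a52-0000-0000-0000-000000000001")   # webhook retry
```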

**First post-deploy tick — backlog cap.** A PER_EXCHANGE rule's first run (no persisted cursor yet) would otherwise scan from `rule.createdAt` forward, which can trigger a one-time notification flood for long-lived rules after a DB migration or schema reset. The server clamps the first-run scan to `max(rule.createdAt, now - deployBacklogCap)`. Default cap: 24 h. Tune via `cameleer.server.alerting.per-exchange-deploy-backlog-cap-seconds` (set to 0 to disable the clamp and replay from `createdAt`).
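The clamp can be worked through with concrete timestamps. This is a small sketch of the formula quoted above; `first_run_scan_start` is a hypothetical name, and the only behaviour it encodes is `max(rule.createdAt, now - deployBacklogCap)` plus the documented "0 disables the clamp" case.

```python
from datetime import datetime, timedelta, timezone


def first_run_scan_start(created_at: datetime, now: datetime,
                         cap_seconds: int = 24 * 3600) -> datetime:
    """Where the first PER_EXCHANGE scan starts when no cursor is persisted."""
    if cap_seconds == 0:
        # Clamp disabled: replay everything since the rule was created.
        return created_at
    return max(created_at, now - timedelta(seconds=cap_seconds))


now = datetime(2026, 4, 20, 12, 0, tzinfo=timezone.utc)
old_rule = datetime(2026, 4, 10, 12, 0, tzinfo=timezone.utc)   # created 10 days ago
new_rule = datetime(2026, 4, 20, 11, 0, tzinfo=timezone.utc)   # created 1 hour ago

start_old = first_run_scan_start(old_rule, now)      # clamped to now - 24 h
start_new = first_run_scan_start(new_rule, now)      # unchanged: newer than the cap
start_all = first_run_scan_start(old_rule, now, 0)   # clamp disabled
```

With the default cap, a rule created ten days ago only scans the last 24 hours on its first run, so at most one day of matching exchanges can notify.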

### AGENT_STATE

Fires when a specific agent (or any agent for an app) stays in a given state for a sustained period.

```json
{
  "name": "Orders agent dead",
  "severity": "CRITICAL",
  "conditionKind": "AGENT_STATE",
  "condition": {
    "kind": "AGENT_STATE",
    "scope": { "appSlug": "orders-service" },
    "state": "DEAD",
    "forSeconds": 120
  }
}
```

States: `LIVE`, `STALE`, `DEAD`.

### DEPLOYMENT_STATE

Fires when a deployment reaches one of the specified states.

```json
{
  "name": "Deployment failed",
  "severity": "WARNING",
  "conditionKind": "DEPLOYMENT_STATE",
  "condition": {
    "kind": "DEPLOYMENT_STATE",
    "scope": { "appSlug": "orders-service" },
    "states": ["FAILED", "DEGRADED"]
  }
}
```

### LOG_PATTERN

Fires when the number of log lines matching a regex pattern at a given level exceeds a threshold in a rolling window.

```json
{
  "name": "TimeoutException spike",
  "severity": "WARNING",
  "conditionKind": "LOG_PATTERN",
  "condition": {
    "kind": "LOG_PATTERN",
    "scope": { "appSlug": "orders-service" },
    "level": "ERROR",
    "pattern": "TimeoutException",
    "threshold": 5,
    "windowSeconds": 300
  }
}
```

`level`: `TRACE`, `DEBUG`, `INFO`, `WARN`, `ERROR`. `pattern` is a Java regex matched against the log message.

### JVM_METRIC

Fires when an aggregated JVM metric crosses a threshold.

```json
{
  "name": "Heap > 85%",
  "severity": "WARNING",
  "conditionKind": "JVM_METRIC",
  "condition": {
    "kind": "JVM_METRIC",
    "scope": { "appSlug": "orders-service" },
    "metric": "jvm.memory.used.value",
    "aggregation": "AVG",
    "comparator": "GT",
    "threshold": 0.85,
    "windowSeconds": 120
  }
}
```

`aggregation`: `AVG`, `MAX`, `MIN`, `LAST`.

---

## Notification Templates

Rules carry `notificationTitleTmpl` and `notificationMessageTmpl` fields, both rendered with [JMustache](https://github.com/samskivert/jmustache). Variables available in every template (populated by `NotificationContextBuilder`):

| Variable | Example |
|---|---|
| `{{rule.name}}` | "TimeoutException spike" |
| `{{rule.severity}}` | "WARNING" |
| `{{rule.description}}` | "…" |
| `{{alert.id}}` | UUID |
| `{{alert.state}}` | "FIRING" |
| `{{alert.firedAt}}` | ISO-8601 instant |
| `{{alert.resolvedAt}}` | ISO-8601 instant or empty |
| `{{alert.currentValue}}` | numeric value that triggered |
| `{{alert.threshold}}` | configured threshold |
| `{{alert.link}}` | deep-link URL to inbox item |
| `{{env.slug}}` | "prod" |
| `{{env.name}}` | "Production" |

Default templates (applied when not specified):

- Title: `"[{{rule.severity}}] {{rule.name}} — {{env.slug}}"`
- Message: `"Alert {{alert.id}} fired at {{alert.firedAt}}. Value: {{alert.currentValue}}, Threshold: {{alert.threshold}}"`

Use `POST /alerts/rules/{id}/render-preview` to test templates before saving.
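To get a feel for how the dotted variables resolve against a nested context, here is a deliberately tiny stand-in for a Mustache renderer. It is not JMustache and supports only plain `{{a.b}}` substitution (no sections, no escaping). Leaving unresolved tags as literal text is an assumption, and the context values are made up for illustration.

```python
import re

TAG = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace {{dotted.path}} tags with values from a nested dict."""
    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                return match.group(0)  # assumption: unknown tags stay literal
            value = value[part]
        return str(value)

    return TAG.sub(lookup, template)


context = {
    "rule": {"name": "TimeoutException spike", "severity": "WARNING"},
    "alert": {"currentValue": 7, "threshold": 5},
    "env": {"slug": "prod"},
}
title = render("[{{rule.severity}}] {{rule.name}} in {{env.slug}}", context)
message = render("Value: {{alert.currentValue}}, Threshold: {{alert.threshold}}", context)
```

The render-preview endpoint remains the reliable way to check a template, since it runs the real renderer with the real context.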

---

## Webhook Setup

Webhooks are sent via **outbound connections** managed by an ADMIN at `/api/v1/admin/outbound-connections`. This decouples secrets (HMAC key, auth tokens) from rule definitions. An OPERATOR can attach an existing connection to a rule.

### Creating an outbound connection (ADMIN)

```http
POST /api/v1/admin/outbound-connections
Content-Type: application/json

{
  "name": "slack-alerts",
  "url": "https://hooks.slack.com/services/T00/B00/XXX",
  "method": "POST",
  "tlsTrustMode": "SYSTEM_DEFAULT",
  "auth": { "kind": "NONE" },
  "defaultHeaders": { "Content-Type": "application/json" },
  "bodyTemplate": "{\"text\": \"{{rule.name}}: {{alert.state}}\"}",
  "hmacSecret": "my-signing-secret",
  "allowedEnvironmentIds": []
}
```

For PagerDuty Events API v2:

```json
{
  "name": "pagerduty-prod",
  "url": "https://events.pagerduty.com/v2/enqueue",
  "method": "POST",
  "tlsTrustMode": "SYSTEM_DEFAULT",
  "auth": { "kind": "NONE" },
  "defaultHeaders": { "Content-Type": "application/json" },
  "bodyTemplate": "{\"routing_key\":\"your-integration-key\",\"event_action\":\"trigger\",\"payload\":{\"summary\":\"{{rule.name}}\",\"severity\":\"{{rule.severity}}\",\"source\":\"{{env.slug}}\"}}"
}
```

### Attaching to a rule (OPERATOR)

Include the connection UUID in the `webhooks` array when creating or updating a rule:

```json
{
  "webhooks": [
    { "outboundConnectionId": "a1b2c3d4-..." }
  ]
}
```

The server validates that the connection exists and is allowed in the rule's environment (422 otherwise).

### HMAC Signature

When `hmacSecret` is set on the connection, each POST includes:

```
X-Cameleer-Signature: sha256=<hex-encoded-HMAC-SHA256(secret, body)>
```

Verify this on the receiving end to confirm authenticity.
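A receiver could check the header roughly as follows. This is a sketch under assumptions: the signature is computed over the raw request body bytes exactly as sent, the secret is used as UTF-8 bytes, and the hex digest is lowercase. `verify_signature` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Check an `X-Cameleer-Signature: sha256=<hex>` header against the body."""
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    received = header_value[len(prefix):].strip().lower()
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


secret = b"my-signing-secret"
body = b'{"text": "High error rate on orders: FIRING"}'
header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

valid = verify_signature(secret, body, header)
tampered = verify_signature(secret, body + b" ", header)
```

Verify against the bytes received on the wire, before any JSON parsing or re-serialization, since even a whitespace change alters the digest.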

---

## Silences

A silence suppresses notifications for matching alerts without deleting the rule. Silences are time-bounded.

```http
POST /api/v1/environments/{envSlug}/alerts/silences
Content-Type: application/json

{
  "matcher": {
    "ruleId": "uuid-of-rule",
    "severity": "WARNING"
  },
  "reason": "Planned maintenance window",
  "startsAt": "2026-04-20T02:00:00Z",
  "endsAt": "2026-04-20T06:00:00Z"
}
```

Matcher fields are all optional; at least one should be set. A silence matches an alert instance if ALL specified matcher fields match. List active silences with `GET /api/v1/environments/{envSlug}/alerts/silences`.
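The "ALL specified fields match" rule can be sketched as below. The alert is modelled as a flat dict whose keys mirror the matcher fields, which is an assumption about shape, as are the half-open time window (`startsAt` inclusive, `endsAt` exclusive) and treating an empty matcher as matching nothing. `silence_applies` is a hypothetical name.

```python
from datetime import datetime, timezone


def _parse(ts: str) -> datetime:
    # Accept the trailing "Z" used in the examples above.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def silence_applies(silence: dict, alert: dict, now: datetime) -> bool:
    """True when the silence is active at `now` and every set field matches."""
    if not (_parse(silence["startsAt"]) <= now < _parse(silence["endsAt"])):
        return False
    specified = {k: v for k, v in silence["matcher"].items() if v is not None}
    if not specified:
        return False  # assumption: an empty matcher silences nothing
    return all(alert.get(field) == value for field, value in specified.items())


silence = {
    "matcher": {"ruleId": "uuid-of-rule", "severity": "WARNING"},
    "startsAt": "2026-04-20T02:00:00Z",
    "endsAt": "2026-04-20T06:00:00Z",
}
during = datetime(2026, 4, 20, 3, 0, tzinfo=timezone.utc)
alert = {"ruleId": "uuid-of-rule", "severity": "WARNING", "appSlug": "orders-service"}
suppressed = silence_applies(silence, alert, during)
```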

---

## Troubleshooting

### Circuit Breaker

If an evaluator kind (`LOG_PATTERN`, `ROUTE_METRIC`, etc.) throws exceptions repeatedly (default: 5 failures in 30 s), the circuit opens and skips that kind for a cooldown period (default: 60 s). Check server logs for:

```
Circuit breaker open for LOG_PATTERN; skipping rule <id>
```

The `alerting_circuit_opened_total{kind}` Prometheus counter tracks openings.

Tune via:

```yaml
cameleer:
  server:
    alerting:
      circuit-breaker-fail-threshold: 5
      circuit-breaker-window-seconds: 30
      circuit-breaker-cooldown-seconds: 60
```

### Retention

Old resolved alert instances and settled notifications are deleted nightly at 03:00. Retention windows:

```yaml
cameleer:
  server:
    alerting:
      event-retention-days: 90        # RESOLVED instances
      notification-retention-days: 30 # DELIVERED/FAILED notifications
```

FIRING and ACKNOWLEDGED instances are never deleted by retention (only RESOLVED ones are).

### Webhook delivery failures

Check `GET /api/v1/environments/{envSlug}/alerts/{id}/notifications` for the response status and a snippet of the response body. An OPERATOR can retry a failed notification via `POST /api/v1/alerts/notifications/{id}/retry`.

### Prometheus metrics (alerting)

| Metric | Tags | Description |
|---|---|---|
| `alerting_eval_errors_total` | `kind` | Evaluation errors by condition kind |
| `alerting_eval_duration_seconds` | `kind` | Evaluation latency histogram |
| `alerting_circuit_opened_total` | `kind` | Circuit breaker open transitions |
| `alerting_notifications_total` | `status` | Notification outcomes |
| `alerting_webhook_delivery_duration_seconds` | — | Webhook POST latency |
| `alerting_rules_total` | `state` (enabled/disabled) | Rule count gauge |
| `alerting_instances_total` | `state` | Instance count gauge |

### ClickHouse projections

The `LOG_PATTERN` and `EXCHANGE_MATCH` evaluators use ClickHouse projections (`logs_by_level`, `executions_by_status`). On fresh ClickHouse containers (e.g. Testcontainers), projections may not be active immediately — the evaluator falls back to a full table scan with the same WHERE clause, so correctness is preserved but latency may increase on first evaluation. In production ClickHouse, projections are applied to new data immediately and to existing data after `OPTIMIZE TABLE … FINAL`.

---

## UI walkthrough

The alerting UI is accessible to any authenticated VIEWER+; write actions (create rule, silence, ack) require OPERATOR+ per backend RBAC.

### Sidebar

A dedicated **Alerts** section between Applications and Admin:

- **Inbox** — open alerts targeted at you (state FIRING or ACKNOWLEDGED). Mark individual rows as read by clicking the title, or "Mark all read" via the toolbar. Firing rows have an amber left border.
- **All** — every open alert in the environment with state-chip filter (Open / Firing / Acked / All).
- **Rules** — the rule catalogue. Toggle the Enabled switch to disable a rule without deleting it. Delete prompts for confirmation; fired instances survive via `rule_snapshot`.
- **Silences** — active + scheduled silences. Create one by filling any combination of `ruleId` and `appSlug`, duration (hours), optional reason.
- **History** — RESOLVED alerts within the retention window (default 90 days).

### Notification bell

A bell icon in the top bar polls `/alerts/unread-count` every 30 seconds (paused when the tab is hidden). Clicking it navigates to the inbox.

### Rule editor (5-step wizard)

1. **Scope** — name, severity, and radio between environment-wide, single-app, single-route, or single-agent.
2. **Condition** — one of six condition kinds (ROUTE_METRIC, EXCHANGE_MATCH, AGENT_STATE, DEPLOYMENT_STATE, LOG_PATTERN, JVM_METRIC) with a form tailored to each.
3. **Trigger** — evaluation interval (≥5s), for-duration before firing (0 = fire immediately), re-notify cadence (minutes). Test-evaluate button when editing an existing rule.
4. **Notify** — notification title + message templates (Mustache with autocomplete), target users/groups/roles, webhook bindings (filtered to outbound connections allowed in the current env).
5. **Review** — summary card, enable toggle, save.

### Mustache autocomplete

Every template-editable field uses a shared CodeMirror 6 editor with variable autocomplete:

- Type `{{` to open the variable picker.
- Variables filter by condition kind (e.g. `route.*` is only shown when a route-scoped condition is selected).
- Unknown references get an amber underline at save time ("not available for this rule kind — will render as literal").
- The canonical variable list lives in `ui/src/components/MustacheEditor/alert-variables.ts` and mirrors the backend `NotificationContextBuilder`.

### Env promotion

Rules are environment-scoped. To replicate a rule in another env, open the source env's rule list and pick a target env from the **Promote to ▾** dropdown. The editor opens pre-filled with the source rule's values, with client-side warnings:

- Agent IDs are env-specific and get cleared.
- Apps that don't exist in the target env flag an "update before saving" hint.
- Outbound connections not allowed in the target env flag a "remove or pick another" hint.

Promotion adds no REST endpoint; it is a purely UI-driven create against the target environment.

### CMD-K

The command palette (`Ctrl/Cmd + K`) surfaces open alerts and alert rules alongside existing apps/routes/exchanges. Select an alert to jump to its inbox detail; select a rule to open its editor.