cameleer-server/docs/superpowers/specs/2026-04-19-alerting-design.md

# Alerting — Design Spec

**Date:** 2026-04-19
**Status:** Draft — awaiting user review
**Surfaces:** server (core + app), UI, admin, Gitea issues
**Related:** [backlog BL-001](../backlog.md) / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137) (managed CA bundles — deferred)

---

## 1. Summary

A first-class alerting feature inside Cameleer. Operators author rules that evaluate conditions over observability data; violations create shared, env-scoped alerts visible in an in-app inbox and optionally dispatched to external systems via admin-curated webhook connections. Lifecycle: `PENDING → FIRING → ACKNOWLEDGED → RESOLVED` with orthogonal `SILENCED`. Horizontally scalable via PostgreSQL claim-based polling. All code confined to new `alerting/`, `outbound/`, and `http/` packages with minimal, documented touchpoints on existing stores.

### Guiding principles

- **"Good enough" baseline.** Customers with dedicated ops tooling (PagerDuty, Grafana, Opsgenie) will keep using it — alerting here serves those *without*. Resist incident-management feature creep; provide the floor, not the ceiling.
- **Confinement over cleverness.** Reads go through existing interfaces; no hooks into ingestion; no new ClickHouse tables; all new code in dedicated packages. The feature should be removable by deleting those packages and one migration.
- **Env-scoped by default, tenant-global where infrastructure.** Rules, alerts, silences live inside an environment. Outbound connections are tenant-global infrastructure admins manage, optionally restricted by env.
- **Performance is a first-class design concern**, not a v2 afterthought. Claim-polling, query coalescing, in-tick caching, per-kind circuit breaker, and CH projections are all v1.
- **No ClickHouse table changes, only projections.** Additive, idempotent (`IF NOT EXISTS`), safe to drop and rebuild.

---

## 2. Scope

### In scope (v1)

Six signal sources, expressed as sealed-type conditions:

1. **`ROUTE_METRIC`** — aggregate stats per route or app: error rate, p95/p99 latency, throughput, error count. Backed by `stats_1m_route`.
2. **`EXCHANGE_MATCH`** — per-exchange matching with two fire modes:
   - `PER_EXCHANGE` — one alert per matching exchange (cursor-advanced, used for "specific failure" patterns)
   - `COUNT_IN_WINDOW` — aggregate "N exchanges matched in window" threshold
3. **`AGENT_STATE`** — agent in `DEAD` / `STALE` state for ≥ N seconds. Reads in-memory `AgentRegistryService`.
4. **`DEPLOYMENT_STATE`** — deployment status is `FAILED` / `DEGRADED` for ≥ N seconds.
5. **`LOG_PATTERN`** — count of log rows matching level / logger / pattern in a window > threshold.
6. **`JVM_METRIC`** — agent-reported JVM/Camel metric (heap %, GC pressure, inflight) over threshold for a window.

**Delivery channels.** In-app inbox (derived from alerts + target-membership) and outbound HTTPS webhooks (via admin-managed outbound connections). No email. No native Slack/Teams integrations — users point webhook URLs at their own integrations.

**Sharing model.** Rules are shared within an environment; alerts are visible to any viewer of the env, but notifications route to targeted users, groups, or roles (via existing RBAC).

**Lifecycle states.** `PENDING → FIRING → ACKNOWLEDGED → RESOLVED`, with `SILENCED` as an orthogonal property resolved at notification-dispatch time (preserves audit trail).

**Rule promotion across environments** via UI prefill — no new server endpoint.

**CMD-K integration** — alerts + alert rules appear as new result sources in the existing CMD-K registry.

**Configurable evaluator cadence** (min 5 s floor), per-rule evaluation intervals, per-rule re-notify cadence.

### Out of scope (v1, not deferred)

- Custom SQL / Prometheus-style query DSL (option F).
- Email delivery channel — webhooks cover Slack / PagerDuty / Teams / OpsGenie / n8n / Zapier via ops-team-owned integrations.
- Native provider integrations (Slack, Teams, PagerDuty as first-class types).
- Incident management (merging alerts, parent/child, assignees, SLA tracking) — integrate with PagerDuty via webhook instead.
- Expression language in rules — fixed templates only.
- mTLS / client-cert auth on outbound webhooks.
- Real-time push (SSE) to the UI — 30 s polling is the v1 cadence. SSE is a clean drop-in for v2 if needed.

### Deferred to backlog

- **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)** — in-app CA bundle management UI. Deferred pending investigation of reusing the SaaS layer's existing CA handling (KISS / DRY). V1 CA trust material is filesystem-resident via deployment config, same posture as OIDC issuer URIs and Ed25519 keys.

---

## 3. Key decisions

Captured from brainstorming, in order of architectural impact.

| Decision | Chosen | Rejected | Rationale |
|---|---|---|---|
| Signal sources | 6 (route / exchange / agent / deployment / log / JVM) | SQL power-user mode | "Good enough" baseline; fixed templates cover real needs; expression languages are where observability tools go to be rewritten |
| Delivery channels | in-app + webhook | email, native integrations | Webhooks cover every target; email is deceptively expensive (deliverability, bounces, DKIM) |
| Sharing | tenant-env-shared rules; notifications target users/groups/roles | per-user "my alerts" | Ops products need single source of truth for what's broken; targets give per-person routing without duplicating rules |
| Evaluation | pull / claim-based polling | push / event-driven | Confinement — reads through existing interfaces, zero ingestion hooks; native handling of "no data" condition; 60 s latency acceptable for ops alerting |
| Horizontal scale | `FOR UPDATE SKIP LOCKED` claim pattern | advisory locks / leader election | Naturally partitions work; supports per-rule cadences; recovers from replica death; industry-standard |
| Alert lifecycle | FIRING / ACK / RESOLVED + SILENCED | minimal fire/resolve only, full incident workflow | Ack is the floor for team workflows (stop paging everyone); silences needed for ops maintenance; incident mgmt is a product category, not a feature |
| Rule shape | fixed templates, sealed-type JSONB | expression DSL, expression-first | Form-fillable; typed; additive for new kinds; consistent with no-SQL decision |
| Templating | JMustache | in-house substituter, Pebble/Freemarker | Industry standard for webhook templates (Slack, PagerDuty); logic-less (safe); small dep; familiar to ops users |
| UI placement | top-nav bell (consumer) + `/alerts` section (OPERATOR+ authoring, VIEWER+ read) | admin-only page, embedded per context, new top-level tab only | Separates consumer from authoring surfaces; rule authoring happens frequently, shouldn't be buried in admin |
| CMD-K | alerts + rules searchable | not searchable | Covers the "I saw this alert before lunch" use case; small surface via existing result-source registry |
| Outbound connections | admin-managed, tenant-global, allowed-env restriction | per-rule raw webhook URLs | Admins own infrastructure; operators author rules; rotation is atomic across N rules; reusable for future integrations |
| TLS trust | shared cross-cutting module `http/` | alerting-local trust config | Future-proofs for additional outbound HTTPS consumers; joins the existing OIDC outbound path |
| CA management UI | **deferred (BL-001)** | build in-server now | SaaS-layer CA mechanism should be investigated first for reuse |
| Env deletion | full cascade across alerting tables | partial cascade with SET NULL | POC teardown safety — zero orphaned rows |

---

## 4. Module architecture

### Package layout

```
cameleer-server-core/src/main/java/com/cameleer/server/core/
├── alerting/                      (domain; pure records + interfaces)
│   ├── AlertRule
│   ├── AlertCondition (sealed)
│   │   ├── RouteMetricCondition
│   │   ├── ExchangeMatchCondition
│   │   ├── AgentStateCondition
│   │   ├── DeploymentStateCondition
│   │   ├── LogPatternCondition
│   │   └── JvmMetricCondition
│   ├── AlertSeverity / AlertState (enums)
│   ├── AlertInstance / AlertEvent
│   ├── NotificationTarget / NotificationTargetKind
│   ├── AlertSilence / SilenceMatcher
│   ├── AlertRuleRepository (interface)
│   ├── AlertInstanceRepository (interface)
│   ├── AlertSilenceRepository (interface)
│   ├── AlertNotificationRepository (interface)
│   ├── AlertReadRepository (interface)
│   ├── ConditionEvaluator<C> (sealed)
│   └── NotificationDispatcher (interface)
├── outbound/                      (admin-managed outbound connections)
│   ├── OutboundConnection
│   ├── OutboundAuth (sealed — NONE, BEARER, BASIC)
│   ├── TrustMode (enum)
│   └── OutboundConnectionRepository (interface)
└── http/                          (cross-cutting outbound HTTP primitive)
    ├── OutboundHttpProperties
    ├── OutboundHttpRequestContext
    └── OutboundHttpClientFactory (interface)

cameleer-server-app/src/main/java/com/cameleer/server/app/
├── alerting/
│   ├── controller/                (REST)
│   │   ├── AlertRuleController
│   │   ├── AlertController
│   │   ├── AlertSilenceController
│   │   └── AlertNotificationController
│   ├── storage/                   (Postgres)
│   │   ├── PostgresAlertRuleRepository
│   │   ├── PostgresAlertInstanceRepository
│   │   ├── PostgresAlertSilenceRepository
│   │   ├── PostgresAlertNotificationRepository
│   │   └── PostgresAlertReadRepository
│   ├── eval/                      (the scheduled evaluators)
│   │   ├── AlertEvaluatorJob        (@Scheduled, claim-based)
│   │   ├── RouteMetricEvaluator
│   │   ├── ExchangeMatchEvaluator
│   │   ├── AgentStateEvaluator
│   │   ├── DeploymentStateEvaluator
│   │   ├── LogPatternEvaluator
│   │   ├── JvmMetricEvaluator
│   │   ├── PerKindCircuitBreaker
│   │   └── TickCache
│   ├── notify/
│   │   ├── NotificationDispatchJob  (@Scheduled, claim-based)
│   │   ├── InAppInboxQuery
│   │   ├── WebhookDispatcher
│   │   ├── MustacheRenderer
│   │   └── SilenceMatcher
│   ├── dto/                       (AlertRuleDto, AlertDto, ConditionDto sealed, WebhookDto, etc.)
│   ├── retention/
│   │   └── AlertingRetentionJob     (daily @Scheduled)
│   └── config/
│       └── AlertingProperties       (@ConfigurationProperties)
├── outbound/
│   ├── controller/
│   │   └── OutboundConnectionAdminController
│   ├── storage/
│   │   └── PostgresOutboundConnectionRepository
│   └── dto/
│       └── OutboundConnectionDto
└── http/
    ├── ApacheOutboundHttpClientFactory
    ├── SslContextBuilder
    └── config/
        └── OutboundHttpConfig         (@ConfigurationProperties)

cameleer-server-app/src/main/resources/
├── db/migration/V11__alerting_and_outbound.sql   (one Flyway migration)
└── clickhouse/V_alerting_projections.sql         (one CH migration, idempotent)

ui/src/
├── pages/Alerts/
│   ├── InboxPage.tsx
│   ├── AllAlertsPage.tsx
│   ├── RulesListPage.tsx
│   ├── RuleEditor/
│   │   ├── RuleEditorWizard.tsx
│   │   ├── ScopeStep.tsx
│   │   ├── ConditionStep.tsx
│   │   ├── TriggerStep.tsx
│   │   ├── NotifyStep.tsx
│   │   └── ReviewStep.tsx
│   ├── SilencesPage.tsx
│   └── HistoryPage.tsx
├── pages/Admin/
│   └── OutboundConnectionsPage.tsx
├── components/
│   ├── NotificationBell.tsx
│   └── AlertStateChip.tsx
├── api/queries/
│   ├── alerts.ts
│   ├── alertRules.ts
│   ├── alertSilences.ts
│   └── outboundConnections.ts
└── cmdk/sources/
    ├── alerts.ts
    └── alertRules.ts
```

### Touchpoints on existing code (deliberate, minimal)

| Existing surface | Change | Scope |
|---|---|---|
| `cameleer-server-app/src/main/resources/db/migration/V11__…` | New Flyway migration | additive |
| `cameleer-server-app/src/main/resources/clickhouse/V_…_projections.sql` | New CH migration | additive, `IF NOT EXISTS` |
| `ClickHouseLogStore` | New method `long countLogs(LogSearchRequest)` (no `FINAL`) | one public method added |
| `ClickHouseSearchIndex` | New method `long countExecutionsForAlerting(AlertMatchSpec)` (no `FINAL`, no text-in-body subqueries) | one public method added |
| `SecurityConfig` | Path matchers for new endpoints | ~15 lines |
| `ui/src/router.tsx` | Route entries for `/alerts/**` and `/admin/outbound-connections` | additive |
| Top-nav layout | Insert `<NotificationBell />` | one import + one component |
| CMD-K registry | Register `alerts` + `alertRules` result sources | two file additions + one import |
| `.claude/rules/app-classes.md` + `core-classes.md` | Update class maps for the new packages | documentation |
| `com.cameleer:cameleer-common` | no changes | — |
| ingestion paths | no changes | — |
| agent protocol | no changes | — |
| ClickHouse schema (table structure) | no changes — only projections added | — |

### New dependencies

- `com.samskivert:jmustache` — logic-less Mustache templating for webhook/notification templates. ~30 KB, zero transitive deps. Added to `cameleer-server-core`.
- Apache HttpClient 5 (`org.apache.hc.client5`) — **already present** in the project; no new coordinate.

---

## 5. Data model (PostgreSQL)

One Flyway migration `V11__alerting_and_outbound.sql` creates all tables, enums, and indexes in a single transaction.

### Enum types

```sql
CREATE TYPE severity_enum          AS ENUM ('CRITICAL','WARNING','INFO');
CREATE TYPE condition_kind_enum    AS ENUM ('ROUTE_METRIC','EXCHANGE_MATCH','AGENT_STATE','DEPLOYMENT_STATE','LOG_PATTERN','JVM_METRIC');
CREATE TYPE alert_state_enum       AS ENUM ('PENDING','FIRING','ACKNOWLEDGED','RESOLVED');
CREATE TYPE target_kind_enum       AS ENUM ('USER','GROUP','ROLE');
CREATE TYPE notification_status_enum AS ENUM ('PENDING','DELIVERED','FAILED');
CREATE TYPE trust_mode_enum        AS ENUM ('SYSTEM_DEFAULT','TRUST_ALL','TRUST_PATHS');
CREATE TYPE outbound_method_enum   AS ENUM ('POST','PUT','PATCH');
CREATE TYPE outbound_auth_kind_enum AS ENUM ('NONE','BEARER','BASIC');
```

### Tables

#### `outbound_connections` (admin-managed)

```sql
CREATE TABLE outbound_connections (
  id                       uuid PRIMARY KEY,
  tenant_id                varchar(64) NOT NULL,
  name                     varchar(100) NOT NULL,
  description              text,
  url                      text NOT NULL,                         -- Mustache-enabled
  method                   outbound_method_enum NOT NULL,
  default_headers          jsonb NOT NULL DEFAULT '{}',           -- values are Mustache templates
  default_body_tmpl        text,                                  -- null = built-in default JSON envelope
  tls_trust_mode           trust_mode_enum NOT NULL DEFAULT 'SYSTEM_DEFAULT',
  tls_ca_pem_paths         jsonb NOT NULL DEFAULT '[]',           -- array of paths from OutboundHttpProperties
  hmac_secret              text,                                  -- Ed25519-key-derived encryption at rest
  auth_kind                outbound_auth_kind_enum NOT NULL DEFAULT 'NONE',
  auth_config              jsonb NOT NULL DEFAULT '{}',           -- shape depends on auth_kind; v1 unused
  allowed_environment_ids  uuid[] NOT NULL DEFAULT '{}',          -- [] = allowed in all envs
  created_at               timestamptz NOT NULL DEFAULT now(),
  created_by               uuid NOT NULL REFERENCES users(id),
  updated_at               timestamptz NOT NULL DEFAULT now(),
  updated_by               uuid NOT NULL REFERENCES users(id),
  UNIQUE (tenant_id, name)
);
CREATE INDEX outbound_connections_tenant_idx ON outbound_connections (tenant_id);
```

#### `alert_rules`

```sql
CREATE TABLE alert_rules (
  id                          uuid PRIMARY KEY,
  environment_id              uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
  name                        varchar(200) NOT NULL,
  description                 text,
  severity                    severity_enum NOT NULL,
  enabled                     boolean NOT NULL DEFAULT true,

  condition_kind              condition_kind_enum NOT NULL,
  condition                   jsonb NOT NULL,                     -- sealed-subtype payload, Jackson polymorphic on `kind`

  evaluation_interval_seconds int NOT NULL DEFAULT 60 CHECK (evaluation_interval_seconds >= 5),
  for_duration_seconds        int NOT NULL DEFAULT 0 CHECK (for_duration_seconds >= 0),
  re_notify_minutes           int NOT NULL DEFAULT 60 CHECK (re_notify_minutes >= 0),

  notification_title_tmpl     text NOT NULL,                      -- Mustache
  notification_message_tmpl   text NOT NULL,                      -- Mustache
  webhooks                    jsonb NOT NULL DEFAULT '[]',        -- [{id: uuid, outboundConnectionId, bodyOverride?, headerOverrides?}] — id assigned server-side on save, used as stable ref from alert_notifications.webhook_id

  next_evaluation_at          timestamptz NOT NULL DEFAULT now(),
  claimed_by                  varchar(64),
  claimed_until               timestamptz,
  eval_state                  jsonb NOT NULL DEFAULT '{}',

  created_at                  timestamptz NOT NULL DEFAULT now(),
  created_by                  uuid NOT NULL REFERENCES users(id),
  updated_at                  timestamptz NOT NULL DEFAULT now(),
  updated_by                  uuid NOT NULL REFERENCES users(id)
);
CREATE INDEX alert_rules_env_idx            ON alert_rules (environment_id);
CREATE INDEX alert_rules_claim_due_idx      ON alert_rules (next_evaluation_at) WHERE enabled = true;
```

#### `alert_rule_targets`

```sql
CREATE TABLE alert_rule_targets (
  id            uuid PRIMARY KEY,
  rule_id       uuid NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE,
  target_kind   target_kind_enum NOT NULL,
  target_id     varchar(128) NOT NULL,
  UNIQUE (rule_id, target_kind, target_id)
);
CREATE INDEX alert_rule_targets_lookup_idx ON alert_rule_targets (target_kind, target_id);
```

#### `alert_instances`

```sql
CREATE TABLE alert_instances (
  id                  uuid PRIMARY KEY,
  rule_id             uuid REFERENCES alert_rules(id) ON DELETE SET NULL,
  rule_snapshot       jsonb NOT NULL,
  environment_id      uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
  state               alert_state_enum NOT NULL,
  severity            severity_enum NOT NULL,
  fired_at            timestamptz NOT NULL,
  acked_at            timestamptz,
  acked_by            uuid REFERENCES users(id),
  resolved_at         timestamptz,
  last_notified_at    timestamptz,
  silenced            boolean NOT NULL DEFAULT false,
  current_value       numeric,
  threshold           numeric,
  context             jsonb NOT NULL,
  title               text NOT NULL,
  message             text NOT NULL,
  target_user_ids     uuid[] NOT NULL DEFAULT '{}',
  target_group_ids    uuid[] NOT NULL DEFAULT '{}',
  target_role_names   text[] NOT NULL DEFAULT '{}'
);
CREATE INDEX alert_instances_inbox_idx      ON alert_instances (environment_id, state, fired_at DESC);
CREATE INDEX alert_instances_open_rule_idx  ON alert_instances (rule_id, state) WHERE rule_id IS NOT NULL;
CREATE INDEX alert_instances_resolved_idx   ON alert_instances (resolved_at) WHERE state = 'RESOLVED';
CREATE INDEX alert_instances_target_u_idx   ON alert_instances USING GIN (target_user_ids);
CREATE INDEX alert_instances_target_g_idx   ON alert_instances USING GIN (target_group_ids);
CREATE INDEX alert_instances_target_r_idx   ON alert_instances USING GIN (target_role_names);
```

#### `alert_silences`

```sql
CREATE TABLE alert_silences (
  id             uuid PRIMARY KEY,
  environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
  matcher        jsonb NOT NULL,                      -- { ruleId?, appSlug?, routeId?, agentId?, severity? }
  reason         text,
  starts_at      timestamptz NOT NULL,
  ends_at        timestamptz NOT NULL CHECK (ends_at > starts_at),
  created_by     uuid NOT NULL REFERENCES users(id),
  created_at     timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_silences_active_idx ON alert_silences (environment_id, ends_at);
```

#### `alert_notifications` (webhook delivery outbox)

```sql
CREATE TABLE alert_notifications (
  id                    uuid PRIMARY KEY,
  alert_instance_id     uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
  webhook_id            uuid,                        -- opaque ref into rule's webhooks JSONB
  outbound_connection_id uuid REFERENCES outbound_connections(id) ON DELETE SET NULL,
  status                notification_status_enum NOT NULL DEFAULT 'PENDING',
  attempts              int NOT NULL DEFAULT 0,
  next_attempt_at       timestamptz NOT NULL DEFAULT now(),
  claimed_by            varchar(64),
  claimed_until         timestamptz,
  last_response_status  int,
  last_response_snippet text,
  payload               jsonb NOT NULL,              -- snapshotted at first attempt
  delivered_at          timestamptz,
  created_at            timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_notifications_pending_idx ON alert_notifications (next_attempt_at) WHERE status = 'PENDING';
CREATE INDEX alert_notifications_instance_idx ON alert_notifications (alert_instance_id);
```

#### `alert_reads`

```sql
CREATE TABLE alert_reads (
  user_id            uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  alert_instance_id  uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
  read_at            timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (user_id, alert_instance_id)
);
```

### Cascade summary

```
environments → alert_rules           (CASCADE)  → alert_rule_targets   (CASCADE)
environments → alert_silences        (CASCADE)
environments → alert_instances       (CASCADE)  → alert_reads          (CASCADE)
                                                → alert_notifications  (CASCADE)
alert_rules  → alert_instances                   (SET NULL, rule_snapshot preserves context)
users        → alert_reads           (CASCADE)
outbound_connections (delete)        (no DB-level FK) blocked by app-level 409 check while referenced in alert_rules.webhooks JSONB
```

**Rule deletion** preserves history (`alert_instances.rule_id = NULL`, `rule_snapshot` retains details). **Environment deletion** leaves zero alerting rows — POC-safe.

### Jackson polymorphism for conditions

```java
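import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonSubTypes.Type;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// The "kind" property already present in each payload selects the subtype.
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "kind",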
              include = JsonTypeInfo.As.EXISTING_PROPERTY, visible = true)
@JsonSubTypes({
    @Type(value = RouteMetricCondition.class,     name = "ROUTE_METRIC"),
    @Type(value = ExchangeMatchCondition.class,   name = "EXCHANGE_MATCH"),
    @Type(value = AgentStateCondition.class,      name = "AGENT_STATE"),
    @Type(value = DeploymentStateCondition.class, name = "DEPLOYMENT_STATE"),
    @Type(value = LogPatternCondition.class,      name = "LOG_PATTERN"),
    @Type(value = JvmMetricCondition.class,       name = "JVM_METRIC"),
})
public sealed interface AlertCondition permits
    RouteMetricCondition, ExchangeMatchCondition, AgentStateCondition,
    DeploymentStateCondition, LogPatternCondition, JvmMetricCondition {
    ConditionKind kind();
}
```

Each payload carries its own `kind` field, which Jackson reads (`EXISTING_PROPERTY`) to pick the subtype and the record still exposes as `ConditionKind kind()`. Bean Validation (`@Valid`) on each record validates at the controller boundary.

Example condition payloads:

```json
// ROUTE_METRIC
{ "kind": "ROUTE_METRIC",
  "scope": {"appSlug":"orders","routeId":"route-1"},
  "metric": "P99_LATENCY_MS", "comparator": "GT", "threshold": 2000, "windowSeconds": 300 }

// EXCHANGE_MATCH — PER_EXCHANGE
{ "kind": "EXCHANGE_MATCH",
  "scope": {"appSlug":"orders"},
  "filter": {"status":"FAILED","attributes":{"type":"payment"}},
  "fireMode": "PER_EXCHANGE", "perExchangeLingerSeconds": 300 }

// EXCHANGE_MATCH — COUNT_IN_WINDOW
{ "kind": "EXCHANGE_MATCH",
  "scope": {"appSlug":"orders"},
  "filter": {"status":"FAILED"},
  "fireMode": "COUNT_IN_WINDOW", "threshold": 5, "windowSeconds": 900 }

// AGENT_STATE
{ "kind": "AGENT_STATE", "scope": {"appSlug":"orders"}, "state": "DEAD", "forSeconds": 60 }

// DEPLOYMENT_STATE
{ "kind": "DEPLOYMENT_STATE", "scope": {"appSlug":"orders"}, "states": ["FAILED","DEGRADED"] }

// LOG_PATTERN
{ "kind": "LOG_PATTERN", "scope": {"appSlug":"orders"}, "level": "ERROR",
  "pattern": "TimeoutException", "threshold": 5, "windowSeconds": 900 }

// JVM_METRIC
{ "kind": "JVM_METRIC", "scope": {"appSlug":"orders"}, "metric": "heap_used_percent",
  "aggregation": "MAX", "comparator": "GT", "threshold": 90, "windowSeconds": 300 }
```

### Claim-polling queries

```sql
-- Rule evaluator
UPDATE alert_rules
   SET claimed_by = :instance, claimed_until = now() + interval '30 seconds'
 WHERE id IN (
   SELECT id FROM alert_rules
    WHERE enabled = true
      AND next_evaluation_at <= now()
      AND (claimed_until IS NULL OR claimed_until < now())
    ORDER BY next_evaluation_at
    LIMIT :batch
    FOR UPDATE SKIP LOCKED
 )
 RETURNING *;

-- Notification dispatcher (same pattern on alert_notifications with status='PENDING')
```

`FOR UPDATE SKIP LOCKED` is the crux: replicas never block each other.
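The claimability rule in the query above can be sketched as a pure function, which is handy for unit-testing the semantics without a database. This is a minimal in-memory illustration only: `ClaimableRule` and `ClaimSketch` are hypothetical names, and the real implementation relies on PostgreSQL row locking rather than anything shown here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an alert_rules row; only the claim-related columns.
final class ClaimableRule {
    final String id;
    final boolean enabled;
    final Instant nextEvaluationAt;
    String claimedBy;      // null when unclaimed
    Instant claimedUntil;  // null when unclaimed

    ClaimableRule(String id, boolean enabled, Instant nextEvaluationAt) {
        this.id = id;
        this.enabled = enabled;
        this.nextEvaluationAt = nextEvaluationAt;
    }
}

final class ClaimSketch {
    static final Duration CLAIM_TTL = Duration.ofSeconds(30);

    // Mirrors the WHERE clause: enabled, due, and not held by a live claim.
    static boolean isClaimable(ClaimableRule r, Instant now) {
        return r.enabled
            && !r.nextEvaluationAt.isAfter(now)
            && (r.claimedUntil == null || r.claimedUntil.isBefore(now));
    }

    // Mirrors ORDER BY next_evaluation_at LIMIT :batch, then stamps the claim.
    static List<ClaimableRule> claimDue(List<ClaimableRule> rules, String instance,
                                        int batch, Instant now) {
        List<ClaimableRule> due = new ArrayList<>();
        for (ClaimableRule r : rules) {
            if (isClaimable(r, now)) due.add(r);
        }
        due.sort(Comparator.comparing((ClaimableRule r) -> r.nextEvaluationAt));
        List<ClaimableRule> claimed = new ArrayList<>(due.subList(0, Math.min(batch, due.size())));
        for (ClaimableRule r : claimed) {
            r.claimedBy = instance;
            r.claimedUntil = now.plus(CLAIM_TTL);
        }
        return claimed;
    }
}
```

A claim whose `claimedUntil` has passed becomes claimable again, which is how a rule held by a dead replica is recovered.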

---

## 6. Outbound connections

### Concept

An `OutboundConnection` is a reusable, admin-managed HTTPS destination. Alert rules reference connections by ID and may override body or header templates per rule. Rotating a URL or secret updates every rule atomically.

**Tenant-global.** Slack URLs and PagerDuty keys are team infrastructure, not env-specific. Env-specific routing is achieved by creating multiple connections (`slack-prod`, `slack-dev`) and referencing the appropriate one in each env's rules.

**Allowed-env restriction.** `allowed_environment_ids` (default empty = all envs). Admin restricts a connection to specific envs via a multi-select on the connection form. UI picker filters by current env; rule save validates (422 on violation); narrowing the restriction while rules still reference it returns 409 with conflict list.

**Delete semantics.** 409 if any rule references the connection. No silent cascade — admin must first remove references.

### Default body template (when rule has no override)

```json
{
  "alert":   { "id", "state", "firedAt", "severity", "title", "message", "link" },
  "rule":    { "id", "name", "description", "severity" },
  "env":     { "slug", "id" },
  "context": { /* full Mustache context: app, route, agent, exchange, etc. */ }
}
```

"Just plug in my Slack incoming webhook URL" works without writing a template.

### HMAC signing (optional per connection)

When `hmac_secret` is set, dispatch adds `X-Cameleer-Signature: sha256=<hmac(secret, body)>` header. GitHub / Stripe pattern. Secret encrypted at rest — concrete approach (Jasypt vs bespoke over existing Ed25519-derived key material) decided in planning (see §20).
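A minimal sketch of computing the header value with the JDK's `javax.crypto.Mac`. Two details here are assumptions not fixed by this spec: the digest is hex-encoded (as GitHub does) and both secret and body are treated as UTF-8. `WebhookSignatureSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class WebhookSignatureSketch {

    // Returns the value for the X-Cameleer-Signature header: "sha256=<hex digest>".
    static String signatureHeaderValue(String secret, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // lowercase hex (assumed encoding)
            }
            return "sha256=" + hex;
        } catch (GeneralSecurityException e) {
            // HmacSHA256 is a required JDK algorithm, so this is not expected in practice.
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }
}
```

The signature must be computed over the exact bytes sent on the wire, so it should be derived from the snapshotted payload rather than a re-rendered body.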

---

## 7. Rule evaluation

### Scheduler

```java
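import static java.util.stream.Collectors.groupingBy;

import java.util.List;
import org.springframework.scheduling.annotation.SchedulingConfigurer;
import org.springframework.scheduling.config.ScheduledTaskRegistrar;
import org.springframework.stereotype.Component;

@Component
public class AlertEvaluatorJob implements SchedulingConfigurer {

    // Injected collaborators (declarations and constructor elided):
    //   ruleRepo (AlertRuleRepository), circuitBreaker (PerKindCircuitBreaker),
    //   properties (AlertingProperties), instanceId (this replica's ID).
    // Interval wired via AlertingProperties.evaluatorTickIntervalMs (floor 5000)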
    @Override
    public void configureTasks(ScheduledTaskRegistrar registrar) {
        registrar.addFixedDelayTask(this::tick, properties.effectiveEvaluatorTickIntervalMs());
    }

    void tick() {
        List<AlertRule> claimed = ruleRepo.claimDueRules(instanceId, properties.batchSize());
        var groups = claimed.stream().collect(groupingBy(r -> new GroupKey(r.conditionKind(), windowSeconds(r))));
        for (var entry : groups.entrySet()) {
            if (circuitBreaker.isOpen(entry.getKey().kind())) { rescheduleBatch(entry.getValue()); continue; }
            try {
                coalescedEvaluate(entry.getKey(), entry.getValue());
            } catch (Exception e) {
                circuitBreaker.recordFailure(entry.getKey().kind());
                rescheduleBatch(entry.getValue());
            }
        }
    }
}
```
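Two details of the tick above lend themselves to a standalone sketch: the 5 s floor on the tick interval and the grouping of claimed rules by `(kind, window)` for coalescing. The `RuleRef` record and the class name are hypothetical simplifications; the real `AlertRule` carries far more state.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class TickGroupingSketch {
    static final long MIN_TICK_INTERVAL_MS = 5_000L;

    // Hypothetical slimmed-down view of a claimed rule.
    record RuleRef(String id, String conditionKind, int windowSeconds) {}

    // Rules with the same kind and window can share one aggregate query.
    record GroupKey(String kind, int windowSeconds) {}

    // Clamp the configured interval to the 5 s floor.
    static long effectiveEvaluatorTickIntervalMs(long configuredMs) {
        return Math.max(configuredMs, MIN_TICK_INTERVAL_MS);
    }

    static Map<GroupKey, List<RuleRef>> groupForCoalescing(List<RuleRef> claimed) {
        return claimed.stream().collect(Collectors.groupingBy(
            r -> new GroupKey(r.conditionKind(), r.windowSeconds())));
    }
}
```

Using a record as the map key gives value-based `equals` and `hashCode` for free, which is what makes the grouping correct.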

### Per-condition evaluators

| Kind | Read source | Query shape |
|---|---|---|
| `ROUTE_METRIC` | `SearchService.statsForRoute` / `statsForApp` | Stats over window; comparator vs threshold |
| `EXCHANGE_MATCH` (PER_EXCHANGE) | `SearchService.search(SearchRequest)` | `timestamp > eval_state.lastExchangeTs AND filter` → fire one alert per match, advance cursor |
| `EXCHANGE_MATCH` (COUNT_IN_WINDOW) | `ClickHouseSearchIndex.countExecutionsForAlerting(spec)` | Count in window vs threshold |
| `AGENT_STATE` | `AgentRegistryService.listByEnvironment` | Any agent matches scope + state |
| `DEPLOYMENT_STATE` | `DeploymentRepository.findLatestByAppAndEnv` | Status in target set |
| `LOG_PATTERN` | `ClickHouseLogStore.countLogs(LogSearchRequest)` | Count in window vs threshold |
| `JVM_METRIC` | `MetricsQueryStore` | Aggregated value over the window (aggregation per rule) vs threshold |

### State machine

```
                 (cond holds for <forDuration)
  PENDING ──────▶ keep pendingSince
    ▲            │
    │            ▼ (cond holds ≥ forDuration)
    │          FIRING ◀── (re-eval matches; update last_notified_at cadence)
    │          / \
    │         /   \
    │    ack/      \resolve
    │       ▼       ▼
    │   ACKNOWLEDGED  RESOLVED ── (cond false again → cycle can restart)
```

`PER_EXCHANGE` mode: each match is its own brief FIRING instance that auto-resolves after `perExchangeLingerSeconds` (default 300 s). History retains it for 90 d.
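The evaluator-driven transitions in the diagram can be sketched as a pure function. Several points are assumptions of this sketch rather than decisions of the spec: a `PENDING` instance whose condition clears is simply discarded, an `ACKNOWLEDGED` instance resolves when the condition clears, and a zero `forDuration` fires immediately. The enum mirrors `alert_state_enum`; the class name is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

enum AlertState { PENDING, FIRING, ACKNOWLEDGED, RESOLVED }

final class AlertStateMachineSketch {

    // Next state after one evaluation. Optional.empty() means "no open instance".
    // pendingSince is only consulted while the current state is PENDING.
    static Optional<AlertState> onEvaluation(Optional<AlertState> current, boolean conditionHolds,
                                             Instant pendingSince, Instant now, Duration forDuration) {
        boolean noOpenInstance = current.isEmpty() || current.get() == AlertState.RESOLVED;
        if (noOpenInstance) {
            if (!conditionHolds) return current;
            // A new cycle starts; with forDuration = 0 there is no pending phase.
            return Optional.of(forDuration.isZero() ? AlertState.FIRING : AlertState.PENDING);
        }
        if (!conditionHolds) {
            // Assumption: a pending instance that clears never becomes visible.
            return current.get() == AlertState.PENDING
                ? Optional.empty()
                : Optional.of(AlertState.RESOLVED);
        }
        if (current.get() == AlertState.PENDING) {
            boolean heldLongEnough = !now.isBefore(pendingSince.plus(forDuration));
            return Optional.of(heldLongEnough ? AlertState.FIRING : AlertState.PENDING);
        }
        return current; // FIRING or ACKNOWLEDGED stays as-is while the condition holds
    }

    // User-driven transition: only a FIRING instance can be acknowledged.
    static AlertState onAck(AlertState current) {
        return current == AlertState.FIRING ? AlertState.ACKNOWLEDGED : current;
    }
}
```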

### Performance optimizations (v1)

1. **Four ClickHouse projections** (new CH migration, idempotent):

   ```sql
   ALTER TABLE executions    ADD PROJECTION IF NOT EXISTS alerting_app_status
     (SELECT * ORDER BY (tenant_id, environment, application_id, status, start_time));
   ALTER TABLE executions    ADD PROJECTION IF NOT EXISTS alerting_route_status
     (SELECT * ORDER BY (tenant_id, environment, route_id, status, start_time));
   ALTER TABLE logs          ADD PROJECTION IF NOT EXISTS alerting_app_level
     (SELECT * ORDER BY (tenant_id, environment, application, level, timestamp));
   ALTER TABLE agent_metrics ADD PROJECTION IF NOT EXISTS alerting_instance_metric
     (SELECT * ORDER BY (tenant_id, environment, instance_id, metric_name, collected_at));
   ```

   `stats_1m_route`'s existing ORDER BY already aligns with alerting access patterns; no projection needed.

2. **Drop `FINAL` for alerting counts.** New methods `ClickHouseLogStore.countLogs(...)` and `ClickHouseSearchIndex.countExecutionsForAlerting(...)` skip `FINAL` — alerting tolerates brief duplicate-row over-count (alert fires briefly, self-resolves on next tick after merge). Existing UI-facing `count()` path unchanged.

3. **Per-tick query coalescing.** Rules of the same kind + window share one aggregate query per tick.

4. **In-tick cache.** `Map<QueryKey, Long>` discarded at tick end. Two rules hitting the same `(app, route, window, metric)` produce one CH call.

5. **Per-kind circuit breaker.** 5 failures in 30 s → open for 60 s. Metric `alerting_circuit_open_total{kind}`. UI surfaces an admin banner when open.
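The breaker thresholds above (5 failures in 30 s, open for 60 s) can be sketched with a sliding window of failure timestamps per kind. This is an illustrative sketch, not the `PerKindCircuitBreaker` implementation: the clock is passed in explicitly for testability, kinds are plain strings, and clearing the window when the breaker opens is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class PerKindCircuitBreakerSketch {
    private static final int FAILURE_THRESHOLD = 5;
    private static final long WINDOW_MS = 30_000L;
    private static final long OPEN_MS = 60_000L;

    private final Map<String, Deque<Long>> recentFailures = new HashMap<>();
    private final Map<String, Long> openUntilMs = new HashMap<>();

    synchronized void recordFailure(String kind, long nowMs) {
        Deque<Long> recent = recentFailures.computeIfAbsent(kind, k -> new ArrayDeque<>());
        recent.addLast(nowMs);
        // Drop failures that have aged out of the 30 s window.
        while (!recent.isEmpty() && recent.peekFirst() <= nowMs - WINDOW_MS) {
            recent.removeFirst();
        }
        if (recent.size() >= FAILURE_THRESHOLD) {
            openUntilMs.put(kind, nowMs + OPEN_MS);
            recent.clear(); // assumption: start counting afresh once the breaker opens
        }
    }

    synchronized boolean isOpen(String kind, long nowMs) {
        Long until = openUntilMs.get(kind);
        return until != null && nowMs < until;
    }
}
```

Because state is keyed by kind, a failing log store does not stop route-metric rules from being evaluated.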

### Silence matching

At notification-dispatch time (not evaluation time):

```sql
SELECT 1 FROM alert_silences
 WHERE environment_id = :env
   AND now() BETWEEN starts_at AND ends_at
   AND matcher_matches(matcher, :instanceContext)
 LIMIT 1;
```

If any match → `alert_instances.silenced = true`, no webhook dispatch, no re-notification. Inbox still shows the instance with a silenced pill — audit trail preserved.
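`matcher_matches` in the query above is shorthand; its semantics could be sketched as follows, assuming every field present in the matcher must equal the corresponding field of the instance context (logical AND) and absent fields are wildcards. The flat `Map<String, String>` representation and the class name are simplifications for illustration.

```java
import java.time.Instant;
import java.util.Map;

final class SilenceMatcherSketch {

    // matcher keys are a subset of: ruleId, appSlug, routeId, agentId, severity.
    // Every key present in the matcher must equal the instance's value.
    static boolean matcherMatches(Map<String, String> matcher, Map<String, String> instanceContext) {
        for (Map.Entry<String, String> e : matcher.entrySet()) {
            if (!e.getValue().equals(instanceContext.get(e.getKey()))) {
                return false;
            }
        }
        return true; // note: under this assumption an empty matcher silences everything in the env
    }

    // Mirrors "now() BETWEEN starts_at AND ends_at" (inclusive on both ends).
    static boolean isActive(Instant startsAt, Instant endsAt, Instant now) {
        return !now.isBefore(startsAt) && !now.isAfter(endsAt);
    }
}
```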

### Failure modes

| Failure | Behavior |
|---|---|
| Read interface throws | Log WARN, increment `alerting_eval_errors_total{kind, rule_id}`, reschedule rule, release claim |
| 10 consecutive failures for a rule | Mark `eval_state.disabledReason`, surface in UI |
| Template render error | Fall back to literal `{{var}}` in output, log WARN, still dispatch |
| Slow evaluator | Claim expires after its 30 s TTL, after which another replica may re-claim the rule; investigate if sustained |
| Rule deleted mid-eval | FK cascade waits on the row lock — effectively serialized |
| Env deleted mid-eval | FK cascade waits — effectively serialized |

---

## 8. Notification dispatch

### In-app inbox — derived, not materialized

```sql
SELECT ai.*
  FROM alert_instances ai
 WHERE ai.environment_id = :env
   AND ai.state IN ('FIRING','ACKNOWLEDGED','RESOLVED')
   AND (
       :me = ANY(ai.target_user_ids)
    OR ai.target_group_ids && :my_group_ids
    OR ai.target_role_names && :my_role_names
   )
 ORDER BY ai.fired_at DESC
 LIMIT 100;
```

`:my_group_ids` and `:my_role_names` resolved once per request from `RbacService`.

**Bell badge count:** same filter, narrowed to `state IN ('FIRING','ACKNOWLEDGED')`, plus `NOT EXISTS (SELECT 1 FROM alert_reads ar WHERE ar.user_id = :me AND ar.alert_instance_id = ai.id)`, count-only. Server-side 5 s memoization per `(env, user)` keeps bell polling cheap.
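
The 5 s memoization could look something like the sketch below. `UnreadCountMemo`, its `get` signature, and the string-typed keys are hypothetical; the loader stands in for the count-only query. Eviction of idle entries is left out for brevity and would be needed in practice.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical per-(env, user) memo for the bell badge count. */
final class UnreadCountMemo {
    private record Key(String envId, String userId) {}
    private record Entry(long count, Instant expiresAt) {}

    private final ConcurrentHashMap<Key, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl;

    UnreadCountMemo(Duration ttl) {
        this.ttl = ttl;
    }

    /** Returns the cached count while it is fresh; otherwise runs the loader once and caches the result. */
    long get(String envId, String userId, Instant now, LongSupplier loader) {
        Entry entry = cache.compute(new Key(envId, userId), (key, old) ->
                (old != null && now.isBefore(old.expiresAt()))
                        ? old
                        : new Entry(loader.getAsLong(), now.plus(ttl)));
        return entry.count();
    }
}
```

With a 5 s TTL and a 30 s client poll interval, the memo mostly helps when one user has several tabs open or when requests are retried in quick succession.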

### Webhook outbox — claim-based

`NotificationDispatchJob` claims due notifications (`status='PENDING' AND next_attempt_at <= now()`) and dispatches. HTTP client from shared `OutboundHttpClientFactory` with TLS config from the referenced outbound connection.

- **2xx** → `DELIVERED`
- **4xx** → `FAILED` immediately (retry won't help); log at WARN
- **5xx / network / timeout** → retry with increasing backoff (30 s → 2 min → 5 min), then `FAILED`
- Manual retry: `POST /alerts/notifications/{id}/retry` (OPERATOR+)

Payload rendered at **first** dispatch attempt, snapshotted in `alert_notifications.payload`. Retries replay the snapshot — template edits after fire don't affect in-flight notifications.
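
The outcome rules above can be sketched as a small pure function. Everything named here (`WebhookRetryPolicy`, `classify`, `nextDelay`, the negative-status convention for network errors) is hypothetical. The sketch also assumes three retries after the initial attempt, one per listed delay; how that maps onto `webhook-max-attempts` is left to planning.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;

/** Hypothetical mapping from a dispatch result to the next notification status. */
final class WebhookRetryPolicy {
    enum Outcome { DELIVERED, FAILED, RETRY }

    // Delays before retry 1, 2 and 3, as listed in the spec.
    private static final List<Duration> BACKOFF = List.of(
            Duration.ofSeconds(30), Duration.ofMinutes(2), Duration.ofMinutes(5));

    /**
     * @param statusCode HTTP status, or a negative value for network error / timeout
     *                   (a convention of this sketch only)
     * @param retriesAlreadyDone number of retries performed so far (0 on the first attempt)
     */
    static Outcome classify(int statusCode, int retriesAlreadyDone) {
        if (statusCode >= 200 && statusCode < 300) {
            return Outcome.DELIVERED;
        }
        if (statusCode >= 400 && statusCode < 500) {
            return Outcome.FAILED; // retrying a client error will not help
        }
        // 5xx, network errors and timeouts; 1xx/3xx also land here, which the spec does not cover.
        return retriesAlreadyDone < BACKOFF.size() ? Outcome.RETRY : Outcome.FAILED;
    }

    /** Delay to add to now() for next_attempt_at, or empty when retries are exhausted. */
    static Optional<Duration> nextDelay(int retriesAlreadyDone) {
        return retriesAlreadyDone < BACKOFF.size()
                ? Optional.of(BACKOFF.get(retriesAlreadyDone))
                : Optional.empty();
    }
}
```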

### Template rendering

JMustache (`com.samskivert:jmustache`). Logic-less, industry-standard syntax.

**Rendered surfaces:** URL (query-string interpolation), header values, body, and separately `alert_instances.title` / `message` rendered once at fire.

**Context map** (dot-notation + camelCase leaves):

```
env.slug                 env.id
rule.id                  rule.name                 rule.severity            rule.description
alert.id                 alert.state               alert.firedAt            alert.resolvedAt
alert.ackedBy            alert.link                alert.currentValue       alert.threshold
alert.comparator         alert.window
app.slug                 app.id                    app.displayName
route.id
agent.id                 agent.name                agent.state
exchange.id              exchange.status           exchange.link
deployment.id            deployment.status
log.logger               log.level                 log.message
metric.name              metric.value
```

**Error handling.** Missing variable renders as `{{var.name}}` literal + WARN log. Malformed template falls back to built-in default + WARN. Never drop a notification due to template error.
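
To make the missing-variable behavior concrete, here is a stdlib-only stand-in. It is deliberately not the JMustache API: `FallbackRenderSketch` is a hypothetical regex-based renderer that only handles plain `{{path}}` tags, and the `warnings` list stands in for WARN logging. It illustrates the intended outcome (literal tag kept, warning recorded, output still produced), not how the real renderer would be configured.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of "missing variable renders as a literal tag". */
final class FallbackRenderSketch {
    private static final Pattern VAR = Pattern.compile("\\{\\{\\s*([A-Za-z0-9_.]+)\\s*\\}\\}");

    static String render(String template, Map<String, Object> context, List<String> warnings) {
        Matcher m = VAR.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String path = m.group(1);
            Object value = lookup(context, path);
            String replacement;
            if (value == null) {
                // Keep the tag visible in the output instead of dropping the notification.
                warnings.add("missing template variable: " + path);
                replacement = "{{" + path + "}}";
            } else {
                replacement = String.valueOf(value);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Resolves a dot-notation path such as "rule.name" through nested maps. */
    private static Object lookup(Map<String, Object> context, String path) {
        Object current = context;
        for (String segment : path.split("\\.")) {
            if (!(current instanceof Map<?, ?> map)) {
                return null;
            }
            current = map.get(segment);
        }
        return current;
    }
}
```

For example, rendering `{{rule.name}} on {{app.slug}}` against a context that only contains `rule.name` yields the rule name followed by the untouched `{{app.slug}}` tag and one warning.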

**"Test render" endpoint:** `POST /alerts/rules/{id}/render-preview` — drives rule editor's Preview button.

---

## 9. Rule promotion across environments

**UX.** Rule list row → **Environments ▾** menu of other envs in the tenant → open rule editor pre-populated with source rule's payload, target env selected. Banner: *"Promoting `<name>` from `<src>` → `<dst>`. Review and adjust, then save."* Save → normal `POST /api/v1/environments/{dstEnvSlug}/alerts/rules`. Source unaffected (it's a copy).

**Pure UI flow — no new server endpoint.** Re-uses the existing GET (to fetch) and POST (to create) paths.

**Prefill-time validation (client-side; warnings are non-blocking unless noted):**

| Field | Check | Behavior |
|---|---|---|
| `scope.appSlug` | Does app exist in target env? | ⚠ warn + picker from target env's apps |
| `scope.agentId` | Per-env; can't transfer | Clear field, keep appSlug, note |
| `scope.routeId` | Per-app logical ID, stable | ✓ pass through |
| `targets[]` | Tenant-scoped | ✓ transfer as-is |
| `webhooks[].outboundConnectionId` | Target env allowed by connection? | ⚠ warn if not; disable save until resolved |

Bulk promotion (select multiple → promote all) deferred until usage patterns justify it.

---

## 10. Cross-cutting: outbound HTTP & TLS trust

Shared module — not inside `alerting/`.

### `OutboundHttpClientFactory`

```java
public interface OutboundHttpClientFactory {
    CloseableHttpClient clientFor(OutboundHttpRequestContext context);
}

public record OutboundHttpRequestContext(
    TrustMode trustMode,                // SYSTEM_DEFAULT | TRUST_ALL | TRUST_PATHS
    List<String> trustedCaPemPaths,
    Duration connectTimeout,
    Duration readTimeout
) {}
```

Implementation (`ApacheOutboundHttpClientFactory`) memoizes one `CloseableHttpClient` per unique effective config — not one per call.
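
One way to get "one client per unique effective config" is to key a concurrent map on the context record, relying on record value equality. The sketch below is an assumption about how that might be done, not the planned implementation: `MemoizingFactorySketch` is hypothetical, the nested `Context` mirrors the record above, and the client type is left generic so the example does not depend on a particular Apache HttpClient version.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical memoizing factory; C stands in for the HTTP client type. */
final class MemoizingFactorySketch<C> {
    enum TrustMode { SYSTEM_DEFAULT, TRUST_ALL, TRUST_PATHS }

    record Context(TrustMode trustMode, List<String> trustedCaPemPaths,
                   Duration connectTimeout, Duration readTimeout) {
        Context {
            // Immutable copy so later mutation of the caller's list cannot change the map key.
            trustedCaPemPaths = List.copyOf(trustedCaPemPaths);
        }
    }

    private final ConcurrentHashMap<Context, C> clients = new ConcurrentHashMap<>();
    private final Function<Context, C> builder;

    MemoizingFactorySketch(Function<Context, C> builder) {
        this.builder = builder;
    }

    /** Builds a client on first use of a given config and reuses it afterwards. */
    C clientFor(Context context) {
        return clients.computeIfAbsent(context, builder);
    }
}
```

Because the key is the whole record, two connections with identical trust mode, CA paths, and timeouts share a client, while any differing field produces a separate one.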

### System config (`cameleer.server.outbound-http.*`)

```yaml
cameleer:
  server:
    outbound-http:
      trust-all: false                       # global kill-switch; WARN logged if true
      trusted-ca-pem-paths:                  # additional roots layered on JVM default
        - /etc/cameleer/certs/corporate-root.pem
        - /etc/cameleer/certs/acme-internal.pem
      default-connect-timeout-ms: 2000
      default-read-timeout-ms:    5000
      proxy-url:                             # optional; null = no proxy
      proxy-username:
      proxy-password:
```

On startup: if `trust-all=true`, log red WARN (not suitable for production). If `trusted-ca-pem-paths` has entries, verify each path exists; fail-fast on missing files.
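
A rough sketch of that startup check, assuming it runs once when the properties bean is initialized. `OutboundHttpStartupCheck` and `validate` are hypothetical names, and the returned list of warning messages stands in for actual logging.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical startup validation for cameleer.server.outbound-http.* settings. */
final class OutboundHttpStartupCheck {

    /**
     * Fails fast when a configured CA file is missing.
     *
     * @return warning messages the caller should log at WARN
     * @throws IllegalStateException if any configured path does not exist
     */
    static List<String> validate(boolean trustAll, List<String> trustedCaPemPaths) {
        List<String> missing = trustedCaPemPaths.stream()
                .filter(p -> !Files.exists(Path.of(p)))
                .toList();
        if (!missing.isEmpty()) {
            throw new IllegalStateException("trusted-ca-pem-paths not found: " + missing);
        }
        return trustAll
                ? List.of("outbound-http.trust-all=true: TLS validation is disabled; not suitable for production")
                : List.of();
    }
}
```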

### Per-connection overrides

Each `OutboundConnection` carries `tls_trust_mode` + `tls_ca_pem_paths`. UI surfaces a dropdown: **System default (validated)** / **Trust custom CAs (from server config)** / **Trust all (insecure — testing only)**. Amber warning when *Trust all* selected. Audit logged (`AuditCategory.OUTBOUND_HTTP_TRUST_CHANGE`).

### Deferred

See **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)**:
- In-app CA bundle upload / admin management
- SaaS-layer CA reuse investigation (do first)

---

## 11. API surface

All env-scoped routes under `/api/v1/environments/{envSlug}/alerts/...` via existing `@EnvPath` resolver.

### Alerting — rules

| Method | Path | Role |
|---|---|---|
| `GET`    | `/alerts/rules` | VIEWER+ |
| `POST`   | `/alerts/rules` | OPERATOR+ |
| `GET`    | `/alerts/rules/{id}` | VIEWER+ |
| `PUT`    | `/alerts/rules/{id}` | OPERATOR+ |
| `DELETE` | `/alerts/rules/{id}` | OPERATOR+ |
| `POST`   | `/alerts/rules/{id}/enable` · `/disable` | OPERATOR+ |
| `POST`   | `/alerts/rules/{id}/render-preview` | OPERATOR+ |
| `POST`   | `/alerts/rules/{id}/test-evaluate` | OPERATOR+ |

### Alerting — instances

| Method | Path | Role |
|---|---|---|
| `GET`    | `/alerts` | VIEWER+ |
| `GET`    | `/alerts/unread-count` | VIEWER+ |
| `GET`    | `/alerts/{id}` | VIEWER+ |
| `POST`   | `/alerts/{id}/ack` | VIEWER+ (if targeted) / OPERATOR+ |
| `POST`   | `/alerts/{id}/read` | VIEWER+ (self) |
| `POST`   | `/alerts/bulk-read` | VIEWER+ (self) |

### Alerting — silences

| Method | Path | Role |
|---|---|---|
| `GET`    | `/alerts/silences` | VIEWER+ |
| `POST`   | `/alerts/silences` | OPERATOR+ |
| `PUT`    | `/alerts/silences/{id}` | OPERATOR+ |
| `DELETE` | `/alerts/silences/{id}` | OPERATOR+ |

### Alerting — notifications

| Method | Path | Role |
|---|---|---|
| `GET`    | `/alerts/{id}/notifications` | VIEWER+ |
| `POST`   | `/alerts/notifications/{id}/retry` | OPERATOR+ |

### Outbound connections (admin)

| Method | Path | Role |
|---|---|---|
| `GET`    | `/api/v1/admin/outbound-connections` | ADMIN / OPERATOR (read-only) |
| `POST`   | `/api/v1/admin/outbound-connections` | ADMIN |
| `GET`    | `/api/v1/admin/outbound-connections/{id}` | ADMIN / OPERATOR (read-only) |
| `PUT`    | `/api/v1/admin/outbound-connections/{id}` | ADMIN (409 if narrowing breaks references) |
| `DELETE` | `/api/v1/admin/outbound-connections/{id}` | ADMIN (409 if referenced) |
| `POST`   | `/api/v1/admin/outbound-connections/{id}/test` | ADMIN |
| `GET`    | `/api/v1/admin/outbound-connections/{id}/usage` | ADMIN / OPERATOR |

### OpenAPI regen

Per `CLAUDE.md` convention: after controller/DTO changes, run `cd ui && npm run generate-api:live` (backend on :8081) to regenerate `ui/src/api/schema.d.ts`. Commit regen alongside controller change.

---

## 12. CMD-K integration

Two new result sources registered in the existing UI registry (`ui/src/cmdk/sources/`):

- **Alerts** — queries `/alerts?q=...&limit=5` (server-side fulltext against title / message / rule_snapshot); results show severity icon + state chip; deep-link to `/alerts/inbox/{id}`.
- **Alert Rules** — queries `/alerts/rules?q=...&limit=5`; deep-link to `/alerts/rules/{id}`.

No new registry machinery — uses the existing extension point.

---

## 13. UI

### Routes

```
/alerts
  ├── /inbox          (default landing)
  ├── /all
  ├── /rules
  │     ├── /new
  │     └── /{id}     (edit; accepts ?promoteFrom=<src>&ruleId=<id>)
  ├── /silences
  └── /history

/admin/outbound-connections
  ├── /
  ├── /new
  └── /{id}
```

### Top-nav

Insert `<NotificationBell />` between env selector and user menu. Badge severity = `max(severities of unread targeting me)` (CRITICAL → `var(--error)`, WARNING → `var(--amber)`, INFO → `var(--muted)`). Dropdown shows 5 most-recent unread with inline ack button + "See all".

### Alerts section

New sidebar/top-nav entry visible to `VIEWER+`. Authoring actions (`POST /rules`, silence create, etc.) gated to `OPERATOR+`.

### Rule editor — 5-step wizard

1. **Scope** — radio (env-wide / app / route / agent) + pickers from env catalog (existing endpoints).
2. **Condition** — radio (6 kinds) + kind-specific form.
3. **Trigger** — threshold + comparator + window + for-duration + evaluation interval + severity; inline *Test evaluate* button.
4. **Notify** — title + message templates with *Preview* button; targets multi-select (users / groups / roles with typeahead); outbound connections multi-select filtered by current env + `allowed_environment_ids`.
5. **Review** — summary card, enabled toggle, save.

### Template editor — Mustache with variable auto-complete

Every Mustache template-editable field — notification title, notification message, webhook URL, webhook header values, webhook body — uses a shared `<MustacheEditor />` component with **variable auto-complete**. Users never have to guess what context variables are available.

**Behavior.**
- Typing `{{` opens a dropdown of available variables at the caret position.
- Each suggestion shows the variable path (`alert.firedAt`), its type (`Instant`), a one-line description, and a sample rendered value from the canned context.
- Filtering narrows the list as the user keeps typing (`{{ale…` → filters to `alert.*`).
- `Enter` / `Tab` inserts the path and closes `}}` automatically.
- Arrow keys + `Esc` follow standard combobox semantics (ARIA-conformant).

**Context-aware filtering.** The available variables depend on the rule's condition kind and scope. The editor is aware of both:
- Always shown: `env.*`, `rule.*`, `alert.*`
- `ROUTE_METRIC` with `route.id` set: adds `route.id`, `app.*`
- `EXCHANGE_MATCH`: adds `exchange.*`, `app.*`, `route.id` (if scoped)
- `AGENT_STATE`: adds `agent.*`, `app.*`
- `DEPLOYMENT_STATE`: adds `deployment.*`, `app.*`
- `LOG_PATTERN`: adds `log.*`, `app.*`
- `JVM_METRIC`: adds `metric.*`, `agent.*`, `app.*`

Variables that *might not* populate (e.g., `alert.resolvedAt` while state is FIRING) are shown with a grey "may be null" badge — users still see them so they can defensively template.
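
The filtering rules listed above can be expressed as a small lookup. The sketch is written in Java for consistency with the other examples in this spec, although the editor itself lives in the UI; `TemplateVariableScope`, `namespacesFor`, and `inScope` are hypothetical names, and variables are grouped by their top-level namespace (`route` covers `route.id`).

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical mapping from condition kind and scope to available variable namespaces. */
final class TemplateVariableScope {
    enum ConditionKind { ROUTE_METRIC, EXCHANGE_MATCH, AGENT_STATE, DEPLOYMENT_STATE, LOG_PATTERN, JVM_METRIC }

    static Set<String> namespacesFor(ConditionKind kind, boolean routeScoped) {
        // env.*, rule.* and alert.* are always offered.
        Set<String> ns = new LinkedHashSet<>(List.of("env", "rule", "alert"));
        switch (kind) {
            case ROUTE_METRIC -> {
                if (routeScoped) ns.addAll(List.of("route", "app"));
            }
            case EXCHANGE_MATCH -> {
                ns.addAll(List.of("exchange", "app"));
                if (routeScoped) ns.add("route");
            }
            case AGENT_STATE -> ns.addAll(List.of("agent", "app"));
            case DEPLOYMENT_STATE -> ns.addAll(List.of("deployment", "app"));
            case LOG_PATTERN -> ns.addAll(List.of("log", "app"));
            case JVM_METRIC -> ns.addAll(List.of("metric", "agent", "app"));
        }
        return ns;
    }

    /** True if a path such as "exchange.id" is available for the given rule shape. */
    static boolean inScope(String path, ConditionKind kind, boolean routeScoped) {
        int dot = path.indexOf('.');
        String namespace = dot < 0 ? path : path.substring(0, dot);
        return namespacesFor(kind, routeScoped).contains(namespace);
    }
}
```

The same lookup could drive both the suggestion list and the amber "out of scope" underline, so the two never disagree.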

**Syntax checks inline.**
- Unclosed `{{` / unmatched `}}` flagged with a red underline + hover hint.
- Reference to an out-of-scope variable (e.g., `{{exchange.id}}` in a ROUTE_METRIC rule) flagged with an amber underline + hint ("not available for this rule kind — will render as literal").
- Checks run client-side on every keystroke (debounced); server-side render preview is still authoritative (§8).

**Shared implementation.** Same `<MustacheEditor />` component is used in:
- Rule editor — Notify step (title, message)
- Rule editor — Webhook overrides (body override, header value overrides; URL not editable per rule, it's the connection's)
- Admin **Outbound Connections** editor — default body template, default header values, URL (URL gets a reduced context: only `env.*` since a connection URL is rule-agnostic)
- *Test render* inline preview — rendered output updates live as user types

**Completion engine.** Specific library choice (CodeMirror 6 with a custom completion extension vs Monaco vs a lighter custom overlay on `<textarea>`) is deferred to planning — see §20.

### Silences, History, Rules list, OutboundConnectionAdminPage

Structure described in design presentation; no new design-system components required. Reuses `Select`, `Tabs`, `Toggle`, `Button`, `Label`, `InfiniteScrollArea`, `PageLoader`, `Badge` from `@cameleer/design-system`.

### Real-time behavior

- Bell: `/alerts/unread-count` polled every 30 s; paused when tab hidden (Page Visibility API).
- Inbox view: `/alerts` polled every 30 s when focused.
- No SSE in v1. SSE is a clean future add under `/alerts/stream` with no schema changes.

### Accessibility

Keyboard navigation; severity conveyed via icon + text + color (not color alone); ARIA live region on inbox for new-alert announcement; bell component has descriptive `aria-label`.

### Styling

All colors via `@cameleer/design-system` CSS variables (`var(--error)`, `var(--amber)`, `var(--muted)`, `var(--success)`). No hard-coded hex.

---

## 14. Configuration

### `AlertingProperties` (`cameleer.server.alerting.*`)

```yaml
cameleer:
  server:
    alerting:
      evaluator-tick-interval-ms:       5000    # floor: 5000 (clamped at startup with WARN if lower)
      evaluator-batch-size:             20
      claim-ttl-seconds:                30
      notification-tick-interval-ms:    5000
      notification-batch-size:          50
      in-tick-cache-enabled:            true
      circuit-breaker-fail-threshold:   5
      circuit-breaker-window-seconds:   30
      circuit-breaker-cooldown-seconds: 60
      event-retention-days:             90
      notification-retention-days:      30
      webhook-timeout-ms:               5000
      webhook-max-attempts:             3
```

Env-var overridable (`CAMELEER_SERVER_ALERTING_EVALUATOR_TICK_INTERVAL_MS=...`). Wired via `SchedulingConfigurer` (not literal `@Scheduled(fixedDelay=...)`) so intervals come from the bean at startup. Hot-reload not supported — restart required to change cadence.
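
The startup clamp on the evaluator tick could be as small as the sketch below. `TickIntervalClamp` and `effectiveTickMs` are hypothetical names, and the `warn` callback stands in for the logger; the `SchedulingConfigurer` wiring itself is omitted because it depends on Spring.

```java
import java.util.function.Consumer;

/** Hypothetical helper that enforces the 5000 ms floor on the evaluator tick interval. */
final class TickIntervalClamp {
    static final long FLOOR_MS = 5_000L;

    /** Returns the interval to schedule with, warning once if the configured value was too low. */
    static long effectiveTickMs(long configuredMs, Consumer<String> warn) {
        if (configuredMs < FLOOR_MS) {
            warn.accept("evaluator-tick-interval-ms=" + configuredMs
                    + " is below the floor; using " + FLOOR_MS);
            return FLOOR_MS;
        }
        return configuredMs;
    }
}
```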

### `OutboundHttpProperties` (`cameleer.server.outbound-http.*`)

See §10.

---

## 15. Retention

Daily `@Scheduled(cron = "0 0 3 * * *")` job `AlertingRetentionJob` (advisory-lock-of-the-day pattern, same as `JarRetentionJob`):

```sql
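-- make_interval(days => n) turns the configured day count into an interval;
-- a bare number cast with ::interval would not be read as days.
DELETE FROM alert_instances
 WHERE state = 'RESOLVED'
   AND resolved_at < now() - make_interval(days => :eventRetentionDays);

DELETE FROM alert_notifications
 WHERE status IN ('DELIVERED','FAILED')
   AND (delivered_at IS NULL OR delivered_at < now() - make_interval(days => :notificationRetentionDays));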
```

Retention values from `AlertingProperties`.

---

## 16. Observability

New metrics exposed via existing `/api/v1/prometheus`:

- `alerting_eval_duration_seconds{kind}` — histogram per condition kind
- `alerting_eval_errors_total{kind, rule_id}` — counter
- `alerting_circuit_open_total{kind}` — counter
- `alerting_rule_state{state}` — gauge (enabled / disabled / broken-reference)
- `alerting_instances_total{state, severity}` — gauge (open alerts)
- `alerting_notifications_total{status}` — counter
- `alerting_webhook_delivery_duration_seconds` — histogram

No new dashboards shipped in v1; tenants with Prometheus + Grafana can build their own. An "Alerting health" admin sub-page is a cheap future add.

### Audit

New `AuditCategory` values:
- `OUTBOUND_HTTP_TRUST_CHANGE` — webhook or connection TLS config change
- `ALERT_RULE_CHANGE` — create / update / delete rule
- `ALERT_SILENCE_CHANGE` — create / update / delete silence
- `OUTBOUND_CONNECTION_CHANGE` — admin CRUD on outbound connection

Emitted via existing `AuditService.log(...)`.

---

## 17. Security

- **Tenant + env isolation.** Every controller call runs through `@EnvPath` (resolves env → tenant via `TenantContext`). Every CH query filters by `tenant_id AND environment` per pre-existing invariant.
- **RBAC.** Enforced via Spring Security `@PreAuthorize` on each endpoint (see §11 role column).
- **Webhook URL SSRF protection.** At rule save, reject URLs resolving to loopback or private addresses (`127.0.0.0/8`, `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `::1`, `fc00::/7`) unless a deployment-level allow-listed dev flag is set.
- **HMAC signing.** Per-connection `hmac_secret` encrypted at rest; signature header sent on dispatch.
- **TLS trust.** Cross-cutting module (§10).
- **Audit.** See §16.
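
The SSRF check in the list above might be sketched with `java.net.InetAddress` as follows. `SsrfGuardSketch` and its methods are hypothetical, and the dev-flag bypass is left out. Note that `isSiteLocalAddress()` also matches the deprecated IPv6 `fec0::/10` range, which goes slightly beyond the ranges listed, and that a save-time check alone does not protect against a hostname that later resolves elsewhere.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical save-time guard against webhook URLs that point at internal addresses. */
final class SsrfGuardSketch {

    /** True for loopback, RFC 1918 private ranges, and IPv6 unique-local (fc00::/7). */
    static boolean isBlocked(InetAddress addr) {
        if (addr.isLoopbackAddress() || addr.isSiteLocalAddress()) {
            return true; // 127.0.0.0/8, ::1, 10/8, 172.16/12, 192.168/16
        }
        byte[] raw = addr.getAddress();
        // fc00::/7: the top seven bits of the first byte are 1111 110.
        return raw.length == 16 && (raw[0] & 0xFE) == 0xFC;
    }

    /** Resolves the host and rejects it if any resolved address is blocked. */
    static boolean hostIsBlocked(String host) {
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (isBlocked(addr)) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("cannot resolve host: " + host, e);
        }
    }
}
```

Checking every resolved address, not just the first, matters for hosts with multiple A/AAAA records.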

---

## 18. Testing

### Backend — unit (`*Test.java`, no Spring)

- Each `ConditionEvaluator`: synthetic inputs → expected `EvalResult`. Fire / no-fire / threshold edges / PER_EXCHANGE cursor / for-duration debounce.
- `MustacheRenderer`: context + template → expected output; malformed falls back + logs.
- `SilenceMatcher`: matcher JSONB vs instance → truth table.
- Jackson polymorphism: roundtrip each `AlertCondition` subtype.
- Claim-polling concurrency (embedded PG): two threads → no duplicates.

### Backend — integration (Testcontainers, `*IT.java`)

- `AlertingFullLifecycleIT` — end-to-end rule → fire → ack → silence → delete, history survives.
- `AlertingEnvIsolationIT` — alert in env-A invisible from env-B inbox.
- `OutboundConnectionAllowedEnvIT` — 422 on save if connection not allowed in env; 409 on narrow-while-referenced.
- `WebhookDispatchIT` (WireMock) — payload shape, HMAC signature, retry on 5xx, FAILED after max, no retry on 4xx.
- `PerformanceIT` (opt-in, not default CI) — 500 rules + 5-replica simulation.

### Frontend — component (Vitest + Testing Library)

- Rule editor wizard step navigation + validation.
- Bell polling pause on tab hide.
- Inbox row rendering by severity.
- CMD-K result-source registration.

### Frontend — E2E (Playwright if infra supports)

- Create rule → inject matching data → bell badge appears → open alert → ack → badge clears.

---

## 19. Rollout

- **No feature flag.** Alerting is dormant-by-default: zero rules → zero evaluator work → zero behavior change. Migration is additive.
- **Migration rollback.** V11 PG migration has matching down-script; CH projections are `IF NOT EXISTS`-safe and droppable without data loss.
- **Progressive adoption.** First user creates the first rule; feature organically spreads from there.
- **Documentation.** Add an admin-facing alerting guide under `docs/` describing rule shapes, template variables, webhook destinations, and silence patterns.
- **`.claude/rules/` updates.** `app-classes.md` and `core-classes.md` updated to document the new packages and any touched classes — part of the change, not a follow-up.

---

## 20. Open questions / items for writing-plans

These are not design-level decisions — they're implementation-phase tasks to be carried into planning:

1. **Alignment with existing OIDC outbound cert handling.** Before implementing `ApacheOutboundHttpClientFactory`, audit how `OidcProviderHelper` / `OidcTokenExchanger` currently validate certs. If there's a pattern in place, mirror it; if not, adopt the factory as the one-true-way and retrofit OIDC in a separate follow-up (not part of alerting v1).
2. **`hmac_secret` encryption-at-rest.** Decide between Jasypt (simplest, adds a dep) and a bespoke encrypt/decrypt over the existing Ed25519-derived key material (no new dep, ~50 LOC). Defer to plan.
3. **V1 CH migration file naming.** Confirm the convention for alerting-owned CH migrations (`V_alerting_projections.sql` vs numbered). Current `ClickHouseSchemaInitializer` runs files idempotently — naming is informational.
4. **Bell component keyboard shortcut.** Optional; align with existing CMD-K shortcut conventions.
5. **Target picker UX.** How to mix user / group / role in one multi-select with typeahead. Small UX design task.
6. **Env-delete cascade audit.** Before merge, verify the full cascade chain empirically in a PG integration test — POC safety depends on it.
7. **`<MustacheEditor />` completion engine choice.** Decide between CodeMirror 6 with a custom completion extension, Monaco, or a lighter custom-overlay-on-`<textarea>` implementation. Criteria: bundle-size cost, accessibility (ARIA combobox semantics), existing design-system integration. The variable metadata registry (`{path, type, description, sampleValue, availableForKinds[]}`) is the same regardless of engine.