diff --git a/docs/superpowers/backlog.md b/docs/superpowers/backlog.md new file mode 100644 index 00000000..56fa88b0 --- /dev/null +++ b/docs/superpowers/backlog.md @@ -0,0 +1,42 @@ +# Backlog + +Deferred items surfaced during design / planning / execution that we've decided *not* to build right now but want to keep visible. Append-only — close items by marking them and moving to the "Closed" section with a link to the delivering commit/spec. + +--- + +## Open + +### BL-001 — Managed CA bundles for outbound HTTPS + +**Opened:** 2026-04-19 +**Surfaced by:** [Alerting design](specs/2026-04-19-alerting-design.md) — TLS trust section +**Tracking:** [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137) +**Status:** Open + +**Context.** The alerting feature introduces server → external HTTPS webhooks, which makes outbound TLS trust a cross-cutting concern (joining the existing OIDC token-exchange / JWKS fetch paths). Alerting v1 handles this with a shared `OutboundHttpClientFactory` + system config (`cameleer.server.outbound-http.trusted-ca-pem-paths`) + a per-webhook `TrustMode` override (`SYSTEM_DEFAULT` / `TRUST_ALL` / `TRUST_PATHS`). CA files in v1 are filesystem-resident, managed via deployment config — there is no in-app upload surface. + +**What's deferred.** + +1. **In-app CA bundle management.** Admin UI to upload, list, and delete trusted CA PEMs. Storage in PG (`trusted_ca_certs` table) so all replicas see a consistent set without a filesystem sync step. Likely lives under `/admin/outbound-http` (new admin surface) or as a tab on the existing admin navigation. + +2. **SaaS-layer CA reuse — design investigation (do first).** The SaaS layer already manages CA material for the server (for its reverse-proxy → OIDC path and related). Before building in-app CA management in the OSS server, investigate whether the SaaS CA mechanism can be extended/exposed so the server can consume trust material from the SaaS layer directly. 
Goal: KISS + DRY — don't duplicate a CA store in the server if the SaaS side already owns one. If reuse is viable, in-app CA upload in the server may never be needed — the SaaS layer becomes the authoritative admin surface and the server just reads. + +**Acceptance criteria.** +- Investigation concludes with a one-page decision: reuse SaaS / build in-server / hybrid, with rationale and an implementation sketch. +- If "build in-server" is chosen: spec + plan + implementation following the normal flow. Must include PG-backed storage, audit logging on CA change (category already introduced: `OUTBOUND_HTTP_TRUST_CHANGE`), and cluster-consistent propagation. +- If "reuse SaaS" is chosen: spec for the extension on the SaaS side + a small server-side consumer; the server's current file-path-based trust config remains as the OSS fallback for non-SaaS deployments. + +**Why we're not doing it now.** +- Alerting v1's file-based trust config is identical to how the server handles other trust material today (OIDC issuer URIs, Ed25519 keys), so it's no regression. +- Building in-server CA management before the SaaS reuse investigation risks duplicating work we may throw away. +- Most early alerting users will target public SaaS webhooks (Slack, PagerDuty, Teams) whose certs chain to public roots — no custom CA needed. 
+ +**Links.** +- `cameleer-server-app/src/main/java/com/cameleer/server/app/http/` (v1 outbound HTTP module — the investigation will extend this) +- OIDC trust touch-points in `OidcProviderHelper`, `OidcTokenExchanger` (alignment reference) + +--- + +## Closed + +_(nothing yet)_ diff --git a/docs/superpowers/specs/2026-04-19-alerting-design.md b/docs/superpowers/specs/2026-04-19-alerting-design.md new file mode 100644 index 00000000..8923c137 --- /dev/null +++ b/docs/superpowers/specs/2026-04-19-alerting-design.md @@ -0,0 +1,1041 @@ +# Alerting — Design Spec + +**Date:** 2026-04-19 +**Status:** Draft — awaiting user review +**Surfaces:** server (core + app), UI, admin, Gitea issues +**Related:** [backlog BL-001](../backlog.md) / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137) (managed CA bundles — deferred) + +--- + +## 1. Summary + +A first-class alerting feature inside Cameleer. Operators author rules that evaluate conditions over observability data; violations create shared, env-scoped alerts visible in an in-app inbox and optionally dispatched to external systems via admin-curated webhook connections. Lifecycle: `FIRING → ACKNOWLEDGED → RESOLVED` with orthogonal `SILENCED`. Horizontally scalable via PostgreSQL claim-based polling. All code confined to new `alerting/`, `outbound/`, and `http/` packages with minimal, documented touchpoints on existing stores. + +### Guiding principles + +- **"Good enough" baseline.** Customers with dedicated ops tooling (PagerDuty, Grafana, Opsgenie) will keep using it — alerting here serves those *without*. Resist incident-management feature creep; provide the floor, not the ceiling. +- **Confinement over cleverness.** Reads go through existing interfaces; no hooks into ingestion; no new ClickHouse tables; all new code in dedicated packages. The feature should be removable by deleting those packages and one migration. 
+- **Env-scoped by default, tenant-global where infrastructure.** Rules, alerts, silences live inside an environment. Outbound connections are tenant-global infrastructure admins manage, optionally restricted by env. +- **Performance is a first-class design concern**, not a v2 afterthought. Claim-polling, query coalescing, in-tick caching, per-kind circuit breaker, and CH projections are all v1. +- **No ClickHouse table changes, only projections.** Additive, idempotent (`IF NOT EXISTS`), safe to drop and rebuild. + +--- + +## 2. Scope + +### In scope (v1) + +Six signal sources, expressed as sealed-type conditions: + +1. **`ROUTE_METRIC`** — aggregate stats per route or app: error rate, p95/p99 latency, throughput, error count. Backed by `stats_1m_route`. +2. **`EXCHANGE_MATCH`** — per-exchange matching with two fire modes: + - `PER_EXCHANGE` — one alert per matching exchange (cursor-advanced, used for "specific failure" patterns) + - `COUNT_IN_WINDOW` — aggregate "N exchanges matched in window" threshold +3. **`AGENT_STATE`** — agent in `DEAD` / `STALE` state for ≥ N seconds. Reads in-memory `AgentRegistryService`. +4. **`DEPLOYMENT_STATE`** — deployment status is `FAILED` / `DEGRADED` for ≥ N seconds. +5. **`LOG_PATTERN`** — count of log rows matching level / logger / pattern in a window > threshold. +6. **`JVM_METRIC`** — agent-reported JVM/Camel metric (heap %, GC pressure, inflight) over threshold for a window. + +**Delivery channels.** In-app inbox (derived from alerts + target-membership) and outbound HTTPS webhooks (via admin-managed outbound connections). No email. No native Slack/Teams integrations — users point webhook URLs at their own integrations. + +**Sharing model.** Rules are shared within an environment; alerts are visible to any viewer of the env, but notifications route to targeted users, groups, or roles (via existing RBAC). 
+ +**Lifecycle states.** `PENDING → FIRING → ACKNOWLEDGED → RESOLVED`, with `SILENCED` as an orthogonal property resolved at notification-dispatch time (preserves audit trail). + +**Rule promotion across environments** via UI prefill — no new server endpoint. + +**CMD-K integration** — alerts + alert rules appear as new result sources in the existing CMD-K registry. + +**Configurable evaluator cadence** (min 5 s floor), per-rule evaluation intervals, per-rule re-notify cadence. + +### Out of scope (v1, not deferred) + +- Custom SQL / Prometheus-style query DSL (option F). +- Email delivery channel — webhooks cover Slack / PagerDuty / Teams / OpsGenie / n8n / Zapier via ops-team-owned integrations. +- Native provider integrations (Slack, Teams, PagerDuty as first-class types). +- Incident management (merging alerts, parent/child, assignees, SLA tracking) — integrate with PagerDuty via webhook instead. +- Expression language in rules — fixed templates only. +- mTLS / client-cert auth on outbound webhooks. +- Real-time push (SSE) to the UI — 30 s polling is the v1 cadence. SSE is a clean drop-in for v2 if needed. + +### Deferred to backlog + +- **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)** — in-app CA bundle management UI. Deferred pending investigation of reusing the SaaS layer's existing CA handling (KISS / DRY). V1 CA trust material is filesystem-resident via deployment config, same posture as OIDC issuer URIs and Ed25519 keys. + +--- + +## 3. Key decisions + +Captured from brainstorming, in order of architectural impact. 
+ +| Decision | Chosen | Rejected | Rationale | +|---|---|---|---| +| Signal sources | 6 (route / exchange / agent / deployment / log / JVM) | SQL power-user mode | "Good enough" baseline; fixed templates cover real needs; expression languages are where observability tools go to be rewritten | +| Delivery channels | in-app + webhook | email, native integrations | Webhooks cover every target; email is deceptively expensive (deliverability, bounces, DKIM) | +| Sharing | tenant-env-shared rules; notifications target users/groups/roles | per-user "my alerts" | Ops products need single source of truth for what's broken; targets give per-person routing without duplicating rules | +| Evaluation | pull / claim-based polling | push / event-driven | Confinement — reads through existing interfaces, zero ingestion hooks; native handling of "no data" condition; 60 s latency acceptable for ops alerting | +| Horizontal scale | `FOR UPDATE SKIP LOCKED` claim pattern | advisory locks / leader election | Naturally partitions work; supports per-rule cadences; recovers from replica death; industry-standard | +| Alert lifecycle | FIRING / ACK / RESOLVED + SILENCED | minimal fire/resolve only, full incident workflow | Ack is the floor for team workflows (stop paging everyone); silences needed for ops maintenance; incident mgmt is a product category, not a feature | +| Rule shape | fixed templates, sealed-type JSONB | expression DSL, expression-first | Form-fillable; typed; additive for new kinds; consistent with no-SQL decision | +| Templating | JMustache | in-house substituter, Pebble/Freemarker | Industry standard for webhook templates (Slack, PagerDuty); logic-less (safe); small dep; familiar to ops users | +| UI placement | top-nav bell (consumer) + `/alerts` section (OPERATOR+ authoring, VIEWER+ read) | admin-only page, embedded per context, new top-level tab only | Separates consumer from authoring surfaces; rule authoring happens frequently, shouldn't be buried in admin | +| 
CMD-K | alerts + rules searchable | not searchable | Covers the "I saw this alert before lunch" use case; small surface via existing result-source registry | +| Outbound connections | admin-managed, tenant-global, allowed-env restriction | per-rule raw webhook URLs | Admins own infrastructure; operators author rules; rotation is atomic across N rules; reusable for future integrations | +| TLS trust | shared cross-cutting module `http/` | alerting-local trust config | Future-proofs for additional outbound HTTPS consumers; joins the existing OIDC outbound path | +| CA management UI | **deferred (BL-001)** | build in-server now | SaaS-layer CA mechanism should be investigated first for reuse | +| Env deletion | full cascade across alerting tables | partial cascade with SET NULL | POC teardown safety — zero orphaned rows | + +--- + +## 4. Module architecture + +### Package layout + +``` +cameleer-server-core/src/main/java/com/cameleer/server/core/ +├── alerting/ (domain; pure records + interfaces) +│ ├── AlertRule +│ ├── AlertCondition (sealed) +│ │ ├── RouteMetricCondition +│ │ ├── ExchangeMatchCondition +│ │ ├── AgentStateCondition +│ │ ├── DeploymentStateCondition +│ │ ├── LogPatternCondition +│ │ └── JvmMetricCondition +│ ├── AlertSeverity / AlertState (enums) +│ ├── AlertInstance / AlertEvent +│ ├── NotificationTarget / NotificationTargetKind +│ ├── AlertSilence / SilenceMatcher +│ ├── AlertRuleRepository (interface) +│ ├── AlertInstanceRepository (interface) +│ ├── AlertSilenceRepository (interface) +│ ├── AlertNotificationRepository (interface) +│ ├── AlertReadRepository (interface) +│ ├── ConditionEvaluator (sealed) +│ └── NotificationDispatcher (interface) +├── outbound/ (admin-managed outbound connections) +│ ├── OutboundConnection +│ ├── OutboundAuth (sealed — NONE, BEARER, BASIC) +│ ├── TrustMode (enum) +│ └── OutboundConnectionRepository (interface) +└── http/ (cross-cutting outbound HTTP primitive) + ├── OutboundHttpProperties + ├── 
OutboundHttpRequestContext + └── OutboundHttpClientFactory (interface) + +cameleer-server-app/src/main/java/com/cameleer/server/app/ +├── alerting/ +│ ├── controller/ (REST) +│ │ ├── AlertRuleController +│ │ ├── AlertController +│ │ ├── AlertSilenceController +│ │ └── AlertNotificationController +│ ├── storage/ (Postgres) +│ │ ├── PostgresAlertRuleRepository +│ │ ├── PostgresAlertInstanceRepository +│ │ ├── PostgresAlertSilenceRepository +│ │ ├── PostgresAlertNotificationRepository +│ │ └── PostgresAlertReadRepository +│ ├── eval/ (the scheduled evaluators) +│ │ ├── AlertEvaluatorJob (@Scheduled, claim-based) +│ │ ├── RouteMetricEvaluator +│ │ ├── ExchangeMatchEvaluator +│ │ ├── AgentStateEvaluator +│ │ ├── DeploymentStateEvaluator +│ │ ├── LogPatternEvaluator +│ │ ├── JvmMetricEvaluator +│ │ ├── PerKindCircuitBreaker +│ │ └── TickCache +│ ├── notify/ +│ │ ├── NotificationDispatchJob (@Scheduled, claim-based) +│ │ ├── InAppInboxQuery +│ │ ├── WebhookDispatcher +│ │ ├── MustacheRenderer +│ │ └── SilenceMatcher +│ ├── dto/ (AlertRuleDto, AlertDto, ConditionDto sealed, WebhookDto, etc.) 
+│ ├── retention/ +│ │ └── AlertingRetentionJob (daily @Scheduled) +│ └── config/ +│ └── AlertingProperties (@ConfigurationProperties) +├── outbound/ +│ ├── controller/ +│ │ └── OutboundConnectionAdminController +│ ├── storage/ +│ │ └── PostgresOutboundConnectionRepository +│ └── dto/ +│ └── OutboundConnectionDto +└── http/ + ├── ApacheOutboundHttpClientFactory + ├── SslContextBuilder + └── config/ + └── OutboundHttpConfig (@ConfigurationProperties) + +cameleer-server-app/src/main/resources/ +├── db/migration/V11__alerting_and_outbound.sql (one Flyway migration) +└── clickhouse/V_alerting_projections.sql (one CH migration, idempotent) + +ui/src/ +├── pages/Alerts/ +│ ├── InboxPage.tsx +│ ├── AllAlertsPage.tsx +│ ├── RulesListPage.tsx +│ ├── RuleEditor/ +│ │ ├── RuleEditorWizard.tsx +│ │ ├── ScopeStep.tsx +│ │ ├── ConditionStep.tsx +│ │ ├── TriggerStep.tsx +│ │ ├── NotifyStep.tsx +│ │ └── ReviewStep.tsx +│ ├── SilencesPage.tsx +│ └── HistoryPage.tsx +├── pages/Admin/ +│ └── OutboundConnectionsPage.tsx +├── components/ +│ ├── NotificationBell.tsx +│ └── AlertStateChip.tsx +├── api/queries/ +│ ├── alerts.ts +│ ├── alertRules.ts +│ ├── alertSilences.ts +│ └── outboundConnections.ts +└── cmdk/sources/ + ├── alerts.ts + └── alertRules.ts +``` + +### Touchpoints on existing code (deliberate, minimal) + +| Existing surface | Change | Scope | +|---|---|---| +| `cameleer-server-app/src/main/resources/db/migration/V11__…` | New Flyway migration | additive | +| `cameleer-server-app/src/main/resources/clickhouse/V_…_projections.sql` | New CH migration | additive, `IF NOT EXISTS` | +| `ClickHouseLogStore` | New method `long countLogs(LogSearchRequest)` (no `FINAL`) | one public method added | +| `ClickHouseSearchIndex` | New method `long countExecutionsForAlerting(AlertMatchSpec)` (no `FINAL`, no text-in-body subqueries) | one public method added | +| `SecurityConfig` | Path matchers for new endpoints | ~15 lines | +| `ui/src/router.tsx` | Route entries for `/alerts/**` and 
`/admin/outbound-connections` | additive | +| Top-nav layout | Insert `<NotificationBell />` | one import + one component | +| CMD-K registry | Register `alerts` + `alertRules` result sources | two file additions + one import | +| `.claude/rules/app-classes.md` + `core-classes.md` | Update class maps for the new packages | documentation | +| `com.cameleer:cameleer-common` | no changes | — | +| ingestion paths | no changes | — | +| agent protocol | no changes | — | +| ClickHouse schema (table structure) | no changes — only projections added | — | + +### New dependencies + +- `com.samskivert:jmustache` — logic-less Mustache templating for webhook/notification templates. ~30 KB, zero transitive deps. Added to `cameleer-server-core`. +- Apache HttpClient 5 (`org.apache.hc.client5`) — **already present** in the project; no new coordinate. + +--- + +## 5. Data model (PostgreSQL) + +One Flyway migration `V11__alerting_and_outbound.sql` creates all tables, enums, and indexes in a single transaction. + +### Enum types + +```sql +CREATE TYPE severity_enum AS ENUM ('CRITICAL','WARNING','INFO'); +CREATE TYPE condition_kind_enum AS ENUM ('ROUTE_METRIC','EXCHANGE_MATCH','AGENT_STATE','DEPLOYMENT_STATE','LOG_PATTERN','JVM_METRIC'); +CREATE TYPE alert_state_enum AS ENUM ('PENDING','FIRING','ACKNOWLEDGED','RESOLVED'); +CREATE TYPE target_kind_enum AS ENUM ('USER','GROUP','ROLE'); +CREATE TYPE notification_status_enum AS ENUM ('PENDING','DELIVERED','FAILED'); +CREATE TYPE trust_mode_enum AS ENUM ('SYSTEM_DEFAULT','TRUST_ALL','TRUST_PATHS'); +CREATE TYPE outbound_method_enum AS ENUM ('POST','PUT','PATCH'); +CREATE TYPE outbound_auth_kind_enum AS ENUM ('NONE','BEARER','BASIC'); +``` + +### Tables + +#### `outbound_connections` (admin-managed) + +```sql +CREATE TABLE outbound_connections ( + id uuid PRIMARY KEY, + tenant_id varchar(64) NOT NULL, + name varchar(100) NOT NULL, + description text, + url text NOT NULL, -- Mustache-enabled + method outbound_method_enum NOT NULL, + default_headers jsonb NOT 
NULL DEFAULT '{}', -- values are Mustache templates + default_body_tmpl text, -- null = built-in default JSON envelope + tls_trust_mode trust_mode_enum NOT NULL DEFAULT 'SYSTEM_DEFAULT', + tls_ca_pem_paths jsonb NOT NULL DEFAULT '[]', -- array of paths from OutboundHttpProperties + hmac_secret text, -- Ed25519-key-derived encryption at rest + auth_kind outbound_auth_kind_enum NOT NULL DEFAULT 'NONE', + auth_config jsonb NOT NULL DEFAULT '{}', -- shape depends on auth_kind; v1 unused + allowed_environment_ids uuid[] NOT NULL DEFAULT '{}', -- [] = allowed in all envs + created_at timestamptz NOT NULL DEFAULT now(), + created_by uuid NOT NULL REFERENCES users(id), + updated_at timestamptz NOT NULL DEFAULT now(), + updated_by uuid NOT NULL REFERENCES users(id), + UNIQUE (tenant_id, name) +); +CREATE INDEX outbound_connections_tenant_idx ON outbound_connections (tenant_id); +``` + +#### `alert_rules` + +```sql +CREATE TABLE alert_rules ( + id uuid PRIMARY KEY, + environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE, + name varchar(200) NOT NULL, + description text, + severity severity_enum NOT NULL, + enabled boolean NOT NULL DEFAULT true, + + condition_kind condition_kind_enum NOT NULL, + condition jsonb NOT NULL, -- sealed-subtype payload, Jackson-DEDUCTION polymorphic + + evaluation_interval_seconds int NOT NULL DEFAULT 60 CHECK (evaluation_interval_seconds >= 5), + for_duration_seconds int NOT NULL DEFAULT 0 CHECK (for_duration_seconds >= 0), + re_notify_minutes int NOT NULL DEFAULT 60 CHECK (re_notify_minutes >= 0), + + notification_title_tmpl text NOT NULL, -- Mustache + notification_message_tmpl text NOT NULL, -- Mustache + webhooks jsonb NOT NULL DEFAULT '[]', -- [{id: uuid, outboundConnectionId, bodyOverride?, headerOverrides?}] — id assigned server-side on save, used as stable ref from alert_notifications.webhook_id + + next_evaluation_at timestamptz NOT NULL DEFAULT now(), + claimed_by varchar(64), + claimed_until timestamptz, + 
eval_state jsonb NOT NULL DEFAULT '{}', + + created_at timestamptz NOT NULL DEFAULT now(), + created_by uuid NOT NULL REFERENCES users(id), + updated_at timestamptz NOT NULL DEFAULT now(), + updated_by uuid NOT NULL REFERENCES users(id) +); +CREATE INDEX alert_rules_env_idx ON alert_rules (environment_id); +CREATE INDEX alert_rules_claim_due_idx ON alert_rules (next_evaluation_at) WHERE enabled = true; +``` + +#### `alert_rule_targets` + +```sql +CREATE TABLE alert_rule_targets ( + id uuid PRIMARY KEY, + rule_id uuid NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE, + target_kind target_kind_enum NOT NULL, + target_id varchar(128) NOT NULL, + UNIQUE (rule_id, target_kind, target_id) +); +CREATE INDEX alert_rule_targets_lookup_idx ON alert_rule_targets (target_kind, target_id); +``` + +#### `alert_instances` + +```sql +CREATE TABLE alert_instances ( + id uuid PRIMARY KEY, + rule_id uuid REFERENCES alert_rules(id) ON DELETE SET NULL, + rule_snapshot jsonb NOT NULL, + environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE, + state alert_state_enum NOT NULL, + severity severity_enum NOT NULL, + fired_at timestamptz NOT NULL, + acked_at timestamptz, + acked_by uuid REFERENCES users(id), + resolved_at timestamptz, + last_notified_at timestamptz, + silenced boolean NOT NULL DEFAULT false, + current_value numeric, + threshold numeric, + context jsonb NOT NULL, + title text NOT NULL, + message text NOT NULL, + target_user_ids uuid[] NOT NULL DEFAULT '{}', + target_group_ids uuid[] NOT NULL DEFAULT '{}', + target_role_names text[] NOT NULL DEFAULT '{}' +); +CREATE INDEX alert_instances_inbox_idx ON alert_instances (environment_id, state, fired_at DESC); +CREATE INDEX alert_instances_open_rule_idx ON alert_instances (rule_id, state) WHERE rule_id IS NOT NULL; +CREATE INDEX alert_instances_resolved_idx ON alert_instances (resolved_at) WHERE state = 'RESOLVED'; +CREATE INDEX alert_instances_target_u_idx ON alert_instances USING GIN (target_user_ids); 
+CREATE INDEX alert_instances_target_g_idx ON alert_instances USING GIN (target_group_ids); +CREATE INDEX alert_instances_target_r_idx ON alert_instances USING GIN (target_role_names); +``` + +#### `alert_silences` + +```sql +CREATE TABLE alert_silences ( + id uuid PRIMARY KEY, + environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE, + matcher jsonb NOT NULL, -- { ruleId?, appSlug?, routeId?, agentId?, severity? } + reason text, + starts_at timestamptz NOT NULL, + ends_at timestamptz NOT NULL CHECK (ends_at > starts_at), + created_by uuid NOT NULL REFERENCES users(id), + created_at timestamptz NOT NULL DEFAULT now() +); +CREATE INDEX alert_silences_active_idx ON alert_silences (environment_id, ends_at); +``` + +#### `alert_notifications` (webhook delivery outbox) + +```sql +CREATE TABLE alert_notifications ( + id uuid PRIMARY KEY, + alert_instance_id uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE, + webhook_id uuid, -- opaque ref into rule's webhooks JSONB + outbound_connection_id uuid REFERENCES outbound_connections(id) ON DELETE SET NULL, + status notification_status_enum NOT NULL DEFAULT 'PENDING', + attempts int NOT NULL DEFAULT 0, + next_attempt_at timestamptz NOT NULL DEFAULT now(), + claimed_by varchar(64), + claimed_until timestamptz, + last_response_status int, + last_response_snippet text, + payload jsonb NOT NULL, -- snapshotted at first attempt + delivered_at timestamptz, + created_at timestamptz NOT NULL DEFAULT now() +); +CREATE INDEX alert_notifications_pending_idx ON alert_notifications (next_attempt_at) WHERE status = 'PENDING'; +CREATE INDEX alert_notifications_instance_idx ON alert_notifications (alert_instance_id); +``` + +#### `alert_reads` + +```sql +CREATE TABLE alert_reads ( + user_id uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE, + alert_instance_id uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE, + read_at timestamptz NOT NULL DEFAULT now(), + PRIMARY KEY (user_id, 
alert_instance_id) +); +``` + +### Cascade summary + +``` +environments → alert_rules (CASCADE) → alert_rule_targets (CASCADE) +environments → alert_silences (CASCADE) +environments → alert_instances (CASCADE) → alert_reads (CASCADE) + → alert_notifications (CASCADE) +alert_rules → alert_instances (SET NULL, rule_snapshot preserves context) +users → alert_reads (CASCADE) +outbound_connections (delete) — blocked by FK from rules.webhooks JSONB via app-level 409 check +``` + +**Rule deletion** preserves history (`alert_instances.rule_id = NULL`, `rule_snapshot` retains details). **Environment deletion** leaves zero alerting rows — POC-safe. + +### Jackson polymorphism for conditions + +```java +@JsonTypeInfo(use = JsonTypeInfo.Id.DEDUCTION) +@JsonSubTypes({ + @Type(RouteMetricCondition.class), + @Type(ExchangeMatchCondition.class), + @Type(AgentStateCondition.class), + @Type(DeploymentStateCondition.class), + @Type(LogPatternCondition.class), + @Type(JvmMetricCondition.class), +}) +public sealed interface AlertCondition permits + RouteMetricCondition, ExchangeMatchCondition, AgentStateCondition, + DeploymentStateCondition, LogPatternCondition, JvmMetricCondition { + ConditionKind kind(); +} +``` + +Jackson deduces the subtype from the set of present fields. Bean Validation (`@Valid`) on each record validates at the controller boundary. 
+ +Example condition payloads: + +```json +// ROUTE_METRIC +{ "scope": {"appSlug":"orders","routeId":"route-1"}, + "metric": "P99_LATENCY_MS", "comparator": "GT", "threshold": 2000, "windowSeconds": 300 } + +// EXCHANGE_MATCH — PER_EXCHANGE +{ "scope": {"appSlug":"orders"}, + "filter": {"status":"FAILED","attributes":{"type":"payment"}}, + "fireMode": "PER_EXCHANGE", "perExchangeLingerSeconds": 300 } + +// EXCHANGE_MATCH — COUNT_IN_WINDOW +{ "scope": {"appSlug":"orders"}, + "filter": {"status":"FAILED"}, + "fireMode": "COUNT_IN_WINDOW", "threshold": 5, "windowSeconds": 900 } + +// AGENT_STATE +{ "scope": {"appSlug":"orders"}, "state": "DEAD", "forSeconds": 60 } + +// DEPLOYMENT_STATE +{ "scope": {"appSlug":"orders"}, "states": ["FAILED","DEGRADED"] } + +// LOG_PATTERN +{ "scope": {"appSlug":"orders"}, "level": "ERROR", + "pattern": "TimeoutException", "threshold": 5, "windowSeconds": 900 } + +// JVM_METRIC +{ "scope": {"appSlug":"orders"}, "metric": "heap_used_percent", + "aggregation": "MAX", "comparator": "GT", "threshold": 90, "windowSeconds": 300 } +``` + +### Claim-polling queries + +```sql +-- Rule evaluator +UPDATE alert_rules + SET claimed_by = :instance, claimed_until = now() + interval '30 seconds' + WHERE id IN ( + SELECT id FROM alert_rules + WHERE enabled = true + AND next_evaluation_at <= now() + AND (claimed_until IS NULL OR claimed_until < now()) + ORDER BY next_evaluation_at + LIMIT :batch + FOR UPDATE SKIP LOCKED + ) + RETURNING *; + +-- Notification dispatcher (same pattern on alert_notifications with status='PENDING') +``` + +`FOR UPDATE SKIP LOCKED` is the crux: replicas never block each other. + +--- + +## 6. Outbound connections + +### Concept + +An `OutboundConnection` is a reusable, admin-managed HTTPS destination. Alert rules reference connections by ID and may override body or header templates per rule. Rotating a URL or secret updates every rule atomically. 
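
A minimal sketch of the by-reference model described above, assuming hypothetical `Connection` and `WebhookRef` records (the real `OutboundConnection` carries many more fields, and the class name `ConnectionResolver` is illustrative only). Because a rule stores just the connection ID plus optional overrides, the effective URL and body are resolved at dispatch time, so a single update to the connection is seen by every rule that references it.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, trimmed-down stand-ins for OutboundConnection and a rule's webhook entry.
record Connection(UUID id, String url, String defaultBodyTmpl) {}
record WebhookRef(UUID outboundConnectionId, String bodyOverride) {}
record EffectiveWebhook(String url, String bodyTmpl) {}

final class ConnectionResolver {
    private final Map<UUID, Connection> connections = new ConcurrentHashMap<>();

    // Insert or replace a connection; replacing models an admin rotating the URL.
    void save(Connection c) {
        connections.put(c.id(), c);
    }

    // Resolve at dispatch time: the URL always comes from the connection,
    // the body comes from the rule override when present, else the connection default.
    EffectiveWebhook resolve(WebhookRef ref) {
        Connection c = connections.get(ref.outboundConnectionId());
        if (c == null) {
            throw new IllegalStateException("unknown connection " + ref.outboundConnectionId());
        }
        String body = ref.bodyOverride() != null ? ref.bodyOverride() : c.defaultBodyTmpl();
        return new EffectiveWebhook(c.url(), body);
    }
}
```

Two rules that point at the same connection ID pick up a rotated URL on their next dispatch without either rule being edited, while a per-rule `bodyOverride` keeps winning over the connection default.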
+ +**Tenant-global.** Slack URLs and PagerDuty keys are team infrastructure, not env-specific. Env-specific routing is achieved by creating multiple connections (`slack-prod`, `slack-dev`) and referencing the appropriate one in each env's rules. + +**Allowed-env restriction.** `allowed_environment_ids` (default empty = all envs). Admin restricts a connection to specific envs via a multi-select on the connection form. UI picker filters by current env; rule save validates (422 on violation); narrowing the restriction while rules still reference it returns 409 with a conflict list. + +**Delete semantics.** 409 if any rule references the connection. No silent cascade — admin must first remove references. + +### Default body template (when rule has no override) + +```json +{ + "alert": { "id", "state", "firedAt", "severity", "title", "message", "link" }, + "rule": { "id", "name", "description", "severity" }, + "env": { "slug", "id" }, + "context": { /* full Mustache context: app, route, agent, exchange, etc. */ } +} +``` + +"Just plug in my Slack incoming webhook URL" works without writing a template. + +### HMAC signing (optional per connection) + +When `hmac_secret` is set, dispatch adds an `X-Cameleer-Signature: sha256=<hex>` header. GitHub / Stripe pattern. Secret encrypted at rest — concrete approach (Jasypt vs bespoke over existing Ed25519-derived key material) decided in planning (see §20). + +--- + +## 7. 
Rule evaluation + +### Scheduler + +```java +@Component +public class AlertEvaluatorJob implements SchedulingConfigurer { + + // Interval wired via AlertingProperties.evaluatorTickIntervalMs (floor 5000) + @Override + public void configureTasks(ScheduledTaskRegistrar registrar) { + registrar.addFixedDelayTask(this::tick, properties.effectiveEvaluatorTickIntervalMs()); + } + + void tick() { + List<AlertRule> claimed = ruleRepo.claimDueRules(instanceId, properties.batchSize()); + var groups = claimed.stream().collect(groupingBy(r -> new GroupKey(r.conditionKind(), windowSeconds(r)))); + for (var entry : groups.entrySet()) { + if (circuitBreaker.isOpen(entry.getKey().kind())) { rescheduleBatch(entry.getValue()); continue; } + try { + coalescedEvaluate(entry.getKey(), entry.getValue()); + } catch (Exception e) { + circuitBreaker.recordFailure(entry.getKey().kind()); + rescheduleBatch(entry.getValue()); + } + } + } +} +``` + +### Per-condition evaluators + +| Kind | Read source | Query shape | +|---|---|---| +| `ROUTE_METRIC` | `SearchService.statsForRoute` / `statsForApp` | Stats over window; comparator vs threshold | +| `EXCHANGE_MATCH` (PER_EXCHANGE) | `SearchService.search(SearchRequest)` | `timestamp > eval_state.lastExchangeTs AND filter` → fire one alert per match, advance cursor | +| `EXCHANGE_MATCH` (COUNT_IN_WINDOW) | `ClickHouseSearchIndex.countExecutionsForAlerting(spec)` | Count in window vs threshold | +| `AGENT_STATE` | `AgentRegistryService.listByEnvironment` | Any agent matches scope + state | +| `DEPLOYMENT_STATE` | `DeploymentRepository.findLatestByAppAndEnv` | Status in target set | +| `LOG_PATTERN` | `ClickHouseLogStore.countLogs(LogSearchRequest)` | Count in window vs threshold | +| `JVM_METRIC` | `MetricsQueryStore` | Latest value (aggregation per rule) vs threshold | + +### State machine + +``` + (cond holds for ` discarded at tick end. Two rules hitting the same `(app, route, window, metric)` produce one CH call. + +5. 
**Per-kind circuit breaker.** 5 failures in 30 s → open for 60 s. Metric `alerting_circuit_open_total{kind}`. UI surfaces an admin banner when open. + +### Silence matching + +At notification-dispatch time (not evaluation time): + +```sql +SELECT 1 FROM alert_silences + WHERE environment_id = :env + AND now() BETWEEN starts_at AND ends_at + AND matcher_matches(matcher, :instanceContext) + LIMIT 1; +``` + +If any match → `alert_instances.silenced = true`, no webhook dispatch, no re-notification. Inbox still shows the instance with a silenced pill — audit trail preserved. + +### Failure modes + +| Failure | Behavior | +|---|---| +| Read interface throws | Log WARN, increment `alerting_eval_errors_total{kind, rule_id}`, reschedule rule, release claim | +| 10 consecutive failures for a rule | Mark `eval_state.disabledReason`, surface in UI | +| Template render error | Fall back to literal `{{var}}` in output, log WARN, still dispatch | +| Slow evaluator | Claim TTL 30 s; investigate if sustained | +| Rule deleted mid-eval | FK cascade waits on the row lock — effectively serialized | +| Env deleted mid-eval | FK cascade waits — effectively serialized | + +--- + +## 8. Notification dispatch + +### In-app inbox — derived, not materialized + +```sql +SELECT ai.* + FROM alert_instances ai + WHERE ai.environment_id = :env + AND ai.state IN ('FIRING','ACKNOWLEDGED','RESOLVED') + AND ( + :me = ANY(ai.target_user_ids) + OR ai.target_group_ids && :my_group_ids + OR ai.target_role_names && :my_role_names + ) + ORDER BY ai.fired_at DESC + LIMIT 100; +``` + +`:my_group_ids` and `:my_role_names` resolved once per request from `RbacService`. + +**Bell badge count:** same filter + `state IN ('FIRING','ACKNOWLEDGED')` + `NOT EXISTS (alert_reads ar WHERE ar.user_id=:me AND ar.alert_instance_id=ai.id)`, count-only. Server-side 5 s memoization per `(env, user)` keeps bell polling cheap. 
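
The 5 s memoization per `(env, user)` mentioned above could be sketched as a small TTL cache. This is an illustrative sketch only: `BadgeKey`, `BadgeCountCache`, and the injectable clock are hypothetical names, and the loader stands in for the count-only query.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical cache key: one entry per (environment, user) pair.
record BadgeKey(String envId, String userId) {}

final class BadgeCountCache {
    private record Entry(long count, long expiresAtMillis) {}

    private final Map<BadgeKey, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so the TTL can be tested without sleeping

    BadgeCountCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached count while it is still fresh; otherwise runs the loader
    // (the count-only query in a real system) and caches the result for ttlMillis.
    // Note: the loader runs inside compute(), which serializes concurrent loads of one key.
    long get(BadgeKey key, LongSupplier loader) {
        long now = clock.getAsLong();
        Entry e = entries.compute(key, (k, old) ->
                (old != null && now < old.expiresAtMillis())
                        ? old
                        : new Entry(loader.getAsLong(), now + ttlMillis));
        return e.count();
    }
}
```

With a 30 s UI polling cadence and several open tabs per user, a cache like this bounds the badge query to at most one execution per key per TTL window. Entries are never evicted in this sketch; a production version would need a size bound or periodic cleanup.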
+ +### Webhook outbox — claim-based + +`NotificationDispatchJob` claims due notifications (`status='PENDING' AND next_attempt_at <= now()`) and dispatches. HTTP client from shared `OutboundHttpClientFactory` with TLS config from the referenced outbound connection. + +- **2xx** → `DELIVERED` +- **4xx** → `FAILED` immediately (retry won't help); log at WARN +- **5xx / network / timeout** → retry with exponential backoff 30 s → 2 m → 5 m, then `FAILED` +- Manual retry: `POST /alerts/notifications/{id}/retry` (OPERATOR+) + +Payload rendered at **first** dispatch attempt, snapshotted in `alert_notifications.payload`. Retries replay the snapshot — template edits after fire don't affect in-flight notifications. + +### Template rendering + +JMustache (`com.samskivert:jmustache`). Logic-less, industry-standard syntax. + +**Rendered surfaces:** URL (query-string interpolation), header values, body, and separately `alert_instances.title` / `message` rendered once at fire. + +**Context map** (dot-notation + camelCase leaves): + +``` +env.slug env.id +rule.id rule.name rule.severity rule.description +alert.id alert.state alert.firedAt alert.resolvedAt +alert.ackedBy alert.link alert.currentValue alert.threshold +alert.comparator alert.window +app.slug app.id app.displayName +route.id +agent.id agent.name agent.state +exchange.id exchange.status exchange.link +deployment.id deployment.status +log.logger log.level log.message +metric.name metric.value +``` + +**Error handling.** Missing variable renders as `{{var.name}}` literal + WARN log. Malformed template falls back to built-in default + WARN. Never drop a notification due to template error. + +**"Test render" endpoint:** `POST /alerts/rules/{id}/render-preview` — drives rule editor's Preview button. + +--- + +## 9. Rule promotion across environments + +**UX.** Rule list row → **Environments ▾** menu of other envs in the tenant → open rule editor pre-populated with source rule's payload, target env selected. 
Banner: *"Promoting `{ruleName}` from `{srcEnv}` → `{dstEnv}`. Review and adjust, then save."* Save → normal `POST /api/v1/environments/{dstEnvSlug}/alerts/rules`. Source unaffected (it's a copy). + +**Pure UI flow — no new server endpoint.** Re-uses the existing GET (to fetch) and POST (to create) paths. + +**Prefill-time validation (client-side; non-blocking except where noted):** + +| Field | Check | Behavior | +|---|---|---| +| `scope.appSlug` | Does app exist in target env? | ⚠ warn + picker from target env's apps | +| `scope.agentId` | Per-env; can't transfer | Clear field, keep appSlug, note | +| `scope.routeId` | Per-app logical ID, stable | ✓ pass through | +| `targets[]` | Tenant-scoped | ✓ transfer as-is | +| `webhooks[].outboundConnectionId` | Target env allowed by connection? | ⚠ warn if not; disable save until resolved | + +Bulk promotion (select multiple → promote all) deferred until usage patterns justify it. + +--- + +## 10. Cross-cutting: outbound HTTP & TLS trust + +Shared module — not inside `alerting/`. + +### `OutboundHttpClientFactory` + +```java +public interface OutboundHttpClientFactory { + CloseableHttpClient clientFor(OutboundHttpRequestContext context); +} + +public record OutboundHttpRequestContext( + TrustMode trustMode, // SYSTEM_DEFAULT | TRUST_ALL | TRUST_PATHS + List<String> trustedCaPemPaths, + Duration connectTimeout, + Duration readTimeout +) {} +``` + +Implementation (`ApacheOutboundHttpClientFactory`) memoizes one `CloseableHttpClient` per unique effective config — not one per call. 
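The per-config memoization can be illustrated with a small sketch. It relies on a Java record's value-based `equals`/`hashCode`, so the request context itself can serve as the cache key. To stay self-contained it is generic over the client type instead of depending on Apache HttpClient; `MemoizingClientFactory` is a hypothetical name, and `List<String>` for the CA paths is an assumption.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

enum TrustMode { SYSTEM_DEFAULT, TRUST_ALL, TRUST_PATHS }

// Mirrors the record from the spec; the element type of the path list is assumed.
record OutboundHttpRequestContext(
        TrustMode trustMode,
        List<String> trustedCaPemPaths,
        Duration connectTimeout,
        Duration readTimeout) {
    OutboundHttpRequestContext {
        // Defensive copy so the cache key cannot be mutated after creation
        trustedCaPemPaths = List.copyOf(trustedCaPemPaths);
    }
}

// Hypothetical sketch: one client instance per distinct effective config.
final class MemoizingClientFactory<C> {
    private final ConcurrentHashMap<OutboundHttpRequestContext, C> clients = new ConcurrentHashMap<>();
    private final Function<OutboundHttpRequestContext, C> builder;

    MemoizingClientFactory(Function<OutboundHttpRequestContext, C> builder) {
        this.builder = builder;
    }

    C clientFor(OutboundHttpRequestContext context) {
        // computeIfAbsent invokes the builder at most once per key, even under concurrency
        return clients.computeIfAbsent(context, builder);
    }
}
```

Two contexts with equal field values resolve to the same client, so webhooks that share trust mode, CA paths, and timeouts also share a connection pool. A real factory would additionally need to close the cached clients on shutdown.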
+ +### System config (`cameleer.server.outbound-http.*`) + +```yaml +cameleer: + server: + outbound-http: + trust-all: false # global kill-switch; WARN logged if true + trusted-ca-pem-paths: # additional roots layered on JVM default + - /etc/cameleer/certs/corporate-root.pem + - /etc/cameleer/certs/acme-internal.pem + default-connect-timeout-ms: 2000 + default-read-timeout-ms: 5000 + proxy-url: # optional; null = no proxy + proxy-username: + proxy-password: +``` + +On startup: if `trust-all=true`, log red WARN (not suitable for production). If `trusted-ca-pem-paths` has entries, verify each path exists; fail-fast on missing files. + +### Per-connection overrides + +Each `OutboundConnection` carries `tls_trust_mode` + `tls_ca_pem_paths`. UI surfaces a dropdown: **System default (validated)** / **Trust custom CAs (from server config)** / **Trust all (insecure — testing only)**. Amber warning when *Trust all* selected. Audit logged (`AuditCategory.OUTBOUND_HTTP_TRUST_CHANGE`). + +### Deferred + +See **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)**: +- In-app CA bundle upload / admin management +- SaaS-layer CA reuse investigation (do first) + +--- + +## 11. API surface + +All env-scoped routes under `/api/v1/environments/{envSlug}/alerts/...` via existing `@EnvPath` resolver. 
+ +### Alerting — rules + +| Method | Path | Role | +|---|---|---| +| `GET` | `/alerts/rules` | VIEWER+ | +| `POST` | `/alerts/rules` | OPERATOR+ | +| `GET` | `/alerts/rules/{id}` | VIEWER+ | +| `PUT` | `/alerts/rules/{id}` | OPERATOR+ | +| `DELETE` | `/alerts/rules/{id}` | OPERATOR+ | +| `POST` | `/alerts/rules/{id}/enable` · `/disable` | OPERATOR+ | +| `POST` | `/alerts/rules/{id}/render-preview` | OPERATOR+ | +| `POST` | `/alerts/rules/{id}/test-evaluate` | OPERATOR+ | + +### Alerting — instances + +| Method | Path | Role | +|---|---|---| +| `GET` | `/alerts` | VIEWER+ | +| `GET` | `/alerts/unread-count` | VIEWER+ | +| `GET` | `/alerts/{id}` | VIEWER+ | +| `POST` | `/alerts/{id}/ack` | VIEWER+ (if targeted) / OPERATOR+ | +| `POST` | `/alerts/{id}/read` | VIEWER+ (self) | +| `POST` | `/alerts/bulk-read` | VIEWER+ (self) | + +### Alerting — silences + +| Method | Path | Role | +|---|---|---| +| `GET` | `/alerts/silences` | VIEWER+ | +| `POST` | `/alerts/silences` | OPERATOR+ | +| `PUT` | `/alerts/silences/{id}` | OPERATOR+ | +| `DELETE` | `/alerts/silences/{id}` | OPERATOR+ | + +### Alerting — notifications + +| Method | Path | Role | +|---|---|---| +| `GET` | `/alerts/{id}/notifications` | VIEWER+ | +| `POST` | `/alerts/notifications/{id}/retry` | OPERATOR+ | + +### Outbound connections (admin) + +| Method | Path | Role | +|---|---|---| +| `GET` | `/api/v1/admin/outbound-connections` | ADMIN / OPERATOR (read-only) | +| `POST` | `/api/v1/admin/outbound-connections` | ADMIN | +| `GET` | `/api/v1/admin/outbound-connections/{id}` | ADMIN / OPERATOR (read-only) | +| `PUT` | `/api/v1/admin/outbound-connections/{id}` | ADMIN (409 if narrowing breaks references) | +| `DELETE` | `/api/v1/admin/outbound-connections/{id}` | ADMIN (409 if referenced) | +| `POST` | `/api/v1/admin/outbound-connections/{id}/test` | ADMIN | +| `GET` | `/api/v1/admin/outbound-connections/{id}/usage` | ADMIN / OPERATOR | + +### OpenAPI regen + +Per `CLAUDE.md` convention: after controller/DTO 
changes, run `cd ui && npm run generate-api:live` (backend on :8081) to regenerate `ui/src/api/schema.d.ts`. Commit the regenerated file alongside the controller change. + +--- + +## 12. CMD-K integration + +Two new result sources registered in the existing UI registry (`ui/src/cmdk/sources/`): + +- **Alerts** — queries `/alerts?q=...&limit=5` (server-side fulltext against title / message / rule_snapshot); results show severity icon + state chip; deep-link to `/alerts/inbox/{id}`. +- **Alert Rules** — queries `/alerts/rules?q=...&limit=5`; deep-link to `/alerts/rules/{id}`. + +No new registry machinery — uses the existing extension point. + +--- + +## 13. UI + +### Routes + +``` +/alerts + ├── /inbox (default landing) + ├── /all + ├── /rules + │ ├── /new + │ └── /{id} (edit; accepts ?promoteFrom=&ruleId=) + ├── /silences + └── /history + +/admin/outbound-connections + ├── / + ├── /new + └── /{id} +``` + +### Top-nav + +Insert the bell component between env selector and user menu. Badge severity = `max(severities of unread targeting me)` (CRITICAL → `var(--error)`, WARNING → `var(--amber)`, INFO → `var(--muted)`). Dropdown shows 5 most-recent unread with inline ack button + "See all". + +### Alerts section + +New sidebar/top-nav entry visible to `VIEWER+`. Authoring actions (`POST /rules`, silence create, etc.) gated to `OPERATOR+`. + +### Rule editor — 5-step wizard + +1. **Scope** — radio (env-wide / app / route / agent) + pickers from env catalog (existing endpoints). +2. **Condition** — radio (6 kinds) + kind-specific form. +3. **Trigger** — threshold + comparator + window + for-duration + evaluation interval + severity; inline *Test evaluate* button. +4. **Notify** — title + message templates with *Preview* button; targets multi-select (users / groups / roles with typeahead); outbound connections multi-select filtered by current env + `allowed_environment_ids`. +5. **Review** — summary card, enabled toggle, save. 
+ +### Silences, History, Rules list, OutboundConnectionAdminPage + +Structure described in design presentation; no new design-system components required. Reuses `Select`, `Tabs`, `Toggle`, `Button`, `Label`, `InfiniteScrollArea`, `PageLoader`, `Badge` from `@cameleer/design-system`. + +### Real-time behavior + +- Bell: `/alerts/unread-count` polled every 30 s; paused when tab hidden (Page Visibility API). +- Inbox view: `/alerts` polled every 30 s when focused. +- No SSE in v1. SSE is a clean future add under `/alerts/stream` with no schema changes. + +### Accessibility + +Keyboard navigation; severity conveyed via icon + text + color (not color alone); ARIA live region on inbox for new-alert announcement; bell component has descriptive `aria-label`. + +### Styling + +All colors via `@cameleer/design-system` CSS variables (`var(--error)`, `var(--amber)`, `var(--muted)`, `var(--success)`). No hard-coded hex. + +--- + +## 14. Configuration + +### `AlertingProperties` (`cameleer.server.alerting.*`) + +```yaml +cameleer: + server: + alerting: + evaluator-tick-interval-ms: 5000 # floor: 5000 (clamped at startup with WARN if lower) + evaluator-batch-size: 20 + claim-ttl-seconds: 30 + notification-tick-interval-ms: 5000 + notification-batch-size: 50 + in-tick-cache-enabled: true + circuit-breaker-fail-threshold: 5 + circuit-breaker-window-seconds: 30 + circuit-breaker-cooldown-seconds: 60 + event-retention-days: 90 + notification-retention-days: 30 + webhook-timeout-ms: 5000 + webhook-max-attempts: 3 +``` + +Env-var overridable (`CAMELEER_SERVER_ALERTING_EVALUATOR_TICK_INTERVAL_MS=...`). Wired via `SchedulingConfigurer` (not literal `@Scheduled(fixedDelay=...)`) so intervals come from the bean at startup. Hot-reload not supported — restart required to change cadence. + +### `OutboundHttpProperties` (`cameleer.server.outbound-http.*`) + +See §10. + +--- + +## 15. 
Retention + +Daily `@Scheduled(cron = "0 0 3 * * *")` job `AlertingRetentionJob` (advisory-lock-of-the-day pattern, same as `JarRetentionJob`): + +```sql +-- Retention values are day counts; make_interval avoids casting a bare number to interval +DELETE FROM alert_instances + WHERE state = 'RESOLVED' + AND resolved_at < now() - make_interval(days => :eventRetentionDays); + +DELETE FROM alert_notifications + WHERE status IN ('DELIVERED','FAILED') + AND (delivered_at IS NULL OR delivered_at < now() - make_interval(days => :notificationRetentionDays)); +``` + +Retention values from `AlertingProperties`. + +--- + +## 16. Observability + +New metrics exposed via existing `/api/v1/prometheus`: + +- `alerting_eval_duration_seconds{kind}` — histogram per condition kind +- `alerting_eval_errors_total{kind, rule_id}` — counter +- `alerting_circuit_open_total{kind}` — counter +- `alerting_rule_state{state}` — gauge (enabled / disabled / broken-reference) +- `alerting_instances_total{state, severity}` — gauge (open alerts) +- `alerting_notifications_total{status}` — counter +- `alerting_webhook_delivery_duration_seconds` — histogram + +No new dashboards shipped in v1; tenants with Prometheus + Grafana can build their own. An "Alerting health" admin sub-page is a cheap future add. + +### Audit + +New `AuditCategory` values: +- `OUTBOUND_HTTP_TRUST_CHANGE` — webhook or connection TLS config change +- `ALERT_RULE_CHANGE` — create / update / delete rule +- `ALERT_SILENCE_CHANGE` — create / update / delete silence +- `OUTBOUND_CONNECTION_CHANGE` — admin CRUD on outbound connection + +Emitted via existing `AuditService.log(...)`. + +--- + +## 17. Security + +- **Tenant + env isolation.** Every controller call runs through `@EnvPath` (resolves env → tenant via `TenantContext`). Every CH query filters by `tenant_id AND environment` per pre-existing invariant. +- **RBAC.** Enforced via Spring Security `@PreAuthorize` on each endpoint (see §11 role column). 
+- **Webhook URL SSRF protection.** At rule save, reject URLs resolving to loopback or private addresses (`127.0.0.0/8`, `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `::1`, `fc00::/7`) unless a deployment-level dev flag explicitly allows them. +- **HMAC signing.** Per-connection `hmac_secret` encrypted at rest; signature header sent on dispatch. +- **TLS trust.** Cross-cutting module (§10). +- **Audit.** See §16. + +--- + +## 18. Testing + +### Backend — unit (`*Test.java`, no Spring) + +- Each `ConditionEvaluator`: synthetic inputs → expected `EvalResult`. Fire / no-fire / threshold edges / PER_EXCHANGE cursor / for-duration debounce. +- `MustacheRenderer`: context + template → expected output; malformed falls back + logs. +- `SilenceMatcher`: matcher JSONB vs instance → truth table. +- Jackson polymorphism: roundtrip each `AlertCondition` subtype. +- Claim-polling concurrency (embedded PG): two threads → no duplicates. + +### Backend — integration (Testcontainers, `*IT.java`) + +- `AlertingFullLifecycleIT` — end-to-end rule → fire → ack → silence → delete, history survives. +- `AlertingEnvIsolationIT` — alert in env-A invisible from env-B inbox. +- `OutboundConnectionAllowedEnvIT` — 422 on save if connection not allowed in env; 409 on narrow-while-referenced. +- `WebhookDispatchIT` (WireMock) — payload shape, HMAC signature, retry on 5xx, FAILED after max, no retry on 4xx. +- `PerformanceIT` (opt-in, not default CI) — 500 rules + 5-replica simulation. + +### Frontend — component (Vitest + Testing Library) + +- Rule editor wizard step navigation + validation. +- Bell polling pause on tab hide. +- Inbox row rendering by severity. +- CMD-K result-source registration. + +### Frontend — E2E (Playwright if infra supports) + +- Create rule → inject matching data → bell badge appears → open alert → ack → badge clears. + +--- + +## 19. Rollout + +- **No feature flag.** Alerting is dormant-by-default: zero rules → zero evaluator work → zero behavior change. Migration is additive. 
+- **Migration rollback.** V11 PG migration has matching down-script; CH projections are `IF NOT EXISTS`-safe and droppable without data loss. +- **Progressive adoption.** First user creates the first rule; feature organically spreads from there. +- **Documentation.** Add an admin-facing alerting guide under `docs/` describing rule shapes, template variables, webhook destinations, and silence patterns. +- **`.claude/rules/` updates.** `app-classes.md` and `core-classes.md` updated to document the new packages and any touched classes — part of the change, not a follow-up. + +--- + +## 20. Open questions / items for writing-plans + +These are not design-level decisions — they're implementation-phase tasks to be carried into planning: + +1. **Alignment with existing OIDC outbound cert handling.** Before implementing `ApacheOutboundHttpClientFactory`, audit how `OidcProviderHelper` / `OidcTokenExchanger` currently validate certs. If there's a pattern in place, mirror it; if not, adopt the factory as the one-true-way and retrofit OIDC in a separate follow-up (not part of alerting v1). +2. **`hmac_secret` encryption-at-rest.** Decide between Jasypt (simplest, adds a dep) and a bespoke encrypt/decrypt over the existing Ed25519-derived key material (no new dep, ~50 LOC). Defer to plan. +3. **V1 CH migration file naming.** Confirm the convention for alerting-owned CH migrations (`V_alerting_projections.sql` vs numbered). Current `ClickHouseSchemaInitializer` runs files idempotently — naming is informational. +4. **Bell component keyboard shortcut.** Optional; align with existing CMD-K shortcut conventions. +5. **Target picker UX.** How to mix user / group / role in one multi-select with typeahead. Small UX design task. +6. **Env-delete cascade audit.** Before merge, verify the full cascade chain empirically in a PG integration test — POC safety depends on it.