docs(spec): PER_EXCHANGE exactly-once-per-exchange alerting

Four focused correctness fixes for the "fire exactly once per FAILED
exchange" use case (alerting layer only; HTTP-level idempotency is a
separate scope):

1. Composite cursor (startTime, executionId) replaces the current
   single-timestamp, inclusive cursor — prevents same-millisecond
   drops and same-exchange re-selection.
2. First-run cursor initialized to rule createdAt (not null) —
   prevents the current unbounded historical-retention scan on first
   tick of a new rule.
3. Transactional coupling of instance writes + notification enqueue +
   cursor advance — eliminates partial-progress failure modes on crash
   or rollback.
4. Config hygiene: reNotifyMinutes forced to 0, forDurationSeconds
   rejected, perExchangeLingerSeconds removed entirely (was validated
   as required but never read) — the rule shape stops admitting
   nonsensical PER_EXCHANGE combinations.

Alert stays FIRING until human ack/resolve (no auto-resolve); webhook
fires exactly once per AlertInstance; Inbox never sees duplicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: hsiegeln
Date: 2026-04-22 14:17:18 +02:00
Parent: e4492b10e1
Commit: 817b61058a

# PER_EXCHANGE — Exactly-Once-Per-Exchange Alerting — Design
**Date:** 2026-04-22
**Scope:** alerting layer only (webhook-delivery-level idempotency is out of scope — see "Out of scope" below).
**Preceding context:** `.planning/sse-flakiness-diagnosis.md` (unrelated); `docs/superpowers/specs/2026-04-19-alerting-design.md` (foundational alerting design — original PER_EXCHANGE intent).
## Motivation
A user wants to create an alert rule of the shape "for every exchange that ends in FAILED status, notify Slack — exactly once per exchange." Exchanges are **terminal events**: once in FAILED state they never transition back, unlike agents which toggle between LIVE / STALE / DEAD. So "exactly once" is well-defined and achievable.
Today's PER_EXCHANGE mode partially supports this but has three gaps that, in combination, either miss exchanges or flood the Inbox with duplicates:
1. The cursor `evalState.lastExchangeTs` is a single timestamp, advanced to `max(startTime)` across the batch and applied as `SearchRequest.timeFrom`, which maps to `start_time >= cursor` in ClickHouse. With that inclusive bound, same-timestamp exchanges get re-selected on the next tick; tightening it to a strict bound would instead drop any exchange that shares a millisecond with the cursor-setting one. A single-timestamp cursor cannot avoid both failure modes.
2. Alert-instance writes and the `evalState` cursor advance are not coupled transactionally. A crash between the two writes produces either silent data loss (cursor advanced, instances never persisted) or duplicate instances on recovery (instances persisted, cursor not advanced).
3. The rule-configuration surface for PER_EXCHANGE admits nonsensical combinations (`reNotifyMinutes > 0`, mandatory-but-unused `perExchangeLingerSeconds`, `forDurationSeconds`) — a user following the UI defaults can build a rule that re-notifies hourly even though they want one-shot semantics.
The product rules, agreed with the user:
- The AlertInstance **stays FIRING** until a human acks or resolves it. No auto-resolve sweep.
- The action (webhook) **fires exactly once** per AlertInstance.
- The Inbox **contains exactly one AlertInstance** per failed exchange — never a duplicate from cursor errors, tick re-runs, or process restarts.
"Exactly once" here is at the alerting layer — one AlertInstance, one PENDING AlertNotification per (instance × webhook binding). The HTTP dispatch that follows is still at-least-once on transient failures; that's a separate scope.
## Non-goals
- Auto-resolve after linger seconds. The existing spec reserved `perExchangeLingerSeconds` for this; we're explicitly dropping the field (unused + not desired).
- Resolve-on-delivery semantics. Alert stays FIRING until human intervention.
- Webhook-level idempotency / exactly-once HTTP delivery to Slack. Rare duplicate Slack messages on timeout retries are accepted; consumer-side dedup (via `alert.id` in the payload) is a template concern, not a server change.
- Any change to COUNT_IN_WINDOW mode.
- Backfilling duplicate instances already created by the existing buggy cursor in any running environment. Pre-prod, manual cleanup if needed.
## Design
Four focused changes. Each is small on its own; together they make PER_EXCHANGE fire exactly once per failed exchange.
### 1. Composite cursor for cursor monotonicity
**Current.** `evalState.lastExchangeTs` is a single ISO-8601 string. `ExchangeMatchEvaluator.evaluatePerExchange` reads it as a lower bound and advances it to `max(startTime)` in the batch.
**New.** Replace with a composite cursor `(startTime, executionId)`, serialized as `"<ISO-8601 startTime>|<executionId>"` in `evalState.lastExchangeCursor`.
- Selection predicate: `(start_time > cursor.ts) OR (start_time = cursor.ts AND execution_id > cursor.id)`.
- This is trivially monotone: every consumed exchange is strictly-after the cursor in the composite ordering.
- Handles two exchanges at the exact same millisecond correctly — both are selected on their turn, neither re-selected.
- Uses the existing ClickHouse primary-key order `(tenant_id, start_time, …, execution_id)`, so the predicate is a range scan on the PK.
- Advance: set cursor to the `(startTime, executionId)` of the lexicographically-last row in the batch (last row when sorted by `(start_time asc, execution_id asc)`).
- First run (no cursor): today's behaviour is `cursor = null → timeFrom = null → unbounded scan of ClickHouse history` — any pre-existing FAILED exchange in retention would fire an alert on the first tick. That's broken and needs fixing.
- New first-run rule: initialize `lastExchangeCursor` to `(rule.createdAt, "")` at rule creation time — so a PER_EXCHANGE rule only alerts on exchanges that fail *after* it was created. The empty-string `executionId` component is correct: any real execution_id sorts strictly after it lexicographically, so the very first matching exchange post-creation gets picked up on the first tick. No ambient lookback window, no retention dependency, no backlog flood.
- `evalState` schema change: retire the `lastExchangeTs` key, add `lastExchangeCursor`. Pre-prod; no migration needed. Readers that see neither key treat the rule as first-run.
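The composite cursor above can be sketched as a small value type. This is an illustrative shape, not the actual class in the codebase — `ExchangeCursor` and its method names are assumptions; only the serialization format, the first-run rule, and the strict-after predicate come from this spec.

```java
import java.time.Instant;

// Hypothetical composite cursor: (startTime, executionId), serialized as
// "<ISO-8601 startTime>|<executionId>" per the evalState format above.
record ExchangeCursor(Instant ts, String executionId) {

    String serialize() {
        return ts.toString() + "|" + executionId;
    }

    static ExchangeCursor parse(String s) {
        int sep = s.indexOf('|');
        return new ExchangeCursor(Instant.parse(s.substring(0, sep)), s.substring(sep + 1));
    }

    // First-run cursor: bounded by the rule's creation time. The empty
    // executionId sorts strictly before any real id, so the first matching
    // post-creation exchange is selected on the first tick.
    static ExchangeCursor firstRun(Instant ruleCreatedAt) {
        return new ExchangeCursor(ruleCreatedAt, "");
    }

    // Selection predicate from this section:
    // (start_time > cursor.ts) OR (start_time = cursor.ts AND execution_id > cursor.id)
    boolean isAfter(Instant startTime, String executionId) {
        int c = startTime.compareTo(ts);
        return c > 0 || (c == 0 && executionId.compareTo(this.executionId) > 0);
    }
}
```

Advancing the cursor to the last row of the batch (sorted by `(start_time asc, execution_id asc)`) keeps this ordering monotone: every consumed exchange is strictly after the cursor, and two exchanges in the same millisecond are distinguished by `executionId`.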
**Affected files (scope estimate):**
- `ExchangeMatchEvaluator.evaluatePerExchange` — cursor parse/advance/selection.
- `SearchRequest` / `ClickHouseSearchIndex.search` — needs to accept the composite predicate. Option A: add an optional `afterExecutionId` param alongside `timeFrom`. Option B: introduce a dedicated `AfterCursor(ts, id)` type. Plan phase picks one — A is simpler.
- `evalState` JSON schema (documented in alerting spec).
### 2. Transactional coupling of instance writes + cursor advance
**Current.** Per tick for a PER_EXCHANGE rule:
1. `applyResult` iterates the `EvalResult.Batch` firings and calls `applyBatchFiring` for each — one `AlertInstance` save + `enqueueNotifications` per firing, each its own transaction (or auto-commit).
2. After the rule loop, `reschedule(rule, nextRun)` saves the updated `evalState` + `nextRunAt` in a separate write.
Crash anywhere between steps 1 and 2 (or partway through the loop in step 1) produces one of two inconsistent states:
- Instances saved but cursor not advanced → next tick duplicates them.
- Cursor advanced but no instances saved → those exchanges never alerted.
**New.** Wrap the whole Batch-result processing for a single rule in one `TransactionTemplate.execute(...)`:
```
TX {
  persist all AlertInstances for the batch
  insert all PENDING AlertNotifications for those instances
  update rule: evalState.lastExchangeCursor + nextRunAt
}
```
Commit: all three land atomically. Rollback: none do, and the rule stays claimed-but-cursor-unchanged so the next tick re-processes the same exchanges. Combined with the monotone cursor from §1, that gives exactly-once instance creation: if a batch half-succeeded and rolled back, the second attempt starts from the same cursor and produces the same set.
Notification dispatch (`NotificationDispatchJob` picking up PENDING rows) happens outside the transaction on its own schedule — webhook I/O must never hold a DB transaction open.
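The exactly-once argument — rollback discards staged work *and* the cursor advance, so a retry from the same cursor reproduces the same set — can be simulated in plain Java. This is a minimal sketch with no Spring or database; the class and an integer cursor stand in for the real transaction and the `(startTime, executionId)` composite.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simulation of one PER_EXCHANGE tick: instances and cursor commit together,
// or a mid-batch fault rolls back both.
class TickSimulation {
    final List<String> exchanges = List.of("e1", "e2", "e3"); // FAILED exchanges, cursor order
    final Set<String> instances = new LinkedHashSet<>();      // "persisted" AlertInstances
    int cursor = 0;                                           // index stand-in for (ts, executionId)

    void tick(boolean faultMidBatch) {
        // Stage all work locally, like uncommitted writes inside the TX.
        Set<String> staged = new LinkedHashSet<>();
        int stagedCursor = cursor;
        for (int i = cursor; i < exchanges.size(); i++) {
            if (faultMidBatch && i == cursor + 1) return; // rollback: staged work discarded
            staged.add(exchanges.get(i));
            stagedCursor = i + 1;
        }
        // Commit: instances and cursor advance land atomically.
        instances.addAll(staged);
        cursor = stagedCursor;
    }
}
```

A faulted tick leaves zero instances and the cursor unchanged; the retry produces exactly the original batch, and a further tick with no new exchanges produces nothing — the three properties the tests in §4 pin down.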
**Affected files (scope estimate):**
- `AlertEvaluatorJob.applyResult` + `applyBatchFiring` — fold into one transactional block when the result is a `Batch`.
- No change to the COUNT_IN_WINDOW path (`applyResult` for non-Batch results keeps its current semantics).
- `PostgresAlertInstanceRepository` / `PostgresAlertNotificationRepository` / `PostgresAlertRuleRepository` — existing methods usable from inside a transaction; verify no implicit auto-commit.
### 3. Config hygiene — enforce a coherent PER_EXCHANGE rule shape
Three knobs on the rule are wrong for PER_EXCHANGE and trap the user into buggy configurations.
| Knob | Current state | New state for PER_EXCHANGE |
|---|---|---|
| `reNotifyMinutes` | Default 60 in UI; re-notify sweep fires every N min while FIRING | **Must be `0`.** API 400s if non-zero. UI forces to `0` and disables the input with tooltip "Per-exchange rules fire exactly once — re-notify does not apply." |
| `perExchangeLingerSeconds` | Validated as required by `ExchangeMatchCondition` compact ctor; unused anywhere in the code | **Removed.** Drop the field entirely — from the record, the compact-ctor validation, `AlertRuleRequest` DTO, form state, UI. Pre-prod; no shim. |
| `forDurationSeconds` | Applied by the state machine in the COUNT_IN_WINDOW / agent-lifecycle path | **Must be `0`/null** for PER_EXCHANGE. 400 on save if non-zero. UI hides the field when PER_EXCHANGE is selected. Evaluator path already ignores it for `Batch` results, so this is a contract-tightening at the API edge only. |
Net effect: a PER_EXCHANGE rule's configurable surface becomes exactly `{scope, filter, severity, notification title/message, webhooks, targets}`. The user can't express an inconsistent combination.
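The cross-field validation can be sketched as a pure function over the three knobs. The class name, the string `fireMode` discriminator, and the error messages are illustrative assumptions — the real check lives in the controller/DTO layer listed below; only the 400 conditions come from the table above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator for the PER_EXCHANGE rule shape: returns the list of
// violations that would make the API respond 400.
final class PerExchangeRuleValidator {
    static List<String> validate(String fireMode, int reNotifyMinutes, Integer forDurationSeconds) {
        List<String> errors = new ArrayList<>();
        if (!"PER_EXCHANGE".equals(fireMode)) return errors; // other modes unaffected
        if (reNotifyMinutes != 0) {
            errors.add("reNotifyMinutes must be 0 for PER_EXCHANGE rules");
        }
        if (forDurationSeconds != null && forDurationSeconds != 0) {
            errors.add("forDurationSeconds must be 0 or unset for PER_EXCHANGE rules");
        }
        return errors;
    }
}
```

(`perExchangeLingerSeconds` needs no check here — the field is removed from the record and DTO outright, so an unknown-property request simply fails to bind or is dropped per the existing handling.)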
**Affected files (scope estimate):**
- `ExchangeMatchCondition` — remove `perExchangeLingerSeconds`.
- `AlertRuleController` / `AlertRuleRequest` — cross-field validation (reNotify + forDuration vs fireMode).
- `ui/src/pages/Alerts/RuleEditor/TriggerStep.tsx` + `form-state.ts` — disable reNotify + hide forDuration when PER_EXCHANGE; remove the linger field.
- Tests (§4).
### 4. Tests that lock the guarantees
Three scenarios, one per correctness property.
**Test 1 — cursor monotonicity** (`ExchangeMatchEvaluatorTest`, unit)
- Seed two FAILED executions with identical `start_time`, different `executionId`.
- Tick 1: both fire, batch of 2.
- Tick 2: neither fires.
- Seed a third at the same timestamp. Tick 3: that third only.
**Test 2 — tick atomicity** (`AlertEvaluatorJobIT`, integration with real Postgres)
- Seed 3 FAILED executions. Inject a fault on the second notification-insert.
- Tick → transaction rolls back: 0 AlertInstances, cursor unchanged, rule `nextRunAt` unchanged.
- Remove fault, tick again: 3 AlertInstances + 3 PENDING notifications, cursor advanced.
**Test 3 — full-lifecycle exactly-once** (extends `AlertingFullLifecycleIT`)
- PER_EXCHANGE rule, dummy webhook.
- Seed 5 FAILED executions across two ticks (3 + 2). After both ticks: exactly 5 FIRING AlertInstances, exactly 5 PENDING notifications.
- Third tick with no new executions: zero new instances, zero new notifications.
- Ack one instance: other four unchanged.
- Additionally: POST a PER_EXCHANGE rule with `reNotifyMinutes=60` via the controller → expect 400.
- Additionally: POST a PER_EXCHANGE rule with `forDurationSeconds=60` → expect 400.
**Test 4 — first-run uses rule creation time, not unbounded history** (unit, in `ExchangeMatchEvaluatorTest`)
- Seed 2 FAILED executions dated *before* rule creation, 1 *after*.
- Evaluate a freshly-created PER_EXCHANGE rule whose `evalState` is empty.
- Expect: exactly 1 firing (the one after creation). The pre-creation ones must not appear in the batch.
Plus a small unit test on the new cross-field validator to isolate its logic from the IT setup.
## Out of scope
- **Webhook-level idempotency.** `WebhookDispatcher` still retries on 5xx / network / timeout. For Slack, that means a timeout mid-POST can produce a duplicate channel message. The consumer-side fix is to include a stable ID (e.g. `{{alert.id}}`) in the message template and drop duplicates on Slack's side — doable today via the existing Mustache editor, no server change. If in the future we want strict exactly-once HTTP delivery, that's a separate design.
- **Auto-resolve of PER_EXCHANGE instances.** Alerts stay FIRING until humans intervene. If operational experience shows the Inbox gets unwieldy, a later phase can add a manual "resolve all" bulk action or an opt-in TTL sweep.
- **Rule-level dedup of identical alerts in a short window** (e.g. "same failure signature fires twice in 5 s"). Out of scope; every failed exchange is its own event by design.
- **COUNT_IN_WINDOW changes.** Untouched.
- **Migration of existing PER_EXCHANGE rules.** Pre-prod; any existing rule using the retired `perExchangeLingerSeconds` field gets the value silently dropped by the API's unknown-property handling on next PUT, or rejected on create (new shape). If needed, a one-shot cleanup is easier than a shim.
## Risks
- **ClickHouse predicate performance.** The composite predicate `(start_time > ? OR (start_time = ? AND execution_id > ?))` must hit the PK range efficiently. The table PK is `(tenant_id, start_time, environment, application_id, route_id, execution_id)`, so the OR-form should be fine, but we'll verify with `EXPLAIN PIPELINE` against the IT container during plan-phase. Fallback: `(start_time, execution_id)` tuple comparison if CH has native support (`(start_time, execution_id) > (?, ?)`), which it does in recent versions.
- **Transaction size.** A single tick caps at `limit = 50` matches (existing behaviour), so the transaction holds at most 50 AlertInstance + 50 AlertNotification writes + 1 rule update. Well within safe bounds.
- **Cursor format churn.** Dropping `lastExchangeTs` in favour of `lastExchangeCursor` is a one-line `evalState` JSON change. Pre-prod; no shim needed. Any rule whose `evalState` has neither key falls back to the "first-run" behaviour — `(rule.createdAt, "")` — which is bounded by the rule's own creation time. For existing PER_EXCHANGE rules this means the cursor effectively resets to rule-creation, so exchanges that failed between the previous `lastExchangeTs` and deploy time would not re-fire. For pre-prod that's acceptable; post-prod we'd add a one-shot translate-old-key shim at startup.
## Verification
- `mvn -pl cameleer-server-app -am -Dit.test='ExchangeMatchEvaluatorTest,AlertEvaluatorJobIT,AlertingFullLifecycleIT,AlertRuleControllerIT' ... verify` → 0 failures.
- Manual: create a PER_EXCHANGE / FAILED rule via UI. Verify `reNotifyMinutes` is fixed at 0 and disabled. Produce failing exchanges. Verify Inbox shows one instance per exchange, Slack gets one message each. Ack one instance. Produce another failure. Verify only the new one appears.