From 817b61058a96a117a7b844f37b7a8ac2d5048775 Mon Sep 17 00:00:00 2001 From: hsiegeln <37154749+hsiegeln@users.noreply.github.com> Date: Wed, 22 Apr 2026 14:17:18 +0200 Subject: [PATCH] docs(spec): PER_EXCHANGE exactly-once-per-exchange alerting MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Four focused correctness fixes for the "fire exactly once per FAILED exchange" use case (alerting layer only; HTTP-level idempotency is a separate scope): 1. Composite cursor (startTime, executionId) replaces the current single-timestamp, inclusive cursor — prevents same-millisecond drops and same-exchange re-selection. 2. First-run cursor initialized to rule createdAt (not null) — prevents the current unbounded historical-retention scan on first tick of a new rule. 3. Transactional coupling of instance writes + notification enqueue + cursor advance — eliminates partial-progress failure modes on crash or rollback. 4. Config hygiene: reNotifyMinutes forced to 0, forDurationSeconds rejected, perExchangeLingerSeconds removed entirely (was validated as required but never read) — the rule shape stops admitting nonsensical PER_EXCHANGE combinations. Alert stays FIRING until human ack/resolve (no auto-resolve); webhook fires exactly once per AlertInstance; Inbox never sees duplicates. Co-Authored-By: Claude Opus 4.7 (1M context) --- ...-04-22-per-exchange-exactly-once-design.md | 154 ++++++++++++++++++ 1 file changed, 154 insertions(+) create mode 100644 docs/superpowers/specs/2026-04-22-per-exchange-exactly-once-design.md diff --git a/docs/superpowers/specs/2026-04-22-per-exchange-exactly-once-design.md b/docs/superpowers/specs/2026-04-22-per-exchange-exactly-once-design.md new file mode 100644 index 00000000..5b7db45b --- /dev/null +++ b/docs/superpowers/specs/2026-04-22-per-exchange-exactly-once-design.md @@ -0,0 +1,154 @@ +# PER_EXCHANGE — Exactly-Once-Per-Exchange Alerting — Design + +**Date:** 2026-04-22 +**Scope:** alerting layer only (webhook-delivery-level idempotency is out of scope — see "Out of scope" below). +**Preceding context:** `.planning/sse-flakiness-diagnosis.md` (unrelated); `docs/superpowers/specs/2026-04-19-alerting-design.md` (foundational alerting design — original PER_EXCHANGE intent). + +## Motivation + +A user wants to create an alert rule of the shape "for every exchange that ends in FAILED status, notify Slack — exactly once per exchange." Exchanges are **terminal events**: once in FAILED state they never transition back, unlike agents which toggle between LIVE / STALE / DEAD. So "exactly once" is well-defined and achievable. + +Today's PER_EXCHANGE mode partially supports this but has three gaps that, in combination, either miss exchanges or flood the Inbox with duplicates: + +1. The cursor `evalState.lastExchangeTs` is advanced to `max(startTime)` across the batch and applied as `SearchRequest.timeFrom`, which maps to `start_time >= cursor` in ClickHouse. Same-timestamp exchanges get re-selected on the next tick; exchanges that share a millisecond with the cursor-setting one can never be seen. +2. Alert-instance writes and `evalState` cursor advance are not coupled transactionally. A crash between the two states produces either silent data loss (cursor advanced, instances never persisted) or duplicate instances on recovery (instances persisted, cursor not advanced). +3. The rule-configuration surface for PER_EXCHANGE admits nonsensical combinations (`reNotifyMinutes > 0`, mandatory-but-unused `perExchangeLingerSeconds`, `forDurationSeconds`) — a user following the UI defaults can build a rule that re-notifies hourly even though they want one-shot semantics. + +The product rules, agreed with the user: + +- The AlertInstance **stays FIRING** until a human acks or resolves it. No auto-resolve sweep. +- The action (webhook) **fires exactly once** per AlertInstance. +- The Inbox **contains exactly one AlertInstance** per failed exchange — never a duplicate from cursor errors, tick re-runs, or process restarts. + +"Exactly once" here is at the alerting layer — one AlertInstance, one PENDING AlertNotification per (instance × webhook binding). The HTTP dispatch that follows is still at-least-once on transient failures; that's a separate scope. + +## Non-goals + +- Auto-resolve after linger seconds. The existing spec reserved `perExchangeLingerSeconds` for this; we're explicitly dropping the field (unused + not desired). +- Resolve-on-delivery semantics. Alert stays FIRING until human intervention. +- Webhook-level idempotency / exactly-once HTTP delivery to Slack. Rare duplicate Slack messages on timeout retries are accepted; consumer-side dedup (via `alert.id` in the payload) is a template concern, not a server change. +- Any change to COUNT_IN_WINDOW mode. +- Backfilling duplicate instances already created by the existing buggy cursor in any running environment. Pre-prod, manual cleanup if needed. + +## Design + +Four focused changes. Each is small on its own; together they make PER_EXCHANGE fire exactly once per failed exchange. + +### 1. Composite cursor for cursor monotonicity + +**Current.** `evalState.lastExchangeTs` is a single ISO-8601 string. `ExchangeMatchEvaluator.evaluatePerExchange` reads it as a lower bound and advances it to `max(startTime)` in the batch. + +**New.** Replace with a composite cursor `(startTime, executionId)`, serialized as `"|"` in `evalState.lastExchangeCursor`. + +- Selection predicate: `(start_time > cursor.ts) OR (start_time = cursor.ts AND execution_id > cursor.id)`. + - This is trivially monotone: every consumed exchange is strictly-after the cursor in the composite ordering. + - Handles two exchanges at the exact same millisecond correctly — both are selected on their turn, neither re-selected. + - Uses the existing ClickHouse primary-key order `(tenant_id, start_time, …, execution_id)`, so the predicate is a range scan on the PK. +- Advance: set cursor to the `(startTime, executionId)` of the lexicographically-last row in the batch (last row when sorted by `(start_time asc, execution_id asc)`). +- First run (no cursor): today's behaviour is `cursor = null → timeFrom = null → unbounded scan of ClickHouse history` — any pre-existing FAILED exchange in retention would fire an alert on the first tick. That's broken and needs fixing. New rule: initialize `lastExchangeCursor` to `(rule.createdAt, "")` at rule creation time — so a PER_EXCHANGE rule only alerts on exchanges that fail *after* it was created. The empty-string `executionId` component is correct: any real execution_id sorts strictly after it lexicographically, so the very first matching exchange post-creation gets picked up on the first tick. No ambient lookback window, no retention dependency, no backlog flood. +- `evalState` schema change: retire the `lastExchangeTs` key, add `lastExchangeCursor`. Pre-prod; no migration needed. Readers that see neither key treat the rule as first-run. + +**Affected files (scope estimate):** +- `ExchangeMatchEvaluator.evaluatePerExchange` — cursor parse/advance/selection. +- `SearchRequest` / `ClickHouseSearchIndex.search` — needs to accept the composite predicate. Option A: add an optional `afterExecutionId` param alongside `timeFrom`. Option B: introduce a dedicated `AfterCursor(ts, id)` type. Plan phase picks one — A is simpler. +- `evalState` JSON schema (documented in alerting spec). + +### 2. Transactional coupling of instance writes + cursor advance + +**Current.** Per tick for a PER_EXCHANGE rule: +1. `applyResult` iterates the `EvalResult.Batch` firings and calls `applyBatchFiring` for each — one `AlertInstance` save + `enqueueNotifications` per firing, each its own transaction (or auto-commit). +2. After the rule loop, `reschedule(rule, nextRun)` saves the updated `evalState` + `nextRunAt` in a separate write. + +Crash anywhere between steps 1 and 2 (or partway through the loop in step 1) produces one of two inconsistent states: +- Instances saved but cursor not advanced → next tick duplicates them. +- Cursor advanced but no instances saved → those exchanges never alerted. + +**New.** Wrap the whole Batch-result processing for a single rule in one `TransactionTemplate.execute(...)`: + +``` +TX { + persist all AlertInstances for the batch + insert all PENDING AlertNotifications for those instances + update rule: evalState.lastExchangeCursor + nextRunAt +} +``` + +Commit: all three land atomically. Rollback: none do, and the rule stays claimed-but-cursor-unchanged so the next tick re-processes the same exchanges. Combined with the monotone cursor from §1, that gives exactly-once instance creation: if a batch half-succeeded and rolled back, the second attempt starts from the same cursor and produces the same set. + +Notification dispatch (`NotificationDispatchJob` picking up PENDING rows) happens outside the transaction on its own schedule — webhook I/O must never hold a DB transaction open. + +**Affected files (scope estimate):** +- `AlertEvaluatorJob.applyResult` + `applyBatchFiring` — fold into one transactional block when the result is a `Batch`. +- No change to the COUNT_IN_WINDOW path (`applyResult` for non-Batch results keeps its current semantics). +- `PostgresAlertInstanceRepository` / `PostgresAlertNotificationRepository` / `PostgresAlertRuleRepository` — existing methods usable from inside a transaction; verify no implicit auto-commit. + +### 3. Config hygiene — enforce a coherent PER_EXCHANGE rule shape + +Three knobs on the rule are wrong for PER_EXCHANGE and trap the user into buggy configurations. + +| Knob | Current state | New state for PER_EXCHANGE | +|---|---|---| +| `reNotifyMinutes` | Default 60 in UI; re-notify sweep fires every N min while FIRING | **Must be `0`.** API 400s if non-zero. UI forces to `0` and disables the input with tooltip "Per-exchange rules fire exactly once — re-notify does not apply." | +| `perExchangeLingerSeconds` | Validated as required by `ExchangeMatchCondition` compact ctor; unused anywhere in the code | **Removed.** Drop the field entirely — from the record, the compact-ctor validation, `AlertRuleRequest` DTO, form state, UI. Pre-prod; no shim. | +| `forDurationSeconds` | Applied by the state machine in the COUNT_IN_WINDOW / agent-lifecycle path | **Must be `0`/null** for PER_EXCHANGE. 400 on save if non-zero. UI hides the field when PER_EXCHANGE is selected. Evaluator path already ignores it for `Batch` results, so this is a contract-tightening at the API edge only. | + +Net effect: a PER_EXCHANGE rule's configurable surface becomes exactly `{scope, filter, severity, notification title/message, webhooks, targets}`. The user can't express an inconsistent combination. + +**Affected files (scope estimate):** +- `ExchangeMatchCondition` — remove `perExchangeLingerSeconds`. +- `AlertRuleController` / `AlertRuleRequest` — cross-field validation (reNotify + forDuration vs fireMode). +- `ui/src/pages/Alerts/RuleEditor/TriggerStep.tsx` + `form-state.ts` — disable reNotify + hide forDuration when PER_EXCHANGE; remove the linger field. +- Tests (§4). + +### 4. Tests that lock the guarantees + +Three scenarios, one per correctness property. + +**Test 1 — cursor monotonicity** (`ExchangeMatchEvaluatorTest`, unit) + +- Seed two FAILED executions with identical `start_time`, different `executionId`. +- Tick 1: both fire, batch of 2. +- Tick 2: neither fires. +- Seed a third at the same timestamp. Tick 3: that third only. + +**Test 2 — tick atomicity** (`AlertEvaluatorJobIT`, integration with real Postgres) + +- Seed 3 FAILED executions. Inject a fault on the second notification-insert. +- Tick → transaction rolls back: 0 AlertInstances, cursor unchanged, rule `nextRunAt` unchanged. +- Remove fault, tick again: 3 AlertInstances + 3 PENDING notifications, cursor advanced. + +**Test 3 — full-lifecycle exactly-once** (extends `AlertingFullLifecycleIT`) + +- PER_EXCHANGE rule, dummy webhook. +- Seed 5 FAILED executions across two ticks (3 + 2). After both ticks: exactly 5 FIRING AlertInstances, exactly 5 PENDING notifications. +- Third tick with no new executions: zero new instances, zero new notifications. +- Ack one instance: other four unchanged. +- Additionally: POST a PER_EXCHANGE rule with `reNotifyMinutes=60` via the controller → expect 400. +- Additionally: POST a PER_EXCHANGE rule with `forDurationSeconds=60` → expect 400. + +**Test 4 — first-run uses rule creation time, not unbounded history** (unit, in `ExchangeMatchEvaluatorTest`) + +- Seed 2 FAILED executions dated *before* rule creation, 1 *after*. +- Evaluate a freshly-created PER_EXCHANGE rule whose `evalState` is empty. +- Expect: exactly 1 firing (the one after creation). The pre-creation ones must not appear in the batch. + +Plus a small unit test on the new cross-field validator to isolate its logic from the IT setup. + +## Out of scope + +- **Webhook-level idempotency.** `WebhookDispatcher` still retries on 5xx / network / timeout. For Slack, that means a timeout mid-POST can produce a duplicate channel message. The consumer-side fix is to include a stable ID (e.g. `{{alert.id}}`) in the message template and drop duplicates on Slack's side — doable today via the existing Mustache editor, no server change. If in the future we want strict exactly-once HTTP delivery, that's a separate design. +- **Auto-resolve of PER_EXCHANGE instances.** Alerts stay FIRING until humans intervene. If operational experience shows the Inbox gets unwieldy, a later phase can add a manual "resolve all" bulk action or an opt-in TTL sweep. +- **Rule-level dedup of identical alerts in a short window** (e.g. "same failure signature fires twice in 5 s"). Out of scope; every failed exchange is its own event by design. +- **COUNT_IN_WINDOW changes.** Untouched. +- **Migration of existing PER_EXCHANGE rules.** Pre-prod; any existing rule using the retired `perExchangeLingerSeconds` field gets the value silently dropped by the API's unknown-property handling on next PUT, or rejected on create (new shape). If needed, a one-shot cleanup is easier than a shim. + +## Risks + +- **ClickHouse predicate performance.** The composite predicate `(start_time > ? OR (start_time = ? AND execution_id > ?))` must hit the PK range efficiently. The table PK is `(tenant_id, start_time, environment, application_id, route_id, execution_id)`, so the OR-form should be fine, but we'll verify with `EXPLAIN PIPELINE` against the IT container during plan-phase. Fallback: `(start_time, execution_id)` tuple comparison if CH has native support (`(start_time, execution_id) > (?, ?)`), which it does in recent versions. +- **Transaction size.** A single tick caps at `limit = 50` matches (existing behaviour), so the transaction holds at most 50 AlertInstance + 50 AlertNotification writes + 1 rule update. Well within safe bounds. +- **Cursor format churn.** Dropping `lastExchangeTs` in favour of `lastExchangeCursor` is a one-line `evalState` JSON change. Pre-prod; no shim needed. Any rule whose `evalState` has neither key falls back to the "first-run" behaviour — `(rule.createdAt, "")` — which is bounded by the rule's own creation time. For existing PER_EXCHANGE rules this means the cursor effectively resets to rule-creation, so exchanges that failed between the previous `lastExchangeTs` and deploy time would not re-fire. For pre-prod that's acceptable; post-prod we'd add a one-shot translate-old-key shim at startup. + +## Verification + +- `mvn -pl cameleer-server-app -am -Dit.test='ExchangeMatchEvaluatorTest,AlertEvaluatorJobIT,AlertingFullLifecycleIT,AlertRuleControllerIT' ... verify` → 0 failures. +- Manual: create a PER_EXCHANGE / FAILED rule via UI. Verify `reNotifyMinutes` is fixed at 0 and disabled. Produce failing exchanges. Verify Inbox shows one instance per exchange, Slack gets one message each. Ack one instance. Produce another failure. Verify only the new one appears.