docs(spec): PER_EXCHANGE exactly-once-per-exchange alerting

Four focused correctness fixes for the "fire exactly once per FAILED
exchange" use case (alerting layer only; HTTP-level idempotency is a
separate scope):

1. Composite cursor (startTime, executionId) replaces the current
   single-timestamp, inclusive cursor — prevents same-millisecond
   drops and same-exchange re-selection.
2. First-run cursor initialized to rule createdAt (not null) —
   prevents the current unbounded historical-retention scan on first
   tick of a new rule.
3. Transactional coupling of instance writes + notification enqueue +
   cursor advance — eliminates partial-progress failure modes on crash
   or rollback.
4. Config hygiene: reNotifyMinutes forced to 0, forDurationSeconds
   rejected, perExchangeLingerSeconds removed entirely (was validated
   as required but never read) — the rule shape stops admitting
   nonsensical PER_EXCHANGE combinations.

Alert stays FIRING until human ack/resolve (no auto-resolve); webhook
fires exactly once per AlertInstance; Inbox never sees duplicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: hsiegeln
Date: 2026-04-22 14:17:18 +02:00
Parent: e4492b10e1
Commit: 817b61058a

# PER_EXCHANGE — Exactly-Once-Per-Exchange Alerting — Design
**Date:** 2026-04-22
**Scope:** alerting layer only (webhook-delivery-level idempotency is out of scope — see "Out of scope" below).
**Preceding context:** `.planning/sse-flakiness-diagnosis.md` (unrelated); `docs/superpowers/specs/2026-04-19-alerting-design.md` (foundational alerting design — original PER_EXCHANGE intent).
## Motivation
A user wants to create an alert rule of the shape "for every exchange that ends in FAILED status, notify Slack — exactly once per exchange." Exchanges are **terminal events**: once in FAILED state they never transition back, unlike agents which toggle between LIVE / STALE / DEAD. So "exactly once" is well-defined and achievable.
Today's PER_EXCHANGE mode partially supports this but has three gaps that, in combination, either miss exchanges or flood the Inbox with duplicates:
1. The cursor `evalState.lastExchangeTs` is a single timestamp, advanced to `max(startTime)` across the batch and applied as `SearchRequest.timeFrom`, which maps to `start_time >= cursor` in ClickHouse. With that inclusive bound, same-timestamp exchanges get re-selected on the next tick; tightening it to a strict bound would instead drop any exchange that shares a millisecond with the cursor-setting one. A single-timestamp cursor cannot avoid both failure modes.
2. Alert-instance writes and the `evalState` cursor advance are not coupled transactionally. A crash between the two writes produces either silent data loss (cursor advanced, instances never persisted) or duplicate instances on recovery (instances persisted, cursor not advanced).
3. The rule-configuration surface for PER_EXCHANGE admits nonsensical combinations (`reNotifyMinutes > 0`, mandatory-but-unused `perExchangeLingerSeconds`, `forDurationSeconds`) — a user following the UI defaults can build a rule that re-notifies hourly even though they want one-shot semantics.
The product rules, agreed with the user:
- The AlertInstance **stays FIRING** until a human acks or resolves it. No auto-resolve sweep.
- The action (webhook) **fires exactly once** per AlertInstance.
- The Inbox **contains exactly one AlertInstance** per failed exchange — never a duplicate from cursor errors, tick re-runs, or process restarts.
"Exactly once" here is at the alerting layer — one AlertInstance, one PENDING AlertNotification per (instance × webhook binding). The HTTP dispatch that follows is still at-least-once on transient failures; that's a separate scope.
## Non-goals
- Auto-resolve after linger seconds. The existing spec reserved `perExchangeLingerSeconds` for this; we're explicitly dropping the field (unused + not desired).
- Resolve-on-delivery semantics. Alert stays FIRING until human intervention.
- Webhook-level idempotency / exactly-once HTTP delivery to Slack. Rare duplicate Slack messages on timeout retries are accepted; consumer-side dedup (via `alert.id` in the payload) is a template concern, not a server change.
- Any change to COUNT_IN_WINDOW mode.
- Backfilling duplicate instances already created by the existing buggy cursor in any running environment. Pre-prod, manual cleanup if needed.
## Design
Four focused changes. Each is small on its own; together they make PER_EXCHANGE fire exactly once per failed exchange.
### 1. Composite cursor for cursor monotonicity
**Current.** `evalState.lastExchangeTs` is a single ISO-8601 string. `ExchangeMatchEvaluator.evaluatePerExchange` reads it as a lower bound and advances it to `max(startTime)` in the batch.
**New.** Replace with a composite cursor `(startTime, executionId)`, serialized as `"<ISO-8601 startTime>|<executionId>"` in `evalState.lastExchangeCursor`.
- Selection predicate: `(start_time > cursor.ts) OR (start_time = cursor.ts AND execution_id > cursor.id)`.
- This is trivially monotone: every consumed exchange is strictly-after the cursor in the composite ordering.
- Handles two exchanges at the exact same millisecond correctly — both are selected on their turn, neither re-selected.
- Uses the existing ClickHouse primary-key order `(tenant_id, start_time, …, execution_id)`, so the predicate is a range scan on the PK.
- Advance: set cursor to the `(startTime, executionId)` of the lexicographically-last row in the batch (last row when sorted by `(start_time asc, execution_id asc)`).
- First run (no cursor): today's behaviour is `cursor = null → timeFrom = null → unbounded scan of ClickHouse history` — any pre-existing FAILED exchange in retention would fire an alert on the first tick. That's broken and needs fixing.
- New first-run rule: initialize `lastExchangeCursor` to `(rule.createdAt, "")` at rule creation time — so a PER_EXCHANGE rule only alerts on exchanges that fail *after* it was created. The empty-string `executionId` component is correct: any real execution_id sorts strictly after it lexicographically, so the very first matching exchange post-creation gets picked up on the first tick. No ambient lookback window, no retention dependency, no backlog flood.
- `evalState` schema change: retire the `lastExchangeTs` key, add `lastExchangeCursor`. Pre-prod; no migration needed. Readers that see neither key treat the rule as first-run.
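The composite cursor above can be sketched as a small value type. This is an illustrative shape, not the actual class in the codebase — `ExchangeCursor` and its method names are assumptions; only the serialization format, the first-run rule, and the strict-after predicate come from this spec.

```java
import java.time.Instant;

// Hypothetical composite cursor: (startTime, executionId), serialized as
// "<ISO-8601 startTime>|<executionId>" per the evalState format above.
record ExchangeCursor(Instant ts, String executionId) {

    String serialize() {
        return ts.toString() + "|" + executionId;
    }

    static ExchangeCursor parse(String s) {
        int sep = s.indexOf('|');
        return new ExchangeCursor(Instant.parse(s.substring(0, sep)), s.substring(sep + 1));
    }

    // First-run cursor: bounded by the rule's creation time. The empty
    // executionId sorts strictly before any real id, so the first matching
    // post-creation exchange is selected on the first tick.
    static ExchangeCursor firstRun(Instant ruleCreatedAt) {
        return new ExchangeCursor(ruleCreatedAt, "");
    }

    // Selection predicate from this section:
    // (start_time > cursor.ts) OR (start_time = cursor.ts AND execution_id > cursor.id)
    boolean isAfter(Instant startTime, String executionId) {
        int c = startTime.compareTo(ts);
        return c > 0 || (c == 0 && executionId.compareTo(this.executionId) > 0);
    }
}
```

Advancing the cursor to the last row of the batch (sorted by `(start_time asc, execution_id asc)`) keeps this ordering monotone: every consumed exchange is strictly after the cursor, and two exchanges in the same millisecond are distinguished by `executionId`.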
**Affected files (scope estimate):**
- `ExchangeMatchEvaluator.evaluatePerExchange` — cursor parse/advance/selection.
- `SearchRequest` / `ClickHouseSearchIndex.search` — needs to accept the composite predicate. Option A: add an optional `afterExecutionId` param alongside `timeFrom`. Option B: introduce a dedicated `AfterCursor(ts, id)` type. Plan phase picks one — A is simpler.
- `evalState` JSON schema (documented in alerting spec).
### 2. Transactional coupling of instance writes + cursor advance
**Current.** Per tick for a PER_EXCHANGE rule:
1. `applyResult` iterates the `EvalResult.Batch` firings and calls `applyBatchFiring` for each — one `AlertInstance` save + `enqueueNotifications` per firing, each its own transaction (or auto-commit).
2. After the rule loop, `reschedule(rule, nextRun)` saves the updated `evalState` + `nextRunAt` in a separate write.
Crash anywhere between steps 1 and 2 (or partway through the loop in step 1) produces one of two inconsistent states:
- Instances saved but cursor not advanced → next tick duplicates them.
- Cursor advanced but no instances saved → those exchanges never alerted.
**New.** Wrap the whole Batch-result processing for a single rule in one `TransactionTemplate.execute(...)`:
```
TX {
  persist all AlertInstances for the batch
  insert all PENDING AlertNotifications for those instances
  update rule: evalState.lastExchangeCursor + nextRunAt
}
```
Commit: all three land atomically. Rollback: none do, and the rule stays claimed-but-cursor-unchanged so the next tick re-processes the same exchanges. Combined with the monotone cursor from §1, that gives exactly-once instance creation: if a batch half-succeeded and rolled back, the second attempt starts from the same cursor and produces the same set.
Notification dispatch (`NotificationDispatchJob` picking up PENDING rows) happens outside the transaction on its own schedule — webhook I/O must never hold a DB transaction open.
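The exactly-once argument — rollback discards staged work *and* the cursor advance, so a retry from the same cursor reproduces the same set — can be simulated in plain Java. This is a minimal sketch with no Spring or database; the class and an integer cursor stand in for the real transaction and the `(startTime, executionId)` composite.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simulation of one PER_EXCHANGE tick: instances and cursor commit together,
// or a mid-batch fault rolls back both.
class TickSimulation {
    final List<String> exchanges = List.of("e1", "e2", "e3"); // FAILED exchanges, cursor order
    final Set<String> instances = new LinkedHashSet<>();      // "persisted" AlertInstances
    int cursor = 0;                                           // index stand-in for (ts, executionId)

    void tick(boolean faultMidBatch) {
        // Stage all work locally, like uncommitted writes inside the TX.
        Set<String> staged = new LinkedHashSet<>();
        int stagedCursor = cursor;
        for (int i = cursor; i < exchanges.size(); i++) {
            if (faultMidBatch && i == cursor + 1) return; // rollback: staged work discarded
            staged.add(exchanges.get(i));
            stagedCursor = i + 1;
        }
        // Commit: instances and cursor advance land atomically.
        instances.addAll(staged);
        cursor = stagedCursor;
    }
}
```

A faulted tick leaves zero instances and the cursor unchanged; the retry produces exactly the original batch, and a further tick with no new exchanges produces nothing — the three properties the tests in §4 pin down.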
**Affected files (scope estimate):**
- `AlertEvaluatorJob.applyResult` + `applyBatchFiring` — fold into one transactional block when the result is a `Batch`.
- No change to the COUNT_IN_WINDOW path (`applyResult` for non-Batch results keeps its current semantics).
- `PostgresAlertInstanceRepository` / `PostgresAlertNotificationRepository` / `PostgresAlertRuleRepository` — existing methods usable from inside a transaction; verify no implicit auto-commit.
### 3. Config hygiene — enforce a coherent PER_EXCHANGE rule shape
Three knobs on the rule are wrong for PER_EXCHANGE and trap the user into buggy configurations.
| Knob | Current state | New state for PER_EXCHANGE |
|---|---|---|
| `reNotifyMinutes` | Default 60 in UI; re-notify sweep fires every N min while FIRING | **Must be `0`.** API 400s if non-zero. UI forces to `0` and disables the input with tooltip "Per-exchange rules fire exactly once — re-notify does not apply." |
| `perExchangeLingerSeconds` | Validated as required by `ExchangeMatchCondition` compact ctor; unused anywhere in the code | **Removed.** Drop the field entirely — from the record, the compact-ctor validation, `AlertRuleRequest` DTO, form state, UI. Pre-prod; no shim. |
| `forDurationSeconds` | Applied by the state machine in the COUNT_IN_WINDOW / agent-lifecycle path | **Must be `0`/null** for PER_EXCHANGE. 400 on save if non-zero. UI hides the field when PER_EXCHANGE is selected. Evaluator path already ignores it for `Batch` results, so this is a contract-tightening at the API edge only. |
Net effect: a PER_EXCHANGE rule's configurable surface becomes exactly `{scope, filter, severity, notification title/message, webhooks, targets}`. The user can't express an inconsistent combination.
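The cross-field validation can be sketched as a pure function over the three knobs. The class name, the string `fireMode` discriminator, and the error messages are illustrative assumptions — the real check lives in the controller/DTO layer listed below; only the 400 conditions come from the table above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator for the PER_EXCHANGE rule shape: returns the list of
// violations that would make the API respond 400.
final class PerExchangeRuleValidator {
    static List<String> validate(String fireMode, int reNotifyMinutes, Integer forDurationSeconds) {
        List<String> errors = new ArrayList<>();
        if (!"PER_EXCHANGE".equals(fireMode)) return errors; // other modes unaffected
        if (reNotifyMinutes != 0) {
            errors.add("reNotifyMinutes must be 0 for PER_EXCHANGE rules");
        }
        if (forDurationSeconds != null && forDurationSeconds != 0) {
            errors.add("forDurationSeconds must be 0 or unset for PER_EXCHANGE rules");
        }
        return errors;
    }
}
```

(`perExchangeLingerSeconds` needs no check here — the field is removed from the record and DTO outright, so an unknown-property request simply fails to bind or is dropped per the existing handling.)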
**Affected files (scope estimate):**
- `ExchangeMatchCondition` — remove `perExchangeLingerSeconds`.
- `AlertRuleController` / `AlertRuleRequest` — cross-field validation (reNotify + forDuration vs fireMode).
- `ui/src/pages/Alerts/RuleEditor/TriggerStep.tsx` + `form-state.ts` — disable reNotify + hide forDuration when PER_EXCHANGE; remove the linger field.
- Tests (§4).
### 4. Tests that lock the guarantees
Three scenarios, one per correctness property.
**Test 1 — cursor monotonicity** (`ExchangeMatchEvaluatorTest`, unit)
- Seed two FAILED executions with identical `start_time`, different `executionId`.
- Tick 1: both fire, batch of 2.
- Tick 2: neither fires.
- Seed a third at the same timestamp. Tick 3: that third only.
**Test 2 — tick atomicity** (`AlertEvaluatorJobIT`, integration with real Postgres)
- Seed 3 FAILED executions. Inject a fault on the second notification-insert.
- Tick → transaction rolls back: 0 AlertInstances, cursor unchanged, rule `nextRunAt` unchanged.
- Remove fault, tick again: 3 AlertInstances + 3 PENDING notifications, cursor advanced.
**Test 3 — full-lifecycle exactly-once** (extends `AlertingFullLifecycleIT`)
- PER_EXCHANGE rule, dummy webhook.
- Seed 5 FAILED executions across two ticks (3 + 2). After both ticks: exactly 5 FIRING AlertInstances, exactly 5 PENDING notifications.
- Third tick with no new executions: zero new instances, zero new notifications.
- Ack one instance: other four unchanged.
- Additionally: POST a PER_EXCHANGE rule with `reNotifyMinutes=60` via the controller → expect 400.
- Additionally: POST a PER_EXCHANGE rule with `forDurationSeconds=60` → expect 400.
**Test 4 — first-run uses rule creation time, not unbounded history** (unit, in `ExchangeMatchEvaluatorTest`)
- Seed 2 FAILED executions dated *before* rule creation, 1 *after*.
- Evaluate a freshly-created PER_EXCHANGE rule whose `evalState` is empty.
- Expect: exactly 1 firing (the one after creation). The pre-creation ones must not appear in the batch.
Plus a small unit test on the new cross-field validator to isolate its logic from the IT setup.
## Out of scope
- **Webhook-level idempotency.** `WebhookDispatcher` still retries on 5xx / network / timeout. For Slack, that means a timeout mid-POST can produce a duplicate channel message. The consumer-side fix is to include a stable ID (e.g. `{{alert.id}}`) in the message template and drop duplicates on Slack's side — doable today via the existing Mustache editor, no server change. If in the future we want strict exactly-once HTTP delivery, that's a separate design.
- **Auto-resolve of PER_EXCHANGE instances.** Alerts stay FIRING until humans intervene. If operational experience shows the Inbox gets unwieldy, a later phase can add a manual "resolve all" bulk action or an opt-in TTL sweep.
- **Rule-level dedup of identical alerts in a short window** (e.g. "same failure signature fires twice in 5 s"). Out of scope; every failed exchange is its own event by design.
- **COUNT_IN_WINDOW changes.** Untouched.
- **Migration of existing PER_EXCHANGE rules.** Pre-prod; any existing rule using the retired `perExchangeLingerSeconds` field gets the value silently dropped by the API's unknown-property handling on next PUT, or rejected on create (new shape). If needed, a one-shot cleanup is easier than a shim.
## Risks
- **ClickHouse predicate performance.** The composite predicate `(start_time > ? OR (start_time = ? AND execution_id > ?))` must hit the PK range efficiently. The table PK is `(tenant_id, start_time, environment, application_id, route_id, execution_id)`, so the OR-form should be fine, but we'll verify with `EXPLAIN PIPELINE` against the IT container during plan-phase. Fallback: `(start_time, execution_id)` tuple comparison if CH has native support (`(start_time, execution_id) > (?, ?)`), which it does in recent versions.
- **Transaction size.** A single tick caps at `limit = 50` matches (existing behaviour), so the transaction holds at most 50 AlertInstance + 50 AlertNotification writes + 1 rule update. Well within safe bounds.
- **Cursor format churn.** Dropping `lastExchangeTs` in favour of `lastExchangeCursor` is a one-line `evalState` JSON change. Pre-prod; no shim needed. Any rule whose `evalState` has neither key falls back to the "first-run" behaviour — `(rule.createdAt, "")` — which is bounded by the rule's own creation time. For existing PER_EXCHANGE rules this means the cursor effectively resets to rule-creation, so exchanges that failed between the previous `lastExchangeTs` and deploy time would not re-fire. For pre-prod that's acceptable; post-prod we'd add a one-shot translate-old-key shim at startup.
## Verification
- `mvn -pl cameleer-server-app -am -Dit.test='ExchangeMatchEvaluatorTest,AlertEvaluatorJobIT,AlertingFullLifecycleIT,AlertRuleControllerIT' ... verify` → 0 failures.
- Manual: create a PER_EXCHANGE / FAILED rule via UI. Verify `reNotifyMinutes` is fixed at 0 and disabled. Produce failing exchanges. Verify Inbox shows one instance per exchange, Slack gets one message each. Ack one instance. Produce another failure. Verify only the new one appears.