cameleer-server/docs/superpowers/specs/2026-04-22-per-exchange-exactly-once-design.md
hsiegeln 817b61058a docs(spec): PER_EXCHANGE exactly-once-per-exchange alerting
Four focused correctness fixes for the "fire exactly once per FAILED
exchange" use case (alerting layer only; HTTP-level idempotency is a
separate scope):

1. Composite cursor (startTime, executionId) replaces the current
   single-timestamp, inclusive cursor — prevents same-millisecond
   drops and same-exchange re-selection.
2. First-run cursor initialized to rule createdAt (not null) —
   prevents the current unbounded historical-retention scan on first
   tick of a new rule.
3. Transactional coupling of instance writes + notification enqueue +
   cursor advance — eliminates partial-progress failure modes on crash
   or rollback.
4. Config hygiene: reNotifyMinutes forced to 0, forDurationSeconds
   rejected, perExchangeLingerSeconds removed entirely (was validated
   as required but never read) — the rule shape stops admitting
   nonsensical PER_EXCHANGE combinations.

Alert stays FIRING until human ack/resolve (no auto-resolve); webhook
fires exactly once per AlertInstance; Inbox never sees duplicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 14:17:18 +02:00


# PER_EXCHANGE — Exactly-Once-Per-Exchange Alerting — Design
**Date:** 2026-04-22
**Scope:** alerting layer only (webhook-delivery-level idempotency is out of scope — see "Out of scope" below).
**Preceding context:** `.planning/sse-flakiness-diagnosis.md` (unrelated); `docs/superpowers/specs/2026-04-19-alerting-design.md` (foundational alerting design — original PER_EXCHANGE intent).
## Motivation
A user wants to create an alert rule of the shape "for every exchange that ends in FAILED status, notify Slack — exactly once per exchange." Exchanges are **terminal events**: once in FAILED state they never transition back, unlike agents which toggle between LIVE / STALE / DEAD. So "exactly once" is well-defined and achievable.
Today's PER_EXCHANGE mode partially supports this but has three gaps that, in combination, either miss exchanges or flood the Inbox with duplicates:
1. The cursor `evalState.lastExchangeTs` is advanced to `max(startTime)` across the batch and applied as `SearchRequest.timeFrom`, which maps to `start_time >= cursor` in ClickHouse. The inclusive bound re-selects already-seen same-timestamp exchanges on the next tick; the naive fix of a strict `start_time > cursor` would instead permanently drop any exchange that shares a millisecond with the cursor-setting one. A single-timestamp cursor cannot avoid both failure modes.
2. Alert-instance writes and `evalState` cursor advance are not coupled transactionally. A crash between the two writes produces either silent data loss (cursor advanced, instances never persisted) or duplicate instances on recovery (instances persisted, cursor not advanced).
3. The rule-configuration surface for PER_EXCHANGE admits nonsensical combinations (`reNotifyMinutes > 0`, mandatory-but-unused `perExchangeLingerSeconds`, `forDurationSeconds`) — a user following the UI defaults can build a rule that re-notifies hourly even though they want one-shot semantics.
The product rules, agreed with the user:
- The AlertInstance **stays FIRING** until a human acks or resolves it. No auto-resolve sweep.
- The action (webhook) **fires exactly once** per AlertInstance.
- The Inbox **contains exactly one AlertInstance** per failed exchange — never a duplicate from cursor errors, tick re-runs, or process restarts.
"Exactly once" here is at the alerting layer — one AlertInstance, one PENDING AlertNotification per (instance × webhook binding). The HTTP dispatch that follows is still at-least-once on transient failures; that's a separate scope.
## Non-goals
- Auto-resolve after linger seconds. The existing spec reserved `perExchangeLingerSeconds` for this; we're explicitly dropping the field (unused + not desired).
- Resolve-on-delivery semantics. Alert stays FIRING until human intervention.
- Webhook-level idempotency / exactly-once HTTP delivery to Slack. Rare duplicate Slack messages on timeout retries are accepted; consumer-side dedup (via `alert.id` in the payload) is a template concern, not a server change.
- Any change to COUNT_IN_WINDOW mode.
- Backfilling duplicate instances already created by the existing buggy cursor in any running environment. Pre-prod, manual cleanup if needed.
## Design
Four focused changes. Each is small on its own; together they make PER_EXCHANGE fire exactly once per failed exchange.
### 1. Composite cursor for monotone selection
**Current.** `evalState.lastExchangeTs` is a single ISO-8601 string. `ExchangeMatchEvaluator.evaluatePerExchange` reads it as a lower bound and advances it to `max(startTime)` in the batch.
**New.** Replace with a composite cursor `(startTime, executionId)`, serialized as `"<ISO-8601 startTime>|<executionId>"` in `evalState.lastExchangeCursor`.
- Selection predicate: `(start_time > cursor.ts) OR (start_time = cursor.ts AND execution_id > cursor.id)`.
- This is trivially monotone: every consumed exchange is strictly after the cursor in the composite ordering.
- Handles two exchanges at the exact same millisecond correctly — both are selected on their turn, neither re-selected.
- Uses the existing ClickHouse primary-key order `(tenant_id, start_time, …, execution_id)`, so the predicate is a range scan on the PK.
- Advance: set cursor to the `(startTime, executionId)` of the lexicographically-last row in the batch (last row when sorted by `(start_time asc, execution_id asc)`).
- First run (no cursor): today's behaviour is `cursor = null → timeFrom = null → unbounded scan of ClickHouse history` — any pre-existing FAILED exchange in retention would fire an alert on the first tick. That's broken and needs fixing. New rule: initialize `lastExchangeCursor` to `(rule.createdAt, "")` at rule creation time — so a PER_EXCHANGE rule only alerts on exchanges that fail *after* it was created. The empty-string `executionId` component is correct: any real execution_id sorts strictly after it lexicographically, so the very first matching exchange post-creation gets picked up on the first tick. No ambient lookback window, no retention dependency, no backlog flood.
- `evalState` schema change: retire the `lastExchangeTs` key, add `lastExchangeCursor`. Pre-prod; no migration needed. Readers that see neither key treat the rule as first-run.
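The cursor mechanics above can be sketched in plain Java. This is illustrative only: the type name `ExchangeCursor` and its methods are invented here, while the serialization format, the first-run rule, and the ordering follow the spec text. It assumes `executionId` never contains `|`.

```java
import java.time.Instant;
import java.util.Comparator;

// Hypothetical sketch of the composite cursor; not the final API.
record ExchangeCursor(Instant ts, String executionId) {

    // Composite order behind the selection predicate:
    // (start_time > ts) OR (start_time = ts AND execution_id > id).
    static final Comparator<ExchangeCursor> ORDER =
            Comparator.comparing(ExchangeCursor::ts)
                    .thenComparing(ExchangeCursor::executionId);

    // Serialized form stored in evalState.lastExchangeCursor.
    String serialize() {
        return ts + "|" + executionId;
    }

    static ExchangeCursor parse(String raw) {
        int sep = raw.indexOf('|');
        return new ExchangeCursor(Instant.parse(raw.substring(0, sep)),
                raw.substring(sep + 1));
    }

    // First-run cursor: rule creation time plus an empty executionId, so any
    // real execution_id at the same instant still sorts strictly after it.
    static ExchangeCursor firstRun(Instant ruleCreatedAt) {
        return new ExchangeCursor(ruleCreatedAt, "");
    }

    // True when an exchange is strictly after this cursor and should fire.
    boolean admits(Instant startTime, String executionId) {
        return ORDER.compare(this, new ExchangeCursor(startTime, executionId)) < 0;
    }
}
```

Note how the empty-string component makes `firstRun` admit an exchange whose `start_time` equals `rule.createdAt` exactly, which is the intended behaviour.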
**Affected files (scope estimate):**
- `ExchangeMatchEvaluator.evaluatePerExchange` — cursor parse/advance/selection.
- `SearchRequest` / `ClickHouseSearchIndex.search` — needs to accept the composite predicate. Option A: add an optional `afterExecutionId` param alongside `timeFrom`. Option B: introduce a dedicated `AfterCursor(ts, id)` type. Plan phase picks one — A is simpler.
- `evalState` JSON schema (documented in alerting spec).
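Under Option A, the composite predicate might be rendered into the query roughly as below. This is a sketch only: the helper name and parameter markers are invented, and the real plumbing would live in `ClickHouseSearchIndex.search`.

```java
// Hypothetical helper: renders the composite strictly-after predicate as a
// WHERE fragment; tsParam/idParam are bound to cursor.ts and cursor.executionId.
final class CursorSql {
    static String cursorWhere(String tsParam, String idParam) {
        return "(start_time > " + tsParam
                + " OR (start_time = " + tsParam
                + " AND execution_id > " + idParam + "))";
    }
}
```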
### 2. Transactional coupling of instance writes + cursor advance
**Current.** Per tick for a PER_EXCHANGE rule:
1. `applyResult` iterates the `EvalResult.Batch` firings and calls `applyBatchFiring` for each — one `AlertInstance` save + `enqueueNotifications` per firing, each its own transaction (or auto-commit).
2. After the rule loop, `reschedule(rule, nextRun)` saves the updated `evalState` + `nextRunAt` in a separate write.
Crash anywhere between steps 1 and 2 (or partway through the loop in step 1) produces one of two inconsistent states:
- Instances saved but cursor not advanced → next tick duplicates them.
- Cursor advanced but no instances saved → those exchanges never alerted.
**New.** Wrap the whole Batch-result processing for a single rule in one `TransactionTemplate.execute(...)`:
```
TX {
  persist all AlertInstances for the batch
  insert all PENDING AlertNotifications for those instances
  update rule: evalState.lastExchangeCursor + nextRunAt
}
```
Commit: all three land atomically. Rollback: none do, and the rule stays claimed-but-cursor-unchanged so the next tick re-processes the same exchanges. Combined with the monotone cursor from §1, that gives exactly-once instance creation: if a batch half-succeeded and rolled back, the second attempt starts from the same cursor and produces the same set.
Notification dispatch (`NotificationDispatchJob` picking up PENDING rows) happens outside the transaction on its own schedule — webhook I/O must never hold a DB transaction open.
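The retry-safety argument can be simulated in a few lines of plain Java. This is a minimal sketch with invented names (the real code uses Spring's `TransactionTemplate` and the repositories listed below): writes are staged and only applied at the commit point, so a fault mid-batch leaves instances and cursor untouched, and the retry starts from the same cursor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the per-rule batch transaction.
final class TickTx {
    final List<String> committedInstances = new ArrayList<>();
    String committedCursor = null;

    // Runs one tick; a fault before the commit point rolls everything back.
    boolean runTick(List<String> batch, String newCursor, int faultAtIndex) {
        List<String> staged = new ArrayList<>();
        for (int i = 0; i < batch.size(); i++) {
            if (i == faultAtIndex) return false;  // crash/rollback: nothing applied
            staged.add(batch.get(i));             // instance + PENDING notification
        }
        committedInstances.addAll(staged);        // commit point: all-or-nothing
        committedCursor = newCursor;              // cursor advances with the batch
        return true;
    }
}
```

A fault on the second write commits nothing; the retry from the unchanged cursor produces exactly the same set, which is the exactly-once property §1 and §2 combine to give.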
**Affected files (scope estimate):**
- `AlertEvaluatorJob.applyResult` + `applyBatchFiring` — fold into one transactional block when the result is a `Batch`.
- No change to the COUNT_IN_WINDOW path (`applyResult` for non-Batch results keeps its current semantics).
- `PostgresAlertInstanceRepository` / `PostgresAlertNotificationRepository` / `PostgresAlertRuleRepository` — existing methods usable from inside a transaction; verify no implicit auto-commit.
### 3. Config hygiene — enforce a coherent PER_EXCHANGE rule shape
Three knobs on the rule are wrong for PER_EXCHANGE and trap the user into buggy configurations.
| Knob | Current state | New state for PER_EXCHANGE |
|---|---|---|
| `reNotifyMinutes` | Default 60 in UI; re-notify sweep fires every N min while FIRING | **Must be `0`.** API 400s if non-zero. UI forces to `0` and disables the input with tooltip "Per-exchange rules fire exactly once — re-notify does not apply." |
| `perExchangeLingerSeconds` | Validated as required by `ExchangeMatchCondition` compact ctor; unused anywhere in the code | **Removed.** Drop the field entirely — from the record, the compact-ctor validation, `AlertRuleRequest` DTO, form state, UI. Pre-prod; no shim. |
| `forDurationSeconds` | Applied by the state machine in the COUNT_IN_WINDOW / agent-lifecycle path | **Must be `0`/null** for PER_EXCHANGE. 400 on save if non-zero. UI hides the field when PER_EXCHANGE is selected. Evaluator path already ignores it for `Batch` results, so this is a contract-tightening at the API edge only. |
Net effect: a PER_EXCHANGE rule's configurable surface becomes exactly `{scope, filter, severity, notification title/message, webhooks, targets}`. The user can't express an inconsistent combination.
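The cross-field check might look like the sketch below. Names are hypothetical (the real validation sits in `AlertRuleController` / `AlertRuleRequest`); only the rules in the table above are encoded.

```java
// Hypothetical validator for the PER_EXCHANGE rule shape: returns an error
// message for an incoherent combination, or null when the shape is valid.
final class PerExchangeRuleValidator {
    static String validate(int reNotifyMinutes, Integer forDurationSeconds) {
        if (reNotifyMinutes != 0) {
            return "reNotifyMinutes must be 0 for PER_EXCHANGE rules";
        }
        if (forDurationSeconds != null && forDurationSeconds != 0) {
            return "forDurationSeconds must be 0 or null for PER_EXCHANGE rules";
        }
        return null; // coherent one-shot shape
    }
}
```

`perExchangeLingerSeconds` needs no check here because the field no longer exists on the request DTO.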
**Affected files (scope estimate):**
- `ExchangeMatchCondition` — remove `perExchangeLingerSeconds`.
- `AlertRuleController` / `AlertRuleRequest` — cross-field validation (reNotify + forDuration vs fireMode).
- `ui/src/pages/Alerts/RuleEditor/TriggerStep.tsx` + `form-state.ts` — disable reNotify + hide forDuration when PER_EXCHANGE; remove the linger field.
- Tests (§4).
### 4. Tests that lock the guarantees
Four scenarios, one per correctness property.
**Test 1 — cursor monotonicity** (`ExchangeMatchEvaluatorTest`, unit)
- Seed two FAILED executions with identical `start_time`, different `executionId`.
- Tick 1: both fire, batch of 2.
- Tick 2: neither fires.
- Seed a third at the same timestamp. Tick 3: that third only.
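The scenario above can be checked against an in-memory simulation of the cursor logic. This is a sketch with invented types (the real test drives `ExchangeMatchEvaluator` against seeded ClickHouse rows): fire everything strictly after the cursor in `(start, id)` order, then advance the cursor to the last fired exchange.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for a matched exchange and the evaluator's cursor.
record Exchange(Instant start, String id) {}

final class CursorSim {
    Instant ts;
    String id;

    CursorSim(Instant ts, String id) { this.ts = ts; this.id = id; }

    // One tick: select strictly-after exchanges, advance past the last one.
    List<Exchange> tick(List<Exchange> all) {
        List<Exchange> fired = all.stream()
                .filter(e -> e.start().compareTo(ts) > 0
                        || (e.start().equals(ts) && e.id().compareTo(id) > 0))
                .sorted(Comparator.comparing(Exchange::start).thenComparing(Exchange::id))
                .collect(Collectors.toList());
        if (!fired.isEmpty()) {
            Exchange last = fired.get(fired.size() - 1);
            ts = last.start();
            id = last.id();
        }
        return fired;
    }
}
```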
**Test 2 — tick atomicity** (`AlertEvaluatorJobIT`, integration with real Postgres)
- Seed 3 FAILED executions. Inject a fault on the second notification-insert.
- Tick → transaction rolls back: 0 AlertInstances, cursor unchanged, rule `nextRunAt` unchanged.
- Remove fault, tick again: 3 AlertInstances + 3 PENDING notifications, cursor advanced.
**Test 3 — full-lifecycle exactly-once** (extends `AlertingFullLifecycleIT`)
- PER_EXCHANGE rule, dummy webhook.
- Seed 5 FAILED executions across two ticks (3 + 2). After both ticks: exactly 5 FIRING AlertInstances, exactly 5 PENDING notifications.
- Third tick with no new executions: zero new instances, zero new notifications.
- Ack one instance: other four unchanged.
- Additionally: POST a PER_EXCHANGE rule with `reNotifyMinutes=60` via the controller → expect 400.
- Additionally: POST a PER_EXCHANGE rule with `forDurationSeconds=60` → expect 400.
**Test 4 — first-run uses rule creation time, not unbounded history** (unit, in `ExchangeMatchEvaluatorTest`)
- Seed 2 FAILED executions dated *before* rule creation, 1 *after*.
- Evaluate a freshly-created PER_EXCHANGE rule whose `evalState` is empty.
- Expect: exactly 1 firing (the one after creation). The pre-creation ones must not appear in the batch.
Plus a small unit test on the new cross-field validator to isolate its logic from the IT setup.
## Out of scope
- **Webhook-level idempotency.** `WebhookDispatcher` still retries on 5xx / network / timeout. For Slack, that means a timeout mid-POST can produce a duplicate channel message. The consumer-side fix is to include a stable ID (e.g. `{{alert.id}}`) in the message template and drop duplicates on Slack's side — doable today via the existing Mustache editor, no server change. If in the future we want strict exactly-once HTTP delivery, that's a separate design.
- **Auto-resolve of PER_EXCHANGE instances.** Alerts stay FIRING until humans intervene. If operational experience shows the Inbox gets unwieldy, a later phase can add a manual "resolve all" bulk action or an opt-in TTL sweep.
- **Rule-level dedup of identical alerts in a short window** (e.g. "same failure signature fires twice in 5 s"). Out of scope; every failed exchange is its own event by design.
- **COUNT_IN_WINDOW changes.** Untouched.
- **Migration of existing PER_EXCHANGE rules.** Pre-prod; any existing rule using the retired `perExchangeLingerSeconds` field gets the value silently dropped by the API's unknown-property handling on next PUT, or rejected on create (new shape). If needed, a one-shot cleanup is easier than a shim.
## Risks
- **ClickHouse predicate performance.** The composite predicate `(start_time > ? OR (start_time = ? AND execution_id > ?))` must hit the PK range efficiently. The table PK is `(tenant_id, start_time, environment, application_id, route_id, execution_id)`, so the OR-form should be fine, but we'll verify with `EXPLAIN PIPELINE` against the IT container during plan-phase. Fallback: `(start_time, execution_id)` tuple comparison if CH has native support (`(start_time, execution_id) > (?, ?)`), which it does in recent versions.
- **Transaction size.** A single tick caps at `limit = 50` matches (existing behaviour), so the transaction holds at most 50 AlertInstance + 50 AlertNotification writes + 1 rule update. Well within safe bounds.
- **Cursor format churn.** Dropping `lastExchangeTs` in favour of `lastExchangeCursor` is a one-line `evalState` JSON change. Pre-prod; no shim needed. Any rule whose `evalState` has neither key falls back to the "first-run" behaviour — `(rule.createdAt, "")` — which is bounded by the rule's own creation time. For existing PER_EXCHANGE rules this means the cursor effectively resets to rule-creation, so exchanges that failed between the previous `lastExchangeTs` and deploy time would not re-fire. For pre-prod that's acceptable; post-prod we'd add a one-shot translate-old-key shim at startup.
## Verification
- `mvn -pl cameleer-server-app -am -Dit.test='ExchangeMatchEvaluatorTest,AlertEvaluatorJobIT,AlertingFullLifecycleIT,AlertRuleControllerIT' ... verify` → 0 failures.
- Manual: create a PER_EXCHANGE / FAILED rule via UI. Verify `reNotifyMinutes` is fixed at 0 and disabled. Produce failing exchanges. Verify Inbox shows one instance per exchange, Slack gets one message each. Ack one instance. Produce another failure. Verify only the new one appears.