cameleer-server/docs/superpowers/specs/2026-04-22-per-exchange-exactly-once-design.md
hsiegeln 817b61058a docs(spec): PER_EXCHANGE exactly-once-per-exchange alerting
Four focused correctness fixes for the "fire exactly once per FAILED
exchange" use case (alerting layer only; HTTP-level idempotency is a
separate scope):

1. Composite cursor (startTime, executionId) replaces the current
   single-timestamp, inclusive cursor — prevents same-millisecond
   drops and same-exchange re-selection.
2. First-run cursor initialized to rule createdAt (not null) —
   prevents the current unbounded historical-retention scan on first
   tick of a new rule.
3. Transactional coupling of instance writes + notification enqueue +
   cursor advance — eliminates partial-progress failure modes on crash
   or rollback.
4. Config hygiene: reNotifyMinutes forced to 0, forDurationSeconds
   rejected, perExchangeLingerSeconds removed entirely (was validated
   as required but never read) — the rule shape stops admitting
   nonsensical PER_EXCHANGE combinations.

Alert stays FIRING until human ack/resolve (no auto-resolve); webhook
fires exactly once per AlertInstance; Inbox never sees duplicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 14:17:18 +02:00


# PER_EXCHANGE — Exactly-Once-Per-Exchange Alerting — Design
**Date:** 2026-04-22
**Scope:** alerting layer only (webhook-delivery-level idempotency is out of scope — see "Out of scope" below).
**Preceding context:** `.planning/sse-flakiness-diagnosis.md` (unrelated); `docs/superpowers/specs/2026-04-19-alerting-design.md` (foundational alerting design — original PER_EXCHANGE intent).
## Motivation
A user wants to create an alert rule of the shape "for every exchange that ends in FAILED status, notify Slack — exactly once per exchange." Exchanges are **terminal events**: once in FAILED state they never transition back, unlike agents which toggle between LIVE / STALE / DEAD. So "exactly once" is well-defined and achievable.
Today's PER_EXCHANGE mode partially supports this but has three gaps that, in combination, either miss exchanges or flood the Inbox with duplicates:
1. The cursor `evalState.lastExchangeTs` is advanced to `max(startTime)` across the batch and applied as `SearchRequest.timeFrom`, which maps to `start_time >= cursor` in ClickHouse. The inclusive bound re-selects already-seen same-timestamp exchanges on the next tick; the naive fix of a strict `start_time > cursor` would instead permanently drop any exchange that shares a millisecond with the cursor-setting one. A single-timestamp cursor cannot avoid both failure modes.
2. Alert-instance writes and `evalState` cursor advance are not coupled transactionally. A crash between the two writes produces either silent data loss (cursor advanced, instances never persisted) or duplicate instances on recovery (instances persisted, cursor not advanced).
3. The rule-configuration surface for PER_EXCHANGE admits nonsensical combinations (`reNotifyMinutes > 0`, mandatory-but-unused `perExchangeLingerSeconds`, `forDurationSeconds`) — a user following the UI defaults can build a rule that re-notifies hourly even though they want one-shot semantics.
The product rules, agreed with the user:
- The AlertInstance **stays FIRING** until a human acks or resolves it. No auto-resolve sweep.
- The action (webhook) **fires exactly once** per AlertInstance.
- The Inbox **contains exactly one AlertInstance** per failed exchange — never a duplicate from cursor errors, tick re-runs, or process restarts.
"Exactly once" here is at the alerting layer — one AlertInstance, one PENDING AlertNotification per (instance × webhook binding). The HTTP dispatch that follows is still at-least-once on transient failures; that's a separate scope.
## Non-goals
- Auto-resolve after linger seconds. The existing spec reserved `perExchangeLingerSeconds` for this; we're explicitly dropping the field (unused + not desired).
- Resolve-on-delivery semantics. Alert stays FIRING until human intervention.
- Webhook-level idempotency / exactly-once HTTP delivery to Slack. Rare duplicate Slack messages on timeout retries are accepted; consumer-side dedup (via `alert.id` in the payload) is a template concern, not a server change.
- Any change to COUNT_IN_WINDOW mode.
- Backfilling duplicate instances already created by the existing buggy cursor in any running environment. Pre-prod, manual cleanup if needed.
## Design
Four focused changes. Each is small on its own; together they make PER_EXCHANGE fire exactly once per failed exchange.
### 1. Composite cursor for monotone selection
**Current.** `evalState.lastExchangeTs` is a single ISO-8601 string. `ExchangeMatchEvaluator.evaluatePerExchange` reads it as a lower bound and advances it to `max(startTime)` in the batch.
**New.** Replace with a composite cursor `(startTime, executionId)`, serialized as `"<ISO-8601 startTime>|<executionId>"` in `evalState.lastExchangeCursor`.
- Selection predicate: `(start_time > cursor.ts) OR (start_time = cursor.ts AND execution_id > cursor.id)`.
- This is trivially monotone: every consumed exchange is strictly after the cursor in the composite ordering.
- Handles two exchanges at the exact same millisecond correctly — both are selected on their turn, neither re-selected.
- Uses the existing ClickHouse primary-key order `(tenant_id, start_time, …, execution_id)`, so the predicate is a range scan on the PK.
- Advance: set cursor to the `(startTime, executionId)` of the lexicographically-last row in the batch (last row when sorted by `(start_time asc, execution_id asc)`).
- First run (no cursor): today's behaviour is `cursor = null → timeFrom = null → unbounded scan of ClickHouse history` — any pre-existing FAILED exchange in retention would fire an alert on the first tick. That's broken and needs fixing. New rule: initialize `lastExchangeCursor` to `(rule.createdAt, "")` at rule creation time — so a PER_EXCHANGE rule only alerts on exchanges that fail *after* it was created. The empty-string `executionId` component is correct: any real execution_id sorts strictly after it lexicographically, so the very first matching exchange post-creation gets picked up on the first tick. No ambient lookback window, no retention dependency, no backlog flood.
- `evalState` schema change: retire the `lastExchangeTs` key, add `lastExchangeCursor`. Pre-prod; no migration needed. Readers that see neither key treat the rule as first-run.
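The cursor mechanics above can be sketched in plain Java. This is illustrative only: the type name `ExchangeCursor` and its methods are invented here, while the serialization format, the first-run rule, and the ordering follow the spec text. It assumes `executionId` never contains `|`.

```java
import java.time.Instant;
import java.util.Comparator;

// Hypothetical sketch of the composite cursor; not the final API.
record ExchangeCursor(Instant ts, String executionId) {

    // Composite order behind the selection predicate:
    // (start_time > ts) OR (start_time = ts AND execution_id > id).
    static final Comparator<ExchangeCursor> ORDER =
            Comparator.comparing(ExchangeCursor::ts)
                    .thenComparing(ExchangeCursor::executionId);

    // Serialized form stored in evalState.lastExchangeCursor.
    String serialize() {
        return ts + "|" + executionId;
    }

    static ExchangeCursor parse(String raw) {
        int sep = raw.indexOf('|');
        return new ExchangeCursor(Instant.parse(raw.substring(0, sep)),
                raw.substring(sep + 1));
    }

    // First-run cursor: rule creation time plus an empty executionId, so any
    // real execution_id at the same instant still sorts strictly after it.
    static ExchangeCursor firstRun(Instant ruleCreatedAt) {
        return new ExchangeCursor(ruleCreatedAt, "");
    }

    // True when an exchange is strictly after this cursor and should fire.
    boolean admits(Instant startTime, String executionId) {
        return ORDER.compare(this, new ExchangeCursor(startTime, executionId)) < 0;
    }
}
```

Note how the empty-string component makes `firstRun` admit an exchange whose `start_time` equals `rule.createdAt` exactly, which is the intended behaviour.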
**Affected files (scope estimate):**
- `ExchangeMatchEvaluator.evaluatePerExchange` — cursor parse/advance/selection.
- `SearchRequest` / `ClickHouseSearchIndex.search` — needs to accept the composite predicate. Option A: add an optional `afterExecutionId` param alongside `timeFrom`. Option B: introduce a dedicated `AfterCursor(ts, id)` type. Plan phase picks one — A is simpler.
- `evalState` JSON schema (documented in alerting spec).
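Under Option A, the composite predicate might be rendered into the query roughly as below. This is a sketch only: the helper name and parameter markers are invented, and the real plumbing would live in `ClickHouseSearchIndex.search`.

```java
// Hypothetical helper: renders the composite strictly-after predicate as a
// WHERE fragment; tsParam/idParam are bound to cursor.ts and cursor.executionId.
final class CursorSql {
    static String cursorWhere(String tsParam, String idParam) {
        return "(start_time > " + tsParam
                + " OR (start_time = " + tsParam
                + " AND execution_id > " + idParam + "))";
    }
}
```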
### 2. Transactional coupling of instance writes + cursor advance
**Current.** Per tick for a PER_EXCHANGE rule:
1. `applyResult` iterates the `EvalResult.Batch` firings and calls `applyBatchFiring` for each — one `AlertInstance` save + `enqueueNotifications` per firing, each its own transaction (or auto-commit).
2. After the rule loop, `reschedule(rule, nextRun)` saves the updated `evalState` + `nextRunAt` in a separate write.
Crash anywhere between steps 1 and 2 (or partway through the loop in step 1) produces one of two inconsistent states:
- Instances saved but cursor not advanced → next tick duplicates them.
- Cursor advanced but no instances saved → those exchanges never alerted.
**New.** Wrap the whole Batch-result processing for a single rule in one `TransactionTemplate.execute(...)`:
```
TX {
  persist all AlertInstances for the batch
  insert all PENDING AlertNotifications for those instances
  update rule: evalState.lastExchangeCursor + nextRunAt
}
```
Commit: all three land atomically. Rollback: none do, and the rule stays claimed-but-cursor-unchanged so the next tick re-processes the same exchanges. Combined with the monotone cursor from §1, that gives exactly-once instance creation: if a batch half-succeeded and rolled back, the second attempt starts from the same cursor and produces the same set.
Notification dispatch (`NotificationDispatchJob` picking up PENDING rows) happens outside the transaction on its own schedule — webhook I/O must never hold a DB transaction open.
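The retry-safety argument can be simulated in a few lines of plain Java. This is a minimal sketch with invented names (the real code uses Spring's `TransactionTemplate` and the repositories listed below): writes are staged and only applied at the commit point, so a fault mid-batch leaves instances and cursor untouched, and the retry starts from the same cursor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the per-rule batch transaction.
final class TickTx {
    final List<String> committedInstances = new ArrayList<>();
    String committedCursor = null;

    // Runs one tick; a fault before the commit point rolls everything back.
    boolean runTick(List<String> batch, String newCursor, int faultAtIndex) {
        List<String> staged = new ArrayList<>();
        for (int i = 0; i < batch.size(); i++) {
            if (i == faultAtIndex) return false;  // crash/rollback: nothing applied
            staged.add(batch.get(i));             // instance + PENDING notification
        }
        committedInstances.addAll(staged);        // commit point: all-or-nothing
        committedCursor = newCursor;              // cursor advances with the batch
        return true;
    }
}
```

A fault on the second write commits nothing; the retry from the unchanged cursor produces exactly the same set, which is the exactly-once property §1 and §2 combine to give.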
**Affected files (scope estimate):**
- `AlertEvaluatorJob.applyResult` + `applyBatchFiring` — fold into one transactional block when the result is a `Batch`.
- No change to the COUNT_IN_WINDOW path (`applyResult` for non-Batch results keeps its current semantics).
- `PostgresAlertInstanceRepository` / `PostgresAlertNotificationRepository` / `PostgresAlertRuleRepository` — existing methods usable from inside a transaction; verify no implicit auto-commit.
### 3. Config hygiene — enforce a coherent PER_EXCHANGE rule shape
Three knobs on the rule are wrong for PER_EXCHANGE and trap the user into buggy configurations.
| Knob | Current state | New state for PER_EXCHANGE |
|---|---|---|
| `reNotifyMinutes` | Default 60 in UI; re-notify sweep fires every N min while FIRING | **Must be `0`.** API 400s if non-zero. UI forces to `0` and disables the input with tooltip "Per-exchange rules fire exactly once — re-notify does not apply." |
| `perExchangeLingerSeconds` | Validated as required by `ExchangeMatchCondition` compact ctor; unused anywhere in the code | **Removed.** Drop the field entirely — from the record, the compact-ctor validation, `AlertRuleRequest` DTO, form state, UI. Pre-prod; no shim. |
| `forDurationSeconds` | Applied by the state machine in the COUNT_IN_WINDOW / agent-lifecycle path | **Must be `0`/null** for PER_EXCHANGE. 400 on save if non-zero. UI hides the field when PER_EXCHANGE is selected. Evaluator path already ignores it for `Batch` results, so this is a contract-tightening at the API edge only. |
Net effect: a PER_EXCHANGE rule's configurable surface becomes exactly `{scope, filter, severity, notification title/message, webhooks, targets}`. The user can't express an inconsistent combination.
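The cross-field check might look like the sketch below. Names are hypothetical (the real validation sits in `AlertRuleController` / `AlertRuleRequest`); only the rules in the table above are encoded.

```java
// Hypothetical validator for the PER_EXCHANGE rule shape: returns an error
// message for an incoherent combination, or null when the shape is valid.
final class PerExchangeRuleValidator {
    static String validate(int reNotifyMinutes, Integer forDurationSeconds) {
        if (reNotifyMinutes != 0) {
            return "reNotifyMinutes must be 0 for PER_EXCHANGE rules";
        }
        if (forDurationSeconds != null && forDurationSeconds != 0) {
            return "forDurationSeconds must be 0 or null for PER_EXCHANGE rules";
        }
        return null; // coherent one-shot shape
    }
}
```

`perExchangeLingerSeconds` needs no check here because the field no longer exists on the request DTO.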
**Affected files (scope estimate):**
- `ExchangeMatchCondition` — remove `perExchangeLingerSeconds`.
- `AlertRuleController` / `AlertRuleRequest` — cross-field validation (reNotify + forDuration vs fireMode).
- `ui/src/pages/Alerts/RuleEditor/TriggerStep.tsx` + `form-state.ts` — disable reNotify + hide forDuration when PER_EXCHANGE; remove the linger field.
- Tests (§4).
### 4. Tests that lock the guarantees
Four scenarios, one per correctness property.
**Test 1 — cursor monotonicity** (`ExchangeMatchEvaluatorTest`, unit)
- Seed two FAILED executions with identical `start_time`, different `executionId`.
- Tick 1: both fire, batch of 2.
- Tick 2: neither fires.
- Seed a third at the same timestamp. Tick 3: that third only.
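The scenario above can be checked against an in-memory simulation of the cursor logic. This is a sketch with invented types (the real test drives `ExchangeMatchEvaluator` against seeded ClickHouse rows): fire everything strictly after the cursor in `(start, id)` order, then advance the cursor to the last fired exchange.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for a matched exchange and the evaluator's cursor.
record Exchange(Instant start, String id) {}

final class CursorSim {
    Instant ts;
    String id;

    CursorSim(Instant ts, String id) { this.ts = ts; this.id = id; }

    // One tick: select strictly-after exchanges, advance past the last one.
    List<Exchange> tick(List<Exchange> all) {
        List<Exchange> fired = all.stream()
                .filter(e -> e.start().compareTo(ts) > 0
                        || (e.start().equals(ts) && e.id().compareTo(id) > 0))
                .sorted(Comparator.comparing(Exchange::start).thenComparing(Exchange::id))
                .collect(Collectors.toList());
        if (!fired.isEmpty()) {
            Exchange last = fired.get(fired.size() - 1);
            ts = last.start();
            id = last.id();
        }
        return fired;
    }
}
```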
**Test 2 — tick atomicity** (`AlertEvaluatorJobIT`, integration with real Postgres)
- Seed 3 FAILED executions. Inject a fault on the second notification-insert.
- Tick → transaction rolls back: 0 AlertInstances, cursor unchanged, rule `nextRunAt` unchanged.
- Remove fault, tick again: 3 AlertInstances + 3 PENDING notifications, cursor advanced.
**Test 3 — full-lifecycle exactly-once** (extends `AlertingFullLifecycleIT`)
- PER_EXCHANGE rule, dummy webhook.
- Seed 5 FAILED executions across two ticks (3 + 2). After both ticks: exactly 5 FIRING AlertInstances, exactly 5 PENDING notifications.
- Third tick with no new executions: zero new instances, zero new notifications.
- Ack one instance: other four unchanged.
- Additionally: POST a PER_EXCHANGE rule with `reNotifyMinutes=60` via the controller → expect 400.
- Additionally: POST a PER_EXCHANGE rule with `forDurationSeconds=60` → expect 400.
**Test 4 — first-run uses rule creation time, not unbounded history** (unit, in `ExchangeMatchEvaluatorTest`)
- Seed 2 FAILED executions dated *before* rule creation, 1 *after*.
- Evaluate a freshly-created PER_EXCHANGE rule whose `evalState` is empty.
- Expect: exactly 1 firing (the one after creation). The pre-creation ones must not appear in the batch.
Plus a small unit test on the new cross-field validator to isolate its logic from the IT setup.
## Out of scope
- **Webhook-level idempotency.** `WebhookDispatcher` still retries on 5xx / network / timeout. For Slack, that means a timeout mid-POST can produce a duplicate channel message. The consumer-side fix is to include a stable ID (e.g. `{{alert.id}}`) in the message template and drop duplicates on Slack's side — doable today via the existing Mustache editor, no server change. If in the future we want strict exactly-once HTTP delivery, that's a separate design.
- **Auto-resolve of PER_EXCHANGE instances.** Alerts stay FIRING until humans intervene. If operational experience shows the Inbox gets unwieldy, a later phase can add a manual "resolve all" bulk action or an opt-in TTL sweep.
- **Rule-level dedup of identical alerts in a short window** (e.g. "same failure signature fires twice in 5 s"). Out of scope; every failed exchange is its own event by design.
- **COUNT_IN_WINDOW changes.** Untouched.
- **Migration of existing PER_EXCHANGE rules.** Pre-prod; any existing rule using the retired `perExchangeLingerSeconds` field gets the value silently dropped by the API's unknown-property handling on next PUT, or rejected on create (new shape). If needed, a one-shot cleanup is easier than a shim.
## Risks
- **ClickHouse predicate performance.** The composite predicate `(start_time > ? OR (start_time = ? AND execution_id > ?))` must hit the PK range efficiently. The table PK is `(tenant_id, start_time, environment, application_id, route_id, execution_id)`, so the OR-form should be fine, but we'll verify with `EXPLAIN PIPELINE` against the IT container during plan-phase. Fallback: `(start_time, execution_id)` tuple comparison if CH has native support (`(start_time, execution_id) > (?, ?)`), which it does in recent versions.
- **Transaction size.** A single tick caps at `limit = 50` matches (existing behaviour), so the transaction holds at most 50 AlertInstance + 50 AlertNotification writes + 1 rule update. Well within safe bounds.
- **Cursor format churn.** Dropping `lastExchangeTs` in favour of `lastExchangeCursor` is a one-line `evalState` JSON change. Pre-prod; no shim needed. Any rule whose `evalState` has neither key falls back to the "first-run" behaviour — `(rule.createdAt, "")` — which is bounded by the rule's own creation time. For existing PER_EXCHANGE rules this means the cursor effectively resets to rule-creation, so exchanges that failed between the previous `lastExchangeTs` and deploy time would not re-fire. For pre-prod that's acceptable; post-prod we'd add a one-shot translate-old-key shim at startup.
## Verification
- `mvn -pl cameleer-server-app -am -Dit.test='ExchangeMatchEvaluatorTest,AlertEvaluatorJobIT,AlertingFullLifecycleIT,AlertRuleControllerIT' ... verify` → 0 failures.
- Manual: create a PER_EXCHANGE / FAILED rule via UI. Verify `reNotifyMinutes` is fixed at 0 and disabled. Produce failing exchanges. Verify Inbox shows one instance per exchange, Slack gets one message each. Ack one instance. Produce another failure. Verify only the new one appears.