From 2f9b9c9b0f849452b611f7e09be0730afbd0c387 Mon Sep 17 00:00:00 2001 From: hsiegeln <37154749+hsiegeln@users.noreply.github.com> Date: Wed, 22 Apr 2026 14:57:25 +0200 Subject: [PATCH] =?UTF-8?q?docs(spec):=20PER=5FEXCHANGE=20=E2=80=94=20tigh?= =?UTF-8?q?ten=20motivation,=20fold=20in=20njams=20review?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Correct the factual claim that the cursor advances — it is dead code: _nextCursor is computed but never persisted by applyBatchFiring/reschedule, so every tick re-enqueues notifications for every matching exchange in retention. Clarify that instance-level dedup already works via the unique index; notification-level dedup is what's broken. Reframe §2 as "make it atomic before §1 goes live." Add builder-UX lessons from the njams Server_4 rules editor: clear stale fields on fireMode toggle (not just hide them); block save on empty webhooks+targets; wire the already-existing /render-preview endpoint into the Review step. Add Test 5 (red-first notification-bleed regression) and Test 6 (form-state clear on mode toggle). Park two follow-ups explicitly: sealed condition-type hierarchy (backend lags the UI's condition-forms/* sharding) and a coalesceSeconds primitive for Inbox-storm taming. Amend cursor-format-churn risk: benign in theory, but first post-deploy tick against long-standing rules could scan from rule.createdAt forward — suggests a deployBacklogCap clamp to bound the one-time backlog flood. Co-Authored-By: Claude Opus 4.7 (1M context) --- ...-04-22-per-exchange-exactly-once-design.md | 54 +++++++++++++++---- 1 file changed, 44 insertions(+), 10 deletions(-) diff --git a/docs/superpowers/specs/2026-04-22-per-exchange-exactly-once-design.md b/docs/superpowers/specs/2026-04-22-per-exchange-exactly-once-design.md index 5b7db45b..0c7612e4 100644 --- a/docs/superpowers/specs/2026-04-22-per-exchange-exactly-once-design.md +++ b/docs/superpowers/specs/2026-04-22-per-exchange-exactly-once-design.md @@ -10,8 +10,8 @@ A user wants to create an alert rule of the shape "for every exchange that ends Today's PER_EXCHANGE mode partially supports this but has three gaps that, in combination, either miss exchanges or flood the Inbox with duplicates: -1. The cursor `evalState.lastExchangeTs` is advanced to `max(startTime)` across the batch and applied as `SearchRequest.timeFrom`, which maps to `start_time >= cursor` in ClickHouse. Same-timestamp exchanges get re-selected on the next tick; exchanges that share a millisecond with the cursor-setting one can never be seen. -2. Alert-instance writes and `evalState` cursor advance are not coupled transactionally. A crash between the two states produces either silent data loss (cursor advanced, instances never persisted) or duplicate instances on recovery (instances persisted, cursor not advanced). +1. **The cursor is dead code.** `ExchangeMatchEvaluator.evaluatePerExchange` at `cameleer-server-app/.../eval/ExchangeMatchEvaluator.java:141` computes a `_nextCursor` and stamps it onto the **last firing's context map**, but `AlertEvaluatorJob.applyBatchFiring` (`:206-213`) never reads it, and `reschedule` (`:259`) calls `releaseClaim(rule.id(), nextRun, rule.evalState())` — passing the **original, unmodified** `evalState`. So `lastExchangeTs` is never persisted; every tick runs with `timeFrom = null` and re-scans ClickHouse from retention's beginning. The partial unique index on `alert_instances(rule_id, context->'exchange'->>'id')` silently de-dups duplicate *instances* on each tick — but `enqueueNotifications` in `applyBatchFiring` runs unconditionally on the returned row, so **every tick enqueues a fresh PENDING `AlertNotification` for every matching exchange already in retention**. The user-visible symptom is not "a same-millisecond collision"; it is "Slack gets re-spammed for every historical failed exchange on every tick." Same-millisecond collision is also real, but strictly subsumed once a working cursor exists. +2. Alert-instance writes, notification enqueues, and `evalState` cursor advance are not coupled transactionally. Once §1 is fixed and the cursor *does* advance, a crash between the instance write and the cursor persist would produce either silent data loss (cursor advanced, instances never persisted) or duplicate instances on recovery (instances persisted, cursor not advanced). §2 makes this atomic before the cursor goes live. 3. The rule-configuration surface for PER_EXCHANGE admits nonsensical combinations (`reNotifyMinutes > 0`, mandatory-but-unused `perExchangeLingerSeconds`, `forDurationSeconds`) — a user following the UI defaults can build a rule that re-notifies hourly even though they want one-shot semantics. The product rules, agreed with the user: @@ -28,7 +28,9 @@ The product rules, agreed with the user: - Resolve-on-delivery semantics. Alert stays FIRING until human intervention. - Webhook-level idempotency / exactly-once HTTP delivery to Slack. Rare duplicate Slack messages on timeout retries are accepted; consumer-side dedup (via `alert.id` in the payload) is a template concern, not a server change. - Any change to COUNT_IN_WINDOW mode. -- Backfilling duplicate instances already created by the existing buggy cursor in any running environment. Pre-prod, manual cleanup if needed. +- Backfilling duplicate instances already created by the existing broken cursor in any running environment. Pre-prod, manual cleanup if needed. +- **Sealed condition-type hierarchy.** A follow-up refactor could replace the current "one `AlertRule` with mode-gated fields" model with a sealed hierarchy (`PerExchangeCondition` / `CountInWindowCondition` / `AgentLifecycleCondition`) where each type carries only the knobs it supports. The UI already sharded condition forms under `ui/src/pages/Alerts/RuleEditor/condition-forms/` — the backend is the laggard. §3 here only patches the three known field conflicts; the structural cleanup is a separate phase (see §Follow-ups). +- **Throttle / coalesce primitive for PER_EXCHANGE.** njams bakes `throttling` + `throttlingEventCount` into its CEP queries ("fire once, count the rest in W seconds"). If operators later find the Inbox unwieldy during incident storms, a `coalesceSeconds` knob is the right cure — one FIRING per (rule × signature) per window, with `occurrenceCount` maintained on the instance. Explicitly parked; see §Follow-ups. ## Design @@ -36,7 +38,7 @@ Four focused changes. Each is small on its own; together they make PER_EXCHANGE ### 1. Composite cursor for cursor monotonicity -**Current.** `evalState.lastExchangeTs` is a single ISO-8601 string. `ExchangeMatchEvaluator.evaluatePerExchange` reads it as a lower bound and advances it to `max(startTime)` in the batch. +**Current.** `evalState.lastExchangeTs` is a single ISO-8601 string the evaluator *reads* as a lower bound, but **never writes**. The advance is computed (`latestTs = max(startTime)`) and attached to the last firing's context map as `_nextCursor`, then discarded by `applyBatchFiring` — `reschedule` passes the untouched `rule.evalState()` to `releaseClaim`. Net effect today: every tick queries ClickHouse with `timeFrom = null` (first-run path). The same-millisecond-collision bug described in the original spec assumes the cursor works; in practice the cursor has never worked. Fixing the advance is therefore both a correctness fix and a dead-code-elimination. **New.** Replace with a composite cursor `(startTime, executionId)`, serialized as `"|"` in `evalState.lastExchangeCursor`. @@ -55,7 +57,9 @@ Four focused changes. Each is small on its own; together they make PER_EXCHANGE ### 2. Transactional coupling of instance writes + cursor advance -**Current.** Per tick for a PER_EXCHANGE rule: +This section presumes §1 is landed — once the cursor actually advances, we need the advance and the instance writes to be atomic. + +**Current (post-§1).** Per tick for a PER_EXCHANGE rule: 1. `applyResult` iterates the `EvalResult.Batch` firings and calls `applyBatchFiring` for each — one `AlertInstance` save + `enqueueNotifications` per firing, each its own transaction (or auto-commit). 2. After the rule loop, `reschedule(rule, nextRun)` saves the updated `evalState` + `nextRunAt` in a separate write. @@ -63,6 +67,8 @@ Crash anywhere between steps 1 and 2 (or partway through the loop in step 1) pro - Instances saved but cursor not advanced → next tick duplicates them. - Cursor advanced but no instances saved → those exchanges never alerted. +(Note: today, pre-§1, the "cursor never advances" bug means only the first failure mode ever occurs. §2 prevents the second from appearing once §1 is live.) + **New.** Wrap the whole Batch-result processing for a single rule in one `TransactionTemplate.execute(...)`: ``` @@ -94,15 +100,22 @@ Three knobs on the rule are wrong for PER_EXCHANGE and trap the user into buggy Net effect: a PER_EXCHANGE rule's configurable surface becomes exactly `{scope, filter, severity, notification title/message, webhooks, targets}`. The user can't express an inconsistent combination. +**Mode-toggle state hygiene (UX).** When the user flips `fireMode` PER_EXCHANGE ↔ COUNT_IN_WINDOW inside `ExchangeMatchForm`, the form state for the *other* mode's fields must be cleared — not just hidden. The njams Server_4 frontend mode-gates fields via `*ngIf` and silently retains stale values behind the toggle (`src/app/rules/.../rule-view.component.html:96–112`), which produces save-time surprises. In `form-state.ts` the `setFireMode` reducer must reset the fields that are no longer in scope for the new mode (to their type-appropriate zero, not to undefined — the record compact-ctor still runs). That keeps the API-layer cross-field validator (400-on-save) and the form shape permanently consistent. + +**Builder-UX lessons worth adopting (tiny, in-scope).** +- **Disabled "Add" gating.** `AlertRuleController` accepts `webhooks: []` and `targets: []` as valid, which lets the user save a rule that never notifies anyone. The form already splits by step; the Notify step's "Add webhook" button should stay enabled, but the wizard's "Save rule" in `ReviewStep.tsx` should block-with-reason if `webhooks.length === 0 && targets.length === 0`. njams's pattern of disabling "Add X" until the last row is complete (`rule-view.component.ts:38–45`) is the right shape. +- **Preserve `/test-evaluate` and `/render-preview`.** `AlertRuleController` exposes POST `{id}/test-evaluate` and `{id}/render-preview`; the wizard should surface at least render-preview in `ReviewStep.tsx` before save. njams ships no in-builder preview and operators compensate with trial-and-error creation. We already have the endpoints; not wiring them up would be leaving value on the floor. + **Affected files (scope estimate):** - `ExchangeMatchCondition` — remove `perExchangeLingerSeconds`. -- `AlertRuleController` / `AlertRuleRequest` — cross-field validation (reNotify + forDuration vs fireMode). -- `ui/src/pages/Alerts/RuleEditor/TriggerStep.tsx` + `form-state.ts` — disable reNotify + hide forDuration when PER_EXCHANGE; remove the linger field. +- `AlertRuleController` / `AlertRuleRequest` — cross-field validation (reNotify + forDuration vs fireMode; empty webhooks+targets). +- `ui/src/pages/Alerts/RuleEditor/condition-forms/ExchangeMatchForm.tsx` + `form-state.ts` — clear fields on mode toggle; disable reNotify + hide forDuration when PER_EXCHANGE; remove the linger field. +- `ui/src/pages/Alerts/RuleEditor/ReviewStep.tsx` — block save on empty webhooks+targets; render-preview pane. - Tests (§4). ### 4. Tests that lock the guarantees -Three scenarios, one per correctness property. +Six scenarios: four on the correctness core, one red-test that reproduces today's actual bleed (turns green on fix), one on the builder-UX state-clearing contract. **Test 1 — cursor monotonicity** (`ExchangeMatchEvaluatorTest`, unit) @@ -132,6 +145,19 @@ Three scenarios, one per correctness property. - Evaluate a freshly-created PER_EXCHANGE rule whose `evalState` is empty. - Expect: exactly 1 firing (the one after creation). The pre-creation ones must not appear in the batch. +**Test 5 — pre-fix regression reproducer: notifications do not re-enqueue for already-matched exchanges** (integration, `AlertEvaluatorJobIT`) + +- Seed 2 FAILED executions. Tick 1 → 2 FIRING instances, 2 PENDING notifications. Dispatcher drains them → 2 DELIVERED. +- Tick 2 with **no new executions**: expect zero new PENDING notifications. (Today, without the §1+§2 fix, tick 2 re-enqueues both. This test should be written red-first against `main`, then go green when the cursor is actually persisted.) +- This test directly pins the bug the original spec text understated: instance-level dedup via the unique index is already working; notification-level dedup is what's broken. + +**Test 6 — form state clears on fireMode toggle** (unit, Vitest, `condition-forms/ExchangeMatchForm.test.tsx`) + +- Build an initial form state with `fireMode=COUNT_IN_WINDOW, threshold=5, windowSeconds=300`. +- Dispatch `setFireMode(PER_EXCHANGE)`. +- Expect: `threshold` and `windowSeconds` are cleared to their zero-values (not merely hidden), and the record compact-ctor doesn't throw when `form-state.ts` rebuilds the condition object. +- Dispatch `setFireMode(COUNT_IN_WINDOW)` — expect threshold/window come back as defaults, not as stale values. + Plus a small unit test on the new cross-field validator to isolate its logic from the IT setup. ## Out of scope @@ -146,9 +172,17 @@ Plus a small unit test on the new cross-field validator to isolate its logic fro - **ClickHouse predicate performance.** The composite predicate `(start_time > ? OR (start_time = ? AND execution_id > ?))` must hit the PK range efficiently. The table PK is `(tenant_id, start_time, environment, application_id, route_id, execution_id)`, so the OR-form should be fine, but we'll verify with `EXPLAIN PIPELINE` against the IT container during plan-phase. Fallback: `(start_time, execution_id)` tuple comparison if CH has native support (`(start_time, execution_id) > (?, ?)`), which it does in recent versions. - **Transaction size.** A single tick caps at `limit = 50` matches (existing behaviour), so the transaction holds at most 50 AlertInstance + 50 AlertNotification writes + 1 rule update. Well within safe bounds. -- **Cursor format churn.** Dropping `lastExchangeTs` in favour of `lastExchangeCursor` is a one-line `evalState` JSON change. Pre-prod; no shim needed. Any rule whose `evalState` has neither key falls back to the "first-run" behaviour — `(rule.createdAt, "")` — which is bounded by the rule's own creation time. For existing PER_EXCHANGE rules this means the cursor effectively resets to rule-creation, so exchanges that failed between the previous `lastExchangeTs` and deploy time would not re-fire. For pre-prod that's acceptable; post-prod we'd add a one-shot translate-old-key shim at startup. +- **Cursor format churn.** Dropping `lastExchangeTs` in favour of `lastExchangeCursor` is a one-line `evalState` JSON change. Pre-prod; no shim needed. In practice the churn is even more benign than it looks: today no rule has ever persisted a `lastExchangeTs` value (the advance path is dead code — see §1 Current.), so every existing PER_EXCHANGE rule will hit the first-run path `(rule.createdAt, "")` on first post-deploy tick. Side effect: on deploy, long-standing PER_EXCHANGE rules will immediately scan from `rule.createdAt` forward and enqueue notifications for every FAILED exchange in retention that matches. This is a **one-time backlog flood** proportional to ClickHouse retention × failure rate × number of PER_EXCHANGE rules. For pre-prod with small history this is tolerable; if a rule was created years ago on a real environment, bound the first-run scan by clamping initial cursor to `max(rule.createdAt, now() - deployBacklogCap)` where `deployBacklogCap` is a config (default 24 h). Call this out explicitly in the plan-phase so deployment order is "deploy first, then create rules" or "accept the one-time flood." + +## Follow-ups (parked — separate phases) + +Explicit list of ideas that are valuable but deliberately not in this spec's scope. + +1. **Sealed condition-type hierarchy (backend).** Replace `AlertRule` + `fireMode` field with a sealed `Condition` hierarchy where each type carries only its own knobs. The UI is already sharded (`condition-forms/*Form.tsx`); the backend would follow. Biggest win: kills the whole "mode-gated field" class of bug at the record level, so cross-field validators become compact-ctor invariants instead of controller-layer glue. Estimated scope: medium (DTO migration, Jackson polymorphism, request compatibility). Trigger: when a 4th condition kind lands or when the next "silently-ignored field" bug surfaces. +2. **`coalesceSeconds` primitive on PER_EXCHANGE.** "One FIRING per (rule × signature) per window; attach occurrenceCount." Addresses the Inbox-flood scenario during incident storms without breaking the exactly-once-per-exchange guarantee for the default case. njams bakes this into its CEP template as `throttling` + `throttlingEventCount`; we'd express it as a post-match coalescer on the AlertInstance write path. Trigger: first operator complaint about Inbox volume during a real incident, or when we onboard a tenant with >100 failed exchanges/min. +3. **Cross-phase: ingestion-time rule-matching.** Today's tick+cursor model is correct but latency-bound to `evaluationIntervalSeconds`. A streaming path (agent → ClickHouse ingest → publish → rule matcher) would drop alert latency to seconds. Not needed today; flagged because the spec's design explicitly chooses batch over streaming and future requirements may flip that. ## Verification - `mvn -pl cameleer-server-app -am -Dit.test='ExchangeMatchEvaluatorTest,AlertEvaluatorJobIT,AlertingFullLifecycleIT,AlertRuleControllerIT' ... verify` → 0 failures. -- Manual: create a PER_EXCHANGE / FAILED rule via UI. Verify `reNotifyMinutes` is fixed at 0 and disabled. Produce failing exchanges. Verify Inbox shows one instance per exchange, Slack gets one message each. Ack one instance. Produce another failure. Verify only the new one appears. +- Manual: create a PER_EXCHANGE / FAILED rule via UI. Verify `reNotifyMinutes` is fixed at 0 and disabled; verify the linger field is gone; verify toggling `fireMode` clears COUNT_IN_WINDOW-specific fields. Produce failing exchanges. Verify Inbox shows one instance per exchange, Slack gets exactly one message each. **Wait three evaluation-interval ticks with no new exchanges**; verify no additional notifications arrive (the pre-fix bleed). Ack one instance. Produce another failure. Verify only the new one appears. Save a rule with empty webhooks+targets → expect blocked at the Review step with a reason shown.