docs(alerting): PER_EXCHANGE exactly-once — fireMode reference + deploy-backlog-cap
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m7s
CI / docker (push) Successful in 1m22s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 41s

Fix stale `AGGREGATE` label (actual enum: `COUNT_IN_WINDOW`). Expand
EXCHANGE_MATCH section with both fire modes, PER_EXCHANGE config-surface
restrictions (0 for reNotifyMinutes/forDurationSeconds, at-least-one-sink
rule), exactly-once guarantee scope, and the first-run backlog-cap knob.

Surface the new config in application.yml with the 24h default and the
opt-out-to-0 semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
hsiegeln
2026-04-22 18:39:49 +02:00
parent e470fc0dab
commit eda74b7339
2 changed files with 43 additions and 4 deletions

View File

@@ -93,6 +93,10 @@ cameleer:
notification-retention-days: ${CAMELEER_SERVER_ALERTING_NOTIFICATIONRETENTIONDAYS:30}
webhook-timeout-ms: ${CAMELEER_SERVER_ALERTING_WEBHOOKTIMEOUTMS:5000}
webhook-max-attempts: ${CAMELEER_SERVER_ALERTING_WEBHOOKMAXATTEMPTS:3}
# PER_EXCHANGE first-run cursor clamp: on first tick with no persisted cursor, evaluator
# scans no further back than (now - this cap). Prevents one-time backlog flood for rules
# whose createdAt predates a migration. Set to 0 to disable and replay from createdAt.
per-exchange-deploy-backlog-cap-seconds: ${CAMELEER_SERVER_ALERTING_PEREXCHANGEDEPLOYBACKLOGCAPSECONDS:86400}
outbound-http:
trust-all: false
trusted-ca-pem-paths: []

View File

@@ -36,25 +36,60 @@ Comparators: `GT`, `GTE`, `LT`, `LTE`, `EQ`.
### EXCHANGE_MATCH
Fires when the number of exchanges matching a filter exceeds a threshold.
Fires on exchanges matching a filter. Two firing modes — pick the one that matches your operational intent.
#### `fireMode: COUNT_IN_WINDOW`
One alert when the count of matching exchanges in a rolling window crosses a threshold. Aggregation-style: good for "more than 3 payment failures in 10 minutes."
```json
{
"name": "Failed payment exchanges",
"name": "Payment failures spike",
"severity": "WARNING",
"conditionKind": "EXCHANGE_MATCH",
"condition": {
"kind": "EXCHANGE_MATCH",
"scope": { "appSlug": "payment-service", "routeId": "processPayment" },
"filter": { "status": "FAILED", "attributes": { "payment.type": "card" } },
"fireMode": "AGGREGATE",
"fireMode": "COUNT_IN_WINDOW",
"threshold": 3,
"windowSeconds": 600
}
}
```
`fireMode`: `AGGREGATE` (one alert for the count) or `PER_EXCHANGE` (one alert per matching exchange).
#### `fireMode: PER_EXCHANGE`
One alert per distinct failed exchange — **exactly once**. Each failure produces its own `AlertInstance` and its own notification. The Inbox contains one row per failed exchange, never a duplicate, across ticks or process restarts. Good for "page me for every failed order regardless of rate."
```json
{
"name": "Any order failure",
"severity": "CRITICAL",
"conditionKind": "EXCHANGE_MATCH",
"condition": {
"kind": "EXCHANGE_MATCH",
"scope": { "appSlug": "orders-service" },
"filter": { "status": "FAILED" },
"fireMode": "PER_EXCHANGE"
}
}
```
PER_EXCHANGE rules have a tighter configurable surface — the server rejects non-coherent combinations at save time with 400:
| Field | PER_EXCHANGE | COUNT_IN_WINDOW |
|---|---|---|
| `threshold`, `windowSeconds` | must be absent / zero | required, positive |
| `reNotifyMinutes` | must be 0 (fires once; re-notify does not apply) | optional |
| `forDurationSeconds` | must be 0 | optional |
| `scope`, `filter`, `severity`, notification template, `webhooks` / `targets` | standard | standard |
Additionally, any rule (any `conditionKind`) with **both** empty `webhooks` and empty `targets` is rejected — a rule that notifies no one is always a misconfiguration.
**Exactly-once guarantee — scope.** One `AlertInstance` and one PENDING `AlertNotification` per exchange, survived across evaluator ticks and process restarts. HTTP webhook delivery is still at-least-once under transient failure; for Slack and similar, include `{{alert.id}}` in the message template so the consumer can dedup.
**First post-deploy tick — backlog cap.** A PER_EXCHANGE rule's first run (no persisted cursor yet) would otherwise scan from `rule.createdAt` forward, which can trigger a one-time notification flood for long-lived rules after a DB migration or schema reset. The server clamps the first-run scan to `max(rule.createdAt, now - deployBacklogCap)`. Default cap: 24 h. Tune via `cameleer.server.alerting.per-exchange-deploy-backlog-cap-seconds` (set to 0 to disable the clamp and replay from `createdAt`).
### AGENT_STATE