feat(alerting): Plan 02 — backend (domain, storage, evaluators, dispatch) #140
Summary
Full backend for the alerting feature. Stacked on top of #139 (Plan 01). Rebase onto main once #139 merges.
- Spec: docs/superpowers/specs/2026-04-19-alerting-design.md. Plan: docs/superpowers/plans/2026-04-19-alerting-02-backend.md. Admin guide: docs/alerting.md. Verification: docs/alerting-02-verification.md. Post-review audit: docs/alerting-02-final-review.md.
- Partial unique index (V13) on alert_instances(rule_id) WHERE state IN (PENDING, FIRING, ACKNOWLEDGED).
- core/alerting/: sealed AlertCondition hierarchy with Jackson polymorphism (6 subtypes: ROUTE_METRIC, EXCHANGE_MATCH, AGENT_STATE, DEPLOYMENT_STATE, LOG_PATTERN, JVM_METRIC), AlertRule / AlertInstance / AlertSilence / AlertNotification / WebhookBinding records, 5 repository interfaces.
- app/alerting/storage/: JdbcTemplate + ObjectMapper pattern; FOR UPDATE SKIP LOCKED claim-polling on rules + notifications.
- app/alerting/eval/: each evaluator reads through existing core interfaces (StatsStore, ClickHouseLogStore, ClickHouseSearchIndex, AgentRegistryService, DeploymentRepository, MetricsQueryStore); new additive CH methods countLogs + countExecutionsForAlerting (no FINAL).
- AlertEvaluatorJob: SchedulingConfigurer claim-polling loop + PerKindCircuitBreaker (5 failures / 30s → open for 60s) + TickCache query coalescing + re-notify cadence sweep (enqueue fresh notifications when lastNotifiedAt + reNotifyMinutes < now).
- NotificationDispatchJob: claim-polling outbox, silence check at dispatch time (not eval time, which preserves the audit trail), WebhookDispatcher with HMAC signing + 2xx/4xx/5xx retry classification + TLS trust modes via Plan 01's OutboundHttpClientFactory.
- MustacheRenderer + NotificationContextBuilder: JMustache 1.16 dep; unresolved {{x.y.z}} renders as literal; context is populated by condition kind (env/rule/alert always; app/route/exchange/agent/deployment/log/metric conditional).
- /api/v1/environments/{envSlug}/alerts/...: rules + alerts + silences + notifications controllers with Bean Validation, RBAC (VIEWER+ read, OPERATOR+ mutations), audit via new ALERT_RULE_CHANGE + ALERT_SILENCE_CHANGE categories.
- AlertRuleController validates ExchangeMatchCondition.filter.attributes keys against ^[a-zA-Z0-9._-]+$ before persistence (they get inlined into JSONExtractString(attributes, '<key>')).
- AlertingRetentionJob: daily @03:00 cleanup of RESOLVED instances + settled notifications.
- AlertingMetrics: alerting_eval_duration_seconds{kind}, alerting_eval_errors_total, alerting_circuit_open_total, alerting_notifications_total, alerting_webhook_delivery_duration_seconds, plus live Postgres-backed gauges for rule/instance state counts.
- CH projections: alerting_app_status, alerting_route_status (on executions ReplacingMergeTree; requires MODIFY SETTING deduplicate_merge_projection_mode='rebuild', applied inline in the migration), alerting_app_level (logs), alerting_instance_metric (agent_metrics).
- OutboundConnectionServiceImpl.rulesReferencing() now calls AlertRuleRepository.findRuleIdsByOutboundConnectionId(id); delete/narrow-envs guards now actually work.

Test Plan
- AlertingFullLifecycleIT: creates rule via REST API (not raw SQL; see note below), injects log, ticks evaluator → FIRING + rule_snapshot populated, ticks dispatcher → WireMock receives POST with X-Cameleer-Signature, ack → ACKNOWLEDGED, silence → second notification FAILED "silenced" with zero extra WireMock hits, re-notify cadence → second WireMock POST after clock advance, rule delete → rule_id=NULL + snapshot retained.
- AlertingEnvIsolationIT: rule in env-A invisible from env-B inbox.
- LOG_PATTERN rule via curl POST /alerts/rules → inject log via POST /api/v1/data/logs → wait 2 eval ticks + 1 notification tick → confirm alert_instances FIRING + alert_notifications DELIVERED + webhook body received.
- SELECT name FROM system.projections WHERE table IN ('executions','logs','agent_metrics') shows all 4 alerting_* projections after startup on a real (not Testcontainer) CH.

Known pre-existing test failures (orthogonal, not Plan 02 scope)
~69 failures/errors in non-alerting test classes (AgentSseControllerIT, RegistrationSecurityIT, SecurityFilterIT, SseSigningIT, JwtRefreshIT, BootstrapTokenIT, ClickHouseStatsStoreIT, IngestionSchemaIT, ClickHouseChunkPipelineIT, SearchControllerIT, et al.). Confirmed to pre-date this branch by running against feat/alerting-01-outbound-infra. Zero overlap with alerting code.

Post-review fixes applied (see docs/alerting-02-final-review.md)
- PostgresAlertRuleRepository.save() now persists alert_rule_targets + rowMapper() loads them back. AlertStateTransitions.newInstance() propagates targets onto the instance. Lifecycle IT rewritten to POST rule via REST API (instead of raw-SQL seeding) so this class of bug is caught going forward.
- lastNotifiedAt tracking on DELIVERED.
- resetForRetry(id, nextAttemptAt) instead of scheduleRetry; attempts reset to 0 as contracted.
- DuplicateKeyException handler in save() (log + return existing). Future-proofs multi-replica even though v1 runs single-replica.
- alerting_notifications_total counter now called at DELIVERED / FAILED branches.

Deferred to Plan 03
- NotificationBell, /alerts/** pages, 5-step rule editor wizard, <MustacheEditor /> with variable auto-complete (BL-002), CMD-K integration, rule promotion across envs (pure UI flow).
- ui/src/api/schema.d.ts: do against merged main in Plan 03.
- OutboundConnection.url (reject RFC-1918 / loopback / link-local): Plan 01 scope, required before SaaS exposure.
- AlertingMetrics gauge caching (NIT).

Rollout
Dormant-by-default: zero rules → zero evaluator work → zero behaviour change. V12 + V13 are additive with matching down-scripts; CH projections are IF NOT EXISTS-safe.

Sentinel-substitution approach: unresolved {{x.y.z}} tokens are replaced with a unique NUL-delimited sentinel before Mustache compilation, rendered as opaque text, then post-replaced with the original {{x.y.z}} literal. Malformed templates (unclosed {{) are caught and return the raw template. Never throws. 9 unit tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- POST/GET/PUT/DELETE /environments/{envSlug}/alerts/rules CRUD
- POST /{id}/enable, /{id}/disable, /{id}/render-preview, /{id}/test-evaluate
- Attribute-key validation: rejects keys not matching ^[a-zA-Z0-9._-]+$ at rule-save time (CRITICAL: ExchangeMatchCondition attribute keys are inlined into ClickHouse SQL)
- Webhook validation: verifies outboundConnectionId exists and is allowed in env
- Null-safe notification template defaults to "" for NOT NULL DB constraint
- Fixed misleading comment in ClickHouseSearchIndex to document validation contract
- OPERATOR+ for mutations, VIEWER+ for reads
- Audit: ALERT_RULE_CREATE/UPDATE/DELETE/ENABLE/DISABLE with AuditCategory.ALERT_RULE_CHANGE
- 11 IT tests covering RBAC, SQL-injection prevention, enable/disable, audit, render-preview

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- GET /environments/{envSlug}/alerts: inbox filtered by userId/groupIds/roleNames via InAppInboxQuery
- GET /unread-count: memoized unread count (5s TTL)
- GET /{id}, POST /{id}/ack, POST /{id}/read, POST /bulk-read
- bulkRead filters instanceIds to env before delegating to AlertReadRepository
- VIEWER+ for all endpoints; env isolation enforced by requireInstance
- 7 IT tests: list, env isolation, unread-count, ack flow, read, bulk-read, viewer access

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- POST/GET/DELETE /environments/{envSlug}/alerts/silences
- 422 when endsAt <= startsAt ("endsAt must be after startsAt")
- OPERATOR+ for create/delete, VIEWER+ for list
- Audit: ALERT_SILENCE_CREATE/DELETE with AuditCategory.ALERT_SILENCE_CHANGE
- 6 IT tests: create, viewer-list, viewer-cannot-create, bad time-range, delete, audit event

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- GET /environments/{envSlug}/alerts/{alertId}/notifications: list notifications for instance (VIEWER+)
- POST /alerts/notifications/{id}/retry: manual retry of failed notification (OPERATOR+). Flat path because notification IDs are globally unique (no env routing needed)
- scheduleRetry resets attempts to 0 and sets nextAttemptAt = now
- Added 11 alerting path matchers to SecurityConfig before outbound-connections block
- Fixed context loading failure in 6 pre-existing alerting storage/migration ITs by adding @MockBean(clickHouseSearchIndex/clickHouseLogStore): ExchangeMatchEvaluator and LogPatternEvaluator inject the concrete classes directly (not interface beans), so the full Spring context fails without these mocks in tests that don't use the real CH container
- 5 IT tests: list, viewer-can-list, retry, viewer-cannot-retry, unknown-404

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Add AlertRuleController, AlertController, AlertSilenceController, AlertNotificationController entries
- Document inbox SQL visibility contract (target_user_ids/group_ids/role_names; no broadcast)
- Add /api/v1/alerts/notifications/{id}/retry to flat-endpoint allow-list
- Update SecurityConfig entry with alerting path matchers
- Note attribute-key SQL injection validation contract on AlertRuleController

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AlertingMetrics @Component wraps MeterRegistry:
- Counters: alerting_eval_errors_total{kind}, alerting_circuit_opened_total{kind}, alerting_notifications_total{status}
- Timers: alerting_eval_duration_seconds{kind}, alerting_webhook_delivery_duration_seconds
- Gauges (DB-backed): alerting_rules_total{state}, alerting_instances_total{state}

AlertEvaluatorJob records evalError + evalDuration around each evaluator call. PerKindCircuitBreaker detects open transitions and fires metrics.circuitOpened(kind). AlertingBeanConfig wires AlertingMetrics into the circuit breaker post-construction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

V13 migration creates alert_instances_open_rule_uq, a partial unique index on (rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED'), preventing duplicate open instances per rule. PostgresAlertInstanceRepository.save() catches DuplicateKeyException and returns the existing open instance instead of failing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Rule creation now goes through POST /alerts/rules (exercises saveTargets on the write path). Clock is replaced with @MockBean(name="alertingClock") and re-stubbed in @BeforeEach to survive Mockito's inter-test reset.

Six ordered steps:
1. seed log → tick evaluator → assert FIRING instance with non-empty targets (B-1)
2. tick dispatcher → assert DELIVERED notification + lastNotifiedAt stamped (B-2)
3. ack via REST → assert ACKNOWLEDGED state
4. create silence → inject PENDING notification → tick dispatcher → assert silenced (FAILED)
5. delete rule → assert rule_id nullified, rule_snapshot preserved (ON DELETE SET NULL)
6. new rule with reNotifyMinutes=1 → first dispatch → advance clock 61s → evaluator sweep → second dispatch → verify 2 WireMock POSTs (B-2 cadence)

Background scheduler races addressed by resetting claimed_by/claimed_until before each manual tick. Simulated clock set AFTER log insert to guarantee log timestamp falls within the evaluator window. Re-notify notifications backdated in Postgres to work around the simulated vs real clock gap in claimDueNotifications.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
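
For readers following step 6 above, here is a minimal sketch of the re-notify cadence condition (lastNotifiedAt + reNotifyMinutes < now) evaluated against an injectable java.time.Clock. The class and method names (ReNotifySketch, isReNotifyDue) are hypothetical and do not come from this branch. Treating a null lastNotifiedAt or a non-positive reNotifyMinutes as "not due" is an assumption of the sketch, not documented behaviour.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical illustration only; not the AlertEvaluatorJob implementation.
final class ReNotifySketch {

    // True when lastNotifiedAt + reNotifyMinutes is strictly before "now" on the given clock.
    // Assumption: a never-delivered alert (null) or a disabled cadence (<= 0) is never due here.
    static boolean isReNotifyDue(Instant lastNotifiedAt, int reNotifyMinutes, Clock clock) {
        if (lastNotifiedAt == null || reNotifyMinutes <= 0) {
            return false;
        }
        Instant dueAfter = lastNotifiedAt.plus(Duration.ofMinutes(reNotifyMinutes));
        return dueAfter.isBefore(clock.instant());
    }

    public static void main(String[] args) {
        Instant delivered = Instant.parse("2026-04-19T10:00:00Z");
        Clock atDelivery = Clock.fixed(delivered, ZoneOffset.UTC);
        // Mirrors "advance clock 61s" with reNotifyMinutes=1.
        Clock after61s = Clock.offset(atDelivery, Duration.ofSeconds(61));

        System.out.println(isReNotifyDue(delivered, 1, atDelivery)); // false
        System.out.println(isReNotifyDue(delivered, 1, after61s));   // true
    }
}
```

Because the comparison is strict, advancing the clock by exactly 60s would not trigger a second notification under this reading, which is consistent with the test advancing by 61s.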