feat(alerting): Plan 02 — backend (domain, storage, evaluators, dispatch) #140

Merged
claude merged 53 commits from feat/alerting-02-backend into main 2026-04-20 09:03:16 +02:00
Owner

Summary

Full backend for the alerting feature. Stacked on top of #139 (Plan 01). Rebase onto main once #139 merges.

Spec: docs/superpowers/specs/2026-04-19-alerting-design.md. Plan: docs/superpowers/plans/2026-04-19-alerting-02-backend.md. Admin guide: docs/alerting.md. Verification: docs/alerting-02-verification.md. Post-review audit: docs/alerting-02-final-review.md.

  • V12 + V13 Flyway migrations — 5 enums, 6 tables, indexes, cascades; V13 adds unique partial index on alert_instances(rule_id) WHERE state IN (PENDING,FIRING,ACKNOWLEDGED).
  • Core domain (core/alerting/): sealed AlertCondition hierarchy with Jackson polymorphism (6 subtypes — ROUTE_METRIC, EXCHANGE_MATCH, AGENT_STATE, DEPLOYMENT_STATE, LOG_PATTERN, JVM_METRIC), AlertRule/AlertInstance/AlertSilence/AlertNotification/WebhookBinding records, 5 repository interfaces.
  • Postgres repositories (app/alerting/storage/): JdbcTemplate + ObjectMapper pattern; FOR UPDATE SKIP LOCKED claim-polling on rules + notifications.
  • Six evaluators (app/alerting/eval/): each reads through existing core interfaces (StatsStore, ClickHouseLogStore, ClickHouseSearchIndex, AgentRegistryService, DeploymentRepository, MetricsQueryStore); new additive CH methods countLogs + countExecutionsForAlerting (no FINAL).
  • Evaluator job (AlertEvaluatorJob): SchedulingConfigurer claim-polling loop + PerKindCircuitBreaker (5 failures / 30s → open for 60s) + TickCache query coalescing + re-notify cadence sweep (enqueue fresh notifications when lastNotifiedAt + reNotifyMinutes < now).
  • Notification dispatch (NotificationDispatchJob): claim-polling outbox, silence check at dispatch time (not eval time — preserves audit trail), WebhookDispatcher with HMAC signing + 2xx/4xx/5xx retry classification + TLS trust modes via Plan 01's OutboundHttpClientFactory.
  • Mustache templating (MustacheRenderer + NotificationContextBuilder): JMustache 1.16 dep; unresolved {{x.y.z}} renders as literal; context is populated by condition kind (env/rule/alert always; app/route/exchange/agent/deployment/log/metric conditional).
  • REST API (/api/v1/environments/{envSlug}/alerts/...): rules + alerts + silences + notifications controllers with Bean Validation, RBAC (VIEWER+ read, OPERATOR+ mutations), audit via new ALERT_RULE_CHANGE + ALERT_SILENCE_CHANGE categories.
  • Attribute-key SQL-injection guard: AlertRuleController validates ExchangeMatchCondition.filter.attributes keys against ^[a-zA-Z0-9._-]+$ before persistence (they get inlined into JSONExtractString(attributes, '<key>')).
  • Retention (AlertingRetentionJob): daily @03:00 cleanup of RESOLVED instances + settled notifications.
  • Metrics (AlertingMetrics): alerting_eval_duration_seconds{kind}, alerting_eval_errors_total, alerting_circuit_open_total, alerting_notifications_total, alerting_webhook_delivery_duration_seconds, plus live Postgres-backed gauges for rule/instance state counts.
  • CH projections: alerting_app_status, alerting_route_status (on executions ReplacingMergeTree — requires MODIFY SETTING deduplicate_merge_projection_mode='rebuild', applied inline in the migration), alerting_app_level (logs), alerting_instance_metric (agent_metrics).
  • Plan 01 gate wired: OutboundConnectionServiceImpl.rulesReferencing() now calls AlertRuleRepository.findRuleIdsByOutboundConnectionId(id) — delete/narrow-envs guards now actually work.

Test Plan

  • 158 new alerting tests, all green (120 pre-final-review + 38 after review fixes).
  • AlertingFullLifecycleIT — creates rule via REST API (not raw SQL — see note below), injects log, ticks evaluator → FIRING + rule_snapshot populated, ticks dispatcher → WireMock receives POST with X-Cameleer-Signature, ack → ACK, silence → second notification FAILED "silenced" with zero extra WireMock hits, re-notify cadence → second WireMock POST after clock advance, rule delete → rule_id=NULL + snapshot retained.
  • AlertingEnvIsolationIT — rule in env-A invisible from env-B inbox.
  • Outbound allowed-env guard — 422 on save in disallowed env; 409 on narrowing while rule references.
  • Staging smoke: create outbound connection → create LOG_PATTERN rule via curl POST /alerts/rules → inject log via POST /api/v1/data/logs → wait 2 eval ticks + 1 notification tick → confirm alert_instances FIRING + alert_notifications DELIVERED + webhook body received.
  • Verify CH projections: SELECT name FROM system.projections WHERE table IN ('executions','logs','agent_metrics') shows all 4 alerting_* projections after startup on a real (not Testcontainer) CH.

Known pre-existing test failures (orthogonal — not Plan 02 scope)

~69 failures/errors in non-alerting test classes (AgentSseControllerIT, RegistrationSecurityIT, SecurityFilterIT, SseSigningIT, JwtRefreshIT, BootstrapTokenIT, ClickHouseStatsStoreIT, IngestionSchemaIT, ClickHouseChunkPipelineIT, SearchControllerIT, et al.). Confirmed pre-date this branch by running against feat/alerting-01-outbound-infra. Zero overlap with alerting code.

Post-review fixes applied (see docs/alerting-02-final-review.md)

  • B-1 — PostgresAlertRuleRepository.save() now persists alert_rule_targets + rowMapper() loads them back. AlertStateTransitions.newInstance() propagates targets onto the instance. Lifecycle IT rewritten to POST rule via REST API (instead of raw-SQL seeding) so this class of bug is caught going forward.
  • B-2 — re-notify cadence implemented end-to-end (evaluator sweep + lastNotifiedAt tracking on DELIVERED).
  • I-1 — retry endpoint calls new resetForRetry(id, nextAttemptAt) instead of scheduleRetry; attempts reset to 0 as contracted.
  • I-2 — V13 unique partial index on open instances + DuplicateKeyException handler in save() (log + return existing). Future-proofs multi-replica even though v1 runs single-replica.
  • I-4 — alerting_notifications_total counter now incremented at DELIVERED / FAILED branches.

Deferred to Plan 03

  • UI: NotificationBell, /alerts/** pages, 5-step rule editor wizard, <MustacheEditor /> with variable auto-complete (BL-002), CMD-K integration, rule promotion across envs (pure UI flow).
  • OpenAPI TypeScript regen (ui/src/api/schema.d.ts) — do against merged main in Plan 03.
  • SSRF guard on OutboundConnection.url (reject RFC-1918 / loopback / link-local) — Plan 01 scope, required before SaaS exposure.
  • AlertingMetrics gauge caching (NIT).
  • Performance tests (500 rules × 5 replicas).
  • Native provider integrations (Slack/Teams/PagerDuty — BL-002).

Rollout

Dormant-by-default: zero rules → zero evaluator work → zero behaviour change. V12 + V13 are additive with matching down-scripts; CH projections are IF NOT EXISTS-safe.

claude changed target branch from feat/alerting-01-outbound-infra to main 2026-04-20 09:02:58 +02:00
claude added 53 commits 2026-04-20 09:02:58 +02:00
- Replace hard-coded 'u1' user_id with per-test UUID to prevent PK collision on re-runs
- Add @AfterEach null-safe cleanup for environments and users rows
- Use containsExactlyInAnyOrder for enum assertions to catch misspelled names
- Slug suffix on environment insert avoids slug uniqueness conflicts on re-runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements AlertRuleRepository with JSONB condition/webhooks/eval_state
serialization via ObjectMapper, UPSERT on conflict, JSONB containment
query for findRuleIdsByOutboundConnectionId, and FOR UPDATE SKIP LOCKED
claim-polling for horizontal scale.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the Plan 01 stub that returned [] with a real call through
AlertRuleRepository.findRuleIdsByOutboundConnectionId. Adds AlertingBeanConfig
exposing the AlertRuleRepository bean; widens OutboundBeanConfig constructor
to inject it. Delete and narrow-envs guards now correctly block when rules
reference a connection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements AlertInstanceRepository: save (upsert), findById, findOpenForRule,
listForInbox (3-way OR: user/group/role via && array-overlap + ANY), countUnreadForUser
(LEFT JOIN alert_reads), ack, resolve, markSilenced, deleteResolvedBefore.
Integration test covers all 9 scenarios including inbox fan-out across all
three target types. Also adds @JsonIgnoreProperties(ignoreUnknown=true) to
SilenceMatcher so deserialization tolerates the extra property Jackson serializes from isWildcard().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PostgresAlertSilenceRepository: save/findById roundtrip, listActive (BETWEEN
starts_at AND ends_at), listByEnvironment, delete. JSONB SilenceMatcher via ObjectMapper.

PostgresAlertNotificationRepository: save/findById, listForInstance,
claimDueNotifications (UPDATE...RETURNING with FOR UPDATE SKIP LOCKED),
markDelivered, scheduleRetry (bumps attempts + next_attempt_at), markFailed,
deleteSettledBefore (DELIVERED+FAILED rows older than cutoff). JSONB payload.

PostgresAlertReadRepository: markRead (ON CONFLICT DO NOTHING idempotent),
bulkMarkRead (iterates, handles empty list without error).

16 IT scenarios across 3 classes, all passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AlertingBeanConfig now exposes 4 additional @Bean methods:
alertInstanceRepository, alertSilenceRepository,
alertNotificationRepository, alertReadRepository.
AlertReadRepository takes only JdbcTemplate (no JSONB/ObjectMapper needed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds countLogs(LogSearchRequest) — no FINAL, no cursor/sort/limit —
reusing the same WHERE-clause logic as search() for tenant, env, app,
level, q, logger, source, exchangeId, and time-range filters.
Also extends ClickHouseTestHelper with executeInitSqlWithProjections()
and makes the script runner non-fatal for ADD/MATERIALIZE PROJECTION.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds AlertMatchSpec record (core) and ClickHouseSearchIndex.countExecutionsForAlerting —
no FINAL, no text subqueries. Filters by tenant, env, app, route, status, time window,
and optional after-cursor. Attributes (JSON string column) use inlined JSONExtractString
key literals since ClickHouse JDBC does not bind ? placeholders inside JSON functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds alerting_projections.sql with four projections (alerting_app_status,
alerting_route_status on executions; alerting_app_level on logs;
alerting_instance_metric on agent_metrics). ClickHouseSchemaInitializer now
runs both init.sql and alerting_projections.sql, with ADD PROJECTION and
MATERIALIZE treated as non-fatal — executions (ReplacingMergeTree) requires
deduplicate_merge_projection_mode=rebuild which is unavailable via JDBC pool.
MergeTree projections (logs, agent_metrics) always succeed and are asserted in IT.

Column names confirmed from init.sql: logs uses 'application' (not application_id),
agent_metrics uses 'collected_at' (not timestamp). All column names match the plan.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Declared in cameleer-server-core pom (canonical location for unit-testable
rendering without Spring) and mirrored in cameleer-server-app pom so the
app module compiles standalone without a full reactor install.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sentinel-substitution approach: unresolved {{x.y.z}} tokens are replaced
with a unique NUL-delimited sentinel before Mustache compilation, rendered
as opaque text, then post-replaced with the original {{x.y.z}} literal.
Malformed templates (unclosed {{) are caught and return the raw template.
Never throws. 9 unit tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
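The observable contract described above (unresolved tokens stay literal, malformed templates come back unchanged, nothing throws) can be illustrated with a small stand-in. This sketch deliberately does not use JMustache or the sentinel technique from the commit; it is a regex-based approximation, and the class name `LiteralFallbackRendererSketch` is hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the rendering contract; NOT the PR's MustacheRenderer.
final class LiteralFallbackRendererSketch {
    // Matches {{a.b.c}} style tokens with dotted, alphanumeric path segments.
    private static final Pattern TOKEN =
            Pattern.compile("\\{\\{\\s*([A-Za-z0-9_]+(?:\\.[A-Za-z0-9_]+)*)\\s*\\}\\}");

    static String render(String template, Map<String, Object> context) {
        Matcher m = TOKEN.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Object value = resolve(m.group(1), context);
            // Unresolved path: keep the original token text as a literal.
            String replacement = value != null ? value.toString() : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // an unclosed "{{" never matches, so it passes through untouched
        return out.toString();
    }

    // Walks nested maps along the dotted path; returns null when any step is missing.
    private static Object resolve(String path, Map<String, Object> context) {
        Object current = context;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(part);
        }
        return current;
    }
}
```

For example, rendering `{{rule.name}} in {{env.slug}}` against a context that only contains `rule.name` leaves `{{env.slug}}` in the output verbatim.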
Builds the Mustache context map from AlertRule + AlertInstance + Environment.
Always emits env/rule/alert subtrees; conditionally emits kind-specific
subtrees (agent, app, route, exchange, log, metric, deployment) based on
rule.conditionKind(). Missing instance.context() keys resolve to empty
string. alert.link prefixed with uiOrigin when non-null. 11 unit tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SilenceMatcherService.matches() evaluates AND semantics across ruleId,
severity, appSlug, routeId, agentId constraints. Null fields are wildcards.
Scope-based constraints (appSlug/routeId/agentId) return false when rule is
null (deleted rule — scope cannot be verified). 17 unit tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
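The matching rules in this commit (AND across constraints, null as wildcard, scope constraints failing when the rule is gone) can be sketched as below. The record shapes and names (`Silence`, `Scope`, `Alert`) are assumptions made for illustration and are not taken from the PR's `SilenceMatcherService`.

```java
// Hypothetical sketch of AND-semantics silence matching with null wildcards.
final class SilenceMatcherSketch {
    // Null fields on a silence mean "match anything".
    record Silence(String ruleId, String severity, String appSlug, String routeId, String agentId) {}

    // Scope taken from the rule; the whole scope is null when the rule was deleted.
    record Scope(String appSlug, String routeId, String agentId) {}

    record Alert(String ruleId, String severity, Scope ruleScope) {}

    static boolean matches(Silence s, Alert a) {
        if (!fieldMatches(s.ruleId(), a.ruleId())) return false;
        if (!fieldMatches(s.severity(), a.severity())) return false;

        boolean hasScopeConstraint =
                s.appSlug() != null || s.routeId() != null || s.agentId() != null;
        if (!hasScopeConstraint) return true;

        Scope scope = a.ruleScope();
        if (scope == null) return false; // deleted rule: scope cannot be verified

        return fieldMatches(s.appSlug(), scope.appSlug())
                && fieldMatches(s.routeId(), scope.routeId())
                && fieldMatches(s.agentId(), scope.agentId());
    }

    private static boolean fieldMatches(String constraint, String actual) {
        return constraint == null || constraint.equals(actual);
    }
}
```

Under this sketch a severity-only silence still applies to an alert whose rule was deleted, while any silence that names an app, route, or agent does not.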
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P95_LATENCY_MS maps to avgDurationMs (ExecutionStats has no p95 bucket).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PER_EXCHANGE returns EvalResult.Batch(List<Firing>); last Firing carries
_nextCursor (Instant) in its context map for the job to persist as
evalState.lastExchangeTs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- AlertingProperties @ConfigurationProperties with effective*() accessors and
  5000 ms floor clamp on evaluatorTickIntervalMs; warn logged at startup
- AlertStateTransitions pure static state machine: Clear/Firing/Batch/Error
  branches, PENDING→FIRING promotion on forDuration elapsed; Batch delegated
  to job
- AlertInstance wither helpers: withState, withFiredAt, withResolvedAt, withAck,
  withSilenced, withTitleMessage, withLastNotifiedAt, withContext
- AlertingBeanConfig gains @EnableConfigurationProperties(AlertingProperties),
  alertingInstanceId bean (hostname:pid), alertingClock bean,
  PerKindCircuitBreaker bean wired from props
- 12 unit tests in AlertStateTransitionsTest covering all transitions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- AlertEvaluatorJob implements SchedulingConfigurer; fixed-delay tick from
  AlertingProperties.effectiveEvaluatorTickIntervalMs (5 s floor)
- Claim-polling via AlertRuleRepository.claimDueRules (FOR UPDATE SKIP LOCKED)
- Per-kind circuit breaker guards each evaluator; failures recorded, open kinds
  skipped and rescheduled without evaluation
- Single-Firing path delegates to AlertStateTransitions; new FIRING instances
  enqueue AlertNotification rows per rule.webhooks()
- Batch (PER_EXCHANGE) path creates one FIRING AlertInstance per Firing entry
- PENDING→FIRING promotion handled in applyResult via state machine
- Title/message rendered via MustacheRenderer + NotificationContextBuilder;
  environment resolved from EnvironmentRepository.findById per tick
- AlertEvaluatorJobIT (4 tests): uses named @MockBean replacements for
  ClickHouseSearchIndex + ClickHouseLogStore; @MockBean AgentRegistryService
  drives Clear/Firing/resolve cycle without timing sensitivity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
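The per-kind circuit breaker mentioned above (5 failures within 30 s opens the kind for 60 s, per the PR summary) could look roughly like the following. This is a sketch under assumptions: the class name, the `Supplier<Instant>` time source (the PR injects a `Clock` bean), and the choice to clear the failure window on open are all illustrative.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sliding-window breaker keyed by condition kind; not the PR's PerKindCircuitBreaker.
final class CircuitBreakerSketch {
    private final int threshold;
    private final Duration window;
    private final Duration cooldown;
    private final Supplier<Instant> now; // assumption: stands in for an injected Clock
    private final Map<String, Deque<Instant>> failures = new HashMap<>();
    private final Map<String, Instant> openUntil = new HashMap<>();

    CircuitBreakerSketch(int threshold, Duration window, Duration cooldown, Supplier<Instant> now) {
        this.threshold = threshold;
        this.window = window;
        this.cooldown = cooldown;
        this.now = now;
    }

    synchronized void recordFailure(String kind) {
        Instant t = now.get();
        Deque<Instant> recent = failures.computeIfAbsent(kind, k -> new ArrayDeque<>());
        recent.addLast(t);
        // Drop failures that fell out of the sliding window.
        while (!recent.isEmpty() && recent.peekFirst().isBefore(t.minus(window))) {
            recent.removeFirst();
        }
        if (recent.size() >= threshold) {
            openUntil.put(kind, t.plus(cooldown));
            recent.clear(); // assumption: start counting afresh after opening
        }
    }

    // While open, the caller skips evaluation for this kind and reschedules the rule.
    synchronized boolean isOpen(String kind) {
        Instant until = openUntil.get(kind);
        return until != null && now.get().isBefore(until);
    }
}
```

Keeping state per kind means a failing ClickHouse-backed evaluator does not block, say, AGENT_STATE rules that read from a different store.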
- Add withRuleSnapshot(Map) wither to AlertInstance (same pattern as other withers)
- Call snapshotRule(rule) + withRuleSnapshot in both applyResult (single-firing) and
  applyBatchFiring paths so every persisted instance carries a non-empty JSONB snapshot
- Strip null values from the Jackson-serialized map before wrapping in the immutable
  snapshot so Map.copyOf in the compact ctor does not throw NPE on nullable rule fields
- Add ruleSnapshotIsPersistedOnInstanceCreation IT: asserts name/severity/conditionKind
  appear in the rule_snapshot column after a tick fires an instance
- Add historySurvivesRuleDelete IT: fires an instance, deletes the rule, asserts
  rule_id IS NULL and rule_snapshot still contains the rule name (spec §5 guarantee)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
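The null-stripping step matters because `Map.copyOf` rejects null values. A minimal sketch of the idea, with a hypothetical helper name (`immutableSnapshot`) and a plain `Map<String, Object>` standing in for the Jackson-serialized rule:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper; illustrates why nulls are removed before Map.copyOf.
final class SnapshotSketch {
    static Map<String, Object> immutableSnapshot(Map<String, Object> serialized) {
        Map<String, Object> cleaned = new HashMap<>(serialized);
        // Map.copyOf throws NullPointerException on null keys or values,
        // so nullable rule fields are dropped first.
        cleaned.values().removeIf(Objects::isNull);
        return Map.copyOf(cleaned);
    }
}
```

The trade-off is that a null field and an absent field become indistinguishable in the stored snapshot.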
HmacSHA256 signer returning sha256=<lowercase-hex>. 5 unit tests covering
known vector, prefix, hex casing, and different secrets/bodies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
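A signer with the output shape described above (`sha256=` followed by lowercase hex) can be sketched with the JDK's `javax.crypto.Mac`. The class and method names here are hypothetical, and UTF-8 encoding of secret and body is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of an HMAC-SHA256 body signer; not the PR's class.
final class HmacSignerSketch {
    static String sign(String secret, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("sha256=");
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // %02x yields lowercase, zero-padded hex
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 not available", e);
        }
    }
}
```

A receiver would recompute the value over the raw request body with the shared secret and compare it to the `X-Cameleer-Signature` header, ideally with a constant-time comparison such as `MessageDigest.isEqual`.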
Renders URL/headers/body with Mustache, optionally HMAC-signs the body
(X-Cameleer-Signature), supports POST/PUT/PATCH, classifies 2xx/4xx/5xx
into DELIVERED/FAILED/retry. 8 WireMock-backed IT tests including HTTPS
TRUST_ALL against WireMock self-signed cert.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
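The status-code classification can be sketched as a small pure function. The enum and class names are hypothetical, and treating codes outside 2xx/4xx/5xx (for example 3xx) as FAILED is an assumption this sketch makes; the commit only names the three ranges.

```java
// Hypothetical sketch of webhook response classification.
final class WebhookOutcomeSketch {
    enum Outcome { DELIVERED, FAILED, RETRY }

    static Outcome classify(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Outcome.DELIVERED;
        }
        if (httpStatus >= 500 && httpStatus < 600) {
            return Outcome.RETRY; // server-side error: worth another attempt later
        }
        // 4xx is a client-side problem that a retry will not fix.
        // Assumption: anything else (1xx, 3xx) is also treated as FAILED here.
        return Outcome.FAILED;
    }
}
```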
Claim-polling SchedulingConfigurer: claims due notifications, resolves
instance/connection/rule, checks active silences, dispatches via
WebhookDispatcher, classifies outcomes into DELIVERED/FAILED/retry.
Guards null rule/env after deletion. 5 Testcontainers ITs: 200/503/404
outcomes, active silence suppression, deleted connection fast-fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
listInbox resolves user groups+roles via RbacService.getEffectiveGroupsForUser
/ getEffectiveRolesForUser then delegates to AlertInstanceRepository.
countUnread memoized per (envId, userId) with 5s TTL via ConcurrentHashMap
using a controllable Clock. 6 unit tests covering delegation, cache hit,
TTL expiry, and isolation between users/envs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
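The 5 s memoization of the unread count could be sketched as a TTL cache keyed by (envId, userId). Names and signatures below are hypothetical; a `Supplier<Instant>` stands in for the controllable `Clock`, and the loader stands in for the repository call.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical TTL memoizer; not the PR's InAppInboxQuery.
final class UnreadCountCacheSketch {
    private record Key(String envId, String userId) {}
    private record Entry(long count, Instant expiresAt) {}

    private final ConcurrentHashMap<Key, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl;
    private final Supplier<Instant> now; // assumption: stands in for an injected Clock

    UnreadCountCacheSketch(Duration ttl, Supplier<Instant> now) {
        this.ttl = ttl;
        this.now = now;
    }

    long countUnread(String envId, String userId, LongSupplier loader) {
        Instant t = now.get();
        // compute() is atomic per key: a fresh entry is reused, an expired or missing one is reloaded.
        Entry entry = cache.compute(new Key(envId, userId), (key, old) ->
                old != null && t.isBefore(old.expiresAt())
                        ? old
                        : new Entry(loader.getAsLong(), t.plus(ttl)));
        return entry.count();
    }
}
```

Running the loader inside `compute` blocks other callers for the same key while the query runs, which keeps the sketch simple; a production version might prefer to load outside the map operation.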
- POST/GET/PUT/DELETE /environments/{envSlug}/alerts/rules CRUD
- POST /{id}/enable, /{id}/disable, /{id}/render-preview, /{id}/test-evaluate
- Attribute-key validation: rejects keys not matching ^[a-zA-Z0-9._-]+$ at rule-save time
  (CRITICAL: ExchangeMatchCondition attribute keys are inlined into ClickHouse SQL)
- Webhook validation: verifies outboundConnectionId exists and is allowed in env
- Null-safe notification template defaults to "" for NOT NULL DB constraint
- Fixed misleading comment in ClickHouseSearchIndex to document validation contract
- OPERATOR+ for mutations, VIEWER+ for reads
- Audit: ALERT_RULE_CREATE/UPDATE/DELETE/ENABLE/DISABLE with AuditCategory.ALERT_RULE_CHANGE
- 11 IT tests covering RBAC, SQL-injection prevention, enable/disable, audit, render-preview

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
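Because attribute keys are inlined into ClickHouse SQL, the allow-list check is the only barrier against injection through that field. A minimal sketch of such a guard, using the pattern quoted in the commit; the class name, method name, and exception type are assumptions, not the controller's actual code.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of an attribute-key allow-list check.
final class AttributeKeyGuardSketch {
    private static final Pattern SAFE_KEY = Pattern.compile("^[a-zA-Z0-9._-]+$");

    static void validate(Map<String, String> attributes) {
        for (String key : attributes.keySet()) {
            // matches() requires the whole key to match, so quotes, spaces,
            // parentheses, and trailing newlines are all rejected.
            if (!SAFE_KEY.matcher(key).matches()) {
                throw new IllegalArgumentException("Invalid attribute key: " + key);
            }
        }
    }
}
```

Only keys need this treatment in the sketch; the commit describes the keys as the part that is inlined as a literal.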
- GET /environments/{envSlug}/alerts — inbox filtered by userId/groupIds/roleNames via InAppInboxQuery
- GET /unread-count — memoized unread count (5s TTL)
- GET /{id}, POST /{id}/ack, POST /{id}/read, POST /bulk-read
- bulkRead filters instanceIds to env before delegating to AlertReadRepository
- VIEWER+ for all endpoints; env isolation enforced by requireInstance
- 7 IT tests: list, env isolation, unread-count, ack flow, read, bulk-read, viewer access

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- POST/GET/DELETE /environments/{envSlug}/alerts/silences
- 422 when endsAt <= startsAt ("endsAt must be after startsAt")
- OPERATOR+ for create/delete, VIEWER+ for list
- Audit: ALERT_SILENCE_CREATE/DELETE with AuditCategory.ALERT_SILENCE_CHANGE
- 6 IT tests: create, viewer-list, viewer-cannot-create, bad time-range, delete, audit event

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- GET /environments/{envSlug}/alerts/{alertId}/notifications — list notifications for instance (VIEWER+)
- POST /alerts/notifications/{id}/retry — manual retry of failed notification (OPERATOR+)
  Flat path because notification IDs are globally unique (no env routing needed)
- scheduleRetry resets attempts to 0 and sets nextAttemptAt = now
- Added 11 alerting path matchers to SecurityConfig before outbound-connections block
- Fixed context loading failure in 6 pre-existing alerting storage/migration ITs by adding
  @MockBean(clickHouseSearchIndex/clickHouseLogStore): ExchangeMatchEvaluator and
  LogPatternEvaluator inject the concrete classes directly (not interface beans), so the
  full Spring context fails without these mocks in tests that don't use the real CH container
- 5 IT tests: list, viewer-can-list, retry, viewer-cannot-retry, unknown-404

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add AlertRuleController, AlertController, AlertSilenceController, AlertNotificationController entries
- Document inbox SQL visibility contract (target_user_ids/group_ids/role_names — no broadcast)
- Add /api/v1/alerts/notifications/{id}/retry to flat-endpoint allow-list
- Update SecurityConfig entry with alerting path matchers
- Note attribute-key SQL injection validation contract on AlertRuleController

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Nightly @Scheduled(03:00) job deletes RESOLVED alert_instances older
than eventRetentionDays and DELIVERED/FAILED alert_notifications older
than notificationRetentionDays.  Uses injected Clock for testability.
IT covers: old-resolved deleted, fresh-resolved kept, FIRING kept
regardless of age, PENDING notification never deleted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AlertingMetrics @Component wraps MeterRegistry:
- Counters: alerting_eval_errors_total{kind}, alerting_circuit_opened_total{kind},
  alerting_notifications_total{status}
- Timers: alerting_eval_duration_seconds{kind}, alerting_webhook_delivery_duration_seconds
- Gauges (DB-backed): alerting_rules_total{state}, alerting_instances_total{state}

AlertEvaluatorJob records evalError + evalDuration around each evaluator call.
PerKindCircuitBreaker detects open transitions and fires metrics.circuitOpened(kind).
AlertingBeanConfig wires AlertingMetrics into the circuit breaker post-construction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds alerting stanza to application.yml with all AlertingProperties
fields backed by env-var overrides.  Creates docs/alerting.md covering
six condition kinds (with example JSON), template variables, webhook
setup (Slack/PagerDuty examples), silence patterns, circuit-breaker
and retention troubleshooting, and Prometheus metrics reference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AbstractPostgresIT gained clickHouseSearchIndex and agentRegistryService mocks in Phase 9.
All 14 alerting IT subclasses that re-declared the same @MockBean fields now fail with
"Duplicate mock definition". Removed the redundant declarations; per-class clickHouseLogStore
mock kept where needed. 120 alerting tests now pass (0 failures).

Also adds docs/alerting-02-verification.md (Task 43).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The evaluator mapped P95_LATENCY_MS to ExecutionStats.avgDurationMs because
stats_1m_route has no p95 column. Exposing the old name implied p95 semantics
operators did not get. Rename to AVG_DURATION_MS makes the contract honest.
Updated RouteMetric enum (with javadoc), evaluator switch, and admin guide.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Investigated three approaches for CH 24.12:
- Inline SETTINGS on ADD PROJECTION: rejected (UNKNOWN_SETTING — not a query-level setting).
- ALTER TABLE MODIFY SETTING deduplicate_merge_projection_mode='rebuild': works; persists in
  table metadata across connection restarts; runs before ADD PROJECTION in the SQL script.
- Session-level JDBC URL param: not pursued (MODIFY SETTING is strictly better).

alerting_projections.sql now runs MODIFY SETTING before the two executions ADD PROJECTIONs.
AlertingProjectionsIT strengthened to assert all four projections (including alerting_app_status
and alerting_route_status on executions) exist after schema init.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
saveTargets() is called unconditionally at the end of save() — it deletes
existing targets and re-inserts from the current targets list. findById()
and listByEnvironment() already call withTargets() so reads are consistent.
PostgresAlertRuleRepositoryIT adds saveTargets_roundtrip and
saveTargets_updateReplacesExistingTargets to cover the new write path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AlertInstanceRepository gains listFiringDueForReNotify(Instant) — only returns
instances where last_notified_at IS NOT NULL and cadence has elapsed (IS NULL
branch excluded: sweep only re-notifies, initial notify is the dispatcher's job).

AlertEvaluatorJob.sweepReNotify() runs at the end of each tick, enqueues fresh
notifications for eligible instances and stamps last_notified_at.

NotificationDispatchJob stamps last_notified_at on the alert_instance when a
notification is DELIVERED, providing the anchor timestamp for cadence checks.

PostgresAlertInstanceRepositoryIT adds listFiringDueForReNotify test covering
the three-rule eligibility matrix (never-notified, long-ago, recent).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
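The eligibility rule for the re-notify sweep (`lastNotifiedAt + reNotifyMinutes < now`, with never-notified instances excluded) can be written as a pure predicate. In the PR this check lives in SQL inside `listFiringDueForReNotify`; the Java form below is only an illustration, and treating a non-positive `reNotifyMinutes` as "disabled" is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical predicate mirroring the sweep's eligibility rule.
final class ReNotifySketch {
    static boolean dueForReNotify(Instant lastNotifiedAt, int reNotifyMinutes, Instant now) {
        if (lastNotifiedAt == null) {
            return false; // never notified: the initial notification is the dispatcher's job
        }
        if (reNotifyMinutes <= 0) {
            return false; // assumption: non-positive cadence disables re-notification
        }
        return lastNotifiedAt.plus(Duration.ofMinutes(reNotifyMinutes)).isBefore(now);
    }
}
```

The three cases from the integration test map directly: never-notified is skipped, long-ago is due, recent is not yet due.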
AlertNotificationRepository gains resetForRetry(UUID, Instant) which sets
attempts=0, status=PENDING, next_attempt_at=now, and clears claim/response
fields. AlertNotificationController calls resetForRetry instead of
scheduleRetry so a manual retry always starts from a clean slate.

AlertNotificationControllerIT adds retryResetsAttemptsToZero to verify
attempts==0 and status==PENDING after three prior markFailed calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
V13 migration creates alert_instances_open_rule_uq — a partial unique index on
(rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED'), preventing
duplicate open instances per rule. PostgresAlertInstanceRepository.save() catches
DuplicateKeyException and returns the existing open instance instead of failing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
newInstance() now maps rule.targets() into targetUserIds/targetGroupIds/targetRoleNames
so newly created AlertInstance rows carry the correct target arrays.
Previously these were always empty List.of(), making the inbox query return nothing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rule creation now goes through POST /alerts/rules (exercises saveTargets on the
write path). Clock is replaced with @MockBean(name="alertingClock") and re-stubbed
in @BeforeEach to survive Mockito's inter-test reset. Six ordered steps:

  1. seed log → tick evaluator → assert FIRING instance with non-empty targets (B-1)
  2. tick dispatcher → assert DELIVERED notification + lastNotifiedAt stamped (B-2)
  3. ack via REST → assert ACKNOWLEDGED state
  4. create silence → inject PENDING notification → tick dispatcher → assert silenced (FAILED)
  5. delete rule → assert rule_id nullified, rule_snapshot preserved (ON DELETE SET NULL)
  6. new rule with reNotifyMinutes=1 → first dispatch → advance clock 61s →
     evaluator sweep → second dispatch → verify 2 WireMock POSTs (B-2 cadence)

Background scheduler races addressed by resetting claimed_by/claimed_until before
each manual tick. Simulated clock set AFTER log insert to guarantee log timestamp
falls within the evaluator window. Re-notify notifications backdated in Postgres
to work around the simulated vs real clock gap in claimDueNotifications.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs(alerting): add V11-V13 migration entries to CLAUDE.md
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 3m35s
CI / docker (push) Successful in 4m34s
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Failing after 2m10s
aa9e93369f
Documents the three Flyway migrations added by the alerting feature branch
so future sessions have an accurate migration map.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
claude merged commit ca78aa3962 into main 2026-04-20 09:03:16 +02:00
claude deleted branch feat/alerting-02-backend 2026-04-20 09:03:16 +02:00