Commit Graph

1531 Commits

Author SHA1 Message Date
hsiegeln
8d8bae4e18 feat(ui/alerts): InboxPage with ack + bulk-read actions
AlertRow is reused by AllAlertsPage and HistoryPage. Marking a row as read
happens when its link is followed (the detail sub-route will be added in
phase 10 polish). FIRING rows get an amber left border.
2026-04-20 13:49:23 +02:00
hsiegeln
891dcaef32 feat(ui/alerts): mount NotificationBell in TopBar
Renders the `<NotificationBell />` as the first child of `<TopBar>` (before
`<SearchTrigger>`). The bell links to `/alerts/inbox` and shows the unread
alert count for the currently selected environment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:47:16 +02:00
hsiegeln
54e4217e21 feat(ui/alerts): Alerts sidebar section with Inbox/All/Rules/Silences/History
Adds `buildAlertsTreeNodes` to sidebar-utils and renders an Alerts section
between Applications and Starred in LayoutShell. The section uses an
accordion pattern — entering `/alerts/*` collapses apps/admin/starred and
restores their state on leave.

gitnexus_impact(LayoutContent, upstream) = LOW (0 direct callers; rendered
only by LayoutShell's provider wrapper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:46:49 +02:00
hsiegeln
167d0ebd42 feat(ui/alerts): register /alerts/* routes with placeholder pages
Adds 6 lazy-loaded route entries for the alerting UI (Inbox, All, History,
Rules list, Rule editor wizard, Silences) plus an `/alerts` → `/alerts/inbox`
redirect. Page components are placeholder stubs to be replaced in Phase 5/6/7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:44:44 +02:00
hsiegeln
019e79a362 feat(ui/alerts): MustacheEditor component (CM6 shell with completion + linter)
Wires the mustache-completion source and mustache-linter into a CodeMirror 6
EditorView. Accepts kind (filters variables) and reducedContext (env-only for
connection URLs). singleLine prevents newlines for URL/header fields. Host
ref syncs when the parent replaces value (promotion prefill).
2026-04-20 13:41:46 +02:00
hsiegeln
ac2a943feb feat(ui/alerts): CM6 completion + linter for Mustache templates
completion fires after {{ and narrows as the user types; apply() closes the
tag automatically. Linter raises an error on unclosed {{, a warning on
references that aren't in the allowed-variable set for the current condition
kind. Kind-specific allowed set comes from availableVariables().
2026-04-20 13:39:10 +02:00
hsiegeln
18e6dde67a fix(ui/alerts): align ALERT_VARIABLES registry with NotificationContextBuilder
Plan prose had spec §8 idealized leaves, but the backend NotificationContext
only emits a subset:
  ROUTE_METRIC / EXCHANGE_MATCH → route.id + route.uri (uri added)
  LOG_PATTERN → log.pattern + log.matchCount (renamed from log.logger/level/message)
  app.slug / app.id → scoped to non-env kinds (removed from 'always')
  exchange.link / alert.comparator / alert.window / app.displayName → removed (backend doesn't emit)

Without this alignment the Task 11 linter would (1) flag valid route.uri as
unknown, (2) suggest log.{logger,level,message} as valid paths that render
empty, and (3) flag app.slug on env-wide rules.
2026-04-20 13:35:42 +02:00
hsiegeln
5ddd89f883 feat(ui/alerts): Mustache variable metadata registry for autocomplete
ALERT_VARIABLES mirrors the spec §8 context map. availableVariables(kind)
returns the kind-specific filter (always vars + kind vars). extractReferences
+ unknownReferences drive the inline amber linter. Backend NotificationContext
adds must land here too.
2026-04-20 13:31:55 +02:00
hsiegeln
38083d7c3f fix(ui/alerts): remove dead usePageVisible subscription; align alerts test mock with DTO
NotificationBell used a usePageVisible() subscription that re-rendered on
every visibilitychange without consuming the value. TanStack Query's
refetchIntervalInBackground:false already pauses polling; the extra
subscription was speculative generality. Dropped the import + call + JSDoc
reference; usePageVisible hook + test retained as a reusable primitive.

Also: alerts.test.tsx 'returns the server payload unmodified' asserted a
pre-plan {total, bySeverity} shape, but UnreadCountResponse is actually
{count}. Fixed mock + assertion to {count: 3}.
2026-04-20 13:30:07 +02:00
hsiegeln
197c60126c feat(ui/alerts): NotificationBell with Page Visibility poll pause
Adds a header bell component linking to /alerts/inbox with an unread-count
badge for the selected environment. Polling pauses when the tab is hidden
via TanStack Query's refetchIntervalInBackground:false (already set on
useUnreadCount); the new usePageVisible hook gives components a
re-renders-on-visibility-change signal for future defense-in-depth.

Plan-prose deviation: the plan assumed UnreadCountResponse carries a
bySeverity map for per-severity badge coloring, but the backend DTO only
exposes a scalar `count`. The bell reads `data?.count` and renders a single
var(--error) tint; a TODO references spec §13 for future per-severity work
that would require expanding the DTO.

Tests: usePageVisible toggles on visibilitychange events; NotificationBell
renders the bell with no badge at count=0 and shows "3" at count=3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:26:58 +02:00
hsiegeln
31ee974830 feat(ui/alerts): AlertStateChip + SeverityBadge components
State colors follow the convention from @cameleer/design-system (CRITICAL->error,
WARNING->warning, INFO->auto). Silenced pill stacks next to state for the spec
section 8 audit-trail surface.
2026-04-20 13:21:37 +02:00
hsiegeln
51bc796bec feat(ui/alerts): silence + notification query hooks 2026-04-20 13:16:57 +02:00
hsiegeln
c6c3dd9cfe feat(ui/alerts): alert rule query hooks (CRUD, enable/disable, preview, test-evaluate)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:13:07 +02:00
hsiegeln
82c29f46a5 fix(ui/alerts): bulk-read body uses instanceIds to match BulkReadRequest DTO
Plan 03 prose had 'alertInstanceIds'; backend record is 'instanceIds'.
2026-04-20 13:08:22 +02:00
hsiegeln
83a8912da6 feat(ui/alerts): TanStack Query hooks for /alerts endpoints
Adds env-scoped hooks for the alerts inbox:
- useAlerts (30s poll, background-paused, filter-aware)
- useAlert, useUnreadCount (30s poll)
- useAckAlert, useMarkAlertRead, useBulkReadAlerts (mutations that
  invalidate the alerts query key tree + unread-count)

Test file uses .tsx because the QueryClientProvider wrapper relies on
JSX; vitest picks up both .ts and .tsx via the configured include glob.
Client mock targets the actual export name (`api` in ../client) rather
than the `apiClient` alias that alertMeta re-exports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:07:16 +02:00
hsiegeln
1a8b9eb41b feat(ui/alerts): shared env helper for alerting query hooks 2026-04-20 13:00:49 +02:00
hsiegeln
b2066cdb68 chore(ui): regenerate openapi.json + schema.d.ts from deployed Plan 02 backend
Fetched from http://192.168.50.86:30090/api/v1/api-docs via
`npm run generate-api:live`. Adds TypeScript types for the new alerting
REST surface merged in #140:

- 15 alerting paths under /environments/{envSlug}/alerts/** (rules CRUD,
  enable/disable, render-preview, test-evaluate, inbox, unread-count,
  ack/read/bulk-read, silences CRUD, per-alert notifications)
- 1 flat notification retry path /alerts/notifications/{id}/retry
- 4 outbound-connection admin paths (from Plan 01 #139)

Verified tsc -p tsconfig.app.json --noEmit exits 0 — no existing SPA
call sites break against the fresh types. Plan 03 UI work can consume
these directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:59:38 +02:00
hsiegeln
1260cbe674 chore(ui): add CodeMirror 6 + Playwright config
Install @codemirror/{view,state,autocomplete,commands,language,lint}
and @lezer/common — needed by Phase 3's MustacheEditor (Task 13).
CM6 picked over a raw textarea for its small incremental-rendering
bundle, full ARIA/keyboard support, and pluggable autocomplete +
linter APIs that map cleanly to Mustache token parsing.

Add ui/playwright.config.ts wiring Task 30's E2E smoke:
- testDir ./src/test/e2e, single worker, trace+screenshot on failure
- webServer launches `npm run dev:local` (backend on :8081 required)
- PLAYWRIGHT_BASE_URL env var skips the dev server for CI against a
  pre-deployed UI

Add test:e2e / test:e2e:ui npm scripts and exclude Playwright's
test-results/ and playwright-report/ from git. @playwright/test
itself was already in devDependencies from an earlier task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:55:57 +02:00
hsiegeln
0aa1776b57 chore(ui): add Vitest + Testing Library scaffolding
Prepares for Plan 03 unit tests (MustacheEditor, NotificationBell, wizard step
validation). jsdom environment + jest-dom matchers + canary test verifies the
wiring.
2026-04-20 12:16:22 +02:00
hsiegeln
2942025a54 docs(alerting): Plan 03 — UI + backfills implementation plan
32 tasks across 10 phases:
 - Foundation: Vitest, CodeMirror 6, Playwright scaffolding + schema regen.
 - API: env-scoped query hooks for alerts/rules/silences/notifications.
 - Components: AlertStateChip, SeverityBadge, NotificationBell (with tab-hidden poll pause), MustacheEditor (CM6 with variable autocomplete + linter).
 - Routes: /alerts/* section with sidebar accordion; bell mounted in TopBar.
 - Pages: Inbox / All / History / Rules (with env promotion) / Silences.
 - Wizard: 5-step editor with kind-specific condition forms + test-evaluate + render-preview + prefill warnings.
 - CMD-K: alerts + rules sources via LayoutShell extension.
 - Backend backfills: SSRF guard on outbound URL + 30s AlertingMetrics gauge cache.
 - Final: Playwright smoke, .claude/rules/ui.md + admin-guide updates, full build/test/PR.

Decisions: CM6 over Monaco/textarea (90KB gzipped, ARIA-conformant); CMD-K extension via existing LayoutShell searchData (not a new registry); REST-API-driven tests per project test policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:12:21 +02:00
99c3cab84a Merge pull request 'chore(ui): regenerate openapi schema for Plan 02 alerting endpoints' (#143) from chore/openapi-regen-post-plan02 into main
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m0s
CI / docker (push) Successful in 1m17s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
Reviewed-on: #143
2026-04-20 11:02:13 +02:00
c861482c9e Merge pull request 'test(alerting): decentralize @MockBean + add SpringContextSmokeIT (follow-up to #141)' (#142) from fix/alerting-test-hygiene into main
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m28s
CI / docker (push) Successful in 29s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 40s
Reviewed-on: #142
2026-04-20 10:57:57 +02:00
hsiegeln
39a134a0db chore(ui): regenerate openapi.json + schema.d.ts from deployed Plan 02 backend
All checks were successful
CI / cleanup-branch (pull_request) Has been skipped
CI / build (pull_request) Successful in 2m37s
CI / docker (pull_request) Has been skipped
CI / deploy (pull_request) Has been skipped
CI / deploy-feature (pull_request) Has been skipped
Fetched from http://192.168.50.86:30090/api/v1/api-docs via
`npm run generate-api:live`. Adds TypeScript types for the new alerting
REST surface merged in #140:

- 15 alerting paths under /environments/{envSlug}/alerts/** (rules CRUD,
  enable/disable, render-preview, test-evaluate, inbox, unread-count,
  ack/read/bulk-read, silences CRUD, per-alert notifications)
- 1 flat notification retry path /alerts/notifications/{id}/retry
- 4 outbound-connection admin paths (from Plan 01 #139)

Verified tsc -p tsconfig.app.json --noEmit exits 0 — no existing SPA
call sites break against the fresh types. Plan 03 UI work can consume
these directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 10:56:43 +02:00
hsiegeln
94e941b026 test(alerting): decentralize @MockBean from AbstractPostgresIT + add SpringContextSmokeIT
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m43s
CI / cleanup-branch (pull_request) Has been skipped
CI / build (pull_request) Successful in 3m7s
CI / docker (pull_request) Has been skipped
CI / deploy (pull_request) Has been skipped
CI / deploy-feature (pull_request) Has been skipped
CI / docker (push) Successful in 1m37s
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Successful in 39s
Follow-up to #141. AbstractPostgresIT centrally declared three @MockBean
fields (clickHouseSearchIndex, clickHouseLogStore, agentRegistryService),
which meant EVERY IT ran against mocks instead of the real Spring context.
That masked the production crashloop — the real bean graph was never
exercised by CI.

- Remove the three @MockBean fields from AbstractPostgresIT.
- Move @MockBean declarations onto only the specific ITs that stub
  method behavior (verified by grepping for when/verify calls).
- ITs that don't stub CH behavior now inject the real beans.
- Add SpringContextSmokeIT — @SpringBootTest with no mocks, void
  contextLoads(). Fails fast on declared-type / autowire-type
  mismatches like the one #141 fixed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 10:51:46 +02:00
5373ac6541 Merge pull request 'fix(alerting): ClickHouseSearchIndex bean registered as concrete type (hotfix: production crashloop)' (#141) from fix/alerting-searchindex-bean-type into main
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m53s
CI / docker (push) Successful in 31s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 42s
Reviewed-on: #141
2026-04-20 10:24:49 +02:00
hsiegeln
c9c93ac565 fix(alerting): declare ClickHouseSearchIndex bean as concrete type
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 3m6s
CI / cleanup-branch (pull_request) Has been skipped
CI / build (pull_request) Successful in 4m5s
CI / docker (pull_request) Has been skipped
CI / deploy (pull_request) Has been skipped
CI / deploy-feature (pull_request) Has been skipped
CI / docker (push) Successful in 1m37s
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Successful in 39s
Production crashlooped on startup: ExchangeMatchEvaluator autowires the
concrete ClickHouseSearchIndex (for countExecutionsForAlerting, which
lives only on the concrete class, not the SearchIndex interface), but
StorageBeanConfig declared the bean with interface return type SearchIndex.
Spring matches autowire candidates by declared bean type, not by runtime
instance class, so the concrete-typed autowire failed with:

  Parameter 0 of constructor in ExchangeMatchEvaluator required a bean
  of type 'ClickHouseSearchIndex' that could not be found.

ClickHouseLogStore's bean is already declared with the concrete return
type (line 171), which is why LogPatternEvaluator autowires fine.

All alerting ITs passed pre-merge because AbstractPostgresIT replaces the
clickHouseSearchIndex bean with @MockBean(name=...) whose declared type
IS the concrete ClickHouseSearchIndex. The mock masked the prod bug.

Follow-up: remove @MockBean(name="clickHouseSearchIndex") from
AbstractPostgresIT so the real bean graph is exercised by alerting ITs
(and add a SpringContextSmokeIT that loads the context with no mocks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 09:11:47 +02:00
ca78aa3962 Merge pull request 'feat(alerting): Plan 02 — backend (domain, storage, evaluators, dispatch)' (#140) from feat/alerting-02-backend into main
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m58s
CI / docker (push) Successful in 34s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Failing after 2m17s
2026-04-20 09:03:15 +02:00
b763155a60 Merge pull request 'feat(alerting): Plan 01 — outbound HTTP infra + admin-managed outbound connections' (#139) from feat/alerting-01-outbound-infra into main
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m53s
CI / docker (push) Successful in 36s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
Reviewed-on: #139
2026-04-20 08:57:40 +02:00
hsiegeln
aa9e93369f docs(alerting): add V11-V13 migration entries to CLAUDE.md
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 3m35s
CI / docker (push) Successful in 4m34s
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Failing after 2m10s
Documents the three Flyway migrations added by the alerting feature branch
so future sessions have an accurate migration map.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:27:39 +02:00
hsiegeln
b0ba08e572 test(alerting): rewrite AlertingFullLifecycleIT — REST-driven rule creation, re-notify cadence
Rule creation now goes through POST /alerts/rules (exercises saveTargets on the
write path). Clock is replaced with @MockBean(name="alertingClock") and re-stubbed
in @BeforeEach to survive Mockito's inter-test reset. Six ordered steps:

  1. seed log → tick evaluator → assert FIRING instance with non-empty targets (B-1)
  2. tick dispatcher → assert DELIVERED notification + lastNotifiedAt stamped (B-2)
  3. ack via REST → assert ACKNOWLEDGED state
  4. create silence → inject PENDING notification → tick dispatcher → assert silenced (FAILED)
  5. delete rule → assert rule_id nullified, rule_snapshot preserved (ON DELETE SET NULL)
  6. new rule with reNotifyMinutes=1 → first dispatch → advance clock 61s →
     evaluator sweep → second dispatch → verify 2 WireMock POSTs (B-2 cadence)

Background scheduler races addressed by resetting claimed_by/claimed_until before
each manual tick. Simulated clock set AFTER log insert to guarantee log timestamp
falls within the evaluator window. Re-notify notifications backdated in Postgres
to work around the simulated vs real clock gap in claimDueNotifications.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:26:38 +02:00
hsiegeln
2c82b50ea2 fix(alerting/B-1): AlertStateTransitions.newInstance() propagates rule targets to AlertInstance
newInstance() now maps rule.targets() into targetUserIds/targetGroupIds/targetRoleNames
so newly created AlertInstance rows carry the correct target arrays.
Previously these were always empty List.of(), making the inbox query return nothing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:26:25 +02:00
hsiegeln
7e79ff4d98 fix(alerting/I-2): add unique partial index on alert_instances(rule_id) for open states
V13 migration creates alert_instances_open_rule_uq — a partial unique index on
(rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED'), preventing
duplicate open instances per rule. PostgresAlertInstanceRepository.save() catches
DuplicateKeyException and returns the existing open instance instead of failing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:26:07 +02:00
hsiegeln
424894a3e2 fix(alerting/I-1): retry endpoint resets attempts to 0 instead of incrementing
AlertNotificationRepository gains resetForRetry(UUID, Instant) which sets
attempts=0, status=PENDING, next_attempt_at=now, and clears claim/response
fields. AlertNotificationController calls resetForRetry instead of
scheduleRetry so a manual retry always starts from a clean slate.

AlertNotificationControllerIT adds retryResetsAttemptsToZero to verify
attempts==0 and status==PENDING after three prior markFailed calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:25:59 +02:00
hsiegeln
d74079da63 fix(alerting/B-2): implement re-notify cadence sweep and lastNotifiedAt tracking
AlertInstanceRepository gains listFiringDueForReNotify(Instant) — only returns
instances where last_notified_at IS NOT NULL and cadence has elapsed (IS NULL
branch excluded: sweep only re-notifies, initial notify is the dispatcher's job).

AlertEvaluatorJob.sweepReNotify() runs at the end of each tick, enqueues fresh
notifications for eligible instances and stamps last_notified_at.

NotificationDispatchJob stamps last_notified_at on the alert_instance when a
notification is DELIVERED, providing the anchor timestamp for cadence checks.

PostgresAlertInstanceRepositoryIT adds listFiringDueForReNotify test covering
the three-rule eligibility matrix (never-notified, long-ago, recent).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:25:50 +02:00
hsiegeln
3f036da03d fix(alerting/B-1): PostgresAlertRuleRepository.save() now persists alert_rule_targets
saveTargets() is called unconditionally at the end of save() — it deletes
existing targets and re-inserts from the current targets list. findById()
and listByEnvironment() already call withTargets() so reads are consistent.
PostgresAlertRuleRepositoryIT adds saveTargets_roundtrip and
saveTargets_updateReplacesExistingTargets to cover the new write path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:25:39 +02:00
hsiegeln
8bf45d5456 fix(alerting): use ALTER TABLE MODIFY SETTING to enable projections on executions ReplacingMergeTree
Investigated three approaches for CH 24.12:
- Inline SETTINGS on ADD PROJECTION: rejected (UNKNOWN_SETTING — not a query-level setting).
- ALTER TABLE MODIFY SETTING deduplicate_merge_projection_mode='rebuild': works; persists in
  table metadata across connection restarts; runs before ADD PROJECTION in the SQL script.
- Session-level JDBC URL param: not pursued (MODIFY SETTING is strictly better).

alerting_projections.sql now runs MODIFY SETTING before the two executions ADD PROJECTIONs.
AlertingProjectionsIT strengthened to assert all four projections (including alerting_app_status
and alerting_route_status on executions) exist after schema init.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 07:36:55 +02:00
hsiegeln
f1abca3a45 refactor(alerting): rename P95_LATENCY_MS → AVG_DURATION_MS to match what stats_1m_route exposes
The evaluator mapped P95_LATENCY_MS to ExecutionStats.avgDurationMs because
stats_1m_route has no p95 column. Exposing the old name implied p95 semantics
operators did not get. Rename to AVG_DURATION_MS makes the contract honest.
Updated RouteMetric enum (with javadoc), evaluator switch, and admin guide.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 07:36:43 +02:00
hsiegeln
144915563c docs(alerting): whole-branch final review report
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 07:25:33 +02:00
hsiegeln
c79a6234af test(alerting): fix duplicate @MockBean after AbstractPostgresIT centralised mocks + Plan 02 verification report
AbstractPostgresIT gained clickHouseSearchIndex and agentRegistryService mocks in Phase 9.
All 14 alerting IT subclasses that re-declared the same @MockBean fields now fail with
"Duplicate mock definition". Removed the redundant declarations; per-class clickHouseLogStore
mock kept where needed. 120 alerting tests now pass (0 failures).

Also adds docs/alerting-02-verification.md (Task 43).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 23:27:19 +02:00
hsiegeln
63669bd1d7 docs(alerting): default config + admin guide
Adds alerting stanza to application.yml with all AlertingProperties
fields backed by env-var overrides.  Creates docs/alerting.md covering
six condition kinds (with example JSON), template variables, webhook
setup (Slack/PagerDuty examples), silence patterns, circuit-breaker
and retention troubleshooting, and Prometheus metrics reference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 22:16:38 +02:00
hsiegeln
840a71df94 feat(alerting): observability metrics via micrometer
AlertingMetrics @Component wraps MeterRegistry:
- Counters: alerting_eval_errors_total{kind}, alerting_circuit_opened_total{kind},
  alerting_notifications_total{status}
- Timers: alerting_eval_duration_seconds{kind}, alerting_webhook_delivery_duration_seconds
- Gauges (DB-backed): alerting_rules_total{state}, alerting_instances_total{state}

AlertEvaluatorJob records evalError + evalDuration around each evaluator call.
PerKindCircuitBreaker detects open transitions and fires metrics.circuitOpened(kind).
AlertingBeanConfig wires AlertingMetrics into the circuit breaker post-construction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 22:16:30 +02:00
hsiegeln
1ab21bc019 feat(alerting): AlertingRetentionJob daily cleanup
Nightly @Scheduled(03:00) job deletes RESOLVED alert_instances older
than eventRetentionDays and DELIVERED/FAILED alert_notifications older
than notificationRetentionDays.  Uses injected Clock for testability.
IT covers: old-resolved deleted, fresh-resolved kept, FIRING kept
regardless of age, PENDING notification never deleted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 22:16:21 +02:00
hsiegeln
118ace7cc3 docs(alerting): update app-classes.md for Phase 9 REST controllers (Task 36)
- Add AlertRuleController, AlertController, AlertSilenceController, AlertNotificationController entries
- Document inbox SQL visibility contract (target_user_ids/group_ids/role_names — no broadcast)
- Add /api/v1/alerts/notifications/{id}/retry to flat-endpoint allow-list
- Update SecurityConfig entry with alerting path matchers
- Note attribute-key SQL injection validation contract on AlertRuleController

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:08:38 +02:00
hsiegeln
e334dfacd3 feat(alerting): AlertNotificationController + SecurityConfig matchers + fix IT context (Task 35)
- GET /environments/{envSlug}/alerts/{alertId}/notifications — list notifications for instance (VIEWER+)
- POST /alerts/notifications/{id}/retry — manual retry of failed notification (OPERATOR+)
  Flat path because notification IDs are globally unique (no env routing needed)
- scheduleRetry resets attempts to 0 and sets nextAttemptAt = now
- Added 11 alerting path matchers to SecurityConfig before outbound-connections block
- Fixed context loading failure in 6 pre-existing alerting storage/migration ITs by adding
  @MockBean(clickHouseSearchIndex/clickHouseLogStore): ExchangeMatchEvaluator and
  LogPatternEvaluator inject the concrete classes directly (not interface beans), so the
  full Spring context fails without these mocks in tests that don't use the real CH container
- 5 IT tests: list, viewer-can-list, retry, viewer-cannot-retry, unknown-404

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:29:17 +02:00
hsiegeln
77d1718451 feat(alerting): AlertSilenceController CRUD with time-range validation + audit (Task 34)
- POST/GET/DELETE /environments/{envSlug}/alerts/silences
- 422 when endsAt <= startsAt ("endsAt must be after startsAt")
- OPERATOR+ for create/delete, VIEWER+ for list
- Audit: ALERT_SILENCE_CREATE/DELETE with AuditCategory.ALERT_SILENCE_CHANGE
- 6 IT tests: create, viewer-list, viewer-cannot-create, bad time-range, delete, audit event

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:29:03 +02:00
hsiegeln
841793d7b9 feat(alerting): AlertController in-app inbox with ack/read/bulk-read (Task 33)
- GET /environments/{envSlug}/alerts — inbox filtered by userId/groupIds/roleNames via InAppInboxQuery
- GET /unread-count — memoized unread count (5s TTL)
- GET /{id}, POST /{id}/ack, POST /{id}/read, POST /bulk-read
- bulkRead filters instanceIds to env before delegating to AlertReadRepository
- VIEWER+ for all endpoints; env isolation enforced by requireInstance
- 7 IT tests: list, env isolation, unread-count, ack flow, read, bulk-read, viewer access

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:28:55 +02:00
hsiegeln
c1b34f592b feat(alerting): AlertRuleController with attribute-key SQL injection validation (Task 32)
- POST/GET/PUT/DELETE /environments/{envSlug}/alerts/rules CRUD
- POST /{id}/enable, /{id}/disable, /{id}/render-preview, /{id}/test-evaluate
- Attribute-key validation: rejects keys not matching ^[a-zA-Z0-9._-]+$ at rule-save time
  (CRITICAL: ExchangeMatchCondition attribute keys are inlined into ClickHouse SQL)
- Webhook validation: verifies outboundConnectionId exists and is allowed in env
- Null-safe notification template defaults to "" for NOT NULL DB constraint
- Fixed misleading comment in ClickHouseSearchIndex to document validation contract
- OPERATOR+ for mutations, VIEWER+ for reads
- Audit: ALERT_RULE_CREATE/UPDATE/DELETE/ENABLE/DISABLE with AuditCategory.ALERT_RULE_CHANGE
- 11 IT tests covering RBAC, SQL-injection prevention, enable/disable, audit, render-preview

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:28:46 +02:00
hsiegeln
d3dd8882bd feat(alerting): InAppInboxQuery with 5s unread-count memoization
listInbox resolves user groups+roles via RbacService.getEffectiveGroupsForUser
/ getEffectiveRolesForUser then delegates to AlertInstanceRepository.
countUnread memoized per (envId, userId) with 5s TTL via ConcurrentHashMap
using a controllable Clock. 6 unit tests covering delegation, cache hit,
TTL expiry, and isolation between users/envs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 20:25:00 +02:00
hsiegeln
6b48bc63bf feat(alerting): NotificationDispatchJob outbox loop with silence + retry
Claim-polling SchedulingConfigurer: claims due notifications, resolves
instance/connection/rule, checks active silences, dispatches via
WebhookDispatcher, classifies outcomes into DELIVERED/FAILED/retry.
Guards null rule/env after deletion. 5 Testcontainers ITs: 200/503/404
outcomes, active silence suppression, deleted connection fast-fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 20:24:54 +02:00
hsiegeln
466aceb920 feat(alerting): WebhookDispatcher with HMAC + TLS + retry classification
Renders URL/headers/body with Mustache, optionally HMAC-signs the body
(X-Cameleer-Signature), supports POST/PUT/PATCH, classifies 2xx/4xx/5xx
into DELIVERED/FAILED/retry. 8 WireMock-backed IT tests including HTTPS
TRUST_ALL against WireMock self-signed cert.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 20:24:47 +02:00