Commit Graph

1339 Commits

Author SHA1 Message Date
hsiegeln
5ebc729b82 feat(alerting): SSRF guard on outbound connection URL
Rejects webhook URLs that resolve to loopback, link-local, or RFC-1918
private ranges (IPv4 + IPv6 ULA fc00::/7). Enforced on both create and
update in OutboundConnectionServiceImpl before persistence; returns 400
Bad Request with "private or loopback" in the body.

Bypass via `cameleer.server.outbound-http.allow-private-targets=true`
for dev environments where webhooks legitimately point at local
services. Production default is `false`.

Test profile sets the flag to `true` in application-test.yml so the
existing ITs that post webhooks to WireMock on https://localhost:PORT
keep working. A dedicated OutboundConnectionSsrfIT overrides the flag
back to false (via @TestPropertySource + @DirtiesContext) to exercise
the reject path end-to-end through the admin controller.

Plan 01 scope; required before SaaS exposure (spec §17).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:17:44 +02:00
hsiegeln
f4c2cb120b feat(ui/alerts): CMD-K sources for alerts + alert rules
Extends operationalSearchData with open alerts (FIRING|ACKNOWLEDGED) and
all rules. Badges convey severity + state. Selecting an alert navigates to
/alerts/inbox/{id}; a rule navigates to /alerts/rules/{id}. Uses the
existing CommandPalette extension point — no new registry.
2026-04-20 14:09:39 +02:00
hsiegeln
8689643e11 feat(ui/alerts): SilencesPage with matcher-based create + end-early action
Matcher accepts ruleId and/or appSlug. Server enforces endsAt > startsAt
(V12 CHECK constraint) and matcher_matches() at dispatch time (spec §7).
2026-04-20 14:08:27 +02:00
hsiegeln
0191ca4b13 feat(ui/alerts): render promotion warnings in wizard banner
Fetches target-env apps (useCatalog) and env-allowed outbound
connections, passes them to prefillFromPromotion, and renders the
returned warnings in an amber banner above the step nav. Warnings list
the field name and the remediation message so users see crossings that
need manual adjustment before saving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:05:08 +02:00
hsiegeln
3963ea5591 feat(ui/alerts): ReviewStep + promotion prefill warnings
Review step dumps a human summary plus raw request JSON, and (when a
setter is supplied) offers an Enabled-on-save Toggle. Promotion prefill
now returns {form, warnings}: clears agent IDs (per-env), flags missing
apps in target env, and flags webhook connections not allowed in target
env. 4 Vitest cases cover copy-name, agent clear, app-missing, and
webhook-not-allowed paths.

The wizard now consumes {form, warnings}; Task 25 renders the warnings
banner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:04:04 +02:00
hsiegeln
816096f4d1 feat(ui/alerts): NotifyStep (MustacheEditor for title/message/body, targets, webhook bindings)
Title and message use MustacheEditor with kind-specific autocomplete.
Preview button posts to the render-preview endpoint and shows rendered
title/message inline. Targets combine users/groups/roles into a unified
Badge pill list. Webhook picker filters to outbound connections allowed
in the current env (spec 6, allowed_environment_ids). Header overrides
use plain Input rather than MustacheEditor for now.

Deviations:
 - RenderPreviewRequest is Record<string, never>, so we send {} instead
   of {titleTemplate, messageTemplate}; backend resolves from rule state.
 - RenderPreviewResponse has {title, message} (plan draft used
   renderedTitle/renderedMessage).
 - Button size="sm" not "small" (DS only accepts sm|md).
 - Target kind field renamed from targetKind to kind to match
   AlertRuleTarget DTO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:02:26 +02:00
hsiegeln
d42a6ca6a8 feat(ui/alerts): TriggerStep (evaluation interval, for-duration, re-notify, test-evaluate)
Three numeric inputs for evaluation cadence, for-duration, and
re-notification window, plus a Test evaluate button for saved rules.
TestEvaluateRequest is empty on the wire (server uses the rule id), so
we send {} and rely on the backend to evaluate the current saved state.

Deviation: plan draft passed {condition: toRequest(form).condition} into
the request body. The generated TestEvaluateRequest type is
Record<string, never>, so we send an empty body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:00:33 +02:00
hsiegeln
ef8c60c2b5 feat(ui/alerts): ConditionStep with 6 kind-specific forms
Each condition kind (ROUTE_METRIC, EXCHANGE_MATCH, AGENT_STATE,
DEPLOYMENT_STATE, LOG_PATTERN, JVM_METRIC) renders its own payload-shape
form. Changing the kind resets the condition payload to {kind, scope} so
stale fields from a previous kind don't leak into the save request.

Deviation: DS Select uses native event-based onChange. Plan draft showed
a value-based signature (onChange(v) => ...).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:59:55 +02:00
hsiegeln
f48fc750f2 feat(ui/alerts): ScopeStep (name, severity, env/app/route/agent selectors)
Name, description, severity, scope-kind radio, and cascading app/route/
agent selectors driven by catalog + agents data. Adjusts condition
routing by clearing routeId/agentId when the app changes.

Deviation: DS Select uses native event-based onChange; plan draft had
a value-based signature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:58:21 +02:00
hsiegeln
334e815c25 feat(ui/alerts): rule editor wizard shell + form-state module
Wizard navigates 5 steps (scope/condition/trigger/notify/review) with
per-step validation. form-state module is the single source of truth for
the rule form; initialForm/toRequest/validateStep are unit-tested (6
tests). Step components are stubbed and will be implemented in Tasks
20-24. prefillFromPromotion is a thin wrapper in this commit; Task 24
rewrites it to compute scope-adjustment warnings.

Deviation notes:
 - FormState.targets uses {kind, targetId} to match AlertRuleTarget DTO
   field names (plan draft had targetKind).
 - toRequest casts through Record<string, unknown> so the spread over
   the Partial<AlertCondition> union typechecks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:57:30 +02:00
hsiegeln
7e91459cd6 feat(ui/alerts): RulesListPage with enable/disable, delete, env promotion
Promotion dropdown builds a /alerts/rules/new URL with promoteFrom, ruleId,
and targetEnv query params — the wizard will read these in Task 24 and
pre-fill the form with source-env prefill + client-side warnings.
2026-04-20 13:52:14 +02:00
hsiegeln
269a63af1f feat(ui/alerts): AllAlertsPage + HistoryPage
AllAlertsPage: state filter chips (Open/Firing/Acked/All).
HistoryPage: RESOLVED filter, respects retention window.
2026-04-20 13:49:52 +02:00
hsiegeln
8d8bae4e18 feat(ui/alerts): InboxPage with ack + bulk-read actions
AlertRow is reused by AllAlertsPage and HistoryPage. Marking a row as read
happens when its link is followed (the detail sub-route will be added in
phase 10 polish). FIRING rows get an amber left border.
2026-04-20 13:49:23 +02:00
hsiegeln
891dcaef32 feat(ui/alerts): mount NotificationBell in TopBar
Renders the `<NotificationBell />` as the first child of `<TopBar>` (before
`<SearchTrigger>`). The bell links to `/alerts/inbox` and shows the unread
alert count for the currently selected environment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:47:16 +02:00
hsiegeln
54e4217e21 feat(ui/alerts): Alerts sidebar section with Inbox/All/Rules/Silences/History
Adds `buildAlertsTreeNodes` to sidebar-utils and renders an Alerts section
between Applications and Starred in LayoutShell. The section uses an
accordion pattern — entering `/alerts/*` collapses apps/admin/starred and
restores their state on leave.

gitnexus_impact(LayoutContent, upstream) = LOW (0 direct callers; rendered
only by LayoutShell's provider wrapper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:46:49 +02:00
hsiegeln
167d0ebd42 feat(ui/alerts): register /alerts/* routes with placeholder pages
Adds 6 lazy-loaded route entries for the alerting UI (Inbox, All, History,
Rules list, Rule editor wizard, Silences) plus an `/alerts` → `/alerts/inbox`
redirect. Page components are placeholder stubs to be replaced in Phase 5/6/7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:44:44 +02:00
hsiegeln
019e79a362 feat(ui/alerts): MustacheEditor component (CM6 shell with completion + linter)
Wires the mustache-completion source and mustache-linter into a CodeMirror 6
EditorView. Accepts kind (filters variables) and reducedContext (env-only for
connection URLs). singleLine prevents newlines for URL/header fields. Host
ref syncs when the parent replaces value (promotion prefill).
2026-04-20 13:41:46 +02:00
hsiegeln
ac2a943feb feat(ui/alerts): CM6 completion + linter for Mustache templates
completion fires after {{ and narrows as the user types; apply() closes the
tag automatically. Linter raises an error on unclosed {{, a warning on
references that aren't in the allowed-variable set for the current condition
kind. Kind-specific allowed set comes from availableVariables().
2026-04-20 13:39:10 +02:00
hsiegeln
18e6dde67a fix(ui/alerts): align ALERT_VARIABLES registry with NotificationContextBuilder
Plan prose had spec §8 idealized leaves, but the backend NotificationContext
only emits a subset:
  ROUTE_METRIC / EXCHANGE_MATCH → route.id + route.uri (uri added)
  LOG_PATTERN → log.pattern + log.matchCount (renamed from log.logger/level/message)
  app.slug / app.id → scoped to non-env kinds (removed from 'always')
  exchange.link / alert.comparator / alert.window / app.displayName → removed (backend doesn't emit)

Without this alignment the Task 11 linter would (1) flag valid route.uri as
unknown, (2) suggest log.{logger,level,message} as valid paths that render
empty, and (3) flag app.slug on env-wide rules.
2026-04-20 13:35:42 +02:00
hsiegeln
5ddd89f883 feat(ui/alerts): Mustache variable metadata registry for autocomplete
ALERT_VARIABLES mirrors the spec §8 context map. availableVariables(kind)
returns the kind-specific filter (always vars + kind vars). extractReferences
+ unknownReferences drive the inline amber linter. Backend NotificationContext
adds must land here too.
2026-04-20 13:31:55 +02:00
hsiegeln
38083d7c3f fix(ui/alerts): remove dead usePageVisible subscription; align alerts test mock with DTO
NotificationBell used a usePageVisible() subscription that re-rendered on
every visibilitychange without consuming the value. TanStack Query's
refetchIntervalInBackground:false already pauses polling; the extra
subscription was speculative generality. Dropped the import + call + JSDoc
reference; usePageVisible hook + test retained as a reusable primitive.

Also: alerts.test.tsx 'returns the server payload unmodified' asserted a
pre-plan {total, bySeverity} shape, but UnreadCountResponse is actually
{count}. Fixed mock + assertion to {count: 3}.
2026-04-20 13:30:07 +02:00
hsiegeln
197c60126c feat(ui/alerts): NotificationBell with Page Visibility poll pause
Adds a header bell component linking to /alerts/inbox with an unread-count
badge for the selected environment. Polling pauses when the tab is hidden
via TanStack Query's refetchIntervalInBackground:false (already set on
useUnreadCount); the new usePageVisible hook gives components a
re-renders-on-visibility-change signal for future defense-in-depth.

Plan-prose deviation: the plan assumed UnreadCountResponse carries a
bySeverity map for per-severity badge coloring, but the backend DTO only
exposes a scalar `count`. The bell reads `data?.count` and renders a single
var(--error) tint; a TODO references spec §13 for future per-severity work
that would require expanding the DTO.

Tests: usePageVisible toggles on visibilitychange events; NotificationBell
renders the bell with no badge at count=0 and shows "3" at count=3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:26:58 +02:00
hsiegeln
31ee974830 feat(ui/alerts): AlertStateChip + SeverityBadge components
State colors follow the convention from @cameleer/design-system (CRITICAL->error,
WARNING->warning, INFO->auto). Silenced pill stacks next to state for the spec
section 8 audit-trail surface.
2026-04-20 13:21:37 +02:00
hsiegeln
51bc796bec feat(ui/alerts): silence + notification query hooks 2026-04-20 13:16:57 +02:00
hsiegeln
c6c3dd9cfe feat(ui/alerts): alert rule query hooks (CRUD, enable/disable, preview, test-evaluate)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:13:07 +02:00
hsiegeln
82c29f46a5 fix(ui/alerts): bulk-read body uses instanceIds to match BulkReadRequest DTO
Plan 03 prose had 'alertInstanceIds'; backend record is 'instanceIds'.
2026-04-20 13:08:22 +02:00
hsiegeln
83a8912da6 feat(ui/alerts): TanStack Query hooks for /alerts endpoints
Adds env-scoped hooks for the alerts inbox:
- useAlerts (30s poll, background-paused, filter-aware)
- useAlert, useUnreadCount (30s poll)
- useAckAlert, useMarkAlertRead, useBulkReadAlerts (mutations that
  invalidate the alerts query key tree + unread-count)

Test file uses .tsx because the QueryClientProvider wrapper relies on
JSX; vitest picks up both .ts and .tsx via the configured include glob.
Client mock targets the actual export name (`api` in ../client) rather
than the `apiClient` alias that alertMeta re-exports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:07:16 +02:00
hsiegeln
1a8b9eb41b feat(ui/alerts): shared env helper for alerting query hooks 2026-04-20 13:00:49 +02:00
hsiegeln
b2066cdb68 chore(ui): regenerate openapi.json + schema.d.ts from deployed Plan 02 backend
Fetched from http://192.168.50.86:30090/api/v1/api-docs via
`npm run generate-api:live`. Adds TypeScript types for the new alerting
REST surface merged in #140:

- 15 alerting paths under /environments/{envSlug}/alerts/** (rules CRUD,
  enable/disable, render-preview, test-evaluate, inbox, unread-count,
  ack/read/bulk-read, silences CRUD, per-alert notifications)
- 1 flat notification retry path /alerts/notifications/{id}/retry
- 4 outbound-connection admin paths (from Plan 01 #139)

Verified tsc -p tsconfig.app.json --noEmit exits 0 — no existing SPA
call sites break against the fresh types. Plan 03 UI work can consume
these directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:59:38 +02:00
hsiegeln
1260cbe674 chore(ui): add CodeMirror 6 + Playwright config
Install @codemirror/{view,state,autocomplete,commands,language,lint}
and @lezer/common — needed by Phase 3's MustacheEditor (Task 13).
CM6 picked over a raw textarea for its small incremental-rendering
bundle, full ARIA/keyboard support, and pluggable autocomplete +
linter APIs that map cleanly to Mustache token parsing.

Add ui/playwright.config.ts wiring Task 30's E2E smoke:
- testDir ./src/test/e2e, single worker, trace+screenshot on failure
- webServer launches `npm run dev:local` (backend on :8081 required)
- PLAYWRIGHT_BASE_URL env var skips the dev server for CI against a
  pre-deployed UI

Add test:e2e / test:e2e:ui npm scripts and exclude Playwright's
test-results/ and playwright-report/ from git. @playwright/test
itself was already in devDependencies from an earlier task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:55:57 +02:00
hsiegeln
0aa1776b57 chore(ui): add Vitest + Testing Library scaffolding
Prepares for Plan 03 unit tests (MustacheEditor, NotificationBell, wizard step
validation). jsdom environment + jest-dom matchers + canary test verifies the
wiring.
2026-04-20 12:16:22 +02:00
hsiegeln
2942025a54 docs(alerting): Plan 03 — UI + backfills implementation plan
32 tasks across 10 phases:
 - Foundation: Vitest, CodeMirror 6, Playwright scaffolding + schema regen.
 - API: env-scoped query hooks for alerts/rules/silences/notifications.
 - Components: AlertStateChip, SeverityBadge, NotificationBell (with tab-hidden poll pause), MustacheEditor (CM6 with variable autocomplete + linter).
 - Routes: /alerts/* section with sidebar accordion; bell mounted in TopBar.
 - Pages: Inbox / All / History / Rules (with env promotion) / Silences.
 - Wizard: 5-step editor with kind-specific condition forms + test-evaluate + render-preview + prefill warnings.
 - CMD-K: alerts + rules sources via LayoutShell extension.
 - Backend backfills: SSRF guard on outbound URL + 30s AlertingMetrics gauge cache.
 - Final: Playwright smoke, .claude/rules/ui.md + admin-guide updates, full build/test/PR.

Decisions: CM6 over Monaco/textarea (90KB gzipped, ARIA-conformant); CMD-K extension via existing LayoutShell searchData (not a new registry); REST-API-driven tests per project test policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:12:21 +02:00
5373ac6541 Merge pull request 'fix(alerting): ClickHouseSearchIndex bean registered as concrete type (hotfix: production crashloop)' (#141) from fix/alerting-searchindex-bean-type into main
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m53s
CI / docker (push) Successful in 31s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 42s
Reviewed-on: #141
2026-04-20 10:24:49 +02:00
hsiegeln
c9c93ac565 fix(alerting): declare ClickHouseSearchIndex bean as concrete type
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 3m6s
CI / cleanup-branch (pull_request) Has been skipped
CI / build (pull_request) Successful in 4m5s
CI / docker (pull_request) Has been skipped
CI / deploy (pull_request) Has been skipped
CI / deploy-feature (pull_request) Has been skipped
CI / docker (push) Successful in 1m37s
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Successful in 39s
Production crashlooped on startup: ExchangeMatchEvaluator autowires the
concrete ClickHouseSearchIndex (for countExecutionsForAlerting, which
lives only on the concrete class, not the SearchIndex interface), but
StorageBeanConfig declared the bean with interface return type SearchIndex.
Spring matches autowire candidates by declared bean type, not by runtime
instance class, so the concrete-typed autowire failed with:

  Parameter 0 of constructor in ExchangeMatchEvaluator required a bean
  of type 'ClickHouseSearchIndex' that could not be found.

ClickHouseLogStore's bean is already declared with the concrete return
type (line 171), which is why LogPatternEvaluator autowires fine.

All alerting ITs passed pre-merge because AbstractPostgresIT replaces the
clickHouseSearchIndex bean with @MockBean(name=...) whose declared type
IS the concrete ClickHouseSearchIndex. The mock masked the prod bug.

Follow-up: remove @MockBean(name="clickHouseSearchIndex") from
AbstractPostgresIT so the real bean graph is exercised by alerting ITs
(and add a SpringContextSmokeIT that loads the context with no mocks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 09:11:47 +02:00
ca78aa3962 Merge pull request 'feat(alerting): Plan 02 — backend (domain, storage, evaluators, dispatch)' (#140) from feat/alerting-02-backend into main
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m58s
CI / docker (push) Successful in 34s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Failing after 2m17s
2026-04-20 09:03:15 +02:00
b763155a60 Merge pull request 'feat(alerting): Plan 01 — outbound HTTP infra + admin-managed outbound connections' (#139) from feat/alerting-01-outbound-infra into main
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m53s
CI / docker (push) Successful in 36s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
Reviewed-on: #139
2026-04-20 08:57:40 +02:00
hsiegeln
aa9e93369f docs(alerting): add V11-V13 migration entries to CLAUDE.md
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 3m35s
CI / docker (push) Successful in 4m34s
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Failing after 2m10s
Documents the three Flyway migrations added by the alerting feature branch
so future sessions have an accurate migration map.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:27:39 +02:00
hsiegeln
b0ba08e572 test(alerting): rewrite AlertingFullLifecycleIT — REST-driven rule creation, re-notify cadence
Rule creation now goes through POST /alerts/rules (exercises saveTargets on the
write path). Clock is replaced with @MockBean(name="alertingClock") and re-stubbed
in @BeforeEach to survive Mockito's inter-test reset. Six ordered steps:

  1. seed log → tick evaluator → assert FIRING instance with non-empty targets (B-1)
  2. tick dispatcher → assert DELIVERED notification + lastNotifiedAt stamped (B-2)
  3. ack via REST → assert ACKNOWLEDGED state
  4. create silence → inject PENDING notification → tick dispatcher → assert silenced (FAILED)
  5. delete rule → assert rule_id nullified, rule_snapshot preserved (ON DELETE SET NULL)
  6. new rule with reNotifyMinutes=1 → first dispatch → advance clock 61s →
     evaluator sweep → second dispatch → verify 2 WireMock POSTs (B-2 cadence)

Background scheduler races addressed by resetting claimed_by/claimed_until before
each manual tick. Simulated clock set AFTER log insert to guarantee log timestamp
falls within the evaluator window. Re-notify notifications backdated in Postgres
to work around the simulated vs real clock gap in claimDueNotifications.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:26:38 +02:00
hsiegeln
2c82b50ea2 fix(alerting/B-1): AlertStateTransitions.newInstance() propagates rule targets to AlertInstance
newInstance() now maps rule.targets() into targetUserIds/targetGroupIds/targetRoleNames
so newly created AlertInstance rows carry the correct target arrays.
Previously these were always empty List.of(), making the inbox query return nothing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:26:25 +02:00
hsiegeln
7e79ff4d98 fix(alerting/I-2): add unique partial index on alert_instances(rule_id) for open states
V13 migration creates alert_instances_open_rule_uq — a partial unique index on
(rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED'), preventing
duplicate open instances per rule. PostgresAlertInstanceRepository.save() catches
DuplicateKeyException and returns the existing open instance instead of failing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:26:07 +02:00
hsiegeln
424894a3e2 fix(alerting/I-1): retry endpoint resets attempts to 0 instead of incrementing
AlertNotificationRepository gains resetForRetry(UUID, Instant) which sets
attempts=0, status=PENDING, next_attempt_at=now, and clears claim/response
fields. AlertNotificationController calls resetForRetry instead of
scheduleRetry so a manual retry always starts from a clean slate.

AlertNotificationControllerIT adds retryResetsAttemptsToZero to verify
attempts==0 and status==PENDING after three prior markFailed calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:25:59 +02:00
hsiegeln
d74079da63 fix(alerting/B-2): implement re-notify cadence sweep and lastNotifiedAt tracking
AlertInstanceRepository gains listFiringDueForReNotify(Instant) — only returns
instances where last_notified_at IS NOT NULL and cadence has elapsed (IS NULL
branch excluded: sweep only re-notifies, initial notify is the dispatcher's job).

AlertEvaluatorJob.sweepReNotify() runs at the end of each tick, enqueues fresh
notifications for eligible instances and stamps last_notified_at.

NotificationDispatchJob stamps last_notified_at on the alert_instance when a
notification is DELIVERED, providing the anchor timestamp for cadence checks.

PostgresAlertInstanceRepositoryIT adds listFiringDueForReNotify test covering
the three-rule eligibility matrix (never-notified, long-ago, recent).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:25:50 +02:00
hsiegeln
3f036da03d fix(alerting/B-1): PostgresAlertRuleRepository.save() now persists alert_rule_targets
saveTargets() is called unconditionally at the end of save() — it deletes
existing targets and re-inserts from the current targets list. findById()
and listByEnvironment() already call withTargets() so reads are consistent.
PostgresAlertRuleRepositoryIT adds saveTargets_roundtrip and
saveTargets_updateReplacesExistingTargets to cover the new write path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:25:39 +02:00
hsiegeln
8bf45d5456 fix(alerting): use ALTER TABLE MODIFY SETTING to enable projections on executions ReplacingMergeTree
Investigated three approaches for CH 24.12:
- Inline SETTINGS on ADD PROJECTION: rejected (UNKNOWN_SETTING — not a query-level setting).
- ALTER TABLE MODIFY SETTING deduplicate_merge_projection_mode='rebuild': works; persists in
  table metadata across connection restarts; runs before ADD PROJECTION in the SQL script.
- Session-level JDBC URL param: not pursued (MODIFY SETTING is strictly better).

alerting_projections.sql now runs MODIFY SETTING before the two executions ADD PROJECTIONs.
AlertingProjectionsIT strengthened to assert all four projections (including alerting_app_status
and alerting_route_status on executions) exist after schema init.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 07:36:55 +02:00
hsiegeln
f1abca3a45 refactor(alerting): rename P95_LATENCY_MS → AVG_DURATION_MS to match what stats_1m_route exposes
The evaluator mapped P95_LATENCY_MS to ExecutionStats.avgDurationMs because
stats_1m_route has no p95 column. Exposing the old name implied p95 semantics
operators did not get. Rename to AVG_DURATION_MS makes the contract honest.
Updated RouteMetric enum (with javadoc), evaluator switch, and admin guide.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 07:36:43 +02:00
hsiegeln
144915563c docs(alerting): whole-branch final review report
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 07:25:33 +02:00
hsiegeln
c79a6234af test(alerting): fix duplicate @MockBean after AbstractPostgresIT centralised mocks + Plan 02 verification report
AbstractPostgresIT gained clickHouseSearchIndex and agentRegistryService mocks in Phase 9.
All 14 alerting IT subclasses that re-declared the same @MockBean fields now fail with
"Duplicate mock definition". Removed the redundant declarations; per-class clickHouseLogStore
mock kept where needed. 120 alerting tests now pass (0 failures).

Also adds docs/alerting-02-verification.md (Task 43).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 23:27:19 +02:00
hsiegeln
63669bd1d7 docs(alerting): default config + admin guide
Adds alerting stanza to application.yml with all AlertingProperties
fields backed by env-var overrides.  Creates docs/alerting.md covering
six condition kinds (with example JSON), template variables, webhook
setup (Slack/PagerDuty examples), silence patterns, circuit-breaker
and retention troubleshooting, and Prometheus metrics reference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 22:16:38 +02:00
hsiegeln
840a71df94 feat(alerting): observability metrics via micrometer
AlertingMetrics @Component wraps MeterRegistry:
- Counters: alerting_eval_errors_total{kind}, alerting_circuit_opened_total{kind},
  alerting_notifications_total{status}
- Timers: alerting_eval_duration_seconds{kind}, alerting_webhook_delivery_duration_seconds
- Gauges (DB-backed): alerting_rules_total{state}, alerting_instances_total{state}

AlertEvaluatorJob records evalError + evalDuration around each evaluator call.
PerKindCircuitBreaker detects open transitions and fires metrics.circuitOpened(kind).
AlertingBeanConfig wires AlertingMetrics into the circuit breaker post-construction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 22:16:30 +02:00
hsiegeln
1ab21bc019 feat(alerting): AlertingRetentionJob daily cleanup
Nightly @Scheduled(03:00) job deletes RESOLVED alert_instances older
than eventRetentionDays and DELIVERED/FAILED alert_notifications older
than notificationRetentionDays.  Uses injected Clock for testability.
IT covers: old-resolved deleted, fresh-resolved kept, FIRING kept
regardless of age, PENDING notification never deleted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 22:16:21 +02:00