feat(alerting): Plan 03 — UI + backfills (SSRF guard, metrics caching, docker stack) #144
Summary
Plan 03 delivers the full alerting UI on top of Plan 02's backend, plus two backend backfills (SSRF guard on outbound URLs + 30s TTL cache on `AlertingMetrics` gauges) and a complete local docker-compose stack mirroring the k8s manifests in `deploy/`.

UI additions (all under `ui/src/pages/Alerts/` + `ui/src/components/`):

- `/alerts/{inbox,all,history,rules,rules/new,rules/:id,silences}` routes (lazy-loaded).
- `NotificationBell` in TopBar that polls `/alerts/unread-count` every 30s (paused when tab hidden via TanStack Query `refetchIntervalInBackground`).
- Rule editor wizard with a form per condition kind (`ROUTE_METRIC`, `EXCHANGE_MATCH`, `AGENT_STATE`, `DEPLOYMENT_STATE`, `LOG_PATTERN`, `JVM_METRIC`).
- `<MustacheEditor />` (CodeMirror 6) with variable autocomplete + inline linter. Registry (`alert-variables.ts`) mirrors `NotificationContextBuilder` leaves.
- `AlertStateChip`, `SeverityBadge`, `InboxPage` (bulk-read), `AllAlertsPage` (state filter), `HistoryPage`, `RulesListPage` (enable/disable + delete + promote), `SilencesPage` (matcher-based create + end-early).
- CMD-K: `alert` + `alertRule` result categories via the existing `LayoutShell` searchData extension point.
- Query hooks: `alerts.ts`, `alertRules.ts`, `alertSilences.ts`, `alertNotifications.ts`, `alertMeta.ts`. All env-scoped via `useSelectedEnv`.

Backend backfills:

- `SsrfGuard`: rejects outbound webhook URLs that resolve to loopback, link-local, RFC-1918 private ranges, or IPv6 ULA. Wired into `OutboundConnectionServiceImpl.create/update`. Bypass via `cameleer.server.outbound-http.allow-private-targets=true` for dev.
- `AlertingMetrics` gauges now wrap their Postgres-backed suppliers in a 30s TTL cache so Prometheus scrapes don't produce per-scrape DB queries (final-review NIT from Plan 02).
- `@Autowired` on the `AlertingMetrics` production constructor so Spring picks it over the package-private test-friendly one (Task 29 refactor had introduced ambiguity).

Docker stack (new `docker-compose.yml` mirroring `deploy/`):

- `cameleer-postgres` (matches `deploy/cameleer-postgres.yaml`).
- `cameleer-clickhouse` (matches `deploy/cameleer-clickhouse.yaml`; `CLICKHOUSE_DB=cameleer`).
- `cameleer-server` built from the repo `Dockerfile` (`REGISTRY_TOKEN` now optional; cameleer-common is public).
- `cameleer-ui` built from `ui/Dockerfile` on host `:8080` so Vite dev (`npm run dev:local`) keeps `:5173` free.
- `cameleer-seed`: one-shot service that seeds `user_id='admin'` in `tenant_default.users` after the server is healthy, bridging a pre-existing FK mismatch between `UserRepository` storage (prefixed `user:admin`) and alerting-controller usage (stripped `admin`). The root-cause fix belongs in a future backend cleanup.

Docs + rules:

- `.claude/rules/ui.md`: Alerts section mapping every Plan 03 UI surface.
- `docs/alerting.md`: UI walkthrough (sidebar / bell / wizard / Mustache autocomplete / env promotion / CMD-K).

Plan + spec

- `docs/superpowers/specs/2026-04-19-alerting-design.md` §9, §12, §13, §17.
- `docs/superpowers/plans/2026-04-20-alerting-03-ui.md`.

Supersedes

- `chore/openapi-regen-post-plan02`: delete that branch after merge.

Test plan

- UI unit tests (`cd ui && npm test`). Covers query hooks, CM6 completion + linter, Mustache variable registry, wizard form-state, promotion prefill, AlertStateChip, SeverityBadge, NotificationBell, usePageVisible.
- `cd ui && npx tsc -p tsconfig.app.json --noEmit` → zero errors.
- `cd ui && npm run build` (RuleEditorWizard chunk ~120 KB gzip incl. CM6).
- Backend tests (`mvn -pl cameleer-server-app -am test -Dtest='SsrfGuardTest,AlertingMetricsCachingTest,OutboundConnectionSsrfIT' -Dsurefire.failIfNoSpecifiedTests=false`): 8 SsrfGuard + 2 AlertingMetrics caching + 1 SSRF admin-controller IT.
- `OutboundConnectionAdminControllerIT` 9/9 pass with `allow-private-targets=true` in test profile.
- Playwright E2E smoke (`cd ui && npx playwright test`). Covers sidebar nav, rule CRUD via wizard, CMD-K open/close, silence create + end-early.
- Follow-up: reconcile `UserRepository` storage with alerting/outbound controller stripping so the compose seeder becomes redundant.

End-to-end `fire → ack → clear` is covered server-side by Plan 02's `AlertingFullLifecycleIT`. UI E2E for that path would require event injection into ClickHouse and is out of scope.

🤖 Generated with Claude Code
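
Reviewer note on the SSRF backfill: the summary says the guard rejects URLs that resolve to loopback, link-local, RFC-1918 private ranges, or IPv6 ULA, with a dev bypass flag. A minimal sketch of that kind of check follows. It is not the PR's `SsrfGuard`: the class name `SsrfGuardSketch`, the `permits` method, and the boolean return (instead of rejecting inside `create/update`) are all assumptions made for illustration.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

/** Illustrative only: names here are hypothetical, not the PR's SsrfGuard. */
final class SsrfGuardSketch {
    // Stands in for cameleer.server.outbound-http.allow-private-targets.
    private final boolean allowPrivateTargets;

    SsrfGuardSketch(boolean allowPrivateTargets) {
        this.allowPrivateTargets = allowPrivateTargets;
    }

    /** True for loopback, link-local, site-local, or IPv6 ULA (fc00::/7) addresses. */
    static boolean isPrivateTarget(InetAddress addr) {
        // isSiteLocalAddress covers the RFC 1918 ranges for IPv4
        // (and the deprecated fec0::/10 range for IPv6).
        if (addr.isLoopbackAddress() || addr.isLinkLocalAddress() || addr.isSiteLocalAddress()) {
            return true;
        }
        if (addr instanceof Inet6Address) {
            // ULA addresses start with the seven bits 1111110.
            return (addr.getAddress()[0] & 0xFE) == 0xFC;
        }
        return false;
    }

    /** Returns false when any address the host resolves to is a private target. */
    boolean permits(String url) {
        if (allowPrivateTargets) {
            return true; // dev bypass
        }
        String host = URI.create(url).getHost();
        if (host == null) {
            return false;
        }
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (isPrivateTarget(addr)) {
                    return false;
                }
            }
            return true;
        } catch (UnknownHostException e) {
            return false; // sketch choice: an unresolvable host cannot be verified
        }
    }
}
```

Checking every resolved address, rather than only the first, matters when a hostname has both public and private records. Whether the real guard does the same is not stated in this PR.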
Install @codemirror/{view,state,autocomplete,commands,language,lint} and @lezer/common, needed by Phase 3's MustacheEditor (Task 13). CM6 picked over a raw textarea for its small incremental-rendering bundle, full ARIA/keyboard support, and pluggable autocomplete + linter APIs that map cleanly to Mustache token parsing.

Add ui/playwright.config.ts wiring Task 30's E2E smoke:
- testDir ./src/test/e2e, single worker, trace+screenshot on failure
- webServer launches `npm run dev:local` (backend on :8081 required)
- PLAYWRIGHT_BASE_URL env var skips the dev server for CI against a pre-deployed UI

Add test:e2e / test:e2e:ui npm scripts and exclude Playwright's test-results/ and playwright-report/ from git. @playwright/test itself was already in devDependencies from an earlier task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

NotificationBell used a usePageVisible() subscription that re-rendered on every visibilitychange without consuming the value. TanStack Query's refetchIntervalInBackground:false already pauses polling; the extra subscription was speculative generality. Dropped the import + call + JSDoc reference; usePageVisible hook + test retained as a reusable primitive.

Also: alerts.test.tsx 'returns the server payload unmodified' asserted a pre-plan {total, bySeverity} shape, but UnreadCountResponse is actually {count}.
Fixed mock + assertion to {count: 3}.

Plan prose had spec §8 idealized leaves, but the backend NotificationContext only emits a subset:
- ROUTE_METRIC / EXCHANGE_MATCH → route.id + route.uri (uri added)
- LOG_PATTERN → log.pattern + log.matchCount (renamed from log.logger/level/message)
- app.slug / app.id → scoped to non-env kinds (removed from 'always')
- exchange.link / alert.comparator / alert.window / app.displayName → removed (backend doesn't emit)

Without this alignment the Task 11 linter would (1) flag valid route.uri as unknown, (2) suggest log.{logger,level,message} as valid paths that render empty, and (3) flag app.slug on env-wide rules.

Completion fires after {{ and narrows as the user types; apply() closes the tag automatically. Linter raises an error on unclosed {{, a warning on references that aren't in the allowed-variable set for the current condition kind. Kind-specific allowed set comes from availableVariables().

Wizard navigates 5 steps (scope/condition/trigger/notify/review) with per-step validation. form-state module is the single source of truth for the rule form; initialForm/toRequest/validateStep are unit-tested (6 tests). Step components are stubbed and will be implemented in Tasks 20-24. prefillFromPromotion is a thin wrapper in this commit; Task 24 rewrites it to compute scope-adjustment warnings.

Deviation notes:
- FormState.targets uses {kind, targetId} to match AlertRuleTarget DTO field names (plan draft had targetKind).
- toRequest casts through Record<string, unknown> so the spread over the Partial<AlertCondition> union typechecks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Each condition kind (ROUTE_METRIC, EXCHANGE_MATCH, AGENT_STATE, DEPLOYMENT_STATE, LOG_PATTERN, JVM_METRIC) renders its own payload-shape form. Changing the kind resets the condition payload to {kind, scope} so stale fields from a previous kind don't leak into the save request.

Deviation: DS Select uses native event-based onChange.
Plan draft showed a value-based signature (onChange(v) => ...).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three numeric inputs for evaluation cadence, for-duration, and re-notification window, plus a Test evaluate button for saved rules. TestEvaluateRequest is empty on the wire (server uses the rule id), so we send {} and rely on the backend to evaluate the current saved state.

Deviation: plan draft passed {condition: toRequest(form).condition} into the request body. The generated TestEvaluateRequest type is Record<string, never>, so we send an empty body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Title and message use MustacheEditor with kind-specific autocomplete. Preview button posts to the render-preview endpoint and shows rendered title/message inline. Targets combine users/groups/roles into a unified Badge pill list. Webhook picker filters to outbound connections allowed in the current env (spec 6, allowed_environment_ids). Header overrides use plain Input rather than MustacheEditor for now.

Deviations:
- RenderPreviewRequest is Record<string, never>, so we send {} instead of {titleTemplate, messageTemplate}; backend resolves from rule state.
- RenderPreviewResponse has {title, message} (plan draft used renderedTitle/renderedMessage).
- Button size="sm" not "small" (DS only accepts sm|md).
- Target kind field renamed from targetKind to kind to match AlertRuleTarget DTO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Review step dumps a human summary plus raw request JSON, and (when a setter is supplied) offers an Enabled-on-save Toggle. Promotion prefill now returns {form, warnings}: clears agent IDs (per-env), flags missing apps in target env, and flags webhook connections not allowed in target env. 4 Vitest cases cover copy-name, agent clear, app-missing, and webhook-not-allowed paths. The wizard now consumes {form, warnings}; Task 25 renders the warnings banner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extends operationalSearchData with open alerts (FIRING|ACKNOWLEDGED) and all rules. Badges convey severity + state. Selecting an alert navigates to /alerts/inbox/{id}; a rule navigates to /alerts/rules/{id}. Uses the existing CommandPalette extension point; no new registry.

Prometheus scrapes can fire every few seconds. The open-alerts / open-rules gauges query Postgres on each read; caching the values for 30s amortises that to one query per half-minute. Addresses final-review NIT from Plan 02.

- Introduces a package-private TtlCache that wraps a Supplier<Long> and memoises the last read for a configurable Duration against a Supplier<Instant> clock.
- Wraps each gauge supplier (alerting_rules_total{enabled|disabled}, alerting_instances_total{state}) in its own TtlCache.
- Adds a test-friendly constructor (package-private) taking explicit Duration + Supplier<Instant> so AlertingMetricsCachingTest can advance a fake clock without waiting wall-clock time.
- Adds AlertingMetricsCachingTest covering:
  * supplier invoked once per TTL across repeated scrapes
  * 29 s elapsed → still cached; 31 s elapsed → re-queried
  * gauge value reflects the cached result even after delegate mutates

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
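
Reviewer note on the caching commit: the message describes a TtlCache that wraps a `Supplier<Long>` and memoises the last read for a `Duration`, measured against a `Supplier<Instant>` clock. The sketch below shows one way that shape could look. It is not the code from this PR: the class name `TtlCacheSketch`, the field names, the `synchronized` locking, and the choice to refresh at exactly the TTL boundary are assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Illustrative only: a hypothetical stand-in for the package-private TtlCache. */
final class TtlCacheSketch implements Supplier<Long> {
    private final Supplier<Long> delegate;  // e.g. a Postgres-backed count query
    private final Duration ttl;             // 30 s in the PR
    private final Supplier<Instant> clock;  // injectable so tests can use a fake clock

    private Instant loadedAt;               // null until the first read
    private long cached;

    TtlCacheSketch(Supplier<Long> delegate, Duration ttl, Supplier<Instant> clock) {
        this.delegate = delegate;
        this.ttl = ttl;
        this.clock = clock;
    }

    @Override
    public synchronized Long get() {
        Instant now = clock.get();
        // Re-query on the first read, or once at least one full TTL has elapsed.
        if (loadedAt == null || Duration.between(loadedAt, now).compareTo(ttl) >= 0) {
            cached = delegate.get();
            loadedAt = now;
        }
        return cached;
    }
}
```

With a 30 s TTL and a fake clock, a read at 29 s returns the memoised value and a read at 31 s calls the delegate again, which is the behaviour the commit says AlertingMetricsCachingTest covers.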