diff --git a/docs/superpowers/specs/2026-04-20-alerting-04-hardening-design.md b/docs/superpowers/specs/2026-04-20-alerting-04-hardening-design.md new file mode 100644 index 00000000..55e544cf --- /dev/null +++ b/docs/superpowers/specs/2026-04-20-alerting-04-hardening-design.md @@ -0,0 +1,403 @@ +# Alerting — Plan 04 Design Spec: Post-Ship Hardening + +**Date:** 2026-04-20 +**Status:** Draft — awaiting user review +**Surfaces:** server (core + app), UI, tests +**Related:** [Plan 03](../plans/2026-04-20-alerting-03-ui.md) (shipped as PR #144); builds on [Alerting design](2026-04-19-alerting-design.md) + +--- + +## 1. Summary + +Close the loop on three bug classes that surfaced in Plan 03's post-merge triage: + +1. **Spring wiring regressions not caught by unit tests** — `AlertingMetrics`'s production constructor was missing `@Autowired` and only broke at Docker runtime (commit `5edf7eb2`). Unit tests exercised the package-private ctor, so green locally. +2. **UI↔backend drift on Mustache template variables** — `ALERT_VARIABLES` in the UI and `NotificationContextBuilder` on the server were two hand-maintained lists of the same thing; one drift was fixed in commit `18e6dde6`. +3. **Hand-maintained TypeScript enum unions** — `ui/src/pages/Alerts/enums.ts` declares `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` manually because springdoc emits a broken polymorphic schema (condition types resolve to `never` in TS due to discriminator-literal conflict). + +Plan 04 is **pure hardening**: no new product features. Five tasks that together eliminate these bug classes at their source rather than adding another test that detects them. + +### Guiding principles + +- **Prefer SSOT over drift-detect.** When two hand-maintained lists need to match, make one derive from the other rather than adding a test that fails when they don't. +- **Type correctness at the contract layer.** Backend records are the contract; if a field has a closed vocabulary, it is an enum. String-typed-with-comment does not survive schema regeneration. +- **Close the loop on Plan 03, don't start new product work.** Features (#8 native integrations, #9 CA-bundle UI), performance (#7 load tests), and deferred cleanup (#6 OIDC post-alpha, #10 router lint sweep) stay out of scope — each is a separate future plan. + +--- + +## 2. Scope + +### In scope + +1. **Context-startup smoke test** — a `@SpringBootTest` that loads the full production application context and asserts the alerting beans are present. Catches `@Autowired` regressions at `mvn verify` time. +2. **Template-variables SSOT endpoint** — new `GET /api/v1/alerts/template-variables` returning the declarative registry. UI deletes its duplicate; autocomplete + linter consume the server registry via TanStack Query. +3. **Second Playwright spec** — rule EDIT path, MustacheEditor autocomplete + linter in-browser, env-promotion warnings banner, inbox ack, bulk-read. +4. **Typed vocabulary for 5 String condition fields** — `AgentStateCondition.state`, `DeploymentStateCondition.states`, `LogPatternCondition.level`, `ExchangeFilter.status` become proper Java enums. `JvmMetricCondition.metric` stays `String` + `@Schema(allowableValues={...})` because agents report raw Micrometer metric names that the evaluator passes through unchanged. OpenAPI emits closed unions for all five; UI types derive. +5. **Springdoc discriminator mapping** — add `@Schema(discriminatorProperty, discriminatorMapping)` to the `AlertCondition` sealed interface. TS stops generating `kind: ""` literals. Hand-declared unions in `ui/src/pages/Alerts/enums.ts` migrate to derived types. + +### Out of scope (explicit) + +To keep Plan 04 focused, none of these are touched in this plan. Each has a separate issue or a future plan slot: + +- Native Slack / Teams / PagerDuty integrations (Plan 03 spec BL-002 — future plan). +- Admin CA-bundle UI ([gitea #137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137) / BL-001). +- Router.tsx lint sweep (207 fast-refresh warnings, pre-existed Plan 03 — standalone lint PR). +- Performance ITs for 500 rules × 5 replicas (Plan 03 spec deferral — separate plan, different shape). +- OIDC / UserAdmin re-audit (commit `ae647363` was minimal; re-audit post-alpha). +- Orphan `.worktrees/` cleanup (user holds authorization). + +--- + +## 3. Task 1 — Context-startup smoke test + +### Problem + +Plan 03 shipped a regression where `AlertingMetrics`'s production constructor was missing `@Autowired`. Unit tests instantiated it via a package-private ctor, so local builds and CI were green. Spring context-load failed only when the full app started under Docker. The entire class of regression (missing annotation on any alerting-subsystem bean) is invisible to the current test pyramid. + +### Design + +**Location:** `cameleer-server-app/src/test/java/com/cameleer/server/app/ContextStartupSmokeTest.java` + +**Shape:** + +```java +@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE) +@ActiveProfiles("test-context-smoke") +class ContextStartupSmokeTest { + @Autowired ApplicationContext ctx; + + @Test + void productionContextLoadsWithAllAlertingBeans() { + assertNotNull(ctx.getBean(AlertingMetrics.class)); + assertNotNull(ctx.getBean(AlertEvaluatorJob.class)); + assertNotNull(ctx.getBean(NotificationDispatchJob.class)); + assertNotNull(ctx.getBean(NotificationContextBuilder.class)); + assertNotNull(ctx.getBean(WebhookDispatcher.class)); + assertNotNull(ctx.getBean(MustacheRenderer.class)); + assertNotNull(ctx.getBean(AlertingRetentionJob.class)); + assertNotNull(ctx.getBean(PerKindCircuitBreaker.class)); + // + one assertion per ConditionEvaluator impl (6 evaluators) + } +} +``` + +**Profile stubs:** `test-context-smoke` profile replaces `DataSource`, `ClickHouseJdbcTemplate`, and `RuntimeOrchestrator` with Mockito mocks so the test doesn't need Testcontainers. Real beans for everything alerting. + +### Rationale + +- **Why not extend an existing `@SpringBootTest`?** Existing ITs use Testcontainers for Postgres + ClickHouse. They catch query-path regressions but are slow and duplicate work we don't need. The bug class here — missing `@Autowired` on a ctor — surfaces during bean creation regardless of whether dependencies are real or stubbed. +- **Why assert individual beans, not just "context loads"?** A context loads if *any* bean breaks is still a passing test if the broken bean is conditional. Naming each bean makes intent explicit and failure messages readable. + +### Validation + +Intentionally break one bean (remove `@Autowired` or `@Component`) and confirm the test fails with a clear message. Revert before committing. + +--- + +## 4. Task 2 — Template-variables SSOT endpoint + +### Problem + +`ui/src/components/MustacheEditor/alert-variables.ts` is a hand-maintained list of template variable paths (`alert.rule.name`, `exchange.status`, etc.) mirroring what `NotificationContextBuilder` actually produces. On 2026-04-20, commit `18e6dde6` fixed one drift between them. Until drift is structurally prevented, it will recur. + +### Design + +#### Backend + +1. **Declarative registry in core.** Extract `NotificationContextBuilder`'s leaf paths into a static constant: + + ```java + public record TemplateVariableDescriptor( + String path, // "alert.rule.name" + String type, // "string" | "number" | "boolean" | "datetime" + String description, + String example + ) {} + + public final class AlertTemplateVariables { + public static final List ALL = List.of( + new TemplateVariableDescriptor("alert.rule.name", "string", "Rule display name", "High error rate"), + new TemplateVariableDescriptor("alert.firing_at", "datetime", "ISO-8601 firing timestamp", "2026-04-20T12:34:56Z"), + // ... exhaustive list + ); + } + ``` + +2. **Refactor `NotificationContextBuilder`** to build its context from this registry, so divergence is structural impossibility rather than convention. + +3. **Round-trip unit test.** `AlertTemplateVariablesTest`: for every descriptor in `ALL`, confirm the path resolves to a non-null value when `NotificationContextBuilder.build(...)` runs against a canonical sample firing instance. Fails if a variable is added to `ALL` but the builder doesn't produce it (or vice versa). + +4. **Controller.** `AlertTemplateVariablesController` at `GET /api/v1/alerts/template-variables`: + - **Flat endpoint** (added to allow-list in `.claude/rules/app-classes.md`, reason: registry is tenant- and env-agnostic global metadata). + - VIEWER+ (operators authoring webhook bodies are typically VIEWER+). + - Returns `List` — identical shape to `TemplateVariableDescriptor`. + - Response is deterministic and cacheable (`Cache-Control: public, max-age=3600` — the registry changes only across server versions). + +#### Frontend + +1. **New query hook:** `ui/src/api/queries/alertTemplateVariables.ts` exposing `useTemplateVariables()`. TanStack Query config: `staleTime: Infinity`, `gcTime: Infinity` — the list never changes at runtime. + +2. **Delete** `ui/src/components/MustacheEditor/alert-variables.ts` entirely. + +3. **Prop-drill the variable list:** + - `MustacheEditor.tsx` takes `variables: TemplateVariable[]` as a prop. + - `mustache-completion.ts` and `mustache-linter.ts` become functions taking the list as an argument instead of importing a module. + - `NotifyStep.tsx` calls `useTemplateVariables()` once and passes the result. + - Webhook body/header editors do the same. + +4. **Loading state:** while the variables are still being fetched, the MustacheEditor shows the textarea without autocomplete rather than blocking. The query runs on mount of `NotifyStep` (or whichever component opens first) and the result is cached forever. + +### Rationale + +- **Why a runtime endpoint, not a build-time JSON artifact?** The registry is also useful to admins outside the wizard — e.g. generated docs, tooling that inspects the schema. A REST endpoint is idiomatic API surface; a generated JSON file is build plumbing that couples UI to server at build time. +- **Why `staleTime: Infinity`?** The registry only changes when the server version changes, at which point the SPA is also redeployed (fresh fetch on load). No point polling. + +--- + +## 5. Task 3 — Second Playwright spec + +### Problem + +Existing `ui/src/test/e2e/alerting.spec.ts` covers CREATE + DELETE of rules and the sidebar structure. It doesn't exercise the EDIT path, MustacheEditor autocomplete in the browser, the env-promotion warnings banner, or inbox interactions. + +### Design + +**File:** `ui/src/test/e2e/alerting-editor.spec.ts` (new; existing smoke stays unchanged) + +**Cases:** + +1. **Rule EDIT.** Seed a rule via backend API, navigate to `/alerts/rules/:id`, change `threshold` field, save, reload page, assert threshold persisted. +2. **MustacheEditor autocomplete.** On NotifyStep, focus the notification title field, type `{{`, assert CodeMirror completion popover appears listing `alert.rule.name` (and friends), arrow-down + Enter, assert insertion. +3. **MustacheEditor linter.** Type `{{alert.not_a_real_var}}`, assert a linter diagnostic appears within 500 ms. +4. **Env-promotion warnings.** Seed a rule in env A with `scope.routeId` referencing a route that doesn't exist in env B. Open RulesListPage, promote to env B, assert the wizard opens with a warning banner listing the missing route. +5. **Inbox ack.** Seed a FIRING alert via API, navigate to `/alerts/inbox`, click Ack button on the row, assert state chip flips to ACKNOWLEDGED and row's Ack button disables. +6. **Bulk-read.** Seed 3 FIRING alerts, select-all via header checkbox, click bulk-read button, assert notification-bell unread-count becomes 0. + +### Stability + +- Use `page.waitForFunction` on CodeMirror's `EditorView.completionStatus` rather than `waitForTimeout` for autocomplete assertions. +- Gate CI on 3 consecutive green runs of the new spec before merging (reviewer checks the CI history, not a scripted gate). +- Seed fixtures via API, not via page navigation — each test independent, no ordering dependency. + +### Rationale + +Playwright tests are the only layer that exercises CodeMirror's live behaviour. Unit tests mock it; integration tests skip it. The cost of flakiness mitigation (three green runs, `waitForFunction` over `waitForTimeout`) buys the coverage that bit us in Plan 03 triage. + +--- + +## 6. Task 4 — Typed vocabulary for condition String fields + +### Problem + +Five condition fields are typed as `String` in the Java records, so springdoc emits `"type": "string"` without `enum`. UI has no way to derive a closed union and either (a) cannot render a dropdown or (b) hard-codes the values. See the top-of-file comment in `ui/src/pages/Alerts/enums.ts` for context. + +### Design + +Four field migrations to proper Java enums + one schema-only hint: + +| Field | Target type | Existing or new? | +|---|---|---| +| `AgentStateCondition.state` | `com.cameleer.server.core.agent.AgentState` (enum) | existing (`LIVE, STALE, DEAD, SHUTDOWN`) | +| `DeploymentStateCondition.states` | `List` (enum) | existing (`STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED`) | +| `LogPatternCondition.level` | new `com.cameleer.server.core.alerting.LogLevel` (enum) | new — `TRACE, DEBUG, INFO, WARN, ERROR` | +| `ExchangeFilter.status` | `com.cameleer.common.model.ExecutionStatus` (enum) | existing in cameleer-common (`RUNNING, COMPLETED, FAILED, ABANDONED`) | +| `JvmMetricCondition.metric` | stays `String` + `@Schema(allowableValues={...})` | schema-only hint — wire stays as agent-reported Micrometer metric names (`jvm.memory.used.value`, etc.) | + +**Why JvmMetric is different.** `JvmMetricEvaluator` passes `c.metric()` directly to `MetricsQueryStore.queryTimeSeries(agentId, List.of(c.metric()), ...)`. The field is a raw Micrometer metric name emitted by the agent — `jvm.memory.used.value`, `process.cpu.usage.value`, `jvm.gc.pause.total_time`, etc. Making this an enum either (a) forces an un-idiomatic mapping between SCREAMING_SNAKE constants and dotted strings, or (b) locks out any metric not in the hard-coded list. Option (a)'s `@Schema(allowableValues=…)` gives springdoc enough to emit a narrowed TS union for UI autocomplete while leaving the backend permissive. The curated list comes from `.claude/rules/metrics.md`. + +### Persistence + +`alert_rules.condition` is stored as JSONB. + +- For the 4 enum migrations: Jackson serializes enum values by `.name()`; on-the-wire payloads already use uppercase tokens (`"COMPLETED"`, not `"Completed"`). Legacy rows round-trip unchanged. +- For `JvmMetricCondition.metric`: no change — it's already a String with dotted-name values. `@Schema(allowableValues=…)` affects OpenAPI emission only, not deserialization or validation. + +**Pre-flight check:** before the first enum-flip commit, run against the local dev DB: + +```sql +SELECT condition_kind, condition->>'level' AS val +FROM alert_rules WHERE condition_kind='LOG_PATTERN' +UNION ALL +SELECT condition_kind, condition->>'state' +FROM alert_rules WHERE condition_kind='AGENT_STATE' +-- etc, one union per migrating field +``` + +If any row has a value that doesn't match the target enum (e.g. lowercase `"info"` instead of `"INFO"`), Plan 04 adds a one-shot Flyway `V15__alert_condition_enum_repair.sql` that normalizes them. If all values match, skip the migration. + +### Validation + +Controller-level `@Valid` on `AlertRuleRequest` automatically enforces enum membership — unknown values produce `400 Bad Request`. Replaces any ad-hoc string checks in evaluators. + +Integration test per migrated field: `POST /rules` with known-bad value → 400; known-good value → 201. + +### UI cleanup + +After each enum migration ships, the corresponding manual artifact in `ui/src/pages/Alerts/enums.ts` (if any) is either deleted (derived type replaces it) or narrowed to label maps only. The top-of-file comment noting "Fields whose backend type is String … aren't in here at all" becomes obsolete and gets deleted. + +--- + +## 7. Task 5 — Springdoc discriminator mapping + +### Problem + +The generated OpenAPI has a discriminator for `AlertCondition` but no `mapping`: + +```json +"AlertCondition": { + "discriminator": { "propertyName": "kind" }, + "properties": { "kind": { "enum": ["ROUTE_METRIC", ...] } } +} +``` + +When `openapi-typescript` encounters a discriminator without a `mapping`, it synthesizes the literal from the schema name. Result: + +```ts +AgentStateCondition: { + kind: "AgentStateCondition"; // synthesized — class simple name +} & (Omit & { + readonly kind?: "ROUTE_METRIC" | ... | "JVM_METRIC"; +}); +``` + +Indexed access like `AgentStateCondition['kind']` is `"AgentStateCondition" & ("ROUTE_METRIC" | ...)` = `never`. `Record` exhaustiveness checks silently collapse. + +### Design + +One edit to `cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/AlertCondition.java`: + +```java +import io.swagger.v3.oas.annotations.media.DiscriminatorMapping; +import io.swagger.v3.oas.annotations.media.Schema; + +@Schema( + discriminatorProperty = "kind", + discriminatorMapping = { + @DiscriminatorMapping(value = "ROUTE_METRIC", schema = RouteMetricCondition.class), + @DiscriminatorMapping(value = "EXCHANGE_MATCH", schema = ExchangeMatchCondition.class), + @DiscriminatorMapping(value = "AGENT_STATE", schema = AgentStateCondition.class), + @DiscriminatorMapping(value = "DEPLOYMENT_STATE", schema = DeploymentStateCondition.class), + @DiscriminatorMapping(value = "LOG_PATTERN", schema = LogPatternCondition.class), + @DiscriminatorMapping(value = "JVM_METRIC", schema = JvmMetricCondition.class) + } +) +@JsonTypeInfo(...) // unchanged +@JsonSubTypes(...) // unchanged +public sealed interface AlertCondition ... +``` + +After `mvn compile && cd ui && npm run generate-api:live`, the schema gains: + +```json +"discriminator": { + "propertyName": "kind", + "mapping": { + "ROUTE_METRIC": "#/components/schemas/RouteMetricCondition", + ... + } +} +``` + +And `openapi-typescript` emits `kind: "ROUTE_METRIC"` instead of `kind: "RouteMetricCondition"`. + +### Fallback + +If springdoc 2.8.6 does not honor `@DiscriminatorMapping` on a sealed interface (sealed types are newer than the library's primary target surface), add a `SpringDocConfig` bean: + +```java +@Bean OpenApiCustomizer alertConditionDiscriminatorFix() { + return openApi -> { + Schema alertCondition = openApi.getComponents().getSchemas().get("AlertCondition"); + alertCondition.getDiscriminator().mapping(Map.of( + "ROUTE_METRIC", "#/components/schemas/RouteMetricCondition", + // ... + )); + }; +} +``` + +~20 lines, no behavioural change for the server — emits the exact same OpenAPI. Plan 04 ships either way. + +### UI cleanup + +Once the schema is correct, `ui/src/pages/Alerts/enums.ts` is simplified: + +- Delete hand-declared `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` unions. +- Replace with derived indexed access: + ```ts + type RouteMetric = NonNullable; + type Comparator = NonNullable; + // etc. + ``` +- Delete the 30-line comment block at the top of the file explaining the workaround. +- Keep `HIDDEN` arrays and label-map `Record` exhaustiveness enforcement — those remain valuable. + +--- + +## 8. Execution Order + +1. **Task 5** (springdoc discriminator mapping) — purely additive annotation. +2. **Task 4** (String → enum migrations, one commit per field). +3. **Task 1** (context-startup smoke — independent, slots anywhere but early catches 4/5 fallout). +4. **Task 2** (template-variables SSOT). +5. **Task 3** (Playwright spec — last, exercises all of the above). + +No hard blocking dependencies between tasks, but this order minimizes rebase pain and keeps each commit small. + +--- + +## 9. Risks + +| Risk | Mitigation | +|---|---| +| Springdoc `@DiscriminatorMapping` on a sealed interface not honored in 2.8.6 | `OpenApiCustomizer` bean fallback (§7). | +| Legacy `alert_rules.condition` JSONB rows have values that don't match new enum names | Pre-flight SQL inspection; `V15` Flyway repair migration only if needed. | +| `NotificationContextBuilder` refactor silently changes rendered notifications | Golden-file integration test — one sample notification per (severity × kind); run before and after refactor. | +| Playwright spec flakiness on CodeMirror autocomplete timing | `waitForFunction` on completion state; 3 consecutive green CI runs before merge. | +| `@Valid` enum enforcement breaks existing rule creation flows in the UI | Task-4 commits bundle `generate-api:live` + typecheck-fixing; any UI call-site change ships with the backend change. | + +--- + +## 10. Acceptance Criteria + +Plan 04 is done when: + +1. Reverting commit `5edf7eb2` (re-introducing the `AlertingMetrics` ctor bug) causes `mvn clean verify` to fail. Revert the revert; test stays. +2. `grep -r "ALERT_VARIABLES" ui/src` shows no hand-maintained registry — only the backend-fetched hook and its consumers. +3. `ui/src/pages/Alerts/enums.ts` has no hand-declared unions for `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` — all four are derived from `schema.d.ts`. The top-of-file explanatory comment is gone. +4. All 5 condition String fields are enum-typed on the wire (inspect `openapi.json`) and in `schema.d.ts`. +5. `alerting-editor.spec.ts` green in CI for 3 consecutive runs. +6. `.claude/rules/app-classes.md` updated: new `AlertTemplateVariablesController` listed under the flat-endpoint allow-list with the reason ("registry is tenant- and env-agnostic global metadata"). +7. `openapi.json` and `schema.d.ts` regenerated and committed alongside the backend changes they reflect. + +--- + +## 11. Estimated Commit Cadence + +Approximate, for sizing: + +1. `feat(alerting): discriminator mapping on AlertCondition` — Task 5 backend. +2. `refactor(ui/alerts): derive polymorphic unions from schema` — Task 5 UI cleanup. +3. `refactor(alerting): AgentStateCondition.state → AgentState enum`. +4. `refactor(alerting): DeploymentStateCondition.states → DeploymentStatus enum`. +5. `refactor(alerting): LogPatternCondition.level → LogLevel enum` (adds new enum). +6. `refactor(alerting): JvmMetricCondition.metric — @Schema(allowableValues) hint` (no type change; OpenAPI hint only). +7. `refactor(alerting): ExchangeFilter.status → ExecutionStatus enum`. +8. `test(alerting): Spring context-startup smoke`. +9. `feat(alerting): template-variables SSOT endpoint` — Task 2 backend. +10. `refactor(ui/alerts): consume template-variables via API` — Task 2 frontend. +11. `test(ui/alerts): Playwright editor spec`. + +~11 commits, bundleable to ~8 if enum flips are trivial. Each atomic and reversible. + +--- + +## 12. CRITICAL process rules (per CLAUDE.md) + +- `gitnexus_impact` on `AlertCondition`, `NotificationContextBuilder`, `AlertingMetrics`, `ExchangeMatchCondition`, each touched subtype before editing. +- `gitnexus_detect_changes` before every commit. +- `cd ui && npm run generate-api:live` after every controller or DTO change. +- `.claude/rules/app-classes.md` updated in the same commit that adds `AlertTemplateVariablesController`. +- `.claude/rules/core-classes.md` updated when the new `LogLevel` enum lands. +- Maven: `mvn clean verify` must stay green through every commit. +- UI: `cd ui && npm run typecheck && npm run test && npm run test:e2e` must stay green through every commit.