Files
cameleer-server/docs/superpowers/specs/2026-04-20-alerting-04-hardening-design.md
hsiegeln 46867cc659 docs(alerting): Plan 04 design spec — post-ship hardening
Closes the loop on three bug classes from Plan 03 triage:
context-load regressions (missing @Autowired), UI/backend drift
on template variables, and hand-maintained TS enum unions caused
by springdoc polymorphic schema quirk.

Covers 5 tasks: context-startup smoke test, template-variables
SSOT endpoint, second Playwright spec, String-to-enum migrations
on 5 condition fields, and @DiscriminatorMapping on AlertCondition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:44:41 +02:00

404 lines
23 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Alerting — Plan 04 Design Spec: Post-Ship Hardening
**Date:** 2026-04-20
**Status:** Draft — awaiting user review
**Surfaces:** server (core + app), UI, tests
**Related:** [Plan 03](../plans/2026-04-20-alerting-03-ui.md) (shipped as PR #144); builds on [Alerting design](2026-04-19-alerting-design.md)
---
## 1. Summary
Close the loop on three bug classes that surfaced in Plan 03's post-merge triage:
1. **Spring wiring regressions not caught by unit tests**`AlertingMetrics`'s production constructor was missing `@Autowired` and only broke at Docker runtime (commit `5edf7eb2`). Unit tests exercised the package-private ctor, so green locally.
2. **UI↔backend drift on Mustache template variables**`ALERT_VARIABLES` in the UI and `NotificationContextBuilder` on the server were two hand-maintained lists of the same thing; one drift was fixed in commit `18e6dde6`.
3. **Hand-maintained TypeScript enum unions**`ui/src/pages/Alerts/enums.ts` declares `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` manually because springdoc emits a broken polymorphic schema (condition types resolve to `never` in TS due to discriminator-literal conflict).
Plan 04 is **pure hardening**: no new product features. Five tasks that together eliminate these bug classes at their source rather than adding another test that detects them.
### Guiding principles
- **Prefer SSOT over drift-detect.** When two hand-maintained lists need to match, make one derive from the other rather than adding a test that fails when they don't.
- **Type correctness at the contract layer.** Backend records are the contract; if a field has a closed vocabulary, it is an enum. String-typed-with-comment does not survive schema regeneration.
- **Close the loop on Plan 03, don't start new product work.** Features (#8 native integrations, #9 CA-bundle UI), performance (#7 load tests), and deferred cleanup (#6 OIDC post-alpha, #10 router lint sweep) stay out of scope — each is a separate future plan.
---
## 2. Scope
### In scope
1. **Context-startup smoke test** — a `@SpringBootTest` that loads the full production application context and asserts the alerting beans are present. Catches `@Autowired` regressions at `mvn verify` time.
2. **Template-variables SSOT endpoint** — new `GET /api/v1/alerts/template-variables` returning the declarative registry. UI deletes its duplicate; autocomplete + linter consume the server registry via TanStack Query.
3. **Second Playwright spec** — rule EDIT path, MustacheEditor autocomplete + linter in-browser, env-promotion warnings banner, inbox ack, bulk-read.
4. **Typed vocabulary for 5 String condition fields**`AgentStateCondition.state`, `DeploymentStateCondition.states`, `LogPatternCondition.level`, `ExchangeFilter.status` become proper Java enums. `JvmMetricCondition.metric` stays `String` + `@Schema(allowableValues={...})` because agents report raw Micrometer metric names that the evaluator passes through unchanged. OpenAPI emits closed unions for all five; UI types derive.
5. **Springdoc discriminator mapping** — add `@Schema(discriminatorProperty, discriminatorMapping)` to the `AlertCondition` sealed interface. TS stops generating `kind: "<ClassName>"` literals. Hand-declared unions in `ui/src/pages/Alerts/enums.ts` migrate to derived types.
### Out of scope (explicit)
To keep Plan 04 focused, none of these are touched in this plan. Each has a separate issue or a future plan slot:
- Native Slack / Teams / PagerDuty integrations (Plan 03 spec BL-002 — future plan).
- Admin CA-bundle UI ([gitea #137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137) / BL-001).
- Router.tsx lint sweep (207 fast-refresh warnings, pre-existed Plan 03 — standalone lint PR).
- Performance ITs for 500 rules × 5 replicas (Plan 03 spec deferral — separate plan, different shape).
- OIDC / UserAdmin re-audit (commit `ae647363` was minimal; re-audit post-alpha).
- Orphan `.worktrees/` cleanup (user holds authorization).
---
## 3. Task 1 — Context-startup smoke test
### Problem
Plan 03 shipped a regression where `AlertingMetrics`'s production constructor was missing `@Autowired`. Unit tests instantiated it via a package-private ctor, so local builds and CI were green. Spring context-load failed only when the full app started under Docker. The entire class of regression (missing annotation on any alerting-subsystem bean) is invisible to the current test pyramid.
### Design
**Location:** `cameleer-server-app/src/test/java/com/cameleer/server/app/ContextStartupSmokeTest.java`
**Shape:**
```java
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE)
@ActiveProfiles("test-context-smoke")
class ContextStartupSmokeTest {
@Autowired ApplicationContext ctx;
@Test
void productionContextLoadsWithAllAlertingBeans() {
assertNotNull(ctx.getBean(AlertingMetrics.class));
assertNotNull(ctx.getBean(AlertEvaluatorJob.class));
assertNotNull(ctx.getBean(NotificationDispatchJob.class));
assertNotNull(ctx.getBean(NotificationContextBuilder.class));
assertNotNull(ctx.getBean(WebhookDispatcher.class));
assertNotNull(ctx.getBean(MustacheRenderer.class));
assertNotNull(ctx.getBean(AlertingRetentionJob.class));
assertNotNull(ctx.getBean(PerKindCircuitBreaker.class));
// + one assertion per ConditionEvaluator impl (6 evaluators)
}
}
```
**Profile stubs:** `test-context-smoke` profile replaces `DataSource`, `ClickHouseJdbcTemplate`, and `RuntimeOrchestrator` with Mockito mocks so the test doesn't need Testcontainers. Real beans for everything alerting.
### Rationale
- **Why not extend an existing `@SpringBootTest`?** Existing ITs use Testcontainers for Postgres + ClickHouse. They catch query-path regressions but are slow and duplicate work we don't need. The bug class here — missing `@Autowired` on a ctor — surfaces during bean creation regardless of whether dependencies are real or stubbed.
- **Why assert individual beans, not just "context loads"?** A context loads if *any* bean breaks is still a passing test if the broken bean is conditional. Naming each bean makes intent explicit and failure messages readable.
### Validation
Intentionally break one bean (remove `@Autowired` or `@Component`) and confirm the test fails with a clear message. Revert before committing.
---
## 4. Task 2 — Template-variables SSOT endpoint
### Problem
`ui/src/components/MustacheEditor/alert-variables.ts` is a hand-maintained list of template variable paths (`alert.rule.name`, `exchange.status`, etc.) mirroring what `NotificationContextBuilder` actually produces. On 2026-04-20, commit `18e6dde6` fixed one drift between them. Until drift is structurally prevented, it will recur.
### Design
#### Backend
1. **Declarative registry in core.** Extract `NotificationContextBuilder`'s leaf paths into a static constant:
```java
public record TemplateVariableDescriptor(
String path, // "alert.rule.name"
String type, // "string" | "number" | "boolean" | "datetime"
String description,
String example
) {}
public final class AlertTemplateVariables {
public static final List<TemplateVariableDescriptor> ALL = List.of(
new TemplateVariableDescriptor("alert.rule.name", "string", "Rule display name", "High error rate"),
new TemplateVariableDescriptor("alert.firing_at", "datetime", "ISO-8601 firing timestamp", "2026-04-20T12:34:56Z"),
// ... exhaustive list
);
}
```
2. **Refactor `NotificationContextBuilder`** to build its context from this registry, so divergence is structural impossibility rather than convention.
3. **Round-trip unit test.** `AlertTemplateVariablesTest`: for every descriptor in `ALL`, confirm the path resolves to a non-null value when `NotificationContextBuilder.build(...)` runs against a canonical sample firing instance. Fails if a variable is added to `ALL` but the builder doesn't produce it (or vice versa).
4. **Controller.** `AlertTemplateVariablesController` at `GET /api/v1/alerts/template-variables`:
- **Flat endpoint** (added to allow-list in `.claude/rules/app-classes.md`, reason: registry is tenant- and env-agnostic global metadata).
- VIEWER+ (operators authoring webhook bodies are typically VIEWER+).
- Returns `List<TemplateVariableDto>` — identical shape to `TemplateVariableDescriptor`.
- Response is deterministic and cacheable (`Cache-Control: public, max-age=3600` — the registry changes only across server versions).
#### Frontend
1. **New query hook:** `ui/src/api/queries/alertTemplateVariables.ts` exposing `useTemplateVariables()`. TanStack Query config: `staleTime: Infinity`, `gcTime: Infinity` — the list never changes at runtime.
2. **Delete** `ui/src/components/MustacheEditor/alert-variables.ts` entirely.
3. **Prop-drill the variable list:**
- `MustacheEditor.tsx` takes `variables: TemplateVariable[]` as a prop.
- `mustache-completion.ts` and `mustache-linter.ts` become functions taking the list as an argument instead of importing a module.
- `NotifyStep.tsx` calls `useTemplateVariables()` once and passes the result.
- Webhook body/header editors do the same.
4. **Loading state:** while the variables are still being fetched, the MustacheEditor shows the textarea without autocomplete rather than blocking. The query runs on mount of `NotifyStep` (or whichever component opens first) and the result is cached forever.
### Rationale
- **Why a runtime endpoint, not a build-time JSON artifact?** The registry is also useful to admins outside the wizard — e.g. generated docs, tooling that inspects the schema. A REST endpoint is idiomatic API surface; a generated JSON file is build plumbing that couples UI to server at build time.
- **Why `staleTime: Infinity`?** The registry only changes when the server version changes, at which point the SPA is also redeployed (fresh fetch on load). No point polling.
---
## 5. Task 3 — Second Playwright spec
### Problem
Existing `ui/src/test/e2e/alerting.spec.ts` covers CREATE + DELETE of rules and the sidebar structure. It doesn't exercise the EDIT path, MustacheEditor autocomplete in the browser, the env-promotion warnings banner, or inbox interactions.
### Design
**File:** `ui/src/test/e2e/alerting-editor.spec.ts` (new; existing smoke stays unchanged)
**Cases:**
1. **Rule EDIT.** Seed a rule via backend API, navigate to `/alerts/rules/:id`, change `threshold` field, save, reload page, assert threshold persisted.
2. **MustacheEditor autocomplete.** On NotifyStep, focus the notification title field, type `{{`, assert CodeMirror completion popover appears listing `alert.rule.name` (and friends), arrow-down + Enter, assert insertion.
3. **MustacheEditor linter.** Type `{{alert.not_a_real_var}}`, assert a linter diagnostic appears within 500 ms.
4. **Env-promotion warnings.** Seed a rule in env A with `scope.routeId` referencing a route that doesn't exist in env B. Open RulesListPage, promote to env B, assert the wizard opens with a warning banner listing the missing route.
5. **Inbox ack.** Seed a FIRING alert via API, navigate to `/alerts/inbox`, click Ack button on the row, assert state chip flips to ACKNOWLEDGED and row's Ack button disables.
6. **Bulk-read.** Seed 3 FIRING alerts, select-all via header checkbox, click bulk-read button, assert notification-bell unread-count becomes 0.
### Stability
- Use `page.waitForFunction` on CodeMirror's `EditorView.completionStatus` rather than `waitForTimeout` for autocomplete assertions.
- Gate CI on 3 consecutive green runs of the new spec before merging (reviewer checks the CI history, not a scripted gate).
- Seed fixtures via API, not via page navigation — each test independent, no ordering dependency.
### Rationale
Playwright tests are the only layer that exercises CodeMirror's live behaviour. Unit tests mock it; integration tests skip it. The cost of flakiness mitigation (three green runs, `waitForFunction` over `waitForTimeout`) buys the coverage that bit us in Plan 03 triage.
---
## 6. Task 4 — Typed vocabulary for condition String fields
### Problem
Five condition fields are typed as `String` in the Java records, so springdoc emits `"type": "string"` without `enum`. UI has no way to derive a closed union and either (a) cannot render a dropdown or (b) hard-codes the values. See the top-of-file comment in `ui/src/pages/Alerts/enums.ts` for context.
### Design
Four field migrations to proper Java enums + one schema-only hint:
| Field | Target type | Existing or new? |
|---|---|---|
| `AgentStateCondition.state` | `com.cameleer.server.core.agent.AgentState` (enum) | existing (`LIVE, STALE, DEAD, SHUTDOWN`) |
| `DeploymentStateCondition.states` | `List<com.cameleer.server.core.runtime.DeploymentStatus>` (enum) | existing (`STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED`) |
| `LogPatternCondition.level` | new `com.cameleer.server.core.alerting.LogLevel` (enum) | new — `TRACE, DEBUG, INFO, WARN, ERROR` |
| `ExchangeFilter.status` | `com.cameleer.common.model.ExecutionStatus` (enum) | existing in cameleer-common (`RUNNING, COMPLETED, FAILED, ABANDONED`) |
| `JvmMetricCondition.metric` | stays `String` + `@Schema(allowableValues={...})` | schema-only hint — wire stays as agent-reported Micrometer metric names (`jvm.memory.used.value`, etc.) |
**Why JvmMetric is different.** `JvmMetricEvaluator` passes `c.metric()` directly to `MetricsQueryStore.queryTimeSeries(agentId, List.of(c.metric()), ...)`. The field is a raw Micrometer metric name emitted by the agent — `jvm.memory.used.value`, `process.cpu.usage.value`, `jvm.gc.pause.total_time`, etc. Making this an enum either (a) forces an un-idiomatic mapping between SCREAMING_SNAKE constants and dotted strings, or (b) locks out any metric not in the hard-coded list. Option (a)'s `@Schema(allowableValues=…)` gives springdoc enough to emit a narrowed TS union for UI autocomplete while leaving the backend permissive. The curated list comes from `.claude/rules/metrics.md`.
### Persistence
`alert_rules.condition` is stored as JSONB.
- For the 4 enum migrations: Jackson serializes enum values by `.name()`; on-the-wire payloads already use uppercase tokens (`"COMPLETED"`, not `"Completed"`). Legacy rows round-trip unchanged.
- For `JvmMetricCondition.metric`: no change — it's already a String with dotted-name values. `@Schema(allowableValues=…)` affects OpenAPI emission only, not deserialization or validation.
**Pre-flight check:** before the first enum-flip commit, run against the local dev DB:
```sql
SELECT condition_kind, condition->>'level' AS val
FROM alert_rules WHERE condition_kind='LOG_PATTERN'
UNION ALL
SELECT condition_kind, condition->>'state'
FROM alert_rules WHERE condition_kind='AGENT_STATE'
-- etc, one union per migrating field
```
If any row has a value that doesn't match the target enum (e.g. lowercase `"info"` instead of `"INFO"`), Plan 04 adds a one-shot Flyway `V15__alert_condition_enum_repair.sql` that normalizes them. If all values match, skip the migration.
### Validation
Controller-level `@Valid` on `AlertRuleRequest` automatically enforces enum membership — unknown values produce `400 Bad Request`. Replaces any ad-hoc string checks in evaluators.
Integration test per migrated field: `POST /rules` with known-bad value → 400; known-good value → 201.
### UI cleanup
After each enum migration ships, the corresponding manual artifact in `ui/src/pages/Alerts/enums.ts` (if any) is either deleted (derived type replaces it) or narrowed to label maps only. The top-of-file comment noting "Fields whose backend type is String … aren't in here at all" becomes obsolete and gets deleted.
---
## 7. Task 5 — Springdoc discriminator mapping
### Problem
The generated OpenAPI has a discriminator for `AlertCondition` but no `mapping`:
```json
"AlertCondition": {
"discriminator": { "propertyName": "kind" },
"properties": { "kind": { "enum": ["ROUTE_METRIC", ...] } }
}
```
When `openapi-typescript` encounters a discriminator without a `mapping`, it synthesizes the literal from the schema name. Result:
```ts
AgentStateCondition: {
kind: "AgentStateCondition"; // synthesized — class simple name
} & (Omit<AlertCondition, "kind"> & {
readonly kind?: "ROUTE_METRIC" | ... | "JVM_METRIC";
});
```
Indexed access like `AgentStateCondition['kind']` is `"AgentStateCondition" & ("ROUTE_METRIC" | ...)` = `never`. `Record<T, string>` exhaustiveness checks silently collapse.
### Design
One edit to `cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/AlertCondition.java`:
```java
import io.swagger.v3.oas.annotations.media.DiscriminatorMapping;
import io.swagger.v3.oas.annotations.media.Schema;
@Schema(
discriminatorProperty = "kind",
discriminatorMapping = {
@DiscriminatorMapping(value = "ROUTE_METRIC", schema = RouteMetricCondition.class),
@DiscriminatorMapping(value = "EXCHANGE_MATCH", schema = ExchangeMatchCondition.class),
@DiscriminatorMapping(value = "AGENT_STATE", schema = AgentStateCondition.class),
@DiscriminatorMapping(value = "DEPLOYMENT_STATE", schema = DeploymentStateCondition.class),
@DiscriminatorMapping(value = "LOG_PATTERN", schema = LogPatternCondition.class),
@DiscriminatorMapping(value = "JVM_METRIC", schema = JvmMetricCondition.class)
}
)
@JsonTypeInfo(...) // unchanged
@JsonSubTypes(...) // unchanged
public sealed interface AlertCondition ...
```
After `mvn compile && cd ui && npm run generate-api:live`, the schema gains:
```json
"discriminator": {
"propertyName": "kind",
"mapping": {
"ROUTE_METRIC": "#/components/schemas/RouteMetricCondition",
...
}
}
```
And `openapi-typescript` emits `kind: "ROUTE_METRIC"` instead of `kind: "RouteMetricCondition"`.
### Fallback
If springdoc 2.8.6 does not honor `@DiscriminatorMapping` on a sealed interface (sealed types are newer than the library's primary target surface), add a `SpringDocConfig` bean:
```java
@Bean OpenApiCustomizer alertConditionDiscriminatorFix() {
return openApi -> {
Schema<?> alertCondition = openApi.getComponents().getSchemas().get("AlertCondition");
alertCondition.getDiscriminator().mapping(Map.of(
"ROUTE_METRIC", "#/components/schemas/RouteMetricCondition",
// ...
));
};
}
```
~20 lines, no behavioural change for the server — emits the exact same OpenAPI. Plan 04 ships either way.
### UI cleanup
Once the schema is correct, `ui/src/pages/Alerts/enums.ts` is simplified:
- Delete hand-declared `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` unions.
- Replace with derived indexed access:
```ts
type RouteMetric = NonNullable<components['schemas']['RouteMetricCondition']['metric']>;
type Comparator = NonNullable<components['schemas']['RouteMetricCondition']['comparator']>;
// etc.
```
- Delete the 30-line comment block at the top of the file explaining the workaround.
- Keep `HIDDEN` arrays and label-map `Record<T, string>` exhaustiveness enforcement — those remain valuable.
---
## 8. Execution Order
1. **Task 5** (springdoc discriminator mapping) — purely additive annotation.
2. **Task 4** (String → enum migrations, one commit per field).
3. **Task 1** (context-startup smoke — independent, slots anywhere but early catches 4/5 fallout).
4. **Task 2** (template-variables SSOT).
5. **Task 3** (Playwright spec — last, exercises all of the above).
No hard blocking dependencies between tasks, but this order minimizes rebase pain and keeps each commit small.
---
## 9. Risks
| Risk | Mitigation |
|---|---|
| Springdoc `@DiscriminatorMapping` on a sealed interface not honored in 2.8.6 | `OpenApiCustomizer` bean fallback (§7). |
| Legacy `alert_rules.condition` JSONB rows have values that don't match new enum names | Pre-flight SQL inspection; `V15` Flyway repair migration only if needed. |
| `NotificationContextBuilder` refactor silently changes rendered notifications | Golden-file integration test — one sample notification per (severity × kind); run before and after refactor. |
| Playwright spec flakiness on CodeMirror autocomplete timing | `waitForFunction` on completion state; 3 consecutive green CI runs before merge. |
| `@Valid` enum enforcement breaks existing rule creation flows in the UI | Task-4 commits bundle `generate-api:live` + typecheck-fixing; any UI call-site change ships with the backend change. |
---
## 10. Acceptance Criteria
Plan 04 is done when:
1. Reverting commit `5edf7eb2` (re-introducing the `AlertingMetrics` ctor bug) causes `mvn clean verify` to fail. Revert the revert; test stays.
2. `grep -r "ALERT_VARIABLES" ui/src` shows no hand-maintained registry — only the backend-fetched hook and its consumers.
3. `ui/src/pages/Alerts/enums.ts` has no hand-declared unions for `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` — all four are derived from `schema.d.ts`. The top-of-file explanatory comment is gone.
4. All 5 condition String fields are enum-typed on the wire (inspect `openapi.json`) and in `schema.d.ts`.
5. `alerting-editor.spec.ts` green in CI for 3 consecutive runs.
6. `.claude/rules/app-classes.md` updated: new `AlertTemplateVariablesController` listed under the flat-endpoint allow-list with the reason ("registry is tenant- and env-agnostic global metadata").
7. `openapi.json` and `schema.d.ts` regenerated and committed alongside the backend changes they reflect.
---
## 11. Estimated Commit Cadence
Approximate, for sizing:
1. `feat(alerting): discriminator mapping on AlertCondition` — Task 5 backend.
2. `refactor(ui/alerts): derive polymorphic unions from schema` — Task 5 UI cleanup.
3. `refactor(alerting): AgentStateCondition.state → AgentState enum`.
4. `refactor(alerting): DeploymentStateCondition.states → DeploymentStatus enum`.
5. `refactor(alerting): LogPatternCondition.level → LogLevel enum` (adds new enum).
6. `refactor(alerting): JvmMetricCondition.metric — @Schema(allowableValues) hint` (no type change; OpenAPI hint only).
7. `refactor(alerting): ExchangeFilter.status → ExecutionStatus enum`.
8. `test(alerting): Spring context-startup smoke`.
9. `feat(alerting): template-variables SSOT endpoint` — Task 2 backend.
10. `refactor(ui/alerts): consume template-variables via API` — Task 2 frontend.
11. `test(ui/alerts): Playwright editor spec`.
~11 commits, bundleable to ~8 if enum flips are trivial. Each atomic and reversible.
---
## 12. CRITICAL process rules (per CLAUDE.md)
- `gitnexus_impact` on `AlertCondition`, `NotificationContextBuilder`, `AlertingMetrics`, `ExchangeMatchCondition`, each touched subtype before editing.
- `gitnexus_detect_changes` before every commit.
- `cd ui && npm run generate-api:live` after every controller or DTO change.
- `.claude/rules/app-classes.md` updated in the same commit that adds `AlertTemplateVariablesController`.
- `.claude/rules/core-classes.md` updated when the new `LogLevel` enum lands.
- Maven: `mvn clean verify` must stay green through every commit.
- UI: `cd ui && npm run typecheck && npm run test && npm run test:e2e` must stay green through every commit.