docs(alerting): Plan 04 design spec — post-ship hardening

Closes the loop on three bug classes from Plan 03 triage:
context-load regressions (missing @Autowired), UI/backend drift
on template variables, and hand-maintained TS enum unions caused
by springdoc polymorphic schema quirk.

Covers 5 tasks: context-startup smoke test, template-variables
SSOT endpoint, second Playwright spec, String-to-enum migrations
on 5 condition fields, and @DiscriminatorMapping on AlertCondition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
hsiegeln
2026-04-20 21:33:56 +02:00
parent efa8390108
commit 46867cc659

View File

@@ -0,0 +1,403 @@
# Alerting — Plan 04 Design Spec: Post-Ship Hardening
**Date:** 2026-04-20
**Status:** Draft — awaiting user review
**Surfaces:** server (core + app), UI, tests
**Related:** [Plan 03](../plans/2026-04-20-alerting-03-ui.md) (shipped as PR #144); builds on [Alerting design](2026-04-19-alerting-design.md)
---
## 1. Summary
Close the loop on three bug classes that surfaced in Plan 03's post-merge triage:
1. **Spring wiring regressions not caught by unit tests**`AlertingMetrics`'s production constructor was missing `@Autowired` and only broke at Docker runtime (commit `5edf7eb2`). Unit tests exercised the package-private ctor, so green locally.
2. **UI↔backend drift on Mustache template variables**`ALERT_VARIABLES` in the UI and `NotificationContextBuilder` on the server were two hand-maintained lists of the same thing; one drift was fixed in commit `18e6dde6`.
3. **Hand-maintained TypeScript enum unions**`ui/src/pages/Alerts/enums.ts` declares `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` manually because springdoc emits a broken polymorphic schema (condition types resolve to `never` in TS due to discriminator-literal conflict).
Plan 04 is **pure hardening**: no new product features. Five tasks that together eliminate these bug classes at their source rather than adding another test that detects them.
### Guiding principles
- **Prefer SSOT over drift-detect.** When two hand-maintained lists need to match, make one derive from the other rather than adding a test that fails when they don't.
- **Type correctness at the contract layer.** Backend records are the contract; if a field has a closed vocabulary, it is an enum. String-typed-with-comment does not survive schema regeneration.
- **Close the loop on Plan 03, don't start new product work.** Features (#8 native integrations, #9 CA-bundle UI), performance (#7 load tests), and deferred cleanup (#6 OIDC post-alpha, #10 router lint sweep) stay out of scope — each is a separate future plan.
---
## 2. Scope
### In scope
1. **Context-startup smoke test** — a `@SpringBootTest` that loads the full production application context and asserts the alerting beans are present. Catches `@Autowired` regressions at `mvn verify` time.
2. **Template-variables SSOT endpoint** — new `GET /api/v1/alerts/template-variables` returning the declarative registry. UI deletes its duplicate; autocomplete + linter consume the server registry via TanStack Query.
3. **Second Playwright spec** — rule EDIT path, MustacheEditor autocomplete + linter in-browser, env-promotion warnings banner, inbox ack, bulk-read.
4. **Typed vocabulary for 5 String condition fields**`AgentStateCondition.state`, `DeploymentStateCondition.states`, `LogPatternCondition.level`, `ExchangeFilter.status` become proper Java enums. `JvmMetricCondition.metric` stays `String` + `@Schema(allowableValues={...})` because agents report raw Micrometer metric names that the evaluator passes through unchanged. OpenAPI emits closed unions for all five; UI types derive.
5. **Springdoc discriminator mapping** — add `@Schema(discriminatorProperty, discriminatorMapping)` to the `AlertCondition` sealed interface. TS stops generating `kind: "<ClassName>"` literals. Hand-declared unions in `ui/src/pages/Alerts/enums.ts` migrate to derived types.
### Out of scope (explicit)
To keep Plan 04 focused, none of these are touched in this plan. Each has a separate issue or a future plan slot:
- Native Slack / Teams / PagerDuty integrations (Plan 03 spec BL-002 — future plan).
- Admin CA-bundle UI ([gitea #137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137) / BL-001).
- Router.tsx lint sweep (207 fast-refresh warnings, pre-existed Plan 03 — standalone lint PR).
- Performance ITs for 500 rules × 5 replicas (Plan 03 spec deferral — separate plan, different shape).
- OIDC / UserAdmin re-audit (commit `ae647363` was minimal; re-audit post-alpha).
- Orphan `.worktrees/` cleanup (user holds authorization).
---
## 3. Task 1 — Context-startup smoke test
### Problem
Plan 03 shipped a regression where `AlertingMetrics`'s production constructor was missing `@Autowired`. Unit tests instantiated it via a package-private ctor, so local builds and CI were green. Spring context-load failed only when the full app started under Docker. The entire class of regression (missing annotation on any alerting-subsystem bean) is invisible to the current test pyramid.
### Design
**Location:** `cameleer-server-app/src/test/java/com/cameleer/server/app/ContextStartupSmokeTest.java`
**Shape:**
```java
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE)
@ActiveProfiles("test-context-smoke")
class ContextStartupSmokeTest {
@Autowired ApplicationContext ctx;
@Test
void productionContextLoadsWithAllAlertingBeans() {
assertNotNull(ctx.getBean(AlertingMetrics.class));
assertNotNull(ctx.getBean(AlertEvaluatorJob.class));
assertNotNull(ctx.getBean(NotificationDispatchJob.class));
assertNotNull(ctx.getBean(NotificationContextBuilder.class));
assertNotNull(ctx.getBean(WebhookDispatcher.class));
assertNotNull(ctx.getBean(MustacheRenderer.class));
assertNotNull(ctx.getBean(AlertingRetentionJob.class));
assertNotNull(ctx.getBean(PerKindCircuitBreaker.class));
// + one assertion per ConditionEvaluator impl (6 evaluators)
}
}
```
**Profile stubs:** `test-context-smoke` profile replaces `DataSource`, `ClickHouseJdbcTemplate`, and `RuntimeOrchestrator` with Mockito mocks so the test doesn't need Testcontainers. Real beans for everything alerting.
### Rationale
- **Why not extend an existing `@SpringBootTest`?** Existing ITs use Testcontainers for Postgres + ClickHouse. They catch query-path regressions but are slow and duplicate work we don't need. The bug class here — missing `@Autowired` on a ctor — surfaces during bean creation regardless of whether dependencies are real or stubbed.
- **Why assert individual beans, not just "context loads"?** A context loads if *any* bean breaks is still a passing test if the broken bean is conditional. Naming each bean makes intent explicit and failure messages readable.
### Validation
Intentionally break one bean (remove `@Autowired` or `@Component`) and confirm the test fails with a clear message. Revert before committing.
---
## 4. Task 2 — Template-variables SSOT endpoint
### Problem
`ui/src/components/MustacheEditor/alert-variables.ts` is a hand-maintained list of template variable paths (`alert.rule.name`, `exchange.status`, etc.) mirroring what `NotificationContextBuilder` actually produces. On 2026-04-20, commit `18e6dde6` fixed one drift between them. Until drift is structurally prevented, it will recur.
### Design
#### Backend
1. **Declarative registry in core.** Extract `NotificationContextBuilder`'s leaf paths into a static constant:
```java
public record TemplateVariableDescriptor(
String path, // "alert.rule.name"
String type, // "string" | "number" | "boolean" | "datetime"
String description,
String example
) {}
public final class AlertTemplateVariables {
public static final List<TemplateVariableDescriptor> ALL = List.of(
new TemplateVariableDescriptor("alert.rule.name", "string", "Rule display name", "High error rate"),
new TemplateVariableDescriptor("alert.firing_at", "datetime", "ISO-8601 firing timestamp", "2026-04-20T12:34:56Z"),
// ... exhaustive list
);
}
```
2. **Refactor `NotificationContextBuilder`** to build its context from this registry, so divergence is structural impossibility rather than convention.
3. **Round-trip unit test.** `AlertTemplateVariablesTest`: for every descriptor in `ALL`, confirm the path resolves to a non-null value when `NotificationContextBuilder.build(...)` runs against a canonical sample firing instance. Fails if a variable is added to `ALL` but the builder doesn't produce it (or vice versa).
4. **Controller.** `AlertTemplateVariablesController` at `GET /api/v1/alerts/template-variables`:
- **Flat endpoint** (added to allow-list in `.claude/rules/app-classes.md`, reason: registry is tenant- and env-agnostic global metadata).
- VIEWER+ (operators authoring webhook bodies are typically VIEWER+).
- Returns `List<TemplateVariableDto>` — identical shape to `TemplateVariableDescriptor`.
- Response is deterministic and cacheable (`Cache-Control: public, max-age=3600` — the registry changes only across server versions).
#### Frontend
1. **New query hook:** `ui/src/api/queries/alertTemplateVariables.ts` exposing `useTemplateVariables()`. TanStack Query config: `staleTime: Infinity`, `gcTime: Infinity` — the list never changes at runtime.
2. **Delete** `ui/src/components/MustacheEditor/alert-variables.ts` entirely.
3. **Prop-drill the variable list:**
- `MustacheEditor.tsx` takes `variables: TemplateVariable[]` as a prop.
- `mustache-completion.ts` and `mustache-linter.ts` become functions taking the list as an argument instead of importing a module.
- `NotifyStep.tsx` calls `useTemplateVariables()` once and passes the result.
- Webhook body/header editors do the same.
4. **Loading state:** while the variables are still being fetched, the MustacheEditor shows the textarea without autocomplete rather than blocking. The query runs on mount of `NotifyStep` (or whichever component opens first) and the result is cached forever.
### Rationale
- **Why a runtime endpoint, not a build-time JSON artifact?** The registry is also useful to admins outside the wizard — e.g. generated docs, tooling that inspects the schema. A REST endpoint is idiomatic API surface; a generated JSON file is build plumbing that couples UI to server at build time.
- **Why `staleTime: Infinity`?** The registry only changes when the server version changes, at which point the SPA is also redeployed (fresh fetch on load). No point polling.
---
## 5. Task 3 — Second Playwright spec
### Problem
Existing `ui/src/test/e2e/alerting.spec.ts` covers CREATE + DELETE of rules and the sidebar structure. It doesn't exercise the EDIT path, MustacheEditor autocomplete in the browser, the env-promotion warnings banner, or inbox interactions.
### Design
**File:** `ui/src/test/e2e/alerting-editor.spec.ts` (new; existing smoke stays unchanged)
**Cases:**
1. **Rule EDIT.** Seed a rule via backend API, navigate to `/alerts/rules/:id`, change `threshold` field, save, reload page, assert threshold persisted.
2. **MustacheEditor autocomplete.** On NotifyStep, focus the notification title field, type `{{`, assert CodeMirror completion popover appears listing `alert.rule.name` (and friends), arrow-down + Enter, assert insertion.
3. **MustacheEditor linter.** Type `{{alert.not_a_real_var}}`, assert a linter diagnostic appears within 500 ms.
4. **Env-promotion warnings.** Seed a rule in env A with `scope.routeId` referencing a route that doesn't exist in env B. Open RulesListPage, promote to env B, assert the wizard opens with a warning banner listing the missing route.
5. **Inbox ack.** Seed a FIRING alert via API, navigate to `/alerts/inbox`, click Ack button on the row, assert state chip flips to ACKNOWLEDGED and row's Ack button disables.
6. **Bulk-read.** Seed 3 FIRING alerts, select-all via header checkbox, click bulk-read button, assert notification-bell unread-count becomes 0.
### Stability
- Use `page.waitForFunction` on CodeMirror's `EditorView.completionStatus` rather than `waitForTimeout` for autocomplete assertions.
- Gate CI on 3 consecutive green runs of the new spec before merging (reviewer checks the CI history, not a scripted gate).
- Seed fixtures via API, not via page navigation — each test independent, no ordering dependency.
### Rationale
Playwright tests are the only layer that exercises CodeMirror's live behaviour. Unit tests mock it; integration tests skip it. The cost of flakiness mitigation (three green runs, `waitForFunction` over `waitForTimeout`) buys the coverage that bit us in Plan 03 triage.
---
## 6. Task 4 — Typed vocabulary for condition String fields
### Problem
Five condition fields are typed as `String` in the Java records, so springdoc emits `"type": "string"` without `enum`. UI has no way to derive a closed union and either (a) cannot render a dropdown or (b) hard-codes the values. See the top-of-file comment in `ui/src/pages/Alerts/enums.ts` for context.
### Design
Four field migrations to proper Java enums + one schema-only hint:
| Field | Target type | Existing or new? |
|---|---|---|
| `AgentStateCondition.state` | `com.cameleer.server.core.agent.AgentState` (enum) | existing (`LIVE, STALE, DEAD, SHUTDOWN`) |
| `DeploymentStateCondition.states` | `List<com.cameleer.server.core.runtime.DeploymentStatus>` (enum) | existing (`STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED`) |
| `LogPatternCondition.level` | new `com.cameleer.server.core.alerting.LogLevel` (enum) | new — `TRACE, DEBUG, INFO, WARN, ERROR` |
| `ExchangeFilter.status` | `com.cameleer.common.model.ExecutionStatus` (enum) | existing in cameleer-common (`RUNNING, COMPLETED, FAILED, ABANDONED`) |
| `JvmMetricCondition.metric` | stays `String` + `@Schema(allowableValues={...})` | schema-only hint — wire stays as agent-reported Micrometer metric names (`jvm.memory.used.value`, etc.) |
**Why JvmMetric is different.** `JvmMetricEvaluator` passes `c.metric()` directly to `MetricsQueryStore.queryTimeSeries(agentId, List.of(c.metric()), ...)`. The field is a raw Micrometer metric name emitted by the agent — `jvm.memory.used.value`, `process.cpu.usage.value`, `jvm.gc.pause.total_time`, etc. Making this an enum either (a) forces an un-idiomatic mapping between SCREAMING_SNAKE constants and dotted strings, or (b) locks out any metric not in the hard-coded list. Option (a)'s `@Schema(allowableValues=…)` gives springdoc enough to emit a narrowed TS union for UI autocomplete while leaving the backend permissive. The curated list comes from `.claude/rules/metrics.md`.
### Persistence
`alert_rules.condition` is stored as JSONB.
- For the 4 enum migrations: Jackson serializes enum values by `.name()`; on-the-wire payloads already use uppercase tokens (`"COMPLETED"`, not `"Completed"`). Legacy rows round-trip unchanged.
- For `JvmMetricCondition.metric`: no change — it's already a String with dotted-name values. `@Schema(allowableValues=…)` affects OpenAPI emission only, not deserialization or validation.
**Pre-flight check:** before the first enum-flip commit, run against the local dev DB:
```sql
SELECT condition_kind, condition->>'level' AS val
FROM alert_rules WHERE condition_kind='LOG_PATTERN'
UNION ALL
SELECT condition_kind, condition->>'state'
FROM alert_rules WHERE condition_kind='AGENT_STATE'
-- etc, one union per migrating field
```
If any row has a value that doesn't match the target enum (e.g. lowercase `"info"` instead of `"INFO"`), Plan 04 adds a one-shot Flyway `V15__alert_condition_enum_repair.sql` that normalizes them. If all values match, skip the migration.
### Validation
Controller-level `@Valid` on `AlertRuleRequest` automatically enforces enum membership — unknown values produce `400 Bad Request`. Replaces any ad-hoc string checks in evaluators.
Integration test per migrated field: `POST /rules` with known-bad value → 400; known-good value → 201.
### UI cleanup
After each enum migration ships, the corresponding manual artifact in `ui/src/pages/Alerts/enums.ts` (if any) is either deleted (derived type replaces it) or narrowed to label maps only. The top-of-file comment noting "Fields whose backend type is String … aren't in here at all" becomes obsolete and gets deleted.
---
## 7. Task 5 — Springdoc discriminator mapping
### Problem
The generated OpenAPI has a discriminator for `AlertCondition` but no `mapping`:
```json
"AlertCondition": {
"discriminator": { "propertyName": "kind" },
"properties": { "kind": { "enum": ["ROUTE_METRIC", ...] } }
}
```
When `openapi-typescript` encounters a discriminator without a `mapping`, it synthesizes the literal from the schema name. Result:
```ts
AgentStateCondition: {
kind: "AgentStateCondition"; // synthesized — class simple name
} & (Omit<AlertCondition, "kind"> & {
readonly kind?: "ROUTE_METRIC" | ... | "JVM_METRIC";
});
```
Indexed access like `AgentStateCondition['kind']` is `"AgentStateCondition" & ("ROUTE_METRIC" | ...)` = `never`. `Record<T, string>` exhaustiveness checks silently collapse.
### Design
One edit to `cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/AlertCondition.java`:
```java
import io.swagger.v3.oas.annotations.media.DiscriminatorMapping;
import io.swagger.v3.oas.annotations.media.Schema;
@Schema(
discriminatorProperty = "kind",
discriminatorMapping = {
@DiscriminatorMapping(value = "ROUTE_METRIC", schema = RouteMetricCondition.class),
@DiscriminatorMapping(value = "EXCHANGE_MATCH", schema = ExchangeMatchCondition.class),
@DiscriminatorMapping(value = "AGENT_STATE", schema = AgentStateCondition.class),
@DiscriminatorMapping(value = "DEPLOYMENT_STATE", schema = DeploymentStateCondition.class),
@DiscriminatorMapping(value = "LOG_PATTERN", schema = LogPatternCondition.class),
@DiscriminatorMapping(value = "JVM_METRIC", schema = JvmMetricCondition.class)
}
)
@JsonTypeInfo(...) // unchanged
@JsonSubTypes(...) // unchanged
public sealed interface AlertCondition ...
```
After `mvn compile && cd ui && npm run generate-api:live`, the schema gains:
```json
"discriminator": {
"propertyName": "kind",
"mapping": {
"ROUTE_METRIC": "#/components/schemas/RouteMetricCondition",
...
}
}
```
And `openapi-typescript` emits `kind: "ROUTE_METRIC"` instead of `kind: "RouteMetricCondition"`.
### Fallback
If springdoc 2.8.6 does not honor `@DiscriminatorMapping` on a sealed interface (sealed types are newer than the library's primary target surface), add a `SpringDocConfig` bean:
```java
@Bean OpenApiCustomizer alertConditionDiscriminatorFix() {
return openApi -> {
Schema<?> alertCondition = openApi.getComponents().getSchemas().get("AlertCondition");
alertCondition.getDiscriminator().mapping(Map.of(
"ROUTE_METRIC", "#/components/schemas/RouteMetricCondition",
// ...
));
};
}
```
~20 lines, no behavioural change for the server — emits the exact same OpenAPI. Plan 04 ships either way.
### UI cleanup
Once the schema is correct, `ui/src/pages/Alerts/enums.ts` is simplified:
- Delete hand-declared `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` unions.
- Replace with derived indexed access:
```ts
type RouteMetric = NonNullable<components['schemas']['RouteMetricCondition']['metric']>;
type Comparator = NonNullable<components['schemas']['RouteMetricCondition']['comparator']>;
// etc.
```
- Delete the 30-line comment block at the top of the file explaining the workaround.
- Keep `HIDDEN` arrays and label-map `Record<T, string>` exhaustiveness enforcement — those remain valuable.
---
## 8. Execution Order
1. **Task 5** (springdoc discriminator mapping) — purely additive annotation.
2. **Task 4** (String → enum migrations, one commit per field).
3. **Task 1** (context-startup smoke — independent, slots anywhere but early catches 4/5 fallout).
4. **Task 2** (template-variables SSOT).
5. **Task 3** (Playwright spec — last, exercises all of the above).
No hard blocking dependencies between tasks, but this order minimizes rebase pain and keeps each commit small.
---
## 9. Risks
| Risk | Mitigation |
|---|---|
| Springdoc `@DiscriminatorMapping` on a sealed interface not honored in 2.8.6 | `OpenApiCustomizer` bean fallback (§7). |
| Legacy `alert_rules.condition` JSONB rows have values that don't match new enum names | Pre-flight SQL inspection; `V15` Flyway repair migration only if needed. |
| `NotificationContextBuilder` refactor silently changes rendered notifications | Golden-file integration test — one sample notification per (severity × kind); run before and after refactor. |
| Playwright spec flakiness on CodeMirror autocomplete timing | `waitForFunction` on completion state; 3 consecutive green CI runs before merge. |
| `@Valid` enum enforcement breaks existing rule creation flows in the UI | Task-4 commits bundle `generate-api:live` + typecheck-fixing; any UI call-site change ships with the backend change. |
---
## 10. Acceptance Criteria
Plan 04 is done when:
1. Reverting commit `5edf7eb2` (re-introducing the `AlertingMetrics` ctor bug) causes `mvn clean verify` to fail. Revert the revert; test stays.
2. `grep -r "ALERT_VARIABLES" ui/src` shows no hand-maintained registry — only the backend-fetched hook and its consumers.
3. `ui/src/pages/Alerts/enums.ts` has no hand-declared unions for `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` — all four are derived from `schema.d.ts`. The top-of-file explanatory comment is gone.
4. All 5 condition String fields are enum-typed on the wire (inspect `openapi.json`) and in `schema.d.ts`.
5. `alerting-editor.spec.ts` green in CI for 3 consecutive runs.
6. `.claude/rules/app-classes.md` updated: new `AlertTemplateVariablesController` listed under the flat-endpoint allow-list with the reason ("registry is tenant- and env-agnostic global metadata").
7. `openapi.json` and `schema.d.ts` regenerated and committed alongside the backend changes they reflect.
---
## 11. Estimated Commit Cadence
Approximate, for sizing:
1. `feat(alerting): discriminator mapping on AlertCondition` — Task 5 backend.
2. `refactor(ui/alerts): derive polymorphic unions from schema` — Task 5 UI cleanup.
3. `refactor(alerting): AgentStateCondition.state → AgentState enum`.
4. `refactor(alerting): DeploymentStateCondition.states → DeploymentStatus enum`.
5. `refactor(alerting): LogPatternCondition.level → LogLevel enum` (adds new enum).
6. `refactor(alerting): JvmMetricCondition.metric — @Schema(allowableValues) hint` (no type change; OpenAPI hint only).
7. `refactor(alerting): ExchangeFilter.status → ExecutionStatus enum`.
8. `test(alerting): Spring context-startup smoke`.
9. `feat(alerting): template-variables SSOT endpoint` — Task 2 backend.
10. `refactor(ui/alerts): consume template-variables via API` — Task 2 frontend.
11. `test(ui/alerts): Playwright editor spec`.
~11 commits, bundleable to ~8 if enum flips are trivial. Each atomic and reversible.
---
## 12. CRITICAL process rules (per CLAUDE.md)
- `gitnexus_impact` on `AlertCondition`, `NotificationContextBuilder`, `AlertingMetrics`, `ExchangeMatchCondition`, each touched subtype before editing.
- `gitnexus_detect_changes` before every commit.
- `cd ui && npm run generate-api:live` after every controller or DTO change.
- `.claude/rules/app-classes.md` updated in the same commit that adds `AlertTemplateVariablesController`.
- `.claude/rules/core-classes.md` updated when the new `LogLevel` enum lands.
- Maven: `mvn clean verify` must stay green through every commit.
- UI: `cd ui && npm run typecheck && npm run test && npm run test:e2e` must stay green through every commit.