V13 added a partial unique index on alert_instances(rule_id) WHERE state
IN (PENDING,FIRING,ACKNOWLEDGED). Correct for scalar condition kinds
(ROUTE_METRIC / AGENT_STATE / DEPLOYMENT_STATE / LOG_PATTERN / JVM_METRIC
/ EXCHANGE_MATCH in COUNT_IN_WINDOW) but wrong for EXCHANGE_MATCH /
PER_EXCHANGE, which by design emits one alert_instance per matching
exchange. Under V13 every PER_EXCHANGE tick with >1 match logged
"Skipped duplicate open alert_instance for rule …" at evaluator cadence
and silently lost alert fidelity — only the first matching exchange per
tick got an AlertInstance + webhook dispatch.
V15 drops the rule_id-only constraint and recreates it with a
discriminator on context->'exchange'->>'id'. Scalar kinds emit
Map.of() as context, so their expression resolves to '' — "one open per
rule" preserved. ExchangeMatchEvaluator.evaluatePerExchange always
populates exchange.id, so per-exchange instances coexist cleanly.
Two new PostgresAlertInstanceRepositoryIT tests:
- multiple open instances for same rule + distinct exchanges all land
- second open for identical (rule, exchange) still dedups via the
DuplicateKeyException fallback in save() — defense-in-depth kept
Also fixes pre-existing PostgresAlertReadRepositoryIT brokenness: its
setup() inserted 3 open instances sharing one rule_id, which V13 blocked
on arrival. Migrate to one rule_id per instance (pattern already used
across other storage ITs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the loop on three bug classes from Plan 03 triage:
context-load regressions (missing @Autowired), UI/backend drift
on template variables, and hand-maintained TS enum unions caused
by springdoc polymorphic schema quirk.
Covers 5 tasks: context-startup smoke test, template-variables
SSOT endpoint, second Playwright spec, String-to-enum migrations
on 5 condition fields, and @DiscriminatorMapping on AlertCondition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:44:41 +02:00
5 changed files with 3201 additions and 8 deletions
# Alerting — Plan 04 Design Spec: Post-Ship Hardening
**Date:** 2026-04-20
**Status:** Draft — awaiting user review
**Surfaces:** server (core + app), UI, tests
**Related:** [Plan 03](../plans/2026-04-20-alerting-03-ui.md) (shipped as PR #144); builds on [Alerting design](2026-04-19-alerting-design.md)
---
## 1. Summary
Close the loop on three bug classes that surfaced in Plan 03's post-merge triage:
1.**Spring wiring regressions not caught by unit tests** — `AlertingMetrics`'s production constructor was missing `@Autowired` and only broke at Docker runtime (commit `5edf7eb2`). Unit tests exercised the package-private ctor, so green locally.
2.**UI↔backend drift on Mustache template variables** — `ALERT_VARIABLES` in the UI and `NotificationContextBuilder` on the server were two hand-maintained lists of the same thing; one drift was fixed in commit `18e6dde6`.
3.**Hand-maintained TypeScript enum unions** — `ui/src/pages/Alerts/enums.ts` declares `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` manually because springdoc emits a broken polymorphic schema (condition types resolve to `never` in TS due to discriminator-literal conflict).
Plan 04 is **pure hardening**: no new product features. Five tasks that together eliminate these bug classes at their source rather than adding another test that detects them.
### Guiding principles
- **Prefer SSOT over drift-detect.** When two hand-maintained lists need to match, make one derive from the other rather than adding a test that fails when they don't.
- **Type correctness at the contract layer.** Backend records are the contract; if a field has a closed vocabulary, it is an enum. String-typed-with-comment does not survive schema regeneration.
- **Close the loop on Plan 03, don't start new product work.** Features (#8 native integrations, #9 CA-bundle UI), performance (#7 load tests), and deferred cleanup (#6 OIDC post-alpha, #10 router lint sweep) stay out of scope — each is a separate future plan.
---
## 2. Scope
### In scope
1.**Context-startup smoke test** — a `@SpringBootTest` that loads the full production application context and asserts the alerting beans are present. Catches `@Autowired` regressions at `mvn verify` time.
2.**Template-variables SSOT endpoint** — new `GET /api/v1/alerts/template-variables` returning the declarative registry. UI deletes its duplicate; autocomplete + linter consume the server registry via TanStack Query.
4.**Typed vocabulary for 5 String condition fields** — `AgentStateCondition.state`, `DeploymentStateCondition.states`, `LogPatternCondition.level`, `ExchangeFilter.status` become proper Java enums. `JvmMetricCondition.metric` stays `String` + `@Schema(allowableValues={...})` because agents report raw Micrometer metric names that the evaluator passes through unchanged. OpenAPI emits closed unions for all five; UI types derive.
5.**Springdoc discriminator mapping** — add `@Schema(discriminatorProperty, discriminatorMapping)` to the `AlertCondition` sealed interface. TS stops generating `kind: "<ClassName>"` literals. Hand-declared unions in `ui/src/pages/Alerts/enums.ts` migrate to derived types.
### Out of scope (explicit)
To keep Plan 04 focused, none of these are touched in this plan. Each has a separate issue or a future plan slot:
Plan 03 shipped a regression where `AlertingMetrics`'s production constructor was missing `@Autowired`. Unit tests instantiated it via a package-private ctor, so local builds and CI were green. Spring context-load failed only when the full app started under Docker. The entire class of regression (missing annotation on any alerting-subsystem bean) is invisible to the current test pyramid.
// + one assertion per ConditionEvaluator impl (6 evaluators)
}
}
```
**Profile stubs:**`test-context-smoke` profile replaces `DataSource`, `ClickHouseJdbcTemplate`, and `RuntimeOrchestrator` with Mockito mocks so the test doesn't need Testcontainers. Real beans for everything alerting.
### Rationale
- **Why not extend an existing `@SpringBootTest`?** Existing ITs use Testcontainers for Postgres + ClickHouse. They catch query-path regressions but are slow and duplicate work we don't need. The bug class here — missing `@Autowired` on a ctor — surfaces during bean creation regardless of whether dependencies are real or stubbed.
- **Why assert individual beans, not just "context loads"?** A context loads if *any* bean breaks is still a passing test if the broken bean is conditional. Naming each bean makes intent explicit and failure messages readable.
### Validation
Intentionally break one bean (remove `@Autowired` or `@Component`) and confirm the test fails with a clear message. Revert before committing.
---
## 4. Task 2 — Template-variables SSOT endpoint
### Problem
`ui/src/components/MustacheEditor/alert-variables.ts` is a hand-maintained list of template variable paths (`alert.rule.name`, `exchange.status`, etc.) mirroring what `NotificationContextBuilder` actually produces. On 2026-04-20, commit `18e6dde6` fixed one drift between them. Until drift is structurally prevented, it will recur.
### Design
#### Backend
1.**Declarative registry in core.** Extract `NotificationContextBuilder`'s leaf paths into a static constant:
public static final List<TemplateVariableDescriptor> ALL = List.of(
new TemplateVariableDescriptor("alert.rule.name", "string", "Rule display name", "High error rate"),
new TemplateVariableDescriptor("alert.firing_at", "datetime", "ISO-8601 firing timestamp", "2026-04-20T12:34:56Z"),
// ... exhaustive list
);
}
```
2. **Refactor `NotificationContextBuilder`** to build its context from this registry, so divergence is structural impossibility rather than convention.
3. **Round-trip unit test.** `AlertTemplateVariablesTest`: for every descriptor in `ALL`, confirm the path resolves to a non-null value when `NotificationContextBuilder.build(...)` runs against a canonical sample firing instance. Fails if a variable is added to `ALL` but the builder doesn't produce it (or vice versa).
4. **Controller.** `AlertTemplateVariablesController` at `GET /api/v1/alerts/template-variables`:
- **Flat endpoint** (added to allow-list in `.claude/rules/app-classes.md`, reason: registry is tenant- and env-agnostic global metadata).
- VIEWER+ (operators authoring webhook bodies are typically VIEWER+).
- Returns `List<TemplateVariableDto>` — identical shape to `TemplateVariableDescriptor`.
- Response is deterministic and cacheable (`Cache-Control: public, max-age=3600` — the registry changes only across server versions).
#### Frontend
1. **New query hook:** `ui/src/api/queries/alertTemplateVariables.ts` exposing `useTemplateVariables()`. TanStack Query config: `staleTime: Infinity`, `gcTime: Infinity` — the list never changes at runtime.
- `MustacheEditor.tsx` takes `variables: TemplateVariable[]` as a prop.
- `mustache-completion.ts` and `mustache-linter.ts` become functions taking the list as an argument instead of importing a module.
- `NotifyStep.tsx` calls `useTemplateVariables()` once and passes the result.
- Webhook body/header editors do the same.
4. **Loading state:** while the variables are still being fetched, the MustacheEditor shows the textarea without autocomplete rather than blocking. The query runs on mount of `NotifyStep` (or whichever component opens first) and the result is cached forever.
### Rationale
- **Why a runtime endpoint, not a build-time JSON artifact?** The registry is also useful to admins outside the wizard — e.g. generated docs, tooling that inspects the schema. A REST endpoint is idiomatic API surface; a generated JSON file is build plumbing that couples UI to server at build time.
- **Why `staleTime: Infinity`?** The registry only changes when the server version changes, at which point the SPA is also redeployed (fresh fetch on load). No point polling.
---
## 5. Task 3 — Second Playwright spec
### Problem
Existing `ui/src/test/e2e/alerting.spec.ts` covers CREATE + DELETE of rules and the sidebar structure. It doesn't exercise the EDIT path, MustacheEditor autocomplete in the browser, the env-promotion warnings banner, or inbox interactions.
1. **Rule EDIT.** Seed a rule via backend API, navigate to `/alerts/rules/:id`, change `threshold` field, save, reload page, assert threshold persisted.
2. **MustacheEditor autocomplete.** On NotifyStep, focus the notification title field, type `{{`, assert CodeMirror completion popover appears listing `alert.rule.name` (and friends), arrow-down + Enter, assert insertion.
3. **MustacheEditor linter.** Type `{{alert.not_a_real_var}}`, assert a linter diagnostic appears within 500 ms.
4. **Env-promotion warnings.** Seed a rule in env A with `scope.routeId` referencing a route that doesn't exist in env B. Open RulesListPage, promote to env B, assert the wizard opens with a warning banner listing the missing route.
5. **Inbox ack.** Seed a FIRING alert via API, navigate to `/alerts/inbox`, click Ack button on the row, assert state chip flips to ACKNOWLEDGED and row's Ack button disables.
- Use `page.waitForFunction` on CodeMirror's `EditorView.completionStatus` rather than `waitForTimeout` for autocomplete assertions.
- Gate CI on 3 consecutive green runs of the new spec before merging (reviewer checks the CI history, not a scripted gate).
- Seed fixtures via API, not via page navigation — each test independent, no ordering dependency.
### Rationale
Playwright tests are the only layer that exercises CodeMirror's live behaviour. Unit tests mock it; integration tests skip it. The cost of flakiness mitigation (three green runs, `waitForFunction` over `waitForTimeout`) buys the coverage that bit us in Plan 03 triage.
Five condition fields are typed as `String` in the Java records, so springdoc emits `"type": "string"` without `enum`. UI has no way to derive a closed union and either (a) cannot render a dropdown or (b) hard-codes the values. See the top-of-file comment in `ui/src/pages/Alerts/enums.ts` for context.
### Design
Four field migrations to proper Java enums + one schema-only hint:
**Why JvmMetric is different.** `JvmMetricEvaluator` passes `c.metric()` directly to `MetricsQueryStore.queryTimeSeries(agentId, List.of(c.metric()), ...)`. The field is a raw Micrometer metric name emitted by the agent — `jvm.memory.used.value`, `process.cpu.usage.value`, `jvm.gc.pause.total_time`, etc. Making this an enum either (a) forces an un-idiomatic mapping between SCREAMING_SNAKE constants and dotted strings, or (b) locks out any metric not in the hard-coded list. Option (a)'s `@Schema(allowableValues=…)` gives springdoc enough to emit a narrowed TS union for UI autocomplete while leaving the backend permissive. The curated list comes from `.claude/rules/metrics.md`.
### Persistence
`alert_rules.condition` is stored as JSONB.
- For the 4 enum migrations: Jackson serializes enum values by `.name()`; on-the-wire payloads already use uppercase tokens (`"COMPLETED"`, not `"Completed"`). Legacy rows round-trip unchanged.
- For `JvmMetricCondition.metric`: no change — it's already a String with dotted-name values. `@Schema(allowableValues=…)` affects OpenAPI emission only, not deserialization or validation.
**Pre-flight check:** before the first enum-flip commit, run against the local dev DB:
```sql
SELECT condition_kind, condition->>'level' AS val
FROM alert_rules WHERE condition_kind='LOG_PATTERN'
UNION ALL
SELECT condition_kind, condition->>'state'
FROM alert_rules WHERE condition_kind='AGENT_STATE'
-- etc, one union per migrating field
```
If any row has a value that doesn't match the target enum (e.g. lowercase `"info"` instead of `"INFO"`), Plan 04 adds a one-shot Flyway `V15__alert_condition_enum_repair.sql` that normalizes them. If all values match, skip the migration.
### Validation
Controller-level `@Valid` on `AlertRuleRequest` automatically enforces enum membership — unknown values produce `400 Bad Request`. Replaces any ad-hoc string checks in evaluators.
Integration test per migrated field: `POST /rules` with known-bad value → 400; known-good value → 201.
### UI cleanup
After each enum migration ships, the corresponding manual artifact in `ui/src/pages/Alerts/enums.ts` (if any) is either deleted (derived type replaces it) or narrowed to label maps only. The top-of-file comment noting "Fields whose backend type is String … aren't in here at all" becomes obsolete and gets deleted.
---
## 7. Task 5 — Springdoc discriminator mapping
### Problem
The generated OpenAPI has a discriminator for `AlertCondition` but no `mapping`:
And `openapi-typescript` emits `kind: "ROUTE_METRIC"` instead of `kind: "RouteMetricCondition"`.
### Fallback
If springdoc 2.8.6 does not honor `@DiscriminatorMapping` on a sealed interface (sealed types are newer than the library's primary target surface), add a `SpringDocConfig` bean:
2. **Task 4** (String → enum migrations, one commit per field).
3. **Task 1** (context-startup smoke — independent, slots anywhere but early catches 4/5 fallout).
4. **Task 2** (template-variables SSOT).
5. **Task 3** (Playwright spec — last, exercises all of the above).
No hard blocking dependencies between tasks, but this order minimizes rebase pain and keeps each commit small.
---
## 9. Risks
| Risk | Mitigation |
|---|---|
| Springdoc `@DiscriminatorMapping` on a sealed interface not honored in 2.8.6 | `OpenApiCustomizer` bean fallback (§7). |
| Legacy `alert_rules.condition` JSONB rows have values that don't match new enum names | Pre-flight SQL inspection; `V15` Flyway repair migration only if needed. |
| `NotificationContextBuilder` refactor silently changes rendered notifications | Golden-file integration test — one sample notification per (severity × kind); run before and after refactor. |
| Playwright spec flakiness on CodeMirror autocomplete timing | `waitForFunction` on completion state; 3 consecutive green CI runs before merge. |
| `@Valid` enum enforcement breaks existing rule creation flows in the UI | Task-4 commits bundle `generate-api:live` + typecheck-fixing; any UI call-site change ships with the backend change. |
---
## 10. Acceptance Criteria
Plan 04 is done when:
1. Reverting commit `5edf7eb2` (re-introducing the `AlertingMetrics` ctor bug) causes `mvn clean verify` to fail. Revert the revert; test stays.
2. `grep -r "ALERT_VARIABLES" ui/src` shows no hand-maintained registry — only the backend-fetched hook and its consumers.
3. `ui/src/pages/Alerts/enums.ts` has no hand-declared unions for `RouteMetric`, `Comparator`, `JvmAggregation`, `ExchangeFireMode` — all four are derived from `schema.d.ts`. The top-of-file explanatory comment is gone.
4. All 5 condition String fields are enum-typed on the wire (inspect `openapi.json`) and in `schema.d.ts`.
5. `alerting-editor.spec.ts` green in CI for 3 consecutive runs.
6. `.claude/rules/app-classes.md` updated: new `AlertTemplateVariablesController` listed under the flat-endpoint allow-list with the reason ("registry is tenant- and env-agnostic global metadata").
7. `openapi.json` and `schema.d.ts` regenerated and committed alongside the backend changes they reflect.
---
## 11. Estimated Commit Cadence
Approximate, for sizing:
1. `feat(alerting): discriminator mapping on AlertCondition` — Task 5 backend.
10. `refactor(ui/alerts): consume template-variables via API` — Task 2 frontend.
11. `test(ui/alerts): Playwright editor spec`.
~11 commits, bundleable to ~8 if enum flips are trivial. Each atomic and reversible.
---
## 12. CRITICAL process rules (per CLAUDE.md)
- `gitnexus_impact` on `AlertCondition`, `NotificationContextBuilder`, `AlertingMetrics`, `ExchangeMatchCondition`, each touched subtype before editing.
- `gitnexus_detect_changes` before every commit.
- `cd ui && npm run generate-api:live` after every controller or DTO change.
- `.claude/rules/app-classes.md` updated in the same commit that adds `AlertTemplateVariablesController`.
- `.claude/rules/core-classes.md` updated when the new `LogLevel` enum lands.
- Maven: `mvn clean verify` must stay green through every commit.
- UI: `cd ui && npm run typecheck && npm run test && npm run test:e2e` must stay green through every commit.
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.