Closes the loop on three bug classes from Plan 03 triage: context-load regressions (missing @Autowired), UI/backend drift on template variables, and hand-maintained TS enum unions caused by springdoc polymorphic schema quirk. Covers 5 tasks: context-startup smoke test, template-variables SSOT endpoint, second Playwright spec, String-to-enum migrations on 5 condition fields, and @DiscriminatorMapping on AlertCondition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
23 KiB
Alerting — Plan 04 Design Spec: Post-Ship Hardening
Date: 2026-04-20 Status: Draft — awaiting user review Surfaces: server (core + app), UI, tests Related: Plan 03 (shipped as PR #144); builds on Alerting design
1. Summary
Close the loop on three bug classes that surfaced in Plan 03's post-merge triage:
- Spring wiring regressions not caught by unit tests —
AlertingMetrics's production constructor was missing@Autowiredand only broke at Docker runtime (commit5edf7eb2). Unit tests exercised the package-private ctor, so green locally. - UI↔backend drift on Mustache template variables —
ALERT_VARIABLESin the UI andNotificationContextBuilderon the server were two hand-maintained lists of the same thing; one drift was fixed in commit18e6dde6. - Hand-maintained TypeScript enum unions —
ui/src/pages/Alerts/enums.tsdeclaresRouteMetric,Comparator,JvmAggregation,ExchangeFireModemanually because springdoc emits a broken polymorphic schema (condition types resolve toneverin TS due to discriminator-literal conflict).
Plan 04 is pure hardening: no new product features. Five tasks that together eliminate these bug classes at their source rather than adding another test that detects them.
Guiding principles
- Prefer SSOT over drift-detect. When two hand-maintained lists need to match, make one derive from the other rather than adding a test that fails when they don't.
- Type correctness at the contract layer. Backend records are the contract; if a field has a closed vocabulary, it is an enum. String-typed-with-comment does not survive schema regeneration.
- Close the loop on Plan 03, don't start new product work. Features (#8 native integrations, #9 CA-bundle UI), performance (#7 load tests), and deferred cleanup (#6 OIDC post-alpha, #10 router lint sweep) stay out of scope — each is a separate future plan.
2. Scope
In scope
- Context-startup smoke test — a
@SpringBootTestthat loads the full production application context and asserts the alerting beans are present. Catches@Autowiredregressions atmvn verifytime. - Template-variables SSOT endpoint — new
GET /api/v1/alerts/template-variablesreturning the declarative registry. UI deletes its duplicate; autocomplete + linter consume the server registry via TanStack Query. - Second Playwright spec — rule EDIT path, MustacheEditor autocomplete + linter in-browser, env-promotion warnings banner, inbox ack, bulk-read.
- Typed vocabulary for 5 String condition fields —
AgentStateCondition.state,DeploymentStateCondition.states,LogPatternCondition.level,ExchangeFilter.statusbecome proper Java enums.JvmMetricCondition.metricstaysString+@Schema(allowableValues={...})because agents report raw Micrometer metric names that the evaluator passes through unchanged. OpenAPI emits closed unions for all five; UI types derive. - Springdoc discriminator mapping — add
@Schema(discriminatorProperty, discriminatorMapping)to theAlertConditionsealed interface. TS stops generatingkind: "<ClassName>"literals. Hand-declared unions inui/src/pages/Alerts/enums.tsmigrate to derived types.
Out of scope (explicit)
To keep Plan 04 focused, none of these are touched in this plan. Each has a separate issue or a future plan slot:
- Native Slack / Teams / PagerDuty integrations (Plan 03 spec BL-002 — future plan).
- Admin CA-bundle UI (gitea #137 / BL-001).
- Router.tsx lint sweep (207 fast-refresh warnings, pre-existed Plan 03 — standalone lint PR).
- Performance ITs for 500 rules × 5 replicas (Plan 03 spec deferral — separate plan, different shape).
- OIDC / UserAdmin re-audit (commit
ae647363was minimal; re-audit post-alpha). - Orphan
.worktrees/cleanup (user holds authorization).
3. Task 1 — Context-startup smoke test
Problem
Plan 03 shipped a regression where AlertingMetrics's production constructor was missing @Autowired. Unit tests instantiated it via a package-private ctor, so local builds and CI were green. Spring context-load failed only when the full app started under Docker. The entire class of regression (missing annotation on any alerting-subsystem bean) is invisible to the current test pyramid.
Design
Location: cameleer-server-app/src/test/java/com/cameleer/server/app/ContextStartupSmokeTest.java
Shape:
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE)
@ActiveProfiles("test-context-smoke")
class ContextStartupSmokeTest {
@Autowired ApplicationContext ctx;
@Test
void productionContextLoadsWithAllAlertingBeans() {
assertNotNull(ctx.getBean(AlertingMetrics.class));
assertNotNull(ctx.getBean(AlertEvaluatorJob.class));
assertNotNull(ctx.getBean(NotificationDispatchJob.class));
assertNotNull(ctx.getBean(NotificationContextBuilder.class));
assertNotNull(ctx.getBean(WebhookDispatcher.class));
assertNotNull(ctx.getBean(MustacheRenderer.class));
assertNotNull(ctx.getBean(AlertingRetentionJob.class));
assertNotNull(ctx.getBean(PerKindCircuitBreaker.class));
// + one assertion per ConditionEvaluator impl (6 evaluators)
}
}
Profile stubs: test-context-smoke profile replaces DataSource, ClickHouseJdbcTemplate, and RuntimeOrchestrator with Mockito mocks so the test doesn't need Testcontainers. Real beans for everything alerting.
Rationale
- Why not extend an existing
@SpringBootTest? Existing ITs use Testcontainers for Postgres + ClickHouse. They catch query-path regressions but are slow and duplicate work we don't need. The bug class here — missing@Autowiredon a ctor — surfaces during bean creation regardless of whether dependencies are real or stubbed. - Why assert individual beans, not just "context loads"? A context loads if any bean breaks is still a passing test if the broken bean is conditional. Naming each bean makes intent explicit and failure messages readable.
Validation
Intentionally break one bean (remove @Autowired or @Component) and confirm the test fails with a clear message. Revert before committing.
4. Task 2 — Template-variables SSOT endpoint
Problem
ui/src/components/MustacheEditor/alert-variables.ts is a hand-maintained list of template variable paths (alert.rule.name, exchange.status, etc.) mirroring what NotificationContextBuilder actually produces. On 2026-04-20, commit 18e6dde6 fixed one drift between them. Until drift is structurally prevented, it will recur.
Design
Backend
-
Declarative registry in core. Extract
NotificationContextBuilder's leaf paths into a static constant:public record TemplateVariableDescriptor( String path, // "alert.rule.name" String type, // "string" | "number" | "boolean" | "datetime" String description, String example ) {} public final class AlertTemplateVariables { public static final List<TemplateVariableDescriptor> ALL = List.of( new TemplateVariableDescriptor("alert.rule.name", "string", "Rule display name", "High error rate"), new TemplateVariableDescriptor("alert.firing_at", "datetime", "ISO-8601 firing timestamp", "2026-04-20T12:34:56Z"), // ... exhaustive list ); } -
Refactor
NotificationContextBuilderto build its context from this registry, so divergence is structural impossibility rather than convention. -
Round-trip unit test.
AlertTemplateVariablesTest: for every descriptor inALL, confirm the path resolves to a non-null value whenNotificationContextBuilder.build(...)runs against a canonical sample firing instance. Fails if a variable is added toALLbut the builder doesn't produce it (or vice versa). -
Controller.
AlertTemplateVariablesControlleratGET /api/v1/alerts/template-variables:- Flat endpoint (added to allow-list in
.claude/rules/app-classes.md, reason: registry is tenant- and env-agnostic global metadata). - VIEWER+ (operators authoring webhook bodies are typically VIEWER+).
- Returns
List<TemplateVariableDto>— identical shape toTemplateVariableDescriptor. - Response is deterministic and cacheable (
Cache-Control: public, max-age=3600— the registry changes only across server versions).
- Flat endpoint (added to allow-list in
Frontend
-
New query hook:
ui/src/api/queries/alertTemplateVariables.tsexposinguseTemplateVariables(). TanStack Query config:staleTime: Infinity,gcTime: Infinity— the list never changes at runtime. -
Delete
ui/src/components/MustacheEditor/alert-variables.tsentirely. -
Prop-drill the variable list:
MustacheEditor.tsxtakesvariables: TemplateVariable[]as a prop.mustache-completion.tsandmustache-linter.tsbecome functions taking the list as an argument instead of importing a module.NotifyStep.tsxcallsuseTemplateVariables()once and passes the result.- Webhook body/header editors do the same.
-
Loading state: while the variables are still being fetched, the MustacheEditor shows the textarea without autocomplete rather than blocking. The query runs on mount of
NotifyStep(or whichever component opens first) and the result is cached forever.
Rationale
- Why a runtime endpoint, not a build-time JSON artifact? The registry is also useful to admins outside the wizard — e.g. generated docs, tooling that inspects the schema. A REST endpoint is idiomatic API surface; a generated JSON file is build plumbing that couples UI to server at build time.
- Why
staleTime: Infinity? The registry only changes when the server version changes, at which point the SPA is also redeployed (fresh fetch on load). No point polling.
5. Task 3 — Second Playwright spec
Problem
Existing ui/src/test/e2e/alerting.spec.ts covers CREATE + DELETE of rules and the sidebar structure. It doesn't exercise the EDIT path, MustacheEditor autocomplete in the browser, the env-promotion warnings banner, or inbox interactions.
Design
File: ui/src/test/e2e/alerting-editor.spec.ts (new; existing smoke stays unchanged)
Cases:
- Rule EDIT. Seed a rule via backend API, navigate to
/alerts/rules/:id, changethresholdfield, save, reload page, assert threshold persisted. - MustacheEditor autocomplete. On NotifyStep, focus the notification title field, type
{{, assert CodeMirror completion popover appears listingalert.rule.name(and friends), arrow-down + Enter, assert insertion. - MustacheEditor linter. Type
{{alert.not_a_real_var}}, assert a linter diagnostic appears within 500 ms. - Env-promotion warnings. Seed a rule in env A with
scope.routeIdreferencing a route that doesn't exist in env B. Open RulesListPage, promote to env B, assert the wizard opens with a warning banner listing the missing route. - Inbox ack. Seed a FIRING alert via API, navigate to
/alerts/inbox, click Ack button on the row, assert state chip flips to ACKNOWLEDGED and row's Ack button disables. - Bulk-read. Seed 3 FIRING alerts, select-all via header checkbox, click bulk-read button, assert notification-bell unread-count becomes 0.
Stability
- Use
page.waitForFunctionon CodeMirror'sEditorView.completionStatusrather thanwaitForTimeoutfor autocomplete assertions. - Gate CI on 3 consecutive green runs of the new spec before merging (reviewer checks the CI history, not a scripted gate).
- Seed fixtures via API, not via page navigation — each test independent, no ordering dependency.
Rationale
Playwright tests are the only layer that exercises CodeMirror's live behaviour. Unit tests mock it; integration tests skip it. The cost of flakiness mitigation (three green runs, waitForFunction over waitForTimeout) buys the coverage that bit us in Plan 03 triage.
6. Task 4 — Typed vocabulary for condition String fields
Problem
Five condition fields are typed as String in the Java records, so springdoc emits "type": "string" without enum. UI has no way to derive a closed union and either (a) cannot render a dropdown or (b) hard-codes the values. See the top-of-file comment in ui/src/pages/Alerts/enums.ts for context.
Design
Four field migrations to proper Java enums + one schema-only hint:
| Field | Target type | Existing or new? |
|---|---|---|
AgentStateCondition.state |
com.cameleer.server.core.agent.AgentState (enum) |
existing (LIVE, STALE, DEAD, SHUTDOWN) |
DeploymentStateCondition.states |
List<com.cameleer.server.core.runtime.DeploymentStatus> (enum) |
existing (STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED) |
LogPatternCondition.level |
new com.cameleer.server.core.alerting.LogLevel (enum) |
new — TRACE, DEBUG, INFO, WARN, ERROR |
ExchangeFilter.status |
com.cameleer.common.model.ExecutionStatus (enum) |
existing in cameleer-common (RUNNING, COMPLETED, FAILED, ABANDONED) |
JvmMetricCondition.metric |
stays String + @Schema(allowableValues={...}) |
schema-only hint — wire stays as agent-reported Micrometer metric names (jvm.memory.used.value, etc.) |
Why JvmMetric is different. JvmMetricEvaluator passes c.metric() directly to MetricsQueryStore.queryTimeSeries(agentId, List.of(c.metric()), ...). The field is a raw Micrometer metric name emitted by the agent — jvm.memory.used.value, process.cpu.usage.value, jvm.gc.pause.total_time, etc. Making this an enum either (a) forces an un-idiomatic mapping between SCREAMING_SNAKE constants and dotted strings, or (b) locks out any metric not in the hard-coded list. Option (a)'s @Schema(allowableValues=…) gives springdoc enough to emit a narrowed TS union for UI autocomplete while leaving the backend permissive. The curated list comes from .claude/rules/metrics.md.
Persistence
alert_rules.condition is stored as JSONB.
- For the 4 enum migrations: Jackson serializes enum values by
.name(); on-the-wire payloads already use uppercase tokens ("COMPLETED", not"Completed"). Legacy rows round-trip unchanged. - For
JvmMetricCondition.metric: no change — it's already a String with dotted-name values.@Schema(allowableValues=…)affects OpenAPI emission only, not deserialization or validation.
Pre-flight check: before the first enum-flip commit, run against the local dev DB:
SELECT condition_kind, condition->>'level' AS val
FROM alert_rules WHERE condition_kind='LOG_PATTERN'
UNION ALL
SELECT condition_kind, condition->>'state'
FROM alert_rules WHERE condition_kind='AGENT_STATE'
-- etc, one union per migrating field
If any row has a value that doesn't match the target enum (e.g. lowercase "info" instead of "INFO"), Plan 04 adds a one-shot Flyway V15__alert_condition_enum_repair.sql that normalizes them. If all values match, skip the migration.
Validation
Controller-level @Valid on AlertRuleRequest automatically enforces enum membership — unknown values produce 400 Bad Request. Replaces any ad-hoc string checks in evaluators.
Integration test per migrated field: POST /rules with known-bad value → 400; known-good value → 201.
UI cleanup
After each enum migration ships, the corresponding manual artifact in ui/src/pages/Alerts/enums.ts (if any) is either deleted (derived type replaces it) or narrowed to label maps only. The top-of-file comment noting "Fields whose backend type is String … aren't in here at all" becomes obsolete and gets deleted.
7. Task 5 — Springdoc discriminator mapping
Problem
The generated OpenAPI has a discriminator for AlertCondition but no mapping:
"AlertCondition": {
"discriminator": { "propertyName": "kind" },
"properties": { "kind": { "enum": ["ROUTE_METRIC", ...] } }
}
When openapi-typescript encounters a discriminator without a mapping, it synthesizes the literal from the schema name. Result:
AgentStateCondition: {
kind: "AgentStateCondition"; // synthesized — class simple name
} & (Omit<AlertCondition, "kind"> & {
readonly kind?: "ROUTE_METRIC" | ... | "JVM_METRIC";
});
Indexed access like AgentStateCondition['kind'] is "AgentStateCondition" & ("ROUTE_METRIC" | ...) = never. Record<T, string> exhaustiveness checks silently collapse.
Design
One edit to cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/AlertCondition.java:
import io.swagger.v3.oas.annotations.media.DiscriminatorMapping;
import io.swagger.v3.oas.annotations.media.Schema;
@Schema(
discriminatorProperty = "kind",
discriminatorMapping = {
@DiscriminatorMapping(value = "ROUTE_METRIC", schema = RouteMetricCondition.class),
@DiscriminatorMapping(value = "EXCHANGE_MATCH", schema = ExchangeMatchCondition.class),
@DiscriminatorMapping(value = "AGENT_STATE", schema = AgentStateCondition.class),
@DiscriminatorMapping(value = "DEPLOYMENT_STATE", schema = DeploymentStateCondition.class),
@DiscriminatorMapping(value = "LOG_PATTERN", schema = LogPatternCondition.class),
@DiscriminatorMapping(value = "JVM_METRIC", schema = JvmMetricCondition.class)
}
)
@JsonTypeInfo(...) // unchanged
@JsonSubTypes(...) // unchanged
public sealed interface AlertCondition ...
After mvn compile && cd ui && npm run generate-api:live, the schema gains:
"discriminator": {
"propertyName": "kind",
"mapping": {
"ROUTE_METRIC": "#/components/schemas/RouteMetricCondition",
...
}
}
And openapi-typescript emits kind: "ROUTE_METRIC" instead of kind: "RouteMetricCondition".
Fallback
If springdoc 2.8.6 does not honor @DiscriminatorMapping on a sealed interface (sealed types are newer than the library's primary target surface), add a SpringDocConfig bean:
@Bean OpenApiCustomizer alertConditionDiscriminatorFix() {
return openApi -> {
Schema<?> alertCondition = openApi.getComponents().getSchemas().get("AlertCondition");
alertCondition.getDiscriminator().mapping(Map.of(
"ROUTE_METRIC", "#/components/schemas/RouteMetricCondition",
// ...
));
};
}
~20 lines, no behavioural change for the server — emits the exact same OpenAPI. Plan 04 ships either way.
UI cleanup
Once the schema is correct, ui/src/pages/Alerts/enums.ts is simplified:
- Delete hand-declared
RouteMetric,Comparator,JvmAggregation,ExchangeFireModeunions. - Replace with derived indexed access:
type RouteMetric = NonNullable<components['schemas']['RouteMetricCondition']['metric']>; type Comparator = NonNullable<components['schemas']['RouteMetricCondition']['comparator']>; // etc. - Delete the 30-line comment block at the top of the file explaining the workaround.
- Keep
HIDDENarrays and label-mapRecord<T, string>exhaustiveness enforcement — those remain valuable.
8. Execution Order
- Task 5 (springdoc discriminator mapping) — purely additive annotation.
- Task 4 (String → enum migrations, one commit per field).
- Task 1 (context-startup smoke — independent, slots anywhere but early catches 4/5 fallout).
- Task 2 (template-variables SSOT).
- Task 3 (Playwright spec — last, exercises all of the above).
No hard blocking dependencies between tasks, but this order minimizes rebase pain and keeps each commit small.
9. Risks
| Risk | Mitigation |
|---|---|
Springdoc @DiscriminatorMapping on a sealed interface not honored in 2.8.6 |
OpenApiCustomizer bean fallback (§7). |
Legacy alert_rules.condition JSONB rows have values that don't match new enum names |
Pre-flight SQL inspection; V15 Flyway repair migration only if needed. |
NotificationContextBuilder refactor silently changes rendered notifications |
Golden-file integration test — one sample notification per (severity × kind); run before and after refactor. |
| Playwright spec flakiness on CodeMirror autocomplete timing | waitForFunction on completion state; 3 consecutive green CI runs before merge. |
@Valid enum enforcement breaks existing rule creation flows in the UI |
Task-4 commits bundle generate-api:live + typecheck-fixing; any UI call-site change ships with the backend change. |
10. Acceptance Criteria
Plan 04 is done when:
- Reverting commit
5edf7eb2(re-introducing theAlertingMetricsctor bug) causesmvn clean verifyto fail. Revert the revert; test stays. grep -r "ALERT_VARIABLES" ui/srcshows no hand-maintained registry — only the backend-fetched hook and its consumers.ui/src/pages/Alerts/enums.tshas no hand-declared unions forRouteMetric,Comparator,JvmAggregation,ExchangeFireMode— all four are derived fromschema.d.ts. The top-of-file explanatory comment is gone.- All 5 condition String fields are enum-typed on the wire (inspect
openapi.json) and inschema.d.ts. alerting-editor.spec.tsgreen in CI for 3 consecutive runs..claude/rules/app-classes.mdupdated: newAlertTemplateVariablesControllerlisted under the flat-endpoint allow-list with the reason ("registry is tenant- and env-agnostic global metadata").openapi.jsonandschema.d.tsregenerated and committed alongside the backend changes they reflect.
11. Estimated Commit Cadence
Approximate, for sizing:
feat(alerting): discriminator mapping on AlertCondition— Task 5 backend.refactor(ui/alerts): derive polymorphic unions from schema— Task 5 UI cleanup.refactor(alerting): AgentStateCondition.state → AgentState enum.refactor(alerting): DeploymentStateCondition.states → DeploymentStatus enum.refactor(alerting): LogPatternCondition.level → LogLevel enum(adds new enum).refactor(alerting): JvmMetricCondition.metric — @Schema(allowableValues) hint(no type change; OpenAPI hint only).refactor(alerting): ExchangeFilter.status → ExecutionStatus enum.test(alerting): Spring context-startup smoke.feat(alerting): template-variables SSOT endpoint— Task 2 backend.refactor(ui/alerts): consume template-variables via API— Task 2 frontend.test(ui/alerts): Playwright editor spec.
~11 commits, bundleable to ~8 if enum flips are trivial. Each atomic and reversible.
12. CRITICAL process rules (per CLAUDE.md)
gitnexus_impactonAlertCondition,NotificationContextBuilder,AlertingMetrics,ExchangeMatchCondition, each touched subtype before editing.gitnexus_detect_changesbefore every commit.cd ui && npm run generate-api:liveafter every controller or DTO change..claude/rules/app-classes.mdupdated in the same commit that addsAlertTemplateVariablesController..claude/rules/core-classes.mdupdated when the newLogLevelenum lands.- Maven:
mvn clean verifymust stay green through every commit. - UI:
cd ui && npm run typecheck && npm run test && npm run test:e2emust stay green through every commit.