Files
cameleer-server/docs/superpowers/specs/2026-04-20-alerting-04-hardening-design.md
hsiegeln 46867cc659 docs(alerting): Plan 04 design spec — post-ship hardening
Closes the loop on three bug classes from Plan 03 triage:
context-load regressions (missing @Autowired), UI/backend drift
on template variables, and hand-maintained TS enum unions caused
by springdoc polymorphic schema quirk.

Covers 5 tasks: context-startup smoke test, template-variables
SSOT endpoint, second Playwright spec, String-to-enum migrations
on 5 condition fields, and @DiscriminatorMapping on AlertCondition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:44:41 +02:00

23 KiB
Raw Blame History

Alerting — Plan 04 Design Spec: Post-Ship Hardening

Date: 2026-04-20 Status: Draft — awaiting user review Surfaces: server (core + app), UI, tests Related: Plan 03 (shipped as PR #144); builds on Alerting design


1. Summary

Close the loop on three bug classes that surfaced in Plan 03's post-merge triage:

  1. Spring wiring regressions not caught by unit testsAlertingMetrics's production constructor was missing @Autowired and only broke at Docker runtime (commit 5edf7eb2). Unit tests exercised the package-private ctor, so green locally.
  2. UI↔backend drift on Mustache template variablesALERT_VARIABLES in the UI and NotificationContextBuilder on the server were two hand-maintained lists of the same thing; one drift was fixed in commit 18e6dde6.
  3. Hand-maintained TypeScript enum unionsui/src/pages/Alerts/enums.ts declares RouteMetric, Comparator, JvmAggregation, ExchangeFireMode manually because springdoc emits a broken polymorphic schema (condition types resolve to never in TS due to discriminator-literal conflict).

Plan 04 is pure hardening: no new product features. Five tasks that together eliminate these bug classes at their source rather than adding another test that detects them.

Guiding principles

  • Prefer SSOT over drift-detect. When two hand-maintained lists need to match, make one derive from the other rather than adding a test that fails when they don't.
  • Type correctness at the contract layer. Backend records are the contract; if a field has a closed vocabulary, it is an enum. String-typed-with-comment does not survive schema regeneration.
  • Close the loop on Plan 03, don't start new product work. Features (#8 native integrations, #9 CA-bundle UI), performance (#7 load tests), and deferred cleanup (#6 OIDC post-alpha, #10 router lint sweep) stay out of scope — each is a separate future plan.

2. Scope

In scope

  1. Context-startup smoke test — a @SpringBootTest that loads the full production application context and asserts the alerting beans are present. Catches @Autowired regressions at mvn verify time.
  2. Template-variables SSOT endpoint — new GET /api/v1/alerts/template-variables returning the declarative registry. UI deletes its duplicate; autocomplete + linter consume the server registry via TanStack Query.
  3. Second Playwright spec — rule EDIT path, MustacheEditor autocomplete + linter in-browser, env-promotion warnings banner, inbox ack, bulk-read.
  4. Typed vocabulary for 5 String condition fieldsAgentStateCondition.state, DeploymentStateCondition.states, LogPatternCondition.level, ExchangeFilter.status become proper Java enums. JvmMetricCondition.metric stays String + @Schema(allowableValues={...}) because agents report raw Micrometer metric names that the evaluator passes through unchanged. OpenAPI emits closed unions for all five; UI types derive.
  5. Springdoc discriminator mapping — add @Schema(discriminatorProperty, discriminatorMapping) to the AlertCondition sealed interface. TS stops generating kind: "<ClassName>" literals. Hand-declared unions in ui/src/pages/Alerts/enums.ts migrate to derived types.

Out of scope (explicit)

To keep Plan 04 focused, none of these are touched in this plan. Each has a separate issue or a future plan slot:

  • Native Slack / Teams / PagerDuty integrations (Plan 03 spec BL-002 — future plan).
  • Admin CA-bundle UI (gitea #137 / BL-001).
  • Router.tsx lint sweep (207 fast-refresh warnings, pre-existed Plan 03 — standalone lint PR).
  • Performance ITs for 500 rules × 5 replicas (Plan 03 spec deferral — separate plan, different shape).
  • OIDC / UserAdmin re-audit (commit ae647363 was minimal; re-audit post-alpha).
  • Orphan .worktrees/ cleanup (user holds authorization).

3. Task 1 — Context-startup smoke test

Problem

Plan 03 shipped a regression where AlertingMetrics's production constructor was missing @Autowired. Unit tests instantiated it via a package-private ctor, so local builds and CI were green. Spring context-load failed only when the full app started under Docker. The entire class of regression (missing annotation on any alerting-subsystem bean) is invisible to the current test pyramid.

Design

Location: cameleer-server-app/src/test/java/com/cameleer/server/app/ContextStartupSmokeTest.java

Shape:

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE)
@ActiveProfiles("test-context-smoke")
class ContextStartupSmokeTest {
    @Autowired ApplicationContext ctx;

    @Test
    void productionContextLoadsWithAllAlertingBeans() {
        assertNotNull(ctx.getBean(AlertingMetrics.class));
        assertNotNull(ctx.getBean(AlertEvaluatorJob.class));
        assertNotNull(ctx.getBean(NotificationDispatchJob.class));
        assertNotNull(ctx.getBean(NotificationContextBuilder.class));
        assertNotNull(ctx.getBean(WebhookDispatcher.class));
        assertNotNull(ctx.getBean(MustacheRenderer.class));
        assertNotNull(ctx.getBean(AlertingRetentionJob.class));
        assertNotNull(ctx.getBean(PerKindCircuitBreaker.class));
        // + one assertion per ConditionEvaluator impl (6 evaluators)
    }
}

Profile stubs: test-context-smoke profile replaces DataSource, ClickHouseJdbcTemplate, and RuntimeOrchestrator with Mockito mocks so the test doesn't need Testcontainers. Real beans for everything alerting.

Rationale

  • Why not extend an existing @SpringBootTest? Existing ITs use Testcontainers for Postgres + ClickHouse. They catch query-path regressions but are slow and duplicate work we don't need. The bug class here — missing @Autowired on a ctor — surfaces during bean creation regardless of whether dependencies are real or stubbed.
  • Why assert individual beans, not just "context loads"? A context loads if any bean breaks is still a passing test if the broken bean is conditional. Naming each bean makes intent explicit and failure messages readable.

Validation

Intentionally break one bean (remove @Autowired or @Component) and confirm the test fails with a clear message. Revert before committing.


4. Task 2 — Template-variables SSOT endpoint

Problem

ui/src/components/MustacheEditor/alert-variables.ts is a hand-maintained list of template variable paths (alert.rule.name, exchange.status, etc.) mirroring what NotificationContextBuilder actually produces. On 2026-04-20, commit 18e6dde6 fixed one drift between them. Until drift is structurally prevented, it will recur.

Design

Backend

  1. Declarative registry in core. Extract NotificationContextBuilder's leaf paths into a static constant:

    public record TemplateVariableDescriptor(
        String path,           // "alert.rule.name"
        String type,           // "string" | "number" | "boolean" | "datetime"
        String description,
        String example
    ) {}
    
    public final class AlertTemplateVariables {
        public static final List<TemplateVariableDescriptor> ALL = List.of(
            new TemplateVariableDescriptor("alert.rule.name",    "string",   "Rule display name", "High error rate"),
            new TemplateVariableDescriptor("alert.firing_at",    "datetime", "ISO-8601 firing timestamp", "2026-04-20T12:34:56Z"),
            // ... exhaustive list
        );
    }
    
  2. Refactor NotificationContextBuilder to build its context from this registry, so divergence is structural impossibility rather than convention.

  3. Round-trip unit test. AlertTemplateVariablesTest: for every descriptor in ALL, confirm the path resolves to a non-null value when NotificationContextBuilder.build(...) runs against a canonical sample firing instance. Fails if a variable is added to ALL but the builder doesn't produce it (or vice versa).

  4. Controller. AlertTemplateVariablesController at GET /api/v1/alerts/template-variables:

    • Flat endpoint (added to allow-list in .claude/rules/app-classes.md, reason: registry is tenant- and env-agnostic global metadata).
    • VIEWER+ (operators authoring webhook bodies are typically VIEWER+).
    • Returns List<TemplateVariableDto> — identical shape to TemplateVariableDescriptor.
    • Response is deterministic and cacheable (Cache-Control: public, max-age=3600 — the registry changes only across server versions).

Frontend

  1. New query hook: ui/src/api/queries/alertTemplateVariables.ts exposing useTemplateVariables(). TanStack Query config: staleTime: Infinity, gcTime: Infinity — the list never changes at runtime.

  2. Delete ui/src/components/MustacheEditor/alert-variables.ts entirely.

  3. Prop-drill the variable list:

    • MustacheEditor.tsx takes variables: TemplateVariable[] as a prop.
    • mustache-completion.ts and mustache-linter.ts become functions taking the list as an argument instead of importing a module.
    • NotifyStep.tsx calls useTemplateVariables() once and passes the result.
    • Webhook body/header editors do the same.
  4. Loading state: while the variables are still being fetched, the MustacheEditor shows the textarea without autocomplete rather than blocking. The query runs on mount of NotifyStep (or whichever component opens first) and the result is cached forever.

Rationale

  • Why a runtime endpoint, not a build-time JSON artifact? The registry is also useful to admins outside the wizard — e.g. generated docs, tooling that inspects the schema. A REST endpoint is idiomatic API surface; a generated JSON file is build plumbing that couples UI to server at build time.
  • Why staleTime: Infinity? The registry only changes when the server version changes, at which point the SPA is also redeployed (fresh fetch on load). No point polling.

5. Task 3 — Second Playwright spec

Problem

Existing ui/src/test/e2e/alerting.spec.ts covers CREATE + DELETE of rules and the sidebar structure. It doesn't exercise the EDIT path, MustacheEditor autocomplete in the browser, the env-promotion warnings banner, or inbox interactions.

Design

File: ui/src/test/e2e/alerting-editor.spec.ts (new; existing smoke stays unchanged)

Cases:

  1. Rule EDIT. Seed a rule via backend API, navigate to /alerts/rules/:id, change threshold field, save, reload page, assert threshold persisted.
  2. MustacheEditor autocomplete. On NotifyStep, focus the notification title field, type {{, assert CodeMirror completion popover appears listing alert.rule.name (and friends), arrow-down + Enter, assert insertion.
  3. MustacheEditor linter. Type {{alert.not_a_real_var}}, assert a linter diagnostic appears within 500 ms.
  4. Env-promotion warnings. Seed a rule in env A with scope.routeId referencing a route that doesn't exist in env B. Open RulesListPage, promote to env B, assert the wizard opens with a warning banner listing the missing route.
  5. Inbox ack. Seed a FIRING alert via API, navigate to /alerts/inbox, click Ack button on the row, assert state chip flips to ACKNOWLEDGED and row's Ack button disables.
  6. Bulk-read. Seed 3 FIRING alerts, select-all via header checkbox, click bulk-read button, assert notification-bell unread-count becomes 0.

Stability

  • Use page.waitForFunction on CodeMirror's EditorView.completionStatus rather than waitForTimeout for autocomplete assertions.
  • Gate CI on 3 consecutive green runs of the new spec before merging (reviewer checks the CI history, not a scripted gate).
  • Seed fixtures via API, not via page navigation — each test independent, no ordering dependency.

Rationale

Playwright tests are the only layer that exercises CodeMirror's live behaviour. Unit tests mock it; integration tests skip it. The cost of flakiness mitigation (three green runs, waitForFunction over waitForTimeout) buys the coverage that bit us in Plan 03 triage.


6. Task 4 — Typed vocabulary for condition String fields

Problem

Five condition fields are typed as String in the Java records, so springdoc emits "type": "string" without enum. UI has no way to derive a closed union and either (a) cannot render a dropdown or (b) hard-codes the values. See the top-of-file comment in ui/src/pages/Alerts/enums.ts for context.

Design

Four field migrations to proper Java enums + one schema-only hint:

Field Target type Existing or new?
AgentStateCondition.state com.cameleer.server.core.agent.AgentState (enum) existing (LIVE, STALE, DEAD, SHUTDOWN)
DeploymentStateCondition.states List<com.cameleer.server.core.runtime.DeploymentStatus> (enum) existing (STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED)
LogPatternCondition.level new com.cameleer.server.core.alerting.LogLevel (enum) new — TRACE, DEBUG, INFO, WARN, ERROR
ExchangeFilter.status com.cameleer.common.model.ExecutionStatus (enum) existing in cameleer-common (RUNNING, COMPLETED, FAILED, ABANDONED)
JvmMetricCondition.metric stays String + @Schema(allowableValues={...}) schema-only hint — wire stays as agent-reported Micrometer metric names (jvm.memory.used.value, etc.)

Why JvmMetric is different. JvmMetricEvaluator passes c.metric() directly to MetricsQueryStore.queryTimeSeries(agentId, List.of(c.metric()), ...). The field is a raw Micrometer metric name emitted by the agent — jvm.memory.used.value, process.cpu.usage.value, jvm.gc.pause.total_time, etc. Making this an enum either (a) forces an un-idiomatic mapping between SCREAMING_SNAKE constants and dotted strings, or (b) locks out any metric not in the hard-coded list. Option (a)'s @Schema(allowableValues=…) gives springdoc enough to emit a narrowed TS union for UI autocomplete while leaving the backend permissive. The curated list comes from .claude/rules/metrics.md.

Persistence

alert_rules.condition is stored as JSONB.

  • For the 4 enum migrations: Jackson serializes enum values by .name(); on-the-wire payloads already use uppercase tokens ("COMPLETED", not "Completed"). Legacy rows round-trip unchanged.
  • For JvmMetricCondition.metric: no change — it's already a String with dotted-name values. @Schema(allowableValues=…) affects OpenAPI emission only, not deserialization or validation.

Pre-flight check: before the first enum-flip commit, run against the local dev DB:

SELECT condition_kind, condition->>'level' AS val
FROM alert_rules WHERE condition_kind='LOG_PATTERN'
UNION ALL
SELECT condition_kind, condition->>'state'
FROM alert_rules WHERE condition_kind='AGENT_STATE'
-- etc, one union per migrating field

If any row has a value that doesn't match the target enum (e.g. lowercase "info" instead of "INFO"), Plan 04 adds a one-shot Flyway V15__alert_condition_enum_repair.sql that normalizes them. If all values match, skip the migration.

Validation

Controller-level @Valid on AlertRuleRequest automatically enforces enum membership — unknown values produce 400 Bad Request. Replaces any ad-hoc string checks in evaluators.

Integration test per migrated field: POST /rules with known-bad value → 400; known-good value → 201.

UI cleanup

After each enum migration ships, the corresponding manual artifact in ui/src/pages/Alerts/enums.ts (if any) is either deleted (derived type replaces it) or narrowed to label maps only. The top-of-file comment noting "Fields whose backend type is String … aren't in here at all" becomes obsolete and gets deleted.


7. Task 5 — Springdoc discriminator mapping

Problem

The generated OpenAPI has a discriminator for AlertCondition but no mapping:

"AlertCondition": {
  "discriminator": { "propertyName": "kind" },
  "properties": { "kind": { "enum": ["ROUTE_METRIC", ...] } }
}

When openapi-typescript encounters a discriminator without a mapping, it synthesizes the literal from the schema name. Result:

AgentStateCondition: {
  kind: "AgentStateCondition";   // synthesized — class simple name
} & (Omit<AlertCondition, "kind"> & {
  readonly kind?: "ROUTE_METRIC" | ... | "JVM_METRIC";
});

Indexed access like AgentStateCondition['kind'] is "AgentStateCondition" & ("ROUTE_METRIC" | ...) = never. Record<T, string> exhaustiveness checks silently collapse.

Design

One edit to cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/AlertCondition.java:

import io.swagger.v3.oas.annotations.media.DiscriminatorMapping;
import io.swagger.v3.oas.annotations.media.Schema;

@Schema(
    discriminatorProperty = "kind",
    discriminatorMapping = {
        @DiscriminatorMapping(value = "ROUTE_METRIC",     schema = RouteMetricCondition.class),
        @DiscriminatorMapping(value = "EXCHANGE_MATCH",   schema = ExchangeMatchCondition.class),
        @DiscriminatorMapping(value = "AGENT_STATE",      schema = AgentStateCondition.class),
        @DiscriminatorMapping(value = "DEPLOYMENT_STATE", schema = DeploymentStateCondition.class),
        @DiscriminatorMapping(value = "LOG_PATTERN",      schema = LogPatternCondition.class),
        @DiscriminatorMapping(value = "JVM_METRIC",       schema = JvmMetricCondition.class)
    }
)
@JsonTypeInfo(...)  // unchanged
@JsonSubTypes(...)  // unchanged
public sealed interface AlertCondition ...

After mvn compile && cd ui && npm run generate-api:live, the schema gains:

"discriminator": {
  "propertyName": "kind",
  "mapping": {
    "ROUTE_METRIC": "#/components/schemas/RouteMetricCondition",
    ...
  }
}

And openapi-typescript emits kind: "ROUTE_METRIC" instead of kind: "RouteMetricCondition".

Fallback

If springdoc 2.8.6 does not honor @DiscriminatorMapping on a sealed interface (sealed types are newer than the library's primary target surface), add a SpringDocConfig bean:

@Bean OpenApiCustomizer alertConditionDiscriminatorFix() {
    return openApi -> {
        Schema<?> alertCondition = openApi.getComponents().getSchemas().get("AlertCondition");
        alertCondition.getDiscriminator().mapping(Map.of(
            "ROUTE_METRIC", "#/components/schemas/RouteMetricCondition",
            // ...
        ));
    };
}

~20 lines, no behavioural change for the server — emits the exact same OpenAPI. Plan 04 ships either way.

UI cleanup

Once the schema is correct, ui/src/pages/Alerts/enums.ts is simplified:

  • Delete hand-declared RouteMetric, Comparator, JvmAggregation, ExchangeFireMode unions.
  • Replace with derived indexed access:
    type RouteMetric = NonNullable<components['schemas']['RouteMetricCondition']['metric']>;
    type Comparator  = NonNullable<components['schemas']['RouteMetricCondition']['comparator']>;
    // etc.
    
  • Delete the 30-line comment block at the top of the file explaining the workaround.
  • Keep HIDDEN arrays and label-map Record<T, string> exhaustiveness enforcement — those remain valuable.

8. Execution Order

  1. Task 5 (springdoc discriminator mapping) — purely additive annotation.
  2. Task 4 (String → enum migrations, one commit per field).
  3. Task 1 (context-startup smoke — independent, slots anywhere but early catches 4/5 fallout).
  4. Task 2 (template-variables SSOT).
  5. Task 3 (Playwright spec — last, exercises all of the above).

No hard blocking dependencies between tasks, but this order minimizes rebase pain and keeps each commit small.


9. Risks

Risk Mitigation
Springdoc @DiscriminatorMapping on a sealed interface not honored in 2.8.6 OpenApiCustomizer bean fallback (§7).
Legacy alert_rules.condition JSONB rows have values that don't match new enum names Pre-flight SQL inspection; V15 Flyway repair migration only if needed.
NotificationContextBuilder refactor silently changes rendered notifications Golden-file integration test — one sample notification per (severity × kind); run before and after refactor.
Playwright spec flakiness on CodeMirror autocomplete timing waitForFunction on completion state; 3 consecutive green CI runs before merge.
@Valid enum enforcement breaks existing rule creation flows in the UI Task-4 commits bundle generate-api:live + typecheck-fixing; any UI call-site change ships with the backend change.

10. Acceptance Criteria

Plan 04 is done when:

  1. Reverting commit 5edf7eb2 (re-introducing the AlertingMetrics ctor bug) causes mvn clean verify to fail. Revert the revert; test stays.
  2. grep -r "ALERT_VARIABLES" ui/src shows no hand-maintained registry — only the backend-fetched hook and its consumers.
  3. ui/src/pages/Alerts/enums.ts has no hand-declared unions for RouteMetric, Comparator, JvmAggregation, ExchangeFireMode — all four are derived from schema.d.ts. The top-of-file explanatory comment is gone.
  4. All 5 condition String fields are enum-typed on the wire (inspect openapi.json) and in schema.d.ts.
  5. alerting-editor.spec.ts green in CI for 3 consecutive runs.
  6. .claude/rules/app-classes.md updated: new AlertTemplateVariablesController listed under the flat-endpoint allow-list with the reason ("registry is tenant- and env-agnostic global metadata").
  7. openapi.json and schema.d.ts regenerated and committed alongside the backend changes they reflect.

11. Estimated Commit Cadence

Approximate, for sizing:

  1. feat(alerting): discriminator mapping on AlertCondition — Task 5 backend.
  2. refactor(ui/alerts): derive polymorphic unions from schema — Task 5 UI cleanup.
  3. refactor(alerting): AgentStateCondition.state → AgentState enum.
  4. refactor(alerting): DeploymentStateCondition.states → DeploymentStatus enum.
  5. refactor(alerting): LogPatternCondition.level → LogLevel enum (adds new enum).
  6. refactor(alerting): JvmMetricCondition.metric — @Schema(allowableValues) hint (no type change; OpenAPI hint only).
  7. refactor(alerting): ExchangeFilter.status → ExecutionStatus enum.
  8. test(alerting): Spring context-startup smoke.
  9. feat(alerting): template-variables SSOT endpoint — Task 2 backend.
  10. refactor(ui/alerts): consume template-variables via API — Task 2 frontend.
  11. test(ui/alerts): Playwright editor spec.

~11 commits, bundleable to ~8 if enum flips are trivial. Each atomic and reversible.


12. CRITICAL process rules (per CLAUDE.md)

  • gitnexus_impact on AlertCondition, NotificationContextBuilder, AlertingMetrics, ExchangeMatchCondition, each touched subtype before editing.
  • gitnexus_detect_changes before every commit.
  • cd ui && npm run generate-api:live after every controller or DTO change.
  • .claude/rules/app-classes.md updated in the same commit that adds AlertTemplateVariablesController.
  • .claude/rules/core-classes.md updated when the new LogLevel enum lands.
  • Maven: mvn clean verify must stay green through every commit.
  • UI: cd ui && npm run typecheck && npm run test && npm run test:e2e must stay green through every commit.