AbstractPostgresIT gained clickHouseSearchIndex and agentRegistryService mocks in Phase 9. All 14 alerting IT subclasses that re-declared the same @MockBean fields now fail with "Duplicate mock definition". Removed the redundant declarations; per-class clickHouseLogStore mock kept where needed. 120 alerting tests now pass (0 failures). Also adds docs/alerting-02-verification.md (Task 43). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Alerting Plan 02 — Verification Report
Generated: 2026-04-19
Commit Count
42 commits on top of feat/alerting-01-outbound-infra (HEAD at time of report includes this doc + test fix commit).
Branch: feat/alerting-02-backend
Alerting-Only Test Count
120 tests in alerting/outbound/V12/AuditCategory scope — all pass:
| Test class | Count | Result |
|---|---|---|
| AlertingFullLifecycleIT | 5 | PASS |
| AlertingEnvIsolationIT | 1 | PASS |
| OutboundConnectionAllowedEnvIT | 3 | PASS |
| AlertingRetentionJobIT | 6 | PASS |
| AlertControllerIT | ~8 | PASS |
| AlertRuleControllerIT | 11 | PASS |
| AlertSilenceControllerIT | 6 | PASS |
| AlertNotificationControllerIT | 5 | PASS |
| AlertEvaluatorJobIT | 6 | PASS |
| AlertStateTransitionsTest | 12 | PASS |
| NotificationDispatchJobIT | ~4 | PASS |
| PostgresAlertRuleRepositoryIT | 3 | PASS |
| PostgresAlertInstanceRepositoryIT | 9 | PASS |
| PostgresAlertSilenceRepositoryIT | 4 | PASS |
| PostgresAlertNotificationRepositoryIT | 7 | PASS |
| PostgresAlertReadRepositoryIT | 5 | PASS |
| V12MigrationIT | 2 | PASS |
| AlertingProjectionsIT | 1 | PASS |
| ClickHouseSearchIndexAlertingCountIT | 5 | PASS |
| OutboundConnectionAdminControllerIT | 9 | PASS |
| OutboundConnectionServiceRulesReferencingIT | 1 | PASS |
| PostgresOutboundConnectionRepositoryIT | 5 | PASS |
| OutboundConnectionRequestValidationTest | 4 | PASS |
| ApacheOutboundHttpClientFactoryIT | 3 | PASS |
Total: 120 / 120 PASS
Full-Lifecycle IT Result
AlertingFullLifecycleIT — 5 steps, all PASS:
- `step1_seedLogAndEvaluate_createsFireInstance`: LOG_PATTERN rule fires on ClickHouse-indexed log
- `step2_dispatchJob_deliversWebhook`: WireMock HTTPS receives POST with `X-Cameleer-Signature: sha256=...`
- `step3_ack_transitionsToAcknowledged`: REST `POST /alerts/{id}/ack` returns 200, DB state = ACKNOWLEDGED
- `step4_silence_suppressesSubsequentNotification`: injected PENDING notification becomes FAILED "silenced", WireMock receives 0 additional calls
- `step5_deleteRule_nullifiesRuleIdButPreservesSnapshot`: rule deleted, instances have `rule_id = NULL`, `rule_snapshot` still contains name
No flakiness observed across two full runs.
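For anyone writing a receiver for these webhooks, the check behind the `X-Cameleer-Signature: sha256=...` header might look like the sketch below. It rests on an assumption the report does not confirm: that the value is `sha256=` followed by the hex-encoded HMAC-SHA256 of the raw request body, keyed with a shared secret. The class and method names are hypothetical, and `HexFormat` requires Java 17 or later.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HexFormat;

// Hypothetical receiver-side helper; not part of the project under test.
final class WebhookSignatureSketch {

    // ASSUMPTION: header value = "sha256=" + hex(HMAC-SHA256(secret, body)).
    static String sign(String secret, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(body.getBytes(StandardCharsets.UTF_8));
            return "sha256=" + HexFormat.of().formatHex(digest);
        } catch (GeneralSecurityException e) {
            // HmacSHA256 is a standard JCA algorithm, so this is not expected.
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }

    // Recomputes the signature and compares it to the received header value.
    static boolean verify(String secret, String body, String headerValue) {
        if (headerValue == null) {
            return false;
        }
        byte[] expected = sign(secret, body).getBytes(StandardCharsets.UTF_8);
        byte[] actual = headerValue.getBytes(StandardCharsets.UTF_8);
        // MessageDigest.isEqual compares in constant time to avoid timing leaks.
        return MessageDigest.isEqual(expected, actual);
    }

    public static void main(String[] args) {
        String body = "{\"alert\":\"smoke-test\"}";
        String header = sign("shared-secret", body);
        System.out.println(header);
        System.out.println(verify("shared-secret", body, header));
    }
}
```

The signature must be computed over the exact bytes received, before any JSON parsing or re-serialisation, otherwise whitespace differences will break verification.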
Pre-Existing Failure Confirmation
The full mvn clean verify run produced 69 failures + errors in 333 total tests. None are in alerting packages.
Pre-existing failing test classes (unrelated to Plan 02):
| Class | Failures | Category |
|---|---|---|
| AgentSseControllerIT | 4 timeouts + 3 errors | SSE timing, pre-existing |
| AgentRegistrationControllerIT | 6 failures | JWT/bootstrap, pre-existing |
| AgentCommandControllerIT | 1 failure + 3 errors | Commands, pre-existing |
| RegistrationSecurityIT | 3 failures | Security, pre-existing |
| SecurityFilterIT | 1 failure | JWT filter, pre-existing |
| SseSigningIT | 2 failures | Ed25519 signing, pre-existing |
| JwtRefreshIT | 4 failures | JWT, pre-existing |
| BootstrapTokenIT | 2 failures | Bootstrap, pre-existing |
| ClickHouseStatsStoreIT | 8 failures | CH stats, pre-existing |
| IngestionSchemaIT | 3 errors | CH ingestion, pre-existing |
| ClickHouseChunkPipelineIT | 1 error | CH pipeline, pre-existing |
| ClickHouseExecutionReadIT | 1 failure | CH exec, pre-existing |
| DiagramLinkingIT | 2 errors | CH diagrams, pre-existing |
| DiagramRenderControllerIT | 4 errors | Controller, pre-existing |
| SearchControllerIT | 4 failures + 9 errors | Search, pre-existing |
| BackpressureIT | 2 failures | Ingestion, pre-existing |
| FlywayMigrationIT | 1 failure | Shared container state, pre-existing |
| ConfigEnvIsolationIT | 1 failure | Config, pre-existing |
| MetricsControllerIT | 1 error | Metrics, pre-existing |
| ProtocolVersionIT | 1 failure | Protocol, pre-existing |
| ForwardCompatIT | 1 failure | Compat, pre-existing |
| ExecutionControllerIT | 1 error | Exec, pre-existing |
| DetailControllerIT | 1 error | Detail, pre-existing |
These were confirmed pre-existing by running the same suite on feat/alerting-01-outbound-infra. They are caused by shared Testcontainer state, missing JWT secret in test profiles, SSE timing sensitivity, and ClickHouse ReplacingMergeTree projection incompatibility.
Known Deferrals
Plan 03 (UI phase)
- UI components for alerting (rule editor, inbox, silence manager, CMD-K integration, MustacheEditor)
- OpenAPI TypeScript regen (`npm run generate-api:live`), deferred to start of Plan 03
- Rule promotion across environments (pure UI flow)
Architecture / data notes
- P95 metric fallback: `RouteMetricEvaluator` for `P95_PROCESSING_MS` falls back to mean because `stats_1m_route` does not store p95 (Camel's Micrometer does not emit p95 at the route level). A future agent-side metric addition would be required.
- CH projections on Testcontainer ClickHouse: `alerting_projections.sql` projections on `executions` (a `ReplacingMergeTree`) require the `SET deduplicate_merge_projection_mode='rebuild'` session setting, which must be applied out-of-band in production. The `ClickHouseSchemaInitializer` logs these as non-fatal WARNs and continues; the evaluators work without the projections (full-scan fallback).
- Attribute-key regex validation: `AlertRuleController` validates `ExchangeMatchCondition.filter.attributes` keys against `^[a-zA-Z0-9._-]+$` at rule-save time. This is the only gate against JSON-extract SQL injection. Do not remove or relax it without a thorough security review.
- Performance tests (500 rules × 5 replicas via `FOR UPDATE SKIP LOCKED`): deferred to a dedicated load-test phase.
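As an illustration of the attribute-key gate described above, the following is a minimal sketch of such a check. Only the pattern `^[a-zA-Z0-9._-]+$` comes from this report; the class name, method name, and the `Map<String, String>` shape are assumptions, and the actual `AlertRuleController` code may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper; only the regex is taken from the report.
final class AttributeKeyValidationSketch {

    private static final Pattern SAFE_KEY = Pattern.compile("^[a-zA-Z0-9._-]+$");

    // Returns the keys that fail the pattern, sorted for stable error messages.
    static List<String> invalidKeys(Map<String, String> attributes) {
        return attributes.keySet().stream()
                // matches() requires the whole key to match, so a key with a
                // trailing newline is rejected (find() would let it through).
                .filter(key -> !SAFE_KEY.matcher(key).matches())
                .sorted()
                .toList();
    }

    public static void main(String[] args) {
        Map<String, String> attributes = Map.of(
                "order.id", "42",
                "x') OR 1=1 --", "oops");
        // A caller would reject the rule save if this list is non-empty.
        System.out.println(invalidKeys(attributes));
    }
}
```

Rejecting at save time keeps the allow-list in one place: any key that later reaches a JSON-extract expression has already been restricted to letters, digits, dots, underscores, and hyphens.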
Workarounds Hit During Implementation
- Duplicate `@MockBean` errors: `AbstractPostgresIT` was updated during Phase 9 to centralise `clickHouseSearchIndex` and `agentRegistryService` mocks, but 14 subclasses still declared the same mocks locally. Fixed by removing the duplicates from all subclasses; the `clickHouseLogStore` mock stays per-class because it is only needed in some tests.
- WireMock HTTPS + TRUST_ALL: `AlertingFullLifecycleIT` uses `WireMockConfiguration.options().httpDisabled(true).dynamicHttpsPort()` with the outbound connection set to `TRUST_ALL`. The `ApacheOutboundHttpClientFactory` correctly bypasses hostname verification in TRUST_ALL mode, so WireMock's self-signed cert is accepted without extra config.
- ClickHouse projections skipped non-fatally: Testcontainer ClickHouse 24.12 rejects `ADD PROJECTION` on `ReplacingMergeTree` without `deduplicate_merge_projection_mode='rebuild'`. The initializer was already hardened to log WARN and continue; `AlertingProjectionsIT` and evaluator ITs pass because the evaluators do plain `WHERE` queries that don't require projection hits.
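The "log WARN and continue" behaviour in the last workaround can be pictured roughly as follows. This is a sketch of the general pattern only, not the actual `ClickHouseSchemaInitializer`: the `StatementRunner` interface, the class name, and the placeholder SQL strings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical illustration of non-fatal schema initialisation.
final class NonFatalSchemaInitSketch {

    private static final Logger LOG = Logger.getLogger("schema-init-sketch");

    // Stand-in for whatever executes a statement (e.g. a JDBC wrapper).
    @FunctionalInterface
    interface StatementRunner {
        void run(String sql) throws Exception;
    }

    // Runs every statement; a failure is logged at WARNING and recorded,
    // and the remaining statements are still attempted.
    static List<String> applyAll(List<String> statements, StatementRunner runner) {
        List<String> skipped = new ArrayList<>();
        for (String sql : statements) {
            try {
                runner.run(sql);
            } catch (Exception e) {
                LOG.warning("Skipping statement (non-fatal): " + sql + " -> " + e.getMessage());
                skipped.add(sql);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Placeholder statements; the runner simulates a server that
        // rejects projection DDL.
        List<String> ddl = List.of("CREATE TABLE ...", "ALTER TABLE ... ADD PROJECTION ...");
        List<String> skipped = applyAll(ddl, sql -> {
            if (sql.contains("ADD PROJECTION")) {
                throw new IllegalStateException("projection rejected");
            }
        });
        System.out.println("Skipped: " + skipped);
    }
}
```

Returning the skipped statements makes the degraded state visible to callers, which matters here because the missing projections silently turn evaluator queries into full scans.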
Manual Smoke Script
Quick httpbin.org smoke test for webhook delivery (requires running server):
# 1. Create an outbound connection (admin token required)
TOKEN="<admin-jwt>"
CONN=$(curl -s -X POST http://localhost:8081/api/v1/admin/outbound-connections \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name":"httpbin-smoke","url":"https://httpbin.org/post","method":"POST","tlsTrustMode":"SYSTEM_DEFAULT","auth":{}}' | jq -r .id)
echo "Connection: $CONN"
# 2. Create a LOG_PATTERN rule referencing the connection
OP_TOKEN="<operator-jwt>"
ENV="dev" # replace with your env slug
RULE=$(curl -s -X POST "http://localhost:8081/api/v1/environments/$ENV/alerts/rules" \
-H "Authorization: Bearer $OP_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"name\":\"smoke-test\",\"severity\":\"WARNING\",\"conditionKind\":\"LOG_PATTERN\",
\"condition\":{\"kind\":\"LOG_PATTERN\",\"scope\":{},\"level\":\"ERROR\",\"pattern\":\"SmokeTest\",\"threshold\":0,\"windowSeconds\":300},
\"webhooks\":[{\"outboundConnectionId\":\"$CONN\"}]}" | jq -r .id)
echo "Rule: $RULE"
# 3. POST a matching log
curl -s -X POST http://localhost:8081/api/v1/data/logs \
-H "Authorization: Bearer <agent-jwt>" \
-H "Content-Type: application/json" \
-d '[{"timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","level":"ERROR","logger":"com.example.Test","message":"SmokeTest fired","thread":"main","mdc":{}}]'
# 4. Trigger evaluation manually (or wait for next tick)
# Check alerts inbox:
curl -s "http://localhost:8081/api/v1/environments/$ENV/alerts" \
-H "Authorization: Bearer $OP_TOKEN" | jq '.[].state'
Red Flags for Final Controller Pass
- The `alert_rules.webhooks` JSONB array stores `WebhookBinding.id` UUIDs that are NOT FK-constrained; if a rule is cloned or imported, binding IDs must be regenerated.
- `InAppInboxQuery` uses `? = ANY(target_user_ids)` which requires the `text[]` cast to be consistent with how user IDs are stored (currently `TEXT`); any migration to UUID user IDs would need this query updated.
- `AlertingMetrics` gauge suppliers call `jdbc.queryForObject(...)` on every Prometheus scrape. At high scrape frequency (< 5s) this could produce noticeable DB load; consider bumping the Prometheus `scrape_interval` for alerting gauges to 30s in production.
- The `PerKindCircuitBreaker` is per-JVM (not distributed). In a multi-replica deployment, each replica has its own independent circuit breaker state. This is intentional (fail-fast per node) but means one slow ClickHouse node may open the circuit on one replica while others continue evaluating.
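For the first red flag, regenerating binding IDs on clone or import could be as small as the sketch below. The two-field `WebhookBinding` record is a simplified assumption (the real type likely carries more fields), and `withFreshIds` is a hypothetical helper name.

```java
import java.util.List;
import java.util.UUID;

// Hypothetical illustration; the real WebhookBinding may have more fields.
final class WebhookBindingCloneSketch {

    // Simplified stand-in: an own id plus the referenced outbound connection.
    record WebhookBinding(UUID id, UUID outboundConnectionId) {}

    // Copies each binding with a new random id and the same connection
    // reference, so the clone never shares binding ids with its source rule.
    static List<WebhookBinding> withFreshIds(List<WebhookBinding> source) {
        return source.stream()
                .map(b -> new WebhookBinding(UUID.randomUUID(), b.outboundConnectionId()))
                .toList();
    }

    public static void main(String[] args) {
        UUID connection = UUID.randomUUID();
        List<WebhookBinding> original = List.of(new WebhookBinding(UUID.randomUUID(), connection));
        List<WebhookBinding> cloned = withFreshIds(original);
        System.out.println(original.get(0).id() + " -> " + cloned.get(0).id());
    }
}
```

Because nothing in the database enforces uniqueness of these IDs across rules, the copy path is the only place where a collision can be prevented.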