# Alerting Plan 02: Verification Report

Generated: 2026-04-19

---

## Commit Count

42 commits on top of `feat/alerting-01-outbound-infra` (HEAD at time of report includes this doc + test fix commit).

Branch: `feat/alerting-02-backend`

---

## Alerting-Only Test Count

120 tests in alerting/outbound/V12/AuditCategory scope, all passing (`~` marks an approximate count):

| Test class | Count | Result |
|---|---|---|
| AlertingFullLifecycleIT | 5 | PASS |
| AlertingEnvIsolationIT | 1 | PASS |
| OutboundConnectionAllowedEnvIT | 3 | PASS |
| AlertingRetentionJobIT | 6 | PASS |
| AlertControllerIT | ~8 | PASS |
| AlertRuleControllerIT | 11 | PASS |
| AlertSilenceControllerIT | 6 | PASS |
| AlertNotificationControllerIT | 5 | PASS |
| AlertEvaluatorJobIT | 6 | PASS |
| AlertStateTransitionsTest | 12 | PASS |
| NotificationDispatchJobIT | ~4 | PASS |
| PostgresAlertRuleRepositoryIT | 3 | PASS |
| PostgresAlertInstanceRepositoryIT | 9 | PASS |
| PostgresAlertSilenceRepositoryIT | 4 | PASS |
| PostgresAlertNotificationRepositoryIT | 7 | PASS |
| PostgresAlertReadRepositoryIT | 5 | PASS |
| V12MigrationIT | 2 | PASS |
| AlertingProjectionsIT | 1 | PASS |
| ClickHouseSearchIndexAlertingCountIT | 5 | PASS |
| OutboundConnectionAdminControllerIT | 9 | PASS |
| OutboundConnectionServiceRulesReferencingIT | 1 | PASS |
| PostgresOutboundConnectionRepositoryIT | 5 | PASS |
| OutboundConnectionRequestValidationTest | 4 | PASS |
| ApacheOutboundHttpClientFactoryIT | 3 | PASS |

**Total: 120 / 120 PASS**

---

## Full-Lifecycle IT Result

`AlertingFullLifecycleIT`: 5 steps, all PASS:

1. `step1_seedLogAndEvaluate_createsFireInstance`: LOG_PATTERN rule fires on a ClickHouse-indexed log
2. `step2_dispatchJob_deliversWebhook`: WireMock HTTPS receives a POST with `X-Cameleer-Signature: sha256=...`
3. `step3_ack_transitionsToAcknowledged`: REST `POST /alerts/{id}/ack` returns 200, DB state = ACKNOWLEDGED
4. `step4_silence_suppressesSubsequentNotification`: injected PENDING notification becomes FAILED "silenced", WireMock receives 0 additional calls
5. `step5_deleteRule_nullifiesRuleIdButPreservesSnapshot`: rule deleted, instances have `rule_id = NULL`, `rule_snapshot` still contains the name

No flakiness observed across two full runs.

---

## Pre-Existing Failure Confirmation

The full `mvn clean verify` run produced **69 failures + errors in 333 total tests**. None are in alerting packages.

Pre-existing failing test classes (unrelated to Plan 02):

| Class | Failures | Category |
|---|---|---|
| `AgentSseControllerIT` | 4 timeouts + 3 errors | SSE timing, pre-existing |
| `AgentRegistrationControllerIT` | 6 failures | JWT/bootstrap, pre-existing |
| `AgentCommandControllerIT` | 1 failure + 3 errors | Commands, pre-existing |
| `RegistrationSecurityIT` | 3 failures | Security, pre-existing |
| `SecurityFilterIT` | 1 failure | JWT filter, pre-existing |
| `SseSigningIT` | 2 failures | Ed25519 signing, pre-existing |
| `JwtRefreshIT` | 4 failures | JWT, pre-existing |
| `BootstrapTokenIT` | 2 failures | Bootstrap, pre-existing |
| `ClickHouseStatsStoreIT` | 8 failures | CH stats, pre-existing |
| `IngestionSchemaIT` | 3 errors | CH ingestion, pre-existing |
| `ClickHouseChunkPipelineIT` | 1 error | CH pipeline, pre-existing |
| `ClickHouseExecutionReadIT` | 1 failure | CH exec, pre-existing |
| `DiagramLinkingIT` | 2 errors | CH diagrams, pre-existing |
| `DiagramRenderControllerIT` | 4 errors | Controller, pre-existing |
| `SearchControllerIT` | 4 failures + 9 errors | Search, pre-existing |
| `BackpressureIT` | 2 failures | Ingestion, pre-existing |
| `FlywayMigrationIT` | 1 failure | Shared container state, pre-existing |
| `ConfigEnvIsolationIT` | 1 failure | Config, pre-existing |
| `MetricsControllerIT` | 1 error | Metrics, pre-existing |
| `ProtocolVersionIT` | 1 failure | Protocol, pre-existing |
| `ForwardCompatIT` | 1 failure | Compat, pre-existing |
| `ExecutionControllerIT` | 1 error | Exec, pre-existing |
| `DetailControllerIT` | 1 error | Detail, pre-existing |

These were confirmed pre-existing by running the same suite on `feat/alerting-01-outbound-infra`. They are caused by shared Testcontainer state, a missing JWT secret in test profiles, SSE timing sensitivity, and ClickHouse `ReplacingMergeTree` projection incompatibility.

---

## Known Deferrals

### Plan 03 (UI phase)

- UI components for alerting (rule editor, inbox, silence manager, CMD-K integration, MustacheEditor)
- OpenAPI TypeScript regen (`npm run generate-api:live`), deferred to the start of Plan 03
- Rule promotion across environments (pure UI flow)

### Architecture / data notes

- **P95 metric fallback**: `RouteMetricEvaluator` for `P95_PROCESSING_MS` falls back to the mean because `stats_1m_route` does not store p95 (Camel's Micrometer does not emit p95 at the route level). A future agent-side metric addition would be required.
- **CH projections on Testcontainer ClickHouse**: the `alerting_projections.sql` projections on `executions` (a `ReplacingMergeTree`) require the `SET deduplicate_merge_projection_mode='rebuild'` session setting, which must be applied out-of-band in production. The `ClickHouseSchemaInitializer` logs these as non-fatal WARNs and continues; the evaluators work without the projections (full-scan fallback).
- **Attribute-key regex validation**: `AlertRuleController` validates `ExchangeMatchCondition.filter.attributes` keys against `^[a-zA-Z0-9._-]+$` at rule-save time. This is the only gate against JSON-extract SQL injection. Do not remove or relax it without a thorough security review.
- **Performance tests** (500 rules × 5 replicas via `FOR UPDATE SKIP LOCKED`) are deferred to a dedicated load-test phase.

---

## Workarounds Hit During Implementation

1. **Duplicate `@MockBean` errors**: `AbstractPostgresIT` was updated during Phase 9 to centralise the `clickHouseSearchIndex` and `agentRegistryService` mocks, but 14 subclasses still declared the same mocks locally. Fixed by removing the duplicates from all subclasses; the `clickHouseLogStore` mock stays per-class because it is only needed in some tests.
2. **WireMock HTTPS + TRUST_ALL**: `AlertingFullLifecycleIT` uses `WireMockConfiguration.options().httpDisabled(true).dynamicHttpsPort()` with the outbound connection set to `TRUST_ALL`. The `ApacheOutboundHttpClientFactory` correctly bypasses hostname verification in TRUST_ALL mode, so WireMock's self-signed cert is accepted without extra config.
3. **ClickHouse projections skipped non-fatally**: Testcontainer ClickHouse 24.12 rejects `ADD PROJECTION` on `ReplacingMergeTree` without `deduplicate_merge_projection_mode='rebuild'`. The initializer was already hardened to log WARN and continue; `AlertingProjectionsIT` and the evaluator ITs pass because the evaluators do plain `WHERE` queries that don't require projection hits.

---

## Manual Smoke Script

Quick httpbin.org smoke test for webhook delivery (requires a running server):

```bash
# 1. Create an outbound connection (admin token required)
TOKEN=""   # left blank in this report; set to an admin token before running
CONN=$(curl -s -X POST http://localhost:8081/api/v1/admin/outbound-connections \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name":"httpbin-smoke","url":"https://httpbin.org/post","method":"POST","tlsTrustMode":"SYSTEM_DEFAULT","auth":{}}' | jq -r .id)
echo "Connection: $CONN"

# 2. Create a LOG_PATTERN rule referencing the connection
OP_TOKEN=""   # left blank in this report; set before running
ENV="dev"     # replace with your env slug
RULE=$(curl -s -X POST "http://localhost:8081/api/v1/environments/$ENV/alerts/rules" \
  -H "Authorization: Bearer $OP_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"name\":\"smoke-test\",\"severity\":\"WARNING\",\"conditionKind\":\"LOG_PATTERN\",
       \"condition\":{\"kind\":\"LOG_PATTERN\",\"scope\":{},\"level\":\"ERROR\",\"pattern\":\"SmokeTest\",\"threshold\":0,\"windowSeconds\":300},
       \"webhooks\":[{\"outboundConnectionId\":\"$CONN\"}]}" | jq -r .id)
echo "Rule: $RULE"

# 3. POST a matching log
#    (the bearer token below is blank in this report; supply a valid one)
curl -s -X POST http://localhost:8081/api/v1/data/logs \
  -H "Authorization: Bearer " \
  -H "Content-Type: application/json" \
  -d '[{"timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","level":"ERROR","logger":"com.example.Test","message":"SmokeTest fired","thread":"main","mdc":{}}]'

# 4. Trigger evaluation manually (or wait for next tick)
#    Check alerts inbox:
curl -s "http://localhost:8081/api/v1/environments/$ENV/alerts" \
  -H "Authorization: Bearer $OP_TOKEN" | jq '.[].state'
```

---

## Red Flags for Final Controller Pass

- The `alert_rules.webhooks` JSONB array stores `WebhookBinding.id` UUIDs that are NOT FK-constrained; if a rule is cloned or imported, binding IDs must be regenerated.
- `InAppInboxQuery` uses `? = ANY(target_user_ids)`, which requires the `text[]` cast to be consistent with how user IDs are stored (currently `TEXT`); any migration to UUID user IDs would need this query updated.
- `AlertingMetrics` gauge suppliers call `jdbc.queryForObject(...)` on every Prometheus scrape. At short scrape intervals (< 5s) this could produce noticeable DB load; consider raising the Prometheus `scrape_interval` for alerting gauges to 30s in production.
- The `PerKindCircuitBreaker` is per-JVM (not distributed). In a multi-replica deployment, each replica has its own independent circuit breaker state. This is intentional (fail-fast per node), but it means one slow ClickHouse node may open the circuit on one replica while others continue evaluating.
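
For the `AlertingMetrics` gauge concern above, an alternative to tuning `scrape_interval` is to cache each gauge value for a short TTL so the query runs at most once per TTL window regardless of scrape rate. The following is a minimal sketch under stated assumptions: `CachedGaugeSupplier` is a hypothetical helper, not a class from the codebase, and the 30s TTL in the usage note simply mirrors the interval suggested above.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical helper (not in the codebase): caches a supplier's value for a fixed TTL. */
final class CachedGaugeSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;   // e.g. a lambda wrapping the JDBC count query
    private final long ttlNanos;
    private final LongSupplier nanoClock; // injectable so the TTL logic can be tested
    private T value;
    private long loadedAt;
    private boolean loaded;

    CachedGaugeSupplier(Supplier<T> delegate, Duration ttl) {
        this(delegate, ttl, System::nanoTime);
    }

    CachedGaugeSupplier(Supplier<T> delegate, Duration ttl, LongSupplier nanoClock) {
        this.delegate = delegate;
        this.ttlNanos = ttl.toNanos();
        this.nanoClock = nanoClock;
    }

    @Override
    public synchronized T get() {
        long now = nanoClock.getAsLong();
        // Reload on first access or once the cached value is at least one TTL old.
        if (!loaded || now - loadedAt >= ttlNanos) {
            value = delegate.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }
}
```

A gauge could then be registered with something like `new CachedGaugeSupplier<>(() -> countOpenAlerts(), Duration.ofSeconds(30))`, where `countOpenAlerts()` stands in for whatever query the real supplier runs. The trade-off is that gauge values may be up to one TTL stale.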
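
The attribute-key check described under "Architecture / data notes" is security-critical, so its behaviour is worth pinning down. The sketch below shows how such a whitelist check could look using the pattern quoted in this report. `AttributeKeyValidator` and its method names are hypothetical; the actual validation lives in `AlertRuleController` and may be structured differently.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical illustration of the attribute-key whitelist; not the project's actual class. */
final class AttributeKeyValidator {
    // Pattern quoted in the report: letters, digits, dot, underscore, hyphen only.
    private static final Pattern SAFE_KEY = Pattern.compile("^[a-zA-Z0-9._-]+$");

    private AttributeKeyValidator() {}

    /** Returns true only if the whole key consists of whitelisted characters. */
    static boolean isSafeKey(String key) {
        // matches() requires the entire input to match, so a trailing newline is rejected too.
        return key != null && SAFE_KEY.matcher(key).matches();
    }

    /** Rejects the attribute map if any key falls outside the whitelist. */
    static void requireSafeKeys(Map<String, ?> attributes) {
        for (String key : attributes.keySet()) {
            if (!isSafeKey(key)) {
                throw new IllegalArgumentException("Invalid attribute key: " + key);
            }
        }
    }
}
```

Because quotes, parentheses, and whitespace are all outside the character class, a key such as `x') OR 1=1 --` is rejected before it can be interpolated into a JSON-extract expression.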