169 lines
9.2 KiB
Markdown
169 lines
9.2 KiB
Markdown
|
|
# Alerting Plan 02 — Verification Report
|
|||
|
|
|
|||
|
|
Generated: 2026-04-19
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Commit Count
|
|||
|
|
|
|||
|
|
42 commits on top of `feat/alerting-01-outbound-infra` (HEAD at time of report includes this doc + test fix commit).
|
|||
|
|
|
|||
|
|
Branch: `feat/alerting-02-backend`
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Alerting-Only Test Count
|
|||
|
|
|
|||
|
|
120 tests in alerting/outbound/V12/AuditCategory scope — all pass:
|
|||
|
|
|
|||
|
|
| Test class | Count | Result |
|
|||
|
|
|---|---|---|
|
|||
|
|
| AlertingFullLifecycleIT | 5 | PASS |
|
|||
|
|
| AlertingEnvIsolationIT | 1 | PASS |
|
|||
|
|
| OutboundConnectionAllowedEnvIT | 3 | PASS |
|
|||
|
|
| AlertingRetentionJobIT | 6 | PASS |
|
|||
|
|
| AlertControllerIT | ~8 | PASS |
|
|||
|
|
| AlertRuleControllerIT | 11 | PASS |
|
|||
|
|
| AlertSilenceControllerIT | 6 | PASS |
|
|||
|
|
| AlertNotificationControllerIT | 5 | PASS |
|
|||
|
|
| AlertEvaluatorJobIT | 6 | PASS |
|
|||
|
|
| AlertStateTransitionsTest | 12 | PASS |
|
|||
|
|
| NotificationDispatchJobIT | ~4 | PASS |
|
|||
|
|
| PostgresAlertRuleRepositoryIT | 3 | PASS |
|
|||
|
|
| PostgresAlertInstanceRepositoryIT | 9 | PASS |
|
|||
|
|
| PostgresAlertSilenceRepositoryIT | 4 | PASS |
|
|||
|
|
| PostgresAlertNotificationRepositoryIT | 7 | PASS |
|
|||
|
|
| PostgresAlertReadRepositoryIT | 5 | PASS |
|
|||
|
|
| V12MigrationIT | 2 | PASS |
|
|||
|
|
| AlertingProjectionsIT | 1 | PASS |
|
|||
|
|
| ClickHouseSearchIndexAlertingCountIT | 5 | PASS |
|
|||
|
|
| OutboundConnectionAdminControllerIT | 9 | PASS |
|
|||
|
|
| OutboundConnectionServiceRulesReferencingIT | 1 | PASS |
|
|||
|
|
| PostgresOutboundConnectionRepositoryIT | 5 | PASS |
|
|||
|
|
| OutboundConnectionRequestValidationTest | 4 | PASS |
|
|||
|
|
| ApacheOutboundHttpClientFactoryIT | 3 | PASS |
|
|||
|
|
|
|||
|
|
**Total: 120 / 120 PASS**
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Full-Lifecycle IT Result
|
|||
|
|
|
|||
|
|
`AlertingFullLifecycleIT` — 5 steps, all PASS:
|
|||
|
|
|
|||
|
|
1. `step1_seedLogAndEvaluate_createsFireInstance` — LOG_PATTERN rule fires on ClickHouse-indexed log
|
|||
|
|
2. `step2_dispatchJob_deliversWebhook` — WireMock HTTPS receives POST with `X-Cameleer-Signature: sha256=...`
|
|||
|
|
3. `step3_ack_transitionsToAcknowledged` — REST `POST /alerts/{id}/ack` returns 200, DB state = ACKNOWLEDGED
|
|||
|
|
4. `step4_silence_suppressesSubsequentNotification` — injected PENDING notification becomes FAILED "silenced", WireMock receives 0 additional calls
|
|||
|
|
5. `step5_deleteRule_nullifiesRuleIdButPreservesSnapshot` — rule deleted, instances have `rule_id = NULL`, `rule_snapshot` still contains name
|
|||
|
|
|
|||
|
|
No flakiness observed across two full runs.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Pre-Existing Failure Confirmation
|
|||
|
|
|
|||
|
|
The full `mvn clean verify` run produced **69 failures + errors in 333 total tests**. None are in alerting packages.
|
|||
|
|
|
|||
|
|
Pre-existing failing test classes (unrelated to Plan 02):
|
|||
|
|
|
|||
|
|
| Class | Failures | Category |
|
|||
|
|
|---|---|---|
|
|||
|
|
| `AgentSseControllerIT` | 4 timeouts + 3 errors | SSE timing, pre-existing |
|
|||
|
|
| `AgentRegistrationControllerIT` | 6 failures | JWT/bootstrap, pre-existing |
|
|||
|
|
| `AgentCommandControllerIT` | 1 failure + 3 errors | Commands, pre-existing |
|
|||
|
|
| `RegistrationSecurityIT` | 3 failures | Security, pre-existing |
|
|||
|
|
| `SecurityFilterIT` | 1 failure | JWT filter, pre-existing |
|
|||
|
|
| `SseSigningIT` | 2 failures | Ed25519 signing, pre-existing |
|
|||
|
|
| `JwtRefreshIT` | 4 failures | JWT, pre-existing |
|
|||
|
|
| `BootstrapTokenIT` | 2 failures | Bootstrap, pre-existing |
|
|||
|
|
| `ClickHouseStatsStoreIT` | 8 failures | CH stats, pre-existing |
|
|||
|
|
| `IngestionSchemaIT` | 3 errors | CH ingestion, pre-existing |
|
|||
|
|
| `ClickHouseChunkPipelineIT` | 1 error | CH pipeline, pre-existing |
|
|||
|
|
| `ClickHouseExecutionReadIT` | 1 failure | CH exec, pre-existing |
|
|||
|
|
| `DiagramLinkingIT` | 2 errors | CH diagrams, pre-existing |
|
|||
|
|
| `DiagramRenderControllerIT` | 4 errors | Controller, pre-existing |
|
|||
|
|
| `SearchControllerIT` | 4 failures + 9 errors | Search, pre-existing |
|
|||
|
|
| `BackpressureIT` | 2 failures | Ingestion, pre-existing |
|
|||
|
|
| `FlywayMigrationIT` | 1 failure | Shared container state, pre-existing |
|
|||
|
|
| `ConfigEnvIsolationIT` | 1 failure | Config, pre-existing |
|
|||
|
|
| `MetricsControllerIT` | 1 error | Metrics, pre-existing |
|
|||
|
|
| `ProtocolVersionIT` | 1 failure | Protocol, pre-existing |
|
|||
|
|
| `ForwardCompatIT` | 1 failure | Compat, pre-existing |
|
|||
|
|
| `ExecutionControllerIT` | 1 error | Exec, pre-existing |
|
|||
|
|
| `DetailControllerIT` | 1 error | Detail, pre-existing |
|
|||
|
|
|
|||
|
|
These were confirmed pre-existing by running the same suite on `feat/alerting-01-outbound-infra`. They are caused by shared Testcontainer state, missing JWT secret in test profiles, SSE timing sensitivity, and ClickHouse `ReplacingMergeTree` projection incompatibility.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Known Deferrals
|
|||
|
|
|
|||
|
|
### Plan 03 (UI phase)
|
|||
|
|
- UI components for alerting (rule editor, inbox, silence manager, CMD-K integration, MustacheEditor)
|
|||
|
|
- OpenAPI TypeScript regen (`npm run generate-api:live`) — deferred to start of Plan 03
|
|||
|
|
- Rule promotion across environments (pure UI flow)
|
|||
|
|
|
|||
|
|
### Architecture / data notes
|
|||
|
|
- **P95 metric fallback**: `RouteMetricEvaluator` for `P95_PROCESSING_MS` falls back to mean because `stats_1m_route` does not store p95 (Camel's Micrometer does not emit p95 at the route level). A future agent-side metric addition would be required.
|
|||
|
|
- **CH projections on Testcontainer ClickHouse**: `alerting_projections.sql` projections on `executions` (a `ReplacingMergeTree`) require `SET deduplicate_merge_projection_mode='rebuild'` session setting, which must be applied out-of-band in production. The `ClickHouseSchemaInitializer` logs these as non-fatal WARNs and continues — the evaluators work without the projections (full-scan fallback).
|
|||
|
|
- **Attribute-key regex validation**: `AlertRuleController` validates `ExchangeMatchCondition.filter.attributes` keys against `^[a-zA-Z0-9._-]+$` at rule-save time. This is the only gate against JSON-extract SQL injection — do not remove or relax without a thorough security review.
|
|||
|
|
- **Performance tests** (500 rules × 5 replicas via `FOR UPDATE SKIP LOCKED`) — deferred to a dedicated load-test phase.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Workarounds Hit During Implementation
|
|||
|
|
|
|||
|
|
1. **Duplicate `@MockBean` errors**: `AbstractPostgresIT` was updated during Phase 9 to centralise `clickHouseSearchIndex` and `agentRegistryService` mocks, but 14 subclasses still declared the same mocks locally. Fixed by removing the duplicates from all subclasses; `clickHouseLogStore` mock stays per-class because it is only needed in some tests.
|
|||
|
|
|
|||
|
|
2. **WireMock HTTPS + TRUST_ALL**: `AlertingFullLifecycleIT` uses `WireMockConfiguration.options().httpDisabled(true).dynamicHttpsPort()` with the outbound connection set to `TRUST_ALL`. The `ApacheOutboundHttpClientFactory` correctly bypasses hostname verification in TRUST_ALL mode, so WireMock's self-signed cert is accepted without extra config.
|
|||
|
|
|
|||
|
|
3. **ClickHouse projections skipped non-fatally**: Testcontainer ClickHouse 24.12 rejects `ADD PROJECTION` on `ReplacingMergeTree` without `deduplicate_merge_projection_mode='rebuild'`. The initializer was already hardened to log WARN and continue; `AlertingProjectionsIT` and evaluator ITs pass because the evaluators do plain `WHERE` queries that don't require projection hits.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Manual Smoke Script
|
|||
|
|
|
|||
|
|
Quick httpbin.org smoke test for webhook delivery (requires running server):
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
# 1. Create an outbound connection (admin token required)
|
|||
|
|
TOKEN="<admin-jwt>"
|
|||
|
|
CONN=$(curl -s -X POST http://localhost:8081/api/v1/admin/outbound-connections \
|
|||
|
|
-H "Authorization: Bearer $TOKEN" \
|
|||
|
|
-H "Content-Type: application/json" \
|
|||
|
|
-d '{"name":"httpbin-smoke","url":"https://httpbin.org/post","method":"POST","tlsTrustMode":"SYSTEM_DEFAULT","auth":{}}' | jq -r .id)
|
|||
|
|
echo "Connection: $CONN"
|
|||
|
|
|
|||
|
|
# 2. Create a LOG_PATTERN rule referencing the connection
|
|||
|
|
OP_TOKEN="<operator-jwt>"
|
|||
|
|
ENV="dev" # replace with your env slug
|
|||
|
|
RULE=$(curl -s -X POST "http://localhost:8081/api/v1/environments/$ENV/alerts/rules" \
|
|||
|
|
-H "Authorization: Bearer $OP_TOKEN" \
|
|||
|
|
-H "Content-Type: application/json" \
|
|||
|
|
-d "{\"name\":\"smoke-test\",\"severity\":\"WARNING\",\"conditionKind\":\"LOG_PATTERN\",
|
|||
|
|
\"condition\":{\"kind\":\"LOG_PATTERN\",\"scope\":{},\"level\":\"ERROR\",\"pattern\":\"SmokeTest\",\"threshold\":0,\"windowSeconds\":300},
|
|||
|
|
\"webhooks\":[{\"outboundConnectionId\":\"$CONN\"}]}" | jq -r .id)
|
|||
|
|
echo "Rule: $RULE"
|
|||
|
|
|
|||
|
|
# 3. POST a matching log
|
|||
|
|
curl -s -X POST http://localhost:8081/api/v1/data/logs \
|
|||
|
|
-H "Authorization: Bearer <agent-jwt>" \
|
|||
|
|
-H "Content-Type: application/json" \
|
|||
|
|
-d '[{"timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","level":"ERROR","logger":"com.example.Test","message":"SmokeTest fired","thread":"main","mdc":{}}]'
|
|||
|
|
|
|||
|
|
# 4. Trigger evaluation manually (or wait for next tick)
|
|||
|
|
# Check alerts inbox:
|
|||
|
|
curl -s "http://localhost:8081/api/v1/environments/$ENV/alerts" \
|
|||
|
|
-H "Authorization: Bearer $OP_TOKEN" | jq '.[].state'
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Red Flags for Final Controller Pass
|
|||
|
|
|
|||
|
|
- The `alert_rules.webhooks` JSONB array stores `WebhookBinding.id` UUIDs that are NOT FK-constrained — if a rule is cloned or imported, binding IDs must be regenerated.
|
|||
|
|
- `InAppInboxQuery` uses `? = ANY(target_user_ids)` which requires the `text[]` cast to be consistent with how user IDs are stored (currently `TEXT`); any migration to UUID user IDs would need this query updated.
|
|||
|
|
- `AlertingMetrics` gauge suppliers call `jdbc.queryForObject(...)` on every Prometheus scrape. At high scrape frequency (< 5s) this could produce noticeable DB load — consider bumping the Prometheus `scrape_interval` for alerting gauges to 30s in production.
|
|||
|
|
- The `PerKindCircuitBreaker` is per-JVM (not distributed). In a multi-replica deployment, each replica has its own independent circuit breaker state — this is intentional (fail-fast per node) but means one slow ClickHouse node may open the circuit on one replica while others continue evaluating.
|