docs(alerting): whole-branch final review report
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# Plan 02 — Final Whole-Branch Review

**Verdict:** ⚠ FIX BEFORE SHIP

## Summary

The 43-commit, 14k-LOC implementation is structurally sound: the evaluator job, outbox loop, RBAC layering, SQL injection gate, state machine, and ClickHouse projections are all correct and well-tested. Three issues require fixing before production use. Two are functional blockers: (1) alert targets configured via the REST API are silently discarded because `PostgresAlertRuleRepository.save()` never writes to `alert_rule_targets`, making the entire in-app inbox feature non-functional for production-created rules; and (2) the re-notification cadence (`reNotifyMinutes`) is stored and exposed but never acted on — `withLastNotifiedAt()` is defined but never called, so a still-FIRING alert never re-notifies no matter what the rule says. The third issue is the retry endpoint calling `scheduleRetry` (which increments `attempts`) rather than resetting the counter, defeating the operator's intent. An SSRF guard (Plan 01 scope) is absent and is flagged here for completeness.

---

## BLOCKER findings

### B-1: `PostgresAlertRuleRepository.save()` never persists targets — inbox is empty for all production-created rules

**File:** `cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/storage/PostgresAlertRuleRepository.java:24-58`

**Impact:** `AlertRuleController.buildRule()` accepts `req.targets()` and passes them into the `AlertRule` record (line 326/344), but `save()` only upserts the `alert_rules` row — it never touches `alert_rule_targets`. On re-load, `rowMapper()` returns `List.of()` for targets (line 185). When the evaluator creates a `newInstance()`, `AlertStateTransitions` copies `rule.targets()` — which is always empty for any rule created via the REST API. The result: `target_user_ids`, `target_group_ids`, and `target_role_names` on every `alert_instances` row are empty arrays, so `listForInbox()` returns nothing for any user. The IT only catches this because it seeds targets via raw SQL (`INSERT INTO alert_rule_targets … ON CONFLICT DO NOTHING`), not the API path.

**Repro:** POST a rule with `targets: [{kind:"USER", targetId:"alice"}]` via the REST API. Evaluate and fire. Check `alert_instances.target_user_ids` — it is `{}`.

**Fix:** Add a `saveTargets(UUID ruleId, List<AlertRuleTarget> targets)` step inside `save()`: delete the existing targets for the rule, then insert the new ones. Run both statements inside one transaction — with plain auto-committing JdbcTemplate calls there is a window between the delete and the insert in which a crash leaves the rule with no targets at all.
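
A minimal sketch of the missing write path. The table and column names and the `AlertRuleTarget` shape are assumptions modeled on this finding; the real repository would execute `DELETE_SQL` then feed the rows into `JdbcTemplate.batchUpdate(INSERT_SQL, rows)`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical shape of the fix; column names are assumptions, not the real schema.
class SaveTargetsSketch {
    record AlertRuleTarget(String kind, String targetId) {}

    static final String DELETE_SQL =
        "DELETE FROM alert_rule_targets WHERE rule_id = ?";
    static final String INSERT_SQL =
        "INSERT INTO alert_rule_targets (rule_id, kind, target_id) VALUES (?, ?, ?)";

    // Builds the batch parameter rows for INSERT_SQL: one row per target.
    static List<Object[]> insertBatchArgs(UUID ruleId, List<AlertRuleTarget> targets) {
        List<Object[]> rows = new ArrayList<>();
        for (AlertRuleTarget t : targets) {
            rows.add(new Object[] { ruleId, t.kind(), t.targetId() });
        }
        return rows;
    }
}
```

Delete-then-insert keeps the write path idempotent for updates: re-saving a rule replaces its target set wholesale rather than diffing it.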

---

### B-2: Re-notification cadence is completely unimplemented — `reNotifyMinutes` is stored but never consulted

**File:** `cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/notify/NotificationDispatchJob.java` (entire file), `cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/AlertInstance.java:82`

**Impact:** The spec's §7 state diagram defines a FIRING→re-notify cycle driven by `rule.reNotifyMinutes`. The `withLastNotifiedAt()` wither method exists on `AlertInstance` but is never called anywhere in production code. `NotificationDispatchJob` has no logic to check `instance.lastNotifiedAt()` or `rule.reNotifyMinutes()`. A rule configured with `reNotifyMinutes=60` will send exactly one notification on first fire and nothing more, regardless of how long the alert stays FIRING. This is a silent spec violation, visible to operators when an acknowledged-then-re-fired alert never pages again.

**Fix:** In the `NotificationDispatchJob.markDelivered` path (or in the evaluator after `enqueueNotifications`), set `instance.withLastNotifiedAt(now)` and persist it. Add a scheduled re-notification enqueue: on each evaluator tick, for FIRING instances where `lastNotifiedAt` is older than `rule.reNotifyMinutes` minutes, enqueue fresh `AlertNotification` rows for each webhook binding.
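
The cadence check the evaluator tick would need can be sketched as a pure function. The method itself is an assumption; the parameter names mirror the accessors cited above:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical cadence check for a FIRING instance on an evaluator tick.
class ReNotifySketch {
    /** True when a fresh notification should be enqueued for a still-FIRING instance. */
    static boolean reNotifyDue(Instant lastNotifiedAt, Integer reNotifyMinutes, Instant now) {
        if (reNotifyMinutes == null || reNotifyMinutes <= 0) return false; // cadence disabled
        if (lastNotifiedAt == null) return true;                           // never notified yet
        return Duration.between(lastNotifiedAt, now)
                       .compareTo(Duration.ofMinutes(reNotifyMinutes)) >= 0;
    }
}
```

Keeping the predicate side-effect-free makes the eventual fix unit-testable without a running dispatcher.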

---

## IMPORTANT findings

### I-1: Retry endpoint's comment says "attempts → 0" but the SQL does `attempts + 1`

**File:** `cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/controller/AlertNotificationController.java:71-75` and `storage/PostgresAlertNotificationRepository.java:108-119`

**Impact:** The operator intent of `POST /api/v1/alerts/notifications/{id}/retry` is to reset a FAILED notification and re-dispatch it fresh. The controller comment says "attempts → 0". However, it calls `scheduleRetry(id, Instant.now(), 0, null)`, and `scheduleRetry`'s SQL is `SET attempts = attempts + 1`. If the notification had already hit `webhookMaxAttempts` (default 3), the retried notification will immediately re-fail on the first transient 5xx because `attempts` is now 4 (≥ maxAttempts). The spec says "retry resets attempts to 0"; the code does the opposite.

**Fix:** Add a dedicated `resetForRetry(UUID id, Instant nextAttemptAt)` method to the repo that sets `attempts = 0, status = 'PENDING', next_attempt_at = ?, claimed_by = NULL, claimed_until = NULL`. Call it from the retry endpoint instead of `scheduleRetry`.
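
The intended reset semantics, expressed as a pure transformation (the record and its field names are illustrative stand-ins for the entity; the actual fix is the equivalent `UPDATE` above):

```java
import java.time.Instant;

// Illustrative model of the notification row; field names are assumptions.
class RetryResetSketch {
    record Notification(String status, int attempts, Instant nextAttemptAt,
                        String claimedBy, Instant claimedUntil) {}

    // Unlike scheduleRetry's attempts + 1, a retry resets the counter and clears the claim.
    static Notification resetForRetry(Notification n, Instant nextAttemptAt) {
        return new Notification("PENDING", 0, nextAttemptAt, null, null);
    }
}
```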

---

### I-2: No UNIQUE partial index on `alert_instances(rule_id)` WHERE open — two replicas can create duplicate FIRING rows

**File:** `cameleer-server-app/src/main/resources/db/migration/V12__alerting_tables.sql:68`

**Impact:** `findOpenForRule` is a plain SELECT (not inside a lock), followed by `instanceRepo.save()`. Two evaluator replicas claiming different rule batches won't conflict on the claim (SKIP LOCKED protects that). But `applyResult` calls `findOpenForRule` after claiming — if two replicas claim the same rule in back-to-back windows (claim TTL 30s, min rule interval 5s), the second will also see no open instance (the first replica's new row may still be uncommitted and therefore invisible from the second connection) and create a second FIRING row. There is no `UNIQUE (rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED')` to block this. In a single-replica setup this is harmless; in HA it causes duplicate alerts.

**Fix:** Add `CREATE UNIQUE INDEX alert_instances_open_rule_uq ON alert_instances (rule_id) WHERE rule_id IS NOT NULL AND state IN ('PENDING','FIRING','ACKNOWLEDGED');` and handle the unique-violation in `save()` (log + skip).
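
On the `save()` side, Postgres reports a conflict with this partial unique index as SQLSTATE `23505` (`unique_violation`). A small helper lets the losing replica detect it and log-and-skip (the helper name is illustrative):

```java
import java.sql.SQLException;

// Detecting the partial-unique-index conflict so the losing replica can skip.
class UniqueViolationSketch {
    static final String UNIQUE_VIOLATION = "23505"; // PostgreSQL unique_violation

    static boolean isUniqueViolation(SQLException e) {
        return UNIQUE_VIOLATION.equals(e.getSQLState());
    }
}
```

Note that Spring wraps this as `DuplicateKeyException`, so the catch site may match on that instead of raw SQLSTATE.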

---

### I-3: SSRF guard absent on `OutboundConnection.url` (Plan 01 scope, flagged here)

**File:** `cameleer-server-app/src/main/java/com/cameleer/server/app/outbound/dto/OutboundConnectionRequest.java` and `OutboundConnectionAdminController`

**Impact:** The URL constraint is `@Pattern("^https://.+")` — it accepts `https://169.254.169.254/` (AWS metadata), `https://10.0.0.1/internal`, and `https://localhost/`. An ADMIN user can configure a connection pointing to cloud metadata or internal services; the dispatcher will POST to it. In the SaaS multi-tenant context this is a server-side request forgery risk. Plan 01 scope — not blocking Plan 02 merge — but must be resolved before this feature is exposed in SaaS.

**Suggested fix:** At service-layer save time, resolve the URL's hostname and reject RFC-1918, loopback, link-local, and unroutable addresses. The Apache HttpClient already enforces the TLS handshake, which limits practical exploitability, but the URL-level guard should be explicit.
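
The JDK's address-category predicates cover most of the ranges named above, so the guard can be a short check on the resolved address (class and method names here are illustrative; a production guard should also pin the resolved address at request time to defeat DNS rebinding):

```java
import java.net.InetAddress;

// Hypothetical service-layer check on the resolved webhook host.
class SsrfGuardSketch {
    static boolean isForbiddenAddress(InetAddress a) {
        return a.isLoopbackAddress()     // 127.0.0.0/8, ::1
            || a.isLinkLocalAddress()    // 169.254.0.0/16 (incl. cloud metadata), fe80::/10
            || a.isSiteLocalAddress()    // RFC-1918: 10/8, 172.16/12, 192.168/16
            || a.isAnyLocalAddress()     // 0.0.0.0, ::
            || a.isMulticastAddress();
    }
}
```

`isSiteLocalAddress()` does not cover every private range (e.g. CGNAT 100.64.0.0/10 or IPv6 ULA fc00::/7), so a complete guard needs an explicit denylist on top of these predicates.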

---

### I-4: `alerting_notifications_total` counter never increments — `notificationOutcome()` is never called

**File:** `cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/metrics/AlertingMetrics.java:141` and `notify/NotificationDispatchJob.java`

**Impact:** `AlertingMetrics.notificationOutcome(status)` is defined but `NotificationDispatchJob.processOne()` never calls it after `markDelivered`, `markFailed`, or `scheduleRetry`. The `alerting_notifications_total` counter will always read 0, making the metric useless for dashboards/alerts.

**Fix:** Call `metrics.notificationOutcome(NotificationStatus.DELIVERED)` / `FAILED` at the three outcome branches in `processOne()`. Requires injecting `AlertingMetrics` into `NotificationDispatchJob`.
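
The wiring shape, with a plain-JDK stand-in for the Micrometer counter (the enum values and the stand-in itself are assumptions; the real code would tag one Micrometer counter per status):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Stand-in for the tagged Micrometer counter: every terminal branch of
// processOne() should report its outcome exactly once.
class OutcomeMetricsSketch {
    enum NotificationStatus { DELIVERED, FAILED, RETRY_SCHEDULED }

    private final Map<NotificationStatus, LongAdder> counts = new ConcurrentHashMap<>();

    void notificationOutcome(NotificationStatus s) {
        counts.computeIfAbsent(s, k -> new LongAdder()).increment();
    }

    long count(NotificationStatus s) {
        LongAdder a = counts.get(s);
        return a == null ? 0 : a.sum();
    }
}
```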

---

## NIT findings

### N-1: `P95_LATENCY_MS` silently falls back to `avgDurationMs`

**File:** `cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/eval/RouteMetricEvaluator.java:52` and `cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/RouteMetric.java`

`ExecutionStats` has no p95 field (only `p99LatencyMs`). The evaluator handles `P95_LATENCY_MS` by returning `avgDurationMs` with a code comment acknowledging the substitution. This is misleading to operators who configure a threshold expecting p95 semantics. Recommend either removing `P95_LATENCY_MS` from the enum or renaming it `AVG_DURATION_MS` before GA.

---

### N-2: `withTargets()` IN-clause uses string interpolation with UUIDs

**File:** `cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/storage/PostgresAlertRuleRepository.java:124-127`

`inClause` is built by string-joining rule UUIDs. UUIDs come from the database (not user input), so SQL injection is not a realistic risk here. However, the pattern is fragile and inconsistent with the rest of the codebase, which uses parameterized queries. Each `claimDueRules` call also produces a distinct SQL text (e.g. a 20-UUID IN clause for a batch of 20), which Postgres must plan anew each time. Use `= ANY(?)` with a UUID array instead (matches the pattern already used in `PostgresAlertInstanceRepository`).
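
The contrast can be sketched as follows. The `inClause` builder models the fragile pattern; `ANY_CLAUSE` is the constant replacement, whose array parameter would be bound via `Connection.createArrayOf("uuid", ids.toArray())` in the JDBC layer (method names here are illustrative):

```java
import java.util.List;
import java.util.UUID;

// Fragile vs. parameterized: only the second keeps a single plannable SQL text.
class AnyClauseSketch {
    // Pattern from the finding: SQL text varies with the number of ids.
    static String inClause(List<UUID> ids) {
        StringBuilder sb = new StringBuilder("... WHERE id IN (");
        for (int i = 0; i < ids.size(); i++) {
            sb.append(i == 0 ? "'" : ", '").append(ids.get(i)).append("'");
        }
        return sb.append(")").toString();
    }

    // Preferred: constant text, all ids passed as one uuid[] parameter.
    static final String ANY_CLAUSE = "... WHERE id = ANY(?)";
}
```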

---

### N-3: `AlertingMetrics` gauge queries hit Postgres on every Micrometer scrape — no caching

**File:** `cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/metrics/AlertingMetrics.java:153-174`

Default scrape interval is typically 60s per Prometheus config, so 6 COUNT(*) queries per minute total. At current scale this is fine. If scrape interval is tightened (e.g. for alerting rules) or tenant count grows, these gauges add visible Postgres load. A 30s in-memory cache (e.g. `AtomicReference<CachedValue>` with expiry) would eliminate the concern. Low priority — leave as a documented follow-up.
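
The suggested cache can be sketched in a few lines; the gauge reads through the wrapper, so a scrape only hits Postgres when the cached value has expired. The clock is injected for testability (in production `Instant.now()` would do); everything here is an assumption about shape, not existing code:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// TTL cache in front of an expensive gauge loader (e.g. a COUNT(*) query).
class CachedGaugeSketch {
    private record Cached(double value, Instant at) {}

    private final Supplier<Double> loader;
    private final Duration ttl;
    private final AtomicReference<Cached> cache = new AtomicReference<>();

    CachedGaugeSketch(Supplier<Double> loader, Duration ttl) {
        this.loader = loader;
        this.ttl = ttl;
    }

    double value(Instant now) {
        Cached c = cache.get();
        if (c == null || Duration.between(c.at(), now).compareTo(ttl) >= 0) {
            c = new Cached(loader.get(), now); // refresh on miss or expiry
            cache.set(c);
        }
        return c.value();
    }
}
```

Under concurrent scrapes two threads may both refresh on the same expiry tick; for a gauge that is harmless, and it avoids holding a lock across the database call.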

---

## Notable strengths

- **Security gate is airtight for the injection surface:** The `ATTR_KEY` regex is applied on both create (`@PostMapping`) and update (`@PutMapping`) paths, validated before any persistence. Attribute values, log patterns, JVM metric names, and logger names all go through parameterized queries — only keys are inlined, and only after regex validation.

- **Claim-polling concurrency model:** Both `claimDueRules` and `claimDueNotifications` use the correct `UPDATE … WHERE id IN (SELECT … FOR UPDATE SKIP LOCKED) RETURNING *` pattern. The subquery lock does not re-scan the outer table; rows are locked, updated, and returned atomically, which is exactly what multi-replica claim-polling requires.

- **Target population from rule on FIRING:** `AlertStateTransitions.newInstance()` correctly copies USER/GROUP/ROLE targets from the rule at fire time, so inbox queries work correctly once B-1 is fixed.

- **Rule snapshot is frozen on creation and never re-written on state transitions:** `withRuleSnapshot()` is only called in `applyResult` and `applyBatchFiring` before the first `instanceRepo.save()`, and the ON CONFLICT UPDATE clause on `alert_instances` intentionally does not include `rule_snapshot`. History survives rule deletion correctly.

- **Test coverage is substantive:** The lifecycle IT (`AlertingFullLifecycleIT`) verifies fire→dispatch→ack→silence→rule-delete end-to-end with real Postgres, real ClickHouse, and WireMock. The webhook body assertion (step 2) confirms the rule name is present in the payload, not just that one POST arrived.

- **ClickHouse test bootstrap is production-identical:** `ClickHouseTestHelper` runs the same `clickhouse/init.sql` as `ClickHouseSchemaInitializer`; no schema drift between test and prod paths.

---

## Open questions

1. **Target persistence design intent:** Was `alert_rule_targets` always intended to be managed by a separate `saveTargets()` call that was accidentally omitted, or was there a plan to store targets as JSONB in the `alert_rules` row (which would be simpler and avoid the separate table)? The migration creates the table, the evaluator reads from it, but the write path is absent. Clarify before fixing.

2. **Re-notification: evaluator vs dispatcher responsibility:** Should re-notification enqueue happen in the evaluator (on each tick when the instance is still FIRING and cadence elapsed) or in the dispatcher (after delivery, schedule a future notification)? The evaluator has the rule and instance context; the dispatcher has the outcome timing. Spec §7 is silent on which component owns this — confirm before implementing.

3. **HA deployment intent:** Is the alerting subsystem expected to run on multiple replicas in the current release? If single-replica only, the UNIQUE index for open instances (I-2) can be deferred; if HA is in scope for this release it should be fixed now.

4. **`P95_LATENCY_MS` enum removal:** Removing it from the enum is a breaking API change if any rules using `P95_LATENCY_MS` exist in production (unlikely at launch, but confirm). Renaming to `AVG_DURATION_MS` also requires a migration to update existing `condition` JSONB values and the `condition_kind_enum` type.