cameleer-server/docs/alerting-02-final-review.md
hsiegeln 144915563c docs(alerting): whole-branch final review report
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 07:25:33 +02:00


Plan 02 — Final Whole-Branch Review

Verdict: ⚠ FIX BEFORE SHIP

Summary

The 43-commit, 14k-LOC implementation is structurally sound: the evaluator job, outbox loop, RBAC layering, SQL injection gate, state machine, and ClickHouse projections are all correct and well-tested. Three issues require fixing before production use. Two are functional blockers: (1) alert targets configured via the REST API are silently discarded because PostgresAlertRuleRepository.save() never writes to alert_rule_targets, making the entire in-app inbox feature non-functional for production-created rules; and (2) the re-notification cadence (reNotifyMinutes) is stored and exposed but never acted on — withLastNotifiedAt() is defined but never called, so a still-FIRING alert will never re-notify no matter what the rule says. A third, important issue is that the retry endpoint calls scheduleRetry (which increments attempts) rather than resetting the counter, defeating the operator's intent. An SSRF guard on outbound webhook URLs (Plan 01 scope) is absent and is flagged here for completeness.


BLOCKER findings

B-1: PostgresAlertRuleRepository.save() never persists targets — inbox is empty for all production-created rules

File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/storage/PostgresAlertRuleRepository.java:24-58

Impact: AlertRuleController.buildRule() accepts req.targets() and passes them into the AlertRule record (line 326/344), but save() only upserts the alert_rules row — it never touches alert_rule_targets. On re-load, rowMapper() returns List.of() for targets (line 185). When the evaluator creates a newInstance(), AlertStateTransitions copies rule.targets() — which is always empty for any rule created via the REST API. The result: target_user_ids, target_group_ids, and target_role_names on every alert_instances row are empty arrays, so listForInbox() returns nothing for any user. The integration test misses this only because it seeds targets via raw SQL (INSERT INTO alert_rule_targets … ON CONFLICT DO NOTHING), bypassing the broken API write path.

Repro: POST a rule with targets: [{kind:"USER", targetId:"alice"}] via the REST API. Evaluate and fire. Check alert_instances.target_user_ids — it is {}.

Fix: Add a saveTargets(UUID ruleId, List<AlertRuleTarget> targets) step inside save(): delete existing targets for the rule, then insert the new ones. Run both statements in one transaction (e.g. @Transactional on save()) so a failure between the delete and the insert cannot leave a rule with no targets — with JdbcTemplate's default auto-commit the two statements are not atomic on their own.
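
A minimal sketch of the two statements, assuming the column names visible in the test seed quoted above (kind, target_id — verify against the V12 migration):

```sql
-- Replace-all semantics for a rule's targets: clear, then insert the new set.
-- Both statements should run in the same transaction as the alert_rules upsert.
DELETE FROM alert_rule_targets WHERE rule_id = ?;

INSERT INTO alert_rule_targets (rule_id, kind, target_id)
VALUES (?, ?, ?);
```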


B-2: Re-notification cadence is completely unimplemented — reNotifyMinutes is stored but never consulted

File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/notify/NotificationDispatchJob.java (entire file), cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/AlertInstance.java:82

Impact: The spec's §7 state diagram defines a FIRING→re-notify cycle driven by rule.reNotifyMinutes. The withLastNotifiedAt() wither method exists on AlertInstance but is never called anywhere in production code. NotificationDispatchJob has no logic to check instance.lastNotifiedAt() or rule.reNotifyMinutes(). A rule configured with reNotifyMinutes=60 will send exactly one notification on first fire and nothing more, regardless of how long the alert stays FIRING. The spec violation is silent in tests but visible to operators when a long-FIRING (or acknowledged-then-re-fired) alert never pages again.

Fix: In NotificationDispatchJob.markDelivered path (or in the evaluator after enqueueNotifications), set instance.withLastNotifiedAt(now) and persist it. Add a scheduled re-notification enqueue: on each evaluator tick, for FIRING instances where lastNotifiedAt is older than rule.reNotifyMinutes minutes, enqueue fresh AlertNotification rows for each webhook binding.
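
The cadence decision itself is small; a self-contained sketch follows (class, method, and parameter names are hypothetical — the real implementation would read rule.reNotifyMinutes() and instance.lastNotifiedAt()):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the re-notification cadence check B-2 calls for.
public final class ReNotifyCheck {

    /**
     * A FIRING instance is due for re-notification when a cadence is
     * configured and the last notification is at least that many minutes old.
     */
    public static boolean isDueForReNotify(Integer reNotifyMinutes,
                                           Instant lastNotifiedAt,
                                           Instant now) {
        if (reNotifyMinutes == null || reNotifyMinutes <= 0) {
            return false; // cadence not configured for this rule
        }
        if (lastNotifiedAt == null) {
            return true; // never notified yet: first fire should enqueue
        }
        return Duration.between(lastNotifiedAt, now)
                .compareTo(Duration.ofMinutes(reNotifyMinutes)) >= 0;
    }
}
```

The evaluator would call this per FIRING instance on each tick and, when it returns true, enqueue fresh AlertNotification rows and persist withLastNotifiedAt(now).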


IMPORTANT findings

I-1: Retry endpoint comment says "attempts → 0" but the SQL does attempts + 1

File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/controller/AlertNotificationController.java:71-75 and storage/PostgresAlertNotificationRepository.java:108-119

Impact: The operator intent of POST /api/v1/alerts/notifications/{id}/retry is to reset a FAILED notification and re-dispatch it fresh. The controller comment says "attempts → 0". However, it calls scheduleRetry(id, Instant.now(), 0, null), and scheduleRetry's SQL is SET attempts = attempts + 1. If the notification had already hit webhookMaxAttempts (default 3), the retried notification will immediately re-fail on the first transient 5xx because attempts is now 4 (≥ maxAttempts). The spec says "retry resets attempts to 0"; the code does the opposite.

Fix: Add a dedicated resetForRetry(UUID id, Instant nextAttemptAt) method to the repo that sets attempts = 0, status = 'PENDING', next_attempt_at = ?, claimed_by = NULL, claimed_until = NULL. Call it from the retry endpoint instead of scheduleRetry.
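
A sketch of the statement resetForRetry would issue (table name and claim columns assumed from the surrounding repository code):

```sql
-- Reset a FAILED notification to a clean PENDING state so dispatch
-- starts again from attempt 0 rather than continuing past maxAttempts.
UPDATE alert_notifications
   SET attempts        = 0,
       status          = 'PENDING',
       next_attempt_at = ?,
       claimed_by      = NULL,
       claimed_until   = NULL
 WHERE id = ? AND status = 'FAILED';
```

Guarding on status = 'FAILED' also makes the endpoint idempotent against double-clicks on an already-reset notification.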


I-2: No UNIQUE partial index on alert_instances(rule_id) WHERE open — two replicas can create duplicate FIRING rows

File: cameleer-server-app/src/main/resources/db/migration/V12__alerting_tables.sql:68

Impact: findOpenForRule is a plain SELECT (not inside a lock), followed by instanceRepo.save(). Two evaluator replicas claiming different rule batches won't conflict on the claim (SKIP LOCKED protects that). But applyResult calls findOpenForRule after claiming — if two replicas claim the same rule in back-to-back windows (claim TTL 30s, min rule interval 5s), the second will also see no open instance (the first replica's freshly inserted PENDING row may not yet be committed, so it is invisible on another connection) and create a second FIRING row. There is no UNIQUE (rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED') to block this. In a single-replica setup this is harmless; in HA it causes duplicate alerts.

Fix: Add CREATE UNIQUE INDEX alert_instances_open_rule_uq ON alert_instances (rule_id) WHERE rule_id IS NOT NULL AND state IN ('PENDING','FIRING','ACKNOWLEDGED'); and handle the unique-violation in save() (log + skip).
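
Formatted as a standalone statement, the proposed index reads:

```sql
-- At most one open instance per rule; rows with NULL rule_id are exempt
-- (the partial predicate keeps the index small and skips resolved history).
CREATE UNIQUE INDEX alert_instances_open_rule_uq
    ON alert_instances (rule_id)
    WHERE rule_id IS NOT NULL
      AND state IN ('PENDING', 'FIRING', 'ACKNOWLEDGED');
```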


I-3: SSRF guard absent on OutboundConnection.url (Plan 01 scope, flagged here)

File: cameleer-server-app/src/main/java/com/cameleer/server/app/outbound/dto/OutboundConnectionRequest.java and OutboundConnectionAdminController

Impact: The URL constraint is @Pattern("^https://.+") — it accepts https://169.254.169.254/ (AWS metadata), https://10.0.0.1/internal, and https://localhost/. An ADMIN user can configure a connection pointing to cloud metadata or internal services; the dispatcher will POST to it. In the SaaS multi-tenant context this is a server-side request forgery risk. Plan 01 scope — not blocking Plan 02 merge — but must be resolved before this feature is exposed in SaaS.

Suggested fix: At service-layer save time, resolve the URL's hostname and reject RFC-1918, loopback, link-local, and unroutable addresses. The Apache HttpClient already enforces the TLS handshake, which limits practical exploitability, but the URL-level guard should be explicit.


I-4: alerting_notifications_total metric (notificationOutcome) is never called

File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/metrics/AlertingMetrics.java:141 and notify/NotificationDispatchJob.java

Impact: AlertingMetrics.notificationOutcome(status) is defined but NotificationDispatchJob.processOne() never calls it after markDelivered, markFailed, or scheduleRetry. The alerting_notifications_total counter will always read 0, making the metric useless for dashboards/alerts.

Fix: Call metrics.notificationOutcome(NotificationStatus.DELIVERED) / FAILED at the three outcome branches in processOne(). Requires injecting AlertingMetrics into NotificationDispatchJob.


NIT findings

N-1: P95_LATENCY_MS silently falls back to avgDurationMs

File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/eval/RouteMetricEvaluator.java:52 and cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/RouteMetric.java

ExecutionStats has no p95 field (only p99LatencyMs). The evaluator handles P95_LATENCY_MS by returning avgDurationMs with a code comment acknowledging the substitution. This is misleading to operators who configure a threshold expecting p95 semantics. Recommend either removing P95_LATENCY_MS from the enum or renaming it AVG_DURATION_MS before GA.


N-2: withTargets() IN-clause uses string interpolation with UUIDs

File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/storage/PostgresAlertRuleRepository.java:124-127

inClause is built by string-joining rule UUIDs. The UUIDs come from the database (not user input), so SQL injection is not a realistic risk here. However, the pattern is fragile and inconsistent with the rest of the codebase, which uses parameterized queries. If batchSize ever grows, a single claimDueRules call with 20 rules generates a 20-UUID IN clause that Postgres must plan each time. Use = ANY(?) with a UUID array instead (matching the pattern already used in PostgresAlertInstanceRepository).
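
Schematically (the actual claimDueRules SQL differs), the change is from an interpolated IN list to a single bound uuid[] parameter, e.g. via Connection.createArrayOf("uuid", ids) in JDBC:

```sql
-- Before: IN clause assembled by string-joining UUIDs; the statement
-- text changes with every batch, so it is re-planned each call.
SELECT * FROM alert_rules WHERE id IN ('9f3c…', '1b2d…', …);

-- After: one array parameter, stable statement text.
SELECT * FROM alert_rules WHERE id = ANY(?);
```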


N-3: AlertingMetrics gauge queries hit Postgres on every Micrometer scrape — no caching

File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/metrics/AlertingMetrics.java:153-174

Default scrape interval is typically 60s per Prometheus config, so 6 COUNT(*) queries per minute total. At current scale this is fine. If scrape interval is tightened (e.g. for alerting rules) or tenant count grows, these gauges add visible Postgres load. A 30s in-memory cache (e.g. AtomicReference<CachedValue> with expiry) would eliminate the concern. Low priority — leave as a documented follow-up.
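
A self-contained sketch of the suggested cache, one instance per gauge (class name hypothetical; the loader would wrap each COUNT(*) query):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Sketch of a TTL-cached gauge value: scrapes within the TTL reuse the
// last result, so Postgres sees at most one query per gauge per TTL.
public final class CachedGauge {

    private record Cached(double value, Instant fetchedAt) {}

    private final Supplier<Double> loader;
    private final Duration ttl;
    private final AtomicReference<Cached> cache = new AtomicReference<>();

    public CachedGauge(Supplier<Double> loader, Duration ttl) {
        this.loader = loader;
        this.ttl = ttl;
    }

    /** Called by the metrics registry on each scrape. */
    public double value() {
        Instant now = Instant.now();
        Cached current = cache.get();
        if (current != null
                && Duration.between(current.fetchedAt(), now).compareTo(ttl) < 0) {
            return current.value(); // still fresh: skip the database
        }
        double fresh = loader.get();
        cache.set(new Cached(fresh, now));
        return fresh;
    }
}
```

Under concurrent scrapes two threads may both refresh once at expiry, which is acceptable for a gauge; a lock is not worth the contention.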


Notable strengths

  • Security gate is airtight for the injection surface: The ATTR_KEY regex is applied on both create (@PostMapping) and update (@PutMapping) paths, validated before any persistence. Attribute values, log patterns, JVM metric names, and logger names all go through parameterized queries — only keys are inlined, and only after regex validation.
  • Claim-polling concurrency model: Both claimDueRules and claimDueNotifications use the correct UPDATE … WHERE id IN (SELECT … FOR UPDATE SKIP LOCKED) RETURNING * pattern. The subquery lock does not re-scan the outer table; rows are locked, updated, and returned atomically, which is exactly what multi-replica claim-polling requires.
  • Target population from rule on FIRING: AlertStateTransitions.newInstance() correctly copies USER/GROUP/ROLE targets from the rule at fire time, so inbox queries work correctly once B-1 is fixed.
  • Rule snapshot is frozen on creation and never re-written on state transitions: withRuleSnapshot() is only called in applyResult and applyBatchFiring before the first instanceRepo.save(), and the ON CONFLICT UPDATE clause on alert_instances intentionally does not include rule_snapshot. History survives rule deletion correctly.
  • Test coverage is substantive: The lifecycle IT (AlertingFullLifecycleIT) verifies fire→dispatch→ack→silence→rule-delete end-to-end with real Postgres, real ClickHouse, and WireMock. The webhook body assertion (step 2) confirms the rule name is present in the payload, not just that one POST arrived.
  • ClickHouse test bootstrap is production-identical: ClickHouseTestHelper runs the same clickhouse/init.sql as ClickHouseSchemaInitializer; no schema drift between test and prod paths.
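
For reference, the claim pattern described in the second bullet has this schematic shape (table and column names assumed for illustration):

```sql
-- Atomic multi-replica claim: lock eligible rows, mark them claimed, and
-- return them in one statement. SKIP LOCKED lets other replicas pass over
-- rows already being claimed instead of blocking on them.
UPDATE alert_rules r
   SET claimed_by = ?, claimed_until = now() + interval '30 seconds'
 WHERE r.id IN (
         SELECT id FROM alert_rules
          WHERE enabled AND next_eval_at <= now()
            AND (claimed_until IS NULL OR claimed_until < now())
          ORDER BY next_eval_at
          LIMIT ?
          FOR UPDATE SKIP LOCKED
       )
RETURNING *;
```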

Open questions

  1. Target persistence design intent: Was alert_rule_targets always intended to be managed by a separate saveTargets() call that was accidentally omitted, or was there a plan to store targets as JSONB in the alert_rules row (which would be simpler and avoid the separate table)? The migration creates the table, the evaluator reads from it, but the write path is absent. Clarify before fixing.

  2. Re-notification: evaluator vs dispatcher responsibility: Should re-notification enqueue happen in the evaluator (on each tick when the instance is still FIRING and cadence elapsed) or in the dispatcher (after delivery, schedule a future notification)? The evaluator has the rule and instance context; the dispatcher has the outcome timing. Spec §7 is silent on which component owns this — confirm before implementing.

  3. HA deployment intent: Is the alerting subsystem expected to run on multiple replicas in the current release? If single-replica only, the UNIQUE index for open instances (I-2) can be deferred; if HA is in scope for this release it should be fixed now.

  4. P95_LATENCY_MS enum removal: Removing from the enum is a breaking API change if any rules using P95_LATENCY_MS exist in production (unlikely at launch, but confirm). Renaming to AVG_DURATION_MS also requires a migration to update existing condition JSONB values and the condition_kind_enum type.