Plan 02 — Final Whole-Branch Review
Verdict: ⚠ FIX BEFORE SHIP
Summary
The 43-commit, 14k-LOC implementation is structurally sound: the evaluator job, outbox loop, RBAC layering, SQL injection gate, state machine, and ClickHouse projections are all correct and well-tested. Three issues require fixing before production use. Two are functional blockers: (1) alert targets configured via the REST API are silently discarded because PostgresAlertRuleRepository.save() never writes to alert_rule_targets, making the entire in-app inbox feature non-functional for production-created rules; and (2) re-notification cadence (reNotifyMinutes) is stored and exposed but never acted on — withLastNotifiedAt() is defined but never called, so a still-FIRING alert will never re-notify no matter what the rule says. A third important issue is the retry endpoint calling scheduleRetry (which increments attempts) rather than resetting it, defeating the operator's intent. An SSRF guard on outbound webhook URLs (Plan 01 scope) is absent and is flagged for completeness.
BLOCKER findings
B-1: PostgresAlertRuleRepository.save() never persists targets — inbox is empty for all production-created rules
File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/storage/PostgresAlertRuleRepository.java:24-58
Impact: AlertRuleController.buildRule() accepts req.targets() and passes them into the AlertRule record (line 326/344), but save() only upserts the alert_rules row — it never touches alert_rule_targets. On re-load, rowMapper() returns List.of() for targets (line 185). When the evaluator creates a newInstance(), AlertStateTransitions copies rule.targets() — which is always empty for any rule created via the REST API. The result: target_user_ids, target_group_ids, and target_role_names on every alert_instances row are empty arrays, so listForInbox() returns nothing for any user. The IT only catches this because it seeds targets via raw SQL (INSERT INTO alert_rule_targets … ON CONFLICT DO NOTHING), not the API path.
Repro: POST a rule with targets: [{kind:"USER", targetId:"alice"}] via the REST API. Evaluate and fire. Check alert_instances.target_user_ids — it is {}.
Fix: Add a saveTargets(UUID ruleId, List<AlertRuleTarget> targets) step inside save(): delete existing targets for the rule, then insert new ones. Both operations should run in one transaction (e.g. @Transactional on save()) — JdbcTemplate otherwise auto-commits each statement, so a failure between the delete and the insert would leave the rule with no targets at all. Ordering matters: delete-then-insert.
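A minimal sketch of the write path, assuming AlertRuleTarget exposes kind() and targetId() accessors (names inferred from the findings above, not confirmed against the source); the JdbcTemplate calls are left as comments and only the pure batch-row construction is spelled out:

```java
import java.util.List;
import java.util.UUID;

public class SaveTargetsSketch {
    // Assumed shape of the domain type; the real record lives in cameleer-server-core.
    record AlertRuleTarget(String kind, String targetId) {}

    // Build the batch-insert argument rows for alert_rule_targets.
    static List<Object[]> targetRows(UUID ruleId, List<AlertRuleTarget> targets) {
        return targets.stream()
                .map(t -> new Object[] { ruleId, t.kind(), t.targetId() })
                .toList();
    }

    // Inside save(), after upserting the alert_rules row (sketch):
    //   jdbc.update("DELETE FROM alert_rule_targets WHERE rule_id = ?", ruleId);
    //   jdbc.batchUpdate(
    //       "INSERT INTO alert_rule_targets (rule_id, kind, target_id) VALUES (?, ?, ?)",
    //       targetRows(ruleId, rule.targets()));
}
```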
B-2: Re-notification cadence is completely unimplemented — reNotifyMinutes is stored but never consulted
File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/notify/NotificationDispatchJob.java (entire file), cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/AlertInstance.java:82
Impact: The spec's §7 state diagram defines a FIRING→re-notify cycle driven by rule.reNotifyMinutes. The withLastNotifiedAt() wither method exists on AlertInstance but is never called anywhere in production code. NotificationDispatchJob has no logic to check instance.lastNotifiedAt() or rule.reNotifyMinutes(). A rule configured with reNotifyMinutes=60 will send exactly one notification on first fire and nothing more, regardless of how long the alert stays FIRING. This is a silent spec violation visible to operators when an acknowledged-then-re-fired alert never pages again.
Fix: In NotificationDispatchJob.markDelivered path (or in the evaluator after enqueueNotifications), set instance.withLastNotifiedAt(now) and persist it. Add a scheduled re-notification enqueue: on each evaluator tick, for FIRING instances where lastNotifiedAt is older than rule.reNotifyMinutes minutes, enqueue fresh AlertNotification rows for each webhook binding.
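The cadence check itself is a small pure predicate. A sketch of the evaluator-side logic, with accessor names (lastNotifiedAt, reNotifyMinutes) taken from the findings above and the enqueue wiring left as comments:

```java
import java.time.Duration;
import java.time.Instant;

public class ReNotifySketch {
    // True when a still-FIRING instance's last notification is older than the
    // rule's cadence. A null lastNotifiedAt means the first send is still pending,
    // and a null/non-positive cadence disables re-notification entirely.
    static boolean reNotifyDue(Instant lastNotifiedAt, Integer reNotifyMinutes, Instant now) {
        if (reNotifyMinutes == null || reNotifyMinutes <= 0 || lastNotifiedAt == null) {
            return false;
        }
        return Duration.between(lastNotifiedAt, now)
                .compareTo(Duration.ofMinutes(reNotifyMinutes)) >= 0;
    }

    // Evaluator tick (sketch):
    //   if (instance.state() == FIRING
    //           && reNotifyDue(instance.lastNotifiedAt(), rule.reNotifyMinutes(), now)) {
    //       enqueueNotifications(rule, instance);   // fresh AlertNotification per webhook binding
    //       instanceRepo.save(instance.withLastNotifiedAt(now));
    //   }
}
```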
IMPORTANT findings
I-1: Retry endpoint's comment says "attempts → 0" but the SQL does attempts + 1
File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/controller/AlertNotificationController.java:71-75 and storage/PostgresAlertNotificationRepository.java:108-119
Impact: The operator intent of POST /api/v1/alerts/notifications/{id}/retry is to reset a FAILED notification and re-dispatch it fresh. The controller comment says "attempts → 0". However, it calls scheduleRetry(id, Instant.now(), 0, null), and scheduleRetry's SQL is SET attempts = attempts + 1. If the notification had already hit webhookMaxAttempts (default 3), the retried notification will immediately re-fail on the first transient 5xx because attempts is now 4 (≥ maxAttempts). The spec says "retry resets attempts to 0"; the code does the opposite.
Fix: Add a dedicated resetForRetry(UUID id, Instant nextAttemptAt) method to the repo that sets attempts = 0, status = 'PENDING', next_attempt_at = ?, claimed_by = NULL, claimed_until = NULL. Call it from the retry endpoint instead of scheduleRetry.
I-2: No UNIQUE partial index on alert_instances(rule_id) WHERE open — two replicas can create duplicate FIRING rows
File: cameleer-server-app/src/main/resources/db/migration/V12__alerting_tables.sql:68
Impact: findOpenForRule is a plain SELECT (not inside a lock), followed by instanceRepo.save(). Two evaluator replicas claiming different rule batches won't conflict on the claim (SKIP LOCKED protects that). But applyResult calls findOpenForRule after claiming — if two replicas claim the same rule in back-to-back windows (claim TTL 30s, min rule interval 5s), the second will also see no open instance (the first is still PENDING, not yet visible if on a different connection) and create a second FIRING row. There is no UNIQUE (rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED') to block this. In a single-replica setup this is harmless; in HA it causes duplicate alerts.
Fix: Add CREATE UNIQUE INDEX alert_instances_open_rule_uq ON alert_instances (rule_id) WHERE rule_id IS NOT NULL AND state IN ('PENDING','FIRING','ACKNOWLEDGED'); and handle the unique-violation in save() (log + skip).
I-3: SSRF guard absent on OutboundConnection.url (Plan 01 scope, flagged here)
File: cameleer-server-app/src/main/java/com/cameleer/server/app/outbound/dto/OutboundConnectionRequest.java and OutboundConnectionAdminController
Impact: The URL constraint is @Pattern("^https://.+") — it accepts https://169.254.169.254/ (AWS metadata), https://10.0.0.1/internal, and https://localhost/. An ADMIN user can configure a connection pointing to cloud metadata or internal services; the dispatcher will POST to it. In the SaaS multi-tenant context this is a server-side request forgery risk. Plan 01 scope — not blocking Plan 02 merge — but must be resolved before this feature is exposed in SaaS.
Suggested fix: At service-layer save time, resolve the URL's hostname and reject RFC-1918, loopback, link-local, and unroutable addresses. The Apache HttpClient already enforces the TLS handshake, which limits practical exploit, but the URL-level guard should be explicit.
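A sketch of the address-level check using only JDK APIs; the method name and wiring into the service layer are assumptions, and a stricter deployment would also re-check at connect time since DNS can change between validation and dispatch (classic TOCTOU):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class SsrfGuardSketch {
    // Reject hostnames that resolve to a private, loopback, link-local, or
    // otherwise unroutable address. Unresolvable hosts are rejected too.
    static boolean isBlockedHost(String host) {
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (addr.isLoopbackAddress()           // 127.0.0.0/8, ::1
                        || addr.isSiteLocalAddress()   // RFC-1918: 10/8, 172.16/12, 192.168/16
                        || addr.isLinkLocalAddress()   // 169.254/16, incl. cloud metadata
                        || addr.isAnyLocalAddress()    // 0.0.0.0
                        || addr.isMulticastAddress()) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException e) {
            return true;
        }
    }
}
```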
I-4: alerting_notifications_total metric (notificationOutcome) is never called
File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/metrics/AlertingMetrics.java:141 and notify/NotificationDispatchJob.java
Impact: AlertingMetrics.notificationOutcome(status) is defined but NotificationDispatchJob.processOne() never calls it after markDelivered, markFailed, or scheduleRetry. The alerting_notifications_total counter will always read 0, making the metric useless for dashboards/alerts.
Fix: Call metrics.notificationOutcome(NotificationStatus.DELIVERED) / FAILED at the three outcome branches in processOne(). Requires injecting AlertingMetrics into NotificationDispatchJob.
NIT findings
N-1: P95_LATENCY_MS silently falls back to avgDurationMs
File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/eval/RouteMetricEvaluator.java:52 and cameleer-server-core/src/main/java/com/cameleer/server/core/alerting/RouteMetric.java
ExecutionStats has no p95 field (only p99LatencyMs). The evaluator handles P95_LATENCY_MS by returning avgDurationMs with a code comment acknowledging the substitution. This is misleading to operators who configure a threshold expecting p95 semantics. Recommend either removing P95_LATENCY_MS from the enum or renaming it AVG_DURATION_MS before GA.
N-2: withTargets() IN-clause uses string interpolation with UUIDs
File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/storage/PostgresAlertRuleRepository.java:124-127
inClause is built by string-joining rule UUIDs. UUIDs come from the database (not user input), so SQL injection is not a realistic risk here. However the pattern is fragile and inconsistent with the rest of the codebase which uses parameterized queries. If batchSize ever grows large, a single claimDueRules call with 20 rules generates a 20-UUID IN clause that Postgres has to plan each time. Use = ANY(?) with a UUID array instead (matches the pattern already used in PostgresAlertInstanceRepository).
N-3: AlertingMetrics gauge queries hit Postgres on every Micrometer scrape — no caching
File: cameleer-server-app/src/main/java/com/cameleer/server/app/alerting/metrics/AlertingMetrics.java:153-174
Default scrape interval is typically 60s per Prometheus config, so 6 COUNT(*) queries per minute total. At current scale this is fine. If scrape interval is tightened (e.g. for alerting rules) or tenant count grows, these gauges add visible Postgres load. A 30s in-memory cache (e.g. AtomicReference<CachedValue> with expiry) would eliminate the concern. Low priority — leave as a documented follow-up.
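If the follow-up is picked up, the cache is a few lines of pure JDK; a sketch under the assumption that each gauge wraps a single COUNT(*) query (the Supplier stands in for the JdbcTemplate call):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class CachedGaugeSketch {
    private record Cached(double value, Instant expiresAt) {}

    private final AtomicReference<Cached> cache = new AtomicReference<>();
    private final Supplier<Double> query; // e.g. () -> jdbc.queryForObject("SELECT COUNT(*) …", Double.class)
    private final Duration ttl;

    CachedGaugeSketch(Supplier<Double> query, Duration ttl) {
        this.query = query;
        this.ttl = ttl;
    }

    // Returns the cached value while fresh; otherwise runs the query and stores
    // the result. A race on expiry can run the query twice, which is harmless here.
    double value() {
        Instant now = Instant.now();
        Cached c = cache.get();
        if (c != null && now.isBefore(c.expiresAt())) {
            return c.value();
        }
        double v = query.get();
        cache.set(new Cached(v, now.plus(ttl)));
        return v;
    }
}
```

Micrometer gauges take a value function, so the existing registration would only change its target, e.g. Gauge.builder(name, cached, CachedGaugeSketch::value).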
Notable strengths
- Security gate is airtight for the injection surface: The `ATTR_KEY` regex is applied on both create (`@PostMapping`) and update (`@PutMapping`) paths, validated before any persistence. Attribute values, log patterns, JVM metric names, and logger names all go through parameterized queries — only keys are inlined, and only after regex validation.
- Claim-polling concurrency model: Both `claimDueRules` and `claimDueNotifications` use the correct `UPDATE … WHERE id IN (SELECT … FOR UPDATE SKIP LOCKED) RETURNING *` pattern. The subquery lock does not re-scan the outer table; rows are locked, updated, and returned atomically, which is exactly what multi-replica claim-polling requires.
- Target population from rule on FIRING: `AlertStateTransitions.newInstance()` correctly copies USER/GROUP/ROLE targets from the rule at fire time, so inbox queries work correctly once B-1 is fixed.
- Rule snapshot is frozen on creation and never re-written on state transitions: `withRuleSnapshot()` is only called in `applyResult` and `applyBatchFiring` before the first `instanceRepo.save()`, and the ON CONFLICT UPDATE clause on `alert_instances` intentionally does not include `rule_snapshot`. History survives rule deletion correctly.
- Test coverage is substantive: The lifecycle IT (`AlertingFullLifecycleIT`) verifies fire→dispatch→ack→silence→rule-delete end-to-end with real Postgres, real ClickHouse, and WireMock. The webhook body assertion (step 2) confirms the rule name is present in the payload, not just that one POST arrived.
- ClickHouse test bootstrap is production-identical: `ClickHouseTestHelper` runs the same `clickhouse/init.sql` as `ClickHouseSchemaInitializer`; no schema drift between test and prod paths.
Open questions
- Target persistence design intent: Was `alert_rule_targets` always intended to be managed by a separate `saveTargets()` call that was accidentally omitted, or was there a plan to store targets as JSONB in the `alert_rules` row (which would be simpler and avoid the separate table)? The migration creates the table, the evaluator reads from it, but the write path is absent. Clarify before fixing.
- Re-notification: evaluator vs dispatcher responsibility: Should re-notification enqueue happen in the evaluator (on each tick when the instance is still FIRING and cadence elapsed) or in the dispatcher (after delivery, schedule a future notification)? The evaluator has the rule and instance context; the dispatcher has the outcome timing. Spec §7 is silent on which component owns this — confirm before implementing.
- HA deployment intent: Is the alerting subsystem expected to run on multiple replicas in the current release? If single-replica only, the UNIQUE index for open instances (I-2) can be deferred; if HA is in scope for this release it should be fixed now.
- `P95_LATENCY_MS` enum removal: Removing it from the enum is a breaking API change if any rules using `P95_LATENCY_MS` exist in production (unlikely at launch, but confirm). Renaming to `AVG_DURATION_MS` also requires a migration to update existing `condition` JSONB values and the `condition_kind_enum` type.