[P0] Alerting & notification system #102
Parent Epic
#100
Problem
Cameleer shows SLA breaches, error spikes, dead agents — but there's no way to be notified about them. Without alerting, teams still need Grafana/PagerDuty alongside Cameleer, which means Cameleer is a "nice to have" viewer, not an essential monitoring tool. Alerting is what makes monitoring tools sticky.
Current State
Proposed Solution
1. Alert Rules Engine
2. Built-in Alert Types
Pre-configured templates users can enable with one click:
3. Notification Channels
Minimum viable channels for v1:
4. Alert History & Status
A page showing recent alert firings, acknowledgments, and resolution:
5. UI Integration Points
Architecture Notes
Acceptance Criteria
Competitive Reference
Design Specification
Data Model (PostgreSQL)
Migration: `V2__alerting.sql`
`alert_channels` (notification destinations):
Config JSONB per type: webhook needs `url`, `method`; email needs `recipients`; slack needs `webhookUrl`.
`alert_rules` (what to monitor):
`alert_history` (immutable firing log):
`alert_rule_state` (ephemeral evaluation state, survives restarts):
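The per-type config requirements listed for `alert_channels` could be enforced before a channel row is saved. The following is a minimal sketch, not the project's implementation: the class name `ChannelConfigValidator` and the method `missingKeys` are hypothetical, and the config is modelled as a plain `Map` rather than a JSONB column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: checks that a channel config carries the keys its type requires. */
final class ChannelConfigValidator {

    // Required keys per channel type, as listed in the data model notes.
    private static final Map<String, List<String>> REQUIRED_KEYS = Map.of(
            "webhook", List.of("url", "method"),
            "email", List.of("recipients"),
            "slack", List.of("webhookUrl"));

    /** Returns the required keys missing from config; an empty list means the config is complete. */
    static List<String> missingKeys(String type, Map<String, ?> config) {
        List<String> required = REQUIRED_KEYS.get(type);
        if (required == null) {
            throw new IllegalArgumentException("Unknown channel type: " + type);
        }
        List<String> missing = new ArrayList<>();
        for (String key : required) {
            if (!config.containsKey(key)) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

A controller could reject a create or update request with a 400 response whenever `missingKeys` returns a non-empty list.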
Alert Evaluation Engine
Spring `@Scheduled` component, `fixedDelay = 10s`. For each enabled, non-muted rule where `now - last_evaluated_at >= eval_interval_sec`:
- `AlertEvaluator.evaluate(rule)` → `EvalResult(breached, measuredValue, context)`
- `consecutive_count >= rule.consecutive_breaches` and not currently firing and cooldown elapsed → FIRE
- `ThreadPoolTaskExecutor(core=2, max=4)`

Built-in Alert Types (7 types)

| Type | Data source | Measurement |
|---|---|---|
| `error_rate` | `stats_1m_app/route` MV | `countIfMerge(failed_count)/countMerge(total_count)*100` |
| `sla_breach` | `executions` table | `countIf(duration_ms <= slaMs)/count()*100` |
| `agent_disconnected` | `AgentRegistryService` (memory) | `findByState(STALE).size()` |
| `agent_dead` | `AgentRegistryService` (memory) | `findByState(DEAD).size()` |
| `throughput_drop` | `stats_1m_app` MV | |
| `route_stopped` | `stats_1m_route` MV | `countMerge(total_count)` in window |
| `resource_warning` | `agent_metrics` table | `avg(metric_value) WHERE metric_name=?` |

Notification Dispatch
Async with retry (3 attempts, exponential backoff: 1s, 4s, 16s).
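The backoff schedule above (1s, 4s, 16s) grows by a factor of four per step. The sketch below shows one reading of it: an initial send followed by up to three retries, each preceded by the next delay. The issue does not say whether the first send counts as one of the three attempts, so that interpretation is an assumption, as are the names `NotificationRetry`, `backoff` and `sendWithRetry`. The sleeper is injected so the loop can run on the dispatch executor or be tested without real waiting.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

/** Hypothetical retry helper for notification dispatch. */
final class NotificationRetry {

    // Assumption: three retries follow the initial send.
    static final int MAX_RETRIES = 3;

    /** Delay before retry number retryIndex (0-based): 1s, 4s, 16s, i.e. 4^retryIndex seconds. */
    static Duration backoff(int retryIndex) {
        return Duration.ofSeconds(1L << (2 * retryIndex));
    }

    /**
     * Calls send once, then retries up to MAX_RETRIES times, pausing via the
     * supplied sleeper before each retry. Returns true as soon as send succeeds.
     */
    static boolean sendWithRetry(BooleanSupplier send, Consumer<Duration> sleeper) {
        if (send.getAsBoolean()) {
            return true;
        }
        for (int retry = 0; retry < MAX_RETRIES; retry++) {
            sleeper.accept(backoff(retry));
            if (send.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }
}
```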
Webhook: POST JSON payload with `X-Cameleer-Alert-Signature` HMAC-SHA256 header. Payload includes ruleName, alertType, severity, status, measuredValue, thresholdValue, scope, dashboardUrl.
Email: Subject `[SEVERITY] Rule Name`. Body includes alert details, scope, dashboard link. Uses `JavaMailSender` with SMTP config from `server_config` table.
Slack: Incoming webhook with attachment. Color: critical=#dc3545, warning=#f59e0b, info=#3b82f6, resolved=#22c55e. Fields: Application, Route, Window, Status.
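The webhook signature header could be computed with the JDK's built-in `HmacSHA256` implementation. This is a sketch under assumptions the issue does not settle: the signature is taken over the raw JSON body, encoded as lowercase hex, and keyed with a per-channel shared secret. The class `WebhookSigner` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical signer for the X-Cameleer-Alert-Signature header. */
final class WebhookSigner {

    static final String HEADER = "X-Cameleer-Alert-Signature";

    /** Returns the lowercase hex HMAC-SHA256 of body keyed with secret (assumed encoding). */
    static String sign(String secret, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    /** Receiver-side check; MessageDigest.isEqual compares in constant time. */
    static boolean verify(String secret, String body, String signature) {
        byte[] expected = sign(secret, body).getBytes(StandardCharsets.UTF_8);
        byte[] actual = signature.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, actual);
    }
}
```

A receiver would recompute the signature over the bytes it received and compare with `verify` before trusting the payload.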
REST API
Rules: `GET/POST /api/v1/alerts/rules`, `GET/PUT/DELETE /rules/{id}`, `POST /rules/{id}/mute|unmute|enable|disable`
Channels: `GET/POST /api/v1/alerts/channels`, `GET/PUT/DELETE /channels/{id}`, `POST /channels/{id}/test`
History: `GET /api/v1/alerts/history` (paginated), `POST /history/{id}/acknowledge`
Active: `GET /api/v1/alerts/active` → `{ firingCount, firing[], recent[] }`
RBAC: VIEWER can read; OPERATOR can create/edit/mute/acknowledge; ADMIN can delete.
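The RBAC rule above can be expressed as a minimum role per action. The sketch assumes the roles are hierarchical (ADMIN can do everything OPERATOR can, and OPERATOR everything VIEWER can), which the issue implies but does not state. `AlertingAccess`, `Action` and `isAllowed` are hypothetical names, not existing Cameleer types.

```java
/** Hypothetical permission check for the alerting endpoints. */
final class AlertingAccess {

    /** Roles ordered from least to most privileged (assumed hierarchy). */
    enum Role { VIEWER, OPERATOR, ADMIN }

    /** Actions from the RBAC line, each with the lowest role allowed to perform it. */
    enum Action {
        READ(Role.VIEWER),
        CREATE(Role.OPERATOR),
        EDIT(Role.OPERATOR),
        MUTE(Role.OPERATOR),
        ACKNOWLEDGE(Role.OPERATOR),
        DELETE(Role.ADMIN);

        final Role minimumRole;

        Action(Role minimumRole) {
            this.minimumRole = minimumRole;
        }
    }

    /** True when role is at least as privileged as the action's minimum role. */
    static boolean isAllowed(Role role, Action action) {
        return role.compareTo(action.minimumRole) >= 0;
    }
}
```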
Alert Lifecycle State Machine
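The transition into FIRING follows the condition given under Alert Evaluation Engine: the consecutive breach count has reached the rule's threshold, the rule is not already firing, and the cooldown has elapsed. The sketch below covers only that transition and the "is this rule due" check. It assumes the breach streak resets to zero on a non-breaching evaluation and that cooldown is measured from the last firing. `AlertFiring`, `lastFiredAt` and `cooldown` are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical pure functions for the evaluation loop's decisions. */
final class AlertFiring {

    /** Breach streak after one evaluation; assumed to reset on a non-breach. */
    static int nextConsecutiveCount(int current, boolean breached) {
        return breached ? current + 1 : 0;
    }

    /** True when the rule is not muted and now - lastEvaluatedAt >= evalIntervalSec. */
    static boolean isDue(Instant now, Instant lastEvaluatedAt, int evalIntervalSec, Instant mutedUntil) {
        boolean muted = mutedUntil != null && now.isBefore(mutedUntil);
        boolean intervalElapsed = lastEvaluatedAt == null
                || !now.isBefore(lastEvaluatedAt.plusSeconds(evalIntervalSec));
        return !muted && intervalElapsed;
    }

    /** FIRE condition: streak reached, not already firing, cooldown elapsed. */
    static boolean shouldFire(int consecutiveCount, int consecutiveBreaches, boolean currentlyFiring,
                              Instant lastFiredAt, Duration cooldown, Instant now) {
        boolean cooldownElapsed = lastFiredAt == null || !now.isBefore(lastFiredAt.plus(cooldown));
        return consecutiveCount >= consecutiveBreaches && !currentlyFiring && cooldownElapsed;
    }
}
```

Keeping these checks free of Spring and database types makes them easy to unit test separately from the scheduler.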
Frontend: Alerting Admin Tab
New tab in AdminLayout between "App Config" and "Database". Sub-tabs: Rules, Channels, History.
Rules list wireframe:
Rule editor form: Name, Description, Alert Type dropdown, Severity radio, Threshold (value + operator), Scope (app/route/instance dropdowns), Evaluation settings (interval, window, consecutive breaches, cooldown), Type-specific params (conditional), Channel checkboxes.
Channel manager: List with type icon, name, enabled status, [Test] and kebab menu. Editor modal with type-specific config fields.
History page: Filtered timeline of firings with status badges (FIRING=red, RESOLVED=green, ACKNOWLEDGED=amber), notification delivery status, [ACK] button on firing alerts.
Bell Icon in TopBar
Polls `GET /api/v1/alerts/active` every 30s. Shows red badge with firing count. Dropdown shows firing alerts and recent resolved/acknowledged. Click navigates to dashboard (scoped) or history.
Dashboard Integration
`[!]` when a firing rule targets that metric
Muting
`POST /rules/{id}/mute` with `{ "duration": "PT1H" }`. Sets `muted_until`. The rule auto-unmutes once `muted_until` has passed. Muted rules show grey dot + "MUTED" badge with countdown.
Audit Integration
New `AuditCategory.ALERTING`. All config changes logged: create/update/delete rules and channels, mute/unmute, enable/disable, acknowledge, test channel.
Module Placement
`cameleer3-server-core/.../alerting/`
`cameleer3-server-core/.../alerting/evaluators/`
`cameleer3-server-app/.../alerting/`
`cameleer3-server-app/.../controller/` and `.../dto/`
`ui/src/pages/Admin/Alerting/` (AlertingPage, RulesTab, ChannelsTab, HistoryTab, RuleEditor, ChannelEditor)
`ui/src/components/AlertBell.tsx`
`ui/src/api/queries/alerts.ts`
Implementation Phases