[P0] Alerting & notification system #102

Open
opened 2026-04-01 22:51:02 +02:00 by claude · 1 comment
Owner

Parent Epic

#100

Problem

Cameleer shows SLA breaches, error spikes, dead agents — but there's no way to be notified about them. Without alerting, teams still need Grafana/PagerDuty alongside Cameleer, which means Cameleer is a "nice to have" viewer, not an essential monitoring tool. Alerting is what makes monitoring tools sticky.

Current State

  • Dashboard shows SLA compliance with "BREACH" badge — but no notification
  • Error rate is visible but no threshold triggers
  • Dead agent count visible but no alert when agents go down
  • No alert configuration UI anywhere
  • No notification channel configuration

Proposed Solution

1. Alert Rules Engine

┌──────────────────────────────────────────────────────┐
│  Alert Rules                              [+ New Rule]│
│──────────────────────────────────────────────────────│
│                                                      │
│  ● ACTIVE   Error Rate > 5%                          │
│    Scope: All applications                           │
│    Window: 5 minutes  │  Channels: #ops-slack, email │
│    Triggered 3 times in last 24h          [Edit]     │
│                                                      │
│  ● ACTIVE   P99 Latency > 2000ms                     │
│    Scope: backend-app                                │
│    Window: 10 minutes │  Channels: #ops-slack        │
│    Last triggered: 2h ago                 [Edit]     │
│                                                      │
│  ○ MUTED    Agent Disconnected                       │
│    Scope: All applications                           │
│    Grace: 60s         │  Channels: email             │
│    Muted until 2026-04-02 09:00           [Edit]     │
│                                                      │
│  ● ACTIVE   Route Stopped Unexpectedly               │
│    Scope: sample-app/complex-fulfillment             │
│    Window: immediate  │  Channels: webhook           │
│    Never triggered                        [Edit]     │
│                                                      │
└──────────────────────────────────────────────────────┘

2. Built-in Alert Types

Pre-configured templates users can enable with one click:

Alert Type            Default Threshold            Description
Error Rate Spike      > 5% over 5min               Error rate exceeds threshold in rolling window
SLA Breach            P99 > configured threshold   Route latency exceeds SLA target
Agent Disconnected    60s grace period             Agent stops heartbeating
Agent Dead            5min no response             Agent confirmed unreachable
Throughput Drop       > 50% decrease               Sudden drop in message throughput
Route Stopped         immediate                    Route transitions to stopped state unexpectedly
Disk/Memory Warning   > 85% usage                  ClickHouse or agent resource pressure

3. Notification Channels

┌──────────────────────────────────────────────────────┐
│  Notification Channels                [+ Add Channel]│
│──────────────────────────────────────────────────────│
│                                                      │
│  ✉  Email                                            │
│     ops-team@company.com                  [Edit]     │
│                                                      │
│  💬 Slack (Webhook)                                   │
│     #cameleer-alerts                      [Edit]     │
│                                                      │
│  🔗 Webhook                                          │
│     https://pagerduty.com/...             [Edit]     │
│                                                      │
│  📱 Microsoft Teams                                   │
│     Ops Channel                           [Edit]     │
│                                                      │
└──────────────────────────────────────────────────────┘

Minimum viable channels for v1:

  1. Webhook (covers PagerDuty, Opsgenie, custom integrations)
  2. Email (SMTP config in admin)
  3. Slack (incoming webhook URL)

4. Alert History & Status

A page showing recent alert firings, acknowledgments, and resolutions:

TIMESTAMP            RULE                    STATUS      DURATION
2026-04-01 22:31     Error Rate > 5%         RESOLVED    12 min
2026-04-01 21:15     P99 > 2000ms            RESOLVED    45 min
2026-04-01 18:00     Agent Disconnected      ACK'd       —

5. UI Integration Points

  • KPI strip: clicking the Err% KPI should show "Set alert threshold" option
  • Dashboard L2/L3: "Alert" icon next to metrics that have active rules
  • Runtime: "Agent went STALE" should show "Configure alert" link
  • Top bar: bell icon with unread alert count badge

Architecture Notes

  • Alert evaluation can run server-side on a scheduled interval (every 30s–60s)
  • Use existing ClickHouse stats materialized views for metric thresholds
  • Agent state changes already tracked in registry — hook alerts there
  • Store alert rules + history in PostgreSQL (RBAC-protected)
  • Webhook dispatch should be async with retry (3 attempts, exponential backoff)

Acceptance Criteria

  • Admin can create alert rules with metric, threshold, scope, and window
  • At least 3 notification channels supported (webhook, email, Slack)
  • Built-in alert templates for common scenarios
  • Alert history page showing recent firings
  • Bell icon in top bar with unread alert count
  • Alert rules can be muted temporarily
  • Test notification button per channel

Competitive Reference

  • Datadog: monitors with conditions, multi-channel routing, alert history, downtime scheduling
  • Grafana: alert rules engine, contact points, notification policies, silences
  • New Relic: alert conditions, incident workflow, notification destinations
claude added the feature, ux, pmf labels 2026-04-01 22:51:02 +02:00
Author
Owner

Design Specification

Data Model (PostgreSQL)

Migration: V2__alerting.sql

alert_channels — Notification destinations:

CREATE TABLE alert_channels (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL UNIQUE,
    channel_type TEXT NOT NULL CHECK (channel_type IN ('webhook', 'email', 'slack')),
    config JSONB NOT NULL,
    enabled BOOLEAN NOT NULL DEFAULT true,
    created_by TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Config JSONB per type: webhook needs url, method; email needs recipients; slack needs webhookUrl.
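A minimal sketch of validating that per-type config contract at create time (the class and method names are hypothetical; only the required field names come from the contract above):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Checks a channel's JSONB config (deserialized to a Map) against the
// required keys for its channel_type: webhook → url/method,
// email → recipients, slack → webhookUrl.
class ChannelConfigValidator {
    private static final Map<String, Set<String>> REQUIRED = Map.of(
            "webhook", Set.of("url", "method"),
            "email",   Set.of("recipients"),
            "slack",   Set.of("webhookUrl"));

    /** Returns the required keys missing from the given config map, sorted. */
    public static List<String> missingKeys(String channelType, Map<String, Object> config) {
        Set<String> required = REQUIRED.get(channelType);
        if (required == null) {
            throw new IllegalArgumentException("unknown channel_type: " + channelType);
        }
        return required.stream().filter(k -> !config.containsKey(k)).sorted().toList();
    }
}
```

The controller would reject a POST/PUT with a 400 listing the missing keys.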

alert_rules — What to monitor:

CREATE TABLE alert_rules (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL UNIQUE,
    alert_type TEXT NOT NULL CHECK (alert_type IN ('error_rate','sla_breach','agent_disconnected','agent_dead','throughput_drop','route_stopped','resource_warning')),
    severity TEXT NOT NULL DEFAULT 'warning' CHECK (severity IN ('info','warning','critical')),
    enabled BOOLEAN NOT NULL DEFAULT true,
    scope_application TEXT, scope_route TEXT, scope_instance TEXT,
    eval_interval_sec INTEGER NOT NULL DEFAULT 60,
    eval_window_sec INTEGER NOT NULL DEFAULT 300,
    threshold_value DOUBLE PRECISION NOT NULL,
    threshold_operator TEXT NOT NULL DEFAULT 'gt',
    consecutive_breaches INTEGER NOT NULL DEFAULT 1,
    cooldown_sec INTEGER NOT NULL DEFAULT 300,
    params JSONB NOT NULL DEFAULT '{}',
    muted_until TIMESTAMPTZ,
    channel_ids UUID[] NOT NULL DEFAULT '{}',
    created_by TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

alert_history — Immutable firing log:

CREATE TABLE alert_history (
    id BIGSERIAL PRIMARY KEY,
    rule_id UUID NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE,
    status TEXT NOT NULL CHECK (status IN ('firing','resolved','acknowledged')),
    fired_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    resolved_at TIMESTAMPTZ, acknowledged_at TIMESTAMPTZ, acknowledged_by TEXT,
    measured_value DOUBLE PRECISION, threshold_value DOUBLE PRECISION,
    context JSONB NOT NULL DEFAULT '{}',
    notifications_sent JSONB NOT NULL DEFAULT '[]',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

alert_rule_state — Transient evaluation state, persisted so it survives restarts:

CREATE TABLE alert_rule_state (
    rule_id UUID PRIMARY KEY REFERENCES alert_rules(id) ON DELETE CASCADE,
    consecutive_count INTEGER NOT NULL DEFAULT 0,
    current_status TEXT NOT NULL DEFAULT 'ok',
    last_evaluated_at TIMESTAMPTZ, last_fired_at TIMESTAMPTZ, last_resolved_at TIMESTAMPTZ,
    current_history_id BIGINT REFERENCES alert_history(id),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Alert Evaluation Engine

Spring @Scheduled component, fixedDelay = 10s. For each enabled, non-muted rule where now - last_evaluated_at >= eval_interval_sec:

  1. Evaluate via AlertEvaluator.evaluate(rule) → EvalResult(breached, measuredValue, context)
  2. Update hysteresis tracker (consecutive count)
  3. If consecutive_count >= rule.consecutive_breaches and not currently firing and cooldown elapsed → FIRE
  4. If not breached and currently firing → RESOLVE
  5. Dispatch notifications async via ThreadPoolTaskExecutor (core=2, max=4)
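Steps 2–4 can be sketched as a pure decision function (names are illustrative — the real engine reads and writes alert_rule_state in PostgreSQL):

```java
import java.time.Duration;
import java.time.Instant;

// Fire/resolve decision for one evaluation pass: a rule fires only after
// N consecutive breaches, once per cooldown window; it resolves on the
// first non-breached evaluation while firing.
class AlertDecision {
    enum Action { NONE, FIRE, RESOLVE }

    static Action decide(boolean breached,
                         int consecutiveCount,   // count AFTER this evaluation
                         int requiredBreaches,   // rule.consecutive_breaches
                         boolean currentlyFiring,
                         Instant lastFiredAt,    // null if never fired
                         long cooldownSec,
                         Instant now) {
        if (breached) {
            boolean cooldownElapsed = lastFiredAt == null
                    || Duration.between(lastFiredAt, now).getSeconds() >= cooldownSec;
            if (consecutiveCount >= requiredBreaches && !currentlyFiring && cooldownElapsed) {
                return Action.FIRE;
            }
            return Action.NONE;
        }
        return currentlyFiring ? Action.RESOLVE : Action.NONE;
    }
}
```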

Built-in Alert Types (7 types)

Type                 Source                         Default Threshold   Query
error_rate           stats_1m_app/route MV          > 5%                countIfMerge(failed_count)/countMerge(total_count)*100
sla_breach           executions table               < 99% SLA           countIf(duration_ms <= slaMs)/count()*100
agent_disconnected   AgentRegistryService (memory)  >= 1 stale          findByState(STALE).size()
agent_dead           AgentRegistryService (memory)  >= 1 dead           findByState(DEAD).size()
throughput_drop      stats_1m_app MV                > 50% drop          Compare current window vs previous window
route_stopped        stats_1m_route MV              <= 0 exchanges      countMerge(total_count) in window
resource_warning     agent_metrics table            > 85%               avg(metric_value) WHERE metric_name=?
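The throughput_drop comparison ("current window vs previous window") reduces to a small predicate; a sketch under the assumption that both windows are equal-length counts from the stats MV:

```java
// throughput_drop check: fires when the current window's message count has
// fallen by more than dropPercent relative to the previous window.
class ThroughputDrop {
    static boolean breached(long previousCount, long currentCount, double dropPercent) {
        if (previousCount <= 0) return false; // no baseline → never fire
        double drop = (previousCount - currentCount) * 100.0 / previousCount;
        return drop > dropPercent;
    }
}
```

Guarding the zero baseline matters: a newly deployed route with no prior traffic should not immediately fire a drop alert.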

Notification Dispatch

Async with retry (3 attempts, exponential backoff: 1s, 4s, 16s).
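The 1s/4s/16s schedule is powers of four; a minimal sketch of the delay computation (class name is illustrative):

```java
// Exponential backoff for notification retries: attempt 1 → 1s,
// attempt 2 → 4s, attempt 3 → 16s.
class Backoff {
    static long delaySeconds(int attempt) {
        if (attempt < 1 || attempt > 3) {
            throw new IllegalArgumentException("attempt out of range: " + attempt);
        }
        return (long) Math.pow(4, attempt - 1);
    }
}
```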

Webhook: POST JSON payload with X-Cameleer-Alert-Signature HMAC-SHA256 header. Payload includes ruleName, alertType, severity, status, measuredValue, thresholdValue, scope, dashboardUrl.
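A sketch of computing the X-Cameleer-Alert-Signature value with the JDK's built-in HMAC support (the per-channel shared secret and hex encoding are assumptions — the spec above only names the algorithm and header):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// HMAC-SHA256 over the raw JSON payload, hex-encoded; the receiver
// recomputes this with the shared secret to verify authenticity.
class WebhookSigner {
    static String sign(String payloadJson, String secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
        byte[] digest = mac.doFinal(payloadJson.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
```

Receivers should compare signatures with a constant-time equality check to avoid timing side channels.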

Email: Subject [SEVERITY] Rule Name. Body includes alert details, scope, dashboard link. Uses JavaMailSender with SMTP config from server_config table.

Slack: Incoming webhook with attachment. Color: critical=#dc3545, warning=#f59e0b, info=#3b82f6, resolved=#22c55e. Fields: Application, Route, Window, Status.


REST API

Rules: GET/POST /api/v1/alerts/rules, GET/PUT/DELETE /rules/{id}, POST /rules/{id}/mute|unmute|enable|disable
Channels: GET/POST /api/v1/alerts/channels, GET/PUT/DELETE /channels/{id}, POST /channels/{id}/test
History: GET /api/v1/alerts/history (paginated), POST /history/{id}/acknowledge
Active: GET /api/v1/alerts/active → { firingCount, firing[], recent[] }

RBAC: VIEWER can read; OPERATOR can create/edit/mute/acknowledge; ADMIN can delete.


Alert Lifecycle State Machine

OK → PENDING (breached, count < N)
PENDING → FIRING (count >= N, cooldown elapsed)
PENDING → OK (not breached, count resets)
FIRING → RESOLVED (not breached)
FIRING → ACKNOWLEDGED (user action)
ACKNOWLEDGED → RESOLVED (not breached)
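The transitions above can be expressed as a pure function (enum and parameter names are illustrative; the cooldown condition on PENDING → FIRING is folded into countReached for brevity):

```java
// One-step transition of the alert lifecycle state machine.
class AlertStateMachine {
    enum State { OK, PENDING, FIRING, ACKNOWLEDGED, RESOLVED }

    static State next(State s, boolean breached, boolean countReached, boolean userAck) {
        switch (s) {
            case OK:
                return breached ? State.PENDING : State.OK;
            case PENDING:
                if (!breached) return State.OK;           // count resets
                return countReached ? State.FIRING : State.PENDING;
            case FIRING:
                if (userAck) return State.ACKNOWLEDGED;
                return breached ? State.FIRING : State.RESOLVED;
            case ACKNOWLEDGED:
                return breached ? State.ACKNOWLEDGED : State.RESOLVED;
            default:
                return s; // RESOLVED is terminal for this firing
        }
    }
}
```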

Frontend: Alerting Admin Tab

New tab in AdminLayout between "App Config" and "Database". Sub-tabs: Rules, Channels, History.

Rules list wireframe:

Alert Rules                                    [+ Create Rule]
Filter: [All Types v] [All Severities v] [All Statuses v]

● Error Rate - order-service        CRITICAL  FIRING    [...]
  > 5.0% error rate | eval: 1m / window: 5m
  Channels: ops-slack, pagerduty

○ Agent Dead - any                  CRITICAL  OK        [...]
  >= 1 dead agents | eval: 30s
  Channels: ops-slack, ops-email

Rule editor form: Name, Description, Alert Type dropdown, Severity radio, Threshold (value + operator), Scope (app/route/instance dropdowns), Evaluation settings (interval, window, consecutive breaches, cooldown), Type-specific params (conditional), Channel checkboxes.

Channel manager: List with type icon, name, enabled status, [Test] and kebab menu. Editor modal with type-specific config fields.

History page: Filtered timeline of firings with status badges (FIRING=red, RESOLVED=green, ACKNOWLEDGED=amber), notification delivery status, [ACK] button on firing alerts.


Bell Icon in TopBar

Polls GET /api/v1/alerts/active every 30s. Shows red badge with firing count. Dropdown shows firing alerts and recent resolved/acknowledged. Click navigates to dashboard (scoped) or history.

TopBar: [...] [bell(2)] [admin] [logout]
                  |
                  v
              +-------------------+
              | Active Alerts     |
              | FIRING (2)        |
              | ● Error Rate 12m  |
              | ● Agent Dead 3m   |
              | RECENT            |
              | ○ SLA (resolved)  |
              | [View All History]|
              +-------------------+

Dashboard Integration

  • KPI cards show alert indicator [!] when a firing rule targets that metric
  • Hover on KPI shows "Set Alert" action → opens rule editor pre-filled
  • L1 app health table gets "Alerts" column with firing count per app

Muting

POST /rules/{id}/mute with { "duration": "PT1H" }. Sets muted_until. Auto-unmutes when past. Muted rules show grey dot + "MUTED" badge with countdown.
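The "PT1H" duration is ISO-8601, which java.time parses directly; a sketch of computing and checking muted_until (class name is illustrative):

```java
import java.time.Duration;
import java.time.Instant;

// Mute handling: muted_until = now + ISO-8601 duration; a rule is muted
// while now < muted_until, and "auto-unmute" is just this check failing.
class MuteHelper {
    static Instant mutedUntil(Instant now, String isoDuration) {
        return now.plus(Duration.parse(isoDuration));
    }

    static boolean isMuted(Instant now, Instant mutedUntil) {
        return mutedUntil != null && now.isBefore(mutedUntil);
    }
}
```

Because muting is evaluated lazily, no scheduled job is needed to clear expired mutes; the evaluator simply stops skipping the rule once the timestamp passes.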

Audit Integration

New AuditCategory.ALERTING. All config changes logged: create/update/delete rules and channels, mute/unmute, enable/disable, acknowledge, test channel.


Module Placement

  • Domain records + repositories: cameleer3-server-core/.../alerting/
  • Evaluators: cameleer3-server-core/.../alerting/evaluators/
  • Scheduler + dispatchers: cameleer3-server-app/.../alerting/
  • Controllers + DTOs: cameleer3-server-app/.../controller/ and .../dto/
  • Frontend: ui/src/pages/Admin/Alerting/ (AlertingPage, RulesTab, ChannelsTab, HistoryTab, RuleEditor, ChannelEditor)
  • Bell icon: ui/src/components/AlertBell.tsx
  • API hooks: ui/src/api/queries/alerts.ts

Implementation Phases

  1. Data model + repositories + REST API (CRUD)
  2. Evaluation engine with all 7 alert types + hysteresis
  3. Notification dispatch (webhook, email, Slack) with async retry
  4. Frontend admin tab (rules, channels, history)
  5. Bell icon + dashboard integration