[P0] Alerting & notification system #102

Open
opened 2026-04-01 22:51:02 +02:00 by claude · 1 comment
Owner

Parent Epic

#100

Problem

Cameleer shows SLA breaches, error spikes, dead agents — but there's no way to be notified about them. Without alerting, teams still need Grafana/PagerDuty alongside Cameleer, which means Cameleer is a "nice to have" viewer, not an essential monitoring tool. Alerting is what makes monitoring tools sticky.

Current State

  • Dashboard shows SLA compliance with "BREACH" badge — but no notification
  • Error rate is visible but no threshold triggers
  • Dead agent count visible but no alert when agents go down
  • No alert configuration UI anywhere
  • No notification channel configuration

Proposed Solution

1. Alert Rules Engine

┌──────────────────────────────────────────────────────┐
│  Alert Rules                              [+ New Rule]│
│──────────────────────────────────────────────────────│
│                                                      │
│  ● ACTIVE   Error Rate > 5%                          │
│    Scope: All applications                           │
│    Window: 5 minutes  │  Channels: #ops-slack, email │
│    Triggered 3 times in last 24h          [Edit]     │
│                                                      │
│  ● ACTIVE   P99 Latency > 2000ms                     │
│    Scope: backend-app                                │
│    Window: 10 minutes │  Channels: #ops-slack        │
│    Last triggered: 2h ago                 [Edit]     │
│                                                      │
│  ○ MUTED    Agent Disconnected                       │
│    Scope: All applications                           │
│    Grace: 60s         │  Channels: email             │
│    Muted until 2026-04-02 09:00           [Edit]     │
│                                                      │
│  ● ACTIVE   Route Stopped Unexpectedly               │
│    Scope: sample-app/complex-fulfillment             │
│    Window: immediate  │  Channels: webhook           │
│    Never triggered                        [Edit]     │
│                                                      │
└──────────────────────────────────────────────────────┘

2. Built-in Alert Types

Pre-configured templates users can enable with one click:

Alert Type            Default Threshold            Description
Error Rate Spike      > 5% over 5min               Error rate exceeds threshold in rolling window
SLA Breach            P99 > configured threshold   Route latency exceeds SLA target
Agent Disconnected    60s grace period             Agent stops heartbeating
Agent Dead            5min no response             Agent confirmed unreachable
Throughput Drop       > 50% decrease               Sudden drop in message throughput
Route Stopped         immediate                    Route transitions to stopped state unexpectedly
Disk/Memory Warning   > 85% usage                  ClickHouse or agent resource pressure

3. Notification Channels

┌──────────────────────────────────────────────────────┐
│  Notification Channels                [+ Add Channel]│
│──────────────────────────────────────────────────────│
│                                                      │
│  ✉  Email                                            │
│     ops-team@company.com                  [Edit]     │
│                                                      │
│  💬 Slack (Webhook)                                   │
│     #cameleer-alerts                      [Edit]     │
│                                                      │
│  🔗 Webhook                                          │
│     https://pagerduty.com/...             [Edit]     │
│                                                      │
│  📱 Microsoft Teams                                   │
│     Ops Channel                           [Edit]     │
│                                                      │
└──────────────────────────────────────────────────────┘

Minimum viable channels for v1:

  1. Webhook (covers PagerDuty, Opsgenie, custom integrations)
  2. Email (SMTP config in admin)
  3. Slack (incoming webhook URL)

4. Alert History & Status

A page showing recent alert firings, acknowledgments, and resolutions:

TIMESTAMP            RULE                    STATUS      DURATION
2026-04-01 22:31     Error Rate > 5%         RESOLVED    12 min
2026-04-01 21:15     P99 > 2000ms            RESOLVED    45 min
2026-04-01 18:00     Agent Disconnected      ACK'd       —

5. UI Integration Points

  • KPI strip: clicking the Err% KPI should show "Set alert threshold" option
  • Dashboard L2/L3: "Alert" icon next to metrics that have active rules
  • Runtime: "Agent went STALE" should show "Configure alert" link
  • Top bar: bell icon with unread alert count badge

Architecture Notes

  • Alert evaluation can run server-side on a scheduled interval (every 30s–60s)
  • Use existing ClickHouse stats materialized views for metric thresholds
  • Agent state changes already tracked in registry — hook alerts there
  • Store alert rules + history in PostgreSQL (RBAC-protected)
  • Webhook dispatch should be async with retry (3 attempts, exponential backoff)

Acceptance Criteria

  • Admin can create alert rules with metric, threshold, scope, and window
  • At least 3 notification channels supported (webhook, email, Slack)
  • Built-in alert templates for common scenarios
  • Alert history page showing recent firings
  • Bell icon in top bar with unread alert count
  • Alert rules can be muted temporarily
  • Test notification button per channel

Competitive Reference

  • Datadog: monitors with conditions, multi-channel routing, alert history, downtime scheduling
  • Grafana: alert rules engine, contact points, notification policies, silences
  • New Relic: alert conditions, incident workflow, notification destinations
claude added the feature, ux, pmf labels 2026-04-01 22:51:02 +02:00
Author
Owner

Design Specification

Data Model (PostgreSQL)

Migration: V2__alerting.sql

alert_channels — Notification destinations:

CREATE TABLE alert_channels (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL UNIQUE,
    channel_type TEXT NOT NULL CHECK (channel_type IN ('webhook', 'email', 'slack')),
    config JSONB NOT NULL,
    enabled BOOLEAN NOT NULL DEFAULT true,
    created_by TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Config JSONB per type: webhook needs url, method; email needs recipients; slack needs webhookUrl.
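A minimal sketch of validating that per-type config contract at create time (the class and method names are hypothetical; only the required field names come from the contract above):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Checks a channel's JSONB config (deserialized to a Map) against the
// required keys for its channel_type: webhook → url/method,
// email → recipients, slack → webhookUrl.
class ChannelConfigValidator {
    private static final Map<String, Set<String>> REQUIRED = Map.of(
            "webhook", Set.of("url", "method"),
            "email",   Set.of("recipients"),
            "slack",   Set.of("webhookUrl"));

    /** Returns the required keys missing from the given config map, sorted. */
    public static List<String> missingKeys(String channelType, Map<String, Object> config) {
        Set<String> required = REQUIRED.get(channelType);
        if (required == null) {
            throw new IllegalArgumentException("unknown channel_type: " + channelType);
        }
        return required.stream().filter(k -> !config.containsKey(k)).sorted().toList();
    }
}
```

The controller would reject a POST/PUT with a 400 listing the missing keys.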

alert_rules — What to monitor:

CREATE TABLE alert_rules (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL UNIQUE,
    alert_type TEXT NOT NULL CHECK (alert_type IN ('error_rate','sla_breach','agent_disconnected','agent_dead','throughput_drop','route_stopped','resource_warning')),
    severity TEXT NOT NULL DEFAULT 'warning' CHECK (severity IN ('info','warning','critical')),
    enabled BOOLEAN NOT NULL DEFAULT true,
    scope_application TEXT, scope_route TEXT, scope_instance TEXT,
    eval_interval_sec INTEGER NOT NULL DEFAULT 60,
    eval_window_sec INTEGER NOT NULL DEFAULT 300,
    threshold_value DOUBLE PRECISION NOT NULL,
    threshold_operator TEXT NOT NULL DEFAULT 'gt',
    consecutive_breaches INTEGER NOT NULL DEFAULT 1,
    cooldown_sec INTEGER NOT NULL DEFAULT 300,
    params JSONB NOT NULL DEFAULT '{}',
    muted_until TIMESTAMPTZ,
    channel_ids UUID[] NOT NULL DEFAULT '{}',
    created_by TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

alert_history — Immutable firing log:

CREATE TABLE alert_history (
    id BIGSERIAL PRIMARY KEY,
    rule_id UUID NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE,
    status TEXT NOT NULL CHECK (status IN ('firing','resolved','acknowledged')),
    fired_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    resolved_at TIMESTAMPTZ, acknowledged_at TIMESTAMPTZ, acknowledged_by TEXT,
    measured_value DOUBLE PRECISION, threshold_value DOUBLE PRECISION,
    context JSONB NOT NULL DEFAULT '{}',
    notifications_sent JSONB NOT NULL DEFAULT '[]',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

alert_rule_state — Transient evaluation state, persisted so it survives restarts:

CREATE TABLE alert_rule_state (
    rule_id UUID PRIMARY KEY REFERENCES alert_rules(id) ON DELETE CASCADE,
    consecutive_count INTEGER NOT NULL DEFAULT 0,
    current_status TEXT NOT NULL DEFAULT 'ok',
    last_evaluated_at TIMESTAMPTZ, last_fired_at TIMESTAMPTZ, last_resolved_at TIMESTAMPTZ,
    current_history_id BIGINT REFERENCES alert_history(id),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Alert Evaluation Engine

Spring @Scheduled component, fixedDelay = 10s. For each enabled, non-muted rule where now - last_evaluated_at >= eval_interval_sec:

  1. Evaluate via AlertEvaluator.evaluate(rule) → EvalResult(breached, measuredValue, context)
  2. Update hysteresis tracker (consecutive count)
  3. If consecutive_count >= rule.consecutive_breaches and not currently firing and cooldown elapsed → FIRE
  4. If not breached and currently firing → RESOLVE
  5. Dispatch notifications async via ThreadPoolTaskExecutor (core=2, max=4)
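Steps 2–4 can be sketched as a pure decision function (names are illustrative — the real engine reads and writes alert_rule_state in PostgreSQL):

```java
import java.time.Duration;
import java.time.Instant;

// Fire/resolve decision for one evaluation pass: a rule fires only after
// N consecutive breaches, once per cooldown window; it resolves on the
// first non-breached evaluation while firing.
class AlertDecision {
    enum Action { NONE, FIRE, RESOLVE }

    static Action decide(boolean breached,
                         int consecutiveCount,   // count AFTER this evaluation
                         int requiredBreaches,   // rule.consecutive_breaches
                         boolean currentlyFiring,
                         Instant lastFiredAt,    // null if never fired
                         long cooldownSec,
                         Instant now) {
        if (breached) {
            boolean cooldownElapsed = lastFiredAt == null
                    || Duration.between(lastFiredAt, now).getSeconds() >= cooldownSec;
            if (consecutiveCount >= requiredBreaches && !currentlyFiring && cooldownElapsed) {
                return Action.FIRE;
            }
            return Action.NONE;
        }
        return currentlyFiring ? Action.RESOLVE : Action.NONE;
    }
}
```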

Built-in Alert Types (7 types)

Type                 Source                         Default Threshold   Query
error_rate           stats_1m_app/route MV          > 5%                countIfMerge(failed_count)/countMerge(total_count)*100
sla_breach           executions table               < 99% SLA           countIf(duration_ms <= slaMs)/count()*100
agent_disconnected   AgentRegistryService (memory)  >= 1 stale          findByState(STALE).size()
agent_dead           AgentRegistryService (memory)  >= 1 dead           findByState(DEAD).size()
throughput_drop      stats_1m_app MV                > 50% drop          Compare current window vs previous window
route_stopped        stats_1m_route MV              <= 0 exchanges      countMerge(total_count) in window
resource_warning     agent_metrics table            > 85%               avg(metric_value) WHERE metric_name=?
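The throughput_drop comparison ("current window vs previous window") reduces to a small predicate; a sketch under the assumption that both windows are equal-length counts from the stats MV:

```java
// throughput_drop check: fires when the current window's message count has
// fallen by more than dropPercent relative to the previous window.
class ThroughputDrop {
    static boolean breached(long previousCount, long currentCount, double dropPercent) {
        if (previousCount <= 0) return false; // no baseline → never fire
        double drop = (previousCount - currentCount) * 100.0 / previousCount;
        return drop > dropPercent;
    }
}
```

Guarding the zero baseline matters: a newly deployed route with no prior traffic should not immediately fire a drop alert.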

Notification Dispatch

Async with retry (3 attempts, exponential backoff: 1s, 4s, 16s).
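The 1s/4s/16s schedule is powers of four; a minimal sketch of the delay computation (class name is illustrative):

```java
// Exponential backoff for notification retries: attempt 1 → 1s,
// attempt 2 → 4s, attempt 3 → 16s.
class Backoff {
    static long delaySeconds(int attempt) {
        if (attempt < 1 || attempt > 3) {
            throw new IllegalArgumentException("attempt out of range: " + attempt);
        }
        return (long) Math.pow(4, attempt - 1);
    }
}
```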

Webhook: POST JSON payload with X-Cameleer-Alert-Signature HMAC-SHA256 header. Payload includes ruleName, alertType, severity, status, measuredValue, thresholdValue, scope, dashboardUrl.
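A sketch of computing the X-Cameleer-Alert-Signature value with the JDK's built-in HMAC support (the per-channel shared secret and hex encoding are assumptions — the spec above only names the algorithm and header):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// HMAC-SHA256 over the raw JSON payload, hex-encoded; the receiver
// recomputes this with the shared secret to verify authenticity.
class WebhookSigner {
    static String sign(String payloadJson, String secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
        byte[] digest = mac.doFinal(payloadJson.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
```

Receivers should compare signatures with a constant-time equality check to avoid timing side channels.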

Email: Subject [SEVERITY] Rule Name. Body includes alert details, scope, dashboard link. Uses JavaMailSender with SMTP config from server_config table.

Slack: Incoming webhook with attachment. Color: critical=#dc3545, warning=#f59e0b, info=#3b82f6, resolved=#22c55e. Fields: Application, Route, Window, Status.


REST API

Rules: GET/POST /api/v1/alerts/rules, GET/PUT/DELETE /rules/{id}, POST /rules/{id}/mute|unmute|enable|disable
Channels: GET/POST /api/v1/alerts/channels, GET/PUT/DELETE /channels/{id}, POST /channels/{id}/test
History: GET /api/v1/alerts/history (paginated), POST /history/{id}/acknowledge
Active: GET /api/v1/alerts/active → { firingCount, firing[], recent[] }

RBAC: VIEWER can read; OPERATOR can create/edit/mute/acknowledge; ADMIN can delete.


Alert Lifecycle State Machine

OK → PENDING (breached, count < N)
PENDING → FIRING (count >= N, cooldown elapsed)
PENDING → OK (not breached, count resets)
FIRING → RESOLVED (not breached)
FIRING → ACKNOWLEDGED (user action)
ACKNOWLEDGED → RESOLVED (not breached)
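The transitions above can be expressed as a pure function (enum and parameter names are illustrative; the cooldown condition on PENDING → FIRING is folded into countReached for brevity):

```java
// One-step transition of the alert lifecycle state machine.
class AlertStateMachine {
    enum State { OK, PENDING, FIRING, ACKNOWLEDGED, RESOLVED }

    static State next(State s, boolean breached, boolean countReached, boolean userAck) {
        switch (s) {
            case OK:
                return breached ? State.PENDING : State.OK;
            case PENDING:
                if (!breached) return State.OK;           // count resets
                return countReached ? State.FIRING : State.PENDING;
            case FIRING:
                if (userAck) return State.ACKNOWLEDGED;
                return breached ? State.FIRING : State.RESOLVED;
            case ACKNOWLEDGED:
                return breached ? State.ACKNOWLEDGED : State.RESOLVED;
            default:
                return s; // RESOLVED is terminal for this firing
        }
    }
}
```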

Frontend: Alerting Admin Tab

New tab in AdminLayout between "App Config" and "Database". Sub-tabs: Rules, Channels, History.

Rules list wireframe:

Alert Rules                                    [+ Create Rule]
Filter: [All Types v] [All Severities v] [All Statuses v]

● Error Rate - order-service        CRITICAL  FIRING    [...]
  > 5.0% error rate | eval: 1m / window: 5m
  Channels: ops-slack, pagerduty

○ Agent Dead - any                  CRITICAL  OK        [...]
  >= 1 dead agents | eval: 30s
  Channels: ops-slack, ops-email

Rule editor form: Name, Description, Alert Type dropdown, Severity radio, Threshold (value + operator), Scope (app/route/instance dropdowns), Evaluation settings (interval, window, consecutive breaches, cooldown), Type-specific params (conditional), Channel checkboxes.

Channel manager: List with type icon, name, enabled status, [Test] and kebab menu. Editor modal with type-specific config fields.

History page: Filtered timeline of firings with status badges (FIRING=red, RESOLVED=green, ACKNOWLEDGED=amber), notification delivery status, [ACK] button on firing alerts.


Bell Icon in TopBar

Polls GET /api/v1/alerts/active every 30s. Shows red badge with firing count. Dropdown shows firing alerts and recent resolved/acknowledged. Click navigates to dashboard (scoped) or history.

TopBar: [...] [bell(2)] [admin] [logout]
                  |
                  v
              +-------------------+
              | Active Alerts     |
              | FIRING (2)        |
              | ● Error Rate 12m  |
              | ● Agent Dead 3m   |
              | RECENT            |
              | ○ SLA (resolved)  |
              | [View All History]|
              +-------------------+

Dashboard Integration

  • KPI cards show alert indicator [!] when a firing rule targets that metric
  • Hover on KPI shows "Set Alert" action → opens rule editor pre-filled
  • L1 app health table gets "Alerts" column with firing count per app

Muting

POST /rules/{id}/mute with { "duration": "PT1H" }. Sets muted_until. Auto-unmutes when past. Muted rules show grey dot + "MUTED" badge with countdown.
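The "PT1H" duration is ISO-8601, which java.time parses directly; a sketch of computing and checking muted_until (class name is illustrative):

```java
import java.time.Duration;
import java.time.Instant;

// Mute handling: muted_until = now + ISO-8601 duration; a rule is muted
// while now < muted_until, and "auto-unmute" is just this check failing.
class MuteHelper {
    static Instant mutedUntil(Instant now, String isoDuration) {
        return now.plus(Duration.parse(isoDuration));
    }

    static boolean isMuted(Instant now, Instant mutedUntil) {
        return mutedUntil != null && now.isBefore(mutedUntil);
    }
}
```

Because muting is evaluated lazily, no scheduled job is needed to clear expired mutes; the evaluator simply stops skipping the rule once the timestamp passes.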

Audit Integration

New AuditCategory.ALERTING. All config changes logged: create/update/delete rules and channels, mute/unmute, enable/disable, acknowledge, test channel.


Module Placement

  • Domain records + repositories: cameleer3-server-core/.../alerting/
  • Evaluators: cameleer3-server-core/.../alerting/evaluators/
  • Scheduler + dispatchers: cameleer3-server-app/.../alerting/
  • Controllers + DTOs: cameleer3-server-app/.../controller/ and .../dto/
  • Frontend: ui/src/pages/Admin/Alerting/ (AlertingPage, RulesTab, ChannelsTab, HistoryTab, RuleEditor, ChannelEditor)
  • Bell icon: ui/src/components/AlertBell.tsx
  • API hooks: ui/src/api/queries/alerts.ts

Implementation Phases

  1. Data model + repositories + REST API (CRUD)
  2. Evaluation engine with all 7 alert types + hysteresis
  3. Notification dispatch (webhook, email, Slack) with async retry
  4. Frontend admin tab (rules, channels, history)
  5. Bell icon + dashboard integration