# Alerting — Design Spec

**Date:** 2026-04-19
**Status:** Draft — awaiting user review
**Surfaces:** server (core + app), UI, admin, Gitea issues
**Related:** [backlog BL-001](../backlog.md) / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137) (managed CA bundles — deferred)

---

## 1. Summary

A first-class alerting feature inside Cameleer. Operators author rules that evaluate conditions over observability data; violations create shared, env-scoped alerts visible in an in-app inbox and optionally dispatched to external systems via admin-curated webhook connections. Lifecycle: `PENDING → FIRING → ACKNOWLEDGED → RESOLVED` with orthogonal `SILENCED`. Horizontally scalable via PostgreSQL claim-based polling. All code confined to new `alerting/`, `outbound/`, and `http/` packages with minimal, documented touchpoints on existing stores.

### Guiding principles

- **"Good enough" baseline.** Customers with dedicated ops tooling (PagerDuty, Grafana, Opsgenie) will keep using it — alerting here serves those *without*. Resist incident-management feature creep; provide the floor, not the ceiling.
- **Confinement over cleverness.** Reads go through existing interfaces; no hooks into ingestion; no new ClickHouse tables; all new code in dedicated packages. The feature should be removable by deleting those packages and one migration.
- **Env-scoped by default, tenant-global for infrastructure.** Rules, alerts, and silences live inside an environment. Outbound connections are tenant-global infrastructure that admins manage, optionally restricted by env.
- **Performance is a first-class design concern**, not a v2 afterthought. Claim-polling, query coalescing, in-tick caching, per-kind circuit breaker, and CH projections are all v1.
- **No ClickHouse table changes, only projections.** Additive, idempotent (`IF NOT EXISTS`), safe to drop and rebuild.

---

## 2. Scope

### In scope (v1)

Six signal sources, expressed as sealed-type conditions:

1. **`ROUTE_METRIC`** — aggregate stats per route or app: error rate, p95/p99 latency, throughput, error count. Backed by `stats_1m_route`.
2. **`EXCHANGE_MATCH`** — per-exchange matching with two fire modes:
   - `PER_EXCHANGE` — one alert per matching exchange (cursor-advanced, used for "specific failure" patterns)
   - `COUNT_IN_WINDOW` — aggregate "N exchanges matched in window" threshold
3. **`AGENT_STATE`** — agent in `DEAD` / `STALE` state for ≥ N seconds. Reads in-memory `AgentRegistryService`.
4. **`DEPLOYMENT_STATE`** — deployment status is `FAILED` / `DEGRADED` for ≥ N seconds.
5. **`LOG_PATTERN`** — count of log rows matching level / logger / pattern in a window > threshold.
6. **`JVM_METRIC`** — agent-reported JVM/Camel metric (heap %, GC pressure, inflight) over threshold for a window.

**Delivery channels.** In-app inbox (derived from alerts + target-membership) and outbound HTTPS webhooks (via admin-managed outbound connections). No email. No native Slack/Teams integrations — users point webhook URLs at their own integrations.

**Sharing model.** Rules are shared within an environment; alerts are visible to any viewer of the env, but notifications route to targeted users, groups, or roles (via existing RBAC).

**Lifecycle states.** `PENDING → FIRING → ACKNOWLEDGED → RESOLVED`, with `SILENCED` as an orthogonal property resolved at notification-dispatch time (preserves audit trail).

**Rule promotion across environments** via UI prefill — no new server endpoint.

**CMD-K integration** — alerts + alert rules appear as new result sources in the existing CMD-K registry.

**Configurable evaluator cadence** (min 5 s floor), per-rule evaluation intervals, per-rule re-notify cadence.

### Out of scope (v1, not deferred)

- Custom SQL / Prometheus-style query DSL (option F).
- Email delivery channel — webhooks cover Slack / PagerDuty / Teams / OpsGenie / n8n / Zapier via ops-team-owned integrations.
- Native provider integrations (Slack, Teams, PagerDuty as first-class types).
- Incident management (merging alerts, parent/child, assignees, SLA tracking) — integrate with PagerDuty via webhook instead.
- Expression language in rules — fixed templates only.
- mTLS / client-cert auth on outbound webhooks.
- Real-time push (SSE) to the UI — 30 s polling is the v1 cadence. SSE is a clean drop-in for v2 if needed.

### Deferred to backlog

- **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)** — in-app CA bundle management UI. Deferred pending investigation of reusing the SaaS layer's existing CA handling (KISS / DRY). V1 CA trust material is filesystem-resident via deployment config, same posture as OIDC issuer URIs and Ed25519 keys.

---

## 3. Key decisions

Captured from brainstorming, in order of architectural impact.

| Decision | Chosen | Rejected | Rationale |
|---|---|---|---|
| Signal sources | 6 (route / exchange / agent / deployment / log / JVM) | SQL power-user mode | "Good enough" baseline; fixed templates cover real needs; expression languages are where observability tools go to be rewritten |
| Delivery channels | in-app + webhook | email, native integrations | Webhooks cover every target; email is deceptively expensive (deliverability, bounces, DKIM) |
| Sharing | tenant-env-shared rules; notifications target users/groups/roles | per-user "my alerts" | Ops products need single source of truth for what's broken; targets give per-person routing without duplicating rules |
| Evaluation | pull / claim-based polling | push / event-driven | Confinement — reads through existing interfaces, zero ingestion hooks; native handling of "no data" condition; 60 s latency acceptable for ops alerting |
| Horizontal scale | `FOR UPDATE SKIP LOCKED` claim pattern | advisory locks / leader election | Naturally partitions work; supports per-rule cadences; recovers from replica death; industry-standard |
| Alert lifecycle | FIRING / ACK / RESOLVED + SILENCED | minimal fire/resolve only, full incident workflow | Ack is the floor for team workflows (stop paging everyone); silences needed for ops maintenance; incident mgmt is a product category, not a feature |
| Rule shape | fixed templates, sealed-type JSONB | expression DSL, expression-first | Form-fillable; typed; additive for new kinds; consistent with no-SQL decision |
| Templating | JMustache | in-house substituter, Pebble/Freemarker | Industry standard for webhook templates (Slack, PagerDuty); logic-less (safe); small dep; familiar to ops users |
| UI placement | top-nav bell (consumer) + `/alerts` section (OPERATOR+ authoring, VIEWER+ read) | admin-only page, embedded per context, new top-level tab only | Separates consumer from authoring surfaces; rule authoring happens frequently, shouldn't be buried in admin |
| CMD-K | alerts + rules searchable | not searchable | Covers the "I saw this alert before lunch" use case; small surface via existing result-source registry |
| Outbound connections | admin-managed, tenant-global, allowed-env restriction | per-rule raw webhook URLs | Admins own infrastructure; operators author rules; rotation is atomic across N rules; reusable for future integrations |
| TLS trust | shared cross-cutting module `http/` | alerting-local trust config | Future-proofs for additional outbound HTTPS consumers; joins the existing OIDC outbound path |
| CA management UI | **deferred (BL-001)** | build in-server now | SaaS-layer CA mechanism should be investigated first for reuse |
| Env deletion | full cascade across alerting tables | partial cascade with SET NULL | POC teardown safety — zero orphaned rows |

---

## 4. Module architecture

### Package layout

```
cameleer-server-core/src/main/java/com/cameleer/server/core/
├── alerting/  (domain; pure records + interfaces)
│   ├── AlertRule
│   ├── AlertCondition (sealed)
│   │   ├── RouteMetricCondition
│   │   ├── ExchangeMatchCondition
│   │   ├── AgentStateCondition
│   │   ├── DeploymentStateCondition
│   │   ├── LogPatternCondition
│   │   └── JvmMetricCondition
│   ├── AlertSeverity / AlertState (enums)
│   ├── AlertInstance / AlertEvent
│   ├── NotificationTarget / NotificationTargetKind
│   ├── AlertSilence / SilenceMatcher
│   ├── AlertRuleRepository (interface)
│   ├── AlertInstanceRepository (interface)
│   ├── AlertSilenceRepository (interface)
│   ├── AlertNotificationRepository (interface)
│   ├── AlertReadRepository (interface)
│   ├── ConditionEvaluator<C> (sealed)
│   └── NotificationDispatcher (interface)
├── outbound/  (admin-managed outbound connections)
│   ├── OutboundConnection
│   ├── OutboundAuth (sealed — NONE, BEARER, BASIC)
│   ├── TrustMode (enum)
│   └── OutboundConnectionRepository (interface)
└── http/  (cross-cutting outbound HTTP primitive)
    ├── OutboundHttpProperties
    ├── OutboundHttpRequestContext
    └── OutboundHttpClientFactory (interface)

cameleer-server-app/src/main/java/com/cameleer/server/app/
├── alerting/
│   ├── controller/  (REST)
│   │   ├── AlertRuleController
│   │   ├── AlertController
│   │   ├── AlertSilenceController
│   │   └── AlertNotificationController
│   ├── storage/  (Postgres)
│   │   ├── PostgresAlertRuleRepository
│   │   ├── PostgresAlertInstanceRepository
│   │   ├── PostgresAlertSilenceRepository
│   │   ├── PostgresAlertNotificationRepository
│   │   └── PostgresAlertReadRepository
│   ├── eval/  (the scheduled evaluators)
│   │   ├── AlertEvaluatorJob (@Scheduled, claim-based)
│   │   ├── RouteMetricEvaluator
│   │   ├── ExchangeMatchEvaluator
│   │   ├── AgentStateEvaluator
│   │   ├── DeploymentStateEvaluator
│   │   ├── LogPatternEvaluator
│   │   ├── JvmMetricEvaluator
│   │   ├── PerKindCircuitBreaker
│   │   └── TickCache
│   ├── notify/
│   │   ├── NotificationDispatchJob (@Scheduled, claim-based)
│   │   ├── InAppInboxQuery
│   │   ├── WebhookDispatcher
│   │   ├── MustacheRenderer
│   │   └── SilenceMatcher
│   ├── dto/  (AlertRuleDto, AlertDto, ConditionDto sealed, WebhookDto, etc.)
│   ├── retention/
│   │   └── AlertingRetentionJob (daily @Scheduled)
│   └── config/
│       └── AlertingProperties (@ConfigurationProperties)
├── outbound/
│   ├── controller/
│   │   └── OutboundConnectionAdminController
│   ├── storage/
│   │   └── PostgresOutboundConnectionRepository
│   └── dto/
│       └── OutboundConnectionDto
└── http/
    ├── ApacheOutboundHttpClientFactory
    ├── SslContextBuilder
    └── config/
        └── OutboundHttpConfig (@ConfigurationProperties)

cameleer-server-app/src/main/resources/
├── db/migration/V11__alerting_and_outbound.sql  (one Flyway migration)
└── clickhouse/V_alerting_projections.sql  (one CH migration, idempotent)

ui/src/
├── pages/Alerts/
│   ├── InboxPage.tsx
│   ├── AllAlertsPage.tsx
│   ├── RulesListPage.tsx
│   ├── RuleEditor/
│   │   ├── RuleEditorWizard.tsx
│   │   ├── ScopeStep.tsx
│   │   ├── ConditionStep.tsx
│   │   ├── TriggerStep.tsx
│   │   ├── NotifyStep.tsx
│   │   └── ReviewStep.tsx
│   ├── SilencesPage.tsx
│   └── HistoryPage.tsx
├── pages/Admin/
│   └── OutboundConnectionsPage.tsx
├── components/
│   ├── NotificationBell.tsx
│   └── AlertStateChip.tsx
├── api/queries/
│   ├── alerts.ts
│   ├── alertRules.ts
│   ├── alertSilences.ts
│   └── outboundConnections.ts
└── cmdk/sources/
    ├── alerts.ts
    └── alertRules.ts
```

### Touchpoints on existing code (deliberate, minimal)

| Existing surface | Change | Scope |
|---|---|---|
| `cameleer-server-app/src/main/resources/db/migration/V11__…` | New Flyway migration | additive |
| `cameleer-server-app/src/main/resources/clickhouse/V_…_projections.sql` | New CH migration | additive, `IF NOT EXISTS` |
| `ClickHouseLogStore` | New method `long countLogs(LogSearchRequest)` (no `FINAL`) | one public method added |
| `ClickHouseSearchIndex` | New method `long countExecutionsForAlerting(AlertMatchSpec)` (no `FINAL`, no text-in-body subqueries) | one public method added |
| `SecurityConfig` | Path matchers for new endpoints | ~15 lines |
| `ui/src/router.tsx` | Route entries for `/alerts/**` and `/admin/outbound-connections` | additive |
| Top-nav layout | Insert `<NotificationBell />` | one import + one component |
| CMD-K registry | Register `alerts` + `alertRules` result sources | two file additions + one import |
| `.claude/rules/app-classes.md` + `core-classes.md` | Update class maps for the new packages | documentation |
| `com.cameleer:cameleer-common` | no changes | — |
| ingestion paths | no changes | — |
| agent protocol | no changes | — |
| ClickHouse schema (table structure) | no changes — only projections added | — |

### New dependencies

- `com.samskivert:jmustache` — logic-less Mustache templating for webhook/notification templates. ~30 KB, zero transitive deps. Added to `cameleer-server-core`.
- Apache HttpClient 5 (`org.apache.hc.client5`) — **already present** in the project; no new coordinate.

---

## 5. Data model (PostgreSQL)

One Flyway migration `V11__alerting_and_outbound.sql` creates all tables, enums, and indexes in a single transaction.

### Enum types

```sql
CREATE TYPE severity_enum AS ENUM ('CRITICAL','WARNING','INFO');
CREATE TYPE condition_kind_enum AS ENUM ('ROUTE_METRIC','EXCHANGE_MATCH','AGENT_STATE','DEPLOYMENT_STATE','LOG_PATTERN','JVM_METRIC');
CREATE TYPE alert_state_enum AS ENUM ('PENDING','FIRING','ACKNOWLEDGED','RESOLVED');
CREATE TYPE target_kind_enum AS ENUM ('USER','GROUP','ROLE');
CREATE TYPE notification_status_enum AS ENUM ('PENDING','DELIVERED','FAILED');
CREATE TYPE trust_mode_enum AS ENUM ('SYSTEM_DEFAULT','TRUST_ALL','TRUST_PATHS');
CREATE TYPE outbound_method_enum AS ENUM ('POST','PUT','PATCH');
CREATE TYPE outbound_auth_kind_enum AS ENUM ('NONE','BEARER','BASIC');
```

### Tables

#### `outbound_connections` (admin-managed)

```sql
CREATE TABLE outbound_connections (
    id uuid PRIMARY KEY,
    tenant_id varchar(64) NOT NULL,
    name varchar(100) NOT NULL,
    description text,
    url text NOT NULL, -- Mustache-enabled
    method outbound_method_enum NOT NULL,
    default_headers jsonb NOT NULL DEFAULT '{}', -- values are Mustache templates
    default_body_tmpl text, -- null = built-in default JSON envelope
    tls_trust_mode trust_mode_enum NOT NULL DEFAULT 'SYSTEM_DEFAULT',
    tls_ca_pem_paths jsonb NOT NULL DEFAULT '[]', -- array of paths from OutboundHttpProperties
    hmac_secret text, -- Ed25519-key-derived encryption at rest
    auth_kind outbound_auth_kind_enum NOT NULL DEFAULT 'NONE',
    auth_config jsonb NOT NULL DEFAULT '{}', -- shape depends on auth_kind; v1 unused
    allowed_environment_ids uuid[] NOT NULL DEFAULT '{}', -- [] = allowed in all envs
    created_at timestamptz NOT NULL DEFAULT now(),
    created_by uuid NOT NULL REFERENCES users(id),
    updated_at timestamptz NOT NULL DEFAULT now(),
    updated_by uuid NOT NULL REFERENCES users(id),
    UNIQUE (tenant_id, name)
);
CREATE INDEX outbound_connections_tenant_idx ON outbound_connections (tenant_id);
```

#### `alert_rules`

```sql
CREATE TABLE alert_rules (
    id uuid PRIMARY KEY,
    environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
    name varchar(200) NOT NULL,
    description text,
    severity severity_enum NOT NULL,
    enabled boolean NOT NULL DEFAULT true,

    condition_kind condition_kind_enum NOT NULL,
    condition jsonb NOT NULL, -- sealed-subtype payload, Jackson-DEDUCTION polymorphic

    evaluation_interval_seconds int NOT NULL DEFAULT 60 CHECK (evaluation_interval_seconds >= 5),
    for_duration_seconds int NOT NULL DEFAULT 0 CHECK (for_duration_seconds >= 0),
    re_notify_minutes int NOT NULL DEFAULT 60 CHECK (re_notify_minutes >= 0),

    notification_title_tmpl text NOT NULL, -- Mustache
    notification_message_tmpl text NOT NULL, -- Mustache
    webhooks jsonb NOT NULL DEFAULT '[]', -- [{id: uuid, outboundConnectionId, bodyOverride?, headerOverrides?}] — id assigned server-side on save, used as stable ref from alert_notifications.webhook_id

    next_evaluation_at timestamptz NOT NULL DEFAULT now(),
    claimed_by varchar(64),
    claimed_until timestamptz,
    eval_state jsonb NOT NULL DEFAULT '{}',

    created_at timestamptz NOT NULL DEFAULT now(),
    created_by uuid NOT NULL REFERENCES users(id),
    updated_at timestamptz NOT NULL DEFAULT now(),
    updated_by uuid NOT NULL REFERENCES users(id)
);
CREATE INDEX alert_rules_env_idx ON alert_rules (environment_id);
CREATE INDEX alert_rules_claim_due_idx ON alert_rules (next_evaluation_at) WHERE enabled = true;
```

#### `alert_rule_targets`

```sql
CREATE TABLE alert_rule_targets (
    id uuid PRIMARY KEY,
    rule_id uuid NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE,
    target_kind target_kind_enum NOT NULL,
    target_id varchar(128) NOT NULL,
    UNIQUE (rule_id, target_kind, target_id)
);
CREATE INDEX alert_rule_targets_lookup_idx ON alert_rule_targets (target_kind, target_id);
```

#### `alert_instances`

```sql
CREATE TABLE alert_instances (
    id uuid PRIMARY KEY,
    rule_id uuid REFERENCES alert_rules(id) ON DELETE SET NULL,
    rule_snapshot jsonb NOT NULL,
    environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
    state alert_state_enum NOT NULL,
    severity severity_enum NOT NULL,
    fired_at timestamptz NOT NULL,
    acked_at timestamptz,
    acked_by uuid REFERENCES users(id),
    resolved_at timestamptz,
    last_notified_at timestamptz,
    silenced boolean NOT NULL DEFAULT false,
    current_value numeric,
    threshold numeric,
    context jsonb NOT NULL,
    title text NOT NULL,
    message text NOT NULL,
    target_user_ids uuid[] NOT NULL DEFAULT '{}',
    target_group_ids uuid[] NOT NULL DEFAULT '{}',
    target_role_names text[] NOT NULL DEFAULT '{}'
);
CREATE INDEX alert_instances_inbox_idx ON alert_instances (environment_id, state, fired_at DESC);
CREATE INDEX alert_instances_open_rule_idx ON alert_instances (rule_id, state) WHERE rule_id IS NOT NULL;
CREATE INDEX alert_instances_resolved_idx ON alert_instances (resolved_at) WHERE state = 'RESOLVED';
CREATE INDEX alert_instances_target_u_idx ON alert_instances USING GIN (target_user_ids);
CREATE INDEX alert_instances_target_g_idx ON alert_instances USING GIN (target_group_ids);
CREATE INDEX alert_instances_target_r_idx ON alert_instances USING GIN (target_role_names);
```

#### `alert_silences`

```sql
CREATE TABLE alert_silences (
    id uuid PRIMARY KEY,
    environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
    matcher jsonb NOT NULL, -- { ruleId?, appSlug?, routeId?, agentId?, severity? }
    reason text,
    starts_at timestamptz NOT NULL,
    ends_at timestamptz NOT NULL CHECK (ends_at > starts_at),
    created_by uuid NOT NULL REFERENCES users(id),
    created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_silences_active_idx ON alert_silences (environment_id, ends_at);
```

#### `alert_notifications` (webhook delivery outbox)

```sql
CREATE TABLE alert_notifications (
    id uuid PRIMARY KEY,
    alert_instance_id uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
    webhook_id uuid, -- opaque ref into rule's webhooks JSONB
    outbound_connection_id uuid REFERENCES outbound_connections(id) ON DELETE SET NULL,
    status notification_status_enum NOT NULL DEFAULT 'PENDING',
    attempts int NOT NULL DEFAULT 0,
    next_attempt_at timestamptz NOT NULL DEFAULT now(),
    claimed_by varchar(64),
    claimed_until timestamptz,
    last_response_status int,
    last_response_snippet text,
    payload jsonb NOT NULL, -- snapshotted at first attempt
    delivered_at timestamptz,
    created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_notifications_pending_idx ON alert_notifications (next_attempt_at) WHERE status = 'PENDING';
CREATE INDEX alert_notifications_instance_idx ON alert_notifications (alert_instance_id);
```

#### `alert_reads`

```sql
CREATE TABLE alert_reads (
    user_id uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    alert_instance_id uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
    read_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, alert_instance_id)
);
```

### Cascade summary

```
environments → alert_rules (CASCADE) → alert_rule_targets (CASCADE)
environments → alert_silences (CASCADE)
environments → alert_instances (CASCADE) → alert_reads (CASCADE)
                                         → alert_notifications (CASCADE)
alert_rules → alert_instances (SET NULL, rule_snapshot preserves context)
users → alert_reads (CASCADE)
outbound_connections (delete) — blocked by app-level 409 check while any alert_rules.webhooks JSONB entry references it (no database FK)
```

**Rule deletion** preserves history (`alert_instances.rule_id = NULL`, `rule_snapshot` retains details). **Environment deletion** leaves zero alerting rows — POC-safe.

### Jackson polymorphism for conditions

```java
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// No discriminator property: Jackson picks the subtype from the fields present in the payload.
@JsonTypeInfo(use = JsonTypeInfo.Id.DEDUCTION)
@JsonSubTypes({
    @JsonSubTypes.Type(RouteMetricCondition.class),
    @JsonSubTypes.Type(ExchangeMatchCondition.class),
    @JsonSubTypes.Type(AgentStateCondition.class),
    @JsonSubTypes.Type(DeploymentStateCondition.class),
    @JsonSubTypes.Type(LogPatternCondition.class),
    @JsonSubTypes.Type(JvmMetricCondition.class)
})
public sealed interface AlertCondition permits
        RouteMetricCondition, ExchangeMatchCondition, AgentStateCondition,
        DeploymentStateCondition, LogPatternCondition, JvmMetricCondition {
    ConditionKind kind();
}
```

Jackson deduces the subtype from the set of present fields, so each record's field set has to distinguish it from the other subtypes. Bean Validation (`@Valid`) on each record validates at the controller boundary.

Example condition payloads:

```json
// ROUTE_METRIC
{ "scope": {"appSlug":"orders","routeId":"route-1"},
  "metric": "P99_LATENCY_MS", "comparator": "GT", "threshold": 2000, "windowSeconds": 300 }

// EXCHANGE_MATCH — PER_EXCHANGE
{ "scope": {"appSlug":"orders"},
  "filter": {"status":"FAILED","attributes":{"type":"payment"}},
  "fireMode": "PER_EXCHANGE", "perExchangeLingerSeconds": 300 }

// EXCHANGE_MATCH — COUNT_IN_WINDOW
{ "scope": {"appSlug":"orders"},
  "filter": {"status":"FAILED"},
  "fireMode": "COUNT_IN_WINDOW", "threshold": 5, "windowSeconds": 900 }

// AGENT_STATE
{ "scope": {"appSlug":"orders"}, "state": "DEAD", "forSeconds": 60 }

// DEPLOYMENT_STATE
{ "scope": {"appSlug":"orders"}, "states": ["FAILED","DEGRADED"] }

// LOG_PATTERN
{ "scope": {"appSlug":"orders"}, "level": "ERROR",
  "pattern": "TimeoutException", "threshold": 5, "windowSeconds": 900 }

// JVM_METRIC
{ "scope": {"appSlug":"orders"}, "metric": "heap_used_percent",
  "aggregation": "MAX", "comparator": "GT", "threshold": 90, "windowSeconds": 300 }
```

### Claim-polling queries

```sql
-- Rule evaluator
UPDATE alert_rules
SET claimed_by = :instance, claimed_until = now() + interval '30 seconds'
WHERE id IN (
    SELECT id FROM alert_rules
    WHERE enabled = true
      AND next_evaluation_at <= now()
      AND (claimed_until IS NULL OR claimed_until < now())
    ORDER BY next_evaluation_at
    LIMIT :batch
    FOR UPDATE SKIP LOCKED
)
RETURNING *;

-- Notification dispatcher (same pattern on alert_notifications with status='PENDING')
```

`FOR UPDATE SKIP LOCKED` is the crux: rows locked by another replica's in-flight claim are skipped rather than waited on, so replicas never block each other. The 30 s `claimed_until` lease lets another replica pick up a rule whose claimant died.
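
The lease semantics of the claim query can be illustrated without a database. The following is a minimal in-memory sketch, not the repository implementation: `RuleRow` and `ClaimModel` are hypothetical names, the clock is passed in explicitly, and `synchronized` only stands in for row locking (it does not model the non-blocking behaviour of `SKIP LOCKED`).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an alert_rules row; only the claim-relevant columns.
final class RuleRow {
    final String id;
    final boolean enabled;
    final Instant nextEvaluationAt;
    String claimedBy;      // null = unclaimed
    Instant claimedUntil;  // null = unclaimed

    RuleRow(String id, boolean enabled, Instant nextEvaluationAt) {
        this.id = id;
        this.enabled = enabled;
        this.nextEvaluationAt = nextEvaluationAt;
    }
}

final class ClaimModel {
    static final Duration CLAIM_TTL = Duration.ofSeconds(30);

    // Mirrors the WHERE / ORDER BY / LIMIT of the claim query above.
    static synchronized List<RuleRow> claimDue(List<RuleRow> table, String instance,
                                               int batch, Instant now) {
        List<RuleRow> claimed = new ArrayList<>();
        table.stream()
            .filter(r -> r.enabled)
            .filter(r -> !r.nextEvaluationAt.isAfter(now))                        // next_evaluation_at <= now()
            .filter(r -> r.claimedUntil == null || r.claimedUntil.isBefore(now))  // lease free or expired
            .sorted(Comparator.comparing((RuleRow r) -> r.nextEvaluationAt))
            .limit(batch)
            .forEach(r -> {
                r.claimedBy = instance;
                r.claimedUntil = now.plus(CLAIM_TTL);
                claimed.add(r);
            });
        return claimed;
    }
}
```

In this model a second replica polling ten seconds after the first gets an empty batch, and picks the same rules up once the 30 s lease has lapsed, which is the "recovers from replica death" property the design relies on.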

---

## 6. Outbound connections

### Concept

An `OutboundConnection` is a reusable, admin-managed HTTPS destination. Alert rules reference connections by ID and may override body or header templates per rule. Rotating a URL or secret updates every rule atomically.

**Tenant-global.** Slack URLs and PagerDuty keys are team infrastructure, not env-specific. Env-specific routing is achieved by creating multiple connections (`slack-prod`, `slack-dev`) and referencing the appropriate one in each env's rules.

**Allowed-env restriction.** `allowed_environment_ids` defaults to empty, which means all envs. An admin restricts a connection to specific envs via a multi-select on the connection form. The UI picker filters by current env; rule save validates (422 on violation); narrowing the restriction while rules still reference the connection returns 409 with a conflict list.
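
The two checks described above can be sketched as pure functions. This is an illustrative sketch only: `AllowedEnvCheck` and both method names are hypothetical, and mapping the results to 422 / 409 responses is left to the controllers.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;

final class AllowedEnvCheck {
    // Empty set = allowed in all envs (the column default).
    static boolean isAllowed(Set<UUID> allowedEnvironmentIds, UUID environmentId) {
        return allowedEnvironmentIds.isEmpty() || allowedEnvironmentIds.contains(environmentId);
    }

    // Envs of rules that reference the connection but would be excluded by the
    // new restriction. A non-empty result would feed the 409 conflict list.
    static Set<UUID> narrowingConflicts(Set<UUID> newAllowed, Set<UUID> envsOfReferencingRules) {
        Set<UUID> conflicts = new TreeSet<>();
        for (UUID env : envsOfReferencingRules) {
            if (!isAllowed(newAllowed, env)) {
                conflicts.add(env);
            }
        }
        return conflicts;
    }
}
```

Rule save would call `isAllowed` with the rule's environment, and the connection update path would call `narrowingConflicts` before persisting a narrower list.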

**Delete semantics.** 409 if any rule references the connection. No silent cascade — admin must first remove references.

### Default body template (when rule has no override)

```json
{
  "alert": { "id", "state", "firedAt", "severity", "title", "message", "link" },
  "rule": { "id", "name", "description", "severity" },
  "env": { "slug", "id" },
  "context": { /* full Mustache context: app, route, agent, exchange, etc. */ }
}
```

"Just plug in my Slack incoming webhook URL" works without writing a template.

### HMAC signing (optional per connection)

When `hmac_secret` is set, dispatch adds an `X-Cameleer-Signature: sha256=<hmac(secret, body)>` header, following the GitHub / Stripe pattern. The secret is encrypted at rest; the concrete approach (Jasypt vs bespoke over existing Ed25519-derived key material) is decided in planning (see §20).
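
As an illustration of the header value, here is a minimal sketch using the JDK's `javax.crypto.Mac`. It assumes HMAC-SHA256 over the UTF-8 bytes of the rendered body with lowercase hex encoding; `WebhookSigner` and `signatureHeader` are hypothetical names, and the spec does not fix the encoding details.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class WebhookSigner {
    // Returns the header value "sha256=<hex>" for the exact bytes sent as the request body.
    static String signatureHeader(String secret, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(body.getBytes(StandardCharsets.UTF_8));
            return "sha256=" + HexFormat.of().formatHex(digest);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

A receiver would recompute the same value over the raw request body and compare it with a constant-time check such as `MessageDigest.isEqual`. Because retries replay the snapshotted payload, the signature stays stable across attempts.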

---

## 7. Rule evaluation

### Scheduler

```java
import static java.util.stream.Collectors.groupingBy;

import java.util.List;
import org.springframework.scheduling.annotation.SchedulingConfigurer;
import org.springframework.scheduling.config.ScheduledTaskRegistrar;
import org.springframework.stereotype.Component;

@Component
public class AlertEvaluatorJob implements SchedulingConfigurer {

    // Injected collaborators (ruleRepo, circuitBreaker, properties, instanceId) omitted for brevity.

    // Interval wired via AlertingProperties.evaluatorTickIntervalMs (floor 5000)
    @Override
    public void configureTasks(ScheduledTaskRegistrar registrar) {
        registrar.addFixedDelayTask(this::tick, properties.effectiveEvaluatorTickIntervalMs());
    }

    void tick() {
        // Claim a batch of due rules; other replicas skip these rows.
        List<AlertRule> claimed = ruleRepo.claimDueRules(instanceId, properties.batchSize());
        // Group by (kind, window) so each group can share one coalesced query.
        var groups = claimed.stream().collect(groupingBy(r -> new GroupKey(r.conditionKind(), windowSeconds(r))));
        for (var entry : groups.entrySet()) {
            // Breaker open for this kind: skip evaluation and push the rules to a later tick.
            if (circuitBreaker.isOpen(entry.getKey().kind())) { rescheduleBatch(entry.getValue()); continue; }
            try {
                coalescedEvaluate(entry.getKey(), entry.getValue());
            } catch (Exception e) {
                circuitBreaker.recordFailure(entry.getKey().kind());
                rescheduleBatch(entry.getValue());
            }
        }
    }
}
```

### Per-condition evaluators

| Kind | Read source | Query shape |
|---|---|---|
| `ROUTE_METRIC` | `SearchService.statsForRoute` / `statsForApp` | Stats over window; comparator vs threshold |
| `EXCHANGE_MATCH` (PER_EXCHANGE) | `SearchService.search(SearchRequest)` | `timestamp > eval_state.lastExchangeTs AND filter` → fire one alert per match, advance cursor |
| `EXCHANGE_MATCH` (COUNT_IN_WINDOW) | `ClickHouseSearchIndex.countExecutionsForAlerting(spec)` | Count in window vs threshold |
| `AGENT_STATE` | `AgentRegistryService.listByEnvironment` | Any agent matching scope has been in the target state for ≥ `forSeconds` |
| `DEPLOYMENT_STATE` | `DeploymentRepository.findLatestByAppAndEnv` | Status in target set |
| `LOG_PATTERN` | `ClickHouseLogStore.countLogs(LogSearchRequest)` | Count in window vs threshold |
| `JVM_METRIC` | `MetricsQueryStore` | Value aggregated over window (aggregation per rule) vs threshold |

### State machine

```
            (cond holds for < forDuration)
 PENDING ──────▶ keep pendingSince
    ▲    │
    │    ▼  (cond holds ≥ forDuration)
    │  FIRING ◀── (re-eval matches; update last_notified_at cadence)
    │    /  \
    │   /    \
    │ ack/    \resolve
    │  ▼        ▼
    │ ACKNOWLEDGED   RESOLVED ── (cond false again → cycle can restart)
```

`PER_EXCHANGE` mode: each match is its own brief FIRING instance that auto-resolves after `perExchangeLingerSeconds` (default 300 s). History retains it for 90 d.
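
The evaluator-driven part of the diagram can be sketched as a pure transition function. This is an interpretation rather than the specified behaviour: it assumes that a `PENDING` instance is dropped when the condition clears, and that an `ACKNOWLEDGED` alert resolves automatically when the condition clears. `AlertStateMachine` and `onEvaluation` are hypothetical names; the manual ack action and the `PER_EXCHANGE` linger are left out.

```java
import java.time.Duration;
import java.time.Instant;

enum AlertState { PENDING, FIRING, ACKNOWLEDGED, RESOLVED }

final class AlertStateMachine {
    // current == null means no open instance exists for the rule.
    // Returns the state after one evaluation; null means "no open instance".
    static AlertState onEvaluation(AlertState current, boolean conditionHolds,
                                   Instant pendingSince, Duration forDuration, Instant now) {
        boolean open = current == AlertState.FIRING || current == AlertState.ACKNOWLEDGED;
        if (!conditionHolds) {
            // Assumption: an open alert resolves; a PENDING one simply disappears.
            return open ? AlertState.RESOLVED : null;
        }
        if (open) {
            return current; // still matching; re-notify cadence is handled elsewhere
        }
        // No instance, RESOLVED, or PENDING: start or continue the pending timer.
        Instant since = (current == AlertState.PENDING && pendingSince != null) ? pendingSince : now;
        boolean heldLongEnough = !now.isBefore(since.plus(forDuration));
        return heldLongEnough ? AlertState.FIRING : AlertState.PENDING;
    }
}
```

With the default `for_duration_seconds = 0` the function goes straight to `FIRING` on the first matching evaluation, so `PENDING` only becomes visible for rules that set a hold duration.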

### Performance optimizations (v1)

1. **Four ClickHouse projections** (new CH migration, idempotent):

   ```sql
   ALTER TABLE executions ADD PROJECTION IF NOT EXISTS alerting_app_status
       (SELECT * ORDER BY (tenant_id, environment, application_id, status, start_time));
   ALTER TABLE executions ADD PROJECTION IF NOT EXISTS alerting_route_status
       (SELECT * ORDER BY (tenant_id, environment, route_id, status, start_time));
   ALTER TABLE logs ADD PROJECTION IF NOT EXISTS alerting_app_level
       (SELECT * ORDER BY (tenant_id, environment, application, level, timestamp));
   ALTER TABLE agent_metrics ADD PROJECTION IF NOT EXISTS alerting_instance_metric
       (SELECT * ORDER BY (tenant_id, environment, instance_id, metric_name, collected_at));
   ```

   `stats_1m_route`'s existing ORDER BY already aligns with alerting access patterns; no projection needed.

2. **Drop `FINAL` for alerting counts.** New methods `ClickHouseLogStore.countLogs(...)` and `ClickHouseSearchIndex.countExecutionsForAlerting(...)` skip `FINAL` — alerting tolerates brief duplicate-row over-count (alert fires briefly, self-resolves on next tick after merge). Existing UI-facing `count()` path unchanged.

3. **Per-tick query coalescing.** Rules of the same kind + window share one aggregate query per tick.

4. **In-tick cache.** `Map<QueryKey, Long>` discarded at tick end. Two rules hitting the same `(app, route, window, metric)` produce one CH call.

5. **Per-kind circuit breaker.** 5 failures in 30 s → open for 60 s. Metric `alerting_circuit_open_total{kind}`. UI surfaces an admin banner when open.
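
The breaker thresholds in item 5 can be sketched as a small sliding-window counter. This is a sketch under assumptions, not the `PerKindCircuitBreaker` design: the class name is hypothetical, kinds are plain strings, the clock is passed in explicitly for testability, and the breaker closes purely by timeout with no half-open probe.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class PerKindCircuitBreakerSketch {
    private static final int FAILURE_THRESHOLD = 5;
    private static final Duration FAILURE_WINDOW = Duration.ofSeconds(30);
    private static final Duration OPEN_FOR = Duration.ofSeconds(60);

    private final Map<String, Deque<Instant>> failures = new HashMap<>();
    private final Map<String, Instant> openUntil = new HashMap<>();

    synchronized void recordFailure(String kind, Instant now) {
        Deque<Instant> recent = failures.computeIfAbsent(kind, k -> new ArrayDeque<>());
        recent.addLast(now);
        // Drop failures that fell out of the sliding window.
        while (!recent.isEmpty() && recent.peekFirst().isBefore(now.minus(FAILURE_WINDOW))) {
            recent.removeFirst();
        }
        if (recent.size() >= FAILURE_THRESHOLD) {
            openUntil.put(kind, now.plus(OPEN_FOR));
            recent.clear();
        }
    }

    synchronized boolean isOpen(String kind, Instant now) {
        Instant until = openUntil.get(kind);
        return until != null && now.isBefore(until);
    }
}
```

Keeping one failure window per kind means a failing log store only pauses `LOG_PATTERN` rules while route, agent, and deployment rules keep evaluating.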

### Silence matching

At notification-dispatch time (not evaluation time):

```sql
SELECT 1 FROM alert_silences
WHERE environment_id = :env
  AND now() BETWEEN starts_at AND ends_at
  AND matcher_matches(matcher, :instanceContext)
LIMIT 1;
```

If any match → `alert_instances.silenced = true`, no webhook dispatch, no re-notification. Inbox still shows the instance with a silenced pill — audit trail preserved.
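
The semantics of `matcher_matches` are not spelled out here. One plausible reading, sketched below, is that every key present in the matcher must equal the corresponding value in the instance context, so an empty matcher silences everything in the env. Both the class name and that AND-over-present-keys rule are assumptions.

```java
import java.time.Instant;
import java.util.Map;

final class SilenceMatcherSketch {
    // matcher: only the keys the silence specifies (ruleId, appSlug, routeId, agentId, severity).
    // context: the same keys as known for the alert instance.
    static boolean matches(Map<String, String> matcher, Map<String, String> context) {
        return matcher.entrySet().stream()
            .allMatch(e -> e.getValue().equals(context.get(e.getKey())));
    }

    // Mirrors "now() BETWEEN starts_at AND ends_at" (inclusive on both ends).
    static boolean isActive(Instant startsAt, Instant endsAt, Instant now) {
        return !now.isBefore(startsAt) && !now.isAfter(endsAt);
    }
}
```

Under this reading a silence of `{appSlug: "orders"}` covers every rule and route of that app, while adding `severity` narrows it to one severity level.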

### Failure modes

| Failure | Behavior |
|---|---|
| Read interface throws | Log WARN, increment `alerting_eval_errors_total{kind, rule_id}`, reschedule rule, release claim |
| 10 consecutive failures for a rule | Mark `eval_state.disabledReason`, surface in UI |
| Template render error | Fall back to literal `{{var}}` in output, log WARN, still dispatch |
| Slow evaluator | Claim TTL 30 s; investigate if sustained |
| Rule deleted mid-eval | FK cascade waits on the row lock — effectively serialized |
| Env deleted mid-eval | FK cascade waits — effectively serialized |

---

## 8. Notification dispatch

### In-app inbox — derived, not materialized

```sql
SELECT ai.*
FROM alert_instances ai
WHERE ai.environment_id = :env
  AND ai.state IN ('FIRING','ACKNOWLEDGED','RESOLVED')
  AND (
        :me = ANY(ai.target_user_ids)
     OR ai.target_group_ids && :my_group_ids
     OR ai.target_role_names && :my_role_names
  )
ORDER BY ai.fired_at DESC
LIMIT 100;
```

`:my_group_ids` and `:my_role_names` resolved once per request from `RbacService`.

**Bell badge count:** same filter + `state IN ('FIRING','ACKNOWLEDGED')` + `NOT EXISTS (alert_reads ar WHERE ar.user_id=:me AND ar.alert_instance_id=ai.id)`, count-only. Server-side 5 s memoization per `(env, user)` keeps bell polling cheap.
|
|
|
|
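The 5 s memoization of the badge count could look roughly like the following. This is a minimal sketch rather than the planned implementation: `UnreadCountMemo`, its `Key` record, and the injected clock are all hypothetical names, and eviction of stale keys is deliberately left out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.ToLongFunction;

/** Hypothetical per-(env, user) memo for the unread-count query. */
final class UnreadCountMemo {
    record Key(String envId, String userId) {}
    private record Entry(long count, long computedAtMillis) {}

    private final ConcurrentHashMap<Key, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;      // 5_000 for the 5 s window described above
    private final LongSupplier clock;  // injected so tests can control time

    UnreadCountMemo(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached count if it is younger than the TTL, otherwise reloads it. */
    long get(Key key, ToLongFunction<Key> loader) {
        long now = clock.getAsLong();
        Entry entry = cache.compute(key, (k, old) ->
                (old != null && now - old.computedAtMillis() < ttlMillis)
                        ? old
                        : new Entry(loader.applyAsLong(k), now));
        return entry.count();
    }
}
```

With a 30 s client poll interval, the memo mostly matters when one user has several tabs open or several replicas sit behind the same session.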
### Webhook outbox — claim-based

`NotificationDispatchJob` claims due notifications (`status='PENDING' AND next_attempt_at <= now()`) and dispatches. HTTP client from shared `OutboundHttpClientFactory` with TLS config from the referenced outbound connection.

- **2xx** → `DELIVERED`
- **4xx** → `FAILED` immediately (retry won't help); log at WARN
- **5xx / network / timeout** → retry with exponential backoff 30 s → 2 m → 5 m, then `FAILED`
- Manual retry: `POST /alerts/notifications/{id}/retry` (OPERATOR+)

Payload rendered at **first** dispatch attempt, snapshotted in `alert_notifications.payload`. Retries replay the snapshot — template edits after fire don't affect in-flight notifications.

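The status handling above can be sketched as a small pure function. The names `WebhookRetryPolicy`, `Outcome`, and `retriesSoFar` are hypothetical. The sketch also makes two assumptions the list does not settle: a non-positive status code stands in for a network error or timeout, and anything outside 2xx/4xx (including 3xx) is treated like 5xx. How the three delays line up with `webhook-max-attempts` is left to planning.

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical classification of one webhook attempt. */
final class WebhookRetryPolicy {
    enum Outcome { DELIVERED, FAILED, RETRY }

    // Delays taken from the list above: 30 s, then 2 m, then 5 m.
    static final List<Duration> BACKOFF = List.of(
            Duration.ofSeconds(30), Duration.ofMinutes(2), Duration.ofMinutes(5));

    /**
     * @param statusCode   HTTP status, or a value <= 0 for network error / timeout (assumption)
     * @param retriesSoFar how many retries have already been scheduled for this notification
     */
    static Outcome classify(int statusCode, int retriesSoFar) {
        if (statusCode >= 200 && statusCode < 300) {
            return Outcome.DELIVERED;
        }
        if (statusCode >= 400 && statusCode < 500) {
            return Outcome.FAILED; // client error: retrying won't help
        }
        // 5xx, network error, timeout, and (by assumption) anything else.
        return retriesSoFar < BACKOFF.size() ? Outcome.RETRY : Outcome.FAILED;
    }

    /** Delay before the next attempt; only meaningful when classify returned RETRY. */
    static Duration nextDelay(int retriesSoFar) {
        return BACKOFF.get(retriesSoFar);
    }
}
```

On `RETRY`, the job would set `next_attempt_at` to now plus `nextDelay(retriesSoFar)` and leave the row `PENDING`.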
### Template rendering

JMustache (`com.samskivert:jmustache`). Logic-less, industry-standard syntax.

**Rendered surfaces:** URL (query-string interpolation), header values, body, and separately `alert_instances.title` / `message` rendered once at fire.

**Context map** (dot-notation + camelCase leaves):

```
env.slug          env.id
rule.id           rule.name         rule.severity       rule.description
alert.id          alert.state       alert.firedAt       alert.resolvedAt
alert.ackedBy     alert.link        alert.currentValue  alert.threshold
alert.comparator  alert.window
app.slug          app.id            app.displayName
route.id
agent.id          agent.name        agent.state
exchange.id       exchange.status   exchange.link
deployment.id     deployment.status
log.logger        log.level         log.message
metric.name       metric.value
```

**Error handling.** Missing variable renders as `{{var.name}}` literal + WARN log. Malformed template falls back to built-in default + WARN. Never drop a notification due to template error.

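To make the missing-variable rule concrete, here is a deliberately tiny stand-in renderer. It is not JMustache and handles only plain `{{path}}` tags. `LenientTemplate` and `Result` are hypothetical names, and the list of missing paths is what a caller would log at WARN. The malformed-template fallback is not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: resolves {{a.b}} against nested maps, keeping unknown tags literal. */
final class LenientTemplate {
    private static final Pattern VAR = Pattern.compile("\\{\\{\\s*([A-Za-z0-9_.]+)\\s*\\}\\}");

    record Result(String output, List<String> missing) {}

    static Result render(String template, Map<String, Object> context) {
        List<String> missing = new ArrayList<>();
        Matcher m = VAR.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String path = m.group(1);
            Object value = lookup(context, path);
            String replacement;
            if (value == null) {
                missing.add(path);                    // caller logs these at WARN
                replacement = "{{" + path + "}}";     // keep the tag as a literal
            } else {
                replacement = String.valueOf(value);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return new Result(out.toString(), missing);
    }

    /** Walks dot-separated segments through nested maps; null if any segment is absent. */
    private static Object lookup(Map<String, Object> context, String path) {
        Object current = context;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map<?, ?> map)) {
                return null;
            }
            current = map.get(part);
        }
        return current;
    }
}
```

For example, rendering `{{rule.name}} is {{alert.state}} since {{alert.firedAt}}` against a context without `alert.firedAt` leaves that last tag untouched and reports it as missing, so the notification still goes out.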
**"Test render" endpoint:** `POST /alerts/rules/{id}/render-preview` — drives rule editor's Preview button.

---

## 9. Rule promotion across environments

**UX.** Rule list row → **Environments ▾** menu of other envs in the tenant → open rule editor pre-populated with source rule's payload, target env selected. Banner: *"Promoting `<name>` from `<src>` → `<dst>`. Review and adjust, then save."* Save → normal `POST /api/v1/environments/{dstEnvSlug}/alerts/rules`. Source unaffected (it's a copy).

**Pure UI flow — no new server endpoint.** Re-uses the existing GET (to fetch) and POST (to create) paths.

**Prefill-time validation (client-side warnings, non-blocking):**

| Field | Check | Behavior |
|---|---|---|
| `scope.appSlug` | Does app exist in target env? | ⚠ warn + picker from target env's apps |
| `scope.agentId` | Per-env; can't transfer | Clear field, keep appSlug, note |
| `scope.routeId` | Per-app logical ID, stable | ✓ pass through |
| `targets[]` | Tenant-scoped | ✓ transfer as-is |
| `webhooks[].outboundConnectionId` | Target env allowed by connection? | ⚠ warn if not; disable save until resolved |

Bulk promotion (select multiple → promote all) deferred until usage patterns justify it.

---

## 10. Cross-cutting: outbound HTTP & TLS trust

Shared module — not inside `alerting/`.

### `OutboundHttpClientFactory`

```java
public interface OutboundHttpClientFactory {
    CloseableHttpClient clientFor(OutboundHttpRequestContext context);
}

public record OutboundHttpRequestContext(
    TrustMode trustMode,                // SYSTEM_DEFAULT | TRUST_ALL | TRUST_PATHS
    List<String> trustedCaPemPaths,
    Duration connectTimeout,
    Duration readTimeout
) {}
```

Implementation (`ApacheOutboundHttpClientFactory`) memoizes one `CloseableHttpClient` per unique effective config — not one per call.

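The "one client per unique effective config" behaviour falls out naturally if the request context is a value type used as a map key. The sketch below is a generic stand-in for illustration: `MemoizingClientFactory` is a hypothetical name, the client type is a type parameter so the example compiles without the Apache dependency, and the two types from the interface above are re-declared locally only to keep it self-contained.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

enum TrustMode { SYSTEM_DEFAULT, TRUST_ALL, TRUST_PATHS }

/** Local copy of the record from the spec; record equality makes it usable as a cache key. */
record OutboundHttpRequestContext(
        TrustMode trustMode,
        List<String> trustedCaPemPaths,
        Duration connectTimeout,
        Duration readTimeout) {
    OutboundHttpRequestContext {
        trustedCaPemPaths = List.copyOf(trustedCaPemPaths); // defensive, immutable key
    }
}

/** Hypothetical memoizing wrapper; C would be CloseableHttpClient in the real factory. */
final class MemoizingClientFactory<C> {
    private final ConcurrentHashMap<OutboundHttpRequestContext, C> clients = new ConcurrentHashMap<>();
    private final Function<OutboundHttpRequestContext, C> builder;

    MemoizingClientFactory(Function<OutboundHttpRequestContext, C> builder) {
        this.builder = builder;
    }

    /** Builds a client on first use of a config and returns the same instance afterwards. */
    C clientFor(OutboundHttpRequestContext context) {
        return clients.computeIfAbsent(context, builder);
    }
}
```

Because the key is the whole context, two outbound connections that share trust mode, CA paths, and timeouts end up sharing one client and its connection pool.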
### System config (`cameleer.server.outbound-http.*`)

```yaml
cameleer:
  server:
    outbound-http:
      trust-all: false                    # global kill-switch; WARN logged if true
      trusted-ca-pem-paths:               # additional roots layered on JVM default
        - /etc/cameleer/certs/corporate-root.pem
        - /etc/cameleer/certs/acme-internal.pem
      default-connect-timeout-ms: 2000
      default-read-timeout-ms: 5000
      proxy-url:                          # optional; null = no proxy
      proxy-username:
      proxy-password:
```

On startup: if `trust-all=true`, log red WARN (not suitable for production). If `trusted-ca-pem-paths` has entries, verify each path exists; fail-fast on missing files.

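The startup check described above might be as small as this. `OutboundHttpStartupCheck` is a hypothetical name, and `System.err` stands in for the real logger. The sketch only tests that each path exists, as the text says; parsing the PEM content is out of scope here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical fail-fast validation run once at application startup. */
final class OutboundHttpStartupCheck {

    /** Returns the configured CA paths that do not exist on disk. */
    static List<String> missingCaFiles(List<String> trustedCaPemPaths) {
        return trustedCaPemPaths.stream()
                .filter(p -> !Files.exists(Path.of(p)))
                .toList();
    }

    /** Warns on trust-all and throws if any configured CA file is missing. */
    static void validate(boolean trustAll, List<String> trustedCaPemPaths) {
        if (trustAll) {
            // Stand-in for a WARN log line.
            System.err.println("WARN outbound-http.trust-all=true: not suitable for production");
        }
        List<String> missing = missingCaFiles(trustedCaPemPaths);
        if (!missing.isEmpty()) {
            throw new IllegalStateException("Missing trusted CA PEM files: " + missing);
        }
    }
}
```

Throwing from a startup hook keeps a misconfigured server from coming up and silently falling back to the JVM default trust store.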
### Per-connection overrides

Each `OutboundConnection` carries `tls_trust_mode` + `tls_ca_pem_paths`. UI surfaces a dropdown: **System default (validated)** / **Trust custom CAs (from server config)** / **Trust all (insecure — testing only)**. Amber warning when *Trust all* selected. Audit logged (`AuditCategory.OUTBOUND_HTTP_TRUST_CHANGE`).

### Deferred

See **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)**:
- In-app CA bundle upload / admin management
- SaaS-layer CA reuse investigation (do first)

---

## 11. API surface

All env-scoped routes under `/api/v1/environments/{envSlug}/alerts/...` via existing `@EnvPath` resolver.

### Alerting — rules

| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts/rules` | VIEWER+ |
| `POST` | `/alerts/rules` | OPERATOR+ |
| `GET` | `/alerts/rules/{id}` | VIEWER+ |
| `PUT` | `/alerts/rules/{id}` | OPERATOR+ |
| `DELETE` | `/alerts/rules/{id}` | OPERATOR+ |
| `POST` | `/alerts/rules/{id}/enable` · `/disable` | OPERATOR+ |
| `POST` | `/alerts/rules/{id}/render-preview` | OPERATOR+ |
| `POST` | `/alerts/rules/{id}/test-evaluate` | OPERATOR+ |

### Alerting — instances

| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts` | VIEWER+ |
| `GET` | `/alerts/unread-count` | VIEWER+ |
| `GET` | `/alerts/{id}` | VIEWER+ |
| `POST` | `/alerts/{id}/ack` | VIEWER+ (if targeted) / OPERATOR+ |
| `POST` | `/alerts/{id}/read` | VIEWER+ (self) |
| `POST` | `/alerts/bulk-read` | VIEWER+ (self) |

### Alerting — silences

| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts/silences` | VIEWER+ |
| `POST` | `/alerts/silences` | OPERATOR+ |
| `PUT` | `/alerts/silences/{id}` | OPERATOR+ |
| `DELETE` | `/alerts/silences/{id}` | OPERATOR+ |

### Alerting — notifications

| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts/{id}/notifications` | VIEWER+ |
| `POST` | `/alerts/notifications/{id}/retry` | OPERATOR+ |

### Outbound connections (admin)

| Method | Path | Role |
|---|---|---|
| `GET` | `/api/v1/admin/outbound-connections` | ADMIN / OPERATOR (read-only) |
| `POST` | `/api/v1/admin/outbound-connections` | ADMIN |
| `GET` | `/api/v1/admin/outbound-connections/{id}` | ADMIN / OPERATOR (read-only) |
| `PUT` | `/api/v1/admin/outbound-connections/{id}` | ADMIN (409 if narrowing breaks references) |
| `DELETE` | `/api/v1/admin/outbound-connections/{id}` | ADMIN (409 if referenced) |
| `POST` | `/api/v1/admin/outbound-connections/{id}/test` | ADMIN |
| `GET` | `/api/v1/admin/outbound-connections/{id}/usage` | ADMIN / OPERATOR |

### OpenAPI regen

Per `CLAUDE.md` convention: after controller/DTO changes, run `cd ui && npm run generate-api:live` (backend on :8081) to regenerate `ui/src/api/schema.d.ts`. Commit regen alongside controller change.

---

## 12. CMD-K integration

Two new result sources registered in the existing UI registry (`ui/src/cmdk/sources/`):

- **Alerts** — queries `/alerts?q=...&limit=5` (server-side fulltext against title / message / rule_snapshot); results show severity icon + state chip; deep-link to `/alerts/inbox/{id}`.
- **Alert Rules** — queries `/alerts/rules?q=...&limit=5`; deep-link to `/alerts/rules/{id}`.

No new registry machinery — uses the existing extension point.

---

## 13. UI

### Routes

```
/alerts
├── /inbox          (default landing)
├── /all
├── /rules
│   ├── /new
│   └── /{id}       (edit; accepts ?promoteFrom=<src>&ruleId=<id>)
├── /silences
└── /history

/admin/outbound-connections
├── /
├── /new
└── /{id}
```

### Top-nav

Insert `<NotificationBell />` between env selector and user menu. Badge severity = `max(severities of unread targeting me)` (CRITICAL → `var(--error)`, WARNING → `var(--amber)`, INFO → `var(--muted)`). Dropdown shows 5 most-recent unread with inline ack button + "See all".

### Alerts section

New sidebar/top-nav entry visible to `VIEWER+`. Authoring actions (`POST /rules`, silence create, etc.) gated to `OPERATOR+`.

### Rule editor — 5-step wizard

1. **Scope** — radio (env-wide / app / route / agent) + pickers from env catalog (existing endpoints).
2. **Condition** — radio (6 kinds) + kind-specific form.
3. **Trigger** — threshold + comparator + window + for-duration + evaluation interval + severity; inline *Test evaluate* button.
4. **Notify** — title + message templates with *Preview* button; targets multi-select (users / groups / roles with typeahead); outbound connections multi-select filtered by current env + `allowed_environment_ids`.
5. **Review** — summary card, enabled toggle, save.

### Template editor — Mustache with variable auto-complete

Every Mustache template-editable field — notification title, notification message, webhook URL, webhook header values, webhook body — uses a shared `<MustacheEditor />` component with **variable auto-complete**. Users never have to guess what context variables are available.

**Behavior.**
- Typing `{{` opens a dropdown of available variables at the caret position.
- Each suggestion shows the variable path (`alert.firedAt`), its type (`Instant`), a one-line description, and a sample rendered value from the canned context.
- Filtering narrows the list as the user keeps typing (`{{ale…` → filters to `alert.*`).
- `Enter` / `Tab` inserts the path and closes `}}` automatically.
- Arrow keys + `Esc` follow standard combobox semantics (ARIA-conformant).

**Context-aware filtering.** The available variables depend on the rule's condition kind and scope. The editor is aware of both:
- Always shown: `env.*`, `rule.*`, `alert.*`
- `ROUTE_METRIC` with `route.id` set: adds `route.id`, `app.*`
- `EXCHANGE_MATCH`: adds `exchange.*`, `app.*`, `route.id` (if scoped)
- `AGENT_STATE`: adds `agent.*`, `app.*`
- `DEPLOYMENT_STATE`: adds `deployment.*`, `app.*`
- `LOG_PATTERN`: adds `log.*`, `app.*`
- `JVM_METRIC`: adds `metric.*`, `agent.*`, `app.*`

Variables that *might not* populate (e.g., `alert.resolvedAt` while state is FIRING) are shown with a grey "may be null" badge — users still see them so they can defensively template.

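The context-aware filtering rules can be expressed as a small lookup. The editor itself is a UI component, so the real registry presumably lives in the front end; the sketch below is written in Java only to show the logic, and `VariableScopes`, `visible`, and `inScope` are hypothetical names. It follows the bullets literally, so a `ROUTE_METRIC` rule without a route scope gets only the always-shown groups.

```java
import java.util.ArrayList;
import java.util.List;

enum ConditionKind { ROUTE_METRIC, EXCHANGE_MATCH, AGENT_STATE, DEPLOYMENT_STATE, LOG_PATTERN, JVM_METRIC }

/** Illustrative mapping from condition kind + scope to the variable groups offered. */
final class VariableScopes {

    /** Returns patterns such as "alert.*" (whole group) or "route.id" (single path). */
    static List<String> visible(ConditionKind kind, boolean routeScoped) {
        List<String> out = new ArrayList<>(List.of("env.*", "rule.*", "alert.*"));
        switch (kind) {
            case ROUTE_METRIC -> {
                if (routeScoped) { out.add("route.id"); out.add("app.*"); }
            }
            case EXCHANGE_MATCH -> {
                out.add("exchange.*"); out.add("app.*");
                if (routeScoped) { out.add("route.id"); }
            }
            case AGENT_STATE -> { out.add("agent.*"); out.add("app.*"); }
            case DEPLOYMENT_STATE -> { out.add("deployment.*"); out.add("app.*"); }
            case LOG_PATTERN -> { out.add("log.*"); out.add("app.*"); }
            case JVM_METRIC -> { out.add("metric.*"); out.add("agent.*"); out.add("app.*"); }
        }
        return out;
    }

    /** True if a concrete path like "exchange.id" is offered for this kind and scope. */
    static boolean inScope(String path, ConditionKind kind, boolean routeScoped) {
        for (String pattern : visible(kind, routeScoped)) {
            boolean matches = pattern.endsWith(".*")
                    ? path.startsWith(pattern.substring(0, pattern.length() - 1))
                    : path.equals(pattern);
            if (matches) {
                return true;
            }
        }
        return false;
    }
}
```

The same predicate could drive both the completion dropdown and an out-of-scope warning, which keeps the two from drifting apart.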
**Syntax checks inline.**
- Unclosed `{{` / unmatched `}}` flagged with a red underline + hover hint.
- Reference to an out-of-scope variable (e.g., `{{exchange.id}}` in a ROUTE_METRIC rule) flagged with an amber underline + hint ("not available for this rule kind — will render as literal").
- Checks run client-side on every keystroke (debounced); server-side render preview is still authoritative (§8).

**Shared implementation.** Same `<MustacheEditor />` component is used in:
- Rule editor — Notify step (title, message)
- Rule editor — Webhook overrides (body override, header value overrides; URL not editable per rule, it's the connection's)
- Admin **Outbound Connections** editor — default body template, default header values, URL (URL gets a reduced context: only `env.*` since a connection URL is rule-agnostic)
- *Test render* inline preview — rendered output updates live as user types

**Completion engine.** Specific library choice (CodeMirror 6 with a custom completion extension vs Monaco vs a lighter custom overlay on `<textarea>`) is deferred to planning — see §20.

### Silences, History, Rules list, OutboundConnectionAdminPage

Structure described in design presentation; no new design-system components required. Reuses `Select`, `Tabs`, `Toggle`, `Button`, `Label`, `InfiniteScrollArea`, `PageLoader`, `Badge` from `@cameleer/design-system`.

### Real-time behavior

- Bell: `/alerts/unread-count` polled every 30 s; paused when tab hidden (Page Visibility API).
- Inbox view: `/alerts` polled every 30 s when focused.
- No SSE in v1. SSE is a clean future add under `/alerts/stream` with no schema changes.

### Accessibility

Keyboard navigation; severity conveyed via icon + text + color (not color alone); ARIA live region on inbox for new-alert announcement; bell component has descriptive `aria-label`.

### Styling

All colors via `@cameleer/design-system` CSS variables (`var(--error)`, `var(--amber)`, `var(--muted)`, `var(--success)`). No hard-coded hex.

---

## 14. Configuration

### `AlertingProperties` (`cameleer.server.alerting.*`)

```yaml
cameleer:
  server:
    alerting:
      evaluator-tick-interval-ms: 5000          # floor: 5000 (clamped at startup with WARN if lower)
      evaluator-batch-size: 20
      claim-ttl-seconds: 30
      notification-tick-interval-ms: 5000
      notification-batch-size: 50
      in-tick-cache-enabled: true
      circuit-breaker-fail-threshold: 5
      circuit-breaker-window-seconds: 30
      circuit-breaker-cooldown-seconds: 60
      event-retention-days: 90
      notification-retention-days: 30
      webhook-timeout-ms: 5000
      webhook-max-attempts: 3
```

Env-var overridable (`CAMELEER_SERVER_ALERTING_EVALUATOR_TICK_INTERVAL_MS=...`). Wired via `SchedulingConfigurer` (not literal `@Scheduled(fixedDelay=...)`) so intervals come from the bean at startup. Hot-reload not supported — restart required to change cadence.

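The clamp mentioned in the YAML comment is simple enough to show in full. This is a sketch under assumptions: `TickIntervals` is a hypothetical helper, and `System.err` stands in for the WARN log.

```java
/** Hypothetical helper applying the 5000 ms floor to the evaluator tick interval. */
final class TickIntervals {
    static final long EVALUATOR_TICK_FLOOR_MS = 5_000;

    /** Returns the configured value, or the floor (with a warning) if it is lower. */
    static long effectiveEvaluatorTickMs(long configuredMs) {
        if (configuredMs < EVALUATOR_TICK_FLOOR_MS) {
            // Stand-in for a WARN log line emitted once at startup.
            System.err.printf(
                    "WARN evaluator-tick-interval-ms=%d is below the floor; using %d%n",
                    configuredMs, EVALUATOR_TICK_FLOOR_MS);
            return EVALUATOR_TICK_FLOOR_MS;
        }
        return configuredMs;
    }
}
```

The value returned here is what the `SchedulingConfigurer` would hand to the scheduler, so a misconfigured interval degrades to the floor instead of failing startup.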
### `OutboundHttpProperties` (`cameleer.server.outbound-http.*`)

See §10.

---

## 15. Retention

Daily `@Scheduled(cron = "0 0 3 * * *")` job `AlertingRetentionJob` (advisory-lock-of-the-day pattern, same as `JarRetentionJob`):

```sql
-- make_interval turns the integer day count into an interval;
-- an integer parameter cannot be cast with ::interval directly.
DELETE FROM alert_instances
WHERE state = 'RESOLVED'
  AND resolved_at < now() - make_interval(days => :eventRetentionDays);

DELETE FROM alert_notifications
WHERE status IN ('DELIVERED','FAILED')
  AND (delivered_at IS NULL OR delivered_at < now() - make_interval(days => :notificationRetentionDays));
```

Retention values from `AlertingProperties`.

---

## 16. Observability

New metrics exposed via existing `/api/v1/prometheus`:

- `alerting_eval_duration_seconds{kind}` — histogram per condition kind
- `alerting_eval_errors_total{kind, rule_id}` — counter
- `alerting_circuit_open_total{kind}` — counter
- `alerting_rule_state{state}` — gauge (enabled / disabled / broken-reference)
- `alerting_instances_total{state, severity}` — gauge (open alerts)
- `alerting_notifications_total{status}` — counter
- `alerting_webhook_delivery_duration_seconds` — histogram

No new dashboards shipped in v1; tenants with Prometheus + Grafana can build their own. An "Alerting health" admin sub-page is a cheap future add.

### Audit

New `AuditCategory` values:
- `OUTBOUND_HTTP_TRUST_CHANGE` — webhook or connection TLS config change
- `ALERT_RULE_CHANGE` — create / update / delete rule
- `ALERT_SILENCE_CHANGE` — create / update / delete silence
- `OUTBOUND_CONNECTION_CHANGE` — admin CRUD on outbound connection

Emitted via existing `AuditService.log(...)`.

---

## 17. Security

- **Tenant + env isolation.** Every controller call runs through `@EnvPath` (resolves env → tenant via `TenantContext`). Every CH query filters by `tenant_id AND environment` per pre-existing invariant.
- **RBAC.** Enforced via Spring Security `@PreAuthorize` on each endpoint (see §11 role column).
- **Webhook URL SSRF protection.** At rule save, reject URLs resolving to private IPs (`127.0.0.0/8`, `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `::1`, `fc00::/7`) unless a deployment-level allow-listed dev flag is set.
- **HMAC signing.** Per-connection `hmac_secret` encrypted at rest; signature header sent on dispatch.
- **TLS trust.** Cross-cutting module (§10).
- **Audit.** See §16.

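The SSRF address check can be sketched with `java.net.InetAddress` alone. `PrivateAddressCheck` and its method names are hypothetical. Note two properties of this sketch: `isSiteLocalAddress()` covers the three IPv4 private ranges but also matches the deprecated IPv6 `fec0::/10` block, and the dev-flag bypass from the bullet is not modelled.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Illustrative check for the address ranges listed in the SSRF bullet. */
final class PrivateAddressCheck {

    /** True for loopback, IPv4 private ranges, and IPv6 unique-local (fc00::/7). */
    static boolean isBlocked(InetAddress addr) {
        if (addr.isLoopbackAddress() || addr.isSiteLocalAddress()) {
            return true; // 127.0.0.0/8, ::1, 10/8, 172.16/12, 192.168/16
        }
        byte[] raw = addr.getAddress();
        // fc00::/7: the top seven bits of the first byte are 1111110.
        return raw.length == 16 && (raw[0] & 0xFE) == 0xFC;
    }

    /** Resolves the host and rejects it if any resolved address is blocked. */
    static boolean isBlockedHost(String host) {
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (isBlocked(addr)) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Cannot resolve host: " + host, e);
        }
    }
}
```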
---

## 18. Testing

### Backend — unit (`*Test.java`, no Spring)

- Each `ConditionEvaluator`: synthetic inputs → expected `EvalResult`. Fire / no-fire / threshold edges / PER_EXCHANGE cursor / for-duration debounce.
- `MustacheRenderer`: context + template → expected output; malformed falls back + logs.
- `SilenceMatcher`: matcher JSONB vs instance → truth table.
- Jackson polymorphism: roundtrip each `AlertCondition` subtype.
- Claim-polling concurrency (embedded PG): two threads → no duplicates.

### Backend — integration (Testcontainers, `*IT.java`)

- `AlertingFullLifecycleIT` — end-to-end rule → fire → ack → silence → delete, history survives.
- `AlertingEnvIsolationIT` — alert in env-A invisible from env-B inbox.
- `OutboundConnectionAllowedEnvIT` — 422 on save if connection not allowed in env; 409 on narrow-while-referenced.
- `WebhookDispatchIT` (WireMock) — payload shape, HMAC signature, retry on 5xx, FAILED after max, no retry on 4xx.
- `PerformanceIT` (opt-in, not default CI) — 500 rules + 5-replica simulation.

### Frontend — component (Vitest + Testing Library)

- Rule editor wizard step navigation + validation.
- Bell polling pause on tab hide.
- Inbox row rendering by severity.
- CMD-K result-source registration.

### Frontend — E2E (Playwright if infra supports)

- Create rule → inject matching data → bell badge appears → open alert → ack → badge clears.

---

## 19. Rollout

- **No feature flag.** Alerting is dormant-by-default: zero rules → zero evaluator work → zero behavior change. Migration is additive.
- **Migration rollback.** V11 PG migration has matching down-script; CH projections are `IF NOT EXISTS`-safe and droppable without data loss.
- **Progressive adoption.** First user creates the first rule; feature organically spreads from there.
- **Documentation.** Add an admin-facing alerting guide under `docs/` describing rule shapes, template variables, webhook destinations, and silence patterns.
- **`.claude/rules/` updates.** `app-classes.md` and `core-classes.md` updated to document the new packages and any touched classes — part of the change, not a follow-up.

---

## 20. Open questions / items for writing-plans

These are not design-level decisions — they're implementation-phase tasks to be carried into planning:

1. **Alignment with existing OIDC outbound cert handling.** Before implementing `ApacheOutboundHttpClientFactory`, audit how `OidcProviderHelper` / `OidcTokenExchanger` currently validate certs. If there's a pattern in place, mirror it; if not, adopt the factory as the one-true-way and retrofit OIDC in a separate follow-up (not part of alerting v1).
2. **`hmac_secret` encryption-at-rest.** Decide between Jasypt (simplest, adds a dep) and a bespoke encrypt/decrypt over the existing Ed25519-derived key material (no new dep, ~50 LOC). Defer to plan.
3. **V1 CH migration file naming.** Confirm the convention for alerting-owned CH migrations (`V_alerting_projections.sql` vs numbered). Current `ClickHouseSchemaInitializer` runs files idempotently — naming is informational.
4. **Bell component keyboard shortcut.** Optional; align with existing CMD-K shortcut conventions.
5. **Target picker UX.** How to mix user / group / role in one multi-select with typeahead. Small UX design task.
6. **Env-delete cascade audit.** Before merge, verify the full cascade chain empirically in a PG integration test — POC safety depends on it.
7. **`<MustacheEditor />` completion engine choice.** Decide between CodeMirror 6 with a custom completion extension, Monaco, or a lighter custom-overlay-on-`<textarea>` implementation. Criteria: bundle-size cost, accessibility (ARIA combobox semantics), existing design-system integration. The variable metadata registry (`{path, type, description, sampleValue, availableForKinds[]}`) is the same regardless of engine.