cameleer-server/docs/superpowers/specs/2026-04-19-alerting-design.md
hsiegeln 18cacb33ee docs(alerting): align @JsonTypeInfo spec with shipped code
Design spec and Plan 02 described AlertCondition polymorphism as
Id.DEDUCTION, but the code that shipped in PR #140 uses Id.NAME with
property="kind" and include=EXISTING_PROPERTY. The `kind` field is
real on every subtype and the DB stores it in a separate column
(condition_kind), so reading the discriminator directly is simpler
than deduction — update the docs to match. Also add `"kind"` to the
example JSON payloads so they match on-wire reality.

OutboundAuth (Plan 01) correctly still uses Id.DEDUCTION and is
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 18:04:17 +02:00


# Alerting — Design Spec
**Date:** 2026-04-19
**Status:** Draft — awaiting user review
**Surfaces:** server (core + app), UI, admin, Gitea issues
**Related:** [backlog BL-001](../backlog.md) / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137) (managed CA bundles — deferred)
---
## 1. Summary
A first-class alerting feature inside Cameleer. Operators author rules that evaluate conditions over observability data; violations create shared, env-scoped alerts visible in an in-app inbox and optionally dispatched to external systems via admin-curated webhook connections. Lifecycle: `PENDING → FIRING → ACKNOWLEDGED → RESOLVED` with orthogonal `SILENCED`. Horizontally scalable via PostgreSQL claim-based polling. All code is confined to new `alerting/`, `outbound/`, and `http/` packages with minimal, documented touchpoints on existing stores.
### Guiding principles
- **"Good enough" baseline.** Customers with dedicated ops tooling (PagerDuty, Grafana, Opsgenie) will keep using it — alerting here serves those *without*. Resist incident-management feature creep; provide the floor, not the ceiling.
- **Confinement over cleverness.** Reads go through existing interfaces; no hooks into ingestion; no new ClickHouse tables; all new code in dedicated packages. The feature should be removable by deleting those packages and one migration.
- **Env-scoped by default, tenant-global where infrastructure.** Rules, alerts, silences live inside an environment. Outbound connections are tenant-global infrastructure admins manage, optionally restricted by env.
- **Performance is a first-class design concern**, not a v2 afterthought. Claim-polling, query coalescing, in-tick caching, per-kind circuit breaker, and CH projections are all v1.
- **No ClickHouse table changes, only projections.** Additive, idempotent (`IF NOT EXISTS`), safe to drop and rebuild.
---
## 2. Scope
### In scope (v1)
Six signal sources, expressed as sealed-type conditions:
1. **`ROUTE_METRIC`** — aggregate stats per route or app: error rate, p95/p99 latency, throughput, error count. Backed by `stats_1m_route`.
2. **`EXCHANGE_MATCH`** — per-exchange matching with two fire modes:
- `PER_EXCHANGE` — one alert per matching exchange (cursor-advanced, used for "specific failure" patterns)
- `COUNT_IN_WINDOW` — aggregate "N exchanges matched in window" threshold
3. **`AGENT_STATE`** — agent in `DEAD` / `STALE` state for ≥ N seconds. Reads in-memory `AgentRegistryService`.
4. **`DEPLOYMENT_STATE`** — deployment status is `FAILED` / `DEGRADED` for ≥ N seconds.
5. **`LOG_PATTERN`** — count of log rows matching level / logger / pattern in a window > threshold.
6. **`JVM_METRIC`** — agent-reported JVM/Camel metric (heap %, GC pressure, inflight) over threshold for a window.
**Delivery channels.** In-app inbox (derived from alerts + target-membership) and outbound HTTPS webhooks (via admin-managed outbound connections). No email. No native Slack/Teams integrations — users point webhook URLs at their own integrations.
**Sharing model.** Rules are shared within an environment; alerts are visible to any viewer of the env, but notifications route to targeted users, groups, or roles (via existing RBAC).
**Lifecycle states.** `PENDING → FIRING → ACKNOWLEDGED → RESOLVED`, with `SILENCED` as an orthogonal property resolved at notification-dispatch time (preserves audit trail).
**Rule promotion across environments** via UI prefill — no new server endpoint.
**CMD-K integration** — alerts + alert rules appear as new result sources in the existing CMD-K registry.
**Configurable evaluator cadence** (min 5 s floor), per-rule evaluation intervals, per-rule re-notify cadence.
### Out of scope (v1, not deferred)
- Custom SQL / Prometheus-style query DSL (option F).
- Email delivery channel — webhooks cover Slack / PagerDuty / Teams / OpsGenie / n8n / Zapier via ops-team-owned integrations.
- Native provider integrations (Slack, Teams, PagerDuty as first-class types).
- Incident management (merging alerts, parent/child, assignees, SLA tracking) — integrate with PagerDuty via webhook instead.
- Expression language in rules — fixed templates only.
- mTLS / client-cert auth on outbound webhooks.
- Real-time push (SSE) to the UI — 30 s polling is the v1 cadence. SSE is a clean drop-in for v2 if needed.
### Deferred to backlog
- **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)** — in-app CA bundle management UI. Deferred pending investigation of reusing the SaaS layer's existing CA handling (KISS / DRY). V1 CA trust material is filesystem-resident via deployment config, same posture as OIDC issuer URIs and Ed25519 keys.
---
## 3. Key decisions
Captured from brainstorming, in order of architectural impact.
| Decision | Chosen | Rejected | Rationale |
|---|---|---|---|
| Signal sources | 6 (route / exchange / agent / deployment / log / JVM) | SQL power-user mode | "Good enough" baseline; fixed templates cover real needs; expression languages are where observability tools go to be rewritten |
| Delivery channels | in-app + webhook | email, native integrations | Webhooks cover every target; email is deceptively expensive (deliverability, bounces, DKIM) |
| Sharing | tenant-env-shared rules; notifications target users/groups/roles | per-user "my alerts" | Ops products need single source of truth for what's broken; targets give per-person routing without duplicating rules |
| Evaluation | pull / claim-based polling | push / event-driven | Confinement — reads through existing interfaces, zero ingestion hooks; native handling of "no data" condition; 60 s latency acceptable for ops alerting |
| Horizontal scale | `FOR UPDATE SKIP LOCKED` claim pattern | advisory locks / leader election | Naturally partitions work; supports per-rule cadences; recovers from replica death; industry-standard |
| Alert lifecycle | FIRING / ACK / RESOLVED + SILENCED | minimal fire/resolve only, full incident workflow | Ack is the floor for team workflows (stop paging everyone); silences needed for ops maintenance; incident mgmt is a product category, not a feature |
| Rule shape | fixed templates, sealed-type JSONB | expression DSL, expression-first | Form-fillable; typed; additive for new kinds; consistent with no-SQL decision |
| Templating | JMustache | in-house substituter, Pebble/Freemarker | Industry standard for webhook templates (Slack, PagerDuty); logic-less (safe); small dep; familiar to ops users |
| UI placement | top-nav bell (consumer) + `/alerts` section (OPERATOR+ authoring, VIEWER+ read) | admin-only page, embedded per context, new top-level tab only | Separates consumer from authoring surfaces; rule authoring happens frequently, shouldn't be buried in admin |
| CMD-K | alerts + rules searchable | not searchable | Covers the "I saw this alert before lunch" use case; small surface via existing result-source registry |
| Outbound connections | admin-managed, tenant-global, allowed-env restriction | per-rule raw webhook URLs | Admins own infrastructure; operators author rules; rotation is atomic across N rules; reusable for future integrations |
| TLS trust | shared cross-cutting module `http/` | alerting-local trust config | Future-proofs for additional outbound HTTPS consumers; joins the existing OIDC outbound path |
| CA management UI | **deferred (BL-001)** | build in-server now | SaaS-layer CA mechanism should be investigated first for reuse |
| Env deletion | full cascade across alerting tables | partial cascade with SET NULL | POC teardown safety — zero orphaned rows |
---
## 4. Module architecture
### Package layout
```
cameleer-server-core/src/main/java/com/cameleer/server/core/
├── alerting/  (domain; pure records + interfaces)
│   ├── AlertRule
│   ├── AlertCondition (sealed)
│   │   ├── RouteMetricCondition
│   │   ├── ExchangeMatchCondition
│   │   ├── AgentStateCondition
│   │   ├── DeploymentStateCondition
│   │   ├── LogPatternCondition
│   │   └── JvmMetricCondition
│   ├── AlertSeverity / AlertState (enums)
│   ├── AlertInstance / AlertEvent
│   ├── NotificationTarget / NotificationTargetKind
│   ├── AlertSilence / SilenceMatcher
│   ├── AlertRuleRepository (interface)
│   ├── AlertInstanceRepository (interface)
│   ├── AlertSilenceRepository (interface)
│   ├── AlertNotificationRepository (interface)
│   ├── AlertReadRepository (interface)
│   ├── ConditionEvaluator<C> (sealed)
│   └── NotificationDispatcher (interface)
├── outbound/  (admin-managed outbound connections)
│   ├── OutboundConnection
│   ├── OutboundAuth (sealed: NONE, BEARER, BASIC)
│   ├── TrustMode (enum)
│   └── OutboundConnectionRepository (interface)
└── http/  (cross-cutting outbound HTTP primitive)
    ├── OutboundHttpProperties
    ├── OutboundHttpRequestContext
    └── OutboundHttpClientFactory (interface)
cameleer-server-app/src/main/java/com/cameleer/server/app/
├── alerting/
│   ├── controller/  (REST)
│   │   ├── AlertRuleController
│   │   ├── AlertController
│   │   ├── AlertSilenceController
│   │   └── AlertNotificationController
│   ├── storage/  (Postgres)
│   │   ├── PostgresAlertRuleRepository
│   │   ├── PostgresAlertInstanceRepository
│   │   ├── PostgresAlertSilenceRepository
│   │   ├── PostgresAlertNotificationRepository
│   │   └── PostgresAlertReadRepository
│   ├── eval/  (the scheduled evaluators)
│   │   ├── AlertEvaluatorJob (@Scheduled, claim-based)
│   │   ├── RouteMetricEvaluator
│   │   ├── ExchangeMatchEvaluator
│   │   ├── AgentStateEvaluator
│   │   ├── DeploymentStateEvaluator
│   │   ├── LogPatternEvaluator
│   │   ├── JvmMetricEvaluator
│   │   ├── PerKindCircuitBreaker
│   │   └── TickCache
│   ├── notify/
│   │   ├── NotificationDispatchJob (@Scheduled, claim-based)
│   │   ├── InAppInboxQuery
│   │   ├── WebhookDispatcher
│   │   ├── MustacheRenderer
│   │   └── SilenceMatcher
│   ├── dto/  (AlertRuleDto, AlertDto, ConditionDto sealed, WebhookDto, etc.)
│   ├── retention/
│   │   └── AlertingRetentionJob (daily @Scheduled)
│   └── config/
│       └── AlertingProperties (@ConfigurationProperties)
├── outbound/
│   ├── controller/
│   │   └── OutboundConnectionAdminController
│   ├── storage/
│   │   └── PostgresOutboundConnectionRepository
│   └── dto/
│       └── OutboundConnectionDto
└── http/
    ├── ApacheOutboundHttpClientFactory
    ├── SslContextBuilder
    └── config/
        └── OutboundHttpConfig (@ConfigurationProperties)
cameleer-server-app/src/main/resources/
├── db/migration/V11__alerting_and_outbound.sql  (one Flyway migration)
└── clickhouse/V_alerting_projections.sql  (one CH migration, idempotent)
ui/src/
├── pages/Alerts/
│   ├── InboxPage.tsx
│   ├── AllAlertsPage.tsx
│   ├── RulesListPage.tsx
│   ├── RuleEditor/
│   │   ├── RuleEditorWizard.tsx
│   │   ├── ScopeStep.tsx
│   │   ├── ConditionStep.tsx
│   │   ├── TriggerStep.tsx
│   │   ├── NotifyStep.tsx
│   │   └── ReviewStep.tsx
│   ├── SilencesPage.tsx
│   └── HistoryPage.tsx
├── pages/Admin/
│   └── OutboundConnectionsPage.tsx
├── components/
│   ├── NotificationBell.tsx
│   └── AlertStateChip.tsx
├── api/queries/
│   ├── alerts.ts
│   ├── alertRules.ts
│   ├── alertSilences.ts
│   └── outboundConnections.ts
└── cmdk/sources/
    ├── alerts.ts
    └── alertRules.ts
```
### Touchpoints on existing code (deliberate, minimal)
| Existing surface | Change | Scope |
|---|---|---|
| `cameleer-server-app/src/main/resources/db/migration/V11__…` | New Flyway migration | additive |
| `cameleer-server-app/src/main/resources/clickhouse/V_…_projections.sql` | New CH migration | additive, `IF NOT EXISTS` |
| `ClickHouseLogStore` | New method `long countLogs(LogSearchRequest)` (no `FINAL`) | one public method added |
| `ClickHouseSearchIndex` | New method `long countExecutionsForAlerting(AlertMatchSpec)` (no `FINAL`, no text-in-body subqueries) | one public method added |
| `SecurityConfig` | Path matchers for new endpoints | ~15 lines |
| `ui/src/router.tsx` | Route entries for `/alerts/**` and `/admin/outbound-connections` | additive |
| Top-nav layout | Insert `<NotificationBell />` | one import + one component |
| CMD-K registry | Register `alerts` + `alertRules` result sources | two file additions + one import |
| `.claude/rules/app-classes.md` + `core-classes.md` | Update class maps for the new packages | documentation |
| `com.cameleer:cameleer-common` | no changes | — |
| ingestion paths | no changes | — |
| agent protocol | no changes | — |
| ClickHouse schema (table structure) | no changes — only projections added | — |
### New dependencies
- `com.samskivert:jmustache` — logic-less Mustache templating for webhook/notification templates. ~30 KB, zero transitive deps. Added to `cameleer-server-core`.
- Apache HttpClient 5 (`org.apache.hc.client5`) — **already present** in the project; no new coordinate.
---
## 5. Data model (PostgreSQL)
One Flyway migration `V11__alerting_and_outbound.sql` creates all tables, enums, and indexes in a single transaction.
### Enum types
```sql
CREATE TYPE severity_enum AS ENUM ('CRITICAL','WARNING','INFO');
CREATE TYPE condition_kind_enum AS ENUM ('ROUTE_METRIC','EXCHANGE_MATCH','AGENT_STATE','DEPLOYMENT_STATE','LOG_PATTERN','JVM_METRIC');
CREATE TYPE alert_state_enum AS ENUM ('PENDING','FIRING','ACKNOWLEDGED','RESOLVED');
CREATE TYPE target_kind_enum AS ENUM ('USER','GROUP','ROLE');
CREATE TYPE notification_status_enum AS ENUM ('PENDING','DELIVERED','FAILED');
CREATE TYPE trust_mode_enum AS ENUM ('SYSTEM_DEFAULT','TRUST_ALL','TRUST_PATHS');
CREATE TYPE outbound_method_enum AS ENUM ('POST','PUT','PATCH');
CREATE TYPE outbound_auth_kind_enum AS ENUM ('NONE','BEARER','BASIC');
```
### Tables
#### `outbound_connections` (admin-managed)
```sql
CREATE TABLE outbound_connections (
id uuid PRIMARY KEY,
tenant_id varchar(64) NOT NULL,
name varchar(100) NOT NULL,
description text,
url text NOT NULL, -- Mustache-enabled
method outbound_method_enum NOT NULL,
default_headers jsonb NOT NULL DEFAULT '{}', -- values are Mustache templates
default_body_tmpl text, -- null = built-in default JSON envelope
tls_trust_mode trust_mode_enum NOT NULL DEFAULT 'SYSTEM_DEFAULT',
tls_ca_pem_paths jsonb NOT NULL DEFAULT '[]', -- array of paths from OutboundHttpProperties
hmac_secret text, -- Ed25519-key-derived encryption at rest
auth_kind outbound_auth_kind_enum NOT NULL DEFAULT 'NONE',
auth_config jsonb NOT NULL DEFAULT '{}', -- shape depends on auth_kind; v1 unused
allowed_environment_ids uuid[] NOT NULL DEFAULT '{}', -- [] = allowed in all envs
created_at timestamptz NOT NULL DEFAULT now(),
created_by uuid NOT NULL REFERENCES users(id),
updated_at timestamptz NOT NULL DEFAULT now(),
updated_by uuid NOT NULL REFERENCES users(id),
UNIQUE (tenant_id, name)
);
CREATE INDEX outbound_connections_tenant_idx ON outbound_connections (tenant_id);
```
#### `alert_rules`
```sql
CREATE TABLE alert_rules (
id uuid PRIMARY KEY,
environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
name varchar(200) NOT NULL,
description text,
severity severity_enum NOT NULL,
enabled boolean NOT NULL DEFAULT true,
condition_kind condition_kind_enum NOT NULL,
condition jsonb NOT NULL, -- sealed-subtype payload, Jackson polymorphic on `kind`
evaluation_interval_seconds int NOT NULL DEFAULT 60 CHECK (evaluation_interval_seconds >= 5),
for_duration_seconds int NOT NULL DEFAULT 0 CHECK (for_duration_seconds >= 0),
re_notify_minutes int NOT NULL DEFAULT 60 CHECK (re_notify_minutes >= 0),
notification_title_tmpl text NOT NULL, -- Mustache
notification_message_tmpl text NOT NULL, -- Mustache
webhooks jsonb NOT NULL DEFAULT '[]', -- [{id: uuid, outboundConnectionId, bodyOverride?, headerOverrides?}] — id assigned server-side on save, used as stable ref from alert_notifications.webhook_id
next_evaluation_at timestamptz NOT NULL DEFAULT now(),
claimed_by varchar(64),
claimed_until timestamptz,
eval_state jsonb NOT NULL DEFAULT '{}',
created_at timestamptz NOT NULL DEFAULT now(),
created_by uuid NOT NULL REFERENCES users(id),
updated_at timestamptz NOT NULL DEFAULT now(),
updated_by uuid NOT NULL REFERENCES users(id)
);
CREATE INDEX alert_rules_env_idx ON alert_rules (environment_id);
CREATE INDEX alert_rules_claim_due_idx ON alert_rules (next_evaluation_at) WHERE enabled = true;
```
#### `alert_rule_targets`
```sql
CREATE TABLE alert_rule_targets (
id uuid PRIMARY KEY,
rule_id uuid NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE,
target_kind target_kind_enum NOT NULL,
target_id varchar(128) NOT NULL,
UNIQUE (rule_id, target_kind, target_id)
);
CREATE INDEX alert_rule_targets_lookup_idx ON alert_rule_targets (target_kind, target_id);
```
#### `alert_instances`
```sql
CREATE TABLE alert_instances (
id uuid PRIMARY KEY,
rule_id uuid REFERENCES alert_rules(id) ON DELETE SET NULL,
rule_snapshot jsonb NOT NULL,
environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
state alert_state_enum NOT NULL,
severity severity_enum NOT NULL,
fired_at timestamptz NOT NULL,
acked_at timestamptz,
acked_by uuid REFERENCES users(id),
resolved_at timestamptz,
last_notified_at timestamptz,
silenced boolean NOT NULL DEFAULT false,
current_value numeric,
threshold numeric,
context jsonb NOT NULL,
title text NOT NULL,
message text NOT NULL,
target_user_ids uuid[] NOT NULL DEFAULT '{}',
target_group_ids uuid[] NOT NULL DEFAULT '{}',
target_role_names text[] NOT NULL DEFAULT '{}'
);
CREATE INDEX alert_instances_inbox_idx ON alert_instances (environment_id, state, fired_at DESC);
CREATE INDEX alert_instances_open_rule_idx ON alert_instances (rule_id, state) WHERE rule_id IS NOT NULL;
CREATE INDEX alert_instances_resolved_idx ON alert_instances (resolved_at) WHERE state = 'RESOLVED';
CREATE INDEX alert_instances_target_u_idx ON alert_instances USING GIN (target_user_ids);
CREATE INDEX alert_instances_target_g_idx ON alert_instances USING GIN (target_group_ids);
CREATE INDEX alert_instances_target_r_idx ON alert_instances USING GIN (target_role_names);
```
#### `alert_silences`
```sql
CREATE TABLE alert_silences (
id uuid PRIMARY KEY,
environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
matcher jsonb NOT NULL, -- { ruleId?, appSlug?, routeId?, agentId?, severity? }
reason text,
starts_at timestamptz NOT NULL,
ends_at timestamptz NOT NULL CHECK (ends_at > starts_at),
created_by uuid NOT NULL REFERENCES users(id),
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_silences_active_idx ON alert_silences (environment_id, ends_at);
```
#### `alert_notifications` (webhook delivery outbox)
```sql
CREATE TABLE alert_notifications (
id uuid PRIMARY KEY,
alert_instance_id uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
webhook_id uuid, -- opaque ref into rule's webhooks JSONB
outbound_connection_id uuid REFERENCES outbound_connections(id) ON DELETE SET NULL,
status notification_status_enum NOT NULL DEFAULT 'PENDING',
attempts int NOT NULL DEFAULT 0,
next_attempt_at timestamptz NOT NULL DEFAULT now(),
claimed_by varchar(64),
claimed_until timestamptz,
last_response_status int,
last_response_snippet text,
payload jsonb NOT NULL, -- snapshotted at first attempt
delivered_at timestamptz,
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_notifications_pending_idx ON alert_notifications (next_attempt_at) WHERE status = 'PENDING';
CREATE INDEX alert_notifications_instance_idx ON alert_notifications (alert_instance_id);
```
#### `alert_reads`
```sql
CREATE TABLE alert_reads (
user_id uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,
alert_instance_id uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
read_at timestamptz NOT NULL DEFAULT now(),
PRIMARY KEY (user_id, alert_instance_id)
);
```
### Cascade summary
```
environments → alert_rules (CASCADE) → alert_rule_targets (CASCADE)
environments → alert_silences (CASCADE)
environments → alert_instances (CASCADE) → alert_reads (CASCADE)
→ alert_notifications (CASCADE)
alert_rules → alert_instances (SET NULL, rule_snapshot preserves context)
users → alert_reads (CASCADE)
outbound_connections (delete): no DB-level FK exists (references live in alert_rules.webhooks JSONB); blocked by an app-level 409 check
```
**Rule deletion** preserves history (`alert_instances.rule_id = NULL`, `rule_snapshot` retains details). **Environment deletion** leaves zero alerting rows — POC-safe.
### Jackson polymorphism for conditions
```java
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonSubTypes.Type;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// The discriminator is the payload's own `kind` field (EXISTING_PROPERTY);
// visible = true keeps that field bound on the record after type resolution.
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "kind",
              include = JsonTypeInfo.As.EXISTING_PROPERTY, visible = true)
@JsonSubTypes({
    @Type(value = RouteMetricCondition.class, name = "ROUTE_METRIC"),
    @Type(value = ExchangeMatchCondition.class, name = "EXCHANGE_MATCH"),
    @Type(value = AgentStateCondition.class, name = "AGENT_STATE"),
    @Type(value = DeploymentStateCondition.class, name = "DEPLOYMENT_STATE"),
    @Type(value = LogPatternCondition.class, name = "LOG_PATTERN"),
    @Type(value = JvmMetricCondition.class, name = "JVM_METRIC"),
})
public sealed interface AlertCondition permits
        RouteMetricCondition, ExchangeMatchCondition, AgentStateCondition,
        DeploymentStateCondition, LogPatternCondition, JvmMetricCondition {
    ConditionKind kind();
}
```
Each payload carries its own `kind` field, which Jackson reads (`EXISTING_PROPERTY`) to pick the subtype and the record still exposes as `ConditionKind kind()`. Bean Validation (`@Valid`) on each record validates at the controller boundary.
Example condition payloads:
```json
// ROUTE_METRIC
{ "kind": "ROUTE_METRIC",
"scope": {"appSlug":"orders","routeId":"route-1"},
"metric": "P99_LATENCY_MS", "comparator": "GT", "threshold": 2000, "windowSeconds": 300 }
// EXCHANGE_MATCH — PER_EXCHANGE
{ "kind": "EXCHANGE_MATCH",
"scope": {"appSlug":"orders"},
"filter": {"status":"FAILED","attributes":{"type":"payment"}},
"fireMode": "PER_EXCHANGE", "perExchangeLingerSeconds": 300 }
// EXCHANGE_MATCH — COUNT_IN_WINDOW
{ "kind": "EXCHANGE_MATCH",
"scope": {"appSlug":"orders"},
"filter": {"status":"FAILED"},
"fireMode": "COUNT_IN_WINDOW", "threshold": 5, "windowSeconds": 900 }
// AGENT_STATE
{ "kind": "AGENT_STATE", "scope": {"appSlug":"orders"}, "state": "DEAD", "forSeconds": 60 }
// DEPLOYMENT_STATE
{ "kind": "DEPLOYMENT_STATE", "scope": {"appSlug":"orders"}, "states": ["FAILED","DEGRADED"] }
// LOG_PATTERN
{ "kind": "LOG_PATTERN", "scope": {"appSlug":"orders"}, "level": "ERROR",
"pattern": "TimeoutException", "threshold": 5, "windowSeconds": 900 }
// JVM_METRIC
{ "kind": "JVM_METRIC", "scope": {"appSlug":"orders"}, "metric": "heap_used_percent",
"aggregation": "MAX", "comparator": "GT", "threshold": 90, "windowSeconds": 300 }
```
### Claim-polling queries
```sql
-- Rule evaluator
UPDATE alert_rules
   SET claimed_by = :instance,
       claimed_until = now() + interval '30 seconds'
 WHERE id IN (
       SELECT id
         FROM alert_rules
        WHERE enabled = true
          AND next_evaluation_at <= now()
          AND (claimed_until IS NULL OR claimed_until < now())
        ORDER BY next_evaluation_at
        LIMIT :batch
          FOR UPDATE SKIP LOCKED
 )
RETURNING *;

-- Notification dispatcher (same pattern on alert_notifications with status='PENDING')
```
`FOR UPDATE SKIP LOCKED` is the crux: replicas never block each other.
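The eligibility rule in the subquery can be mirrored as a plain predicate, which may be handy for an in-memory repository fake in unit tests. This is a minimal sketch, not the shipped repository code; `RuleClaim`, `isClaimable`, and `claim` are hypothetical names, and the 30 s TTL is taken from the query above.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical in-memory mirror of the claim columns on alert_rules. */
record RuleClaim(boolean enabled, Instant nextEvaluationAt, String claimedBy, Instant claimedUntil) {

    /** Same predicate as the subquery's WHERE clause: enabled, due, and not under a live lease. */
    boolean isClaimable(Instant now) {
        boolean due = !nextEvaluationAt.isAfter(now);                            // next_evaluation_at <= now()
        boolean leaseFree = claimedUntil == null || claimedUntil.isBefore(now);  // claimed_until < now()
        return enabled && due && leaseFree;
    }

    /** Returns a copy leased to the given replica until now + ttl. */
    RuleClaim claim(String instanceId, Instant now, Duration ttl) {
        return new RuleClaim(enabled, nextEvaluationAt, instanceId, now.plus(ttl));
    }
}
```

Because the lease is just a timestamp, a replica that dies mid-evaluation needs no cleanup: its rules become claimable again once `claimed_until` passes.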
---
## 6. Outbound connections
### Concept
An `OutboundConnection` is a reusable, admin-managed HTTPS destination. Alert rules reference connections by ID and may override body or header templates per rule. Rotating a URL or secret updates every rule atomically.
**Tenant-global.** Slack URLs and PagerDuty keys are team infrastructure, not env-specific. Env-specific routing is achieved by creating multiple connections (`slack-prod`, `slack-dev`) and referencing the appropriate one in each env's rules.
**Allowed-env restriction.** `allowed_environment_ids` (default empty = all envs). Admin restricts a connection to specific envs via a multi-select on the connection form. UI picker filters by current env; rule save validates (422 on violation); narrowing the restriction while rules still reference it returns 409 with conflict list.
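The "empty means all" semantics and the narrowing check can be sketched as two small set operations. This is an illustration under assumptions: `AllowedEnvironments`, `permits`, and `conflicts` are hypothetical names, and the real validation would run against the repository rather than in-memory sets.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** Hypothetical helper mirroring the allowed_environment_ids column semantics. */
final class AllowedEnvironments {
    private AllowedEnvironments() {}

    /** An empty set means "allowed in all envs"; otherwise the env must be listed. */
    static boolean permits(Set<UUID> allowedEnvironmentIds, UUID environmentId) {
        return allowedEnvironmentIds.isEmpty() || allowedEnvironmentIds.contains(environmentId);
    }

    /** Envs of referencing rules that a narrowed restriction would no longer permit (the 409 case). */
    static Set<UUID> conflicts(Set<UUID> newAllowed, Set<UUID> envsOfReferencingRules) {
        if (newAllowed.isEmpty()) {
            return Set.of(); // unrestricted: nothing can conflict
        }
        Set<UUID> remaining = new HashSet<>(envsOfReferencingRules);
        remaining.removeAll(newAllowed);
        return remaining;
    }
}
```

A rule save would call `permits` and answer 422 on `false`; a connection update would call `conflicts` and answer 409 when the result is non-empty.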
**Delete semantics.** 409 if any rule references the connection. No silent cascade — admin must first remove references.
### Default body template (when rule has no override)
```json
{
"alert": { "id", "state", "firedAt", "severity", "title", "message", "link" },
"rule": { "id", "name", "description", "severity" },
"env": { "slug", "id" },
"context": { /* full Mustache context: app, route, agent, exchange, etc. */ }
}
```
"Just plug in my Slack incoming webhook URL" works without writing a template.
### HMAC signing (optional per connection)
When `hmac_secret` is set, dispatch adds `X-Cameleer-Signature: sha256=<hmac(secret, body)>` header. GitHub / Stripe pattern. Secret encrypted at rest — concrete approach (Jasypt vs bespoke over existing Ed25519-derived key material) decided in planning (see §20).
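A minimal sketch of computing that header with the JDK's `javax.crypto.Mac`. The spec does not state the digest encoding; lowercase hex is assumed here because that is what the GitHub pattern uses. `WebhookSigner` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical signer for the X-Cameleer-Signature header (hex encoding is an assumption). */
final class WebhookSigner {
    private WebhookSigner() {}

    /** Returns "sha256=" followed by the hex HMAC-SHA256 of the exact bytes sent as the body. */
    static String signatureHeader(String secret, byte[] body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return "sha256=" + HexFormat.of().formatHex(mac.doFinal(body));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    /** Receiver-side check; MessageDigest.isEqual compares in constant time. */
    static boolean verify(String secret, byte[] body, String receivedHeader) {
        byte[] expected = signatureHeader(secret, body).getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, receivedHeader.getBytes(StandardCharsets.UTF_8));
    }
}
```

The signature has to be computed over the rendered body bytes exactly as transmitted, so signing would happen after template rendering and after the payload snapshot is taken.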
---
## 7. Rule evaluation
### Scheduler
```java
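import java.util.List;
import org.springframework.scheduling.annotation.SchedulingConfigurer;
import org.springframework.scheduling.config.ScheduledTaskRegistrar;
import org.springframework.stereotype.Component;

import static java.util.stream.Collectors.groupingBy;

@Component
public class AlertEvaluatorJob implements SchedulingConfigurer {
    // Collaborators (properties, ruleRepo, circuitBreaker, instanceId) are injected; declarations elided.
    // Interval wired via AlertingProperties.evaluatorTickIntervalMs (floor 5000)
    @Override
    public void configureTasks(ScheduledTaskRegistrar registrar) {
        registrar.addFixedDelayTask(this::tick, properties.effectiveEvaluatorTickIntervalMs());
    }

    void tick() {
        // Claim a batch of due rules for this replica.
        List<AlertRule> claimed = ruleRepo.claimDueRules(instanceId, properties.batchSize());
        // Group by (kind, window) so each group can share one coalesced query.
        var groups = claimed.stream().collect(groupingBy(r -> new GroupKey(r.conditionKind(), windowSeconds(r))));
        for (var entry : groups.entrySet()) {
            // Skip kinds whose breaker is open; their rules are pushed to a later tick.
            if (circuitBreaker.isOpen(entry.getKey().kind())) { rescheduleBatch(entry.getValue()); continue; }
            try {
                coalescedEvaluate(entry.getKey(), entry.getValue());
            } catch (Exception e) {
                // A failing group counts against its kind's breaker and is retried later.
                circuitBreaker.recordFailure(entry.getKey().kind());
                rescheduleBatch(entry.getValue());
            }
        }
    }
}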
```
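`effectiveEvaluatorTickIntervalMs()` is not defined in this spec. Given the 5000 ms floor noted in the comment, one plausible reading is a simple clamp, sketched below; `AlertingTickProperties` is a hypothetical stand-in for the relevant slice of `AlertingProperties`.

```java
/** Hypothetical slice of AlertingProperties; only the tick-interval clamp is shown. */
record AlertingTickProperties(long evaluatorTickIntervalMs) {
    /** Floor from the spec: the evaluator never ticks more often than every 5 s. */
    static final long FLOOR_MS = 5_000L;

    /** Configured interval, raised to the floor when it is set too low. */
    long effectiveEvaluatorTickIntervalMs() {
        return Math.max(FLOOR_MS, evaluatorTickIntervalMs);
    }
}
```

Clamping rather than rejecting keeps a misconfigured deployment running at the safest permitted cadence; failing fast at startup would be an equally defensible choice.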
### Per-condition evaluators
| Kind | Read source | Query shape |
|---|---|---|
| `ROUTE_METRIC` | `SearchService.statsForRoute` / `statsForApp` | Stats over window; comparator vs threshold |
| `EXCHANGE_MATCH` (PER_EXCHANGE) | `SearchService.search(SearchRequest)` | `timestamp > eval_state.lastExchangeTs AND filter` → fire one alert per match, advance cursor |
| `EXCHANGE_MATCH` (COUNT_IN_WINDOW) | `ClickHouseSearchIndex.countExecutionsForAlerting(spec)` | Count in window vs threshold |
| `AGENT_STATE` | `AgentRegistryService.listByEnvironment` | Any agent matches scope + state |
| `DEPLOYMENT_STATE` | `DeploymentRepository.findLatestByAppAndEnv` | Status in target set |
| `LOG_PATTERN` | `ClickHouseLogStore.countLogs(LogSearchRequest)` | Count in window vs threshold |
| `JVM_METRIC` | `MetricsQueryStore` | Value aggregated over the window (aggregation per rule) vs threshold |
### State machine
```
(cond holds for <forDuration)
PENDING ──────▶ keep pendingSince
▲ │
│ ▼ (cond holds ≥ forDuration)
│ FIRING ◀── (re-eval matches; update last_notified_at cadence)
│ / \
│ / \
│ ack/ \resolve
│ ▼ ▼
│ ACKNOWLEDGED RESOLVED ── (cond false again → cycle can restart)
```
`PER_EXCHANGE` mode: each match is its own brief FIRING instance that auto-resolves after `perExchangeLingerSeconds` (default 300 s). History retains it for 90 d.
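The evaluation-driven transitions in the diagram can be read as a pure function of the current state and whether the condition holds. The sketch below is one interpretation, with assumptions called out: a `PENDING` instance whose condition clears is discarded, a condition that clears resolves both `FIRING` and `ACKNOWLEDGED` instances, and a zero `forDuration` fires immediately. Manual ack is a user action and is not modelled. `AlertStateSketch` and `AlertTransitions` are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

/** Re-declared locally so the sketch compiles on its own. */
enum AlertStateSketch { PENDING, FIRING, ACKNOWLEDGED, RESOLVED }

final class AlertTransitions {
    private AlertTransitions() {}

    /**
     * State after one evaluation. `current` is null when no instance is open;
     * an empty result means no open instance remains.
     */
    static Optional<AlertStateSketch> next(AlertStateSketch current, boolean conditionHolds,
                                           Instant pendingSince, Instant now, Duration forDuration) {
        if (!conditionHolds) {
            // Assumption: open instances resolve, a PENDING one is simply dropped.
            if (current == AlertStateSketch.FIRING || current == AlertStateSketch.ACKNOWLEDGED) {
                return Optional.of(AlertStateSketch.RESOLVED);
            }
            return Optional.empty();
        }
        if (current == null || current == AlertStateSketch.RESOLVED) {
            // A new cycle starts; with no hold time it fires straight away (assumption).
            return Optional.of(forDuration.isZero() ? AlertStateSketch.FIRING : AlertStateSketch.PENDING);
        }
        if (current == AlertStateSketch.PENDING) {
            boolean heldLongEnough = !now.isBefore(pendingSince.plus(forDuration));
            return Optional.of(heldLongEnough ? AlertStateSketch.FIRING : AlertStateSketch.PENDING);
        }
        return Optional.of(current); // FIRING and ACKNOWLEDGED stay put while the condition holds
    }
}
```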
### Performance optimizations (v1)
1. **Four ClickHouse projections** (new CH migration, idempotent):
```sql
ALTER TABLE executions ADD PROJECTION IF NOT EXISTS alerting_app_status
(SELECT * ORDER BY (tenant_id, environment, application_id, status, start_time));
ALTER TABLE executions ADD PROJECTION IF NOT EXISTS alerting_route_status
(SELECT * ORDER BY (tenant_id, environment, route_id, status, start_time));
ALTER TABLE logs ADD PROJECTION IF NOT EXISTS alerting_app_level
(SELECT * ORDER BY (tenant_id, environment, application, level, timestamp));
ALTER TABLE agent_metrics ADD PROJECTION IF NOT EXISTS alerting_instance_metric
(SELECT * ORDER BY (tenant_id, environment, instance_id, metric_name, collected_at));
```
`stats_1m_route`'s existing ORDER BY already aligns with alerting access patterns; no projection needed.
2. **Drop `FINAL` for alerting counts.** New methods `ClickHouseLogStore.countLogs(...)` and `ClickHouseSearchIndex.countExecutionsForAlerting(...)` skip `FINAL` — alerting tolerates brief duplicate-row over-count (alert fires briefly, self-resolves on next tick after merge). Existing UI-facing `count()` path unchanged.
3. **Per-tick query coalescing.** Rules of the same kind + window share one aggregate query per tick.
4. **In-tick cache.** `Map<QueryKey, Long>` discarded at tick end. Two rules hitting the same `(app, route, window, metric)` produce one CH call.
5. **Per-kind circuit breaker.** 5 failures in 30 s → open for 60 s. Metric `alerting_circuit_open_total{kind}`. UI surfaces an admin banner when open.
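The breaker rule in item 5 (5 failures in 30 s opens the kind for 60 s) can be sketched with a sliding window of failure timestamps per kind. This is an illustration, not the shipped `PerKindCircuitBreaker`: the class name is hypothetical, kinds are plain strings, and the clock is passed in explicitly so the behavior is testable, whereas the scheduler code above calls `isOpen(kind)` and `recordFailure(kind)` without one.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sliding-window breaker keyed by condition kind. */
final class PerKindCircuitBreakerSketch {
    private final int failureThreshold;
    private final Duration window;
    private final Duration cooldown;
    private final Map<String, Deque<Instant>> failures = new HashMap<>();
    private final Map<String, Instant> openUntil = new HashMap<>();

    PerKindCircuitBreakerSketch(int failureThreshold, Duration window, Duration cooldown) {
        this.failureThreshold = failureThreshold;
        this.window = window;
        this.cooldown = cooldown;
    }

    synchronized void recordFailure(String kind, Instant now) {
        Deque<Instant> recent = failures.computeIfAbsent(kind, k -> new ArrayDeque<>());
        recent.addLast(now);
        // Drop failures that have fallen out of the sliding window.
        Instant cutoff = now.minus(window);
        while (!recent.isEmpty() && recent.peekFirst().isBefore(cutoff)) {
            recent.removeFirst();
        }
        if (recent.size() >= failureThreshold) {
            openUntil.put(kind, now.plus(cooldown)); // trip: skip this kind until the cooldown ends
            recent.clear();
        }
    }

    synchronized boolean isOpen(String kind, Instant now) {
        Instant until = openUntil.get(kind);
        return until != null && now.isBefore(until);
    }
}
```

Keying by kind means a ClickHouse problem that only affects log queries pauses `LOG_PATTERN` rules while the other kinds keep evaluating.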
### Silence matching
At notification-dispatch time (not evaluation time):
```sql
SELECT 1 FROM alert_silences
WHERE environment_id = :env
AND now() BETWEEN starts_at AND ends_at
AND matcher_matches(matcher, :instanceContext)
LIMIT 1;
```
If any match → `alert_instances.silenced = true`, no webhook dispatch, no re-notification. Inbox still shows the instance with a silenced pill — audit trail preserved.
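`matcher_matches` is not specified further in this spec. A natural reading of the matcher shape `{ ruleId?, appSlug?, routeId?, agentId?, severity? }` is that every field present must equal the instance's value and absent fields match anything; the sketch below assumes exactly that. `InstanceContext` and `SilenceMatcherSketch` are hypothetical names.

```java
/** Hypothetical attributes of an alert instance that a silence can match on. */
record InstanceContext(String ruleId, String appSlug, String routeId, String agentId, String severity) {}

/** Mirrors the matcher JSONB: every field is optional, null means "match anything". */
record SilenceMatcherSketch(String ruleId, String appSlug, String routeId, String agentId, String severity) {

    /** True when all non-null matcher fields equal the corresponding instance field (AND semantics). */
    boolean matches(InstanceContext ctx) {
        return fieldMatches(ruleId, ctx.ruleId())
            && fieldMatches(appSlug, ctx.appSlug())
            && fieldMatches(routeId, ctx.routeId())
            && fieldMatches(agentId, ctx.agentId())
            && fieldMatches(severity, ctx.severity());
    }

    private static boolean fieldMatches(String wanted, String actual) {
        return wanted == null || wanted.equals(actual);
    }
}
```

Under this reading an all-null matcher silences every alert in the environment for the silence window, which the UI may want to confirm explicitly.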
### Failure modes
| Failure | Behavior |
|---|---|
| Read interface throws | Log WARN, increment `alerting_eval_errors_total{kind, rule_id}`, reschedule rule, release claim |
| 10 consecutive failures for a rule | Mark `eval_state.disabledReason`, surface in UI |
| Template render error | Fall back to literal `{{var}}` in output, log WARN, still dispatch |
| Slow evaluator | Claim lapses after the 30 s TTL, so another replica may re-claim the rule; investigate if sustained |
| Rule deleted mid-eval | FK cascade waits on the row lock — effectively serialized |
| Env deleted mid-eval | FK cascade waits — effectively serialized |
---
## 8. Notification dispatch
### In-app inbox — derived, not materialized
```sql
SELECT ai.*
  FROM alert_instances ai
 WHERE ai.environment_id = :env
   AND ai.state IN ('FIRING','ACKNOWLEDGED','RESOLVED')
   AND (
        :me = ANY(ai.target_user_ids)
     OR ai.target_group_ids && :my_group_ids
     OR ai.target_role_names && :my_role_names
   )
 ORDER BY ai.fired_at DESC
 LIMIT 100;
```
`:my_group_ids` and `:my_role_names` resolved once per request from `RbacService`.
**Bell badge count:** same filter + `state IN ('FIRING','ACKNOWLEDGED')` + `NOT EXISTS (SELECT 1 FROM alert_reads ar WHERE ar.user_id = :me AND ar.alert_instance_id = ai.id)`, count-only. Server-side 5 s memoization per `(env, user)` keeps bell polling cheap.
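The 5 s memoization could be as small as a TTL map keyed by `(env, user)`. This is a sketch under assumptions: `UnreadCountMemo` is a hypothetical name, IDs are plain strings, the clock is passed in for testability, and expired entries are never evicted, which a real implementation would want to handle.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical per-(env, user) memo for the bell badge count. */
final class UnreadCountMemo {
    private record Key(String envId, String userId) {}
    private record Entry(long count, Instant expiresAt) {}

    private final Map<Key, Entry> entries = new ConcurrentHashMap<>();
    private final Duration ttl;

    UnreadCountMemo(Duration ttl) {
        this.ttl = ttl;
    }

    /** Returns the cached count while it is fresh, otherwise runs the loader (the count query). */
    long get(String envId, String userId, Instant now, LongSupplier loader) {
        Entry entry = entries.compute(new Key(envId, userId), (key, old) ->
                old != null && now.isBefore(old.expiresAt())
                        ? old
                        : new Entry(loader.getAsLong(), now.plus(ttl)));
        return entry.count();
    }
}
```

With 30 s UI polling a single tab rarely hits the cache; the memo mainly helps when one user has several tabs open or when requests burst after a page load.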
### Webhook outbox — claim-based
`NotificationDispatchJob` claims due notifications (`status='PENDING' AND next_attempt_at <= now()`) and dispatches. HTTP client from shared `OutboundHttpClientFactory` with TLS config from the referenced outbound connection.
- **2xx** → `DELIVERED`
- **4xx** → `FAILED` immediately (retry won't help); log at WARN
- **5xx / network / timeout** → retry with exponential backoff 30 s → 2 m → 5 m, then `FAILED`
- Manual retry: `POST /alerts/notifications/{id}/retry` (OPERATOR+)
Payload rendered at **first** dispatch attempt, snapshotted in `alert_notifications.payload`. Retries replay the snapshot — template edits after fire don't affect in-flight notifications.
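The status handling above can be expressed as a small pure policy. The sketch reads "30 s → 2 m → 5 m, then `FAILED`" as three retries after the initial attempt, so at most four attempts in total; that reading, the treatment of any non-2xx, non-4xx status like a 5xx, and the names `DeliveryOutcome` and `WebhookRetryPolicy` are assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;

enum DeliveryOutcome { DELIVERED, FAILED, RETRY }

/** Hypothetical retry policy for the webhook outbox. */
final class WebhookRetryPolicy {
    private WebhookRetryPolicy() {}

    /** Delay before retry 1, 2 and 3 respectively. */
    static final List<Duration> BACKOFF =
            List.of(Duration.ofSeconds(30), Duration.ofMinutes(2), Duration.ofMinutes(5));

    /**
     * Classifies one attempt. `statusCode` is null for network errors and timeouts;
     * `attempts` counts attempts made so far, including this one.
     */
    static DeliveryOutcome classify(Integer statusCode, int attempts) {
        if (statusCode != null && statusCode >= 200 && statusCode < 300) {
            return DeliveryOutcome.DELIVERED;
        }
        if (statusCode != null && statusCode >= 400 && statusCode < 500) {
            return DeliveryOutcome.FAILED; // a retry will not fix a client error
        }
        return attempts <= BACKOFF.size() ? DeliveryOutcome.RETRY : DeliveryOutcome.FAILED;
    }

    /** Delay to add to next_attempt_at after the given attempt, if a retry remains. */
    static Optional<Duration> nextDelay(int attempts) {
        return attempts >= 1 && attempts <= BACKOFF.size()
                ? Optional.of(BACKOFF.get(attempts - 1))
                : Optional.empty();
    }
}
```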
### Template rendering
JMustache (`com.samskivert:jmustache`). Logic-less, industry-standard syntax.
**Rendered surfaces:** URL (query-string interpolation), header values, body, and separately `alert_instances.title` / `message` rendered once at fire.
**Context map** (dot-notation + camelCase leaves):
```
env.slug env.id
rule.id rule.name rule.severity rule.description
alert.id alert.state alert.firedAt alert.resolvedAt
alert.ackedBy alert.link alert.currentValue alert.threshold
alert.comparator alert.window
app.slug app.id app.displayName
route.id
agent.id agent.name agent.state
exchange.id exchange.status exchange.link
deployment.id deployment.status
log.logger log.level log.message
metric.name metric.value
```
**Error handling.** Missing variable renders as `{{var.name}}` literal + WARN log. Malformed template falls back to built-in default + WARN. Never drop a notification due to template error.
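To make the "missing variable renders as literal" rule concrete, here is a deliberately tiny stand-in renderer. It is not JMustache and does not attempt sections, partials, or HTML escaping; it only shows the intended fallback for plain `{{path}}` tags. `LenientTemplateSketch` and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LenientTemplateSketch {
    // Matches simple variable tags such as {{alert.firedAt}} or {{ rule.name }}.
    private static final Pattern VAR = Pattern.compile("\\{\\{\\s*([A-Za-z0-9_.]+)\\s*\\}\\}");

    /** Substitutes known variables and leaves unknown ones in the output as {{path}}. */
    static String render(String template, Map<String, Object> context) {
        Matcher matcher = VAR.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String path = matcher.group(1);
            Object value = lookup(context, path);
            // A real renderer would also log at WARN when value is null.
            String replacement = value != null ? value.toString() : "{{" + path + "}}";
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    /** Walks a dot-notation path through nested maps; returns null if any segment is missing. */
    static Object lookup(Map<String, Object> context, String path) {
        Object current = context;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map<?, ?> map)) return null;
            current = map.get(part);
        }
        return current;
    }
}
```

With a context holding only `env.slug` and `rule.name`, a template that also references `{{app.slug}}` keeps that tag verbatim in the output instead of failing the notification.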
**"Test render" endpoint:** `POST /alerts/rules/{id}/render-preview` — drives rule editor's Preview button.
---
## 9. Rule promotion across environments
**UX.** Rule list row → **Environments ▾** menu of other envs in the tenant → open rule editor pre-populated with source rule's payload, target env selected. Banner: *"Promoting `<name>` from `<src>` → `<dst>`. Review and adjust, then save."* Save → normal `POST /api/v1/environments/{dstEnvSlug}/alerts/rules`. Source unaffected (it's a copy).
**Pure UI flow — no new server endpoint.** Re-uses the existing GET (to fetch) and POST (to create) paths.
**Prefill-time validation (client-side; warnings are non-blocking except where a row says otherwise):**
| Field | Check | Behavior |
|---|---|---|
| `scope.appSlug` | Does app exist in target env? | ⚠ warn + picker from target env's apps |
| `scope.agentId` | Per-env; can't transfer | Clear field, keep appSlug, note |
| `scope.routeId` | Per-app logical ID, stable | ✓ pass through |
| `targets[]` | Tenant-scoped | ✓ transfer as-is |
| `webhooks[].outboundConnectionId` | Target env allowed by connection? | ⚠ warn if not; disable save until resolved |
Bulk promotion (select multiple → promote all) deferred until usage patterns justify it.
---
## 10. Cross-cutting: outbound HTTP & TLS trust
Shared module — not inside `alerting/`.
### `OutboundHttpClientFactory`
```java
public interface OutboundHttpClientFactory {
    CloseableHttpClient clientFor(OutboundHttpRequestContext context);
}

public record OutboundHttpRequestContext(
    TrustMode trustMode,              // SYSTEM_DEFAULT | TRUST_ALL | TRUST_PATHS
    List<String> trustedCaPemPaths,
    Duration connectTimeout,
    Duration readTimeout
) {}
```
Implementation (`ApacheOutboundHttpClientFactory`) memoizes one `CloseableHttpClient` per unique effective config — not one per call.
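The memoization could look roughly like this. It is a sketch under assumptions: `TrustMode` and `OutboundHttpRequestContext` are redeclared locally to keep the block self-contained, the generic `C` stands in for `CloseableHttpClient`, and `MemoizingClientFactory` is a hypothetical name. It relies on record equality to define "unique effective config".

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

enum TrustMode { SYSTEM_DEFAULT, TRUST_ALL, TRUST_PATHS }

/** Local copy of the context record; value-based equals/hashCode make it usable as a cache key. */
record OutboundHttpRequestContext(
        TrustMode trustMode,
        List<String> trustedCaPemPaths,
        Duration connectTimeout,
        Duration readTimeout) {
    OutboundHttpRequestContext {
        // Defensive copy so a caller mutating its list cannot change the key after insertion.
        trustedCaPemPaths = List.copyOf(trustedCaPemPaths);
    }
}

/** Builds at most one client per distinct context; C stands in for CloseableHttpClient. */
final class MemoizingClientFactory<C> {
    private final Map<OutboundHttpRequestContext, C> clients = new ConcurrentHashMap<>();
    private final Function<OutboundHttpRequestContext, C> builder;

    MemoizingClientFactory(Function<OutboundHttpRequestContext, C> builder) {
        this.builder = builder;
    }

    C clientFor(OutboundHttpRequestContext context) {
        // computeIfAbsent runs the builder at most once per key, even under concurrent calls.
        return clients.computeIfAbsent(context, builder);
    }
}
```

Two connections that resolve to the same trust mode, CA paths, and timeouts share one client and therefore one connection pool.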
### System config (`cameleer.server.outbound-http.*`)
```yaml
cameleer:
  server:
    outbound-http:
      trust-all: false                     # global kill-switch; WARN logged if true
      trusted-ca-pem-paths:                # additional roots layered on JVM default
        - /etc/cameleer/certs/corporate-root.pem
        - /etc/cameleer/certs/acme-internal.pem
      default-connect-timeout-ms: 2000
      default-read-timeout-ms: 5000
      proxy-url:                           # optional; null = no proxy
      proxy-username:
      proxy-password:
On startup: if `trust-all=true`, log red WARN (not suitable for production). If `trusted-ca-pem-paths` has entries, verify each path exists; fail-fast on missing files.
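A minimal sketch of that startup check, assuming the properties have already been bound to a boolean and a list of strings. `OutboundHttpStartupCheck` and the logger name are hypothetical; the real wiring (for example a Spring lifecycle hook) is not shown.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class OutboundHttpStartupCheck {
    private static final System.Logger LOG = System.getLogger("outbound-http");

    /** Returns the configured CA paths that do not point at a regular file. */
    static List<String> missingCaPaths(List<String> trustedCaPemPaths) {
        return trustedCaPemPaths.stream()
            .filter(p -> !Files.isRegularFile(Path.of(p)))
            .toList();
    }

    /** Logs a warning for trust-all and fails fast when any configured CA file is missing. */
    static void validate(boolean trustAll, List<String> trustedCaPemPaths) {
        if (trustAll) {
            LOG.log(System.Logger.Level.WARNING,
                "outbound-http.trust-all=true: TLS verification is disabled; not suitable for production");
        }
        List<String> missing = missingCaPaths(trustedCaPemPaths);
        if (!missing.isEmpty()) {
            throw new IllegalStateException("Missing trusted CA PEM files: " + missing);
        }
    }
}
```

Reporting every missing path in one exception, rather than stopping at the first, saves a restart cycle per typo.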
### Per-connection overrides
Each `OutboundConnection` carries `tls_trust_mode` + `tls_ca_pem_paths`. UI surfaces a dropdown: **System default (validated)** / **Trust custom CAs (from server config)** / **Trust all (insecure — testing only)**. Amber warning when *Trust all* selected. Audit logged (`AuditCategory.OUTBOUND_HTTP_TRUST_CHANGE`).
### Deferred
See **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)**:
- In-app CA bundle upload / admin management
- SaaS-layer CA reuse investigation (do first)
---
## 11. API surface
All env-scoped routes under `/api/v1/environments/{envSlug}/alerts/...` via existing `@EnvPath` resolver.
### Alerting — rules
| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts/rules` | VIEWER+ |
| `POST` | `/alerts/rules` | OPERATOR+ |
| `GET` | `/alerts/rules/{id}` | VIEWER+ |
| `PUT` | `/alerts/rules/{id}` | OPERATOR+ |
| `DELETE` | `/alerts/rules/{id}` | OPERATOR+ |
| `POST` | `/alerts/rules/{id}/enable` · `/disable` | OPERATOR+ |
| `POST` | `/alerts/rules/{id}/render-preview` | OPERATOR+ |
| `POST` | `/alerts/rules/{id}/test-evaluate` | OPERATOR+ |
### Alerting — instances
| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts` | VIEWER+ |
| `GET` | `/alerts/unread-count` | VIEWER+ |
| `GET` | `/alerts/{id}` | VIEWER+ |
| `POST` | `/alerts/{id}/ack` | VIEWER+ (if targeted) / OPERATOR+ |
| `POST` | `/alerts/{id}/read` | VIEWER+ (self) |
| `POST` | `/alerts/bulk-read` | VIEWER+ (self) |
### Alerting — silences
| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts/silences` | VIEWER+ |
| `POST` | `/alerts/silences` | OPERATOR+ |
| `PUT` | `/alerts/silences/{id}` | OPERATOR+ |
| `DELETE` | `/alerts/silences/{id}` | OPERATOR+ |
### Alerting — notifications
| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts/{id}/notifications` | VIEWER+ |
| `POST` | `/alerts/notifications/{id}/retry` | OPERATOR+ |
### Outbound connections (admin)
| Method | Path | Role |
|---|---|---|
| `GET` | `/api/v1/admin/outbound-connections` | ADMIN / OPERATOR (read-only) |
| `POST` | `/api/v1/admin/outbound-connections` | ADMIN |
| `GET` | `/api/v1/admin/outbound-connections/{id}` | ADMIN / OPERATOR (read-only) |
| `PUT` | `/api/v1/admin/outbound-connections/{id}` | ADMIN (409 if narrowing breaks references) |
| `DELETE` | `/api/v1/admin/outbound-connections/{id}` | ADMIN (409 if referenced) |
| `POST` | `/api/v1/admin/outbound-connections/{id}/test` | ADMIN |
| `GET` | `/api/v1/admin/outbound-connections/{id}/usage` | ADMIN / OPERATOR |
### OpenAPI regen
Per `CLAUDE.md` convention: after controller/DTO changes, run `cd ui && npm run generate-api:live` (backend on :8081) to regenerate `ui/src/api/schema.d.ts`. Commit regen alongside controller change.
---
## 12. CMD-K integration
Two new result sources registered in the existing UI registry (`ui/src/cmdk/sources/`):
- **Alerts** — queries `/alerts?q=...&limit=5` (server-side fulltext against title / message / rule_snapshot); results show severity icon + state chip; deep-link to `/alerts/inbox/{id}`.
- **Alert Rules** — queries `/alerts/rules?q=...&limit=5`; deep-link to `/alerts/rules/{id}`.
No new registry machinery — uses the existing extension point.
---
## 13. UI
### Routes
```
/alerts
├── /inbox           (default landing)
├── /all
├── /rules
│   ├── /new
│   └── /{id}        (edit; accepts ?promoteFrom=<src>&ruleId=<id>)
├── /silences
└── /history

/admin/outbound-connections
├── /
├── /new
└── /{id}
```
### Top-nav
Insert `<NotificationBell />` between env selector and user menu. Badge severity = `max(severities of unread targeting me)` (CRITICAL → `var(--error)`, WARNING → `var(--amber)`, INFO → `var(--muted)`). Dropdown shows 5 most-recent unread with inline ack button + "See all".
### Alerts section
New sidebar/top-nav entry visible to `VIEWER+`. Authoring actions (`POST /rules`, silence create, etc.) gated to `OPERATOR+`.
### Rule editor — 5-step wizard
1. **Scope** — radio (env-wide / app / route / agent) + pickers from env catalog (existing endpoints).
2. **Condition** — radio (6 kinds) + kind-specific form.
3. **Trigger** — threshold + comparator + window + for-duration + evaluation interval + severity; inline *Test evaluate* button.
4. **Notify** — title + message templates with *Preview* button; targets multi-select (users / groups / roles with typeahead); outbound connections multi-select filtered by current env + `allowed_environment_ids`.
5. **Review** — summary card, enabled toggle, save.
### Template editor — Mustache with variable auto-complete
Every Mustache template-editable field — notification title, notification message, webhook URL, webhook header values, webhook body — uses a shared `<MustacheEditor />` component with **variable auto-complete**. Users never have to guess what context variables are available.
**Behavior.**
- Typing `{{` opens a dropdown of available variables at the caret position.
- Each suggestion shows the variable path (`alert.firedAt`), its type (`Instant`), a one-line description, and a sample rendered value from the canned context.
- Filtering narrows the list as the user keeps typing (`{{ale…` → filters to `alert.*`).
- `Enter` / `Tab` inserts the path and closes `}}` automatically.
- Arrow keys + `Esc` follow standard combobox semantics (ARIA-conformant).
**Context-aware filtering.** The available variables depend on the rule's condition kind and scope. The editor is aware of both:
- Always shown: `env.*`, `rule.*`, `alert.*`
- `ROUTE_METRIC` with `route.id` set: adds `route.id`, `app.*`
- `EXCHANGE_MATCH`: adds `exchange.*`, `app.*`, `route.id` (if scoped)
- `AGENT_STATE`: adds `agent.*`, `app.*`
- `DEPLOYMENT_STATE`: adds `deployment.*`, `app.*`
- `LOG_PATTERN`: adds `log.*`, `app.*`
- `JVM_METRIC`: adds `metric.*`, `agent.*`, `app.*`
Variables that *might not* populate (e.g., `alert.resolvedAt` while state is FIRING) are shown with a grey "may be null" badge — users still see them so they can defensively template.
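The kind-to-variable mapping above is language-independent; it is sketched here in Java to match the other code in this spec, while the editor would carry an equivalent table. `TemplateVariableScope` is a hypothetical name, and the handling of `ROUTE_METRIC` without a route (only the always-on roots) is one reading of the list, not a confirmed rule.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

enum ConditionKind { ROUTE_METRIC, EXCHANGE_MATCH, AGENT_STATE, DEPLOYMENT_STATE, LOG_PATTERN, JVM_METRIC }

final class TemplateVariableScope {
    private static final Set<String> ALWAYS = Set.of("env", "rule", "alert");

    // Root namespaces added per condition kind, independent of route scoping.
    private static final Map<ConditionKind, Set<String>> BY_KIND = Map.of(
        ConditionKind.EXCHANGE_MATCH, Set.of("exchange", "app"),
        ConditionKind.AGENT_STATE, Set.of("agent", "app"),
        ConditionKind.DEPLOYMENT_STATE, Set.of("deployment", "app"),
        ConditionKind.LOG_PATTERN, Set.of("log", "app"),
        ConditionKind.JVM_METRIC, Set.of("metric", "agent", "app"));

    /** Root namespaces (env, alert, ...) the editor should offer for this rule. */
    static Set<String> allowedRoots(ConditionKind kind, boolean routeScoped) {
        Set<String> roots = new TreeSet<>(ALWAYS);
        roots.addAll(BY_KIND.getOrDefault(kind, Set.of()));
        if (routeScoped && kind == ConditionKind.ROUTE_METRIC) {
            roots.add("route");
            roots.add("app");
        }
        if (routeScoped && kind == ConditionKind.EXCHANGE_MATCH) {
            roots.add("route");
        }
        return roots;
    }

    /** True if a dotted path such as "exchange.id" is available for this rule. */
    static boolean inScope(String path, ConditionKind kind, boolean routeScoped) {
        int dot = path.indexOf('.');
        String root = dot >= 0 ? path.substring(0, dot) : path;
        return allowedRoots(kind, routeScoped).contains(root);
    }
}
```

The same predicate can drive both the completion list and the amber "out of scope" underline, so the two never disagree.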
**Syntax checks inline.**
- Unclosed `{{` / unmatched `}}` flagged with a red underline + hover hint.
- Reference to an out-of-scope variable (e.g., `{{exchange.id}}` in a ROUTE_METRIC rule) flagged with an amber underline + hint ("not available for this rule kind — will render as literal").
- Checks run client-side on every keystroke (debounced); server-side render preview is still authoritative (§8).
**Shared implementation.** Same `<MustacheEditor />` component is used in:
- Rule editor — Notify step (title, message)
- Rule editor — Webhook overrides (body override, header value overrides; URL not editable per rule, it's the connection's)
- Admin **Outbound Connections** editor — default body template, default header values, URL (URL gets a reduced context: only `env.*` since a connection URL is rule-agnostic)
- *Test render* inline preview — rendered output updates live as user types
**Completion engine.** Specific library choice (CodeMirror 6 with a custom completion extension vs Monaco vs a lighter custom overlay on `<textarea>`) is deferred to planning — see §20.
### Silences, History, Rules list, OutboundConnectionAdminPage
Structure described in design presentation; no new design-system components required. Reuses `Select`, `Tabs`, `Toggle`, `Button`, `Label`, `InfiniteScrollArea`, `PageLoader`, `Badge` from `@cameleer/design-system`.
### Real-time behavior
- Bell: `/alerts/unread-count` polled every 30 s; paused when tab hidden (Page Visibility API).
- Inbox view: `/alerts` polled every 30 s when focused.
- No SSE in v1. SSE is a clean future add under `/alerts/stream` with no schema changes.
### Accessibility
Keyboard navigation; severity conveyed via icon + text + color (not color alone); ARIA live region on inbox for new-alert announcement; bell component has descriptive `aria-label`.
### Styling
All colors via `@cameleer/design-system` CSS variables (`var(--error)`, `var(--amber)`, `var(--muted)`, `var(--success)`). No hard-coded hex.
---
## 14. Configuration
### `AlertingProperties` (`cameleer.server.alerting.*`)
```yaml
cameleer:
  server:
    alerting:
      evaluator-tick-interval-ms: 5000      # floor: 5000 (clamped at startup with WARN if lower)
      evaluator-batch-size: 20
      claim-ttl-seconds: 30
      notification-tick-interval-ms: 5000
      notification-batch-size: 50
      in-tick-cache-enabled: true
      circuit-breaker-fail-threshold: 5
      circuit-breaker-window-seconds: 30
      circuit-breaker-cooldown-seconds: 60
      event-retention-days: 90
      notification-retention-days: 30
      webhook-timeout-ms: 5000
      webhook-max-attempts: 3
```
Env-var overridable (`CAMELEER_SERVER_ALERTING_EVALUATOR_TICK_INTERVAL_MS=...`). Wired via `SchedulingConfigurer` (not literal `@Scheduled(fixedDelay=...)`) so intervals come from the bean at startup. Hot-reload not supported — restart required to change cadence.
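The floor on `evaluator-tick-interval-ms` could be applied with a helper along these lines before the value is handed to the scheduler. `TickIntervalClamp` and the logger name are hypothetical; only the 5000 ms floor and the WARN come from the config comment above.

```java
final class TickIntervalClamp {
    /** Minimum evaluator tick interval, from the config comment above. */
    static final long FLOOR_MS = 5_000;

    private static final System.Logger LOG = System.getLogger("alerting");

    /** Returns the configured interval, raised to the floor (with a warning) if it is lower. */
    static long effectiveTickIntervalMs(long configuredMs) {
        if (configuredMs < FLOOR_MS) {
            LOG.log(System.Logger.Level.WARNING,
                "evaluator-tick-interval-ms={0} is below the floor; using {1}",
                configuredMs, FLOOR_MS);
            return FLOOR_MS;
        }
        return configuredMs;
    }
}
```

Because the interval is read once at startup, clamping there is enough; no per-tick check is needed.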
### `OutboundHttpProperties` (`cameleer.server.outbound-http.*`)
See §10.
---
## 15. Retention
Daily `@Scheduled(cron = "0 0 3 * * *")` job `AlertingRetentionJob` (advisory-lock-of-the-day pattern, same as `JarRetentionJob`):
```sql
DELETE FROM alert_instances
WHERE state = 'RESOLVED'
  AND resolved_at < now() - make_interval(days => :eventRetentionDays);

DELETE FROM alert_notifications
WHERE status IN ('DELIVERED','FAILED')
  -- rows with NULL delivered_at (never delivered) are purged regardless of age
  AND (delivered_at IS NULL
       OR delivered_at < now() - make_interval(days => :notificationRetentionDays));
```
Retention values from `AlertingProperties`.
---
## 16. Observability
New metrics exposed via existing `/api/v1/prometheus`:
- `alerting_eval_duration_seconds{kind}` — histogram per condition kind
- `alerting_eval_errors_total{kind, rule_id}` — counter
- `alerting_circuit_open_total{kind}` — counter
- `alerting_rule_state{state}` — gauge (enabled / disabled / broken-reference)
- `alerting_instances_total{state, severity}` — gauge (open alerts)
- `alerting_notifications_total{status}` — counter
- `alerting_webhook_delivery_duration_seconds` — histogram
No new dashboards shipped in v1; tenants with Prometheus + Grafana can build their own. An "Alerting health" admin sub-page is a cheap future add.
### Audit
New `AuditCategory` values:
- `OUTBOUND_HTTP_TRUST_CHANGE` — webhook or connection TLS config change
- `ALERT_RULE_CHANGE` — create / update / delete rule
- `ALERT_SILENCE_CHANGE` — create / update / delete silence
- `OUTBOUND_CONNECTION_CHANGE` — admin CRUD on outbound connection
Emitted via existing `AuditService.log(...)`.
---
## 17. Security
- **Tenant + env isolation.** Every controller call runs through `@EnvPath` (resolves env → tenant via `TenantContext`). Every CH query filters by `tenant_id AND environment` per pre-existing invariant.
- **RBAC.** Enforced via Spring Security `@PreAuthorize` on each endpoint (see §11 role column).
- **Webhook URL SSRF protection.** At rule save, reject URLs resolving to private IPs (`127.0.0.0/8`, `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `::1`, `fc00::/7`) unless a deployment-level allow-listed dev flag is set.
- **HMAC signing.** Per-connection `hmac_secret` encrypted at rest; signature header sent on dispatch.
- **TLS trust.** Cross-cutting module (§10).
- **Audit.** See §16.
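The private-address check for SSRF protection might be sketched as follows, using only `java.net`. `PrivateAddressCheck` is a hypothetical name, and rejecting hosts that fail to resolve (fail closed) is an assumption rather than something this spec states.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

final class PrivateAddressCheck {
    /** True if the address falls in one of the loopback or private ranges listed above. */
    static boolean isBlocked(InetAddress addr) {
        // Covers 127.0.0.0/8 and ::1.
        if (addr.isLoopbackAddress()) return true;
        // For IPv4 this covers 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16.
        // For IPv6 it matches the deprecated fec0::/10 range, which goes beyond the list above.
        if (addr.isSiteLocalAddress()) return true;
        if (addr instanceof Inet6Address) {
            int firstByte = addr.getAddress()[0] & 0xff;
            return (firstByte & 0xfe) == 0xfc; // fc00::/7 unique local addresses
        }
        return false;
    }

    /** Resolves the host and blocks it if any resolved address is private. */
    static boolean isBlockedHost(String host) {
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (isBlocked(addr)) return true;
            }
            return false;
        } catch (UnknownHostException e) {
            return true; // assumption: an unresolvable host is rejected rather than allowed
        }
    }
}
```

Checking every address returned for the host matters because a name can resolve to both a public and a private address.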
---
## 18. Testing
### Backend — unit (`*Test.java`, no Spring)
- Each `ConditionEvaluator`: synthetic inputs → expected `EvalResult`. Fire / no-fire / threshold edges / PER_EXCHANGE cursor / for-duration debounce.
- `MustacheRenderer`: context + template → expected output; malformed falls back + logs.
- `SilenceMatcher`: matcher JSONB vs instance → truth table.
- Jackson polymorphism: roundtrip each `AlertCondition` subtype.
- Claim-polling concurrency (embedded PG): two threads → no duplicates.
### Backend — integration (Testcontainers, `*IT.java`)
- `AlertingFullLifecycleIT` — end-to-end rule → fire → ack → silence → delete, history survives.
- `AlertingEnvIsolationIT` — alert in env-A invisible from env-B inbox.
- `OutboundConnectionAllowedEnvIT` — 422 on save if connection not allowed in env; 409 on narrow-while-referenced.
- `WebhookDispatchIT` (WireMock) — payload shape, HMAC signature, retry on 5xx, FAILED after max, no retry on 4xx.
- `PerformanceIT` (opt-in, not default CI) — 500 rules + 5-replica simulation.
### Frontend — component (Vitest + Testing Library)
- Rule editor wizard step navigation + validation.
- Bell polling pause on tab hide.
- Inbox row rendering by severity.
- CMD-K result-source registration.
### Frontend — E2E (Playwright if infra supports)
- Create rule → inject matching data → bell badge appears → open alert → ack → badge clears.
---
## 19. Rollout
- **No feature flag.** Alerting is dormant-by-default: zero rules → zero evaluator work → zero behavior change. Migration is additive.
- **Migration rollback.** V11 PG migration has matching down-script; CH projections are `IF NOT EXISTS`-safe and droppable without data loss.
- **Progressive adoption.** First user creates the first rule; feature organically spreads from there.
- **Documentation.** Add an admin-facing alerting guide under `docs/` describing rule shapes, template variables, webhook destinations, and silence patterns.
- **`.claude/rules/` updates.** `app-classes.md` and `core-classes.md` updated to document the new packages and any touched classes — part of the change, not a follow-up.
---
## 20. Open questions / items for writing-plans
These are not design-level decisions — they're implementation-phase tasks to be carried into planning:
1. **Alignment with existing OIDC outbound cert handling.** Before implementing `ApacheOutboundHttpClientFactory`, audit how `OidcProviderHelper` / `OidcTokenExchanger` currently validate certs. If there's a pattern in place, mirror it; if not, adopt the factory as the one-true-way and retrofit OIDC in a separate follow-up (not part of alerting v1).
2. **`hmac_secret` encryption-at-rest.** Decide between Jasypt (simplest, adds a dep) and a bespoke encrypt/decrypt over the existing Ed25519-derived key material (no new dep, ~50 LOC). Defer to plan.
3. **V1 CH migration file naming.** Confirm the convention for alerting-owned CH migrations (`V_alerting_projections.sql` vs numbered). Current `ClickHouseSchemaInitializer` runs files idempotently — naming is informational.
4. **Bell component keyboard shortcut.** Optional; align with existing CMD-K shortcut conventions.
5. **Target picker UX.** How to mix user / group / role in one multi-select with typeahead. Small UX design task.
6. **Env-delete cascade audit.** Before merge, verify the full cascade chain empirically in a PG integration test — POC safety depends on it.
7. **`<MustacheEditor />` completion engine choice.** Decide between CodeMirror 6 with a custom completion extension, Monaco, or a lighter custom-overlay-on-`<textarea>` implementation. Criteria: bundle-size cost, accessibility (ARIA combobox semantics), existing design-system integration. The variable metadata registry (`{path, type, description, sampleValue, availableForKinds[]}`) is the same regardless of engine.