**Surfaces:** server (core + app), UI, admin, Gitea issues
**Related:** [backlog BL-001](../backlog.md) / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137) (managed CA bundles — deferred)
---
## 1. Summary
A first-class alerting feature inside Cameleer. Operators author rules that evaluate conditions over observability data; violations create shared, env-scoped alerts visible in an in-app inbox and optionally dispatched to external systems via admin-curated webhook connections. Lifecycle: `FIRING → ACKNOWLEDGED → RESOLVED` with orthogonal `SILENCED`. Horizontally scalable via PostgreSQL claim-based polling. All code confined to new `alerting/`, `outbound/`, and `http/` packages with minimal, documented touchpoints on existing stores.
### Guiding principles
- **"Good enough" baseline.** Customers with dedicated ops tooling (PagerDuty, Grafana, Opsgenie) will keep using it — alerting here serves those *without*. Resist incident-management feature creep; provide the floor, not the ceiling.
- **Confinement over cleverness.** Reads go through existing interfaces; no hooks into ingestion; no new ClickHouse tables; all new code in dedicated packages. The feature should be removable by deleting those packages and one migration.
- **Env-scoped by default, tenant-global where infrastructure.** Rules, alerts, silences live inside an environment. Outbound connections are tenant-global infrastructure admins manage, optionally restricted by env.
- **Performance is a first-class design concern**, not a v2 afterthought. Claim-polling, query coalescing, in-tick caching, per-kind circuit breaker, and CH projections are all v1.
- **No ClickHouse table changes, only projections.** Additive, idempotent (`IF NOT EXISTS`), safe to drop and rebuild.
---
## 2. Scope
### In scope (v1)
Six signal sources, expressed as sealed-type conditions:
1.**`ROUTE_METRIC`** — aggregate stats per route or app: error rate, p95/p99 latency, throughput, error count. Backed by `stats_1m_route`.
2.**`EXCHANGE_MATCH`** — per-exchange matching with two fire modes:
-`PER_EXCHANGE` — one alert per matching exchange (cursor-advanced, used for "specific failure" patterns)
-`COUNT_IN_WINDOW` — aggregate "N exchanges matched in window" threshold
3.**`AGENT_STATE`** — agent in `DEAD` / `STALE` state for ≥ N seconds. Reads in-memory `AgentRegistryService`.
4.**`DEPLOYMENT_STATE`** — deployment status is `FAILED` / `DEGRADED` for ≥ N seconds.
5.**`LOG_PATTERN`** — count of log rows matching level / logger / pattern in a window > threshold.
6.**`JVM_METRIC`** — agent-reported JVM/Camel metric (heap %, GC pressure, inflight) over threshold for a window.
**Delivery channels.** In-app inbox (derived from alerts + target-membership) and outbound HTTPS webhooks (via admin-managed outbound connections). No email. No native Slack/Teams integrations — users point webhook URLs at their own integrations.
**Sharing model.** Rules are shared within an environment; alerts are visible to any viewer of the env, but notifications route to targeted users, groups, or roles (via existing RBAC).
**Lifecycle states.** `PENDING → FIRING → ACKNOWLEDGED → RESOLVED`, with `SILENCED` as an orthogonal property resolved at notification-dispatch time (preserves audit trail).
**Rule promotion across environments** via UI prefill — no new server endpoint.
**CMD-K integration** — alerts + alert rules appear as new result sources in the existing CMD-K registry.
- Native provider integrations (Slack, Teams, PagerDuty as first-class types).
- Incident management (merging alerts, parent/child, assignees, SLA tracking) — integrate with PagerDuty via webhook instead.
- Expression language in rules — fixed templates only.
- mTLS / client-cert auth on outbound webhooks.
- Real-time push (SSE) to the UI — 30 s polling is the v1 cadence. SSE is a clean drop-in for v2 if needed.
### Deferred to backlog
- **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)** — in-app CA bundle management UI. Deferred pending investigation of reusing the SaaS layer's existing CA handling (KISS / DRY). V1 CA trust material is filesystem-resident via deployment config, same posture as OIDC issuer URIs and Ed25519 keys.
---
## 3. Key decisions
Captured from brainstorming, in order of architectural impact.
| Decision | Chosen | Rejected | Rationale |
|---|---|---|---|
| Signal sources | 6 (route / exchange / agent / deployment / log / JVM) | SQL power-user mode | "Good enough" baseline; fixed templates cover real needs; expression languages are where observability tools go to be rewritten |
| Sharing | tenant-env-shared rules; notifications target users/groups/roles | per-user "my alerts" | Ops products need single source of truth for what's broken; targets give per-person routing without duplicating rules |
| Evaluation | pull / claim-based polling | push / event-driven | Confinement — reads through existing interfaces, zero ingestion hooks; native handling of "no data" condition; 60 s latency acceptable for ops alerting |
| Alert lifecycle | FIRING / ACK / RESOLVED + SILENCED | minimal fire/resolve only, full incident workflow | Ack is the floor for team workflows (stop paging everyone); silences needed for ops maintenance; incident mgmt is a product category, not a feature |
| Rule shape | fixed templates, sealed-type JSONB | expression DSL, expression-first | Form-fillable; typed; additive for new kinds; consistent with no-SQL decision |
| Templating | JMustache | in-house substituter, Pebble/Freemarker | Industry standard for webhook templates (Slack, PagerDuty); logic-less (safe); small dep; familiar to ops users |
| UI placement | top-nav bell (consumer) + `/alerts` section (OPERATOR+ authoring, VIEWER+ read) | admin-only page, embedded per context, new top-level tab only | Separates consumer from authoring surfaces; rule authoring happens frequently, shouldn't be buried in admin |
| CMD-K | alerts + rules searchable | not searchable | Covers the "I saw this alert before lunch" use case; small surface via existing result-source registry |
| Outbound connections | admin-managed, tenant-global, allowed-env restriction | per-rule raw webhook URLs | Admins own infrastructure; operators author rules; rotation is atomic across N rules; reusable for future integrations |
### Touchpoints on existing code (deliberate, minimal)
| Existing surface | Change | Scope |
|---|---|---|
| `cameleer-server-app/src/main/resources/db/migration/V11__…` | New Flyway migration | additive |
| `cameleer-server-app/src/main/resources/clickhouse/V_…_projections.sql` | New CH migration | additive, `IF NOT EXISTS` |
| `ClickHouseLogStore` | New method `long countLogs(LogSearchRequest)` (no `FINAL`) | one public method added |
| `ClickHouseSearchIndex` | New method `long countExecutionsForAlerting(AlertMatchSpec)` (no `FINAL`, no text-in-body subqueries) | one public method added |
| `SecurityConfig` | Path matchers for new endpoints | ~15 lines |
| `ui/src/router.tsx` | Route entries for `/alerts/**` and `/admin/outbound-connections` | additive |
| Top-nav layout | Insert `<NotificationBell />` | one import + one component |
| CMD-K registry | Register `alerts` + `alertRules` result sources | two file additions + one import |
| `.claude/rules/app-classes.md` + `core-classes.md` | Update class maps for the new packages | documentation |
| `com.cameleer:cameleer-common` | no changes | — |
| ingestion paths | no changes | — |
| agent protocol | no changes | — |
| ClickHouse schema (table structure) | no changes — only projections added | — |
### New dependencies
-`com.samskivert:jmustache` — logic-less Mustache templating for webhook/notification templates. ~30 KB, zero transitive deps. Added to `cameleer-server-core`.
- Apache HttpClient 5 (`org.apache.hc.client5`) — **already present** in the project; no new coordinate.
---
## 5. Data model (PostgreSQL)
One Flyway migration `V11__alerting_and_outbound.sql` creates all tables, enums, and indexes in a single transaction.
### Enum types
```sql
CREATE TYPE severity_enum AS ENUM ('CRITICAL','WARNING','INFO');
CREATE TYPE condition_kind_enum AS ENUM ('ROUTE_METRIC','EXCHANGE_MATCH','AGENT_STATE','DEPLOYMENT_STATE','LOG_PATTERN','JVM_METRIC');
CREATE TYPE alert_state_enum AS ENUM ('PENDING','FIRING','ACKNOWLEDGED','RESOLVED');
CREATE TYPE target_kind_enum AS ENUM ('USER','GROUP','ROLE');
CREATE TYPE notification_status_enum AS ENUM ('PENDING','DELIVERED','FAILED');
CREATE TYPE trust_mode_enum AS ENUM ('SYSTEM_DEFAULT','TRUST_ALL','TRUST_PATHS');
CREATE TYPE outbound_method_enum AS ENUM ('POST','PUT','PATCH');
CREATE TYPE outbound_auth_kind_enum AS ENUM ('NONE','BEARER','BASIC');
```
### Tables
#### `outbound_connections` (admin-managed)
```sql
CREATE TABLE outbound_connections (
id uuid PRIMARY KEY,
tenant_id varchar(64) NOT NULL,
name varchar(100) NOT NULL,
description text,
url text NOT NULL, -- Mustache-enabled
method outbound_method_enum NOT NULL,
default_headers jsonb NOT NULL DEFAULT '{}', -- values are Mustache templates
tls_trust_mode trust_mode_enum NOT NULL DEFAULT 'SYSTEM_DEFAULT',
tls_ca_pem_paths jsonb NOT NULL DEFAULT '[]', -- array of paths from OutboundHttpProperties
hmac_secret text, -- Ed25519-key-derived encryption at rest
auth_kind outbound_auth_kind_enum NOT NULL DEFAULT 'NONE',
auth_config jsonb NOT NULL DEFAULT '{}', -- shape depends on auth_kind; v1 unused
allowed_environment_ids uuid[] NOT NULL DEFAULT '{}', -- [] = allowed in all envs
created_at timestamptz NOT NULL DEFAULT now(),
created_by uuid NOT NULL REFERENCES users(id),
updated_at timestamptz NOT NULL DEFAULT now(),
updated_by uuid NOT NULL REFERENCES users(id),
UNIQUE (tenant_id, name)
);
CREATE INDEX outbound_connections_tenant_idx ON outbound_connections (tenant_id);
```
#### `alert_rules`
```sql
CREATE TABLE alert_rules (
id uuid PRIMARY KEY,
environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
name varchar(200) NOT NULL,
description text,
severity severity_enum NOT NULL,
enabled boolean NOT NULL DEFAULT true,
condition_kind condition_kind_enum NOT NULL,
condition jsonb NOT NULL, -- sealed-subtype payload, Jackson-DEDUCTION polymorphic
evaluation_interval_seconds int NOT NULL DEFAULT 60 CHECK (evaluation_interval_seconds >= 5),
for_duration_seconds int NOT NULL DEFAULT 0 CHECK (for_duration_seconds >= 0),
re_notify_minutes int NOT NULL DEFAULT 60 CHECK (re_notify_minutes >= 0),
notification_title_tmpl text NOT NULL, -- Mustache
notification_message_tmpl text NOT NULL, -- Mustache
webhooks jsonb NOT NULL DEFAULT '[]', -- [{id: uuid, outboundConnectionId, bodyOverride?, headerOverrides?}] — id assigned server-side on save, used as stable ref from alert_notifications.webhook_id
next_evaluation_at timestamptz NOT NULL DEFAULT now(),
claimed_by varchar(64),
claimed_until timestamptz,
eval_state jsonb NOT NULL DEFAULT '{}',
created_at timestamptz NOT NULL DEFAULT now(),
created_by uuid NOT NULL REFERENCES users(id),
updated_at timestamptz NOT NULL DEFAULT now(),
updated_by uuid NOT NULL REFERENCES users(id)
);
CREATE INDEX alert_rules_env_idx ON alert_rules (environment_id);
CREATE INDEX alert_rules_claim_due_idx ON alert_rules (next_evaluation_at) WHERE enabled = true;
```
#### `alert_rule_targets`
```sql
CREATE TABLE alert_rule_targets (
id uuid PRIMARY KEY,
rule_id uuid NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE,
target_kind target_kind_enum NOT NULL,
target_id varchar(128) NOT NULL,
UNIQUE (rule_id, target_kind, target_id)
);
CREATE INDEX alert_rule_targets_lookup_idx ON alert_rule_targets (target_kind, target_id);
```
#### `alert_instances`
```sql
CREATE TABLE alert_instances (
id uuid PRIMARY KEY,
rule_id uuid REFERENCES alert_rules(id) ON DELETE SET NULL,
rule_snapshot jsonb NOT NULL,
environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
state alert_state_enum NOT NULL,
severity severity_enum NOT NULL,
fired_at timestamptz NOT NULL,
acked_at timestamptz,
acked_by uuid REFERENCES users(id),
resolved_at timestamptz,
last_notified_at timestamptz,
silenced boolean NOT NULL DEFAULT false,
current_value numeric,
threshold numeric,
context jsonb NOT NULL,
title text NOT NULL,
message text NOT NULL,
target_user_ids uuid[] NOT NULL DEFAULT '{}',
target_group_ids uuid[] NOT NULL DEFAULT '{}',
target_role_names text[] NOT NULL DEFAULT '{}'
);
CREATE INDEX alert_instances_inbox_idx ON alert_instances (environment_id, state, fired_at DESC);
CREATE INDEX alert_instances_open_rule_idx ON alert_instances (rule_id, state) WHERE rule_id IS NOT NULL;
CREATE INDEX alert_instances_resolved_idx ON alert_instances (resolved_at) WHERE state = 'RESOLVED';
CREATE INDEX alert_instances_target_u_idx ON alert_instances USING GIN (target_user_ids);
CREATE INDEX alert_instances_target_g_idx ON alert_instances USING GIN (target_group_ids);
CREATE INDEX alert_instances_target_r_idx ON alert_instances USING GIN (target_role_names);
```
#### `alert_silences`
```sql
CREATE TABLE alert_silences (
id uuid PRIMARY KEY,
environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
AND (claimed_until IS NULL OR claimed_until < now())
ORDER BY next_evaluation_at
LIMIT :batch
FOR UPDATE SKIP LOCKED
)
RETURNING *;
-- Notification dispatcher (same pattern on alert_notifications with status='PENDING')
```
`FOR UPDATE SKIP LOCKED` is the crux: replicas never block each other.
---
## 6. Outbound connections
### Concept
An `OutboundConnection` is a reusable, admin-managed HTTPS destination. Alert rules reference connections by ID and may override body or header templates per rule. Rotating a URL or secret updates every rule atomically.
**Tenant-global.** Slack URLs and PagerDuty keys are team infrastructure, not env-specific. Env-specific routing is achieved by creating multiple connections (`slack-prod`, `slack-dev`) and referencing the appropriate one in each env's rules.
**Allowed-env restriction.** `allowed_environment_ids` (default empty = all envs). Admin restricts a connection to specific envs via a multi-select on the connection form. UI picker filters by current env; rule save validates (422 on violation); narrowing the restriction while rules still reference it returns 409 with conflict list.
**Delete semantics.** 409 if any rule references the connection. No silent cascade — admin must first remove references.
### Default body template (when rule has no override)
"context": { /* full Mustache context: app, route, agent, exchange, etc. */ }
}
```
"Just plug in my Slack incoming webhook URL" works without writing a template.
### HMAC signing (optional per connection)
When `hmac_secret` is set, dispatch adds `X-Cameleer-Signature: sha256=<hmac(secret, body)>` header. GitHub / Stripe pattern. Secret encrypted at rest — concrete approach (Jasypt vs bespoke over existing Ed25519-derived key material) decided in planning (see §20).
---
## 7. Rule evaluation
### Scheduler
```java
@Component
public class AlertEvaluatorJob implements SchedulingConfigurer {
// Interval wired via AlertingProperties.evaluatorTickIntervalMs (floor 5000)
@Override
public void configureTasks(ScheduledTaskRegistrar registrar) {
| `ROUTE_METRIC` | `SearchService.statsForRoute` / `statsForApp` | Stats over window; comparator vs threshold |
| `EXCHANGE_MATCH` (PER_EXCHANGE) | `SearchService.search(SearchRequest)` | `timestamp > eval_state.lastExchangeTs AND filter` → fire one alert per match, advance cursor |
| `EXCHANGE_MATCH` (COUNT_IN_WINDOW) | `ClickHouseSearchIndex.countExecutionsForAlerting(spec)` | Count in window vs threshold |
| `AGENT_STATE` | `AgentRegistryService.listByEnvironment` | Any agent matches scope + state |
| `DEPLOYMENT_STATE` | `DeploymentRepository.findLatestByAppAndEnv` | Status in target set |
| `LOG_PATTERN` | `ClickHouseLogStore.countLogs(LogSearchRequest)` | Count in window vs threshold |
| `JVM_METRIC` | `MetricsQueryStore` | Latest value (aggregation per rule) vs threshold |
│ ACKNOWLEDGED RESOLVED ── (cond false again → cycle can restart)
```
`PER_EXCHANGE` mode: each match is its own brief FIRING instance that auto-resolves after `perExchangeLingerSeconds` (default 300 s). History retains it for 90 d.
### Performance optimizations (v1)
1.**Four ClickHouse projections** (new CH migration, idempotent):
```sql
ALTER TABLE executions ADD PROJECTION IF NOT EXISTS alerting_app_status
(SELECT * ORDER BY (tenant_id, environment, application_id, status, start_time));
ALTER TABLE executions ADD PROJECTION IF NOT EXISTS alerting_route_status
(SELECT * ORDER BY (tenant_id, environment, route_id, status, start_time));
ALTER TABLE logs ADD PROJECTION IF NOT EXISTS alerting_app_level
(SELECT * ORDER BY (tenant_id, environment, application, level, timestamp));
ALTER TABLE agent_metrics ADD PROJECTION IF NOT EXISTS alerting_instance_metric
(SELECT * ORDER BY (tenant_id, environment, instance_id, metric_name, collected_at));
```
`stats_1m_route`'s existing ORDER BY already aligns with alerting access patterns; no projection needed.
2.**Drop `FINAL` for alerting counts.** New methods `ClickHouseLogStore.countLogs(...)` and `ClickHouseSearchIndex.countExecutionsForAlerting(...)` skip `FINAL` — alerting tolerates brief duplicate-row over-count (alert fires briefly, self-resolves on next tick after merge). Existing UI-facing `count()` path unchanged.
3.**Per-tick query coalescing.** Rules of the same kind + window share one aggregate query per tick.
4.**In-tick cache.**`Map<QueryKey, Long>` discarded at tick end. Two rules hitting the same `(app, route, window, metric)` produce one CH call.
5.**Per-kind circuit breaker.** 5 failures in 30 s → open for 60 s. Metric `alerting_circuit_open_total{kind}`. UI surfaces an admin banner when open.
### Silence matching
At notification-dispatch time (not evaluation time):
```sql
SELECT 1 FROM alert_silences
WHERE environment_id = :env
AND now() BETWEEN starts_at AND ends_at
AND matcher_matches(matcher, :instanceContext)
LIMIT 1;
```
If any match → `alert_instances.silenced = true`, no webhook dispatch, no re-notification. Inbox still shows the instance with a silenced pill — audit trail preserved.
AND ai.state IN ('FIRING','ACKNOWLEDGED','RESOLVED')
AND (
:me = ANY(ai.target_user_ids)
OR ai.target_group_ids && :my_group_ids
OR ai.target_role_names && :my_role_names
)
ORDER BY ai.fired_at DESC
LIMIT 100;
```
`:my_group_ids` and `:my_role_names` resolved once per request from `RbacService`.
**Bell badge count:** same filter + `state IN ('FIRING','ACKNOWLEDGED')` + `NOT EXISTS (alert_reads ar WHERE ar.user_id=:me AND ar.alert_instance_id=ai.id)`, count-only. Server-side 5 s memoization per `(env, user)` keeps bell polling cheap.
### Webhook outbox — claim-based
`NotificationDispatchJob` claims due notifications (`status='PENDING' AND next_attempt_at <= now()`) and dispatches. HTTP client from shared `OutboundHttpClientFactory` with TLS config from the referenced outbound connection.
Payload rendered at **first** dispatch attempt, snapshotted in `alert_notifications.payload`. Retries replay the snapshot — template edits after fire don't affect in-flight notifications.
**Error handling.** Missing variable renders as `{{var.name}}` literal + WARN log. Malformed template falls back to built-in default + WARN. Never drop a notification due to template error.
**UX.** Rule list row → **Environments ▾** menu of other envs in the tenant → open rule editor pre-populated with source rule's payload, target env selected. Banner: *"Promoting `<name>` from `<src>` → `<dst>`. Review and adjust, then save."* Save → normal `POST /api/v1/environments/{dstEnvSlug}/alerts/rules`. Source unaffected (it's a copy).
**Pure UI flow — no new server endpoint.** Re-uses the existing GET (to fetch) and POST (to create) paths.
Implementation (`ApacheOutboundHttpClientFactory`) memoizes one `CloseableHttpClient` per unique effective config — not one per call.
### System config (`cameleer.server.outbound-http.*`)
```yaml
cameleer:
server:
outbound-http:
trust-all: false # global kill-switch; WARN logged if true
trusted-ca-pem-paths: # additional roots layered on JVM default
- /etc/cameleer/certs/corporate-root.pem
- /etc/cameleer/certs/acme-internal.pem
default-connect-timeout-ms: 2000
default-read-timeout-ms: 5000
proxy-url: # optional; null = no proxy
proxy-username:
proxy-password:
```
On startup: if `trust-all=true`, log red WARN (not suitable for production). If `trusted-ca-pem-paths` has entries, verify each path exists; fail-fast on missing files.
### Per-connection overrides
Each `OutboundConnection` carries `tls_trust_mode` + `tls_ca_pem_paths`. UI surfaces a dropdown: **System default (validated)** / **Trust custom CAs (from server config)** / **Trust all (insecure — testing only)**. Amber warning when *Trust all* selected. Audit logged (`AuditCategory.OUTBOUND_HTTP_TRUST_CHANGE`).
### Deferred
See **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)**:
- In-app CA bundle upload / admin management
- SaaS-layer CA reuse investigation (do first)
---
## 11. API surface
All env-scoped routes under `/api/v1/environments/{envSlug}/alerts/...` via existing `@EnvPath` resolver.
Per `CLAUDE.md` convention: after controller/DTO changes, run `cd ui && npm run generate-api:live` (backend on :8081) to regenerate `ui/src/api/schema.d.ts`. Commit regen alongside controller change.
---
## 12. CMD-K integration
Two new result sources registered in the existing UI registry (`ui/src/cmdk/sources/`):
- **Alerts** — queries `/alerts?q=...&limit=5` (server-side fulltext against title / message / rule_snapshot); results show severity icon + state chip; deep-link to `/alerts/inbox/{id}`.
- **Alert Rules** — queries `/alerts/rules?q=...&limit=5`; deep-link to `/alerts/rules/{id}`.
No new registry machinery — uses the existing extension point.
4.**Notify** — title + message templates with *Preview* button; targets multi-select (users / groups / roles with typeahead); outbound connections multi-select filtered by current env + `allowed_environment_ids`.
### Template editor — Mustache with variable auto-complete
Every Mustache template-editable field — notification title, notification message, webhook URL, webhook header values, webhook body — uses a shared `<MustacheEditor />` component with **variable auto-complete**. Users never have to guess what context variables are available.
**Behavior.**
- Typing `{{` opens a dropdown of available variables at the caret position.
- Each suggestion shows the variable path (`alert.firedAt`), its type (`Instant`), a one-line description, and a sample rendered value from the canned context.
- Filtering narrows the list as the user keeps typing (`{{ale…` → filters to `alert.*`).
-`Enter` / `Tab` inserts the path and closes `}}` automatically.
- Arrow keys + `Esc` follow standard combobox semantics (ARIA-conformant).
**Context-aware filtering.** The available variables depend on the rule's condition kind and scope. The editor is aware of both:
- Always shown: `env.*`, `rule.*`, `alert.*`
-`ROUTE_METRIC` with `route.id` set: adds `route.id`, `app.*`
Variables that *might not* populate (e.g., `alert.resolvedAt` while state is FIRING) are shown with a grey "may be null" badge — users still see them so they can defensively template.
**Syntax checks inline.**
- Unclosed `{{` / unmatched `}}` flagged with a red underline + hover hint.
- Reference to an out-of-scope variable (e.g., `{{exchange.id}}` in a ROUTE_METRIC rule) flagged with an amber underline + hint ("not available for this rule kind — will render as literal").
- Checks run client-side on every keystroke (debounced); server-side render preview is still authoritative (§8).
**Shared implementation.** Same `<MustacheEditor />` component is used in:
- Rule editor — Notify step (title, message)
- Rule editor — Webhook overrides (body override, header value overrides; URL not editable per rule, it's the connection's)
- Admin **Outbound Connections** editor — default body template, default header values, URL (URL gets a reduced context: only `env.*` since a connection URL is rule-agnostic)
- *Test render* inline preview — rendered output updates live as user types
**Completion engine.** Specific library choice (CodeMirror 6 with a custom completion extension vs Monaco vs a lighter custom overlay on `<textarea>`) is deferred to planning — see §20.
Structure described in design presentation; no new design-system components required. Reuses `Select`, `Tabs`, `Toggle`, `Button`, `Label`, `InfiniteScrollArea`, `PageLoader`, `Badge` from `@cameleer/design-system`.
### Real-time behavior
- Bell: `/alerts/unread-count` polled every 30 s; paused when tab hidden (Page Visibility API).
- Inbox view: `/alerts` polled every 30 s when focused.
- No SSE in v1. SSE is a clean future add under `/alerts/stream` with no schema changes.
### Accessibility
Keyboard navigation; severity conveyed via icon + text + color (not color alone); ARIA live region on inbox for new-alert announcement; bell component has descriptive `aria-label`.
### Styling
All colors via `@cameleer/design-system` CSS variables (`var(--error)`, `var(--amber)`, `var(--muted)`, `var(--success)`). No hard-coded hex.
evaluator-tick-interval-ms: 5000 # floor: 5000 (clamped at startup with WARN if lower)
evaluator-batch-size: 20
claim-ttl-seconds: 30
notification-tick-interval-ms: 5000
notification-batch-size: 50
in-tick-cache-enabled: true
circuit-breaker-fail-threshold: 5
circuit-breaker-window-seconds: 30
circuit-breaker-cooldown-seconds: 60
event-retention-days: 90
notification-retention-days: 30
webhook-timeout-ms: 5000
webhook-max-attempts: 3
```
Env-var overridable (`CAMELEER_SERVER_ALERTING_EVALUATOR_TICK_INTERVAL_MS=...`). Wired via `SchedulingConfigurer` (not literal `@Scheduled(fixedDelay=...)`) so intervals come from the bean at startup. Hot-reload not supported — restart required to change cadence.
-`OUTBOUND_CONNECTION_CHANGE` — admin CRUD on outbound connection
Emitted via existing `AuditService.log(...)`.
---
## 17. Security
- **Tenant + env isolation.** Every controller call runs through `@EnvPath` (resolves env → tenant via `TenantContext`). Every CH query filters by `tenant_id AND environment` per pre-existing invariant.
- **RBAC.** Enforced via Spring Security `@PreAuthorize` on each endpoint (see §11 role column).
- **Webhook URL SSRF protection.** At rule save, reject URLs resolving to private IPs (`127.0.0.0/8`, `10.0.0.0/8`, `172.16/12`, `192.168/16`, `::1`, `fc00::/7`) unless a deployment-level allow-listed dev flag is set.
- **HMAC signing.** Per-connection `hmac_secret` encrypted at rest; signature header sent on dispatch.
- **TLS trust.** Cross-cutting module (§10).
- **Audit.** See §16.
---
## 18. Testing
### Backend — unit (`*Test.java`, no Spring)
- Each `ConditionEvaluator`: synthetic inputs → expected `EvalResult`. Fire / no-fire / threshold edges / PER_EXCHANGE cursor / for-duration debounce.
- Create rule → inject matching data → bell badge appears → open alert → ack → badge clears.
---
## 19. Rollout
- **No feature flag.** Alerting is dormant-by-default: zero rules → zero evaluator work → zero behavior change. Migration is additive.
- **Migration rollback.** V11 PG migration has matching down-script; CH projections are `IF NOT EXISTS`-safe and droppable without data loss.
- **Progressive adoption.** First user creates the first rule; feature organically spreads from there.
- **Documentation.** Add an admin-facing alerting guide under `docs/` describing rule shapes, template variables, webhook destinations, and silence patterns.
- **`.claude/rules/` updates.** `app-classes.md` and `core-classes.md` updated to document the new packages and any touched classes — part of the change, not a follow-up.
---
## 20. Open questions / items for writing-plans
These are not design-level decisions — they're implementation-phase tasks to be carried into planning:
1.**Alignment with existing OIDC outbound cert handling.** Before implementing `ApacheOutboundHttpClientFactory`, audit how `OidcProviderHelper` / `OidcTokenExchanger` currently validate certs. If there's a pattern in place, mirror it; if not, adopt the factory as the one-true-way and retrofit OIDC in a separate follow-up (not part of alerting v1).
2.**`hmac_secret` encryption-at-rest.** Decide between Jasypt (simplest, adds a dep) and a bespoke encrypt/decrypt over the existing Ed25519-derived key material (no new dep, ~50 LOC). Defer to plan.
3.**V1 CH migration file naming.** Confirm the convention for alerting-owned CH migrations (`V_alerting_projections.sql` vs numbered). Current `ClickHouseSchemaInitializer` runs files idempotently — naming is informational.
7.**`<MustacheEditor />` completion engine choice.** Decide between CodeMirror 6 with a custom completion extension, Monaco, or a lighter custom-overlay-on-`<textarea>` implementation. Criteria: bundle-size cost, accessibility (ARIA combobox semantics), existing design-system integration. The variable metadata registry (`{path, type, description, sampleValue, availableForKinds[]}`) is the same regardless of engine.