# Alerting: Design Spec

**Date:** 2026-04-19
**Status:** Draft, awaiting user review
**Surfaces:** server (core + app), UI, admin, Gitea issues
**Related:** [backlog BL-001](../backlog.md) / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137) (managed CA bundles, deferred)

---

## 1. Summary

A first-class alerting feature inside Cameleer. Operators author rules that evaluate conditions over observability data; violations create shared, env-scoped alerts visible in an in-app inbox and optionally dispatched to external systems via admin-curated webhook connections. Lifecycle: `PENDING → FIRING → ACKNOWLEDGED → RESOLVED` with orthogonal `SILENCED`. Horizontally scalable via PostgreSQL claim-based polling. All code is confined to new `alerting/`, `outbound/`, and `http/` packages with minimal, documented touchpoints on existing stores.

### Guiding principles

- **"Good enough" baseline.** Customers with dedicated ops tooling (PagerDuty, Grafana, Opsgenie) will keep using it; alerting here serves those *without*. Resist incident-management feature creep; provide the floor, not the ceiling.
- **Confinement over cleverness.** Reads go through existing interfaces; no hooks into ingestion; no new ClickHouse tables; all new code in dedicated packages. The feature should be removable by deleting those packages and one migration.
- **Env-scoped by default, tenant-global where infrastructure.** Rules, alerts, and silences live inside an environment. Outbound connections are tenant-global infrastructure that admins manage, optionally restricted by env.
- **Performance is a first-class design concern**, not a v2 afterthought. Claim-polling, query coalescing, in-tick caching, per-kind circuit breaker, and CH projections are all v1.
- **No ClickHouse table changes, only projections.** Additive, idempotent (`IF NOT EXISTS`), safe to drop and rebuild.

---

## 2. Scope

### In scope (v1)

Six signal sources, expressed as sealed-type conditions:

1. **`ROUTE_METRIC`**: aggregate stats per route or app (error rate, p95/p99 latency, throughput, error count). Backed by `stats_1m_route`.
2. **`EXCHANGE_MATCH`**: per-exchange matching with two fire modes:
   - `PER_EXCHANGE`: one alert per matching exchange (cursor-advanced, used for "specific failure" patterns)
   - `COUNT_IN_WINDOW`: aggregate "N exchanges matched in window" threshold
3. **`AGENT_STATE`**: agent in `DEAD` / `STALE` state for ≥ N seconds. Reads in-memory `AgentRegistryService`.
4. **`DEPLOYMENT_STATE`**: deployment status is `FAILED` / `DEGRADED` for ≥ N seconds.
5. **`LOG_PATTERN`**: count of log rows matching level / logger / pattern in a window exceeds a threshold.
6. **`JVM_METRIC`**: agent-reported JVM/Camel metric (heap %, GC pressure, inflight) over threshold for a window.

**Delivery channels.** In-app inbox (derived from alerts + target-membership) and outbound HTTPS webhooks (via admin-managed outbound connections). No email. No native Slack/Teams integrations; users point webhook URLs at their own integrations.

**Sharing model.** Rules are shared within an environment; alerts are visible to any viewer of the env, but notifications route to targeted users, groups, or roles (via existing RBAC).

**Lifecycle states.** `PENDING → FIRING → ACKNOWLEDGED → RESOLVED`, with `SILENCED` as an orthogonal property resolved at notification-dispatch time (preserves audit trail).

**Rule promotion across environments** via UI prefill; no new server endpoint.

**CMD-K integration.** Alerts and alert rules appear as new result sources in the existing CMD-K registry.

**Configurable evaluator cadence** (5 s floor), per-rule evaluation intervals, per-rule re-notify cadence.

### Out of scope (v1, not deferred)

- Custom SQL / Prometheus-style query DSL (option F).
- Email delivery channel. Webhooks cover Slack / PagerDuty / Teams / Opsgenie / n8n / Zapier via ops-team-owned integrations.
- Native provider integrations (Slack, Teams, PagerDuty as first-class types).
- Incident management (merging alerts, parent/child, assignees, SLA tracking). Integrate with PagerDuty via webhook instead.
- Expression language in rules; fixed templates only.
- mTLS / client-cert auth on outbound webhooks.
- Real-time push (SSE) to the UI. 30 s polling is the v1 cadence; SSE is a clean drop-in for v2 if needed.

### Deferred to backlog

- **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)**: in-app CA bundle management UI. Deferred pending investigation of reusing the SaaS layer's existing CA handling (KISS / DRY). V1 CA trust material is filesystem-resident via deployment config, same posture as OIDC issuer URIs and Ed25519 keys.

---

## 3. Key decisions

Captured from brainstorming, in order of architectural impact.

| Decision | Chosen | Rejected | Rationale |
|---|---|---|---|
| Signal sources | 6 (route / exchange / agent / deployment / log / JVM) | SQL power-user mode | "Good enough" baseline; fixed templates cover real needs; expression languages are where observability tools go to be rewritten |
| Delivery channels | in-app + webhook | email, native integrations | Webhooks cover every target; email is deceptively expensive (deliverability, bounces, DKIM) |
| Sharing | tenant-env-shared rules; notifications target users/groups/roles | per-user "my alerts" | Ops products need single source of truth for what's broken; targets give per-person routing without duplicating rules |
| Evaluation | pull / claim-based polling | push / event-driven | Confinement: reads through existing interfaces, zero ingestion hooks; native handling of "no data" condition; 60 s latency acceptable for ops alerting |
| Horizontal scale | `FOR UPDATE SKIP LOCKED` claim pattern | advisory locks / leader election | Naturally partitions work; supports per-rule cadences; recovers from replica death; industry-standard |
| Alert lifecycle | FIRING / ACK / RESOLVED + SILENCED | minimal fire/resolve only, full incident workflow | Ack is the floor for team workflows (stop paging everyone); silences needed for ops maintenance; incident mgmt is a product category, not a feature |
| Rule shape | fixed templates, sealed-type JSONB | expression DSL, expression-first | Form-fillable; typed; additive for new kinds; consistent with no-SQL decision |
| Templating | JMustache | in-house substituter, Pebble/Freemarker | Industry standard for webhook templates (Slack, PagerDuty); logic-less (safe); small dep; familiar to ops users |
| UI placement | top-nav bell (consumer) + `/alerts` section (OPERATOR+ authoring, VIEWER+ read) | admin-only page, embedded per context, new top-level tab only | Separates consumer from authoring surfaces; rule authoring happens frequently, shouldn't be buried in admin |
| CMD-K | alerts + rules searchable | not searchable | Covers the "I saw this alert before lunch" use case; small surface via existing result-source registry |
| Outbound connections | admin-managed, tenant-global, allowed-env restriction | per-rule raw webhook URLs | Admins own infrastructure; operators author rules; rotation is atomic across N rules; reusable for future integrations |
| TLS trust | shared cross-cutting module `http/` | alerting-local trust config | Future-proofs for additional outbound HTTPS consumers; joins the existing OIDC outbound path |
| CA management UI | **deferred (BL-001)** | build in-server now | SaaS-layer CA mechanism should be investigated first for reuse |
| Env deletion | full cascade across alerting tables | partial cascade with SET NULL | POC teardown safety: zero orphaned rows |

---

## 4. Module architecture

### Package layout

```
cameleer-server-core/src/main/java/com/cameleer/server/core/
├── alerting/                      (domain; pure records + interfaces)
│   ├── AlertRule
│   ├── AlertCondition (sealed)
│   │   ├── RouteMetricCondition
│   │   ├── ExchangeMatchCondition
│   │   ├── AgentStateCondition
│   │   ├── DeploymentStateCondition
│   │   ├── LogPatternCondition
│   │   └── JvmMetricCondition
│   ├── AlertSeverity / AlertState (enums)
│   ├── AlertInstance / AlertEvent
│   ├── NotificationTarget / NotificationTargetKind
│   ├── AlertSilence / SilenceMatcher
│   ├── AlertRuleRepository (interface)
│   ├── AlertInstanceRepository (interface)
│   ├── AlertSilenceRepository (interface)
│   ├── AlertNotificationRepository (interface)
│   ├── AlertReadRepository (interface)
│   ├── ConditionEvaluator (sealed)
│   └── NotificationDispatcher (interface)
├── outbound/                      (admin-managed outbound connections)
│   ├── OutboundConnection
│   ├── OutboundAuth (sealed: NONE, BEARER, BASIC)
│   ├── TrustMode (enum)
│   └── OutboundConnectionRepository (interface)
└── http/                          (cross-cutting outbound HTTP primitive)
    ├── OutboundHttpProperties
    ├── OutboundHttpRequestContext
    └── OutboundHttpClientFactory (interface)

cameleer-server-app/src/main/java/com/cameleer/server/app/
├── alerting/
│   ├── controller/                (REST)
│   │   ├── AlertRuleController
│   │   ├── AlertController
│   │   ├── AlertSilenceController
│   │   └── AlertNotificationController
│   ├── storage/                   (Postgres)
│   │   ├── PostgresAlertRuleRepository
│   │   ├── PostgresAlertInstanceRepository
│   │   ├── PostgresAlertSilenceRepository
│   │   ├── PostgresAlertNotificationRepository
│   │   └── PostgresAlertReadRepository
│   ├── eval/                      (the scheduled evaluators)
│   │   ├── AlertEvaluatorJob (@Scheduled, claim-based)
│   │   ├── RouteMetricEvaluator
│   │   ├── ExchangeMatchEvaluator
│   │   ├── AgentStateEvaluator
│   │   ├── DeploymentStateEvaluator
│   │   ├── LogPatternEvaluator
│   │   ├── JvmMetricEvaluator
│   │   ├── PerKindCircuitBreaker
│   │   └── TickCache
│   ├── notify/
│   │   ├── NotificationDispatchJob (@Scheduled, claim-based)
│   │   ├── InAppInboxQuery
│   │   ├── WebhookDispatcher
│   │   ├── MustacheRenderer
│   │   └── SilenceMatcher
│   ├── dto/                       (AlertRuleDto, AlertDto, ConditionDto sealed, WebhookDto, etc.)
│   ├── retention/
│   │   └── AlertingRetentionJob (daily @Scheduled)
│   └── config/
│       └── AlertingProperties (@ConfigurationProperties)
├── outbound/
│   ├── controller/
│   │   └── OutboundConnectionAdminController
│   ├── storage/
│   │   └── PostgresOutboundConnectionRepository
│   └── dto/
│       └── OutboundConnectionDto
└── http/
    ├── ApacheOutboundHttpClientFactory
    ├── SslContextBuilder
    └── config/
        └── OutboundHttpConfig (@ConfigurationProperties)

cameleer-server-app/src/main/resources/
├── db/migration/V11__alerting_and_outbound.sql   (one Flyway migration)
└── clickhouse/V_alerting_projections.sql         (one CH migration, idempotent)

ui/src/
├── pages/Alerts/
│   ├── InboxPage.tsx
│   ├── AllAlertsPage.tsx
│   ├── RulesListPage.tsx
│   ├── RuleEditor/
│   │   ├── RuleEditorWizard.tsx
│   │   ├── ScopeStep.tsx
│   │   ├── ConditionStep.tsx
│   │   ├── TriggerStep.tsx
│   │   ├── NotifyStep.tsx
│   │   └── ReviewStep.tsx
│   ├── SilencesPage.tsx
│   └── HistoryPage.tsx
├── pages/Admin/
│   └── OutboundConnectionsPage.tsx
├── components/
│   ├── NotificationBell.tsx
│   └── AlertStateChip.tsx
├── api/queries/
│   ├── alerts.ts
│   ├── alertRules.ts
│   ├── alertSilences.ts
│   └── outboundConnections.ts
└── cmdk/sources/
    ├── alerts.ts
    └── alertRules.ts
```

### Touchpoints on existing code (deliberate, minimal)

| Existing surface | Change | Scope |
|---|---|---|
| `cameleer-server-app/src/main/resources/db/migration/V11__…` | New Flyway migration | additive |
| `cameleer-server-app/src/main/resources/clickhouse/V_…_projections.sql` | New CH migration | additive, `IF NOT EXISTS` |
| `ClickHouseLogStore` | New method `long countLogs(LogSearchRequest)` (no `FINAL`) | one public method added |
| `ClickHouseSearchIndex` | New method `long countExecutionsForAlerting(AlertMatchSpec)` (no `FINAL`, no text-in-body subqueries) | one public method added |
| `SecurityConfig` | Path matchers for new endpoints | ~15 lines |
| `ui/src/router.tsx` | Route entries for `/alerts/**` and `/admin/outbound-connections` | additive |
| Top-nav layout | Insert `<NotificationBell />` | one import + one component |
| CMD-K registry | Register `alerts` + `alertRules` result sources | two file additions + one import |
| `.claude/rules/app-classes.md` + `core-classes.md` | Update class maps for the new packages | documentation |
| `com.cameleer:cameleer-common` | no changes | n/a |
| ingestion paths | no changes | n/a |
| agent protocol | no changes | n/a |
| ClickHouse schema (table structure) | no changes, only projections added | n/a |

### New dependencies

- `com.samskivert:jmustache`: logic-less Mustache templating for webhook/notification templates. ~30 KB, zero transitive deps. Added to `cameleer-server-core`.
- Apache HttpClient 5 (`org.apache.hc.client5`): **already present** in the project; no new coordinate.

---

## 5. Data model (PostgreSQL)

One Flyway migration `V11__alerting_and_outbound.sql` creates all tables, enums, and indexes in a single transaction.

### Enum types

```sql
CREATE TYPE severity_enum            AS ENUM ('CRITICAL','WARNING','INFO');
CREATE TYPE condition_kind_enum      AS ENUM ('ROUTE_METRIC','EXCHANGE_MATCH','AGENT_STATE','DEPLOYMENT_STATE','LOG_PATTERN','JVM_METRIC');
CREATE TYPE alert_state_enum         AS ENUM ('PENDING','FIRING','ACKNOWLEDGED','RESOLVED');
CREATE TYPE target_kind_enum         AS ENUM ('USER','GROUP','ROLE');
CREATE TYPE notification_status_enum AS ENUM ('PENDING','DELIVERED','FAILED');
CREATE TYPE trust_mode_enum          AS ENUM ('SYSTEM_DEFAULT','TRUST_ALL','TRUST_PATHS');
CREATE TYPE outbound_method_enum     AS ENUM ('POST','PUT','PATCH');
CREATE TYPE outbound_auth_kind_enum  AS ENUM ('NONE','BEARER','BASIC');
```

### Tables

#### `outbound_connections` (admin-managed)

```sql
CREATE TABLE outbound_connections (
  id                      uuid PRIMARY KEY,
  tenant_id               varchar(64) NOT NULL,
  name                    varchar(100) NOT NULL,
  description             text,
  url                     text NOT NULL,                               -- Mustache-enabled
  method                  outbound_method_enum NOT NULL,
  default_headers         jsonb NOT NULL DEFAULT '{}',                 -- values are Mustache templates
  default_body_tmpl       text,                                        -- null = built-in default JSON envelope
  tls_trust_mode          trust_mode_enum NOT NULL DEFAULT 'SYSTEM_DEFAULT',
  tls_ca_pem_paths        jsonb NOT NULL DEFAULT '[]',                 -- array of paths from OutboundHttpProperties
  hmac_secret             text,                                        -- Ed25519-key-derived encryption at rest
  auth_kind               outbound_auth_kind_enum NOT NULL DEFAULT 'NONE',
  auth_config             jsonb NOT NULL DEFAULT '{}',                 -- shape depends on auth_kind; v1 unused
  allowed_environment_ids uuid[] NOT NULL DEFAULT '{}',                -- [] = allowed in all envs
  created_at              timestamptz NOT NULL DEFAULT now(),
  created_by              uuid NOT NULL REFERENCES users(id),
  updated_at              timestamptz NOT NULL DEFAULT now(),
  updated_by              uuid NOT NULL REFERENCES users(id),
  UNIQUE (tenant_id, name)
);
CREATE INDEX outbound_connections_tenant_idx ON outbound_connections (tenant_id);
```

#### `alert_rules`

```sql
CREATE TABLE alert_rules (
  id                          uuid PRIMARY KEY,
  environment_id              uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
  name                        varchar(200) NOT NULL,
  description                 text,
  severity                    severity_enum NOT NULL,
  enabled                     boolean NOT NULL DEFAULT true,
  condition_kind              condition_kind_enum NOT NULL,
  condition                   jsonb NOT NULL,                          -- sealed-subtype payload, Jackson polymorphic on `kind`
  evaluation_interval_seconds int NOT NULL DEFAULT 60 CHECK (evaluation_interval_seconds >= 5),
  for_duration_seconds        int NOT NULL DEFAULT 0 CHECK (for_duration_seconds >= 0),
  re_notify_minutes           int NOT NULL DEFAULT 60 CHECK (re_notify_minutes >= 0),
  notification_title_tmpl     text NOT NULL,                           -- Mustache
  notification_message_tmpl   text NOT NULL,                           -- Mustache
  -- webhooks: [{id: uuid, outboundConnectionId, bodyOverride?, headerOverrides?}]
  -- id is assigned server-side on save and used as a stable ref from alert_notifications.webhook_id
  webhooks                    jsonb NOT NULL DEFAULT '[]',
  next_evaluation_at          timestamptz NOT NULL DEFAULT now(),
  claimed_by                  varchar(64),
  claimed_until               timestamptz,
  eval_state                  jsonb NOT NULL DEFAULT '{}',
  created_at                  timestamptz NOT NULL DEFAULT now(),
  created_by                  uuid NOT NULL REFERENCES users(id),
  updated_at                  timestamptz NOT NULL DEFAULT now(),
  updated_by                  uuid NOT NULL REFERENCES users(id)
);
CREATE INDEX alert_rules_env_idx ON alert_rules (environment_id);
CREATE INDEX alert_rules_claim_due_idx ON alert_rules (next_evaluation_at) WHERE enabled = true;
```

#### `alert_rule_targets`

```sql
CREATE TABLE alert_rule_targets (
  id          uuid PRIMARY KEY,
  rule_id     uuid NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE,
  target_kind target_kind_enum NOT NULL,
  target_id   varchar(128) NOT NULL,
  UNIQUE (rule_id, target_kind, target_id)
);
CREATE INDEX alert_rule_targets_lookup_idx ON alert_rule_targets (target_kind, target_id);
```

#### `alert_instances`

```sql
CREATE TABLE alert_instances (
  id                uuid PRIMARY KEY,
  rule_id           uuid REFERENCES alert_rules(id) ON DELETE SET NULL,
  rule_snapshot     jsonb NOT NULL,
  environment_id    uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
  state             alert_state_enum NOT NULL,
  severity          severity_enum NOT NULL,
  fired_at          timestamptz NOT NULL,
  acked_at          timestamptz,
  acked_by          uuid REFERENCES users(id),
  resolved_at       timestamptz,
  last_notified_at  timestamptz,
  silenced          boolean NOT NULL DEFAULT false,
  current_value     numeric,
  threshold         numeric,
  context           jsonb NOT NULL,
  title             text NOT NULL,
  message           text NOT NULL,
  target_user_ids   uuid[] NOT NULL DEFAULT '{}',
  target_group_ids  uuid[] NOT NULL DEFAULT '{}',
  target_role_names text[] NOT NULL DEFAULT '{}'
);
CREATE INDEX alert_instances_inbox_idx     ON alert_instances (environment_id, state, fired_at DESC);
CREATE INDEX alert_instances_open_rule_idx ON alert_instances (rule_id, state) WHERE rule_id IS NOT NULL;
CREATE INDEX alert_instances_resolved_idx  ON alert_instances (resolved_at) WHERE state = 'RESOLVED';
CREATE INDEX alert_instances_target_u_idx  ON alert_instances USING GIN (target_user_ids);
CREATE INDEX alert_instances_target_g_idx  ON alert_instances USING GIN (target_group_ids);
CREATE INDEX alert_instances_target_r_idx  ON alert_instances USING GIN (target_role_names);
```

#### `alert_silences`

```sql
CREATE TABLE alert_silences (
  id             uuid PRIMARY KEY,
  environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
  matcher        jsonb NOT NULL,                                       -- { ruleId?, appSlug?, routeId?, agentId?, severity? }
  reason         text,
  starts_at      timestamptz NOT NULL,
  ends_at        timestamptz NOT NULL CHECK (ends_at > starts_at),
  created_by     uuid NOT NULL REFERENCES users(id),
  created_at     timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_silences_active_idx ON alert_silences (environment_id, ends_at);
```

#### `alert_notifications` (webhook delivery outbox)

```sql
CREATE TABLE alert_notifications (
  id                     uuid PRIMARY KEY,
  alert_instance_id      uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
  webhook_id             uuid,                                         -- opaque ref into rule's webhooks JSONB
  outbound_connection_id uuid REFERENCES outbound_connections(id) ON DELETE SET NULL,
  status                 notification_status_enum NOT NULL DEFAULT 'PENDING',
  attempts               int NOT NULL DEFAULT 0,
  next_attempt_at        timestamptz NOT NULL DEFAULT now(),
  claimed_by             varchar(64),
  claimed_until          timestamptz,
  last_response_status   int,
  last_response_snippet  text,
  payload                jsonb NOT NULL,                               -- snapshotted at first attempt
  delivered_at           timestamptz,
  created_at             timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_notifications_pending_idx  ON alert_notifications (next_attempt_at) WHERE status = 'PENDING';
CREATE INDEX alert_notifications_instance_idx ON alert_notifications (alert_instance_id);
```

#### `alert_reads`

```sql
CREATE TABLE alert_reads (
  user_id           uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  alert_instance_id uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
  read_at           timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (user_id, alert_instance_id)
);
```

### Cascade summary

```
environments → alert_rules (CASCADE) → alert_rule_targets (CASCADE)
environments → alert_silences (CASCADE)
environments → alert_instances (CASCADE) → alert_reads (CASCADE)
                                         → alert_notifications (CASCADE)
alert_rules  → alert_instances (SET NULL, rule_snapshot preserves context)
users        → alert_reads (CASCADE)
outbound_connections (delete): blocked by an app-level 409 check on references
                               from alert_rules.webhooks JSONB (JSONB cannot carry an FK)
```

**Rule deletion** preserves history (`alert_instances.rule_id = NULL`, `rule_snapshot` retains details). **Environment deletion** leaves zero alerting rows, so it is POC-safe.

### Jackson polymorphism for conditions

```java
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonSubTypes.Type;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// The subtype is chosen from the payload's own "kind" field (EXISTING_PROPERTY),
// and visible = true keeps that field available to the record itself.
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "kind",
              include = JsonTypeInfo.As.EXISTING_PROPERTY, visible = true)
@JsonSubTypes({
    @Type(value = RouteMetricCondition.class,     name = "ROUTE_METRIC"),
    @Type(value = ExchangeMatchCondition.class,   name = "EXCHANGE_MATCH"),
    @Type(value = AgentStateCondition.class,      name = "AGENT_STATE"),
    @Type(value = DeploymentStateCondition.class, name = "DEPLOYMENT_STATE"),
    @Type(value = LogPatternCondition.class,      name = "LOG_PATTERN"),
    @Type(value = JvmMetricCondition.class,       name = "JVM_METRIC"),
})
public sealed interface AlertCondition permits
    RouteMetricCondition, ExchangeMatchCondition, AgentStateCondition,
    DeploymentStateCondition, LogPatternCondition, JvmMetricCondition {
    ConditionKind kind();
}
```

Each payload carries its own `kind` field, which Jackson reads (`EXISTING_PROPERTY`) to pick the subtype and which the record still exposes as `ConditionKind kind()`. Bean Validation (`@Valid`) on each record validates at the controller boundary.

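One practical payoff of the sealed hierarchy is that evaluator code can dispatch on the concrete condition type while the permitted set stays closed. The following is a minimal sketch under stated assumptions, not the spec's implementation: it uses only two trimmed-down condition records, and `ThresholdComparator`, `ConditionSketch`, `breaches`, and `describe` are hypothetical names introduced for illustration. It also derives `kind()` from the type instead of storing it as a record component.

```java
// Hypothetical, trimmed-down stand-ins for the spec's condition types.
enum ConditionKind { ROUTE_METRIC, AGENT_STATE }

// Assumed comparator set; the spec's examples only show "GT".
enum ThresholdComparator { GT, GTE, LT, LTE }

sealed interface AlertCondition permits RouteMetricCondition, AgentStateCondition {
    ConditionKind kind();
}

record RouteMetricCondition(String metric, ThresholdComparator comparator,
                            double threshold, int windowSeconds) implements AlertCondition {
    @Override public ConditionKind kind() { return ConditionKind.ROUTE_METRIC; }
}

record AgentStateCondition(String state, int forSeconds) implements AlertCondition {
    @Override public ConditionKind kind() { return ConditionKind.AGENT_STATE; }
}

final class ConditionSketch {
    private ConditionSketch() {}

    /** Returns true when the observed value violates the threshold under the given comparator. */
    static boolean breaches(ThresholdComparator comparator, double observed, double threshold) {
        switch (comparator) {
            case GT:  return observed > threshold;
            case GTE: return observed >= threshold;
            case LT:  return observed < threshold;
            case LTE: return observed <= threshold;
            default:  throw new IllegalArgumentException("unknown comparator: " + comparator);
        }
    }

    /** Dispatches on the concrete subtype; the sealed interface keeps the set of cases closed. */
    static String describe(AlertCondition condition) {
        if (condition instanceof RouteMetricCondition r) {
            return r.metric() + " " + r.comparator() + " " + r.threshold()
                    + " over " + r.windowSeconds() + "s";
        }
        if (condition instanceof AgentStateCondition a) {
            return "agent " + a.state() + " for " + a.forSeconds() + "s";
        }
        throw new IllegalStateException("unhandled kind: " + condition.kind());
    }
}
```

For instance, `ConditionSketch.breaches(ThresholdComparator.GT, 2500, 2000)` is true for a p99 latency of 2500 ms against a 2000 ms threshold. Adding a seventh kind would mean extending the `permits` clause and adding one more branch.
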
Example condition payloads:

```json
// ROUTE_METRIC
{ "kind": "ROUTE_METRIC", "scope": {"appSlug":"orders","routeId":"route-1"}, "metric": "P99_LATENCY_MS", "comparator": "GT", "threshold": 2000, "windowSeconds": 300 }

// EXCHANGE_MATCH (PER_EXCHANGE)
{ "kind": "EXCHANGE_MATCH", "scope": {"appSlug":"orders"}, "filter": {"status":"FAILED","attributes":{"type":"payment"}}, "fireMode": "PER_EXCHANGE", "perExchangeLingerSeconds": 300 }

// EXCHANGE_MATCH (COUNT_IN_WINDOW)
{ "kind": "EXCHANGE_MATCH", "scope": {"appSlug":"orders"}, "filter": {"status":"FAILED"}, "fireMode": "COUNT_IN_WINDOW", "threshold": 5, "windowSeconds": 900 }

// AGENT_STATE
{ "kind": "AGENT_STATE", "scope": {"appSlug":"orders"}, "state": "DEAD", "forSeconds": 60 }

// DEPLOYMENT_STATE
{ "kind": "DEPLOYMENT_STATE", "scope": {"appSlug":"orders"}, "states": ["FAILED","DEGRADED"] }

// LOG_PATTERN
{ "kind": "LOG_PATTERN", "scope": {"appSlug":"orders"}, "level": "ERROR", "pattern": "TimeoutException", "threshold": 5, "windowSeconds": 900 }

// JVM_METRIC
{ "kind": "JVM_METRIC", "scope": {"appSlug":"orders"}, "metric": "heap_used_percent", "aggregation": "MAX", "comparator": "GT", "threshold": 90, "windowSeconds": 300 }
```

### Claim-polling queries

```sql
-- Rule evaluator
UPDATE alert_rules
SET claimed_by = :instance,
    claimed_until = now() + interval '30 seconds'
WHERE id IN (
  SELECT id FROM alert_rules
  WHERE enabled = true
    AND next_evaluation_at <= now()
    AND (claimed_until IS NULL OR claimed_until < now())
  ORDER BY next_evaluation_at
  LIMIT :batch
  FOR UPDATE SKIP LOCKED
)
RETURNING *;

-- Notification dispatcher (same pattern on alert_notifications with status='PENDING')
```

`FOR UPDATE SKIP LOCKED` is the crux: replicas never block each other.

---

## 6. Outbound connections

### Concept

An `OutboundConnection` is a reusable, admin-managed HTTPS destination. Alert rules reference connections by ID and may override body or header templates per rule. Rotating a URL or secret updates every rule atomically.

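The override behaviour described above can be pictured as a small resolution step at dispatch time. This is only a sketch under assumptions: the spec does not say whether `headerOverrides` merge with or replace `default_headers`, so the merge shown here (rule values win per header name) is a guess, and `WebhookOverrideSketch`, `effectiveHeaders`, `effectiveBody`, and the placeholder body constant are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WebhookOverrideSketch {
    private WebhookOverrideSketch() {}

    // Placeholder standing in for the built-in default JSON envelope.
    static final String BUILT_IN_DEFAULT_BODY = "<built-in default JSON envelope>";

    /**
     * Connection defaults first, then per-rule overrides on top.
     * Assumption: overrides win per header name; untouched defaults are kept.
     */
    static Map<String, String> effectiveHeaders(Map<String, String> connectionDefaults,
                                                Map<String, String> ruleOverrides) {
        Map<String, String> merged = new LinkedHashMap<>(connectionDefaults);
        if (ruleOverrides != null) {
            merged.putAll(ruleOverrides);
        }
        return merged;
    }

    /** Rule override, else the connection's template, else the built-in default. */
    static String effectiveBody(String connectionBodyTmpl, String ruleBodyOverride) {
        if (ruleBodyOverride != null) {
            return ruleBodyOverride;
        }
        if (connectionBodyTmpl != null) {
            return connectionBodyTmpl;
        }
        return BUILT_IN_DEFAULT_BODY;
    }
}
```

Because rules hold only the connection ID plus optional overrides, changing the connection's URL or default headers takes effect for every referencing rule the next time this resolution runs.
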
**Tenant-global.** Slack URLs and PagerDuty keys are team infrastructure, not env-specific. Env-specific routing is achieved by creating multiple connections (`slack-prod`, `slack-dev`) and referencing the appropriate one in each env's rules.

**Allowed-env restriction.** `allowed_environment_ids` (default empty = all envs). An admin restricts a connection to specific envs via a multi-select on the connection form. The UI picker filters by current env; rule save validates (422 on violation); narrowing the restriction while rules still reference the connection returns 409 with a conflict list.

**Delete semantics.** 409 if any rule references the connection. No silent cascade; the admin must first remove references.

### Default body template (when rule has no override)

```json
{
  "alert":   { "id", "state", "firedAt", "severity", "title", "message", "link" },
  "rule":    { "id", "name", "description", "severity" },
  "env":     { "slug", "id" },
  "context": { /* full Mustache context: app, route, agent, exchange, etc. */ }
}
```

"Just plug in my Slack incoming webhook URL" works without writing a template.

### HMAC signing (optional per connection)

When `hmac_secret` is set, dispatch adds an `X-Cameleer-Signature: sha256=<hex digest>` header (the GitHub / Stripe pattern). The secret is encrypted at rest; the concrete approach (Jasypt vs bespoke over existing Ed25519-derived key material) is decided in planning (see §20).

---

## 7. Rule evaluation

### Scheduler

```java
import static java.util.stream.Collectors.groupingBy;

import java.util.List;
import org.springframework.scheduling.annotation.SchedulingConfigurer;
import org.springframework.scheduling.config.ScheduledTaskRegistrar;
import org.springframework.stereotype.Component;

@Component
public class AlertEvaluatorJob implements SchedulingConfigurer {

    // ruleRepo, properties, circuitBreaker and instanceId are injected (constructor omitted).
    // Interval wired via AlertingProperties.evaluatorTickIntervalMs (floor 5000).

    @Override
    public void configureTasks(ScheduledTaskRegistrar registrar) {
        registrar.addFixedDelayTask(this::tick, properties.effectiveEvaluatorTickIntervalMs());
    }

    void tick() {
        List<AlertRule> claimed = ruleRepo.claimDueRules(instanceId, properties.batchSize());
        // Group by (kind, window) so rules of the same shape share one coalesced evaluation.
        var groups = claimed.stream()
            .collect(groupingBy(r -> new GroupKey(r.conditionKind(), windowSeconds(r))));
        for (var entry : groups.entrySet()) {
            if (circuitBreaker.isOpen(entry.getKey().kind())) {
                rescheduleBatch(entry.getValue());
                continue;
            }
            try {
                coalescedEvaluate(entry.getKey(), entry.getValue());
            } catch (Exception e) {
                circuitBreaker.recordFailure(entry.getKey().kind());
                rescheduleBatch(entry.getValue());
            }
        }
    }
}
```

### Per-condition evaluators

| Kind | Read source | Query shape |
|---|---|---|
| `ROUTE_METRIC` | `SearchService.statsForRoute` / `statsForApp` | Stats over window; comparator vs threshold |
| `EXCHANGE_MATCH` (PER_EXCHANGE) | `SearchService.search(SearchRequest)` | `timestamp > eval_state.lastExchangeTs AND filter` → fire one alert per match, advance cursor |
| `EXCHANGE_MATCH` (COUNT_IN_WINDOW) | `ClickHouseSearchIndex.countExecutionsForAlerting(spec)` | Count in window vs threshold |
| `AGENT_STATE` | `AgentRegistryService.listByEnvironment` | Any agent matches scope + state |
| `DEPLOYMENT_STATE` | `DeploymentRepository.findLatestByAppAndEnv` | Status in target set |
| `LOG_PATTERN` | `ClickHouseLogStore.countLogs(LogSearchRequest)` | Count in window vs threshold |
| `JVM_METRIC` | `MetricsQueryStore` | Latest value (aggregation per rule) vs threshold |

### State machine

```
(cond holds for […]
```

4. […] discarded at tick end. Two rules hitting the same `(app, route, window, metric)` produce one CH call.
5. **Per-kind circuit breaker.** 5 failures in 30 s → open for 60 s. Metric `alerting_circuit_open_total{kind}`. UI surfaces an admin banner when open.

### Silence matching

At notification-dispatch time (not evaluation time):

```sql
SELECT 1 FROM alert_silences
WHERE environment_id = :env
  AND now() BETWEEN starts_at AND ends_at
  AND matcher_matches(matcher, :instanceContext)
LIMIT 1;
```

If any silence matches → `alert_instances.silenced = true`, no webhook dispatch, no re-notification. The inbox still shows the instance with a silenced pill, so the audit trail is preserved.

### Failure modes

| Failure | Behavior |
|---|---|
| Read interface throws | Log WARN, increment `alerting_eval_errors_total{kind, rule_id}`, reschedule rule, release claim |
| 10 consecutive failures for a rule | Mark `eval_state.disabledReason`, surface in UI |
| Template render error | Fall back to literal `{{var}}` in output, log WARN, still dispatch |
| Slow evaluator | Claim TTL 30 s; investigate if sustained |
| Rule deleted mid-eval | FK cascade waits on the row lock; effectively serialized |
| Env deleted mid-eval | FK cascade waits; effectively serialized |

---

## 8. Notification dispatch

### In-app inbox: derived, not materialized

```sql
SELECT ai.* FROM alert_instances ai
WHERE ai.environment_id = :env
  AND ai.state IN ('FIRING','ACKNOWLEDGED','RESOLVED')
  AND (
    :me = ANY(ai.target_user_ids)
    OR ai.target_group_ids && :my_group_ids
    OR ai.target_role_names && :my_role_names
  )
ORDER BY ai.fired_at DESC
LIMIT 100;
```

`:my_group_ids` and `:my_role_names` are resolved once per request from `RbacService`.

**Bell badge count:** same filter + `state IN ('FIRING','ACKNOWLEDGED')` + `NOT EXISTS (SELECT 1 FROM alert_reads ar WHERE ar.user_id = :me AND ar.alert_instance_id = ai.id)`, count-only. Server-side 5 s memoization per `(env, user)` keeps bell polling cheap.

### Webhook outbox: claim-based

`NotificationDispatchJob` claims due notifications (`status='PENDING' AND next_attempt_at <= now()`) and dispatches. The HTTP client comes from the shared `OutboundHttpClientFactory`, with TLS config from the referenced outbound connection.

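As a rough illustration of what happens after each dispatch attempt, the sketch below maps an outcome to the next notification status and retry delay, following the outcome rules of this section. Several points are assumptions rather than spec: a network error or timeout is encoded as status `-1`, anything that is neither 2xx nor 4xx is treated as retryable, and the schedule is read as three retries (four attempts in total) before `FAILED`. `DeliveryPolicySketch`, `nextStatus`, and `retryDelay` are hypothetical names.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;

enum NotificationStatus { PENDING, DELIVERED, FAILED }

final class DeliveryPolicySketch {
    private DeliveryPolicySketch() {}

    // Waits before the 1st, 2nd and 3rd retry: 30 s, 2 m, 5 m.
    static final List<Duration> RETRY_DELAYS =
            List.of(Duration.ofSeconds(30), Duration.ofMinutes(2), Duration.ofMinutes(5));

    /**
     * @param httpStatus    response code, or -1 for a network error / timeout (assumed encoding)
     * @param attemptsSoFar number of attempts made, including the one that just finished
     */
    static NotificationStatus nextStatus(int httpStatus, int attemptsSoFar) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return NotificationStatus.DELIVERED;
        }
        if (httpStatus >= 400 && httpStatus < 500) {
            return NotificationStatus.FAILED; // a retry would not help
        }
        // 5xx, network error, timeout: stay PENDING while a retry slot is left.
        return attemptsSoFar <= RETRY_DELAYS.size()
                ? NotificationStatus.PENDING
                : NotificationStatus.FAILED;
    }

    /** Delay to add to now() for next_attempt_at, or empty once retries are exhausted. */
    static Optional<Duration> retryDelay(int attemptsSoFar) {
        if (attemptsSoFar >= 1 && attemptsSoFar <= RETRY_DELAYS.size()) {
            return Optional.of(RETRY_DELAYS.get(attemptsSoFar - 1));
        }
        return Optional.empty();
    }
}
```

A dispatcher would call `nextStatus` with the response of the attempt it just made and, when the result is `PENDING`, push `next_attempt_at` forward by `retryDelay`.
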
- **2xx** → `DELIVERED`
- **4xx** → `FAILED` immediately (retry won't help); log at WARN
- **5xx / network / timeout** → retry with increasing backoff 30 s → 2 m → 5 m, then `FAILED`
- Manual retry: `POST /alerts/notifications/{id}/retry` (OPERATOR+)

The payload is rendered at the **first** dispatch attempt and snapshotted in `alert_notifications.payload`. Retries replay the snapshot, so template edits after fire don't affect in-flight notifications.

### Template rendering

JMustache (`com.samskivert:jmustache`). Logic-less, industry-standard syntax.

**Rendered surfaces:** URL (query-string interpolation), header values, body, and separately `alert_instances.title` / `message` rendered once at fire.

**Context map** (dot-notation + camelCase leaves):

```
env.slug  env.id
rule.id  rule.name  rule.severity  rule.description
alert.id  alert.state  alert.firedAt  alert.resolvedAt  alert.ackedBy  alert.link
alert.currentValue  alert.threshold  alert.comparator  alert.window
app.slug  app.id  app.displayName
route.id
agent.id  agent.name  agent.state
exchange.id  exchange.status  exchange.link
deployment.id  deployment.status
log.logger  log.level  log.message
metric.name  metric.value
```

**Error handling.** A missing variable renders as the literal `{{var.name}}` plus a WARN log. A malformed template falls back to the built-in default plus a WARN. Never drop a notification due to a template error.

**"Test render" endpoint:** `POST /alerts/rules/{id}/render-preview` drives the rule editor's Preview button.

---

## 9. Rule promotion across environments

**UX.** Rule list row → **Promote to ▾** menu of other envs in the tenant → open rule editor pre-populated with source rule's payload, target env selected. Banner: *"Promoting `<rule>` from `<source env>` → `<target env>`. Review and adjust, then save."* Save → normal `POST /api/v1/environments/{dstEnvSlug}/alerts/rules`. Source unaffected (it's a copy).

**Pure UI flow, no new server endpoint.** Re-uses the existing GET (to fetch) and POST (to create) paths.

**Prefill-time validation (client-side warnings, non-blocking):** | Field | Check | Behavior | |---|---|---| | `scope.appSlug` | Does app exist in target env? | ⚠ warn + picker from target env's apps | | `scope.agentId` | Per-env; can't transfer | Clear field, keep appSlug, note | | `scope.routeId` | Per-app logical ID, stable | ✓ pass through | | `targets[]` | Tenant-scoped | ✓ transfer as-is | | `webhooks[].outboundConnectionId` | Target env allowed by connection? | ⚠ warn if not; disable save until resolved | Bulk promotion (select multiple → promote all) deferred until usage patterns justify it. --- ## 10. Cross-cutting: outbound HTTP & TLS trust Shared module — not inside `alerting/`. ### `OutboundHttpClientFactory` ```java public interface OutboundHttpClientFactory { CloseableHttpClient clientFor(OutboundHttpRequestContext context); } public record OutboundHttpRequestContext( TrustMode trustMode, // SYSTEM_DEFAULT | TRUST_ALL | TRUST_PATHS List trustedCaPemPaths, Duration connectTimeout, Duration readTimeout ) {} ``` Implementation (`ApacheOutboundHttpClientFactory`) memoizes one `CloseableHttpClient` per unique effective config — not one per call. ### System config (`cameleer.server.outbound-http.*`) ```yaml cameleer: server: outbound-http: trust-all: false # global kill-switch; WARN logged if true trusted-ca-pem-paths: # additional roots layered on JVM default - /etc/cameleer/certs/corporate-root.pem - /etc/cameleer/certs/acme-internal.pem default-connect-timeout-ms: 2000 default-read-timeout-ms: 5000 proxy-url: # optional; null = no proxy proxy-username: proxy-password: ``` On startup: if `trust-all=true`, log red WARN (not suitable for production). If `trusted-ca-pem-paths` has entries, verify each path exists; fail-fast on missing files. ### Per-connection overrides Each `OutboundConnection` carries `tls_trust_mode` + `tls_ca_pem_paths`. 
UI surfaces a dropdown: **System default (validated)** / **Trust custom CAs (from server config)** / **Trust all (insecure — testing only)**. Amber warning when *Trust all* selected. Audit logged (`AuditCategory.OUTBOUND_HTTP_TRUST_CHANGE`).

### Deferred

See **BL-001 / [gitea#137](https://gitea.siegeln.net/cameleer/cameleer-server/issues/137)**:

- In-app CA bundle upload / admin management
- SaaS-layer CA reuse investigation (do first)

---

## 11. API surface

All env-scoped routes under `/api/v1/environments/{envSlug}/alerts/...` via existing `@EnvPath` resolver.

### Alerting — rules

| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts/rules` | VIEWER+ |
| `POST` | `/alerts/rules` | OPERATOR+ |
| `GET` | `/alerts/rules/{id}` | VIEWER+ |
| `PUT` | `/alerts/rules/{id}` | OPERATOR+ |
| `DELETE` | `/alerts/rules/{id}` | OPERATOR+ |
| `POST` | `/alerts/rules/{id}/enable` · `/disable` | OPERATOR+ |
| `POST` | `/alerts/rules/{id}/render-preview` | OPERATOR+ |
| `POST` | `/alerts/rules/{id}/test-evaluate` | OPERATOR+ |

### Alerting — instances

| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts` | VIEWER+ |
| `GET` | `/alerts/unread-count` | VIEWER+ |
| `GET` | `/alerts/{id}` | VIEWER+ |
| `POST` | `/alerts/{id}/ack` | VIEWER+ (if targeted) / OPERATOR+ |
| `POST` | `/alerts/{id}/read` | VIEWER+ (self) |
| `POST` | `/alerts/bulk-read` | VIEWER+ (self) |

### Alerting — silences

| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts/silences` | VIEWER+ |
| `POST` | `/alerts/silences` | OPERATOR+ |
| `PUT` | `/alerts/silences/{id}` | OPERATOR+ |
| `DELETE` | `/alerts/silences/{id}` | OPERATOR+ |

### Alerting — notifications

| Method | Path | Role |
|---|---|---|
| `GET` | `/alerts/{id}/notifications` | VIEWER+ |
| `POST` | `/alerts/notifications/{id}/retry` | OPERATOR+ |

### Outbound connections (admin)

| Method | Path | Role |
|---|---|---|
| `GET` | `/api/v1/admin/outbound-connections` | ADMIN / OPERATOR (read-only) |
| `POST` | `/api/v1/admin/outbound-connections` | ADMIN |
| `GET` | `/api/v1/admin/outbound-connections/{id}` | ADMIN / OPERATOR (read-only) |
| `PUT` | `/api/v1/admin/outbound-connections/{id}` | ADMIN (409 if narrowing breaks references) |
| `DELETE` | `/api/v1/admin/outbound-connections/{id}` | ADMIN (409 if referenced) |
| `POST` | `/api/v1/admin/outbound-connections/{id}/test` | ADMIN |
| `GET` | `/api/v1/admin/outbound-connections/{id}/usage` | ADMIN / OPERATOR |

### OpenAPI regen

Per `CLAUDE.md` convention: after controller/DTO changes, run `cd ui && npm run generate-api:live` (backend on :8081) to regenerate `ui/src/api/schema.d.ts`. Commit regen alongside controller change.

---

## 12. CMD-K integration

Two new result sources registered in the existing UI registry (`ui/src/cmdk/sources/`):

- **Alerts** — queries `/alerts?q=...&limit=5` (server-side fulltext against title / message / rule_snapshot); results show severity icon + state chip; deep-link to `/alerts/inbox/{id}`.
- **Alert Rules** — queries `/alerts/rules?q=...&limit=5`; deep-link to `/alerts/rules/{id}`.

No new registry machinery — uses the existing extension point.

---

## 13. UI

### Routes

```
/alerts
├── /inbox            (default landing)
├── /all
├── /rules
│   ├── /new
│   └── /{id}         (edit; accepts ?promoteFrom=&ruleId=)
├── /silences
└── /history

/admin/outbound-connections
├── /
├── /new
└── /{id}
```

### Top-nav

Insert `` between env selector and user menu. Badge severity = `max(severities of unread targeting me)` (CRITICAL → `var(--error)`, WARNING → `var(--amber)`, INFO → `var(--muted)`). Dropdown shows 5 most-recent unread with inline ack button + "See all".

### Alerts section

New sidebar/top-nav entry visible to `VIEWER+`. Authoring actions (`POST /rules`, silence create, etc.) gated to `OPERATOR+`.

### Rule editor — 5-step wizard

1. **Scope** — radio (env-wide / app / route / agent) + pickers from env catalog (existing endpoints).
2. **Condition** — radio (6 kinds) + kind-specific form.
3. **Trigger** — threshold + comparator + window + for-duration + evaluation interval + severity; inline *Test evaluate* button.
4. **Notify** — title + message templates with *Preview* button; targets multi-select (users / groups / roles with typeahead); outbound connections multi-select filtered by current env + `allowed_environment_ids`.
5. **Review** — summary card, enabled toggle, save.

### Template editor — Mustache with variable auto-complete

Every Mustache template-editable field — notification title, notification message, webhook URL, webhook header values, webhook body — uses a shared `` component with **variable auto-complete**. Users never have to guess what context variables are available.

**Behavior.**

- Typing `{{` opens a dropdown of available variables at the caret position.
- Each suggestion shows the variable path (`alert.firedAt`), its type (`Instant`), a one-line description, and a sample rendered value from the canned context.
- Filtering narrows the list as the user keeps typing (`{{ale…` → filters to `alert.*`).
- `Enter` / `Tab` inserts the path and closes `}}` automatically.
- Arrow keys + `Esc` follow standard combobox semantics (ARIA-conformant).

**Context-aware filtering.** The available variables depend on the rule's condition kind and scope. The editor is aware of both:

- Always shown: `env.*`, `rule.*`, `alert.*`
- `ROUTE_METRIC` with `route.id` set: adds `route.id`, `app.*`
- `EXCHANGE_MATCH`: adds `exchange.*`, `app.*`, `route.id` (if scoped)
- `AGENT_STATE`: adds `agent.*`, `app.*`
- `DEPLOYMENT_STATE`: adds `deployment.*`, `app.*`
- `LOG_PATTERN`: adds `log.*`, `app.*`
- `JVM_METRIC`: adds `metric.*`, `agent.*`, `app.*`

Variables that *might not* populate (e.g., `alert.resolvedAt` while state is FIRING) are shown with a grey "may be null" badge — users still see them so they can defensively template.

**Syntax checks inline.**

- Unclosed `{{` / unmatched `}}` flagged with a red underline + hover hint.
- Reference to an out-of-scope variable (e.g., `{{exchange.id}}` in a ROUTE_METRIC rule) flagged with an amber underline + hint ("not available for this rule kind — will render as literal").
- Checks run client-side on every keystroke (debounced); server-side render preview is still authoritative (§8).

**Shared implementation.** Same `` component is used in:

- Rule editor — Notify step (title, message)
- Rule editor — Webhook overrides (body override, header value overrides; URL not editable per rule, it's the connection's)
- Admin **Outbound Connections** editor — default body template, default header values, URL (URL gets a reduced context: only `env.*` since a connection URL is rule-agnostic)
- *Test render* inline preview — rendered output updates live as user types

**Completion engine.** Specific library choice (CodeMirror 6 with a custom completion extension vs Monaco vs a lighter custom overlay on `