hsiegeln a9ad0eb841 docs(alerting): spec for alerting feature + backlog entry BL-001

Comprehensive design spec for a confined, env-scoped alerting feature:
6 signal sources, shared env-scoped rules with RBAC-targeted notifications,
in-app inbox + webhook delivery via admin-managed outbound connections,
claim-based polling for horizontal scalability, 4 CH projections for hot-path
reads. Backlog entry BL-001 / gitea#137 tracks deferred managed-CA investigation
(reuse SaaS-layer CA handling first before building in-server storage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-19 14:58:38 +02:00

Alerting — Design Spec

Date: 2026-04-19
Status: Draft, awaiting user review
Surfaces: server (core + app), UI, admin, Gitea issues
Related: backlog BL-001 / gitea#137 (managed CA bundles, deferred)

1. Summary

A first-class alerting feature inside Cameleer. Operators author rules that evaluate conditions over observability data; violations create shared, env-scoped alerts visible in an in-app inbox and optionally dispatched to external systems via admin-managed outbound webhook connections. Lifecycle: PENDING → FIRING → ACKNOWLEDGED → RESOLVED, with SILENCED as an orthogonal property. Horizontally scalable via PostgreSQL claim-based polling. All code is confined to new alerting/, outbound/, and http/ packages with minimal, documented touchpoints on existing stores.

Guiding principles

"Good enough" baseline. Customers with dedicated ops tooling (PagerDuty, Grafana, Opsgenie) will keep using it — alerting here serves those without. Resist incident-management feature creep; provide the floor, not the ceiling.
Confinement over cleverness. Reads go through existing interfaces; no hooks into ingestion; no new ClickHouse tables; all new code in dedicated packages. The feature should be removable by deleting those packages and one migration.
Env-scoped by default, tenant-global where infrastructure. Rules, alerts, and silences live inside an environment. Outbound connections are tenant-global infrastructure that admins manage, optionally restricted by env.
Performance is a first-class design concern, not a v2 afterthought. Claim-polling, query coalescing, in-tick caching, per-kind circuit breaker, and CH projections are all v1.
No ClickHouse table changes, only projections. Additive, idempotent (IF NOT EXISTS), safe to drop and rebuild.

2. Scope

In scope (v1)

Six signal sources, expressed as sealed-type conditions:

ROUTE_METRIC — aggregate stats per route or app: error rate, p95/p99 latency, throughput, error count. Backed by stats_1m_route.
EXCHANGE_MATCH — per-exchange matching with two fire modes:
- PER_EXCHANGE — one alert per matching exchange (cursor-advanced, used for "specific failure" patterns)
- COUNT_IN_WINDOW — aggregate "N exchanges matched in window" threshold
AGENT_STATE — agent in DEAD / STALE state for ≥ N seconds. Reads in-memory AgentRegistryService.
DEPLOYMENT_STATE — deployment status is FAILED / DEGRADED for ≥ N seconds.
LOG_PATTERN — count of log rows matching level / logger / pattern in a window > threshold.
JVM_METRIC — agent-reported JVM/Camel metric (heap %, GC pressure, inflight) over threshold for a window.

Delivery channels. In-app inbox (derived from alerts + target-membership) and outbound HTTPS webhooks (via admin-managed outbound connections). No email. No native Slack/Teams integrations — users point webhook URLs at their own integrations.

Sharing model. Rules are shared within an environment; alerts are visible to any viewer of the env, but notifications route to targeted users, groups, or roles (via existing RBAC).

Lifecycle states. PENDING → FIRING → ACKNOWLEDGED → RESOLVED, with SILENCED as an orthogonal property resolved at notification-dispatch time (preserves audit trail).

Rule promotion across environments via UI prefill — no new server endpoint.

CMD-K integration — alerts + alert rules appear as new result sources in the existing CMD-K registry.

Configurable evaluator cadence (min 5 s floor), per-rule evaluation intervals, per-rule re-notify cadence.

Out of scope (v1, not deferred)

Custom SQL / Prometheus-style query DSL (option F).
Email delivery channel (webhooks cover Slack / PagerDuty / Teams / Opsgenie / n8n / Zapier via ops-team-owned integrations).
Native provider integrations (Slack, Teams, PagerDuty as first-class types).
Incident management (merging alerts, parent/child, assignees, SLA tracking) — integrate with PagerDuty via webhook instead.
Expression language in rules — fixed templates only.
mTLS / client-cert auth on outbound webhooks.
Real-time push (SSE) to the UI — 30 s polling is the v1 cadence. SSE is a clean drop-in for v2 if needed.

Deferred to backlog

BL-001 / gitea#137 — in-app CA bundle management UI. Deferred pending investigation of reusing the SaaS layer's existing CA handling (KISS / DRY). V1 CA trust material is filesystem-resident via deployment config, same posture as OIDC issuer URIs and Ed25519 keys.

3. Key decisions

Captured from brainstorming, in order of architectural impact.

Decision	Chosen	Rejected	Rationale
Signal sources	6 (route / exchange / agent / deployment / log / JVM)	SQL power-user mode	"Good enough" baseline; fixed templates cover real needs; expression languages are where observability tools go to be rewritten
Delivery channels	in-app + webhook	email, native integrations	Webhooks cover every target; email is deceptively expensive (deliverability, bounces, DKIM)
Sharing	tenant-env-shared rules; notifications target users/groups/roles	per-user "my alerts"	Ops products need single source of truth for what's broken; targets give per-person routing without duplicating rules
Evaluation	pull / claim-based polling	push / event-driven	Confinement — reads through existing interfaces, zero ingestion hooks; native handling of "no data" condition; 60 s latency acceptable for ops alerting
Horizontal scale	`FOR UPDATE SKIP LOCKED` claim pattern	advisory locks / leader election	Naturally partitions work; supports per-rule cadences; recovers from replica death; industry-standard
Alert lifecycle	FIRING / ACK / RESOLVED + SILENCED	minimal fire/resolve only, full incident workflow	Ack is the floor for team workflows (stop paging everyone); silences needed for ops maintenance; incident mgmt is a product category, not a feature
Rule shape	fixed templates, sealed-type JSONB	expression DSL, expression-first	Form-fillable; typed; additive for new kinds; consistent with no-SQL decision
Templating	JMustache	in-house substituter, Pebble/Freemarker	Industry standard for webhook templates (Slack, PagerDuty); logic-less (safe); small dep; familiar to ops users
UI placement	top-nav bell (consumer) + `/alerts` section (OPERATOR+ authoring, VIEWER+ read)	admin-only page, embedded per context, new top-level tab only	Separates consumer from authoring surfaces; rule authoring happens frequently, shouldn't be buried in admin
CMD-K	alerts + rules searchable	not searchable	Covers the "I saw this alert before lunch" use case; small surface via existing result-source registry
Outbound connections	admin-managed, tenant-global, allowed-env restriction	per-rule raw webhook URLs	Admins own infrastructure; operators author rules; rotation is atomic across N rules; reusable for future integrations
TLS trust	shared cross-cutting module `http/`	alerting-local trust config	Future-proofs for additional outbound HTTPS consumers; joins the existing OIDC outbound path
CA management UI	deferred (BL-001)	build in-server now	SaaS-layer CA mechanism should be investigated first for reuse
Env deletion	full cascade across alerting tables	partial cascade with SET NULL	POC teardown safety — zero orphaned rows

4. Module architecture

Package layout

cameleer-server-core/src/main/java/com/cameleer/server/core/
├── alerting/                      (domain; pure records + interfaces)
│   ├── AlertRule
│   ├── AlertCondition (sealed)
│   │   ├── RouteMetricCondition
│   │   ├── ExchangeMatchCondition
│   │   ├── AgentStateCondition
│   │   ├── DeploymentStateCondition
│   │   ├── LogPatternCondition
│   │   └── JvmMetricCondition
│   ├── AlertSeverity / AlertState (enums)
│   ├── AlertInstance / AlertEvent
│   ├── NotificationTarget / NotificationTargetKind
│   ├── AlertSilence / SilenceMatcher
│   ├── AlertRuleRepository (interface)
│   ├── AlertInstanceRepository (interface)
│   ├── AlertSilenceRepository (interface)
│   ├── AlertNotificationRepository (interface)
│   ├── AlertReadRepository (interface)
│   ├── ConditionEvaluator<C> (sealed)
│   └── NotificationDispatcher (interface)
├── outbound/                      (admin-managed outbound connections)
│   ├── OutboundConnection
│   ├── OutboundAuth (sealed — NONE, BEARER, BASIC)
│   ├── TrustMode (enum)
│   └── OutboundConnectionRepository (interface)
└── http/                          (cross-cutting outbound HTTP primitive)
    ├── OutboundHttpProperties
    ├── OutboundHttpRequestContext
    └── OutboundHttpClientFactory (interface)

cameleer-server-app/src/main/java/com/cameleer/server/app/
├── alerting/
│   ├── controller/                (REST)
│   │   ├── AlertRuleController
│   │   ├── AlertController
│   │   ├── AlertSilenceController
│   │   └── AlertNotificationController
│   ├── storage/                   (Postgres)
│   │   ├── PostgresAlertRuleRepository
│   │   ├── PostgresAlertInstanceRepository
│   │   ├── PostgresAlertSilenceRepository
│   │   ├── PostgresAlertNotificationRepository
│   │   └── PostgresAlertReadRepository
│   ├── eval/                      (the scheduled evaluators)
│   │   ├── AlertEvaluatorJob        (@Scheduled, claim-based)
│   │   ├── RouteMetricEvaluator
│   │   ├── ExchangeMatchEvaluator
│   │   ├── AgentStateEvaluator
│   │   ├── DeploymentStateEvaluator
│   │   ├── LogPatternEvaluator
│   │   ├── JvmMetricEvaluator
│   │   ├── PerKindCircuitBreaker
│   │   └── TickCache
│   ├── notify/
│   │   ├── NotificationDispatchJob  (@Scheduled, claim-based)
│   │   ├── InAppInboxQuery
│   │   ├── WebhookDispatcher
│   │   ├── MustacheRenderer
│   │   └── SilenceMatcher
│   ├── dto/                       (AlertRuleDto, AlertDto, ConditionDto sealed, WebhookDto, etc.)
│   ├── retention/
│   │   └── AlertingRetentionJob     (daily @Scheduled)
│   └── config/
│       └── AlertingProperties       (@ConfigurationProperties)
├── outbound/
│   ├── controller/
│   │   └── OutboundConnectionAdminController
│   ├── storage/
│   │   └── PostgresOutboundConnectionRepository
│   └── dto/
│       └── OutboundConnectionDto
└── http/
    ├── ApacheOutboundHttpClientFactory
    ├── SslContextBuilder
    └── config/
        └── OutboundHttpConfig         (@ConfigurationProperties)

cameleer-server-app/src/main/resources/
├── db/migration/V11__alerting_and_outbound.sql   (one Flyway migration)
└── clickhouse/V_alerting_projections.sql         (one CH migration, idempotent)

ui/src/
├── pages/Alerts/
│   ├── InboxPage.tsx
│   ├── AllAlertsPage.tsx
│   ├── RulesListPage.tsx
│   ├── RuleEditor/
│   │   ├── RuleEditorWizard.tsx
│   │   ├── ScopeStep.tsx
│   │   ├── ConditionStep.tsx
│   │   ├── TriggerStep.tsx
│   │   ├── NotifyStep.tsx
│   │   └── ReviewStep.tsx
│   ├── SilencesPage.tsx
│   └── HistoryPage.tsx
├── pages/Admin/
│   └── OutboundConnectionsPage.tsx
├── components/
│   ├── NotificationBell.tsx
│   └── AlertStateChip.tsx
├── api/queries/
│   ├── alerts.ts
│   ├── alertRules.ts
│   ├── alertSilences.ts
│   └── outboundConnections.ts
└── cmdk/sources/
    ├── alerts.ts
    └── alertRules.ts

Touchpoints on existing code (deliberate, minimal)

Existing surface	Change	Scope
`cameleer-server-app/src/main/resources/db/migration/V11__…`	New Flyway migration	additive
`cameleer-server-app/src/main/resources/clickhouse/V_…_projections.sql`	New CH migration	additive, `IF NOT EXISTS`
`ClickHouseLogStore`	New method `long countLogs(LogSearchRequest)` (no `FINAL`)	one public method added
`ClickHouseSearchIndex`	New method `long countExecutionsForAlerting(AlertMatchSpec)` (no `FINAL`, no text-in-body subqueries)	one public method added
`SecurityConfig`	Path matchers for new endpoints	~15 lines
`ui/src/router.tsx`	Route entries for `/alerts/**` and `/admin/outbound-connections`	additive
Top-nav layout	Insert `<NotificationBell />`	one import + one component
CMD-K registry	Register `alerts` + `alertRules` result sources	two file additions + one import
`.claude/rules/app-classes.md` + `core-classes.md`	Update class maps for the new packages	documentation
`com.cameleer:cameleer-common`	no changes	—
ingestion paths	no changes	—
agent protocol	no changes	—
ClickHouse schema (table structure)	no changes — only projections added	—

New dependencies

com.samskivert:jmustache — logic-less Mustache templating for webhook/notification templates. ~30 KB, zero transitive deps. Added to cameleer-server-core.
Apache HttpClient 5 (org.apache.hc.client5) — already present in the project; no new coordinate.

5. Data model (PostgreSQL)

One Flyway migration V11__alerting_and_outbound.sql creates all tables, enums, and indexes in a single transaction.

Enum types

CREATE TYPE severity_enum          AS ENUM ('CRITICAL','WARNING','INFO');
CREATE TYPE condition_kind_enum    AS ENUM ('ROUTE_METRIC','EXCHANGE_MATCH','AGENT_STATE','DEPLOYMENT_STATE','LOG_PATTERN','JVM_METRIC');
CREATE TYPE alert_state_enum       AS ENUM ('PENDING','FIRING','ACKNOWLEDGED','RESOLVED');
CREATE TYPE target_kind_enum       AS ENUM ('USER','GROUP','ROLE');
CREATE TYPE notification_status_enum AS ENUM ('PENDING','DELIVERED','FAILED');
CREATE TYPE trust_mode_enum        AS ENUM ('SYSTEM_DEFAULT','TRUST_ALL','TRUST_PATHS');
CREATE TYPE outbound_method_enum   AS ENUM ('POST','PUT','PATCH');
CREATE TYPE outbound_auth_kind_enum AS ENUM ('NONE','BEARER','BASIC');

Tables

`outbound_connections` (admin-managed)

CREATE TABLE outbound_connections (
  id                       uuid PRIMARY KEY,
  tenant_id                varchar(64) NOT NULL,
  name                     varchar(100) NOT NULL,
  description              text,
  url                      text NOT NULL,                         -- Mustache-enabled
  method                   outbound_method_enum NOT NULL,
  default_headers          jsonb NOT NULL DEFAULT '{}',           -- values are Mustache templates
  default_body_tmpl        text,                                  -- null = built-in default JSON envelope
  tls_trust_mode           trust_mode_enum NOT NULL DEFAULT 'SYSTEM_DEFAULT',
  tls_ca_pem_paths         jsonb NOT NULL DEFAULT '[]',           -- array of paths from OutboundHttpProperties
  hmac_secret              text,                                  -- Ed25519-key-derived encryption at rest
  auth_kind                outbound_auth_kind_enum NOT NULL DEFAULT 'NONE',
  auth_config              jsonb NOT NULL DEFAULT '{}',           -- shape depends on auth_kind; v1 unused
  allowed_environment_ids  uuid[] NOT NULL DEFAULT '{}',          -- [] = allowed in all envs
  created_at               timestamptz NOT NULL DEFAULT now(),
  created_by               uuid NOT NULL REFERENCES users(id),
  updated_at               timestamptz NOT NULL DEFAULT now(),
  updated_by               uuid NOT NULL REFERENCES users(id),
  UNIQUE (tenant_id, name)
);
CREATE INDEX outbound_connections_tenant_idx ON outbound_connections (tenant_id);

`alert_rules`

CREATE TABLE alert_rules (
  id                          uuid PRIMARY KEY,
  environment_id              uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
  name                        varchar(200) NOT NULL,
  description                 text,
  severity                    severity_enum NOT NULL,
  enabled                     boolean NOT NULL DEFAULT true,

  condition_kind              condition_kind_enum NOT NULL,
  condition                   jsonb NOT NULL,                     -- sealed-subtype payload, Jackson-DEDUCTION polymorphic

  evaluation_interval_seconds int NOT NULL DEFAULT 60 CHECK (evaluation_interval_seconds >= 5),
  for_duration_seconds        int NOT NULL DEFAULT 0 CHECK (for_duration_seconds >= 0),
  re_notify_minutes           int NOT NULL DEFAULT 60 CHECK (re_notify_minutes >= 0),

  notification_title_tmpl     text NOT NULL,                      -- Mustache
  notification_message_tmpl   text NOT NULL,                      -- Mustache
  webhooks                    jsonb NOT NULL DEFAULT '[]',        -- [{id: uuid, outboundConnectionId, bodyOverride?, headerOverrides?}] — id assigned server-side on save, used as stable ref from alert_notifications.webhook_id

  next_evaluation_at          timestamptz NOT NULL DEFAULT now(),
  claimed_by                  varchar(64),
  claimed_until               timestamptz,
  eval_state                  jsonb NOT NULL DEFAULT '{}',

  created_at                  timestamptz NOT NULL DEFAULT now(),
  created_by                  uuid NOT NULL REFERENCES users(id),
  updated_at                  timestamptz NOT NULL DEFAULT now(),
  updated_by                  uuid NOT NULL REFERENCES users(id)
);
CREATE INDEX alert_rules_env_idx            ON alert_rules (environment_id);
CREATE INDEX alert_rules_claim_due_idx      ON alert_rules (next_evaluation_at) WHERE enabled = true;

`alert_rule_targets`

CREATE TABLE alert_rule_targets (
  id            uuid PRIMARY KEY,
  rule_id       uuid NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE,
  target_kind   target_kind_enum NOT NULL,
  target_id     varchar(128) NOT NULL,
  UNIQUE (rule_id, target_kind, target_id)
);
CREATE INDEX alert_rule_targets_lookup_idx ON alert_rule_targets (target_kind, target_id);

`alert_instances`

CREATE TABLE alert_instances (
  id                  uuid PRIMARY KEY,
  rule_id             uuid REFERENCES alert_rules(id) ON DELETE SET NULL,
  rule_snapshot       jsonb NOT NULL,
  environment_id      uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
  state               alert_state_enum NOT NULL,
  severity            severity_enum NOT NULL,
  fired_at            timestamptz NOT NULL,
  acked_at            timestamptz,
  acked_by            uuid REFERENCES users(id),
  resolved_at         timestamptz,
  last_notified_at    timestamptz,
  silenced            boolean NOT NULL DEFAULT false,
  current_value       numeric,
  threshold           numeric,
  context             jsonb NOT NULL,
  title               text NOT NULL,
  message             text NOT NULL,
  target_user_ids     uuid[] NOT NULL DEFAULT '{}',
  target_group_ids    uuid[] NOT NULL DEFAULT '{}',
  target_role_names   text[] NOT NULL DEFAULT '{}'
);
CREATE INDEX alert_instances_inbox_idx      ON alert_instances (environment_id, state, fired_at DESC);
CREATE INDEX alert_instances_open_rule_idx  ON alert_instances (rule_id, state) WHERE rule_id IS NOT NULL;
CREATE INDEX alert_instances_resolved_idx   ON alert_instances (resolved_at) WHERE state = 'RESOLVED';
CREATE INDEX alert_instances_target_u_idx   ON alert_instances USING GIN (target_user_ids);
CREATE INDEX alert_instances_target_g_idx   ON alert_instances USING GIN (target_group_ids);
CREATE INDEX alert_instances_target_r_idx   ON alert_instances USING GIN (target_role_names);

`alert_silences`

CREATE TABLE alert_silences (
  id             uuid PRIMARY KEY,
  environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
  matcher        jsonb NOT NULL,                      -- { ruleId?, appSlug?, routeId?, agentId?, severity? }
  reason         text,
  starts_at      timestamptz NOT NULL,
  ends_at        timestamptz NOT NULL CHECK (ends_at > starts_at),
  created_by     uuid NOT NULL REFERENCES users(id),
  created_at     timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_silences_active_idx ON alert_silences (environment_id, ends_at);

`alert_notifications` (webhook delivery outbox)

CREATE TABLE alert_notifications (
  id                    uuid PRIMARY KEY,
  alert_instance_id     uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
  webhook_id            uuid,                        -- opaque ref into rule's webhooks JSONB
  outbound_connection_id uuid REFERENCES outbound_connections(id) ON DELETE SET NULL,
  status                notification_status_enum NOT NULL DEFAULT 'PENDING',
  attempts              int NOT NULL DEFAULT 0,
  next_attempt_at       timestamptz NOT NULL DEFAULT now(),
  claimed_by            varchar(64),
  claimed_until         timestamptz,
  last_response_status  int,
  last_response_snippet text,
  payload               jsonb NOT NULL,              -- snapshotted at first attempt
  delivered_at          timestamptz,
  created_at            timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_notifications_pending_idx ON alert_notifications (next_attempt_at) WHERE status = 'PENDING';
CREATE INDEX alert_notifications_instance_idx ON alert_notifications (alert_instance_id);

`alert_reads`

CREATE TABLE alert_reads (
  user_id            uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  alert_instance_id  uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
  read_at            timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (user_id, alert_instance_id)
);

Cascade summary

environments → alert_rules           (CASCADE)  → alert_rule_targets   (CASCADE)
environments → alert_silences        (CASCADE)
environments → alert_instances       (CASCADE)  → alert_reads          (CASCADE)
                                                → alert_notifications  (CASCADE)
alert_rules  → alert_instances                   (SET NULL, rule_snapshot preserves context)
users        → alert_reads           (CASCADE)
outbound_connections (delete)        (app-level 409 check against alert_rules.webhooks JSONB; no database FK)

Rule deletion preserves history (alert_instances.rule_id = NULL, rule_snapshot retains details). Environment deletion leaves zero alerting rows — POC-safe.

Jackson polymorphism for conditions

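import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonSubTypes.Type;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// No discriminator property in the JSON: Jackson infers the subtype from the fields present.
@JsonTypeInfo(use = JsonTypeInfo.Id.DEDUCTION)
@JsonSubTypes({
    @Type(RouteMetricCondition.class),
    @Type(ExchangeMatchCondition.class),
    @Type(AgentStateCondition.class),
    @Type(DeploymentStateCondition.class),
    @Type(LogPatternCondition.class),
    @Type(JvmMetricCondition.class),
})
public sealed interface AlertCondition permits
    RouteMetricCondition, ExchangeMatchCondition, AgentStateCondition,
    DeploymentStateCondition, LogPatternCondition, JvmMetricCondition {
    ConditionKind kind();
}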

Jackson deduces the subtype from the set of present fields, so each subtype's field set must distinguish it from the others. Bean Validation (@Valid) on each record validates at the controller boundary.

Example condition payloads:

// ROUTE_METRIC
{ "scope": {"appSlug":"orders","routeId":"route-1"},
  "metric": "P99_LATENCY_MS", "comparator": "GT", "threshold": 2000, "windowSeconds": 300 }

// EXCHANGE_MATCH — PER_EXCHANGE
{ "scope": {"appSlug":"orders"},
  "filter": {"status":"FAILED","attributes":{"type":"payment"}},
  "fireMode": "PER_EXCHANGE", "perExchangeLingerSeconds": 300 }

// EXCHANGE_MATCH — COUNT_IN_WINDOW
{ "scope": {"appSlug":"orders"},
  "filter": {"status":"FAILED"},
  "fireMode": "COUNT_IN_WINDOW", "threshold": 5, "windowSeconds": 900 }

// AGENT_STATE
{ "scope": {"appSlug":"orders"}, "state": "DEAD", "forSeconds": 60 }

// DEPLOYMENT_STATE
{ "scope": {"appSlug":"orders"}, "states": ["FAILED","DEGRADED"] }

// LOG_PATTERN
{ "scope": {"appSlug":"orders"}, "level": "ERROR",
  "pattern": "TimeoutException", "threshold": 5, "windowSeconds": 900 }

// JVM_METRIC
{ "scope": {"appSlug":"orders"}, "metric": "heap_used_percent",
  "aggregation": "MAX", "comparator": "GT", "threshold": 90, "windowSeconds": 300 }

Claim-polling queries

-- Rule evaluator
UPDATE alert_rules
   SET claimed_by = :instance, claimed_until = now() + interval '30 seconds'
 WHERE id IN (
   SELECT id FROM alert_rules
    WHERE enabled = true
      AND next_evaluation_at <= now()
      AND (claimed_until IS NULL OR claimed_until < now())
    ORDER BY next_evaluation_at
    LIMIT :batch
    FOR UPDATE SKIP LOCKED
 )
 RETURNING *;

-- Notification dispatcher (same pattern on alert_notifications with status='PENDING')

FOR UPDATE SKIP LOCKED is the crux: replicas never block each other.

6. Outbound connections

Concept

An OutboundConnection is a reusable, admin-managed HTTPS destination. Alert rules reference connections by ID and may override body or header templates per rule. Rotating a URL or secret updates every rule atomically.

Tenant-global. Slack URLs and PagerDuty keys are team infrastructure, not env-specific. Env-specific routing is achieved by creating multiple connections (slack-prod, slack-dev) and referencing the appropriate one in each env's rules.

Allowed-env restriction. allowed_environment_ids defaults to empty, which means all envs. An admin restricts a connection to specific envs via a multi-select on the connection form. The UI picker filters by current env; rule save validates (422 on violation); narrowing the restriction while rules still reference the connection returns 409 with a conflict list.

Delete semantics. 409 if any rule references the connection. No silent cascade — admin must first remove references.
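The allowed-env rule and the narrowing conflict check can be sketched as two small predicates. This is a minimal illustration, not the spec's implementation: the class name OutboundConnectionEnvPolicySketch and both method names are hypothetical, and the only behavior taken from the text above is "empty list = all envs" and "narrowing while rules still reference the connection is a conflict".

```java
import java.util.List;
import java.util.UUID;

/** Hypothetical helper; name and placement are illustrative only. */
final class OutboundConnectionEnvPolicySketch {

    private OutboundConnectionEnvPolicySketch() {}

    /** An empty allowed list means the connection is usable in every environment. */
    static boolean isAllowed(List<UUID> allowedEnvironmentIds, UUID environmentId) {
        return allowedEnvironmentIds.isEmpty() || allowedEnvironmentIds.contains(environmentId);
    }

    /**
     * Environments of referencing rules that the proposed allowed list would exclude.
     * A non-empty result corresponds to the 409 conflict list.
     */
    static List<UUID> conflicts(List<UUID> proposedAllowed, List<UUID> envsOfReferencingRules) {
        if (proposedAllowed.isEmpty()) {
            return List.of(); // widening to "all envs" can never conflict
        }
        return envsOfReferencingRules.stream()
                .filter(env -> !proposedAllowed.contains(env))
                .distinct()
                .toList();
    }
}
```

The same isAllowed check would back both the rule-save validation (422) and the UI picker filter, which keeps the two from drifting apart.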

Default body template (when rule has no override)

{
  "alert":   { "id", "state", "firedAt", "severity", "title", "message", "link" },
  "rule":    { "id", "name", "description", "severity" },
  "env":     { "slug", "id" },
  "context": { /* full Mustache context: app, route, agent, exchange, etc. */ }
}

"Just plug in my Slack incoming webhook URL" works without writing a template.

HMAC signing (optional per connection)

When hmac_secret is set, dispatch adds an X-Cameleer-Signature: sha256=<hmac(secret, body)> header, following the GitHub / Stripe pattern. The secret is encrypted at rest; the concrete approach (Jasypt vs bespoke over existing Ed25519-derived key material) is decided in planning (see §20).
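A minimal sketch of how the signature value could be computed with the JDK's own HMAC support. The class and method names are hypothetical. Two details are assumptions the spec does not state: the digest is hex-encoded in lowercase (as GitHub does), and both secret and body are taken as UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper; not part of the spec's package layout. */
final class WebhookSignerSketch {

    private WebhookSignerSketch() {}

    /** Returns "sha256=" followed by the lowercase hex HMAC-SHA256 of the body. */
    static String signatureHeaderValue(String secret, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(body.getBytes(StandardCharsets.UTF_8));
            return "sha256=" + HexFormat.of().formatHex(digest);
        } catch (GeneralSecurityException e) {
            // HmacSHA256 is a mandatory JDK algorithm, so this is not expected in practice.
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

Signing the snapshotted payload bytes, rather than re-rendering the template, would keep the signature stable across retries.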

7. Rule evaluation

Scheduler

@Component
public class AlertEvaluatorJob implements SchedulingConfigurer {

    // Interval wired via AlertingProperties.evaluatorTickIntervalMs (floor 5000)
    @Override
    public void configureTasks(ScheduledTaskRegistrar registrar) {
        registrar.addFixedDelayTask(this::tick, properties.effectiveEvaluatorTickIntervalMs());
    }

    void tick() {
        List<AlertRule> claimed = ruleRepo.claimDueRules(instanceId, properties.batchSize());
        var groups = claimed.stream().collect(groupingBy(r -> new GroupKey(r.conditionKind(), windowSeconds(r))));
        for (var entry : groups.entrySet()) {
            if (circuitBreaker.isOpen(entry.getKey().kind())) { rescheduleBatch(entry.getValue()); continue; }
            try {
                coalescedEvaluate(entry.getKey(), entry.getValue());
            } catch (Exception e) {
                circuitBreaker.recordFailure(entry.getKey().kind());
                rescheduleBatch(entry.getValue());
            }
        }
    }
}
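The scheduler above calls properties.effectiveEvaluatorTickIntervalMs(). A minimal sketch of what that accessor might look like, assuming the 5 s floor is applied by clamping the configured value rather than by rejecting the configuration (the spec only says "floor 5000"). The record name is hypothetical and it omits the other properties.

```java
/** Hypothetical stand-in for AlertingProperties; only the floor logic is shown. */
record AlertingPropertiesSketch(long evaluatorTickIntervalMs) {

    /** Minimum tick interval taken from the spec's "min 5 s floor". */
    static final long MIN_TICK_INTERVAL_MS = 5_000L;

    /** Configured interval, raised to the floor when set lower (assumption: clamp, not fail). */
    long effectiveEvaluatorTickIntervalMs() {
        return Math.max(MIN_TICK_INTERVAL_MS, evaluatorTickIntervalMs);
    }
}
```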

Per-condition evaluators

Kind	Read source	Query shape
`ROUTE_METRIC`	`SearchService.statsForRoute` / `statsForApp`	Stats over window; comparator vs threshold
`EXCHANGE_MATCH` (PER_EXCHANGE)	`SearchService.search(SearchRequest)`	`timestamp > eval_state.lastExchangeTs AND filter` → fire one alert per match, advance cursor
`EXCHANGE_MATCH` (COUNT_IN_WINDOW)	`ClickHouseSearchIndex.countExecutionsForAlerting(spec)`	Count in window vs threshold
`AGENT_STATE`	`AgentRegistryService.listByEnvironment`	Any agent matches scope + state
`DEPLOYMENT_STATE`	`DeploymentRepository.findLatestByAppAndEnv`	Status in target set
`LOG_PATTERN`	`ClickHouseLogStore.countLogs(LogSearchRequest)`	Count in window vs threshold
`JVM_METRIC`	`MetricsQueryStore`	Aggregated value over window (aggregation set per rule) vs threshold

State machine

                 (cond holds for <forDuration)
  PENDING ──────▶ keep pendingSince
    ▲            │
    │            ▼ (cond holds ≥ forDuration)
    │          FIRING ◀── (re-eval matches; update last_notified_at cadence)
    │          / \
    │         /   \
    │    ack/      \resolve
    │       ▼       ▼
    │   ACKNOWLEDGED  RESOLVED ── (cond false again → cycle can restart)

PER_EXCHANGE mode: each match is its own brief FIRING instance that auto-resolves after perExchangeLingerSeconds (default 300 s). History retains it for 90 d.
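The evaluation-driven transitions in the diagram can be sketched as a pure function. This is an illustration under stated assumptions, not the spec's implementation: Phase.NONE is a hypothetical marker for "no open instance" (the spec's enum has no such value), a PENDING candidate is assumed to be dropped when the condition clears, and user actions such as ack are outside the function.

```java
import java.time.Duration;
import java.time.Instant;

/** NONE is a hypothetical marker for "no open instance". */
enum Phase { NONE, PENDING, FIRING, ACKNOWLEDGED, RESOLVED }

/** Phase after one evaluation, plus the instant the condition first held (null when not tracked). */
record Step(Phase phase, Instant pendingSince) {}

final class AlertStateMachineSketch {

    private AlertStateMachineSketch() {}

    static Step onEvaluation(Phase current, Instant pendingSince, boolean conditionHolds,
                             Instant now, Duration forDuration) {
        if (!conditionHolds) {
            // Assumption: an open alert resolves; anything else leaves no open instance.
            boolean open = current == Phase.FIRING || current == Phase.ACKNOWLEDGED;
            return new Step(open ? Phase.RESOLVED : Phase.NONE, null);
        }
        return switch (current) {
            // Condition newly true: fire at once when no hold time is configured.
            case NONE, RESOLVED -> new Step(forDuration.isZero() ? Phase.FIRING : Phase.PENDING, now);
            case PENDING -> {
                boolean heldLongEnough = !now.isBefore(pendingSince.plus(forDuration));
                yield new Step(heldLongEnough ? Phase.FIRING : Phase.PENDING, pendingSince);
            }
            // FIRING and ACKNOWLEDGED persist while the condition keeps holding.
            case FIRING, ACKNOWLEDGED -> new Step(current, pendingSince);
        };
    }
}
```

Keeping the transition free of I/O makes it straightforward to unit-test each edge of the diagram with fixed instants.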

Performance optimizations (v1)

Four ClickHouse projections (new CH migration, idempotent):

ALTER TABLE executions    ADD PROJECTION IF NOT EXISTS alerting_app_status
  (SELECT * ORDER BY (tenant_id, environment, application_id, status, start_time));
ALTER TABLE executions    ADD PROJECTION IF NOT EXISTS alerting_route_status
  (SELECT * ORDER BY (tenant_id, environment, route_id, status, start_time));
ALTER TABLE logs          ADD PROJECTION IF NOT EXISTS alerting_app_level
  (SELECT * ORDER BY (tenant_id, environment, application, level, timestamp));
ALTER TABLE agent_metrics ADD PROJECTION IF NOT EXISTS alerting_instance_metric
  (SELECT * ORDER BY (tenant_id, environment, instance_id, metric_name, collected_at));

stats_1m_route's existing ORDER BY already aligns with alerting access patterns; no projection needed.

Drop FINAL for alerting counts. New methods ClickHouseLogStore.countLogs(...) and ClickHouseSearchIndex.countExecutionsForAlerting(...) skip FINAL — alerting tolerates brief duplicate-row over-count (alert fires briefly, self-resolves on next tick after merge). Existing UI-facing count() path unchanged.
Per-tick query coalescing. Rules of the same kind + window share one aggregate query per tick.
In-tick cache. Map<QueryKey, Long> discarded at tick end. Two rules hitting the same (app, route, window, metric) produce one CH call.
Per-kind circuit breaker. 5 failures in 30 s → open for 60 s. Metric alerting_circuit_open_total{kind}. UI surfaces an admin banner when open.
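A minimal sketch of the per-kind circuit breaker described in the last item, assuming a sliding failure window and a fixed open period. The class name is hypothetical, the kind is a plain String here instead of the spec's ConditionKind so the block stands alone, and the injected time source is an assumption made for testability. Metrics and the admin banner are omitted.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch: N failures inside a sliding window open the breaker for a fixed period. */
final class PerKindCircuitBreakerSketch {
    private final int failureThreshold;
    private final Duration failureWindow;
    private final Duration openDuration;
    private final Supplier<Instant> now;
    private final Map<String, Deque<Instant>> recentFailures = new HashMap<>();
    private final Map<String, Instant> openUntil = new HashMap<>();

    PerKindCircuitBreakerSketch(int failureThreshold, Duration failureWindow,
                                Duration openDuration, Supplier<Instant> now) {
        this.failureThreshold = failureThreshold;
        this.failureWindow = failureWindow;
        this.openDuration = openDuration;
        this.now = now;
    }

    synchronized void recordFailure(String kind) {
        Instant t = now.get();
        Deque<Instant> failures = recentFailures.computeIfAbsent(kind, k -> new ArrayDeque<>());
        failures.addLast(t);
        // Forget failures that fell out of the sliding window.
        Instant cutoff = t.minus(failureWindow);
        while (!failures.isEmpty() && failures.peekFirst().isBefore(cutoff)) {
            failures.removeFirst();
        }
        if (failures.size() >= failureThreshold) {
            openUntil.put(kind, t.plus(openDuration));
            failures.clear(); // start counting afresh once the breaker closes again
        }
    }

    synchronized boolean isOpen(String kind) {
        Instant until = openUntil.get(kind);
        return until != null && now.get().isBefore(until);
    }
}
```

With the values from the text this would be constructed as new PerKindCircuitBreakerSketch(5, Duration.ofSeconds(30), Duration.ofSeconds(60), Instant::now).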

Silence matching

At notification-dispatch time (not evaluation time):

SELECT 1 FROM alert_silences
 WHERE environment_id = :env
   AND now() BETWEEN starts_at AND ends_at
   AND matcher_matches(matcher, :instanceContext)
 LIMIT 1;

If any match → alert_instances.silenced = true, no webhook dispatch, no re-notification. Inbox still shows the instance with a silenced pill — audit trail preserved.
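The matcher_matches call in the query is not defined in this section. One plausible reading, sketched below, is that every field present in the matcher must equal the corresponding value in the instance context, and absent fields act as wildcards. That semantics is an assumption, as are the class and method names; the matcher and context are modelled as flat string maps for brevity.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch of silence matching; the wildcard semantics are an assumption. */
final class SilenceMatcherSketch {

    private SilenceMatcherSketch() {}

    /** Mirrors "now() BETWEEN starts_at AND ends_at", which is inclusive on both ends. */
    static boolean isActive(Instant startsAt, Instant endsAt, Instant now) {
        return !now.isBefore(startsAt) && !now.isAfter(endsAt);
    }

    /**
     * True when every key set in the matcher (ruleId, appSlug, routeId, agentId, severity)
     * has the same value in the instance context. Keys missing from the matcher match anything.
     */
    static boolean matches(Map<String, String> matcher, Map<String, String> instanceContext) {
        return matcher.entrySet().stream()
                .allMatch(e -> Objects.equals(e.getValue(), instanceContext.get(e.getKey())));
    }
}
```

Under this reading an empty matcher silences every alert in the environment, so the UI may want to require at least one field.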

Failure modes

Failure	Behavior
Read interface throws	Log WARN, increment `alerting_eval_errors_total{kind, rule_id}`, reschedule rule, release claim
10 consecutive failures for a rule	Mark `eval_state.disabledReason`, surface in UI
Template render error	Fall back to literal `{{var}}` in output, log WARN, still dispatch
Slow evaluator	Claim lapses after its 30 s TTL and the rule becomes claimable by another replica; investigate if sustained
Rule deleted mid-eval	FK cascade waits on the row lock — effectively serialized
Env deleted mid-eval	FK cascade waits — effectively serialized

8. Notification dispatch

In-app inbox — derived, not materialized

SELECT ai.*
  FROM alert_instances ai
 WHERE ai.environment_id = :env
   AND ai.state IN ('FIRING','ACKNOWLEDGED','RESOLVED')
   AND (
       ai.target_user_ids @> ARRAY[:me]::uuid[]   -- @> can use the GIN index; "= ANY(column)" cannot
    OR ai.target_group_ids && :my_group_ids
    OR ai.target_role_names && :my_role_names
   )
 ORDER BY ai.fired_at DESC
 LIMIT 100;

:my_group_ids and :my_role_names resolved once per request from RbacService.

Bell badge count: same filter + state IN ('FIRING','ACKNOWLEDGED') + NOT EXISTS (alert_reads ar WHERE ar.user_id=:me AND ar.alert_instance_id=ai.id), count-only. Server-side 5 s memoization per (env, user) keeps bell polling cheap.
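The 5 s memoization per (env, user) could be as small as the sketch below. The class name, the string-typed key, and the injected time source are all hypothetical, and eviction of stale entries is left out for brevity. The loader stands in for the count query described above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical TTL memo for the bell badge count; stale entries are replaced, never evicted. */
final class BadgeCountMemoSketch {

    record Key(String environmentId, String userId) {}

    private record Entry(long count, Instant expiresAt) {}

    private final ConcurrentHashMap<Key, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl;
    private final Supplier<Instant> now;

    BadgeCountMemoSketch(Duration ttl, Supplier<Instant> now) {
        this.ttl = ttl;
        this.now = now;
    }

    /** Returns the cached count while it is fresh; otherwise runs the loader and caches its result. */
    long unreadCount(Key key, LongSupplier loader) {
        Instant t = now.get();
        // compute() serializes callers per key, so concurrent polls trigger at most one query.
        Entry entry = cache.compute(key, (k, old) ->
                old != null && t.isBefore(old.expiresAt())
                        ? old
                        : new Entry(loader.getAsLong(), t.plus(ttl)));
        return entry.count();
    }
}
```

A production version would likely bound the map size or lean on an existing cache library; this sketch only shows the freshness check.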

Webhook outbox — claim-based

NotificationDispatchJob claims due notifications (status='PENDING' AND next_attempt_at <= now()) and dispatches. HTTP client from shared OutboundHttpClientFactory with TLS config from the referenced outbound connection.

2xx → DELIVERED
4xx → FAILED immediately (retry won't help); log at WARN
5xx / network / timeout → retry with increasing backoff (30 s → 2 m → 5 m), then FAILED
Manual retry: POST /alerts/notifications/{id}/retry (OPERATOR+)

Payload rendered at first dispatch attempt, snapshotted in alert_notifications.payload. Retries replay the snapshot — template edits after fire don't affect in-flight notifications.
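The status handling and backoff schedule above can be sketched as a small policy class. The names are hypothetical, and two points are assumptions the text leaves open: the schedule is read as three retries (four attempts in total), and any response that is neither 2xx nor 4xx is treated like a 5xx. A negative status code stands in for "network error or timeout, no HTTP response".

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical sketch of the webhook retry policy; not the spec's implementation. */
final class WebhookRetryPolicySketch {

    enum Outcome { DELIVERED, FAILED, RETRY }

    /** Delays before the 2nd, 3rd and 4th attempt. */
    static final List<Duration> BACKOFF =
            List.of(Duration.ofSeconds(30), Duration.ofMinutes(2), Duration.ofMinutes(5));

    private WebhookRetryPolicySketch() {}

    /**
     * @param statusCode    HTTP status, or a negative value for network error / timeout
     * @param attemptsSoFar attempts made including the one that just finished (1-based)
     */
    static Outcome classify(int statusCode, int attemptsSoFar) {
        if (statusCode >= 200 && statusCode < 300) {
            return Outcome.DELIVERED;
        }
        if (statusCode >= 400 && statusCode < 500) {
            return Outcome.FAILED; // retrying a client error will not help
        }
        // Assumption: everything else (5xx, 3xx, no response) is retryable until the schedule runs out.
        return attemptsSoFar <= BACKOFF.size() ? Outcome.RETRY : Outcome.FAILED;
    }

    /** Delay to add to now() for next_attempt_at; valid only when classify returned RETRY. */
    static Duration delayAfterAttempt(int attemptsSoFar) {
        return BACKOFF.get(attemptsSoFar - 1);
    }
}
```

On RETRY the dispatcher would set next_attempt_at to now plus delayAfterAttempt(attempts) and release its claim, leaving the row in PENDING.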

Template rendering

JMustache (com.samskivert:jmustache). Logic-less, industry-standard syntax.

Rendered surfaces: URL (query-string interpolation), header values, body, and separately alert_instances.title / message rendered once at fire.

Context map (dot-notation + camelCase leaves):

env.slug                 env.id
rule.id                  rule.name                 rule.severity            rule.description
alert.id                 alert.state               alert.firedAt            alert.resolvedAt
alert.ackedBy            alert.link                alert.currentValue       alert.threshold
alert.comparator         alert.window
app.slug                 app.id                    app.displayName
route.id
agent.id                 agent.name                agent.state
exchange.id              exchange.status           exchange.link
deployment.id            deployment.status
log.logger               log.level                 log.message
metric.name              metric.value

Error handling. Missing variable renders as {{var.name}} literal + WARN log. Malformed template falls back to built-in default + WARN. Never drop a notification due to template error.
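
To illustrate the missing-variable policy only, here is a stand-in written with `java.util.regex` rather than JMustache itself. `LenientTemplate` is a hypothetical name, it handles plain `{{a.b}}` variables and nothing else from Mustache, and the WARN log is left as a comment.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in: resolves {{dot.path}} variables, leaving unknown ones as literals. */
final class LenientTemplate {
    private static final Pattern VAR = Pattern.compile("\\{\\{\\s*([A-Za-z0-9_.]+)\\s*\\}\\}");

    static String render(String template, Map<String, Object> context) {
        return VAR.matcher(template).replaceAll(match -> {
            String path = match.group(1);
            Object value = lookup(context, path);
            // Missing variable: keep the literal so the notification is still sent
            // (a real implementation would also log WARN here).
            String rendered = value != null ? value.toString() : "{{" + path + "}}";
            return Matcher.quoteReplacement(rendered);
        });
    }

    /** Walks nested maps along a dot-separated path; returns null if any segment is absent. */
    private static Object lookup(Map<String, Object> context, String path) {
        Object current = context;
        for (String segment : path.split("\\.")) {
            if (!(current instanceof Map<?, ?> map)) {
                return null;
            }
            current = map.get(segment);
        }
        return current;
    }
}
```

The same contract (never throw, never drop) is what the renderer unit tests in §18 would pin down for the real JMustache-based implementation.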

"Test render" endpoint: POST /alerts/rules/{id}/render-preview — drives rule editor's Preview button.

9. Rule promotion across environments

UX. Rule list row → Environments ▾ menu of other envs in the tenant → open rule editor pre-populated with source rule's payload, target env selected. Banner: "Promoting <name> from <src> → <dst>. Review and adjust, then save." Save → normal POST /api/v1/environments/{dstEnvSlug}/alerts/rules. Source unaffected (it's a copy).

Pure UI flow — no new server endpoint. Re-uses the existing GET (to fetch) and POST (to create) paths.

Prefill-time validation (client-side; warnings are non-blocking unless the row says otherwise):

Field	Check	Behavior
`scope.appSlug`	Does app exist in target env?	⚠ warn + picker from target env's apps
`scope.agentId`	Per-env; can't transfer	Clear field, keep appSlug, note
`scope.routeId`	Per-app logical ID, stable	✓ pass through
`targets[]`	Tenant-scoped	✓ transfer as-is
`webhooks[].outboundConnectionId`	Target env allowed by connection?	⚠ warn if not; disable save until resolved

Bulk promotion (select multiple → promote all) deferred until usage patterns justify it.

10. Cross-cutting: outbound HTTP & TLS trust

Shared module — not inside alerting/.

`OutboundHttpClientFactory`

public interface OutboundHttpClientFactory {
    CloseableHttpClient clientFor(OutboundHttpRequestContext context);
}

public record OutboundHttpRequestContext(
    TrustMode trustMode,                // SYSTEM_DEFAULT | TRUST_ALL | TRUST_PATHS
    List<String> trustedCaPemPaths,
    Duration connectTimeout,
    Duration readTimeout
) {}

Implementation (ApacheOutboundHttpClientFactory) memoizes one CloseableHttpClient per unique effective config — not one per call.
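
One way to get "one client per unique effective config" is to key a concurrent map on a value type. The sketch below is generic over the client type so it runs without Apache HttpClient on the classpath; `MemoizingClientFactory` and its nested types are hypothetical and only mirror the shape of `OutboundHttpRequestContext`.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: builds a client once per distinct request context. */
final class MemoizingClientFactory<C> {
    enum TrustMode { SYSTEM_DEFAULT, TRUST_ALL, TRUST_PATHS }

    /** Records supply equals/hashCode, so equal configs map to the same cache entry. */
    record RequestContext(TrustMode trustMode, List<String> trustedCaPemPaths,
                          Duration connectTimeout, Duration readTimeout) {
        RequestContext {
            trustedCaPemPaths = List.copyOf(trustedCaPemPaths); // defensive, immutable key
        }
    }

    private final Map<RequestContext, C> clients = new ConcurrentHashMap<>();
    private final Function<RequestContext, C> builder;

    MemoizingClientFactory(Function<RequestContext, C> builder) {
        this.builder = builder;
    }

    /** Returns the cached client for this config, building it on first use. */
    C clientFor(RequestContext context) {
        return clients.computeIfAbsent(context, builder);
    }
}
```

Because the key includes the CA path list, two connections that differ only in `tls_ca_pem_paths` get separate clients, while connections with identical settings share one.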

System config (`cameleer.server.outbound-http.*`)

cameleer:
  server:
    outbound-http:
      trust-all: false                       # global kill-switch; WARN logged if true
      trusted-ca-pem-paths:                  # additional roots layered on JVM default
        - /etc/cameleer/certs/corporate-root.pem
        - /etc/cameleer/certs/acme-internal.pem
      default-connect-timeout-ms: 2000
      default-read-timeout-ms:    5000
      proxy-url:                             # optional; null = no proxy
      proxy-username:
      proxy-password:

On startup: if trust-all=true, log a prominent WARN stating that the setting is not suitable for production. If trusted-ca-pem-paths has entries, verify that each path exists and fail fast on any missing file.
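
The startup checks could be expressed roughly as follows. `OutboundHttpStartupCheck` is a hypothetical name, the method returns warnings instead of logging so the sketch stays free of a logging dependency, and the exception type is an assumption.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical startup validation for the outbound-http settings. */
final class OutboundHttpStartupCheck {

    /**
     * Fails fast if any configured CA file is missing; otherwise returns
     * the warnings the caller should log at WARN level.
     */
    static List<String> validate(boolean trustAll, List<String> trustedCaPemPaths) {
        for (String pemPath : trustedCaPemPaths) {
            if (!Files.exists(Path.of(pemPath))) {
                throw new IllegalStateException("Configured CA file not found: " + pemPath);
            }
        }
        return trustAll
                ? List.of("outbound-http.trust-all=true: TLS verification is disabled; not suitable for production")
                : List.of();
    }
}
```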

Per-connection overrides

Each OutboundConnection carries tls_trust_mode + tls_ca_pem_paths. UI surfaces a dropdown: System default (validated) / Trust custom CAs (from server config) / Trust all (insecure — testing only). Amber warning when Trust all selected. Audit logged (AuditCategory.OUTBOUND_HTTP_TRUST_CHANGE).

Deferred

See BL-001 / gitea#137:

In-app CA bundle upload / admin management
SaaS-layer CA reuse investigation (do first)

11. API surface

All env-scoped routes under /api/v1/environments/{envSlug}/alerts/... via existing @EnvPath resolver.

Alerting — rules

Method	Path	Role
`GET`	`/alerts/rules`	VIEWER+
`POST`	`/alerts/rules`	OPERATOR+
`GET`	`/alerts/rules/{id}`	VIEWER+
`PUT`	`/alerts/rules/{id}`	OPERATOR+
`DELETE`	`/alerts/rules/{id}`	OPERATOR+
`POST`	`/alerts/rules/{id}/enable` · `/disable`	OPERATOR+
`POST`	`/alerts/rules/{id}/render-preview`	OPERATOR+
`POST`	`/alerts/rules/{id}/test-evaluate`	OPERATOR+

Alerting — instances

Method	Path	Role
`GET`	`/alerts`	VIEWER+
`GET`	`/alerts/unread-count`	VIEWER+
`GET`	`/alerts/{id}`	VIEWER+
`POST`	`/alerts/{id}/ack`	VIEWER+ (if targeted) / OPERATOR+
`POST`	`/alerts/{id}/read`	VIEWER+ (self)
`POST`	`/alerts/bulk-read`	VIEWER+ (self)

Alerting — silences

Method	Path	Role
`GET`	`/alerts/silences`	VIEWER+
`POST`	`/alerts/silences`	OPERATOR+
`PUT`	`/alerts/silences/{id}`	OPERATOR+
`DELETE`	`/alerts/silences/{id}`	OPERATOR+

Alerting — notifications

Method	Path	Role
`GET`	`/alerts/{id}/notifications`	VIEWER+
`POST`	`/alerts/notifications/{id}/retry`	OPERATOR+

Outbound connections (admin)

Method	Path	Role
`GET`	`/api/v1/admin/outbound-connections`	ADMIN / OPERATOR (read-only)
`POST`	`/api/v1/admin/outbound-connections`	ADMIN
`GET`	`/api/v1/admin/outbound-connections/{id}`	ADMIN / OPERATOR (read-only)
`PUT`	`/api/v1/admin/outbound-connections/{id}`	ADMIN (409 if narrowing breaks references)
`DELETE`	`/api/v1/admin/outbound-connections/{id}`	ADMIN (409 if referenced)
`POST`	`/api/v1/admin/outbound-connections/{id}/test`	ADMIN
`GET`	`/api/v1/admin/outbound-connections/{id}/usage`	ADMIN / OPERATOR

OpenAPI regen

Per CLAUDE.md convention: after controller/DTO changes, run cd ui && npm run generate-api:live (backend on :8081) to regenerate ui/src/api/schema.d.ts. Commit regen alongside controller change.

12. CMD-K integration

Two new result sources registered in the existing UI registry (ui/src/cmdk/sources/):

Alerts — queries /alerts?q=...&limit=5 (server-side fulltext against title / message / rule_snapshot); results show severity icon + state chip; deep-link to /alerts/inbox/{id}.
Alert Rules — queries /alerts/rules?q=...&limit=5; deep-link to /alerts/rules/{id}.

No new registry machinery — uses the existing extension point.

13. UI

Routes

/alerts
  ├── /inbox          (default landing)
  ├── /all
  ├── /rules
  │     ├── /new
  │     └── /{id}     (edit; accepts ?promoteFrom=<src>&ruleId=<id>)
  ├── /silences
  └── /history

/admin/outbound-connections
  ├── /
  ├── /new
  └── /{id}

Top-nav

Insert <NotificationBell /> between env selector and user menu. Badge severity = max(severities of unread targeting me) (CRITICAL → var(--error), WARNING → var(--amber), INFO → var(--muted)). Dropdown shows 5 most-recent unread with inline ack button + "See all".

Alerts section

New sidebar/top-nav entry visible to VIEWER+. Authoring actions (POST /rules, silence create, etc.) gated to OPERATOR+.

Rule editor — 5-step wizard

Scope — radio (env-wide / app / route / agent) + pickers from env catalog (existing endpoints).
Condition — radio (6 kinds) + kind-specific form.
Trigger — threshold + comparator + window + for-duration + evaluation interval + severity; inline Test evaluate button.
Notify — title + message templates with Preview button; targets multi-select (users / groups / roles with typeahead); outbound connections multi-select filtered by current env + allowed_environment_ids.
Review — summary card, enabled toggle, save.

Silences, History, Rules list, OutboundConnectionAdminPage

Structure described in design presentation; no new design-system components required. Reuses Select, Tabs, Toggle, Button, Label, InfiniteScrollArea, PageLoader, Badge from @cameleer/design-system.

Real-time behavior

Bell: /alerts/unread-count polled every 30 s; paused when tab hidden (Page Visibility API).
Inbox view: /alerts polled every 30 s when focused.
No SSE in v1. SSE is a clean future add under /alerts/stream with no schema changes.

Accessibility

Keyboard navigation; severity conveyed via icon + text + color (not color alone); ARIA live region on inbox for new-alert announcement; bell component has descriptive aria-label.

Styling

All colors via @cameleer/design-system CSS variables (var(--error), var(--amber), var(--muted), var(--success)). No hard-coded hex.

14. Configuration

`AlertingProperties` (`cameleer.server.alerting.*`)

cameleer:
  server:
    alerting:
      evaluator-tick-interval-ms:       5000    # floor: 5000 (clamped at startup with WARN if lower)
      evaluator-batch-size:             20
      claim-ttl-seconds:                30
      notification-tick-interval-ms:    5000
      notification-batch-size:          50
      in-tick-cache-enabled:            true
      circuit-breaker-fail-threshold:   5
      circuit-breaker-window-seconds:   30
      circuit-breaker-cooldown-seconds: 60
      event-retention-days:             90
      notification-retention-days:      30
      webhook-timeout-ms:               5000
      webhook-max-attempts:             3

Env-var overridable (CAMELEER_SERVER_ALERTING_EVALUATOR_TICK_INTERVAL_MS=...). Wired via SchedulingConfigurer (not literal @Scheduled(fixedDelay=...)) so intervals come from the bean at startup. Hot-reload not supported — restart required to change cadence.

`OutboundHttpProperties` (`cameleer.server.outbound-http.*`)

See §10.

15. Retention

Daily @Scheduled(cron = "0 0 3 * * *") job AlertingRetentionJob (advisory-lock-of-the-day pattern, same as JarRetentionJob):

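-- An integer day count cannot be cast straight to interval; multiply by a one-day interval instead.
DELETE FROM alert_instances
 WHERE state = 'RESOLVED'
   AND resolved_at < now() - :eventRetentionDays * interval '1 day';

-- Note: rows with delivered_at IS NULL (for example FAILED ones) match regardless of age.
DELETE FROM alert_notifications
 WHERE status IN ('DELIVERED','FAILED')
   AND (delivered_at IS NULL OR delivered_at < now() - :notificationRetentionDays * interval '1 day');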

Retention values from AlertingProperties.

16. Observability

New metrics exposed via existing /api/v1/prometheus:

alerting_eval_duration_seconds{kind} — histogram per condition kind
alerting_eval_errors_total{kind, rule_id} — counter
alerting_circuit_open_total{kind} — counter
alerting_rule_state{state} — gauge (enabled / disabled / broken-reference)
alerting_instances_total{state, severity} — gauge (open alerts)
alerting_notifications_total{status} — counter
alerting_webhook_delivery_duration_seconds — histogram

No new dashboards shipped in v1; tenants with Prometheus + Grafana can build their own. An "Alerting health" admin sub-page is a cheap future add.

Audit

New AuditCategory values:

OUTBOUND_HTTP_TRUST_CHANGE — webhook or connection TLS config change
ALERT_RULE_CHANGE — create / update / delete rule
ALERT_SILENCE_CHANGE — create / update / delete silence
OUTBOUND_CONNECTION_CHANGE — admin CRUD on outbound connection

Emitted via existing AuditService.log(...).

17. Security

Tenant + env isolation. Every controller call runs through @EnvPath (resolves env → tenant via TenantContext). Every CH query filters by tenant_id AND environment per pre-existing invariant.
RBAC. Enforced via Spring Security @PreAuthorize on each endpoint (see §11 role column).
Webhook URL SSRF protection. At rule save, reject URLs that resolve to loopback or private addresses (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, ::1, fc00::/7) unless a deployment-level dev flag explicitly allows them.
HMAC signing. Per-connection hmac_secret encrypted at rest; signature header sent on dispatch.
TLS trust. Cross-cutting module (§10).
Audit. See §16.
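
A minimal sketch of the SSRF address check, covering only the ranges listed above. `PrivateAddressCheck` is a hypothetical name. `InetAddress.isSiteLocalAddress()` covers the three private IPv4 ranges but, for IPv6, only the deprecated fec0::/10 block, so fc00::/7 is tested by hand. The sketch checks addresses at validation time only and does not address DNS changes between save and dispatch.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical check for the loopback and private ranges named in the spec. */
final class PrivateAddressCheck {

    static boolean isBlocked(InetAddress address) {
        // Loopback: 127.0.0.0/8 and ::1. Site-local (IPv4): 10/8, 172.16/12, 192.168/16.
        if (address.isLoopbackAddress() || address.isSiteLocalAddress()) {
            return true;
        }
        if (address instanceof Inet6Address) {
            // fc00::/7 (unique local): the top seven bits of the first byte are 1111110.
            return (address.getAddress()[0] & 0xFE) == 0xFC;
        }
        return false;
    }

    /** Resolves a host (or parses an IP literal) and blocks if any resolved address is private. */
    static boolean anyBlocked(String host) {
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                if (isBlocked(address)) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Cannot resolve host: " + host, e);
        }
    }
}
```

Checking every resolved address, not just the first, matters because a hostname can carry both a public and a private record.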

18. Testing

Backend — unit (`*Test.java`, no Spring)

Each ConditionEvaluator: synthetic inputs → expected EvalResult. Fire / no-fire / threshold edges / PER_EXCHANGE cursor / for-duration debounce.
MustacheRenderer: context + template → expected output; malformed falls back + logs.
SilenceMatcher: matcher JSONB vs instance → truth table.
Jackson polymorphism: roundtrip each AlertCondition subtype.
Claim-polling concurrency (embedded PG): two threads → no duplicates.

Backend — integration (Testcontainers, `*IT.java`)

AlertingFullLifecycleIT — end-to-end rule → fire → ack → silence → delete, history survives.
AlertingEnvIsolationIT — alert in env-A invisible from env-B inbox.
OutboundConnectionAllowedEnvIT — 422 on save if connection not allowed in env; 409 on narrow-while-referenced.
WebhookDispatchIT (WireMock) — payload shape, HMAC signature, retry on 5xx, FAILED after max, no retry on 4xx.
PerformanceIT (opt-in, not default CI) — 500 rules + 5-replica simulation.
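
For the HMAC assertion in WebhookDispatchIT, the test needs to recompute the signature from the captured body. The spec does not name the algorithm, encoding or header, so the sketch below assumes HMAC-SHA256 over the UTF-8 body with lowercase hex output; `WebhookSignature` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical signer: HMAC-SHA256 over the payload, hex-encoded (algorithm is an assumption). */
final class WebhookSignature {

    static String hmacSha256Hex(String secret, String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return HexFormat.of().formatHex(mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    /** Compares two signatures with MessageDigest.isEqual rather than String.equals. */
    static boolean matches(String expectedHex, String actualHex) {
        return MessageDigest.isEqual(
                expectedHex.getBytes(StandardCharsets.UTF_8),
                actualHex.getBytes(StandardCharsets.UTF_8));
    }
}
```

Signing the snapshotted `alert_notifications.payload` rather than a re-rendered body keeps the signature stable across retries.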

Frontend — component (Vitest + Testing Library)

Rule editor wizard step navigation + validation.
Bell polling pause on tab hide.
Inbox row rendering by severity.
CMD-K result-source registration.

Frontend — E2E (Playwright if infra supports)

Create rule → inject matching data → bell badge appears → open alert → ack → badge clears.

19. Rollout

No feature flag. Alerting is dormant-by-default: zero rules → zero evaluator work → zero behavior change. Migration is additive.
Migration rollback. V11 PG migration has matching down-script; CH projections are IF NOT EXISTS-safe and droppable without data loss.
Progressive adoption. First user creates the first rule; feature organically spreads from there.
Documentation. Add an admin-facing alerting guide under docs/ describing rule shapes, template variables, webhook destinations, and silence patterns.
.claude/rules/ updates. app-classes.md and core-classes.md updated to document the new packages and any touched classes — part of the change, not a follow-up.

20. Open questions / items for writing-plans

These are not design-level decisions — they're implementation-phase tasks to be carried into planning:

Alignment with existing OIDC outbound cert handling. Before implementing ApacheOutboundHttpClientFactory, audit how OidcProviderHelper / OidcTokenExchanger currently validate certs. If there's a pattern in place, mirror it; if not, adopt the factory as the one-true-way and retrofit OIDC in a separate follow-up (not part of alerting v1).
hmac_secret encryption-at-rest. Decide between Jasypt (simplest, adds a dep) and a bespoke encrypt/decrypt over the existing Ed25519-derived key material (no new dep, ~50 LOC). Defer to plan.
V1 CH migration file naming. Confirm the convention for alerting-owned CH migrations (V_alerting_projections.sql vs numbered). Current ClickHouseSchemaInitializer runs files idempotently — naming is informational.
Bell component keyboard shortcut. Optional; align with existing CMD-K shortcut conventions.
Target picker UX. How to mix user / group / role in one multi-select with typeahead. Small UX design task.
Env-delete cascade audit. Before merge, verify the full cascade chain empirically in a PG integration test — POC safety depends on it.
