Comprehensive design spec for a confined, env-scoped alerting feature: 6 signal sources, shared env-scoped rules with RBAC-targeted notifications, in-app inbox + webhook delivery via admin-managed outbound connections, claim-based polling for horizontal scalability, 4 CH projections for hot-path reads. Backlog entry BL-001 / gitea#137 tracks deferred managed-CA investigation (reuse SaaS-layer CA handling first before building in-server storage). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alerting — Design Spec
Date: 2026-04-19
Status: Draft — awaiting user review
Surfaces: server (core + app), UI, admin, Gitea issues
Related: backlog BL-001 / gitea#137 (managed CA bundles — deferred)
1. Summary
A first-class alerting feature inside Cameleer. Operators author rules that evaluate conditions over observability data; violations create shared, env-scoped alerts visible in an in-app inbox and optionally dispatched to external systems via admin-curated webhook connections. Lifecycle: PENDING → FIRING → ACKNOWLEDGED → RESOLVED with orthogonal SILENCED. Horizontally scalable via PostgreSQL claim-based polling. All code confined to new alerting/, outbound/, and http/ packages with minimal, documented touchpoints on existing stores.
Guiding principles
- "Good enough" baseline. Customers with dedicated ops tooling (PagerDuty, Grafana, Opsgenie) will keep using it — alerting here serves those without. Resist incident-management feature creep; provide the floor, not the ceiling.
- Confinement over cleverness. Reads go through existing interfaces; no hooks into ingestion; no new ClickHouse tables; all new code in dedicated packages. The feature should be removable by deleting those packages and one migration.
- Env-scoped by default, tenant-global where infrastructure. Rules, alerts, silences live inside an environment. Outbound connections are tenant-global infrastructure admins manage, optionally restricted by env.
- Performance is a first-class design concern, not a v2 afterthought. Claim-polling, query coalescing, in-tick caching, per-kind circuit breaker, and CH projections are all v1.
- No ClickHouse table changes, only projections. Additive, idempotent (IF NOT EXISTS), safe to drop and rebuild.
2. Scope
In scope (v1)
Six signal sources, expressed as sealed-type conditions:
- ROUTE_METRIC — aggregate stats per route or app: error rate, p95/p99 latency, throughput, error count. Backed by stats_1m_route.
- EXCHANGE_MATCH — per-exchange matching with two fire modes:
  - PER_EXCHANGE — one alert per matching exchange (cursor-advanced, used for "specific failure" patterns)
  - COUNT_IN_WINDOW — aggregate "N exchanges matched in window" threshold
- AGENT_STATE — agent in DEAD / STALE state for ≥ N seconds. Reads in-memory AgentRegistryService.
- DEPLOYMENT_STATE — deployment status is FAILED / DEGRADED for ≥ N seconds.
- LOG_PATTERN — count of log rows matching level / logger / pattern in a window > threshold.
- JVM_METRIC — agent-reported JVM/Camel metric (heap %, GC pressure, inflight) over threshold for a window.
Delivery channels. In-app inbox (derived from alerts + target-membership) and outbound HTTPS webhooks (via admin-managed outbound connections). No email. No native Slack/Teams integrations — users point webhook URLs at their own integrations.
Sharing model. Rules are shared within an environment; alerts are visible to any viewer of the env, but notifications route to targeted users, groups, or roles (via existing RBAC).
Lifecycle states. PENDING → FIRING → ACKNOWLEDGED → RESOLVED, with SILENCED as an orthogonal property resolved at notification-dispatch time (preserves audit trail).
Rule promotion across environments via UI prefill — no new server endpoint.
CMD-K integration — alerts + alert rules appear as new result sources in the existing CMD-K registry.
Configurable evaluator cadence (5 s floor), per-rule evaluation intervals, per-rule re-notify cadence.
Out of scope (v1, not deferred)
- Custom SQL / Prometheus-style query DSL (option F).
- Email delivery channel — webhooks cover Slack / PagerDuty / Teams / OpsGenie / n8n / Zapier via ops-team-owned integrations.
- Native provider integrations (Slack, Teams, PagerDuty as first-class types).
- Incident management (merging alerts, parent/child, assignees, SLA tracking) — integrate with PagerDuty via webhook instead.
- Expression language in rules — fixed templates only.
- mTLS / client-cert auth on outbound webhooks.
- Real-time push (SSE) to the UI — 30 s polling is the v1 cadence. SSE is a clean drop-in for v2 if needed.
Deferred to backlog
- BL-001 / gitea#137 — in-app CA bundle management UI. Deferred pending investigation of reusing the SaaS layer's existing CA handling (KISS / DRY). V1 CA trust material is filesystem-resident via deployment config, same posture as OIDC issuer URIs and Ed25519 keys.
3. Key decisions
Captured from brainstorming, in order of architectural impact.
| Decision | Chosen | Rejected | Rationale |
|---|---|---|---|
| Signal sources | 6 (route / exchange / agent / deployment / log / JVM) | SQL power-user mode | "Good enough" baseline; fixed templates cover real needs; expression languages are where observability tools go to be rewritten |
| Delivery channels | in-app + webhook | email, native integrations | Webhooks cover every target; email is deceptively expensive (deliverability, bounces, DKIM) |
| Sharing | tenant-env-shared rules; notifications target users/groups/roles | per-user "my alerts" | Ops products need single source of truth for what's broken; targets give per-person routing without duplicating rules |
| Evaluation | pull / claim-based polling | push / event-driven | Confinement — reads through existing interfaces, zero ingestion hooks; native handling of "no data" condition; 60 s latency acceptable for ops alerting |
| Horizontal scale | FOR UPDATE SKIP LOCKED claim pattern | advisory locks / leader election | Naturally partitions work; supports per-rule cadences; recovers from replica death; industry-standard |
| Alert lifecycle | FIRING / ACK / RESOLVED + SILENCED | minimal fire/resolve only, full incident workflow | Ack is the floor for team workflows (stop paging everyone); silences needed for ops maintenance; incident mgmt is a product category, not a feature |
| Rule shape | fixed templates, sealed-type JSONB | expression DSL, expression-first | Form-fillable; typed; additive for new kinds; consistent with no-SQL decision |
| Templating | JMustache | in-house substituter, Pebble/Freemarker | Industry standard for webhook templates (Slack, PagerDuty); logic-less (safe); small dep; familiar to ops users |
| UI placement | top-nav bell (consumer) + /alerts section (OPERATOR+ authoring, VIEWER+ read) | admin-only page, embedded per context, new top-level tab only | Separates consumer from authoring surfaces; rule authoring happens frequently, shouldn't be buried in admin |
| CMD-K | alerts + rules searchable | not searchable | Covers the "I saw this alert before lunch" use case; small surface via existing result-source registry |
| Outbound connections | admin-managed, tenant-global, allowed-env restriction | per-rule raw webhook URLs | Admins own infrastructure; operators author rules; rotation is atomic across N rules; reusable for future integrations |
| TLS trust | shared cross-cutting module http/ | alerting-local trust config | Future-proofs for additional outbound HTTPS consumers; joins the existing OIDC outbound path |
| CA management UI | deferred (BL-001) | build in-server now | SaaS-layer CA mechanism should be investigated first for reuse |
| Env deletion | full cascade across alerting tables | partial cascade with SET NULL | POC teardown safety — zero orphaned rows |
4. Module architecture
Package layout
cameleer-server-core/src/main/java/com/cameleer/server/core/
├── alerting/ (domain; pure records + interfaces)
│ ├── AlertRule
│ ├── AlertCondition (sealed)
│ │ ├── RouteMetricCondition
│ │ ├── ExchangeMatchCondition
│ │ ├── AgentStateCondition
│ │ ├── DeploymentStateCondition
│ │ ├── LogPatternCondition
│ │ └── JvmMetricCondition
│ ├── AlertSeverity / AlertState (enums)
│ ├── AlertInstance / AlertEvent
│ ├── NotificationTarget / NotificationTargetKind
│ ├── AlertSilence / SilenceMatcher
│ ├── AlertRuleRepository (interface)
│ ├── AlertInstanceRepository (interface)
│ ├── AlertSilenceRepository (interface)
│ ├── AlertNotificationRepository (interface)
│ ├── AlertReadRepository (interface)
│ ├── ConditionEvaluator<C> (sealed)
│ └── NotificationDispatcher (interface)
├── outbound/ (admin-managed outbound connections)
│ ├── OutboundConnection
│ ├── OutboundAuth (sealed — NONE, BEARER, BASIC)
│ ├── TrustMode (enum)
│ └── OutboundConnectionRepository (interface)
└── http/ (cross-cutting outbound HTTP primitive)
├── OutboundHttpProperties
├── OutboundHttpRequestContext
└── OutboundHttpClientFactory (interface)
cameleer-server-app/src/main/java/com/cameleer/server/app/
├── alerting/
│ ├── controller/ (REST)
│ │ ├── AlertRuleController
│ │ ├── AlertController
│ │ ├── AlertSilenceController
│ │ └── AlertNotificationController
│ ├── storage/ (Postgres)
│ │ ├── PostgresAlertRuleRepository
│ │ ├── PostgresAlertInstanceRepository
│ │ ├── PostgresAlertSilenceRepository
│ │ ├── PostgresAlertNotificationRepository
│ │ └── PostgresAlertReadRepository
│ ├── eval/ (the scheduled evaluators)
│ │ ├── AlertEvaluatorJob (@Scheduled, claim-based)
│ │ ├── RouteMetricEvaluator
│ │ ├── ExchangeMatchEvaluator
│ │ ├── AgentStateEvaluator
│ │ ├── DeploymentStateEvaluator
│ │ ├── LogPatternEvaluator
│ │ ├── JvmMetricEvaluator
│ │ ├── PerKindCircuitBreaker
│ │ └── TickCache
│ ├── notify/
│ │ ├── NotificationDispatchJob (@Scheduled, claim-based)
│ │ ├── InAppInboxQuery
│ │ ├── WebhookDispatcher
│ │ ├── MustacheRenderer
│ │ └── SilenceMatcher
│ ├── dto/ (AlertRuleDto, AlertDto, ConditionDto sealed, WebhookDto, etc.)
│ ├── retention/
│ │ └── AlertingRetentionJob (daily @Scheduled)
│ └── config/
│ └── AlertingProperties (@ConfigurationProperties)
├── outbound/
│ ├── controller/
│ │ └── OutboundConnectionAdminController
│ ├── storage/
│ │ └── PostgresOutboundConnectionRepository
│ └── dto/
│ └── OutboundConnectionDto
└── http/
├── ApacheOutboundHttpClientFactory
├── SslContextBuilder
└── config/
└── OutboundHttpConfig (@ConfigurationProperties)
cameleer-server-app/src/main/resources/
├── db/migration/V11__alerting_and_outbound.sql (one Flyway migration)
└── clickhouse/V_alerting_projections.sql (one CH migration, idempotent)
ui/src/
├── pages/Alerts/
│ ├── InboxPage.tsx
│ ├── AllAlertsPage.tsx
│ ├── RulesListPage.tsx
│ ├── RuleEditor/
│ │ ├── RuleEditorWizard.tsx
│ │ ├── ScopeStep.tsx
│ │ ├── ConditionStep.tsx
│ │ ├── TriggerStep.tsx
│ │ ├── NotifyStep.tsx
│ │ └── ReviewStep.tsx
│ ├── SilencesPage.tsx
│ └── HistoryPage.tsx
├── pages/Admin/
│ └── OutboundConnectionsPage.tsx
├── components/
│ ├── NotificationBell.tsx
│ └── AlertStateChip.tsx
├── api/queries/
│ ├── alerts.ts
│ ├── alertRules.ts
│ ├── alertSilences.ts
│ └── outboundConnections.ts
└── cmdk/sources/
├── alerts.ts
└── alertRules.ts
Touchpoints on existing code (deliberate, minimal)
| Existing surface | Change | Scope |
|---|---|---|
| cameleer-server-app/src/main/resources/db/migration/V11__… | New Flyway migration | additive |
| cameleer-server-app/src/main/resources/clickhouse/V_…_projections.sql | New CH migration | additive, IF NOT EXISTS |
| ClickHouseLogStore | New method long countLogs(LogSearchRequest) (no FINAL) | one public method added |
| ClickHouseSearchIndex | New method long countExecutionsForAlerting(AlertMatchSpec) (no FINAL, no text-in-body subqueries) | one public method added |
| SecurityConfig | Path matchers for new endpoints | ~15 lines |
| ui/src/router.tsx | Route entries for /alerts/** and /admin/outbound-connections | additive |
| Top-nav layout | Insert <NotificationBell /> | one import + one component |
| CMD-K registry | Register alerts + alertRules result sources | two file additions + one import |
| .claude/rules/app-classes.md + core-classes.md | Update class maps for the new packages | documentation |
| com.cameleer:cameleer-common | no changes | — |
| ingestion paths | no changes | — |
| agent protocol | no changes | — |
| ClickHouse schema (table structure) | no changes — only projections added | — |
New dependencies
- com.samskivert:jmustache — logic-less Mustache templating for webhook/notification templates. ~30 KB, zero transitive deps. Added to cameleer-server-core.
- Apache HttpClient 5 (org.apache.hc.client5) — already present in the project; no new coordinate.
5. Data model (PostgreSQL)
One Flyway migration V11__alerting_and_outbound.sql creates all tables, enums, and indexes in a single transaction.
Enum types
CREATE TYPE severity_enum AS ENUM ('CRITICAL','WARNING','INFO');
CREATE TYPE condition_kind_enum AS ENUM ('ROUTE_METRIC','EXCHANGE_MATCH','AGENT_STATE','DEPLOYMENT_STATE','LOG_PATTERN','JVM_METRIC');
CREATE TYPE alert_state_enum AS ENUM ('PENDING','FIRING','ACKNOWLEDGED','RESOLVED');
CREATE TYPE target_kind_enum AS ENUM ('USER','GROUP','ROLE');
CREATE TYPE notification_status_enum AS ENUM ('PENDING','DELIVERED','FAILED');
CREATE TYPE trust_mode_enum AS ENUM ('SYSTEM_DEFAULT','TRUST_ALL','TRUST_PATHS');
CREATE TYPE outbound_method_enum AS ENUM ('POST','PUT','PATCH');
CREATE TYPE outbound_auth_kind_enum AS ENUM ('NONE','BEARER','BASIC');
Tables
outbound_connections (admin-managed)
CREATE TABLE outbound_connections (
id uuid PRIMARY KEY,
tenant_id varchar(64) NOT NULL,
name varchar(100) NOT NULL,
description text,
url text NOT NULL, -- Mustache-enabled
method outbound_method_enum NOT NULL,
default_headers jsonb NOT NULL DEFAULT '{}', -- values are Mustache templates
default_body_tmpl text, -- null = built-in default JSON envelope
tls_trust_mode trust_mode_enum NOT NULL DEFAULT 'SYSTEM_DEFAULT',
tls_ca_pem_paths jsonb NOT NULL DEFAULT '[]', -- array of paths from OutboundHttpProperties
hmac_secret text, -- Ed25519-key-derived encryption at rest
auth_kind outbound_auth_kind_enum NOT NULL DEFAULT 'NONE',
auth_config jsonb NOT NULL DEFAULT '{}', -- shape depends on auth_kind; v1 unused
allowed_environment_ids uuid[] NOT NULL DEFAULT '{}', -- [] = allowed in all envs
created_at timestamptz NOT NULL DEFAULT now(),
created_by uuid NOT NULL REFERENCES users(id),
updated_at timestamptz NOT NULL DEFAULT now(),
updated_by uuid NOT NULL REFERENCES users(id),
UNIQUE (tenant_id, name)
);
CREATE INDEX outbound_connections_tenant_idx ON outbound_connections (tenant_id);
alert_rules
CREATE TABLE alert_rules (
id uuid PRIMARY KEY,
environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
name varchar(200) NOT NULL,
description text,
severity severity_enum NOT NULL,
enabled boolean NOT NULL DEFAULT true,
condition_kind condition_kind_enum NOT NULL,
condition jsonb NOT NULL, -- sealed-subtype payload, Jackson-DEDUCTION polymorphic
evaluation_interval_seconds int NOT NULL DEFAULT 60 CHECK (evaluation_interval_seconds >= 5),
for_duration_seconds int NOT NULL DEFAULT 0 CHECK (for_duration_seconds >= 0),
re_notify_minutes int NOT NULL DEFAULT 60 CHECK (re_notify_minutes >= 0),
notification_title_tmpl text NOT NULL, -- Mustache
notification_message_tmpl text NOT NULL, -- Mustache
webhooks jsonb NOT NULL DEFAULT '[]', -- [{id: uuid, outboundConnectionId, bodyOverride?, headerOverrides?}] — id assigned server-side on save, used as stable ref from alert_notifications.webhook_id
next_evaluation_at timestamptz NOT NULL DEFAULT now(),
claimed_by varchar(64),
claimed_until timestamptz,
eval_state jsonb NOT NULL DEFAULT '{}',
created_at timestamptz NOT NULL DEFAULT now(),
created_by uuid NOT NULL REFERENCES users(id),
updated_at timestamptz NOT NULL DEFAULT now(),
updated_by uuid NOT NULL REFERENCES users(id)
);
CREATE INDEX alert_rules_env_idx ON alert_rules (environment_id);
CREATE INDEX alert_rules_claim_due_idx ON alert_rules (next_evaluation_at) WHERE enabled = true;
alert_rule_targets
CREATE TABLE alert_rule_targets (
id uuid PRIMARY KEY,
rule_id uuid NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE,
target_kind target_kind_enum NOT NULL,
target_id varchar(128) NOT NULL,
UNIQUE (rule_id, target_kind, target_id)
);
CREATE INDEX alert_rule_targets_lookup_idx ON alert_rule_targets (target_kind, target_id);
alert_instances
CREATE TABLE alert_instances (
id uuid PRIMARY KEY,
rule_id uuid REFERENCES alert_rules(id) ON DELETE SET NULL,
rule_snapshot jsonb NOT NULL,
environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
state alert_state_enum NOT NULL,
severity severity_enum NOT NULL,
fired_at timestamptz NOT NULL,
acked_at timestamptz,
acked_by uuid REFERENCES users(id),
resolved_at timestamptz,
last_notified_at timestamptz,
silenced boolean NOT NULL DEFAULT false,
current_value numeric,
threshold numeric,
context jsonb NOT NULL,
title text NOT NULL,
message text NOT NULL,
target_user_ids uuid[] NOT NULL DEFAULT '{}',
target_group_ids uuid[] NOT NULL DEFAULT '{}',
target_role_names text[] NOT NULL DEFAULT '{}'
);
CREATE INDEX alert_instances_inbox_idx ON alert_instances (environment_id, state, fired_at DESC);
CREATE INDEX alert_instances_open_rule_idx ON alert_instances (rule_id, state) WHERE rule_id IS NOT NULL;
CREATE INDEX alert_instances_resolved_idx ON alert_instances (resolved_at) WHERE state = 'RESOLVED';
CREATE INDEX alert_instances_target_u_idx ON alert_instances USING GIN (target_user_ids);
CREATE INDEX alert_instances_target_g_idx ON alert_instances USING GIN (target_group_ids);
CREATE INDEX alert_instances_target_r_idx ON alert_instances USING GIN (target_role_names);
alert_silences
CREATE TABLE alert_silences (
id uuid PRIMARY KEY,
environment_id uuid NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
matcher jsonb NOT NULL, -- { ruleId?, appSlug?, routeId?, agentId?, severity? }
reason text,
starts_at timestamptz NOT NULL,
ends_at timestamptz NOT NULL CHECK (ends_at > starts_at),
created_by uuid NOT NULL REFERENCES users(id),
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_silences_active_idx ON alert_silences (environment_id, ends_at);
alert_notifications (webhook delivery outbox)
CREATE TABLE alert_notifications (
id uuid PRIMARY KEY,
alert_instance_id uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
webhook_id uuid, -- opaque ref into rule's webhooks JSONB
outbound_connection_id uuid REFERENCES outbound_connections(id) ON DELETE SET NULL,
status notification_status_enum NOT NULL DEFAULT 'PENDING',
attempts int NOT NULL DEFAULT 0,
next_attempt_at timestamptz NOT NULL DEFAULT now(),
claimed_by varchar(64),
claimed_until timestamptz,
last_response_status int,
last_response_snippet text,
payload jsonb NOT NULL, -- snapshotted at first attempt
delivered_at timestamptz,
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX alert_notifications_pending_idx ON alert_notifications (next_attempt_at) WHERE status = 'PENDING';
CREATE INDEX alert_notifications_instance_idx ON alert_notifications (alert_instance_id);
alert_reads
CREATE TABLE alert_reads (
user_id uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,
alert_instance_id uuid NOT NULL REFERENCES alert_instances(id) ON DELETE CASCADE,
read_at timestamptz NOT NULL DEFAULT now(),
PRIMARY KEY (user_id, alert_instance_id)
);
Cascade summary
environments → alert_rules (CASCADE) → alert_rule_targets (CASCADE)
environments → alert_silences (CASCADE)
environments → alert_instances (CASCADE) → alert_reads (CASCADE)
→ alert_notifications (CASCADE)
alert_rules → alert_instances (SET NULL, rule_snapshot preserves context)
users → alert_reads (CASCADE)
outbound_connections (delete): blocked by an app-level 409 check while any alert_rules.webhooks entry references the connection (JSONB reference, no database FK)
Rule deletion preserves history (alert_instances.rule_id = NULL, rule_snapshot retains details). Environment deletion leaves zero alerting rows — POC-safe.
Jackson polymorphism for conditions
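import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonSubTypes.Type;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// DEDUCTION: no type discriminator in the JSON; the subtype is inferred from the fields present.
@JsonTypeInfo(use = JsonTypeInfo.Id.DEDUCTION)
@JsonSubTypes({
    @Type(RouteMetricCondition.class),
    @Type(ExchangeMatchCondition.class),
    @Type(AgentStateCondition.class),
    @Type(DeploymentStateCondition.class),
    @Type(LogPatternCondition.class),
    @Type(JvmMetricCondition.class),
})
public sealed interface AlertCondition permits
        RouteMetricCondition, ExchangeMatchCondition, AgentStateCondition,
        DeploymentStateCondition, LogPatternCondition, JvmMetricCondition {
    ConditionKind kind();
}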
Jackson deduces the subtype from the set of present fields. Bean Validation (@Valid) on each record validates at the controller boundary.
Example condition payloads:
// ROUTE_METRIC
{ "scope": {"appSlug":"orders","routeId":"route-1"},
"metric": "P99_LATENCY_MS", "comparator": "GT", "threshold": 2000, "windowSeconds": 300 }
// EXCHANGE_MATCH — PER_EXCHANGE
{ "scope": {"appSlug":"orders"},
"filter": {"status":"FAILED","attributes":{"type":"payment"}},
"fireMode": "PER_EXCHANGE", "perExchangeLingerSeconds": 300 }
// EXCHANGE_MATCH — COUNT_IN_WINDOW
{ "scope": {"appSlug":"orders"},
"filter": {"status":"FAILED"},
"fireMode": "COUNT_IN_WINDOW", "threshold": 5, "windowSeconds": 900 }
// AGENT_STATE
{ "scope": {"appSlug":"orders"}, "state": "DEAD", "forSeconds": 60 }
// DEPLOYMENT_STATE
{ "scope": {"appSlug":"orders"}, "states": ["FAILED","DEGRADED"] }
// LOG_PATTERN
{ "scope": {"appSlug":"orders"}, "level": "ERROR",
"pattern": "TimeoutException", "threshold": 5, "windowSeconds": 900 }
// JVM_METRIC
{ "scope": {"appSlug":"orders"}, "metric": "heap_used_percent",
"aggregation": "MAX", "comparator": "GT", "threshold": 90, "windowSeconds": 300 }
Claim-polling queries
-- Rule evaluator
UPDATE alert_rules
SET claimed_by = :instance, claimed_until = now() + interval '30 seconds'
WHERE id IN (
SELECT id FROM alert_rules
WHERE enabled = true
AND next_evaluation_at <= now()
AND (claimed_until IS NULL OR claimed_until < now())
ORDER BY next_evaluation_at
LIMIT :batch
FOR UPDATE SKIP LOCKED
)
RETURNING *;
-- Notification dispatcher (same pattern on alert_notifications with status='PENDING')
FOR UPDATE SKIP LOCKED is the crux: replicas never block each other.
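The claim predicate in the query above can also be read as plain Java. The following is a minimal in-memory sketch, not the repository implementation: RuleRow, ClaimSketch, and the LEASE constant are hypothetical names that mirror the claim columns and the 30-second interval, and the synchronized keyword only stands in for the row locking that PostgreSQL provides.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

/** Hypothetical in-memory stand-in for the claim columns of one alert_rules row. */
final class RuleRow {
    final String id;
    final boolean enabled;
    final Instant nextEvaluationAt;
    String claimedBy;      // null = unclaimed
    Instant claimedUntil;  // null = no lease held

    RuleRow(String id, boolean enabled, Instant nextEvaluationAt) {
        this.id = id;
        this.enabled = enabled;
        this.nextEvaluationAt = nextEvaluationAt;
    }
}

final class ClaimSketch {
    /** Assumed lease length, mirroring interval '30 seconds' in the SQL. */
    static final Duration LEASE = Duration.ofSeconds(30);

    /** Same three conditions as the WHERE clause: enabled, due, and no live lease. */
    static boolean isClaimable(RuleRow r, Instant now) {
        boolean due = !r.nextEvaluationAt.isAfter(now);
        boolean leaseFree = r.claimedUntil == null || r.claimedUntil.isBefore(now);
        return r.enabled && due && leaseFree;
    }

    /** Claims up to batch rows, oldest due first, and stamps the lease on each. */
    static synchronized List<RuleRow> claim(List<RuleRow> table, String instance, Instant now, int batch) {
        List<RuleRow> claimed = table.stream()
                .filter(r -> isClaimable(r, now))
                .sorted(Comparator.comparing((RuleRow r) -> r.nextEvaluationAt))
                .limit(batch)
                .toList();
        for (RuleRow r : claimed) {
            r.claimedBy = instance;
            r.claimedUntil = now.plus(LEASE);
        }
        return claimed;
    }
}
```

In this model a rule claimed by a replica that then dies becomes claimable again once its lease has passed, which is the recovery property the claim pattern is chosen for.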
6. Outbound connections
Concept
An OutboundConnection is a reusable, admin-managed HTTPS destination. Alert rules reference connections by ID and may override body or header templates per rule. Rotating a URL or secret updates every rule atomically.
Tenant-global. Slack URLs and PagerDuty keys are team infrastructure, not env-specific. Env-specific routing is achieved by creating multiple connections (slack-prod, slack-dev) and referencing the appropriate one in each env's rules.
Allowed-env restriction. allowed_environment_ids (default empty = all envs). Admin restricts a connection to specific envs via a multi-select on the connection form. UI picker filters by current env; rule save validates (422 on violation); narrowing the restriction while rules still reference it returns 409 with conflict list.
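The two checks described here (rule save and narrowing) reduce to one membership test. A minimal sketch follows, assuming environments and rules are identified by UUID as in the schema; AllowedEnvSketch and its method names are hypothetical, and the mapping of conflicts to HTTP 422 / 409 is left to the controller.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

final class AllowedEnvSketch {
    /** An empty set means "allowed in all environments", mirroring the column default. */
    static boolean isAllowedIn(Set<UUID> allowedEnvironmentIds, UUID environmentId) {
        return allowedEnvironmentIds.isEmpty() || allowedEnvironmentIds.contains(environmentId);
    }

    /**
     * Rule IDs that would be left referencing a connection their environment may no
     * longer use. A non-empty result corresponds to the 409 conflict list.
     */
    static List<UUID> conflictsOnNarrowing(Set<UUID> newAllowed, Map<UUID, UUID> ruleIdToEnvId) {
        return ruleIdToEnvId.entrySet().stream()
                .filter(e -> !isAllowedIn(newAllowed, e.getValue()))
                .map(Map.Entry::getKey)
                .sorted()
                .toList();
    }
}
```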
Delete semantics. 409 if any rule references the connection. No silent cascade — admin must first remove references.
Default body template (when rule has no override)
{
"alert": { "id", "state", "firedAt", "severity", "title", "message", "link" },
"rule": { "id", "name", "description", "severity" },
"env": { "slug", "id" },
"context": { /* full Mustache context: app, route, agent, exchange, etc. */ }
}
"Just plug in my Slack incoming webhook URL" works without writing a template.
HMAC signing (optional per connection)
When hmac_secret is set, dispatch adds X-Cameleer-Signature: sha256=<hmac(secret, body)> header. GitHub / Stripe pattern. Secret encrypted at rest — concrete approach (Jasypt vs bespoke over existing Ed25519-derived key material) decided in planning (see §20).
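A signature of this shape can be computed with the JDK alone. The sketch below assumes HMAC-SHA256 over the raw body bytes with lowercase hex encoding, which is the GitHub convention; the spec itself only fixes the sha256= prefix, and WebhookSignatureSketch is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class WebhookSignatureSketch {
    /** Value for the X-Cameleer-Signature header; hex encoding is an assumption. */
    static String signatureHeaderValue(String secret, byte[] body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return "sha256=" + HexFormat.of().formatHex(mac.doFinal(body));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    /** Receiver-side check using a constant-time comparison. */
    static boolean verify(String secret, byte[] body, String receivedHeader) {
        byte[] expected = signatureHeaderValue(secret, body).getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, receivedHeader.getBytes(StandardCharsets.UTF_8));
    }
}
```

Signing the exact bytes that go on the wire matters here: because retries replay the snapshotted payload, the signature can be computed once per snapshot and stays valid across attempts.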
7. Rule evaluation
Scheduler
@Component
public class AlertEvaluatorJob implements SchedulingConfigurer {
    // Interval wired via AlertingProperties.evaluatorTickIntervalMs (floor 5000)
    @Override
    public void configureTasks(ScheduledTaskRegistrar registrar) {
        registrar.addFixedDelayTask(this::tick, properties.effectiveEvaluatorTickIntervalMs());
    }

    void tick() {
        // Claim a batch of due rules; rows locked by other replicas are skipped.
        List<AlertRule> claimed = ruleRepo.claimDueRules(instanceId, properties.batchSize());
        // Group by (kind, window) so each group can share one coalesced query.
        var groups = claimed.stream().collect(groupingBy(r -> new GroupKey(r.conditionKind(), windowSeconds(r))));
        for (var entry : groups.entrySet()) {
            // Breaker open for this kind: skip evaluation and push the rules to a later tick.
            if (circuitBreaker.isOpen(entry.getKey().kind())) { rescheduleBatch(entry.getValue()); continue; }
            try {
                coalescedEvaluate(entry.getKey(), entry.getValue());
            } catch (Exception e) {
                // A failure counts against the kind's breaker; the batch is retried later.
                circuitBreaker.recordFailure(entry.getKey().kind());
                rescheduleBatch(entry.getValue());
            }
        }
    }
}
Per-condition evaluators
| Kind | Read source | Query shape |
|---|---|---|
| ROUTE_METRIC | SearchService.statsForRoute / statsForApp | Stats over window; comparator vs threshold |
| EXCHANGE_MATCH (PER_EXCHANGE) | SearchService.search(SearchRequest) | timestamp > eval_state.lastExchangeTs AND filter → fire one alert per match, advance cursor |
| EXCHANGE_MATCH (COUNT_IN_WINDOW) | ClickHouseSearchIndex.countExecutionsForAlerting(spec) | Count in window vs threshold |
| AGENT_STATE | AgentRegistryService.listByEnvironment | Any agent matches scope + state |
| DEPLOYMENT_STATE | DeploymentRepository.findLatestByAppAndEnv | Status in target set |
| LOG_PATTERN | ClickHouseLogStore.countLogs(LogSearchRequest) | Count in window vs threshold |
| JVM_METRIC | MetricsQueryStore | Latest value (aggregation per rule) vs threshold |
State machine
(cond holds for <forDuration)
PENDING ──────▶ keep pendingSince
▲ │
│ ▼ (cond holds ≥ forDuration)
│ FIRING ◀── (re-eval matches; update last_notified_at cadence)
│ / \
│ / \
│ ack/ \resolve
│ ▼ ▼
│ ACKNOWLEDGED RESOLVED ── (cond false again → cycle can restart)
PER_EXCHANGE mode: each match is its own brief FIRING instance that auto-resolves after perExchangeLingerSeconds (default 300 s). History retains it for 90 d.
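The PENDING to FIRING step is a duration comparison against for_duration_seconds. A minimal sketch, assuming the evaluator keeps a pendingSince timestamp while the condition holds; SketchAlertState and ForDurationSketch are hypothetical names, and what happens when the condition stops holding during PENDING is deliberately left out because the diagram does not specify it.

```java
import java.time.Duration;
import java.time.Instant;

enum SketchAlertState { PENDING, FIRING, ACKNOWLEDGED, RESOLVED }

final class ForDurationSketch {
    /**
     * State of an alert whose condition has held continuously since pendingSince.
     * It stays PENDING until the condition has held for at least forDuration.
     */
    static SketchAlertState stateWhileConditionHolds(Instant pendingSince, Instant now, Duration forDuration) {
        Duration held = Duration.between(pendingSince, now);
        return held.compareTo(forDuration) >= 0 ? SketchAlertState.FIRING : SketchAlertState.PENDING;
    }
}
```

With the column default of 0 seconds, the comparison is satisfied on the first matching evaluation, so a rule without a for-duration fires immediately.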
Performance optimizations (v1)
- Four ClickHouse projections (new CH migration, idempotent):
ALTER TABLE executions ADD PROJECTION IF NOT EXISTS alerting_app_status
  (SELECT * ORDER BY (tenant_id, environment, application_id, status, start_time));
ALTER TABLE executions ADD PROJECTION IF NOT EXISTS alerting_route_status
  (SELECT * ORDER BY (tenant_id, environment, route_id, status, start_time));
ALTER TABLE logs ADD PROJECTION IF NOT EXISTS alerting_app_level
  (SELECT * ORDER BY (tenant_id, environment, application, level, timestamp));
ALTER TABLE agent_metrics ADD PROJECTION IF NOT EXISTS alerting_instance_metric
  (SELECT * ORDER BY (tenant_id, environment, instance_id, metric_name, collected_at));
stats_1m_route's existing ORDER BY already aligns with alerting access patterns; no projection needed.
- Drop FINAL for alerting counts. New methods ClickHouseLogStore.countLogs(...) and ClickHouseSearchIndex.countExecutionsForAlerting(...) skip FINAL — alerting tolerates brief duplicate-row over-count (alert fires briefly, self-resolves on next tick after merge). Existing UI-facing count() path unchanged.
- Per-tick query coalescing. Rules of the same kind + window share one aggregate query per tick.
- In-tick cache. Map<QueryKey, Long> discarded at tick end. Two rules hitting the same (app, route, window, metric) produce one CH call.
- Per-kind circuit breaker. 5 failures in 30 s → open for 60 s. Metric alerting_circuit_open_total{kind}. UI surfaces an admin banner when open.
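The breaker thresholds above (5 failures in 30 s, open for 60 s) fit in a small sliding-window counter. This is a sketch under assumptions, not the PerKindCircuitBreaker class itself: the kind is a plain String, the current time is passed in explicitly so the behavior is testable, and metrics are omitted.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class PerKindCircuitBreakerSketch {
    private static final int FAILURE_THRESHOLD = 5;
    private static final long WINDOW_MS = 30_000;
    private static final long OPEN_MS = 60_000;

    private final Map<String, Deque<Long>> failures = new HashMap<>();
    private final Map<String, Long> openUntil = new HashMap<>();

    /** Records one failure; opens the breaker once 5 failures fall inside the last 30 s. */
    synchronized void recordFailure(String kind, long nowMs) {
        Deque<Long> recent = failures.computeIfAbsent(kind, k -> new ArrayDeque<>());
        recent.addLast(nowMs);
        // Drop failures that have aged out of the sliding window.
        while (!recent.isEmpty() && recent.peekFirst() <= nowMs - WINDOW_MS) {
            recent.removeFirst();
        }
        if (recent.size() >= FAILURE_THRESHOLD) {
            openUntil.put(kind, nowMs + OPEN_MS);
            recent.clear();
        }
    }

    /** True while the kind is inside its 60 s open period. */
    synchronized boolean isOpen(String kind, long nowMs) {
        Long until = openUntil.get(kind);
        return until != null && nowMs < until;
    }
}
```

Keeping state per kind means a failing log store does not suspend route-metric or agent-state rules.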
Silence matching
At notification-dispatch time (not evaluation time):
SELECT 1 FROM alert_silences
WHERE environment_id = :env
AND now() BETWEEN starts_at AND ends_at
AND matcher_matches(matcher, :instanceContext)
LIMIT 1;
If any match → alert_instances.silenced = true, no webhook dispatch, no re-notification. Inbox still shows the instance with a silenced pill — audit trail preserved.
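The matcher_matches predicate is not defined in this section. One plausible reading, sketched below as an assumption, is that every field present in the matcher must equal the corresponding field of the alert instance, and absent fields act as wildcards; SilenceMatcherSketch and the flat String maps are illustrative only.

```java
import java.util.Map;

final class SilenceMatcherSketch {
    /**
     * True when every key present in the matcher (ruleId, appSlug, routeId, agentId,
     * severity) has the same value in the instance context. Assumed AND semantics.
     */
    static boolean matches(Map<String, String> matcher, Map<String, String> instanceContext) {
        return matcher.entrySet().stream()
                .allMatch(e -> e.getValue().equals(instanceContext.get(e.getKey())));
    }
}
```

Under this reading an empty matcher silences every alert in the environment for the silence window, so a real implementation may want to reject it at save time.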
Failure modes
| Failure | Behavior |
|---|---|
| Read interface throws | Log WARN, increment alerting_eval_errors_total{kind, rule_id}, reschedule rule, release claim |
| 10 consecutive failures for a rule | Mark eval_state.disabledReason, surface in UI |
| Template render error | Fall back to literal {{var}} in output, log WARN, still dispatch |
| Slow evaluator | Claim lease (claimed_until) expires after 30 s, after which another replica can re-claim the rule; investigate if sustained |
| Rule deleted mid-eval | FK cascade waits on the row lock — effectively serialized |
| Env deleted mid-eval | FK cascade waits — effectively serialized |
8. Notification dispatch
In-app inbox — derived, not materialized
SELECT ai.*
FROM alert_instances ai
WHERE ai.environment_id = :env
AND ai.state IN ('FIRING','ACKNOWLEDGED','RESOLVED')
AND (
:me = ANY(ai.target_user_ids)
OR ai.target_group_ids && :my_group_ids
OR ai.target_role_names && :my_role_names
)
ORDER BY ai.fired_at DESC
LIMIT 100;
:my_group_ids and :my_role_names resolved once per request from RbacService.
Bell badge count: same filter + state IN ('FIRING','ACKNOWLEDGED') + NOT EXISTS (alert_reads ar WHERE ar.user_id=:me AND ar.alert_instance_id=ai.id), count-only. Server-side 5 s memoization per (env, user) keeps bell polling cheap.
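The 5 s memoization per (env, user) can be sketched as a small TTL cache. Everything named here is hypothetical: UnreadCountMemoSketch, its Key record, and the injected clock and loader stand in for whatever the controller uses to run the count query.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.ToLongFunction;

final class UnreadCountMemoSketch {
    record Key(String envId, String userId) {}
    private record Entry(long count, long expiresAtMs) {}

    private static final long TTL_MS = 5_000;

    private final Map<Key, Entry> cache = new ConcurrentHashMap<>();
    private final LongSupplier clockMs;        // injected clock, e.g. System::currentTimeMillis
    private final ToLongFunction<Key> loader;  // runs the count-only query

    UnreadCountMemoSketch(LongSupplier clockMs, ToLongFunction<Key> loader) {
        this.clockMs = clockMs;
        this.loader = loader;
    }

    /** Returns the cached count while it is younger than 5 s, otherwise reloads it. */
    long unreadCount(String envId, String userId) {
        long now = clockMs.getAsLong();
        Entry entry = cache.compute(new Key(envId, userId), (key, old) ->
                old != null && now < old.expiresAtMs()
                        ? old
                        : new Entry(loader.applyAsLong(key), now + TTL_MS));
        return entry.count();
    }
}
```

A user acknowledging or reading an alert may see a stale badge for up to 5 s in this sketch; evicting the user's entry on those writes would be one way to tighten that.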
Webhook outbox — claim-based
NotificationDispatchJob claims due notifications (status='PENDING' AND next_attempt_at <= now()) and dispatches. HTTP client from shared OutboundHttpClientFactory with TLS config from the referenced outbound connection.
- 2xx → DELIVERED
- 4xx → FAILED immediately (retry won't help); log at WARN
- 5xx / network / timeout → retry with exponential backoff 30 s → 2 m → 5 m, then FAILED
- Manual retry: POST /alerts/notifications/{id}/retry (OPERATOR+)
Payload rendered at first dispatch attempt, snapshotted in alert_notifications.payload. Retries replay the snapshot — template edits after fire don't affect in-flight notifications.
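The outcome rules above can be expressed as one decision function. The sketch assumes that the three delays apply after the first, second, and third failed attempts (four attempts in total) and that any status outside 2xx and 4xx, including 3xx, is treated like a 5xx; neither detail is stated in the spec, and WebhookRetrySketch is a hypothetical name.

```java
import java.time.Duration;
import java.util.List;

final class WebhookRetrySketch {
    enum Outcome { DELIVERED, FAILED, RETRY }

    /** retryAfter is only set when outcome is RETRY. */
    record Decision(Outcome outcome, Duration retryAfter) {}

    /** Pseudo status code for network errors and timeouts. */
    static final int NETWORK_ERROR = -1;

    private static final List<Duration> BACKOFF =
            List.of(Duration.ofSeconds(30), Duration.ofMinutes(2), Duration.ofMinutes(5));

    /**
     * @param statusCode    HTTP status, or NETWORK_ERROR
     * @param attemptsSoFar attempts made including the one that just finished (1 or more)
     */
    static Decision decide(int statusCode, int attemptsSoFar) {
        if (statusCode >= 200 && statusCode < 300) {
            return new Decision(Outcome.DELIVERED, null);
        }
        if (statusCode >= 400 && statusCode < 500) {
            return new Decision(Outcome.FAILED, null); // a retry will not help
        }
        int retriesUsed = attemptsSoFar - 1;
        if (retriesUsed < BACKOFF.size()) {
            return new Decision(Outcome.RETRY, BACKOFF.get(retriesUsed));
        }
        return new Decision(Outcome.FAILED, null); // backoff schedule exhausted
    }
}
```

On RETRY the dispatcher would set next_attempt_at to now plus retryAfter and leave status at PENDING, so the existing pending index keeps serving the claim query.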
Template rendering
JMustache (com.samskivert:jmustache). Logic-less, industry-standard syntax.
Rendered surfaces: URL (query-string interpolation), header values, body, and separately alert_instances.title / message rendered once at fire.
Context map (dot-notation + camelCase leaves):
env.slug env.id
rule.id rule.name rule.severity rule.description
alert.id alert.state alert.firedAt alert.resolvedAt
alert.ackedBy alert.link alert.currentValue alert.threshold
alert.comparator alert.window
app.slug app.id app.displayName
route.id
agent.id agent.name agent.state
exchange.id exchange.status exchange.link
deployment.id deployment.status
log.logger log.level log.message
metric.name metric.value
Error handling. Missing variable renders as {{var.name}} literal + WARN log. Malformed template falls back to built-in default + WARN. Never drop a notification due to template error.
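To make the missing-variable rule concrete, here is a plain-regex stand-in that shows only the fallback behavior. It is not JMustache and does not use its API: LenientTemplateSketch is a hypothetical illustration that resolves dotted names against nested maps and leaves unresolved tags in place as literals. WARN logging is omitted.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LenientTemplateSketch {
    private static final Pattern VAR = Pattern.compile("\\{\\{\\s*([A-Za-z0-9_.]+)\\s*\\}\\}");

    /** Replaces {{a.b}} tags from the context; unknown tags are emitted unchanged. */
    static String render(String template, Map<String, Object> context) {
        Matcher m = VAR.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Object value = lookup(context, m.group(1));
            String replacement = value != null ? value.toString() : "{{" + m.group(1) + "}}";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Walks a dotted path such as "rule.name" through nested maps; null if any step is missing. */
    static Object lookup(Map<String, Object> context, String dottedPath) {
        Object current = context;
        for (String part : dottedPath.split("\\.")) {
            if (!(current instanceof Map<?, ?> map)) {
                return null;
            }
            current = map.get(part);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

For example, rendering "{{rule.name}} on {{app.slug}}: {{metric.value}}" with only rule and app in the context keeps {{metric.value}} verbatim in the output, so the notification is still dispatched.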
"Test render" endpoint: POST /alerts/rules/{id}/render-preview — drives rule editor's Preview button.
9. Rule promotion across environments
UX. Rule list row → Environments ▾ menu of other envs in the tenant → open rule editor pre-populated with source rule's payload, target env selected. Banner: "Promoting <name> from <src> → <dst>. Review and adjust, then save." Save → normal POST /api/v1/environments/{dstEnvSlug}/alerts/rules. Source unaffected (it's a copy).
Pure UI flow — no new server endpoint. Re-uses the existing GET (to fetch) and POST (to create) paths.
Prefill-time validation (client-side warnings, non-blocking):
| Field | Check | Behavior |
|---|---|---|
| scope.appSlug | Does app exist in target env? | ⚠ warn + picker from target env's apps |
| scope.agentId | Per-env; can't transfer | Clear field, keep appSlug, note |
| scope.routeId | Per-app logical ID, stable | ✓ pass through |
| targets[] | Tenant-scoped | ✓ transfer as-is |
| webhooks[].outboundConnectionId | Target env allowed by connection? | ⚠ warn if not; disable save until resolved |
Bulk promotion (select multiple → promote all) deferred until usage patterns justify it.
10. Cross-cutting: outbound HTTP & TLS trust
Shared module — not inside alerting/.
OutboundHttpClientFactory
public interface OutboundHttpClientFactory {
    CloseableHttpClient clientFor(OutboundHttpRequestContext context);
}
public record OutboundHttpRequestContext(
    TrustMode trustMode, // SYSTEM_DEFAULT | TRUST_ALL | TRUST_PATHS
    List<String> trustedCaPemPaths,
    Duration connectTimeout,
    Duration readTimeout
) {}
Implementation (ApacheOutboundHttpClientFactory) memoizes one CloseableHttpClient per unique effective config — not one per call.
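Because the request context is a record, its generated equals and hashCode make it usable directly as a cache key. The sketch below shows that memoization with stand-in types so it runs on the JDK alone: SketchTrustMode, SketchRequestContext, and MemoizingClientFactorySketch are hypothetical, and the generic parameter C takes the place of CloseableHttpClient.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

enum SketchTrustMode { SYSTEM_DEFAULT, TRUST_ALL, TRUST_PATHS }

record SketchRequestContext(SketchTrustMode trustMode, List<String> trustedCaPemPaths,
                            Duration connectTimeout, Duration readTimeout) {
    SketchRequestContext {
        // Defensive copy so later mutation of the caller's list cannot change the key.
        trustedCaPemPaths = List.copyOf(trustedCaPemPaths);
    }
}

final class MemoizingClientFactorySketch<C> {
    private final Map<SketchRequestContext, C> clients = new ConcurrentHashMap<>();
    private final Function<SketchRequestContext, C> builder;

    MemoizingClientFactorySketch(Function<SketchRequestContext, C> builder) {
        this.builder = builder;
    }

    /** Builds a client at most once per distinct effective config. */
    C clientFor(SketchRequestContext context) {
        return clients.computeIfAbsent(context, builder);
    }
}
```

Two connections with identical trust mode, CA paths, and timeouts therefore share one client and one connection pool in this model.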
System config (cameleer.server.outbound-http.*)
cameleer:
server:
outbound-http:
trust-all: false # global kill-switch; WARN logged if true
trusted-ca-pem-paths: # additional roots layered on JVM default
- /etc/cameleer/certs/corporate-root.pem
- /etc/cameleer/certs/acme-internal.pem
default-connect-timeout-ms: 2000
default-read-timeout-ms: 5000
proxy-url: # optional; null = no proxy
proxy-username:
proxy-password:
On startup: if trust-all=true, log red WARN (not suitable for production). If trusted-ca-pem-paths has entries, verify each path exists; fail-fast on missing files.
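The startup check described above could look roughly like the following. The class and method names are hypothetical, and the use of `java.util.logging` and `IllegalStateException` is an assumption for the sake of a dependency-free example; the real implementation would presumably hang off the Spring properties bean.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical startup validator for cameleer.server.outbound-http.*. */
final class OutboundHttpStartupCheckSketch {
    private static final Logger LOG =
            Logger.getLogger(OutboundHttpStartupCheckSketch.class.getName());

    static void validate(boolean trustAll, List<String> trustedCaPemPaths) {
        if (trustAll) {
            LOG.warning("outbound-http.trust-all=true: TLS certificate validation "
                    + "is disabled; not suitable for production");
        }
        for (String path : trustedCaPemPaths) {
            // Fail fast so a typo in the config surfaces at boot, not at dispatch.
            if (!Files.exists(Path.of(path))) {
                throw new IllegalStateException("Configured CA file not found: " + path);
            }
        }
    }
}
```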
Per-connection overrides
Each OutboundConnection carries tls_trust_mode + tls_ca_pem_paths. UI surfaces a dropdown: System default (validated) / Trust custom CAs (from server config) / Trust all (insecure — testing only). Amber warning when Trust all selected. Audit logged (AuditCategory.OUTBOUND_HTTP_TRUST_CHANGE).
Deferred
See BL-001 / gitea#137:
- In-app CA bundle upload / admin management
- SaaS-layer CA reuse investigation (do first)
11. API surface
All env-scoped routes under /api/v1/environments/{envSlug}/alerts/... via existing @EnvPath resolver.
Alerting — rules
| Method | Path | Role |
|---|---|---|
| GET | /alerts/rules | VIEWER+ |
| POST | /alerts/rules | OPERATOR+ |
| GET | /alerts/rules/{id} | VIEWER+ |
| PUT | /alerts/rules/{id} | OPERATOR+ |
| DELETE | /alerts/rules/{id} | OPERATOR+ |
| POST | /alerts/rules/{id}/enable · /disable | OPERATOR+ |
| POST | /alerts/rules/{id}/render-preview | OPERATOR+ |
| POST | /alerts/rules/{id}/test-evaluate | OPERATOR+ |
Alerting — instances
| Method | Path | Role |
|---|---|---|
| GET | /alerts | VIEWER+ |
| GET | /alerts/unread-count | VIEWER+ |
| GET | /alerts/{id} | VIEWER+ |
| POST | /alerts/{id}/ack | VIEWER+ (if targeted) / OPERATOR+ |
| POST | /alerts/{id}/read | VIEWER+ (self) |
| POST | /alerts/bulk-read | VIEWER+ (self) |
Alerting — silences
| Method | Path | Role |
|---|---|---|
| GET | /alerts/silences | VIEWER+ |
| POST | /alerts/silences | OPERATOR+ |
| PUT | /alerts/silences/{id} | OPERATOR+ |
| DELETE | /alerts/silences/{id} | OPERATOR+ |
Alerting — notifications
| Method | Path | Role |
|---|---|---|
| GET | /alerts/{id}/notifications | VIEWER+ |
| POST | /alerts/notifications/{id}/retry | OPERATOR+ |
Outbound connections (admin)
| Method | Path | Role |
|---|---|---|
| GET | /api/v1/admin/outbound-connections | ADMIN / OPERATOR (read-only) |
| POST | /api/v1/admin/outbound-connections | ADMIN |
| GET | /api/v1/admin/outbound-connections/{id} | ADMIN / OPERATOR (read-only) |
| PUT | /api/v1/admin/outbound-connections/{id} | ADMIN (409 if narrowing breaks references) |
| DELETE | /api/v1/admin/outbound-connections/{id} | ADMIN (409 if referenced) |
| POST | /api/v1/admin/outbound-connections/{id}/test | ADMIN |
| GET | /api/v1/admin/outbound-connections/{id}/usage | ADMIN / OPERATOR |
OpenAPI regen
Per CLAUDE.md convention: after controller/DTO changes, run cd ui && npm run generate-api:live (backend on :8081) to regenerate ui/src/api/schema.d.ts. Commit regen alongside controller change.
12. CMD-K integration
Two new result sources registered in the existing UI registry (ui/src/cmdk/sources/):
- Alerts: queries /alerts?q=...&limit=5 (server-side fulltext against title / message / rule_snapshot); results show severity icon + state chip; deep-link to /alerts/inbox/{id}.
- Alert Rules: queries /alerts/rules?q=...&limit=5; deep-link to /alerts/rules/{id}.
No new registry machinery — uses the existing extension point.
13. UI
Routes
/alerts
├── /inbox (default landing)
├── /all
├── /rules
│ ├── /new
│ └── /{id} (edit; accepts ?promoteFrom=<src>&ruleId=<id>)
├── /silences
└── /history
/admin/outbound-connections
├── /
├── /new
└── /{id}
Top-nav
Insert <NotificationBell /> between env selector and user menu. Badge severity = max(severities of unread targeting me) (CRITICAL → var(--error), WARNING → var(--amber), INFO → var(--muted)). Dropdown shows 5 most-recent unread with inline ack button + "See all".
Alerts section
New sidebar/top-nav entry visible to VIEWER+. Authoring actions (POST /rules, silence create, etc.) gated to OPERATOR+.
Rule editor — 5-step wizard
- Scope: radio (env-wide / app / route / agent) + pickers from env catalog (existing endpoints).
- Condition: radio (6 kinds) + kind-specific form.
- Trigger: threshold + comparator + window + for-duration + evaluation interval + severity; inline Test evaluate button.
- Notify: title + message templates with Preview button; targets multi-select (users / groups / roles with typeahead); outbound connections multi-select filtered by current env + allowed_environment_ids.
- Review: summary card, enabled toggle, save.
Silences, History, Rules list, OutboundConnectionAdminPage
Structure described in design presentation; no new design-system components required. Reuses Select, Tabs, Toggle, Button, Label, InfiniteScrollArea, PageLoader, Badge from @cameleer/design-system.
Real-time behavior
- Bell: /alerts/unread-count polled every 30 s; paused when tab hidden (Page Visibility API).
- Inbox view: /alerts polled every 30 s when focused.
- No SSE in v1. SSE is a clean future add under /alerts/stream with no schema changes.
Accessibility
Keyboard navigation; severity conveyed via icon + text + color (not color alone); ARIA live region on inbox for new-alert announcement; bell component has descriptive aria-label.
Styling
All colors via @cameleer/design-system CSS variables (var(--error), var(--amber), var(--muted), var(--success)). No hard-coded hex.
14. Configuration
AlertingProperties (cameleer.server.alerting.*)
cameleer:
server:
alerting:
evaluator-tick-interval-ms: 5000 # floor: 5000 (clamped at startup with WARN if lower)
evaluator-batch-size: 20
claim-ttl-seconds: 30
notification-tick-interval-ms: 5000
notification-batch-size: 50
in-tick-cache-enabled: true
circuit-breaker-fail-threshold: 5
circuit-breaker-window-seconds: 30
circuit-breaker-cooldown-seconds: 60
event-retention-days: 90
notification-retention-days: 30
webhook-timeout-ms: 5000
webhook-max-attempts: 3
Env-var overridable (CAMELEER_SERVER_ALERTING_EVALUATOR_TICK_INTERVAL_MS=...). Wired via SchedulingConfigurer (not literal @Scheduled(fixedDelay=...)) so intervals come from the bean at startup. Hot-reload not supported — restart required to change cadence.
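The floor on the evaluator tick can be expressed as a small clamp applied once when the properties are read. This is a sketch under assumptions: the class and method names are hypothetical and logging uses `java.util.logging` to stay dependency-free. In the Spring wiring described above, the resulting `Duration` would be handed to the scheduler inside `SchedulingConfigurer.configureTasks` (for example via `ScheduledTaskRegistrar.addFixedDelayTask`).

```java
import java.time.Duration;
import java.util.logging.Logger;

/** Hypothetical helper that enforces the 5000 ms floor at startup. */
final class TickIntervalSketch {
    private static final Logger LOG =
            Logger.getLogger(TickIntervalSketch.class.getName());

    static final long FLOOR_MS = 5000;

    static Duration effectiveTickInterval(long configuredMs) {
        if (configuredMs < FLOOR_MS) {
            // Clamp rather than fail: a too-low value still boots, with a WARN.
            LOG.warning("evaluator-tick-interval-ms=" + configuredMs
                    + " is below the floor; clamping to " + FLOOR_MS);
            return Duration.ofMillis(FLOOR_MS);
        }
        return Duration.ofMillis(configuredMs);
    }
}
```

Because the value is computed once at startup, changing it still requires a restart, consistent with the no-hot-reload note.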
OutboundHttpProperties (cameleer.server.outbound-http.*)
See §10.
15. Retention
Daily @Scheduled(cron = "0 0 3 * * *") job AlertingRetentionJob (advisory-lock-of-the-day pattern, same as JarRetentionJob):
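-- Retention parameters are integer day counts; multiply by a one-day interval
-- instead of casting the bare number to interval.
DELETE FROM alert_instances
WHERE state = 'RESOLVED'
  AND resolved_at < now() - (:eventRetentionDays * interval '1 day');

-- Rows with no delivered_at (for example FAILED) are purged regardless of age.
DELETE FROM alert_notifications
WHERE status IN ('DELIVERED','FAILED')
  AND (delivered_at IS NULL
       OR delivered_at < now() - (:notificationRetentionDays * interval '1 day'));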
Retention values from AlertingProperties.
16. Observability
New metrics exposed via existing /api/v1/prometheus:
- alerting_eval_duration_seconds{kind}: histogram per condition kind
- alerting_eval_errors_total{kind, rule_id}: counter
- alerting_circuit_open_total{kind}: counter
- alerting_rule_state{state}: gauge (enabled / disabled / broken-reference)
- alerting_instances_total{state, severity}: gauge (open alerts)
- alerting_notifications_total{status}: counter
- alerting_webhook_delivery_duration_seconds: histogram
No new dashboards shipped in v1; tenants with Prometheus + Grafana can build their own. An "Alerting health" admin sub-page is a cheap future add.
Audit
New AuditCategory values:
- OUTBOUND_HTTP_TRUST_CHANGE: webhook or connection TLS config change
- ALERT_RULE_CHANGE: create / update / delete rule
- ALERT_SILENCE_CHANGE: create / update / delete silence
- OUTBOUND_CONNECTION_CHANGE: admin CRUD on outbound connection
Emitted via existing AuditService.log(...).
17. Security
- Tenant + env isolation. Every controller call runs through @EnvPath (resolves env → tenant via TenantContext). Every CH query filters by tenant_id AND environment per pre-existing invariant.
- RBAC. Enforced via Spring Security @PreAuthorize on each endpoint (see §11 role column).
- Webhook URL SSRF protection. At rule save, reject URLs resolving to private IPs (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, ::1, fc00::/7) unless a deployment-level allow-listed dev flag is set.
- HMAC signing. Per-connection hmac_secret encrypted at rest; signature header sent on dispatch.
- TLS trust. Cross-cutting module (§10).
- Audit. See §16.
18. Testing
Backend — unit (*Test.java, no Spring)
- Each ConditionEvaluator: synthetic inputs → expected EvalResult. Fire / no-fire / threshold edges / PER_EXCHANGE cursor / for-duration debounce.
- MustacheRenderer: context + template → expected output; malformed falls back + logs.
- SilenceMatcher: matcher JSONB vs instance → truth table.
- Jackson polymorphism: roundtrip each AlertCondition subtype.
- Claim-polling concurrency (embedded PG): two threads → no duplicates.
Backend — integration (Testcontainers, *IT.java)
- AlertingFullLifecycleIT: end-to-end rule → fire → ack → silence → delete, history survives.
- AlertingEnvIsolationIT: alert in env-A invisible from env-B inbox.
- OutboundConnectionAllowedEnvIT: 422 on save if connection not allowed in env; 409 on narrow-while-referenced.
- WebhookDispatchIT (WireMock): payload shape, HMAC signature, retry on 5xx, FAILED after max, no retry on 4xx.
- PerformanceIT (opt-in, not default CI): 500 rules + 5-replica simulation.
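The retry expectations that WebhookDispatchIT asserts (retry on 5xx, FAILED after the maximum number of attempts, no retry on 4xx) reduce to a small decision function. The sketch below is illustrative only: the class, enum, and method names are hypothetical, and treating every status outside 2xx and 5xx as a non-retryable failure is an assumption the spec does not state.

```java
/** Hypothetical classification of a webhook response into a delivery outcome. */
final class WebhookRetrySketch {

    enum Outcome { DELIVERED, RETRY, FAILED }

    /**
     * @param httpStatus    status code returned by the webhook endpoint
     * @param attemptNumber 1-based number of the attempt that just completed
     * @param maxAttempts   configured ceiling (webhook-max-attempts)
     */
    static Outcome classify(int httpStatus, int attemptNumber, int maxAttempts) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Outcome.DELIVERED;
        }
        if (httpStatus >= 500 && httpStatus < 600) {
            // Server-side errors are retried until the attempt budget is spent.
            return attemptNumber < maxAttempts ? Outcome.RETRY : Outcome.FAILED;
        }
        // 4xx (and, by assumption, anything else) is not retried.
        return Outcome.FAILED;
    }
}
```

With `webhook-max-attempts: 3`, a 503 on attempts 1 and 2 yields RETRY and a 503 on attempt 3 yields FAILED, while a 404 fails on the first attempt.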
Frontend — component (Vitest + Testing Library)
- Rule editor wizard step navigation + validation.
- Bell polling pause on tab hide.
- Inbox row rendering by severity.
- CMD-K result-source registration.
Frontend — E2E (Playwright if infra supports)
- Create rule → inject matching data → bell badge appears → open alert → ack → badge clears.
19. Rollout
- No feature flag. Alerting is dormant-by-default: zero rules → zero evaluator work → zero behavior change. Migration is additive.
- Migration rollback. V11 PG migration has matching down-script; CH projections are IF NOT EXISTS-safe and droppable without data loss.
- Progressive adoption. First user creates the first rule; feature organically spreads from there.
- Documentation. Add an admin-facing alerting guide under docs/ describing rule shapes, template variables, webhook destinations, and silence patterns.
- .claude/rules/ updates. app-classes.md and core-classes.md updated to document the new packages and any touched classes; this is part of the change, not a follow-up.
20. Open questions / items for writing-plans
These are not design-level decisions — they're implementation-phase tasks to be carried into planning:
- Alignment with existing OIDC outbound cert handling. Before implementing ApacheOutboundHttpClientFactory, audit how OidcProviderHelper / OidcTokenExchanger currently validate certs. If there's a pattern in place, mirror it; if not, adopt the factory as the one-true-way and retrofit OIDC in a separate follow-up (not part of alerting v1).
- hmac_secret encryption-at-rest. Decide between Jasypt (simplest, adds a dep) and a bespoke encrypt/decrypt over the existing Ed25519-derived key material (no new dep, ~50 LOC). Defer to plan.
- V1 CH migration file naming. Confirm the convention for alerting-owned CH migrations (V_alerting_projections.sql vs numbered). Current ClickHouseSchemaInitializer runs files idempotently, so naming is informational.
- Bell component keyboard shortcut. Optional; align with existing CMD-K shortcut conventions.
- Target picker UX. How to mix user / group / role in one multi-select with typeahead. Small UX design task.
- Env-delete cascade audit. Before merge, verify the full cascade chain empirically in a PG integration test — POC safety depends on it.