# Alerting — Admin Guide

Cameleer's alerting system provides rule-based monitoring over the observability data the server already collects: route metrics, exchange outcomes, agent state, deployment state, application logs, and JVM metrics. It is a "good enough" baseline for operational awareness. For on-call rotation, escalation policies, and incident management, integrate with PagerDuty or OpsGenie via a webhook rule: Cameleer handles the HTTP POST, and they handle the rest.

> For full architectural detail, see `docs/superpowers/plans/2026-04-19-alerting-02-backend.md` and the spec at `docs/superpowers/specs/2026-04-19-alerting-design.md`.

---

## Condition Kinds

Six condition kinds are supported. Each rule belongs to exactly one environment.

### ROUTE_METRIC

Fires when a computed route metric crosses a threshold over a rolling window.

```json
{
  "name": "High error rate on orders",
  "severity": "CRITICAL",
  "conditionKind": "ROUTE_METRIC",
  "condition": {
    "kind": "ROUTE_METRIC",
    "scope": { "appSlug": "orders-service" },
    "metric": "ERROR_RATE",
    "comparator": "GT",
    "threshold": 0.05,
    "windowSeconds": 300
  },
  "evaluationIntervalSeconds": 60
}
```

Available metrics: `ERROR_RATE`, `THROUGHPUT`, `MEAN_PROCESSING_MS`, `P95_PROCESSING_MS`.

Comparators: `GT`, `GTE`, `LT`, `LTE`, `EQ`.
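
As a rough illustration of how a comparator and a threshold combine, the sketch below checks a metric value against a rule condition. The `compare` helper and the sample counts are hypothetical and not part of Cameleer. It also assumes that `ERROR_RATE` is a fraction between 0 and 1 (failed exchanges divided by total), which the `0.05` threshold above suggests but this guide does not state.

```python
import operator

# Comparator names from the list above, mapped to Python operators.
COMPARATORS = {
    "GT": operator.gt,
    "GTE": operator.ge,
    "LT": operator.lt,
    "LTE": operator.le,
    "EQ": operator.eq,
}


def compare(value: float, comparator: str, threshold: float) -> bool:
    """Return True when `value <comparator> threshold` holds."""
    return COMPARATORS[comparator](value, threshold)


# Hypothetical window: 12 failed exchanges out of 200 in the last 300 s.
failed, total = 12, 200
error_rate = failed / total  # assumption: ERROR_RATE is a 0..1 fraction

fires = compare(error_rate, "GT", 0.05)
print(fires)
```

Note that `GT` and `GTE` differ only when the value lands exactly on the threshold, which matters for integer-valued metrics more than for rates.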

### EXCHANGE_MATCH

Fires when the number of exchanges matching a filter exceeds a threshold.

```json
{
  "name": "Failed payment exchanges",
  "severity": "WARNING",
  "conditionKind": "EXCHANGE_MATCH",
  "condition": {
    "kind": "EXCHANGE_MATCH",
    "scope": { "appSlug": "payment-service", "routeId": "processPayment" },
    "filter": { "status": "FAILED", "attributes": { "payment.type": "card" } },
    "fireMode": "AGGREGATE",
    "threshold": 3,
    "windowSeconds": 600
  }
}
```

`fireMode`: `AGGREGATE` (one alert for the count) or `PER_EXCHANGE` (one alert per matching exchange).
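
The difference between the two fire modes can be sketched as a count of alerts raised per evaluation. The `alerts_to_raise` helper is hypothetical. It assumes that "exceeds" means strictly greater than the threshold and that `PER_EXCHANGE` ignores the threshold, neither of which this guide confirms.

```python
def alerts_to_raise(match_count: int, fire_mode: str, threshold: int) -> int:
    """Hypothetical helper: number of alerts one evaluation would produce."""
    if fire_mode == "PER_EXCHANGE":
        # Assumption: one alert per matching exchange, threshold not consulted.
        return match_count
    if fire_mode == "AGGREGATE":
        # Assumption: a single alert once the count strictly exceeds the threshold.
        return 1 if match_count > threshold else 0
    raise ValueError(f"unknown fireMode: {fire_mode}")


# Five failed card payments in the window, threshold 3 as in the rule above.
print(alerts_to_raise(5, "AGGREGATE", 3))
print(alerts_to_raise(5, "PER_EXCHANGE", 3))
```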

### AGENT_STATE

Fires when a specific agent (or any agent for an app) reaches a given state for a sustained period.

```json
{
  "name": "Orders agent dead",
  "severity": "CRITICAL",
  "conditionKind": "AGENT_STATE",
  "condition": {
    "kind": "AGENT_STATE",
    "scope": { "appSlug": "orders-service" },
    "state": "DEAD",
    "forSeconds": 120
  }
}
```

States: `LIVE`, `STALE`, `DEAD`.
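
The "sustained period" check can be pictured as a comparison between the time an agent entered a state and the current time. This is a minimal sketch with a hypothetical `state_sustained` helper. It assumes the condition holds once the elapsed time is at least `forSeconds`; the server's exact boundary handling is not documented here.

```python
from datetime import datetime, timedelta, timezone


def state_sustained(entered_at: datetime, now: datetime, for_seconds: int) -> bool:
    """True once the agent has been in the state for at least `for_seconds`."""
    return now - entered_at >= timedelta(seconds=for_seconds)


# Hypothetical agent that went DEAD at 02:00:00 UTC.
entered = datetime(2026, 4, 20, 2, 0, 0, tzinfo=timezone.utc)
print(state_sustained(entered, entered + timedelta(seconds=90), 120))
print(state_sustained(entered, entered + timedelta(seconds=150), 120))
```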

### DEPLOYMENT_STATE

Fires when a deployment reaches one of the specified states.

```json
{
  "name": "Deployment failed",
  "severity": "WARNING",
  "conditionKind": "DEPLOYMENT_STATE",
  "condition": {
    "kind": "DEPLOYMENT_STATE",
    "scope": { "appSlug": "orders-service" },
    "states": ["FAILED", "DEGRADED"]
  }
}
```

### LOG_PATTERN

Fires when the number of log lines matching a regex pattern at a given level exceeds a threshold in a rolling window.

```json
{
  "name": "TimeoutException spike",
  "severity": "WARNING",
  "conditionKind": "LOG_PATTERN",
  "condition": {
    "kind": "LOG_PATTERN",
    "scope": { "appSlug": "orders-service" },
    "level": "ERROR",
    "pattern": "TimeoutException",
    "threshold": 5,
    "windowSeconds": 300
  }
}
```

`level`: `TRACE`, `DEBUG`, `INFO`, `WARN`, `ERROR`. `pattern` is a Java regex matched against the log message.

### JVM_METRIC

Fires when an aggregated JVM metric crosses a threshold.

```json
{
  "name": "Heap > 85%",
  "severity": "WARNING",
  "conditionKind": "JVM_METRIC",
  "condition": {
    "kind": "JVM_METRIC",
    "scope": { "appSlug": "orders-service" },
    "metric": "jvm.memory.used.value",
    "aggregation": "AVG",
    "comparator": "GT",
    "threshold": 0.85,
    "windowSeconds": 120
  }
}
```

`aggregation`: `AVG`, `MAX`, `MIN`, `LAST`.
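
To show how the choice of aggregation changes the value that is compared with the threshold, here is a small sketch over a window of samples. The `aggregate` helper and the sample values are hypothetical, and the samples are assumed to be ordered oldest to newest so that `LAST` means the most recent one.

```python
from statistics import fmean


def aggregate(samples: list[float], aggregation: str) -> float:
    """Reduce a window of samples (oldest first) to a single value."""
    if not samples:
        raise ValueError("no samples in window")
    if aggregation == "AVG":
        return fmean(samples)
    if aggregation == "MAX":
        return max(samples)
    if aggregation == "MIN":
        return min(samples)
    if aggregation == "LAST":
        return samples[-1]
    raise ValueError(f"unknown aggregation: {aggregation}")


# Hypothetical samples collected during one window.
window = [70.0, 90.0, 80.0]
for kind in ("AVG", "MAX", "MIN", "LAST"):
    print(kind, aggregate(window, kind))
```

With a threshold of 85 and comparator `GT`, only `MAX` would fire for this window, so a short spike is caught by `MAX` but smoothed away by `AVG`.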

---

## Notification Templates

Rules carry `notificationTitleTmpl` and `notificationMessageTmpl` fields, both rendered with [JMustache](https://github.com/samskivert/jmustache). Variables available in every template (populated by `NotificationContextBuilder`):

| Variable | Example |
|---|---|
| `{{rule.name}}` | "TimeoutException spike" |
| `{{rule.severity}}` | "WARNING" |
| `{{rule.description}}` | "…" |
| `{{alert.id}}` | UUID |
| `{{alert.state}}` | "FIRING" |
| `{{alert.firedAt}}` | ISO-8601 instant |
| `{{alert.resolvedAt}}` | ISO-8601 instant or empty |
| `{{alert.currentValue}}` | numeric value that triggered |
| `{{alert.threshold}}` | configured threshold |
| `{{alert.link}}` | deep-link URL to inbox item |
| `{{env.slug}}` | "prod" |
| `{{env.name}}` | "Production" |

Default templates (applied when not specified):

- Title: `"[{{rule.severity}}] {{rule.name}} — {{env.slug}}"`
- Message: `"Alert {{alert.id}} fired at {{alert.firedAt}}. Value: {{alert.currentValue}}, Threshold: {{alert.threshold}}"`

Use `POST /alerts/rules/{id}/render-preview` to test templates before saving.
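
For a feel of how the dotted variables resolve against the notification context, the sketch below substitutes plain `{{a.b}}` tags from a nested dictionary. It is a simplified stand-in and not JMustache: the `render` function is hypothetical, handles only plain variable tags, and ignores sections, escaping, and the other Mustache features the real engine supports.

```python
import re

# Matches plain variable tags such as {{rule.name}}.
TAG = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace each {{dotted.path}} with its value; missing paths become ''."""

    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).split("."):
            value = value.get(part, "") if isinstance(value, dict) else ""
        return str(value)

    return TAG.sub(lookup, template)


# Hypothetical context mirroring a few variables from the table above.
context = {
    "rule": {"name": "TimeoutException spike", "severity": "WARNING"},
    "alert": {"state": "FIRING"},
    "env": {"slug": "prod"},
}
title = render("{{rule.name}} is {{alert.state}} in {{env.slug}}", context)
print(title)
```

The render-preview endpoint remains the authoritative check, since it uses the server's own template engine and context.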

---

## Webhook Setup

Webhooks are sent via **outbound connections** managed by an ADMIN at `/api/v1/admin/outbound-connections`. This decouples secrets (HMAC key, auth tokens) from rule definitions. An OPERATOR can attach an existing connection to a rule.

### Creating an outbound connection (ADMIN)

```http
POST /api/v1/admin/outbound-connections

{
  "name": "slack-alerts",
  "url": "https://hooks.slack.com/services/T00/B00/XXX",
  "method": "POST",
  "tlsTrustMode": "SYSTEM_DEFAULT",
  "auth": { "kind": "NONE" },
  "defaultHeaders": { "Content-Type": "application/json" },
  "bodyTemplate": "{\"text\": \"{{rule.name}}: {{alert.state}}\"}",
  "hmacSecret": "my-signing-secret",
  "allowedEnvironmentIds": []
}
```

For PagerDuty Events API v2, the integration key goes in the `routing_key` field of the body:

```json
{
  "name": "pagerduty-prod",
  "url": "https://events.pagerduty.com/v2/enqueue",
  "method": "POST",
  "tlsTrustMode": "SYSTEM_DEFAULT",
  "auth": { "kind": "NONE" },
  "defaultHeaders": { "Content-Type": "application/json" },
  "bodyTemplate": "{\"routing_key\":\"your-integration-key\",\"event_action\":\"trigger\",\"payload\":{\"summary\":\"{{rule.name}}\",\"severity\":\"{{rule.severity}}\",\"source\":\"{{env.slug}}\"}}"
}
```

The Events API v2 identifies the integration through `routing_key` rather than an `Authorization` header, and it expects `severity` to be one of `critical`, `error`, `warning`, or `info`. Confirm with the render preview and a test event that the rendered `{{rule.severity}}` value is accepted before relying on this template.
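
Because `bodyTemplate` is a JSON document embedded inside a JSON string, escaping the quotes by hand is error-prone. One way to avoid that, sketched here with Python's standard `json` module, is to build the inner body as a dictionary and let the serializer do the escaping. The field names are taken from the Slack example above; the script itself is only an illustration and does not call the Cameleer API.

```python
import json

# The body the webhook receiver should get. Mustache tags stay as literal text
# here and are filled in by the server when the notification is rendered.
body = {"text": "{{rule.name}}: {{alert.state}}"}

connection = {
    "name": "slack-alerts",
    "url": "https://hooks.slack.com/services/T00/B00/XXX",
    "method": "POST",
    "tlsTrustMode": "SYSTEM_DEFAULT",
    "auth": {"kind": "NONE"},
    "defaultHeaders": {"Content-Type": "application/json"},
    # json.dumps turns the inner document into a string; the outer json.dumps
    # below escapes its quotes when the whole request is serialized.
    "bodyTemplate": json.dumps(body),
    "allowedEnvironmentIds": [],
}

payload = json.dumps(connection, indent=2)
print(payload)
```

The printed payload can be used as the request body for `POST /api/v1/admin/outbound-connections`.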

### Attaching to a rule (OPERATOR)

Include the connection UUID in the `webhooks` array when creating or updating a rule:

```json
{
  "webhooks": [
    { "outboundConnectionId": "a1b2c3d4-..." }
  ]
}
```

The server validates that the connection exists and is allowed in the rule's environment, and returns 422 otherwise.

### HMAC Signature

When `hmacSecret` is set on the connection, each POST includes:

```
X-Cameleer-Signature: sha256=<hex-encoded-HMAC-SHA256(secret, body)>
```

Verify this signature on the receiving end to confirm that the request came from Cameleer and that the body was not altered in transit.
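
A receiver could check the header along the following lines. This is a minimal sketch using Python's standard `hmac` and `hashlib` modules. The function names are hypothetical, and it assumes the signature is computed over the raw request body bytes with the secret encoded as UTF-8 and the digest written as lowercase hex, details this guide does not spell out.

```python
import hashlib
import hmac


def expected_signature(secret: str, body: bytes) -> str:
    """Build the header value for a given secret and raw request body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(secret: str, body: bytes, header_value: str) -> bool:
    """Compare in constant time to avoid leaking how many characters matched."""
    expected = expected_signature(secret, body).encode("utf-8")
    return hmac.compare_digest(expected, header_value.encode("utf-8"))


# Hypothetical delivery: sign a body, then check the original and a tampered copy.
secret = "my-signing-secret"
body = b'{"text": "High error rate on orders: FIRING"}'
header = expected_signature(secret, body)

print(verify_signature(secret, body, header))
print(verify_signature(secret, body + b" ", header))
```

Verification should run against the bytes exactly as received, before any JSON parsing or re-serialization, because even a whitespace change produces a different digest.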

---

## Silences

A silence suppresses notifications for matching alerts without deleting the rule. Silences are time-bounded.

```http
POST /api/v1/environments/{envSlug}/alerts/silences

{
  "matcher": {
    "ruleId": "uuid-of-rule",
    "severity": "WARNING"
  },
  "reason": "Planned maintenance window",
  "startsAt": "2026-04-20T02:00:00Z",
  "endsAt": "2026-04-20T06:00:00Z"
}
```

Matcher fields are all optional; at least one should be set. A silence matches an alert instance if ALL specified matcher fields match. List active silences with `GET /api/v1/environments/{envSlug}/alerts/silences`.
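
The all-fields-must-match rule can be sketched as follows. The `silence_applies` helper is hypothetical. It assumes an alert exposes the same field names as the matcher (`ruleId`, `severity`) and that `endsAt` is exclusive, neither of which is confirmed by this guide.

```python
from datetime import datetime, timezone


def parse_instant(text: str) -> datetime:
    # Rewrite the trailing "Z" so older Python versions can parse the instant.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def silence_applies(silence: dict, alert: dict, now: datetime) -> bool:
    """True when the silence is active and every matcher field equals the alert's."""
    starts = parse_instant(silence["startsAt"])
    ends = parse_instant(silence["endsAt"])
    if not (starts <= now < ends):  # assumption: endsAt is exclusive
        return False
    matcher = silence["matcher"]
    return all(alert.get(field) == wanted for field, wanted in matcher.items())


silence = {
    "matcher": {"ruleId": "uuid-of-rule", "severity": "WARNING"},
    "startsAt": "2026-04-20T02:00:00Z",
    "endsAt": "2026-04-20T06:00:00Z",
}
now = datetime(2026, 4, 20, 3, 0, tzinfo=timezone.utc)

print(silence_applies(silence, {"ruleId": "uuid-of-rule", "severity": "WARNING"}, now))
print(silence_applies(silence, {"ruleId": "uuid-of-rule", "severity": "CRITICAL"}, now))
```

In this sketch an empty matcher matches every alert, because `all()` over no fields is true. That is one reason to set at least one matcher field.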

---

## Troubleshooting

### Circuit Breaker

If an evaluator kind (`LOG_PATTERN`, `ROUTE_METRIC`, etc.) throws exceptions repeatedly (default: 5 failures in 30 s), the circuit opens and skips that kind for a cooldown period (default: 60 s). Check server logs for:

```
Circuit breaker open for LOG_PATTERN; skipping rule <id>
```

The `alerting_circuit_opened_total{kind}` Prometheus counter tracks openings.

Tune via:

```yaml
cameleer:
  server:
    alerting:
      circuit-breaker-fail-threshold: 5
      circuit-breaker-window-seconds: 30
      circuit-breaker-cooldown-seconds: 60
```
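
How the three settings interact can be sketched with a small sliding-window breaker. This is an illustrative model and not the server's implementation: the `CircuitBreaker` class is hypothetical, and details such as whether failures are cleared when the circuit opens are assumptions.

```python
from collections import deque


class CircuitBreaker:
    """Opens after `fail_threshold` failures within `window_seconds`."""

    def __init__(self, fail_threshold=5, window_seconds=30, cooldown_seconds=60):
        self.fail_threshold = fail_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.failures = deque()  # timestamps of recent failures, oldest first
        self.open_until = None   # time at which the cooldown ends

    def is_open(self, now: float) -> bool:
        return self.open_until is not None and now < self.open_until

    def record_failure(self, now: float) -> None:
        self.failures.append(now)
        # Drop failures that have aged out of the rolling window.
        while self.failures and self.failures[0] <= now - self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.fail_threshold:
            self.open_until = now + self.cooldown_seconds
            self.failures.clear()  # assumption: start counting afresh after opening


# Five failures within five seconds trip the breaker for 60 s.
breaker = CircuitBreaker()
for t in range(5):
    breaker.record_failure(float(t))
print(breaker.is_open(10.0))
print(breaker.is_open(64.0))
```

Failures spaced ten seconds apart never accumulate to five inside a 30 s window, so occasional errors do not open the circuit under the defaults.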

### Retention

Old resolved alert instances and settled notifications are deleted nightly at 03:00. Retention windows:

```yaml
cameleer:
  server:
    alerting:
      event-retention-days: 90         # RESOLVED instances
      notification-retention-days: 30  # DELIVERED/FAILED notifications
```

FIRING and ACKNOWLEDGED instances are never deleted by retention (only RESOLVED ones are).
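
The retention rule for alert instances can be expressed as a small predicate. The `is_purgeable` helper is hypothetical, and it assumes that age is measured from the time the instance was resolved, which this guide does not state explicitly.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_purgeable(state: str, resolved_at: Optional[datetime],
                 now: datetime, retention_days: int = 90) -> bool:
    """Only RESOLVED instances older than the retention window qualify."""
    if state != "RESOLVED" or resolved_at is None:
        return False  # FIRING and ACKNOWLEDGED instances are always kept
    return resolved_at < now - timedelta(days=retention_days)


now = datetime(2026, 4, 20, 3, 0, tzinfo=timezone.utc)
old = now - timedelta(days=120)

print(is_purgeable("RESOLVED", old, now))
print(is_purgeable("FIRING", None, now))
print(is_purgeable("RESOLVED", now - timedelta(days=10), now))
```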

### Webhook delivery failures

Check `GET /api/v1/environments/{envSlug}/alerts/{id}/notifications` for the response status and snippet. An OPERATOR can retry a failed notification via `POST /api/v1/alerts/notifications/{id}/retry`.

### Prometheus metrics (alerting)

| Metric | Tags | Description |
|---|---|---|
| `alerting_eval_errors_total` | `kind` | Evaluation errors by condition kind |
| `alerting_eval_duration_seconds` | `kind` | Evaluation latency histogram |
| `alerting_circuit_opened_total` | `kind` | Circuit breaker open transitions |
| `alerting_notifications_total` | `status` | Notification outcomes |
| `alerting_webhook_delivery_duration_seconds` | (none) | Webhook POST latency |
| `alerting_rules_total` | `state` (enabled/disabled) | Rule count gauge |
| `alerting_instances_total` | `state` | Instance count gauge |

### ClickHouse projections

The `LOG_PATTERN` and `EXCHANGE_MATCH` evaluators use ClickHouse projections (`logs_by_level`, `executions_by_status`). On fresh ClickHouse containers (e.g. Testcontainers), projections may not be active immediately. In that case the evaluator falls back to a full table scan with the same WHERE clause, so correctness is preserved but latency may increase on first evaluation. In production ClickHouse, projections are applied to new data immediately and to existing data after `OPTIMIZE TABLE … FINAL`.