cameleer-server

Author	SHA1	Message	Date
hsiegeln	88804aca2c	fix(alerts): final sweep — drop ACKNOWLEDGED from AlertStateChip + CMD-K; harden V17 IT UI: AlertStateChip.LABELS and .COLORS no longer include ACKNOWLEDGED (dropped in V17). AlertStateChip.test.tsx test-cases trimmed to the three remaining states. LayoutShell CMD-K now searches FIRING alerts with acked=false (was state=[FIRING,ACKNOWLEDGED]). Test: V17MigrationIT.open_rule_index_predicate_is_reworked replaced with a structural-only assertion (index exists, indisunique). The pg_get_indexdef pretty-printer varies across Postgres versions, so predicate semantics are verified behaviorally in PostgresAlertInstanceRepositoryIT (findOpenForRule_* + save_rejectsSecondOpenInstanceForSameRuleAndExchange). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 19:29:58 +02:00
hsiegeln	69fe80353c	test(alerts): close repo IT gaps — filterInEnvLive other-env + bulkMarkRead soft-delete Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:55:12 +02:00
hsiegeln	99b739d946	fix(alerts): backend hardening + complete ACKNOWLEDGED migration - new AlertInstanceRepository.filterInEnvLive(ids, env): single-query bulk ID validation - AlertController.inEnvLiveIds now one SQL round-trip instead of N - bulkMarkRead SQL: defense-in-depth AND deleted_at IS NULL - bulkAck SQL already had deleted_at IS NULL guard — no change needed - PostgresAlertInstanceRepositoryIT: add filterInEnvLive_excludes_other_env_and_soft_deleted - V12MigrationIT: remove alert_reads assertion (table dropped by V17) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:48:57 +02:00
hsiegeln	c70fa130ab	test(alerts): cover global read — one user marks read, others see readAt Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:20:21 +02:00
hsiegeln	efd8396045	feat(alerts): controller — DELETE/bulk-delete/bulk-ack/restore + acked/read filters + readAt on DTO - GET /alerts gains tri-state acked + read query params - new endpoints: DELETE /{id} (soft-delete), POST /bulk-delete, POST /bulk-ack, POST /{id}/restore - requireLiveInstance 404s on soft-deleted rows; restore() reads the row regardless - BulkReadRequest → BulkIdsRequest (shared body for bulk read/ack/delete) - AlertDto gains readAt; deletedAt stays off the wire - InAppInboxQuery.listInbox threads acked/read through to the repo (7-arg, no more null placeholders) - SecurityConfig: new matchers for bulk-ack (VIEWER+), DELETE/bulk-delete/restore (OPERATOR+) - AlertControllerIT: persistence assertions on /read + /bulk-read; full coverage for new endpoints - InAppInboxQueryTest: updated to 7-arg listInbox signature Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:15:16 +02:00
hsiegeln	dd2a5536ab	test(alerts): rename ack test to reflect state is unchanged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:04:39 +02:00
hsiegeln	e1321a4002	chore(alerts): delete orphan PostgresAlertReadRepositoryIT The class under test was removed in da281933; the IT became a @Disabled placeholder. Deleting per no-backwards-compat policy. Read mutation coverage lives in PostgresAlertInstanceRepositoryIT going forward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:00:00 +02:00
hsiegeln	da2819332c	feat(alerts): Postgres repo — read_at/deleted_at columns, filter params, new mutations - save/rowMapper read+write read_at and deleted_at - listForInbox: tri-state acked/read filters; always excludes deleted - countUnreadBySeverity: rewire without alert_reads join, preserve zero-fill - new: markRead/bulkMarkRead/softDelete/bulkSoftDelete/bulkAck/restore - delete PostgresAlertReadRepository + its bean - restore zero-fill Javadoc on interface - mechanical compile-fixes in AlertController, InAppInboxQuery, AlertControllerIT, InAppInboxQueryTest; Task 6 owns the rewrite - PostgresAlertReadRepositoryIT stubbed @Disabled; Task 7 owns migration Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:56:06 +02:00
hsiegeln	6e8d890442	fix(alerts): remove dead ACKNOWLEDGED enum SQL + TODO comments Remove SET state='ACKNOWLEDGED' from ack() and the ACKNOWLEDGED predicate from findOpenForRule — both would error after V17. The final ack() + open-rule semantics (idempotent guards, deleted_at) are owned by Task 5; this is just the minimum to stop runtime SQL errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:36:02 +02:00
hsiegeln	5b1b3f215a	test(alerts): state machine — ack is orthogonal, does not transition FIRING - AlertStateTransitionsTest: add null,null for readAt/deletedAt in openInstance helper; replace firingWhenAcknowledgedIsNoOp with firing_with_ack_stays_firing_on_next_firing_tick; convert ackedInstanceClearsToResolved to use FIRING+withAck; update section comment. - PostgresAlertInstanceRepository: stub null,null for readAt/deletedAt in rowMapper to unblock compilation (Task 4 will read the actual DB columns). - All other alerting test files: add null,null for readAt/deletedAt to AlertInstance ctor calls so the test source tree compiles; stub ACKNOWLEDGED JSON/state assertions with FIRING + TODO Task 4 comments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:28:31 +02:00
hsiegeln	e95c21d0cb	feat(alerts): V17 migration — drop ACKNOWLEDGED, add read_at + deleted_at Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:04:09 +02:00
hsiegeln	414f7204bf	feat(alerting): AGENT_LIFECYCLE condition kind with per-subject fire mode Allows alert rules to fire on agent-lifecycle events — REGISTERED, RE_REGISTERED, DEREGISTERED, WENT_STALE, WENT_DEAD, RECOVERED — rather than only on current state. Each matching `(agent, eventType, timestamp)` becomes its own ackable AlertInstance, so outages on distinct agents are independently routable. Core: - New `ConditionKind.AGENT_LIFECYCLE` + `AgentLifecycleCondition` record (scope, eventTypes, withinSeconds). Compact ctor rejects empty eventTypes and withinSeconds<1. - Strict allowlist enum `AgentLifecycleEventType` (six entries matching the server-emitted types in `AgentRegistrationController` and `AgentLifecycleMonitor`). Custom agent-emitted event types tracked in backlog issue #145. - `AgentEventRepository.findInWindow(env, appSlug, agentId, eventTypes, from, to, limit)` — new read path ordered `(timestamp ASC, insert_id ASC)` used by the evaluator. Implemented on `ClickHouseAgentEventRepository` with tenant + env filter mandatory. App: - `AgentLifecycleEvaluator` queries events in the last `withinSeconds` window and returns `EvalResult.Batch` with one `Firing` per row. Every Firing carries a canonical `_subjectFingerprint` of `"<agentId>:<eventType>:<tsMillis>"` in context plus `agent` / `event` subtrees for Mustache templating. - `NotificationContextBuilder` gains an `AGENT_LIFECYCLE` branch that exposes `{{agent.id}}`, `{{agent.app}}`, `{{event.type}}`, `{{event.timestamp}}`, `{{event.detail}}`. - Validation is delegated to the record compact ctor + enum at Jackson deserialization time — matches the existing policy of keeping controller validators focused on env-scoped / SQL-injection concerns. Schema: - V16 migration generalises the V15 per-exchange discriminator on `alert_instances_open_rule_uq` to prefer `_subjectFingerprint` with a fallback to the legacy `exchange.id` expression. Scalar kinds still resolve to `''` and keep one-open-per-rule. Duplicate-key path in `PostgresAlertInstanceRepository.save` is unchanged — the index is the deduper. UI: - New `AgentLifecycleForm.tsx` wizard form with multi-select chips for the six allowed event types + `withinSeconds` input. Wired into `ConditionStep`, `form-state` (validation + defaults: WENT_DEAD, 300 s), and `enums.ts` options. Tests in `enums.test.ts` pin the new option array. - `alert-variables.ts` registers `{{agent.app}}`, `{{event.type}}`, `{{event.timestamp}}`, `{{event.detail}}` leaves for the new kind, and extends `agent.id`'s availability list to include `AGENT_LIFECYCLE`. Tests (all passing): - 5 new JSON-roundtrip cases on `AlertConditionJsonTest` (positive + empty/zero/unknown-type rejection). - 5 new evaluator unit tests on `AgentLifecycleEvaluatorTest` (empty window, multi-agent fingerprint shape, scope forwarding, missing env). - `NotificationContextBuilderTest` switch now covers the new kind. - 119 alerting unit tests + 71 UI tests green. Docs: `.claude/rules/{core,app,ui}` and CLAUDE.md migration list updated.	2026-04-21 14:52:08 +02:00
hsiegeln	f037d8c922	feat(alerting): server-side state+severity filters, ButtonGroup filter UI Backend: `GET /environments/{envSlug}/alerts` now accepts optional multi-value `state=…` and `severity=…` query params. Filters are pushed down to PostgresAlertInstanceRepository, which appends `AND state::text = ANY(?)` / `AND severity::text = ANY(?)` to the inbox query (null/empty = no filter). `AlertInstanceRepository.listForInbox` gained a 7-arg overload; the old 5-arg form is preserved as a default delegate so existing callers (evaluator, AlertingFullLifecycleIT, PostgresAlertInstanceRepositoryIT) compile unchanged. `InAppInboxQuery.listInbox` also has a new filtered overload. UI: InboxPage severity filter migrated from `SegmentedTabs` (single-select, no color cues) to `ButtonGroup` (multi-select with severity-coloured dots), matching the topnavbar status-filter pattern. `useAlerts` forwards the filters as query params and cache-keys on the filter tuple so each combo is independently cached. Unit + hook tests updated to the new contract (5 UI tests + 8 Java unit tests passing). OpenAPI types regenerated from the fresh local backend.	2026-04-21 12:47:31 +02:00
hsiegeln	037a27d405	fix(alerting): allow multiple open alert_instances per rule for PER_EXCHANGE V13 added a partial unique index on alert_instances(rule_id) WHERE state IN (PENDING,FIRING,ACKNOWLEDGED). Correct for scalar condition kinds (ROUTE_METRIC / AGENT_STATE / DEPLOYMENT_STATE / LOG_PATTERN / JVM_METRIC / EXCHANGE_MATCH in COUNT_IN_WINDOW) but wrong for EXCHANGE_MATCH / PER_EXCHANGE, which by design emits one alert_instance per matching exchange. Under V13 every PER_EXCHANGE tick with >1 match logged "Skipped duplicate open alert_instance for rule …" at evaluator cadence and silently lost alert fidelity — only the first matching exchange per tick got an AlertInstance + webhook dispatch. V15 drops the rule_id-only constraint and recreates it with a discriminator on context->'exchange'->>'id'. Scalar kinds emit Map.of() as context, so their expression resolves to '' — "one open per rule" preserved. ExchangeMatchEvaluator.evaluatePerExchange always populates exchange.id, so per-exchange instances coexist cleanly. Two new PostgresAlertInstanceRepositoryIT tests: - multiple open instances for same rule + distinct exchanges all land - second open for identical (rule, exchange) still dedups via the DuplicateKeyException fallback in save() — defense-in-depth kept Also fixes pre-existing PostgresAlertReadRepositoryIT brokenness: its setup() inserted 3 open instances sharing one rule_id, which V13 blocked on arrival. Migrate to one rule_id per instance (pattern already used across other storage ITs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 22:26:19 +02:00
hsiegeln	09b49f096c	feat(alerting): per-severity breakdown on unread-count DTO Spec §13 calls for the notification bell to colour-code by highest unread severity (CRITICAL → error, WARNING → amber, INFO → muted). The old { count } DTO forced the UI to pick one static colour, so NotificationBell shipped with a TODO. Grow the contract instead: UnreadCountResponse = { total, bySeverity: { CRITICAL, WARNING, INFO } } Guarantees: - every severity is always present with a >=0 value (no undefined keys on the wire), so the UI can branch without defaults. - total = sum of bySeverity values — kept explicit on the wire for cheap top-line display, not recomputed client-side. Backend - AlertInstanceRepository: replaces countUnreadForUser(long) with countUnreadBySeverityForUser returning Map<AlertSeverity, Long>. One SQL round-trip per (env, user) — GROUP BY ai.severity over the same NOT EXISTS(alert_reads) filter. - UnreadCountResponse.from(Map) normalises and defensively copies; missing severities default to 0. - InAppInboxQuery.countUnread now returns the DTO, caches the full response (still 5s TTL) so severity breakdown gets the same hit-rate as the total did before. - AlertController just hands the DTO back. Breaking change — no backwards-compat shim: the `count` field is gone. UI and tests updated in the same commit; there are no other API consumers in the tree. Frontend - Regenerated openapi.json + schema.d.ts against a fresh build of the new backend. - NotificationBell branches badge colour on the highest unread severity (CRITICAL > WARNING > INFO) via new CSS variants. - Tests cover all four paths: zero, critical-present, warning-only, info-only. Tests: 7 unit tests + 12 ITs (incl. new grouping + empty-map) + 49 vitest (was 46; +3 severity-branch assertions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 18:15:56 +02:00
hsiegeln	ec460faf02	Merge pull request 'feat(alerting): Plan 03 — UI + backfills (SSRF guard, metrics caching, docker stack)' (#144) from feat/alerting-03-ui into main Reviewed-on: #144	2026-04-20 16:27:49 +02:00
hsiegeln	9f109b20fd	perf(alerting): 30s TTL cache on AlertingMetrics gauge suppliers Prometheus scrapes can fire every few seconds. The open-alerts / open-rules gauges query Postgres on each read — caching the values for 30s amortises that to one query per half-minute. Addresses final-review NIT from Plan 02. - Introduces a package-private TtlCache that wraps a Supplier<Long> and memoises the last read for a configurable Duration against a Supplier<Instant> clock. - Wraps each gauge supplier (alerting_rules_total{enabled|disabled}, alerting_instances_total{state}) in its own TtlCache. - Adds a test-friendly constructor (package-private) taking explicit Duration + Supplier<Instant> so AlertingMetricsCachingTest can advance a fake clock without waiting wall-clock time. - Adds AlertingMetricsCachingTest covering: * supplier invoked once per TTL across repeated scrapes * 29 s elapsed → still cached; 31 s elapsed → re-queried * gauge value reflects the cached result even after delegate mutates Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:22:54 +02:00
hsiegeln	5ebc729b82	feat(alerting): SSRF guard on outbound connection URL Rejects webhook URLs that resolve to loopback, link-local, or RFC-1918 private ranges (IPv4 + IPv6 ULA fc00::/7). Enforced on both create and update in OutboundConnectionServiceImpl before persistence; returns 400 Bad Request with "private or loopback" in the body. Bypass via `cameleer.server.outbound-http.allow-private-targets=true` for dev environments where webhooks legitimately point at local services. Production default is `false`. Test profile sets the flag to `true` in application-test.yml so the existing ITs that post webhooks to WireMock on https://localhost:PORT keep working. A dedicated OutboundConnectionSsrfIT overrides the flag back to false (via @TestPropertySource + @DirtiesContext) to exercise the reject path end-to-end through the admin controller. Plan 01 scope; required before SaaS exposure (spec §17). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:17:44 +02:00
hsiegeln	94e941b026	test(alerting): decentralize @MockBean from AbstractPostgresIT + add SpringContextSmokeIT Follow-up to #141. AbstractPostgresIT centrally declared three @MockBean fields (clickHouseSearchIndex, clickHouseLogStore, agentRegistryService), which meant EVERY IT ran against mocks instead of the real Spring context. That masked the production crashloop — the real bean graph was never exercised by CI. - Remove the three @MockBean fields from AbstractPostgresIT. - Move @MockBean declarations onto only the specific ITs that stub method behavior (verified by grepping for when/verify calls). - ITs that don't stub CH behavior now inject the real beans. - Add SpringContextSmokeIT — @SpringBootTest with no mocks, void contextLoads(). Fails fast on declared-type / autowire-type mismatches like the one #141 fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 10:51:46 +02:00
hsiegeln	b0ba08e572	test(alerting): rewrite AlertingFullLifecycleIT — REST-driven rule creation, re-notify cadence Rule creation now goes through POST /alerts/rules (exercises saveTargets on the write path). Clock is replaced with @MockBean(name="alertingClock") and re-stubbed in @BeforeEach to survive Mockito's inter-test reset. Six ordered steps: 1. seed log → tick evaluator → assert FIRING instance with non-empty targets (B-1) 2. tick dispatcher → assert DELIVERED notification + lastNotifiedAt stamped (B-2) 3. ack via REST → assert ACKNOWLEDGED state 4. create silence → inject PENDING notification → tick dispatcher → assert silenced (FAILED) 5. delete rule → assert rule_id nullified, rule_snapshot preserved (ON DELETE SET NULL) 6. new rule with reNotifyMinutes=1 → first dispatch → advance clock 61s → evaluator sweep → second dispatch → verify 2 WireMock POSTs (B-2 cadence) Background scheduler races addressed by resetting claimed_by/claimed_until before each manual tick. Simulated clock set AFTER log insert to guarantee log timestamp falls within the evaluator window. Re-notify notifications backdated in Postgres to work around the simulated vs real clock gap in claimDueNotifications. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:26:38 +02:00
hsiegeln	424894a3e2	fix(alerting/I-1): retry endpoint resets attempts to 0 instead of incrementing AlertNotificationRepository gains resetForRetry(UUID, Instant) which sets attempts=0, status=PENDING, next_attempt_at=now, and clears claim/response fields. AlertNotificationController calls resetForRetry instead of scheduleRetry so a manual retry always starts from a clean slate. AlertNotificationControllerIT adds retryResetsAttemptsToZero to verify attempts==0 and status==PENDING after three prior markFailed calls. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:59 +02:00
hsiegeln	d74079da63	fix(alerting/B-2): implement re-notify cadence sweep and lastNotifiedAt tracking AlertInstanceRepository gains listFiringDueForReNotify(Instant) — only returns instances where last_notified_at IS NOT NULL and cadence has elapsed (IS NULL branch excluded: sweep only re-notifies, initial notify is the dispatcher's job). AlertEvaluatorJob.sweepReNotify() runs at the end of each tick, enqueues fresh notifications for eligible instances and stamps last_notified_at. NotificationDispatchJob stamps last_notified_at on the alert_instance when a notification is DELIVERED, providing the anchor timestamp for cadence checks. PostgresAlertInstanceRepositoryIT adds listFiringDueForReNotify test covering the three-rule eligibility matrix (never-notified, long-ago, recent). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:50 +02:00
hsiegeln	3f036da03d	fix(alerting/B-1): PostgresAlertRuleRepository.save() now persists alert_rule_targets saveTargets() is called unconditionally at the end of save() — it deletes existing targets and re-inserts from the current targets list. findById() and listByEnvironment() already call withTargets() so reads are consistent. PostgresAlertRuleRepositoryIT adds saveTargets_roundtrip and saveTargets_updateReplacesExistingTargets to cover the new write path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:39 +02:00
hsiegeln	8bf45d5456	fix(alerting): use ALTER TABLE MODIFY SETTING to enable projections on executions ReplacingMergeTree Investigated three approaches for CH 24.12: - Inline SETTINGS on ADD PROJECTION: rejected (UNKNOWN_SETTING — not a query-level setting). - ALTER TABLE MODIFY SETTING deduplicate_merge_projection_mode='rebuild': works; persists in table metadata across connection restarts; runs before ADD PROJECTION in the SQL script. - Session-level JDBC URL param: not pursued (MODIFY SETTING is strictly better). alerting_projections.sql now runs MODIFY SETTING before the two executions ADD PROJECTIONs. AlertingProjectionsIT strengthened to assert all four projections (including alerting_app_status and alerting_route_status on executions) exist after schema init. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 07:36:55 +02:00
hsiegeln	c79a6234af	test(alerting): fix duplicate @MockBean after AbstractPostgresIT centralised mocks + Plan 02 verification report AbstractPostgresIT gained clickHouseSearchIndex and agentRegistryService mocks in Phase 9. All 14 alerting IT subclasses that re-declared the same @MockBean fields now fail with "Duplicate mock definition". Removed the redundant declarations; per-class clickHouseLogStore mock kept where needed. 120 alerting tests now pass (0 failures). Also adds docs/alerting-02-verification.md (Task 43). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 23:27:19 +02:00
hsiegeln	1ab21bc019	feat(alerting): AlertingRetentionJob daily cleanup Nightly @Scheduled(03:00) job deletes RESOLVED alert_instances older than eventRetentionDays and DELIVERED/FAILED alert_notifications older than notificationRetentionDays. Uses injected Clock for testability. IT covers: old-resolved deleted, fresh-resolved kept, FIRING kept regardless of age, PENDING notification never deleted. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 22:16:21 +02:00
hsiegeln	e334dfacd3	feat(alerting): AlertNotificationController + SecurityConfig matchers + fix IT context (Task 35) - GET /environments/{envSlug}/alerts/{alertId}/notifications — list notifications for instance (VIEWER+) - POST /alerts/notifications/{id}/retry — manual retry of failed notification (OPERATOR+) Flat path because notification IDs are globally unique (no env routing needed) - scheduleRetry resets attempts to 0 and sets nextAttemptAt = now - Added 11 alerting path matchers to SecurityConfig before outbound-connections block - Fixed context loading failure in 6 pre-existing alerting storage/migration ITs by adding @MockBean(clickHouseSearchIndex/clickHouseLogStore): ExchangeMatchEvaluator and LogPatternEvaluator inject the concrete classes directly (not interface beans), so the full Spring context fails without these mocks in tests that don't use the real CH container - 5 IT tests: list, viewer-can-list, retry, viewer-cannot-retry, unknown-404 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:29:17 +02:00
hsiegeln	77d1718451	feat(alerting): AlertSilenceController CRUD with time-range validation + audit (Task 34) - POST/GET/DELETE /environments/{envSlug}/alerts/silences - 422 when endsAt <= startsAt ("endsAt must be after startsAt") - OPERATOR+ for create/delete, VIEWER+ for list - Audit: ALERT_SILENCE_CREATE/DELETE with AuditCategory.ALERT_SILENCE_CHANGE - 6 IT tests: create, viewer-list, viewer-cannot-create, bad time-range, delete, audit event Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:29:03 +02:00
hsiegeln	841793d7b9	feat(alerting): AlertController in-app inbox with ack/read/bulk-read (Task 33) - GET /environments/{envSlug}/alerts — inbox filtered by userId/groupIds/roleNames via InAppInboxQuery - GET /unread-count — memoized unread count (5s TTL) - GET /{id}, POST /{id}/ack, POST /{id}/read, POST /bulk-read - bulkRead filters instanceIds to env before delegating to AlertReadRepository - VIEWER+ for all endpoints; env isolation enforced by requireInstance - 7 IT tests: list, env isolation, unread-count, ack flow, read, bulk-read, viewer access Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:28:55 +02:00
hsiegeln	c1b34f592b	feat(alerting): AlertRuleController with attribute-key SQL injection validation (Task 32) - POST/GET/PUT/DELETE /environments/{envSlug}/alerts/rules CRUD - POST /{id}/enable, /{id}/disable, /{id}/render-preview, /{id}/test-evaluate - Attribute-key validation: rejects keys not matching ^[a-zA-Z0-9._-]+$ at rule-save time (CRITICAL: ExchangeMatchCondition attribute keys are inlined into ClickHouse SQL) - Webhook validation: verifies outboundConnectionId exists and is allowed in env - Null-safe notification template defaults to "" for NOT NULL DB constraint - Fixed misleading comment in ClickHouseSearchIndex to document validation contract - OPERATOR+ for mutations, VIEWER+ for reads - Audit: ALERT_RULE_CREATE/UPDATE/DELETE/ENABLE/DISABLE with AuditCategory.ALERT_RULE_CHANGE - 11 IT tests covering RBAC, SQL-injection prevention, enable/disable, audit, render-preview Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:28:46 +02:00
hsiegeln	d3dd8882bd	feat(alerting): InAppInboxQuery with 5s unread-count memoization listInbox resolves user groups+roles via RbacService.getEffectiveGroupsForUser / getEffectiveRolesForUser then delegates to AlertInstanceRepository. countUnread memoized per (envId, userId) with 5s TTL via ConcurrentHashMap using a controllable Clock. 6 unit tests covering delegation, cache hit, TTL expiry, and isolation between users/envs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:25:00 +02:00
hsiegeln	6b48bc63bf	feat(alerting): NotificationDispatchJob outbox loop with silence + retry Claim-polling SchedulingConfigurer: claims due notifications, resolves instance/connection/rule, checks active silences, dispatches via WebhookDispatcher, classifies outcomes into DELIVERED/FAILED/retry. Guards null rule/env after deletion. 5 Testcontainers ITs: 200/503/404 outcomes, active silence suppression, deleted connection fast-fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:24:54 +02:00
hsiegeln	466aceb920	feat(alerting): WebhookDispatcher with HMAC + TLS + retry classification Renders URL/headers/body with Mustache, optionally HMAC-signs the body (X-Cameleer-Signature), supports POST/PUT/PATCH, classifies 2xx/4xx/5xx into DELIVERED/FAILED/retry. 8 WireMock-backed IT tests including HTTPS TRUST_ALL against WireMock self-signed cert. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:24:47 +02:00
hsiegeln	6f1feaa4b0	feat(alerting): HmacSigner for webhook signature HmacSHA256 signer returning sha256=<lowercase-hex>. 5 unit tests covering known vector, prefix, hex casing, and different secrets/bodies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:24:39 +02:00
hsiegeln	bf178ba141	fix(alerting): populate AlertInstance.rule_snapshot so history survives rule delete - Add withRuleSnapshot(Map) wither to AlertInstance (same pattern as other withers) - Call snapshotRule(rule) + withRuleSnapshot in both applyResult (single-firing) and applyBatchFiring paths so every persisted instance carries a non-empty JSONB snapshot - Strip null values from the Jackson-serialized map before wrapping in the immutable snapshot so Map.copyOf in the compact ctor does not throw NPE on nullable rule fields - Add ruleSnapshotIsPersistedOnInstanceCreation IT: asserts name/severity/conditionKind appear in the rule_snapshot column after a tick fires an instance - Add historySurvivesRuleDelete IT: fires an instance, deletes the rule, asserts rule_id IS NULL and rule_snapshot still contains the rule name (spec §5 guarantee) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:09:28 +02:00
hsiegeln	15c0a8273c	feat(alerting): AlertEvaluatorJob with claim-polling + circuit breaker - AlertEvaluatorJob implements SchedulingConfigurer; fixed-delay tick from AlertingProperties.effectiveEvaluatorTickIntervalMs (5 s floor) - Claim-polling via AlertRuleRepository.claimDueRules (FOR UPDATE SKIP LOCKED) - Per-kind circuit breaker guards each evaluator; failures recorded, open kinds skipped and rescheduled without evaluation - Single-Firing path delegates to AlertStateTransitions; new FIRING instances enqueue AlertNotification rows per rule.webhooks() - Batch (PER_EXCHANGE) path creates one FIRING AlertInstance per Firing entry - PENDING→FIRING promotion handled in applyResult via state machine - Title/message rendered via MustacheRenderer + NotificationContextBuilder; environment resolved from EnvironmentRepository.findById per tick - AlertEvaluatorJobIT (4 tests): uses named @MockBean replacements for ClickHouseSearchIndex + ClickHouseLogStore; @MockBean AgentRegistryService drives Clear/Firing/resolve cycle without timing sensitivity Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:58:27 +02:00
hsiegeln	657dc2d407	feat(alerting): AlertingProperties + AlertStateTransitions state machine - AlertingProperties @ConfigurationProperties with effective*() accessors and 5000 ms floor clamp on evaluatorTickIntervalMs; warn logged at startup - AlertStateTransitions pure static state machine: Clear/Firing/Batch/Error branches, PENDING→FIRING promotion on forDuration elapsed; Batch delegated to job - AlertInstance wither helpers: withState, withFiredAt, withResolvedAt, withAck, withSilenced, withTitleMessage, withLastNotifiedAt, withContext - AlertingBeanConfig gains @EnableConfigurationProperties(AlertingProperties), alertingInstanceId bean (hostname:pid), alertingClock bean, PerKindCircuitBreaker bean wired from props - 12 unit tests in AlertStateTransitionsTest covering all transitions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:58:12 +02:00
hsiegeln	f8cd3f3ee4	feat(alerting): EXCHANGE_MATCH evaluator with per-exchange + count modes PER_EXCHANGE returns EvalResult.Batch(List<Firing>); last Firing carries _nextCursor (Instant) in its context map for the job to persist as evalState.lastExchangeTs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:40:54 +02:00
hsiegeln	89db8bd1c5	feat(alerting): JVM_METRIC evaluator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:38:48 +02:00
hsiegeln	17d2be5638	feat(alerting): LOG_PATTERN evaluator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:37:33 +02:00
hsiegeln	07d0386bf2	feat(alerting): ROUTE_METRIC evaluator P95_LATENCY_MS maps to avgDurationMs (ExecutionStats has no p95 bucket). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:36:22 +02:00
hsiegeln	983b698266	feat(alerting): DEPLOYMENT_STATE evaluator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:34:47 +02:00
hsiegeln	e84338fc9a	feat(alerting): AGENT_STATE evaluator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:33:13 +02:00
hsiegeln	55f4cab948	feat(alerting): evaluator scaffolding (context, result, tick cache, circuit breaker) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:32:06 +02:00
hsiegeln	891c7f87e3	feat(alerting): silence matcher for notification-time dispatch SilenceMatcherService.matches() evaluates AND semantics across ruleId, severity, appSlug, routeId, agentId constraints. Null fields are wildcards. Scope-based constraints (appSlug/routeId/agentId) return false when rule is null (deleted rule — scope cannot be verified). 17 unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:27:18 +02:00
hsiegeln	1c74ab8541	feat(alerting): NotificationContextBuilder for template context maps Builds the Mustache context map from AlertRule + AlertInstance + Environment. Always emits env/rule/alert subtrees; conditionally emits kind-specific subtrees (agent, app, route, exchange, log, metric, deployment) based on rule.conditionKind(). Missing instance.context() keys resolve to empty string. alert.link prefixed with uiOrigin when non-null. 11 unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:27:12 +02:00
hsiegeln	92a74e7b8d	feat(alerting): MustacheRenderer with literal fallback on missing vars Sentinel-substitution approach: unresolved {{x.y.z}} tokens are replaced with a unique NUL-delimited sentinel before Mustache compilation, rendered as opaque text, then post-replaced with the original {{x.y.z}} literal. Malformed templates (unclosed {{) are caught and return the raw template. Never throws. 9 unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:27:05 +02:00
hsiegeln	7c0e94a425	feat(alerting): ClickHouse projections for alerting read paths Adds alerting_projections.sql with four projections (alerting_app_status, alerting_route_status on executions; alerting_app_level on logs; alerting_instance_metric on agent_metrics). ClickHouseSchemaInitializer now runs both init.sql and alerting_projections.sql, with ADD PROJECTION and MATERIALIZE treated as non-fatal — executions (ReplacingMergeTree) requires deduplicate_merge_projection_mode=rebuild which is unavailable via JDBC pool. MergeTree projections (logs, agent_metrics) always succeed and are asserted in IT. Column names confirmed from init.sql: logs uses 'application' (not application_id), agent_metrics uses 'collected_at' (not timestamp). All column names match the plan. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:18:58 +02:00
hsiegeln	7b79d3aa64	feat(alerting): countExecutionsForAlerting for exchange-match evaluator Adds AlertMatchSpec record (core) and ClickHouseSearchIndex.countExecutionsForAlerting — no FINAL, no text subqueries. Filters by tenant, env, app, route, status, time window, and optional after-cursor. Attributes (JSON string column) use inlined JSONExtractString key literals since ClickHouse JDBC does not bind ? placeholders inside JSON functions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:18:49 +02:00
hsiegeln	44e91ccdb5	feat(alerting): ClickHouseLogStore.countLogs for log-pattern evaluator Adds countLogs(LogSearchRequest) — no FINAL, no cursor/sort/limit — reusing the same WHERE-clause logic as search() for tenant, env, app, level, q, logger, source, exchangeId, and time-range filters. Also extends ClickHouseTestHelper with executeInitSqlWithProjections() and makes the script runner non-fatal for ADD/MATERIALIZE PROJECTION. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:18:41 +02:00
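
Commit 9f109b20fd above describes a package-private `TtlCache` that wraps a `Supplier<Long>` and memoises the last read for a configurable `Duration` against a `Supplier<Instant>` clock. The repository's actual class is not shown in this log, so the following is only a minimal sketch of that idea. The class name `TtlCacheSketch`, the `synchronized` accessor, and the choice to refresh once the elapsed time reaches the TTL exactly are all assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical sketch of a TTL-memoising supplier; not the project's TtlCache. */
final class TtlCacheSketch implements Supplier<Long> {
    private final Supplier<Long> delegate;   // expensive read, e.g. a COUNT query
    private final Duration ttl;              // how long a cached value stays valid
    private final Supplier<Instant> clock;   // injectable so tests can use a fake clock

    private Long cached;      // last value returned by the delegate
    private Instant loadedAt; // when that value was loaded; null before the first read

    TtlCacheSketch(Supplier<Long> delegate, Duration ttl, Supplier<Instant> clock) {
        this.delegate = delegate;
        this.ttl = ttl;
        this.clock = clock;
    }

    @Override
    public synchronized Long get() {
        Instant now = clock.get();
        // Reload on the first call, or once at least `ttl` has elapsed (assumed boundary).
        if (loadedAt == null || Duration.between(loadedAt, now).compareTo(ttl) >= 0) {
            cached = delegate.get();
            loadedAt = now;
        }
        return cached;
    }
}
```

With a 30 s TTL and a fake clock, a read at 29 s would be served from the cached value and a read at 31 s would call the delegate again, which mirrors the two cases the commit lists for `AlertingMetricsCachingTest`.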


78 Commits
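
Commit 6f1feaa4b0 in the log describes `HmacSigner` as an HmacSHA256 signer returning `sha256=<lowercase-hex>`, and commit 466aceb920 says the webhook body is optionally signed and sent in the `X-Cameleer-Signature` header. The real class is not part of this log, so the sketch below only illustrates that output format using the JDK's `javax.crypto.Mac`. The class name `HmacSignerSketch`, the `sign(secret, body)` signature, and the use of UTF-8 for both secret and body are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch of an HMAC-SHA256 body signer; not the project's HmacSigner. */
final class HmacSignerSketch {
    private static final String ALGORITHM = "HmacSHA256";

    private HmacSignerSketch() {}

    /** Returns "sha256=" followed by the lowercase hex HMAC of the body. */
    static String sign(String secret, String body) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            // Assumption: secret and body are both encoded as UTF-8 before signing.
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), ALGORITHM));
            byte[] digest = mac.doFinal(body.getBytes(StandardCharsets.UTF_8));

            StringBuilder hex = new StringBuilder("sha256=");
            for (byte b : digest) {
                // Mask to an unsigned value so every byte renders as two lowercase hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            // HmacSHA256 is a standard JDK algorithm, so this is not expected in practice.
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

A receiver would recompute the same value over the raw request body with the shared secret and compare it against the header, ideally with a constant-time comparison such as `java.security.MessageDigest.isEqual`.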