cameleer-server

Author	SHA1	Message	Date
hsiegeln	1ea0258393	fix(auth): upsert UI login user_id unprefixed (drop docker seeder workaround) Root cause of the mismatch that prompted the one-shot cameleer-seed docker service: UiAuthController stored users.user_id as the JWT subject "user:admin" (JWT sub format). Every env-scoped controller (Alert, AlertSilence, AlertRule, OutboundConnectionAdmin) already strips the "user:" prefix on the read path — so the rest of the system expects the DB key to be the bare username. With UiAuth storing prefixed, fresh docker stacks hit "alert_rules_created_by_fkey violation" on the first rule create. Fix: inside login(), compute `userId = request.username()` and use it everywhere the DB/RBAC layer is touched (isLocked, getPasswordHash, record/clearFailedLogins, upsert, assignRoleToUser, addUserToGroup, getSystemRoleNames). Keep `subject = "user:" + userId` — we still sign JWTs with the namespaced subject so JwtAuthenticationFilter can distinguish user vs agent tokens. refresh() and me() follow the same rule via a stripSubjectPrefix() helper (JWT subject in, bare DB key out). With the write path aligned, the docker bridge is no longer needed: - Deleted deploy/docker/postgres-init.sql - Deleted cameleer-seed service from docker-compose.yml Scope: UiAuthController only. UserAdminController + OidcAuthController still prefix on upsert — that's the bug class the triage identified as "Option A or B either way OK". Not changing them now because: a) prod admins are provisioned unprefixed through some other path, so those two files aren't the docker-only failure observed; b) stripping them would need a data migration for any existing prod users stored prefixed, which is out of scope for a cleanup phase. Follow-up worth scheduling if we ever wire OIDC or admin-created users into alerting FKs. Verified: 33/33 alerting+outbound controller ITs pass (9 outbound, 10 rules, 9 silences, 5 alert inbox). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 18:26:03 +02:00
hsiegeln	09b49f096c	feat(alerting): per-severity breakdown on unread-count DTO Spec §13 calls for the notification bell to colour-code by highest unread severity (CRITICAL → error, WARNING → amber, INFO → muted). The old { count } DTO forced the UI to pick one static colour, so NotificationBell shipped with a TODO. Grow the contract instead: UnreadCountResponse = { total, bySeverity: { CRITICAL, WARNING, INFO } } Guarantees: - every severity is always present with a >=0 value (no undefined keys on the wire), so the UI can branch without defaults. - total = sum of bySeverity values — kept explicit on the wire for cheap top-line display, not recomputed client-side. Backend - AlertInstanceRepository: replaces countUnreadForUser(long) with countUnreadBySeverityForUser returning Map<AlertSeverity, Long>. One SQL round-trip per (env, user) — GROUP BY ai.severity over the same NOT EXISTS(alert_reads) filter. - UnreadCountResponse.from(Map) normalises and defensively copies; missing severities default to 0. - InAppInboxQuery.countUnread now returns the DTO, caches the full response (still 5s TTL) so severity breakdown gets the same hit-rate as the total did before. - AlertController just hands the DTO back. Breaking change — no backwards-compat shim: the `count` field is gone. UI and tests updated in the same commit; there are no other API consumers in the tree. Frontend - Regenerated openapi.json + schema.d.ts against a fresh build of the new backend. - NotificationBell branches badge colour on the highest unread severity (CRITICAL > WARNING > INFO) via new CSS variants. - Tests cover all four paths: zero, critical-present, warning-only, info-only. Tests: 7 unit tests + 12 ITs (incl. new grouping + empty-map) + 49 vitest (was 46; +3 severity-branch assertions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 18:15:56 +02:00
hsiegeln	ec460faf02	Merge pull request 'feat(alerting): Plan 03 — UI + backfills (SSRF guard, metrics caching, docker stack)' (#144) from feat/alerting-03-ui into main Reviewed-on: #144	2026-04-20 16:27:49 +02:00
hsiegeln	5edf7eb23a	fix(alerting): @Autowired on AlertingMetrics production constructor Task 29's refactor added a package-private test-friendly constructor alongside the public production one. Without @Autowired Spring cannot pick which constructor to use for the @Component, and falls back to searching for a no-arg default — crashing startup with 'No default constructor found'. Detected when launching the server via the new docker-compose stack; unit tests still pass because they invoke the package-private test constructor directly.	2026-04-20 16:02:48 +02:00
hsiegeln	9f109b20fd	perf(alerting): 30s TTL cache on AlertingMetrics gauge suppliers Prometheus scrapes can fire every few seconds. The open-alerts / open-rules gauges query Postgres on each read — caching the values for 30s amortises that to one query per half-minute. Addresses final-review NIT from Plan 02. - Introduces a package-private TtlCache that wraps a Supplier<Long> and memoises the last read for a configurable Duration against a Supplier<Instant> clock. - Wraps each gauge supplier (alerting_rules_total{enabled|disabled}, alerting_instances_total{state}) in its own TtlCache. - Adds a test-friendly constructor (package-private) taking explicit Duration + Supplier<Instant> so AlertingMetricsCachingTest can advance a fake clock without waiting wall-clock time. - Adds AlertingMetricsCachingTest covering: * supplier invoked once per TTL across repeated scrapes * 29 s elapsed → still cached; 31 s elapsed → re-queried * gauge value reflects the cached result even after delegate mutates Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:22:54 +02:00
hsiegeln	5ebc729b82	feat(alerting): SSRF guard on outbound connection URL Rejects webhook URLs that resolve to loopback, link-local, or RFC-1918 private ranges (IPv4 + IPv6 ULA fc00::/7). Enforced on both create and update in OutboundConnectionServiceImpl before persistence; returns 400 Bad Request with "private or loopback" in the body. Bypass via `cameleer.server.outbound-http.allow-private-targets=true` for dev environments where webhooks legitimately point at local services. Production default is `false`. Test profile sets the flag to `true` in application-test.yml so the existing ITs that post webhooks to WireMock on https://localhost:PORT keep working. A dedicated OutboundConnectionSsrfIT overrides the flag back to false (via @TestPropertySource + @DirtiesContext) to exercise the reject path end-to-end through the admin controller. Plan 01 scope; required before SaaS exposure (spec §17). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:17:44 +02:00
hsiegeln	94e941b026	test(alerting): decentralize @MockBean from AbstractPostgresIT + add SpringContextSmokeIT Follow-up to #141. AbstractPostgresIT centrally declared three @MockBean fields (clickHouseSearchIndex, clickHouseLogStore, agentRegistryService), which meant EVERY IT ran against mocks instead of the real Spring context. That masked the production crashloop — the real bean graph was never exercised by CI. - Remove the three @MockBean fields from AbstractPostgresIT. - Move @MockBean declarations onto only the specific ITs that stub method behavior (verified by grepping for when/verify calls). - ITs that don't stub CH behavior now inject the real beans. - Add SpringContextSmokeIT — @SpringBootTest with no mocks, void contextLoads(). Fails fast on declared-type / autowire-type mismatches like the one #141 fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 10:51:46 +02:00
hsiegeln	c9c93ac565	fix(alerting): declare ClickHouseSearchIndex bean as concrete type Production crashlooped on startup: ExchangeMatchEvaluator autowires the concrete ClickHouseSearchIndex (for countExecutionsForAlerting, which lives only on the concrete class, not the SearchIndex interface), but StorageBeanConfig declared the bean with interface return type SearchIndex. Spring matches autowire candidates by declared bean type, not by runtime instance class, so the concrete-typed autowire failed with: Parameter 0 of constructor in ExchangeMatchEvaluator required a bean of type 'ClickHouseSearchIndex' that could not be found. ClickHouseLogStore's bean is already declared with the concrete return type (line 171), which is why LogPatternEvaluator autowires fine. All alerting ITs passed pre-merge because AbstractPostgresIT replaces the clickHouseSearchIndex bean with @MockBean(name=...) whose declared type IS the concrete ClickHouseSearchIndex. The mock masked the prod bug. Follow-up: remove @MockBean(name="clickHouseSearchIndex") from AbstractPostgresIT so the real bean graph is exercised by alerting ITs (and add a SpringContextSmokeIT that loads the context with no mocks). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 09:11:47 +02:00
hsiegeln	b0ba08e572	test(alerting): rewrite AlertingFullLifecycleIT — REST-driven rule creation, re-notify cadence Rule creation now goes through POST /alerts/rules (exercises saveTargets on the write path). Clock is replaced with @MockBean(name="alertingClock") and re-stubbed in @BeforeEach to survive Mockito's inter-test reset. Six ordered steps: 1. seed log → tick evaluator → assert FIRING instance with non-empty targets (B-1) 2. tick dispatcher → assert DELIVERED notification + lastNotifiedAt stamped (B-2) 3. ack via REST → assert ACKNOWLEDGED state 4. create silence → inject PENDING notification → tick dispatcher → assert silenced (FAILED) 5. delete rule → assert rule_id nullified, rule_snapshot preserved (ON DELETE SET NULL) 6. new rule with reNotifyMinutes=1 → first dispatch → advance clock 61s → evaluator sweep → second dispatch → verify 2 WireMock POSTs (B-2 cadence) Background scheduler races addressed by resetting claimed_by/claimed_until before each manual tick. Simulated clock set AFTER log insert to guarantee log timestamp falls within the evaluator window. Re-notify notifications backdated in Postgres to work around the simulated vs real clock gap in claimDueNotifications. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:26:38 +02:00
hsiegeln	2c82b50ea2	fix(alerting/B-1): AlertStateTransitions.newInstance() propagates rule targets to AlertInstance newInstance() now maps rule.targets() into targetUserIds/targetGroupIds/targetRoleNames so newly created AlertInstance rows carry the correct target arrays. Previously these were always empty List.of(), making the inbox query return nothing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:26:25 +02:00
hsiegeln	7e79ff4d98	fix(alerting/I-2): add unique partial index on alert_instances(rule_id) for open states V13 migration creates alert_instances_open_rule_uq — a partial unique index on (rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED'), preventing duplicate open instances per rule. PostgresAlertInstanceRepository.save() catches DuplicateKeyException and returns the existing open instance instead of failing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:26:07 +02:00
hsiegeln	424894a3e2	fix(alerting/I-1): retry endpoint resets attempts to 0 instead of incrementing AlertNotificationRepository gains resetForRetry(UUID, Instant) which sets attempts=0, status=PENDING, next_attempt_at=now, and clears claim/response fields. AlertNotificationController calls resetForRetry instead of scheduleRetry so a manual retry always starts from a clean slate. AlertNotificationControllerIT adds retryResetsAttemptsToZero to verify attempts==0 and status==PENDING after three prior markFailed calls. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:59 +02:00
hsiegeln	d74079da63	fix(alerting/B-2): implement re-notify cadence sweep and lastNotifiedAt tracking AlertInstanceRepository gains listFiringDueForReNotify(Instant) — only returns instances where last_notified_at IS NOT NULL and cadence has elapsed (IS NULL branch excluded: sweep only re-notifies, initial notify is the dispatcher's job). AlertEvaluatorJob.sweepReNotify() runs at the end of each tick, enqueues fresh notifications for eligible instances and stamps last_notified_at. NotificationDispatchJob stamps last_notified_at on the alert_instance when a notification is DELIVERED, providing the anchor timestamp for cadence checks. PostgresAlertInstanceRepositoryIT adds listFiringDueForReNotify test covering the three-rule eligibility matrix (never-notified, long-ago, recent). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:50 +02:00
hsiegeln	3f036da03d	fix(alerting/B-1): PostgresAlertRuleRepository.save() now persists alert_rule_targets saveTargets() is called unconditionally at the end of save() — it deletes existing targets and re-inserts from the current targets list. findById() and listByEnvironment() already call withTargets() so reads are consistent. PostgresAlertRuleRepositoryIT adds saveTargets_roundtrip and saveTargets_updateReplacesExistingTargets to cover the new write path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:39 +02:00
hsiegeln	8bf45d5456	fix(alerting): use ALTER TABLE MODIFY SETTING to enable projections on executions ReplacingMergeTree Investigated three approaches for CH 24.12: - Inline SETTINGS on ADD PROJECTION: rejected (UNKNOWN_SETTING — not a query-level setting). - ALTER TABLE MODIFY SETTING deduplicate_merge_projection_mode='rebuild': works; persists in table metadata across connection restarts; runs before ADD PROJECTION in the SQL script. - Session-level JDBC URL param: not pursued (MODIFY SETTING is strictly better). alerting_projections.sql now runs MODIFY SETTING before the two executions ADD PROJECTIONs. AlertingProjectionsIT strengthened to assert all four projections (including alerting_app_status and alerting_route_status on executions) exist after schema init. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 07:36:55 +02:00
hsiegeln	f1abca3a45	refactor(alerting): rename P95_LATENCY_MS → AVG_DURATION_MS to match what stats_1m_route exposes The evaluator mapped P95_LATENCY_MS to ExecutionStats.avgDurationMs because stats_1m_route has no p95 column. Exposing the old name implied p95 semantics operators did not get. Rename to AVG_DURATION_MS makes the contract honest. Updated RouteMetric enum (with javadoc), evaluator switch, and admin guide. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 07:36:43 +02:00
hsiegeln	c79a6234af	test(alerting): fix duplicate @MockBean after AbstractPostgresIT centralised mocks + Plan 02 verification report AbstractPostgresIT gained clickHouseSearchIndex and agentRegistryService mocks in Phase 9. All 14 alerting IT subclasses that re-declared the same @MockBean fields now fail with "Duplicate mock definition". Removed the redundant declarations; per-class clickHouseLogStore mock kept where needed. 120 alerting tests now pass (0 failures). Also adds docs/alerting-02-verification.md (Task 43). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 23:27:19 +02:00
hsiegeln	63669bd1d7	docs(alerting): default config + admin guide Adds alerting stanza to application.yml with all AlertingProperties fields backed by env-var overrides. Creates docs/alerting.md covering six condition kinds (with example JSON), template variables, webhook setup (Slack/PagerDuty examples), silence patterns, circuit-breaker and retention troubleshooting, and Prometheus metrics reference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 22:16:38 +02:00
hsiegeln	840a71df94	feat(alerting): observability metrics via micrometer AlertingMetrics @Component wraps MeterRegistry: - Counters: alerting_eval_errors_total{kind}, alerting_circuit_opened_total{kind}, alerting_notifications_total{status} - Timers: alerting_eval_duration_seconds{kind}, alerting_webhook_delivery_duration_seconds - Gauges (DB-backed): alerting_rules_total{state}, alerting_instances_total{state} AlertEvaluatorJob records evalError + evalDuration around each evaluator call. PerKindCircuitBreaker detects open transitions and fires metrics.circuitOpened(kind). AlertingBeanConfig wires AlertingMetrics into the circuit breaker post-construction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 22:16:30 +02:00
hsiegeln	1ab21bc019	feat(alerting): AlertingRetentionJob daily cleanup Nightly @Scheduled(03:00) job deletes RESOLVED alert_instances older than eventRetentionDays and DELIVERED/FAILED alert_notifications older than notificationRetentionDays. Uses injected Clock for testability. IT covers: old-resolved deleted, fresh-resolved kept, FIRING kept regardless of age, PENDING notification never deleted. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 22:16:21 +02:00
hsiegeln	e334dfacd3	feat(alerting): AlertNotificationController + SecurityConfig matchers + fix IT context (Task 35) - GET /environments/{envSlug}/alerts/{alertId}/notifications — list notifications for instance (VIEWER+) - POST /alerts/notifications/{id}/retry — manual retry of failed notification (OPERATOR+) Flat path because notification IDs are globally unique (no env routing needed) - scheduleRetry resets attempts to 0 and sets nextAttemptAt = now - Added 11 alerting path matchers to SecurityConfig before outbound-connections block - Fixed context loading failure in 6 pre-existing alerting storage/migration ITs by adding @MockBean(clickHouseSearchIndex/clickHouseLogStore): ExchangeMatchEvaluator and LogPatternEvaluator inject the concrete classes directly (not interface beans), so the full Spring context fails without these mocks in tests that don't use the real CH container - 5 IT tests: list, viewer-can-list, retry, viewer-cannot-retry, unknown-404 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:29:17 +02:00
hsiegeln	77d1718451	feat(alerting): AlertSilenceController CRUD with time-range validation + audit (Task 34) - POST/GET/DELETE /environments/{envSlug}/alerts/silences - 422 when endsAt <= startsAt ("endsAt must be after startsAt") - OPERATOR+ for create/delete, VIEWER+ for list - Audit: ALERT_SILENCE_CREATE/DELETE with AuditCategory.ALERT_SILENCE_CHANGE - 6 IT tests: create, viewer-list, viewer-cannot-create, bad time-range, delete, audit event Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:29:03 +02:00
hsiegeln	841793d7b9	feat(alerting): AlertController in-app inbox with ack/read/bulk-read (Task 33) - GET /environments/{envSlug}/alerts — inbox filtered by userId/groupIds/roleNames via InAppInboxQuery - GET /unread-count — memoized unread count (5s TTL) - GET /{id}, POST /{id}/ack, POST /{id}/read, POST /bulk-read - bulkRead filters instanceIds to env before delegating to AlertReadRepository - VIEWER+ for all endpoints; env isolation enforced by requireInstance - 7 IT tests: list, env isolation, unread-count, ack flow, read, bulk-read, viewer access Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:28:55 +02:00
hsiegeln	c1b34f592b	feat(alerting): AlertRuleController with attribute-key SQL injection validation (Task 32) - POST/GET/PUT/DELETE /environments/{envSlug}/alerts/rules CRUD - POST /{id}/enable, /{id}/disable, /{id}/render-preview, /{id}/test-evaluate - Attribute-key validation: rejects keys not matching ^[a-zA-Z0-9._-]+$ at rule-save time (CRITICAL: ExchangeMatchCondition attribute keys are inlined into ClickHouse SQL) - Webhook validation: verifies outboundConnectionId exists and is allowed in env - Null-safe notification template defaults to "" for NOT NULL DB constraint - Fixed misleading comment in ClickHouseSearchIndex to document validation contract - OPERATOR+ for mutations, VIEWER+ for reads - Audit: ALERT_RULE_CREATE/UPDATE/DELETE/ENABLE/DISABLE with AuditCategory.ALERT_RULE_CHANGE - 11 IT tests covering RBAC, SQL-injection prevention, enable/disable, audit, render-preview Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:28:46 +02:00
hsiegeln	d3dd8882bd	feat(alerting): InAppInboxQuery with 5s unread-count memoization listInbox resolves user groups+roles via RbacService.getEffectiveGroupsForUser / getEffectiveRolesForUser then delegates to AlertInstanceRepository. countUnread memoized per (envId, userId) with 5s TTL via ConcurrentHashMap using a controllable Clock. 6 unit tests covering delegation, cache hit, TTL expiry, and isolation between users/envs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:25:00 +02:00
hsiegeln	6b48bc63bf	feat(alerting): NotificationDispatchJob outbox loop with silence + retry Claim-polling SchedulingConfigurer: claims due notifications, resolves instance/connection/rule, checks active silences, dispatches via WebhookDispatcher, classifies outcomes into DELIVERED/FAILED/retry. Guards null rule/env after deletion. 5 Testcontainers ITs: 200/503/404 outcomes, active silence suppression, deleted connection fast-fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:24:54 +02:00
hsiegeln	466aceb920	feat(alerting): WebhookDispatcher with HMAC + TLS + retry classification Renders URL/headers/body with Mustache, optionally HMAC-signs the body (X-Cameleer-Signature), supports POST/PUT/PATCH, classifies 2xx/4xx/5xx into DELIVERED/FAILED/retry. 8 WireMock-backed IT tests including HTTPS TRUST_ALL against WireMock self-signed cert. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:24:47 +02:00
hsiegeln	6f1feaa4b0	feat(alerting): HmacSigner for webhook signature HmacSHA256 signer returning sha256=<lowercase-hex>. 5 unit tests covering known vector, prefix, hex casing, and different secrets/bodies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:24:39 +02:00
hsiegeln	bf178ba141	fix(alerting): populate AlertInstance.rule_snapshot so history survives rule delete - Add withRuleSnapshot(Map) wither to AlertInstance (same pattern as other withers) - Call snapshotRule(rule) + withRuleSnapshot in both applyResult (single-firing) and applyBatchFiring paths so every persisted instance carries a non-empty JSONB snapshot - Strip null values from the Jackson-serialized map before wrapping in the immutable snapshot so Map.copyOf in the compact ctor does not throw NPE on nullable rule fields - Add ruleSnapshotIsPersistedOnInstanceCreation IT: asserts name/severity/conditionKind appear in the rule_snapshot column after a tick fires an instance - Add historySurvivesRuleDelete IT: fires an instance, deletes the rule, asserts rule_id IS NULL and rule_snapshot still contains the rule name (spec §5 guarantee) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:09:28 +02:00
hsiegeln	15c0a8273c	feat(alerting): AlertEvaluatorJob with claim-polling + circuit breaker - AlertEvaluatorJob implements SchedulingConfigurer; fixed-delay tick from AlertingProperties.effectiveEvaluatorTickIntervalMs (5 s floor) - Claim-polling via AlertRuleRepository.claimDueRules (FOR UPDATE SKIP LOCKED) - Per-kind circuit breaker guards each evaluator; failures recorded, open kinds skipped and rescheduled without evaluation - Single-Firing path delegates to AlertStateTransitions; new FIRING instances enqueue AlertNotification rows per rule.webhooks() - Batch (PER_EXCHANGE) path creates one FIRING AlertInstance per Firing entry - PENDING→FIRING promotion handled in applyResult via state machine - Title/message rendered via MustacheRenderer + NotificationContextBuilder; environment resolved from EnvironmentRepository.findById per tick - AlertEvaluatorJobIT (4 tests): uses named @MockBean replacements for ClickHouseSearchIndex + ClickHouseLogStore; @MockBean AgentRegistryService drives Clear/Firing/resolve cycle without timing sensitivity Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:58:27 +02:00
hsiegeln	657dc2d407	feat(alerting): AlertingProperties + AlertStateTransitions state machine - AlertingProperties @ConfigurationProperties with effective*() accessors and 5000 ms floor clamp on evaluatorTickIntervalMs; warn logged at startup - AlertStateTransitions pure static state machine: Clear/Firing/Batch/Error branches, PENDING→FIRING promotion on forDuration elapsed; Batch delegated to job - AlertInstance wither helpers: withState, withFiredAt, withResolvedAt, withAck, withSilenced, withTitleMessage, withLastNotifiedAt, withContext - AlertingBeanConfig gains @EnableConfigurationProperties(AlertingProperties), alertingInstanceId bean (hostname:pid), alertingClock bean, PerKindCircuitBreaker bean wired from props - 12 unit tests in AlertStateTransitionsTest covering all transitions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:58:12 +02:00
hsiegeln	f8cd3f3ee4	feat(alerting): EXCHANGE_MATCH evaluator with per-exchange + count modes PER_EXCHANGE returns EvalResult.Batch(List<Firing>); last Firing carries _nextCursor (Instant) in its context map for the job to persist as evalState.lastExchangeTs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:40:54 +02:00
hsiegeln	89db8bd1c5	feat(alerting): JVM_METRIC evaluator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:38:48 +02:00
hsiegeln	17d2be5638	feat(alerting): LOG_PATTERN evaluator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:37:33 +02:00
hsiegeln	07d0386bf2	feat(alerting): ROUTE_METRIC evaluator P95_LATENCY_MS maps to avgDurationMs (ExecutionStats has no p95 bucket). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:36:22 +02:00
hsiegeln	983b698266	feat(alerting): DEPLOYMENT_STATE evaluator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:34:47 +02:00
hsiegeln	e84338fc9a	feat(alerting): AGENT_STATE evaluator Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:33:13 +02:00
hsiegeln	55f4cab948	feat(alerting): evaluator scaffolding (context, result, tick cache, circuit breaker) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:32:06 +02:00
hsiegeln	891c7f87e3	feat(alerting): silence matcher for notification-time dispatch SilenceMatcherService.matches() evaluates AND semantics across ruleId, severity, appSlug, routeId, agentId constraints. Null fields are wildcards. Scope-based constraints (appSlug/routeId/agentId) return false when rule is null (deleted rule — scope cannot be verified). 17 unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:27:18 +02:00
hsiegeln	1c74ab8541	feat(alerting): NotificationContextBuilder for template context maps Builds the Mustache context map from AlertRule + AlertInstance + Environment. Always emits env/rule/alert subtrees; conditionally emits kind-specific subtrees (agent, app, route, exchange, log, metric, deployment) based on rule.conditionKind(). Missing instance.context() keys resolve to empty string. alert.link prefixed with uiOrigin when non-null. 11 unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:27:12 +02:00
hsiegeln	92a74e7b8d	feat(alerting): MustacheRenderer with literal fallback on missing vars Sentinel-substitution approach: unresolved {{x.y.z}} tokens are replaced with a unique NUL-delimited sentinel before Mustache compilation, rendered as opaque text, then post-replaced with the original {{x.y.z}} literal. Malformed templates (unclosed {{) are caught and return the raw template. Never throws. 9 unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:27:05 +02:00
hsiegeln	c53f642838	chore(alerting): add jmustache 1.16 Declared in cameleer-server-core pom (canonical location for unit-testable rendering without Spring) and mirrored in cameleer-server-app pom so the app module compiles standalone without a full reactor install. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:26:57 +02:00
hsiegeln	7c0e94a425	feat(alerting): ClickHouse projections for alerting read paths Adds alerting_projections.sql with four projections (alerting_app_status, alerting_route_status on executions; alerting_app_level on logs; alerting_instance_metric on agent_metrics). ClickHouseSchemaInitializer now runs both init.sql and alerting_projections.sql, with ADD PROJECTION and MATERIALIZE treated as non-fatal — executions (ReplacingMergeTree) requires deduplicate_merge_projection_mode=rebuild which is unavailable via JDBC pool. MergeTree projections (logs, agent_metrics) always succeed and are asserted in IT. Column names confirmed from init.sql: logs uses 'application' (not application_id), agent_metrics uses 'collected_at' (not timestamp). All column names match the plan. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:18:58 +02:00
hsiegeln	7b79d3aa64	feat(alerting): countExecutionsForAlerting for exchange-match evaluator Adds AlertMatchSpec record (core) and ClickHouseSearchIndex.countExecutionsForAlerting — no FINAL, no text subqueries. Filters by tenant, env, app, route, status, time window, and optional after-cursor. Attributes (JSON string column) use inlined JSONExtractString key literals since ClickHouse JDBC does not bind ? placeholders inside JSON functions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:18:49 +02:00
hsiegeln	44e91ccdb5	feat(alerting): ClickHouseLogStore.countLogs for log-pattern evaluator Adds countLogs(LogSearchRequest) — no FINAL, no cursor/sort/limit — reusing the same WHERE-clause logic as search() for tenant, env, app, level, q, logger, source, exchangeId, and time-range filters. Also extends ClickHouseTestHelper with executeInitSqlWithProjections() and makes the script runner non-fatal for ADD/MATERIALIZE PROJECTION. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:18:41 +02:00
hsiegeln	59354fae18	feat(alerting): wire all alerting repository beans AlertingBeanConfig now exposes 4 additional @Bean methods: alertInstanceRepository, alertSilenceRepository, alertNotificationRepository, alertReadRepository. AlertReadRepository takes only JdbcTemplate (no JSONB/ObjectMapper needed). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:05:06 +02:00
hsiegeln	f829929b07	feat(alerting): Postgres repositories for silences, notifications, reads PostgresAlertSilenceRepository: save/findById roundtrip, listActive (BETWEEN starts_at AND ends_at), listByEnvironment, delete. JSONB SilenceMatcher via ObjectMapper. PostgresAlertNotificationRepository: save/findById, listForInstance, claimDueNotifications (UPDATE...RETURNING with FOR UPDATE SKIP LOCKED), markDelivered, scheduleRetry (bumps attempts + next_attempt_at), markFailed, deleteSettledBefore (DELIVERED+FAILED rows older than cutoff). JSONB payload. PostgresAlertReadRepository: markRead (ON CONFLICT DO NOTHING idempotent), bulkMarkRead (iterates, handles empty list without error). 16 IT scenarios across 3 classes, all passing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:05:01 +02:00
hsiegeln	45028de1db	feat(alerting): Postgres repository for alert_instances with inbox queries Implements AlertInstanceRepository: save (upsert), findById, findOpenForRule, listForInbox (3-way OR: user/group/role via && array-overlap + ANY), countUnreadForUser (LEFT JOIN alert_reads), ack, resolve, markSilenced, deleteResolvedBefore. Integration test covers all 9 scenarios including inbox fan-out across all three target types. Also adds @JsonIgnoreProperties(ignoreUnknown=true) to SilenceMatcher to suppress Jackson serializing isWildcard() as a round-trip field. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:04:51 +02:00
hsiegeln	930ac20d11	fix(outbound): wire rulesReferencing to AlertRuleRepository (Plan 01 gate) Replaces the Plan 01 stub that returned [] with a real call through AlertRuleRepository.findRuleIdsByOutboundConnectionId. Adds AlertingBeanConfig exposing the AlertRuleRepository bean; widens OutboundBeanConfig constructor to inject it. Delete and narrow-envs guards now correctly block when rules reference a connection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:51:36 +02:00
hsiegeln	f80bc006c1	feat(alerting): Postgres repository for alert_rules Implements AlertRuleRepository with JSONB condition/webhooks/eval_state serialization via ObjectMapper, UPSERT on conflict, JSONB containment query for findRuleIdsByOutboundConnectionId, and FOR UPDATE SKIP LOCKED claim-polling for horizontal scale. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:48:15 +02:00
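
Commit 9f109b20fd in the log above describes a TtlCache that wraps a Supplier<Long> and memoises the last read for a configurable Duration against a Supplier<Instant> clock. The following is a minimal sketch of that idea, not the project's actual class: the name TtlCacheSketch, the synchronized locking, and the exact expiry comparison are assumptions made for illustration. The main method replays the 29 s / 31 s scenario mentioned in the commit message with a fake clock.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical TTL-memoising wrapper; the real TtlCache may differ. */
final class TtlCacheSketch implements Supplier<Long> {
    private final Supplier<Long> delegate;   // e.g. a DB-backed gauge query
    private final Duration ttl;              // how long a cached value stays fresh
    private final Supplier<Instant> clock;   // injectable so tests can fake time
    private Long cached;
    private Instant loadedAt;

    TtlCacheSketch(Supplier<Long> delegate, Duration ttl, Supplier<Instant> clock) {
        this.delegate = delegate;
        this.ttl = ttl;
        this.clock = clock;
    }

    @Override
    public synchronized Long get() {
        Instant now = clock.get();
        // Reload on the first read, or once the TTL has fully elapsed (assumed boundary rule).
        if (loadedAt == null || !now.isBefore(loadedAt.plus(ttl))) {
            cached = delegate.get();
            loadedAt = now;
        }
        return cached;
    }

    public static void main(String[] args) {
        AtomicLong queries = new AtomicLong();                  // counts delegate invocations
        Instant[] now = { Instant.parse("2026-04-20T12:00:00Z") };
        TtlCacheSketch gauge =
                new TtlCacheSketch(queries::incrementAndGet, Duration.ofSeconds(30), () -> now[0]);

        gauge.get();                      // first scrape: queries the delegate
        now[0] = now[0].plusSeconds(29);
        gauge.get();                      // 29 s elapsed: served from cache
        now[0] = now[0].plusSeconds(2);
        gauge.get();                      // 31 s elapsed: queries the delegate again
        System.out.println(queries.get()); // prints 2
    }
}
```

Passing the clock as a Supplier<Instant> is what lets a test advance time without sleeping, which matches the motivation given in the commit message for the test-friendly constructor.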

200 Commits
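
Commit 6f1feaa4b0 in the log above describes an HmacSHA256 signer that returns sha256=<lowercase-hex>, used for the X-Cameleer-Signature webhook header. A minimal sketch of such a signer follows. The class and method names (HmacSignerSketch, sign), the UTF-8 encoding of secret and body, and the choice to wrap checked exceptions are assumptions for illustration; the actual HmacSigner may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical webhook body signer; not the project's actual HmacSigner. */
final class HmacSignerSketch {
    private static final String ALGORITHM = "HmacSHA256";

    /** Returns "sha256=" followed by the lowercase hex HMAC-SHA256 of body under secret. */
    static String sign(String secret, String body) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), ALGORITHM));
            byte[] digest = mac.doFinal(body.getBytes(StandardCharsets.UTF_8));

            // Character.forDigit yields lowercase hex digits for radix 16.
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(Character.forDigit((b >> 4) & 0xF, 16));
                hex.append(Character.forDigit(b & 0xF, 16));
            }
            return "sha256=" + hex;
        } catch (GeneralSecurityException e) {
            // HmacSHA256 is a standard JCA algorithm, so this is not expected in practice.
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Prints a 71-character value: the "sha256=" prefix plus 64 hex digits.
        System.out.println(sign("key", "The quick brown fox jumps over the lazy dog"));
    }
}
```

A receiver would recompute the same value over the raw request body and compare it with the header, ideally using a constant-time comparison such as MessageDigest.isEqual.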