cameleer-server

Author	SHA1	Message	Date
hsiegeln	dfacedb0ca	fix(test): rewrite DetailControllerIT seed to ExecutionChunk + REST-driven lookup POST /api/v1/data/executions is owned by ChunkIngestionController (the legacy ExecutionController path is @ConditionalOnMissingBean(ChunkAccumulator) and never binds). The old RouteExecution-shaped seed was silently parsed as an empty ExecutionChunk and nothing landed in ClickHouse. Rewrote the seed as a single final ExecutionChunk with chunkSeq=0 / final=true and a flat processors[] carrying seq + parentSeq to preserve the 3-level tree (DetailService.buildTree reconstructs the nested shape for the API response). Execution-id lookup now goes through the search REST API filtered by correlationId, per the no-raw-SQL preference. Template for the other Cluster B ITs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:04:00 +02:00
hsiegeln	9bda4d8f8d	fix(test): de-couple Flyway/ConfigEnvIsolation ITs from cross-test state Both Testcontainers Postgres ITs were asserting exact counts on rows that other classes in the shared context had already written. - FlywayMigrationIT: treat the non-seed tables (users, server_config, audit_log, application_config, app_settings) as "must exist; COUNT must return a non-negative integer" rather than expecting exactly 0. The seeded tables (roles=4, groups=1) still assert exact V1 baseline. - ConfigEnvIsolationIT.findByEnvironment_excludesOtherEnvs: use unique prefixed app slugs and switch containsExactlyInAnyOrder to contains + doesNotContain, so the cross-env filter is still verified without coupling to other tests' inserts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 21:43:29 +02:00
hsiegeln	10e2b69974	fix(test): route SecurityFilterIT protected-endpoint check to env-scoped URL The agent list moved from /api/v1/agents to /api/v1/environments/{envSlug}/agents; the 'valid JWT returns 200' test was hitting the retired flat path and getting 404. The other 'without JWT' cases still pass because Spring Security rejects them at the filter chain before URL routing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 21:41:35 +02:00
hsiegeln	e955302fe8	fix(test): add required environmentId to agent register bodies Registration now requires environmentId in the body (400 if missing), so the stale register bodies were failing every downstream test that relied on a registered agent. Affected helpers in: - BootstrapTokenIT (static constant + inline body) - JwtRefreshIT (registerAndGetTokens) - RegistrationSecurityIT (registerAgent) - SseSigningIT (registerAgentWithAuth) - AgentSseControllerIT (registerAgent helper) Also in JwtRefreshIT / RegistrationSecurityIT, the "access token can reach a protected endpoint" tests were hitting env-scoped read endpoints that now require VIEWER+. Redirected both to the AGENT-role heartbeat endpoint — it proves the token is accepted by the security filter without being coupled to RBAC rules for reader endpoints. JwtRefreshIT.refreshWithValidToken also dropped an isNotEqualTo assertion that assumed sub-second iat uniqueness — HMAC JWTs with second-precision claims are byte-identical when minted for the same subject within the same second, so the old assertion was flaky by design. SseSigningIT / AgentSseControllerIT still have SSE-connection timing failures unrelated to registration — parked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 21:24:54 +02:00
hsiegeln	97a6b2e010	fix(test): align AgentCommandControllerIT with current spec Two drifts corrected: - registerAgent helper missing required environmentId (spec: 400 if absent). - sendGroupCommand is now synchronous request-reply: returns 200 with an aggregated CommandGroupResponse {success,total,responded,responses,timedOut} — no longer 202 with {targetCount,commandIds}. Updated assertions and name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 21:18:14 +02:00
hsiegeln	7436a37b99	fix(test): align AgentRegistrationControllerIT with current spec Four drifts against the current server contract, all now corrected: - Registration body missing required environmentId (spec: 400 if absent). - Agent list moved to env-scoped /api/v1/environments/{envSlug}/agents; flat /api/v1/agents no longer exists. - heartbeatUnknownAgent now auto-heals via JWT env claim (`fb54f9cb`); the 404 branch is only reachable without a JWT, which the security filter rejects before the controller sees the request. - sseEndpoint is an absolute URL (ServletUriComponentsBuilder.fromCurrentContextPath), so assert endsWith the path rather than equals-to-relative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 21:15:16 +02:00
hsiegeln	fb54f9cbd2	fix(agent): revive DEAD agents on heartbeat (not just STALE) Reproduction: pause a container long enough to cross both the stale and dead thresholds, then unpause. The agent resumes sending heartbeats but the server keeps it shown as DEAD. Only a full container restart (which re-registers) fixes it. Root cause: AgentRegistryService.heartbeat() only revived STALE → LIVE. A DEAD agent's heartbeat updated lastHeartbeat but left state unchanged. checkLifecycle() never downgrades DEAD either (no-op in that branch), so the agent was permanently stuck in DEAD until a register() call. Fix: extend the revival branch to also cover DEAD. Same process; a heartbeat is proof of liveness regardless of the previous state. Also: AgentLifecycleMonitor.mapTransitionEvent() now emits RECOVERED for DEAD → LIVE, mirroring its behavior for STALE → LIVE, so the lifecycle timeline captures the transition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:55:47 +02:00
hsiegeln	90083f886a	refactor(schema): collapse V1..V18 into single V1__init.sql baseline The project is still greenfield (no production deployment) so this is the last safe moment to flatten the migration archaeology before the checksum history starts mattering for real. Schema changes - 18 migration files (531 lines) → one V1__init.sql (~380 lines) declaring the final end-state: RBAC + claim mappings + runtime management + config + audit + outbound + alerting, plus seed data (system roles, Admins group, default environment). - Drops the data-repair statements from V14 (firemode backfill), V16 (subjectFingerprint migration), V17 (ACKNOWLEDGED → FIRING coercion) — they were no-ops on any DB that starts at V1. - Declares condition_kind_enum with AGENT_LIFECYCLE from the start (was added retroactively by V18). - Declares alert_state_enum with three values only (was five, then swapped in V17) and alert_instances with read_at / deleted_at columns from day one (was added by V17). - alert_reads table never created (V12 created, V17 dropped). - alert_instances_open_rule_uq built with the V17 predicate from the start. Test changes - Replace V12MigrationIT / V17MigrationIT / V18MigrationIT with one SchemaBootstrapIT that asserts the combined invariants: tables present, alert_reads absent, enum value sets, alert_instances has read_at + deleted_at, open_rule_uq exists and is unique, env-delete cascade fires. Verification - pg_dump of the new V1 matches the pg_dump of V1..V18 applied in sequence (bytewise modulo column order and Postgres-auto FK names). - Full alerting IT suite (53 tests across 6 classes) green against the new schema. - The 47 pre-existing test failures on main (AgentRegistrationIT, SearchControllerIT, ClickHouseStatsStoreIT, …) are unrelated and fail identically without this change. Developer impact - Existing local DBs will fail checksum validation on boot. Wipe: docker compose down -v (or drop the tenant_default schema). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:52:22 +02:00
hsiegeln	b7d201d743	fix(alerts): add AGENT_LIFECYCLE to condition_kind_enum + readable error toasts Backend - V18 migration adds AGENT_LIFECYCLE to condition_kind_enum. Java ConditionKind enum shipped with this value but no Postgres migration extended the type, so any AGENT_LIFECYCLE rule insert failed with "invalid input value for enum condition_kind_enum". - ALTER TYPE ... ADD VALUE lives alone in its migration per Postgres constraint that the new value cannot be referenced in the same tx. - V18MigrationIT asserts the enum now contains all 7 kinds. Frontend - Add describeApiError(e) helper to unwrap openapi-fetch error bodies (Spring error JSON) into readable strings. String(e) on a plain object rendered "[object Object]" in toasts — the actual failure reason was hidden from the user. - Replace String(e) in all 13 toast descriptions across the alerting and outbound-connection mutation paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:23:14 +02:00
hsiegeln	88804aca2c	fix(alerts): final sweep — drop ACKNOWLEDGED from AlertStateChip + CMD-K; harden V17 IT UI: AlertStateChip.LABELS and .COLORS no longer include ACKNOWLEDGED (dropped in V17). AlertStateChip.test.tsx test-cases trimmed to the three remaining states. LayoutShell CMD-K now searches FIRING alerts with acked=false (was state=[FIRING,ACKNOWLEDGED]). Test: V17MigrationIT.open_rule_index_predicate_is_reworked replaced with a structural-only assertion (index exists, indisunique). The pg_get_indexdef pretty-printer varies across Postgres versions, so predicate semantics are verified behaviorally in PostgresAlertInstanceRepositoryIT (findOpenForRule_* + save_rejectsSecondOpenInstanceForSameRuleAndExchange). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 19:29:58 +02:00
hsiegeln	69fe80353c	test(alerts): close repo IT gaps — filterInEnvLive other-env + bulkMarkRead soft-delete Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:55:12 +02:00
hsiegeln	99b739d946	fix(alerts): backend hardening + complete ACKNOWLEDGED migration - new AlertInstanceRepository.filterInEnvLive(ids, env): single-query bulk ID validation - AlertController.inEnvLiveIds now one SQL round-trip instead of N - bulkMarkRead SQL: defense-in-depth AND deleted_at IS NULL - bulkAck SQL already had deleted_at IS NULL guard — no change needed - PostgresAlertInstanceRepositoryIT: add filterInEnvLive_excludes_other_env_and_soft_deleted - V12MigrationIT: remove alert_reads assertion (table dropped by V17) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:48:57 +02:00
hsiegeln	c70fa130ab	test(alerts): cover global read — one user marks read, others see readAt Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:20:21 +02:00
hsiegeln	efd8396045	feat(alerts): controller — DELETE/bulk-delete/bulk-ack/restore + acked/read filters + readAt on DTO - GET /alerts gains tri-state acked + read query params - new endpoints: DELETE /{id} (soft-delete), POST /bulk-delete, POST /bulk-ack, POST /{id}/restore - requireLiveInstance 404s on soft-deleted rows; restore() reads the row regardless - BulkReadRequest → BulkIdsRequest (shared body for bulk read/ack/delete) - AlertDto gains readAt; deletedAt stays off the wire - InAppInboxQuery.listInbox threads acked/read through to the repo (7-arg, no more null placeholders) - SecurityConfig: new matchers for bulk-ack (VIEWER+), DELETE/bulk-delete/restore (OPERATOR+) - AlertControllerIT: persistence assertions on /read + /bulk-read; full coverage for new endpoints - InAppInboxQueryTest: updated to 7-arg listInbox signature Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:15:16 +02:00
hsiegeln	dd2a5536ab	test(alerts): rename ack test to reflect state is unchanged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:04:39 +02:00
hsiegeln	e1321a4002	chore(alerts): delete orphan PostgresAlertReadRepositoryIT The class under test was removed in da281933; the IT became a @Disabled placeholder. Deleting per no-backwards-compat policy. Read mutation coverage lives in PostgresAlertInstanceRepositoryIT going forward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:00:00 +02:00
hsiegeln	da2819332c	feat(alerts): Postgres repo — read_at/deleted_at columns, filter params, new mutations - save/rowMapper read+write read_at and deleted_at - listForInbox: tri-state acked/read filters; always excludes deleted - countUnreadBySeverity: rewire without alert_reads join, preserve zero-fill - new: markRead/bulkMarkRead/softDelete/bulkSoftDelete/bulkAck/restore - delete PostgresAlertReadRepository + its bean - restore zero-fill Javadoc on interface - mechanical compile-fixes in AlertController, InAppInboxQuery, AlertControllerIT, InAppInboxQueryTest; Task 6 owns the rewrite - PostgresAlertReadRepositoryIT stubbed @Disabled; Task 7 owns migration Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:56:06 +02:00
hsiegeln	6e8d890442	fix(alerts): remove dead ACKNOWLEDGED enum SQL + TODO comments Remove SET state='ACKNOWLEDGED' from ack() and the ACKNOWLEDGED predicate from findOpenForRule — both would error after V17. The final ack() + open-rule semantics (idempotent guards, deleted_at) are owned by Task 5; this is just the minimum to stop runtime SQL errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:36:02 +02:00
hsiegeln	5b1b3f215a	test(alerts): state machine — ack is orthogonal, does not transition FIRING - AlertStateTransitionsTest: add null,null for readAt/deletedAt in openInstance helper; replace firingWhenAcknowledgedIsNoOp with firing_with_ack_stays_firing_on_next_firing_tick; convert ackedInstanceClearsToResolved to use FIRING+withAck; update section comment. - PostgresAlertInstanceRepository: stub null,null for readAt/deletedAt in rowMapper to unblock compilation (Task 4 will read the actual DB columns). - All other alerting test files: add null,null for readAt/deletedAt to AlertInstance ctor calls so the test source tree compiles; stub ACKNOWLEDGED JSON/state assertions with FIRING + TODO Task 4 comments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:28:31 +02:00
hsiegeln	82e82350f9	refactor(alerts): drop ACKNOWLEDGED from AlertState, add readAt/deletedAt to AlertInstance - AlertState: remove ACKNOWLEDGED case (V17 migration already dropped it from DB enum) - AlertInstance: insert readAt + deletedAt Instant fields after lastNotifiedAt; add withReadAt/withDeletedAt withers; update all existing withers to pass both fields positionally - AlertStateTransitions: add null,null for readAt/deletedAt in newInstance ctor call; collapse FIRING,ACKNOWLEDGED switch arm to just FIRING - AlertScopeTest: update AlertState.values() assertion to 3 values; fix stale ConditionKind.hasSize(6) to 7 (JVM_METRIC was added earlier) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:12:37 +02:00
hsiegeln	e95c21d0cb	feat(alerts): V17 migration — drop ACKNOWLEDGED, add read_at + deleted_at Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:04:09 +02:00
hsiegeln	414f7204bf	feat(alerting): AGENT_LIFECYCLE condition kind with per-subject fire mode Allows alert rules to fire on agent-lifecycle events — REGISTERED, RE_REGISTERED, DEREGISTERED, WENT_STALE, WENT_DEAD, RECOVERED — rather than only on current state. Each matching `(agent, eventType, timestamp)` becomes its own ackable AlertInstance, so outages on distinct agents are independently routable. Core: - New `ConditionKind.AGENT_LIFECYCLE` + `AgentLifecycleCondition` record (scope, eventTypes, withinSeconds). Compact ctor rejects empty eventTypes and withinSeconds<1. - Strict allowlist enum `AgentLifecycleEventType` (six entries matching the server-emitted types in `AgentRegistrationController` and `AgentLifecycleMonitor`). Custom agent-emitted event types tracked in backlog issue #145. - `AgentEventRepository.findInWindow(env, appSlug, agentId, eventTypes, from, to, limit)` — new read path ordered `(timestamp ASC, insert_id ASC)` used by the evaluator. Implemented on `ClickHouseAgentEventRepository` with tenant + env filter mandatory. App: - `AgentLifecycleEvaluator` queries events in the last `withinSeconds` window and returns `EvalResult.Batch` with one `Firing` per row. Every Firing carries a canonical `_subjectFingerprint` of `"<agentId>:<eventType>:<tsMillis>"` in context plus `agent` / `event` subtrees for Mustache templating. - `NotificationContextBuilder` gains an `AGENT_LIFECYCLE` branch that exposes `{{agent.id}}`, `{{agent.app}}`, `{{event.type}}`, `{{event.timestamp}}`, `{{event.detail}}`. - Validation is delegated to the record compact ctor + enum at Jackson deserialization time — matches the existing policy of keeping controller validators focused on env-scoped / SQL-injection concerns. Schema: - V16 migration generalises the V15 per-exchange discriminator on `alert_instances_open_rule_uq` to prefer `_subjectFingerprint` with a fallback to the legacy `exchange.id` expression. Scalar kinds still resolve to `''` and keep one-open-per-rule. Duplicate-key path in `PostgresAlertInstanceRepository.save` is unchanged — the index is the deduper. UI: - New `AgentLifecycleForm.tsx` wizard form with multi-select chips for the six allowed event types + `withinSeconds` input. Wired into `ConditionStep`, `form-state` (validation + defaults: WENT_DEAD, 300 s), and `enums.ts` options. Tests in `enums.test.ts` pin the new option array. - `alert-variables.ts` registers `{{agent.app}}`, `{{event.type}}`, `{{event.timestamp}}`, `{{event.detail}}` leaves for the new kind, and extends `agent.id`'s availability list to include `AGENT_LIFECYCLE`. Tests (all passing): - 5 new JSON-roundtrip cases on `AlertConditionJsonTest` (positive + empty/zero/unknown-type rejection). - 5 new evaluator unit tests on `AgentLifecycleEvaluatorTest` (empty window, multi-agent fingerprint shape, scope forwarding, missing env). - `NotificationContextBuilderTest` switch now covers the new kind. - 119 alerting unit tests + 71 UI tests green. Docs: `.claude/rules/{core,app,ui}` and CLAUDE.md migration list updated.	2026-04-21 14:52:08 +02:00
hsiegeln	f037d8c922	feat(alerting): server-side state+severity filters, ButtonGroup filter UI Backend: `GET /environments/{envSlug}/alerts` now accepts optional multi-value `state=…` and `severity=…` query params. Filters are pushed down to PostgresAlertInstanceRepository, which appends `AND state::text = ANY(?)` / `AND severity::text = ANY(?)` to the inbox query (null/empty = no filter). `AlertInstanceRepository.listForInbox` gained a 7-arg overload; the old 5-arg form is preserved as a default delegate so existing callers (evaluator, AlertingFullLifecycleIT, PostgresAlertInstanceRepositoryIT) compile unchanged. `InAppInboxQuery.listInbox` also has a new filtered overload. UI: InboxPage severity filter migrated from `SegmentedTabs` (single-select, no color cues) to `ButtonGroup` (multi-select with severity-coloured dots), matching the topnavbar status-filter pattern. `useAlerts` forwards the filters as query params and cache-keys on the filter tuple so each combo is independently cached. Unit + hook tests updated to the new contract (5 UI tests + 8 Java unit tests passing). OpenAPI types regenerated from the fresh local backend.	2026-04-21 12:47:31 +02:00
hsiegeln	037a27d405	fix(alerting): allow multiple open alert_instances per rule for PER_EXCHANGE V13 added a partial unique index on alert_instances(rule_id) WHERE state IN (PENDING,FIRING,ACKNOWLEDGED). Correct for scalar condition kinds (ROUTE_METRIC / AGENT_STATE / DEPLOYMENT_STATE / LOG_PATTERN / JVM_METRIC / EXCHANGE_MATCH in COUNT_IN_WINDOW) but wrong for EXCHANGE_MATCH / PER_EXCHANGE, which by design emits one alert_instance per matching exchange. Under V13 every PER_EXCHANGE tick with >1 match logged "Skipped duplicate open alert_instance for rule …" at evaluator cadence and silently lost alert fidelity — only the first matching exchange per tick got an AlertInstance + webhook dispatch. V15 drops the rule_id-only constraint and recreates it with a discriminator on context->'exchange'->>'id'. Scalar kinds emit Map.of() as context, so their expression resolves to '' — "one open per rule" preserved. ExchangeMatchEvaluator.evaluatePerExchange always populates exchange.id, so per-exchange instances coexist cleanly. Two new PostgresAlertInstanceRepositoryIT tests: - multiple open instances for same rule + distinct exchanges all land - second open for identical (rule, exchange) still dedups via the DuplicateKeyException fallback in save() — defense-in-depth kept Also fixes pre-existing PostgresAlertReadRepositoryIT brokenness: its setup() inserted 3 open instances sharing one rule_id, which V13 blocked on arrival. Migrate to one rule_id per instance (pattern already used across other storage ITs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 22:26:19 +02:00
hsiegeln	efa8390108	fix(alerting): reject null fireMode on ExchangeMatchCondition + repair in-flight rows The rule editor wizard reset the condition payload on kind-change without seeding a fireMode default; the ExchangeMatchCondition ctor allowed null to pass through; AlertEvaluatorJob then NPE-looped every tick on a saved rule. - core: compact ctor now rejects null fireMode (Jackson-deser path only — all production callers already pass a value). - V14: repair existing EXCHANGE_MATCH rows with fireMode=null to PER_EXCHANGE + perExchangeLingerSeconds=300 (default matches the wizard). - ui: ConditionStep.onKindChange seeds EXCHANGE_MATCH defaults so the Select's displayed fallback ("Per exchange") is actually in form state. - ui: validateStep('condition', ...) now enforces fireMode presence + the mode-specific fields before the user reaches Review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:05:55 +02:00
hsiegeln	ae6473635d	fix(auth): OidcAuthController + UserAdminController upsert unprefixed Follow-up to the UiAuthController fix: every write path that puts a row into users/user_roles/user_groups must use the bare DB key, because the env-scoped controllers (Alert, AlertRule, AlertSilence, Outbound) strip "user:" before using the name as an FK. If the write path stores prefixed, first-time alerting/outbound writes fail with alert_rules_created_by_fkey violation. UiAuthController shipped the model in the prior commit (bare userId for all DB/RBAC calls, "user:"-namespaced subject for JWT signing). Bringing the other two write paths in line: - OidcAuthController.callback: userId = "oidc:" + oidcUser.subject() // DB key, no "user:" subject = "user:" + userId // JWT subject (namespaced) All userRepository / rbacService / applyClaimMappings calls use userId. Tokens still carry the namespaced subject so JwtAuthenticationFilter can distinguish user vs agent tokens. - UserAdminController.createUser: userId = request.username() (bare). resetPassword: dropped the "user:"-strip fallback that was only needed because create used to prefix — now dead. No migration. Greenfield alpha product — any pre-existing prefixed rows in a dev DB will become orphans on next login (login upserts the unprefixed row, old prefixed row is harmless but unused). Operators doing a clean re-index can wipe the DB. Read-path controllers still strip — harmless for bare DB rows, and OIDC humans (JWT sub "user:oidc:<s>") still resolve correctly to the new DB key "oidc:<s>" after stripping. Verified: 45/45 alerting + outbound ITs pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 18:44:17 +02:00
hsiegeln	1ea0258393	fix(auth): upsert UI login user_id unprefixed (drop docker seeder workaround) Root cause of the mismatch that prompted the one-shot cameleer-seed docker service: UiAuthController stored users.user_id as the JWT subject "user:admin" (JWT sub format). Every env-scoped controller (Alert, AlertSilence, AlertRule, OutboundConnectionAdmin) already strips the "user:" prefix on the read path — so the rest of the system expects the DB key to be the bare username. With UiAuth storing prefixed, fresh docker stacks hit "alert_rules_created_by_fkey violation" on the first rule create. Fix: inside login(), compute `userId = request.username()` and use it everywhere the DB/RBAC layer is touched (isLocked, getPasswordHash, record/clearFailedLogins, upsert, assignRoleToUser, addUserToGroup, getSystemRoleNames). Keep `subject = "user:" + userId` — we still sign JWTs with the namespaced subject so JwtAuthenticationFilter can distinguish user vs agent tokens. refresh() and me() follow the same rule via a stripSubjectPrefix() helper (JWT subject in, bare DB key out). With the write path aligned, the docker bridge is no longer needed: - Deleted deploy/docker/postgres-init.sql - Deleted cameleer-seed service from docker-compose.yml Scope: UiAuthController only. UserAdminController + OidcAuthController still prefix on upsert — that's the bug class the triage identified as "Option A or B either way OK". Not changing them now because: a) prod admins are provisioned unprefixed through some other path, so those two files aren't the docker-only failure observed; b) stripping them would need a data migration for any existing prod users stored prefixed, which is out of scope for a cleanup phase. Follow-up worth scheduling if we ever wire OIDC or admin- created users into alerting FKs. Verified: 33/33 alerting+outbound controller ITs pass (9 outbound, 10 rules, 9 silences, 5 alert inbox). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 18:26:03 +02:00
hsiegeln	09b49f096c	feat(alerting): per-severity breakdown on unread-count DTO Spec §13 calls for the notification bell to colour-code by highest unread severity (CRITICAL → error, WARNING → amber, INFO → muted). The old { count } DTO forced the UI to pick one static colour, so NotificationBell shipped with a TODO. Grow the contract instead: UnreadCountResponse = { total, bySeverity: { CRITICAL, WARNING, INFO } } Guarantees: - every severity is always present with a >=0 value (no undefined keys on the wire), so the UI can branch without defaults. - total = sum of bySeverity values — kept explicit on the wire for cheap top-line display, not recomputed client-side. Backend - AlertInstanceRepository: replaces countUnreadForUser(long) with countUnreadBySeverityForUser returning Map<AlertSeverity, Long>. One SQL round-trip per (env, user) — GROUP BY ai.severity over the same NOT EXISTS(alert_reads) filter. - UnreadCountResponse.from(Map) normalises and defensively copies; missing severities default to 0. - InAppInboxQuery.countUnread now returns the DTO, caches the full response (still 5s TTL) so severity breakdown gets the same hit-rate as the total did before. - AlertController just hands the DTO back. Breaking change — no backwards-compat shim: the `count` field is gone. UI and tests updated in the same commit; there are no other API consumers in the tree. Frontend - Regenerated openapi.json + schema.d.ts against a fresh build of the new backend. - NotificationBell branches badge colour on the highest unread severity (CRITICAL > WARNING > INFO) via new CSS variants. - Tests cover all four paths: zero, critical-present, warning-only, info-only. Tests: 7 unit tests + 12 ITs (incl. new grouping + empty-map) + 49 vitest (was 46; +3 severity-branch assertions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 18:15:56 +02:00
hsiegeln	ec460faf02	Merge pull request 'feat(alerting): Plan 03 — UI + backfills (SSRF guard, metrics caching, docker stack)' (#144) from feat/alerting-03-ui into main Reviewed-on: #144	2026-04-20 16:27:49 +02:00
hsiegeln	5edf7eb23a	fix(alerting): @Autowired on AlertingMetrics production constructor Task 29's refactor added a package-private test-friendly constructor alongside the public production one. Without @Autowired Spring cannot pick which constructor to use for the @Component, and falls back to searching for a no-arg default — crashing startup with 'No default constructor found'. Detected when launching the server via the new docker-compose stack; unit tests still pass because they invoke the package-private test constructor directly.	2026-04-20 16:02:48 +02:00
hsiegeln	9f109b20fd	perf(alerting): 30s TTL cache on AlertingMetrics gauge suppliers Prometheus scrapes can fire every few seconds. The open-alerts / open-rules gauges query Postgres on each read — caching the values for 30s amortises that to one query per half-minute. Addresses final-review NIT from Plan 02. - Introduces a package-private TtlCache that wraps a Supplier<Long> and memoises the last read for a configurable Duration against a Supplier<Instant> clock. - Wraps each gauge supplier (alerting_rules_total{enabled|disabled}, alerting_instances_total{state}) in its own TtlCache. - Adds a test-friendly constructor (package-private) taking explicit Duration + Supplier<Instant> so AlertingMetricsCachingTest can advance a fake clock without waiting wall-clock time. - Adds AlertingMetricsCachingTest covering: * supplier invoked once per TTL across repeated scrapes * 29 s elapsed → still cached; 31 s elapsed → re-queried * gauge value reflects the cached result even after delegate mutates Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:22:54 +02:00
hsiegeln	5ebc729b82	feat(alerting): SSRF guard on outbound connection URL Rejects webhook URLs that resolve to loopback, link-local, or RFC-1918 private ranges (IPv4 + IPv6 ULA fc00::/7). Enforced on both create and update in OutboundConnectionServiceImpl before persistence; returns 400 Bad Request with "private or loopback" in the body. Bypass via `cameleer.server.outbound-http.allow-private-targets=true` for dev environments where webhooks legitimately point at local services. Production default is `false`. Test profile sets the flag to `true` in application-test.yml so the existing ITs that post webhooks to WireMock on https://localhost:PORT keep working. A dedicated OutboundConnectionSsrfIT overrides the flag back to false (via @TestPropertySource + @DirtiesContext) to exercise the reject path end-to-end through the admin controller. Plan 01 scope; required before SaaS exposure (spec §17). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:17:44 +02:00
hsiegeln	94e941b026	test(alerting): decentralize @MockBean from AbstractPostgresIT + add SpringContextSmokeIT Follow-up to #141. AbstractPostgresIT centrally declared three @MockBean fields (clickHouseSearchIndex, clickHouseLogStore, agentRegistryService), which meant EVERY IT ran against mocks instead of the real Spring context. That masked the production crashloop — the real bean graph was never exercised by CI. - Remove the three @MockBean fields from AbstractPostgresIT. - Move @MockBean declarations onto only the specific ITs that stub method behavior (verified by grepping for when/verify calls). - ITs that don't stub CH behavior now inject the real beans. - Add SpringContextSmokeIT — @SpringBootTest with no mocks, void contextLoads(). Fails fast on declared-type / autowire-type mismatches like the one #141 fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 10:51:46 +02:00
hsiegeln	c9c93ac565	fix(alerting): declare ClickHouseSearchIndex bean as concrete type Production crashlooped on startup: ExchangeMatchEvaluator autowires the concrete ClickHouseSearchIndex (for countExecutionsForAlerting, which lives only on the concrete class, not the SearchIndex interface), but StorageBeanConfig declared the bean with interface return type SearchIndex. Spring matches autowire candidates by declared bean type, not by runtime instance class, so the concrete-typed autowire failed with: Parameter 0 of constructor in ExchangeMatchEvaluator required a bean of type 'ClickHouseSearchIndex' that could not be found. ClickHouseLogStore's bean is already declared with the concrete return type (line 171), which is why LogPatternEvaluator autowires fine. All alerting ITs passed pre-merge because AbstractPostgresIT replaces the clickHouseSearchIndex bean with @MockBean(name=...) whose declared type IS the concrete ClickHouseSearchIndex. The mock masked the prod bug. Follow-up: remove @MockBean(name="clickHouseSearchIndex") from AbstractPostgresIT so the real bean graph is exercised by alerting ITs (and add a SpringContextSmokeIT that loads the context with no mocks). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 09:11:47 +02:00
hsiegeln	b0ba08e572	test(alerting): rewrite AlertingFullLifecycleIT — REST-driven rule creation, re-notify cadence Rule creation now goes through POST /alerts/rules (exercises saveTargets on the write path). Clock is replaced with @MockBean(name="alertingClock") and re-stubbed in @BeforeEach to survive Mockito's inter-test reset. Six ordered steps: 1. seed log → tick evaluator → assert FIRING instance with non-empty targets (B-1) 2. tick dispatcher → assert DELIVERED notification + lastNotifiedAt stamped (B-2) 3. ack via REST → assert ACKNOWLEDGED state 4. create silence → inject PENDING notification → tick dispatcher → assert silenced (FAILED) 5. delete rule → assert rule_id nullified, rule_snapshot preserved (ON DELETE SET NULL) 6. new rule with reNotifyMinutes=1 → first dispatch → advance clock 61s → evaluator sweep → second dispatch → verify 2 WireMock POSTs (B-2 cadence) Background scheduler races addressed by resetting claimed_by/claimed_until before each manual tick. Simulated clock set AFTER log insert to guarantee log timestamp falls within the evaluator window. Re-notify notifications backdated in Postgres to work around the simulated vs real clock gap in claimDueNotifications. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:26:38 +02:00
hsiegeln	2c82b50ea2	fix(alerting/B-1): AlertStateTransitions.newInstance() propagates rule targets to AlertInstance newInstance() now maps rule.targets() into targetUserIds/targetGroupIds/targetRoleNames so newly created AlertInstance rows carry the correct target arrays. Previously these were always empty List.of(), making the inbox query return nothing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:26:25 +02:00
hsiegeln	7e79ff4d98	fix(alerting/I-2): add unique partial index on alert_instances(rule_id) for open states V13 migration creates alert_instances_open_rule_uq — a partial unique index on (rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED'), preventing duplicate open instances per rule. PostgresAlertInstanceRepository.save() catches DuplicateKeyException and returns the existing open instance instead of failing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:26:07 +02:00
hsiegeln	424894a3e2	fix(alerting/I-1): retry endpoint resets attempts to 0 instead of incrementing AlertNotificationRepository gains resetForRetry(UUID, Instant) which sets attempts=0, status=PENDING, next_attempt_at=now, and clears claim/response fields. AlertNotificationController calls resetForRetry instead of scheduleRetry so a manual retry always starts from a clean slate. AlertNotificationControllerIT adds retryResetsAttemptsToZero to verify attempts==0 and status==PENDING after three prior markFailed calls. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:59 +02:00
hsiegeln	d74079da63	fix(alerting/B-2): implement re-notify cadence sweep and lastNotifiedAt tracking AlertInstanceRepository gains listFiringDueForReNotify(Instant) — only returns instances where last_notified_at IS NOT NULL and cadence has elapsed (IS NULL branch excluded: sweep only re-notifies, initial notify is the dispatcher's job). AlertEvaluatorJob.sweepReNotify() runs at the end of each tick, enqueues fresh notifications for eligible instances and stamps last_notified_at. NotificationDispatchJob stamps last_notified_at on the alert_instance when a notification is DELIVERED, providing the anchor timestamp for cadence checks. PostgresAlertInstanceRepositoryIT adds listFiringDueForReNotify test covering the three-rule eligibility matrix (never-notified, long-ago, recent). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:50 +02:00
hsiegeln	3f036da03d	fix(alerting/B-1): PostgresAlertRuleRepository.save() now persists alert_rule_targets saveTargets() is called unconditionally at the end of save() — it deletes existing targets and re-inserts from the current targets list. findById() and listByEnvironment() already call withTargets() so reads are consistent. PostgresAlertRuleRepositoryIT adds saveTargets_roundtrip and saveTargets_updateReplacesExistingTargets to cover the new write path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:39 +02:00
hsiegeln	8bf45d5456	fix(alerting): use ALTER TABLE MODIFY SETTING to enable projections on executions ReplacingMergeTree Investigated three approaches for CH 24.12: - Inline SETTINGS on ADD PROJECTION: rejected (UNKNOWN_SETTING — not a query-level setting). - ALTER TABLE MODIFY SETTING deduplicate_merge_projection_mode='rebuild': works; persists in table metadata across connection restarts; runs before ADD PROJECTION in the SQL script. - Session-level JDBC URL param: not pursued (MODIFY SETTING is strictly better). alerting_projections.sql now runs MODIFY SETTING before the two executions ADD PROJECTIONs. AlertingProjectionsIT strengthened to assert all four projections (including alerting_app_status and alerting_route_status on executions) exist after schema init. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 07:36:55 +02:00
hsiegeln	f1abca3a45	refactor(alerting): rename P95_LATENCY_MS → AVG_DURATION_MS to match what stats_1m_route exposes The evaluator mapped P95_LATENCY_MS to ExecutionStats.avgDurationMs because stats_1m_route has no p95 column. Exposing the old name implied p95 semantics operators did not get. Rename to AVG_DURATION_MS makes the contract honest. Updated RouteMetric enum (with javadoc), evaluator switch, and admin guide. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 07:36:43 +02:00
hsiegeln	c79a6234af	test(alerting): fix duplicate @MockBean after AbstractPostgresIT centralised mocks + Plan 02 verification report AbstractPostgresIT gained clickHouseSearchIndex and agentRegistryService mocks in Phase 9. All 14 alerting IT subclasses that re-declared the same @MockBean fields now fail with "Duplicate mock definition". Removed the redundant declarations; per-class clickHouseLogStore mock kept where needed. 120 alerting tests now pass (0 failures). Also adds docs/alerting-02-verification.md (Task 43). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 23:27:19 +02:00
hsiegeln	63669bd1d7	docs(alerting): default config + admin guide Adds alerting stanza to application.yml with all AlertingProperties fields backed by env-var overrides. Creates docs/alerting.md covering six condition kinds (with example JSON), template variables, webhook setup (Slack/PagerDuty examples), silence patterns, circuit-breaker and retention troubleshooting, and Prometheus metrics reference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 22:16:38 +02:00
hsiegeln	840a71df94	feat(alerting): observability metrics via micrometer AlertingMetrics @Component wraps MeterRegistry: - Counters: alerting_eval_errors_total{kind}, alerting_circuit_opened_total{kind}, alerting_notifications_total{status} - Timers: alerting_eval_duration_seconds{kind}, alerting_webhook_delivery_duration_seconds - Gauges (DB-backed): alerting_rules_total{state}, alerting_instances_total{state} AlertEvaluatorJob records evalError + evalDuration around each evaluator call. PerKindCircuitBreaker detects open transitions and fires metrics.circuitOpened(kind). AlertingBeanConfig wires AlertingMetrics into the circuit breaker post-construction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 22:16:30 +02:00
hsiegeln	1ab21bc019	feat(alerting): AlertingRetentionJob daily cleanup Nightly @Scheduled(03:00) job deletes RESOLVED alert_instances older than eventRetentionDays and DELIVERED/FAILED alert_notifications older than notificationRetentionDays. Uses injected Clock for testability. IT covers: old-resolved deleted, fresh-resolved kept, FIRING kept regardless of age, PENDING notification never deleted. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 22:16:21 +02:00
hsiegeln	e334dfacd3	feat(alerting): AlertNotificationController + SecurityConfig matchers + fix IT context (Task 35) - GET /environments/{envSlug}/alerts/{alertId}/notifications — list notifications for instance (VIEWER+) - POST /alerts/notifications/{id}/retry — manual retry of failed notification (OPERATOR+) Flat path because notification IDs are globally unique (no env routing needed) - scheduleRetry resets attempts to 0 and sets nextAttemptAt = now - Added 11 alerting path matchers to SecurityConfig before outbound-connections block - Fixed context loading failure in 6 pre-existing alerting storage/migration ITs by adding @MockBean(clickHouseSearchIndex/clickHouseLogStore): ExchangeMatchEvaluator and LogPatternEvaluator inject the concrete classes directly (not interface beans), so the full Spring context fails without these mocks in tests that don't use the real CH container - 5 IT tests: list, viewer-can-list, retry, viewer-cannot-retry, unknown-404 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:29:17 +02:00
hsiegeln	77d1718451	feat(alerting): AlertSilenceController CRUD with time-range validation + audit (Task 34) - POST/GET/DELETE /environments/{envSlug}/alerts/silences - 422 when endsAt <= startsAt ("endsAt must be after startsAt") - OPERATOR+ for create/delete, VIEWER+ for list - Audit: ALERT_SILENCE_CREATE/DELETE with AuditCategory.ALERT_SILENCE_CHANGE - 6 IT tests: create, viewer-list, viewer-cannot-create, bad time-range, delete, audit event Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:29:03 +02:00
hsiegeln	841793d7b9	feat(alerting): AlertController in-app inbox with ack/read/bulk-read (Task 33) - GET /environments/{envSlug}/alerts — inbox filtered by userId/groupIds/roleNames via InAppInboxQuery - GET /unread-count — memoized unread count (5s TTL) - GET /{id}, POST /{id}/ack, POST /{id}/read, POST /bulk-read - bulkRead filters instanceIds to env before delegating to AlertReadRepository - VIEWER+ for all endpoints; env isolation enforced by requireInstance - 7 IT tests: list, env isolation, unread-count, ack flow, read, bulk-read, viewer access Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:28:55 +02:00
hsiegeln	c1b34f592b	feat(alerting): AlertRuleController with attribute-key SQL injection validation (Task 32) - POST/GET/PUT/DELETE /environments/{envSlug}/alerts/rules CRUD - POST /{id}/enable, /{id}/disable, /{id}/render-preview, /{id}/test-evaluate - Attribute-key validation: rejects keys not matching ^[a-zA-Z0-9._-]+$ at rule-save time (CRITICAL: ExchangeMatchCondition attribute keys are inlined into ClickHouse SQL) - Webhook validation: verifies outboundConnectionId exists and is allowed in env - Null-safe notification template defaults to "" for NOT NULL DB constraint - Fixed misleading comment in ClickHouseSearchIndex to document validation contract - OPERATOR+ for mutations, VIEWER+ for reads - Audit: ALERT_RULE_CREATE/UPDATE/DELETE/ENABLE/DISABLE with AuditCategory.ALERT_RULE_CHANGE - 11 IT tests covering RBAC, SQL-injection prevention, enable/disable, audit, render-preview Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:28:46 +02:00

126 Commits