cameleer-server

Author	SHA1	Message	Date
hsiegeln	0b419db9f1	feat(search): add AttributeFilter record with key regex + wildcard pattern translation	2026-04-24 09:51:28 +02:00
hsiegeln	d58c8cde2e	feat(server): REST API over server_metrics for SaaS dashboards Adds /api/v1/admin/server-metrics/{catalog,instances,query} so SaaS control planes can build the server-health dashboard without direct ClickHouse access. One generic /query endpoint covers every panel in the server-self-metrics doc: aggregation (avg/sum/max/min/latest), group-by-tag, filter-by-tag, counter-delta mode with per-server_instance_id rotation handling, and a derived 'mean' statistic for timers. Regex-validated identifiers, parameterised literals, 31-day range cap, 500-series response cap. ADMIN-only via the existing /api/v1/admin/** RBAC gate. Docs updated: all 17 suggested panels now expressed as single-endpoint queries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:41:02 +02:00
hsiegeln	48ce75bf38	feat(server): persist server self-metrics into ClickHouse Snapshot the full Micrometer registry (cameleer business metrics, alerting metrics, and Spring Boot Actuator defaults) every 60s into a new server_metrics table so server health survives restarts without an external Prometheus. Includes a dashboard-builder reference for the SaaS team. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:20:45 +02:00
hsiegeln	c7e5c7fa2d	refactor(diagrams): retire findContentHashForRouteByAgents All production callers migrated to findLatestContentHashForAppRoute in the preceding commits. The agent-scoped lookup adds no coverage beyond the latest-per-(app,env,route) resolver, so the dead API is removed along with its test coverage and unused imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:02:47 +02:00
hsiegeln	d3ce5e861b	feat(diagrams): add findLatestContentHashForAppRoute with app-route cache Agent-scoped lookups miss diagrams from routes whose publishing agents have been redeployed or removed. The new method resolves by (applicationId, environment, routeId) + created_at DESC, independent of the agent registry. An in-memory cache mirrors the existing hashCache pattern, warm-loaded at startup via argMax. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:58:49 +02:00
hsiegeln	21db92ff00	fix(traefik): make TLS cert resolver configurable, omit when unset Previously `TraefikLabelBuilder` hardcoded `tls.certresolver=default` on every router. That assumes a resolver literally named `default` exists in the Traefik static config — true for ACME-backed installs, false for dev/local installs that use a file-based TLS store. Traefik logs "Router uses a nonexistent certificate resolver" for the bogus resolver on every managed app, and any future attempt to define a differently-named real resolver would silently skip these routers. Server-wide setting via `CAMELEER_SERVER_RUNTIME_CERTRESOLVER` (empty by default) flows through `ConfigMerger.GlobalRuntimeDefaults.certResolver` into `ResolvedContainerConfig.certResolver`. When blank the `tls.certresolver` label is omitted entirely; `tls=true` is still emitted so Traefik serves the default TLS-store cert. When set, the label is emitted with the configured resolver name. Not per-app/per-env configurable: there is one Traefik per server instance and one resolver config; app-level override would only let users break their own routers. TDD: TraefikLabelBuilderTest gains 3 cases (resolver set, null, blank). Full unit suite 211/0/0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:18:47 +02:00
hsiegeln	165c9f10e3	feat(deploy): externalRouting toggle to keep apps off Traefik Adds a boolean `externalRouting` flag (default `true`) on ResolvedContainerConfig. When `false`, TraefikLabelBuilder emits only the identity labels (`managed-by`, `cameleer.`) and skips every `traefik.` label, so the container is not published by Traefik. Sibling containers on `cameleer-traefik` / `cameleer-env-{tenant}-{env}` can still reach it via Docker DNS on whatever port the app listens on. TDD: new TraefikLabelBuilderTest covers enabled (default labels present), disabled (zero traefik.* labels), and disabled (identity labels retained) cases. Full module unit suite: 208/0/0. Plumbed through ConfigMerger read, DeploymentExecutor snapshot, UI form state, Resources tab toggle, POST payload, and snapshot-to-form mapping. Rule files updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:03:48 +02:00
hsiegeln	382e1801a7	feat(logs): add instanceIds multi-value filter to /logs endpoint Adds List<String> instanceIds to LogSearchRequest (null-normalized to List.of() in compact ctor) and generates an IN clause in both ClickHouseLogStore.search() and countLogs(), mirroring the existing sources pattern. LogQueryController parses ?instanceIds= as a comma-split list. All existing LogSearchRequest call sites updated. New ClickHouseLogStoreInstanceIdsIT covers: multi-value filter, empty filter (all rows), null filter (all rows), single-value filter, and coexistence with the singular instanceId field. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-23 12:41:09 +02:00
hsiegeln	a141e99a07	feat(deploy): cascade createdBy through Deployment record + service + repo Appends String createdBy to the Deployment record (after createdAt), updates both with-er methods to pass it through, threads the parameter through DeploymentRepository.create, DeploymentService.createDeployment/promote, and PostgresDeploymentRepository (INSERT + SELECT_COLS + mapRow). DeploymentController passes null as placeholder (Task 4 will resolve from SecurityContextHolder). Covers with PostgresDeploymentRepositoryCreatedByIT verifying round-trip via both createDeployment and promote. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-23 12:04:15 +02:00
hsiegeln	15d00f039c	feat(audit): add DEPLOYMENT audit category	2026-04-23 11:51:28 +02:00
hsiegeln	c6aef5ab35	fix(deploy): Checkpoints — preserve STOPPED history, fix filter + placement - Backend: rename deleteTerminalByAppAndEnvironment → deleteFailedByAppAndEnvironment. STOPPED rows were being wiped on every redeploy, so Checkpoints was always empty. Now only FAILED rows are pruned; STOPPED deployments are retained as restorable checkpoints (they still carry deployed_config_snapshot from their RUNNING window). - UI filter: any deployment with a snapshot is a checkpoint (was RUNNING|DEGRADED only, which excluded the main case — the previous blue/green deployment now in STOPPED). - UI placement: Checkpoints disclosure now renders inside IdentitySection, matching the design spec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 10:26:46 +02:00
hsiegeln	5304c8ee01	core(deploy): DeploymentStrategy enum with safe wire conversion Typed enum (BLUE_GREEN, ROLLING) with fromWire/toWire kebab-case translation. fromWire falls back to BLUE_GREEN for unknown or null input so the executor dispatch site never null-checks and no misconfigured container-config can throw at runtime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 09:42:35 +02:00
hsiegeln	f8dccaae2b	fix(deploy): stop previous active deployment before START_REPLICAS (fixes 409) Container names are deterministic: {tenant}-{envSlug}-{appSlug}-{replica}. The prior code did the stop-existing step at SWAP_TRAFFIC, after START_REPLICAS had already tried to create containers with the same names — so a redeploy against a RUNNING app consistently failed with Docker 409 "container name already in use". Move the stop-existing block to run right after CREATE_NETWORK and before START_REPLICAS. SWAP_TRAFFIC becomes a label-only marker (traffic is swapped implicitly by Traefik labels once new replicas are healthy). Also: add `findActiveByAppIdAndEnvironmentIdExcluding` so the SQL excludes the current deployment by id — previously the Java-side `!id.equals(me)` guard failed because the newly-inserted row has status=STARTING (DB default) and ORDER BY created_at DESC LIMIT 1 picked the new row, hiding the actual previous deployment. Trade-off: this is destroy-then-start rather than true blue/green — brief downtime during the swap. Matches the pre-unified-page behavior and is what users reasonably expect. True blue/green would require per-deployment container names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 01:01:00 +02:00
hsiegeln	d33c039a17	fix(deploy): address final review — sensitiveKeys snapshot, dirty scrubbing, transition race, refetch invalidations - Issue 1: add List<String> sensitiveKeys as 4th field to DeploymentConfigSnapshot; populate from agentConfig.getSensitiveKeys() in DeploymentExecutor; handleRestore hydrates from snap.sensitiveKeys directly; Deployment type in apps.ts gains sensitiveKeys field - Issue 2: after createApp succeeds, refetchQueries(['apps', envSlug]) before navigate so the new app is in cache before the router renders the deployed view (eliminates transient Save-disabled flash) - Issue 3: useDeploymentPageState useEffect now uses prevServerStateRef to detect local edits; background refetches only overwrite form when no local changes are present - Issue 5: handleRedeploy invalidates dirty-state + versions queries after createDeployment resolves; handleSave invalidates dirty-state after staged save - Issue 10: DirtyStateCalculator strips volatile agentConfig keys (version, updatedAt, updatedBy, environment, application) before JSON comparison via scrubAgentConfig(); adds versionBumpDoesNotMarkDirty test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 23:29:01 +02:00
hsiegeln	97f25b4c7e	test(deploy): register JavaTimeModule in DirtyStateCalculator unit test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:38:57 +02:00
hsiegeln	6591f2fde3	api(apps): GET /apps/{slug}/dirty-state returns desired-vs-deployed diff Wires DirtyStateCalculator behind an HTTP endpoint on AppController. Adds findLatestSuccessfulByAppAndEnv to PostgresDeploymentRepository, registers DirtyStateCalculator as a Spring bean (with ObjectMapper for JavaTimeModule support), and covers all three scenarios with IT. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:35:35 +02:00
hsiegeln	24464c0772	core(deploy): recurse into nested diffs + unquote scalar values in DirtyStateCalculator - compareJson now recurses when both nodes are ObjectNode, so nested maps (tracedProcessors, routeRecording, routeSamplingRates) produce deep paths like agentConfig.tracedProcessors.proc-1 instead of a blob diff - Extract nodeToString helper: value nodes use asText() (strips JSON quotes), null becomes "(none)", arrays/objects get compact JSON - Apply nodeToString in both diff-emission paths (top-level mismatch + leaf) - Add three new tests: nullAgentConfigInSnapshot, nestedAgentField_reportsDeepPath, stringField_differenceValueIsUnquoted (8 tests total, all pass) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:25:04 +02:00
hsiegeln	e4ccce1e3b	core(deploy): add DirtyStateCalculator + DirtyStateResult Pure-logic dirty-state detection: compares desired JAR + agent config + container config against the DeploymentConfigSnapshot from the last successful deployment. Returns a structured DirtyStateResult with per-field differences. 5 unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:20:49 +02:00
hsiegeln	7f9cfc7f18	core(deploy): add deployedConfigSnapshot field to Deployment model Appends DeploymentConfigSnapshot deployedConfigSnapshot to the Deployment record and adds a matching withDeployedConfigSnapshot wither. All positional call sites (repository mapper, test fixture) updated to pass null; Task 1.4 will wire real persistence and Task 1.5 will populate the field on RUNNING transition. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:31:48 +02:00
hsiegeln	06fa7d832f	core(deploy): type jarVersionId as UUID (match domain convention) All other FKs to app_versions.id (e.g. Deployment.appVersionId) use UUID; DeploymentConfigSnapshot.jarVersionId was incorrectly typed as String. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:29:26 +02:00
hsiegeln	d580b6e90c	core(deploy): add DeploymentConfigSnapshot record Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:26:30 +02:00
hsiegeln	c2eab71a31	env(admin): per-environment color field + V2 migration - V2__add_environment_color.sql adds a CHECK-constrained VARCHAR color column (default 'slate'); existing rows backfill to slate. - Environment record + EnvironmentColor constants (8 preset values) flow through repository, service, and admin API. - UpdateEnvironmentRequest.color nullable: null preserves existing; unknown values → 400. - ITs cover valid / invalid / null-preserves behaviour; existing Environment constructor call-sites updated with the new color arg. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 19:24:30 +02:00
hsiegeln	e483e52eee	alerting(core): drop unused perExchangeLingerSeconds from ExchangeMatchCondition Dead field — was enforced by compact ctor as required for PER_EXCHANGE, but never read anywhere in the codebase. Removal tightens the API surface and is precondition for the Task 3.3 cross-field validator. Pre-prod; no shim / migration.	2026-04-22 17:10:53 +02:00
hsiegeln	0bad014811	core(alerting): AlertRule.withEvalState wither for cursor threading	2026-04-22 16:04:55 +02:00
hsiegeln	b41f34c090	search: SearchRequest.afterExecutionId — composite (startTime, execId) predicate Adds an optional afterExecutionId field to SearchRequest. When combined with a non-null timeFrom, ClickHouseSearchIndex applies a strictly-after tuple predicate (start_time > ts OR (start_time = ts AND execution_id > id)) so same-millisecond exchanges can be consumed exactly once across ticks. When afterExecutionId is null, timeFrom keeps its existing >= semantics — no behaviour change for any current caller. Also adds the SearchRequest.withCursor(ts, id) wither. Threads the field through existing withInstanceIds / withEnvironment withers. All existing positional call-sites (SearchController, ExchangeMatchEvaluator, ClickHouseSearchIndexIT, ClickHouseChunkPipelineIT) pass null for the new slot. Task 1.2 of docs/superpowers/plans/2026-04-22-per-exchange-exactly-once.md. The evaluator-side wiring that actually supplies the cursor is Task 1.5.	2026-04-22 15:49:05 +02:00
hsiegeln	06c6f53bbc	refactor(ingestion): remove unused TaggedExecution record No callers after the legacy PG ingestion path was retired in `0f635576`. core-classes.md updated to drop the leftover note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:33:26 +02:00
hsiegeln	98cbf8f3fc	refactor(search): drop dead SearchIndexer subsystem After the ExecutionController removal (`0f635576`), SearchIndexer subscribed to ExecutionUpdatedEvent but nothing publishes that event. Every SearchIndexerStats metric returned always-zero, and the admin /api/v1/admin/clickhouse/pipeline endpoint that surfaced those stats carried no signal. Backend removed: - core: SearchIndexer, SearchIndexerStats, ExecutionUpdatedEvent - app: IndexerPipelineResponse DTO, /pipeline endpoint on ClickHouseAdminController (field + ctor param) - StorageBeanConfig.searchIndexer bean UI removed: - IndexerPipeline type + useIndexerPipeline hook in api/queries/admin/clickhouse.ts - Indexer Pipeline card in ClickHouseAdminPage.tsx (plus ProgressBar import and pipeline* CSS classes) OpenAPI schema.d.ts + openapi.json regenerated (stale /pipeline path and IndexerPipelineResponse schema removed). SearchIndex interface + ClickHouseSearchIndex impl kept — those are live and used by SearchService + ExchangeMatchEvaluator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:32:49 +02:00
hsiegeln	0f635576a3	refactor(ingestion): drop dead legacy execution-ingestion path ExecutionController was @ConditionalOnMissingBean(ChunkAccumulator.class), and ChunkAccumulator is registered unconditionally — the legacy controller never bound in any profile. Even if it had, IngestionService.ingestExecution called executionStore.upsert(), and the only ExecutionStore impl (ClickHouseExecutionStore) threw UnsupportedOperationException from upsert and upsertProcessors. The entire RouteExecution → upsert path was dead code carrying four transitive dependencies (RouteExecution import, eventPublisher wiring, body-size-limit config, searchIndexer::onExecutionUpdated hook). Removed: - cameleer-server-app/.../controller/ExecutionController.java (whole file) - ExecutionStore.upsert + upsertProcessors (interface methods) - ClickHouseExecutionStore.upsert + upsertProcessors (thrower overrides) - IngestionService.ingestExecution + toExecutionRecord + flattenProcessors + hasAnyTraceData + truncateBody + toJson/toJsonObject helpers - IngestionService constructor now takes (DiagramStore, WriteBuffer<Metrics>); dropped ExecutionStore + Consumer<ExecutionUpdatedEvent> + bodySizeLimit - StorageBeanConfig.ingestionService(...) simplified accordingly Untouched because still in use: - ExecutionRecord / ProcessorRecord records (findById / findProcessors / SearchIndexer / DetailController) - SearchIndexer (its onExecutionUpdated never fires now since no-one publishes ExecutionUpdatedEvent, but SearchIndexerStats is still referenced by ClickHouseAdminController — separate cleanup) - TaggedExecution record has no remaining callers after this change — flagged in core-classes.md as a leftover; separate cleanup. Rule docs updated: - .claude/rules/app-classes.md: retired ExecutionController bullet, fixed stale URL for ChunkIngestionController (it owns /api/v1/data/executions, not /api/v1/ingestion/chunk/executions). 
- .claude/rules/core-classes.md: IngestionService surface + note the dead TaggedExecution. Full IT suite post-removal: 560 tests run, 11 F + 1 E — same 12 failures in the same 3 previously-parked classes (AgentSseControllerIT / SseSigningIT SSE-timing + ClickHouseStatsStoreIT timezone bug). No regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:50:51 +02:00
hsiegeln	fb54f9cbd2	fix(agent): revive DEAD agents on heartbeat (not just STALE) Reproduction: pause a container long enough to cross both the stale and dead thresholds, then unpause. The agent resumes sending heartbeats but the server keeps it shown as DEAD. Only a full container restart (which re-registers) fixes it. Root cause: AgentRegistryService.heartbeat() only revived STALE → LIVE. A DEAD agent's heartbeat updated lastHeartbeat but left state unchanged. checkLifecycle() never downgrades DEAD either (no-op in that branch), so the agent was permanently stuck in DEAD until a register() call. Fix: extend the revival branch to also cover DEAD. Same process; a heartbeat is proof of liveness regardless of the previous state. Also: AgentLifecycleMonitor.mapTransitionEvent() now emits RECOVERED for DEAD → LIVE, mirroring its behavior for STALE → LIVE, so the lifecycle timeline captures the transition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:55:47 +02:00
hsiegeln	99b739d946	fix(alerts): backend hardening + complete ACKNOWLEDGED migration - new AlertInstanceRepository.filterInEnvLive(ids, env): single-query bulk ID validation - AlertController.inEnvLiveIds now one SQL round-trip instead of N - bulkMarkRead SQL: defense-in-depth AND deleted_at IS NULL - bulkAck SQL already had deleted_at IS NULL guard — no change needed - PostgresAlertInstanceRepositoryIT: add filterInEnvLive_excludes_other_env_and_soft_deleted - V12MigrationIT: remove alert_reads assertion (table dropped by V17) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:48:57 +02:00
hsiegeln	da2819332c	feat(alerts): Postgres repo — read_at/deleted_at columns, filter params, new mutations - save/rowMapper read+write read_at and deleted_at - listForInbox: tri-state acked/read filters; always excludes deleted - countUnreadBySeverity: rewire without alert_reads join, preserve zero-fill - new: markRead/bulkMarkRead/softDelete/bulkSoftDelete/bulkAck/restore - delete PostgresAlertReadRepository + its bean - restore zero-fill Javadoc on interface - mechanical compile-fixes in AlertController, InAppInboxQuery, AlertControllerIT, InAppInboxQueryTest; Task 6 owns the rewrite - PostgresAlertReadRepositoryIT stubbed @Disabled; Task 7 owns migration Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:56:06 +02:00
hsiegeln	55b2a00458	feat(alerts): core repo — filter params + markRead/softDelete/bulkAck/restore; drop AlertReadRepository - listForInbox gains tri-state acked/read filter params - countUnreadBySeverityForUser(envId, userId) → countUnreadBySeverity(envId, userId, groupIds, roleNames) - new methods: markRead, bulkMarkRead, softDelete, bulkSoftDelete, bulkAck, restore - delete AlertReadRepository — read is now global on alert_instances.read_at Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:38:10 +02:00
hsiegeln	82e82350f9	refactor(alerts): drop ACKNOWLEDGED from AlertState, add readAt/deletedAt to AlertInstance - AlertState: remove ACKNOWLEDGED case (V17 migration already dropped it from DB enum) - AlertInstance: insert readAt + deletedAt Instant fields after lastNotifiedAt; add withReadAt/withDeletedAt withers; update all existing withers to pass both fields positionally - AlertStateTransitions: add null,null for readAt/deletedAt in newInstance ctor call; collapse FIRING,ACKNOWLEDGED switch arm to just FIRING - AlertScopeTest: update AlertState.values() assertion to 3 values; fix stale ConditionKind.hasSize(6) to 7 (JVM_METRIC was added earlier) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:12:37 +02:00
hsiegeln	414f7204bf	feat(alerting): AGENT_LIFECYCLE condition kind with per-subject fire mode Allows alert rules to fire on agent-lifecycle events — REGISTERED, RE_REGISTERED, DEREGISTERED, WENT_STALE, WENT_DEAD, RECOVERED — rather than only on current state. Each matching `(agent, eventType, timestamp)` becomes its own ackable AlertInstance, so outages on distinct agents are independently routable. Core: - New `ConditionKind.AGENT_LIFECYCLE` + `AgentLifecycleCondition` record (scope, eventTypes, withinSeconds). Compact ctor rejects empty eventTypes and withinSeconds<1. - Strict allowlist enum `AgentLifecycleEventType` (six entries matching the server-emitted types in `AgentRegistrationController` and `AgentLifecycleMonitor`). Custom agent-emitted event types tracked in backlog issue #145. - `AgentEventRepository.findInWindow(env, appSlug, agentId, eventTypes, from, to, limit)` — new read path ordered `(timestamp ASC, insert_id ASC)` used by the evaluator. Implemented on `ClickHouseAgentEventRepository` with tenant + env filter mandatory. App: - `AgentLifecycleEvaluator` queries events in the last `withinSeconds` window and returns `EvalResult.Batch` with one `Firing` per row. Every Firing carries a canonical `_subjectFingerprint` of `"<agentId>:<eventType>:<tsMillis>"` in context plus `agent` / `event` subtrees for Mustache templating. - `NotificationContextBuilder` gains an `AGENT_LIFECYCLE` branch that exposes `{{agent.id}}`, `{{agent.app}}`, `{{event.type}}`, `{{event.timestamp}}`, `{{event.detail}}`. - Validation is delegated to the record compact ctor + enum at Jackson deserialization time — matches the existing policy of keeping controller validators focused on env-scoped / SQL-injection concerns. Schema: - V16 migration generalises the V15 per-exchange discriminator on `alert_instances_open_rule_uq` to prefer `_subjectFingerprint` with a fallback to the legacy `exchange.id` expression. Scalar kinds still resolve to `''` and keep one-open-per-rule. 
Duplicate-key path in `PostgresAlertInstanceRepository.save` is unchanged — the index is the deduper. UI: - New `AgentLifecycleForm.tsx` wizard form with multi-select chips for the six allowed event types + `withinSeconds` input. Wired into `ConditionStep`, `form-state` (validation + defaults: WENT_DEAD, 300 s), and `enums.ts` options. Tests in `enums.test.ts` pin the new option array. - `alert-variables.ts` registers `{{agent.app}}`, `{{event.type}}`, `{{event.timestamp}}`, `{{event.detail}}` leaves for the new kind, and extends `agent.id`'s availability list to include `AGENT_LIFECYCLE`. Tests (all passing): - 5 new JSON-roundtrip cases on `AlertConditionJsonTest` (positive + empty/zero/unknown-type rejection). - 5 new evaluator unit tests on `AgentLifecycleEvaluatorTest` (empty window, multi-agent fingerprint shape, scope forwarding, missing env). - `NotificationContextBuilderTest` switch now covers the new kind. - 119 alerting unit tests + 71 UI tests green. Docs: `.claude/rules/{core,app,ui}` and CLAUDE.md migration list updated.	2026-04-21 14:52:08 +02:00
hsiegeln	f037d8c922	feat(alerting): server-side state+severity filters, ButtonGroup filter UI Backend: `GET /environments/{envSlug}/alerts` now accepts optional multi-value `state=…` and `severity=…` query params. Filters are pushed down to PostgresAlertInstanceRepository, which appends `AND state::text = ANY(?)` / `AND severity::text = ANY(?)` to the inbox query (null/empty = no filter). `AlertInstanceRepository.listForInbox` gained a 7-arg overload; the old 5-arg form is preserved as a default delegate so existing callers (evaluator, AlertingFullLifecycleIT, PostgresAlertInstanceRepositoryIT) compile unchanged. `InAppInboxQuery.listInbox` also has a new filtered overload. UI: InboxPage severity filter migrated from `SegmentedTabs` (single-select, no color cues) to `ButtonGroup` (multi-select with severity-coloured dots), matching the topnavbar status-filter pattern. `useAlerts` forwards the filters as query params and cache-keys on the filter tuple so each combo is independently cached. Unit + hook tests updated to the new contract (5 UI tests + 8 Java unit tests passing). OpenAPI types regenerated from the fresh local backend.	2026-04-21 12:47:31 +02:00
hsiegeln	efa8390108	fix(alerting): reject null fireMode on ExchangeMatchCondition + repair in-flight rows The rule editor wizard reset the condition payload on kind-change without seeding a fireMode default; the ExchangeMatchCondition ctor allowed null to pass through; AlertEvaluatorJob then NPE-looped every tick on a saved rule. - core: compact ctor now rejects null fireMode (Jackson-deser path only — all production callers already pass a value). - V14: repair existing EXCHANGE_MATCH rows with fireMode=null to PER_EXCHANGE + perExchangeLingerSeconds=300 (default matches the wizard). - ui: ConditionStep.onKindChange seeds EXCHANGE_MATCH defaults so the Select's displayed fallback ("Per exchange") is actually in form state. - ui: validateStep('condition', ...) now enforces fireMode presence + the mode-specific fields before the user reaches Review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:05:55 +02:00
hsiegeln	09b49f096c	feat(alerting): per-severity breakdown on unread-count DTO Spec §13 calls for the notification bell to colour-code by highest unread severity (CRITICAL → error, WARNING → amber, INFO → muted). The old { count } DTO forced the UI to pick one static colour, so NotificationBell shipped with a TODO. Grow the contract instead: UnreadCountResponse = { total, bySeverity: { CRITICAL, WARNING, INFO } } Guarantees: - every severity is always present with a >=0 value (no undefined keys on the wire), so the UI can branch without defaults. - total = sum of bySeverity values — kept explicit on the wire for cheap top-line display, not recomputed client-side. Backend - AlertInstanceRepository: replaces countUnreadForUser(long) with countUnreadBySeverityForUser returning Map<AlertSeverity, Long>. One SQL round-trip per (env, user) — GROUP BY ai.severity over the same NOT EXISTS(alert_reads) filter. - UnreadCountResponse.from(Map) normalises and defensively copies; missing severities default to 0. - InAppInboxQuery.countUnread now returns the DTO, caches the full response (still 5s TTL) so severity breakdown gets the same hit-rate as the total did before. - AlertController just hands the DTO back. Breaking change — no backwards-compat shim: the `count` field is gone. UI and tests updated in the same commit; there are no other API consumers in the tree. Frontend - Regenerated openapi.json + schema.d.ts against a fresh build of the new backend. - NotificationBell branches badge colour on the highest unread severity (CRITICAL > WARNING > INFO) via new CSS variants. - Tests cover all four paths: zero, critical-present, warning-only, info-only. Tests: 7 unit tests + 12 ITs (incl. new grouping + empty-map) + 49 vitest (was 46; +3 severity-branch assertions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 18:15:56 +02:00
hsiegeln	424894a3e2	fix(alerting/I-1): retry endpoint resets attempts to 0 instead of incrementing AlertNotificationRepository gains resetForRetry(UUID, Instant) which sets attempts=0, status=PENDING, next_attempt_at=now, and clears claim/response fields. AlertNotificationController calls resetForRetry instead of scheduleRetry so a manual retry always starts from a clean slate. AlertNotificationControllerIT adds retryResetsAttemptsToZero to verify attempts==0 and status==PENDING after three prior markFailed calls. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:59 +02:00
hsiegeln	d74079da63	fix(alerting/B-2): implement re-notify cadence sweep and lastNotifiedAt tracking AlertInstanceRepository gains listFiringDueForReNotify(Instant) — only returns instances where last_notified_at IS NOT NULL and cadence has elapsed (IS NULL branch excluded: sweep only re-notifies, initial notify is the dispatcher's job). AlertEvaluatorJob.sweepReNotify() runs at the end of each tick, enqueues fresh notifications for eligible instances and stamps last_notified_at. NotificationDispatchJob stamps last_notified_at on the alert_instance when a notification is DELIVERED, providing the anchor timestamp for cadence checks. PostgresAlertInstanceRepositoryIT adds listFiringDueForReNotify test covering the three-rule eligibility matrix (never-notified, long-ago, recent). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:50 +02:00
hsiegeln	f1abca3a45	refactor(alerting): rename P95_LATENCY_MS → AVG_DURATION_MS to match what stats_1m_route exposes The evaluator mapped P95_LATENCY_MS to ExecutionStats.avgDurationMs because stats_1m_route has no p95 column. Exposing the old name implied p95 semantics operators did not get. Rename to AVG_DURATION_MS makes the contract honest. Updated RouteMetric enum (with javadoc), evaluator switch, and admin guide. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 07:36:43 +02:00
hsiegeln	bf178ba141	fix(alerting): populate AlertInstance.rule_snapshot so history survives rule delete - Add withRuleSnapshot(Map) wither to AlertInstance (same pattern as other withers) - Call snapshotRule(rule) + withRuleSnapshot in both applyResult (single-firing) and applyBatchFiring paths so every persisted instance carries a non-empty JSONB snapshot - Strip null values from the Jackson-serialized map before wrapping in the immutable snapshot so Map.copyOf in the compact ctor does not throw NPE on nullable rule fields - Add ruleSnapshotIsPersistedOnInstanceCreation IT: asserts name/severity/conditionKind appear in the rule_snapshot column after a tick fires an instance - Add historySurvivesRuleDelete IT: fires an instance, deletes the rule, asserts rule_id IS NULL and rule_snapshot still contains the rule name (spec §5 guarantee) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:09:28 +02:00
hsiegeln	657dc2d407	feat(alerting): AlertingProperties + AlertStateTransitions state machine - AlertingProperties @ConfigurationProperties with effective*() accessors and 5000 ms floor clamp on evaluatorTickIntervalMs; warn logged at startup - AlertStateTransitions pure static state machine: Clear/Firing/Batch/Error branches, PENDING→FIRING promotion on forDuration elapsed; Batch delegated to job - AlertInstance wither helpers: withState, withFiredAt, withResolvedAt, withAck, withSilenced, withTitleMessage, withLastNotifiedAt, withContext - AlertingBeanConfig gains @EnableConfigurationProperties(AlertingProperties), alertingInstanceId bean (hostname:pid), alertingClock bean, PerKindCircuitBreaker bean wired from props - 12 unit tests in AlertStateTransitionsTest covering all transitions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:58:12 +02:00
hsiegeln	7b79d3aa64	feat(alerting): countExecutionsForAlerting for exchange-match evaluator Adds AlertMatchSpec record (core) and ClickHouseSearchIndex.countExecutionsForAlerting — no FINAL, no text subqueries. Filters by tenant, env, app, route, status, time window, and optional after-cursor. Attributes (JSON string column) use inlined JSONExtractString key literals since ClickHouse JDBC does not bind ? placeholders inside JSON functions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:18:49 +02:00
hsiegeln	45028de1db	feat(alerting): Postgres repository for alert_instances with inbox queries Implements AlertInstanceRepository: save (upsert), findById, findOpenForRule, listForInbox (3-way OR: user/group/role via && array-overlap + ANY), countUnreadForUser (LEFT JOIN alert_reads), ack, resolve, markSilenced, deleteResolvedBefore. Integration test covers all 9 scenarios including inbox fan-out across all three target types. Also adds @JsonIgnoreProperties(ignoreUnknown=true) to SilenceMatcher to suppress Jackson serializing isWildcard() as a round-trip field. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:04:51 +02:00
hsiegeln	1ff256dce0	feat(alerting): core repository interfaces	2026-04-19 18:43:36 +02:00
hsiegeln	e7a9042677	feat(alerting): core domain records (rule, instance, silence, notification)	2026-04-19 18:43:03 +02:00
hsiegeln	56a7b6de7d	feat(alerting): sealed AlertCondition hierarchy with Jackson deduction	2026-04-19 18:42:04 +02:00
hsiegeln	530bc32040	feat(alerting): core enums + AlertScope	2026-04-19 18:36:29 +02:00
hsiegeln	5103dc91be	feat(alerting): add ALERT_RULE_CHANGE + ALERT_SILENCE_CHANGE audit categories	2026-04-19 18:34:08 +02:00
hsiegeln	ea4c56e7f6	feat(outbound): admin CRUD REST + RBAC + audit New audit categories: OUTBOUND_CONNECTION_CHANGE, OUTBOUND_HTTP_TRUST_CHANGE. Controller-level @PreAuthorize defaults to ADMIN; GETs relaxed to ADMIN|OPERATOR. SecurityConfig permits OPERATOR GETs on /api/v1/admin/outbound-connections/**. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:43:48 +02:00
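
The unread-count contract described in commit 09b49f096c (every severity always present, missing severities defaulting to 0, total equal to the sum of the per-severity values, defensive copy) can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: only the names UnreadCountResponse, AlertSeverity, total, bySeverity and from(Map) come from the commit message. The record shape, field types and method body are assumptions.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

// Severity levels named in the commit message.
enum AlertSeverity { CRITICAL, WARNING, INFO }

// Hypothetical shape of the DTO; the real class may differ.
record UnreadCountResponse(long total, Map<AlertSeverity, Long> bySeverity) {

    // Normalises a possibly sparse map: every severity gets an entry,
    // missing ones default to 0, and the input map is copied, not shared.
    static UnreadCountResponse from(Map<AlertSeverity, Long> counts) {
        EnumMap<AlertSeverity, Long> normalised = new EnumMap<>(AlertSeverity.class);
        long total = 0L;
        for (AlertSeverity severity : AlertSeverity.values()) {
            Long raw = counts.get(severity);
            long value = (raw == null) ? 0L : raw;
            normalised.put(severity, value);
            total += value; // total is the sum of the per-severity values
        }
        return new UnreadCountResponse(total, Collections.unmodifiableMap(normalised));
    }
}
```

Under these assumptions, a map holding only CRITICAL and INFO counts yields a response whose WARNING entry is 0, so a consumer can read all three keys without null checks.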

68 Commits