cameleer-server

Author	SHA1	Message	Date
hsiegeln	c7e5c7fa2d	refactor(diagrams): retire findContentHashForRouteByAgents All production callers migrated to findLatestContentHashForAppRoute in the preceding commits. The agent-scoped lookup adds no coverage beyond the latest-per-(app,env,route) resolver, so the dead API is removed along with its test coverage and unused imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:02:47 +02:00
hsiegeln	0995ab35c4	fix(catalog): preserve fromEndpointUri for removed routes Both catalog controllers resolved the from-endpoint URI via findContentHashForRouteByAgents, which filtered by the currently-live agent instance_ids. Routes removed between app versions therefore lost their fromUri even though the diagram row still exists. Route through findLatestContentHashForAppRoute so resolution depends only on (app, env, route) — stays populated for historical routes. CatalogController now resolves the per-row env slug up-front so the fromUri lookup works even for cross-env queries against managed apps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:01:19 +02:00
hsiegeln	480a53c80c	fix(diagrams): by-route lookup no longer requires live agents The env-scoped /routes/{routeId}/diagram endpoint filtered diagrams by the currently-live agent instance_ids. Routes removed between app versions have no live publisher, so the lookup returned 404 even though the historical diagram row still exists in route_diagrams. Sidebar entries for removed routes showed "no diagram" as a result. Switch to findLatestContentHashForAppRoute which resolves directly off (applicationId, environment, routeId) + created_at DESC, independent of the agent registry. The controller no longer depends on AgentRegistryService. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:59:43 +02:00
hsiegeln	d3ce5e861b	feat(diagrams): add findLatestContentHashForAppRoute with app-route cache Agent-scoped lookups miss diagrams from routes whose publishing agents have been redeployed or removed. The new method resolves by (applicationId, environment, routeId) + created_at DESC, independent of the agent registry. An in-memory cache mirrors the existing hashCache pattern, warm-loaded at startup via argMax. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:58:49 +02:00
hsiegeln	21db92ff00	fix(traefik): make TLS cert resolver configurable, omit when unset Previously `TraefikLabelBuilder` hardcoded `tls.certresolver=default` on every router. That assumes a resolver literally named `default` exists in the Traefik static config — true for ACME-backed installs, false for dev/local installs that use a file-based TLS store. Traefik logs "Router uses a nonexistent certificate resolver" for the bogus resolver on every managed app, and any future attempt to define a differently-named real resolver would silently skip these routers. Server-wide setting via `CAMELEER_SERVER_RUNTIME_CERTRESOLVER` (empty by default) flows through `ConfigMerger.GlobalRuntimeDefaults.certResolver` into `ResolvedContainerConfig.certResolver`. When blank the `tls.certresolver` label is omitted entirely; `tls=true` is still emitted so Traefik serves the default TLS-store cert. When set, the label is emitted with the configured resolver name. Not per-app/per-env configurable: there is one Traefik per server instance and one resolver config; app-level override would only let users break their own routers. TDD: TraefikLabelBuilderTest gains 3 cases (resolver set, null, blank). Full unit suite 211/0/0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:18:47 +02:00
hsiegeln	165c9f10e3	feat(deploy): externalRouting toggle to keep apps off Traefik Adds a boolean `externalRouting` flag (default `true`) on ResolvedContainerConfig. When `false`, TraefikLabelBuilder emits only the identity labels (`managed-by`, `cameleer.`) and skips every `traefik.` label, so the container is not published by Traefik. Sibling containers on `cameleer-traefik` / `cameleer-env-{tenant}-{env}` can still reach it via Docker DNS on whatever port the app listens on. TDD: new TraefikLabelBuilderTest covers enabled (default labels present), disabled (zero traefik.* labels), and disabled (identity labels retained) cases. Full module unit suite: 208/0/0. Plumbed through ConfigMerger read, DeploymentExecutor snapshot, UI form state, Resources tab toggle, POST payload, and snapshot-to-form mapping. Rule files updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:03:48 +02:00
hsiegeln	0cf64b2928	fix(audit): exclude env-scoped executions/search from safety-net log The exclusion list still named the legacy flat `/api/v1/search/executions` URL, which no longer exists — the endpoint moved to env-scoped `/api/v1/environments/{envSlug}/executions/search`. Exact-match Set lookup never matched, so every UI search POST produced an audit row. Switch to AntPathMatcher over a pattern list so the dynamic envSlug is handled correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:35:44 +02:00
hsiegeln	ed0e616109	refactor(logs): drop dead null guards on instanceIds filter (record normalizes)	2026-04-23 12:52:18 +02:00
hsiegeln	382e1801a7	feat(logs): add instanceIds multi-value filter to /logs endpoint Adds List<String> instanceIds to LogSearchRequest (null-normalized to List.of() in compact ctor) and generates an IN clause in both ClickHouseLogStore.search() and countLogs(), mirroring the existing sources pattern. LogQueryController parses ?instanceIds= as a comma-split list. All existing LogSearchRequest call sites updated. New ClickHouseLogStoreInstanceIdsIT covers: multi-value filter, empty filter (all rows), null filter (all rows), single-value filter, and coexistence with the singular instanceId field. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-23 12:41:09 +02:00
hsiegeln	2312a7304d	fix(deploy): widen promote FAILURE audit detail + clean up test envs	2026-04-23 12:29:46 +02:00
hsiegeln	47d5611462	feat(audit): audit deploy/stop/promote with DEPLOYMENT category Wires AuditService and AppVersionRepository into DeploymentController. Replaces null createdBy placeholder with currentUserId() on createDeployment/promote. Adds audit log entries (SUCCESS + FAILURE) for deploy_app, stop_deployment, and promote_deployment actions. Fixes FK violations in affected ITs by seeding the test-operator and alice users into the users table before deploy calls. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-23 12:24:27 +02:00
hsiegeln	a141e99a07	feat(deploy): cascade createdBy through Deployment record + service + repo Appends String createdBy to the Deployment record (after createdAt), updates both with-er methods to pass it through, threads the parameter through DeploymentRepository.create, DeploymentService.createDeployment/promote, and PostgresDeploymentRepository (INSERT + SELECT_COLS + mapRow). DeploymentController passes null as placeholder (Task 4 will resolve from SecurityContextHolder). Covers with PostgresDeploymentRepositoryCreatedByIT verifying round-trip via both createDeployment and promote. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-23 12:04:15 +02:00
hsiegeln	35748ea7a1	feat(deploy): V4 migration — add created_by to deployments	2026-04-23 11:44:05 +02:00
hsiegeln	c6aef5ab35	fix(deploy): Checkpoints — preserve STOPPED history, fix filter + placement - Backend: rename deleteTerminalByAppAndEnvironment → deleteFailedByAppAndEnvironment. STOPPED rows were being wiped on every redeploy, so Checkpoints was always empty. Now only FAILED rows are pruned; STOPPED deployments are retained as restorable checkpoints (they still carry deployed_config_snapshot from their RUNNING window). - UI filter: any deployment with a snapshot is a checkpoint (was RUNNING|DEGRADED only, which excluded the main case — the previous blue/green deployment now in STOPPED). - UI placement: Checkpoints disclosure now renders inside IdentitySection, matching the design spec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 10:26:46 +02:00
hsiegeln	653f983a08	deploy: rolling strategy (per-replica replacement) Replace the Phase 3 stub with a working rolling implementation. Flow: - Capture previous deployment's per-index container ids up front. - For i = 0..replicas-1: - Start new[i] (gen-suffixed name, coexists with old[i]). - Wait for new[i] healthy (new waitForOneHealthy helper). - On success: stop old[i] if present, continue. - On failure: stop in-flight new[0..i], leave un-replaced old[i+1..N] running, mark FAILED. Already-replaced old replicas are not restored — rolling is not reversible; user redeploys to recover. - After the loop: sweep any leftover old replicas (when replica count shrank) and mark the old deployment STOPPED. Resource peak: replicas + 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 09:53:52 +02:00
hsiegeln	459cdfe427	deploy: blue-green strategy (start → health-all → stop old) Phase 3 of deployment-strategies plan. Refactor executeAsync to dispatch on DeploymentStrategy.fromWire(config.deploymentStrategy()). Blue-green (default): - Start all N new replicas (gen-suffixed names coexist with old). - Wait for ALL healthy (strict — partial-healthy = FAILED, preserves previous deployment untouched). - Only then find + stop the previous deployment. - Final status is always RUNNING; DEGRADED is now reserved for post-deploy replica crashes (set by DockerEventMonitor). Rolling: stub — throws UnsupportedOperationException for now, gets its real implementation in Phase 4. Refactor details: - Extract DeployCtx record to carry 13 per-deploy values around. - Extract startReplica(ctx, i, stateOut) — shared by both strategy paths. - Extract persistSnapshotAndMarkRunning(ctx, primaryCid) — shared finalizer. - Rename waitForAnyHealthy → waitForAllHealthy (the name was misleading; the method already waited for all, just returned partial on timeout). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 09:51:24 +02:00
hsiegeln	652346dcd4	deploy: gen-suffixed container names + cameleer.generation label Append an 8-char generation id (first 8 chars of deployment UUID) to: - container name: {tenant}-{env}-{app}-{replica}-{gen} - CAMELEER_AGENT_INSTANCEID (so old+new agents are distinct in the registry) - Traefik cameleer.instance-id label And emit a new standalone cameleer.generation label so dashboards (Prometheus/Grafana) can pin deploy boundaries without regex on instance-id. Strategy branching comes next — this commit is foundation only; the interim destroy-then-start flow still runs regardless of strategy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 09:45:44 +02:00
hsiegeln	f8dccaae2b	fix(deploy): stop previous active deployment before START_REPLICAS (fixes 409) Container names are deterministic: {tenant}-{envSlug}-{appSlug}-{replica}. The prior code did the stop-existing step at SWAP_TRAFFIC, after START_REPLICAS had already tried to create containers with the same names — so a redeploy against a RUNNING app consistently failed with Docker 409 "container name already in use". Move the stop-existing block to run right after CREATE_NETWORK and before START_REPLICAS. SWAP_TRAFFIC becomes a label-only marker (traffic is swapped implicitly by Traefik labels once new replicas are healthy). Also: add `findActiveByAppIdAndEnvironmentIdExcluding` so the SQL excludes the current deployment by id — previously the Java-side `!id.equals(me)` guard failed because the newly-inserted row has status=STARTING (DB default) and ORDER BY created_at DESC LIMIT 1 picked the new row, hiding the actual previous deployment. Trade-off: this is destroy-then-start rather than true blue/green — brief downtime during the swap. Matches the pre-unified-page behavior and is what users reasonably expect. True blue/green would require per-deployment container names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 01:01:00 +02:00
hsiegeln	b655de3975	fix(config): structured 400 body on unknown apply value Replace empty-body ResponseEntity.status(BAD_REQUEST).build() with ResponseStatusException so Spring returns the usual error body shape with a descriptive reason string, matching the idiom used by UserAdminController, AppSettingsController, ThresholdAdminController. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 00:45:31 +02:00
hsiegeln	b5ecd39100	docs(api): document ?apply query param on updateConfig (Swagger) Adds @Parameter description so the generated OpenAPI spec / Swagger UI explains what 'staged' vs 'live' means instead of just surfacing the bare param name. Follow-up: run `cd ui && npm run generate-api:live` against a live backend to refresh openapi.json + schema.d.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 00:39:10 +02:00
hsiegeln	4d4c59efe3	fix(deploy): include DEGRADED deploys as restorable checkpoints Snapshot is written by DeploymentExecutor before the RUNNING/DEGRADED split, so DEGRADED rows already carry a deployed_config_snapshot. Treat them as checkpoints — partial-healthy deploys still produced a working config worth restoring. Aligns repo query with UI filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 00:34:25 +02:00
hsiegeln	d33c039a17	fix(deploy): address final review — sensitiveKeys snapshot, dirty scrubbing, transition race, refetch invalidations - Issue 1: add List<String> sensitiveKeys as 4th field to DeploymentConfigSnapshot; populate from agentConfig.getSensitiveKeys() in DeploymentExecutor; handleRestore hydrates from snap.sensitiveKeys directly; Deployment type in apps.ts gains sensitiveKeys field - Issue 2: after createApp succeeds, refetchQueries(['apps', envSlug]) before navigate so the new app is in cache before the router renders the deployed view (eliminates transient Save-disabled flash) - Issue 3: useDeploymentPageState useEffect now uses prevServerStateRef to detect local edits; background refetches only overwrite form when no local changes are present - Issue 5: handleRedeploy invalidates dirty-state + versions queries after createDeployment resolves; handleSave invalidates dirty-state after staged save - Issue 10: DirtyStateCalculator strips volatile agentConfig keys (version, updatedAt, updatedBy, environment, application) before JSON comparison via scrubAgentConfig(); adds versionBumpDoesNotMarkDirty test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 23:29:01 +02:00
hsiegeln	6591f2fde3	api(apps): GET /apps/{slug}/dirty-state returns desired-vs-deployed diff Wires DirtyStateCalculator behind an HTTP endpoint on AppController. Adds findLatestSuccessfulByAppAndEnv to PostgresDeploymentRepository, registers DirtyStateCalculator as a Spring bean (with ObjectMapper for JavaTimeModule support), and covers all three scenarios with IT. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:35:35 +02:00
hsiegeln	76129d407e	api(config): ?apply=staged|live gates SSE push on PUT /apps/{slug}/config When apply=staged, saves to DB only — no CONFIG_UPDATE dispatched to agents. When apply=live (default, back-compat), preserves today's immediate-push behavior. Unknown apply values return 400. Audit action is stage_app_config vs update_app_config. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:07:36 +02:00
hsiegeln	a79eafeaf4	runtime(deploy): capture config snapshot on RUNNING transition Injects PostgresApplicationConfigRepository into DeploymentExecutor and calls saveDeployedConfigSnapshot at the COMPLETE stage, before markRunning. Snapshot contains jarVersionId, agentConfig (nullable), and app.containerConfig. The FAILED catch path is left untouched so snapshot stays null on failure. Verified by DeploymentSnapshotIT. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:51:00 +02:00
hsiegeln	d3e86b9d77	storage(deploy): persist deployed_config_snapshot as JSONB Wire SELECT_COLS, mapRow deserialization, and saveDeployedConfigSnapshot update method. Adds PostgresDeploymentRepositoryIT with roundtrip, null-default, and clear-to-null tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:39:04 +02:00
hsiegeln	7f9cfc7f18	core(deploy): add deployedConfigSnapshot field to Deployment model Appends DeploymentConfigSnapshot deployedConfigSnapshot to the Deployment record and adds a matching withDeployedConfigSnapshot wither. All positional call sites (repository mapper, test fixture) updated to pass null; Task 1.4 will wire real persistence and Task 1.5 will populate the field on RUNNING transition. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:31:48 +02:00
hsiegeln	ff95187707	db(deploy): add deployments.deployed_config_snapshot column (V3) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:23:46 +02:00
hsiegeln	c2eab71a31	env(admin): per-environment color field + V2 migration - V2__add_environment_color.sql adds a CHECK-constrained VARCHAR color column (default 'slate'); existing rows backfill to slate. - Environment record + EnvironmentColor constants (8 preset values) flow through repository, service, and admin API. - UpdateEnvironmentRequest.color nullable: null preserves existing; unknown values → 400. - ITs cover valid / invalid / null-preserves behaviour; existing Environment constructor call-sites updated with the new color arg. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 19:24:30 +02:00
hsiegeln	e6dcad1e07	config(app): silence MustacheAutoConfiguration templates-dir warning jmustache on the classpath (for alert notification templates) triggers Spring Boot's MustacheAutoConfiguration, which warns about the missing classpath:/templates/ folder we don't use. Disable its check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:47:46 +02:00
hsiegeln	eda74b7339	docs(alerting): PER_EXCHANGE exactly-once — fireMode reference + deploy-backlog-cap Fix stale `AGGREGATE` label (actual enum: `COUNT_IN_WINDOW`). Expand EXCHANGE_MATCH section with both fire modes, PER_EXCHANGE config-surface restrictions (0 for reNotifyMinutes/forDurationSeconds, at-least-one-sink rule), exactly-once guarantee scope, and the first-run backlog-cap knob. Surface the new config in application.yml with the 24h default and the opt-out-to-0 semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:39:49 +02:00
hsiegeln	e470fc0dab	alerting(eval): clamp first-run cursor to deployBacklogCap — flood guard New property cameleer.server.alerting.perExchangeDeployBacklogCapSeconds (default 86400 = 24h, 0 disables). On first run (no persisted cursor or malformed), clamp cursorTs to max(rule.createdAt, now - cap) so a long-lived PER_EXCHANGE rule doesn't scan from its creation date forward on first post-deploy tick. Normal-advance path unaffected. Follows up final-review I-1 on the PER_EXCHANGE exactly-once phase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:34:23 +02:00
hsiegeln	0f6bafae8e	alerting(api): cross-field validation for PER_EXCHANGE + empty-targets guard PER_EXCHANGE rules: 400 if reNotifyMinutes != 0 or forDurationSeconds != 0. Any rule: 400 if webhooks + targets are both empty (never notifies anyone). Turns green: AlertRuleControllerIT#createPerExchangeRule_with*NonZero_returns400, AlertRuleControllerIT#createAnyRule_withEmptyWebhooksAndTargets_returns400.	2026-04-22 17:31:11 +02:00
hsiegeln	ba4e2bb68f	alerting(eval): atomic per-rule batch commit via @Transactional — Phase 2 close Wraps instance writes, notification enqueues, and cursor advance in one transactional boundary per rule tick. Rollback leaves the rule replayable on next tick. Turns the Phase 2 atomicity IT green (see AlertEvaluatorJobIT #tickRollback_faultOnSecondNotificationInsert_leavesCursorUnchanged).	2026-04-22 17:03:07 +02:00
hsiegeln	b8d4b59f40	alerting(eval): AlertEvaluatorJob persists advanced cursor via withEvalState Thread EvalResult.Batch.nextEvalState into releaseClaim so the composite cursor from Task 1.5 actually lands in rule.evalState across tick boundaries. Guards against empty-batch wipe (would regress to first-run scan).	2026-04-22 16:24:27 +02:00
hsiegeln	850c030642	search: compose ORDER BY with execution_id when afterExecutionId set Follow-up to Task 1.2 flagged by Task 1.5 review (I-1). Single-column ORDER BY could drop tail rows in a same-millisecond group >50 when paginating via the composite cursor. Appending ', execution_id <dir>' as secondary key only when afterExecutionId is set preserves existing behaviour for UI/stats callers.	2026-04-22 16:21:52 +02:00
hsiegeln	4acf0aeeff	alerting(eval): PER_EXCHANGE composite cursor — monotone across same-ms exchanges Tests: - cursorMonotonicity_sameMillisecondExchanges_fireExactlyOncePerTick - firstRun_boundedByRuleCreatedAt_notRetentionHistory	2026-04-22 16:11:01 +02:00
hsiegeln	b41f34c090	search: SearchRequest.afterExecutionId — composite (startTime, execId) predicate Adds an optional afterExecutionId field to SearchRequest. When combined with a non-null timeFrom, ClickHouseSearchIndex applies a strictly-after tuple predicate (start_time > ts OR (start_time = ts AND execution_id > id)) so same-millisecond exchanges can be consumed exactly once across ticks. When afterExecutionId is null, timeFrom keeps its existing >= semantics — no behaviour change for any current caller. Also adds the SearchRequest.withCursor(ts, id) wither. Threads the field through existing withInstanceIds / withEnvironment withers. All existing positional call-sites (SearchController, ExchangeMatchEvaluator, ClickHouseSearchIndexIT, ClickHouseChunkPipelineIT) pass null for the new slot. Task 1.2 of docs/superpowers/plans/2026-04-22-per-exchange-exactly-once.md. The evaluator-side wiring that actually supplies the cursor is Task 1.5.	2026-04-22 15:49:05 +02:00
hsiegeln	6fa8e3aa30	alerting(eval): EvalResult.Batch carries nextEvalState for cursor threading	2026-04-22 15:42:20 +02:00
hsiegeln	41df042e98	fix(sse): close 4 parked SSE test failures Three distinct root causes, all reproducible when the classes run solo — not order-dependent as the triage report suggested. Full diagnosis in .planning/sse-flakiness-diagnosis.md. 1. AgentSseController.events auto-heal was over-permissive: any valid JWT allowed registering an arbitrary path-id, a spoofing vector. Surface symptom was the parked sseConnect_unknownAgent_returns404 test hanging on a 200-with-empty-stream instead of getting 404. Fix: auto-heal requires JWT subject == path id. 2. SseConnectionManager.pingAll read ${agent-registry.ping-interval-ms} (unprefixed). AgentRegistryConfig binds cameleer.server.agentregistry.* — same family of bug as the MetricsFlushScheduler fix in `a6944911`. Fix: corrected placeholder prefix. 3. Spring's SseEmitter doesn't flush response headers until the first emitter.send(); clients on BodyHandlers.ofInputStream blocked on the first body byte, making awaitConnection(5s) unreliable under a 15s ping cadence. Fix: send an initial ": connected" comment on connect() so headers hit the wire immediately. Verified: 9/9 SSE tests green across AgentSseControllerIT + SseSigningIT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:41:34 +02:00
hsiegeln	98cbf8f3fc	refactor(search): drop dead SearchIndexer subsystem After the ExecutionController removal (`0f635576`), SearchIndexer subscribed to ExecutionUpdatedEvent but nothing publishes that event. Every SearchIndexerStats metric returned always-zero, and the admin /api/v1/admin/clickhouse/pipeline endpoint that surfaced those stats carried no signal. Backend removed: - core: SearchIndexer, SearchIndexerStats, ExecutionUpdatedEvent - app: IndexerPipelineResponse DTO, /pipeline endpoint on ClickHouseAdminController (field + ctor param) - StorageBeanConfig.searchIndexer bean UI removed: - IndexerPipeline type + useIndexerPipeline hook in api/queries/admin/clickhouse.ts - Indexer Pipeline card in ClickHouseAdminPage.tsx (plus ProgressBar import and pipeline* CSS classes) OpenAPI schema.d.ts + openapi.json regenerated (stale /pipeline path and IndexerPipelineResponse schema removed). SearchIndex interface + ClickHouseSearchIndex impl kept — those are live and used by SearchService + ExchangeMatchEvaluator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:32:49 +02:00
hsiegeln	a694491140	fix(metrics): MetricsFlushScheduler honour ingestion config flush interval The @Scheduled placeholder read ${ingestion.flush-interval-ms:1000} (unprefixed) but IngestionConfig binds cameleer.server.ingestion.* — YAML tuning of the metrics flush interval was silently ignored and the scheduler fell back to the 1s default in every environment. Corrected to ${cameleer.server.ingestion.flush-interval-ms:1000}. (The initial attempt to bind via SpEL #{@ingestionConfig.flushIntervalMs} failed because beans registered via @EnableConfigurationProperties use a compound bean name "<prefix>-<FQN>", not the simple camelCase form. The property-placeholder path is sufficient — IngestionConfig still owns the Java-side default.) BackpressureIT: drops the obsolete workaround property `ingestion.flush-interval-ms=60000`; the single prefixed override now controls both buffer config and flush cadence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:28:00 +02:00
hsiegeln	a9a6b465d4	fix(stats): close 8 ClickHouseStatsStoreIT TZ failures (bucket DateTime('UTC') + JVM UTC pin) Two-layer fix for the TZ drift that caused stats reads to miss every row when the JVM default TZ and CH session TZ disagreed: - Insert side: ClickHouse JDBC 0.9.7 formats java.sql.Timestamp via Timestamp.toString(), which uses JVM default TZ. A CEST JVM shipping to a UTC CH server stored Unix timestamps off by the TZ offset (the triage report's original symptom). Pinned JVM default to UTC in CameleerServerApplication.main() — standard practice for observability servers that push to time-series stores. - Read side: stats_1m_* tables now declare bucket as DateTime('UTC'), MV SELECTs wrap toStartOfMinute(start_time) in toDateTime(..., 'UTC') so projections match column type, and ClickHouseStatsStore.lit(Instant) emits toDateTime('...', 'UTC') rather than a bare literal — defence in depth against future refactors. Test class pins its own JVM TZ (the store IT builds its own HikariDataSource, bypassing the main() path). Debug scaffolding from the triage investigation removed. Greenfield CH — no migration needed. Verified: 14/14 ClickHouseStatsStoreIT green, plus 84/84 across all ClickHouse IT classes (no regression from the JVM TZ default change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:25:22 +02:00
hsiegeln	0f635576a3	refactor(ingestion): drop dead legacy execution-ingestion path ExecutionController was @ConditionalOnMissingBean(ChunkAccumulator.class), and ChunkAccumulator is registered unconditionally — the legacy controller never bound in any profile. Even if it had, IngestionService.ingestExecution called executionStore.upsert(), and the only ExecutionStore impl (ClickHouseExecutionStore) threw UnsupportedOperationException from upsert and upsertProcessors. The entire RouteExecution → upsert path was dead code carrying four transitive dependencies (RouteExecution import, eventPublisher wiring, body-size-limit config, searchIndexer::onExecutionUpdated hook). Removed: - cameleer-server-app/.../controller/ExecutionController.java (whole file) - ExecutionStore.upsert + upsertProcessors (interface methods) - ClickHouseExecutionStore.upsert + upsertProcessors (thrower overrides) - IngestionService.ingestExecution + toExecutionRecord + flattenProcessors + hasAnyTraceData + truncateBody + toJson/toJsonObject helpers - IngestionService constructor now takes (DiagramStore, WriteBuffer<Metrics>); dropped ExecutionStore + Consumer<ExecutionUpdatedEvent> + bodySizeLimit - StorageBeanConfig.ingestionService(...) simplified accordingly Untouched because still in use: - ExecutionRecord / ProcessorRecord records (findById / findProcessors / SearchIndexer / DetailController) - SearchIndexer (its onExecutionUpdated never fires now since no-one publishes ExecutionUpdatedEvent, but SearchIndexerStats is still referenced by ClickHouseAdminController — separate cleanup) - TaggedExecution record has no remaining callers after this change — flagged in core-classes.md as a leftover; separate cleanup. Rule docs updated: - .claude/rules/app-classes.md: retired ExecutionController bullet, fixed stale URL for ChunkIngestionController (it owns /api/v1/data/executions, not /api/v1/ingestion/chunk/executions). 
- .claude/rules/core-classes.md: IngestionService surface + note the dead TaggedExecution. Full IT suite post-removal: 560 tests run, 11 F + 1 E — same 12 failures in the same 3 previously-parked classes (AgentSseControllerIT / SseSigningIT SSE-timing + ClickHouseStatsStoreIT timezone bug). No regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:50:51 +02:00
hsiegeln	fb54f9cbd2	fix(agent): revive DEAD agents on heartbeat (not just STALE) Reproduction: pause a container long enough to cross both the stale and dead thresholds, then unpause. The agent resumes sending heartbeats but the server keeps it shown as DEAD. Only a full container restart (which re-registers) fixes it. Root cause: AgentRegistryService.heartbeat() only revived STALE → LIVE. A DEAD agent's heartbeat updated lastHeartbeat but left state unchanged. checkLifecycle() never downgrades DEAD either (no-op in that branch), so the agent was permanently stuck in DEAD until a register() call. Fix: extend the revival branch to also cover DEAD. Same process; a heartbeat is proof of liveness regardless of the previous state. Also: AgentLifecycleMonitor.mapTransitionEvent() now emits RECOVERED for DEAD → LIVE, mirroring its behavior for STALE → LIVE, so the lifecycle timeline captures the transition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:55:47 +02:00
hsiegeln	90083f886a	refactor(schema): collapse V1..V18 into single V1__init.sql baseline The project is still greenfield (no production deployment) so this is the last safe moment to flatten the migration archaeology before the checksum history starts mattering for real. Schema changes - 18 migration files (531 lines) → one V1__init.sql (~380 lines) declaring the final end-state: RBAC + claim mappings + runtime management + config + audit + outbound + alerting, plus seed data (system roles, Admins group, default environment). - Drops the data-repair statements from V14 (firemode backfill), V16 (subjectFingerprint migration), V17 (ACKNOWLEDGED → FIRING coercion) — they were no-ops on any DB that starts at V1. - Declares condition_kind_enum with AGENT_LIFECYCLE from the start (was added retroactively by V18). - Declares alert_state_enum with three values only (was five, then swapped in V17) and alert_instances with read_at / deleted_at columns from day one (was added by V17). - alert_reads table never created (V12 created, V17 dropped). - alert_instances_open_rule_uq built with the V17 predicate from the start. Test changes - Replace V12MigrationIT / V17MigrationIT / V18MigrationIT with one SchemaBootstrapIT that asserts the combined invariants: tables present, alert_reads absent, enum value sets, alert_instances has read_at + deleted_at, open_rule_uq exists and is unique, env-delete cascade fires. Verification - pg_dump of the new V1 matches the pg_dump of V1..V18 applied in sequence (bytewise modulo column order and Postgres-auto FK names). - Full alerting IT suite (53 tests across 6 classes) green against the new schema. 
- The 47 pre-existing test failures on main (AgentRegistrationIT, SearchControllerIT, ClickHouseStatsStoreIT, …) are unrelated and fail identically without this change. Developer impact - Existing local DBs will fail checksum validation on boot. Wipe: docker compose down -v (or drop the tenant_default schema). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:52:22 +02:00
hsiegeln	b7d201d743	fix(alerts): add AGENT_LIFECYCLE to condition_kind_enum + readable error toasts Backend - V18 migration adds AGENT_LIFECYCLE to condition_kind_enum. Java ConditionKind enum shipped with this value but no Postgres migration extended the type, so any AGENT_LIFECYCLE rule insert failed with "invalid input value for enum condition_kind_enum". - ALTER TYPE ... ADD VALUE lives alone in its migration per Postgres constraint that the new value cannot be referenced in the same tx. - V18MigrationIT asserts the enum now contains all 7 kinds. Frontend - Add describeApiError(e) helper to unwrap openapi-fetch error bodies (Spring error JSON) into readable strings. String(e) on a plain object rendered "[object Object]" in toasts; the actual failure reason was hidden from the user. - Replace String(e) in all 13 toast descriptions across the alerting and outbound-connection mutation paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:23:14 +02:00
hsiegeln	99b739d946	fix(alerts): backend hardening + complete ACKNOWLEDGED migration - new AlertInstanceRepository.filterInEnvLive(ids, env): single-query bulk ID validation - AlertController.inEnvLiveIds now one SQL round-trip instead of N - bulkMarkRead SQL: defense-in-depth AND deleted_at IS NULL - bulkAck SQL already had deleted_at IS NULL guard — no change needed - PostgresAlertInstanceRepositoryIT: add filterInEnvLive_excludes_other_env_and_soft_deleted - V12MigrationIT: remove alert_reads assertion (table dropped by V17) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:48:57 +02:00
hsiegeln	efd8396045	feat(alerts): controller — DELETE/bulk-delete/bulk-ack/restore + acked/read filters + readAt on DTO - GET /alerts gains tri-state acked + read query params - new endpoints: DELETE /{id} (soft-delete), POST /bulk-delete, POST /bulk-ack, POST /{id}/restore - requireLiveInstance 404s on soft-deleted rows; restore() reads the row regardless - BulkReadRequest → BulkIdsRequest (shared body for bulk read/ack/delete) - AlertDto gains readAt; deletedAt stays off the wire - InAppInboxQuery.listInbox threads acked/read through to the repo (7-arg, no more null placeholders) - SecurityConfig: new matchers for bulk-ack (VIEWER+), DELETE/bulk-delete/restore (OPERATOR+) - AlertControllerIT: persistence assertions on /read + /bulk-read; full coverage for new endpoints - InAppInboxQueryTest: updated to 7-arg listInbox signature Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:15:16 +02:00
hsiegeln	da2819332c	feat(alerts): Postgres repo — read_at/deleted_at columns, filter params, new mutations - save/rowMapper read+write read_at and deleted_at - listForInbox: tri-state acked/read filters; always excludes deleted - countUnreadBySeverity: rewire without alert_reads join, preserve zero-fill - new: markRead/bulkMarkRead/softDelete/bulkSoftDelete/bulkAck/restore - delete PostgresAlertReadRepository + its bean - restore zero-fill Javadoc on interface - mechanical compile-fixes in AlertController, InAppInboxQuery, AlertControllerIT, InAppInboxQueryTest; Task 6 owns the rewrite - PostgresAlertReadRepositoryIT stubbed @Disabled; Task 7 owns migration Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:56:06 +02:00

150 Commits