cameleer-server

Author	SHA1	Message	Date
hsiegeln	d33c039a17	fix(deploy): address final review — sensitiveKeys snapshot, dirty scrubbing, transition race, refetch invalidations - Issue 1: add List<String> sensitiveKeys as 4th field to DeploymentConfigSnapshot; populate from agentConfig.getSensitiveKeys() in DeploymentExecutor; handleRestore hydrates from snap.sensitiveKeys directly; Deployment type in apps.ts gains sensitiveKeys field - Issue 2: after createApp succeeds, refetchQueries(['apps', envSlug]) before navigate so the new app is in cache before the router renders the deployed view (eliminates transient Save- disabled flash) - Issue 3: useDeploymentPageState useEffect now uses prevServerStateRef to detect local edits; background refetches only overwrite form when no local changes are present - Issue 5: handleRedeploy invalidates dirty-state + versions queries after createDeployment resolves; handleSave invalidates dirty-state after staged save - Issue 10: DirtyStateCalculator strips volatile agentConfig keys (version, updatedAt, updatedBy, environment, application) before JSON comparison via scrubAgentConfig(); adds versionBumpDoesNotMarkDirty test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 23:29:01 +02:00
hsiegeln	6591f2fde3	api(apps): GET /apps/{slug}/dirty-state returns desired-vs-deployed diff Wires DirtyStateCalculator behind an HTTP endpoint on AppController. Adds findLatestSuccessfulByAppAndEnv to PostgresDeploymentRepository, registers DirtyStateCalculator as a Spring bean (with ObjectMapper for JavaTimeModule support), and covers all three scenarios with IT. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:35:35 +02:00
hsiegeln	76352c0d6f	test(config): tighten audit assertions + @DirtiesContext on ApplicationConfigControllerIT - Add @DirtiesContext(AFTER_CLASS) so the SpyBean-forked context is torn down after the 6 tests finish, preventing permanent cache pollution - Replace single-row queryForObject with queryForList + hasSize(1) in both audit tests so spurious extra rows will fail explicitly - Assert auditCount == 0 in the 400 test to lock in the no-audit-on-bad-input invariant Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:18:44 +02:00
hsiegeln	e716dbf8ca	test(config): verify audit action in staged/live config IT Replace the misleading putConfig_staged_auditActionIsStagedAppConfig test (which only checked pushResult.total == 0, a duplicate of _savesButDoesNotPush) with two real audit-log assertions: one verifying "stage_app_config" is written for apply=staged and a new companion test verifying "update_app_config" for the live path. Uses jdbcTemplate to query audit_log directly (Option B). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:13:53 +02:00
hsiegeln	76129d407e	api(config): ?apply=staged\|live gates SSE push on PUT /apps/{slug}/config When apply=staged, saves to DB only — no CONFIG_UPDATE dispatched to agents. When apply=live (default, back-compat), preserves today's immediate-push behavior. Unknown apply values return 400. Audit action is stage_app_config vs update_app_config. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:07:36 +02:00
hsiegeln	9b1240274d	test(deploy): assert containerConfig round-trip + strict RUNNING in snapshot IT Adds the missing containerConfig assertion to snapshot_isPopulated_whenDeploymentReachesRunning (runtimeType + appPort entries), and tightens the await predicate from .isIn(RUNNING, DEGRADED) to .isEqualTo(RUNNING) — the mock returns a healthy container so RUNNING is deterministic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:54:57 +02:00
hsiegeln	a79eafeaf4	runtime(deploy): capture config snapshot on RUNNING transition Injects PostgresApplicationConfigRepository into DeploymentExecutor and calls saveDeployedConfigSnapshot at the COMPLETE stage, before markRunning. Snapshot contains jarVersionId, agentConfig (nullable), and app.containerConfig. The FAILED catch path is left untouched so snapshot stays null on failure. Verified by DeploymentSnapshotIT. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:51:00 +02:00
hsiegeln	9b851c4622	test(deploy): autowire repository in snapshot IT (JavaTimeModule-safe) Replace manual `new PostgresDeploymentRepository(jdbcTemplate, new ObjectMapper())` with `@Autowired PostgresDeploymentRepository repository` to use the Spring-managed bean whose ObjectMapper has JavaTimeModule registered. Also removes the redundant isNotNull() assertion whose work is done by the field-level assertions that follow. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:43:40 +02:00
hsiegeln	d3e86b9d77	storage(deploy): persist deployed_config_snapshot as JSONB Wire SELECT_COLS, mapRow deserialization, and saveDeployedConfigSnapshot update method. Adds PostgresDeploymentRepositoryIT with roundtrip, null-default, and clear-to-null tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:39:04 +02:00
hsiegeln	7f9cfc7f18	core(deploy): add deployedConfigSnapshot field to Deployment model Appends DeploymentConfigSnapshot deployedConfigSnapshot to the Deployment record and adds a matching withDeployedConfigSnapshot wither. All positional call sites (repository mapper, test fixture) updated to pass null; Task 1.4 will wire real persistence and Task 1.5 will populate the field on RUNNING transition. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:31:48 +02:00
hsiegeln	ff95187707	db(deploy): add deployments.deployed_config_snapshot column (V3) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:23:46 +02:00
hsiegeln	c2eab71a31	env(admin): per-environment color field + V2 migration - V2__add_environment_color.sql adds a CHECK-constrained VARCHAR color column (default 'slate'); existing rows backfill to slate. - Environment record + EnvironmentColor constants (8 preset values) flow through repository, service, and admin API. - UpdateEnvironmentRequest.color nullable: null preserves existing; unknown values → 400. - ITs cover valid / invalid / null-preserves behaviour; existing Environment constructor call-sites updated with the new color arg. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 19:24:30 +02:00
hsiegeln	e6dcad1e07	config(app): silence MustacheAutoConfiguration templates-dir warning jmustache on the classpath (for alert notification templates) triggers Spring Boot's MustacheAutoConfiguration, which warns about the missing classpath:/templates/ folder we don't use. Disable its check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:47:46 +02:00
hsiegeln	eda74b7339	docs(alerting): PER_EXCHANGE exactly-once — fireMode reference + deploy-backlog-cap All checks were successful CI / cleanup-branch (push) Has been skipped Details CI / build (push) Successful in 2m7s Details CI / docker (push) Successful in 1m22s Details CI / deploy-feature (push) Has been skipped Details CI / deploy (push) Successful in 41s Details Fix stale `AGGREGATE` label (actual enum: `COUNT_IN_WINDOW`). Expand EXCHANGE_MATCH section with both fire modes, PER_EXCHANGE config-surface restrictions (0 for reNotifyMinutes/forDurationSeconds, at-least-one-sink rule), exactly-once guarantee scope, and the first-run backlog-cap knob. Surface the new config in application.yml with the 24h default and the opt-out-to-0 semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:39:49 +02:00
hsiegeln	e470fc0dab	alerting(eval): clamp first-run cursor to deployBacklogCap — flood guard New property cameleer.server.alerting.perExchangeDeployBacklogCapSeconds (default 86400 = 24h, 0 disables). On first run (no persisted cursor or malformed), clamp cursorTs to max(rule.createdAt, now - cap) so a long-lived PER_EXCHANGE rule doesn't scan from its creation date forward on first post-deploy tick. Normal-advance path unaffected. Follows up final-review I-1 on the PER_EXCHANGE exactly-once phase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:34:23 +02:00
hsiegeln	cfc619505a	alerting(it): AlertingFullLifecycleIT — exactly-once across ticks, ack isolation End-to-end lifecycle test: 5 FAILED exchanges across 2 ticks produces exactly 5 FIRING instances + 5 PENDING notifications. Tick 3 with no new exchanges produces zero new instances or notifications. Ack on one instance leaves the other four untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:07:45 +02:00
hsiegeln	0f6bafae8e	alerting(api): cross-field validation for PER_EXCHANGE + empty-targets guard PER_EXCHANGE rules: 400 if reNotifyMinutes != 0 or forDurationSeconds != 0. Any rule: 400 if webhooks + targets are both empty (never notifies anyone). Turns green: AlertRuleControllerIT#createPerExchangeRule_with*NonZero_returns400, AlertRuleControllerIT#createAnyRule_withEmptyWebhooksAndTargets_returns400.	2026-04-22 17:31:11 +02:00
hsiegeln	377968eb53	alerting(it): RED tests for PER_EXCHANGE cross-field validation + empty targets Three failing IT tests documenting the contract Task 3.3 will satisfy: - createPerExchangeRule_withReNotifyMinutesNonZero_returns400 - createPerExchangeRule_withForDurationSecondsNonZero_returns400 - createAnyRule_withEmptyWebhooksAndTargets_returns400	2026-04-22 17:17:47 +02:00
hsiegeln	e483e52eee	alerting(core): drop unused perExchangeLingerSeconds from ExchangeMatchCondition Dead field — was enforced by compact ctor as required for PER_EXCHANGE, but never read anywhere in the codebase. Removal tightens the API surface and is precondition for the Task 3.3 cross-field validator. Pre-prod; no shim / migration.	2026-04-22 17:10:53 +02:00
hsiegeln	ba4e2bb68f	alerting(eval): atomic per-rule batch commit via @Transactional — Phase 2 close Wraps instance writes, notification enqueues, and cursor advance in one transactional boundary per rule tick. Rollback leaves the rule replayable on next tick. Turns the Phase 2 atomicity IT green (see AlertEvaluatorJobIT #tickRollback_faultOnSecondNotificationInsert_leavesCursorUnchanged).	2026-04-22 17:03:07 +02:00
hsiegeln	989dde23eb	alerting(it): RED test pinning Phase 2 tick-atomicity contract Fault-injection IT asserts that a crash mid-batch rolls back every instance + notification write AND leaves the cursor unchanged. Fails against current (Phase 1 only) code — turns green when Task 2.2 wraps batch processing in @Transactional.	2026-04-22 16:51:09 +02:00
hsiegeln	3c3d90c45b	test(alerting): align AlertEvaluatorJobIT CH cleanup with house style Replace async @AfterEach ALTER...DELETE with @BeforeEach TRUNCATE TABLE executions — matches the convention used in ClickHouseExecutionStoreIT and peers. Env-slug isolation was already preventing cross-test pollution; this change is about hygiene and determinism (TRUNCATE is synchronous).	2026-04-22 16:45:28 +02:00
hsiegeln	5bd0e09df3	alerting(eval): persist advanced cursor via releaseClaim — Phase 1 close Fixes the notification-bleed regression pinned by AlertEvaluatorJobIT#tick2_noNewExchanges_enqueuesZeroAdditionalNotifications.	2026-04-22 16:36:01 +02:00
hsiegeln	b8d4b59f40	alerting(eval): AlertEvaluatorJob persists advanced cursor via withEvalState Thread EvalResult.Batch.nextEvalState into releaseClaim so the composite cursor from Task 1.5 actually lands in rule.evalState across tick boundaries. Guards against empty-batch wipe (would regress to first-run scan).	2026-04-22 16:24:27 +02:00
hsiegeln	850c030642	search: compose ORDER BY with execution_id when afterExecutionId set Follow-up to Task 1.2 flagged by Task 1.5 review (I-1). Single-column ORDER BY could drop tail rows in a same-millisecond group >50 when paginating via the composite cursor. Appending ', execution_id <dir>' as secondary key only when afterExecutionId is set preserves existing behaviour for UI/stats callers.	2026-04-22 16:21:52 +02:00
hsiegeln	4acf0aeeff	alerting(eval): PER_EXCHANGE composite cursor — monotone across same-ms exchanges Tests: - cursorMonotonicity_sameMillisecondExchanges_fireExactlyOncePerTick - firstRun_boundedByRuleCreatedAt_notRetentionHistory	2026-04-22 16:11:01 +02:00
hsiegeln	c2252a0e72	alerting(eval): RED tests for PER_EXCHANGE cursor monotonicity + first-run bound Two failing tests documenting the contract Task 1.5 will satisfy: - cursorMonotonicity_sameMillisecondExchanges_fireExactlyOncePerTick - firstRun_boundedByRuleCreatedAt_notRetentionHistory Compile may fail until Task 1.4 adds AlertRule.withEvalState wither.	2026-04-22 15:58:16 +02:00
hsiegeln	b41f34c090	search: SearchRequest.afterExecutionId — composite (startTime, execId) predicate Adds an optional afterExecutionId field to SearchRequest. When combined with a non-null timeFrom, ClickHouseSearchIndex applies a strictly-after tuple predicate (start_time > ts OR (start_time = ts AND execution_id > id)) so same-millisecond exchanges can be consumed exactly once across ticks. When afterExecutionId is null, timeFrom keeps its existing >= semantics — no behaviour change for any current caller. Also adds the SearchRequest.withCursor(ts, id) wither. Threads the field through existing withInstanceIds / withEnvironment witheres. All existing positional call-sites (SearchController, ExchangeMatchEvaluator, ClickHouseSearchIndexIT, ClickHouseChunkPipelineIT) pass null for the new slot. Task 1.2 of docs/superpowers/plans/2026-04-22-per-exchange-exactly-once.md. The evaluator-side wiring that actually supplies the cursor is Task 1.5.	2026-04-22 15:49:05 +02:00
hsiegeln	6fa8e3aa30	alerting(eval): EvalResult.Batch carries nextEvalState for cursor threading	2026-04-22 15:42:20 +02:00
hsiegeln	41df042e98	fix(sse): close 4 parked SSE test failures Three distinct root causes, all reproducible when the classes run solo — not order-dependent as the triage report suggested. Full diagnosis in .planning/sse-flakiness-diagnosis.md. 1. AgentSseController.events auto-heal was over-permissive: any valid JWT allowed registering an arbitrary path-id, a spoofing vector. Surface symptom was the parked sseConnect_unknownAgent_returns404 test hanging on a 200-with-empty-stream instead of getting 404. Fix: auto-heal requires JWT subject == path id. 2. SseConnectionManager.pingAll read ${agent-registry.ping-interval-ms} (unprefixed). AgentRegistryConfig binds cameleer.server.agentregistry.* — same family of bug as the MetricsFlushScheduler fix in `a6944911`. Fix: corrected placeholder prefix. 3. Spring's SseEmitter doesn't flush response headers until the first emitter.send(); clients on BodyHandlers.ofInputStream blocked on the first body byte, making awaitConnection(5s) unreliable under a 15s ping cadence. Fix: send an initial ": connected" comment on connect() so headers hit the wire immediately. Verified: 9/9 SSE tests green across AgentSseControllerIT + SseSigningIT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:41:34 +02:00
hsiegeln	98cbf8f3fc	refactor(search): drop dead SearchIndexer subsystem After the ExecutionController removal (`0f635576`), SearchIndexer subscribed to ExecutionUpdatedEvent but nothing publishes that event. Every SearchIndexerStats metric returned always-zero, and the admin /api/v1/admin/clickhouse/pipeline endpoint that surfaced those stats carried no signal. Backend removed: - core: SearchIndexer, SearchIndexerStats, ExecutionUpdatedEvent - app: IndexerPipelineResponse DTO, /pipeline endpoint on ClickHouseAdminController (field + ctor param) - StorageBeanConfig.searchIndexer bean UI removed: - IndexerPipeline type + useIndexerPipeline hook in api/queries/admin/clickhouse.ts - Indexer Pipeline card in ClickHouseAdminPage.tsx (plus ProgressBar import and pipeline* CSS classes) OpenAPI schema.d.ts + openapi.json regenerated (stale /pipeline path and IndexerPipelineResponse schema removed). SearchIndex interface + ClickHouseSearchIndex impl kept — those are live and used by SearchService + ExchangeMatchEvaluator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:32:49 +02:00
hsiegeln	a694491140	fix(metrics): MetricsFlushScheduler honour ingestion config flush interval The @Scheduled placeholder read ${ingestion.flush-interval-ms:1000} (unprefixed) but IngestionConfig binds cameleer.server.ingestion.* — YAML tuning of the metrics flush interval was silently ignored and the scheduler fell back to the 1s default in every environment. Corrected to ${cameleer.server.ingestion.flush-interval-ms:1000}. (The initial attempt to bind via SpEL #{@ingestionConfig.flushIntervalMs} failed because beans registered via @EnableConfigurationProperties use a compound bean name "<prefix>-<FQN>", not the simple camelCase form. The property-placeholder path is sufficient — IngestionConfig still owns the Java-side default.) BackpressureIT: drops the obsolete workaround property `ingestion.flush-interval-ms=60000`; the single prefixed override now controls both buffer config and flush cadence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:28:00 +02:00
hsiegeln	a9a6b465d4	fix(stats): close 8 ClickHouseStatsStoreIT TZ failures (bucket DateTime('UTC') + JVM UTC pin) Two-layer fix for the TZ drift that caused stats reads to miss every row when the JVM default TZ and CH session TZ disagreed: - Insert side: ClickHouse JDBC 0.9.7 formats java.sql.Timestamp via Timestamp.toString(), which uses JVM default TZ. A CEST JVM shipping to a UTC CH server stored Unix timestamps off by the TZ offset (the triage report's original symptom). Pinned JVM default to UTC in CameleerServerApplication.main() — standard practice for observability servers that push to time-series stores. - Read side: stats_1m_* tables now declare bucket as DateTime('UTC'), MV SELECTs wrap toStartOfMinute(start_time) in toDateTime(..., 'UTC') so projections match column type, and ClickHouseStatsStore.lit(Instant) emits toDateTime('...', 'UTC') rather than a bare literal — defence in depth against future refactors. Test class pins its own JVM TZ (the store IT builds its own HikariDataSource, bypassing the main() path). Debug scaffolding from the triage investigation removed. Greenfield CH — no migration needed. Verified: 14/14 ClickHouseStatsStoreIT green, plus 84/84 across all ClickHouse IT classes (no regression from the JVM TZ default change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:25:22 +02:00
hsiegeln	0f635576a3	refactor(ingestion): drop dead legacy execution-ingestion path ExecutionController was @ConditionalOnMissingBean(ChunkAccumulator.class), and ChunkAccumulator is registered unconditionally — the legacy controller never bound in any profile. Even if it had, IngestionService.ingestExecution called executionStore.upsert(), and the only ExecutionStore impl (ClickHouseExecutionStore) threw UnsupportedOperationException from upsert and upsertProcessors. The entire RouteExecution → upsert path was dead code carrying four transitive dependencies (RouteExecution import, eventPublisher wiring, body-size-limit config, searchIndexer::onExecutionUpdated hook). Removed: - cameleer-server-app/.../controller/ExecutionController.java (whole file) - ExecutionStore.upsert + upsertProcessors (interface methods) - ClickHouseExecutionStore.upsert + upsertProcessors (thrower overrides) - IngestionService.ingestExecution + toExecutionRecord + flattenProcessors + hasAnyTraceData + truncateBody + toJson/toJsonObject helpers - IngestionService constructor now takes (DiagramStore, WriteBuffer<Metrics>); dropped ExecutionStore + Consumer<ExecutionUpdatedEvent> + bodySizeLimit - StorageBeanConfig.ingestionService(...) simplified accordingly Untouched because still in use: - ExecutionRecord / ProcessorRecord records (findById / findProcessors / SearchIndexer / DetailController) - SearchIndexer (its onExecutionUpdated never fires now since no-one publishes ExecutionUpdatedEvent, but SearchIndexerStats is still referenced by ClickHouseAdminController — separate cleanup) - TaggedExecution record has no remaining callers after this change — flagged in core-classes.md as a leftover; separate cleanup. Rule docs updated: - .claude/rules/app-classes.md: retired ExecutionController bullet, fixed stale URL for ChunkIngestionController (it owns /api/v1/data/executions, not /api/v1/ingestion/chunk/executions). - .claude/rules/core-classes.md: IngestionService surface + note the dead TaggedExecution. Full IT suite post-removal: 560 tests run, 11 F + 1 E — same 12 failures in the same 3 previously-parked classes (AgentSseControllerIT / SseSigningIT SSE-timing + ClickHouseStatsStoreIT timezone bug). No regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:50:51 +02:00
hsiegeln	b55221e90a	fix(test): SensitiveKeysAdminControllerIT — assert push-result shape, not count The pushToAgents fan-out iterates every distinct (app, env) slice in the shared agent registry. In isolated runs that's 0, but with Spring context reuse across IT classes we always see non-zero here. Assert the response has a pushResult.total field (shape) rather than exact 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:28:44 +02:00
hsiegeln	95f90f43dc	fix(test): update Forward-compat / Protocol-version / Backpressure ITs - ForwardCompatIT: send a valid ExecutionChunk envelope with extra unknown fields instead of a bare {futureField}. Was being parsed into an empty/degenerate chunk and rejected with 400. - ProtocolVersionIT.requestWithCorrectProtocolVersionPassesInterceptor: same shape fix — minimal valid chunk so the controller's 400 is not an ambiguous signal for interceptor-passthrough. - BackpressureIT: * TestPropertySource keys were "ingestion." but IngestionConfig is bound under "cameleer.server.ingestion." — overrides were ignored and the buffer stayed at its default 50_000, so the 503 overflow branch was unreachable. Corrected the keys. * MetricsFlushScheduler's @Scheduled uses a different key again ("ingestion.flush-interval-ms"), so we override that separately to stop the default 1s flush from draining the buffer mid-test. * executionIngestion_isSynchronous_returnsAccepted now uses the chunked envelope format. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:26:48 +02:00
hsiegeln	8283d531f6	fix(test): restore CH pipeline + read ITs after schema collapse ClickHouseChunkPipelineIT.setUp was loading /clickhouse/V2__executions.sql and /clickhouse/V3__processor_executions.sql — resource paths that no longer exist after `90083f88` collapsed the V1..V18 ClickHouse schema into init.sql. Swapped for ClickHouseTestHelper.executeInitSql(jdbc). ClickHouseExecutionReadIT.detailService_buildTree_withIterations was asserting getLoopIndex() on children of a split, but DetailService's seq-based buildTree path (buildTreeBySeq) maps FlatProcessorRecord.iteration into ProcessorNode.iteration — not loopIndex. The loopIndex path is only populated by buildTreeByProcessorId (the legacy ID-only fallback). Switched the assertion to getIteration() to match the seq-driven reconstruction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:22:34 +02:00
hsiegeln	d5adaaab72	fix(test): REST-drive Diagram-linking and IngestionSchema ITs Both tests extend AbstractPostgresIT and inherit the Postgres jdbcTemplate, which they were using to query ClickHouse-resident tables (executions, processor_executions, route_diagrams). Now: - DiagramLinkingIT reads diagramContentHash off the execution-detail REST response (and tolerates JSON null by normalising to empty string, which matches how the ingestion service stamps un-linked executions). - IngestionSchemaIT asserts the reconstructed processor tree through the execution-detail endpoint (covers both flattening on write and buildTree on read) and reads processor bodies via the processor-snapshot endpoint rather than raw processor_executions rows. Both tests now use the ExecutionChunk envelope on POST /data/executions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:20:05 +02:00
hsiegeln	5684479938	fix(test): rewrite SearchControllerIT seed to chunks + fix GET auth scope Largest Cluster B test: seeded 10 executions via the legacy RouteExecution shape which ChunkIngestionController silently degenerates to empty chunks, then verified via a Postgres SELECT against a ClickHouse table. Both failure modes addressed: - All 10 seed payloads are now ExecutionChunk envelopes (chunkSeq=0, final=true, flat processors[]). - Pipeline visibility probe is the env-scoped search REST endpoint (polling for the last corr-page-10 row). - searchGet() helper was using the AGENT token; env-scoped read endpoints require VIEWER+, so it now uses viewerJwt (matches what searchPost already did). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:14:56 +02:00
hsiegeln	a6e7458adb	fix(test): REST-drive Diagram / DiagramRender ITs for CH assertions DiagramControllerIT.postDiagram_dataAppearsAfterFlush now verifies via GET /api/v1/environments/{env}/apps/{app}/routes/{route}/diagram instead of a PG SELECT against the ClickHouse route_diagrams table. DiagramRenderControllerIT seeds both a diagram and an execution on the same route, then reads the stamped diagramContentHash off the execution- detail REST response to drive the flat /api/v1/diagrams/{hash}/render tests. The env-scoped endpoint only serves JSON, so SVG tests still hit the content-hash endpoint — but the hash comes from REST now, not SQL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:12:19 +02:00
hsiegeln	87bada1fc7	fix(test): rewrite Execution/Metrics ControllerITs to chunks + REST verify Same pattern as DetailControllerIT: - ExecutionControllerIT: all four tests now post ExecutionChunk envelopes (chunkSeq=0, final=true) carrying instanceId/applicationId. Flush visibility check pivoted from PG SELECT → env-scoped search REST. - MetricsControllerIT: postMetrics_dataAppearsAfterFlush now stamps collectedAt at now() and verifies through GET /environments/{env}/ agents/{id}/metrics with the default 1h lookback, looking for a non-zero bucket on the metric name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:07:25 +02:00
hsiegeln	dfacedb0ca	fix(test): rewrite DetailControllerIT seed to ExecutionChunk + REST-driven lookup POST /api/v1/data/executions is owned by ChunkIngestionController (the legacy ExecutionController path is @ConditionalOnMissingBean(ChunkAccumulator) and never binds). The old RouteExecution-shaped seed was silently parsed as an empty ExecutionChunk and nothing landed in ClickHouse. Rewrote the seed as a single final ExecutionChunk with chunkSeq=0 / final=true and a flat processors[] carrying seq + parentSeq to preserve the 3-level tree (DetailService.buildTree reconstructs the nested shape for the API response). Execution-id lookup now goes through the search REST API filtered by correlationId, per the no-raw-SQL preference. Template for the other Cluster B ITs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:04:00 +02:00
hsiegeln	9bda4d8f8d	fix(test): de-couple Flyway/ConfigEnvIsolation ITs from cross-test state Both Testcontainers Postgres ITs were asserting exact counts on rows that other classes in the shared context had already written. - FlywayMigrationIT: treat the non-seed tables (users, server_config, audit_log, application_config, app_settings) as "must exist; COUNT must return a non-negative integer" rather than expecting exactly 0. The seeded tables (roles=4, groups=1) still assert exact V1 baseline. - ConfigEnvIsolationIT.findByEnvironment_excludesOtherEnvs: use unique prefixed app slugs and switch containsExactlyInAnyOrder to contains + doesNotContain, so the cross-env filter is still verified without coupling to other tests' inserts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 21:43:29 +02:00
hsiegeln	10e2b69974	fix(test): route SecurityFilterIT protected-endpoint check to env-scoped URL The agent list moved from /api/v1/agents to /api/v1/environments/{envSlug}/agents; the 'valid JWT returns 200' test was hitting the retired flat path and getting 404. The other 'without JWT' cases still pass because Spring Security rejects them at the filter chain before URL routing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 21:41:35 +02:00
hsiegeln	e955302fe8	fix(test): add required environmentId to agent register bodies Registration now requires environmentId in the body (400 if missing), so the stale register bodies were failing every downstream test that relied on a registered agent. Affected helpers in: - BootstrapTokenIT (static constant + inline body) - JwtRefreshIT (registerAndGetTokens) - RegistrationSecurityIT (registerAgent) - SseSigningIT (registerAgentWithAuth) - AgentSseControllerIT (registerAgent helper) Also in JwtRefreshIT / RegistrationSecurityIT, the "access token can reach a protected endpoint" tests were hitting env-scoped read endpoints that now require VIEWER+. Redirected both to the AGENT-role heartbeat endpoint — it proves the token is accepted by the security filter without being coupled to RBAC rules for reader endpoints. JwtRefreshIT.refreshWithValidToken also dropped an isNotEqualTo assertion that assumed sub-second iat uniqueness — HMAC JWTs with second-precision claims are byte-identical when minted for the same subject within the same second, so the old assertion was flaky by design. SseSigningIT / AgentSseControllerIT still have SSE-connection timing failures unrelated to registration — parked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 21:24:54 +02:00
hsiegeln	97a6b2e010	fix(test): align AgentCommandControllerIT with current spec Two drifts corrected: - registerAgent helper missing required environmentId (spec: 400 if absent). - sendGroupCommand is now synchronous request-reply: returns 200 with an aggregated CommandGroupResponse {success,total,responded,responses,timedOut} — no longer 202 with {targetCount,commandIds}. Updated assertions and name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 21:18:14 +02:00
hsiegeln	7436a37b99	fix(test): align AgentRegistrationControllerIT with current spec Four drifts against the current server contract, all now corrected: - Registration body missing required environmentId (spec: 400 if absent). - Agent list moved to env-scoped /api/v1/environments/{envSlug}/agents; flat /api/v1/agents no longer exists. - heartbeatUnknownAgent now auto-heals via JWT env claim (`fb54f9cb`); the 404 branch is only reachable without a JWT, which the security filter rejects before the controller sees the request. - sseEndpoint is an absolute URL (ServletUriComponentsBuilder.fromCurrentContextPath), so assert endsWith the path rather than equals-to-relative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 21:15:16 +02:00
hsiegeln	fb54f9cbd2	fix(agent): revive DEAD agents on heartbeat (not just STALE) Some checks failed CI / cleanup-branch (push) Has been skipped Details CI / build (push) Successful in 2m5s Details CI / deploy (push) Has been cancelled Details CI / deploy-feature (push) Has been cancelled Details CI / docker (push) Has been cancelled Details Reproduction: pause a container long enough to cross both the stale and dead thresholds, then unpause. The agent resumes sending heartbeats but the server keeps it shown as DEAD. Only a full container restart (which re-registers) fixes it. Root cause: AgentRegistryService.heartbeat() only revived STALE → LIVE. A DEAD agent's heartbeat updated lastHeartbeat but left state unchanged. checkLifecycle() never downgrades DEAD either (no-op in that branch), so the agent was permanently stuck in DEAD until a register() call. Fix: extend the revival branch to also cover DEAD. Same process; a heartbeat is proof of liveness regardless of the previous state. Also: AgentLifecycleMonitor.mapTransitionEvent() now emits RECOVERED for DEAD → LIVE, mirroring its behavior for STALE → LIVE, so the lifecycle timeline captures the transition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:55:47 +02:00
hsiegeln	90083f886a	refactor(schema): collapse V1..V18 into single V1__init.sql baseline Some checks failed CI / cleanup-branch (push) Has been skipped Details CI / build (push) Successful in 2m4s Details CI / docker (push) Successful in 1m17s Details CI / deploy (push) Has been cancelled Details CI / deploy-feature (push) Has been cancelled Details The project is still greenfield (no production deployment) so this is the last safe moment to flatten the migration archaeology before the checksum history starts mattering for real. Schema changes - 18 migration files (531 lines) → one V1__init.sql (~380 lines) declaring the final end-state: RBAC + claim mappings + runtime management + config + audit + outbound + alerting, plus seed data (system roles, Admins group, default environment). - Drops the data-repair statements from V14 (firemode backfill), V16 (subjectFingerprint migration), V17 (ACKNOWLEDGED → FIRING coercion) — they were no-ops on any DB that starts at V1. - Declares condition_kind_enum with AGENT_LIFECYCLE from the start (was added retroactively by V18). - Declares alert_state_enum with three values only (was five, then swapped in V17) and alert_instances with read_at / deleted_at columns from day one (was added by V17). - alert_reads table never created (V12 created, V17 dropped). - alert_instances_open_rule_uq built with the V17 predicate from the start. Test changes - Replace V12MigrationIT / V17MigrationIT / V18MigrationIT with one SchemaBootstrapIT that asserts the combined invariants: tables present, alert_reads absent, enum value sets, alert_instances has read_at + deleted_at, open_rule_uq exists and is unique, env-delete cascade fires. Verification - pg_dump of the new V1 matches the pg_dump of V1..V18 applied in sequence (bytewise modulo column order and Postgres-auto FK names). - Full alerting IT suite (53 tests across 6 classes) green against the new schema. - The 47 pre-existing test failures on main (AgentRegistrationIT, SearchControllerIT, ClickHouseStatsStoreIT, …) are unrelated and fail identically without this change. Developer impact - Existing local DBs will fail checksum validation on boot. Wipe: docker compose down -v (or drop the tenant_default schema). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:52:22 +02:00
hsiegeln	b7d201d743	fix(alerts): add AGENT_LIFECYCLE to condition_kind_enum + readable error toasts All checks were successful CI / cleanup-branch (push) Has been skipped Details CI / build (push) Successful in 2m5s Details CI / docker (push) Successful in 1m19s Details CI / deploy-feature (push) Has been skipped Details CI / deploy (push) Successful in 37s Details Backend - V18 migration adds AGENT_LIFECYCLE to condition_kind_enum. Java ConditionKind enum shipped with this value but no Postgres migration extended the type, so any AGENT_LIFECYCLE rule insert failed with "invalid input value for enum condition_kind_enum". - ALTER TYPE ... ADD VALUE lives alone in its migration per Postgres constraint that the new value cannot be referenced in the same tx. - V18MigrationIT asserts the enum now contains all 7 kinds. Frontend - Add describeApiError(e) helper to unwrap openapi-fetch error bodies (Spring error JSON) into readable strings. String(e) on a plain object rendered "[object Object]" in toasts — the actual failure reason was hidden from the user. - Replace String(e) in all 13 toast descriptions across the alerting and outbound-connection mutation paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:23:14 +02:00

1 2 3 4

167 Commits