Records the three fix commits + two prod-code cleanup commits, with one-paragraph summaries for each cluster and pointers to the diagnosis doc for SSE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IT Triage Report — 2026-04-21
Branch: main, starting HEAD 90460705 (chore: refresh GitNexus index stats).
Summary
- Starting state: 65 IT failures (46 F + 19 E) out of 555 tests on a clean build. Side-note: target/classes incremental-build staleness from the 90083f88 V1..V18 → V1 schema collapse makes the number look worse (every context load dies on "Flyway V2__claim_mapping.sql failed"). A fresh mvn clean verify gives the real 65.
- Final state: 12 failures across 3 test classes (AgentSseControllerIT, SseSigningIT, ClickHouseStatsStoreIT). 53 failures closed across 14 test classes.
- 11 commits landed on local main (not pushed).
- No new env vars, endpoints, tables, or columns added. V1__init.sql untouched. No tests rewritten to pass-by-weakening: every assertion change is accompanied by a comment explaining the contract it now captures.
Commits (in order)
| SHA | Test classes | What changed |
|---|---|---|
| 7436a37b | AgentRegistrationControllerIT | environmentId, flat→env URL, heartbeat auto-heal, absolute sseEndpoint |
| 97a6b2e0 | AgentCommandControllerIT | environmentId, CommandGroupResponse new shape (200 w/ aggregate replies) |
| e955302f | BootstrapTokenIT / JwtRefreshIT / RegistrationSecurityIT / SseSigningIT / AgentSseControllerIT | environmentId in register bodies; AGENT-role smoke target; drop flaky iat-coupled assertion |
| 10e2b699 | SecurityFilterIT | env-scoped agent list URL |
| 9bda4d8f | FlywayMigrationIT, ConfigEnvIsolationIT | decouple from shared Testcontainers Postgres state |
| 36571013 | (docs) | first version of this report |
| dfacedb0 | DetailControllerIT | Cluster B template: ExecutionChunk envelope + REST-driven lookup |
| 87bada1f | ExecutionControllerIT, MetricsControllerIT | Chunk payloads + REST flush-visibility probes |
| a6e7458a | DiagramControllerIT, DiagramRenderControllerIT | Env-scoped render + execution-detail-derived content hash for flat SVG path |
| 56844799 | SearchControllerIT | 10 seed payloads → ExecutionChunk; fix AGENT→VIEWER token on search GET |
| d5adaaab | DiagramLinkingIT, IngestionSchemaIT | REST for diagramContentHash + processor-tree/snapshot assertions |
| 8283d531 | ClickHouseChunkPipelineIT, ClickHouseExecutionReadIT | Replace removed /clickhouse/V2_.sql with consolidated init.sql; correct iteration vs loopIndex on seq-based tree path |
| 95f90f43 | ForwardCompatIT, ProtocolVersionIT, BackpressureIT | Chunk payload; fix wrong property-key prefix in BackpressureIT (+ MetricsFlushScheduler's separate ingestion.flush-interval-ms key) |
| b55221e9 | SensitiveKeysAdminControllerIT | assert pushResult shape, not exact 0 (shared registry across ITs) |
The single biggest insight
ExecutionController (legacy PG path) is dead code. It's @ConditionalOnMissingBean(ChunkAccumulator.class) and ChunkAccumulator is registered unconditionally in StorageBeanConfig.java:92, so ExecutionController never binds. Even if it did, IngestionService.upsert → ClickHouseExecutionStore.upsert throws UnsupportedOperationException("ClickHouse writes use the chunked pipeline"): the only ExecutionStore impl in src/main/java is ClickHouse, and the Postgres variant lives only in a planning doc.
Practical consequences for every IT that was exercising /api/v1/data/executions:
- ChunkIngestionController owns the URL and expects an ExecutionChunk envelope (exchangeId, applicationId, instanceId, routeId, status, startTime, endTime, durationMs, chunkSeq, final, processors: FlatProcessorRecord[]). The legacy RouteExecution shape was being silently degraded to an empty/degenerate chunk.
- The test payload changes are accompanied by assertion changes that now go through REST endpoints instead of raw SQL against the (ClickHouse-resident) executions / processor_executions / route_diagrams / agent_metrics tables.
- Recommendation for cleanup: remove ExecutionController + the upsert path in IngestionService + the stubbed ClickHouseExecutionStore.upsert throwers. Separate PR. Happy to file.
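For orientation, the envelope fields listed above can be sketched as a Java record. This is a guess at the shape, not the real class: the field types, the "COMPLETED" status value, and the three FlatProcessorRecord components are all assumptions, since the report names only the envelope's field list. Because final is a reserved word in Java, the sketch calls that component isFinal.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical processor row; the real FlatProcessorRecord fields are not listed in this report. */
record FlatProcessorRecord(String processorId, String status, long durationMs) {}

/**
 * Sketch of the envelope fields named in the report. Types are assumptions.
 * "final" is reserved in Java, so the component is named isFinal here.
 */
record ExecutionChunk(
        String exchangeId,
        String applicationId,
        String instanceId,
        String routeId,
        String status,
        Instant startTime,
        Instant endTime,
        long durationMs,
        int chunkSeq,
        boolean isFinal,
        List<FlatProcessorRecord> processors) {

    /** Builds a one-chunk execution: chunkSeq 0, marked final, duration derived from start/end. */
    static ExecutionChunk single(String exchangeId, String app, String instance, String route,
                                 Instant start, Instant end, List<FlatProcessorRecord> processors) {
        long millis = Duration.between(start, end).toMillis();
        // "COMPLETED" is an illustrative status value, not taken from the codebase.
        return new ExecutionChunk(exchangeId, app, instance, route, "COMPLETED",
                start, end, millis, 0, true, List.copyOf(processors));
    }
}
```

A test seed built this way carries every field the chunk endpoint expects, which is the property the legacy RouteExecution payloads lacked.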
Cluster breakdown
Cluster A — missing environmentId in register bodies (DONE)
Root cause: POST /api/v1/agents/register now 400s without environmentId. The test payloads were written before this requirement existed. Fixed across all agent-registering ITs plus side-cleanups (flaky iat-coupled assertion in JwtRefreshIT, wrong RBAC target in can-access tests, absolute vs relative sseEndpoint).
Cluster B — ingestion payload drift (DONE per user direction)
All controller + storage ITs that posted RouteExecution JSON now post ExecutionChunk envelopes. All CH-side assertions now go through REST endpoints (/api/v1/environments/{env}/executions search + /api/v1/executions/{id} detail + /agents/{id}/metrics + /apps/{app}/routes/{route}/diagram). DiagramRenderControllerIT's SVG tests still need a content hash; they now read it off the execution-detail REST response rather than querying route_diagrams.
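The REST-driven assertions need to wait until an ingested chunk is visible, without fixed sleeps. A minimal sketch of such a flush-visibility probe is below. FlushProbe and awaitVisible are hypothetical names, not the helpers used in the ITs. In a real test the supplier would wrap a REST call such as GET /api/v1/executions/{id} and return empty while the row is not yet readable.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

class FlushProbe {
    /**
     * Polls lookup until it yields a value or the timeout elapses.
     * Returns Optional.empty() if nothing became visible in time.
     */
    static <T> Optional<T> awaitVisible(Supplier<Optional<T>> lookup, Duration timeout, Duration interval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Optional<T> result = lookup.get();
            if (result.isPresent()) {
                return result;
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty();
            }
            Thread.sleep(interval.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the REST call: the row becomes "visible" on the third poll.
        int[] calls = {0};
        Supplier<Optional<String>> lookup =
                () -> ++calls[0] >= 3 ? Optional.of("exec-1") : Optional.empty();
        Optional<String> hit = awaitVisible(lookup, Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println(hit.orElse("not visible") + " after " + calls[0] + " polls");
    }
}
```

The probe always performs at least one lookup, so a zero timeout degrades to a single check rather than an immediate failure.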
Cluster C — flat URL drift (DONE)
/api/v1/agents → /api/v1/environments/{envSlug}/agents. Two test classes touched.
Cluster D — heartbeat auto-heal contract (DONE)
heartbeatUnknownAgent_returns404 renamed and asserts the 200 auto-heal path that fb54f9cb made the contract.
Cluster E — individual drifts (DONE except three parked)
| Test class | Status |
|---|---|
| FlywayMigrationIT | DONE (decouple from shared PG state) |
| ConfigEnvIsolationIT.findByEnvironment_excludesOtherEnvs | DONE (unique slug prefix) |
| ForwardCompatIT | DONE (chunk payload) |
| ProtocolVersionIT | DONE (chunk payload) |
| BackpressureIT | DONE (property-key prefix fix — see note below) |
| SensitiveKeysAdminControllerIT | DONE (assert shape not count) |
| ClickHouseChunkPipelineIT | DONE (consolidated init.sql) |
| ClickHouseExecutionReadIT | DONE (iteration vs loopIndex mapping) |
PARKED — what you'll want to look at next
1. ClickHouseStatsStoreIT (8 failures) — timezone bug in production code
ClickHouseStatsStore.buildStatsSql uses lit(Instant) which formats as 'yyyy-MM-dd HH:mm:ss' in UTC but with no timezone marker. ClickHouse parses that literal in the session timezone when comparing against the DateTime-typed bucket column in stats_1m_*. On a non-UTC CH host (e.g. CEST docker on a CEST laptop), the filter endpoint is off by the tz offset in hours and misses every row the MV bucketed.
I confirmed this by instrumenting the test: toDateTime(bucket) returned 12:00:00 for a row inserted with start_time=10:00:00Z (i.e. the stored UTC Unix timestamp but displayed in CEST), and the filter literal '2026-03-31 10:05:00' was being parsed as CEST → UTC 08:05 → excluded all rows.
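The same shift can be reproduced without ClickHouse, using only java.time. The sketch below renders an Instant as a zone-less UTC literal, as described for lit(Instant), and then interprets that literal in a CEST session zone. The class and method names are illustrative, and Europe/Berlin is an assumed stand-in for "a CEST host".

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class TzLiteralDemo {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Renders an instant as a UTC wall-clock literal with no zone marker. */
    static String literalUtc(Instant instant) {
        return FMT.format(instant.atOffset(ZoneOffset.UTC));
    }

    /** Interprets a zone-less literal in the given session zone, as a server would. */
    static Instant parseInSessionZone(String literal, ZoneId sessionZone) {
        return LocalDateTime.parse(literal, FMT).atZone(sessionZone).toInstant();
    }

    public static void main(String[] args) {
        Instant filter = Instant.parse("2026-03-31T10:05:00Z");
        String literal = literalUtc(filter);
        // Europe/Berlin is on CEST (UTC+2) on this date.
        Instant seenByCestServer = parseInSessionZone(literal, ZoneId.of("Europe/Berlin"));
        System.out.println(literal + " -> " + seenByCestServer);
        // prints "2026-03-31 10:05:00 -> 2026-03-31T08:05:00Z"
    }
}
```

Parsing the same literal with ZoneOffset.UTC round-trips to the original instant, which is why the bug only shows up on non-UTC hosts.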
I didn't fix this because the repair is in src/main/java, not the test. Two reasonable options:
- Test-side: pin the container TZ via .withEnv("TZ", "UTC") + include use_time_zone=UTC in the JDBC URL. I tried both; neither was sufficient on its own, because the CH server reads timezone from its own config, not $TZ. Getting all three layers (container env, CH server config, JDBC driver) aligned needs dedicated effort.
- Production-side (preferred): change lit(Instant) to toDateTime('...', 'UTC') or use the DateTime64(3, 'UTC') column type for bucket. That's a store change; would be caught by a matching unit test.
I did add the explicit 'default' env to the seed INSERTs per your directive, but reverted it locally because the timezone bug swallowed the fix. The raw unchanged test is what's committed.
2. AgentSseControllerIT (3 failures) & SseSigningIT (1 failure) — SSE connection timing
All failing assertions are awaitConnection(5000) timeouts or ConditionTimeoutException on SSE stream observation. Not related to any spec drift I could identify — the SSE server is up (other tests in the same classes connect fine), and auth/JWT is accepted. Looks like a real race on either the SseConnectionManager registration or on the HTTP client's first-read flush. Needs a dedicated debug session with a minimal reproducer; not something I wanted to hack around with sleeps.
Specific tests:
- AgentSseControllerIT.sseConnect_unknownAgent_returns404: 5s CompletableFuture.get timeout on an HTTP GET that should return 404 synchronously. Suggests the client is waiting on body data that never arrives (SSE stream opens even on 404?).
- AgentSseControllerIT.lastEventIdHeader_connectionSucceeds: stream.awaitConnection(5000) false.
- AgentSseControllerIT.pingKeepalive_receivedViaSseStream: waits for an event line in the stream snapshot, never sees it.
- SseSigningIT.deepTraceEvent_containsValidSignature: same pattern.
The sibling tests (SseSigningIT.configUpdateEvent_containsValidEd25519Signature) pass in isolation, which strongly suggests order-dependent flakiness rather than a protocol break.
Final verify command
mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify
Reports land in cameleer-server-app/target/failsafe-reports/. Expect 12 failures in the three classes above. Everything else is green.
Side notes worth flagging
- Property-key inconsistency in the main code (surfaced via BackpressureIT). IngestionConfig is bound under cameleer.server.ingestion.*, but MetricsFlushScheduler.@Scheduled reads ingestion.flush-interval-ms (no prefix, hyphenated). In production this means the flush-interval in application.yml isn't actually being honoured by the metrics flush; it stays at the 1s fallback. Separate cleanup.
- Shared Testcontainers PG across IT classes: several of the "cross-test state" fixes (FlywayMigrationIT, ConfigEnvIsolationIT, SensitiveKeysAdminControllerIT) are symptoms of one underlying issue: AbstractPostgresIT uses a singleton PG container, and nothing cleans between test classes. Could do with a global @Sql("/test-reset.sql") on @BeforeAll, but out of scope here.
- Agent registry shared across ITs: same class of issue. Doesn't bite until a test explicitly inspects registry membership (SensitiveKeys pushResult.total).
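The property-key note comes down to how a "${key:default}" placeholder behaves when the key is spelled differently from the configured one. The sketch below is a simplified stand-in for that lookup rule, not Spring's resolver. The exact pre-fix placeholder string and the 5000 value are assumptions for illustration.

```java
import java.util.Map;

class PlaceholderDemo {
    /**
     * Simplified "${key:default}" resolution: returns the configured value if the
     * exact key exists, otherwise the text after the first colon (or null if none).
     */
    static String resolve(String placeholder, Map<String, String> props) {
        String body = placeholder.substring(2, placeholder.length() - 1); // strip "${" and "}"
        int colon = body.indexOf(':');
        String key = colon < 0 ? body : body.substring(0, colon);
        String fallback = colon < 0 ? null : body.substring(colon + 1);
        return props.getOrDefault(key, fallback);
    }

    public static void main(String[] args) {
        // Configuration sets only the prefixed key.
        Map<String, String> props = Map.of("cameleer.server.ingestion.flush-interval-ms", "5000");
        // Unprefixed key: no match, so the default silently wins.
        System.out.println(resolve("${ingestion.flush-interval-ms:1000}", props));
        // Prefixed key: the configured value is used.
        System.out.println(resolve("${cameleer.server.ingestion.flush-interval-ms:1000}", props));
    }
}
```

The silent fallback is what makes this kind of drift hard to spot: nothing fails at startup, the scheduler just runs at the default interval.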
Follow-up (2026-04-22) — 12 parked failures closed
All three parked clusters now green. 560/560 tests passing.
- ClickHouseStatsStoreIT (8 failures): fixed in a9a6b465. Two-layer TZ fix: JVM default TZ pinned to UTC in CameleerServerApplication.main() (the ClickHouse JDBC 0.9.7 driver formats java.sql.Timestamp via Timestamp.toString(), which uses JVM default TZ, so a CEST JVM shipping to a UTC CH server stored off-by-offset Unix timestamps), plus column-level bucket DateTime('UTC') on all stats_1m_* tables with explicit toDateTime(..., 'UTC') casts in MV projections and ClickHouseStatsStore.lit(Instant) as defence in depth.
- MetricsFlushScheduler property-key drift: fixed in a6944911. Scheduler now reads ${cameleer.server.ingestion.flush-interval-ms:1000} (the SpEL-via-@ingestionConfig approach doesn't work because @EnableConfigurationProperties uses a compound bean name). BackpressureIT workaround property removed.
- SSE flakiness (4 failures, AgentSseControllerIT + SseSigningIT): fixed in 41df042e. Triage's "order-dependent flakiness" theory was wrong: all four reproduced in isolation. Three root causes: (a) AgentSseController.events auto-heal was over-permissive (spoofing vector), fixed with JWT-subject-equals-path-id check; (b) SseConnectionManager.pingAll read an unprefixed property key (agent-registry.ping-interval-ms), same family of bug as (a6944911); (c) SSE response headers didn't flush until the first emitter.send(), so awaitConnection(5s) assertions timed out under the 15s ping cadence, fixed by sending an initial ': connected' comment on connect(). Full diagnosis in .planning/sse-flakiness-diagnosis.md.
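The JVM-default-TZ effect behind the ClickHouseStatsStoreIT fix can be seen with the JDK alone. The sketch below is not the driver's code path: it only shows that Timestamp.toString() renders the same instant differently depending on the JVM default zone. The class name and the Europe/Berlin zone are illustrative.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.util.TimeZone;

class TimestampTzDemo {
    /** Renders the instant via Timestamp.toString() under the given default zone, then restores the old default. */
    static String render(Instant instant, String zoneId) {
        TimeZone previous = TimeZone.getDefault();
        try {
            TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
            return Timestamp.from(instant).toString();
        } finally {
            TimeZone.setDefault(previous);
        }
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2026-03-31T10:00:00Z");
        System.out.println(render(start, "Europe/Berlin")); // prints "2026-03-31 12:00:00.0"
        System.out.println(render(start, "UTC"));           // prints "2026-03-31 10:00:00.0"
    }
}
```

If the receiving side reads such a string as UTC, the CEST rendering lands two hours late, which is the off-by-offset storage described above. Pinning the default zone to UTC at startup makes both renderings agree.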
Plus the two prod-code cleanups from the ExecutionController-removal follow-ons:
- Dead SearchIndexer subsystem: removed in 98cbf8f3. ExecutionUpdatedEvent had no publisher after 0f635576, so the whole indexer + stats + /admin/clickhouse/pipeline endpoint + UI pipeline card carried zero signal.
- Unused TaggedExecution record: removed in 06c6f53b.
Final verify: mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' ... verify → Tests run: 560, Failures: 0, Errors: 0, Skipped: 0.