IT Triage Follow-Ups — Design
Date: 2026-04-21
Branch: main (local, not pushed)
Starting HEAD: 0f635576 (refactor(ingestion): drop dead legacy execution-ingestion path)
Context source: .planning/it-triage-report.md
Goal
Close the three tracks the IT triage report parked, plus two production-code cleanups flagged by the ExecutionController removal commit, so that mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' verify returns 0 failures.
Non-goals
- Test-infrastructure hygiene (shared Testcontainers PG, shared agent registry across ITs). Report called these out as a separate concern — they stay deferred.
- Rewriting tests to pass-by-weakening. Every assertion stays as strong or stronger than current.
- New env vars, endpoints, DB tables, or schema columns beyond what's explicitly listed below.
Scope — 4 items
- ClickHouseStatsStore timezone fix — column-level DateTime('UTC') on bucket, greenfield CH (no migration)
- SSE flakiness — diagnose-then-fix with a user checkpoint between the two phases
- MetricsFlushScheduler property-key fix — bind via SpEL so IngestionConfig is the single source of truth
- Dead-code cleanup — SearchIndexer.onExecutionUpdated + SearchIndexerStats (possibly), and the unused TaggedExecution record
Item 1 — ClickHouseStatsStore timezone fix (8 failures)
Failing tests
ClickHouseStatsStoreIT — 8 assertions that filter by a time window currently miss every row the MV bucketed because the filter literal is parsed in session TZ (CEST in the test env) while the bucket column stores UTC Unix timestamps.
Root cause
ClickHouseStatsStore.buildStatsSql emits lit(Instant) which formats as 'yyyy-MM-dd HH:mm:ss' with no timezone marker. ClickHouse parses that literal in the session timezone when comparing against the bare DateTime-typed bucket column. On a CEST CH host, '2026-03-31 10:05:00' becomes UTC 08:05:00 — off by the CEST offset — so the row inserted at start_time = 10:00:00Z (bucketed to 10:00:00 UTC) is excluded.
The report's evidence: toDateTime(bucket) returned 12:00:00 for a row whose start_time was 10:00:00Z — the stored UTC timestamp displayed in CEST.
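The offset mechanics can be reproduced outside ClickHouse with plain java.time — the same bare literal lands two hours early when interpreted in a CEST zone (Europe/Paris stands in here for the test host's session TZ; the literal matches the one the report quotes):

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TzDemo {
    public static void main(String[] args) {
        // The shape jOOQ's lit(Instant) emits: no timezone marker.
        String literal = "2026-03-31 10:05:00";
        LocalDateTime ldt = LocalDateTime.parse(
                literal, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));

        // Parsed in a CEST session — what CH does against a bare DateTime column:
        Instant asSessionTz = ldt.atZone(ZoneId.of("Europe/Paris")).toInstant();
        // Parsed as UTC — what the bucket column actually stores:
        Instant asUtc = ldt.atZone(ZoneOffset.UTC).toInstant();

        System.out.println(asSessionTz); // 2026-03-31T08:05:00Z — two hours early
        System.out.println(asUtc);       // 2026-03-31T10:05:00Z
        System.out.println(Duration.between(asSessionTz, asUtc).toHours()); // 2
    }
}
```

A filter lower bound of 10:05 thus becomes 08:05 UTC under CEST parsing — but the failing assertions expect rows *inside* the window to match, and the upper bound shifts early by the same two hours, excluding the 10:00:00Z bucket.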
Fix — column-level TZ
Greenfield applies (pre-prod, no existing data to migrate). Changes:
- cameleer-server-app/src/main/resources/clickhouse/init.sql
  - Change bucket DateTime → bucket DateTime('UTC') on every stats_1m_* target table
  - Wrap toStartOfMinute(...) in toDateTime(toStartOfMinute(...), 'UTC') in every MV SELECT that produces a bucket value, so the MV output matches the column type
  - Audit the whole file for any other bucket-typed columns or any other DateTime-typed column that participates in time-range filtering; if found, apply the same treatment
- ClickHouseStatsStore.buildStatsSql
  - With the column now DateTime('UTC'), jOOQ's lit(Instant) literal should cast into UTC correctly. If it doesn't (quick verify in the failing ITs after the schema change), switch to an explicit toDateTime('...', 'UTC') literal.
  - No behavioural change to the method signature or callers.
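A sketch of the DDL shape, with hypothetical table, MV, and column names standing in for the real ones in init.sql (stats_1m_example, executions, start_time, agent_id are all placeholders):

```sql
-- Before: bucket DateTime — literals compared against it parse in session TZ.
-- After: column-level TZ pins every comparison to UTC.
CREATE TABLE stats_1m_example
(
    bucket   DateTime('UTC'),
    agent_id String,
    cnt      UInt64
)
ENGINE = SummingMergeTree
ORDER BY (agent_id, bucket);

-- MV SELECT wraps the bucket expression so its type matches the column:
CREATE MATERIALIZED VIEW stats_1m_example_mv TO stats_1m_example AS
SELECT
    toDateTime(toStartOfMinute(start_time), 'UTC') AS bucket,
    agent_id,
    count() AS cnt
FROM executions
GROUP BY bucket, agent_id;
```

If the jOOQ literal still misbehaves after this, the fallback mentioned above is an explicit `toDateTime('2026-03-31 10:05:00', 'UTC')` literal in buildStatsSql.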
Blast radius
- gitnexus_impact({target: "buildStatsSql", direction: "upstream"}) before editing
- gitnexus_impact({target: "ClickHouseStatsStore", direction: "upstream"}) — identify all stats read paths
- Every MV definition touched → any dashboard or API reading stats_1m_* sees the same corrected values
Verification
The 8 failing ITs in ClickHouseStatsStoreIT are the regression net. No new tests. After the fix, all 8 go green without touching the test code or container TZ env.
Commits
1 commit: fix(stats): store bucket as DateTime('UTC') so reads don't depend on CH session TZ
Item 2 — SSE flakiness (4 failures, diagnose-then-fix)
Failing tests
- AgentSseControllerIT.sseConnect_unknownAgent_returns404 — 5s timeout on what should be a synchronous 404
- AgentSseControllerIT.lastEventIdHeader_connectionSucceeds — stream.awaitConnection(5000) returns false
- AgentSseControllerIT.pingKeepalive_receivedViaSseStream — keepalive never observed in stream snapshot
- SseSigningIT.deepTraceEvent_containsValidSignature — awaitConnection pattern, never sees signed event
Sibling test SseSigningIT.configUpdateEvent_containsValidEd25519Signature passes in isolation — strong signal of order-dependent flakiness, not a protocol break.
Phase 2a — Diagnosis
One commit, markdown-only: docs(debug): SSE flakiness root-cause analysis.
Steps:
- Baseline in isolation. Run each failing test solo (-Dit.test=AgentSseControllerIT#sseConnect_unknownAgent_returns404 etc.) to confirm it passes alone. Record.
- Bisect test order. Run the full IT suite with -Dsurefire.runOrder=alphabetical and -Dsurefire.runOrder=reversealphabetical. Identify which prior IT class poisons the state.
- Inspect shared singletons. Read SseConnectionManager, AgentInstanceRegistry, the Tomcat async thread pool config, and any singleton HTTP client used by the SseTestClient harness. Look for state that persists across Spring context reuse when @DirtiesContext isn't applied.
- Inspect sseConnect_unknownAgent_returns404 specifically. A synchronous 404 that hangs for 5s is suspicious on its own. Likely cause: the controller opens the SseEmitter before validating agent existence, so the test client sees an open stream and the CompletableFuture waits on body data that never arrives. That would be a controller bug — a real finding, not a test problem.
- Write .planning/sse-flakiness-diagnosis.md with: named root cause, evidence (test output, log excerpts, code references), proposed fix, risk. Commit only this file.
CHECKPOINT
Stop and present the diagnosis to the user. Do not proceed to Phase 2b until approved — the fix shape depends entirely on what the diagnosis finds, and we can't responsibly plan it up front.
Phase 2b — Fix (1–2 commits, shape TBD)
Likely shapes (to be locked by diagnosis):
- If shared-singleton state poisoning → add @DirtiesContext(classMode = BEFORE_CLASS) on the affected IT classes, or add a proper reset bean (e.g. SseConnectionManager.clear() called from a test-only @Component).
- If sseConnect_unknownAgent_returns404 is a controller bug → reorder AgentSseController to call agentRegistry.lookup() before creating the SseEmitter; return a synchronous ResponseEntity.notFound() when the agent is unknown.
- If thread-pool exhaustion → explicit bounded async pool with sizing tied to test count.
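If the diagnosis confirms the controller-ordering hypothesis, the reorder could look like this sketch — mapping path, method names, and the Optional-returning lookup() are assumptions, not taken from the real AgentSseController:

```java
// Validate the agent BEFORE committing to a streaming response, so an
// unknown agent gets a synchronous 404 instead of an open, silent stream
// that the test client's CompletableFuture waits on until timeout.
@GetMapping(path = "/agents/{agentId}/sse",
            produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public ResponseEntity<SseEmitter> connect(@PathVariable String agentId) {
    if (agentRegistry.lookup(agentId).isEmpty()) {  // assumed Optional-returning API
        return ResponseEntity.notFound().build();   // 404 before any emitter exists
    }
    SseEmitter emitter = new SseEmitter(0L);        // 0 = no server-side timeout
    connectionManager.register(agentId, emitter);   // assumed registration hook
    return ResponseEntity.ok(emitter);
}
```

The point of the sketch is the ordering: no SseEmitter is constructed, and no async context opened, until the agent is known to exist.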
Any fix must be accompanied by a .claude/rules/app-classes.md update if controller behaviour changes.
Blast radius
Depends on diagnosis. gitnexus_impact on whatever symbols the diagnosis names, before the fix commit lands.
Verification
The 4 failing ITs are the regression net. Fix lands only once all 4 go green and the sibling passing tests stay green.
Commits
1 diagnosis commit + 1–2 fix commits.
Item 3 — MetricsFlushScheduler property-key fix
Root cause
IngestionConfig is @ConfigurationProperties("cameleer.server.ingestion"). MetricsFlushScheduler.@Scheduled(fixedRateString = "${ingestion.flush-interval-ms:1000}") uses a key with no cameleer.server. prefix. The YAML key cameleer.server.ingestion.flush-interval-ms is never resolved by the scheduler; the :1000 fallback is always used. Prod config of flush interval is silently ignored.
Fix
Bind via SpEL so IngestionConfig is the single source of truth and the :1000 default doesn't get duplicated between YAML and @Scheduled:
@Scheduled(fixedRateString = "#{@ingestionConfig.flushIntervalMs}")
Requires IngestionConfig to be a named bean (usually via @ConfigurationProperties + @EnableConfigurationProperties — verify it is, or ensure the bean name is ingestionConfig / whatever SpEL resolves). If the default on IngestionConfig.flushIntervalMs field isn't already 1000, keep it there — it's the single place the default is now defined.
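A minimal sketch of the binding, assuming IngestionConfig exposes a flushIntervalMs property with getter/setter (names assumed from the placeholder key; the real class may differ):

```java
// Single source of truth: prefix and default both live here.
@Component  // gives the bean the predictable name "ingestionConfig";
            // with @EnableConfigurationProperties alone the generated name
            // is "<prefix>-<FQCN>", which the SpEL below would NOT resolve.
@ConfigurationProperties("cameleer.server.ingestion")
public class IngestionConfig {
    private long flushIntervalMs = 1000;  // the one place the default is defined
    public long getFlushIntervalMs() { return flushIntervalMs; }
    public void setFlushIntervalMs(long v) { this.flushIntervalMs = v; }
}

// Scheduler binds via SpEL against the bean, not a raw property key:
@Component
public class MetricsFlushScheduler {
    @Scheduled(fixedRateString = "#{@ingestionConfig.flushIntervalMs}")
    void flush() { /* drain metrics buffer */ }
}
```

The bean-name caveat in the comment is exactly the risk the Risks section calls out — verify before committing.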
Blast radius
- gitnexus_impact({target: "IngestionConfig.getFlushIntervalMs", direction: "upstream"}) — confirm no other @Scheduled strings depend on the old unprefixed key
- gitnexus_impact({target: "MetricsFlushScheduler", direction: "upstream"}) — confirm no test depends on the old placeholder string
Verification
No new test. The prod bug is "silent config not honoured" — testing @Scheduled placeholder resolution is framework plumbing and not worth a test. Manual verification: set cameleer.server.ingestion.flush-interval-ms: 250 in application.yml and confirm logs show 250ms flush cadence rather than 1s.
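The manual check amounts to this fragment in application.yml (standard layout assumed):

```yaml
cameleer:
  server:
    ingestion:
      flush-interval-ms: 250
```

Before the fix this key is silently ignored (the scheduler falls back to 1000ms); after, flush log lines should tick at roughly 250ms intervals.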
Commits
1 commit: fix(metrics): make MetricsFlushScheduler honour the configured ingestion flush interval
Item 4 — Dead-code cleanup (2 commits)
Flagged in 0f635576's commit message as follow-on cleanups from the ExecutionController removal.
4a — SearchIndexer.onExecutionUpdated (+ possibly SearchIndexerStats)
After the ExecutionController removal, SearchIndexer.onExecutionUpdated is subscribed to ExecutionUpdatedEvent, but nothing publishes that event anymore. The method can never fire. SearchIndexerStats is still referenced by ClickHouseAdminController.
Decide based on what SearchIndexerStats tracks once read:
- (a) If SearchIndexerStats tracks only dead signals → delete the whole subsystem (listener + stats class + admin controller exposure + UI consumer if any)
- (b) If it still tracks live signals (e.g. search index build time) → delete just the listener method and keep the stats class
Approach: read SearchIndexerStats and ClickHouseAdminController before committing to a shape; pick (a) or (b) accordingly; note the decision in the commit message.
Blast radius: gitnexus_impact({target: "onExecutionUpdated"}) and gitnexus_impact({target: "SearchIndexerStats"}).
Rule update: .claude/rules/core-classes.md.
4b — TaggedExecution record
Commit message on 0f635576 says no remaining callers. Verify with gitnexus_context({name: "TaggedExecution"}) before deleting — if anything surprises us (e.g. a test file referencing it), fold that into the delete commit.
Blast radius: gitnexus_impact({target: "TaggedExecution"}) — expect empty upstream.
Rule update: .claude/rules/core-classes.md (it's explicitly listed there as a leftover).
Commits
2 commits, one per piece:
- refactor(search): drop dead SearchIndexer.onExecutionUpdated listener (plus stats cleanup if applicable)
- refactor(model): remove unused TaggedExecution record
Execution order
Wave 1 — parallelizable, no inter-dependencies:
- Item 1 (CH timezone)
- Item 3 (scheduler SpEL)
- Item 4a (SearchIndexer)
- Item 4b (TaggedExecution)
Wave 2 — sequential with user checkpoint:
- Item 2a (SSE diagnosis)
- CHECKPOINT — user reviews diagnosis
- Item 2b (SSE fix)
Total commits: 5–6 on local main, not pushed (same convention as the triage report already established).
Verification — final
After all commits land:
mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify
Expected: 0 failures. Reports in cameleer-server-app/target/failsafe-reports/.
Cross-cutting rules
- Every symbol edit: gitnexus_impact({target, direction: "upstream"}) before the edit, warn on HIGH/CRITICAL risk per CLAUDE.md
- Before each commit: gitnexus_detect_changes({scope: "staged"}) — verify scope matches expectation
- After each commit: GitNexus index re-runs via PostToolUse hook per CLAUDE.md
- .claude/rules/ updates are part of the same commit as the class-level change, not a separate task
- No new env vars, endpoints, tables, or columns beyond what's explicitly listed in this spec
- No tests rewritten to pass-by-weakening. Every assertion change is accompanied by a comment capturing the contract it now expresses
Risks
- Item 2 diagnosis finds nothing conclusive. Mitigation: the diagnosis commit documents what was ruled out; the user decides whether to continue investigating or park with @Disabled + a GH issue pointer. No code-side workaround (sleeps, retries) per the report's explicit direction.
- Item 1 MV recreation drops existing local dev data. Greenfield means this is acceptable. Local devs re-run their smoke scenarios. No prod impact since there is no prod yet.
- Item 3 SpEL bean name resolution fails. IngestionConfig might not be registered with the exact bean name ingestionConfig. Mitigation: verify the bean name via ApplicationContext.getBeanNamesForType(IngestionConfig.class) in a quick smoke before committing; if the name differs, use the actual name in the SpEL expression.
- Item 4a decision is harder than expected. If SearchIndexerStats has a live UI consumer, the cleanup scope changes. Mitigation: read first, commit to (a) or (b) with a clear note.