cameleer-server/docs/superpowers/specs/2026-04-21-it-triage-followups-design.md
hsiegeln 6c1cbc289c docs(spec): IT triage follow-ups — design
Design for closing the 12 parked IT failures (ClickHouseStatsStoreIT
timezone, SSE flakiness in AgentSseControllerIT/SseSigningIT) plus two
production-code side notes the ExecutionController removal surfaced:

- ClickHouseStatsStore timezone fix — column-level DateTime('UTC') on
  bucket, greenfield CH
- SSE flakiness — diagnose-then-fix with user checkpoint between phases
- MetricsFlushScheduler property-key fix — bind via SpEL, single source
  of truth in IngestionConfig
- Dead-code cleanup — SearchIndexer.onExecutionUpdated listener +
  unused TaggedExecution record

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:03:08 +02:00


IT Triage Follow-Ups — Design

Date: 2026-04-21
Branch: main (local, not pushed)
Starting HEAD: 0f635576 (refactor(ingestion): drop dead legacy execution-ingestion path)
Context source: .planning/it-triage-report.md

Goal

Close the three tracks the IT triage report parked, plus two production-code cleanups flagged by the ExecutionController removal commit, so that mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' verify returns 0 failures.

Non-goals

  • Test-infrastructure hygiene (shared Testcontainers PG, shared agent registry across ITs). Report called these out as a separate concern — they stay deferred.
  • Rewriting tests to pass-by-weakening. Every assertion stays as strong or stronger than current.
  • New env vars, endpoints, DB tables, or schema columns beyond what's explicitly listed below.

Scope — 4 items

  1. ClickHouseStatsStore timezone fix — column-level DateTime('UTC') on bucket, greenfield CH (no migration)
  2. SSE flakiness — diagnose-then-fix with a user checkpoint between the two phases
  3. MetricsFlushScheduler property-key fix — bind via SpEL so IngestionConfig is the single source of truth
  4. Dead-code cleanup — SearchIndexer.onExecutionUpdated + SearchIndexerStats (possibly), and the unused TaggedExecution record

Item 1 — ClickHouseStatsStore timezone fix (8 failures)

Failing tests

ClickHouseStatsStoreIT — 8 assertions that filter by a time window currently miss every row the MV bucketed because the filter literal is parsed in session TZ (CEST in the test env) while the bucket column stores UTC Unix timestamps.

Root cause

ClickHouseStatsStore.buildStatsSql emits lit(Instant) which formats as 'yyyy-MM-dd HH:mm:ss' with no timezone marker. ClickHouse parses that literal in the session timezone when comparing against the bare DateTime-typed bucket column. On a CEST CH host, '2026-03-31 10:05:00' becomes UTC 08:05:00 — off by the CEST offset — so the row inserted at start_time = 10:00:00Z (bucketed to 10:00:00 UTC) is excluded.

The report's evidence: toDateTime(bucket) returned 12:00:00 for a row whose start_time was 10:00:00Z — the stored UTC timestamp displayed in CEST.
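The offset arithmetic can be reproduced with plain java.time, with no ClickHouse involved. This is a minimal sketch of how the same zone-less literal lands two hours apart depending on the session zone (class and method names here are illustrative):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BareLiteralDemo {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Parse a zone-less literal the way a CH session does: in the session zone.
    static Instant parseAs(String literal, ZoneId sessionZone) {
        return LocalDateTime.parse(literal, FMT).atZone(sessionZone).toInstant();
    }

    public static void main(String[] args) {
        String literal = "2026-03-31 10:05:00"; // what lit(Instant) emits: no TZ marker
        // CEST session (UTC+2 on that date): the literal shifts to 08:05Z
        System.out.println(parseAs(literal, ZoneId.of("Europe/Berlin"))); // 2026-03-31T08:05:00Z
        // UTC session, or a DateTime('UTC') column: the literal stays 10:05Z
        System.out.println(parseAs(literal, ZoneOffset.UTC)); // 2026-03-31T10:05:00Z
    }
}
```

The row bucketed at 10:00:00 UTC falls outside a filter window that silently moved to 08:05Z, which is exactly the 8-failure pattern.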

Fix — column-level TZ

Greenfield applies (pre-prod, no existing data to migrate). Changes:

  1. cameleer-server-app/src/main/resources/clickhouse/init.sql

    • Change bucket DateTime → bucket DateTime('UTC') on every stats_1m_* target table
    • Wrap toStartOfMinute(...) in toDateTime(toStartOfMinute(...), 'UTC') in every MV SELECT that produces a bucket value, so the MV output matches the column type
    • Audit the whole file for any other bucket-typed columns or any other DateTime-typed column that participates in time-range filtering; if found, apply the same treatment
  2. ClickHouseStatsStore.buildStatsSql

    • With the column now DateTime('UTC'), jOOQ's lit(Instant) literal should cast into UTC correctly. If it doesn't (quick verify in the failing ITs after the schema change), switch to an explicit toDateTime('...', 'UTC') literal.
    • No behavioural change to the method signature or callers.
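As a sketch of the init.sql shape step 1 describes (table, view, and column names here are illustrative; the real file defines the actual stats_1m_* tables and MVs):

```sql
-- Before: bare DateTime, interpreted in whatever zone the session uses
--   bucket DateTime

-- After: column-level zone pins reads and literal comparisons to UTC
CREATE TABLE stats_1m_example
(
    bucket DateTime('UTC'),
    cnt    UInt64
)
ENGINE = SummingMergeTree
ORDER BY bucket;

-- MV SELECT wraps the bucket expression so its output matches the column type
CREATE MATERIALIZED VIEW stats_1m_example_mv TO stats_1m_example AS
SELECT
    toDateTime(toStartOfMinute(start_time), 'UTC') AS bucket,
    count() AS cnt
FROM executions
GROUP BY bucket;
```

If the jOOQ lit(Instant) literal still misbehaves after this schema change, the read-side fallback is the explicit toDateTime('…', 'UTC') literal described in step 2.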

Blast radius

  • gitnexus_impact({target: "buildStatsSql", direction: "upstream"}) before editing
  • gitnexus_impact({target: "ClickHouseStatsStore", direction: "upstream"}) — identify all stats read paths
  • Every MV definition touched → any dashboard or API reading stats_1m_* sees the same corrected values

Verification

The 8 failing ITs in ClickHouseStatsStoreIT are the regression net. No new tests. After the fix, all 8 go green without touching the test code or container TZ env.

Commits

1 commit: fix(stats): store bucket as DateTime('UTC') so reads don't depend on CH session TZ

Item 2 — SSE flakiness (4 failures, diagnose-then-fix)

Failing tests

  • AgentSseControllerIT.sseConnect_unknownAgent_returns404 — 5s timeout on what should be a synchronous 404
  • AgentSseControllerIT.lastEventIdHeader_connectionSucceeds — stream.awaitConnection(5000) returns false
  • AgentSseControllerIT.pingKeepalive_receivedViaSseStream — keepalive never observed in stream snapshot
  • SseSigningIT.deepTraceEvent_containsValidSignature — awaitConnection pattern, never sees signed event

Sibling test SseSigningIT.configUpdateEvent_containsValidEd25519Signature passes in isolation — strong signal of order-dependent flakiness, not a protocol break.

Phase 2a — Diagnosis

One commit, markdown-only: docs(debug): SSE flakiness root-cause analysis.

Steps:

  1. Baseline in isolation. Run each failing test solo (-Dit.test=AgentSseControllerIT#sseConnect_unknownAgent_returns404 etc.) to confirm it passes alone. Record.
  2. Bisect test order. Run the full IT suite with -Dfailsafe.runOrder=alphabetical and again with -Dfailsafe.runOrder=reversealphabetical (the ITs run under Failsafe, which reads failsafe.runOrder rather than surefire.runOrder). Identify which prior IT class poisons the state.
  3. Inspect shared singletons. Read SseConnectionManager, AgentInstanceRegistry, the Tomcat async thread pool config, any singleton HTTP client used by the SseTestClient harness. Look for state that persists across Spring context reuse when @DirtiesContext isn't applied.
  4. Inspect sseConnect_unknownAgent_returns404 specifically. A synchronous 404 that hangs 5s is suspicious on its own. Likely cause: the controller opens the SseEmitter before validating agent existence, so the test client sees an open stream and the CompletableFuture waits on body data that never arrives. That would be a controller bug — a real finding, not a test problem.
  5. Write .planning/sse-flakiness-diagnosis.md with: named root cause, evidence (test output, log excerpts, code references), proposed fix, risk. Commit only this file.

CHECKPOINT

Stop and present the diagnosis to the user. Do not proceed to Phase 2b until approved — the fix shape depends entirely on what the diagnosis finds, and we can't responsibly plan it up front.

Phase 2b — Fix (1–2 commits, shape TBD)

Likely shapes (to be locked by diagnosis):

  • If shared-singleton state poisoning → add @DirtiesContext(classMode = BEFORE_CLASS) on the affected IT classes, or add a proper reset bean (e.g. SseConnectionManager.clear() called from a test-only @Component).
  • If sseConnect_unknownAgent_returns404 controller bug → reorder AgentSseController to call agentRegistry.lookup() before creating the SseEmitter; return a synchronous ResponseEntity.notFound() when the agent is unknown.
  • If thread-pool exhaustion → explicit bounded async pool with sizing tied to test count.

Any fix must be accompanied by a .claude/rules/app-classes.md update if controller behaviour changes.
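The second shape above can be modeled in plain Java to make the hang mechanism concrete. Every name below is illustrative, not the real controller API:

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Toy model of the suspected ordering bug: if the stream is opened before the
// agent lookup, an unknown agent leaves the client waiting on a body that
// never arrives, instead of getting a synchronous 404.
public class SseOrderingModel {
    static final Set<String> REGISTRY = Set.of("agent-1");

    // Buggy shape: "emitter" (modeled as a future) created first, lookup afterwards
    static String connectBuggy(String agentId) {
        CompletableFuture<String> body = new CompletableFuture<>(); // stream already open
        if (REGISTRY.contains(agentId)) {
            body.complete("200 event-stream");
        }
        // Unknown agent: nothing ever completes the future, so the client times out.
        return body.getNow("HANGS (timeout)");
    }

    // Fixed shape: validate before opening anything
    static String connectFixed(String agentId) {
        if (!REGISTRY.contains(agentId)) {
            return "404"; // synchronous, no stream opened
        }
        return "200 event-stream";
    }
}
```

Whether the real controller matches this model is precisely what Phase 2a has to establish before any fix lands.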

Blast radius

Depends on diagnosis. gitnexus_impact on whatever symbols the diagnosis names, before the fix commit lands.

Verification

The 4 failing ITs are the regression net. Fix lands only once all 4 go green and the sibling passing tests stay green.

Commits

1 diagnosis commit + 1–2 fix commits.

Item 3 — MetricsFlushScheduler property-key fix

Root cause

IngestionConfig is @ConfigurationProperties("cameleer.server.ingestion"). MetricsFlushScheduler's @Scheduled(fixedRateString = "${ingestion.flush-interval-ms:1000}") uses a key without the cameleer.server. prefix, so the YAML key cameleer.server.ingestion.flush-interval-ms is never resolved by the scheduler and the :1000 fallback is always used. Prod configuration of the flush interval is silently ignored.
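A toy resolver mimicking Spring's ${key:default} placeholder semantics shows the silent fallback. The map stands in for the YAML, and the resolver is illustrative, not Spring's actual implementation:

```java
import java.util.Map;

public class PlaceholderDemo {
    // Minimal "${key:default}" resolution: look the key up, else take the fallback.
    static String resolve(String placeholder, Map<String, String> props) {
        String inner = placeholder.substring(2, placeholder.length() - 1); // strip ${ and }
        int colon = inner.indexOf(':');
        String key = colon < 0 ? inner : inner.substring(0, colon);
        String fallback = colon < 0 ? null : inner.substring(colon + 1);
        return props.getOrDefault(key, fallback);
    }

    public static void main(String[] args) {
        Map<String, String> yaml =
                Map.of("cameleer.server.ingestion.flush-interval-ms", "250");
        // The unprefixed key never matches the YAML, so the :1000 fallback always wins
        System.out.println(resolve("${ingestion.flush-interval-ms:1000}", yaml));
        // The fully prefixed key would have picked up the configured 250
        System.out.println(resolve("${cameleer.server.ingestion.flush-interval-ms:1000}", yaml));
    }
}
```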

Fix

Bind via SpEL so IngestionConfig is the single source of truth and the :1000 default doesn't get duplicated between YAML and @Scheduled:

@Scheduled(fixedRateString = "#{@ingestionConfig.flushIntervalMs}")

Requires IngestionConfig to be resolvable as a named bean. Classes registered via @EnableConfigurationProperties get a generated bean name (the prefix plus the fully-qualified class name), so @ingestionConfig may not resolve as written; verify the actual bean name first, or register the class so the conventional name applies. Keep the 1000 default on the IngestionConfig.flushIntervalMs field: that field is now the single place the default is defined.

Blast radius

  • gitnexus_impact({target: "IngestionConfig.getFlushIntervalMs", direction: "upstream"}) — confirm no other @Scheduled strings depend on the old unprefixed key
  • gitnexus_impact({target: "MetricsFlushScheduler", direction: "upstream"}) — confirm no test depends on the old placeholder string

Verification

No new test. The prod bug is "silent config not honoured" — testing @Scheduled placeholder resolution is framework plumbing and not worth a test. Manual verification: set cameleer.server.ingestion.flush-interval-ms: 250 in application.yml and confirm logs show 250ms flush cadence rather than 1s.

Commits

1 commit: fix(metrics): MetricsFlushScheduler honour ingestion config flush interval.

Item 4 — Dead-code cleanup (2 commits)

Flagged in 0f635576's commit message as follow-on cleanups from the ExecutionController removal.

4a — SearchIndexer.onExecutionUpdated (+ possibly SearchIndexerStats)

After the ExecutionController removal, SearchIndexer.onExecutionUpdated is subscribed to ExecutionUpdatedEvent, but nothing publishes that event anymore. The method can never fire. SearchIndexerStats is still referenced by ClickHouseAdminController.

Decide based on what SearchIndexerStats tracks once read:

  • (a) If SearchIndexerStats tracks only dead signals → delete the whole subsystem (listener + stats class + admin controller exposure + UI consumer if any)
  • (b) If it still tracks live signals (e.g. search index build time) → delete just the listener method and keep the stats class

Approach: read SearchIndexerStats and ClickHouseAdminController before committing to a shape; pick (a) or (b) accordingly; note the decision in the commit message.

Blast radius: gitnexus_impact({target: "onExecutionUpdated"}) and gitnexus_impact({target: "SearchIndexerStats"}).

Rule update: .claude/rules/core-classes.md.

4b — TaggedExecution record

Commit message on 0f635576 says no remaining callers. Verify with gitnexus_context({name: "TaggedExecution"}) before deleting — if anything surprises us (e.g. a test file referencing it), fold that into the delete commit.

Blast radius: gitnexus_impact({target: "TaggedExecution"}) — expect empty upstream.

Rule update: .claude/rules/core-classes.md (it's explicitly listed there as a leftover).

Commits

2 commits, one per piece:

  • refactor(search): drop dead SearchIndexer.onExecutionUpdated listener (plus stats cleanup if applicable)
  • refactor(model): remove unused TaggedExecution record

Execution order

Wave 1 — parallelizable, no inter-dependencies:

  • Item 1 (CH timezone)
  • Item 3 (scheduler SpEL)
  • Item 4a (SearchIndexer)
  • Item 4b (TaggedExecution)

Wave 2 — sequential with user checkpoint:

  • Item 2a (SSE diagnosis)
  • CHECKPOINT — user reviews diagnosis
  • Item 2b (SSE fix)

Total commits: 5–6 on local main, not pushed (same convention as the triage report already established).

Verification — final

After all commits land:

mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify

Expected: 0 failures. Reports in cameleer-server-app/target/failsafe-reports/.

Cross-cutting rules

  • Every symbol edit: gitnexus_impact({target, direction: "upstream"}) before the edit, warn on HIGH/CRITICAL risk per CLAUDE.md
  • Before each commit: gitnexus_detect_changes({scope: "staged"}) — verify scope matches expectation
  • After each commit: GitNexus index re-runs via PostToolUse hook per CLAUDE.md
  • .claude/rules/ updates are part of the same commit as the class-level change, not a separate task
  • No new env vars, endpoints, tables, or columns beyond what's explicitly listed in this spec
  • No tests rewritten to pass-by-weakening. Every assertion change is accompanied by a comment capturing the contract it now expresses

Risks

  • Item 2 diagnosis finds nothing conclusive. Mitigation: the diagnosis commit documents what was ruled out, user decides whether to continue investigating or park with @Disabled + a GH issue pointer. No code-side workaround (sleeps, retries) per report's explicit direction.
  • Item 1 MV recreation drops existing local dev data. Greenfield means this is acceptable. Local devs re-run their smoke scenarios. No prod impact since there is no prod yet.
  • Item 3 SpEL bean name resolution fails. IngestionConfig might not be registered with the exact bean name ingestionConfig. Mitigation: verify bean name via ApplicationContext.getBeanNamesForType(IngestionConfig.class) in a quick smoke before committing; if the name differs, use the actual name in SpEL.
  • Item 4a decision is harder than expected. If SearchIndexerStats has a live UI consumer the cleanup scope changes. Mitigation: read first, commit to (a) or (b) with a clear note.