Files
cameleer-server/docs/superpowers/specs/2026-04-21-it-triage-followups-design.md
hsiegeln 6c1cbc289c docs(spec): IT triage follow-ups — design
Design for closing the 12 parked IT failures (ClickHouseStatsStoreIT
timezone, SSE flakiness in AgentSseControllerIT/SseSigningIT) plus two
production-code side notes the ExecutionController removal surfaced:

- ClickHouseStatsStore timezone fix — column-level DateTime('UTC') on
  bucket, greenfield CH
- SSE flakiness — diagnose-then-fix with user checkpoint between phases
- MetricsFlushScheduler property-key fix — bind via SpEL, single source
  of truth in IngestionConfig
- Dead-code cleanup — SearchIndexer.onExecutionUpdated listener +
  unused TaggedExecution record

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:03:08 +02:00

215 lines
13 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# IT Triage Follow-Ups — Design
**Date:** 2026-04-21
**Branch:** `main` (local, not pushed)
**Starting HEAD:** `0f635576` (refactor(ingestion): drop dead legacy execution-ingestion path)
**Context source:** `.planning/it-triage-report.md`
## Goal
Close the three tracks the IT triage report parked, plus two production-code cleanups flagged by the ExecutionController removal commit, so that `mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' verify` returns **0 failures**.
## Non-goals
- Test-infrastructure hygiene (shared Testcontainers PG, shared agent registry across ITs). Report called these out as a separate concern — they stay deferred.
- Rewriting tests to pass-by-weakening. Every assertion stays as strong or stronger than current.
- New env vars, endpoints, DB tables, or schema columns beyond what's explicitly listed below.
## Scope — 4 items
1. **ClickHouseStatsStore timezone fix** — column-level `DateTime('UTC')` on `bucket`, greenfield CH (no migration)
2. **SSE flakiness** — diagnose-then-fix with a user checkpoint between the two phases
3. **MetricsFlushScheduler property-key fix** — bind via SpEL so `IngestionConfig` is the single source of truth
4. **Dead-code cleanup**`SearchIndexer.onExecutionUpdated` + `SearchIndexerStats` (possibly), and the unused `TaggedExecution` record
## Item 1 — ClickHouseStatsStore timezone fix (8 failures)
### Failing tests
`ClickHouseStatsStoreIT` — 8 assertions that filter by a time window currently miss every row the MV bucketed because the filter literal is parsed in session TZ (CEST in the test env) while the `bucket` column stores UTC Unix timestamps.
### Root cause
`ClickHouseStatsStore.buildStatsSql` emits `lit(Instant)` which formats as `'yyyy-MM-dd HH:mm:ss'` with no timezone marker. ClickHouse parses that literal in the session timezone when comparing against the bare `DateTime`-typed `bucket` column. On a CEST CH host, `'2026-03-31 10:05:00'` becomes UTC `08:05:00` — off by the CEST offset — so the row inserted at `start_time = 10:00:00Z` (bucketed to `10:00:00` UTC) is excluded.
The report's evidence: `toDateTime(bucket)` returned `12:00:00` for a row whose `start_time` was `10:00:00Z` — the stored UTC timestamp displayed in CEST.
### Fix — column-level TZ
Greenfield applies (pre-prod, no existing data to migrate). Changes:
1. **`cameleer-server-app/src/main/resources/clickhouse/init.sql`**
- Change `bucket DateTime``bucket DateTime('UTC')` on every `stats_1m_*` target table
- Wrap `toStartOfMinute(...)` in `toDateTime(toStartOfMinute(...), 'UTC')` in every MV SELECT that produces a `bucket` value, so the MV output matches the column type
- Audit the whole file for any other `bucket`-typed columns or any other `DateTime`-typed column that participates in time-range filtering; if found, apply the same treatment
2. **`ClickHouseStatsStore.buildStatsSql`**
- With the column now `DateTime('UTC')`, jOOQ's `lit(Instant)` literal should cast into UTC correctly. If it doesn't (quick verify in the failing ITs after the schema change), switch to an explicit `toDateTime('...', 'UTC')` literal.
- No behavioural change to the method signature or callers.
### Blast radius
- `gitnexus_impact({target: "buildStatsSql", direction: "upstream"})` before editing
- `gitnexus_impact({target: "ClickHouseStatsStore", direction: "upstream"})` — identify all stats read paths
- Every MV definition touched → any dashboard or API reading `stats_1m_*` sees the same corrected values
### Verification
The 8 failing ITs in `ClickHouseStatsStoreIT` are the regression net. No new tests. After the fix, all 8 go green without touching the test code or container TZ env.
### Commits
1 commit: `fix(stats): store bucket as DateTime('UTC') so reads don't depend on CH session TZ`
## Item 2 — SSE flakiness (4 failures, diagnose-then-fix)
### Failing tests
- `AgentSseControllerIT.sseConnect_unknownAgent_returns404` — 5s timeout on what should be a synchronous 404
- `AgentSseControllerIT.lastEventIdHeader_connectionSucceeds``stream.awaitConnection(5000)` returns false
- `AgentSseControllerIT.pingKeepalive_receivedViaSseStream` — keepalive never observed in stream snapshot
- `SseSigningIT.deepTraceEvent_containsValidSignature``awaitConnection` pattern, never sees signed event
Sibling test `SseSigningIT.configUpdateEvent_containsValidEd25519Signature` passes in isolation — strong signal of order-dependent flakiness, not a protocol break.
### Phase 2a — Diagnosis
One commit, markdown-only: `docs(debug): SSE flakiness root-cause analysis`.
Steps:
1. **Baseline in isolation.** Run each failing test solo (`-Dit.test=AgentSseControllerIT#sseConnect_unknownAgent_returns404` etc.) to confirm it passes alone. Record.
2. **Bisect test order.** Run the full IT suite with `-Dsurefire.runOrder=alphabetical` and `-Dsurefire.runOrder=reversealphabetical`. Identify which prior IT class poisons the state.
3. **Inspect shared singletons.** Read `SseConnectionManager`, `AgentInstanceRegistry`, the Tomcat async thread pool config, any singleton HTTP client used by the `SseTestClient` harness. Look for state that persists across Spring context reuse when `@DirtiesContext` isn't applied.
4. **Inspect `sseConnect_unknownAgent_returns404` specifically.** A synchronous 404 that hangs 5s is suspicious on its own. Likely cause: the controller opens the `SseEmitter` *before* validating agent existence, so the test client sees an open stream and the `CompletableFuture` waits on body data that never arrives. That would be a controller bug — a real finding, not a test problem.
5. **Write `.planning/sse-flakiness-diagnosis.md`** with: named root cause, evidence (test output, log excerpts, code references), proposed fix, risk. Commit only this file.
### CHECKPOINT
Stop and present the diagnosis to the user. Do not proceed to Phase 2b until approved — the fix shape depends entirely on what the diagnosis finds, and we can't responsibly plan it up front.
### Phase 2b — Fix (12 commits, shape TBD)
Likely shapes (to be locked by diagnosis):
- **If shared-singleton state poisoning** → add `@DirtiesContext(classMode = BEFORE_CLASS)` on the affected IT classes, or add a proper reset bean (e.g. `SseConnectionManager.clear()` called from a test-only `@Component`).
- **If `sseConnect_unknownAgent_returns404` controller bug** → reorder `AgentSseController` to call `agentRegistry.lookup()` *before* creating the `SseEmitter`; return a synchronous `ResponseEntity.notFound()` when the agent is unknown.
- **If thread-pool exhaustion** → explicit bounded async pool with sizing tied to test count.
Any fix must be accompanied by a `.claude/rules/app-classes.md` update if controller behaviour changes.
### Blast radius
Depends on diagnosis. `gitnexus_impact` on whatever symbols the diagnosis names, before the fix commit lands.
### Verification
The 4 failing ITs are the regression net. Fix lands only once all 4 go green and the sibling passing tests stay green.
### Commits
1 diagnosis commit + 12 fix commits.
## Item 3 — MetricsFlushScheduler property-key fix
### Root cause
`IngestionConfig` is `@ConfigurationProperties("cameleer.server.ingestion")`. `MetricsFlushScheduler.@Scheduled(fixedRateString = "${ingestion.flush-interval-ms:1000}")` uses a key with no `cameleer.server.` prefix. The YAML key `cameleer.server.ingestion.flush-interval-ms` is never resolved by the scheduler; the `:1000` fallback is always used. Prod config of flush interval is silently ignored.
### Fix
Bind via SpEL so `IngestionConfig` is the single source of truth and the `:1000` default doesn't get duplicated between YAML and `@Scheduled`:
```java
@Scheduled(fixedRateString = "#{@ingestionConfig.flushIntervalMs}")
```
Requires `IngestionConfig` to be a named bean (usually via `@ConfigurationProperties` + `@EnableConfigurationProperties` — verify it is, or ensure the bean name is `ingestionConfig` / whatever SpEL resolves). If the default on `IngestionConfig.flushIntervalMs` field isn't already `1000`, keep it there — it's the single place the default is now defined.
### Blast radius
- `gitnexus_impact({target: "IngestionConfig.getFlushIntervalMs", direction: "upstream"})` — confirm no other `@Scheduled` strings depend on the old unprefixed key
- `gitnexus_impact({target: "MetricsFlushScheduler", direction: "upstream"})` — confirm no test depends on the old placeholder string
### Verification
No new test. The prod bug is "silent config not honoured" — testing `@Scheduled` placeholder resolution is framework plumbing and not worth a test. Manual verification: set `cameleer.server.ingestion.flush-interval-ms: 250` in `application.yml` and confirm logs show 250ms flush cadence rather than 1s.
### Commits
1 commit: `fix(metrics): MetricsFlushScheduler honour ingestion config flush interval`.
## Item 4 — Dead-code cleanup (2 commits)
Flagged in `0f635576`'s commit message as follow-on cleanups from the ExecutionController removal.
### 4a — SearchIndexer.onExecutionUpdated (+ possibly SearchIndexerStats)
After the ExecutionController removal, `SearchIndexer.onExecutionUpdated` is subscribed to `ExecutionUpdatedEvent`, but nothing publishes that event anymore. The method can never fire. `SearchIndexerStats` is still referenced by `ClickHouseAdminController`.
Decide based on what `SearchIndexerStats` tracks once read:
- **(a)** If `SearchIndexerStats` tracks only dead signals → delete the whole subsystem (listener + stats class + admin controller exposure + UI consumer if any)
- **(b)** If it still tracks live signals (e.g. search index build time) → delete just the listener method and keep the stats class
Approach: read `SearchIndexerStats` and `ClickHouseAdminController` before committing to a shape; pick (a) or (b) accordingly; note the decision in the commit message.
Blast radius: `gitnexus_impact({target: "onExecutionUpdated"})` and `gitnexus_impact({target: "SearchIndexerStats"})`.
Rule update: `.claude/rules/core-classes.md`.
### 4b — TaggedExecution record
Commit message on `0f635576` says no remaining callers. Verify with `gitnexus_context({name: "TaggedExecution"})` before deleting — if anything surprises us (e.g. a test file referencing it), fold that into the delete commit.
Blast radius: `gitnexus_impact({target: "TaggedExecution"})` — expect empty upstream.
Rule update: `.claude/rules/core-classes.md` (it's explicitly listed there as a leftover).
### Commits
2 commits, one per piece:
- `refactor(search): drop dead SearchIndexer.onExecutionUpdated listener` (plus stats cleanup if applicable)
- `refactor(model): remove unused TaggedExecution record`
## Execution order
**Wave 1 — parallelizable, no inter-dependencies:**
- Item 1 (CH timezone)
- Item 3 (scheduler SpEL)
- Item 4a (SearchIndexer)
- Item 4b (TaggedExecution)
**Wave 2 — sequential with user checkpoint:**
- Item 2a (SSE diagnosis)
- **CHECKPOINT** — user reviews diagnosis
- Item 2b (SSE fix)
Total commits: 56 on local `main`, not pushed (same convention as the triage report already established).
## Verification — final
After all commits land:
```bash
mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify
```
Expected: **0 failures**. Reports in `cameleer-server-app/target/failsafe-reports/`.
## Cross-cutting rules
- **Every symbol edit:** `gitnexus_impact({target, direction: "upstream"})` before the edit, warn on HIGH/CRITICAL risk per CLAUDE.md
- **Before each commit:** `gitnexus_detect_changes({scope: "staged"})` — verify scope matches expectation
- **After each commit:** GitNexus index re-runs via PostToolUse hook per CLAUDE.md
- **`.claude/rules/` updates** are part of the same commit as the class-level change, not a separate task
- **No new env vars, endpoints, tables, or columns** beyond what's explicitly listed in this spec
- **No tests rewritten to pass-by-weakening.** Every assertion change is accompanied by a comment capturing the contract it now expresses
## Risks
- **Item 2 diagnosis finds nothing conclusive.** Mitigation: the diagnosis commit documents what was ruled out, user decides whether to continue investigating or park with `@Disabled` + a GH issue pointer. No code-side workaround (sleeps, retries) per report's explicit direction.
- **Item 1 MV recreation drops existing local dev data.** Greenfield means this is acceptable. Local devs re-run their smoke scenarios. No prod impact since there is no prod yet.
- **Item 3 SpEL bean name resolution fails.** `IngestionConfig` might not be registered with the exact bean name `ingestionConfig`. Mitigation: verify bean name via `ApplicationContext.getBeanNamesForType(IngestionConfig.class)` in a quick smoke before committing; if the name differs, use the actual name in SpEL.
- **Item 4a decision is harder than expected.** If `SearchIndexerStats` has a live UI consumer the cleanup scope changes. Mitigation: read first, commit to (a) or (b) with a clear note.