docs(spec): IT triage follow-ups — design
Design for closing the 12 parked IT failures (ClickHouseStatsStoreIT
timezone, SSE flakiness in AgentSseControllerIT/SseSigningIT) plus two
production-code side notes the ExecutionController removal surfaced:
- ClickHouseStatsStore timezone fix — column-level DateTime('UTC') on
bucket, greenfield CH
- SSE flakiness — diagnose-then-fix with user checkpoint between phases
- MetricsFlushScheduler property-key fix — bind via SpEL, single source
of truth in IngestionConfig
- Dead-code cleanup — SearchIndexer.onExecutionUpdated listener +
unused TaggedExecution record
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
214
docs/superpowers/specs/2026-04-21-it-triage-followups-design.md
Normal file
214
docs/superpowers/specs/2026-04-21-it-triage-followups-design.md
Normal file
@@ -0,0 +1,214 @@
|
||||
# IT Triage Follow-Ups — Design
|
||||
|
||||
**Date:** 2026-04-21
|
||||
**Branch:** `main` (local, not pushed)
|
||||
**Starting HEAD:** `0f635576` (refactor(ingestion): drop dead legacy execution-ingestion path)
|
||||
**Context source:** `.planning/it-triage-report.md`
|
||||
|
||||
## Goal
|
||||
|
||||
Close the three tracks the IT triage report parked, plus two production-code cleanups flagged by the ExecutionController removal commit, so that `mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' verify` returns **0 failures**.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- Test-infrastructure hygiene (shared Testcontainers PG, shared agent registry across ITs). Report called these out as a separate concern — they stay deferred.
|
||||
- Rewriting tests to pass-by-weakening. Every assertion stays as strong or stronger than current.
|
||||
- New env vars, endpoints, DB tables, or schema columns beyond what's explicitly listed below.
|
||||
|
||||
## Scope — 4 items
|
||||
|
||||
1. **ClickHouseStatsStore timezone fix** — column-level `DateTime('UTC')` on `bucket`, greenfield CH (no migration)
|
||||
2. **SSE flakiness** — diagnose-then-fix with a user checkpoint between the two phases
|
||||
3. **MetricsFlushScheduler property-key fix** — bind via SpEL so `IngestionConfig` is the single source of truth
|
||||
4. **Dead-code cleanup** — `SearchIndexer.onExecutionUpdated` + `SearchIndexerStats` (possibly), and the unused `TaggedExecution` record
|
||||
|
||||
## Item 1 — ClickHouseStatsStore timezone fix (8 failures)
|
||||
|
||||
### Failing tests
|
||||
|
||||
`ClickHouseStatsStoreIT` — 8 assertions that filter by a time window currently miss every row the MV bucketed because the filter literal is parsed in session TZ (CEST in the test env) while the `bucket` column stores UTC Unix timestamps.
|
||||
|
||||
### Root cause
|
||||
|
||||
`ClickHouseStatsStore.buildStatsSql` emits `lit(Instant)` which formats as `'yyyy-MM-dd HH:mm:ss'` with no timezone marker. ClickHouse parses that literal in the session timezone when comparing against the bare `DateTime`-typed `bucket` column. On a CEST CH host, `'2026-03-31 10:05:00'` becomes UTC `08:05:00` — off by the CEST offset — so the row inserted at `start_time = 10:00:00Z` (bucketed to `10:00:00` UTC) is excluded.
|
||||
|
||||
The report's evidence: `toDateTime(bucket)` returned `12:00:00` for a row whose `start_time` was `10:00:00Z` — the stored UTC timestamp displayed in CEST.
|
||||
|
||||
### Fix — column-level TZ
|
||||
|
||||
Greenfield applies (pre-prod, no existing data to migrate). Changes:
|
||||
|
||||
1. **`cameleer-server-app/src/main/resources/clickhouse/init.sql`**
|
||||
- Change `bucket DateTime` → `bucket DateTime('UTC')` on every `stats_1m_*` target table
|
||||
- Wrap `toStartOfMinute(...)` in `toDateTime(toStartOfMinute(...), 'UTC')` in every MV SELECT that produces a `bucket` value, so the MV output matches the column type
|
||||
- Audit the whole file for any other `bucket`-typed columns or any other `DateTime`-typed column that participates in time-range filtering; if found, apply the same treatment
|
||||
|
||||
2. **`ClickHouseStatsStore.buildStatsSql`**
|
||||
- With the column now `DateTime('UTC')`, jOOQ's `lit(Instant)` literal should cast into UTC correctly. If it doesn't (quick verify in the failing ITs after the schema change), switch to an explicit `toDateTime('...', 'UTC')` literal.
|
||||
- No behavioural change to the method signature or callers.
|
||||
|
||||
### Blast radius
|
||||
|
||||
- `gitnexus_impact({target: "buildStatsSql", direction: "upstream"})` before editing
|
||||
- `gitnexus_impact({target: "ClickHouseStatsStore", direction: "upstream"})` — identify all stats read paths
|
||||
- Every MV definition touched → any dashboard or API reading `stats_1m_*` sees the same corrected values
|
||||
|
||||
### Verification
|
||||
|
||||
The 8 failing ITs in `ClickHouseStatsStoreIT` are the regression net. No new tests. After the fix, all 8 go green without touching the test code or container TZ env.
|
||||
|
||||
### Commits
|
||||
|
||||
1 commit: `fix(stats): store bucket as DateTime('UTC') so reads don't depend on CH session TZ`
|
||||
|
||||
## Item 2 — SSE flakiness (4 failures, diagnose-then-fix)
|
||||
|
||||
### Failing tests
|
||||
|
||||
- `AgentSseControllerIT.sseConnect_unknownAgent_returns404` — 5s timeout on what should be a synchronous 404
|
||||
- `AgentSseControllerIT.lastEventIdHeader_connectionSucceeds` — `stream.awaitConnection(5000)` returns false
|
||||
- `AgentSseControllerIT.pingKeepalive_receivedViaSseStream` — keepalive never observed in stream snapshot
|
||||
- `SseSigningIT.deepTraceEvent_containsValidSignature` — `awaitConnection` pattern, never sees signed event
|
||||
|
||||
Sibling test `SseSigningIT.configUpdateEvent_containsValidEd25519Signature` passes in isolation — strong signal of order-dependent flakiness, not a protocol break.
|
||||
|
||||
### Phase 2a — Diagnosis
|
||||
|
||||
One commit, markdown-only: `docs(debug): SSE flakiness root-cause analysis`.
|
||||
|
||||
Steps:
|
||||
|
||||
1. **Baseline in isolation.** Run each failing test solo (`-Dit.test=AgentSseControllerIT#sseConnect_unknownAgent_returns404` etc.) to confirm it passes alone. Record.
|
||||
2. **Bisect test order.** Run the full IT suite with `-Dsurefire.runOrder=alphabetical` and `-Dsurefire.runOrder=reversealphabetical`. Identify which prior IT class poisons the state.
|
||||
3. **Inspect shared singletons.** Read `SseConnectionManager`, `AgentInstanceRegistry`, the Tomcat async thread pool config, any singleton HTTP client used by the `SseTestClient` harness. Look for state that persists across Spring context reuse when `@DirtiesContext` isn't applied.
|
||||
4. **Inspect `sseConnect_unknownAgent_returns404` specifically.** A synchronous 404 that hangs 5s is suspicious on its own. Likely cause: the controller opens the `SseEmitter` *before* validating agent existence, so the test client sees an open stream and the `CompletableFuture` waits on body data that never arrives. That would be a controller bug — a real finding, not a test problem.
|
||||
5. **Write `.planning/sse-flakiness-diagnosis.md`** with: named root cause, evidence (test output, log excerpts, code references), proposed fix, risk. Commit only this file.
|
||||
|
||||
### CHECKPOINT
|
||||
|
||||
Stop and present the diagnosis to the user. Do not proceed to Phase 2b until approved — the fix shape depends entirely on what the diagnosis finds, and we can't responsibly plan it up front.
|
||||
|
||||
### Phase 2b — Fix (1–2 commits, shape TBD)
|
||||
|
||||
Likely shapes (to be locked by diagnosis):
|
||||
|
||||
- **If shared-singleton state poisoning** → add `@DirtiesContext(classMode = BEFORE_CLASS)` on the affected IT classes, or add a proper reset bean (e.g. `SseConnectionManager.clear()` called from a test-only `@Component`).
|
||||
- **If `sseConnect_unknownAgent_returns404` controller bug** → reorder `AgentSseController` to call `agentRegistry.lookup()` *before* creating the `SseEmitter`; return a synchronous `ResponseEntity.notFound()` when the agent is unknown.
|
||||
- **If thread-pool exhaustion** → explicit bounded async pool with sizing tied to test count.
|
||||
|
||||
Any fix must be accompanied by a `.claude/rules/app-classes.md` update if controller behaviour changes.
|
||||
|
||||
### Blast radius
|
||||
|
||||
Depends on diagnosis. `gitnexus_impact` on whatever symbols the diagnosis names, before the fix commit lands.
|
||||
|
||||
### Verification
|
||||
|
||||
The 4 failing ITs are the regression net. Fix lands only once all 4 go green and the sibling passing tests stay green.
|
||||
|
||||
### Commits
|
||||
|
||||
1 diagnosis commit + 1–2 fix commits.
|
||||
|
||||
## Item 3 — MetricsFlushScheduler property-key fix
|
||||
|
||||
### Root cause
|
||||
|
||||
`IngestionConfig` is `@ConfigurationProperties("cameleer.server.ingestion")`. `MetricsFlushScheduler.@Scheduled(fixedRateString = "${ingestion.flush-interval-ms:1000}")` uses a key with no `cameleer.server.` prefix. The YAML key `cameleer.server.ingestion.flush-interval-ms` is never resolved by the scheduler; the `:1000` fallback is always used. Prod config of flush interval is silently ignored.
|
||||
|
||||
### Fix
|
||||
|
||||
Bind via SpEL so `IngestionConfig` is the single source of truth and the `:1000` default doesn't get duplicated between YAML and `@Scheduled`:
|
||||
|
||||
```java
|
||||
@Scheduled(fixedRateString = "#{@ingestionConfig.flushIntervalMs}")
|
||||
```
|
||||
|
||||
Requires `IngestionConfig` to be a named bean (usually via `@ConfigurationProperties` + `@EnableConfigurationProperties` — verify it is, or ensure the bean name is `ingestionConfig` / whatever SpEL resolves). If the default on `IngestionConfig.flushIntervalMs` field isn't already `1000`, keep it there — it's the single place the default is now defined.
|
||||
|
||||
### Blast radius
|
||||
|
||||
- `gitnexus_impact({target: "IngestionConfig.getFlushIntervalMs", direction: "upstream"})` — confirm no other `@Scheduled` strings depend on the old unprefixed key
|
||||
- `gitnexus_impact({target: "MetricsFlushScheduler", direction: "upstream"})` — confirm no test depends on the old placeholder string
|
||||
|
||||
### Verification
|
||||
|
||||
No new test. The prod bug is "silent config not honoured" — testing `@Scheduled` placeholder resolution is framework plumbing and not worth a test. Manual verification: set `cameleer.server.ingestion.flush-interval-ms: 250` in `application.yml` and confirm logs show 250ms flush cadence rather than 1s.
|
||||
|
||||
### Commits
|
||||
|
||||
1 commit: `fix(metrics): MetricsFlushScheduler honour ingestion config flush interval`.
|
||||
|
||||
## Item 4 — Dead-code cleanup (2 commits)
|
||||
|
||||
Flagged in `0f635576`'s commit message as follow-on cleanups from the ExecutionController removal.
|
||||
|
||||
### 4a — SearchIndexer.onExecutionUpdated (+ possibly SearchIndexerStats)
|
||||
|
||||
After the ExecutionController removal, `SearchIndexer.onExecutionUpdated` is subscribed to `ExecutionUpdatedEvent`, but nothing publishes that event anymore. The method can never fire. `SearchIndexerStats` is still referenced by `ClickHouseAdminController`.
|
||||
|
||||
Decide based on what `SearchIndexerStats` tracks once read:
|
||||
|
||||
- **(a)** If `SearchIndexerStats` tracks only dead signals → delete the whole subsystem (listener + stats class + admin controller exposure + UI consumer if any)
|
||||
- **(b)** If it still tracks live signals (e.g. search index build time) → delete just the listener method and keep the stats class
|
||||
|
||||
Approach: read `SearchIndexerStats` and `ClickHouseAdminController` before committing to a shape; pick (a) or (b) accordingly; note the decision in the commit message.
|
||||
|
||||
Blast radius: `gitnexus_impact({target: "onExecutionUpdated"})` and `gitnexus_impact({target: "SearchIndexerStats"})`.
|
||||
|
||||
Rule update: `.claude/rules/core-classes.md`.
|
||||
|
||||
### 4b — TaggedExecution record
|
||||
|
||||
Commit message on `0f635576` says no remaining callers. Verify with `gitnexus_context({name: "TaggedExecution"})` before deleting — if anything surprises us (e.g. a test file referencing it), fold that into the delete commit.
|
||||
|
||||
Blast radius: `gitnexus_impact({target: "TaggedExecution"})` — expect empty upstream.
|
||||
|
||||
Rule update: `.claude/rules/core-classes.md` (it's explicitly listed there as a leftover).
|
||||
|
||||
### Commits
|
||||
|
||||
2 commits, one per piece:
|
||||
- `refactor(search): drop dead SearchIndexer.onExecutionUpdated listener` (plus stats cleanup if applicable)
|
||||
- `refactor(model): remove unused TaggedExecution record`
|
||||
|
||||
## Execution order
|
||||
|
||||
**Wave 1 — parallelizable, no inter-dependencies:**
|
||||
- Item 1 (CH timezone)
|
||||
- Item 3 (scheduler SpEL)
|
||||
- Item 4a (SearchIndexer)
|
||||
- Item 4b (TaggedExecution)
|
||||
|
||||
**Wave 2 — sequential with user checkpoint:**
|
||||
- Item 2a (SSE diagnosis)
|
||||
- **CHECKPOINT** — user reviews diagnosis
|
||||
- Item 2b (SSE fix)
|
||||
|
||||
Total commits: 5–6 on local `main`, not pushed (same convention as the triage report already established).
|
||||
|
||||
## Verification — final
|
||||
|
||||
After all commits land:
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify
|
||||
```
|
||||
|
||||
Expected: **0 failures**. Reports in `cameleer-server-app/target/failsafe-reports/`.
|
||||
|
||||
## Cross-cutting rules
|
||||
|
||||
- **Every symbol edit:** `gitnexus_impact({target, direction: "upstream"})` before the edit, warn on HIGH/CRITICAL risk per CLAUDE.md
|
||||
- **Before each commit:** `gitnexus_detect_changes({scope: "staged"})` — verify scope matches expectation
|
||||
- **After each commit:** GitNexus index re-runs via PostToolUse hook per CLAUDE.md
|
||||
- **`.claude/rules/` updates** are part of the same commit as the class-level change, not a separate task
|
||||
- **No new env vars, endpoints, tables, or columns** beyond what's explicitly listed in this spec
|
||||
- **No tests rewritten to pass-by-weakening.** Every assertion change is accompanied by a comment capturing the contract it now expresses
|
||||
|
||||
## Risks
|
||||
|
||||
- **Item 2 diagnosis finds nothing conclusive.** Mitigation: the diagnosis commit documents what was ruled out, user decides whether to continue investigating or park with `@Disabled` + a GH issue pointer. No code-side workaround (sleeps, retries) per report's explicit direction.
|
||||
- **Item 1 MV recreation drops existing local dev data.** Greenfield means this is acceptable. Local devs re-run their smoke scenarios. No prod impact since there is no prod yet.
|
||||
- **Item 3 SpEL bean name resolution fails.** `IngestionConfig` might not be registered with the exact bean name `ingestionConfig`. Mitigation: verify bean name via `ApplicationContext.getBeanNamesForType(IngestionConfig.class)` in a quick smoke before committing; if the name differs, use the actual name in SpEL.
|
||||
- **Item 4a decision is harder than expected.** If `SearchIndexerStats` has a live UI consumer the cleanup scope changes. Mitigation: read first, commit to (a) or (b) with a clear note.
|
||||
Reference in New Issue
Block a user