Files
cameleer-server/docs/superpowers/plans/2026-04-21-it-triage-followups.md
hsiegeln d32208d403 docs(plan): IT triage follow-ups — implementation plan
Task-by-task plan for the 2026-04-21-it-triage-followups-design spec.
Autonomous execution variant — SSE diagnose-then-fix branches to either
apply-fix or park-with-@Disabled based on diagnosis confidence, since
this runs unattended overnight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:10:55 +02:00

42 KiB

IT Triage Follow-Ups Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Close the 12 parked IT failures from .planning/it-triage-report.md plus two prod-code side-notes, so mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' verify returns 0 failures.

Architecture: Four focused fixes (CH timezone, scheduler property key, two dead-code removals) executed atomically, each with its own commit. Then SSE flakiness as diagnose-then-fix. User is asleep during execution — no interactive checkpoint; if SSE diagnosis is inconclusive within the timebox, park the 4 failing SSE tests with @Disabled + link to the diagnosis doc and finish the rest green.

Tech Stack: Java 17, Spring Boot 3.4.3, ClickHouse 24.12 via JDBC, Testcontainers, Maven Failsafe.


Execution policy

  • Atomic commits — one task, one commit, scoped to the task's files.
  • Before each symbol edit: gitnexus_impact({target, direction: "upstream"}). Warn on HIGH/CRITICAL. Stop if unexpected dependents appear and re-scope.
  • Before each commit: gitnexus_detect_changes({scope: "staged"}). Confirm scope.
  • .claude/rules/* updates are part of the same commit as the class change, not a separate task.
  • Test-only scope — no tests rewritten to pass-by-weakening. Every change to an assertion gets a comment explaining the contract it now captures.
  • Final stepgit push origin main after all tasks commit and the full verify run is green (or yellow with the parked SSE tests only, clearly noted).

Task 0 — Baseline verify (evidence, no commit)

Files: none modified.

  • Step 0.1: Run baseline failing tests to confirm starting state
mvn -pl cameleer-server-app -am -Dit.test='ClickHouseStatsStoreIT,AgentSseControllerIT,SseSigningIT,BackpressureIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -60

Expected: 12 failures distributed across those three IT classes, matching .planning/it-triage-report.md. If the baseline count differs, stop and re-audit — the spec assumes this number.

  • Step 0.2: Record baseline to memory

Keep the failure count as the reference number. The final verify must show 0 new failures; the only acceptable regression is the SSE cluster if and only if Task 5 parks them (and the user-facing summary notes this).


Task 1 — ClickHouse timezone fix

Closes 8 failures in ClickHouseStatsStoreIT.

Files:

  • Modify: cameleer-server-app/src/main/resources/clickhouse/init.sql

  • Modify: cameleer-server-app/src/main/java/com/cameleer/server/app/storage/ClickHouseStatsStore.java:346-350

  • Modify: cameleer-server-app/src/test/java/com/cameleer/server/app/storage/ClickHouseStatsStoreIT.java:49-78 (remove debug scaffolding from the triage investigation)

  • Step 1.1: Impact analysis

Run gitnexus_impact({target: "ClickHouseStatsStore", direction: "upstream"}). Expected: SearchService, SearchController, alerting evaluator. Note the blast radius — every read-path that uses stats_1m_* tables sees the now-correct values.

  • Step 1.2: Change init.sqlbucket columns to DateTime('UTC'), MV SELECTs to emit UTC

Edit cameleer-server-app/src/main/resources/clickhouse/init.sql:

For each of the five stats tables (stats_1m_all, stats_1m_app, stats_1m_route, stats_1m_processor, stats_1m_processor_detail), change the bucket column declaration from:

    bucket         DateTime,

to:

    bucket         DateTime('UTC'),

For each of the five materialized views (stats_1m_all_mv, stats_1m_app_mv, stats_1m_route_mv, stats_1m_processor_mv, stats_1m_processor_detail_mv), change the bucket projection from:

    toStartOfMinute(start_time)           AS bucket,

to:

    toDateTime(toStartOfMinute(start_time), 'UTC') AS bucket,

The TTL bucket + INTERVAL 365 DAY DELETE lines need no change — TTL interval arithmetic is tz-agnostic.

  • Step 1.3: Verify ClickHouseStatsStore.lit(Instant) literal works against the typed column

Read ClickHouseStatsStore.java:346-350. The current formatter writes 'yyyy-MM-dd HH:mm:ss' with ZoneOffset.UTC:

private static String lit(Instant instant) {
    return "'" + java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
            .withZone(java.time.ZoneOffset.UTC)
            .format(instant.truncatedTo(ChronoUnit.SECONDS)) + "'";
}

With bucket DateTime('UTC'), a bare literal like '2026-03-31 10:05:00' is parsed by ClickHouse as being in the column's TZ (UTC). So bucket >= '2026-03-31 10:05:00' now compares UTC-to-UTC consistently. No code change required in lit(Instant) — leave it alone.

However, for defence-in-depth (so a future reader or refactor doesn't reintroduce the bug), wrap the formatted string in an explicit toDateTime('...', 'UTC') cast. Change the method to:

/**
 * Format an Instant as a ClickHouse DateTime literal explicitly typed in UTC.
 * The explicit `toDateTime(..., 'UTC')` cast avoids depending on the session
 * timezone matching the `bucket DateTime('UTC')` column type.
 */
private static String lit(Instant instant) {
    String raw = java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
            .withZone(java.time.ZoneOffset.UTC)
            .format(instant.truncatedTo(ChronoUnit.SECONDS));
    return "toDateTime('" + raw + "', 'UTC')";
}

Note: this affects both bucket >= ... comparisons and tenant_id = ... elsewhere — but tenant_id uses the lit(String) overload. Only lit(Instant) is touched.

  • Step 1.4: Remove debug scaffolding from ClickHouseStatsStoreIT.setUp()

Lines 49-78 currently contain a try-catch that runs a failing query, flushes logs, prints query log entries to stdout. This was diagnostic code from the triage investigation; it's no longer needed and pollutes CI output.

Edit cameleer-server-app/src/test/java/com/cameleer/server/app/storage/ClickHouseStatsStoreIT.java, replacing the current setUp() body with:

@BeforeEach
void setUp() throws Exception {
    HikariDataSource ds = new HikariDataSource();
    ds.setJdbcUrl(clickhouse.getJdbcUrl());
    ds.setUsername(clickhouse.getUsername());
    ds.setPassword(clickhouse.getPassword());

    jdbc = new JdbcTemplate(ds);

    ClickHouseTestHelper.executeInitSql(jdbc);

    // Truncate base tables
    jdbc.execute("TRUNCATE TABLE executions");
    jdbc.execute("TRUNCATE TABLE processor_executions");

    seedTestData();

    store = new ClickHouseStatsStore("default", jdbc);
}

And remove the now-unused imports: import java.nio.charset.StandardCharsets; (line 16) — keep everything else, the rest is still used by seedTestData and the tests.

  • Step 1.5: Run the 8 failing ITs
mvn -pl cameleer-server-app -am -Dit.test='ClickHouseStatsStoreIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -40

Expected: 0 failures, 15 passed (ClickHouseStatsStoreIT has 14 tests; count may vary — the baseline says 8 failures out of the full class count). If any failure remains, root-cause it:

  • If literal format is wrong → verify the toDateTime(..., 'UTC') cast renders correctly

  • If MV isn't emitting → the MV source expression now needs the explicit UTC wrap

  • If a different test that was previously passing now fails → CH schema change broke a reader; gitnexus_impact identifies who.

  • Step 1.6: Verify GitNexus impact surface

gitnexus_detect_changes({scope: "staged"})

Expected: init.sql, ClickHouseStatsStore.java, ClickHouseStatsStoreIT.java. Nothing else.

  • Step 1.7: Commit
git add cameleer-server-app/src/main/resources/clickhouse/init.sql \
        cameleer-server-app/src/main/java/com/cameleer/server/app/storage/ClickHouseStatsStore.java \
        cameleer-server-app/src/test/java/com/cameleer/server/app/storage/ClickHouseStatsStoreIT.java
git commit -m "$(cat <<'EOF'
fix(stats): store bucket as DateTime('UTC') so reads don't depend on CH session TZ

ClickHouseStatsStoreIT had 8 failures when the CH container's session
timezone was non-UTC (e.g. CEST): stats filter literals were parsed in
session TZ while the bucket column stored UTC Unix timestamps, and every
time-range query missed rows by the tz offset.

- init.sql: bucket columns on all stats_1m_* tables typed as
  DateTime('UTC'); MV SELECTs wrap toStartOfMinute(start_time) in
  toDateTime(..., 'UTC') so projections match the target column type.
- ClickHouseStatsStore.lit(Instant): emit toDateTime('...', 'UTC') cast
  rather than a bare literal, as defence-in-depth against future
  refactors that change column typing.
- ClickHouseStatsStoreIT.setUp: remove debug scaffolding (failing-query
  try-catch + query_log printing) from the triage investigation.

Greenfield CH — no migration needed for existing data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"

Task 2 — MetricsFlushScheduler property-key fix

Fixes the production bug that flush-interval-ms YAML config was silently ignored. No IT failures directly depend on this (BackpressureIT worked around it with a second property), but the workaround is no longer needed after the fix.

Files:

  • Modify: cameleer-server-app/src/main/java/com/cameleer/server/app/ingestion/MetricsFlushScheduler.java:33

  • Modify: cameleer-server-app/src/test/java/com/cameleer/server/app/controller/BackpressureIT.java:24-36

  • Step 2.1: Impact analysis

gitnexus_impact({target: "MetricsFlushScheduler", direction: "upstream"})
gitnexus_impact({target: "IngestionConfig.getFlushIntervalMs", direction: "upstream"})

Expected: only MetricsFlushScheduler consumes flushIntervalMs. If another @Scheduled uses the unprefixed key, fix it too (it has the same bug).

Verify IngestionConfig bean name:

grep -rn "EnableConfigurationProperties" cameleer-server-app/src/main/java

Expected: CameleerServerApplication.java has @EnableConfigurationProperties({IngestionConfig.class, AgentRegistryConfig.class}). Spring registers the bean with the default name derived from the class simple name: ingestionConfig (camelCase first-letter-lower). SpEL @ingestionConfig resolves to this bean.

  • Step 2.2: Change MetricsFlushScheduler.@Scheduled to SpEL

Edit cameleer-server-app/src/main/java/com/cameleer/server/app/ingestion/MetricsFlushScheduler.java. Replace line 33:

    @Scheduled(fixedDelayString = "${ingestion.flush-interval-ms:1000}")

with:

    @Scheduled(fixedDelayString = "#{@ingestionConfig.flushIntervalMs}")

No other change to this file.

  • Step 2.3: Drop the BackpressureIT workaround property

Edit cameleer-server-app/src/test/java/com/cameleer/server/app/controller/BackpressureIT.java lines 24-36. Replace the @TestPropertySource block with:

@TestPropertySource(properties = {
        // Property keys must match the IngestionConfig @ConfigurationProperties
        // prefix (cameleer.server.ingestion) exactly. The MetricsFlushScheduler
        // now binds its @Scheduled flush interval via SpEL on IngestionConfig,
        // so a single property override controls both the buffer config and
        // the flush cadence.
        "cameleer.server.ingestion.buffercapacity=5",
        "cameleer.server.ingestion.batchsize=5",
        "cameleer.server.ingestion.flushintervalms=60000"
})

Removed: the second ingestion.flush-interval-ms=60000 entry and its comment block.

  • Step 2.4: Run BackpressureIT
mvn -pl cameleer-server-app -am -Dit.test='BackpressureIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -30

Expected: 2 passed, 0 failed. whenMetricsBufferFull_returns503WithRetryAfter in particular must still pass — the 60s flush interval must still be honoured, proving the SpEL binding works.

  • Step 2.5: Smoke test the app bean wiring
mvn -pl cameleer-server-app compile 2>&1 | tail -10

Expected: BUILD SUCCESS. Bean name mismatches between SpEL and the actual bean name usually surface as IllegalStateException: No bean named 'ingestionConfig' available at runtime, not at compile time — BackpressureIT in Step 2.4 is the actual smoke test.

  • Step 2.6: Commit
gitnexus_detect_changes({scope: "staged"})
# Expected: only MetricsFlushScheduler.java, BackpressureIT.java

git add cameleer-server-app/src/main/java/com/cameleer/server/app/ingestion/MetricsFlushScheduler.java \
        cameleer-server-app/src/test/java/com/cameleer/server/app/controller/BackpressureIT.java
git commit -m "$(cat <<'EOF'
fix(metrics): MetricsFlushScheduler honour ingestion config flush interval

The @Scheduled placeholder read ${ingestion.flush-interval-ms:1000}
(unprefixed), but IngestionConfig binds the cameleer.server.ingestion.*
namespace — YAML config of the metrics flush interval was silently
ignored, always falling back to 1s.

- Scheduler: bind via SpEL `#{@ingestionConfig.flushIntervalMs}` so
  IngestionConfig is the single source of truth; default lives on the
  config field, not duplicated in the @Scheduled annotation.
- BackpressureIT: remove the second ingestion.flush-interval-ms=60000
  workaround property that was papering over this bug. The single
  cameleer.server.ingestion.flushintervalms override now slows the
  scheduler enough for the 503 overflow scenario to be reachable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"

Task 3 — Delete dead SearchIndexer subsystem

The ExecutionController removal commit (0f635576) left SearchIndexer.onExecutionUpdated subscribed to an event (ExecutionUpdatedEvent) that nothing publishes. The whole indexer subsystem is dead: every stat method it exposes returns always-zero values, and the admin /pipeline endpoint that consumes them is therefore vestigial.

Files:

  • Delete: cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/SearchIndexer.java

  • Delete: cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/SearchIndexerStats.java

  • Delete: cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/ExecutionUpdatedEvent.java

  • Delete: cameleer-server-app/src/main/java/com/cameleer/server/app/dto/IndexerPipelineResponse.java (if it exists as a standalone DTO — verify in Step 3.2)

  • Modify: cameleer-server-app/src/main/java/com/cameleer/server/app/controller/ClickHouseAdminController.java (remove /pipeline endpoint, indexerStats field, its constructor parameter)

  • Modify: any bean config that creates SearchIndexer (discover in Step 3.2)

  • Modify: ui/src/api/queries/admin/clickhouse.ts if it calls /pipeline (discover in Step 3.2)

  • Update: .claude/rules/core-classes.md (remove SearchIndexer/SearchIndexerStats bullets)

  • Update: .claude/rules/app-classes.md (remove /pipeline endpoint mention)

  • Step 3.1: Impact analysis

gitnexus_impact({target: "SearchIndexer", direction: "upstream"})
gitnexus_impact({target: "SearchIndexerStats", direction: "upstream"})
gitnexus_impact({target: "ExecutionUpdatedEvent", direction: "upstream"})
gitnexus_impact({target: "IndexerPipelineResponse", direction: "upstream"})

Expected: ClickHouseAdminController depends on SearchIndexerStats. The other three should have no non-self dependents after the ExecutionController removal. If anything else surprises you, STOP — something is still live and needs re-scoping.

  • Step 3.2: Discover full footprint
grep -rn "SearchIndexer\|IndexerPipelineResponse\|ExecutionUpdatedEvent" \
    --include="*.java" --include="*.ts" --include="*.tsx" --include="*.md" \
    cameleer-server-core/src cameleer-server-app/src ui/src .claude/rules

Expected matches:

  • SearchIndexer.java, SearchIndexerStats.java, ExecutionUpdatedEvent.java themselves
  • ClickHouseAdminController.java — has field + constructor param + /pipeline endpoint
  • IndexerPipelineResponse.java — DTO (check if it exists in cameleer-server-app/src/main/java/com/cameleer/server/app/dto/)
  • A bean config file (likely StorageBeanConfig.java or a dedicated indexing config) instantiating SearchIndexer
  • ui/src/api/queries/admin/clickhouse.ts — maybe queries /pipeline
  • .claude/rules/core-classes.md, .claude/rules/app-classes.md
  • design docs under docs/superpowers/specs/ — leave untouched (historical)

Make a list of every file to edit. Don't proceed until you've seen them all.

  • Step 3.3: Remove SearchIndexer instantiation from bean config

Open the file(s) found in Step 3.2 that construct SearchIndexer. Delete:

  • the @Bean SearchIndexer searchIndexer(...) method
  • the @Bean SearchIndexerStats searchIndexerStats(...) method (if it exists separately — usually just returns the SearchIndexer instance cast to the interface)
  • any private helper field such as private final ExecutionStore executionStore; that becomes unused only if it was used exclusively for constructing SearchIndexer; leave fields used by other beans.

If the bean config also pulls in SearchIndex purely to pass it to SearchIndexer, check whether anything else uses SearchIndex. If not, leave the SearchIndex bean — it may be used by the search-query path (SearchController/SearchService). Verify before deleting.

  • Step 3.4: Remove /pipeline endpoint from ClickHouseAdminController

Edit cameleer-server-app/src/main/java/com/cameleer/server/app/controller/ClickHouseAdminController.java:

  1. Remove the import import com.cameleer.server.core.indexing.SearchIndexerStats;
  2. Remove the import import com.cameleer.server.app.dto.IndexerPipelineResponse;
  3. Remove the field private final SearchIndexerStats indexerStats;
  4. Remove the constructor parameter SearchIndexerStats indexerStats and the this.indexerStats = indexerStats; assignment
  5. Remove the entire @GetMapping("/pipeline") ... public IndexerPipelineResponse getPipeline() { ... } method

The remaining controller retains /status, /tables, /performance, /queries endpoints — those don't depend on the indexer.

  • Step 3.5: Delete the dead files
git rm cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/SearchIndexer.java
git rm cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/SearchIndexerStats.java
git rm cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/ExecutionUpdatedEvent.java

If the indexing package becomes empty:

find cameleer-server-core/src/main/java/com/cameleer/server/core/indexing -type d -empty -delete 2>/dev/null
  • Step 3.6: Delete IndexerPipelineResponse DTO (if standalone)

If Step 3.2 confirmed cameleer-server-app/src/main/java/com/cameleer/server/app/dto/IndexerPipelineResponse.java exists as its own file:

git rm cameleer-server-app/src/main/java/com/cameleer/server/app/dto/IndexerPipelineResponse.java

If it's an inner record in another DTO file, leave that file alone and remove only the record definition.

  • Step 3.7: Remove UI consumer of /pipeline (if any)

If ui/src/api/queries/admin/clickhouse.ts or another UI file calls /api/v1/admin/clickhouse/pipeline:

  • Remove the query hook / fetch call
  • Remove any UI component rendering its data (likely in an admin page)
  • Run cd ui && npm run build 2>&1 | tail -20 to surface compile errors from other call sites; fix them by deleting the relevant UI sections

If no UI reference exists, skip this step.

  • Step 3.8: Regenerate OpenAPI schema

Per CLAUDE.md: any REST surface change requires regenerating ui/src/api/schema.d.ts.

Start the backend:

cd cameleer-server-app && mvn spring-boot:run &
# wait for port 8081 to be listening — poll with: until curl -sf http://localhost:8081/api-docs >/dev/null 2>&1; do sleep 2; done

Regenerate:

cd ui && npm run generate-api:live

Stop the backend. Commit includes the regenerated ui/src/api/schema.d.ts and ui/src/api/openapi.json.

If the user is offline / the backend can't start, skip this step but flag it in the commit message so a follow-up can regenerate. The TypeScript types will be out of sync until then — the build will fail if any UI code referenced /pipeline endpoint types.

  • Step 3.9: Update .claude/rules/core-classes.md

Remove these sections entirely:

  • The SearchIndexer bullet (if present in the core-classes rules)
  • Any SearchIndexerStats interface bullet
  • Any ExecutionUpdatedEvent record mention

The file is currently 100+ lines. Search for "SearchIndexer" and "ExecutionUpdatedEvent" and delete the matching lines/bullets.

  • Step 3.10: Update .claude/rules/app-classes.md

Remove:

  • The /pipeline endpoint mention under ClickHouseAdminController (line reading "GET /api/v1/admin/clickhouse/** (conditional on infrastructureendpoints flag)" can stay — /pipeline is no longer listed separately; if there was a specific /pipeline bullet, remove it).

Also grep for "SearchIndexer" in the rules and delete any residual mentions.

  • Step 3.11: Build and verify
mvn -pl cameleer-server-app -am compile 2>&1 | tail -20

Expected: BUILD SUCCESS. If a reference slipped through, the compile fails with a clear cannot find symbol pointing at the dead class.

mvn -pl cameleer-server-app -am -Dit.test='ClickHouseAdminControllerIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -20

If this IT exists and tests /pipeline, its test methods for that endpoint must be removed too. Edit the IT file, remove the /pipeline test methods. Re-run.

  • Step 3.12: Commit
gitnexus_detect_changes({scope: "staged"})
# Expected: deleted files + modified ClickHouseAdminController.java + rule updates + (optionally) UI changes and OpenAPI regen

git add -A cameleer-server-core/src/main/java/com/cameleer/server/core/indexing \
          cameleer-server-app/src/main/java/com/cameleer/server/app/controller/ClickHouseAdminController.java \
          cameleer-server-app/src/main/java/com/cameleer/server/app/dto/ \
          .claude/rules/core-classes.md .claude/rules/app-classes.md
# Only add UI/openapi paths if they actually changed:
git add ui/src/api/schema.d.ts ui/src/api/openapi.json ui/src/api/queries/admin/clickhouse.ts 2>/dev/null || true
git commit -m "$(cat <<'EOF'
refactor(search): drop dead SearchIndexer subsystem

After the ExecutionController removal (0f635576), SearchIndexer
subscribed to ExecutionUpdatedEvent but nothing publishes that event.
Every SearchIndexerStats metric returned always-zero, and the admin
/api/v1/admin/clickhouse/pipeline endpoint that surfaced those stats
carried no signal.

Removed:
- core: SearchIndexer, SearchIndexerStats, ExecutionUpdatedEvent
- app: IndexerPipelineResponse DTO, /pipeline endpoint on
  ClickHouseAdminController, field + ctor param
- bean wiring that constructed SearchIndexer
- UI query for /pipeline if it existed
- .claude/rules/{core,app}-classes.md references

OpenAPI schema regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"

Task 4 — Delete unused TaggedExecution record

The ExecutionController removal commit (0f635576) flagged TaggedExecution as having no remaining callers after the legacy PG ingest path was retired.

Files:

  • Delete: cameleer-server-core/src/main/java/com/cameleer/server/core/ingestion/TaggedExecution.java

  • Update: .claude/rules/core-classes.md

  • Step 4.1: Impact analysis

gitnexus_impact({target: "TaggedExecution", direction: "upstream"})
gitnexus_context({name: "TaggedExecution"})

Expected: empty upstream (or only documentation-file references). If a test file still imports TaggedExecution, that test is dead code too and should be deleted.

grep -rn "TaggedExecution" --include="*.java" cameleer-server-core/src cameleer-server-app/src

Expected: only TaggedExecution.java itself.

  • Step 4.2: Delete the file
git rm cameleer-server-core/src/main/java/com/cameleer/server/core/ingestion/TaggedExecution.java
  • Step 4.3: Update .claude/rules/core-classes.md

Edit the file. Find the line containing "TaggedExecution still lives in the package as a leftover" (in the ingestion/ section) and remove the parenthetical. Before:

- `MergedExecution`, `TaggedDiagram` — tagged ingestion records. `TaggedDiagram` carries `(instanceId, applicationId, environment, graph)` — env is resolved from the agent registry in the controller and stamped on the ClickHouse `route_diagrams` row. (`TaggedExecution` still lives in the package as a leftover but has no callers since the legacy PG ingest path was retired.)

After:

- `MergedExecution`, `TaggedDiagram` — tagged ingestion records. `TaggedDiagram` carries `(instanceId, applicationId, environment, graph)` — env is resolved from the agent registry in the controller and stamped on the ClickHouse `route_diagrams` row.
  • Step 4.4: Build
mvn -pl cameleer-server-core compile 2>&1 | tail -5

Expected: BUILD SUCCESS.

  • Step 4.5: Commit
gitnexus_detect_changes({scope: "staged"})
# Expected: TaggedExecution.java deleted, core-classes.md updated

git commit -m "$(cat <<'EOF'
refactor(ingestion): remove unused TaggedExecution record

No callers after the legacy PG ingestion path was retired in 0f635576.
core-classes.md updated to drop the leftover note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"

Task 5 — SSE diagnosis

Diagnose the 4 failing SSE tests before attempting a fix. Produces a markdown diagnosis doc, not code changes.

Files:

  • Create: .planning/sse-flakiness-diagnosis.md

  • Step 5.1: Run each failing test in isolation to confirm baseline

for t in "AgentSseControllerIT#sseConnect_unknownAgent_returns404" \
         "AgentSseControllerIT#lastEventIdHeader_connectionSucceeds" \
         "AgentSseControllerIT#pingKeepalive_receivedViaSseStream" \
         "SseSigningIT#deepTraceEvent_containsValidSignature"; do
    echo "=== $t ==="
    mvn -pl cameleer-server-app -am -Dit.test="$t" -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -15
done

Record for each: PASS or FAIL in isolation.

  • Step 5.2: Run all SSE tests together in both class orders
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dsurefire.runOrder=alphabetical verify 2>&1 | tail -30
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dsurefire.runOrder=reversealphabetical verify 2>&1 | tail -30

Record: which tests fail in which order.

  • Step 5.3: Investigate the sseConnect_unknownAgent_returns404 case specifically

Read AgentSseController.java:63-82. Trace the control flow when:

  • JWT is valid for agent subject X
  • Path id is unknown-sse-agent (different from JWT subject)
  • registryService.findById("unknown-sse-agent") returns null
  • jwtResult != null — so auto-heal triggers, registers unknown-sse-agent with JWT's env+application, returns 200 with SSE stream

Hypothesis: the test expects 404, but the controller's auto-heal path accepts the unknown agent because it only checks "JWT present", not "JWT subject matches path id". The 5s timeout on statusFuture.get(...) is because the 200 response opens an infinite SSE stream; BodyHandlers.ofString() waits for body completion that never comes.

Confirm by inspecting JwtAuthenticationFilter and JwtService.JwtValidationResult to see whether subject() or an equivalent agent-id claim is available on the result. Then read a nearby controller that does verify subject-vs-path-id (e.g. AgentRegistrationController.heartbeat or AgentCommandController) for the accepted pattern.

  • Step 5.4: Investigate the awaitConnection(5000) tests

For lastEventIdHeader_connectionSucceeds, pingKeepalive_receivedViaSseStream, deepTraceEvent_containsValidSignature: all register a fresh UUID-suffixed agent first, then open SSE with the JWT that was minted in setUp() for test-agent-sse-it. The JWT subject doesn't match the path id.

If Step 5.3 finds the auto-heal bug, these tests may also benefit: when JWT subject ≠ path id, these tests currently rely on auto-heal too (since the path id was freshly registered through the /register endpoint, but the JWT they use in the SSE request is a different agent's JWT).

Wait — re-check: registerAgent in the test uses bootstrapHeaders() (not JWT). It registers the agent directly. Then openSseStream uses jwt which is securityHelper.registerTestAgent("test-agent-sse-it") — a JWT for a different agent.

So in these tests:

  • Path id: fresh UUID (registered via bootstrap)
  • JWT subject: "test-agent-sse-it"
  • findById(uuid) succeeds — agent exists
  • Auto-heal NOT triggered
  • Controller calls connectionManager.connect(uuid) — returns SseEmitter

If this path works in isolation, why does it time out under full-class execution? Possibilities:

  • Tomcat async thread pool exhaustion. SseEmitter(Long.MAX_VALUE) holds a thread; prior tests' connections may not have closed before the pool fills
  • SseConnectionManager state leak — emitters from prior tests still in the map, competing
  • The ping scheduler (@Scheduled(fixedDelayString = "${agent-registry.ping-interval-ms:15000}")) — if an IOException on a stale emitter propagates

Check Tomcat async config in application.yml and any server.tomcat.* settings. Default max-threads is 200 but async handling has separate limits.

  • Step 5.5: Capture test output + logs

Run the failing tests with debug logging:

mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dlogging.level.com.cameleer=DEBUG -Dlogging.level.org.apache.catalina.core=DEBUG verify 2>&1 | tail -100

Look for:

  • SseConnectionManager log lines showing emitter count over time

  • "Replacing existing SSE connection" or "SSE connection timed out" patterns

  • Tomcat "async request timed out" warnings

  • Any NOT_FOUND being thrown that the client interprets as hanging

  • Step 5.6: Write the diagnosis doc

Create .planning/sse-flakiness-diagnosis.md with sections:

  1. Summary — 1-2 sentences, named root cause (or "inconclusive")
  2. Evidence — commands run, output snippets, code references (file:line)
  3. Hypothesis ladder — auto-heal over-permissiveness, thread pool exhaustion, singleton state leak — with confidence level for each
  4. Proposed fix — if confident: specific changes to specific files. If inconclusive: say so, recommend parking with @Disabled.
  5. Risk — what could go wrong with the proposed fix.
  • Step 5.7: Commit the diagnosis
git add .planning/sse-flakiness-diagnosis.md
git commit -m "$(cat <<'EOF'
docs(debug): SSE flakiness root-cause analysis

Investigation of the 4 parked SSE test failures documented in
.planning/it-triage-report.md. Records evidence, hypothesis ladder,
and proposed fix shape (or recommendation to park if inconclusive).

See .planning/sse-flakiness-diagnosis.md for details; Task 6 (or a
skip-to-final-verify) follows based on the conclusion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"

Task 6 — SSE fix (branches based on Task 5 diagnosis)

Decision tree:

  • If Task 5 landed a confident root cause → follow Task 6.A
  • If Task 5 found auto-heal over-permissiveness as the (whole or partial) cause → follow Task 6.B
  • If Task 5 was inconclusive or the fix exceeds a 45-minute timebox → follow Task 6.C (park)

Task 6.A — Fix per diagnosis finding

  • Step 6.A.1: Impact analysis on the symbols identified by diagnosis
gitnexus_impact({target: "<symbol_named_by_diagnosis>", direction: "upstream"})
  • Step 6.A.2: Apply the fix exactly as the diagnosis prescribes

Follow the "Proposed fix" section of .planning/sse-flakiness-diagnosis.md step-by-step. Do not adapt or extend — the diagnosis is the plan.

  • Step 6.A.3: Run the 4 failing SSE tests
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -30

Expected: 0 failures. If any remain, the diagnosis was incomplete — fall back to Task 6.C and park the residual.

  • Step 6.A.4: Commit
git commit -m "$(cat <<'EOF'
fix(sse): <one-line description from diagnosis>

<2-3 sentence explanation pulled from diagnosis doc>

Closes 4 parked SSE test failures from .planning/it-triage-report.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"

Task 6.B — Auto-heal guard (likely fix if Step 5.3 confirms)

If diagnosis confirmed AgentSseController auto-heals regardless of JWT subject vs path id.

  • Step 6.B.1: Impact analysis
gitnexus_impact({target: "AgentSseController.events", direction: "upstream"})
gitnexus_impact({target: "JwtService.JwtValidationResult", direction: "upstream"})
  • Step 6.B.2: Inspect JwtValidationResult for subject access

Read the JwtValidationResult class/record. Confirm it exposes the JWT subject (likely .subject() or similar accessor). Note the field name.

  • Step 6.B.3: Add the guard to AgentSseController.events

Edit cameleer-server-app/src/main/java/com/cameleer/server/app/controller/AgentSseController.java:63-76.

Replace the auto-heal block:

AgentInfo agent = registryService.findById(id);
if (agent == null) {
    // Auto-heal: re-register agent from JWT claims after server restart
    var jwtResult = (JwtService.JwtValidationResult) httpRequest.getAttribute(
            JwtAuthenticationFilter.JWT_RESULT_ATTR);
    if (jwtResult != null) {
        String application = jwtResult.application() != null ? jwtResult.application() : "default";
        String env = jwtResult.environment() != null ? jwtResult.environment() : "default";
        registryService.register(id, id, application, env, "unknown", List.of(), Map.of());
        log.info("Auto-registered agent {} (app={}, env={}) from SSE connect after server restart", id, application, env);
    } else {
        throw new ResponseStatusException(HttpStatus.NOT_FOUND, "Agent not found: " + id);
    }
}

with:

AgentInfo agent = registryService.findById(id);
if (agent == null) {
    // Auto-heal re-registers an agent from JWT claims after server restart,
    // but only when the JWT subject matches the path id. Otherwise a
    // different agent could spoof any agentId in the URL.
    var jwtResult = (JwtService.JwtValidationResult) httpRequest.getAttribute(
            JwtAuthenticationFilter.JWT_RESULT_ATTR);
    if (jwtResult != null && id.equals(jwtResult.subject())) {
        String application = jwtResult.application() != null ? jwtResult.application() : "default";
        String env = jwtResult.environment() != null ? jwtResult.environment() : "default";
        registryService.register(id, id, application, env, "unknown", List.of(), Map.of());
        log.info("Auto-registered agent {} (app={}, env={}) from SSE connect after server restart", id, application, env);
    } else {
        throw new ResponseStatusException(HttpStatus.NOT_FOUND, "Agent not found: " + id);
    }
}

Adjust jwtResult.subject() to the actual accessor method from Step 6.B.2 (could be .subject(), .instanceId(), .agentId(), etc.).

  • Step 6.B.4: Run sseConnect_unknownAgent_returns404
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT#sseConnect_unknownAgent_returns404' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -15

Expected: PASS. The controller now returns a synchronous 404 for the mismatched case.

  • Step 6.B.5: Run the remaining 3 SSE tests
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -30

If all 4 now pass → Task 6.B closed everything. If 3 still fail (the awaitConnection trio) → the auto-heal guard fixed the 404 case but the others have a separate root cause. Fall back to Task 6.A with narrower scope, or Task 6.C to park the residual.

  • Step 6.B.6: Commit
git add cameleer-server-app/src/main/java/com/cameleer/server/app/controller/AgentSseController.java
git commit -m "$(cat <<'EOF'
fix(sse): auto-heal requires JWT subject to match requested agent id

AgentSseController.events auto-registered an unknown agent id from JWT
claims whenever any valid JWT was present, regardless of whose agent the
JWT actually identified. This was a spoofing vector — a holder of a JWT
for agent X could open SSE for any path-id Y — and it silently masked
404 as 200 with an infinite empty stream (surface symptom: the parked
sseConnect_unknownAgent_returns404 test hung for 5s on the status
future).

Auto-heal now triggers only when the JWT subject equals the path id.
Cross-agent requests fall through to the existing 404.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"

Task 6.C — Park and annotate (fallback if diagnosis inconclusive)

If the 45-minute diagnosis timebox expires without a confident root cause, or Task 6.A/6.B leaves residual failures.

  • Step 6.C.1: Annotate the failing tests

Edit each of the (still-)failing test methods in AgentSseControllerIT and SseSigningIT. Add above each method:

@org.junit.jupiter.api.Disabled(
    "Parked — see .planning/sse-flakiness-diagnosis.md. Order-dependent "
    + "flakiness; passes in isolation. Re-enable after fix.")

Leave the rest of the method unchanged.

  • Step 6.C.2: Run to confirm they skip
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -20

Expected: 0 failures, N skipped (where N is the number parked). Other tests still run.

  • Step 6.C.3: Commit
git add cameleer-server-app/src/test/java/com/cameleer/server/app/controller/AgentSseControllerIT.java \
        cameleer-server-app/src/test/java/com/cameleer/server/app/controller/SseSigningIT.java
git commit -m "$(cat <<'EOF'
test(sse): park flaky tests with @Disabled pending fix

Order-dependent flakiness; all four tests pass in isolation. Diagnosis
in .planning/sse-flakiness-diagnosis.md was inconclusive within the
investigation timebox. Re-enable after targeted fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"

Task 7 — Final verify + push

  • Step 7.1: Full verify
mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tee /tmp/final-verify.log | tail -60

Expected: 0 failures, plus either (a) all SSE tests passing if Task 6.A/6.B succeeded, or (b) 4 skipped if Task 6.C was taken.

If any non-SSE test fails that previously passed: STOP. Root cause before pushing. Likely a regression from Task 1, 2, or 3 that escaped unit verification.

  • Step 7.2: Confirm commit history
git log --oneline -15

Expected: the new commits in order — CH timezone fix, scheduler SpEL fix, SearchIndexer removal, TaggedExecution removal, SSE diagnosis doc, SSE fix or park.

  • Step 7.3: Push to main
git push origin main

The user explicitly authorized pushing to main for this overnight run. If the remote rejects (non-fast-forward, auth), stop and report — do not --force.

  • Step 7.4: Update .planning/it-triage-report.md

Append a closing section at the bottom of the triage report:

## Follow-up (2026-04-22)

Closed the 3 parked clusters:

- **ClickHouseStatsStoreIT (8 failures)** — fixed via column-level `DateTime('UTC')` on `bucket` + defensive `toDateTime(..., 'UTC')` cast in `ClickHouseStatsStore.lit(Instant)`.
- **MetricsFlushScheduler property-key drift** — scheduler now binds via SpEL `#{@ingestionConfig.flushIntervalMs}`; BackpressureIT workaround property dropped.
- **SSE flakiness (4 failures)** — see `.planning/sse-flakiness-diagnosis.md`; resolved by <one-line summary from diagnosis> / parked with `@Disabled` pending targeted fix.

Plus two prod-code cleanups from the ExecutionController removal follow-ons: removed dead `SearchIndexer` subsystem and unused `TaggedExecution` record.

Final verify: **0 failures** (or: **0 failures, 4 skipped SSE tests**).

Commit:

git add .planning/it-triage-report.md
git commit -m "$(cat <<'EOF'
docs(triage): IT triage report — close-out of remaining 12 failures

All three parked clusters closed + two prod-code side-notes landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
git push origin main

Self-review (pre-execution)

Spec coverage:

  • Item 1 (CH timezone) → Task 1
  • Item 2 (SSE flakiness) → Tasks 5 + 6 (diagnose-then-fix, autonomous variant without user checkpoint)
  • Item 3 (scheduler property-key) → Task 2
  • Item 4a (SearchIndexer cleanup) → Task 3
  • Item 4b (TaggedExecution removal) → Task 4
  • Execution order (Wave 1 parallelizable, Wave 2 sequential) → reflected in task numbering; Wave 1 tasks have no inter-task dependencies and can be executed in any order, Wave 2 (Tasks 5 → 6) is strictly sequential.

Placeholder scan: Every step contains concrete commands, file paths, and code blocks. The one deferred decision (Task 6 branching based on diagnosis) is bounded — all three branches (A/B/C) are fully specified.

Type consistency: ingestionConfig bean name consistent across Task 2 steps. JwtValidationResult.subject() access method flagged as "verify" in Step 6.B.2 — the actual accessor is confirmed during diagnosis, not guessed here.

Deviations from spec: the spec called for a user checkpoint between SSE diagnosis and fix. This plan runs autonomously (user is asleep), so the checkpoint becomes a decision tree (Task 6.A/6.B/6.C) with explicit stop conditions.