docs(plan): IT triage follow-ups — implementation plan
Task-by-task plan for the 2026-04-21-it-triage-followups-design spec. Autonomous execution variant — SSE diagnose-then-fix branches to either apply-fix or park-with-@Disabled based on diagnosis confidence, since this runs unattended overnight. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
940
docs/superpowers/plans/2026-04-21-it-triage-followups.md
Normal file
940
docs/superpowers/plans/2026-04-21-it-triage-followups.md
Normal file
@@ -0,0 +1,940 @@
|
||||
# IT Triage Follow-Ups Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Close the 12 parked IT failures from `.planning/it-triage-report.md` plus two prod-code side-notes, so `mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' verify` returns **0 failures**.
|
||||
|
||||
**Architecture:** Four focused fixes (CH timezone, scheduler property key, two dead-code removals) executed atomically, each with its own commit. Then SSE flakiness as diagnose-then-fix. User is asleep during execution — no interactive checkpoint; if SSE diagnosis is inconclusive within the timebox, park the 4 failing SSE tests with `@Disabled` + link to the diagnosis doc and finish the rest green.
|
||||
|
||||
**Tech Stack:** Java 17, Spring Boot 3.4.3, ClickHouse 24.12 via JDBC, Testcontainers, Maven Failsafe.
|
||||
|
||||
---
|
||||
|
||||
## Execution policy
|
||||
|
||||
- **Atomic commits** — one task, one commit, scoped to the task's files.
|
||||
- **Before each symbol edit:** `gitnexus_impact({target, direction: "upstream"})`. Warn on HIGH/CRITICAL. Stop if unexpected dependents appear and re-scope.
|
||||
- **Before each commit:** `gitnexus_detect_changes({scope: "staged"})`. Confirm scope.
|
||||
- **`.claude/rules/*` updates** are part of the same commit as the class change, not a separate task.
|
||||
- **Test-only scope** — no tests rewritten to pass-by-weakening. Every change to an assertion gets a comment explaining the contract it now captures.
|
||||
- **Final step** — `git push origin main` after all tasks commit and the full verify run is green (or yellow with the parked SSE tests only, clearly noted).
|
||||
|
||||
---
|
||||
|
||||
## Task 0 — Baseline verify (evidence, no commit)
|
||||
|
||||
**Files:** none modified.
|
||||
|
||||
- [ ] **Step 0.1: Run baseline failing tests to confirm starting state**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='ClickHouseStatsStoreIT,AgentSseControllerIT,SseSigningIT,BackpressureIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -60
|
||||
```
|
||||
|
||||
Expected: **12 failures** distributed across those three IT classes, matching `.planning/it-triage-report.md`. If the baseline count differs, stop and re-audit — the spec assumes this number.
|
||||
|
||||
- [ ] **Step 0.2: Record baseline to memory**
|
||||
|
||||
Keep the failure count as the reference number. The final verify must show 0 new failures; the only acceptable regression is the SSE cluster if and only if Task 5 parks them (and the user-facing summary notes this).
|
||||
|
||||
---
|
||||
|
||||
## Task 1 — ClickHouse timezone fix
|
||||
|
||||
Closes 8 failures in `ClickHouseStatsStoreIT`.
|
||||
|
||||
**Files:**
|
||||
- Modify: `cameleer-server-app/src/main/resources/clickhouse/init.sql`
|
||||
- Modify: `cameleer-server-app/src/main/java/com/cameleer/server/app/storage/ClickHouseStatsStore.java:346-350`
|
||||
- Modify: `cameleer-server-app/src/test/java/com/cameleer/server/app/storage/ClickHouseStatsStoreIT.java:49-78` (remove debug scaffolding from the triage investigation)
|
||||
|
||||
- [ ] **Step 1.1: Impact analysis**
|
||||
|
||||
Run `gitnexus_impact({target: "ClickHouseStatsStore", direction: "upstream"})`. Expected: `SearchService`, `SearchController`, alerting evaluator. Note the blast radius — every read-path that uses `stats_1m_*` tables sees the now-correct values.
|
||||
|
||||
- [ ] **Step 1.2: Change `init.sql` — `bucket` columns to `DateTime('UTC')`, MV SELECTs to emit UTC**
|
||||
|
||||
Edit `cameleer-server-app/src/main/resources/clickhouse/init.sql`:
|
||||
|
||||
For each of the five stats tables (`stats_1m_all`, `stats_1m_app`, `stats_1m_route`, `stats_1m_processor`, `stats_1m_processor_detail`), change the `bucket` column declaration from:
|
||||
|
||||
```sql
|
||||
bucket DateTime,
|
||||
```
|
||||
|
||||
to:
|
||||
|
||||
```sql
|
||||
bucket DateTime('UTC'),
|
||||
```
|
||||
|
||||
For each of the five materialized views (`stats_1m_all_mv`, `stats_1m_app_mv`, `stats_1m_route_mv`, `stats_1m_processor_mv`, `stats_1m_processor_detail_mv`), change the bucket projection from:
|
||||
|
||||
```sql
|
||||
toStartOfMinute(start_time) AS bucket,
|
||||
```
|
||||
|
||||
to:
|
||||
|
||||
```sql
|
||||
toDateTime(toStartOfMinute(start_time), 'UTC') AS bucket,
|
||||
```
|
||||
|
||||
The `TTL bucket + INTERVAL 365 DAY DELETE` lines need no change — TTL interval arithmetic is tz-agnostic.
|
||||
|
||||
- [ ] **Step 1.3: Verify `ClickHouseStatsStore.lit(Instant)` literal works against the typed column**
|
||||
|
||||
Read `ClickHouseStatsStore.java:346-350`. The current formatter writes `'yyyy-MM-dd HH:mm:ss'` with `ZoneOffset.UTC`:
|
||||
|
||||
```java
|
||||
private static String lit(Instant instant) {
|
||||
return "'" + java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
|
||||
.withZone(java.time.ZoneOffset.UTC)
|
||||
.format(instant.truncatedTo(ChronoUnit.SECONDS)) + "'";
|
||||
}
|
||||
```
|
||||
|
||||
With `bucket DateTime('UTC')`, a bare literal like `'2026-03-31 10:05:00'` is parsed by ClickHouse as being in the column's TZ (UTC). So `bucket >= '2026-03-31 10:05:00'` now compares UTC-to-UTC consistently. No code change required in `lit(Instant)` — leave it alone.
|
||||
|
||||
**However**, for defence-in-depth (so a future reader or refactor doesn't reintroduce the bug), wrap the formatted string in an explicit `toDateTime('...', 'UTC')` cast. Change the method to:
|
||||
|
||||
```java
|
||||
/**
|
||||
* Format an Instant as a ClickHouse DateTime literal explicitly typed in UTC.
|
||||
* The explicit `toDateTime(..., 'UTC')` cast avoids depending on the session
|
||||
* timezone matching the `bucket DateTime('UTC')` column type.
|
||||
*/
|
||||
private static String lit(Instant instant) {
|
||||
String raw = java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
|
||||
.withZone(java.time.ZoneOffset.UTC)
|
||||
.format(instant.truncatedTo(ChronoUnit.SECONDS));
|
||||
return "toDateTime('" + raw + "', 'UTC')";
|
||||
}
|
||||
```
|
||||
|
||||
Note: this affects both `bucket >= ...` comparisons and `tenant_id = ...` elsewhere — but `tenant_id` uses the `lit(String)` overload. Only `lit(Instant)` is touched.
|
||||
|
||||
- [ ] **Step 1.4: Remove debug scaffolding from `ClickHouseStatsStoreIT.setUp()`**
|
||||
|
||||
Lines 49-78 currently contain a try-catch that runs a failing query, flushes logs, prints query log entries to stdout. This was diagnostic code from the triage investigation; it's no longer needed and pollutes CI output.
|
||||
|
||||
Edit `cameleer-server-app/src/test/java/com/cameleer/server/app/storage/ClickHouseStatsStoreIT.java`, replacing the current `setUp()` body with:
|
||||
|
||||
```java
|
||||
@BeforeEach
|
||||
void setUp() throws Exception {
|
||||
HikariDataSource ds = new HikariDataSource();
|
||||
ds.setJdbcUrl(clickhouse.getJdbcUrl());
|
||||
ds.setUsername(clickhouse.getUsername());
|
||||
ds.setPassword(clickhouse.getPassword());
|
||||
|
||||
jdbc = new JdbcTemplate(ds);
|
||||
|
||||
ClickHouseTestHelper.executeInitSql(jdbc);
|
||||
|
||||
// Truncate base tables
|
||||
jdbc.execute("TRUNCATE TABLE executions");
|
||||
jdbc.execute("TRUNCATE TABLE processor_executions");
|
||||
|
||||
seedTestData();
|
||||
|
||||
store = new ClickHouseStatsStore("default", jdbc);
|
||||
}
|
||||
```
|
||||
|
||||
And remove the now-unused imports: `import java.nio.charset.StandardCharsets;` (line 16) — keep everything else, the rest is still used by `seedTestData` and the tests.
|
||||
|
||||
- [ ] **Step 1.5: Run the 8 failing ITs**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='ClickHouseStatsStoreIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -40
|
||||
```
|
||||
|
||||
Expected: **0 failures, 15 passed** (`ClickHouseStatsStoreIT` has 14 tests; count may vary — the baseline says 8 failures out of the full class count). If any failure remains, root-cause it:
|
||||
- If literal format is wrong → verify the `toDateTime(..., 'UTC')` cast renders correctly
|
||||
- If MV isn't emitting → the MV source expression now needs the explicit UTC wrap
|
||||
- If a different test that was previously passing now fails → CH schema change broke a reader; `gitnexus_impact` identifies who.
|
||||
|
||||
- [ ] **Step 1.6: Verify GitNexus impact surface**
|
||||
|
||||
```bash
|
||||
gitnexus_detect_changes({scope: "staged"})
|
||||
```
|
||||
|
||||
Expected: `init.sql`, `ClickHouseStatsStore.java`, `ClickHouseStatsStoreIT.java`. Nothing else.
|
||||
|
||||
- [ ] **Step 1.7: Commit**
|
||||
|
||||
```bash
|
||||
git add cameleer-server-app/src/main/resources/clickhouse/init.sql \
|
||||
cameleer-server-app/src/main/java/com/cameleer/server/app/storage/ClickHouseStatsStore.java \
|
||||
cameleer-server-app/src/test/java/com/cameleer/server/app/storage/ClickHouseStatsStoreIT.java
|
||||
git commit -m "$(cat <<'EOF'
|
||||
fix(stats): store bucket as DateTime('UTC') so reads don't depend on CH session TZ
|
||||
|
||||
ClickHouseStatsStoreIT had 8 failures when the CH container's session
|
||||
timezone was non-UTC (e.g. CEST): stats filter literals were parsed in
|
||||
session TZ while the bucket column stored UTC Unix timestamps, and every
|
||||
time-range query missed rows by the tz offset.
|
||||
|
||||
- init.sql: bucket columns on all stats_1m_* tables typed as
|
||||
DateTime('UTC'); MV SELECTs wrap toStartOfMinute(start_time) in
|
||||
toDateTime(..., 'UTC') so projections match the target column type.
|
||||
- ClickHouseStatsStore.lit(Instant): emit toDateTime('...', 'UTC') cast
|
||||
rather than a bare literal, as defence-in-depth against future
|
||||
refactors that change column typing.
|
||||
- ClickHouseStatsStoreIT.setUp: remove debug scaffolding (failing-query
|
||||
try-catch + query_log printing) from the triage investigation.
|
||||
|
||||
Greenfield CH — no migration needed for existing data.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 2 — MetricsFlushScheduler property-key fix
|
||||
|
||||
Fixes the production bug that `flush-interval-ms` YAML config was silently ignored. No IT failures directly depend on this (BackpressureIT worked around it with a second property), but the workaround is no longer needed after the fix.
|
||||
|
||||
**Files:**
|
||||
- Modify: `cameleer-server-app/src/main/java/com/cameleer/server/app/ingestion/MetricsFlushScheduler.java:33`
|
||||
- Modify: `cameleer-server-app/src/test/java/com/cameleer/server/app/controller/BackpressureIT.java:24-36`
|
||||
|
||||
- [ ] **Step 2.1: Impact analysis**
|
||||
|
||||
```bash
|
||||
gitnexus_impact({target: "MetricsFlushScheduler", direction: "upstream"})
|
||||
gitnexus_impact({target: "IngestionConfig.getFlushIntervalMs", direction: "upstream"})
|
||||
```
|
||||
|
||||
Expected: only `MetricsFlushScheduler` consumes `flushIntervalMs`. If another `@Scheduled` uses the unprefixed key, fix it too (it has the same bug).
|
||||
|
||||
Verify `IngestionConfig` bean name:
|
||||
|
||||
```bash
|
||||
grep -rn "EnableConfigurationProperties" cameleer-server-app/src/main/java
|
||||
```
|
||||
|
||||
Expected: `CameleerServerApplication.java` has `@EnableConfigurationProperties({IngestionConfig.class, AgentRegistryConfig.class})`. Spring registers the bean with the default name derived from the class simple name: `ingestionConfig` (camelCase first-letter-lower). SpEL `@ingestionConfig` resolves to this bean.
|
||||
|
||||
- [ ] **Step 2.2: Change `MetricsFlushScheduler.@Scheduled` to SpEL**
|
||||
|
||||
Edit `cameleer-server-app/src/main/java/com/cameleer/server/app/ingestion/MetricsFlushScheduler.java`. Replace line 33:
|
||||
|
||||
```java
|
||||
@Scheduled(fixedDelayString = "${ingestion.flush-interval-ms:1000}")
|
||||
```
|
||||
|
||||
with:
|
||||
|
||||
```java
|
||||
@Scheduled(fixedDelayString = "#{@ingestionConfig.flushIntervalMs}")
|
||||
```
|
||||
|
||||
No other change to this file.
|
||||
|
||||
- [ ] **Step 2.3: Drop the BackpressureIT workaround property**
|
||||
|
||||
Edit `cameleer-server-app/src/test/java/com/cameleer/server/app/controller/BackpressureIT.java` lines 24-36. Replace the `@TestPropertySource` block with:
|
||||
|
||||
```java
|
||||
@TestPropertySource(properties = {
|
||||
// Property keys must match the IngestionConfig @ConfigurationProperties
|
||||
// prefix (cameleer.server.ingestion) exactly. The MetricsFlushScheduler
|
||||
// now binds its @Scheduled flush interval via SpEL on IngestionConfig,
|
||||
// so a single property override controls both the buffer config and
|
||||
// the flush cadence.
|
||||
"cameleer.server.ingestion.buffercapacity=5",
|
||||
"cameleer.server.ingestion.batchsize=5",
|
||||
"cameleer.server.ingestion.flushintervalms=60000"
|
||||
})
|
||||
```
|
||||
|
||||
Removed: the second `ingestion.flush-interval-ms=60000` entry and its comment block.
|
||||
|
||||
- [ ] **Step 2.4: Run BackpressureIT**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='BackpressureIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -30
|
||||
```
|
||||
|
||||
Expected: 2 passed, 0 failed. `whenMetricsBufferFull_returns503WithRetryAfter` in particular must still pass — the 60s flush interval must still be honoured, proving the SpEL binding works.
|
||||
|
||||
- [ ] **Step 2.5: Smoke test the app bean wiring**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app compile 2>&1 | tail -10
|
||||
```
|
||||
|
||||
Expected: BUILD SUCCESS. Bean name mismatches between SpEL and the actual bean name usually surface as `IllegalStateException: No bean named 'ingestionConfig' available` at runtime, not at compile time — BackpressureIT in Step 2.4 is the actual smoke test.
|
||||
|
||||
- [ ] **Step 2.6: Commit**
|
||||
|
||||
```bash
|
||||
gitnexus_detect_changes({scope: "staged"})
|
||||
# Expected: only MetricsFlushScheduler.java, BackpressureIT.java
|
||||
|
||||
git add cameleer-server-app/src/main/java/com/cameleer/server/app/ingestion/MetricsFlushScheduler.java \
|
||||
cameleer-server-app/src/test/java/com/cameleer/server/app/controller/BackpressureIT.java
|
||||
git commit -m "$(cat <<'EOF'
|
||||
fix(metrics): MetricsFlushScheduler honour ingestion config flush interval
|
||||
|
||||
The @Scheduled placeholder read ${ingestion.flush-interval-ms:1000}
|
||||
(unprefixed), but IngestionConfig binds the cameleer.server.ingestion.*
|
||||
namespace — YAML config of the metrics flush interval was silently
|
||||
ignored, always falling back to 1s.
|
||||
|
||||
- Scheduler: bind via SpEL `#{@ingestionConfig.flushIntervalMs}` so
|
||||
IngestionConfig is the single source of truth; default lives on the
|
||||
config field, not duplicated in the @Scheduled annotation.
|
||||
- BackpressureIT: remove the second ingestion.flush-interval-ms=60000
|
||||
workaround property that was papering over this bug. The single
|
||||
cameleer.server.ingestion.flushintervalms override now slows the
|
||||
scheduler enough for the 503 overflow scenario to be reachable.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 3 — Delete dead SearchIndexer subsystem
|
||||
|
||||
The ExecutionController removal commit (0f635576) left `SearchIndexer.onExecutionUpdated` subscribed to an event (`ExecutionUpdatedEvent`) that nothing publishes. The whole indexer subsystem is dead: every stat method it exposes returns always-zero values, and the admin `/pipeline` endpoint that consumes them is therefore vestigial.
|
||||
|
||||
**Files:**
|
||||
- Delete: `cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/SearchIndexer.java`
|
||||
- Delete: `cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/SearchIndexerStats.java`
|
||||
- Delete: `cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/ExecutionUpdatedEvent.java`
|
||||
- Delete: `cameleer-server-app/src/main/java/com/cameleer/server/app/dto/IndexerPipelineResponse.java` (if it exists as a standalone DTO — verify in Step 3.2)
|
||||
- Modify: `cameleer-server-app/src/main/java/com/cameleer/server/app/controller/ClickHouseAdminController.java` (remove `/pipeline` endpoint, `indexerStats` field, its constructor parameter)
|
||||
- Modify: any bean config that creates `SearchIndexer` (discover in Step 3.2)
|
||||
- Modify: `ui/src/api/queries/admin/clickhouse.ts` if it calls `/pipeline` (discover in Step 3.2)
|
||||
- Update: `.claude/rules/core-classes.md` (remove SearchIndexer/SearchIndexerStats bullets)
|
||||
- Update: `.claude/rules/app-classes.md` (remove `/pipeline` endpoint mention)
|
||||
|
||||
- [ ] **Step 3.1: Impact analysis**
|
||||
|
||||
```bash
|
||||
gitnexus_impact({target: "SearchIndexer", direction: "upstream"})
|
||||
gitnexus_impact({target: "SearchIndexerStats", direction: "upstream"})
|
||||
gitnexus_impact({target: "ExecutionUpdatedEvent", direction: "upstream"})
|
||||
gitnexus_impact({target: "IndexerPipelineResponse", direction: "upstream"})
|
||||
```
|
||||
|
||||
Expected: `ClickHouseAdminController` depends on `SearchIndexerStats`. The other three should have no non-self dependents after the ExecutionController removal. If anything else surprises you, STOP — something is still live and needs re-scoping.
|
||||
|
||||
- [ ] **Step 3.2: Discover full footprint**
|
||||
|
||||
```bash
|
||||
grep -rn "SearchIndexer\|IndexerPipelineResponse\|ExecutionUpdatedEvent" \
|
||||
--include="*.java" --include="*.ts" --include="*.tsx" --include="*.md" \
|
||||
cameleer-server-core/src cameleer-server-app/src ui/src .claude/rules
|
||||
```
|
||||
|
||||
Expected matches:
|
||||
- `SearchIndexer.java`, `SearchIndexerStats.java`, `ExecutionUpdatedEvent.java` themselves
|
||||
- `ClickHouseAdminController.java` — has field + constructor param + `/pipeline` endpoint
|
||||
- `IndexerPipelineResponse.java` — DTO (check if it exists in `cameleer-server-app/src/main/java/com/cameleer/server/app/dto/`)
|
||||
- A bean config file (likely `StorageBeanConfig.java` or a dedicated indexing config) instantiating `SearchIndexer`
|
||||
- `ui/src/api/queries/admin/clickhouse.ts` — maybe queries `/pipeline`
|
||||
- `.claude/rules/core-classes.md`, `.claude/rules/app-classes.md`
|
||||
- design docs under `docs/superpowers/specs/` — leave untouched (historical)
|
||||
|
||||
Make a list of every file to edit. Don't proceed until you've seen them all.
|
||||
|
||||
- [ ] **Step 3.3: Remove SearchIndexer instantiation from bean config**
|
||||
|
||||
Open the file(s) found in Step 3.2 that construct `SearchIndexer`. Delete:
|
||||
- the `@Bean SearchIndexer searchIndexer(...)` method
|
||||
- the `@Bean SearchIndexerStats searchIndexerStats(...)` method (if it exists separately — usually just returns the `SearchIndexer` instance cast to the interface)
|
||||
- any private helper field such as `private final ExecutionStore executionStore;` that becomes unused *only if it was used exclusively for constructing SearchIndexer*; leave fields used by other beans.
|
||||
|
||||
If the bean config also pulls in `SearchIndex` purely to pass it to `SearchIndexer`, check whether anything else uses `SearchIndex`. If not, leave the `SearchIndex` bean — it may be used by the search-query path (`SearchController`/`SearchService`). Verify before deleting.
|
||||
|
||||
- [ ] **Step 3.4: Remove `/pipeline` endpoint from ClickHouseAdminController**
|
||||
|
||||
Edit `cameleer-server-app/src/main/java/com/cameleer/server/app/controller/ClickHouseAdminController.java`:
|
||||
|
||||
1. Remove the import `import com.cameleer.server.core.indexing.SearchIndexerStats;`
|
||||
2. Remove the import `import com.cameleer.server.app.dto.IndexerPipelineResponse;`
|
||||
3. Remove the field `private final SearchIndexerStats indexerStats;`
|
||||
4. Remove the constructor parameter `SearchIndexerStats indexerStats` and the `this.indexerStats = indexerStats;` assignment
|
||||
5. Remove the entire `@GetMapping("/pipeline") ... public IndexerPipelineResponse getPipeline() { ... }` method
|
||||
|
||||
The remaining controller retains `/status`, `/tables`, `/performance`, `/queries` endpoints — those don't depend on the indexer.
|
||||
|
||||
- [ ] **Step 3.5: Delete the dead files**
|
||||
|
||||
```bash
|
||||
git rm cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/SearchIndexer.java
|
||||
git rm cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/SearchIndexerStats.java
|
||||
git rm cameleer-server-core/src/main/java/com/cameleer/server/core/indexing/ExecutionUpdatedEvent.java
|
||||
```
|
||||
|
||||
If the indexing package becomes empty:
|
||||
|
||||
```bash
|
||||
find cameleer-server-core/src/main/java/com/cameleer/server/core/indexing -type d -empty -delete 2>/dev/null
|
||||
```
|
||||
|
||||
- [ ] **Step 3.6: Delete IndexerPipelineResponse DTO (if standalone)**
|
||||
|
||||
If Step 3.2 confirmed `cameleer-server-app/src/main/java/com/cameleer/server/app/dto/IndexerPipelineResponse.java` exists as its own file:
|
||||
|
||||
```bash
|
||||
git rm cameleer-server-app/src/main/java/com/cameleer/server/app/dto/IndexerPipelineResponse.java
|
||||
```
|
||||
|
||||
If it's an inner record in another DTO file, leave that file alone and remove only the record definition.
|
||||
|
||||
- [ ] **Step 3.7: Remove UI consumer of `/pipeline` (if any)**
|
||||
|
||||
If `ui/src/api/queries/admin/clickhouse.ts` or another UI file calls `/api/v1/admin/clickhouse/pipeline`:
|
||||
- Remove the query hook / fetch call
|
||||
- Remove any UI component rendering its data (likely in an admin page)
|
||||
- Run `cd ui && npm run build 2>&1 | tail -20` to surface compile errors from other call sites; fix them by deleting the relevant UI sections
|
||||
|
||||
If no UI reference exists, skip this step.
|
||||
|
||||
- [ ] **Step 3.8: Regenerate OpenAPI schema**
|
||||
|
||||
Per CLAUDE.md: any REST surface change requires regenerating `ui/src/api/schema.d.ts`.
|
||||
|
||||
Start the backend:
|
||||
|
||||
```bash
|
||||
cd cameleer-server-app && mvn spring-boot:run &
|
||||
# wait for port 8081 to be listening — poll with: until curl -sf http://localhost:8081/api-docs >/dev/null 2>&1; do sleep 2; done
|
||||
```
|
||||
|
||||
Regenerate:
|
||||
|
||||
```bash
|
||||
cd ui && npm run generate-api:live
|
||||
```
|
||||
|
||||
Stop the backend. Commit includes the regenerated `ui/src/api/schema.d.ts` and `ui/src/api/openapi.json`.
|
||||
|
||||
If the user is offline / the backend can't start, skip this step but flag it in the commit message so a follow-up can regenerate. The TypeScript types will be out of sync until then — the build will fail if any UI code referenced `/pipeline` endpoint types.
|
||||
|
||||
- [ ] **Step 3.9: Update `.claude/rules/core-classes.md`**
|
||||
|
||||
Remove these sections entirely:
|
||||
- The `SearchIndexer` bullet (if present in the core-classes rules)
|
||||
- Any `SearchIndexerStats` interface bullet
|
||||
- Any `ExecutionUpdatedEvent` record mention
|
||||
|
||||
The file is currently 100+ lines. Search for "SearchIndexer" and "ExecutionUpdatedEvent" and delete the matching lines/bullets.
|
||||
|
||||
- [ ] **Step 3.10: Update `.claude/rules/app-classes.md`**
|
||||
|
||||
Remove:
|
||||
- The `/pipeline` endpoint mention under `ClickHouseAdminController` (line reading "GET `/api/v1/admin/clickhouse/**` (conditional on `infrastructureendpoints` flag)" can stay — `/pipeline` is no longer listed separately; if there was a specific `/pipeline` bullet, remove it).
|
||||
|
||||
Also grep for "SearchIndexer" in the rules and delete any residual mentions.
|
||||
|
||||
- [ ] **Step 3.11: Build and verify**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am compile 2>&1 | tail -20
|
||||
```
|
||||
|
||||
Expected: BUILD SUCCESS. If a reference slipped through, the compile fails with a clear `cannot find symbol` pointing at the dead class.
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='ClickHouseAdminControllerIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -20
|
||||
```
|
||||
|
||||
If this IT exists and tests `/pipeline`, its test methods for that endpoint must be removed too. Edit the IT file, remove the `/pipeline` test methods. Re-run.
|
||||
|
||||
- [ ] **Step 3.12: Commit**
|
||||
|
||||
```bash
|
||||
gitnexus_detect_changes({scope: "staged"})
|
||||
# Expected: deleted files + modified ClickHouseAdminController.java + rule updates + (optionally) UI changes and OpenAPI regen
|
||||
|
||||
git add -A cameleer-server-core/src/main/java/com/cameleer/server/core/indexing \
|
||||
cameleer-server-app/src/main/java/com/cameleer/server/app/controller/ClickHouseAdminController.java \
|
||||
cameleer-server-app/src/main/java/com/cameleer/server/app/dto/ \
|
||||
.claude/rules/core-classes.md .claude/rules/app-classes.md
|
||||
# Only add UI/openapi paths if they actually changed:
|
||||
git add ui/src/api/schema.d.ts ui/src/api/openapi.json ui/src/api/queries/admin/clickhouse.ts 2>/dev/null || true
|
||||
git commit -m "$(cat <<'EOF'
|
||||
refactor(search): drop dead SearchIndexer subsystem
|
||||
|
||||
After the ExecutionController removal (0f635576), SearchIndexer
|
||||
subscribed to ExecutionUpdatedEvent but nothing publishes that event.
|
||||
Every SearchIndexerStats metric returned always-zero, and the admin
|
||||
/api/v1/admin/clickhouse/pipeline endpoint that surfaced those stats
|
||||
carried no signal.
|
||||
|
||||
Removed:
|
||||
- core: SearchIndexer, SearchIndexerStats, ExecutionUpdatedEvent
|
||||
- app: IndexerPipelineResponse DTO, /pipeline endpoint on
|
||||
ClickHouseAdminController, field + ctor param
|
||||
- bean wiring that constructed SearchIndexer
|
||||
- UI query for /pipeline if it existed
|
||||
- .claude/rules/{core,app}-classes.md references
|
||||
|
||||
OpenAPI schema regenerated.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 4 — Delete unused TaggedExecution record
|
||||
|
||||
The ExecutionController removal commit (0f635576) flagged `TaggedExecution` as having no remaining callers after the legacy PG ingest path was retired.
|
||||
|
||||
**Files:**
|
||||
- Delete: `cameleer-server-core/src/main/java/com/cameleer/server/core/ingestion/TaggedExecution.java`
|
||||
- Update: `.claude/rules/core-classes.md`
|
||||
|
||||
- [ ] **Step 4.1: Impact analysis**
|
||||
|
||||
```bash
|
||||
gitnexus_impact({target: "TaggedExecution", direction: "upstream"})
|
||||
gitnexus_context({name: "TaggedExecution"})
|
||||
```
|
||||
|
||||
Expected: empty upstream (or only documentation-file references). If a test file still imports `TaggedExecution`, that test is dead code too and should be deleted.
|
||||
|
||||
```bash
|
||||
grep -rn "TaggedExecution" --include="*.java" cameleer-server-core/src cameleer-server-app/src
|
||||
```
|
||||
|
||||
Expected: only `TaggedExecution.java` itself.
|
||||
|
||||
- [ ] **Step 4.2: Delete the file**
|
||||
|
||||
```bash
|
||||
git rm cameleer-server-core/src/main/java/com/cameleer/server/core/ingestion/TaggedExecution.java
|
||||
```
|
||||
|
||||
- [ ] **Step 4.3: Update `.claude/rules/core-classes.md`**
|
||||
|
||||
Edit the file. Find the line containing "TaggedExecution still lives in the package as a leftover" (in the `ingestion/` section) and remove the parenthetical. Before:
|
||||
|
||||
```
|
||||
- `MergedExecution`, `TaggedDiagram` — tagged ingestion records. `TaggedDiagram` carries `(instanceId, applicationId, environment, graph)` — env is resolved from the agent registry in the controller and stamped on the ClickHouse `route_diagrams` row. (`TaggedExecution` still lives in the package as a leftover but has no callers since the legacy PG ingest path was retired.)
|
||||
```
|
||||
|
||||
After:
|
||||
|
||||
```
|
||||
- `MergedExecution`, `TaggedDiagram` — tagged ingestion records. `TaggedDiagram` carries `(instanceId, applicationId, environment, graph)` — env is resolved from the agent registry in the controller and stamped on the ClickHouse `route_diagrams` row.
|
||||
```
|
||||
|
||||
- [ ] **Step 4.4: Build**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-core compile 2>&1 | tail -5
|
||||
```
|
||||
|
||||
Expected: BUILD SUCCESS.
|
||||
|
||||
- [ ] **Step 4.5: Commit**
|
||||
|
||||
```bash
|
||||
gitnexus_detect_changes({scope: "staged"})
|
||||
# Expected: TaggedExecution.java deleted, core-classes.md updated
|
||||
|
||||
git commit -m "$(cat <<'EOF'
|
||||
refactor(ingestion): remove unused TaggedExecution record
|
||||
|
||||
No callers after the legacy PG ingestion path was retired in 0f635576.
|
||||
core-classes.md updated to drop the leftover note.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 5 — SSE diagnosis
|
||||
|
||||
Diagnose the 4 failing SSE tests before attempting a fix. Produces a markdown diagnosis doc, not code changes.
|
||||
|
||||
**Files:**
|
||||
- Create: `.planning/sse-flakiness-diagnosis.md`
|
||||
|
||||
- [ ] **Step 5.1: Run each failing test in isolation to confirm baseline**
|
||||
|
||||
```bash
|
||||
for t in "AgentSseControllerIT#sseConnect_unknownAgent_returns404" \
|
||||
"AgentSseControllerIT#lastEventIdHeader_connectionSucceeds" \
|
||||
"AgentSseControllerIT#pingKeepalive_receivedViaSseStream" \
|
||||
"SseSigningIT#deepTraceEvent_containsValidSignature"; do
|
||||
echo "=== $t ==="
|
||||
mvn -pl cameleer-server-app -am -Dit.test="$t" -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -15
|
||||
done
|
||||
```
|
||||
|
||||
Record for each: PASS or FAIL in isolation.
|
||||
|
||||
- [ ] **Step 5.2: Run all SSE tests together in both class orders**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dsurefire.runOrder=alphabetical verify 2>&1 | tail -30
|
||||
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dsurefire.runOrder=reversealphabetical verify 2>&1 | tail -30
|
||||
```
|
||||
|
||||
Record: which tests fail in which order.
|
||||
|
||||
- [ ] **Step 5.3: Investigate the `sseConnect_unknownAgent_returns404` case specifically**
|
||||
|
||||
Read `AgentSseController.java:63-82`. Trace the control flow when:
|
||||
- JWT is valid for agent subject `X`
|
||||
- Path id is `unknown-sse-agent` (different from JWT subject)
|
||||
- `registryService.findById("unknown-sse-agent")` returns null
|
||||
- `jwtResult != null` — so auto-heal triggers, registers `unknown-sse-agent` with JWT's env+application, returns 200 with SSE stream
|
||||
|
||||
Hypothesis: the test expects 404, but the controller's auto-heal path accepts the unknown agent because it only checks "JWT present", not "JWT subject matches path id". The 5s timeout on `statusFuture.get(...)` is because the 200 response opens an infinite SSE stream; `BodyHandlers.ofString()` waits for body completion that never comes.
|
||||
|
||||
Confirm by inspecting `JwtAuthenticationFilter` and `JwtService.JwtValidationResult` to see whether `subject()` or an equivalent agent-id claim is available on the result. Then read a nearby controller that does verify subject-vs-path-id (e.g. `AgentRegistrationController.heartbeat` or `AgentCommandController`) for the accepted pattern.
|
||||
|
||||
- [ ] **Step 5.4: Investigate the `awaitConnection(5000)` tests**
|
||||
|
||||
For `lastEventIdHeader_connectionSucceeds`, `pingKeepalive_receivedViaSseStream`, `deepTraceEvent_containsValidSignature`: all register a fresh UUID-suffixed agent first, then open SSE with the JWT that was minted in `setUp()` for `test-agent-sse-it`. The JWT subject doesn't match the path id.
|
||||
|
||||
If Step 5.3 finds the auto-heal bug, these tests may also benefit: when JWT subject ≠ path id, these tests currently rely on auto-heal too (since the path id was freshly registered through the `/register` endpoint, but the JWT they use in the SSE request is a *different* agent's JWT).
|
||||
|
||||
Wait — re-check: `registerAgent` in the test uses `bootstrapHeaders()` (not JWT). It registers the agent directly. Then `openSseStream` uses `jwt` which is `securityHelper.registerTestAgent("test-agent-sse-it")` — a JWT for a different agent.
|
||||
|
||||
So in these tests:
|
||||
- Path id: fresh UUID (registered via bootstrap)
|
||||
- JWT subject: "test-agent-sse-it"
|
||||
- `findById(uuid)` succeeds — agent exists
|
||||
- Auto-heal NOT triggered
|
||||
- Controller calls `connectionManager.connect(uuid)` — returns SseEmitter
|
||||
|
||||
If this path works in isolation, why does it time out under full-class execution? Possibilities:
|
||||
- **Tomcat async thread pool** exhaustion. `SseEmitter(Long.MAX_VALUE)` holds a thread; prior tests' connections may not have closed before the pool fills
|
||||
- **SseConnectionManager state** leak — emitters from prior tests still in the map, competing
|
||||
- **The ping scheduler** (`@Scheduled(fixedDelayString = "${agent-registry.ping-interval-ms:15000}")`) — if an IOException on a stale emitter propagates
|
||||
|
||||
Check Tomcat async config in `application.yml` and any `server.tomcat.*` settings. Default max-threads is 200 but async handling has separate limits.
|
||||
|
||||
- [ ] **Step 5.5: Capture test output + logs**
|
||||
|
||||
Run the failing tests with debug logging:
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dlogging.level.com.cameleer=DEBUG -Dlogging.level.org.apache.catalina.core=DEBUG verify 2>&1 | tail -100
|
||||
```
|
||||
|
||||
Look for:
|
||||
- `SseConnectionManager` log lines showing emitter count over time
|
||||
- "Replacing existing SSE connection" or "SSE connection timed out" patterns
|
||||
- Tomcat "async request timed out" warnings
|
||||
- Any `NOT_FOUND` being thrown that the client interprets as hanging
|
||||
|
||||
- [ ] **Step 5.6: Write the diagnosis doc**
|
||||
|
||||
Create `.planning/sse-flakiness-diagnosis.md` with sections:
|
||||
|
||||
1. **Summary** — 1-2 sentences, named root cause (or "inconclusive")
|
||||
2. **Evidence** — commands run, output snippets, code references (file:line)
|
||||
3. **Hypothesis ladder** — auto-heal over-permissiveness, thread pool exhaustion, singleton state leak — with confidence level for each
|
||||
4. **Proposed fix** — if confident: specific changes to specific files. If inconclusive: say so, recommend parking with `@Disabled`.
|
||||
5. **Risk** — what could go wrong with the proposed fix.
|
||||
|
||||
- [ ] **Step 5.7: Commit the diagnosis**
|
||||
|
||||
```bash
|
||||
git add .planning/sse-flakiness-diagnosis.md
|
||||
git commit -m "$(cat <<'EOF'
|
||||
docs(debug): SSE flakiness root-cause analysis
|
||||
|
||||
Investigation of the 4 parked SSE test failures documented in
|
||||
.planning/it-triage-report.md. Records evidence, hypothesis ladder,
|
||||
and proposed fix shape (or recommendation to park if inconclusive).
|
||||
|
||||
See .planning/sse-flakiness-diagnosis.md for details; Task 6 (or a
|
||||
skip-to-final-verify) follows based on the conclusion.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 6 — SSE fix (branches based on Task 5 diagnosis)
|
||||
|
||||
**Decision tree:**
|
||||
- **If Task 5 landed a confident root cause** → follow Task 6.A
|
||||
- **If Task 5 found auto-heal over-permissiveness as the (whole or partial) cause** → follow Task 6.B
|
||||
- **If Task 5 was inconclusive or the fix exceeds a 45-minute timebox** → follow Task 6.C (park)
|
||||
|
||||
---
|
||||
|
||||
### Task 6.A — Fix per diagnosis finding
|
||||
|
||||
- [ ] **Step 6.A.1: Impact analysis on the symbols identified by diagnosis**
|
||||
|
||||
```bash
|
||||
gitnexus_impact({target: "<symbol_named_by_diagnosis>", direction: "upstream"})
|
||||
```
|
||||
|
||||
- [ ] **Step 6.A.2: Apply the fix exactly as the diagnosis prescribes**
|
||||
|
||||
Follow the "Proposed fix" section of `.planning/sse-flakiness-diagnosis.md` step-by-step. Do not adapt or extend — the diagnosis is the plan.
|
||||
|
||||
- [ ] **Step 6.A.3: Run the 4 failing SSE tests**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -30
|
||||
```
|
||||
|
||||
Expected: **0 failures.** If any remain, the diagnosis was incomplete — fall back to Task 6.C and park the residual.
|
||||
|
||||
- [ ] **Step 6.A.4: Commit**
|
||||
|
||||
```bash
|
||||
git commit -m "$(cat <<'EOF'
|
||||
fix(sse): <one-line description from diagnosis>
|
||||
|
||||
<2-3 sentence explanation pulled from diagnosis doc>
|
||||
|
||||
Closes 4 parked SSE test failures from .planning/it-triage-report.md.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6.B — Auto-heal guard (likely fix if Step 5.3 confirms)
|
||||
|
||||
If diagnosis confirmed `AgentSseController` auto-heals regardless of JWT subject vs path id.
|
||||
|
||||
- [ ] **Step 6.B.1: Impact analysis**
|
||||
|
||||
```bash
|
||||
gitnexus_impact({target: "AgentSseController.events", direction: "upstream"})
|
||||
gitnexus_impact({target: "JwtService.JwtValidationResult", direction: "upstream"})
|
||||
```
|
||||
|
||||
- [ ] **Step 6.B.2: Inspect JwtValidationResult for subject access**
|
||||
|
||||
Read the `JwtValidationResult` class/record. Confirm it exposes the JWT subject (likely `.subject()` or similar accessor). Note the field name.
|
||||
|
||||
- [ ] **Step 6.B.3: Add the guard to `AgentSseController.events`**
|
||||
|
||||
Edit `cameleer-server-app/src/main/java/com/cameleer/server/app/controller/AgentSseController.java:63-76`.
|
||||
|
||||
Replace the auto-heal block:
|
||||
|
||||
```java
|
||||
AgentInfo agent = registryService.findById(id);
|
||||
if (agent == null) {
|
||||
// Auto-heal: re-register agent from JWT claims after server restart
|
||||
var jwtResult = (JwtService.JwtValidationResult) httpRequest.getAttribute(
|
||||
JwtAuthenticationFilter.JWT_RESULT_ATTR);
|
||||
if (jwtResult != null) {
|
||||
String application = jwtResult.application() != null ? jwtResult.application() : "default";
|
||||
String env = jwtResult.environment() != null ? jwtResult.environment() : "default";
|
||||
registryService.register(id, id, application, env, "unknown", List.of(), Map.of());
|
||||
log.info("Auto-registered agent {} (app={}, env={}) from SSE connect after server restart", id, application, env);
|
||||
} else {
|
||||
throw new ResponseStatusException(HttpStatus.NOT_FOUND, "Agent not found: " + id);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
with:
|
||||
|
||||
```java
|
||||
AgentInfo agent = registryService.findById(id);
|
||||
if (agent == null) {
|
||||
// Auto-heal re-registers an agent from JWT claims after server restart,
|
||||
// but only when the JWT subject matches the path id. Otherwise a
|
||||
// different agent could spoof any agentId in the URL.
|
||||
var jwtResult = (JwtService.JwtValidationResult) httpRequest.getAttribute(
|
||||
JwtAuthenticationFilter.JWT_RESULT_ATTR);
|
||||
if (jwtResult != null && id.equals(jwtResult.subject())) {
|
||||
String application = jwtResult.application() != null ? jwtResult.application() : "default";
|
||||
String env = jwtResult.environment() != null ? jwtResult.environment() : "default";
|
||||
registryService.register(id, id, application, env, "unknown", List.of(), Map.of());
|
||||
log.info("Auto-registered agent {} (app={}, env={}) from SSE connect after server restart", id, application, env);
|
||||
} else {
|
||||
throw new ResponseStatusException(HttpStatus.NOT_FOUND, "Agent not found: " + id);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Adjust `jwtResult.subject()` to the actual accessor method from Step 6.B.2 (could be `.subject()`, `.instanceId()`, `.agentId()`, etc.).
|
||||
|
||||
- [ ] **Step 6.B.4: Run `sseConnect_unknownAgent_returns404`**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT#sseConnect_unknownAgent_returns404' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -15
|
||||
```
|
||||
|
||||
Expected: PASS. The controller now returns a synchronous 404 for the mismatched case.
|
||||
|
||||
- [ ] **Step 6.B.5: Run the remaining 3 SSE tests**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -30
|
||||
```
|
||||
|
||||
If all 4 now pass → Task 6.B closed everything. If 3 still fail (the `awaitConnection` trio) → the auto-heal guard fixed the 404 case but the others have a separate root cause. Fall back to Task 6.A with narrower scope, or Task 6.C to park the residual.
|
||||
|
||||
- [ ] **Step 6.B.6: Commit**
|
||||
|
||||
```bash
|
||||
git add cameleer-server-app/src/main/java/com/cameleer/server/app/controller/AgentSseController.java
|
||||
git commit -m "$(cat <<'EOF'
|
||||
fix(sse): auto-heal requires JWT subject to match requested agent id
|
||||
|
||||
AgentSseController.events auto-registered an unknown agent id from JWT
|
||||
claims whenever any valid JWT was present, regardless of whose agent the
|
||||
JWT actually identified. This was a spoofing vector — a holder of a JWT
|
||||
for agent X could open SSE for any path-id Y — and it silently masked
|
||||
404 as 200 with an infinite empty stream (surface symptom: the parked
|
||||
sseConnect_unknownAgent_returns404 test hung for 5s on the status
|
||||
future).
|
||||
|
||||
Auto-heal now triggers only when the JWT subject equals the path id.
|
||||
Cross-agent requests fall through to the existing 404.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6.C — Park and annotate (fallback if diagnosis inconclusive)
|
||||
|
||||
If the 45-minute diagnosis timebox expires without a confident root cause, or Task 6.A/6.B leaves residual failures.
|
||||
|
||||
- [ ] **Step 6.C.1: Annotate the failing tests**
|
||||
|
||||
Edit each of the (still-)failing test methods in `AgentSseControllerIT` and `SseSigningIT`. Add above each method:
|
||||
|
||||
```java
|
||||
@org.junit.jupiter.api.Disabled(
|
||||
"Parked — see .planning/sse-flakiness-diagnosis.md. Order-dependent "
|
||||
+ "flakiness; passes in isolation. Re-enable after fix.")
|
||||
```
|
||||
|
||||
Leave the rest of the method unchanged.
|
||||
|
||||
- [ ] **Step 6.C.2: Run to confirm they skip**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tail -20
|
||||
```
|
||||
|
||||
Expected: 0 failures, N skipped (where N is the number parked). Other tests still run.
|
||||
|
||||
- [ ] **Step 6.C.3: Commit**
|
||||
|
||||
```bash
|
||||
git add cameleer-server-app/src/test/java/com/cameleer/server/app/controller/AgentSseControllerIT.java \
|
||||
cameleer-server-app/src/test/java/com/cameleer/server/app/controller/SseSigningIT.java
|
||||
git commit -m "$(cat <<'EOF'
|
||||
test(sse): park flaky tests with @Disabled pending fix
|
||||
|
||||
Order-dependent flakiness; all four tests pass in isolation. Diagnosis
|
||||
in .planning/sse-flakiness-diagnosis.md was inconclusive within the
|
||||
investigation timebox. Re-enable after targeted fix.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 7 — Final verify + push
|
||||
|
||||
- [ ] **Step 7.1: Full verify**
|
||||
|
||||
```bash
|
||||
mvn -pl cameleer-server-app -am -Dit.test='!SchemaBootstrapIT' -Dtest='!*' -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify 2>&1 | tee /tmp/final-verify.log | tail -60
|
||||
```
|
||||
|
||||
Expected: **0 failures**, plus either (a) all SSE tests passing if Task 6.A/6.B succeeded, or (b) 4 skipped if Task 6.C was taken.
|
||||
|
||||
If any non-SSE test fails that previously passed: STOP. Root cause before pushing. Likely a regression from Task 1, 2, or 3 that escaped unit verification.
|
||||
|
||||
- [ ] **Step 7.2: Confirm commit history**
|
||||
|
||||
```bash
|
||||
git log --oneline -15
|
||||
```
|
||||
|
||||
Expected: the new commits in order — CH timezone fix, scheduler SpEL fix, SearchIndexer removal, TaggedExecution removal, SSE diagnosis doc, SSE fix or park.
|
||||
|
||||
- [ ] **Step 7.3: Push to main**
|
||||
|
||||
```bash
|
||||
git push origin main
|
||||
```
|
||||
|
||||
The user explicitly authorized pushing to main for this overnight run. If the remote rejects (non-fast-forward, auth), stop and report — do not `--force`.
|
||||
|
||||
- [ ] **Step 7.4: Update `.planning/it-triage-report.md`**
|
||||
|
||||
Append a closing section at the bottom of the triage report:
|
||||
|
||||
```markdown
|
||||
## Follow-up (2026-04-22)
|
||||
|
||||
Closed the 3 parked clusters:
|
||||
|
||||
- **ClickHouseStatsStoreIT (8 failures)** — fixed via column-level `DateTime('UTC')` on `bucket` + defensive `toDateTime(..., 'UTC')` cast in `ClickHouseStatsStore.lit(Instant)`.
|
||||
- **MetricsFlushScheduler property-key drift** — scheduler now binds via SpEL `#{@ingestionConfig.flushIntervalMs}`; BackpressureIT workaround property dropped.
|
||||
- **SSE flakiness (4 failures)** — see `.planning/sse-flakiness-diagnosis.md`; resolved by <one-line summary from diagnosis> / parked with `@Disabled` pending targeted fix.
|
||||
|
||||
Plus two prod-code cleanups from the ExecutionController removal follow-ons: removed dead `SearchIndexer` subsystem and unused `TaggedExecution` record.
|
||||
|
||||
Final verify: **0 failures** (or: **0 failures, 4 skipped SSE tests**).
|
||||
```
|
||||
|
||||
Commit:
|
||||
|
||||
```bash
|
||||
git add .planning/it-triage-report.md
|
||||
git commit -m "$(cat <<'EOF'
|
||||
docs(triage): IT triage report — close-out of remaining 12 failures
|
||||
|
||||
All three parked clusters closed + two prod-code side-notes landed.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||||
EOF
|
||||
)"
|
||||
git push origin main
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Self-review (pre-execution)
|
||||
|
||||
**Spec coverage:**
|
||||
- [x] Item 1 (CH timezone) → Task 1
|
||||
- [x] Item 2 (SSE flakiness) → Tasks 5 + 6 (diagnose-then-fix, autonomous variant without user checkpoint)
|
||||
- [x] Item 3 (scheduler property-key) → Task 2
|
||||
- [x] Item 4a (SearchIndexer cleanup) → Task 3
|
||||
- [x] Item 4b (TaggedExecution removal) → Task 4
|
||||
- [x] Execution order (Wave 1 parallelizable, Wave 2 sequential) → reflected in task numbering; Wave 1 tasks have no inter-task dependencies and can be executed in any order, Wave 2 (Tasks 5 → 6) is strictly sequential.
|
||||
|
||||
**Placeholder scan:** Every step contains concrete commands, file paths, and code blocks. The one deferred decision (Task 6 branching based on diagnosis) is bounded — all three branches (A/B/C) are fully specified.
|
||||
|
||||
**Type consistency:** `ingestionConfig` bean name consistent across Task 2 steps. `JwtValidationResult.subject()` access method flagged as "verify" in Step 6.B.2 — the actual accessor is confirmed during diagnosis, not guessed here.
|
||||
|
||||
**Deviations from spec:** the spec called for a user checkpoint between SSE diagnosis and fix. This plan runs autonomously (user is asleep), so the checkpoint becomes a decision tree (Task 6.A/6.B/6.C) with explicit stop conditions.
|
||||
Reference in New Issue
Block a user