Files
cameleer-server/.planning/sse-flakiness-diagnosis.md
hsiegeln 41df042e98 fix(sse): close 4 parked SSE test failures
Three distinct root causes, all reproducible when the classes run
solo — not order-dependent as the triage report suggested. Full
diagnosis in .planning/sse-flakiness-diagnosis.md.

1. AgentSseController.events auto-heal was over-permissive: any valid
   JWT allowed registering an arbitrary path-id, a spoofing vector.
   Surface symptom was the parked sseConnect_unknownAgent_returns404
   test hanging on a 200-with-empty-stream instead of getting 404.
   Fix: auto-heal requires JWT subject == path id.

2. SseConnectionManager.pingAll read ${agent-registry.ping-interval-ms}
   (unprefixed). AgentRegistryConfig binds cameleer.server.agentregistry.*
   — same family of bug as the MetricsFlushScheduler fix in a6944911.
   Fix: corrected placeholder prefix.

3. Spring's SseEmitter doesn't flush response headers until the first
   emitter.send(); clients on BodyHandlers.ofInputStream blocked on
   the first body byte, making awaitConnection(5s) unreliable under a
   15s ping cadence. Fix: send an initial ": connected" comment on
   connect() so headers hit the wire immediately.

Verified: 9/9 SSE tests green across AgentSseControllerIT + SseSigningIT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:41:34 +02:00

4.5 KiB

SSE Flakiness — Root-Cause Analysis

Date: 2026-04-21 Tests: AgentSseControllerIT.sseConnect_unknownAgent_returns404, .lastEventIdHeader_connectionSucceeds, .pingKeepalive_receivedViaSseStream, SseSigningIT.deepTraceEvent_containsValidSignature

Summary

Not order-dependent flakiness (triage report was wrong). Three distinct root causes, one production bug and one test-infrastructure issue, all reproducible when running the classes in isolation.

Reproduction

mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT' -Dtest='!*' \
    -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false verify

Result: 3 failures out of 7 tests with a cold CH container. Not order-dependent.

Root causes

1. AgentSseController.events auto-heal is over-permissive (security bug)

File: cameleer-server-app/src/main/java/com/cameleer/server/app/controller/AgentSseController.java:63-76

AgentInfo agent = registryService.findById(id);
if (agent == null) {
    var jwtResult = ...;
    if (jwtResult != null) {     // ← only checks JWT presence
        registryService.register(id, id, application, env, ...);
    } else {
        throw 404;
    }
}

Bug: auto-heal registers any path id when any valid JWT is present, regardless of whether the JWT subject matches the path id. A holder of agent X's JWT can open SSE for any path-id Y, silently spoofing Y.

Surface symptom: sseConnect_unknownAgent_returns404 sends a JWT for test-agent-sse-it and requests SSE for unknown-sse-agent. Auto-heal kicks in, returns 200 with an infinite empty stream. Test's statusFuture.get(5s) — which uses BodyHandlers.ofString() and waits for the full body — times out instead of getting a synchronous 404.

Fix: only auto-heal when jwtResult.subject().equals(id).

2. SseConnectionManager.pingAll reads an unprefixed property key (production bug)

File: cameleer-server-app/src/main/java/com/cameleer/server/app/agent/SseConnectionManager.java:172

@Scheduled(fixedDelayString = "${agent-registry.ping-interval-ms:15000}")

Bug: AgentRegistryConfig is @ConfigurationProperties(prefix = "cameleer.server.agentregistry"). The scheduler reads an unprefixed agent-registry.* key that the YAML never defines — so the default 15s always applies, regardless of config. Same family of bug as the MetricsFlushScheduler fix in commit a6944911.

Fix: ${cameleer.server.agentregistry.ping-interval-ms:15000}.

3. SSE response body doesn't flush until first event (test timing dependency)

File: cameleer-server-app/src/main/java/com/cameleer/server/app/agent/SseConnectionManager.java:connect()

Spring's SseEmitter holds the response open but doesn't flush headers to the client until the first emitter.send(). Until then, clients using HttpResponse.BodyHandlers.ofInputStream() block on the first byte.

Surface symptom:

  • lastEventIdHeader_connectionSucceeds — asserts awaitConnection(5000) is true. The latch counts down in .thenAccept(response -> ...), which in practice only fires once body bytes start flowing (JDK 21 behaviour with SSE streams). Default ping cadence is 15s → 5s assertion times out.
  • pingKeepalive_receivedViaSseStream — waits 5s for a :ping line. The scheduler runs every 15s (both by default, and because of bug #2, unconditionally).
  • SseSigningIT.deepTraceEvent_containsValidSignature — same family: awaitConnection(5000).isTrue().

Fix: send an initial : connected comment as part of connect(). Spring flushes on the first .send(), so an immediate comment forces the response headers + first byte to hit the wire, which triggers the client's thenAccept callback. Also solves the ping-test: the initial comment is observed as a keepalive line within the test's polling window.

Hypothesis ladder (ruled out)

  • Order-dependent singleton leak — ruled out: every failure reproduces when the class is run solo.
  • Tomcat async thread pool exhaustion — ruled out: SseEmitter(Long.MAX_VALUE) does hold threads, but the 7-test class doesn't reach Tomcat's defaults.
  • SseConnectionManager emitter-map contamination — ruled out: each test uses a unique agent id (UUID-suffixed), and the @Component is the same instance across tests but the emitter map is keyed by agent id, no collisions.

Verification

mvn -pl cameleer-server-app -am -Dit.test='AgentSseControllerIT,SseSigningIT' ... verify
# Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

All 9 tests green with the three fixes applied.