Files
cameleer-server/.planning/phases/03-agent-registry-sse-push/03-RESEARCH.md
hsiegeln cb3ebfea7c
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Failing after 18s
CI / docker (push) Has been skipped
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Has been skipped
chore: rename cameleer3 to cameleer
Rename Java packages from com.cameleer3 to com.cameleer, module
directories from cameleer3-* to cameleer-*, and all references
throughout workflows, Dockerfiles, docs, migrations, and pom.xml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:28:42 +02:00

27 KiB

Phase 3: Agent Registry + SSE Push - Research

Researched: 2026-03-11 Domain: Agent lifecycle management, Server-Sent Events (SSE), in-memory registry Confidence: HIGH

Summary

This phase adds agent registration, heartbeat-based lifecycle management (LIVE/STALE/DEAD), and real-time command push via SSE to the Cameleer server. The technology stack is straightforward: Spring MVC's SseEmitter for server-push, ConcurrentHashMap for the in-memory agent registry, and @Scheduled for periodic lifecycle checks (same pattern already used by ClickHouseFlushScheduler).

The main architectural challenge is managing per-agent SSE connections reliably -- handling disconnections, timeouts, and cleanup without leaking threads or emitters. The command delivery model (PENDING with 60s expiry, acknowledgement) adds a second concurrent data structure to manage alongside the registry itself.

Primary recommendation: Use Spring MVC SseEmitter (already on classpath via spring-boot-starter-web). No new dependencies required. Follow the established core-module-plain-class / app-module-Spring-bean pattern. Agent registry service in core, SSE connection manager and controllers in app.

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

  • Heartbeat interval: 30 seconds
  • STALE threshold: 90 seconds (3 missed heartbeats)
  • DEAD threshold: 5 minutes after going STALE
  • DEAD agents kept indefinitely (no auto-purge)
  • Agent list endpoint returns all agents (LIVE, STALE, DEAD) with ?status= filter parameter
  • Generic command endpoint: POST /api/v1/agents/{id}/commands with {"type": "config-update|deep-trace|replay", "payload": {...}}
  • Three targeting levels: single agent, group, all live agents
  • Agent self-declares group name at registration
  • Command delivery tracking: PENDING until acknowledged, 60s expiry
  • Agent provides its own persistent ID at registration
  • Rich registration payload: agent ID, name, group, version, list of route IDs, capabilities
  • Re-registration with same ID resumes existing identity
  • Heartbeat is just a ping -- no metadata update
  • Registration response includes: SSE endpoint URL, current server config, server public key placeholder
  • Last-Event-ID supported but does NOT replay missed events
  • Pending commands NOT auto-pushed on reconnect
  • SSE ping/keepalive interval: 15 seconds

Claude's Discretion

  • In-memory vs persistent storage for agent registry (in-memory is fine for v1)
  • Command acknowledgement mechanism details (heartbeat piggyback vs dedicated endpoint)
  • SSE implementation approach (Spring SseEmitter, WebFlux, or other)
  • Thread scheduling for lifecycle state transitions

Deferred Ideas (OUT OF SCOPE)

  • Server-side agent tags/labels for more flexible grouping
  • Auto-push pending commands on reconnect
  • Last-Event-ID replay of missed events
  • Agent capability negotiation

</user_constraints>

<phase_requirements>

Phase Requirements

ID Description Research Support
AGNT-01 (#13) Agent registers via POST /api/v1/agents/register with bootstrap token, receives JWT + server public key Registration controller + service; JWT/security enforcement deferred to Phase 4 but flow must work end-to-end
AGNT-02 (#14) Server maintains agent registry with LIVE/STALE/DEAD lifecycle based on heartbeat timing In-memory ConcurrentHashMap registry + @Scheduled lifecycle monitor
AGNT-03 (#15) Agent sends heartbeat via POST /api/v1/agents/{id}/heartbeat every 30s Heartbeat endpoint updates lastHeartbeat timestamp, transitions STALE back to LIVE
AGNT-04 (#16) Server pushes config-update events to agents via SSE (Ed25519 signature deferred to Phase 4) SseEmitter per-agent connection + command push infrastructure
AGNT-05 (#17) Server pushes deep-trace commands to agents via SSE for specific correlationIds Same SSE command push mechanism with deep-trace type
AGNT-06 (#18) Server pushes replay commands to agents via SSE (signed replay tokens deferred to Phase 4) Same SSE command push mechanism with replay type
AGNT-07 (#19) SSE connection includes ping keepalive and supports Last-Event-ID reconnection 15s ping via @Scheduled, Last-Event-ID header read on connect

</phase_requirements>

Standard Stack

Core

Library Version Purpose Why Standard
Spring MVC SseEmitter 6.2.x (via Boot 3.4.3) Server-Sent Events Already on classpath, servlet-based (matches existing stack), no WebFlux needed
ConcurrentHashMap JDK 17 Agent registry storage Thread-safe, O(1) lookup by agent ID, no external dependency
Spring @Scheduled 6.2.x (via Boot 3.4.3) Lifecycle monitor + SSE keepalive Already enabled in application, proven pattern in ClickHouseFlushScheduler

Supporting

Library Version Purpose When to Use
Jackson ObjectMapper 2.17.3 (managed) Command serialization/deserialization Already configured with JavaTimeModule, used throughout codebase

Alternatives Considered

Instead of Could Use Tradeoff
SseEmitter (MVC) WebFlux Flux Would require adding spring-boot-starter-webflux and mixing reactive/servlet stacks -- unnecessary complexity for this use case
ConcurrentHashMap Redis/ClickHouse persistence Over-engineering for v1; in-memory is sufficient since agent state is ephemeral and rebuilt on reconnect
@Scheduled ScheduledExecutorService @Scheduled already works, already enabled; raw executor only needed for complex scheduling

Installation: No new dependencies required. Everything is already on the classpath.

Architecture Patterns

cameleer-server-core/src/main/java/com/cameleer/server/core/
├── agent/
│   ├── AgentInfo.java              # Record: id, name, group, version, routeIds, capabilities, state, timestamps
│   ├── AgentState.java             # Enum: LIVE, STALE, DEAD
│   ├── AgentRegistryService.java   # Plain class: register, heartbeat, findById, findAll, lifecycle transitions
│   ├── AgentCommand.java           # Record: id, type, payload, targetAgentId, createdAt, status
│   └── CommandStatus.java          # Enum: PENDING, DELIVERED, ACKNOWLEDGED, EXPIRED

cameleer-server-app/src/main/java/com/cameleer/server/app/
├── config/
│   ├── AgentRegistryConfig.java    # @ConfigurationProperties(prefix = "agent-registry")
│   └── AgentRegistryBeanConfig.java # @Configuration: wires AgentRegistryService as bean
├── controller/
│   ├── AgentRegistrationController.java  # POST /register, POST /{id}/heartbeat, GET /agents
│   ├── AgentCommandController.java       # POST /{id}/commands, POST /groups/{group}/commands, POST /commands
│   └── AgentSseController.java           # GET /{id}/events (SSE stream)
├── agent/
│   ├── SseConnectionManager.java         # @Component: ConcurrentHashMap<agentId, SseEmitter>, ping scheduler
│   └── AgentLifecycleMonitor.java        # @Component: @Scheduled lifecycle check (like ClickHouseFlushScheduler)

Pattern 1: Core Module Plain Class + App Module Bean Config

What: Domain logic in core module as plain Java classes; Spring wiring in app module via @Configuration When to use: Always -- this is the established codebase pattern Example:

// Core module: plain class, no Spring annotations
public class AgentRegistryService {
    private final ConcurrentHashMap<String, AgentInfo> agents = new ConcurrentHashMap<>();

    public AgentInfo register(String id, String name, String group, String version,
                              List<String> routeIds, Map<String, Object> capabilities) {
        AgentInfo existing = agents.get(id);
        if (existing != null) {
            // Re-registration: update metadata, transition back to LIVE
            AgentInfo updated = existing.withState(AgentState.LIVE)
                    .withLastHeartbeat(Instant.now());
            agents.put(id, updated);
            return updated;
        }
        AgentInfo agent = new AgentInfo(id, name, group, version, routeIds,
                capabilities, AgentState.LIVE, Instant.now(), Instant.now());
        agents.put(id, agent);
        return agent;
    }

    public boolean heartbeat(String id) {
        return agents.computeIfPresent(id, (k, v) ->
                v.withState(AgentState.LIVE).withLastHeartbeat(Instant.now())) != null;
    }
}

// App module: bean config
@Configuration
public class AgentRegistryBeanConfig {
    @Bean
    public AgentRegistryService agentRegistryService() {
        return new AgentRegistryService();
    }
}

Pattern 2: SseEmitter Per-Agent Connection

What: Each agent has one SseEmitter stored in ConcurrentHashMap, managed by a dedicated component When to use: For all SSE connections to agents Example:

@Component
public class SseConnectionManager {
    private final ConcurrentHashMap<String, SseEmitter> emitters = new ConcurrentHashMap<>();

    public SseEmitter connect(String agentId) {
        // Use Long.MAX_VALUE timeout -- we manage keepalive ourselves
        SseEmitter emitter = new SseEmitter(Long.MAX_VALUE);

        emitter.onCompletion(() -> emitters.remove(agentId));
        emitter.onTimeout(() -> emitters.remove(agentId));
        emitter.onError(e -> emitters.remove(agentId));

        // Replace any existing emitter (agent reconnect)
        SseEmitter old = emitters.put(agentId, emitter);
        if (old != null) {
            old.complete(); // Close stale connection
        }

        return emitter;
    }

    public boolean sendEvent(String agentId, String eventId, String eventType, Object data) {
        SseEmitter emitter = emitters.get(agentId);
        if (emitter == null) return false;
        try {
            emitter.send(SseEmitter.event()
                    .id(eventId)
                    .name(eventType)
                    .data(data, MediaType.APPLICATION_JSON));
            return true;
        } catch (IOException e) {
            emitters.remove(agentId);
            return false;
        }
    }

    public void sendPingToAll() {
        emitters.forEach((id, emitter) -> {
            try {
                emitter.send(SseEmitter.event().comment("ping"));
            } catch (IOException e) {
                emitters.remove(id);
            }
        });
    }
}

Pattern 3: Lifecycle Monitor via @Scheduled

What: Periodic task checks all agents' lastHeartbeat timestamps and transitions states When to use: For LIVE->STALE and STALE->DEAD transitions Example:

@Component
public class AgentLifecycleMonitor {
    private final AgentRegistryService registry;
    private final AgentRegistryConfig config;

    @Scheduled(fixedDelayString = "${agent-registry.lifecycle-check-interval-ms:10000}")
    public void checkLifecycle() {
        Instant now = Instant.now();
        for (AgentInfo agent : registry.findAll()) {
            Duration sinceHeartbeat = Duration.between(agent.lastHeartbeat(), now);
            if (agent.state() == AgentState.LIVE
                    && sinceHeartbeat.toMillis() > config.getStaleThresholdMs()) {
                registry.transitionState(agent.id(), AgentState.STALE);
            } else if (agent.state() == AgentState.STALE
                    && sinceHeartbeat.toMillis() > config.getStaleThresholdMs() + config.getDeadThresholdMs()) {
                registry.transitionState(agent.id(), AgentState.DEAD);
            }
        }
    }
}

Anti-Patterns to Avoid

  • Mixing WebFlux and MVC: Do not add spring-boot-starter-webflux. The project uses servlet-based MVC. Adding WebFlux creates classpath conflicts and ambiguity.
  • Sharing SseEmitter across threads without protection: Always use ConcurrentHashMap and handle IOException on every send. A failed send means the client disconnected.
  • Storing SseEmitter in the core module: SseEmitter is a Spring MVC class. Keep it in the app module only. The core module should define interfaces for "push event to agent" that the app module implements.
  • Not setting SseEmitter timeout: Default timeout is server-dependent (often 30s). Use Long.MAX_VALUE and manage keepalive yourself.

Don't Hand-Roll

Problem Don't Build Use Instead Why
SSE protocol Custom HTTP streaming Spring SseEmitter Handles text/event-stream format, event IDs, retry fields automatically
Thread-safe map Synchronized HashMap ConcurrentHashMap Lock-free reads, segmented writes, battle-tested
Periodic scheduling Manual Thread/Timer @Scheduled + @EnableScheduling Already configured, integrates with Spring lifecycle
JSON serialization Manual string building ObjectMapper (already configured) Handles Instant, unknown fields, all edge cases
Async request timeout Manual thread management spring.mvc.async.request-timeout config Spring handles Tomcat async timeout correctly

Key insight: SSE in Spring MVC is a well-supported, first-class feature. The SseEmitter API handles the wire protocol; your job is managing the lifecycle of emitters (create, store, cleanup, send).

Common Pitfalls

Pitfall 1: SseEmitter Default Timeout Kills Long-Lived Connections

What goes wrong: Emitter times out after 30s (Tomcat default), client gets disconnected Why it happens: Not setting explicit timeout on SseEmitter constructor How to avoid: Always use new SseEmitter(Long.MAX_VALUE). Also set spring.mvc.async.request-timeout=-1 in application.yml to disable the MVC-level async timeout Warning signs: Clients disconnecting every 30 seconds, reconnection storms

Pitfall 2: IOException on Send Not Handled

What goes wrong: Client disconnects but server keeps trying to send, gets IOException, does not clean up Why it happens: Not wrapping every emitter.send() in try-catch How to avoid: Every send must catch IOException, remove the emitter from the map, and log at debug level (not error -- disconnects are normal) Warning signs: Growing emitter map, increasing IOExceptions in logs

Pitfall 3: Race Condition on Agent Reconnect

What goes wrong: Agent disconnects and reconnects rapidly; old emitter and new emitter both exist briefly Why it happens: onCompletion callback of old emitter fires after new emitter is stored, removing the new one How to avoid: Use ConcurrentHashMap.put() which returns the old value. Only remove in callbacks if the emitter in the map is still the same instance (reference equality check) Warning signs: Agent SSE stream stops working after reconnect

Pitfall 4: Tomcat Thread Exhaustion with SSE

What goes wrong: Each SSE connection holds a Tomcat thread (with default sync mode) Why it happens: MVC SseEmitter uses Servlet 3.1 async support but the async processing still occupies a thread from the pool during the initial request How to avoid: Spring Boot's default Tomcat thread pool (200 threads) is sufficient for dozens to low hundreds of agents. If scaling beyond that, configure server.tomcat.threads.max. For thousands of agents, consider WebFlux (but that is a v2 concern) Warning signs: Thread pool exhaustion, connection refused errors

Pitfall 5: Command Expiry Not Cleaned Up

What goes wrong: Expired PENDING commands accumulate in memory Why it happens: No scheduled task to clean them up How to avoid: The lifecycle monitor (or a separate @Scheduled task) should also sweep expired commands every check cycle Warning signs: Memory growth over time, stale commands in API responses

Pitfall 6: SSE Endpoint Blocked by ProtocolVersionInterceptor

What goes wrong: SSE GET request rejected because it lacks X-Cameleer-Protocol-Version header Why it happens: WebConfig already registers the interceptor for /api/v1/agents/** which includes the SSE endpoint How to avoid: Either add the protocol header requirement to agents (recommended -- agents already send it for POST requests) or exclude the SSE endpoint path from the interceptor Warning signs: 400 errors on SSE connect attempts

Code Examples

Registration Controller

@RestController
@RequestMapping("/api/v1/agents")
@Tag(name = "Agent Management", description = "Agent registration and lifecycle endpoints")
public class AgentRegistrationController {

    private final AgentRegistryService registryService;
    private final ObjectMapper objectMapper;

    @PostMapping("/register")
    @Operation(summary = "Register an agent")
    public ResponseEntity<String> register(@RequestBody String body) throws JsonProcessingException {
        // Parse registration payload
        JsonNode node = objectMapper.readTree(body);
        String agentId = node.get("agentId").asText();
        String name = node.get("name").asText();
        String group = node.has("group") ? node.get("group").asText() : "default";
        // ... extract other fields

        AgentInfo agent = registryService.register(agentId, name, group, version, routeIds, capabilities);

        // Build registration response
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("agentId", agent.id());
        response.put("sseEndpoint", "/api/v1/agents/" + agentId + "/events");
        response.put("heartbeatIntervalMs", 30000);
        response.put("serverPublicKey", null); // Phase 4
        // JWT token placeholder -- Phase 4 will add real JWT
        response.put("token", "placeholder-" + agentId);

        return ResponseEntity.ok(objectMapper.writeValueAsString(response));
    }

    @PostMapping("/{id}/heartbeat")
    @Operation(summary = "Agent heartbeat ping")
    public ResponseEntity<Void> heartbeat(@PathVariable String id) {
        boolean found = registryService.heartbeat(id);
        if (!found) {
            return ResponseEntity.notFound().build();
        }
        return ResponseEntity.ok().build();
    }

    @GetMapping
    @Operation(summary = "List all agents")
    public ResponseEntity<String> listAgents(
            @RequestParam(required = false) String status) throws JsonProcessingException {
        List<AgentInfo> agents;
        if (status != null) {
            AgentState stateFilter = AgentState.valueOf(status.toUpperCase());
            agents = registryService.findByState(stateFilter);
        } else {
            agents = registryService.findAll();
        }
        return ResponseEntity.ok(objectMapper.writeValueAsString(agents));
    }
}

SSE Controller

@RestController
@RequestMapping("/api/v1/agents")
@Tag(name = "Agent SSE", description = "Server-Sent Events for agent communication")
public class AgentSseController {

    private final SseConnectionManager connectionManager;
    private final AgentRegistryService registryService;

    @GetMapping(value = "/{id}/events", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    @Operation(summary = "SSE event stream for an agent")
    public SseEmitter subscribe(
            @PathVariable String id,
            @RequestHeader(value = "Last-Event-ID", required = false) String lastEventId) {

        AgentInfo agent = registryService.findById(id);
        if (agent == null) {
            throw new ResponseStatusException(HttpStatus.NOT_FOUND, "Agent not registered");
        }

        // Last-Event-ID acknowledged but no replay (per decision)
        if (lastEventId != null) {
            log.debug("Agent {} reconnected with Last-Event-ID: {} (no replay)", id, lastEventId);
        }

        return connectionManager.connect(id);
    }
}
@PostMapping("/{id}/commands/{commandId}/ack")
@Operation(summary = "Acknowledge command receipt")
public ResponseEntity<Void> acknowledgeCommand(
        @PathVariable String id,
        @PathVariable String commandId) {
    boolean acknowledged = registryService.acknowledgeCommand(id, commandId);
    if (!acknowledged) {
        return ResponseEntity.notFound().build();
    }
    return ResponseEntity.ok().build();
}

Application Configuration Addition

# application.yml additions
agent-registry:
  heartbeat-interval-ms: 30000
  stale-threshold-ms: 90000
  dead-threshold-ms: 300000   # 5 minutes after last heartbeat (not after going stale)
  ping-interval-ms: 15000
  command-expiry-ms: 60000
  lifecycle-check-interval-ms: 10000

spring:
  mvc:
    async:
      request-timeout: -1    # Disable async timeout for SSE

State of the Art

Old Approach Current Approach When Changed Impact
Polling for agent status SSE push for commands Always SSE for server-push Immediate delivery, lower latency
WebFlux for SSE MVC SseEmitter Spring 4.2+ MVC SseEmitter is sufficient for moderate scale; no need for reactive stack
Custom HTTP streaming SseEmitter.event() builder Spring 4.2+ Wire protocol handled automatically

Deprecated/outdated:

  • ResponseBodyEmitter directly for SSE: Use SseEmitter which extends it with SSE-specific features
  • DeferredResult for server push: Only for single-value responses, not streams

Open Questions

  1. Command acknowledgement: dedicated endpoint vs heartbeat piggyback

    • What we know: Dedicated endpoint is simpler, more explicit, and decoupled from heartbeat
    • What's unclear: Whether agent-side implementation prefers one approach
    • Recommendation: Use dedicated POST /{id}/commands/{commandId}/ack endpoint. Cleaner separation of concerns, easier to test, and does not complicate the heartbeat path
  2. Dead threshold calculation: from last heartbeat or from STALE transition?

    • What we know: CONTEXT.md says "5 minutes after going STALE"
    • What's unclear: Whether to track staleTransitionTime separately or compute from lastHeartbeat
    • Recommendation: Track staleTransitionTime in AgentInfo. Dead threshold = 5 minutes after staleTransitionTime. This matches the stated requirement precisely
  3. Async timeout vs SseEmitter timeout

    • What we know: Both spring.mvc.async.request-timeout and new SseEmitter(timeout) affect SSE lifetime
    • What's unclear: Interaction between the two
    • Recommendation: Set SseEmitter(Long.MAX_VALUE) AND spring.mvc.async.request-timeout=-1. Belt and suspenders -- both disabled ensures no premature timeout

Validation Architecture

Test Framework

Property Value
Framework JUnit 5 + Spring Boot Test (via spring-boot-starter-test)
Config file pom.xml (Surefire + Failsafe configured)
Quick run command mvn test -pl cameleer-server-core -Dtest=AgentRegistryServiceTest
Full suite command mvn clean verify

Phase Requirements to Test Map

Req ID Behavior Test Type Automated Command File Exists?
AGNT-01 Agent registers and gets response integration mvn test -pl cameleer-server-app -Dtest=AgentRegistrationControllerIT#registerAgent* No - Wave 0
AGNT-02 Lifecycle transitions LIVE/STALE/DEAD unit mvn test -pl cameleer-server-core -Dtest=AgentRegistryServiceTest#lifecycle* No - Wave 0
AGNT-03 Heartbeat updates timestamp, returns 200/404 integration mvn test -pl cameleer-server-app -Dtest=AgentRegistrationControllerIT#heartbeat* No - Wave 0
AGNT-04 Config-update pushed via SSE integration mvn test -pl cameleer-server-app -Dtest=AgentSseControllerIT#configUpdate* No - Wave 0
AGNT-05 Deep-trace command pushed via SSE integration mvn test -pl cameleer-server-app -Dtest=AgentSseControllerIT#deepTrace* No - Wave 0
AGNT-06 Replay command pushed via SSE integration mvn test -pl cameleer-server-app -Dtest=AgentSseControllerIT#replay* No - Wave 0
AGNT-07 SSE ping keepalive + Last-Event-ID integration mvn test -pl cameleer-server-app -Dtest=AgentSseControllerIT#pingKeepalive* No - Wave 0

Sampling Rate

  • Per task commit: mvn test -pl cameleer-server-core,cameleer-server-app -Dtest="Agent*" (agent-related tests only)
  • Per wave merge: mvn clean verify
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

  • cameleer-server-core/.../agent/AgentRegistryServiceTest.java -- covers AGNT-02, AGNT-03 (unit tests for registry logic)
  • cameleer-server-app/.../controller/AgentRegistrationControllerIT.java -- covers AGNT-01, AGNT-03
  • cameleer-server-app/.../controller/AgentSseControllerIT.java -- covers AGNT-04, AGNT-05, AGNT-06, AGNT-07
  • cameleer-server-app/.../controller/AgentCommandControllerIT.java -- covers command targeting (single, group, all)
  • No new framework install needed -- JUnit 5 + Spring Boot Test + Awaitility already in place

SSE Test Strategy

Testing SSE with TestRestTemplate requires special handling. Use Spring's WebClient from WebFlux test support or raw HttpURLConnection to read the SSE stream. Alternatively, test at the service layer (SseConnectionManager) with direct emitter interaction. The integration test should:

  1. Register agent via POST
  2. Open SSE connection (separate thread)
  3. Send command via POST
  4. Assert SSE stream received the event
  5. Verify with Awaitility for async assertions

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

  • Various Medium articles on SSE patterns - used for cross-referencing community patterns only

Metadata

Confidence breakdown:

  • Standard stack: HIGH - SseEmitter is built into Spring MVC, already on classpath, well-documented API
  • Architecture: HIGH - follows established codebase patterns (core plain class, app bean config, @Scheduled)
  • Pitfalls: HIGH - well-known issues documented in Spring GitHub issues and multiple sources
  • SSE test strategy: MEDIUM - SSE testing with TestRestTemplate is non-trivial, may need adaptation

Research date: 2026-03-11 Valid until: 2026-04-11 (stable stack, no fast-moving dependencies)