fix(agent): revive DEAD agents on heartbeat (not just STALE)

Reproduction: pause a container long enough to cross both the stale and dead thresholds, then unpause. The agent resumes sending heartbeats but the server keeps it shown as DEAD. Only a full container restart (which re-registers) fixes it.

Root cause: AgentRegistryService.heartbeat() only revived STALE → LIVE. A DEAD agent's heartbeat updated lastHeartbeat but left state unchanged. checkLifecycle() never downgrades DEAD either (no-op in that branch), so the agent was permanently stuck in DEAD until a register() call.

Fix: extend the revival branch to also cover DEAD. Same process; a heartbeat is proof of liveness regardless of the previous state.

Also: AgentLifecycleMonitor.mapTransitionEvent() now emits RECOVERED for DEAD → LIVE, mirroring its behavior for STALE → LIVE, so the lifecycle timeline captures the transition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 20:55:47 +02:00
parent 90083f886a
commit fb54f9cbd2
3 changed files with 24 additions and 4 deletions
--- a/cameleer-server-app/src/main/java/com/cameleer/server/app/agent/AgentLifecycleMonitor.java
+++ b/cameleer-server-app/src/main/java/com/cameleer/server/app/agent/AgentLifecycleMonitor.java
@@ -70,7 +70,7 @@ public class AgentLifecycleMonitor {
    private String mapTransitionEvent(AgentState from, AgentState to) {
        if (from == AgentState.LIVE && to == AgentState.STALE) return "WENT_STALE";
        if (from == AgentState.STALE && to == AgentState.DEAD) return "WENT_DEAD";
-        if (from == AgentState.STALE && to == AgentState.LIVE) return "RECOVERED";
+        if (to == AgentState.LIVE && (from == AgentState.STALE || from == AgentState.DEAD)) return "RECOVERED";
        return null;
    }
 }
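
For reviewers who want to try the behavior in isolation, the revival rule from the commit message can be sketched as a small standalone class. This is an illustration under assumptions, not the actual AgentRegistryService code: the `Agent` holder, the `heartbeat(Agent, Instant)` signature, and the `HeartbeatRevivalSketch` class name are hypothetical. Only the `AgentState` values and the body of `mapTransitionEvent()` are taken from the diff above.

```java
import java.time.Instant;

public class HeartbeatRevivalSketch {

    enum AgentState { LIVE, STALE, DEAD }

    // Hypothetical stand-in for the registry's per-agent record.
    static final class Agent {
        AgentState state = AgentState.LIVE;
        Instant lastHeartbeat = Instant.EPOCH;
    }

    // Assumed shape of the fixed heartbeat handling: record the heartbeat,
    // then revive the agent if it was STALE or DEAD (previously only STALE).
    // Returns the lifecycle event name, or null if nothing should be emitted.
    static String heartbeat(Agent agent, Instant now) {
        AgentState previous = agent.state;
        agent.lastHeartbeat = now;
        if (previous == AgentState.STALE || previous == AgentState.DEAD) {
            agent.state = AgentState.LIVE;
        }
        return mapTransitionEvent(previous, agent.state);
    }

    // Same mapping as the patched AgentLifecycleMonitor.mapTransitionEvent().
    static String mapTransitionEvent(AgentState from, AgentState to) {
        if (from == AgentState.LIVE && to == AgentState.STALE) return "WENT_STALE";
        if (from == AgentState.STALE && to == AgentState.DEAD) return "WENT_DEAD";
        if (to == AgentState.LIVE && (from == AgentState.STALE || from == AgentState.DEAD)) return "RECOVERED";
        return null;
    }

    public static void main(String[] args) {
        // Simulate the reproduction: the agent was marked DEAD while paused.
        Agent agent = new Agent();
        agent.state = AgentState.DEAD;

        String event = heartbeat(agent, Instant.now());
        System.out.println(agent.state + " " + event); // prints "LIVE RECOVERED"
    }
}
```

A heartbeat from an agent that is already LIVE leaves the state untouched and maps to no event, so in this sketch regular heartbeats would not add entries to the lifecycle timeline.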