7-phase plan to replace the interim destroy-then-start flow (f8dccaae)
with a strategy-aware executor. Adds gen-suffixed container names so
old + new replicas can coexist, plus a cameleer.generation label for
Prometheus/Grafana deploy-boundary annotations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deployment Strategies (blue-green + rolling) — Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Make `deploymentStrategy` actually affect runtime behavior. Support blue-green (all-at-once, default) and rolling (per-replica) deployments with correct semantics. Unblock real blue/green by giving each deployment a unique container-name generation suffix so old + new replicas can coexist during the swap.
Current state (interim fix landed in f8dccaae): strategy field exists but executor doesn't branch on it; a destroy-then-start flow runs regardless. This plan replaces that interim behavior.
Architecture:
- Append an 8-char `gen` suffix (the first 8 chars of `deployment.id`) to the container name AND to `CAMELEER_AGENT_INSTANCEID`. Unique per deployment; no new DB state.
- Add a `cameleer.generation` Docker label so Grafana/Prometheus can pin deploy boundaries without regexing the instance-id.
- Branch `DeploymentExecutor.executeAsync` on strategy:
  - blue-green: start all N new → health-check all → stop all old. Strict all-healthy: partial = FAILED (old stays running).
  - rolling: per-replica loop: start new[i] → health-check → stop old[i] → next. Mid-rollout failure → stop the failed new[i], leave the remaining old[i..n] running, mark FAILED.
- Keep destroy-then-start as the fallback for unknown strategy values (safety net).
Reference: interim-fix commit f8dccaae; investigation summary in the session log.
File Structure
Backend (new / modified)
- Create: `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java` — enum `BLUE_GREEN`, `ROLLING`; `fromWire(String)` with blue-green fallback; `toWire()` → "blue-green" / "rolling".
- Modify: `cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/DeploymentExecutor.java` — add the `gen` computation, strategy branching, and per-strategy START_REPLICAS + HEALTH_CHECK + SWAP_TRAFFIC flows. Rewrite the body of `executeAsync` so stages 4–6 dispatch on strategy. Extract helper methods `deployBlueGreen` and `deployRolling` to keep each path readable.
- Modify: `cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/TraefikLabelBuilder.java` — take a `gen` argument; emit the `cameleer.generation` label; `cameleer.instance-id` becomes `{envSlug}-{appSlug}-{replicaIndex}-{gen}`.
- Modify: `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentService.java` — `containerName` stored on the row stays `env.slug() + "-" + app.slug()` (unchanged — it is already just the group name for DB/operator visibility; the real Docker name is computed in the executor).
- Modify: `cameleer-server-app/src/test/java/com/cameleer/server/app/controller/DeploymentControllerIT.java` — update the single assertion that pins the `container_name` format, if any (spotted at line ~112 in the investigation).
- Create: `cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/BlueGreenStrategyIT.java` — two tests: the all-replicas-healthy path stops old after new; partial-healthy aborts, preserving old.
- Create: `cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/RollingStrategyIT.java` — two tests: a happy rolling 3→3 replacement; a failure on replica 1 preserves the remaining old replicas.
UI
- Modify: `ui/src/pages/AppsTab/AppDeploymentPage/ConfigTabs/ResourcesTab.tsx` — confirm the strategy dropdown offers "blue-green" and "rolling" with descriptive labels plus a hint line.
- Modify: `ui/src/pages/AppsTab/AppDeploymentPage/DeploymentTab/StatusCard.tsx` — surface `deployment.deploymentStrategy` as a small text/badge near the version badge (read-only).
Docs + rules
- Modify: `.claude/rules/docker-orchestration.md` — rewrite the "DeploymentExecutor Details" and "Blue/green strategy" sections to describe the new behavior and the `gen` suffix; retire the interim destroy-then-start note.
- Modify: `.claude/rules/app-classes.md` — update the `DeploymentExecutor` bullet under `runtime/`.
- Modify: `.claude/rules/core-classes.md` — note the new `DeploymentStrategy` enum under `runtime/`.
Phase 1 — Core: DeploymentStrategy enum + gen utility
Task 1.1: DeploymentStrategy enum
Files: Create cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java.
- Create the enum with two constants: `BLUE_GREEN`, `ROLLING`.
- Add `toWire()` returning `"blue-green"` / `"rolling"`.
- Add `fromWire(String)` — case-insensitive match; unknown or null → `BLUE_GREEN` with no throw (safety fallback). Returns an enum value, never null.
Verification: unit test covering known + unknown + null inputs.
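A minimal sketch of what Task 1.1 describes (package declaration and final placement are up to the implementer; backing `toWire` with a wire-string field is one reasonable shape, not mandated by the plan):

```java
// Sketch of DeploymentStrategy per Task 1.1. The wire-string constructor arg is an
// implementation choice; the plan only fixes the constant names and wire values.
enum DeploymentStrategy {
    BLUE_GREEN("blue-green"),
    ROLLING("rolling");

    private final String wire;

    DeploymentStrategy(String wire) { this.wire = wire; }

    /** Kebab-case value as stored on the container config. */
    String toWire() { return wire; }

    /** Case-insensitive parse; unknown or null falls back to BLUE_GREEN. Never null, never throws. */
    static DeploymentStrategy fromWire(String value) {
        if (value == null) return BLUE_GREEN;
        for (DeploymentStrategy s : values()) {
            if (s.wire.equalsIgnoreCase(value.trim())) return s;
        }
        return BLUE_GREEN;  // safety fallback for unknown strategy strings
    }
}
```

The fallback-not-throw choice is what lets the executor's dispatch absorb bad config values (see Non-goals).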
Task 1.2: Generation suffix helper
- Decide location — an inline static helper on `DeploymentExecutor` is fine (`private static String gen(UUID id) { return id.toString().substring(0, 8); }`). No new file needed.
Phase 2 — Executor: gen-suffixed naming + cameleer.generation label
This phase is purely the naming change; no strategy branching yet. After this phase, redeploy still uses the interim destroy-then-start flow, but containers carry the new names and label.
Task 2.1: TraefikLabelBuilder — accept gen, emit generation label
Files: Modify TraefikLabelBuilder.java.
- Add `String gen` as a new argument on `build(...)`.
- Change the `instanceId` construction to `envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen`.
- Add the label `cameleer.generation = gen`.
- Leave the Traefik router/service label keys using `svc = envSlug + "-" + appSlug` (unchanged — routing is generation-agnostic, so load balancing across old + new works automatically).
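To make the split concrete — a reduced stand-in for the label wiring (not the real `TraefikLabelBuilder` signature; the Traefik port label and its value are placeholders):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: shows which label keys stay generation-agnostic (routing) and
// which become generation-specific (instance identity). Port 8080 is a placeholder.
final class LabelSketch {
    static Map<String, String> labels(String envSlug, String appSlug, int replicaIndex, String gen) {
        String svc = envSlug + "-" + appSlug;                      // generation-agnostic routing key
        String instanceId = svc + "-" + replicaIndex + "-" + gen;  // unique per deployment
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("cameleer.instance-id", instanceId);
        labels.put("cameleer.generation", gen);
        labels.put("traefik.http.services." + svc + ".loadbalancer.server.port", "8080");
        return labels;
    }
}
```

Because `svc` omits `gen`, old and new containers register under the same Traefik service and share traffic during the swap.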
Task 2.2: DeploymentExecutor — compute gen once, thread through
Files: Modify DeploymentExecutor.executeAsync.
- At the top of the try block (after `env`, `app`, `config` resolution), compute `String gen = gen(deployment.id());`.
- In the replica loop: `String instanceId = env.slug() + "-" + app.slug() + "-" + i + "-" + gen;` and `String containerName = tenantId + "-" + instanceId;`.
- Pass `gen` to `TraefikLabelBuilder.build(...)`.
- Set `CAMELEER_AGENT_INSTANCEID=instanceId` (already done — just verify the new value propagates).
- Store `replicaStates[].containerName` as the new full name.
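Combined with the Task 1.2 helper, the naming logic is small enough to sketch in isolation (hypothetical stand-alone class; in the real code these live inline in `DeploymentExecutor`):

```java
import java.util.UUID;

// Hypothetical extraction of the naming logic from Tasks 1.2 and 2.2.
final class NamingSketch {
    /** 8-char generation suffix: the first 8 chars of the deployment UUID. */
    static String gen(UUID deploymentId) {
        return deploymentId.toString().substring(0, 8);
    }

    /** Full Docker container name: {tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{gen}. */
    static String containerName(String tenantId, String envSlug, String appSlug,
                                int replicaIndex, String gen) {
        return tenantId + "-" + envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen;
    }
}
```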
Task 2.3: Update the one brittle test
Files: Modify DeploymentControllerIT.java.
- Relax the container-name assertion to `startsWith("default-default-deploy-test-")` or similar — verify behavior, not the exact suffix.
Verification after Phase 2:
`mvn -pl cameleer-server-app -am test -Dtest=DeploymentSnapshotIT,DeploymentControllerIT,PostgresDeploymentRepositoryIT`
- All green; container names now include `gen`; redeploy still works via the interim destroy-then-start flow (replaced in Phase 3).
Phase 3 — Blue-green strategy (default)
Task 3.1: Extract deployBlueGreen(...) helper
Files: Modify DeploymentExecutor.java.
- Move the current START_REPLICAS → HEALTH_CHECK → SWAP_TRAFFIC body into a new `private void deployBlueGreen(...)` method.
- Signature: take `deployment`, `app`, `env`, `config`, `resolvedRuntimeType`, `mainClass`, `gen`, `primaryNetwork`, `additionalNets`.
Task 3.2: Reorder for proper blue-green
- Remove the pre-flight "stop previous" block added in `f8dccaae` (replaced by the post-health swap).
- Order: start all new → wait for all healthy → find the previous active deployment (via `findActiveByAppIdAndEnvironmentIdExcluding`) → stop the old containers + mark the old row STOPPED.
- Strict all-healthy: if `healthyCount < config.replicas()`, stop the new containers we just started and mark the deployment FAILED with `"blue-green: %d/%d replicas healthy; preserving previous deployment"`. Do not touch the old deployment.
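The ordering and failure semantics above can be sketched with the container lifecycle abstracted to lambdas (names and shapes here are illustrative, not the executor's real API; starting the new replicas is assumed to have already happened):

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Simplified blue-green control flow per Task 3.2. waitHealthy stands in for the
// health-check wait; stop stands in for stopContainer + removeContainer.
final class BlueGreenSketch {
    /** Returns true when the swap completed; false means FAILED with old replicas untouched. */
    static boolean deploy(List<String> newReplicas, List<String> oldReplicas,
                          Predicate<String> waitHealthy, Consumer<String> stop) {
        // Strict all-healthy gate: every new replica must pass before anything old is touched.
        long healthy = newReplicas.stream().filter(waitHealthy).count();
        if (healthy < newReplicas.size()) {
            newReplicas.forEach(stop);  // tear down the partial new generation
            return false;               // old deployment stays RUNNING
        }
        // Only after all-healthy: stop the previous generation.
        oldReplicas.forEach(stop);
        return true;
    }
}
```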
Task 3.3: Wire strategy dispatch
- At the point where `deployBlueGreen` is called, check `DeploymentStrategy.fromWire(config.deploymentStrategy())` and dispatch on the result. `BLUE_GREEN` calls `deployBlueGreen`; `ROLLING` dispatches to `deployRolling(...)`, implemented in Phase 4 — stub it to throw `UnsupportedOperationException` for now; the stub is replaced when Phase 4 lands.
Phase 4 — Rolling strategy
Task 4.1: deployRolling(...) helper
Files: Modify DeploymentExecutor.java.
- Same signature as `deployBlueGreen`.
- Look up the previous deployment once at entry via `findActiveByAppIdAndEnvironmentIdExcluding`. Capture its `replicaStates` into a map keyed by replica index.
- For `i` from 0 to `config.replicas() - 1`:
  - Start new replica `i` (with the gen-suffixed name).
  - Wait for this single container to go healthy (per-replica `waitForOneHealthy(containerId, timeoutSeconds)`; reuse `healthCheckTimeout` per replica or introduce a smaller per-replica budget).
  - On success: stop the corresponding old replica `i` by `containerId` from the previous deployment's replicaStates (if present); log and continue.
  - On failure: stop + remove all new replicas started so far, mark the deployment FAILED with `"rolling: replica %d failed to reach healthy; preserved %d previous replicas"`. Do not touch the already-replaced old replicas (they are already stopped) or the not-yet-replaced ones (they keep serving).
- After the loop succeeds for all replicas, mark the previous deployment row STOPPED (its containers are all stopped).
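The per-replica loop above, again with the lifecycle abstracted to lambdas (illustrative names, not the executor's real API; "start" is implied by tracking each new replica for cleanup):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Simplified rolling control flow per Task 4.1.
final class RollingSketch {
    /** Returns the index of the failed replica, or -1 when the whole rollout succeeded. */
    static int deploy(List<String> newReplicas, List<String> oldReplicas,
                      Predicate<String> waitOneHealthy, Consumer<String> stop) {
        List<String> started = new ArrayList<>();
        for (int i = 0; i < newReplicas.size(); i++) {
            String fresh = newReplicas.get(i);
            started.add(fresh);                  // track every new replica started so far
            if (!waitOneHealthy.test(fresh)) {
                started.forEach(stop);           // clean up all new replicas started so far
                return i;                        // old[i..n) keep serving; caller marks FAILED
            }
            if (i < oldReplicas.size()) {
                stop.accept(oldReplicas.get(i)); // replace old[i] only after new[i] is healthy
            }
        }
        return -1;  // caller marks the previous deployment row STOPPED
    }
}
```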
Task 4.2: Add waitForOneHealthy
- A variant of `waitForAnyHealthy` that polls a single container id. Returns boolean. Same sleep cadence.
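A sketch of the polling shape, assuming the existing method's deadline-plus-sleep cadence (the health probe is a lambda here; the real code inspects the container's Docker health status, and the real signature takes fewer knobs):

```java
import java.util.function.Predicate;

// Polling variant for a single container, per Task 4.2. Interruption is treated
// as "not healthy" so the checked exception does not leak to callers.
final class HealthPoll {
    static boolean waitForOneHealthy(String containerId, long timeoutMillis, long sleepMillis,
                                     Predicate<String> isHealthy) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (isHealthy.test(containerId)) return true;
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return isHealthy.test(containerId);  // one final check at the deadline
    }
}
```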
Task 4.3: Replace the Phase 3 stub
- `ROLLING` dispatch calls `deployRolling` instead of throwing.
Phase 5 — Integration tests
Each IT extends `AbstractPostgresIT`, uses `@MockBean RuntimeOrchestrator`, and overrides `cameleer.server.runtime.healthchecktimeout=2` via `@TestPropertySource`.
Task 5.1: BlueGreenStrategyIT
Files: Create BlueGreenStrategyIT.java.
- Test 1 `blueGreen_allHealthy_stopsOldAfterNew`: seed a previous RUNNING deployment (2 replicas). Trigger a redeploy with `containerConfig.deploymentStrategy=blue-green` and replicas=2. Mock orchestrator: new containers return healthy. Await the new deployment reaching RUNNING. Assert: the previous deployment has status STOPPED and its container IDs had `stopContainer` + `removeContainer` called; the new deployment's replicaStates contain the two new container IDs; the `cameleer.generation` label is present on both new container requests.
- Test 2 `blueGreen_partialHealthy_preservesOldAndMarksFailed`: seed a previous RUNNING deployment (2 replicas). New deploy with replicas=2. Mock: container A healthy, container B starting forever. Await the new deployment reaching FAILED. Assert: the previous deployment is still RUNNING; its container IDs were not stopped; the new deployment's errorMessage contains "1/2 replicas healthy".
Task 5.2: RollingStrategyIT
Files: Create RollingStrategyIT.java.
- Test 1 `rolling_allHealthy_replacesOneByOne`: seed a previous RUNNING deployment (3 replicas). New deploy with strategy=rolling, replicas=3. Mock: new containers all healthy. Use an `ArgumentCaptor` on `startContainer` to observe the start order. Assert: start[0] → stop[old0] → start[1] → stop[old1] → start[2] → stop[old2]; the new deployment is RUNNING with 3 replicaStates; the old deployment is STOPPED.
- Test 2 `rolling_failsMidRollout_preservesRemainingOld`: seed a previous RUNNING deployment (3 replicas). New deploy with strategy=rolling. Mock: new[0] healthy, new[1] never healthy. Await FAILED. Assert: new[0] was stopped during cleanup; old[0] was stopped (replaced before the failure); old[1] + old[2] are still RUNNING; the new deployment's errorMessage contains "replica 1".
Phase 6 — UI strategy indicator
Task 6.1: Strategy dropdown polish
Files: Modify ResourcesTab.tsx.
- Verify the `<select>` has options `blue-green` and `rolling`.
- Add a one-line description under the dropdown: "Blue-green: start all new, swap when healthy. Rolling: replace one replica at a time."
Task 6.2: Strategy on StatusCard
Files: Modify DeploymentTab/StatusCard.tsx.
- Add a small, subtle text line in the grid: `<span>Strategy</span><span>{deployment.deploymentStrategy}</span>` (read-only; mono text is fine).
Phase 7 — Docs + rules updates
Task 7.1: Update .claude/rules/docker-orchestration.md
- Replace the "DeploymentExecutor Details" section with the new flow (gen suffix, strategy dispatch, per-strategy ordering).
- Update the "Deployment Status Model" table — `DEGRADED` now means "post-deploy replica crashed"; failed-during-deploy is always `FAILED`.
- Add a short "Deployment Strategies" section: the behavior of blue-green vs rolling, peak resource usage, and failure semantics.
Task 7.2: Update .claude/rules/app-classes.md
- Under the `runtime/` → `DeploymentExecutor` bullet, add: "branches on `DeploymentStrategy.fromWire(config.deploymentStrategy())`. Container name format: `{tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{gen}` where `gen` is the 8-char prefix of the deployment UUID."
Task 7.3: Update .claude/rules/core-classes.md
- Add under `runtime/`: `DeploymentStrategy` — enum `BLUE_GREEN`, `ROLLING`; `fromWire` falls back to `BLUE_GREEN`; stored as a kebab-case string on the config.
Rollout sequence
- Phase 1 (enum + helper) — trivial, land as one commit.
- Phase 2 (naming + generation label) — one commit; interim destroy-then-start still active; regenerates no OpenAPI (no controller change).
- Phase 3 (blue-green as default) — one commit replacing the interim flow. This is where real behavior changes.
- Phase 4 (rolling) — one commit.
- Phase 5 (4 ITs) — one commit; run `mvn test` against the affected modules.
- Phase 6 (UI) — one commit; `npx tsc` clean.
- Phase 7 (docs) — one commit.
Total: 7 commits, all atomic.
Acceptance
- Existing `DeploymentSnapshotIT` still passes.
- New `BlueGreenStrategyIT` (2 tests) and `RollingStrategyIT` (2 tests) pass.
- Browser QA: redeploy with `deploymentStrategy=blue-green` vs `rolling` produces the expected container timeline (inspect via `docker ps`); Prometheus metrics show continuity across deploys when queried by `{cameleer_app, cameleer_environment}`; the `cameleer_generation` label flips per deploy.
- `.claude/rules/docker-orchestration.md` reflects the new behavior.
Non-goals
- Automatic rollback on blue-green partial failure (old is left running; user redeploys).
- Automatic rollback on rolling mid-failure (remaining old replicas keep running; user redeploys).
- Per-replica `HEALTH_CHECK` stage labels in the UI progress bar — the 7-stage progress is reused as-is; the strategy dictates the internal looping.
- Strategy field validation at container-config save time (the executor's `fromWire` fallback absorbs unknown values — consider a follow-up for strict validation if it becomes an issue).