cameleer-server/docs/superpowers/plans/2026-04-23-deployment-strategies.md
hsiegeln 2c82f29aef docs(plans): deployment strategies (blue-green + rolling) plan
7-phase plan to replace the interim destroy-then-start flow (f8dccaae)
with a strategy-aware executor. Adds gen-suffixed container names so
old + new replicas can coexist, plus a cameleer.generation label for
Prometheus/Grafana deploy-boundary annotations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:41:43 +02:00


Deployment Strategies (blue-green + rolling) — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Make deploymentStrategy actually affect runtime behavior. Support blue-green (all-at-once, default) and rolling (per-replica) deployments with correct semantics. Unblock real blue/green by giving each deployment a unique container-name generation suffix so old + new replicas can coexist during the swap.

Current state (interim fix landed in f8dccaae): strategy field exists but executor doesn't branch on it; a destroy-then-start flow runs regardless. This plan replaces that interim behavior.

Architecture:

  • Append an 8-char gen suffix (first 8 chars of deployment.id) to container name AND CAMELEER_AGENT_INSTANCEID. Unique per deployment; no new DB state.
  • Add a cameleer.generation Docker label so Grafana/Prometheus can pin deploy boundaries without regex on instance-id.
  • Branch DeploymentExecutor.executeAsync on strategy:
    • blue-green: start all N new → health-check all → stop all old. Strict all-healthy: partial = FAILED (old stays running).
    • rolling: per-replica loop: start new[i] → health-check → stop old[i] → next. Mid-rollout failure → stop failed new[i], leave remaining old[i..n] running, mark FAILED.
  • Keep destroy-then-start as the fallback for unknown strategy values (safety net).

Reference: interim-fix commit f8dccaae; investigation summary in the session log.


File Structure

Backend (new / modified)

  • Create: cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java — enum BLUE_GREEN, ROLLING; fromWire(String) with blue-green fallback; toWire() → "blue-green" / "rolling".
  • Modify: cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/DeploymentExecutor.java — add gen computation, strategy branching, and per-strategy START_REPLICAS + HEALTH_CHECK + SWAP_TRAFFIC flows. Rewrite the body of executeAsync so stages 4–6 dispatch on strategy. Extract helper methods deployBlueGreen and deployRolling to keep each path readable.
  • Modify: cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/TraefikLabelBuilder.java — take gen argument; emit cameleer.generation label; cameleer.instance-id becomes {envSlug}-{appSlug}-{replicaIndex}-{gen}.
  • Modify: cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentService.java — containerName stored on the row stays env.slug() + "-" + app.slug() (unchanged — it is already just the group name for DB/operator visibility; the real Docker name is computed in the executor).
  • Modify: cameleer-server-app/src/test/java/com/cameleer/server/app/controller/DeploymentControllerIT.java — update the single assertion, if any, that pins the container_name format (spotted at line ~112 in the investigation).
  • Create: cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/BlueGreenStrategyIT.java — two tests: all-replicas-healthy path stops old after new, and partial-healthy aborts preserving old.
  • Create: cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/RollingStrategyIT.java — two tests: happy rolling 3→3 replacement, and fail-on-replica-1 preserves remaining old replicas.

UI

  • Modify: ui/src/pages/AppsTab/AppDeploymentPage/ConfigTabs/ResourcesTab.tsx — confirm the strategy dropdown offers "blue-green" and "rolling" with descriptive labels + a hint line.
  • Modify: ui/src/pages/AppsTab/AppDeploymentPage/DeploymentTab/StatusCard.tsx — surface deployment.deploymentStrategy as a small text/badge near the version badge (read-only).

Docs + rules

  • Modify: .claude/rules/docker-orchestration.md — rewrite the "DeploymentExecutor Details" and "Blue/green strategy" sections to describe the new behavior and the gen suffix; retire the interim destroy-then-start note.
  • Modify: .claude/rules/app-classes.md — update the DeploymentExecutor bullet under runtime/.
  • Modify: .claude/rules/core-classes.md — note new DeploymentStrategy enum under runtime/.

Phase 1 — Core: DeploymentStrategy enum + gen utility

Task 1.1: DeploymentStrategy enum

Files: Create cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java.

  • Create enum with two constants BLUE_GREEN, ROLLING.
  • Add toWire() returning "blue-green" / "rolling".
  • Add fromWire(String) — case-insensitive match; unknown or null → BLUE_GREEN with no throw (safety fallback). Returns enum, never null.

Verification: unit test covering known + unknown + null inputs.
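As a rough sketch, Task 1.1 could look like the following. Storing the wire string on each constant is an implementation choice made for this sketch; only the two constants, toWire(), and the fromWire() fallback semantics come from the plan.

```java
// Sketch of DeploymentStrategy per Task 1.1 (abbreviated; the real file would be public).
enum DeploymentStrategy {
    BLUE_GREEN("blue-green"),
    ROLLING("rolling");

    private final String wire;

    DeploymentStrategy(String wire) { this.wire = wire; }

    public String toWire() { return wire; }

    // Case-insensitive; unknown or null falls back to BLUE_GREEN (safety net).
    // Always returns an enum constant, never null, never throws.
    public static DeploymentStrategy fromWire(String value) {
        if (value == null) return BLUE_GREEN;
        for (DeploymentStrategy s : values()) {
            if (s.wire.equalsIgnoreCase(value.trim())) return s;
        }
        return BLUE_GREEN;
    }
}
```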

Task 1.2: Generation suffix helper

  • Decide location — inline static helper on DeploymentExecutor is fine (private static String gen(UUID id) { return id.toString().substring(0,8); }). No new file needed.
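For illustration, the helper behaves like this (the enclosing class name exists only for the sketch; per the plan, the real helper is a private static method on DeploymentExecutor):

```java
import java.util.UUID;

// Task 1.2 helper, exactly as given in the plan text.
class GenSuffix {
    static String gen(UUID id) {
        return id.toString().substring(0, 8);  // first 8 chars of the deployment UUID
    }
}
```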

Phase 2 — Executor: gen-suffixed naming + cameleer.generation label

This phase is purely the naming change; no strategy branching yet. After this phase, redeploy still uses the destroy-then-start interim, but containers carry the new names + label.

Task 2.1: TraefikLabelBuilder — accept gen, emit generation label

Files: Modify TraefikLabelBuilder.java.

  • Add String gen as a new arg on build(...).
  • Change instanceId construction: envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen.
  • Add label cameleer.generation = gen.
  • Leave the Traefik router/service label keys using svc = envSlug + "-" + appSlug (unchanged — routing is generation-agnostic so load balancing across old+new works automatically).
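A trimmed sketch of the builder change, showing only the generation-related labels (the real builder also emits the full set of Traefik router/service labels keyed by the generation-agnostic svc name; class and parameter names here are assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, reduced view of TraefikLabelBuilder.build(...) after Task 2.1.
class LabelSketch {
    static Map<String, String> build(String envSlug, String appSlug, int replicaIndex, String gen) {
        String svc = envSlug + "-" + appSlug;                      // unchanged: no gen, so Traefik
                                                                   // balances old + new during the swap
        String instanceId = svc + "-" + replicaIndex + "-" + gen;  // new instance-id format
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("cameleer.instance-id", instanceId);
        labels.put("cameleer.generation", gen);                    // new deploy-boundary label
        return labels;
    }
}
```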

Task 2.2: DeploymentExecutor — compute gen once, thread through

Files: Modify DeploymentExecutor.executeAsync.

  • At the top of the try block (after env, app, config resolution), compute String gen = gen(deployment.id());.
  • In the replica loop: String instanceId = env.slug() + "-" + app.slug() + "-" + i + "-" + gen; and String containerName = tenantId + "-" + instanceId;.
  • Pass gen to TraefikLabelBuilder.build(...).
  • Set CAMELEER_AGENT_INSTANCEID=instanceId (already done, just verify the new value propagates).
  • Leave replicaStates[].containerName stored as the new full name.

Task 2.3: Update the one brittle test

Files: Modify DeploymentControllerIT.java.

  • Relax the container-name assertion to startsWith("default-default-deploy-test-") or similar — verify behavior, not exact suffix.

Verification after Phase 2:

  • mvn -pl cameleer-server-app -am test -Dtest=DeploymentSnapshotIT,DeploymentControllerIT,PostgresDeploymentRepositoryIT
  • All green; container names now include gen; redeploy still works via the interim destroy-then-start flow (which will be replaced in Phase 3).

Phase 3 — Blue-green strategy (default)

Task 3.1: Extract deployBlueGreen(...) helper

Files: Modify DeploymentExecutor.java.

  • Move the current START_REPLICAS → HEALTH_CHECK → SWAP_TRAFFIC body into a new private void deployBlueGreen(...) method.
  • Signature: take deployment, app, env, config, resolvedRuntimeType, mainClass, gen, primaryNetwork, additionalNets.

Task 3.2: Reorder for proper blue-green

  • Remove the pre-flight "stop previous" block added in f8dccaae (will be replaced by post-health swap).
  • Order: start all new → wait all healthy → find previous active (via findActiveByAppIdAndEnvironmentIdExcluding) → stop old containers + mark old row STOPPED.
  • Strict all-healthy: if healthyCount < config.replicas(), stop the new containers we just started, mark deployment FAILED with "blue-green: %d/%d replicas healthy; preserving previous deployment". Do not touch the old deployment.
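The ordering and the strict all-healthy gate can be sketched as follows. Orchestrator calls are stubbed into an event log and healthyCount is injected, so this illustrates only control flow, not the real deployBlueGreen signature.

```java
import java.util.ArrayList;
import java.util.List;

// Control-flow sketch of deployBlueGreen after the Task 3.2 reorder.
class BlueGreenSketch {
    final List<String> events = new ArrayList<>();

    boolean deploy(int replicas, int healthyCount, String... oldContainers) {
        for (int i = 0; i < replicas; i++) events.add("start new[" + i + "]");
        events.add("wait all healthy");
        if (healthyCount < replicas) {                 // strict gate: partial = FAILED
            for (int i = 0; i < replicas; i++) events.add("stop new[" + i + "]");
            events.add(String.format(
                "FAILED: blue-green: %d/%d replicas healthy; preserving previous deployment",
                healthyCount, replicas));
            return false;                              // old deployment untouched
        }
        for (String old : oldContainers) events.add("stop " + old);
        events.add("mark previous deployment STOPPED");
        return true;
    }
}
```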

Task 3.3: Wire strategy dispatch

  • At the point where deployBlueGreen is called, check DeploymentStrategy.fromWire(config.deploymentStrategy()) and dispatch. For this phase, always call deployBlueGreen.
  • ROLLING dispatches to deployRolling(...), which is implemented in Phase 4; for now, stub it to throw UnsupportedOperationException (Task 4.3 replaces the stub).

Phase 4 — Rolling strategy

Task 4.1: deployRolling(...) helper

Files: Modify DeploymentExecutor.java.

  • Same signature as deployBlueGreen.
  • Look up previous deployment once at entry via findActiveByAppIdAndEnvironmentIdExcluding. Capture its replicaStates into a map keyed by replica index.
  • For i from 0 to config.replicas() - 1:
    • Start new replica i (with gen-suffixed name).
    • Wait for this single container to go healthy (per-replica waitForOneHealthy(containerId, timeoutSeconds); reuse healthCheckTimeout per replica or introduce a smaller per-replica budget).
    • On success: stop the corresponding old replica i by containerId from the previous deployment's replicaStates (if present); log and continue.
    • On failure: stop + remove all new replicas started so far, mark deployment FAILED with "rolling: replica %d failed to reach healthy; preserved %d previous replicas". Do not touch the already-replaced replicas from previous deployment (they're already stopped) or the not-yet-replaced ones (they keep serving).
  • After the loop succeeds for all replicas, mark the previous deployment row STOPPED (its containers are all stopped).
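The per-replica loop above, including the mid-rollout failure cleanup, can be sketched like this. Orchestrator calls are again stubbed into an event log with per-replica health outcomes injected as a fixture; the real deployRolling takes the full argument list from Task 3.1.

```java
import java.util.ArrayList;
import java.util.List;

// Control-flow sketch of deployRolling (Task 4.1).
class RollingSketch {
    final List<String> events = new ArrayList<>();
    final boolean[] healthy;                 // per-new-replica health outcome (test fixture)

    RollingSketch(boolean[] healthy) { this.healthy = healthy; }

    // Returns true on full rollout; false when a replica failed and the new ones were cleaned up.
    boolean deploy(int replicas) {
        List<Integer> startedNew = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            events.add("start new[" + i + "]");
            startedNew.add(i);
            if (!healthy[i]) {                       // waitForOneHealthy timed out
                for (int n : startedNew) events.add("stop new[" + n + "]");
                events.add("FAILED: rolling: replica " + i + " failed to reach healthy");
                return false;                        // old[i..n-1] keep serving
            }
            events.add("stop old[" + i + "]");       // replace the matching old replica
        }
        events.add("mark previous deployment STOPPED");
        return true;
    }
}
```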

Task 4.2: Add waitForOneHealthy

  • Variant of waitForAnyHealthy that polls a single container id. Returns boolean. Same sleep cadence.
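A minimal sketch of the polling shape, assuming the health probe is injected so the loop is testable (the real method would query the orchestrator by container id and reuse the existing sleep cadence):

```java
import java.util.function.Supplier;

// Sketch of waitForOneHealthy (Task 4.2): single-container variant of waitForAnyHealthy.
class HealthWait {
    static boolean waitForOneHealthy(Supplier<Boolean> isHealthy, long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (isHealthy.get()) return true;        // container reported healthy
            try {
                Thread.sleep(pollMillis);            // poll cadence between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;                                // timed out without going healthy
    }
}
```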

Task 4.3: Replace the Phase 3 stub

  • ROLLING dispatch calls deployRolling instead of throwing.

Phase 5 — Integration tests

Each IT extends AbstractPostgresIT, uses @MockBean RuntimeOrchestrator, and overrides cameleer.server.runtime.healthchecktimeout=2 via @TestPropertySource.

Task 5.1: BlueGreenStrategyIT

Files: Create BlueGreenStrategyIT.java.

  • Test 1 blueGreen_allHealthy_stopsOldAfterNew: seed a previous RUNNING deployment (2 replicas). Trigger redeploy with containerConfig.deploymentStrategy=blue-green + replicas=2. Mock orchestrator: new containers return healthy. Await new deployment RUNNING. Assert: previous deployment has status STOPPED, its container IDs had stopContainer+removeContainer called; new deployment replicaStates contain the two new container IDs; cameleer.generation label on both new container requests.
  • Test 2 blueGreen_partialHealthy_preservesOldAndMarksFailed: seed previous RUNNING (2 replicas). New deploy with replicas=2. Mock: container A healthy, container B starting forever. Await new deployment FAILED. Assert: previous deployment still RUNNING; its container IDs were not stopped; new deployment errorMessage contains "1/2 replicas healthy".

Task 5.2: RollingStrategyIT

Files: Create RollingStrategyIT.java.

  • Test 1 rolling_allHealthy_replacesOneByOne: seed previous RUNNING (3 replicas). New deploy with strategy=rolling, replicas=3. Mock: new containers all healthy. Use ArgumentCaptor on startContainer to observe start order. Assert: start[0] → stop[old0] → start[1] → stop[old1] → start[2] → stop[old2]; new deployment RUNNING with 3 replicaStates; old deployment STOPPED.
  • Test 2 rolling_failsMidRollout_preservesRemainingOld: seed previous RUNNING (3 replicas). New deploy strategy=rolling. Mock: new[0] healthy, new[1] never healthy. Await FAILED. Assert: new[0] was stopped during cleanup; old[0] was stopped (replaced before the failure); old[1] + old[2] still RUNNING; new deployment errorMessage contains "replica 1".

Phase 6 — UI strategy indicator

Task 6.1: Strategy dropdown polish

Files: Modify ResourcesTab.tsx.

  • Verify the <select> has options blue-green and rolling.
  • Add a one-line description under the dropdown: "Blue-green: start all new, swap when healthy. Rolling: replace one replica at a time."

Task 6.2: Strategy on StatusCard

Files: Modify DeploymentTab/StatusCard.tsx.

  • Add a small subtle text line in the grid: <span>Strategy</span><span>{deployment.deploymentStrategy}</span> (read-only, mono text ok).

Phase 7 — Docs + rules updates

Task 7.1: Update .claude/rules/docker-orchestration.md

  • Replace the "DeploymentExecutor Details" section with the new flow (gen suffix, strategy dispatch, per-strategy ordering).
  • Update the "Deployment Status Model" table — DEGRADED now means "post-deploy replica crashed"; failed-during-deploy is always FAILED.
  • Add a short "Deployment Strategies" section: behavior of blue-green vs rolling, resource peak, failure semantics.

Task 7.2: Update .claude/rules/app-classes.md

  • Under runtime/DeploymentExecutor bullet: add "branches on DeploymentStrategy.fromWire(config.deploymentStrategy()). Container name format: {tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{gen} where gen = 8-char prefix of deployment UUID."

Task 7.3: Update .claude/rules/core-classes.md

  • Add under runtime/: DeploymentStrategy — enum BLUE_GREEN, ROLLING; fromWire falls back to BLUE_GREEN; note stored as kebab-case string on config.

Rollout sequence

  1. Phase 1 (enum + helper) — trivial, land as one commit.
  2. Phase 2 (naming + generation label) — one commit; interim destroy-then-start still active; no OpenAPI regeneration needed (no controller change).
  3. Phase 3 (blue-green as default) — one commit replacing the interim flow. This is where real behavior changes.
  4. Phase 4 (rolling) — one commit.
  5. Phase 5 (4 ITs) — one commit; run mvn test against affected modules.
  6. Phase 6 (UI) — one commit; npx tsc clean.
  7. Phase 7 (docs) — one commit.

Total: 7 commits, all atomic.

Acceptance

  • Existing DeploymentSnapshotIT still passes.
  • New BlueGreenStrategyIT (2 tests) and RollingStrategyIT (2 tests) pass.
  • Browser QA: redeploy with deploymentStrategy=blue-green vs rolling produces the expected container timeline (inspect via docker ps); Prometheus metrics show continuity across deploys when queried by {cameleer_app, cameleer_environment}; the cameleer_generation label flips per deploy.
  • .claude/rules/docker-orchestration.md reflects the new behavior.

Non-goals

  • Automatic rollback on blue-green partial failure (old is left running; user redeploys).
  • Automatic rollback on rolling mid-failure (remaining old replicas keep running; user redeploys).
  • Per-replica HEALTH_CHECK stage label in the UI progress bar — the 7-stage progress is reused as-is; strategy dictates internal looping.
  • Strategy field validation at container-config save time (executor's fromWire fallback absorbs unknown values — consider a follow-up for strict validation if it becomes an issue).