docs(plans): deployment strategies (blue-green + rolling) plan
7-phase plan to replace the interim destroy-then-start flow (f8dccaae)
with a strategy-aware executor. Adds gen-suffixed container names so
old + new replicas can coexist, plus a cameleer.generation label for
Prometheus/Grafana deploy-boundary annotations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
225
docs/superpowers/plans/2026-04-23-deployment-strategies.md
Normal file
@@ -0,0 +1,225 @@

# Deployment Strategies (blue-green + rolling): Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make `deploymentStrategy` actually affect runtime behavior. Support **blue-green** (all-at-once, default) and **rolling** (per-replica) deployments with correct semantics. Unblock real blue-green by giving each deployment a unique container-name generation suffix so old + new replicas can coexist during the swap.

**Current state (interim fix landed in `f8dccaae`):** the strategy field exists but the executor doesn't branch on it; a destroy-then-start flow runs regardless. This plan replaces that interim behavior.

**Architecture:**
- Append an 8-char **`gen`** suffix (first 8 chars of `deployment.id`) to the container name AND `CAMELEER_AGENT_INSTANCEID`. Effectively unique per deployment; no new DB state.
- Add a `cameleer.generation` Docker label so Grafana/Prometheus can pin deploy boundaries without regex on instance-id.
- Branch `DeploymentExecutor.executeAsync` on strategy:
  - **blue-green**: start all N new → health-check all → stop all old. Strict all-healthy: partial = FAILED (old stays running).
  - **rolling**: per-replica loop: start new[i] → health-check → stop old[i] → next. Mid-rollout failure → stop the new replicas started so far (including the failed new[i]), leave the remaining old[i..n-1] running, mark FAILED.
- Unknown strategy values fall back to blue-green through `DeploymentStrategy.fromWire` (safety net); the interim destroy-then-start flow is retired in Phase 3.

**Reference:** interim-fix commit `f8dccaae`; investigation summary in the session log.

---

## File Structure

### Backend (new / modified)

- **Create:** `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java`: enum `BLUE_GREEN, ROLLING`; `fromWire(String)` with blue-green fallback; `toWire()` → "blue-green" / "rolling".
- **Modify:** `cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/DeploymentExecutor.java`: add `gen` computation, strategy branching, per-strategy START_REPLICAS + HEALTH_CHECK + SWAP_TRAFFIC flows. Rewrite the body of `executeAsync` so stages 4–6 dispatch on strategy. Extract helper methods `deployBlueGreen` and `deployRolling` to keep each path readable.
- **Modify:** `cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/TraefikLabelBuilder.java`: take a `gen` argument; emit the `cameleer.generation` label; `cameleer.instance-id` becomes `{envSlug}-{appSlug}-{replicaIndex}-{gen}`.
- **Verify (likely no change):** `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentService.java`: `containerName` stored on the row stays `env.slug() + "-" + app.slug()`. It is already just the group name for DB/operator visibility; the real Docker name is computed in the executor.
- **Modify:** `cameleer-server-app/src/test/java/com/cameleer/server/app/controller/DeploymentControllerIT.java`: update the single assertion that pins the `container_name` format, if present (spotted at line ~112 in the investigation).
- **Create:** `cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/BlueGreenStrategyIT.java`: two tests: the all-replicas-healthy path stops old after new, and partial-healthy aborts while preserving old.
- **Create:** `cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/RollingStrategyIT.java`: two tests: happy rolling 3→3 replacement, and fail-on-replica-1 preserves the remaining old replicas.

### UI

- **Modify:** `ui/src/pages/AppsTab/AppDeploymentPage/ConfigTabs/ResourcesTab.tsx`: confirm the strategy dropdown offers "blue-green" and "rolling" with descriptive labels + a hint line.
- **Modify:** `ui/src/pages/AppsTab/AppDeploymentPage/DeploymentTab/StatusCard.tsx`: surface `deployment.deploymentStrategy` as a small text/badge near the version badge (read-only).

### Docs + rules

- **Modify:** `.claude/rules/docker-orchestration.md`: rewrite the "DeploymentExecutor Details" and "Blue/green strategy" sections to describe the new behavior and the `gen` suffix; retire the interim destroy-then-start note.
- **Modify:** `.claude/rules/app-classes.md`: update the `DeploymentExecutor` bullet under `runtime/`.
- **Modify:** `.claude/rules/core-classes.md`: note the new `DeploymentStrategy` enum under `runtime/`.

---

## Phase 1: Core (DeploymentStrategy enum + gen utility)

### Task 1.1: DeploymentStrategy enum

**Files:** Create `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java`.

- [ ] Create enum with two constants `BLUE_GREEN`, `ROLLING`.
- [ ] Add `toWire()` returning `"blue-green"` / `"rolling"`.
- [ ] Add `fromWire(String)`: case-insensitive match; unknown or null → `BLUE_GREEN` with no throw (safety fallback). Returns an enum constant, never null.

**Verification:** unit test covering known + unknown + null inputs.
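
The checklist above could be realised roughly as follows. This is a minimal sketch, not the committed file: the enum is package-private so the snippet compiles on its own (the real one would live in `com.cameleer.server.core.runtime`), and trimming whitespace before matching is an assumption the plan does not state.

```java
import java.util.Locale;

/** Sketch of the strategy enum from Task 1.1 (package-private for standalone use). */
enum DeploymentStrategy {
    BLUE_GREEN("blue-green"),
    ROLLING("rolling");

    private final String wire;

    DeploymentStrategy(String wire) {
        this.wire = wire;
    }

    /** Kebab-case value as stored on the container config. */
    String toWire() {
        return wire;
    }

    /** Case-insensitive parse; null or unknown input yields BLUE_GREEN and never throws. */
    static DeploymentStrategy fromWire(String value) {
        if (value == null) {
            return BLUE_GREEN;
        }
        // Assumption: surrounding whitespace is ignored.
        String normalized = value.trim().toLowerCase(Locale.ROOT);
        for (DeploymentStrategy strategy : values()) {
            if (strategy.wire.equals(normalized)) {
                return strategy;
            }
        }
        return BLUE_GREEN; // safety fallback for unknown values
    }
}
```

A unit test for the verification step would then cover `fromWire("rolling")`, `fromWire("ROLLING")`, `fromWire("canary")`, and `fromWire(null)`.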

### Task 1.2: Generation suffix helper

- [ ] Decide location: an inline static helper on `DeploymentExecutor` is fine (`private static String gen(UUID id) { return id.toString().substring(0, 8); }`). No new file needed.

---

## Phase 2: Executor gen-suffixed naming + `cameleer.generation` label

This phase is purely the naming change; no strategy branching yet. After this phase, redeploy still uses the interim destroy-then-start flow, but containers carry the new names + label.

### Task 2.1: TraefikLabelBuilder accepts `gen`, emits generation label

**Files:** Modify `TraefikLabelBuilder.java`.

- [ ] Add `String gen` as a new arg on `build(...)`.
- [ ] Change `instanceId` construction: `envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen`.
- [ ] Add label `cameleer.generation = gen`.
- [ ] Leave the Traefik router/service label keys using `svc = envSlug + "-" + appSlug` (unchanged: routing is generation-agnostic, so load balancing across old + new works automatically).
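
To illustrate the label changes in this task, here is a rough sketch. The class `GenerationLabelSketch`, its `build` signature, and the `port` parameter are hypothetical; the real `TraefikLabelBuilder.build(...)` takes arguments not shown in this plan and emits more labels. The single Traefik key is included only to show that the service name carries no `gen`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the label-building step; not the real TraefikLabelBuilder. */
final class GenerationLabelSketch {
    private GenerationLabelSketch() {}

    static Map<String, String> build(String envSlug, String appSlug,
                                     int replicaIndex, String gen, int port) {
        // Routing key stays generation-agnostic so old and new replicas share one service.
        String svc = envSlug + "-" + appSlug;

        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("cameleer.instance-id", svc + "-" + replicaIndex + "-" + gen);
        labels.put("cameleer.generation", gen);
        // Illustrative Traefik service label; port value is an assumption.
        labels.put("traefik.http.services." + svc + ".loadbalancer.server.port",
                Integer.toString(port));
        return labels;
    }
}
```

Because both generations register under the same `svc` name, Traefik can balance across them during the overlap window without any label rewrite.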

### Task 2.2: DeploymentExecutor computes gen once and threads it through

**Files:** Modify `DeploymentExecutor.executeAsync`.

- [ ] At the top of the try block (after `env`, `app`, `config` resolution), compute `String gen = gen(deployment.id());`.
- [ ] In the replica loop: `String instanceId = env.slug() + "-" + app.slug() + "-" + i + "-" + gen;` and `String containerName = tenantId + "-" + instanceId;`.
- [ ] Pass `gen` to `TraefikLabelBuilder.build(...)`.
- [ ] Set `CAMELEER_AGENT_INSTANCEID=instanceId` (already done; just verify the new value propagates).
- [ ] Leave `replicaStates[].containerName` as is; it now stores the new full name.
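
The naming rules from Tasks 1.2 and 2.2 can be sketched together as below. The holder class `DeploymentNames` is hypothetical (the plan keeps `gen` as a private static helper on `DeploymentExecutor`); the string formats are taken from the checklist.

```java
import java.util.UUID;

/** Hypothetical holder for the naming helpers; the plan inlines these in DeploymentExecutor. */
final class DeploymentNames {
    private DeploymentNames() {}

    /** First 8 characters of the deployment UUID's canonical string form. */
    static String gen(UUID deploymentId) {
        return deploymentId.toString().substring(0, 8);
    }

    /** {envSlug}-{appSlug}-{replicaIndex}-{gen}, also used for CAMELEER_AGENT_INSTANCEID. */
    static String instanceId(String envSlug, String appSlug, int replicaIndex, String gen) {
        return envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen;
    }

    /** {tenantId}-{instanceId}: the full Docker container name. */
    static String containerName(String tenantId, String instanceId) {
        return tenantId + "-" + instanceId;
    }
}
```

Since the suffix differs between two deployments of the same app, replica 0 of the old and new generation get distinct Docker names and can run side by side.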

### Task 2.3: Update the one brittle test

**Files:** Modify `DeploymentControllerIT.java`.

- [ ] Relax the container-name assertion to `startsWith("default-default-deploy-test-")` or similar: verify behavior, not the exact suffix.

**Verification after Phase 2:**
- `mvn -pl cameleer-server-app -am test -Dtest=DeploymentSnapshotIT,DeploymentControllerIT,PostgresDeploymentRepositoryIT`
- All green; container names now include gen; redeploy still works via the interim destroy-then-start flow (which will be replaced in Phase 3).

---

## Phase 3: Blue-green strategy (default)

### Task 3.1: Extract `deployBlueGreen(...)` helper

**Files:** Modify `DeploymentExecutor.java`.

- [ ] Move the current START_REPLICAS → HEALTH_CHECK → SWAP_TRAFFIC body into a new `private void deployBlueGreen(...)` method.
- [ ] Signature: take `deployment`, `app`, `env`, `config`, `resolvedRuntimeType`, `mainClass`, `gen`, `primaryNetwork`, `additionalNets`.

### Task 3.2: Reorder for proper blue-green

- [ ] Remove the pre-flight "stop previous" block added in `f8dccaae` (will be replaced by the post-health swap).
- [ ] Order: start all new → wait all healthy → find previous active (via `findActiveByAppIdAndEnvironmentIdExcluding`) → stop old containers + mark old row STOPPED.
- [ ] Strict all-healthy: if `healthyCount < config.replicas()`, stop the new containers we just started, mark deployment FAILED with `"blue-green: %d/%d replicas healthy; preserving previous deployment"`. Do **not** touch the old deployment.
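
The ordering and strict all-healthy rule above can be sketched as follows. This is an illustration under stated assumptions, not the executor code: `ContainerOps` is a hypothetical stand-in for the real orchestrator, and the method returns the failure message instead of updating deployment rows.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical orchestrator facade; method names are assumptions for this sketch. */
interface ContainerOps {
    String start(int replicaIndex);           // returns the new container id
    boolean waitHealthy(String containerId);  // true once healthy, false on timeout
    void stop(String containerId);
}

final class BlueGreenSketch {
    private BlueGreenSketch() {}

    /** Returns null on success, or the FAILED message when the swap is aborted. */
    static String deploy(ContainerOps ops, int replicas, List<String> oldContainerIds) {
        // 1. Start all new replicas while the old ones keep serving.
        List<String> started = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            started.add(ops.start(i));
        }
        // 2. Count how many of the new replicas became healthy.
        int healthyCount = 0;
        for (String id : started) {
            if (ops.waitHealthy(id)) {
                healthyCount++;
            }
        }
        // 3. Strict all-healthy: on partial health, clean up new and leave old untouched.
        if (healthyCount < replicas) {
            started.forEach(ops::stop);
            return String.format(
                    "blue-green: %d/%d replicas healthy; preserving previous deployment",
                    healthyCount, replicas);
        }
        // 4. Only now stop the previous generation.
        oldContainerIds.forEach(ops::stop);
        return null;
    }
}
```

The key property is that no old container is stopped before step 4, so a failed rollout leaves the previous deployment serving traffic.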

### Task 3.3: Wire strategy dispatch

- [ ] At the point where `deployBlueGreen` is called, resolve `DeploymentStrategy.fromWire(config.deploymentStrategy())` and dispatch on it. In this phase only `BLUE_GREEN` has a working path.
- [ ] `ROLLING` dispatches to `deployRolling(...)`, implemented in Phase 4. Stub it to throw `UnsupportedOperationException` for now; Task 4.3 replaces the stub.

---

## Phase 4: Rolling strategy

### Task 4.1: `deployRolling(...)` helper

**Files:** Modify `DeploymentExecutor.java`.

- [ ] Same signature as `deployBlueGreen`.
- [ ] Look up the previous deployment once at entry via `findActiveByAppIdAndEnvironmentIdExcluding`. Capture its `replicaStates` into a map keyed by replica index.
- [ ] For `i` from 0 to `config.replicas() - 1`:
  - [ ] Start new replica `i` (with gen-suffixed name).
  - [ ] Wait for this single container to go healthy (per-replica `waitForOneHealthy(containerId, timeoutSeconds)`; reuse `healthCheckTimeout` per replica or introduce a smaller per-replica budget).
  - [ ] On success: stop the corresponding old replica `i` by `containerId` from the previous deployment's replicaStates (if present); log and continue.
  - [ ] On failure: stop + remove all new replicas started so far, mark deployment FAILED with `"rolling: replica %d failed to reach healthy; preserved %d previous replicas"`. Do **not** touch the already-replaced replicas from the previous deployment (they're already stopped) or the not-yet-replaced ones (they keep serving).
- [ ] After the loop succeeds for all replicas, mark the previous deployment row STOPPED (its containers are all stopped).
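
The per-replica loop might look like the sketch below. `ReplicaOps` and `RollingSketch` are hypothetical names, and the method returns the failure message rather than touching deployment rows. The plan does not say what happens to old replicas whose index is beyond the new replica count, so this sketch leaves them alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical orchestrator facade; method names are assumptions for this sketch. */
interface ReplicaOps {
    String start(int replicaIndex);           // returns the new container id
    boolean waitHealthy(String containerId);  // true once healthy, false on timeout
    void stop(String containerId);
}

final class RollingSketch {
    private RollingSketch() {}

    /** Returns null on success, or the FAILED message when the rollout aborts. */
    static String deploy(ReplicaOps ops, int replicas, Map<Integer, String> oldByIndex) {
        List<String> startedNew = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            String newId = ops.start(i);
            startedNew.add(newId);

            if (!ops.waitHealthy(newId)) {
                // Clean up every new replica started so far, including the failed one.
                startedNew.forEach(ops::stop);
                // Old replicas at index >= i were never replaced and keep serving.
                int preserved = 0;
                for (int index : oldByIndex.keySet()) {
                    if (index >= i) {
                        preserved++;
                    }
                }
                return String.format(
                        "rolling: replica %d failed to reach healthy; preserved %d previous replicas",
                        i, preserved);
            }

            // New replica i is healthy: retire old replica i if one exists.
            String oldId = oldByIndex.get(i);
            if (oldId != null) {
                ops.stop(oldId);
            }
        }
        return null;
    }
}
```

Note the consequence of the cleanup rule: after a failure at replica `i`, indices below `i` have neither an old nor a new container running, so capacity is reduced until the user redeploys.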

### Task 4.2: Add `waitForOneHealthy`

- [ ] Variant of `waitForAnyHealthy` that polls a single container id. Returns boolean. Same sleep cadence.
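
A single-container poll loop of this kind can be sketched as below. The signature is an assumption: the plan calls for `waitForOneHealthy(containerId, timeoutSeconds)`, while this standalone version takes a `BooleanSupplier` health probe and explicit millisecond values so it runs without an orchestrator. The real sleep cadence should match `waitForAnyHealthy`.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical standalone version of the per-replica health wait. */
final class HealthWaitSketch {
    private HealthWaitSketch() {}

    /**
     * Polls the probe until it reports healthy or the timeout elapses.
     * Returns false on timeout or if the waiting thread is interrupted.
     */
    static boolean waitForOneHealthy(BooleanSupplier isHealthy, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (isHealthy.getAsBoolean()) {
                return true;
            }
            // Overflow-safe comparison, as recommended for System.nanoTime().
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
    }
}
```

In the executor, the probe would wrap the orchestrator's status lookup for one container id, and the timeout would come from `healthCheckTimeout` or a smaller per-replica budget.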

### Task 4.3: Replace the Phase 3 stub

- [ ] `ROLLING` dispatch calls `deployRolling` instead of throwing.

---

## Phase 5: Integration tests

Each IT extends `AbstractPostgresIT`, uses `@MockBean RuntimeOrchestrator`, and overrides `cameleer.server.runtime.healthchecktimeout=2` via `@TestPropertySource`.

### Task 5.1: BlueGreenStrategyIT

**Files:** Create `BlueGreenStrategyIT.java`.

- [ ] **Test 1 `blueGreen_allHealthy_stopsOldAfterNew`:** seed a previous RUNNING deployment (2 replicas). Trigger redeploy with `containerConfig.deploymentStrategy=blue-green` + replicas=2. Mock orchestrator: new containers return `healthy`. Await new deployment RUNNING. Assert: previous deployment has status STOPPED and its container IDs had `stopContainer` + `removeContainer` called; new deployment replicaStates contain the two new container IDs; the `cameleer.generation` label is present on both new container requests.
- [ ] **Test 2 `blueGreen_partialHealthy_preservesOldAndMarksFailed`:** seed previous RUNNING (2 replicas). New deploy with replicas=2. Mock: container A healthy, container B starting forever. Await new deployment FAILED. Assert: previous deployment still RUNNING; its container IDs were **not** stopped; new deployment errorMessage contains "1/2 replicas healthy".

### Task 5.2: RollingStrategyIT

**Files:** Create `RollingStrategyIT.java`.

- [ ] **Test 1 `rolling_allHealthy_replacesOneByOne`:** seed previous RUNNING (3 replicas). New deploy with strategy=rolling, replicas=3. Mock: new containers all healthy. Use `ArgumentCaptor` on `startContainer` to observe start order, and Mockito `InOrder` to verify the interleaving with `stopContainer`. Assert: start[0] → stop[old0] → start[1] → stop[old1] → start[2] → stop[old2]; new deployment RUNNING with 3 replicaStates; old deployment STOPPED.
- [ ] **Test 2 `rolling_failsMidRollout_preservesRemainingOld`:** seed previous RUNNING (3 replicas). New deploy strategy=rolling. Mock: new[0] healthy, new[1] never healthy. Await FAILED. Assert: new[0] was stopped during cleanup; old[0] was stopped (replaced before the failure); old[1] + old[2] still RUNNING; new deployment errorMessage contains "replica 1".

---

## Phase 6: UI strategy indicator

### Task 6.1: Strategy dropdown polish

**Files:** Modify `ResourcesTab.tsx`.

- [ ] Verify the `<select>` has options `blue-green` and `rolling`.
- [ ] Add a one-line description under the dropdown: "Blue-green: start all new, swap when healthy. Rolling: replace one replica at a time."

### Task 6.2: Strategy on StatusCard

**Files:** Modify `DeploymentTab/StatusCard.tsx`.

- [ ] Add a small, subtle text line in the grid: `<span>Strategy</span><span>{deployment.deploymentStrategy}</span>` (read-only, mono text ok).

---

## Phase 7: Docs + rules updates

### Task 7.1: Update `.claude/rules/docker-orchestration.md`

- [ ] Replace the "DeploymentExecutor Details" section with the new flow (gen suffix, strategy dispatch, per-strategy ordering).
- [ ] Update the "Deployment Status Model" table: `DEGRADED` now means "post-deploy replica crashed"; failed-during-deploy is always `FAILED`.
- [ ] Add a short "Deployment Strategies" section: behavior of blue-green vs rolling, resource peak, failure semantics.

### Task 7.2: Update `.claude/rules/app-classes.md`

- [ ] Under `runtime/` → `DeploymentExecutor` bullet: add "branches on `DeploymentStrategy.fromWire(config.deploymentStrategy())`. Container name format: `{tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{gen}` where gen = 8-char prefix of deployment UUID."

### Task 7.3: Update `.claude/rules/core-classes.md`

- [ ] Add under `runtime/`: `DeploymentStrategy`, enum BLUE_GREEN, ROLLING; `fromWire` falls back to BLUE_GREEN; note that it is stored as a kebab-case string on config.

---

## Rollout sequence

1. Phase 1 (enum + helper): trivial, land as one commit.
2. Phase 2 (naming + generation label): one commit; interim destroy-then-start still active; no OpenAPI regeneration needed (no controller change).
3. Phase 3 (blue-green as default): one commit replacing the interim flow. This is where real behavior changes.
4. Phase 4 (rolling): one commit.
5. Phase 5 (2 IT classes, 4 tests): one commit; run `mvn test` against affected modules.
6. Phase 6 (UI): one commit; `npx tsc` clean.
7. Phase 7 (docs): one commit.

Total: 7 commits, all atomic.

## Acceptance

- Existing `DeploymentSnapshotIT` still passes.
- New `BlueGreenStrategyIT` (2 tests) and `RollingStrategyIT` (2 tests) pass.
- Browser QA: redeploy with `deploymentStrategy=blue-green` vs `rolling` produces the expected container timeline (inspect via `docker ps`); Prometheus metrics show continuity across deploys when queried by `{cameleer_app, cameleer_environment}`; the `cameleer_generation` label flips per deploy.
- `.claude/rules/docker-orchestration.md` reflects the new behavior.

## Non-goals

- Automatic rollback on blue-green partial failure (old is left running; user redeploys).
- Automatic rollback on rolling mid-failure (remaining old replicas keep running; user redeploys).
- Per-replica `HEALTH_CHECK` stage label in the UI progress bar: the 7-stage progress is reused as-is; strategy dictates internal looping.
- Strategy field validation at container-config save time (the executor's `fromWire` fallback absorbs unknown values; consider a follow-up for strict validation if it becomes an issue).