diff --git a/docs/superpowers/plans/2026-04-23-deployment-strategies.md b/docs/superpowers/plans/2026-04-23-deployment-strategies.md
new file mode 100644
index 00000000..67257dd5
--- /dev/null
+++ b/docs/superpowers/plans/2026-04-23-deployment-strategies.md
@@ -0,0 +1,225 @@
# Deployment Strategies (blue-green + rolling) — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make `deploymentStrategy` actually affect runtime behavior. Support **blue-green** (all-at-once, default) and **rolling** (per-replica) deployments with correct semantics. Unblock real blue/green by giving each deployment a unique container-name generation suffix so old + new replicas can coexist during the swap.

**Current state (interim fix landed in `f8dccaae`):** the strategy field exists but the executor doesn't branch on it; a destroy-then-start flow runs regardless. This plan replaces that interim behavior.

**Architecture:**

- Append an 8-char **`gen`** suffix (first 8 chars of `deployment.id`) to the container name AND `CAMELEER_AGENT_INSTANCEID`. Unique per deployment; no new DB state.
- Add a `cameleer.generation` Docker label so Grafana/Prometheus can pin deploy boundaries without regexing the instance-id.
- Branch `DeploymentExecutor.executeAsync` on strategy:
  - **blue-green**: start all N new → health-check all → stop all old. Strict all-healthy: partial = FAILED (old stays running).
  - **rolling**: per-replica loop: start new[i] → health-check → stop old[i] → next. Mid-rollout failure → stop failed new[i], leave remaining old[i..n] running, mark FAILED.
- Keep destroy-then-start as the fallback for unknown strategy values (safety net).

**Reference:** interim-fix commit `f8dccaae`; investigation summary in the session log.
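The `gen` suffix and instance-id shape described above can be sketched as follows. This is a minimal illustration, not the final `DeploymentExecutor` code; the class name and the `instanceId` helper are assumptions for this plan:

```java
import java.util.UUID;

// Illustrative sketch only: class and helper names are assumptions,
// not existing Cameleer code.
final class GenNaming {

    /** First 8 chars of deployment.id: unique per deployment, no new DB state. */
    static String gen(UUID deploymentId) {
        return deploymentId.toString().substring(0, 8);
    }

    /** {envSlug}-{appSlug}-{replicaIndex}-{gen}, also used for CAMELEER_AGENT_INSTANCEID. */
    static String instanceId(String envSlug, String appSlug, int replicaIndex, String gen) {
        return envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen;
    }
}
```

Because `gen` is derived from the deployment id, old and new containers never collide on name, which is what makes the blue/green coexistence window possible.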
---

## File Structure

### Backend (new / modified)

- **Create:** `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java` — enum `BLUE_GREEN, ROLLING`; `fromWire(String)` with blue-green fallback; `toWire()` → "blue-green" / "rolling".
- **Modify:** `cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/DeploymentExecutor.java` — add `gen` computation, strategy branching, and per-strategy START_REPLICAS + HEALTH_CHECK + SWAP_TRAFFIC flows. Rewrite the body of `executeAsync` so stages 4–6 dispatch on strategy. Extract helper methods `deployBlueGreen` and `deployRolling` to keep each path readable.
- **Modify:** `cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/TraefikLabelBuilder.java` — take a `gen` argument; emit the `cameleer.generation` label; `cameleer.instance-id` becomes `{envSlug}-{appSlug}-{replicaIndex}-{gen}`.
- **Modify:** `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentService.java` — `containerName` stored on the row remains `env.slug() + "-" + app.slug()` (unchanged — it is already just the group name for DB/operator visibility; the real Docker name is computed in the executor).
- **Modify:** `cameleer-server-app/src/test/java/com/cameleer/server/app/controller/DeploymentControllerIT.java` — update the single assertion that pins the `container_name` format, if any (spotted at line ~112 in the investigation).
- **Create:** `cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/BlueGreenStrategyIT.java` — two tests: the all-replicas-healthy path stops old after new, and partial-healthy aborts while preserving old.
- **Create:** `cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/RollingStrategyIT.java` — two tests: happy-path rolling 3→3 replacement, and fail-on-replica-1 preserves the remaining old replicas.
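The `DeploymentStrategy` enum described above is small enough to sketch in full. This is a hedged draft of the shape the plan asks for (wire values, case-insensitive parse, blue-green fallback), not the final source file:

```java
// Draft of the enum described above; matches the plan's wire values and
// blue-green fallback, but is a sketch rather than the final source file.
enum DeploymentStrategy {
    BLUE_GREEN("blue-green"),
    ROLLING("rolling");

    private final String wire;

    DeploymentStrategy(String wire) {
        this.wire = wire;
    }

    /** Wire value: "blue-green" or "rolling". */
    String toWire() {
        return wire;
    }

    /** Case-insensitive; unknown or null falls back to BLUE_GREEN. Never null, never throws. */
    static DeploymentStrategy fromWire(String value) {
        if (value != null) {
            for (DeploymentStrategy s : values()) {
                if (s.wire.equalsIgnoreCase(value.trim())) {
                    return s;
                }
            }
        }
        return BLUE_GREEN;
    }
}
```

The no-throw fallback is deliberate: a bad or missing strategy value on an existing row degrades to the safe default instead of failing the deploy.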
### UI

- **Modify:** `ui/src/pages/AppsTab/AppDeploymentPage/ConfigTabs/ResourcesTab.tsx` — confirm the strategy dropdown offers "blue-green" and "rolling" with descriptive labels plus a hint line.
- **Modify:** `ui/src/pages/AppsTab/AppDeploymentPage/DeploymentTab/StatusCard.tsx` — surface `deployment.deploymentStrategy` as a small text/badge near the version badge (read-only).

### Docs + rules

- **Modify:** `.claude/rules/docker-orchestration.md` — rewrite the "DeploymentExecutor Details" and "Blue/green strategy" sections to describe the new behavior and the `gen` suffix; retire the interim destroy-then-start note.
- **Modify:** `.claude/rules/app-classes.md` — update the `DeploymentExecutor` bullet under `runtime/`.
- **Modify:** `.claude/rules/core-classes.md` — note the new `DeploymentStrategy` enum under `runtime/`.

---

## Phase 1 — Core: DeploymentStrategy enum + gen utility

### Task 1.1: DeploymentStrategy enum

**Files:** Create `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java`.

- [ ] Create the enum with two constants `BLUE_GREEN`, `ROLLING`.
- [ ] Add `toWire()` returning `"blue-green"` / `"rolling"`.
- [ ] Add `fromWire(String)` — case-insensitive match; unknown or null → `BLUE_GREEN` with no throw (safety fallback). Returns an enum, never null.

**Verification:** unit test covering known + unknown + null inputs.

### Task 1.2: Generation suffix helper

- [ ] Decide location — an inline static helper on `DeploymentExecutor` is fine (`private static String gen(UUID id) { return id.toString().substring(0, 8); }`). No new file needed.

---

## Phase 2 — Executor: gen-suffixed naming + `cameleer.generation` label

This phase is purely the naming change; no strategy branching yet. After this phase, redeploy still uses the destroy-then-start interim flow, but containers carry the new names + label.
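The label output this phase produces can be sketched as follows. Only `cameleer.instance-id` and `cameleer.generation` come from the plan; the class name and surrounding shape are illustrative assumptions, and the real implementation is `TraefikLabelBuilder`:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of the Phase 2 label output; the real code is
// TraefikLabelBuilder, and names here are illustrative.
final class GenerationLabelsSketch {

    /** Identity labels carry gen; the Traefik service key stays generation-agnostic. */
    static Map<String, String> build(String envSlug, String appSlug, int replicaIndex, String gen) {
        String svc = envSlug + "-" + appSlug;  // unchanged router/service key: LB spans old + new
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("cameleer.instance-id", svc + "-" + replicaIndex + "-" + gen);
        labels.put("cameleer.generation", gen);  // lets dashboards pin deploy boundaries
        return labels;
    }
}
```

Keeping the routing key free of `gen` is what lets Traefik balance across old and new replicas during the swap with no extra wiring.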
### Task 2.1: TraefikLabelBuilder — accept `gen`, emit generation label

**Files:** Modify `TraefikLabelBuilder.java`.

- [ ] Add `String gen` as a new arg on `build(...)`.
- [ ] Change the `instanceId` construction: `envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen`.
- [ ] Add label `cameleer.generation = gen`.
- [ ] Leave the Traefik router/service label keys using `svc = envSlug + "-" + appSlug` (unchanged — routing is generation-agnostic, so load balancing across old + new works automatically).

### Task 2.2: DeploymentExecutor — compute gen once, thread through

**Files:** Modify `DeploymentExecutor.executeAsync`.

- [ ] At the top of the try block (after `env`, `app`, `config` resolution), compute `String gen = gen(deployment.id());`.
- [ ] In the replica loop: `String instanceId = env.slug() + "-" + app.slug() + "-" + i + "-" + gen;` and `String containerName = tenantId + "-" + instanceId;`.
- [ ] Pass `gen` to `TraefikLabelBuilder.build(...)`.
- [ ] Set `CAMELEER_AGENT_INSTANCEID=instanceId` (already done — just verify the new value propagates).
- [ ] Leave `replicaStates[].containerName` stored as the new full name.

### Task 2.3: Update the one brittle test

**Files:** Modify `DeploymentControllerIT.java`.

- [ ] Relax the container-name assertion to `startsWith("default-default-deploy-test-")` or similar — verify the behavior, not the exact suffix.

**Verification after Phase 2:**

- `mvn -pl cameleer-server-app -am test -Dtest=DeploymentSnapshotIT,DeploymentControllerIT,PostgresDeploymentRepositoryIT`
- All green; container names now include `gen`; redeploy still works via the interim destroy-then-start flow (replaced in Phase 3).

---

## Phase 3 — Blue-green strategy (default)

### Task 3.1: Extract `deployBlueGreen(...)` helper

**Files:** Modify `DeploymentExecutor.java`.

- [ ] Move the current START_REPLICAS → HEALTH_CHECK → SWAP_TRAFFIC body into a new `private void deployBlueGreen(...)` method.
- [ ] Signature: take `deployment`, `app`, `env`, `config`, `resolvedRuntimeType`, `mainClass`, `gen`, `primaryNetwork`, `additionalNets`.

### Task 3.2: Reorder for proper blue-green

- [ ] Remove the pre-flight "stop previous" block added in `f8dccaae` (replaced by the post-health swap).
- [ ] Order: start all new → wait for all healthy → find the previous active deployment (via `findActiveByAppIdAndEnvironmentIdExcluding`) → stop the old containers + mark the old row STOPPED.
- [ ] Strict all-healthy: if `healthyCount < config.replicas()`, stop the new containers just started and mark the deployment FAILED with `"blue-green: %d/%d replicas healthy; preserving previous deployment"`. Do **not** touch the old deployment.

### Task 3.3: Wire strategy dispatch

- [ ] At the point where `deployBlueGreen` is called, check `DeploymentStrategy.fromWire(config.deploymentStrategy())` and dispatch; in this phase every non-`ROLLING` value (including the unknown-value fallback) lands on `deployBlueGreen`.
- [ ] `ROLLING` dispatches to `deployRolling(...)`, implemented in Phase 4 — stub it to throw `UnsupportedOperationException` for now; Task 4.3 replaces the stub.

---

## Phase 4 — Rolling strategy

### Task 4.1: `deployRolling(...)` helper

**Files:** Modify `DeploymentExecutor.java`.

- [ ] Same signature as `deployBlueGreen`.
- [ ] Look up the previous deployment once at entry via `findActiveByAppIdAndEnvironmentIdExcluding`. Capture its `replicaStates` into a map keyed by replica index.
- [ ] For `i` from 0 to `config.replicas() - 1`:
  - [ ] Start new replica `i` (with the gen-suffixed name).
  - [ ] Wait for this single container to go healthy (per-replica `waitForOneHealthy(containerId, timeoutSeconds)`; reuse `healthCheckTimeout` per replica or introduce a smaller per-replica budget).
  - [ ] On success: stop the corresponding old replica `i` by `containerId` from the previous deployment's replicaStates (if present); log and continue.
  - [ ] On failure: stop + remove only the failed new replica `i` (per the architecture: `new[0..i-1]` stay running, since they already replaced their old counterparts and are serving), and mark the deployment FAILED with `"rolling: replica %d failed to reach healthy; preserved %d previous replicas"`. Do **not** touch the already-replaced replicas from the previous deployment (they're already stopped) or the not-yet-replaced ones (they keep serving).
- [ ] After the loop succeeds for all replicas, mark the previous deployment row STOPPED (its containers are all stopped).

### Task 4.2: Add `waitForOneHealthy`

- [ ] Variant of `waitForAnyHealthy` that polls a single container id. Returns a boolean. Same sleep cadence.

### Task 4.3: Replace the Phase 3 stub

- [ ] The `ROLLING` dispatch calls `deployRolling` instead of throwing.

---

## Phase 5 — Integration tests

Each IT extends `AbstractPostgresIT`, uses `@MockBean RuntimeOrchestrator`, and overrides `cameleer.server.runtime.healthchecktimeout=2` via `@TestPropertySource`.

### Task 5.1: BlueGreenStrategyIT

**Files:** Create `BlueGreenStrategyIT.java`.

- [ ] **Test 1 `blueGreen_allHealthy_stopsOldAfterNew`:** seed a previous RUNNING deployment (2 replicas). Trigger a redeploy with `containerConfig.deploymentStrategy=blue-green` and replicas=2. Mock orchestrator: new containers return `healthy`. Await the new deployment RUNNING. Assert: the previous deployment has status STOPPED and its container IDs had `stopContainer` + `removeContainer` called; the new deployment's replicaStates contain the two new container IDs; the `cameleer.generation` label is on both new container requests.
- [ ] **Test 2 `blueGreen_partialHealthy_preservesOldAndMarksFailed`:** seed a previous RUNNING deployment (2 replicas). New deploy with replicas=2. Mock: container A healthy, container B starting forever. Await the new deployment FAILED. Assert: the previous deployment is still RUNNING; its container IDs were **not** stopped; the new deployment's errorMessage contains "1/2 replicas healthy".

### Task 5.2: RollingStrategyIT

**Files:** Create `RollingStrategyIT.java`.
- [ ] **Test 1 `rolling_allHealthy_replacesOneByOne`:** seed a previous RUNNING deployment (3 replicas). New deploy with strategy=rolling, replicas=3. Mock: new containers all healthy. Use an `ArgumentCaptor` on `startContainer` to observe start order. Assert: start[0] → stop[old0] → start[1] → stop[old1] → start[2] → stop[old2]; new deployment RUNNING with 3 replicaStates; old deployment STOPPED.
- [ ] **Test 2 `rolling_failsMidRollout_preservesRemainingOld`:** seed a previous RUNNING deployment (3 replicas). New deploy with strategy=rolling. Mock: new[0] healthy, new[1] never healthy. Await FAILED. Assert: new[1] (the failed replica) was stopped during cleanup; new[0] still running (it replaced old[0] before the failure, matching the architecture's leave-replaced-replicas-serving rule); old[0] was stopped; old[1] + old[2] still RUNNING; the new deployment's errorMessage contains "replica 1".

---

## Phase 6 — UI strategy indicator

### Task 6.1: Strategy dropdown polish

**Files:** Modify `ResourcesTab.tsx`.

- [ ] Verify the `