cameleer-server/docs/superpowers/plans/2026-04-23-deployment-strategies.md
hsiegeln 2c82f29aef docs(plans): deployment strategies (blue-green + rolling) plan
7-phase plan to replace the interim destroy-then-start flow (f8dccaae)
with a strategy-aware executor. Adds gen-suffixed container names so
old + new replicas can coexist, plus a cameleer.generation label for
Prometheus/Grafana deploy-boundary annotations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:41:43 +02:00

# Deployment Strategies (blue-green + rolling) — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make `deploymentStrategy` actually affect runtime behavior. Support **blue-green** (all-at-once, default) and **rolling** (per-replica) deployments with correct semantics. Unblock real blue/green by giving each deployment a unique container-name generation suffix so old + new replicas can coexist during the swap.

**Current state (interim fix landed in `f8dccaae`):** strategy field exists but executor doesn't branch on it; a destroy-then-start flow runs regardless. This plan replaces that interim behavior.

**Architecture:**
- Append an 8-char **`gen`** suffix (first 8 chars of `deployment.id`) to container name AND `CAMELEER_AGENT_INSTANCEID`. Unique per deployment; no new DB state.
- Add a `cameleer.generation` Docker label so Grafana/Prometheus can pin deploy boundaries without regex on instance-id.
- Branch `DeploymentExecutor.executeAsync` on strategy:
  - **blue-green**: start all N new → health-check all → stop all old. Strict all-healthy: partial = FAILED (old stays running).
  - **rolling**: per-replica loop: start new[i] → health-check → stop old[i] → next. Mid-rollout failure → stop failed new[i], leave remaining old[i..n] running, mark FAILED.
  - Keep destroy-then-start as the fallback for unknown strategy values (safety net).

**Reference:** interim-fix commit `f8dccaae`; investigation summary in the session log.

---
## File Structure
### Backend (new / modified)
- **Create:** `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java` — enum `BLUE_GREEN, ROLLING`; `fromWire(String)` with blue-green fallback; `toWire()` → "blue-green" / "rolling".
- **Modify:** `cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/DeploymentExecutor.java` — add `gen` computation, strategy branching, per-strategy START_REPLICAS + HEALTH_CHECK + SWAP_TRAFFIC flows. Rewrite the body of `executeAsync` so stages 4-6 dispatch on strategy. Extract helper methods `deployBlueGreen` and `deployRolling` to keep each path readable.
- **Modify:** `cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/TraefikLabelBuilder.java` — take `gen` argument; emit `cameleer.generation` label; `cameleer.instance-id` becomes `{envSlug}-{appSlug}-{replicaIndex}-{gen}`.
- **Modify:** `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentService.java` — `containerName` stored on the row becomes `env.slug() + "-" + app.slug()` (unchanged — already just the group-name for DB/operator visibility; real Docker name is computed in the executor).
- **Modify:** `cameleer-server-app/src/test/java/com/cameleer/server/app/controller/DeploymentControllerIT.java` — update the single assertion that pins `container_name` format if any (spotted at line ~112 in the investigation).
- **Create:** `cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/BlueGreenStrategyIT.java` — two tests: all-replicas-healthy path stops old after new, and partial-healthy aborts preserving old.
- **Create:** `cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/RollingStrategyIT.java` — two tests: happy rolling 3→3 replacement, and fail-on-replica-1 preserves remaining old replicas.
### UI
- **Modify:** `ui/src/pages/AppsTab/AppDeploymentPage/ConfigTabs/ResourcesTab.tsx` — confirm the strategy dropdown offers "blue-green" and "rolling" with descriptive labels + a hint line.
- **Modify:** `ui/src/pages/AppsTab/AppDeploymentPage/DeploymentTab/StatusCard.tsx` — surface `deployment.deploymentStrategy` as a small text/badge near the version badge (read-only).
### Docs + rules
- **Modify:** `.claude/rules/docker-orchestration.md` — rewrite the "DeploymentExecutor Details" and "Blue/green strategy" sections to describe the new behavior and the `gen` suffix; retire the interim destroy-then-start note.
- **Modify:** `.claude/rules/app-classes.md` — update the `DeploymentExecutor` bullet under `runtime/`.
- **Modify:** `.claude/rules/core-classes.md` — note new `DeploymentStrategy` enum under `runtime/`.
---
## Phase 1 — Core: DeploymentStrategy enum + gen utility
### Task 1.1: DeploymentStrategy enum
**Files:** Create `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java`.
- [ ] Create enum with two constants `BLUE_GREEN`, `ROLLING`.
- [ ] Add `toWire()` returning `"blue-green"` / `"rolling"`.
- [ ] Add `fromWire(String)` — case-insensitive match; unknown or null → `BLUE_GREEN` with no throw (safety fallback). Returns enum, never null.

**Verification:** unit test covering known + unknown + null inputs.
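
A minimal sketch of what the enum in Task 1.1 might look like, following the rules listed above (kebab-case wire values, case-insensitive parse, `BLUE_GREEN` fallback). The `wire` field name and the whitespace trimming are assumptions, and the type is package-private here only to keep the snippet self-contained; the real class would presumably be `public`.

```java
import java.util.Locale;

/** Sketch of the strategy enum from Task 1.1; not the final implementation. */
enum DeploymentStrategy {
    BLUE_GREEN("blue-green"),
    ROLLING("rolling");

    // Hypothetical field name: the kebab-case value stored on the config.
    private final String wire;

    DeploymentStrategy(String wire) {
        this.wire = wire;
    }

    /** Returns "blue-green" or "rolling". */
    String toWire() {
        return wire;
    }

    /**
     * Case-insensitive parse. Null or unknown input falls back to BLUE_GREEN
     * without throwing, so the result is never null.
     * Trimming surrounding whitespace is an assumption, not part of the plan.
     */
    static DeploymentStrategy fromWire(String value) {
        if (value == null) {
            return BLUE_GREEN;
        }
        String normalized = value.trim().toLowerCase(Locale.ROOT);
        for (DeploymentStrategy strategy : values()) {
            if (strategy.wire.equals(normalized)) {
                return strategy;
            }
        }
        return BLUE_GREEN;
    }
}
```

The unit test named in the verification step could exercise one known value per constant, one mixed-case value, one unknown value, and `null`.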
### Task 1.2: Generation suffix helper
- [ ] Decide location — inline static helper on `DeploymentExecutor` is fine (`private static String gen(UUID id) { return id.toString().substring(0,8); }`). No new file needed.
---
## Phase 2 — Executor: gen-suffixed naming + `cameleer.generation` label
This phase is purely the naming change; no strategy branching yet. After this phase, redeploy still uses the destroy-then-start interim, but containers carry the new names + label.
### Task 2.1: TraefikLabelBuilder — accept `gen`, emit generation label
**Files:** Modify `TraefikLabelBuilder.java`.
- [ ] Add `String gen` as a new arg on `build(...)`.
- [ ] Change `instanceId` construction: `envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen`.
- [ ] Add label `cameleer.generation = gen`.
- [ ] Leave the Traefik router/service label keys using `svc = envSlug + "-" + appSlug` (unchanged — routing is generation-agnostic so load balancing across old+new works automatically).
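
The label change in Task 2.1 could be sketched as follows. The class name `GenerationLabels`, the method signatures, and the sample slugs are hypothetical; only the two `cameleer.*` label keys and the `svc` format come from this plan. The real `TraefikLabelBuilder.build(...)` emits further labels that are not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of the generation-related labels only; class and method names are hypothetical. */
final class GenerationLabels {
    private GenerationLabels() {}

    /** Labels that change per deployment generation. */
    static Map<String, String> build(String envSlug, String appSlug, int replicaIndex, String gen) {
        Map<String, String> labels = new LinkedHashMap<>();
        // Instance id now carries the generation suffix.
        labels.put("cameleer.instance-id", envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen);
        // Dedicated label so dashboards can mark deploy boundaries without parsing the instance id.
        labels.put("cameleer.generation", gen);
        return labels;
    }

    /** Router/service key stays generation-agnostic so old and new replicas share one service. */
    static String serviceKey(String envSlug, String appSlug) {
        return envSlug + "-" + appSlug;
    }
}
```

For example, `build("prod", "orders", 1, "2c82f29a")` would yield an instance id of `prod-orders-1-2c82f29a`, while `serviceKey("prod", "orders")` stays `prod-orders` across generations.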
### Task 2.2: DeploymentExecutor — compute gen once, thread through
**Files:** Modify `DeploymentExecutor.executeAsync`.
- [ ] At the top of the try block (after `env`, `app`, `config` resolution), compute `String gen = gen(deployment.id());`.
- [ ] In the replica loop: `String instanceId = env.slug() + "-" + app.slug() + "-" + i + "-" + gen;` and `String containerName = tenantId + "-" + instanceId;`.
- [ ] Pass `gen` to `TraefikLabelBuilder.build(...)`.
- [ ] Set `CAMELEER_AGENT_INSTANCEID=instanceId` (already done, just verify the new value propagates).
- [ ] Leave `replicaStates[].containerName` stored as the new full name.
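
As a rough illustration of the naming scheme in Task 2.2, the sketch below derives `gen`, the instance id, and the container name from plain strings. The helper class `ContainerNaming` is hypothetical (the plan keeps this logic inline in `DeploymentExecutor`), and the sample UUID and slugs are made up.

```java
import java.util.UUID;

/** Sketch of the gen-suffixed naming; the helper class itself is hypothetical. */
final class ContainerNaming {
    private ContainerNaming() {}

    /** First 8 characters of the canonical UUID string, i.e. the hex digits before the first hyphen. */
    static String gen(UUID deploymentId) {
        return deploymentId.toString().substring(0, 8);
    }

    /** {envSlug}-{appSlug}-{replicaIndex}-{gen} */
    static String instanceId(String envSlug, String appSlug, int replicaIndex, String gen) {
        return envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen;
    }

    /** {tenantId}-{instanceId}: the real Docker container name. */
    static String containerName(String tenantId, String instanceId) {
        return tenantId + "-" + instanceId;
    }
}
```

Because two deployments of the same app get different `gen` values, replica 0 of the old deployment and replica 0 of the new one end up with distinct container names and can run side by side.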
### Task 2.3: Update the one brittle test
**Files:** Modify `DeploymentControllerIT.java`.
- [ ] Relax the container-name assertion to `startsWith("default-default-deploy-test-")` or similar — verify behavior, not exact suffix.

**Verification after Phase 2:**
- `mvn -pl cameleer-server-app -am test -Dtest=DeploymentSnapshotIT,DeploymentControllerIT,PostgresDeploymentRepositoryIT`
- All green; container names now include gen; redeploy still works via the interim destroy-then-start flow (which will be replaced in Phase 3).
---
## Phase 3 — Blue-green strategy (default)
### Task 3.1: Extract `deployBlueGreen(...)` helper
**Files:** Modify `DeploymentExecutor.java`.
- [ ] Move the current START_REPLICAS → HEALTH_CHECK → SWAP_TRAFFIC body into a new `private void deployBlueGreen(...)` method.
- [ ] Signature: take `deployment`, `app`, `env`, `config`, `resolvedRuntimeType`, `mainClass`, `gen`, `primaryNetwork`, `additionalNets`.
### Task 3.2: Reorder for proper blue-green
- [ ] Remove the pre-flight "stop previous" block added in `f8dccaae` (will be replaced by post-health swap).
- [ ] Order: start all new → wait all healthy → find previous active (via `findActiveByAppIdAndEnvironmentIdExcluding`) → stop old containers + mark old row STOPPED.
- [ ] Strict all-healthy: if `healthyCount < config.replicas()`, stop the new containers we just started, mark deployment FAILED with `"blue-green: %d/%d replicas healthy; preserving previous deployment"`. Do **not** touch the old deployment.
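
The ordering in Task 3.2 can be sketched against a small stand-in interface. `ContainerRuntime` and its three methods are hypothetical simplifications of the real orchestrator, and returning the failure message as an `Optional` is an assumption made to keep the sketch free of persistence code; the real method would update deployment rows instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the real orchestrator; method names are assumptions. */
interface ContainerRuntime {
    String start(String containerName);      // returns the container id
    boolean waitHealthy(String containerId); // true once the container reports healthy
    void stopAndRemove(String containerId);
}

final class BlueGreenSketch {
    private BlueGreenSketch() {}

    /** Returns empty on success, or the failure message when not all new replicas are healthy. */
    static Optional<String> deploy(ContainerRuntime runtime, List<String> newNames, List<String> oldIds) {
        // 1. Start all new replicas while the old ones keep serving.
        List<String> started = new ArrayList<>();
        for (String name : newNames) {
            started.add(runtime.start(name));
        }
        // 2. Health-check every new replica.
        int healthy = 0;
        for (String id : started) {
            if (runtime.waitHealthy(id)) {
                healthy++;
            }
        }
        // 3. Strict all-healthy: on a partial result, remove the new replicas and leave old untouched.
        if (healthy < newNames.size()) {
            for (String id : started) {
                runtime.stopAndRemove(id);
            }
            return Optional.of(String.format(
                "blue-green: %d/%d replicas healthy; preserving previous deployment",
                healthy, newNames.size()));
        }
        // 4. Only now stop the previous deployment's containers.
        for (String id : oldIds) {
            runtime.stopAndRemove(id);
        }
        return Optional.empty();
    }
}
```

The key property is that no old container id is passed to `stopAndRemove` until every new replica has passed its health check.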
### Task 3.3: Wire strategy dispatch
- [ ] At the point where `deployBlueGreen` is called, check `DeploymentStrategy.fromWire(config.deploymentStrategy())` and dispatch. For this phase, always call `deployBlueGreen`.
- [ ] `ROLLING` dispatches to `deployRolling(...)` implemented in Phase 4 (stub it to throw `UnsupportedOperationException` for now — will be replaced before this phase lands).
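
One possible shape for the Phase 3 dispatch, assuming the `ROLLING` branch is stubbed until Phase 4. The `Strategy` enum is a local stand-in for `DeploymentStrategy`, and the helper methods return a string purely so the sketch has something observable; the real helpers take the deployment context and return nothing.

```java
/** Local stand-in for DeploymentStrategy so the sketch compiles on its own. */
enum Strategy { BLUE_GREEN, ROLLING }

final class DispatchSketch {
    private DispatchSketch() {}

    /** Branches on the parsed strategy; the string return value is illustrative only. */
    static String dispatch(Strategy strategy) {
        switch (strategy) {
            case BLUE_GREEN:
                return deployBlueGreen();
            case ROLLING:
                return deployRolling();
            default:
                // Unreachable while the enum has two constants; guards future additions.
                throw new IllegalStateException("unhandled strategy: " + strategy);
        }
    }

    static String deployBlueGreen() {
        return "blue-green";
    }

    /** Phase 3 stub, assumed to be replaced by the real rolling flow in Phase 4. */
    static String deployRolling() {
        throw new UnsupportedOperationException("rolling strategy is implemented in Phase 4");
    }
}
```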
---
## Phase 4 — Rolling strategy
### Task 4.1: `deployRolling(...)` helper
**Files:** Modify `DeploymentExecutor.java`.
- [ ] Same signature as `deployBlueGreen`.
- [ ] Look up previous deployment once at entry via `findActiveByAppIdAndEnvironmentIdExcluding`. Capture its `replicaStates` into a map keyed by replica index.
- [ ] For `i` from 0 to `config.replicas() - 1`:
  - [ ] Start new replica `i` (with gen-suffixed name).
  - [ ] Wait for this single container to go healthy (per-replica `waitForOneHealthy(containerId, timeoutSeconds)`; reuse `healthCheckTimeout` per replica or introduce a smaller per-replica budget).
  - [ ] On success: stop the corresponding old replica `i` by `containerId` from the previous deployment's replicaStates (if present); log and continue.
  - [ ] On failure: stop + remove all new replicas started so far, mark deployment FAILED with `"rolling: replica %d failed to reach healthy; preserved %d previous replicas"`. Do **not** touch the already-replaced replicas from previous deployment (they're already stopped) or the not-yet-replaced ones (they keep serving).
- [ ] After the loop succeeds for all replicas, mark the previous deployment row STOPPED (its containers are all stopped).
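
The per-replica loop above might look roughly like this. `ReplicaRuntime` is a hypothetical stand-in for the orchestrator, and the way the "preserved" count is computed (old replicas whose index is at or after the failed one) is an assumption, since the plan only gives the message format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for the real orchestrator; method names are assumptions. */
interface ReplicaRuntime {
    String start(int replicaIndex);               // returns the new container id
    boolean waitForOneHealthy(String containerId);
    void stopAndRemove(String containerId);
}

final class RollingSketch {
    private RollingSketch() {}

    /** Returns empty on success, or the failure message for the first replica that never got healthy. */
    static Optional<String> deploy(ReplicaRuntime runtime, int replicas, Map<Integer, String> oldIdsByIndex) {
        List<String> startedNew = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            String newId = runtime.start(i);
            startedNew.add(newId);
            if (!runtime.waitForOneHealthy(newId)) {
                // Failure: remove every new replica started so far, including the failed one.
                for (String id : startedNew) {
                    runtime.stopAndRemove(id);
                }
                final int failedIndex = i;
                // Assumption: "preserved" counts old replicas not yet replaced.
                long preserved = oldIdsByIndex.keySet().stream()
                    .filter(index -> index >= failedIndex)
                    .count();
                return Optional.of(String.format(
                    "rolling: replica %d failed to reach healthy; preserved %d previous replicas",
                    failedIndex, preserved));
            }
            // Success: retire the matching old replica, if the previous deployment had one.
            String oldId = oldIdsByIndex.get(i);
            if (oldId != null) {
                runtime.stopAndRemove(oldId);
            }
        }
        return Optional.empty();
    }
}
```

The sketch does not cover a previous deployment with more replicas than the new one; those extra old containers would need a separate stop step before the old row is marked STOPPED.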
### Task 4.2: Add `waitForOneHealthy`
- [ ] Variant of `waitForAnyHealthy` that polls a single container id. Returns boolean. Same sleep cadence.
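
A sketch of the single-container polling loop, under the assumption that the health probe can be expressed as a boolean check. The plan's signature is `waitForOneHealthy(containerId, timeoutSeconds)`; the `BooleanSupplier` and `pollMillis` parameters here are substitutions that keep the example independent of the orchestrator and of the existing sleep cadence.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Sketch of a single-container health wait; parameter shapes are assumptions. */
final class HealthPolling {
    private HealthPolling() {}

    /**
     * Polls the probe until it reports healthy or the timeout elapses.
     * The probe is always checked at least once, even with a zero timeout.
     */
    static boolean waitForOneHealthy(BooleanSupplier isHealthy, int timeoutSeconds, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSeconds);
        while (true) {
            if (isHealthy.getAsBoolean()) {
                return true;
            }
            // Subtraction form stays correct even if nanoTime wraps around.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and treat the wait as failed.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In the executor, the probe would presumably wrap a status lookup for the given container id, and `pollMillis` would match the cadence already used by `waitForAnyHealthy`.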
### Task 4.3: Replace the Phase 3 stub
- [ ] `ROLLING` dispatch calls `deployRolling` instead of throwing.
---
## Phase 5 — Integration tests
Each IT extends `AbstractPostgresIT`, uses `@MockBean RuntimeOrchestrator`, and overrides `cameleer.server.runtime.healthchecktimeout=2` via `@TestPropertySource`.
### Task 5.1: BlueGreenStrategyIT
**Files:** Create `BlueGreenStrategyIT.java`.
- [ ] **Test 1 `blueGreen_allHealthy_stopsOldAfterNew`:** seed a previous RUNNING deployment (2 replicas). Trigger redeploy with `containerConfig.deploymentStrategy=blue-green` + replicas=2. Mock orchestrator: new containers return `healthy`. Await new deployment RUNNING. Assert: previous deployment has status STOPPED, its container IDs had `stopContainer`+`removeContainer` called; new deployment replicaStates contain the two new container IDs; `cameleer.generation` label on both new container requests.
- [ ] **Test 2 `blueGreen_partialHealthy_preservesOldAndMarksFailed`:** seed previous RUNNING (2 replicas). New deploy with replicas=2. Mock: container A healthy, container B starting forever. Await new deployment FAILED. Assert: previous deployment still RUNNING; its container IDs were **not** stopped; new deployment errorMessage contains "1/2 replicas healthy".
### Task 5.2: RollingStrategyIT
**Files:** Create `RollingStrategyIT.java`.
- [ ] **Test 1 `rolling_allHealthy_replacesOneByOne`:** seed previous RUNNING (3 replicas). New deploy with strategy=rolling, replicas=3. Mock: new containers all healthy. Use `ArgumentCaptor` on `startContainer` to observe start order. Assert: start[0] → stop[old0] → start[1] → stop[old1] → start[2] → stop[old2]; new deployment RUNNING with 3 replicaStates; old deployment STOPPED.
- [ ] **Test 2 `rolling_failsMidRollout_preservesRemainingOld`:** seed previous RUNNING (3 replicas). New deploy strategy=rolling. Mock: new[0] healthy, new[1] never healthy. Await FAILED. Assert: new[0] was stopped during cleanup; old[0] was stopped (replaced before the failure); old[1] + old[2] still RUNNING; new deployment errorMessage contains "replica 1".
---
## Phase 6 — UI strategy indicator
### Task 6.1: Strategy dropdown polish
**Files:** Modify `ResourcesTab.tsx`.
- [ ] Verify the `<select>` has options `blue-green` and `rolling`.
- [ ] Add a one-line description under the dropdown: "Blue-green: start all new, swap when healthy. Rolling: replace one replica at a time."
### Task 6.2: Strategy on StatusCard
**Files:** Modify `DeploymentTab/StatusCard.tsx`.
- [ ] Add a small subtle text line in the grid: `<span>Strategy</span><span>{deployment.deploymentStrategy}</span>` (read-only, mono text ok).
---
## Phase 7 — Docs + rules updates
### Task 7.1: Update `.claude/rules/docker-orchestration.md`
- [ ] Replace the "DeploymentExecutor Details" section with the new flow (gen suffix, strategy dispatch, per-strategy ordering).
- [ ] Update the "Deployment Status Model" table — `DEGRADED` now means "post-deploy replica crashed"; failed-during-deploy is always `FAILED`.
- [ ] Add a short "Deployment Strategies" section: behavior of blue-green vs rolling, resource peak, failure semantics.
### Task 7.2: Update `.claude/rules/app-classes.md`
- [ ] Under the `runtime/` `DeploymentExecutor` bullet: add "branches on `DeploymentStrategy.fromWire(config.deploymentStrategy())`. Container name format: `{tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{gen}` where gen = 8-char prefix of deployment UUID."
### Task 7.3: Update `.claude/rules/core-classes.md`
- [ ] Add under `runtime/`: `DeploymentStrategy` — enum BLUE_GREEN, ROLLING; `fromWire` falls back to BLUE_GREEN; note stored as kebab-case string on config.
---
## Rollout sequence
1. Phase 1 (enum + helper) — trivial, land as one commit.
2. Phase 2 (naming + generation label) — one commit; interim destroy-then-start still active; no OpenAPI regeneration needed (no controller change).
3. Phase 3 (blue-green as default) — one commit replacing the interim flow. This is where real behavior changes.
4. Phase 4 (rolling) — one commit.
5. Phase 5 (4 ITs) — one commit; run `mvn test` against affected modules.
6. Phase 6 (UI) — one commit; `npx tsc` clean.
7. Phase 7 (docs) — one commit.
Total: 7 commits, all atomic.
## Acceptance
- Existing `DeploymentSnapshotIT` still passes.
- New `BlueGreenStrategyIT` (2 tests) and `RollingStrategyIT` (2 tests) pass.
- Browser QA: redeploy with `deploymentStrategy=blue-green` vs `rolling` produces the expected container timeline (inspect via `docker ps`); Prometheus metrics show continuity across deploys when queried by `{cameleer_app, cameleer_environment}`; the `cameleer_generation` label flips per deploy.
- `.claude/rules/docker-orchestration.md` reflects the new behavior.
## Non-goals
- Automatic rollback on blue-green partial failure (old is left running; user redeploys).
- Automatic rollback on rolling mid-failure (remaining old replicas keep running; user redeploys).
- Per-replica `HEALTH_CHECK` stage label in the UI progress bar — the 7-stage progress is reused as-is; strategy dictates internal looping.
- Strategy field validation at container-config save time (executor's `fromWire` fallback absorbs unknown values — consider a follow-up for strict validation if it becomes an issue).