Compare commits

8 Commits

Author SHA1 Message Date
hsiegeln
007597715a docs(rules): deployment strategies + generation suffix
Refresh the three rules files to match the new executor behavior:

- docker-orchestration.md: rewrite DeploymentExecutor Details with
  container naming scheme ({...}-{replica}-{generation}), strategy
  dispatch (blue-green vs rolling), and the new DEGRADED semantics
  (post-deploy only). Update TraefikLabelBuilder + ContainerLogForwarder
  bullets for the generation suffix + new cameleer.generation label.
- app-classes.md: DeploymentExecutor + TraefikLabelBuilder bullets
  mirror the same.
- core-classes.md: add DeploymentStrategy enum; note DEGRADED is now
  post-deploy-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:02:51 +02:00
hsiegeln
b6e54db6ec ui(deploy): strategy hint on Resources tab + indicator on StatusCard
Resources tab: add a hint under the Deploy Strategy dropdown that
explains the blue-green vs rolling trade-off (resource peak, failure
semantics), switching text based on the current selection.

StatusCard: show the active deployment's strategy inline in the info
grid so users can tell at a glance which path was taken for a given
deployment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:00:44 +02:00
hsiegeln
e9f523f2b8 test(deploy): blue-green + rolling strategy ITs
Four ITs covering strategy behavior:
- BlueGreenStrategyIT#blueGreen_allHealthy_stopsOldAfterNew:
  old is stopped only after all new replicas are healthy.
- BlueGreenStrategyIT#blueGreen_partialHealthy_preservesOldAndMarksFailed:
  strict all-healthy — one starting replica aborts the deploy and
  leaves the previous deployment RUNNING untouched.
- RollingStrategyIT#rolling_allHealthy_replacesOneByOne:
  InOrder on stopContainer confirms old-0 stops before old-1 (the
  interleaving that distinguishes rolling from blue-green).
- RollingStrategyIT#rolling_failsMidRollout_preservesRemainingOld:
  mid-rollout health failure stops only the in-flight new containers
  and the already-replaced old-0; old-1 stays untouched.

Shortens healthchecktimeout to 2s via @TestPropertySource so failure
paths complete in ~25s instead of ~60s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:00:00 +02:00
hsiegeln
653f983a08 deploy: rolling strategy (per-replica replacement)
Replace the Phase 3 stub with a working rolling implementation.

Flow:
- Capture previous deployment's per-index container ids up front.
- For i = 0..replicas-1:
  - Start new[i] (gen-suffixed name, coexists with old[i]).
  - Wait for new[i] healthy (new waitForOneHealthy helper).
  - On success: stop old[i] if present, continue.
  - On failure: stop in-flight new[0..i], leave un-replaced old[i+1..N]
    running, mark FAILED. Already-replaced old replicas are not
    restored — rolling is not reversible; user redeploys to recover.
- After the loop: sweep any leftover old replicas (when replica count
  shrank) and mark the old deployment STOPPED.

Resource peak: replicas + 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:53:52 +02:00
hsiegeln
459cdfe427 deploy: blue-green strategy (start → health-all → stop old)
Phase 3 of deployment-strategies plan. Refactor executeAsync to
dispatch on DeploymentStrategy.fromWire(config.deploymentStrategy()).

Blue-green (default):
- Start all N new replicas (gen-suffixed names coexist with old).
- Wait for ALL healthy (strict — partial-healthy = FAILED, preserves
  previous deployment untouched).
- Only then find + stop the previous deployment.
- Final status is always RUNNING; DEGRADED is now reserved for
  post-deploy replica crashes (set by DockerEventMonitor).

Rolling: stub — throws UnsupportedOperationException for now, gets
its real implementation in Phase 4.

Refactor details:
- Extract DeployCtx record to carry 13 per-deploy values around.
- Extract startReplica(ctx, i, stateOut) — shared by both strategy paths.
- Extract persistSnapshotAndMarkRunning(ctx, primaryCid) — shared finalizer.
- Rename waitForAnyHealthy → waitForAllHealthy (the name was misleading;
  the method already waited for all, just returned partial on timeout).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:51:24 +02:00
hsiegeln
652346dcd4 deploy: gen-suffixed container names + cameleer.generation label
Append an 8-char generation id (first 8 chars of deployment UUID) to:
- container name: {tenant}-{env}-{app}-{replica}-{gen}
- CAMELEER_AGENT_INSTANCEID (so old+new agents are distinct in the registry)
- Traefik cameleer.instance-id label

And emit a new standalone cameleer.generation label so dashboards
(Prometheus/Grafana) can pin deploy boundaries without regex on
instance-id.

Strategy branching comes next — this commit is foundation only; the
interim destroy-then-start flow still runs regardless of strategy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:45:44 +02:00
hsiegeln
5304c8ee01 core(deploy): DeploymentStrategy enum with safe wire conversion
Typed enum (BLUE_GREEN, ROLLING) with fromWire/toWire kebab-case
translation. fromWire falls back to BLUE_GREEN for unknown or null
input so the executor dispatch site never null-checks and no
misconfigured container-config can throw at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:42:35 +02:00
hsiegeln
2c82f29aef docs(plans): deployment strategies (blue-green + rolling) plan
7-phase plan to replace the interim destroy-then-start flow (f8dccaae)
with a strategy-aware executor. Adds gen-suffixed container names so
old + new replicas can coexist, plus a cameleer.generation label for
Prometheus/Grafana deploy-boundary annotations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:41:43 +02:00
12 changed files with 1020 additions and 154 deletions

View File

@@ -118,10 +118,10 @@ Env-scoped read-path controllers (`AlertController`, `AlertRuleController`, `Ale
  ## runtime/ — Docker orchestration
  - `DockerRuntimeOrchestrator` — implements RuntimeOrchestrator; Docker Java client (zerodep transport), container lifecycle
- - `DeploymentExecutor` — @Async staged deploy: PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE. Container names are `{tenantId}-{envSlug}-{appSlug}-{replicaIndex}` (globally unique on Docker daemon). Sets per-replica `CAMELEER_AGENT_INSTANCEID` env var to `{envSlug}-{appSlug}-{replicaIndex}`.
+ - `DeploymentExecutor` — @Async staged deploy: PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE. Container names are `{tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{generation}`, where `generation` is the first 8 chars of the deployment UUID — old and new replicas coexist during a blue/green swap. Per-replica `CAMELEER_AGENT_INSTANCEID` env var is `{envSlug}-{appSlug}-{replicaIndex}-{generation}`. Branches on `DeploymentStrategy.fromWire(config.deploymentStrategy())`: **blue-green** (default) starts all N → waits for all healthy → stops old (partial health = FAILED, preserves old untouched); **rolling** replaces replicas one at a time with rollback only for in-flight new containers (already-replaced old stay stopped; un-replaced old keep serving). DEGRADED is now only set by `DockerEventMonitor` post-deploy, never by the executor.
  - `DockerNetworkManager` — ensures bridge networks (cameleer-traefik, cameleer-env-{slug}), connects containers
  - `DockerEventMonitor` — persistent Docker event stream listener (die, oom, start, stop), updates deployment status
- - `TraefikLabelBuilder` — generates Traefik Docker labels for path-based or subdomain routing. Also emits `cameleer.replica` and `cameleer.instance-id` labels per container for labels-first identity.
+ - `TraefikLabelBuilder` — generates Traefik Docker labels for path-based or subdomain routing. Per-container identity labels: `cameleer.replica` (index), `cameleer.generation` (deployment-scoped 8-char id — for Prometheus/Grafana deploy-boundary annotations), `cameleer.instance-id` (`{envSlug}-{appSlug}-{replicaIndex}-{generation}`). Router/service label keys are generation-agnostic so load balancing spans old + new replicas during a blue/green overlap.
  - `PrometheusLabelBuilder` — generates Prometheus Docker labels (`prometheus.scrape/path/port`) per runtime type for `docker_sd_configs` auto-discovery
  - `ContainerLogForwarder` — streams Docker container stdout/stderr to ClickHouse with `source='container'`. One follow-stream thread per container, batches lines every 2s/50 lines via `ClickHouseLogStore.insertBufferedBatch()`. 60-second max capture timeout.
  - `DisabledRuntimeOrchestrator` — no-op when runtime not enabled
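
As an illustration of the naming scheme in the `DeploymentExecutor` bullet above, here is a minimal sketch of how the generation suffix, instance id, and container name could be composed. The class `GenerationNamingSketch` and its method signatures are assumptions made for this example (the real executor works from a `Deployment` record); only the name formats and the 8-character UUID prefix come from the notes above.

```java
import java.util.UUID;

/** Hypothetical helper; only the name formats are taken from the rules file. */
final class GenerationNamingSketch {

    /** Generation = first 8 characters of the deployment UUID's string form. */
    static String generationOf(UUID deploymentId) {
        return deploymentId.toString().substring(0, 8);
    }

    /** Instance id: {envSlug}-{appSlug}-{replicaIndex}-{generation}. */
    static String instanceId(String envSlug, String appSlug, int replicaIndex, String generation) {
        return envSlug + "-" + appSlug + "-" + replicaIndex + "-" + generation;
    }

    /** Container name: {tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{generation}. */
    static String containerName(String tenantId, String envSlug, String appSlug,
                                int replicaIndex, String generation) {
        return tenantId + "-" + instanceId(envSlug, appSlug, replicaIndex, generation);
    }

    public static void main(String[] args) {
        // Example values only; tenant, environment, and app slugs are made up.
        UUID deploymentId = UUID.fromString("0a1b2c3d-1111-2222-3333-444455556666");
        String generation = generationOf(deploymentId);
        System.out.println(containerName("acme", "prod", "orders", 0, generation));
    }
}
```

Because the generation differs between two deployments of the same app, replica 0 of the old deployment and replica 0 of the new one get different container names and can run side by side.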

View File

@@ -29,8 +29,9 @@ paths:
  - `Environment` — record: id, slug, displayName, production, enabled, defaultContainerConfig, jarRetentionCount, color, createdAt. `color` is one of the 8 preset palette values validated by `EnvironmentColor.VALUES` and CHECK-constrained in PostgreSQL (V2 migration).
  - `EnvironmentColor` — constants: `DEFAULT = "slate"`, `VALUES = {slate,red,amber,green,teal,blue,purple,pink}`, `isValid(String)`.
  - `Deployment` — record: id, appId, appVersionId, environmentId, status, targetState, deploymentStrategy, replicaStates (JSONB), deployStage, containerId, containerName
- - `DeploymentStatus` — enum: STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED
+ - `DeploymentStatus` — enum: STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED. `DEGRADED` is reserved for post-deploy drift (a replica died after RUNNING); `DeploymentExecutor` now marks partial-healthy deploys FAILED, not DEGRADED.
  - `DeployStage` — enum: PRE_FLIGHT, PULL_IMAGE, CREATE_NETWORK, START_REPLICAS, HEALTH_CHECK, SWAP_TRAFFIC, COMPLETE
+ - `DeploymentStrategy` — enum: BLUE_GREEN, ROLLING. Stored on `ResolvedContainerConfig.deploymentStrategy` as kebab-case string (`"blue-green"` / `"rolling"`). `fromWire(String)` is the only conversion entry point; unknown/null inputs fall back to BLUE_GREEN so the executor dispatch site never null-checks or throws.
  - `DeploymentService` — createDeployment (deletes terminal deployments first), markRunning, markFailed, markStopped
  - `RuntimeType` — enum: AUTO, SPRING_BOOT, QUARKUS, PLAIN_JAVA, NATIVE
  - `RuntimeDetector` — probes JAR files at upload time: detects runtime from manifest Main-Class (Spring Boot loader, Quarkus entry point, plain Java) or native binary (non-ZIP magic bytes)
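
The `DeploymentStrategy` entry above describes a kebab-case wire format with a safe fallback. A minimal sketch of such an enum follows; it is named `DeploymentStrategySketch` to make clear it is not the committed class, and the `wire` field plus the exact-match comparison (no trimming, no case folding) are assumptions.

```java
/** Illustrative enum; only the constants, wire strings, and fallback rule come from the notes. */
enum DeploymentStrategySketch {
    BLUE_GREEN("blue-green"),
    ROLLING("rolling");

    // Assumed field name; holds the kebab-case wire value.
    private final String wire;

    DeploymentStrategySketch(String wire) {
        this.wire = wire;
    }

    /** Kebab-case string as stored in the container config. */
    String toWire() {
        return wire;
    }

    /** Never throws and never returns null: unknown or null input maps to BLUE_GREEN. */
    static DeploymentStrategySketch fromWire(String value) {
        if (value != null) {
            for (DeploymentStrategySketch s : values()) {
                if (s.wire.equals(value)) {
                    return s;
                }
            }
        }
        return BLUE_GREEN;
    }
}
```

With a total conversion like this, a `switch` over the result needs no default branch for bad input, which matches the stated goal that the dispatch site never null-checks.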

View File

@@ -13,19 +13,28 @@ paths:
  When deployed via the cameleer-saas platform, this server orchestrates customer app containers using Docker. Key components:
  - **ConfigMerger** (`core/runtime/ConfigMerger.java`) — pure function: resolve(globalDefaults, envConfig, appConfig) -> ResolvedContainerConfig. Three-layer merge: global (application.yml) -> environment (defaultContainerConfig JSONB) -> app (containerConfig JSONB). Includes `runtimeType` (default `"auto"`) and `customArgs` (default `""`).
- - **TraefikLabelBuilder** (`app/runtime/TraefikLabelBuilder.java`) — generates Traefik Docker labels for path-based (`/{envSlug}/{appSlug}/`) or subdomain-based (`{appSlug}-{envSlug}.{domain}`) routing. Supports strip-prefix and SSL offloading toggles. Also sets per-replica identity labels: `cameleer.replica` (index) and `cameleer.instance-id` (`{envSlug}-{appSlug}-{replicaIndex}`). Internal processing uses labels (not container name parsing) for extensibility.
+ - **TraefikLabelBuilder** (`app/runtime/TraefikLabelBuilder.java`) — generates Traefik Docker labels for path-based (`/{envSlug}/{appSlug}/`) or subdomain-based (`{appSlug}-{envSlug}.{domain}`) routing. Supports strip-prefix and SSL offloading toggles. Per-replica identity labels: `cameleer.replica` (index), `cameleer.generation` (8-char deployment UUID prefix — pin Prometheus/Grafana deploy boundaries with this), `cameleer.instance-id` (`{envSlug}-{appSlug}-{replicaIndex}-{generation}`). Traefik router/service keys deliberately omit the generation so load balancing spans old + new replicas during a blue/green overlap.
  - **PrometheusLabelBuilder** (`app/runtime/PrometheusLabelBuilder.java`) — generates Prometheus `docker_sd_configs` labels per resolved runtime type: Spring Boot `/actuator/prometheus:8081`, Quarkus/native `/q/metrics:9000`, plain Java `/metrics:9464`. Labels merged into container metadata alongside Traefik labels at deploy time.
  - **DockerNetworkManager** (`app/runtime/DockerNetworkManager.java`) — manages two Docker network tiers:
    - `cameleer-traefik` — shared network; Traefik, server, and all app containers attach here. Server joined via docker-compose with `cameleer-server` DNS alias.
    - `cameleer-env-{slug}` — per-environment isolated network; containers in the same environment discover each other via Docker DNS. In SaaS mode, env networks are tenant-scoped: `cameleer-env-{tenantId}-{envSlug}` (overloaded `envNetworkName(tenantId, envSlug)` method) to prevent cross-tenant collisions when multiple tenants have identically-named environments.
  - **DockerEventMonitor** (`app/runtime/DockerEventMonitor.java`) — persistent Docker event stream listener for containers with `managed-by=cameleer-server` label. Detects die/oom/start/stop events and updates deployment replica states. Periodic reconciliation (@Scheduled every 30s) inspects actual container state and corrects deployment status mismatches (fixes stale DEGRADED with all replicas healthy).
  - **DeploymentProgress** (`ui/src/components/DeploymentProgress.tsx`) — UI step indicator showing 7 deploy stages with amber active/green completed styling.
- - **ContainerLogForwarder** (`app/runtime/ContainerLogForwarder.java`) — streams Docker container stdout/stderr to ClickHouse `logs` table with `source='container'`. Uses `docker logs --follow` per container, batches lines every 2s or 50 lines. Parses Docker timestamp prefix, infers log level via regex. `DeploymentExecutor` starts capture after each replica launches with the replica's `instanceId` (`{envSlug}-{appSlug}-{replicaIndex}`); `DockerEventMonitor` stops capture on die/oom. 60-second max capture timeout with 30s cleanup scheduler. Thread pool of 10 daemon threads. Container logs use the same `instanceId` as the agent (set via `CAMELEER_AGENT_INSTANCEID` env var) for unified log correlation at the instance level.
+ - **ContainerLogForwarder** (`app/runtime/ContainerLogForwarder.java`) — streams Docker container stdout/stderr to ClickHouse `logs` table with `source='container'`. Uses `docker logs --follow` per container, batches lines every 2s or 50 lines. Parses Docker timestamp prefix, infers log level via regex. `DeploymentExecutor` starts capture after each replica launches with the replica's `instanceId` (`{envSlug}-{appSlug}-{replicaIndex}-{generation}`); `DockerEventMonitor` stops capture on die/oom. 60-second max capture timeout with 30s cleanup scheduler. Thread pool of 10 daemon threads. Container logs use the same `instanceId` as the agent (set via `CAMELEER_AGENT_INSTANCEID` env var) for unified log correlation at the instance level. Instance-id changes per deployment — cross-deploy queries aggregate on `application + environment` (and optionally `replica_index`).
  - **StartupLogPanel** (`ui/src/components/StartupLogPanel.tsx`) — collapsible log panel rendered below `DeploymentProgress`. Queries `/api/v1/logs?source=container&application={appSlug}&environment={envSlug}`. Auto-polls every 3s while deployment is STARTING; shows green "live" badge during polling, red "stopped" badge on FAILED. Uses `useStartupLogs` hook and `LogViewer` (design system).
  ## DeploymentExecutor Details
- Primary network for app containers is set via `CAMELEER_SERVER_RUNTIME_DOCKERNETWORK` env var (in SaaS mode: `cameleer-tenant-{slug}`); apps also connect to `cameleer-traefik` (routing) and `cameleer-env-{tenantId}-{envSlug}` (per-environment discovery) as additional networks. Resolves `runtimeType: auto` to concrete type from `AppVersion.detectedRuntimeType` at PRE_FLIGHT (fails deployment if unresolvable). Builds Docker entrypoint per runtime type (all JVM types use `-javaagent:/app/agent.jar -jar`, plain Java uses `-cp` with main class, native runs binary directly). Sets per-replica `CAMELEER_AGENT_INSTANCEID` env var to `{envSlug}-{appSlug}-{replicaIndex}` so container logs and agent logs share the same instance identity. Sets `CAMELEER_AGENT_*` env vars from `ResolvedContainerConfig` (routeControlEnabled, replayEnabled, health port). These are startup-only agent properties — changing them requires redeployment.
+ Primary network for app containers is set via `CAMELEER_SERVER_RUNTIME_DOCKERNETWORK` env var (in SaaS mode: `cameleer-tenant-{slug}`); apps also connect to `cameleer-traefik` (routing) and `cameleer-env-{tenantId}-{envSlug}` (per-environment discovery) as additional networks. Resolves `runtimeType: auto` to concrete type from `AppVersion.detectedRuntimeType` at PRE_FLIGHT (fails deployment if unresolvable). Builds Docker entrypoint per runtime type (all JVM types use `-javaagent:/app/agent.jar -jar`, plain Java uses `-cp` with main class, native runs binary directly). Sets per-replica `CAMELEER_AGENT_INSTANCEID` env var to `{envSlug}-{appSlug}-{replicaIndex}-{generation}` so container logs and agent logs share the same instance identity. Sets `CAMELEER_AGENT_*` env vars from `ResolvedContainerConfig` (routeControlEnabled, replayEnabled, health port). These are startup-only agent properties — changing them requires redeployment.
+ **Container naming**: `{tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{generation}`, where `generation` is the first 8 characters of the deployment UUID. The generation suffix lets old + new replicas coexist during a blue/green swap (deterministic names without a generation used to fail with a 409 name conflict). All lookups across the executor, `DockerEventMonitor`, and `ContainerLogForwarder` key on container **id**, not name; the name is for operator visibility only.
+ **Strategy dispatch**: `DeploymentStrategy.fromWire(config.deploymentStrategy())` branches the executor. Unknown or null values fall back to BLUE_GREEN so misconfiguration never throws at runtime.
+ - **Blue/green** (default): start all N new replicas → wait for ALL healthy → stop the previous deployment. Resource peak ≈ 2× replicas for the health-check window. Partial health aborts with status FAILED; the previous deployment is preserved untouched (user's safety net).
+ - **Rolling**: replace replicas one at a time — start new[i] → wait healthy → stop old[i] → next. Resource peak = replicas + 1. Mid-rollout health failure stops in-flight new containers and aborts; already-replaced old replicas are NOT restored (not reversible) but un-replaced old[i+1..N] keep serving traffic. User redeploys to recover.
+ Traffic routing is implicit: Traefik labels (`cameleer.app`, `cameleer.environment`) are generation-agnostic, so new replicas attract load balancing as soon as they come up healthy — no explicit swap step.
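
To make the rolling flow above concrete, here is a small in-memory simulation. It does not talk to Docker: container ids are plain strings, health is a caller-supplied predicate, and the class `RollingSketch`, its `Outcome` record, and the `new-{i}-{generation}` id format are all hypothetical. On failure it removes every new container started so far, following the "stop in-flight new[0..i]" wording of commit 653f983a08.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** In-memory model of the rolling strategy; not the committed implementation. */
final class RollingSketch {

    /** success flag, ids still running afterwards, and the largest number of containers alive at once. */
    record Outcome(boolean success, List<String> running, int peak) {}

    static Outcome deploy(List<String> oldIds, int replicas, String generation,
                          Predicate<String> isHealthy) {
        List<String> running = new ArrayList<>(oldIds);
        List<String> startedNew = new ArrayList<>();
        int peak = running.size();

        for (int i = 0; i < replicas; i++) {
            // Start new[i]; the generation suffix lets it coexist with old[i].
            String fresh = "new-" + i + "-" + generation;
            running.add(fresh);
            startedNew.add(fresh);
            peak = Math.max(peak, running.size());

            if (!isHealthy.test(fresh)) {
                // Failure: stop the new containers, keep un-replaced old ones running.
                running.removeAll(startedNew);
                return new Outcome(false, List.copyOf(running), peak);
            }
            // Success: stop old[i] if it exists, then continue with the next index.
            if (i < oldIds.size()) {
                running.remove(oldIds.get(i));
            }
        }
        // Sweep leftover old replicas when the replica count shrank.
        for (int i = replicas; i < oldIds.size(); i++) {
            running.remove(oldIds.get(i));
        }
        return new Outcome(true, List.copyOf(running), peak);
    }

    public static void main(String[] args) {
        Outcome outcome = deploy(List.of("old-0", "old-1"), 2, "0a1b2c3d", id -> true);
        System.out.println(outcome);
    }
}
```

With two old replicas and two new ones, at most three containers are alive at any moment, which is the "replicas + 1" peak quoted above.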
  ## Deployment Status Model
@@ -34,15 +43,11 @@ Primary network for app containers is set via `CAMELEER_SERVER_RUNTIME_DOCKERNET
  | `STOPPED` | Intentionally stopped or initial state |
  | `STARTING` | Deploy in progress |
  | `RUNNING` | All replicas healthy and serving |
- | `DEGRADED` | Some replicas healthy, some dead |
+ | `DEGRADED` | Post-deploy: a replica died after the deploy was marked RUNNING. Set by `DockerEventMonitor` reconciliation, never by `DeploymentExecutor` directly. |
  | `STOPPING` | Graceful shutdown in progress |
- | `FAILED` | Terminal failure (pre-flight, health check, or crash) |
+ | `FAILED` | Terminal failure (pre-flight, health check, or crash). Partial-healthy deploys now mark FAILED — DEGRADED is reserved for post-deploy drift. |
- **Replica support**: deployments can specify a replica count. `DEGRADED` is used when at least one but not all replicas are healthy.
- **Deploy stages** (`DeployStage`): PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE (or FAILED at any stage).
- **Blue/green strategy**: when re-deploying, new replicas are started and health-checked before old ones are stopped, minimising downtime.
+ **Deploy stages** (`DeployStage`): PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE (or FAILED at any stage). Rolling reuses the same stage labels inside the per-replica loop; the UI progress bar shows the most recent stage.
  **Deployment uniqueness**: `DeploymentService.createDeployment()` deletes any STOPPED/FAILED deployments for the same app+environment before creating a new one, preventing duplicate rows.
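
The strict all-healthy rule from the status table (a partial-healthy blue/green deploy ends FAILED, never DEGRADED) can be sketched with an in-memory model. As before, ids are plain strings and health is a predicate; `BlueGreenSketch`, its nested types, and the `new-{i}-{generation}` id format are hypothetical. That the new replicas are cleaned up on failure is an assumption of this sketch; the notes only state that the previous deployment is left untouched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** In-memory model of the blue/green strategy; not the committed implementation. */
final class BlueGreenSketch {

    enum Status { RUNNING, FAILED }

    /** final status, ids serving afterwards, and the container count during the health-check window. */
    record Outcome(Status status, List<String> running, int peak) {}

    static Outcome deploy(List<String> oldIds, int replicas, String generation,
                          Predicate<String> isHealthy) {
        // Start all N new replicas next to the old ones.
        List<String> fresh = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            fresh.add("new-" + i + "-" + generation);
        }
        int peak = oldIds.size() + fresh.size();

        // Strict rule: every new replica must be healthy.
        boolean allHealthy = fresh.stream().allMatch(isHealthy);
        if (!allHealthy) {
            // Assumption: new replicas are removed; the old deployment keeps serving.
            return new Outcome(Status.FAILED, List.copyOf(oldIds), peak);
        }
        // Only now is the previous deployment stopped.
        return new Outcome(Status.RUNNING, List.copyOf(fresh), peak);
    }

    public static void main(String[] args) {
        Outcome outcome = deploy(List.of("old-0", "old-1"), 2, "0a1b2c3d", id -> true);
        System.out.println(outcome);
    }
}
```

The trade-off against rolling is visible in the `peak` value: old and new replicas overlap fully, so the peak is roughly twice the replica count, in exchange for an untouched fallback when the deploy fails.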

View File

@@ -89,6 +89,34 @@ public class DeploymentExecutor {
          this.applicationConfigRepository = applicationConfigRepository;
      }
+     /** Deployment-scoped id suffix — distinguishes container names and
+      * CAMELEER_AGENT_INSTANCEID across redeploys so old + new replicas can
+      * coexist during a blue/green swap. First 8 chars of the deployment UUID. */
+     static String generationOf(Deployment deployment) {
+         return deployment.id().toString().substring(0, 8);
+     }
+     /**
+      * Per-deployment context assembled once at the top of executeAsync and passed
+      * into strategy handlers. Keeps the strategy methods readable instead of
+      * threading 12 positional args.
+      */
+     private record DeployCtx(
+             Deployment deployment,
+             App app,
+             Environment env,
+             ResolvedContainerConfig config,
+             String jarPath,
+             String resolvedRuntimeType,
+             String mainClass,
+             String generation,
+             String primaryNetwork,
+             List<String> additionalNets,
+             Map<String, String> baseEnvVars,
+             Map<String, String> prometheusLabels,
+             long deployStart
+     ) {}
      @Async("deploymentTaskExecutor")
      public void executeAsync(Deployment deployment) {
          long deployStart = System.currentTimeMillis();
@@ -96,6 +124,7 @@ public class DeploymentExecutor {
  App app = appService.getById(deployment.appId());
  Environment env = envService.getById(deployment.environmentId());
  String jarPath = appService.resolveJarPath(deployment.appVersionId());
+ String generation = generationOf(deployment);
  var globalDefaults = new ConfigMerger.GlobalRuntimeDefaults(
          parseMemoryLimitMb(globalMemoryLimit),
@@ -144,7 +173,6 @@ public class DeploymentExecutor {
  updateStage(deployment.id(), DeployStage.CREATE_NETWORK);
  // Primary network: use configured CAMELEER_DOCKER_NETWORK (tenant-isolated in SaaS mode)
  String primaryNetwork = dockerNetwork;
- String envNet = null;
  List<String> additionalNets = new ArrayList<>();
  if (networkManager != null) {
      networkManager.ensureNetwork(primaryNetwork);
@@ -152,7 +180,7 @@ public class DeploymentExecutor {
      networkManager.ensureNetwork(DockerNetworkManager.TRAEFIK_NETWORK);
      additionalNets.add(DockerNetworkManager.TRAEFIK_NETWORK);
      // Per-environment network scoped to tenant to prevent cross-tenant collisions
-     envNet = DockerNetworkManager.envNetworkName(tenantId, env.slug());
+     String envNet = DockerNetworkManager.envNetworkName(tenantId, env.slug());
      networkManager.ensureNetwork(envNet);
      additionalNets.add(envNet);
  }
@@ -167,135 +195,21 @@ public class DeploymentExecutor {
      }
  }
+ DeployCtx ctx = new DeployCtx(
+         deployment, app, env, config, jarPath,
+         resolvedRuntimeType, mainClass, generation,
+         primaryNetwork, additionalNets,
+         buildEnvVars(app, env, config),
+         PrometheusLabelBuilder.build(resolvedRuntimeType),
+         deployStart);
+
+ // Dispatch on strategy. Unknown values fall back to BLUE_GREEN via fromWire.
+ DeploymentStrategy strategy = DeploymentStrategy.fromWire(config.deploymentStrategy());
+ switch (strategy) {
+     case BLUE_GREEN -> deployBlueGreen(ctx);
+     case ROLLING -> deployRolling(ctx);
+ }
- // === STOP PREVIOUS ACTIVE DEPLOYMENT ===
- // Container names are deterministic ({tenant}-{env}-{app}-{replica}), so a
- // previous active deployment holds the Docker names we need. Stop + remove
- // it before starting new replicas to avoid a 409 name conflict. Excluding
- // the current deployment id by SQL (not Java) because the newly created
- // row already has status=STARTING and would otherwise be picked by
- // findActiveByAppIdAndEnvironmentId ORDER BY created_at DESC LIMIT 1.
- Optional<Deployment> previous = deploymentRepository.findActiveByAppIdAndEnvironmentIdExcluding(
-         deployment.appId(), deployment.environmentId(), deployment.id());
- if (previous.isPresent()) {
-     log.info("Stopping previous deployment {} before starting new replicas", previous.get().id());
-     stopDeploymentContainers(previous.get());
-     deploymentService.markStopped(previous.get().id());
- }
- // === START REPLICAS ===
- updateStage(deployment.id(), DeployStage.START_REPLICAS);
- Map<String, String> baseEnvVars = buildEnvVars(app, env, config);
- Map<String, String> prometheusLabels = PrometheusLabelBuilder.build(resolvedRuntimeType);
- List<Map<String, Object>> replicaStates = new ArrayList<>();
- List<String> newContainerIds = new ArrayList<>();
- for (int i = 0; i < config.replicas(); i++) {
-     String instanceId = env.slug() + "-" + app.slug() + "-" + i;
-     String containerName = tenantId + "-" + instanceId;
-     // Per-replica labels (include replica index and instance-id)
-     Map<String, String> labels = TraefikLabelBuilder.build(app.slug(), env.slug(), tenantId, config, i);
-     labels.putAll(prometheusLabels);
-     // Per-replica env vars (set agent instance ID to match container log identity)
-     Map<String, String> replicaEnvVars = new LinkedHashMap<>(baseEnvVars);
-     replicaEnvVars.put("CAMELEER_AGENT_INSTANCEID", instanceId);
-     String volumeName = jarDockerVolume != null && !jarDockerVolume.isBlank() ? jarDockerVolume : null;
-     ContainerRequest request = new ContainerRequest(
-             containerName, baseImage, jarPath,
-             volumeName, jarStoragePath,
-             primaryNetwork,
-             additionalNets,
-             replicaEnvVars, labels,
-             config.memoryLimitBytes(), config.memoryReserveBytes(),
-             config.dockerCpuShares(), config.dockerCpuQuota(),
-             config.exposedPorts(), agentHealthPort,
-             "on-failure", 3,
-             resolvedRuntimeType, config.customArgs(), mainClass
-     );
-     String containerId = orchestrator.startContainer(request);
-     newContainerIds.add(containerId);
-     // Connect to additional networks after container is started
-     for (String net : additionalNets) {
-         if (networkManager != null) {
-             networkManager.connectContainer(containerId, net);
-         }
-     }
-     orchestrator.startLogCapture(containerId, instanceId, app.slug(), env.slug(), tenantId);
-     replicaStates.add(Map.of(
-             "index", i,
-             "containerId", containerId,
-             "containerName", containerName,
-             "status", "STARTING"
-     ));
- }
- pgDeployRepo.updateReplicaStates(deployment.id(), replicaStates);
- // === HEALTH CHECK ===
- updateStage(deployment.id(), DeployStage.HEALTH_CHECK);
- int healthyCount = waitForAnyHealthy(newContainerIds, healthCheckTimeout);
- if (healthyCount == 0) {
-     for (String cid : newContainerIds) {
-         try { orchestrator.stopContainer(cid); orchestrator.removeContainer(cid); }
-         catch (Exception e) { log.warn("Cleanup failed for {}: {}", cid, e.getMessage()); }
-     }
-     pgDeployRepo.updateDeployStage(deployment.id(), null);
-     deploymentService.markFailed(deployment.id(), "No replicas passed health check within " + healthCheckTimeout + "s");
-     serverMetrics.recordDeploymentOutcome("FAILED");
-     serverMetrics.recordDeploymentDuration(deployStart);
-     return;
- }
- replicaStates = updateReplicaHealth(replicaStates, newContainerIds);
- pgDeployRepo.updateReplicaStates(deployment.id(), replicaStates);
- // === SWAP TRAFFIC ===
- // Traffic is routed via Traefik Docker labels, so the "swap" happens
- // implicitly once the new replicas are healthy and the old containers
- // are gone. The old deployment was already stopped before START_REPLICAS
- // to free the deterministic container names.
- updateStage(deployment.id(), DeployStage.SWAP_TRAFFIC);
- // === COMPLETE ===
- updateStage(deployment.id(), DeployStage.COMPLETE);
- // Capture config snapshot before marking RUNNING
- ApplicationConfig agentConfig = applicationConfigRepository
-         .findByApplicationAndEnvironment(app.slug(), env.slug())
-         .orElse(null);
- List<String> snapshotSensitiveKeys = agentConfig != null ? agentConfig.getSensitiveKeys() : null;
- DeploymentConfigSnapshot snapshot = new DeploymentConfigSnapshot(
deployment.appVersionId(),
agentConfig,
app.containerConfig(),
snapshotSensitiveKeys
);
pgDeployRepo.saveDeployedConfigSnapshot(deployment.id(), snapshot);
String primaryContainerId = newContainerIds.get(0);
DeploymentStatus finalStatus = healthyCount == config.replicas()
? DeploymentStatus.RUNNING : DeploymentStatus.DEGRADED;
deploymentService.markRunning(deployment.id(), primaryContainerId);
if (finalStatus == DeploymentStatus.DEGRADED) {
deploymentRepository.updateStatus(deployment.id(), DeploymentStatus.DEGRADED,
primaryContainerId, null);
}
pgDeployRepo.updateDeployStage(deployment.id(), null);
serverMetrics.recordDeploymentOutcome(finalStatus.name());
serverMetrics.recordDeploymentDuration(deployStart);
log.info("Deployment {} is {} ({}/{} replicas healthy)",
deployment.id(), finalStatus, healthyCount, config.replicas());
} catch (Exception e) {
log.error("Deployment {} FAILED: {}", deployment.id(), e.getMessage(), e);
pgDeployRepo.updateDeployStage(deployment.id(), null);
@@ -305,6 +219,262 @@ public class DeploymentExecutor {
}
}
/**
* Blue/green strategy: start all N new replicas (coexisting with the old
* ones thanks to the gen-suffixed container names), wait for ALL healthy,
* then stop the previous deployment. Strict all-healthy — partial failure
* preserves the previous deployment untouched.
*/
private void deployBlueGreen(DeployCtx ctx) {
ResolvedContainerConfig config = ctx.config();
Deployment deployment = ctx.deployment();
// === START REPLICAS ===
updateStage(deployment.id(), DeployStage.START_REPLICAS);
List<Map<String, Object>> replicaStates = new ArrayList<>();
List<String> newContainerIds = new ArrayList<>();
for (int i = 0; i < config.replicas(); i++) {
Map<String, Object> state = new LinkedHashMap<>();
String containerId = startReplica(ctx, i, state);
newContainerIds.add(containerId);
replicaStates.add(state);
}
pgDeployRepo.updateReplicaStates(deployment.id(), replicaStates);
// === HEALTH CHECK ===
updateStage(deployment.id(), DeployStage.HEALTH_CHECK);
int healthyCount = waitForAllHealthy(newContainerIds, healthCheckTimeout);
if (healthyCount < config.replicas()) {
// Strict abort: tear down new replicas, leave the previous deployment untouched.
for (String cid : newContainerIds) {
try { orchestrator.stopContainer(cid); orchestrator.removeContainer(cid); }
catch (Exception e) { log.warn("Cleanup failed for {}: {}", cid, e.getMessage()); }
}
pgDeployRepo.updateDeployStage(deployment.id(), null);
String reason = String.format(
"blue-green: %d/%d replicas healthy within %ds; preserving previous deployment",
healthyCount, config.replicas(), healthCheckTimeout);
deploymentService.markFailed(deployment.id(), reason);
serverMetrics.recordDeploymentOutcome("FAILED");
serverMetrics.recordDeploymentDuration(ctx.deployStart());
return;
}
replicaStates = updateReplicaHealth(replicaStates, newContainerIds);
pgDeployRepo.updateReplicaStates(deployment.id(), replicaStates);
// === SWAP TRAFFIC ===
// All new replicas are healthy; Traefik labels are already attracting
// traffic to them. Stop the previous deployment now — the swap is
// implicit in the label-driven load balancer.
updateStage(deployment.id(), DeployStage.SWAP_TRAFFIC);
Optional<Deployment> previous = deploymentRepository.findActiveByAppIdAndEnvironmentIdExcluding(
deployment.appId(), deployment.environmentId(), deployment.id());
if (previous.isPresent()) {
log.info("blue-green: stopping previous deployment {} now that new replicas are healthy",
previous.get().id());
stopDeploymentContainers(previous.get());
deploymentService.markStopped(previous.get().id());
}
// === COMPLETE ===
updateStage(deployment.id(), DeployStage.COMPLETE);
persistSnapshotAndMarkRunning(ctx, newContainerIds.get(0));
log.info("Deployment {} is RUNNING (blue-green, {}/{} replicas healthy)",
deployment.id(), healthyCount, config.replicas());
}
/**
* Rolling strategy: replace replicas one at a time — start new[i], wait
* healthy, stop old[i]. On any replica's health failure, stop the
* in-flight new container, leave remaining old replicas serving, mark
* FAILED. Already-replaced old containers are not restored (can't unring
* that bell) — user redeploys to recover.
*
* Resource peak: replicas + 1 (briefly while a new replica warms up
* before its counterpart is stopped).
*/
private void deployRolling(DeployCtx ctx) {
ResolvedContainerConfig config = ctx.config();
Deployment deployment = ctx.deployment();
// Capture previous deployment's per-index container ids up front.
Optional<Deployment> previousOpt = deploymentRepository.findActiveByAppIdAndEnvironmentIdExcluding(
deployment.appId(), deployment.environmentId(), deployment.id());
Map<Integer, String> oldContainerByIndex = new LinkedHashMap<>();
if (previousOpt.isPresent() && previousOpt.get().replicaStates() != null) {
for (Map<String, Object> r : previousOpt.get().replicaStates()) {
Object idx = r.get("index");
Object cid = r.get("containerId");
if (idx instanceof Number n && cid instanceof String s) {
oldContainerByIndex.put(n.intValue(), s);
}
}
}
// === START REPLICAS ===
updateStage(deployment.id(), DeployStage.START_REPLICAS);
List<Map<String, Object>> replicaStates = new ArrayList<>();
List<String> newContainerIds = new ArrayList<>();
for (int i = 0; i < config.replicas(); i++) {
// Start new replica i (gen-suffixed name; coexists with old[i]).
Map<String, Object> state = new LinkedHashMap<>();
String newCid = startReplica(ctx, i, state);
newContainerIds.add(newCid);
replicaStates.add(state);
pgDeployRepo.updateReplicaStates(deployment.id(), replicaStates);
// === HEALTH CHECK (per-replica) ===
updateStage(deployment.id(), DeployStage.HEALTH_CHECK);
boolean healthy = waitForOneHealthy(newCid, healthCheckTimeout);
if (!healthy) {
// Abort: stop this in-flight new replica AND any new replicas
// started so far. Already-stopped old replicas stay stopped
// (rolling is not reversible). Remaining un-replaced old
// replicas keep serving traffic.
for (String cid : newContainerIds) {
try { orchestrator.stopContainer(cid); orchestrator.removeContainer(cid); }
catch (Exception e) { log.warn("Cleanup failed for {}: {}", cid, e.getMessage()); }
}
pgDeployRepo.updateDeployStage(deployment.id(), null);
String reason = String.format(
"rolling: replica %d failed to reach healthy within %ds; %d previous replicas still running",
i, healthCheckTimeout, oldContainerByIndex.size());
deploymentService.markFailed(deployment.id(), reason);
serverMetrics.recordDeploymentOutcome("FAILED");
serverMetrics.recordDeploymentDuration(ctx.deployStart());
return;
}
// Health check passed: update replica status to RUNNING, stop the
// corresponding old[i] if present, and continue with replica i+1.
replicaStates = updateReplicaHealth(replicaStates, newContainerIds);
pgDeployRepo.updateReplicaStates(deployment.id(), replicaStates);
String oldCid = oldContainerByIndex.remove(i);
if (oldCid != null) {
try {
orchestrator.stopContainer(oldCid);
orchestrator.removeContainer(oldCid);
log.info("rolling: replaced replica {} (old={}, new={})", i, oldCid, newCid);
} catch (Exception e) {
log.warn("rolling: failed to stop old replica {} ({}): {}", i, oldCid, e.getMessage());
}
}
}
// === SWAP TRAFFIC ===
// Any old replicas with indices >= new.replicas (e.g., when replica
// count shrank) are still running; sweep them now so the old
// deployment can be marked STOPPED.
updateStage(deployment.id(), DeployStage.SWAP_TRAFFIC);
for (Map.Entry<Integer, String> e : oldContainerByIndex.entrySet()) {
try {
orchestrator.stopContainer(e.getValue());
orchestrator.removeContainer(e.getValue());
log.info("rolling: stopped leftover old replica {} ({})", e.getKey(), e.getValue());
} catch (Exception ex) {
log.warn("rolling: failed to stop leftover old replica {}: {}", e.getKey(), ex.getMessage());
}
}
if (previousOpt.isPresent()) {
deploymentService.markStopped(previousOpt.get().id());
}
// === COMPLETE ===
updateStage(deployment.id(), DeployStage.COMPLETE);
persistSnapshotAndMarkRunning(ctx, newContainerIds.get(0));
log.info("Deployment {} is RUNNING (rolling, {}/{} replicas replaced)",
deployment.id(), config.replicas(), config.replicas());
}
/** Poll a single container until healthy or the timeout expires. Returns
* true on healthy, false on timeout or thread interrupt. */
private boolean waitForOneHealthy(String containerId, int timeoutSeconds) {
long deadline = System.currentTimeMillis() + (timeoutSeconds * 1000L);
while (System.currentTimeMillis() < deadline) {
ContainerStatus status = orchestrator.getContainerStatus(containerId);
if ("healthy".equals(status.state())) return true;
try { Thread.sleep(2000); } catch (InterruptedException e) {
Thread.currentThread().interrupt();
return false;
}
}
return false;
}
/** Start one replica container with the gen-suffixed name and return its
* container id. Fills `stateOut` with the replicaStates JSONB row. */
private String startReplica(DeployCtx ctx, int i, Map<String, Object> stateOut) {
Environment env = ctx.env();
App app = ctx.app();
ResolvedContainerConfig config = ctx.config();
String instanceId = env.slug() + "-" + app.slug() + "-" + i + "-" + ctx.generation();
String containerName = tenantId + "-" + instanceId;
Map<String, String> labels = TraefikLabelBuilder.build(
app.slug(), env.slug(), tenantId, config, i, ctx.generation());
labels.putAll(ctx.prometheusLabels());
Map<String, String> replicaEnvVars = new LinkedHashMap<>(ctx.baseEnvVars());
replicaEnvVars.put("CAMELEER_AGENT_INSTANCEID", instanceId);
String volumeName = jarDockerVolume != null && !jarDockerVolume.isBlank() ? jarDockerVolume : null;
ContainerRequest request = new ContainerRequest(
containerName, baseImage, ctx.jarPath(),
volumeName, jarStoragePath,
ctx.primaryNetwork(),
ctx.additionalNets(),
replicaEnvVars, labels,
config.memoryLimitBytes(), config.memoryReserveBytes(),
config.dockerCpuShares(), config.dockerCpuQuota(),
config.exposedPorts(), agentHealthPort,
"on-failure", 3,
ctx.resolvedRuntimeType(), config.customArgs(), ctx.mainClass()
);
String containerId = orchestrator.startContainer(request);
// Connect to additional networks after container is started
for (String net : ctx.additionalNets()) {
if (networkManager != null) {
networkManager.connectContainer(containerId, net);
}
}
orchestrator.startLogCapture(containerId, instanceId, app.slug(), env.slug(), tenantId);
stateOut.put("index", i);
stateOut.put("containerId", containerId);
stateOut.put("containerName", containerName);
stateOut.put("status", "STARTING");
return containerId;
}
/** Persist the deployment snapshot and mark the deployment RUNNING.
* Finalizes the deploy in a single place shared by all strategy paths. */
private void persistSnapshotAndMarkRunning(DeployCtx ctx, String primaryContainerId) {
Deployment deployment = ctx.deployment();
ApplicationConfig agentConfig = applicationConfigRepository
.findByApplicationAndEnvironment(ctx.app().slug(), ctx.env().slug())
.orElse(null);
List<String> snapshotSensitiveKeys = agentConfig != null ? agentConfig.getSensitiveKeys() : null;
DeploymentConfigSnapshot snapshot = new DeploymentConfigSnapshot(
deployment.appVersionId(),
agentConfig,
ctx.app().containerConfig(),
snapshotSensitiveKeys);
pgDeployRepo.saveDeployedConfigSnapshot(deployment.id(), snapshot);
deploymentService.markRunning(deployment.id(), primaryContainerId);
pgDeployRepo.updateDeployStage(deployment.id(), null);
serverMetrics.recordDeploymentOutcome("RUNNING");
serverMetrics.recordDeploymentDuration(ctx.deployStart());
}
public void stopDeployment(Deployment deployment) {
pgDeployRepo.updateTargetState(deployment.id(), "STOPPED");
deploymentRepository.updateStatus(deployment.id(), DeploymentStatus.STOPPING,
@@ -370,7 +540,10 @@ public class DeploymentExecutor {
return envVars;
}
- private int waitForAnyHealthy(List<String> containerIds, int timeoutSeconds) {
+ /** Poll until all containers are healthy or the timeout expires. Returns
+ * the healthy count at return time: containerIds.size() on full success,
+ * less if the timeout won. */
+ private int waitForAllHealthy(List<String> containerIds, int timeoutSeconds) {
long deadline = System.currentTimeMillis() + (timeoutSeconds * 1000L);
int lastHealthy = 0;
while (System.currentTimeMillis() < deadline) {
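
The hunk above cuts off before the body of waitForAllHealthy. As a rough illustration of the contract its Javadoc describes (return the healthy count, which equals the list size only on full success), a poll-until-deadline loop could look like the sketch below. This is not the executor's actual code: `HealthPollSketch`, the `statusLookup` function and the `pollMillis` parameter are hypothetical stand-ins for `orchestrator.getContainerStatus(...).state()` and the fixed sleep between polls.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only; not the DeploymentExecutor implementation. */
final class HealthPollSketch {

    /**
     * Polls until every id reports "healthy" or the deadline passes.
     * Returns the healthy count seen on the last poll.
     * statusLookup and pollMillis are assumptions made for this sketch.
     */
    static int waitForAllHealthy(List<String> ids,
                                 Function<String, String> statusLookup,
                                 long timeoutMillis,
                                 long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        int lastHealthy = 0;
        while (System.currentTimeMillis() < deadline) {
            int healthy = 0;
            for (String id : ids) {
                if ("healthy".equals(statusLookup.apply(id))) healthy++;
            }
            lastHealthy = healthy;
            if (healthy == ids.size()) return healthy; // full success, stop early
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return lastHealthy;
            }
        }
        return lastHealthy; // timeout won; caller compares against ids.size()
    }

    public static void main(String[] args) {
        Map<String, String> states = Map.of("new-0", "healthy", "new-1", "starting");
        int healthy = waitForAllHealthy(List.of("new-0", "new-1"), states::get, 200, 20);
        System.out.println(healthy + "/2 healthy"); // prints "1/2 healthy"
    }
}
```

A caller following the blue-green path would treat any result below `ids.size()` as a failed deploy, which is what the `healthyCount < config.replicas()` check in deployBlueGreen does.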

@@ -10,9 +10,13 @@ public final class TraefikLabelBuilder {
private TraefikLabelBuilder() {}
public static Map<String, String> build(String appSlug, String envSlug, String tenantId,
- ResolvedContainerConfig config, int replicaIndex) {
+ ResolvedContainerConfig config, int replicaIndex,
+ String generation) {
+ // Traefik router/service keys stay generation-agnostic so load balancing
+ // spans old + new replicas during a blue/green overlap. instance-id and
+ // the new generation label carry the per-deploy identity.
String svc = envSlug + "-" + appSlug;
- String instanceId = envSlug + "-" + appSlug + "-" + replicaIndex;
+ String instanceId = envSlug + "-" + appSlug + "-" + replicaIndex + "-" + generation;
Map<String, String> labels = new LinkedHashMap<>();
labels.put("traefik.enable", "true");
@@ -21,6 +25,7 @@ public final class TraefikLabelBuilder {
labels.put("cameleer.app", appSlug);
labels.put("cameleer.environment", envSlug);
labels.put("cameleer.replica", String.valueOf(replicaIndex));
+ labels.put("cameleer.generation", generation);
labels.put("cameleer.instance-id", instanceId);
labels.put("traefik.http.services." + svc + ".loadbalancer.server.port",
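
To make the naming scheme concrete, here is a small standalone sketch of how the identifiers relate, based only on the string concatenations visible in startReplica and TraefikLabelBuilder. `NamingSketch`, the slugs, and the generation values "g1"/"g2" are made up for illustration; the diff does not show what a real generation string looks like.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only; mirrors the concatenations shown in the diff. */
final class NamingSketch {

    /** Generation-agnostic key shared by old and new replicas. */
    static String serviceKey(String envSlug, String appSlug) {
        return envSlug + "-" + appSlug;
    }

    /** Per-deploy identity: env-app-replica-generation. */
    static String instanceId(String envSlug, String appSlug, int replicaIndex, String generation) {
        return envSlug + "-" + appSlug + "-" + replicaIndex + "-" + generation;
    }

    /** Container name is the tenant id prefixed to the instance id. */
    static String containerName(String tenantId, String instanceId) {
        return tenantId + "-" + instanceId;
    }

    /** The cameleer.* identity labels named in the diff. */
    static Map<String, String> identityLabels(String envSlug, String appSlug,
                                              int replicaIndex, String generation) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("cameleer.app", appSlug);
        labels.put("cameleer.environment", envSlug);
        labels.put("cameleer.replica", String.valueOf(replicaIndex));
        labels.put("cameleer.generation", generation);
        labels.put("cameleer.instance-id", instanceId(envSlug, appSlug, replicaIndex, generation));
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical values: tenant "acme", env "prod", app "orders".
        String oldName = containerName("acme", instanceId("prod", "orders", 0, "g1"));
        String newName = containerName("acme", instanceId("prod", "orders", 0, "g2"));
        System.out.println(oldName); // prints "acme-prod-orders-0-g1"
        System.out.println(newName); // prints "acme-prod-orders-0-g2"
        System.out.println(serviceKey("prod", "orders")); // prints "prod-orders"
    }
}
```

Because the two container names differ only in the generation suffix, replica 0 of the old and new deployments can exist at the same time, while both still register under the same service key.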

@@ -0,0 +1,190 @@
package com.cameleer.server.app.runtime;
import com.cameleer.server.app.AbstractPostgresIT;
import com.cameleer.server.app.TestSecurityHelper;
import com.cameleer.server.app.storage.PostgresDeploymentRepository;
import com.cameleer.server.core.runtime.ContainerStatus;
import com.cameleer.server.core.runtime.Deployment;
import com.cameleer.server.core.runtime.DeploymentStatus;
import com.cameleer.server.core.runtime.RuntimeOrchestrator;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.mock.mockito.MockBean;
import org.springframework.boot.test.web.client.TestRestTemplate;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.MediaType;
import org.springframework.test.context.TestPropertySource;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;
import java.util.UUID;
import java.util.concurrent.TimeUnit;
import static org.assertj.core.api.Assertions.assertThat;
import static org.awaitility.Awaitility.await;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.never;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;
/**
* Verifies the blue-green deployment strategy: start all new → health-check
* all → stop old. Strict all-healthy — partial failure preserves the previous
* deployment untouched.
*/
@TestPropertySource(properties = "cameleer.server.runtime.healthchecktimeout=2")
class BlueGreenStrategyIT extends AbstractPostgresIT {
@MockBean
RuntimeOrchestrator runtimeOrchestrator;
@Autowired private TestRestTemplate restTemplate;
@Autowired private ObjectMapper objectMapper;
@Autowired private TestSecurityHelper securityHelper;
@Autowired private PostgresDeploymentRepository deploymentRepository;
private String operatorJwt;
private String appSlug;
private String versionId;
@BeforeEach
void setUp() throws Exception {
operatorJwt = securityHelper.operatorToken();
jdbcTemplate.update("DELETE FROM deployments");
jdbcTemplate.update("DELETE FROM app_versions");
jdbcTemplate.update("DELETE FROM apps");
jdbcTemplate.update("DELETE FROM application_config WHERE environment = 'default'");
when(runtimeOrchestrator.isEnabled()).thenReturn(true);
appSlug = "bg-" + UUID.randomUUID().toString().substring(0, 8);
post("/api/v1/environments/default/apps", String.format("""
{"slug": "%s", "displayName": "BG App"}
""", appSlug), operatorJwt);
put("/api/v1/environments/default/apps/" + appSlug + "/container-config", """
{"runtimeType": "spring-boot", "appPort": 8081, "replicas": 2, "deploymentStrategy": "blue-green"}
""", operatorJwt);
versionId = uploadJar(appSlug, ("bg-jar-" + appSlug).getBytes());
}
@Test
void blueGreen_allHealthy_stopsOldAfterNew() throws Exception {
when(runtimeOrchestrator.startContainer(any()))
.thenReturn("old-0", "old-1", "new-0", "new-1");
ContainerStatus healthy = new ContainerStatus("healthy", true, 0, null);
when(runtimeOrchestrator.getContainerStatus("old-0")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("old-1")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("new-0")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("new-1")).thenReturn(healthy);
String firstDeployId = triggerDeploy();
awaitStatus(firstDeployId, DeploymentStatus.RUNNING);
String secondDeployId = triggerDeploy();
awaitStatus(secondDeployId, DeploymentStatus.RUNNING);
// Previous deployment was stopped once new was healthy
Deployment first = deploymentRepository.findById(UUID.fromString(firstDeployId)).orElseThrow();
assertThat(first.status()).isEqualTo(DeploymentStatus.STOPPED);
verify(runtimeOrchestrator).stopContainer("old-0");
verify(runtimeOrchestrator).stopContainer("old-1");
verify(runtimeOrchestrator, never()).stopContainer("new-0");
verify(runtimeOrchestrator, never()).stopContainer("new-1");
// New deployment has both new replicas recorded
Deployment second = deploymentRepository.findById(UUID.fromString(secondDeployId)).orElseThrow();
assertThat(second.replicaStates()).hasSize(2);
}
@Test
void blueGreen_partialHealthy_preservesOldAndMarksFailed() throws Exception {
when(runtimeOrchestrator.startContainer(any()))
.thenReturn("old-0", "old-1", "new-0", "new-1");
ContainerStatus healthy = new ContainerStatus("healthy", true, 0, null);
ContainerStatus starting = new ContainerStatus("starting", true, 0, null);
when(runtimeOrchestrator.getContainerStatus("old-0")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("old-1")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("new-0")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("new-1")).thenReturn(starting);
String firstDeployId = triggerDeploy();
awaitStatus(firstDeployId, DeploymentStatus.RUNNING);
String secondDeployId = triggerDeploy();
awaitStatus(secondDeployId, DeploymentStatus.FAILED);
Deployment second = deploymentRepository.findById(UUID.fromString(secondDeployId)).orElseThrow();
assertThat(second.errorMessage())
.contains("blue-green")
.contains("1/2");
// Previous deployment stays RUNNING — blue-green's safety promise.
Deployment first = deploymentRepository.findById(UUID.fromString(firstDeployId)).orElseThrow();
assertThat(first.status()).isEqualTo(DeploymentStatus.RUNNING);
verify(runtimeOrchestrator, never()).stopContainer("old-0");
verify(runtimeOrchestrator, never()).stopContainer("old-1");
// Cleanup ran on both new replicas.
verify(runtimeOrchestrator).stopContainer("new-0");
verify(runtimeOrchestrator).stopContainer("new-1");
}
// ---- helpers ----
private String triggerDeploy() throws Exception {
JsonNode deployResponse = post(
"/api/v1/environments/default/apps/" + appSlug + "/deployments",
String.format("{\"appVersionId\": \"%s\"}", versionId), operatorJwt);
return deployResponse.path("id").asText();
}
private void awaitStatus(String deployId, DeploymentStatus expected) {
await().atMost(30, TimeUnit.SECONDS)
.pollInterval(500, TimeUnit.MILLISECONDS)
.untilAsserted(() -> {
Deployment d = deploymentRepository.findById(UUID.fromString(deployId))
.orElseThrow(() -> new AssertionError("Deployment not found: " + deployId));
assertThat(d.status()).isEqualTo(expected);
});
}
private JsonNode post(String path, String json, String jwt) throws Exception {
HttpHeaders headers = securityHelper.authHeaders(jwt);
var response = restTemplate.exchange(path, HttpMethod.POST,
new HttpEntity<>(json, headers), String.class);
return objectMapper.readTree(response.getBody());
}
private void put(String path, String json, String jwt) {
HttpHeaders headers = securityHelper.authHeaders(jwt);
restTemplate.exchange(path, HttpMethod.PUT,
new HttpEntity<>(json, headers), String.class);
}
private String uploadJar(String appSlug, byte[] content) throws Exception {
ByteArrayResource resource = new ByteArrayResource(content) {
@Override public String getFilename() { return "app.jar"; }
};
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", resource);
HttpHeaders headers = new HttpHeaders();
headers.set("Authorization", "Bearer " + operatorJwt);
headers.set("X-Cameleer-Protocol-Version", "1");
headers.setContentType(MediaType.MULTIPART_FORM_DATA);
var response = restTemplate.exchange(
"/api/v1/environments/default/apps/" + appSlug + "/versions",
HttpMethod.POST, new HttpEntity<>(body, headers), String.class);
JsonNode versionNode = objectMapper.readTree(response.getBody());
return versionNode.path("id").asText();
}
}
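
The ordering these two tests assert can be summarised with a tiny in-memory model. The sketch below is only an approximation of the blue-green flow in deployBlueGreen: `BlueGreenSketch` and its string-based call log are hypothetical, and health is reduced to a predicate instead of polling with a timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative only; models call ordering, not real containers. */
final class BlueGreenSketch {

    /** Returns the ordered calls a blue-green deploy would make, plus the outcome. */
    static List<String> deploy(List<String> oldIds, List<String> newIds,
                               Predicate<String> isHealthy) {
        List<String> calls = new ArrayList<>();
        // Start every new replica first; old replicas keep serving.
        for (String id : newIds) calls.add("start " + id);

        // Strict all-healthy gate.
        boolean allHealthy = newIds.stream().allMatch(isHealthy);
        if (!allHealthy) {
            // Tear down the new replicas, leave the old deployment untouched.
            for (String id : newIds) calls.add("stop " + id);
            calls.add("FAILED");
            return calls;
        }
        // Only now stop the previous deployment.
        for (String id : oldIds) calls.add("stop " + id);
        calls.add("RUNNING");
        return calls;
    }

    public static void main(String[] args) {
        List<String> oldIds = List.of("old-0", "old-1");
        List<String> newIds = List.of("new-0", "new-1");
        // prints "[start new-0, start new-1, stop old-0, stop old-1, RUNNING]"
        System.out.println(deploy(oldIds, newIds, id -> true));
        // prints "[start new-0, start new-1, stop new-0, stop new-1, FAILED]"
        System.out.println(deploy(oldIds, newIds, id -> !id.equals("new-1")));
    }
}
```

The second run corresponds to blueGreen_partialHealthy_preservesOldAndMarksFailed: no "stop old-*" entry appears in the log when one new replica never becomes healthy.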

@@ -0,0 +1,194 @@
package com.cameleer.server.app.runtime;
import com.cameleer.server.app.AbstractPostgresIT;
import com.cameleer.server.app.TestSecurityHelper;
import com.cameleer.server.app.storage.PostgresDeploymentRepository;
import com.cameleer.server.core.runtime.ContainerStatus;
import com.cameleer.server.core.runtime.Deployment;
import com.cameleer.server.core.runtime.DeploymentStatus;
import com.cameleer.server.core.runtime.RuntimeOrchestrator;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.mockito.InOrder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.mock.mockito.MockBean;
import org.springframework.boot.test.web.client.TestRestTemplate;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.MediaType;
import org.springframework.test.context.TestPropertySource;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;
import java.util.UUID;
import java.util.concurrent.TimeUnit;
import static org.assertj.core.api.Assertions.assertThat;
import static org.awaitility.Awaitility.await;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.inOrder;
import static org.mockito.Mockito.never;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;
/**
* Verifies the rolling deployment strategy: per-replica start → health → stop
* old. Mid-rollout health failure preserves remaining un-replaced old replicas;
* already-stopped old replicas are not restored.
*/
@TestPropertySource(properties = "cameleer.server.runtime.healthchecktimeout=2")
class RollingStrategyIT extends AbstractPostgresIT {
@MockBean
RuntimeOrchestrator runtimeOrchestrator;
@Autowired private TestRestTemplate restTemplate;
@Autowired private ObjectMapper objectMapper;
@Autowired private TestSecurityHelper securityHelper;
@Autowired private PostgresDeploymentRepository deploymentRepository;
private String operatorJwt;
private String appSlug;
private String versionId;
@BeforeEach
void setUp() throws Exception {
operatorJwt = securityHelper.operatorToken();
jdbcTemplate.update("DELETE FROM deployments");
jdbcTemplate.update("DELETE FROM app_versions");
jdbcTemplate.update("DELETE FROM apps");
jdbcTemplate.update("DELETE FROM application_config WHERE environment = 'default'");
when(runtimeOrchestrator.isEnabled()).thenReturn(true);
appSlug = "roll-" + UUID.randomUUID().toString().substring(0, 8);
post("/api/v1/environments/default/apps", String.format("""
{"slug": "%s", "displayName": "Rolling App"}
""", appSlug), operatorJwt);
put("/api/v1/environments/default/apps/" + appSlug + "/container-config", """
{"runtimeType": "spring-boot", "appPort": 8081, "replicas": 2, "deploymentStrategy": "rolling"}
""", operatorJwt);
versionId = uploadJar(appSlug, ("roll-jar-" + appSlug).getBytes());
}
@Test
void rolling_allHealthy_replacesOneByOne() throws Exception {
when(runtimeOrchestrator.startContainer(any()))
.thenReturn("old-0", "old-1", "new-0", "new-1");
ContainerStatus healthy = new ContainerStatus("healthy", true, 0, null);
when(runtimeOrchestrator.getContainerStatus("old-0")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("old-1")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("new-0")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("new-1")).thenReturn(healthy);
String firstDeployId = triggerDeploy();
awaitStatus(firstDeployId, DeploymentStatus.RUNNING);
String secondDeployId = triggerDeploy();
awaitStatus(secondDeployId, DeploymentStatus.RUNNING);
// Rolling invariant: old-0 is stopped BEFORE old-1 (replicas replaced
// one at a time, not all at once). Checking stop order is sufficient —
// a blue-green path would have both stops adjacent at the end with no
// interleaved starts; rolling interleaves starts between stops.
InOrder inOrder = inOrder(runtimeOrchestrator);
inOrder.verify(runtimeOrchestrator).stopContainer("old-0");
inOrder.verify(runtimeOrchestrator).stopContainer("old-1");
// Total of 4 startContainer calls: 2 for first deploy, 2 for rolling.
verify(runtimeOrchestrator, times(4)).startContainer(any());
// New replicas were not stopped — they're the running ones now.
verify(runtimeOrchestrator, never()).stopContainer("new-0");
verify(runtimeOrchestrator, never()).stopContainer("new-1");
Deployment first = deploymentRepository.findById(UUID.fromString(firstDeployId)).orElseThrow();
assertThat(first.status()).isEqualTo(DeploymentStatus.STOPPED);
}
@Test
void rolling_failsMidRollout_preservesRemainingOld() throws Exception {
when(runtimeOrchestrator.startContainer(any()))
.thenReturn("old-0", "old-1", "new-0", "new-1");
ContainerStatus healthy = new ContainerStatus("healthy", true, 0, null);
ContainerStatus starting = new ContainerStatus("starting", true, 0, null);
when(runtimeOrchestrator.getContainerStatus("old-0")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("old-1")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("new-0")).thenReturn(healthy);
when(runtimeOrchestrator.getContainerStatus("new-1")).thenReturn(starting);
String firstDeployId = triggerDeploy();
awaitStatus(firstDeployId, DeploymentStatus.RUNNING);
String secondDeployId = triggerDeploy();
awaitStatus(secondDeployId, DeploymentStatus.FAILED);
        Deployment second = deploymentRepository.findById(UUID.fromString(secondDeployId)).orElseThrow();
        assertThat(second.errorMessage())
                .contains("rolling")
                .contains("replica 1");
        // old-0 was replaced before the failure; old-1 was never touched.
        verify(runtimeOrchestrator).stopContainer("old-0");
        verify(runtimeOrchestrator, never()).stopContainer("old-1");
        // Cleanup stops both new replicas started so far.
        verify(runtimeOrchestrator).stopContainer("new-0");
        verify(runtimeOrchestrator).stopContainer("new-1");
    }

    // ---- helpers (same pattern as BlueGreenStrategyIT) ----

    private String triggerDeploy() throws Exception {
        JsonNode deployResponse = post(
                "/api/v1/environments/default/apps/" + appSlug + "/deployments",
                String.format("{\"appVersionId\": \"%s\"}", versionId), operatorJwt);
        return deployResponse.path("id").asText();
    }

    private void awaitStatus(String deployId, DeploymentStatus expected) {
        await().atMost(30, TimeUnit.SECONDS)
                .pollInterval(500, TimeUnit.MILLISECONDS)
                .untilAsserted(() -> {
                    Deployment d = deploymentRepository.findById(UUID.fromString(deployId))
                            .orElseThrow(() -> new AssertionError("Deployment not found: " + deployId));
                    assertThat(d.status()).isEqualTo(expected);
                });
    }

    private JsonNode post(String path, String json, String jwt) throws Exception {
        HttpHeaders headers = securityHelper.authHeaders(jwt);
        var response = restTemplate.exchange(path, HttpMethod.POST,
                new HttpEntity<>(json, headers), String.class);
        return objectMapper.readTree(response.getBody());
    }

    private void put(String path, String json, String jwt) {
        HttpHeaders headers = securityHelper.authHeaders(jwt);
        restTemplate.exchange(path, HttpMethod.PUT,
                new HttpEntity<>(json, headers), String.class);
    }

    private String uploadJar(String appSlug, byte[] content) throws Exception {
        ByteArrayResource resource = new ByteArrayResource(content) {
            @Override public String getFilename() { return "app.jar"; }
        };
        MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
        body.add("file", resource);
        HttpHeaders headers = new HttpHeaders();
        headers.set("Authorization", "Bearer " + operatorJwt);
        headers.set("X-Cameleer-Protocol-Version", "1");
        headers.setContentType(MediaType.MULTIPART_FORM_DATA);
        var response = restTemplate.exchange(
                "/api/v1/environments/default/apps/" + appSlug + "/versions",
                HttpMethod.POST, new HttpEntity<>(body, headers), String.class);
        JsonNode versionNode = objectMapper.readTree(response.getBody());
        return versionNode.path("id").asText();
    }
}


@@ -0,0 +1,31 @@
package com.cameleer.server.core.runtime;

import java.util.Locale;

/**
 * Supported deployment strategies. Persisted as a kebab-case string on
 * ApplicationConfig / ResolvedContainerConfig; {@link #fromWire(String)} is
 * the only conversion entry point and falls back to {@link #BLUE_GREEN} for
 * unknown or null input so the executor never has to null-check.
 */
public enum DeploymentStrategy {
    BLUE_GREEN("blue-green"),
    ROLLING("rolling");

    private final String wire;

    DeploymentStrategy(String wire) {
        this.wire = wire;
    }

    public String toWire() {
        return wire;
    }

    public static DeploymentStrategy fromWire(String value) {
        if (value == null) return BLUE_GREEN;
        // Locale.ROOT keeps lowercasing locale-independent; with a Turkish default
        // locale, "ROLLING" would otherwise lowercase to a dotless-i form and miss.
        String normalized = value.trim().toLowerCase(Locale.ROOT);
        for (DeploymentStrategy s : values()) {
            if (s.wire.equals(normalized)) return s;
        }
        return BLUE_GREEN;
    }
}


@@ -0,0 +1,34 @@
package com.cameleer.server.core.runtime;

import org.junit.jupiter.api.Test;

import static org.assertj.core.api.Assertions.assertThat;

class DeploymentStrategyTest {

    @Test
    void fromWire_knownValues() {
        assertThat(DeploymentStrategy.fromWire("blue-green")).isEqualTo(DeploymentStrategy.BLUE_GREEN);
        assertThat(DeploymentStrategy.fromWire("rolling")).isEqualTo(DeploymentStrategy.ROLLING);
    }

    @Test
    void fromWire_caseInsensitiveAndTrims() {
        assertThat(DeploymentStrategy.fromWire("BLUE-GREEN")).isEqualTo(DeploymentStrategy.BLUE_GREEN);
        assertThat(DeploymentStrategy.fromWire(" Rolling ")).isEqualTo(DeploymentStrategy.ROLLING);
    }

    @Test
    void fromWire_unknownOrNullFallsBackToBlueGreen() {
        assertThat(DeploymentStrategy.fromWire(null)).isEqualTo(DeploymentStrategy.BLUE_GREEN);
        assertThat(DeploymentStrategy.fromWire("")).isEqualTo(DeploymentStrategy.BLUE_GREEN);
        assertThat(DeploymentStrategy.fromWire("canary")).isEqualTo(DeploymentStrategy.BLUE_GREEN);
    }

    @Test
    void toWire_roundTrips() {
        for (DeploymentStrategy s : DeploymentStrategy.values()) {
            assertThat(DeploymentStrategy.fromWire(s.toWire())).isEqualTo(s);
        }
    }
}


@@ -0,0 +1,225 @@
# Deployment Strategies (blue-green + rolling) — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make `deploymentStrategy` actually affect runtime behavior. Support **blue-green** (all-at-once, default) and **rolling** (per-replica) deployments with correct semantics. Unblock real blue/green by giving each deployment a unique container-name generation suffix so old + new replicas can coexist during the swap.

**Current state (interim fix landed in `f8dccaae`):** strategy field exists but executor doesn't branch on it; a destroy-then-start flow runs regardless. This plan replaces that interim behavior.

**Architecture:**
- Append an 8-char **`gen`** suffix (first 8 chars of `deployment.id`) to container name AND `CAMELEER_AGENT_INSTANCEID`. Unique per deployment; no new DB state.
- Add a `cameleer.generation` Docker label so Grafana/Prometheus can pin deploy boundaries without regex on instance-id.
- Branch `DeploymentExecutor.executeAsync` on strategy:
  - **blue-green**: start all N new → health-check all → stop all old. Strict all-healthy: partial = FAILED (old stays running).
  - **rolling**: per-replica loop: start new[i] → health-check → stop old[i] → next. Mid-rollout failure → stop failed new[i], leave remaining old[i..n] running, mark FAILED.
- Keep destroy-then-start as the fallback for unknown strategy values (safety net).

**Reference:** interim-fix commit `f8dccaae`; investigation summary in the session log.

---
## File Structure
### Backend (new / modified)
- **Create:** `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java` — enum `BLUE_GREEN, ROLLING`; `fromWire(String)` with blue-green fallback; `toWire()` → "blue-green" / "rolling".
- **Modify:** `cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/DeploymentExecutor.java`: add `gen` computation, strategy branching, per-strategy START_REPLICAS + HEALTH_CHECK + SWAP_TRAFFIC flows. Rewrite the body of `executeAsync` so stages 4-6 dispatch on strategy. Extract helper methods `deployBlueGreen` and `deployRolling` to keep each path readable.
- **Modify:** `cameleer-server-app/src/main/java/com/cameleer/server/app/runtime/TraefikLabelBuilder.java` — take `gen` argument; emit `cameleer.generation` label; `cameleer.instance-id` becomes `{envSlug}-{appSlug}-{replicaIndex}-{gen}`.
- **Modify:** `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentService.java`: `containerName` stored on the row stays `env.slug() + "-" + app.slug()` (unchanged; it is already just the group name for DB/operator visibility, and the real Docker name is computed in the executor).
- **Modify:** `cameleer-server-app/src/test/java/com/cameleer/server/app/controller/DeploymentControllerIT.java` — update the single assertion that pins `container_name` format if any (spotted at line ~112 in the investigation).
- **Create:** `cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/BlueGreenStrategyIT.java` — two tests: all-replicas-healthy path stops old after new, and partial-healthy aborts preserving old.
- **Create:** `cameleer-server-app/src/test/java/com/cameleer/server/app/runtime/RollingStrategyIT.java` — two tests: happy rolling 3→3 replacement, and fail-on-replica-1 preserves remaining old replicas.
### UI
- **Modify:** `ui/src/pages/AppsTab/AppDeploymentPage/ConfigTabs/ResourcesTab.tsx` — confirm the strategy dropdown offers "blue-green" and "rolling" with descriptive labels + a hint line.
- **Modify:** `ui/src/pages/AppsTab/AppDeploymentPage/DeploymentTab/StatusCard.tsx` — surface `deployment.deploymentStrategy` as a small text/badge near the version badge (read-only).
### Docs + rules
- **Modify:** `.claude/rules/docker-orchestration.md` — rewrite the "DeploymentExecutor Details" and "Blue/green strategy" sections to describe the new behavior and the `gen` suffix; retire the interim destroy-then-start note.
- **Modify:** `.claude/rules/app-classes.md` — update the `DeploymentExecutor` bullet under `runtime/`.
- **Modify:** `.claude/rules/core-classes.md` — note new `DeploymentStrategy` enum under `runtime/`.
---
## Phase 1 — Core: DeploymentStrategy enum + gen utility
### Task 1.1: DeploymentStrategy enum
**Files:** Create `cameleer-server-core/src/main/java/com/cameleer/server/core/runtime/DeploymentStrategy.java`.
- [ ] Create enum with two constants `BLUE_GREEN`, `ROLLING`.
- [ ] Add `toWire()` returning `"blue-green"` / `"rolling"`.
- [ ] Add `fromWire(String)` — case-insensitive match; unknown or null → `BLUE_GREEN` with no throw (safety fallback). Returns enum, never null.

**Verification:** unit test covering known + unknown + null inputs.

### Task 1.2: Generation suffix helper
- [ ] Decide location — inline static helper on `DeploymentExecutor` is fine (`private static String gen(UUID id) { return id.toString().substring(0,8); }`). No new file needed.
---
## Phase 2 — Executor: gen-suffixed naming + `cameleer.generation` label
This phase is purely the naming change; no strategy branching yet. After this phase, redeploy still uses the destroy-then-start interim, but containers carry the new names + label.
### Task 2.1: TraefikLabelBuilder — accept `gen`, emit generation label
**Files:** Modify `TraefikLabelBuilder.java`.
- [ ] Add `String gen` as a new arg on `build(...)`.
- [ ] Change `instanceId` construction: `envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen`.
- [ ] Add label `cameleer.generation = gen`.
- [ ] Leave the Traefik router/service label keys using `svc = envSlug + "-" + appSlug` (unchanged — routing is generation-agnostic so load balancing across old+new works automatically).
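
The label changes in Task 2.1 can be sketched as follows. This is a minimal illustration under stated assumptions: `GenerationLabelsSketch` and its `build` signature are hypothetical, and the real `TraefikLabelBuilder.build(...)` takes other arguments and also emits the Traefik router/service labels, which are omitted here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the Task 2.1 label logic; not the real TraefikLabelBuilder. */
final class GenerationLabelsSketch {

    static Map<String, String> build(String envSlug, String appSlug, int replicaIndex, String gen) {
        // The routing key stays generation-agnostic, so old and new replicas share one service.
        String svc = envSlug + "-" + appSlug;
        // The instance id is unique per replica and per deployment generation.
        String instanceId = svc + "-" + replicaIndex + "-" + gen;

        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("cameleer.instance-id", instanceId);
        labels.put("cameleer.generation", gen);
        return labels;
    }

    public static void main(String[] args) {
        // Example values are made up for illustration.
        build("prod", "orders", 1, "00759771")
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Because only the instance id and the generation label carry `gen`, two generations of the same app would still resolve to the same `svc` key during a swap.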
### Task 2.2: DeploymentExecutor — compute gen once, thread through
**Files:** Modify `DeploymentExecutor.executeAsync`.
- [ ] At the top of the try block (after `env`, `app`, `config` resolution), compute `String gen = gen(deployment.id());`.
- [ ] In the replica loop: `String instanceId = env.slug() + "-" + app.slug() + "-" + i + "-" + gen;` and `String containerName = tenantId + "-" + instanceId;`.
- [ ] Pass `gen` to `TraefikLabelBuilder.build(...)`.
- [ ] Set `CAMELEER_AGENT_INSTANCEID=instanceId` (already done, just verify the new value propagates).
- [ ] Leave `replicaStates[].containerName` stored as the new full name.
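
A small sketch of the naming scheme from Tasks 1.2 and 2.2, assuming the `gen` helper is a plain static method. The class name `GenNamingSketch` and the sample tenant, slugs, and UUID are hypothetical; only the string layout comes from the plan.

```java
import java.util.UUID;

/** Hypothetical illustration of the gen-suffixed names; not the real DeploymentExecutor. */
final class GenNamingSketch {

    /** First 8 characters of the deployment UUID, as proposed in Task 1.2. */
    static String gen(UUID deploymentId) {
        return deploymentId.toString().substring(0, 8);
    }

    /** {envSlug}-{appSlug}-{replicaIndex}-{gen} */
    static String instanceId(String envSlug, String appSlug, int replicaIndex, String gen) {
        return envSlug + "-" + appSlug + "-" + replicaIndex + "-" + gen;
    }

    /** {tenantId}-{instanceId}: the Docker container name. */
    static String containerName(String tenantId, String instanceId) {
        return tenantId + "-" + instanceId;
    }

    public static void main(String[] args) {
        // Sample values only; a real deployment id comes from the deployment row.
        String gen = gen(UUID.fromString("00759771-5a00-4000-8000-000000000000"));
        String instanceId = instanceId("prod", "orders", 0, gen);
        System.out.println(containerName("acme", instanceId));
    }
}
```

Two deployments of the same app differ only in the trailing `gen`, which is what lets old and new replicas coexist under distinct container names.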
### Task 2.3: Update the one brittle test
**Files:** Modify `DeploymentControllerIT.java`.
- [ ] Relax the container-name assertion to `startsWith("default-default-deploy-test-")` or similar — verify behavior, not exact suffix.

**Verification after Phase 2:**

- `mvn -pl cameleer-server-app -am test -Dtest=DeploymentSnapshotIT,DeploymentControllerIT,PostgresDeploymentRepositoryIT`
- All green; container names now include gen; redeploy still works via the interim destroy-then-start flow (which will be replaced in Phase 3).
---
## Phase 3 — Blue-green strategy (default)
### Task 3.1: Extract `deployBlueGreen(...)` helper
**Files:** Modify `DeploymentExecutor.java`.
- [ ] Move the current START_REPLICAS → HEALTH_CHECK → SWAP_TRAFFIC body into a new `private void deployBlueGreen(...)` method.
- [ ] Signature: take `deployment`, `app`, `env`, `config`, `resolvedRuntimeType`, `mainClass`, `gen`, `primaryNetwork`, `additionalNets`.
### Task 3.2: Reorder for proper blue-green
- [ ] Remove the pre-flight "stop previous" block added in `f8dccaae` (will be replaced by post-health swap).
- [ ] Order: start all new → wait all healthy → find previous active (via `findActiveByAppIdAndEnvironmentIdExcluding`) → stop old containers + mark old row STOPPED.
- [ ] Strict all-healthy: if `healthyCount < config.replicas()`, stop the new containers we just started, mark deployment FAILED with `"blue-green: %d/%d replicas healthy; preserving previous deployment"`. Do **not** touch the old deployment.
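
The ordering in Task 3.2 can be sketched against a tiny stand-in interface. Everything here is an assumption for illustration: the `Orchestrator` interface, its method names, and the list-based inputs are hypothetical and do not reflect the real `RuntimeOrchestrator` or repository calls; only the start, health-check, then swap order and the failure message come from the plan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical blue-green flow; a sketch, not the real deployBlueGreen. */
final class BlueGreenSketch {

    /** Assumed minimal orchestrator surface, invented for this example. */
    interface Orchestrator {
        String start(String containerName);      // returns a container id
        boolean awaitHealthy(String containerId);
        void stop(String containerId);
    }

    /** Returns an empty Optional on success, or the failure message. */
    static Optional<String> deploy(Orchestrator orch, List<String> newNames, List<String> oldIds) {
        // 1. Start all new replicas before touching the old ones.
        List<String> started = new ArrayList<>();
        for (String name : newNames) {
            started.add(orch.start(name));
        }
        // 2. Health-check every new replica.
        int healthy = 0;
        for (String id : started) {
            if (orch.awaitHealthy(id)) healthy++;
        }
        // 3. Strict all-healthy: on partial health, clean up new and leave old untouched.
        if (healthy < newNames.size()) {
            for (String id : started) orch.stop(id);
            return Optional.of(String.format(
                    "blue-green: %d/%d replicas healthy; preserving previous deployment",
                    healthy, newNames.size()));
        }
        // 4. Only now stop the previous deployment's containers.
        for (String id : oldIds) orch.stop(id);
        return Optional.empty();
    }
}
```

The key property is that no `stop` on an old container can happen before every new replica has reported healthy.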
### Task 3.3: Wire strategy dispatch
- [ ] At the point where `deployBlueGreen` is called, check `DeploymentStrategy.fromWire(config.deploymentStrategy())` and dispatch. For this phase, always call `deployBlueGreen`.
- [ ] `ROLLING` dispatches to `deployRolling(...)` implemented in Phase 4 (stub it to throw `UnsupportedOperationException` for now — will be replaced before this phase lands).
---
## Phase 4 — Rolling strategy
### Task 4.1: `deployRolling(...)` helper
**Files:** Modify `DeploymentExecutor.java`.
- [ ] Same signature as `deployBlueGreen`.
- [ ] Look up previous deployment once at entry via `findActiveByAppIdAndEnvironmentIdExcluding`. Capture its `replicaStates` into a map keyed by replica index.
- [ ] For `i` from 0 to `config.replicas() - 1`:
  - [ ] Start new replica `i` (with gen-suffixed name).
  - [ ] Wait for this single container to go healthy (per-replica `waitForOneHealthy(containerId, timeoutSeconds)`; reuse `healthCheckTimeout` per replica or introduce a smaller per-replica budget).
  - [ ] On success: stop the corresponding old replica `i` by `containerId` from the previous deployment's replicaStates (if present); log and continue.
  - [ ] On failure: stop + remove all new replicas started so far, mark deployment FAILED with `"rolling: replica %d failed to reach healthy; preserved %d previous replicas"`. Do **not** touch the already-replaced replicas from previous deployment (they're already stopped) or the not-yet-replaced ones (they keep serving).
- [ ] After the loop succeeds for all replicas, mark the previous deployment row STOPPED (its containers are all stopped).
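
The per-replica loop in Task 4.1 might look roughly like this. The `Orchestrator` interface and the list-based inputs are hypothetical simplifications (the plan keys old replicas by index via `replicaStates`), and the sketch assumes old and new replica counts match; it does not handle leftover old replicas on a scale-down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical rolling flow; a sketch, not the real deployRolling. */
final class RollingSketch {

    /** Assumed minimal orchestrator surface, invented for this example. */
    interface Orchestrator {
        String start(String containerName);      // returns a container id
        boolean awaitHealthy(String containerId);
        void stop(String containerId);
    }

    /** Returns an empty Optional on success, or the failure message. */
    static Optional<String> deploy(Orchestrator orch, List<String> newNames, List<String> oldIds) {
        List<String> started = new ArrayList<>();
        for (int i = 0; i < newNames.size(); i++) {
            String id = orch.start(newNames.get(i));
            started.add(id);
            if (!orch.awaitHealthy(id)) {
                // Clean up every new replica started so far, including the failed one.
                for (String s : started) orch.stop(s);
                // Old replicas i..n-1 were never touched and keep serving.
                int preserved = Math.max(0, oldIds.size() - i);
                return Optional.of(String.format(
                        "rolling: replica %d failed to reach healthy; preserved %d previous replicas",
                        i, preserved));
            }
            // New replica i is healthy, so the matching old replica can go.
            if (i < oldIds.size()) orch.stop(oldIds.get(i));
        }
        return Optional.empty();
    }
}
```

With three old replicas and a failure at index 1, this stops `old[0]`, then both new containers, and reports two preserved replicas, which lines up with the mid-rollout scenario described for `RollingStrategyIT`.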
### Task 4.2: Add `waitForOneHealthy`
- [ ] Variant of `waitForAnyHealthy` that polls a single container id. Returns boolean. Same sleep cadence.
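
A possible shape for the single-container poll in Task 4.2, assuming the health probe can be passed in as a predicate. The signature, the millisecond parameters, and the class name are hypothetical; the real method would query the orchestrator and reuse the existing sleep cadence of `waitForAnyHealthy`.

```java
import java.util.function.Predicate;

/** Hypothetical polling helper; a sketch, not the real waitForOneHealthy. */
final class WaitForOneHealthySketch {

    static boolean waitForOneHealthy(String containerId,
                                     Predicate<String> isHealthy,
                                     long timeoutMillis,
                                     long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (isHealthy.test(containerId)) return true;
            // Subtraction keeps the comparison safe against nanoTime overflow.
            if (deadline - System.nanoTime() <= 0) return false;
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and treat the wait as unsuccessful.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

Probing before checking the deadline means a container that is already healthy returns immediately, even with a zero timeout.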
### Task 4.3: Replace the Phase 3 stub
- [ ] `ROLLING` dispatch calls `deployRolling` instead of throwing.
---
## Phase 5 — Integration tests
Each IT extends `AbstractPostgresIT`, uses `@MockBean RuntimeOrchestrator`, and overrides `cameleer.server.runtime.healthchecktimeout=2` via `@TestPropertySource`.
### Task 5.1: BlueGreenStrategyIT
**Files:** Create `BlueGreenStrategyIT.java`.
- [ ] **Test 1 `blueGreen_allHealthy_stopsOldAfterNew`:** seed a previous RUNNING deployment (2 replicas). Trigger redeploy with `containerConfig.deploymentStrategy=blue-green` + replicas=2. Mock orchestrator: new containers return `healthy`. Await new deployment RUNNING. Assert: previous deployment has status STOPPED, its container IDs had `stopContainer`+`removeContainer` called; new deployment replicaStates contain the two new container IDs; `cameleer.generation` label on both new container requests.
- [ ] **Test 2 `blueGreen_partialHealthy_preservesOldAndMarksFailed`:** seed previous RUNNING (2 replicas). New deploy with replicas=2. Mock: container A healthy, container B starting forever. Await new deployment FAILED. Assert: previous deployment still RUNNING; its container IDs were **not** stopped; new deployment errorMessage contains "1/2 replicas healthy".
### Task 5.2: RollingStrategyIT
**Files:** Create `RollingStrategyIT.java`.
- [ ] **Test 1 `rolling_allHealthy_replacesOneByOne`:** seed previous RUNNING (3 replicas). New deploy with strategy=rolling, replicas=3. Mock: new containers all healthy. Use `ArgumentCaptor` on `startContainer` to observe start order. Assert: start[0] → stop[old0] → start[1] → stop[old1] → start[2] → stop[old2]; new deployment RUNNING with 3 replicaStates; old deployment STOPPED.
- [ ] **Test 2 `rolling_failsMidRollout_preservesRemainingOld`:** seed previous RUNNING (3 replicas). New deploy strategy=rolling. Mock: new[0] healthy, new[1] never healthy. Await FAILED. Assert: new[0] was stopped during cleanup; old[0] was stopped (replaced before the failure); old[1] + old[2] still RUNNING; new deployment errorMessage contains "replica 1".
---
## Phase 6 — UI strategy indicator
### Task 6.1: Strategy dropdown polish
**Files:** Modify `ResourcesTab.tsx`.
- [ ] Verify the `<select>` has options `blue-green` and `rolling`.
- [ ] Add a one-line description under the dropdown: "Blue-green: start all new, swap when healthy. Rolling: replace one replica at a time."
### Task 6.2: Strategy on StatusCard
**Files:** Modify `DeploymentTab/StatusCard.tsx`.
- [ ] Add a small subtle text line in the grid: `<span>Strategy</span><span>{deployment.deploymentStrategy}</span>` (read-only, mono text ok).
---
## Phase 7 — Docs + rules updates
### Task 7.1: Update `.claude/rules/docker-orchestration.md`
- [ ] Replace the "DeploymentExecutor Details" section with the new flow (gen suffix, strategy dispatch, per-strategy ordering).
- [ ] Update the "Deployment Status Model" table — `DEGRADED` now means "post-deploy replica crashed"; failed-during-deploy is always `FAILED`.
- [ ] Add a short "Deployment Strategies" section: behavior of blue-green vs rolling, resource peak, failure semantics.
### Task 7.2: Update `.claude/rules/app-classes.md`
- [ ] Under the `runtime/` `DeploymentExecutor` bullet: add "branches on `DeploymentStrategy.fromWire(config.deploymentStrategy())`. Container name format: `{tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{gen}` where gen = 8-char prefix of deployment UUID."
### Task 7.3: Update `.claude/rules/core-classes.md`
- [ ] Add under `runtime/`: `DeploymentStrategy` — enum BLUE_GREEN, ROLLING; `fromWire` falls back to BLUE_GREEN; note stored as kebab-case string on config.
---
## Rollout sequence
1. Phase 1 (enum + helper) — trivial, land as one commit.
2. Phase 2 (naming + generation label) — one commit; interim destroy-then-start still active; regenerates no OpenAPI (no controller change).
3. Phase 3 (blue-green as default) — one commit replacing the interim flow. This is where real behavior changes.
4. Phase 4 (rolling) — one commit.
5. Phase 5 (4 ITs) — one commit; run `mvn test` against affected modules.
6. Phase 6 (UI) — one commit; `npx tsc` clean.
7. Phase 7 (docs) — one commit.

Total: 7 commits, all atomic.

## Acceptance
- Existing `DeploymentSnapshotIT` still passes.
- New `BlueGreenStrategyIT` (2 tests) and `RollingStrategyIT` (2 tests) pass.
- Browser QA: redeploy with `deploymentStrategy=blue-green` vs `rolling` produces the expected container timeline (inspect via `docker ps`); Prometheus metrics show continuity across deploys when queried by `{cameleer_app, cameleer_environment}`; the `cameleer_generation` label flips per deploy.
- `.claude/rules/docker-orchestration.md` reflects the new behavior.
## Non-goals
- Automatic rollback on blue-green partial failure (old is left running; user redeploys).
- Automatic rollback on rolling mid-failure (remaining old replicas keep running; user redeploys).
- Per-replica `HEALTH_CHECK` stage label in the UI progress bar — the 7-stage progress is reused as-is; strategy dictates internal looping.
- Strategy field validation at container-config save time (executor's `fromWire` fallback absorbs unknown values — consider a follow-up for strict validation if it becomes an issue).


@@ -172,15 +172,22 @@ export function ResourcesTab({ value, onChange, disabled, isProd = false }: Prop
         />
         <span className={styles.configLabel}>Deploy Strategy</span>
-        <Select
-          disabled={disabled}
-          value={value.deployStrategy}
-          onChange={(e) => update('deployStrategy', e.target.value)}
-          options={[
-            { value: 'blue-green', label: 'Blue/Green' },
-            { value: 'rolling', label: 'Rolling' },
-          ]}
-        />
+        <div>
+          <Select
+            disabled={disabled}
+            value={value.deployStrategy}
+            onChange={(e) => update('deployStrategy', e.target.value)}
+            options={[
+              { value: 'blue-green', label: 'Blue/Green' },
+              { value: 'rolling', label: 'Rolling' },
+            ]}
+          />
+          <span className={styles.configHint}>
+            {value.deployStrategy === 'rolling'
+              ? 'Replace one replica at a time; peak = replicas + 1. Partial failure leaves remaining old replicas serving.'
+              : 'Start all new replicas, swap once all are healthy; peak = 2 × replicas. Partial failure preserves the previous deployment.'}
+          </span>
+        </div>
         <span className={styles.configLabel}>Strip Path Prefix</span>
         <div className={styles.configInline}>


@@ -35,6 +35,7 @@ export function StatusCard({ deployment, version, externalUrl }: Props) {
         {version && <><span>JAR</span><MonoText size="sm">{version.jarFilename}</MonoText></>}
         {version && <><span>Checksum</span><MonoText size="xs">{version.jarChecksum.substring(0, 12)}</MonoText></>}
         <span>Replicas</span><span>{running}/{total}</span>
+        <span>Strategy</span><span>{deployment.deploymentStrategy ?? '—'}</span>
         <span>URL</span>
         {deployment.status === 'RUNNING'
           ? <a href={externalUrl} target="_blank" rel="noreferrer"><MonoText size="sm">{externalUrl}</MonoText></a>