From da6bf694f8c8ca41901974cdeeeb1f00b2e83a61 Mon Sep 17 00:00:00 2001
From: hsiegeln <37154749+hsiegeln@users.noreply.github.com>
Date: Wed, 8 Apr 2026 19:48:34 +0200
Subject: [PATCH] docs: Docker container orchestration design spec

Covers: config merging (3-layer), Traefik label generation (path +
subdomain routing), network topology (infra/traefik/env isolation),
replica management, blue/green and rolling deployment strategies,
Docker event stream monitoring, deployment status state machine
(DEGRADED/STOPPING states), pre-flight checks, and UI changes.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../2026-04-08-docker-orchestration-design.md | 347 ++++++++++++++++++
 1 file changed, 347 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-08-docker-orchestration-design.md

diff --git a/docs/superpowers/specs/2026-04-08-docker-orchestration-design.md b/docs/superpowers/specs/2026-04-08-docker-orchestration-design.md
new file mode 100644
index 00000000..b219e9e6
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-08-docker-orchestration-design.md
@@ -0,0 +1,347 @@
+# Docker Container Orchestration Design
+
+## Goal
+
+Make the `DockerRuntimeOrchestrator` fully functional: apply container configs (memory, CPU, ports, env vars) when starting containers, generate correct Traefik routing labels, support replicas, implement blue/green and rolling deployment strategies, and monitor container health via Docker event stream.
+
+## Scope
+
+- Docker single-host only (Swarm and K8s are future `RuntimeOrchestrator` implementations)
+- Replicas managed by the orchestrator as independent containers
+- Traefik integration for path-based and subdomain-based routing
+- Docker event stream for infrastructure-level health monitoring
+- UI changes for new config fields, replica management, and deployment progress
+
+---
+
+## Network Topology
+
+Three network tiers with lazy creation:
+
+```
+cameleer-infra       — server, postgres, clickhouse (databases isolated)
+cameleer-traefik     — server, traefik, all app containers (ingress + agent SSE)
+cameleer-env-{slug}  — app containers within one environment (inter-app only)
+```
+
+- **Server** joins `cameleer-infra` + `cameleer-traefik`
+- **App containers** join `cameleer-traefik` + `cameleer-env-{envSlug}`
+- **Traefik** joins `cameleer-traefik` only
+- **Databases** join `cameleer-infra` only
+
+App containers reach the server for SSE/heartbeats via the `cameleer-traefik` network. They never touch databases directly.
+
+### Network Manager
+
+Wraps Docker network operations. `ensureNetwork(name)` creates a bridge network if it doesn't exist (idempotent). `connectContainer(containerId, networkName)` attaches a container to a second network. Called by `DeploymentExecutor` before container creation.
+
+---
+
+## Configuration Model
+
+### Three-layer merge
+
+```
+Global defaults (application.yml)
+  → Environment defaults (environments.default_container_config)
+    → App overrides (apps.container_config)
+```
+
+App overrides environment, environment overrides global. Missing keys fall through.
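
The fall-through rule can be illustrated with a minimal Java sketch. It assumes each layer is a plain `Map<String, Object>` in which an absent or `null` key means "not set". The class name `LayeredConfigSketch` and the use of raw maps are illustrative assumptions, not the `ConfigMerger` API from this spec.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: layers are modeled as raw maps, which is an assumption.
final class LayeredConfigSketch {

    /** Merges layers from lowest to highest precedence; null values never override. */
    @SafeVarargs
    static Map<String, Object> merge(Map<String, Object>... layersLowToHigh) {
        Map<String, Object> merged = new LinkedHashMap<>();
        for (Map<String, Object> layer : layersLowToHigh) {
            if (layer == null) {
                continue; // a missing layer contributes nothing
            }
            for (Map.Entry<String, Object> entry : layer.entrySet()) {
                if (entry.getValue() != null) {
                    merged.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> global = Map.of("memoryLimitMb", 512, "cpuShares", 512);
        Map<String, Object> env = Map.of("memoryLimitMb", 1024);
        Map<String, Object> app = Map.of("replicas", 2);
        // Order matters: global first, app last, so app wins on conflicts.
        System.out.println(merge(global, env, app));
    }
}
```

Passing the layers in precedence order keeps the function free of per-key logic: a key set only in the global layer survives untouched, while a key set in several layers takes the value from the last one.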
+
+### Environment-level settings (`defaultContainerConfig`)
+
+| Key | Type | Default | Description |
+|-----|------|---------|-------------|
+| `memoryLimitMb` | int | 512 | Default memory limit |
+| `memoryReserveMb` | int | null | Memory reservation |
+| `cpuShares` | int | 512 | CPU shares |
+| `cpuLimit` | float | null | CPU core limit |
+| `routingMode` | string | `"path"` | `path` or `subdomain` |
+| `routingDomain` | string | from global | Domain for URL generation |
+| `serverUrl` | string | from global | Server URL for agent callbacks |
+| `sslOffloading` | boolean | true | Traefik terminates TLS |
+
+### App-level settings (`containerConfig`)
+
+| Key | Type | Default | Description |
+|-----|------|---------|-------------|
+| `memoryLimitMb` | int | from env | Override memory limit |
+| `memoryReserveMb` | int | from env | Override memory reservation |
+| `cpuShares` | int | from env | Override CPU shares |
+| `cpuLimit` | float | from env | Override CPU core limit |
+| `appPort` | int | 8080 | Main HTTP port for Traefik |
+| `exposedPorts` | int[] | [] | Additional ports (debug, JMX) |
+| `customEnvVars` | map | {} | App-specific environment variables |
+| `stripPathPrefix` | boolean | true | Traefik strips `/{env}/{app}` prefix |
+| `sslOffloading` | boolean | from env | Override SSL offloading |
+| `replicas` | int | 1 | Number of container replicas |
+| `deploymentStrategy` | string | `"blue-green"` | `blue-green` or `rolling` |
+
+### ConfigMerger
+
+Pure function: `resolve(globalDefaults, envConfig, appConfig) → ResolvedContainerConfig`
+
+`ResolvedContainerConfig` is a typed Java record with all fields resolved to concrete values. No more scattered `@Value` fields in `DeploymentExecutor` for container-level settings.
+
+---
+
+## Traefik Label Generation
+
+### TraefikLabelBuilder
+
+Pure function: takes app slug, env slug, resolved config. Returns `Map<String, String>`.
+
+### Path-based routing (`routingMode: "path"`)
+
+Service name derived as `{envSlug}-{appSlug}`.
+
+```
+traefik.enable=true
+traefik.http.routers.{svc}.rule=PathPrefix(`/{envSlug}/{appSlug}/`)
+traefik.http.routers.{svc}.entrypoints=websecure
+traefik.http.services.{svc}.loadbalancer.server.port={appPort}
+managed-by=cameleer3-server
+cameleer.app={appSlug}
+cameleer.environment={envSlug}
+```
+
+If `stripPathPrefix` is true:
+```
+traefik.http.middlewares.{svc}-strip.stripprefix.prefixes=/{envSlug}/{appSlug}
+traefik.http.routers.{svc}.middlewares={svc}-strip
+```
+
+### Subdomain-based routing (`routingMode: "subdomain"`)
+
+```
+traefik.http.routers.{svc}.rule=Host(`{appSlug}-{envSlug}.{routingDomain}`)
+```
+
+No strip-prefix needed for subdomain routing.
+
+### SSL offloading
+
+If `sslOffloading` is true:
+```
+traefik.http.routers.{svc}.tls=true
+traefik.http.routers.{svc}.tls.certresolver=default
+```
+
+If false, Traefik passes through TLS to the container (requires the app to terminate TLS itself).
+
+### Replicas
+
+All replicas of the same app get identical Traefik labels. Traefik automatically load-balances across containers with the same service name on the same network.
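
The label rules above could be combined roughly as in the following Java sketch. The class name `TraefikLabelSketch` and the flat parameter list are hypothetical (the spec's builder takes a resolved config object), and the sketch assumes that subdomain routing keeps the shared `enable`, `entrypoints`, port, and ownership labels and only swaps the router rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the label rules; not the spec's TraefikLabelBuilder.
final class TraefikLabelSketch {

    static Map<String, String> build(String envSlug, String appSlug, String routingMode,
                                     String routingDomain, int appPort,
                                     boolean stripPathPrefix, boolean sslOffloading) {
        String svc = envSlug + "-" + appSlug;
        String router = "traefik.http.routers." + svc;

        Map<String, String> labels = new LinkedHashMap<>();
        // Labels assumed to be common to both routing modes.
        labels.put("traefik.enable", "true");
        labels.put(router + ".entrypoints", "websecure");
        labels.put("traefik.http.services." + svc + ".loadbalancer.server.port",
                String.valueOf(appPort));
        labels.put("managed-by", "cameleer3-server");
        labels.put("cameleer.app", appSlug);
        labels.put("cameleer.environment", envSlug);

        if ("subdomain".equals(routingMode)) {
            // Host rule; no strip-prefix middleware is needed here.
            labels.put(router + ".rule",
                    "Host(`" + appSlug + "-" + envSlug + "." + routingDomain + "`)");
        } else {
            String prefix = "/" + envSlug + "/" + appSlug;
            labels.put(router + ".rule", "PathPrefix(`" + prefix + "/`)");
            if (stripPathPrefix) {
                labels.put("traefik.http.middlewares." + svc + "-strip.stripprefix.prefixes",
                        prefix);
                labels.put(router + ".middlewares", svc + "-strip");
            }
        }

        if (sslOffloading) {
            labels.put(router + ".tls", "true");
            labels.put(router + ".tls.certresolver", "default");
        }
        return labels;
    }
}
```

Because the function depends only on its arguments, every replica of an app receives the same map, which matches the requirement that replicas carry identical labels.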
+
+---
+
+## Deployment Status Model
+
+### New fields on `deployments` table
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `target_state` | varchar | `RUNNING` or `STOPPED` |
+| `deployment_strategy` | varchar | `BLUE_GREEN` or `ROLLING` |
+| `replica_states` | jsonb | Array of `{index, containerId, status}` |
+| `deploy_stage` | varchar | Current stage for progress tracking (null when stable) |
+
+### Status values
+
+`STOPPED`, `STARTING`, `RUNNING`, `DEGRADED`, `STOPPING`, `FAILED`
+
+### State transitions
+
+```
+Target: RUNNING
+  STOPPED → STARTING → RUNNING (all replicas healthy)
+                     → DEGRADED (some replicas healthy, some dead)
+                     → FAILED (zero healthy / pre-flight failed)
+  RUNNING → DEGRADED (replica dies)
+  DEGRADED → RUNNING (replica recovers via restart policy)
+  DEGRADED → FAILED (all dead, retries exhausted)
+
+Target: STOPPED
+  RUNNING/DEGRADED → STOPPING → STOPPED (all replicas stopped and removed)
+                              → FAILED (couldn't stop some replicas)
+```
+
+### Aggregate derivation
+
+Deployment status is derived from replica states:
+- All replicas `RUNNING` → deployment `RUNNING`
+- At least one `RUNNING`, some `DEAD` → deployment `DEGRADED`
+- Zero `RUNNING` after retries → deployment `FAILED`
+- All stopped → deployment `STOPPED`
+
+---
+
+## Deployment Flow
+
+### Deploy stages (tracked in `deploy_stage` for progress UI)
+
+1. `PRE_FLIGHT` — Validate config, check JAR exists, verify base image
+2. `PULL_IMAGE` — Pull base image if not present locally
+3. `CREATE_NETWORK` — Ensure traefik and environment networks exist
+4. `START_REPLICAS` — Create and start N containers
+5. `HEALTH_CHECK` — Wait for at least one replica to pass health check
+6. `SWAP_TRAFFIC` — (blue/green) Stop old deployment replicas
+7. `COMPLETE` — Mark deployment RUNNING/DEGRADED
+
+### Pre-flight checks
+
+Before touching any running containers:
+1. JAR file exists on disk at the path stored in `app_versions`
+2. Base image available (pull if missing)
+3. Resolved config is valid (memory > 0, appPort > 0, replicas >= 1)
+4. No naming conflict with containers from other apps
+
+If any check fails → deployment marked `FAILED` immediately, existing deployment untouched.
+
+### Blue/green strategy
+
+1. Start all new replicas alongside the old deployment
+2. Wait for health checks on new replicas
+3. If healthy: stop and remove all old replicas, mark new deployment RUNNING
+4. If unhealthy: remove new replicas, mark new deployment FAILED, old deployment stays
+
+Temporarily uses 2x resources during the swap window.
+
+### Rolling strategy
+
+1. For each replica index (0..N-1):
+   a. Stop old replica at index i (if exists)
+   b. Start new replica at index i
+   c. Wait for health check
+   d. If unhealthy: stop, mark deployment FAILED, leave remaining old replicas
+2. After all replicas replaced, mark deployment RUNNING
+
+Lower peak resources but slower and more complex.
+
+### Container naming
+
+`{envSlug}-{appSlug}-{replicaIndex}` (e.g., `staging-payment-gateway-0`)
+
+### Container restart policy
+
+`on-failure` with max 3 retries. Docker handles transient failures. After 3 retries exhausted, the container stays dead and `DockerEventMonitor` detects the permanent failure.
+
+### Environment variables injected
+
+Base env vars (always set):
+```
+CAMELEER_EXPORT_TYPE=HTTP
+CAMELEER_APPLICATION_ID={appSlug}
+CAMELEER_ENVIRONMENT_ID={envSlug}
+CAMELEER_DISPLAY_NAME={containerName}
+CAMELEER_SERVER_URL={resolvedServerUrl}
+CAMELEER_AUTH_TOKEN={bootstrapToken}
+```
+
+Plus all entries from `customEnvVars` in the resolved config.
+
+---
+
+## Docker Event Monitor
+
+### DockerEventMonitor
+
+`@Component` that starts a persistent Docker event stream on `@PostConstruct`.
+
+- Filters for containers with label `managed-by=cameleer3-server`
+- Listens for events: `die`, `oom`, `stop`, `start`
+- On `die`/`oom`: looks up deployment by container ID, updates replica status to `DEAD`, recomputes deployment status (RUNNING → DEGRADED → FAILED)
+- On `start`: updates replica status to `RUNNING` (handles Docker restart policy recoveries)
+- Reconnects automatically if the stream drops
+
+### Interaction with agent heartbeats
+
+- Agent heartbeats: app-level health (is the Camel context running, are routes active)
+- Docker events: infrastructure-level health (is the container alive, OOM, crash)
+- Both feed into the same deployment status. Docker events are faster for container crashes. Agent heartbeats catch app-level hangs where the container is alive but the app is stuck.
+
+---
+
+## Database Migration
+
+`V7__deployment_orchestration.sql`:
+
+```sql
+-- New status values and fields for deployments
+ALTER TABLE deployments ADD COLUMN target_state VARCHAR(20) NOT NULL DEFAULT 'RUNNING';
+ALTER TABLE deployments ADD COLUMN deployment_strategy VARCHAR(20) NOT NULL DEFAULT 'BLUE_GREEN';
+ALTER TABLE deployments ADD COLUMN replica_states JSONB NOT NULL DEFAULT '[]';
+ALTER TABLE deployments ADD COLUMN deploy_stage VARCHAR(30);
+
+-- Backfill existing deployments
+UPDATE deployments SET target_state = CASE
+    WHEN status = 'STOPPED' THEN 'STOPPED'
+    ELSE 'RUNNING'
+END;
+```
+
+The `status` column remains but gains two new values: `DEGRADED` and `STOPPING`. The `DeploymentStatus` enum is updated to match.
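
As a rough illustration of the updated enum and of the aggregate derivation rules from the Deployment Status Model section, consider the Java sketch below. The `ReplicaStatus` enum, the class name `DeploymentStatusSketch`, and the handling of two cases the spec leaves open (an empty replica list maps to `STOPPED`, and a mix of `RUNNING` and `STOPPED` replicas maps to `DEGRADED`) are assumptions made for the example.

```java
import java.util.List;

// Hypothetical sketch; only the DeploymentStatus values come from the spec.
final class DeploymentStatusSketch {

    // Assumed replica-level states: the spec names RUNNING and DEAD and refers to "stopped".
    enum ReplicaStatus { RUNNING, DEAD, STOPPED }

    enum DeploymentStatus { STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED }

    /** Derives the stable aggregate status from per-replica states. */
    static DeploymentStatus derive(List<ReplicaStatus> replicas) {
        long running = replicas.stream()
                .filter(status -> status == ReplicaStatus.RUNNING)
                .count();
        boolean allStopped = replicas.stream()
                .allMatch(status -> status == ReplicaStatus.STOPPED);

        if (allStopped) {
            return DeploymentStatus.STOPPED;   // also covers an empty list (assumption)
        }
        if (running == replicas.size()) {
            return DeploymentStatus.RUNNING;   // every replica healthy
        }
        if (running > 0) {
            return DeploymentStatus.DEGRADED;  // some healthy, some not
        }
        return DeploymentStatus.FAILED;        // zero healthy replicas
    }
}
```

The transitional values `STARTING` and `STOPPING` are never produced here, since the spec drives them from the deploy flow and `target_state` rather than from replica states.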
+
+---
+
+## UI Changes
+
+### Deployments tab — Overview
+
+- **Replicas column** in deployments table: shows `{healthy}/{total}` (e.g., `2/3`)
+- **Status badge** updated for new states: `DEGRADED` (warning color), `STOPPING` (auto color)
+- **Deployment progress** shown when `deploy_stage` is not null — horizontal step indicator:
+  ```
+  ●━━━━●━━━━●━━━━○━━━━○━━━━○
+  Pre- Pull Start Health Swap
+  flight    reps  check  traffic
+  ```
+  Completed steps filled, current step highlighted, failed step red.
+
+### Create App page — Resources tab
+
+- `appPort` — number input (default 8080)
+- `replicas` — number input (default 1)
+- `deploymentStrategy` — select: Blue/Green, Rolling (default Blue/Green)
+- `stripPathPrefix` — toggle (default true)
+- `sslOffloading` — toggle (default true)
+
+### Config tab — Resources sub-tab (app detail)
+
+Same fields as create page, plus visible in read-only mode when not editing.
+
+### Environment admin page
+
+- `routingMode` — select: Path-based, Subdomain (default Path-based)
+- `routingDomain` — text input
+- `serverUrl` — text input with placeholder showing global default
+- `sslOffloading` — toggle (default true)
+
+---
+
+## New/Modified Components Summary
+
+### Core module (cameleer3-server-core)
+
+- `ResolvedContainerConfig` — new record with all typed fields
+- `ConfigMerger` — pure function, three-layer merge
+- `ContainerRequest` — add `cpuLimit`, `exposedPorts`, `restartPolicy`, `additionalNetworks`
+- `DeploymentStatus` — add `DEGRADED`, `STOPPING`
+- `Deployment` — add `targetState`, `deploymentStrategy`, `replicaStates`, `deployStage`
+
+### App module (cameleer3-server-app)
+
+- `DockerRuntimeOrchestrator` — apply full config (memory reserve, CPU limit, exposed ports, restart policy)
+- `DockerNetworkManager` — new component, lazy network creation + container attachment
+- `DockerEventMonitor` — new component, persistent event stream listener
+- `TraefikLabelBuilder` — new utility, generates full Traefik label set
+- `DeploymentExecutor` — rewrite deploy flow with stages, pre-flight, strategy dispatch
+- `V7__deployment_orchestration.sql` — migration for new columns
+
+### UI
+
+- `AppsTab.tsx` — new fields in create page and config tabs
+- `EnvironmentsPage.tsx` — routing and SSL fields
+- `DeploymentProgress` component — step indicator for deploy stages
+- Status badges updated for DEGRADED/STOPPING