docs: Docker container orchestration design spec
Covers: config merging (3-layer), Traefik label generation (path + subdomain routing), network topology (infra/traefik/env isolation), replica management, blue/green and rolling deployment strategies, Docker event stream monitoring, deployment status state machine (DEGRADED/STOPPING states), pre-flight checks, and UI changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
347
docs/superpowers/specs/2026-04-08-docker-orchestration-design.md
Normal file
@@ -0,0 +1,347 @@
# Docker Container Orchestration Design

## Goal

Make the `DockerRuntimeOrchestrator` fully functional: apply container configs (memory, CPU, ports, env vars) when starting containers, generate correct Traefik routing labels, support replicas, implement blue/green and rolling deployment strategies, and monitor container health via Docker event stream.

## Scope

- Docker single-host only (Swarm and K8s are future `RuntimeOrchestrator` implementations)
- Replicas managed by the orchestrator as independent containers
- Traefik integration for path-based and subdomain-based routing
- Docker event stream for infrastructure-level health monitoring
- UI changes for new config fields, replica management, and deployment progress

---

## Network Topology

Three network tiers with lazy creation:

```
cameleer-infra       — server, postgres, clickhouse (databases isolated)
cameleer-traefik     — server, traefik, all app containers (ingress + agent SSE)
cameleer-env-{slug}  — app containers within one environment (inter-app only)
```

- **Server** joins `cameleer-infra` + `cameleer-traefik`
- **App containers** join `cameleer-traefik` + `cameleer-env-{envSlug}`
- **Traefik** joins `cameleer-traefik` only
- **Databases** join `cameleer-infra` only

App containers reach the server for SSE/heartbeats via the `cameleer-traefik` network. They never touch databases directly.

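The membership rules above can be expressed as a small lookup. This is only a sketch: `NetworkTopology`, `Role`, and `networksFor` are hypothetical names that do not appear in the spec, and the real component may organise this differently.

```java
import java.util.List;

/** Hypothetical helper: lists the networks each participant joins, per the topology above. */
final class NetworkTopology {
    static final String INFRA_NET = "cameleer-infra";
    static final String TRAEFIK_NET = "cameleer-traefik";

    enum Role { SERVER, APP, TRAEFIK, DATABASE }

    /** Per-environment network name, e.g. cameleer-env-staging. */
    static String envNetwork(String envSlug) {
        return "cameleer-env-" + envSlug;
    }

    /** Networks for the given role; envSlug is only used for APP containers. */
    static List<String> networksFor(Role role, String envSlug) {
        return switch (role) {
            case SERVER -> List.of(INFRA_NET, TRAEFIK_NET);
            case APP -> List.of(TRAEFIK_NET, envNetwork(envSlug));
            case TRAEFIK -> List.of(TRAEFIK_NET);
            case DATABASE -> List.of(INFRA_NET);
        };
    }
}
```

Because an app container's list never contains `cameleer-infra`, the database isolation follows directly from the membership table.
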
### Network Manager

Wraps Docker network operations. `ensureNetwork(name)` creates a bridge network if it doesn't exist (idempotent). `connectContainer(containerId, networkName)` attaches a container to a second network. Called by `DeploymentExecutor` before container creation.

---

## Configuration Model

### Three-layer merge

```
Global defaults (application.yml)
  → Environment defaults (environments.default_container_config)
    → App overrides (apps.container_config)
```

App settings override environment settings, which override global defaults. A key missing at one layer falls through to the layer below it.

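A minimal sketch of the fall-through rule, using plain maps rather than the typed record the spec describes. `ConfigMergeSketch` and `merge` are hypothetical names, and treating a `null` value as "not set" is an assumption, not something the spec states.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the three-layer merge on untyped maps. */
final class ConfigMergeSketch {
    /**
     * Layers are passed lowest priority first (global, environment, app).
     * Later layers win; a key that is absent (or null, by assumption) in a
     * layer keeps the value from the layer below.
     */
    @SafeVarargs
    static Map<String, Object> merge(Map<String, Object>... layers) {
        Map<String, Object> out = new HashMap<>();
        for (Map<String, Object> layer : layers) {
            if (layer == null) continue; // a missing layer contributes nothing
            layer.forEach((key, value) -> {
                if (value != null) out.put(key, value);
            });
        }
        return out;
    }
}
```

For example, merging global `{memoryLimitMb: 512, cpuShares: 512}`, environment `{memoryLimitMb: 1024}`, and app `{replicas: 2}` yields `memoryLimitMb=1024`, `cpuShares=512`, `replicas=2`.
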
### Environment-level settings (`defaultContainerConfig`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `memoryLimitMb` | int | 512 | Default memory limit |
| `memoryReserveMb` | int | null | Memory reservation |
| `cpuShares` | int | 512 | CPU shares |
| `cpuLimit` | float | null | CPU core limit |
| `routingMode` | string | `"path"` | `path` or `subdomain` |
| `routingDomain` | string | from global | Domain for URL generation |
| `serverUrl` | string | from global | Server URL for agent callbacks |
| `sslOffloading` | boolean | true | Traefik terminates TLS |

### App-level settings (`containerConfig`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `memoryLimitMb` | int | from env | Override memory limit |
| `memoryReserveMb` | int | from env | Override memory reservation |
| `cpuShares` | int | from env | Override CPU shares |
| `cpuLimit` | float | from env | Override CPU core limit |
| `appPort` | int | 8080 | Main HTTP port for Traefik |
| `exposedPorts` | int[] | [] | Additional ports (debug, JMX) |
| `customEnvVars` | map | {} | App-specific environment variables |
| `stripPathPrefix` | boolean | true | Traefik strips `/{env}/{app}` prefix |
| `sslOffloading` | boolean | from env | Override SSL offloading |
| `replicas` | int | 1 | Number of container replicas |
| `deploymentStrategy` | string | `"blue-green"` | `blue-green` or `rolling` |

### ConfigMerger

Pure function: `resolve(globalDefaults, envConfig, appConfig) → ResolvedContainerConfig`

`ResolvedContainerConfig` is a typed Java record with all fields resolved to concrete values. It replaces the scattered `@Value` fields in `DeploymentExecutor` for container-level settings.

---

## Traefik Label Generation

### TraefikLabelBuilder

Pure function: takes app slug, env slug, resolved config. Returns `Map<String, String>`.

### Path-based routing (`routingMode: "path"`)

Service name derived as `{envSlug}-{appSlug}`.

```
traefik.enable=true
traefik.http.routers.{svc}.rule=PathPrefix(`/{envSlug}/{appSlug}/`)
traefik.http.routers.{svc}.entrypoints=websecure
traefik.http.services.{svc}.loadbalancer.server.port={appPort}
managed-by=cameleer3-server
cameleer.app={appSlug}
cameleer.environment={envSlug}
```

If `stripPathPrefix` is true:
```
traefik.http.middlewares.{svc}-strip.stripprefix.prefixes=/{envSlug}/{appSlug}
traefik.http.routers.{svc}.middlewares={svc}-strip
```

### Subdomain-based routing (`routingMode: "subdomain"`)

```
traefik.http.routers.{svc}.rule=Host(`{appSlug}-{envSlug}.{routingDomain}`)
```

No strip-prefix needed for subdomain routing.

### SSL offloading

If `sslOffloading` is true:
```
traefik.http.routers.{svc}.tls=true
traefik.http.routers.{svc}.tls.certresolver=default
```

If false, Traefik passes through TLS to the container (requires the app to terminate TLS itself).

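The label rules above can be combined into one pure function, sketched below. `TraefikLabelSketch` and its flat parameter list are hypothetical (the spec passes a resolved config object instead), and it assumes the common labels (`traefik.enable`, entrypoints, service port, ownership labels) are emitted in both routing modes, which the spec implies but does not state for subdomain routing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the label rules; not the actual TraefikLabelBuilder. */
final class TraefikLabelSketch {
    static Map<String, String> build(String envSlug, String appSlug, String routingMode,
                                     String routingDomain, int appPort,
                                     boolean stripPathPrefix, boolean sslOffloading) {
        String svc = envSlug + "-" + appSlug;            // service name: {envSlug}-{appSlug}
        String router = "traefik.http.routers." + svc;
        Map<String, String> labels = new LinkedHashMap<>();

        // Labels assumed common to both routing modes.
        labels.put("traefik.enable", "true");
        labels.put(router + ".entrypoints", "websecure");
        labels.put("traefik.http.services." + svc + ".loadbalancer.server.port",
                String.valueOf(appPort));
        labels.put("managed-by", "cameleer3-server");
        labels.put("cameleer.app", appSlug);
        labels.put("cameleer.environment", envSlug);

        if ("subdomain".equals(routingMode)) {
            // Subdomain routing: Host rule, no strip-prefix middleware.
            labels.put(router + ".rule",
                    "Host(`" + appSlug + "-" + envSlug + "." + routingDomain + "`)");
        } else {
            // Path routing: PathPrefix rule, optional strip-prefix middleware.
            String prefix = "/" + envSlug + "/" + appSlug;
            labels.put(router + ".rule", "PathPrefix(`" + prefix + "/`)");
            if (stripPathPrefix) {
                labels.put("traefik.http.middlewares." + svc + "-strip.stripprefix.prefixes", prefix);
                labels.put(router + ".middlewares", svc + "-strip");
            }
        }

        if (sslOffloading) {
            labels.put(router + ".tls", "true");
            labels.put(router + ".tls.certresolver", "default");
        }
        return labels;
    }
}
```

Keeping the builder free of Docker or Spring dependencies means the full label set can be checked in a plain unit test for each routing mode.
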
### Replicas

All replicas of the same app get identical Traefik labels. Traefik automatically load-balances across containers with the same service name on the same network.

---

## Deployment Status Model

### New fields on `deployments` table

| Column | Type | Description |
|--------|------|-------------|
| `target_state` | varchar | `RUNNING` or `STOPPED` |
| `deployment_strategy` | varchar | `BLUE_GREEN` or `ROLLING` |
| `replica_states` | jsonb | Array of `{index, containerId, status}` |
| `deploy_stage` | varchar | Current stage for progress tracking (null when stable) |

### Status values

`STOPPED`, `STARTING`, `RUNNING`, `DEGRADED`, `STOPPING`, `FAILED`

### State transitions

```
Target: RUNNING
  STOPPED → STARTING → RUNNING   (all replicas healthy)
                     → DEGRADED  (some replicas healthy, some dead)
                     → FAILED    (zero healthy / pre-flight failed)
  RUNNING → DEGRADED             (replica dies)
  DEGRADED → RUNNING             (replica recovers via restart policy)
  DEGRADED → FAILED              (all dead, retries exhausted)

Target: STOPPED
  RUNNING/DEGRADED → STOPPING → STOPPED  (all replicas stopped and removed)
                              → FAILED   (couldn't stop some replicas)
```

### Aggregate derivation

Deployment status is derived from replica states:
- All replicas `RUNNING` → deployment `RUNNING`
- At least one `RUNNING`, some `DEAD` → deployment `DEGRADED`
- Zero `RUNNING` after retries → deployment `FAILED`
- All stopped → deployment `STOPPED`

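The derivation rules can be sketched as a pure function over replica states. Several points here are assumptions: the `StatusDerivation` class and the `STOPPED` replica status are hypothetical names, a `DEAD` replica is taken to mean its retries are already exhausted, and a mix of `RUNNING` and stopped replicas (a case the rules above do not cover) is treated as `DEGRADED`. The transitional `STARTING` and `STOPPING` values are assumed to be set by the deploy flow rather than derived.

```java
import java.util.List;

/** Hypothetical sketch of aggregate status derivation from replica states. */
final class StatusDerivation {
    enum ReplicaStatus { RUNNING, DEAD, STOPPED }
    enum DeploymentStatus { STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED }

    static DeploymentStatus derive(List<ReplicaStatus> replicas) {
        long running = replicas.stream().filter(s -> s == ReplicaStatus.RUNNING).count();
        long stopped = replicas.stream().filter(s -> s == ReplicaStatus.STOPPED).count();

        if (!replicas.isEmpty() && running == replicas.size()) {
            return DeploymentStatus.RUNNING;   // all replicas RUNNING
        }
        if (stopped == replicas.size()) {
            return DeploymentStatus.STOPPED;   // all stopped (or no replicas at all)
        }
        if (running > 0) {
            return DeploymentStatus.DEGRADED;  // some RUNNING, the rest not
        }
        return DeploymentStatus.FAILED;        // zero RUNNING, at least one DEAD
    }
}
```
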
---

## Deployment Flow

### Deploy stages (tracked in `deploy_stage` for progress UI)

1. `PRE_FLIGHT` — Validate config, check JAR exists, verify base image
2. `PULL_IMAGE` — Pull base image if not present locally
3. `CREATE_NETWORK` — Ensure traefik and environment networks exist
4. `START_REPLICAS` — Create and start N containers
5. `HEALTH_CHECK` — Wait for at least one replica to pass health check
6. `SWAP_TRAFFIC` — (blue/green) Stop old deployment replicas
7. `COMPLETE` — Mark deployment RUNNING/DEGRADED

### Pre-flight checks

Before touching any running containers:
1. JAR file exists on disk at the path stored in `app_versions`
2. Base image available (pull if missing)
3. Resolved config is valid (memory > 0, appPort > 0, replicas >= 1)
4. No naming conflict with containers from other apps

If any check fails, the deployment is marked `FAILED` immediately and the existing deployment is left untouched.

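Checks 1 and 3 are simple enough to sketch directly. The `PreFlightSketch` class, its nested `Config` record, and the error messages are hypothetical; the image and naming checks need a Docker client and are left out.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the JAR and config pre-flight checks. */
final class PreFlightSketch {
    /** Only the fields that check 3 inspects; the real resolved config has more. */
    record Config(int memoryLimitMb, int appPort, int replicas) {}

    /** Returns a list of failure messages; an empty list means the checks passed. */
    static List<String> check(Path jarPath, Config config) {
        List<String> errors = new ArrayList<>();
        if (!Files.isRegularFile(jarPath)) {
            errors.add("JAR not found: " + jarPath);
        }
        if (config.memoryLimitMb() <= 0) errors.add("memoryLimitMb must be > 0");
        if (config.appPort() <= 0) errors.add("appPort must be > 0");
        if (config.replicas() < 1) errors.add("replicas must be >= 1");
        return errors;
    }
}
```

Collecting every failure, rather than stopping at the first one, lets the caller show all problems at once before marking the deployment `FAILED`.
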
### Blue/green strategy

1. Start all new replicas alongside the old deployment
2. Wait for health checks on new replicas
3. If healthy: stop and remove all old replicas, mark new deployment RUNNING
4. If unhealthy: remove new replicas, mark new deployment FAILED, old deployment stays

Temporarily uses 2x resources during the swap window.

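The four steps can be sketched against a small runtime abstraction. The `Runtime` interface and `BlueGreenSketch` class are hypothetical stand-ins (the orchestrator API is not shown in this spec), and the sketch assumes "healthy" means every new replica passed its health check, which the steps above leave open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the blue/green flow; not the actual DeploymentExecutor. */
final class BlueGreenSketch {
    /** Stand-in for the container runtime. */
    interface Runtime {
        String start(int replicaIndex);            // returns the new container id
        boolean awaitHealthy(String containerId);  // blocks until healthy or timed out
        void stopAndRemove(String containerId);
    }

    /** Returns true if the new deployment went live, false if it was discarded. */
    static boolean deploy(Runtime runtime, List<String> oldContainers, int replicas) {
        // 1. Start all new replicas alongside the old deployment.
        List<String> fresh = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            fresh.add(runtime.start(i));
        }
        // 2. Wait for health checks (assumption: all replicas must pass).
        boolean healthy = fresh.stream().allMatch(runtime::awaitHealthy);
        if (healthy) {
            // 3. Retire the old deployment; the caller marks the new one RUNNING.
            oldContainers.forEach(runtime::stopAndRemove);
        } else {
            // 4. Discard the new replicas; the old deployment stays untouched.
            fresh.forEach(runtime::stopAndRemove);
        }
        return healthy;
    }
}
```

The old containers are only touched after the health decision, which is what keeps the previous deployment serving traffic when the new one fails.
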
### Rolling strategy

1. For each replica index (0..N-1):
   a. Stop old replica at index i (if exists)
   b. Start new replica at index i
   c. Wait for health check
   d. If unhealthy: stop the new replica, mark deployment FAILED, leave the remaining old replicas in place
2. After all replicas replaced, mark deployment RUNNING

Uses fewer peak resources than blue/green, but is slower and more complex.

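A sketch of the per-index loop, again using a hypothetical `Runtime` interface in place of the real orchestrator API. It assumes old replicas are addressed by list position and that an unhealthy new replica is stopped and removed; old replicas beyond the new replica count (a scale-down) are not handled because the steps above do not cover that case.

```java
import java.util.List;

/** Hypothetical sketch of the rolling flow; not the actual DeploymentExecutor. */
final class RollingSketch {
    /** Stand-in for the container runtime. */
    interface Runtime {
        String start(int replicaIndex);            // returns the new container id
        boolean awaitHealthy(String containerId);  // blocks until healthy or timed out
        void stopAndRemove(String containerId);
    }

    /**
     * Returns the number of replicas replaced. A value equal to `replicas`
     * means success; anything lower means the caller should mark FAILED.
     */
    static int deploy(Runtime runtime, List<String> oldContainers, int replicas) {
        for (int i = 0; i < replicas; i++) {
            if (i < oldContainers.size()) {
                runtime.stopAndRemove(oldContainers.get(i));  // a. stop old replica i
            }
            String id = runtime.start(i);                     // b. start new replica i
            if (!runtime.awaitHealthy(id)) {                  // c. wait for health check
                runtime.stopAndRemove(id);                    // d. give up, keep remaining old
                return i;
            }
        }
        return replicas;
    }
}
```

Note that when step d triggers, index `i` has neither its old nor its new replica running, so capacity is reduced until the next deployment.
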
### Container naming

`{envSlug}-{appSlug}-{replicaIndex}` (e.g., `staging-payment-gateway-0`)

### Container restart policy

`on-failure` with max 3 retries. Docker handles transient failures. After 3 retries exhausted, the container stays dead and `DockerEventMonitor` detects the permanent failure.

### Environment variables injected

Base env vars (always set):
```
CAMELEER_EXPORT_TYPE=HTTP
CAMELEER_APPLICATION_ID={appSlug}
CAMELEER_ENVIRONMENT_ID={envSlug}
CAMELEER_DISPLAY_NAME={containerName}
CAMELEER_SERVER_URL={resolvedServerUrl}
CAMELEER_AUTH_TOKEN={bootstrapToken}
```

Plus all entries from `customEnvVars` in the resolved config.

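The naming scheme and the injected variables can be sketched together. `ContainerEnvSketch` and its method names are hypothetical. The spec does not say what happens when a custom variable reuses a base name; this sketch assumes the base variables win, since they are described as always set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of container naming and environment assembly. */
final class ContainerEnvSketch {
    /** {envSlug}-{appSlug}-{replicaIndex}, e.g. staging-payment-gateway-0. */
    static String containerName(String envSlug, String appSlug, int replicaIndex) {
        return envSlug + "-" + appSlug + "-" + replicaIndex;
    }

    static Map<String, String> buildEnv(String envSlug, String appSlug, int replicaIndex,
                                        String serverUrl, String bootstrapToken,
                                        Map<String, String> customEnvVars) {
        // Custom entries go in first so the base variables below overwrite
        // any colliding keys (assumption: base variables take precedence).
        Map<String, String> env = new LinkedHashMap<>(customEnvVars);
        env.put("CAMELEER_EXPORT_TYPE", "HTTP");
        env.put("CAMELEER_APPLICATION_ID", appSlug);
        env.put("CAMELEER_ENVIRONMENT_ID", envSlug);
        env.put("CAMELEER_DISPLAY_NAME", containerName(envSlug, appSlug, replicaIndex));
        env.put("CAMELEER_SERVER_URL", serverUrl);
        env.put("CAMELEER_AUTH_TOKEN", bootstrapToken);
        return env;
    }
}
```
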
---

## Docker Event Monitor

### DockerEventMonitor

`@Component` that starts a persistent Docker event stream on `@PostConstruct`.

- Filters for containers with label `managed-by=cameleer3-server`
- Listens for events: `die`, `oom`, `stop`, `start`
- On `die`/`oom`: looks up deployment by container ID, updates replica status to `DEAD`, recomputes deployment status (RUNNING → DEGRADED → FAILED)
- On `start`: updates replica status to `RUNNING` (handles Docker restart policy recoveries)
- Reconnects automatically if the stream drops

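The event handling rules can be sketched without a Docker client by modelling one deployment's replica states in memory. `ReplicaEventSketch` is a hypothetical class: the in-memory map stands in for the `replica_states` column, retry accounting is omitted (zero running replicas is reported as `FAILED` straight away), and `stop` events are ignored because the list above does not say how they are handled.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of applying Docker event actions to replica states. */
final class ReplicaEventSketch {
    enum ReplicaStatus { RUNNING, DEAD }

    /** containerId -> status for a single deployment. */
    private final Map<String, ReplicaStatus> replicas = new HashMap<>();

    void track(String containerId) {
        replicas.put(containerId, ReplicaStatus.RUNNING);
    }

    /** Applies one event action; unknown containers and other actions are ignored. */
    void onEvent(String action, String containerId) {
        if (!replicas.containsKey(containerId)) return;
        switch (action) {
            case "die", "oom" -> replicas.put(containerId, ReplicaStatus.DEAD);
            case "start" -> replicas.put(containerId, ReplicaStatus.RUNNING);
            default -> { /* e.g. "stop": handling not specified, so no change */ }
        }
    }

    /** Recomputes the deployment status from the current replica states. */
    String deploymentStatus() {
        long running = replicas.values().stream()
                .filter(s -> s == ReplicaStatus.RUNNING).count();
        if (running == replicas.size()) return "RUNNING";
        return running > 0 ? "DEGRADED" : "FAILED";
    }
}
```

A `start` event after a `die` moves the replica back to `RUNNING`, which is how a restart-policy recovery would bring a `DEGRADED` deployment back to `RUNNING`.
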
### Interaction with agent heartbeats

- Agent heartbeats: app-level health (is the Camel context running, are routes active)
- Docker events: infrastructure-level health (is the container alive, OOM, crash)
- Both feed into the same deployment status. Docker events are faster for container crashes. Agent heartbeats catch app-level hangs where the container is alive but the app is stuck.

---

## Database Migration

`V7__deployment_orchestration.sql`:

```sql
-- New status values and fields for deployments
ALTER TABLE deployments ADD COLUMN target_state VARCHAR(20) NOT NULL DEFAULT 'RUNNING';
ALTER TABLE deployments ADD COLUMN deployment_strategy VARCHAR(20) NOT NULL DEFAULT 'BLUE_GREEN';
ALTER TABLE deployments ADD COLUMN replica_states JSONB NOT NULL DEFAULT '[]';
ALTER TABLE deployments ADD COLUMN deploy_stage VARCHAR(30);

-- Backfill existing deployments
UPDATE deployments SET target_state = CASE
    WHEN status = 'STOPPED' THEN 'STOPPED'
    ELSE 'RUNNING'
END;
```

The `status` column remains but gains two new values: `DEGRADED` and `STOPPING`. The `DeploymentStatus` enum is updated to match.

---

## UI Changes

### Deployments tab — Overview

- **Replicas column** in deployments table: shows `{healthy}/{total}` (e.g., `2/3`)
- **Status badge** updated for new states: `DEGRADED` (warning color), `STOPPING` (auto color)
- **Deployment progress** shown when `deploy_stage` is not null — horizontal step indicator:
  ```
  ●━━━━●━━━━●━━━━○━━━━○━━━━○
  Pre-  Pull Start Health Swap
  flight     reps  check  traffic
  ```
  Completed steps filled, current step highlighted, failed step red.

### Create App page — Resources tab

- `appPort` — number input (default 8080)
- `replicas` — number input (default 1)
- `deploymentStrategy` — select: Blue/Green, Rolling (default Blue/Green)
- `stripPathPrefix` — toggle (default true)
- `sslOffloading` — toggle (default true)

### Config tab — Resources sub-tab (app detail)

Same fields as the create page, shown read-only when not editing.

### Environment admin page

- `routingMode` — select: Path-based, Subdomain (default Path-based)
- `routingDomain` — text input
- `serverUrl` — text input with placeholder showing global default
- `sslOffloading` — toggle (default true)

---

## New/Modified Components Summary

### Core module (cameleer3-server-core)

- `ResolvedContainerConfig` — new record with all typed fields
- `ConfigMerger` — pure function, three-layer merge
- `ContainerRequest` — add `cpuLimit`, `exposedPorts`, `restartPolicy`, `additionalNetworks`
- `DeploymentStatus` — add `DEGRADED`, `STOPPING`
- `Deployment` — add `targetState`, `deploymentStrategy`, `replicaStates`, `deployStage`

### App module (cameleer3-server-app)

- `DockerRuntimeOrchestrator` — apply full config (memory reserve, CPU limit, exposed ports, restart policy)
- `DockerNetworkManager` — new component, lazy network creation + container attachment
- `DockerEventMonitor` — new component, persistent event stream listener
- `TraefikLabelBuilder` — new utility, generates full Traefik label set
- `DeploymentExecutor` — rewrite deploy flow with stages, pre-flight, strategy dispatch
- `V7__deployment_orchestration.sql` — migration for new columns

### UI

- `AppsTab.tsx` — new fields in create page and config tabs
- `EnvironmentsPage.tsx` — routing and SSL fields
- `DeploymentProgress` component — step indicator for deploy stages
- Status badges updated for DEGRADED/STOPPING