# Docker Container Orchestration Design

## Goal

Make the `DockerRuntimeOrchestrator` fully functional: apply container configs (memory, CPU, ports, env vars) when starting containers, generate correct Traefik routing labels, support replicas, implement blue/green and rolling deployment strategies, and monitor container health via Docker event stream.

## Scope

- Docker single-host only (Swarm and K8s are future `RuntimeOrchestrator` implementations)
- Replicas managed by the orchestrator as independent containers
- Traefik integration for path-based and subdomain-based routing
- Docker event stream for infrastructure-level health monitoring
- UI changes for new config fields, replica management, and deployment progress

---

## Network Topology

Three network tiers with lazy creation:

```
cameleer-infra:       server, postgres, clickhouse (databases isolated)
cameleer-traefik:     server, traefik, all app containers (ingress + agent SSE)
cameleer-env-{slug}:  app containers within one environment (inter-app only)
```

- **Server** joins `cameleer-infra` + `cameleer-traefik`
- **App containers** join `cameleer-traefik` + `cameleer-env-{envSlug}`
- **Traefik** joins `cameleer-traefik` only
- **Databases** join `cameleer-infra` only

App containers reach the server for SSE/heartbeats via the `cameleer-traefik` network. They never touch databases directly.

### Network isolation

The `cameleer-traefik` network is created with **inter-container communication (ICC) disabled** (`--opt com.docker.network.bridge.enable_icc=false`). This means containers on the traefik network cannot communicate directly with each other; they can only be reached through Traefik's published ports. This prevents a compromised app in one environment from reaching apps in other environments via the shared routing network.

The `cameleer-env-{slug}` networks keep ICC enabled so apps within the same environment can discover and communicate with each other freely.

### Network Manager

Wraps Docker network operations. `ensureNetwork(name, iccEnabled)` creates a bridge network if it doesn't exist (idempotent). The traefik network is created with `iccEnabled=false`, environment networks with `iccEnabled=true`. `connectContainer(containerId, networkName)` attaches a container to a second network. Called by `DeploymentExecutor` before container creation.
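
The idempotent behaviour of `ensureNetwork` can be illustrated with a minimal sketch. It deliberately avoids the real Docker client: the `NetworkBackend` interface, the `InMemoryBackend` fake, and the class name `NetworkManagerSketch` are hypothetical names introduced only for this example. The only detail taken from this document is the `com.docker.network.bridge.enable_icc` option key.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical abstraction over the Docker network API (not the real client). */
interface NetworkBackend {
    boolean networkExists(String name);
    void createBridgeNetwork(String name, Map<String, String> options);
}

/** In-memory stand-in so the sketch runs without a Docker daemon. */
class InMemoryBackend implements NetworkBackend {
    final Map<String, Map<String, String>> networks = new HashMap<>();
    int createCalls = 0;

    public boolean networkExists(String name) {
        return networks.containsKey(name);
    }

    public void createBridgeNetwork(String name, Map<String, String> options) {
        createCalls++;
        networks.put(name, options);
    }
}

class NetworkManagerSketch {
    static final String ICC_OPTION = "com.docker.network.bridge.enable_icc";
    private final NetworkBackend backend;

    NetworkManagerSketch(NetworkBackend backend) {
        this.backend = backend;
    }

    /** Creates the bridge network only if it is missing, so repeated calls are harmless. */
    void ensureNetwork(String name, boolean iccEnabled) {
        if (backend.networkExists(name)) {
            return; // already present: nothing to do
        }
        backend.createBridgeNetwork(name, Map.of(ICC_OPTION, Boolean.toString(iccEnabled)));
    }
}
```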

---

## Configuration Model

### Three-layer merge

```
Global defaults (application.yml)
  → Environment defaults (environments.default_container_config)
    → App overrides (apps.container_config)
```

App overrides environment, environment overrides global. Missing keys fall through.

### Environment-level settings (`defaultContainerConfig`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `memoryLimitMb` | int | 512 | Default memory limit |
| `memoryReserveMb` | int | null | Memory reservation |
| `cpuShares` | int | 512 | CPU shares |
| `cpuLimit` | float | null | CPU core limit |
| `routingMode` | string | `"path"` | `path` or `subdomain` |
| `routingDomain` | string | from global | Domain for URL generation |
| `serverUrl` | string | from global | Server URL for agent callbacks |
| `sslOffloading` | boolean | true | Traefik terminates TLS |

### App-level settings (`containerConfig`)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `memoryLimitMb` | int | from env | Override memory limit |
| `memoryReserveMb` | int | from env | Override memory reservation |
| `cpuShares` | int | from env | Override CPU shares |
| `cpuLimit` | float | from env | Override CPU core limit |
| `appPort` | int | 8080 | Main HTTP port for Traefik |
| `exposedPorts` | int[] | [] | Additional ports (debug, JMX) |
| `customEnvVars` | map | {} | App-specific environment variables |
| `stripPathPrefix` | boolean | true | Traefik strips `/{env}/{app}` prefix |
| `sslOffloading` | boolean | from env | Override SSL offloading |
| `replicas` | int | 1 | Number of container replicas |
| `deploymentStrategy` | string | `"blue-green"` | `blue-green` or `rolling` |

### ConfigMerger

Pure function: `resolve(globalDefaults, envConfig, appConfig) → ResolvedContainerConfig`

`ResolvedContainerConfig` is a typed Java record with all fields resolved to concrete values. No more scattered `@Value` fields in `DeploymentExecutor` for container-level settings.
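
A minimal sketch of the three-layer merge follows, assuming each layer arrives as a `Map<String, Object>` (the actual input types are not specified in this document). `ResolvedConfigSketch` is a hypothetical, reduced stand-in for `ResolvedContainerConfig` that carries only three of the documented fields; the fallback values 512, 8080, and 1 are the defaults listed in the tables above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reduced stand-in for ResolvedContainerConfig (three fields only). */
record ResolvedConfigSketch(int memoryLimitMb, int appPort, int replicas) {}

class ConfigMergerSketch {
    /** Later layers win; keys that are missing (or null) in a layer fall through. */
    static ResolvedConfigSketch resolve(Map<String, Object> globalDefaults,
                                        Map<String, Object> envConfig,
                                        Map<String, Object> appConfig) {
        Map<String, Object> merged = new HashMap<>();
        for (Map<String, Object> layer : List.of(globalDefaults, envConfig, appConfig)) {
            layer.forEach((key, value) -> {
                if (value != null) {
                    merged.put(key, value);
                }
            });
        }
        return new ResolvedConfigSketch(
                intValue(merged, "memoryLimitMb", 512),
                intValue(merged, "appPort", 8080),
                intValue(merged, "replicas", 1));
    }

    /** Reads a numeric key, using the documented default when no layer set it. */
    private static int intValue(Map<String, Object> merged, String key, int fallback) {
        Object value = merged.get(key);
        return value instanceof Number n ? n.intValue() : fallback;
    }
}
```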

---

## Traefik Label Generation

### TraefikLabelBuilder

Pure function: takes app slug, env slug, resolved config. Returns `Map<String, String>`.

### Path-based routing (`routingMode: "path"`)

Service name derived as `{envSlug}-{appSlug}`.

```
traefik.enable=true
traefik.http.routers.{svc}.rule=PathPrefix(`/{envSlug}/{appSlug}/`)
traefik.http.routers.{svc}.entrypoints=websecure
traefik.http.services.{svc}.loadbalancer.server.port={appPort}
managed-by=cameleer3-server
cameleer.app={appSlug}
cameleer.environment={envSlug}
```

If `stripPathPrefix` is true:
```
traefik.http.middlewares.{svc}-strip.stripprefix.prefixes=/{envSlug}/{appSlug}
traefik.http.routers.{svc}.middlewares={svc}-strip
```

### Subdomain-based routing (`routingMode: "subdomain"`)

```
traefik.http.routers.{svc}.rule=Host(`{appSlug}-{envSlug}.{routingDomain}`)
```

No strip-prefix needed for subdomain routing.

### SSL offloading

If `sslOffloading` is true:
```
traefik.http.routers.{svc}.tls=true
traefik.http.routers.{svc}.tls.certresolver=default
```

If false, Traefik passes through TLS to the container (requires the app to terminate TLS itself).
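
The label rules above could be assembled roughly as follows. This is a sketch, not the actual `TraefikLabelBuilder`: the class name `TraefikLabelSketch` and the flat parameter list are hypothetical simplifications of the resolved config. It also assumes that the `entrypoints`, service port, and `cameleer.*` labels are emitted in subdomain mode as well, which the document does not state explicitly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TraefikLabelSketch {
    /** Builds a label map from simplified, hypothetical stand-ins for the resolved config. */
    static Map<String, String> build(String envSlug, String appSlug, String routingMode,
                                     String routingDomain, int appPort,
                                     boolean stripPathPrefix, boolean sslOffloading) {
        String svc = envSlug + "-" + appSlug;
        String router = "traefik.http.routers." + svc;
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("traefik.enable", "true");
        labels.put("managed-by", "cameleer3-server");
        labels.put("cameleer.app", appSlug);
        labels.put("cameleer.environment", envSlug);
        // Assumption: these two labels are shared by both routing modes.
        labels.put(router + ".entrypoints", "websecure");
        labels.put("traefik.http.services." + svc + ".loadbalancer.server.port",
                Integer.toString(appPort));

        if ("subdomain".equals(routingMode)) {
            labels.put(router + ".rule",
                    "Host(`" + appSlug + "-" + envSlug + "." + routingDomain + "`)");
        } else {
            String prefix = "/" + envSlug + "/" + appSlug;
            labels.put(router + ".rule", "PathPrefix(`" + prefix + "/`)");
            if (stripPathPrefix) {
                labels.put("traefik.http.middlewares." + svc + "-strip.stripprefix.prefixes", prefix);
                labels.put(router + ".middlewares", svc + "-strip");
            }
        }
        if (sslOffloading) {
            labels.put(router + ".tls", "true");
            labels.put(router + ".tls.certresolver", "default");
        }
        return labels;
    }
}
```

Keeping the builder free of Docker or Spring dependencies means the generated labels can be checked in a plain unit test, which matches the "pure function" intent stated for `TraefikLabelBuilder`.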

### Replicas

All replicas of the same app get identical Traefik labels. Traefik automatically load-balances across containers with the same service name on the same network.

---

## Deployment Status Model

### New fields on `deployments` table

| Column | Type | Description |
|--------|------|-------------|
| `target_state` | varchar | `RUNNING` or `STOPPED` |
| `deployment_strategy` | varchar | `BLUE_GREEN` or `ROLLING` |
| `replica_states` | jsonb | Array of `{index, containerId, status}` |
| `deploy_stage` | varchar | Current stage for progress tracking (null when stable) |

### Status values

`STOPPED`, `STARTING`, `RUNNING`, `DEGRADED`, `STOPPING`, `FAILED`

### State transitions

```
Target: RUNNING
  STOPPED → STARTING → RUNNING (all replicas healthy)
                     → DEGRADED (some replicas healthy, some dead)
                     → FAILED (zero healthy / pre-flight failed)
  RUNNING → DEGRADED (replica dies)
  DEGRADED → RUNNING (replica recovers via restart policy)
  DEGRADED → FAILED (all dead, retries exhausted)

Target: STOPPED
  RUNNING/DEGRADED → STOPPING → STOPPED (all replicas stopped and removed)
                              → FAILED (couldn't stop some replicas)
```

### Aggregate derivation

Deployment status is derived from replica states:
- All replicas `RUNNING` → deployment `RUNNING`
- At least one `RUNNING`, some `DEAD` → deployment `DEGRADED`
- Zero `RUNNING` after retries → deployment `FAILED`
- All stopped → deployment `STOPPED`
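
The derivation rules above can be expressed as a small pure function. The sketch below makes a few assumptions that are not in this document: the replica status for a stopped container is called `STOPPED`, an empty replica list is treated as `STOPPED`, any mix with at least one running replica counts as `DEGRADED`, and the function is only evaluated once Docker's retries are exhausted. The type names are hypothetical.

```java
import java.util.List;

/** Hypothetical replica-level statuses; only RUNNING and DEAD are named in the design. */
enum ReplicaStatusSketch { RUNNING, DEAD, STOPPED }

/** Mirrors the deployment status values listed under "Status values". */
enum DeploymentStatusSketch { STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED }

class StatusAggregatorSketch {
    /** Derives the aggregate deployment status from the per-replica statuses. */
    static DeploymentStatusSketch derive(List<ReplicaStatusSketch> replicas) {
        long running = replicas.stream().filter(s -> s == ReplicaStatusSketch.RUNNING).count();
        long stopped = replicas.stream().filter(s -> s == ReplicaStatusSketch.STOPPED).count();

        if (replicas.isEmpty() || stopped == replicas.size()) {
            return DeploymentStatusSketch.STOPPED;   // all stopped (or nothing deployed)
        }
        if (running == replicas.size()) {
            return DeploymentStatusSketch.RUNNING;   // every replica is healthy
        }
        if (running > 0) {
            return DeploymentStatusSketch.DEGRADED;  // partially healthy
        }
        return DeploymentStatusSketch.FAILED;        // zero running after retries
    }
}
```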

---

## Deployment Flow

### Deploy stages (tracked in `deploy_stage` for progress UI)

1. `PRE_FLIGHT`: Validate config, check JAR exists, verify base image
2. `PULL_IMAGE`: Pull base image if not present locally
3. `CREATE_NETWORK`: Ensure traefik and environment networks exist
4. `START_REPLICAS`: Create and start N containers
5. `HEALTH_CHECK`: Wait for at least one replica to pass health check
6. `SWAP_TRAFFIC`: (blue/green) Stop old deployment replicas
7. `COMPLETE`: Mark deployment RUNNING/DEGRADED

### Pre-flight checks

Before touching any running containers:
1. JAR file exists on disk at the path stored in `app_versions`
2. Base image available (pull if missing)
3. Resolved config is valid (memory > 0, appPort > 0, replicas >= 1)
4. No naming conflict with containers from other apps

If any check fails → deployment marked `FAILED` immediately, existing deployment untouched.
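
As an illustration of check 3 only, the config validation might look like the sketch below. The class name `PreFlightSketch` and the error strings are hypothetical; the three rules are the ones listed above. Checks 1, 2, and 4 need file system and Docker access and are not covered here.

```java
import java.util.ArrayList;
import java.util.List;

class PreFlightSketch {
    /** Returns the violated rules; an empty list means the resolved config is valid. */
    static List<String> validate(int memoryLimitMb, int appPort, int replicas) {
        List<String> errors = new ArrayList<>();
        if (memoryLimitMb <= 0) {
            errors.add("memoryLimitMb must be > 0");
        }
        if (appPort <= 0) {
            errors.add("appPort must be > 0");
        }
        if (replicas < 1) {
            errors.add("replicas must be >= 1");
        }
        return errors;
    }
}
```

Collecting every violation instead of stopping at the first one lets the UI show all config problems in a single failed deployment.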

### Blue/green strategy

1. Start all new replicas alongside the old deployment
2. Wait for health checks on new replicas
3. If healthy: stop and remove all old replicas, mark new deployment RUNNING
4. If unhealthy: remove new replicas, mark new deployment FAILED, old deployment stays

Temporarily uses 2x resources during the swap window.
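
The four steps above can be sketched against a hypothetical runtime facade. `BlueGreenRuntime` and its methods are invented for this example and are not the real orchestrator API. The sketch also assumes that "healthy" means every new replica passes its health check, which the document leaves open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical runtime facade; the real orchestrator API may look different. */
interface BlueGreenRuntime {
    String start(int replicaIndex);           // returns the new container id
    boolean waitHealthy(String containerId);  // true once the health check passes
    void stopAndRemove(String containerId);
}

class BlueGreenSketch {
    /** Returns true if the new replicas took over, false if the old deployment was kept. */
    static boolean deploy(BlueGreenRuntime runtime, List<String> oldContainers, int replicas) {
        List<String> fresh = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            fresh.add(runtime.start(i)); // new replicas run alongside the old ones
        }
        // Assumption: every new replica must be healthy before the swap.
        boolean healthy = fresh.stream().allMatch(runtime::waitHealthy);
        if (healthy) {
            oldContainers.forEach(runtime::stopAndRemove); // retire the old deployment
        } else {
            fresh.forEach(runtime::stopAndRemove);         // roll back, old deployment stays
        }
        return healthy;
    }
}
```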

### Rolling strategy

1. For each replica index (0..N-1):
   a. Stop old replica at index i (if exists)
   b. Start new replica at index i
   c. Wait for health check
   d. If unhealthy: stop, mark deployment FAILED, leave remaining old replicas
2. After all replicas replaced, mark deployment RUNNING

Lower peak resources but slower and more complex.
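
The rolling loop can be sketched in the same style. `RollingRuntime` is again a hypothetical facade introduced for the example, and the return value (the number of replicas successfully replaced) is an illustrative choice rather than part of the design.

```java
/** Hypothetical runtime facade for the rolling sketch. */
interface RollingRuntime {
    void stopOld(int replicaIndex);           // no-op if no old replica exists at this index
    String startNew(int replicaIndex);        // returns the new container id
    boolean waitHealthy(String containerId);  // true once the health check passes
    void stop(String containerId);
}

class RollingSketch {
    /** Returns how many replicas were replaced; less than `replicas` means the rollout failed. */
    static int deploy(RollingRuntime runtime, int replicas) {
        for (int i = 0; i < replicas; i++) {
            runtime.stopOld(i);                  // stop the old replica at this index
            String id = runtime.startNew(i);     // start its replacement
            if (!runtime.waitHealthy(id)) {
                runtime.stop(id);                // stop the unhealthy replacement
                return i;                        // remaining old replicas are left alone
            }
        }
        return replicas;                         // every index was replaced
    }
}
```

A caller would mark the deployment `FAILED` when the returned count is below the requested replica count, and `RUNNING` otherwise.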

### Container naming

`{envSlug}-{appSlug}-{replicaIndex}` (e.g., `staging-payment-gateway-0`)

### Container restart policy

`on-failure` with max 3 retries. Docker handles transient failures. Once the 3 retries are exhausted, the container stays dead and `DockerEventMonitor` detects the permanent failure.

### Environment variables injected

Base env vars (always set):
```
CAMELEER_EXPORT_TYPE=HTTP
CAMELEER_APPLICATION_ID={appSlug}
CAMELEER_ENVIRONMENT_ID={envSlug}
CAMELEER_DISPLAY_NAME={containerName}
CAMELEER_SERVER_URL={resolvedServerUrl}
CAMELEER_AUTH_TOKEN={bootstrapToken}
```

Plus all entries from `customEnvVars` in the resolved config.
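
A small sketch of how the container name and environment could be assembled is shown below. The class `ContainerEnvSketch` and its parameter list are hypothetical. Applying `customEnvVars` last, so that they can override base values, is an assumption; the document only says the custom entries are added.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ContainerEnvSketch {
    /** Follows the {envSlug}-{appSlug}-{replicaIndex} naming scheme. */
    static String containerName(String envSlug, String appSlug, int replicaIndex) {
        return envSlug + "-" + appSlug + "-" + replicaIndex;
    }

    /** Builds the base variables, then adds the app-specific ones. */
    static Map<String, String> buildEnv(String envSlug, String appSlug, int replicaIndex,
                                        String serverUrl, String bootstrapToken,
                                        Map<String, String> customEnvVars) {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("CAMELEER_EXPORT_TYPE", "HTTP");
        env.put("CAMELEER_APPLICATION_ID", appSlug);
        env.put("CAMELEER_ENVIRONMENT_ID", envSlug);
        env.put("CAMELEER_DISPLAY_NAME", containerName(envSlug, appSlug, replicaIndex));
        env.put("CAMELEER_SERVER_URL", serverUrl);
        env.put("CAMELEER_AUTH_TOKEN", bootstrapToken);
        // Assumption: custom entries are applied last and may override base values.
        env.putAll(customEnvVars);
        return env;
    }
}
```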

---

## Docker Event Monitor

### DockerEventMonitor

`@Component` that starts a persistent Docker event stream on `@PostConstruct`.

- Filters for containers with label `managed-by=cameleer3-server`
- Listens for events: `die`, `oom`, `stop`, `start`
- On `die`/`oom`: looks up deployment by container ID, updates replica status to `DEAD`, recomputes deployment status (RUNNING → DEGRADED → FAILED)
- On `start`: updates replica status to `RUNNING` (handles Docker restart policy recoveries)
- Reconnects automatically if the stream drops
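
The event handling rules above can be illustrated without a Docker connection. In the sketch below, `EventHandlerSketch` is a hypothetical class and the in-memory map stands in for the `replica_states` column. The document does not say what happens on `stop`, so the sketch leaves the state unchanged for it; stream subscription, label filtering, and reconnects are out of scope.

```java
import java.util.HashMap;
import java.util.Map;

class EventHandlerSketch {
    /** containerId -> replica status ("RUNNING" or "DEAD"); stands in for replica_states. */
    final Map<String, String> replicaStates = new HashMap<>();

    /** Applies one Docker event; `action` is the event name such as "die" or "start". */
    void onEvent(String action, String containerId) {
        if (!replicaStates.containsKey(containerId)) {
            return; // not one of the tracked replicas
        }
        switch (action) {
            case "die", "oom" -> replicaStates.put(containerId, "DEAD");
            case "start" -> replicaStates.put(containerId, "RUNNING");
            default -> { } // "stop" and anything else: no state change in this sketch
        }
    }
}
```

After each state change, the real monitor would recompute the aggregate deployment status from the updated replica states.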

### Interaction with agent heartbeats

- Agent heartbeats: app-level health (is the Camel context running, are routes active)
- Docker events: infrastructure-level health (is the container alive, OOM, crash)
- Both feed into the same deployment status. Docker events are faster for container crashes. Agent heartbeats catch app-level hangs where the container is alive but the app is stuck.

---

## Database Migration

`V7__deployment_orchestration.sql`:

```sql
-- New status values and fields for deployments
ALTER TABLE deployments ADD COLUMN target_state VARCHAR(20) NOT NULL DEFAULT 'RUNNING';
ALTER TABLE deployments ADD COLUMN deployment_strategy VARCHAR(20) NOT NULL DEFAULT 'BLUE_GREEN';
ALTER TABLE deployments ADD COLUMN replica_states JSONB NOT NULL DEFAULT '[]';
ALTER TABLE deployments ADD COLUMN deploy_stage VARCHAR(30);

-- Backfill existing deployments
UPDATE deployments SET target_state = CASE
    WHEN status = 'STOPPED' THEN 'STOPPED'
    ELSE 'RUNNING'
END;
```

The `status` column remains but gains two new values: `DEGRADED` and `STOPPING`. The `DeploymentStatus` enum is updated to match.

---

## UI Changes

### Deployments tab: Overview

- **Replicas column** in deployments table: shows `{healthy}/{total}` (e.g., `2/3`)
- **Status badge** updated for new states: `DEGRADED` (warning color), `STOPPING` (auto color)
- **Deployment progress** shown when `deploy_stage` is not null, as a horizontal step indicator:
  ```
  ●━━━━●━━━━●━━━━○━━━━○━━━━○
  Pre-   Pull  Start  Health  Swap
  flight       reps   check   traffic
  ```
  Completed steps filled, current step highlighted, failed step red.

### Create App page: Resources tab

- `appPort`: number input (default 8080)
- `replicas`: number input (default 1)
- `deploymentStrategy`: select: Blue/Green, Rolling (default Blue/Green)
- `stripPathPrefix`: toggle (default true)
- `sslOffloading`: toggle (default true)

### Config tab: Resources sub-tab (app detail)

Same fields as the create page; they are also shown in read-only mode when not editing.

### Environment admin page

- `routingMode`: select: Path-based, Subdomain (default Path-based)
- `routingDomain`: text input
- `serverUrl`: text input with placeholder showing global default
- `sslOffloading`: toggle (default true)

---

## New/Modified Components Summary

### Core module (cameleer3-server-core)

- `ResolvedContainerConfig`: new record with all typed fields
- `ConfigMerger`: pure function, three-layer merge
- `ContainerRequest`: add `cpuLimit`, `exposedPorts`, `restartPolicy`, `additionalNetworks`
- `DeploymentStatus`: add `DEGRADED`, `STOPPING`
- `Deployment`: add `targetState`, `deploymentStrategy`, `replicaStates`, `deployStage`

### App module (cameleer3-server-app)

- `DockerRuntimeOrchestrator`: apply full config (memory reserve, CPU limit, exposed ports, restart policy)
- `DockerNetworkManager`: new component, lazy network creation + container attachment
- `DockerEventMonitor`: new component, persistent event stream listener
- `TraefikLabelBuilder`: new utility, generates full Traefik label set
- `DeploymentExecutor`: rewrite deploy flow with stages, pre-flight, strategy dispatch
- `V7__deployment_orchestration.sql`: migration for new columns

### UI

- `AppsTab.tsx`: new fields in create page and config tabs
- `EnvironmentsPage.tsx`: routing and SSL fields
- `DeploymentProgress` component: step indicator for deploy stages
- Status badges updated for DEGRADED/STOPPING