cameleer-server/docs/superpowers/specs/2026-04-08-docker-orchestration-design.md
hsiegeln b196918e70
docs: revert ICC-disabled, use shared traefik network with app-level auth
ICC=false breaks Traefik routing and agent-server communication.
Switched to shared traefik network (ICC enabled) with app-level
security boundaries. Per-env Traefik networks noted as future option.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 20:00:12 +02:00


Docker Container Orchestration Design

Goal

Make the DockerRuntimeOrchestrator fully functional: apply container configs (memory, CPU, ports, env vars) when starting containers, generate correct Traefik routing labels, support replicas, implement blue/green and rolling deployment strategies, and monitor container health via Docker event stream.

Scope

  • Docker single-host only (Swarm and K8s are future RuntimeOrchestrator implementations)
  • Replicas managed by the orchestrator as independent containers
  • Traefik integration for path-based and subdomain-based routing
  • Docker event stream for infrastructure-level health monitoring
  • UI changes for new config fields, replica management, and deployment progress

Network Topology

Three network tiers with lazy creation:

cameleer-infra          — server, postgres, clickhouse (databases isolated)
cameleer-traefik        — server, traefik, all app containers (ingress + agent SSE)
cameleer-env-{slug}     — app containers within one environment (inter-app only)
  • Server joins cameleer-infra + cameleer-traefik
  • App containers join cameleer-traefik + cameleer-env-{envSlug}
  • Traefik joins cameleer-traefik only
  • Databases join cameleer-infra only

App containers reach the server for SSE/heartbeats via the cameleer-traefik network. They never touch databases directly.

Network isolation

The cameleer-traefik network has ICC enabled (required for Traefik routing and agent-server communication). All app containers are technically reachable from each other on this network. The security boundary is at the application level (auth tokens, environment-specific credentials).

The cameleer-env-{slug} networks provide intentional service discovery isolation — apps only discover and communicate with services in their own environment via Docker DNS. Cross-environment communication requires knowing the target container's IP, which apps have no reason to discover.

Future option: Per-environment Traefik networks (each env gets its own network with Traefik and server attached) would provide full network-level isolation. This can be added based on customer security requirements without changing the orchestrator interface.

Network Manager

Wraps Docker network operations. ensureNetwork(name) creates a bridge network if it doesn't exist (idempotent). connectContainer(containerId, networkName) attaches a container to a second network. Called by DeploymentExecutor before container creation.
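A minimal sketch of this contract, with the Docker client abstracted behind a hypothetical `DockerNetworks` interface (all names here are illustrative; the real component would call the Docker API directly):

```java
// Sketch of the idempotent network-manager contract. The DockerNetworks
// interface is an assumption standing in for the actual Docker client.
final class NetworkManager {

    interface DockerNetworks {
        boolean exists(String name);
        void createBridge(String name);
        void connect(String containerId, String networkName);
    }

    private final DockerNetworks docker;

    NetworkManager(DockerNetworks docker) {
        this.docker = docker;
    }

    // Idempotent: creating an already-existing network is a no-op.
    void ensureNetwork(String name) {
        if (!docker.exists(name)) docker.createBridge(name);
    }

    // Ensures the target network exists, then attaches the container to it.
    void connectContainer(String containerId, String networkName) {
        ensureNetwork(networkName);
        docker.connect(containerId, networkName);
    }
}
```

Keeping the Docker calls behind an interface also makes the lazy-creation logic trivially testable without a running daemon.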


Configuration Model

Three-layer merge

Global defaults (application.yml)
  → Environment defaults (environments.default_container_config)
    → App overrides (apps.container_config)

App overrides environment, environment overrides global. Missing keys fall through.

Environment-level settings (defaultContainerConfig)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| memoryLimitMb | int | 512 | Default memory limit |
| memoryReserveMb | int | null | Memory reservation |
| cpuShares | int | 512 | CPU shares |
| cpuLimit | float | null | CPU core limit |
| routingMode | string | "path" | path or subdomain |
| routingDomain | string | from global | Domain for URL generation |
| serverUrl | string | from global | Server URL for agent callbacks |
| sslOffloading | boolean | true | Traefik terminates TLS |

App-level settings (containerConfig)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| memoryLimitMb | int | from env | Override memory limit |
| memoryReserveMb | int | from env | Override memory reservation |
| cpuShares | int | from env | Override CPU shares |
| cpuLimit | float | from env | Override CPU core limit |
| appPort | int | 8080 | Main HTTP port for Traefik |
| exposedPorts | int[] | [] | Additional ports (debug, JMX) |
| customEnvVars | map | {} | App-specific environment variables |
| stripPathPrefix | boolean | true | Traefik strips /{env}/{app} prefix |
| sslOffloading | boolean | from env | Override SSL offloading |
| replicas | int | 1 | Number of container replicas |
| deploymentStrategy | string | "blue-green" | blue-green or rolling |

ConfigMerger

Pure function: resolve(globalDefaults, envConfig, appConfig) → ResolvedContainerConfig

ResolvedContainerConfig is a typed Java record with all fields resolved to concrete values. No more scattered @Value fields in DeploymentExecutor for container-level settings.
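The merge can be sketched for a subset of fields; the record shapes and helper below are illustrative, not the actual classes:

```java
// Illustrative subset of the three-layer merge; the record shapes and
// helper are assumptions, not the real ConfigMerger.
final class ConfigMerger {

    // Each layer may leave fields null; missing keys fall through to the layer below.
    record Layer(Integer memoryLimitMb, Integer cpuShares, Integer appPort) {}

    record Resolved(int memoryLimitMb, int cpuShares, int appPort) {}

    static Resolved resolve(Layer global, Layer env, Layer app) {
        return new Resolved(
                first(app.memoryLimitMb(), env.memoryLimitMb(), global.memoryLimitMb()),
                first(app.cpuShares(), env.cpuShares(), global.cpuShares()),
                first(app.appPort(), env.appPort(), global.appPort()));
    }

    // App overrides environment, environment overrides global.
    private static <T> T first(T app, T env, T global) {
        if (app != null) return app;
        if (env != null) return env;
        return global;
    }
}
```

Because the result is a record of concrete values, callers downstream never need to repeat the fallback logic.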


Traefik Label Generation

TraefikLabelBuilder

Pure function: takes app slug, env slug, resolved config. Returns Map<String, String>.

Path-based routing (routingMode: "path")

Service name derived as {envSlug}-{appSlug}.

traefik.enable=true
traefik.http.routers.{svc}.rule=PathPrefix(`/{envSlug}/{appSlug}/`)
traefik.http.routers.{svc}.entrypoints=websecure
traefik.http.services.{svc}.loadbalancer.server.port={appPort}
managed-by=cameleer3-server
cameleer.app={appSlug}
cameleer.environment={envSlug}

If stripPathPrefix is true:

traefik.http.middlewares.{svc}-strip.stripprefix.prefixes=/{envSlug}/{appSlug}
traefik.http.routers.{svc}.middlewares={svc}-strip

Subdomain-based routing (routingMode: "subdomain")

traefik.http.routers.{svc}.rule=Host(`{appSlug}-{envSlug}.{routingDomain}`)

No strip-prefix needed for subdomain routing.

SSL offloading

If sslOffloading is true:

traefik.http.routers.{svc}.tls=true
traefik.http.routers.{svc}.tls.certresolver=default

If false, Traefik passes through TLS to the container (requires the app to terminate TLS itself).

Replicas

All replicas of the same app get identical Traefik labels. Traefik automatically load-balances across containers with the same service name on the same network.
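The path-based variant above can be sketched as a pure function (the method name and signature are assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of path-based label generation; method name and signature are
// assumptions, the label keys follow the spec above.
final class TraefikLabelBuilder {

    static Map<String, String> pathLabels(String envSlug, String appSlug,
                                          int appPort, boolean stripPathPrefix) {
        String svc = envSlug + "-" + appSlug;  // service name: {envSlug}-{appSlug}
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("traefik.enable", "true");
        labels.put("traefik.http.routers." + svc + ".rule",
                "PathPrefix(`/" + envSlug + "/" + appSlug + "/`)");
        labels.put("traefik.http.routers." + svc + ".entrypoints", "websecure");
        labels.put("traefik.http.services." + svc + ".loadbalancer.server.port",
                String.valueOf(appPort));
        labels.put("managed-by", "cameleer3-server");
        labels.put("cameleer.app", appSlug);
        labels.put("cameleer.environment", envSlug);
        if (stripPathPrefix) {
            labels.put("traefik.http.middlewares." + svc + "-strip.stripprefix.prefixes",
                    "/" + envSlug + "/" + appSlug);
            labels.put("traefik.http.routers." + svc + ".middlewares", svc + "-strip");
        }
        return labels;
    }
}
```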


Deployment Status Model

New fields on deployments table

| Column | Type | Description |
|--------|------|-------------|
| target_state | varchar | RUNNING or STOPPED |
| deployment_strategy | varchar | BLUE_GREEN or ROLLING |
| replica_states | jsonb | Array of {index, containerId, status} |
| deploy_stage | varchar | Current stage for progress tracking (null when stable) |

Status values

STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED

State transitions

Target: RUNNING
  STOPPED → STARTING → RUNNING     (all replicas healthy)
                     → DEGRADED    (some replicas healthy, some dead)
                     → FAILED      (zero healthy / pre-flight failed)
  RUNNING → DEGRADED               (replica dies)
  DEGRADED → RUNNING               (replica recovers via restart policy)
  DEGRADED → FAILED                (all dead, retries exhausted)

Target: STOPPED
  RUNNING/DEGRADED → STOPPING → STOPPED  (all replicas stopped and removed)
                              → FAILED   (couldn't stop some replicas)

Aggregate derivation

Deployment status is derived from replica states:

  • All replicas RUNNING → deployment RUNNING
  • At least one RUNNING, some DEAD → deployment DEGRADED
  • Zero RUNNING after retries → deployment FAILED
  • All stopped → deployment STOPPED
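The rules above can be sketched as a single pure function; the `retriesExhausted` flag standing in for the restart-policy bookkeeping is an assumption:

```java
import java.util.List;

// Minimal sketch of the aggregate derivation; enum and method names are assumptions.
final class StatusAggregator {

    enum ReplicaStatus { RUNNING, DEAD, STOPPED }
    enum DeploymentStatus { RUNNING, DEGRADED, FAILED, STOPPED }

    static DeploymentStatus derive(List<ReplicaStatus> replicas, boolean retriesExhausted) {
        long running = replicas.stream().filter(s -> s == ReplicaStatus.RUNNING).count();
        if (replicas.stream().allMatch(s -> s == ReplicaStatus.STOPPED)) {
            return DeploymentStatus.STOPPED;                 // all stopped
        }
        if (running == replicas.size()) return DeploymentStatus.RUNNING;
        if (running > 0) return DeploymentStatus.DEGRADED;   // some healthy, some dead
        // Zero running: FAILED once Docker's restart retries are exhausted,
        // otherwise still DEGRADED pending a restart-policy recovery.
        return retriesExhausted ? DeploymentStatus.FAILED : DeploymentStatus.DEGRADED;
    }
}
```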

Deployment Flow

Deploy stages (tracked in deploy_stage for progress UI)

  1. PRE_FLIGHT — Validate config, check JAR exists, verify base image
  2. PULL_IMAGE — Pull base image if not present locally
  3. CREATE_NETWORK — Ensure traefik and environment networks exist
  4. START_REPLICAS — Create and start N containers
  5. HEALTH_CHECK — Wait for at least one replica to pass health check
  6. SWAP_TRAFFIC — (blue/green) Stop old deployment replicas
  7. COMPLETE — Mark deployment RUNNING/DEGRADED

Pre-flight checks

Before touching any running containers:

  1. JAR file exists on disk at the path stored in app_versions
  2. Base image available (pull if missing)
  3. Resolved config is valid (memory > 0, appPort > 0, replicas >= 1)
  4. No naming conflict with containers from other apps

If any check fails → deployment marked FAILED immediately, existing deployment untouched.
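The config-validity portion (check 3) could look like this; the method name and error format are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the config-validity pre-flight check; names are illustrative.
final class PreFlight {

    // Returns human-readable failures; an empty list means the resolved config is valid.
    static List<String> validateConfig(int memoryLimitMb, int appPort, int replicas) {
        List<String> errors = new ArrayList<>();
        if (memoryLimitMb <= 0) errors.add("memoryLimitMb must be > 0");
        if (appPort <= 0) errors.add("appPort must be > 0");
        if (replicas < 1) errors.add("replicas must be >= 1");
        return errors;
    }
}
```

Collecting all failures rather than throwing on the first one lets the UI show every problem at once.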

Blue/green strategy

  1. Start all new replicas alongside the old deployment
  2. Wait for health checks on new replicas
  3. If healthy: stop and remove all old replicas, mark new deployment RUNNING
  4. If unhealthy: remove new replicas, mark new deployment FAILED, old deployment stays

Temporarily uses 2x resources during the swap window.

Rolling strategy

  1. For each replica index (0..N-1):
     a. Stop old replica at index i (if it exists)
     b. Start new replica at index i
     c. Wait for health check
     d. If unhealthy: stop the new replica, mark deployment FAILED, leave remaining old replicas
  2. After all replicas are replaced, mark deployment RUNNING
  2. After all replicas replaced, mark deployment RUNNING

Lower peak resources but slower and more complex.
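The two strategies can be contrasted as ordered action plans; this planner and its action strings are purely illustrative, not the DeploymentExecutor API:

```java
import java.util.ArrayList;
import java.util.List;

// Contrasts the two strategies as ordered action plans (strings are illustrative).
final class StrategyPlanner {

    static List<String> plan(String strategy, int replicas) {
        List<String> actions = new ArrayList<>();
        if (strategy.equals("blue-green")) {
            // New replicas start alongside the old; old ones stop only after health checks pass.
            for (int i = 0; i < replicas; i++) actions.add("start new-" + i);
            actions.add("health-check new replicas");
            for (int i = 0; i < replicas; i++) actions.add("stop old-" + i);
        } else {
            // Rolling: replace one index at a time, health-checking each step.
            for (int i = 0; i < replicas; i++) {
                actions.add("stop old-" + i);
                actions.add("start new-" + i);
                actions.add("health-check new-" + i);
            }
        }
        return actions;
    }
}
```

The plans make the trade-off visible: blue/green holds both replica sets alive between the first and last action, while rolling never runs more than one extra container.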

Container naming

{envSlug}-{appSlug}-{replicaIndex} (e.g., staging-payment-gateway-0)

Container restart policy

on-failure with a maximum of 3 retries. Docker handles transient failures; once the retries are exhausted, the container stays dead and DockerEventMonitor detects the permanent failure.

Environment variables injected

Base env vars (always set):

CAMELEER_EXPORT_TYPE=HTTP
CAMELEER_APPLICATION_ID={appSlug}
CAMELEER_ENVIRONMENT_ID={envSlug}
CAMELEER_DISPLAY_NAME={containerName}
CAMELEER_SERVER_URL={resolvedServerUrl}
CAMELEER_AUTH_TOKEN={bootstrapToken}

Plus all entries from customEnvVars in the resolved config.
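A sketch of the injection as a pure map builder (the class name, parameter list, and token sourcing are assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of env-var assembly; the builder name and parameters are assumptions.
final class AgentEnvBuilder {

    static Map<String, String> baseEnv(String envSlug, String appSlug, int replicaIndex,
                                       String serverUrl, String bootstrapToken,
                                       Map<String, String> customEnvVars) {
        // Container naming convention: {envSlug}-{appSlug}-{replicaIndex}
        String containerName = envSlug + "-" + appSlug + "-" + replicaIndex;
        Map<String, String> env = new LinkedHashMap<>();
        env.put("CAMELEER_EXPORT_TYPE", "HTTP");
        env.put("CAMELEER_APPLICATION_ID", appSlug);
        env.put("CAMELEER_ENVIRONMENT_ID", envSlug);
        env.put("CAMELEER_DISPLAY_NAME", containerName);
        env.put("CAMELEER_SERVER_URL", serverUrl);
        env.put("CAMELEER_AUTH_TOKEN", bootstrapToken);
        env.putAll(customEnvVars);  // custom vars layered on top
        return env;
    }
}
```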


Docker Event Monitor

DockerEventMonitor

@Component that starts a persistent Docker event stream on @PostConstruct.

  • Filters for containers with label managed-by=cameleer3-server
  • Listens for events: die, oom, stop, start
  • On die/oom: looks up deployment by container ID, updates replica status to DEAD, recomputes deployment status (RUNNING → DEGRADED → FAILED)
  • On start: updates replica status to RUNNING (handles Docker restart policy recoveries)
  • Reconnects automatically if the stream drops
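The per-event replica-status mapping applied by the monitor can be sketched as follows (enum values and the method name are assumptions):

```java
// Sketch of the event-to-replica-status mapping; names are assumptions.
final class EventMapper {

    enum ReplicaStatus { RUNNING, DEAD, STOPPED, UNCHANGED }

    // Docker event action → new replica status; unrelated events leave the state unchanged.
    static ReplicaStatus onEvent(String action) {
        return switch (action) {
            case "die", "oom" -> ReplicaStatus.DEAD;     // crash / out-of-memory kill
            case "stop"       -> ReplicaStatus.STOPPED;  // orchestrator-initiated stop
            case "start"      -> ReplicaStatus.RUNNING;  // restart-policy recovery
            default           -> ReplicaStatus.UNCHANGED;
        };
    }
}
```

After each mapped update, the monitor recomputes the aggregate deployment status from the replica states.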

Interaction with agent heartbeats

  • Agent heartbeats: app-level health (is the Camel context running, are routes active)
  • Docker events: infrastructure-level health (is the container alive, OOM, crash)
  • Both feed into the same deployment status. Docker events are faster for container crashes. Agent heartbeats catch app-level hangs where the container is alive but the app is stuck.

Database Migration

V7__deployment_orchestration.sql:

-- New status values and fields for deployments
ALTER TABLE deployments ADD COLUMN target_state VARCHAR(20) NOT NULL DEFAULT 'RUNNING';
ALTER TABLE deployments ADD COLUMN deployment_strategy VARCHAR(20) NOT NULL DEFAULT 'BLUE_GREEN';
ALTER TABLE deployments ADD COLUMN replica_states JSONB NOT NULL DEFAULT '[]';
ALTER TABLE deployments ADD COLUMN deploy_stage VARCHAR(30);

-- Backfill existing deployments
UPDATE deployments SET target_state = CASE
    WHEN status = 'STOPPED' THEN 'STOPPED'
    ELSE 'RUNNING'
END;

The status column remains but gains two new values: DEGRADED and STOPPING. The DeploymentStatus enum is updated to match.


UI Changes

Deployments tab — Overview

  • Replicas column in deployments table: shows {healthy}/{total} (e.g., 2/3)
  • Status badge updated for new states: DEGRADED (warning color), STOPPING (auto color)
  • Deployment progress shown when deploy_stage is not null — horizontal step indicator:
    ●━━━━●━━━━●━━━━○━━━━○━━━━○
    Pre-   Pull   Net-   Start  Health  Swap
    flight image  work   reps   check   traffic
    
    Completed steps filled, current step highlighted, failed step red.

Create App page — Resources tab

  • appPort — number input (default 8080)
  • replicas — number input (default 1)
  • deploymentStrategy — select: Blue/Green, Rolling (default Blue/Green)
  • stripPathPrefix — toggle (default true)
  • sslOffloading — toggle (default true)

Config tab — Resources sub-tab (app detail)

Same fields as the create page, shown read-only when not editing.

Environment admin page

  • routingMode — select: Path-based, Subdomain (default Path-based)
  • routingDomain — text input
  • serverUrl — text input with placeholder showing global default
  • sslOffloading — toggle (default true)

New/Modified Components Summary

Core module (cameleer3-server-core)

  • ResolvedContainerConfig — new record with all typed fields
  • ConfigMerger — pure function, three-layer merge
  • ContainerRequest — add cpuLimit, exposedPorts, restartPolicy, additionalNetworks
  • DeploymentStatus — add DEGRADED, STOPPING
  • Deployment — add targetState, deploymentStrategy, replicaStates, deployStage

App module (cameleer3-server-app)

  • DockerRuntimeOrchestrator — apply full config (memory reserve, CPU limit, exposed ports, restart policy)
  • DockerNetworkManager — new component, lazy network creation + container attachment
  • DockerEventMonitor — new component, persistent event stream listener
  • TraefikLabelBuilder — new utility, generates full Traefik label set
  • DeploymentExecutor — rewrite deploy flow with stages, pre-flight, strategy dispatch
  • V7__deployment_orchestration.sql — migration for new columns

UI

  • AppsTab.tsx — new fields in create page and config tabs
  • EnvironmentsPage.tsx — routing and SSL fields
  • DeploymentProgress component — step indicator for deploy stages
  • Status badges updated for DEGRADED/STOPPING