cameleer-server/docs/superpowers/specs/2026-04-08-docker-orchestration-design.md
hsiegeln dd4442329c docs: add ICC-disabled traefik network isolation to orchestration spec
The cameleer-traefik network disables inter-container communication
so app containers cannot reach each other directly — only through
Traefik. Environment networks keep ICC enabled for intra-env comms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 19:53:51 +02:00


Docker Container Orchestration Design

Goal

Make the DockerRuntimeOrchestrator fully functional: apply container configs (memory, CPU, ports, env vars) when starting containers, generate correct Traefik routing labels, support replicas, implement blue/green and rolling deployment strategies, and monitor container health via Docker event stream.

Scope

  • Docker single-host only (Swarm and K8s are future RuntimeOrchestrator implementations)
  • Replicas managed by the orchestrator as independent containers
  • Traefik integration for path-based and subdomain-based routing
  • Docker event stream for infrastructure-level health monitoring
  • UI changes for new config fields, replica management, and deployment progress

Network Topology

Three network tiers with lazy creation:

cameleer-infra          — server, postgres, clickhouse (databases isolated)
cameleer-traefik        — server, traefik, all app containers (ingress + agent SSE)
cameleer-env-{slug}     — app containers within one environment (inter-app only)
  • Server joins cameleer-infra + cameleer-traefik
  • App containers join cameleer-traefik + cameleer-env-{envSlug}
  • Traefik joins cameleer-traefik only
  • Databases join cameleer-infra only

App containers reach the server for SSE/heartbeats via the cameleer-traefik network. They never touch databases directly.

Network isolation

The cameleer-traefik network is created with inter-container communication (ICC) disabled (--opt com.docker.network.bridge.enable_icc=false). This means containers on the traefik network cannot communicate directly with each other — they can only be reached through Traefik's published ports. This prevents a compromised app in one environment from reaching apps in other environments via the shared routing network.

The cameleer-env-{slug} networks keep ICC enabled so apps within the same environment can discover and communicate with each other freely.

Network Manager

Wraps Docker network operations. ensureNetwork(name, iccEnabled) creates a bridge network if it doesn't exist (idempotent). The traefik network is created with iccEnabled=false, environment networks with iccEnabled=true. connectContainer(containerId, networkName) attaches a container to a second network. Called by DeploymentExecutor before container creation.
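A minimal sketch of this idempotent pattern, with the Docker calls hidden behind a tiny interface so the logic is testable without a daemon (the interface and all names here are illustrative, not the actual Docker client API):

```java
import java.util.Map;

// Hypothetical seam: the real manager would delegate these to a Docker client.
interface DockerNetworkOps {
    boolean networkExists(String name);
    void createBridgeNetwork(String name, Map<String, String> options);
    void connectContainer(String containerId, String networkName);
}

class NetworkManager {
    static final String ICC_OPT = "com.docker.network.bridge.enable_icc";
    private final DockerNetworkOps ops;

    NetworkManager(DockerNetworkOps ops) { this.ops = ops; }

    /** Creates the bridge network if missing; safe to call repeatedly. */
    void ensureNetwork(String name, boolean iccEnabled) {
        if (ops.networkExists(name)) return;
        ops.createBridgeNetwork(name, Map.of(ICC_OPT, String.valueOf(iccEnabled)));
    }

    /** Attaches an existing container to an additional network. */
    void connectContainer(String containerId, String networkName) {
        ops.connectContainer(containerId, networkName);
    }
}
```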


Configuration Model

Three-layer merge

Global defaults (application.yml)
  → Environment defaults (environments.default_container_config)
    → App overrides (apps.container_config)

App overrides environment, environment overrides global. Missing keys fall through.

Environment-level settings (defaultContainerConfig)

| Key             | Type    | Default     | Description                    |
|-----------------|---------|-------------|--------------------------------|
| memoryLimitMb   | int     | 512         | Default memory limit           |
| memoryReserveMb | int     | null        | Memory reservation             |
| cpuShares       | int     | 512         | CPU shares                     |
| cpuLimit        | float   | null        | CPU core limit                 |
| routingMode     | string  | "path"      | path or subdomain              |
| routingDomain   | string  | from global | Domain for URL generation      |
| serverUrl       | string  | from global | Server URL for agent callbacks |
| sslOffloading   | boolean | true        | Traefik terminates TLS         |

App-level settings (containerConfig)

| Key                | Type    | Default      | Description                        |
|--------------------|---------|--------------|------------------------------------|
| memoryLimitMb      | int     | from env     | Override memory limit              |
| memoryReserveMb    | int     | from env     | Override memory reservation        |
| cpuShares          | int     | from env     | Override CPU shares                |
| cpuLimit           | float   | from env     | Override CPU core limit            |
| appPort            | int     | 8080         | Main HTTP port for Traefik         |
| exposedPorts       | int[]   | []           | Additional ports (debug, JMX)      |
| customEnvVars      | map     | {}           | App-specific environment variables |
| stripPathPrefix    | boolean | true         | Traefik strips /{env}/{app} prefix |
| sslOffloading      | boolean | from env     | Override SSL offloading            |
| replicas           | int     | 1            | Number of container replicas       |
| deploymentStrategy | string  | "blue-green" | blue-green or rolling              |

ConfigMerger

Pure function: resolve(globalDefaults, envConfig, appConfig) → ResolvedContainerConfig

ResolvedContainerConfig is a typed Java record with all fields resolved to concrete values. No more scattered @Value fields in DeploymentExecutor for container-level settings.
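The merge itself is a last-non-null-wins coalesce per key. A simplified sketch of this idea, with the field set reduced for brevity (names are illustrative):

```java
// Per-layer config: null means "not set at this layer, fall through".
record ContainerConfigLayer(Integer memoryLimitMb, Integer cpuShares, Integer replicas) {}

// Fully resolved: every field has a concrete value.
record ResolvedContainerConfig(int memoryLimitMb, int cpuShares, int replicas) {}

class ConfigMerger {
    /** App overrides environment, environment overrides global. */
    static ResolvedContainerConfig resolve(ContainerConfigLayer global,
                                           ContainerConfigLayer env,
                                           ContainerConfigLayer app) {
        return new ResolvedContainerConfig(
            coalesce(app.memoryLimitMb(), env.memoryLimitMb(), global.memoryLimitMb()),
            coalesce(app.cpuShares(), env.cpuShares(), global.cpuShares()),
            coalesce(app.replicas(), env.replicas(), global.replicas()));
    }

    @SafeVarargs
    private static <T> T coalesce(T... values) {
        for (T v : values) if (v != null) return v;
        throw new IllegalStateException("no value at any layer");
    }
}
```

The global layer would carry non-null defaults for every key, so the coalesce always terminates with a concrete value.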


Traefik Label Generation

TraefikLabelBuilder

Pure function: takes app slug, env slug, resolved config. Returns Map<String, String>.

Path-based routing (routingMode: "path")

Service name derived as {envSlug}-{appSlug}.

traefik.enable=true
traefik.http.routers.{svc}.rule=PathPrefix(`/{envSlug}/{appSlug}/`)
traefik.http.routers.{svc}.entrypoints=websecure
traefik.http.services.{svc}.loadbalancer.server.port={appPort}
managed-by=cameleer3-server
cameleer.app={appSlug}
cameleer.environment={envSlug}

If stripPathPrefix is true:

traefik.http.middlewares.{svc}-strip.stripprefix.prefixes=/{envSlug}/{appSlug}
traefik.http.routers.{svc}.middlewares={svc}-strip

Subdomain-based routing (routingMode: "subdomain")

traefik.http.routers.{svc}.rule=Host(`{appSlug}-{envSlug}.{routingDomain}`)

No strip-prefix needed for subdomain routing.

SSL offloading

If sslOffloading is true:

traefik.http.routers.{svc}.tls=true
traefik.http.routers.{svc}.tls.certresolver=default

If false, Traefik passes through TLS to the container (requires the app to terminate TLS itself).

Replicas

All replicas of the same app get identical Traefik labels. Traefik automatically load-balances across containers with the same service name on the same network.
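Putting the routing, strip-prefix, and TLS rules above together, a sketch of the builder (label keys follow the Traefik v2 Docker-provider convention shown in this spec; the parameter list is illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TraefikLabelBuilder {
    static Map<String, String> build(String appSlug, String envSlug, String routingMode,
                                     String routingDomain, int appPort,
                                     boolean stripPathPrefix, boolean sslOffloading) {
        String svc = envSlug + "-" + appSlug;
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("traefik.enable", "true");
        if ("subdomain".equals(routingMode)) {
            labels.put("traefik.http.routers." + svc + ".rule",
                "Host(`" + appSlug + "-" + envSlug + "." + routingDomain + "`)");
        } else {
            labels.put("traefik.http.routers." + svc + ".rule",
                "PathPrefix(`/" + envSlug + "/" + appSlug + "/`)");
            if (stripPathPrefix) {
                labels.put("traefik.http.middlewares." + svc + "-strip.stripprefix.prefixes",
                    "/" + envSlug + "/" + appSlug);
                labels.put("traefik.http.routers." + svc + ".middlewares", svc + "-strip");
            }
        }
        labels.put("traefik.http.routers." + svc + ".entrypoints", "websecure");
        labels.put("traefik.http.services." + svc + ".loadbalancer.server.port",
            String.valueOf(appPort));
        if (sslOffloading) {
            labels.put("traefik.http.routers." + svc + ".tls", "true");
            labels.put("traefik.http.routers." + svc + ".tls.certresolver", "default");
        }
        labels.put("managed-by", "cameleer3-server");
        labels.put("cameleer.app", appSlug);
        labels.put("cameleer.environment", envSlug);
        return labels;
    }
}
```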


Deployment Status Model

New fields on deployments table

| Column              | Type    | Description                                            |
|---------------------|---------|--------------------------------------------------------|
| target_state        | varchar | RUNNING or STOPPED                                     |
| deployment_strategy | varchar | BLUE_GREEN or ROLLING                                  |
| replica_states      | jsonb   | Array of {index, containerId, status}                  |
| deploy_stage        | varchar | Current stage for progress tracking (null when stable) |

Status values

STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED

State transitions

Target: RUNNING
  STOPPED → STARTING → RUNNING     (all replicas healthy)
                     → DEGRADED    (some replicas healthy, some dead)
                     → FAILED      (zero healthy / pre-flight failed)
  RUNNING → DEGRADED               (replica dies)
  DEGRADED → RUNNING               (replica recovers via restart policy)
  DEGRADED → FAILED                (all dead, retries exhausted)

Target: STOPPED
  RUNNING/DEGRADED → STOPPING → STOPPED  (all replicas stopped and removed)
                              → FAILED   (couldn't stop some replicas)

Aggregate derivation

Deployment status is derived from replica states:

  • All replicas RUNNING → deployment RUNNING
  • At least one RUNNING, some DEAD → deployment DEGRADED
  • Zero RUNNING after retries → deployment FAILED
  • All stopped → deployment STOPPED
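The derivation above is a pure function over the replica states. A sketch (status names follow the spec; the retries flag and ReplicaStatus type are illustrative):

```java
import java.util.List;

enum ReplicaStatus { RUNNING, DEAD, STOPPED }
enum DeploymentStatus { STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED }

class DeploymentStatusDeriver {
    /** retriesExhausted: Docker's on-failure restart budget is spent. */
    static DeploymentStatus derive(List<ReplicaStatus> replicas, boolean retriesExhausted) {
        long running = replicas.stream().filter(r -> r == ReplicaStatus.RUNNING).count();
        long dead = replicas.stream().filter(r -> r == ReplicaStatus.DEAD).count();
        if (!replicas.isEmpty() && running == replicas.size()) return DeploymentStatus.RUNNING;
        if (running > 0 && dead > 0) return DeploymentStatus.DEGRADED;
        if (running == 0 && dead > 0) {
            // Zero healthy: FAILED once retries are exhausted, else still recoverable.
            return retriesExhausted ? DeploymentStatus.FAILED : DeploymentStatus.DEGRADED;
        }
        return DeploymentStatus.STOPPED; // everything stopped and removed
    }
}
```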

Deployment Flow

Deploy stages (tracked in deploy_stage for progress UI)

  1. PRE_FLIGHT — Validate config, check JAR exists, verify base image
  2. PULL_IMAGE — Pull base image if not present locally
  3. CREATE_NETWORK — Ensure traefik and environment networks exist
  4. START_REPLICAS — Create and start N containers
  5. HEALTH_CHECK — Wait for at least one replica to pass health check
  6. SWAP_TRAFFIC — (blue/green) Stop old deployment replicas
  7. COMPLETE — Mark deployment RUNNING/DEGRADED

Pre-flight checks

Before touching any running containers:

  1. JAR file exists on disk at the path stored in app_versions
  2. Base image available (pull if missing)
  3. Resolved config is valid (memory > 0, appPort > 0, replicas >= 1)
  4. No naming conflict with containers from other apps

If any check fails → deployment marked FAILED immediately, existing deployment untouched.
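The config-validity portion of these checks (step 3) can be a small guard that collects every violation at once instead of failing on the first, so the FAILED deployment carries a complete error list. A sketch, with the resolved-config shape reduced for brevity (names illustrative):

```java
import java.util.ArrayList;
import java.util.List;

record ResolvedConfig(int memoryLimitMb, int appPort, int replicas) {}

class PreFlightValidator {
    /** Empty list means the config passes; otherwise one message per violation. */
    static List<String> validate(ResolvedConfig cfg) {
        List<String> errors = new ArrayList<>();
        if (cfg.memoryLimitMb() <= 0) errors.add("memoryLimitMb must be > 0");
        if (cfg.appPort() <= 0 || cfg.appPort() > 65535) errors.add("appPort must be a valid port");
        if (cfg.replicas() < 1) errors.add("replicas must be >= 1");
        return errors;
    }
}
```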

Blue/green strategy

  1. Start all new replicas alongside the old deployment
  2. Wait for health checks on new replicas
  3. If healthy: stop and remove all old replicas, mark new deployment RUNNING
  4. If unhealthy: remove new replicas, mark new deployment FAILED, old deployment stays

Temporarily uses 2x resources during the swap window.

Rolling strategy

  1. For each replica index (0..N-1):
     a. Stop old replica at index i (if it exists)
     b. Start new replica at index i
     c. Wait for health check
     d. If unhealthy: stop the new replica, mark deployment FAILED, leave remaining old replicas
  2. After all replicas are replaced, mark deployment RUNNING

Lower peak resources but slower and more complex.
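The rolling loop can be sketched against a small set of injected container operations, which also makes the failure path (stop the new replica, mark FAILED, leave the remaining old replicas) easy to unit-test (the interface and names here are illustrative):

```java
// Hypothetical operations the executor would delegate to the Docker runtime.
interface ReplicaOps {
    void stopOldReplica(int index);
    String startNewReplica(int index);          // returns container id
    boolean waitForHealthy(String containerId); // health check with timeout
    void stopContainer(String containerId);
}

class RollingDeployer {
    /** Replaces replicas one by one; returns true when the whole rollout succeeds. */
    static boolean deploy(ReplicaOps ops, int replicas) {
        for (int i = 0; i < replicas; i++) {
            ops.stopOldReplica(i);
            String id = ops.startNewReplica(i);
            if (!ops.waitForHealthy(id)) {
                ops.stopContainer(id);
                return false; // mark deployment FAILED; remaining old replicas stay up
            }
        }
        return true; // mark deployment RUNNING
    }
}
```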

Container naming

{envSlug}-{appSlug}-{replicaIndex} (e.g., staging-payment-gateway-0)

Container restart policy

on-failure with max 3 retries. Docker handles transient failures. Once the 3 retries are exhausted, the container stays dead and DockerEventMonitor detects the permanent failure.

Environment variables injected

Base env vars (always set):

CAMELEER_EXPORT_TYPE=HTTP
CAMELEER_APPLICATION_ID={appSlug}
CAMELEER_ENVIRONMENT_ID={envSlug}
CAMELEER_DISPLAY_NAME={containerName}
CAMELEER_SERVER_URL={resolvedServerUrl}
CAMELEER_AUTH_TOKEN={bootstrapToken}

Plus all entries from customEnvVars in the resolved config.
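Assembling the injected environment is then a straightforward merge of the base variables with customEnvVars (a sketch; variable names as in the spec, the surrounding method shape is illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EnvVarBuilder {
    static Map<String, String> build(String appSlug, String envSlug, String containerName,
                                     String serverUrl, String bootstrapToken,
                                     Map<String, String> customEnvVars) {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("CAMELEER_EXPORT_TYPE", "HTTP");
        env.put("CAMELEER_APPLICATION_ID", appSlug);
        env.put("CAMELEER_ENVIRONMENT_ID", envSlug);
        env.put("CAMELEER_DISPLAY_NAME", containerName);
        env.put("CAMELEER_SERVER_URL", serverUrl);
        env.put("CAMELEER_AUTH_TOKEN", bootstrapToken);
        // Custom vars added last, so in this sketch they win on key collisions;
        // the real precedence between base and custom vars is a design choice.
        env.putAll(customEnvVars);
        return env;
    }
}
```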


Docker Event Monitor

DockerEventMonitor

@Component that starts a persistent Docker event stream on @PostConstruct.

  • Filters for containers with label managed-by=cameleer3-server
  • Listens for events: die, oom, stop, start
  • On die/oom: looks up deployment by container ID, updates replica status to DEAD, recomputes deployment status (RUNNING → DEGRADED → FAILED)
  • On start: updates replica status to RUNNING (handles Docker restart policy recoveries)
  • Reconnects automatically if the stream drops

Interaction with agent heartbeats

  • Agent heartbeats: app-level health (is the Camel context running, are routes active)
  • Docker events: infrastructure-level health (is the container alive, OOM, crash)
  • Both feed into the same deployment status. Docker events are faster for container crashes. Agent heartbeats catch app-level hangs where the container is alive but the app is stuck.
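The per-event handling in the monitor reduces to a small mapping from the Docker event action to the replica transition it implies, which can be kept separate from the stream plumbing and unit-tested (a sketch; the method and return shape are illustrative):

```java
import java.util.Optional;

class EventMapper {
    /** Maps a raw Docker event action to the replica status it implies, if any. */
    static Optional<String> replicaStatusFor(String action) {
        return switch (action) {
            case "die", "oom" -> Optional.of("DEAD");  // crash or OOM kill
            case "stop" -> Optional.of("STOPPED");      // deliberate stop
            case "start" -> Optional.of("RUNNING");     // restart-policy recovery
            default -> Optional.empty();                 // unrelated event, ignore
        };
    }
}
```

After this mapping, the monitor would update the matching replica entry and re-run the aggregate status derivation described earlier.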

Database Migration

V7__deployment_orchestration.sql:

-- New status values and fields for deployments
ALTER TABLE deployments ADD COLUMN target_state VARCHAR(20) NOT NULL DEFAULT 'RUNNING';
ALTER TABLE deployments ADD COLUMN deployment_strategy VARCHAR(20) NOT NULL DEFAULT 'BLUE_GREEN';
ALTER TABLE deployments ADD COLUMN replica_states JSONB NOT NULL DEFAULT '[]';
ALTER TABLE deployments ADD COLUMN deploy_stage VARCHAR(30);

-- Backfill existing deployments
UPDATE deployments SET target_state = CASE
    WHEN status = 'STOPPED' THEN 'STOPPED'
    ELSE 'RUNNING'
END;

The status column remains but gains two new values: DEGRADED and STOPPING. The DeploymentStatus enum is updated to match.


UI Changes

Deployments tab — Overview

  • Replicas column in deployments table: shows {healthy}/{total} (e.g., 2/3)
  • Status badge updated for new states: DEGRADED (warning color), STOPPING (auto color)
  • Deployment progress shown when deploy_stage is not null — horizontal step indicator:
    ●━━━━●━━━━●━━━━○━━━━○━━━━○
    Pre-    Pull   Create  Start  Health  Swap
    flight  image  network reps   check   traffic
    
    Completed steps filled, current step highlighted, failed step red.

Create App page — Resources tab

  • appPort — number input (default 8080)
  • replicas — number input (default 1)
  • deploymentStrategy — select: Blue/Green, Rolling (default Blue/Green)
  • stripPathPrefix — toggle (default true)
  • sslOffloading — toggle (default true)

Config tab — Resources sub-tab (app detail)

Same fields as the create page; shown read-only when not editing.

Environment admin page

  • routingMode — select: Path-based, Subdomain (default Path-based)
  • routingDomain — text input
  • serverUrl — text input with placeholder showing global default
  • sslOffloading — toggle (default true)

New/Modified Components Summary

Core module (cameleer3-server-core)

  • ResolvedContainerConfig — new record with all typed fields
  • ConfigMerger — pure function, three-layer merge
  • ContainerRequest — add cpuLimit, exposedPorts, restartPolicy, additionalNetworks
  • DeploymentStatus — add DEGRADED, STOPPING
  • Deployment — add targetState, deploymentStrategy, replicaStates, deployStage

App module (cameleer3-server-app)

  • DockerRuntimeOrchestrator — apply full config (memory reserve, CPU limit, exposed ports, restart policy)
  • DockerNetworkManager — new component, lazy network creation + container attachment
  • DockerEventMonitor — new component, persistent event stream listener
  • TraefikLabelBuilder — new utility, generates full Traefik label set
  • DeploymentExecutor — rewrite deploy flow with stages, pre-flight, strategy dispatch
  • V7__deployment_orchestration.sql — migration for new columns

UI

  • AppsTab.tsx — new fields in create page and config tabs
  • EnvironmentsPage.tsx — routing and SSL fields
  • DeploymentProgress component — step indicator for deploy stages
  • Status badges updated for DEGRADED/STOPPING