cameleer-server/.claude/rules/docker-orchestration.md
hsiegeln 21db92ff00
fix(traefik): make TLS cert resolver configurable, omit when unset
Previously `TraefikLabelBuilder` hardcoded `tls.certresolver=default` on
every router. That assumes a resolver literally named `default` exists
in the Traefik static config — true for ACME-backed installs, false for
dev/local installs that use a file-based TLS store. Traefik logs
"Router uses a nonexistent certificate resolver" for the bogus resolver
on every managed app, and any future attempt to define a differently-
named real resolver would silently skip these routers.

Server-wide setting via `CAMELEER_SERVER_RUNTIME_CERTRESOLVER` (empty by
default) flows through `ConfigMerger.GlobalRuntimeDefaults.certResolver`
into `ResolvedContainerConfig.certResolver`. When blank the
`tls.certresolver` label is omitted entirely; `tls=true` is still
emitted so Traefik serves the default TLS-store cert. When set, the
label is emitted with the configured resolver name.

Not per-app/per-env configurable: there is one Traefik per server
instance and one resolver config; app-level override would only let
users break their own routers.

TDD: TraefikLabelBuilderTest gains 3 cases (resolver set, null, blank).
Full unit suite 211/0/0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:18:47 +02:00


paths
cameleer-server-app/**/runtime/**
cameleer-server-core/**/runtime/**
deploy/**
docker-compose*.yml
Dockerfile
docker-entrypoint.sh

Docker Orchestration

When deployed via the cameleer-saas platform, this server orchestrates customer app containers using Docker. Key components:

  • ConfigMerger (core/runtime/ConfigMerger.java) — pure function: resolve(globalDefaults, envConfig, appConfig) -> ResolvedContainerConfig. Three-layer merge: global (application.yml) -> environment (defaultContainerConfig JSONB) -> app (containerConfig JSONB). Includes runtimeType (default "auto") and customArgs (default "").
  • TraefikLabelBuilder (app/runtime/TraefikLabelBuilder.java) — generates Traefik Docker labels for path-based (/{envSlug}/{appSlug}/) or subdomain-based ({appSlug}-{envSlug}.{domain}) routing. Supports strip-prefix and SSL offloading toggles. Per-replica identity labels: cameleer.replica (index), cameleer.generation (8-char deployment UUID prefix — pin Prometheus/Grafana deploy boundaries with this), cameleer.instance-id ({envSlug}-{appSlug}-{replicaIndex}-{generation}). Traefik router/service keys deliberately omit the generation so load balancing spans old + new replicas during a blue/green overlap. When ResolvedContainerConfig.externalRouting() is false (UI: Resources → External Routing, default true), the builder emits ONLY the identity labels (managed-by, cameleer.*) and skips every traefik.* label — the container stays on cameleer-traefik and the per-env network (so sibling containers can still reach it via Docker DNS) but is invisible to Traefik. The tls.certresolver label is emitted only when CAMELEER_SERVER_RUNTIME_CERTRESOLVER is set to a non-blank resolver name (matching a resolver configured in the Traefik static config). When unset (dev installs backed by a static TLS store) only tls=true is emitted and Traefik serves the default cert from the TLS store.
  • PrometheusLabelBuilder (app/runtime/PrometheusLabelBuilder.java) — generates Prometheus docker_sd_configs labels per resolved runtime type: Spring Boot /actuator/prometheus:8081, Quarkus/native /q/metrics:9000, plain Java /metrics:9464. Labels merged into container metadata alongside Traefik labels at deploy time.
  • DockerNetworkManager (app/runtime/DockerNetworkManager.java) — manages two Docker network tiers:
    • cameleer-traefik — shared network; Traefik, server, and all app containers attach here. Server joined via docker-compose with cameleer-server DNS alias.
    • cameleer-env-{slug} — per-environment isolated network; containers in the same environment discover each other via Docker DNS. In SaaS mode, env networks are tenant-scoped: cameleer-env-{tenantId}-{envSlug} (overloaded envNetworkName(tenantId, envSlug) method) to prevent cross-tenant collisions when multiple tenants have identically-named environments.
  • DockerEventMonitor (app/runtime/DockerEventMonitor.java) — persistent Docker event stream listener for containers with managed-by=cameleer-server label. Detects die/oom/start/stop events and updates deployment replica states. Periodic reconciliation (@Scheduled every 30s) inspects actual container state and corrects deployment status mismatches (fixes stale DEGRADED with all replicas healthy).
  • DeploymentProgress (ui/src/components/DeploymentProgress.tsx) — UI step indicator showing 7 deploy stages with amber active/green completed styling.
  • ContainerLogForwarder (app/runtime/ContainerLogForwarder.java) — streams Docker container stdout/stderr to ClickHouse logs table with source='container'. Uses docker logs --follow per container, batches lines every 2s or 50 lines. Parses Docker timestamp prefix, infers log level via regex. DeploymentExecutor starts capture after each replica launches with the replica's instanceId ({envSlug}-{appSlug}-{replicaIndex}-{generation}); DockerEventMonitor stops capture on die/oom. 60-second max capture timeout with 30s cleanup scheduler. Thread pool of 10 daemon threads. Container logs use the same instanceId as the agent (set via CAMELEER_AGENT_INSTANCEID env var) for unified log correlation at the instance level. Instance-id changes per deployment — cross-deploy queries aggregate on application + environment (and optionally replica_index).
  • StartupLogPanel (ui/src/components/StartupLogPanel.tsx) — collapsible log panel rendered below DeploymentProgress. Queries /api/v1/logs?source=container&application={appSlug}&environment={envSlug}. Auto-polls every 3s while deployment is STARTING; shows green "live" badge during polling, red "stopped" badge on FAILED. Uses useStartupLogs hook and LogViewer (design system).
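The cert resolver rule from the TraefikLabelBuilder bullet can be sketched as follows. This is a minimal illustration, not the actual builder: the class name, method name, and router key format are hypothetical, and it assumes SSL offloading is enabled so that `tls=true` applies. The label keys follow Traefik's Docker label convention for routers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the real TraefikLabelBuilder.
final class TlsLabelSketch {

    // routerKey is an assumed, generation-free router name such as "dev-orders".
    static Map<String, String> tlsLabels(String routerKey, String certResolver) {
        Map<String, String> labels = new LinkedHashMap<>();
        String prefix = "traefik.http.routers." + routerKey;
        // tls=true is emitted in both cases so Traefik can serve the default TLS-store cert.
        labels.put(prefix + ".tls", "true");
        // The resolver label is added only when a non-blank resolver name is configured.
        if (certResolver != null && !certResolver.isBlank()) {
            labels.put(prefix + ".tls.certresolver", certResolver);
        }
        return labels;
    }
}
```

With a null or blank resolver the returned map holds only the `tls` entry, which mirrors the three test cases named in the commit message (resolver set, null, blank).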

DeploymentExecutor Details

Primary network for app containers is set via CAMELEER_SERVER_RUNTIME_DOCKERNETWORK env var (in SaaS mode: cameleer-tenant-{slug}); apps also connect to cameleer-traefik (routing) and cameleer-env-{tenantId}-{envSlug} (per-environment discovery) as additional networks. The executor resolves runtimeType: auto to a concrete type from AppVersion.detectedRuntimeType at PRE_FLIGHT (the deployment fails if the type cannot be resolved). It builds the Docker entrypoint per runtime type: JVM types attach -javaagent:/app/agent.jar, with Spring Boot and Quarkus launching via -jar and plain Java via -cp with the main class; native runs the binary directly. It sets the per-replica CAMELEER_AGENT_INSTANCEID env var to {envSlug}-{appSlug}-{replicaIndex}-{generation} so container logs and agent logs share the same instance identity, and sets CAMELEER_AGENT_* env vars from ResolvedContainerConfig (routeControlEnabled, replayEnabled, health port). These are startup-only agent properties; changing them requires redeployment.
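As a rough illustration of the per-type entrypoint rule, the sketch below builds a start command string. It is not the real DeploymentExecutor code: the class and method names are hypothetical, the `native` spelling and the `app.jar` / `./app` paths are assumptions, and it simply appends custom arguments at the end of the command.

```java
// Hypothetical sketch; not the real DeploymentExecutor code.
final class EntrypointSketch {

    private static final String AGENT = "-javaagent:/app/agent.jar";

    // runtimeType uses the detector's spellings; "native" and the file paths are assumed.
    static String startCommand(String runtimeType, String mainClass, String customArgs) {
        String base;
        switch (runtimeType) {
            case "spring-boot":
            case "quarkus":
                base = "java " + AGENT + " -jar app.jar";
                break;
            case "plain-java":
                // Plain Java needs the explicit main class instead of -jar.
                base = "java " + AGENT + " -cp app.jar " + mainClass;
                break;
            case "native":
                // Native binaries run directly, without a JVM or agent flag.
                base = "./app";
                break;
            default:
                // "auto" must already be resolved at PRE_FLIGHT.
                throw new IllegalArgumentException("unresolved runtime type: " + runtimeType);
        }
        boolean hasArgs = customArgs != null && !customArgs.isBlank();
        return hasArgs ? base + " " + customArgs : base;
    }
}
```

An unresolved type throws here, which corresponds to the deployment failing at PRE_FLIGHT rather than starting a container with a guessed command.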

Container naming: {tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{generation}, where generation is the first 8 characters of the deployment UUID. The generation suffix lets old + new replicas coexist during a blue/green swap (deterministic names without a generation used to fail with a 409 name conflict). All lookups across the executor, DockerEventMonitor, and ContainerLogForwarder key on container id, not name; the name exists only for operator visibility.
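The naming scheme can be sketched as below. The class and method names are hypothetical; only the name formats and the 8-character generation prefix come from the text above.

```java
import java.util.UUID;

// Hypothetical helper; illustrates the documented name formats only.
final class ReplicaNamingSketch {

    // Generation = first 8 characters of the deployment UUID's string form.
    static String generation(UUID deploymentId) {
        return deploymentId.toString().substring(0, 8);
    }

    // {envSlug}-{appSlug}-{replicaIndex}-{generation}
    static String instanceId(String envSlug, String appSlug, int replicaIndex, UUID deploymentId) {
        return envSlug + "-" + appSlug + "-" + replicaIndex + "-" + generation(deploymentId);
    }

    // {tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{generation}
    static String containerName(String tenantId, String envSlug, String appSlug,
                                int replicaIndex, UUID deploymentId) {
        return tenantId + "-" + instanceId(envSlug, appSlug, replicaIndex, deploymentId);
    }
}
```

Two deployments of the same app and replica index differ only in the generation suffix, which is what allows old and new containers to exist side by side.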

Strategy dispatch: DeploymentStrategy.fromWire(config.deploymentStrategy()) selects the executor branch. Unknown values fall back to BLUE_GREEN so misconfiguration never throws at runtime.

  • Blue/green (default): start all N new replicas → wait for ALL healthy → stop the previous deployment. Resource peak ≈ 2× replicas for the health-check window. Partial health aborts with status FAILED; the previous deployment is preserved untouched (user's safety net).
  • Rolling: replace replicas one at a time — start new[i] → wait healthy → stop old[i] → next. Resource peak = replicas + 1. Mid-rollout health failure stops in-flight new containers and aborts; already-replaced old replicas are NOT restored (not reversible) but un-replaced old[i+1..N] keep serving traffic. User redeploys to recover.
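A minimal sketch of a lenient strategy parser with the documented fallback, plus the resource peaks stated in the two bullets above. The enum name, the accepted wire spellings, and the `peakReplicas` helper are assumptions for illustration; the real `DeploymentStrategy.fromWire` may normalize input differently.

```java
import java.util.Locale;

// Hypothetical stand-in for DeploymentStrategy; wire spellings are assumed.
enum DeploymentStrategySketch {
    BLUE_GREEN, ROLLING;

    // Never throws: null, blank, or unknown input falls back to BLUE_GREEN.
    static DeploymentStrategySketch fromWire(String wire) {
        if (wire == null) {
            return BLUE_GREEN;
        }
        String normalized = wire.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        try {
            return valueOf(normalized);
        } catch (IllegalArgumentException e) {
            return BLUE_GREEN;
        }
    }

    // Approximate peak of simultaneously running replicas during a deploy.
    int peakReplicas(int replicas) {
        return this == BLUE_GREEN ? 2 * replicas : replicas + 1;
    }
}
```

For three replicas this gives a peak of six containers under blue/green and four under rolling, which is the trade-off between a preserved fallback deployment and lower resource use.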

Traffic routing is implicit: Traefik labels (cameleer.app, cameleer.environment) are generation-agnostic, so new replicas attract load balancing as soon as they come up healthy — no explicit swap step.

Deployment Status Model

| Status | Meaning |
|---|---|
| STOPPED | Intentionally stopped or initial state |
| STARTING | Deploy in progress |
| RUNNING | All replicas healthy and serving |
| DEGRADED | Post-deploy: a replica died after the deploy was marked RUNNING. Set by DockerEventMonitor reconciliation, never by DeploymentExecutor directly. |
| STOPPING | Graceful shutdown in progress |
| FAILED | Terminal failure (pre-flight, health check, or crash). Partial-healthy deploys are marked FAILED; DEGRADED is reserved for post-deploy drift. |

Deploy stages (DeployStage): PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE (or FAILED at any stage). Rolling reuses the same stage labels inside the per-replica loop; the UI progress bar shows the most recent stage.

Deployment retention: DeploymentService.createDeployment() deletes FAILED deployments for the same app+environment before creating a new one, preventing failed-attempt buildup. STOPPED deployments are preserved as restorable checkpoints — the UI Checkpoints disclosure lists every deployment with a non-null deployed_config_snapshot (RUNNING, DEGRADED, STOPPED) minus the current one.

JAR Management

  • Retention policy per environment: configurable maximum number of JAR versions to keep. Older JARs are deleted automatically.
  • Nightly cleanup job (JarRetentionJob, Spring @Scheduled 03:00): purges JARs exceeding the retention limit and removes orphaned files not referenced by any app version. Skips versions currently deployed.
  • Volume-based JAR mounting for Docker-in-Docker setups: set CAMELEER_SERVER_RUNTIME_JARDOCKERVOLUME to the Docker volume name that contains the JAR storage directory. When set, the orchestrator mounts this volume into the container instead of bind-mounting the host path (required when the SaaS container itself runs inside Docker and the host path is not accessible from sibling containers).
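The retention rule (keep the newest N versions, never delete a deployed one) might look like the sketch below. It is an illustration under assumptions: versions are represented by plain version numbers where a higher number means newer, and the class and method names are hypothetical rather than taken from JarRetentionJob.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the selection step; not the real JarRetentionJob.
final class JarRetentionSketch {

    // Returns the version numbers eligible for deletion, newest first.
    static List<Integer> selectForDeletion(List<Integer> versionNumbers,
                                           int maxToKeep,
                                           Set<Integer> deployedVersions) {
        List<Integer> newestFirst = new ArrayList<>(versionNumbers);
        newestFirst.sort(Comparator.reverseOrder());

        List<Integer> toDelete = new ArrayList<>();
        // Everything past the first maxToKeep entries exceeds the retention limit.
        for (int i = Math.max(0, maxToKeep); i < newestFirst.size(); i++) {
            Integer candidate = newestFirst.get(i);
            // Versions that are currently deployed are skipped.
            if (!deployedVersions.contains(candidate)) {
                toDelete.add(candidate);
            }
        }
        return toDelete;
    }
}
```

With versions 1 to 5, a limit of 2, and version 2 deployed, the sketch keeps 5 and 4, skips 2, and selects 3 and 1 for deletion.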

Runtime Type Detection

The server detects the app framework from uploaded JARs and builds Docker entrypoints. The agent shaded JAR bundles the log appender, so no separate cameleer-log-appender.jar or PropertiesLauncher is needed:

  • Detection (RuntimeDetector): runs at JAR upload time. Checks ZIP magic bytes (non-ZIP = native binary), then probes META-INF/MANIFEST.MF Main-Class: Spring Boot loader prefix -> spring-boot, Quarkus entry point -> quarkus, other Main-Class -> plain-java (extracts class name). Results stored on AppVersion (detected_runtime_type, detected_main_class).
  • Runtime types (RuntimeType enum): AUTO, SPRING_BOOT, QUARKUS, PLAIN_JAVA, NATIVE. Configurable per app/environment via containerConfig.runtimeType (default "auto").
  • Entrypoint per type: JVM types run java -javaagent:/app/agent.jar. Spring Boot and Quarkus launch with -jar app.jar, while plain Java uses -cp with an explicit main class instead of -jar. Native runs the binary directly.
  • Custom arguments (containerConfig.customArgs): freeform string appended to the start command. Validated against a strict pattern to prevent shell injection (entrypoint uses sh -c).
  • AUTO resolution: at deploy time (PRE_FLIGHT), "auto" resolves to the detected type from AppVersion. Fails deployment if detection was unsuccessful — user must set type explicitly.
  • UI: Resources tab shows Runtime Type dropdown (with detection hint from latest uploaded version) and Custom Arguments text field.
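The detection order described in the first bullet can be sketched as a pure function over the file's leading bytes and its manifest Main-Class. The class name and method are hypothetical, and the concrete Spring Boot loader prefix and Quarkus entry point strings are assumptions based on those frameworks' usual launcher classes, not values confirmed from RuntimeDetector.

```java
import java.util.Optional;

// Hypothetical sketch; the real RuntimeDetector may check different strings.
final class RuntimeDetectionSketch {

    enum Detected { SPRING_BOOT, QUARKUS, PLAIN_JAVA, NATIVE }

    // ZIP local file header magic: 'P' 'K' 0x03 0x04.
    static boolean isZip(byte[] head) {
        return head.length >= 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4;
    }

    // Empty result means detection failed and the type must be set explicitly.
    static Optional<Detected> classify(byte[] head, String mainClass) {
        if (!isZip(head)) {
            return Optional.of(Detected.NATIVE);
        }
        if (mainClass == null || mainClass.isBlank()) {
            return Optional.empty();
        }
        if (mainClass.startsWith("org.springframework.boot.loader.")) {
            return Optional.of(Detected.SPRING_BOOT);
        }
        if (mainClass.equals("io.quarkus.bootstrap.runner.QuarkusEntryPoint")) {
            return Optional.of(Detected.QUARKUS);
        }
        // Any other Main-Class is treated as plain Java; the class name is kept for -cp.
        return Optional.of(Detected.PLAIN_JAVA);
    }
}
```

An empty result corresponds to the AUTO resolution failure at PRE_FLIGHT, where the user has to pick a runtime type by hand.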

SaaS Multi-Tenant Network Isolation

In SaaS mode, each tenant's server and its deployed apps are isolated at the Docker network level:

  • Tenant network (cameleer-tenant-{slug}) — primary internal bridge for all of a tenant's containers. Set as CAMELEER_SERVER_RUNTIME_DOCKERNETWORK for the tenant's server instance. Tenant A's apps cannot reach tenant B's apps.
  • Shared services network — server also connects to the shared infrastructure network (PostgreSQL, ClickHouse, Logto) and cameleer-traefik for HTTP routing.
  • Tenant-scoped environment networks (cameleer-env-{tenantId}-{envSlug}) — per-environment discovery is scoped per tenant, so alpha-corp's "dev" environment network is separate from beta-corp's "dev" environment network.
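The network names above follow simple patterns, sketched here. The class is hypothetical; `envNetworkName(tenantId, envSlug)` mirrors the overload mentioned for DockerNetworkManager, and using `alpha-corp` / `beta-corp` as tenant ids is an assumption for the example.

```java
// Hypothetical helper showing the documented network name patterns.
final class NetworkNamesSketch {

    // Standalone mode: cameleer-env-{slug}
    static String envNetworkName(String envSlug) {
        return "cameleer-env-" + envSlug;
    }

    // SaaS mode: cameleer-env-{tenantId}-{envSlug}
    static String envNetworkName(String tenantId, String envSlug) {
        return "cameleer-env-" + tenantId + "-" + envSlug;
    }

    // SaaS mode primary network: cameleer-tenant-{slug}
    static String tenantNetworkName(String tenantSlug) {
        return "cameleer-tenant-" + tenantSlug;
    }
}
```

Because the tenant id is part of the name, two tenants that both define a "dev" environment end up on different Docker networks.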

nginx / Reverse Proxy

  • client_max_body_size 200m is required in the nginx config to allow JAR uploads up to 200 MB. Without this, large JAR uploads return 413.