Customers running this server with no overrides reach the public registry alias, not the internal hostname. registry.cameleer.io and gitea.siegeln.net resolve to the same registry: buildtime CI keeps pushing to gitea.siegeln.net, runtime defaults pull via the public alias.

- application.yml: baseimage, loaderimage defaults
- DeploymentExecutor.java: matching @Value defaults
- docker-orchestration.md: updates the documented default and notes the buildtime/public split so future changes don't "fix" the asymmetry

Out of scope (intentionally still on gitea.siegeln.net):

- LoaderHardeningIT and the two DockerRuntimeOrchestrator unit tests. Tests are buildtime artifacts; LoaderHardeningIT pulls the real image via CI's pre-authenticated docker login to gitea.siegeln.net.
- deploy/base/*.yaml and deploy/overlays/main/*.yaml (internal k3s, customers don't use these manifests).
- pom.xml, .npmrc, ui/Dockerfile (build dependency sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Docker Orchestration
When deployed via the cameleer-saas platform, this server orchestrates customer app containers using Docker. Key components:
- ConfigMerger (core/runtime/ConfigMerger.java): pure function resolve(globalDefaults, envConfig, appConfig) -> ResolvedContainerConfig. Three-layer merge: global (application.yml) -> environment (defaultContainerConfig JSONB) -> app (containerConfig JSONB). Includes runtimeType (default "auto") and customArgs (default "").
- TraefikLabelBuilder (app/runtime/TraefikLabelBuilder.java): generates Traefik Docker labels for path-based (/{envSlug}/{appSlug}/) or subdomain-based ({appSlug}-{envSlug}.{domain}) routing. Supports strip-prefix and SSL offloading toggles. Per-replica identity labels: cameleer.replica (index), cameleer.generation (8-char deployment UUID prefix; pin Prometheus/Grafana deploy boundaries with this), cameleer.instance-id ({envSlug}-{appSlug}-{replicaIndex}-{generation}). Traefik router/service keys deliberately omit the generation so load balancing spans old + new replicas during a blue/green overlap. When ResolvedContainerConfig.externalRouting() is false (UI: Resources → External Routing, default true), the builder emits ONLY the identity labels (managed-by, cameleer.*) and skips every traefik.* label; the container stays on cameleer-traefik and the per-env network (so sibling containers can still reach it via Docker DNS) but is invisible to Traefik. The tls.certresolver label is emitted only when CAMELEER_SERVER_RUNTIME_CERTRESOLVER is set to a non-blank resolver name (matching a resolver configured in the Traefik static config). When unset (dev installs backed by a static TLS store) only tls=true is emitted and Traefik serves the default cert from the TLS store.
- PrometheusLabelBuilder (app/runtime/PrometheusLabelBuilder.java): generates Prometheus docker_sd_configs labels per resolved runtime type: Spring Boot /actuator/prometheus:8081, Quarkus/native /q/metrics:9000, plain Java /metrics:9464. Labels are merged into container metadata alongside Traefik labels at deploy time.
- DockerNetworkManager (app/runtime/DockerNetworkManager.java): manages two Docker network tiers:
  - cameleer-traefik: shared network; Traefik, server, and all app containers attach here. Server joined via docker-compose with cameleer-server DNS alias.
  - cameleer-env-{slug}: per-environment isolated network; containers in the same environment discover each other via Docker DNS. In SaaS mode, env networks are tenant-scoped: cameleer-env-{tenantId}-{envSlug} (overloaded envNetworkName(tenantId, envSlug) method) to prevent cross-tenant collisions when multiple tenants have identically-named environments.
- DockerEventMonitor (app/runtime/DockerEventMonitor.java): persistent Docker event stream listener for containers with the managed-by=cameleer-server label. Detects die/oom/start/stop events and updates deployment replica states. Periodic reconciliation (@Scheduled every 30s) inspects actual container state and corrects deployment status mismatches (fixes stale DEGRADED with all replicas healthy).
- DeploymentProgress (ui/src/components/DeploymentProgress.tsx): UI step indicator showing 7 deploy stages with amber active/green completed styling.
- ContainerLogForwarder (app/runtime/ContainerLogForwarder.java): streams Docker container stdout/stderr to the ClickHouse logs table with source='container'. Uses docker logs --follow per container, batches lines every 2s or 50 lines. Parses Docker timestamp prefix, infers log level via regex. DeploymentExecutor starts capture after each replica launches with the replica's instanceId ({envSlug}-{appSlug}-{replicaIndex}-{generation}); DockerEventMonitor stops capture on die/oom. 60-second max capture timeout with 30s cleanup scheduler. Thread pool of 10 daemon threads. Container logs use the same instanceId as the agent (set via CAMELEER_AGENT_INSTANCEID env var) for unified log correlation at the instance level. Instance-id changes per deployment; cross-deploy queries aggregate on application + environment (and optionally replica_index).
- StartupLogPanel (ui/src/components/StartupLogPanel.tsx): collapsible log panel rendered below DeploymentProgress. Queries /api/v1/logs?source=container&application={appSlug}&environment={envSlug}. Auto-polls every 3s while deployment is STARTING; shows green "live" badge during polling, red "stopped" badge on FAILED. Uses useStartupLogs hook and LogViewer (design system).
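The three-layer merge described for ConfigMerger can be pictured as successive map overlays. The following is a minimal sketch only, assuming each layer is a plain map in which an absent key means "inherit from the layer below". The class name LayeredConfigSketch, the map representation, and the sample keys are illustrative; the real ConfigMerger returns a typed ResolvedContainerConfig.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the real ConfigMerger.
final class LayeredConfigSketch {

    // Later layers win: global defaults < environment config < app config.
    static Map<String, Object> resolve(Map<String, ?> globalDefaults,
                                       Map<String, ?> envConfig,
                                       Map<String, ?> appConfig) {
        Map<String, Object> resolved = new LinkedHashMap<>();
        // Documented defaults for the two keys named in the text.
        resolved.put("runtimeType", "auto");
        resolved.put("customArgs", "");
        overlay(resolved, globalDefaults);
        overlay(resolved, envConfig);
        overlay(resolved, appConfig);
        return resolved;
    }

    // Copies non-null entries only, so a null never erases a lower layer (an assumption).
    private static void overlay(Map<String, Object> target, Map<String, ?> layer) {
        if (layer == null) {
            return;
        }
        layer.forEach((key, value) -> {
            if (value != null) {
                target.put(key, value);
            }
        });
    }
}
```

With a global layer of {replicas=1}, an environment layer of {replicas=2}, and an app layer of {runtimeType=quarkus}, the resolved view carries replicas=2, runtimeType=quarkus, and the default empty customArgs.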
Container Hardening (issue #152)
DockerRuntimeOrchestrator.startContainer applies an unconditional hardening contract to BOTH the loader init-container AND the main tenant container (baseHardenedHostConfig() is the shared helper). Java 17 deprecates the SecurityManager for removal (JEP 411), so the JVM is not a security boundary and isolation must live below it. Defaults are fail-closed and have no opt-out:
- cap_drop = every Capability.values() (effectively ALL; docker-java's enum has no ALL constant). Outbound TCP still works (no caps needed); raw sockets, ptrace, mounts, and bind <1024 are denied.
- security_opt: no-new-privileges:true, apparmor=docker-default. Default seccomp profile is applied implicitly when seccomp= is absent.
- read_only rootfs = true.
- pids_limit = 512 (PIDS_LIMIT constant).
- tmpfs mount: /tmp with rw,nosuid,size=256m. No noexec: Netty/tcnative, Snappy, LZ4, and Zstd dlopen native libs from /tmp via mmap(PROT_EXEC), which noexec blocks. Issue #153 will add per-app writeableVolumes for stateful tenants (Kafka Streams etc.).
- userns_mode = host:1000:65536 on both loader and main. Container root is never UID 0 on the host, which closes the last open hardening item from issue #152.
Sandboxed runtime auto-detect: at construction the orchestrator calls dockerClient.infoCmd().exec().getRuntimes() and uses runsc (gVisor) when present. Override with cameleer.server.runtime.dockerruntime (e.g. kata to force Kata Containers, or any other registered runtime). Empty/blank = auto. The override always wins over auto-detect. The DockerRuntimeOrchestrator(DockerClient, String) constructor is the canonical entry point; the single-arg constructor exists only as a convenience for tests that don't need an override.
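The selection rule above (override wins, otherwise prefer runsc, otherwise nothing special) is small enough to sketch as a pure function. This is an illustration under assumptions: the class and method names are hypothetical, the set of runtime names stands in for whatever the Docker info call reports, and returning null for "keep Docker's default runtime" is a guess at the fallback behaviour.

```java
import java.util.Set;

// Hypothetical sketch of the selection rule; the real orchestrator reads the
// registered runtimes from the Docker daemon at construction time.
final class RuntimeSelectionSketch {

    static final String GVISOR = "runsc";

    // Returns the runtime name to request, or null to keep Docker's default (assumed fallback).
    static String select(String override, Set<String> registeredRuntimes) {
        if (override != null && !override.isBlank()) {
            return override.trim(); // an explicit override always wins over auto-detect
        }
        if (registeredRuntimes != null && registeredRuntimes.contains(GVISOR)) {
            return GVISOR; // auto-detect: use gVisor when the daemon has it registered
        }
        return null; // no sandboxed runtime available
    }
}
```

For example, select("kata", Set.of("runc", "runsc")) yields "kata", while select("", Set.of("runc", "runsc")) yields "runsc".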
Init-Container Loader Pattern (JAR fetch)
startContainer is now a two-phase op per replica:
- Volume create: cameleer-jars-{containerName} named volume (per-replica, deterministic so cleanup in removeContainer can derive it).
- Loader container: loaderImage (default registry.cameleer.io/cameleer/cameleer-runtime-loader:latest, built and published by the cameleer-saas repo at docker/runtime-loader/; CI pushes to gitea.siegeln.net under the same path, and both names resolve to the same registry), name {containerName}-loader, mount the volume RW at /app/jars, env vars ARTIFACT_URL + ARTIFACT_EXPECTED_SIZE. Loader downloads the JAR from the signed URL into the volume and exits 0. Orchestrator blocks on waitContainerCmd().exec(WaitContainerResultCallback).awaitStatusCode(120, SECONDS). Loader container is removed in a finally block; on non-zero exit the volume is also removed and RuntimeException propagates so DeploymentExecutor marks the deployment FAILED. Loader logs are captured before removal (captureLoaderLogs: logContainerCmd with withTail(50), capped at 4096 chars, 5s timeout) and appended to the thrown RuntimeException message as ". loader output: <text>". Best-effort: log-capture failures are swallowed and don't mask the original exit. The loader image's Dockerfile pre-creates /app/jars owned by loader:loader (UID 1000) so the orchestrator's fresh named volume initialises with that ownership; without it the empty volume comes up as root:root 0755 and wget exits 1 with "Permission denied". LoaderHardeningIT is the cross-repo contract test (pulls the published :latest and asserts exit 0 under the orchestrator's hardening shape).
- Main container: same hardening contract, mount the same volume RO at /app/jars, entrypoint reads /app/jars/app.jar (Spring Boot/Quarkus: -jar /app/jars/app.jar; plain Java: -cp /app/jars/app.jar <MainClass>; native: exec /app/jars/app.jar).
removeContainer(id) derives the volume name from the inspected container name (Docker prefixes it with /) and removes the volume after the container is removed, so blue/green doesn't leak volumes.
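Because the volume name is deterministic, cleanup needs nothing but the container name. A minimal sketch of that derivation follows; the class and method names are hypothetical, and only the cameleer-jars- prefix and the leading-slash behaviour come from the text above.

```java
// Hypothetical helper showing how the per-replica volume name can be derived.
final class JarVolumeNames {

    static final String PREFIX = "cameleer-jars-";

    // Docker's inspect output reports names as "/name", so strip one leading slash if present.
    static String forContainer(String containerName) {
        String name = containerName.startsWith("/")
                ? containerName.substring(1)
                : containerName;
        return PREFIX + name;
    }
}
```

The same function serves both sides: startContainer calls it with the plain name it is about to create, and removeContainer calls it with the slash-prefixed name returned by inspect.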
DeploymentExecutor generates the signed URL via ArtifactDownloadTokenSigner.sign(appVersion.id(), Duration.ofSeconds(artifactTokenTtlSeconds)) and passes appVersion.id(), the URL, appVersion.jarSizeBytes(), and the loader image into ContainerRequest. The host filesystem is no longer involved at deploy time.
Loader → server reachability: the loader hits the Cameleer server from its primary Docker
network only (request.network(), set from CAMELEER_SERVER_RUNTIME_DOCKERNETWORK). Additional networks
(cameleer-traefik, per-env) are attached by DockerNetworkManager.connectContainer AFTER startContainer
returns — by which time the loader has already exited. The loader cannot use them. The signed URL is built
from cameleer.server.runtime.artifactbaseurl (preferred), falling back to cameleer.server.runtime.serverurl,
falling back to http://cameleer-server:8081. The default works in SaaS mode because the tenant's primary
network (cameleer-tenant-{slug}) hosts the tenant's own server — same CAMELEER_SERVER_RUNTIME_DOCKERNETWORK
on both. For non-SaaS topologies, set CAMELEER_SERVER_RUNTIME_ARTIFACTBASEURL to a URL the loader can reach
on its primary network.
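The fallback order for the signed URL's base can be sketched as below. This is an illustration, not the server's code: the class and method names are hypothetical, and treating blank values the same as unset ones is an assumption.

```java
// Hypothetical sketch of the base-URL fallback chain described above.
final class ArtifactBaseUrlSketch {

    static final String DEFAULT_BASE_URL = "http://cameleer-server:8081";

    // artifactBaseUrl ~ cameleer.server.runtime.artifactbaseurl (preferred)
    // serverUrl       ~ cameleer.server.runtime.serverurl (fallback)
    static String resolve(String artifactBaseUrl, String serverUrl) {
        if (hasText(artifactBaseUrl)) {
            return artifactBaseUrl.trim();
        }
        if (hasText(serverUrl)) {
            return serverUrl.trim();
        }
        return DEFAULT_BASE_URL;
    }

    // Blank is treated like unset (an assumption).
    private static boolean hasText(String value) {
        return value != null && !value.isBlank();
    }
}
```

Whatever this returns must be resolvable from the loader's primary network, since that is the only network attached while the loader runs.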
DeploymentExecutor Details
Primary network for app containers is set via CAMELEER_SERVER_RUNTIME_DOCKERNETWORK env var (in SaaS mode: cameleer-tenant-{slug}); apps also connect to cameleer-traefik (routing) and cameleer-env-{tenantId}-{envSlug} (per-environment discovery) as additional networks. Resolves runtimeType: auto to concrete type from AppVersion.detectedRuntimeType at PRE_FLIGHT (fails deployment if unresolvable). Builds Docker entrypoint per runtime type (all JVM types use -javaagent:/app/agent.jar -jar, plain Java uses -cp with main class, native runs binary directly). Sets per-replica CAMELEER_AGENT_INSTANCEID env var to {envSlug}-{appSlug}-{replicaIndex}-{generation} so container logs and agent logs share the same instance identity. Sets CAMELEER_AGENT_* env vars from ResolvedContainerConfig (routeControlEnabled, replayEnabled, health port). These are startup-only agent properties — changing them requires redeployment.
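The AUTO resolution and per-type entrypoint described above can be sketched as two small functions. This is an illustration under assumptions: RuntimeKind and EntrypointSketch are hypothetical names, the command strings follow the paths given in the loader section, and custom arguments and the sh -c wrapping are left out.

```java
// Hypothetical stand-in for the RuntimeType enum named in the text.
enum RuntimeKind { AUTO, SPRING_BOOT, QUARKUS, PLAIN_JAVA, NATIVE }

// Hypothetical sketch: resolve "auto" at PRE_FLIGHT, then build the start command.
final class EntrypointSketch {

    static final String JAR = "/app/jars/app.jar";
    static final String AGENT = "-javaagent:/app/agent.jar";

    // An explicit type wins; AUTO falls back to the detected type or fails the deployment.
    static RuntimeKind resolve(RuntimeKind configured, RuntimeKind detected) {
        if (configured != RuntimeKind.AUTO) {
            return configured;
        }
        if (detected == null || detected == RuntimeKind.AUTO) {
            throw new IllegalStateException("runtime type not detected; set it explicitly");
        }
        return detected;
    }

    // mainClass is only used for PLAIN_JAVA.
    static String command(RuntimeKind type, String mainClass) {
        switch (type) {
            case SPRING_BOOT:
            case QUARKUS:
                return "java " + AGENT + " -jar " + JAR;
            case PLAIN_JAVA:
                return "java " + AGENT + " -cp " + JAR + " " + mainClass;
            case NATIVE:
                return "exec " + JAR; // native binary, no JVM and no agent flag
            default:
                throw new IllegalArgumentException("resolve AUTO before building the entrypoint");
        }
    }
}
```

Keeping resolution separate from command building mirrors the text: an unresolvable type fails at PRE_FLIGHT, before any container is created.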
Container naming — {tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{generation}, where generation is the first 8 characters of the deployment UUID. The generation suffix lets old + new replicas coexist during a blue/green swap (deterministic names without a generation used to 409). All lookups across the executor, DockerEventMonitor, and ContainerLogForwarder key on container id, not name — the name is operator-visibility only.
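As an illustration of the naming scheme, the sketch below derives the generation, the container name, and the instance id from a deployment UUID. The class and method names are hypothetical; only the formats themselves come from this document.

```java
import java.util.UUID;

// Hypothetical helper illustrating the naming formats described above.
final class ContainerNamingSketch {

    // Generation = first 8 characters of the deployment UUID's string form.
    static String generation(UUID deploymentId) {
        return deploymentId.toString().substring(0, 8);
    }

    // {tenantId}-{envSlug}-{appSlug}-{replicaIndex}-{generation}
    static String containerName(String tenantId, String envSlug, String appSlug,
                                int replicaIndex, UUID deploymentId) {
        return String.join("-", tenantId, envSlug, appSlug,
                String.valueOf(replicaIndex), generation(deploymentId));
    }

    // {envSlug}-{appSlug}-{replicaIndex}-{generation}, the value of CAMELEER_AGENT_INSTANCEID
    static String instanceId(String envSlug, String appSlug,
                             int replicaIndex, UUID deploymentId) {
        return String.join("-", envSlug, appSlug,
                String.valueOf(replicaIndex), generation(deploymentId));
    }
}
```

Two deployments of the same app share every component except the generation, which is why old and new replicas can coexist without a name conflict.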
Strategy dispatch — DeploymentStrategy.fromWire(config.deploymentStrategy()) branches the executor. Unknown values fall back to BLUE_GREEN so misconfiguration never throws at runtime.
- Blue/green (default): start all N new replicas → wait for ALL healthy → stop the previous deployment. Resource peak ≈ 2× replicas for the health-check window. Partial health aborts with status FAILED; the previous deployment is preserved untouched (user's safety net).
- Rolling: replace replicas one at a time — start new[i] → wait healthy → stop old[i] → next. Resource peak = replicas + 1. Mid-rollout health failure stops in-flight new containers and aborts; already-replaced old replicas are NOT restored (not reversible) but un-replaced old[i+1..N] keep serving traffic. User redeploys to recover.
Traffic routing is implicit: Traefik labels (cameleer.app, cameleer.environment) are generation-agnostic, so new replicas attract load balancing as soon as they come up healthy — no explicit swap step.
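The lenient dispatch and the resource peaks of the two strategies can be sketched as follows. This is a hypothetical stand-in for DeploymentStrategy: the accepted wire spellings (case-insensitive, hyphen or underscore) are an assumption, and peakReplicas is an illustrative helper that does not appear in the source.

```java
import java.util.Locale;

// Hypothetical sketch of a lenient wire-value parser with a safe default.
enum DeploymentStrategySketch {
    BLUE_GREEN,
    ROLLING;

    // Unknown, blank, or null values fall back to BLUE_GREEN instead of throwing.
    static DeploymentStrategySketch fromWire(String wire) {
        if (wire == null) {
            return BLUE_GREEN;
        }
        String normalized = wire.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        for (DeploymentStrategySketch strategy : values()) {
            if (strategy.name().equals(normalized)) {
                return strategy;
            }
        }
        return BLUE_GREEN;
    }

    // Peak container count while a deploy is in flight, per the bullets above.
    int peakReplicas(int replicas) {
        return this == BLUE_GREEN ? 2 * replicas : replicas + 1;
    }
}
```

For a 3-replica app this gives a peak of 6 containers under blue/green and 4 under rolling.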
Deployment Status Model
| Status | Meaning |
|---|---|
| STOPPED | Intentionally stopped or initial state |
| STARTING | Deploy in progress |
| RUNNING | All replicas healthy and serving |
| DEGRADED | Post-deploy: a replica died after the deploy was marked RUNNING. Set by DockerEventMonitor reconciliation, never by DeploymentExecutor directly. |
| STOPPING | Graceful shutdown in progress |
| FAILED | Terminal failure (pre-flight, health check, or crash). Partial-healthy deploys now mark FAILED; DEGRADED is reserved for post-deploy drift. |
Deploy stages (DeployStage): PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE (or FAILED at any stage). Rolling reuses the same stage labels inside the per-replica loop; the UI progress bar shows the most recent stage.
Deployment retention: DeploymentService.createDeployment() deletes FAILED deployments for the same app+environment before creating a new one, preventing failed-attempt buildup. STOPPED deployments are preserved as restorable checkpoints — the UI Checkpoints disclosure lists every deployment with a non-null deployed_config_snapshot (RUNNING, DEGRADED, STOPPED) minus the current one.
JAR Management
- Retention policy per environment: configurable maximum number of JAR versions to keep. Older JARs are deleted automatically.
- Nightly cleanup job (JarRetentionJob, Spring @Scheduled 03:00): purges JARs exceeding the retention limit and removes orphaned files not referenced by any app version. Skips versions currently deployed.
- Storage abstraction: ArtifactStore (in cameleer-server-core/storage) is the only path that touches JAR bytes. FilesystemArtifactStore writes under cameleer.server.runtime.jarstoragepath (default /data/jars); the orchestrator never reads the host filesystem at deploy time.
- Loader-fetch at deploy time: tenant containers no longer bind-mount JARs from the host. The loader init-container streams the JAR via a signed URL (HMAC-SHA256, TTL cameleer.server.runtime.artifacttokenttlseconds, default 600s) into a per-replica named volume; main mounts that volume RO. This works without host-path access and is the single path supported in Docker-in-Docker SaaS deployments.
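To make the signed-URL idea concrete, here is a minimal sketch of an HMAC-SHA256 token with an expiry. It is not the ArtifactDownloadTokenSigner implementation: the class name, the query-string layout (artifact, expires, sig), and the signed payload format are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.time.Duration;
import java.time.Instant;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of a time-limited, HMAC-signed download token.
final class SignedArtifactUrlSketch {

    private final byte[] secret;

    SignedArtifactUrlSketch(byte[] secret) {
        this.secret = secret.clone();
    }

    // Returns a query string carrying the artifact id, the expiry (epoch seconds), and the signature.
    String sign(String artifactId, Duration ttl, Instant now) {
        long expires = now.plus(ttl).getEpochSecond();
        return "artifact=" + artifactId + "&expires=" + expires
                + "&sig=" + mac(artifactId + ":" + expires);
    }

    // Rejects expired tokens first, then compares signatures in constant time.
    boolean verify(String artifactId, long expires, String sig, Instant now) {
        if (now.getEpochSecond() > expires) {
            return false;
        }
        byte[] expected = mac(artifactId + ":" + expires).getBytes(StandardCharsets.US_ASCII);
        return MessageDigest.isEqual(expected, sig.getBytes(StandardCharsets.US_ASCII));
    }

    // HMAC-SHA256 over the payload, encoded as unpadded URL-safe Base64.
    private String mac(String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            byte[] raw = mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

Signing the expiry together with the artifact id means a loader cannot extend the TTL or reuse a token for a different version. With the default TTL of 600s, a token minted at deploy time is useless ten minutes later.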
Runtime Type Detection
The server detects the app framework from uploaded JARs and builds Docker entrypoints. The agent shaded JAR bundles the log appender, so no separate cameleer-log-appender.jar or PropertiesLauncher is needed:
- Detection (RuntimeDetector): runs at JAR upload time. Checks ZIP magic bytes (non-ZIP = native binary), then probes META-INF/MANIFEST.MF Main-Class: Spring Boot loader prefix -> spring-boot, Quarkus entry point -> quarkus, other Main-Class -> plain-java (extracts class name). Results stored on AppVersion (detected_runtime_type, detected_main_class).
- Runtime types (RuntimeType enum): AUTO, SPRING_BOOT, QUARKUS, PLAIN_JAVA, NATIVE. Configurable per app/environment via containerConfig.runtimeType (default "auto").
- Entrypoint per type: All JVM types use java -javaagent:/app/agent.jar -jar app.jar. Plain Java uses -cp with explicit main class instead of -jar. Native runs the binary directly.
- Custom arguments (containerConfig.customArgs): freeform string appended to the start command. Validated against a strict pattern to prevent shell injection (entrypoint uses sh -c).
- AUTO resolution: at deploy time (PRE_FLIGHT), "auto" resolves to the detected type from AppVersion. Fails deployment if detection was unsuccessful; user must set type explicitly.
- UI: Resources tab shows Runtime Type dropdown (with detection hint from latest uploaded version) and Custom Arguments text field.
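The detection steps in the list above can be sketched with the JDK's own JAR support. This is a hedged illustration, not the RuntimeDetector source: the class name is hypothetical, and the specific Main-Class values checked (the org.springframework.boot.loader. prefix and io.quarkus.bootstrap.runner.QuarkusEntryPoint) are assumptions about what "Spring Boot loader prefix" and "Quarkus entry point" refer to.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.JarInputStream;
import java.util.jar.Manifest;

// Hypothetical sketch of upload-time runtime detection.
final class RuntimeDetectorSketch {

    // Returns "native", "spring-boot", "quarkus", "plain-java", or null when undetectable.
    static String detect(byte[] artifact) {
        if (!isZip(artifact)) {
            return "native"; // not a ZIP archive, so treat it as a native binary
        }
        // Note: JarInputStream only exposes the manifest when it is stored at the
        // start of the archive, which is where standard build tools place it.
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(artifact))) {
            Manifest manifest = jar.getManifest();
            String mainClass = manifest == null
                    ? null
                    : manifest.getMainAttributes().getValue("Main-Class");
            return classify(mainClass);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String classify(String mainClass) {
        if (mainClass == null || mainClass.isBlank()) {
            return null; // detection failed; the user must set the type explicitly
        }
        if (mainClass.startsWith("org.springframework.boot.loader.")) {
            return "spring-boot";
        }
        if (mainClass.equals("io.quarkus.bootstrap.runner.QuarkusEntryPoint")) {
            return "quarkus";
        }
        return "plain-java"; // mainClass itself is what would be stored as detected_main_class
    }

    // ZIP local file header magic: 'P' 'K' 0x03 0x04.
    static boolean isZip(byte[] bytes) {
        return bytes.length >= 4
                && bytes[0] == 'P' && bytes[1] == 'K' && bytes[2] == 3 && bytes[3] == 4;
    }
}
```

A null result corresponds to the case where AUTO cannot be resolved at PRE_FLIGHT and the deployment fails until a type is chosen by hand.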
SaaS Multi-Tenant Network Isolation
In SaaS mode, each tenant's server and its deployed apps are isolated at the Docker network level:
- Tenant network (cameleer-tenant-{slug}): primary internal bridge for all of a tenant's containers. Set as CAMELEER_SERVER_RUNTIME_DOCKERNETWORK for the tenant's server instance. Tenant A's apps cannot reach tenant B's apps.
- Shared services network: server also connects to the shared infrastructure network (PostgreSQL, ClickHouse, Logto) and cameleer-traefik for HTTP routing.
- Tenant-scoped environment networks (cameleer-env-{tenantId}-{envSlug}): per-environment discovery is scoped per tenant, so alpha-corp's "dev" environment network is separate from beta-corp's "dev" environment network.
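The network naming scheme can be summarised in a few lines. The sketch below is illustrative: the class name and tenantNetworkName are hypothetical, while the envNetworkName overload and the name formats follow this document.

```java
// Hypothetical sketch of the network naming conventions described above.
final class NetworkNamesSketch {

    static final String TRAEFIK = "cameleer-traefik";

    // Primary per-tenant bridge in SaaS mode.
    static String tenantNetworkName(String tenantSlug) {
        return "cameleer-tenant-" + tenantSlug;
    }

    // Standalone mode: one network per environment.
    static String envNetworkName(String envSlug) {
        return "cameleer-env-" + envSlug;
    }

    // SaaS mode: the tenant id is part of the name, so equal env slugs never collide.
    static String envNetworkName(String tenantId, String envSlug) {
        return "cameleer-env-" + tenantId + "-" + envSlug;
    }
}
```

With this scheme alpha-corp and beta-corp can both have a "dev" environment and still end up on cameleer-env-alpha-corp-dev and cameleer-env-beta-corp-dev respectively.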
nginx / Reverse Proxy
client_max_body_size 200m is required in the nginx config to allow JAR uploads up to 200 MB. Without this, large JAR uploads return 413.