Two diagnostics-and-confidence follow-ups to the loader-init-container pattern.
1) DockerRuntimeOrchestrator now captures the last 50 lines of the loader's
stdout/stderr (capped at 4096 chars, 5s timeout) before the finally-block
removal and appends them to the thrown RuntimeException as
`. loader output: <text>`. Capture is best-effort: log-capture failures are
swallowed and never mask the original exit. This closes the visibility gap
that turned a simple "wget: Permission denied" into an opaque "Loader exited 1".
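The class and method names below are hypothetical (the real capture goes through docker-java's log commands inside DockerRuntimeOrchestrator); this is a minimal stdlib sketch of the cap-and-swallow behaviour described above:

```java
import java.util.function.Supplier;

// Hypothetical helper; sketches the capped append-and-swallow diagnostics behaviour.
final class LoaderDiagnostics {
    private static final int MAX_CHARS = 4096;

    /** Best-effort: appends captured loader output to the base error message.
     *  Capture failures are swallowed so diagnostics never mask the original exit. */
    static String withLoaderOutput(String baseMessage, Supplier<String> logCapture) {
        try {
            String logs = logCapture.get();
            if (logs == null || logs.isBlank()) {
                return baseMessage;
            }
            String capped = logs.length() > MAX_CHARS ? logs.substring(0, MAX_CHARS) : logs;
            return baseMessage + ". loader output: " + capped;
        } catch (RuntimeException swallowed) {
            return baseMessage; // never replace the real failure with a diagnostics failure
        }
    }
}
```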
2) New LoaderHardeningIT spins up a Testcontainers nginx serving a 1KB
fixture, builds the loader image fresh from cameleer-runtime-loader/,
and runs it under the exact baseHardenedHostConfig() shape (cap_drop ALL,
readonly rootfs, /tmp tmpfs, no-new-privileges, apparmor=docker-default,
pids=512) bound to a fresh named volume RW at /app/jars. Asserts exit 0.
This would have caught the volume-permission regression in CI.
GenericContainer + OneShotStartupCheckStrategy is used instead of raw
docker-java waitContainerCmd because the unshaded docker-java API version
in this project's pom and Testcontainers' shaded copy disagree on
WaitContainerCmd.getCondition(); going through GenericContainer keeps
the call inside Testcontainers' shaded executor.
Rules doc updated to point at the captured-output behaviour and the IT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final-review must-fixes:
- HOWTO.md: drop CAMELEER_SERVER_RUNTIME_JARDOCKERVOLUME; add the three new
artifact env vars (loaderimage / artifacttokenttlseconds / artifactbaseurl).
- DeploymentExecutor @PostConstruct WARN, handoff doc, and docker-orchestration
rule no longer claim the loader uses cameleer-traefik. The loader runs on
the PRIMARY Docker network only — additional networks are attached after
startContainer returns, by which time the loader has exited. SaaS still
works because the tenant's primary network hosts the tenant server.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-pull the loader image at PULL_IMAGE so the implicit pull on first
createContainerCmd doesn't bypass the 120s loader-wait timeout.
Wrap createAndStartLoader in try/catch so a create/start failure cleans
up the just-created volume; same guard around createAndStartMain on
phase-2 failures. Fold the wait-error message into the rethrown
RuntimeException so the cause chain stays visible.
Add a @PostConstruct WARN when neither artifactbaseurl nor serverurl is
set so the implicit cameleer-server DNS dependency is loud at boot, and
document the loader-to-server reachability contract in
.claude/rules/docker-orchestration.md.
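A stdlib sketch of that boot-time check; the helper name is hypothetical, and in the real code the result feeds a WARN log from a @PostConstruct hook rather than being returned:

```java
import java.util.Optional;

// Hypothetical helper; the real check runs in a @PostConstruct hook.
final class ArtifactConfigCheck {
    /** Returns the WARN to log at boot when neither artifactbaseurl nor
     *  serverurl is set, so the implicit cameleer-server DNS dependency
     *  is loud at startup instead of failing silently at deploy time. */
    static Optional<String> bootWarning(String artifactBaseUrl, String serverUrl) {
        if (isBlank(artifactBaseUrl) && isBlank(serverUrl)) {
            return Optional.of(
                "Neither artifactbaseurl nor serverurl is configured; the loader "
                + "will rely on resolving cameleer-server via Docker DNS");
        }
        return Optional.empty();
    }

    private static boolean isBlank(String s) {
        return s == null || s.isBlank();
    }
}
```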
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tasks 9+10+11 of the init-container-jar-fetch plan, landed atomically because
9 alone leaves the orchestrator+executor referencing removed ContainerRequest
fields.
ContainerRequest (core) drops jarPath/jarVolumeName/jarVolumeMountPath; adds
appVersionId, artifactDownloadUrl, artifactExpectedSize, loaderImage.
DockerRuntimeOrchestrator (app):
- per-replica named volume "cameleer-jars-{containerName}"
- phase 1: loader container with the volume mounted RW at /app/jars,
ARTIFACT_URL + ARTIFACT_EXPECTED_SIZE env, full hardening contract
- block on waitContainerCmd().awaitStatusCode(120s); on non-zero exit
remove the loader, remove the volume, propagate RuntimeException so
DeploymentExecutor marks the deployment FAILED. main is never created.
- phase 2: main container with the same volume mounted RO at /app/jars
- withUsernsMode("host:1000:65536") on BOTH containers — closes the last
open hardening gap from issue #152
- main entrypoint paths point at /app/jars/app.jar
- extracted baseHardenedHostConfig() so loader and main share the
cap_drop / security_opt / readonly / pids / tmpfs contract
- removeContainer() also removes the per-replica volume so blue/green
doesn't leak volumes
DeploymentExecutor (app):
- injects ArtifactDownloadTokenSigner; new @Value props loaderimage,
artifacttokenttlseconds, artifactbaseurl
- replaces the temporary getVersion(...).jarPath() bridge with a signed
URL ${artifactBaseUrl}/api/v1/artifacts/{id}?exp&sig
- drops the Files.exists pre-flight check; AppVersion.jarSizeBytes is
the size-of-record check now
- drops jarDockerVolume / jarStoragePath @Value fields and the volume
plumbing in startReplica
- DeployCtx carries appVersionId / artifactUrl / artifactExpectedSize
in place of jarPath
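The signed-URL shape above can be sketched as follows. The source does not show ArtifactDownloadTokenSigner's actual payload or algorithm, so the HMAC-SHA256 construction, the `|` payload separator, and the class name here are assumptions for illustration only:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.HexFormat;

// Hypothetical signer; payload format and algorithm are assumed, not from the source.
final class ArtifactUrlSigner {
    private final byte[] secret;

    ArtifactUrlSigner(byte[] secret) {
        this.secret = secret.clone();
    }

    /** Builds ${baseUrl}/api/v1/artifacts/{id}?exp=...&sig=... with an
     *  HMAC-SHA256 over the path and expiry epoch-second. */
    String signedUrl(String baseUrl, long appVersionId, Instant expiry) {
        String path = "/api/v1/artifacts/" + appVersionId;
        long exp = expiry.getEpochSecond();
        String sig = hmacHex(path + "|" + exp); // assumed payload layout
        return baseUrl + path + "?exp=" + exp + "&sig=" + sig;
    }

    private String hmacHex(String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return HexFormat.of().formatHex(mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
        } catch (java.security.GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```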
Tests:
- DockerRuntimeOrchestratorHardeningTest updated for the new shape;
captures HostConfig on the MAIN container and asserts cap_drop ALL
+ no-new-privileges + apparmor + readonly + pids + tmpfs + the new
withUsernsMode("host:1000:65536")
- DockerRuntimeOrchestratorLoaderTest (new): verifies volume create →
loader create with RW bind → loader started → awaited → loader
removed → main create with RO bind → main started; verifies abort
+ cleanup on loader exit != 0 (loader removed, volume removed, main
NEVER created); verifies userns_mode applied to both containers.
Config:
- application.yml replaces jardockervolume with loaderimage,
artifacttokenttlseconds, artifactbaseurl
Rules updated: .claude/rules/docker-orchestration.md (loader pattern,
userns, no more bind-mount); .claude/rules/core-classes.md
(ContainerRequest field map).
Test counts after change:
- cameleer-server-core: 116/116 unit tests pass
- cameleer-server-app: 273/273 unit tests pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tenant JARs are arbitrary user code: Camel ships components (camel-exec,
camel-bean, MVEL/Groovy templating) that turn a header into shell, and
Java 17 has no SecurityManager — the JVM is not a security boundary.
This applies an unconditional hardening contract to every tenant
container so a single runc CVE no longer equals host takeover.
DockerRuntimeOrchestrator.startContainer now sets:
- cap_drop ALL (Capability.values() — docker-java has no ALL constant)
- security_opt: no-new-privileges, apparmor=docker-default
(default seccomp profile applies implicitly)
- read_only rootfs, pids_limit=512
- /tmp tmpfs rw,nosuid,size=256m — no noexec, since Netty/Snappy/LZ4/Zstd
dlopen native libs from /tmp via mmap(PROT_EXEC) which noexec blocks
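The contract can be modelled library-agnostically as below (a sketch, not the real code: in the orchestrator this is a docker-java HostConfig built by a method like baseHardenedHostConfig(), where cap_drop ALL is spelled Capability.values() since docker-java has no ALL constant):

```java
import java.util.List;
import java.util.Map;

/** Library-agnostic model of the shared hardening contract; the record
 *  name and string-based cap_drop are assumptions for illustration. */
record HardenedContract(
        List<String> capDrop,
        List<String> securityOpt,
        boolean readonlyRootfs,
        long pidsLimit,
        Map<String, String> tmpfs) {

    static HardenedContract base() {
        return new HardenedContract(
                List.of("ALL"), // cap_drop ALL
                List.of("no-new-privileges", "apparmor=docker-default"), // default seccomp applies implicitly
                true,  // read_only rootfs
                512L,  // pids_limit
                // rw,nosuid but NOT noexec: Netty/Snappy/LZ4/Zstd dlopen native
                // libs from /tmp via mmap(PROT_EXEC), which noexec would block
                Map.of("/tmp", "rw,nosuid,size=256m"));
    }
}
```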
The orchestrator also probes `docker info` at construction and uses runsc
(gVisor) automatically when the daemon has it registered. Override via
cameleer.server.runtime.dockerruntime (e.g. "kata"); empty = auto.
Outbound TCP, DNS, and TLS are unaffected — caps/seccomp don't gate
those — so vanilla Camel-Kafka producers/consumers and REST integrations
keep working unchanged. Stateful tenants (Kafka Streams with on-disk
state stores, apps writing to /var/log/...) need explicit writeable
volumes; that's tracked in #153 as the natural follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `TraefikLabelBuilder` hardcoded `tls.certresolver=default` on
every router. That assumes a resolver literally named `default` exists
in the Traefik static config — true for ACME-backed installs, false for
dev/local installs that use a file-based TLS store. Traefik logs
"Router uses a nonexistent certificate resolver" for the bogus resolver
on every managed app, and any future attempt to define a differently-
named real resolver would silently skip these routers.
Server-wide setting via `CAMELEER_SERVER_RUNTIME_CERTRESOLVER` (empty by
default) flows through `ConfigMerger.GlobalRuntimeDefaults.certResolver`
into `ResolvedContainerConfig.certResolver`. When blank the
`tls.certresolver` label is omitted entirely; `tls=true` is still
emitted so Traefik serves the default TLS-store cert. When set, the
label is emitted with the configured resolver name.
Not per-app/per-env configurable: there is one Traefik per server
instance and one resolver config; app-level override would only let
users break their own routers.
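The blank-means-omit behaviour can be sketched as below; the helper name is hypothetical, though the `traefik.http.routers.<name>.tls` and `.tls.certresolver` label keys are standard Traefik Docker-provider labels:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper sketching TraefikLabelBuilder's TLS-label decision.
final class TlsLabels {
    /** tls=true is always emitted so Traefik serves the default TLS-store
     *  cert; tls.certresolver is emitted only when a resolver is configured. */
    static Map<String, String> tlsLabels(String routerName, String certResolver) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("traefik.http.routers." + routerName + ".tls", "true");
        if (certResolver != null && !certResolver.isBlank()) {
            labels.put("traefik.http.routers." + routerName + ".tls.certresolver", certResolver);
        }
        return labels;
    }
}
```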
TDD: TraefikLabelBuilderTest gains 3 cases (resolver set, null, blank).
Full unit suite 211/0/0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a boolean `externalRouting` flag (default `true`) on
ResolvedContainerConfig. When `false`, TraefikLabelBuilder emits only
the identity labels (`managed-by`, `cameleer.*`) and skips every
`traefik.*` label, so the container is not published by Traefik.
Sibling containers on `cameleer-traefik` / `cameleer-env-{tenant}-{env}`
can still reach it via Docker DNS on whatever port the app listens on.
TDD: new TraefikLabelBuilderTest covers enabled (default labels present),
disabled (zero traefik.* labels), and disabled (identity labels retained)
cases. Full module unit suite: 208/0/0.
Plumbed through ConfigMerger read, DeploymentExecutor snapshot, UI form
state, Resources tab toggle, POST payload, and snapshot-to-form mapping.
Rule files updated.
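The flag's effect on the emitted label set can be sketched as follows; the class name and signature are hypothetical, since TraefikLabelBuilder's real API is not shown here:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical shape; TraefikLabelBuilder's real signature differs.
final class RoutingLabels {
    /** With externalRouting=false only the identity labels survive, so Traefik
     *  never publishes the container; siblings still reach it via Docker DNS. */
    static Map<String, String> forContainer(Map<String, String> identityLabels,
                                            Map<String, String> traefikLabels,
                                            boolean externalRouting) {
        Map<String, String> labels = new LinkedHashMap<>(identityLabels);
        if (externalRouting) {
            labels.putAll(traefikLabels);
        }
        return labels;
    }
}
```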
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Backend: rename deleteTerminalByAppAndEnvironment → deleteFailedByAppAndEnvironment.
STOPPED rows were being wiped on every redeploy, so Checkpoints was always empty.
Now only FAILED rows are pruned; STOPPED deployments are retained as restorable
checkpoints (they still carry deployed_config_snapshot from their RUNNING window).
- UI filter: any deployment with a snapshot is a checkpoint (was RUNNING|DEGRADED only,
which excluded the main case — the previous blue/green deployment now in STOPPED).
- UI placement: Checkpoints disclosure now renders inside IdentitySection, matching
the design spec.
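The two rules above (checkpoint = has snapshot; prune = FAILED only) can be sketched as a pair of predicates; the enum and record names here are illustrative, not the real entity:

```java
// Hypothetical names; the real entity carries deployed_config_snapshot.
enum DeploymentStatus { RUNNING, DEGRADED, STOPPED, FAILED }

record DeploymentRow(DeploymentStatus status, String deployedConfigSnapshot) {

    /** A checkpoint is any deployment carrying a snapshot, regardless of status;
     *  the common case is the previous blue/green deployment, now STOPPED. */
    boolean isCheckpoint() {
        return deployedConfigSnapshot != null;
    }

    /** Redeploy pruning deletes only FAILED rows; STOPPED rows are retained. */
    boolean prunedOnRedeploy() {
        return status == DeploymentStatus.FAILED;
    }
}
```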
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refresh the three rules files to match the new executor behavior:
- docker-orchestration.md: rewrite DeploymentExecutor Details with
container naming scheme ({...}-{replica}-{generation}), strategy
dispatch (blue-green vs rolling), and the new DEGRADED semantics
(post-deploy only). Update TraefikLabelBuilder + ContainerLogForwarder
bullets for the generation suffix + new cameleer.generation label.
- app-classes.md: DeploymentExecutor + TraefikLabelBuilder bullets
mirror the same.
- core-classes.md: add DeploymentStrategy enum; note DEGRADED is now
post-deploy-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Un-ignore .claude/rules/ so path-scoped rule files are shared via git.
Add instruction in CLAUDE.md to update rule files when modifying classes,
controllers, endpoints, or metrics — keeps rules current as part of
normal workflow rather than requiring separate maintenance.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>