feat(runtime): capture loader logs in failure exceptions; add LoaderHardeningIT regression guard
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 3m42s
CI / docker (push) Successful in 2m36s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 54s

Two diagnostics-and-confidence follow-ups to the loader-init-container pattern.

1) DockerRuntimeOrchestrator now captures the loader's last 50 lines of
   stdout/stderr (capped at 4096 chars, 5s timeout) before the finally-remove
   and appends them to the thrown RuntimeException as
   `. loader output: <text>`. Best-effort: log-capture failures are swallowed
   and never mask the original exit. Closes the visibility gap that turned a
   simple "wget: Permission denied" into the opaque "Loader exited 1".

2) New LoaderHardeningIT spins up a Testcontainers nginx serving a 1KB
   fixture, builds the loader image fresh from cameleer-runtime-loader/,
   and runs it under the exact baseHardenedHostConfig() shape (cap_drop ALL,
   readonly rootfs, /tmp tmpfs, no-new-privileges, apparmor=docker-default,
   pids=512) bound to a fresh named volume RW at /app/jars. Asserts exit 0.
   This would have caught the volume-permission regression in CI.
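The Dockerfile fix this IT guards is not part of the diff below; as a hedged sketch only (base image, user/group names, and flag spellings are assumptions — only the UID-1000 `loader` user and the pre-created `/app/jars` are stated in this commit), it would look something like:

```dockerfile
# Assumed shape of cameleer-runtime-loader's Dockerfile: create the
# loader user at UID 1000 and pre-create /app/jars owned by it, so a
# fresh named volume mounted there initialises as loader:loader
# instead of root:root 0755 (which made wget fail with EACCES).
FROM alpine:3.19
RUN addgroup -g 1000 loader \
 && adduser -D -u 1000 -G loader loader \
 && mkdir -p /app/jars \
 && chown -R loader:loader /app
USER loader
```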

GenericContainer + OneShotStartupCheckStrategy is used instead of raw
docker-java waitContainerCmd because docker-java's unshaded api version
in this project's pom and testcontainers' shaded copy disagree on
WaitContainerCmd.getCondition() — going through GenericContainer keeps
the call inside testcontainers' shaded executor.

Rules doc updated to point at the captured-output behaviour and the IT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
hsiegeln
2026-04-27 23:51:25 +02:00
parent c2efb7fbf7
commit 2e2d069530
3 changed files with 189 additions and 3 deletions

View File

@@ -41,7 +41,7 @@ When deployed via the cameleer-saas platform, this server orchestrates customer
`startContainer` is now a two-phase op per replica:
1. **Volume create** — `cameleer-jars-{containerName}` named volume (per-replica, deterministic so cleanup in `removeContainer` can derive it).
2. **Loader container** — `loaderImage` (default `gitea.siegeln.net/cameleer/cameleer-runtime-loader:latest`), name `{containerName}-loader`, mount the volume **RW at `/app/jars`**, env vars `ARTIFACT_URL` + `ARTIFACT_EXPECTED_SIZE`. Loader downloads the JAR from the signed URL into the volume and exits 0. Orchestrator blocks on `waitContainerCmd().exec(WaitContainerResultCallback).awaitStatusCode(120, SECONDS)`. Loader container is removed in a `finally` block; on non-zero exit the volume is also removed and `RuntimeException` propagates so `DeploymentExecutor` marks the deployment FAILED. **Loader logs are captured before removal** (`captureLoaderLogs` → `logContainerCmd` with `withTail(50)`, capped at 4096 chars, 5s timeout) and appended to the thrown `RuntimeException` message as `". loader output: <text>"`. Best-effort: log-capture failures are swallowed and don't mask the original exit. The loader image's Dockerfile pre-creates `/app/jars` owned by `loader:loader` (UID 1000) so the orchestrator's fresh named volume initialises with that ownership — without it the empty volume comes up as `root:root 0755` and wget exits 1 with "Permission denied". `LoaderHardeningIT` is the regression guard.
3. **Main container** — same hardening contract, mount the same volume **RO at `/app/jars`**, entrypoint reads `/app/jars/app.jar` (Spring Boot/Quarkus: `-jar /app/jars/app.jar`; plain Java: `-cp /app/jars/app.jar <MainClass>`; native: `exec /app/jars/app.jar`).
`removeContainer(id)` derives the volume name from the inspected container name (Docker prefixes it with `/`) and removes the volume after the container removes — blue/green doesn't leak volumes.
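The volume-name derivation described above can be sketched as follows (a minimal illustration, not the actual `DockerRuntimeOrchestrator` code — the helper name and the example container names are hypothetical; only the `cameleer-jars-` prefix and Docker's leading-`/` inspect quirk come from this commit):

```java
public class VolumeNameDerivation {

    // Docker's inspect API reports container names with a leading "/"
    // (e.g. "/cameleer-app-blue"); strip it before rebuilding the
    // deterministic per-replica volume name.
    static String volumeNameFor(String inspectedContainerName) {
        String name = inspectedContainerName.startsWith("/")
                ? inspectedContainerName.substring(1)
                : inspectedContainerName;
        return "cameleer-jars-" + name;
    }

    public static void main(String[] args) {
        System.out.println(volumeNameFor("/cameleer-app-blue"));
        // → cameleer-jars-cameleer-app-blue
    }
}
```

Because the name is derived rather than stored, blue/green teardown can reconstruct it from the container alone and remove the volume without extra bookkeeping.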

View File

@@ -13,12 +13,14 @@ import com.github.dockerjava.api.model.HealthCheck;
import com.github.dockerjava.api.model.HostConfig;
import com.github.dockerjava.api.model.RestartPolicy;
import com.github.dockerjava.api.model.Volume;
import com.github.dockerjava.core.command.LogContainerResultCallback;
import com.github.dockerjava.core.command.WaitContainerResultCallback;
import jakarta.annotation.PreDestroy;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
@@ -57,6 +59,14 @@ public class DockerRuntimeOrchestrator implements RuntimeOrchestrator {
     * the JAR before giving up. */
    private static final long LOADER_WAIT_TIMEOUT_SECONDS = 120;

    /** Tail size for loader-failure diagnostics surfaced in the thrown
     * RuntimeException. wget's BusyBox failure messages are short (e.g.
     * "wget: server returned error: HTTP/1.1 401" or "Permission denied"),
     * so 50 lines is plenty without bloating exception strings. */
    private static final int LOADER_LOG_TAIL_LINES = 50;
    private static final int LOADER_LOG_MAX_CHARS = 4096;
    private static final long LOADER_LOG_FETCH_TIMEOUT_SECONDS = 5;

    private final DockerClient dockerClient;
    private final String dockerRuntime;
@@ -148,13 +158,19 @@ public class DockerRuntimeOrchestrator implements RuntimeOrchestrator {
        }

        int exitCode;
        String loaderLogs = "";
        try {
            exitCode = dockerClient.waitContainerCmd(loaderId)
                    .exec(new WaitContainerResultCallback())
                    .awaitStatusCode(LOADER_WAIT_TIMEOUT_SECONDS, TimeUnit.SECONDS);
            if (exitCode != 0) {
                loaderLogs = captureLoaderLogs(loaderId);
            }
        } catch (Exception e) {
            loaderLogs = captureLoaderLogs(loaderId);
            cleanup(loaderId, volumeName);
            throw new RuntimeException("Loader wait failed for " + request.containerName()
                    + appendLogs(loaderLogs) + ": " + e.getMessage(), e);
        } finally {
            try {
                dockerClient.removeContainerCmd(loaderId).withForce(true).exec();
@@ -164,7 +180,8 @@ public class DockerRuntimeOrchestrator implements RuntimeOrchestrator {
        }

        if (exitCode != 0) {
            cleanupVolume(volumeName);
            throw new RuntimeException("Loader exited " + exitCode + " for " + request.containerName()
                    + appendLogs(loaderLogs));
        }

        // Phase 2: Main container — RO on the shared volume. Wrap in try/catch
@@ -271,6 +288,40 @@ public class DockerRuntimeOrchestrator implements RuntimeOrchestrator {
        return id;
    }

    /** Capture the loader's last stdout/stderr so the failure exception explains
     * itself instead of just "Loader exited N". Best-effort: a docker-side
     * failure to stream logs must not mask the original loader exit. */
    private String captureLoaderLogs(String loaderId) {
        try {
            StringBuilder buf = new StringBuilder();
            dockerClient.logContainerCmd(loaderId)
                    .withStdOut(true)
                    .withStdErr(true)
                    .withTail(LOADER_LOG_TAIL_LINES)
                    .exec(new LogContainerResultCallback() {
                        @Override
                        public void onNext(Frame frame) {
                            if (buf.length() < LOADER_LOG_MAX_CHARS) {
                                buf.append(new String(frame.getPayload(), StandardCharsets.UTF_8));
                            }
                        }
                    })
                    .awaitCompletion(LOADER_LOG_FETCH_TIMEOUT_SECONDS, TimeUnit.SECONDS);
            String text = buf.toString().trim();
            if (text.length() > LOADER_LOG_MAX_CHARS) {
                text = text.substring(text.length() - LOADER_LOG_MAX_CHARS);
            }
            return text;
        } catch (Exception e) {
            log.warn("Failed to capture loader logs for {}: {}", loaderId, e.getMessage());
            return "";
        }
    }

    private static String appendLogs(String logs) {
        return logs == null || logs.isEmpty() ? "" : ". loader output: " + logs;
    }
    /** Hardening contract from issue #152 — applied uniformly to loader + main. */
    private HostConfig baseHardenedHostConfig() {
        return HostConfig.newHostConfig()

View File

@@ -0,0 +1,135 @@
package com.cameleer.server.app.runtime;
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.model.AccessMode;
import com.github.dockerjava.api.model.Bind;
import com.github.dockerjava.api.model.Capability;
import com.github.dockerjava.api.model.Volume;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.testcontainers.DockerClientFactory;
import org.testcontainers.containers.BindMode;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.Network;
import org.testcontainers.containers.startupcheck.OneShotStartupCheckStrategy;
import org.testcontainers.images.builder.ImageFromDockerfile;
import org.testcontainers.junit.jupiter.Testcontainers;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import static org.assertj.core.api.Assertions.assertThat;
/**
* Real-Docker IT for the cameleer-runtime-loader image. Closes the regression
* window where /app/jars wasn't pre-created in the loader image — the
* orchestrator's fresh named volume mounted as root:root 0755 and wget
* (running as UID 1000) failed with "Permission denied" only at runtime.
*
* <p>Replicates the exact hardening shape from
* {@link DockerRuntimeOrchestrator}'s {@code baseHardenedHostConfig()} +
* loader-specific bind, against a real artifact server, and asserts the
* loader writes the expected file.
*/
@Testcontainers
class LoaderHardeningIT {

    private static final Path LOADER_DIR = Paths
            .get(System.getProperty("user.dir"))
            .getParent()
            .resolve("cameleer-runtime-loader");
    private static final int ARTIFACT_BYTES = 1024;

    private DockerClient dockerClient;
    private Network network;
    private GenericContainer<?> fileServer;
    private GenericContainer<?> loader;
    private Path fixtureDir;
    private String volumeName;
    private String loaderImageId;
    @BeforeEach
    void setUp() throws IOException {
        dockerClient = DockerClientFactory.instance().client();
        network = Network.newNetwork();

        fixtureDir = Files.createTempDirectory("loader-it-fixture-");
        Path payload = fixtureDir.resolve("artifact.jar");
        Files.write(payload, new byte[ARTIFACT_BYTES]);

        fileServer = new GenericContainer<>("nginx:alpine")
                .withNetwork(network)
                .withNetworkAliases("file-server")
                .withFileSystemBind(
                        fixtureDir.toAbsolutePath().toString(),
                        "/usr/share/nginx/html",
                        BindMode.READ_ONLY);
        fileServer.start();

        loaderImageId = new ImageFromDockerfile()
                .withFileFromPath(".", LOADER_DIR)
                .get();

        volumeName = "cameleer-loader-it-" + UUID.randomUUID().toString().substring(0, 8);
        dockerClient.createVolumeCmd().withName(volumeName).exec();
    }
    @AfterEach
    void tearDown() throws IOException {
        if (loader != null) {
            try { loader.stop(); } catch (Exception ignored) { }
        }
        if (volumeName != null) {
            try { dockerClient.removeVolumeCmd(volumeName).exec(); } catch (Exception ignored) { }
        }
        if (fileServer != null) fileServer.stop();
        if (network != null) network.close();
        if (fixtureDir != null) {
            // Close the walk stream; delete deepest paths first so
            // children go before their parent directories.
            try (var paths = Files.walk(fixtureDir)) {
                paths.sorted((a, b) -> b.getNameCount() - a.getNameCount())
                        .forEach(p -> {
                            try { Files.deleteIfExists(p); } catch (IOException ignored) { }
                        });
            }
        }
    }
    @Test
    void loaderWritesArtifactUnderHardenedContract() {
        // OneShotStartupCheckStrategy succeeds only when the container has
        // exited with status 0. Anything else (non-zero exit, timeout) throws
        // ContainerLaunchException — the assertion below is a belt-and-braces
        // explicit check on the resolved exit code.
        loader = new GenericContainer<>(loaderImageId)
                .withNetwork(network)
                .withEnv("ARTIFACT_URL", "http://file-server/artifact.jar")
                .withEnv("ARTIFACT_EXPECTED_SIZE", String.valueOf(ARTIFACT_BYTES))
                .withCreateContainerCmdModifier(cmd -> cmd.getHostConfig()
                        .withCapDrop(Capability.values())
                        .withSecurityOpts(List.of("no-new-privileges:true", "apparmor=docker-default"))
                        .withReadonlyRootfs(true)
                        .withPidsLimit(512L)
                        .withTmpFs(Map.of("/tmp", "rw,nosuid,size=64m"))
                        .withBinds(new Bind(volumeName, new Volume("/app/jars"), AccessMode.rw)))
                .withStartupCheckStrategy(new OneShotStartupCheckStrategy()
                        .withTimeout(Duration.ofSeconds(60)));
        loader.start();

        Long exit = dockerClient.inspectContainerCmd(loader.getContainerId())
                .exec()
                .getState()
                .getExitCodeLong();
        assertThat(exit)
                .as("loader must exit 0 — non-zero indicates the hardening contract "
                        + "broke the artifact write (e.g. /app/jars not owned by loader UID 1000)")
                .isZero();
    }
}