Harden multi-tenant runtime: sandbox untrusted user JVMs (Docker + K8s) #152

Open
opened 2026-04-25 09:56:57 +02:00 by claude · 0 comments

Threat model

Cameleer is marketed as an Apache Camel observability platform, but a Camel app is just a JVM — tenants can run arbitrary Java: Runtime.exec, Unsafe, JNI, reflection, dynamic class loading. Camel itself ships components that turn a single header into shell:

  • camel-exec — direct Runtime.exec (CVE-2025-27636 / CVE-2025-29891, Mar 2025)
  • camel-bean — header-driven reflective dispatch
  • camel-groovy, camel-joor, camel-mvel, camel-simple, camel-velocity, camel-mustache — Turing-complete or templating engines (CVE-2020-11994 SSTI→RCE)

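SecurityManager is not an option: it has been deprecated for removal since Java 17 (JEP 411) and is permanently disabled as of JDK 24 (JEP 486). The JVM is not a security boundary. All meaningful isolation must live at the OS / container / network layers.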

When we run as cameleer-saas, every tenant container we launch is hostile by default. We need to keep tenants from attacking:

  1. Each other
  2. The platform (server, host, cluster)
  3. The internet (mining, scanning, exfil, abuse of our IP rep)

Current gaps (from DockerRuntimeOrchestrator, DeploymentExecutor)

| Layer | Current state | Risk |
|---|---|---|
| Container runtime | runc only | One runc CVE = host takeover. CVE-2024-21626 (Leaky Vessels) and CVE-2025-31133 / 52565 / 52881 (Nov 2025) prove this is annual. |
| User namespace | None: tenant JVM runs as root | UID 0 in container = UID 0 on host on any escape |
| Read-only rootfs | Not set | Persistence + miner unpacking trivial |
| Capabilities | Full default set | CAP_NET_RAW, CAP_SYS_PTRACE, etc. all granted |
| Seccomp | Not applied (not even RuntimeDefault forced) | Whole syscall surface available |
| AppArmor / SELinux | None | No MAC layer |
| --pids-limit | None | Fork bomb crashes the host |
| Egress | Unrestricted | Mining, scanning, exfil, IMDS access |
| Cross-tenant network | Shared cameleer-traefik bridge | Any tenant can reach any other tenant's TCP ports |
| Per-tenant K8s | Not implemented | Greenfield: design isolation in from day one |

What's already OK:

  • Memory + CPU limits configurable
  • JAR bind-mounted read-only
  • No Docker socket / no --privileged
  • Per-tenant primary network exists (cameleer-tenant-{slug})
  • Per-env scoping (cameleer-env-{tenantId}-{envSlug})
  • No DB / JWT secret / K8s API token leaks into tenant containers
  • Tenant agent token is narrowly scoped (per app, env)

P0 — ship before any external SaaS tenant runs untrusted code

1. Sandboxed container runtime

  • Install gVisor (runsc) on Docker hosts. Pass --runtime=runsc for tenant containers via withRuntime("runsc") on HostConfig.
  • On K8s: declare a RuntimeClass named gvisor, force it on tenant namespaces via Kyverno policy.
  • High-sensitivity tier (regulated workloads): Kata + Firecracker (~150-300ms cold start, full guest kernel — Fly.io runs Java this way).
  • Single biggest leverage point. Converts a runc-CVE-of-the-quarter from "total host takeover" to "tenant pod owned, nothing else."

2. Harden every tenant container

Add to DockerRuntimeOrchestrator.HostConfig build (gated by cameleer.server.runtime.hardened=true, default true for SaaS):

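// docker-java HostConfig. No seccomp entry is passed: Docker applies its
// built-in default profile whenever no seccomp option is given.
hostConfig
  .withReadonlyRootfs(true)
  .withCapDrop(Capability.ALL)
  .withSecurityOpts(List.of(
      "no-new-privileges:true",
      "apparmor=docker-default"
  ))
  .withPidsLimit(512L)
  // UsernsMode is not set here: Docker only accepts "" or "host" for it, and
  // "host" opts the container OUT of remapping. UID remapping has to come from
  // the daemon-level userns-remap setting on each Docker host.
  .withTmpFs(Map.of("/tmp", "rw,noexec,nosuid,size=64m"));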

Enforce cgroup v2 on hosts (systemd.unified_cgroup_hierarchy=1).
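
As a rough illustration of how the hardened flag could gate these settings, the sketch below builds the equivalent docker run flags as plain strings, so it runs without docker-java on the classpath. The class HardenedRunArgs and its build method are hypothetical names, not existing Cameleer code. It emits no seccomp flag, on the assumption that Docker's default profile applies when none is specified, and no user-namespace flag, because remapping is configured on the daemon rather than per container.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: docker run flags mirroring the hardening settings above. */
final class HardenedRunArgs {

    /**
     * @param hardened value of cameleer.server.runtime.hardened
     * @param runtime  OCI runtime name such as "runsc"; null or blank keeps the daemon default
     */
    static List<String> build(boolean hardened, String runtime) {
        List<String> args = new ArrayList<>();
        if (runtime != null && !runtime.isBlank()) {
            args.add("--runtime=" + runtime);
        }
        if (!hardened) {
            return args; // flag off: only the runtime choice applies
        }
        args.add("--read-only");
        args.add("--cap-drop=ALL");
        args.add("--security-opt=no-new-privileges:true");
        args.add("--security-opt=apparmor=docker-default");
        args.add("--pids-limit=512");
        args.add("--tmpfs=/tmp:rw,noexec,nosuid,size=64m");
        return args;
    }
}
```

Calling HardenedRunArgs.build(true, "runsc") yields the full hardened set with gVisor selected, while build(false, null) yields an empty list, which is what makes a per-tenant ramp reversible.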

3. Operator-controlled JRE base image

Tenants upload only the JAR. Our CI builds the final image: pinned JRE 21, fixed entrypoint, JVM flags hard-coded:

-XX:+UseContainerSupport -XX:MaxRAMPercentage=75
-XX:ActiveProcessorCount=2 -XX:+ExitOnOutOfMemoryError
-XX:-UsePerfData -Dsun.net.inetaddr.ttl=30

(networkaddress.cache.ttl is a security property read from java.security, so passing it with -D has no effect; sun.net.inetaddr.ttl is the system-property equivalent.)

Strip -javaagent / -agentlib from any user-supplied manifest before launch. Our agent must be the only javaagent loaded.
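
A minimal sketch of the stripping step, assuming the user-supplied options arrive as a list of already-tokenized JVM arguments. JvmOptionFilter is a hypothetical name. The -agentpath prefix is an addition not named above; it is included because it loads a native agent the same way -agentlib does.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical filter that drops agent-loading JVM options supplied by a tenant. */
final class JvmOptionFilter {

    // Option prefixes that load a Java or native agent into the JVM.
    private static final List<String> AGENT_PREFIXES =
            List.of("-javaagent:", "-agentlib:", "-agentpath:");

    static boolean isAgentOption(String option) {
        String trimmed = option.trim();
        return AGENT_PREFIXES.stream().anyMatch(trimmed::startsWith);
    }

    /** Returns the options with every agent-loading entry removed, order preserved. */
    static List<String> stripAgentOptions(List<String> userOptions) {
        return userOptions.stream()
                .filter(o -> !isAgentOption(o))
                .collect(Collectors.toList());
    }
}
```

Filtering the argument list is not sufficient on its own: the JVM also picks up options from the JAVA_TOOL_OPTIONS environment variable, so tenant-controlled environment variables would need the same treatment.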

4. Default-deny egress per tenant

Today every tenant container can reach the internet and our control plane. Replace the shared cameleer-traefik bridge model:

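  • Docker (interim): per-tenant user-defined bridge networks + iptables rules dropping all traffic that leaves a tenant bridge, RFC1918 destinations included, unless it is addressed to our egress proxy. Container traffic is forwarded rather than generated on the host, so these rules belong in the DOCKER-USER chain (evaluated from FORWARD), not OUTPUT. Reverse the Traefik model: Traefik joins each tenant's network rather than exposing all tenants on a shared bridge.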
  • K8s: Cilium CiliumClusterwideNetworkPolicy — default-deny ingress + egress, allow only kube-dns + per-tenant L7 egress proxy. Block 169.254.169.254 (cloud IMDS), control-plane CIDR, K8s API service CIDR.
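
The blocked destinations above map onto address checks the JDK already provides. The sketch below is one way an egress proxy or rule generator might classify a destination; EgressGuard is a hypothetical name. It covers loopback, link-local (which includes the IMDS address) and RFC 1918 ranges. Control-plane and K8s service CIDRs are deployment-specific and would need explicit entries, as would IPv6 unique-local addresses (fc00::/7).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical guard: destinations a tenant container must never reach directly. */
final class EgressGuard {

    static boolean isForbidden(InetAddress dst) {
        return dst.isLoopbackAddress()
                || dst.isAnyLocalAddress()
                || dst.isLinkLocalAddress()   // 169.254.0.0/16, includes 169.254.169.254 (IMDS)
                || dst.isSiteLocalAddress();  // RFC 1918: 10/8, 172.16/12, 192.168/16
    }

    /** Accepts an IP literal only; literals are parsed without a DNS lookup. */
    static boolean isForbidden(String ipLiteral) {
        try {
            return isForbidden(InetAddress.getByName(ipLiteral));
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + ipLiteral, e);
        }
    }
}
```

A check like this only sees the address it is handed, so it has to run on the resolved IP at connect time; validating the hostname alone would leave room for DNS rebinding.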

5. Kyverno admission policies (when on K8s)

  • Reject privileged, hostPath, hostNetwork, hostPID, hostIPC, docker.sock mounts, dangerous capabilities.
  • Require runtimeClassName: gvisor on tenant namespaces.
  • Require automountServiceAccountToken: false on every tenant pod.
  • Require resource limits.
  • Pod Security Standards restricted enforced via namespace label pod-security.kubernetes.io/enforce: restricted.

6. Per-tenant K8s namespace

  • ResourceQuota + LimitRange per tenant, so that one tenant cannot exhaust the cluster.
  • Dedicated ServiceAccount with no RBAC and automountServiceAccountToken: false.

7. runc / containerd patch monitoring

Subscribe to runc-security mailing list. Auto-deploy patches.

P1 — first quarter after launch

8. Falco + custom Camel rules

Stable Falco rule set plus our additions:

  • Java process spawning sh|bash|curl|wget|nc|nmap|python|perl → catches camel-exec abuse instantly
  • Read of /proc/cpuinfo from tenant container → mining recon
  • Outbound TCP rate >50/min → port scanning
  • Sustained pod CPU >90% for >5 min on non-CPU-tier tenants → mining economic signal
  • DNS for known mining-pool / C2 domains

9. CoreDNS sinkhole

NXDOMAIN for mining-pool wildcards (*.minexmr.com, *.nanopool.org, ethermine, f2pool, supportxmr) and a published C2 feed.
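
The matching rule behind such a sinkhole is a label-aligned suffix match. The sketch below shows that rule in isolation so the blocklist can be unit-tested outside CoreDNS; Sinkhole is a hypothetical name, and treating a wildcard entry as also covering the zone apex is a design choice made here, not something DNS wildcards do by themselves.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical matcher for the sinkhole blocklist. */
final class Sinkhole {

    /** True if the queried name equals a blocked zone or is a subdomain of one. */
    static boolean matches(String queryName, List<String> blockedZones) {
        String q = normalize(queryName);
        for (String zone : blockedZones) {
            String z = normalize(zone);
            // The leading dot keeps "notminexmr.com" from matching "minexmr.com".
            if (q.equals(z) || q.endsWith("." + z)) {
                return true;
            }
        }
        return false;
    }

    /** Lower-cases, drops a leading "*." wildcard label and a trailing root dot. */
    private static String normalize(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        if (n.startsWith("*.")) {
            n = n.substring(2);
        }
        return n.endsWith(".") ? n.substring(0, n.length() - 1) : n;
    }
}
```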

10. L7 egress proxy per tenant

Squid / Envoy with domain allowlist the tenant declares at deploy time. Default: deny all egress except curated platform allowlist (Maven Central, etc.). Gives us billable byte-rate per tenant.

11. Camel component allowlist at deploy admission

Reject JARs whose META-INF/services/org/apache/camel/component/ includes exec, groovy, joor, mvel unless tenant has explicitly opted in. ASM-scan bytecode for Runtime.exec, ProcessBuilder, Unsafe, JNI — fail-closed if found and tenant tier is "untrusted."
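
The descriptor check can be sketched with the JDK's own JAR reader. CamelComponentScan is a hypothetical name, and the sketch only inspects top-level entries: a Spring Boot style fat JAR keeps its dependencies as nested JARs, which would need to be opened recursively. The bytecode scan mentioned above is not covered here.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

/** Hypothetical admission check for Camel component descriptors in a tenant JAR. */
final class CamelComponentScan {

    static final String PREFIX = "META-INF/services/org/apache/camel/component/";

    /** Returns the denied component schemes declared among the given entry names. */
    static Set<String> deniedComponents(Collection<String> entryNames, Set<String> denied) {
        Set<String> hits = new TreeSet<>();
        for (String name : entryNames) {
            if (name.startsWith(PREFIX) && !name.endsWith("/")) {
                String scheme = name.substring(PREFIX.length());
                if (denied.contains(scheme)) {
                    hits.add(scheme);
                }
            }
        }
        return hits;
    }

    /** Lists the entries of a JAR on disk and applies the check above. */
    static Set<String> scan(Path jar, Set<String> denied) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            List<String> names = jarFile.stream()
                    .map(JarEntry::getName)
                    .collect(Collectors.toList());
            return deniedComponents(names, denied);
        }
    }
}
```

An admission step could reject the upload whenever scan(jar, Set.of("exec", "groovy", "joor", "mvel")) is non-empty and the tenant has not opted in.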

12. Image scan + signing

Trivy in CI. Reject HIGH/CRITICAL CVEs at registry push gate. Cosign-sign every image. Kyverno verifyImages rejects unsigned at admission.

13. SBOM scan tenant JARs

Typosquatting detection (Dec 2025 org.fasterxml.jackson → com.fasterxml.jackson swap dropped a Cobalt Strike beacon). Dependency-Track or Trivy SBOM mode. Levenshtein-close GAV coords from non-allowlisted groups → alert.
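
The Levenshtein check can be illustrated with a small, dependency-free sketch. TyposquatCheck, its method names and the distance threshold are hypothetical; the example coordinates assume com.fasterxml.jackson.core is the allowlisted group.

```java
import java.util.Collection;
import java.util.Optional;

/** Hypothetical lookalike detector for Maven groupIds. */
final class TyposquatCheck {

    /** Classic two-row Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = cur;
            cur = swap;
        }
        return prev[b.length()];
    }

    /** Returns the trusted groupId that an untrusted one imitates, if any. */
    static Optional<String> lookalikeOf(String groupId, Collection<String> trusted, int maxDistance) {
        if (trusted.contains(groupId)) {
            return Optional.empty(); // exact match is the real thing, not a lookalike
        }
        for (String t : trusted) {
            if (levenshtein(groupId, t) <= maxDistance) {
                return Optional.of(t);
            }
        }
        return Optional.empty();
    }
}
```

The two Jackson-style groups differ only in their first label, an edit distance of 3, so a threshold of 3 would flag org.fasterxml.jackson.core against an allowlist containing com.fasterxml.jackson.core. Short groupIds produce false positives at that threshold, so the result is better treated as an alert than a hard reject.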

14. Camel CVE auto-patch policy

Refuse to deploy a JAR whose Camel version < our CVE-clean floor. Move floor monthly.
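
The floor comparison has to be numeric per segment, since a plain string comparison would rank 4.10 below 4.8. A minimal sketch, assuming the Camel version has already been extracted from the JAR; CamelVersionFloor is a hypothetical name and the versions in the usage note are placeholders, not our actual floor. A real policy would probably keep one floor per supported release line.

```java
/** Hypothetical deploy gate comparing a Camel version against a minimum floor. */
final class CamelVersionFloor {

    /** Compares dotted versions numerically, segment by segment; missing segments count as 0. */
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int xi = i < x.length ? leadingNumber(x[i]) : 0;
            int yi = i < y.length ? leadingNumber(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Parses the leading digits of a segment, so "5-SNAPSHOT" counts as 5. */
    private static int leadingNumber(String segment) {
        int end = 0;
        while (end < segment.length() && Character.isDigit(segment.charAt(end))) {
            end++;
        }
        return end == 0 ? 0 : Integer.parseInt(segment.substring(0, end));
    }

    static boolean isDeployable(String camelVersion, String floor) {
        return compare(camelVersion, floor) >= 0;
    }
}
```

With a placeholder floor of 4.8.5, isDeployable("4.10.2", "4.8.5") is true and isDeployable("3.22.3", "4.8.5") is false. Qualifiers such as -SNAPSHOT are ignored, so a snapshot compares equal to its release.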

15. Per-tenant egress quota + flow logs

Cilium bandwidth manager. Hubble flow logs → ClickHouse for per-tenant byte-rate alerting (we already have ClickHouse — natural fit).

P2 — at scale or for regulated workloads

  16. Kata + Firecracker universally: stop trusting the host kernel
  17. Tetragon in enforcing mode: kernel-level kill on policy violation
  18. Per-tenant node pools (taints/tolerations): Spectre / L1TF side-channel hardening
  19. Per-tenant SNAT egress IP via Cilium egress gateway: IP-reputation isolation
  20. Quarterly red-team: try to escape our own sandbox

Smallest first PR (deployable today, no K8s required)

  1. Extend DockerRuntimeOrchestrator HostConfig with: read_only, cap_drop ALL, no-new-privileges, Docker's default seccomp profile, apparmor=docker-default, pids_limit, user-namespace remapping, tmpfs /tmp. Put it behind the feature flag cameleer.server.runtime.hardened so we can ramp per-tenant.
  2. Install gVisor on Docker hosts. Add cameleer.server.runtime.dockerRuntime config; pass withRuntime("runsc"). One-line opt-in for migration.
  3. Reverse the cameleer-traefik bridge model: Traefik joins per-tenant networks rather than tenants joining a shared bridge. Kills cross-tenant TCP today.
  4. Add containerConfig.allowedEgress: List<String> to app config; default []. Wire to host iptables rules (interim) until Cilium / K8s lands.

These four are reversible, behind flags, no K8s required, and close the runc-host-takeover blast radius now.

Sub-issue tracking

This is an epic. File children for: gVisor rollout, container hardening flags, base-image control, Camel component allowlist, Kyverno policy bundle, Cilium NetworkPolicy bundle, Falco rule set, CoreDNS sinkhole, L7 egress proxy, Trivy/Cosign pipeline, JAR SBOM scanner, Tetragon enforcement, red-team exercise.

Key references

  • CVE-2024-21626 (Leaky Vessels): https://www.wiz.io/blog/leaky-vessels-container-escape-vulnerabilities
  • CVE-2025-31133/52565/52881 (runc, Nov 2025): https://www.sysdig.com/blog/runc-container-escape-vulnerabilities
  • Camel CVEs 2025 (Akamai): https://www.akamai.com/blog/security-research/march-apache-camel-vulnerability-detections-and-mitigations
  • Maven typosquat Dec 2025 (Aikido): https://www.aikido.dev/blog/maven-central-jackson-typosquatting-malware
  • Fly.io sandboxing: https://fly.io/blog/sandboxing-and-workload-isolation/
  • gVisor production at Ant Group: https://gvisor.dev/blog/2021/12/02/running-gvisor-in-production-at-scale-in-ant/
  • Cilium tenant isolation: https://docs.cilium.io/en/stable/security/policy/kubernetes/
  • K8s Pod Security Standards: https://kubernetes.io/docs/concepts/security/pod-security-standards/
  • Falco rules: https://github.com/falcosecurity/rules
claude added the security, pmf, epic labels 2026-04-25 09:56:57 +02:00