Commit Graph

12 Commits

hsiegeln
843e782340 chore(runtime): point shipped image defaults to registry.cameleer.io
Customers running this server with no overrides reach the public registry
alias, not the internal hostname. registry.cameleer.io and gitea.siegeln.net
resolve to the same registry: buildtime CI keeps pushing to gitea.siegeln.net,
while runtime defaults pull via the public alias.

- application.yml: baseimage, loaderimage defaults
- DeploymentExecutor.java: matching @Value defaults
- docker-orchestration.md: updates the documented default and notes the
  buildtime/public split so future changes don't "fix" the asymmetry
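
For illustration, the matching @Value defaults might look roughly like this; the
property prefix and the full image tags are assumptions, only the baseimage /
loaderimage keys and the registry.cameleer.io host come from this commit:

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Hedged sketch, not the actual DeploymentExecutor fields: property paths and the
// exact image tags are assumptions; only the keys and the registry host are from
// this commit.
@Component
class RuntimeImageDefaultsSketch {

    @Value("${cameleer.server.runtime.baseimage:registry.cameleer.io/cameleer/cameleer-runtime-base:latest}")
    String baseImage;

    @Value("${cameleer.server.runtime.loaderimage:registry.cameleer.io/cameleer/cameleer-runtime-loader:latest}")
    String loaderImage;
}
```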

Out of scope (intentionally still on gitea.siegeln.net):
- LoaderHardeningIT and the two DockerRuntimeOrchestrator unit tests.
  Tests are buildtime artifacts; LoaderHardeningIT pulls the real image
  via CI's pre-authenticated docker login to gitea.siegeln.net.
- deploy/base/*.yaml and deploy/overlays/main/*.yaml (internal k3s,
  customers don't use these manifests).
- pom.xml, .npmrc, ui/Dockerfile (build dependency sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:47:28 +02:00
hsiegeln
3334f0a1d2 chore: hand cameleer-runtime-loader image build to cameleer-saas
The loader is infra glue (per-replica init container that fetches the
tenant JAR from a signed URL) — same shape as runtime-base, postgres,
clickhouse, traefik, logto images already living in cameleer-saas. Move
the source + CI build there so all sidecar/infra image builds are in
one place; cameleer-server's CI is back to building only what it owns
(server, server-ui).

Coordination: cameleer-saas@ac8d628 added the build step and copied the
source verbatim. Published tag path is unchanged
(gitea.siegeln.net/cameleer/cameleer-runtime-loader:latest), so running
tenant servers continue pulling the same image without disruption.

This commit:
- Deletes cameleer-runtime-loader/ (Dockerfile, entrypoint.sh, README).
- Removes the conditional "Build and push runtime-loader" step and its
  upstream "Detect runtime-loader changes" detection from .gitea/workflows/ci.yml.
  Drops the fetch-depth: 0 + outputs.loader_changed plumbing that only
  existed for the change-detection path.
- Drops cameleer-runtime-loader from the in-job and cleanup-branch image
  cleanup loops — saas owns the registry lifecycle now.
- Rewrites LoaderHardeningIT to pull the published :latest from the
  registry (via Testcontainers GenericContainer) instead of building
  from a local Dockerfile. The IT now functions as a cross-repo contract
  test: cameleer-server's hardening expectations vs. the saas-published
  artifact. Local devs need `docker login gitea.siegeln.net`; CI runners
  are pre-authenticated.
- Updates .claude/rules/docker-orchestration.md to point at the new
  source-of-truth location and reframe LoaderHardeningIT as the
  cross-repo contract test.

The image's runtime contract (ARTIFACT_URL, ARTIFACT_EXPECTED_SIZE,
/app/jars/app.jar mount, exit code semantics) is unchanged. Future
contract changes need coordinated commits across both repos.
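
As a rough illustration of that contract from the consuming side, the rewritten IT
exercises the published image along these lines; this is a minimal sketch with
assumed names, not the actual LoaderHardeningIT (which also applies the full
hardening HostConfig and binds a fresh named volume at /app/jars):

```java
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.startupcheck.OneShotStartupCheckStrategy;
import org.testcontainers.utility.DockerImageName;

// Minimal sketch of exercising the published loader image; requires a prior
// `docker login gitea.siegeln.net` locally, as noted above.
class LoaderContractSketch {

    void runLoaderOnce(String artifactUrl, long expectedSize) {
        try (GenericContainer<?> loader = new GenericContainer<>(
                DockerImageName.parse("gitea.siegeln.net/cameleer/cameleer-runtime-loader:latest"))
                .withEnv("ARTIFACT_URL", artifactUrl)
                .withEnv("ARTIFACT_EXPECTED_SIZE", String.valueOf(expectedSize))
                // One-shot containers count as "started" only if they run to completion with exit 0.
                .withStartupCheckStrategy(new OneShotStartupCheckStrategy())) {
            loader.start(); // throws if the published artifact breaks the exit-code contract
        }
    }
}
```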

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:02:54 +02:00
hsiegeln
2e2d069530 feat(runtime): capture loader logs in failure exceptions; add LoaderHardeningIT regression guard
Two diagnostics-and-confidence follow-ups to the loader-init-container pattern.

1) DockerRuntimeOrchestrator now captures the loader's last 50 lines of
   stdout/stderr (capped at 4096 chars, 5s timeout) before the finally-remove
   and appends them to the thrown RuntimeException as
   `. loader output: <text>`. Best-effort: log-capture failures are swallowed
   and never mask the original failure (sketched below). Closes the visibility
   gap that turned a simple "wget: Permission denied" into the opaque
   "Loader exited 1".

2) New LoaderHardeningIT spins up a Testcontainers nginx serving a 1KB
   fixture, builds the loader image fresh from cameleer-runtime-loader/,
   and runs it under the exact baseHardenedHostConfig() shape (cap_drop ALL,
   readonly rootfs, /tmp tmpfs, no-new-privileges, apparmor=docker-default,
   pids=512) bound to a fresh named volume RW at /app/jars. Asserts exit 0.
   This would have caught the volume-permission regression in CI.

GenericContainer + OneShotStartupCheckStrategy is used instead of raw
docker-java waitContainerCmd because docker-java's unshaded api version
in this project's pom and testcontainers' shaded copy disagree on
WaitContainerCmd.getCondition() — going through GenericContainer keeps
the call inside testcontainers' shaded executor.
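
For illustration, the best-effort capture in (1) could look roughly like the
following docker-java sequence; the class and method names are made up for the
sketch, and `docker` stands in for the orchestrator's DockerClient:

```java
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.async.ResultCallback;
import com.github.dockerjava.api.model.Frame;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

// Hedged sketch: best-effort capture of the loader's last output before the
// finally-remove; capture failures never mask the original exception.
class LoaderOutputCaptureSketch {

    String lastLoaderOutput(DockerClient docker, String containerId) {
        StringBuilder out = new StringBuilder();
        try {
            ResultCallback.Adapter<Frame> callback = new ResultCallback.Adapter<>() {
                @Override
                public void onNext(Frame frame) {
                    out.append(new String(frame.getPayload(), StandardCharsets.UTF_8));
                }
            };
            docker.logContainerCmd(containerId)
                    .withStdOut(true)
                    .withStdErr(true)
                    .withTail(50)                          // last 50 lines
                    .exec(callback);
            callback.awaitCompletion(5, TimeUnit.SECONDS); // 5s timeout
        } catch (Exception e) {
            return "";                                     // swallow: never mask the original failure
        }
        return out.length() > 4096 ? out.substring(0, 4096) : out.toString();
    }
}
```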

Rules doc updated to point at the captured-output behaviour and the IT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:51:25 +02:00
hsiegeln
f772e868e6 docs: correct loader-network reachability claim; refresh HOWTO env vars
Final-review must-fixes:
- HOWTO.md: drop CAMELEER_SERVER_RUNTIME_JARDOCKERVOLUME; add the three new
  artifact env vars (loaderimage / artifacttokenttlseconds / artifactbaseurl).
- DeploymentExecutor @PostConstruct WARN, handoff doc, and docker-orchestration
  rule no longer claim the loader uses cameleer-traefik. The loader runs on
  the PRIMARY Docker network only — additional networks are attached after
  startContainer returns, by which time the loader has exited. SaaS still
  works because the tenant's primary network hosts the tenant server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:13:56 +02:00
hsiegeln
cc076b1923 fix(runtime): pre-pull loader image, plug volume-leak windows, document network dep
Pre-pull the loader image at PULL_IMAGE so the implicit pull on first
createContainerCmd doesn't bypass the 120s loader-wait timeout.

Wrap createAndStartLoader in try/catch so a create/start failure cleans
up the just-created volume; same guard around createAndStartMain on
phase-2 failures. Folds the wait-error message into the rethrown
RuntimeException so the cause chain is visible.

Add a @PostConstruct WARN when neither artifactbaseurl nor serverurl is
set so the implicit cameleer-server DNS dependency is loud at boot, and
document the loader-to-server reachability contract in
.claude/rules/docker-orchestration.md.
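
A minimal sketch of that boot-time WARN, assuming a Spring Boot 3 / jakarta setup;
the class name, field names, and exact property paths are illustrative (the real
check sits in DeploymentExecutor's @PostConstruct):

```java
import jakarta.annotation.PostConstruct;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Hedged sketch of the WARN described above; property paths are assumptions.
@Component
class LoaderReachabilitySketch {

    private static final Logger log = LoggerFactory.getLogger(LoaderReachabilitySketch.class);

    @Value("${cameleer.server.runtime.artifactbaseurl:}")
    String artifactBaseUrl;

    @Value("${cameleer.server.serverurl:}")
    String serverUrl;

    @PostConstruct
    void warnOnImplicitDnsDependency() {
        if (artifactBaseUrl.isBlank() && serverUrl.isBlank()) {
            log.warn("Neither artifactbaseurl nor serverurl is set; loader containers will "
                    + "rely on resolving cameleer-server over Docker DNS on their primary network.");
        }
    }
}
```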

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:26:35 +02:00
hsiegeln
1ddae94930 feat(runtime): init-container loader pattern + withUsernsMode (#152 hardening close)
Tasks 9+10+11 of the init-container-jar-fetch plan, landed atomically because
9 alone leaves the orchestrator+executor referencing removed ContainerRequest
fields.

ContainerRequest (core) drops jarPath/jarVolumeName/jarVolumeMountPath; adds
appVersionId, artifactDownloadUrl, artifactExpectedSize, loaderImage.

DockerRuntimeOrchestrator (app):
  - per-replica named volume "cameleer-jars-{containerName}"
  - phase 1: loader container with the volume mounted RW at /app/jars,
    ARTIFACT_URL + ARTIFACT_EXPECTED_SIZE env, full hardening contract
  - block on waitContainerCmd().awaitStatusCode(120s); on non-zero exit
    remove the loader, remove the volume, propagate RuntimeException so
    DeploymentExecutor marks the deployment FAILED. main is never created.
  - phase 2: main container with the same volume mounted RO at /app/jars
  - withUsernsMode("host:1000:65536") on BOTH containers — closes the last
    open hardening gap from issue #152
  - main entrypoint paths point at /app/jars/app.jar
  - extracted baseHardenedHostConfig() so loader and main share the
    cap_drop / security_opt / readonly / pids / tmpfs contract
  - removeContainer() also removes the per-replica volume so blue/green
    doesn't leak volumes
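
Condensed into a sketch, the phase-1/phase-2 flow above is roughly the following
docker-java sequence; names, parameters, and the reduced error handling are
assumptions, and the real orchestrator adds the entrypoint wiring, userns mode,
and the later try/catch cleanup guards:

```java
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.command.WaitContainerResultCallback;
import com.github.dockerjava.api.model.AccessMode;
import com.github.dockerjava.api.model.Bind;
import com.github.dockerjava.api.model.HostConfig;
import com.github.dockerjava.api.model.Volume;

import java.util.concurrent.TimeUnit;

// Hedged sketch of the two-phase loader flow described above.
class LoaderPhaseSketch {

    void deploy(DockerClient docker, HostConfig hardenedLoader, HostConfig hardenedMain,
                String containerName, String loaderImage, String baseImage,
                String artifactUrl, long expectedSize) {
        String volumeName = "cameleer-jars-" + containerName;
        docker.createVolumeCmd().withName(volumeName).exec();

        // Phase 1: loader mounts the volume RW at /app/jars and fetches the tenant JAR.
        String loaderId = docker.createContainerCmd(loaderImage)
                .withName(containerName + "-loader")
                .withEnv("ARTIFACT_URL=" + artifactUrl,
                         "ARTIFACT_EXPECTED_SIZE=" + expectedSize)
                .withHostConfig(hardenedLoader.withBinds(
                        new Bind(volumeName, new Volume("/app/jars"), AccessMode.rw)))
                .exec().getId();
        docker.startContainerCmd(loaderId).exec();

        Integer exit = docker.waitContainerCmd(loaderId)
                .exec(new WaitContainerResultCallback())
                .awaitStatusCode(120, TimeUnit.SECONDS);
        docker.removeContainerCmd(loaderId).exec();
        if (exit == null || exit != 0) {
            docker.removeVolumeCmd(volumeName).exec();           // don't leak the volume
            throw new RuntimeException("Loader exited " + exit); // main is never created
        }

        // Phase 2: main container mounts the same volume RO at /app/jars.
        String mainId = docker.createContainerCmd(baseImage)
                .withName(containerName)
                .withHostConfig(hardenedMain.withBinds(
                        new Bind(volumeName, new Volume("/app/jars"), AccessMode.ro)))
                .exec().getId();
        docker.startContainerCmd(mainId).exec();
    }
}
```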

DeploymentExecutor (app):
  - injects ArtifactDownloadTokenSigner; new @Value props loaderimage,
    artifacttokenttlseconds, artifactbaseurl
  - replaces the temporary getVersion(...).jarPath() bridge with a signed
    URL ${artifactBaseUrl}/api/v1/artifacts/{id}?exp&sig
  - drops the Files.exists pre-flight check; AppVersion.jarSizeBytes is
    the size-of-record check now
  - drops jarDockerVolume / jarStoragePath @Value fields and the volume
    plumbing in startReplica
  - DeployCtx carries appVersionId / artifactUrl / artifactExpectedSize
    in place of jarPath

Tests:
  - DockerRuntimeOrchestratorHardeningTest updated for the new shape;
    captures HostConfig on the MAIN container and asserts cap_drop ALL
    + no-new-privileges + apparmor + readonly + pids + tmpfs + the new
    withUsernsMode("host:1000:65536")
  - DockerRuntimeOrchestratorLoaderTest (new): verifies volume create →
    loader create with RW bind → loader started → awaited → loader
    removed → main create with RO bind → main started; verifies abort
    + cleanup on loader exit != 0 (loader removed, volume removed, main
    NEVER created); verifies userns_mode applied to both containers.

Config:
  - application.yml replaces jardockervolume with loaderimage,
    artifacttokenttlseconds, artifactbaseurl

Rules updated: .claude/rules/docker-orchestration.md (loader pattern,
userns, no more bind-mount); .claude/rules/core-classes.md
(ContainerRequest field map).

Test counts after change:
  - cameleer-server-core: 116/116 unit tests pass
  - cameleer-server-app: 273/273 unit tests pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:06:56 +02:00
hsiegeln
8e9ad47077 feat(runtime): harden tenant containers + auto-detect gVisor (#152)
Tenant JARs are arbitrary user code: Camel ships components (camel-exec,
camel-bean, MVEL/Groovy templating) that can turn a message header into shell
execution, and the SecurityManager is deprecated for removal in Java 17; the
JVM is not a security boundary.
This applies an unconditional hardening contract to every tenant
container so a single runc CVE no longer equals host takeover.

DockerRuntimeOrchestrator.startContainer now sets:
- cap_drop ALL (Capability.values() — docker-java has no ALL constant)
- security_opt: no-new-privileges, apparmor=docker-default
  (default seccomp profile applies implicitly)
- read_only rootfs, pids_limit=512
- /tmp tmpfs rw,nosuid,size=256m — no noexec, since Netty/Snappy/LZ4/Zstd
  dlopen native libs from /tmp via mmap(PROT_EXEC) which noexec blocks
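
In docker-java terms that contract is roughly the following; a sketch only, with
the helper name borrowed from the baseHardenedHostConfig() mentioned in the loader
commit above, and option spellings that may differ from the real code:

```java
import com.github.dockerjava.api.model.Capability;
import com.github.dockerjava.api.model.HostConfig;

import java.util.List;
import java.util.Map;

// Hedged sketch of the hardening contract listed above, not the actual implementation.
class HardenedHostConfigSketch {

    HostConfig baseHardenedHostConfig() {
        return HostConfig.newHostConfig()
                .withCapDrop(Capability.values())   // cap_drop ALL: docker-java has no ALL constant
                .withSecurityOpts(List.of("no-new-privileges", "apparmor=docker-default"))
                .withReadonlyRootfs(true)
                .withPidsLimit(512L)
                // rw,nosuid only: noexec would break native libs dlopen'ed from /tmp
                .withTmpFs(Map.of("/tmp", "rw,nosuid,size=256m"));
    }
}
```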

The orchestrator also probes `docker info` at construction and uses runsc
(gVisor) automatically when the daemon has it registered. Override via
cameleer.server.runtime.dockerruntime (e.g. "kata"); empty = auto.

Outbound TCP, DNS, and TLS are unaffected — caps/seccomp don't gate
those — so vanilla Camel-Kafka producers/consumers and REST integrations
keep working unchanged. Stateful tenants (Kafka Streams with on-disk
state stores, apps writing to /var/log/...) need explicit writeable
volumes; that's tracked in #153 as the natural follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 20:58:26 +02:00
hsiegeln
21db92ff00 fix(traefik): make TLS cert resolver configurable, omit when unset
Previously `TraefikLabelBuilder` hardcoded `tls.certresolver=default` on
every router. That assumes a resolver literally named `default` exists
in the Traefik static config — true for ACME-backed installs, false for
dev/local installs that use a file-based TLS store. Traefik logs
"Router uses a nonexistent certificate resolver" for the bogus resolver
on every managed app, and any future attempt to define a differently-named
real resolver would silently leave these routers still pointing at the
nonexistent one.

Server-wide setting via `CAMELEER_SERVER_RUNTIME_CERTRESOLVER` (empty by
default) flows through `ConfigMerger.GlobalRuntimeDefaults.certResolver`
into `ResolvedContainerConfig.certResolver`. When blank the
`tls.certresolver` label is omitted entirely; `tls=true` is still
emitted so Traefik serves the default TLS-store cert. When set, the
label is emitted with the configured resolver name.
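
The emission rule boils down to something like this; the label key layout and
method name are illustrative, not TraefikLabelBuilder's actual shape:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of the resolver handling described above.
class CertResolverLabelSketch {

    Map<String, String> tlsLabels(String routerName, String certResolver) {
        Map<String, String> labels = new LinkedHashMap<>();
        // tls=true is always emitted, so Traefik still serves the default TLS-store cert.
        labels.put("traefik.http.routers." + routerName + ".tls", "true");
        // The certresolver label is only emitted when a resolver is actually configured.
        if (certResolver != null && !certResolver.isBlank()) {
            labels.put("traefik.http.routers." + routerName + ".tls.certresolver", certResolver);
        }
        return labels;
    }
}
```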

Not per-app/per-env configurable: there is one Traefik per server
instance and one resolver config; app-level override would only let
users break their own routers.

TDD: TraefikLabelBuilderTest gains 3 cases (resolver set, null, blank).
Full unit suite 211/0/0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:18:47 +02:00
hsiegeln
165c9f10e3 feat(deploy): externalRouting toggle to keep apps off Traefik
Adds a boolean `externalRouting` flag (default `true`) on
ResolvedContainerConfig. When `false`, TraefikLabelBuilder emits only
the identity labels (`managed-by`, `cameleer.*`) and skips every
`traefik.*` label, so the container is not published by Traefik.
Sibling containers on `cameleer-traefik` / `cameleer-env-{tenant}-{env}`
can still reach it via Docker DNS on whatever port the app listens on.
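
In sketch form the toggle behaves roughly like this; label names other than
managed-by and the cameleer.* / traefik.* prefixes are assumptions, not the
real TraefikLabelBuilder output:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of the externalRouting short-circuit described above.
class ExternalRoutingSketch {

    Map<String, String> buildLabels(String appName, boolean externalRouting) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("managed-by", "cameleer");   // identity labels are always present
        labels.put("cameleer.app", appName);    // illustrative cameleer.* label
        if (!externalRouting) {
            return labels;                      // no traefik.* labels: Traefik ignores the container
        }
        labels.put("traefik.enable", "true");   // routing labels only when externalRouting=true
        // ... router/service labels for the app's port follow here in the real builder
        return labels;
    }
}
```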

TDD: new TraefikLabelBuilderTest covers enabled (default labels present),
disabled (zero traefik.* labels), and disabled (identity labels retained)
cases. Full module unit suite: 208/0/0.

Plumbed through ConfigMerger read, DeploymentExecutor snapshot, UI form
state, Resources tab toggle, POST payload, and snapshot-to-form mapping.
Rule files updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:03:48 +02:00
hsiegeln
c6aef5ab35 fix(deploy): Checkpoints — preserve STOPPED history, fix filter + placement
- Backend: rename deleteTerminalByAppAndEnvironment → deleteFailedByAppAndEnvironment.
  STOPPED rows were being wiped on every redeploy, so Checkpoints was always empty.
  Now only FAILED rows are pruned; STOPPED deployments are retained as restorable
  checkpoints (they still carry deployed_config_snapshot from their RUNNING window).
- UI filter: any deployment with a snapshot is a checkpoint (was RUNNING|DEGRADED only,
  which excluded the main case — the previous blue/green deployment now in STOPPED).
- UI placement: Checkpoints disclosure now renders inside IdentitySection, matching
  the design spec.
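
A minimal sketch of the backend change in the first bullet; the repository shape
and call site are assumptions, only the method name and the FAILED-only pruning
semantics come from this commit:

```java
// Hedged sketch, assuming a Spring Data style repository behind the interface.
class CheckpointPruningSketch {

    interface DeploymentRepository {
        void deleteFailedByAppAndEnvironment(String app, String environment);
    }

    private final DeploymentRepository deployments;

    CheckpointPruningSketch(DeploymentRepository deployments) {
        this.deployments = deployments;
    }

    void pruneBeforeRedeploy(String app, String environment) {
        // Previously deleteTerminalByAppAndEnvironment(...) also wiped STOPPED rows here,
        // which emptied Checkpoints on every redeploy. Now only FAILED rows are pruned;
        // STOPPED deployments keep their deployed_config_snapshot and stay restorable.
        deployments.deleteFailedByAppAndEnvironment(app, environment);
    }
}
```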

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:26:46 +02:00
hsiegeln
007597715a docs(rules): deployment strategies + generation suffix
Refresh the three rules files to match the new executor behavior:

- docker-orchestration.md: rewrite DeploymentExecutor Details with
  container naming scheme ({...}-{replica}-{generation}), strategy
  dispatch (blue-green vs rolling), and the new DEGRADED semantics
  (post-deploy only). Update TraefikLabelBuilder + ContainerLogForwarder
  bullets for the generation suffix + new cameleer.generation label.
- app-classes.md: DeploymentExecutor + TraefikLabelBuilder bullets
  mirror the same.
- core-classes.md: add DeploymentStrategy enum; note DEGRADED is now
  post-deploy-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:02:51 +02:00
hsiegeln
810f493639 chore: track .claude/rules/ and add self-maintenance instruction
Un-ignore .claude/rules/ so path-scoped rule files are shared via git.
Add instruction in CLAUDE.md to update rule files when modifying classes,
controllers, endpoints, or metrics — keeps rules current as part of
normal workflow rather than requiring separate maintenance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 09:26:53 +02:00