cameleer-saas/docs/superpowers/specs/2026-04-04-phase-3-runtime-orchestration.md
hsiegeln 0326dc6cce docs: add Phase 3 Runtime Orchestration spec
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 17:13:08 +02:00


Phase 3: Runtime Orchestration + Environments

Date: 2026-04-04
Status: Draft
Depends on: Phase 2 (Tenants + Identity + Licensing)
Gitea issue: #26

Context

Phase 2 delivered multi-tenancy, identity (Logto OIDC), and license management. The platform can create tenants and issue licenses, but there is nothing to run yet. Phase 3 is the core product differentiator: customers upload a Camel JAR, the platform builds an immutable container image with the cameleer3 agent auto-injected, and deploys it to a logical environment. This is "managed Camel runtime" — similar to Coolify or MuleSoft CloudHub, but purpose-built for Apache Camel with deep observability.

Docker-first. The KubernetesRuntimeOrchestrator is deferred to Phase 5.

Single-node constraint: Because Phase 3 builds images locally via Docker socket (no registry push), the cameleer-saas control plane and the Docker daemon must reside on the same host. This is inherent to the single-tenant Docker Compose stack and is acceptable for that target. In K8s mode (Phase 5), images are built via Kaniko and pushed to a registry, removing this constraint.

Key Decisions

| Decision | Choice | Rationale |
|---|---|---|
| JAR delivery | Direct HTTP upload (multipart) | Simplest path. Git-based and image-ref options can be added later. |
| Agent JAR source | Bundled in cameleer-runtime-base image | Version-locked to platform release. Updated by rebuilding the platform image with the new agent version. No runtime network dependency. |
| Build speed | Pre-built base image + single-layer customer add | Customer image build is FROM base + COPY app.jar. ~1-3 seconds. |
| Deployment model | Async with polling | Build, container startup, and health checking take time end to end. Deploy returns immediately with a deployment ID; the client polls for status. |
| Entity hierarchy | Environment → App → Deployment | User thinks "I'm in dev, deploy my app." Environment is the workspace context. |
| Environment provisioning | Hybrid auto + manual | Every tenant gets a default environment on creation. Additional environments are created manually, with the tier limit enforced. |
| Cross-environment isolation | Logical (not network) | Docker single-tenant mode — customer owns the stack. Data is separated by environmentId in cameleer3-server. Network isolation is a K8s Phase 5 concern. |
| Container networking | Shared cameleer bridge network | Customer containers join the existing network. The agent reaches cameleer3-server at http://cameleer3-server:8081. |
| Container naming | {tenant-slug}-{env-slug}-{app-slug} | Human-readable, unique, identifies tenant + environment + app at a glance. |
| Bootstrap tokens | Shared CAMELEER_AUTH_TOKEN from cameleer3-server config | The platform reads the existing token and injects it into customer containers. Environment separation via the agent's environmentId claim, not the token. Per-environment tokens deferred to K8s Phase 5. |
| Health checking | Agent health endpoint (port 9464) | Guaranteed to exist, no user config needed. User-defined health endpoints deferred. |
| Inbound HTTP routing | Not in Phase 3 | Most Camel apps are consumers (queues, polls), not servers. Traefik routing for customer apps deferred to Phase 4/4.5. |
| Container logs | Captured via docker-java, written to ClickHouse | Unified log query surface from day 1. Same pattern future app logs will use. |
| Resource constraints | cgroups via docker-java memory + cpuShares | Protect the control plane from noisy neighbors. Tier-based defaults. Even in single-tenant Docker mode, a runaway Camel app shouldn't starve Traefik/Postgres/Logto. |
| Orchestrator metadata | JSONB field on deployment entity | Docker stores containerId. K8s (Phase 5) stores namespace, deploymentName, gitCommit. Same table, different orchestrator. |
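
The container-naming and image-tagging conventions from the table boil down to simple string assembly. A minimal sketch — class and method names are hypothetical helpers, not actual platform code:

```java
// Hypothetical naming helpers mirroring the conventions in the spec.
public final class RuntimeNames {
    private RuntimeNames() {}

    /** Container name: {tenant-slug}-{env-slug}-{app-slug}. */
    public static String containerName(String tenant, String env, String app) {
        return tenant + "-" + env + "-" + app;
    }

    /** Image reference: cameleer-runtime-{tenant}-{app}:v{version}. */
    public static String imageRef(String tenant, String app, int version) {
        return "cameleer-runtime-" + tenant + "-" + app + ":v" + version;
    }
}
```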

Data Model

Environment Entity

CREATE TABLE environments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    slug VARCHAR(100) NOT NULL,
    display_name VARCHAR(255) NOT NULL,
    bootstrap_token TEXT NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'ACTIVE',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, slug)
);

CREATE INDEX idx_environments_tenant_id ON environments(tenant_id);
  • slug — URL-safe, immutable, unique per tenant. Auto-created environment gets slug default.
  • display_name — User-editable. Auto-created environment gets Default.
  • bootstrap_token — The CAMELEER_AUTH_TOKEN value used for customer containers in this environment. In Docker mode, all environments share the same value (read from platform config). In K8s mode (Phase 5), can be unique per environment.
  • status — ACTIVE or SUSPENDED.
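
A sketch of what "URL-safe" slug validation could look like; the spec only fixes the length limit (VARCHAR(100)) and uniqueness, so the exact character rules below are an assumption:

```java
import java.util.regex.Pattern;

// Hypothetical slug validator: lowercase alphanumerics and hyphens,
// no leading/trailing hyphen, 1-100 characters (fits VARCHAR(100)).
public final class Slugs {
    private static final Pattern SLUG =
            Pattern.compile("[a-z0-9]([a-z0-9-]{0,98}[a-z0-9])?");

    public static boolean isValid(String slug) {
        return slug != null && SLUG.matcher(slug).matches();
    }
}
```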

App Entity

CREATE TABLE apps (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    environment_id UUID NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
    slug VARCHAR(100) NOT NULL,
    display_name VARCHAR(255) NOT NULL,
    jar_storage_path VARCHAR(500),
    jar_checksum VARCHAR(64),
    jar_original_filename VARCHAR(255),
    jar_size_bytes BIGINT,
    current_deployment_id UUID,
    previous_deployment_id UUID,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(environment_id, slug)
);

CREATE INDEX idx_apps_environment_id ON apps(environment_id);
  • slug — URL-safe, immutable, unique per environment.
  • jar_storage_path — Relative path to uploaded JAR (e.g., tenants/{tenant-slug}/envs/{env-slug}/apps/{app-slug}/app.jar). Relative to the configured storage root (cameleer.runtime.jar-storage-path). Makes it easy to migrate the storage volume to a different mount point or cloud provider.
  • jar_checksum — SHA-256 hex digest of the uploaded JAR.
  • current_deployment_id — Points to the active deployment. Nullable (app created but never deployed).
  • previous_deployment_id — Points to the last known good deployment. When a new deploy succeeds, current becomes the new one and previous becomes the old current. When a deploy fails, current stays as the failed one but previous still points to the last good version, enabling a rollback button.
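
The relative-path scheme for jar_storage_path can be sketched as below; JarPaths and its methods are illustrative, not actual entity or service code:

```java
import java.nio.file.Path;

// Hypothetical helper building the relative JAR path stored in the DB,
// and resolving it against the configured storage root at read time.
public final class JarPaths {
    /** Relative path: tenants/{tenant}/envs/{env}/apps/{app}/app.jar */
    public static String relativePath(String tenant, String env, String app) {
        return "tenants/" + tenant + "/envs/" + env + "/apps/" + app + "/app.jar";
    }

    /** Resolve against the root from cameleer.runtime.jar-storage-path. */
    public static Path absolutePath(Path storageRoot, String relative) {
        return storageRoot.resolve(relative);
    }
}
```

Because the database stores only the relative part, remounting the volume elsewhere changes a single config value, not every row.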

Deployment Entity

CREATE TABLE deployments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    app_id UUID NOT NULL REFERENCES apps(id) ON DELETE CASCADE,
    version INTEGER NOT NULL,
    image_ref VARCHAR(500) NOT NULL,
    desired_status VARCHAR(20) NOT NULL DEFAULT 'RUNNING',
    observed_status VARCHAR(20) NOT NULL DEFAULT 'BUILDING',
    orchestrator_metadata JSONB DEFAULT '{}',
    error_message TEXT,
    deployed_at TIMESTAMPTZ,
    stopped_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(app_id, version)
);

CREATE INDEX idx_deployments_app_id ON deployments(app_id);
  • version — Sequential per app (1, 2, 3...). Incremented on each deploy.
  • image_ref — Docker image reference, e.g., cameleer-runtime-{tenant}-{app}:v3.
  • desired_status — What the user wants: RUNNING, STOPPED.
  • observed_status — What the platform sees: BUILDING, STARTING, RUNNING, FAILED, STOPPED.
  • orchestrator_metadata — Docker mode: {"containerId": "abc123"}. K8s mode (Phase 5): {"namespace": "...", "deploymentName": "...", "gitCommit": "..."}.
  • error_message — Populated when observed_status is FAILED. Build error, startup crash, etc.
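
The desired/observed split maps naturally onto two enums. The isTerminal helper below is an assumed simplification (states where a poller can stop), not something the spec mandates:

```java
// Sketch of the status model; enum members match the spec exactly.
public final class DeploymentStatus {
    public enum Desired { RUNNING, STOPPED }
    public enum Observed { BUILDING, STARTING, RUNNING, FAILED, STOPPED }

    /** Observed states in which a status poller can stop polling. */
    public static boolean isTerminal(Observed s) {
        return s == Observed.RUNNING || s == Observed.FAILED || s == Observed.STOPPED;
    }
}
```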

Component Architecture

RuntimeOrchestrator Interface

public interface RuntimeOrchestrator {
    String buildImage(BuildImageRequest request);
    void startContainer(StartContainerRequest request);
    void stopContainer(String containerId);
    void removeContainer(String containerId);
    ContainerStatus getContainerStatus(String containerId);
    void streamLogs(String containerId, LogConsumer consumer);
}
  • Single interface, implemented by DockerRuntimeOrchestrator (Phase 3) and KubernetesRuntimeOrchestrator (Phase 5).
  • Injected via Spring @Profile or @ConditionalOnProperty.
  • Request objects carry all context (image name, env vars, network, labels, etc.).

DockerRuntimeOrchestrator

Uses the com.github.docker-java Docker Java client (docker-java-core plus the httpclient5 transport). Connects via the Docker socket (/var/run/docker.sock).

buildImage:

  1. Creates a temporary build context directory
  2. Writes a Dockerfile:
    FROM cameleer-runtime-base:{platform-version}
    COPY app.jar /app/app.jar
    
  3. Copies the customer JAR as app.jar
  4. Calls docker build via docker-java
  5. Tags as cameleer-runtime-{tenant-slug}-{app-slug}:v{version}
  6. Returns the image reference
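
Step 2's generated Dockerfile is just two lines, so the build-context writer reduces to string assembly. A hedged sketch (BuildContext is a hypothetical name):

```java
// Generates the per-deployment Dockerfile described in the buildImage steps.
public final class BuildContext {
    public static String dockerfile(String baseImage, String platformVersion) {
        return "FROM " + baseImage + ":" + platformVersion + "\n"
             + "COPY app.jar /app/app.jar\n";
    }
}
```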

startContainer:

  1. Creates container with:
    • Image: the built image reference
    • Name: {tenant-slug}-{env-slug}-{app-slug}
    • Network: cameleer (the platform bridge network)
    • Environment variables:
      • CAMELEER_AUTH_TOKEN={bootstrap-token}
      • CAMELEER_EXPORT_TYPE=HTTP
      • CAMELEER_EXPORT_ENDPOINT=http://cameleer3-server:8081
      • CAMELEER_APPLICATION_ID={app-slug}
      • CAMELEER_ENVIRONMENT_ID={env-slug}
      • CAMELEER_DISPLAY_NAME={tenant-slug}-{env-slug}-{app-slug}
    • Resource constraints (cgroups):
      • memory / memorySwap — hard memory limit per container
      • cpuShares — relative CPU weight (default 512)
      • Defaults configurable via cameleer.runtime.container-memory-limit (default 512m) and cameleer.runtime.container-cpu-shares (default 512)
      • Protects the control plane (Traefik, Postgres, Logto, cameleer-saas) from noisy neighbor Camel apps
    • Health check: HTTP GET to agent health port 9464
  2. Starts container
  3. Returns container ID
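
The environment-variable set passed to the container can be assembled as a plain map before being handed to docker-java. A sketch with assumed names, mirroring the variable list above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical assembly of the customer container's environment variables.
public final class ContainerEnv {
    public static Map<String, String> forContainer(String bootstrapToken,
            String appSlug, String envSlug, String containerName) {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("CAMELEER_AUTH_TOKEN", bootstrapToken);
        env.put("CAMELEER_EXPORT_TYPE", "HTTP");
        env.put("CAMELEER_EXPORT_ENDPOINT", "http://cameleer3-server:8081");
        env.put("CAMELEER_APPLICATION_ID", appSlug);
        env.put("CAMELEER_ENVIRONMENT_ID", envSlug);
        env.put("CAMELEER_DISPLAY_NAME", containerName);
        return env;
    }
}
```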

streamLogs:

  • Attaches to container stdout/stderr via docker-java LogContainerCmd
  • Passes log lines to a LogConsumer callback (for ClickHouse ingestion)

cameleer-runtime-base Image

A pre-built Docker image containing everything except the customer JAR:

FROM eclipse-temurin:21-jre-alpine
WORKDIR /app

COPY cameleer3-agent-{version}-shaded.jar /app/agent.jar

ENTRYPOINT exec java \
  -Dcameleer.export.type=${CAMELEER_EXPORT_TYPE:-HTTP} \
  -Dcameleer.export.endpoint=${CAMELEER_EXPORT_ENDPOINT} \
  -Dcameleer.agent.name=${HOSTNAME} \
  -Dcameleer.agent.application=${CAMELEER_APPLICATION_ID:-default} \
  -Dcameleer.agent.environment=${CAMELEER_ENVIRONMENT_ID:-default} \
  -Dcameleer.routeControl.enabled=${CAMELEER_ROUTE_CONTROL_ENABLED:-false} \
  -Dcameleer.replay.enabled=${CAMELEER_REPLAY_ENABLED:-false} \
  -Dcameleer.health.enabled=true \
  -Dcameleer.health.port=9464 \
  -javaagent:/app/agent.jar \
  -jar /app/app.jar
  • Built as part of the CI pipeline for cameleer-saas.
  • Published to Gitea registry: gitea.siegeln.net/cameleer/cameleer-runtime-base:{version}.
  • Version tracks the platform version + agent version (e.g., 0.2.0 includes agent 1.0-SNAPSHOT).
  • Updating the agent JAR = rebuild this image with the new agent version → rebuild cameleer-saas image → all new deployments use the new agent.

JAR Upload

  • POST /api/environments/{eid}/apps with multipart file
  • Validation:
    • File extension: .jar
    • Max size: 200 MB (configurable via cameleer.runtime.max-jar-size)
    • SHA-256 checksum computed and stored
  • Storage: relative path tenants/{tenant-slug}/envs/{env-slug}/apps/{app-slug}/app.jar under the configured storage root (cameleer.runtime.jar-storage-path, default /data/jars)
    • Docker volume jardata mounted into cameleer-saas container
    • Database stores the relative path only — decoupled from mount point
  • JAR is overwritten on re-upload (new deploy uses new JAR)
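
Computing the stored jar_checksum uses standard JDK APIs; the sketch below digests an in-memory byte array, while the real upload path would stream the file through the digest the same way:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// SHA-256 hex digest as stored in jar_checksum (64 hex chars, fits VARCHAR(64)).
public final class Checksums {
    public static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```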

Async Deployment Pipeline

  1. API receives deploy request → creates Deployment entity with observed_status=BUILDING → returns deployment ID (HTTP 202 Accepted)
  2. Background thread (Spring @Async with a bounded thread pool):
    a. Calls orchestrator.buildImage(...) → updates observed_status=STARTING
    b. Calls orchestrator.startContainer(...) → updates observed_status=STARTING
    c. Polls agent health endpoint (port 9464) with timeout → updates to RUNNING or FAILED
    d. On any failure → updates observed_status=FAILED, error_message=...
  3. Client polls GET /api/apps/{aid}/deployments/{did} for status updates
  4. On success: set previous_deployment_id = old current_deployment_id, then current_deployment_id = new deployment. Stop and remove the old container.
  5. On failure: current_deployment_id is set to the failed deployment (so status is visible), previous_deployment_id still points to the last known good version. Enables rollback.
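
The pointer bookkeeping in steps 4-5 can be modeled minimally as below. Names are illustrative, and the sketch assumes the pre-deploy current was the last known good deployment (consecutive-failure handling is out of scope here):

```java
// Minimal model of current/previous deployment pointer updates.
public final class AppPointers {
    public String currentDeploymentId;   // maps to current_deployment_id
    public String previousDeploymentId;  // maps to previous_deployment_id

    /** Both success and failure advance current; previous keeps the rollback target. */
    public void onDeployFinished(String newDeploymentId) {
        previousDeploymentId = currentDeploymentId;
        currentDeploymentId = newDeploymentId;
    }
}
```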

Container Logs → ClickHouse

  • When a container starts, platform attaches a log consumer via orchestrator.streamLogs()
  • Log consumer batches lines and writes to ClickHouse table:
CREATE TABLE IF NOT EXISTS container_logs (
    tenant_id UUID,
    environment_id UUID,
    app_id UUID,
    deployment_id UUID,
    timestamp DateTime64(3),
    stream String,  -- 'stdout' or 'stderr'
    message String
) ENGINE = MergeTree()
ORDER BY (tenant_id, environment_id, app_id, timestamp);
  • Logs retrieved via GET /api/apps/{aid}/logs?since=...&limit=... which queries ClickHouse
  • ClickHouse TTL can enforce retention based on license retention_days limit (future enhancement)
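
Batching before the ClickHouse insert might look like the size-triggered buffer below; names are assumptions, and a real consumer would also flush on a timer so quiet containers don't hold lines indefinitely:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical size-triggered log batcher feeding a ClickHouse writer.
public final class LogBatcher {
    private final int batchSize;
    private final Consumer<List<String>> flushTarget;
    private final List<String> buffer = new ArrayList<>();

    public LogBatcher(int batchSize, Consumer<List<String>> flushTarget) {
        this.batchSize = batchSize;
        this.flushTarget = flushTarget;
    }

    public void accept(String line) {
        buffer.add(line);
        if (buffer.size() >= batchSize) flush();
    }

    public void flush() {
        if (buffer.isEmpty()) return;
        flushTarget.accept(new ArrayList<>(buffer)); // hand off a copy
        buffer.clear();
    }
}
```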

Bootstrap Token Handling

In Docker single-tenant mode, all environments share the single cameleer3-server instance and its single CAMELEER_AUTH_TOKEN. The platform reads this token from its own configuration (cameleer.runtime.bootstrap-token / CAMELEER_AUTH_TOKEN env var) and injects it into every customer container. No changes to cameleer3-server are needed.

Environment-level data separation happens at the agent registration level — the agent sends its environmentId claim when it registers, and cameleer3-server uses that to scope all data. The bootstrap token is the same across environments in a Docker stack.

The bootstrap_token column on the environment entity stores the token value used for that environment's containers. In Docker mode this is the same shared value for all environments. In K8s mode (Phase 5), each environment could have its own cameleer3-server instance with a unique token, enabling true per-environment token isolation.

API Surface

Environment Endpoints

POST   /api/tenants/{tenantId}/environments
  Body: { "slug": "dev", "displayName": "Development" }
  Returns: 201 Created + EnvironmentResponse
  Enforces: tier-based max_environments limit from license

GET    /api/tenants/{tenantId}/environments
  Returns: 200 + List<EnvironmentResponse>

GET    /api/tenants/{tenantId}/environments/{environmentId}
  Returns: 200 + EnvironmentResponse

PATCH  /api/tenants/{tenantId}/environments/{environmentId}
  Body: { "displayName": "New Name" }
  Returns: 200 + EnvironmentResponse

DELETE /api/tenants/{tenantId}/environments/{environmentId}
  Returns: 204 No Content
  Precondition: no running apps in environment
  Restriction: cannot delete the auto-created "default" environment

App Endpoints

POST   /api/environments/{environmentId}/apps
  Multipart: file (JAR) + metadata { "slug": "order-service", "displayName": "Order Service" }
  Returns: 201 Created + AppResponse
  Validates: file extension, size, checksum

GET    /api/environments/{environmentId}/apps
  Returns: 200 + List<AppResponse>

GET    /api/environments/{environmentId}/apps/{appId}
  Returns: 200 + AppResponse (includes current deployment status)

PUT    /api/environments/{environmentId}/apps/{appId}/jar
  Multipart: file (JAR)
  Returns: 200 + AppResponse
  Purpose: re-upload JAR without creating new app

DELETE /api/environments/{environmentId}/apps/{appId}
  Returns: 204 No Content
  Side effect: stops running container, removes image

Deployment Endpoints

POST   /api/apps/{appId}/deploy
  Body: {} (empty — uses current JAR)
  Returns: 202 Accepted + DeploymentResponse (with deployment ID, status=BUILDING)

GET    /api/apps/{appId}/deployments
  Returns: 200 + List<DeploymentResponse> (ordered by version desc)

GET    /api/apps/{appId}/deployments/{deploymentId}
  Returns: 200 + DeploymentResponse (poll this for status updates)

POST   /api/apps/{appId}/stop
  Returns: 200 + DeploymentResponse (desired_status=STOPPED)

POST   /api/apps/{appId}/restart
  Returns: 202 Accepted + DeploymentResponse (stops + redeploys same image)

Log Endpoints

GET    /api/apps/{appId}/logs
  Query: since (ISO timestamp), until (ISO timestamp), limit (default 500), stream (stdout/stderr/both)
  Returns: 200 + List<LogEntry>
  Source: ClickHouse container_logs table

Tier Enforcement

| Tier | max_environments | max_agents (apps) |
|---|---|---|
| LOW | 1 | 3 |
| MID | 2 | 10 |
| HIGH | unlimited (-1) | 50 |
| BUSINESS | unlimited (-1) | unlimited (-1) |
  • max_environments enforced on POST /api/tenants/{tid}/environments. The auto-created default environment counts toward the limit.
  • max_agents enforced on POST /api/environments/{eid}/apps. Count is total apps across all environments in the tenant.
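
Both checks share the same shape: -1 means unlimited, otherwise the current count must stay below the limit. A sketch (TierLimits is a hypothetical name, not the actual enforcement code):

```java
// Tier limit check matching the table above; -1 encodes "unlimited".
public final class TierLimits {
    public static boolean withinLimit(int limit, long currentCount) {
        return limit == -1 || currentCount < limit;
    }
}
```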

Docker Compose Changes

The cameleer-saas service needs:

  • Docker socket mount: /var/run/docker.sock:/var/run/docker.sock (already present in docker-compose.yml)
  • JAR storage volume: jardata:/data/jars
  • cameleer-runtime-base image must be available (pre-pulled or built locally)

The cameleer3-server CAMELEER_AUTH_TOKEN is read by cameleer-saas from shared environment config and injected into customer containers.

New volume in docker-compose.yml:

volumes:
  jardata:

Dependencies

New Maven Dependencies

<!-- Docker Java client -->
<dependency>
    <groupId>com.github.docker-java</groupId>
    <artifactId>docker-java-core</artifactId>
    <version>3.4.1</version>
</dependency>
<dependency>
    <groupId>com.github.docker-java</groupId>
    <artifactId>docker-java-transport-httpclient5</artifactId>
    <version>3.4.1</version>
</dependency>

<!-- ClickHouse JDBC -->
<dependency>
    <groupId>com.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.7.1</version>
    <classifier>all</classifier>
</dependency>

New Configuration Properties

cameleer:
  runtime:
    max-jar-size: 209715200  # 200 MB
    jar-storage-path: /data/jars
    base-image: cameleer-runtime-base:latest
    docker-network: cameleer
    agent-health-port: 9464
    health-check-timeout: 60  # seconds to wait for healthy status
    deployment-thread-pool-size: 4
    container-memory-limit: 512m  # per customer container
    container-cpu-shares: 512     # relative weight (default Docker is 1024)
  clickhouse:
    url: jdbc:clickhouse://clickhouse:8123/cameleer

Verification Plan

  1. Upload a sample Camel JAR via POST /api/environments/{eid}/apps
  2. Deploy via POST /api/apps/{aid}/deploy — returns 202 with deployment ID
  3. Poll GET /api/apps/{aid}/deployments/{did} — status transitions: BUILDING → STARTING → RUNNING
  4. Container visible in docker ps as {tenant}-{env}-{app}
  5. Container is on the cameleer network
  6. cameleer3 agent registers with cameleer3-server (visible in server logs)
  7. Agent health endpoint responds on port 9464
  8. Container logs appear in ClickHouse container_logs table
  9. GET /api/apps/{aid}/logs returns log entries
  10. POST /api/apps/{aid}/stop stops the container, status becomes STOPPED
  11. POST /api/apps/{aid}/restart restarts with same image
  12. Re-upload JAR + redeploy creates deployment v2, stops v1
  13. Tier limits enforced: LOW tenant cannot create more than 1 environment or 3 apps
  14. Default environment auto-created on tenant provisioning