docs: add Phase 3 Runtime Orchestration spec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hsiegeln
2026-04-04 17:13:08 +02:00
parent 5d14f78b9d
commit 0326dc6cce


# Phase 3: Runtime Orchestration + Environments
**Date:** 2026-04-04
**Status:** Draft
**Depends on:** Phase 2 (Tenants + Identity + Licensing)
**Gitea issue:** #26
## Context
Phase 2 delivered multi-tenancy, identity (Logto OIDC), and license management. The platform can create tenants and issue licenses, but there is nothing to run yet. Phase 3 is the core product differentiator: customers upload a Camel JAR, the platform builds an immutable container image with the cameleer3 agent auto-injected, and deploys it to a logical environment. This is "managed Camel runtime" — similar to Coolify or MuleSoft CloudHub, but purpose-built for Apache Camel with deep observability.
Docker-first. The `KubernetesRuntimeOrchestrator` is deferred to Phase 5.
**Single-node constraint:** Because Phase 3 builds images locally via Docker socket (no registry push), the cameleer-saas control plane and the Docker daemon must reside on the same host. This is inherent to the single-tenant Docker Compose stack and is acceptable for that target. In K8s mode (Phase 5), images are built via Kaniko and pushed to a registry, removing this constraint.
## Key Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| JAR delivery | Direct HTTP upload (multipart) | Simplest path. Git-based and image-ref options can be added later. |
| Agent JAR source | Bundled in `cameleer-runtime-base` image | Version-locked to platform release. Updated by rebuilding the platform image with the new agent version. No runtime network dependency. |
| Build speed | Pre-built base image + single-layer customer add | Customer image build is `FROM base` + `COPY app.jar`. ~1-3 seconds. |
| Deployment model | Async with polling | Image builds are inherently slow. Deploy returns immediately with deployment ID. Client polls for status. |
| Entity hierarchy | Environment → App → Deployment | User thinks "I'm in dev, deploy my app." Environment is the workspace context. |
| Environment provisioning | Hybrid auto + manual | Every tenant gets a `default` environment on creation. Additional environments created manually, tier limit enforced. |
| Cross-environment isolation | Logical (not network) | Docker single-tenant mode — customer owns the stack. Data separated by `environmentId` in cameleer3-server. Network isolation is a K8s Phase 5 concern. |
| Container networking | Shared `cameleer` bridge network | Customer containers join the existing network. Agent reaches cameleer3-server at `http://cameleer3-server:8081`. |
| Container naming | `{tenant-slug}-{env-slug}-{app-slug}` | Human-readable, unique, identifies tenant+environment+app at a glance. |
| Bootstrap tokens | Shared `CAMELEER_AUTH_TOKEN` from cameleer3-server config | Platform reads the existing token and injects it into customer containers. Environment separation via agent `environmentId` claim, not token. Per-environment tokens deferred to K8s Phase 5. |
| Health checking | Agent health endpoint (port 9464) | Guaranteed to exist, no user config needed. User-defined health endpoints deferred. |
| Inbound HTTP routing | Not in Phase 3 | Most Camel apps are consumers (queues, polls), not servers. Traefik routing for customer apps deferred to Phase 4/4.5. |
| Container logs | Captured via docker-java, written to ClickHouse | Unified log query surface from day 1. Same pattern future app logs will use. |
| Resource constraints | cgroups via docker-java `mem_limit` + `cpu_shares` | Protect the control plane from noisy neighbors. Tier-based defaults. Even in single-tenant Docker mode, a runaway Camel app shouldn't starve Traefik/Postgres/Logto. |
| Orchestrator metadata | JSONB field on deployment entity | Docker stores `containerId`. K8s (Phase 5) stores `namespace`, `deploymentName`, `gitCommit`. Same table, different orchestrator. |
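The container-naming and image-tagging decisions above reduce to simple string composition. A minimal sketch (class and method names are illustrative, not from the codebase):

```java
// Sketch of the naming conventions from the decision table.
// Container name: {tenant-slug}-{env-slug}-{app-slug}
// Image reference: cameleer-runtime-{tenant-slug}-{app-slug}:v{version}
class NamingConventions {

    /** Human-readable, unique, identifies tenant+environment+app at a glance. */
    static String containerName(String tenantSlug, String envSlug, String appSlug) {
        return tenantSlug + "-" + envSlug + "-" + appSlug;
    }

    /** Per-deployment image tag, e.g. cameleer-runtime-acme-orders:v3. */
    static String imageRef(String tenantSlug, String appSlug, int version) {
        return "cameleer-runtime-" + tenantSlug + "-" + appSlug + ":v" + version;
    }
}
```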
## Data Model
### Environment Entity
```sql
CREATE TABLE environments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    slug VARCHAR(100) NOT NULL,
    display_name VARCHAR(255) NOT NULL,
    bootstrap_token TEXT NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'ACTIVE',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, slug)
);
CREATE INDEX idx_environments_tenant_id ON environments(tenant_id);
```
- `slug` — URL-safe, immutable, unique per tenant. Auto-created environment gets slug `default`.
- `display_name` — User-editable. Auto-created environment gets `Default`.
- `bootstrap_token` — The `CAMELEER_AUTH_TOKEN` value used for customer containers in this environment. In Docker mode, all environments share the same value (read from platform config). In K8s mode (Phase 5), can be unique per environment.
- `status` — `ACTIVE` or `SUSPENDED`.
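The slug rules ("URL-safe, immutable, unique per tenant") can be checked with one pattern. A sketch, assuming "URL-safe" means lowercase alphanumerics with internal hyphens, within the `VARCHAR(100)` column limit; the class name is hypothetical:

```java
import java.util.regex.Pattern;

// Illustrative slug validation: lowercase alphanumerics, hyphens only in the
// middle, 1-100 characters (matching the VARCHAR(100) column).
class SlugValidator {
    private static final Pattern SLUG =
            Pattern.compile("^[a-z0-9]([a-z0-9-]{0,98}[a-z0-9])?$");

    static boolean isValidSlug(String slug) {
        return slug != null && SLUG.matcher(slug).matches();
    }
}
```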
### App Entity
```sql
CREATE TABLE apps (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    environment_id UUID NOT NULL REFERENCES environments(id) ON DELETE CASCADE,
    slug VARCHAR(100) NOT NULL,
    display_name VARCHAR(255) NOT NULL,
    jar_storage_path VARCHAR(500),
    jar_checksum VARCHAR(64),
    jar_original_filename VARCHAR(255),
    jar_size_bytes BIGINT,
    current_deployment_id UUID,
    previous_deployment_id UUID,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(environment_id, slug)
);
CREATE INDEX idx_apps_environment_id ON apps(environment_id);
```
- `slug` — URL-safe, immutable, unique per environment.
- `jar_storage_path` — Relative path to uploaded JAR (e.g., `tenants/{tenant-slug}/envs/{env-slug}/apps/{app-slug}/app.jar`). Relative to the configured storage root (`cameleer.runtime.jar-storage-path`). Makes it easy to migrate the storage volume to a different mount point or cloud provider.
- `jar_checksum` — SHA-256 hex digest of the uploaded JAR.
- `current_deployment_id` — Points to the active deployment. Nullable (app created but never deployed).
- `previous_deployment_id` — Points to the last known good deployment. When a new deploy succeeds, `current` becomes the new one and `previous` becomes the old `current`. When a deploy fails, `current` stays as the failed one but `previous` still points to the last good version, enabling a rollback button.
### Deployment Entity
```sql
CREATE TABLE deployments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    app_id UUID NOT NULL REFERENCES apps(id) ON DELETE CASCADE,
    version INTEGER NOT NULL,
    image_ref VARCHAR(500) NOT NULL,
    desired_status VARCHAR(20) NOT NULL DEFAULT 'RUNNING',
    observed_status VARCHAR(20) NOT NULL DEFAULT 'BUILDING',
    orchestrator_metadata JSONB DEFAULT '{}',
    error_message TEXT,
    deployed_at TIMESTAMPTZ,
    stopped_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(app_id, version)
);
CREATE INDEX idx_deployments_app_id ON deployments(app_id);
```
- `version` — Sequential per app (1, 2, 3...). Incremented on each deploy.
- `image_ref` — Docker image reference, e.g., `cameleer-runtime-{tenant}-{app}:v3`.
- `desired_status` — What the user wants: `RUNNING`, `STOPPED`.
- `observed_status` — What the platform sees: `BUILDING`, `STARTING`, `RUNNING`, `FAILED`, `STOPPED`.
- `orchestrator_metadata` — Docker mode: `{"containerId": "abc123"}`. K8s mode (Phase 5): `{"namespace": "...", "deploymentName": "...", "gitCommit": "..."}`.
- `error_message` — Populated when `observed_status` is `FAILED`. Build error, startup crash, etc.
## Component Architecture
### RuntimeOrchestrator Interface
```java
public interface RuntimeOrchestrator {
    String buildImage(BuildImageRequest request);
    void startContainer(StartContainerRequest request);
    void stopContainer(String containerId);
    void removeContainer(String containerId);
    ContainerStatus getContainerStatus(String containerId);
    void streamLogs(String containerId, LogConsumer consumer);
}
```
- Single interface, implemented by `DockerRuntimeOrchestrator` (Phase 3) and `KubernetesRuntimeOrchestrator` (Phase 5).
- Injected via Spring `@Profile` or `@ConditionalOnProperty`.
- Request objects carry all context (image name, env vars, network, labels, etc.).
### DockerRuntimeOrchestrator
Uses `com.github.docker-java:docker-java` library. Connects via Docker socket (`/var/run/docker.sock`).
**buildImage:**
1. Creates a temporary build context directory
2. Writes a Dockerfile:
   ```dockerfile
   FROM cameleer-runtime-base:{platform-version}
   COPY app.jar /app/app.jar
   ```
3. Copies the customer JAR as `app.jar`
4. Calls `docker build` via docker-java
5. Tags as `cameleer-runtime-{tenant-slug}-{app-slug}:v{version}`
6. Returns the image reference
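Steps 1-3 (build-context preparation) need nothing beyond the JDK; the actual image build via docker-java is elided here. A sketch under stated assumptions — class and method names are illustrative, and the base-image tag is whatever `cameleer.runtime.base-image` resolves to:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative build-context preparation for the two-layer customer image.
class BuildContext {

    /** Creates a temp dir holding the generated Dockerfile and the customer JAR. */
    static Path prepare(byte[] customerJarBytes, String baseImage) {
        try {
            Path ctx = Files.createTempDirectory("cameleer-build-");
            // Step 2: the Dockerfile is exactly two lines — base image + customer JAR.
            Files.writeString(ctx.resolve("Dockerfile"),
                    "FROM " + baseImage + "\nCOPY app.jar /app/app.jar\n");
            // Step 3: the upload is always normalized to app.jar inside the context.
            Files.write(ctx.resolve("app.jar"), customerJarBytes);
            return ctx;
            // Step 4 would hand this directory to docker-java's buildImageCmd(...).
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```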
**startContainer:**
1. Creates container with:
   - Image: the built image reference
   - Name: `{tenant-slug}-{env-slug}-{app-slug}`
   - Network: `cameleer` (the platform bridge network)
   - Environment variables:
     - `CAMELEER_AUTH_TOKEN={bootstrap-token}`
     - `CAMELEER_EXPORT_TYPE=HTTP`
     - `CAMELEER_EXPORT_ENDPOINT=http://cameleer3-server:8081`
     - `CAMELEER_APPLICATION_ID={app-slug}`
     - `CAMELEER_ENVIRONMENT_ID={env-slug}`
     - `CAMELEER_DISPLAY_NAME={tenant-slug}-{env-slug}-{app-slug}`
   - Resource constraints (cgroups):
     - `memory` / `memorySwap` — hard memory limit per container
     - `cpuShares` — relative CPU weight (default 512)
     - Defaults configurable via `cameleer.runtime.container-memory-limit` (default `512m`) and `cameleer.runtime.container-cpu-shares` (default `512`)
     - Protects the control plane (Traefik, Postgres, Logto, cameleer-saas) from noisy neighbor Camel apps
   - Health check: HTTP GET to agent health port 9464
2. Starts container
3. Returns container ID
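The health-check wait amounts to polling until the agent endpoint answers or a timeout elapses. A minimal sketch with the check injected as a supplier (a real implementation would wrap an HTTP GET against port 9464); names are illustrative:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Illustrative poll-until-healthy loop for the post-start health check.
class HealthPoller {

    /** True if the check succeeds before the timeout, false otherwise. */
    static boolean awaitHealthy(BooleanSupplier check, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (check.getAsBoolean()) return true;        // agent answered on 9464
            if (System.nanoTime() >= deadline) return false; // -> observed_status=FAILED
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // treat interruption as not healthy
            }
        }
    }
}
```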
**streamLogs:**
- Attaches to container stdout/stderr via docker-java `LogContainerCmd`
- Passes log lines to a `LogConsumer` callback (for ClickHouse ingestion)
### cameleer-runtime-base Image
A pre-built Docker image containing everything except the customer JAR:
```dockerfile
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY cameleer3-agent-{version}-shaded.jar /app/agent.jar
ENTRYPOINT exec java \
    -Dcameleer.export.type=${CAMELEER_EXPORT_TYPE:-HTTP} \
    -Dcameleer.export.endpoint=${CAMELEER_EXPORT_ENDPOINT} \
    -Dcameleer.agent.name=${HOSTNAME} \
    -Dcameleer.agent.application=${CAMELEER_APPLICATION_ID:-default} \
    -Dcameleer.agent.environment=${CAMELEER_ENVIRONMENT_ID:-default} \
    -Dcameleer.routeControl.enabled=${CAMELEER_ROUTE_CONTROL_ENABLED:-false} \
    -Dcameleer.replay.enabled=${CAMELEER_REPLAY_ENABLED:-false} \
    -Dcameleer.health.enabled=true \
    -Dcameleer.health.port=9464 \
    -javaagent:/app/agent.jar \
    -jar /app/app.jar
```
- Built as part of the CI pipeline for cameleer-saas.
- Published to Gitea registry: `gitea.siegeln.net/cameleer/cameleer-runtime-base:{version}`.
- Version tracks the platform version + agent version (e.g., `0.2.0` includes agent `1.0-SNAPSHOT`).
- Updating the agent JAR means rebuilding this image with the new agent version, then rebuilding the cameleer-saas image; all new deployments then use the new agent.
### JAR Upload
- `POST /api/environments/{eid}/apps` with multipart file
- Validation:
- File extension: `.jar`
- Max size: 200 MB (configurable via `cameleer.runtime.max-jar-size`)
- SHA-256 checksum computed and stored
- Storage: relative path `tenants/{tenant-slug}/envs/{env-slug}/apps/{app-slug}/app.jar` under the configured storage root (`cameleer.runtime.jar-storage-path`, default `/data/jars`)
- Docker volume `jardata` mounted into cameleer-saas container
- Database stores the relative path only — decoupled from mount point
- JAR is overwritten on re-upload (new deploy uses new JAR)
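The checksum step is a plain SHA-256 over the uploaded bytes, stored as a lowercase hex digest (64 characters, matching the `jar_checksum VARCHAR(64)` column). A sketch with an illustrative class name:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustrative SHA-256 digest computation for uploaded JARs.
class JarChecksum {

    static String sha256Hex(byte[] jarBytes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(jarBytes)); // lowercase hex
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e); // never on a standard JRE
        }
    }
}
```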
### Async Deployment Pipeline
1. **API receives deploy request** → creates `Deployment` entity with `observed_status=BUILDING` → returns deployment ID (HTTP 202 Accepted)
2. **Background thread** (Spring `@Async` with a bounded thread pool):
   a. Calls `orchestrator.buildImage(...)` → updates `observed_status=STARTING`
   b. Calls `orchestrator.startContainer(...)` → updates `observed_status=STARTING`
   c. Polls agent health endpoint (port 9464) with timeout → updates to `RUNNING` or `FAILED`
   d. On any failure → updates `observed_status=FAILED`, `error_message=...`
3. **Client polls** `GET /api/apps/{aid}/deployments/{did}` for status updates
4. **On success:** set `previous_deployment_id = old current_deployment_id`, then `current_deployment_id = new deployment`. Stop and remove the old container.
5. **On failure:** `current_deployment_id` is set to the failed deployment (so status is visible), `previous_deployment_id` still points to the last known good version. Enables rollback.
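The pointer bookkeeping in steps 4-5 can be sketched as below. This is one reading of the spec (it assumes the old `current` was healthy; consecutive failed deploys are not handled), and the class and field names only mirror the `apps` table columns:

```java
import java.util.UUID;

// Illustrative current/previous pointer bookkeeping from the deploy pipeline.
class DeploymentPointers {
    UUID currentDeploymentId;   // apps.current_deployment_id
    UUID previousDeploymentId;  // apps.previous_deployment_id

    void onDeploySucceeded(UUID newDeploymentId) {
        previousDeploymentId = currentDeploymentId; // old current becomes last known good
        currentDeploymentId = newDeploymentId;
    }

    void onDeployFailed(UUID failedDeploymentId) {
        // The failed deployment becomes current (so its status is visible),
        // while the last known good deployment stays reachable for rollback.
        previousDeploymentId = currentDeploymentId;
        currentDeploymentId = failedDeploymentId;
    }
}
```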
### Container Logs → ClickHouse
- When a container starts, platform attaches a log consumer via `orchestrator.streamLogs()`
- Log consumer batches lines and writes to ClickHouse table:
```sql
CREATE TABLE IF NOT EXISTS container_logs (
    tenant_id UUID,
    environment_id UUID,
    app_id UUID,
    deployment_id UUID,
    timestamp DateTime64(3),
    stream String, -- 'stdout' or 'stderr'
    message String
) ENGINE = MergeTree()
ORDER BY (tenant_id, environment_id, app_id, timestamp);
```
- Logs retrieved via `GET /api/apps/{aid}/logs?since=...&limit=...` which queries ClickHouse
- ClickHouse TTL can enforce retention based on license `retention_days` limit (future enhancement)
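The batching consumer can be sketched as a small buffer with a flush callback (which would issue the batched `INSERT` into `container_logs`). A real implementation would also flush on a timer and on container stop; names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative size-triggered log batcher feeding a ClickHouse writer.
class LogBatcher {
    private final int batchSize;
    private final Consumer<List<String>> flush; // e.g. one INSERT per batch
    private final List<String> buffer = new ArrayList<>();

    LogBatcher(int batchSize, Consumer<List<String>> flush) {
        this.batchSize = batchSize;
        this.flush = flush;
    }

    void accept(String line) {
        buffer.add(line);
        if (buffer.size() >= batchSize) {
            flush.accept(List.copyOf(buffer)); // hand off an immutable batch
            buffer.clear();
        }
    }
}
```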
### Bootstrap Token Handling
In Docker single-tenant mode, all environments share the single cameleer3-server instance and its single `CAMELEER_AUTH_TOKEN`. The platform reads this token from its own configuration (`cameleer.runtime.bootstrap-token` / `CAMELEER_AUTH_TOKEN` env var) and injects it into every customer container. No changes to cameleer3-server are needed.
Environment-level data separation happens at the agent registration level — the agent sends its `environmentId` claim when it registers, and cameleer3-server uses that to scope all data. The bootstrap token is the same across environments in a Docker stack.
The `bootstrap_token` column on the environment entity stores the token value used for that environment's containers. In Docker mode this is the same shared value for all environments. In K8s mode (Phase 5), each environment could have its own cameleer3-server instance with a unique token, enabling true per-environment token isolation.
## API Surface
### Environment Endpoints
```
POST /api/tenants/{tenantId}/environments
  Body: { "slug": "dev", "displayName": "Development" }
  Returns: 201 Created + EnvironmentResponse
  Enforces: tier-based max_environments limit from license

GET /api/tenants/{tenantId}/environments
  Returns: 200 + List<EnvironmentResponse>

GET /api/tenants/{tenantId}/environments/{environmentId}
  Returns: 200 + EnvironmentResponse

PATCH /api/tenants/{tenantId}/environments/{environmentId}
  Body: { "displayName": "New Name" }
  Returns: 200 + EnvironmentResponse

DELETE /api/tenants/{tenantId}/environments/{environmentId}
  Returns: 204 No Content
  Precondition: no running apps in environment
  Restriction: cannot delete the auto-created "default" environment
```
### App Endpoints
```
POST /api/environments/{environmentId}/apps
  Multipart: file (JAR) + metadata { "slug": "order-service", "displayName": "Order Service" }
  Returns: 201 Created + AppResponse
  Validates: file extension, size, checksum

GET /api/environments/{environmentId}/apps
  Returns: 200 + List<AppResponse>

GET /api/environments/{environmentId}/apps/{appId}
  Returns: 200 + AppResponse (includes current deployment status)

PUT /api/environments/{environmentId}/apps/{appId}/jar
  Multipart: file (JAR)
  Returns: 200 + AppResponse
  Purpose: re-upload JAR without creating new app

DELETE /api/environments/{environmentId}/apps/{appId}
  Returns: 204 No Content
  Side effect: stops running container, removes image
```
### Deployment Endpoints
```
POST /api/apps/{appId}/deploy
  Body: {} (empty — uses current JAR)
  Returns: 202 Accepted + DeploymentResponse (with deployment ID, status=BUILDING)

GET /api/apps/{appId}/deployments
  Returns: 200 + List<DeploymentResponse> (ordered by version desc)

GET /api/apps/{appId}/deployments/{deploymentId}
  Returns: 200 + DeploymentResponse (poll this for status updates)

POST /api/apps/{appId}/stop
  Returns: 200 + DeploymentResponse (desired_status=STOPPED)

POST /api/apps/{appId}/restart
  Returns: 202 Accepted + DeploymentResponse (stops + redeploys same image)
```
### Log Endpoints
```
GET /api/apps/{appId}/logs
  Query: since (ISO timestamp), until (ISO timestamp), limit (default 500), stream (stdout/stderr/both)
  Returns: 200 + List<LogEntry>
  Source: ClickHouse container_logs table
```
## Tier Enforcement
| Tier | max_environments | max_agents (apps) |
|------|-----------------|-------------------|
| LOW | 1 | 3 |
| MID | 2 | 10 |
| HIGH | unlimited (-1) | 50 |
| BUSINESS | unlimited (-1) | unlimited (-1) |
- `max_environments` enforced on `POST /api/tenants/{tid}/environments`. The auto-created `default` environment counts toward the limit.
- `max_agents` enforced on `POST /api/environments/{eid}/apps`. Count is total apps across all environments in the tenant.
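The enforcement check itself is trivial once `-1` is treated as the "unlimited" sentinel from the table above. A sketch with illustrative names:

```java
// Illustrative tier-limit check; -1 encodes "unlimited" as in the tier table.
class TierLimits {
    static final int UNLIMITED = -1;

    /** True if creating one more environment/app stays within the licensed limit. */
    static boolean withinLimit(long currentCount, int max) {
        return max == UNLIMITED || currentCount < max;
    }
}
```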
## Docker Compose Changes
The cameleer-saas service needs:
- Docker socket mount: `/var/run/docker.sock:/var/run/docker.sock` (already present in docker-compose.yml)
- JAR storage volume: `jardata:/data/jars`
- `cameleer-runtime-base` image must be available (pre-pulled or built locally)
The cameleer3-server `CAMELEER_AUTH_TOKEN` is read by cameleer-saas from shared environment config and injected into customer containers.
New volume in docker-compose.yml:
```yaml
volumes:
  jardata:
```
## Dependencies
### New Maven Dependencies
```xml
<!-- Docker Java client -->
<dependency>
    <groupId>com.github.docker-java</groupId>
    <artifactId>docker-java-core</artifactId>
    <version>3.4.1</version>
</dependency>
<dependency>
    <groupId>com.github.docker-java</groupId>
    <artifactId>docker-java-transport-httpclient5</artifactId>
    <version>3.4.1</version>
</dependency>

<!-- ClickHouse JDBC -->
<dependency>
    <groupId>com.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.7.1</version>
    <classifier>all</classifier>
</dependency>
```
### New Configuration Properties
```yaml
cameleer:
  runtime:
    max-jar-size: 209715200 # 200 MB
    jar-storage-path: /data/jars
    base-image: cameleer-runtime-base:latest
    docker-network: cameleer
    agent-health-port: 9464
    health-check-timeout: 60 # seconds to wait for healthy status
    deployment-thread-pool-size: 4
    container-memory-limit: 512m # per customer container
    container-cpu-shares: 512 # relative weight (default Docker is 1024)
  clickhouse:
    url: jdbc:clickhouse://clickhouse:8123/cameleer
```
## Verification Plan
1. Upload a sample Camel JAR via `POST /api/environments/{eid}/apps`
2. Deploy via `POST /api/apps/{aid}/deploy` — returns 202 with deployment ID
3. Poll `GET /api/apps/{aid}/deployments/{did}` — status transitions: `BUILDING` → `STARTING` → `RUNNING`
4. Container visible in `docker ps` as `{tenant}-{env}-{app}`
5. Container is on the `cameleer` network
6. cameleer3 agent registers with cameleer3-server (visible in server logs)
7. Agent health endpoint responds on port 9464
8. Container logs appear in ClickHouse `container_logs` table
9. `GET /api/apps/{aid}/logs` returns log entries
10. `POST /api/apps/{aid}/stop` stops the container, status becomes `STOPPED`
11. `POST /api/apps/{aid}/restart` restarts with same image
12. Re-upload JAR + redeploy creates deployment v2, stops v1
13. Tier limits enforced: LOW tenant cannot create more than 1 environment or 3 apps
14. Default environment auto-created on tenant provisioning