docs: add dual deployment architecture spec and Phase 2 plan

Architecture spec covers Docker+K8s dual deployment with build-vs-buy
decisions (Logto, Traefik, Stripe, deferred Lago/Vault). Phase 2 plan
has 12 implementation tasks for tenants, identity, and licensing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hsiegeln
2026-04-04 14:45:33 +02:00
parent fcb372023f
commit 24309eab94
2 changed files with 3086 additions and 0 deletions

File diff suppressed because it is too large

@@ -0,0 +1,399 @@
# Dual Deployment Architecture: Docker + Kubernetes
**Date:** 2026-04-04
**Status:** Approved
**Supersedes:** Portions of `2026-03-29-saas-platform-prd.md` (deployment model, phase ordering, auth strategy)
## Context
Cameleer SaaS must serve two deployment targets:
- **Docker Compose** — production-viable for small customers and air-gapped installs (single-tenant per stack)
- **Kubernetes** — managed SaaS and enterprise self-hosted (multi-tenant)
The original PRD assumed K8s-only production. This design restructures the architecture and roadmap to treat Docker Compose as a first-class production target, uses the Docker+K8s dual requirement as a filter for build-vs-buy decisions, and reorders the phase roadmap to ship a deployable product faster.
Key constraints:
- The application is **always multi-tenant** — Docker deployments have exactly 1 tenant
- Don't build custom abstractions over K8s-only primitives when no Docker equivalent exists
- Prefer right-sized OSS tools over Swiss Army knives or custom builds
- K8s-only features (NetworkPolicies, HPA, Flux CD) are operational enhancements, never functional requirements
## Build-vs-Buy Decisions
### BUY (Use 3rd Party OSS)
| Subsystem | Tool | License | Why This Tool |
|---|---|---|---|
| **Identity & Auth** | **Logto** | MPL-2.0 | Lightest IdP (2 containers, ~0.5-1 GB). Orgs, RBAC, M2M tokens, OIDC/SSO federation all in OSS. Replaces ~3-4 months of custom auth build (OIDC, SSO, teams, invites, MFA, password reset, custom roles). |
| **Reverse Proxy** | **Traefik** | MIT | Native Docker provider (labels) and K8s provider (IngressRoute CRDs). Same mental model in both environments. Already on the k3s cluster. ForwardAuth middleware for tenant-aware routing. Auto-HTTPS via Let's Encrypt. ~256 MB RAM. |
| **Database** | **PostgreSQL** | PostgreSQL License | Already chosen. Platform data + Logto data (separate schemas). |
| **Trace/Metrics Storage** | **ClickHouse** | Apache-2.0 | Replaced OpenSearch in the cameleer3-server stack. Columnar OLAP, excellent for time-series observability data. |
| **Schema Migrations** | **Flyway** | Apache-2.0 | Already in place. |
| **Billing (subscriptions)** | **Stripe** | N/A (API) | Start with Stripe Checkout for fixed-tier subscriptions. No custom billing infrastructure day 1. |
| **Billing (usage metering)** | **Lago** (deferred) | AGPL-3.0 | Purpose-built for event-based metering. 8 containers, so deploy only when usage-based pricing launches. Design the event model with Lago's API shape in mind from day 1. Integrate via its API only, so no AGPL-licensed code is linked into the platform. |
| **GitOps (K8s only)** | **Flux CD** | Apache-2.0 | K8s-only, and that's acceptable. Docker deployments get release tarballs + upgrade scripts. |
| **Image Builds (K8s)** | **Kaniko** | Apache-2.0 | Daemonless container image builds inside K8s. For Docker mode, `docker build` via docker-java is simpler. |
| **Monitoring** | **Prometheus + Grafana + Loki** | Apache-2.0 | Works in both Docker and K8s. Optional for Docker (customer's choice), standard for K8s SaaS. |
| **TLS Certificates** | **Traefik ACME** (Docker) / **cert-manager** (K8s) | MIT / Apache-2.0 | Standard tools, no custom code. |
| **Container Registry (K8s)** | **Gitea Registry** (SaaS) / **registry:2** (self-hosted) | — | Docker mode doesn't need a registry (local image cache). |
### BUILD (Custom / Core IP)
| Subsystem | Why Build |
|---|---|
| **License signing & validation** | Ed25519 signed JWT with tier, features, limits, expiry. Dual mode: online API check + offline signed file. No off-the-shelf tool does this. Core IP. |
| **Agent bootstrap tokens** | Tightly coupled to the cameleer3 agent protocol (PROTOCOL.md). Custom Ed25519 tokens for agent registration. |
| **Tenant lifecycle** | CRUD, configuration, status management. Core business logic. User management (invites, teams, roles) is delegated to Logto's organization model. |
| **Runtime orchestration** | The core of the "managed Camel runtime" product. `RuntimeOrchestrator` interface with Docker and K8s implementations. No off-the-shelf tool does "managed Camel runtime with agent injection." |
| **Image build pipeline** | Templated Dockerfile: JRE + cameleer3-agent.jar + customer JAR + `-javaagent` flag. Simple but custom. |
| **Feature gating** | Tier-based feature gating logic. Which features are available at which tier. Business logic. |
| **Billing integration** | Stripe API calls, subscription lifecycle, webhook handling. Thin integration layer. |
| **Observability proxy** | Routing authenticated requests to tenant-specific cameleer3-server instances. |
| **MOAT features** | Debugger, Lineage, Correlation — the defensible product. Built in cameleer3 agent + server. |
### SKIP / DEFER
| Subsystem | Why Skip |
|---|---|
| **Secrets management (Vault)** | Docker: env vars + mounted files. K8s: K8s Secrets. Vault is enterprise-tier complexity. Defer until demanded. |
| **Custom role management UI** | Logto provides this. |
| **OIDC provider implementation** | Logto provides this. |
| **WireGuard VPN / VPC peering** | Far future, dedicated-tier only. |
| **Cluster API for dedicated tiers** | Don't design for this until enterprise customers exist. |
| **Management agent for updates** | Watchtower is optional for connected customers. Air-gapped installs get release tarballs. Don't build a custom updater. |
## Architecture
### Platform Stack (Docker Compose — 6 base containers)
```
+-------------------------------------------------------+
|  Traefik (reverse proxy, TLS, ForwardAuth)            |
|  - Docker: labels-based routing                       |
|  - K8s: IngressRoute CRDs                             |
+--------+---------------------+------------------------+
         |                     |
+--------v--------+  +---------v-----------+
|  cameleer-saas  |  |  cameleer3-server   |
|  (Spring Boot)  |  |  (observability)    |
|  Control plane  |  |  Per-tenant instance|
+---+-------+-----+  +----------+----------+
    |       |                   |
+---v--+ +--v----+    +---------v---------+
|  PG  | | Logto |    |    ClickHouse     |
|      | | (IdP) |    |  (traces/metrics) |
+------+ +-------+    +-------------------+
```
Customer Camel apps are **additional containers** dynamically managed by the control plane via Docker API (Docker mode) or K8s API (K8s mode).
### Auth Flow
```
User login:
  Browser -> Traefik -> Logto (OIDC flow) -> JWT issued by Logto
API request:
  Browser -> Traefik -> ForwardAuth (cameleer-saas /auth/verify)
    -> Validates Logto JWT, injects X-Tenant-Id header
    -> Traefik forwards to upstream service
Machine auth (agent bootstrap):
  cameleer3-agent -> cameleer-saas /api/agent/register
    -> Validates bootstrap token (Ed25519)
    -> Issues agent session token
    -> Agent connects to cameleer3-server
```
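The API request path above can be sketched as a pure decision function behind `/auth/verify`. This is a minimal sketch (Java 17+), not the actual endpoint: the class and record names are hypothetical, JWT signature checking is assumed to have happened before this step, and the `organization_id` claim name is an assumption about how the Logto organization reaches the token. Traefik forwards the original request only when the ForwardAuth endpoint answers with a 2xx status, and it copies response headers such as `X-Tenant-Id` onto the upstream request only if they are listed in the middleware's `authResponseHeaders` option.

```java
import java.time.Instant;
import java.util.Map;

/** Hypothetical decision logic for the ForwardAuth endpoint. */
final class ForwardAuthVerifier {
    private final String expectedIssuer;

    ForwardAuthVerifier(String expectedIssuer) {
        this.expectedIssuer = expectedIssuer;
    }

    /**
     * Decides on claims whose JWT signature has already been checked elsewhere.
     * ASSUMPTION: the tenant travels in an "organization_id" claim.
     */
    AuthDecision verify(Map<String, Object> claims, Instant now) {
        if (!expectedIssuer.equals(claims.get("iss"))) {
            return deny();
        }
        Object exp = claims.get("exp");
        if (!(exp instanceof Number)
                || !now.isBefore(Instant.ofEpochSecond(((Number) exp).longValue()))) {
            return deny(); // missing or expired "exp"
        }
        Object org = claims.get("organization_id");
        if (!(org instanceof String) || ((String) org).isBlank()) {
            return deny(); // token is not bound to a tenant
        }
        return new AuthDecision(200, Map.of("X-Tenant-Id", (String) org));
    }

    private static AuthDecision deny() {
        return new AuthDecision(401, Map.of());
    }

    public static void main(String[] args) {
        ForwardAuthVerifier verifier = new ForwardAuthVerifier("http://logto:3001/oidc");
        Map<String, Object> claims = Map.of(
                "iss", "http://logto:3001/oidc",
                "exp", Instant.now().plusSeconds(300).getEpochSecond(),
                "organization_id", "acme");
        System.out.println(verifier.verify(claims, Instant.now()));
    }
}

/** Status and headers returned to Traefik. */
record AuthDecision(int status, Map<String, String> headers) {}
```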
Logto handles all user-facing identity. The cameleer-saas app handles machine-to-machine auth (agent tokens, license tokens) using Ed25519.
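
For the Ed25519 side, a license or bootstrap token could be produced as a compact JWS with the `EdDSA` algorithm, using only the JDK (Ed25519 is available in `java.security` since Java 15). This is a sketch under assumptions: the class name `LicenseTokens` and the claim names in the example payload are hypothetical, and the key pair is generated in memory here, whereas the spec calls for keys loaded from mounted files.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;

/** Hypothetical helper: signs and verifies compact JWS tokens with Ed25519. */
final class LicenseTokens {
    private static final Base64.Encoder ENC = Base64.getUrlEncoder().withoutPadding();
    private static final Base64.Decoder DEC = Base64.getUrlDecoder();
    private static final String HEADER = "{\"alg\":\"EdDSA\",\"typ\":\"JWT\"}";

    static KeyPair newKeyPair() {
        try {
            return KeyPairGenerator.getInstance("Ed25519").generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns header.payload.signature, each part base64url-encoded without padding. */
    static String sign(String payloadJson, PrivateKey key) {
        String signingInput = ENC.encodeToString(HEADER.getBytes(StandardCharsets.UTF_8))
                + "." + ENC.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
        try {
            Signature signer = Signature.getInstance("Ed25519");
            signer.initSign(key);
            signer.update(signingInput.getBytes(StandardCharsets.US_ASCII));
            return signingInput + "." + ENC.encodeToString(signer.sign());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the payload JSON if the signature matches; claims such as expiry still need checking. */
    static String verify(String token, PublicKey key) {
        String[] parts = token.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("not a compact JWS");
        }
        try {
            Signature verifier = Signature.getInstance("Ed25519");
            verifier.initVerify(key);
            verifier.update((parts[0] + "." + parts[1]).getBytes(StandardCharsets.US_ASCII));
            if (!verifier.verify(DEC.decode(parts[2]))) {
                throw new SecurityException("signature does not match");
            }
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
        return new String(DEC.decode(parts[1]), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        KeyPair keys = newKeyPair();
        // Claim names are illustrative only.
        String token = sign("{\"tier\":\"mid\",\"exp\":1790000000}", keys.getPrivate());
        System.out.println(verify(token, keys.getPublic()));
    }
}
```

Because verification needs only the public key, the same token format would serve both modes named in the build table: the online API check and the offline signed file.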
### Runtime Orchestration
```java
RuntimeOrchestrator (interface)
  + deployApp(tenantId, appId, envId, imageRef, config) -> Deployment
  + stopApp(tenantId, appId, envId) -> void
  + restartApp(tenantId, appId, envId) -> void
  + getAppLogs(tenantId, appId, envId, since) -> Stream<LogLine>
  + getAppStatus(tenantId, appId, envId) -> AppStatus
  + listApps(tenantId) -> List<AppSummary>
DockerRuntimeOrchestrator (docker-java library)
  - Talks to Docker daemon via /var/run/docker.sock
  - Creates containers with labels for Traefik routing
  - Manages container lifecycle
  - Builds images locally via docker build
KubernetesRuntimeOrchestrator (fabric8 kubernetes-client)
  - Creates Deployments, Services, ConfigMaps in tenant namespace
  - Builds images via Kaniko Jobs, pushes to registry
  - Manages rollout lifecycle
```
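As compilable Java (17+), the contract might look like the sketch below. It is trimmed to four of the six operations, and the parameter types, the `Deployment` record fields, the `AppStatus` values, and the `InMemoryRuntimeOrchestrator` class are all assumptions made for illustration. The in-memory variant starts no containers; it only shows how control-plane code can be exercised against the interface without a Docker daemon or a cluster.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in; tracks state only, starts nothing. */
final class InMemoryRuntimeOrchestrator implements RuntimeOrchestrator {
    private final Map<String, Deployment> deployments = new ConcurrentHashMap<>();
    private final Map<String, AppStatus> statuses = new ConcurrentHashMap<>();

    private static String key(String tenantId, String appId, String envId) {
        return tenantId + "/" + appId + "/" + envId;
    }

    @Override
    public Deployment deployApp(String tenantId, String appId, String envId,
                                String imageRef, Map<String, String> config) {
        Deployment d = new Deployment(tenantId, appId, envId, imageRef, Map.copyOf(config));
        deployments.put(key(tenantId, appId, envId), d);
        statuses.put(key(tenantId, appId, envId), AppStatus.RUNNING);
        return d;
    }

    @Override
    public void stopApp(String tenantId, String appId, String envId) {
        // Only known deployments change state; unknown ones stay UNKNOWN.
        statuses.computeIfPresent(key(tenantId, appId, envId), (k, v) -> AppStatus.STOPPED);
    }

    @Override
    public AppStatus getAppStatus(String tenantId, String appId, String envId) {
        return statuses.getOrDefault(key(tenantId, appId, envId), AppStatus.UNKNOWN);
    }

    @Override
    public List<String> listApps(String tenantId) {
        return deployments.values().stream()
                .filter(d -> d.tenantId().equals(tenantId))
                .map(Deployment::appId)
                .distinct()
                .sorted()
                .toList();
    }

    public static void main(String[] args) {
        RuntimeOrchestrator orchestrator = new InMemoryRuntimeOrchestrator();
        orchestrator.deployApp("acme", "orders", "dev", "orders:1.0", Map.of("LOG_LEVEL", "debug"));
        System.out.println(orchestrator.getAppStatus("acme", "orders", "dev"));
    }
}

enum AppStatus { RUNNING, STOPPED, UNKNOWN }

record Deployment(String tenantId, String appId, String envId,
                  String imageRef, Map<String, String> config) {}

/** Subset of the operations listed in the spec; types are assumed. */
interface RuntimeOrchestrator {
    Deployment deployApp(String tenantId, String appId, String envId,
                         String imageRef, Map<String, String> config);
    void stopApp(String tenantId, String appId, String envId);
    AppStatus getAppStatus(String tenantId, String appId, String envId);
    List<String> listApps(String tenantId);
}
```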
### Image Build Pipeline
```
Customer uploads JAR
  -> Validation (file type, size, SHA-256, security scan)
  -> Templated Dockerfile generation:
       FROM eclipse-temurin:21-jre-alpine
       COPY cameleer3-agent.jar /opt/agent/
       COPY customer-app.jar /opt/app/
       ENTRYPOINT ["java", "-javaagent:/opt/agent/cameleer3-agent.jar", "-jar", "/opt/app/customer-app.jar"]
  -> Build:
       Docker mode: docker build via docker-java (local image cache)
       K8s mode: Kaniko Job -> push to registry
  -> Deploy to requested environment
```
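The Dockerfile generation step could be as small as the following sketch. The class name `DockerfileTemplate` and the file-name whitelist are assumptions; the point is that the uploaded file name is validated before it is written into a `COPY` line, and that the JAR is copied to a fixed path so the `ENTRYPOINT` stays constant. The base image is treated as operator-controlled and is not validated here.

```java
import java.util.regex.Pattern;

/** Hypothetical renderer for the templated Dockerfile described above. */
final class DockerfileTemplate {
    // Conservative whitelist so an uploaded name cannot inject Dockerfile instructions.
    private static final Pattern SAFE_JAR_NAME = Pattern.compile("[A-Za-z0-9._-]+\\.jar");

    static String render(String baseImage, String customerJarName) {
        if (!SAFE_JAR_NAME.matcher(customerJarName).matches()) {
            throw new IllegalArgumentException("unsafe JAR file name: " + customerJarName);
        }
        return String.join("\n",
                "FROM " + baseImage,
                "COPY cameleer3-agent.jar /opt/agent/cameleer3-agent.jar",
                // Fixed target path keeps the ENTRYPOINT independent of the upload name.
                "COPY " + customerJarName + " /opt/app/customer-app.jar",
                "ENTRYPOINT [\"java\", \"-javaagent:/opt/agent/cameleer3-agent.jar\", "
                        + "\"-jar\", \"/opt/app/customer-app.jar\"]",
                "");
    }

    public static void main(String[] args) {
        System.out.print(render("eclipse-temurin:21-jre-alpine", "orders-1.0.jar"));
    }
}
```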
### Multi-Tenancy Model
- **Always multi-tenant.** Docker Compose has 1 pre-configured tenant.
- **Schema-per-tenant** in PostgreSQL for platform data isolation.
- **Logto organizations** map 1:1 to tenants. Logto handles user-tenant membership.
- **ClickHouse** data partitioned by tenant_id.
- **cameleer3-server** instances are per-tenant (separate containers/pods).
- **K8s bonus:** Namespace-per-tenant for network isolation, resource quotas.
### Environment Model
Each tenant can have multiple logical environments (tier-dependent):
| Tier | Environments |
|---|---|
| Low | prod only |
| Mid | dev, prod |
| High+ | dev, staging, prod + custom |
Each environment is a separate deployment of the same app image with different configuration:
- Docker: separate container, different env vars
- K8s: separate Deployment, different ConfigMap
Promotion = deploy same image tag to a different environment with that environment's config.
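
The tier table and the promotion rule can be expressed together in a few lines. In this sketch the names `Tier`, `EnvDeployment`, and `Promotion` are hypothetical, `HIGH` stands for the "High+" row, and custom environment names are assumed to be allowed only on that tier.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical promotion helper: same image, target environment's config. */
final class Promotion {
    static EnvDeployment promote(EnvDeployment source, String targetEnv,
                                 Map<String, String> targetConfig, Tier tier) {
        if (!tier.allows(targetEnv)) {
            throw new IllegalArgumentException(
                    tier + " tier has no '" + targetEnv + "' environment");
        }
        // The image reference is carried over unchanged; only the config differs.
        return new EnvDeployment(targetEnv, source.imageRef(), Map.copyOf(targetConfig));
    }

    public static void main(String[] args) {
        EnvDeployment dev = new EnvDeployment("dev", "orders:1.0", Map.of("LOG_LEVEL", "debug"));
        System.out.println(promote(dev, "prod", Map.of("LOG_LEVEL", "info"), Tier.MID));
    }
}

enum Tier {
    LOW(Set.of("prod"), false),
    MID(Set.of("dev", "prod"), false),
    HIGH(Set.of("dev", "staging", "prod"), true); // "High+": custom names allowed

    private final Set<String> environments;
    private final boolean customAllowed;

    Tier(Set<String> environments, boolean customAllowed) {
        this.environments = environments;
        this.customAllowed = customAllowed;
    }

    boolean allows(String env) {
        return customAllowed || environments.contains(env);
    }
}

record EnvDeployment(String envId, String imageRef, Map<String, String> config) {}
```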
### Configuration Strategy
The application is configured entirely via environment variables and Spring Boot profiles:
```yaml
# Deployment mode, auto-detected at startup: docker or kubernetes
cameleer.deployment.mode: docker
cameleer.deployment.docker.socket: /var/run/docker.sock
cameleer.deployment.k8s.namespace-template: tenant-{tenantId}
# Identity provider
cameleer.identity.issuer-uri: http://logto:3001/oidc
cameleer.identity.client-id: ${LOGTO_CLIENT_ID}
cameleer.identity.client-secret: ${LOGTO_CLIENT_SECRET}
# Ed25519 keys (externalized, not per-boot)
cameleer.jwt.private-key-path: /etc/cameleer/keys/ed25519.key
cameleer.jwt.public-key-path: /etc/cameleer/keys/ed25519.pub
# Database
spring.datasource.url: ${DATABASE_URL}
# ClickHouse
cameleer.clickhouse.url: ${CLICKHOUSE_URL}
```
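The spec does not say how the deployment mode is auto-detected. One plausible heuristic, sketched below, is to honor an explicit setting first, then look for `KUBERNETES_SERVICE_HOST` (which Kubernetes sets inside every pod), then fall back to the presence of the Docker socket. The class name and the precedence order are assumptions; `CAMELEER_DEPLOYMENT_MODE` is the environment-variable form that Spring Boot's relaxed binding maps to `cameleer.deployment.mode`. The sketch also shows how the `tenant-{tenantId}` namespace template would expand.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical startup check for cameleer.deployment.mode. */
final class DeploymentModeDetector {
    static DeploymentMode detect(Map<String, String> env, boolean dockerSocketPresent) {
        String explicit = env.get("CAMELEER_DEPLOYMENT_MODE");
        if (explicit != null) {
            // An explicit setting always wins over detection.
            return DeploymentMode.valueOf(explicit.trim().toUpperCase(Locale.ROOT));
        }
        if (env.containsKey("KUBERNETES_SERVICE_HOST")) {
            return DeploymentMode.KUBERNETES;
        }
        if (dockerSocketPresent) {
            return DeploymentMode.DOCKER;
        }
        throw new IllegalStateException("cannot detect deployment mode; set it explicitly");
    }

    /** Expands a template such as "tenant-{tenantId}". */
    static String namespaceFor(String template, String tenantId) {
        return template.replace("{tenantId}", tenantId);
    }

    public static void main(String[] args) {
        // Sample inputs; a real caller would pass System.getenv() and
        // Files.exists(Path.of("/var/run/docker.sock")).
        System.out.println(detect(Map.of("KUBERNETES_SERVICE_HOST", "10.0.0.1"), false));
        System.out.println(namespaceFor("tenant-{tenantId}", "acme"));
    }
}

enum DeploymentMode { DOCKER, KUBERNETES }
```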
### Docker Compose Production Template
```yaml
services:
  traefik:
    image: traefik:v3
    ports: ["80:80", "443:443"]
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.yml:/etc/traefik/traefik.yml
      - acme:/etc/traefik/acme
    # labels:
    #   Dashboard routing (optional, secured)
  cameleer-saas:
    image: gitea.siegeln.net/cameleer/cameleer-saas:${VERSION}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # For runtime orchestration
      - ./keys:/etc/cameleer/keys:ro
    environment:
      - DATABASE_URL=jdbc:postgresql://postgres:5432/cameleer_saas
      - LOGTO_CLIENT_ID=${LOGTO_CLIENT_ID}
      - LOGTO_CLIENT_SECRET=${LOGTO_CLIENT_SECRET}
    labels:
      - traefik.enable=true
      - traefik.http.routers.api.rule=PathPrefix(`/api`)
  logto:
    image: svhd/logto:latest
    environment:
      - DB_URL=postgresql://postgres:5432/logto
    labels:
      - traefik.enable=true
      - traefik.http.routers.auth.rule=PathPrefix(`/auth`)
  cameleer3-server:
    image: gitea.siegeln.net/cameleer/cameleer3-server:${VERSION}
    environment:
      - CLICKHOUSE_URL=jdbc:clickhouse://clickhouse:8123/cameleer
    labels:
      - traefik.enable=true
      - traefik.http.routers.observe.rule=PathPrefix(`/observe`)
  postgres:
    image: postgres:16-alpine
    volumes: ["pgdata:/var/lib/postgresql/data"]
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    volumes: ["chdata:/var/lib/clickhouse"]
volumes:
  pgdata:
  chdata:
  acme:
```
### Docker vs K8s Feature Matrix
| Feature | Docker Compose | Kubernetes |
|---|---|---|
| Deploy Camel apps | Yes (Docker API) | Yes (K8s API) |
| Multiple environments | Yes (separate containers) | Yes (separate Deployments) |
| Agent injection | Yes | Yes |
| Observability (traces, topology) | Yes | Yes |
| Identity / SSO / Teams | Yes (Logto) | Yes (Logto) |
| Licensing | Yes | Yes |
| Auto-scaling | No | Yes (HPA) |
| Network isolation (multi-tenant) | Docker networks | NetworkPolicies |
| GitOps deployment | No (manual updates) | Yes (Flux CD) |
| Rolling updates | Manual restart | Native |
| Platform monitoring | Optional (customer adds Grafana) | Standard (Prometheus/Grafana/Loki) |
| Certificate management | Traefik ACME | cert-manager |
## Revised Phase Roadmap
### Phase 2: Tenants + Identity + Licensing
**Goal:** A customer can sign up, get a tenant, and access the platform via Traefik.
- Integrate Logto as identity provider
- Replace custom user-facing auth (login, registration, password management)
- Keep Ed25519 JWT for machine tokens (agent bootstrap, license signing)
- Configure Logto organizations to map to tenants
- Tenant entity + CRUD API
- License token generation (Ed25519 signed JWT: tier, features, limits, expiry)
- Traefik integration with ForwardAuth middleware
- Docker Compose production stack (6 containers)
- Externalize Ed25519 keys (mounted files, not per-boot)
**Files to modify/create:**
- `src/main/java/net/siegeln/cameleer/saas/tenant/` — new package
- `src/main/java/net/siegeln/cameleer/saas/license/` — new package
- `src/main/java/net/siegeln/cameleer/saas/config/SecurityConfig.java` — Logto OIDC integration
- `src/main/resources/db/migration/V005__create_tenants.sql`
- `src/main/resources/db/migration/V006__create_licenses.sql`
- `docker-compose.yml` — expand to full production stack
- `traefik.yml` — static config
- `src/main/resources/application.yml` — Logto + Traefik config
### Phase 3: Runtime Orchestration + Environments
**Goal:** Customer can upload a Camel JAR, deploy it to dev/prod, see it running with agent attached.
- `RuntimeOrchestrator` interface
- `DockerRuntimeOrchestrator` implementation (docker-java)
- Customer JAR upload endpoint
- Image build pipeline (Dockerfile template + docker build)
- Logical environment model (dev/staging/prod per tenant, tier-dependent)
- Environment-specific config overlays
- App lifecycle API (deploy, start, stop, restart, logs, health)
**Key dependencies:** docker-java, Kaniko (for future K8s)
### Phase 4: Observability Pipeline
**Goal:** Customer can see traces, metrics, and route topology for deployed apps.
- Connect cameleer3-server to customer app containers
- ClickHouse tenant-scoped data partitioning
- Observability API proxy (tenant-aware routing to cameleer3-server)
- Basic topology graph endpoint
- Agent ↔ server connectivity verification
### Phase 5: K8s Operational Layer
**Goal:** Same product works on K8s with operational enhancements.
- `KubernetesRuntimeOrchestrator` implementation (fabric8)
- Kaniko-based image builds
- Flux CD integration for platform GitOps
- Namespace-per-tenant provisioning
- NetworkPolicies, ResourceQuotas
- Helm chart for K8s deployment
- Registry integration (Gitea registry / registry:2)
### Phase 6: Billing
**Goal:** Customers can subscribe and pay.
- Stripe Checkout integration
- Subscription lifecycle (create, upgrade, downgrade, cancel)
- Tier enforcement (feature gating based on active subscription)
- Usage tracking in platform DB (prep for Lago integration later)
- Webhook handling for payment events
### Phase 7: Security Hardening + Monitoring
**Goal:** Production-hardened platform.
- Prometheus/Grafana/Loki stack (optional Docker Compose overlay, standard on K8s)
- SOC 2 compliance review
- Rate limiting
- Container image signing (cosign)
- Supply chain security (SBOM, Trivy scanning)
- Audit log shipping to separate sink
### Frontend (React Shell) — Parallel Track (Phase 2+)
- Can start as soon as Phase 2 API contracts are defined
- Uses `@cameleer/design-system`
- Screens: login, dashboard, app deployment, environment management, observability views, team management, billing
## Verification Plan
### Phase 2 Verification
1. `docker compose up` starts all 6 containers
2. Navigate to Logto admin, create a user
3. User logs in via OIDC flow through Traefik
4. API calls carrying a valid Logto JWT reach the upstream service with the `X-Tenant-Id` header injected by ForwardAuth
5. License token can be generated and verified
6. All existing tests still pass
### Phase 3 Verification
1. Upload a sample Camel JAR via API
2. Platform builds container image
3. Deploy to "dev" environment
4. Container starts with cameleer3 agent attached
5. App is reachable via Traefik routing
6. Logs are accessible via API
7. Deploy same image to "prod" with different config
### Phase 4 Verification
1. Running Camel app sends traces to cameleer3-server
2. Traces visible in ClickHouse with correct tenant_id
3. Topology graph shows route structure
4. One tenant cannot see another tenant's data
### Phase 5 Verification
1. Helm install deploys full platform to k3s
2. Tenant provisioning creates namespace + resources
3. App deployment creates K8s Deployment + Service
4. Kaniko builds image and pushes to registry
5. NetworkPolicy blocks cross-tenant traffic
6. Same API contracts work as Docker mode
### End-to-End Smoke Test (Any Phase)
```bash
# Docker Compose
docker compose up -d
# Create tenant + user via API/Logto
# Upload sample Camel JAR
# Deploy to environment
# Verify agent connects to cameleer3-server
# Verify traces in ClickHouse
# Verify observability API returns data
```