docs: multitenancy architecture design spec

Covers tenant isolation (1 tenant = 1 server instance), environment support (first-class agent property), ClickHouse partitioning (tenant → time → environment → application), PostgreSQL schema-per-tenant via JDBC currentSchema, and agent protocol changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 14:37:00 +02:00
parent 7429b85964
commit ee7226cf1c
1 changed files with 366 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-04-multitenancy-design.md
+++ b/docs/superpowers/specs/2026-04-04-multitenancy-design.md
@@ -0,0 +1,366 @@
+# Multitenancy Architecture Design
+
+**Date:** 2026-04-04
+**Status:** Draft
+
+## Context
+
+Cameleer3 Server is being integrated into a SaaS platform (cameleer-saas). The deployment must serve multiple tenants that share PostgreSQL and ClickHouse while guaranteeing strict data isolation. Each tenant gets its own cameleer3-server instance. Environments (dev/staging/prod) are a first-class concept within each tenant.
+
+## Decisions
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| Tenant model | 1 customer = 1 tenant | SaaS customer isolation |
+| Instance model | 1 tenant = 1 server instance | In-memory state (registry, catalog, SSE) is tenant-scoped |
+| Environments | First-class, per-agent property | Agents belong to exactly 1 environment |
+| PG isolation | Schema-per-tenant | No query changes needed; Flyway runs per-schema; JDBC `currentSchema` param |
+| CH isolation | Shared DB, `tenant_id` column + partition key | Already partially in place; tenant in partition key enables pruning + TTL |
+| Agent auth | Per-tenant bootstrap token | SaaS shell provisions tokens; agent JWT carries the tenant ID (`tenant` claim) |
+| User scope | Single tenant per user | Logto organizations handle user↔tenant mapping |
+| Migration | Fresh install | No backward-compatibility migration needed |
+
+## Data Hierarchy
+
+```
+Tenant (customer org)
+  └─ Environment (dev, staging, prod)
+       └─ Application (order-service, payment-gateway)
+            └─ Agent Instance (pod-1, pod-2)
+```
+
+## Architecture
+
+```
+Tenant "Acme" ──► cameleer3-server (TENANT_ID=acme)
+                    ├─ PG schema: tenant_acme
+                    ├─ CH writes: tenant_id='acme'
+                    ├─ Agents: env=dev, env=prod
+                    └─ In-memory: registry, catalog, SSE
+
+Tenant "Beta" ──► cameleer3-server (TENANT_ID=beta)
+                    ├─ PG schema: tenant_beta
+                    ├─ CH writes: tenant_id='beta'
+                    └─ ...
+
+Shared: PostgreSQL (multiple schemas) + ClickHouse (single DB, tenant_id partitioning)
+```
+
+Each server instance reads `CAMELEER_TENANT_ID` from its environment (default: `"default"`). This value is used for all ClickHouse reads/writes. The PG schema is set via `?currentSchema=tenant_{id}` on the JDBC URL.
+
+## 1. Agent Protocol Changes
+
+### Registration Payload
+
+Add `environmentId` field:
+
+```json
+{
+  "instanceId": "order-svc-pod-1",
+  "displayName": "order-svc-pod-1",
+  "applicationId": "order-service",
+  "environmentId": "dev",
+  "version": "1.0-SNAPSHOT",
+  "routeIds": ["route-orders"],
+  "capabilities": { "tracing": true, "replay": false }
+}
+```
+
+`environmentId` defaults to `"default"` if omitted (backward compatibility with older agents).
+
+### Heartbeat Payload
+
+Add `environmentId` (optional, for auto-heal after server restart):
+
+```json
+{
+  "routeStates": { "route-orders": "Started" },
+  "capabilities": { "tracing": true },
+  "environmentId": "dev"
+}
+```
+
+### JWT Claims
+
+Agent JWTs issued by the server include:
+- `tenant` — tenant ID (from server config)
+- `env` — environment ID (from registration)
+- `group` — application ID (existing)
+
+The SaaS shell uses `tenant` + `env` claims to route agent traffic to the correct server instance.
+
+## 2. Server Configuration
+
+New environment variables:
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `CAMELEER_TENANT_ID` | `default` | Tenant identifier for all CH data operations |
+
+PG connection includes schema:
+```yaml
+spring:
+  datasource:
+    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
+```
+
+Flyway runs against the configured schema automatically.
+
+## 3. ClickHouse Schema Changes
+
+### Column Ordering Principle
+
+All tables follow the ordering: **tenant → time → environment → application → agent/route → specifics**
+
+This matches query patterns (most-filtered column first), so the sparse primary index can skip as much data as possible.
+
+### Partitioning
+
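+All tables: `PARTITION BY (tenant_id, toYYYYMM(<time column>))`, where the time column is the table's own timestamp (`start_time`, `timestamp`, `collected_at`, or `created_at`; `bucket` for stats tables).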
+
+Benefits:
+- Partition pruning by tenant (never scans other tenants' data)
+- Partition pruning by month (time-range queries)
+- Per-tenant TTL/retention (drop partitions)
+
+### Raw Tables
+
+#### `executions`
+
+```sql
+CREATE TABLE executions (
+    tenant_id         String   DEFAULT 'default',
+    start_time        DateTime64(3),
+    environment       String   DEFAULT 'default',
+    application_id    String,
+    instance_id       String,
+    -- ... existing columns ...
+) ENGINE = ReplacingMergeTree()
+PARTITION BY (tenant_id, toYYYYMM(start_time))
+ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id)
+```
+
+#### `processor_executions`
+
+```sql
+ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id, seq)
+PARTITION BY (tenant_id, toYYYYMM(start_time))
+```
+
+#### `logs`
+
+```sql
+ORDER BY (tenant_id, timestamp, environment, application, instance_id)
+PARTITION BY (tenant_id, toYYYYMM(timestamp))
+```
+
+#### `agent_metrics`
+
+```sql
+ORDER BY (tenant_id, collected_at, environment, instance_id, metric_name)
+PARTITION BY (tenant_id, toYYYYMM(collected_at))
+```
+
+#### `route_diagrams`
+
+```sql
+ORDER BY (tenant_id, created_at, environment, route_id, instance_id)
+PARTITION BY (tenant_id, toYYYYMM(created_at))
+```
+
+#### `agent_events`
+
+```sql
+ORDER BY (tenant_id, timestamp, environment, instance_id)
+PARTITION BY (tenant_id, toYYYYMM(timestamp))
+```
+
+#### `usage_events` (new column)
+
+```sql
+-- Add tenant_id (currently missing)
+ORDER BY (tenant_id, timestamp, environment, username, normalized)
+PARTITION BY (tenant_id, toYYYYMM(timestamp))
+```
+
+### Materialized View Targets (stats_1m_*)
+
+All follow: `ORDER BY (tenant_id, bucket, environment, ...)`, `PARTITION BY (tenant_id, toYYYYMM(bucket))`
+
+Example for `stats_1m_route`:
+```sql
+ORDER BY (tenant_id, bucket, environment, application_id, route_id)
+PARTITION BY (tenant_id, toYYYYMM(bucket))
+```
+
+### MV Source Queries
+
+All materialized view SELECT statements include `environment` in GROUP BY:
+
+```sql
+SELECT
+    tenant_id,
+    toStartOfMinute(start_time) AS bucket,
+    environment,
+    application_id,
+    route_id,
+    countState() AS total_count,
+    ...
+FROM executions
+GROUP BY tenant_id, bucket, environment, application_id, route_id
+```
+
+## 4. Java Code Changes
+
+### Configuration
+
+New config class:
+
+```java
+@ConfigurationProperties(prefix = "cameleer.tenant")
+public class TenantProperties {
+    private String id = "default";
+    // getter/setter
+}
+```
+
+The value is read from the `CAMELEER_TENANT_ID` env var, which Spring Boot relaxed binding maps to the `cameleer.tenant.id` property.
+
+### AgentInfo Record
+
+Add `environmentId` field:
+
+```java
+public record AgentInfo(
+    String instanceId,
+    String displayName,
+    String applicationId,
+    String environmentId,    // NEW
+    String version,
+    List<String> routeIds,
+    Map<String, Object> capabilities,
+    AgentState state,
+    Instant registeredAt,
+    Instant lastHeartbeat,
+    Instant staleTransitionTime
+) { ... }
+```
+
+### ClickHouse Stores
+
+All stores receive `TenantProperties` via constructor injection and use `tenantProperties.getId()` instead of hardcoded `"default"`:
+
+**Pattern (applies to all stores):**
+```java
+// Before:
+private static final String TENANT = "default";
+
+// After:
+private final String tenantId;
+
+public ClickHouseStatsStore(JdbcTemplate jdbc, TenantProperties tenantProps) {
+    this.jdbc = jdbc;
+    this.tenantId = tenantProps.getId();
+}
+```
+
+**Files to update:**
+- `ClickHouseExecutionStore` — writes and reads
+- `ClickHouseLogStore` — writes and reads
+- `ClickHouseMetricsStore` — add tenant_id to INSERT
+- `ClickHouseMetricsQueryStore` — add tenant_id filter to reads
+- `ClickHouseStatsStore` — replace `TENANT` constant
+- `ClickHouseDiagramStore` — replace `TENANT` constant
+- `ClickHouseSearchIndex` — replace hardcoded `'default'`
+- `ClickHouseAgentEventRepository` — replace `TENANT` constant
+- `ClickHouseUsageTracker` — add tenant_id to writes and reads
+
+### Environment in Write Path
+
+The `ChunkAccumulator` extracts `environmentId` from the agent registry and includes it in `MergedExecution` and `ProcessorBatch`:
+
+```java
+// ChunkAccumulator.toMergedExecution():
+AgentInfo agent = registryService.findById(instanceId);
+String environment = agent != null ? agent.environmentId() : "default";
+// include environment in MergedExecution
+```
+
+### Registration Controller
+
+Pass `environmentId` from registration payload to `AgentRegistryService.register()`. Default to `"default"` if absent.
+
+### Heartbeat Controller
+
+On auto-heal, use `environmentId` from heartbeat payload (if present).
+
+## 5. PostgreSQL — Schema-per-Tenant
+
+No table schema changes. Isolation via JDBC `currentSchema`:
+
+```yaml
+spring:
+  datasource:
+    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
+```
+
+Flyway creates tables in the tenant's schema on first startup. Each server instance manages its own schema independently.
+
+The tenant schema is created in one of two ways:
+- The SaaS shell creates the PG schema before starting a tenant's server instance
+- The server creates it on startup via Flyway's `CREATE SCHEMA IF NOT EXISTS`
+
+## 6. UI Changes
+
+### Environment Filter
+
+Add an environment filter dropdown to the sidebar header (next to the time range picker). The selection is persisted in URL query params.
+
+All data queries (executions, stats, logs, catalog) include an `environment` filter when one is selected. "All environments" is the default.
+
+### Catalog
+
+The route catalog groups by environment → application → route. The sidebar tree becomes:
+
+```
+dev
+  └─ order-service
+       ├─ route-orders (42)
+       └─ route-cbr (18)
+prod
+  └─ order-service
+       ├─ route-orders (1,204)
+       └─ route-cbr (890)
+```
+
+## 7. What the SaaS Shell Must Do
+
+The cameleer3-server does NOT manage tenants. The SaaS shell (cameleer-saas) is responsible for:
+
+1. **Provisioning**: Create PG schema `tenant_{id}`, generate per-tenant bootstrap token, start cameleer3-server container with `CAMELEER_TENANT_ID={id}` and PG URL pointing to the schema
+2. **Routing**: Route agent and UI traffic to the correct server instance (by tenant)
+3. **Lifecycle**: Start/stop/upgrade tenant server instances
+4. **Auth**: Issue JWTs with tenant claims (via Logto), configure ForwardAuth
+
+## 8. Scope Summary
+
+| Area | Change | Complexity |
+|------|--------|------------|
+| Agent protocol (cameleer3-common) | Add `environmentId` to registration + heartbeat | Low |
+| Server config | `TenantProperties` bean, PG schema URL | Low |
+| ClickHouse schema | Add `environment` column, update ORDER BY/PARTITION BY | Medium |
+| ClickHouse stores (9 files) | Replace hardcoded `"default"` with injected tenant ID, add environment | Medium |
+| AgentInfo + registry | Add `environmentId` field | Low |
+| ChunkAccumulator + write pipeline | Include environment in data writes | Low |
+| Controllers | Pass environment from registration/heartbeat | Low |
+| UI | Environment filter dropdown, catalog grouping | Medium |
+| PostgreSQL | No table changes (schema-per-tenant via JDBC URL) | None |
+
+## Verification
+
+1. Start server with `CAMELEER_TENANT_ID=acme` and PG `currentSchema=tenant_acme`
+2. Register agent with `environmentId=dev`
+3. Verify ClickHouse writes contain `tenant_id='acme'` and `environment='dev'`
+4. Start second server with `CAMELEER_TENANT_ID=beta`
+5. Verify data from tenant "beta" is not visible to tenant "acme" queries
+6. Verify UI environment filter shows only selected environment's data
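
The verification steps above can be rehearsed without any database. The following is a minimal sketch, not the server's actual code: `TenantIsolationSketch`, `Row`, `write`, and `query` are hypothetical names, an in-memory list stands in for the shared ClickHouse table, and the tenant ID character check is an assumption rather than a rule stated in the spec. It shows the JDBC URL derived from the tenant ID, the `"default"` fallback for a missing `environmentId`, and reads that always filter on `tenant_id`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical in-memory stand-in for one cameleer3-server instance. */
final class TenantIsolationSketch {

    /** One row of a shared, ClickHouse-like table (illustrative columns only). */
    record Row(String tenantId, String environment, String executionId) {}

    // Assumption: tenant IDs use only characters that are safe inside a schema name.
    private static final Pattern SAFE_ID = Pattern.compile("[a-z0-9_]+");

    private final String tenantId;
    private final List<Row> sharedTable; // shared by all instances, like the single CH database

    TenantIsolationSketch(String tenantId, List<Row> sharedTable) {
        if (tenantId == null || !SAFE_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("unsafe tenant id: " + tenantId);
        }
        this.tenantId = tenantId;
        this.sharedTable = sharedTable;
    }

    /** Step 1: the PG schema is selected through currentSchema on the JDBC URL. */
    String jdbcUrl() {
        return "jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_" + tenantId;
    }

    /** Steps 2-3: every write carries this instance's tenant ID and the agent's environment. */
    void write(String environmentId, String executionId) {
        // Agents that omit environmentId fall back to "default".
        String environment =
                (environmentId == null || environmentId.isBlank()) ? "default" : environmentId;
        sharedTable.add(new Row(tenantId, environment, executionId));
    }

    /** Steps 5-6: reads always filter on tenant; the environment filter is optional (null = all). */
    List<Row> query(String environmentFilter) {
        List<Row> result = new ArrayList<>();
        for (Row row : sharedTable) {
            boolean sameTenant = row.tenantId().equals(tenantId);
            boolean envMatches =
                    environmentFilter == null || row.environment().equals(environmentFilter);
            if (sameTenant && envMatches) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> shared = new ArrayList<>();
        TenantIsolationSketch acme = new TenantIsolationSketch("acme", shared);
        TenantIsolationSketch beta = new TenantIsolationSketch("beta", shared);
        acme.write("dev", "exec-1");
        acme.write(null, "exec-2"); // older agent without environmentId
        beta.write("dev", "exec-3");
        System.out.println(acme.jdbcUrl());
        System.out.println(acme.query(null));  // acme rows only, all environments
        System.out.println(acme.query("dev")); // acme rows in dev only
    }
}
```

A real check would run the same assertions against ClickHouse and PostgreSQL; the sketch only pins down which values each step is expected to produce.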