docs: multitenancy architecture design spec

Covers tenant isolation (1 tenant = 1 server instance), environment
support (first-class agent property), ClickHouse partitioning
(tenant → time → environment → application), PostgreSQL schema-per-
tenant via JDBC currentSchema, and agent protocol changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hsiegeln
2026-04-04 14:37:00 +02:00
parent 7429b85964
commit ee7226cf1c

@@ -0,0 +1,366 @@
# Multitenancy Architecture Design
**Date:** 2026-04-04
**Status:** Draft
## Context
Cameleer3 Server is being integrated into a SaaS platform (cameleer-saas). The server must support multiple tenants sharing PostgreSQL and ClickHouse while guaranteeing strict data isolation. Each tenant gets their own cameleer3-server instance. Environments (dev/staging/prod) are a first-class concept within each tenant.
## Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Tenant model | 1 customer = 1 tenant | SaaS customer isolation |
| Instance model | 1 tenant = 1 server instance | In-memory state (registry, catalog, SSE) is tenant-scoped |
| Environments | First-class, per-agent property | Agents belong to exactly 1 environment |
| PG isolation | Schema-per-tenant | No query changes needed; Flyway runs per-schema; JDBC `currentSchema` param |
| CH isolation | Shared DB, `tenant_id` column + partition key | Already partially in place; tenant in partition key enables pruning + TTL |
| Agent auth | Per-tenant bootstrap token | SaaS shell provisions tokens; agent JWT carries the tenant ID (`tenant` claim) |
| User scope | Single tenant per user | Logto organizations handle user↔tenant mapping |
| Migration | Fresh install | No backward-compatibility migration needed |
## Data Hierarchy
```
Tenant (customer org)
 └─ Environment (dev, staging, prod)
     └─ Application (order-service, payment-gateway)
         └─ Agent Instance (pod-1, pod-2)
```
## Architecture
```
Tenant "Acme" ──► cameleer3-server (TENANT_ID=acme)
                    ├─ PG schema: tenant_acme
                    ├─ CH writes: tenant_id='acme'
                    ├─ Agents: env=dev, env=prod
                    └─ In-memory: registry, catalog, SSE

Tenant "Beta" ──► cameleer3-server (TENANT_ID=beta)
                    ├─ PG schema: tenant_beta
                    ├─ CH writes: tenant_id='beta'
                    └─ ...

Shared: PostgreSQL (multiple schemas) + ClickHouse (single DB, tenant_id partitioning)
```
Each server instance reads `CAMELEER_TENANT_ID` from its environment (default: `"default"`). This value is used for all ClickHouse reads/writes. The PG schema is set via `?currentSchema=tenant_{id}` on the JDBC URL.
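The mapping from tenant ID to PG schema name and JDBC URL can be sketched as follows. `TenantSchema` and its methods are hypothetical names, and the character check is an assumption: the spec does not define which characters a tenant ID may contain, but since the ID ends up inside a schema name, some validation seems prudent.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the spec. */
final class TenantSchema {
    // Assumption: tenant IDs are lowercase alphanumerics/underscores, safe as a schema suffix.
    private static final Pattern SAFE_ID = Pattern.compile("[a-z0-9_]{1,48}");

    /** Mirrors the documented fallback: an unset CAMELEER_TENANT_ID means "default". */
    static String resolveTenantId(String envValue) {
        return (envValue == null || envValue.isBlank()) ? "default" : envValue;
    }

    /** Builds the schema name tenant_{id}, rejecting IDs outside the assumed pattern. */
    static String schemaFor(String tenantId) {
        if (!SAFE_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("Unsupported tenant ID: " + tenantId);
        }
        return "tenant_" + tenantId;
    }

    /** Appends ?currentSchema=tenant_{id} to a base JDBC URL that has no query string yet. */
    static String jdbcUrl(String baseUrl, String tenantId) {
        return baseUrl + "?currentSchema=" + schemaFor(tenantId);
    }
}
```

For example, `TenantSchema.jdbcUrl("jdbc:postgresql://pg:5432/cameleer", "acme")` yields a URL ending in `?currentSchema=tenant_acme`.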
## 1. Agent Protocol Changes
### Registration Payload
Add `environmentId` field:
```json
{
  "instanceId": "order-svc-pod-1",
  "displayName": "order-svc-pod-1",
  "applicationId": "order-service",
  "environmentId": "dev",
  "version": "1.0-SNAPSHOT",
  "routeIds": ["route-orders"],
  "capabilities": { "tracing": true, "replay": false }
}
```
`environmentId` defaults to `"default"` if omitted (backward compatibility with older agents).
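One way to apply this default is to normalize the field when the payload is deserialized. The record below is a minimal sketch: `RegistrationRequest` is a hypothetical DTO name, and treating a blank value like an omitted one is an assumption the spec does not state.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical DTO mirroring the registration payload above. */
record RegistrationRequest(
        String instanceId,
        String displayName,
        String applicationId,
        String environmentId,
        String version,
        List<String> routeIds,
        Map<String, Object> capabilities) {

    static final String DEFAULT_ENVIRONMENT = "default";

    // Compact constructor: older agents omit environmentId, so fall back to "default".
    // Assumption: a blank value is treated the same as a missing one.
    RegistrationRequest {
        if (environmentId == null || environmentId.isBlank()) {
            environmentId = DEFAULT_ENVIRONMENT;
        }
    }
}
```

Normalizing in the constructor means downstream code (registry, write path) never has to handle a null environment.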
### Heartbeat Payload
Add `environmentId` (optional, for auto-heal after server restart):
```json
{
  "routeStates": { "route-orders": "Started" },
  "capabilities": { "tracing": true },
  "environmentId": "dev"
}
```
### JWT Claims
Agent JWTs issued by the server include:
- `tenant` — tenant ID (from server config)
- `env` — environment ID (from registration)
- `group` — application ID (existing)
The SaaS shell uses `tenant` + `env` claims to route agent traffic to the correct server instance.
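The source of each claim can be made explicit with a small sketch. `AgentClaims` is a hypothetical name, and the snippet only assembles the claim set; signing and the JWT library in use are outside what this spec describes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that assembles the custom claims listed above. */
final class AgentClaims {
    static Map<String, Object> build(String tenantId, String environmentId, String applicationId) {
        Map<String, Object> claims = new LinkedHashMap<>();
        claims.put("tenant", tenantId);      // from server config (CAMELEER_TENANT_ID)
        claims.put("env", environmentId);    // from the agent's registration payload
        claims.put("group", applicationId);  // existing claim: application ID
        return claims;
    }
}
```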
## 2. Server Configuration
New environment variables:
| Variable | Default | Purpose |
|----------|---------|---------|
| `CAMELEER_TENANT_ID` | `default` | Tenant identifier for all CH data operations |
PG connection includes schema:
```yaml
spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
```
Flyway runs against the configured schema automatically.
## 3. ClickHouse Schema Changes
### Column Ordering Principle
All tables follow the ordering: **tenant → time → environment → application → agent/route → specifics**
This matches query patterns (most-filtered-first) and gives optimal sparse index data skipping.
### Partitioning
All tables: `PARTITION BY (tenant_id, toYYYYMM(timestamp))` (or `toYYYYMM(bucket)` for stats tables).
Benefits:
- Partition pruning by tenant (never scans other tenants' data)
- Partition pruning by month (time-range queries)
- Per-tenant TTL/retention (drop partitions)
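Because each partition is keyed by `(tenant_id, toYYYYMM(...))`, per-tenant retention can be expressed as dropping one partition per tenant and month. The sketch below generates such a statement; `TenantRetention` is a hypothetical name, the inputs are assumed to be trusted and pre-validated, and the exact partition-literal form should be checked against the ClickHouse version in use.

```java
import java.time.YearMonth;

/** Hypothetical helper for per-tenant retention; not part of the spec. */
final class TenantRetention {
    // Mirrors ClickHouse toYYYYMM(): 2026-04 -> 202604.
    static int toYYYYMM(YearMonth month) {
        return month.getYear() * 100 + month.getMonthValue();
    }

    // Assumption: table and tenantId are trusted identifiers (no user input).
    static String dropPartitionSql(String table, String tenantId, YearMonth month) {
        return "ALTER TABLE " + table
                + " DROP PARTITION ('" + tenantId + "', " + toYYYYMM(month) + ")";
    }
}
```

Only the named tenant's partition for that month is affected; other tenants' partitions for the same month are untouched.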
### Raw Tables
#### `executions`
```sql
CREATE TABLE executions (
    tenant_id String DEFAULT 'default',
    start_time DateTime64(3),
    environment String DEFAULT 'default',
    application_id String,
    instance_id String,
    -- ... existing columns ...
) ENGINE = ReplacingMergeTree()
PARTITION BY (tenant_id, toYYYYMM(start_time))
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id)
```
#### `processor_executions`
```sql
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id, seq)
PARTITION BY (tenant_id, toYYYYMM(start_time))
```
#### `logs`
```sql
ORDER BY (tenant_id, timestamp, environment, application, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```
#### `agent_metrics`
```sql
ORDER BY (tenant_id, collected_at, environment, instance_id, metric_name)
PARTITION BY (tenant_id, toYYYYMM(collected_at))
```
#### `route_diagrams`
```sql
ORDER BY (tenant_id, created_at, environment, route_id, instance_id)
PARTITION BY (tenant_id, toYYYYMM(created_at))
```
#### `agent_events`
```sql
ORDER BY (tenant_id, timestamp, environment, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```
#### `usage_events` (new column)
```sql
-- Add tenant_id (currently missing)
ORDER BY (tenant_id, timestamp, environment, username, normalized)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```
### Materialized View Targets (stats_1m_*)
All follow: `ORDER BY (tenant_id, bucket, environment, ...)`, `PARTITION BY (tenant_id, toYYYYMM(bucket))`
Example for `stats_1m_route`:
```sql
ORDER BY (tenant_id, bucket, environment, application_id, route_id)
PARTITION BY (tenant_id, toYYYYMM(bucket))
```
### MV Source Queries
All materialized view SELECT statements include `environment` in GROUP BY:
```sql
SELECT
    tenant_id,
    toStartOfMinute(start_time) AS bucket,
    environment,
    application_id,
    route_id,
    countState() AS total_count,
    ...
FROM executions
GROUP BY tenant_id, bucket, environment, application_id, route_id
```
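The effect of adding `environment` to the grouping key can be illustrated in memory. This is only an analogy, under stated assumptions: `MinuteBuckets` and its records are hypothetical, truncation to the minute stands in for `toStartOfMinute`, and a plain count stands in for the `countState()` aggregate state.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical in-memory analogy of the MV grouping; not server code. */
final class MinuteBuckets {
    record Execution(String tenantId, Instant startTime, String environment,
                     String applicationId, String routeId) {}

    // Same columns, same order as the GROUP BY clause above.
    record Key(String tenantId, Instant bucket, String environment,
               String applicationId, String routeId) {}

    static Map<Key, Long> countPerBucket(List<Execution> executions) {
        return executions.stream().collect(Collectors.groupingBy(
                e -> new Key(e.tenantId(),
                        e.startTime().truncatedTo(ChronoUnit.MINUTES), // like toStartOfMinute
                        e.environment(), e.applicationId(), e.routeId()),
                Collectors.counting()));
    }
}
```

Executions of the same route in the same minute land in separate rows for `dev` and `prod`, which is what lets the UI filter statistics by environment.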
## 4. Java Code Changes
### Configuration
New config class:
```java
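import org.springframework.boot.context.properties.ConfigurationProperties;

@ConfigurationProperties(prefix = "cameleer.tenant")
public class TenantProperties {
    // Falls back to "default" when CAMELEER_TENANT_ID is not set.
    private String id = "default";
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}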
```
The value is read from the `CAMELEER_TENANT_ID` environment variable, which Spring Boot's relaxed binding maps to the `cameleer.tenant.id` property.
### AgentInfo Record
Add `environmentId` field:
```java
public record AgentInfo(
    String instanceId,
    String displayName,
    String applicationId,
    String environmentId, // NEW
    String version,
    List<String> routeIds,
    Map<String, Object> capabilities,
    AgentState state,
    Instant registeredAt,
    Instant lastHeartbeat,
    Instant staleTransitionTime
) { ... }
```
### ClickHouse Stores
All stores receive `TenantProperties` via constructor injection and use `tenantProperties.getId()` instead of hardcoded `"default"`:
**Pattern (applies to all stores):**
```java
// Before:
private static final String TENANT = "default";

// After:
private final JdbcTemplate jdbc; // existing field
private final String tenantId;

public ClickHouseStatsStore(JdbcTemplate jdbc, TenantProperties tenantProps) {
    this.jdbc = jdbc;
    this.tenantId = tenantProps.getId();
}
```
**Files to update:**
- `ClickHouseExecutionStore` — writes and reads
- `ClickHouseLogStore` — writes and reads
- `ClickHouseMetricsStore` — add tenant_id to INSERT
- `ClickHouseMetricsQueryStore` — add tenant_id filter to reads
- `ClickHouseStatsStore` — replace `TENANT` constant
- `ClickHouseDiagramStore` — replace `TENANT` constant
- `ClickHouseSearchIndex` — replace hardcoded `'default'`
- `ClickHouseAgentEventRepository` — replace `TENANT` constant
- `ClickHouseUsageTracker` — add tenant_id to writes and reads
### Environment in Write Path
The `ChunkAccumulator` extracts `environmentId` from the agent registry and includes it in `MergedExecution` and `ProcessorBatch`:
```java
// ChunkAccumulator.toMergedExecution():
AgentInfo agent = registryService.findById(instanceId);
String environment = agent != null ? agent.environmentId() : "default";
// include environment in MergedExecution
```
### Registration Controller
Pass `environmentId` from registration payload to `AgentRegistryService.register()`. Default to `"default"` if absent.
### Heartbeat Controller
On auto-heal, use `environmentId` from heartbeat payload (if present).
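The auto-heal path can be sketched as follows. `HeartbeatAutoHeal` and its nested `Agent` record are hypothetical stand-ins for the real registry types, and falling back to `"default"` when the heartbeat carries no `environmentId` is an assumption consistent with the registration default.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of heartbeat auto-heal; not the actual registry service. */
final class HeartbeatAutoHeal {
    record Agent(String instanceId, String environmentId) {}

    private final Map<String, Agent> registry = new ConcurrentHashMap<>();

    /** Returns the registry entry, re-creating it if the server lost it (e.g. after a restart). */
    Agent onHeartbeat(String instanceId, String heartbeatEnvironmentId) {
        return registry.computeIfAbsent(instanceId, id -> {
            // Assumption: older agents send no environmentId, so use "default".
            String env = heartbeatEnvironmentId != null ? heartbeatEnvironmentId : "default";
            return new Agent(id, env);
        });
    }
}
```

An agent that is already registered keeps its environment; the heartbeat value is only consulted when the entry has to be re-created.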
## 5. PostgreSQL — Schema-per-Tenant
No table schema changes. Isolation via JDBC `currentSchema`:
```yaml
spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
```
Flyway creates tables in the tenant's schema on first startup. Each server instance manages its own schema independently.
The tenant's PG schema must exist before migrations run. One of two approaches applies:
- The SaaS shell creates the PG schema before starting a tenant's server instance
- The server creates it on startup via Flyway's `CREATE SCHEMA IF NOT EXISTS`
## 6. UI Changes
### Environment Filter
Add an environment filter dropdown to the sidebar header (next to the time range picker). Persisted in URL query params.
All data queries (executions, stats, logs, catalog) include `environment` filter when set. "All environments" is the default.
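On the server side, an optional environment filter might translate into a query like the one built below. This is a sketch under assumptions: `ExecutionQuery` is a hypothetical name, the `SELECT` is illustrative rather than an actual server query, and `null` is used here to represent "All environments".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical query builder showing the optional environment filter. */
final class ExecutionQuery {
    record Sql(String text, List<Object> params) {}

    static Sql build(String tenantId, String environment) {
        // tenant_id is always filtered; it comes from server config, never from the request.
        StringBuilder sql = new StringBuilder("SELECT count() FROM executions WHERE tenant_id = ?");
        List<Object> params = new ArrayList<>();
        params.add(tenantId);
        if (environment != null) { // null means "All environments"
            sql.append(" AND environment = ?");
            params.add(environment);
        }
        return new Sql(sql.toString(), List.copyOf(params));
    }
}
```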
### Catalog
The route catalog groups by environment → application → route. The sidebar tree becomes:
```
dev
 └─ order-service
     ├─ route-orders (42)
     └─ route-cbr (18)
prod
 └─ order-service
     ├─ route-orders (1,204)
     └─ route-cbr (890)
```
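The environment → application → route grouping behind this tree can be sketched with nested sorted maps. `CatalogTree` and `RouteEntry` are hypothetical names; the real catalog types are not shown in this spec.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the catalog grouping; not the actual catalog service. */
final class CatalogTree {
    record RouteEntry(String environment, String applicationId, String routeId, long count) {}

    /** Groups entries as environment -> application -> route -> summed count, sorted by key. */
    static Map<String, Map<String, Map<String, Long>>> group(List<RouteEntry> entries) {
        Map<String, Map<String, Map<String, Long>>> tree = new TreeMap<>();
        for (RouteEntry e : entries) {
            tree.computeIfAbsent(e.environment(), k -> new TreeMap<>())
                .computeIfAbsent(e.applicationId(), k -> new TreeMap<>())
                .merge(e.routeId(), e.count(), Long::sum);
        }
        return tree;
    }
}
```

Counts for the same route reported by several agent instances are summed, so the tree shows one number per route and environment.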
## 7. What the SaaS Shell Must Do
The cameleer3-server does NOT manage tenants. The SaaS shell (cameleer-saas) is responsible for:
1. **Provisioning**: Create PG schema `tenant_{id}`, generate per-tenant bootstrap token, start cameleer3-server container with `CAMELEER_TENANT_ID={id}` and PG URL pointing to the schema
2. **Routing**: Route agent and UI traffic to the correct server instance (by tenant)
3. **Lifecycle**: Start/stop/upgrade tenant server instances
4. **Auth**: Issue JWTs with tenant claims (via Logto), configure ForwardAuth
## 8. Scope Summary
| Area | Change | Complexity |
|------|--------|------------|
| Agent protocol (cameleer3-common) | Add `environmentId` to registration + heartbeat | Low |
| Server config | `TenantProperties` bean, PG schema URL | Low |
| ClickHouse schema | Add `environment` column, update ORDER BY/PARTITION BY | Medium |
| ClickHouse stores (9 files) | Replace hardcoded `"default"` with injected tenant ID, add environment | Medium |
| AgentInfo + registry | Add `environmentId` field | Low |
| ChunkAccumulator + write pipeline | Include environment in data writes | Low |
| Controllers | Pass environment from registration/heartbeat | Low |
| UI | Environment filter dropdown, catalog grouping | Medium |
| PostgreSQL | No table changes (schema-per-tenant via JDBC URL) | None |
## Verification
1. Start server with `CAMELEER_TENANT_ID=acme` and PG `currentSchema=tenant_acme`
2. Register agent with `environmentId=dev`
3. Verify ClickHouse writes contain `tenant_id='acme'` and `environment='dev'`
4. Start second server with `CAMELEER_TENANT_ID=beta`
5. Verify that tenant "acme" queries return no data from tenant "beta", and vice versa
6. Verify UI environment filter shows only selected environment's data