docs: multitenancy architecture design spec
Covers tenant isolation (1 tenant = 1 server instance), environment support (first-class agent property), ClickHouse partitioning (tenant → time → environment → application), PostgreSQL schema-per-tenant via JDBC currentSchema, and agent protocol changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
366
docs/superpowers/specs/2026-04-04-multitenancy-design.md
Normal file
@@ -0,0 +1,366 @@
# Multitenancy Architecture Design

**Date:** 2026-04-04
**Status:** Draft

## Context

Cameleer3 Server is being integrated into a SaaS platform (cameleer-saas). The server must support multiple tenants sharing PostgreSQL and ClickHouse while guaranteeing strict data isolation. Each tenant gets its own cameleer3-server instance. Environments (dev/staging/prod) are a first-class concept within each tenant.

## Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Tenant model | 1 customer = 1 tenant | SaaS customer isolation |
| Instance model | 1 tenant = 1 server instance | In-memory state (registry, catalog, SSE) is tenant-scoped |
| Environments | First-class, per-agent property | Agents belong to exactly 1 environment |
| PG isolation | Schema-per-tenant | No query changes needed; Flyway runs per-schema; JDBC `currentSchema` param |
| CH isolation | Shared DB, `tenant_id` column + partition key | Already partially in place; tenant in partition key enables pruning + TTL |
| Agent auth | Per-tenant bootstrap token | SaaS shell provisions tokens; JWT carries the tenant ID (`tenant` claim) |
| User scope | Single tenant per user | Logto organizations handle user↔tenant mapping |
| Migration | Fresh install | No backward-compatibility migration needed |

## Data Hierarchy

```
Tenant (customer org)
  └─ Environment (dev, staging, prod)
       └─ Application (order-service, payment-gateway)
            └─ Agent Instance (pod-1, pod-2)
```

## Architecture

```
Tenant "Acme" ──► cameleer3-server (TENANT_ID=acme)
                    ├─ PG schema: tenant_acme
                    ├─ CH writes: tenant_id='acme'
                    ├─ Agents: env=dev, env=prod
                    └─ In-memory: registry, catalog, SSE

Tenant "Beta" ──► cameleer3-server (TENANT_ID=beta)
                    ├─ PG schema: tenant_beta
                    ├─ CH writes: tenant_id='beta'
                    └─ ...

Shared: PostgreSQL (multiple schemas) + ClickHouse (single DB, tenant_id partitioning)
```

Each server instance reads `CAMELEER_TENANT_ID` from its environment (default: `"default"`). This value is used for all ClickHouse reads/writes. The PG schema is set via `?currentSchema=tenant_{id}` on the JDBC URL.
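
As a rough illustration of the paragraph above, the sketch below resolves the tenant ID from an environment map and derives the JDBC URL from it. The class name `TenantConfig` and the `SAFE_ID` validation rule are assumptions made for this example, not part of the spec; the host and database name simply mirror the spec's own URL.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class TenantConfig {
    // Hypothetical rule (not in the spec): only allow characters that are
    // safe in an unquoted PostgreSQL schema name.
    private static final Pattern SAFE_ID = Pattern.compile("[a-z0-9_]{1,48}");

    /** Returns CAMELEER_TENANT_ID from the given environment, or "default" when unset or blank. */
    static String resolveTenantId(Map<String, String> env) {
        String id = env.get("CAMELEER_TENANT_ID");
        if (id == null || id.isBlank()) {
            return "default";
        }
        if (!SAFE_ID.matcher(id).matches()) {
            throw new IllegalArgumentException("Unsafe tenant id: " + id);
        }
        return id;
    }

    /** Builds the JDBC URL with currentSchema=tenant_{id}. */
    static String jdbcUrl(String tenantId) {
        return "jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_" + tenantId;
    }
}
```

Passing `System.getenv()` to `resolveTenantId` would give the runtime behaviour; taking a `Map` keeps the sketch testable without touching the process environment.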

## 1. Agent Protocol Changes

### Registration Payload

Add `environmentId` field:

```json
{
  "instanceId": "order-svc-pod-1",
  "displayName": "order-svc-pod-1",
  "applicationId": "order-service",
  "environmentId": "dev",
  "version": "1.0-SNAPSHOT",
  "routeIds": ["route-orders"],
  "capabilities": { "tracing": true, "replay": false }
}
```

`environmentId` defaults to `"default"` if omitted (backward compatibility with older agents).
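
One way to apply that default is in the payload type itself. The record below is a hypothetical subset of the registration payload (the name `RegistrationRequest` and the choice of fields are assumptions for illustration); it normalises a missing or blank `environmentId` in its compact constructor.

```java
import java.util.List;

/** Hypothetical subset of the registration payload; field names follow the JSON example. */
record RegistrationRequest(String instanceId, String applicationId,
                           String environmentId, List<String> routeIds) {

    static final String DEFAULT_ENVIRONMENT = "default";

    // Compact constructor: older agents omit environmentId, so fall back to "default".
    RegistrationRequest {
        if (environmentId == null || environmentId.isBlank()) {
            environmentId = DEFAULT_ENVIRONMENT;
        }
        // Defensive copy so the record stays immutable; null becomes an empty list.
        routeIds = routeIds == null ? List.of() : List.copyOf(routeIds);
    }
}
```

Doing the defaulting at the edge means downstream code (registry, write path) never sees a null environment.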

### Heartbeat Payload

Add `environmentId` (optional, for auto-heal after server restart):

```json
{
  "routeStates": { "route-orders": "Started" },
  "capabilities": { "tracing": true },
  "environmentId": "dev"
}
```

### JWT Claims

Agent JWTs issued by the server include:
- `tenant` — tenant ID (from server config)
- `env` — environment ID (from registration)
- `group` — application ID (existing)

The SaaS shell uses `tenant` + `env` claims to route agent traffic to the correct server instance.
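
The three custom claims listed above can be sketched as a plain map, independent of whichever JWT library does the signing. `AgentClaims` and `issuedForTenant` are hypothetical names introduced for this example; signing and standard claims such as `sub` or `exp` are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AgentClaims {
    /** Assembles the custom claims named in the spec: tenant, env, group. */
    static Map<String, Object> build(String tenantId, String environmentId, String applicationId) {
        Map<String, Object> claims = new LinkedHashMap<>();
        claims.put("tenant", tenantId);      // from server config
        claims.put("env", environmentId);    // from registration
        claims.put("group", applicationId);  // existing claim
        return claims;
    }

    /** Hypothetical check: does a token's tenant claim match the given tenant? */
    static boolean issuedForTenant(Map<String, Object> claims, String tenantId) {
        return tenantId.equals(claims.get("tenant"));
    }
}
```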

## 2. Server Configuration

New environment variables:

| Variable | Default | Purpose |
|----------|---------|---------|
| `CAMELEER_TENANT_ID` | `default` | Tenant identifier for all CH data operations |

PG connection includes schema:
```yaml
spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
```

Flyway runs against the configured schema automatically.

## 3. ClickHouse Schema Changes

### Column Ordering Principle

All tables follow the ordering: **tenant → time → environment → application → agent/route → specifics**

This matches query patterns (most-filtered-first), so the sparse primary index can skip as much data as possible.

### Partitioning

All tables: `PARTITION BY (tenant_id, toYYYYMM(timestamp))` (or `toYYYYMM(bucket)` for stats tables).

Benefits:
- Partition pruning by tenant (queries never scan other tenants' data)
- Partition pruning by month (time-range queries)
- Per-tenant TTL/retention (drop partitions)
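
To make the partition key concrete, the sketch below computes the `(tenant_id, YYYYMM)` pair a row would fall into and builds the matching `ALTER TABLE ... DROP PARTITION` statement for per-tenant retention. The class `Partitions`, the UTC assumption (ClickHouse's `toYYYYMM` follows the column or server time zone), and the identifier check are all assumptions of this example.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;

final class Partitions {
    /** Mirrors toYYYYMM for a timestamp, assuming UTC: year * 100 + month. */
    static int yyyymm(Instant ts) {
        YearMonth ym = YearMonth.from(ts.atZone(ZoneOffset.UTC));
        return ym.getYear() * 100 + ym.getMonthValue();
    }

    /** Builds a DROP PARTITION statement for one tenant and month. */
    static String dropPartitionSql(String table, String tenantId, int yyyymm) {
        // Hypothetical guard: the values are inlined into DDL, so restrict them to simple identifiers.
        if (!table.matches("[a-z0-9_]+") || !tenantId.matches("[a-z0-9_]+")) {
            throw new IllegalArgumentException("Unexpected identifier");
        }
        return "ALTER TABLE " + table + " DROP PARTITION ('" + tenantId + "', " + yyyymm + ")";
    }
}
```

Because the tenant is part of the partition key, dropping `('acme', 202601)` removes only Acme's January data and leaves every other tenant's partitions untouched.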

### Raw Tables

#### `executions`

```sql
CREATE TABLE executions (
    tenant_id String DEFAULT 'default',
    start_time DateTime64(3),
    environment String DEFAULT 'default',
    application_id String,
    instance_id String,
    -- ... existing columns ...
) ENGINE = ReplacingMergeTree()
PARTITION BY (tenant_id, toYYYYMM(start_time))
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id)
```

#### `processor_executions`

```sql
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id, seq)
PARTITION BY (tenant_id, toYYYYMM(start_time))
```

#### `logs`

```sql
ORDER BY (tenant_id, timestamp, environment, application, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

#### `agent_metrics`

```sql
ORDER BY (tenant_id, collected_at, environment, instance_id, metric_name)
PARTITION BY (tenant_id, toYYYYMM(collected_at))
```

#### `route_diagrams`

```sql
ORDER BY (tenant_id, created_at, environment, route_id, instance_id)
PARTITION BY (tenant_id, toYYYYMM(created_at))
```

#### `agent_events`

```sql
ORDER BY (tenant_id, timestamp, environment, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

#### `usage_events` (new column)

```sql
-- Add tenant_id (currently missing)
ORDER BY (tenant_id, timestamp, environment, username, normalized)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

### Materialized View Targets (stats_1m_*)

All follow: `ORDER BY (tenant_id, bucket, environment, ...)`, `PARTITION BY (tenant_id, toYYYYMM(bucket))`

Example for `stats_1m_route`:
```sql
ORDER BY (tenant_id, bucket, environment, application_id, route_id)
PARTITION BY (tenant_id, toYYYYMM(bucket))
```

### MV Source Queries

All materialized view SELECT statements include `environment` in GROUP BY:

```sql
SELECT
    tenant_id,
    toStartOfMinute(start_time) AS bucket,
    environment,
    application_id,
    route_id,
    countState() AS total_count,
    ...
FROM executions
GROUP BY tenant_id, bucket, environment, application_id, route_id
```

## 4. Java Code Changes

### Configuration

New config class:

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

@ConfigurationProperties(prefix = "cameleer.tenant")
public class TenantProperties {
    private String id = "default";

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}
```

Read from `CAMELEER_TENANT_ID` env var (Spring Boot relaxed binding: `cameleer.tenant.id`).

### AgentInfo Record

Add `environmentId` field:

```java
public record AgentInfo(
    String instanceId,
    String displayName,
    String applicationId,
    String environmentId,    // NEW
    String version,
    List<String> routeIds,
    Map<String, Object> capabilities,
    AgentState state,
    Instant registeredAt,
    Instant lastHeartbeat,
    Instant staleTransitionTime
) { ... }
```

### ClickHouse Stores

All stores receive `TenantProperties` via constructor injection and use `tenantProperties.getId()` instead of hardcoded `"default"`:

**Pattern (applies to all stores):**
```java
// Before:
private static final String TENANT = "default";

// After:
private final String tenantId;

public ClickHouseStatsStore(JdbcTemplate jdbc, TenantProperties tenantProps) {
    this.jdbc = jdbc;
    this.tenantId = tenantProps.getId();
}
```

**Files to update:**
- `ClickHouseExecutionStore` — writes and reads
- `ClickHouseLogStore` — writes and reads
- `ClickHouseMetricsStore` — add tenant_id to INSERT
- `ClickHouseMetricsQueryStore` — add tenant_id filter to reads
- `ClickHouseStatsStore` — replace `TENANT` constant
- `ClickHouseDiagramStore` — replace `TENANT` constant
- `ClickHouseSearchIndex` — replace hardcoded `'default'`
- `ClickHouseAgentEventRepository` — replace `TENANT` constant
- `ClickHouseUsageTracker` — add tenant_id to writes and reads
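
On the read side, the pattern amounts to every query carrying a mandatory `tenant_id` predicate plus an optional `environment` predicate. The following is a minimal, framework-free sketch of that idea; `TenantScopedQuery` and `executionIds` are hypothetical names, and the selected column is only an example.

```java
import java.util.ArrayList;
import java.util.List;

final class TenantScopedQuery {
    final String sql;
    final List<Object> params;

    private TenantScopedQuery(String sql, List<Object> params) {
        this.sql = sql;
        this.params = params;
    }

    /** Selects execution IDs for one tenant; a null environment means "all environments". */
    static TenantScopedQuery executionIds(String tenantId, String environment) {
        StringBuilder sql = new StringBuilder("SELECT execution_id FROM executions WHERE tenant_id = ?");
        List<Object> params = new ArrayList<>();
        params.add(tenantId);              // always present: the isolation boundary
        if (environment != null) {
            sql.append(" AND environment = ?");
            params.add(environment);       // optional narrowing within the tenant
        }
        return new TenantScopedQuery(sql.toString(), List.copyOf(params));
    }
}
```

In a store, the result could be handed to Spring's `JdbcTemplate`, for example `jdbc.queryForList(q.sql, q.params.toArray())`. Using bind parameters rather than string concatenation keeps the tenant ID out of the SQL text.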

### Environment in Write Path

The `ChunkAccumulator` extracts `environmentId` from the agent registry and includes it in `MergedExecution` and `ProcessorBatch`:

```java
// ChunkAccumulator.toMergedExecution():
AgentInfo agent = registryService.findById(instanceId);
String environment = agent != null ? agent.environmentId() : "default";
// include environment in MergedExecution
```

### Registration Controller

Pass `environmentId` from registration payload to `AgentRegistryService.register()`. Default to `"default"` if absent.

### Heartbeat Controller

On auto-heal, use `environmentId` from heartbeat payload (if present).
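
The auto-heal rule can be sketched with an in-memory map standing in for the agent registry. Everything here is hypothetical (the class `HeartbeatAutoHeal`, its methods, and the choice to keep the originally registered environment for known agents); it only illustrates the "use the heartbeat's `environmentId` if the agent is unknown" behaviour.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class HeartbeatAutoHeal {
    // Stand-in for the in-memory agent registry: instanceId -> environmentId.
    private final Map<String, String> environmentByInstance = new ConcurrentHashMap<>();

    void register(String instanceId, String environmentId) {
        environmentByInstance.put(instanceId, environmentId != null ? environmentId : "default");
    }

    /**
     * Processes a heartbeat and returns the environment the agent is registered under.
     * Known agents keep their registered environment; unknown agents (for example after a
     * server restart) are re-registered with the heartbeat's environmentId, or "default".
     */
    String heartbeat(String instanceId, String environmentIdFromPayload) {
        return environmentByInstance.computeIfAbsent(instanceId,
                id -> environmentIdFromPayload != null ? environmentIdFromPayload : "default");
    }
}
```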

## 5. PostgreSQL — Schema-per-Tenant

No table schema changes. Isolation via JDBC `currentSchema`:

```yaml
spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
```

Flyway creates tables in the tenant's schema on first startup. Each server instance manages its own schema independently.

The schema itself must exist before Flyway can create tables in it. Two options:
- The SaaS shell creates the PG schema before starting a tenant's server instance
- The server creates it on startup via Flyway's `CREATE SCHEMA IF NOT EXISTS`

## 6. UI Changes

### Environment Filter

Add an environment filter dropdown to the sidebar header (next to the time range picker). Persisted in URL query params.

All data queries (executions, stats, logs, catalog) include `environment` filter when set. "All environments" is the default.

### Catalog

The route catalog groups by environment → application → route. The sidebar tree becomes:

```
dev
  └─ order-service
       ├─ route-orders (42)
       └─ route-cbr (18)
prod
  └─ order-service
       ├─ route-orders (1,204)
       └─ route-cbr (890)
```
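
The environment → application → route grouping behind that tree can be sketched as a nested map built on the server side. `CatalogTree` and `RouteEntry` are hypothetical types for this example; the counts stand for the numbers shown in parentheses in the sidebar.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CatalogTree {
    /** Hypothetical flat catalog row: one route in one application in one environment. */
    record RouteEntry(String environment, String applicationId, String routeId, long executionCount) {}

    /** Groups flat rows into environment -> application -> route -> count, sorted by key. */
    static Map<String, Map<String, Map<String, Long>>> group(List<RouteEntry> entries) {
        Map<String, Map<String, Map<String, Long>>> tree = new TreeMap<>();
        for (RouteEntry e : entries) {
            tree.computeIfAbsent(e.environment(), k -> new TreeMap<>())
                .computeIfAbsent(e.applicationId(), k -> new TreeMap<>())
                .merge(e.routeId(), e.executionCount(), Long::sum); // sum counts across agents
        }
        return tree;
    }
}
```

`TreeMap` is used only to get a stable, alphabetical order for the sidebar; any ordered map would do.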

## 7. What the SaaS Shell Must Do

The cameleer3-server does NOT manage tenants. The SaaS shell (cameleer-saas) is responsible for:

1. **Provisioning**: Create PG schema `tenant_{id}`, generate per-tenant bootstrap token, start cameleer3-server container with `CAMELEER_TENANT_ID={id}` and PG URL pointing to the schema
2. **Routing**: Route agent and UI traffic to the correct server instance (by tenant)
3. **Lifecycle**: Start/stop/upgrade tenant server instances
4. **Auth**: Issue JWTs with tenant claims (via Logto), configure ForwardAuth

## 8. Scope Summary

| Area | Change | Complexity |
|------|--------|------------|
| Agent protocol (cameleer3-common) | Add `environmentId` to registration + heartbeat | Low |
| Server config | `TenantProperties` bean, PG schema URL | Low |
| ClickHouse schema | Add `environment` column, update ORDER BY/PARTITION BY | Medium |
| ClickHouse stores (9 files) | Replace hardcoded `"default"` with injected tenant ID, add environment | Medium |
| AgentInfo + registry | Add `environmentId` field | Low |
| ChunkAccumulator + write pipeline | Include environment in data writes | Low |
| Controllers | Pass environment from registration/heartbeat | Low |
| UI | Environment filter dropdown, catalog grouping | Medium |
| PostgreSQL | No table changes (schema-per-tenant via JDBC URL) | None |

## Verification

1. Start server with `CAMELEER_TENANT_ID=acme` and PG `currentSchema=tenant_acme`
2. Register agent with `environmentId=dev`
3. Verify ClickHouse writes contain `tenant_id='acme'` and `environment='dev'`
4. Start second server with `CAMELEER_TENANT_ID=beta`
5. Verify data from tenant "beta" is not visible to tenant "acme" queries
6. Verify the UI environment filter shows only the selected environment's data