# Multitenancy Architecture Design

**Date:** 2026-04-04
**Status:** Draft

## Context

Cameleer3 Server is being integrated into a SaaS platform (cameleer-saas). The platform must serve multiple tenants that share PostgreSQL and ClickHouse while guaranteeing strict data isolation. Each tenant gets its own cameleer3-server instance. Environments (dev/staging/prod) are a first-class concept within each tenant.

## Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Tenant model | 1 customer = 1 tenant | SaaS customer isolation |
| Instance model | 1 tenant = 1 server instance | In-memory state (registry, catalog, SSE) is tenant-scoped |
| Environments | First-class, per-agent property | Agents belong to exactly 1 environment |
| PG isolation | Schema-per-tenant | No query changes needed; Flyway runs per-schema; JDBC `currentSchema` param |
| CH isolation | Shared DB, `tenant_id` column + partition key | Already partially in place; tenant in partition key enables pruning + TTL |
| Agent auth | Per-tenant bootstrap token | SaaS shell provisions tokens; JWT carries the tenant ID (`tenant` claim) |
| User scope | Single tenant per user | Logto organizations handle user↔tenant mapping |
| Migration | Fresh install | No backward-compatibility migration needed |

## Data Hierarchy

```
Tenant (customer org)
 └─ Environment (dev, staging, prod)
     └─ Application (order-service, payment-gateway)
         └─ Agent Instance (pod-1, pod-2)
```

## Architecture

```
Tenant "Acme" ──► cameleer3-server (CAMELEER_TENANT_ID=acme)
                   ├─ PG schema: tenant_acme
                   ├─ CH writes: tenant_id='acme'
                   ├─ Agents: env=dev, env=prod
                   └─ In-memory: registry, catalog, SSE

Tenant "Beta" ──► cameleer3-server (CAMELEER_TENANT_ID=beta)
                   ├─ PG schema: tenant_beta
                   ├─ CH writes: tenant_id='beta'
                   └─ ...

Shared: PostgreSQL (multiple schemas) + ClickHouse (single DB, tenant_id partitioning)
```

Each server instance reads `CAMELEER_TENANT_ID` from its environment (default: `"default"`) and uses that value for all ClickHouse reads and writes. The PG schema is selected via `?currentSchema=tenant_{id}` on the JDBC URL.
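As a rough illustration of the paragraph above, the following sketch resolves the tenant ID and derives the schema name and JDBC URL from it. `TenantDataSource` and its methods are hypothetical names, and the restriction of tenant IDs to lowercase letters, digits and underscores is an assumption made here so the ID can be embedded in a schema name without quoting; the design itself does not specify an ID format.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: resolves the tenant ID and derives the PG schema and JDBC URL. */
final class TenantDataSource {
    // Assumption: IDs are lowercase letters, digits and underscores (max 40 chars),
    // so "tenant_" + id stays a valid unquoted PostgreSQL identifier.
    private static final Pattern SAFE_ID = Pattern.compile("[a-z][a-z0-9_]{0,39}");

    private TenantDataSource() {}

    /** Returns CAMELEER_TENANT_ID from the given environment, or "default" when unset or blank. */
    static String tenantId(Map<String, String> env) {
        String raw = env.get("CAMELEER_TENANT_ID");
        String id = (raw == null || raw.isBlank()) ? "default" : raw.trim();
        if (!SAFE_ID.matcher(id).matches()) {
            throw new IllegalArgumentException("Unsupported tenant ID: " + id);
        }
        return id;
    }

    /** Schema naming convention from this document: tenant_{id}. */
    static String schemaName(String tenantId) {
        return "tenant_" + tenantId;
    }

    /** Builds the JDBC URL with the currentSchema parameter set to the tenant's schema. */
    static String jdbcUrl(String host, int port, String database, String tenantId) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database
                + "?currentSchema=" + schemaName(tenantId);
    }
}
```

Calling `TenantDataSource.tenantId(System.getenv())` at startup would give the value used for ClickHouse operations; rejecting unexpected characters early keeps a malformed ID from reaching either database.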

## 1. Agent Protocol Changes

### Registration Payload

Add an `environmentId` field:

```json
{
  "instanceId": "order-svc-pod-1",
  "displayName": "order-svc-pod-1",
  "applicationId": "order-service",
  "environmentId": "dev",
  "version": "1.0-SNAPSHOT",
  "routeIds": ["route-orders"],
  "capabilities": { "tracing": true, "replay": false }
}
```

`environmentId` defaults to `"default"` if omitted (backward compatibility with older agents).
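One way the default could be applied on the server side is in the request type itself, so every consumer sees a non-null environment. This is a minimal sketch: `RegistrationRequest` is a hypothetical DTO name that mirrors the payload fields above, not necessarily the class used in cameleer3-common.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical DTO mirroring the registration payload. */
record RegistrationRequest(
        String instanceId,
        String displayName,
        String applicationId,
        String environmentId,
        String version,
        List<String> routeIds,
        Map<String, Object> capabilities) {

    static final String DEFAULT_ENVIRONMENT = "default";

    // Compact constructor: older agents omit environmentId, so normalize
    // a missing or blank value to the default environment.
    RegistrationRequest {
        if (environmentId == null || environmentId.isBlank()) {
            environmentId = DEFAULT_ENVIRONMENT;
        }
    }
}
```

Normalizing at the boundary means the registry and the write path never have to repeat the null check.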

### Heartbeat Payload

Add `environmentId` (optional, for auto-heal after server restart):

```json
{
  "routeStates": { "route-orders": "Started" },
  "capabilities": { "tracing": true },
  "environmentId": "dev"
}
```

### JWT Claims

Agent JWTs issued by the server include:
- `tenant`: tenant ID (from server config)
- `env`: environment ID (from registration)
- `group`: application ID (existing)

The SaaS shell uses `tenant` + `env` claims to route agent traffic to the correct server instance.
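The claim set above can be sketched as a plain map that is then handed to whichever JWT library the server uses for signing. `AgentClaims` and its methods are hypothetical names introduced for illustration; only the three claim names come from this document.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that assembles the custom claims for an agent JWT. */
final class AgentClaims {
    private AgentClaims() {}

    /** Builds the claims in a stable order: tenant, env, group. */
    static Map<String, Object> forAgent(String tenantId, String environmentId, String applicationId) {
        Map<String, Object> claims = new LinkedHashMap<>();
        claims.put("tenant", tenantId);       // from server config
        claims.put("env", environmentId);     // from registration
        claims.put("group", applicationId);   // existing claim
        return Collections.unmodifiableMap(claims);
    }

    /** True when the token's tenant claim matches the tenant this instance serves. */
    static boolean belongsTo(Map<String, Object> claims, String tenantId) {
        return tenantId.equals(claims.get("tenant"));
    }
}
```

A check like `belongsTo` is an extra safeguard assumed here, not something the design mandates: it would let an instance reject a token that was routed to the wrong tenant.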

## 2. Server Configuration

New environment variables:

| Variable | Default | Purpose |
|----------|---------|---------|
| `CAMELEER_TENANT_ID` | `default` | Tenant identifier for all CH data operations |

PG connection includes schema:
```yaml
spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
```

Flyway runs against the configured schema automatically.

## 3. ClickHouse Schema Changes

### Column Ordering Principle

All tables follow the ordering: **tenant → time → environment → application → agent/route → specifics**

This matches query patterns (most-filtered-first) and lets the sparse primary index skip data for the leading filter columns.

### Partitioning

All tables: `PARTITION BY (tenant_id, toYYYYMM(timestamp))` (or `toYYYYMM(bucket)` for stats tables).

Benefits:
- Partition pruning by tenant (queries that filter on `tenant_id` never scan other tenants' data)
- Partition pruning by month (time-range queries)
- Per-tenant TTL/retention (drop partitions)
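Per-tenant retention by dropping partitions could look roughly like the sketch below, which only generates the statements. `TenantRetention` is a hypothetical helper, and the `DROP PARTITION ('acme', 202601)` tuple form is an assumption based on ClickHouse's documented partition-expression syntax for composite partition keys; it should be checked against the deployed ClickHouse version before use.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that generates per-tenant DROP PARTITION statements. */
final class TenantRetention {
    private TenantRetention() {}

    /** Same value toYYYYMM produces, e.g. 2026-04 -> 202604. */
    static int toYyyyMm(YearMonth month) {
        return month.getYear() * 100 + month.getMonthValue();
    }

    /**
     * One statement per month from 'oldest' up to the last month outside the
     * retention window. keepMonths counts the current month, so keepMonths = 3
     * with current = 2026-04 keeps 2026-02 through 2026-04.
     */
    static List<String> dropStatements(String table, String tenantId,
                                       YearMonth oldest, YearMonth current, int keepMonths) {
        // Values are concatenated into SQL, so only allow plain identifiers.
        if (!table.matches("[a-z_][a-z0-9_]*") || !tenantId.matches("[a-z][a-z0-9_]*")) {
            throw new IllegalArgumentException("Unsafe table or tenant identifier");
        }
        List<String> statements = new ArrayList<>();
        YearMonth lastToDrop = current.minusMonths(keepMonths);
        for (YearMonth m = oldest; !m.isAfter(lastToDrop); m = m.plusMonths(1)) {
            statements.add("ALTER TABLE " + table
                    + " DROP PARTITION ('" + tenantId + "', " + toYyyyMm(m) + ")");
        }
        return statements;
    }
}
```

Because the tenant is part of the partition key, each statement touches only that tenant's data for that month, which is what makes differing retention periods per tenant practical.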

### Raw Tables

#### `executions`

```sql
CREATE TABLE executions (
    tenant_id String DEFAULT 'default',
    start_time DateTime64(3),
    environment String DEFAULT 'default',
    application_id String,
    instance_id String,
    -- ... existing columns ...
) ENGINE = ReplacingMergeTree()
PARTITION BY (tenant_id, toYYYYMM(start_time))
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id)
```

#### `processor_executions`

```sql
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id, seq)
PARTITION BY (tenant_id, toYYYYMM(start_time))
```

#### `logs`

```sql
ORDER BY (tenant_id, timestamp, environment, application, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

#### `agent_metrics`

```sql
ORDER BY (tenant_id, collected_at, environment, instance_id, metric_name)
PARTITION BY (tenant_id, toYYYYMM(collected_at))
```

#### `route_diagrams`

```sql
ORDER BY (tenant_id, created_at, environment, route_id, instance_id)
PARTITION BY (tenant_id, toYYYYMM(created_at))
```

#### `agent_events`

```sql
ORDER BY (tenant_id, timestamp, environment, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

#### `usage_events` (new column)

```sql
-- Add tenant_id (currently missing)
ORDER BY (tenant_id, timestamp, environment, username, normalized)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

### Materialized View Targets (stats_1m_*)

All follow: `ORDER BY (tenant_id, bucket, environment, ...)`, `PARTITION BY (tenant_id, toYYYYMM(bucket))`

Example for `stats_1m_route`:
```sql
ORDER BY (tenant_id, bucket, environment, application_id, route_id)
PARTITION BY (tenant_id, toYYYYMM(bucket))
```

### MV Source Queries

All materialized view SELECT statements include `environment` in GROUP BY:

```sql
SELECT
    tenant_id,
    toStartOfMinute(start_time) AS bucket,
    environment,
    application_id,
    route_id,
    countState() AS total_count,
    ...
FROM executions
GROUP BY tenant_id, bucket, environment, application_id, route_id
```

## 4. Java Code Changes

### Configuration

New config class:

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

@ConfigurationProperties(prefix = "cameleer.tenant")
public class TenantProperties {
    private String id = "default";

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}
```

The value is read from the `CAMELEER_TENANT_ID` env var, which Spring Boot's relaxed binding maps to the property `cameleer.tenant.id`.

### AgentInfo Record

Add `environmentId` field:

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

public record AgentInfo(
    String instanceId,
    String displayName,
    String applicationId,
    String environmentId,    // NEW
    String version,
    List<String> routeIds,
    Map<String, Object> capabilities,
    AgentState state,
    Instant registeredAt,
    Instant lastHeartbeat,
    Instant staleTransitionTime
) { ... }
```

### ClickHouse Stores

All stores receive `TenantProperties` via constructor injection and use `tenantProperties.getId()` instead of hardcoded `"default"`:

**Pattern (applies to all stores):**
```java
// Before:
private static final String TENANT = "default";

// After:
private final String tenantId;

public ClickHouseStatsStore(JdbcTemplate jdbc, TenantProperties tenantProps) {
    this.jdbc = jdbc;
    this.tenantId = tenantProps.getId();
}
```

**Files to update:**
- `ClickHouseExecutionStore`: writes and reads
- `ClickHouseLogStore`: writes and reads
- `ClickHouseMetricsStore`: add tenant_id to INSERT
- `ClickHouseMetricsQueryStore`: add tenant_id filter to reads
- `ClickHouseStatsStore`: replace `TENANT` constant
- `ClickHouseDiagramStore`: replace `TENANT` constant
- `ClickHouseSearchIndex`: replace hardcoded `'default'`
- `ClickHouseAgentEventRepository`: replace `TENANT` constant
- `ClickHouseUsageTracker`: add tenant_id to writes and reads
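For the read paths in the stores listed above, the tenant filter can be made hard to forget by building it in one place. The following is a sketch under assumptions: `ScopedQuery` is a hypothetical value type, and `columns` and `table` are expected to be constants from the store's own code, never user input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value object: SQL text plus positional bind parameters. */
record ScopedQuery(String sql, List<Object> params) {

    /**
     * Builds a SELECT that always filters by tenant and optionally by environment.
     * A null or blank environment means "all environments".
     */
    static ScopedQuery select(String columns, String table, String tenantId, String environment) {
        StringBuilder sql = new StringBuilder("SELECT ").append(columns)
                .append(" FROM ").append(table)
                .append(" WHERE tenant_id = ?");
        List<Object> params = new ArrayList<>();
        params.add(tenantId);
        if (environment != null && !environment.isBlank()) {
            sql.append(" AND environment = ?");
            params.add(environment);
        }
        return new ScopedQuery(sql.toString(), List.copyOf(params));
    }
}
```

With Spring's `JdbcTemplate`, the result could be executed as `jdbc.queryForList(q.sql(), q.params().toArray())`. Binding the tenant ID as a parameter rather than concatenating it keeps the injected value out of the SQL text.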

### Environment in Write Path

The `ChunkAccumulator` looks up the agent's `environmentId` in the agent registry and includes it in `MergedExecution` and `ProcessorBatch`, falling back to `"default"` when the agent is not registered:

```java
// ChunkAccumulator.toMergedExecution():
AgentInfo agent = registryService.findById(instanceId);
String environment = agent != null ? agent.environmentId() : "default";
// include environment in MergedExecution
```

### Registration Controller

Pass `environmentId` from the registration payload to `AgentRegistryService.register()`. Default to `"default"` if absent.

### Heartbeat Controller

On auto-heal, use `environmentId` from the heartbeat payload (if present).

## 5. PostgreSQL: Schema-per-Tenant

No table schema changes. Isolation comes from the JDBC `currentSchema` parameter:

```yaml
spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
```

Flyway creates tables in the tenant's schema on first startup. Each server instance manages its own schema independently.

The schema itself can be created in one of two ways:
- The SaaS shell creates the PG schema before starting a tenant's server instance
- Or the server creates it on startup via Flyway's `CREATE SCHEMA IF NOT EXISTS`, which typically means naming the schema in Flyway's own configuration (for example `spring.flyway.schemas`) rather than only in the JDBC URL

## 6. UI Changes

### Environment Filter

Add an environment filter dropdown to the sidebar header (next to the time range picker). The selection is persisted in URL query params.

All data queries (executions, stats, logs, catalog) include an `environment` filter when one is set. "All environments" is the default.

### Catalog

The route catalog groups by environment → application → route. The sidebar tree becomes:

```
dev
 └─ order-service
     ├─ route-orders (42)
     └─ route-cbr (18)
prod
 └─ order-service
     ├─ route-orders (1,204)
     └─ route-cbr (890)
```
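On the server side, the grouping behind this tree could be produced from flat catalog rows along the following lines. `CatalogEntry` and `CatalogTree` are hypothetical names, and the sketch assumes the catalog endpoint already has one row per environment, application and route with an execution count.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical flat catalog row: one route with its execution count. */
record CatalogEntry(String environment, String application, String routeId, long count) {}

/** Hypothetical helper that nests catalog rows for the sidebar tree. */
final class CatalogTree {
    private CatalogTree() {}

    /** Groups entries as environment -> application -> routeId -> count, sorted by key. */
    static Map<String, Map<String, Map<String, Long>>> build(List<CatalogEntry> entries) {
        Map<String, Map<String, Map<String, Long>>> tree = new TreeMap<>();
        for (CatalogEntry e : entries) {
            tree.computeIfAbsent(e.environment(), k -> new TreeMap<>())
                .computeIfAbsent(e.application(), k -> new TreeMap<>())
                // Sum counts if the same route is reported more than once
                .merge(e.routeId(), e.count(), Long::sum);
        }
        return tree;
    }
}
```

Using `TreeMap` at each level gives a stable alphabetical order, so the sidebar does not reshuffle between refreshes.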

## 7. What the SaaS Shell Must Do

The cameleer3-server does NOT manage tenants. The SaaS shell (cameleer-saas) is responsible for:

1. **Provisioning**: Create PG schema `tenant_{id}`, generate a per-tenant bootstrap token, and start the cameleer3-server container with `CAMELEER_TENANT_ID={id}` and a PG URL pointing to the schema
2. **Routing**: Route agent and UI traffic to the correct server instance (by tenant)
3. **Lifecycle**: Start/stop/upgrade tenant server instances
4. **Auth**: Issue user JWTs with tenant claims (via Logto), configure ForwardAuth
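The provisioning step can be sketched as a pure function that computes what the shell needs to execute. Everything named here is illustrative: `TenantProvisioning` is a hypothetical type, `CAMELEER_BOOTSTRAP_TOKEN` is an invented variable name (the document does not say how the token reaches the server), passing the PG URL through `SPRING_DATASOURCE_URL` is an assumption, and the tenant ID format restriction is likewise assumed.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;

/** Hypothetical provisioning plan computed by the SaaS shell for one tenant. */
record TenantProvisioning(String schemaSql, String bootstrapToken, Map<String, String> containerEnv) {

    static TenantProvisioning plan(String tenantId, String pgHost, SecureRandom random) {
        // Assumed ID format; the ID is concatenated into DDL, so reject anything else.
        if (!tenantId.matches("[a-z][a-z0-9_]{0,39}")) {
            throw new IllegalArgumentException("Unsupported tenant ID: " + tenantId);
        }
        String schema = "tenant_" + tenantId;

        // 32 random bytes encoded as URL-safe Base64 without padding
        byte[] secret = new byte[32];
        random.nextBytes(secret);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(secret);

        Map<String, String> env = Map.of(
                "CAMELEER_TENANT_ID", tenantId,
                // Assumption: the URL is supplied via Spring's env-var binding
                "SPRING_DATASOURCE_URL",
                "jdbc:postgresql://" + pgHost + ":5432/cameleer?currentSchema=" + schema,
                // Hypothetical variable name for the bootstrap token
                "CAMELEER_BOOTSTRAP_TOKEN", token);

        return new TenantProvisioning("CREATE SCHEMA IF NOT EXISTS " + schema, token, env);
    }
}
```

Keeping the plan separate from its execution makes it straightforward to test, and lets the shell run the schema DDL and the container start through whatever orchestration it already uses.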

## 8. Scope Summary

| Area | Change | Complexity |
|------|--------|------------|
| Agent protocol (cameleer3-common) | Add `environmentId` to registration + heartbeat | Low |
| Server config | `TenantProperties` bean, PG schema URL | Low |
| ClickHouse schema | Add `environment` column, update ORDER BY/PARTITION BY | Medium |
| ClickHouse stores (9 files) | Replace hardcoded `"default"` with injected tenant ID, add environment | Medium |
| AgentInfo + registry | Add `environmentId` field | Low |
| ChunkAccumulator + write pipeline | Include environment in data writes | Low |
| Controllers | Pass environment from registration/heartbeat | Low |
| UI | Environment filter dropdown, catalog grouping | Medium |
| PostgreSQL | No table changes (schema-per-tenant via JDBC URL) | None |

## Verification

1. Start server with `CAMELEER_TENANT_ID=acme` and PG `currentSchema=tenant_acme`
2. Register agent with `environmentId=dev`
3. Verify ClickHouse writes contain `tenant_id='acme'` and `environment='dev'`
4. Start second server with `CAMELEER_TENANT_ID=beta`
5. Verify data from tenant "beta" is not visible to tenant "acme" queries, and vice versa
6. Verify the UI environment filter shows only the selected environment's data