From ee7226cf1c6ba47b279557c79fdb56f8a639dd31 Mon Sep 17 00:00:00 2001
From: hsiegeln <37154749+hsiegeln@users.noreply.github.com>
Date: Sat, 4 Apr 2026 14:37:00 +0200
Subject: [PATCH] docs: multitenancy architecture design spec
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Covers tenant isolation (1 tenant = 1 server instance), environment
support (first-class agent property), ClickHouse partitioning
(tenant → time → environment → application), PostgreSQL
schema-per-tenant via JDBC currentSchema, and agent protocol changes.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../specs/2026-04-04-multitenancy-design.md   | 366 ++++++++++++++++++
 1 file changed, 366 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-04-multitenancy-design.md

diff --git a/docs/superpowers/specs/2026-04-04-multitenancy-design.md b/docs/superpowers/specs/2026-04-04-multitenancy-design.md
new file mode 100644
index 00000000..821d6ee8
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-04-multitenancy-design.md
@@ -0,0 +1,366 @@
+# Multitenancy Architecture Design
+
+**Date:** 2026-04-04
+**Status:** Draft
+
+## Context
+
+Cameleer3 Server is being integrated into a SaaS platform (cameleer-saas). The server must support multiple tenants sharing PostgreSQL and ClickHouse while guaranteeing strict data isolation. Each tenant gets their own cameleer3-server instance. Environments (dev/staging/prod) are a first-class concept within each tenant.
+
+## Decisions
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| Tenant model | 1 customer = 1 tenant | SaaS customer isolation |
+| Instance model | 1 tenant = 1 server instance | In-memory state (registry, catalog, SSE) is tenant-scoped |
+| Environments | First-class, per-agent property | Agents belong to exactly 1 environment |
+| PG isolation | Schema-per-tenant | No query changes needed; Flyway runs per-schema; JDBC `currentSchema` param |
+| CH isolation | Shared DB, `tenant_id` column + partition key | Already partially in place; tenant in partition key enables pruning + TTL |
+| Agent auth | Per-tenant bootstrap token | SaaS shell provisions tokens; JWT includes `tenant_id` |
+| User scope | Single tenant per user | Logto organizations handle user↔tenant mapping |
+| Migration | Fresh install | No backward-compatibility migration needed |
+
+## Data Hierarchy
+
+```
+Tenant (customer org)
+  └─ Environment (dev, staging, prod)
+       └─ Application (order-service, payment-gateway)
+            └─ Agent Instance (pod-1, pod-2)
+```
+
+## Architecture
+
+```
+Tenant "Acme" ──► cameleer3-server (TENANT_ID=acme)
+                    ├─ PG schema: tenant_acme
+                    ├─ CH writes: tenant_id='acme'
+                    ├─ Agents: env=dev, env=prod
+                    └─ In-memory: registry, catalog, SSE
+
+Tenant "Beta" ──► cameleer3-server (TENANT_ID=beta)
+                    ├─ PG schema: tenant_beta
+                    ├─ CH writes: tenant_id='beta'
+                    └─ ...
+
+Shared: PostgreSQL (multiple schemas) + ClickHouse (single DB, tenant_id partitioning)
+```
+
+Each server instance reads `CAMELEER_TENANT_ID` from its environment (default: `"default"`). This value is used for all ClickHouse reads/writes. The PG schema is set via `?currentSchema=tenant_{id}` on the JDBC URL.
+
+## 1. Agent Protocol Changes
+
+### Registration Payload
+
+Add `environmentId` field:
+
+```json
+{
+  "instanceId": "order-svc-pod-1",
+  "displayName": "order-svc-pod-1",
+  "applicationId": "order-service",
+  "environmentId": "dev",
+  "version": "1.0-SNAPSHOT",
+  "routeIds": ["route-orders"],
+  "capabilities": { "tracing": true, "replay": false }
+}
+```
+
+`environmentId` defaults to `"default"` if omitted (backward compatibility with older agents).
+
+### Heartbeat Payload
+
+Add `environmentId` (optional, for auto-heal after server restart):
+
+```json
+{
+  "routeStates": { "route-orders": "Started" },
+  "capabilities": { "tracing": true },
+  "environmentId": "dev"
+}
+```
+
+### JWT Claims
+
+Agent JWTs issued by the server include:
+- `tenant` — tenant ID (from server config)
+- `env` — environment ID (from registration)
+- `group` — application ID (existing)
+
+The SaaS shell uses `tenant` + `env` claims to route agent traffic to the correct server instance.
+
+## 2. Server Configuration
+
+New environment variables:
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `CAMELEER_TENANT_ID` | `default` | Tenant identifier for all CH data operations |
+
+PG connection includes schema:
+```yaml
+spring:
+  datasource:
+    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
+```
+
+Flyway runs against the configured schema automatically.
+
+## 3. ClickHouse Schema Changes
+
+### Column Ordering Principle
+
+All tables follow the ordering: **tenant → time → environment → application → agent/route → specifics**
+
+This matches query patterns (most-filtered-first) and gives optimal sparse index data skipping.
+
+### Partitioning
+
+All tables: `PARTITION BY (tenant_id, toYYYYMM(timestamp))` (or `toYYYYMM(bucket)` for stats tables).
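As a rough illustration of how the `(tenant_id, toYYYYMM(...))` partition key could support retention for a single tenant, the sketch below builds an `ALTER TABLE ... DROP PARTITION` statement for one tenant and one month. The class `TenantPartitions`, its method names, and the identifier check are hypothetical and not part of cameleer3-server. It assumes the month component of the partition value is the integer `YYYYMM`, which is what ClickHouse's `toYYYYMM` produces.

```java
import java.time.YearMonth;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the cameleer3-server code base. */
final class TenantPartitions {

    // Assumption: table names and tenant IDs contain only letters, digits and underscores.
    private static final Pattern SAFE_NAME = Pattern.compile("[A-Za-z0-9_]+");

    private TenantPartitions() {}

    /** Mirrors ClickHouse toYYYYMM: April 2026 becomes 202604. */
    static int yyyymm(YearMonth month) {
        return month.getYear() * 100 + month.getMonthValue();
    }

    /** Builds the statement that drops one tenant's data for one month. */
    static String dropPartitionSql(String table, String tenantId, YearMonth month) {
        requireSafe(table);
        requireSafe(tenantId);
        return "ALTER TABLE " + table
                + " DROP PARTITION ('" + tenantId + "', " + yyyymm(month) + ")";
    }

    // Rejects anything that could break out of the quoted literal or the table name.
    private static void requireSafe(String name) {
        if (name == null || !SAFE_NAME.matcher(name).matches()) {
            throw new IllegalArgumentException("Unsafe identifier: " + name);
        }
    }
}
```

For example, `TenantPartitions.dropPartitionSql("executions", "acme", YearMonth.of(2026, 4))` yields `ALTER TABLE executions DROP PARTITION ('acme', 202604)`. Because the tenant is the first element of the partition tuple, such a statement touches no other tenant's partitions.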
+
+Benefits:
+- Partition pruning by tenant (queries filtered on `tenant_id` never scan other tenants' data)
+- Partition pruning by month (time-range queries)
+- Per-tenant TTL/retention (drop partitions)
+
+### Raw Tables
+
+#### `executions`
+
+```sql
+CREATE TABLE executions (
+    tenant_id String DEFAULT 'default',
+    start_time DateTime64(3),
+    environment String DEFAULT 'default',
+    application_id String,
+    instance_id String,
+    -- ... existing columns ...
+) ENGINE = ReplacingMergeTree()
+PARTITION BY (tenant_id, toYYYYMM(start_time))
+ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id)
+```
+
+#### `processor_executions`
+
+```sql
+ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id, seq)
+PARTITION BY (tenant_id, toYYYYMM(start_time))
+```
+
+#### `logs`
+
+```sql
+ORDER BY (tenant_id, timestamp, environment, application, instance_id)
+PARTITION BY (tenant_id, toYYYYMM(timestamp))
+```
+
+#### `agent_metrics`
+
+```sql
+ORDER BY (tenant_id, collected_at, environment, instance_id, metric_name)
+PARTITION BY (tenant_id, toYYYYMM(collected_at))
+```
+
+#### `route_diagrams`
+
+```sql
+ORDER BY (tenant_id, created_at, environment, route_id, instance_id)
+PARTITION BY (tenant_id, toYYYYMM(created_at))
+```
+
+#### `agent_events`
+
+```sql
+ORDER BY (tenant_id, timestamp, environment, instance_id)
+PARTITION BY (tenant_id, toYYYYMM(timestamp))
+```
+
+#### `usage_events` (new column)
+
+```sql
+-- Add tenant_id (currently missing)
+ORDER BY (tenant_id, timestamp, environment, username, normalized)
+PARTITION BY (tenant_id, toYYYYMM(timestamp))
+```
+
+### Materialized View Targets (stats_1m_*)
+
+All follow: `ORDER BY (tenant_id, bucket, environment, ...)`, `PARTITION BY (tenant_id, toYYYYMM(bucket))`
+
+Example for `stats_1m_route`:
+```sql
+ORDER BY (tenant_id, bucket, environment, application_id, route_id)
+PARTITION BY (tenant_id, toYYYYMM(bucket))
+```
+
+### MV Source Queries
+
+All materialized view SELECT statements include `environment` in GROUP BY:
+
+```sql
+SELECT
+    tenant_id,
+    toStartOfMinute(start_time) AS bucket,
+    environment,
+    application_id,
+    route_id,
+    countState() AS total_count,
+    ...
+FROM executions
+GROUP BY tenant_id, bucket, environment, application_id, route_id
+```
+
+## 4. Java Code Changes
+
+### Configuration
+
+New config class:
+
+```java
+@ConfigurationProperties(prefix = "cameleer.tenant")
+public class TenantProperties {
+    private String id = "default";
+
+    public String getId() { return id; }
+    public void setId(String id) { this.id = id; }
+}
+```
+
+Read from `CAMELEER_TENANT_ID` env var (Spring Boot relaxed binding: `cameleer.tenant.id`).
+
+### AgentInfo Record
+
+Add `environmentId` field:
+
+```java
+public record AgentInfo(
+    String instanceId,
+    String displayName,
+    String applicationId,
+    String environmentId,  // NEW
+    String version,
+    List routeIds,
+    Map capabilities,
+    AgentState state,
+    Instant registeredAt,
+    Instant lastHeartbeat,
+    Instant staleTransitionTime
+) { ... }
+```
+
+### ClickHouse Stores
+
+All stores receive `TenantProperties` via constructor injection and use `tenantProperties.getId()` instead of hardcoded `"default"`:
+
+**Pattern (applies to all stores):**
+```java
+// Before:
+private static final String TENANT = "default";
+
+// After:
+private final String tenantId;
+
+public ClickHouseStatsStore(JdbcTemplate jdbc, TenantProperties tenantProps) {
+    this.jdbc = jdbc;
+    this.tenantId = tenantProps.getId();
+}
+```
+
+**Files to update:**
+- `ClickHouseExecutionStore` — writes and reads
+- `ClickHouseLogStore` — writes and reads
+- `ClickHouseMetricsStore` — add tenant_id to INSERT
+- `ClickHouseMetricsQueryStore` — add tenant_id filter to reads
+- `ClickHouseStatsStore` — replace `TENANT` constant
+- `ClickHouseDiagramStore` — replace `TENANT` constant
+- `ClickHouseSearchIndex` — replace hardcoded `'default'`
+- `ClickHouseAgentEventRepository` — replace `TENANT` constant
+- `ClickHouseUsageTracker` — add tenant_id to writes and reads
+
+### Environment in Write Path
+
+The `ChunkAccumulator` extracts `environmentId` from the agent registry and includes it in `MergedExecution` and `ProcessorBatch`:
+
+```java
+// ChunkAccumulator.toMergedExecution():
+AgentInfo agent = registryService.findById(instanceId);
+String environment = agent != null ? agent.environmentId() : "default";
+// include environment in MergedExecution
+```
+
+### Registration Controller
+
+Pass `environmentId` from registration payload to `AgentRegistryService.register()`. Default to `"default"` if absent.
+
+### Heartbeat Controller
+
+On auto-heal, use `environmentId` from heartbeat payload (if present).
+
+## 5. PostgreSQL — Schema-per-Tenant
+
+No table schema changes. Isolation via JDBC `currentSchema`:
+
+```yaml
+spring:
+  datasource:
+    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
+```
+
+Flyway creates tables in the tenant's schema on first startup. Each server instance manages its own schema independently.
+
+The schema itself is created in one of two ways:
+- The SaaS shell creates the PG schema before starting a tenant's server instance
+- Or the server creates it on startup via Flyway's `CREATE SCHEMA IF NOT EXISTS`
+
+## 6. UI Changes
+
+### Environment Filter
+
+Add an environment filter dropdown to the sidebar header (next to the time range picker). The selection is persisted in URL query params.
+
+All data queries (executions, stats, logs, catalog) include the `environment` filter when one is set. "All environments" is the default.
+
+### Catalog
+
+The route catalog groups by environment → application → route. The sidebar tree becomes:
+
+```
+dev
+  └─ order-service
+       ├─ route-orders (42)
+       └─ route-cbr (18)
+prod
+  └─ order-service
+       ├─ route-orders (1,204)
+       └─ route-cbr (890)
+```
+
+## 7. What the SaaS Shell Must Do
+
+The cameleer3-server does NOT manage tenants. The SaaS shell (cameleer-saas) is responsible for:
+
+1. **Provisioning**: Create PG schema `tenant_{id}`, generate per-tenant bootstrap token, start cameleer3-server container with `CAMELEER_TENANT_ID={id}` and PG URL pointing to the schema
+2. **Routing**: Route agent and UI traffic to the correct server instance (by tenant)
+3. **Lifecycle**: Start/stop/upgrade tenant server instances
+4. **Auth**: Issue JWTs with tenant claims (via Logto), configure ForwardAuth
+
+## 8. Scope Summary
+
+| Area | Change | Complexity |
+|------|--------|------------|
+| Agent protocol (cameleer3-common) | Add `environmentId` to registration + heartbeat | Low |
+| Server config | `TenantProperties` bean, PG schema URL | Low |
+| ClickHouse schema | Add `environment` column, update ORDER BY/PARTITION BY | Medium |
+| ClickHouse stores (9 files) | Replace hardcoded `"default"` with injected tenant ID, add environment | Medium |
+| AgentInfo + registry | Add `environmentId` field | Low |
+| ChunkAccumulator + write pipeline | Include environment in data writes | Low |
+| Controllers | Pass environment from registration/heartbeat | Low |
+| UI | Environment filter dropdown, catalog grouping | Medium |
+| PostgreSQL | No table changes (schema-per-tenant via JDBC URL) | None |
+
+## Verification
+
+1. Start server with `CAMELEER_TENANT_ID=acme` and PG `currentSchema=tenant_acme`
+2. Register agent with `environmentId=dev`
+3. Verify ClickHouse writes contain `tenant_id='acme'` and `environment='dev'`
+4. Start second server with `CAMELEER_TENANT_ID=beta`
+5. Verify data from tenant "beta" is not visible to tenant "acme" queries
+6. Verify UI environment filter shows only selected environment's data
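Steps 1 and 4 both depend on each instance resolving its tenant ID to the right schema. A minimal sketch of that mapping follows, assuming the `tenant_{id}` naming and the JDBC URL shape shown in this spec. The class `TenantSchema`, its methods, and the ID pattern are hypothetical; the length limit is chosen only so that the schema name stays within PostgreSQL's 63-byte identifier limit.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the cameleer3-server code base. */
final class TenantSchema {

    // Assumption: IDs are lowercase letters, digits and underscores, at most 56 characters,
    // so that "tenant_" + id fits within PostgreSQL's 63-byte identifier limit.
    private static final Pattern SAFE_ID = Pattern.compile("[a-z0-9_]{1,56}");

    private TenantSchema() {}

    /** Reads CAMELEER_TENANT_ID, falling back to "default" when the variable is unset. */
    static String tenantId(Map<String, String> env) {
        String id = env.get("CAMELEER_TENANT_ID");
        return id == null ? "default" : id;
    }

    /** Maps a tenant ID to its schema name, e.g. "acme" to "tenant_acme". */
    static String schemaName(String tenantId) {
        if (tenantId == null || !SAFE_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("Invalid tenant ID: " + tenantId);
        }
        return "tenant_" + tenantId;
    }

    /** Appends the currentSchema parameter to a base URL that has no query string yet. */
    static String jdbcUrl(String baseUrl, String tenantId) {
        return baseUrl + "?currentSchema=" + schemaName(tenantId);
    }
}
```

With `CAMELEER_TENANT_ID=acme`, `TenantSchema.jdbcUrl("jdbc:postgresql://pg:5432/cameleer", TenantSchema.tenantId(System.getenv()))` produces `jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_acme`, which is the value step 1 expects. Rejecting unexpected characters up front is one way to keep a malformed tenant ID from silently pointing an instance at the wrong schema.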