# Multitenancy Architecture Design

**Date:** 2026-04-04

**Status:** Draft

## Context

Cameleer Server is being integrated into a SaaS platform (cameleer-saas). The server must support multiple tenants sharing PostgreSQL and ClickHouse while guaranteeing strict data isolation. Each tenant gets their own cameleer-server instance. Environments (dev/staging/prod) are a first-class concept within each tenant.

## Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Tenant model | 1 customer = 1 tenant | SaaS customer isolation |
| Instance model | 1 tenant = 1 server instance | In-memory state (registry, catalog, SSE) is tenant-scoped |
| Environments | First-class, per-agent property | Agents belong to exactly 1 environment |
| PG isolation | Schema-per-tenant | No query changes needed; Flyway runs per-schema; JDBC `currentSchema` param |
| CH isolation | Shared DB, `tenant_id` column + partition key | Already partially in place; tenant in partition key enables pruning + TTL |
| Agent auth | Per-tenant bootstrap token | SaaS shell provisions tokens; JWT includes `tenant_id` |
| User scope | Single tenant per user | Logto organizations handle user↔tenant mapping |
| Migration | Fresh install | No backward-compatibility migration needed |

## Data Hierarchy

```
Tenant (customer org)
 └─ Environment (dev, staging, prod)
     └─ Application (order-service, payment-gateway)
         └─ Agent Instance (pod-1, pod-2)
```

## Architecture

```
Tenant "Acme" ──► cameleer-server (TENANT_ID=acme)
                  ├─ PG schema: tenant_acme
                  ├─ CH writes: tenant_id='acme'
                  ├─ Agents: env=dev, env=prod
                  └─ In-memory: registry, catalog, SSE

Tenant "Beta" ──► cameleer-server (TENANT_ID=beta)
                  ├─ PG schema: tenant_beta
                  ├─ CH writes: tenant_id='beta'
                  └─ ...

Shared: PostgreSQL (multiple schemas) + ClickHouse (single DB, tenant_id partitioning)
```

Each server instance reads `CAMELEER_TENANT_ID` from its environment (default: `"default"`). This value is used for all ClickHouse reads/writes. The PG schema is set via `?currentSchema=tenant_{id}` on the JDBC URL.

## 1. Agent Protocol Changes

### Registration Payload

Add `environmentId` field:

```json
{
  "instanceId": "order-svc-pod-1",
  "displayName": "order-svc-pod-1",
  "applicationId": "order-service",
  "environmentId": "dev",
  "version": "1.0-SNAPSHOT",
  "routeIds": ["route-orders"],
  "capabilities": { "tracing": true, "replay": false }
}
```

`environmentId` defaults to `"default"` if omitted (backward compatibility with older agents).

### Heartbeat Payload

Add `environmentId` (optional, for auto-heal after server restart):

```json
{
  "routeStates": { "route-orders": "Started" },
  "capabilities": { "tracing": true },
  "environmentId": "dev"
}
```

### JWT Claims

Agent JWTs issued by the server include:

- `tenant`: tenant ID (from server config)
- `env`: environment ID (from registration)
- `group`: application ID (existing)

The SaaS shell uses `tenant` + `env` claims to route agent traffic to the correct server instance.

## 2. Server Configuration

New environment variables:

| Variable | Default | Purpose |
|----------|---------|---------|
| `CAMELEER_TENANT_ID` | `default` | Tenant identifier for all CH data operations |

PG connection includes schema:

```yaml
spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
```

Flyway runs against the configured schema automatically.

## 3. ClickHouse Schema Changes

### Column Ordering Principle

All tables follow the ordering: **tenant → time → environment → application → agent/route → specifics**

This matches query patterns (most-filtered-first) and gives optimal sparse index data skipping.

### Partitioning

All tables: `PARTITION BY (tenant_id, toYYYYMM(timestamp))` (or `toYYYYMM(bucket)` for stats tables).

Benefits:

- Partition pruning by tenant (never scans other tenants' data)
- Partition pruning by month (time-range queries)
- Per-tenant TTL/retention (drop partitions)

### Raw Tables

#### `executions`

```sql
CREATE TABLE executions (
    tenant_id String DEFAULT 'default',
    start_time DateTime64(3),
    environment String DEFAULT 'default',
    application_id String,
    instance_id String,
    -- ... existing columns ...
) ENGINE = ReplacingMergeTree()
PARTITION BY (tenant_id, toYYYYMM(start_time))
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id)
```

#### `processor_executions`

```sql
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id, seq)
PARTITION BY (tenant_id, toYYYYMM(start_time))
```

#### `logs`

```sql
ORDER BY (tenant_id, timestamp, environment, application, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

#### `agent_metrics`

```sql
ORDER BY (tenant_id, collected_at, environment, instance_id, metric_name)
PARTITION BY (tenant_id, toYYYYMM(collected_at))
```

#### `route_diagrams`

```sql
ORDER BY (tenant_id, created_at, environment, route_id, instance_id)
PARTITION BY (tenant_id, toYYYYMM(created_at))
```

#### `agent_events`

```sql
ORDER BY (tenant_id, timestamp, environment, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

#### `usage_events` (new column)

```sql
-- Add tenant_id (currently missing)
ORDER BY (tenant_id, timestamp, environment, username, normalized)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

### Materialized View Targets (stats_1m_*)

All follow: `ORDER BY (tenant_id, bucket, environment, ...)`, `PARTITION BY (tenant_id, toYYYYMM(bucket))`

Example for `stats_1m_route`:

```sql
ORDER BY (tenant_id, bucket, environment, application_id, route_id)
PARTITION BY (tenant_id, toYYYYMM(bucket))
```

### MV Source Queries

All materialized view SELECT statements include `environment` in GROUP BY:

```sql
SELECT
    tenant_id,
    toStartOfMinute(start_time) AS bucket,
    environment,
    application_id,
    route_id,
    countState() AS total_count,
    ...
FROM executions
GROUP BY tenant_id, bucket, environment, application_id, route_id
```

## 4. Java Code Changes

### Configuration

New config class:

```java
@ConfigurationProperties(prefix = "cameleer.tenant")
public class TenantProperties {
    // Falls back to "default" when CAMELEER_TENANT_ID is not set
    private String id = "default";

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}
```

Read from `CAMELEER_TENANT_ID` env var (Spring Boot relaxed binding: `cameleer.tenant.id`).

### AgentInfo Record

Add `environmentId` field:

```java
public record AgentInfo(
    String instanceId,
    String displayName,
    String applicationId,
    String environmentId, // NEW
    String version,
    List routeIds,
    Map capabilities,
    AgentState state,
    Instant registeredAt,
    Instant lastHeartbeat,
    Instant staleTransitionTime
) { ... }
```

### ClickHouse Stores

All stores receive `TenantProperties` via constructor injection and use `tenantProperties.getId()` instead of hardcoded `"default"`:

**Pattern (applies to all stores):**

```java
// Before:
private static final String TENANT = "default";

// After:
private final String tenantId;

public ClickHouseStatsStore(JdbcTemplate jdbc, TenantProperties tenantProps) {
    this.jdbc = jdbc;
    this.tenantId = tenantProps.getId();
}
```

**Files to update:**

- `ClickHouseExecutionStore`: writes and reads
- `ClickHouseLogStore`: writes and reads
- `ClickHouseMetricsStore`: add tenant_id to INSERT
- `ClickHouseMetricsQueryStore`: add tenant_id filter to reads
- `ClickHouseStatsStore`: replace `TENANT` constant
- `ClickHouseDiagramStore`: replace `TENANT` constant
- `ClickHouseSearchIndex`: replace hardcoded `'default'`
- `ClickHouseAgentEventRepository`: replace `TENANT` constant
- `ClickHouseUsageTracker`: add tenant_id to writes and reads

### Environment in Write Path

The `ChunkAccumulator` extracts `environmentId` from the agent registry and includes it in `MergedExecution` and `ProcessorBatch`:

```java
// ChunkAccumulator.toMergedExecution():
AgentInfo agent = registryService.findById(instanceId);
// Fall back to "default" when the agent is not (yet) in the registry
String environment = agent != null ? agent.environmentId() : "default";
// include environment in MergedExecution
```

### Registration Controller

Pass `environmentId` from registration payload to `AgentRegistryService.register()`. Default to `"default"` if absent.

### Heartbeat Controller

On auto-heal, use `environmentId` from heartbeat payload (if present).

## 5. PostgreSQL: Schema-per-Tenant

No table schema changes. Isolation via JDBC `currentSchema`:

```yaml
spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
```

Flyway creates tables in the tenant's schema on first startup. Each server instance manages its own schema independently.

The schema itself must exist before the tables can be created. Either:

- The SaaS shell creates the PG schema before starting a tenant's server instance
- Or the server creates it on startup via Flyway's `CREATE SCHEMA IF NOT EXISTS`

## 6. UI Changes

### Environment Filter

Add an environment filter dropdown to the sidebar header (next to the time range picker). Persisted in URL query params.

All data queries (executions, stats, logs, catalog) include `environment` filter when set. "All environments" is the default.

### Catalog

The route catalog groups by environment → application → route. The sidebar tree becomes:

```
dev
 └─ order-service
     ├─ route-orders (42)
     └─ route-cbr (18)
prod
 └─ order-service
     ├─ route-orders (1,204)
     └─ route-cbr (890)
```

## 7. What the SaaS Shell Must Do

The cameleer-server does NOT manage tenants. The SaaS shell (cameleer-saas) is responsible for:

1. **Provisioning**: Create PG schema `tenant_{id}`, generate per-tenant bootstrap token, start cameleer-server container with `CAMELEER_TENANT_ID={id}` and PG URL pointing to the schema
2. **Routing**: Route agent and UI traffic to the correct server instance (by tenant)
3. **Lifecycle**: Start/stop/upgrade tenant server instances
4. **Auth**: Issue JWTs with tenant claims (via Logto), configure ForwardAuth

## 8. Scope Summary

| Area | Change | Complexity |
|------|--------|------------|
| Agent protocol (cameleer-common) | Add `environmentId` to registration + heartbeat | Low |
| Server config | `TenantProperties` bean, PG schema URL | Low |
| ClickHouse schema | Add `environment` column, update ORDER BY/PARTITION BY | Medium |
| ClickHouse stores (9 files) | Replace hardcoded `"default"` with injected tenant ID, add environment | Medium |
| AgentInfo + registry | Add `environmentId` field | Low |
| ChunkAccumulator + write pipeline | Include environment in data writes | Low |
| Controllers | Pass environment from registration/heartbeat | Low |
| UI | Environment filter dropdown, catalog grouping | Medium |
| PostgreSQL | No table changes (schema-per-tenant via JDBC URL) | None |

## Verification

1. Start server with `CAMELEER_TENANT_ID=acme` and PG `currentSchema=tenant_acme`
2. Register agent with `environmentId=dev`
3. Verify ClickHouse writes contain `tenant_id='acme'` and `environment='dev'`
4. Start second server with `CAMELEER_TENANT_ID=beta`
5. Verify data from tenant "beta" is not visible to tenant "acme" queries
6. Verify UI environment filter shows only selected environment's data
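
The isolation rules exercised by the verification steps can be sketched in plain Java, without a database. This is only an illustration under stated assumptions: `TenantIsolationSketch`, `Row`, and the tenant ID pattern are hypothetical names and rules that do not come from cameleer-server, and an in-memory list stands in for the shared ClickHouse tables. It shows how a tenant ID maps to the `tenant_{id}` schema on the JDBC URL, and how a tenant filter combined with an optional environment filter keeps one tenant's rows out of another tenant's results.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the isolation rules; not part of cameleer-server. */
final class TenantIsolationSketch {

    // Assumption: tenant IDs are limited to lowercase letters, digits and underscores
    // so they can be embedded safely in a schema name.
    private static final Pattern TENANT_ID = Pattern.compile("[a-z][a-z0-9_]*");

    /** Simplified stand-in for one row in a shared ClickHouse table. */
    record Row(String tenantId, String environment, String routeId) {}

    /** Maps a tenant ID to its PG schema name, rejecting IDs outside the assumed pattern. */
    static String schemaFor(String tenantId) {
        if (tenantId == null || !TENANT_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("Invalid tenant ID: " + tenantId);
        }
        return "tenant_" + tenantId;
    }

    /** Builds the JDBC URL in the shape used by the datasource configuration. */
    static String jdbcUrl(String tenantId) {
        return "jdbc:postgresql://pg:5432/cameleer?currentSchema=" + schemaFor(tenantId);
    }

    /**
     * Mimics "WHERE tenant_id = ? [AND environment = ?]".
     * A null environment corresponds to the "All environments" default.
     */
    static List<Row> visibleRows(List<Row> all, String tenantId, String environment) {
        return all.stream()
                .filter(r -> r.tenantId().equals(tenantId))
                .filter(r -> environment == null || r.environment().equals(environment))
                .toList();
    }

    public static void main(String[] args) {
        List<Row> shared = List.of(
                new Row("acme", "dev", "route-orders"),
                new Row("acme", "prod", "route-orders"),
                new Row("beta", "dev", "route-cbr"));

        // prints jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_acme
        System.out.println(jdbcUrl("acme"));
        // prints 2: both acme rows, the beta row is filtered out
        System.out.println(visibleRows(shared, "acme", null).size());
        // prints 1: only the acme row in the dev environment
        System.out.println(visibleRows(shared, "acme", "dev").size());
    }
}
```

In the design above the tenant filter value comes from `CAMELEER_TENANT_ID` rather than from a request parameter, so a server instance can only ever query its own tenant. The sketch passes it as an argument purely to make the check easy to run for two tenants side by side.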