cameleer-server/docs/superpowers/specs/2026-04-04-multitenancy-design.md
Multitenancy Architecture Design

Date: 2026-04-04
Status: Draft

Context

Cameleer Server is being integrated into a SaaS platform (cameleer-saas). The deployment must support multiple tenants that share PostgreSQL and ClickHouse while guaranteeing strict data isolation. Each tenant gets its own cameleer-server instance. Environments (dev/staging/prod) are a first-class concept within each tenant.

Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Tenant model | 1 customer = 1 tenant | SaaS customer isolation |
| Instance model | 1 tenant = 1 server instance | In-memory state (registry, catalog, SSE) is tenant-scoped |
| Environments | First-class, per-agent property | Agents belong to exactly 1 environment |
| PG isolation | Schema-per-tenant | No query changes needed; Flyway runs per-schema; JDBC currentSchema param |
| CH isolation | Shared DB, tenant_id column + partition key | Already partially in place; tenant in partition key enables pruning + TTL |
| Agent auth | Per-tenant bootstrap token | SaaS shell provisions tokens; JWT includes tenant_id |
| User scope | Single tenant per user | Logto organizations handle user↔tenant mapping |
| Migration | Fresh install | No backward-compatibility migration needed |

Data Hierarchy

Tenant (customer org)
  └─ Environment (dev, staging, prod)
       └─ Application (order-service, payment-gateway)
            └─ Agent Instance (pod-1, pod-2)

Architecture

Tenant "Acme" ──► cameleer-server (TENANT_ID=acme)
                    ├─ PG schema: tenant_acme
                    ├─ CH writes: tenant_id='acme'
                    ├─ Agents: env=dev, env=prod
                    └─ In-memory: registry, catalog, SSE

Tenant "Beta" ──► cameleer-server (TENANT_ID=beta)
                    ├─ PG schema: tenant_beta
                    ├─ CH writes: tenant_id='beta'
                    └─ ...

Shared: PostgreSQL (multiple schemas) + ClickHouse (single DB, tenant_id partitioning)

Each server instance reads CAMELEER_TENANT_ID from its environment (default: "default") and uses that value as the tenant_id for all ClickHouse reads and writes. The PG schema is selected via ?currentSchema=tenant_{id} on the JDBC URL.
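The mapping from tenant ID to PG schema and JDBC URL can be sketched as below. This is only an illustration: the `TenantSchema` helper and its validation pattern are hypothetical and not part of the design, and the restriction of tenant IDs to lowercase letters, digits and underscores is an assumption made so the ID is safe to embed in a schema name.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: derives the per-tenant PG schema name and JDBC URL. */
final class TenantSchema {
    // Assumption: tenant IDs are lowercase identifiers; anything else is rejected.
    private static final Pattern SAFE_ID = Pattern.compile("[a-z][a-z0-9_]{0,50}");

    private TenantSchema() {}

    /** Returns "tenant_{id}" after checking that the ID is a safe identifier. */
    static String schemaName(String tenantId) {
        if (tenantId == null || !SAFE_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("Unsafe tenant id: " + tenantId);
        }
        return "tenant_" + tenantId;
    }

    /** Builds the JDBC URL with the currentSchema parameter set to the tenant schema. */
    static String jdbcUrl(String host, int port, String database, String tenantId) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database
                + "?currentSchema=" + schemaName(tenantId);
    }
}
```

For example, `TenantSchema.jdbcUrl("pg", 5432, "cameleer", "acme")` yields a URL ending in `?currentSchema=tenant_acme`.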

1. Agent Protocol Changes

Registration Payload

Add environmentId field:

{
  "instanceId": "order-svc-pod-1",
  "displayName": "order-svc-pod-1",
  "applicationId": "order-service",
  "environmentId": "dev",
  "version": "1.0-SNAPSHOT",
  "routeIds": ["route-orders"],
  "capabilities": { "tracing": true, "replay": false }
}

environmentId defaults to "default" if omitted (backward compatibility with older agents).
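One way to apply this default on the server side is to normalize the field when the payload is deserialized. The sketch below uses a hypothetical `RegistrationRequest` record that mirrors the JSON above; treating a blank value like a missing one is an assumption, not something the design specifies.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical DTO mirroring the registration payload. */
record RegistrationRequest(
        String instanceId,
        String displayName,
        String applicationId,
        String environmentId,
        String version,
        List<String> routeIds,
        Map<String, Object> capabilities) {

    static final String DEFAULT_ENVIRONMENT = "default";

    // Compact constructor: older agents omit environmentId, so fall back to "default".
    // Assumption: a blank value is treated the same as a missing one.
    RegistrationRequest {
        if (environmentId == null || environmentId.isBlank()) {
            environmentId = DEFAULT_ENVIRONMENT;
        }
    }
}
```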

Heartbeat Payload

Add environmentId (optional, for auto-heal after server restart):

{
  "routeStates": { "route-orders": "Started" },
  "capabilities": { "tracing": true },
  "environmentId": "dev"
}

JWT Claims

Agent JWTs issued by the server include:

  • tenant — tenant ID (from server config)
  • env — environment ID (from registration)
  • group — application ID (existing)

The SaaS shell uses tenant + env claims to route agent traffic to the correct server instance.
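As an illustration of the claim set, the sketch below assembles the three claims into a map that a JWT library could then sign. The `AgentClaims` class is hypothetical, and signing, expiry and subject handling are deliberately left out because the design does not describe them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds the custom claims carried by an agent JWT. */
final class AgentClaims {
    private AgentClaims() {}

    static Map<String, Object> build(String tenantId, String environmentId, String applicationId) {
        Map<String, Object> claims = new LinkedHashMap<>();
        claims.put("tenant", tenantId);      // from server config (CAMELEER_TENANT_ID)
        claims.put("env", environmentId);    // from the registration payload
        claims.put("group", applicationId);  // existing claim: application ID
        return claims;
    }
}
```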

2. Server Configuration

New environment variables:

| Variable | Default | Purpose |
| --- | --- | --- |
| CAMELEER_TENANT_ID | default | Tenant identifier for all CH data operations |

PG connection includes schema:

spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}

Flyway runs against the configured schema automatically.

3. ClickHouse Schema Changes

Column Ordering Principle

All tables follow the ordering: tenant → time → environment → application → agent/route → specifics

This matches query patterns (most-filtered-first) and gives optimal sparse index data skipping.

Partitioning

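All tables: PARTITION BY (tenant_id, toYYYYMM(<time column>)), where the time column is the table's own timestamp (start_time, timestamp, collected_at or created_at for raw tables, bucket for stats tables).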

Benefits:

  • Partition pruning by tenant (never scans other tenants' data)
  • Partition pruning by month (time-range queries)
  • Per-tenant TTL/retention (drop partitions)
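Because the partition key is the tuple (tenant_id, month), retention for one tenant can be enforced by dropping that tenant's partition for a given month. The sketch below only builds the statement text; `TenantRetention` is a hypothetical helper, and the identifier check is an assumption added so that the table name and tenant ID can be concatenated safely.

```java
import java.time.YearMonth;
import java.util.regex.Pattern;

/** Hypothetical helper: builds a per-tenant DROP PARTITION statement for ClickHouse. */
final class TenantRetention {
    // Assumption: table names and tenant IDs are plain identifiers.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9_]+");

    private TenantRetention() {}

    static String dropPartitionSql(String table, String tenantId, YearMonth month) {
        if (!SAFE.matcher(table).matches() || !SAFE.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("Unsafe identifier");
        }
        // toYYYYMM produces a number such as 202601 for January 2026
        int yyyymm = month.getYear() * 100 + month.getMonthValue();
        return "ALTER TABLE " + table + " DROP PARTITION ('" + tenantId + "', " + yyyymm + ")";
    }
}
```

A retention job would run this once per table and expired month, so only the named tenant's partition is dropped.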

Raw Tables

executions

CREATE TABLE executions (
    tenant_id         String   DEFAULT 'default',
    start_time        DateTime64(3),
    environment       String   DEFAULT 'default',
    application_id    String,
    instance_id       String,
    -- ... existing columns ...
) ENGINE = ReplacingMergeTree()
PARTITION BY (tenant_id, toYYYYMM(start_time))
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id)

processor_executions

ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id, seq)
PARTITION BY (tenant_id, toYYYYMM(start_time))

logs

ORDER BY (tenant_id, timestamp, environment, application, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))

agent_metrics

ORDER BY (tenant_id, collected_at, environment, instance_id, metric_name)
PARTITION BY (tenant_id, toYYYYMM(collected_at))

route_diagrams

ORDER BY (tenant_id, created_at, environment, route_id, instance_id)
PARTITION BY (tenant_id, toYYYYMM(created_at))

agent_events

ORDER BY (tenant_id, timestamp, environment, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))

usage_events (new column)

-- Add tenant_id (currently missing)
ORDER BY (tenant_id, timestamp, environment, username, normalized)
PARTITION BY (tenant_id, toYYYYMM(timestamp))

Materialized View Targets (stats_1m_*)

All follow: ORDER BY (tenant_id, bucket, environment, ...), PARTITION BY (tenant_id, toYYYYMM(bucket))

Example for stats_1m_route:

ORDER BY (tenant_id, bucket, environment, application_id, route_id)
PARTITION BY (tenant_id, toYYYYMM(bucket))

MV Source Queries

All materialized view SELECT statements include environment in GROUP BY:

SELECT
    tenant_id,
    toStartOfMinute(start_time) AS bucket,
    environment,
    application_id,
    route_id,
    countState() AS total_count,
    ...
FROM executions
GROUP BY tenant_id, bucket, environment, application_id, route_id

4. Java Code Changes

Configuration

New config class:

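import org.springframework.boot.context.properties.ConfigurationProperties;

@ConfigurationProperties(prefix = "cameleer.tenant")
public class TenantProperties {
    // Bound from cameleer.tenant.id; falls back to "default" when unset
    private String id = "default";

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}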

Read from CAMELEER_TENANT_ID env var (Spring Boot relaxed binding: cameleer.tenant.id).

AgentInfo Record

Add environmentId field:

public record AgentInfo(
    String instanceId,
    String displayName,
    String applicationId,
    String environmentId,    // NEW
    String version,
    List<String> routeIds,
    Map<String, Object> capabilities,
    AgentState state,
    Instant registeredAt,
    Instant lastHeartbeat,
    Instant staleTransitionTime
) { ... }

ClickHouse Stores

All stores receive TenantProperties via constructor injection and use tenantProperties.getId() instead of a hardcoded "default":

Pattern (applies to all stores):

// Before: tenant hardcoded in each store
private static final String TENANT = "default";

// After: tenant resolved once from injected configuration
private final JdbcTemplate jdbc;
private final String tenantId;

public ClickHouseStatsStore(JdbcTemplate jdbc, TenantProperties tenantProps) {
    this.jdbc = jdbc;
    this.tenantId = tenantProps.getId();
}

Files to update:

  • ClickHouseExecutionStore — writes and reads
  • ClickHouseLogStore — writes and reads
  • ClickHouseMetricsStore — add tenant_id to INSERT
  • ClickHouseMetricsQueryStore — add tenant_id filter to reads
  • ClickHouseStatsStore — replace TENANT constant
  • ClickHouseDiagramStore — replace TENANT constant
  • ClickHouseSearchIndex — replace hardcoded 'default'
  • ClickHouseAgentEventRepository — replace TENANT constant
  • ClickHouseUsageTracker — add tenant_id to writes and reads

Environment in Write Path

The ChunkAccumulator looks up the agent in the registry, takes its environmentId, and includes it in MergedExecution and ProcessorBatch:

// ChunkAccumulator.toMergedExecution():
AgentInfo agent = registryService.findById(instanceId);
String environment = agent != null ? agent.environmentId() : "default";
// include environment in MergedExecution
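The lookup-with-fallback step can be isolated into a small, testable unit. The sketch below is self-contained and uses stand-in types: `RegisteredAgent` and `EnvironmentResolver` are hypothetical names, and a plain map replaces the real registry service.

```java
import java.util.Map;

/** Hypothetical stand-in for the registry entry; only the fields needed here. */
record RegisteredAgent(String instanceId, String applicationId, String environmentId) {}

/** Hypothetical helper: resolves the environment to stamp on written data. */
final class EnvironmentResolver {
    static final String DEFAULT_ENVIRONMENT = "default";

    private final Map<String, RegisteredAgent> registry;

    EnvironmentResolver(Map<String, RegisteredAgent> registry) {
        this.registry = registry;
    }

    /** Returns the agent's environment, or "default" if the agent is unknown. */
    String environmentFor(String instanceId) {
        RegisteredAgent agent = registry.get(instanceId);
        if (agent == null || agent.environmentId() == null) {
            return DEFAULT_ENVIRONMENT;
        }
        return agent.environmentId();
    }
}
```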

Registration Controller

Pass environmentId from registration payload to AgentRegistryService.register(). Default to "default" if absent.

Heartbeat Controller

On auto-heal, use environmentId from heartbeat payload (if present).
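A minimal sketch of that auto-heal path follows, assuming that a heartbeat from an unknown instance re-creates its registry entry. The types are hypothetical, the application ID is assumed to come from the JWT group claim, and the fallback to "default" when the heartbeat carries no environmentId is an assumption modelled on the registration default.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry entry re-created from a heartbeat. */
record HealedAgent(String instanceId, String applicationId, String environmentId) {}

/** Hypothetical sketch of auto-heal: re-register an unknown agent on heartbeat. */
final class HeartbeatAutoHeal {
    private final Map<String, HealedAgent> registry = new ConcurrentHashMap<>();

    /**
     * Returns the existing entry, or creates one if the registry lost it
     * (for example after a server restart). environmentIdFromHeartbeat may be null.
     */
    HealedAgent onHeartbeat(String instanceId, String applicationId, String environmentIdFromHeartbeat) {
        return registry.computeIfAbsent(instanceId, id -> new HealedAgent(
                id,
                applicationId,
                environmentIdFromHeartbeat != null ? environmentIdFromHeartbeat : "default"));
    }
}
```

An entry that already exists is returned unchanged, so a later heartbeat without environmentId does not overwrite a known environment.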

5. PostgreSQL — Schema-per-Tenant

No table schema changes. Isolation via JDBC currentSchema:

spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}

Flyway creates tables in the tenant's schema on first startup. Each server instance manages its own schema independently.

The tenant schema must exist before migrations run. Two options:

  • The SaaS shell creates the PG schema before starting a tenant's server instance
  • Or the server creates it on startup via Flyway's CREATE SCHEMA IF NOT EXISTS

6. UI Changes

Environment Filter

Add an environment filter dropdown to the sidebar header (next to the time range picker). Persisted in URL query params.

All data queries (executions, stats, logs, catalog) include environment filter when set. "All environments" is the default.
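On the server side, this translates to a query that always filters by tenant and adds the environment predicate only when a filter is selected. The sketch below builds such a parameterized query; `ScopedQuery` and `ExecutionQueries` are hypothetical names, and the count query is just an example statement.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value type: SQL text plus its positional parameters. */
record ScopedQuery(String sql, List<Object> params) {}

/** Hypothetical helper: tenant filter is mandatory, environment filter is optional. */
final class ExecutionQueries {
    private ExecutionQueries() {}

    static ScopedQuery countExecutions(String tenantId, String environment) {
        StringBuilder sql = new StringBuilder("SELECT count() FROM executions WHERE tenant_id = ?");
        List<Object> params = new ArrayList<>();
        params.add(tenantId);
        // null or blank means "All environments": no extra predicate
        if (environment != null && !environment.isBlank()) {
            sql.append(" AND environment = ?");
            params.add(environment);
        }
        return new ScopedQuery(sql.toString(), List.copyOf(params));
    }
}
```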

Catalog

The route catalog groups by environment → application → route. The sidebar tree becomes:

dev
  └─ order-service
       ├─ route-orders (42)
       └─ route-cbr (18)
prod
  └─ order-service
       ├─ route-orders (1,204)
       └─ route-cbr (890)
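The tree above can be produced by grouping flat catalog rows twice, first by environment and then by application. The sketch below assumes a hypothetical `RouteEntry` row type and uses sorted maps so the sidebar order is stable; the real catalog model may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical flat catalog row. */
record RouteEntry(String environment, String application, String routeId, long count) {}

/** Hypothetical helper: groups routes as environment -> application -> routes. */
final class CatalogTree {
    private CatalogTree() {}

    static Map<String, Map<String, List<RouteEntry>>> group(List<RouteEntry> routes) {
        Map<String, Map<String, List<RouteEntry>>> tree = new TreeMap<>();
        for (RouteEntry route : routes) {
            tree.computeIfAbsent(route.environment(), env -> new TreeMap<>())
                .computeIfAbsent(route.application(), app -> new ArrayList<>())
                .add(route);
        }
        return tree;
    }
}
```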

7. What the SaaS Shell Must Do

The cameleer-server does NOT manage tenants. The SaaS shell (cameleer-saas) is responsible for:

  1. Provisioning: Create PG schema tenant_{id}, generate per-tenant bootstrap token, start cameleer-server container with CAMELEER_TENANT_ID={id} and PG URL pointing to the schema
  2. Routing: Route agent and UI traffic to the correct server instance (by tenant)
  3. Lifecycle: Start/stop/upgrade tenant server instances
  4. Auth: Issue JWTs with tenant claims (via Logto), configure ForwardAuth
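The provisioning step can be sketched as a pure function that computes everything the shell needs before it starts a container. All names here are hypothetical; passing the PG URL through SPRING_DATASOURCE_URL assumes the shell relies on Spring Boot's environment-variable binding for spring.datasource.url, and the token format (32 random bytes, URL-safe Base64) is an illustrative choice rather than a requirement of the design.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical result of planning a tenant: DDL, token, and container environment. */
record ProvisioningPlan(String createSchemaSql, String bootstrapToken, Map<String, String> containerEnv) {}

/** Hypothetical sketch of the shell-side provisioning step. */
final class TenantProvisioner {
    private static final SecureRandom RANDOM = new SecureRandom();

    private TenantProvisioner() {}

    static ProvisioningPlan plan(String tenantId, String pgHostPort, String database) {
        // Assumption: tenant IDs are lowercase identifiers, safe to embed in DDL
        if (tenantId == null || !tenantId.matches("[a-z][a-z0-9_]*")) {
            throw new IllegalArgumentException("Unsafe tenant id: " + tenantId);
        }
        String schema = "tenant_" + tenantId;

        // Per-tenant bootstrap token: 32 random bytes, URL-safe Base64 without padding
        byte[] raw = new byte[32];
        RANDOM.nextBytes(raw);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(raw);

        Map<String, String> env = new LinkedHashMap<>();
        env.put("CAMELEER_TENANT_ID", tenantId);
        env.put("SPRING_DATASOURCE_URL",
                "jdbc:postgresql://" + pgHostPort + "/" + database + "?currentSchema=" + schema);

        return new ProvisioningPlan(
                "CREATE SCHEMA IF NOT EXISTS " + schema, token, Collections.unmodifiableMap(env));
    }
}
```

The shell would execute the DDL against PostgreSQL, store the token, and pass the environment map to the cameleer-server container.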

8. Scope Summary

| Area | Change | Complexity |
| --- | --- | --- |
| Agent protocol (cameleer-common) | Add environmentId to registration + heartbeat | Low |
| Server config | TenantProperties bean, PG schema URL | Low |
| ClickHouse schema | Add environment column, update ORDER BY/PARTITION BY | Medium |
| ClickHouse stores (9 files) | Replace hardcoded "default" with injected tenant ID, add environment | Medium |
| AgentInfo + registry | Add environmentId field | Low |
| ChunkAccumulator + write pipeline | Include environment in data writes | Low |
| Controllers | Pass environment from registration/heartbeat | Low |
| UI | Environment filter dropdown, catalog grouping | Medium |
| PostgreSQL | No table changes (schema-per-tenant via JDBC URL) | None |

Verification

  1. Start server with CAMELEER_TENANT_ID=acme and PG currentSchema=tenant_acme
  2. Register agent with environmentId=dev
  3. Verify ClickHouse writes contain tenant_id='acme' and environment='dev'
  4. Start second server with CAMELEER_TENANT_ID=beta
  5. Verify data from tenant "beta" is not visible to tenant "acme" queries
  6. Verify UI environment filter shows only selected environment's data