cameleer/cameleer-server

hsiegeln ee7226cf1c

docs: multitenancy architecture design spec

Covers tenant isolation (1 tenant = 1 server instance), environment
support (first-class agent property), ClickHouse partitioning
(tenant → time → environment → application), PostgreSQL schema-per-
tenant via JDBC currentSchema, and agent protocol changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-04 14:37:00 +02:00

Multitenancy Architecture Design

Date: 2026-04-04
Status: Draft

Context

Cameleer3 Server is being integrated into a SaaS platform (cameleer-saas). The server must support multiple tenants sharing PostgreSQL and ClickHouse while guaranteeing strict data isolation. Each tenant gets their own cameleer3-server instance. Environments (dev/staging/prod) are a first-class concept within each tenant.

Decisions

Decision	Choice	Rationale
Tenant model	1 customer = 1 tenant	SaaS customer isolation
Instance model	1 tenant = 1 server instance	In-memory state (registry, catalog, SSE) is tenant-scoped
Environments	First-class, per-agent property	Agents belong to exactly 1 environment
PG isolation	Schema-per-tenant	No query changes needed; Flyway runs per-schema; JDBC `currentSchema` param
CH isolation	Shared DB, `tenant_id` column + partition key	Already partially in place; tenant in partition key enables pruning + TTL
Agent auth	Per-tenant bootstrap token	SaaS shell provisions tokens; JWT includes `tenant_id`
User scope	Single tenant per user	Logto organizations handle user↔tenant mapping
Migration	Fresh install	No backward-compatibility migration needed

Data Hierarchy

Tenant (customer org)
  └─ Environment (dev, staging, prod)
       └─ Application (order-service, payment-gateway)
            └─ Agent Instance (pod-1, pod-2)

Architecture

Tenant "Acme" ──► cameleer3-server (TENANT_ID=acme)
                    ├─ PG schema: tenant_acme
                    ├─ CH writes: tenant_id='acme'
                    ├─ Agents: env=dev, env=prod
                    └─ In-memory: registry, catalog, SSE

Tenant "Beta" ──► cameleer3-server (TENANT_ID=beta)
                    ├─ PG schema: tenant_beta
                    ├─ CH writes: tenant_id='beta'
                    └─ ...

Shared: PostgreSQL (multiple schemas) + ClickHouse (single DB, tenant_id partitioning)

Each server instance reads CAMELEER_TENANT_ID from its environment (default: "default"). This value is used for all ClickHouse reads/writes. The PG schema is set via ?currentSchema=tenant_{id} on the JDBC URL.
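As a rough illustration of the paragraph above, the following plain-Java sketch resolves the tenant ID from an environment map and derives the JDBC URL from it. The class and method names are hypothetical and not part of the server; in the real application Spring resolves the placeholder in the datasource URL instead.

```java
import java.util.Map;

/** Illustrative sketch only; class and method names are hypothetical. */
final class TenantBootstrapSketch {

    /** Returns CAMELEER_TENANT_ID from the given environment, or "default" when it is not set. */
    static String resolveTenantId(Map<String, String> env) {
        return env.getOrDefault("CAMELEER_TENANT_ID", "default");
    }

    /** Appends ?currentSchema=tenant_{id} to a base JDBC URL that has no query string yet. */
    static String jdbcUrl(String baseUrl, String tenantId) {
        return baseUrl + "?currentSchema=tenant_" + tenantId;
    }

    public static void main(String[] args) {
        String tenantId = resolveTenantId(System.getenv());
        System.out.println(jdbcUrl("jdbc:postgresql://pg:5432/cameleer", tenantId));
    }
}
```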

1. Agent Protocol Changes

Registration Payload

Add environmentId field:

{
  "instanceId": "order-svc-pod-1",
  "displayName": "order-svc-pod-1",
  "applicationId": "order-service",
  "environmentId": "dev",
  "version": "1.0-SNAPSHOT",
  "routeIds": ["route-orders"],
  "capabilities": { "tracing": true, "replay": false }
}

environmentId defaults to "default" if omitted (backward compatibility with older agents).
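One way to apply this default is at the point where the payload is turned into an object. The record below is a hypothetical stand-in that covers only a subset of the payload fields; treating a blank value like a missing one is an assumption of this sketch, not something the spec states.

```java
import java.util.List;

/** Hypothetical request type mirroring a subset of the registration payload. */
record RegistrationSketch(String instanceId, String applicationId,
                          String environmentId, List<String> routeIds) {

    RegistrationSketch {
        // Older agents omit environmentId; fall back to "default".
        // Assumption: a blank value is treated the same as a missing one.
        if (environmentId == null || environmentId.isBlank()) {
            environmentId = "default";
        }
        routeIds = routeIds == null ? List.of() : List.copyOf(routeIds);
    }

    public static void main(String[] args) {
        RegistrationSketch legacy =
                new RegistrationSketch("order-svc-pod-1", "order-service", null, List.of("route-orders"));
        System.out.println(legacy.environmentId());
    }
}
```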

Heartbeat Payload

Add environmentId (optional, for auto-heal after server restart):

{
  "routeStates": { "route-orders": "Started" },
  "capabilities": { "tracing": true },
  "environmentId": "dev"
}

JWT Claims

Agent JWTs issued by the server include:

tenant — tenant ID (from server config)
env — environment ID (from registration)
group — application ID (existing)

The SaaS shell uses tenant + env claims to route agent traffic to the correct server instance.
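The claim set can be pictured as a small map assembled at token-issue time. This sketch only builds the claims; signing and the JWT library are deliberately left out, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; builds the claim set but does not sign a token. */
final class AgentClaimsSketch {

    static Map<String, Object> claims(String tenantId, String environmentId, String applicationId) {
        Map<String, Object> claims = new LinkedHashMap<>();
        claims.put("tenant", tenantId);      // from server config
        claims.put("env", environmentId);    // from registration
        claims.put("group", applicationId);  // existing claim
        return claims;
    }

    public static void main(String[] args) {
        System.out.println(claims("acme", "dev", "order-service"));
    }
}
```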

2. Server Configuration

New environment variables:

Variable	Default	Purpose
`CAMELEER_TENANT_ID`	`default`	Tenant identifier for all CH data operations

PG connection includes schema:

spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}

Flyway runs against the configured schema automatically.

3. ClickHouse Schema Changes

Column Ordering Principle

All tables follow the ordering: tenant → time → environment → application → agent/route → specifics

This matches the query patterns (most-filtered column first) and is intended to maximize sparse-index data skipping.

Partitioning

All tables: PARTITION BY (tenant_id, toYYYYMM(timestamp)) (or toYYYYMM(bucket) for stats tables).

Benefits:

Partition pruning by tenant (never scans other tenants' data)
Partition pruning by month (time-range queries)
Per-tenant TTL/retention (drop partitions)
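Per-tenant retention by dropping partitions could be driven by a helper like the one below. It is a sketch under two assumptions that the spec does not state: that ClickHouse's tuple-literal form of ALTER TABLE ... DROP PARTITION is used to address the (tenant_id, toYYYYMM(...)) key, and that table names and tenant IDs consist only of lowercase letters, digits, and underscores. The class and method names are hypothetical.

```java
import java.time.YearMonth;
import java.util.regex.Pattern;

/** Illustrative sketch only; builds the statement text but does not execute it. */
final class PartitionRetentionSketch {

    // Assumed identifier format; rejects anything that could break out of the statement.
    private static final Pattern SAFE_ID = Pattern.compile("[a-z0-9_]+");

    static String dropPartitionSql(String table, String tenantId, YearMonth month) {
        if (!SAFE_ID.matcher(table).matches() || !SAFE_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("unsafe table name or tenant id");
        }
        // toYYYYMM yields a number such as 202601 for January 2026.
        int yyyymm = month.getYear() * 100 + month.getMonthValue();
        return "ALTER TABLE " + table + " DROP PARTITION ('" + tenantId + "', " + yyyymm + ")";
    }

    public static void main(String[] args) {
        System.out.println(dropPartitionSql("executions", "acme", YearMonth.of(2026, 1)));
    }
}
```

Because the tenant is part of the partition key, such a statement touches only one tenant's month of data.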

Raw Tables

`executions`

CREATE TABLE executions (
    tenant_id         String   DEFAULT 'default',
    start_time        DateTime64(3),
    environment       String   DEFAULT 'default',
    application_id    String,
    instance_id       String,
    -- ... existing columns ...
) ENGINE = ReplacingMergeTree()
PARTITION BY (tenant_id, toYYYYMM(start_time))
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id)

`processor_executions`

ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id, seq)
PARTITION BY (tenant_id, toYYYYMM(start_time))

`logs`

ORDER BY (tenant_id, timestamp, environment, application, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))

`agent_metrics`

ORDER BY (tenant_id, collected_at, environment, instance_id, metric_name)
PARTITION BY (tenant_id, toYYYYMM(collected_at))

`route_diagrams`

ORDER BY (tenant_id, created_at, environment, route_id, instance_id)
PARTITION BY (tenant_id, toYYYYMM(created_at))

`agent_events`

ORDER BY (tenant_id, timestamp, environment, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))

`usage_events` (new column)

-- Add tenant_id (currently missing)
ORDER BY (tenant_id, timestamp, environment, username, normalized)
PARTITION BY (tenant_id, toYYYYMM(timestamp))

Materialized View Targets (stats_1m_*)

All follow: ORDER BY (tenant_id, bucket, environment, ...), PARTITION BY (tenant_id, toYYYYMM(bucket))

Example for stats_1m_route:

ORDER BY (tenant_id, bucket, environment, application_id, route_id)
PARTITION BY (tenant_id, toYYYYMM(bucket))

MV Source Queries

All materialized view SELECT statements include environment in GROUP BY:

SELECT
    tenant_id,
    toStartOfMinute(start_time) AS bucket,
    environment,
    application_id,
    route_id,
    countState() AS total_count,
    ...
FROM executions
GROUP BY tenant_id, bucket, environment, application_id, route_id

4. Java Code Changes

Configuration

New config class:

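import org.springframework.boot.context.properties.ConfigurationProperties;

@ConfigurationProperties(prefix = "cameleer.tenant")
public class TenantProperties {
    private String id = "default";

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}

The value is read from the CAMELEER_TENANT_ID environment variable, which Spring Boot's relaxed binding maps to the property cameleer.tenant.id.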
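The name mapping follows Spring Boot's documented rule for environment variables: replace dots with underscores, remove dashes, and convert to uppercase. The helper below only mimics that rule for illustration; it is hypothetical and not a Spring API.

```java
import java.util.Locale;

/** Illustrative sketch only; mimics the property-to-environment-variable naming rule. */
final class RelaxedBindingSketch {

    static String toEnvVarName(String property) {
        return property.replace('.', '_').replace("-", "").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toEnvVarName("cameleer.tenant.id"));
    }
}
```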

AgentInfo Record

Add environmentId field:

public record AgentInfo(
    String instanceId,
    String displayName,
    String applicationId,
    String environmentId,    // NEW
    String version,
    List<String> routeIds,
    Map<String, Object> capabilities,
    AgentState state,
    Instant registeredAt,
    Instant lastHeartbeat,
    Instant staleTransitionTime
) { ... }

ClickHouse Stores

All stores receive TenantProperties via constructor injection and use tenantProperties.getId() instead of the hardcoded "default".

Pattern (applies to all stores):

// Before: tenant hardcoded as a constant
private static final String TENANT = "default";

// After: tenant injected once at construction time
private final JdbcTemplate jdbc;
private final String tenantId;

public ClickHouseStatsStore(JdbcTemplate jdbc, TenantProperties tenantProps) {
    this.jdbc = jdbc;
    this.tenantId = tenantProps.getId();
}

Files to update:

ClickHouseExecutionStore — writes and reads
ClickHouseLogStore — writes and reads
ClickHouseMetricsStore — add tenant_id to INSERT
ClickHouseMetricsQueryStore — add tenant_id filter to reads
ClickHouseStatsStore — replace TENANT constant
ClickHouseDiagramStore — replace TENANT constant
ClickHouseSearchIndex — replace hardcoded 'default'
ClickHouseAgentEventRepository — replace TENANT constant
ClickHouseUsageTracker — add tenant_id to writes and reads
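For the read side, the sketch below shows one way a store might scope every query to its tenant while treating the environment as an optional extra filter. It uses no Spring types so that it stands alone; the class, the nested Query record, and the example query are hypothetical, and the resulting SQL and parameters would be handed to whatever JDBC helper the store already uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; builds SQL text and bind parameters without executing them. */
final class TenantScopedQuerySketch {

    /** SQL text plus positional bind parameters. */
    record Query(String sql, List<Object> params) {}

    private final String tenantId;

    TenantScopedQuerySketch(String tenantId) {
        this.tenantId = tenantId;
    }

    /** Counts executions for this tenant; a null environment means "all environments". */
    Query countExecutions(String environment) {
        StringBuilder sql = new StringBuilder("SELECT count() FROM executions WHERE tenant_id = ?");
        List<Object> params = new ArrayList<>();
        params.add(tenantId);
        if (environment != null) {
            sql.append(" AND environment = ?");
            params.add(environment);
        }
        return new Query(sql.toString(), List.copyOf(params));
    }

    public static void main(String[] args) {
        System.out.println(new TenantScopedQuerySketch("acme").countExecutions("dev"));
    }
}
```

Binding the tenant as a parameter rather than concatenating it keeps the filter uniform across stores and avoids quoting issues.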

Environment in Write Path

The ChunkAccumulator extracts environmentId from the agent registry and includes it in MergedExecution and ProcessorBatch:

// ChunkAccumulator.toMergedExecution():
AgentInfo agent = registryService.findById(instanceId);
String environment = agent != null ? agent.environmentId() : "default";
// include environment in MergedExecution

Registration Controller

Pass environmentId from registration payload to AgentRegistryService.register(). Default to "default" if absent.

Heartbeat Controller

On auto-heal, use environmentId from heartbeat payload (if present).
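A minimal sketch of the auto-heal behaviour, assuming that an agent already known to the registry keeps its registered environment and that a heartbeat without environmentId falls back to "default" (both are assumptions of this sketch). The class is hypothetical and tracks only the environment per instance.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; tracks just the environment per agent instance. */
final class HeartbeatAutoHealSketch {

    private final Map<String, String> environmentByInstance = new ConcurrentHashMap<>();

    /**
     * Processes a heartbeat and returns the environment recorded for the instance.
     * heartbeatEnvironment may be null when the agent did not send one.
     */
    String onHeartbeat(String instanceId, String heartbeatEnvironment) {
        // Only an unknown instance (for example after a server restart) is re-created.
        return environmentByInstance.computeIfAbsent(
                instanceId,
                id -> heartbeatEnvironment != null ? heartbeatEnvironment : "default");
    }

    public static void main(String[] args) {
        HeartbeatAutoHealSketch registry = new HeartbeatAutoHealSketch();
        System.out.println(registry.onHeartbeat("order-svc-pod-1", "dev"));
    }
}
```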

5. PostgreSQL — Schema-per-Tenant

No table schema changes. Isolation via JDBC currentSchema:

spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}

Flyway creates tables in the tenant's schema on first startup. Each server instance manages its own schema independently.

The tenant's schema must exist before the server can use it. There are two options:

The SaaS shell creates the PG schema before starting a tenant's server instance
Or the server creates it on startup via Flyway's CREATE SCHEMA IF NOT EXISTS
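Whichever side creates the schema, the schema name is derived from the tenant ID, so the ID should be checked before it ends up in DDL. The sketch below assumes tenant IDs consist only of lowercase letters, digits, and underscores (not stated in the spec); the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only; derives the schema name and DDL text from a tenant ID. */
final class TenantSchemaSketch {

    // Assumed tenant ID format; schema names cannot be bound as statement parameters.
    private static final Pattern SAFE_TENANT_ID = Pattern.compile("[a-z0-9_]+");

    static String schemaName(String tenantId) {
        if (tenantId == null || !SAFE_TENANT_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("unsafe tenant id: " + tenantId);
        }
        return "tenant_" + tenantId;
    }

    static String createSchemaSql(String tenantId) {
        return "CREATE SCHEMA IF NOT EXISTS " + schemaName(tenantId);
    }

    public static void main(String[] args) {
        System.out.println(createSchemaSql("acme"));
    }
}
```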

6. UI Changes

Environment Filter

Add an environment filter dropdown to the sidebar header (next to the time range picker). Persisted in URL query params.

All data queries (executions, stats, logs, catalog) include environment filter when set. "All environments" is the default.

Catalog

The route catalog groups by environment → application → route. The sidebar tree becomes:

dev
  └─ order-service
       ├─ route-orders (42)
       └─ route-cbr (18)
prod
  └─ order-service
       ├─ route-orders (1,204)
       └─ route-cbr (890)
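On the server side, the environment → application → route grouping could be produced from flat catalog rows as sketched below. The RouteCount record and class name are hypothetical, and sorting each level alphabetically is an assumption of this sketch rather than a stated requirement.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; groups flat rows into environment -> application -> route. */
final class CatalogTreeSketch {

    /** Hypothetical flat row: one route with its execution count. */
    record RouteCount(String environment, String application, String routeId, long executions) {}

    static Map<String, Map<String, Map<String, Long>>> group(List<RouteCount> rows) {
        Map<String, Map<String, Map<String, Long>>> tree = new TreeMap<>();
        for (RouteCount row : rows) {
            tree.computeIfAbsent(row.environment(), k -> new TreeMap<>())
                .computeIfAbsent(row.application(), k -> new TreeMap<>())
                // Rows for the same route (for example from several pods) are summed.
                .merge(row.routeId(), row.executions(), Long::sum);
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(group(List.of(
                new RouteCount("prod", "order-service", "route-orders", 1204),
                new RouteCount("dev", "order-service", "route-orders", 42),
                new RouteCount("dev", "order-service", "route-cbr", 18))));
    }
}
```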

7. What the SaaS Shell Must Do

The cameleer3-server does NOT manage tenants. The SaaS shell (cameleer-saas) is responsible for:

Provisioning: Create PG schema tenant_{id}, generate per-tenant bootstrap token, start cameleer3-server container with CAMELEER_TENANT_ID={id} and PG URL pointing to the schema
Routing: Route agent and UI traffic to the correct server instance (by tenant)
Lifecycle: Start/stop/upgrade tenant server instances
Auth: Issue JWTs with tenant claims (via Logto), configure ForwardAuth

8. Scope Summary

Area	Change	Complexity
Agent protocol (cameleer3-common)	Add `environmentId` to registration + heartbeat	Low
Server config	`TenantProperties` bean, PG schema URL	Low
ClickHouse schema	Add `environment` column, update ORDER BY/PARTITION BY	Medium
ClickHouse stores (9 files)	Replace hardcoded `"default"` with injected tenant ID, add environment	Medium
AgentInfo + registry	Add `environmentId` field	Low
ChunkAccumulator + write pipeline	Include environment in data writes	Low
Controllers	Pass environment from registration/heartbeat	Low
UI	Environment filter dropdown, catalog grouping	Medium
PostgreSQL	No table changes (schema-per-tenant via JDBC URL)	None

Verification

Start server with CAMELEER_TENANT_ID=acme and PG currentSchema=tenant_acme
Register agent with environmentId=dev
Verify ClickHouse writes contain tenant_id='acme' and environment='dev'
Start second server with CAMELEER_TENANT_ID=beta
Verify data from tenant "beta" is not visible to tenant "acme" queries
Verify UI environment filter shows only selected environment's data
