# Multitenancy Architecture Design

**Date:** 2026-04-04
**Status:** Draft

## Context

Cameleer Server is being integrated into a SaaS platform (cameleer-saas). Multiple tenants share one PostgreSQL and one ClickHouse deployment, and data isolation between them must be strict. Each tenant gets its own cameleer-server instance. Environments (dev/staging/prod) are a first-class concept within each tenant.

## Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Tenant model | 1 customer = 1 tenant | SaaS customer isolation |
| Instance model | 1 tenant = 1 server instance | In-memory state (registry, catalog, SSE) is tenant-scoped |
| Environments | First-class, per-agent property | Agents belong to exactly 1 environment |
| PG isolation | Schema-per-tenant | No query changes needed; Flyway runs per-schema; JDBC `currentSchema` param |
| CH isolation | Shared DB, `tenant_id` column + partition key | Already partially in place; tenant in partition key enables pruning + TTL |
| Agent auth | Per-tenant bootstrap token | SaaS shell provisions tokens; agent JWT includes a `tenant` claim |
| User scope | Single tenant per user | Logto organizations handle user↔tenant mapping |
| Migration | Fresh install | No backward-compatibility migration needed |

## Data Hierarchy

```
Tenant (customer org)
  └─ Environment (dev, staging, prod)
       └─ Application (order-service, payment-gateway)
            └─ Agent Instance (pod-1, pod-2)
```

## Architecture

```
Tenant "Acme" ──► cameleer-server (TENANT_ID=acme)
                    ├─ PG schema: tenant_acme
                    ├─ CH writes: tenant_id='acme'
                    ├─ Agents: env=dev, env=prod
                    └─ In-memory: registry, catalog, SSE

Tenant "Beta" ──► cameleer-server (TENANT_ID=beta)
                    ├─ PG schema: tenant_beta
                    ├─ CH writes: tenant_id='beta'
                    └─ ...

Shared: PostgreSQL (multiple schemas) + ClickHouse (single DB, tenant_id partitioning)
```

Each server instance reads `CAMELEER_TENANT_ID` from its environment (default: `"default"`). This value is used for all ClickHouse reads/writes. The PG schema is set via `?currentSchema=tenant_{id}` on the JDBC URL.
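
Because the tenant ID is interpolated into a schema name and a JDBC URL, it may be worth validating it once at startup. A minimal sketch, assuming tenant IDs are limited to lowercase letters, digits, and underscores (this design does not specify an ID format; `TenantSchema` and its methods are hypothetical names):

```java
import java.util.regex.Pattern;

/** Hypothetical helper: derives the PG schema name and JDBC URL for a tenant. */
final class TenantSchema {
    // Assumed ID format; the design does not define one.
    private static final Pattern VALID_ID = Pattern.compile("[a-z][a-z0-9_]{0,50}");

    private TenantSchema() {}

    /** Returns "tenant_{id}", rejecting IDs that are unsafe as an unquoted identifier. */
    static String schemaFor(String tenantId) {
        if (tenantId == null || !VALID_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("Invalid tenant ID: " + tenantId);
        }
        return "tenant_" + tenantId;
    }

    /** Appends currentSchema to a base JDBC URL that has no query string yet. */
    static String jdbcUrl(String baseUrl, String tenantId) {
        return baseUrl + "?currentSchema=" + schemaFor(tenantId);
    }
}
```

The length cap keeps `tenant_{id}` within PostgreSQL's 63-byte identifier limit.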

## 1. Agent Protocol Changes

### Registration Payload

Add `environmentId` field:

```json
{
  "instanceId": "order-svc-pod-1",
  "displayName": "order-svc-pod-1",
  "applicationId": "order-service",
  "environmentId": "dev",
  "version": "1.0-SNAPSHOT",
  "routeIds": ["route-orders"],
  "capabilities": { "tracing": true, "replay": false }
}
```

`environmentId` defaults to `"default"` if omitted (backward compatibility with older agents).
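
One way to apply the default is at deserialization time, so downstream code never sees a missing environment. A sketch, assuming a hypothetical `RegistrationRequest` record that mirrors the payload above; treating a blank value like a missing one is also an assumption:

```java
import java.util.List;
import java.util.Map;

/** Hypothetical DTO mirroring the registration payload. */
record RegistrationRequest(
        String instanceId,
        String displayName,
        String applicationId,
        String environmentId,
        String version,
        List<String> routeIds,
        Map<String, Object> capabilities) {

    static final String DEFAULT_ENVIRONMENT = "default";

    // Compact constructor: older agents omit environmentId, so null or blank becomes "default".
    RegistrationRequest {
        if (environmentId == null || environmentId.isBlank()) {
            environmentId = DEFAULT_ENVIRONMENT;
        }
    }
}
```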

### Heartbeat Payload

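Add `environmentId` (optional). After a server restart the in-memory registry is empty, so the server uses this field to re-register (auto-heal) the agent in the correct environment: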

```json
{
  "routeStates": { "route-orders": "Started" },
  "capabilities": { "tracing": true },
  "environmentId": "dev"
}
```

### JWT Claims

Agent JWTs issued by the server include:
- `tenant` — tenant ID (from server config)
- `env` — environment ID (from registration)
- `group` — application ID (existing)

The SaaS shell uses `tenant` + `env` claims to route agent traffic to the correct server instance.
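
The three custom claims can be assembled independently of the JWT library in use. A sketch with a hypothetical `AgentClaims` helper; signing and standard claims (`sub`, `exp`, and so on) are left to whatever library the server already uses:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that assembles the custom claims for an agent JWT. */
final class AgentClaims {
    private AgentClaims() {}

    static Map<String, Object> build(String tenantId, String environmentId, String applicationId) {
        Map<String, Object> claims = new LinkedHashMap<>();
        claims.put("tenant", tenantId);      // from server config
        claims.put("env", environmentId);    // from registration
        claims.put("group", applicationId);  // existing claim
        return claims;
    }
}
```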

## 2. Server Configuration

New environment variables:

| Variable | Default | Purpose |
|----------|---------|---------|
| `CAMELEER_TENANT_ID` | `default` | Tenant identifier for all CH data operations |

PG connection includes schema:
```yaml
spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
```

Flyway runs against the configured schema automatically.

## 3. ClickHouse Schema Changes

### Column Ordering Principle

All tables follow the ordering: **tenant → time → environment → application → agent/route → specifics**

This matches query patterns (most-filtered-first), so the sparse primary index can skip data for queries bounded by tenant and time range.

### Partitioning

All tables: `PARTITION BY (tenant_id, toYYYYMM(timestamp))` (or `toYYYYMM(bucket)` for stats tables).

Benefits:
- Partition pruning by tenant (queries that filter on `tenant_id` never scan other tenants' data)
- Partition pruning by month (time-range queries)
- Per-tenant TTL/retention (drop partitions)
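
Per-tenant retention then reduces to dropping `(tenant_id, month)` partitions. A sketch of how such a statement could be rendered, assuming a hypothetical `PartitionRetention` helper and the tuple form of ClickHouse's `ALTER TABLE ... DROP PARTITION`; the identifier checks are an assumption, added because the values are concatenated into SQL:

```java
import java.time.YearMonth;

/** Hypothetical helper that renders a per-tenant partition drop for ClickHouse. */
final class PartitionRetention {
    private PartitionRetention() {}

    /** Builds the DROP PARTITION statement for one (tenant_id, toYYYYMM) partition. */
    static String dropPartitionSql(String table, String tenantId, YearMonth month) {
        // Both values end up inside the statement text, so reject anything unexpected.
        if (!table.matches("[a-z_][a-z0-9_]*") || !tenantId.matches("[a-z0-9_]+")) {
            throw new IllegalArgumentException("Unsafe table or tenant ID");
        }
        // Same numeric value that toYYYYMM() produces, e.g. 202601 for January 2026.
        int yyyymm = month.getYear() * 100 + month.getMonthValue();
        return "ALTER TABLE " + table + " DROP PARTITION ('" + tenantId + "', " + yyyymm + ")";
    }
}
```

For example, `dropPartitionSql("executions", "acme", YearMonth.of(2026, 1))` yields `ALTER TABLE executions DROP PARTITION ('acme', 202601)`.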

### Raw Tables

#### `executions`

```sql
CREATE TABLE executions (
    tenant_id         String   DEFAULT 'default',
    start_time        DateTime64(3),
    environment       String   DEFAULT 'default',
    application_id    String,
    instance_id       String,
    -- ... existing columns ...
) ENGINE = ReplacingMergeTree()
PARTITION BY (tenant_id, toYYYYMM(start_time))
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id)
```

#### `processor_executions`

```sql
ORDER BY (tenant_id, start_time, environment, application_id, route_id, execution_id, seq)
PARTITION BY (tenant_id, toYYYYMM(start_time))
```

#### `logs`

```sql
ORDER BY (tenant_id, timestamp, environment, application, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

#### `agent_metrics`

```sql
ORDER BY (tenant_id, collected_at, environment, instance_id, metric_name)
PARTITION BY (tenant_id, toYYYYMM(collected_at))
```

#### `route_diagrams`

```sql
ORDER BY (tenant_id, created_at, environment, route_id, instance_id)
PARTITION BY (tenant_id, toYYYYMM(created_at))
```

#### `agent_events`

```sql
ORDER BY (tenant_id, timestamp, environment, instance_id)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

#### `usage_events` (new column)

```sql
-- Add tenant_id (currently missing)
ORDER BY (tenant_id, timestamp, environment, username, normalized)
PARTITION BY (tenant_id, toYYYYMM(timestamp))
```

### Materialized View Targets (stats_1m_*)

All follow: `ORDER BY (tenant_id, bucket, environment, ...)`, `PARTITION BY (tenant_id, toYYYYMM(bucket))`

Example for `stats_1m_route`:
```sql
ORDER BY (tenant_id, bucket, environment, application_id, route_id)
PARTITION BY (tenant_id, toYYYYMM(bucket))
```

### MV Source Queries

All materialized view SELECT statements include `environment` in GROUP BY:

```sql
SELECT
    tenant_id,
    toStartOfMinute(start_time) AS bucket,
    environment,
    application_id,
    route_id,
    countState() AS total_count,
    ...
FROM executions
GROUP BY tenant_id, bucket, environment, application_id, route_id
```
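
The effect of adding `environment` to the grouping key can be modeled in memory: executions in the same minute but different environments land in separate buckets. A sketch with hypothetical types (`RouteStatsModel`, `Execution`, `Key`); it models only the count, not ClickHouse aggregate states:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical in-memory model of the stats_1m_route grouping. */
final class RouteStatsModel {
    record Execution(String tenantId, Instant startTime, String environment,
                     String applicationId, String routeId) {}

    record Key(String tenantId, Instant bucket, String environment,
               String applicationId, String routeId) {}

    private RouteStatsModel() {}

    /** Counts executions per (tenant, minute bucket, environment, application, route). */
    static Map<Key, Long> countPerMinute(List<Execution> executions) {
        return executions.stream().collect(Collectors.groupingBy(
                e -> new Key(e.tenantId(),
                             e.startTime().truncatedTo(ChronoUnit.MINUTES), // toStartOfMinute
                             e.environment(), e.applicationId(), e.routeId()),
                Collectors.counting()));
    }
}
```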

## 4. Java Code Changes

### Configuration

New config class:

```java
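import org.springframework.boot.context.properties.ConfigurationProperties;

@ConfigurationProperties(prefix = "cameleer.tenant")
public class TenantProperties {
    // Falls back to "default" when CAMELEER_TENANT_ID is not set
    private String id = "default";

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}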
```

Read from `CAMELEER_TENANT_ID` env var (Spring Boot relaxed binding: `cameleer.tenant.id`).

### AgentInfo Record

Add `environmentId` field:

```java
public record AgentInfo(
    String instanceId,
    String displayName,
    String applicationId,
    String environmentId,    // NEW
    String version,
    List<String> routeIds,
    Map<String, Object> capabilities,
    AgentState state,
    Instant registeredAt,
    Instant lastHeartbeat,
    Instant staleTransitionTime
) { ... }
```

### ClickHouse Stores

All stores receive `TenantProperties` via constructor injection and use `tenantProperties.getId()` instead of hardcoded `"default"`:

**Pattern (applies to all stores):**
```java
// Before:
private static final String TENANT = "default";

// After:
private final String tenantId;

public ClickHouseStatsStore(JdbcTemplate jdbc, TenantProperties tenantProps) {
    this.jdbc = jdbc;
    this.tenantId = tenantProps.getId();
}
```

**Files to update:**
- `ClickHouseExecutionStore` — writes and reads
- `ClickHouseLogStore` — writes and reads
- `ClickHouseMetricsStore` — add tenant_id to INSERT
- `ClickHouseMetricsQueryStore` — add tenant_id filter to reads
- `ClickHouseStatsStore` — replace `TENANT` constant
- `ClickHouseDiagramStore` — replace `TENANT` constant
- `ClickHouseSearchIndex` — replace hardcoded `'default'`
- `ClickHouseAgentEventRepository` — replace `TENANT` constant
- `ClickHouseUsageTracker` — add tenant_id to writes and reads
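
On the read side, every query should carry the tenant filter first and the environment filter only when one is selected. A sketch of that shape, assuming a hypothetical `TenantScopedQuery` value object; the SQL is illustrative and not taken from the existing stores:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value object: SQL text plus positional parameters. */
record TenantScopedQuery(String sql, List<Object> params) {

    /** Counts executions for one tenant; a null environment means "all environments". */
    static TenantScopedQuery countExecutions(String tenantId, String environment,
                                             Instant from, Instant to) {
        StringBuilder sql = new StringBuilder(
                "SELECT count() FROM executions"
                + " WHERE tenant_id = ? AND start_time >= ? AND start_time < ?");
        List<Object> params = new ArrayList<>(List.<Object>of(tenantId, from, to));
        if (environment != null) {
            // Optional filter, appended after the tenant and time predicates.
            sql.append(" AND environment = ?");
            params.add(environment);
        }
        return new TenantScopedQuery(sql.toString(), List.copyOf(params));
    }
}
```

With Spring's `JdbcTemplate`, the result could be passed as `jdbc.queryForObject(q.sql(), Long.class, q.params().toArray())`; whether the ClickHouse driver accepts `Instant` parameters directly should be checked.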

### Environment in Write Path

The `ChunkAccumulator` extracts `environmentId` from the agent registry and includes it in `MergedExecution` and `ProcessorBatch`:

```java
// ChunkAccumulator.toMergedExecution():
AgentInfo agent = registryService.findById(instanceId);
String environment = agent != null ? agent.environmentId() : "default";
// include environment in MergedExecution
```

### Registration Controller

Pass `environmentId` from registration payload to `AgentRegistryService.register()`. Default to `"default"` if absent.

### Heartbeat Controller

On auto-heal, use `environmentId` from heartbeat payload (if present).
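
The auto-heal rule can be sketched with a simplified registry that tracks only each agent's environment. `EnvironmentRegistry` is a hypothetical stand-in for `AgentRegistryService`; two behaviors here are assumptions not stated in this design: a known agent keeps its registered environment, and a heartbeat without `environmentId` falls back to `"default"`:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified registry that only tracks each agent's environment. */
final class EnvironmentRegistry {
    static final String DEFAULT_ENVIRONMENT = "default";

    private final Map<String, String> environmentByInstance = new ConcurrentHashMap<>();

    void register(String instanceId, String environmentId) {
        environmentByInstance.put(instanceId,
                environmentId == null ? DEFAULT_ENVIRONMENT : environmentId);
    }

    /**
     * Handles a heartbeat and returns the agent's effective environment.
     * An unknown agent (for example after a server restart) is re-registered
     * using the environment carried by the heartbeat.
     */
    String heartbeat(String instanceId, String heartbeatEnvironmentId) {
        return environmentByInstance.computeIfAbsent(instanceId,
                id -> heartbeatEnvironmentId == null ? DEFAULT_ENVIRONMENT : heartbeatEnvironmentId);
    }
}
```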

## 5. PostgreSQL — Schema-per-Tenant

No table schema changes. Isolation via JDBC `currentSchema`:

```yaml
spring:
  datasource:
    url: jdbc:postgresql://pg:5432/cameleer?currentSchema=tenant_${CAMELEER_TENANT_ID:default}
```

Flyway creates tables in the tenant's schema on first startup. Each server instance manages its own schema independently.

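The schema must exist before Flyway can migrate it. Two options:
- The SaaS shell creates the PG schema before starting a tenant's server instance (section 7 assumes this option).
- The server creates it on startup, for example by listing the schema in Flyway's `schemas` setting so that Flyway creates it when missing (`createSchemas`, enabled by default).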

## 6. UI Changes

### Environment Filter

Add an environment filter dropdown to the sidebar header (next to the time range picker). Persisted in URL query params.

All data queries (executions, stats, logs, catalog) include `environment` filter when set. "All environments" is the default.

### Catalog

The route catalog groups by environment → application → route. The sidebar tree becomes:

```
dev
  └─ order-service
       ├─ route-orders (42)
       └─ route-cbr (18)
prod
  └─ order-service
       ├─ route-orders (1,204)
       └─ route-cbr (890)
```
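
The grouping behind this tree is a two-level map keyed by environment, then application. A sketch with hypothetical types (`CatalogTree`, `RouteEntry`); sorted maps are used so the sidebar order is stable, which is an assumption about the desired ordering:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that builds the environment → application → route tree. */
final class CatalogTree {
    record RouteEntry(String environment, String applicationId,
                      String routeId, long executionCount) {}

    private CatalogTree() {}

    static Map<String, Map<String, List<RouteEntry>>> group(List<RouteEntry> routes) {
        Map<String, Map<String, List<RouteEntry>>> tree = new TreeMap<>();
        for (RouteEntry route : routes) {
            tree.computeIfAbsent(route.environment(), env -> new TreeMap<>())
                .computeIfAbsent(route.applicationId(), app -> new ArrayList<>())
                .add(route);
        }
        return tree;
    }
}
```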

## 7. What the SaaS Shell Must Do

The cameleer-server does NOT manage tenants. The SaaS shell (cameleer-saas) is responsible for:

1. **Provisioning**: Create PG schema `tenant_{id}`, generate per-tenant bootstrap token, start cameleer-server container with `CAMELEER_TENANT_ID={id}` and PG URL pointing to the schema
2. **Routing**: Route agent and UI traffic to the correct server instance (by tenant)
3. **Lifecycle**: Start/stop/upgrade tenant server instances
4. **Auth**: Issue user JWTs with tenant claims (via Logto), configure ForwardAuth. Agent JWTs are still issued by the server (section 1).

## 8. Scope Summary

| Area | Change | Complexity |
|------|--------|------------|
| Agent protocol (cameleer-common) | Add `environmentId` to registration + heartbeat | Low |
| Server config | `TenantProperties` bean, PG schema URL | Low |
| ClickHouse schema | Add `environment` column, add `tenant_id` to `usage_events`, update ORDER BY/PARTITION BY | Medium |
| ClickHouse stores (9 files) | Replace hardcoded `"default"` with injected tenant ID, add environment | Medium |
| AgentInfo + registry | Add `environmentId` field | Low |
| ChunkAccumulator + write pipeline | Include environment in data writes | Low |
| Controllers | Pass environment from registration/heartbeat | Low |
| UI | Environment filter dropdown, catalog grouping | Medium |
| PostgreSQL | No table changes (schema-per-tenant via JDBC URL) | None |

## Verification

1. Start server with `CAMELEER_TENANT_ID=acme` and PG `currentSchema=tenant_acme`
2. Register agent with `environmentId=dev`
3. Verify ClickHouse writes contain `tenant_id='acme'` and `environment='dev'`
4. Start a second server with `CAMELEER_TENANT_ID=beta` and register an agent against it
5. Verify that queries on the "acme" server return no data written by tenant "beta", and vice versa
6. Verify UI environment filter shows only selected environment's data