docs(license): brainstorm spec for license enforcement design

Captures the agreed design for enforcing licensing on cameleer-server:
- Default tier with hard caps when no license is configured
- Arbitrary per-customer limits in signed Ed25519 license tokens
- Standalone cameleer-license-minter module (vendor-only)
- DB-persisted license with env/file override paths
- ABSENT/ACTIVE/GRACE/EXPIRED state machine; offline expiry only
- Removes the dead Feature enum scaffolding

Pending writing-plans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hsiegeln
2026-04-25 21:55:18 +02:00
parent f6b76b2d5e
commit 0e512a3c0c


@@ -0,0 +1,449 @@
# License Enforcement — Design
**Date:** 2026-04-25
**Status:** Approved (brainstorm); pending writing-plans
**Related:** cameleer-saas#7 (Epic: License & Feature Gating), cameleer-saas#42 (vendor minting), cameleer-saas#50 (customer license view)
## Problem
`cameleer-server` ships a license skeleton (`LicenseValidator`, `LicenseGate`, admin endpoint) but
nothing enforces anything. Open mode (no license configured) currently grants *all* features and
*no* limits — the opposite of what we want for a self-hosted distribution that needs to gate scale
behind a paid license.
We want:
1. A self-hosted server with **no license** to operate within a small, hard-coded "default tier"
that is enough to evaluate the product but not enough to run it in production.
2. Licenses to express **arbitrary per-customer limits** (no fixed tiers) on a vendor-defined set
of resources: entity counts, compute footprint, retention.
3. A **standalone minter** owned by the vendor that signs licenses with an Ed25519 private key the
customer never sees.
4. Licenses to be **persisted** on the server, **installable** via env var, file, or admin POST,
and **renewable** by replacement.
5. **Revocation** handled out of band (vendor suspends the SaaS tenant, or issues short-`exp`
licenses) — no online revocation callback in v1.
## Non-goals
- Feature flags. The current `Feature` enum (topology/lineage/correlation/debugger/replay) is dead
scaffolding and gets removed; this design is about quantitative limits only.
- Ingestion-rate limits (executions/minute, logs/minute). Defer to a follow-up.
- Online revocation. Vendor uses shorter `exp` + reissue; SaaS suspension is independent.
- Auto-deletion of resources when caps are lowered. Existing rows stay; only new creates reject.
- Minter keypair generation tooling. Vendor uses standard `openssl genpkey -algorithm ed25519`
out of band.
---
## 1. Architecture
### 1.1 Module layout
```
cameleer-server-core/                        (existing — pure domain, no Spring)
└── license/
    ├── LicenseInfo                          (record — see §2)
    ├── LicenseLimits                        (typed wrapper over the limits map)
    ├── LicenseValidator                     (existing, payload schema updated)
    ├── LicenseGate                          (existing, gutted: no Feature; getLimits() only)
    ├── LicenseStateMachine                  (NEW — pure FSM: ABSENT / ACTIVE / GRACE / EXPIRED)
    └── DefaultTierLimits                    (constant — §3.2 numbers)
cameleer-server-app/                         (existing — Spring, web, persistence)
├── license/
│   ├── LicenseRepository                    (NEW — PostgreSQL persistence)
│   ├── LicenseService                       (NEW — load/save/replace; emits state events)
│   ├── LicenseEnforcer                      (NEW — assertWithinCap entry point)
│   ├── LicenseUsageReader                   (NEW — counts current usage for /usage endpoint)
│   ├── LicenseCapExceededException          (NEW — mapped to 403 by ControllerAdvice)
│   └── LicenseMetrics                       (NEW — Prometheus gauges)
├── controller/
│   ├── LicenseAdminController               (existing — extended; persists, audited)
│   └── LicenseUsageController               (NEW — GET /admin/license/usage)
└── config/
    └── LicenseBeanConfig                    (existing — extended for DB load order)
cameleer-license-minter/                     (NEW — top-level Maven module)
├── pom.xml                                  (depends on cameleer-server-core)
├── LicenseMinter                            (signing primitive; takes private key + LicenseInfo)
└── cli/LicenseMinterCli                     (CLI main class)
```
### 1.2 Why a separate `cameleer-license-minter` module
The minter is not shipped in the runtime JAR. Vendor distributes it independently or builds it
from source on a trusted machine. Customers never receive it.
This is module hygiene plus a smaller runtime attack surface, not a cryptographic protection:
forging a license requires the vendor's private key, and the public key in the server is enough to
reject forged tokens regardless of where the minter code lives.
### 1.3 Dependency graph
```
cameleer-license-minter ──▶ cameleer-server-core (LicenseInfo schema only)
cameleer-server-app ──▶ cameleer-server-core (validator, gate, FSM, defaults)
cameleer-saas ──▶ cameleer-license-minter (for SaaS-mode minting)
cameleer-saas ──▶ cameleer-server-core (transitive)
```
`cameleer-server-app` has **no** dependency on `cameleer-license-minter`.
---
## 2. License envelope
Wire format unchanged: `base64(payload).base64(ed25519_signature)`. Payload schema:
```json
{
  "licenseId": "550e8400-e29b-41d4-a716-446655440000",
  "tenantId": "acme-corp",
  "label": "ACME prod 2026",
  "iat": 1745539200,
  "exp": 1777075200,
  "gracePeriodDays": 30,
  "limits": {
    "max_environments": 5,
    "max_apps": 50,
    "max_agents": 100,
    "max_users": 25,
    "max_outbound_connections": 10,
    "max_alert_rules": 200,
    "max_total_cpu_millis": 32000,
    "max_total_memory_mb": 65536,
    "max_total_replicas": 100,
    "max_execution_retention_days": 90,
    "max_log_retention_days": 30,
    "max_metric_retention_days": 365,
    "max_jar_retention_count": 10
  }
}
```
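As a rough illustration of the wire format, the sketch below splits a token into its two Base64 halves before any signature check. `LicenseToken` and its methods are hypothetical names, not part of the existing `LicenseValidator`, and the standard (non-URL) Base64 alphabet is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper: splits a token of the form base64(payload).base64(signature). */
final class LicenseToken {
    final byte[] payload;    // raw JSON bytes that were signed
    final byte[] signature;  // Ed25519 signature over the payload bytes

    private LicenseToken(byte[] payload, byte[] signature) {
        this.payload = payload;
        this.signature = signature;
    }

    static LicenseToken parse(String token) {
        int dot = token.indexOf('.');
        // Require exactly one separator with a non-empty half on each side.
        if (dot <= 0 || dot != token.lastIndexOf('.') || dot == token.length() - 1) {
            throw new IllegalArgumentException("expected <payload>.<signature>");
        }
        // Assumption: standard Base64 alphabet, which never contains '.'.
        Base64.Decoder decoder = Base64.getDecoder();
        // decode() itself throws IllegalArgumentException on invalid Base64.
        return new LicenseToken(
                decoder.decode(token.substring(0, dot)),
                decoder.decode(token.substring(dot + 1)));
    }

    String payloadJson() {
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

Any parse failure here would land the server in `ABSENT`, the same as a bad signature.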
### 2.1 Field rules
| Field | Required | Notes |
|---|---|---|
| `licenseId` | yes | UUID. Used in audit + future revocation. |
| `tenantId` | optional | If present and `CAMELEER_SERVER_TENANT_ID` differs, treat as no license + log error. Air-gapped customers may omit. |
| `label` | optional | Free-form human description. Surfaced in UI. |
| `iat` | yes | Unix seconds. |
| `exp` | yes | Unix seconds. |
| `gracePeriodDays` | optional, default `0` | Number of days after `exp` during which the license limits still apply. |
| `limits.*` | each optional | Missing key inherits from `DefaultTierLimits`. A license can lift any subset. |
### 2.2 Removed from the current envelope
- `tier` (string) — was a non-functional label. Folded into `label`.
- `features` (array) — out of scope. `Feature` enum deleted.
---
## 3. License state machine
```
exp + grace passes
┌─────────┐ install valid ┌────────┐ exp ┌────────┐ ────────► ┌─────────┐
│ ABSENT │ ───────────────▶│ ACTIVE │──────▶│ GRACE │ │ EXPIRED │
└─────────┘ └────────┘ └────────┘ └─────────┘
▲ │ │ ▲ │
│ install invalid │ replace │ │ replace valid │ replace
│ (sig/tenant/parse) ▼ │ │ ▼
└────────────────────────────┴──────────────┴─┴───────────────────┘
all transitions persist + audit-log
```
### 3.1 State semantics
| State | Effective limits | Trigger |
|---|---|---|
| `ABSENT` | `DefaultTierLimits` | No DB row, or signature/tenant/parse failure. |
| `ACTIVE` | `merge(default, license.limits)` | License loaded, `now < exp`. |
| `GRACE` | Same as `ACTIVE` | `exp ≤ now < exp + gracePeriodDays`. UI banner. |
| `EXPIRED` | `DefaultTierLimits` | `now ≥ exp + gracePeriodDays`. Distinct UI label vs ABSENT. |
State is recomputed on every limit check (clock comparison only) — no scheduler needed for
transitions. The only "background" behaviour is the Prometheus gauge refresh.
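The clock-only recomputation can be sketched as a pure function. This is a minimal sketch, not the actual `LicenseStateMachine`: `LicenseState`, `LicenseStateSketch`, and the convention that a `null` expiry means "no valid license loaded" are assumptions.

```java
import java.time.Duration;
import java.time.Instant;

enum LicenseState { ABSENT, ACTIVE, GRACE, EXPIRED }

final class LicenseStateSketch {
    /**
     * Derives the state from the clock alone.
     * Assumption: exp == null models "no DB row, or validation failed".
     */
    static LicenseState compute(Instant exp, int gracePeriodDays, Instant now) {
        if (exp == null) {
            return LicenseState.ABSENT;
        }
        if (now.isBefore(exp)) {
            return LicenseState.ACTIVE;                      // now < exp
        }
        Instant graceEnd = exp.plus(Duration.ofDays(gracePeriodDays));
        if (now.isBefore(graceEnd)) {
            return LicenseState.GRACE;                       // exp <= now < exp + grace
        }
        return LicenseState.EXPIRED;                         // now >= exp + grace
    }
}
```

With `gracePeriodDays = 0` the grace window is empty, so the state goes straight from `ACTIVE` to `EXPIRED` at `exp`.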
### 3.2 Default tier (the "no license" caps)
| Limit | Default |
|---|---|
| `max_environments` | 1 |
| `max_apps` | 3 |
| `max_agents` | 5 |
| `max_users` | 3 |
| `max_outbound_connections` | 1 |
| `max_alert_rules` | 2 |
| `max_total_cpu_millis` | 2000 (2 cores) |
| `max_total_memory_mb` | 2048 (2 GB) |
| `max_total_replicas` | 5 |
| `max_execution_retention_days` | 1 |
| `max_log_retention_days` | 1 |
| `max_metric_retention_days` | 1 |
| `max_jar_retention_count` | 3 |
Encoded as `public static final Map<String, Integer> DEFAULTS` in `DefaultTierLimits`. Keys
match the license payload exactly.
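A minimal sketch of the defaults map and the `merge(default, license.limits)` step from §3.1. The class name `DefaultTierLimitsSketch` and the method `effectiveLimits` are hypothetical; dropping unknown license keys is an assumption, since the design does not say how they are treated.

```java
import java.util.Map;
import java.util.TreeMap;

final class DefaultTierLimitsSketch {
    /** Default-tier caps; keys match the license payload exactly. */
    static final Map<String, Integer> DEFAULTS = Map.ofEntries(
            Map.entry("max_environments", 1),
            Map.entry("max_apps", 3),
            Map.entry("max_agents", 5),
            Map.entry("max_users", 3),
            Map.entry("max_outbound_connections", 1),
            Map.entry("max_alert_rules", 2),
            Map.entry("max_total_cpu_millis", 2000),
            Map.entry("max_total_memory_mb", 2048),
            Map.entry("max_total_replicas", 5),
            Map.entry("max_execution_retention_days", 1),
            Map.entry("max_log_retention_days", 1),
            Map.entry("max_metric_retention_days", 1),
            Map.entry("max_jar_retention_count", 3));

    /**
     * License values win; keys the license omits inherit the default.
     * Assumption: keys not present in DEFAULTS are ignored.
     */
    static Map<String, Integer> effectiveLimits(Map<String, Integer> licenseLimits) {
        Map<String, Integer> merged = new TreeMap<>(DEFAULTS);
        for (Map.Entry<String, Integer> e : licenseLimits.entrySet()) {
            if (DEFAULTS.containsKey(e.getKey())) {
                merged.put(e.getKey(), e.getValue());
            }
        }
        return merged;
    }
}
```

In `ABSENT` and `EXPIRED` the same function would be called with an empty map, which yields the defaults unchanged.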
---
## 4. Enforcement map
Every limit check goes through one method on `LicenseEnforcer`:
```java
void assertWithinCap(String limitKey, long currentUsage, long requestedDelta);
```
Throws `LicenseCapExceededException(limitKey, current, cap)` when `currentUsage + requestedDelta > cap`.
A `@ControllerAdvice` maps it to `403` with body
`{"error":"license cap reached","limit":"max_apps","current":3,"cap":3}`.
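The check itself is small. The following is a sketch under assumptions: the exception's field names and the constructor-injected limits map are hypothetical, and the real `LicenseEnforcer` would resolve caps through `LicenseGate` rather than a plain map.

```java
import java.util.Map;

/** Sketch of the exception shape; field names are assumptions. */
final class LicenseCapExceededException extends RuntimeException {
    final String limitKey;
    final long current;
    final long cap;

    LicenseCapExceededException(String limitKey, long current, long cap) {
        super("license cap reached: " + limitKey + " (" + current + "/" + cap + ")");
        this.limitKey = limitKey;
        this.current = current;
        this.cap = cap;
    }
}

final class LicenseEnforcerSketch {
    private final Map<String, Integer> effectiveLimits; // merged default + license caps

    LicenseEnforcerSketch(Map<String, Integer> effectiveLimits) {
        this.effectiveLimits = effectiveLimits;
    }

    void assertWithinCap(String limitKey, long currentUsage, long requestedDelta) {
        Integer cap = effectiveLimits.get(limitKey);
        if (cap == null) {
            // Assumption: an unknown key is a programming error, not a license failure.
            throw new IllegalArgumentException("unknown limit: " + limitKey);
        }
        if (currentUsage + requestedDelta > cap) {
            throw new LicenseCapExceededException(limitKey, currentUsage, cap);
        }
    }
}
```

For example, with `max_apps = 3` and three apps already present, `assertWithinCap("max_apps", 3, 1)` throws and carries `current = 3`, `cap = 3`, matching the response body above.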
| Limit | Call site | Failure response |
|---|---|---|
| `max_environments` | `EnvironmentService.create` (start) | 403 |
| `max_apps` | `AppService.createApp` | 403 |
| `max_agents` | `AgentRegistryService.register` | 403 — agent treated as unregistered (no SSE, no commands) |
| `max_users` | `UserAdminController.createUser` and `OidcAuthController.callback` (auto-signup) | 403 / OIDC login failure |
| `max_outbound_connections` | `OutboundConnectionServiceImpl.create` | 403 |
| `max_alert_rules` | `AlertRuleController.create` | 403 |
| `max_total_cpu_millis` | `DeploymentExecutor.PRE_FLIGHT` (sum across non-stopped deploys + new) | Deploy fails fast at PRE_FLIGHT, status FAILED, audit row |
| `max_total_memory_mb` | same | same |
| `max_total_replicas` | same | same |
| `max_execution_retention_days` | `EnvironmentService.update` (per-env field, see §4.1) + `ClickHouseSchemaInitializer.applyRetention()` at boot | 422 on update; boot pins effective TTL = `min(licenseCap, configured)` |
| `max_log_retention_days` | same | same |
| `max_metric_retention_days` | same | same |
| `max_jar_retention_count` | `EnvironmentAdminController.PUT /jar-retention` | 422 |
### 4.1 Per-environment retention fields
Three new columns on `environments` (Flyway V2):
```sql
ALTER TABLE environments
    ADD COLUMN execution_retention_days INTEGER NOT NULL DEFAULT 1,
    ADD COLUMN log_retention_days       INTEGER NOT NULL DEFAULT 1,
    ADD COLUMN metric_retention_days    INTEGER NOT NULL DEFAULT 1;
```
These are the configured per-env values. The effective ClickHouse TTL is
`min(licenseCap, configured)`, applied at startup by `ClickHouseSchemaInitializer`. Admin UI
surfaces the configured values; `EnvironmentService.update` rejects values above the license cap
with 422.
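The two retention rules can be sketched as follows. `RetentionSketch` and both method names are hypothetical, and `IllegalArgumentException` stands in for whatever exception the controller layer maps to 422.

```java
final class RetentionSketch {
    /** Effective ClickHouse TTL: never more than the license allows. */
    static int effectiveTtlDays(int licenseCapDays, int configuredDays) {
        return Math.min(licenseCapDays, configuredDays);
    }

    /**
     * Update-time check. Assumption: the caller maps this exception to HTTP 422.
     */
    static void validateConfigured(int licenseCapDays, int requestedDays) {
        if (requestedDays > licenseCapDays) {
            throw new IllegalArgumentException(
                    "retention " + requestedDays + "d exceeds license cap " + licenseCapDays + "d");
        }
    }
}
```

The `min` matters when a cap drops after the value was configured: an environment set to 90 days under a 90-day license keeps its configured value, but once the license expires and the cap falls back to 1, the effective TTL at the next boot is 1 day.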
### 4.2 Boot-time invariant
If a license is added that *lowers* a cap below current usage (10 apps, license now allows 5), the
server logs one WARN per limit at boot. **No deletion**. New creates reject; existing resources
keep working.
---
## 5. Usage endpoint
`GET /api/v1/admin/license/usage` (ADMIN only):
```json
{
  "state": "ACTIVE",
  "expiresAt": "2027-04-25T00:00:00Z",
  "daysRemaining": 365,
  "gracePeriodDays": 30,
  "tenantId": "acme-corp",
  "label": "ACME prod 2026",
  "limits": [
    {"key": "max_apps", "current": 7, "cap": 50, "source": "license"},
    {"key": "max_agents", "current": 12, "cap": 100, "source": "license"},
    {"key": "max_total_cpu_millis", "current": 8500, "cap": 32000, "source": "license"},
    {"key": "max_outbound_connections", "current": 0, "cap": 1, "source": "default"}
  ]
}
```
`source` is `"default"` when the cap comes from `DefaultTierLimits` (i.e. the license omits this
key, or there is no license), and `"license"` when the cap is explicit in the license. Drives the
SaaS UI's "free tier" badge.
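How the `source` field could be derived, as a sketch only. `UsageRow` and `UsageSketch` are hypothetical names; the caller is assumed to pass an empty `licenseLimits` map when the state is `ABSENT` or `EXPIRED`, so every row then reports `"default"`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** One entry of the "limits" array in the usage response. */
final class UsageRow {
    final String key;
    final long current;
    final long cap;
    final String source; // "license" or "default"

    UsageRow(String key, long current, long cap, String source) {
        this.key = key;
        this.current = current;
        this.cap = cap;
        this.source = source;
    }
}

final class UsageSketch {
    static List<UsageRow> rows(Map<String, Integer> defaults,
                               Map<String, Integer> licenseLimits,
                               Map<String, Long> currentUsage) {
        List<UsageRow> rows = new ArrayList<>();
        // Iterate known keys in sorted order so the response is stable.
        for (String key : new TreeMap<>(defaults).keySet()) {
            boolean fromLicense = licenseLimits.containsKey(key);
            long cap = fromLicense ? licenseLimits.get(key) : defaults.get(key);
            long current = currentUsage.getOrDefault(key, 0L);
            rows.add(new UsageRow(key, current, cap, fromLicense ? "license" : "default"));
        }
        return rows;
    }
}
```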
`LicenseUsageReader` issues one cheap aggregate per limit (`SELECT COUNT(*)` per entity table; a
single grouped `SELECT SUM(replicas * cpuMillis), SUM(replicas * memoryMb), SUM(replicas)` over
non-stopped deployments).
`GET /api/v1/admin/license` (existing) is extended to return `{state, envelope}` with the raw token
omitted from the response.
---
## 6. Lifecycle, persistence, install paths
### 6.1 Storage
Flyway V2 migration:
```sql
CREATE TABLE license (
    tenant_id    TEXT PRIMARY KEY,       -- one row per server (= one tenant)
    token        TEXT NOT NULL,          -- full signed token
    license_id   UUID NOT NULL,
    installed_at TIMESTAMPTZ NOT NULL,
    installed_by TEXT NOT NULL,          -- users.user_id (bare) or 'system' for env/file boot
    expires_at   TIMESTAMPTZ NOT NULL
);
```
### 6.2 Boot order
`LicenseBeanConfig`:
1. If `CAMELEER_SERVER_LICENSE_TOKEN` env var is set → validate → write to DB (overwrite) →
load.
2. Else if `CAMELEER_SERVER_LICENSE_FILE` is set → read file → validate → write to DB → load.
3. Else read `license` row from DB → validate → load.
4. Else (no DB row) → `ABSENT`.
Env-var / file act as **idempotent overrides** — they always win and replace the DB row, so the
operator's last action survives reboots.
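The precedence above can be sketched as a pure resolution step. `BootOrderSketch`, `Source`, and `Resolved` are hypothetical names; validation is deliberately left out, and a `null` or blank argument is assumed to mean "not set".

```java
final class BootOrderSketch {
    enum Source { ENV, FILE, DB, NONE }

    /** Which token to load, where it came from, and whether it must overwrite the DB row. */
    static final class Resolved {
        final Source source;
        final String token;       // null when source == NONE
        final boolean writeToDb;  // env/file overrides replace the persisted row

        Resolved(Source source, String token, boolean writeToDb) {
            this.source = source;
            this.token = token;
            this.writeToDb = writeToDb;
        }
    }

    static Resolved resolve(String envToken, String fileToken, String dbToken) {
        if (isSet(envToken)) {
            return new Resolved(Source.ENV, envToken, true);
        }
        if (isSet(fileToken)) {
            return new Resolved(Source.FILE, fileToken, true);
        }
        if (isSet(dbToken)) {
            return new Resolved(Source.DB, dbToken, false);
        }
        return new Resolved(Source.NONE, null, false); // state becomes ABSENT
    }

    private static boolean isSet(String value) {
        return value != null && !value.isBlank();
    }
}
```

One consequence of this ordering: a license installed through the admin POST is replaced at the next boot if the env var or file is still configured.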
### 6.3 Runtime install
`POST /api/v1/admin/license { "token": "..." }` (existing):
- Validates against the configured public key.
- On success, persists to `license` table (`installed_by = user_id`), updates the in-memory
`LicenseGate`, audits.
- On failure, returns 400 with the validator error message and audits the rejection.
### 6.4 Public key custody
`CAMELEER_SERVER_LICENSE_PUBLICKEY` (existing) remains the only verification key. It is build- or
deploy-time configuration bound to the vendor distribution and is **not stored in DB**. If it is
unset *and* a license is present → reject all licenses (existing behaviour).
### 6.5 Audit trail
New `AuditCategory.LICENSE`. Actions:
| Action | When | Payload |
|---|---|---|
| `install_license` | First successful install in an empty state | `{licenseId, expiresAt, installedBy, source}` (`source` = `env`/`file`/`api`) |
| `replace_license` | Successful install over an existing license | same + `previousLicenseId` |
| `reject_license` | Validation failed (signature, tenant, parse, public key missing) | `{reason, source}` |
| `cap_exceeded` | Any `LicenseCapExceededException` | `{limit, current, cap, requestedBy}` |
---
## 7. Minter
### 7.1 `LicenseMinter` (library)
Pure function, packaged in `cameleer-license-minter`:
```java
public final class LicenseMinter {
public static String mint(LicenseInfo info, PrivateKey ed25519PrivateKey);
}
```
Serializes `LicenseInfo` to canonical JSON (sorted keys), signs the bytes with Ed25519, returns
`base64(payload).base64(signature)`. cameleer-saas calls this directly to mint per-tenant tokens.
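A round-trip sketch of the signing primitive using the JDK's built-in Ed25519 support (Java 15+). Assumptions: the payload is passed as a `Map` because the `LicenseInfo` record is not shown here, the hand-rolled canonical JSON only handles strings, numbers, and nested maps, and `LicenseMinterSketch` with its `verify` and `throwawayKeyPair` helpers is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;
import java.util.Map;
import java.util.TreeMap;

final class LicenseMinterSketch {
    /** Canonical form: keys sorted, no whitespace. */
    static String canonicalJson(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{");
        String sep = "";
        for (Map.Entry<String, Object> e : new TreeMap<String, Object>(fields).entrySet()) {
            sb.append(sep).append(quote(e.getKey())).append(':').append(value(e.getValue()));
            sep = ",";
        }
        return sb.append('}').toString();
    }

    @SuppressWarnings("unchecked")
    private static String value(Object v) {
        if (v instanceof Number) return v.toString();
        if (v instanceof Map) return canonicalJson((Map<String, Object>) v);
        return quote(String.valueOf(v));
    }

    // Assumption: values contain no control characters needing \\u escapes.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static String mint(Map<String, Object> payload, PrivateKey ed25519PrivateKey) {
        try {
            byte[] body = canonicalJson(payload).getBytes(StandardCharsets.UTF_8);
            Signature signer = Signature.getInstance("Ed25519");
            signer.initSign(ed25519PrivateKey);
            signer.update(body);
            Base64.Encoder b64 = Base64.getEncoder();
            return b64.encodeToString(body) + "." + b64.encodeToString(signer.sign());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("signing failed", e);
        }
    }

    static boolean verify(String token, PublicKey ed25519PublicKey) {
        try {
            String[] parts = token.split("\\.");
            if (parts.length != 2) return false;
            Signature verifier = Signature.getInstance("Ed25519");
            verifier.initVerify(ed25519PublicKey);
            verifier.update(Base64.getDecoder().decode(parts[0]));
            return verifier.verify(Base64.getDecoder().decode(parts[1]));
        } catch (GeneralSecurityException | IllegalArgumentException e) {
            return false; // malformed Base64 or signature bytes count as invalid
        }
    }

    static KeyPair throwawayKeyPair() {
        try {
            return KeyPairGenerator.getInstance("Ed25519").generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Sorting keys before signing is what keeps the signed bytes stable across runs, since verification operates on the exact payload bytes rather than on parsed JSON.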
### 7.2 `LicenseMinterCli` (CLI)
```bash
java -jar cameleer-license-minter-1.0-SNAPSHOT.jar \
  --private-key=/secure/vendor.key \
  --tenant=acme-corp \
  --label="ACME prod 2026" \
  --expires=2027-04-25 \
  --grace-days=30 \
  --max-apps=50 \
  --max-agents=100 \
  --max-total-cpu-millis=32000 \
  --max-total-memory-mb=65536 \
  --max-execution-retention-days=90 \
  --output=acme-license.tok
```
- `--private-key` reads a PEM-encoded Ed25519 private key (output of
`openssl genpkey -algorithm ed25519`).
- Unspecified `--max-*` flags are omitted from the payload — the license inherits the default for
that key.
- Unknown flags fail fast.
- `--output` writes the token; if omitted, prints to stdout.
Keypair generation is **out of band** — vendor uses `openssl` and stores both halves in their
secret manager. We deliberately do not ship a `--gen-keypair` subcommand to keep the boundary
clean.
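The flag rules above (omitted `--max-*` flags stay out of the payload, unknown flags fail fast) might be implemented along these lines. This is a sketch: `MinterFlagsSketch` is a hypothetical name, the set of non-limit flags is taken from the example invocation and assumed complete, and every flag is assumed to use the `--name=value` form.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class MinterFlagsSketch {
    // Assumption: the non-limit flags shown in the example are the full set.
    private static final Set<String> GENERAL = Set.of(
            "private-key", "tenant", "label", "expires", "grace-days", "output");

    private static final List<String> LIMIT_KEYS = List.of(
            "max_environments", "max_apps", "max_agents", "max_users",
            "max_outbound_connections", "max_alert_rules", "max_total_cpu_millis",
            "max_total_memory_mb", "max_total_replicas", "max_execution_retention_days",
            "max_log_retention_days", "max_metric_retention_days", "max_jar_retention_count");

    /** Parses --name=value arguments; unknown or malformed flags fail fast. */
    static Map<String, String> parse(String... args) {
        Map<String, String> flags = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("--") || eq <= 2) {
                throw new IllegalArgumentException("expected --name=value, got: " + arg);
            }
            String name = arg.substring(2, eq);
            if (!GENERAL.contains(name) && !isLimitFlag(name)) {
                throw new IllegalArgumentException("unknown flag: --" + name);
            }
            flags.put(name, arg.substring(eq + 1));
        }
        return flags;
    }

    /** Maps --max-apps style flags to payload keys; unspecified limits are simply absent. */
    static Map<String, Integer> limits(Map<String, String> flags) {
        Map<String, Integer> limits = new TreeMap<>();
        for (String key : LIMIT_KEYS) {
            String raw = flags.get(key.replace('_', '-'));
            if (raw != null) {
                limits.put(key, Integer.parseInt(raw));
            }
        }
        return limits;
    }

    private static boolean isLimitFlag(String name) {
        for (String key : LIMIT_KEYS) {
            if (key.replace('_', '-').equals(name)) return true;
        }
        return false;
    }
}
```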
---
## 8. Telemetry
Prometheus gauges scraped via `/api/v1/prometheus`:
| Metric | Labels | Notes |
|---|---|---|
| `cameleer_license_state` | `state` = `ABSENT`, `ACTIVE`, `GRACE`, or `EXPIRED` | One series per state; exactly one is 1, the rest are 0. |
| `cameleer_license_days_remaining` | (none) | Negative in GRACE/EXPIRED. |
| `cameleer_license_limit_utilisation` | `limit="max_apps"` etc. | `current / cap`; at least 0, and above 1 when usage exceeds the cap. |
| `cameleer_license_cap_rejections_total` | `limit="..."` | Counter. |
State-transition log lines: `INFO` on install/ACTIVE, `WARN` on GRACE, `ERROR` on EXPIRED, `WARN`
on cap reject (sampled to avoid log spam).
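The gauge values themselves are simple to derive; the sketch below computes them without any metrics library. `LicenseGaugeSketch` is a hypothetical name, caps are assumed to be positive, and rounding `days_remaining` down (so it turns negative as soon as `exp` has passed) is an assumption rather than a stated rule.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

final class LicenseGaugeSketch {
    private static final String[] STATES = {"ABSENT", "ACTIVE", "GRACE", "EXPIRED"};

    /** One gauge value per state label; exactly one entry is 1. */
    static Map<String, Integer> stateGauge(String currentState) {
        Map<String, Integer> values = new LinkedHashMap<>();
        for (String state : STATES) {
            values.put(state, state.equals(currentState) ? 1 : 0);
        }
        return values;
    }

    /** current / cap; exceeds 1.0 when usage is above the cap. Assumes cap > 0. */
    static double utilisation(long current, long cap) {
        return (double) current / cap;
    }

    /** Whole days until exp, rounded down, so any moment after exp is negative. */
    static long daysRemaining(Instant exp, Instant now) {
        return Math.floorDiv(exp.getEpochSecond() - now.getEpochSecond(), 86_400L);
    }
}
```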
---
## 9. Dead-code removal
Performed in the **first commit** of the implementation. Per the project's "no backwards
compatibility shims" preference, no deprecated path or feature flag.
- Delete `Feature.java`.
- Delete `LicenseGate.isEnabled(Feature)`.
- Delete `LicenseInfo.features` field, `LicenseInfo.hasFeature(Feature)`.
- Delete `LicenseGateTest.withLicense_onlyLicensedFeaturesEnabled` and `LicenseInfo.open()`'s
`Set.of(Feature.values())` assertion.
- Update `LicenseValidator` to ignore `features` if present in old tokens (silently dropped,
not an error).
---
## 10. Testing
| Layer | Tests |
|---|---|
| Core unit | `LicenseValidatorTest` — signature, expiry, tenant mismatch, missing required fields, unknown extra fields. |
| Core unit | `LicenseStateMachineTest` — all four transitions including grace boundary, replace from any state, invalid install. |
| Core unit | `DefaultTierLimitsTest` — every documented key has a default. |
| Minter unit | `LicenseMinterTest` — round-trip with a throwaway Ed25519 keypair. Canonical JSON is stable across runs. |
| Minter CLI | `LicenseMinterCliTest` — invokes `main` with `--private-key=tmp` and checks output token validates. |
| App unit | `LicenseEnforcerTest` — for each limit: cap-reached, under-cap, default-tier with no license, missing-cap-inherits-default. |
| App integration | `LicenseLifecycleIT` — install via env, replace via POST, restart restores from DB. Driven through REST. |
| App integration | `LicenseEnforcementIT` — REST-driven, hit each cap end-to-end (per the project's "REST-API-driven ITs" preference). Includes `cap_exceeded` audit row check. |
| Boot | `SchemaBootstrapIT` extension — `license` table exists, `environments` retention columns exist, retention pinning honoured at boot. |
No raw-SQL seeding of caps in ITs. All caps installed via the REST endpoint or env var.
---
## 11. Open follow-ups (deliberately deferred)
- Ingestion-rate limits (`max_executions_per_minute`, `max_logs_per_minute`).
- Online revocation callback (the `revocation_check_url` envelope field).
- Concurrent debug session limit (`max_concurrent_debug_sessions` from the SaaS epic).
- A "license usage history" report for vendors to see growth over time.
- Open a tracking issue on `cameleer/cameleer-server` (Gitea) — none exists today.
---
## 12. Risk register
| Risk | Mitigation |
|---|---|
| Default tier so tight that an honest evaluator cannot try the product. | Defaults documented; vendor can ship a longer-`exp` "trial" license at install time if needed. |
| Customer lowers `gracePeriodDays` field by editing token. | Token is signed; any edit invalidates the signature. |
| License removed from DB out of band: server lands in ABSENT and rejects new resources, while existing ones remain above the default tier. | Boot-time WARN per over-cap limit. UI banner in the admin license page. No auto-deletion. |
| Public key rotation. | Out of scope for v1; documented as "redeploy with new key" — vendors are expected to rotate via redeployment. |
| Compute cap arithmetic relies on `cpuLimit` and `memoryLimitMb` being set on every container. | Existing `ResolvedContainerConfig` already enforces these; `DeploymentExecutor.PRE_FLIGHT` rejects deploys with unset compute fields. |
| Per-env retention column added but old ClickHouse partitions retain longer. | Documented: TTL change is honoured by ClickHouse on its next merge cycle. New rows inserted always honour the new TTL. |