diff --git a/docs/superpowers/specs/2026-04-25-license-enforcement-design.md b/docs/superpowers/specs/2026-04-25-license-enforcement-design.md new file mode 100644 index 00000000..fafbe1a9 --- /dev/null +++ b/docs/superpowers/specs/2026-04-25-license-enforcement-design.md @@ -0,0 +1,449 @@ +# License Enforcement — Design + +**Date:** 2026-04-25 +**Status:** Approved (brainstorm); pending writing-plans +**Related:** cameleer-saas#7 (Epic: License & Feature Gating), cameleer-saas#42 (vendor minting), cameleer-saas#50 (customer license view) + +## Problem + +`cameleer-server` ships a license skeleton (`LicenseValidator`, `LicenseGate`, admin endpoint) but +nothing enforces anything. Open mode (no license configured) currently grants *all* features and +*no* limits — the opposite of what we want for a self-hosted distribution that needs to gate scale +behind a paid license. + +We want: + +1. A self-hosted server with **no license** to operate within a small, hard-coded "default tier" + that is enough to evaluate the product but not enough to run it in production. +2. Licenses to express **arbitrary per-customer limits** (no fixed tiers) on a vendor-defined set + of resources: entity counts, compute footprint, retention. +3. A **standalone minter** owned by the vendor that signs licenses with an Ed25519 private key the + customer never sees. +4. Licenses to be **persisted** on the server, **installable** via env var, file, or admin POST, + and **renewable** by replacement. +5. **Revocation** handled out of band (vendor suspends the SaaS tenant, or issues short-`exp` + licenses) — no online revocation callback in v1. + +## Non-goals + +- Feature flags. The current `Feature` enum (topology/lineage/correlation/debugger/replay) is dead + scaffolding and gets removed; this design is about quantitative limits only. +- Ingestion-rate limits (executions/minute, logs/minute). Defer to a follow-up. +- Online revocation. 
Vendor uses shorter `exp` + reissue; SaaS suspension is independent. +- Auto-deletion of resources when caps are lowered. Existing rows stay; only new creates reject. +- Minter keypair generation tooling. Vendor uses standard `openssl genpkey -algorithm ed25519` + out of band. + +--- + +## 1. Architecture + +### 1.1 Module layout + +``` +cameleer-server-core/ (existing — pure domain, no Spring) +└── license/ + ├── LicenseInfo (record — see §2) + ├── LicenseLimits (typed wrapper over the limits map) + ├── LicenseValidator (existing, payload schema updated) + ├── LicenseGate (existing, gutted: no Feature; getLimits() only) + ├── LicenseStateMachine (NEW — pure FSM: ABSENT / ACTIVE / GRACE / EXPIRED) + └── DefaultTierLimits (constant — §3.2 numbers) + +cameleer-server-app/ (existing — Spring, web, persistence) +├── license/ +│ ├── LicenseRepository (NEW — PostgreSQL persistence) +│ ├── LicenseService (NEW — load/save/replace; emits state events) +│ ├── LicenseEnforcer (NEW — assertWithinCap entry point) +│ ├── LicenseUsageReader (NEW — counts current usage for /usage endpoint) +│ ├── LicenseCapExceededException (NEW — mapped to 403 by ControllerAdvice) +│ └── LicenseMetrics (NEW — Prometheus gauges) +├── controller/ +│ ├── LicenseAdminController (existing — extended; persists, audited) +│ └── LicenseUsageController (NEW — GET /admin/license/usage) +└── config/ + └── LicenseBeanConfig (existing — extended for DB load order) + +cameleer-license-minter/ (NEW — top-level Maven module) +├── pom.xml (depends on cameleer-server-core) +├── LicenseMinter (signing primitive; takes private key + LicenseInfo) +└── cli/LicenseMinterCli (CLI main class) +``` + +### 1.2 Why a separate `cameleer-license-minter` module + +The minter is not shipped in the runtime JAR. Vendor distributes it independently or builds it from source on a +trusted machine. Customers never receive it. 
+ +This is module hygiene + smaller runtime attack surface, not a cryptographic protection — license +forgery requires the vendor's private key, and the public key in the server is enough to reject +forged tokens regardless of where the minter code lives. + +### 1.3 Dependency graph + +``` +cameleer-license-minter ──▶ cameleer-server-core (LicenseInfo schema only) +cameleer-server-app ──▶ cameleer-server-core (validator, gate, FSM, defaults) +cameleer-saas ──▶ cameleer-license-minter (for SaaS-mode minting) +cameleer-saas ──▶ cameleer-server-core (transitive) +``` + +`cameleer-server-app` has **no** dependency on `cameleer-license-minter`. + +--- + +## 2. License envelope + +Wire format unchanged: `base64(payload).base64(ed25519_signature)`. Payload schema: + +```json +{ + "licenseId": "550e8400-e29b-41d4-a716-446655440000", + "tenantId": "acme-corp", + "label": "ACME prod 2026", + "iat": 1777075200, + "exp": 1808611200, + "gracePeriodDays": 30, + "limits": { + "max_environments": 5, + "max_apps": 50, + "max_agents": 100, + "max_users": 25, + "max_outbound_connections": 10, + "max_alert_rules": 200, + "max_total_cpu_millis": 32000, + "max_total_memory_mb": 65536, + "max_total_replicas": 100, + "max_execution_retention_days": 90, + "max_log_retention_days": 30, + "max_metric_retention_days": 365, + "max_jar_retention_count": 10 + } +} +``` + +### 2.1 Field rules + +| Field | Required | Notes | +|---|---|---| +| `licenseId` | yes | UUID. Used in audit + future revocation. | +| `tenantId` | optional | If present and `CAMELEER_SERVER_TENANT_ID` differs, treat as no license + log error. Air-gapped customers may omit. | +| `label` | optional | Free-form human description. Surfaced in UI. | +| `iat` | yes | Unix seconds. | +| `exp` | yes | Unix seconds. | +| `gracePeriodDays` | optional, default `0` | Days `exp` may be in the past while limits still apply. | +| `limits.*` | each optional | Missing key inherits from `DefaultTierLimits`. A license can lift any subset. 
| + +### 2.2 Removed from the current envelope + +- `tier` (string) — was a non-functional label. Folded into `label`. +- `features` (array) — out of scope. `Feature` enum deleted. + +--- + +## 3. License state machine + +``` + exp + grace passes + ┌─────────┐ install valid ┌────────┐ exp ┌────────┐ ────────► ┌─────────┐ + │ ABSENT │ ───────────────▶│ ACTIVE │──────▶│ GRACE │ │ EXPIRED │ + └─────────┘ └────────┘ └────────┘ └─────────┘ + ▲ │ │ ▲ │ + │ install invalid │ replace │ │ replace valid │ replace + │ (sig/tenant/parse) ▼ │ │ ▼ + └────────────────────────────┴──────────────┴─┴───────────────────┘ + all transitions persist + audit-log +``` + +### 3.1 State semantics + +| State | Effective limits | Trigger | +|---|---|---| +| `ABSENT` | `DefaultTierLimits` | No DB row, or signature/tenant/parse failure. | +| `ACTIVE` | `merge(default, license.limits)` | License loaded, `now < exp`. | +| `GRACE` | Same as `ACTIVE` | `exp ≤ now < exp + gracePeriodDays`. UI banner. | +| `EXPIRED` | `DefaultTierLimits` | `now ≥ exp + gracePeriodDays`. Distinct UI label vs ABSENT. | + +State is recomputed on every limit check (clock comparison only) — no scheduler needed for +transitions. The only "background" behaviour is the Prometheus gauge refresh. + +### 3.2 Default tier (the "no license" caps) + +| Limit | Default | +|---|---| +| `max_environments` | 1 | +| `max_apps` | 3 | +| `max_agents` | 5 | +| `max_users` | 3 | +| `max_outbound_connections` | 1 | +| `max_alert_rules` | 2 | +| `max_total_cpu_millis` | 2000 (2 cores) | +| `max_total_memory_mb` | 2048 (2 GB) | +| `max_total_replicas` | 5 | +| `max_execution_retention_days` | 1 | +| `max_log_retention_days` | 1 | +| `max_metric_retention_days` | 1 | +| `max_jar_retention_count` | 3 | + +Encoded as `public static final Map DEFAULTS` in `DefaultTierLimits`. Keys +match the license payload exactly. + +--- + +## 4. 
Enforcement map + +Every limit check goes through one method on `LicenseEnforcer`: + +```java +void assertWithinCap(String limitKey, long currentUsage, long requestedDelta); +``` + +Throws `LicenseCapExceededException(limitKey, current, cap)` when `currentUsage + requestedDelta > cap`. +A `@ControllerAdvice` maps it to `403` with body +`{"error":"license cap reached","limit":"max_apps","current":3,"cap":3}`. + +| Limit | Call site | Failure response | +|---|---|---| +| `max_environments` | `EnvironmentService.create` (start) | 403 | +| `max_apps` | `AppService.createApp` | 403 | +| `max_agents` | `AgentRegistryService.register` | 403 — agent treated as unregistered (no SSE, no commands) | +| `max_users` | `UserAdminController.createUser` and `OidcAuthController.callback` (auto-signup) | 403 / OIDC login failure | +| `max_outbound_connections` | `OutboundConnectionServiceImpl.create` | 403 | +| `max_alert_rules` | `AlertRuleController.create` | 403 | +| `max_total_cpu_millis` | `DeploymentExecutor.PRE_FLIGHT` (sum across non-stopped deploys + new) | Deploy fails fast at PRE_FLIGHT, status FAILED, audit row | +| `max_total_memory_mb` | same | same | +| `max_total_replicas` | same | same | +| `max_execution_retention_days` | `EnvironmentService.update` (per-env field, see §4.1) + `ClickHouseSchemaInitializer.applyRetention()` at boot | 422 on update; boot pins effective TTL = `min(licenseCap, configured)` | +| `max_log_retention_days` | same | same | +| `max_metric_retention_days` | same | same | +| `max_jar_retention_count` | `EnvironmentAdminController.PUT /jar-retention` | 422 | + +### 4.1 Per-environment retention fields + +Three new columns on `environments` (Flyway V2): + +```sql +ALTER TABLE environments + ADD COLUMN execution_retention_days INTEGER NOT NULL DEFAULT 1, + ADD COLUMN log_retention_days INTEGER NOT NULL DEFAULT 1, + ADD COLUMN metric_retention_days INTEGER NOT NULL DEFAULT 1; +``` + +These are the configured per-env values. 
The effective ClickHouse TTL is +`min(licenseCap, configured)`, applied at startup by `ClickHouseSchemaInitializer`. Admin UI +surfaces the configured values; `EnvironmentService.update` rejects values above the license cap +with 422. + +### 4.2 Boot-time invariant + +If a license is added that *lowers* a cap below current usage (10 apps, license now allows 5), the +server logs one WARN per limit at boot. **No deletion**. New creates reject; existing resources +keep working. + +--- + +## 5. Usage endpoint + +`GET /api/v1/admin/license/usage` (ADMIN only): + +```json +{ + "state": "ACTIVE", + "expiresAt": "2027-04-25T00:00:00Z", + "daysRemaining": 365, + "gracePeriodDays": 30, + "tenantId": "acme-corp", + "label": "ACME prod 2026", + "limits": [ + {"key": "max_apps", "current": 7, "cap": 50, "source": "license"}, + {"key": "max_agents", "current": 12, "cap": 100, "source": "license"}, + {"key": "max_total_cpu_millis", "current": 8500, "cap": 32000, "source": "license"}, + {"key": "max_outbound_connections", "current": 0, "cap": 1, "source": "default"} + ] +} +``` + +`source` is `"default"` when the cap comes from `DefaultTierLimits` (i.e. the license omits this +key, or there is no license), and `"license"` when the cap is explicit in the license. Drives the +SaaS UI's "free tier" badge. + +`LicenseUsageReader` issues one cheap aggregate per limit (`SELECT COUNT(*)` per entity table; a +single grouped `SELECT SUM(replicas * cpuMillis), SUM(replicas * memoryMb), SUM(replicas)` over +non-stopped deployments). + +`GET /api/v1/admin/license` (existing) is extended to return `{state, envelope}` with the raw token +omitted from the response. + +--- + +## 6. 
Lifecycle, persistence, install paths + +### 6.1 Storage + +Flyway V2 migration: + +```sql +CREATE TABLE license ( + tenant_id TEXT PRIMARY KEY, -- one row per server (= one tenant) + token TEXT NOT NULL, -- full signed token + license_id UUID NOT NULL, + installed_at TIMESTAMPTZ NOT NULL, + installed_by TEXT NOT NULL, -- users.user_id (bare) or 'system' for env/file boot + expires_at TIMESTAMPTZ NOT NULL +); +``` + +### 6.2 Boot order + +`LicenseBeanConfig`: + +1. If `CAMELEER_SERVER_LICENSE_TOKEN` env var is set → validate → write to DB (overwrite) → + load. +2. Else if `CAMELEER_SERVER_LICENSE_FILE` is set → read file → validate → write to DB → load. +3. Else read `license` row from DB → validate → load. +4. Else `ABSENT`. + +Env-var / file act as **idempotent overrides** — they always win and replace the DB row, so the +operator's last action survives reboots. + +### 6.3 Runtime install + +`POST /api/v1/admin/license { "token": "..." }` (existing): +- Validates against the configured public key. +- On success, persists to `license` table (`installed_by = user_id`), updates the in-memory + `LicenseGate`, audits. +- On failure, returns 400 with the validator error message and audits the rejection. + +### 6.4 Public key custody + +`CAMELEER_SERVER_LICENSE_PUBLICKEY` (existing) remains the only verification key. Build- / +deploy-time secret bound to the vendor distribution. **Not stored in DB.** If unset *and* a +license is present → reject all licenses (existing behaviour). + +### 6.5 Audit trail + +New `AuditCategory.LICENSE`. 
Actions: + +| Action | When | Payload | +|---|---|---| +| `install_license` | First successful install in an empty state | `{licenseId, expiresAt, installedBy, source}` (`source` = `env`/`file`/`api`) | +| `replace_license` | Successful install over an existing license | same + `previousLicenseId` | +| `reject_license` | Validation failed (signature, tenant, parse, public key missing) | `{reason, source}` | +| `cap_exceeded` | Any `LicenseCapExceededException` | `{limit, current, cap, requestedBy}` | + +--- + +## 7. Minter + +### 7.1 `LicenseMinter` (library) + +Pure function, packaged in `cameleer-license-minter`: + +```java +public final class LicenseMinter { + public static String mint(LicenseInfo info, PrivateKey ed25519PrivateKey); +} +``` + +Serializes `LicenseInfo` to canonical JSON (sorted keys), signs the bytes with Ed25519, returns +`base64(payload).base64(signature)`. cameleer-saas calls this directly to mint per-tenant tokens. + +### 7.2 `LicenseMinterCli` (CLI) + +```bash +java -jar cameleer-license-minter-1.0-SNAPSHOT.jar \ + --private-key=/secure/vendor.key \ + --tenant=acme-corp \ + --label="ACME prod 2026" \ + --expires=2027-04-25 \ + --grace-days=30 \ + --max-apps=50 \ + --max-agents=100 \ + --max-total-cpu-millis=32000 \ + --max-total-memory-mb=65536 \ + --max-execution-retention-days=90 \ + --output=acme-license.tok +``` + +- `--private-key` reads a PEM-encoded Ed25519 private key (output of + `openssl genpkey -algorithm ed25519`). +- Unspecified `--max-*` flags are omitted from the payload — the license inherits the default for + that key. +- Unknown flags fail fast. +- `--output` writes the token; if omitted, prints to stdout. + +Keypair generation is **out of band** — vendor uses `openssl` and stores both halves in their +secret manager. We deliberately do not ship a `--gen-keypair` subcommand to keep the boundary +clean. + +--- + +## 8. 
Telemetry + +Prometheus gauges scraped via `/api/v1/prometheus`: + +| Metric | Labels | Notes | +|---|---|---| +| `cameleer_license_state` | `state="ABSENT|ACTIVE|GRACE|EXPIRED"` | Boolean — exactly one is 1. | +| `cameleer_license_days_remaining` | (none) | Negative in GRACE/EXPIRED. | +| `cameleer_license_limit_utilisation`| `limit="max_apps"` etc. | `current / cap`, in `[0, 1+]`. | +| `cameleer_license_cap_rejections_total` | `limit="..."` | Counter. | + +State-transition log lines: `INFO` on install/ACTIVE, `WARN` on GRACE, `ERROR` on EXPIRED, `WARN` +on cap reject (sampled to avoid log spam). + +--- + +## 9. Dead-code removal + +Performed in the **first commit** of the implementation. Per the project's "no backwards +compatibility shims" preference, no deprecated path or feature flag. + +- Delete `Feature.java`. +- Delete `LicenseGate.isEnabled(Feature)`. +- Delete `LicenseInfo.features` field, `LicenseInfo.hasFeature(Feature)`. +- Delete `LicenseGateTest.withLicense_onlyLicensedFeaturesEnabled` and `LicenseInfo.open()`'s + `Set.of(Feature.values())` assertion. +- Update `LicenseValidator` to ignore `features` if present in old tokens (silently dropped, + not an error). + +--- + +## 10. Testing + +| Layer | Tests | +|---|---| +| Core unit | `LicenseValidatorTest` — signature, expiry, tenant mismatch, missing required fields, unknown extra fields. | +| Core unit | `LicenseStateMachineTest` — all four transitions including grace boundary, replace from any state, invalid install. | +| Core unit | `DefaultTierLimitsTest` — every documented key has a default. | +| Minter unit | `LicenseMinterTest` — round-trip with a throwaway Ed25519 keypair. Canonical JSON is stable across runs. | +| Minter CLI | `LicenseMinterCliTest` — invokes `main` with `--private-key=tmp` and checks output token validates. | +| App unit | `LicenseEnforcerTest` — for each limit: cap-reached, under-cap, default-tier with no license, missing-cap-inherits-default. 
| +| App integration | `LicenseLifecycleIT` — install via env, replace via POST, restart restores from DB. Driven through REST. | +| App integration | `LicenseEnforcementIT` — REST-driven, hit each cap end-to-end (per the project's "REST-API-driven ITs" preference). Includes `cap_exceeded` audit row check. | +| Boot | `SchemaBootstrapIT` extension — `license` table exists, `environments` retention columns exist, retention pinning honoured at boot. | + +No raw-SQL seeding of caps in ITs. All caps installed via the REST endpoint or env var. + +--- + +## 11. Open follow-ups (deliberately deferred) + +- Ingestion-rate limits (`max_executions_per_minute`, `max_logs_per_minute`). +- Online revocation callback (the `revocation_check_url` envelope field). +- Concurrent debug session limit (`max_concurrent_debug_sessions` from the SaaS epic). +- A "license usage history" report for vendors to see growth over time. +- Open a tracking issue on `cameleer/cameleer-server` (Gitea) — none exists today. + +--- + +## 12. Risk register + +| Risk | Mitigation | +|---|---| +| Default tier so tight that an honest evaluator cannot try the product. | Defaults documented; vendor can ship a longer-`exp` "trial" license at install time if needed. | +| Customer lowers `gracePeriodDays` field by editing token. | Token is signed; any edit invalidates the signature. | +| License removed from DB out of band, server lands in ABSENT and rejects new resources but old ones are above default tier. | Boot-time WARN per over-cap limit. UI banner in the admin license page. No auto-deletion. | +| Public key rotation. | Out of scope for v1; documented as "redeploy with new key" — vendors are expected to rotate via redeployment. | +| Compute cap arithmetic relies on `cpuLimit` and `memoryLimitMb` being set on every container. | Existing `ResolvedContainerConfig` already enforces these; `DeploymentExecutor.PRE_FLIGHT` rejects deploys with unset compute fields. 
| +| Per-env retention column added but old ClickHouse partitions retain longer. | Documented: TTL change is honoured by ClickHouse on its next merge cycle. New rows inserted always honour the new TTL. |
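
As a rough illustration of the state semantics in §3.1 and the cap check in §4, the sketch below recomputes the state from a clock comparison, merges license limits over the defaults, and applies the `currentUsage + requestedDelta > cap` rule. The class name `LicenseEnforcementSketch`, the nested `License` holder, and the use of `IllegalStateException` in place of `LicenseCapExceededException` are assumptions made to keep the example self-contained; only three of the §3.2 default-tier keys are included.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of §3.1 and §4; not the actual cameleer-server classes. */
final class LicenseEnforcementSketch {

    enum State { ABSENT, ACTIVE, GRACE, EXPIRED }

    /** Illustrative subset of the §3.2 default tier. */
    static final Map<String, Long> DEFAULTS = Map.of(
            "max_environments", 1L,
            "max_apps", 3L,
            "max_agents", 5L);

    /** Minimal stand-in for a validated license (assumed shape). */
    static final class License {
        final Instant exp;
        final int gracePeriodDays;
        final Map<String, Long> limits;

        License(Instant exp, int gracePeriodDays, Map<String, Long> limits) {
            this.exp = exp;
            this.gracePeriodDays = gracePeriodDays;
            this.limits = limits;
        }
    }

    /** Pure clock comparison; a null license stands for "no license installed". */
    static State stateAt(License license, Instant now) {
        if (license == null) return State.ABSENT;
        if (now.isBefore(license.exp)) return State.ACTIVE;
        Instant graceEnd = license.exp.plus(Duration.ofDays(license.gracePeriodDays));
        return now.isBefore(graceEnd) ? State.GRACE : State.EXPIRED;
    }

    /** ACTIVE and GRACE overlay the license limits on the defaults; other states use defaults only. */
    static Map<String, Long> effectiveLimits(License license, Instant now) {
        Map<String, Long> merged = new HashMap<>(DEFAULTS);
        State state = stateAt(license, now);
        if (state == State.ACTIVE || state == State.GRACE) {
            merged.putAll(license.limits);
        }
        return merged;
    }

    /** Rejects when currentUsage + requestedDelta would exceed the cap for limitKey. */
    static void assertWithinCap(Map<String, Long> limits, String limitKey,
                                long currentUsage, long requestedDelta) {
        Long cap = limits.get(limitKey);
        if (cap == null) {
            throw new IllegalArgumentException("unknown limit key: " + limitKey);
        }
        if (currentUsage + requestedDelta > cap) {
            throw new IllegalStateException("license cap reached: limit=" + limitKey
                    + " current=" + currentUsage + " cap=" + cap);
        }
    }

    public static void main(String[] args) {
        Instant exp = Instant.parse("2027-04-25T00:00:00Z");
        License license = new License(exp, 30, Map.of("max_apps", 50L));
        Instant duringGrace = exp.plus(Duration.ofDays(10));
        System.out.println(stateAt(license, duringGrace));
        // Creating an eighth app is still allowed while the license is in GRACE.
        assertWithinCap(effectiveLimits(license, duringGrace), "max_apps", 7, 1);
    }
}
```

Because the state is derived on every call rather than stored, a license that crosses `exp` or the end of its grace period needs no scheduled job in this sketch; the next cap check simply sees the new state.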
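
To make the `base64(payload).base64(ed25519_signature)` wire format from §2 and the sign/verify split between minter and server more concrete, here is a minimal round-trip sketch using the JDK's built-in Ed25519 support (Java 15 or later). The class name `EnvelopeSketch`, the choice of the standard (non-URL) Base64 alphabet, and the use of `SecurityException` for a bad signature are assumptions; the real `LicenseMinter` and `LicenseValidator` may differ, and canonical JSON serialization is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;

/** Hypothetical sketch of the envelope; not the actual minter or validator. */
final class EnvelopeSketch {

    /** Signs the payload bytes and joins payload and signature with a dot. */
    static String sign(String payloadJson, PrivateKey privateKey) throws GeneralSecurityException {
        byte[] payload = payloadJson.getBytes(StandardCharsets.UTF_8);
        Signature signer = Signature.getInstance("Ed25519");
        signer.initSign(privateKey);
        signer.update(payload);
        Base64.Encoder enc = Base64.getEncoder();
        return enc.encodeToString(payload) + "." + enc.encodeToString(signer.sign());
    }

    /** Returns the payload JSON if the signature verifies; throws otherwise. */
    static String verify(String token, PublicKey publicKey) throws GeneralSecurityException {
        // The standard Base64 alphabet contains no '.', so the split is unambiguous.
        String[] parts = token.split("\\.");
        if (parts.length != 2) {
            throw new IllegalArgumentException("malformed token");
        }
        byte[] payload = Base64.getDecoder().decode(parts[0]);
        byte[] signature = Base64.getDecoder().decode(parts[1]);
        Signature verifier = Signature.getInstance("Ed25519");
        verifier.initVerify(publicKey);
        verifier.update(payload);
        if (!verifier.verify(signature)) {
            throw new SecurityException("signature does not match payload");
        }
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Throwaway keypair for the demo; the vendor key is generated out of band.
        KeyPair keyPair = KeyPairGenerator.getInstance("Ed25519").generateKeyPair();
        String token = sign("{\"tenantId\":\"acme-corp\"}", keyPair.getPrivate());
        System.out.println(verify(token, keyPair.getPublic()));
    }
}
```

Only `sign` needs the private key, which is why the signing side can live in a module the customer never receives, while `verify` works with the public key alone.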