cameleer-server/docs/handoff/2026-04-26-license-saas-handoff.md
hsiegeln 5864553fed docs(license): minter README + operator guide + SaaS handoff
cameleer-license-minter/README.md — vendor-side guide: build, public
LicenseMinter API, CLI usage with all flags, token format (standard
base64, not url-safe), LicenseInfo schema, Ed25519 key generation,
worked example, security guidance, runtime-separation verification.

docs/license-enforcement.md — operator guide: install paths and
priority (env > file > DB > none), public-key config, REST API,
state machine (ABSENT/ACTIVE/GRACE/EXPIRED/INVALID), default tier
caps, 403 envelope semantics, retention TTL recompute, daily
revalidation, audit + Prometheus surfaces, troubleshooting.

docs/handoff/2026-04-26-license-saas-handoff.md — SaaS playbook:
trust model, onboarding/renewal/revocation runbooks, key management,
cap matrix per plan tier, telemetry, failure modes, testing guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 16:33:12 +02:00


# License Enforcement — SaaS Handoff (2026-04-26)
Handoff for the cameleer-saas team and customer-success engineers operating customer-facing cameleer-server deployments. Covers issuing, renewing, revoking, and operationally observing licenses.
For end-customer operator docs, see `docs/license-enforcement.md`. For minting tooling, see `cameleer-license-minter/README.md`. For the original design + plan, see:
- `docs/superpowers/specs/2026-04-25-license-enforcement-design.md`
- `docs/superpowers/plans/2026-04-25-license-enforcement.md`
## Table of contents
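- [Session context](#session-context)
- [What this delivers](#what-this-delivers)
- [Trust model architecture](#trust-model-architecture)
- [Operational playbook](#operational-playbook)
- [Key management](#key-management)
- [Cap matrix (plan tiers)](#cap-matrix-plan-tiers)
- [Telemetry the SaaS team can observe](#telemetry-the-saas-team-can-observe)
- [Failure modes & runbook](#failure-modes--runbook)
- [Edge cases the SaaS team should know](#edge-cases-the-saas-team-should-know)
- [Testing guidance](#testing-guidance)
- [Pointers](#pointers)
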
---
## Session context
- **Branch:** `feature/runtime-hardening`
- **Commit range:** `ec51aef8..140ea884` — 40 commits delivering the full feature (3 doc/spec/plan commits + 14 implementation commits + 23 follow-ons covering enforcement, retention, metrics, REST surface, integration tests, and rules updates).
- **Plan tasks:** 36 of 36 complete. Tests green: core (122), minter (7), app unit (230), key ITs (`PostgresLicenseRepositoryIT`, `LicenseLifecycleIT`, `LicenseEnforcementIT`, `RetentionRuntimeRecomputeIT`, `SchemaBootstrapIT`).
- **Persisted state:** Flyway migration **V5** — adds the `license` table and three retention columns on `environments` (`execution_retention_days`, `log_retention_days`, `metric_retention_days`).
### Key SHAs
| SHA | Subject |
|---|---|
| `ec51aef8` | start of plan (above this is unrelated runtime-hardening work) |
| `551a7f12` | refactor(license): remove dead Feature enum and isEnabled scaffolding |
| `2ebe4989..0499a54e` | LicenseInfo / Validator / Limits / Gate redesign |
| `896b7e6e..f6657f81` | Standalone `cameleer-license-minter` module |
| `20aefd5b..b95e80a2` | PG schema, repository, service, boot wiring |
| `2bad9c3e..e198c13e` | Enforcement points, retention applier, REST surface, metrics, ITs |
| `140ea884` | docs(rules): document license enforcement classes + endpoints (head) |
## What this delivers
- **Cap enforcement** at 8 surfaces (env/app/agent/user/outbound/alert-rule creation, deploy-time compute caps, jar retention).
- **License lifecycle**: install (env > file > DB > API), daily revalidation cron + 60s post-startup tick, grace period, full state machine (ABSENT/ACTIVE/GRACE/EXPIRED/INVALID).
- **Retention enforcement**: ClickHouse TTL recomputed on every license change for `executions`, `processor_executions`, `logs`, `agent_metrics`, `agent_events`. Effective TTL = `min(licenseCap, env.configured)`.
- **Standalone `cameleer-license-minter` Maven module** for vendor-side license generation. **Not** in the server runtime/compile classpath.
- **Audit trail**: every install/replace/cap_exceeded/revalidate event under `AuditCategory.LICENSE`.
- **Observability**: 3 Prometheus gauges + 1 counter (see [Telemetry](#telemetry-the-saas-team-can-observe)).
- **Default tier**: small fixed caps when no license is installed; intentionally restrictive.
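The retention rule above, effective TTL = `min(licenseCap, env.configured)`, can be illustrated with a minimal sketch. The class and method names are hypothetical and are not the server's actual `RetentionPolicyApplier` code; only the `min` rule itself comes from this document.

```java
/** Sketch only: illustrative names, not the server's real classes. */
final class EffectiveRetention {

    /** Effective TTL in days is the smaller of the license cap and the environment setting. */
    static int effectiveDays(int licenseCapDays, int envConfiguredDays) {
        return Math.min(licenseCapDays, envConfiguredDays);
    }

    public static void main(String[] args) {
        // License allows 30 days, environment asks for 90: the license cap wins.
        System.out.println(effectiveDays(30, 90));
        // License allows 30 days, environment asks for 7: the configured value wins.
        System.out.println(effectiveDays(30, 7));
    }
}
```

A practical consequence is that raising a license cap never lengthens retention on its own; the environment's configured value has to be raised as well.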
## Trust model architecture
```
VENDOR / SaaS CUSTOMER (cameleer-server)
+-------------------------+ +------------------------------------+
| cameleer-license- | | CAMELEER_SERVER_LICENSE_PUBLICKEY |
| minter (CLI/Java) | | CAMELEER_SERVER_TENANT_ID |
| | | |
| Ed25519 PRIVATE key | | Ed25519 PUBLIC key (matching) |
| (HSM / KMS / Vault) | | |
| | | | ^ |
| v | | | validate |
| LicenseMinter.mint | | | |
| | | token (HTTPS) | LicenseValidator |
| +-----token----+----------------->+ | |
| | env-var or POST | v |
+-------------------------+ | LicenseGate (state + limits) |
| | |
| v |
| LicenseEnforcer (cap checks) |
+------------------------------------+
```
The vendor holds the **only** copy of the private key. Customers receive only the public key (over deployment-config channels) and the signed token. A compromised customer can read tokens but cannot forge new ones.
The minter module physically lives in the cameleer-server repo for shared `LicenseInfo` types but is intentionally absent from the runtime classpath of the server. Verify with:
```bash
mvn dependency:tree -pl cameleer-server-app | grep license-minter
# expected: empty (or test-scope only on dev branches)
```
## Operational playbook
### Onboarding a new tenant
1. Choose the tenant id (must match the customer's `CAMELEER_SERVER_TENANT_ID`; lowercase alphanumeric + dashes; immutable).
2. Decide whether to use the shared SaaS signing key or a dedicated per-tenant key. Shared is simpler and standard; per-tenant only if a customer has compliance requirements that mandate isolation.
3. Mint the initial license:
```bash
java -jar cameleer-license-minter-1.0-SNAPSHOT-cli.jar \
--private-key=<vault path>/cameleer-license-priv.pem \
--tenant=<tenant id> \
--label="<Customer Name> (<Plan>)" \
--expires=2027-04-26 \
--grace-days=14 \
--max-environments=<plan> \
--max-apps=<plan> \
--max-agents=<plan> \
--max-users=<plan> \
--max-outbound-connections=<plan> \
--max-alert-rules=<plan> \
--max-total-cpu-millis=<plan> \
--max-total-memory-mb=<plan> \
--max-total-replicas=<plan> \
--max-execution-retention-days=<plan> \
--max-log-retention-days=<plan> \
--max-metric-retention-days=<plan> \
--max-jar-retention-count=<plan> \
--output=/tmp/<tenant>.lic \
--public-key=<vault path>/cameleer-license-pub.b64 \
--verify
```
4. Deliver to the customer's server via either:
- **Container env var** (preferred for SaaS-managed deployments): `CAMELEER_SERVER_LICENSE_TOKEN=<token>` set on the deploy descriptor. Activates at next boot.
- **Admin REST POST** (for hot install on a running server): `POST /api/v1/admin/license` with `{"token": "..."}`. The response body confirms successful installation.
5. Confirm acceptance: `GET /api/v1/admin/license` returns `state=ACTIVE`, the audit log shows `install_license`/`SUCCESS`, and `cameleer_license_state{state="ACTIVE"} == 1.0` in Prometheus.
### Renewing a license
1. Mint a new token with a later `--expires`. Use a **fresh `licenseId`** so the audit trail clearly distinguishes the renewal from the prior license.
2. Install via admin POST. The PG `license` row is updated in place (one row per tenant, upserted on `tenant_id`); the audit row records `replace_license` with `previousLicenseId`.
3. Confirm `lastValidatedAt` advances on the next 03:00 cron tick (or trigger revalidation sooner with a restart or a `POST /api/v1/admin/license`).
### Adjusting caps mid-term
Same as renewal: mint a new token with the new limits and install. The `limits` map of the new license replaces the prior one entirely (no merging — only `DefaultTierLimits` provides fallback for keys the new license omits).
If the customer is **lowering** caps below current usage, there is no automatic enforcement against existing entities — only future creates are rejected. Communicate the implication clearly. The `/api/v1/admin/license/usage` endpoint after install will show `current > cap` rows, which is the operator's signal to clean up.
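A rough sketch of the replace-not-merge behaviour described above. `LimitResolutionSketch` and its methods are hypothetical stand-ins for the real `DefaultTierLimits` lookup; the default values are a subset copied from the cap matrix in this document.

```java
import java.util.Map;

/** Sketch only: illustrative names, not the server's real limit resolution. */
final class LimitResolutionSketch {

    // Subset of the default-tier values listed in the cap matrix.
    static final Map<String, Integer> DEFAULTS = Map.of(
            "max_environments", 1,
            "max_apps", 3,
            "max_agents", 5);

    /** The installed license's map wins outright; defaults fill only the keys it omits. */
    static int effectiveCap(Map<String, Integer> licenseLimits, String key) {
        Integer cap = licenseLimits.containsKey(key) ? licenseLimits.get(key) : DEFAULTS.get(key);
        if (cap == null) {
            throw new IllegalArgumentException("unknown limit key: " + key);
        }
        return cap;
    }

    /** True when existing usage already exceeds the cap, i.e. the operator has cleanup to do. */
    static boolean overCap(int current, int cap) {
        return current > cap;
    }

    public static void main(String[] args) {
        // The prior license granted max_agents=100, but the new one lists only max_apps.
        Map<String, Integer> renewed = Map.of("max_apps", 50);
        System.out.println(effectiveCap(renewed, "max_apps"));   // taken from the new license
        System.out.println(effectiveCap(renewed, "max_agents")); // falls back to the default tier
    }
}
```

The takeaway for minting is to list every limit explicitly on each re-issue; a key left out of the new token silently drops to the default tier rather than inheriting the previous license's value.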
### Revoking a license
There is no remote revocation. Practical options:
1. **Wait for expiry.** Short license terms (12 months max) keep this honest.
2. **Rotate the public key.** Push a new `CAMELEER_SERVER_LICENSE_PUBLICKEY` to the customer's server config and restart. All existing tokens become `INVALID` because the signature no longer verifies. This is destructive (all customers sharing this signing key need a re-issue), so reserve for true compromise scenarios.
3. **Deploy a corrupted token.** If the customer cooperates, set `CAMELEER_SERVER_LICENSE_TOKEN` to garbage; the boot loader marks it `INVALID` and default-tier caps apply.
In all cases the customer falls to default-tier caps (1 env, 3 apps, 5 agents). They can continue running for evaluation, but new creates beyond those caps fail with 403.
### Migrating a license between server instances
Tokens are bound to `tenantId`, not to a particular server instance. A token works on any server configured for the same tenant. To migrate:
1. Provision the new server with `CAMELEER_SERVER_TENANT_ID=<same id>` and `CAMELEER_SERVER_LICENSE_PUBLICKEY=<same key>`.
2. Install the existing token on the new server (env var or POST). PG state is fresh on the new instance — usage starts at zero.
3. Decommission the old server.
If both run simultaneously they both pass validation (same token, same key, same tenant id) and both apply the caps independently against their own local state — usage is **not** federated.
## Key management
### Where the signing key lives
The SaaS team's Ed25519 private key is the trust root. Place it in:
- **Production:** AWS KMS, GCP KMS, Azure Key Vault (with a non-exportable signing key) **or** HashiCorp Vault Transit. The minter API supports signing via a `PrivateKey` instance, so a custom integration that asks the KMS to sign canonicalized payload bytes is straightforward to build on top of `LicenseMinter.canonicalPayload(...)` (it's `static`-accessible for that purpose).
- **Pre-production / dev:** sealed file in a single privileged operator's home directory. Never on a CI server, never in the repo.
For high-security environments, the minter CLI's `--private-key=<path>` is the wrong fit — it requires the key bytes to be readable. Use the Java API directly:
```java
PrivateKey kmsKey = kmsClient.getSigningKey("cameleer-license-prod");
String token = LicenseMinter.mint(info, kmsKey);
```
The JCE provider for the KMS handles signing; the private bytes never leave the KMS.
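For orientation, here is a self-contained sketch of what signing payload bytes with Ed25519 and assembling a `<base64payload>.<base64sig>` token looks like using only the JDK (15 or later). It is not the `LicenseMinter` implementation: the class name, the method names, and the placeholder payload are assumptions, and the real canonical payload format is whatever `LicenseMinter.canonicalPayload(...)` produces. With a KMS-backed `PrivateKey`, the same `Signature` calls would be served by the KMS vendor's JCE provider, assuming that provider is registered.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;

/** Sketch only: hypothetical helper, not the minter's real code. */
final class TokenSigningSketch {

    /** Signs the payload and joins payload and signature with '.', both in standard base64. */
    static String signToToken(byte[] payload, PrivateKey key) {
        try {
            Signature signer = Signature.getInstance("Ed25519");
            signer.initSign(key);
            signer.update(payload);
            Base64.Encoder b64 = Base64.getEncoder();
            return b64.encodeToString(payload) + "." + b64.encodeToString(signer.sign());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("signing failed", e);
        }
    }

    /** Returns true only if the signature part verifies over the payload part. */
    static boolean verifyToken(String token, PublicKey key) {
        int dot = token.indexOf('.');
        if (dot < 0) {
            return false;
        }
        try {
            byte[] payload = Base64.getDecoder().decode(token.substring(0, dot));
            byte[] sig = Base64.getDecoder().decode(token.substring(dot + 1));
            Signature verifier = Signature.getInstance("Ed25519");
            verifier.initVerify(key);
            verifier.update(payload);
            return verifier.verify(sig);
        } catch (GeneralSecurityException | IllegalArgumentException e) {
            return false; // malformed base64 or malformed signature bytes
        }
    }

    /** Throwaway in-memory keypair, standing in for a KMS-held key. */
    static KeyPair newKeyPair() {
        try {
            return KeyPairGenerator.getInstance("Ed25519").generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Ed25519 not available (needs JDK 15+)", e);
        }
    }

    public static void main(String[] args) {
        KeyPair kp = newKeyPair();
        // Placeholder payload; the real bytes come from the minter's canonicalization.
        byte[] payload = "{\"tenantId\":\"test-tenant\"}".getBytes(StandardCharsets.UTF_8);
        String token = signToToken(payload, kp.getPrivate());
        System.out.println(verifyToken(token, kp.getPublic()));
    }
}
```

Because standard base64 never emits a `.` character, splitting at the first dot is unambiguous.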
### Public key distribution
Each tenant's server reads the public key from `CAMELEER_SERVER_LICENSE_PUBLICKEY` (base64-encoded X.509 SPKI). Distribute via:
- **Helm values / Kubernetes Secret** for k8s-orchestrated tenants.
- **Docker compose env file** for self-hosted tenants.
- **Bare environment variable on the host** for VM tenants.
A typo or whitespace difference will cause every license to be rejected. Build a smoke test that boots a sandbox server with the candidate public key and POSTs a known-good test token.
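Before running the full sandbox smoke test, a cheap local preflight can catch the most common mistakes in the configured value. The sketch below is hypothetical (the class and method names are not part of the server) and only checks that the string is clean standard base64 that parses as an Ed25519 X.509 SPKI key on JDK 15+. It does not prove the key matches the signing key, so it complements the smoke test rather than replacing it.

```java
import java.security.GeneralSecurityException;
import java.security.KeyFactory;
import java.security.KeyPairGenerator;
import java.security.PublicKey;
import java.security.spec.X509EncodedKeySpec;
import java.util.Base64;

/** Sketch only: a local sanity check for a candidate CAMELEER_SERVER_LICENSE_PUBLICKEY value. */
final class PublicKeyPreflight {

    /** Returns null when the value parses as a base64 X.509 SPKI Ed25519 key, else a short reason. */
    static String problemWith(String configured) {
        if (configured == null || configured.isEmpty()) {
            return "empty value";
        }
        if (!configured.equals(configured.strip())) {
            return "leading or trailing whitespace";
        }
        try {
            byte[] der = Base64.getDecoder().decode(configured);
            KeyFactory.getInstance("Ed25519").generatePublic(new X509EncodedKeySpec(der));
            return null;
        } catch (IllegalArgumentException e) {
            return "not valid standard base64";
        } catch (GeneralSecurityException e) {
            return "not an Ed25519 SPKI key";
        }
    }

    /** Generates a throwaway public key in the expected encoding, for demonstration. */
    static String freshKeyBase64() {
        try {
            PublicKey pub = KeyPairGenerator.getInstance("Ed25519").generateKeyPair().getPublic();
            return Base64.getEncoder().encodeToString(pub.getEncoded());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Ed25519 not available (needs JDK 15+)", e);
        }
    }

    public static void main(String[] args) {
        String good = freshKeyBase64();
        System.out.println(problemWith(good));        // null means no problem found
        System.out.println(problemWith(good + "\n")); // a trailing newline is flagged
    }
}
```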
### Rotation playbook
Rotation is the trickiest part. The validator does not support multiple public keys — exactly one is configured. Procedure:
1. **Generate the new keypair** in production storage (KMS / Vault).
2. **Coordinate downtime windows** with each customer running on the old key. There is no overlap-period mechanism; you must:
- Push the new public key to all tenants (config rollout, restart).
- Re-mint and re-deliver every active license under the new key.
- Each customer's server is `INVALID` between the public-key change and the new token install.
3. **Decommission the old private key** only after every active license has been re-issued.
To avoid emergency rotations, sign with a **fresh** keypair every 24 months on a planned schedule. License terms shorter than the rotation interval keep customer impact bounded — at most one re-issue per customer per rotation.
## Cap matrix (plan tiers)
These are suggested values — adjust to your pricing model. Caps not listed fall through to defaults.
| Limit key | Default (no license) | Starter | Team | Business | Enterprise |
|---|---|---|---|---|---|
| `max_environments` | 1 | 2 | 5 | 10 | 50 |
| `max_apps` | 3 | 10 | 50 | 200 | 1000 |
| `max_agents` | 5 | 20 | 100 | 500 | 5000 |
| `max_users` | 3 | 5 | 25 | 100 | 1000 |
| `max_outbound_connections` | 1 | 5 | 25 | 100 | 500 |
| `max_alert_rules` | 2 | 10 | 50 | 200 | 1000 |
| `max_total_cpu_millis` | 2000 | 8000 | 32000 | 128000 | 512000 |
| `max_total_memory_mb` | 2048 | 8192 | 32768 | 131072 | 524288 |
| `max_total_replicas` | 5 | 25 | 100 | 500 | 2000 |
| `max_execution_retention_days` | 1 | 7 | 30 | 90 | 365 |
| `max_log_retention_days` | 1 | 7 | 30 | 90 | 180 |
| `max_metric_retention_days` | 1 | 7 | 30 | 90 | 180 |
| `max_jar_retention_count` | 3 | 5 | 10 | 25 | 50 |
## Telemetry the SaaS team can observe
### Audit log
Every license event lives in `audit_log` with `category=LICENSE`. Useful queries:
```sql
-- Last 30 license events (run against tenant X's database; the query has no tenant filter)
SELECT timestamp, username, action, target, result, detail
FROM audit_log
WHERE category = 'LICENSE'
ORDER BY timestamp DESC
LIMIT 30;
-- Limits rejected most often in the last 24h
SELECT target AS limit, COUNT(*) AS rejections
FROM audit_log
WHERE category = 'LICENSE' AND action = 'cap_exceeded'
AND timestamp > now() - INTERVAL '24 hours'
GROUP BY target
ORDER BY rejections DESC;
-- Rejected license installs, most recent first
SELECT timestamp, detail->>'reason' AS reason, detail->>'source' AS source
FROM audit_log
WHERE category = 'LICENSE' AND action = 'reject_license'
ORDER BY timestamp DESC;
```
### Prometheus metrics
| Metric | Type | Labels | Use |
|---|---|---|---|
| `cameleer_license_state` | gauge | `state` | Dashboard tile: which state is each tenant in. One-hot per state. |
| `cameleer_license_days_remaining` | gauge | (none) | Renewal alerting. Recommended thresholds: warn at 30 days, page at 7 days, critical at 1 day. `-1.0` means no license. |
| `cameleer_license_last_validated_age_seconds` | gauge | (none) | Detect stuck schedulers. Alert at >86400. |
| `cameleer_license_cap_rejections_total` | counter | `limit` | Account-management signal — customers consistently hitting caps are upgrade prospects. |
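The renewal thresholds in the table can be expressed as a small classification function, for example in SaaS-side tooling that polls the gauge. This is a sketch under stated assumptions: the names are hypothetical, the thresholds are treated as inclusive, and any value other than exactly `-1.0` is treated as a real day count (this document does not say what the gauge reports for an already-expired license).

```java
/** Sketch only: maps cameleer_license_days_remaining to the alert levels suggested above. */
final class RenewalAlertSketch {

    enum Severity { NO_LICENSE, OK, WARN, PAGE, CRITICAL }

    static Severity classify(double daysRemaining) {
        if (daysRemaining == -1.0) {
            return Severity.NO_LICENSE; // sentinel value: no license installed
        }
        if (daysRemaining <= 1) {
            return Severity.CRITICAL;
        }
        if (daysRemaining <= 7) {
            return Severity.PAGE;
        }
        if (daysRemaining <= 30) {
            return Severity.WARN;
        }
        return Severity.OK;
    }

    public static void main(String[] args) {
        for (double d : new double[] {-1.0, 45, 30, 7, 1}) {
            System.out.println(d + " -> " + classify(d));
        }
    }
}
```

Checking the sentinel first matters: a naive `days <= 1` rule would page on every unlicensed evaluation server.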
### REST API
`/api/v1/admin/license/usage` returns the per-limit current/cap/source table. Wire this into your SaaS-side admin UI for an at-a-glance per-tenant view. The endpoint requires an ADMIN-role JWT; SaaS-side automation can mint short-lived ADMIN tokens scoped per tenant or use a shared service account.
## Failure modes & runbook
### "Customer reports 403s after upgrade"
1. Pull `/api/v1/admin/license/usage`. Identify which `limit` row has `current >= cap`.
2. If `state = ACTIVE` and a higher-tier license is owed, mint and install it.
3. If `state = EXPIRED`/`INVALID`/`ABSENT`, fix the license-state issue first — the cap rejection is downstream of that.
4. Confirm by replaying the failing operation; the 403 should clear.
### "Customer reports state=INVALID"
1. Pull `/api/v1/admin/license` — note `invalidReason`.
2. Most likely causes:
- Public-key mismatch — the customer's `CAMELEER_SERVER_LICENSE_PUBLICKEY` differs from the key used to mint. Diff the two values byte-for-byte.
- Tenant mismatch — `CAMELEER_SERVER_TENANT_ID` on the server differs from the `--tenant` used when minting. The customer must restart with the correct tenant id (it's immutable for the lifetime of the deployment because it appears in PG schema names and CH partition keys — coordinate carefully).
- Token tampering — base64-decode the payload portion (`<base64payload>.<base64sig>`), confirm the JSON looks well-formed.
3. Re-mint or fix config; re-install.
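The token inspection step above can be done with a few lines of JDK code. This sketch assumes the payload is UTF-8 JSON encoded with standard (not URL-safe) base64, as described for the token format; the class name and the sample payload are hypothetical. It only decodes for eyeballing and does not verify the signature.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch only: decodes the payload half of a <base64payload>.<base64sig> token. */
final class TokenPayloadPeek {

    static String payloadJson(String token) {
        int dot = token.indexOf('.');
        if (dot <= 0) {
            throw new IllegalArgumentException("expected <base64payload>.<base64sig>");
        }
        // Standard base64 decoder; it rejects URL-safe characters and stray whitespace.
        byte[] raw = Base64.getDecoder().decode(token.substring(0, dot));
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Build a fake token with a dummy signature part, purely for demonstration.
        String json = "{\"tenantId\":\"test-tenant\"}";
        String token = Base64.getEncoder().encodeToString(json.getBytes(StandardCharsets.UTF_8))
                + "." + Base64.getEncoder().encodeToString(new byte[] {1, 2, 3});
        System.out.println(payloadJson(token));
    }
}
```

If decoding throws `IllegalArgumentException`, the token was most likely truncated or re-encoded in transit (for example, line-wrapped by a mail client), which points to a delivery problem rather than a key mismatch.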
### "License will expire in N days"
1. Alert on `cameleer_license_days_remaining < 30`.
2. Mint a renewal license (new `licenseId`, later `expiresAt`).
3. Install via the customer's preferred channel (env-var on next deploy, or hot via POST).
### "Audit table fills up with cap_exceeded rows"
Customer is hammering a creation path. Either:
- They genuinely outgrew their tier — upgrade conversation.
- Their automation has a runaway loop creating environments/apps. Coordinate with the customer to throttle and clean up.
The `cameleer_license_cap_rejections_total{limit=...}` counter is more efficient for monitoring this than scanning audit; use audit only for forensic detail.
### "TTL recompute logs WARN: Failed to apply TTL"
`RetentionPolicyApplier` could not run `ALTER TABLE ... MODIFY TTL` on ClickHouse. The license install itself succeeded; only the retention update failed. Check:
- ClickHouse user has `ALTER` privilege on the cameleer DB.
- ClickHouse version is >= 22.3 (required for `WHERE` predicate on TTL).
- ClickHouse cluster health.
## Edge cases the SaaS team should know
- **Default tier is restrictive on purpose.** A customer on default tier cannot stand up a real production workload (1 env, 3 apps, 5 agents, 1-day retention). Onboarding should always include license install before the customer adds any real workload.
- **Grace period defaults to 0.** If you want a buffer between `expiresAt` and capability loss, set `--grace-days=N` at mint time. We recommend 14 days for paid plans so a slipped renewal doesn't immediately drop the customer to default-tier caps.
- **Public key change invalidates all installed tokens immediately on next revalidation.** Daily revalidation runs at 03:00 server-local time, with a 60-second post-startup tick. A surprise public-key rollout will surface as `state=INVALID` for every customer running on the old key on the next tick or restart.
- **Caps reduce on revalidation, not just install.** A token whose `expiresAt` lapses will, at the next revalidation, transition `ACTIVE → GRACE → EXPIRED` automatically, dropping caps to default-tier on the EXPIRED transition. The state change is announced via `LicenseChangedEvent` and triggers TTL recompute.
- **Compute caps are evaluated at deploy time, not at runtime.** A deployment that successfully started under a high-tier license will keep running unchanged when the license downgrades. Only the *next* deploy attempt will see the new cap.
- **Agent count is in-memory.** `max_agents` is enforced against `AgentRegistryService.liveCount()` (agents in the LIVE state). A restart resets the count to zero until agents re-register. This is by design: DEAD agents shouldn't pin a license slot.
- **License id changes on every renewal.** Always use a fresh `UUID.randomUUID()` when minting a renewal. The audit `previousLicenseId` field then tells you which token superseded which.
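The time-driven part of the state machine (`ACTIVE → GRACE → EXPIRED`) can be sketched as a pure function of the current time, `expiresAt`, and the grace days set at mint time. The names below are hypothetical and the boundary handling (a token is still ACTIVE at exactly `expiresAt`, and still in GRACE at exactly the end of the grace window) is an assumption, not confirmed server behaviour. ABSENT and INVALID are left out because they depend on installation and signature checks rather than on time.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch only: time-based license state, ignoring ABSENT and INVALID. */
final class LicenseStateSketch {

    enum State { ACTIVE, GRACE, EXPIRED }

    static State stateAt(Instant now, Instant expiresAt, int graceDays) {
        if (!now.isAfter(expiresAt)) {
            return State.ACTIVE;
        }
        Instant graceEnd = expiresAt.plus(Duration.ofDays(graceDays));
        // With graceDays == 0 the grace window is empty, so the state goes straight to EXPIRED.
        return now.isAfter(graceEnd) ? State.EXPIRED : State.GRACE;
    }

    public static void main(String[] args) {
        Instant expiresAt = Instant.parse("2027-04-26T00:00:00Z");
        System.out.println(stateAt(expiresAt.minus(Duration.ofDays(1)), expiresAt, 14));
        System.out.println(stateAt(expiresAt.plus(Duration.ofDays(1)), expiresAt, 14));
        System.out.println(stateAt(expiresAt.plus(Duration.ofDays(15)), expiresAt, 14));
        System.out.println(stateAt(expiresAt.plus(Duration.ofDays(1)), expiresAt, 0));
    }
}
```

Keep in mind that the server only re-evaluates state on revalidation (the daily cron, the post-startup tick, or an install), so an observed transition can lag the instant computed here by up to a day.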
## Testing guidance
Three approaches for dry-running licenses without touching a customer server:
### 1. Pure unit test — `LicenseMinter` round-trip with `LicenseValidator`
```java
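import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Base64;
import java.util.Map;
import java.util.UUID;
// LicenseInfo, LicenseMinter, LicenseValidator come from the project; assertEquals from JUnit.
// Throwaway Ed25519 keypair (JDK 15+); the validator takes the public key as base64 X.509 SPKI.
KeyPair kp = KeyPairGenerator.getInstance("Ed25519").generateKeyPair();
String pubB64 = Base64.getEncoder().encodeToString(kp.getPublic().getEncoded());
LicenseInfo info = new LicenseInfo(
        UUID.randomUUID(), "test-tenant", "Test", Map.of("max_apps", 50),
        Instant.now(), Instant.now().plus(365, ChronoUnit.DAYS), 0
);
String token = LicenseMinter.mint(info, kp.getPrivate());
LicenseValidator validator = new LicenseValidator(pubB64, "test-tenant");
LicenseInfo parsed = validator.validate(token);
assertEquals(info.licenseId(), parsed.licenseId());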
```
This is the model already used in `LicenseMinterTest` and `LicenseValidatorTest` in the repo — copy from there.
### 2. CLI dry-run — mint and self-verify
```bash
java -jar cameleer-license-minter-1.0-SNAPSHOT-cli.jar \
--private-key=test-priv.pem \
--public-key=test-pub.b64 \
--tenant=test-tenant \
--expires=2027-12-31 \
--max-apps=50 \
--output=/tmp/test.lic \
--verify
```
`--verify` runs the full `LicenseValidator.validate(...)` round-trip and exits 3 on failure. Useful for shaking out wrong-key / wrong-tenant before sending to a customer.
### 3. Test server with a test public key
Spin up a sandbox cameleer-server (docker-compose or k8s-test-namespace) with:
```yaml
environment:
CAMELEER_SERVER_TENANT_ID: test-tenant
CAMELEER_SERVER_LICENSE_PUBLICKEY: <test public key base64>
```
Install the test license, exercise the customer's reported scenario, observe `state` transitions and audit rows. The `LicenseLifecycleIT` and `LicenseEnforcementIT` integration tests in `cameleer-server-app/src/test/java/.../license/` are good templates for full-stack reproduction.
## Pointers
| Document | Audience |
|---|---|
| `cameleer-license-minter/README.md` | Vendor-side mint operations |
| `docs/license-enforcement.md` | End-customer operators (install, monitor, troubleshoot) |
| `docs/superpowers/specs/2026-04-25-license-enforcement-design.md` | Original design rationale |
| `docs/superpowers/plans/2026-04-25-license-enforcement.md` | Implementation plan (36 tasks) |
| `.claude/rules/core-classes.md` `# license/` section | License domain class map |
| `.claude/rules/app-classes.md` `# license/` section | Server license-app class map + endpoint surface |