cameleer-server/docs/superpowers/specs/2026-04-25-license-enforcement-design.md
hsiegeln 0e512a3c0c docs(license): brainstorm spec for license enforcement design
Captures the agreed design for enforcing licensing on cameleer-server:
- Default tier with hard caps when no license is configured
- Arbitrary per-customer limits in signed Ed25519 license tokens
- Standalone cameleer-license-minter module (vendor-only)
- DB-persisted license with env/file override paths
- ABSENT/ACTIVE/GRACE/EXPIRED state machine; offline expiry only
- Removes the dead Feature enum scaffolding

Pending writing-plans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 21:55:18 +02:00


License Enforcement — Design

Date: 2026-04-25
Status: Approved (brainstorm); pending writing-plans
Related: cameleer-saas#7 (Epic: License & Feature Gating), cameleer-saas#42 (vendor minting), cameleer-saas#50 (customer license view)

Problem

cameleer-server ships a license skeleton (LicenseValidator, LicenseGate, admin endpoint), but nothing is enforced. Open mode (no license configured) currently grants all features and imposes no limits, which is the opposite of what we want for a self-hosted distribution that needs to gate scale behind a paid license.

We want:

  1. A self-hosted server with no license to operate within a small, hard-coded "default tier" that is enough to evaluate the product but not enough to run it in production.
  2. Licenses to express arbitrary per-customer limits (no fixed tiers) on a vendor-defined set of resources: entity counts, compute footprint, retention.
  3. A standalone minter owned by the vendor that signs licenses with an Ed25519 private key the customer never sees.
  4. Licenses to be persisted on the server, installable via env var, file, or admin POST, and renewable by replacement.
  5. Revocation handled out of band (vendor suspends the SaaS tenant, or issues short-exp licenses) — no online revocation callback in v1.

Non-goals

  • Feature flags. The current Feature enum (topology/lineage/correlation/debugger/replay) is dead scaffolding and gets removed; this design is about quantitative limits only.
  • Ingestion-rate limits (executions/minute, logs/minute). Defer to a follow-up.
  • Online revocation. Vendor uses shorter exp + reissue; SaaS suspension is independent.
  • Auto-deletion of resources when caps are lowered. Existing rows stay; only new creates reject.
  • Minter keypair generation tooling. Vendor uses standard openssl genpkey -algorithm ed25519 out of band.

1. Architecture

1.1 Module layout

cameleer-server-core/                    (existing — pure domain, no Spring)
└── license/
    ├── LicenseInfo                      (record — see §2)
    ├── LicenseLimits                    (typed wrapper over the limits map)
    ├── LicenseValidator                 (existing, payload schema updated)
    ├── LicenseGate                      (existing, gutted: no Feature; getLimits() only)
    ├── LicenseStateMachine              (NEW — pure FSM: ABSENT / ACTIVE / GRACE / EXPIRED)
    └── DefaultTierLimits                (constant — §3.2 numbers)

cameleer-server-app/                     (existing — Spring, web, persistence)
├── license/
│   ├── LicenseRepository                (NEW — PostgreSQL persistence)
│   ├── LicenseService                   (NEW — load/save/replace; emits state events)
│   ├── LicenseEnforcer                  (NEW — assertWithinCap entry point)
│   ├── LicenseUsageReader               (NEW — counts current usage for /usage endpoint)
│   ├── LicenseCapExceededException      (NEW — mapped to 403 by ControllerAdvice)
│   └── LicenseMetrics                   (NEW — Prometheus gauges)
├── controller/
│   ├── LicenseAdminController           (existing — extended; persists, audited)
│   └── LicenseUsageController           (NEW — GET /admin/license/usage)
└── config/
    └── LicenseBeanConfig                (existing — extended for DB load order)

cameleer-license-minter/                 (NEW — top-level Maven module)
├── pom.xml                              (depends on cameleer-server-core)
├── LicenseMinter                        (signing primitive; takes private key + LicenseInfo)
└── cli/LicenseMinterCli                 (CLI main class)

1.2 Why a separate cameleer-license-minter module

Not shipped in the runtime JAR. Vendor distributes it independently or builds it from source on a trusted machine. Customers never receive it.

This is module hygiene plus a smaller runtime attack surface, not a cryptographic protection: forging a license requires the vendor's private key, and the public key in the server is enough to reject forged tokens regardless of where the minter code lives.

1.3 Dependency graph

cameleer-license-minter ──▶ cameleer-server-core (LicenseInfo schema only)
cameleer-server-app     ──▶ cameleer-server-core (validator, gate, FSM, defaults)
cameleer-saas           ──▶ cameleer-license-minter (for SaaS-mode minting)
cameleer-saas           ──▶ cameleer-server-core   (transitive)

cameleer-server-app has no dependency on cameleer-license-minter.


2. License envelope

Wire format unchanged: base64(payload).base64(ed25519_signature). Payload schema:

{
  "licenseId": "550e8400-e29b-41d4-a716-446655440000",
  "tenantId": "acme-corp",
  "label": "ACME prod 2026",
  "iat": 1745539200,
  "exp": 1777075200,
  "gracePeriodDays": 30,
  "limits": {
    "max_environments": 5,
    "max_apps": 50,
    "max_agents": 100,
    "max_users": 25,
    "max_outbound_connections": 10,
    "max_alert_rules": 200,
    "max_total_cpu_millis": 32000,
    "max_total_memory_mb": 65536,
    "max_total_replicas": 100,
    "max_execution_retention_days": 90,
    "max_log_retention_days": 30,
    "max_metric_retention_days": 365,
    "max_jar_retention_count": 10
  }
}
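A minimal sketch of checking the wire format above, assuming the standard (non-URL) base64 alphabet and the JDK's built-in Ed25519 support (Java 15+). TokenEnvelopeSketch and its method names are hypothetical; the real check lives in LicenseValidator and may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;

// Hypothetical sketch, not the project's LicenseValidator.
final class TokenEnvelopeSketch {

    /** Returns the payload JSON when the signature verifies; throws otherwise. */
    static String verify(String token, PublicKey publicKey) {
        String[] parts = token.split("\\.");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected base64(payload).base64(signature)");
        }
        try {
            byte[] payload = Base64.getDecoder().decode(parts[0]);
            byte[] signature = Base64.getDecoder().decode(parts[1]);
            Signature verifier = Signature.getInstance("Ed25519");
            verifier.initVerify(publicKey);
            verifier.update(payload);
            if (!verifier.verify(signature)) {
                throw new IllegalArgumentException("signature mismatch");
            }
            return new String(payload, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalArgumentException("signature check failed", e);
        }
    }

    /** Demo-only signer so a round trip can be shown; real signing belongs to the minter. */
    static String sign(String payloadJson, PrivateKey privateKey) {
        try {
            byte[] payload = payloadJson.getBytes(StandardCharsets.UTF_8);
            Signature signer = Signature.getInstance("Ed25519");
            signer.initSign(privateKey);
            signer.update(payload);
            Base64.Encoder b64 = Base64.getEncoder();
            return b64.encodeToString(payload) + "." + b64.encodeToString(signer.sign());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("signing failed", e);
        }
    }

    /** Throwaway keypair for demos and tests. */
    static KeyPair newKeyPair() {
        try {
            return KeyPairGenerator.getInstance("Ed25519").generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Ed25519 unavailable", e);
        }
    }
}
```

Because the signature covers the exact payload bytes, swapping in a different payload while reusing an old signature fails verification. Field-level rules (tenant match, required fields) would run after this step on the returned JSON.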

2.1 Field rules

| Field | Required | Notes |
| --- | --- | --- |
| licenseId | yes | UUID. Used in audit + future revocation. |
| tenantId | optional | If present and CAMELEER_SERVER_TENANT_ID differs, treat as no license + log error. Air-gapped customers may omit. |
| label | optional | Free-form human description. Surfaced in UI. |
| iat | yes | Unix seconds. |
| exp | yes | Unix seconds. |
| gracePeriodDays | optional, default 0 | Days exp may be in the past while limits still apply. |
| limits.* | each optional | Missing key inherits from DefaultTierLimits. A license can lift any subset. |

2.2 Removed from the current envelope

  • tier (string) — was a non-functional label. Folded into label.
  • features (array) — out of scope. Feature enum deleted.

3. License state machine

                                                      exp + grace passes
       ┌─────────┐  install valid  ┌────────┐  exp  ┌────────┐ ────────► ┌─────────┐
       │ ABSENT  │ ───────────────▶│ ACTIVE │──────▶│ GRACE  │           │ EXPIRED │
       └─────────┘                  └────────┘       └────────┘           └─────────┘
            ▲                            │              │ ▲                   │
            │ install invalid            │ replace      │ │ replace valid     │ replace
            │  (sig/tenant/parse)        ▼              │ │                   ▼
            └────────────────────────────┴──────────────┴─┴───────────────────┘
                                              all transitions persist + audit-log

3.1 State semantics

| State | Effective limits | Trigger |
| --- | --- | --- |
| ABSENT | DefaultTierLimits | No DB row, or signature/tenant/parse failure. |
| ACTIVE | merge(default, license.limits) | License loaded, now < exp. |
| GRACE | Same as ACTIVE | exp ≤ now < exp + gracePeriodDays. UI banner. |
| EXPIRED | DefaultTierLimits | now ≥ exp + gracePeriodDays. Distinct UI label vs ABSENT. |

State is recomputed on every limit check (clock comparison only) — no scheduler needed for transitions. The only "background" behaviour is the Prometheus gauge refresh.
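The clock-only recomputation can be sketched as a pure function. LicenseStateSketch and stateAt are hypothetical names, and a null exp stands in for "no valid license loaded"; the real LicenseStateMachine may model this differently.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the state derivation in §3.1.
final class LicenseStateSketch {

    enum State { ABSENT, ACTIVE, GRACE, EXPIRED }

    /** Derives the state from the license expiry, the grace period, and the current time. */
    static State stateAt(Instant exp, int gracePeriodDays, Instant now) {
        if (exp == null) {
            return State.ABSENT;                 // no row, or validation failed
        }
        if (now.isBefore(exp)) {
            return State.ACTIVE;                 // now < exp
        }
        Instant graceEnd = exp.plus(Duration.ofDays(gracePeriodDays));
        // exp <= now < exp + grace is GRACE; anything later is EXPIRED
        return now.isBefore(graceEnd) ? State.GRACE : State.EXPIRED;
    }
}
```

With the default gracePeriodDays of 0 the GRACE window is empty, so a license moves straight from ACTIVE to EXPIRED at exp.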

3.2 Default tier (the "no license" caps)

| Limit | Default |
| --- | --- |
| max_environments | 1 |
| max_apps | 3 |
| max_agents | 5 |
| max_users | 3 |
| max_outbound_connections | 1 |
| max_alert_rules | 2 |
| max_total_cpu_millis | 2000 (2 cores) |
| max_total_memory_mb | 2048 (2 GB) |
| max_total_replicas | 5 |
| max_execution_retention_days | 1 |
| max_log_retention_days | 1 |
| max_metric_retention_days | 1 |
| max_jar_retention_count | 3 |

Encoded as public static final Map<String, Integer> DEFAULTS in DefaultTierLimits. Keys match the license payload exactly.
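A sketch of how merge(default, license.limits) from §3.1 could work over such a map. Only three of the documented keys are shown for brevity, and dropping license keys that have no default is an assumption of this sketch, not something the spec decides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the real DefaultTierLimits carries every key from the table above.
final class EffectiveLimitsSketch {

    static final Map<String, Integer> DEFAULTS = Map.of(
            "max_environments", 1,
            "max_apps", 3,
            "max_agents", 5);

    /** For every known key, the license value wins when present; otherwise the default applies. */
    static Map<String, Integer> merge(Map<String, Integer> defaults, Map<String, Integer> licenseLimits) {
        Map<String, Integer> effective = new LinkedHashMap<>();
        defaults.forEach((key, fallback) ->
                effective.put(key, licenseLimits.getOrDefault(key, fallback)));
        return effective;
    }
}
```

For example, a license carrying only max_apps = 50 yields max_apps = 50 while max_environments and max_agents stay at 1 and 5.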


4. Enforcement map

Every limit check goes through one method on LicenseEnforcer:

void assertWithinCap(String limitKey, long currentUsage, long requestedDelta);

Throws LicenseCapExceededException(limitKey, current, cap) when currentUsage + requestedDelta > cap. A @ControllerAdvice maps it to 403 with body {"error":"license cap reached","limit":"max_apps","current":3,"cap":3}.
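A minimal sketch of the check itself, assuming the effective limits are available as a map. EnforcerSketch is a hypothetical stand-in for LicenseEnforcer, and rejecting unknown limit keys is an assumption of this sketch.

```java
import java.util.Map;

// Hypothetical stand-in for LicenseEnforcer; the exception is nested only to keep the sketch in one class.
final class EnforcerSketch {

    static final class LicenseCapExceededException extends RuntimeException {
        final String limitKey;
        final long current;
        final long cap;

        LicenseCapExceededException(String limitKey, long current, long cap) {
            super("license cap reached: " + limitKey + " (" + current + "/" + cap + ")");
            this.limitKey = limitKey;
            this.current = current;
            this.cap = cap;
        }
    }

    private final Map<String, Integer> effectiveLimits;

    EnforcerSketch(Map<String, Integer> effectiveLimits) {
        this.effectiveLimits = effectiveLimits;
    }

    /** Throws when currentUsage + requestedDelta would exceed the cap for limitKey. */
    void assertWithinCap(String limitKey, long currentUsage, long requestedDelta) {
        Integer cap = effectiveLimits.get(limitKey);
        if (cap == null) {
            throw new IllegalArgumentException("unknown limit key: " + limitKey);
        }
        if (currentUsage + requestedDelta > cap) {
            throw new LicenseCapExceededException(limitKey, currentUsage, cap);
        }
    }
}
```

With max_apps capped at 3, a create at usage 2 passes and a create at usage 3 throws, which matches the example body above (current 3, cap 3).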

| Limit | Call site | Failure response |
| --- | --- | --- |
| max_environments | EnvironmentService.create (start) | 403 |
| max_apps | AppService.createApp | 403 |
| max_agents | AgentRegistryService.register | 403; agent treated as unregistered (no SSE, no commands) |
| max_users | UserAdminController.createUser and OidcAuthController.callback (auto-signup) | 403 / OIDC login failure |
| max_outbound_connections | OutboundConnectionServiceImpl.create | 403 |
| max_alert_rules | AlertRuleController.create | 403 |
| max_total_cpu_millis | DeploymentExecutor.PRE_FLIGHT (sum across non-stopped deploys + new) | Deploy fails fast at PRE_FLIGHT, status FAILED, audit row |
| max_total_memory_mb | same | same |
| max_total_replicas | same | same |
| max_execution_retention_days | EnvironmentService.update (per-env field, see §4.1) + ClickHouseSchemaInitializer.applyRetention() at boot | 422 on update; boot pins effective TTL = min(licenseCap, configured) |
| max_log_retention_days | same | same |
| max_metric_retention_days | same | same |
| max_jar_retention_count | EnvironmentAdminController.PUT /jar-retention | 422 |

4.1 Per-environment retention fields

Three new columns on environments (Flyway V2):

ALTER TABLE environments
  ADD COLUMN execution_retention_days INTEGER NOT NULL DEFAULT 1,
  ADD COLUMN log_retention_days       INTEGER NOT NULL DEFAULT 1,
  ADD COLUMN metric_retention_days    INTEGER NOT NULL DEFAULT 1;

These are the configured per-env values. The effective ClickHouse TTL is min(licenseCap, configured), applied at startup by ClickHouseSchemaInitializer. Admin UI surfaces the configured values; EnvironmentService.update rejects values above the license cap with 422.
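The two retention rules (pin at boot, reject on update) reduce to the following arithmetic. RetentionSketch and its method names are hypothetical.

```java
// Hypothetical sketch of the retention rules in §4.1.
final class RetentionSketch {

    /** Boot-time pin: the TTL actually applied is never above the license cap. */
    static int effectiveTtlDays(int licenseCapDays, int configuredDays) {
        return Math.min(licenseCapDays, configuredDays);
    }

    /** Update-time check: true means the request should be rejected with 422. */
    static boolean exceedsCap(int requestedDays, int licenseCapDays) {
        return requestedDays > licenseCapDays;
    }
}
```

For instance, a configured value of 365 days under a 90-day cap is pinned to 90 at boot, while a configured value of 30 days is applied unchanged.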

4.2 Boot-time invariant

If a license is added that lowers a cap below current usage (10 apps, license now allows 5), the server logs one WARN per limit at boot. No deletion. New creates reject; existing resources keep working.


5. Usage endpoint

GET /api/v1/admin/license/usage (ADMIN only):

{
  "state": "ACTIVE",
  "expiresAt": "2027-04-25T00:00:00Z",
  "daysRemaining": 365,
  "gracePeriodDays": 30,
  "tenantId": "acme-corp",
  "label": "ACME prod 2026",
  "limits": [
    {"key": "max_apps", "current": 7, "cap": 50, "source": "license"},
    {"key": "max_agents", "current": 12, "cap": 100, "source": "license"},
    {"key": "max_total_cpu_millis", "current": 8500, "cap": 32000, "source": "license"},
    {"key": "max_outbound_connections", "current": 0, "cap": 1, "source": "default"}
  ]
}

source is "default" when the cap comes from DefaultTierLimits (i.e. the license omits this key, or there is no license), and "license" when the cap is explicit in the license. Drives the SaaS UI's "free tier" badge.
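A sketch of how one entry of the limits array could be assembled, including the source rule. UsageSourceSketch and LimitUsage are hypothetical names; the sketch assumes every key has a default and that the caller passes an empty license map when the state is ABSENT or EXPIRED.

```java
import java.util.Map;

// Hypothetical sketch of the "source" rule for the usage response.
final class UsageSourceSketch {

    record LimitUsage(String key, long current, int cap, String source) {}

    /** Uses the license cap when the license names the key, otherwise the default-tier cap. */
    static LimitUsage describe(String key, long current,
                               Map<String, Integer> licenseLimits,
                               Map<String, Integer> defaults) {
        Integer fromLicense = licenseLimits.get(key);
        if (fromLicense != null) {
            return new LimitUsage(key, current, fromLicense, "license");
        }
        return new LimitUsage(key, current, defaults.get(key), "default");
    }
}
```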

LicenseUsageReader issues one cheap aggregate per limit (SELECT COUNT(*) per entity table; a single grouped SELECT SUM(replicas * cpuMillis), SUM(replicas * memoryMb), SUM(replicas) over non-stopped deployments).

GET /api/v1/admin/license (existing) is extended to return {state, envelope} with the raw token omitted from the response.


6. Lifecycle, persistence, install paths

6.1 Storage

Flyway V2 migration:

CREATE TABLE license (
  tenant_id    TEXT PRIMARY KEY,        -- one row per server (= one tenant)
  token        TEXT NOT NULL,           -- full signed token
  license_id   UUID NOT NULL,
  installed_at TIMESTAMPTZ NOT NULL,
  installed_by TEXT NOT NULL,           -- users.user_id (bare) or 'system' for env/file boot
  expires_at   TIMESTAMPTZ NOT NULL
);

6.2 Boot order

LicenseBeanConfig:

  1. If CAMELEER_SERVER_LICENSE_TOKEN env var is set → validate → write to DB (overwrite) → load.
  2. Else if CAMELEER_SERVER_LICENSE_FILE is set → read file → validate → write to DB → load.
  3. Else read license row from DB → validate → load.
  4. Else ABSENT.

Env-var / file act as idempotent overrides — they always win and replace the DB row, so the operator's last action survives reboots.
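The precedence above can be sketched as a small resolver. BootSourceSketch is a hypothetical name, and treating a blank value as "not set" is an assumption; validation and the DB write are left out.

```java
// Hypothetical sketch of the boot-order precedence in §6.2.
final class BootSourceSketch {

    enum Source { ENV, FILE, DB, NONE }

    /** Picks where the license is loaded from: env token, then file, then DB row, else none (ABSENT). */
    static Source resolve(String envToken, String envFilePath, boolean dbRowPresent) {
        if (isSet(envToken)) {
            return Source.ENV;
        }
        if (isSet(envFilePath)) {
            return Source.FILE;
        }
        return dbRowPresent ? Source.DB : Source.NONE;
    }

    private static boolean isSet(String value) {
        return value != null && !value.isBlank();
    }
}
```

A caller would pass System.getenv("CAMELEER_SERVER_LICENSE_TOKEN") and System.getenv("CAMELEER_SERVER_LICENSE_FILE") as the first two arguments.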

6.3 Runtime install

POST /api/v1/admin/license { "token": "..." } (existing):

  • Validates against the configured public key.
  • On success, persists to license table (installed_by = user_id), updates the in-memory LicenseGate, audits.
  • On failure, returns 400 with the validator error message and audits the rejection.

6.4 Public key custody

CAMELEER_SERVER_LICENSE_PUBLICKEY (existing) remains the only verification key. Build- / deploy-time secret bound to the vendor distribution. Not stored in DB. If unset and a license is present → reject all licenses (existing behaviour).

6.5 Audit trail

New AuditCategory.LICENSE. Actions:

| Action | When | Payload |
| --- | --- | --- |
| install_license | First successful install in an empty state | {licenseId, expiresAt, installedBy, source} (source = env/file/api) |
| replace_license | Successful install over an existing license | same + previousLicenseId |
| reject_license | Validation failed (signature, tenant, parse, public key missing) | {reason, source} |
| cap_exceeded | Any LicenseCapExceededException | {limit, current, cap, requestedBy} |

7. Minter

7.1 LicenseMinter (library)

Pure function, packaged in cameleer-license-minter:

public final class LicenseMinter {
    public static String mint(LicenseInfo info, PrivateKey ed25519PrivateKey);
}

Serializes LicenseInfo to canonical JSON (sorted keys), signs the bytes with Ed25519, returns base64(payload).base64(signature). cameleer-saas calls this directly to mint per-tenant tokens.
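A sketch of the canonical-JSON-then-sign step, using a plain map in place of LicenseInfo. MintSketch is hypothetical, the serializer handles only strings, numbers, booleans and nested maps with minimal escaping, and null values are not supported; a real implementation would more likely lean on a JSON library configured for sorted keys.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.Signature;
import java.util.Base64;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the real LicenseMinter.mint takes a LicenseInfo.
final class MintSketch {

    /** Serializes with keys in sorted order so the signed bytes are stable across runs. */
    static String canonicalJson(Map<String, ?> fields) {
        StringBuilder out = new StringBuilder("{");
        for (Map.Entry<String, Object> e : new TreeMap<String, Object>(fields).entrySet()) {
            if (out.length() > 1) {
                out.append(',');
            }
            out.append('"').append(e.getKey()).append("\":").append(render(e.getValue()));
        }
        return out.append('}').toString();
    }

    private static String render(Object value) {
        if (value instanceof Map<?, ?> nested) {
            Map<String, Object> copy = new TreeMap<>();
            nested.forEach((k, v) -> copy.put(String.valueOf(k), v));
            return canonicalJson(copy);
        }
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        // Minimal escaping only: backslash and double quote.
        return "\"" + value.toString().replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Signs the canonical payload and returns base64(payload).base64(signature). */
    static String mint(Map<String, ?> fields, PrivateKey ed25519PrivateKey) {
        try {
            byte[] payload = canonicalJson(fields).getBytes(StandardCharsets.UTF_8);
            Signature signer = Signature.getInstance("Ed25519");
            signer.initSign(ed25519PrivateKey);
            signer.update(payload);
            Base64.Encoder b64 = Base64.getEncoder();
            return b64.encodeToString(payload) + "." + b64.encodeToString(signer.sign());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("signing failed", e);
        }
    }

    /** Throwaway keypair for demos and tests. */
    static KeyPair newKeyPair() {
        try {
            return KeyPairGenerator.getInstance("Ed25519").generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Ed25519 unavailable", e);
        }
    }
}
```

Sorting keys matters because the server verifies the signature over the exact payload bytes: two serializations of the same fields in different key order would be different byte strings.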

7.2 LicenseMinterCli (CLI)

java -jar cameleer-license-minter-1.0-SNAPSHOT.jar \
     --private-key=/secure/vendor.key \
     --tenant=acme-corp \
     --label="ACME prod 2026" \
     --expires=2027-04-25 \
     --grace-days=30 \
     --max-apps=50 \
     --max-agents=100 \
     --max-total-cpu-millis=32000 \
     --max-total-memory-mb=65536 \
     --max-execution-retention-days=90 \
     --output=acme-license.tok
  • --private-key reads a PEM-encoded Ed25519 private key (output of openssl genpkey -algorithm ed25519).
  • Unspecified --max-* flags are omitted from the payload — the license inherits the default for that key.
  • Unknown flags fail fast.
  • --output writes the token; if omitted, prints to stdout.

Keypair generation is out of band — vendor uses openssl and stores both halves in their secret manager. We deliberately do not ship a --gen-keypair subcommand to keep the boundary clean.
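For reference, a sketch of how --private-key could load the PEM file, assuming the unencrypted PKCS#8 form ("BEGIN PRIVATE KEY") that openssl genpkey emits by default and the JDK's Ed25519 KeyFactory (Java 15+). PemKeySketch and its methods are hypothetical; the CLI may read the key differently.

```java
import java.security.GeneralSecurityException;
import java.security.KeyFactory;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.spec.PKCS8EncodedKeySpec;
import java.util.Base64;

// Hypothetical sketch of PEM handling for the minter CLI.
final class PemKeySketch {

    /** Parses an unencrypted PKCS#8 PEM block into an Ed25519 private key. */
    static PrivateKey parsePrivateKey(String pem) {
        String body = pem
                .replace("-----BEGIN PRIVATE KEY-----", "")
                .replace("-----END PRIVATE KEY-----", "")
                .replaceAll("\\s", "");
        try {
            byte[] der = Base64.getDecoder().decode(body);
            return KeyFactory.getInstance("Ed25519").generatePrivate(new PKCS8EncodedKeySpec(der));
        } catch (GeneralSecurityException e) {
            throw new IllegalArgumentException("not a PKCS#8 Ed25519 private key", e);
        }
    }

    /** Demo helper: renders a key back to PEM so the parser can be exercised without openssl. */
    static String toPem(PrivateKey key) {
        String body = Base64.getMimeEncoder(64, new byte[] {'\n'}).encodeToString(key.getEncoded());
        return "-----BEGIN PRIVATE KEY-----\n" + body + "\n-----END PRIVATE KEY-----\n";
    }

    /** Throwaway key for demos and tests. */
    static PrivateKey newPrivateKey() {
        try {
            return KeyPairGenerator.getInstance("Ed25519").generateKeyPair().getPrivate();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Ed25519 unavailable", e);
        }
    }
}
```

Reading the file itself is a one-liner with Files.readString(Path.of(path)); passphrase-protected keys would need extra handling that this sketch omits.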


8. Telemetry

Prometheus gauges scraped via /api/v1/prometheus:

| Metric | Labels | Notes |
| --- | --- | --- |
| cameleer_license_state | state="ABSENT\|ACTIVE\|GRACE\|EXPIRED" | |
| cameleer_license_days_remaining | (none) | Negative in GRACE/EXPIRED. |
| cameleer_license_limit_utilisation | limit="max_apps" etc. | current / cap, in [0, 1+]. |
| cameleer_license_cap_rejections_total | limit="..." | Counter. |

State-transition log lines: INFO on install/ACTIVE, WARN on GRACE, ERROR on EXPIRED, WARN on cap reject (sampled to avoid log spam).


9. Dead-code removal

Performed in the first commit of the implementation. Per the project's "no backwards compatibility shims" preference, no deprecated path or feature flag.

  • Delete Feature.java.
  • Delete LicenseGate.isEnabled(Feature).
  • Delete LicenseInfo.features field, LicenseInfo.hasFeature(Feature).
  • Delete LicenseGateTest.withLicense_onlyLicensedFeaturesEnabled and LicenseInfo.open()'s Set.of(Feature.values()) assertion.
  • Update LicenseValidator to ignore features if present in old tokens (silently dropped, not an error).

10. Testing

| Layer | Tests |
| --- | --- |
| Core unit | LicenseValidatorTest: signature, expiry, tenant mismatch, missing required fields, unknown extra fields. |
| Core unit | LicenseStateMachineTest: all four transitions including grace boundary, replace from any state, invalid install. |
| Core unit | DefaultTierLimitsTest: every documented key has a default. |
| Minter unit | LicenseMinterTest: round-trip with a throwaway Ed25519 keypair. Canonical JSON is stable across runs. |
| Minter CLI | LicenseMinterCliTest: invokes main with --private-key=tmp and checks output token validates. |
| App unit | LicenseEnforcerTest: for each limit: cap-reached, under-cap, default-tier with no license, missing-cap-inherits-default. |
| App integration | LicenseLifecycleIT: install via env, replace via POST, restart restores from DB. Driven through REST. |
| App integration | LicenseEnforcementIT: REST-driven, hit each cap end-to-end (per the project's "REST-API-driven ITs" preference). Includes cap_exceeded audit row check. |
| Boot | SchemaBootstrapIT extension: license table exists, environments retention columns exist, retention pinning honoured at boot. |

No raw-SQL seeding of caps in ITs. All caps installed via the REST endpoint or env var.


11. Open follow-ups (deliberately deferred)

  • Ingestion-rate limits (max_executions_per_minute, max_logs_per_minute).
  • Online revocation callback (the revocation_check_url envelope field).
  • Concurrent debug session limit (max_concurrent_debug_sessions from the SaaS epic).
  • A "license usage history" report for vendors to see growth over time.
  • Open a tracking issue on cameleer/cameleer-server (Gitea) — none exists today.

12. Risk register

| Risk | Mitigation |
| --- | --- |
| Default tier so tight that an honest evaluator cannot try the product. | Defaults documented; vendor can ship a longer-exp "trial" license at install time if needed. |
| Customer changes the gracePeriodDays field by editing the token. | Token is signed; any edit invalidates the signature. |
| License removed from DB out of band; server lands in ABSENT and rejects new resources, but existing ones are above the default tier. | Boot-time WARN per over-cap limit. UI banner in the admin license page. No auto-deletion. |
| Public key rotation. | Out of scope for v1; documented as "redeploy with new key". Vendors are expected to rotate via redeployment. |
| Compute cap arithmetic relies on cpuLimit and memoryLimitMb being set on every container. | Existing ResolvedContainerConfig already enforces these; DeploymentExecutor.PRE_FLIGHT rejects deploys with unset compute fields. |
| Per-env retention column added but old ClickHouse partitions retain longer. | Documented: TTL change is honoured by ClickHouse on its next merge cycle. New rows inserted always honour the new TTL. |