# License Enforcement
Operator documentation for the cameleer-server license subsystem. Audience: operators running their own cameleer-server instance who need to install, monitor, or troubleshoot a license.
For *issuing* licenses, see `cameleer-license-minter/README.md`. For SaaS-team operational playbooks, see `docs/handoff/2026-04-26-license-saas-handoff.md`.
## Table of contents
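- [Overview](#overview)
- [What gets enforced](#what-gets-enforced)
- [Install paths and priority](#install-paths-and-priority)
- [Public-key configuration](#public-key-configuration)
- [REST API](#rest-api)
- [License state machine](#license-state-machine)
- [Default tier caps](#default-tier-caps)
- [Cap-exceeded behavior](#cap-exceeded-behavior)
- [Retention semantics](#retention-semantics)
- [Daily revalidation](#daily-revalidation)
- [Audit categories](#audit-categories)
- [Prometheus metrics](#prometheus-metrics)
- [Troubleshooting](#troubleshooting)
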
---
## Overview
cameleer-server can run in one of two postures:
- **Default tier (no license installed).** A small fixed cap-set applies (1 environment, 3 apps, 5 agents, 1 day retention, etc.). Suitable for evaluation and self-host single-instance use. The default tier engages automatically when no license is configured.
- **Licensed (token installed).** Caps from the signed token override the default tier on a per-key basis. Any limit key the token does not specify falls through to the default value, so a partial license that only raises `max_environments` and `max_apps` keeps default retention.
A signed Ed25519 license token carries the customer's `tenantId`, an `expiresAt` timestamp, an optional `gracePeriodDays`, and a `limits` map. The server's `LicenseValidator` (`cameleer-server-core/src/main/java/com/cameleer/server/core/license/LicenseValidator.java`) checks the signature against `CAMELEER_SERVER_LICENSE_PUBLICKEY`, verifies the tenant matches `CAMELEER_SERVER_TENANT_ID`, and rejects expired tokens (past `expiresAt + gracePeriodDays`).
The license posture is summarized as a `LicenseState`:
- `ABSENT` — no license configured. Default-tier caps apply.
- `ACTIVE` — valid token, current time is at or before `expiresAt`. License caps apply.
- `GRACE` — past `expiresAt` but within `gracePeriodDays`. License caps still apply; the operator should renew.
- `EXPIRED` — past `expiresAt + gracePeriodDays`. Default-tier caps apply.
- `INVALID` — signature, tenant, or schema validation failed. Default-tier caps apply.
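The relationship between state and effective caps can be modeled in a few lines. This is an illustrative Python sketch, not the server's Java implementation: `effective_limits`, `LICENSED_STATES`, and the trimmed `DEFAULT_TIER` dictionary are hypothetical names, and the default values are copied from the default tier table in this document.

```python
# Illustrative model only; the real logic lives in the server's Java code.
# A trimmed copy of the default tier (values from the default tier table).
DEFAULT_TIER = {
    "max_environments": 1,
    "max_apps": 3,
    "max_agents": 5,
    "max_execution_retention_days": 1,
}

# States in which the token's limits are honored.
LICENSED_STATES = {"ACTIVE", "GRACE"}


def effective_limits(state, license_limits=None):
    """Return the caps in force for a given license state."""
    limits = dict(DEFAULT_TIER)
    if state in LICENSED_STATES and license_limits:
        # Per-key override: keys missing from the token keep the default.
        limits.update(license_limits)
    return limits


# A partial license that only raises two keys keeps default retention.
partial = {"max_environments": 3, "max_apps": 25}
print(effective_limits("ACTIVE", partial))
print(effective_limits("EXPIRED", partial))
```

With the partial license above, `ACTIVE` and `GRACE` raise only the two listed keys, while `ABSENT`, `EXPIRED`, and `INVALID` return the default tier unchanged.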
## What gets enforced
License caps are enforced through a single component, `LicenseEnforcer.assertWithinCap(limitKey, currentUsage, requestedDelta)`, called from each creation path.
| Limit key | Enforcement point | Effect when exceeded |
|---|---|---|
| `max_environments` | `EnvironmentService.create(...)` | HTTP 403 from `EnvironmentAdminController.create`. |
| `max_apps` | `AppService.createApp(...)` | HTTP 403 from `AppController.create`. |
| `max_agents` | `AgentRegistryService.register(...)` | HTTP 403 from `AgentRegistrationController.register`. Counted against the in-memory live agent registry. |
| `max_users` | User creation paths in `UserAdminController`, `UiAuthController`, `OidcAuthController` | HTTP 403 (REST) or rejection during OIDC first-login. |
| `max_outbound_connections` | `OutboundConnectionServiceImpl.create(...)` | HTTP 403. |
| `max_alert_rules` | `AlertRuleController.create(...)` | HTTP 403. |
| `max_total_cpu_millis` | `DeploymentExecutor` `PRE_FLIGHT` stage | Deployment fails before pulling images; row is marked FAILED with the cap message in `deployments.error_message`. |
| `max_total_memory_mb` | same | same |
| `max_total_replicas` | same | same |
| `max_jar_retention_count` | `EnvironmentAdminController` PUT `/{envSlug}/jar-retention` | HTTP 403 if requested value > cap. The daily `JarRetentionJob` is also bounded by this cap. |
| `max_execution_retention_days`, `max_log_retention_days`, `max_metric_retention_days` | `RetentionPolicyApplier` (runs on `LicenseChangedEvent`) | Not a creation cap; clamps the ClickHouse TTL to `min(cap, env.configured)`. See [Retention semantics](#retention-semantics). |
Note that the three compute caps are checked together at deploy time, after `ConfigMerger.resolve(...)` produces the final `ResolvedContainerConfig` but before the image is pulled. The current usage figure is computed by `LicenseUsageReader.computeUsage()` over non-stopped deployments.
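The cap check itself can be sketched as below. This is a Python model of the idea, not the Java `LicenseEnforcer`. The comparison `current + delta > cap` is an assumption inferred from the documented behavior (a create is rejected once usage has reached the cap), and `LicenseCapExceeded` is a hypothetical stand-in for `LicenseCapExceededException`.

```python
class LicenseCapExceeded(Exception):
    """Hypothetical stand-in for the server's LicenseCapExceededException."""

    def __init__(self, limit_key, current, cap):
        super().__init__(f"Cap reached for {limit_key} ({current} of {cap} used)")
        self.limit_key = limit_key
        self.current = current
        self.cap = cap


def assert_within_cap(caps, limit_key, current_usage, requested_delta=1):
    """Raise if adding requested_delta would push usage past the cap.

    Assumption: the rejection condition is current + delta > cap.
    """
    cap = caps[limit_key]
    if current_usage + requested_delta > cap:
        raise LicenseCapExceeded(limit_key, current_usage, cap)


default_caps = {"max_apps": 3}
assert_within_cap(default_caps, "max_apps", current_usage=2)  # third app: allowed
try:
    assert_within_cap(default_caps, "max_apps", current_usage=3)  # fourth app
except LicenseCapExceeded as exc:
    print(f"rejected: {exc}")
```

For the compute caps the same check would be applied with a `requested_delta` equal to the resources of the deployment being started rather than `1`.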
## Install paths and priority
A token can reach the server from three sources. At boot they are resolved highest-priority-first, with a fourth outcome when none is configured:
1. **`CAMELEER_SERVER_LICENSE_TOKEN` environment variable.** Highest priority. `LicenseBeanConfig.LicenseBootLoader` reads the raw token in its `@PostConstruct` method.
2. **`cameleer.server.license.file` Spring property** (or `CAMELEER_SERVER_LICENSE_FILE`). Path to a file containing the token. Read at boot if no env-var token is present.
3. **PostgreSQL `license` table.** Set via the admin REST POST. Loaded at boot if the env var and file both miss.
4. **None of the above.** State is `ABSENT`, default-tier caps apply, the boot loader publishes a `LicenseChangedEvent(ABSENT, null)` so listeners (Prometheus gauges, retention applier) settle on default values.
If a higher-priority source rejects (signature failure, tenant mismatch, expired) the loader logs the reason and **does not** fall through to a lower-priority source. This is deliberate: an operator who set `CAMELEER_SERVER_LICENSE_TOKEN` expects that token to be the active one, not a silently-stale DB row.
Any token loaded at boot also flows through `LicenseService.install(...)` so audit, persistence, and `LicenseChangedEvent` publishing are uniform across paths.
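The no-fall-through rule is easy to get wrong when reasoning about a misbehaving instance, so here is a small Python model of it. The function name `resolve_boot_token` and the `validate` callable (returning `None` on success or a reason string on failure) are hypothetical; the real loader is the Java `LicenseBootLoader`.

```python
def resolve_boot_token(env_token, file_token, db_token, validate):
    """Pick the highest-priority configured source and validate only that one.

    Returns (source, failure_reason). source is None when nothing is
    configured (state ABSENT); failure_reason is None when the token is accepted.
    """
    candidates = (("env", env_token), ("file", file_token), ("db", db_token))
    for source, token in candidates:
        if token:
            # No fall-through: the first configured source is final,
            # even if validation rejects it.
            return source, validate(token)
    return None, None


def fake_validate(token):
    # Toy validator for the example: only the literal "good" is accepted.
    return None if token == "good" else "License signature verification failed"


# A bad env-var token masks a perfectly good DB row.
print(resolve_boot_token("bad", None, "good", fake_validate))
```

In the last call the result names `env` as the source together with the rejection reason; the valid DB token is never consulted.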
## Public-key configuration
```bash
export CAMELEER_SERVER_LICENSE_PUBLICKEY="$(cat cameleer-license-pub.b64)"
```
The value is the base64 encoding of the Ed25519 public key in X.509 SubjectPublicKeyInfo form (see `cameleer-license-minter/README.md` for generation).
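Before restarting the server it can help to sanity-check that the value has the expected shape. The sketch below uses only the Python standard library and checks structure, not whether the key matches your minter's private key. It relies on the fact that an Ed25519 SubjectPublicKeyInfo is always 44 bytes of DER: a fixed 12-byte header followed by the 32-byte key. The helper name `looks_like_ed25519_spki` is hypothetical.

```python
import base64
import binascii

# DER header of an X.509 SubjectPublicKeyInfo wrapping a 32-byte Ed25519 key
# (algorithm OID 1.3.101.112). 12 header bytes + 32 key bytes = 44 bytes.
ED25519_SPKI_PREFIX = bytes.fromhex("302a300506032b6570032100")


def looks_like_ed25519_spki(b64_value):
    """Return True if b64_value decodes to a 44-byte Ed25519 SPKI structure."""
    try:
        der = base64.b64decode(b64_value.strip(), validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(der) == 44 and der.startswith(ED25519_SPKI_PREFIX)


# Build a structurally valid sample (all-zero key bytes) for demonstration.
sample = base64.b64encode(ED25519_SPKI_PREFIX + bytes(32)).decode()
print(looks_like_ed25519_spki(sample))
```

To check a live configuration, pass `os.environ.get("CAMELEER_SERVER_LICENSE_PUBLICKEY", "")` to the function. A common failure is supplying the raw 32-byte key or a PEM file with its header lines instead of the base64 of the DER form.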
When `CAMELEER_SERVER_LICENSE_PUBLICKEY` is **unset**:
- `LicenseBeanConfig.licenseValidator()` (line 62) logs a WARN: `CAMELEER_SERVER_LICENSE_PUBLICKEY not set — all licenses will be rejected as INVALID`.
- The bean is constructed against a throwaway public key whose private counterpart no one holds. The override's `validate(...)` always throws `IllegalStateException("license public key not configured")`.
- Any token loaded from any source routes through `LicenseService.install(...)`, fails validation, marks the gate `INVALID`, and writes a `reject_license` audit row with the failure reason.
- The state will be `INVALID`, default-tier caps apply, and the operator must set the variable and restart (or hot-install via POST after restart).
## REST API
All endpoints require an ADMIN-role JWT. Source-of-truth controllers: `cameleer-server-app/src/main/java/com/cameleer/server/app/controller/LicenseAdminController.java`, `LicenseUsageController.java`.
### `GET /api/v1/admin/license`
```json
{
  "state": "ACTIVE",
  "invalidReason": null,
  "envelope": {
    "licenseId": "fd3a8f2a-1c44-4eac-aa07-1a5d1ce9c4a4",
    "tenantId": "acme-prod",
    "label": "Acme Production",
    "limits": { "max_apps": 25, "max_environments": 3 },
    "issuedAt": "2026-04-26T10:00:00Z",
    "expiresAt": "2027-01-01T00:00:00Z",
    "gracePeriodDays": 14
  },
  "lastValidatedAt": "2026-04-26T03:00:00Z"
}
```
The raw token string is **deliberately not** returned — only the parsed envelope. `lastValidatedAt` is omitted when no DB row exists yet (env-var or file source on first boot before the next revalidation tick).
### `POST /api/v1/admin/license`
```bash
curl -X POST https://server.example.com/api/v1/admin/license \
  -H "Authorization: Bearer ${ADMIN_JWT}" \
  -H "Content-Type: application/json" \
  -d '{"token": "eyJ...long.base64.string..."}'
```
Body shape: `{"token": "<minted token>"}`. On success returns `{"state": "ACTIVE", "envelope": {...}}`. On failure returns HTTP 400 with `{"error": "<reason>"}`.
The handler delegates to `LicenseService.install(token, userId, "api")`. Acting `userId` comes from the authenticated principal stripped of the `user:` prefix (see `app-classes.md` user-id convention).
This endpoint installs *or replaces* — there is one row per tenant in the `license` table, so a successful POST upserts and supersedes any prior token. The previous license id is captured in the `replace_license` audit detail.
### `GET /api/v1/admin/license/usage`
```json
{
  "state": "ACTIVE",
  "expiresAt": "2027-01-01T00:00:00Z",
  "daysRemaining": 250,
  "gracePeriodDays": 14,
  "tenantId": "acme-prod",
  "label": "Acme Production",
  "lastValidatedAt": "2026-04-26T03:00:00Z",
  "message": "License active. 250 days remaining.",
  "limits": [
    {"key": "max_environments", "current": 2, "cap": 3, "source": "license"},
    {"key": "max_apps", "current": 12, "cap": 25, "source": "license"},
    {"key": "max_agents", "current": 38, "cap": 50, "source": "license"},
    {"key": "max_users", "current": 4, "cap": 3, "source": "default"}
  ]
}
```
For each effective-limits key:
- `current` — current usage. `max_agents` is read from the in-memory `AgentRegistryService.liveCount()`; everything else comes from `LicenseUsageReader.snapshot()` (PostgreSQL counts, plus deployment compute aggregates from `deployed_config_snapshot`). Limits the server does not measure return `0`.
- `cap` — effective cap (license override or default-tier value).
- `source` — `"license"` if the cap came from the token's `limits` map, `"default"` if it fell through.
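When scripting against this endpoint, the usual question is which limits have no headroom left. A minimal Python sketch, assuming the response shape shown above; `limits_at_cap` is a hypothetical helper and the embedded JSON is a trimmed copy of the example response.

```python
import json


def limits_at_cap(usage):
    """Return the keys of limits[] entries whose usage has reached the cap."""
    return [
        row["key"]
        for row in usage.get("limits", [])
        if row["current"] >= row["cap"]
    ]


# Trimmed copy of the example response body.
body = json.loads("""
{
  "state": "ACTIVE",
  "limits": [
    {"key": "max_environments", "current": 2, "cap": 3, "source": "license"},
    {"key": "max_apps", "current": 12, "cap": 25, "source": "license"},
    {"key": "max_users", "current": 4, "cap": 3, "source": "default"}
  ]
}
""")

for key in limits_at_cap(body):
    print(f"no headroom left for {key}")
```

In the example only `max_users` is flagged; it also has `"source": "default"`, which shows that the installed token does not raise that key.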
## License state machine
```
                       +---------------+
                       |    ABSENT     |  (no token configured)
                       +-------+-------+
                               |
                               | install via env / file / DB / POST
                               v
                       +-------+-------+
        +--------------|    ACTIVE     |--------------+
        |              +-------+-------+              |
        | revalidate                                  | now > expiresAt
        | fails sig/tenant/                           |
        | parse                                       v
        |                                     +-------+-------+
        |                                     |     GRACE     |
        |                                     +-------+-------+
        |                                             |
        |                                             | now > exp + gracePeriodDays
        |                                             v
        |                                     +-------+-------+
        |                                     |    EXPIRED    |
        |                                     +-------+-------+
        v
+-------+-------+
|    INVALID    |  (signature mismatch, tenant mismatch,
+---------------+   missing public key, malformed payload)
```
Classification logic: `LicenseStateMachine.classify(license, invalidReason)` (`cameleer-server-core/src/main/java/com/cameleer/server/core/license/LicenseStateMachine.java`).
- `INVALID` and `EXPIRED` revert to **default-tier caps**. In `INVALID` the license envelope is dropped from the gate (`getCurrent()` returns null). In `EXPIRED` the gate retains the parsed envelope, but `getEffectiveLimits()` returns defaults only.
- `GRACE` keeps **license caps**. It is the only state in which the license is past `expiresAt` yet still enforced, so the operator can keep running but should be actively working on renewal.
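The classification rules can be written out as a short function. This Python sketch flattens the Java signature `classify(license, invalidReason)` into plain arguments for illustration. The boundary handling (inclusive at `expiresAt` and at the end of the grace window) is an assumption derived from the state descriptions in the Overview, not from the source code.

```python
from datetime import datetime, timedelta, timezone


def classify(now, expires_at=None, grace_period_days=0, invalid_reason=None):
    """Classify the license posture at time `now`.

    expires_at is None when no license is loaded.
    """
    if invalid_reason is not None:
        return "INVALID"
    if expires_at is None:
        return "ABSENT"
    if now <= expires_at:
        return "ACTIVE"
    if now <= expires_at + timedelta(days=grace_period_days):
        return "GRACE"
    return "EXPIRED"


expires = datetime(2027, 1, 1, tzinfo=timezone.utc)
for day in (datetime(2026, 4, 26, tzinfo=timezone.utc),
            datetime(2027, 1, 10, tzinfo=timezone.utc),
            datetime(2027, 1, 16, tzinfo=timezone.utc)):
    print(day.date(), classify(day, expires, grace_period_days=14))
```

With `expiresAt` of 2027-01-01 and a 14-day grace period, the three sample dates fall in `ACTIVE`, `GRACE`, and `EXPIRED` respectively.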
## Default tier caps
Source: `cameleer-server-core/src/main/java/com/cameleer/server/core/license/DefaultTierLimits.java`.
| Key | Default | Semantics |
|---|---|---|
| `max_environments` | 1 | Total environments across the tenant. |
| `max_apps` | 3 | Total apps across all environments. |
| `max_agents` | 5 | Live agents in the in-memory registry (LIVE state). |
| `max_users` | 3 | Local + OIDC users in the `users` table. |
| `max_outbound_connections` | 1 | Rows in `outbound_connections`. |
| `max_alert_rules` | 2 | Rows in `alert_rules`. |
| `max_total_cpu_millis` | 2000 | Sum of `replicas * cpuLimit` over non-stopped deployments. cpuLimit is millicores; 1000 = one core. |
| `max_total_memory_mb` | 2048 | Sum of `replicas * memoryLimitMb` over non-stopped deployments. |
| `max_total_replicas` | 5 | Sum of `replicas` over non-stopped deployments. |
| `max_execution_retention_days` | 1 | Cap on TTL applied to `executions` and `processor_executions`. |
| `max_log_retention_days` | 1 | Cap on TTL applied to `logs`. |
| `max_metric_retention_days` | 1 | Cap on TTL applied to `agent_metrics` and `agent_events`. |
| `max_jar_retention_count` | 3 | Maximum JAR retention count per environment. |
The default tier is intentionally restrictive — it is sized for evaluation, single-developer demos, and "I forgot to install my license" recovery, not production. New customers should install a license at first onboarding.
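To see how quickly the compute caps fill up, the three aggregates from the table can be reproduced by hand. The Python sketch below is a model of the arithmetic only: the dictionary field names and the `"STOPPED"` status value are assumptions for illustration, not the server's actual schema.

```python
def compute_usage(deployments):
    """Aggregate compute usage over non-stopped deployments."""
    # Assumption: a stopped deployment is marked with status "STOPPED".
    active = [d for d in deployments if d["status"] != "STOPPED"]
    return {
        "max_total_cpu_millis": sum(d["replicas"] * d["cpu_limit"] for d in active),
        "max_total_memory_mb": sum(d["replicas"] * d["memory_limit_mb"] for d in active),
        "max_total_replicas": sum(d["replicas"] for d in active),
    }


deployments = [
    {"status": "RUNNING", "replicas": 2, "cpu_limit": 500, "memory_limit_mb": 512},
    {"status": "RUNNING", "replicas": 1, "cpu_limit": 1000, "memory_limit_mb": 1024},
    {"status": "STOPPED", "replicas": 4, "cpu_limit": 1000, "memory_limit_mb": 2048},
]
print(compute_usage(deployments))
```

The two running deployments already total 2000 millicores and 2048 MB, which is exactly the default-tier ceiling for CPU and memory, so any further deployment would be rejected at `PRE_FLIGHT` under the default tier.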
## Cap-exceeded behavior
When a creation path exceeds its cap, `LicenseEnforcer.assertWithinCap(...)` throws `LicenseCapExceededException(limitKey, current, cap)`. `LicenseExceptionAdvice` (`@ControllerAdvice`) maps it to:
```http
HTTP/1.1 403 Forbidden
Content-Type: application/json

{
  "error": "license cap reached",
  "limit": "max_apps",
  "current": 3,
  "cap": 3,
  "state": "ABSENT",
  "message": "License absent. Default tier limits apply. Cap reached for max_apps (3 of 3 used)."
}
```
Concurrently:
- The Prometheus counter `cameleer_license_cap_rejections_total{limit=...}` increments.
- An audit row is written: `category=LICENSE`, `action=cap_exceeded`, `target=<limit key>`, `result=FAILURE`, `detail` carries `{limit, current, requested, cap, state}`. If audit storage fails, the 403 still surfaces (audit is best-effort here).
The `message` field is rendered by `LicenseMessageRenderer.forCap(...)` and varies per state — under `EXPIRED` it nudges the operator to renew; under `INVALID` it cites `invalidReason`.
## Retention semantics
The license caps `max_execution_retention_days`, `max_log_retention_days`, `max_metric_retention_days`, and `max_jar_retention_count` define **maximums**. Per-environment configuration (`environments.execution_retention_days`, `log_retention_days`, `metric_retention_days`, `jar_retention_count`) defines the **operator preference**. The effective TTL applied to ClickHouse tables is:
```
effective = min(licenseCap, env.configuredRetentionDays)
```
When `LicenseChangedEvent` fires (any install/replace/revalidate/boot transition), `RetentionPolicyApplier` (`@EventListener @Async`) recomputes TTL for every (table, env) pair using:
```sql
ALTER TABLE <table>
MODIFY TTL toDateTime(<time_col>) + INTERVAL <effective> DAY DELETE
WHERE environment = '<env_slug>'
```
Tables affected: `executions`, `processor_executions`, `logs`, `agent_metrics`, `agent_events`. Excluded:
- `route_diagrams` — content-addressed `ReplacingMergeTree`, no time-based TTL.
- `server_metrics` — server-wide, no `environment` column. Its 90-day cap is fixed in the schema.
ClickHouse failures are logged (WARN) but do not fail the originating license install — TTL recompute is best-effort.
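The per-table result of the `min(...)` rule can be previewed with a short Python sketch. The table-to-cap mapping follows the default tier caps table and the environment field names follow the `environments` columns listed above; the function name `effective_ttl_days` is hypothetical.

```python
# Which license cap governs which ClickHouse table.
TABLE_CAP_KEY = {
    "executions": "max_execution_retention_days",
    "processor_executions": "max_execution_retention_days",
    "logs": "max_log_retention_days",
    "agent_metrics": "max_metric_retention_days",
    "agent_events": "max_metric_retention_days",
}

# Which per-environment setting pairs with each cap.
ENV_FIELD = {
    "max_execution_retention_days": "execution_retention_days",
    "max_log_retention_days": "log_retention_days",
    "max_metric_retention_days": "metric_retention_days",
}


def effective_ttl_days(caps, env):
    """Return {table: days}, where days = min(license cap, env preference)."""
    return {
        table: min(caps[key], env[ENV_FIELD[key]])
        for table, key in TABLE_CAP_KEY.items()
    }


caps = {
    "max_execution_retention_days": 30,
    "max_log_retention_days": 7,
    "max_metric_retention_days": 14,
}
prod = {"execution_retention_days": 90, "log_retention_days": 3, "metric_retention_days": 14}
print(effective_ttl_days(caps, prod))
```

Here the execution tables are clamped from the configured 90 days to the 30-day cap, while logs keep the operator's lower preference of 3 days.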
## Daily revalidation
`LicenseRevalidationJob` (`@Scheduled(cron = "0 0 3 * * *")`) re-runs `LicenseService.revalidate()` against the persisted token at 03:00 server-local time. It also fires once 60 seconds after `ApplicationReadyEvent` to catch the case where a license was installed via SQL between server starts.
Each revalidation:
- Re-reads the token from `license` table.
- Runs `LicenseValidator.validate(...)` again — same checks as install (signature, tenant, expiry).
- On success: bumps `last_validated_at`, reloads the gate, publishes `LicenseChangedEvent`.
- On failure: marks the gate `INVALID`, writes an audit row `revalidate_license` / `FAILURE`, publishes `LicenseChangedEvent(INVALID, null)`.
A token transitioning `ACTIVE → GRACE → EXPIRED` will surface as a state change at the next revalidation tick (or on the next license-touching admin action).
## Audit categories
All license lifecycle events use `AuditCategory.LICENSE`. Action codes:
| Action | Result | Detail keys |
|---|---|---|
| `install_license` | SUCCESS | `licenseId, expiresAt, installedBy, source` |
| `replace_license` | SUCCESS | same plus `previousLicenseId` |
| `reject_license` | FAILURE | `reason, source` |
| `revalidate_license` | FAILURE | `licenseId, reason` |
| `cap_exceeded` | FAILURE | `limit, current, requested, cap, state` |
The `source` value is one of `env`, `file`, `db`, `api` — corresponds to the install path.
## Prometheus metrics
Scraped at `/api/v1/prometheus`. Source: `LicenseMetrics` (`cameleer-server-app/src/main/java/com/cameleer/server/app/license/LicenseMetrics.java`).
| Metric | Type | Labels | Semantics |
|---|---|---|---|
| `cameleer_license_state` | gauge | `state=<ABSENT\|ACTIVE\|GRACE\|EXPIRED\|INVALID>` | One-hot per state — exactly one tag value carries `1.0` at any time, others are `0.0`. |
| `cameleer_license_days_remaining` | gauge | (none) | Whole days until `expiresAt`. `-1.0` when no license is loaded (ABSENT/INVALID). Suitable alert thresholds: warn at 30, page at 7. |
| `cameleer_license_last_validated_age_seconds` | gauge | (none) | Seconds since the persisted `last_validated_at`. `0` when there is no DB row. Alerts at >86400 (revalidation hasn't run for >24h) detect a stuck scheduler or a misconfigured server. |
| `cameleer_license_cap_rejections_total` | counter | `limit=<limit_key>` | Incremented every time `LicenseEnforcer` rejects a creation due to a cap. A non-zero rate indicates customers hitting their plan ceiling. |
Gauges refresh on every `LicenseChangedEvent` and on a 60-second `@Scheduled(fixedDelay)` so values stay current even without state changes.
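Because `cameleer_license_state` is one-hot, a monitoring script has to find the series whose value is `1.0` rather than read a single number. The Python sketch below parses Prometheus text exposition output. The sample scrape is hand-written for illustration and real output may carry additional labels, which the pattern tolerates; `active_state` is a hypothetical helper.

```python
import re

# Matches e.g.: cameleer_license_state{state="ACTIVE"} 1.0
STATE_SAMPLE = re.compile(
    r'^cameleer_license_state\{[^}]*\bstate="([A-Z]+)"[^}]*\}\s+(\S+)'
)


def active_state(exposition_text):
    """Return the single state whose one-hot gauge is 1.0."""
    active = []
    for line in exposition_text.splitlines():
        match = STATE_SAMPLE.match(line.strip())
        if match and float(match.group(2)) == 1.0:
            active.append(match.group(1))
    if len(active) != 1:
        raise ValueError(f"expected exactly one active state, found {active}")
    return active[0]


# Hand-written sample in Prometheus text exposition format.
scrape = """\
# TYPE cameleer_license_state gauge
cameleer_license_state{state="ABSENT"} 0.0
cameleer_license_state{state="ACTIVE"} 0.0
cameleer_license_state{state="GRACE"} 1.0
cameleer_license_state{state="EXPIRED"} 0.0
cameleer_license_state{state="INVALID"} 0.0
"""
print(active_state(scrape))
```

Raising on zero or multiple active series is a deliberate choice: either case means the scrape is incomplete or the gauge is misbehaving, and neither should be silently mapped to a state.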
## Troubleshooting
### My license shows `INVALID` — why?
Check `invalidReason` from `GET /api/v1/admin/license`. Common causes:
| `invalidReason` substring | Cause | Fix |
|---|---|---|
| `License signature verification failed` | Public key on the server does not match the private key the token was signed with. | Confirm `CAMELEER_SERVER_LICENSE_PUBLICKEY` matches the keypair used to mint the token. |
| `License tenantId 'X' does not match server tenant 'Y'` | Token minted for a different `tenantId`. | Re-mint with `--tenant=<correct id>` matching `CAMELEER_SERVER_TENANT_ID`. |
| `licenseId is required` / `tenantId is required` / `exp is required` | Malformed token (missing required field). | Re-mint via the supported minter — fields are mandatory. |
| `License expired at <...>` | Past `expiresAt + gracePeriodDays`. | Issue a renewal license. |
| `license public key not configured` | `CAMELEER_SERVER_LICENSE_PUBLICKEY` is unset. | Set the env var and restart; POST the token again if it is not picked up from a boot source. |
### I'm getting 403s on creates — which cap is biting?
```bash
curl https://server.example.com/api/v1/admin/license/usage \
  -H "Authorization: Bearer ${ADMIN_JWT}"
```
The `limits[]` array shows current/cap per limit key. Any row with `current >= cap` is a candidate. The 403 response body itself names the limit:
```json
{"error":"license cap reached","limit":"max_apps","current":3,"cap":3,"state":"ABSENT", ...}
```
If `state` is `ABSENT` or `EXPIRED`, the fix is to install a valid license. If it is `INVALID`, resolve the cause reported in `invalidReason` first. If `state` is `ACTIVE` and you are at the license cap, you need a higher-tier license re-issued.
### My new license didn't take effect
1. Check the audit log:
```bash
curl 'https://server.example.com/api/v1/admin/audit?category=LICENSE&limit=10' \
  -H "Authorization: Bearer ${ADMIN_JWT}"
```
You should see an `install_license` or `replace_license` row at `SUCCESS`. A `reject_license` `FAILURE` row carries the reason.
2. Confirm the public key matches the private key used to mint:
- Vendor side: `openssl pkey -in <priv> -pubout -outform DER | base64 -w0`
- Server side: `echo $CAMELEER_SERVER_LICENSE_PUBLICKEY`
- These must be byte-identical.
3. Confirm `CAMELEER_SERVER_TENANT_ID` matches the `tenantId` in the token envelope (`GET /api/v1/admin/license`).
4. If the env var token disagrees with what's in the DB (e.g. you POSTed but a stale env var remains): the env var wins on next boot. Either remove the env var or update it before restarting.
### Cap rejections spiking but no licensed customer should be hitting the cap
Inspect `cameleer_license_cap_rejections_total{limit=...}`. If a tenant is on default tier (state = `ABSENT`/`EXPIRED`/`INVALID`) the very low default caps will trip immediately on routine activity. Install a license to restore expected behavior.
### Retention TTL didn't change after installing a license
`RetentionPolicyApplier` runs on `LicenseChangedEvent` asynchronously (`@Async`). Look for the log line:
```
License changed (state=ACTIVE) — recomputing TTL across N environment(s) and 5 table(s)
Applied TTL: table=executions env=prod days=30 (cap=30, configured=90)
```
If the log shows `Failed to apply TTL` warnings, ClickHouse rejected the `ALTER TABLE ... MODIFY TTL` statement — most often because of a permissions issue or a ClickHouse version below 22.3. The license install itself still succeeded; the TTL change just didn't land.