docs(license): apply review feedback to enforcement design

- Add INVALID state to FSM (signature/tenant/parse failure ≠ ABSENT)
  with loud UI/audit/metric severity; ABSENT stays a calm state.
- Make tenantId required in the license envelope (it's already inside
  the signed payload, so a self-hosted customer cannot strip it).
- Move ClickHouse TTL recompute from boot-only to a
  RetentionPolicyApplier @EventListener(LicenseChangedEvent), so a
  long-running server that lands in EXPIRED tightens TTL automatically.
- Add LicenseRevalidationJob (daily) that re-runs signature check
  against the DB row and updates last_validated_at; transitions to
  INVALID on failure (catches public-key rotation drift).
- Add last_validated_at column to the license table, surfaced on the
  /usage endpoint and as cameleer_license_last_validated_age_seconds.
- Enrich enforcement-failure responses and the /usage endpoint with a
  per-state human-readable message so 403s and the UI both explain
  WHY caps changed.
- Add --verify (with --public-key) to the minter CLI to round-trip a
  freshly-minted token through LicenseValidator before shipping it,
  deleting the output file on verify failure.
- Add corresponding tests, telemetry gauge, and a runtime-recompute IT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hsiegeln
2026-04-26 09:42:16 +02:00
parent 0e512a3c0c
commit e0be6a069f


@@ -47,16 +47,18 @@ cameleer-server-core/ (existing — pure domain, no Spring)
├── LicenseLimits (typed wrapper over the limits map)
├── LicenseValidator (existing, payload schema updated)
├── LicenseGate (existing, gutted: no Feature; getLimits() only)
├── LicenseStateMachine (NEW — pure FSM: ABSENT / ACTIVE / GRACE / EXPIRED)
└── DefaultTierLimits (constant — §5 numbers)
├── LicenseStateMachine (NEW — pure FSM: ABSENT / ACTIVE / GRACE / EXPIRED / INVALID)
└── DefaultTierLimits (constant — §3.2 numbers)
cameleer-server-app/ (existing — Spring, web, persistence)
├── license/
│ ├── LicenseRepository (NEW — PostgreSQL persistence)
│ ├── LicenseService (NEW — load/save/replace; emits state events)
│ ├── LicenseService (NEW — load/save/replace; publishes LicenseChangedEvent)
│ ├── LicenseEnforcer (NEW — assertWithinCap entry point)
│ ├── LicenseUsageReader (NEW — counts current usage for /usage endpoint)
│ ├── LicenseCapExceededException (NEW — mapped to 403 by ControllerAdvice)
│ ├── LicenseRevalidationJob (NEW — @Scheduled daily; updates last_validated_at)
│ ├── RetentionPolicyApplier (NEW — @EventListener(LicenseChangedEvent); recomputes ClickHouse TTL + per-env caps)
│ └── LicenseMetrics (NEW — Prometheus gauges)
├── controller/
│ ├── LicenseAdminController (existing — extended; persists, audited)
@@ -67,7 +69,7 @@ cameleer-server-app/ (existing — Spring, web, persistence)
cameleer-license-minter/ (NEW — top-level Maven module)
├── pom.xml (depends on cameleer-server-core)
├── LicenseMinter (signing primitive; takes private key + LicenseInfo)
└── cli/LicenseMinterCli (CLI main class)
└── cli/LicenseMinterCli (CLI main class, supports --verify)
```
### 1.2 Why a separate `cameleer-license-minter` module
@@ -100,7 +102,7 @@ Wire format unchanged: `base64(payload).base64(ed25519_signature)`. Payload sche
{
"licenseId": "550e8400-e29b-41d4-a716-446655440000",
"tenantId": "acme-corp",
"label": "ACME prod 2026",
"label": "ACME prod 2026 — site:hamburg",
"iat": 1745539200,
"exp": 1777075200,
"gracePeriodDays": 30,
@@ -127,7 +129,7 @@ Wire format unchanged: `base64(payload).base64(ed25519_signature)`. Payload sche
| Field | Required | Notes |
|---|---|---|
| `licenseId` | yes | UUID. Used in audit + future revocation. |
| `tenantId` | optional | If present and `CAMELEER_SERVER_TENANT_ID` differs, treat as no license + log error. Air-gapped customers may omit. |
| `tenantId` | **yes** | Must match `CAMELEER_SERVER_TENANT_ID`. Mismatch = `INVALID` state (see §3). The field is inside the signed payload, so a self-hosted customer cannot strip it to make a license portable across tenants — any edit invalidates the signature. Air-gapped customers receive a license bound to a vendor-issued tenant id (not necessarily a UUID — any non-empty slug). |
| `label` | optional | Free-form human description. Surfaced in UI. |
| `iat` | yes | Unix seconds. |
| `exp` | yes | Unix seconds. |
@@ -149,23 +151,42 @@ Wire format unchanged: `base64(payload).base64(ed25519_signature)`. Payload sche
│ ABSENT │ ───────────────▶│ ACTIVE │──────▶│ GRACE │ │ EXPIRED │
└─────────┘ └────────┘ └────────┘ └─────────┘
▲ │ │ ▲ │
install invalid │ replace │ │ replace valid │ replace
(sig/tenant/parse) ▼ │ │ ▼
└────────────────────────────┴──────────────┴─┴───────────────────┘
│ replace │ │ replace valid │ replace
▼ │ │ ▼
│ ┌─────────┐ └──────────────┴─┴───────────────────┘
└──┤ INVALID │ ──── replace valid ────────────────────────────────▶ ACTIVE
└─────────┘
│ install fails (signature / tenant / parse / public-key-missing)
all transitions persist + audit-log
```
### 3.1 State semantics
| State | Effective limits | Trigger |
|---|---|---|
| `ABSENT` | `DefaultTierLimits` | No DB row, or signature/tenant/parse failure. |
| `ACTIVE` | `merge(default, license.limits)` | License loaded, `now < exp`. |
| `GRACE` | Same as `ACTIVE` | `exp ≤ now < exp + gracePeriodDays`. UI banner. |
| `EXPIRED` | `DefaultTierLimits` | `now ≥ exp + gracePeriodDays`. Distinct UI label vs ABSENT. |
| State | Effective limits | Trigger | Severity |
|--- |--- |--- |--- |
| `ABSENT` | `DefaultTierLimits` | No DB row. Clean install with no license configured. | INFO |
| `ACTIVE` | `merge(default, license.limits)` | License loaded, `now < exp`. | INFO |
| `GRACE` | Same as `ACTIVE` | `exp ≤ now < exp + gracePeriodDays`. UI warning banner. | WARN |
| `EXPIRED` | `DefaultTierLimits` | `now ≥ exp + gracePeriodDays`. UI label distinct from ABSENT. | ERROR |
| `INVALID` | `DefaultTierLimits` | Signature failure, tenant mismatch, parse error, or public key not configured but a token is present. | **ERROR — loud** |
State is recomputed on every limit check (clock comparison only) — no scheduler needed for
transitions. The only "background" behaviour is the Prometheus gauge refresh.
`ABSENT` and `INVALID` produce the same enforcement (default tier) but are surfaced very
differently:
- **`ABSENT`** is a clean state — fresh install, no license yet. UI shows a calm "Install a
license to lift the default-tier caps" call to action. No audit row beyond the boot log line.
- **`INVALID`** is an active error — tampering, wrong public key, or a paste that lost
characters. UI shows a red banner with the validator's error message
(e.g. "License signature verification failed", "License tenantId 'acme-corp' does not match
server tenant 'beta-corp'"). Audit row written under
`AuditCategory.LICENSE` action `reject_license`. Prometheus
`cameleer_license_state{state="INVALID"} = 1` so an alert can fire.
State is recomputed on every limit check (clock comparison only against parsed in-memory
`LicenseInfo`) — no scheduler needed for `ACTIVE → GRACE → EXPIRED` transitions. A separate
**daily revalidation job** (§6.6) re-runs the signature check against the DB row to catch slow
failures like public-key rotation drift.
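
As a minimal sketch of the clock-only state computation described above: the class name `LicenseStateSketch`, the method `compute`, and the two boolean inputs are hypothetical and stand in for whatever the real `LicenseStateMachine` exposes. It assumes validity (signature, tenant, parse) has already been decided elsewhere.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of the pure state computation; not the real LicenseStateMachine. */
final class LicenseStateSketch {
    enum State { ABSENT, ACTIVE, GRACE, EXPIRED, INVALID }

    /**
     * @param tokenPresent a license row or token exists
     * @param tokenValid   signature, tenant and parse checks all passed
     */
    static State compute(boolean tokenPresent, boolean tokenValid,
                         Instant exp, int gracePeriodDays, Instant now) {
        if (!tokenPresent) return State.ABSENT;      // no DB row: calm state
        if (!tokenValid) return State.INVALID;       // token present but rejected: loud state
        if (now.isBefore(exp)) return State.ACTIVE;  // now < exp
        Instant graceEnd = exp.plus(Duration.ofDays(gracePeriodDays));
        // exp <= now < exp + gracePeriodDays is GRACE; anything later is EXPIRED
        return now.isBefore(graceEnd) ? State.GRACE : State.EXPIRED;
    }
}
```

Because the function takes `now` as a parameter and holds no state, the grace boundary can be unit-tested without a clock mock.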
### 3.2 Default tier (the "no license" caps)
@@ -199,8 +220,31 @@ void assertWithinCap(String limitKey, long currentUsage, long requestedDelta);
```
Throws `LicenseCapExceededException(limitKey, current, cap)` when `currentUsage + requestedDelta > cap`.
A `@ControllerAdvice` maps it to `403` with body
`{"error":"license cap reached","limit":"max_apps","current":3,"cap":3}`.
A `@ControllerAdvice` maps it to `403` with a body that explains the "why" so operators can act
without grepping logs:
```json
{
"error": "license cap reached",
"limit": "max_apps",
"current": 3,
"cap": 3,
"state": "EXPIRED",
"message": "License expired 5 days ago: system reverted to default tier (3 apps). Current usage is 3. Install or renew the license to create more apps."
}
```
The `message` field is rendered server-side from a small template per state:
| State | Message template |
|--- |---|
| `ABSENT` | "No license installed: default tier applies (cap = {cap} for {limit}). Install a license to raise this." |
| `ACTIVE` | "License cap reached: {limit} = {cap}. Current usage is {current}. Contact your vendor to raise the cap." |
| `GRACE` | "License expired {n} day(s) ago and is in its grace period (ends in {m} days). Cap unchanged at {cap}. Renew before grace ends." |
| `EXPIRED`| "License expired {n} days ago: system reverted to default tier (cap = {cap} for {limit}). Current usage is {current}. Renew the license to lift the cap." |
| `INVALID`| "License rejected ({reason}): default tier applies (cap = {cap} for {limit}). Fix the license to raise this." |
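
The cap check and the per-state message can be sketched together as follows. This is an illustration only: `CapCheckSketch`, its nested `CapExceeded` exception, and the extra `cap` and `state` parameters are hypothetical, and only the `ABSENT` and `ACTIVE` templates are rendered because the others need inputs (`{n}`, `{m}`, `{reason}`) that this sketch does not model.

```java
/** Hypothetical sketch of the cap check; class and field names are illustrative. */
final class CapCheckSketch {
    static final class CapExceeded extends RuntimeException {
        final String limitKey;
        final long current;
        final long cap;

        CapExceeded(String limitKey, long current, long cap, String message) {
            super(message);
            this.limitKey = limitKey;
            this.current = current;
            this.cap = cap;
        }
    }

    /** Throws when currentUsage + requestedDelta would exceed the cap. */
    static void assertWithinCap(String limitKey, long currentUsage, long requestedDelta,
                                long cap, String state) {
        if (currentUsage + requestedDelta > cap) {
            throw new CapExceeded(limitKey, currentUsage, cap,
                    render(state, limitKey, currentUsage, cap));
        }
    }

    /** Renders the human-readable explanation for the states this sketch covers. */
    static String render(String state, String limit, long current, long cap) {
        switch (state) {
            case "ABSENT":
                return "No license installed: default tier applies (cap = " + cap
                        + " for " + limit + "). Install a license to raise this.";
            case "ACTIVE":
                return "License cap reached: " + limit + " = " + cap
                        + ". Current usage is " + current
                        + ". Contact your vendor to raise the cap.";
            default:
                // GRACE, EXPIRED and INVALID need extra inputs and are left out here.
                return "License cap reached: " + limit + " = " + cap + ".";
        }
    }
}
```

Carrying the rendered message on the exception lets a single `@ControllerAdvice` build the 403 body without knowing about license state itself.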
### 4.1 Per-limit call sites
| Limit | Call site | Failure response |
|---|---|---|
@@ -213,12 +257,12 @@ A `@ControllerAdvice` maps it to `403` with body
| `max_total_cpu_millis` | `DeploymentExecutor.PRE_FLIGHT` (sum across non-stopped deploys + new) | Deploy fails fast at PRE_FLIGHT, status FAILED, audit row |
| `max_total_memory_mb` | same | same |
| `max_total_replicas` | same | same |
| `max_execution_retention_days` | `EnvironmentService.update` (per-env field, see §4.1) + `ClickHouseSchemaInitializer.applyRetention()` at boot | 422 on update; boot pins effective TTL = `min(licenseCap, configured)` |
| `max_execution_retention_days` | `EnvironmentService.update` (per-env field, see §4.2) + `RetentionPolicyApplier` (see §4.3) | 422 on update; ClickHouse TTL recomputed on every license change |
| `max_log_retention_days` | same | same |
| `max_metric_retention_days` | same | same |
| `max_jar_retention_count` | `EnvironmentAdminController.PUT /jar-retention` | 422 |
### 4.1 Per-environment retention fields
### 4.2 Per-environment retention fields
Three new columns on `environments` (Flyway V2):
@@ -230,11 +274,30 @@ ALTER TABLE environments
```
These are the configured per-env values. The effective ClickHouse TTL is
`min(licenseCap, configured)`, applied at startup by `ClickHouseSchemaInitializer`. Admin UI
surfaces the configured values; `EnvironmentService.update` rejects values above the license cap
with 422.
`min(licenseCap, configured)`. Admin UI surfaces the configured values;
`EnvironmentService.update` rejects values above the license cap with 422.
### 4.2 Boot-time invariant
### 4.3 Runtime retention recompute
`RetentionPolicyApplier` is `@EventListener(LicenseChangedEvent)`:
- Triggered on every `LicenseService.replace(...)` (boot install, env-var override, file
override, POST `/admin/license`) **and** on every state transition the revalidation job
detects (e.g. license becomes `EXPIRED`, caps drop to default).
- Recomputes the effective TTL per env (`min(licenseCap, configured)`), then issues
`ALTER TABLE … MODIFY TTL …` on the affected ClickHouse tables (executions, processors,
logs, metrics, route_diagrams, agent_events). One ALTER per table per affected env.
- Errors are logged WARN; a failed ALTER does not block the license install — the operator can
retry by reposting the license. The previous TTL keeps applying until the next successful
ALTER.
- At boot, `LicenseService.loadInitial(...)` publishes one `LicenseChangedEvent` after the
load order in §6.2 settles, so the boot path goes through the same applier as runtime
changes.
Result: a server that stays up for months and lands in `EXPIRED` will see ClickHouse TTLs
collapse to default-tier values automatically — no restart needed.
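
The recompute step reduces to a `min` followed by statement generation. The sketch below is an assumption-laden illustration: `RetentionSketch`, the `timestamp` column name, and the day-granular interval are all hypothetical, and it does not model how per-environment values are combined within one ClickHouse table.

```java
/** Hypothetical sketch of the TTL recompute; table and column names are assumptions. */
final class RetentionSketch {
    /** Effective retention is the tighter of the license cap and the configured value. */
    static int effectiveDays(int licenseCapDays, int configuredDays) {
        return Math.min(licenseCapDays, configuredDays);
    }

    /** Builds a ClickHouse ALTER that replaces the table-level TTL expression. */
    static String alterTtl(String table, String timestampColumn, int days) {
        if (days <= 0) {
            throw new IllegalArgumentException("days must be positive");
        }
        return "ALTER TABLE " + table + " MODIFY TTL " + timestampColumn
                + " + INTERVAL " + days + " DAY";
    }
}
```

For example, a license cap of 7 days against a configured 30 days gives `effectiveDays(7, 30) == 7`, and the resulting statement for a `logs` table would be `ALTER TABLE logs MODIFY TTL timestamp + INTERVAL 7 DAY`.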
### 4.4 Boot-time invariant
If a license is added that *lowers* a cap below current usage (10 apps, license now allows 5), the
server logs one WARN per limit at boot. **No deletion**. New creates reject; existing resources
@@ -254,6 +317,8 @@ keep working.
"gracePeriodDays": 30,
"tenantId": "acme-corp",
"label": "ACME prod 2026",
"lastValidatedAt": "2026-04-26T03:14:07Z",
"message": "License active. 365 days remaining.",
"limits": [
{"key": "max_apps", "current": 7, "cap": 50, "source": "license"},
{"key": "max_agents", "current": 12, "cap": 100, "source": "license"},
@@ -267,12 +332,20 @@ keep working.
key, or there is no license), and `"license"` when the cap is explicit in the license. Drives the
SaaS UI's "free tier" badge.
`message` carries the same human-readable explanation that the 403 body uses, varying by state:
- `ABSENT` — "No license installed. Default tier applies."
- `ACTIVE` — "License active. {n} days remaining."
- `GRACE` — "License expired {n} days ago. Grace period ends in {m} days. Renew now to avoid degradation."
- `EXPIRED`— "License expired {n} days ago. System reverted to default tier."
- `INVALID`— "License rejected: {reason}. Default tier applies. Fix the license to recover."
`LicenseUsageReader` issues one cheap aggregate per limit (`SELECT COUNT(*)` per entity table; a
single grouped `SELECT SUM(replicas * cpuMillis), SUM(replicas * memoryMb), SUM(replicas)` over
non-stopped deployments).
`GET /api/v1/admin/license` (existing) is extended to return `{state, envelope}` with the raw token
omitted from the response.
`GET /api/v1/admin/license` (existing) is extended to return `{state, envelope, lastValidatedAt}`
with the raw token omitted from the response.
---
@@ -289,20 +362,30 @@ CREATE TABLE license (
license_id UUID NOT NULL,
installed_at TIMESTAMPTZ NOT NULL,
installed_by TEXT NOT NULL, -- users.user_id (bare) or 'system' for env/file boot
expires_at TIMESTAMPTZ NOT NULL
expires_at TIMESTAMPTZ NOT NULL,
last_validated_at TIMESTAMPTZ NOT NULL -- updated by boot, install, and revalidation job
);
```
`last_validated_at` is the timestamp of the most recent **successful** signature/parse round-trip
against the current public key. Useful for troubleshooting "why did my license stop working" — a
stale `last_validated_at` next to a recent `now` is a strong signal that revalidation is failing
and the operator should check the public key.
### 6.2 Boot order
`LicenseBeanConfig`:
1. If `CAMELEER_SERVER_LICENSE_TOKEN` env var is set → validate → write to DB (overwrite) →
load.
1. If `CAMELEER_SERVER_LICENSE_TOKEN` env var is set → validate → write to DB (overwrite,
sets `last_validated_at = now`) → load.
2. Else if `CAMELEER_SERVER_LICENSE_FILE` is set → read file → validate → write to DB → load.
3. Else read `license` row from DB → validate → load.
3. Else read `license` row from DB → validate → on success update `last_validated_at = now` →
load.
4. Else `ABSENT`.
After steps 1-3 the service publishes one `LicenseChangedEvent` so the retention applier and
metrics gauges initialise off the same code path as runtime changes.
Env-var / file act as **idempotent overrides** — they always win and replace the DB row, so the
operator's last action survives reboots.
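
The precedence in the boot order can be sketched as a pure function. `BootOrderSketch` and its `Source` enum are hypothetical names; the sketch only picks the winning source and assumes reading and validation happen afterwards.

```java
import java.util.Optional;

/** Hypothetical sketch of the boot-time source precedence; not the real LicenseBeanConfig. */
final class BootOrderSketch {
    enum Source { ENV, FILE, DB, NONE }

    /** Env var wins over file, file wins over the DB row; nothing at all means ABSENT. */
    static Source resolve(Optional<String> envToken,
                          Optional<String> fileToken,
                          Optional<String> dbToken) {
        if (envToken.isPresent()) return Source.ENV;
        if (fileToken.isPresent()) return Source.FILE;
        if (dbToken.isPresent()) return Source.DB;
        return Source.NONE;
    }
}
```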
@@ -310,15 +393,17 @@ operator's last action survives reboots.
`POST /api/v1/admin/license { "token": "..." }` (existing):
- Validates against the configured public key.
- On success, persists to `license` table (`installed_by = user_id`), updates the in-memory
`LicenseGate`, audits.
- On success, persists to `license` table (`installed_by = user_id`, `last_validated_at = now`),
updates the in-memory `LicenseGate`, publishes `LicenseChangedEvent`, audits.
- On failure, returns 400 with the validator error message and audits the rejection.
Server transitions to `INVALID` state if a previously-loaded license was replaced; otherwise
remains in its prior state (the rejected token is *not* written to DB).
### 6.4 Public key custody
`CAMELEER_SERVER_LICENSE_PUBLICKEY` (existing) remains the only verification key. Build- /
deploy-time secret bound to the vendor distribution. **Not stored in DB.** If unset *and* a
license is present → reject all licenses (existing behaviour).
license is present → reject all licenses (existing behaviour) → `INVALID` state.
### 6.5 Audit trail
@@ -329,7 +414,23 @@ New `AuditCategory.LICENSE`. Actions:
| `install_license` | First successful install in an empty state | `{licenseId, expiresAt, installedBy, source}` (`source` = `env`/`file`/`api`) |
| `replace_license` | Successful install over an existing license | same + `previousLicenseId` |
| `reject_license` | Validation failed (signature, tenant, parse, public key missing) | `{reason, source}` |
| `cap_exceeded` | Any `LicenseCapExceededException` | `{limit, current, cap, requestedBy}` |
| `revalidate_license` | Daily job result, on **failure only** | `{licenseId, reason}` |
| `cap_exceeded` | Any `LicenseCapExceededException` | `{limit, current, cap, requestedBy, state}` |
### 6.6 Daily revalidation job
`LicenseRevalidationJob`:
- `@Scheduled(cron = "0 0 3 * * *")` (03:00 server local time) plus an immediate run 60s
after boot.
- Reads the DB token, re-runs `LicenseValidator.validate(token)` against the current public
key.
- On success: `UPDATE license SET last_validated_at = now WHERE tenant_id = ?`.
- On failure (e.g. operator rotated the public key without reinstalling the license, or DB
row was tampered with directly): transition state to `INVALID`, publish
`LicenseChangedEvent` (so retention recomputes too), audit `revalidate_license` with the
reason, log `ERROR`.
- Cheap (no I/O beyond one DB read + one DB write); safe to run frequently. 03:00 is chosen
to coincide with off-peak so any ERROR noise lands when humans aren't deploying.
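
The job's decision logic can be sketched without Spring or a database. Everything here is hypothetical (`RevalidationSketch`, `Outcome`, the `Predicate` standing in for `LicenseValidator`); the point is that `last_validated_at` only advances on success, which is what makes the age gauge grow while revalidation keeps failing.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Predicate;

/** Hypothetical sketch of the revalidation outcome; persistence and events are left out. */
final class RevalidationSketch {
    static final class Outcome {
        final boolean valid;
        final Instant lastValidatedAt;

        Outcome(boolean valid, Instant lastValidatedAt) {
            this.valid = valid;
            this.lastValidatedAt = lastValidatedAt;
        }
    }

    /** On success the timestamp moves to now; on failure the previous value is kept. */
    static Outcome revalidate(String token, Predicate<String> validator,
                              Instant previousLastValidatedAt, Instant now) {
        boolean valid = validator.test(token);
        return new Outcome(valid, valid ? now : previousLastValidatedAt);
    }

    /** Value behind an age gauge such as cameleer_license_last_validated_age_seconds. */
    static long ageSeconds(Instant lastValidatedAt, Instant now) {
        return Duration.between(lastValidatedAt, now).getSeconds();
    }
}
```

A caller would map `valid == false` to the `INVALID` transition, the `LicenseChangedEvent`, and the `revalidate_license` audit row.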
---
@@ -353,6 +454,7 @@ Serializes `LicenseInfo` to canonical JSON (sorted keys), signs the bytes with E
```bash
java -jar cameleer-license-minter-1.0-SNAPSHOT.jar \
--private-key=/secure/vendor.key \
--public-key=/secure/vendor.pub \
--tenant=acme-corp \
--label="ACME prod 2026" \
--expires=2027-04-25 \
@@ -362,15 +464,26 @@ java -jar cameleer-license-minter-1.0-SNAPSHOT.jar \
--max-total-cpu-millis=32000 \
--max-total-memory-mb=65536 \
--max-execution-retention-days=90 \
--output=acme-license.tok
--output=acme-license.tok \
--verify
```
- `--private-key` reads a PEM-encoded Ed25519 private key (output of
`openssl genpkey -algorithm ed25519`).
- `--public-key` *(used only with `--verify`)* reads the matching public key. Required when
`--verify` is set; ignored otherwise.
- Unspecified `--max-*` flags are omitted from the payload — the license inherits the default for
that key.
- Unknown flags fail fast.
- `--output` writes the token; if omitted, prints to stdout.
- `--verify` round-trips the freshly-minted token through `LicenseValidator` against
`--public-key` *after* writing the output file. This catches:
- corruption between `String → file` write,
- wrong-key pairing (vendor accidentally pointed `--public-key` at a different keypair's
public half),
- signature mismatch from a buggy build of the minter.
On verify failure the CLI exits non-zero, prints the validator error, and (if `--output` was
written) deletes the output file so the bad token does not get shipped.
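
A round-trip of the `base64(payload).base64(ed25519_signature)` wire format might look like the sketch below. It uses the JDK's built-in `Ed25519` signature algorithm (Java 15+). The class `TokenRoundTripSketch` is hypothetical, the standard (non-URL-safe) Base64 alphabet is an assumption, and the real `LicenseValidator` also checks payload fields that are not modelled here.

```java
import java.security.GeneralSecurityException;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;

/** Hypothetical sketch of mint + verify over the two-part token format. */
final class TokenRoundTripSketch {
    /** Signs the payload bytes and joins payload and signature with a dot. */
    static String mint(byte[] payload, PrivateKey key) throws GeneralSecurityException {
        Signature signer = Signature.getInstance("Ed25519");
        signer.initSign(key);
        signer.update(payload);
        Base64.Encoder enc = Base64.getEncoder();
        return enc.encodeToString(payload) + "." + enc.encodeToString(signer.sign());
    }

    /** Returns true only if the token has two parts and the signature matches the payload. */
    static boolean verify(String token, PublicKey key) {
        String[] parts = token.split("\\.");
        if (parts.length != 2) return false;
        try {
            byte[] payload = Base64.getDecoder().decode(parts[0]);
            byte[] signature = Base64.getDecoder().decode(parts[1]);
            Signature verifier = Signature.getInstance("Ed25519");
            verifier.initVerify(key);
            verifier.update(payload);
            return verifier.verify(signature);
        } catch (IllegalArgumentException | GeneralSecurityException e) {
            // Bad Base64, malformed signature, or wrong key type all count as a failed verify.
            return false;
        }
    }
}
```

Verifying against a public key from a different keypair returns `false`, which is the wrong-key pairing case that `--verify` is meant to catch.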
Keypair generation is **out of band** — vendor uses `openssl` and stores both halves in their
secret manager. We deliberately do not ship a `--gen-keypair` subcommand to keep the boundary
@@ -384,13 +497,17 @@ Prometheus gauges scraped via `/api/v1/prometheus`:
| Metric | Labels | Notes |
|---|---|---|
| `cameleer_license_state` | `state="ABSENT\|ACTIVE\|GRACE\|EXPIRED"` | Boolean; exactly one is 1. |
| `cameleer_license_state` | `state="ABSENT\|ACTIVE\|GRACE\|EXPIRED\|INVALID"` | Boolean; exactly one is 1. |
| `cameleer_license_days_remaining` | (none) | Negative in GRACE/EXPIRED. |
| `cameleer_license_limit_utilisation`| `limit="max_apps"` etc. | `current / cap`, in `[0, 1+]`. |
| `cameleer_license_cap_rejections_total` | `limit="..."` | Counter. |
| `cameleer_license_last_validated_age_seconds` | (none) | `now - last_validated_at`. Spikes if the daily revalidation job is failing. |
State-transition log lines: `INFO` on install/ACTIVE, `WARN` on GRACE, `ERROR` on EXPIRED, `WARN`
on cap reject (sampled to avoid log spam).
State-transition log lines: `INFO` on install/ACTIVE, `WARN` on GRACE, `ERROR` on EXPIRED,
`ERROR` on INVALID, `WARN` on cap reject (sampled to avoid log spam).
Recommended alert (in cameleer-saas Grafana, not shipped with the server): page on
`cameleer_license_state{state="INVALID"} == 1` for > 5 minutes.
---
@@ -413,15 +530,17 @@ compatibility shims" preference, no deprecated path or feature flag.
| Layer | Tests |
|---|---|
| Core unit | `LicenseValidatorTest` — signature, expiry, tenant mismatch, missing required fields, unknown extra fields. |
| Core unit | `LicenseStateMachineTest` — all four transitions including grace boundary, replace from any state, invalid install. |
| Core unit | `LicenseValidatorTest` — signature, expiry, tenant mismatch, missing required fields (`tenantId`, `licenseId`, `iat`, `exp`), unknown extra fields. |
| Core unit | `LicenseStateMachineTest` — all five transitions including grace boundary, replace from any state, invalid install routes to `INVALID`, valid install from `INVALID` recovers to `ACTIVE`. |
| Core unit | `DefaultTierLimitsTest` — every documented key has a default. |
| Minter unit | `LicenseMinterTest` — round-trip with a throwaway Ed25519 keypair. Canonical JSON is stable across runs. |
| Minter CLI | `LicenseMinterCliTest` — invokes `main` with `--private-key=tmp` and checks output token validates. |
| App unit | `LicenseEnforcerTest` — for each limit: cap-reached, under-cap, default-tier with no license, missing-cap-inherits-default. |
| App integration | `LicenseLifecycleIT` — install via env, replace via POST, restart restores from DB. Driven through REST. |
| App integration | `LicenseEnforcementIT` — REST-driven, hit each cap end-to-end (per the project's "REST-API-driven ITs" preference). Includes `cap_exceeded` audit row check. |
| Boot | `SchemaBootstrapIT` extension — `license` table exists, `environments` retention columns exist, retention pinning honoured at boot. |
| Minter CLI | `LicenseMinterCliTest` — invokes `main` with `--private-key=tmp` and checks output token validates; `--verify` happy path; `--verify` failure path deletes the output file and exits non-zero. |
| App unit | `LicenseEnforcerTest` — for each limit: cap-reached, under-cap, default-tier with no license, missing-cap-inherits-default, message text varies per state. |
| App unit | `RetentionPolicyApplierTest` — license-changed event recomputes effective TTL per env; failed ALTER logs WARN and does not throw. |
| App integration | `LicenseLifecycleIT` — install via env, replace via POST, restart restores from DB, public-key removal at runtime transitions to `INVALID`, daily revalidation job updates `last_validated_at`. Driven through REST. |
| App integration | `LicenseEnforcementIT` — REST-driven, hit each cap end-to-end (per the project's "REST-API-driven ITs" preference). Includes `cap_exceeded` audit row check and verifies the 403 body's `message` field matches the state. |
| App integration | `RetentionRuntimeRecomputeIT` — install license with `max_log_retention_days=30`, observe `logs` TTL ALTER fires; replace with `max_log_retention_days=7`, observe TTL drops to 7 without restart. |
| Boot | `SchemaBootstrapIT` extension — `license` table exists with `last_validated_at`, `environments` retention columns exist, retention pinning honoured at boot. |
No raw-SQL seeding of caps in ITs. All caps installed via the REST endpoint or env var.
@@ -444,6 +563,7 @@ No raw-SQL seeding of caps in ITs. All caps installed via the REST endpoint or e
| Default tier so tight that an honest evaluator cannot try the product. | Defaults documented; vendor can ship a longer-`exp` "trial" license at install time if needed. |
| Customer lowers `gracePeriodDays` field by editing token. | Token is signed; any edit invalidates the signature. |
| License removed from DB out of band, server lands in ABSENT and rejects new resources but old ones are above default tier. | Boot-time WARN per over-cap limit. UI banner in the admin license page. No auto-deletion. |
| Public key rotation. | Out of scope for v1; documented as "redeploy with new key" — vendors are expected to rotate via redeployment. |
| Public key rotation. | Out of scope for v1; documented as "redeploy with new key" — vendors are expected to rotate via redeployment. Daily revalidation job catches a rotation that wasn't paired with a reinstall (state → `INVALID`, alertable). |
| Compute cap arithmetic relies on `cpuLimit` and `memoryLimitMb` being set on every container. | Existing `ResolvedContainerConfig` already enforces these; `DeploymentExecutor.PRE_FLIGHT` rejects deploys with unset compute fields. |
| Per-env retention column added but old ClickHouse partitions retain longer. | Documented: TTL change is honoured by ClickHouse on its next merge cycle. New rows inserted always honour the new TTL. |
| `RetentionPolicyApplier` issues blocking ALTERs from the event listener thread. | Applier runs ALTERs serialised but on a separate executor (not the publisher thread) so a slow ClickHouse does not stall the install API call. License install API returns immediately with the new state; retention recompute completes asynchronously and is observable via metrics. |