LicenseGate now exposes getState() (delegates to LicenseStateMachine),
getEffectiveLimits() (merged over DefaultTierLimits in ACTIVE/GRACE,
defaults-only in ABSENT/EXPIRED/INVALID), markInvalid(reason), and
clear(). Internal snapshot is an immutable record-like class swapped
atomically so concurrent reads see a consistent license+reason pair.
Removes the transient openSentinel() and getTier() introduced by
earlier tasks (no production consumers).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §2.1 — both fields are required and the validator rejects a
token whose tenantId does not match the server's configured tenant
(CAMELEER_SERVER_TENANT_ID). Self-hosted customers cannot strip
tenantId because the field is in the signed payload.
LicenseBeanConfig and LicenseAdminController updated to pass the
expected tenant to the validator constructor. The transient
placeholder/TODO from Task 2 is removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required fields per spec §2.1. tenantId is non-blank; gracePeriodDays
defines the post-exp window during which limits keep applying.
isExpired() now honours the grace; isAfterRawExpiry() distinguishes
ACTIVE from GRACE for the state machine in Task 4.
Validator and gate use placeholder values temporarily; Task 3 wires
the validator to read the new fields, Task 5 rewrites the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §9 — feature flags are out of scope for license enforcement.
Drops Feature.java, LicenseGate.isEnabled, LicenseInfo.hasFeature,
and the corresponding test cases. LicenseValidator now silently
ignores any features array on the wire (no error).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tenant JARs are arbitrary user code: Camel ships components (camel-exec,
camel-bean, MVEL/Groovy templating) that turn a header into shell, and
Java 17 has no SecurityManager — the JVM is not a security boundary.
This applies an unconditional hardening contract to every tenant
container so a single runc CVE no longer equals host takeover.
DockerRuntimeOrchestrator.startContainer now sets:
- cap_drop ALL (Capability.values() — docker-java has no ALL constant)
- security_opt: no-new-privileges, apparmor=docker-default
(default seccomp profile applies implicitly)
- read_only rootfs, pids_limit=512
- /tmp tmpfs rw,nosuid,size=256m — no noexec, since Netty/Snappy/LZ4/Zstd
dlopen native libs from /tmp via mmap(PROT_EXEC) which noexec blocks
The orchestrator also probes `docker info` at construction and uses runsc
(gVisor) automatically when the daemon has it registered. Override via
cameleer.server.runtime.dockerruntime (e.g. "kata"); empty = auto.
Outbound TCP, DNS, and TLS are unaffected — caps/seccomp don't gate
those — so vanilla Camel-Kafka producers/consumers and REST integrations
keep working unchanged. Stateful tenants (Kafka Streams with on-disk
state stores, apps writing to /var/log/...) need explicit writeable
volumes; that's tracked in #153 as the natural follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a repeatable attr query parameter to the GET /executions endpoint that
parses key-only (exists check) and key:value (exact or wildcard-via-*)
filters. Invalid keys are mapped to HTTP 400 via ResponseStatusException.
The POST /executions/search path already honoured attributeFilters from
the request body via the Jackson canonical ctor; an IT now proves it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds /api/v1/admin/server-metrics/{catalog,instances,query} so SaaS control
planes can build the server-health dashboard without direct ClickHouse
access. One generic /query endpoint covers every panel in the
server-self-metrics doc: aggregation (avg/sum/max/min/latest), group-by-tag,
filter-by-tag, counter-delta mode with per-server_instance_id rotation
handling, and a derived 'mean' statistic for timers. Regex-validated
identifiers, parameterised literals, 31-day range cap, 500-series response
cap. ADMIN-only via the existing /api/v1/admin/** RBAC gate. Docs updated:
all 17 suggested panels now expressed as single-endpoint queries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Store-level: assert findLatestContentHashForAppRoute picks the newest
hash across publishing instances (proves the lookup survives agent
removal), isolates by (app, env), and returns empty for blank inputs.
Controller-level: assert the env-scoped /routes/{routeId}/diagram
endpoint resolves without a registry prerequisite, 404s for unknown
routes, and that an execution's stored diagramContentHash stays pinned
to the point-in-time version after a newer diagram is stored — the
"latest" endpoint flips to v2, the by-hash render remains byte-stable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All production callers migrated to findLatestContentHashForAppRoute in
the preceding commits. The agent-scoped lookup adds no coverage beyond
the latest-per-(app,env,route) resolver, so the dead API is removed
along with its test coverage and unused imports.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `TraefikLabelBuilder` hardcoded `tls.certresolver=default` on
every router. That assumes a resolver literally named `default` exists
in the Traefik static config — true for ACME-backed installs, false for
dev/local installs that use a file-based TLS store. Traefik logs
"Router uses a nonexistent certificate resolver" for the bogus resolver
on every managed app, and any future attempt to define a differently-
named real resolver would silently skip these routers.
Server-wide setting via `CAMELEER_SERVER_RUNTIME_CERTRESOLVER` (empty by
default) flows through `ConfigMerger.GlobalRuntimeDefaults.certResolver`
into `ResolvedContainerConfig.certResolver`. When blank the
`tls.certresolver` label is omitted entirely; `tls=true` is still
emitted so Traefik serves the default TLS-store cert. When set, the
label is emitted with the configured resolver name.
Not per-app/per-env configurable: there is one Traefik per server
instance and one resolver config; app-level override would only let
users break their own routers.
TDD: TraefikLabelBuilderTest gains 3 cases (resolver set, null, blank).
Full unit suite 211/0/0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a boolean `externalRouting` flag (default `true`) on
ResolvedContainerConfig. When `false`, TraefikLabelBuilder emits only
the identity labels (`managed-by`, `cameleer.*`) and skips every
`traefik.*` label, so the container is not published by Traefik.
Sibling containers on `cameleer-traefik` / `cameleer-env-{tenant}-{env}`
can still reach it via Docker DNS on whatever port the app listens on.
TDD: new TraefikLabelBuilderTest covers enabled (default labels present),
disabled (zero traefik.* labels), and disabled (identity labels retained)
cases. Full module unit suite: 208/0/0.
Plumbed through ConfigMerger read, DeploymentExecutor snapshot, UI form
state, Resources tab toggle, POST payload, and snapshot-to-form mapping.
Rule files updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surface from the Task 0 testcontainers.reuse enable: when the same Postgres
container is reused across `mvn verify` runs, Flyway migrates both `public`
and `tenant_default` schemas (the app.yml default URL uses
?currentSchema=tenant_default; AbstractPostgresIT overrides to public).
Schema-introspection assertions saw duplicate rows/indexes/enums.
Plus: OutboundConnectionAdminControllerIT's AfterEach couldn't delete its
test users because sibling deployment ITs (Task 4) left deployments.created_by
references — FK blocks the DELETE. Clear referencing deployments first.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds List<String> instanceIds to LogSearchRequest (null-normalized to
List.of() in compact ctor) and generates an IN clause in both
ClickHouseLogStore.search() and countLogs(), mirroring the existing
sources pattern. LogQueryController parses ?instanceIds= as a
comma-split list. All existing LogSearchRequest call sites updated.
New ClickHouseLogStoreInstanceIdsIT covers: multi-value filter, empty
filter (all rows), null filter (all rows), single-value filter, and
coexistence with the singular instanceId field.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wires AuditService and AppVersionRepository into DeploymentController.
Replaces null createdBy placeholder with currentUserId() on createDeployment/promote.
Adds audit log entries (SUCCESS + FAILURE) for deploy_app, stop_deployment,
and promote_deployment actions. Fixes FK violations in affected ITs by
seeding the test-operator and alice users into the users table before deploy calls.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix Issue 1: Add @AfterEach cleanup for alice/bob users in PostgresDeploymentRepositoryCreatedByIT to prevent test leakage (FK order: deployments -> app_versions -> apps, then users).
Fix Issue 2: Add comment at first create(..., null) call site in PostgresDeploymentRepositoryIT documenting the null placeholder for pre-V4 rows where createdBy is nullable.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Appends String createdBy to the Deployment record (after createdAt), updates
both with-er methods to pass it through, threads the parameter through
DeploymentRepository.create, DeploymentService.createDeployment/promote, and
PostgresDeploymentRepository (INSERT + SELECT_COLS + mapRow). DeploymentController
passes null as placeholder (Task 4 will resolve from SecurityContextHolder).
Covers with PostgresDeploymentRepositoryCreatedByIT verifying round-trip via
both createDeployment and promote.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Maven: enable useIncrementalCompilation; Surefire forkCount=1C +
reuseForks=true so unit-test JVMs are reused per CPU core instead of
spawning per class (205 tests pass under the new strategy).
- Testcontainers: opt-in reuse via .withReuse(true) on Postgres +
ClickHouse base; per-developer enable via ~/.testcontainers.properties.
- UI: drop redundant `tsc --noEmit` from `npm run build` (Vite already
type-checks); split into a dedicated `npm run typecheck` script.
- CI: cache ~/.npm and ui/node_modules/.vite alongside Maven; npm ci with
--prefer-offline --no-audit --fund=false; paths-ignore for docs-only,
.planning/ and .claude/ changes so doc-only pushes skip the pipeline.
- Docs: CLAUDE.md + .claude/rules/cicd.md updated with the new build
knobs and the Testcontainers reuse opt-in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Backend: rename deleteTerminalByAppAndEnvironment → deleteFailedByAppAndEnvironment.
STOPPED rows were being wiped on every redeploy, so Checkpoints was always empty.
Now only FAILED rows are pruned; STOPPED deployments are retained as restorable
checkpoints (they still carry deployed_config_snapshot from their RUNNING window).
- UI filter: any deployment with a snapshot is a checkpoint (was RUNNING|DEGRADED only,
which excluded the main case — the previous blue/green deployment now in STOPPED).
- UI placement: Checkpoints disclosure now renders inside IdentitySection, matching
the design spec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four ITs covering strategy behavior:
- BlueGreenStrategyIT#blueGreen_allHealthy_stopsOldAfterNew:
old is stopped only after all new replicas are healthy.
- BlueGreenStrategyIT#blueGreen_partialHealthy_preservesOldAndMarksFailed:
strict all-healthy — one starting replica aborts the deploy and
leaves the previous deployment RUNNING untouched.
- RollingStrategyIT#rolling_allHealthy_replacesOneByOne:
InOrder on stopContainer confirms old-0 stops before old-1 (the
interleaving that distinguishes rolling from blue-green).
- RollingStrategyIT#rolling_failsMidRollout_preservesRemainingOld:
mid-rollout health failure stops only the in-flight new containers
and the already-replaced old-0; old-1 stays untouched.
Shortens healthchecktimeout to 2s via @TestPropertySource so failure
paths complete in ~25s instead of ~60s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Existing IT only exercises the startContainer-throws path, where the
exception bypasses the entire try block. Add a test where startContainer
succeeds but getContainerStatus never returns healthy — this covers the
early-exit at the HEALTH_CHECK stage, which is the common real-world
failure shape and closest to the snapshot-write point.
Shortens healthchecktimeout to 2s via @TestPropertySource so the test
completes in a few seconds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Issue 1: add List<String> sensitiveKeys as 4th field to DeploymentConfigSnapshot; populate
from agentConfig.getSensitiveKeys() in DeploymentExecutor; handleRestore hydrates from
snap.sensitiveKeys directly; Deployment type in apps.ts gains sensitiveKeys field
- Issue 2: after createApp succeeds, refetchQueries(['apps', envSlug]) before navigate so the
new app is in cache before the router renders the deployed view (eliminates transient Save-
disabled flash)
- Issue 3: useDeploymentPageState useEffect now uses prevServerStateRef to detect local edits;
background refetches only overwrite form when no local changes are present
- Issue 5: handleRedeploy invalidates dirty-state + versions queries after createDeployment
resolves; handleSave invalidates dirty-state after staged save
- Issue 10: DirtyStateCalculator strips volatile agentConfig keys (version, updatedAt, updatedBy,
environment, application) before JSON comparison via scrubAgentConfig(); adds
versionBumpDoesNotMarkDirty test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wires DirtyStateCalculator behind an HTTP endpoint on AppController.
Adds findLatestSuccessfulByAppAndEnv to PostgresDeploymentRepository,
registers DirtyStateCalculator as a Spring bean (with ObjectMapper for
JavaTimeModule support), and covers all three scenarios with IT.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add @DirtiesContext(AFTER_CLASS) so the SpyBean-forked context is torn
down after the 6 tests finish, preventing permanent cache pollution
- Replace single-row queryForObject with queryForList + hasSize(1) in both
audit tests so spurious extra rows will fail explicitly
- Assert auditCount == 0 in the 400 test to lock in the no-audit-on-bad-input invariant
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the misleading putConfig_staged_auditActionIsStagedAppConfig test
(which only checked pushResult.total == 0, a duplicate of _savesButDoesNotPush)
with two real audit-log assertions: one verifying "stage_app_config" is written
for apply=staged and a new companion test verifying "update_app_config" for the
live path. Uses jdbcTemplate to query audit_log directly (Option B).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When apply=staged, saves to DB only — no CONFIG_UPDATE dispatched to agents.
When apply=live (default, back-compat), preserves today's immediate-push behavior.
Unknown apply values return 400. Audit action is stage_app_config vs update_app_config.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the missing containerConfig assertion to snapshot_isPopulated_whenDeploymentReachesRunning
(runtimeType + appPort entries), and tightens the await predicate from .isIn(RUNNING, DEGRADED)
to .isEqualTo(RUNNING) — the mock returns a healthy container so RUNNING is deterministic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Injects PostgresApplicationConfigRepository into DeploymentExecutor and
calls saveDeployedConfigSnapshot at the COMPLETE stage, before
markRunning. Snapshot contains jarVersionId, agentConfig (nullable),
and app.containerConfig. The FAILED catch path is left untouched so
snapshot stays null on failure. Verified by DeploymentSnapshotIT.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace manual `new PostgresDeploymentRepository(jdbcTemplate, new ObjectMapper())` with
`@Autowired PostgresDeploymentRepository repository` to use the Spring-managed bean whose
ObjectMapper has JavaTimeModule registered. Also removes the redundant isNotNull() assertion
whose work is done by the field-level assertions that follow.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Appends DeploymentConfigSnapshot deployedConfigSnapshot to the Deployment
record and adds a matching withDeployedConfigSnapshot wither. All
positional call sites (repository mapper, test fixture) updated to pass
null; Task 1.4 will wire real persistence and Task 1.5 will populate
the field on RUNNING transition.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New property cameleer.server.alerting.perExchangeDeployBacklogCapSeconds
(default 86400 = 24h, 0 disables). On first run (no persisted cursor
or malformed), clamp cursorTs to max(rule.createdAt, now - cap) so a
long-lived PER_EXCHANGE rule doesn't scan from its creation date
forward on first post-deploy tick. Normal-advance path unaffected.
Follows up final-review I-1 on the PER_EXCHANGE exactly-once phase.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end lifecycle test: 5 FAILED exchanges across 2 ticks produces
exactly 5 FIRING instances + 5 PENDING notifications. Tick 3 with no
new exchanges produces zero new instances or notifications. Ack on one
instance leaves the other four untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PER_EXCHANGE rules: 400 if reNotifyMinutes != 0 or forDurationSeconds != 0.
Any rule: 400 if webhooks + targets are both empty (never notifies anyone).
Turns green: AlertRuleControllerIT#createPerExchangeRule_with*NonZero_returns400,
AlertRuleControllerIT#createAnyRule_withEmptyWebhooksAndTargets_returns400.
Three failing IT tests documenting the contract Task 3.3 will satisfy:
- createPerExchangeRule_withReNotifyMinutesNonZero_returns400
- createPerExchangeRule_withForDurationSecondsNonZero_returns400
- createAnyRule_withEmptyWebhooksAndTargets_returns400
Dead field — was enforced by compact ctor as required for PER_EXCHANGE,
but never read anywhere in the codebase. Removal tightens the API surface
and is precondition for the Task 3.3 cross-field validator.
Pre-prod; no shim / migration.
Wraps instance writes, notification enqueues, and cursor advance in one
transactional boundary per rule tick. Rollback leaves the rule replayable
on next tick. Turns the Phase 2 atomicity IT green (see AlertEvaluatorJobIT
#tickRollback_faultOnSecondNotificationInsert_leavesCursorUnchanged).
Fault-injection IT asserts that a crash mid-batch rolls back every
instance + notification write AND leaves the cursor unchanged. Fails
against current (Phase 1 only) code — turns green when Task 2.2
wraps batch processing in @Transactional.
Replace async @AfterEach ALTER...DELETE with @BeforeEach TRUNCATE TABLE
executions — matches the convention used in ClickHouseExecutionStoreIT
and peers. Env-slug isolation was already preventing cross-test pollution;
this change is about hygiene and determinism (TRUNCATE is synchronous).
Two failing tests documenting the contract Task 1.5 will satisfy:
- cursorMonotonicity_sameMillisecondExchanges_fireExactlyOncePerTick
- firstRun_boundedByRuleCreatedAt_notRetentionHistory
Compile may fail until Task 1.4 adds AlertRule.withEvalState wither.
Adds an optional afterExecutionId field to SearchRequest. When combined
with a non-null timeFrom, ClickHouseSearchIndex applies a strictly-after
tuple predicate (start_time > ts OR (start_time = ts AND execution_id > id))
so same-millisecond exchanges can be consumed exactly once across ticks.
When afterExecutionId is null, timeFrom keeps its existing >= semantics —
no behaviour change for any current caller.
Also adds the SearchRequest.withCursor(ts, id) wither. Threads the field
through existing withInstanceIds / withEnvironment witheres. All existing
positional call-sites (SearchController, ExchangeMatchEvaluator,
ClickHouseSearchIndexIT, ClickHouseChunkPipelineIT) pass null for the new
slot.
Task 1.2 of docs/superpowers/plans/2026-04-22-per-exchange-exactly-once.md.
The evaluator-side wiring that actually supplies the cursor is Task 1.5.
The @Scheduled placeholder read ${ingestion.flush-interval-ms:1000}
(unprefixed) but IngestionConfig binds cameleer.server.ingestion.* —
YAML tuning of the metrics flush interval was silently ignored and the
scheduler fell back to the 1s default in every environment.
Corrected to ${cameleer.server.ingestion.flush-interval-ms:1000}.
(The initial attempt to bind via SpEL #{@ingestionConfig.flushIntervalMs}
failed because beans registered via @EnableConfigurationProperties use a
compound bean name "<prefix>-<FQN>", not the simple camelCase form. The
property-placeholder path is sufficient — IngestionConfig still owns
the Java-side default.)
BackpressureIT: drops the obsolete workaround property
`ingestion.flush-interval-ms=60000`; the single prefixed override now
controls both buffer config and flush cadence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>