Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Store-level: assert findLatestContentHashForAppRoute picks the newest
hash across publishing instances (proves the lookup survives agent
removal), isolates by (app, env), and returns empty for blank inputs.
Controller-level: assert the env-scoped /routes/{routeId}/diagram
endpoint resolves without a registry prerequisite, 404s for unknown
routes, and that an execution's stored diagramContentHash stays pinned
to the point-in-time version after a newer diagram is stored — the
"latest" endpoint flips to v2, the by-hash render remains byte-stable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All production callers migrated to findLatestContentHashForAppRoute in
the preceding commits. The agent-scoped lookup adds no coverage beyond
the latest-per-(app,env,route) resolver, so the dead API is removed
along with its test coverage and unused imports.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both catalog controllers resolved the from-endpoint URI via
findContentHashForRouteByAgents, which filtered by the currently-live
agent instance_ids. Routes removed between app versions therefore lost
their fromUri even though the diagram row still exists.
Route through findLatestContentHashForAppRoute so resolution depends
only on (app, env, route) — stays populated for historical routes.
CatalogController now resolves the per-row env slug up-front so the
fromUri lookup works even for cross-env queries against managed apps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The env-scoped /routes/{routeId}/diagram endpoint filtered diagrams by
the currently-live agent instance_ids. Routes removed between app
versions have no live publisher, so the lookup returned 404 even though
the historical diagram row still exists in route_diagrams. Sidebar
entries for removed routes showed "no diagram" as a result.
Switch to findLatestContentHashForAppRoute which resolves directly off
(applicationId, environment, routeId) + created_at DESC, independent of
the agent registry. The controller no longer depends on
AgentRegistryService.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Agent-scoped lookups miss diagrams from routes whose publishing agents
have been redeployed or removed. The new method resolves by
(applicationId, environment, routeId) + created_at DESC, independent of
the agent registry. An in-memory cache mirrors the existing hashCache
pattern, warm-loaded at startup via argMax.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `TraefikLabelBuilder` hardcoded `tls.certresolver=default` on
every router. That assumes a resolver literally named `default` exists
in the Traefik static config — true for ACME-backed installs, false for
dev/local installs that use a file-based TLS store. Traefik logs
"Router uses a nonexistent certificate resolver" for the bogus resolver
on every managed app, and any future attempt to define a differently-
named real resolver would silently skip these routers.
Server-wide setting via `CAMELEER_SERVER_RUNTIME_CERTRESOLVER` (empty by
default) flows through `ConfigMerger.GlobalRuntimeDefaults.certResolver`
into `ResolvedContainerConfig.certResolver`. When blank the
`tls.certresolver` label is omitted entirely; `tls=true` is still
emitted so Traefik serves the default TLS-store cert. When set, the
label is emitted with the configured resolver name.
Not per-app/per-env configurable: there is one Traefik per server
instance and one resolver config; app-level override would only let
users break their own routers.
TDD: TraefikLabelBuilderTest gains 3 cases (resolver set, null, blank).
Full unit suite 211/0/0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a boolean `externalRouting` flag (default `true`) on
ResolvedContainerConfig. When `false`, TraefikLabelBuilder emits only
the identity labels (`managed-by`, `cameleer.*`) and skips every
`traefik.*` label, so the container is not published by Traefik.
Sibling containers on `cameleer-traefik` / `cameleer-env-{tenant}-{env}`
can still reach it via Docker DNS on whatever port the app listens on.
TDD: new TraefikLabelBuilderTest covers enabled (default labels present),
disabled (zero traefik.* labels), and disabled (identity labels retained)
cases. Full module unit suite: 208/0/0.
Plumbed through ConfigMerger read, DeploymentExecutor snapshot, UI form
state, Resources tab toggle, POST payload, and snapshot-to-form mapping.
Rule files updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The exclusion list still named the legacy flat `/api/v1/search/executions`
URL, which no longer exists — the endpoint moved to env-scoped
`/api/v1/environments/{envSlug}/executions/search`. Exact-match Set
lookup never matched, so every UI search POST produced an audit row.
Switch to AntPathMatcher over a pattern list so the dynamic envSlug is
handled correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surface from the Task 0 testcontainers.reuse enable: when the same Postgres
container is reused across `mvn verify` runs, Flyway migrates both `public`
and `tenant_default` schemas (the app.yml default URL uses
?currentSchema=tenant_default; AbstractPostgresIT overrides to public).
Schema-introspection assertions saw duplicate rows/indexes/enums.
Plus: OutboundConnectionAdminControllerIT's AfterEach couldn't delete its
test users because sibling deployment ITs (Task 4) left deployments.created_by
references — FK blocks the DELETE. Clear referencing deployments first.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds List<String> instanceIds to LogSearchRequest (null-normalized to
List.of() in compact ctor) and generates an IN clause in both
ClickHouseLogStore.search() and countLogs(), mirroring the existing
sources pattern. LogQueryController parses ?instanceIds= as a
comma-split list. All existing LogSearchRequest call sites updated.
New ClickHouseLogStoreInstanceIdsIT covers: multi-value filter, empty
filter (all rows), null filter (all rows), single-value filter, and
coexistence with the singular instanceId field.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wires AuditService and AppVersionRepository into DeploymentController.
Replaces null createdBy placeholder with currentUserId() on createDeployment/promote.
Adds audit log entries (SUCCESS + FAILURE) for deploy_app, stop_deployment,
and promote_deployment actions. Fixes FK violations in affected ITs by
seeding the test-operator and alice users into the users table before deploy calls.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix Issue 1: Add @AfterEach cleanup for alice/bob users in PostgresDeploymentRepositoryCreatedByIT to prevent test leakage (FK order: deployments -> app_versions -> apps, then users).
Fix Issue 2: Add comment at first create(..., null) call site in PostgresDeploymentRepositoryIT documenting the null placeholder for pre-V4 rows where createdBy is nullable.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Appends String createdBy to the Deployment record (after createdAt), updates
both with-er methods to pass it through, threads the parameter through
DeploymentRepository.create, DeploymentService.createDeployment/promote, and
PostgresDeploymentRepository (INSERT + SELECT_COLS + mapRow). DeploymentController
passes null as placeholder (Task 4 will resolve from SecurityContextHolder).
Covers with PostgresDeploymentRepositoryCreatedByIT verifying round-trip via
both createDeployment and promote.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Maven: enable useIncrementalCompilation; Surefire forkCount=1C +
reuseForks=true so unit-test JVMs are reused per CPU core instead of
spawning per class (205 tests pass under the new strategy).
- Testcontainers: opt-in reuse via .withReuse(true) on Postgres +
ClickHouse base; per-developer enable via ~/.testcontainers.properties.
- UI: drop redundant `tsc --noEmit` from `npm run build` (Vite already
type-checks); split into a dedicated `npm run typecheck` script.
- CI: cache ~/.npm and ui/node_modules/.vite alongside Maven; npm ci with
--prefer-offline --no-audit --fund=false; paths-ignore for docs-only,
.planning/ and .claude/ changes so doc-only pushes skip the pipeline.
- Docs: CLAUDE.md + .claude/rules/cicd.md updated with the new build
knobs and the Testcontainers reuse opt-in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Backend: rename deleteTerminalByAppAndEnvironment → deleteFailedByAppAndEnvironment.
STOPPED rows were being wiped on every redeploy, so Checkpoints was always empty.
Now only FAILED rows are pruned; STOPPED deployments are retained as restorable
checkpoints (they still carry deployed_config_snapshot from their RUNNING window).
- UI filter: any deployment with a snapshot is a checkpoint (was RUNNING|DEGRADED only,
which excluded the main case — the previous blue/green deployment now in STOPPED).
- UI placement: Checkpoints disclosure now renders inside IdentitySection, matching
the design spec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four ITs covering strategy behavior:
- BlueGreenStrategyIT#blueGreen_allHealthy_stopsOldAfterNew:
old is stopped only after all new replicas are healthy.
- BlueGreenStrategyIT#blueGreen_partialHealthy_preservesOldAndMarksFailed:
strict all-healthy — one starting replica aborts the deploy and
leaves the previous deployment RUNNING untouched.
- RollingStrategyIT#rolling_allHealthy_replacesOneByOne:
InOrder on stopContainer confirms old-0 stops before old-1 (the
interleaving that distinguishes rolling from blue-green).
- RollingStrategyIT#rolling_failsMidRollout_preservesRemainingOld:
mid-rollout health failure stops only the in-flight new containers
and the already-replaced old-0; old-1 stays untouched.
Shortens healthchecktimeout to 2s via @TestPropertySource so failure
paths complete in ~25s instead of ~60s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the Phase 3 stub with a working rolling implementation.
Flow:
- Capture previous deployment's per-index container ids up front.
- For i = 0..replicas-1:
- Start new[i] (gen-suffixed name, coexists with old[i]).
- Wait for new[i] healthy (new waitForOneHealthy helper).
- On success: stop old[i] if present, continue.
- On failure: stop in-flight new[0..i], leave un-replaced old[i+1..N]
running, mark FAILED. Already-replaced old replicas are not
restored — rolling is not reversible; user redeploys to recover.
- After the loop: sweep any leftover old replicas (when replica count
shrank) and mark the old deployment STOPPED.
Resource peak: replicas + 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of deployment-strategies plan. Refactor executeAsync to
dispatch on DeploymentStrategy.fromWire(config.deploymentStrategy()).
Blue-green (default):
- Start all N new replicas (gen-suffixed names coexist with old).
- Wait for ALL healthy (strict — partial-healthy = FAILED, preserves
previous deployment untouched).
- Only then find + stop the previous deployment.
- Final status is always RUNNING; DEGRADED is now reserved for
post-deploy replica crashes (set by DockerEventMonitor).
Rolling: stub — throws UnsupportedOperationException for now, gets
its real implementation in Phase 4.
Refactor details:
- Extract DeployCtx record to carry 13 per-deploy values around.
- Extract startReplica(ctx, i, stateOut) — shared by both strategy paths.
- Extract persistSnapshotAndMarkRunning(ctx, primaryCid) — shared finalizer.
- Rename waitForAnyHealthy → waitForAllHealthy (the name was misleading;
the method already waited for all, just returned partial on timeout).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append an 8-char generation id (first 8 chars of deployment UUID) to:
- container name: {tenant}-{env}-{app}-{replica}-{gen}
- CAMELEER_AGENT_INSTANCEID (so old+new agents are distinct in the registry)
- Traefik cameleer.instance-id label
And emit a new standalone cameleer.generation label so dashboards
(Prometheus/Grafana) can pin deploy boundaries without regex on
instance-id.
Strategy branching comes next — this commit is foundation only; the
interim destroy-then-start flow still runs regardless of strategy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Container names are deterministic: {tenant}-{envSlug}-{appSlug}-{replica}.
The prior code did the stop-existing step at SWAP_TRAFFIC, *after*
START_REPLICAS had already tried to create containers with the same
names — so a redeploy against a RUNNING app consistently failed with
Docker 409 "container name already in use".
Move the stop-existing block to run right after CREATE_NETWORK and
before START_REPLICAS. SWAP_TRAFFIC becomes a label-only marker (traffic
is swapped implicitly by Traefik labels once new replicas are healthy).
Also: add `findActiveByAppIdAndEnvironmentIdExcluding` so the SQL
excludes the current deployment by id — previously the Java-side
`!id.equals(me)` guard failed because the newly-inserted row has
status=STARTING (DB default) and ORDER BY created_at DESC LIMIT 1
picked the new row, hiding the actual previous deployment.
Trade-off: this is destroy-then-start rather than true blue/green —
brief downtime during the swap. Matches the pre-unified-page behavior
and is what users reasonably expect. True blue/green would require
per-deployment container names.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace empty-body ResponseEntity.status(BAD_REQUEST).build() with
ResponseStatusException so Spring returns the usual error body shape
with a descriptive reason string, matching the idiom used by
UserAdminController, AppSettingsController, ThresholdAdminController.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds @Parameter description so the generated OpenAPI spec / Swagger UI
explains what 'staged' vs 'live' means instead of just surfacing the
bare param name. Follow-up: run `cd ui && npm run generate-api:live`
against a live backend to refresh openapi.json + schema.d.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Existing IT only exercises the startContainer-throws path, where the
exception bypasses the entire try block. Add a test where startContainer
succeeds but getContainerStatus never returns healthy — this covers the
early-exit at the HEALTH_CHECK stage, which is the common real-world
failure shape and closest to the snapshot-write point.
Shortens healthchecktimeout to 2s via @TestPropertySource so the test
completes in a few seconds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot is written by DeploymentExecutor before the RUNNING/DEGRADED
split, so DEGRADED rows already carry a deployed_config_snapshot. Treat
them as checkpoints — partial-healthy deploys still produced a working
config worth restoring. Aligns repo query with UI filter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Issue 1: add List<String> sensitiveKeys as 4th field to DeploymentConfigSnapshot; populate
from agentConfig.getSensitiveKeys() in DeploymentExecutor; handleRestore hydrates from
snap.sensitiveKeys directly; Deployment type in apps.ts gains sensitiveKeys field
- Issue 2: after createApp succeeds, refetchQueries(['apps', envSlug]) before navigate so the
new app is in cache before the router renders the deployed view (eliminates transient Save-
disabled flash)
- Issue 3: useDeploymentPageState useEffect now uses prevServerStateRef to detect local edits;
background refetches only overwrite form when no local changes are present
- Issue 5: handleRedeploy invalidates dirty-state + versions queries after createDeployment
resolves; handleSave invalidates dirty-state after staged save
- Issue 10: DirtyStateCalculator strips volatile agentConfig keys (version, updatedAt, updatedBy,
environment, application) before JSON comparison via scrubAgentConfig(); adds
versionBumpDoesNotMarkDirty test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wires DirtyStateCalculator behind an HTTP endpoint on AppController.
Adds findLatestSuccessfulByAppAndEnv to PostgresDeploymentRepository,
registers DirtyStateCalculator as a Spring bean (with ObjectMapper for
JavaTimeModule support), and covers all three scenarios with IT.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add @DirtiesContext(AFTER_CLASS) so the SpyBean-forked context is torn
down after the 6 tests finish, preventing permanent cache pollution
- Replace single-row queryForObject with queryForList + hasSize(1) in both
audit tests so spurious extra rows will fail explicitly
- Assert auditCount == 0 in the 400 test to lock in the no-audit-on-bad-input invariant
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the misleading putConfig_staged_auditActionIsStagedAppConfig test
(which only checked pushResult.total == 0, a duplicate of _savesButDoesNotPush)
with two real audit-log assertions: one verifying "stage_app_config" is written
for apply=staged and a new companion test verifying "update_app_config" for the
live path. Uses jdbcTemplate to query audit_log directly (Option B).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When apply=staged, saves to DB only — no CONFIG_UPDATE dispatched to agents.
When apply=live (default, back-compat), preserves today's immediate-push behavior.
Unknown apply values return 400. Audit action is stage_app_config vs update_app_config.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the missing containerConfig assertion to snapshot_isPopulated_whenDeploymentReachesRunning
(runtimeType + appPort entries), and tightens the await predicate from .isIn(RUNNING, DEGRADED)
to .isEqualTo(RUNNING) — the mock returns a healthy container so RUNNING is deterministic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Injects PostgresApplicationConfigRepository into DeploymentExecutor and
calls saveDeployedConfigSnapshot at the COMPLETE stage, before
markRunning. Snapshot contains jarVersionId, agentConfig (nullable),
and app.containerConfig. The FAILED catch path is left untouched so
snapshot stays null on failure. Verified by DeploymentSnapshotIT.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace manual `new PostgresDeploymentRepository(jdbcTemplate, new ObjectMapper())` with
`@Autowired PostgresDeploymentRepository repository` to use the Spring-managed bean whose
ObjectMapper has JavaTimeModule registered. Also removes the redundant isNotNull() assertion
whose work is done by the field-level assertions that follow.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Appends DeploymentConfigSnapshot deployedConfigSnapshot to the Deployment
record and adds a matching withDeployedConfigSnapshot wither. All
positional call sites (repository mapper, test fixture) updated to pass
null; Task 1.4 will wire real persistence and Task 1.5 will populate
the field on RUNNING transition.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jmustache on the classpath (for alert notification templates) triggers
Spring Boot's MustacheAutoConfiguration, which warns about the missing
classpath:/templates/ folder we don't use. Disable its check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix stale `AGGREGATE` label (actual enum: `COUNT_IN_WINDOW`). Expand
EXCHANGE_MATCH section with both fire modes, PER_EXCHANGE config-surface
restrictions (0 for reNotifyMinutes/forDurationSeconds, at-least-one-sink
rule), exactly-once guarantee scope, and the first-run backlog-cap knob.
Surface the new config in application.yml with the 24h default and the
opt-out-to-0 semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New property cameleer.server.alerting.perExchangeDeployBacklogCapSeconds
(default 86400 = 24h, 0 disables). On first run (no persisted cursor
or malformed), clamp cursorTs to max(rule.createdAt, now - cap) so a
long-lived PER_EXCHANGE rule doesn't scan from its creation date
forward on first post-deploy tick. Normal-advance path unaffected.
Follows up final-review I-1 on the PER_EXCHANGE exactly-once phase.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end lifecycle test: 5 FAILED exchanges across 2 ticks produces
exactly 5 FIRING instances + 5 PENDING notifications. Tick 3 with no
new exchanges produces zero new instances or notifications. Ack on one
instance leaves the other four untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PER_EXCHANGE rules: 400 if reNotifyMinutes != 0 or forDurationSeconds != 0.
Any rule: 400 if webhooks + targets are both empty (never notifies anyone).
Turns green: AlertRuleControllerIT#createPerExchangeRule_with*NonZero_returns400,
AlertRuleControllerIT#createAnyRule_withEmptyWebhooksAndTargets_returns400.
Three failing IT tests documenting the contract Task 3.3 will satisfy:
- createPerExchangeRule_withReNotifyMinutesNonZero_returns400
- createPerExchangeRule_withForDurationSecondsNonZero_returns400
- createAnyRule_withEmptyWebhooksAndTargets_returns400
Dead field — was enforced by compact ctor as required for PER_EXCHANGE,
but never read anywhere in the codebase. Removal tightens the API surface
and is precondition for the Task 3.3 cross-field validator.
Pre-prod; no shim / migration.
Wraps instance writes, notification enqueues, and cursor advance in one
transactional boundary per rule tick. Rollback leaves the rule replayable
on next tick. Turns the Phase 2 atomicity IT green (see AlertEvaluatorJobIT
#tickRollback_faultOnSecondNotificationInsert_leavesCursorUnchanged).
Fault-injection IT asserts that a crash mid-batch rolls back every
instance + notification write AND leaves the cursor unchanged. Fails
against current (Phase 1 only) code — turns green when Task 2.2
wraps batch processing in @Transactional.
Replace async @AfterEach ALTER...DELETE with @BeforeEach TRUNCATE TABLE
executions — matches the convention used in ClickHouseExecutionStoreIT
and peers. Env-slug isolation was already preventing cross-test pollution;
this change is about hygiene and determinism (TRUNCATE is synchronous).