Adds /api/v1/admin/server-metrics/{catalog,instances,query} so SaaS control
planes can build the server-health dashboard without direct ClickHouse
access. One generic /query endpoint covers every panel in the
server-self-metrics doc: aggregation (avg/sum/max/min/latest), group-by-tag,
filter-by-tag, counter-delta mode with per-server_instance_id rotation
handling, and a derived 'mean' statistic for timers. Regex-validated
identifiers, parameterised literals, 31-day range cap, 500-series response
cap. ADMIN-only via the existing /api/v1/admin/** RBAC gate. Docs updated:
all 17 suggested panels now expressed as single-endpoint queries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot the full Micrometer registry (cameleer business metrics, alerting
metrics, and Spring Boot Actuator defaults) every 60s into a new
server_metrics table so server health survives restarts without an external
Prometheus. Includes a dashboard-builder reference for the SaaS team.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Store-level: assert findLatestContentHashForAppRoute picks the newest
hash across publishing instances (proves the lookup survives agent
removal), isolates by (app, env), and returns empty for blank inputs.
Controller-level: assert the env-scoped /routes/{routeId}/diagram
endpoint resolves without a registry prerequisite, 404s for unknown
routes, and that an execution's stored diagramContentHash stays pinned
to the point-in-time version after a newer diagram is stored — the
"latest" endpoint flips to v2, the by-hash render remains byte-stable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All production callers migrated to findLatestContentHashForAppRoute in
the preceding commits. The agent-scoped lookup adds no coverage beyond
the latest-per-(app,env,route) resolver, so the dead API is removed
along with its test coverage and unused imports.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `TraefikLabelBuilder` hardcoded `tls.certresolver=default` on
every router. That assumes a resolver literally named `default` exists
in the Traefik static config — true for ACME-backed installs, false for
dev/local installs that use a file-based TLS store. Traefik logs
"Router uses a nonexistent certificate resolver" for the bogus resolver
on every managed app, and any future attempt to define a differently-
named real resolver would silently skip these routers.
Server-wide setting via `CAMELEER_SERVER_RUNTIME_CERTRESOLVER` (empty by
default) flows through `ConfigMerger.GlobalRuntimeDefaults.certResolver`
into `ResolvedContainerConfig.certResolver`. When blank the
`tls.certresolver` label is omitted entirely; `tls=true` is still
emitted so Traefik serves the default TLS-store cert. When set, the
label is emitted with the configured resolver name.
Not per-app/per-env configurable: there is one Traefik per server
instance and one resolver config; app-level override would only let
users break their own routers.
TDD: TraefikLabelBuilderTest gains 3 cases (resolver set, null, blank).
Full unit suite 211/0/0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a boolean `externalRouting` flag (default `true`) on
ResolvedContainerConfig. When `false`, TraefikLabelBuilder emits only
the identity labels (`managed-by`, `cameleer.*`) and skips every
`traefik.*` label, so the container is not published by Traefik.
Sibling containers on `cameleer-traefik` / `cameleer-env-{tenant}-{env}`
can still reach it via Docker DNS on whatever port the app listens on.
TDD: new TraefikLabelBuilderTest covers enabled (default labels present),
disabled (zero traefik.* labels), and disabled (identity labels retained)
cases. Full module unit suite: 208/0/0.
Plumbed through ConfigMerger read, DeploymentExecutor snapshot, UI form
state, Resources tab toggle, POST payload, and snapshot-to-form mapping.
Rule files updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surface from the Task 0 testcontainers.reuse enable: when the same Postgres
container is reused across `mvn verify` runs, Flyway migrates both `public`
and `tenant_default` schemas (the app.yml default URL uses
?currentSchema=tenant_default; AbstractPostgresIT overrides to public).
Schema-introspection assertions saw duplicate rows/indexes/enums.
Plus: OutboundConnectionAdminControllerIT's AfterEach couldn't delete its
test users because sibling deployment ITs (Task 4) left deployments.created_by
references — FK blocks the DELETE. Clear referencing deployments first.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds List<String> instanceIds to LogSearchRequest (null-normalized to
List.of() in compact ctor) and generates an IN clause in both
ClickHouseLogStore.search() and countLogs(), mirroring the existing
sources pattern. LogQueryController parses ?instanceIds= as a
comma-split list. All existing LogSearchRequest call sites updated.
New ClickHouseLogStoreInstanceIdsIT covers: multi-value filter, empty
filter (all rows), null filter (all rows), single-value filter, and
coexistence with the singular instanceId field.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wires AuditService and AppVersionRepository into DeploymentController.
Replaces null createdBy placeholder with currentUserId() on createDeployment/promote.
Adds audit log entries (SUCCESS + FAILURE) for deploy_app, stop_deployment,
and promote_deployment actions. Fixes FK violations in affected ITs by
seeding the test-operator and alice users into the users table before deploy calls.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix Issue 1: Add @AfterEach cleanup for alice/bob users in PostgresDeploymentRepositoryCreatedByIT to prevent test leakage (FK order: deployments -> app_versions -> apps, then users).
Fix Issue 2: Add comment at first create(..., null) call site in PostgresDeploymentRepositoryIT documenting the null placeholder for pre-V4 rows where createdBy is nullable.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Appends String createdBy to the Deployment record (after createdAt), updates
both with-er methods to pass it through, threads the parameter through
DeploymentRepository.create, DeploymentService.createDeployment/promote, and
PostgresDeploymentRepository (INSERT + SELECT_COLS + mapRow). DeploymentController
passes null as placeholder (Task 4 will resolve from SecurityContextHolder).
Covers with PostgresDeploymentRepositoryCreatedByIT verifying round-trip via
both createDeployment and promote.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Maven: enable useIncrementalCompilation; Surefire forkCount=1C +
reuseForks=true so unit-test JVMs are reused per CPU core instead of
spawning per class (205 tests pass under the new strategy).
- Testcontainers: opt-in reuse via .withReuse(true) on Postgres +
ClickHouse base; per-developer enable via ~/.testcontainers.properties.
- UI: drop redundant `tsc --noEmit` from `npm run build` (Vite already
type-checks); split into a dedicated `npm run typecheck` script.
- CI: cache ~/.npm and ui/node_modules/.vite alongside Maven; npm ci with
--prefer-offline --no-audit --fund=false; paths-ignore for docs-only,
.planning/ and .claude/ changes so doc-only pushes skip the pipeline.
- Docs: CLAUDE.md + .claude/rules/cicd.md updated with the new build
knobs and the Testcontainers reuse opt-in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Backend: rename deleteTerminalByAppAndEnvironment → deleteFailedByAppAndEnvironment.
STOPPED rows were being wiped on every redeploy, so Checkpoints was always empty.
Now only FAILED rows are pruned; STOPPED deployments are retained as restorable
checkpoints (they still carry deployed_config_snapshot from their RUNNING window).
- UI filter: any deployment with a snapshot is a checkpoint (was RUNNING|DEGRADED only,
which excluded the main case — the previous blue/green deployment now in STOPPED).
- UI placement: Checkpoints disclosure now renders inside IdentitySection, matching
the design spec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four ITs covering strategy behavior:
- BlueGreenStrategyIT#blueGreen_allHealthy_stopsOldAfterNew:
old is stopped only after all new replicas are healthy.
- BlueGreenStrategyIT#blueGreen_partialHealthy_preservesOldAndMarksFailed:
strict all-healthy — one starting replica aborts the deploy and
leaves the previous deployment RUNNING untouched.
- RollingStrategyIT#rolling_allHealthy_replacesOneByOne:
InOrder on stopContainer confirms old-0 stops before old-1 (the
interleaving that distinguishes rolling from blue-green).
- RollingStrategyIT#rolling_failsMidRollout_preservesRemainingOld:
mid-rollout health failure stops only the in-flight new containers
and the already-replaced old-0; old-1 stays untouched.
Shortens healthchecktimeout to 2s via @TestPropertySource so failure
paths complete in ~25s instead of ~60s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Existing IT only exercises the startContainer-throws path, where the
exception bypasses the entire try block. Add a test where startContainer
succeeds but getContainerStatus never returns healthy — this covers the
early-exit at the HEALTH_CHECK stage, which is the common real-world
failure shape and closest to the snapshot-write point.
Shortens healthchecktimeout to 2s via @TestPropertySource so the test
completes in a few seconds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Issue 1: add List<String> sensitiveKeys as 4th field to DeploymentConfigSnapshot; populate
from agentConfig.getSensitiveKeys() in DeploymentExecutor; handleRestore hydrates from
snap.sensitiveKeys directly; Deployment type in apps.ts gains sensitiveKeys field
- Issue 2: after createApp succeeds, refetchQueries(['apps', envSlug]) before navigate so the
new app is in cache before the router renders the deployed view (eliminates transient Save-
disabled flash)
- Issue 3: useDeploymentPageState useEffect now uses prevServerStateRef to detect local edits;
background refetches only overwrite form when no local changes are present
- Issue 5: handleRedeploy invalidates dirty-state + versions queries after createDeployment
resolves; handleSave invalidates dirty-state after staged save
- Issue 10: DirtyStateCalculator strips volatile agentConfig keys (version, updatedAt, updatedBy,
environment, application) before JSON comparison via scrubAgentConfig(); adds
versionBumpDoesNotMarkDirty test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wires DirtyStateCalculator behind an HTTP endpoint on AppController.
Adds findLatestSuccessfulByAppAndEnv to PostgresDeploymentRepository,
registers DirtyStateCalculator as a Spring bean (with ObjectMapper for
JavaTimeModule support), and covers all three scenarios with IT.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add @DirtiesContext(AFTER_CLASS) so the SpyBean-forked context is torn
down after the 6 tests finish, preventing permanent cache pollution
- Replace single-row queryForObject with queryForList + hasSize(1) in both
audit tests so spurious extra rows will fail explicitly
- Assert auditCount == 0 in the 400 test to lock in the no-audit-on-bad-input invariant
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the misleading putConfig_staged_auditActionIsStagedAppConfig test
(which only checked pushResult.total == 0, a duplicate of _savesButDoesNotPush)
with two real audit-log assertions: one verifying "stage_app_config" is written
for apply=staged and a new companion test verifying "update_app_config" for the
live path. Uses jdbcTemplate to query audit_log directly (Option B).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When apply=staged, saves to DB only — no CONFIG_UPDATE dispatched to agents.
When apply=live (default, back-compat), preserves today's immediate-push behavior.
Unknown apply values return 400. Audit action is stage_app_config vs update_app_config.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the missing containerConfig assertion to snapshot_isPopulated_whenDeploymentReachesRunning
(runtimeType + appPort entries), and tightens the await predicate from .isIn(RUNNING, DEGRADED)
to .isEqualTo(RUNNING) — the mock returns a healthy container so RUNNING is deterministic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Injects PostgresApplicationConfigRepository into DeploymentExecutor and
calls saveDeployedConfigSnapshot at the COMPLETE stage, before
markRunning. Snapshot contains jarVersionId, agentConfig (nullable),
and app.containerConfig. The FAILED catch path is left untouched so
snapshot stays null on failure. Verified by DeploymentSnapshotIT.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace manual `new PostgresDeploymentRepository(jdbcTemplate, new ObjectMapper())` with
`@Autowired PostgresDeploymentRepository repository` to use the Spring-managed bean whose
ObjectMapper has JavaTimeModule registered. Also removes the redundant isNotNull() assertion
whose work is done by the field-level assertions that follow.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Appends DeploymentConfigSnapshot deployedConfigSnapshot to the Deployment
record and adds a matching withDeployedConfigSnapshot wither. All
positional call sites (repository mapper, test fixture) updated to pass
null; Task 1.4 will wire real persistence and Task 1.5 will populate
the field on RUNNING transition.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New property cameleer.server.alerting.perExchangeDeployBacklogCapSeconds
(default 86400 = 24h, 0 disables). On first run (no persisted cursor
or malformed), clamp cursorTs to max(rule.createdAt, now - cap) so a
long-lived PER_EXCHANGE rule doesn't scan from its creation date
forward on first post-deploy tick. Normal-advance path unaffected.
Follows up final-review I-1 on the PER_EXCHANGE exactly-once phase.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end lifecycle test: 5 FAILED exchanges across 2 ticks produces
exactly 5 FIRING instances + 5 PENDING notifications. Tick 3 with no
new exchanges produces zero new instances or notifications. Ack on one
instance leaves the other four untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PER_EXCHANGE rules: 400 if reNotifyMinutes != 0 or forDurationSeconds != 0.
Any rule: 400 if webhooks + targets are both empty (never notifies anyone).
Turns green: AlertRuleControllerIT#createPerExchangeRule_with*NonZero_returns400,
AlertRuleControllerIT#createAnyRule_withEmptyWebhooksAndTargets_returns400.
Three failing IT tests documenting the contract Task 3.3 will satisfy:
- createPerExchangeRule_withReNotifyMinutesNonZero_returns400
- createPerExchangeRule_withForDurationSecondsNonZero_returns400
- createAnyRule_withEmptyWebhooksAndTargets_returns400
Dead field — was enforced by compact ctor as required for PER_EXCHANGE,
but never read anywhere in the codebase. Removal tightens the API surface
and is precondition for the Task 3.3 cross-field validator.
Pre-prod; no shim / migration.
Wraps instance writes, notification enqueues, and cursor advance in one
transactional boundary per rule tick. Rollback leaves the rule replayable
on next tick. Turns the Phase 2 atomicity IT green (see AlertEvaluatorJobIT
#tickRollback_faultOnSecondNotificationInsert_leavesCursorUnchanged).
Fault-injection IT asserts that a crash mid-batch rolls back every
instance + notification write AND leaves the cursor unchanged. Fails
against current (Phase 1 only) code — turns green when Task 2.2
wraps batch processing in @Transactional.
Replace async @AfterEach ALTER...DELETE with @BeforeEach TRUNCATE TABLE
executions — matches the convention used in ClickHouseExecutionStoreIT
and peers. Env-slug isolation was already preventing cross-test pollution;
this change is about hygiene and determinism (TRUNCATE is synchronous).
Two failing tests documenting the contract Task 1.5 will satisfy:
- cursorMonotonicity_sameMillisecondExchanges_fireExactlyOncePerTick
- firstRun_boundedByRuleCreatedAt_notRetentionHistory
Compile may fail until Task 1.4 adds AlertRule.withEvalState wither.
Adds an optional afterExecutionId field to SearchRequest. When combined
with a non-null timeFrom, ClickHouseSearchIndex applies a strictly-after
tuple predicate (start_time > ts OR (start_time = ts AND execution_id > id))
so same-millisecond exchanges can be consumed exactly once across ticks.
When afterExecutionId is null, timeFrom keeps its existing >= semantics —
no behaviour change for any current caller.
Also adds the SearchRequest.withCursor(ts, id) wither. Threads the field
through existing withInstanceIds / withEnvironment witheres. All existing
positional call-sites (SearchController, ExchangeMatchEvaluator,
ClickHouseSearchIndexIT, ClickHouseChunkPipelineIT) pass null for the new
slot.
Task 1.2 of docs/superpowers/plans/2026-04-22-per-exchange-exactly-once.md.
The evaluator-side wiring that actually supplies the cursor is Task 1.5.
The @Scheduled placeholder read ${ingestion.flush-interval-ms:1000}
(unprefixed) but IngestionConfig binds cameleer.server.ingestion.* —
YAML tuning of the metrics flush interval was silently ignored and the
scheduler fell back to the 1s default in every environment.
Corrected to ${cameleer.server.ingestion.flush-interval-ms:1000}.
(The initial attempt to bind via SpEL #{@ingestionConfig.flushIntervalMs}
failed because beans registered via @EnableConfigurationProperties use a
compound bean name "<prefix>-<FQN>", not the simple camelCase form. The
property-placeholder path is sufficient — IngestionConfig still owns
the Java-side default.)
BackpressureIT: drops the obsolete workaround property
`ingestion.flush-interval-ms=60000`; the single prefixed override now
controls both buffer config and flush cadence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-layer fix for the TZ drift that caused stats reads to miss every row
when the JVM default TZ and CH session TZ disagreed:
- Insert side: ClickHouse JDBC 0.9.7 formats java.sql.Timestamp via
Timestamp.toString(), which uses JVM default TZ. A CEST JVM shipping
to a UTC CH server stored Unix timestamps off by the TZ offset (the
triage report's original symptom). Pinned JVM default to UTC in
CameleerServerApplication.main() — standard practice for observability
servers that push to time-series stores.
- Read side: stats_1m_* tables now declare bucket as DateTime('UTC'),
MV SELECTs wrap toStartOfMinute(start_time) in toDateTime(..., 'UTC')
so projections match column type, and ClickHouseStatsStore.lit(Instant)
emits toDateTime('...', 'UTC') rather than a bare literal — defence
in depth against future refactors.
Test class pins its own JVM TZ (the store IT builds its own
HikariDataSource, bypassing the main() path). Debug scaffolding from
the triage investigation removed.
Greenfield CH — no migration needed.
Verified: 14/14 ClickHouseStatsStoreIT green, plus 84/84 across all
ClickHouse IT classes (no regression from the JVM TZ default change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pushToAgents fan-out iterates every distinct (app, env) slice in
the shared agent registry. In isolated runs that's 0, but with Spring
context reuse across IT classes we always see non-zero here. Assert
the response has a pushResult.total field (shape) rather than exact 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ForwardCompatIT: send a valid ExecutionChunk envelope with extra
unknown fields instead of a bare {futureField}. Was being parsed into
an empty/degenerate chunk and rejected with 400.
- ProtocolVersionIT.requestWithCorrectProtocolVersionPassesInterceptor:
same shape fix — minimal valid chunk so the controller's 400 is not
an ambiguous signal for interceptor-passthrough.
- BackpressureIT:
* TestPropertySource keys were "ingestion.*" but IngestionConfig is
bound under "cameleer.server.ingestion.*" — overrides were ignored
and the buffer stayed at its default 50_000, so the 503 overflow
branch was unreachable. Corrected the keys.
* MetricsFlushScheduler's @Scheduled uses a *different* key again
("ingestion.flush-interval-ms"), so we override that separately to
stop the default 1s flush from draining the buffer mid-test.
* executionIngestion_isSynchronous_returnsAccepted now uses the
chunked envelope format.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClickHouseChunkPipelineIT.setUp was loading /clickhouse/V2__executions.sql
and /clickhouse/V3__processor_executions.sql — resource paths that no
longer exist after 90083f88 collapsed the V1..V18 ClickHouse schema into
init.sql. Swapped for ClickHouseTestHelper.executeInitSql(jdbc).
ClickHouseExecutionReadIT.detailService_buildTree_withIterations was
asserting getLoopIndex() on children of a split, but DetailService's
seq-based buildTree path (buildTreeBySeq) maps FlatProcessorRecord.iteration
into ProcessorNode.iteration — not loopIndex. The loopIndex path is only
populated by buildTreeByProcessorId (the legacy ID-only fallback). Switched
the assertion to getIteration() to match the seq-driven reconstruction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both tests extend AbstractPostgresIT and inherit the Postgres jdbcTemplate,
which they were using to query ClickHouse-resident tables (executions,
processor_executions, route_diagrams). Now:
- DiagramLinkingIT reads diagramContentHash off the execution-detail REST
response (and tolerates JSON null by normalising to empty string, which
matches how the ingestion service stamps un-linked executions).
- IngestionSchemaIT asserts the reconstructed processor tree through the
execution-detail endpoint (covers both flattening on write and
buildTree on read) and reads processor bodies via the processor-snapshot
endpoint rather than raw processor_executions rows.
Both tests now use the ExecutionChunk envelope on POST /data/executions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Largest Cluster B test: seeded 10 executions via the legacy RouteExecution
shape which ChunkIngestionController silently degenerates to empty chunks,
then verified via a Postgres SELECT against a ClickHouse table. Both
failure modes addressed:
- All 10 seed payloads are now ExecutionChunk envelopes (chunkSeq=0,
final=true, flat processors[]).
- Pipeline visibility probe is the env-scoped search REST endpoint
(polling for the last corr-page-10 row).
- searchGet() helper was using the AGENT token; env-scoped read
endpoints require VIEWER+, so it now uses viewerJwt (matches what
searchPost already did).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DiagramControllerIT.postDiagram_dataAppearsAfterFlush now verifies via
GET /api/v1/environments/{env}/apps/{app}/routes/{route}/diagram instead
of a PG SELECT against the ClickHouse route_diagrams table.
DiagramRenderControllerIT seeds both a diagram and an execution on the
same route, then reads the stamped diagramContentHash off the execution-
detail REST response to drive the flat /api/v1/diagrams/{hash}/render
tests. The env-scoped endpoint only serves JSON, so SVG tests still hit
the content-hash endpoint — but the hash comes from REST now, not SQL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>