Commit Graph

185 Commits

Author SHA1 Message Date
hsiegeln
ed0e616109 refactor(logs): drop dead null guards on instanceIds filter (record normalizes) 2026-04-23 12:52:18 +02:00
hsiegeln
382e1801a7 feat(logs): add instanceIds multi-value filter to /logs endpoint
Adds List<String> instanceIds to LogSearchRequest (null-normalized to
List.of() in compact ctor) and generates an IN clause in both
ClickHouseLogStore.search() and countLogs(), mirroring the existing
sources pattern. LogQueryController parses ?instanceIds= as a
comma-split list. All existing LogSearchRequest call sites updated.
New ClickHouseLogStoreInstanceIdsIT covers: multi-value filter, empty
filter (all rows), null filter (all rows), single-value filter, and
coexistence with the singular instanceId field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 12:41:09 +02:00
hsiegeln
2312a7304d fix(deploy): widen promote FAILURE audit detail + clean up test envs 2026-04-23 12:29:46 +02:00
hsiegeln
47d5611462 feat(audit): audit deploy/stop/promote with DEPLOYMENT category
Wires AuditService and AppVersionRepository into DeploymentController.
Replaces null createdBy placeholder with currentUserId() on createDeployment/promote.
Adds audit log entries (SUCCESS + FAILURE) for deploy_app, stop_deployment,
and promote_deployment actions. Fixes FK violations in affected ITs by
seeding the test-operator and alice users into the users table before deploy calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 12:24:27 +02:00
hsiegeln
9043dc00b0 test(deploy): clean up seeded users + document null createdBy placeholder
Fix Issue 1: Add @AfterEach cleanup for alice/bob users in PostgresDeploymentRepositoryCreatedByIT to prevent test leakage (FK order: deployments -> app_versions -> apps, then users).

Fix Issue 2: Add comment at first create(..., null) call site in PostgresDeploymentRepositoryIT documenting the null placeholder for pre-V4 rows where createdBy is nullable.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-23 12:10:21 +02:00
hsiegeln
a141e99a07 feat(deploy): cascade createdBy through Deployment record + service + repo
Appends String createdBy to the Deployment record (after createdAt), updates
both with-er methods to pass it through, threads the parameter through
DeploymentRepository.create, DeploymentService.createDeployment/promote, and
PostgresDeploymentRepository (INSERT + SELECT_COLS + mapRow). DeploymentController
passes null as placeholder (Task 4 will resolve from SecurityContextHolder).
Covers with PostgresDeploymentRepositoryCreatedByIT verifying round-trip via
both createDeployment and promote.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 12:04:15 +02:00
hsiegeln
35748ea7a1 feat(deploy): V4 migration — add created_by to deployments 2026-04-23 11:44:05 +02:00
hsiegeln
242ef1f0af perf(build): faster Maven + UI + CI pipelines
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m43s
CI / docker (push) Successful in 4m13s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 41s
- Maven: enable useIncrementalCompilation; Surefire forkCount=1C +
  reuseForks=true so unit-test JVMs are reused per CPU core instead of
  spawning per class (205 tests pass under the new strategy).
- Testcontainers: opt-in reuse via .withReuse(true) on Postgres +
  ClickHouse base; per-developer enable via ~/.testcontainers.properties.
- UI: drop redundant `tsc --noEmit` from `npm run build` (Vite already
  type-checks); split into a dedicated `npm run typecheck` script.
- CI: cache ~/.npm and ui/node_modules/.vite alongside Maven; npm ci with
  --prefer-offline --no-audit --fund=false; paths-ignore for docs-only,
  .planning/ and .claude/ changes so doc-only pushes skip the pipeline.
- Docs: CLAUDE.md + .claude/rules/cicd.md updated with the new build
  knobs and the Testcontainers reuse opt-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:48:34 +02:00
hsiegeln
c6aef5ab35 fix(deploy): Checkpoints — preserve STOPPED history, fix filter + placement
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m4s
CI / docker (push) Successful in 1m15s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 41s
- Backend: rename deleteTerminalByAppAndEnvironment → deleteFailedByAppAndEnvironment.
  STOPPED rows were being wiped on every redeploy, so Checkpoints was always empty.
  Now only FAILED rows are pruned; STOPPED deployments are retained as restorable
  checkpoints (they still carry deployed_config_snapshot from their RUNNING window).
- UI filter: any deployment with a snapshot is a checkpoint (was RUNNING|DEGRADED only,
  which excluded the main case — the previous blue/green deployment now in STOPPED).
- UI placement: Checkpoints disclosure now renders inside IdentitySection, matching
  the design spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:26:46 +02:00
hsiegeln
e9f523f2b8 test(deploy): blue-green + rolling strategy ITs
Four ITs covering strategy behavior:
- BlueGreenStrategyIT#blueGreen_allHealthy_stopsOldAfterNew:
  old is stopped only after all new replicas are healthy.
- BlueGreenStrategyIT#blueGreen_partialHealthy_preservesOldAndMarksFailed:
  strict all-healthy — one starting replica aborts the deploy and
  leaves the previous deployment RUNNING untouched.
- RollingStrategyIT#rolling_allHealthy_replacesOneByOne:
  InOrder on stopContainer confirms old-0 stops before old-1 (the
  interleaving that distinguishes rolling from blue-green).
- RollingStrategyIT#rolling_failsMidRollout_preservesRemainingOld:
  mid-rollout health failure stops only the in-flight new containers
  and the already-replaced old-0; old-1 stays untouched.

Shortens healthchecktimeout to 2s via @TestPropertySource so failure
paths complete in ~25s instead of ~60s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:00:00 +02:00
hsiegeln
653f983a08 deploy: rolling strategy (per-replica replacement)
Replace the Phase 3 stub with a working rolling implementation.

Flow:
- Capture previous deployment's per-index container ids up front.
- For i = 0..replicas-1:
  - Start new[i] (gen-suffixed name, coexists with old[i]).
  - Wait for new[i] healthy (new waitForOneHealthy helper).
  - On success: stop old[i] if present, continue.
  - On failure: stop in-flight new[0..i], leave un-replaced old[i+1..N]
    running, mark FAILED. Already-replaced old replicas are not
    restored — rolling is not reversible; user redeploys to recover.
- After the loop: sweep any leftover old replicas (when replica count
  shrank) and mark the old deployment STOPPED.

Resource peak: replicas + 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:53:52 +02:00
hsiegeln
459cdfe427 deploy: blue-green strategy (start → health-all → stop old)
Phase 3 of deployment-strategies plan. Refactor executeAsync to
dispatch on DeploymentStrategy.fromWire(config.deploymentStrategy()).

Blue-green (default):
- Start all N new replicas (gen-suffixed names coexist with old).
- Wait for ALL healthy (strict — partial-healthy = FAILED, preserves
  previous deployment untouched).
- Only then find + stop the previous deployment.
- Final status is always RUNNING; DEGRADED is now reserved for
  post-deploy replica crashes (set by DockerEventMonitor).

Rolling: stub — throws UnsupportedOperationException for now, gets
its real implementation in Phase 4.

Refactor details:
- Extract DeployCtx record to carry 13 per-deploy values around.
- Extract startReplica(ctx, i, stateOut) — shared by both strategy paths.
- Extract persistSnapshotAndMarkRunning(ctx, primaryCid) — shared finalizer.
- Rename waitForAnyHealthy → waitForAllHealthy (the name was misleading;
  the method already waited for all, just returned partial on timeout).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:51:24 +02:00
hsiegeln
652346dcd4 deploy: gen-suffixed container names + cameleer.generation label
Append an 8-char generation id (first 8 chars of deployment UUID) to:
- container name: {tenant}-{env}-{app}-{replica}-{gen}
- CAMELEER_AGENT_INSTANCEID (so old+new agents are distinct in the registry)
- Traefik cameleer.instance-id label

And emit a new standalone cameleer.generation label so dashboards
(Prometheus/Grafana) can pin deploy boundaries without regex on
instance-id.

Strategy branching comes next — this commit is foundation only; the
interim destroy-then-start flow still runs regardless of strategy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:45:44 +02:00
hsiegeln
f8dccaae2b fix(deploy): stop previous active deployment before START_REPLICAS (fixes 409)
Container names are deterministic: {tenant}-{envSlug}-{appSlug}-{replica}.
The prior code did the stop-existing step at SWAP_TRAFFIC, *after*
START_REPLICAS had already tried to create containers with the same
names — so a redeploy against a RUNNING app consistently failed with
Docker 409 "container name already in use".

Move the stop-existing block to run right after CREATE_NETWORK and
before START_REPLICAS. SWAP_TRAFFIC becomes a label-only marker (traffic
is swapped implicitly by Traefik labels once new replicas are healthy).

Also: add `findActiveByAppIdAndEnvironmentIdExcluding` so the SQL
excludes the current deployment by id — previously the Java-side
`!id.equals(me)` guard failed because the newly-inserted row has
status=STARTING (DB default) and ORDER BY created_at DESC LIMIT 1
picked the new row, hiding the actual previous deployment.

Trade-off: this is destroy-then-start rather than true blue/green —
brief downtime during the swap. Matches the pre-unified-page behavior
and is what users reasonably expect. True blue/green would require
per-deployment container names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 01:01:00 +02:00
hsiegeln
b655de3975 fix(config): structured 400 body on unknown apply value
Replace empty-body ResponseEntity.status(BAD_REQUEST).build() with
ResponseStatusException so Spring returns the usual error body shape
with a descriptive reason string, matching the idiom used by
UserAdminController, AppSettingsController, ThresholdAdminController.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 00:45:31 +02:00
hsiegeln
b5ecd39100 docs(api): document ?apply query param on updateConfig (Swagger)
Adds @Parameter description so the generated OpenAPI spec / Swagger UI
explains what 'staged' vs 'live' means instead of just surfacing the
bare param name. Follow-up: run `cd ui && npm run generate-api:live`
against a live backend to refresh openapi.json + schema.d.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 00:39:10 +02:00
hsiegeln
ffdaeabc9f test(deploy): lock in FAILED→null snapshot for health-check-fail path
Existing IT only exercises the startContainer-throws path, where the
exception bypasses the entire try block. Add a test where startContainer
succeeds but getContainerStatus never returns healthy — this covers the
early-exit at the HEALTH_CHECK stage, which is the common real-world
failure shape and closest to the snapshot-write point.

Shortens healthchecktimeout to 2s via @TestPropertySource so the test
completes in a few seconds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 00:37:37 +02:00
hsiegeln
4d4c59efe3 fix(deploy): include DEGRADED deploys as restorable checkpoints
Snapshot is written by DeploymentExecutor before the RUNNING/DEGRADED
split, so DEGRADED rows already carry a deployed_config_snapshot. Treat
them as checkpoints — partial-healthy deploys still produced a working
config worth restoring. Aligns repo query with UI filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 00:34:25 +02:00
hsiegeln
d33c039a17 fix(deploy): address final review — sensitiveKeys snapshot, dirty scrubbing, transition race, refetch invalidations
- Issue 1: add List<String> sensitiveKeys as 4th field to DeploymentConfigSnapshot; populate
  from agentConfig.getSensitiveKeys() in DeploymentExecutor; handleRestore hydrates from
  snap.sensitiveKeys directly; Deployment type in apps.ts gains sensitiveKeys field
- Issue 2: after createApp succeeds, refetchQueries(['apps', envSlug]) before navigate so the
  new app is in cache before the router renders the deployed view (eliminates transient Save-
  disabled flash)
- Issue 3: useDeploymentPageState useEffect now uses prevServerStateRef to detect local edits;
  background refetches only overwrite form when no local changes are present
- Issue 5: handleRedeploy invalidates dirty-state + versions queries after createDeployment
  resolves; handleSave invalidates dirty-state after staged save
- Issue 10: DirtyStateCalculator strips volatile agentConfig keys (version, updatedAt, updatedBy,
  environment, application) before JSON comparison via scrubAgentConfig(); adds
  versionBumpDoesNotMarkDirty test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 23:29:01 +02:00
hsiegeln
6591f2fde3 api(apps): GET /apps/{slug}/dirty-state returns desired-vs-deployed diff
Wires DirtyStateCalculator behind an HTTP endpoint on AppController.
Adds findLatestSuccessfulByAppAndEnv to PostgresDeploymentRepository,
registers DirtyStateCalculator as a Spring bean (with ObjectMapper for
JavaTimeModule support), and covers all three scenarios with IT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:35:35 +02:00
hsiegeln
76352c0d6f test(config): tighten audit assertions + @DirtiesContext on ApplicationConfigControllerIT
- Add @DirtiesContext(AFTER_CLASS) so the SpyBean-forked context is torn
  down after the 6 tests finish, preventing permanent cache pollution
- Replace single-row queryForObject with queryForList + hasSize(1) in both
  audit tests so spurious extra rows will fail explicitly
- Assert auditCount == 0 in the 400 test to lock in the no-audit-on-bad-input invariant

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:18:44 +02:00
hsiegeln
e716dbf8ca test(config): verify audit action in staged/live config IT
Replace the misleading putConfig_staged_auditActionIsStagedAppConfig test
(which only checked pushResult.total == 0, a duplicate of _savesButDoesNotPush)
with two real audit-log assertions: one verifying "stage_app_config" is written
for apply=staged and a new companion test verifying "update_app_config" for the
live path. Uses jdbcTemplate to query audit_log directly (Option B).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:13:53 +02:00
hsiegeln
76129d407e api(config): ?apply=staged|live gates SSE push on PUT /apps/{slug}/config
When apply=staged, saves to DB only — no CONFIG_UPDATE dispatched to agents.
When apply=live (default, back-compat), preserves today's immediate-push behavior.
Unknown apply values return 400. Audit action is stage_app_config vs update_app_config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:07:36 +02:00
hsiegeln
9b1240274d test(deploy): assert containerConfig round-trip + strict RUNNING in snapshot IT
Adds the missing containerConfig assertion to snapshot_isPopulated_whenDeploymentReachesRunning
(runtimeType + appPort entries), and tightens the await predicate from .isIn(RUNNING, DEGRADED)
to .isEqualTo(RUNNING) — the mock returns a healthy container so RUNNING is deterministic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:54:57 +02:00
hsiegeln
a79eafeaf4 runtime(deploy): capture config snapshot on RUNNING transition
Injects PostgresApplicationConfigRepository into DeploymentExecutor and
calls saveDeployedConfigSnapshot at the COMPLETE stage, before
markRunning. Snapshot contains jarVersionId, agentConfig (nullable),
and app.containerConfig. The FAILED catch path is left untouched so
snapshot stays null on failure. Verified by DeploymentSnapshotIT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:51:00 +02:00
hsiegeln
9b851c4622 test(deploy): autowire repository in snapshot IT (JavaTimeModule-safe)
Replace manual `new PostgresDeploymentRepository(jdbcTemplate, new ObjectMapper())` with
`@Autowired PostgresDeploymentRepository repository` to use the Spring-managed bean whose
ObjectMapper has JavaTimeModule registered. Also removes the redundant isNotNull() assertion
whose work is done by the field-level assertions that follow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:43:40 +02:00
hsiegeln
d3e86b9d77 storage(deploy): persist deployed_config_snapshot as JSONB
Wire SELECT_COLS, mapRow deserialization, and saveDeployedConfigSnapshot
update method. Adds PostgresDeploymentRepositoryIT with roundtrip,
null-default, and clear-to-null tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:39:04 +02:00
hsiegeln
7f9cfc7f18 core(deploy): add deployedConfigSnapshot field to Deployment model
Appends DeploymentConfigSnapshot deployedConfigSnapshot to the Deployment
record and adds a matching withDeployedConfigSnapshot wither. All
positional call sites (repository mapper, test fixture) updated to pass
null; Task 1.4 will wire real persistence and Task 1.5 will populate
the field on RUNNING transition.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:31:48 +02:00
hsiegeln
ff95187707 db(deploy): add deployments.deployed_config_snapshot column (V3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:23:46 +02:00
hsiegeln
c2eab71a31 env(admin): per-environment color field + V2 migration
- V2__add_environment_color.sql adds a CHECK-constrained VARCHAR color column (default 'slate'); existing rows backfill to slate.
- Environment record + EnvironmentColor constants (8 preset values) flow through repository, service, and admin API.
- UpdateEnvironmentRequest.color nullable: null preserves existing; unknown values → 400.
- ITs cover valid / invalid / null-preserves behaviour; existing Environment constructor call-sites updated with the new color arg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:24:30 +02:00
hsiegeln
e6dcad1e07 config(app): silence MustacheAutoConfiguration templates-dir warning
jmustache on the classpath (for alert notification templates) triggers
Spring Boot's MustacheAutoConfiguration, which warns about the missing
classpath:/templates/ folder we don't use. Disable its check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:47:46 +02:00
hsiegeln
eda74b7339 docs(alerting): PER_EXCHANGE exactly-once — fireMode reference + deploy-backlog-cap
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m7s
CI / docker (push) Successful in 1m22s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 41s
Fix stale `AGGREGATE` label (actual enum: `COUNT_IN_WINDOW`). Expand
EXCHANGE_MATCH section with both fire modes, PER_EXCHANGE config-surface
restrictions (0 for reNotifyMinutes/forDurationSeconds, at-least-one-sink
rule), exactly-once guarantee scope, and the first-run backlog-cap knob.

Surface the new config in application.yml with the 24h default and the
opt-out-to-0 semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:39:49 +02:00
hsiegeln
e470fc0dab alerting(eval): clamp first-run cursor to deployBacklogCap — flood guard
New property cameleer.server.alerting.perExchangeDeployBacklogCapSeconds
(default 86400 = 24h, 0 disables). On first run (no persisted cursor
or malformed), clamp cursorTs to max(rule.createdAt, now - cap) so a
long-lived PER_EXCHANGE rule doesn't scan from its creation date
forward on first post-deploy tick. Normal-advance path unaffected.

Follows up final-review I-1 on the PER_EXCHANGE exactly-once phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:34:23 +02:00
hsiegeln
cfc619505a alerting(it): AlertingFullLifecycleIT — exactly-once across ticks, ack isolation
End-to-end lifecycle test: 5 FAILED exchanges across 2 ticks produces
exactly 5 FIRING instances + 5 PENDING notifications. Tick 3 with no
new exchanges produces zero new instances or notifications. Ack on one
instance leaves the other four untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:07:45 +02:00
hsiegeln
0f6bafae8e alerting(api): cross-field validation for PER_EXCHANGE + empty-targets guard
PER_EXCHANGE rules: 400 if reNotifyMinutes != 0 or forDurationSeconds != 0.
Any rule: 400 if webhooks + targets are both empty (never notifies anyone).

Turns green: AlertRuleControllerIT#createPerExchangeRule_with*NonZero_returns400,
AlertRuleControllerIT#createAnyRule_withEmptyWebhooksAndTargets_returns400.
2026-04-22 17:31:11 +02:00
hsiegeln
377968eb53 alerting(it): RED tests for PER_EXCHANGE cross-field validation + empty targets
Three failing IT tests documenting the contract Task 3.3 will satisfy:
- createPerExchangeRule_withReNotifyMinutesNonZero_returns400
- createPerExchangeRule_withForDurationSecondsNonZero_returns400
- createAnyRule_withEmptyWebhooksAndTargets_returns400
2026-04-22 17:17:47 +02:00
hsiegeln
e483e52eee alerting(core): drop unused perExchangeLingerSeconds from ExchangeMatchCondition
Dead field — was enforced by compact ctor as required for PER_EXCHANGE,
but never read anywhere in the codebase. Removal tightens the API surface
and is precondition for the Task 3.3 cross-field validator.

Pre-prod; no shim / migration.
2026-04-22 17:10:53 +02:00
hsiegeln
ba4e2bb68f alerting(eval): atomic per-rule batch commit via @Transactional — Phase 2 close
Wraps instance writes, notification enqueues, and cursor advance in one
transactional boundary per rule tick. Rollback leaves the rule replayable
on next tick. Turns the Phase 2 atomicity IT green (see AlertEvaluatorJobIT
#tickRollback_faultOnSecondNotificationInsert_leavesCursorUnchanged).
2026-04-22 17:03:07 +02:00
hsiegeln
989dde23eb alerting(it): RED test pinning Phase 2 tick-atomicity contract
Fault-injection IT asserts that a crash mid-batch rolls back every
instance + notification write AND leaves the cursor unchanged. Fails
against current (Phase 1 only) code — turns green when Task 2.2
wraps batch processing in @Transactional.
2026-04-22 16:51:09 +02:00
hsiegeln
3c3d90c45b test(alerting): align AlertEvaluatorJobIT CH cleanup with house style
Replace async @AfterEach ALTER...DELETE with @BeforeEach TRUNCATE TABLE
executions — matches the convention used in ClickHouseExecutionStoreIT
and peers. Env-slug isolation was already preventing cross-test pollution;
this change is about hygiene and determinism (TRUNCATE is synchronous).
2026-04-22 16:45:28 +02:00
hsiegeln
5bd0e09df3 alerting(eval): persist advanced cursor via releaseClaim — Phase 1 close
Fixes the notification-bleed regression pinned by
AlertEvaluatorJobIT#tick2_noNewExchanges_enqueuesZeroAdditionalNotifications.
2026-04-22 16:36:01 +02:00
hsiegeln
b8d4b59f40 alerting(eval): AlertEvaluatorJob persists advanced cursor via withEvalState
Thread EvalResult.Batch.nextEvalState into releaseClaim so the composite
cursor from Task 1.5 actually lands in rule.evalState across tick boundaries.
Guards against empty-batch wipe (would regress to first-run scan).
2026-04-22 16:24:27 +02:00
hsiegeln
850c030642 search: compose ORDER BY with execution_id when afterExecutionId set
Follow-up to Task 1.2 flagged by Task 1.5 review (I-1). Single-column
ORDER BY could drop tail rows in a same-millisecond group >50 when
paginating via the composite cursor. Appending ', execution_id <dir>'
as secondary key only when afterExecutionId is set preserves existing
behaviour for UI/stats callers.
2026-04-22 16:21:52 +02:00
hsiegeln
4acf0aeeff alerting(eval): PER_EXCHANGE composite cursor — monotone across same-ms exchanges
Tests:
- cursorMonotonicity_sameMillisecondExchanges_fireExactlyOncePerTick
- firstRun_boundedByRuleCreatedAt_notRetentionHistory
2026-04-22 16:11:01 +02:00
hsiegeln
c2252a0e72 alerting(eval): RED tests for PER_EXCHANGE cursor monotonicity + first-run bound
Two failing tests documenting the contract Task 1.5 will satisfy:
- cursorMonotonicity_sameMillisecondExchanges_fireExactlyOncePerTick
- firstRun_boundedByRuleCreatedAt_notRetentionHistory

Compile may fail until Task 1.4 adds AlertRule.withEvalState wither.
2026-04-22 15:58:16 +02:00
hsiegeln
b41f34c090 search: SearchRequest.afterExecutionId — composite (startTime, execId) predicate
Adds an optional afterExecutionId field to SearchRequest. When combined
with a non-null timeFrom, ClickHouseSearchIndex applies a strictly-after
tuple predicate (start_time > ts OR (start_time = ts AND execution_id > id))
so same-millisecond exchanges can be consumed exactly once across ticks.

When afterExecutionId is null, timeFrom keeps its existing >= semantics —
no behaviour change for any current caller.

Also adds the SearchRequest.withCursor(ts, id) wither. Threads the field
through existing withInstanceIds / withEnvironment witheres. All existing
positional call-sites (SearchController, ExchangeMatchEvaluator,
ClickHouseSearchIndexIT, ClickHouseChunkPipelineIT) pass null for the new
slot.

Task 1.2 of docs/superpowers/plans/2026-04-22-per-exchange-exactly-once.md.
The evaluator-side wiring that actually supplies the cursor is Task 1.5.
2026-04-22 15:49:05 +02:00
hsiegeln
6fa8e3aa30 alerting(eval): EvalResult.Batch carries nextEvalState for cursor threading 2026-04-22 15:42:20 +02:00
hsiegeln
41df042e98 fix(sse): close 4 parked SSE test failures
Three distinct root causes, all reproducible when the classes run
solo — not order-dependent as the triage report suggested. Full
diagnosis in .planning/sse-flakiness-diagnosis.md.

1. AgentSseController.events auto-heal was over-permissive: any valid
   JWT allowed registering an arbitrary path-id, a spoofing vector.
   Surface symptom was the parked sseConnect_unknownAgent_returns404
   test hanging on a 200-with-empty-stream instead of getting 404.
   Fix: auto-heal requires JWT subject == path id.

2. SseConnectionManager.pingAll read ${agent-registry.ping-interval-ms}
   (unprefixed). AgentRegistryConfig binds cameleer.server.agentregistry.*
   — same family of bug as the MetricsFlushScheduler fix in a6944911.
   Fix: corrected placeholder prefix.

3. Spring's SseEmitter doesn't flush response headers until the first
   emitter.send(); clients on BodyHandlers.ofInputStream blocked on
   the first body byte, making awaitConnection(5s) unreliable under a
   15s ping cadence. Fix: send an initial ": connected" comment on
   connect() so headers hit the wire immediately.

Verified: 9/9 SSE tests green across AgentSseControllerIT + SseSigningIT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:41:34 +02:00
hsiegeln
98cbf8f3fc refactor(search): drop dead SearchIndexer subsystem
After the ExecutionController removal (0f635576), SearchIndexer
subscribed to ExecutionUpdatedEvent but nothing publishes that event.
Every SearchIndexerStats metric returned always-zero, and the admin
/api/v1/admin/clickhouse/pipeline endpoint that surfaced those stats
carried no signal.

Backend removed:
- core: SearchIndexer, SearchIndexerStats, ExecutionUpdatedEvent
- app: IndexerPipelineResponse DTO, /pipeline endpoint on
  ClickHouseAdminController (field + ctor param)
- StorageBeanConfig.searchIndexer bean

UI removed:
- IndexerPipeline type + useIndexerPipeline hook in
  api/queries/admin/clickhouse.ts
- Indexer Pipeline card in ClickHouseAdminPage.tsx (plus ProgressBar
  import and pipeline* CSS classes)

OpenAPI schema.d.ts + openapi.json regenerated (stale /pipeline path
and IndexerPipelineResponse schema removed).

SearchIndex interface + ClickHouseSearchIndex impl kept — those are
live and used by SearchService + ExchangeMatchEvaluator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:32:49 +02:00
hsiegeln
a694491140 fix(metrics): MetricsFlushScheduler honour ingestion config flush interval
The @Scheduled placeholder read ${ingestion.flush-interval-ms:1000}
(unprefixed) but IngestionConfig binds cameleer.server.ingestion.* —
YAML tuning of the metrics flush interval was silently ignored and the
scheduler fell back to the 1s default in every environment.

Corrected to ${cameleer.server.ingestion.flush-interval-ms:1000}.

(The initial attempt to bind via SpEL #{@ingestionConfig.flushIntervalMs}
failed because beans registered via @EnableConfigurationProperties use a
compound bean name "<prefix>-<FQN>", not the simple camelCase form. The
property-placeholder path is sufficient — IngestionConfig still owns
the Java-side default.)

BackpressureIT: drops the obsolete workaround property
`ingestion.flush-interval-ms=60000`; the single prefixed override now
controls both buffer config and flush cadence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:28:00 +02:00