Commit Graph

1486 Commits

Author SHA1 Message Date
hsiegeln
d3e86b9d77 storage(deploy): persist deployed_config_snapshot as JSONB
Wire SELECT_COLS, mapRow deserialization, and saveDeployedConfigSnapshot
update method. Adds PostgresDeploymentRepositoryIT with roundtrip,
null-default, and clear-to-null tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:39:04 +02:00
hsiegeln
7f9cfc7f18 core(deploy): add deployedConfigSnapshot field to Deployment model
Appends DeploymentConfigSnapshot deployedConfigSnapshot to the Deployment
record and adds a matching withDeployedConfigSnapshot wither. All
positional call sites (repository mapper, test fixture) updated to pass
null; Task 1.4 will wire real persistence and Task 1.5 will populate
the field on RUNNING transition.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:31:48 +02:00
hsiegeln
06fa7d832f core(deploy): type jarVersionId as UUID (match domain convention)
All other FKs to app_versions.id (e.g. Deployment.appVersionId) use UUID;
DeploymentConfigSnapshot.jarVersionId was incorrectly typed as String.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:29:26 +02:00
hsiegeln
d580b6e90c core(deploy): add DeploymentConfigSnapshot record
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:26:30 +02:00
hsiegeln
ff95187707 db(deploy): add deployments.deployed_config_snapshot column (V3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:23:46 +02:00
hsiegeln
1a376eb25f plan(deploy): unified app deployment page implementation plan
13 phases, TDD-oriented: Flyway V3 snapshot column, staged/live config
write flag, dirty-state endpoint, regen OpenAPI, then the new React page
(Identity, Checkpoints, 7 tabs including the live-apply Traces+Taps and
Route Recording with banner), primary Save/Redeploy state machine,
router blocker, old view cleanup, rules docs, and a manual QA walkthrough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:14:11 +02:00
hsiegeln
58ec67aef9 spec(deploy): unified app deployment page design
Single page at /apps/:slug (+ /apps/new in net-new mode) replacing the
CreateAppView/AppDetailView split. Save ↔ Redeploy state machine driven
by a deployment snapshot on the deployments table, agent-config writes
gain ?apply=staged|live, Identity & Artifact always visible, new
Deployment tab carries progress + startup log, and checkpoints restore
full prior state (JAR + config) from past successful deploys.

Concurrent-edit protection deferred to #147.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:02:50 +02:00
hsiegeln
2835d08418 ui(env): explicit switcher button+modal, forced selection, 3px color bar
- Replace EnvironmentSelector "All Envs" dropdown with Button+Modal (DS Modal, forced on first-use).
- Add 8-swatch preset color picker in the Environment settings "Appearance" section; commits via useUpdateEnvironment.
- Render a 3px fixed top bar in the current env's color across every page (z-index 900, below DS modals).
- New env-colors tokens (--env-color-*, light + dark) and envColorVar() helper with slate fallback.
- Vitest coverage for button, modal, and color helpers (13 new specs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:24:48 +02:00
hsiegeln
79fa4c097c api(schema): regenerate OpenAPI + schema.d.ts for env color field
UpdateEnvironmentRequest gains an optional color; Environment schema
surfaces color on GET responses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:24:35 +02:00
hsiegeln
c2eab71a31 env(admin): per-environment color field + V2 migration
- V2__add_environment_color.sql adds a CHECK-constrained VARCHAR color column (default 'slate'); existing rows backfill to slate.
- Environment record + EnvironmentColor constants (8 preset values) flow through repository, service, and admin API.
- UpdateEnvironmentRequest.color nullable: null preserves existing; unknown values → 400.
- ITs cover valid / invalid / null-preserves behaviour; existing Environment constructor call-sites updated with the new color arg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:24:30 +02:00
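The update semantics described in c2eab71a31 (null keeps the stored color, unknown values are rejected) can be sketched as below. This is a minimal illustration, not the project's code: `EnvColorUpdate` and `resolve` are hypothetical names, and the allowed palette is passed in because the log does not list the 8 preset values.

```java
import java.util.Set;

final class EnvColorUpdate {
    private EnvColorUpdate() {}

    // Hypothetical helper: decides which color to persist for an update request.
    // - requested == null   -> keep the current color (null preserves existing)
    // - requested not known -> reject (the API would answer 400)
    static String resolve(String requested, String current, Set<String> allowed) {
        if (requested == null) {
            return current;
        }
        if (!allowed.contains(requested)) {
            throw new IllegalArgumentException("unknown color: " + requested);
        }
        return requested;
    }
}
```

A controller would typically translate the `IllegalArgumentException` into a 400 response; how the real admin API does that is not shown in the log.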
hsiegeln
88b003d4f0 docs(spec): explicit env switcher + per-env color (design)
Replace env dropdown with button+modal pattern, remove All Envs,
add 8-swatch preset color palette per env rendered as 3px top bar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:13:00 +02:00
hsiegeln
e6dcad1e07 config(app): silence MustacheAutoConfiguration templates-dir warning
jmustache on the classpath (for alert notification templates) triggers
Spring Boot's MustacheAutoConfiguration, which warns about the missing
classpath:/templates/ folder we don't use. Disable its check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:47:46 +02:00
hsiegeln
eda74b7339 docs(alerting): PER_EXCHANGE exactly-once — fireMode reference + deploy-backlog-cap
Fix stale `AGGREGATE` label (actual enum: `COUNT_IN_WINDOW`). Expand
EXCHANGE_MATCH section with both fire modes, PER_EXCHANGE config-surface
restrictions (0 for reNotifyMinutes/forDurationSeconds, at-least-one-sink
rule), exactly-once guarantee scope, and the first-run backlog-cap knob.

Surface the new config in application.yml with the 24h default and the
opt-out-to-0 semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:39:49 +02:00
hsiegeln
e470fc0dab alerting(eval): clamp first-run cursor to deployBacklogCap — flood guard
New property cameleer.server.alerting.perExchangeDeployBacklogCapSeconds
(default 86400 = 24h, 0 disables). On first run (no persisted cursor
or malformed), clamp cursorTs to max(rule.createdAt, now - cap) so a
long-lived PER_EXCHANGE rule doesn't scan from its creation date
forward on first post-deploy tick. Normal-advance path unaffected.

Follows up final-review I-1 on the PER_EXCHANGE exactly-once phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:34:23 +02:00
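The first-run clamp from e470fc0dab, `max(rule.createdAt, now - cap)` with 0 disabling the cap, might look roughly like this. `FirstRunCursor` and `clamp` are hypothetical names, and treating negative caps like 0 is an assumption made here for safety, not something the log states.

```java
import java.time.Duration;
import java.time.Instant;

final class FirstRunCursor {
    private FirstRunCursor() {}

    // Returns the timestamp a first-run scan should start from.
    // capSeconds == 0 disables the clamp (assumption: negative values do too).
    static Instant clamp(Instant ruleCreatedAt, Instant now, long capSeconds) {
        if (capSeconds <= 0) {
            return ruleCreatedAt;
        }
        Instant floor = now.minus(Duration.ofSeconds(capSeconds));
        // max(ruleCreatedAt, now - cap): never scan further back than the cap allows.
        return ruleCreatedAt.isAfter(floor) ? ruleCreatedAt : floor;
    }
}
```

With the default of 86400 seconds, a rule created months ago starts its first post-deploy scan 24 hours back, while a rule created an hour ago still starts at its own creation time.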
hsiegeln
32c52aa22e docs(rules): update app-classes for BatchResultApplier
Task 6.2 housekeeping — add BatchResultApplier to the class map per
CLAUDE.md convention. Introduced in Task 2.2 as the @Transactional
wrapper for atomic per-rule batch commits (instance writes + notification
enqueues + cursor advance).

Also refreshes GitNexus index stats auto-emitted into AGENTS.md /
CLAUDE.md (8778 -> 8893 nodes, 22647 -> 23049 edges).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:13:57 +02:00
hsiegeln
cfc619505a alerting(it): AlertingFullLifecycleIT — exactly-once across ticks, ack isolation
End-to-end lifecycle test: 5 FAILED exchanges across 2 ticks produces
exactly 5 FIRING instances + 5 PENDING notifications. Tick 3 with no
new exchanges produces zero new instances or notifications. Ack on one
instance leaves the other four untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:07:45 +02:00
hsiegeln
e0496fdba2 ui(alerts): ReviewStep — render-preview pane for existing rules
Wire up the existing POST /alerts/rules/{id}/render-preview endpoint
so rule authors can preview their Mustache-templated notification
before saving. Available in edit mode only (new rules require save
first — endpoint is id-bound). Matches the njams gap: their rules
builder ships no in-builder preview and operators compensate with
trial-and-error save/retry.

Implementation notes:
- ReviewStep gains an optional `ruleId` prop; when present, a
  "Preview notification" button calls `useRenderPreview` (the
  existing TanStack mutation in api/queries/alertRules.ts) and
  renders title + message in a titled, read-only pane styled like
  a notification card.
- Errors surface as a DS Alert (variant=error) beneath the button.
- `RuleEditorWizard` passes `ruleId={id}` through — mirrors the
  existing TriggerStep / NotifyStep wiring.
- No stateless (/render-preview without id) variant exists on the
  backend, so for new rules the button is simply omitted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:58:17 +02:00
hsiegeln
f096365e05 ui(alerts): ReviewStep blocks save on empty webhooks+targets
Shows a warning banner and disables the Save button when a rule has
neither webhooks nor targets — would have been rejected at the server
edge (Task 3.3 validator), now caught earlier in the wizard with clear
reason.
2026-04-22 17:55:13 +02:00
hsiegeln
36cb93ecdd ui(alerts): ExchangeMatchForm — enforce PER_EXCHANGE UI constraints
Disable reNotifyMinutes at 0 with tooltip when PER_EXCHANGE is selected
(server rejects non-zero per Task 3.3 validator). Hide forDurationSeconds
entirely for PER_EXCHANGE (not applicable to per-exchange semantics).
Values stay zeroed via Task 4.3's applyFireModeChange helper on any
mode toggle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:51:58 +02:00
hsiegeln
9960fd8c36 ui(alerts): applyFireModeChange — clear mode-specific fields on toggle
Prevents stale COUNT_IN_WINDOW threshold/windowSeconds from surviving
PER_EXCHANGE save (would trip the Task 3.3 server-side validator).
Also forces reNotifyMinutes=0 and forDurationSeconds=0 when switching to
PER_EXCHANGE.

Turns green: form-state.test.ts#applyFireModeChange (3 tests).
2026-04-22 17:48:51 +02:00
hsiegeln
4d37dff9f8 ui(alerts): RED tests for form-state fireMode toggle clearing
Three failing tests pinning Task 4.3's mode-toggle state hygiene:
- clears threshold+windowSeconds on COUNT_IN_WINDOW -> PER_EXCHANGE
- returns to defaults (not stale values) on PER_EXCHANGE -> COUNT_IN_WINDOW
- forces reNotifyMinutes=0 and forDurationSeconds=0 on PER_EXCHANGE

Targets a to-be-introduced pure helper `applyFireModeChange(form, newMode)`
in form-state.ts. Task 4.3 will implement the helper and wire it into
ExchangeMatchForm so the Fire-mode <Select> calls it instead of the current
raw patch({ fireMode }) that leaves stale fields.
2026-04-22 17:46:11 +02:00
hsiegeln
7677df33e5 ui(api): regen types + drop perExchangeLingerSeconds from SPA
Follows backend removal of the field (Task 3.1). Typechecker confirms
zero remaining references. The ExchangeMatchForm linger-input is
visually removed in Task 4.4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:40:43 +02:00
hsiegeln
0f6bafae8e alerting(api): cross-field validation for PER_EXCHANGE + empty-targets guard
PER_EXCHANGE rules: 400 if reNotifyMinutes != 0 or forDurationSeconds != 0.
Any rule: 400 if webhooks + targets are both empty (never notifies anyone).

Turns green: AlertRuleControllerIT#createPerExchangeRule_with*NonZero_returns400,
AlertRuleControllerIT#createAnyRule_withEmptyWebhooksAndTargets_returns400.
2026-04-22 17:31:11 +02:00
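The two rules in 0f6bafae8e can be illustrated with a small validator. This is only a sketch under stated assumptions: `RuleDraft` and `RuleDraftValidator` are hypothetical, simplified types, and the real request class and error format are not shown in the log.

```java
import java.util.ArrayList;
import java.util.List;

// Fire modes named in the log.
enum FireMode { PER_EXCHANGE, COUNT_IN_WINDOW }

// Hypothetical, simplified rule payload; the real request type is not shown in the log.
record RuleDraft(FireMode fireMode, int reNotifyMinutes, int forDurationSeconds,
                 List<String> webhooks, List<String> targets) {}

final class RuleDraftValidator {
    private RuleDraftValidator() {}

    // Returns validation errors; a non-empty list would map to HTTP 400.
    static List<String> validate(RuleDraft d) {
        List<String> errors = new ArrayList<>();
        if (d.fireMode() == FireMode.PER_EXCHANGE) {
            if (d.reNotifyMinutes() != 0) {
                errors.add("reNotifyMinutes must be 0 for PER_EXCHANGE");
            }
            if (d.forDurationSeconds() != 0) {
                errors.add("forDurationSeconds must be 0 for PER_EXCHANGE");
            }
        }
        // Applies to any rule: with no webhooks and no targets it never notifies anyone.
        if (d.webhooks().isEmpty() && d.targets().isEmpty()) {
            errors.add("at least one webhook or target is required");
        }
        return errors;
    }
}
```

Collecting all errors instead of failing on the first one is a design choice of this sketch; it lets a client show every problem at once.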
hsiegeln
377968eb53 alerting(it): RED tests for PER_EXCHANGE cross-field validation + empty targets
Three failing IT tests documenting the contract Task 3.3 will satisfy:
- createPerExchangeRule_withReNotifyMinutesNonZero_returns400
- createPerExchangeRule_withForDurationSecondsNonZero_returns400
- createAnyRule_withEmptyWebhooksAndTargets_returns400
2026-04-22 17:17:47 +02:00
hsiegeln
e483e52eee alerting(core): drop unused perExchangeLingerSeconds from ExchangeMatchCondition
Dead field — was enforced by compact ctor as required for PER_EXCHANGE,
but never read anywhere in the codebase. Removal tightens the API surface
and is precondition for the Task 3.3 cross-field validator.

Pre-prod; no shim / migration.
2026-04-22 17:10:53 +02:00
hsiegeln
ba4e2bb68f alerting(eval): atomic per-rule batch commit via @Transactional — Phase 2 close
Wraps instance writes, notification enqueues, and cursor advance in one
transactional boundary per rule tick. Rollback leaves the rule replayable
on next tick. Turns the Phase 2 atomicity IT green (see AlertEvaluatorJobIT
#tickRollback_faultOnSecondNotificationInsert_leavesCursorUnchanged).
2026-04-22 17:03:07 +02:00
hsiegeln
989dde23eb alerting(it): RED test pinning Phase 2 tick-atomicity contract
Fault-injection IT asserts that a crash mid-batch rolls back every
instance + notification write AND leaves the cursor unchanged. Fails
against current (Phase 1 only) code — turns green when Task 2.2
wraps batch processing in @Transactional.
2026-04-22 16:51:09 +02:00
hsiegeln
3c3d90c45b test(alerting): align AlertEvaluatorJobIT CH cleanup with house style
Replace async @AfterEach ALTER...DELETE with @BeforeEach TRUNCATE TABLE
executions — matches the convention used in ClickHouseExecutionStoreIT
and peers. Env-slug isolation was already preventing cross-test pollution;
this change is about hygiene and determinism (TRUNCATE is synchronous).
2026-04-22 16:45:28 +02:00
hsiegeln
5bd0e09df3 alerting(eval): persist advanced cursor via releaseClaim — Phase 1 close
Fixes the notification-bleed regression pinned by
AlertEvaluatorJobIT#tick2_noNewExchanges_enqueuesZeroAdditionalNotifications.
2026-04-22 16:36:01 +02:00
hsiegeln
b8d4b59f40 alerting(eval): AlertEvaluatorJob persists advanced cursor via withEvalState
Thread EvalResult.Batch.nextEvalState into releaseClaim so the composite
cursor from Task 1.5 actually lands in rule.evalState across tick boundaries.
Guards against empty-batch wipe (would regress to first-run scan).
2026-04-22 16:24:27 +02:00
hsiegeln
850c030642 search: compose ORDER BY with execution_id when afterExecutionId set
Follow-up to Task 1.2 flagged by Task 1.5 review (I-1). Single-column
ORDER BY could drop tail rows in a same-millisecond group >50 when
paginating via the composite cursor. Appending ', execution_id <dir>'
as secondary key only when afterExecutionId is set preserves existing
behaviour for UI/stats callers.
2026-04-22 16:21:52 +02:00
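The conditional secondary sort key from 850c030642 can be sketched as a small clause builder. `OrderByBuilder` is a hypothetical name; the real query assembly in ClickHouseSearchIndex is not shown in the log.

```java
final class OrderByBuilder {
    private OrderByBuilder() {}

    // Builds the ORDER BY clause. The execution_id tiebreaker is appended only
    // when a composite cursor is in use, so other callers keep the old ordering.
    // sortColumn is assumed to come from a fixed whitelist, never from user input.
    static String orderBy(String sortColumn, boolean ascending, String afterExecutionId) {
        String dir = ascending ? "ASC" : "DESC";
        String clause = "ORDER BY " + sortColumn + " " + dir;
        if (afterExecutionId != null) {
            clause += ", execution_id " + dir;
        }
        return clause;
    }
}
```

The tiebreaker matters because a page limit applied to rows sharing one `start_time` value would otherwise cut the group at an arbitrary point, and the next page could not resume deterministically.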
hsiegeln
4acf0aeeff alerting(eval): PER_EXCHANGE composite cursor — monotone across same-ms exchanges
Tests:
- cursorMonotonicity_sameMillisecondExchanges_fireExactlyOncePerTick
- firstRun_boundedByRuleCreatedAt_notRetentionHistory
2026-04-22 16:11:01 +02:00
hsiegeln
0bad014811 core(alerting): AlertRule.withEvalState wither for cursor threading
2026-04-22 16:04:55 +02:00
hsiegeln
c2252a0e72 alerting(eval): RED tests for PER_EXCHANGE cursor monotonicity + first-run bound
Two failing tests documenting the contract Task 1.5 will satisfy:
- cursorMonotonicity_sameMillisecondExchanges_fireExactlyOncePerTick
- firstRun_boundedByRuleCreatedAt_notRetentionHistory

Compile may fail until Task 1.4 adds AlertRule.withEvalState wither.
2026-04-22 15:58:16 +02:00
hsiegeln
b41f34c090 search: SearchRequest.afterExecutionId — composite (startTime, execId) predicate
Adds an optional afterExecutionId field to SearchRequest. When combined
with a non-null timeFrom, ClickHouseSearchIndex applies a strictly-after
tuple predicate (start_time > ts OR (start_time = ts AND execution_id > id))
so same-millisecond exchanges can be consumed exactly once across ticks.

When afterExecutionId is null, timeFrom keeps its existing >= semantics —
no behaviour change for any current caller.

Also adds the SearchRequest.withCursor(ts, id) wither. Threads the field
through existing withInstanceIds / withEnvironment withers. All existing
positional call-sites (SearchController, ExchangeMatchEvaluator,
ClickHouseSearchIndexIT, ClickHouseChunkPipelineIT) pass null for the new
slot.

Task 1.2 of docs/superpowers/plans/2026-04-22-per-exchange-exactly-once.md.
The evaluator-side wiring that actually supplies the cursor is Task 1.5.
2026-04-22 15:49:05 +02:00
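The strictly-after tuple predicate from b41f34c090, `start_time > ts OR (start_time = ts AND execution_id > id)`, can be shown in memory as follows. `ExecRow` and `CompositeCursor` are hypothetical stand-ins; the real predicate is applied in SQL by ClickHouseSearchIndex, and comparing execution ids as plain strings is an assumption of this sketch.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an execution row; only the two cursor columns matter here.
record ExecRow(Instant startTime, String executionId) {}

final class CompositeCursor {
    private CompositeCursor() {}

    // Strictly-after tuple predicate:
    // start_time > ts OR (start_time = ts AND execution_id > id)
    static boolean isAfter(ExecRow row, Instant ts, String id) {
        int byTime = row.startTime().compareTo(ts);
        return byTime > 0 || (byTime == 0 && row.executionId().compareTo(id) > 0);
    }

    // Returns the rows a tick would consume, ordered by (startTime, executionId).
    static List<ExecRow> nextBatch(List<ExecRow> rows, Instant ts, String id) {
        List<ExecRow> out = new ArrayList<>();
        for (ExecRow r : rows) {
            if (isAfter(r, ts, id)) {
                out.add(r);
            }
        }
        out.sort(Comparator.comparing(ExecRow::startTime)
                .thenComparing(ExecRow::executionId));
        return out;
    }
}
```

With a cursor at `(t, "a")`, a second row at the same millisecond with id `"b"` is still returned, while the row `(t, "a")` itself is not selected again.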
hsiegeln
6fa8e3aa30 alerting(eval): EvalResult.Batch carries nextEvalState for cursor threading
2026-04-22 15:42:20 +02:00
hsiegeln
031fe725b5 docs(plan): PER_EXCHANGE exactly-once — implementation plan (21 tasks, 6 phases)
Plan for executing the tightened spec. TDD per task: RED test first,
minimal GREEN impl, commit. Phases 1-2 land the cursor + atomic batch
commit; phase 3 validates config; phase 4 fixes the UI mode-toggle
leakage + empty-targets guard + render-preview pane; phases 5-6 close
with full-lifecycle IT and regression sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:39:31 +02:00
hsiegeln
2f9b9c9b0f docs(spec): PER_EXCHANGE — tighten motivation, fold in njams review
Correct the factual claim that the cursor advances — it is dead code:
_nextCursor is computed but never persisted by applyBatchFiring/reschedule,
so every tick re-enqueues notifications for every matching exchange in
retention. Clarify that instance-level dedup already works via the unique
index; notification-level dedup is what's broken. Reframe §2 as "make it
atomic before §1 goes live."

Add builder-UX lessons from the njams Server_4 rules editor: clear stale
fields on fireMode toggle (not just hide them); block save on empty
webhooks+targets; wire the already-existing /render-preview endpoint into
the Review step. Add Test 5 (red-first notification-bleed regression) and
Test 6 (form-state clear on mode toggle).

Park two follow-ups explicitly: sealed condition-type hierarchy (backend
lags the UI's condition-forms/* sharding) and a coalesceSeconds primitive
for Inbox-storm taming. Amend cursor-format-churn risk: benign in theory,
but first post-deploy tick against long-standing rules could scan from
rule.createdAt forward — suggests a deployBacklogCap clamp to bound the
one-time backlog flood.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 14:57:25 +02:00
hsiegeln
817b61058a docs(spec): PER_EXCHANGE exactly-once-per-exchange alerting
Four focused correctness fixes for the "fire exactly once per FAILED
exchange" use case (alerting layer only; HTTP-level idempotency is a
separate scope):

1. Composite cursor (startTime, executionId) replaces the current
   single-timestamp, inclusive cursor — prevents same-millisecond
   drops and same-exchange re-selection.
2. First-run cursor initialized to rule createdAt (not null) —
   prevents the current unbounded historical-retention scan on first
   tick of a new rule.
3. Transactional coupling of instance writes + notification enqueue +
   cursor advance — eliminates partial-progress failure modes on crash
   or rollback.
4. Config hygiene: reNotifyMinutes forced to 0, forDurationSeconds
   rejected, perExchangeLingerSeconds removed entirely (was validated
   as required but never read) — the rule shape stops admitting
   nonsensical PER_EXCHANGE combinations.

Alert stays FIRING until human ack/resolve (no auto-resolve); webhook
fires exactly once per AlertInstance; Inbox never sees duplicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 14:17:18 +02:00
hsiegeln
e4492b10e1 chore: refresh GitNexus index stats + drop stale ci logs
- AGENTS.md / CLAUDE.md: GitNexus stat block re-rendered by the analyze
  hook after the last indexing run (8778 symbols / 22647 relationships).
- Remove checked-in ci-log.txt and ci-log2.txt — leftover debug output
  from an earlier CI troubleshooting session, not referenced anywhere.

Also deleted untracked ui/playwright.config.js and ui/vitest.config.js
from the working tree — those are stray compiled-to-JS artifacts of the
tracked .ts config sources, not intended to be committed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 09:24:49 +02:00
hsiegeln
6f78d0a513 ui(alerts): MustacheEditor — completion consumes existing }} instead of duplicating
closeBrackets auto-inserts `}}` when the user types `{{`, so the buffer
already reads `{{<prefix>}}` before a completion is accepted. The apply
callback was unconditionally appending another `}}`, producing
`{{path}}}}` (valid Mustache but obviously wrong).

Fix: peek at the two characters immediately after the completion range
and, when they're `}}`, extend the replacement range by two so the
existing closing braces are overwritten rather than left in place.

Added a regression test that drives `apply` through a real EditorView
for both the bare-prefix (no trailing `}}`) and auto-closed
(`{{prefix}}`) scenarios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 09:12:56 +02:00
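The range-extension fix in 6f78d0a513 lives in a CodeMirror apply callback in the UI code. The plain-string sketch below only illustrates the idea in Java: `MustacheCompletion.apply` is a hypothetical helper, and it assumes the completion range `[from, to)` covers just the typed prefix.

```java
final class MustacheCompletion {
    private MustacheCompletion() {}

    // Replaces buffer[from, to) with "<path>}}". If the two characters right
    // after the range are already "}}" (inserted by bracket auto-closing),
    // the range is extended by two so they are overwritten, not duplicated.
    static String apply(String buffer, int from, int to, String path) {
        int end = to;
        if (buffer.startsWith("}}", end)) {
            end += 2;
        }
        return buffer.substring(0, from) + path + "}}" + buffer.substring(end);
    }
}
```

For the auto-closed buffer `{{pa}}` with the range over `pa`, accepting `path` yields `{{path}}` instead of `{{path}}}}`; the bare-prefix buffer `{{pa` yields the same result.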
hsiegeln
1c4a98c0da ui(alerts): Silences page adopts Rules UX — top-right button + modal form
Before: the Silences page rendered an always-visible 4-field form strip
above the list, taking room even when the environment had zero silences.
Inconsistent with Rules, which puts a "New rule" action in the page
header and reserves the content area for either the list or an empty
state.

After: header mirrors Rules — title + subtitle on the left, a "New
silence" primary button on the right. The create form moved into a
Modal opened by that button (and by the empty-state's "Create silence"
action). `?ruleId=` deep links still work: the param is read on mount,
prefills the Rule ID field, and auto-opens the modal — preserving the
InboxPage "Silence rule… → Custom…" flow.

Dropped: unused `sectionStyles` import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 09:09:13 +02:00
hsiegeln
be45ba2d59 docs(triage): close-out follow-up — all 12 parked failures resolved, 560/560 green
Records the three fix commits + two prod-code cleanup commits, with
one-paragraph summaries for each cluster and pointers to the diagnosis
doc for SSE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:45:59 +02:00
hsiegeln
41df042e98 fix(sse): close 4 parked SSE test failures
Three distinct root causes, all reproducible when the classes run
solo — not order-dependent as the triage report suggested. Full
diagnosis in .planning/sse-flakiness-diagnosis.md.

1. AgentSseController.events auto-heal was over-permissive: any valid
   JWT allowed registering an arbitrary path-id, a spoofing vector.
   Surface symptom was the parked sseConnect_unknownAgent_returns404
   test hanging on a 200-with-empty-stream instead of getting 404.
   Fix: auto-heal requires JWT subject == path id.

2. SseConnectionManager.pingAll read ${agent-registry.ping-interval-ms}
   (unprefixed). AgentRegistryConfig binds cameleer.server.agentregistry.*
   — same family of bug as the MetricsFlushScheduler fix in a6944911.
   Fix: corrected placeholder prefix.

3. Spring's SseEmitter doesn't flush response headers until the first
   emitter.send(); clients on BodyHandlers.ofInputStream blocked on
   the first body byte, making awaitConnection(5s) unreliable under a
   15s ping cadence. Fix: send an initial ": connected" comment on
   connect() so headers hit the wire immediately.

Verified: 9/9 SSE tests green across AgentSseControllerIT + SseSigningIT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:41:34 +02:00
hsiegeln
06c6f53bbc refactor(ingestion): remove unused TaggedExecution record
No callers after the legacy PG ingestion path was retired in 0f635576.
core-classes.md updated to drop the leftover note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:33:26 +02:00
hsiegeln
98cbf8f3fc refactor(search): drop dead SearchIndexer subsystem
After the ExecutionController removal (0f635576), SearchIndexer
subscribed to ExecutionUpdatedEvent but nothing publishes that event.
Every SearchIndexerStats metric returned always-zero, and the admin
/api/v1/admin/clickhouse/pipeline endpoint that surfaced those stats
carried no signal.

Backend removed:
- core: SearchIndexer, SearchIndexerStats, ExecutionUpdatedEvent
- app: IndexerPipelineResponse DTO, /pipeline endpoint on
  ClickHouseAdminController (field + ctor param)
- StorageBeanConfig.searchIndexer bean

UI removed:
- IndexerPipeline type + useIndexerPipeline hook in
  api/queries/admin/clickhouse.ts
- Indexer Pipeline card in ClickHouseAdminPage.tsx (plus ProgressBar
  import and pipeline* CSS classes)

OpenAPI schema.d.ts + openapi.json regenerated (stale /pipeline path
and IndexerPipelineResponse schema removed).

SearchIndex interface + ClickHouseSearchIndex impl kept — those are
live and used by SearchService + ExchangeMatchEvaluator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:32:49 +02:00
hsiegeln
a694491140 fix(metrics): MetricsFlushScheduler honour ingestion config flush interval
The @Scheduled placeholder read ${ingestion.flush-interval-ms:1000}
(unprefixed) but IngestionConfig binds cameleer.server.ingestion.* —
YAML tuning of the metrics flush interval was silently ignored and the
scheduler fell back to the 1s default in every environment.

Corrected to ${cameleer.server.ingestion.flush-interval-ms:1000}.

(The initial attempt to bind via SpEL #{@ingestionConfig.flushIntervalMs}
failed because beans registered via @EnableConfigurationProperties use a
compound bean name "<prefix>-<FQN>", not the simple camelCase form. The
property-placeholder path is sufficient — IngestionConfig still owns
the Java-side default.)

BackpressureIT: drops the obsolete workaround property
`ingestion.flush-interval-ms=60000`; the single prefixed override now
controls both buffer config and flush cadence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:28:00 +02:00
hsiegeln
a9a6b465d4 fix(stats): close 8 ClickHouseStatsStoreIT TZ failures (bucket DateTime('UTC') + JVM UTC pin)
Two-layer fix for the TZ drift that caused stats reads to miss every row
when the JVM default TZ and CH session TZ disagreed:

- Insert side: ClickHouse JDBC 0.9.7 formats java.sql.Timestamp via
  Timestamp.toString(), which uses JVM default TZ. A CEST JVM shipping
  to a UTC CH server stored Unix timestamps off by the TZ offset (the
  triage report's original symptom). Pinned JVM default to UTC in
  CameleerServerApplication.main() — standard practice for observability
  servers that push to time-series stores.
- Read side: stats_1m_* tables now declare bucket as DateTime('UTC'),
  MV SELECTs wrap toStartOfMinute(start_time) in toDateTime(..., 'UTC')
  so projections match column type, and ClickHouseStatsStore.lit(Instant)
  emits toDateTime('...', 'UTC') rather than a bare literal — defence
  in depth against future refactors.

Test class pins its own JVM TZ (the store IT builds its own
HikariDataSource, bypassing the main() path). Debug scaffolding from
the triage investigation removed.

Greenfield CH — no migration needed.

Verified: 14/14 ClickHouseStatsStoreIT green, plus 84/84 across all
ClickHouse IT classes (no regression from the JVM TZ default change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:25:22 +02:00
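The insert-side problem in a9a6b465d4 rests on `java.sql.Timestamp.toString()` formatting in the JVM default time zone. The sketch below shows that effect with JDK classes only; `TimestampTzDemo` is a hypothetical helper, and it swaps the default zone temporarily purely for demonstration (the commit pins UTC once, in `main()`).

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.util.TimeZone;

final class TimestampTzDemo {
    private TimestampTzDemo() {}

    // Renders the instant the way Timestamp.toString() does when the JVM
    // default zone is zoneId. Mutating the default zone is global state,
    // so this is for demonstration only and is not thread-safe.
    static String render(Instant instant, String zoneId) {
        TimeZone previous = TimeZone.getDefault();
        try {
            TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
            return Timestamp.from(instant).toString();
        } finally {
            TimeZone.setDefault(previous);
        }
    }
}
```

The same instant renders with a different wall-clock hour under `Europe/Berlin` than under `UTC`, which is why a driver that ships the rendered string to a UTC server stores a shifted value unless the JVM default is pinned.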
hsiegeln
d32208d403 docs(plan): IT triage follow-ups — implementation plan
Task-by-task plan for the 2026-04-21-it-triage-followups-design spec.
Autonomous execution variant — SSE diagnose-then-fix branches to either
apply-fix or park-with-@Disabled based on diagnosis confidence, since
this runs unattended overnight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:10:55 +02:00
hsiegeln
6c1cbc289c docs(spec): IT triage follow-ups — design
Design for closing the 12 parked IT failures (ClickHouseStatsStoreIT
timezone, SSE flakiness in AgentSseControllerIT/SseSigningIT) plus two
production-code side notes the ExecutionController removal surfaced:

- ClickHouseStatsStore timezone fix — column-level DateTime('UTC') on
  bucket, greenfield CH
- SSE flakiness — diagnose-then-fix with user checkpoint between phases
- MetricsFlushScheduler property-key fix — bind via SpEL, single source
  of truth in IngestionConfig
- Dead-code cleanup — SearchIndexer.onExecutionUpdated listener +
  unused TaggedExecution record

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:03:08 +02:00