Commit Graph

132 Commits

Author SHA1 Message Date
hsiegeln
b655de3975 fix(config): structured 400 body on unknown apply value
Replace empty-body ResponseEntity.status(BAD_REQUEST).build() with
ResponseStatusException so Spring returns the usual error body shape
with a descriptive reason string, matching the idiom used by
UserAdminController, AppSettingsController, ThresholdAdminController.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 00:45:31 +02:00
hsiegeln
b5ecd39100 docs(api): document ?apply query param on updateConfig (Swagger)
Adds @Parameter description so the generated OpenAPI spec / Swagger UI
explains what 'staged' vs 'live' means instead of just surfacing the
bare param name. Follow-up: run `cd ui && npm run generate-api:live`
against a live backend to refresh openapi.json + schema.d.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 00:39:10 +02:00
hsiegeln
4d4c59efe3 fix(deploy): include DEGRADED deploys as restorable checkpoints
Snapshot is written by DeploymentExecutor before the RUNNING/DEGRADED
split, so DEGRADED rows already carry a deployed_config_snapshot. Treat
them as checkpoints — partial-healthy deploys still produced a working
config worth restoring. Aligns repo query with UI filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 00:34:25 +02:00
hsiegeln
d33c039a17 fix(deploy): address final review — sensitiveKeys snapshot, dirty scrubbing, transition race, refetch invalidations
- Issue 1: add List<String> sensitiveKeys as 4th field to DeploymentConfigSnapshot; populate
  from agentConfig.getSensitiveKeys() in DeploymentExecutor; handleRestore hydrates from
  snap.sensitiveKeys directly; Deployment type in apps.ts gains sensitiveKeys field
- Issue 2: after createApp succeeds, refetchQueries(['apps', envSlug]) before navigate so the
  new app is in cache before the router renders the deployed view (eliminates transient Save-
  disabled flash)
- Issue 3: useDeploymentPageState useEffect now uses prevServerStateRef to detect local edits;
  background refetches only overwrite form when no local changes are present
- Issue 5: handleRedeploy invalidates dirty-state + versions queries after createDeployment
  resolves; handleSave invalidates dirty-state after staged save
- Issue 10: DirtyStateCalculator strips volatile agentConfig keys (version, updatedAt, updatedBy,
  environment, application) before JSON comparison via scrubAgentConfig(); adds
  versionBumpDoesNotMarkDirty test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 23:29:01 +02:00
hsiegeln
6591f2fde3 api(apps): GET /apps/{slug}/dirty-state returns desired-vs-deployed diff
Wires DirtyStateCalculator behind an HTTP endpoint on AppController.
Adds findLatestSuccessfulByAppAndEnv to PostgresDeploymentRepository,
registers DirtyStateCalculator as a Spring bean (with ObjectMapper for
JavaTimeModule support), and covers all three scenarios with IT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:35:35 +02:00
hsiegeln
76129d407e api(config): ?apply=staged|live gates SSE push on PUT /apps/{slug}/config
When apply=staged, saves to DB only — no CONFIG_UPDATE dispatched to agents.
When apply=live (default, back-compat), preserves today's immediate-push behavior.
Unknown apply values return 400. Audit action is stage_app_config vs update_app_config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:07:36 +02:00
hsiegeln
a79eafeaf4 runtime(deploy): capture config snapshot on RUNNING transition
Injects PostgresApplicationConfigRepository into DeploymentExecutor and
calls saveDeployedConfigSnapshot at the COMPLETE stage, before
markRunning. Snapshot contains jarVersionId, agentConfig (nullable),
and app.containerConfig. The FAILED catch path is left untouched so
snapshot stays null on failure. Verified by DeploymentSnapshotIT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:51:00 +02:00
hsiegeln
d3e86b9d77 storage(deploy): persist deployed_config_snapshot as JSONB
Wire SELECT_COLS, mapRow deserialization, and saveDeployedConfigSnapshot
update method. Adds PostgresDeploymentRepositoryIT with roundtrip,
null-default, and clear-to-null tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:39:04 +02:00
hsiegeln
7f9cfc7f18 core(deploy): add deployedConfigSnapshot field to Deployment model
Appends DeploymentConfigSnapshot deployedConfigSnapshot to the Deployment
record and adds a matching withDeployedConfigSnapshot wither. All
positional call sites (repository mapper, test fixture) updated to pass
null; Task 1.4 will wire real persistence and Task 1.5 will populate
the field on RUNNING transition.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:31:48 +02:00
hsiegeln
ff95187707 db(deploy): add deployments.deployed_config_snapshot column (V3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 21:23:46 +02:00
hsiegeln
c2eab71a31 env(admin): per-environment color field + V2 migration
- V2__add_environment_color.sql adds a CHECK-constrained VARCHAR color column (default 'slate'); existing rows backfill to slate.
- Environment record + EnvironmentColor constants (8 preset values) flow through repository, service, and admin API.
- UpdateEnvironmentRequest.color nullable: null preserves existing; unknown values → 400.
- ITs cover valid / invalid / null-preserves behaviour; existing Environment constructor call-sites updated with the new color arg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:24:30 +02:00
hsiegeln
e6dcad1e07 config(app): silence MustacheAutoConfiguration templates-dir warning
jmustache on the classpath (for alert notification templates) triggers
Spring Boot's MustacheAutoConfiguration, which warns about the missing
classpath:/templates/ folder we don't use. Disable its check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:47:46 +02:00
hsiegeln
eda74b7339 docs(alerting): PER_EXCHANGE exactly-once — fireMode reference + deploy-backlog-cap
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m7s
CI / docker (push) Successful in 1m22s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 41s
Fix stale `AGGREGATE` label (actual enum: `COUNT_IN_WINDOW`). Expand
EXCHANGE_MATCH section with both fire modes, PER_EXCHANGE config-surface
restrictions (0 for reNotifyMinutes/forDurationSeconds, at-least-one-sink
rule), exactly-once guarantee scope, and the first-run backlog-cap knob.

Surface the new config in application.yml with the 24h default and the
opt-out-to-0 semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:39:49 +02:00
hsiegeln
e470fc0dab alerting(eval): clamp first-run cursor to deployBacklogCap — flood guard
New property cameleer.server.alerting.perExchangeDeployBacklogCapSeconds
(default 86400 = 24h, 0 disables). On first run (no persisted cursor
or malformed), clamp cursorTs to max(rule.createdAt, now - cap) so a
long-lived PER_EXCHANGE rule doesn't scan from its creation date
forward on first post-deploy tick. Normal-advance path unaffected.

Follows up final-review I-1 on the PER_EXCHANGE exactly-once phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:34:23 +02:00
hsiegeln
0f6bafae8e alerting(api): cross-field validation for PER_EXCHANGE + empty-targets guard
PER_EXCHANGE rules: 400 if reNotifyMinutes != 0 or forDurationSeconds != 0.
Any rule: 400 if webhooks + targets are both empty (never notifies anyone).

Turns green: AlertRuleControllerIT#createPerExchangeRule_with*NonZero_returns400,
AlertRuleControllerIT#createAnyRule_withEmptyWebhooksAndTargets_returns400.
2026-04-22 17:31:11 +02:00
hsiegeln
ba4e2bb68f alerting(eval): atomic per-rule batch commit via @Transactional — Phase 2 close
Wraps instance writes, notification enqueues, and cursor advance in one
transactional boundary per rule tick. Rollback leaves the rule replayable
on next tick. Turns the Phase 2 atomicity IT green (see AlertEvaluatorJobIT
#tickRollback_faultOnSecondNotificationInsert_leavesCursorUnchanged).
2026-04-22 17:03:07 +02:00
hsiegeln
b8d4b59f40 alerting(eval): AlertEvaluatorJob persists advanced cursor via withEvalState
Thread EvalResult.Batch.nextEvalState into releaseClaim so the composite
cursor from Task 1.5 actually lands in rule.evalState across tick boundaries.
Guards against empty-batch wipe (would regress to first-run scan).
2026-04-22 16:24:27 +02:00
hsiegeln
850c030642 search: compose ORDER BY with execution_id when afterExecutionId set
Follow-up to Task 1.2 flagged by Task 1.5 review (I-1). Single-column
ORDER BY could drop tail rows in a same-millisecond group >50 when
paginating via the composite cursor. Appending ', execution_id <dir>'
as secondary key only when afterExecutionId is set preserves existing
behaviour for UI/stats callers.
2026-04-22 16:21:52 +02:00
hsiegeln
4acf0aeeff alerting(eval): PER_EXCHANGE composite cursor — monotone across same-ms exchanges
Tests:
- cursorMonotonicity_sameMillisecondExchanges_fireExactlyOncePerTick
- firstRun_boundedByRuleCreatedAt_notRetentionHistory
2026-04-22 16:11:01 +02:00
hsiegeln
b41f34c090 search: SearchRequest.afterExecutionId — composite (startTime, execId) predicate
Adds an optional afterExecutionId field to SearchRequest. When combined
with a non-null timeFrom, ClickHouseSearchIndex applies a strictly-after
tuple predicate (start_time > ts OR (start_time = ts AND execution_id > id))
so same-millisecond exchanges can be consumed exactly once across ticks.

When afterExecutionId is null, timeFrom keeps its existing >= semantics —
no behaviour change for any current caller.

Also adds the SearchRequest.withCursor(ts, id) wither. Threads the field
through existing withInstanceIds / withEnvironment witheres. All existing
positional call-sites (SearchController, ExchangeMatchEvaluator,
ClickHouseSearchIndexIT, ClickHouseChunkPipelineIT) pass null for the new
slot.

Task 1.2 of docs/superpowers/plans/2026-04-22-per-exchange-exactly-once.md.
The evaluator-side wiring that actually supplies the cursor is Task 1.5.
2026-04-22 15:49:05 +02:00
hsiegeln
6fa8e3aa30 alerting(eval): EvalResult.Batch carries nextEvalState for cursor threading 2026-04-22 15:42:20 +02:00
hsiegeln
41df042e98 fix(sse): close 4 parked SSE test failures
Three distinct root causes, all reproducible when the classes run
solo — not order-dependent as the triage report suggested. Full
diagnosis in .planning/sse-flakiness-diagnosis.md.

1. AgentSseController.events auto-heal was over-permissive: any valid
   JWT allowed registering an arbitrary path-id, a spoofing vector.
   Surface symptom was the parked sseConnect_unknownAgent_returns404
   test hanging on a 200-with-empty-stream instead of getting 404.
   Fix: auto-heal requires JWT subject == path id.

2. SseConnectionManager.pingAll read ${agent-registry.ping-interval-ms}
   (unprefixed). AgentRegistryConfig binds cameleer.server.agentregistry.*
   — same family of bug as the MetricsFlushScheduler fix in a6944911.
   Fix: corrected placeholder prefix.

3. Spring's SseEmitter doesn't flush response headers until the first
   emitter.send(); clients on BodyHandlers.ofInputStream blocked on
   the first body byte, making awaitConnection(5s) unreliable under a
   15s ping cadence. Fix: send an initial ": connected" comment on
   connect() so headers hit the wire immediately.

Verified: 9/9 SSE tests green across AgentSseControllerIT + SseSigningIT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:41:34 +02:00
hsiegeln
98cbf8f3fc refactor(search): drop dead SearchIndexer subsystem
After the ExecutionController removal (0f635576), SearchIndexer
subscribed to ExecutionUpdatedEvent but nothing publishes that event.
Every SearchIndexerStats metric returned always-zero, and the admin
/api/v1/admin/clickhouse/pipeline endpoint that surfaced those stats
carried no signal.

Backend removed:
- core: SearchIndexer, SearchIndexerStats, ExecutionUpdatedEvent
- app: IndexerPipelineResponse DTO, /pipeline endpoint on
  ClickHouseAdminController (field + ctor param)
- StorageBeanConfig.searchIndexer bean

UI removed:
- IndexerPipeline type + useIndexerPipeline hook in
  api/queries/admin/clickhouse.ts
- Indexer Pipeline card in ClickHouseAdminPage.tsx (plus ProgressBar
  import and pipeline* CSS classes)

OpenAPI schema.d.ts + openapi.json regenerated (stale /pipeline path
and IndexerPipelineResponse schema removed).

SearchIndex interface + ClickHouseSearchIndex impl kept — those are
live and used by SearchService + ExchangeMatchEvaluator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:32:49 +02:00
hsiegeln
a694491140 fix(metrics): MetricsFlushScheduler honour ingestion config flush interval
The @Scheduled placeholder read ${ingestion.flush-interval-ms:1000}
(unprefixed) but IngestionConfig binds cameleer.server.ingestion.* —
YAML tuning of the metrics flush interval was silently ignored and the
scheduler fell back to the 1s default in every environment.

Corrected to ${cameleer.server.ingestion.flush-interval-ms:1000}.

(The initial attempt to bind via SpEL #{@ingestionConfig.flushIntervalMs}
failed because beans registered via @EnableConfigurationProperties use a
compound bean name "<prefix>-<FQN>", not the simple camelCase form. The
property-placeholder path is sufficient — IngestionConfig still owns
the Java-side default.)

BackpressureIT: drops the obsolete workaround property
`ingestion.flush-interval-ms=60000`; the single prefixed override now
controls both buffer config and flush cadence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:28:00 +02:00
hsiegeln
a9a6b465d4 fix(stats): close 8 ClickHouseStatsStoreIT TZ failures (bucket DateTime('UTC') + JVM UTC pin)
Two-layer fix for the TZ drift that caused stats reads to miss every row
when the JVM default TZ and CH session TZ disagreed:

- Insert side: ClickHouse JDBC 0.9.7 formats java.sql.Timestamp via
  Timestamp.toString(), which uses JVM default TZ. A CEST JVM shipping
  to a UTC CH server stored Unix timestamps off by the TZ offset (the
  triage report's original symptom). Pinned JVM default to UTC in
  CameleerServerApplication.main() — standard practice for observability
  servers that push to time-series stores.
- Read side: stats_1m_* tables now declare bucket as DateTime('UTC'),
  MV SELECTs wrap toStartOfMinute(start_time) in toDateTime(..., 'UTC')
  so projections match column type, and ClickHouseStatsStore.lit(Instant)
  emits toDateTime('...', 'UTC') rather than a bare literal — defence
  in depth against future refactors.

Test class pins its own JVM TZ (the store IT builds its own
HikariDataSource, bypassing the main() path). Debug scaffolding from
the triage investigation removed.

Greenfield CH — no migration needed.

Verified: 14/14 ClickHouseStatsStoreIT green, plus 84/84 across all
ClickHouse IT classes (no regression from the JVM TZ default change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:25:22 +02:00
hsiegeln
0f635576a3 refactor(ingestion): drop dead legacy execution-ingestion path
ExecutionController was @ConditionalOnMissingBean(ChunkAccumulator.class),
and ChunkAccumulator is registered unconditionally — the legacy controller
never bound in any profile. Even if it had, IngestionService.ingestExecution
called executionStore.upsert(), and the only ExecutionStore impl
(ClickHouseExecutionStore) threw UnsupportedOperationException from upsert
and upsertProcessors. The entire RouteExecution → upsert path was dead code
carrying four transitive dependencies (RouteExecution import, eventPublisher
wiring, body-size-limit config, searchIndexer::onExecutionUpdated hook).

Removed:
- cameleer-server-app/.../controller/ExecutionController.java (whole file)
- ExecutionStore.upsert + upsertProcessors (interface methods)
- ClickHouseExecutionStore.upsert + upsertProcessors (thrower overrides)
- IngestionService.ingestExecution + toExecutionRecord + flattenProcessors
  + hasAnyTraceData + truncateBody + toJson/toJsonObject helpers
- IngestionService constructor now takes (DiagramStore, WriteBuffer<Metrics>);
  dropped ExecutionStore + Consumer<ExecutionUpdatedEvent> + bodySizeLimit
- StorageBeanConfig.ingestionService(...) simplified accordingly

Untouched because still in use:
- ExecutionRecord / ProcessorRecord records (findById / findProcessors /
  SearchIndexer / DetailController)
- SearchIndexer (its onExecutionUpdated never fires now since no-one
  publishes ExecutionUpdatedEvent, but SearchIndexerStats is still
  referenced by ClickHouseAdminController — separate cleanup)
- TaggedExecution record has no remaining callers after this change —
  flagged in core-classes.md as a leftover; separate cleanup.

Rule docs updated:
- .claude/rules/app-classes.md: retired ExecutionController bullet, fixed
  stale URL for ChunkIngestionController (it owns /api/v1/data/executions,
  not /api/v1/ingestion/chunk/executions).
- .claude/rules/core-classes.md: IngestionService surface + note the dead
  TaggedExecution.

Full IT suite post-removal: 560 tests run, 11 F + 1 E — same 12 failures
in the same 3 previously-parked classes (AgentSseControllerIT / SseSigningIT
SSE-timing + ClickHouseStatsStoreIT timezone bug). No regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 22:50:51 +02:00
hsiegeln
fb54f9cbd2 fix(agent): revive DEAD agents on heartbeat (not just STALE)
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m5s
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / docker (push) Has been cancelled
Reproduction: pause a container long enough to cross both the stale
and dead thresholds, then unpause. The agent resumes sending heartbeats
but the server keeps it shown as DEAD. Only a full container restart
(which re-registers) fixes it.

Root cause: AgentRegistryService.heartbeat() only revived STALE → LIVE.
A DEAD agent's heartbeat updated lastHeartbeat but left state unchanged.
checkLifecycle() never downgrades DEAD either (no-op in that branch),
so the agent was permanently stuck in DEAD until a register() call.

Fix: extend the revival branch to also cover DEAD. Same process; a
heartbeat is proof of liveness regardless of the previous state.

Also: AgentLifecycleMonitor.mapTransitionEvent() now emits RECOVERED
for DEAD → LIVE, mirroring its behavior for STALE → LIVE, so the
lifecycle timeline captures the transition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 20:55:47 +02:00
hsiegeln
90083f886a refactor(schema): collapse V1..V18 into single V1__init.sql baseline
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m4s
CI / docker (push) Successful in 1m17s
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
The project is still greenfield (no production deployment) so this is
the last safe moment to flatten the migration archaeology before the
checksum history starts mattering for real.

Schema changes
 - 18 migration files (531 lines) → one V1__init.sql (~380 lines)
   declaring the final end-state: RBAC + claim mappings + runtime
   management + config + audit + outbound + alerting, plus seed data
   (system roles, Admins group, default environment).
 - Drops the data-repair statements from V14 (firemode backfill),
   V16 (subjectFingerprint migration), V17 (ACKNOWLEDGED → FIRING
   coercion) — they were no-ops on any DB that starts at V1.
 - Declares condition_kind_enum with AGENT_LIFECYCLE from the start
   (was added retroactively by V18).
 - Declares alert_state_enum with three values only (was five, then
   swapped in V17) and alert_instances with read_at / deleted_at
   columns from day one (was added by V17).
 - alert_reads table never created (V12 created, V17 dropped).
 - alert_instances_open_rule_uq built with the V17 predicate from
   the start.

Test changes
 - Replace V12MigrationIT / V17MigrationIT / V18MigrationIT with one
   SchemaBootstrapIT that asserts the combined invariants: tables
   present, alert_reads absent, enum value sets, alert_instances has
   read_at + deleted_at, open_rule_uq exists and is unique, env-delete
   cascade fires.

Verification
 - pg_dump of the new V1 matches the pg_dump of V1..V18 applied in
   sequence (bytewise modulo column order and Postgres-auto FK names).
 - Full alerting IT suite (53 tests across 6 classes) green against
   the new schema.
 - The 47 pre-existing test failures on main (AgentRegistrationIT,
   SearchControllerIT, ClickHouseStatsStoreIT, …) are unrelated and
   fail identically without this change.

Developer impact
 - Existing local DBs will fail checksum validation on boot. Wipe:
   docker compose down -v  (or drop the tenant_default schema).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 20:52:22 +02:00
hsiegeln
b7d201d743 fix(alerts): add AGENT_LIFECYCLE to condition_kind_enum + readable error toasts
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m5s
CI / docker (push) Successful in 1m19s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
Backend
 - V18 migration adds AGENT_LIFECYCLE to condition_kind_enum. Java
   ConditionKind enum shipped with this value but no Postgres migration
   extended the type, so any AGENT_LIFECYCLE rule insert failed with
   "invalid input value for enum condition_kind_enum".
 - ALTER TYPE ... ADD VALUE lives alone in its migration per Postgres
   constraint that the new value cannot be referenced in the same tx.
 - V18MigrationIT asserts the enum now contains all 7 kinds.

Frontend
 - Add describeApiError(e) helper to unwrap openapi-fetch error bodies
   (Spring error JSON) into readable strings. String(e) on a plain
   object rendered "[object Object]" in toasts — the actual failure
   reason was hidden from the user.
 - Replace String(e) in all 13 toast descriptions across the alerting
   and outbound-connection mutation paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 20:23:14 +02:00
hsiegeln
99b739d946 fix(alerts): backend hardening + complete ACKNOWLEDGED migration
- new AlertInstanceRepository.filterInEnvLive(ids, env): single-query bulk ID validation
- AlertController.inEnvLiveIds now one SQL round-trip instead of N
- bulkMarkRead SQL: defense-in-depth AND deleted_at IS NULL
- bulkAck SQL already had deleted_at IS NULL guard — no change needed
- PostgresAlertInstanceRepositoryIT: add filterInEnvLive_excludes_other_env_and_soft_deleted
- V12MigrationIT: remove alert_reads assertion (table dropped by V17)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 18:48:57 +02:00
hsiegeln
efd8396045 feat(alerts): controller — DELETE/bulk-delete/bulk-ack/restore + acked/read filters + readAt on DTO
- GET /alerts gains tri-state acked + read query params
- new endpoints: DELETE /{id} (soft-delete), POST /bulk-delete, POST /bulk-ack, POST /{id}/restore
- requireLiveInstance 404s on soft-deleted rows; restore() reads the row regardless
- BulkReadRequest → BulkIdsRequest (shared body for bulk read/ack/delete)
- AlertDto gains readAt; deletedAt stays off the wire
- InAppInboxQuery.listInbox threads acked/read through to the repo (7-arg, no more null placeholders)
- SecurityConfig: new matchers for bulk-ack (VIEWER+), DELETE/bulk-delete/restore (OPERATOR+)
- AlertControllerIT: persistence assertions on /read + /bulk-read; full coverage for new endpoints
- InAppInboxQueryTest: updated to 7-arg listInbox signature

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 18:15:16 +02:00
hsiegeln
da2819332c feat(alerts): Postgres repo — read_at/deleted_at columns, filter params, new mutations
- save/rowMapper read+write read_at and deleted_at
- listForInbox: tri-state acked/read filters; always excludes deleted
- countUnreadBySeverity: rewire without alert_reads join, preserve zero-fill
- new: markRead/bulkMarkRead/softDelete/bulkSoftDelete/bulkAck/restore
- delete PostgresAlertReadRepository + its bean
- restore zero-fill Javadoc on interface
- mechanical compile-fixes in AlertController, InAppInboxQuery,
  AlertControllerIT, InAppInboxQueryTest; Task 6 owns the rewrite
- PostgresAlertReadRepositoryIT stubbed @Disabled; Task 7 owns migration

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:56:06 +02:00
hsiegeln
6e8d890442 fix(alerts): remove dead ACKNOWLEDGED enum SQL + TODO comments
Remove SET state='ACKNOWLEDGED' from ack() and the ACKNOWLEDGED predicate
from findOpenForRule — both would error after V17. The final ack() + open-rule
semantics (idempotent guards, deleted_at) are owned by Task 5; this is just
the minimum to stop runtime SQL errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:36:02 +02:00
hsiegeln
5b1b3f215a test(alerts): state machine — ack is orthogonal, does not transition FIRING
- AlertStateTransitionsTest: add null,null for readAt/deletedAt in openInstance helper;
  replace firingWhenAcknowledgedIsNoOp with firing_with_ack_stays_firing_on_next_firing_tick;
  convert ackedInstanceClearsToResolved to use FIRING+withAck; update section comment.
- PostgresAlertInstanceRepository: stub null,null for readAt/deletedAt in rowMapper
  to unblock compilation (Task 4 will read the actual DB columns).
- All other alerting test files: add null,null for readAt/deletedAt to AlertInstance
  ctor calls so the test source tree compiles; stub ACKNOWLEDGED JSON/state assertions
  with FIRING + TODO Task 4 comments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:28:31 +02:00
hsiegeln
82e82350f9 refactor(alerts): drop ACKNOWLEDGED from AlertState, add readAt/deletedAt to AlertInstance
- AlertState: remove ACKNOWLEDGED case (V17 migration already dropped it from DB enum)
- AlertInstance: insert readAt + deletedAt Instant fields after lastNotifiedAt; add withReadAt/withDeletedAt withers; update all existing withers to pass both fields positionally
- AlertStateTransitions: add null,null for readAt/deletedAt in newInstance ctor call; collapse FIRING,ACKNOWLEDGED switch arm to just FIRING
- AlertScopeTest: update AlertState.values() assertion to 3 values; fix stale ConditionKind.hasSize(6) to 7 (JVM_METRIC was added earlier)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:12:37 +02:00
hsiegeln
e95c21d0cb feat(alerts): V17 migration — drop ACKNOWLEDGED, add read_at + deleted_at
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:04:09 +02:00
hsiegeln
414f7204bf feat(alerting): AGENT_LIFECYCLE condition kind with per-subject fire mode
Allows alert rules to fire on agent-lifecycle events — REGISTERED,
RE_REGISTERED, DEREGISTERED, WENT_STALE, WENT_DEAD, RECOVERED — rather
than only on current state. Each matching `(agent, eventType, timestamp)`
becomes its own ackable AlertInstance, so outages on distinct agents are
independently routable.

Core:
- New `ConditionKind.AGENT_LIFECYCLE` + `AgentLifecycleCondition` record
  (scope, eventTypes, withinSeconds). Compact ctor rejects empty
  eventTypes and withinSeconds<1.
- Strict allowlist enum `AgentLifecycleEventType` (six entries matching
  the server-emitted types in `AgentRegistrationController` and
  `AgentLifecycleMonitor`). Custom agent-emitted event types tracked in
  backlog issue #145.
- `AgentEventRepository.findInWindow(env, appSlug, agentId, eventTypes,
  from, to, limit)` — new read path ordered `(timestamp ASC, insert_id
  ASC)` used by the evaluator. Implemented on
  `ClickHouseAgentEventRepository` with tenant + env filter mandatory.

App:
- `AgentLifecycleEvaluator` queries events in the last `withinSeconds`
  window and returns `EvalResult.Batch` with one `Firing` per row.
  Every Firing carries a canonical `_subjectFingerprint` of
  `"<agentId>:<eventType>:<tsMillis>"` in context plus `agent` / `event`
  subtrees for Mustache templating.
- `NotificationContextBuilder` gains an `AGENT_LIFECYCLE` branch that
  exposes `{{agent.id}}`, `{{agent.app}}`, `{{event.type}}`,
  `{{event.timestamp}}`, `{{event.detail}}`.
- Validation is delegated to the record compact ctor + enum at Jackson
  deserialization time — matches the existing policy of keeping
  controller validators focused on env-scoped / SQL-injection concerns.

Schema:
- V16 migration generalises the V15 per-exchange discriminator on
  `alert_instances_open_rule_uq` to prefer `_subjectFingerprint` with a
  fallback to the legacy `exchange.id` expression. Scalar kinds still
  resolve to `''` and keep one-open-per-rule. Duplicate-key path in
  `PostgresAlertInstanceRepository.save` is unchanged — the index is
  the deduper.

UI:
- New `AgentLifecycleForm.tsx` wizard form with multi-select chips for
  the six allowed event types + `withinSeconds` input. Wired into
  `ConditionStep`, `form-state` (validation + defaults: WENT_DEAD,
  300 s), and `enums.ts` options. Tests in `enums.test.ts` pin the
  new option array.
- `alert-variables.ts` registers `{{agent.app}}`, `{{event.type}}`,
  `{{event.timestamp}}`, `{{event.detail}}` leaves for the new kind,
  and extends `agent.id`'s availability list to include `AGENT_LIFECYCLE`.

Tests (all passing):
- 5 new JSON-roundtrip cases on `AlertConditionJsonTest` (positive +
  empty/zero/unknown-type rejection).
- 5 new evaluator unit tests on `AgentLifecycleEvaluatorTest` (empty
  window, multi-agent fingerprint shape, scope forwarding, missing env).
- `NotificationContextBuilderTest` switch now covers the new kind.
- 119 alerting unit tests + 71 UI tests green.

Docs: `.claude/rules/{core,app,ui}` and CLAUDE.md migration list updated.
2026-04-21 14:52:08 +02:00
hsiegeln
f037d8c922 feat(alerting): server-side state+severity filters, ButtonGroup filter UI
Backend: `GET /environments/{envSlug}/alerts` now accepts optional multi-value
`state=…` and `severity=…` query params. Filters are pushed down to
PostgresAlertInstanceRepository, which appends `AND state::text = ANY(?)` /
`AND severity::text = ANY(?)` to the inbox query (null/empty = no filter).

`AlertInstanceRepository.listForInbox` gained a 7-arg overload; the old 5-arg
form is preserved as a default delegate so existing callers (evaluator,
AlertingFullLifecycleIT, PostgresAlertInstanceRepositoryIT) compile unchanged.
`InAppInboxQuery.listInbox` also has a new filtered overload.

UI: InboxPage severity filter migrated from `SegmentedTabs` (single-select,
no color cues) to `ButtonGroup` (multi-select with severity-coloured dots),
matching the topnavbar status-filter pattern. `useAlerts` forwards the
filters as query params and cache-keys on the filter tuple so each combo
is independently cached.

Unit + hook tests updated to the new contract (5 UI tests + 8 Java unit
tests passing). OpenAPI types regenerated from the fresh local backend.
2026-04-21 12:47:31 +02:00
hsiegeln
037a27d405 fix(alerting): allow multiple open alert_instances per rule for PER_EXCHANGE
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m51s
CI / docker (push) Successful in 1m17s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 41s
V13 added a partial unique index on alert_instances(rule_id) WHERE state
IN (PENDING,FIRING,ACKNOWLEDGED). Correct for scalar condition kinds
(ROUTE_METRIC / AGENT_STATE / DEPLOYMENT_STATE / LOG_PATTERN / JVM_METRIC
/ EXCHANGE_MATCH in COUNT_IN_WINDOW) but wrong for EXCHANGE_MATCH /
PER_EXCHANGE, which by design emits one alert_instance per matching
exchange. Under V13 every PER_EXCHANGE tick with >1 match logged
"Skipped duplicate open alert_instance for rule …" at evaluator cadence
and silently lost alert fidelity — only the first matching exchange per
tick got an AlertInstance + webhook dispatch.

V15 drops the rule_id-only constraint and recreates it with a
discriminator on context->'exchange'->>'id'. Scalar kinds emit
Map.of() as context, so their expression resolves to '' — "one open per
rule" preserved. ExchangeMatchEvaluator.evaluatePerExchange always
populates exchange.id, so per-exchange instances coexist cleanly.

Two new PostgresAlertInstanceRepositoryIT tests:
  - multiple open instances for same rule + distinct exchanges all land
  - second open for identical (rule, exchange) still dedups via the
    DuplicateKeyException fallback in save() — defense-in-depth kept

Also fixes pre-existing PostgresAlertReadRepositoryIT brokenness: its
setup() inserted 3 open instances sharing one rule_id, which V13 blocked
on arrival. Migrate to one rule_id per instance (pattern already used
across other storage ITs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 22:26:19 +02:00
hsiegeln
efa8390108 fix(alerting): reject null fireMode on ExchangeMatchCondition + repair in-flight rows
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m2s
CI / docker (push) Successful in 1m20s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
SonarQube / sonarqube (push) Successful in 5m31s
The rule editor wizard reset the condition payload on kind-change without
seeding a fireMode default; the ExchangeMatchCondition ctor allowed null to
pass through; AlertEvaluatorJob then NPE-looped every tick on a saved rule.

- core: compact ctor now rejects null fireMode (Jackson-deser path only — all
  production callers already pass a value).
- V14: repair existing EXCHANGE_MATCH rows with fireMode=null to
  PER_EXCHANGE + perExchangeLingerSeconds=300 (default matches the wizard).
- ui: ConditionStep.onKindChange seeds EXCHANGE_MATCH defaults so the
  Select's displayed fallback ("Per exchange") is actually in form state.
- ui: validateStep('condition', ...) now enforces fireMode presence + the
  mode-specific fields before the user reaches Review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:05:55 +02:00
hsiegeln
ae6473635d fix(auth): OidcAuthController + UserAdminController upsert unprefixed
Follow-up to the UiAuthController fix: every write path that puts a row
into users/user_roles/user_groups must use the bare DB key, because
the env-scoped controllers (Alert, AlertRule, AlertSilence, Outbound)
strip "user:" before using the name as an FK. If the write path stores
prefixed, first-time alerting/outbound writes fail with
alert_rules_created_by_fkey violation.

UiAuthController shipped the model in the prior commit (bare userId
for all DB/RBAC calls, "user:"-namespaced subject for JWT signing).
Bringing the other two write paths in line:

- OidcAuthController.callback:
    userId  = "oidc:" + oidcUser.subject()    // DB key, no "user:"
    subject = "user:" + userId                // JWT subject (namespaced)
  All userRepository / rbacService / applyClaimMappings calls use
  userId. Tokens still carry the namespaced subject so
  JwtAuthenticationFilter can distinguish user vs agent tokens.

- UserAdminController.createUser: userId = request.username() (bare).
  resetPassword: dropped the "user:"-strip fallback that was only
  needed because create used to prefix — now dead.

No migration. Greenfield alpha product — any pre-existing prefixed
rows in a dev DB will become orphans on next login (login upserts
the unprefixed row, old prefixed row is harmless but unused).
Operators doing a clean re-index can wipe the DB.

Read-path controllers still strip — harmless for bare DB rows, and
OIDC humans (JWT sub "user:oidc:<s>") still resolve correctly to
the new DB key "oidc:<s>" after stripping.

Verified: 45/45 alerting + outbound ITs pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 18:44:17 +02:00
hsiegeln
1ea0258393 fix(auth): upsert UI login user_id unprefixed (drop docker seeder workaround)
Root cause of the mismatch that prompted the one-shot cameleer-seed
docker service: UiAuthController stored users.user_id as the JWT
subject "user:admin" (JWT sub format). Every env-scoped controller
(Alert, AlertSilence, AlertRule, OutboundConnectionAdmin) already
strips the "user:" prefix on the read path — so the rest of the
system expects the DB key to be the bare username. With UiAuth
storing prefixed, fresh docker stacks hit
"alert_rules_created_by_fkey violation" on the first rule create.

Fix: inside login(), compute `userId = request.username()` and use
it everywhere the DB/RBAC layer is touched (isLocked, getPasswordHash,
record/clearFailedLogins, upsert, assignRoleToUser, addUserToGroup,
getSystemRoleNames). Keep `subject = "user:" + userId` — we still
sign JWTs with the namespaced subject so JwtAuthenticationFilter can
distinguish user vs agent tokens.

refresh() and me() follow the same rule via a stripSubjectPrefix()
helper (JWT subject in, bare DB key out).

With the write path aligned, the docker bridge is no longer needed:
- Deleted deploy/docker/postgres-init.sql
- Deleted cameleer-seed service from docker-compose.yml

Scope: UiAuthController only. UserAdminController + OidcAuthController
still prefix on upsert — that's the bug class the triage identified
as "Option A or B either way OK". Not changing them now because:
  a) prod admins are provisioned unprefixed through some other path,
     so those two files aren't the docker-only failure observed;
  b) stripping them would need a data migration for any existing
     prod users stored prefixed, which is out of scope for a cleanup
     phase. Follow-up worth scheduling if we ever wire OIDC or admin-
     created users into alerting FKs.

Verified: 33/33 alerting+outbound controller ITs pass (9 outbound,
10 rules, 9 silences, 5 alert inbox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 18:26:03 +02:00
hsiegeln
09b49f096c feat(alerting): per-severity breakdown on unread-count DTO
Spec §13 calls for the notification bell to colour-code by highest
unread severity (CRITICAL → error, WARNING → amber, INFO → muted).
The old { count } DTO forced the UI to pick one static colour, so
NotificationBell shipped with a TODO. Grow the contract instead:

  UnreadCountResponse = { total, bySeverity: { CRITICAL, WARNING, INFO } }

Guarantees:
- every severity is always present with a >=0 value (no undefined
  keys on the wire), so the UI can branch without defaults.
- total = sum of bySeverity values — kept explicit on the wire for
  cheap top-line display, not recomputed client-side.

Backend
- AlertInstanceRepository: replaces countUnreadForUser(long) with
  countUnreadBySeverityForUser returning Map<AlertSeverity, Long>.
  One SQL round-trip per (env, user) — GROUP BY ai.severity over the
  same NOT EXISTS(alert_reads) filter.
- UnreadCountResponse.from(Map) normalises and defensively copies;
  missing severities default to 0.
- InAppInboxQuery.countUnread now returns the DTO, caches the full
  response (still 5s TTL) so severity breakdown gets the same
  hit-rate as the total did before.
- AlertController just hands the DTO back.

Breaking change — no backwards-compat shim: the `count` field is
gone. UI and tests updated in the same commit; there are no other
API consumers in the tree.

Frontend
- Regenerated openapi.json + schema.d.ts against a fresh build of
  the new backend.
- NotificationBell branches badge colour on the highest unread
  severity (CRITICAL > WARNING > INFO) via new CSS variants.
- Tests cover all four paths: zero, critical-present, warning-only,
  info-only.

Tests: 7 unit tests + 12 ITs (incl. new grouping + empty-map)
       + 49 vitest (was 46; +3 severity-branch assertions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 18:15:56 +02:00
hsiegeln
5edf7eb23a fix(alerting): @Autowired on AlertingMetrics production constructor
Task 29's refactor added a package-private test-friendly constructor
alongside the public production one. Without @Autowired Spring cannot pick
which constructor to use for the @Component, and falls back to searching
for a no-arg default — crashing startup with 'No default constructor found'.

Detected when launching the server via the new docker-compose stack; unit
tests still pass because they invoke the package-private test constructor
directly.
2026-04-20 16:02:48 +02:00
hsiegeln
9f109b20fd perf(alerting): 30s TTL cache on AlertingMetrics gauge suppliers
Prometheus scrapes can fire every few seconds. The open-alerts / open-rules
gauges query Postgres on each read — caching the values for 30s amortises
that to one query per half-minute. Addresses final-review NIT from Plan 02.

- Introduces a package-private TtlCache that wraps a Supplier<Long> and
  memoises the last read for a configurable Duration against a Supplier<Instant>
  clock.
- Wraps each gauge supplier (alerting_rules_total{enabled|disabled},
  alerting_instances_total{state}) in its own TtlCache.
- Adds a test-friendly constructor (package-private) taking explicit
  Duration + Supplier<Instant> so AlertingMetricsCachingTest can advance
  a fake clock without waiting wall-clock time.
- Adds AlertingMetricsCachingTest covering:
  * supplier invoked once per TTL across repeated scrapes
  * 29 s elapsed → still cached; 31 s elapsed → re-queried
  * gauge value reflects the cached result even after delegate mutates

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:22:54 +02:00
hsiegeln
5ebc729b82 feat(alerting): SSRF guard on outbound connection URL
Rejects webhook URLs that resolve to loopback, link-local, or RFC-1918
private ranges (IPv4 + IPv6 ULA fc00::/7). Enforced on both create and
update in OutboundConnectionServiceImpl before persistence; returns 400
Bad Request with "private or loopback" in the body.

Bypass via `cameleer.server.outbound-http.allow-private-targets=true`
for dev environments where webhooks legitimately point at local
services. Production default is `false`.

Test profile sets the flag to `true` in application-test.yml so the
existing ITs that post webhooks to WireMock on https://localhost:PORT
keep working. A dedicated OutboundConnectionSsrfIT overrides the flag
back to false (via @TestPropertySource + @DirtiesContext) to exercise
the reject path end-to-end through the admin controller.

Plan 01 scope; required before SaaS exposure (spec §17).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:17:44 +02:00
hsiegeln
c9c93ac565 fix(alerting): declare ClickHouseSearchIndex bean as concrete type
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 3m6s
CI / cleanup-branch (pull_request) Has been skipped
CI / build (pull_request) Successful in 4m5s
CI / docker (pull_request) Has been skipped
CI / deploy (pull_request) Has been skipped
CI / deploy-feature (pull_request) Has been skipped
CI / docker (push) Successful in 1m37s
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Successful in 39s
Production crashlooped on startup: ExchangeMatchEvaluator autowires the
concrete ClickHouseSearchIndex (for countExecutionsForAlerting, which
lives only on the concrete class, not the SearchIndex interface), but
StorageBeanConfig declared the bean with interface return type SearchIndex.
Spring matches autowire candidates by declared bean type, not by runtime
instance class, so the concrete-typed autowire failed with:

  Parameter 0 of constructor in ExchangeMatchEvaluator required a bean
  of type 'ClickHouseSearchIndex' that could not be found.

ClickHouseLogStore's bean is already declared with the concrete return
type (line 171), which is why LogPatternEvaluator autowires fine.

All alerting ITs passed pre-merge because AbstractPostgresIT replaces the
clickHouseSearchIndex bean with @MockBean(name=...) whose declared type
IS the concrete ClickHouseSearchIndex. The mock masked the prod bug.

Follow-up: remove @MockBean(name="clickHouseSearchIndex") from
AbstractPostgresIT so the real bean graph is exercised by alerting ITs
(and add a SpringContextSmokeIT that loads the context with no mocks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 09:11:47 +02:00
hsiegeln
2c82b50ea2 fix(alerting/B-1): AlertStateTransitions.newInstance() propagates rule targets to AlertInstance
newInstance() now maps rule.targets() into targetUserIds/targetGroupIds/targetRoleNames
so newly created AlertInstance rows carry the correct target arrays.
Previously these were always empty List.of(), making the inbox query return nothing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:26:25 +02:00
hsiegeln
7e79ff4d98 fix(alerting/I-2): add unique partial index on alert_instances(rule_id) for open states
V13 migration creates alert_instances_open_rule_uq — a partial unique index on
(rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED'), preventing
duplicate open instances per rule. PostgresAlertInstanceRepository.save() catches
DuplicateKeyException and returns the existing open instance instead of failing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:26:07 +02:00
hsiegeln
424894a3e2 fix(alerting/I-1): retry endpoint resets attempts to 0 instead of incrementing
AlertNotificationRepository gains resetForRetry(UUID, Instant) which sets
attempts=0, status=PENDING, next_attempt_at=now, and clears claim/response
fields. AlertNotificationController calls resetForRetry instead of
scheduleRetry so a manual retry always starts from a clean slate.

AlertNotificationControllerIT adds retryResetsAttemptsToZero to verify
attempts==0 and status==PENDING after three prior markFailed calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:25:59 +02:00