Commit Graph

48 Commits

Author SHA1 Message Date
hsiegeln
c2eab71a31 env(admin): per-environment color field + V2 migration
- V2__add_environment_color.sql adds a CHECK-constrained VARCHAR color column (default 'slate'); existing rows backfill to slate.
- Environment record + EnvironmentColor constants (8 preset values) flow through repository, service, and admin API.
- UpdateEnvironmentRequest.color nullable: null preserves existing; unknown values → 400.
- ITs cover valid / invalid / null-preserves behaviour; existing Environment constructor call-sites updated with the new color arg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:24:30 +02:00
hsiegeln
e483e52eee alerting(core): drop unused perExchangeLingerSeconds from ExchangeMatchCondition
Dead field — was enforced by compact ctor as required for PER_EXCHANGE,
but never read anywhere in the codebase. Removal tightens the API surface
and is precondition for the Task 3.3 cross-field validator.

Pre-prod; no shim / migration.
2026-04-22 17:10:53 +02:00
hsiegeln
0bad014811 core(alerting): AlertRule.withEvalState wither for cursor threading 2026-04-22 16:04:55 +02:00
hsiegeln
b41f34c090 search: SearchRequest.afterExecutionId — composite (startTime, execId) predicate
Adds an optional afterExecutionId field to SearchRequest. When combined
with a non-null timeFrom, ClickHouseSearchIndex applies a strictly-after
tuple predicate (start_time > ts OR (start_time = ts AND execution_id > id))
so same-millisecond exchanges can be consumed exactly once across ticks.

When afterExecutionId is null, timeFrom keeps its existing >= semantics —
no behaviour change for any current caller.

Also adds the SearchRequest.withCursor(ts, id) wither. Threads the field
through existing withInstanceIds / withEnvironment witheres. All existing
positional call-sites (SearchController, ExchangeMatchEvaluator,
ClickHouseSearchIndexIT, ClickHouseChunkPipelineIT) pass null for the new
slot.

Task 1.2 of docs/superpowers/plans/2026-04-22-per-exchange-exactly-once.md.
The evaluator-side wiring that actually supplies the cursor is Task 1.5.
2026-04-22 15:49:05 +02:00
hsiegeln
06c6f53bbc refactor(ingestion): remove unused TaggedExecution record
No callers after the legacy PG ingestion path was retired in 0f635576.
core-classes.md updated to drop the leftover note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:33:26 +02:00
hsiegeln
98cbf8f3fc refactor(search): drop dead SearchIndexer subsystem
After the ExecutionController removal (0f635576), SearchIndexer
subscribed to ExecutionUpdatedEvent but nothing publishes that event.
Every SearchIndexerStats metric returned always-zero, and the admin
/api/v1/admin/clickhouse/pipeline endpoint that surfaced those stats
carried no signal.

Backend removed:
- core: SearchIndexer, SearchIndexerStats, ExecutionUpdatedEvent
- app: IndexerPipelineResponse DTO, /pipeline endpoint on
  ClickHouseAdminController (field + ctor param)
- StorageBeanConfig.searchIndexer bean

UI removed:
- IndexerPipeline type + useIndexerPipeline hook in
  api/queries/admin/clickhouse.ts
- Indexer Pipeline card in ClickHouseAdminPage.tsx (plus ProgressBar
  import and pipeline* CSS classes)

OpenAPI schema.d.ts + openapi.json regenerated (stale /pipeline path
and IndexerPipelineResponse schema removed).

SearchIndex interface + ClickHouseSearchIndex impl kept — those are
live and used by SearchService + ExchangeMatchEvaluator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 23:32:49 +02:00
hsiegeln
0f635576a3 refactor(ingestion): drop dead legacy execution-ingestion path
ExecutionController was @ConditionalOnMissingBean(ChunkAccumulator.class),
and ChunkAccumulator is registered unconditionally — the legacy controller
never bound in any profile. Even if it had, IngestionService.ingestExecution
called executionStore.upsert(), and the only ExecutionStore impl
(ClickHouseExecutionStore) threw UnsupportedOperationException from upsert
and upsertProcessors. The entire RouteExecution → upsert path was dead code
carrying four transitive dependencies (RouteExecution import, eventPublisher
wiring, body-size-limit config, searchIndexer::onExecutionUpdated hook).

Removed:
- cameleer-server-app/.../controller/ExecutionController.java (whole file)
- ExecutionStore.upsert + upsertProcessors (interface methods)
- ClickHouseExecutionStore.upsert + upsertProcessors (thrower overrides)
- IngestionService.ingestExecution + toExecutionRecord + flattenProcessors
  + hasAnyTraceData + truncateBody + toJson/toJsonObject helpers
- IngestionService constructor now takes (DiagramStore, WriteBuffer<Metrics>);
  dropped ExecutionStore + Consumer<ExecutionUpdatedEvent> + bodySizeLimit
- StorageBeanConfig.ingestionService(...) simplified accordingly

Untouched because still in use:
- ExecutionRecord / ProcessorRecord records (findById / findProcessors /
  SearchIndexer / DetailController)
- SearchIndexer (its onExecutionUpdated never fires now since no-one
  publishes ExecutionUpdatedEvent, but SearchIndexerStats is still
  referenced by ClickHouseAdminController — separate cleanup)
- TaggedExecution record has no remaining callers after this change —
  flagged in core-classes.md as a leftover; separate cleanup.

Rule docs updated:
- .claude/rules/app-classes.md: retired ExecutionController bullet, fixed
  stale URL for ChunkIngestionController (it owns /api/v1/data/executions,
  not /api/v1/ingestion/chunk/executions).
- .claude/rules/core-classes.md: IngestionService surface + note the dead
  TaggedExecution.

Full IT suite post-removal: 560 tests run, 11 F + 1 E — same 12 failures
in the same 3 previously-parked classes (AgentSseControllerIT / SseSigningIT
SSE-timing + ClickHouseStatsStoreIT timezone bug). No regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 22:50:51 +02:00
hsiegeln
fb54f9cbd2 fix(agent): revive DEAD agents on heartbeat (not just STALE)
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m5s
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / docker (push) Has been cancelled
Reproduction: pause a container long enough to cross both the stale
and dead thresholds, then unpause. The agent resumes sending heartbeats
but the server keeps it shown as DEAD. Only a full container restart
(which re-registers) fixes it.

Root cause: AgentRegistryService.heartbeat() only revived STALE → LIVE.
A DEAD agent's heartbeat updated lastHeartbeat but left state unchanged.
checkLifecycle() never downgrades DEAD either (no-op in that branch),
so the agent was permanently stuck in DEAD until a register() call.

Fix: extend the revival branch to also cover DEAD. Same process; a
heartbeat is proof of liveness regardless of the previous state.

Also: AgentLifecycleMonitor.mapTransitionEvent() now emits RECOVERED
for DEAD → LIVE, mirroring its behavior for STALE → LIVE, so the
lifecycle timeline captures the transition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 20:55:47 +02:00
hsiegeln
99b739d946 fix(alerts): backend hardening + complete ACKNOWLEDGED migration
- new AlertInstanceRepository.filterInEnvLive(ids, env): single-query bulk ID validation
- AlertController.inEnvLiveIds now one SQL round-trip instead of N
- bulkMarkRead SQL: defense-in-depth AND deleted_at IS NULL
- bulkAck SQL already had deleted_at IS NULL guard — no change needed
- PostgresAlertInstanceRepositoryIT: add filterInEnvLive_excludes_other_env_and_soft_deleted
- V12MigrationIT: remove alert_reads assertion (table dropped by V17)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 18:48:57 +02:00
hsiegeln
da2819332c feat(alerts): Postgres repo — read_at/deleted_at columns, filter params, new mutations
- save/rowMapper read+write read_at and deleted_at
- listForInbox: tri-state acked/read filters; always excludes deleted
- countUnreadBySeverity: rewire without alert_reads join, preserve zero-fill
- new: markRead/bulkMarkRead/softDelete/bulkSoftDelete/bulkAck/restore
- delete PostgresAlertReadRepository + its bean
- restore zero-fill Javadoc on interface
- mechanical compile-fixes in AlertController, InAppInboxQuery,
  AlertControllerIT, InAppInboxQueryTest; Task 6 owns the rewrite
- PostgresAlertReadRepositoryIT stubbed @Disabled; Task 7 owns migration

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:56:06 +02:00
hsiegeln
55b2a00458 feat(alerts): core repo — filter params + markRead/softDelete/bulkAck/restore; drop AlertReadRepository
- listForInbox gains tri-state acked/read filter params
- countUnreadBySeverityForUser(envId, userId) → countUnreadBySeverity(envId, userId, groupIds, roleNames)
- new methods: markRead, bulkMarkRead, softDelete, bulkSoftDelete, bulkAck, restore
- delete AlertReadRepository — read is now global on alert_instances.read_at

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:38:10 +02:00
hsiegeln
82e82350f9 refactor(alerts): drop ACKNOWLEDGED from AlertState, add readAt/deletedAt to AlertInstance
- AlertState: remove ACKNOWLEDGED case (V17 migration already dropped it from DB enum)
- AlertInstance: insert readAt + deletedAt Instant fields after lastNotifiedAt; add withReadAt/withDeletedAt withers; update all existing withers to pass both fields positionally
- AlertStateTransitions: add null,null for readAt/deletedAt in newInstance ctor call; collapse FIRING,ACKNOWLEDGED switch arm to just FIRING
- AlertScopeTest: update AlertState.values() assertion to 3 values; fix stale ConditionKind.hasSize(6) to 7 (JVM_METRIC was added earlier)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:12:37 +02:00
hsiegeln
414f7204bf feat(alerting): AGENT_LIFECYCLE condition kind with per-subject fire mode
Allows alert rules to fire on agent-lifecycle events — REGISTERED,
RE_REGISTERED, DEREGISTERED, WENT_STALE, WENT_DEAD, RECOVERED — rather
than only on current state. Each matching `(agent, eventType, timestamp)`
becomes its own ackable AlertInstance, so outages on distinct agents are
independently routable.

Core:
- New `ConditionKind.AGENT_LIFECYCLE` + `AgentLifecycleCondition` record
  (scope, eventTypes, withinSeconds). Compact ctor rejects empty
  eventTypes and withinSeconds<1.
- Strict allowlist enum `AgentLifecycleEventType` (six entries matching
  the server-emitted types in `AgentRegistrationController` and
  `AgentLifecycleMonitor`). Custom agent-emitted event types tracked in
  backlog issue #145.
- `AgentEventRepository.findInWindow(env, appSlug, agentId, eventTypes,
  from, to, limit)` — new read path ordered `(timestamp ASC, insert_id
  ASC)` used by the evaluator. Implemented on
  `ClickHouseAgentEventRepository` with tenant + env filter mandatory.

App:
- `AgentLifecycleEvaluator` queries events in the last `withinSeconds`
  window and returns `EvalResult.Batch` with one `Firing` per row.
  Every Firing carries a canonical `_subjectFingerprint` of
  `"<agentId>:<eventType>:<tsMillis>"` in context plus `agent` / `event`
  subtrees for Mustache templating.
- `NotificationContextBuilder` gains an `AGENT_LIFECYCLE` branch that
  exposes `{{agent.id}}`, `{{agent.app}}`, `{{event.type}}`,
  `{{event.timestamp}}`, `{{event.detail}}`.
- Validation is delegated to the record compact ctor + enum at Jackson
  deserialization time — matches the existing policy of keeping
  controller validators focused on env-scoped / SQL-injection concerns.

Schema:
- V16 migration generalises the V15 per-exchange discriminator on
  `alert_instances_open_rule_uq` to prefer `_subjectFingerprint` with a
  fallback to the legacy `exchange.id` expression. Scalar kinds still
  resolve to `''` and keep one-open-per-rule. Duplicate-key path in
  `PostgresAlertInstanceRepository.save` is unchanged — the index is
  the deduper.

UI:
- New `AgentLifecycleForm.tsx` wizard form with multi-select chips for
  the six allowed event types + `withinSeconds` input. Wired into
  `ConditionStep`, `form-state` (validation + defaults: WENT_DEAD,
  300 s), and `enums.ts` options. Tests in `enums.test.ts` pin the
  new option array.
- `alert-variables.ts` registers `{{agent.app}}`, `{{event.type}}`,
  `{{event.timestamp}}`, `{{event.detail}}` leaves for the new kind,
  and extends `agent.id`'s availability list to include `AGENT_LIFECYCLE`.

Tests (all passing):
- 5 new JSON-roundtrip cases on `AlertConditionJsonTest` (positive +
  empty/zero/unknown-type rejection).
- 5 new evaluator unit tests on `AgentLifecycleEvaluatorTest` (empty
  window, multi-agent fingerprint shape, scope forwarding, missing env).
- `NotificationContextBuilderTest` switch now covers the new kind.
- 119 alerting unit tests + 71 UI tests green.

Docs: `.claude/rules/{core,app,ui}` and CLAUDE.md migration list updated.
2026-04-21 14:52:08 +02:00
hsiegeln
f037d8c922 feat(alerting): server-side state+severity filters, ButtonGroup filter UI
Backend: `GET /environments/{envSlug}/alerts` now accepts optional multi-value
`state=…` and `severity=…` query params. Filters are pushed down to
PostgresAlertInstanceRepository, which appends `AND state::text = ANY(?)` /
`AND severity::text = ANY(?)` to the inbox query (null/empty = no filter).

`AlertInstanceRepository.listForInbox` gained a 7-arg overload; the old 5-arg
form is preserved as a default delegate so existing callers (evaluator,
AlertingFullLifecycleIT, PostgresAlertInstanceRepositoryIT) compile unchanged.
`InAppInboxQuery.listInbox` also has a new filtered overload.

UI: InboxPage severity filter migrated from `SegmentedTabs` (single-select,
no color cues) to `ButtonGroup` (multi-select with severity-coloured dots),
matching the topnavbar status-filter pattern. `useAlerts` forwards the
filters as query params and cache-keys on the filter tuple so each combo
is independently cached.

Unit + hook tests updated to the new contract (5 UI tests + 8 Java unit
tests passing). OpenAPI types regenerated from the fresh local backend.
2026-04-21 12:47:31 +02:00
hsiegeln
efa8390108 fix(alerting): reject null fireMode on ExchangeMatchCondition + repair in-flight rows
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m2s
CI / docker (push) Successful in 1m20s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
SonarQube / sonarqube (push) Successful in 5m31s
The rule editor wizard reset the condition payload on kind-change without
seeding a fireMode default; the ExchangeMatchCondition ctor allowed null to
pass through; AlertEvaluatorJob then NPE-looped every tick on a saved rule.

- core: compact ctor now rejects null fireMode (Jackson-deser path only — all
  production callers already pass a value).
- V14: repair existing EXCHANGE_MATCH rows with fireMode=null to
  PER_EXCHANGE + perExchangeLingerSeconds=300 (default matches the wizard).
- ui: ConditionStep.onKindChange seeds EXCHANGE_MATCH defaults so the
  Select's displayed fallback ("Per exchange") is actually in form state.
- ui: validateStep('condition', ...) now enforces fireMode presence + the
  mode-specific fields before the user reaches Review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:05:55 +02:00
hsiegeln
09b49f096c feat(alerting): per-severity breakdown on unread-count DTO
Spec §13 calls for the notification bell to colour-code by highest
unread severity (CRITICAL → error, WARNING → amber, INFO → muted).
The old { count } DTO forced the UI to pick one static colour, so
NotificationBell shipped with a TODO. Grow the contract instead:

  UnreadCountResponse = { total, bySeverity: { CRITICAL, WARNING, INFO } }

Guarantees:
- every severity is always present with a >=0 value (no undefined
  keys on the wire), so the UI can branch without defaults.
- total = sum of bySeverity values — kept explicit on the wire for
  cheap top-line display, not recomputed client-side.

Backend
- AlertInstanceRepository: replaces countUnreadForUser(long) with
  countUnreadBySeverityForUser returning Map<AlertSeverity, Long>.
  One SQL round-trip per (env, user) — GROUP BY ai.severity over the
  same NOT EXISTS(alert_reads) filter.
- UnreadCountResponse.from(Map) normalises and defensively copies;
  missing severities default to 0.
- InAppInboxQuery.countUnread now returns the DTO, caches the full
  response (still 5s TTL) so severity breakdown gets the same
  hit-rate as the total did before.
- AlertController just hands the DTO back.

Breaking change — no backwards-compat shim: the `count` field is
gone. UI and tests updated in the same commit; there are no other
API consumers in the tree.

Frontend
- Regenerated openapi.json + schema.d.ts against a fresh build of
  the new backend.
- NotificationBell branches badge colour on the highest unread
  severity (CRITICAL > WARNING > INFO) via new CSS variants.
- Tests cover all four paths: zero, critical-present, warning-only,
  info-only.

Tests: 7 unit tests + 12 ITs (incl. new grouping + empty-map)
       + 49 vitest (was 46; +3 severity-branch assertions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 18:15:56 +02:00
hsiegeln
424894a3e2 fix(alerting/I-1): retry endpoint resets attempts to 0 instead of incrementing
AlertNotificationRepository gains resetForRetry(UUID, Instant) which sets
attempts=0, status=PENDING, next_attempt_at=now, and clears claim/response
fields. AlertNotificationController calls resetForRetry instead of
scheduleRetry so a manual retry always starts from a clean slate.

AlertNotificationControllerIT adds retryResetsAttemptsToZero to verify
attempts==0 and status==PENDING after three prior markFailed calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:25:59 +02:00
hsiegeln
d74079da63 fix(alerting/B-2): implement re-notify cadence sweep and lastNotifiedAt tracking
AlertInstanceRepository gains listFiringDueForReNotify(Instant) — only returns
instances where last_notified_at IS NOT NULL and cadence has elapsed (IS NULL
branch excluded: sweep only re-notifies, initial notify is the dispatcher's job).

AlertEvaluatorJob.sweepReNotify() runs at the end of each tick, enqueues fresh
notifications for eligible instances and stamps last_notified_at.

NotificationDispatchJob stamps last_notified_at on the alert_instance when a
notification is DELIVERED, providing the anchor timestamp for cadence checks.

PostgresAlertInstanceRepositoryIT adds listFiringDueForReNotify test covering
the three-rule eligibility matrix (never-notified, long-ago, recent).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:25:50 +02:00
hsiegeln
f1abca3a45 refactor(alerting): rename P95_LATENCY_MS → AVG_DURATION_MS to match what stats_1m_route exposes
The evaluator mapped P95_LATENCY_MS to ExecutionStats.avgDurationMs because
stats_1m_route has no p95 column. Exposing the old name implied p95 semantics
operators did not get. Rename to AVG_DURATION_MS makes the contract honest.
Updated RouteMetric enum (with javadoc), evaluator switch, and admin guide.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 07:36:43 +02:00
hsiegeln
bf178ba141 fix(alerting): populate AlertInstance.rule_snapshot so history survives rule delete
- Add withRuleSnapshot(Map) wither to AlertInstance (same pattern as other withers)
- Call snapshotRule(rule) + withRuleSnapshot in both applyResult (single-firing) and
  applyBatchFiring paths so every persisted instance carries a non-empty JSONB snapshot
- Strip null values from the Jackson-serialized map before wrapping in the immutable
  snapshot so Map.copyOf in the compact ctor does not throw NPE on nullable rule fields
- Add ruleSnapshotIsPersistedOnInstanceCreation IT: asserts name/severity/conditionKind
  appear in the rule_snapshot column after a tick fires an instance
- Add historySurvivesRuleDelete IT: fires an instance, deletes the rule, asserts
  rule_id IS NULL and rule_snapshot still contains the rule name (spec §5 guarantee)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 20:09:28 +02:00
hsiegeln
657dc2d407 feat(alerting): AlertingProperties + AlertStateTransitions state machine
- AlertingProperties @ConfigurationProperties with effective*() accessors and
  5000 ms floor clamp on evaluatorTickIntervalMs; warn logged at startup
- AlertStateTransitions pure static state machine: Clear/Firing/Batch/Error
  branches, PENDING→FIRING promotion on forDuration elapsed; Batch delegated
  to job
- AlertInstance wither helpers: withState, withFiredAt, withResolvedAt, withAck,
  withSilenced, withTitleMessage, withLastNotifiedAt, withContext
- AlertingBeanConfig gains @EnableConfigurationProperties(AlertingProperties),
  alertingInstanceId bean (hostname:pid), alertingClock bean,
  PerKindCircuitBreaker bean wired from props
- 12 unit tests in AlertStateTransitionsTest covering all transitions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 19:58:12 +02:00
hsiegeln
c53f642838 chore(alerting): add jmustache 1.16
Declared in cameleer-server-core pom (canonical location for unit-testable
rendering without Spring) and mirrored in cameleer-server-app pom so the
app module compiles standalone without a full reactor install.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 19:26:57 +02:00
hsiegeln
7b79d3aa64 feat(alerting): countExecutionsForAlerting for exchange-match evaluator
Adds AlertMatchSpec record (core) and ClickHouseSearchIndex.countExecutionsForAlerting —
no FINAL, no text subqueries. Filters by tenant, env, app, route, status, time window,
and optional after-cursor. Attributes (JSON string column) use inlined JSONExtractString
key literals since ClickHouse JDBC does not bind ? placeholders inside JSON functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 19:18:49 +02:00
hsiegeln
45028de1db feat(alerting): Postgres repository for alert_instances with inbox queries
Implements AlertInstanceRepository: save (upsert), findById, findOpenForRule,
listForInbox (3-way OR: user/group/role via && array-overlap + ANY), countUnreadForUser
(LEFT JOIN alert_reads), ack, resolve, markSilenced, deleteResolvedBefore.
Integration test covers all 9 scenarios including inbox fan-out across all
three target types. Also adds @JsonIgnoreProperties(ignoreUnknown=true) to
SilenceMatcher to suppress Jackson serializing isWildcard() as a round-trip field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 19:04:51 +02:00
hsiegeln
1ff256dce0 feat(alerting): core repository interfaces 2026-04-19 18:43:36 +02:00
hsiegeln
e7a9042677 feat(alerting): core domain records (rule, instance, silence, notification) 2026-04-19 18:43:03 +02:00
hsiegeln
56a7b6de7d feat(alerting): sealed AlertCondition hierarchy with Jackson deduction 2026-04-19 18:42:04 +02:00
hsiegeln
530bc32040 feat(alerting): core enums + AlertScope 2026-04-19 18:36:29 +02:00
hsiegeln
5103dc91be feat(alerting): add ALERT_RULE_CHANGE + ALERT_SILENCE_CHANGE audit categories 2026-04-19 18:34:08 +02:00
hsiegeln
ea4c56e7f6 feat(outbound): admin CRUD REST + RBAC + audit
New audit categories: OUTBOUND_CONNECTION_CHANGE, OUTBOUND_HTTP_TRUST_CHANGE.
Controller-level @PreAuthorize defaults to ADMIN; GETs relaxed to ADMIN|OPERATOR.
SecurityConfig permits OPERATOR GETs on /api/v1/admin/outbound-connections/**.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:43:48 +02:00
hsiegeln
380ccb102b fix(outbound): align user FK with users(user_id) TEXT schema
V11 migration referenced users(id) as uuid, but V1 users table has
user_id as TEXT primary key. Amending V11 and the OutboundConnection
record before Task 7's integration tests catch this at Flyway startup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:18:12 +02:00
hsiegeln
46b8f63fd1 feat(outbound): core domain records for outbound connections
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:10:17 +02:00
hsiegeln
2224f7d902 feat(http): core outbound HTTP interfaces and property records 2026-04-19 15:39:57 +02:00
hsiegeln
6d3956935d refactor(events): remove dead non-paginated query path
AgentEventService.queryEvents, AgentEventRepository.query, and the
ClickHouse implementation have had no callers since /agents/events
became cursor-paginated. Remove them along with their dedicated IT
tests. queryPage and its tests remain as the single query path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 13:16:28 +02:00
hsiegeln
67a834153e feat(events): add AgentEventPage + queryPage interface
Introduces cursor-paginated query on AgentEventRepository. The cursor
format is owned by the implementation. The existing non-paginated
query(...) is kept for internal consumers.
2026-04-17 11:52:42 +02:00
hsiegeln
769752a327 feat(logs): widen source filter to multi-value OR list
Replaces LogSearchRequest.source (String) with sources (List<String>)
and emits 'source IN (...)' when non-empty. LogQueryController parses
?source=a,b,c the same way it parses ?level=a,b,c.
2026-04-17 11:48:10 +02:00
hsiegeln
62dd71b860 fix: stamp environment on agent_events rows
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m28s
CI / docker (push) Successful in 1m13s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 43s
The agent_events table has an `environment` column and AgentEventsController
filters on it, but the INSERT never populated it — every row got the
column default ('default'). Result: Timeline on the Application Runtime
page was empty whenever the user's selected env was anything other than
'default'.

Thread env through the write path:
- AgentEventRepository.insert + AgentEventService.recordEvent gain an
  `environment` param; delete the no-env query overload (unused).
- ClickHouseAgentEventRepository.insert writes the column (falls back to
  'default' on null to match column DEFAULT).
- All 5 callers source env from the agent registry (AgentInfo.environmentId)
  or the registration request body; AgentLifecycleMonitor, deregister,
  command ack, event ingestion, register/re-register.
- Integration test updated for the new signatures.

Pre-existing rows in deployed CH will still report environment='default'.
New events from this build forward will carry the correct env. Backfill
(UPDATE ... FROM apps) is left as a manual DB step if historical timeline
is needed for non-default envs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 10:30:56 +02:00
hsiegeln
d02fa73080 fix: scope correlation-chain query to the exchange's own env
Correlated exchanges always share the env of the one being viewed —
using the globally-selected env from the picker was wrong if the user
switched envs after opening a detail view (or arrived via permalink).

Thread `environment` through:
- `ExecutionStore.ExecutionRecord` gains `environment` field; the
  ClickHouse `executions` table already stores this, just not read back.
- `ClickHouseExecutionStore.findById` SELECT adds the column; mapper
  populates it.
- `ExecutionDetail` gains `environment`; `DetailService` passes through.
- `IngestionService.toExecutionRecord` passes null — this legacy PG
  ingestion path isn't active when ClickHouse is enabled, and the
  read-side is what drives the correlation UI.
- UI `ExchangeHeader` reads `detail.environment ?? storeEnv` and
  extends the TS type locally (schema.d.ts catches up on next regen).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 10:19:42 +02:00
hsiegeln
6d9e456b97 feat!: move apps & deployments under /api/v1/environments/{envSlug}/apps/{appSlug}/...
P3B of the taxonomy migration. App and deployment routes are now
env-scoped in the URL itself, making the (env, app_slug) uniqueness
key explicit. Previously /api/v1/apps/{appSlug} was ambiguous: with
the same app deployed to multiple environments (dev/staging/prod),
the handler called AppService.getBySlug(slug) which returns the
first row matching slug regardless of env.

Server:
- AppController: @RequestMapping("/api/v1/environments/{envSlug}/
  apps"). Every handler now calls
  appService.getByEnvironmentAndSlug(env.id(), appSlug) — 404 if the
  app doesn't exist in *this* env. CreateAppRequest body drops
  environmentId (it's in the path).
- DeploymentController: @RequestMapping("/api/v1/environments/
  {envSlug}/apps/{appSlug}/deployments"). DeployRequest body drops
  environmentId. PromoteRequest body switches from
  targetEnvironmentId (UUID) to targetEnvironment (slug);
  promote handler resolves the target env by slug and looks up the
  app with the same slug in the target env (fails with 404 if the
  target app doesn't exist yet — apps must exist in both source
  and target before promote).
- AppService: added getByEnvironmentAndSlug helper; createApp now
  validates slug against ^[a-z0-9][a-z0-9-]{0,63}$ (400 on
  invalid).

SPA:
- queries/admin/apps.ts: rewritten. Hooks take envSlug where
  env-scoped. Removed useAllApps (no flat endpoint). Renamed path
  param naming: appId → appSlug throughout. Added
  usePromoteDeployment. Query keys include envSlug so cache is
  env-scoped.
- AppsTab.tsx: call sites updated. When no environment is selected,
  the managed-app list is empty — cross-env discovery lives in the
  Runtime tab (catalog). handleDeploy/handleStop/etc. pass envSlug
  to the new hook signatures.

BREAKING CHANGE: /api/v1/apps/** paths removed. Clients must use
/api/v1/environments/{envSlug}/apps/{appSlug}/**. Request bodies
for POST /apps and POST /apps/{slug}/deployments no longer accept
environmentId (use the URL path instead). Promote body uses slug
not UUID.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 23:38:37 +02:00
hsiegeln
6b5ee10944 feat!: environment admin URLs use slug; validate and immutabilize slug
UUID-based admin paths were the only remaining UUID-in-URL pattern in
the API. Migrates /api/v1/admin/environments/{id} to /{envSlug} so
slugs are the single environment identifier in every URL. UUIDs stay
internal to the database.

- Controller: @PathVariable UUID id → @PathVariable String envSlug on
  get/update/delete and the two nested endpoints (default-container-
  config, jar-retention). Handlers resolve slug → Environment via
  EnvironmentService.getBySlug, then delegate to existing UUID-based
  service methods.
- Service: create() now validates slug against ^[a-z0-9][a-z0-9-]{0,63}$
  and returns 400 on invalid slugs. Rationale documented in the class:
  slugs are immutable after creation because they appear in URLs,
  Docker network names, container names, and ClickHouse partition keys.
- UpdateEnvironmentRequest has no slug field and Jackson's default
  ignore-unknown behavior drops any slug supplied in a PUT body;
  regression test (updateEnvironment_withSlugInBody_ignoresSlug)
  documents this invariant.
- SPA: mutation args change from { id } to { slug }. EnvironmentsPage
  still uses env.id for local selection state (UUID from DB) but
  passes env.slug to every mutation.

BREAKING CHANGE: /api/v1/admin/environments/{id:UUID}/... paths removed.
Clients must use /{envSlug}/... (slug from the environments list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 23:23:31 +02:00
hsiegeln
fcb53dd010 fix!: require environment on diagram lookup and attribute keys queries
Closes two cross-env data leakage paths. Both endpoints previously
returned data aggregated across all environments, so a diagram or
attribute key from dev could appear in a prod UI query (and vice versa).

B1: GET /api/v1/diagrams?application=&routeId= now requires
?environment= and resolves agents via
registryService.findByApplicationAndEnvironment instead of
findByApplication. Prevents serving a dev diagram for a prod route.

B2: GET /api/v1/search/attributes/keys now requires ?environment=.
SearchIndex.distinctAttributeKeys gains an environment parameter and
the ClickHouse query adds the env filter alongside the existing
tenant_id filter. Prevents prod attribute names leaking into dev
autocompletion (and vice versa).

SPA hooks updated to thread environment through from
useEnvironmentStore; query keys include environment so React Query
re-fetches on env switch. No call-site changes needed — hook
signatures unchanged.

B3 (AgentMetricsController env scope) deferred to P3C: agent-env is
effectively 1:1 today via the instance_id naming
({envSlug}-{appSlug}-{replicaIndex}), and the URL migration in P3C
to /api/v1/environments/{env}/agents/{agentId}/metrics naturally
introduces env from path. A minimal P1 fix would regress the "view
metrics of a killed agent" case.

BREAKING CHANGE: Both endpoints now require ?environment= (slug).
Clients omitting the parameter receive 400.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 23:19:55 +02:00
hsiegeln
9b1ef51d77 feat!: scope per-app config and settings by environment
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m27s
CI / docker (push) Successful in 1m10s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 1m40s
SonarQube / sonarqube (push) Successful in 4m29s
BREAKING: wipe dev PostgreSQL before deploying — V1 checksum changes.
Agents must now send environmentId on registration (400 if missing).

Two tables previously keyed on app name alone caused cross-environment
data bleed: writing config for (app=X, env=dev) would overwrite the row
used by (app=X, env=prod) agents, and agent startup fetches ignored env
entirely.

- V1 schema: application_config and app_settings are now PK (app, env).
- Repositories: env-keyed finders/saves; env is the authoritative column,
  stamped on the stored JSON so the row agrees with itself.
- ApplicationConfigController.getConfig is dual-mode — AGENT role uses
  JWT env claim (agents cannot spoof env); non-agent callers provide env
  via ?environment= query param.
- AppSettingsController endpoints now require ?environment=.
- SensitiveKeysAdminController fan-out iterates (app, env) slices so each
  env gets its own merged keys.
- DiagramController ingestion stamps env on TaggedDiagram; ClickHouse
  route_diagrams INSERT + findProcessorRouteMapping are env-scoped.
- AgentRegistrationController: environmentId is required on register;
  removed all "default" fallbacks from register/refresh/heartbeat auto-heal.
- UI hooks (useApplicationConfig, useProcessorRouteMapping, useAppSettings,
  useAllAppSettings, useUpdateAppSettings) take env, wired to
  useEnvironmentStore at all call sites.
- New ConfigEnvIsolationIT covers env-isolation for both repositories.

Plan in docs/superpowers/plans/2026-04-16-environment-scoping.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 22:25:21 +02:00
hsiegeln
e2d9428dff fix: drop stale instance_id filter from search and scope route stats by app
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m28s
CI / docker (push) Successful in 1m11s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 42s
The exchange search silently filtered by the in-memory agent registry's
current instance IDs on top of application_id. Historical exchanges written
by previous agent instances (or any instance not currently registered, e.g.
after a server restart before agents heartbeat back) were hidden from
results even though they matched the application filter.

Fix: drop the applicationId -> instanceIds resolution in SearchController.
Rely on application_id = ? in ClickHouseSearchIndex; keep explicit
instanceIds filtering only when a client passes them.

Related cleanup: the agentIds parameter on StatsStore.statsForRoute /
timeseriesForRoute was silently discarded inside ClickHouseStatsStore, so
per-route stats aggregated across any apps sharing a routeId. Replace with
String applicationId and add application_id to the stats_1m_route filters
so per-route stats are correctly scoped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 19:49:55 +02:00
hsiegeln
4ee43bab95 fix: include headers and properties in has_trace_data detection
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m28s
CI / docker (push) Successful in 1m10s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
IngestionService.hasAnyTraceData() and ChunkAccumulator only checked
for inputBody/outputBody when setting has_trace_data on executions.
Headers and properties captured via deep tracing were not considered,
causing the trace data indicator to be missing in the exchange list.

DetailService already checked all six fields — this aligns the
ingestion path to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:26:15 +02:00
hsiegeln
887a9b6faa feat: add RouteCatalogStore interface and RouteCatalogEntry record 2026-04-16 18:45:42 +02:00
hsiegeln
78396a2796 fix: sidebar route selection and missing routes after server restart
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m20s
CI / docker (push) Successful in 1m10s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
Two sidebar bugs fixed:

1. Route entries never highlighted on navigation because sidebar-utils
   generated /apps/ paths for route children while effectiveSelectedPath
   normalizes to /exchanges/. The design system does exact string matching.

2. Routes disappeared from sidebar when agents had no recent exchange
   data. Heartbeat carried routeStates (with route IDs as keys) but
   AgentRegistryService.heartbeat() never updated AgentInfo.routeIds.
   After server restart, auto-heal registered agents with empty routes,
   leaving ClickHouse (24h window) as the only discovery source.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:42:01 +02:00
hsiegeln
859cf7c10d fix: support pre-3.2 Spring Boot JARs in runtime entrypoint
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m39s
CI / docker (push) Successful in 1m38s
CI / deploy (push) Successful in 47s
CI / deploy-feature (push) Has been skipped
RuntimeDetector now derives the correct PropertiesLauncher FQN from
the JAR manifest Main-Class package. Spring Boot 3.2+ uses
org.springframework.boot.loader.launch.PropertiesLauncher, pre-3.2
uses org.springframework.boot.loader.PropertiesLauncher.

DockerRuntimeOrchestrator uses the detected class instead of a
hardcoded 3.2+ reference, falling back to 3.2+ when not auto-detected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 23:21:01 +02:00
hsiegeln
cb3ebfea7c chore: rename cameleer3 to cameleer
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Failing after 18s
CI / docker (push) Has been skipped
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Has been skipped
Rename Java packages from com.cameleer3 to com.cameleer, module
directories from cameleer3-* to cameleer-*, and all references
throughout workflows, Dockerfiles, docs, migrations, and pom.xml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:28:42 +02:00