cameleer-server

Author	SHA1	Message	Date
hsiegeln	c6aef5ab35	fix(deploy): Checkpoints — preserve STOPPED history, fix filter + placement - Backend: rename deleteTerminalByAppAndEnvironment → deleteFailedByAppAndEnvironment. STOPPED rows were being wiped on every redeploy, so Checkpoints was always empty. Now only FAILED rows are pruned; STOPPED deployments are retained as restorable checkpoints (they still carry deployed_config_snapshot from their RUNNING window). - UI filter: any deployment with a snapshot is a checkpoint (was RUNNING|DEGRADED only, which excluded the main case — the previous blue/green deployment now in STOPPED). - UI placement: Checkpoints disclosure now renders inside IdentitySection, matching the design spec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 10:26:46 +02:00
hsiegeln	5304c8ee01	core(deploy): DeploymentStrategy enum with safe wire conversion Typed enum (BLUE_GREEN, ROLLING) with fromWire/toWire kebab-case translation. fromWire falls back to BLUE_GREEN for unknown or null input so the executor dispatch site never null-checks and no misconfigured container-config can throw at runtime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 09:42:35 +02:00
hsiegeln	f8dccaae2b	fix(deploy): stop previous active deployment before START_REPLICAS (fixes 409) Container names are deterministic: {tenant}-{envSlug}-{appSlug}-{replica}. The prior code did the stop-existing step at SWAP_TRAFFIC, after START_REPLICAS had already tried to create containers with the same names — so a redeploy against a RUNNING app consistently failed with Docker 409 "container name already in use". Move the stop-existing block to run right after CREATE_NETWORK and before START_REPLICAS. SWAP_TRAFFIC becomes a label-only marker (traffic is swapped implicitly by Traefik labels once new replicas are healthy). Also: add `findActiveByAppIdAndEnvironmentIdExcluding` so the SQL excludes the current deployment by id — previously the Java-side `!id.equals(me)` guard failed because the newly-inserted row has status=STARTING (DB default) and ORDER BY created_at DESC LIMIT 1 picked the new row, hiding the actual previous deployment. Trade-off: this is destroy-then-start rather than true blue/green — brief downtime during the swap. Matches the pre-unified-page behavior and is what users reasonably expect. True blue/green would require per-deployment container names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 01:01:00 +02:00
hsiegeln	d33c039a17	fix(deploy): address final review — sensitiveKeys snapshot, dirty scrubbing, transition race, refetch invalidations - Issue 1: add List<String> sensitiveKeys as 4th field to DeploymentConfigSnapshot; populate from agentConfig.getSensitiveKeys() in DeploymentExecutor; handleRestore hydrates from snap.sensitiveKeys directly; Deployment type in apps.ts gains sensitiveKeys field - Issue 2: after createApp succeeds, refetchQueries(['apps', envSlug]) before navigate so the new app is in cache before the router renders the deployed view (eliminates transient Save-disabled flash) - Issue 3: useDeploymentPageState useEffect now uses prevServerStateRef to detect local edits; background refetches only overwrite form when no local changes are present - Issue 5: handleRedeploy invalidates dirty-state + versions queries after createDeployment resolves; handleSave invalidates dirty-state after staged save - Issue 10: DirtyStateCalculator strips volatile agentConfig keys (version, updatedAt, updatedBy, environment, application) before JSON comparison via scrubAgentConfig(); adds versionBumpDoesNotMarkDirty test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 23:29:01 +02:00
hsiegeln	97f25b4c7e	test(deploy): register JavaTimeModule in DirtyStateCalculator unit test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:38:57 +02:00
hsiegeln	6591f2fde3	api(apps): GET /apps/{slug}/dirty-state returns desired-vs-deployed diff Wires DirtyStateCalculator behind an HTTP endpoint on AppController. Adds findLatestSuccessfulByAppAndEnv to PostgresDeploymentRepository, registers DirtyStateCalculator as a Spring bean (with ObjectMapper for JavaTimeModule support), and covers all three scenarios with IT. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:35:35 +02:00
hsiegeln	24464c0772	core(deploy): recurse into nested diffs + unquote scalar values in DirtyStateCalculator - compareJson now recurses when both nodes are ObjectNode, so nested maps (tracedProcessors, routeRecording, routeSamplingRates) produce deep paths like agentConfig.tracedProcessors.proc-1 instead of a blob diff - Extract nodeToString helper: value nodes use asText() (strips JSON quotes), null becomes "(none)", arrays/objects get compact JSON - Apply nodeToString in both diff-emission paths (top-level mismatch + leaf) - Add three new tests: nullAgentConfigInSnapshot, nestedAgentField_reportsDeepPath, stringField_differenceValueIsUnquoted (8 tests total, all pass) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:25:04 +02:00
hsiegeln	e4ccce1e3b	core(deploy): add DirtyStateCalculator + DirtyStateResult Pure-logic dirty-state detection: compares desired JAR + agent config + container config against the DeploymentConfigSnapshot from the last successful deployment. Returns a structured DirtyStateResult with per-field differences. 5 unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 22:20:49 +02:00
hsiegeln	7f9cfc7f18	core(deploy): add deployedConfigSnapshot field to Deployment model Appends DeploymentConfigSnapshot deployedConfigSnapshot to the Deployment record and adds a matching withDeployedConfigSnapshot wither. All positional call sites (repository mapper, test fixture) updated to pass null; Task 1.4 will wire real persistence and Task 1.5 will populate the field on RUNNING transition. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:31:48 +02:00
hsiegeln	06fa7d832f	core(deploy): type jarVersionId as UUID (match domain convention) All other FKs to app_versions.id (e.g. Deployment.appVersionId) use UUID; DeploymentConfigSnapshot.jarVersionId was incorrectly typed as String. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:29:26 +02:00
hsiegeln	d580b6e90c	core(deploy): add DeploymentConfigSnapshot record Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 21:26:30 +02:00
hsiegeln	c2eab71a31	env(admin): per-environment color field + V2 migration - V2__add_environment_color.sql adds a CHECK-constrained VARCHAR color column (default 'slate'); existing rows backfill to slate. - Environment record + EnvironmentColor constants (8 preset values) flow through repository, service, and admin API. - UpdateEnvironmentRequest.color nullable: null preserves existing; unknown values → 400. - ITs cover valid / invalid / null-preserves behaviour; existing Environment constructor call-sites updated with the new color arg. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 19:24:30 +02:00
hsiegeln	e483e52eee	alerting(core): drop unused perExchangeLingerSeconds from ExchangeMatchCondition Dead field — was enforced by compact ctor as required for PER_EXCHANGE, but never read anywhere in the codebase. Removal tightens the API surface and is precondition for the Task 3.3 cross-field validator. Pre-prod; no shim / migration.	2026-04-22 17:10:53 +02:00
hsiegeln	0bad014811	core(alerting): AlertRule.withEvalState wither for cursor threading	2026-04-22 16:04:55 +02:00
hsiegeln	b41f34c090	search: SearchRequest.afterExecutionId — composite (startTime, execId) predicate Adds an optional afterExecutionId field to SearchRequest. When combined with a non-null timeFrom, ClickHouseSearchIndex applies a strictly-after tuple predicate (start_time > ts OR (start_time = ts AND execution_id > id)) so same-millisecond exchanges can be consumed exactly once across ticks. When afterExecutionId is null, timeFrom keeps its existing >= semantics — no behaviour change for any current caller. Also adds the SearchRequest.withCursor(ts, id) wither. Threads the field through existing withInstanceIds / withEnvironment withers. All existing positional call-sites (SearchController, ExchangeMatchEvaluator, ClickHouseSearchIndexIT, ClickHouseChunkPipelineIT) pass null for the new slot. Task 1.2 of docs/superpowers/plans/2026-04-22-per-exchange-exactly-once.md. The evaluator-side wiring that actually supplies the cursor is Task 1.5.	2026-04-22 15:49:05 +02:00
hsiegeln	06c6f53bbc	refactor(ingestion): remove unused TaggedExecution record No callers after the legacy PG ingestion path was retired in `0f635576`. core-classes.md updated to drop the leftover note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:33:26 +02:00
hsiegeln	98cbf8f3fc	refactor(search): drop dead SearchIndexer subsystem After the ExecutionController removal (`0f635576`), SearchIndexer subscribed to ExecutionUpdatedEvent but nothing publishes that event. Every SearchIndexerStats metric returned always-zero, and the admin /api/v1/admin/clickhouse/pipeline endpoint that surfaced those stats carried no signal. Backend removed: - core: SearchIndexer, SearchIndexerStats, ExecutionUpdatedEvent - app: IndexerPipelineResponse DTO, /pipeline endpoint on ClickHouseAdminController (field + ctor param) - StorageBeanConfig.searchIndexer bean UI removed: - IndexerPipeline type + useIndexerPipeline hook in api/queries/admin/clickhouse.ts - Indexer Pipeline card in ClickHouseAdminPage.tsx (plus ProgressBar import and pipeline* CSS classes) OpenAPI schema.d.ts + openapi.json regenerated (stale /pipeline path and IndexerPipelineResponse schema removed). SearchIndex interface + ClickHouseSearchIndex impl kept — those are live and used by SearchService + ExchangeMatchEvaluator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 23:32:49 +02:00
hsiegeln	0f635576a3	refactor(ingestion): drop dead legacy execution-ingestion path ExecutionController was @ConditionalOnMissingBean(ChunkAccumulator.class), and ChunkAccumulator is registered unconditionally — the legacy controller never bound in any profile. Even if it had, IngestionService.ingestExecution called executionStore.upsert(), and the only ExecutionStore impl (ClickHouseExecutionStore) threw UnsupportedOperationException from upsert and upsertProcessors. The entire RouteExecution → upsert path was dead code carrying four transitive dependencies (RouteExecution import, eventPublisher wiring, body-size-limit config, searchIndexer::onExecutionUpdated hook). Removed: - cameleer-server-app/.../controller/ExecutionController.java (whole file) - ExecutionStore.upsert + upsertProcessors (interface methods) - ClickHouseExecutionStore.upsert + upsertProcessors (thrower overrides) - IngestionService.ingestExecution + toExecutionRecord + flattenProcessors + hasAnyTraceData + truncateBody + toJson/toJsonObject helpers - IngestionService constructor now takes (DiagramStore, WriteBuffer<Metrics>); dropped ExecutionStore + Consumer<ExecutionUpdatedEvent> + bodySizeLimit - StorageBeanConfig.ingestionService(...) simplified accordingly Untouched because still in use: - ExecutionRecord / ProcessorRecord records (findById / findProcessors / SearchIndexer / DetailController) - SearchIndexer (its onExecutionUpdated never fires now since no-one publishes ExecutionUpdatedEvent, but SearchIndexerStats is still referenced by ClickHouseAdminController — separate cleanup) - TaggedExecution record has no remaining callers after this change — flagged in core-classes.md as a leftover; separate cleanup. Rule docs updated: - .claude/rules/app-classes.md: retired ExecutionController bullet, fixed stale URL for ChunkIngestionController (it owns /api/v1/data/executions, not /api/v1/ingestion/chunk/executions). - .claude/rules/core-classes.md: IngestionService surface + note the dead TaggedExecution. Full IT suite post-removal: 560 tests run, 11 F + 1 E — same 12 failures in the same 3 previously-parked classes (AgentSseControllerIT / SseSigningIT SSE-timing + ClickHouseStatsStoreIT timezone bug). No regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 22:50:51 +02:00
hsiegeln	fb54f9cbd2	fix(agent): revive DEAD agents on heartbeat (not just STALE) Reproduction: pause a container long enough to cross both the stale and dead thresholds, then unpause. The agent resumes sending heartbeats but the server keeps it shown as DEAD. Only a full container restart (which re-registers) fixes it. Root cause: AgentRegistryService.heartbeat() only revived STALE → LIVE. A DEAD agent's heartbeat updated lastHeartbeat but left state unchanged. checkLifecycle() never downgrades DEAD either (no-op in that branch), so the agent was permanently stuck in DEAD until a register() call. Fix: extend the revival branch to also cover DEAD. Same process; a heartbeat is proof of liveness regardless of the previous state. Also: AgentLifecycleMonitor.mapTransitionEvent() now emits RECOVERED for DEAD → LIVE, mirroring its behavior for STALE → LIVE, so the lifecycle timeline captures the transition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:55:47 +02:00
hsiegeln	99b739d946	fix(alerts): backend hardening + complete ACKNOWLEDGED migration - new AlertInstanceRepository.filterInEnvLive(ids, env): single-query bulk ID validation - AlertController.inEnvLiveIds now one SQL round-trip instead of N - bulkMarkRead SQL: defense-in-depth AND deleted_at IS NULL - bulkAck SQL already had deleted_at IS NULL guard — no change needed - PostgresAlertInstanceRepositoryIT: add filterInEnvLive_excludes_other_env_and_soft_deleted - V12MigrationIT: remove alert_reads assertion (table dropped by V17) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:48:57 +02:00
hsiegeln	da2819332c	feat(alerts): Postgres repo — read_at/deleted_at columns, filter params, new mutations - save/rowMapper read+write read_at and deleted_at - listForInbox: tri-state acked/read filters; always excludes deleted - countUnreadBySeverity: rewire without alert_reads join, preserve zero-fill - new: markRead/bulkMarkRead/softDelete/bulkSoftDelete/bulkAck/restore - delete PostgresAlertReadRepository + its bean - restore zero-fill Javadoc on interface - mechanical compile-fixes in AlertController, InAppInboxQuery, AlertControllerIT, InAppInboxQueryTest; Task 6 owns the rewrite - PostgresAlertReadRepositoryIT stubbed @Disabled; Task 7 owns migration Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:56:06 +02:00
hsiegeln	55b2a00458	feat(alerts): core repo — filter params + markRead/softDelete/bulkAck/restore; drop AlertReadRepository - listForInbox gains tri-state acked/read filter params - countUnreadBySeverityForUser(envId, userId) → countUnreadBySeverity(envId, userId, groupIds, roleNames) - new methods: markRead, bulkMarkRead, softDelete, bulkSoftDelete, bulkAck, restore - delete AlertReadRepository — read is now global on alert_instances.read_at Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:38:10 +02:00
hsiegeln	82e82350f9	refactor(alerts): drop ACKNOWLEDGED from AlertState, add readAt/deletedAt to AlertInstance - AlertState: remove ACKNOWLEDGED case (V17 migration already dropped it from DB enum) - AlertInstance: insert readAt + deletedAt Instant fields after lastNotifiedAt; add withReadAt/withDeletedAt withers; update all existing withers to pass both fields positionally - AlertStateTransitions: add null,null for readAt/deletedAt in newInstance ctor call; collapse FIRING,ACKNOWLEDGED switch arm to just FIRING - AlertScopeTest: update AlertState.values() assertion to 3 values; fix stale ConditionKind.hasSize(6) to 7 (JVM_METRIC was added earlier) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:12:37 +02:00
hsiegeln	414f7204bf	feat(alerting): AGENT_LIFECYCLE condition kind with per-subject fire mode Allows alert rules to fire on agent-lifecycle events — REGISTERED, RE_REGISTERED, DEREGISTERED, WENT_STALE, WENT_DEAD, RECOVERED — rather than only on current state. Each matching `(agent, eventType, timestamp)` becomes its own ackable AlertInstance, so outages on distinct agents are independently routable. Core: - New `ConditionKind.AGENT_LIFECYCLE` + `AgentLifecycleCondition` record (scope, eventTypes, withinSeconds). Compact ctor rejects empty eventTypes and withinSeconds<1. - Strict allowlist enum `AgentLifecycleEventType` (six entries matching the server-emitted types in `AgentRegistrationController` and `AgentLifecycleMonitor`). Custom agent-emitted event types tracked in backlog issue #145. - `AgentEventRepository.findInWindow(env, appSlug, agentId, eventTypes, from, to, limit)` — new read path ordered `(timestamp ASC, insert_id ASC)` used by the evaluator. Implemented on `ClickHouseAgentEventRepository` with tenant + env filter mandatory. App: - `AgentLifecycleEvaluator` queries events in the last `withinSeconds` window and returns `EvalResult.Batch` with one `Firing` per row. Every Firing carries a canonical `_subjectFingerprint` of `"<agentId>:<eventType>:<tsMillis>"` in context plus `agent` / `event` subtrees for Mustache templating. - `NotificationContextBuilder` gains an `AGENT_LIFECYCLE` branch that exposes `{{agent.id}}`, `{{agent.app}}`, `{{event.type}}`, `{{event.timestamp}}`, `{{event.detail}}`. - Validation is delegated to the record compact ctor + enum at Jackson deserialization time — matches the existing policy of keeping controller validators focused on env-scoped / SQL-injection concerns. Schema: - V16 migration generalises the V15 per-exchange discriminator on `alert_instances_open_rule_uq` to prefer `_subjectFingerprint` with a fallback to the legacy `exchange.id` expression. Scalar kinds still resolve to `''` and keep one-open-per-rule. Duplicate-key path in `PostgresAlertInstanceRepository.save` is unchanged — the index is the deduper. UI: - New `AgentLifecycleForm.tsx` wizard form with multi-select chips for the six allowed event types + `withinSeconds` input. Wired into `ConditionStep`, `form-state` (validation + defaults: WENT_DEAD, 300 s), and `enums.ts` options. Tests in `enums.test.ts` pin the new option array. - `alert-variables.ts` registers `{{agent.app}}`, `{{event.type}}`, `{{event.timestamp}}`, `{{event.detail}}` leaves for the new kind, and extends `agent.id`'s availability list to include `AGENT_LIFECYCLE`. Tests (all passing): - 5 new JSON-roundtrip cases on `AlertConditionJsonTest` (positive + empty/zero/unknown-type rejection). - 5 new evaluator unit tests on `AgentLifecycleEvaluatorTest` (empty window, multi-agent fingerprint shape, scope forwarding, missing env). - `NotificationContextBuilderTest` switch now covers the new kind. - 119 alerting unit tests + 71 UI tests green. Docs: `.claude/rules/{core,app,ui}` and CLAUDE.md migration list updated.	2026-04-21 14:52:08 +02:00
hsiegeln	f037d8c922	feat(alerting): server-side state+severity filters, ButtonGroup filter UI Backend: `GET /environments/{envSlug}/alerts` now accepts optional multi-value `state=…` and `severity=…` query params. Filters are pushed down to PostgresAlertInstanceRepository, which appends `AND state::text = ANY(?)` / `AND severity::text = ANY(?)` to the inbox query (null/empty = no filter). `AlertInstanceRepository.listForInbox` gained a 7-arg overload; the old 5-arg form is preserved as a default delegate so existing callers (evaluator, AlertingFullLifecycleIT, PostgresAlertInstanceRepositoryIT) compile unchanged. `InAppInboxQuery.listInbox` also has a new filtered overload. UI: InboxPage severity filter migrated from `SegmentedTabs` (single-select, no color cues) to `ButtonGroup` (multi-select with severity-coloured dots), matching the topnavbar status-filter pattern. `useAlerts` forwards the filters as query params and cache-keys on the filter tuple so each combo is independently cached. Unit + hook tests updated to the new contract (5 UI tests + 8 Java unit tests passing). OpenAPI types regenerated from the fresh local backend.	2026-04-21 12:47:31 +02:00
hsiegeln	efa8390108	fix(alerting): reject null fireMode on ExchangeMatchCondition + repair in-flight rows The rule editor wizard reset the condition payload on kind-change without seeding a fireMode default; the ExchangeMatchCondition ctor allowed null to pass through; AlertEvaluatorJob then NPE-looped every tick on a saved rule. - core: compact ctor now rejects null fireMode (Jackson-deser path only — all production callers already pass a value). - V14: repair existing EXCHANGE_MATCH rows with fireMode=null to PER_EXCHANGE + perExchangeLingerSeconds=300 (default matches the wizard). - ui: ConditionStep.onKindChange seeds EXCHANGE_MATCH defaults so the Select's displayed fallback ("Per exchange") is actually in form state. - ui: validateStep('condition', ...) now enforces fireMode presence + the mode-specific fields before the user reaches Review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:05:55 +02:00
hsiegeln	09b49f096c	feat(alerting): per-severity breakdown on unread-count DTO Spec §13 calls for the notification bell to colour-code by highest unread severity (CRITICAL → error, WARNING → amber, INFO → muted). The old { count } DTO forced the UI to pick one static colour, so NotificationBell shipped with a TODO. Grow the contract instead: UnreadCountResponse = { total, bySeverity: { CRITICAL, WARNING, INFO } } Guarantees: - every severity is always present with a >=0 value (no undefined keys on the wire), so the UI can branch without defaults. - total = sum of bySeverity values — kept explicit on the wire for cheap top-line display, not recomputed client-side. Backend - AlertInstanceRepository: replaces countUnreadForUser(long) with countUnreadBySeverityForUser returning Map<AlertSeverity, Long>. One SQL round-trip per (env, user) — GROUP BY ai.severity over the same NOT EXISTS(alert_reads) filter. - UnreadCountResponse.from(Map) normalises and defensively copies; missing severities default to 0. - InAppInboxQuery.countUnread now returns the DTO, caches the full response (still 5s TTL) so severity breakdown gets the same hit-rate as the total did before. - AlertController just hands the DTO back. Breaking change — no backwards-compat shim: the `count` field is gone. UI and tests updated in the same commit; there are no other API consumers in the tree. Frontend - Regenerated openapi.json + schema.d.ts against a fresh build of the new backend. - NotificationBell branches badge colour on the highest unread severity (CRITICAL > WARNING > INFO) via new CSS variants. - Tests cover all four paths: zero, critical-present, warning-only, info-only. Tests: 7 unit tests + 12 ITs (incl. new grouping + empty-map) + 49 vitest (was 46; +3 severity-branch assertions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 18:15:56 +02:00
hsiegeln	424894a3e2	fix(alerting/I-1): retry endpoint resets attempts to 0 instead of incrementing AlertNotificationRepository gains resetForRetry(UUID, Instant) which sets attempts=0, status=PENDING, next_attempt_at=now, and clears claim/response fields. AlertNotificationController calls resetForRetry instead of scheduleRetry so a manual retry always starts from a clean slate. AlertNotificationControllerIT adds retryResetsAttemptsToZero to verify attempts==0 and status==PENDING after three prior markFailed calls. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:59 +02:00
hsiegeln	d74079da63	fix(alerting/B-2): implement re-notify cadence sweep and lastNotifiedAt tracking AlertInstanceRepository gains listFiringDueForReNotify(Instant) — only returns instances where last_notified_at IS NOT NULL and cadence has elapsed (IS NULL branch excluded: sweep only re-notifies, initial notify is the dispatcher's job). AlertEvaluatorJob.sweepReNotify() runs at the end of each tick, enqueues fresh notifications for eligible instances and stamps last_notified_at. NotificationDispatchJob stamps last_notified_at on the alert_instance when a notification is DELIVERED, providing the anchor timestamp for cadence checks. PostgresAlertInstanceRepositoryIT adds listFiringDueForReNotify test covering the three-rule eligibility matrix (never-notified, long-ago, recent). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 08:25:50 +02:00
hsiegeln	f1abca3a45	refactor(alerting): rename P95_LATENCY_MS → AVG_DURATION_MS to match what stats_1m_route exposes The evaluator mapped P95_LATENCY_MS to ExecutionStats.avgDurationMs because stats_1m_route has no p95 column. Exposing the old name implied p95 semantics operators did not get. Rename to AVG_DURATION_MS makes the contract honest. Updated RouteMetric enum (with javadoc), evaluator switch, and admin guide. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 07:36:43 +02:00
hsiegeln	bf178ba141	fix(alerting): populate AlertInstance.rule_snapshot so history survives rule delete - Add withRuleSnapshot(Map) wither to AlertInstance (same pattern as other withers) - Call snapshotRule(rule) + withRuleSnapshot in both applyResult (single-firing) and applyBatchFiring paths so every persisted instance carries a non-empty JSONB snapshot - Strip null values from the Jackson-serialized map before wrapping in the immutable snapshot so Map.copyOf in the compact ctor does not throw NPE on nullable rule fields - Add ruleSnapshotIsPersistedOnInstanceCreation IT: asserts name/severity/conditionKind appear in the rule_snapshot column after a tick fires an instance - Add historySurvivesRuleDelete IT: fires an instance, deletes the rule, asserts rule_id IS NULL and rule_snapshot still contains the rule name (spec §5 guarantee) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 20:09:28 +02:00
hsiegeln	657dc2d407	feat(alerting): AlertingProperties + AlertStateTransitions state machine - AlertingProperties @ConfigurationProperties with effective*() accessors and 5000 ms floor clamp on evaluatorTickIntervalMs; warn logged at startup - AlertStateTransitions pure static state machine: Clear/Firing/Batch/Error branches, PENDING→FIRING promotion on forDuration elapsed; Batch delegated to job - AlertInstance wither helpers: withState, withFiredAt, withResolvedAt, withAck, withSilenced, withTitleMessage, withLastNotifiedAt, withContext - AlertingBeanConfig gains @EnableConfigurationProperties(AlertingProperties), alertingInstanceId bean (hostname:pid), alertingClock bean, PerKindCircuitBreaker bean wired from props - 12 unit tests in AlertStateTransitionsTest covering all transitions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:58:12 +02:00
hsiegeln	c53f642838	chore(alerting): add jmustache 1.16 Declared in cameleer-server-core pom (canonical location for unit-testable rendering without Spring) and mirrored in cameleer-server-app pom so the app module compiles standalone without a full reactor install. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:26:57 +02:00
hsiegeln	7b79d3aa64	feat(alerting): countExecutionsForAlerting for exchange-match evaluator Adds AlertMatchSpec record (core) and ClickHouseSearchIndex.countExecutionsForAlerting — no FINAL, no text subqueries. Filters by tenant, env, app, route, status, time window, and optional after-cursor. Attributes (JSON string column) use inlined JSONExtractString key literals since ClickHouse JDBC does not bind ? placeholders inside JSON functions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:18:49 +02:00
hsiegeln	45028de1db	feat(alerting): Postgres repository for alert_instances with inbox queries Implements AlertInstanceRepository: save (upsert), findById, findOpenForRule, listForInbox (3-way OR: user/group/role via && array-overlap + ANY), countUnreadForUser (LEFT JOIN alert_reads), ack, resolve, markSilenced, deleteResolvedBefore. Integration test covers all 9 scenarios including inbox fan-out across all three target types. Also adds @JsonIgnoreProperties(ignoreUnknown=true) to SilenceMatcher to suppress Jackson serializing isWildcard() as a round-trip field. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:04:51 +02:00
hsiegeln	1ff256dce0	feat(alerting): core repository interfaces	2026-04-19 18:43:36 +02:00
hsiegeln	e7a9042677	feat(alerting): core domain records (rule, instance, silence, notification)	2026-04-19 18:43:03 +02:00
hsiegeln	56a7b6de7d	feat(alerting): sealed AlertCondition hierarchy with Jackson deduction	2026-04-19 18:42:04 +02:00
hsiegeln	530bc32040	feat(alerting): core enums + AlertScope	2026-04-19 18:36:29 +02:00
hsiegeln	5103dc91be	feat(alerting): add ALERT_RULE_CHANGE + ALERT_SILENCE_CHANGE audit categories	2026-04-19 18:34:08 +02:00
hsiegeln	ea4c56e7f6	feat(outbound): admin CRUD REST + RBAC + audit New audit categories: OUTBOUND_CONNECTION_CHANGE, OUTBOUND_HTTP_TRUST_CHANGE. Controller-level @PreAuthorize defaults to ADMIN; GETs relaxed to ADMIN|OPERATOR. SecurityConfig permits OPERATOR GETs on /api/v1/admin/outbound-connections/**. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:43:48 +02:00
hsiegeln	380ccb102b	fix(outbound): align user FK with users(user_id) TEXT schema V11 migration referenced users(id) as uuid, but V1 users table has user_id as TEXT primary key. Amending V11 and the OutboundConnection record before Task 7's integration tests catch this at Flyway startup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:18:12 +02:00
hsiegeln	46b8f63fd1	feat(outbound): core domain records for outbound connections Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:10:17 +02:00
hsiegeln	2224f7d902	feat(http): core outbound HTTP interfaces and property records	2026-04-19 15:39:57 +02:00
hsiegeln	6d3956935d	refactor(events): remove dead non-paginated query path AgentEventService.queryEvents, AgentEventRepository.query, and the ClickHouse implementation have had no callers since /agents/events became cursor-paginated. Remove them along with their dedicated IT tests. queryPage and its tests remain as the single query path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 13:16:28 +02:00
hsiegeln	67a834153e	feat(events): add AgentEventPage + queryPage interface Introduces cursor-paginated query on AgentEventRepository. The cursor format is owned by the implementation. The existing non-paginated query(...) is kept for internal consumers.	2026-04-17 11:52:42 +02:00
hsiegeln	769752a327	feat(logs): widen source filter to multi-value OR list Replaces LogSearchRequest.source (String) with sources (List<String>) and emits 'source IN (...)' when non-empty. LogQueryController parses ?source=a,b,c the same way it parses ?level=a,b,c.	2026-04-17 11:48:10 +02:00
hsiegeln	62dd71b860	fix: stamp environment on agent_events rows The agent_events table has an `environment` column and AgentEventsController filters on it, but the INSERT never populated it — every row got the column default ('default'). Result: Timeline on the Application Runtime page was empty whenever the user's selected env was anything other than 'default'. Thread env through the write path: - AgentEventRepository.insert + AgentEventService.recordEvent gain an `environment` param; delete the no-env query overload (unused). - ClickHouseAgentEventRepository.insert writes the column (falls back to 'default' on null to match column DEFAULT). - All 5 callers source env from the agent registry (AgentInfo.environmentId) or the registration request body; AgentLifecycleMonitor, deregister, command ack, event ingestion, register/re-register. - Integration test updated for the new signatures. Pre-existing rows in deployed CH will still report environment='default'. New events from this build forward will carry the correct env. Backfill (UPDATE ... FROM apps) is left as a manual DB step if historical timeline is needed for non-default envs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 10:30:56 +02:00
hsiegeln	d02fa73080	fix: scope correlation-chain query to the exchange's own env Correlated exchanges always share the env of the one being viewed — using the globally-selected env from the picker was wrong if the user switched envs after opening a detail view (or arrived via permalink). Thread `environment` through: - `ExecutionStore.ExecutionRecord` gains `environment` field; the ClickHouse `executions` table already stores this, just not read back. - `ClickHouseExecutionStore.findById` SELECT adds the column; mapper populates it. - `ExecutionDetail` gains `environment`; `DetailService` passes through. - `IngestionService.toExecutionRecord` passes null — this legacy PG ingestion path isn't active when ClickHouse is enabled, and the read-side is what drives the correlation UI. - UI `ExchangeHeader` reads `detail.environment ?? storeEnv` and extends the TS type locally (schema.d.ts catches up on next regen). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 10:19:42 +02:00
hsiegeln	6d9e456b97	feat!: move apps & deployments under /api/v1/environments/{envSlug}/apps/{appSlug}/... P3B of the taxonomy migration. App and deployment routes are now env-scoped in the URL itself, making the (env, app_slug) uniqueness key explicit. Previously /api/v1/apps/{appSlug} was ambiguous: with the same app deployed to multiple environments (dev/staging/prod), the handler called AppService.getBySlug(slug) which returns the first row matching slug regardless of env. Server: - AppController: @RequestMapping("/api/v1/environments/{envSlug}/apps"). Every handler now calls appService.getByEnvironmentAndSlug(env.id(), appSlug); 404 if the app doesn't exist in this env. CreateAppRequest body drops environmentId (it's in the path). - DeploymentController: @RequestMapping("/api/v1/environments/{envSlug}/apps/{appSlug}/deployments"). DeployRequest body drops environmentId. PromoteRequest body switches from targetEnvironmentId (UUID) to targetEnvironment (slug); promote handler resolves the target env by slug and looks up the app with the same slug in the target env (fails with 404 if the target app doesn't exist yet; apps must exist in both source and target before promote). - AppService: added getByEnvironmentAndSlug helper; createApp now validates slug against ^[a-z0-9][a-z0-9-]{0,63}$ (400 on invalid). SPA: - queries/admin/apps.ts: rewritten. Hooks take envSlug where env-scoped. Removed useAllApps (no flat endpoint). Renamed path param: appId → appSlug throughout. Added usePromoteDeployment. Query keys include envSlug so cache is env-scoped. - AppsTab.tsx: call sites updated. When no environment is selected, the managed-app list is empty; cross-env discovery lives in the Runtime tab (catalog). handleDeploy/handleStop/etc. pass envSlug to the new hook signatures. BREAKING CHANGE: /api/v1/apps/ paths removed. Clients must use /api/v1/environments/{envSlug}/apps/{appSlug}/. Request bodies for POST /apps and POST /apps/{slug}/deployments no longer accept environmentId (use the URL path instead). Promote body uses slug not UUID. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 23:38:37 +02:00
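The slug rule quoted in the commit above, `^[a-z0-9][a-z0-9-]{0,63}$`, can be illustrated with a short sketch. The pattern is taken from the commit message; the class name `SlugValidatorSketch` and the `isValid` method are hypothetical, and how `AppService.createApp` actually wires the check into its 400 response is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the slug pattern from the commit message.
public final class SlugValidatorSketch {
    // One lowercase letter or digit, followed by up to 63 lowercase letters, digits, or hyphens.
    private static final Pattern SLUG = Pattern.compile("^[a-z0-9][a-z0-9-]{0,63}$");

    private SlugValidatorSketch() {}

    /** Returns true when the slug is non-null and matches the pattern in full. */
    public static boolean isValid(String slug) {
        return slug != null && SLUG.matcher(slug).matches();
    }
}
```

Under this pattern a slug is 1 to 64 characters long, cannot start with a hyphen, and cannot contain uppercase letters, so `my-app` is accepted while `-app` and `My-App` are rejected.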

59 Commits