Commit Graph

19 Commits

Author SHA1 Message Date
hsiegeln
90083f886a refactor(schema): collapse V1..V18 into single V1__init.sql baseline
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m4s
CI / docker (push) Successful in 1m17s
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
The project is still greenfield (no production deployment) so this is
the last safe moment to flatten the migration archaeology before the
checksum history starts mattering for real.

Schema changes
 - 18 migration files (531 lines) → one V1__init.sql (~380 lines)
   declaring the final end-state: RBAC + claim mappings + runtime
   management + config + audit + outbound + alerting, plus seed data
   (system roles, Admins group, default environment).
 - Drops the data-repair statements from V14 (firemode backfill),
   V16 (subjectFingerprint migration), V17 (ACKNOWLEDGED → FIRING
   coercion) — they were no-ops on any DB that starts at V1.
 - Declares condition_kind_enum with AGENT_LIFECYCLE from the start
   (was added retroactively by V18).
 - Declares alert_state_enum with three values only (was five, then
   swapped in V17) and alert_instances with read_at / deleted_at
   columns from day one (was added by V17).
 - alert_reads table never created (V12 created, V17 dropped).
 - alert_instances_open_rule_uq built with the V17 predicate from
   the start.

Test changes
 - Replace V12MigrationIT / V17MigrationIT / V18MigrationIT with one
   SchemaBootstrapIT that asserts the combined invariants: tables
   present, alert_reads absent, enum value sets, alert_instances has
   read_at + deleted_at, open_rule_uq exists and is unique, env-delete
   cascade fires.

Verification
 - pg_dump of the new V1 matches the pg_dump of V1..V18 applied in
   sequence (bytewise modulo column order and Postgres-auto FK names).
 - Full alerting IT suite (53 tests across 6 classes) green against
   the new schema.
 - The 47 pre-existing test failures on main (AgentRegistrationIT,
   SearchControllerIT, ClickHouseStatsStoreIT, …) are unrelated and
   fail identically without this change.

Developer impact
 - Existing local DBs will fail checksum validation on boot. Wipe:
   docker compose down -v  (or drop the tenant_default schema).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 20:52:22 +02:00
hsiegeln
b7d201d743 fix(alerts): add AGENT_LIFECYCLE to condition_kind_enum + readable error toasts
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m5s
CI / docker (push) Successful in 1m19s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
Backend
 - V18 migration adds AGENT_LIFECYCLE to condition_kind_enum. Java
   ConditionKind enum shipped with this value but no Postgres migration
   extended the type, so any AGENT_LIFECYCLE rule insert failed with
   "invalid input value for enum condition_kind_enum".
 - ALTER TYPE ... ADD VALUE lives alone in its migration per Postgres
   constraint that the new value cannot be referenced in the same tx.
 - V18MigrationIT asserts the enum now contains all 7 kinds.

Frontend
 - Add describeApiError(e) helper to unwrap openapi-fetch error bodies
   (Spring error JSON) into readable strings. String(e) on a plain
   object rendered "[object Object]" in toasts — the actual failure
   reason was hidden from the user.
 - Replace String(e) in all 13 toast descriptions across the alerting
   and outbound-connection mutation paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 20:23:14 +02:00
hsiegeln
e95c21d0cb feat(alerts): V17 migration — drop ACKNOWLEDGED, add read_at + deleted_at
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:04:09 +02:00
hsiegeln
414f7204bf feat(alerting): AGENT_LIFECYCLE condition kind with per-subject fire mode
Allows alert rules to fire on agent-lifecycle events — REGISTERED,
RE_REGISTERED, DEREGISTERED, WENT_STALE, WENT_DEAD, RECOVERED — rather
than only on current state. Each matching `(agent, eventType, timestamp)`
becomes its own ackable AlertInstance, so outages on distinct agents are
independently routable.

Core:
- New `ConditionKind.AGENT_LIFECYCLE` + `AgentLifecycleCondition` record
  (scope, eventTypes, withinSeconds). Compact ctor rejects empty
  eventTypes and withinSeconds<1.
- Strict allowlist enum `AgentLifecycleEventType` (six entries matching
  the server-emitted types in `AgentRegistrationController` and
  `AgentLifecycleMonitor`). Custom agent-emitted event types tracked in
  backlog issue #145.
- `AgentEventRepository.findInWindow(env, appSlug, agentId, eventTypes,
  from, to, limit)` — new read path ordered `(timestamp ASC, insert_id
  ASC)` used by the evaluator. Implemented on
  `ClickHouseAgentEventRepository` with tenant + env filter mandatory.

App:
- `AgentLifecycleEvaluator` queries events in the last `withinSeconds`
  window and returns `EvalResult.Batch` with one `Firing` per row.
  Every Firing carries a canonical `_subjectFingerprint` of
  `"<agentId>:<eventType>:<tsMillis>"` in context plus `agent` / `event`
  subtrees for Mustache templating.
- `NotificationContextBuilder` gains an `AGENT_LIFECYCLE` branch that
  exposes `{{agent.id}}`, `{{agent.app}}`, `{{event.type}}`,
  `{{event.timestamp}}`, `{{event.detail}}`.
- Validation is delegated to the record compact ctor + enum at Jackson
  deserialization time — matches the existing policy of keeping
  controller validators focused on env-scoped / SQL-injection concerns.

Schema:
- V16 migration generalises the V15 per-exchange discriminator on
  `alert_instances_open_rule_uq` to prefer `_subjectFingerprint` with a
  fallback to the legacy `exchange.id` expression. Scalar kinds still
  resolve to `''` and keep one-open-per-rule. Duplicate-key path in
  `PostgresAlertInstanceRepository.save` is unchanged — the index is
  the deduper.

UI:
- New `AgentLifecycleForm.tsx` wizard form with multi-select chips for
  the six allowed event types + `withinSeconds` input. Wired into
  `ConditionStep`, `form-state` (validation + defaults: WENT_DEAD,
  300 s), and `enums.ts` options. Tests in `enums.test.ts` pin the
  new option array.
- `alert-variables.ts` registers `{{agent.app}}`, `{{event.type}}`,
  `{{event.timestamp}}`, `{{event.detail}}` leaves for the new kind,
  and extends `agent.id`'s availability list to include `AGENT_LIFECYCLE`.

Tests (all passing):
- 5 new JSON-roundtrip cases on `AlertConditionJsonTest` (positive +
  empty/zero/unknown-type rejection).
- 5 new evaluator unit tests on `AgentLifecycleEvaluatorTest` (empty
  window, multi-agent fingerprint shape, scope forwarding, missing env).
- `NotificationContextBuilderTest` switch now covers the new kind.
- 119 alerting unit tests + 71 UI tests green.

Docs: `.claude/rules/{core,app,ui}` and CLAUDE.md migration list updated.
2026-04-21 14:52:08 +02:00
hsiegeln
037a27d405 fix(alerting): allow multiple open alert_instances per rule for PER_EXCHANGE
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m51s
CI / docker (push) Successful in 1m17s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 41s
V13 added a partial unique index on alert_instances(rule_id) WHERE state
IN (PENDING,FIRING,ACKNOWLEDGED). Correct for scalar condition kinds
(ROUTE_METRIC / AGENT_STATE / DEPLOYMENT_STATE / LOG_PATTERN / JVM_METRIC
/ EXCHANGE_MATCH in COUNT_IN_WINDOW) but wrong for EXCHANGE_MATCH /
PER_EXCHANGE, which by design emits one alert_instance per matching
exchange. Under V13 every PER_EXCHANGE tick with >1 match logged
"Skipped duplicate open alert_instance for rule …" at evaluator cadence
and silently lost alert fidelity — only the first matching exchange per
tick got an AlertInstance + webhook dispatch.

V15 drops the rule_id-only constraint and recreates it with a
discriminator on context->'exchange'->>'id'. Scalar kinds emit
Map.of() as context, so their expression resolves to '' — "one open per
rule" preserved. ExchangeMatchEvaluator.evaluatePerExchange always
populates exchange.id, so per-exchange instances coexist cleanly.

Two new PostgresAlertInstanceRepositoryIT tests:
  - multiple open instances for same rule + distinct exchanges all land
  - second open for identical (rule, exchange) still dedups via the
    DuplicateKeyException fallback in save() — defense-in-depth kept

Also fixes pre-existing PostgresAlertReadRepositoryIT brokenness: its
setup() inserted 3 open instances sharing one rule_id, which V13 blocked
on arrival. Migrate to one rule_id per instance (pattern already used
across other storage ITs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 22:26:19 +02:00
hsiegeln
efa8390108 fix(alerting): reject null fireMode on ExchangeMatchCondition + repair in-flight rows
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 2m2s
CI / docker (push) Successful in 1m20s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
SonarQube / sonarqube (push) Successful in 5m31s
The rule editor wizard reset the condition payload on kind-change without
seeding a fireMode default; the ExchangeMatchCondition ctor allowed null to
pass through; AlertEvaluatorJob then NPE-looped every tick on a saved rule.

- core: compact ctor now rejects null fireMode (Jackson-deser path only — all
  production callers already pass a value).
- V14: repair existing EXCHANGE_MATCH rows with fireMode=null to
  PER_EXCHANGE + perExchangeLingerSeconds=300 (default matches the wizard).
- ui: ConditionStep.onKindChange seeds EXCHANGE_MATCH defaults so the
  Select's displayed fallback ("Per exchange") is actually in form state.
- ui: validateStep('condition', ...) now enforces fireMode presence + the
  mode-specific fields before the user reaches Review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:05:55 +02:00
hsiegeln
7e79ff4d98 fix(alerting/I-2): add unique partial index on alert_instances(rule_id) for open states
V13 migration creates alert_instances_open_rule_uq — a partial unique index on
(rule_id) WHERE state IN ('PENDING','FIRING','ACKNOWLEDGED'), preventing
duplicate open instances per rule. PostgresAlertInstanceRepository.save() catches
DuplicateKeyException and returns the existing open instance instead of failing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 08:26:07 +02:00
hsiegeln
8bf45d5456 fix(alerting): use ALTER TABLE MODIFY SETTING to enable projections on executions ReplacingMergeTree
Investigated three approaches for CH 24.12:
- Inline SETTINGS on ADD PROJECTION: rejected (UNKNOWN_SETTING — not a query-level setting).
- ALTER TABLE MODIFY SETTING deduplicate_merge_projection_mode='rebuild': works; persists in
  table metadata across connection restarts; runs before ADD PROJECTION in the SQL script.
- Session-level JDBC URL param: not pursued (MODIFY SETTING is strictly better).

alerting_projections.sql now runs MODIFY SETTING before the two executions ADD PROJECTIONs.
AlertingProjectionsIT strengthened to assert all four projections (including alerting_app_status
and alerting_route_status on executions) exist after schema init.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 07:36:55 +02:00
hsiegeln
63669bd1d7 docs(alerting): default config + admin guide
Adds alerting stanza to application.yml with all AlertingProperties
fields backed by env-var overrides.  Creates docs/alerting.md covering
six condition kinds (with example JSON), template variables, webhook
setup (Slack/PagerDuty examples), silence patterns, circuit-breaker
and retention troubleshooting, and Prometheus metrics reference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 22:16:38 +02:00
hsiegeln
7c0e94a425 feat(alerting): ClickHouse projections for alerting read paths
Adds alerting_projections.sql with four projections (alerting_app_status,
alerting_route_status on executions; alerting_app_level on logs;
alerting_instance_metric on agent_metrics). ClickHouseSchemaInitializer now
runs both init.sql and alerting_projections.sql, with ADD PROJECTION and
MATERIALIZE treated as non-fatal — executions (ReplacingMergeTree) requires
deduplicate_merge_projection_mode=rebuild which is unavailable via JDBC pool.
MergeTree projections (logs, agent_metrics) always succeed and are asserted in IT.

Column names confirmed from init.sql: logs uses 'application' (not application_id),
agent_metrics uses 'collected_at' (not timestamp). All column names match the plan.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 19:18:58 +02:00
hsiegeln
59e76bdfb6 feat(alerting): V12 flyway migration for alerting tables 2026-04-19 18:28:09 +02:00
hsiegeln
380ccb102b fix(outbound): align user FK with users(user_id) TEXT schema
V11 migration referenced users(id) as uuid, but V1 users table has
user_id as TEXT primary key. Amending V11 and the OutboundConnection
record before Task 7's integration tests catch this at Flyway startup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:18:12 +02:00
hsiegeln
000e9d2847 feat(http): ApacheOutboundHttpClientFactory with memoization and startup validation
Adds ApacheOutboundHttpClientFactory (Apache HttpClient 5) that memoizes
CloseableHttpClient instances keyed on effective TLS + timeout config, and
OutboundHttpConfig (@ConfigurationProperties) that validates trusted CA paths
at startup and exposes OutboundHttpClientFactory as a Spring bean.

TRUST_ALL mode disables both cert validation (TrustAllManager in SslContextBuilder)
and hostname verification (NoopHostnameVerifier on SSLConnectionSocketFactoryBuilder).
WireMock HTTPS integration test covers trust-all bypass, system-default PKIX rejection,
and client memoization.

OIDC audit: OidcProviderHelper and OidcTokenExchanger use Nimbus SDK's own HTTP layer
(DefaultResourceRetriever for JWKS, HTTPRequest.send() for token exchange) plus the
bespoke InsecureTlsHelper for TLS skip-verify; neither uses OutboundHttpClientFactory.
Retrofit deferred to a separate follow-up per plan §20.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:03:56 +02:00
hsiegeln
ffdfd6cd9a feat(outbound): add HTTPS CHECK constraint on outbound_connections.url
Defense-in-depth per code review. DTO layer already validates HTTPS at save
time; this DB-level check guards against future code paths that might bypass
the DTO validator. Mustache template variables in the URL (e.g., {{env.slug}})
remain valid since only the scheme prefix is constrained.
2026-04-19 15:37:35 +02:00
hsiegeln
116038262a feat(outbound): V11 flyway migration for outbound_connections table 2026-04-19 15:33:39 +02:00
hsiegeln
89c9b53edd fix(pagination): add insert_id UUID tiebreak to cursor keyset
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m26s
CI / docker (push) Successful in 1m12s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 43s
Same-millisecond rows were silently skipped between pages because the
log cursor had no tiebreak and the events cursor tied by instance_id
(which also collides when one instance emits multiple events within a
millisecond). Add an insert_id UUID (DEFAULT generateUUIDv4()) column
to both logs and agent_events, order by (timestamp, insert_id)
consistently, and encode the cursor as 'timestamp|insert_id'. Existing
data is materialized via ALTER TABLE MATERIALIZE COLUMN (one-time
background mutation).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:25:36 +02:00
hsiegeln
9b1ef51d77 feat!: scope per-app config and settings by environment
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m27s
CI / docker (push) Successful in 1m10s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 1m40s
SonarQube / sonarqube (push) Successful in 4m29s
BREAKING: wipe dev PostgreSQL before deploying — V1 checksum changes.
Agents must now send environmentId on registration (400 if missing).

Two tables previously keyed on app name alone caused cross-environment
data bleed: writing config for (app=X, env=dev) would overwrite the row
used by (app=X, env=prod) agents, and agent startup fetches ignored env
entirely.

- V1 schema: application_config and app_settings are now PK (app, env).
- Repositories: env-keyed finders/saves; env is the authoritative column,
  stamped on the stored JSON so the row agrees with itself.
- ApplicationConfigController.getConfig is dual-mode — AGENT role uses
  JWT env claim (agents cannot spoof env); non-agent callers provide env
  via ?environment= query param.
- AppSettingsController endpoints now require ?environment=.
- SensitiveKeysAdminController fan-out iterates (app, env) slices so each
  env gets its own merged keys.
- DiagramController ingestion stamps env on TaggedDiagram; ClickHouse
  route_diagrams INSERT + findProcessorRouteMapping are env-scoped.
- AgentRegistrationController: environmentId is required on register;
  removed all "default" fallbacks from register/refresh/heartbeat auto-heal.
- UI hooks (useApplicationConfig, useProcessorRouteMapping, useAppSettings,
  useAllAppSettings, useUpdateAppSettings) take env, wired to
  useEnvironmentStore at all call sites.
- New ConfigEnvIsolationIT covers env-isolation for both repositories.

Plan in docs/superpowers/plans/2026-04-16-environment-scoping.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 22:25:21 +02:00
hsiegeln
04da0af4bc feat: add route_catalog table to ClickHouse schema 2026-04-16 18:45:38 +02:00
hsiegeln
cb3ebfea7c chore: rename cameleer3 to cameleer
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Failing after 18s
CI / docker (push) Has been skipped
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Has been skipped
Rename Java packages from com.cameleer3 to com.cameleer, module
directories from cameleer3-* to cameleer-*, and all references
throughout workflows, Dockerfiles, docs, migrations, and pom.xml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:28:42 +02:00