Commit Graph

319 Commits

Author SHA1 Message Date
hsiegeln
f4eafd9a0f feat: sync OIDC roles on every login, not just first
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Has been cancelled
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / build (push) Has been cancelled
Roles from the id_token's rolesClaim are now diffed against stored
system roles on each OIDC login. Missing roles are added, revoked
roles are removed. Group memberships (manually assigned) are never
touched. This propagates scope revocations from the OIDC provider
on next user login.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:11:06 +02:00
hsiegeln
9c2e6aacad feat: support server:-prefixed scopes and case-insensitive role mapping
Some checks failed
CI / docker (push) Has been cancelled
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / cleanup-branch (push) Has been cancelled
CI / build (push) Has been cancelled
M2M scope mapping now accepts both 'server:admin' and 'admin' (case-
insensitive). OIDC user login role assignment strips the 'server:'
prefix before looking up SystemRole, so 'server:viewer' from the
id_token maps to VIEWER correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:05:13 +02:00
hsiegeln
c757a0ea51 fix: replace last hardcoded paths with BASE_PATH-aware alternatives
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m13s
CI / docker (push) Successful in 57s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 36s
- index.html: change /src/main.tsx to ./src/main.tsx (relative, respects
  <base> tag)
- AgentRegistrationController: derive SSE endpoint URL from request
  context via ServletUriComponentsBuilder instead of hardcoding /api/v1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 09:53:00 +02:00
hsiegeln
e8bcc39ca9 fix: add ES384 to OidcTokenExchanger JWT algorithm list
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m13s
CI / docker (push) Successful in 43s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 40s
Logto signs id_tokens with ES384 by default. SecurityConfig already
included it but OidcTokenExchanger only had RS256 and ES256.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 01:37:22 +02:00
hsiegeln
d7001804f7 fix: permit branding endpoints without authentication
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m7s
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / docker (push) Has been cancelled
The login page loads the branding logo before the user is signed in.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 01:15:21 +02:00
hsiegeln
0fab20e67a fix: append .well-known/openid-configuration to issuerUri in token exchanger
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m12s
CI / docker (push) Successful in 40s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
OidcTokenExchanger fetched the discovery document from the issuerUri
as-is, but the database stores the issuer URI (e.g. /oidc), not the
full discovery URL. Logto returns 404 for the bare issuer path.
SecurityConfig already appended the well-known suffix — now the token
exchanger does the same.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 01:04:57 +02:00
hsiegeln
d7563902a7 fix: read oidcTlsSkipVerify at call time instead of caching in constructor
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m13s
CI / docker (push) Successful in 42s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Has been cancelled
OidcTokenExchanger cached securityProperties.isOidcTlsSkipVerify() in
the constructor as a boolean field. If Spring constructed the bean
before property binding completed, the cached value was false even when
the env var was set. SecurityConfig worked because it read the property
at call time. Now OidcTokenExchanger stores the SecurityProperties
reference and reads the flag on each call, matching SecurityConfig's
pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 01:02:36 +02:00
hsiegeln
99e2a8354f fix: handle HTTPS redirects in InsecureTlsHelper for OIDC discovery
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m13s
CI / docker (push) Successful in 42s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
Java's automatic redirect following creates new connections that do NOT
inherit custom SSLSocketFactory/HostnameVerifier. This caused the OIDC
discovery fetch to fail on redirect even with TLS_SKIP_VERIFY=true.
Now disables auto-redirect and follows manually with SSL on each hop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:51:49 +02:00
hsiegeln
083cb8b9ec feat: add CAMELEER_CORS_ALLOWED_ORIGINS for multi-origin CORS support
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m7s
CI / docker (push) Successful in 41s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
Behind a reverse proxy the browser sends Origin matching the proxy's
public URL, which the single-origin CAMELEER_UI_ORIGIN rejects.
New env var accepts comma-separated origins and takes priority over
UI_ORIGIN, which remains as a backwards-compatible fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:41:00 +02:00
hsiegeln
ca92b3ce7d feat: add CAMELEER_OIDC_TLS_SKIP_VERIFY to bypass cert verification for OIDC
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m7s
CI / docker (push) Successful in 43s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 36s
Self-signed CA certs on the OIDC provider (e.g. Logto behind a reverse
proxy) cause the login flow to fail because Java's truststore rejects
the connection. This adds an opt-in env var that creates a trust-all
SSLContext scoped to OIDC HTTP calls only (discovery, token exchange,
JWKS fetch) without affecting system-wide TLS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:26:40 +02:00
hsiegeln
3c70313d78 feat: add CAMELEER_OIDC_JWK_SET_URI for direct JWKS fetching
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Has been cancelled
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / build (push) Has been cancelled
When set, fetches JWKs from this URL directly instead of discovering
from the OIDC well-known endpoint. Needed when the public issuer URL
(e.g., https://domain.com/oidc) isn't reachable from inside containers
but the internal URL (http://logto:3001/oidc/jwks) is.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 21:02:51 +02:00
hsiegeln
fec6717a85 feat: update default rolesClaim to 'roles' for Logto compatibility
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 13:10:53 +02:00
hsiegeln
3bd07c9b07 feat: add OIDC resource server support with JWKS discovery and scope-based roles
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:10:08 +02:00
hsiegeln
a5c4e0cead feat: add spring-boot-starter-oauth2-resource-server and OIDC properties
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:06:53 +02:00
hsiegeln
de85cdf5a2 fix: let SPRING_DATASOURCE_URL fully control datasource connection
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m5s
CI / docker (push) Successful in 41s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
SonarQube / sonarqube (push) Successful in 3m26s
Explicit spring.datasource.url in YAML takes precedence over the env var,
causing deployed containers to connect to localhost instead of the postgres
service. Now the YAML uses ${SPRING_DATASOURCE_URL:...} so the env var
wins when set. Flyway inherits from the datasource (no separate URL).
Removed CAMELEER_DB_SCHEMA — schema is part of the datasource URL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:24:22 +02:00
hsiegeln
ac87aa6eb2 fix: derive PG schema from tenant ID instead of defaulting to public
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m6s
CI / docker (push) Successful in 43s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Failing after 2m17s
Schema now defaults to tenant_${cameleer.tenant.id} (e.g. tenant_default,
tenant_acme) instead of public. Flyway create-schemas: true ensures the
schema is auto-created on first startup. CAMELEER_DB_SCHEMA env var still
available as override for feature branch isolation. Removed hardcoded
public schema from K8s base and main overlay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 21:46:57 +02:00
hsiegeln
37eb56332a fix: use environmentId from heartbeat body for auto-heal
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m5s
CI / docker (push) Successful in 42s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
HeartbeatRequest now carries environmentId (cameleer3-common update).
Auto-heal prefers the heartbeat value (most current) over the JWT
claim, ensuring agents recover their correct environment immediately
on the first heartbeat after server restart.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 16:21:55 +02:00
hsiegeln
72ec87a3ba fix: persist environment in JWT claims for auto-heal recovery
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m4s
CI / docker (push) Successful in 1m7s
CI / deploy (push) Successful in 45s
CI / deploy-feature (push) Has been skipped
Add 'env' claim to agent JWTs (set at registration, carried through
refresh). Auto-heal on heartbeat/SSE now reads environment from the
JWT instead of hardcoding 'default', so agents retain their correct
environment after server restart.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 16:12:25 +02:00
hsiegeln
694d0eef59 feat: add environment filtering across all APIs and UI
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m8s
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / docker (push) Has been cancelled
Backend: Added optional `environment` query parameter to catalog,
search, stats, timeseries, punchcard, top-errors, logs, and agents
endpoints. ClickHouse queries filter by environment when specified
(literal SQL for AggregatingMergeTree, ? binds for raw tables).
StatsStore interface methods all accept environment parameter.

UI: Added EnvironmentSelector component (compact native select).
LayoutShell extracts distinct environments from agent data and
passes selected environment to catalog and agent queries via URL
search param (?env=). TopBar shows current environment label.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 15:42:26 +02:00
hsiegeln
a188308ec5 feat: implement multitenancy with tenant isolation + environment support
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m8s
CI / docker (push) Successful in 42s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 1m25s
Adds configurable tenant ID (CAMELEER_TENANT_ID env var, default:
"default") and environment as a first-class concept. Each server
instance serves one tenant with multiple environments.

Changes across 36 files:
- TenantProperties config bean for tenant ID injection
- AgentInfo: added environmentId field
- AgentRegistrationRequest: added environmentId field
- All 9 ClickHouse stores: inject tenant ID, replace hardcoded
  "default" constant, add environment to writes/reads
- ChunkAccumulator: configurable tenant ID + environment resolver
- MergedExecution/ProcessorBatch/BufferedLogEntry: added environment
- ClickHouse init.sql: added environment column to all tables,
  updated ORDER BY (tenant→time→env→app), added tenant_id to
  usage_events, updated all MV GROUP BY clauses
- Controllers: pass environmentId through registration/auto-heal
- K8s deploy: added CAMELEER_TENANT_ID env var
- All tests updated for new signatures

Closes #123

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 15:00:18 +02:00
hsiegeln
45a74075a1 feat: restore agent capabilities from heartbeat after server restart
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m8s
CI / docker (push) Successful in 40s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
The heartbeat now carries capabilities (per protocol v2 update).
On each heartbeat, capabilities are updated in the agent registry.
On auto-heal (server restart), capabilities from the heartbeat
are used instead of empty Map.of(), so the agent's feature flags
(replay, routeControl, logForwarding, etc.) are restored
immediately on the first heartbeat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 13:19:15 +02:00
hsiegeln
abed4dc96f security: fix SQL injection in ClickHouse query escaping
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m48s
CI / docker (push) Successful in 1m6s
CI / deploy (push) Successful in 47s
CI / deploy-feature (push) Has been skipped
Convert ClickHouseUsageTracker and ClickHouseMetricsQueryStore to
use JDBC parameterized queries (? binds) — these query raw tables
without AggregateFunction columns.

Fix lit(String) in RouteMetricsController and ClickHouseStatsStore
to escape backslashes before single quotes. Without this, an input
like \' breaks out of the string literal in ClickHouse (where \
is an escaped backslash). These must remain as literal SQL because
the ClickHouse JDBC 0.9.x driver wraps PreparedStatement in
sub-queries that strip AggregateFunction types, breaking -Merge
combinators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 12:17:12 +02:00
hsiegeln
b04b12220b fix: resolve 25 SonarQube code smells across 21 files
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m2s
CI / docker (push) Successful in 45s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
Remove unused fields (log, rbacService, roleRepository, jwt),
unused variables (agentTps, routeKeys, updated), unused imports
(HttpHeaders, JdbcTemplate). Rename restricted identifier 'record'
to 'auditRecord'/'event'. Return empty collections instead of null.
Replace .collect(Collectors.toList()) with .toList(). Simplify
conditional return in BootstrapTokenValidator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:36:13 +02:00
hsiegeln
633a61d89d perf: batch processor and log inserts to reduce ClickHouse part creation
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m7s
CI / docker (push) Successful in 39s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 1m2s
SonarQube / sonarqube (push) Failing after 1m58s
Diagnostics showed ~3,200 tiny inserts per 5 minutes:
- processor_executions: 2,376 inserts (14 rows avg) — one per chunk
- logs: 803 inserts (5 rows avg) — synchronous in HTTP handler

Fix 1: Consolidate processor inserts — new insertProcessorBatches() method
flattens all ProcessorBatch records into a single INSERT per flush cycle.

Fix 2: Buffer log inserts — route through WriteBuffer<BufferedLogEntry>,
flushed on the same 5s interval as executions. LogIngestionController now
pushes to buffer instead of inserting directly.

Also reverts async_insert config (doesn't work with JDBC inline VALUES).

Expected: ~3,200 inserts/5min → ~160 (20x reduction in part creation,
MV triggers, and background merge work).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 22:48:04 +02:00
hsiegeln
ac94a67a49 fix: reduce ClickHouse CPU by increasing flush interval, rename LIVE→AUTO labels
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m7s
CI / docker (push) Successful in 1m24s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 41s
- Increase ingestion flush interval from 500ms to 5000ms to reduce MV merge storms
- Reduce ClickHouse background_schedule_pool_size from 8 to 4
- Rename LIVE/PAUSED badge labels to AUTO/MANUAL across all pages
- Update design system to v0.1.29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 22:05:29 +02:00
hsiegeln
a9ec424d52 fix: derive Ed25519 signing key from JWT secret, no DB storage
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m11s
CI / docker (push) Successful in 42s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
Replace DB-persisted keypair with deterministic derivation from
CAMELEER_JWT_SECRET via HMAC-SHA256 seed + seeded SHA1PRNG KeyPairGenerator.
Same secret = same key pair across restarts, no private key in the database.

Closes #121

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 17:18:43 +02:00
hsiegeln
81f13396a0 fix: persist Ed25519 signing key to survive server restarts
All checks were successful
CI / build (push) Successful in 2m8s
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Successful in 50s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 54s
The keypair was generated ephemerally on each startup, causing agents
to reject all commands after a server restart (signature mismatch).
Now persisted to PostgreSQL server_config table and restored on startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 17:13:40 +02:00
hsiegeln
670e458376 fix: update ITs to use consolidated init.sql, remove dead code
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m23s
CI / docker (push) Successful in 1m29s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 50s
- All 7 ClickHouse integration tests now load init.sql via shared
  ClickHouseTestHelper instead of deleted V1-V11 migration files
- Remove unused useScope exports (setApp, setRoute, setExchange, clearScope)
- Remove unused CSS classes (monoCell, punchcardStack)
- Update ui/README.md DS version to v0.1.28

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 17:03:54 +02:00
hsiegeln
d4327af6a4 refactor: consolidate ClickHouse schema into single init.sql, cache diagrams
All checks were successful
CI / build (push) Successful in 2m2s
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Successful in 51s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
- Merge all V1-V11 migration scripts into one idempotent init.sql
- Simplify ClickHouseSchemaInitializer to load single file
- Replace route_diagrams projection with in-memory caches:
  hashCache (routeId+instanceId → contentHash) warm-loaded on startup,
  graphCache (contentHash → RouteGraph) lazy-populated on access
- Eliminates 9M+ row scans on diagram lookups

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 15:24:53 +02:00
hsiegeln
bb3e1e2bc3 fix: set deduplicate_merge_projection_mode for ReplacingMergeTree projection
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m8s
CI / docker (push) Successful in 42s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
ClickHouse 24.12 requires this setting before adding projections to
ReplacingMergeTree tables. Using 'drop' mode which discards the projection
during deduplication merges and rebuilds it afterward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 15:14:56 +02:00
hsiegeln
984bb2d40f fix: sort ClickHouse migration scripts by numeric version prefix
All checks were successful
CI / build (push) Successful in 2m32s
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Successful in 55s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 52s
Alphabetical sort put V10/V11 before V2-V9 ("V11" < "V1_" in ASCII),
causing the route_diagrams projection to run before the table existed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 15:06:56 +02:00
hsiegeln
6f00ff2e28 fix: reduce ClickHouse log noise, admin query spam, and diagram scan perf
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m7s
CI / docker (push) Successful in 1m25s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 40s
- Set com.clickhouse log level to INFO and org.apache.hc.client5 to WARN
- Admin hooks (useUsers/useGroups/useRoles) now only fetch on admin pages,
  eliminating AUDIT view_users entries on every UI click
- Add ClickHouse projection on route_diagrams for (tenant_id, route_id,
  instance_id, created_at) to avoid full table scans on diagram lookups
- Bump @cameleer/design-system to v0.1.28 (PAUSED mode time range fix,
  refreshTimeRange API)
- Call refreshTimeRange before invalidateQueries in PAUSED mode manual
  refresh so sidebar clicks use current time window

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 14:48:30 +02:00
hsiegeln
d30c267292 fix: route catalog missing routes after server restart
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m20s
CI / docker (push) Successful in 52s
CI / deploy (push) Successful in 54s
CI / deploy-feature (push) Has been skipped
After server restart, auto-healed agents register with empty
routeIds. The catalog only looked at agent registry for routes,
so routes and counts disappeared.

Now merges route IDs from ClickHouse stats_1m_route into the
catalog. Also includes apps that only exist in ClickHouse data
(no agent currently registered). Routes and exchange counts
survive server restarts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 11:14:27 +02:00
hsiegeln
2bc3efad7f fix: agent auth, heartbeat, and SSE all break after server restart
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m9s
CI / docker (push) Successful in 41s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
Three related issues caused by in-memory agent registry being empty
after server restart:

1. JwtAuthenticationFilter rejected valid agent JWTs if agent wasn't
   in registry — now authenticates any valid JWT regardless

2. Heartbeat returned 404 for unknown agents — now auto-registers
   the agent from JWT claims (subject, application)

3. SSE endpoint returned 404 — same auto-registration fix

JWT validation result is stored as a request attribute so downstream
controllers can extract the application claim for auto-registration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 10:41:23 +02:00
hsiegeln
0632f1c6a8 fix: agent token refresh returns 404 after server restart
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m8s
CI / docker (push) Successful in 42s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 1m23s
The refresh endpoint required the agent to exist in the in-memory
registry. After server restart the registry is empty, so all refresh
attempts got 404. The refresh token itself is self-contained with
subject, application, and roles — the registry lookup is optional.

Now uses application from the JWT, falling back to registry only
if the agent happens to be registered.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 10:37:57 +02:00
hsiegeln
bdac363e40 fix: active queries list always showed itself
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m8s
CI / docker (push) Successful in 40s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 36s
The system.processes query was returning its own row. Added
filter: query NOT LIKE '%system.processes%'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 10:33:47 +02:00
hsiegeln
2896bb90a9 fix: usage events never flushed to ClickHouse
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m11s
CI / docker (push) Successful in 42s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 1m7s
UsageFlushScheduler was a @Component with @ConditionalOnBean, but
ClickHouseUsageTracker is created via @Bean — component scan runs
first, so the condition always evaluated false. Events accumulated
in the WriteBuffer but flush() was never called.

Moved scheduler to @Bean in StorageBeanConfig with the same
@ConditionalOnProperty guard as the tracker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 10:07:13 +02:00
hsiegeln
e495b80432 fix: increase ClickHouse pool size and reduce flush interval
All checks were successful
CI / build (push) Successful in 1m49s
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Successful in 2m10s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 43s
Pool was hardcoded to 10 connections serving 7 concurrent write
streams + UI reads, causing "too many simultaneous queries" and
WriteBuffer overflow. Pool now defaults to 50 (configurable via
clickhouse.pool-size), flush interval reduced from 1000ms to 500ms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 22:11:15 +02:00
hsiegeln
8d899cc70c refactor: use HeartbeatRequest from cameleer3-common
Some checks failed
CI / docker (push) Has been cancelled
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / cleanup-branch (push) Has been cancelled
CI / build (push) Has been cancelled
Replace local HeartbeatRequest DTO with the shared model from
cameleer3-common. Message types exchanged between server and agent
belong in the common module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 22:05:26 +02:00
hsiegeln
520b80444a feat(#119): accept route states in heartbeat and state-change events
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Failing after 34s
CI / docker (push) Has been skipped
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Has been skipped
Replace ACK-based route state inference with agent-reported state.
Heartbeats now carry optional routeStates map, and ROUTE_STATE_CHANGED
events update the registry immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:45:13 +02:00
hsiegeln
b714d3363f feat(#119): expose route state in catalog API and sidebar/dashboard
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Failing after 29s
CI / docker (push) Has been skipped
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Has been skipped
Add routeState field to RouteSummary DTO (null for started, 'stopped'
or 'suspended' for non-default states). Sidebar shows stop/pause icons
and state badge for affected routes in both Apps and Routes sections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:15:46 +02:00
hsiegeln
0acceaf1a9 feat(#119): add RouteStateRegistry for tracking route operational state
In-memory registry that infers route state (started/stopped/suspended)
from successful route-control command ACKs. Updates state only when all
agents in a group confirm success.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:15:35 +02:00
hsiegeln
027e45aadf feat(#116): synchronous group command dispatch with multi-agent response collection
Add addGroupCommandWithReplies() to AgentRegistryService that sends commands
to all LIVE agents in a group and returns CompletableFuture per agent for
collecting replies. Update sendGroupCommand() and pushConfigToAgents() to
wait with a shared 10-second deadline, returning CommandGroupResponse with
per-agent status, timeouts, and overall success. Config update endpoint now
returns ConfigUpdateResponse wrapping both the saved config and push result.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:00:56 +02:00
hsiegeln
56297701e6 fix: use ILIKE for case-insensitive log search in ClickHouse
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m4s
CI / docker (push) Successful in 57s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 1m29s
LIKE is case-sensitive in ClickHouse. Switch to ILIKE for message,
stack_trace, and logger_name searches so queries match regardless
of casing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:35:34 +02:00
hsiegeln
b73f5e6dd4 feat: add Logs tab with cursor-paginated search, level filters, and live tail
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m3s
CI / docker (push) Successful in 1m11s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 49s
- Extend GET /api/v1/logs with cursor pagination, multi-level filtering,
  optional application scoping, and level count aggregation
- Add exchangeId, instanceId, application, mdc fields to log responses
- Refactor ClickHouseLogStore with keyset pagination (N+1 pattern)
- Add LogSearchRequest/LogSearchResponse core domain records
- Create LogSearchPageResponse wrapper DTO
- Add Logs as 4th content tab (Exchanges | Dashboard | Runtime | Logs)
- Implement LogSearch component with debounced search, level filter bar,
  expandable log entries, cursor pagination, and live tail mode
- Add cross-navigation: exchange header → logs, log tab → logs tab
- Update ClickHouseLogStoreIT with cursor, multi-level, cross-app tests

Closes: #104

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 08:47:16 +02:00
hsiegeln
a52751da1b fix: avoid alias shadowing in processor metrics -Merge query
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m11s
CI / docker (push) Successful in 43s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 37s
SonarQube / sonarqube (push) Failing after 1m52s
ClickHouse 24.12 new query analyzer resolves countMerge(total_count)
in the CASE WHEN to the SELECT alias (UInt64) instead of the original
AggregateFunction column when the alias has the same name. Renamed
aliases to tc/fc to avoid the collision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 22:24:50 +02:00
hsiegeln
51780031ea fix: use alias in ORDER BY for processor metrics query
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m8s
CI / docker (push) Successful in 44s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 39s
ClickHouse rejects countMerge() in ORDER BY after GROUP BY because the
column is already finalized to UInt64. Use the SELECT alias instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 22:11:54 +02:00
hsiegeln
805e6d51cb fix: add processor_type to stats_1m_processor_detail MV
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m14s
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / docker (push) Has been cancelled
The table and materialized view were missing the processor_type column,
causing the RouteMetricsController query to fail and the dashboard
processor metrics table to render empty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 22:00:23 +02:00
hsiegeln
f3feaddbfe feat: show distinct attribute keys in cmd-k Attributes tab
All checks were successful
CI / build (push) Successful in 1m58s
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Successful in 1m46s
CI / deploy (push) Successful in 47s
CI / deploy-feature (push) Has been skipped
Add GET /search/attributes/keys endpoint that queries distinct
attribute key names from ClickHouse using JSONExtractKeys. Attribute
keys appear in the cmd-k Attributes tab alongside attribute value
matches from exchange results.

- SearchIndex.distinctAttributeKeys() interface method
- ClickHouseSearchIndex implementation using arrayJoin(JSONExtractKeys)
- SearchController /attributes/keys endpoint
- useAttributeKeys() React Query hook
- buildSearchData includes attribute keys as 'attribute' category items

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 21:39:27 +02:00
hsiegeln
910230cbf8 fix: add <mark> highlighting to search match context snippets
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m14s
CI / docker (push) Successful in 46s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
The command palette renders matchContext via dangerouslySetInnerHTML
expecting HTML with <mark> tags, but extractSnippet() returned plain
text. Wrap the matched term in <mark> tags and escape surrounding
text to prevent XSS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 21:18:04 +02:00