Commit Graph

28 Commits

Author SHA1 Message Date
hsiegeln
694d0eef59 feat: add environment filtering across all APIs and UI
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m8s
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / docker (push) Has been cancelled
Backend: Added optional `environment` query parameter to catalog,
search, stats, timeseries, punchcard, top-errors, logs, and agents
endpoints. ClickHouse queries filter by environment when specified
(literal SQL for AggregatingMergeTree, ? binds for raw tables).
StatsStore interface methods all accept environment parameter.

UI: Added EnvironmentSelector component (compact native select).
LayoutShell extracts distinct environments from agent data and
passes selected environment to catalog and agent queries via URL
search param (?env=). TopBar shows current environment label.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 15:42:26 +02:00
hsiegeln
a188308ec5 feat: implement multitenancy with tenant isolation + environment support
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m8s
CI / docker (push) Successful in 42s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 1m25s
Adds configurable tenant ID (CAMELEER_TENANT_ID env var, default:
"default") and environment as a first-class concept. Each server
instance serves one tenant with multiple environments.

Changes across 36 files:
- TenantProperties config bean for tenant ID injection
- AgentInfo: added environmentId field
- AgentRegistrationRequest: added environmentId field
- All 9 ClickHouse stores: inject tenant ID, replace hardcoded
  "default" constant, add environment to writes/reads
- ChunkAccumulator: configurable tenant ID + environment resolver
- MergedExecution/ProcessorBatch/BufferedLogEntry: added environment
- ClickHouse init.sql: added environment column to all tables,
  updated ORDER BY (tenant→time→env→app), added tenant_id to
  usage_events, updated all MV GROUP BY clauses
- Controllers: pass environmentId through registration/auto-heal
- K8s deploy: added CAMELEER_TENANT_ID env var
- All tests updated for new signatures

Closes #123

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 15:00:18 +02:00
hsiegeln
b04b12220b fix: resolve 25 SonarQube code smells across 21 files
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m2s
CI / docker (push) Successful in 45s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
Remove unused fields (log, rbacService, roleRepository, jwt),
unused variables (agentTps, routeKeys, updated), unused imports
(HttpHeaders, JdbcTemplate). Rename restricted identifier 'record'
to 'auditRecord'/'event'. Return empty collections instead of null.
Replace .collect(Collectors.toList()) with .toList(). Simplify
conditional return in BootstrapTokenValidator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:36:13 +02:00
hsiegeln
8d899cc70c refactor: use HeartbeatRequest from cameleer3-common
Some checks failed
CI / docker (push) Has been cancelled
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / cleanup-branch (push) Has been cancelled
CI / build (push) Has been cancelled
Replace local HeartbeatRequest DTO with the shared model from
cameleer3-common. Message types exchanged between server and agent
belong in the common module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 22:05:26 +02:00
hsiegeln
520b80444a feat(#119): accept route states in heartbeat and state-change events
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Failing after 34s
CI / docker (push) Has been skipped
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Has been skipped
Replace ACK-based route state inference with agent-reported state.
Heartbeats now carry optional routeStates map, and ROUTE_STATE_CHANGED
events update the registry immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:45:13 +02:00
hsiegeln
b714d3363f feat(#119): expose route state in catalog API and sidebar/dashboard
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Failing after 29s
CI / docker (push) Has been skipped
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Has been skipped
Add routeState field to RouteSummary DTO (null for started, 'stopped'
or 'suspended' for non-default states). Sidebar shows stop/pause icons
and state badge for affected routes in both Apps and Routes sections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:15:46 +02:00
hsiegeln
027e45aadf feat(#116): synchronous group command dispatch with multi-agent response collection
Add addGroupCommandWithReplies() to AgentRegistryService that sends commands
to all LIVE agents in a group and returns CompletableFuture per agent for
collecting replies. Update sendGroupCommand() and pushConfigToAgents() to
wait with a shared 10-second deadline, returning CommandGroupResponse with
per-agent status, timeouts, and overall success. Config update endpoint now
returns ConfigUpdateResponse wrapping both the saved config and push result.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:00:56 +02:00
hsiegeln
b73f5e6dd4 feat: add Logs tab with cursor-paginated search, level filters, and live tail
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m3s
CI / docker (push) Successful in 1m11s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 49s
- Extend GET /api/v1/logs with cursor pagination, multi-level filtering,
  optional application scoping, and level count aggregation
- Add exchangeId, instanceId, application, mdc fields to log responses
- Refactor ClickHouseLogStore with keyset pagination (N+1 pattern)
- Add LogSearchRequest/LogSearchResponse core domain records
- Create LogSearchPageResponse wrapper DTO
- Add Logs as 4th content tab (Exchanges | Dashboard | Runtime | Logs)
- Implement LogSearch component with debounced search, level filter bar,
  expandable log entries, cursor pagination, and live tail mode
- Add cross-navigation: exchange header → logs, log tab → logs tab
- Update ClickHouseLogStoreIT with cursor, multi-level, cross-app tests

Closes: #104

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 08:47:16 +02:00
hsiegeln
f82aa26371 fix: improve ClickHouse admin page, fix AgentHealth type error
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m13s
CI / docker (push) Successful in 3m46s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 58s
Rewrite ClickHouse admin to show useful storage metrics instead of
often-empty system.events data. Add active queries section.

- Replace performance endpoint: query system.parts for disk size,
  uncompressed size, compression ratio, total rows, part count
- Add /queries endpoint querying system.processes for active queries
- Frontend: storage overview strip, tables with total size, active
  queries DataTable
- Fix AgentHealth.tsx type: agentId → instanceId in inline type cast

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 20:18:06 +02:00
hsiegeln
188810e54b feat: remove TimescaleDB, dead PG stores, and storage feature flags
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Failing after 32s
CI / docker (push) Has been skipped
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Has been skipped
Complete the ClickHouse migration by removing all PostgreSQL analytics
code. PostgreSQL now serves only RBAC, config, and audit — all
observability data is exclusively in ClickHouse.

- Delete 6 dead PostgreSQL store classes (executions, stats, diagrams,
  events, metrics, metrics-query) and 2 integration tests
- Delete RetentionScheduler (ClickHouse TTL handles retention)
- Remove all 7 cameleer.storage.* feature flags from application.yml
- Remove all @ConditionalOnProperty from ClickHouse beans in StorageBeanConfig
- Consolidate 14 Flyway migrations (V1-V14) into single clean V1 with
  only RBAC/config/audit tables (no TimescaleDB, no analytics tables)
- Switch from timescale/timescaledb-ha:pg16 to postgres:16 everywhere
  (docker-compose, deploy/postgres.yaml, test containers)
- Remove TimescaleDB check and /metrics-pipeline from DatabaseAdminController
- Set clickhouse.enabled default to true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 20:10:58 +02:00
hsiegeln
283e38a20d feat: remove OpenSearch, add ClickHouse admin page
Some checks failed
CI / cleanup-branch (push) Has been skipped
CI / build (push) Failing after 33s
CI / docker (push) Has been skipped
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Has been skipped
Remove all OpenSearch code, dependencies, configuration, deployment
manifests, and CI/CD references. Replace the OpenSearch admin page
with a ClickHouse admin page showing cluster status, table sizes,
performance metrics, and indexer pipeline stats.

- Delete 11 OpenSearch Java files (config, search impl, admin controller, DTOs, tests)
- Delete 3 OpenSearch frontend files (admin page, CSS, query hooks)
- Delete deploy/opensearch.yaml K8s manifest
- Remove opensearch Maven dependencies from pom.xml
- Remove opensearch config from application.yml, Dockerfile, docker-compose
- Remove opensearch from CI workflow (secrets, deploy, cleanup steps)
- Simplify ThresholdConfig (remove OpenSearch thresholds, database-only)
- Change default search backend from opensearch to clickhouse
- Add ClickHouseAdminController with /status, /tables, /performance, /pipeline
- Add ClickHouseAdminPage with StatCards, pipeline ProgressBar, tables DataTable
- Update CLAUDE.md, HOWTO.md, and source comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:56:06 +02:00
hsiegeln
909d713837 feat: rename agent identity fields for protocol v2 + add SHUTDOWN lifecycle state
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m7s
CI / docker (push) Successful in 45s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 22s
Align all internal naming with the agent team's protocol v2 identity rename:
- agentId → instanceId (unique per-JVM identifier)
- applicationName → applicationId (shared app identifier)
- AgentInfo: id → instanceId, name → displayName, application → applicationId

Add SHUTDOWN lifecycle state for graceful agent shutdowns:
- New POST /data/events endpoint receives agent lifecycle events
- AGENT_STOPPED event transitions agent to SHUTDOWN (skips STALE/DEAD)
- New POST /{id}/deregister endpoint removes agent from registry
- Server now distinguishes graceful shutdown from crash (heartbeat timeout)

Includes ClickHouse V9 and PostgreSQL V14 migrations for column renames.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 12:22:42 +02:00
hsiegeln
715cbc1894 feat: synchronous replay endpoint with agent response status
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m8s
CI / docker (push) Successful in 56s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
Add dedicated POST /agents/{id}/replay endpoint that uses
addCommandWithReply to wait for the agent ACK (30s timeout).
Returns the actual replay result (status, message, data) instead
of just a delivery confirmation.

Frontend toast now reflects the agent's response: "Replay completed"
on success, agent error message on failure, timeout message if the
agent doesn't respond.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 22:48:02 +02:00
hsiegeln
213aa86c47 feat: progressive drill-down dashboard with RED metrics and SLA compliance (#94)
Three-level dashboard driven by sidebar selection:
- L1 (no selection): all-apps overview with health table, per-app charts
- L2 (app selected): route performance table, error velocity, top errors
- L3 (route selected): processor table, latency heatmap data, bottleneck KPI

Backend: 3 new endpoints (timeseries/by-app, timeseries/by-route, errors/top),
per-app SLA settings (app_settings table, V12 migration), exact SLA compliance
from executions hypertable, error velocity with acceleration detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:29:20 +02:00
hsiegeln
e5e6175aca feat: use endpointUri for cross-route drill-down instead of label parsing
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m2s
CI / docker (push) Successful in 1m0s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 38s
Server:
- Add endpointUri to PositionedNode (from RouteNode)
- Add fromEndpointUri to RouteSummary (catalog API)
- Catalog controller resolves endpoint URI from diagram store

UI:
- Build endpointRouteMap from catalog's fromEndpointUri field
- Drill-down uses exact match on node.endpointUri against the map
- Remove label parsing heuristics (extractTargetEndpoint, camelToKebab)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 18:31:08 +01:00
hsiegeln
d6d96aad07 feat: add TEST_EXPRESSION command with request-reply infrastructure
Adds CompletableFuture-based request-reply mechanism for commands that
need synchronous results. CommandReply record in core, pendingReplies
map in AgentRegistryService, test-expression endpoint on config controller
with 5s timeout. CommandAckRequest extended with optional data field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 18:27:59 +01:00
hsiegeln
b612941aae feat: wire up application logs from OpenSearch, fix event autoscroll
All checks were successful
CI / build (push) Successful in 55s
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Successful in 51s
CI / deploy (push) Successful in 37s
CI / deploy-feature (push) Has been skipped
Add GET /api/v1/logs endpoint to query application logs stored in
OpenSearch with filters for application, agent, level, time range,
and text search. Wire up the AgentInstance LogViewer with real data
and an EventFeed-style toolbar (search input + level filter pills).

Fix agent events timeline autoscroll by reversing the DESC-ordered
events so newest entries appear at the bottom where EventFeed
autoscrolls to.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 18:56:13 +01:00
4d9a9ff851 feat: add CommandAckRequest DTO for enriched command acknowledgments
Some checks failed
CI / build (push) Has started running
CI / docker (push) Has been cancelled
CI / deploy (push) Has been cancelled
CI / deploy-feature (push) Has been cancelled
CI / cleanup-branch (push) Has been cancelled
2026-03-24 16:10:27 +01:00
hsiegeln
ff76751629 refactor: rename agent group→application across entire codebase
All checks were successful
CI / build (push) Successful in 1m22s
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Successful in 52s
CI / deploy (push) Successful in 39s
CI / deploy-feature (push) Has been skipped
Complete the group→application terminology rename in the agent
registry subsystem:

- AgentInfo: field group → application, all wither methods updated
- AgentRegistryService: findByGroup → findByApplication
- AgentInstanceResponse: field group → application (API response)
- AgentRegistrationRequest: field group → application (API request)
- JwtServiceImpl: parameter names group → application (JWT claim
  string "group" preserved for token backward compatibility)
- All controllers, lifecycle monitor, command controller updated
- Integration tests: JSON request bodies "group" → "application"
- Frontend: schema.d.ts, openapi.json, agent queries, AgentHealth

RBAC group references (groups table, GroupAdminController, etc.)
are NOT affected — they are a separate domain concept.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 08:48:12 +01:00
hsiegeln
f4d2693561 feat: enrich AgentInstanceResponse with version/capabilities, add password reset endpoint
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 18:13:37 +01:00
hsiegeln
2051572ee2 feat: add GET /agents/{id}/metrics endpoint for JVM metrics
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 18:11:22 +01:00
hsiegeln
cc433b4215 feat: add GET /routes/metrics/processors endpoint
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 18:10:54 +01:00
hsiegeln
2b111c603c feat: migrate UI to @cameleer/design-system, add backend endpoints
Some checks failed
CI / build (push) Failing after 47s
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Has been skipped
CI / deploy (push) Has been skipped
CI / deploy-feature (push) Has been skipped
Backend:
- Add agent_events table (V5) and lifecycle event recording
- Add route catalog endpoint (GET /routes/catalog)
- Add route metrics endpoint (GET /routes/metrics)
- Add agent events endpoint (GET /agents/events-log)
- Enrich AgentInstanceResponse with tps, errorRate, activeRoutes, uptimeSeconds
- Add TimescaleDB retention/compression policies (V6)

Frontend:
- Replace custom Mission Control UI with @cameleer/design-system components
- Rebuild all pages: Dashboard, ExchangeDetail, RoutesMetrics, AgentHealth,
  AgentInstance, RBAC, AuditLog, OIDC, DatabaseAdmin, OpenSearchAdmin, Swagger
- New LayoutShell with design system AppShell, Sidebar, TopBar, CommandPalette
- Consume design system from Gitea npm registry (@cameleer/design-system@0.0.1)
- Add .npmrc for scoped registry, update Dockerfile with REGISTRY_TOKEN arg

CI:
- Pass REGISTRY_TOKEN build-arg to UI Docker build step

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 17:38:39 +01:00
hsiegeln
17ef48e392 fix: return rotated refresh token from agent token refresh endpoint
All checks were successful
CI / build (push) Successful in 1m22s
CI / cleanup-branch (push) Has been skipped
CI / docker (push) Successful in 56s
CI / deploy (push) Successful in 47s
CI / deploy-feature (push) Has been skipped
Previously the refresh endpoint only returned a new accessToken, causing
agents to lose their refreshToken after the first refresh cycle and
forcing a full re-registration every ~2 hours.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 16:44:16 +01:00
hsiegeln
0cea8af6bc feat: add response/request DTOs for admin infrastructure endpoints
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 15:51:31 +01:00
hsiegeln
a6f94e8a70 Full OIDC logout with id_token_hint for provider session termination
Some checks failed
CI / build (push) Successful in 1m10s
CI / docker (push) Successful in 48s
CI / deploy (push) Has been cancelled
Return the OIDC id_token in the callback response so the frontend can
store it and pass it as id_token_hint to the provider's end-session
endpoint on logout. This lets Authentik (or any OIDC provider) honor
the post_logout_redirect_uri and redirect back to the Cameleer login
page instead of showing the provider's own logout page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 16:14:07 +01:00
hsiegeln
463cab1196 Add displayName to auth response and configurable display name claim for OIDC
Some checks failed
CI / build (push) Successful in 1m11s
CI / docker (push) Successful in 49s
CI / deploy (push) Failing after 2m9s
- Add displayName field to AuthTokenResponse so the UI shows human-readable
  names instead of internal JWT subjects (e.g. user:oidc:<hash>)
- Add displayNameClaim to OIDC config (default: "name") allowing admins to
  configure which ID token claim contains the user's display name
- Support dot-separated claim paths (e.g. profile.display_name) like rolesClaim
- Add admin UI field for Display Name Claim on the OIDC config page
- ClickHouse migration: ALTER TABLE adds display_name_claim column

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 16:09:24 +01:00
hsiegeln
465f210aee Contract-first API with DTOs, validation, and server-side OpenAPI post-processing
All checks were successful
CI / build (push) Successful in 1m27s
CI / docker (push) Successful in 2m6s
CI / deploy (push) Successful in 30s
Add dedicated request/response DTOs for all controllers, replacing raw
JsonNode parameters with validated types. Move OpenAPI path-prefix stripping
and ProcessorNode children injection into OpenApiCustomizer beans so the
spec served at /api/v1/api-docs is already clean — eliminating the need for
the ui/scripts/process-openapi.mjs post-processing script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 15:33:37 +01:00