From 810f493639b016b201501134da00e1f786c0783f Mon Sep 17 00:00:00 2001 From: hsiegeln <37154749+hsiegeln@users.noreply.github.com> Date: Thu, 16 Apr 2026 09:26:53 +0200 Subject: [PATCH] chore: track .claude/rules/ and add self-maintenance instruction MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Un-ignore .claude/rules/ so path-scoped rule files are shared via git. Add instruction in CLAUDE.md to update rule files when modifying classes, controllers, endpoints, or metrics — keeps rules current as part of normal workflow rather than requiring separate maintenance. Co-Authored-By: Claude Opus 4.6 (1M context) --- .claude/rules/app-classes.md | 109 ++++++++++++++++++++++++++ .claude/rules/cicd.md | 24 ++++++ .claude/rules/core-classes.md | 97 +++++++++++++++++++++++ .claude/rules/docker-orchestration.md | 76 ++++++++++++++++++ .claude/rules/gitnexus.md | 98 +++++++++++++++++++++++ .claude/rules/metrics.md | 85 ++++++++++++++++++++ .claude/rules/ui.md | 43 ++++++++++ .gitignore | 3 +- CLAUDE.md | 4 + 9 files changed, 538 insertions(+), 1 deletion(-) create mode 100644 .claude/rules/app-classes.md create mode 100644 .claude/rules/cicd.md create mode 100644 .claude/rules/core-classes.md create mode 100644 .claude/rules/docker-orchestration.md create mode 100644 .claude/rules/gitnexus.md create mode 100644 .claude/rules/metrics.md create mode 100644 .claude/rules/ui.md diff --git a/.claude/rules/app-classes.md b/.claude/rules/app-classes.md new file mode 100644 index 00000000..c561df90 --- /dev/null +++ b/.claude/rules/app-classes.md @@ -0,0 +1,109 @@ +--- +paths: + - "cameleer-server-app/**" +--- + +# App Module Key Classes + +`cameleer-server-app/src/main/java/com/cameleer/server/app/` + +## controller/ — REST endpoints + +- `AgentRegistrationController` — POST /register, POST /heartbeat, GET / (list), POST /refresh-token +- `AgentSseController` — GET /sse (Server-Sent Events connection) +- `AgentCommandController` — POST 
/broadcast, POST /{agentId}, POST /{agentId}/ack +- `AppController` — CRUD /api/v1/apps, POST /{appId}/upload-jar, GET /{appId}/versions +- `DeploymentController` — GET/POST /api/v1/apps/{appId}/deployments, POST /{id}/stop, POST /{id}/promote, GET /{id}/logs +- `EnvironmentAdminController` — CRUD /api/v1/admin/environments, PUT /{id}/jar-retention +- `ExecutionController` — GET /api/v1/executions (search + detail) +- `SearchController` — POST /api/v1/search, GET /routes, GET /top-errors, GET /punchcard +- `LogQueryController` — GET /api/v1/logs (filters: source, application, agentId, exchangeId, level, logger, q, environment, time range) +- `LogIngestionController` — POST /api/v1/data/logs (accepts `List` JSON array, each entry has `source`: app/agent). Logs WARN for: missing agent identity, unregistered agents, empty payloads, buffer-full drops, deserialization failures. Normal acceptance at DEBUG. +- `CatalogController` — GET /api/v1/catalog (unified app catalog merging PG managed apps + in-memory agents + CH stats), DELETE /api/v1/catalog/{applicationId} (ADMIN: dismiss app, purge all CH data + PG record). Auto-filters discovered apps older than `discoveryttldays` with no live agents. +- `ChunkIngestionController` — POST /api/v1/ingestion/chunk/{executions|metrics|diagrams} +- `UserAdminController` — CRUD /api/v1/admin/users, POST /{id}/roles, POST /{id}/set-password +- `RoleAdminController` — CRUD /api/v1/admin/roles +- `GroupAdminController` — CRUD /api/v1/admin/groups +- `OidcConfigAdminController` — GET/POST /api/v1/admin/oidc, POST /test +- `SensitiveKeysAdminController` — GET/PUT /api/v1/admin/sensitive-keys. GET returns 200 with config or 204 if not configured. PUT accepts `{ keys: [...] }` with optional `?pushToAgents=true` to fan out merged keys to all LIVE agents. Stored in `server_config` table (key `sensitive_keys`). 
+- `AuditLogController` — GET /api/v1/admin/audit +- `MetricsController` — GET /api/v1/metrics, GET /timeseries +- `DiagramController` — GET /api/v1/diagrams/{id}, POST / +- `DiagramRenderController` — POST /api/v1/diagrams/render (ELK layout) +- `ClaimMappingAdminController` — CRUD /api/v1/admin/claim-mappings, POST /test (accepts inline rules + claims for preview without saving) +- `LicenseAdminController` — GET/POST /api/v1/admin/license +- `AgentEventsController` — GET /api/v1/agent-events (agent state change history) +- `AgentMetricsController` — GET /api/v1/agent-metrics (JVM/Camel metrics per agent instance) +- `AppSettingsController` — GET/PUT /api/v1/apps/{appId}/settings +- `ApplicationConfigController` — GET/PUT /api/v1/apps/{appId}/config (traced processors, route recording, sensitive keys per app) +- `ClickHouseAdminController` — GET /api/v1/admin/clickhouse (ClickHouse admin, conditional on infrastructure endpoints) +- `DatabaseAdminController` — GET /api/v1/admin/database (PG admin, conditional on infrastructure endpoints) +- `DetailController` — GET /api/v1/detail (execution detail with processor tree) +- `EventIngestionController` — POST /api/v1/data/events (agent event ingestion) +- `RbacStatsController` — GET /api/v1/admin/rbac/stats +- `RouteCatalogController` — GET /api/v1/routes/catalog (merged route catalog from registry + ClickHouse) +- `RouteMetricsController` — GET /api/v1/route-metrics (per-route Camel metrics) +- `ThresholdAdminController` — CRUD /api/v1/admin/thresholds +- `UsageAnalyticsController` — GET /api/v1/admin/usage (ClickHouse usage_events) + +## runtime/ — Docker orchestration + +- `DockerRuntimeOrchestrator` — implements RuntimeOrchestrator; Docker Java client (zerodep transport), container lifecycle +- `DeploymentExecutor` — @Async staged deploy: PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE. 
Container names are `{tenantId}-{envSlug}-{appSlug}-{replicaIndex}` (globally unique on Docker daemon). Sets per-replica `CAMELEER_AGENT_INSTANCEID` env var to `{envSlug}-{appSlug}-{replicaIndex}`. +- `DockerNetworkManager` — ensures bridge networks (cameleer-traefik, cameleer-env-{slug}), connects containers +- `DockerEventMonitor` — persistent Docker event stream listener (die, oom, start, stop), updates deployment status +- `TraefikLabelBuilder` — generates Traefik Docker labels for path-based or subdomain routing. Also emits `cameleer.replica` and `cameleer.instance-id` labels per container for labels-first identity. +- `PrometheusLabelBuilder` — generates Prometheus Docker labels (`prometheus.scrape/path/port`) per runtime type for `docker_sd_configs` auto-discovery +- `ContainerLogForwarder` — streams Docker container stdout/stderr to ClickHouse with `source='container'`. One follow-stream thread per container, batches lines every 2s/50 lines via `ClickHouseLogStore.insertBufferedBatch()`. 60-second max capture timeout. +- `DisabledRuntimeOrchestrator` — no-op when runtime not enabled + +## metrics/ — Prometheus observability + +- `ServerMetrics` — centralized business metrics: gauges (agents by state, SSE connections, buffer depths), counters (ingestion drops, agent transitions, deployment outcomes, auth failures), timers (flush duration, deployment duration). Exposed via `/api/v1/prometheus`. 
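
The tag-scoped counters in `ServerMetrics` (e.g. `cameleer.ingestion.drops{reason=...}`) follow a common pattern that can be sketched without Micrometer. This is a simplified, dependency-free illustration of the idea only — the real class registers Micrometer meters and exposes them through `/api/v1/prometheus`; the class and method names below are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Simplified sketch of tag-scoped business counters. Illustrative only:
// ServerMetrics itself uses Micrometer Counter/Gauge/Timer instances.
class TaggedCounters {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Key combines metric name and one tag, mirroring e.g.
    // cameleer.ingestion.drops{reason="buffer_full"}.
    private static String key(String metric, String tagKey, String tagValue) {
        return metric + "{" + tagKey + "=\"" + tagValue + "\"}";
    }

    void increment(String metric, String tagKey, String tagValue) {
        counters.computeIfAbsent(key(metric, tagKey, tagValue), k -> new LongAdder())
                .increment();
    }

    long get(String metric, String tagKey, String tagValue) {
        LongAdder adder = counters.get(key(metric, tagKey, tagValue));
        return adder == null ? 0L : adder.sum();
    }
}
```

The per-tag `LongAdder` keeps increments cheap under concurrent ingestion, which matters for hot paths like drop counting in `LogIngestionController`.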
+ +## storage/ — PostgreSQL repositories (JdbcTemplate) + +- `PostgresAppRepository`, `PostgresAppVersionRepository`, `PostgresEnvironmentRepository` +- `PostgresDeploymentRepository` — includes JSONB replica_states, deploy_stage, findByContainerId +- `PostgresUserRepository`, `PostgresRoleRepository`, `PostgresGroupRepository` +- `PostgresAuditRepository`, `PostgresOidcConfigRepository`, `PostgresClaimMappingRepository`, `PostgresSensitiveKeysRepository` +- `PostgresAppSettingsRepository`, `PostgresApplicationConfigRepository`, `PostgresThresholdRepository` + +## storage/ — ClickHouse stores + +- `ClickHouseExecutionStore`, `ClickHouseMetricsStore`, `ClickHouseMetricsQueryStore` +- `ClickHouseStatsStore` — pre-aggregated stats, punchcard +- `ClickHouseDiagramStore`, `ClickHouseAgentEventRepository` +- `ClickHouseUsageTracker` — usage_events for billing + +## search/ — ClickHouse search and log stores + +- `ClickHouseLogStore` — log storage and query, MDC-based exchange/processor correlation +- `ClickHouseSearchIndex` — full-text search + +## security/ — Spring Security + +- `SecurityConfig` — WebSecurityFilterChain, JWT filter, CORS, OIDC conditional +- `JwtAuthenticationFilter` — OncePerRequestFilter, validates Bearer tokens +- `JwtServiceImpl` — HMAC-SHA256 JWT (Nimbus JOSE) +- `OidcAuthController` — /api/v1/auth/oidc (login-uri, token-exchange, logout) +- `OidcTokenExchanger` — code -> tokens, role extraction from access_token then id_token +- `OidcProviderHelper` — OIDC discovery, JWK source cache + +## agent/ — Agent lifecycle + +- `SseConnectionManager` — manages per-agent SSE connections, delivers commands +- `AgentLifecycleMonitor` — @Scheduled 10s, LIVE->STALE->DEAD transitions +- `SsePayloadSigner` — Ed25519 signs SSE payloads for agent verification + +## retention/ — JAR cleanup + +- `JarRetentionJob` — @Scheduled 03:00 daily, per-environment retention, skips deployed versions + +## config/ — Spring beans + +- `RuntimeOrchestratorAutoConfig` — 
conditional Docker/Disabled orchestrator + NetworkManager + EventMonitor +- `RuntimeBeanConfig` — DeploymentExecutor, AppService, EnvironmentService +- `SecurityBeanConfig` — JwtService, Ed25519, BootstrapTokenValidator +- `StorageBeanConfig` — all repositories +- `ClickHouseConfig` — ClickHouse JdbcTemplate, schema initializer diff --git a/.claude/rules/cicd.md b/.claude/rules/cicd.md new file mode 100644 index 00000000..98c83942 --- /dev/null +++ b/.claude/rules/cicd.md @@ -0,0 +1,24 @@ +--- +paths: + - ".gitea/**" + - "deploy/**" + - "Dockerfile" + - "docker-entrypoint.sh" +--- + +# CI/CD & Deployment + +- CI workflow: `.gitea/workflows/ci.yml` — build -> docker -> deploy on push to main or feature branches +- Build step skips integration tests (`-DskipITs`) — Testcontainers needs Docker daemon +- Docker: multi-stage build (`Dockerfile`), `$BUILDPLATFORM` for native Maven on ARM64 runner, amd64 runtime. `docker-entrypoint.sh` imports `/certs/ca.pem` into JVM truststore before starting the app (supports custom CAs for OIDC discovery without `CAMELEER_SERVER_SECURITY_OIDCTLSSKIPVERIFY`). +- `REGISTRY_TOKEN` build arg required for `cameleer-common` dependency resolution +- Registry: `gitea.siegeln.net/cameleer/cameleer-server` (container images) +- K8s manifests in `deploy/` — Kustomize base + overlays (main/feature), shared infra (PostgreSQL, ClickHouse, Logto) as top-level manifests +- Deployment target: k3s at 192.168.50.86, namespace `cameleer` (main), `cam-` (feature branches) +- Feature branches: isolated namespace, PG schema; Traefik Ingress at `-api.cameleer.siegeln.net` +- Secrets managed in CI deploy step (idempotent `--dry-run=client | kubectl apply`): `cameleer-auth`, `cameleer-postgres-credentials`, `cameleer-clickhouse-credentials` +- K8s probes: server uses `/api/v1/health`, PostgreSQL uses `pg_isready -U "$POSTGRES_USER"` (env var, not hardcoded) +- K8s security: server and database pods run with `securityContext.runAsNonRoot`. 
UI (nginx) runs without securityContext (needs root for entrypoint setup). +- Docker: server Dockerfile has no default credentials — all DB config comes from env vars at runtime +- Docker build uses buildx registry cache + `--provenance=false` for Gitea compatibility +- CI: branch slug sanitization extracted to `.gitea/sanitize-branch.sh`, sourced by docker and deploy-feature jobs diff --git a/.claude/rules/core-classes.md b/.claude/rules/core-classes.md new file mode 100644 index 00000000..61d22934 --- /dev/null +++ b/.claude/rules/core-classes.md @@ -0,0 +1,97 @@ +--- +paths: + - "cameleer-server-core/**" +--- + +# Core Module Key Classes + +`cameleer-server-core/src/main/java/com/cameleer/server/core/` + +## agent/ — Agent lifecycle and commands + +- `AgentRegistryService` — in-memory registry (ConcurrentHashMap), register/heartbeat/lifecycle +- `AgentInfo` — record: id, name, application, environmentId, version, routeIds, capabilities, state +- `AgentCommand` — record: id, type, targetAgent, payload, createdAt, expiresAt +- `AgentEventService` — records agent state changes, heartbeats +- `AgentState` — enum: LIVE, STALE, DEAD, SHUTDOWN +- `CommandType` — enum for command types (config-update, deep-trace, replay, route-control, etc.) 
+- `CommandStatus` — enum for command acknowledgement states +- `CommandReply` — record: command execution result from agent +- `AgentEventRecord`, `AgentEventRepository` — event persistence +- `AgentEventListener` — callback interface for agent events +- `RouteStateRegistry` — tracks per-agent route states + +## runtime/ — App/Environment/Deployment domain + +- `App` — record: id, environmentId, slug, displayName, containerConfig (JSONB) +- `AppVersion` — record: id, appId, version, jarPath, detectedRuntimeType, detectedMainClass +- `Environment` — record: id, slug, jarRetentionCount +- `Deployment` — record: id, appId, appVersionId, environmentId, status, targetState, deploymentStrategy, replicaStates (JSONB), deployStage, containerId, containerName +- `DeploymentStatus` — enum: STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED +- `DeployStage` — enum: PRE_FLIGHT, PULL_IMAGE, CREATE_NETWORK, START_REPLICAS, HEALTH_CHECK, SWAP_TRAFFIC, COMPLETE +- `DeploymentService` — createDeployment (deletes terminal deployments first), markRunning, markFailed, markStopped +- `RuntimeType` — enum: AUTO, SPRING_BOOT, QUARKUS, PLAIN_JAVA, NATIVE +- `RuntimeDetector` — probes JAR files at upload time: detects runtime from manifest Main-Class (Spring Boot loader, Quarkus entry point, plain Java) or native binary (non-ZIP magic bytes) +- `ContainerRequest` — record: 20 fields for Docker container creation (includes runtimeType, customArgs, mainClass) +- `ContainerStatus` — record: state, running, exitCode, error +- `ResolvedContainerConfig` — record: typed config with memoryLimitMb, memoryReserveMb, cpuRequest, cpuLimit, appPort, exposedPorts, customEnvVars, stripPathPrefix, sslOffloading, routingMode, routingDomain, serverUrl, replicas, deploymentStrategy, routeControlEnabled, replayEnabled, runtimeType, customArgs, extraNetworks +- `RoutingMode` — enum for routing strategies +- `ConfigMerger` — pure function: resolve(globalDefaults, envConfig, appConfig) -> 
ResolvedContainerConfig +- `RuntimeOrchestrator` — interface: startContainer, stopContainer, getContainerStatus, getLogs, startLogCapture, stopLogCapture +- `AppRepository`, `AppVersionRepository`, `EnvironmentRepository`, `DeploymentRepository` — repository interfaces +- `AppService`, `EnvironmentService` — domain services + +## search/ — Execution search and stats + +- `SearchService` — search, count, stats, statsForApp, timeseries, timeseriesForApp, timeseriesForRoute, timeseriesGroupedByApp, timeseriesGroupedByRoute, slaCompliance, slaCountsByApp, slaCountsByRoute, topErrors, activeErrorTypes, punchcard, distinctAttributeKeys +- `SearchRequest` / `SearchResult` — search DTOs +- `ExecutionStats`, `ExecutionSummary` — stats aggregation records +- `StatsTimeseries`, `TopError` — timeseries and error DTOs +- `LogSearchRequest` / `LogSearchResponse` — log search DTOs + +## storage/ — Storage abstractions + +- `ExecutionStore`, `MetricsStore`, `MetricsQueryStore`, `StatsStore`, `DiagramStore`, `SearchIndex`, `LogIndex` — interfaces +- `LogEntryResult` — log query result record +- `model/` — `ExecutionDocument`, `MetricTimeSeries`, `MetricsSnapshot` + +## rbac/ — Role-based access control + +- `RbacService` — interface: role/group CRUD, assignRoleToUser, removeRoleFromUser, addUserToGroup, removeUserFromGroup, getDirectRolesForUser, getEffectiveRolesForUser, clearManagedAssignments, assignManagedRole, addUserToManagedGroup, getStats, listUsers +- `SystemRole` — enum: AGENT, VIEWER, OPERATOR, ADMIN; `normalizeScope()` maps scopes +- `UserDetail`, `RoleDetail`, `GroupDetail` — records +- `UserSummary`, `RoleSummary`, `GroupSummary` — lightweight list records +- `RbacStats` — aggregate stats record +- `AssignmentOrigin` — enum: DIRECT, CLAIM_MAPPING (tracks how roles were assigned) +- `ClaimMappingRule` — record: OIDC claim-to-role mapping rule +- `ClaimMappingService` — interface: CRUD for claim mapping rules +- `ClaimMappingRepository` — persistence interface +- 
`RoleRepository`, `GroupRepository` — persistence interfaces + +## admin/ — Server-wide admin config + +- `SensitiveKeysConfig` — record: keys (List, immutable) +- `SensitiveKeysRepository` — interface: find(), save() +- `SensitiveKeysMerger` — pure function: merge(global, perApp) -> union with case-insensitive dedup, preserves first-seen casing. Returns null when both inputs null. +- `AppSettings`, `AppSettingsRepository` — per-app settings config and persistence +- `ThresholdConfig`, `ThresholdRepository` — alerting threshold config and persistence +- `AuditService` — audit logging facade +- `AuditRecord`, `AuditResult`, `AuditCategory`, `AuditRepository` — audit trail records and persistence + +## security/ — Auth + +- `JwtService` — interface: createAccessToken, createRefreshToken, validateAccessToken, validateRefreshToken +- `Ed25519SigningService` — interface: sign, getPublicKeyBase64 (config signing) +- `OidcConfig` — record: enabled, issuerUri, clientId, clientSecret, rolesClaim, defaultRoles, autoSignup, displayNameClaim, userIdClaim, audience, additionalScopes +- `OidcConfigRepository` — persistence interface +- `PasswordPolicyValidator` — min 12 chars, 3-of-4 character classes, no username match +- `UserInfo`, `UserRepository` — user identity records and persistence +- `InvalidTokenException` — thrown on revoked/expired tokens + +## ingestion/ — Buffered data pipeline + +- `IngestionService` — ingestExecution, ingestMetric, ingestLog, ingestDiagram +- `ChunkAccumulator` — batches data for efficient flush +- `WriteBuffer` — bounded ring buffer for async flush +- `BufferedLogEntry` — log entry wrapper with metadata +- `MergedExecution`, `TaggedExecution`, `TaggedDiagram` — tagged ingestion records diff --git a/.claude/rules/docker-orchestration.md b/.claude/rules/docker-orchestration.md new file mode 100644 index 00000000..7e762a53 --- /dev/null +++ b/.claude/rules/docker-orchestration.md @@ -0,0 +1,76 @@ +--- +paths: + - 
"cameleer-server-app/**/runtime/**" + - "cameleer-server-core/**/runtime/**" + - "deploy/**" + - "docker-compose*.yml" + - "Dockerfile" + - "docker-entrypoint.sh" +--- + +# Docker Orchestration + +When deployed via the cameleer-saas platform, this server orchestrates customer app containers using Docker. Key components: + +- **ConfigMerger** (`core/runtime/ConfigMerger.java`) — pure function: resolve(globalDefaults, envConfig, appConfig) -> ResolvedContainerConfig. Three-layer merge: global (application.yml) -> environment (defaultContainerConfig JSONB) -> app (containerConfig JSONB). Includes `runtimeType` (default `"auto"`) and `customArgs` (default `""`). +- **TraefikLabelBuilder** (`app/runtime/TraefikLabelBuilder.java`) — generates Traefik Docker labels for path-based (`/{envSlug}/{appSlug}/`) or subdomain-based (`{appSlug}-{envSlug}.{domain}`) routing. Supports strip-prefix and SSL offloading toggles. Also sets per-replica identity labels: `cameleer.replica` (index) and `cameleer.instance-id` (`{envSlug}-{appSlug}-{replicaIndex}`). Internal processing uses labels (not container name parsing) for extensibility. +- **PrometheusLabelBuilder** (`app/runtime/PrometheusLabelBuilder.java`) — generates Prometheus `docker_sd_configs` labels per resolved runtime type: Spring Boot `/actuator/prometheus:8081`, Quarkus/native `/q/metrics:9000`, plain Java `/metrics:9464`. Labels merged into container metadata alongside Traefik labels at deploy time. +- **DockerNetworkManager** (`app/runtime/DockerNetworkManager.java`) — manages two Docker network tiers: + - `cameleer-traefik` — shared network; Traefik, server, and all app containers attach here. Server joined via docker-compose with `cameleer-server` DNS alias. + - `cameleer-env-{slug}` — per-environment isolated network; containers in the same environment discover each other via Docker DNS. 
In SaaS mode, env networks are tenant-scoped: `cameleer-env-{tenantId}-{envSlug}` (overloaded `envNetworkName(tenantId, envSlug)` method) to prevent cross-tenant collisions when multiple tenants have identically-named environments. +- **DockerEventMonitor** (`app/runtime/DockerEventMonitor.java`) — persistent Docker event stream listener for containers with `managed-by=cameleer-server` label. Detects die/oom/start/stop events and updates deployment replica states. Periodic reconciliation (@Scheduled every 30s) inspects actual container state and corrects deployment status mismatches (fixes stale DEGRADED with all replicas healthy). +- **DeploymentProgress** (`ui/src/components/DeploymentProgress.tsx`) — UI step indicator showing 7 deploy stages with amber active/green completed styling. +- **ContainerLogForwarder** (`app/runtime/ContainerLogForwarder.java`) — streams Docker container stdout/stderr to ClickHouse `logs` table with `source='container'`. Uses `docker logs --follow` per container, batches lines every 2s or 50 lines. Parses Docker timestamp prefix, infers log level via regex. `DeploymentExecutor` starts capture after each replica launches with the replica's `instanceId` (`{envSlug}-{appSlug}-{replicaIndex}`); `DockerEventMonitor` stops capture on die/oom. 60-second max capture timeout with 30s cleanup scheduler. Thread pool of 10 daemon threads. Container logs use the same `instanceId` as the agent (set via `CAMELEER_AGENT_INSTANCEID` env var) for unified log correlation at the instance level. +- **StartupLogPanel** (`ui/src/components/StartupLogPanel.tsx`) — collapsible log panel rendered below `DeploymentProgress`. Queries `/api/v1/logs?source=container&application={appSlug}&environment={envSlug}`. Auto-polls every 3s while deployment is STARTING; shows green "live" badge during polling, red "stopped" badge on FAILED. Uses `useStartupLogs` hook and `LogViewer` (design system). 
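
The naming conventions above — per-replica instance IDs, globally unique container names, and tenant-scoped environment networks — can be sketched as pure string helpers. The formats come from this document; the class and method names are illustrative, not the actual `DockerNetworkManager`/`DeploymentExecutor` API:

```java
// Sketch of the naming scheme described above; helper names are hypothetical.
class CameleerNames {
    // Per-replica identity shared by agent and container logs
    // (also set as the CAMELEER_AGENT_INSTANCEID env var).
    static String instanceId(String envSlug, String appSlug, int replicaIndex) {
        return envSlug + "-" + appSlug + "-" + replicaIndex;
    }

    // Globally unique container name on the Docker daemon:
    // {tenantId}-{envSlug}-{appSlug}-{replicaIndex}.
    static String containerName(String tenantId, String envSlug,
                                String appSlug, int replicaIndex) {
        return tenantId + "-" + instanceId(envSlug, appSlug, replicaIndex);
    }

    // Tenant-scoped per-environment discovery network (SaaS mode),
    // preventing collisions between identically named environments.
    static String envNetworkName(String tenantId, String envSlug) {
        return "cameleer-env-" + tenantId + "-" + envSlug;
    }
}
```

Because the instance ID is both a container label (`cameleer.instance-id`) and the agent's identity, container logs and agent logs correlate at the instance level without parsing container names.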
+ +## DeploymentExecutor Details + +Primary network for app containers is set via `CAMELEER_SERVER_RUNTIME_DOCKERNETWORK` env var (in SaaS mode: `cameleer-tenant-{slug}`); apps also connect to `cameleer-traefik` (routing) and `cameleer-env-{tenantId}-{envSlug}` (per-environment discovery) as additional networks. Resolves `runtimeType: auto` to concrete type from `AppVersion.detectedRuntimeType` at PRE_FLIGHT (fails deployment if unresolvable). Builds Docker entrypoint per runtime type (all JVM types use `-javaagent:/app/agent.jar -jar`, plain Java uses `-cp` with main class, native runs binary directly). Sets per-replica `CAMELEER_AGENT_INSTANCEID` env var to `{envSlug}-{appSlug}-{replicaIndex}` so container logs and agent logs share the same instance identity. Sets `CAMELEER_AGENT_*` env vars from `ResolvedContainerConfig` (routeControlEnabled, replayEnabled, health port). These are startup-only agent properties — changing them requires redeployment. + +## Deployment Status Model + +| Status | Meaning | +|--------|---------| +| `STOPPED` | Intentionally stopped or initial state | +| `STARTING` | Deploy in progress | +| `RUNNING` | All replicas healthy and serving | +| `DEGRADED` | Some replicas healthy, some dead | +| `STOPPING` | Graceful shutdown in progress | +| `FAILED` | Terminal failure (pre-flight, health check, or crash) | + +**Replica support**: deployments can specify a replica count. `DEGRADED` is used when at least one but not all replicas are healthy. + +**Deploy stages** (`DeployStage`): PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE (or FAILED at any stage). + +**Blue/green strategy**: when re-deploying, new replicas are started and health-checked before old ones are stopped, minimising downtime. + +**Deployment uniqueness**: `DeploymentService.createDeployment()` deletes any STOPPED/FAILED deployments for the same app+environment before creating a new one, preventing duplicate rows. 
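
The replica-health rule behind the status table can be sketched as a pure function. This is a minimal illustration of the RUNNING/DEGRADED distinction only — the real logic in `DeploymentExecutor`/`DockerEventMonitor` also accounts for deploy stages, intentional stops, and reconciliation, and the class name here is hypothetical:

```java
enum DeploymentStatus { STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED }

// Sketch: map replica health to a status. Illustrative only — the actual
// implementation considers deploy stage and target state as well.
class ReplicaStatusRule {
    static DeploymentStatus fromReplicaHealth(int healthy, int total) {
        if (total <= 0 || healthy < 0 || healthy > total) {
            throw new IllegalArgumentException("invalid replica counts");
        }
        if (healthy == total) return DeploymentStatus.RUNNING;   // all healthy
        if (healthy > 0)     return DeploymentStatus.DEGRADED;   // some, not all
        return DeploymentStatus.FAILED;                          // none healthy
    }
}
```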
+ +## JAR Management + +- **Retention policy** per environment: configurable maximum number of JAR versions to keep. Older JARs are deleted automatically. +- **Nightly cleanup job** (`JarRetentionJob`, Spring `@Scheduled` 03:00): purges JARs exceeding the retention limit and removes orphaned files not referenced by any app version. Skips versions currently deployed. +- **Volume-based JAR mounting** for Docker-in-Docker setups: set `CAMELEER_SERVER_RUNTIME_JARDOCKERVOLUME` to the Docker volume name that contains the JAR storage directory. When set, the orchestrator mounts this volume into the container instead of bind-mounting the host path (required when the SaaS container itself runs inside Docker and the host path is not accessible from sibling containers). + +## Runtime Type Detection + +The server detects the app framework from uploaded JARs and builds Docker entrypoints. The agent shaded JAR bundles the log appender, so no separate `cameleer-log-appender.jar` or `PropertiesLauncher` is needed: + +- **Detection** (`RuntimeDetector`): runs at JAR upload time. Checks ZIP magic bytes (non-ZIP = native binary), then probes `META-INF/MANIFEST.MF` Main-Class: Spring Boot loader prefix -> `spring-boot`, Quarkus entry point -> `quarkus`, other Main-Class -> `plain-java` (extracts class name). Results stored on `AppVersion` (`detected_runtime_type`, `detected_main_class`). +- **Runtime types** (`RuntimeType` enum): `AUTO`, `SPRING_BOOT`, `QUARKUS`, `PLAIN_JAVA`, `NATIVE`. Configurable per app/environment via `containerConfig.runtimeType` (default `"auto"`). +- **Entrypoint per type**: All JVM types use `java -javaagent:/app/agent.jar -jar app.jar`. Plain Java uses `-cp` with explicit main class instead of `-jar`. Native runs the binary directly. +- **Custom arguments** (`containerConfig.customArgs`): freeform string appended to the start command. Validated against a strict pattern to prevent shell injection (entrypoint uses `sh -c`). 
+- **AUTO resolution**: at deploy time (PRE_FLIGHT), `"auto"` resolves to the detected type from `AppVersion`. Fails deployment if detection was unsuccessful — user must set type explicitly. +- **UI**: Resources tab shows Runtime Type dropdown (with detection hint from latest uploaded version) and Custom Arguments text field. + +## SaaS Multi-Tenant Network Isolation + +In SaaS mode, each tenant's server and its deployed apps are isolated at the Docker network level: + +- **Tenant network** (`cameleer-tenant-{slug}`) — primary internal bridge for all of a tenant's containers. Set as `CAMELEER_SERVER_RUNTIME_DOCKERNETWORK` for the tenant's server instance. Tenant A's apps cannot reach tenant B's apps. +- **Shared services network** — server also connects to the shared infrastructure network (PostgreSQL, ClickHouse, Logto) and `cameleer-traefik` for HTTP routing. +- **Tenant-scoped environment networks** (`cameleer-env-{tenantId}-{envSlug}`) — per-environment discovery is scoped per tenant, so `alpha-corp`'s "dev" environment network is separate from `beta-corp`'s "dev" environment network. + +## nginx / Reverse Proxy + +- `client_max_body_size 200m` is required in the nginx config to allow JAR uploads up to 200 MB. Without this, large JAR uploads return 413. diff --git a/.claude/rules/gitnexus.md b/.claude/rules/gitnexus.md new file mode 100644 index 00000000..e2e24175 --- /dev/null +++ b/.claude/rules/gitnexus.md @@ -0,0 +1,98 @@ +# GitNexus — Code Intelligence + +This project is indexed by GitNexus as **cameleer-server** (6306 symbols, 15892 relationships, 300 execution flows). Use the GitNexus MCP tools to understand code, assess impact, and navigate safely. + +> If any GitNexus tool warns the index is stale, run `npx gitnexus analyze` in terminal first. 
+ +## Always Do + +- **MUST run impact analysis before editing any symbol.** Before modifying a function, class, or method, run `gitnexus_impact({target: "symbolName", direction: "upstream"})` and report the blast radius (direct callers, affected processes, risk level) to the user. +- **MUST run `gitnexus_detect_changes()` before committing** to verify your changes only affect expected symbols and execution flows. +- **MUST warn the user** if impact analysis returns HIGH or CRITICAL risk before proceeding with edits. +- When exploring unfamiliar code, use `gitnexus_query({query: "concept"})` to find execution flows instead of grepping. It returns process-grouped results ranked by relevance. +- When you need full context on a specific symbol — callers, callees, which execution flows it participates in — use `gitnexus_context({name: "symbolName"})`. + +## When Debugging + +1. `gitnexus_query({query: ""})` — find execution flows related to the issue +2. `gitnexus_context({name: ""})` — see all callers, callees, and process participation +3. `READ gitnexus://repo/cameleer-server/process/{processName}` — trace the full execution flow step by step +4. For regressions: `gitnexus_detect_changes({scope: "compare", base_ref: "main"})` — see what your branch changed + +## When Refactoring + +- **Renaming**: MUST use `gitnexus_rename({symbol_name: "old", new_name: "new", dry_run: true})` first. Review the preview — graph edits are safe, text_search edits need manual review. Then run with `dry_run: false`. +- **Extracting/Splitting**: MUST run `gitnexus_context({name: "target"})` to see all incoming/outgoing refs, then `gitnexus_impact({target: "target", direction: "upstream"})` to find all external callers before moving code. +- After any refactor: run `gitnexus_detect_changes({scope: "all"})` to verify only expected files changed. + +## Never Do + +- NEVER edit a function, class, or method without first running `gitnexus_impact` on it. 
+- NEVER ignore HIGH or CRITICAL risk warnings from impact analysis. +- NEVER rename symbols with find-and-replace — use `gitnexus_rename` which understands the call graph. +- NEVER commit changes without running `gitnexus_detect_changes()` to check affected scope. + +## Tools Quick Reference + +| Tool | When to use | Command | +|------|-------------|---------| +| `query` | Find code by concept | `gitnexus_query({query: "auth validation"})` | +| `context` | 360-degree view of one symbol | `gitnexus_context({name: "validateUser"})` | +| `impact` | Blast radius before editing | `gitnexus_impact({target: "X", direction: "upstream"})` | +| `detect_changes` | Pre-commit scope check | `gitnexus_detect_changes({scope: "staged"})` | +| `rename` | Safe multi-file rename | `gitnexus_rename({symbol_name: "old", new_name: "new", dry_run: true})` | +| `cypher` | Custom graph queries | `gitnexus_cypher({query: "MATCH ..."})` | + +## Impact Risk Levels + +| Depth | Meaning | Action | +|-------|---------|--------| +| d=1 | WILL BREAK — direct callers/importers | MUST update these | +| d=2 | LIKELY AFFECTED — indirect deps | Should test | +| d=3 | MAY NEED TESTING — transitive | Test if critical path | + +## Resources + +| Resource | Use for | +|----------|---------| +| `gitnexus://repo/cameleer-server/context` | Codebase overview, check index freshness | +| `gitnexus://repo/cameleer-server/clusters` | All functional areas | +| `gitnexus://repo/cameleer-server/processes` | All execution flows | +| `gitnexus://repo/cameleer-server/process/{name}` | Step-by-step execution trace | + +## Self-Check Before Finishing + +Before completing any code modification task, verify: +1. `gitnexus_impact` was run for all modified symbols +2. No HIGH/CRITICAL risk warnings were ignored +3. `gitnexus_detect_changes()` confirms changes match expected scope +4. All d=1 (WILL BREAK) dependents were updated + +## Keeping the Index Fresh + +After committing code changes, the GitNexus index becomes stale. 
Re-run analyze to update it: + +```bash +npx gitnexus analyze +``` + +If the index previously included embeddings, preserve them by adding `--embeddings`: + +```bash +npx gitnexus analyze --embeddings +``` + +To check whether embeddings exist, inspect `.gitnexus/meta.json` — the `stats.embeddings` field shows the count (0 means no embeddings). **Running analyze without `--embeddings` will delete any previously generated embeddings.** + +> Claude Code users: A PostToolUse hook handles this automatically after `git commit` and `git merge`. + +## CLI + +| Task | Read this skill file | +|------|---------------------| +| Understand architecture / "How does X work?" | `.claude/skills/gitnexus/gitnexus-exploring/SKILL.md` | +| Blast radius / "What breaks if I change X?" | `.claude/skills/gitnexus/gitnexus-impact-analysis/SKILL.md` | +| Trace bugs / "Why is X failing?" | `.claude/skills/gitnexus/gitnexus-debugging/SKILL.md` | +| Rename / extract / split / refactor | `.claude/skills/gitnexus/gitnexus-refactoring/SKILL.md` | +| Tools, resources, schema reference | `.claude/skills/gitnexus/gitnexus-guide/SKILL.md` | +| Index, status, clean, wiki CLI commands | `.claude/skills/gitnexus/gitnexus-cli/SKILL.md` | diff --git a/.claude/rules/metrics.md b/.claude/rules/metrics.md new file mode 100644 index 00000000..a2f0f365 --- /dev/null +++ b/.claude/rules/metrics.md @@ -0,0 +1,85 @@ +--- +paths: + - "cameleer-server-app/**/metrics/**" + - "cameleer-server-app/**/ServerMetrics*" + - "ui/src/pages/RuntimeTab/**" + - "ui/src/pages/DashboardTab/**" +--- + +# Prometheus Metrics + +Server exposes `/api/v1/prometheus` (unauthenticated, Prometheus text format). Spring Boot Actuator provides JVM, GC, thread pool, and `http.server.requests` metrics automatically. 
Business metrics via `ServerMetrics` component: + +## Gauges (auto-polled) + +| Metric | Tags | Source | +|--------|------|--------| +| `cameleer.agents.connected` | `state` (live, stale, dead, shutdown) | `AgentRegistryService.findByState()` | +| `cameleer.agents.sse.active` | — | `SseConnectionManager.getConnectionCount()` | +| `cameleer.ingestion.buffer.size` | `type` (execution, processor, log, metrics) | `WriteBuffer.size()` | +| `cameleer.ingestion.accumulator.pending` | — | `ChunkAccumulator.getPendingCount()` | + +## Counters + +| Metric | Tags | Instrumented in | +|--------|------|-----------------| +| `cameleer.ingestion.drops` | `reason` (buffer_full, no_agent, no_identity) | `LogIngestionController` | +| `cameleer.agents.transitions` | `transition` (went_stale, went_dead, recovered) | `AgentLifecycleMonitor` | +| `cameleer.deployments.outcome` | `status` (running, failed, degraded) | `DeploymentExecutor` | +| `cameleer.auth.failures` | `reason` (invalid_token, revoked, oidc_rejected) | `JwtAuthenticationFilter` | + +## Timers + +| Metric | Tags | Instrumented in | +|--------|------|-----------------| +| `cameleer.ingestion.flush.duration` | `type` (execution, processor, log) | `ExecutionFlushScheduler` | +| `cameleer.deployments.duration` | — | `DeploymentExecutor` | + +## Agent container Prometheus labels (set by PrometheusLabelBuilder at deploy time) + +| Runtime Type | `prometheus.path` | `prometheus.port` | +|---|---|---| +| `spring-boot` | `/actuator/prometheus` | `8081` | +| `quarkus` / `native` | `/q/metrics` | `9000` | +| `plain-java` | `/metrics` | `9464` | + +All containers also get `prometheus.scrape=true`. These labels enable Prometheus `docker_sd_configs` auto-discovery. + +## Agent Metric Names (Micrometer) + +Agents send `MetricsSnapshot` records with Micrometer-convention metric names. The server stores them generically (ClickHouse `agent_metrics.metric_name`). The UI references specific names in `AgentInstance.tsx` for JVM charts. 
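Stepping back to the container labels above: the `docker_sd_configs` auto-discovery they enable can be sketched roughly like this (job name, socket path, and relabeling details are illustrative assumptions; only the `prometheus.*` label names come from this repo):

```yaml
scrape_configs:
  - job_name: cameleer-agent-containers   # illustrative name
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # Keep only containers labeled prometheus.scrape=true at deploy time.
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: "true"
        action: keep
      # Use the runtime-specific metrics path set by PrometheusLabelBuilder.
      - source_labels: [__meta_docker_container_label_prometheus_path]
        target_label: __metrics_path__
      # Scrape the container IP on the runtime-specific port.
      - source_labels: [__meta_docker_network_ip, __meta_docker_container_label_prometheus_port]
        separator: ":"
        target_label: __address__
```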
+ +### JVM metrics (used by UI) + +| Metric name | UI usage | +|---|---| +| `process.cpu.usage.value` | CPU % stat card + chart | +| `jvm.memory.used.value` | Heap MB stat card + chart (tags: `area=heap`) | +| `jvm.memory.max.value` | Heap max for % calculation (tags: `area=heap`) | +| `jvm.threads.live.value` | Thread count chart | +| `jvm.gc.pause.total_time` | GC time chart | + +### Camel route metrics (stored, queried by dashboard) + +| Metric name | Type | Tags | +|---|---|---| +| `camel.exchanges.succeeded.count` | counter | `routeId`, `camelContext` | +| `camel.exchanges.failed.count` | counter | `routeId`, `camelContext` | +| `camel.exchanges.total.count` | counter | `routeId`, `camelContext` | +| `camel.exchanges.failures.handled.count` | counter | `routeId`, `camelContext` | +| `camel.route.policy.count` | count | `routeId`, `camelContext` | +| `camel.route.policy.total_time` | total | `routeId`, `camelContext` | +| `camel.route.policy.max` | gauge | `routeId`, `camelContext` | +| `camel.routes.running.value` | gauge | — | + +Mean processing time = `camel.route.policy.total_time / camel.route.policy.count`. Min processing time is not available (Micrometer does not track minimums). + +### Cameleer agent metrics + +| Metric name | Type | Tags | +|---|---|---| +| `cameleer.chunks.exported.count` | counter | `instanceId` | +| `cameleer.chunks.dropped.count` | counter | `instanceId`, `reason` | +| `cameleer.sse.reconnects.count` | counter | `instanceId` | +| `cameleer.taps.evaluated.count` | counter | `instanceId` | +| `cameleer.metrics.exported.count` | counter | `instanceId` | diff --git a/.claude/rules/ui.md b/.claude/rules/ui.md new file mode 100644 index 00000000..084ed9bb --- /dev/null +++ b/.claude/rules/ui.md @@ -0,0 +1,43 @@ +--- +paths: + - "ui/**" +--- + +# UI Structure + +The UI has 4 main tabs: **Exchanges**, **Dashboard**, **Runtime**, **Deployments**. 
+ +- **Exchanges** — route execution search and detail (`ui/src/pages/Exchanges/`) +- **Dashboard** — metrics and stats with L1/L2/L3 drill-down (`ui/src/pages/DashboardTab/`) +- **Runtime** — live agent status, logs, commands (`ui/src/pages/RuntimeTab/`) +- **Deployments** — app management, JAR upload, deployment lifecycle (`ui/src/pages/AppsTab/`) + - Config sub-tabs: **Monitoring | Resources | Variables | Traces & Taps | Route Recording** + - Create app: full page at `/apps/new` (not a modal) + - Deployment progress: `ui/src/components/DeploymentProgress.tsx` (7-stage step indicator) + +**Admin pages** (ADMIN-only, under `/admin/`): +- **Sensitive Keys** (`ui/src/pages/Admin/SensitiveKeysPage.tsx`) — global sensitive key masking config. Shows agent built-in defaults as outlined Badge reference, editable Tag pills for custom keys, amber-highlighted push-to-agents toggle. Keys add to (not replace) agent defaults. Per-app sensitive key additions managed via `ApplicationConfigController` API. Note: `AppConfigDetailPage.tsx` exists but is not routed in `router.tsx`. 
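The push-to-agents flow can also be exercised directly against the admin API. A hedged sketch (host, token, and the example key names are assumptions; the endpoint, payload shape, and `pushToAgents` flag come from `SensitiveKeysAdminController`):

```shell
# Example key names are hypothetical; submitted keys are merged into
# the agent built-in defaults, not replacing them.
payload='{"keys":["clientSecret","x-internal-token"]}'
echo "$payload"
# Apply globally and fan out to all LIVE agents (host/token assumed):
#   curl -X PUT "http://localhost:8080/api/v1/admin/sensitive-keys?pushToAgents=true" \
#     -H "Authorization: Bearer $ADMIN_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$payload"
```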
+ +## Key UI Files + +- `ui/src/router.tsx` — React Router v6 routes +- `ui/src/config.ts` — apiBaseUrl, basePath +- `ui/src/auth/auth-store.ts` — Zustand: accessToken, user, roles, login/logout +- `ui/src/api/environment-store.ts` — Zustand: selected environment (localStorage) +- `ui/src/components/ContentTabs.tsx` — main tab switcher +- `ui/src/components/ExecutionDiagram/` — interactive trace view (canvas) +- `ui/src/components/ProcessDiagram/` — ELK-rendered route diagram +- `ui/src/hooks/useScope.ts` — TabKey type, scope inference +- `ui/src/components/StartupLogPanel.tsx` — deployment startup log viewer (container logs from ClickHouse, polls 3s while STARTING) +- `ui/src/api/queries/logs.ts` — `useStartupLogs` hook for container startup log polling, `useLogs`/`useApplicationLogs` for general log search + +## UI Styling + +- Always use `@cameleer/design-system` CSS variables for colors (`var(--amber)`, `var(--error)`, `var(--success)`, etc.) — never hardcode hex values. This applies to CSS modules, inline styles, and SVG `fill`/`stroke` attributes; SVG presentation attributes resolve `var()` correctly. +- Shared CSS modules in `ui/src/styles/` (table-section, log-panel, rate-colors, refresh-indicator, chart-card, section-card) — import these instead of duplicating patterns. +- Shared `PageLoader` component replaces copy-pasted spinner patterns. +- Design system components used consistently: `Select`, `Tabs`, `Toggle`, `Button`, `LogViewer`, `Label` — prefer DS components over raw HTML elements. `LogViewer` renders optional source badges (`container`, `app`, `agent`) via `LogEntry.source` field (DS v0.1.49+). +- Environment slugs are auto-computed from display name (read-only in UI). +- Brand assets: `@cameleer/design-system/assets/` provides `camel-logo.svg` (currentColor), `cameleer-{16,32,48,192,512}.png`, and `cameleer-logo.png`.
Copied to `ui/public/` for use as favicon (`favicon-16.png`, `favicon-32.png`) and logo (`camel-logo.svg` — login dialog 36px, sidebar 28x24px). +- Sidebar generates `/exchanges/` paths directly (no legacy `/apps/` redirects). basePath is centralized in `ui/src/config.ts`; router.tsx imports it instead of re-reading the `<base>` tag. +- Global user preferences (environment selection) use Zustand stores with localStorage persistence — never URL search params. URL params are for page-specific state only (e.g. `?text=` search query). Switching environment resets all filters and remounts pages. diff --git a/.gitignore b/.gitignore index b9037c8f..7f339af9 100644 --- a/.gitignore +++ b/.gitignore @@ -38,7 +38,8 @@ Thumbs.db logs/ # Claude -.claude/ +.claude/* +!.claude/rules/ .superpowers/ .playwright-mcp/ .worktrees/ diff --git a/CLAUDE.md b/CLAUDE.md index 52ec1f37..846f03c4 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -67,6 +67,10 @@ PostgreSQL (Flyway): `cameleer-server-app/src/main/resources/db/migration/` ClickHouse: `cameleer-server-app/src/main/resources/clickhouse/init.sql` (run idempotently on startup) +## Maintaining .claude/rules/ + +When adding, removing, or renaming classes, controllers, endpoints, UI components, or metrics, update the corresponding `.claude/rules/` file as part of the same change. The rule files are the class/API map that future sessions rely on — stale rules cause wrong assumptions. Treat rule file updates like updating an import: part of the change, not a separate task. + ## Disabled Skills - Do NOT use any `gsd:*` skills in this project. This includes all `/gsd:` prefixed commands.