From 810f493639b016b201501134da00e1f786c0783f Mon Sep 17 00:00:00 2001 From: hsiegeln <37154749+hsiegeln@users.noreply.github.com> Date: Thu, 16 Apr 2026 09:26:53 +0200 Subject: [PATCH] chore: track .claude/rules/ and add self-maintenance instruction MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Un-ignore .claude/rules/ so path-scoped rule files are shared via git. Add instruction in CLAUDE.md to update rule files when modifying classes, controllers, endpoints, or metrics — keeps rules current as part of normal workflow rather than requiring separate maintenance. Co-Authored-By: Claude Opus 4.6 (1M context) --- .claude/rules/app-classes.md | 109 ++++++++++++++++++++++++++ .claude/rules/cicd.md | 24 ++++++ .claude/rules/core-classes.md | 97 +++++++++++++++++++++++ .claude/rules/docker-orchestration.md | 76 ++++++++++++++++++ .claude/rules/gitnexus.md | 98 +++++++++++++++++++++++ .claude/rules/metrics.md | 85 ++++++++++++++++++++ .claude/rules/ui.md | 43 ++++++++++ .gitignore | 3 +- CLAUDE.md | 4 + 9 files changed, 538 insertions(+), 1 deletion(-) create mode 100644 .claude/rules/app-classes.md create mode 100644 .claude/rules/cicd.md create mode 100644 .claude/rules/core-classes.md create mode 100644 .claude/rules/docker-orchestration.md create mode 100644 .claude/rules/gitnexus.md create mode 100644 .claude/rules/metrics.md create mode 100644 .claude/rules/ui.md diff --git a/.claude/rules/app-classes.md b/.claude/rules/app-classes.md new file mode 100644 index 00000000..c561df90 --- /dev/null +++ b/.claude/rules/app-classes.md @@ -0,0 +1,109 @@ +--- +paths: + - "cameleer-server-app/**" +--- + +# App Module Key Classes + +`cameleer-server-app/src/main/java/com/cameleer/server/app/` + +## controller/ — REST endpoints + +- `AgentRegistrationController` — POST /register, POST /heartbeat, GET / (list), POST /refresh-token +- `AgentSseController` — GET /sse (Server-Sent Events connection) +- `AgentCommandController` — POST 
/broadcast, POST /{agentId}, POST /{agentId}/ack +- `AppController` — CRUD /api/v1/apps, POST /{appId}/upload-jar, GET /{appId}/versions +- `DeploymentController` — GET/POST /api/v1/apps/{appId}/deployments, POST /{id}/stop, POST /{id}/promote, GET /{id}/logs +- `EnvironmentAdminController` — CRUD /api/v1/admin/environments, PUT /{id}/jar-retention +- `ExecutionController` — GET /api/v1/executions (search + detail) +- `SearchController` — POST /api/v1/search, GET /routes, GET /top-errors, GET /punchcard +- `LogQueryController` — GET /api/v1/logs (filters: source, application, agentId, exchangeId, level, logger, q, environment, time range) +- `LogIngestionController` — POST /api/v1/data/logs (accepts `List` JSON array, each entry has `source`: app/agent). Logs WARN for: missing agent identity, unregistered agents, empty payloads, buffer-full drops, deserialization failures. Normal acceptance at DEBUG. +- `CatalogController` — GET /api/v1/catalog (unified app catalog merging PG managed apps + in-memory agents + CH stats), DELETE /api/v1/catalog/{applicationId} (ADMIN: dismiss app, purge all CH data + PG record). Auto-filters discovered apps older than `discoveryttldays` with no live agents. +- `ChunkIngestionController` — POST /api/v1/ingestion/chunk/{executions|metrics|diagrams} +- `UserAdminController` — CRUD /api/v1/admin/users, POST /{id}/roles, POST /{id}/set-password +- `RoleAdminController` — CRUD /api/v1/admin/roles +- `GroupAdminController` — CRUD /api/v1/admin/groups +- `OidcConfigAdminController` — GET/POST /api/v1/admin/oidc, POST /test +- `SensitiveKeysAdminController` — GET/PUT /api/v1/admin/sensitive-keys. GET returns 200 with config or 204 if not configured. PUT accepts `{ keys: [...] }` with optional `?pushToAgents=true` to fan out merged keys to all LIVE agents. Stored in `server_config` table (key `sensitive_keys`). 
+- `AuditLogController` — GET /api/v1/admin/audit +- `MetricsController` — GET /api/v1/metrics, GET /timeseries +- `DiagramController` — GET /api/v1/diagrams/{id}, POST / +- `DiagramRenderController` — POST /api/v1/diagrams/render (ELK layout) +- `ClaimMappingAdminController` — CRUD /api/v1/admin/claim-mappings, POST /test (accepts inline rules + claims for preview without saving) +- `LicenseAdminController` — GET/POST /api/v1/admin/license +- `AgentEventsController` — GET /api/v1/agent-events (agent state change history) +- `AgentMetricsController` — GET /api/v1/agent-metrics (JVM/Camel metrics per agent instance) +- `AppSettingsController` — GET/PUT /api/v1/apps/{appId}/settings +- `ApplicationConfigController` — GET/PUT /api/v1/apps/{appId}/config (traced processors, route recording, sensitive keys per app) +- `ClickHouseAdminController` — GET /api/v1/admin/clickhouse (ClickHouse admin, conditional on infrastructure endpoints) +- `DatabaseAdminController` — GET /api/v1/admin/database (PG admin, conditional on infrastructure endpoints) +- `DetailController` — GET /api/v1/detail (execution detail with processor tree) +- `EventIngestionController` — POST /api/v1/data/events (agent event ingestion) +- `RbacStatsController` — GET /api/v1/admin/rbac/stats +- `RouteCatalogController` — GET /api/v1/routes/catalog (merged route catalog from registry + ClickHouse) +- `RouteMetricsController` — GET /api/v1/route-metrics (per-route Camel metrics) +- `ThresholdAdminController` — CRUD /api/v1/admin/thresholds +- `UsageAnalyticsController` — GET /api/v1/admin/usage (ClickHouse usage_events) + +## runtime/ — Docker orchestration + +- `DockerRuntimeOrchestrator` — implements RuntimeOrchestrator; Docker Java client (zerodep transport), container lifecycle +- `DeploymentExecutor` — @Async staged deploy: PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE. 
Container names are `{tenantId}-{envSlug}-{appSlug}-{replicaIndex}` (globally unique on Docker daemon). Sets per-replica `CAMELEER_AGENT_INSTANCEID` env var to `{envSlug}-{appSlug}-{replicaIndex}`. +- `DockerNetworkManager` — ensures bridge networks (cameleer-traefik, cameleer-env-{slug}), connects containers +- `DockerEventMonitor` — persistent Docker event stream listener (die, oom, start, stop), updates deployment status +- `TraefikLabelBuilder` — generates Traefik Docker labels for path-based or subdomain routing. Also emits `cameleer.replica` and `cameleer.instance-id` labels per container for labels-first identity. +- `PrometheusLabelBuilder` — generates Prometheus Docker labels (`prometheus.scrape/path/port`) per runtime type for `docker_sd_configs` auto-discovery +- `ContainerLogForwarder` — streams Docker container stdout/stderr to ClickHouse with `source='container'`. One follow-stream thread per container, batches lines every 2s/50 lines via `ClickHouseLogStore.insertBufferedBatch()`. 60-second max capture timeout. +- `DisabledRuntimeOrchestrator` — no-op when runtime not enabled + +## metrics/ — Prometheus observability + +- `ServerMetrics` — centralized business metrics: gauges (agents by state, SSE connections, buffer depths), counters (ingestion drops, agent transitions, deployment outcomes, auth failures), timers (flush duration, deployment duration). Exposed via `/api/v1/prometheus`. 
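
The tag-scoped counters in `ServerMetrics` (e.g. `cameleer.ingestion.drops{reason=...}`) follow a common pattern that can be sketched without Micrometer. This is a simplified, dependency-free illustration of the idea only — the real class registers Micrometer meters and exposes them through `/api/v1/prometheus`; the class and method names below are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Simplified sketch of tag-scoped business counters. Illustrative only:
// ServerMetrics itself uses Micrometer Counter/Gauge/Timer instances.
class TaggedCounters {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Key combines metric name and one tag, mirroring e.g.
    // cameleer.ingestion.drops{reason="buffer_full"}.
    private static String key(String metric, String tagKey, String tagValue) {
        return metric + "{" + tagKey + "=\"" + tagValue + "\"}";
    }

    void increment(String metric, String tagKey, String tagValue) {
        counters.computeIfAbsent(key(metric, tagKey, tagValue), k -> new LongAdder())
                .increment();
    }

    long get(String metric, String tagKey, String tagValue) {
        LongAdder adder = counters.get(key(metric, tagKey, tagValue));
        return adder == null ? 0L : adder.sum();
    }
}
```

The per-tag `LongAdder` keeps increments cheap under concurrent ingestion, which matters for hot paths like drop counting in `LogIngestionController`.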
+ +## storage/ — PostgreSQL repositories (JdbcTemplate) + +- `PostgresAppRepository`, `PostgresAppVersionRepository`, `PostgresEnvironmentRepository` +- `PostgresDeploymentRepository` — includes JSONB replica_states, deploy_stage, findByContainerId +- `PostgresUserRepository`, `PostgresRoleRepository`, `PostgresGroupRepository` +- `PostgresAuditRepository`, `PostgresOidcConfigRepository`, `PostgresClaimMappingRepository`, `PostgresSensitiveKeysRepository` +- `PostgresAppSettingsRepository`, `PostgresApplicationConfigRepository`, `PostgresThresholdRepository` + +## storage/ — ClickHouse stores + +- `ClickHouseExecutionStore`, `ClickHouseMetricsStore`, `ClickHouseMetricsQueryStore` +- `ClickHouseStatsStore` — pre-aggregated stats, punchcard +- `ClickHouseDiagramStore`, `ClickHouseAgentEventRepository` +- `ClickHouseUsageTracker` — usage_events for billing + +## search/ — ClickHouse search and log stores + +- `ClickHouseLogStore` — log storage and query, MDC-based exchange/processor correlation +- `ClickHouseSearchIndex` — full-text search + +## security/ — Spring Security + +- `SecurityConfig` — WebSecurityFilterChain, JWT filter, CORS, OIDC conditional +- `JwtAuthenticationFilter` — OncePerRequestFilter, validates Bearer tokens +- `JwtServiceImpl` — HMAC-SHA256 JWT (Nimbus JOSE) +- `OidcAuthController` — /api/v1/auth/oidc (login-uri, token-exchange, logout) +- `OidcTokenExchanger` — code -> tokens, role extraction from access_token then id_token +- `OidcProviderHelper` — OIDC discovery, JWK source cache + +## agent/ — Agent lifecycle + +- `SseConnectionManager` — manages per-agent SSE connections, delivers commands +- `AgentLifecycleMonitor` — @Scheduled 10s, LIVE->STALE->DEAD transitions +- `SsePayloadSigner` — Ed25519 signs SSE payloads for agent verification + +## retention/ — JAR cleanup + +- `JarRetentionJob` — @Scheduled 03:00 daily, per-environment retention, skips deployed versions + +## config/ — Spring beans + +- `RuntimeOrchestratorAutoConfig` — 
conditional Docker/Disabled orchestrator + NetworkManager + EventMonitor +- `RuntimeBeanConfig` — DeploymentExecutor, AppService, EnvironmentService +- `SecurityBeanConfig` — JwtService, Ed25519, BootstrapTokenValidator +- `StorageBeanConfig` — all repositories +- `ClickHouseConfig` — ClickHouse JdbcTemplate, schema initializer diff --git a/.claude/rules/cicd.md b/.claude/rules/cicd.md new file mode 100644 index 00000000..98c83942 --- /dev/null +++ b/.claude/rules/cicd.md @@ -0,0 +1,24 @@ +--- +paths: + - ".gitea/**" + - "deploy/**" + - "Dockerfile" + - "docker-entrypoint.sh" +--- + +# CI/CD & Deployment + +- CI workflow: `.gitea/workflows/ci.yml` — build -> docker -> deploy on push to main or feature branches +- Build step skips integration tests (`-DskipITs`) — Testcontainers needs Docker daemon +- Docker: multi-stage build (`Dockerfile`), `$BUILDPLATFORM` for native Maven on ARM64 runner, amd64 runtime. `docker-entrypoint.sh` imports `/certs/ca.pem` into JVM truststore before starting the app (supports custom CAs for OIDC discovery without `CAMELEER_SERVER_SECURITY_OIDCTLSSKIPVERIFY`). +- `REGISTRY_TOKEN` build arg required for `cameleer-common` dependency resolution +- Registry: `gitea.siegeln.net/cameleer/cameleer-server` (container images) +- K8s manifests in `deploy/` — Kustomize base + overlays (main/feature), shared infra (PostgreSQL, ClickHouse, Logto) as top-level manifests +- Deployment target: k3s at 192.168.50.86, namespace `cameleer` (main), `cam-` (feature branches) +- Feature branches: isolated namespace, PG schema; Traefik Ingress at `-api.cameleer.siegeln.net` +- Secrets managed in CI deploy step (idempotent `--dry-run=client | kubectl apply`): `cameleer-auth`, `cameleer-postgres-credentials`, `cameleer-clickhouse-credentials` +- K8s probes: server uses `/api/v1/health`, PostgreSQL uses `pg_isready -U "$POSTGRES_USER"` (env var, not hardcoded) +- K8s security: server and database pods run with `securityContext.runAsNonRoot`. 
UI (nginx) runs without securityContext (needs root for entrypoint setup). +- Docker: server Dockerfile has no default credentials — all DB config comes from env vars at runtime +- Docker build uses buildx registry cache + `--provenance=false` for Gitea compatibility +- CI: branch slug sanitization extracted to `.gitea/sanitize-branch.sh`, sourced by docker and deploy-feature jobs diff --git a/.claude/rules/core-classes.md b/.claude/rules/core-classes.md new file mode 100644 index 00000000..61d22934 --- /dev/null +++ b/.claude/rules/core-classes.md @@ -0,0 +1,97 @@ +--- +paths: + - "cameleer-server-core/**" +--- + +# Core Module Key Classes + +`cameleer-server-core/src/main/java/com/cameleer/server/core/` + +## agent/ — Agent lifecycle and commands + +- `AgentRegistryService` — in-memory registry (ConcurrentHashMap), register/heartbeat/lifecycle +- `AgentInfo` — record: id, name, application, environmentId, version, routeIds, capabilities, state +- `AgentCommand` — record: id, type, targetAgent, payload, createdAt, expiresAt +- `AgentEventService` — records agent state changes, heartbeats +- `AgentState` — enum: LIVE, STALE, DEAD, SHUTDOWN +- `CommandType` — enum for command types (config-update, deep-trace, replay, route-control, etc.) 
+- `CommandStatus` — enum for command acknowledgement states +- `CommandReply` — record: command execution result from agent +- `AgentEventRecord`, `AgentEventRepository` — event persistence +- `AgentEventListener` — callback interface for agent events +- `RouteStateRegistry` — tracks per-agent route states + +## runtime/ — App/Environment/Deployment domain + +- `App` — record: id, environmentId, slug, displayName, containerConfig (JSONB) +- `AppVersion` — record: id, appId, version, jarPath, detectedRuntimeType, detectedMainClass +- `Environment` — record: id, slug, jarRetentionCount +- `Deployment` — record: id, appId, appVersionId, environmentId, status, targetState, deploymentStrategy, replicaStates (JSONB), deployStage, containerId, containerName +- `DeploymentStatus` — enum: STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED +- `DeployStage` — enum: PRE_FLIGHT, PULL_IMAGE, CREATE_NETWORK, START_REPLICAS, HEALTH_CHECK, SWAP_TRAFFIC, COMPLETE +- `DeploymentService` — createDeployment (deletes terminal deployments first), markRunning, markFailed, markStopped +- `RuntimeType` — enum: AUTO, SPRING_BOOT, QUARKUS, PLAIN_JAVA, NATIVE +- `RuntimeDetector` — probes JAR files at upload time: detects runtime from manifest Main-Class (Spring Boot loader, Quarkus entry point, plain Java) or native binary (non-ZIP magic bytes) +- `ContainerRequest` — record: 20 fields for Docker container creation (includes runtimeType, customArgs, mainClass) +- `ContainerStatus` — record: state, running, exitCode, error +- `ResolvedContainerConfig` — record: typed config with memoryLimitMb, memoryReserveMb, cpuRequest, cpuLimit, appPort, exposedPorts, customEnvVars, stripPathPrefix, sslOffloading, routingMode, routingDomain, serverUrl, replicas, deploymentStrategy, routeControlEnabled, replayEnabled, runtimeType, customArgs, extraNetworks +- `RoutingMode` — enum for routing strategies +- `ConfigMerger` — pure function: resolve(globalDefaults, envConfig, appConfig) -> 
ResolvedContainerConfig +- `RuntimeOrchestrator` — interface: startContainer, stopContainer, getContainerStatus, getLogs, startLogCapture, stopLogCapture +- `AppRepository`, `AppVersionRepository`, `EnvironmentRepository`, `DeploymentRepository` — repository interfaces +- `AppService`, `EnvironmentService` — domain services + +## search/ — Execution search and stats + +- `SearchService` — search, count, stats, statsForApp, timeseries, timeseriesForApp, timeseriesForRoute, timeseriesGroupedByApp, timeseriesGroupedByRoute, slaCompliance, slaCountsByApp, slaCountsByRoute, topErrors, activeErrorTypes, punchcard, distinctAttributeKeys +- `SearchRequest` / `SearchResult` — search DTOs +- `ExecutionStats`, `ExecutionSummary` — stats aggregation records +- `StatsTimeseries`, `TopError` — timeseries and error DTOs +- `LogSearchRequest` / `LogSearchResponse` — log search DTOs + +## storage/ — Storage abstractions + +- `ExecutionStore`, `MetricsStore`, `MetricsQueryStore`, `StatsStore`, `DiagramStore`, `SearchIndex`, `LogIndex` — interfaces +- `LogEntryResult` — log query result record +- `model/` — `ExecutionDocument`, `MetricTimeSeries`, `MetricsSnapshot` + +## rbac/ — Role-based access control + +- `RbacService` — interface: role/group CRUD, assignRoleToUser, removeRoleFromUser, addUserToGroup, removeUserFromGroup, getDirectRolesForUser, getEffectiveRolesForUser, clearManagedAssignments, assignManagedRole, addUserToManagedGroup, getStats, listUsers +- `SystemRole` — enum: AGENT, VIEWER, OPERATOR, ADMIN; `normalizeScope()` maps scopes +- `UserDetail`, `RoleDetail`, `GroupDetail` — records +- `UserSummary`, `RoleSummary`, `GroupSummary` — lightweight list records +- `RbacStats` — aggregate stats record +- `AssignmentOrigin` — enum: DIRECT, CLAIM_MAPPING (tracks how roles were assigned) +- `ClaimMappingRule` — record: OIDC claim-to-role mapping rule +- `ClaimMappingService` — interface: CRUD for claim mapping rules +- `ClaimMappingRepository` — persistence interface +- 
`RoleRepository`, `GroupRepository` — persistence interfaces + +## admin/ — Server-wide admin config + +- `SensitiveKeysConfig` — record: keys (List, immutable) +- `SensitiveKeysRepository` — interface: find(), save() +- `SensitiveKeysMerger` — pure function: merge(global, perApp) -> union with case-insensitive dedup, preserves first-seen casing. Returns null when both inputs null. +- `AppSettings`, `AppSettingsRepository` — per-app settings config and persistence +- `ThresholdConfig`, `ThresholdRepository` — alerting threshold config and persistence +- `AuditService` — audit logging facade +- `AuditRecord`, `AuditResult`, `AuditCategory`, `AuditRepository` — audit trail records and persistence + +## security/ — Auth + +- `JwtService` — interface: createAccessToken, createRefreshToken, validateAccessToken, validateRefreshToken +- `Ed25519SigningService` — interface: sign, getPublicKeyBase64 (config signing) +- `OidcConfig` — record: enabled, issuerUri, clientId, clientSecret, rolesClaim, defaultRoles, autoSignup, displayNameClaim, userIdClaim, audience, additionalScopes +- `OidcConfigRepository` — persistence interface +- `PasswordPolicyValidator` — min 12 chars, 3-of-4 character classes, no username match +- `UserInfo`, `UserRepository` — user identity records and persistence +- `InvalidTokenException` — thrown on revoked/expired tokens + +## ingestion/ — Buffered data pipeline + +- `IngestionService` — ingestExecution, ingestMetric, ingestLog, ingestDiagram +- `ChunkAccumulator` — batches data for efficient flush +- `WriteBuffer` — bounded ring buffer for async flush +- `BufferedLogEntry` — log entry wrapper with metadata +- `MergedExecution`, `TaggedExecution`, `TaggedDiagram` — tagged ingestion records diff --git a/.claude/rules/docker-orchestration.md b/.claude/rules/docker-orchestration.md new file mode 100644 index 00000000..7e762a53 --- /dev/null +++ b/.claude/rules/docker-orchestration.md @@ -0,0 +1,76 @@ +--- +paths: + - 
"cameleer-server-app/**/runtime/**" + - "cameleer-server-core/**/runtime/**" + - "deploy/**" + - "docker-compose*.yml" + - "Dockerfile" + - "docker-entrypoint.sh" +--- + +# Docker Orchestration + +When deployed via the cameleer-saas platform, this server orchestrates customer app containers using Docker. Key components: + +- **ConfigMerger** (`core/runtime/ConfigMerger.java`) — pure function: resolve(globalDefaults, envConfig, appConfig) -> ResolvedContainerConfig. Three-layer merge: global (application.yml) -> environment (defaultContainerConfig JSONB) -> app (containerConfig JSONB). Includes `runtimeType` (default `"auto"`) and `customArgs` (default `""`). +- **TraefikLabelBuilder** (`app/runtime/TraefikLabelBuilder.java`) — generates Traefik Docker labels for path-based (`/{envSlug}/{appSlug}/`) or subdomain-based (`{appSlug}-{envSlug}.{domain}`) routing. Supports strip-prefix and SSL offloading toggles. Also sets per-replica identity labels: `cameleer.replica` (index) and `cameleer.instance-id` (`{envSlug}-{appSlug}-{replicaIndex}`). Internal processing uses labels (not container name parsing) for extensibility. +- **PrometheusLabelBuilder** (`app/runtime/PrometheusLabelBuilder.java`) — generates Prometheus `docker_sd_configs` labels per resolved runtime type: Spring Boot `/actuator/prometheus:8081`, Quarkus/native `/q/metrics:9000`, plain Java `/metrics:9464`. Labels merged into container metadata alongside Traefik labels at deploy time. +- **DockerNetworkManager** (`app/runtime/DockerNetworkManager.java`) — manages two Docker network tiers: + - `cameleer-traefik` — shared network; Traefik, server, and all app containers attach here. Server joined via docker-compose with `cameleer-server` DNS alias. + - `cameleer-env-{slug}` — per-environment isolated network; containers in the same environment discover each other via Docker DNS. 
In SaaS mode, env networks are tenant-scoped: `cameleer-env-{tenantId}-{envSlug}` (overloaded `envNetworkName(tenantId, envSlug)` method) to prevent cross-tenant collisions when multiple tenants have identically-named environments. +- **DockerEventMonitor** (`app/runtime/DockerEventMonitor.java`) — persistent Docker event stream listener for containers with `managed-by=cameleer-server` label. Detects die/oom/start/stop events and updates deployment replica states. Periodic reconciliation (@Scheduled every 30s) inspects actual container state and corrects deployment status mismatches (fixes stale DEGRADED with all replicas healthy). +- **DeploymentProgress** (`ui/src/components/DeploymentProgress.tsx`) — UI step indicator showing 7 deploy stages with amber active/green completed styling. +- **ContainerLogForwarder** (`app/runtime/ContainerLogForwarder.java`) — streams Docker container stdout/stderr to ClickHouse `logs` table with `source='container'`. Uses `docker logs --follow` per container, batches lines every 2s or 50 lines. Parses Docker timestamp prefix, infers log level via regex. `DeploymentExecutor` starts capture after each replica launches with the replica's `instanceId` (`{envSlug}-{appSlug}-{replicaIndex}`); `DockerEventMonitor` stops capture on die/oom. 60-second max capture timeout with 30s cleanup scheduler. Thread pool of 10 daemon threads. Container logs use the same `instanceId` as the agent (set via `CAMELEER_AGENT_INSTANCEID` env var) for unified log correlation at the instance level. +- **StartupLogPanel** (`ui/src/components/StartupLogPanel.tsx`) — collapsible log panel rendered below `DeploymentProgress`. Queries `/api/v1/logs?source=container&application={appSlug}&environment={envSlug}`. Auto-polls every 3s while deployment is STARTING; shows green "live" badge during polling, red "stopped" badge on FAILED. Uses `useStartupLogs` hook and `LogViewer` (design system). 
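
The naming conventions above — per-replica instance IDs, globally unique container names, and tenant-scoped environment networks — can be sketched as pure string helpers. The formats come from this document; the class and method names are illustrative, not the actual `DockerNetworkManager`/`DeploymentExecutor` API:

```java
// Sketch of the naming scheme described above; helper names are hypothetical.
class CameleerNames {
    // Per-replica identity shared by agent and container logs
    // (also set as the CAMELEER_AGENT_INSTANCEID env var).
    static String instanceId(String envSlug, String appSlug, int replicaIndex) {
        return envSlug + "-" + appSlug + "-" + replicaIndex;
    }

    // Globally unique container name on the Docker daemon:
    // {tenantId}-{envSlug}-{appSlug}-{replicaIndex}.
    static String containerName(String tenantId, String envSlug,
                                String appSlug, int replicaIndex) {
        return tenantId + "-" + instanceId(envSlug, appSlug, replicaIndex);
    }

    // Tenant-scoped per-environment discovery network (SaaS mode),
    // preventing collisions between identically named environments.
    static String envNetworkName(String tenantId, String envSlug) {
        return "cameleer-env-" + tenantId + "-" + envSlug;
    }
}
```

Because the instance ID is both a container label (`cameleer.instance-id`) and the agent's identity, container logs and agent logs correlate at the instance level without parsing container names.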
+ +## DeploymentExecutor Details + +Primary network for app containers is set via `CAMELEER_SERVER_RUNTIME_DOCKERNETWORK` env var (in SaaS mode: `cameleer-tenant-{slug}`); apps also connect to `cameleer-traefik` (routing) and `cameleer-env-{tenantId}-{envSlug}` (per-environment discovery) as additional networks. Resolves `runtimeType: auto` to concrete type from `AppVersion.detectedRuntimeType` at PRE_FLIGHT (fails deployment if unresolvable). Builds Docker entrypoint per runtime type (all JVM types use `-javaagent:/app/agent.jar -jar`, plain Java uses `-cp` with main class, native runs binary directly). Sets per-replica `CAMELEER_AGENT_INSTANCEID` env var to `{envSlug}-{appSlug}-{replicaIndex}` so container logs and agent logs share the same instance identity. Sets `CAMELEER_AGENT_*` env vars from `ResolvedContainerConfig` (routeControlEnabled, replayEnabled, health port). These are startup-only agent properties — changing them requires redeployment. + +## Deployment Status Model + +| Status | Meaning | +|--------|---------| +| `STOPPED` | Intentionally stopped or initial state | +| `STARTING` | Deploy in progress | +| `RUNNING` | All replicas healthy and serving | +| `DEGRADED` | Some replicas healthy, some dead | +| `STOPPING` | Graceful shutdown in progress | +| `FAILED` | Terminal failure (pre-flight, health check, or crash) | + +**Replica support**: deployments can specify a replica count. `DEGRADED` is used when at least one but not all replicas are healthy. + +**Deploy stages** (`DeployStage`): PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE (or FAILED at any stage). + +**Blue/green strategy**: when re-deploying, new replicas are started and health-checked before old ones are stopped, minimising downtime. + +**Deployment uniqueness**: `DeploymentService.createDeployment()` deletes any STOPPED/FAILED deployments for the same app+environment before creating a new one, preventing duplicate rows. 
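
The replica-health rule behind the status table can be sketched as a pure function. This is a minimal illustration of the RUNNING/DEGRADED distinction only — the real logic in `DeploymentExecutor`/`DockerEventMonitor` also accounts for deploy stages, intentional stops, and reconciliation, and the class name here is hypothetical:

```java
enum DeploymentStatus { STOPPED, STARTING, RUNNING, DEGRADED, STOPPING, FAILED }

// Sketch: map replica health to a status. Illustrative only — the actual
// implementation considers deploy stage and target state as well.
class ReplicaStatusRule {
    static DeploymentStatus fromReplicaHealth(int healthy, int total) {
        if (total <= 0 || healthy < 0 || healthy > total) {
            throw new IllegalArgumentException("invalid replica counts");
        }
        if (healthy == total) return DeploymentStatus.RUNNING;   // all healthy
        if (healthy > 0)     return DeploymentStatus.DEGRADED;   // some, not all
        return DeploymentStatus.FAILED;                          // none healthy
    }
}
```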
+ +## JAR Management + +- **Retention policy** per environment: configurable maximum number of JAR versions to keep. Older JARs are deleted automatically. +- **Nightly cleanup job** (`JarRetentionJob`, Spring `@Scheduled` 03:00): purges JARs exceeding the retention limit and removes orphaned files not referenced by any app version. Skips versions currently deployed. +- **Volume-based JAR mounting** for Docker-in-Docker setups: set `CAMELEER_SERVER_RUNTIME_JARDOCKERVOLUME` to the Docker volume name that contains the JAR storage directory. When set, the orchestrator mounts this volume into the container instead of bind-mounting the host path (required when the SaaS container itself runs inside Docker and the host path is not accessible from sibling containers). + +## Runtime Type Detection + +The server detects the app framework from uploaded JARs and builds Docker entrypoints. The agent shaded JAR bundles the log appender, so no separate `cameleer-log-appender.jar` or `PropertiesLauncher` is needed: + +- **Detection** (`RuntimeDetector`): runs at JAR upload time. Checks ZIP magic bytes (non-ZIP = native binary), then probes `META-INF/MANIFEST.MF` Main-Class: Spring Boot loader prefix -> `spring-boot`, Quarkus entry point -> `quarkus`, other Main-Class -> `plain-java` (extracts class name). Results stored on `AppVersion` (`detected_runtime_type`, `detected_main_class`). +- **Runtime types** (`RuntimeType` enum): `AUTO`, `SPRING_BOOT`, `QUARKUS`, `PLAIN_JAVA`, `NATIVE`. Configurable per app/environment via `containerConfig.runtimeType` (default `"auto"`). +- **Entrypoint per type**: All JVM types use `java -javaagent:/app/agent.jar -jar app.jar`. Plain Java uses `-cp` with explicit main class instead of `-jar`. Native runs the binary directly. +- **Custom arguments** (`containerConfig.customArgs`): freeform string appended to the start command. Validated against a strict pattern to prevent shell injection (entrypoint uses `sh -c`). 
+- **AUTO resolution**: at deploy time (PRE_FLIGHT), `"auto"` resolves to the detected type from `AppVersion`. Fails deployment if detection was unsuccessful — user must set type explicitly. +- **UI**: Resources tab shows Runtime Type dropdown (with detection hint from latest uploaded version) and Custom Arguments text field. + +## SaaS Multi-Tenant Network Isolation + +In SaaS mode, each tenant's server and its deployed apps are isolated at the Docker network level: + +- **Tenant network** (`cameleer-tenant-{slug}`) — primary internal bridge for all of a tenant's containers. Set as `CAMELEER_SERVER_RUNTIME_DOCKERNETWORK` for the tenant's server instance. Tenant A's apps cannot reach tenant B's apps. +- **Shared services network** — server also connects to the shared infrastructure network (PostgreSQL, ClickHouse, Logto) and `cameleer-traefik` for HTTP routing. +- **Tenant-scoped environment networks** (`cameleer-env-{tenantId}-{envSlug}`) — per-environment discovery is scoped per tenant, so `alpha-corp`'s "dev" environment network is separate from `beta-corp`'s "dev" environment network. + +## nginx / Reverse Proxy + +- `client_max_body_size 200m` is required in the nginx config to allow JAR uploads up to 200 MB. Without this, large JAR uploads return 413. diff --git a/.claude/rules/gitnexus.md b/.claude/rules/gitnexus.md new file mode 100644 index 00000000..e2e24175 --- /dev/null +++ b/.claude/rules/gitnexus.md @@ -0,0 +1,98 @@ +# GitNexus — Code Intelligence + +This project is indexed by GitNexus as **cameleer-server** (6306 symbols, 15892 relationships, 300 execution flows). Use the GitNexus MCP tools to understand code, assess impact, and navigate safely. + +> If any GitNexus tool warns the index is stale, run `npx gitnexus analyze` in terminal first. 
+ +## Always Do + +- **MUST run impact analysis before editing any symbol.** Before modifying a function, class, or method, run `gitnexus_impact({target: "symbolName", direction: "upstream"})` and report the blast radius (direct callers, affected processes, risk level) to the user. +- **MUST run `gitnexus_detect_changes()` before committing** to verify your changes only affect expected symbols and execution flows. +- **MUST warn the user** if impact analysis returns HIGH or CRITICAL risk before proceeding with edits. +- When exploring unfamiliar code, use `gitnexus_query({query: "concept"})` to find execution flows instead of grepping. It returns process-grouped results ranked by relevance. +- When you need full context on a specific symbol — callers, callees, which execution flows it participates in — use `gitnexus_context({name: "symbolName"})`. + +## When Debugging + +1. `gitnexus_query({query: ""})` — find execution flows related to the issue +2. `gitnexus_context({name: ""})` — see all callers, callees, and process participation +3. `READ gitnexus://repo/cameleer-server/process/{processName}` — trace the full execution flow step by step +4. For regressions: `gitnexus_detect_changes({scope: "compare", base_ref: "main"})` — see what your branch changed + +## When Refactoring + +- **Renaming**: MUST use `gitnexus_rename({symbol_name: "old", new_name: "new", dry_run: true})` first. Review the preview — graph edits are safe, text_search edits need manual review. Then run with `dry_run: false`. +- **Extracting/Splitting**: MUST run `gitnexus_context({name: "target"})` to see all incoming/outgoing refs, then `gitnexus_impact({target: "target", direction: "upstream"})` to find all external callers before moving code. +- After any refactor: run `gitnexus_detect_changes({scope: "all"})` to verify only expected files changed. + +## Never Do + +- NEVER edit a function, class, or method without first running `gitnexus_impact` on it. 
+- NEVER ignore HIGH or CRITICAL risk warnings from impact analysis. +- NEVER rename symbols with find-and-replace — use `gitnexus_rename` which understands the call graph. +- NEVER commit changes without running `gitnexus_detect_changes()` to check affected scope. + +## Tools Quick Reference + +| Tool | When to use | Command | +|------|-------------|---------| +| `query` | Find code by concept | `gitnexus_query({query: "auth validation"})` | +| `context` | 360-degree view of one symbol | `gitnexus_context({name: "validateUser"})` | +| `impact` | Blast radius before editing | `gitnexus_impact({target: "X", direction: "upstream"})` | +| `detect_changes` | Pre-commit scope check | `gitnexus_detect_changes({scope: "staged"})` | +| `rename` | Safe multi-file rename | `gitnexus_rename({symbol_name: "old", new_name: "new", dry_run: true})` | +| `cypher` | Custom graph queries | `gitnexus_cypher({query: "MATCH ..."})` | + +## Impact Risk Levels + +| Depth | Meaning | Action | +|-------|---------|--------| +| d=1 | WILL BREAK — direct callers/importers | MUST update these | +| d=2 | LIKELY AFFECTED — indirect deps | Should test | +| d=3 | MAY NEED TESTING — transitive | Test if critical path | + +## Resources + +| Resource | Use for | +|----------|---------| +| `gitnexus://repo/cameleer-server/context` | Codebase overview, check index freshness | +| `gitnexus://repo/cameleer-server/clusters` | All functional areas | +| `gitnexus://repo/cameleer-server/processes` | All execution flows | +| `gitnexus://repo/cameleer-server/process/{name}` | Step-by-step execution trace | + +## Self-Check Before Finishing + +Before completing any code modification task, verify: +1. `gitnexus_impact` was run for all modified symbols +2. No HIGH/CRITICAL risk warnings were ignored +3. `gitnexus_detect_changes()` confirms changes match expected scope +4. All d=1 (WILL BREAK) dependents were updated + +## Keeping the Index Fresh + +After committing code changes, the GitNexus index becomes stale. 
Re-run analyze to update it: + +```bash +npx gitnexus analyze +``` + +If the index previously included embeddings, preserve them by adding `--embeddings`: + +```bash +npx gitnexus analyze --embeddings +``` + +To check whether embeddings exist, inspect `.gitnexus/meta.json` — the `stats.embeddings` field shows the count (0 means no embeddings). **Running analyze without `--embeddings` will delete any previously generated embeddings.** + +> Claude Code users: A PostToolUse hook handles this automatically after `git commit` and `git merge`. + +## CLI + +| Task | Read this skill file | +|------|---------------------| +| Understand architecture / "How does X work?" | `.claude/skills/gitnexus/gitnexus-exploring/SKILL.md` | +| Blast radius / "What breaks if I change X?" | `.claude/skills/gitnexus/gitnexus-impact-analysis/SKILL.md` | +| Trace bugs / "Why is X failing?" | `.claude/skills/gitnexus/gitnexus-debugging/SKILL.md` | +| Rename / extract / split / refactor | `.claude/skills/gitnexus/gitnexus-refactoring/SKILL.md` | +| Tools, resources, schema reference | `.claude/skills/gitnexus/gitnexus-guide/SKILL.md` | +| Index, status, clean, wiki CLI commands | `.claude/skills/gitnexus/gitnexus-cli/SKILL.md` | diff --git a/.claude/rules/metrics.md b/.claude/rules/metrics.md new file mode 100644 index 00000000..a2f0f365 --- /dev/null +++ b/.claude/rules/metrics.md @@ -0,0 +1,85 @@ +--- +paths: + - "cameleer-server-app/**/metrics/**" + - "cameleer-server-app/**/ServerMetrics*" + - "ui/src/pages/RuntimeTab/**" + - "ui/src/pages/DashboardTab/**" +--- + +# Prometheus Metrics + +Server exposes `/api/v1/prometheus` (unauthenticated, Prometheus text format). Spring Boot Actuator provides JVM, GC, thread pool, and `http.server.requests` metrics automatically. 
Business metrics via `ServerMetrics` component: + +## Gauges (auto-polled) + +| Metric | Tags | Source | +|--------|------|--------| +| `cameleer.agents.connected` | `state` (live, stale, dead, shutdown) | `AgentRegistryService.findByState()` | +| `cameleer.agents.sse.active` | — | `SseConnectionManager.getConnectionCount()` | +| `cameleer.ingestion.buffer.size` | `type` (execution, processor, log, metrics) | `WriteBuffer.size()` | +| `cameleer.ingestion.accumulator.pending` | — | `ChunkAccumulator.getPendingCount()` | + +## Counters + +| Metric | Tags | Instrumented in | +|--------|------|-----------------| +| `cameleer.ingestion.drops` | `reason` (buffer_full, no_agent, no_identity) | `LogIngestionController` | +| `cameleer.agents.transitions` | `transition` (went_stale, went_dead, recovered) | `AgentLifecycleMonitor` | +| `cameleer.deployments.outcome` | `status` (running, failed, degraded) | `DeploymentExecutor` | +| `cameleer.auth.failures` | `reason` (invalid_token, revoked, oidc_rejected) | `JwtAuthenticationFilter` | + +## Timers + +| Metric | Tags | Instrumented in | +|--------|------|-----------------| +| `cameleer.ingestion.flush.duration` | `type` (execution, processor, log) | `ExecutionFlushScheduler` | +| `cameleer.deployments.duration` | — | `DeploymentExecutor` | + +## Agent container Prometheus labels (set by PrometheusLabelBuilder at deploy time) + +| Runtime Type | `prometheus.path` | `prometheus.port` | +|---|---|---| +| `spring-boot` | `/actuator/prometheus` | `8081` | +| `quarkus` / `native` | `/q/metrics` | `9000` | +| `plain-java` | `/metrics` | `9464` | + +All containers also get `prometheus.scrape=true`. These labels enable Prometheus `docker_sd_configs` auto-discovery. + +## Agent Metric Names (Micrometer) + +Agents send `MetricsSnapshot` records with Micrometer-convention metric names. The server stores them generically (ClickHouse `agent_metrics.metric_name`). The UI references specific names in `AgentInstance.tsx` for JVM charts. 
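Stepping back to the container labels above: the `docker_sd_configs` auto-discovery they enable can be sketched roughly like this (job name, socket path, and relabeling details are illustrative assumptions; only the `prometheus.*` label names come from this repo):

```yaml
scrape_configs:
  - job_name: cameleer-agent-containers   # illustrative name
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # Keep only containers labeled prometheus.scrape=true at deploy time.
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: "true"
        action: keep
      # Use the runtime-specific metrics path set by PrometheusLabelBuilder.
      - source_labels: [__meta_docker_container_label_prometheus_path]
        target_label: __metrics_path__
      # Scrape the container IP on the runtime-specific port.
      - source_labels: [__meta_docker_network_ip, __meta_docker_container_label_prometheus_port]
        separator: ":"
        target_label: __address__
```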
+ +### JVM metrics (used by UI) + +| Metric name | UI usage | +|---|---| +| `process.cpu.usage.value` | CPU % stat card + chart | +| `jvm.memory.used.value` | Heap MB stat card + chart (tags: `area=heap`) | +| `jvm.memory.max.value` | Heap max for % calculation (tags: `area=heap`) | +| `jvm.threads.live.value` | Thread count chart | +| `jvm.gc.pause.total_time` | GC time chart | + +### Camel route metrics (stored, queried by dashboard) + +| Metric name | Type | Tags | +|---|---|---| +| `camel.exchanges.succeeded.count` | counter | `routeId`, `camelContext` | +| `camel.exchanges.failed.count` | counter | `routeId`, `camelContext` | +| `camel.exchanges.total.count` | counter | `routeId`, `camelContext` | +| `camel.exchanges.failures.handled.count` | counter | `routeId`, `camelContext` | +| `camel.route.policy.count` | count | `routeId`, `camelContext` | +| `camel.route.policy.total_time` | total | `routeId`, `camelContext` | +| `camel.route.policy.max` | gauge | `routeId`, `camelContext` | +| `camel.routes.running.value` | gauge | — | + +Mean processing time = `camel.route.policy.total_time / camel.route.policy.count`. Min processing time is not available (Micrometer does not track minimums). + +### Cameleer agent metrics + +| Metric name | Type | Tags | +|---|---|---| +| `cameleer.chunks.exported.count` | counter | `instanceId` | +| `cameleer.chunks.dropped.count` | counter | `instanceId`, `reason` | +| `cameleer.sse.reconnects.count` | counter | `instanceId` | +| `cameleer.taps.evaluated.count` | counter | `instanceId` | +| `cameleer.metrics.exported.count` | counter | `instanceId` | diff --git a/.claude/rules/ui.md b/.claude/rules/ui.md new file mode 100644 index 00000000..084ed9bb --- /dev/null +++ b/.claude/rules/ui.md @@ -0,0 +1,43 @@ +--- +paths: + - "ui/**" +--- + +# UI Structure + +The UI has 4 main tabs: **Exchanges**, **Dashboard**, **Runtime**, **Deployments**. 
+ +- **Exchanges** — route execution search and detail (`ui/src/pages/Exchanges/`) +- **Dashboard** — metrics and stats with L1/L2/L3 drill-down (`ui/src/pages/DashboardTab/`) +- **Runtime** — live agent status, logs, commands (`ui/src/pages/RuntimeTab/`) +- **Deployments** — app management, JAR upload, deployment lifecycle (`ui/src/pages/AppsTab/`) + - Config sub-tabs: **Monitoring | Resources | Variables | Traces & Taps | Route Recording** + - Create app: full page at `/apps/new` (not a modal) + - Deployment progress: `ui/src/components/DeploymentProgress.tsx` (7-stage step indicator) + +**Admin pages** (ADMIN-only, under `/admin/`): +- **Sensitive Keys** (`ui/src/pages/Admin/SensitiveKeysPage.tsx`) — global sensitive key masking config. Shows agent built-in defaults as outlined Badge reference, editable Tag pills for custom keys, amber-highlighted push-to-agents toggle. Keys add to (not replace) agent defaults. Per-app sensitive key additions managed via `ApplicationConfigController` API. Note: `AppConfigDetailPage.tsx` exists but is not routed in `router.tsx`. 
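The push-to-agents flow can also be exercised directly against the admin API. A hedged sketch (host, token, and the example key names are assumptions; the endpoint, payload shape, and `pushToAgents` flag come from `SensitiveKeysAdminController`):

```shell
# Example key names are hypothetical; submitted keys are merged into
# the agent built-in defaults, not replacing them.
payload='{"keys":["clientSecret","x-internal-token"]}'
echo "$payload"
# Apply globally and fan out to all LIVE agents (host/token assumed):
#   curl -X PUT "http://localhost:8080/api/v1/admin/sensitive-keys?pushToAgents=true" \
#     -H "Authorization: Bearer $ADMIN_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$payload"
```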
+ +## Key UI Files + +- `ui/src/router.tsx` — React Router v6 routes +- `ui/src/config.ts` — apiBaseUrl, basePath +- `ui/src/auth/auth-store.ts` — Zustand: accessToken, user, roles, login/logout +- `ui/src/api/environment-store.ts` — Zustand: selected environment (localStorage) +- `ui/src/components/ContentTabs.tsx` — main tab switcher +- `ui/src/components/ExecutionDiagram/` — interactive trace view (canvas) +- `ui/src/components/ProcessDiagram/` — ELK-rendered route diagram +- `ui/src/hooks/useScope.ts` — TabKey type, scope inference +- `ui/src/components/StartupLogPanel.tsx` — deployment startup log viewer (container logs from ClickHouse, polls 3s while STARTING) +- `ui/src/api/queries/logs.ts` — `useStartupLogs` hook for container startup log polling, `useLogs`/`useApplicationLogs` for general log search + +## UI Styling + +- Always use `@cameleer/design-system` CSS variables for colors (`var(--amber)`, `var(--error)`, `var(--success)`, etc.) — never hardcode hex values. This applies to CSS modules, inline styles, and SVG `fill`/`stroke` attributes; SVG presentation attributes resolve `var()` correctly. +- Shared CSS modules in `ui/src/styles/` (table-section, log-panel, rate-colors, refresh-indicator, chart-card, section-card) — import these instead of duplicating patterns. +- Shared `PageLoader` component replaces copy-pasted spinner patterns. +- Design system components used consistently: `Select`, `Tabs`, `Toggle`, `Button`, `LogViewer`, `Label` — prefer DS components over raw HTML elements. `LogViewer` renders optional source badges (`container`, `app`, `agent`) via `LogEntry.source` field (DS v0.1.49+). +- Environment slugs are auto-computed from display name (read-only in UI). +- Brand assets: `@cameleer/design-system/assets/` provides `camel-logo.svg` (currentColor), `cameleer-{16,32,48,192,512}.png`, and `cameleer-logo.png`.
Copied to `ui/public/` for use as favicon (`favicon-16.png`, `favicon-32.png`) and logo (`camel-logo.svg` — login dialog 36px, sidebar 28x24px). +- Sidebar generates `/exchanges/` paths directly (no legacy `/apps/` redirects). basePath is centralized in `ui/src/config.ts`; router.tsx imports it instead of re-reading the `<base>` tag. +- Global user preferences (environment selection) use Zustand stores with localStorage persistence — never URL search params. URL params are for page-specific state only (e.g. `?text=` search query). Switching environment resets all filters and remounts pages. diff --git a/.gitignore b/.gitignore index b9037c8f..7f339af9 100644 --- a/.gitignore +++ b/.gitignore @@ -38,7 +38,8 @@ Thumbs.db logs/ # Claude -.claude/ +.claude/* +!.claude/rules/ .superpowers/ .playwright-mcp/ .worktrees/ diff --git a/CLAUDE.md b/CLAUDE.md index 52ec1f37..846f03c4 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -67,6 +67,10 @@ PostgreSQL (Flyway): `cameleer-server-app/src/main/resources/db/migration/` ClickHouse: `cameleer-server-app/src/main/resources/clickhouse/init.sql` (run idempotently on startup) +## Maintaining .claude/rules/ + +When adding, removing, or renaming classes, controllers, endpoints, UI components, or metrics, update the corresponding `.claude/rules/` file as part of the same change. The rule files are the class/API map that future sessions rely on — stale rules cause wrong assumptions. Treat rule file updates like updating an import: part of the change, not a separate task. + ## Disabled Skills - Do NOT use any `gsd:*` skills in this project. This includes all `/gsd:` prefixed commands.