diff --git a/docs/superpowers/specs/2026-04-14-container-startup-log-capture-design.md b/docs/superpowers/specs/2026-04-14-container-startup-log-capture-design.md
new file mode 100644
index 00000000..7deb4651
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-14-container-startup-log-capture-design.md
@@ -0,0 +1,187 @@

# Container Startup Log Capture

Capture Docker container stdout/stderr from the moment a container starts until the Cameleer agent inside it fully registers (SSE connection established). Captured logs are stored in ClickHouse for display in the deployment view and for general log search.

## Problem

When a deployed application crashes during startup — before the Cameleer agent can connect and send logs through the normal ingestion pipeline — all diagnostic output is lost. The container may be removed before anyone can inspect it, leaving operators blind to the root cause.

## Solution

A `ContainerLogForwarder` component streams Docker log output in real time for each managed container, batches lines, and flushes them to the existing ClickHouse `logs` table with `source = 'container'`. Capture stops when the agent establishes its SSE connection, at which point the agent's own log pipeline takes over.

## Architecture

### Core Interface Extension

Extend `RuntimeOrchestrator` (core module) with three new methods:

```java
// in RuntimeOrchestrator.java
void startLogCapture(String containerId, String appSlug, String envSlug, String tenantId);
void stopLogCapture(String containerId);
void stopLogCaptureByApp(String appSlug, String envSlug);
```

`DisabledRuntimeOrchestrator` implements these as no-ops. `DockerRuntimeOrchestrator` delegates to `ContainerLogForwarder`.

### ContainerLogForwarder

**Package:** `com.cameleer3.server.app.runtime` (Docker-specific, alongside `DockerRuntimeOrchestrator`, `DockerEventMonitor`, etc.)
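As a sketch, the delegation could look like this. The forwarder's `start`/`stop`/`stopByApp` method names and the `RecordingForwarder` stub are illustrative assumptions, not the real API:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the delegation described above. The forwarder's method names
 * (start/stop/stopByApp) are assumptions, not the real API.
 */
public class DelegationSketch {

    /** Docker-specific collaborator; the real one streams via docker-java. */
    public interface ContainerLogForwarder {
        void start(String containerId, String appSlug, String envSlug, String tenantId);
        void stop(String containerId);
        void stopByApp(String appSlug, String envSlug);
    }

    /** DockerRuntimeOrchestrator forwards the three new RuntimeOrchestrator methods. */
    public static class DockerRuntimeOrchestrator {
        private final ContainerLogForwarder forwarder;

        public DockerRuntimeOrchestrator(ContainerLogForwarder forwarder) {
            this.forwarder = forwarder;
        }

        public void startLogCapture(String containerId, String appSlug, String envSlug, String tenantId) {
            forwarder.start(containerId, appSlug, envSlug, tenantId);
        }

        public void stopLogCapture(String containerId) {
            forwarder.stop(containerId);
        }

        public void stopLogCaptureByApp(String appSlug, String envSlug) {
            forwarder.stopByApp(appSlug, envSlug);
        }
    }

    /** Recording stub standing in for the real forwarder, for illustration only. */
    public static class RecordingForwarder implements ContainerLogForwarder {
        public final List<String> calls = new ArrayList<>();
        public void start(String c, String a, String e, String t) { calls.add("start:" + c); }
        public void stop(String c) { calls.add("stop:" + c); }
        public void stopByApp(String a, String e) { calls.add("stopByApp:" + a + "/" + e); }
    }
}
```

`DisabledRuntimeOrchestrator` would implement the same three methods with empty bodies.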
**Responsibilities:**
- Manages active capture sessions in a `ConcurrentHashMap` keyed by container ID
- Each `CaptureSession` holds: containerId, appSlug, envSlug, tenantId, a `Future` for the streaming thread, and a buffer of pending log lines
- Uses a bounded thread pool (fixed size, ~10 threads)

**Streaming logic:**
- Calls `dockerClient.logContainerCmd(containerId).withFollowStream(true).withStdOut(true).withStdErr(true).withTimestamps(true)`
- The callback's `onNext(Frame)` appends to an in-memory buffer
- Every ~2 seconds (or every 50 lines, whichever comes first), flushes the buffer to ClickHouse via `ClickHouseLogStore.insertBufferedBatch()`, constructing `BufferedLogEntry` records with `source = "container"`, the deployment's app/env/tenant metadata, and the container name as `instanceId`
- On `onComplete()` (container stopped) or `onError()`: final flush, then remove the session from the map

**Safety:**
- Max capture duration: 5 minutes. A scheduled cleanup (every 30 s) stops sessions exceeding this limit.
- `@PreDestroy` cleanup: stop all active captures on server shutdown.

### ClickHouse Field Mapping

Uses the existing `logs` table. No schema changes required.
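Constructing one entry for that table from a raw, timestamped Docker line might look like this sketch. The `BufferedLogEntry` record shape here is a hypothetical stand-in for the real record; only the field values follow this design:

```java
import java.time.Instant;
import java.util.Map;

/**
 * Sketch: build one ClickHouse-bound entry from a raw Docker log line.
 * BufferedLogEntry here is a hypothetical stand-in for the real record.
 */
public class ContainerLineMapper {

    public record BufferedLogEntry(
            String source, String application, String environment, String tenantId,
            String instanceId, Instant timestamp, String message, String level,
            String loggerName, String threadName, String stackTrace,
            String exchangeId, Map<String, String> mdc) {}

    public static BufferedLogEntry map(String rawLine, String appSlug, String envSlug,
                                       String tenantId, String containerName) {
        // With withTimestamps(true), Docker prefixes each line with an
        // RFC3339Nano timestamp followed by a single space.
        int sep = rawLine.indexOf(' ');
        Instant timestamp = Instant.parse(rawLine.substring(0, sep));
        String message = rawLine.substring(sep + 1);
        return new BufferedLogEntry(
                "container", appSlug, envSlug, tenantId, containerName,
                timestamp, message,
                "INFO",          // level inference is a separate pass (see below)
                "", "", "", "",  // logger/thread/stack trace/exchange id: not parseable
                Map.of());
    }
}
```

The sketch assumes a well-formed timestamp prefix; the real forwarder would need to tolerate malformed lines.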
| Field | Value |
|-------|-------|
| `source` | `'container'` |
| `application` | appSlug from deployment |
| `environment` | envSlug from deployment |
| `tenant_id` | tenantId from deployment |
| `instance_id` | containerName (e.g., `prod-orderservice-0`) |
| `timestamp` | Parsed from the Docker timestamp prefix |
| `message` | Log line content (after the timestamp) |
| `level` | Inferred by regex (see below) |
| `logger_name` | Empty string (not parseable from raw stdout) |
| `thread_name` | Empty string |
| `stack_trace` | Empty string (stack traces appear as consecutive message lines) |
| `exchange_id` | Empty string |
| `mdc` | Empty map |

### Log Level Inference

- Regex scan for common Java log patterns: ` ERROR `, ` WARN `, ` INFO `, ` DEBUG `, ` TRACE `
- Stack trace continuation lines (starting with `\tat ` or `Caused by:`) inherit ERROR level
- Lines matching no pattern default to INFO

## Integration Points

### Start Capture — DeploymentExecutor

After each replica container is started (inside the replica loop):

```java
orchestrator.startLogCapture(containerId, appSlug, envSlug, tenantId);
```

### Stop Capture — SseConnectionManager.connect()

When an agent establishes its SSE connection, look up its `AgentInfo` in the registry to get `application` and `environmentId`:

```java
orchestrator.stopLogCaptureByApp(application, environmentId);
```

This is a best-effort call — a no-op if no capture exists for that app+env (e.g., a non-Docker agent).

### Stop Capture — Container Death

`DockerEventMonitor` handles `die`/`oom` events. After updating replica state:

```java
orchestrator.stopLogCapture(containerId);
```

This triggers a final flush of buffered lines before cleanup.

### Stop Capture — Deployment Failure Cleanup

No extra code needed. When `DeploymentExecutor` stops and removes containers on health-check failure, the resulting Docker `die` event flows through `DockerEventMonitor`, which calls `stopLogCapture`.
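The level-inference rules above could be sketched as follows (shown here with plain substring checks rather than a compiled regex; the class name is illustrative):

```java
/** Sketch of the level-inference rules described above. */
public class LogLevelInference {

    private static final String[] LEVELS = {"ERROR", "WARN", "INFO", "DEBUG", "TRACE"};

    public static String infer(String line) {
        // Stack-trace continuation lines inherit ERROR level.
        if (line.startsWith("\tat ") || line.startsWith("Caused by:")) {
            return "ERROR";
        }
        // Scan for common Java log patterns like " ERROR " or " WARN ".
        for (String level : LEVELS) {
            if (line.contains(" " + level + " ")) {
                return level;
            }
        }
        return "INFO"; // no pattern matched: default
    }
}
```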
## UI Changes

### 1. Deployment Startup Log Panel

A collapsible log panel below the `DeploymentProgress` component in the deployment detail view.

**Data source:** Queries `/api/v1/logs?application={appSlug}&environment={envSlug}&source=container&from={deployCreatedAt}`

**Polling behavior:**
- Auto-refreshes every 3 seconds while deployment status is STARTING
- Stops polling when status reaches RUNNING or FAILED
- Manual refresh button available in all states

**Status indicator:**
- Green "live" badge plus "polling every 3s" text while STARTING
- Red "stopped" badge when FAILED
- No badge when RUNNING (the panel remains visible with the historical startup logs)

**Layout:** Uses the existing `LogViewer` component from `@cameleer/design-system` and the shared log panel styles from `ui/src/styles/log-panel.module.css`.

### 2. Source Badge in Log Views

Everywhere logs are displayed (AgentInstance page, LogTab, general log search), each log line gets a small source badge:
- `container` — slate/gray badge
- `app` — green badge
- `agent` — existing behavior

The `source` field already exists in `LogEntryResponse`, so this is a rendering-only change in `LogViewer` or its wrapper.

### 3. Source Filter Update

The log toolbar's source filter (currently App vs. Agent) gains `Container` as a third option. The backend `/api/v1/logs` endpoint already accepts `source` as a query parameter, so no backend change is needed for filtering.

## Edge Cases

**Multi-replica:** Each replica gets its own capture session keyed by container ID. `instance_id` in ClickHouse is the container name (e.g., `prod-orderservice-0`). `stopLogCaptureByApp()` stops all sessions for that app+env pair.

**Server restart during capture:** Active sessions are in-memory and lost on restart. This is acceptable: the containers likely restart too (Docker restart policy), and new captures start when `DeploymentExecutor` runs again. Already-flushed logs survive in ClickHouse.
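The multi-replica stop path (`stopLogCaptureByApp()` removing every session for an app+env pair) can be sketched as a scan of the forwarder's session map. `CaptureSession` is reduced here to the fields needed for matching:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: stopLogCaptureByApp() as a scan of the forwarder's session map. */
public class SessionMapSketch {

    /** Reduced CaptureSession; the real one also holds the Future and buffer. */
    public record CaptureSession(String appSlug, String envSlug, String tenantId) {}

    public final Map<String, CaptureSession> sessions = new ConcurrentHashMap<>();

    public void stopByApp(String appSlug, String envSlug) {
        // Remove every replica's session for this app+env pair; the real
        // forwarder would also cancel the streaming thread and final-flush.
        sessions.entrySet().removeIf(e ->
                e.getValue().appSlug().equals(appSlug)
                        && e.getValue().envSlug().equals(envSlug));
    }
}
```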
**Container produces no output:** The follow stream stays open but idle (a parked thread, no CPU cost). It is cleaned up by the 5-minute timeout or by container death.

**Rapid redeployment:** The old container dies -> `stopLogCapture(oldContainerId)`. The new container starts -> `startLogCapture(newContainerId, ...)`. Different container IDs, so no conflict.

**Log overlap:** When the agent connects and starts sending `source='app'` logs, there may be a brief overlap with `source='container'` logs for the same timeframe. Both are shown with source badges, and users can filter by source if needed.

## Files Changed

### Backend — New

| File | Description |
|------|-------------|
| `app/runtime/ContainerLogForwarder.java` | Docker log streaming, buffering, ClickHouse flush |

### Backend — Modified

| File | Change |
|------|--------|
| `core/runtime/RuntimeOrchestrator.java` | Add 3 log capture methods to the interface |
| `app/runtime/DockerRuntimeOrchestrator.java` | Implement the log capture methods; delegate to ContainerLogForwarder |
| `app/runtime/DisabledRuntimeOrchestrator.java` | No-op implementations of the new methods |
| `app/runtime/DeploymentExecutor.java` | Call `startLogCapture()` after container start |
| `app/agent/SseConnectionManager.java` | Call `stopLogCaptureByApp()` on SSE connect |
| `app/runtime/DockerEventMonitor.java` | Call `stopLogCapture()` on die/oom events |
| `app/runtime/RuntimeOrchestratorAutoConfig.java` | Wire ContainerLogForwarder into DockerRuntimeOrchestrator |

### Frontend — Modified

| File | Change |
|------|--------|
| `ui/src/pages/AppsTab/AppsTab.tsx` | Add startup log panel below DeploymentProgress |
| `ui/src/api/queries/logs.ts` | Hook for the deployment startup logs query |
| Log display components | Add source badge rendering |
| Log toolbar | Add Container to the source filter options |

### No Changes

| File | Reason |
|------|--------|
| ClickHouse `init.sql` | Existing `logs` table with `source` column is sufficient |
| `LogQueryController.java` | Already accepts the `source` filter parameter |
| `ClickHouseLogStore.java` | Already writes the `source` field from log entries |