Capture Docker container stdout/stderr from the moment a container starts until the Cameleer agent inside fully registers (SSE connection established). Logs are stored in ClickHouse for display in the deployment view and in general log search.
## Problem
When a deployed application crashes during startup — before the Cameleer agent can connect and send logs via the normal ingestion pipeline — all diagnostic output is lost. The container may be removed before anyone can inspect it, leaving operators blind to the root cause.
## Solution
A `ContainerLogForwarder` component streams Docker log output in real time for each managed container, batches lines, and flushes them to the existing ClickHouse `logs` table with `source = 'container'`. Capture stops when the agent establishes its SSE connection, at which point the agent's own log pipeline takes over.
## Architecture
### Core Interface Extension
Extend `RuntimeOrchestrator` (core module) with three new methods: `startLogCapture(containerId, ...)`, `stopLogCapture(containerId)`, and `stopLogCaptureByApp()`. Starting a capture attaches a follow-stream to the container's stdout/stderr; the stream callback then drives the session:
- Callback `onNext(Frame)` appends to an in-memory buffer
- Every ~2 seconds (or every 50 lines, whichever comes first), flushes the buffer to ClickHouse via `ClickHouseLogStore.insertBufferedBatch()` — constructs `BufferedLogEntry` records with `source = "container"`, the deployment's app/env/tenant metadata, and container name as `instanceId`
- On `onComplete()` (container stopped) or `onError()` — final flush, remove session from map
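The flush policy above (50 lines or ~2 seconds, whichever comes first) can be sketched as a small standalone class. `LogBatcher` and its sink are illustrative names, not the real component API; the real sink would be `ClickHouseLogStore.insertBufferedBatch()`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the buffering policy: accumulate log lines and hand them to a
// sink when either 50 lines are buffered or ~2 seconds have passed since
// the oldest unflushed line. Names here are illustrative.
final class LogBatcher {
    private static final int MAX_LINES = 50;
    private static final long MAX_AGE_MS = 2_000;

    private final List<String> buffer = new ArrayList<>();
    private final Consumer<List<String>> sink; // e.g. ClickHouse batch insert
    private long oldestLineAt = -1;

    LogBatcher(Consumer<List<String>> sink) {
        this.sink = sink;
    }

    // Called from the Docker log stream's onNext(Frame) callback.
    synchronized void append(String line, long nowMs) {
        if (buffer.isEmpty()) oldestLineAt = nowMs;
        buffer.add(line);
        if (buffer.size() >= MAX_LINES || nowMs - oldestLineAt >= MAX_AGE_MS) {
            flush();
        }
    }

    // Called on onComplete()/onError() so buffered lines are never lost.
    synchronized void flush() {
        if (buffer.isEmpty()) return;
        sink.accept(List.copyOf(buffer));
        buffer.clear();
        oldestLineAt = -1;
    }
}
```

In the real component the time-based flush would be driven by a scheduler rather than piggybacking on `append`; the policy is the same either way.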
**Safety:**
- Max capture duration: 5 minutes. A scheduled cleanup (every 30s) stops sessions exceeding this limit.
- `@PreDestroy` cleanup: stop all active captures on server shutdown.
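The 30-second sweep that enforces the 5-minute cap could look like the following. The session bookkeeping (container ID to start time) and the class name are assumptions; the real cleanup would also final-flush and close the Docker stream:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the scheduled cleanup: every 30s, stop any capture session
// that has been running longer than 5 minutes. Illustrative names only.
final class CaptureReaper {
    static final long MAX_CAPTURE_MS = 5 * 60 * 1_000;

    // containerId -> capture start time (epoch millis)
    final Map<String, Long> startedAt = new ConcurrentHashMap<>();

    // Invoked by a scheduler every 30 seconds; returns how many were stopped.
    int sweep(long nowMs) {
        int stopped = 0;
        Iterator<Map.Entry<String, Long>> it = startedAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() >= MAX_CAPTURE_MS) {
                it.remove(); // real code: also final-flush and close the stream
                stopped++;
            }
        }
        return stopped;
    }
}
```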
### ClickHouse Field Mapping
Uses the existing `logs` table. No schema changes required.
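A plausible shape for the row written per line, based on the fields this doc mentions (`source`, app/env/tenant metadata, container name as `instanceId`); any field names beyond those are assumptions about the real `BufferedLogEntry`:

```java
// Hypothetical sketch of the entry flushed to the existing `logs` table.
// Only source, app/env/tenant, and instanceId are stated in this design;
// timestamp and message field names are assumed.
record BufferedLogEntry(
        String source,     // always "container" for this pipeline
        String appId,
        String envId,
        String tenantId,
        String instanceId, // container name, e.g. prod-orderservice-0
        long timestampMs,
        String message) {

    static BufferedLogEntry fromContainerLine(
            String app, String env, String tenant,
            String containerName, long tsMs, String line) {
        return new BufferedLogEntry("container", app, env, tenant,
                containerName, tsMs, line);
    }
}
```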
### Stop Capture — Agent Connected
When the agent inside the container establishes its SSE connection, the server calls `stopLogCaptureByApp()` for that deployment. Best-effort call: no-op if no capture exists for that app+env (e.g., non-Docker agent).
### Stop Capture — Container Death
`DockerEventMonitor` handles `die`/`oom` events. After updating replica state:
```java
orchestrator.stopLogCapture(containerId);
```
Triggers final flush of buffered lines before cleanup.
### Stop Capture — Deployment Failure Cleanup
No extra code needed. When `DeploymentExecutor` stops or removes containers after a failed health check, the resulting Docker `die` events flow through `DockerEventMonitor`, which calls `stopLogCapture` as above.
## UI Changes
### 1. Deployment Startup Log Panel
A collapsible log panel below the `DeploymentProgress` component in the deployment detail view.
- Auto-refreshes every 3 seconds while deployment status is STARTING
- Stops polling when status reaches RUNNING or FAILED
- Manual refresh button available in all states
**Status indicator:**
- Green "live" badge + "polling every 3s" text while STARTING
- Red "stopped" badge when FAILED
- No badge when RUNNING (panel remains visible with historical startup logs)
**Layout:** Uses existing `LogViewer` component from `@cameleer/design-system` and shared log panel styles from `ui/src/styles/log-panel.module.css`.
### 2. Source Badge in Log Views
Everywhere logs are displayed (AgentInstance page, LogTab, general log search), each log line gets a small source badge:
- `container` — slate/gray badge
- `app` — green badge
- `agent` — existing behavior
The `source` field already exists in `LogEntryResponse`. This is a rendering-only change in the LogViewer or its wrapper.
### 3. Source Filter Update
The log toolbar source filter (currently App vs Agent) adds `Container` as a third option. The backend `/api/v1/logs` endpoint already accepts `source` as a query parameter — no backend change needed for filtering.
## Edge Cases
**Multi-replica:** Each replica gets its own capture session keyed by container ID. `instance_id` in ClickHouse is the container name (e.g., `prod-orderservice-0`). `stopLogCaptureByApp()` stops all sessions for that app+env pair.
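The multi-replica bookkeeping can be sketched as a registry keyed by container ID, where each session also carries its app+env so the by-app stop can find every replica. Class and method names here are illustrative stand-ins for the orchestrator internals:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: one capture session per replica, keyed by container ID.
// stopByApp mirrors stopLogCaptureByApp(); names are illustrative.
final class CaptureRegistry {
    record Session(String containerId, String app, String env) {}

    private final Map<String, Session> byContainerId = new ConcurrentHashMap<>();

    void register(Session s) {
        byContainerId.put(s.containerId(), s);
    }

    // Single replica died (Docker die/oom event).
    boolean stopByContainer(String containerId) {
        return byContainerId.remove(containerId) != null;
    }

    // Agent connected: stop every replica's capture for this app+env.
    int stopByApp(String app, String env) {
        List<String> ids = byContainerId.values().stream()
                .filter(s -> s.app().equals(app) && s.env().equals(env))
                .map(Session::containerId)
                .toList();
        ids.forEach(byContainerId::remove); // real code: final-flush each
        return ids.size();
    }
}
```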
**Server restart during capture:** Active sessions are in-memory and lost on restart. Not a problem — containers likely restart too (Docker restart policy), and new captures start when `DeploymentExecutor` runs again. Already-flushed logs survive in ClickHouse.
**Container produces no output:** Follow stream stays open but idle (parked thread, no CPU cost). Cleaned up by the 5-minute timeout or container death.
**Rapid redeployment:** Old container dies -> `stopLogCapture(oldContainerId)`. New container starts -> `startLogCapture(newContainerId, ...)`. Different container IDs, no conflict.
**Log overlap:** When the agent connects and starts sending `source='app'` logs, there may be a brief overlap with `source='container'` logs for the same timeframe. Both are shown with source badges. Users can filter by source if needed.