fix: unify container/agent log identity and fix multi-replica log capture
All checks were successful
CI / cleanup-branch (push) Has been skipped
CI / build (push) Successful in 1m21s
CI / docker (push) Successful in 1m18s
CI / deploy-feature (push) Has been skipped
CI / deploy (push) Successful in 42s

Four logging pipeline fixes:

1. Multi-replica startup logs: remove stopLogCaptureByApp from
   SseConnectionManager — container log capture now expires naturally
   after 60s instead of being killed when the first agent connects SSE.
   This ensures all replicas' bootstrap output is captured.

2. Unified instance_id: container logs and agent logs now share the same
   instance identity ({envSlug}-{appSlug}-{replicaIndex}). DeploymentExecutor
   sets CAMELEER_AGENT_INSTANCEID per replica so the agent uses the same
   ID as ContainerLogForwarder. Instance-level log views now show both
   container and agent logs.

3. Labels-first container identity: TraefikLabelBuilder emits cameleer.replica
   and cameleer.instance-id labels. Container names are tenant-prefixed
   ({tenantId}-{envSlug}-{appSlug}-{idx}) for global Docker daemon uniqueness.

4. Environment filter on log queries: useApplicationLogs now passes the
   selected environment to the API, preventing log leakage across environments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
hsiegeln
2026-04-15 10:54:05 +02:00
parent bf2d07f3ba
commit fd806b9f2d
9 changed files with 43 additions and 65 deletions

View File

@@ -53,7 +53,7 @@ java -jar cameleer3-server-app/target/cameleer3-server-app-1.0-SNAPSHOT.jar
- `ContainerRequest` — record: 20 fields for Docker container creation (includes runtimeType, customArgs, mainClass)
- `ResolvedContainerConfig` — record: typed config with memoryLimitMb, cpuShares, cpuLimit, appPort, replicas, routingMode, routeControlEnabled, replayEnabled, runtimeType, customArgs, etc.
- `ConfigMerger` — pure function: resolve(globalDefaults, envConfig, appConfig) -> ResolvedContainerConfig
- `RuntimeOrchestrator` — interface: startContainer, stopContainer, getContainerStatus, getLogs, startLogCapture, stopLogCapture, stopLogCaptureByApp
- `RuntimeOrchestrator` — interface: startContainer, stopContainer, getContainerStatus, getLogs, startLogCapture(containerId, instanceId, appSlug, envSlug, tenantId), stopLogCapture
**search/** — Execution search
- `SearchService` — search, topErrors, punchcard, distinctAttributeKeys
@@ -110,12 +110,12 @@ java -jar cameleer3-server-app/target/cameleer3-server-app-1.0-SNAPSHOT.jar
**runtime/** — Docker orchestration
- `DockerRuntimeOrchestrator` — implements RuntimeOrchestrator; Docker Java client (zerodep transport), container lifecycle
- `DeploymentExecutor`@Async staged deploy: PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE. Primary network for app containers is set via `CAMELEER_SERVER_RUNTIME_DOCKERNETWORK` env var (in SaaS mode: `cameleer-tenant-{slug}`); apps also connect to `cameleer-traefik` (routing) and `cameleer-env-{tenantId}-{envSlug}` (per-environment discovery) as additional networks. Resolves `runtimeType: auto` to concrete type from `AppVersion.detectedRuntimeType` at PRE_FLIGHT (fails deployment if unresolvable). Builds framework-specific Docker entrypoint per runtime type (Spring Boot PropertiesLauncher, Quarkus `-jar`, plain Java classpath, native binary). Sets `CAMELEER_AGENT_*` env vars from `ResolvedContainerConfig` (routeControlEnabled, replayEnabled, health port). These are startup-only agent properties — changing them requires redeployment.
- `DeploymentExecutor`@Async staged deploy: PRE_FLIGHT -> PULL_IMAGE -> CREATE_NETWORK -> START_REPLICAS -> HEALTH_CHECK -> SWAP_TRAFFIC -> COMPLETE. Container names are `{tenantId}-{envSlug}-{appSlug}-{replicaIndex}` (globally unique on Docker daemon). Primary network for app containers is set via `CAMELEER_SERVER_RUNTIME_DOCKERNETWORK` env var (in SaaS mode: `cameleer-tenant-{slug}`); apps also connect to `cameleer-traefik` (routing) and `cameleer-env-{tenantId}-{envSlug}` (per-environment discovery) as additional networks. Resolves `runtimeType: auto` to concrete type from `AppVersion.detectedRuntimeType` at PRE_FLIGHT (fails deployment if unresolvable). Builds framework-specific Docker entrypoint per runtime type (Spring Boot PropertiesLauncher, Quarkus `-jar`, plain Java classpath, native binary). Sets per-replica `CAMELEER_AGENT_INSTANCEID` env var to `{envSlug}-{appSlug}-{replicaIndex}` so container logs and agent logs share the same instance identity. Sets `CAMELEER_AGENT_*` env vars from `ResolvedContainerConfig` (routeControlEnabled, replayEnabled, health port). These are startup-only agent properties — changing them requires redeployment.
- `DockerNetworkManager` — ensures bridge networks (cameleer-traefik, cameleer-env-{slug}), connects containers
- `DockerEventMonitor` — persistent Docker event stream listener (die, oom, start, stop), updates deployment status
- `TraefikLabelBuilder` — generates Traefik Docker labels for path-based or subdomain routing
- `TraefikLabelBuilder` — generates Traefik Docker labels for path-based or subdomain routing. Also emits `cameleer.replica` and `cameleer.instance-id` labels per container for labels-first identity.
- `PrometheusLabelBuilder` — generates Prometheus Docker labels (`prometheus.scrape/path/port`) per runtime type for `docker_sd_configs` auto-discovery
- `ContainerLogForwarder` — streams Docker container stdout/stderr to ClickHouse with `source='container'`. One follow-stream thread per container, batches lines every 2s/50 lines via `ClickHouseLogStore.insertBufferedBatch()`. Started by `DeploymentExecutor` after container creation, stopped by `SseConnectionManager` on agent SSE connect or by `DockerEventMonitor` on die/oom. 5-minute max capture timeout.
- `ContainerLogForwarder` — streams Docker container stdout/stderr to ClickHouse with `source='container'`. One follow-stream thread per container, batches lines every 2s/50 lines via `ClickHouseLogStore.insertBufferedBatch()`. Started by `DeploymentExecutor` after each replica launches, stopped by `DockerEventMonitor` on die/oom. 60-second max capture timeout (captures full bootstrap). Uses the same `instanceId` (`{envSlug}-{appSlug}-{replicaIndex}`) as the agent for unified log correlation.
- `DisabledRuntimeOrchestrator` — no-op when runtime not enabled
**metrics/** — Prometheus observability
@@ -252,14 +252,14 @@ The UI has 4 main tabs: **Exchanges**, **Dashboard**, **Runtime**, **Deployments
When deployed via the cameleer-saas platform, this server orchestrates customer app containers using Docker. Key components:
- **ConfigMerger** (`core/runtime/ConfigMerger.java`) — pure function: resolve(globalDefaults, envConfig, appConfig) -> ResolvedContainerConfig. Three-layer merge: global (application.yml) -> environment (defaultContainerConfig JSONB) -> app (containerConfig JSONB). Includes `runtimeType` (default `"auto"`) and `customArgs` (default `""`).
- **TraefikLabelBuilder** (`app/runtime/TraefikLabelBuilder.java`) — generates Traefik Docker labels for path-based (`/{envSlug}/{appSlug}/`) or subdomain-based (`{appSlug}-{envSlug}.{domain}`) routing. Supports strip-prefix and SSL offloading toggles.
- **TraefikLabelBuilder** (`app/runtime/TraefikLabelBuilder.java`) — generates Traefik Docker labels for path-based (`/{envSlug}/{appSlug}/`) or subdomain-based (`{appSlug}-{envSlug}.{domain}`) routing. Supports strip-prefix and SSL offloading toggles. Also sets per-replica identity labels: `cameleer.replica` (index) and `cameleer.instance-id` (`{envSlug}-{appSlug}-{replicaIndex}`). Internal processing uses labels (not container name parsing) for extensibility.
- **PrometheusLabelBuilder** (`app/runtime/PrometheusLabelBuilder.java`) — generates Prometheus `docker_sd_configs` labels per resolved runtime type: Spring Boot `/actuator/prometheus:8081`, Quarkus/native `/q/metrics:9000`, plain Java `/metrics:9464`. Labels merged into container metadata alongside Traefik labels at deploy time.
- **DockerNetworkManager** (`app/runtime/DockerNetworkManager.java`) — manages two Docker network tiers:
- `cameleer-traefik` — shared network; Traefik, server, and all app containers attach here. Server joined via docker-compose with `cameleer3-server` DNS alias.
- `cameleer-env-{slug}` — per-environment isolated network; containers in the same environment discover each other via Docker DNS. In SaaS mode, env networks are tenant-scoped: `cameleer-env-{tenantId}-{envSlug}` (overloaded `envNetworkName(tenantId, envSlug)` method) to prevent cross-tenant collisions when multiple tenants have identically-named environments.
- **DockerEventMonitor** (`app/runtime/DockerEventMonitor.java`) — persistent Docker event stream listener for containers with `managed-by=cameleer3-server` label. Detects die/oom/start/stop events and updates deployment replica states. Periodic reconciliation (@Scheduled every 30s) inspects actual container state and corrects deployment status mismatches (fixes stale DEGRADED with all replicas healthy).
- **DeploymentProgress** (`ui/src/components/DeploymentProgress.tsx`) — UI step indicator showing 7 deploy stages with amber active/green completed styling.
- **ContainerLogForwarder** (`app/runtime/ContainerLogForwarder.java`) — streams Docker container stdout/stderr to ClickHouse `logs` table with `source='container'`. Uses `docker logs --follow` per container, batches lines every 2s or 50 lines. Parses Docker timestamp prefix, infers log level via regex. `DeploymentExecutor` starts capture after each replica launches; `SseConnectionManager` stops capture when the agent connects SSE; `DockerEventMonitor` stops capture on die/oom. 5-minute max capture timeout with 30s cleanup scheduler. Thread pool of 10 daemon threads.
- **ContainerLogForwarder** (`app/runtime/ContainerLogForwarder.java`) — streams Docker container stdout/stderr to ClickHouse `logs` table with `source='container'`. Uses `docker logs --follow` per container, batches lines every 2s or 50 lines. Parses Docker timestamp prefix, infers log level via regex. `DeploymentExecutor` starts capture after each replica launches with the replica's `instanceId` (`{envSlug}-{appSlug}-{replicaIndex}`); `DockerEventMonitor` stops capture on die/oom. 60-second max capture timeout with 30s cleanup scheduler. Thread pool of 10 daemon threads. Container logs use the same `instanceId` as the agent (set via `CAMELEER_AGENT_INSTANCEID` env var) for unified log correlation at the instance level.
- **StartupLogPanel** (`ui/src/components/StartupLogPanel.tsx`) — collapsible log panel rendered below `DeploymentProgress`. Queries `/api/v1/logs?source=container&application={appSlug}&environment={envSlug}`. Auto-polls every 3s while deployment is STARTING; shows green "live" badge during polling, red "stopped" badge on FAILED. Uses `useStartupLogs` hook and `LogViewer` (design system).
### Deployment Status Model
@@ -397,7 +397,7 @@ Mean processing time = `camel.route.policy.total_time / camel.route.policy.count
<!-- gitnexus:start -->
# GitNexus — Code Intelligence
This project is indexed by GitNexus as **cameleer3-server** (6295 symbols, 15883 relationships, 300 execution flows). Use the GitNexus MCP tools to understand code, assess impact, and navigate safely.
This project is indexed by GitNexus as **cameleer3-server** (6300 symbols, 15883 relationships, 300 execution flows). Use the GitNexus MCP tools to understand code, assess impact, and navigate safely.
> If any GitNexus tool warns the index is stale, run `npx gitnexus analyze` in terminal first.

View File

@@ -3,9 +3,7 @@ package com.cameleer3.server.app.agent;
import com.cameleer3.server.app.config.AgentRegistryConfig;
import com.cameleer3.server.core.agent.AgentCommand;
import com.cameleer3.server.core.agent.AgentEventListener;
import com.cameleer3.server.core.agent.AgentInfo;
import com.cameleer3.server.core.agent.AgentRegistryService;
import com.cameleer3.server.core.runtime.RuntimeOrchestrator;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import jakarta.annotation.PostConstruct;
@@ -37,16 +35,13 @@ public class SseConnectionManager implements AgentEventListener {
private final AgentRegistryConfig config;
private final SsePayloadSigner ssePayloadSigner;
private final ObjectMapper objectMapper;
private final RuntimeOrchestrator runtimeOrchestrator;
public SseConnectionManager(AgentRegistryService registryService, AgentRegistryConfig config,
SsePayloadSigner ssePayloadSigner, ObjectMapper objectMapper,
RuntimeOrchestrator runtimeOrchestrator) {
SsePayloadSigner ssePayloadSigner, ObjectMapper objectMapper) {
this.registryService = registryService;
this.config = config;
this.ssePayloadSigner = ssePayloadSigner;
this.objectMapper = objectMapper;
this.runtimeOrchestrator = runtimeOrchestrator;
}
@PostConstruct
@@ -87,12 +82,6 @@ public class SseConnectionManager implements AgentEventListener {
log.info("SSE connection established for agent {}", agentId);
// Stop container log capture — agent is now online and will send its own logs
AgentInfo agent = registryService.findById(agentId);
if (agent != null) {
runtimeOrchestrator.stopLogCaptureByApp(agent.applicationId(), agent.environmentId());
}
return emitter;
}

View File

@@ -30,7 +30,7 @@ public class ContainerLogForwarder {
private static final int FLUSH_BATCH_SIZE = 50;
private static final long FLUSH_INTERVAL_MS = 2_000;
private static final long MAX_CAPTURE_DURATION_MS = 5 * 60 * 1_000;
private static final long MAX_CAPTURE_DURATION_MS = 60_000;
private static final long CLEANUP_INTERVAL_MS = 30_000;
// Pattern to match Docker timestamp prefix: "2026-04-14T14:23:01.234567890Z "
@@ -56,21 +56,21 @@ public class ContainerLogForwarder {
CLEANUP_INTERVAL_MS, CLEANUP_INTERVAL_MS, TimeUnit.MILLISECONDS);
}
public void startCapture(String containerId, String appSlug, String envSlug, String tenantId) {
public void startCapture(String containerId, String instanceId, String appSlug, String envSlug, String tenantId) {
if (sessions.containsKey(containerId)) {
log.debug("Already capturing logs for container {}", containerId.substring(0, 12));
return;
}
CaptureSession session = new CaptureSession(containerId, appSlug, envSlug, tenantId);
CaptureSession session = new CaptureSession(containerId, instanceId, appSlug, envSlug, tenantId);
if (sessions.putIfAbsent(containerId, session) != null) {
return; // another thread beat us
}
Future<?> future = executor.submit(() -> streamLogs(session));
session.future = future;
log.info("Started log capture for container {} (app={}, env={})",
containerId.substring(0, 12), appSlug, envSlug);
log.info("Started log capture for container {} (instance={}, app={}, env={})",
containerId.substring(0, 12), instanceId, appSlug, envSlug);
}
public void stopCapture(String containerId) {
@@ -83,23 +83,6 @@ public class ContainerLogForwarder {
containerId.substring(0, 12), session.lineCount);
}
public void stopCaptureByApp(String appSlug, String envSlug) {
List<String> toRemove = new ArrayList<>();
for (Map.Entry<String, CaptureSession> entry : sessions.entrySet()) {
CaptureSession s = entry.getValue();
if (appSlug.equals(s.appSlug) && envSlug.equals(s.envSlug)) {
toRemove.add(entry.getKey());
}
}
for (String containerId : toRemove) {
stopCapture(containerId);
}
if (!toRemove.isEmpty()) {
log.info("Stopped log capture for app={} env={} ({} containers)",
appSlug, envSlug, toRemove.size());
}
}
@PreDestroy
public void shutdown() {
for (String containerId : new ArrayList<>(sessions.keySet())) {
@@ -192,7 +175,7 @@ public class ContainerLogForwarder {
logEntry.setSource("container");
entries.add(new BufferedLogEntry(
session.tenantId, session.envSlug, session.containerId.substring(0, 12),
session.tenantId, session.envSlug, session.instanceId,
session.appSlug, logEntry));
}
@@ -229,6 +212,7 @@ public class ContainerLogForwarder {
private static class CaptureSession {
final String containerId;
final String instanceId;
final String appSlug;
final String envSlug;
final String tenantId;
@@ -240,8 +224,9 @@ public class ContainerLogForwarder {
volatile Future<?> future;
volatile ResultCallback.Adapter<Frame> callback;
CaptureSession(String containerId, String appSlug, String envSlug, String tenantId) {
CaptureSession(String containerId, String instanceId, String appSlug, String envSlug, String tenantId) {
this.containerId = containerId;
this.instanceId = instanceId;
this.appSlug = appSlug;
this.envSlug = envSlug;
this.tenantId = tenantId;

View File

@@ -166,14 +166,22 @@ public class DeploymentExecutor {
updateStage(deployment.id(), DeployStage.START_REPLICAS);
Map<String, String> baseEnvVars = buildEnvVars(app, env, config);
Map<String, String> labels = TraefikLabelBuilder.build(app.slug(), env.slug(), tenantId, config);
labels.putAll(PrometheusLabelBuilder.build(resolvedRuntimeType));
Map<String, String> prometheusLabels = PrometheusLabelBuilder.build(resolvedRuntimeType);
List<Map<String, Object>> replicaStates = new ArrayList<>();
List<String> newContainerIds = new ArrayList<>();
for (int i = 0; i < config.replicas(); i++) {
String containerName = env.slug() + "-" + app.slug() + "-" + i;
String instanceId = env.slug() + "-" + app.slug() + "-" + i;
String containerName = tenantId + "-" + instanceId;
// Per-replica labels (include replica index and instance-id)
Map<String, String> labels = TraefikLabelBuilder.build(app.slug(), env.slug(), tenantId, config, i);
labels.putAll(prometheusLabels);
// Per-replica env vars (set agent instance ID to match container log identity)
Map<String, String> replicaEnvVars = new LinkedHashMap<>(baseEnvVars);
replicaEnvVars.put("CAMELEER_AGENT_INSTANCEID", instanceId);
String volumeName = jarDockerVolume != null && !jarDockerVolume.isBlank() ? jarDockerVolume : null;
ContainerRequest request = new ContainerRequest(
@@ -181,7 +189,7 @@ public class DeploymentExecutor {
volumeName, jarStoragePath,
primaryNetwork,
additionalNets,
baseEnvVars, labels,
replicaEnvVars, labels,
config.memoryLimitBytes(), config.memoryReserveBytes(),
config.dockerCpuShares(), config.dockerCpuQuota(),
config.exposedPorts(), agentHealthPort,
@@ -199,7 +207,7 @@ public class DeploymentExecutor {
}
}
orchestrator.startLogCapture(containerId, app.slug(), env.slug(), tenantId);
orchestrator.startLogCapture(containerId, instanceId, app.slug(), env.slug(), tenantId);
replicaStates.add(Map.of(
"index", i,

View File

@@ -13,7 +13,6 @@ public class DisabledRuntimeOrchestrator implements RuntimeOrchestrator {
@Override public void removeContainer(String id) { throw new UnsupportedOperationException("Runtime management disabled"); }
@Override public ContainerStatus getContainerStatus(String id) { return ContainerStatus.notFound(); }
@Override public Stream<String> getLogs(String id, int tail) { return Stream.empty(); }
@Override public void startLogCapture(String containerId, String appSlug, String envSlug, String tenantId) {}
@Override public void startLogCapture(String containerId, String instanceId, String appSlug, String envSlug, String tenantId) {}
@Override public void stopLogCapture(String containerId) {}
@Override public void stopLogCaptureByApp(String appSlug, String envSlug) {}
}

View File

@@ -192,9 +192,9 @@ public class DockerRuntimeOrchestrator implements RuntimeOrchestrator {
}
@Override
public void startLogCapture(String containerId, String appSlug, String envSlug, String tenantId) {
public void startLogCapture(String containerId, String instanceId, String appSlug, String envSlug, String tenantId) {
if (logForwarder != null) {
logForwarder.startCapture(containerId, appSlug, envSlug, tenantId);
logForwarder.startCapture(containerId, instanceId, appSlug, envSlug, tenantId);
}
}
@@ -204,11 +204,4 @@ public class DockerRuntimeOrchestrator implements RuntimeOrchestrator {
logForwarder.stopCapture(containerId);
}
}
@Override
public void stopLogCaptureByApp(String appSlug, String envSlug) {
if (logForwarder != null) {
logForwarder.stopCaptureByApp(appSlug, envSlug);
}
}
}

View File

@@ -9,8 +9,10 @@ public final class TraefikLabelBuilder {
private TraefikLabelBuilder() {}
public static Map<String, String> build(String appSlug, String envSlug, String tenantId, ResolvedContainerConfig config) {
public static Map<String, String> build(String appSlug, String envSlug, String tenantId,
ResolvedContainerConfig config, int replicaIndex) {
String svc = envSlug + "-" + appSlug;
String instanceId = envSlug + "-" + appSlug + "-" + replicaIndex;
Map<String, String> labels = new LinkedHashMap<>();
labels.put("traefik.enable", "true");
@@ -18,6 +20,8 @@ public final class TraefikLabelBuilder {
labels.put("cameleer.tenant", tenantId);
labels.put("cameleer.app", appSlug);
labels.put("cameleer.environment", envSlug);
labels.put("cameleer.replica", String.valueOf(replicaIndex));
labels.put("cameleer.instance-id", instanceId);
labels.put("traefik.http.services." + svc + ".loadbalancer.server.port",
String.valueOf(config.appPort()));

View File

@@ -11,11 +11,8 @@ public interface RuntimeOrchestrator {
Stream<String> getLogs(String containerId, int tailLines);
/** Start streaming container logs to ClickHouse. */
default void startLogCapture(String containerId, String appSlug, String envSlug, String tenantId) {}
default void startLogCapture(String containerId, String instanceId, String appSlug, String envSlug, String tenantId) {}
/** Stop log capture for a specific container (e.g., on die/oom). */
default void stopLogCapture(String containerId) {}
/** Stop log capture for all containers matching this app+env (e.g., on agent SSE connect). */
default void stopLogCaptureByApp(String appSlug, String envSlug) {}
}

View File

@@ -3,6 +3,7 @@ import { config } from '../../config';
import { useAuthStore } from '../../auth/auth-store';
import { useRefreshInterval } from './use-refresh-interval';
import { useGlobalFilters } from '@cameleer/design-system';
import { useEnvironmentStore } from '../environment-store';
export interface LogEntryResponse {
timestamp: string;
@@ -98,6 +99,7 @@ export function useApplicationLogs(
) {
const refetchInterval = useRefreshInterval(15_000);
const { timeRange } = useGlobalFilters();
const selectedEnv = useEnvironmentStore((s) => s.environment);
const to = options?.toOverride ?? timeRange.end.toISOString();
const useTimeRange = !options?.exchangeId;
@@ -105,6 +107,7 @@ export function useApplicationLogs(
application: application || undefined,
agentId: agentId || undefined,
source: options?.source || undefined,
environment: selectedEnv || undefined,
exchangeId: options?.exchangeId || undefined,
from: useTimeRange ? timeRange.start.toISOString() : undefined,
to: useTimeRange ? to : undefined,
@@ -112,7 +115,7 @@ export function useApplicationLogs(
};
const query = useQuery({
queryKey: ['logs', 'compat', application, agentId,
queryKey: ['logs', 'compat', application, agentId, selectedEnv,
useTimeRange ? timeRange.start.toISOString() : null,
useTimeRange ? to : null,
options?.limit, options?.exchangeId, options?.source],