# Moat-Strengthening Features — Design Specification
**Date:** 2026-03-29
**Status:** Draft — Awaiting Review
**Author:** Boardroom simulation (Strategist, Skeptic, Architect, Growth Hacker)
**Gitea Issues:** cameleer/cameleer3 #57-#72 (label: MOAT)
## Executive Summary
Three features designed to convert Cameleer's technical moat (ByteBuddy agent) into a workflow moat (debugger + lineage) and ultimately a network moat (cross-service correlation) before the vibe-coding window closes.
| Feature | Ship Target | Moat Type | Agent Changes | Server Changes |
|---------|-------------|-----------|---------------|----------------|
| Live Route Debugger | Weeks 8-14 | Workflow | Heavy (DebugSessionManager, breakpoints) | Heavy (WebSocket, session mgmt) |
| Payload Flow Lineage | Weeks 3-6 | Technical | Light (one capture mode check) | Medium (DiffEngine) |
| Cross-Service Correlation | Weeks 1-9 | Network effect | Light (header propagation) | Medium (trace assembly, topology) |
### Build Order
```
Week 1-3: Foundation + Topology Graph (from existing data, zero agent changes)
Week 3-6: Payload Flow Lineage (agent + server + UI)
Week 5-9: Distributed Trace Correlation (agent header + server joins + UI)
Week 8-14: Live Route Debugger (agent + server + UI)
```
### Gitea Issue Map
**Epics:**
- #57 — Live Route Debugger
- #58 — Payload Flow Lineage
- #59 — Cross-Service Trace Correlation + Topology Map
**Debugger sub-issues:**
- #60 — Protocol: Debug session command types (`cameleer3-common`)
- #61 — Agent: DebugSessionManager + breakpoint InterceptStrategy integration
- #62 — Agent: ExchangeStateSerializer + synthetic direct route wrapper
- #63 — Server: DebugSessionService + WebSocket + REST API
- #70 — UI: Debug session frontend components
**Lineage sub-issues:**
- #64 — Protocol: Lineage command types (`cameleer3-common`)
- #65 — Agent: LineageManager + capture mode integration
- #66 — Server: LineageService + DiffEngine + REST API
- #71 — UI: Lineage timeline + diff viewer components
**Correlation sub-issues:**
- #67 — Agent: Enhanced trace context header propagation
- #68 — Server: CorrelationService — distributed trace assembly
- #69 — Server: DependencyGraphService + service topology materialized view
- #72 — UI: Distributed trace view + service topology graph
---
## 1. Live Route Debugger
### 1.1 Concept
Extend the existing `replay` command with a debug-session wrapper. Users provide an exchange (from a prior failed execution or manually constructed) and replay it through a route with breakpoints. Only the replayed exchange's thread blocks at breakpoints — production traffic flows normally.
**User Story:** A developer sees a failed exchange. They click "Debug This Exchange." Cameleer pre-fills the body/headers. They set breakpoints, click "Start Debug Session." The exchange replays through the route, pausing at each breakpoint. They inspect state, modify the body, step forward. Total: 3 minutes. Without Cameleer: 45 minutes.
### 1.2 Architecture
```
Browser (SaaS UI)
|
v
WebSocket <--------------------------------------+
| |
v |
cameleer3-server |
| POST /api/v1/debug/sessions |
| POST /api/v1/debug/sessions/{id}/step |
| POST /api/v1/debug/sessions/{id}/resume |
| DELETE /api/v1/debug/sessions/{id} |
| |
v |
SSE Command Channel --> cameleer3 agent |
| | |
| "start-debug" | |
| command v |
| DebugSessionManager |
| | |
| Replay exchange via |
| ProducerTemplate |
| | |
| InterceptStrategy checks |
| breakpoints before each |
| processor |
| | |
| On breakpoint hit: |
| > LockSupport.park() |
| > Serialize exchange state |
| > POST state to server -------+
| | (server pushes to
| Wait for resume/step/skip browser via WS)
| command via SSE
| |
| On resume: LockSupport.unpark()
| Continue to next processor
```
### 1.3 Protocol Additions (cameleer3-common)
#### New SSE Commands
| Command | Direction | Purpose |
|---------|-----------|---------|
| `START_DEBUG` | Server -> Agent | Create session, spawn thread, replay exchange with breakpoints |
| `DEBUG_RESUME` | Server -> Agent | Unpark thread, continue to next breakpoint |
| `DEBUG_STEP` | Server -> Agent | Unpark thread, break at next processor (STEP_OVER/STEP_INTO) |
| `DEBUG_SKIP` | Server -> Agent | Skip current processor, continue |
| `DEBUG_MODIFY` | Server -> Agent | Apply body/header changes at current breakpoint before resuming |
| `DEBUG_ABORT` | Server -> Agent | Abort session, release thread |
#### StartDebugPayload
```json
{
"sessionId": "dbg-a1b2c3",
"routeId": "route-orders",
"exchange": {
"body": "{\"orderId\": 42, \"amount\": 150.00}",
"headers": { "Content-Type": "application/json" }
},
"breakpoints": [
{ "processorId": "choice1", "condition": null },
{ "processorId": "to5", "condition": "${body.amount} > 100" }
],
"mode": "STEP_OVER",
"timeoutSeconds": 300,
"originalExchangeId": "ID-failed-789",
"replayToken": "...",
"nonce": "..."
}
```
#### BreakpointHitReport (Agent -> Server)
```json
{
"sessionId": "dbg-a1b2c3",
"processorId": "to5",
"processorType": "TO",
"endpointUri": "http://payment-service/charge",
"depth": 2,
"stepIndex": 4,
"exchangeState": {
"body": "{\"orderId\": 42, \"amount\": 150.00, \"validated\": true}",
"headers": { "...": "..." },
"properties": { "CamelSplitIndex": 0 },
"exception": null,
"bodyType": "java.util.LinkedHashMap"
},
"executionTree": ["...partial tree up to this point..."],
"parentProcessorId": "split1",
"routeId": "route-orders",
"timestamp": "2026-03-29T14:22:05.123Z"
}
```
### 1.4 Agent Implementation (cameleer3-agent)
#### DebugSessionManager
- Location: `com.cameleer3.agent.debug.DebugSessionManager`
- Stores active sessions: `ConcurrentHashMap<sessionId, DebugSession>`
- Enforces max concurrent sessions (default 3, configurable via `cameleer.debug.maxSessions`)
- Allocates **dedicated Thread** per session (NOT from Camel thread pool)
- Timeout watchdog: `ScheduledExecutorService` auto-aborts expired sessions
- Handles all `DEBUG_*` commands via `DefaultCommandHandler` delegation
#### DebugSession
- Stores breakpoint definitions, current step mode, parked thread reference
- `shouldBreak(processorId, Exchange)`: evaluates processor match + Simple condition + step mode
- `reportBreakpointHit()`: serializes state, POSTs to server, calls `LockSupport.park()`
- `applyModifications(Exchange)`: sets body/headers from `DEBUG_MODIFY` command
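The `shouldBreak` decision above can be sketched in plain Java. This is an illustrative model, not the agent code: names are hypothetical, the real implementation operates on Camel `Exchange` objects, and breakpoint conditions are evaluated with Camel's Simple language rather than the stubbed `Predicate` used here.

```java
import java.util.Map;
import java.util.function.Predicate;

// Sketch of DebugSession.shouldBreak: step mode wins, otherwise a
// breakpoint must target this processor and its optional condition
// must pass. Conditions are stubbed as predicates over the headers.
public class DebugBreakLogic {

    public enum StepMode { RUN_TO_BREAKPOINT, STEP_OVER }

    public record Breakpoint(String processorId, Predicate<Map<String, Object>> condition) {}

    public static boolean shouldBreak(StepMode mode, Map<String, Breakpoint> breakpoints,
                                      String processorId, Map<String, Object> headers) {
        if (mode == StepMode.STEP_OVER) {
            return true; // stepping: pause at the very next processor
        }
        Breakpoint bp = breakpoints.get(processorId);
        if (bp == null) {
            return false; // no breakpoint on this processor
        }
        // unconditional breakpoint, or condition evaluates true
        return bp.condition() == null || bp.condition().test(headers);
    }
}
```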
#### InterceptStrategy Integration
In `CameleerInterceptStrategy.DelegateAsyncProcessor.process()`:
```java
DebugSession session = debugSessionManager.getSession(exchange);
if (session != null && session.shouldBreak(processorId, exchange)) {
ExchangeState state = ExchangeStateSerializer.capture(exchange);
List<ProcessorExecution> tree = executionCollector.getPartialTree(exchange);
session.reportBreakpointHit(processorId, state, tree);
// Thread parked until server sends resume/step/skip/abort
if (session.isAborted()) throw new DebugSessionAbortedException();
if (session.shouldSkip()) { callback.done(true); return true; }
if (session.hasModifications()) session.applyModifications(exchange);
}
```
**Zero production overhead:** Debug exchanges carry `CameleerDebugSessionId` exchange property. `getSession()` checks this property — single null-check. Production exchanges have no property, check returns null, no further work.
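The hot-path claim can be made concrete with a small sketch. Shapes are hypothetical (the real code reads a Camel `Exchange` property); the point is that production exchanges never carry the property, so the check bottoms out in one null test before any map lookup happens.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the DebugSessionManager hot-path lookup. Only debug replays
// carry CameleerDebugSessionId, so production traffic pays one null-check.
public class DebugSessionLookup {

    public static final String SESSION_PROP = "CameleerDebugSessionId";

    private final Map<String, Object> sessions = new ConcurrentHashMap<>();

    public void register(String sessionId, Object session) {
        sessions.put(sessionId, session);
    }

    /** Active session for a debug exchange, or null for production traffic. */
    public Object getSession(Map<String, Object> exchangeProperties) {
        Object sessionId = exchangeProperties.get(SESSION_PROP); // single property read
        if (sessionId == null) {
            return null; // production exchange: no further work
        }
        return sessions.get((String) sessionId);
    }
}
```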
#### ExchangeStateSerializer
- TypeConverter chain: String -> byte[] as Base64 -> class name fallback
- Stream bodies: wrap in `CachedOutputStream` (same pattern as Camel's stream caching)
- Sensitive header redaction (reuses `PayloadCapture` redaction logic)
- Size limit: `cameleer.debug.maxBodySize` (default 64KB)
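A minimal sketch of the serialization fallback chain, under stated assumptions: it skips the Camel TypeConverter step and header redaction, and the placeholder format for unserializable bodies is invented here. It shows only the String -> byte[]-as-Base64 -> class-name ordering and the size cap.

```java
import java.util.Arrays;
import java.util.Base64;

// Illustrative body serialization: String passes through (truncated at the
// configured limit), byte[] becomes Base64, anything else falls back to a
// class-name placeholder so the UI can at least show the Java type.
public class BodySerializerSketch {

    public static String serialize(Object body, int maxBodySize) {
        if (body == null) {
            return null;
        }
        if (body instanceof String s) {
            // enforce cameleer.debug.maxBodySize by truncation
            return s.length() <= maxBodySize ? s : s.substring(0, maxBodySize);
        }
        if (body instanceof byte[] bytes) {
            byte[] capped = bytes.length <= maxBodySize
                    ? bytes
                    : Arrays.copyOf(bytes, maxBodySize);
            return Base64.getEncoder().encodeToString(capped);
        }
        // last-resort fallback: record the type, not the content
        return "<unserializable: " + body.getClass().getName() + ">";
    }
}
```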
#### Synthetic Direct Route Wrapper
For non-direct routes (timer, jms, http, file):
1. Extract route's processor chain from `CamelContext`
2. Create temporary `direct:__debug_{routeId}` route with same processors (shared by reference)
3. Debug exchange enters via `ProducerTemplate.send()`
4. Remove temporary route on session completion
### 1.5 Server Implementation (cameleer3-server)
#### REST Endpoints
| Method | Path | Role | Purpose |
|--------|------|------|---------|
| POST | `/api/v1/debug/sessions` | OPERATOR | Create debug session |
| GET | `/api/v1/debug/sessions/{id}` | VIEWER | Get session state |
| POST | `/api/v1/debug/sessions/{id}/step` | OPERATOR | Step over/into |
| POST | `/api/v1/debug/sessions/{id}/resume` | OPERATOR | Resume to next breakpoint |
| POST | `/api/v1/debug/sessions/{id}/skip` | OPERATOR | Skip current processor |
| POST | `/api/v1/debug/sessions/{id}/modify` | OPERATOR | Modify exchange at breakpoint |
| DELETE | `/api/v1/debug/sessions/{id}` | OPERATOR | Abort session |
| POST | `/api/v1/debug/sessions/{id}/breakpoint-hit` | AGENT | Agent reports breakpoint |
| GET | `/api/v1/debug/sessions/{id}/compare` | VIEWER | Compare debug vs original |
#### WebSocket Channel
```
Endpoint: WS /api/v1/debug/ws?token={jwt}
Server -> Browser events:
{ "type": "breakpoint-hit", "sessionId": "...", "data": { ...state... } }
{ "type": "session-completed", "sessionId": "...", "execution": { ... } }
{ "type": "session-error", "sessionId": "...", "error": "Agent disconnected" }
{ "type": "session-timeout", "sessionId": "..." }
```
#### Data Model
```sql
CREATE TABLE debug_sessions (
session_id TEXT PRIMARY KEY,
agent_id TEXT NOT NULL,
route_id TEXT NOT NULL,
original_exchange TEXT,
status TEXT NOT NULL DEFAULT 'PENDING',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
breakpoints JSONB,
current_state JSONB,
step_count INT DEFAULT 0,
replay_exchange TEXT,
created_by TEXT NOT NULL
);
```
#### DebugSessionService
Lifecycle: PENDING -> ACTIVE -> PAUSED -> COMPLETED/ABORTED/TIMEOUT
1. Generate sessionId + nonce + replay token
2. Send `START_DEBUG` via existing SSE channel
3. Receive breakpoint-hit POSTs, store state, push to WebSocket
4. Translate browser actions (step/resume/skip/modify) into SSE commands
5. Detect agent SSE disconnect via `SseConnectionManager` callback
6. Store completed execution in normal pipeline (tagged with `debugSessionId`)
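The lifecycle can be guarded with a small transition table. The transitions below are assumptions inferred from the flow above (ACTIVE and PAUSED cycle as the exchange hits and leaves breakpoints; terminal states accept nothing); the spec does not enumerate them explicitly.

```java
import java.util.Map;
import java.util.Set;

// Sketch of a DebugSessionService state-transition guard for
// PENDING -> ACTIVE -> PAUSED -> COMPLETED/ABORTED/TIMEOUT.
public class DebugLifecycle {

    public enum State { PENDING, ACTIVE, PAUSED, COMPLETED, ABORTED, TIMEOUT }

    private static final Map<State, Set<State>> ALLOWED = Map.of(
            State.PENDING, Set.of(State.ACTIVE, State.ABORTED, State.TIMEOUT),
            State.ACTIVE, Set.of(State.PAUSED, State.COMPLETED, State.ABORTED, State.TIMEOUT),
            State.PAUSED, Set.of(State.ACTIVE, State.ABORTED, State.TIMEOUT),
            State.COMPLETED, Set.of(),
            State.ABORTED, Set.of(),
            State.TIMEOUT, Set.of());

    /** True when the transition is legal; terminal states accept nothing. */
    public static boolean canTransition(State from, State to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }
}
```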
### 1.6 SaaS Layer (cameleer-saas)
- Tenant isolation: debug sessions scoped to tenant's agents
- Concurrent session limits per tier (free: 1, pro: 5, enterprise: unlimited)
- Usage metering: session creation counted as billable event
### 1.7 UI Components
- **DebugLauncher.tsx** — "Debug This Exchange" button on failed execution detail, pre-fills exchange data
- **DebugSession.tsx** — Main view: route diagram with status coloring (green/yellow/gray), exchange state panel, step controls (F10/F11/F5/F6 keyboard shortcuts)
- **DebugCompare.tsx** — Side-by-side: original execution vs debug replay with diff highlighting
- **BreakpointEditor.tsx** — Click processor nodes to toggle breakpoints, conditional expression input
### 1.8 Safety Mechanisms
| Concern | Mitigation |
|---------|------------|
| Thread leak | Session timeout auto-aborts (default 5 min) |
| Memory leak | Exchange state captured on-demand, not buffered |
| Agent restart | Server detects SSE disconnect, notifies browser |
| High-throughput route | Only debug exchange hits breakpoints (property check) |
| Concurrent sessions | Hard limit (default 3), FAILURE ack if exceeded |
| Non-direct routes | Synthetic `direct:__debug_*` wrapper with same processor chain |
---
## 2. Payload Flow Lineage
### 2.1 Concept
Capture the full transformation history of a message flowing through a route. At each processor, snapshot body before and after. Server computes structural diffs. UI renders a visual "data flow" timeline showing exactly where and how data transforms.
**User Story:** A developer has an exchange where `customerName` is null. They click "Trace Payload Flow." Vertical timeline: at each processor, before/after body with structural diff. Processor 7 (`enrich1`) returned a response missing the `name` field. Root cause in 30 seconds.
### 2.2 Architecture
```
cameleer3 agent
|
| On lineage-enabled exchange:
| Before processor: capture INPUT
| After processor: capture OUTPUT
| Attach to ProcessorExecution as inputBody/outputBody
|
v
POST /api/v1/data/executions (processors carry full snapshots)
|
v
cameleer3-server
|
| LineageService:
| > Flatten processor tree to ordered list
| > Compute diffs between processor[n].output and processor[n+1].input
| > Classify transformation type
| > Generate human-readable summary
|
v
GET /api/v1/executions/{id}/lineage
|
v
Browser: LineageTimeline + DiffViewer
```
### 2.3 Protocol Additions (cameleer3-common)
#### New SSE Commands
| Command | Direction | Purpose |
|---------|-----------|---------|
| `ENABLE_LINEAGE` | Server -> Agent | Activate targeted payload capture |
| `DISABLE_LINEAGE` | Server -> Agent | Deactivate lineage capture |
#### EnableLineagePayload
```json
{
"lineageId": "lin-x1y2z3",
"scope": {
"type": "ROUTE",
"routeId": "route-orders"
},
"predicate": "${header.orderId} == 'ORD-500'",
"predicateLanguage": "simple",
"maxCaptures": 10,
"duration": "PT10M",
"captureHeaders": true,
"captureProperties": false
}
```
#### Scope Types
| Scope | Meaning |
|-------|---------|
| `ROUTE` | All exchanges on a specific route |
| `CORRELATION` | All exchanges with a specific correlationId |
| `EXPRESSION` | Any exchange matching a Simple/JsonPath predicate |
| `NEXT_N` | Next N exchanges on the route (countdown) |
### 2.4 Agent Implementation (cameleer3-agent)
#### LineageManager
- Location: `com.cameleer3.agent.lineage.LineageManager`
- Stores active configs: `ConcurrentHashMap<lineageId, LineageConfig>`
- Tracks capture count per lineageId: auto-disables at `maxCaptures`
- Duration timeout via `ScheduledExecutorService`: auto-disables after expiry
- `shouldCaptureLineage(Exchange)`: evaluates scope + predicate, sets `CameleerLineageActive` property
- `isLineageActive(Exchange)`: single null-check on exchange property (HOT PATH, O(1))
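The `maxCaptures` countdown is the subtle part: two exchanges may match concurrently, so the budget must be claimed atomically. A minimal sketch (hypothetical names; the real LineageManager also evaluates scope plus the Simple predicate and arms a `ScheduledExecutorService` for the duration timeout):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Atomic capture-budget countdown: the config auto-disables once
// maxCaptures exchanges have been captured, even under concurrency.
public class LineageCountdown {

    private final Map<String, AtomicInteger> remaining = new ConcurrentHashMap<>();

    public void enable(String lineageId, int maxCaptures) {
        remaining.put(lineageId, new AtomicInteger(maxCaptures));
    }

    public boolean isEnabled(String lineageId) {
        return remaining.containsKey(lineageId);
    }

    /** Claim one capture slot; false once disabled, expired, or exhausted. */
    public boolean tryCapture(String lineageId) {
        AtomicInteger counter = remaining.get(lineageId);
        if (counter == null) {
            return false;
        }
        int left = counter.decrementAndGet();
        if (left <= 0) {
            remaining.remove(lineageId); // budget spent: auto-disable
        }
        return left >= 0; // negative means another thread took the last slot
    }
}
```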
#### Integration Points (Minimal Agent Changes)
**1. CameleerEventNotifier.onExchangeCreated():**
```java
lineageManager.shouldCaptureLineage(exchange);
// Sets CameleerLineageActive property if matching
```
**2. ExecutionCollector.resolveProcessorCaptureMode():**
```java
if (lineageManager.isLineageActive(exchange)) {
return PayloadCaptureMode.BOTH;
}
```
**3. PayloadCapture body size:**
```java
int maxSize = lineageManager.isLineageActive(exchange)
? config.getLineageMaxBodySize() // 64KB
: config.getMaxBodySize(); // 4KB
```
**Production overhead when lineage is disabled: effectively zero.** The `isLineageActive()` check is a single null-check on an exchange property that doesn't exist on non-lineage exchanges.
#### Configuration
```properties
# 64KB for lineage captures (vs 4KB normal)
cameleer.lineage.maxBodySize=65536
# master switch
cameleer.lineage.enabled=true
```
### 2.5 Server Implementation (cameleer3-server)
#### LineageService
- `getLineage(executionId)`: fetch execution, flatten tree to ordered processor list, compute diffs
- `enableLineage(request)`: send `ENABLE_LINEAGE` to target agents
- `disableLineage(lineageId)`: send `DISABLE_LINEAGE`
- `getActiveLineages()`: list active configs across all agents
#### DiffEngine
Format-aware diff computation:
| Format | Detection | Library | Output |
|--------|-----------|---------|--------|
| JSON | Jackson parse success | zjsonpatch (RFC 6902) or custom tree walk | FIELD_ADDED, FIELD_REMOVED, FIELD_MODIFIED with JSON path |
| XML | DOM parse success | xmlunit-core | ELEMENT_ADDED, ELEMENT_REMOVED, ATTRIBUTE_CHANGED |
| Text | Fallback | java-diff-utils (Myers) | LINE_ADDED, LINE_REMOVED, LINE_CHANGED |
| Binary | Type detection | N/A | Size comparison only |
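The dispatch into that table starts with format detection. As described above, the server attempts a real Jackson or DOM parse; the sketch below approximates that with prefix heuristics only, to show the dispatch order (binary check first, then JSON, then XML, then the text fallback).

```java
import java.nio.charset.StandardCharsets;

// Simplified first-pass format sniffing for the DiffEngine dispatch.
// The real detection confirms candidates with an actual parse.
public class FormatSniffer {

    public enum Format { JSON, XML, TEXT, BINARY }

    public static Format detect(byte[] body) {
        if (body == null || body.length == 0) {
            return Format.TEXT;
        }
        // control bytes below TAB -> treat as binary (size comparison only)
        for (byte b : body) {
            if (b >= 0 && b < 0x09) {
                return Format.BINARY;
            }
        }
        String text = new String(body, StandardCharsets.UTF_8).trim();
        if (text.startsWith("{") || text.startsWith("[")) {
            return Format.JSON; // real code confirms with a Jackson parse
        }
        if (text.startsWith("<")) {
            return Format.XML;  // real code confirms with a DOM parse
        }
        return Format.TEXT;     // fallback: line-based Myers diff
    }
}
```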
#### Transformation Classification
```
UNCHANGED — No diff
MUTATION — Existing fields modified, same format
ENRICHMENT — Fields only added (e.g., enrich processor)
REDUCTION — Fields only removed
FORMAT_CHANGED — Content type changed (XML -> JSON)
TYPE_CHANGED — Java type changed but content equivalent
MIXED — Combination of additions, removals, modifications
```
#### Summary Generation
Auto-generated human-readable summaries:
- `"XML -> JSON conversion"` (FORMAT_CHANGED)
- `"Added customer object from external API"` (ENRICHMENT + field names)
- `"Modified amount field: 150.00 -> 135.00"` (MUTATION + values)
#### Lineage Response Schema
```json
{
"executionId": "exec-123",
"routeId": "route-orders",
"processors": [
{
"processorId": "unmarshal1",
"processorType": "UNMARSHAL",
"input": {
"body": "<order><id>42</id></order>",
"bodyType": "java.lang.String",
"contentType": "application/xml"
},
"output": {
"body": "{\"id\": 42}",
"bodyType": "java.util.LinkedHashMap",
"contentType": "application/json"
},
"diff": {
"transformationType": "FORMAT_CHANGED",
"summary": "XML -> JSON conversion",
"bodyChanged": true,
"headersChanged": true,
"changes": [
{ "type": "FORMAT_CHANGED", "from": "XML", "to": "JSON" }
]
},
"durationMs": 12,
"status": "COMPLETED"
}
]
}
```
#### REST Endpoints
| Method | Path | Role | Purpose |
|--------|------|------|---------|
| GET | `/api/v1/executions/{id}/lineage` | VIEWER | Full lineage with diffs |
| POST | `/api/v1/lineage/enable` | OPERATOR | Enable lineage on agents |
| DELETE | `/api/v1/lineage/{lineageId}` | OPERATOR | Disable lineage |
| GET | `/api/v1/lineage/active` | VIEWER | List active lineage configs |
### 2.6 SaaS Layer (cameleer-saas)
- Lineage captures counted as premium events (higher billing weight)
- Active lineage config limits per tier
- Post-hoc lineage from COMPLETE engine level available on all tiers (resource-intensive fallback)
- Targeted lineage-on-demand is a paid-tier feature (upgrade driver)
### 2.7 UI Components
- **LineageTimeline.tsx** — Vertical processor list, color-coded by transformation type (green/yellow/blue/red/purple), expandable diffs, auto-generated summaries
- **LineageDiffViewer.tsx** — Side-by-side or unified diff, format-aware (JSON tree-diff, XML element-diff, text line-diff, binary hex)
- **LineageEnableDialog.tsx** — "Trace Payload Flow" button, scope/predicate builder, max captures slider
- **LineageSummaryStrip.tsx** — Compact horizontal strip on execution detail page, transformation icons per processor
---
## 3. Cross-Service Trace Correlation + Topology Map
### 3.1 Concept
Stitch executions across services into unified distributed traces. Build a service dependency topology graph automatically from observed traffic. Design the protocol for future cross-tenant federation.
**User Story:** Platform team with 8 Camel microservices. Order stuck in "processing." Engineer searches by orderId, sees distributed trace: horizontal timeline across all services, each expandable to route detail. Service C (pricing) timed out. Root cause across 4 boundaries in 60 seconds.
### 3.2 Phase 1: Intra-Tenant Trace Correlation
#### Enhanced Trace Context Header
```
Current (exists):
X-Cameleer-CorrelationId: corr-abc-123
New (added):
X-Cameleer-TraceContext: {
"traceId": "trc-xyz",
"parentSpanId": "span-001",
"hopIndex": 2,
"sourceApp": "order-service",
"sourceRoute": "route-validate"
}
```
#### Transport-Specific Propagation
| Transport | Detection | Mechanism |
|-----------|-----------|-----------|
| HTTP/REST | URI prefix `http:`, `https:`, `rest:` | HTTP header `X-Cameleer-TraceContext` |
| JMS | URI prefix `jms:`, `activemq:`, `amqp:` | JMS property `CameleerTraceContext` |
| Kafka | URI prefix `kafka:` | Kafka header `cameleer-trace-context` |
| Direct/SEDA | URI prefix `direct:`, `seda:`, `vm:` | Exchange property (in-process) |
| File/FTP | URI prefix `file:`, `ftp:` | Not propagated (async) |
### 3.3 Agent Implementation (cameleer3-agent)
#### Outgoing Propagation (InterceptStrategy)
Before delegating to TO/ENRICH/WIRE_TAP processors:
```java
if (isOutgoingEndpoint(processorType, endpointUri)) {
TraceContext ctx = new TraceContext(
executionCollector.getTraceId(exchange),
currentProcessorExecution.getId(),
executionCollector.getHopIndex(exchange) + 1,
config.getApplicationName(),
exchange.getFromRouteId()
);
injectTraceContext(exchange, endpointUri, ctx);
}
```
#### Incoming Extraction (CameleerEventNotifier)
In `onExchangeCreated()`:
```java
String traceCtxJson = extractTraceContext(exchange);
if (traceCtxJson != null) {
TraceContext ctx = objectMapper.readValue(traceCtxJson, TraceContext.class);
exchange.setProperty("CameleerParentSpanId", ctx.parentSpanId);
exchange.setProperty("CameleerSourceApp", ctx.sourceApp);
exchange.setProperty("CameleerSourceRoute", ctx.sourceRoute);
exchange.setProperty("CameleerHopIndex", ctx.hopIndex);
}
```
#### New RouteExecution Fields
```java
execution.setParentSpanId(...); // processor execution ID from calling service
execution.setSourceApp(...); // application name of caller
execution.setSourceRoute(...); // routeId of caller
execution.setHopIndex(...); // depth in distributed trace
```
#### Safety
- Header size kept under 256 bytes on every transport
- Parse failure: log a warning and continue without trace context (never fail the exchange)
- Inject only on outgoing processors, never on FROM consumers
### 3.4 Server Implementation: Trace Assembly (cameleer3-server)
#### CorrelationService
```
buildDistributedTrace(correlationId):
1. SELECT * FROM executions WHERE correlation_id = ? ORDER BY start_time
2. Index by executionId for O(1) lookup
3. Build tree: roots = executions where parentSpanId IS NULL
For each with parentSpanId: find parent, attach as child hop
4. Compute gaps: child.startTime - parent.processor.startTime = network latency
If gap < 0: flag clock skew warning
5. Aggregate: totalDuration, serviceCount, hopCount, status
```
#### Distributed Trace Response
```json
{
"traceId": "trc-xyz",
"correlationId": "corr-abc-123",
"totalDurationMs": 1250,
"hopCount": 4,
"serviceCount": 3,
"status": "FAILED",
"entryPoint": {
"application": "api-gateway",
"routeId": "route-incoming-orders",
"executionId": "exec-001",
"durationMs": 1250,
"children": [
{
"calledFrom": {
"processorId": "to3",
"processorType": "TO",
"endpointUri": "http://order-service/validate"
},
"application": "order-service",
"routeId": "route-validate",
"executionId": "exec-002",
"durationMs": 350,
"networkLatencyMs": 12,
"children": []
}
]
}
}
```
#### Data Model Changes
```sql
ALTER TABLE executions ADD COLUMN parent_span_id TEXT;
ALTER TABLE executions ADD COLUMN source_app TEXT;
ALTER TABLE executions ADD COLUMN source_route TEXT;
ALTER TABLE executions ADD COLUMN hop_index INT;
CREATE INDEX idx_executions_parent_span
ON executions(parent_span_id) WHERE parent_span_id IS NOT NULL;
```
#### Edge Cases
- **Missing hops:** uninstrumented service shown as "unknown" node
- **Clock skew:** flagged as warning, still rendered
- **Fan-out:** parallel multicast creates multiple children from same processor
- **Circular calls:** detected via hopIndex (max depth 20)
### 3.5 Server Implementation: Topology Graph (cameleer3-server)
#### DependencyGraphService
Builds service dependency graph from existing execution data — **zero additional agent overhead**.
Data source: `processor_executions` where `processor_type IN (TO, TO_DYNAMIC, EIP_ENRICH, EIP_POLL_ENRICH, EIP_WIRE_TAP)` and `resolved_endpoint_uri IS NOT NULL`.
#### Endpoint-to-Service Resolution
1. Direct/SEDA match: `direct:processOrder` -> route's applicationName
2. Agent registration match: URI base URL matches registered agent
3. Kubernetes hostname: extract hostname from URI -> applicationName
4. Manual mapping: admin-configured regex/glob patterns
5. Unresolved: `external:{hostname}` node
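The resolution chain above is naturally an ordered list of resolvers where the first non-empty answer wins. A sketch under stated assumptions: the lookup maps stand in for route metadata and agent registrations, and step 4 (admin-configured regex/glob mappings) is omitted for brevity.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// First-match resolver chain with external:{hostname} as final fallback.
public class EndpointResolver {

    public static String resolve(String uri,
                                 Map<String, String> directRouteOwners,    // direct:name -> app
                                 Map<String, String> registeredAgentHosts) // hostname -> app
    {
        List<Function<String, Optional<String>>> chain = List.of(
                u -> Optional.ofNullable(directRouteOwners.get(u)),  // step 1: direct/seda match
                u -> hostname(u).map(registeredAgentHosts::get));    // steps 2-3: agent / k8s host
        for (Function<String, Optional<String>> resolver : chain) {
            Optional<String> hit = resolver.apply(uri);
            if (hit.isPresent()) {
                return hit.get();
            }
        }
        // step 5: unresolved -> opaque external node
        return "external:" + hostname(uri).orElse("unknown");
    }

    static Optional<String> hostname(String uri) {
        try {
            return Optional.ofNullable(URI.create(uri).getHost());
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // unparseable URI: no hostname available
        }
    }
}
```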
#### Materialized View
```sql
CREATE MATERIALIZED VIEW service_dependencies AS
SELECT
e.application_name AS source_app,
pe.resolved_endpoint_uri AS target_uri,
COUNT(*) AS call_count,
AVG(pe.duration_ms) AS avg_latency_ms,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY pe.duration_ms) AS p99_latency_ms,
SUM(CASE WHEN pe.status = 'FAILED' THEN 1 ELSE 0 END)::FLOAT
/ NULLIF(COUNT(*), 0) AS error_rate,
MAX(pe.start_time) AS last_seen,
MIN(pe.start_time) AS first_seen
FROM executions e
JOIN processor_executions pe
ON e.execution_id = pe.execution_id
AND e.start_time = pe.start_time
WHERE pe.processor_type IN ('TO','TO_DYNAMIC','EIP_ENRICH','EIP_POLL_ENRICH','EIP_WIRE_TAP')
AND pe.resolved_endpoint_uri IS NOT NULL
AND e.start_time > NOW() - INTERVAL '24 hours'
GROUP BY e.application_name, pe.resolved_endpoint_uri;
-- Refresh every 5 minutes
```
#### REST Endpoints
| Method | Path | Role | Purpose |
|--------|------|------|---------|
| GET | `/api/v1/traces/{correlationId}` | VIEWER | Assembled distributed trace |
| GET | `/api/v1/traces/{correlationId}/timeline` | VIEWER | Flat timeline for Gantt |
| GET | `/api/v1/topology/dependencies` | VIEWER | Service dependency graph |
| GET | `/api/v1/topology/diff` | VIEWER | Topology changes between windows |
| GET | `/api/v1/topology/dependencies/{source}/{target}` | VIEWER | Dependency detail |
### 3.6 Phase 2: Cross-Tenant Federation (Design Only)
Reserve `sourceTenantHash` in TraceContext for future use:
```json
{
"traceId": "trc-xyz",
"parentSpanId": "span-001",
"hopIndex": 2,
"sourceApp": "order-service",
"sourceRoute": "route-validate",
"sourceTenantHash": null
}
```
**Consent model (v2):**
- Both tenants opt-in to "Federation" in SaaS settings
- Shared: trace structure (timing, status, service names)
- NOT shared: payload content, headers, internal route details
- Either tenant can revoke at any time
### 3.7 SaaS Layer (cameleer-saas)
- All trace correlation intra-tenant in v1
- Topology graph scoped to tenant's applications
- External dependencies shown as opaque nodes
- Cross-tenant federation as enterprise-tier feature (v2)
### 3.8 UI Components
- **DistributedTraceView.tsx** — Horizontal Gantt timeline, rows=services, bars=executions, arrows=call flow, click-to-expand to route detail
- **ServiceTopologyGraph.tsx** — Force-directed graph, nodes sized by throughput, edges colored by error rate, animated traffic pulse, click drill-down
- **TopologyDiff.tsx** — "What changed?" view, new/removed dependencies highlighted, latency/error changes annotated
- **TraceSearchEnhanced.tsx** — Search by correlationId/traceId/business attributes, results show trace summaries with service count and hop count
---
## 4. Cross-Feature Integration Points
| From -> To | Integration |
|------------|-------------|
| Correlation -> Debugger | "Debug This Hop": from distributed trace, click a service hop to replay and debug |
| Correlation -> Lineage | "Trace Payload Across Services": enable lineage on a correlationId, see transforms across boundaries |
| Lineage -> Debugger | "Debug From Diff": unexpected processor output -> one-click launch debug with breakpoint on that processor |
| Debugger -> Lineage | Debug sessions auto-capture full lineage (all processors at BOTH mode) |
| Topology -> Correlation | Click dependency edge -> show recent traces between those services |
| Topology -> Lineage | "How does data transform?" -> aggregated lineage summary for a dependency edge |
---
## 5. Competitive Analysis
### What an LLM + Junior Dev Can Replicate
| Capability | Replicable? | Time | Barrier |
|------------|-------------|------|---------|
| JMX metrics dashboard | Yes | 1 weekend | None |
| Log parsing + display | Yes | 1 weekend | None |
| Basic replay (re-send exchange) | Yes | 1 week | Need agent access |
| Per-processor payload capture | No* | 2-3 months | Requires bytecode instrumentation |
| Nested EIP execution trees | No* | 3-6 months | Requires deep Camel internals knowledge |
| Breakpoint debugging in route | No | 6+ months | Thread management + InterceptStrategy + serialization |
| Format-aware payload diffing | Partially | 2 weeks | Diff library exists, but data pipeline doesn't |
| Distributed trace assembly | Partially | 1 month | OTel exists but lacks Camel-specific depth |
| Service topology from execution data | Partially | 2 weeks | Istio does this at network layer, not route layer |
*Achievable with OTel Camel instrumentation (spans only, not payload content)
### Where Each Feature Creates Unreplicable Value
- **Debugger:** Requires InterceptStrategy breakpoints + thread parking + exchange serialization. The combination is unique — no other Camel tool offers browser-based route stepping.
- **Lineage:** Requires per-processor INPUT/OUTPUT capture with correct ordering. OTel spans don't carry body content. JMX doesn't capture payloads. Only bytecode instrumentation provides this data.
- **Correlation + Topology:** The trace assembly is achievable elsewhere. The differentiation is Camel-specific depth: each hop shows processor-level execution trees, not just "Service B took 350ms."
---
## 6. Implementation Sequencing
### Phase A: Foundation + Topology (Weeks 1-3)
| Work | Repo | Issue |
|------|------|-------|
| Service topology materialized view | cameleer3-server | #69 |
| Topology REST API | cameleer3-server | #69 |
| ServiceTopologyGraph.tsx | cameleer3-server + saas | #72 |
| WebSocket infrastructure (for debugger) | cameleer3-server | #63 |
| TraceContext DTO in cameleer3-common | cameleer3 | #67 |
**Ship:** Topology graph visible from existing data. Zero agent changes. Immediate visual payoff.
### Phase B: Lineage (Weeks 3-6)
| Work | Repo | Issue |
|------|------|-------|
| Lineage protocol DTOs | cameleer3-common | #64 |
| LineageManager + capture integration | cameleer3-agent | #65 |
| LineageService + DiffEngine | cameleer3-server | #66 |
| Lineage UI components | cameleer3-server + saas | #71 |
**Ship:** Payload flow lineage independently usable.
### Phase C: Distributed Trace Correlation (Weeks 5-9, overlaps B)
| Work | Repo | Issue |
|------|------|-------|
| Trace context header propagation | cameleer3-agent | #67 |
| Executions table migration (new columns) | cameleer3-server | #68 |
| CorrelationService + trace assembly | cameleer3-server | #68 |
| DistributedTraceView + TraceSearch UI | cameleer3-server + saas | #72 |
**Ship:** Distributed traces + topology — full correlation story.
### Phase D: Live Route Debugger (Weeks 8-14)
| Work | Repo | Issue |
|------|------|-------|
| Debug protocol DTOs | cameleer3-common | #60 |
| DebugSessionManager + InterceptStrategy | cameleer3-agent | #61 |
| ExchangeStateSerializer + synthetic wrapper | cameleer3-agent | #62 |
| DebugSessionService + WS + REST | cameleer3-server | #63 |
| Debug UI components | cameleer3-server + saas | #70 |
**Ship:** Full browser-based route debugger with integration to lineage and correlation.
---
## 7. Open Questions
1. **Debugger concurrency model:** Should we support debugging through parallel `Split` branches? Current design follows the main thread. Parallel branches would require multiple parked threads per session.
2. **Lineage storage costs:** Full INPUT+OUTPUT at every processor generates significant data. Should we add a separate lineage retention policy (e.g., 7 days) shorter than normal execution retention?
3. **Topology graph refresh frequency:** 5-minute materialized view refresh is a trade-off. Real-time would require streaming aggregation (e.g., Kafka Streams). Is 5 minutes acceptable for v1?
4. **Cross-tenant federation security model:** The v2 `sourceTenantHash` design needs a full threat model. Can a malicious tenant forge trace context to see another tenant's data?
5. **OTel interop:** Should the trace context header be compatible with W3C Trace Context format? This would enable mixed environments where some services use OTel and others use Cameleer.