Add moat-strengthening features design spec

Comprehensive design document for three defensibility features:
- Live Route Debugger (replay-based, zero production impact)
- Payload Flow Lineage (targeted per-processor capture + diff)
- Cross-Service Trace Correlation + Topology Map (network effect)

Gitea issues: cameleer/cameleer3 #57-#72 (MOAT label)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit bd472be312 by hsiegeln, 2026-03-29 16:14:04 +02:00. 4 changed files with 997 additions and 0 deletions.

.gitattributes (vendored, new file, 1 line)

@@ -0,0 +1 @@
* text=auto eol=lf

CLAUDE.md (new file, 39 lines)

@@ -0,0 +1,39 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project
Cameleer SaaS — multi-tenant SaaS platform wrapping the Cameleer observability stack (Java agent + server) for Apache Camel applications. Customers get managed observability for their Camel integrations without running infrastructure.
## Ecosystem
This repo is the SaaS layer on top of two proven components:
- **cameleer3** (sibling repo) — Java agent using ByteBuddy for zero-code instrumentation of Camel apps. Captures route executions, processor traces, payloads, metrics, and route graph topology. Deploys as `-javaagent` JAR.
- **cameleer3-server** (sibling repo) — Spring Boot observability backend. Receives agent data via HTTP, pushes config/commands via SSE. PostgreSQL + OpenSearch storage. React SPA dashboard. JWT auth with Ed25519 config signing.
- **cameleer-website** — Marketing site (Astro 5)
- **design-system** — Shared React component library (`@cameleer/design-system` on Gitea npm registry)
Agent-server protocol is defined in `cameleer3/cameleer3-common/PROTOCOL.md`. The agent and server are mature, proven components — this repo wraps them with multi-tenancy, billing, and self-service onboarding.
## Architecture Context
The existing cameleer3-server already has single-tenant auth (JWT, RBAC, bootstrap tokens, OIDC). The SaaS layer must:
- Add multi-tenancy (tenant isolation of agent data, diagrams, configs)
- Provide self-service signup, billing, and team management
- Generate per-tenant bootstrap tokens for agent registration
- Proxy or federate access to tenant-specific cameleer3-server instances
- Enforce usage quotas and metered billing
## Related Conventions
- Gitea-hosted: `gitea.siegeln.net/cameleer/`
- CI: `.gitea/workflows/` — Gitea Actions
- K8s target: k3s cluster at 192.168.50.86
- Docker builds: multi-stage, buildx with registry cache, `--provenance=false` for Gitea compatibility
- Design system: import from `@cameleer/design-system` (Gitea npm registry)
## Disabled Skills
- Do NOT use any `gsd:*` skills in this project. This includes all `/gsd:` prefixed commands.

STRATEGY.md (new file, 101 lines)

@@ -0,0 +1,101 @@
## The Threat: Why "Vibe-Coded App" Is Real
Someone with Claude/GPT can spin up a basic Camel observability dashboard in a weekend:
- Consume Camel JMX MBeans → show metrics
- Parse Camel logs → show traces
- Wrap it in a Next.js app with Clerk auth
That covers maybe 40% of what you've built. The question is what's in the other 60% that's genuinely hard to replicate.
## Your Actual Defensible Assets
### 1. ByteBuddy Agent (Hard to Replicate)
This is your strongest moat. Zero-code bytecode instrumentation that:
- Captures nested processor execution trees (not just route-level)
- Handles 45+ Camel EIP node types correctly
- Preserves parent-child relationships through choice/split/multicast/doTry
- Captures payloads at configurable granularity with redaction
- Cross-service correlation via header propagation
- Works across Camel versions without source changes
Why this is defensible: Getting bytecode instrumentation right for a complex framework like Camel requires deep understanding of Camel internals, EIP execution semantics, and ByteBuddy's instrumentation model. An LLM can generate skeleton ByteBuddy code, but getting the edge cases right (nested EIPs, error handlers, dynamic endpoints, thread boundaries in async routes) requires iterative debugging against real Camel apps. You've already done this work.
### 2. Route Graph Topology (Medium to Replicate)
Your agent extracts a full directed graph model of every route:
- Dual representation (tree + graph)
- 45+ node types mapping to Camel's EIP vocabulary
- 4 edge types including cross-route references
- Versioned diagrams that link to specific executions
Why this matters: Visualizing what a Camel route actually does at runtime — not just what the XML/Java DSL says — is hard. The graph model is a data structure spec that took significant iteration (DIAGRAMS.md is 400+ lines of specification).
### 3. Bidirectional Agent-Server Protocol (Medium to Replicate)
Not just data ingestion — your server can push commands back to agents:
- Deep trace specific correlation IDs
- Replay exchanges on routes
- Override per-processor payload capture at runtime
- Push config changes with cryptographic signing
Why this matters: Most observability tools are read-only. Being able to reach into a running Camel app and say "capture full payloads for the next exchange on this route" or "replay this failed message" is genuinely differentiated.
### 4. Domain Expertise Encoded in Software (Hardest to Replicate)
The accumulated knowledge of how Camel actually behaves at runtime:
- How processor execution nests through EIP patterns
- Where thread boundaries create gaps in tracing
- How to correlate across services via header propagation
- Which JMX MBeans are meaningful and which are noise
- Engine levels (NONE→COMPLETE) that let ops dial overhead vs. detail
This isn't just code — it's encoded domain expertise that took building and testing against real Camel applications.
## What Is NOT a Moat
- The SaaS shell (auth, billing, multi-tenancy) — this is pure commodity now
- The React dashboard — any dev can build a metrics/trace viewer
- PostgreSQL/OpenSearch storage — standard infra choices
- K8s namespace isolation — well-documented pattern
- Marketing site — trivially replicated
## MOAT Strategy Recommendations
Here's what I'd focus on to make Cameleer irreplaceable:
### A. Double Down on Agent Intelligence
The agent is your core IP. Make it do things that are impossible without bytecode instrumentation:
- Payload diffing — show exactly how a message transforms through a route (before/after each processor)
- Anomaly detection — learn normal execution patterns, alert on deviations (route took 10x longer, new exception type, payload structure changed)
- Data flow lineage — trace a piece of data from ingestion through multiple routes to final destination
- Impact analysis — "if I change this route, which downstream systems are affected?"
### B. Make the Protocol a Platform
The bidirectional SSE protocol is unique. Extend it:
- Live debugging — step through a Camel route like a debugger, inspect state at each processor
- Traffic mirroring — shadow production traffic to a test route
- Canary deployments — route a percentage of exchanges through a new route version
- Circuit breaker control — open/close circuit breakers from the dashboard
### C. Build Network Effects (The Ultimate Moat)
- Cross-company correlation — when Company A's Camel app sends to Company B's Camel app, both see the full trace. This requires both parties to use Cameleer. Once two companies are linked, neither can easily leave.
- Community route patterns — anonymized aggregation of common EIP patterns, anti-patterns, performance benchmarks. The more customers, the better the insights.
- Integration marketplace — pre-built route templates customers can deploy. Community contributes, everyone benefits.
### D. Operational Maturity as a Feature
The things that take years to get right:
- Retention policies — smart data lifecycle (hot/warm/cold storage)
- High-cardinality handling — when a customer has 10,000 routes
- Multi-region — data residency compliance
- Guaranteed delivery — never lose an execution trace, even when the server is down (agent-side buffering with exactly-once delivery)
## The One-Sentence MOAT
> Cameleer's moat is the Java agent: zero-code bytecode instrumentation that understands Apache Camel's execution model at the processor level, combined with a bidirectional protocol that turns passive observability into active control. Everything else is a commodity wrapper around this core.


@@ -0,0 +1,856 @@
# Moat-Strengthening Features — Design Specification
**Date:** 2026-03-29
**Status:** Draft — Awaiting Review
**Author:** Boardroom simulation (Strategist, Skeptic, Architect, Growth Hacker)
**Gitea Issues:** cameleer/cameleer3 #57-#72 (label: MOAT)
## Executive Summary
Three features designed to convert Cameleer's technical moat (ByteBuddy agent) into a workflow moat (debugger + lineage) and ultimately a network moat (cross-service correlation) before the vibe-coding window closes.
| Feature | Ship Target | Moat Type | Agent Changes | Server Changes |
|---------|-------------|-----------|---------------|----------------|
| Live Route Debugger | Weeks 8-14 | Workflow | Heavy (DebugSessionManager, breakpoints) | Heavy (WebSocket, session mgmt) |
| Payload Flow Lineage | Weeks 3-6 | Technical | Light (one capture mode check) | Medium (DiffEngine) |
| Cross-Service Correlation | Weeks 1-9 | Network effect | Light (header propagation) | Medium (trace assembly, topology) |
### Build Order
```
Week 1-3: Foundation + Topology Graph (from existing data, zero agent changes)
Week 3-6: Payload Flow Lineage (agent + server + UI)
Week 5-9: Distributed Trace Correlation (agent header + server joins + UI)
Week 8-14: Live Route Debugger (agent + server + UI)
```
### Gitea Issue Map
**Epics:**
- #57 — Live Route Debugger
- #58 — Payload Flow Lineage
- #59 — Cross-Service Trace Correlation + Topology Map
**Debugger sub-issues:**
- #60 — Protocol: Debug session command types (`cameleer3-common`)
- #61 — Agent: DebugSessionManager + breakpoint InterceptStrategy integration
- #62 — Agent: ExchangeStateSerializer + synthetic direct route wrapper
- #63 — Server: DebugSessionService + WebSocket + REST API
- #70 — UI: Debug session frontend components
**Lineage sub-issues:**
- #64 — Protocol: Lineage command types (`cameleer3-common`)
- #65 — Agent: LineageManager + capture mode integration
- #66 — Server: LineageService + DiffEngine + REST API
- #71 — UI: Lineage timeline + diff viewer components
**Correlation sub-issues:**
- #67 — Agent: Enhanced trace context header propagation
- #68 — Server: CorrelationService — distributed trace assembly
- #69 — Server: DependencyGraphService + service topology materialized view
- #72 — UI: Distributed trace view + service topology graph
---
## 1. Live Route Debugger
### 1.1 Concept
Extend the existing `replay` command with a debug-session wrapper. Users provide an exchange (from a prior failed execution or manually constructed) and replay it through a route with breakpoints. Only the replayed exchange's thread blocks at breakpoints — production traffic flows normally.
**User Story:** A developer sees a failed exchange. They click "Debug This Exchange." Cameleer pre-fills the body/headers. They set breakpoints, click "Start Debug Session." The exchange replays through the route, pausing at each breakpoint. They inspect state, modify the body, step forward. Total: 3 minutes. Without Cameleer: 45 minutes.
### 1.2 Architecture
```
Browser (SaaS UI)
|
v
WebSocket <--------------------------------------+
| |
v |
cameleer3-server |
| POST /api/v1/debug/sessions |
| POST /api/v1/debug/sessions/{id}/step |
| POST /api/v1/debug/sessions/{id}/resume |
| DELETE /api/v1/debug/sessions/{id} |
| |
v |
SSE Command Channel --> cameleer3 agent |
| | |
| "start-debug" | |
| command v |
| DebugSessionManager |
| | |
| Replay exchange via |
| ProducerTemplate |
| | |
| InterceptStrategy checks |
| breakpoints before each |
| processor |
| | |
| On breakpoint hit: |
| > LockSupport.park() |
| > Serialize exchange state |
| > POST state to server -------+
| | (server pushes to
| Wait for resume/step/skip browser via WS)
| command via SSE
| |
| On resume: LockSupport.unpark()
| Continue to next processor
```
### 1.3 Protocol Additions (cameleer3-common)
#### New SSE Commands
| Command | Direction | Purpose |
|---------|-----------|---------|
| `START_DEBUG` | Server -> Agent | Create session, spawn thread, replay exchange with breakpoints |
| `DEBUG_RESUME` | Server -> Agent | Unpark thread, continue to next breakpoint |
| `DEBUG_STEP` | Server -> Agent | Unpark thread, break at next processor (STEP_OVER/STEP_INTO) |
| `DEBUG_SKIP` | Server -> Agent | Skip current processor, continue |
| `DEBUG_MODIFY` | Server -> Agent | Apply body/header changes at current breakpoint before resuming |
| `DEBUG_ABORT` | Server -> Agent | Abort session, release thread |
#### StartDebugPayload
```json
{
"sessionId": "dbg-a1b2c3",
"routeId": "route-orders",
"exchange": {
"body": "{\"orderId\": 42, \"amount\": 150.00}",
"headers": { "Content-Type": "application/json" }
},
"breakpoints": [
{ "processorId": "choice1", "condition": null },
{ "processorId": "to5", "condition": "${body.amount} > 100" }
],
"mode": "STEP_OVER",
"timeoutSeconds": 300,
"originalExchangeId": "ID-failed-789",
"replayToken": "...",
"nonce": "..."
}
```
#### BreakpointHitReport (Agent -> Server)
```json
{
"sessionId": "dbg-a1b2c3",
"processorId": "to5",
"processorType": "TO",
"endpointUri": "http://payment-service/charge",
"depth": 2,
"stepIndex": 4,
"exchangeState": {
"body": "{\"orderId\": 42, \"amount\": 150.00, \"validated\": true}",
    "headers": { "...": "..." },
"properties": { "CamelSplitIndex": 0 },
"exception": null,
"bodyType": "java.util.LinkedHashMap"
},
"executionTree": ["...partial tree up to this point..."],
"parentProcessorId": "split1",
"routeId": "route-orders",
"timestamp": "2026-03-29T14:22:05.123Z"
}
```
### 1.4 Agent Implementation (cameleer3-agent)
#### DebugSessionManager
- Location: `com.cameleer3.agent.debug.DebugSessionManager`
- Stores active sessions: `ConcurrentHashMap<sessionId, DebugSession>`
- Enforces max concurrent sessions (default 3, configurable via `cameleer.debug.maxSessions`)
- Allocates **dedicated Thread** per session (NOT from Camel thread pool)
- Timeout watchdog: `ScheduledExecutorService` auto-aborts expired sessions
- Handles all `DEBUG_*` commands via `DefaultCommandHandler` delegation
#### DebugSession
- Stores breakpoint definitions, current step mode, parked thread reference
- `shouldBreak(processorId, Exchange)`: evaluates processor match + Simple condition + step mode
- `reportBreakpointHit()`: serializes state, POSTs to server, calls `LockSupport.park()`
- `applyModifications(Exchange)`: sets body/headers from `DEBUG_MODIFY` command
#### InterceptStrategy Integration
In `CameleerInterceptStrategy.DelegateAsyncProcessor.process()`:
```java
DebugSession session = debugSessionManager.getSession(exchange);
if (session != null && session.shouldBreak(processorId, exchange)) {
ExchangeState state = ExchangeStateSerializer.capture(exchange);
List<ProcessorExecution> tree = executionCollector.getPartialTree(exchange);
session.reportBreakpointHit(processorId, state, tree);
// Thread parked until server sends resume/step/skip/abort
if (session.isAborted()) throw new DebugSessionAbortedException();
if (session.shouldSkip()) { callback.done(true); return true; }
if (session.hasModifications()) session.applyModifications(exchange);
}
```
**Zero production overhead:** Debug exchanges carry `CameleerDebugSessionId` exchange property. `getSession()` checks this property — single null-check. Production exchanges have no property, check returns null, no further work.
#### ExchangeStateSerializer
- TypeConverter chain: String -> byte[] as Base64 -> class name fallback
- Stream bodies: wrap in `CachedOutputStream` (same pattern as Camel's stream caching)
- Sensitive header redaction (reuses `PayloadCapture` redaction logic)
- Size limit: `cameleer.debug.maxBodySize` (default 64KB)
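The fallback chain above can be illustrated with a minimal sketch (hypothetical `BodySerializerSketch`; the real serializer runs through Camel's TypeConverter chain and stream caching, which are omitted here, and the size limit is assumed to be the 64KB default):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Simplified sketch of the serializer fallback chain. */
class BodySerializerSketch {
    static final int MAX_BODY_SIZE = 64 * 1024; // cameleer.debug.maxBodySize default

    /** String -> byte[] as Base64 -> class-name fallback, truncated to the size limit. */
    static String serialize(Object body) {
        if (body == null) return null;
        String text;
        if (body instanceof String) {
            text = (String) body;
        } else if (body instanceof byte[]) {
            text = Base64.getEncoder().encodeToString((byte[]) body);
        } else {
            // No converter available: record the type instead of the content
            return "[unserializable: " + body.getClass().getName() + "]";
        }
        if (text.getBytes(StandardCharsets.UTF_8).length > MAX_BODY_SIZE) {
            // Approximation: truncates by chars, not bytes
            return text.substring(0, Math.min(text.length(), MAX_BODY_SIZE)) + "...[truncated]";
        }
        return text;
    }
}
```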
#### Synthetic Direct Route Wrapper
For non-direct routes (timer, jms, http, file):
1. Extract route's processor chain from `CamelContext`
2. Create temporary `direct:__debug_{routeId}` route with same processors (shared by reference)
3. Debug exchange enters via `ProducerTemplate.send()`
4. Remove temporary route on session completion
### 1.5 Server Implementation (cameleer3-server)
#### REST Endpoints
| Method | Path | Role | Purpose |
|--------|------|------|---------|
| POST | `/api/v1/debug/sessions` | OPERATOR | Create debug session |
| GET | `/api/v1/debug/sessions/{id}` | VIEWER | Get session state |
| POST | `/api/v1/debug/sessions/{id}/step` | OPERATOR | Step over/into |
| POST | `/api/v1/debug/sessions/{id}/resume` | OPERATOR | Resume to next breakpoint |
| POST | `/api/v1/debug/sessions/{id}/skip` | OPERATOR | Skip current processor |
| POST | `/api/v1/debug/sessions/{id}/modify` | OPERATOR | Modify exchange at breakpoint |
| DELETE | `/api/v1/debug/sessions/{id}` | OPERATOR | Abort session |
| POST | `/api/v1/debug/sessions/{id}/breakpoint-hit` | AGENT | Agent reports breakpoint |
| GET | `/api/v1/debug/sessions/{id}/compare` | VIEWER | Compare debug vs original |
#### WebSocket Channel
```
Endpoint: WS /api/v1/debug/ws?token={jwt}
Server -> Browser events:
{ "type": "breakpoint-hit", "sessionId": "...", "data": { ...state... } }
{ "type": "session-completed", "sessionId": "...", "execution": { ... } }
{ "type": "session-error", "sessionId": "...", "error": "Agent disconnected" }
{ "type": "session-timeout", "sessionId": "..." }
```
#### Data Model
```sql
CREATE TABLE debug_sessions (
session_id TEXT PRIMARY KEY,
agent_id TEXT NOT NULL,
route_id TEXT NOT NULL,
original_exchange TEXT,
status TEXT NOT NULL DEFAULT 'PENDING',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
breakpoints JSONB,
current_state JSONB,
step_count INT DEFAULT 0,
replay_exchange TEXT,
created_by TEXT NOT NULL
);
```
#### DebugSessionService
Lifecycle: `PENDING -> ACTIVE -> PAUSED -> COMPLETED/ABORTED/TIMEOUT` (PAUSED and ACTIVE alternate on each breakpoint hit and resume)
1. Generate sessionId + nonce + replay token
2. Send `START_DEBUG` via existing SSE channel
3. Receive breakpoint-hit POSTs, store state, push to WebSocket
4. Translate browser actions (step/resume/skip/modify) into SSE commands
5. Detect agent SSE disconnect via `SseConnectionManager` callback
6. Store completed execution in normal pipeline (tagged with `debugSessionId`)
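The lifecycle implies a small state machine; a sketch of the legal transitions (hypothetical `DebugSessionStates`; the exact transition set, e.g. whether PENDING can be aborted directly, is an assumption):

```java
import java.util.Map;
import java.util.Set;

/** Sketch of the debug-session state machine (assumed transition set). */
class DebugSessionStates {
    static final Map<String, Set<String>> LEGAL = Map.of(
        "PENDING",   Set.of("ACTIVE", "ABORTED", "TIMEOUT"),
        "ACTIVE",    Set.of("PAUSED", "COMPLETED", "ABORTED", "TIMEOUT"),
        "PAUSED",    Set.of("ACTIVE", "ABORTED", "TIMEOUT"),
        "COMPLETED", Set.of(),   // terminal
        "ABORTED",   Set.of(),   // terminal
        "TIMEOUT",   Set.of());  // terminal

    static boolean canTransition(String from, String to) {
        return LEGAL.getOrDefault(from, Set.of()).contains(to);
    }
}
```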
### 1.6 SaaS Layer (cameleer-saas)
- Tenant isolation: debug sessions scoped to tenant's agents
- Concurrent session limits per tier (free: 1, pro: 5, enterprise: unlimited)
- Usage metering: session creation counted as billable event
### 1.7 UI Components
- **DebugLauncher.tsx** — "Debug This Exchange" button on failed execution detail, pre-fills exchange data
- **DebugSession.tsx** — Main view: route diagram with status coloring (green/yellow/gray), exchange state panel, step controls (F10/F11/F5/F6 keyboard shortcuts)
- **DebugCompare.tsx** — Side-by-side: original execution vs debug replay with diff highlighting
- **BreakpointEditor.tsx** — Click processor nodes to toggle breakpoints, conditional expression input
### 1.8 Safety Mechanisms
| Concern | Mitigation |
|---------|------------|
| Thread leak | Session timeout auto-aborts (default 5 min) |
| Memory leak | Exchange state captured on-demand, not buffered |
| Agent restart | Server detects SSE disconnect, notifies browser |
| High-throughput route | Only debug exchange hits breakpoints (property check) |
| Concurrent sessions | Hard limit (default 3), FAILURE ack if exceeded |
| Non-direct routes | Synthetic `direct:__debug_*` wrapper with same processor chain |
---
## 2. Payload Flow Lineage
### 2.1 Concept
Capture the full transformation history of a message flowing through a route. At each processor, snapshot body before and after. Server computes structural diffs. UI renders a visual "data flow" timeline showing exactly where and how data transforms.
**User Story:** A developer has an exchange where `customerName` is null. They click "Trace Payload Flow." Vertical timeline: at each processor, before/after body with structural diff. Processor 7 (`enrich1`) returned a response missing the `name` field. Root cause in 30 seconds.
### 2.2 Architecture
```
cameleer3 agent
|
| On lineage-enabled exchange:
| Before processor: capture INPUT
| After processor: capture OUTPUT
| Attach to ProcessorExecution as inputBody/outputBody
|
v
POST /api/v1/data/executions (processors carry full snapshots)
|
v
cameleer3-server
|
| LineageService:
| > Flatten processor tree to ordered list
| > Compute diffs between processor[n].output and processor[n+1].input
| > Classify transformation type
| > Generate human-readable summary
|
v
GET /api/v1/executions/{id}/lineage
|
v
Browser: LineageTimeline + DiffViewer
```
### 2.3 Protocol Additions (cameleer3-common)
#### New SSE Commands
| Command | Direction | Purpose |
|---------|-----------|---------|
| `ENABLE_LINEAGE` | Server -> Agent | Activate targeted payload capture |
| `DISABLE_LINEAGE` | Server -> Agent | Deactivate lineage capture |
#### EnableLineagePayload
```json
{
"lineageId": "lin-x1y2z3",
"scope": {
"type": "ROUTE",
"routeId": "route-orders"
},
"predicate": "${header.orderId} == 'ORD-500'",
"predicateLanguage": "simple",
"maxCaptures": 10,
"duration": "PT10M",
"captureHeaders": true,
"captureProperties": false
}
```
#### Scope Types
| Scope | Meaning |
|-------|---------|
| `ROUTE` | All exchanges on a specific route |
| `CORRELATION` | All exchanges with a specific correlationId |
| `EXPRESSION` | Any exchange matching a Simple/JsonPath predicate |
| `NEXT_N` | Next N exchanges on the route (countdown) |
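Scope evaluation might look like the following sketch (hypothetical `LineageScopeSketch`; the correlation key reuses the `X-Cameleer-CorrelationId` name from section 3, `EXPRESSION` predicates are assumed to be delegated to Camel's Simple language and stubbed out here):

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of scope matching for an active lineage config. */
class LineageScopeSketch {
    enum ScopeType { ROUTE, CORRELATION, EXPRESSION, NEXT_N }

    final ScopeType type;
    final String routeId;          // ROUTE / NEXT_N
    final String correlationId;    // CORRELATION
    final AtomicInteger remaining; // NEXT_N countdown

    LineageScopeSketch(ScopeType type, String routeId, String correlationId, int n) {
        this.type = type;
        this.routeId = routeId;
        this.correlationId = correlationId;
        this.remaining = new AtomicInteger(n);
    }

    /** True if this exchange should get full before/after payload capture. */
    boolean matches(String exchangeRouteId, Map<String, Object> headers) {
        switch (type) {
            case ROUTE:
                return routeId.equals(exchangeRouteId);
            case CORRELATION:
                return correlationId != null
                        && correlationId.equals(headers.get("X-Cameleer-CorrelationId"));
            case NEXT_N:
                // Countdown: match only while captures remain, then auto-disable
                return routeId.equals(exchangeRouteId) && remaining.getAndDecrement() > 0;
            case EXPRESSION:
            default:
                return false; // predicate evaluation delegated to Camel's Simple language
        }
    }
}
```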
### 2.4 Agent Implementation (cameleer3-agent)
#### LineageManager
- Location: `com.cameleer3.agent.lineage.LineageManager`
- Stores active configs: `ConcurrentHashMap<lineageId, LineageConfig>`
- Tracks capture count per lineageId: auto-disables at `maxCaptures`
- Duration timeout via `ScheduledExecutorService`: auto-disables after expiry
- `shouldCaptureLineage(Exchange)`: evaluates scope + predicate, sets `CameleerLineageActive` property
- `isLineageActive(Exchange)`: single null-check on exchange property (HOT PATH, O(1))
#### Integration Points (Minimal Agent Changes)
**1. CameleerEventNotifier.onExchangeCreated():**
```java
lineageManager.shouldCaptureLineage(exchange);
// Sets CameleerLineageActive property if matching
```
**2. ExecutionCollector.resolveProcessorCaptureMode():**
```java
if (lineageManager.isLineageActive(exchange)) {
return PayloadCaptureMode.BOTH;
}
```
**3. PayloadCapture body size:**
```java
int maxSize = lineageManager.isLineageActive(exchange)
? config.getLineageMaxBodySize() // 64KB
: config.getMaxBodySize(); // 4KB
```
**Production overhead when lineage is disabled: effectively zero.** The `isLineageActive()` check is a single null-check on an exchange property that doesn't exist on non-lineage exchanges.
#### Configuration
```properties
# 64KB for lineage captures (vs 4KB normal)
cameleer.lineage.maxBodySize=65536
# master switch
cameleer.lineage.enabled=true
```
### 2.5 Server Implementation (cameleer3-server)
#### LineageService
- `getLineage(executionId)`: fetch execution, flatten tree to ordered processor list, compute diffs
- `enableLineage(request)`: send `ENABLE_LINEAGE` to target agents
- `disableLineage(lineageId)`: send `DISABLE_LINEAGE`
- `getActiveLineages()`: list active configs across all agents
#### DiffEngine
Format-aware diff computation:
| Format | Detection | Library | Output |
|--------|-----------|---------|--------|
| JSON | Jackson parse success | zjsonpatch (RFC 6902) or custom tree walk | FIELD_ADDED, FIELD_REMOVED, FIELD_MODIFIED with JSON path |
| XML | DOM parse success | xmlunit-core | ELEMENT_ADDED, ELEMENT_REMOVED, ATTRIBUTE_CHANGED |
| Text | Fallback | java-diff-utils (Myers) | LINE_ADDED, LINE_REMOVED, LINE_CHANGED |
| Binary | Type detection | N/A | Size comparison only |
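Before the format-aware libraries come in, the core FIELD_* change detection reduces to a set comparison; a sketch on flat maps (hypothetical `FieldDiffSketch`, standing in for the JSON tree walk, nested paths omitted):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Sketch: field-level diff between before/after bodies modeled as flat maps. */
class FieldDiffSketch {
    static Map<String, List<String>> diff(Map<String, Object> before, Map<String, Object> after) {
        List<String> added = new ArrayList<>();
        List<String> removed = new ArrayList<>();
        List<String> modified = new ArrayList<>();
        for (String key : after.keySet()) {
            if (!before.containsKey(key)) added.add(key);        // FIELD_ADDED
        }
        for (String key : before.keySet()) {
            if (!after.containsKey(key)) removed.add(key);       // FIELD_REMOVED
            else if (!Objects.equals(before.get(key), after.get(key))) {
                modified.add(key);                               // FIELD_MODIFIED
            }
        }
        return Map.of("FIELD_ADDED", added, "FIELD_REMOVED", removed, "FIELD_MODIFIED", modified);
    }
}
```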
#### Transformation Classification
```
UNCHANGED — No diff
MUTATION — Existing fields modified, same format
ENRICHMENT — Fields only added (e.g., enrich processor)
REDUCTION — Fields only removed
FORMAT_CHANGED — Content type changed (XML -> JSON)
TYPE_CHANGED — Java type changed but content equivalent
MIXED — Combination of additions, removals, modifications
```
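The classification rules above can be sketched as a pure function over diff counts (hypothetical `TransformationClassifier`; `TYPE_CHANGED` is omitted because it requires comparing Java types rather than diff counts):

```java
/** Sketch: derive the transformation type from add/remove/modify counts. */
class TransformationClassifier {
    static String classify(int added, int removed, int modified, boolean formatChanged) {
        if (formatChanged) return "FORMAT_CHANGED";                       // XML -> JSON etc.
        if (added == 0 && removed == 0 && modified == 0) return "UNCHANGED";
        if (added > 0 && removed == 0 && modified == 0) return "ENRICHMENT"; // fields only added
        if (removed > 0 && added == 0 && modified == 0) return "REDUCTION";  // fields only removed
        if (modified > 0 && added == 0 && removed == 0) return "MUTATION";   // fields only modified
        return "MIXED";
    }
}
```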
#### Summary Generation
Auto-generated human-readable summaries:
- `"XML -> JSON conversion"` (FORMAT_CHANGED)
- `"Added customer object from external API"` (ENRICHMENT + field names)
- `"Modified amount field: 150.00 -> 135.00"` (MUTATION + values)
#### Lineage Response Schema
```json
{
"executionId": "exec-123",
"routeId": "route-orders",
"processors": [
{
"processorId": "unmarshal1",
"processorType": "UNMARSHAL",
"input": {
"body": "<order><id>42</id></order>",
"bodyType": "java.lang.String",
"contentType": "application/xml"
},
"output": {
"body": "{\"id\": 42}",
"bodyType": "java.util.LinkedHashMap",
"contentType": "application/json"
},
"diff": {
"transformationType": "FORMAT_CHANGED",
"summary": "XML -> JSON conversion",
"bodyChanged": true,
"headersChanged": true,
"changes": [
{ "type": "FORMAT_CHANGED", "from": "XML", "to": "JSON" }
]
},
"durationMs": 12,
"status": "COMPLETED"
}
]
}
```
#### REST Endpoints
| Method | Path | Role | Purpose |
|--------|------|------|---------|
| GET | `/api/v1/executions/{id}/lineage` | VIEWER | Full lineage with diffs |
| POST | `/api/v1/lineage/enable` | OPERATOR | Enable lineage on agents |
| DELETE | `/api/v1/lineage/{lineageId}` | OPERATOR | Disable lineage |
| GET | `/api/v1/lineage/active` | VIEWER | List active lineage configs |
### 2.6 SaaS Layer (cameleer-saas)
- Lineage captures counted as premium events (higher billing weight)
- Active lineage config limits per tier
- Post-hoc lineage from COMPLETE engine level available on all tiers (resource-intensive fallback)
- Targeted lineage-on-demand is a paid-tier feature (upgrade driver)
### 2.7 UI Components
- **LineageTimeline.tsx** — Vertical processor list, color-coded by transformation type (green/yellow/blue/red/purple), expandable diffs, auto-generated summaries
- **LineageDiffViewer.tsx** — Side-by-side or unified diff, format-aware (JSON tree-diff, XML element-diff, text line-diff, binary hex)
- **LineageEnableDialog.tsx** — "Trace Payload Flow" button, scope/predicate builder, max captures slider
- **LineageSummaryStrip.tsx** — Compact horizontal strip on execution detail page, transformation icons per processor
---
## 3. Cross-Service Trace Correlation + Topology Map
### 3.1 Concept
Stitch executions across services into unified distributed traces. Build a service dependency topology graph automatically from observed traffic. Design the protocol for future cross-tenant federation.
**User Story:** Platform team with 8 Camel microservices. Order stuck in "processing." Engineer searches by orderId, sees distributed trace: horizontal timeline across all services, each expandable to route detail. Service C (pricing) timed out. Root cause across 4 boundaries in 60 seconds.
### 3.2 Phase 1: Intra-Tenant Trace Correlation
#### Enhanced Trace Context Header
```
Current (exists):
X-Cameleer-CorrelationId: corr-abc-123
New (added):
X-Cameleer-TraceContext: {
"traceId": "trc-xyz",
"parentSpanId": "span-001",
"hopIndex": 2,
"sourceApp": "order-service",
"sourceRoute": "route-validate"
}
```
#### Transport-Specific Propagation
| Transport | Detection | Mechanism |
|-----------|-----------|-----------|
| HTTP/REST | URI prefix `http:`, `https:`, `rest:` | HTTP header `X-Cameleer-TraceContext` |
| JMS | URI prefix `jms:`, `activemq:`, `amqp:` | JMS property `CameleerTraceContext` |
| Kafka | URI prefix `kafka:` | Kafka header `cameleer-trace-context` |
| Direct/SEDA | URI prefix `direct:`, `seda:`, `vm:` | Exchange property (in-process) |
| File/FTP | URI prefix `file:`, `ftp:` | Not propagated (async) |
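The dispatch in this table reduces to a URI-prefix check; a sketch (hypothetical `TransportDispatchSketch`; the prefix lists mirror the table above):

```java
/** Sketch: pick the propagation mechanism from the endpoint URI prefix. */
class TransportDispatchSketch {
    /** Returns the carrier name, "exchange-property" for in-process, or null if not propagated. */
    static String headerNameFor(String endpointUri) {
        if (endpointUri.startsWith("http:") || endpointUri.startsWith("https:")
                || endpointUri.startsWith("rest:")) return "X-Cameleer-TraceContext";
        if (endpointUri.startsWith("jms:") || endpointUri.startsWith("activemq:")
                || endpointUri.startsWith("amqp:")) return "CameleerTraceContext";
        if (endpointUri.startsWith("kafka:")) return "cameleer-trace-context";
        if (endpointUri.startsWith("direct:") || endpointUri.startsWith("seda:")
                || endpointUri.startsWith("vm:")) return "exchange-property";
        return null; // file:, ftp:, ... -> not propagated
    }
}
```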
### 3.3 Agent Implementation (cameleer3-agent)
#### Outgoing Propagation (InterceptStrategy)
Before delegating to TO/ENRICH/WIRE_TAP processors:
```java
if (isOutgoingEndpoint(processorType, endpointUri)) {
TraceContext ctx = new TraceContext(
executionCollector.getTraceId(exchange),
currentProcessorExecution.getId(),
executionCollector.getHopIndex(exchange) + 1,
config.getApplicationName(),
exchange.getFromRouteId()
);
injectTraceContext(exchange, endpointUri, ctx);
}
```
#### Incoming Extraction (CameleerEventNotifier)
In `onExchangeCreated()`:
```java
String traceCtxJson = extractTraceContext(exchange);
if (traceCtxJson != null) {
    try {
        TraceContext ctx = objectMapper.readValue(traceCtxJson, TraceContext.class);
        exchange.setProperty("CameleerParentSpanId", ctx.parentSpanId);
        exchange.setProperty("CameleerSourceApp", ctx.sourceApp);
        exchange.setProperty("CameleerSourceRoute", ctx.sourceRoute);
        exchange.setProperty("CameleerHopIndex", ctx.hopIndex);
    } catch (JsonProcessingException e) {
        // Safety rule: a malformed context must never fail the exchange
        LOG.warn("Ignoring malformed trace context header: {}", e.getMessage());
    }
}
```
#### New RouteExecution Fields
```java
execution.setParentSpanId(...); // processor execution ID from calling service
execution.setSourceApp(...); // application name of caller
execution.setSourceRoute(...); // routeId of caller
execution.setHopIndex(...); // depth in distributed trace
```
#### Safety
- Header size always <256 bytes
- Parse failure: log warning, continue without context (no exchange failure)
- Only inject on outgoing processors, never on FROM consumers
### 3.4 Server Implementation: Trace Assembly (cameleer3-server)
#### CorrelationService
```
buildDistributedTrace(correlationId):
1. SELECT * FROM executions WHERE correlation_id = ? ORDER BY start_time
2. Index by executionId for O(1) lookup
3. Build tree: roots = executions where parentSpanId IS NULL
For each with parentSpanId: find parent, attach as child hop
4. Compute gaps: child.startTime - parent.processor.startTime = network latency
If gap < 0: flag clock skew warning
5. Aggregate: totalDuration, serviceCount, hopCount, status
```
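Steps 2 and 3 can be sketched as follows (hypothetical `TraceAssemblySketch`; simplification: `parentSpanId` is matched against the parent's `executionId`, whereas the design above matches it to a processor execution inside the parent):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of distributed trace assembly from a flat list of executions. */
class TraceAssemblySketch {

    static class Hop {
        final String executionId;
        final String parentSpanId;            // null for entry-point executions
        final List<Hop> children = new ArrayList<>();
        Hop(String executionId, String parentSpanId) {
            this.executionId = executionId;
            this.parentSpanId = parentSpanId;
        }
    }

    /** Index by id, then attach each hop to its parent; unmatched parents become roots. */
    static List<Hop> assemble(List<Hop> hops) {
        Map<String, Hop> byId = new HashMap<>();
        for (Hop h : hops) byId.put(h.executionId, h);   // O(1) lookup index
        List<Hop> roots = new ArrayList<>();
        for (Hop h : hops) {
            Hop parent = h.parentSpanId == null ? null : byId.get(h.parentSpanId);
            if (parent == null) roots.add(h);            // missing hop -> treated as a root
            else parent.children.add(h);
        }
        return roots;
    }
}
```

Treating an unmatched parent as a root is what the edge-case list below calls a "missing hop" from an uninstrumented service.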
#### Distributed Trace Response
```json
{
"traceId": "trc-xyz",
"correlationId": "corr-abc-123",
"totalDurationMs": 1250,
"hopCount": 4,
"serviceCount": 3,
"status": "FAILED",
"entryPoint": {
"application": "api-gateway",
"routeId": "route-incoming-orders",
"executionId": "exec-001",
"durationMs": 1250,
"children": [
{
"calledFrom": {
"processorId": "to3",
"processorType": "TO",
"endpointUri": "http://order-service/validate"
},
"application": "order-service",
"routeId": "route-validate",
"executionId": "exec-002",
"durationMs": 350,
"networkLatencyMs": 12,
"children": []
}
]
}
}
```
#### Data Model Changes
```sql
ALTER TABLE executions ADD COLUMN parent_span_id TEXT;
ALTER TABLE executions ADD COLUMN source_app TEXT;
ALTER TABLE executions ADD COLUMN source_route TEXT;
ALTER TABLE executions ADD COLUMN hop_index INT;
CREATE INDEX idx_executions_parent_span
ON executions(parent_span_id) WHERE parent_span_id IS NOT NULL;
```
#### Edge Cases
- **Missing hops:** uninstrumented service shown as "unknown" node
- **Clock skew:** flagged as warning, still rendered
- **Fan-out:** parallel multicast creates multiple children from same processor
- **Circular calls:** detected via hopIndex (max depth 20)
### 3.5 Server Implementation: Topology Graph (cameleer3-server)
#### DependencyGraphService
Builds service dependency graph from existing execution data — **zero additional agent overhead**.
Data source: `processor_executions` where `processor_type IN (TO, TO_DYNAMIC, EIP_ENRICH, EIP_POLL_ENRICH, EIP_WIRE_TAP)` and `resolved_endpoint_uri IS NOT NULL`.
#### Endpoint-to-Service Resolution
1. Direct/SEDA match: `direct:processOrder` -> route's applicationName
2. Agent registration match: URI base URL matches registered agent
3. Kubernetes hostname: extract hostname from URI -> applicationName
4. Manual mapping: admin-configured regex/glob patterns
5. Unresolved: `external:{hostname}` node
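The cascade can be sketched as an ordered chain (hypothetical `EndpointResolverSketch`; steps 3-5 are collapsed into a hostname fallback and the manual regex/glob mapping is omitted):

```java
import java.net.URI;
import java.util.Map;

/** Sketch of the endpoint-to-service resolution cascade. */
class EndpointResolverSketch {
    /**
     * directRoutes:        "direct:processOrder" -> owning applicationName
     * registeredBaseUrls:  "http://order-service:8080" -> applicationName (agent registration)
     */
    static String resolve(String uri,
                          Map<String, String> directRoutes,
                          Map<String, String> registeredBaseUrls) {
        if (uri.startsWith("direct:") || uri.startsWith("seda:")) {
            String app = directRoutes.get(uri);
            if (app != null) return app;                         // 1. in-process match
        }
        for (Map.Entry<String, String> e : registeredBaseUrls.entrySet()) {
            if (uri.startsWith(e.getKey())) return e.getValue(); // 2. agent registration match
        }
        try {
            String host = URI.create(uri).getHost();             // 3. hostname extraction
            if (host != null) return "external:" + host;         // 5. unresolved fallback node
        } catch (IllegalArgumentException ignored) {
            // Unparseable URI: fall through
        }
        return "external:unknown";
    }
}
```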
#### Materialized View
```sql
CREATE MATERIALIZED VIEW service_dependencies AS
SELECT
e.application_name AS source_app,
pe.resolved_endpoint_uri AS target_uri,
COUNT(*) AS call_count,
AVG(pe.duration_ms) AS avg_latency_ms,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY pe.duration_ms) AS p99_latency_ms,
SUM(CASE WHEN pe.status = 'FAILED' THEN 1 ELSE 0 END)::FLOAT
/ NULLIF(COUNT(*), 0) AS error_rate,
MAX(pe.start_time) AS last_seen,
MIN(pe.start_time) AS first_seen
FROM executions e
JOIN processor_executions pe
ON e.execution_id = pe.execution_id
AND e.start_time = pe.start_time
WHERE pe.processor_type IN ('TO','TO_DYNAMIC','EIP_ENRICH','EIP_POLL_ENRICH','EIP_WIRE_TAP')
AND pe.resolved_endpoint_uri IS NOT NULL
AND e.start_time > NOW() - INTERVAL '24 hours'
GROUP BY e.application_name, pe.resolved_endpoint_uri;
-- Refresh every 5 minutes
```
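The 5-minute refresh could be driven by a plain scheduler; the real server would more likely use Spring's `@Scheduled`. A sketch, assuming `CONCURRENTLY` is viable (it requires a unique index on the view, e.g. over `(source_app, target_uri)`) so reads are not blocked during refresh:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DependencyViewRefresher {
    public static final String REFRESH_SQL =
        "REFRESH MATERIALIZED VIEW CONCURRENTLY service_dependencies";

    /** Schedules the given refresh action immediately, then every 5 minutes. */
    public static ScheduledExecutorService start(Runnable refresh) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(refresh, 0, 5, TimeUnit.MINUTES);
        return ses;
    }
}
```

The `refresh` runnable would execute `REFRESH_SQL` against PostgreSQL via the server's JDBC layer.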
#### REST Endpoints
| Method | Path | Role | Purpose |
|--------|------|------|---------|
| GET | `/api/v1/traces/{correlationId}` | VIEWER | Assembled distributed trace |
| GET | `/api/v1/traces/{correlationId}/timeline` | VIEWER | Flat timeline for Gantt |
| GET | `/api/v1/topology/dependencies` | VIEWER | Service dependency graph |
| GET | `/api/v1/topology/diff` | VIEWER | Topology changes between windows |
| GET | `/api/v1/topology/dependencies/{source}/{target}` | VIEWER | Dependency detail |
### 3.6 Phase 2: Cross-Tenant Federation (Design Only)
Reserve `sourceTenantHash` in TraceContext for future use:
```json
{
"traceId": "trc-xyz",
"parentSpanId": "span-001",
"hopIndex": 2,
"sourceApp": "order-service",
"sourceRoute": "route-validate",
"sourceTenantHash": null
}
```
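For propagation, the agent has to flatten this context into something it can attach to outbound calls. An illustrative round-trip codec — the header name `X-Cameleer-Trace` and the pipe-delimited format are assumptions; the actual wire format belongs in cameleer3-common / PROTOCOL.md:

```java
public class TraceContextCodec {
    public static final String HEADER = "X-Cameleer-Trace";

    public static final class TraceContext {
        public final String traceId;
        public final String parentSpanId;
        public final int hopIndex;
        public final String sourceApp;
        public final String sourceRoute;

        public TraceContext(String traceId, String parentSpanId, int hopIndex,
                            String sourceApp, String sourceRoute) {
            this.traceId = traceId;
            this.parentSpanId = parentSpanId;
            this.hopIndex = hopIndex;
            this.sourceApp = sourceApp;
            this.sourceRoute = sourceRoute;
        }
    }

    public static String encode(TraceContext ctx) {
        // Pipe-delimited; assumes field values never contain '|'
        return String.join("|", ctx.traceId, ctx.parentSpanId,
                Integer.toString(ctx.hopIndex), ctx.sourceApp, ctx.sourceRoute);
    }

    public static TraceContext decode(String headerValue) {
        String[] p = headerValue.split("\\|", -1);
        return new TraceContext(p[0], p[1], Integer.parseInt(p[2]), p[3], p[4]);
    }
}
```

`sourceTenantHash` is deliberately omitted from the encoded form until the v2 consent model below is settled.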
**Consent model (v2):**
- Both tenants opt-in to "Federation" in SaaS settings
- Shared: trace structure (timing, status, service names)
- NOT shared: payload content, headers, internal route details
- Either tenant can revoke at any time
### 3.7 SaaS Layer (cameleer-saas)
- All trace correlation intra-tenant in v1
- Topology graph scoped to tenant's applications
- External dependencies shown as opaque nodes
- Cross-tenant federation as enterprise-tier feature (v2)
### 3.8 UI Components
- **DistributedTraceView.tsx** — Horizontal Gantt timeline, rows=services, bars=executions, arrows=call flow, click-to-expand to route detail
- **ServiceTopologyGraph.tsx** — Force-directed graph, nodes sized by throughput, edges colored by error rate, animated traffic pulse, click drill-down
- **TopologyDiff.tsx** — "What changed?" view, new/removed dependencies highlighted, latency/error changes annotated
- **TraceSearchEnhanced.tsx** — Search by correlationId/traceId/business attributes, results show trace summaries with service count and hop count
---
## 4. Cross-Feature Integration Points
| From -> To | Integration |
|------------|-------------|
| Correlation -> Debugger | "Debug This Hop": from distributed trace, click a service hop to replay and debug |
| Correlation -> Lineage | "Trace Payload Across Services": enable lineage on a correlationId, see transforms across boundaries |
| Lineage -> Debugger | "Debug From Diff": unexpected processor output -> one-click launch debug with breakpoint on that processor |
| Debugger -> Lineage | Debug sessions auto-capture full lineage (all processors at BOTH mode) |
| Topology -> Correlation | Click dependency edge -> show recent traces between those services |
| Topology -> Lineage | "How does data transform?" -> aggregated lineage summary for a dependency edge |
---
## 5. Competitive Analysis
### What an LLM + Junior Dev Can Replicate
| Capability | Replicable? | Time | Barrier |
|------------|-------------|------|---------|
| JMX metrics dashboard | Yes | 1 weekend | None |
| Log parsing + display | Yes | 1 weekend | None |
| Basic replay (re-send exchange) | Yes | 1 week | Need agent access |
| Per-processor payload capture | No* | 2-3 months | Requires bytecode instrumentation |
| Nested EIP execution trees | No* | 3-6 months | Requires deep Camel internals knowledge |
| Breakpoint debugging in route | No | 6+ months | Thread management + InterceptStrategy + serialization |
| Format-aware payload diffing | Partially | 2 weeks | Diff library exists, but data pipeline doesn't |
| Distributed trace assembly | Partially | 1 month | OTel exists but lacks Camel-specific depth |
| Service topology from execution data | Partially | 2 weeks | Istio does this at network layer, not route layer |
*Achievable with OTel Camel instrumentation (spans only, not payload content)
### Where Each Feature Creates Unreplicable Value
- **Debugger:** Requires InterceptStrategy breakpoints + thread parking + exchange serialization. The combination is unique — no other Camel tool offers browser-based route stepping.
- **Lineage:** Requires per-processor INPUT/OUTPUT capture with correct ordering. OTel spans don't carry body content. JMX doesn't capture payloads. Only bytecode instrumentation provides this data.
- **Correlation + Topology:** The trace assembly is achievable elsewhere. The differentiation is Camel-specific depth: each hop shows processor-level execution trees, not just "Service B took 350ms."
---
## 6. Implementation Sequencing
### Phase A: Foundation + Topology (Weeks 1-3)
| Work | Repo | Issue |
|------|------|-------|
| Service topology materialized view | cameleer3-server | #69 |
| Topology REST API | cameleer3-server | #69 |
| ServiceTopologyGraph.tsx | cameleer3-server + saas | #72 |
| WebSocket infrastructure (for debugger) | cameleer3-server | #63 |
| TraceContext DTO in cameleer3-common | cameleer3 | #67 |
**Ship:** Topology graph visible from existing data. Zero agent changes. Immediate visual payoff.
### Phase B: Lineage (Weeks 3-6)
| Work | Repo | Issue |
|------|------|-------|
| Lineage protocol DTOs | cameleer3-common | #64 |
| LineageManager + capture integration | cameleer3-agent | #65 |
| LineageService + DiffEngine | cameleer3-server | #66 |
| Lineage UI components | cameleer3-server + saas | #71 |
**Ship:** Payload flow lineage independently usable.
### Phase C: Distributed Trace Correlation (Weeks 5-9, overlaps B)
| Work | Repo | Issue |
|------|------|-------|
| Trace context header propagation | cameleer3-agent | #67 |
| Executions table migration (new columns) | cameleer3-server | #68 |
| CorrelationService + trace assembly | cameleer3-server | #68 |
| DistributedTraceView + TraceSearch UI | cameleer3-server + saas | #72 |
**Ship:** Distributed traces + topology — full correlation story.
### Phase D: Live Route Debugger (Weeks 8-14)
| Work | Repo | Issue |
|------|------|-------|
| Debug protocol DTOs | cameleer3-common | #60 |
| DebugSessionManager + InterceptStrategy | cameleer3-agent | #61 |
| ExchangeStateSerializer + synthetic wrapper | cameleer3-agent | #62 |
| DebugSessionService + WS + REST | cameleer3-server | #63 |
| Debug UI components | cameleer3-server + saas | #70 |
**Ship:** Full browser-based route debugger with integration to lineage and correlation.
---
## 7. Open Questions
1. **Debugger concurrency model:** Should we support debugging through parallel `Split` branches? Current design follows the main thread. Parallel branches would require multiple parked threads per session.
2. **Lineage storage costs:** Full INPUT+OUTPUT at every processor generates significant data. Should we add a separate lineage retention policy (e.g., 7 days) shorter than normal execution retention?
3. **Topology graph refresh frequency:** 5-minute materialized view refresh is a trade-off. Real-time would require streaming aggregation (e.g., Kafka Streams). Is 5 minutes acceptable for v1?
4. **Cross-tenant federation security model:** The v2 `sourceTenantHash` design needs a full threat model. Can a malicious tenant forge trace context to see another tenant's data?
5. **OTel interop:** Should the trace context header be compatible with W3C Trace Context format? This would enable mixed environments where some services use OTel and others use Cameleer.