diff --git a/.planning/research/ARCHITECTURE.md b/.planning/research/ARCHITECTURE.md new file mode 100644 index 00000000..658a57a5 --- /dev/null +++ b/.planning/research/ARCHITECTURE.md @@ -0,0 +1,593 @@ +# Architecture Patterns + +**Domain:** Transaction monitoring / observability server for Apache Camel route executions +**Researched:** 2026-03-11 +**Confidence:** MEDIUM (based on established observability architecture patterns; no live web verification available) + +## Recommended Architecture + +### High-Level Overview + +The system follows a **write-heavy, read-occasional** observability pattern with three distinct data paths: + +``` +Agents (50+) Users / UI + | | + v v +[Ingestion Pipeline] [Query Engine] + | | + v | +[Write Buffer / Batcher] | + | | + v v +[ClickHouse] <----- reads ----------+ +[Text Index] <----- full-text ------+ + ^ + | +[Diagram Store] (versioned) + +[SSE Channel Manager] --push--> Agents +``` + +### Component Boundaries + +| Component | Module | Responsibility | Communicates With | +|-----------|--------|---------------|-------------------| +| **Ingestion Controller** | app | HTTP POST endpoint, request validation, deserialization | Write Buffer | +| **Write Buffer** | core | In-memory batching, backpressure signaling | ClickHouse Writer, Text Indexer | +| **ClickHouse Writer** | core | Batch INSERT into ClickHouse, retry logic | ClickHouse | +| **Text Indexer** | core | Extract searchable text, write to text index | Text index (ClickHouse or external) | +| **Transaction Service** | core | Domain logic: transactions, activities, correlations | Storage interfaces | +| **Query Engine** | core | Combines structured + full-text queries, pagination | ClickHouse, Text index | +| **Agent Registry** | core | Track agent instances, lifecycle (LIVE/STALE/DEAD), heartbeat | SSE Channel Manager | +| **SSE Channel Manager** | core (interface) + app (impl) | Manage SSE connections, push config/commands | Agent Registry | +| **Diagram Service** | core 
| Version diagrams, link to transactions, trigger rendering | Diagram Store | +| **Diagram Renderer** | core | Server-side rendering of route definitions to visual output | Diagram Service | +| **Auth Service** | core | JWT validation, Ed25519 signing, bootstrap token flow | All controllers | +| **REST Controllers** | app | HTTP endpoints for transactions, agents, diagrams, config | All core services | +| **SSE Controller** | app | SSE endpoint, connection lifecycle | SSE Channel Manager | +| **Config Controller** | app | Config CRUD, push triggers | SSE Channel Manager, Config store | + +### Data Flow + +#### 1. Transaction Ingestion (Hot Path) + +``` +Agent POST /api/v1/ingest + | + v +[IngestController] -- validates JWT, deserializes using cameleer3-common models + | + v +[IngestionService.accept(batch)] -- accepts TransactionData/ActivityData + | + v +[WriteBuffer] -- in-memory queue (bounded, per-partition) + | signals backpressure via HTTP 429 when full + | + +---(flush trigger: size threshold OR time interval)---+ + | | + v v +[ClickHouseWriter.insertBatch()] [TextIndexer.indexBatch()] + | | + v v +ClickHouse (MergeTree tables) ClickHouse full-text index + (or separate text index) +``` + +#### 2. Transaction Query (Read Path) + +``` +UI GET /api/v1/transactions?state=ERROR&from=...&to=...&q=free+text + | + v +[TransactionController] -- validates, builds query criteria + | + v +[QueryEngine.search(criteria)] -- combines structured filters + full-text + | + +--- structured filters --> ClickHouse WHERE clauses + +--- full-text query -----> text index lookup (returns transaction IDs) + +--- merge results -------> intersect, sort, paginate + | + v +[Page] -- paginated response with cursor +``` + +#### 3. 
Agent SSE Communication + +``` +Agent GET /api/v1/agents/{id}/events (SSE) + | + v +[SseController] -- authenticates, registers SseEmitter + | + v +[SseChannelManager.register(agentId, emitter)] + | + v +[AgentRegistry.markLive(agentId)] + +--- Later, when config changes --- + +[ConfigController.update(config)] + | + v +[SseChannelManager.broadcast(configEvent)] + | + v +Each registered SseEmitter sends event to connected agent +``` + +#### 4. Diagram Versioning + +``` +Agent POST /api/v1/diagrams (on startup or route change) + | + v +[DiagramController] -- receives route definition (XML/YAML/JSON from cameleer3-common) + | + v +[DiagramService.storeVersion(definition)] + | + +--- compute content hash + +--- if hash differs from latest: store new version with timestamp + +--- if identical: skip (idempotent) + | + v +[DiagramStore] -- versioned storage (content-addressable) + +--- On transaction query --- + +[TransactionService] -- looks up diagram version active at transaction timestamp + | + v +[DiagramService.getVersionAt(routeId, instant)] + | + v +[DiagramRenderer.render(definition)] -- produces SVG/PNG for display +``` + +## Patterns to Follow + +### Pattern 1: Bounded Write Buffer with Backpressure + +**What:** In-memory queue between ingestion endpoint and storage writes. Bounded size. When full, return HTTP 429 to agents so they back off and retry. + +**When:** Always -- this is the critical buffer between high-throughput ingestion and batch-oriented database writes. + +**Why:** ClickHouse performs best with large batch inserts (thousands of rows). Individual inserts per HTTP request would destroy write performance. The buffer decouples ingestion rate from write rate. 
+
+**Example:**
+```java
+public class WriteBuffer<T> {
+    private final BlockingQueue<T> queue;
+    private final int batchSize;
+    private final Duration maxFlushInterval;
+    private final Consumer<List<T>> flushAction;
+
+    public boolean offer(T item) {
+        // Returns false when queue is full -> caller returns 429
+        return queue.offer(item);
+    }
+
+    // Scheduled flush: drains up to batchSize items
+    @Scheduled(fixedDelayString = "${ingestion.flush-interval-ms:1000}")
+    void flush() {
+        List<T> batch = new ArrayList<>(batchSize);
+        queue.drainTo(batch, batchSize);
+        if (!batch.isEmpty()) {
+            flushAction.accept(batch);
+        }
+    }
+}
+```
+
+**Implementation detail:** Use `ArrayBlockingQueue` with a capacity that matches your memory budget. At ~2KB per transaction record and 10,000 capacity, that is ~20MB -- well within bounds.
+
+### Pattern 2: Repository Abstraction over ClickHouse
+
+**What:** Define storage interfaces in core module, implement with ClickHouse JDBC in app module. Core never imports the ClickHouse driver directly.
+
+**When:** Always -- this is the key module boundary principle.
+
+**Why:** Keeps core testable without a database. Allows swapping storage in tests (in-memory) and theoretically in production. More importantly, it enforces that domain logic does not leak storage concerns.
+
+**Example:**
+```java
+// In core module
+public interface TransactionRepository {
+    void insertBatch(List<Transaction> transactions);
+    Page<Transaction> search(TransactionQuery query, PageRequest page);
+    Optional<Transaction> findById(String transactionId);
+}
+
+// In app module
+@Repository
+public class ClickHouseTransactionRepository implements TransactionRepository {
+    private final JdbcTemplate jdbc;
+    // ClickHouse-specific SQL, batch inserts, etc.
+}
+```
+
+### Pattern 3: SseEmitter Registry with Heartbeat
+
+**What:** Maintain a concurrent map of agent ID to SseEmitter. Send periodic heartbeat events. Remove on timeout, error, or completion.
+
+**When:** For all SSE connections.
+
+**Why:** SSE connections are long-lived. Without heartbeat, you cannot distinguish between a healthy idle connection and a silently dropped one. The registry is the source of truth for which agents are reachable.
+
+**Example:**
+```java
+public class SseChannelManager {
+    private final ConcurrentHashMap<String, SseEmitter> emitters = new ConcurrentHashMap<>();
+
+    public SseEmitter register(String agentId) {
+        SseEmitter emitter = new SseEmitter(Long.MAX_VALUE); // no framework timeout
+        emitter.onCompletion(() -> remove(agentId));
+        emitter.onTimeout(() -> remove(agentId));
+        emitter.onError(e -> remove(agentId));
+        emitters.put(agentId, emitter);
+        return emitter;
+    }
+
+    private void remove(String agentId) {
+        emitters.remove(agentId);
+    }
+
+    @Scheduled(fixedDelay = 15_000)
+    void heartbeat() {
+        emitters.forEach((id, emitter) -> {
+            try {
+                emitter.send(SseEmitter.event().name("heartbeat").data(""));
+            } catch (IOException e) {
+                remove(id);
+            }
+        });
+    }
+
+    public void send(String agentId, String eventName, Object data) {
+        SseEmitter emitter = emitters.get(agentId);
+        if (emitter != null) {
+            try {
+                emitter.send(SseEmitter.event().name(eventName).data(data));
+            } catch (IOException e) {
+                remove(agentId);
+            }
+        }
+    }
+}
+```
+
+### Pattern 4: Content-Addressable Diagram Versioning
+
+**What:** Hash diagram definitions. Store each unique definition once. Link transactions to the definition hash + a version timestamp.
+
+**When:** For diagram storage.
+
+**Why:** Many transactions reference the same diagram version. Content-addressing deduplicates storage. A separate version table maps (routeId, timestamp) to content hash, enabling "what diagram was active at time T?" queries.
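A minimal sketch of the hashing step, assuming a SHA-256 digest over the trimmed definition text (the `DiagramHasher` class and its method are illustrative names, not existing code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Illustrative sketch (not the project's API): derive the content hash
 *  used as the diagram-definition key; identical definitions dedupe to one row. */
public final class DiagramHasher {

    /** SHA-256 over the trimmed definition text, hex-encoded. */
    public static String contentHash(String definition) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(definition.strip().getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is a mandatory JDK algorithm", e);
        }
    }
}
```

Trimming before hashing keeps insignificant leading/trailing whitespace from producing spurious new versions; any stronger normalization (e.g. canonical XML) would be a deliberate design choice.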
+
+**Schema sketch:**
+```sql
+-- Diagram definitions (content-addressable)
+CREATE TABLE diagram_definitions (
+    content_hash String,            -- SHA-256 of definition
+    route_id String,
+    definition String,              -- raw XML/YAML/JSON
+    rendered_svg Nullable(String),  -- pre-rendered SVG (NULL until filled async)
+    created_at DateTime64(3)
+) ENGINE = MergeTree()
+ORDER BY (content_hash);
+
+-- Version history (which definition was active when)
+CREATE TABLE diagram_versions (
+    route_id String,
+    active_from DateTime64(3),
+    content_hash String
+) ENGINE = MergeTree()
+ORDER BY (route_id, active_from);
+```
+
+### Pattern 5: Cursor-Based Pagination for Time-Series Data
+
+**What:** Use cursor-based pagination (keyset pagination) instead of OFFSET/LIMIT for transaction listing.
+
+**When:** For all list/search endpoints returning time-ordered transaction data.
+
+**Why:** OFFSET-based pagination degrades as offset grows -- ClickHouse must scan and skip rows. Cursor-based pagination using `(timestamp, id) > (last_seen_timestamp, last_seen_id)` gives constant-time page fetches regardless of how deep you paginate.
+
+**Example:**
+```java
+public record PageCursor(Instant timestamp, String id) {}
+
+// Query: WHERE (timestamp, id) < (:cursorTs, :cursorId) ORDER BY timestamp DESC, id DESC LIMIT :size
+```
+
+## Anti-Patterns to Avoid
+
+### Anti-Pattern 1: Individual Row Inserts to ClickHouse
+
+**What:** Inserting one transaction per HTTP request directly to ClickHouse.
+
+**Why bad:** ClickHouse is designed for bulk inserts. Individual inserts create excessive parts in MergeTree tables, causing merge pressure and degraded read performance. At 50+ agents posting concurrently, this would quickly become a bottleneck.
+
+**Instead:** Buffer in memory, flush in batches of 1,000-10,000 rows per insert.
+
+### Anti-Pattern 2: Storing Rendered Diagrams in ClickHouse BLOBs
+
+**What:** Putting SVG/PNG binary data directly in the main ClickHouse tables alongside transaction data.
+ +**Why bad:** ClickHouse is columnar and optimized for analytical queries. Large binary data in columns degrades compression ratios and query performance for all queries touching that table. + +**Instead:** Store rendered output in filesystem or object storage. Store only the content hash reference in ClickHouse. Or use a separate ClickHouse table with the rendered content that is rarely queried alongside transaction data. + +### Anti-Pattern 3: Blocking SSE Writes on the Request Thread + +**What:** Sending SSE events synchronously from the thread handling a config update request. + +**Why bad:** If an agent's connection is slow or dead, the config update request blocks. With 50+ agents, this creates cascading latency. + +**Instead:** Send SSE events asynchronously. Use a thread pool or virtual threads (Java 21+) to handle SSE writes. Return success to the config updater immediately, handle delivery failures in the background. + +### Anti-Pattern 4: Fat Core Module with Spring Dependencies + +**What:** Adding Spring annotations (@Service, @Repository, @Autowired) throughout the core module. + +**Why bad:** Couples domain logic to Spring. Makes unit testing harder. Violates the purpose of the core/app split. + +**Instead:** Core module defines plain Java interfaces and classes. App module wires them with Spring. Core can use `@Scheduled` or similar only if Spring is already a dependency; otherwise, keep scheduling in app. + +### Anti-Pattern 5: Unbounded SSE Emitter Timeouts + +**What:** Setting SseEmitter timeout to 0 or Long.MAX_VALUE without any heartbeat or cleanup. + +**Why bad:** Dead connections accumulate. Memory leaks. Agent registry shows agents as LIVE when they are actually gone. + +**Instead:** Use heartbeat (Pattern 3). Track last successful send. Transition agents to STALE after N missed heartbeats, DEAD after M. + +## Module Boundary Design + +### Core Module (`cameleer3-server-core`) + +The core module is the domain layer. 
It contains: + +- **Domain models** -- Transaction, Activity, Agent, DiagramVersion, etc. (may extend or complement cameleer3-common models) +- **Service interfaces and implementations** -- TransactionService, AgentRegistryService, DiagramService, QueryEngine +- **Repository interfaces** -- TransactionRepository, DiagramRepository, AgentRepository (interfaces only, no implementations) +- **Ingestion logic** -- WriteBuffer, batch assembly, backpressure signaling +- **Text indexing abstraction** -- TextIndexer interface +- **Event/notification abstractions** -- SseChannelManager interface (not the Spring SseEmitter impl) +- **Security abstractions** -- JwtValidator interface, Ed25519Signer/Verifier +- **Query model** -- TransactionQuery, PageCursor, search criteria builders + +**No Spring Boot dependencies.** Jackson is acceptable (already present). JUnit for tests. + +### App Module (`cameleer3-server-app`) + +The app module is the infrastructure/adapter layer. It contains: + +- **Spring Boot application class** +- **REST controllers** -- IngestController, TransactionController, AgentController, DiagramController, ConfigController, SseController +- **Repository implementations** -- ClickHouseTransactionRepository, etc. +- **SSE implementation** -- SpringSseChannelManager using SseEmitter +- **Security filters** -- JWT filter, bootstrap token filter +- **Configuration** -- application.yml, ClickHouse connection config, scheduler config +- **Diagram rendering implementation** -- if using an external library for SVG generation +- **Static resources** -- UI assets (later phase) + +**Depends on core.** Wires everything together with Spring configuration. + +### Boundary Rule + +``` +app --> core (allowed) +core --> app (NEVER) +core --> cameleer3-common (allowed) +app --> cameleer3-common (transitively via core) +``` + +## Ingestion Pipeline Detail + +### Buffering Strategy + +Use a two-stage approach: + +1. 
**Accept stage** -- IngestController deserializes, validates, places into WriteBuffer. Returns 202 Accepted (or 429 if buffer full). +2. **Flush stage** -- Scheduled task drains buffer into batches. Each batch goes to ClickHouseWriter and TextIndexer. + +### Backpressure Mechanism + +- WriteBuffer has a bounded capacity (configurable, default 50,000 items). +- When buffer is >80% full, respond with HTTP 429 + `Retry-After` header. +- Agents (cameleer3) should implement exponential backoff on 429. +- Monitor buffer fill level as a metric. + +### Batch Size Tuning + +- Target: 5,000-10,000 rows per ClickHouse INSERT. +- Flush interval: 1-2 seconds (configurable). +- Flush triggers: whichever comes first -- batch size reached OR interval elapsed. + +## Storage Architecture + +### Write Path (ClickHouse) + +ClickHouse excels at: +- Columnar compression (10:1 or better for structured transaction data) +- Time-partitioned tables with automatic TTL-based expiry (30-day retention) +- Massive batch INSERT throughput +- Analytical queries over time ranges + +**Table design principles:** +- Partition by month: `PARTITION BY toYYYYMM(execution_time)` +- Order by query pattern: `ORDER BY (execution_time, transaction_id)` for time-range scans +- TTL: `TTL execution_time + INTERVAL 30 DAY` +- Use `LowCardinality(String)` for state, agent_id, route_id columns + +### Full-Text Search + +Two viable approaches: + +**Option A: ClickHouse built-in full-text index (recommended for simplicity)** +- ClickHouse supports `tokenbf_v1` and `ngrambf_v1` bloom filter indexes +- Not as powerful as Elasticsearch/Lucene but avoids a separate system +- Good enough for "find transactions containing this string" queries +- Add a `search_text` column that concatenates searchable fields + +**Option B: External search index (Elasticsearch/OpenSearch)** +- More powerful: fuzzy matching, relevance scoring, complex text analysis +- Additional infrastructure to manage +- Only justified if full-text search 
quality is a key differentiator
+
+**Recommendation:** Start with ClickHouse bloom filter indexes. The query pattern described (incident-driven, searching by known strings like correlation IDs or error messages) does not require Lucene-level text analysis. If users need fuzzy/ranked search later, add an external index as a separate phase.
+
+### Read Path
+
+- Structured queries go directly to ClickHouse SQL.
+- Full-text queries use the bloom filter index for pre-filtering, then exact match.
+- Results are merged at the QueryEngine level.
+- Pagination uses the cursor-based approach (Pattern 5).
+
+## SSE Connection Management at Scale
+
+### Connection Lifecycle
+
+```
+Agent connects --> authenticate JWT --> register SseEmitter --> mark LIVE
+       |
+       +-- heartbeat every 15s --> success: stays LIVE
+       |                       --> failure: mark STALE, remove emitter
+       |
+       +-- agent reconnects --> new SseEmitter replaces old one
+       |
+       +-- no reconnect within 5min --> mark DEAD
+```
+
+### Scaling Considerations
+
+- 50 agents = 50 concurrent SSE connections. This is trivially handled by a single Spring Boot instance.
+- At 500+ agents: consider sticky sessions behind a load balancer, or move to a pub/sub system (Redis Pub/Sub) for cross-instance coordination.
+- Spring's SseEmitter relies on Servlet async support: the request thread is released once the emitter is returned, so an idle connection does not pin a container thread -- threads are only borrowed while an event is being written.
+- With virtual threads (Java 21+), SSE connection overhead becomes negligible even at scale.
+
+### Reconnection Protocol
+
+- Agents should reconnect with `Last-Event-Id` header.
+- Server tracks last event ID per agent.
+- On reconnect, replay missed events (if any) from a small in-memory or persistent event log.
+- For config push: since config is idempotent, replaying the latest config on reconnect is sufficient.
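The replay log can be sketched as a bounded per-agent event buffer; `ReplayBuffer`, `Event`, and their methods are assumed names for illustration, not part of the codebase:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative sketch: bounded per-agent event log backing
 *  Last-Event-Id replay after an SSE reconnect. */
public class ReplayBuffer {

    /** Minimal stand-in for an SSE event with a monotonically increasing id. */
    public record Event(long id, String name, String data) {}

    private final Deque<Event> events = new ArrayDeque<>();
    private final int capacity;
    private long nextId = 1;

    public ReplayBuffer(int capacity) {
        this.capacity = capacity;
    }

    /** Record an outgoing event; evict the oldest once over capacity. */
    public synchronized Event append(String name, String data) {
        Event e = new Event(nextId++, name, data);
        events.addLast(e);
        if (events.size() > capacity) {
            events.removeFirst();
        }
        return e;
    }

    /** Events the agent missed, given the Last-Event-Id it reconnected with. */
    public synchronized List<Event> since(long lastEventId) {
        List<Event> missed = new ArrayList<>();
        for (Event e : events) {
            if (e.id() > lastEventId) {
                missed.add(e);
            }
        }
        return missed;
    }
}
```

On reconnect, the SSE controller would parse the `Last-Event-Id` header, call `since(lastId)`, and re-send the returned events before resuming live delivery.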
+ +## REST API Organization + +### Controller Structure + +``` +/api/v1/ + ingest/ IngestController + POST /transactions -- batch ingest from agents + POST /activities -- batch ingest activities + + transactions/ TransactionController + GET / -- search/list with filters + GET /{id} -- single transaction detail + GET /{id}/activities -- activities within a transaction + + agents/ AgentController + GET / -- list all agents with status + GET /{id} -- agent detail + GET /{id}/events -- SSE stream (SseController) + POST /register -- bootstrap registration + + diagrams/ DiagramController + POST / -- store new diagram version + GET /{routeId} -- latest diagram + GET /{routeId}/at -- diagram at specific timestamp + GET /{routeId}/rendered -- rendered SVG/PNG + + config/ ConfigController + GET / -- current config + PUT / -- update config (triggers SSE push) + POST /commands -- send ad-hoc command to agent(s) +``` + +### Response Conventions + +- List endpoints return `Page` with cursor-based pagination. +- All timestamps in ISO-8601 UTC. +- Error responses follow RFC 7807 Problem Details. +- Use `@RestControllerAdvice` for global exception handling. 
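The cursor convention can be made concrete as an opaque, URL-safe token. This sketch assumes a simple Base64 packing of `(timestamp, id)` and mirrors the `PageCursor` shape from Pattern 5; the `Cursors` class is a hypothetical helper:

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Base64;

/** Illustrative sketch: opaque page-cursor token for list responses.
 *  Names are assumptions, not existing code. */
public final class Cursors {

    /** Mirrors the PageCursor record from Pattern 5. */
    public record PageCursor(Instant timestamp, String id) {}

    public static String encode(PageCursor cursor) {
        String raw = cursor.timestamp().toEpochMilli() + ":" + cursor.id();
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    public static PageCursor decode(String token) {
        String raw = new String(Base64.getUrlDecoder().decode(token), StandardCharsets.UTF_8);
        int sep = raw.indexOf(':');
        return new PageCursor(
                Instant.ofEpochMilli(Long.parseLong(raw.substring(0, sep))),
                raw.substring(sep + 1));
    }
}
```

Keeping the token opaque means clients cannot depend on cursor internals, so the encoding can change without breaking the API.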
+ +## Scalability Considerations + +| Concern | At 50 agents | At 500 agents | At 5,000 agents | +|---------|-------------|---------------|-----------------| +| **Ingestion throughput** | Single instance, in-memory buffer | Single instance, larger buffer | Multiple instances, partition by agent behind LB | +| **SSE connections** | Single instance, ConcurrentHashMap | Sticky sessions + Redis Pub/Sub for cross-instance events | Dedicated SSE gateway service | +| **ClickHouse writes** | Single writer thread, batch every 1-2s | Multiple writer threads, parallel batches | ClickHouse cluster with sharding | +| **Query latency** | Single ClickHouse node | Read replicas | Distributed ClickHouse cluster | +| **Diagram rendering** | Synchronous on request | Async pre-rendering on store | Worker pool with rendering queue | + +## Suggested Build Order + +Based on component dependencies: + +``` +Phase 1: Foundation + Domain models (core) + Repository interfaces (core) + Basic Spring Boot wiring (app) + +Phase 2: Ingestion Pipeline + WriteBuffer (core) + ClickHouse schema + connection (app) + ClickHouseWriter (app) + IngestController (app) + --> Can receive and store transactions + +Phase 3: Query Engine + TransactionQuery model (core) + QueryEngine (core) + ClickHouse query implementation (app) + TransactionController (app) + --> Can search stored transactions + +Phase 4: Agent Registry + SSE + AgentRegistryService (core) + SseChannelManager interface (core) + impl (app) + AgentController + SseController (app) + --> Agents can register and receive push events + +Phase 5: Diagram Service + DiagramService (core) + DiagramRepository interface (core) + impl (app) + DiagramRenderer (core/app) + DiagramController (app) + --> Versioned diagrams linked to transactions + +Phase 6: Security + JWT validation (core interface, app impl) + Ed25519 config signing (core) + Bootstrap token flow (app) + Security filters (app) + --> All endpoints secured + +Phase 7: Full-Text Search + TextIndexer 
(core interface, app impl) + ClickHouse bloom filter index setup + QueryEngine full-text integration + --> Combined structured + text search + +Phase 8: UI + Static resources (app) + Frontend consuming REST API +``` + +**Ordering rationale:** +- Storage before query (you need data to query) +- Ingestion before agents (agents need an endpoint to POST to) +- Query before full-text (structured search first, text search layers on top) +- Security can be added at any point but is cleanest as a cross-cutting concern after core flows work +- Diagrams are semi-independent but reference transactions, so after query +- UI is last because API-first means the API must be stable + +## Sources + +- ClickHouse documentation on MergeTree engines, TTL, bloom filter indexes (official docs, verified against training data) +- Spring Boot SseEmitter documentation (Spring Framework reference) +- Observability system architecture patterns from Jaeger, Zipkin, and SigNoz architectures (well-established open-source projects) +- Content-addressable storage patterns from Git internals and Docker image layers +- Cursor-based pagination patterns from Slack API and Stripe API design guides +- Confidence: MEDIUM -- based on established patterns in training data, not live-verified against current documentation diff --git a/.planning/research/FEATURES.md b/.planning/research/FEATURES.md new file mode 100644 index 00000000..46e41c1c --- /dev/null +++ b/.planning/research/FEATURES.md @@ -0,0 +1,194 @@ +# Feature Landscape + +**Domain:** Transaction monitoring / observability for Apache Camel route executions +**Researched:** 2026-03-11 +**Confidence:** MEDIUM (based on domain expertise from njams Server, Jaeger, Zipkin, Dynatrace; web search unavailable for latest feature sets) + +## Table Stakes + +Features users expect. Missing = product feels incomplete. 
+ +### Transaction Search and Filtering + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| Search by time range | Every monitoring tool has this; primary axis for incident investigation | Low | Date picker with presets (last 15m, 1h, 24h, 7d, custom) | +| Filter by transaction state | SUCCESS/ERROR/WARNING is the first thing ops checks | Low | Multi-select checkboxes, counts per state | +| Filter by duration | Finding slow transactions is core use case | Low | Min/max duration inputs, or predefined buckets | +| Full-text search across payload/attributes | Users need to find "that one order ID" across millions of records | Medium | Requires text index; match highlighting in results | +| Combined/compound filters | Users always combine: "errors in last hour on instance X" | Medium | AND-composition of all filter criteria | +| Paginated result list | Cannot load millions of rows; must page or virtual-scroll | Low | Cursor-based pagination preferred over offset for large datasets | +| Sort by time, duration, state | Basic result ordering | Low | Default: newest first | +| Filter by agent/instance | "Show me only transactions from production-instance-3" | Low | Dropdown populated from agent registry | +| Filter by route name | Users think in routes, not raw IDs | Low | Autocomplete from known route definitions | +| Save/bookmark search queries | Ops teams reuse the same searches during incidents | Medium | Named saved searches, shareable via URL | + +### Transaction Detail and Drill-Down + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| Transaction summary view | One-glance: state, start time, duration, instance, route entry point | Low | Header card in detail page | +| Activity list (per-route breakdown) | Hierarchical view of all route executions within a transaction | Medium | Tree or table showing each activity with timing | +| Activity timing waterfall | Visual 
timeline showing which routes executed when, and their overlap | Medium | Horizontal bar chart; critical for finding bottlenecks | +| Payload/attribute inspection | View message body, headers, properties at each activity step | Medium | Expandable sections; JSON/XML pretty-printing | +| Error detail with stack trace | When a transaction fails, users need the exception detail immediately | Low | Rendered stack trace with copy button | +| Cross-instance correlation | Transaction spans instances A and B -- show the full chain | High | Requires correlation ID propagation; single unified view | +| Link to route diagram | From any activity, jump to the diagram showing the route definition | Low | Hyperlink; depends on diagram storage existing | + +### Route Diagram Visualization + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| Render route diagram from stored definition | The core differentiator vs generic tracing tools; users think in Camel routes | High | Server-side or client-side rendering from graph model | +| Diagram versioning | Route changed last Tuesday -- show the diagram as it was when the transaction ran | Medium | Version stored per diagram; transaction references specific version | +| Zoom and pan | Diagrams can be large (50+ nodes); must be navigable | Medium | Standard canvas controls; minimap helpful for large diagrams | +| Execution overlay on diagram | Highlight which path the transaction actually took through the route | High | Color/annotate nodes with state (success/error), timing | +| Node click for activity detail | Click a node in the diagram to see the activity data for that step | Medium | Links diagram nodes to activity records | + +### Agent Management + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| Agent list with status | See all connected agents and their lifecycle state (LIVE/STALE/DEAD) | Low | Table with status indicator; 
auto-refresh | +| Agent heartbeat monitoring | Detect when an agent goes silent | Low | Timestamp of last heartbeat; threshold-based state transitions | +| Agent detail view | Instance name, version, connected routes, uptime, config | Low | Detail page per agent | +| Agent registration/deregistration | New agents register via bootstrap token; dead agents get cleaned up | Medium | Registration endpoint; TTL-based cleanup | + +### Authentication and Security + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| JWT-based API authentication | Secure the REST API; every enterprise monitoring tool requires auth | Medium | Token issuance, validation, refresh | +| Bootstrap token for agent registration | Agents need a way to initially register without pre-existing credentials | Low | Shared secret, single-use or time-limited | +| Ed25519 config signing | Agents must verify config came from the server, not tampered | Medium | Key management, signature generation/verification | + +### Dashboard and Overview + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| Transaction volume chart (time series) | "How many transactions are we processing?" -- first question on login | Medium | Bar or line chart, grouped by time bucket | +| Error rate chart | "Is something broken right now?" -- second question | Medium | Error count or percentage over time | +| Active agents count | Quick health check of the agent fleet | Low | Simple counter with status breakdown | +| Recent errors list | Quick access to the latest failures without searching | Low | Pre-filtered list, auto-refreshing | + +## Differentiators + +Features that set product apart from generic tracing tools. Not expected, but valued. 
+ +### Diagram-Centric Experience + +| Feature | Value Proposition | Complexity | Notes | +|---------|-------------------|------------|-------| +| Route diagram as primary navigation | Instead of trace waterfall, users navigate via the Camel route diagram -- this is how they think | High | Diagram becomes the entry point, not just a visualization | +| Execution heatmap on diagram | Color nodes by frequency/error rate over a time window -- shows hotspots | High | Aggregate stats per node; requires efficient querying | +| Side-by-side diagram comparison | Compare two diagram versions to see what changed in a route | Medium | Diff view highlighting added/removed/changed nodes | +| Diagram-based search | "Show me all failed transactions that passed through this node" | High | Click a node, get filtered transaction list | + +### Advanced Search and Analytics + +| Feature | Value Proposition | Complexity | Notes | +|---------|-------------------|------------|-------| +| Statistical duration analysis | P50/P95/P99 duration for a route over time -- detect degradation trends | Medium | Requires ClickHouse aggregation queries | +| Transaction comparison | Side-by-side diff of two transactions through the same route | Medium | Useful for "why did this one fail but that one succeed?" 
| +| Search result aggregations | Faceted counts: N errors, N warnings, distribution by route, by instance | Medium | ClickHouse GROUP BY queries alongside search results | +| Correlation graph | Visual graph showing how transactions flow across instances | High | Network diagram; requires correlation data | + +### Configuration Push + +| Feature | Value Proposition | Complexity | Notes | +|---------|-------------------|------------|-------| +| Per-route tracing level control | Turn on detailed tracing for one problematic route without restarting the agent | Medium | SSE push of config change; agent applies dynamically | +| Bulk config push to agent groups | "Enable debug tracing on all production instances" | Medium | Agent tagging/grouping + batch SSE dispatch | +| Config history and rollback | See what config was active when, roll back a bad change | Medium | Versioned config storage with timestamps | +| Ad-hoc command dispatch | Send a "flush cache" or "reconnect" command to specific agents | Medium | Command/response pattern over SSE; command status tracking | + +### Operational Intelligence + +| Feature | Value Proposition | Complexity | Notes | +|---------|-------------------|------------|-------| +| Alerting on error rate thresholds | Notify when error rate exceeds threshold for a route | High | Threshold evaluation, notification channels (email, webhook) | +| Anomaly detection on duration | Alert when P95 duration spikes compared to baseline | High | Statistical baseline computation; deviation detection | +| Scheduled data export | Export transaction data as CSV/JSON for compliance or reporting | Medium | Job scheduler; file generation; download endpoint | +| Retention policy management | Configure per-route or per-instance retention periods | Medium | TTL management in ClickHouse; UI for policy CRUD | + +## Anti-Features + +Features to explicitly NOT build. 
+ +| Anti-Feature | Why Avoid | What to Do Instead | +|--------------|-----------|-------------------| +| General APM metrics (CPU, memory, GC) | Out of scope; Cameleer is transaction-focused, not an APM tool. Adding metrics creates scope creep and competes with Prometheus/Grafana which do it better | Provide a link/integration point to external metrics tools if needed | +| Log aggregation/viewer | Transactions are not logs. Mixing them confuses the data model and competes with ELK/Loki | Store transaction payloads and attributes, not raw log lines | +| Custom dashboard builder | Enormous complexity for marginal value. Ops teams already have Grafana for custom dashboards | Provide good built-in dashboards; expose metrics via Prometheus endpoint for Grafana | +| Multi-tenancy | Adds auth complexity, data isolation, billing concerns. Single-tenant deployment is simpler and sufficient for the target audience | Deploy separate instances per environment/team | +| Mobile app | Ops teams use desktop browsers during incidents. Mobile adds huge UI complexity | Responsive web UI that works on tablets if needed | +| Plugin/extension system | Premature abstraction; adds API stability burden before the core is stable | Build features directly; consider plugins much later if demand emerges | +| Real-time streaming transaction view | "Firehose" views of all transactions in real-time look impressive but are useless at scale (millions/day). Users cannot process the stream | Provide auto-refreshing search results and recent errors list | +| AI/ML-powered root cause analysis | Hype-driven feature with poor reliability. 
Requires massive training data and domain-specific models | Provide good search, filtering, and comparison tools so humans can find root causes efficiently | + +## Feature Dependencies + +``` +Agent Registration --> Agent List/Status +Agent Registration --> SSE Connection --> Config Push +Agent Registration --> SSE Connection --> Ad-hoc Commands + +Transaction Ingestion --> Transaction Storage +Transaction Storage --> Transaction Search/Filtering +Transaction Search --> Transaction Detail View +Transaction Detail --> Activity Waterfall +Transaction Detail --> Payload Inspection +Transaction Detail --> Error Detail + +Diagram Storage --> Diagram Rendering +Diagram Versioning --> Transaction-to-Diagram Linking +Diagram Rendering --> Execution Overlay (requires both diagram + activity data) +Diagram Rendering --> Execution Heatmap (requires aggregated activity data) +Diagram Rendering --> Diagram-based Search + +Transaction Search --> Statistical Duration Analysis (aggregation of search results) +Transaction Search --> Search Result Aggregations + +JWT Auth --> All REST API endpoints +Bootstrap Token --> Agent Registration +Ed25519 Signing --> Config Push + +Transaction Volume Chart --> Transaction Storage (aggregation queries) +Error Rate Chart --> Transaction Storage (aggregation queries) +``` + +## MVP Recommendation + +**Prioritize (Phase 1 -- Foundation):** +1. Transaction ingestion and storage -- nothing works without data flowing in +2. Agent registration and lifecycle -- must know who is sending data +3. Basic transaction search (time range, state, duration) -- core value proposition +4. Transaction detail with activity breakdown -- users need to drill down + +**Prioritize (Phase 2 -- Core Experience):** +5. Full-text search -- the "find that one transaction" use case +6. Route diagram rendering with version linking -- the Camel-specific differentiator +7. JWT authentication -- required before any production deployment +8. 
Dashboard overview (volume chart, error rate, agent status) + +**Prioritize (Phase 3 -- Differentiation):** +9. Execution overlay on diagrams -- the killer feature that generic tools cannot offer +10. Config push via SSE -- operational value that justifies the agent-server architecture +11. Cross-instance correlation -- required for complex multi-instance Camel deployments + +**Defer:** +- Alerting: defer until core search and dashboard are solid; alerting without good data is noise +- Data export: useful but not blocking; add when compliance demands arise +- Anomaly detection: requires baseline data that only accumulates over time +- Diagram-based search: powerful but depends on both diagram rendering and search being mature +- Execution heatmap: requires significant aggregation infrastructure + +## Sources + +- Domain knowledge from njams Server (Integration Matters) feature set -- transaction monitoring for integration platforms, hierarchical transaction/activity model, route diagram visualization +- Jaeger UI and Zipkin UI -- distributed tracing search, trace detail waterfall views, service dependency graphs +- Dynatrace PurePath -- transaction-level drill-down, service flow visualization, statistical analysis +- Apache Camel route model -- EIP-based visual representation, route definition structure +- Project context from PROJECT.md and CLAUDE.md -- specific requirements, constraints, and architectural decisions + +**Confidence note:** Feature categorization is based on training data knowledge of these products. Web search was unavailable to verify latest feature additions in 2025-2026 releases. The core feature landscape for this domain is mature and unlikely to have shifted dramatically, but specific UI patterns and newer differentiators may be missed. Confidence: MEDIUM. 
diff --git a/.planning/research/PITFALLS.md b/.planning/research/PITFALLS.md new file mode 100644 index 00000000..dd669fbb --- /dev/null +++ b/.planning/research/PITFALLS.md @@ -0,0 +1,322 @@ +# Domain Pitfalls + +**Domain:** Transaction monitoring / observability server (Cameleer3 Server) +**Researched:** 2026-03-11 +**Confidence:** MEDIUM (based on established patterns for ClickHouse, SSE, high-volume ingestion; no web verification available) + +--- + +## Critical Pitfalls + +Mistakes that cause data loss, rewrites, or production outages. + +### Pitfall 1: Inserting Rows One-at-a-Time into ClickHouse + +**What goes wrong:** ClickHouse is a columnar OLAP engine optimized for bulk inserts. Sending one INSERT per incoming transaction (or per activity) creates a new data part per insert. ClickHouse merges parts in the background, but if parts accumulate faster than merges complete, you get "too many parts" errors and the table becomes read-only or the server OOMs. + +**Why it happens:** Developers coming from PostgreSQL/MySQL treat ClickHouse like an OLTP database. The agent sends a transaction, the server writes it immediately -- natural but catastrophic at scale. + +**Consequences:** At 50+ agents sending thousands of transactions/minute, row-by-row inserts will produce hundreds of parts per second. ClickHouse will reject inserts within hours. Data loss follows. 
+ +**Warning signs:** +- `system.parts` table shows thousands of active parts per partition +- ClickHouse logs show "too many parts" warnings +- Insert latency increases progressively over hours + +**Prevention:** +- Buffer incoming transactions in memory (or a local queue) and flush in batches of 1,000-10,000 rows every 1-5 seconds +- Use ClickHouse's `Buffer` table engine as a safety net, but do not rely on it as the primary batching mechanism -- it has its own quirks (data visible before flush, lost on crash) +- Alternatively, write to a Kafka topic and use ClickHouse's Kafka engine for consumption (adds infrastructure but is the most robust pattern at high scale) +- Set `max_insert_block_size` and monitor `system.parts` in your health checks + +**Phase relevance:** Must be correct from the very first storage implementation (Phase 1). Retrofitting batching into a synchronous write path is painful. + +--- + +### Pitfall 2: Wrong ClickHouse Primary Key / ORDER BY Design + +**What goes wrong:** ClickHouse does not have traditional indexes. The `ORDER BY` clause defines how data is physically sorted on disk, and this sorting IS the primary access optimization. Choosing the wrong ORDER BY makes your most common queries scan entire partitions. + +**Why it happens:** Developers pick `ORDER BY (id)` by instinct (UUID primary key). But ClickHouse queries for this project will filter by time range, agent, state, and transaction attributes -- not by UUID. + +**Consequences:** A query like "find all ERROR transactions in the last hour from agent X" does a full partition scan instead of reading a narrow range. At millions of rows per day with 30-day retention, this means scanning tens of millions of rows for simple queries. 
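As a concrete illustration, a table shaped around the dominant query pattern might look like this (column names, index parameters, and granularity are illustrative sketches, not the project's actual schema):

```sql
-- Illustrative only: sorts data so agent + status + time-range filters
-- read a narrow range instead of scanning the whole partition.
CREATE TABLE transactions
(
    agent_id        LowCardinality(String),
    status          LowCardinality(String),
    execution_time  DateTime64(3, 'UTC'),
    transaction_id  UUID,
    duration_ms     UInt32,
    content         String,
    INDEX idx_content content TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(execution_time)  -- daily partitions: instant DROP PARTITION for retention
ORDER BY (agent_id, status, toStartOfHour(execution_time), transaction_id);
```

With this layout, "all ERROR transactions in the last hour from agent X" reads a narrow sorted range, and the high-cardinality `transaction_id` sits last in the sorting key.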
+ +**Warning signs:** +- `EXPLAIN` shows large `rows_read` relative to result set +- Queries that should take milliseconds take seconds +- CPU spikes on simple filtered queries + +**Prevention:** +- Design ORDER BY around your dominant query pattern: `ORDER BY (agent_id, status, toStartOfHour(execution_time), transaction_id)` or similar +- PARTITION BY month or day (e.g., `toYYYYMM(execution_time)`) to enable efficient TTL and partition dropping +- Put high-cardinality columns (like transaction_id) last in the ORDER BY +- Add `GRANULARITY`-based skip indexes (e.g., `INDEX idx_text content TYPE tokenbf_v1(...)`) for full-text-like searches +- Test with realistic data volumes before committing to a schema -- ClickHouse schema changes require table recreation or materialized views + +**Phase relevance:** Must be designed correctly before any data is stored (Phase 1). Changing ORDER BY requires recreating the table and re-ingesting all data. + +--- + +### Pitfall 3: SSE Connection Leaks and Unbounded Memory Growth + +**What goes wrong:** Each connected agent holds an open SSE connection. If the server does not detect dead connections, does not limit per-agent connections, and does not bound the event buffer per connection, memory grows unboundedly. Agents that disconnect uncleanly (network failure, OOM kill) leave orphaned `SseEmitter` objects on the server. + +**Why it happens:** Spring's `SseEmitter` does not automatically detect a dead TCP connection. The server happily buffers events for a dead connection until memory runs out. HTTP keep-alive and TCP timeouts are often far too long (minutes to hours). + +**Consequences:** With 50+ agents, each potentially disconnecting/reconnecting multiple times per day, orphaned emitters accumulate. The server eventually OOMs or becomes unresponsive. Config pushes go to dead connections and are silently lost. 
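The registry bookkeeping behind that cleanup can be sketched with a stdlib stand-in (in the real server the value would wrap a Spring `SseEmitter` whose `onCompletion`/`onTimeout`/`onError` callbacks invoke `evict`; `AgentConnection` here is a hypothetical interface):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: one live connection per agent, closed and evicted on replacement.
// AgentConnection is a hypothetical stand-in for a wrapped SseEmitter.
class SseRegistry {
    interface AgentConnection { void close(); }

    private final Map<String, AgentConnection> byAgent = new ConcurrentHashMap<>();

    /** A new connection for an agent replaces (and closes) any previous one. */
    void register(String agentId, AgentConnection conn) {
        AgentConnection old = byAgent.put(agentId, conn);
        if (old != null) old.close();           // kill the orphaned connection
    }

    /** Called from completion/timeout/error callbacks or a failed heartbeat write. */
    void evict(String agentId, AgentConnection conn) {
        byAgent.remove(agentId, conn);          // remove only if still the current one
    }

    int activeCount() { return byAgent.size(); } // expose as a metric
}
```

Keying by agent ID also gives the one-connection-per-agent invariant for free: a reconnect displaces and closes the stale emitter instead of letting it accumulate.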
+ +**Warning signs:** +- Heap usage grows steadily over days without corresponding agent count increase +- `SseEmitter` count in metrics diverges from known active agent count +- Config pushes succeed (no error) but agents never receive them + +**Prevention:** +- Set `SseEmitter` timeout explicitly (e.g., 60 seconds idle, with periodic heartbeat/ping events) +- Implement server-side heartbeat: send a comment event (`: ping`) every 15-30 seconds. If the write fails, the connection is dead -- clean it up immediately +- Register `onCompletion`, `onTimeout`, and `onError` callbacks on every `SseEmitter` to remove it from the registry +- Limit to one SSE connection per agent instance (keyed by agent ID). If a new connection arrives for the same agent, close the old one +- Bound the outbound event queue per connection (drop oldest events if agent is too slow) +- Use Spring WebFlux `Flux` instead of `SseEmitter` if possible -- it integrates better with reactive backpressure and connection lifecycle + +**Phase relevance:** Must be correct from the first SSE implementation (Phase 2 or whenever SSE is introduced). Connection leaks are silent and cumulative. + +--- + +### Pitfall 4: No Backpressure on Ingestion Endpoint + +**What goes wrong:** The HTTP POST endpoint that receives transaction data from agents accepts requests unboundedly. Under burst load (agent reconnection storm, batch replay), the server runs out of memory buffering writes, or overwhelms ClickHouse with insert pressure. + +**Why it happens:** The default Spring Boot behavior is to accept all incoming requests. Without explicit rate limiting or queue depth control, the server cannot signal agents to slow down. + +**Consequences:** Server OOMs during agent reconnection storms (all 50+ agents replay buffered data simultaneously). Or ClickHouse falls behind on merges, enters "too many parts" state, and rejects writes -- causing data loss. 
+ +**Warning signs:** +- Memory spikes correlated with agent reconnect events +- HTTP 503s during burst periods +- ClickHouse merge queue growing faster than it drains + +**Prevention:** +- Implement a bounded in-memory queue (e.g., `ArrayBlockingQueue` or Disruptor ring buffer) between the HTTP endpoint and the ClickHouse writer +- Return HTTP 429 (Too Many Requests) with `Retry-After` header when the queue is full -- agents should implement exponential backoff +- Size the queue based on expected burst duration (e.g., 30 seconds of peak throughput) +- Monitor queue depth as a key metric +- Consider writing to local disk (append-only log) as overflow when queue is full, then draining asynchronously + +**Phase relevance:** Should be designed into the ingestion layer from the start (Phase 1). Retrofitting backpressure requires changing both server and agent behavior. + +--- + +### Pitfall 5: Storing Full Transaction Payloads in ClickHouse for Full-Text Search + +**What goes wrong:** Developers store large text fields (message bodies, stack traces, XML/JSON payloads) directly in ClickHouse columns and try to search them with `LIKE '%term%'` or `hasToken()`. ClickHouse is not a text search engine. These queries scan every row in the partition and are extremely slow at scale. + +**Why it happens:** The requirement says "full-text search." ClickHouse can technically do string matching. So developers avoid adding a second storage system. + +**Consequences:** Full-text queries on 30 days of data (hundreds of millions of rows) take 30+ seconds or time out entirely. Users cannot find transactions by content, which is a core value proposition. + +**Warning signs:** +- Full-text queries take >5 seconds even on recent data +- ClickHouse CPU pegged at 100% during text searches +- Users avoid the search feature because it is too slow + +**Prevention:** +- Use a dedicated text search index alongside ClickHouse. 
Options: + - **OpenSearch/Elasticsearch:** Battle-tested for log/observability search. Index the searchable text fields (message content, stack traces) with the transaction ID as a foreign key. Query OpenSearch for matching transaction IDs, then fetch details from ClickHouse. + - **ClickHouse `tokenbf_v1` or `ngrambf_v1` skip indexes:** Viable for token-based search on specific columns if the search vocabulary is limited. Not a replacement for real full-text search but can handle "find transactions containing this exact correlation ID" well. + - **Tantivy/Lucene sidecar:** If you want to avoid a full OpenSearch cluster, embed a Lucene-based index in the server process. Higher coupling but lower infrastructure cost. +- For MVP, ClickHouse token bloom filter indexes may suffice for exact-token searches. Plan the architecture to swap in OpenSearch later without changing the query API. + +**Phase relevance:** Architecture decision needed in Phase 1 (storage design). Implementation can be phased -- start with ClickHouse skip indexes, add OpenSearch when query patterns demand it. + +--- + +### Pitfall 6: Losing Data During Server Restart or Crash + +**What goes wrong:** If the server buffers transactions in memory before batch-flushing to ClickHouse (as recommended in Pitfall 1), a server crash or restart loses all buffered data. + +**Why it happens:** In-memory buffering is the obvious first implementation. Nobody thinks about crash recovery until data is lost. + +**Consequences:** Every server restart during deployment loses 1-5 seconds of transaction data. In a crash scenario, potentially more. + +**Warning signs:** +- Missing transactions in ClickHouse around server restart timestamps +- Agents report successful POSTs but transactions are absent from storage + +**Prevention:** +- Accept that some data loss on crash is tolerable for an observability system (this is not a financial ledger). 
Document the guarantee: "at-most-once delivery with bounded loss window of N seconds" +- Implement graceful shutdown: on SIGTERM, flush the current buffer before stopping (`@PreDestroy` or `SmartLifecycle` with ordered shutdown) +- For zero data loss: write to a Write-Ahead Log (local append file) before acknowledging the HTTP POST, then batch from the WAL to ClickHouse. This adds complexity -- only do it if the data loss window from in-memory buffering is unacceptable +- Size the flush interval to minimize the loss window (1 second flush = max 1 second of data lost) + +**Phase relevance:** Graceful shutdown should be in Phase 1. WAL-based durability is a later optimization if needed. + +--- + +## Moderate Pitfalls + +### Pitfall 7: Timezone and Instant Handling Inconsistency + +**What goes wrong:** Transaction timestamps arrive from agents in various formats or timezones. The server stores them inconsistently, leading to queries that miss transactions or return wrong time ranges. ClickHouse's `DateTime` type is timezone-aware but defaults to server timezone if not specified. + +**Prevention:** +- Mandate UTC everywhere: agents send `Instant` (epoch millis or ISO-8601 with Z), server stores as ClickHouse `DateTime64(3, 'UTC')`, UI converts to local timezone for display only +- Use Jackson's `JavaTimeModule` (already noted in CLAUDE.md) and ensure `WRITE_DATES_AS_TIMESTAMPS` is disabled so Instant serializes as ISO-8601 +- ClickHouse: always use `DateTime64(3, 'UTC')` not bare `DateTime` +- Add a server-received timestamp alongside the agent-reported timestamp so you can detect clock skew + +**Phase relevance:** Must be correct from first data model design (Phase 1). + +--- + +### Pitfall 8: Correlation ID Design That Cannot Span Instances + +**What goes wrong:** Transactions that span multiple Camel instances (route A on instance 1 calls route B on instance 2) need a shared correlation ID. 
If the correlation ID is generated per-instance or per-route, you cannot reconstruct the full transaction path. + +**Prevention:** +- Use a single correlation ID (propagated via message headers) that is generated at the entry point and carried through all downstream calls +- Store both `transactionId` (the correlation ID spanning instances) and `activityId` (unique per route execution) as separate fields +- Ensure the agent propagates the correlation ID through Camel exchange properties and any external endpoint calls (HTTP headers, JMS properties, etc.) +- Index `transactionId` in ClickHouse ORDER BY so correlation lookups are fast +- This is primarily an agent-side concern, but the server schema must support it + +**Phase relevance:** Data model design (Phase 1). Agent protocol must define correlation ID propagation. + +--- + +### Pitfall 9: N+1 Queries When Loading Transaction Details + +**What goes wrong:** A transaction detail view needs: the transaction record, all activities within it, the route diagram for each activity, and possibly the message content. If each is a separate query, a transaction with 20 activities generates 40+ queries. + +**Prevention:** +- Design the API to return a fully hydrated transaction in one call: transaction + activities in a single ClickHouse query (they share the same `transactionId`, and if ORDER BY is designed correctly, they are physically co-located) +- Cache route diagrams aggressively (they are versioned and immutable once stored) -- a transaction with 20 activities likely references only 2-3 distinct diagrams +- For list views (search results), return summary data only (no activities, no content). Load details on demand via a separate detail endpoint +- Consider storing the diagram version hash with each activity so the detail endpoint can batch-fetch unique diagrams + +**Phase relevance:** API design (Phase 2). Must be considered during data model design (Phase 1). 
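The diagram side of this can be sketched in plain Java: collect the distinct content hashes from the activity rows first, then do one batched fetch instead of one lookup per activity (record shape and names are illustrative):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative shapes -- not the project's actual model.
record ActivityRow(String activityId, String diagramHash, long durationMs) {}

class DiagramBatcher {
    /**
     * A transaction with 20 activities usually references only 2-3 distinct
     * diagrams; collecting the unique hashes first turns 20 potential diagram
     * lookups into one batched fetch (or cache hits).
     */
    static Set<String> uniqueDiagramHashes(List<ActivityRow> activities) {
        Set<String> hashes = new LinkedHashSet<>();
        for (ActivityRow a : activities) {
            hashes.add(a.diagramHash());
        }
        return hashes;
    }
}
```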
+ +--- + +### Pitfall 10: SSE Reconnection Without Last-Event-ID + +**What goes wrong:** When an agent's SSE connection drops and reconnects, it misses all events sent during the disconnection. Without `Last-Event-ID` support, the agent has no way to request missed events, so configuration changes are silently lost. + +**Prevention:** +- Assign a monotonically increasing ID to every SSE event +- On reconnection, the agent sends `Last-Event-ID` header. The server replays events since that ID +- Keep a bounded event log (last N events or last T minutes) for replay. Events older than the replay window trigger a full state sync instead +- For config push specifically: make config idempotent and include a version number. On reconnection, always send the current full config state rather than relying on event replay. This is simpler and more robust than event sourcing for config + +**Phase relevance:** SSE implementation (Phase 2). The "full config sync on reconnect" pattern should be the default from day one. + +--- + +### Pitfall 11: ClickHouse TTL That Fragments Partitions + +**What goes wrong:** ClickHouse TTL deletes individual rows, which fragments existing data parts. At high data volumes with daily TTL expiration, this creates continuous background merge pressure and degrades query performance. + +**Prevention:** +- PARTITION BY `toYYYYMMDD(execution_time)` (daily partitions) and use `ALTER TABLE DROP PARTITION` via a scheduled job instead of row-level TTL +- Dropping a partition is an instant metadata operation -- no data scanning, no merge pressure +- A simple daily cron (or Spring `@Scheduled`) that drops partitions older than 30 days is more predictable than TTL +- If you use TTL, set `ttl_only_drop_parts = 1` in the table settings so ClickHouse drops entire parts rather than rewriting them with rows removed (available in recent ClickHouse versions) + +**Phase relevance:** Storage design (Phase 1). Must be decided before data accumulates. 
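The core of such a retention job is a few lines of date arithmetic; a sketch, assuming daily `toYYYYMMDD` partitions and an illustrative table name (the methods would be driven from a Spring `@Scheduled` task):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Core of a daily retention job: compute the cutoff, then DROP every
// partition whose name sorts below it. Table name is illustrative.
class RetentionJob {
    private static final DateTimeFormatter PARTITION_FMT =
            DateTimeFormatter.ofPattern("yyyyMMdd"); // matches PARTITION BY toYYYYMMDD(...)

    /** Partitions named below this value have aged out of the retention window. */
    static String cutoffPartition(LocalDate today, int retentionDays) {
        return today.minusDays(retentionDays).format(PARTITION_FMT);
    }

    /** Statement issued per expired partition -- an instant metadata operation. */
    static String dropStatement(String partition) {
        return "ALTER TABLE transactions DROP PARTITION " + partition;
    }
}
```

Dropping by name comparison rather than "drop exactly one partition per run" keeps the job idempotent: a missed run is caught up on the next execution.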
+ +--- + +### Pitfall 12: JWT Token Management Without Rotation + +**What goes wrong:** JWT tokens are issued with no expiration or with very long expiration. If a token is compromised, there is no way to revoke it. Alternatively, tokens expire too quickly and agents disconnect/reconnect constantly. + +**Prevention:** +- Use short-lived access tokens (15-60 minutes) with a refresh token mechanism +- For agent authentication specifically: the bootstrap token is used once to register and obtain a long-lived agent credential. The agent credential is used to obtain short-lived JWTs +- Maintain a server-side token denylist (or use token versioning per agent) so compromised tokens can be revoked +- Ed25519 signing for config push is separate from JWT auth -- do not conflate the two. Ed25519 ensures config integrity (agent verifies server signature). JWT ensures identity (server verifies agent identity) +- Store agent public keys server-side so you can revoke individual agents + +**Phase relevance:** Security implementation (later phase). Design the token lifecycle model early even if implementation comes later. + +--- + +### Pitfall 13: Schema Evolution Without Migration Strategy + +**What goes wrong:** The agent protocol is "still evolving" (per project constraints). When the data model changes, existing data in ClickHouse becomes incompatible. ClickHouse does not support `ALTER TABLE` for changing ORDER BY, and column type changes are limited. 
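One way to keep older agents working under these constraints is to normalize at the ingestion boundary so storage only ever sees the current shape; a sketch (version numbers, field names, and the rename are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the ingestion layer lifts every incoming payload to the
// current schema version before it reaches the write buffer.
class PayloadNormalizer {
    static final int CURRENT_VERSION = 2;

    static Map<String, Object> normalize(int protocolVersion, Map<String, Object> payload) {
        return switch (protocolVersion) {
            case 1 -> upgradeV1(payload);
            case CURRENT_VERSION -> payload;
            default -> throw new IllegalArgumentException(
                    "Unsupported protocol_version: " + protocolVersion);
        };
    }

    private static Map<String, Object> upgradeV1(Map<String, Object> v1) {
        Map<String, Object> v2 = new HashMap<>(v1);
        // Hypothetical example: v1 agents sent "state"; v2 renamed it to "status".
        Object state = v2.remove("state");
        if (state != null) v2.put("status", state);
        v2.put("schema_version", CURRENT_VERSION);
        return v2;
    }
}
```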
+ +**Prevention:** +- Version your schema explicitly (e.g., `schema_version` column or table naming convention) +- For additive changes (new nullable columns): `ALTER TABLE ADD COLUMN` works fine in ClickHouse +- For breaking changes (ORDER BY change, column type change): create a new table with the new schema and use a materialized view to transform data from old tables, or accept that old data stays in old format and queries span both tables +- Design the ingestion layer to normalize incoming data to the current schema version, handling backward compatibility with older agents +- Include a `protocol_version` field in agent registration so the server knows what format to expect + +**Phase relevance:** Must be considered in Phase 1 data model design. The migration strategy becomes critical as soon as you have production data. + +--- + +## Minor Pitfalls + +### Pitfall 14: Overloading Agents via SSE Event Storms + +**What goes wrong:** A bulk config change pushes 50 events in rapid succession to all agents. Agents process events synchronously and stall their main Camel routes while handling config updates. + +**Prevention:** +- Batch config changes into a single SSE event containing the full updated config +- Rate-limit SSE event emission (no more than 1 config event per second per agent) +- Agents should process SSE events asynchronously on a separate thread from the Camel context + +**Phase relevance:** SSE implementation (Phase 2). + +--- + +### Pitfall 15: Not Monitoring the Monitoring System + +**What goes wrong:** The observability server itself has no observability. When it degrades, nobody knows until agents report failures or users complain about missing data. 
+ +**Prevention:** +- Expose Prometheus/Micrometer metrics for: ingestion rate, batch flush latency, ClickHouse insert latency, SSE active connections, queue depth, error rates +- Add a `/health` endpoint that checks ClickHouse connectivity and queue depth +- Alert on: ingestion rate dropping below expected baseline, queue depth exceeding threshold, ClickHouse insert errors + +**Phase relevance:** Should be layered in alongside each component as it is built, not deferred to a "monitoring phase." + +--- + +### Pitfall 16: Route Diagram Versioning Without Content Hashing + +**What goes wrong:** Storing a new diagram version every time an agent reports, even if the diagram has not changed. With 50 agents reporting the same routes, you get 50 copies of identical diagrams. + +**Prevention:** +- Content-hash each diagram definition (SHA-256 of the normalized diagram content) +- Store diagrams keyed by content hash. If the hash already exists, skip the insert +- Link activities to diagrams via the content hash, not a sequential version number +- This deduplicates across agents running the same routes and across deployments where routes did not change + +**Phase relevance:** Diagram storage design (Phase 2 or whenever diagrams are implemented). + +--- + +## Phase-Specific Warnings + +| Phase Topic | Likely Pitfall | Mitigation | +|-------------|---------------|------------| +| Storage / ClickHouse setup | Row-by-row inserts (Pitfall 1), wrong ORDER BY (Pitfall 2), TTL fragmentation (Pitfall 11) | Design batch ingestion and ORDER BY before writing any code. Prototype with realistic volume. | +| Ingestion endpoint | No backpressure (Pitfall 4), crash data loss (Pitfall 6) | Bounded queue + graceful shutdown from day one. | +| Full-text search | ClickHouse as text engine (Pitfall 5) | Start with skip indexes, design API to allow backend swap. 
| +| SSE implementation | Connection leaks (Pitfall 3), no reconnection handling (Pitfall 10), event storms (Pitfall 14) | Heartbeat + timeout + one-connection-per-agent from first implementation. | +| Data model | Timezone inconsistency (Pitfall 7), correlation ID design (Pitfall 8), schema evolution (Pitfall 13) | UTC everywhere, correlation ID in protocol spec, versioned schema. | +| Security | Token management (Pitfall 12) | Design token lifecycle early, implement in security phase. | +| API design | N+1 queries (Pitfall 9) | Co-locate activities with transactions in storage, cache diagrams. | +| Operations | No self-monitoring (Pitfall 15) | Add metrics alongside each component, not as a separate phase. | + +--- + +## Sources + +- ClickHouse documentation on MergeTree engine, partitioning, and TTL (training data, MEDIUM confidence) +- Spring Framework SSE / SseEmitter documentation (training data, MEDIUM confidence) +- Production experience patterns from observability platforms (Jaeger, Zipkin, Grafana Tempo architecture docs) (training data, MEDIUM confidence) +- General distributed systems ingestion patterns (training data, MEDIUM confidence) + +**Note:** WebSearch was unavailable during this research session. All findings are based on training data (cutoff May 2025). Confidence is MEDIUM across the board -- the patterns are well-established but specific version details (e.g., ClickHouse TTL settings, Spring Boot 3.4.3 SSE behavior) should be verified against current documentation during implementation. 
diff --git a/.planning/research/STACK.md b/.planning/research/STACK.md new file mode 100644 index 00000000..f5ec3ceb --- /dev/null +++ b/.planning/research/STACK.md @@ -0,0 +1,271 @@ +# Technology Stack + +**Project:** Cameleer3 Server +**Researched:** 2026-03-11 +**Overall confidence:** MEDIUM (no live source verification available; versions based on training data up to May 2025) + +## Recommended Stack + +### Core Framework (Already Decided) + +| Technology | Version | Purpose | Why | Confidence | +|------------|---------|---------|-----|------------| +| Java | 17+ | Runtime | Already established; LTS, well-supported | HIGH | +| Spring Boot | 3.4.3 | Application framework | Already in POM; provides web, security, configuration | HIGH | +| Maven | 3.9+ | Build system | Already established; multi-module project | HIGH | + +### Primary Data Store: ClickHouse + +| Technology | Version | Purpose | Why | Confidence | +|------------|---------|---------|-----|------------| +| ClickHouse | 24.x+ | Transaction/activity storage | Column-oriented, built for billions of rows, native TTL, excellent time-range queries, MergeTree engine handles millions of inserts/day trivially | MEDIUM | +| clickhouse-java (HTTP) | 0.6.x+ | Java client | Official ClickHouse Java client; HTTP transport is simpler and more reliable than native TCP for Spring Boot apps | MEDIUM | + +**Why ClickHouse over alternatives:** + +- **vs Elasticsearch/OpenSearch:** ClickHouse is 5-10x more storage-efficient for structured columnar data. For time-series-like transaction data with known schema, ClickHouse drastically outperforms ES on aggregation queries (avg duration, count by state, time bucketing). ES is overkill when you don't need its inverted index for *every* field. +- **vs TimescaleDB:** TimescaleDB is PostgreSQL-based and good for moderate scale, but ClickHouse handles the "millions of inserts per day" tier with less operational overhead. 
TimescaleDB's row-oriented heritage means larger storage footprint for wide transaction records. ClickHouse's columnar compression achieves 10-20x compression on typical observability data. +- **vs PostgreSQL (plain):** PostgreSQL cannot efficiently handle this insert volume with 30-day retention and fast analytical queries. Partitioning and vacuuming become operational nightmares at this scale. + +**ClickHouse key features for this project:** +- **TTL on tables:** `TTL executionDate + INTERVAL 30 DAY` — automatic 30-day retention with zero application code +- **MergeTree engine:** Handles high insert throughput; batch inserts of 10K+ rows are trivial +- **Materialized views:** Pre-aggregate common queries (transactions by state per hour, etc.) +- **Low storage cost:** 10-20x compression means 30 days of millions of transactions fits in modest disk + +### Full-Text Search: OpenSearch + +| Technology | Version | Purpose | Why | Confidence | +|------------|---------|---------|-----|------------| +| OpenSearch | 2.x | Full-text search over payloads, metadata, attributes | True inverted index for arbitrary text search; ClickHouse's full-text is rudimentary | MEDIUM | +| opensearch-java | 2.x | Java client | Official OpenSearch Java client; works well with Spring Boot | MEDIUM | + +**Why a separate search engine instead of ClickHouse alone:** + +ClickHouse has token-level bloom filter indexes and `hasToken()`/`LIKE` matching, but these are not true full-text search. 
For the requirement "search by any content in payloads, metadata, and attributes," you need an inverted index with: +- Tokenization and analysis (stemming, case folding) +- Relevance scoring +- Phrase matching +- Highlighting of matched terms in results + +**Why OpenSearch over Elasticsearch:** +- Apache 2.0 licensed (no SSPL concerns for self-hosted deployment) +- API-compatible with Elasticsearch 7.x +- Active development, large community +- OpenSearch Dashboards available if needed later +- No licensing ambiguity for Docker deployment + +**Dual-store pattern:** +- ClickHouse = source of truth for structured queries (time range, state, duration, aggregations) +- OpenSearch = search index for full-text queries +- Application writes to both; OpenSearch indexed asynchronously from an internal queue +- Structured filters (time, state) applied in ClickHouse; full-text queries in OpenSearch return transaction IDs, then ClickHouse fetches full records + +### Caching Layer: Caffeine + Redis (phased) + +| Technology | Version | Purpose | Why | Confidence | +|------------|---------|---------|-----|------------| +| Caffeine | 3.1.x | In-process cache for agent registry, diagram versions, hot config | Fastest JVM cache; zero network overhead; perfect for single-instance start | MEDIUM | +| Spring Cache (`@Cacheable`) | (Spring Boot) | Cache abstraction | Switch cache backends without code changes | HIGH | +| Redis | 7.x | Distributed cache (Phase 2+, when horizontal scaling) | Shared state across multiple server instances; SSE session coordination | MEDIUM | + +**Phased approach:** +1. **Phase 1:** Caffeine only. Single server instance. Agent registry, diagram cache, recent query results all in-process. +2. **Phase 2 (horizontal scaling):** Add Redis for shared state. Agent registry must be consistent across instances. SSE sessions need coordination. 
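To make the Phase 1 choice concrete, here is a minimal sketch of an in-process agent registry cache built on Caffeine. The `AgentEntry` record, the field names, and the expiry/size bounds are illustrative assumptions, not the project's actual types:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

public class AgentRegistryCache {
    // Hypothetical value type; the real registry entry lives in the core module.
    public record AgentEntry(String agentId, long lastHeartbeatMillis) {}

    private final Cache<String, AgentEntry> cache = Caffeine.newBuilder()
            .maximumSize(10_000)                      // generous bound for 50+ agents
            .expireAfterWrite(Duration.ofMinutes(5))  // entries refresh on every heartbeat
            .build();

    public void recordHeartbeat(String agentId) {
        cache.put(agentId, new AgentEntry(agentId, System.currentTimeMillis()));
    }

    public AgentEntry lookup(String agentId) {
        return cache.getIfPresent(agentId);  // null when the agent never reported or expired
    }
}
```

Hiding the cache behind a small interface in the core module keeps the Phase 2 move to Redis a swap of implementations rather than a rewrite.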
+ +### Message Ingestion: Internal Buffer with Backpressure + +| Technology | Version | Purpose | Why | Confidence | +|------------|---------|---------|-----|------------| +| LMAX Disruptor | 4.0.x | High-performance ring buffer for ingestion | Lock-free, single-writer principle, handles burst traffic without blocking HTTP threads | MEDIUM | +| *Alternative:* `java.util.concurrent.LinkedBlockingQueue` | (JDK) | Simpler bounded queue | Good enough for initial implementation; switch to Disruptor if profiling shows contention | HIGH | + +**Why an internal buffer, not Kafka:** + +Kafka is the standard answer for "high-volume ingestion," but it adds massive operational complexity for a system that: +- Has a single data producer type (Cameleer agents via HTTP POST) +- Does not need replay from an external topic +- Does not need multi-consumer fan-out +- Is already receiving data via HTTP (not streaming) + +The right pattern here: **HTTP POST -> bounded in-memory queue -> batch writer to ClickHouse + async indexer to OpenSearch**. If the queue fills up, return HTTP 503 with `Retry-After` header — agents should implement exponential backoff. + +**When to add Kafka:** Only if you need cross-datacenter replication, multi-consumer processing, or guaranteed exactly-once delivery beyond what the internal buffer provides. This is a "maybe Phase 3+" decision. 
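The recommended pattern is easy to sketch with the JDK alternative from the table above; `TransactionRecord` is a stand-in for the real agent payload, and the capacity and timeout values are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class IngestionBuffer {
    // Hypothetical record standing in for the agent's transaction payload.
    public record TransactionRecord(String id, String payload) {}

    private final LinkedBlockingQueue<TransactionRecord> queue;

    public IngestionBuffer(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Called from the HTTP handler; false means "respond 503 + Retry-After". */
    public boolean offer(TransactionRecord record) {
        return queue.offer(record);  // non-blocking: never stalls an HTTP thread
    }

    /** Called from the batch-writer loop; blocks briefly, then drains up to maxBatch. */
    public List<TransactionRecord> nextBatch(int maxBatch) throws InterruptedException {
        List<TransactionRecord> batch = new ArrayList<>(maxBatch);
        TransactionRecord first = queue.poll(200, TimeUnit.MILLISECONDS);
        if (first == null) return batch;      // nothing arrived; caller just loops again
        batch.add(first);
        queue.drainTo(batch, maxBatch - 1);   // grab whatever else is queued, up to the cap
        return batch;
    }
}
```

A single writer thread loops on `nextBatch(...)` and hands each non-empty batch to the ClickHouse batch INSERT; swapping this class for a Disruptor later only changes the internals behind the same two methods.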
+ +### API Documentation: springdoc-openapi + +| Technology | Version | Purpose | Why | Confidence | +|------------|---------|---------|-----|------------| +| springdoc-openapi-starter-webmvc-ui | 2.x | OpenAPI 3.1 spec generation + Swagger UI | De facto standard for Spring Boot 3.x API docs; annotation-driven, zero-config for basic setup | MEDIUM | + +**Why springdoc over alternatives:** +- **vs SpringFox:** SpringFox is effectively dead; no Spring Boot 3 support +- **vs manual OpenAPI:** Too much maintenance overhead; springdoc generates from code +- springdoc supports Spring Boot 3.x natively, including Spring Security integration + +### Security + +| Technology | Version | Purpose | Why | Confidence | +|------------|---------|---------|-----|------------| +| Spring Security | (Spring Boot 3.4.3) | Authentication/authorization framework | Already part of Spring Boot; JWT filter chain, method security | HIGH | +| java-jwt (Auth0) | 4.x | JWT creation and validation | Lightweight, well-maintained; simpler than Nimbus for this use case | MEDIUM | +| Ed25519 (JDK `java.security`) | (JDK 17) | Config signing | JDK 15+ has native EdDSA support; no external library needed | HIGH | + +### Testing + +| Technology | Version | Purpose | Why | Confidence | +|------------|---------|---------|-----|------------| +| JUnit 5 | (Spring Boot) | Unit/integration testing | Already in POM; standard | HIGH | +| Testcontainers | 1.19.x+ | Integration tests with ClickHouse and OpenSearch | Spin up real databases in Docker for tests; no mocking storage layer | MEDIUM | +| Spring Boot Test | (Spring Boot) | Controller/integration testing | `@SpringBootTest`, `MockMvc`, etc. 
| HIGH | +| Awaitility | 4.2.x | Async testing (SSE, queue processing) | Clean API for testing eventually-consistent behavior | MEDIUM | + +### Containerization + +| Technology | Version | Purpose | Why | Confidence | +|------------|---------|---------|-----|------------| +| Docker | - | Container runtime | Required per project constraints | HIGH | +| Docker Compose | - | Local dev + simple deployment | Single command to run server + ClickHouse + OpenSearch + Redis | HIGH | +| Eclipse Temurin JDK 17 | - | Base image | Official OpenJDK distribution; `eclipse-temurin:17-jre-alpine` for small image | HIGH | + +### Monitoring (Server Self-Observability) + +| Technology | Version | Purpose | Why | Confidence | +|------------|---------|---------|-----|------------| +| Micrometer | (Spring Boot) | Metrics facade | Built into Spring Boot; exposes ingestion rates, queue depth, query latencies | HIGH | +| Spring Boot Actuator | (Spring Boot) | Health checks, metrics endpoint | `/actuator/health` for Docker health checks, `/actuator/prometheus` for metrics | HIGH | + +## Supporting Libraries + +| Library | Version | Purpose | When to Use | Confidence | +|---------|---------|---------|-------------|------------| +| MapStruct | 1.5.x | DTO <-> entity mapping | Compile-time mapping; avoids reflection overhead in hot path | MEDIUM | +| Jackson JavaTimeModule | (already used) | `Instant` serialization | Already in project for `java.time` types | HIGH | +| SLF4J + Logback | (Spring Boot) | Logging | Default Spring Boot logging; structured JSON logging for production | HIGH | + +## What NOT to Use + +| Technology | Why Not | +|------------|---------| +| Elasticsearch | SSPL license; OpenSearch is API-compatible and Apache 2.0 | +| Kafka | Massive operational overhead for a system with single producer type; internal buffer is sufficient initially | +| MongoDB | Poor fit for time-series analytical queries; no native TTL with the efficiency of ClickHouse's MergeTree | +| PostgreSQL (as 
primary) | Cannot handle millions of inserts/day with fast analytical queries at 30-day retention |
+| SpringFox | Dead project; no Spring Boot 3 support |
+| Hibernate/JPA | ClickHouse is not a relational DB; JPA adds friction with no benefit. Use ClickHouse Java client directly. |
+| Lombok | Controversial; Java 17 records cover most use cases; explicit code is clearer |
+| gRPC | Agents already use HTTP POST; adding gRPC doubles protocol complexity for marginal throughput gain |
+
+## Alternatives Considered
+
+| Category | Recommended | Alternative | Why Not Alternative |
+|----------|-------------|-------------|---------------------|
+| Primary store | ClickHouse | TimescaleDB | Row-oriented heritage; larger storage footprint; less efficient for wide analytical queries |
+| Primary store | ClickHouse | PostgreSQL + partitioning | Vacuum overhead; partition management; slower aggregations |
+| Search | OpenSearch | Elasticsearch | SSPL license risk; functionally equivalent |
+| Search | OpenSearch | ClickHouse full-text indexes | Not true full-text search; no relevance scoring, no phrase matching |
+| Ingestion buffer | Internal queue | Apache Kafka | Operational complexity not justified; single producer type |
+| Cache | Caffeine | Guava Cache | Caffeine is successor to Guava Cache with better performance |
+| API docs | springdoc-openapi | SpringFox | SpringFox has no Spring Boot 3 support |
+| JWT | java-jwt (Auth0) | Nimbus JOSE+JWT | Nimbus is more complex; java-jwt sufficient for symmetric/asymmetric JWT |
+
+## Installation (Maven Dependencies)
+
+```xml
+<!-- ClickHouse client -->
+<dependency>
+    <groupId>com.clickhouse</groupId>
+    <artifactId>clickhouse-http-client</artifactId>
+    <version>0.6.5</version>
+</dependency>
+
+<!-- OpenSearch client -->
+<dependency>
+    <groupId>org.opensearch.client</groupId>
+    <artifactId>opensearch-java</artifactId>
+    <version>2.13.0</version>
+</dependency>
+
+<!-- Caching -->
+<dependency>
+    <groupId>com.github.ben-manes.caffeine</groupId>
+    <artifactId>caffeine</artifactId>
+</dependency>
+<dependency>
+    <groupId>org.springframework.boot</groupId>
+    <artifactId>spring-boot-starter-cache</artifactId>
+</dependency>
+
+<!-- Security -->
+<dependency>
+    <groupId>org.springframework.boot</groupId>
+    <artifactId>spring-boot-starter-security</artifactId>
+</dependency>
+<dependency>
+    <groupId>com.auth0</groupId>
+    <artifactId>java-jwt</artifactId>
+    <version>4.4.0</version>
+</dependency>
+
+<!-- API documentation -->
+<dependency>
+    <groupId>org.springdoc</groupId>
+    <artifactId>springdoc-openapi-starter-webmvc-ui</artifactId>
+    <version>2.6.0</version>
+</dependency>
+
+<!-- Monitoring -->
+<dependency>
+    <groupId>org.springframework.boot</groupId>
+    <artifactId>spring-boot-starter-actuator</artifactId>
+</dependency>
+<dependency>
+    <groupId>io.micrometer</groupId>
+    <artifactId>micrometer-registry-prometheus</artifactId>
+</dependency>
+
+<!-- Testing -->
+<dependency>
+    <groupId>org.testcontainers</groupId>
+    <artifactId>clickhouse</artifactId>
+    <version>1.19.8</version>
+    <scope>test</scope>
+</dependency>
+<dependency>
+    <groupId>org.testcontainers</groupId>
+    <artifactId>junit-jupiter</artifactId>
+    <version>1.19.8</version>
+    <scope>test</scope>
+</dependency>
+<dependency>
+    <groupId>org.awaitility</groupId>
+    <artifactId>awaitility</artifactId>
+    <scope>test</scope>
+</dependency>
+
+<!-- Mapping -->
+<dependency>
+    <groupId>org.mapstruct</groupId>
+    <artifactId>mapstruct</artifactId>
+    <version>1.5.5.Final</version>
+</dependency>
+```
+
+## Version Verification Notes
+
+All version numbers are from training data (up to May 2025). Before adding dependencies, verify the latest stable versions on Maven Central:
+- `clickhouse-http-client`: check https://github.com/ClickHouse/clickhouse-java/releases
+- `opensearch-java`: check https://github.com/opensearch-project/opensearch-java/releases
+- `springdoc-openapi-starter-webmvc-ui`: check https://springdoc.org/
+- `java-jwt`: check https://github.com/auth0/java-jwt/releases
+- `testcontainers`: check https://github.com/testcontainers/testcontainers-java/releases
+
+## Sources
+
+- Training data knowledge (ClickHouse architecture, OpenSearch capabilities, Spring Boot ecosystem)
+- Project POM analysis (Spring Boot 3.4.3, Jackson 2.17.3, existing module structure)
+- CLAUDE.md project instructions (ClickHouse mentioned as storage target, JWT/Ed25519 security model)
+
+**Note:** All external source verification was unavailable during this research session. Version numbers should be validated before implementation.
diff --git a/.planning/research/SUMMARY.md b/.planning/research/SUMMARY.md new file mode 100644 index 00000000..358cce0b --- /dev/null +++ b/.planning/research/SUMMARY.md @@ -0,0 +1,101 @@ +# Research Summary: Cameleer3 Server + +**Domain:** Transaction observability server for Apache Camel integrations +**Researched:** 2026-03-11 +**Overall confidence:** MEDIUM (established domain with mature patterns; version numbers unverified against live sources) + +## Executive Summary + +Cameleer3 Server is a write-heavy, read-occasional observability system that receives millions of transaction records per day from distributed Apache Camel agents, stores them with 30-day retention, and provides structured + full-text search. The architecture closely parallels established observability platforms like Jaeger, Zipkin, and njams Server, with the key differentiator being Camel route diagram visualization tied to individual transactions. + +The recommended stack centers on **ClickHouse** as the primary data store. ClickHouse's columnar MergeTree engine provides the exact properties this project needs: massive batch insert throughput, excellent time-range query performance, native TTL-based retention, and 10-20x compression on structured observability data. This is a well-established pattern used by production observability platforms (SigNoz, Uptrace, PostHog all run on ClickHouse). + +For full-text search, the recommendation is a **phased approach**: start with ClickHouse's built-in token bloom filter skip indexes (`tokenbf_v1`), which handle exact-token search (correlation IDs, error messages, specific values) well enough for MVP. When query patterns demand fuzzy matching or relevance scoring, add **OpenSearch** as a secondary search index. The architecture should be designed from the start to allow this swap transparently via the repository abstraction in the core module. 
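Under those assumptions, a phase-1 table sketch might look like the following. All column names, the ORDER BY, and the `tokenbf_v1` parameters are illustrative placeholders, not the actual schema: the real column set must follow the agent protocol in cameleer3-common, and the ORDER BY needs its own phase-specific research.

```java
// Illustrative ClickHouse DDL held as a Java constant; NOT the project's actual schema.
public class TransactionSchema {
    public static final String CREATE_TRANSACTIONS = """
            CREATE TABLE IF NOT EXISTS transactions (
                transactionId  String,
                state          LowCardinality(String),
                executionDate  DateTime,
                durationMs     UInt32,
                payload        String,
                -- token bloom filter: exact-token search over payload (phase 1)
                INDEX payload_tokens payload TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
            )
            ENGINE = MergeTree
            PARTITION BY toYYYYMMDD(executionDate)
            ORDER BY (executionDate, transactionId)
            TTL executionDate + INTERVAL 30 DAY
            """;
}
```

The TTL clause gives the 30-day retention with no application code, and the skip index is what a phase-1 `hasToken(payload, '...')` query would use; if bloom filters prove insufficient, OpenSearch slots in behind the same repository interface.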
+
+The critical architectural pattern is an **in-memory write buffer** between the HTTP ingestion endpoint and ClickHouse. ClickHouse performs best with batch inserts of 1K-10K rows; individual row inserts are the single most common and most damaging mistake when building on ClickHouse. The buffer also provides the backpressure mechanism (HTTP 503 with `Retry-After`) that prevents the server from being overwhelmed during agent reconnection storms.
+
+The two-module structure (core for domain logic + interfaces, app for Spring Boot wiring + implementations) enforces clean boundaries. Core defines repository interfaces, service implementations, and the write buffer. App provides ClickHouse repository implementations, Spring SseEmitter integration, REST controllers, and security filters. The boundary rule is strict: app depends on core, never the reverse.
+
+## Key Findings
+
+**Stack:** Java 17 / Spring Boot 3.4.3 + ClickHouse (primary store) + ClickHouse skip indexes for text search (phase 1), OpenSearch optional (phase 2+) + Caffeine cache + springdoc-openapi + Auth0 java-jwt. No Kafka, no Elasticsearch, no JPA.
+
+**Architecture:** Write-heavy CQRS-lite with three data paths: (1) buffered ingestion pipeline to ClickHouse, (2) query engine combining structured ClickHouse queries with text search, (3) SSE connection registry for agent push. Repository abstraction keeps core module storage-agnostic. Content-addressable diagram versioning with async pre-rendering.
+
+**Critical pitfall:** Row-by-row ClickHouse inserts and wrong ORDER BY design. These two mistakes together will make the system fail within hours under load and cannot be fixed without table recreation. Batch buffering and schema design must be correct from the first implementation.
+
+## Implications for Roadmap
+
+Based on research, suggested phase structure:
+
+1.
**Foundation + Ingestion Pipeline** - Data model, ClickHouse schema design, batch write buffer, ingestion endpoint
+   - Addresses: Transaction ingestion, storage with TTL retention
+   - Avoids: Row-by-row inserts, wrong ORDER BY, no backpressure
+   - This phase needs careful design; ClickHouse ORDER BY and partition strategy are nearly impossible to change later
+
+2. **Transaction Query + API** - Query engine, structured filters (time/state/duration), cursor-based pagination, REST controllers
+   - Addresses: Core search experience, API-first design
+   - Avoids: OFFSET pagination degradation; N+1 queries (prevented by co-locating data access)
+
+3. **Agent Registry + SSE** - Agent lifecycle management (LIVE/STALE/DEAD), heartbeat monitoring, SSE connection registry, config push
+   - Addresses: Agent management, real-time server-to-agent communication
+   - Avoids: SSE connection leaks, ghost agents, reconnection without Last-Event-ID
+
+4. **Diagram Service** - Content-addressable versioned storage, async rendering, transaction-diagram linking
+   - Addresses: Route diagram visualization (key Camel-specific differentiator)
+   - Avoids: Duplicate diagram storage (deduplicated via content hashing), synchronous rendering bottleneck
+
+5. **Security** - JWT authentication, Ed25519 config signing, bootstrap token registration
+   - Addresses: Production-ready security
+   - Avoids: Token management without a rotation path
+   - Can be partially layered in earlier if needed for integration testing with agents
+
+6. **Full-Text Search** - ClickHouse skip indexes initially; OpenSearch integration if bloom filters prove insufficient
+   - Addresses: "Find any transaction by content" requirement
+   - Avoids: Using LIKE/hasToken on large text columns without proper indexing
+   - Decision point: ClickHouse bloom filters may suffice; evaluate before adding OpenSearch
+
+7.
**Dashboard + Aggregations** - Overview charts, error rates, volume trends using ClickHouse aggregation queries + - Addresses: At-a-glance operational awareness + +8. **Web UI** - Frontend consuming the REST API exclusively + - Addresses: User-facing interface + - Must come after API is stable per API-first principle + +**Phase ordering rationale:** +- Storage before query: you need data to query +- Ingestion before agents: agents need somewhere to POST +- Query before full-text: structured search first, text layers on top +- Agent registry before config push: must know who to push to +- Diagrams after query engine: transactions must exist to link diagrams to +- Security is cross-cutting but cleanest after core flows work +- UI last because API-first means the API must be stable first + +**Research flags for phases:** +- Phase 1 (Storage): NEEDS DEEPER RESEARCH -- ClickHouse Java client API, optimal ORDER BY for the specific query patterns, Docker configuration +- Phase 4 (Diagrams): NEEDS DEEPER RESEARCH -- server-side graph rendering library selection (Batik, jsvg, JGraphX, or client-side rendering) +- Phase 6 (Full-Text): NEEDS DEEPER RESEARCH -- ClickHouse skip index capabilities vs OpenSearch integration complexity; decision point +- Phase 8 (UI): NEEDS DEEPER RESEARCH -- frontend framework selection +- Phase 2 (Query): Standard patterns, unlikely to need research +- Phase 5 (Security): Standard patterns, unlikely to need research + +## Confidence Assessment + +| Area | Confidence | Notes | +|------|------------|-------| +| Stack (ClickHouse choice) | HIGH | Well-established pattern for observability; used by SigNoz, Uptrace, PostHog | +| Stack (version numbers) | LOW | Could not verify against live sources; all versions from training data (May 2025 cutoff) | +| Features | MEDIUM | Based on domain knowledge of njams, Jaeger, Zipkin; could not verify latest feature trends | +| Architecture | MEDIUM | Patterns are well-established; batch buffer, SSE registry, 
content-addressable storage are standard | +| Pitfalls | HIGH | ClickHouse pitfalls are well-documented; SSE lifecycle issues are common; ingestion backpressure is standard | +| Full-text search approach | MEDIUM | ClickHouse skip indexes vs OpenSearch is a legitimate decision point that needs hands-on evaluation | + +## Gaps to Address + +- **ClickHouse Java client API:** The clickhouse-java library has undergone significant changes. Exact API, connection pooling, and Spring Boot integration patterns need phase-specific research +- **cameleer3-common PROTOCOL.md:** Must read the agent protocol definition before designing ClickHouse schema -- this defines the exact data structures being ingested +- **ClickHouse Docker setup:** Optimal ClickHouse Docker configuration (memory limits, merge settings) for development and production +- **Full-text search decision:** ClickHouse skip indexes may or may not meet the "search by any content" requirement. This needs prototyping with realistic data +- **Diagram rendering library:** Server-side route diagram rendering is a significant unknown; needs prototyping with actual Camel route graph data from cameleer3-common +- **Frontend framework:** No research on UI technology -- deferred to UI phase +- **Agent protocol stability:** The cameleer3-common protocol is still evolving. Schema evolution strategy needs alignment with agent development