# Domain Pitfalls

**Domain:** Transaction monitoring / observability server (Cameleer Server)
**Researched:** 2026-03-11
**Confidence:** MEDIUM (based on established patterns for ClickHouse, SSE, and high-volume ingestion; no web verification available)

---
## Critical Pitfalls

Mistakes that cause data loss, rewrites, or production outages.
### Pitfall 1: Inserting Rows One-at-a-Time into ClickHouse

**What goes wrong:** ClickHouse is a columnar OLAP engine optimized for bulk inserts. Sending one INSERT per incoming transaction (or per activity) creates a new data part per insert. ClickHouse merges parts in the background, but if parts accumulate faster than merges complete, you get "too many parts" errors and the table becomes read-only or the server OOMs.

**Why it happens:** Developers coming from PostgreSQL/MySQL treat ClickHouse like an OLTP database. The agent sends a transaction, the server writes it immediately -- natural but catastrophic at scale.

**Consequences:** At 50+ agents sending thousands of transactions/minute, row-by-row inserts produce hundreds of parts per second. ClickHouse will reject inserts within hours. Data loss follows.

**Warning signs:**
- `system.parts` shows thousands of active parts per partition
- ClickHouse logs show "too many parts" warnings
- Insert latency increases progressively over hours

**Prevention:**
- Buffer incoming transactions in memory (or a local queue) and flush in batches of 1,000-10,000 rows every 1-5 seconds
- Use ClickHouse's `Buffer` table engine as a safety net, but do not rely on it as the primary batching mechanism -- it has its own quirks (data visible before flush, lost on crash)
- Alternatively, write to a Kafka topic and use ClickHouse's Kafka engine for consumption (adds infrastructure but is the most robust pattern at high scale)
- Set `max_insert_block_size` and monitor `system.parts` in your health checks

**Phase relevance:** Must be correct from the very first storage implementation (Phase 1). Retrofitting batching into a synchronous write path is painful.
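The buffer-and-flush pattern can be sketched in plain Java. This is a minimal illustration, not the project's implementation; the class and method names (`BatchBuffer`, `add`, `flush`) are invented for the example, and the flush callback stands in for one multi-row ClickHouse INSERT:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Accumulates rows and hands them to a flush callback in bulk. */
final class BatchBuffer<T> {
    private final int maxRows;
    private final long maxAgeMillis;
    private final Consumer<List<T>> flusher;   // e.g. one multi-row INSERT per call
    private List<T> pending = new ArrayList<>();
    private long oldestMillis = -1;

    BatchBuffer(int maxRows, long maxAgeMillis, Consumer<List<T>> flusher) {
        this.maxRows = maxRows;
        this.maxAgeMillis = maxAgeMillis;
        this.flusher = flusher;
    }

    /** Called per incoming transaction; flushes when the size or age threshold is hit. */
    synchronized void add(T row, long nowMillis) {
        if (pending.isEmpty()) oldestMillis = nowMillis;
        pending.add(row);
        if (pending.size() >= maxRows || nowMillis - oldestMillis >= maxAgeMillis) {
            flush();
        }
    }

    /** Also call from a scheduled task and on shutdown so the tail is never stranded. */
    synchronized void flush() {
        if (pending.isEmpty()) return;
        List<T> batch = pending;
        pending = new ArrayList<>();
        flusher.accept(batch);
    }
}
```

In production the age check would come from a scheduler rather than piggybacking on `add`, but the invariant is the same: rows reach ClickHouse only in batches, never one at a time.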
---
### Pitfall 2: Wrong ClickHouse Primary Key / ORDER BY Design

**What goes wrong:** ClickHouse does not have traditional indexes. The `ORDER BY` clause defines how data is physically sorted on disk, and this sorting IS the primary access optimization. Choosing the wrong ORDER BY makes your most common queries scan entire partitions.

**Why it happens:** Developers pick `ORDER BY (id)` by instinct (UUID primary key). But ClickHouse queries for this project will filter by time range, agent, state, and transaction attributes -- not by UUID.

**Consequences:** A query like "find all ERROR transactions in the last hour from agent X" does a full partition scan instead of reading a narrow range. At millions of rows per day with 30-day retention, this means scanning tens of millions of rows for simple queries.

**Warning signs:**
- `EXPLAIN` shows large `rows_read` relative to the result set
- Queries that should take milliseconds take seconds
- CPU spikes on simple filtered queries

**Prevention:**
- Design ORDER BY around your dominant query pattern: `ORDER BY (agent_id, status, toStartOfHour(execution_time), transaction_id)` or similar
- PARTITION BY month or day (e.g., `toYYYYMM(execution_time)`) to enable efficient TTL and partition dropping
- Put high-cardinality columns (like transaction_id) last in the ORDER BY
- Add skip indexes (e.g., `INDEX idx_text content TYPE tokenbf_v1(...)` with a suitable `GRANULARITY`) for full-text-like searches
- Test with realistic data volumes before committing to a schema -- ClickHouse schema changes require table recreation or materialized views

**Phase relevance:** Must be designed correctly before any data is stored (Phase 1). Changing ORDER BY requires recreating the table and re-ingesting all data.
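A schema following these guidelines might look like the sketch below. The table name, column set, and bloom-filter parameters are illustrative assumptions, not the project's actual schema; the ORDER BY mirrors the pattern suggested above, and the daily `PARTITION BY` anticipates partition-drop retention (Pitfall 11):

```sql
CREATE TABLE transactions
(
    transaction_id  UUID,
    agent_id        LowCardinality(String),
    status          LowCardinality(String),
    execution_time  DateTime64(3, 'UTC'),
    content         String,
    INDEX idx_content content TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(execution_time)
ORDER BY (agent_id, status, toStartOfHour(execution_time), transaction_id);
```

Note that `transaction_id`, the highest-cardinality column, comes last in the sort key, and `LowCardinality(String)` keeps the leading columns cheap to scan.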
---
### Pitfall 3: SSE Connection Leaks and Unbounded Memory Growth

**What goes wrong:** Each connected agent holds an open SSE connection. If the server does not detect dead connections, does not limit per-agent connections, and does not bound the event buffer per connection, memory grows unboundedly. Agents that disconnect uncleanly (network failure, OOM kill) leave orphaned `SseEmitter` objects on the server.

**Why it happens:** Spring's `SseEmitter` does not automatically detect a dead TCP connection. The server happily buffers events for a dead connection until memory runs out. HTTP keep-alive and TCP timeouts are often far too long (minutes to hours).

**Consequences:** With 50+ agents, each potentially disconnecting/reconnecting multiple times per day, orphaned emitters accumulate. The server eventually OOMs or becomes unresponsive. Config pushes go to dead connections and are silently lost.

**Warning signs:**
- Heap usage grows steadily over days without a corresponding agent count increase
- `SseEmitter` count in metrics diverges from the known active agent count
- Config pushes succeed (no error) but agents never receive them

**Prevention:**
- Set `SseEmitter` timeout explicitly (e.g., 60 seconds idle, with periodic heartbeat/ping events)
- Implement server-side heartbeat: send a comment event (`: ping`) every 15-30 seconds. If the write fails, the connection is dead -- clean it up immediately
- Register `onCompletion`, `onTimeout`, and `onError` callbacks on every `SseEmitter` to remove it from the registry
- Limit to one SSE connection per agent instance (keyed by agent ID). If a new connection arrives for the same agent, close the old one
- Bound the outbound event queue per connection (drop oldest events if the agent is too slow)
- Use Spring WebFlux `Flux<ServerSentEvent>` instead of `SseEmitter` if possible -- it integrates better with reactive backpressure and connection lifecycle

**Phase relevance:** Must be correct from the first SSE implementation (Phase 2 or whenever SSE is introduced). Connection leaks are silent and cumulative.
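The one-connection-per-agent and heartbeat-eviction rules can be sketched without any Spring dependency. `AgentConnection` here is a hypothetical abstraction standing in for an `SseEmitter` or WebFlux sink; the registry logic is the point:

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal connection abstraction; in Spring this would wrap SseEmitter. */
interface AgentConnection {
    void send(String event) throws IOException;
    void close();
}

/** One live connection per agent; heartbeat failures evict dead entries. */
final class SseRegistry {
    private final Map<String, AgentConnection> connections = new ConcurrentHashMap<>();

    /** A new connection for the same agent displaces (and closes) the old one. */
    void register(String agentId, AgentConnection conn) {
        AgentConnection old = connections.put(agentId, conn);
        if (old != null) old.close();
    }

    /** Run every 15-30s from a scheduler; a failed write means the peer is gone. */
    void heartbeat() {
        connections.forEach((agentId, conn) -> {
            try {
                conn.send(": ping");
            } catch (IOException dead) {
                connections.remove(agentId, conn);
                conn.close();
            }
        });
    }

    int activeCount() { return connections.size(); }   // export as a gauge metric
}
```

Exporting `activeCount()` as a metric is what lets you spot the divergence between registered emitters and known agents described in the warning signs.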
---
### Pitfall 4: No Backpressure on Ingestion Endpoint

**What goes wrong:** The HTTP POST endpoint that receives transaction data from agents accepts requests unboundedly. Under burst load (agent reconnection storm, batch replay), the server runs out of memory buffering writes, or overwhelms ClickHouse with insert pressure.

**Why it happens:** The default Spring Boot behavior is to accept all incoming requests. Without explicit rate limiting or queue depth control, the server cannot signal agents to slow down.

**Consequences:** The server OOMs during agent reconnection storms (all 50+ agents replay buffered data simultaneously). Or ClickHouse falls behind on merges, enters the "too many parts" state, and rejects writes -- causing data loss.

**Warning signs:**
- Memory spikes correlated with agent reconnect events
- HTTP 503s during burst periods
- ClickHouse merge queue growing faster than it drains

**Prevention:**
- Implement a bounded in-memory queue (e.g., `ArrayBlockingQueue` or Disruptor ring buffer) between the HTTP endpoint and the ClickHouse writer
- Return HTTP 429 (Too Many Requests) with a `Retry-After` header when the queue is full -- agents should implement exponential backoff
- Size the queue based on expected burst duration (e.g., 30 seconds of peak throughput)
- Monitor queue depth as a key metric
- Consider writing to local disk (append-only log) as overflow when the queue is full, then draining asynchronously

**Phase relevance:** Should be designed into the ingestion layer from the start (Phase 1). Retrofitting backpressure requires changing both server and agent behavior.
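The bounded-queue admission step is small enough to show in full. This sketch (the `IngestQueue` name and status-code mapping are illustrative) uses a JDK `ArrayBlockingQueue`; the HTTP layer would translate the returned code into the actual response:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Sits between the HTTP handler and the ClickHouse writer thread. */
final class IngestQueue {
    static final int HTTP_ACCEPTED = 202;
    static final int HTTP_TOO_MANY_REQUESTS = 429;

    private final BlockingQueue<byte[]> queue;

    IngestQueue(int capacity) {                 // size for ~30s of peak throughput
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking admission: a full queue becomes an explicit slow-down signal. */
    int offer(byte[] payload) {
        return queue.offer(payload) ? HTTP_ACCEPTED : HTTP_TOO_MANY_REQUESTS;
    }

    int depth() { return queue.size(); }        // export as a gauge metric
}
```

The key design choice is `offer` rather than `put`: the request thread never blocks, so the agent gets a 429 with `Retry-After` immediately instead of tying up a server thread.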
---
### Pitfall 5: Storing Full Transaction Payloads in ClickHouse for Full-Text Search

**What goes wrong:** Developers store large text fields (message bodies, stack traces, XML/JSON payloads) directly in ClickHouse columns and try to search them with `LIKE '%term%'` or `hasToken()`. ClickHouse is not a text search engine. These queries scan every row in the partition and are extremely slow at scale.

**Why it happens:** The requirement says "full-text search." ClickHouse can technically do string matching. So developers avoid adding a second storage system.

**Consequences:** Full-text queries on 30 days of data (hundreds of millions of rows) take 30+ seconds or time out entirely. Users cannot find transactions by content, which is a core value proposition.

**Warning signs:**
- Full-text queries take >5 seconds even on recent data
- ClickHouse CPU pegged at 100% during text searches
- Users avoid the search feature because it is too slow

**Prevention:**
- Use a dedicated text search index alongside ClickHouse. Options:
  - **OpenSearch/Elasticsearch:** Battle-tested for log/observability search. Index the searchable text fields (message content, stack traces) with the transaction ID as a foreign key. Query OpenSearch for matching transaction IDs, then fetch details from ClickHouse.
  - **ClickHouse `tokenbf_v1` or `ngrambf_v1` skip indexes:** Viable for token-based search on specific columns if the search vocabulary is limited. Not a replacement for real full-text search but can handle "find transactions containing this exact correlation ID" well.
  - **Tantivy/Lucene sidecar:** If you want to avoid a full OpenSearch cluster, embed a Lucene-based index in the server process. Higher coupling but lower infrastructure cost.
- For MVP, ClickHouse token bloom filter indexes may suffice for exact-token searches. Plan the architecture to swap in OpenSearch later without changing the query API.

**Phase relevance:** Architecture decision needed in Phase 1 (storage design). Implementation can be phased -- start with ClickHouse skip indexes, add OpenSearch when query patterns demand it.

---
### Pitfall 6: Losing Data During Server Restart or Crash

**What goes wrong:** If the server buffers transactions in memory before batch-flushing to ClickHouse (as recommended in Pitfall 1), a server crash or restart loses all buffered data.

**Why it happens:** In-memory buffering is the obvious first implementation. Nobody thinks about crash recovery until data is lost.

**Consequences:** Every server restart during deployment loses 1-5 seconds of transaction data. In a crash scenario, potentially more.

**Warning signs:**
- Missing transactions in ClickHouse around server restart timestamps
- Agents report successful POSTs but transactions are absent from storage

**Prevention:**
- Accept that some data loss on crash is tolerable for an observability system (this is not a financial ledger). Document the guarantee: "at-most-once delivery with a bounded loss window of N seconds"
- Implement graceful shutdown: on SIGTERM, flush the current buffer before stopping (`@PreDestroy` or `SmartLifecycle` with ordered shutdown)
- For zero data loss: write to a Write-Ahead Log (local append file) before acknowledging the HTTP POST, then batch from the WAL to ClickHouse. This adds complexity -- only do it if the data loss window from in-memory buffering is unacceptable
- Size the flush interval to minimize the loss window (1 second flush = max 1 second of data lost)

**Phase relevance:** Graceful shutdown should be in Phase 1. WAL-based durability is a later optimization if needed.
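The graceful-shutdown half of this can be reduced to one rule: the writer's close path drains whatever is still buffered. A minimal sketch (`DrainingWriter` and its methods are invented for the example; in Spring the `close()` would hang off `@PreDestroy` or a `SmartLifecycle` stop):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Writer whose close() drains whatever is still buffered, bounding the loss window. */
final class DrainingWriter implements AutoCloseable {
    private final List<String> buffer = new ArrayList<>();
    private final Consumer<List<String>> sink;   // one bulk INSERT per call

    DrainingWriter(Consumer<List<String>> sink) { this.sink = sink; }

    synchronized void accept(String row) { buffer.add(row); }

    /** In Spring, invoke from a @PreDestroy method so SIGTERM during deploys drains first. */
    @Override
    public synchronized void close() {
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

This only covers orderly restarts; a hard crash still loses the in-flight buffer, which is exactly the documented at-most-once loss window.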
---
## Moderate Pitfalls
### Pitfall 7: Timezone and Instant Handling Inconsistency

**What goes wrong:** Transaction timestamps arrive from agents in various formats or timezones. The server stores them inconsistently, leading to queries that miss transactions or return wrong time ranges. ClickHouse's `DateTime` type is timezone-aware but defaults to the server timezone if not specified.

**Prevention:**
- Mandate UTC everywhere: agents send `Instant` (epoch millis or ISO-8601 with Z), the server stores ClickHouse `DateTime64(3, 'UTC')`, and the UI converts to local timezone for display only
- Use Jackson's `JavaTimeModule` (already noted in CLAUDE.md) and ensure `WRITE_DATES_AS_TIMESTAMPS` is disabled so `Instant` serializes as ISO-8601
- ClickHouse: always use `DateTime64(3, 'UTC')`, not bare `DateTime`
- Add a server-received timestamp alongside the agent-reported timestamp so you can detect clock skew

**Phase relevance:** Must be correct from first data model design (Phase 1).
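The "UTC everywhere, convert at the display edge" rule maps directly onto `java.time`. A small sketch (the `Timestamps` helper and its two accepted input formats are illustrative assumptions about the agent protocol):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Normalizes whatever the agent sent to a single UTC Instant. */
final class Timestamps {
    /** Accepts epoch millis or ISO-8601 with an offset; everything becomes UTC. */
    static Instant parse(String raw) {
        if (raw.chars().allMatch(Character::isDigit)) {
            return Instant.ofEpochMilli(Long.parseLong(raw));
        }
        return ZonedDateTime.parse(raw, DateTimeFormatter.ISO_OFFSET_DATE_TIME).toInstant();
    }

    /** Renders for a user, converting timezone at the display edge only. */
    static String display(Instant utc, ZoneId userZone) {
        return utc.atZone(userZone).format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }
}
```

Everything between `parse` and `display` (storage, queries, comparisons) operates on `Instant`, which is always UTC, so no intermediate layer can reintroduce a timezone bug.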
---
### Pitfall 8: Correlation ID Design That Cannot Span Instances

**What goes wrong:** Transactions that span multiple Camel instances (route A on instance 1 calls route B on instance 2) need a shared correlation ID. If the correlation ID is generated per-instance or per-route, you cannot reconstruct the full transaction path.

**Prevention:**
- Use a single correlation ID (propagated via message headers) that is generated at the entry point and carried through all downstream calls
- Store both `transactionId` (the correlation ID spanning instances) and `activityId` (unique per route execution) as separate fields
- Ensure the agent propagates the correlation ID through Camel exchange properties and any external endpoint calls (HTTP headers, JMS properties, etc.)
- Index `transactionId` in the ClickHouse ORDER BY so correlation lookups are fast
- This is primarily an agent-side concern, but the server schema must support it

**Phase relevance:** Data model design (Phase 1). The agent protocol must define correlation ID propagation.

---
### Pitfall 9: N+1 Queries When Loading Transaction Details

**What goes wrong:** A transaction detail view needs: the transaction record, all activities within it, the route diagram for each activity, and possibly the message content. If each is a separate query, a transaction with 20 activities generates 40+ queries.

**Prevention:**
- Design the API to return a fully hydrated transaction in one call: transaction + activities in a single ClickHouse query (they share the same `transactionId`, and if ORDER BY is designed correctly, they are physically co-located)
- Cache route diagrams aggressively (they are versioned and immutable once stored) -- a transaction with 20 activities likely references only 2-3 distinct diagrams
- For list views (search results), return summary data only (no activities, no content). Load details on demand via a separate detail endpoint
- Consider storing the diagram version hash with each activity so the detail endpoint can batch-fetch unique diagrams

**Phase relevance:** API design (Phase 2). Must be considered during data model design (Phase 1).

---
### Pitfall 10: SSE Reconnection Without Last-Event-ID

**What goes wrong:** When an agent's SSE connection drops and reconnects, it misses all events sent during the disconnection. Without `Last-Event-ID` support, the agent has no way to request missed events, so configuration changes are silently lost.

**Prevention:**
- Assign a monotonically increasing ID to every SSE event
- On reconnection, the agent sends the `Last-Event-ID` header. The server replays events since that ID
- Keep a bounded event log (last N events or last T minutes) for replay. Events older than the replay window trigger a full state sync instead
- For config push specifically: make config idempotent and include a version number. On reconnection, always send the current full config state rather than relying on event replay. This is simpler and more robust than event sourcing for config

**Phase relevance:** SSE implementation (Phase 2). The "full config sync on reconnect" pattern should be the default from day one.
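A bounded replay log with the full-sync fallback can be sketched as follows (the `ReplayLog` class is illustrative; an empty `Optional` is the signal that the requested ID has aged out and the caller must push a full state sync instead):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Optional;

/** Bounded event log supporting Last-Event-ID replay with a full-sync fallback. */
final class ReplayLog {
    record Event(long id, String payload) {}

    private final Deque<Event> log = new ArrayDeque<>();
    private final int capacity;
    private long nextId = 1;

    ReplayLog(int capacity) { this.capacity = capacity; }

    /** Assigns a monotonically increasing id to every event. */
    synchronized long append(String payload) {
        if (log.size() == capacity) log.removeFirst();   // oldest falls out of the window
        long id = nextId++;
        log.addLast(new Event(id, payload));
        return id;
    }

    /**
     * Events after lastEventId, or empty if that id has aged out of the window --
     * the caller must then send a full config sync instead of a replay.
     */
    synchronized Optional<List<Event>> since(long lastEventId) {
        Event oldest = log.peekFirst();
        if (oldest != null && lastEventId < oldest.id() - 1) return Optional.empty();
        List<Event> out = new ArrayList<>();
        for (Event e : log) if (e.id() > lastEventId) out.add(e);
        return Optional.of(out);
    }
}
```

For config events specifically, the simpler "always resend full versioned config on reconnect" approach recommended above makes the replay path optional rather than load-bearing.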
---
### Pitfall 11: ClickHouse TTL That Fragments Partitions

**What goes wrong:** ClickHouse TTL deletes individual rows, which fragments existing data parts. At high data volumes with daily TTL expiration, this creates continuous background merge pressure and degrades query performance.

**Prevention:**
- PARTITION BY `toYYYYMMDD(execution_time)` (daily partitions) and use `ALTER TABLE DROP PARTITION` via a scheduled job instead of row-level TTL
- Dropping a partition is an instant metadata operation -- no data scanning, no merge pressure
- A simple daily cron (or Spring `@Scheduled`) that drops partitions older than 30 days is more predictable than TTL
- If you do use TTL, set `ttl_only_drop_parts = 1` in the table settings so ClickHouse drops entire parts rather than rewriting them with rows removed (available in recent ClickHouse versions)

**Phase relevance:** Storage design (Phase 1). Must be decided before data accumulates.
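The scheduled-drop job is mostly date arithmetic. This sketch (the `RetentionJob` name, table name, and parameters are illustrative) assumes the daily `toYYYYMMDD` partition key recommended above, where partition IDs look like `20260208`; the statements it builds would be executed against ClickHouse by the daily job:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

/** Builds the DROP PARTITION statements for data past the retention window. */
final class RetentionJob {
    private static final DateTimeFormatter PARTITION_FMT =
            DateTimeFormatter.ofPattern("yyyyMMdd");   // matches PARTITION BY toYYYYMMDD(...)

    /** One statement per expired day; run daily from @Scheduled or cron. */
    static List<String> dropStatements(String table, LocalDate today,
                                       int retentionDays, int sweepDays) {
        List<String> out = new ArrayList<>();
        LocalDate cutoff = today.minusDays(retentionDays);
        for (int i = 1; i <= sweepDays; i++) {
            String partition = cutoff.minusDays(i).format(PARTITION_FMT);
            out.add("ALTER TABLE " + table + " DROP PARTITION " + partition);
        }
        return out;
    }
}
```

Sweeping a few days past the cutoff (rather than exactly one) makes the job self-healing if a scheduled run is missed.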
---
### Pitfall 12: JWT Token Management Without Rotation

**What goes wrong:** JWT tokens are issued with no expiration or with very long expiration. If a token is compromised, there is no way to revoke it. Alternatively, tokens expire too quickly and agents disconnect/reconnect constantly.

**Prevention:**
- Use short-lived access tokens (15-60 minutes) with a refresh token mechanism
- For agent authentication specifically: the bootstrap token is used once to register and obtain a long-lived agent credential. The agent credential is used to obtain short-lived JWTs
- Maintain a server-side token denylist (or use token versioning per agent) so compromised tokens can be revoked
- Ed25519 signing for config push is separate from JWT auth -- do not conflate the two. Ed25519 ensures config integrity (the agent verifies the server signature). JWT ensures identity (the server verifies agent identity)
- Store agent public keys server-side so you can revoke individual agents

**Phase relevance:** Security implementation (later phase). Design the token lifecycle model early even if implementation comes later.
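The per-agent token-versioning alternative to a denylist can be sketched without any JWT library (the `TokenVersions` class is illustrative; the version would be embedded as a custom claim and checked alongside signature and expiry validation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Per-agent token versioning: bumping the version invalidates every JWT
 * minted before the bump, without tracking individual token ids.
 */
final class TokenVersions {
    private final Map<String, Long> current = new ConcurrentHashMap<>();

    /** Embed this value in the JWT claims when issuing a token for the agent. */
    long versionFor(String agentId) {
        return current.computeIfAbsent(agentId, id -> 1L);
    }

    /** Revoke all outstanding tokens for one agent (e.g. suspected compromise). */
    void revokeAll(String agentId) {
        current.merge(agentId, 2L, (old, unused) -> old + 1);
    }

    /** Check on every request, alongside signature and expiry validation. */
    boolean isCurrent(String agentId, long tokenVersion) {
        return tokenVersion == versionFor(agentId);
    }
}
```

Compared with a denylist, this stores one long per agent rather than one entry per revoked token, at the cost of revoking all of an agent's tokens at once -- which is usually what you want when a credential is compromised.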
---
### Pitfall 13: Schema Evolution Without a Migration Strategy

**What goes wrong:** The agent protocol is "still evolving" (per project constraints). When the data model changes, existing data in ClickHouse becomes incompatible. ClickHouse does not support `ALTER TABLE` for changing ORDER BY, and column type changes are limited.

**Prevention:**
- Version your schema explicitly (e.g., a `schema_version` column or table naming convention)
- For additive changes (new nullable columns): `ALTER TABLE ADD COLUMN` works fine in ClickHouse
- For breaking changes (ORDER BY change, column type change): create a new table with the new schema and use a materialized view to transform data from old tables, or accept that old data stays in the old format and queries span both tables
- Design the ingestion layer to normalize incoming data to the current schema version, handling backward compatibility with older agents
- Include a `protocol_version` field in agent registration so the server knows what format to expect

**Phase relevance:** Must be considered in Phase 1 data model design. The migration strategy becomes critical as soon as you have production data.

---
## Minor Pitfalls
### Pitfall 14: Overloading Agents via SSE Event Storms

**What goes wrong:** A bulk config change pushes 50 events in rapid succession to all agents. Agents process events synchronously and stall their main Camel routes while handling config updates.

**Prevention:**
- Batch config changes into a single SSE event containing the full updated config
- Rate-limit SSE event emission (no more than 1 config event per second per agent)
- Agents should process SSE events asynchronously on a separate thread from the Camel context

**Phase relevance:** SSE implementation (Phase 2).

---
### Pitfall 15: Not Monitoring the Monitoring System

**What goes wrong:** The observability server itself has no observability. When it degrades, nobody knows until agents report failures or users complain about missing data.

**Prevention:**
- Expose Prometheus/Micrometer metrics for: ingestion rate, batch flush latency, ClickHouse insert latency, SSE active connections, queue depth, error rates
- Add a `/health` endpoint that checks ClickHouse connectivity and queue depth
- Alert on: ingestion rate dropping below the expected baseline, queue depth exceeding a threshold, ClickHouse insert errors

**Phase relevance:** Should be layered in alongside each component as it is built, not deferred to a "monitoring phase."

---
### Pitfall 16: Route Diagram Versioning Without Content Hashing

**What goes wrong:** Storing a new diagram version every time an agent reports, even if the diagram has not changed. With 50 agents reporting the same routes, you get 50 copies of identical diagrams.

**Prevention:**
- Content-hash each diagram definition (SHA-256 of the normalized diagram content)
- Store diagrams keyed by content hash. If the hash already exists, skip the insert
- Link activities to diagrams via the content hash, not a sequential version number
- This deduplicates across agents running the same routes and across deployments where routes did not change

**Phase relevance:** Diagram storage design (Phase 2 or whenever diagrams are implemented).
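Hash-keyed deduplication is a few lines of JDK code. This sketch (`DiagramStore` and its whitespace-collapsing normalization are illustrative assumptions; the real normalization should be defined by the diagram format) uses an in-memory map where production would use the diagram table:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Deduplicates route diagrams by content hash across agents and deployments. */
final class DiagramStore {
    private final Map<String, String> byHash = new ConcurrentHashMap<>();

    /** Normalize first so formatting noise does not defeat deduplication. */
    static String hash(String diagram) {
        String normalized = diagram.strip().replaceAll("\\s+", " ");
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(normalized.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // SHA-256 is always available
        }
    }

    /** Returns the hash to link from activities; inserts only if unseen. */
    String store(String diagram) {
        String h = hash(diagram);
        byHash.putIfAbsent(h, diagram);
        return h;
    }

    int uniqueCount() { return byHash.size(); }
}
```

Activities then reference the returned hash, so 50 agents reporting the same route all point at one stored diagram.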
---
## Phase-Specific Warnings

| Phase Topic | Likely Pitfall | Mitigation |
|-------------|----------------|------------|
| Storage / ClickHouse setup | Row-by-row inserts (Pitfall 1), wrong ORDER BY (Pitfall 2), TTL fragmentation (Pitfall 11) | Design batch ingestion and ORDER BY before writing any code. Prototype with realistic volume. |
| Ingestion endpoint | No backpressure (Pitfall 4), crash data loss (Pitfall 6) | Bounded queue + graceful shutdown from day one. |
| Full-text search | ClickHouse as text engine (Pitfall 5) | Start with skip indexes, design API to allow backend swap. |
| SSE implementation | Connection leaks (Pitfall 3), no reconnection handling (Pitfall 10), event storms (Pitfall 14) | Heartbeat + timeout + one-connection-per-agent from first implementation. |
| Data model | Timezone inconsistency (Pitfall 7), correlation ID design (Pitfall 8), schema evolution (Pitfall 13) | UTC everywhere, correlation ID in protocol spec, versioned schema. |
| Security | Token management (Pitfall 12) | Design token lifecycle early, implement in security phase. |
| API design | N+1 queries (Pitfall 9) | Co-locate activities with transactions in storage, cache diagrams. |
| Operations | No self-monitoring (Pitfall 15) | Add metrics alongside each component, not as a separate phase. |

---
## Sources

- ClickHouse documentation on the MergeTree engine, partitioning, and TTL (training data, MEDIUM confidence)
- Spring Framework SSE / SseEmitter documentation (training data, MEDIUM confidence)
- Production experience patterns from observability platforms (Jaeger, Zipkin, Grafana Tempo architecture docs) (training data, MEDIUM confidence)
- General distributed systems ingestion patterns (training data, MEDIUM confidence)

**Note:** WebSearch was unavailable during this research session. All findings are based on training data (cutoff May 2025). Confidence is MEDIUM across the board -- the patterns are well-established, but specific version details (e.g., ClickHouse TTL settings, Spring Boot 3.4.3 SSE behavior) should be verified against current documentation during implementation.