[TransactionService] -- looks up diagram version active at transaction timestamp
|
v
[DiagramService.getVersionAt(routeId, instant)]
|
v
[DiagramRenderer.render(definition)] -- produces SVG/PNG for display
```
## Patterns to Follow
### Pattern 1: Bounded Write Buffer with Backpressure
**What:** In-memory queue between ingestion endpoint and storage writes. Bounded size. When full, return HTTP 429 to agents so they back off and retry.
**When:** Always -- this is the critical buffer between high-throughput ingestion and batch-oriented database writes.
**Why:** ClickHouse performs best with large batch inserts (thousands of rows). Individual inserts per HTTP request would destroy write performance. The buffer decouples ingestion rate from write rate.
**Example:**
```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

public class WriteBuffer<T> {
    private final BlockingQueue<T> queue;
    private final int batchSize;
    private final Duration maxFlushInterval;
    private final Consumer<List<T>> flushAction;

    public boolean offer(T item) {
        // Returns false when the queue is full -> caller responds with HTTP 429
        return queue.offer(item);
    }
    // Constructor and background flush loop elided.
}
```
**Implementation detail:** Use `ArrayBlockingQueue` with a capacity that matches your memory budget. At ~2KB per transaction record and 10,000 capacity, that is ~20MB -- well within bounds.
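The consuming side of the buffer can be sketched as follows. This is a minimal sketch, not the project's implementation: `FlushLoop` is an illustrative name, and it assumes the same `batchSize`/`maxFlushInterval`/`flushAction` fields described in the `WriteBuffer` snippet above.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch of the flush side: wait up to maxFlushInterval for the first item,
// then drain whatever else is already queued, capped at batchSize. This gives
// the "batch size reached OR interval elapsed" trigger in a single loop body.
public class FlushLoop<T> {
    private final BlockingQueue<T> queue;
    private final int batchSize;
    private final Duration maxFlushInterval;
    private final Consumer<List<T>> flushAction;

    public FlushLoop(BlockingQueue<T> queue, int batchSize,
                     Duration maxFlushInterval, Consumer<List<T>> flushAction) {
        this.queue = queue;
        this.batchSize = batchSize;
        this.maxFlushInterval = maxFlushInterval;
        this.flushAction = flushAction;
    }

    // One flush cycle; run repeatedly on a dedicated writer thread.
    public void flushOnce() throws InterruptedException {
        List<T> batch = new ArrayList<>(batchSize);
        T first = queue.poll(maxFlushInterval.toMillis(), TimeUnit.MILLISECONDS);
        if (first == null) return; // interval elapsed with nothing to write
        batch.add(first);
        queue.drainTo(batch, batchSize - 1); // non-blocking: takes what is there
        flushAction.accept(batch);
    }
}
```

Because `drainTo` never blocks, a partially filled batch still flushes as soon as the interval elapses, which keeps ingestion-to-queryable latency bounded.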
### Pattern 2: Repository Abstraction over ClickHouse
**What:** Define storage interfaces in core module, implement with ClickHouse JDBC in app module. Core never imports ClickHouse driver directly.
**When:** Always -- this is the key module boundary principle.
**Why:** Keeps core testable without a database. Allows swapping storage in tests (in-memory) and theoretically in production. More importantly, it enforces that domain logic does not leak storage concerns.
**Example:**
```java
public class ClickHouseTransactionRepository implements TransactionRepository {
    private final JdbcTemplate jdbc;
    // ClickHouse-specific SQL, batch inserts, etc.
}
```
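The core-module side of this boundary might look like the sketch below. The method names and the `TransactionRecord` fields are illustrative assumptions, not the project's actual API; the point is that the interface mentions only domain types and JDK classes, never the ClickHouse driver.

```java
import java.time.Instant;
import java.util.List;

// Lives in the core module: plain Java, no ClickHouse or Spring imports.
// The app module provides the ClickHouse-backed implementation; tests can
// provide an in-memory one.
interface TransactionRepository {
    void saveBatch(List<TransactionRecord> batch);
    List<TransactionRecord> findByTimeRange(Instant from, Instant to);
}

// Illustrative domain type; the real record presumably carries more fields.
record TransactionRecord(String id, Instant executionTime, String state) {}
```

An in-memory implementation of this interface is all that unit tests in core need, which is exactly the testability argument above.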
### Pattern 3: SseEmitter Registry with Heartbeat
**What:** Maintain a concurrent map of agent ID to SseEmitter. Send periodic heartbeat events. Remove on timeout, error, or completion.
**When:** For all SSE connections.
**Why:** SSE connections are long-lived. Without heartbeat, you cannot distinguish between a healthy idle connection and a silently dropped one. The registry is the source of truth for which agents are reachable.
**Example:**
```java
public class SseChannelManager {
    private final ConcurrentHashMap<String, SseEmitter> emitters = new ConcurrentHashMap<>();

    public SseEmitter register(String agentId) {
        SseEmitter emitter = new SseEmitter(Long.MAX_VALUE); // no framework timeout
        emitter.onCompletion(() -> emitters.remove(agentId));
        emitter.onTimeout(() -> emitters.remove(agentId));
        emitter.onError(e -> emitters.remove(agentId));
        emitters.put(agentId, emitter);
        return emitter;
    }
}
```
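The heartbeat side can be sketched independently of Spring. In this sketch the actual event delivery is injected as a callback so the registry-pruning logic stays testable; `HeartbeatPinger` and the callback shape are illustrative assumptions, and in practice `pingAll` would be driven by a `ScheduledExecutorService` (e.g. every 15 seconds).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

// Periodically pings every registered emitter. A send that throws means the
// connection is dead, so the agent is dropped from the registry -- this is
// how a silently dropped connection is distinguished from a healthy idle one.
public class HeartbeatPinger<E> {
    private final ConcurrentHashMap<String, E> emitters;
    private final BiConsumer<String, E> send; // sends one heartbeat event; throws if the connection is dead

    public HeartbeatPinger(ConcurrentHashMap<String, E> emitters, BiConsumer<String, E> send) {
        this.emitters = emitters;
        this.send = send;
    }

    // One heartbeat pass; schedule at a fixed rate.
    public void pingAll() {
        emitters.forEach((agentId, emitter) -> {
            try {
                send.accept(agentId, emitter);
            } catch (RuntimeException e) {
                emitters.remove(agentId); // dead connection: prune from registry
            }
        });
    }
}
```

Removal during `forEach` is safe here because `ConcurrentHashMap` iteration tolerates concurrent modification.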
### Pattern 4: Content-Addressed Diagram Storage
**What:** Hash diagram definitions. Store each unique definition once. Link transactions to the definition hash + a version timestamp.
**When:** For diagram storage.
**Why:** Many transactions reference the same diagram version. Content-addressing deduplicates storage. A separate version table maps (routeId, timestamp) to content hash, enabling "what diagram was active at time T?" queries.
**Example:**
```sql
-- Version history (which definition was active when)
CREATE TABLE diagram_versions (
    route_id String,
    active_from DateTime64(3),
    content_hash String
) ENGINE = MergeTree()
ORDER BY (route_id, active_from);
```
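The "what diagram was active at time T?" query against this table might look like the following sketch. The ClickHouse named-parameter syntax is real; the parameter names are illustrative.

```sql
-- Latest version whose active_from is at or before the requested instant.
SELECT content_hash
FROM diagram_versions
WHERE route_id = {route:String}
  AND active_from <= {at:DateTime64(3)}
ORDER BY active_from DESC
LIMIT 1;
```

Because the table is ordered by `(route_id, active_from)`, this lookup reads only a narrow range of the primary index.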
### Pattern 5: Cursor-Based Pagination for Time-Series Data
**What:** Use cursor-based pagination (keyset pagination) instead of OFFSET/LIMIT for transaction listing.
**When:** For all list/search endpoints returning time-ordered transaction data.
**Why:** OFFSET-based pagination degrades as the offset grows -- ClickHouse must scan and skip all preceding rows. Cursor-based pagination compares against the last row seen -- `(timestamp, id) > (last_seen_timestamp, last_seen_id)` when paginating oldest-first, `<` when paginating newest-first -- giving constant-time page fetches regardless of how deep you paginate.
**Example:**
```java
public record PageCursor(Instant timestamp, String id) {}
// Query: WHERE (timestamp, id) < (:cursorTs, :cursorId) ORDER BY timestamp DESC, id DESC LIMIT :size
```
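The record above could be extended with an opaque token encoding so clients never parse or construct cursor internals. The `epochMillis:id` layout and Base64 scheme below are an illustrative choice, not prescribed by this document.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Base64;

// Cursor with an opaque wire format: "epochMillis:id", URL-safe Base64 encoded.
// Clients echo the token back verbatim; only the server interprets it.
public record PageCursor(Instant timestamp, String id) {
    public String encode() {
        String raw = timestamp.toEpochMilli() + ":" + id;
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    public static PageCursor decode(String token) {
        String raw = new String(Base64.getUrlDecoder().decode(token), StandardCharsets.UTF_8);
        int sep = raw.indexOf(':');
        return new PageCursor(
                Instant.ofEpochMilli(Long.parseLong(raw.substring(0, sep))),
                raw.substring(sep + 1));
    }
}
```

The server builds the next-page token from the last row of the current page; an absent token means "start from the newest row".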
## Anti-Patterns to Avoid
### Anti-Pattern 1: Individual Row Inserts to ClickHouse
**What:** Inserting one transaction per HTTP request directly to ClickHouse.
**Why bad:** ClickHouse is designed for bulk inserts. Individual inserts create excessive parts in MergeTree tables, causing merge pressure and degraded read performance. At 50+ agents posting concurrently, this would quickly become a bottleneck.
**Instead:** Buffer in memory, flush in batches of 1,000-10,000 rows per insert.
### Anti-Pattern 2: Storing Rendered Diagrams in ClickHouse BLOBs
**What:** Putting SVG/PNG binary data directly in the main ClickHouse tables alongside transaction data.
**Why bad:** ClickHouse is columnar and optimized for analytical queries. Large binary data in columns degrades compression ratios and query performance for all queries touching that table.
**Instead:** Store rendered output in filesystem or object storage. Store only the content hash reference in ClickHouse. Or use a separate ClickHouse table with the rendered content that is rarely queried alongside transaction data.
### Anti-Pattern 3: Blocking SSE Writes on the Request Thread
**What:** Sending SSE events synchronously from the thread handling a config update request.
**Why bad:** If an agent's connection is slow or dead, the config update request blocks. With 50+ agents, this creates cascading latency.
**Instead:** Send SSE events asynchronously. Use a thread pool or virtual threads (Java 21+) to handle SSE writes. Return success to the config updater immediately, handle delivery failures in the background.
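A sketch of the asynchronous fan-out, with delivery injected as a callback so the dispatch logic is independent of Spring. `AsyncEventFanout` and the callback shape are illustrative assumptions; on Java 21+ the pool could be `Executors.newVirtualThreadPerTaskExecutor()`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.function.BiConsumer;

// Fans a config-update event out to per-agent tasks, so a slow or dead
// connection stalls only its own task, never the thread that accepted the
// config change.
public class AsyncEventFanout<E> {
    private final ExecutorService pool; // e.g. virtual threads on Java 21+
    private final ConcurrentHashMap<String, E> emitters;
    private final BiConsumer<E, String> send; // throws on delivery failure

    public AsyncEventFanout(ExecutorService pool, ConcurrentHashMap<String, E> emitters,
                            BiConsumer<E, String> send) {
        this.pool = pool;
        this.emitters = emitters;
        this.send = send;
    }

    // Returns immediately; failed deliveries drop the agent in the background.
    public void broadcast(String event) {
        emitters.forEach((agentId, emitter) -> pool.submit(() -> {
            try {
                send.accept(emitter, event);
            } catch (RuntimeException e) {
                emitters.remove(agentId); // dead connection: prune the registry
            }
        }));
    }
}
```

The config-update endpoint calls `broadcast` and returns success without waiting for any individual delivery.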
### Anti-Pattern 4: Fat Core Module with Spring Dependencies
**What:** Adding Spring annotations (@Service, @Repository, @Autowired) throughout the core module.
**Why bad:** Couples domain logic to Spring. Makes unit testing harder. Violates the purpose of the core/app split.
**Instead:** Core module defines plain Java interfaces and classes. App module wires them with Spring. Core can use `@Scheduled` or similar only if Spring is already a dependency; otherwise, keep scheduling in app.
### Batching Parameters
- Target: 5,000-10,000 rows per ClickHouse INSERT.
- Flush interval: 1-2 seconds (configurable).
- Flush triggers: whichever comes first -- batch size reached OR interval elapsed.
## Storage Architecture
### Write Path (ClickHouse)
ClickHouse excels at:
- Columnar compression (10:1 or better for structured transaction data)
- Time-partitioned tables with automatic TTL-based expiry (30-day retention)
- Massive batch INSERT throughput
- Analytical queries over time ranges
**Table design principles:**
- Partition by month: `PARTITION BY toYYYYMM(execution_time)`
- Order by query pattern: `ORDER BY (execution_time, transaction_id)` for time-range scans
- TTL: `TTL execution_time + INTERVAL 30 DAY`
- Use `LowCardinality(String)` for state, agent_id, route_id columns
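Putting these principles together, the main table's DDL might look like the following sketch. The column list is an illustrative minimum, not the actual schema; note the `toDateTime` conversion in the TTL clause, since TTL expressions want `Date`/`DateTime` rather than `DateTime64`.

```sql
CREATE TABLE transactions (
    transaction_id String,
    execution_time DateTime64(3),
    agent_id LowCardinality(String),
    route_id LowCardinality(String),
    state LowCardinality(String)
    -- payload columns elided
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(execution_time)
ORDER BY (execution_time, transaction_id)
TTL toDateTime(execution_time) + INTERVAL 30 DAY;
```

Monthly partitions mean expired data is dropped a partition at a time rather than row by row, which keeps the 30-day retention cheap.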
### Full-Text Search
Two viable approaches:
**Option A: ClickHouse built-in full-text index (recommended for simplicity)**
- ClickHouse supports `tokenbf_v1` and `ngrambf_v1` bloom filter indexes
- Not as powerful as Elasticsearch/Lucene but avoids a separate system
- Good enough for "find transactions containing this string" queries
- Add a `search_text` column that concatenates searchable fields
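A sketch of wiring this up, assuming the `transactions` table name used elsewhere in this document. The `tokenbf_v1(bloom_filter_bytes, hash_functions, seed)` parameters and the granularity are starting-point guesses to be tuned against real data.

```sql
ALTER TABLE transactions
    ADD COLUMN search_text String,
    ADD INDEX idx_search_text search_text TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;
```

The bloom filter is a skip index: it lets ClickHouse rule out granules that cannot contain a token, and rows in the surviving granules are then matched exactly.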
**Option B: External search index (Elasticsearch/OpenSearch)**
- More powerful: fuzzy matching, relevance scoring, complex text analysis
- Additional infrastructure to manage
- Only justified if full-text search quality is a key differentiator
**Recommendation:** Start with ClickHouse bloom filter indexes. The query pattern described (incident-driven, searching by known strings like correlation IDs or error messages) does not require Lucene-level text analysis. If users need fuzzy/ranked search later, add an external index as a separate phase.
### Read Path
- Structured queries go directly to ClickHouse SQL.
- Full-text queries use the bloom filter index for pre-filtering, then exact match.
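A representative full-text read might look like the sketch below; the table and column names follow the examples above, and the correlation-ID literal is of course illustrative.

```sql
-- hasToken can be served by the tokenbf_v1 skip index as a pre-filter;
-- only granules that may contain the token are read and matched exactly.
SELECT transaction_id, execution_time, state
FROM transactions
WHERE execution_time >= now() - INTERVAL 1 DAY
  AND hasToken(search_text, 'a1b2c3d4')
ORDER BY execution_time DESC
LIMIT 100;
```

Constraining the time range first keeps even a search-index miss bounded to a day's partitions rather than the full 30-day retention window.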