ClickHouse scaling plan for 1,000-10,000 tx/s production load #122

Open
opened 2026-04-03 22:59:05 +02:00 by claude · 0 comments
Owner

Context

At 3 tx/s testbed load, ClickHouse CPU was reduced from 407m to 130m by batching processor and log inserts (commit 633a61d). However, production loads are expected at 1,000-10,000 tx/s (~333-3,333x current), which will require further optimization.

Current State (3 tx/s baseline)

  • 12 inserts/min across all tables (one per 5s flush cycle)
  • 5 materialized views with quantileState(0.99) on every insert
  • 12 tables + 5 MV target tables = continuous background merge work
  • HTTP protocol with text serialization

Scaling Levers (in priority order)

1. Remove quantileState from MVs

  • Drop p99_duration column from all 5 stats MV target tables
  • Compute P99 at query time: quantile(0.99)(duration_ms) over raw data
  • At 1-minute bucket granularity, even high-volume query-time aggregations stay sub-second
  • Impact: Eliminates the most CPU-intensive aggregate function from the write path
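
A sketch of both halves of this change, assuming the table and column names used in this issue (`stats_1m_all`, `p99_duration`, `duration_ms`) and a hypothetical raw table named `requests`; note that each MV itself must also be recreated without the `quantileState(0.99)` expression before the target-table column can go:

```sql
-- 1) Drop the quantileState column from each stats MV target table
--    (repeat for all 5 target tables; the MV definition must be
--    recreated first so nothing still writes to the column):
ALTER TABLE stats_1m_all DROP COLUMN p99_duration;

-- 2) Compute P99 at query time over raw data instead:
SELECT toStartOfMinute(timestamp) AS bucket,
       quantile(0.99)(duration_ms) AS p99_duration
FROM requests
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY bucket
ORDER BY bucket;
```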

2. Consolidate 5 MVs into 2

  • Merge stats_1m_all + stats_1m_app + stats_1m_route into one MV carrying all dimensions (tenant, app, route); queries then GROUP BY / filter on whichever dimensions they need
  • Merge stats_1m_processor + stats_1m_processor_detail similarly
  • Impact: Reduces write amplification from 5 MV inserts to 2 per flush
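
A sketch of the consolidated MV. The combined target table name `stats_1m` and the raw-table columns are assumptions; adjust to the real schema:

```sql
-- One MV with all three dimensions replaces stats_1m_all/_app/_route.
CREATE MATERIALIZED VIEW stats_1m_mv TO stats_1m
AS SELECT
    toStartOfMinute(timestamp) AS bucket,
    tenant, app, route,
    count() AS requests,
    sum(duration_ms) AS total_duration_ms
FROM requests
GROUP BY bucket, tenant, app, route;

-- The old per-app rollup becomes a query-time aggregation:
SELECT bucket, app, sum(requests) AS requests
FROM stats_1m
GROUP BY bucket, app;
```

A SummingMergeTree target table keeps the rollup cheap, since rows sharing a (bucket, tenant, app, route) key collapse during merges.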

3. Switch to native TCP protocol

  • Change JDBC URL from jdbc:clickhouse://host:8123 to jdbc:ch://host:9000
  • Binary columnar format reduces serialization CPU on both Java and CH sides
  • Enables async_insert as safety net (native protocol separates query from data)
  • Driver already supports it (clickhouse-jdbc:0.9.7:all)
  • Impact: 1-4x throughput improvement per benchmarks
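
If `async_insert` is adopted as the safety net, it can be enabled per statement (or in a user settings profile) using standard ClickHouse settings; the table name `logs` here is illustrative:

```sql
-- async_insert batches small inserts server-side;
-- wait_for_async_insert = 0 trades the flush acknowledgment for lower latency.
INSERT INTO logs
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (...);
```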

4. Increase flush interval at scale

  • At high volume, 10-15s flush with larger batches reduces part creation
  • May need to increase buffer-capacity beyond 50,000
  • Impact: Fewer, larger parts = less merge work
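
Whether a longer flush interval is actually producing fewer, larger parts can be verified directly from system tables:

```sql
-- Active part counts and average part size per table;
-- expect active_parts to fall and avg_rows_per_part to rise after the change.
SELECT table,
       count() AS active_parts,
       round(avg(rows)) AS avg_rows_per_part
FROM system.parts
WHERE active
GROUP BY table
ORDER BY active_parts DESC;
```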

5. Consider ClickHouse Buffer engine

  • Buffer engine sits in front of MergeTree and auto-flushes
  • Eliminates application-side batching concerns entirely
  • Trade-off: data in buffer is lost on CH crash (acceptable for observability)
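
A minimal Buffer-engine sketch, again using a hypothetical `logs` MergeTree table. The application then inserts into `logs_buffer` instead of `logs`:

```sql
-- Buffer(database, destination_table, num_layers,
--        min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
-- Data flushes to `logs` when ANY max_* threshold is hit, or when all
-- min_* thresholds are met. Values below are the docs' example tuning.
CREATE TABLE logs_buffer AS logs
ENGINE = Buffer(currentDatabase(), logs, 16,
                10, 100, 10000, 1000000, 10000000, 100000000);
```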

Diagnostics

Useful queries for measuring impact at each step:

-- Insert frequency per table (last 5 min)
SELECT substring(query, position(query, 'INTO ') + 5, position(query, ' (') - position(query, 'INTO ') - 5) AS tbl,
       count() AS inserts, round(avg(written_rows),1) AS avg_rows
FROM system.query_log
WHERE event_time > now() - INTERVAL 5 MINUTE AND type = 'QueryFinish' AND query_kind = 'Insert'
GROUP BY tbl ORDER BY inserts DESC

-- Top CPU events
SELECT event, value FROM system.events
WHERE event LIKE '%Merge%' OR event LIKE '%Microseconds%'
ORDER BY value DESC LIMIT 10

Acceptance Criteria

  • ClickHouse CPU stays below 500m at 1,000 tx/s sustained
  • P99 latency still available in dashboard (query-time or pre-computed)
  • No data loss under normal operation