Verify processor stats dedup after deployment #128

Open
opened 2026-04-12 23:12:30 +02:00 by claude · 0 comments
Owner

Context

Commit e2f784b fixes inflated processor execution counts caused by duplicate inserts into the plain MergeTree processor_executions table. The fix replaces count() with uniq(execution_id) in both stats_1m_processor and stats_1m_processor_detail materialized views.

What to verify after deployment

  1. Counts are correct: In the dashboard process diagram tooltip, processors earlier in a route should have counts >= processors later in the same route. Check the try-catch-test route in quarkus-native-app specifically — process5 should have count >= log12.

  2. Backfill ran successfully: The init.sql drops and recreates both processor stats tables on startup, backfilling from processor_executions. Check server logs for ClickHouse schema initialization complete without errors.

  3. Startup time: The backfill scans the full processor_executions table (365-day TTL). Monitor startup duration — if it becomes a problem with large datasets, consider making the migration one-time (check column type before dropping).

  4. Duration metrics still accurate: avg_duration_ms now divides sumMerge(duration_sum) by uniqMerge(total_count). With duplicates, duration_sum is still inflated (sum of all rows including dupes) while the denominator is now deduplicated. This means avg duration may be slightly higher than reality for historically duplicated data. New data going forward is correct.

Verification query

-- Compare raw unique count vs what the MV reports
SELECT
    p.processor_id,
    uniqExact(p.execution_id) AS raw_unique,
    sum(tc) AS mv_total
FROM processor_executions p
LEFT JOIN (
    SELECT processor_id, uniqMerge(total_count) AS tc
    FROM stats_1m_processor_detail
    GROUP BY processor_id
) m ON p.processor_id = m.processor_id
WHERE p.route_id = 'try-catch-test'
GROUP BY p.processor_id
ORDER BY p.processor_id

raw_unique and mv_total should match.

Follow-up consideration

The duration_sum aggregate (sumState) still double-counts duplicate inserts. A full fix would require deduplicating processor_executions itself (e.g., ReplacingMergeTree keyed on execution_id + seq) or changing duration_sum to also use a dedup-aware aggregate. Low priority since avg/p99 are approximately correct and new data is clean.

## Context Commit `e2f784b` fixes inflated processor execution counts caused by duplicate inserts into the plain `MergeTree` `processor_executions` table. The fix replaces `count()` with `uniq(execution_id)` in both `stats_1m_processor` and `stats_1m_processor_detail` materialized views. ## What to verify after deployment 1. **Counts are correct**: In the dashboard process diagram tooltip, processors earlier in a route should have counts >= processors later in the same route. Check the `try-catch-test` route in `quarkus-native-app` specifically — `process5` should have count >= `log12`. 2. **Backfill ran successfully**: The init.sql drops and recreates both processor stats tables on startup, backfilling from `processor_executions`. Check server logs for `ClickHouse schema initialization complete` without errors. 3. **Startup time**: The backfill scans the full `processor_executions` table (365-day TTL). Monitor startup duration — if it becomes a problem with large datasets, consider making the migration one-time (check column type before dropping). 4. **Duration metrics still accurate**: `avg_duration_ms` now divides `sumMerge(duration_sum)` by `uniqMerge(total_count)`. With duplicates, `duration_sum` is still inflated (sum of all rows including dupes) while the denominator is now deduplicated. This means avg duration may be slightly higher than reality for historically duplicated data. New data going forward is correct. ## Verification query ```sql -- Compare raw unique count vs what the MV reports SELECT p.processor_id, uniqExact(p.execution_id) AS raw_unique, sum(tc) AS mv_total FROM processor_executions p LEFT JOIN ( SELECT processor_id, uniqMerge(total_count) AS tc FROM stats_1m_processor_detail GROUP BY processor_id ) m ON p.processor_id = m.processor_id WHERE p.route_id = 'try-catch-test' GROUP BY p.processor_id ORDER BY p.processor_id ``` `raw_unique` and `mv_total` should match. ## Follow-up consideration The `duration_sum` aggregate (`sumState`) still double-counts duplicate inserts. A full fix would require deduplicating `processor_executions` itself (e.g., `ReplacingMergeTree` keyed on `execution_id + seq`) or changing `duration_sum` to also use a dedup-aware aggregate. Low priority since avg/p99 are approximately correct and new data is clean.
Sign in to join this conversation.