[P1] Latency outlier investigation path #106

Open
opened 2026-04-01 22:53:17 +02:00 by claude · 2 comments
Owner

Parent Epic

#100

Problem

The KPI strip shows P99 latency of 321 seconds (5+ minutes) — alarming, but clicking it does nothing. The dashboard shows SLA breaches but provides no path to investigate which exchanges are causing the problem. Dashboards that surface problems but don't help investigate them are only half-useful.

Current State

  • KPI strip shows P99: 321s with an ↑ trend arrow — not clickable
  • Dashboard L2 shows SLA Compliance: 0.0% BREACH — no drill-down
  • Dashboard L3 has a "BOTTLENECK" card showing the slowest processor — good but not enough
  • No latency distribution histogram anywhere
  • No way to filter exchanges by "slower than X"
  • No outlier detection or grouping

Proposed Solution

1. Clickable KPI Strip

Each KPI in the strip should be interactive:

| KPI clicked | Action |
|-------------|--------|
| Total | Navigate to Exchanges tab (current scope) |
| Err% | Navigate to Exchanges tab with status=error filter |
| Avg | Open latency distribution popover (see below) |
| P99 | Open latency distribution popover, highlight P99 line |

2. Latency Distribution Histogram

When clicking Avg or P99, show a popover or inline expansion:

┌─────────────────────────────────────────────┐
│  Latency Distribution (last 1h)             │
│                                             │
│  ██████████████████████  < 100ms  (1,847)   │
│  ████████               100-500ms (156)     │
│  ███                    500ms-1s  (52)      │
│  █                      1-5s      (18)      │
│  ▏                      5-30s     (3)       │
│  ▏                      > 30s     (2)  ← P99│
│                                             │
│  [View slowest exchanges →]                 │
└─────────────────────────────────────────────┘

Clicking a bar filters the exchanges table to that duration range.

3. "Slowest Exchanges" Quick View

On Dashboard L2 (app level) and L3 (route level), add a "Slowest Exchanges" section:

┌─────────────────────────────────────────────────────┐
│  Slowest Exchanges (last 1h)           [View all →] │
│─────────────────────────────────────────────────────│
│  ● OK   route3   73A3...B3   22:31:41   321.4s     │
│  ● OK   route1   2219...17   22:31:39   241.0s     │
│  ● ERR  route2   006593..F7  22:31:40   266.0ms    │
│  ● OK   route3   73A3...B0   22:31:33   304.0ms    │
│  ● OK   route1   73A3...AE   22:31:32   379.0ms    │
└─────────────────────────────────────────────────────┘

4. Duration Filter on Exchanges Table

Add a duration filter to the exchanges page:

[Status ▾]  [Duration: > 1s ▾]  [Search...]

Preset options: > 100ms, > 500ms, > 1s, > 5s, > 30s, Custom

5. SLA Context on Dashboard Cards

The SLA Compliance card currently says "0.0% BREACH" but doesn't explain the threshold. Add:

SLA COMPLIANCE
0.0%   BREACH
Threshold: P99 < 300ms
8 of 12 routes breaching
[View breaching routes →]

Backend Requirements

  • New endpoint: GET /api/v1/stats/latency-distribution?app=X&route=Y&range=1h
    • Returns histogram buckets with counts
  • New endpoint or parameter: GET /api/v1/executions?minDuration=1000&sort=duration_desc
    • Filter exchanges by minimum duration
  • ClickHouse query: histogram over duration_ms column with configurable buckets

Acceptance Criteria

  • [ ] KPI strip values are clickable with contextual actions
  • [ ] Latency distribution histogram available on dashboard
  • [ ] Histogram bars are clickable to filter exchanges
  • [ ] "Slowest exchanges" section on Dashboard L2/L3
  • [ ] Duration filter on exchanges table (presets + custom)
  • [ ] SLA breach card shows threshold context and links to offending routes

Related

  • #65 (Format durations to user locale)
  • Dashboard L3 BOTTLENECK card already shows slowest processor — extend this pattern
claude added the feature, ux, pmf labels 2026-04-01 22:53:17 +02:00
Author
Owner

Design Specification

Clickable KPI Strip

| KPI Clicked | Action |
|-------------|--------|
| Total | Navigate to Exchanges tab (current scope) |
| Err% | Navigate to Exchanges with status=error |
| Avg | Open latency distribution popover |
| P99 | Open latency distribution popover, highlight P99 line |

Latency Distribution Histogram

Popover or inline expansion showing bar chart:

< 100ms   ██████████████████████  (1,847)
100-500ms ████████               (156)
500ms-1s  ███                    (52)
1-5s      █                     (18)
> 5s      ▏                     (5)

Click a bar → navigate to exchanges filtered by that duration range.

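Backend: New endpoint GET /api/v1/stats/latency-distribution?app=X&route=Y&range=1h. Returns histogram buckets with counts. ClickHouse query: countIf(duration_ms < 100), countIf(duration_ms >= 100 AND duration_ms < 500), ... Use half-open ranges: BETWEEN is inclusive on both ends, so a boundary value such as 500 would be counted in two adjacent buckets.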

Duration Filter on Exchanges

New filter in top bar: [Duration: > 1s ▾] with presets: > 100ms, > 500ms, > 1s, > 5s, > 30s, Custom. Maps to minDuration query param on search API.
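
As a rough sketch of how the presets could map to the minDuration parameter: the preset labels come from the list above, while the `DurationPreset` type, the `DURATION_PRESETS` table and the `minDurationQuery` helper are hypothetical names. Whether the API treats minDuration as inclusive or exclusive is not stated here, so the sketch simply passes the threshold in milliseconds.

```typescript
// Hypothetical preset table: labels are from the spec, the structure is assumed.
type DurationPreset = { label: string; minDurationMs: number };

const DURATION_PRESETS: readonly DurationPreset[] = [
  { label: "> 100ms", minDurationMs: 100 },
  { label: "> 500ms", minDurationMs: 500 },
  { label: "> 1s", minDurationMs: 1000 },
  { label: "> 5s", minDurationMs: 5000 },
  { label: "> 30s", minDurationMs: 30000 },
];

// Returns the query-string fragment for a preset label.
// An unknown or missing label means "no duration filter" and yields "".
function minDurationQuery(label: string | undefined): string {
  const preset = DURATION_PRESETS.find((p) => p.label === label);
  return preset ? `minDuration=${preset.minDurationMs}` : "";
}
```

The Custom option is left out on purpose, since it takes free-form input rather than a fixed threshold.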

Slowest Exchanges Widget

Dashboard L2/L3 section showing top 5 slowest exchanges:

Slowest Exchanges (last 1h)              [View all →]
● OK  route3  73A3...B3  22:31:41  321.4s
● OK  route1  2219...17  22:31:39  241.0s

Backend: GET /api/v1/executions?sort=duration_desc&limit=5.

SLA Card Enhancement

SLA COMPLIANCE
0.0%   BREACH
Threshold: P99 < 300ms
8 of 12 routes breaching
[View breaching routes →]
Author
Owner

Design Specification

1. Clickable KPI Strip

Requires a minor design system (DS) change: add onClick and cursor to the KpiItem interface in KpiStrip.tsx.

| KPI Clicked | Action (Dashboard L2) |
|-------------|-----------------------|
| Throughput | Navigate to /exchanges/{appId} |
| Success Rate | Navigate to /exchanges/{appId}?status=FAILED |
| P99 Latency | Open latency distribution histogram (inline overlay) |
| SLA Compliance | Scroll to Route Performance table, sort by SLA ascending |
| Error Velocity | Scroll to Top Errors section |
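
One possible way to keep the click behaviour testable is to resolve each KPI to a plain action value and let the dashboard's onClick handler interpret it. This is only a sketch: `KpiId`, `KpiAction` and `kpiAction` are hypothetical names, not part of the design system.

```typescript
// Hypothetical identifiers for the five KPIs in the Dashboard L2 strip.
type KpiId = "throughput" | "successRate" | "p99Latency" | "slaCompliance" | "errorVelocity";

// Hypothetical action type; the spec only says KpiItem gains an onClick.
type KpiAction =
  | { kind: "navigate"; to: string }
  | { kind: "openHistogram" }
  | { kind: "scrollTo"; section: "routePerformance" | "topErrors"; sortBy?: "slaAsc" };

// Maps a clicked KPI to the action listed in the table above.
function kpiAction(kpi: KpiId, appId: string): KpiAction {
  const base = `/exchanges/${encodeURIComponent(appId)}`;
  switch (kpi) {
    case "throughput":
      return { kind: "navigate", to: base };
    case "successRate":
      return { kind: "navigate", to: `${base}?status=FAILED` };
    case "p99Latency":
      return { kind: "openHistogram" };
    case "slaCompliance":
      return { kind: "scrollTo", section: "routePerformance", sortBy: "slaAsc" };
    case "errorVelocity":
      return { kind: "scrollTo", section: "topErrors" };
  }
}
```

A handler would then be wired as something like `onClick: () => run(kpiAction("p99Latency", appId))`, where `run` is whatever the dashboard uses to navigate, scroll, or toggle the histogram.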

2. Latency Distribution Histogram

Custom LatencyHistogram component (not DS BarChart — needs click handlers + threshold lines).

┌─ Latency Distribution ───────── 2,078 exchanges ────────────┐
│                                                               │
│  < 10ms    ████████████████████████████████  1,421 (68%)      │
│  10-50ms   ██████████████                     387 (19%)       │
│  50-100ms  ████████                           156  (8%)       │
│  100-500ms ███                                 72  (3%)       │
│  0.5-1s    █                                   24  (1%)       │
│  1-5s      ▏                                   12  (<1%)      │
│  5-30s     ▏                                    4  (<1%)      │
│  > 30s     ▏                                    2  (<1%)      │
│            ··················|···· P99 ────── ▲               │
│                                                               │
│  ─── P99: 4,210ms  ─── Avg: 48ms  ─── SLA: 300ms            │
│  Click a bar to filter exchanges by duration range            │
│  [View slowest exchanges →]                                   │
└───────────────────────────────────────────────────────────────┘

Bar colors relative to SLA: below=var(--success), crossing=var(--amber), above=var(--error).
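
The colour rule can be read as a comparison of each bucket's range against the SLA threshold. The sketch below assumes that "crossing" means the bucket's range contains the threshold, and that a negative maxMs marks the open-ended top bucket (mirroring the -1 in the bucket query); `BucketRange` and `barColor` are hypothetical names.

```typescript
// Bucket range in milliseconds. Assumption: maxMs < 0 means "no upper bound".
interface BucketRange {
  minMs: number;
  maxMs: number;
}

// Picks the CSS variable for a histogram bar relative to the SLA threshold.
function barColor(bucket: BucketRange, slaMs: number): string {
  const upper = bucket.maxMs < 0 ? Infinity : bucket.maxMs;
  if (upper <= slaMs) return "var(--success)"; // whole bucket is within the SLA
  if (bucket.minMs >= slaMs) return "var(--error)"; // whole bucket breaches the SLA
  return "var(--amber)"; // bucket straddles the threshold
}
```

With the 300ms SLA from the mock-up, the 100-500ms bar would be amber, everything below it green, and everything above it red.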

Rendered as inline overlay below KPI strip (not DS Popover — avoids trigger-wrapping issues). CSS: position: relative; z-index: 10; background: var(--bg-surface); border: 1px solid var(--border-subtle); border-radius: var(--radius-lg); box-shadow: var(--shadow-card); animation: slideDown 0.15s ease-out;

New backend endpoint: GET /api/v1/search/stats/latency-distribution?from=...&to=...&application=...&routeId=...

ClickHouse query using multiIf() for bucket assignment:

SELECT multiIf(duration_ms<10,0, duration_ms<50,10, ..., 30000) AS bucket_min,
       multiIf(duration_ms<10,10, duration_ms<50,50, ..., -1) AS bucket_max,
       count() AS cnt
FROM executions WHERE start_time >= ? AND start_time < ? [AND app=? AND route=?]
GROUP BY bucket_min, bucket_max ORDER BY bucket_min

Response: { buckets: [{minMs, maxMs, count, percentage}], totalCount, avgMs, p99Ms }
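
For UI tests or mock data, the same bucket assignment can be reproduced client-side. This is a reference sketch, not the backend implementation: the field names follow the response shape above, the bucket bounds are read off the histogram mock-up (the query elides them), and the nearest-rank P99 is an assumption about how p99Ms is computed. Unlike a GROUP BY, it also returns empty buckets.

```typescript
interface LatencyBucket {
  minMs: number;
  maxMs: number; // -1 marks the open-ended top bucket, as in the query
  count: number;
  percentage: number;
}

interface LatencyDistribution {
  buckets: LatencyBucket[];
  totalCount: number;
  avgMs: number;
  p99Ms: number;
}

// Exclusive upper bounds, assumed from the mock-up labels (< 10ms ... > 30s).
const BOUNDS_MS = [10, 50, 100, 500, 1000, 5000, 30000];

function latencyDistribution(durationsMs: number[]): LatencyDistribution {
  const counts = new Array<number>(BOUNDS_MS.length + 1).fill(0);
  for (const d of durationsMs) {
    // First bound that exceeds the duration; none means the open-ended bucket.
    const idx = BOUNDS_MS.findIndex((b) => d < b);
    counts[idx === -1 ? BOUNDS_MS.length : idx] += 1;
  }
  const total = durationsMs.length;
  const sorted = [...durationsMs].sort((a, b) => a - b);
  // Nearest-rank percentile (assumed definition).
  const p99Ms = total === 0 ? 0 : sorted[Math.ceil(0.99 * total) - 1];
  const avgMs = total === 0 ? 0 : sorted.reduce((sum, d) => sum + d, 0) / total;
  const buckets = counts.map((count, i) => ({
    minMs: i === 0 ? 0 : BOUNDS_MS[i - 1],
    maxMs: i < BOUNDS_MS.length ? BOUNDS_MS[i] : -1,
    count,
    percentage: total === 0 ? 0 : (100 * count) / total,
  }));
  return { buckets, totalCount: total, avgMs, p99Ms };
}
```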

3. Histogram → Exchange Filter

Click bar → navigate to /exchanges/{app}/{route}?durationMin={min}&durationMax={max}.

Backend already supports durationMin/durationMax in SearchRequest — no backend changes needed for this part.
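
A minimal sketch of the bar-click navigation target, assuming the open-ended top bucket is signalled by a negative max and should therefore omit durationMax. `exchangesUrlForBucket` is a hypothetical helper; whether the backend treats durationMax as inclusive is not stated here and may need checking against the half-open buckets.

```typescript
// Builds the exchanges URL for a clicked histogram bar.
function exchangesUrlForBucket(app: string, route: string, minMs: number, maxMs: number): string {
  const params: string[] = [`durationMin=${minMs}`];
  // Assumption: maxMs < 0 means "no upper bound", so no durationMax is sent.
  if (maxMs >= 0) params.push(`durationMax=${maxMs}`);
  return `/exchanges/${encodeURIComponent(app)}/${encodeURIComponent(route)}?${params.join("&")}`;
}
```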

4. Duration Filter on Exchanges Table

Dropdown in table header: [Duration: All ▾] with presets: > 100ms, > 500ms, > 1s, > 5s, > 30s, Custom.

Custom shows inline inputs: [Min: ___ms] [Max: ___ms] [Apply].

Syncs with URL params durationMin/durationMax bidirectionally. Active filter shown as amber Badge: > 1s.
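
The bidirectional sync could be built from two small pure functions, one reading the filter from the query string and one writing it back while leaving other parameters untouched. The sketch uses the standard URLSearchParams API; `DurationFilter`, `readDurationFilter` and `writeDurationFilter` are hypothetical names.

```typescript
interface DurationFilter {
  minMs: number | undefined;
  maxMs: number | undefined;
}

// URL -> filter state. Missing, empty, negative or non-numeric values are ignored.
function readDurationFilter(search: string): DurationFilter {
  const params = new URLSearchParams(search);
  const parse = (key: string): number | undefined => {
    const raw = params.get(key);
    if (raw === null || raw.trim() === "") return undefined;
    const n = Number(raw);
    return Number.isFinite(n) && n >= 0 ? n : undefined;
  };
  return { minMs: parse("durationMin"), maxMs: parse("durationMax") };
}

// Filter state -> URL. Other query parameters (for example status) are preserved.
function writeDurationFilter(search: string, filter: DurationFilter): string {
  const params = new URLSearchParams(search);
  params.delete("durationMin");
  params.delete("durationMax");
  if (filter.minMs !== undefined) params.set("durationMin", String(filter.minMs));
  if (filter.maxMs !== undefined) params.set("durationMax", String(filter.maxMs));
  return params.toString();
}
```

Keeping both directions pure makes it straightforward to call them from whatever router hook the page already uses.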

5. Slowest Exchanges Widget

Dashboard L2/L3 section showing top 5 by duration:

┌─ Slowest Exchanges ────────────────────── [View all →] ─┐
│ ● OK   route3   ...000000B3  22:31:41  321.4s     ████  │
│ ● OK   route1   ...00000017  22:31:39  241.0s     ███   │
│ ● ERR  route2   ...000000F7  22:31:40  266ms      █     │
└──────────────────────────────────────────────────────────┘

Data source: existing useSearchExecutions with sortField: 'durationMs', sortDir: 'desc', limit: 5. No new endpoint needed.
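
Two rendering helpers the widget is likely to need are sketched below, assuming rows arrive already sorted by the server: a duration formatter matching the mock-up ("266ms" versus "321.4s") and bar widths relative to the slowest row. Both function names are hypothetical, and locale-aware formatting is out of scope here.

```typescript
// Formats a duration in ms: whole milliseconds below one second, otherwise
// seconds with one decimal. Rounding first avoids output like "1000ms".
function formatDuration(ms: number): string {
  const rounded = Math.round(ms);
  return rounded < 1000 ? `${rounded}ms` : `${(ms / 1000).toFixed(1)}s`;
}

// Bar widths as fractions (0..1) of the longest duration in the list.
function barFractions(durationsMs: number[]): number[] {
  const max = Math.max(0, ...durationsMs);
  return durationsMs.map((d) => (max === 0 ? 0 : d / max));
}
```

On a linear scale a 266ms row next to a 321.4s row is effectively invisible, so the component may want a minimum bar width.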

"View all" links to /exchanges/{app}?sort=durationMs&dir=desc.

6. SLA Card Enhancement

Current: 0.0% BREACH with threshold shown.

Enhanced:

SLA COMPLIANCE
0.0%   BREACH
Threshold: P99 < 300ms
8 of 12 routes breaching
[View breaching routes →]

Breach count computed from existing routeRows: routeRows.filter(r => r.slaCompliance < 99.0).length. Click scrolls to Route Performance table sorted by SLA ascending.
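
The breach line on the card could be derived as in the sketch below. It reuses the filter expression above; the `RouteRow` shape (only the fields used here) and the `breachSummary` helper are assumptions, and slaCompliance is taken to be a percentage between 0 and 100.

```typescript
// Minimal row shape assumed for this sketch; the real routeRows carry more fields.
interface RouteRow {
  routeId: string;
  slaCompliance: number; // percentage, 0..100
}

// Produces the card text, e.g. "8 of 12 routes breaching".
// A route exactly at the target (99.0) is not counted as breaching.
function breachSummary(routeRows: RouteRow[], targetCompliance = 99.0): string {
  const breaching = routeRows.filter((r) => r.slaCompliance < targetCompliance).length;
  return `${breaching} of ${routeRows.length} routes breaching`;
}
```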

Files

New:

  • ui/src/components/LatencyHistogram/LatencyHistogram.tsx + .module.css
  • LatencyDistribution.java, LatencyBucket.java (records)

Modified:

  • @cameleer/design-system KpiStrip.tsx — add onClick to KpiItem (requires DS release)
  • DashboardL1.tsx, DashboardL2.tsx, DashboardL3.tsx — KPI click handlers, histogram state, slowest widget, SLA breach count
  • Dashboard.tsx — duration filter dropdown, read durationMin/durationMax from URL
  • dashboard.ts (queries) — add useLatencyDistribution hook
  • StatsStore.java — add latencyDistribution() method
  • ClickHouseStatsStore.java — histogram query implementation
  • SearchController.java: /stats/latency-distribution endpoint