[P1] Latency outlier investigation path #106
Parent Epic
#100
Problem
The KPI strip shows P99 latency of 321 seconds (5+ minutes) — alarming, but clicking it does nothing. The dashboard shows SLA breaches but provides no path to investigate which exchanges are causing the problem. Dashboards that surface problems but don't help investigate them are only half-useful.
Current State
Proposed Solution
1. Clickable KPI Strip
Each KPI in the strip should be interactive:
status=error filter
2. Latency Distribution Histogram
When clicking Avg or P99, show a popover or inline expansion:
Clicking a bar filters the exchanges table to that duration range.
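A minimal sketch of how durations could be grouped into histogram bars, for illustration only. The field names follow the response shape given in the design specification further down (`minMs`, `maxMs`, `count`, `percentage`); the bucket boundaries and the function name `bucketDurations` are assumptions, since the issue says the buckets should be configurable and the real grouping is meant to happen in the backend query.

```typescript
// Hypothetical bucket shape, mirroring the {minMs, maxMs, count, percentage} fields.
interface LatencyBucket {
  minMs: number;
  maxMs: number | null; // null marks the open-ended last bucket
  count: number;
  percentage: number;
}

// Assumed boundaries in milliseconds, borrowed from the filter presets.
const BOUNDARIES_MS = [100, 500, 1000, 5000, 30000];

function bucketDurations(
  durationsMs: number[],
  boundaries: number[] = BOUNDARIES_MS,
): LatencyBucket[] {
  const edges = [0, ...boundaries];
  const buckets: LatencyBucket[] = edges.map((minMs, i) => ({
    minMs,
    maxMs: i + 1 < edges.length ? edges[i + 1] : null,
    count: 0,
    percentage: 0,
  }));

  for (const d of durationsMs) {
    // Pick the last bucket whose lower edge is <= d (buckets are half-open: [min, max)).
    let idx = 0;
    for (let i = 0; i < edges.length; i++) {
      if (d >= edges[i]) idx = i;
    }
    buckets[idx].count += 1;
  }

  const total = durationsMs.length;
  for (const b of buckets) {
    b.percentage = total === 0 ? 0 : (b.count / total) * 100;
  }
  return buckets;
}
```

Each returned bucket carries the range a click on its bar would filter by: `minMs` as the lower bound and `maxMs` (when not null) as the upper bound.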
3. "Slowest Exchanges" Quick View
On Dashboard L2 (app level) and L3 (route level), add a "Slowest Exchanges" section:
4. Duration Filter on Exchanges Table
Add a duration filter to the exchanges page:
Preset options: > 100ms, > 500ms, > 1s, > 5s, > 30s, Custom
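The preset labels can be turned into a millisecond threshold with a small helper. This is a sketch under assumptions: the helper names are hypothetical, and the query parameter name `minDuration` is taken from the backend requirements in this issue (the later design specification uses `durationMin` instead).

```typescript
// Preset labels from the issue; "Custom" is handled separately by the UI.
const DURATION_PRESETS = ["> 100ms", "> 500ms", "> 1s", "> 5s", "> 30s"] as const;
type DurationPreset = (typeof DURATION_PRESETS)[number];

// Convert a preset label such as "> 5s" to a minimum duration in milliseconds.
function presetToMinDurationMs(preset: DurationPreset): number {
  const match = /^>\s*(\d+)(ms|s)$/.exec(preset);
  if (!match) throw new Error(`Unrecognized preset: ${preset}`);
  const value = Number(match[1]);
  return match[2] === "s" ? value * 1000 : value;
}

// Build the query fragment for the search request (parameter name assumed).
function presetToQuery(preset: DurationPreset): string {
  return `minDuration=${presetToMinDurationMs(preset)}`;
}
```

Keeping the labels as the single source of truth means the dropdown and the query cannot drift apart when a preset is added or removed.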
5. SLA Context on Dashboard Cards
The SLA Compliance card currently says "0.0% BREACH" but doesn't explain the threshold. Add:
Backend Requirements
GET /api/v1/stats/latency-distribution?app=X&route=Y&range=1h
GET /api/v1/executions?minDuration=1000&sort=duration_desc
duration_ms column with configurable buckets
Acceptance Criteria
Related
Design Specification
Clickable KPI Strip
status=error
Latency Distribution Histogram
Popover or inline expansion showing bar chart:
Click a bar → navigate to exchanges filtered by that duration range.
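The bar-click navigation could look roughly like the sketch below. The route shape and the `durationMin`/`durationMax` parameter names come from the detailed design specification in this issue; the function name, the `BucketRange` type, and the choice to omit `durationMax` for an open-ended last bucket are assumptions.

```typescript
// Hypothetical range carried by a histogram bar; maxMs is null for the last, open-ended bar.
interface BucketRange {
  minMs: number;
  maxMs: number | null;
}

// Build the exchanges URL for a clicked bar. `route` is optional (app-level dashboards).
function exchangesUrlForBucket(
  app: string,
  route: string | null,
  bucket: BucketRange,
): string {
  const path = route
    ? `/exchanges/${encodeURIComponent(app)}/${encodeURIComponent(route)}`
    : `/exchanges/${encodeURIComponent(app)}`;
  const query = [`durationMin=${bucket.minMs}`];
  if (bucket.maxMs !== null) {
    query.push(`durationMax=${bucket.maxMs}`);
  }
  return `${path}?${query.join("&")}`;
}
```

For example, clicking the 1s to 5s bar for a hypothetical app `orders` and route `route-1` would yield `/exchanges/orders/route-1?durationMin=1000&durationMax=5000`.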
Backend: New endpoint
GET /api/v1/stats/latency-distribution?app=X&route=Y&range=1h. Returns histogram buckets with counts. ClickHouse query: countIf(duration_ms < 100), countIf(duration_ms BETWEEN 100 AND 500), ...
Duration Filter on Exchanges
New filter in top bar:
[Duration: > 1s ▾] with presets: > 100ms, > 500ms, > 1s, > 5s, > 30s, Custom. Maps to minDuration query param on search API.
Slowest Exchanges Widget
Dashboard L2/L3 section showing top 5 slowest exchanges:
Backend:
GET /api/v1/executions?sort=duration_desc&limit=5.
SLA Card Enhancement
Design Specification
1. Clickable KPI Strip
Requires minor DS change: add onClick and cursor to KpiItem interface in KpiStrip.tsx.
/exchanges/{appId}
/exchanges/{appId}?status=FAILED
2. Latency Distribution Histogram
Custom LatencyHistogram component (not DS BarChart; needs click handlers + threshold lines).
Bar colors relative to SLA: below=var(--success), crossing=var(--amber), above=var(--error).
Rendered as inline overlay below KPI strip (not DS Popover; avoids trigger-wrapping issues). CSS:
position: relative; z-index: 10; background: var(--bg-surface); border: 1px solid var(--border-subtle); border-radius: var(--radius-lg); box-shadow: var(--shadow-card); animation: slideDown 0.15s ease-out;
New backend endpoint:
GET /api/v1/search/stats/latency-distribution?from=...&to=...&application=...&routeId=...
ClickHouse query using multiIf() for bucket assignment:
Response:
{ buckets: [{minMs, maxMs, count, percentage}], totalCount, avgMs, p99Ms }
3. Histogram → Exchange Filter
Click bar → navigate to /exchanges/{app}/{route}?durationMin={min}&durationMax={max}.
Backend already supports durationMin/durationMax in SearchRequest; no backend changes needed for this part.
4. Duration Filter on Exchanges Table
Dropdown in table header: [Duration: All ▾] with presets: > 100ms, > 500ms, > 1s, > 5s, > 30s, Custom.
Custom shows inline inputs: [Min: ___ms] [Max: ___ms] [Apply].
Syncs with URL params durationMin/durationMax bidirectionally. Active filter shown as amber Badge: > 1s.
5. Slowest Exchanges Widget
Dashboard L2/L3 section showing top 5 by duration:
Data source: existing useSearchExecutions with sortField: 'durationMs', sortDir: 'desc', limit: 5. No new endpoint needed.
"View all" links to /exchanges/{app}?sort=durationMs&dir=desc.
6. SLA Card Enhancement
Current: 0.0% BREACH with threshold shown.
Enhanced:
Breach count computed from existing routeRows: routeRows.filter(r => r.slaCompliance < 99.0).length. Click scrolls to Route Performance table sorted by SLA ascending.
Files
New:
ui/src/components/LatencyHistogram/LatencyHistogram.tsx + .module.css
LatencyDistribution.java, LatencyBucket.java (records)
Modified:
@cameleer/design-system KpiStrip.tsx: add onClick to KpiItem (requires DS release)
DashboardL1.tsx, DashboardL2.tsx, DashboardL3.tsx: KPI click handlers, histogram state, slowest widget, SLA breach count
Dashboard.tsx: duration filter dropdown, read durationMin/durationMax from URL
dashboard.ts (queries): add useLatencyDistribution hook
StatsStore.java: add latencyDistribution() method
ClickHouseStatsStore.java: histogram query implementation
SearchController.java: /stats/latency-distribution endpoint
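For the bar coloring rule in the Latency Distribution Histogram section (below, crossing, above the SLA), one possible reading is sketched here. It assumes half-open buckets `[minMs, maxMs)` and a single SLA threshold in milliseconds; the function name `barColor` and the exact boundary handling are assumptions, not something the specification pins down.

```typescript
// Pick a CSS color variable for a histogram bar relative to the SLA threshold.
// maxMs is null for the open-ended last bucket.
function barColor(
  minMs: number,
  maxMs: number | null,
  slaThresholdMs: number,
): string {
  // Every duration in the bucket is under the threshold.
  if (maxMs !== null && maxMs <= slaThresholdMs) return "var(--success)";
  // Every duration in the bucket is at or over the threshold.
  if (minMs >= slaThresholdMs) return "var(--error)";
  // The threshold falls inside the bucket.
  return "var(--amber)";
}
```

With a hypothetical 300 ms threshold, a 0 to 100 ms bar would be green, the 100 to 500 ms bar amber, and everything from 500 ms upward red.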