Dashboard: progressive drill-down with RED metrics #94

Closed
opened 2026-03-29 19:03:35 +02:00 by claude · 0 comments
Owner

## Summary

Redesign the Dashboard tab as a progressive drill-down view following the RED method (Rate, Errors, Duration) plus SLA compliance. The sidebar drives scope through three levels of detail. The Dashboard focuses purely on transaction health — binary/infrastructure health stays on the Runtime tab.

## Design Decisions

- **Dashboard = Transactions only (RED + SLA).** Binary/infra health stays on the Runtime tab. This follows the industry pattern (Datadog, New Relic, and Dynatrace all separate APM from Infrastructure as top-level sections).
- **SLA threshold = per-app configurable** via the admin API, stored in the `app_settings` table.
- **Health dot thresholds = per-app configurable** (error rate + SLA compliance thresholds stored alongside the SLA threshold).
- **No data collection constraints** — build whatever queries/views are needed for the ideal dashboard.
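
The per-app health dot thresholds can be combined into a status roughly as follows. This is a minimal sketch: the worst-of-both-signals rule is an assumption, and `health_dot`/`defaults` are hypothetical names mirroring the `app_settings` columns, not existing code.

```python
def health_dot(error_rate_pct: float, sla_pct: float, s: dict) -> str:
    """Classify an app's health dot from its per-app thresholds.

    `s` mirrors an app_settings row: health_error_warn/crit are error-rate
    percentages that trip yellow/red; health_sla_warn/crit are SLA-percentage
    floors. Assumption: the dot reflects the worst of the two signals.
    """
    if error_rate_pct >= s["health_error_crit"] or sla_pct < s["health_sla_crit"]:
        return "red"
    if error_rate_pct >= s["health_error_warn"] or sla_pct < s["health_sla_warn"]:
        return "yellow"
    return "green"

# Defaults from the proposed app_settings migration.
defaults = {"health_error_warn": 1.0, "health_error_crit": 5.0,
            "health_sla_warn": 99.0, "health_sla_crit": 95.0}
```

With the defaults, the wireframe's `payment-gw` row (94.1% success → 5.9% errors, 88.2% SLA) classifies red, and `inventory` (99.9% success, 99.8% SLA) classifies green.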

## Scope Levels

| Sidebar state | Level | Question answered |
|---|---|---|
| No selection | L1: All Applications | Is my landscape healthy? Which app needs attention? |
| App selected | L2: Application | How is this app performing? Which route is the problem? |
| Route selected | L3: Route | What's happening in this route? Where's the bottleneck? |

## KPIs

### Core 4 (every level, scoped to selection)

| KPI | RED pillar | What it tells you |
|---|---|---|
| **Throughput** (msg/s) | Rate | Volume of work |
| **Success Rate** (%) | Errors (inverted) | Quality of processing |
| **P99 Latency** (ms) | Duration | Speed of processing |
| **SLA Compliance** (%) | Duration (business) | Meeting commitments? Exact count of exchanges under the per-app threshold |

Each card shows: value, trend vs previous 24h, sparkline (12 buckets).
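
The card's bucketing and trend could be computed as sketched below, assuming 12 equal 2-hour buckets over the 24h window; `sparkline_buckets` and `trend_pct` are hypothetical helpers, not part of the existing codebase.

```python
from datetime import datetime, timedelta, timezone

def sparkline_buckets(points, window_end, hours=24, n=12):
    """Sum (timestamp, value) samples into n equal slices of the window.

    Returns a list of n floats suitable for rendering a sparkline.
    Samples outside [window_end - hours, window_end) are ignored.
    """
    start = window_end - timedelta(hours=hours)
    width = timedelta(hours=hours) / n
    buckets = [0.0] * n
    for ts, value in points:
        if start <= ts < window_end:
            buckets[int((ts - start) / width)] += value
    return buckets

def trend_pct(current_total, previous_total):
    """Percent change vs the previous 24h window (None if no baseline)."""
    if previous_total == 0:
        return None
    return (current_total - previous_total) / previous_total * 100.0
```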

### Level-specific 5th KPI

| Level | 5th KPI | Why |
|---|---|---|
| L1: All Apps | **Active Errors** (distinct error types in window) | "How many different problems exist?" Not a raw count (noisy) — distinct types = distinct issues to investigate |
| L2: App | **Error Velocity** (errors/min + acceleration ▲/──/▼) | "Is it getting worse?" Critical for triage — a stable 2/min is different from an accelerating 2/min |
| L3: Route | **Bottleneck** (slowest processor name + % of route time) | "Where's the time going?" Immediately actionable without reading the table |

## Level 1: All Applications

```
┌──────────────────────────────────────────────────────────────────────────┐
│  KPI STRIP (5 cards)                                                     │
│  [Throughput] [Success Rate] [P99 Latency] [SLA Compliance] [Active Err] │
├──────────────────────────────────────────────────────────────────────────┤
│  APPLICATION HEALTH TABLE                                                │
│  ● App Name      Throughput  Success%  P99(ms)  SLA%  Errors  Trend     │
│  🔴 payment-gw     312/s      94.1%    380    88.2%    18   ▁▂▃█▇▆▅    │
│  🟢 inventory       215/s      99.9%     85    99.8%     0   ▇▇▇▇▇▇▇    │
│  Status dot: per-app thresholds from app_settings                        │
│  Row click → sidebar selects app → L2                                    │
├──────────────────────────────────────────────────────────────────────────┤
│  CHARTS (side by side)                                                   │
│  [Throughput by App — stacked area] [Error Rate by App — lines]          │
│  Data: new timeseries/by-app endpoint                                    │
└──────────────────────────────────────────────────────────────────────────┘
```

## Level 2: Single Application

```
┌──────────────────────────────────────────────────────────────────────────┐
│  KPI STRIP (5 cards)                                                     │
│  [Throughput] [Success Rate] [P99 Latency] [SLA Compliance] [Err Veloc.] │
├──────────────────────────────────────────────────────────────────────────┤
│  ROUTE PERFORMANCE TABLE                                                 │
│  Route ID       Throughput  Success%  Avg(ms)  P99(ms)  SLA%  Trend     │
│  validate-pay     180/s      97.2%     45      210    94.8%  ▅▆▇▇▆▅▇   │
│  Row click → sidebar selects route → L3                                  │
├──────────────────────────────────────────────────────────────────────────┤
│  CHARTS (side by side)                                                   │
│  [Throughput by Route — stacked area] [Latency P50/P99 + SLA threshold]  │
│  SLA threshold line = per-app configurable value                         │
├──────────────────────────────────────────────────────────────────────────┤
│  TOP 5 ERRORS (hidden if zero)                                           │
│  Error Type          Route           Count  Velocity  Last Seen          │
│  ConnectTimeout      validate-pay      12   ▲ 1.2/m   2 min ago         │
│  Click → Exchanges tab with error type pre-filtered                      │
└──────────────────────────────────────────────────────────────────────────┘
```

## Level 3: Single Route

```
┌──────────────────────────────────────────────────────────────────────────┐
│  KPI STRIP (5 cards)                                                     │
│  [Throughput] [Success Rate] [P99 Latency] [SLA Compliance] [Bottleneck] │
│  Bottleneck = slowest processor name + avg ms + % of route time          │
├──────────────────────────────────────────────────────────────────────────┤
│  CHARTS (3 in a row)                                                     │
│  [Throughput — area] [Latency P50/P99 + SLA] [Error Rate — red area]     │
├──────────────────────────────────────────────────────────────────────────┤
│  PROCESS DIAGRAM + LATENCY HEATMAP (~250px)                              │
│  Nodes colored green→yellow→red by relative latency (within route)       │
│  Each node shows avg duration label. Click → scrolls to processor table  │
├──────────────────────────────────────────────────────────────────────────┤
│  PROCESSOR METRICS TABLE (default sort: P99 desc)                        │
│  Processor     Type     Invocations  Avg(ms)  P99(ms)  Err%  % Time     │
│  enrich-cust   enricher    18,200     120      285    0.5%   57.1%      │
│  % Time = processor avg / route avg duration                             │
├──────────────────────────────────────────────────────────────────────────┤
│  TOP 5 ERRORS (attributed to PROCESSOR, not route)                       │
│  Error Type          Processor       Count  Velocity  Last Seen          │
│  ConnectTimeout      enrich-cust       12   ▲ 1.2/m   2 min ago         │
└──────────────────────────────────────────────────────────────────────────┘
```

## New Backend Endpoints

| Endpoint | Description | Source |
|---|---|---|
| `GET /search/stats/timeseries/by-app` | Timeseries grouped by application | `stats_1m_app` view |
| `GET /search/stats/timeseries/by-route?application=X` | Timeseries grouped by route | `stats_1m_route` view |
| `GET /search/errors/top?application&routeId&limit=5` | Top N errors with velocity trend | `executions` / `processor_executions` tables |
| `GET /admin/app-settings` (CRUD) | Per-app SLA threshold + health dot thresholds | `app_settings` table |

## Modified Endpoints

| Endpoint | Change |
|---|---|
| `GET /search/stats` | Add `slaCompliance` field to `ExecutionStats` response |
| `GET /routes/metrics` | Add `slaCompliance` field to `RouteMetrics` response |

## New Database

**V12 migration — `app_settings` table:**

| Column | Type | Default | Description |
|---|---|---|---|
| app_id | TEXT PK | — | Application name |
| sla_threshold_ms | INTEGER | 300 | Duration threshold for SLA compliance |
| health_error_warn | DOUBLE | 1.0 | Error rate % for yellow dot |
| health_error_crit | DOUBLE | 5.0 | Error rate % for red dot |
| health_sla_warn | DOUBLE | 99.0 | SLA % for yellow dot |
| health_sla_crit | DOUBLE | 95.0 | SLA % for red dot |
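
A sketch of what the V12 DDL could look like, demonstrated with SQLite for portability — the production store is presumably Postgres/TimescaleDB (the spec mentions a hypertable), so the real migration syntax may differ.

```python
import sqlite3

# Column names and defaults taken from the spec table above;
# types/constraints beyond that are assumptions.
DDL = """
CREATE TABLE app_settings (
    app_id            TEXT PRIMARY KEY,          -- application name
    sla_threshold_ms  INTEGER NOT NULL DEFAULT 300,
    health_error_warn DOUBLE  NOT NULL DEFAULT 1.0,
    health_error_crit DOUBLE  NOT NULL DEFAULT 5.0,
    health_sla_warn   DOUBLE  NOT NULL DEFAULT 99.0,
    health_sla_crit   DOUBLE  NOT NULL DEFAULT 95.0
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
# A new app picks up all defaults until an admin overrides them via the CRUD API.
conn.execute("INSERT INTO app_settings (app_id) VALUES ('payment-gw')")
row = conn.execute(
    "SELECT * FROM app_settings WHERE app_id = 'payment-gw'"
).fetchone()
```

Relying on column defaults means apps that were never configured still get sensible thresholds without a backfill step.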

## SLA Compliance Calculation

Exact count from the raw `executions` hypertable (not approximated from P99):

```sql
SELECT
  COUNT(*) FILTER (WHERE duration_ms <= :threshold AND status != 'RUNNING') * 100.0
    / NULLIF(COUNT(*) FILTER (WHERE status != 'RUNNING'), 0)
FROM executions
WHERE start_time >= :from AND start_time < :to
  AND application_name = :app  -- optional
  AND route_id = :route        -- optional
```
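
The query above can be exercised against toy data. SQLite stands in for the real hypertable here (it supports the same `FILTER` aggregate clause), and the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE executions (duration_ms INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO executions VALUES (?, ?)",
    [(120, "SUCCESS"), (280, "SUCCESS"), (450, "SUCCESS"),
     (90, "FAILED"), (999, "RUNNING")],  # RUNNING rows are excluded entirely
)
threshold = 300  # the app's sla_threshold_ms
(compliance,) = conn.execute(
    """
    SELECT COUNT(*) FILTER (WHERE duration_ms <= :t AND status != 'RUNNING') * 100.0
         / NULLIF(COUNT(*) FILTER (WHERE status != 'RUNNING'), 0)
    FROM executions
    """,
    {"t": threshold},
).fetchone()
# 3 of the 4 finished executions are under 300 ms -> 75.0
```

Note that as written the metric counts fast-but-failed exchanges as compliant; failures are surfaced separately via Success Rate.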

## Error Velocity Calculation

Compare 5-minute sliding windows: `recent_5m` vs `prev_5m` error count per error type.

- ▲ accelerating: recent > prev * 1.2
- ▼ decelerating: recent < prev * 0.8
- ── stable: otherwise
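
The window comparison reduces to a small classification function; this is a sketch with hypothetical names, using the thresholds above.

```python
def velocity_trend(recent_5m: int, prev_5m: int) -> str:
    """Classify error-count acceleration between two 5-minute windows."""
    if recent_5m > prev_5m * 1.2:
        return "▲"   # accelerating
    if recent_5m < prev_5m * 0.8:
        return "▼"   # decelerating
    return "──"      # stable

def errors_per_min(recent_5m: int) -> float:
    """The errors/min rate displayed alongside the trend arrow."""
    return recent_5m / 5.0
```

One edge case falls out naturally: any errors after a quiet window (`prev_5m = 0`, `recent_5m > 0`) classify as accelerating.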

## Implementation Phases

1. **Database**: V12 migration (`app_settings` table)
2. **Backend core**: AppSettings model/repository, StatsStore interface extensions, TopError model
3. **Backend endpoints**: 3 new + 2 modified + app-settings CRUD
4. **Frontend hooks**: new query hooks + shared dashboard utilities
5. **Frontend components**: DashboardL1, DashboardL2, DashboardL3
6. **Process diagram heatmap**: new `latencyHeatmap` prop on the ProcessDiagram component
7. **Verification**: build, manually test all 3 levels, regenerate `openapi.json`

## Design Spec

Full spec with ASCII wireframes: `docs/superpowers/specs/2026-03-29-dashboard-design.md`
