docs: add dashboard design spec
Progressive drill-down dashboard following RED method (Rate, Errors, Duration) with 3 scope levels driven by sidebar selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
197
docs/superpowers/specs/2026-03-29-dashboard-design.md
Normal file
@@ -0,0 +1,197 @@
# Dashboard Design Spec

## Goal

Redesign the Dashboard tab as a progressive drill-down dashboard for Apache Camel application monitoring. Follows the RED method (Rate, Errors, Duration) plus saturation (inflight exchanges). The sidebar drives scope: all applications → single application → single route. Each level answers a focused question with increasing detail.

Business/support users are the primary audience: the dashboard focuses on exchange health, throughput, error rates, and SLA compliance. Ops/infrastructure monitoring stays on the Runtime tab (agent health, JVM metrics). Power users needing custom analysis will use Grafana; this dashboard targets the 80% sweet spot.

## Architecture

The Dashboard tab renders one of three views based on the current sidebar selection:

| Sidebar state | Dashboard level | Question answered |
|---|---|---|
| No selection | Level 1: All Applications | Is my landscape healthy? Which app needs attention? |
| Application selected | Level 2: Application | How is this app performing? Which route is the problem? |
| Route selected | Level 3: Route | What's happening in this route? Where's the bottleneck? |

All levels share the same time range (controlled by the top bar's time selector) and auto-refresh behavior (LIVE mode with 30s refresh).
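
The mapping from sidebar state to dashboard level could be sketched as below. This is a minimal illustration, not the actual implementation: the `SidebarSelection` shape, the `DashboardLevel` type, and the `resolveDashboardLevel` name are hypothetical, and the fallback to Level 1 for a route without an application is an assumption.

```typescript
// Hypothetical shape of the sidebar selection; the real store may differ.
interface SidebarSelection {
  application?: string;
  routeId?: string;
}

// One variant per dashboard level, carrying the scope each level needs.
type DashboardLevel =
  | { level: 1 }
  | { level: 2; application: string }
  | { level: 3; application: string; routeId: string };

// Assumption: a route is always scoped to an application, so a routeId
// without an application falls back to the all-applications view.
function resolveDashboardLevel(sel: SidebarSelection): DashboardLevel {
  if (sel.application && sel.routeId) {
    return { level: 3, application: sel.application, routeId: sel.routeId };
  }
  if (sel.application) {
    return { level: 2, application: sel.application };
  }
  return { level: 1 };
}
```

Modeling the level as a discriminated union lets each view receive exactly the scope it needs, so a Level 3 view cannot be rendered without a route ID.
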

## Level 1: All Applications Overview

### KPI Strip (4 metrics)

| Metric | Source | Trend |
|---|---|---|
| Total throughput (msg/s) | `stats_1m_all` aggregate | vs previous period |
| Error rate (%) | failed / total | vs previous period |
| P99 latency (ms) | `approx_percentile(0.99)` | vs previous period |
| Inflight exchanges | running count | current value |

Uses the existing `useExecutionStats()` hook with no application/route filter.
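
The KPI arithmetic in the table above could look roughly like this. It is a sketch under stated assumptions: the `WindowCounts` and `TimeRange` shapes and all function names are hypothetical, and "previous period" is assumed to mean the window of equal length that ends where the current window starts.

```typescript
// Hypothetical raw counters for one time window.
interface WindowCounts {
  totalCount: number;
  failedCount: number;
}

interface TimeRange {
  from: Date;
  to: Date;
}

// Assumption: the previous period has the same length as the current one
// and ends exactly where the current one begins.
function previousPeriod(range: TimeRange): TimeRange {
  const lengthMs = range.to.getTime() - range.from.getTime();
  return { from: new Date(range.from.getTime() - lengthMs), to: range.from };
}

// Throughput in msg/s: exchanges in the window divided by its length.
function throughputPerSecond(counts: WindowCounts, range: TimeRange): number {
  const seconds = (range.to.getTime() - range.from.getTime()) / 1000;
  return seconds > 0 ? counts.totalCount / seconds : 0;
}

// Error rate in percent: failed / total, guarded against an empty window.
function errorRatePercent(counts: WindowCounts): number {
  return counts.totalCount > 0 ? (counts.failedCount * 100) / counts.totalCount : 0;
}

// Relative change in percent; null when there is no baseline to compare to.
function trendPercent(current: number, previous: number): number | null {
  return previous === 0 ? null : ((current - previous) * 100) / previous;
}
```

Returning `null` for a zero baseline leaves it to the KPI component to decide how to render "no trend" rather than showing an infinite percentage.
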

### Application Health Table

Columns: App name, Status dot, Throughput (msg/s), Error Rate (%), P99 (ms), Active Routes, Sparkline (12-bucket trend).

**Status dot derivation** (evaluated worst-first: Red, then Yellow, then Green, so a P99 between 200ms and 300ms yields Yellow even though it is below the SLA threshold):

- Green: error rate < 1% AND P99 < SLA threshold (300ms)
- Yellow: error rate 1-5% OR P99 between 200ms and 300ms
- Red: error rate > 5% OR P99 > SLA threshold
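
The status rules above can be sketched as a small pure function. The `deriveStatus` name and the constants are hypothetical; checking Red first, then Yellow, then Green is an assumption made here so that the overlapping P99 ranges (Green below 300ms, Yellow from 200ms to 300ms) resolve to a single color.

```typescript
type Status = "green" | "yellow" | "red";

// Thresholds taken from the spec; the constant names are illustrative.
const SLA_P99_MS = 300;
const WARN_P99_MS = 200;

// Worst-first evaluation: the first matching rule wins.
function deriveStatus(errorRatePct: number, p99Ms: number): Status {
  if (errorRatePct > 5 || p99Ms > SLA_P99_MS) return "red";
  if (errorRatePct >= 1 || p99Ms >= WARN_P99_MS) return "yellow";
  return "green";
}
```

For example, an application with a 0.5% error rate and a P99 of 250ms is Yellow under this ordering, while the same error rate with a P99 of 150ms is Green.
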

Row click navigates to that application in the sidebar (transitions to Level 2).

Data source: `useRouteMetrics()` aggregated by application. The route-level metrics are grouped by `appId` and aggregated to produce application-level rows.
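
One way to roll route-level rows up to application-level rows is sketched below. Apart from `appId`, the field names on `RouteMetrics` are hypothetical, since the spec does not show what `useRouteMetrics()` returns. Two further assumptions are flagged in the comments: percentiles cannot be combined exactly from per-route values, so the sketch uses the maximum route P99 as a conservative upper bound, and every returned route counts as active.

```typescript
// Hypothetical per-route row; only appId is named in the spec.
interface RouteMetrics {
  appId: string;
  routeId: string;
  throughputPerSec: number;
  totalCount: number;
  failedCount: number;
  p99DurationMs: number;
}

interface AppRow {
  appId: string;
  throughputPerSec: number;
  errorRatePct: number;
  p99DurationMs: number;
  activeRoutes: number;
}

function aggregateByApp(routes: RouteMetrics[]): AppRow[] {
  const totals = new Map<
    string,
    { throughput: number; total: number; failed: number; p99: number; routes: number }
  >();

  routes.forEach((r) => {
    const t = totals.get(r.appId) ?? { throughput: 0, total: 0, failed: 0, p99: 0, routes: 0 };
    t.throughput += r.throughputPerSec; // rates add up across routes
    t.total += r.totalCount;
    t.failed += r.failedCount;
    // Assumption: max of route P99s, an upper bound on the true app-level P99.
    t.p99 = Math.max(t.p99, r.p99DurationMs);
    t.routes += 1; // Assumption: every returned route counts as active.
    totals.set(r.appId, t);
  });

  return Array.from(totals.entries()).map(([appId, t]) => ({
    appId,
    throughputPerSec: t.throughput,
    // Weighted by exchange count rather than averaging per-route percentages.
    errorRatePct: t.total > 0 ? (t.failed * 100) / t.total : 0,
    p99DurationMs: t.p99,
    activeRoutes: t.routes,
  }));
}
```

Summing the counts before dividing keeps a low-volume route with a high error percentage from skewing the application's error rate.
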

### Charts (2, side-by-side)

1. **Throughput over time**: Stacked area chart, one series per application. Shows relative volume and total.
2. **Error rate over time**: Line chart, one line per application. Highlights which app is misbehaving.

Data source: `useStatsTimeseries()` with no application filter, extended to support per-app breakdown. This requires a new API parameter or a new endpoint that returns timeseries grouped by application.

### New API endpoint needed

`GET /api/v1/search/stats/timeseries/by-app`: Returns timeseries buckets grouped by application name. Query: `stats_1m_app` materialized view.

Response shape:

```json
{
  "apps": {
    "order-service": [ { "time": "...", "totalCount": 42, "failedCount": 1, ... } ],
    "payment-gateway": [ ... ]
  }
}
```
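
On the frontend, the grouped response could be turned into one chart series per application along these lines. The types cover only the fields visible in the response shape above; `ChartSeries`, the function names, and the fixed 60-second bucket width (inferred from the `stats_1m_app` view name) are assumptions.

```typescript
// Only the fields shown in the response shape; the real bucket has more.
interface StatsBucket {
  time: string;
  totalCount: number;
  failedCount: number;
}

interface TimeseriesByAppResponse {
  apps: Record<string, StatsBucket[]>;
}

// Hypothetical input shape for the chart components.
interface ChartSeries {
  name: string;
  points: { time: string; value: number }[];
}

// Assumption: buckets are one minute wide, as the view name suggests.
const BUCKET_SECONDS = 60;

// One series per application, in msg/s, for the stacked area chart.
function toThroughputSeries(res: TimeseriesByAppResponse): ChartSeries[] {
  return Object.entries(res.apps).map(([name, buckets]) => ({
    name,
    points: buckets.map((b) => ({ time: b.time, value: b.totalCount / BUCKET_SECONDS })),
  }));
}

// One series per application, in percent, for the error rate line chart.
function toErrorRateSeries(res: TimeseriesByAppResponse): ChartSeries[] {
  return Object.entries(res.apps).map(([name, buckets]) => ({
    name,
    points: buckets.map((b) => ({
      time: b.time,
      value: b.totalCount > 0 ? (b.failedCount * 100) / b.totalCount : 0,
    })),
  }));
}
```

If the server returns wider buckets for long time ranges, the bucket width would have to come from the response instead of a constant.
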

## Level 2: Single Application

### KPI Strip (4 metrics)

Same four metrics as Level 1 but scoped to the selected application. Uses `useExecutionStats({ application })`.

### Route Performance Table

Columns: Route ID, Throughput (msg/s), Success %, Avg Duration (ms), P99 Duration (ms), Error Rate (%), Sparkline.

Sortable by any column. Row click navigates to that route in the sidebar (transitions to Level 3).

Data source: `useRouteMetrics({ appId })`, which already exists and returns per-route data filtered by application.

### Charts (2, side-by-side)

1. **Throughput over time**: Stacked area by route.
2. **Latency percentiles over time**: P50, P95, P99 lines with SLA threshold (300ms horizontal dashed line).

Data source: `useStatsTimeseries({ application })` for the aggregate latency chart. The per-route throughput breakdown needs the by-app pattern extended to by-route, which requires the new endpoint described in the next section.

### New API endpoint needed

`GET /api/v1/search/stats/timeseries/by-route`: Returns timeseries buckets grouped by route ID, filtered by application. Query: `stats_1m_route` materialized view.

Response shape:

```json
{
  "routes": {
    "process-order": [ { "time": "...", "totalCount": 42, ... } ],
    "validate-payment": [ ... ]
  }
}
```

### Top 5 Errors

Compact table: Error Type, Route, Count, Last Seen.

Click navigates to the Exchanges tab with the error type pre-filled as a search filter. Section hidden when there are zero errors in the time window.

Data source: PostgreSQL query on `executions` table, aggregating by `error_type` column for the selected application and time range.

### New API endpoint needed

`GET /api/v1/search/errors/top`: Returns top N errors grouped by error type.

Parameters: `application` (optional), `routeId` (optional), `from`, `to`, `limit` (default 5).

Response shape:

```json
[
  { "errorType": "ConnectTimeoutException", "routeId": "validate-payment", "count": 47, "lastSeen": "2026-03-29T15:58:00Z" }
]
```
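
A client-side helper for this endpoint might build the request URL as follows. The `TopErrorsParams` type and `buildTopErrorsUrl` name are hypothetical, and passing `from` and `to` as ISO-8601 strings is an assumption, since the spec does not state the timestamp format.

```typescript
// Parameters as listed in the spec; from/to format is an assumption.
interface TopErrorsParams {
  application?: string;
  routeId?: string;
  from: string; // assumed ISO-8601, e.g. "2026-03-29T15:00:00Z"
  to: string;
  limit?: number; // defaults to 5 per the spec
}

// One entry of the response array shown above.
interface TopError {
  errorType: string;
  routeId: string;
  count: number;
  lastSeen: string;
}

function buildTopErrorsUrl(p: TopErrorsParams): string {
  const pairs: [string, string][] = [];
  // Optional filters are omitted entirely when not set.
  if (p.application) pairs.push(["application", p.application]);
  if (p.routeId) pairs.push(["routeId", p.routeId]);
  pairs.push(["from", p.from], ["to", p.to], ["limit", String(p.limit ?? 5)]);

  const query = pairs
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return `/api/v1/search/errors/top?${query}`;
}
```

Level 2 would call this with only `application` set, and Level 3 would add `routeId`, so both levels share one code path.
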

Source: PostgreSQL query on `executions` table. Group by `error_type` column (added in V10 migration). Filter to `status = 'FAILED'` within the time range. Order by count descending, limit to N.
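
The aggregation described above can be mirrored in memory to make the intended semantics concrete, for example as a reference for tests. This is not the SQL itself. The `ExecutionRow` fields are hypothetical stand-ins for the table columns, the time window is assumed to be half-open (`from` inclusive, `to` exclusive), and because the response carries a single `routeId` per error type, the sketch reports the route of the most recent failure, which is also an assumption.

```typescript
// Hypothetical projection of an executions row.
interface ExecutionRow {
  status: string;
  errorType: string | null;
  routeId: string;
  startTime: string; // ISO-8601 timestamp
}

interface TopError {
  errorType: string;
  routeId: string;
  count: number;
  lastSeen: string;
}

function topErrors(rows: ExecutionRow[], from: string, to: string, limit: number = 5): TopError[] {
  const fromMs = Date.parse(from);
  const toMs = Date.parse(to);
  const byType = new Map<string, TopError>();

  rows.forEach((row) => {
    // Filter: failed executions with an error type, inside [from, to).
    if (row.status !== "FAILED" || row.errorType === null) return;
    const t = Date.parse(row.startTime);
    if (t < fromMs || t >= toMs) return;

    const entry = byType.get(row.errorType);
    if (entry === undefined) {
      byType.set(row.errorType, {
        errorType: row.errorType,
        routeId: row.routeId,
        count: 1,
        lastSeen: row.startTime,
      });
      return;
    }
    entry.count += 1;
    // Assumption: report the route of the most recent occurrence.
    if (t > Date.parse(entry.lastSeen)) {
      entry.lastSeen = row.startTime;
      entry.routeId = row.routeId;
    }
  });

  // Order by count descending, then keep the top N.
  return Array.from(byType.values())
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}
```

An empty result corresponds to the case where the Top 5 Errors section is hidden.
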

## Level 3: Single Route

### KPI Strip (4 metrics)

Same four metrics scoped to the selected route. Uses `useExecutionStats({ application, routeId })`.

### Charts (3, in a row)

1. **Throughput over time**: Area chart, single series.
2. **Latency percentiles over time**: P50, P95, P99 lines with SLA threshold.
3. **Error rate over time**: Area chart, red-tinted.

Data source: `useStatsTimeseries({ application, routeId })` (already exists).

### Compact Process Diagram with Heatmap

Reuses the `ProcessDiagram` component but with a **latency heatmap overlay** instead of the execution overlay. Processor nodes are colored by aggregate performance:

- Color scale: green (fast) → yellow (moderate) → red (slow), computed relative to the route's own processors (not absolute thresholds). The slowest processor in the route gets the warmest color.
- Data source: `useProcessorMetrics({ routeId, appId })` (already exists). Uses `avgDurationMs` or `p99DurationMs` for the color mapping.
- Interaction is limited to a single click behavior: clicking a node scrolls to its row in the processor table below. The overlay is otherwise for visual identification only.
- Compact sizing: fixed height (~250px), the diagram fits-to-view within that space.

Implementation: a new `heatmapOverlay` prop on `ProcessDiagram` (or a wrapper component) that takes a `Map<processorId, { avgMs, p99Ms }>` and colors nodes accordingly. Reuses the existing diagram layout and rendering; only the fill color logic changes.
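
The relative color mapping could be sketched as below. The `heatmapColors` name, the HSL hue interpolation (120 for green through 60 for yellow to 0 for red), and the saturation and lightness values are illustrative assumptions, as is rendering every node green when all processors have the same duration.

```typescript
// Value type of the heatmapOverlay map described in the spec.
interface ProcessorTiming {
  avgMs: number;
  p99Ms: number;
}

// Returns a CSS color per processor ID, scaled to this route's own range.
function heatmapColors(
  metrics: Map<string, ProcessorTiming>,
  field: keyof ProcessorTiming = "p99Ms",
): Map<string, string> {
  const values = Array.from(metrics.values()).map((m) => m[field]);
  const min = Math.min(...values);
  const max = Math.max(...values);

  const colors = new Map<string, string>();
  metrics.forEach((m, processorId) => {
    // t is 0 for the fastest processor and 1 for the slowest.
    // Assumption: with no spread (max === min) every node is green.
    const t = max > min ? (m[field] - min) / (max - min) : 0;
    const hue = Math.round(120 * (1 - t));
    colors.set(processorId, `hsl(${hue}, 70%, 50%)`);
  });
  return colors;
}
```

Because the scale is normalized per route, the slowest processor is always red even in a fast route, which matches the goal of locating the relative bottleneck rather than judging absolute latency.
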

### Processor Metrics Table

Columns: Processor ID, Type, Invocation Count, Avg Duration (ms), P99 Duration (ms), Error Rate (%).

Default sort: P99 descending (slowest processor first, which highlights bottlenecks).

Data source: `useProcessorMetrics({ routeId, appId })` (already exists).

### Top 5 Errors

Same format as Level 2 but scoped to this route. Uses the same `GET /api/v1/search/errors/top` endpoint with the `routeId` parameter.

## Data Requirements Summary

### Existing endpoints (no backend changes)

| Endpoint | Used at |
|---|---|
| `GET /search/stats` | All levels (KPI strip) |
| `GET /search/stats/timeseries` | Level 2, Level 3 (charts) |
| `GET /routes/metrics` | Level 1 (app table, aggregated), Level 2 (route table) |
| `GET /routes/metrics/processors` | Level 3 (processor table + heatmap) |

### New endpoints needed

| Endpoint | Used at | Source |
|---|---|---|
| `GET /search/stats/timeseries/by-app` | Level 1 (charts) | `stats_1m_app` view |
| `GET /search/stats/timeseries/by-route` | Level 2 (throughput chart) | `stats_1m_route` view |
| `GET /search/errors/top` | Level 2, Level 3 (top errors) | `executions` table |

### Existing frontend components reused

- `KpiStrip` / `TabKpis`: KPI display with trends
- `DataTable`: sortable tables
- `AreaChart`, `LineChart`: time-series charts
- `Sparkline`: compact trend in table cells
- `StatusDot`: health indicators
- `ProcessDiagram`: route visualization (extended with heatmap)

## Scope Exclusions

- No user-customizable panels or drag-and-drop layout (power users use Grafana)
- No JVM/infrastructure metrics on Dashboard tab (that's Runtime tab)
- No alerting or threshold configuration (out of scope)
- No comparison mode (e.g., "this week vs last week" side-by-side)
- No export/PDF functionality