cameleer/cameleer-server

hsiegeln 0ec41bc02c

docs: add dashboard design spec

Progressive drill-down dashboard following RED method (Rate, Errors,
Duration) with 3 scope levels driven by sidebar selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-29 19:02:35 +02:00

Dashboard Design Spec

Goal

Redesign the Dashboard tab as a progressive drill-down dashboard for Apache Camel application monitoring. Follows the RED method (Rate, Errors, Duration) plus saturation (inflight exchanges). The sidebar drives scope: all applications → single application → single route. Each level answers a focused question with increasing detail.

Business/support users are the primary audience — the dashboard focuses on exchange health, throughput, error rates, and SLA compliance. Ops/infrastructure monitoring stays on the Runtime tab (agent health, JVM metrics). Power users needing custom analysis will use Grafana; this dashboard targets the 80% sweet spot.

Architecture

The Dashboard tab renders one of three views based on the current sidebar selection:

Sidebar state	Dashboard level	Question answered
No selection	Level 1: All Applications	Is my landscape healthy? Which app needs attention?
Application selected	Level 2: Application	How is this app performing? Which route is the problem?
Route selected	Level 3: Route	What's happening in this route? Where's the bottleneck?

All levels share the same time range (controlled by the top bar's time selector) and auto-refresh behavior (LIVE mode with 30s refresh).
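The mapping from sidebar state to dashboard level can be sketched as a small pure function. This is only an illustration: `SidebarSelection`, `DashboardLevel`, and `resolveDashboardLevel` are hypothetical names, and the real sidebar state shape is not defined in this spec.

```typescript
// Hypothetical shape of the sidebar selection; not taken from the codebase.
interface SidebarSelection {
  application?: string;
  routeId?: string;
}

// One variant per dashboard level, carrying only the scope that level needs.
type DashboardLevel =
  | { level: 1 }
  | { level: 2; application: string }
  | { level: 3; application: string; routeId: string };

function resolveDashboardLevel(sel: SidebarSelection): DashboardLevel {
  if (sel.application && sel.routeId) {
    return { level: 3, application: sel.application, routeId: sel.routeId };
  }
  if (sel.application) {
    return { level: 2, application: sel.application };
  }
  // No selection. A route without an application is treated as no selection
  // here, which is an assumption of this sketch.
  return { level: 1 };
}
```

Modelling the level as a discriminated union keeps the scope parameters (application, route) attached to the level that requires them, so each view can pass them straight to its data hooks.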

Level 1: All Applications Overview

KPI Strip (4 metrics)

Metric	Source	Trend
Total throughput (msg/s)	`stats_1m_all` aggregate	vs previous period
Error rate (%)	failed / total	vs previous period
P99 latency (ms)	`approx_percentile(0.99)`	vs previous period
Inflight exchanges	running count	current value

Uses the existing useExecutionStats() hook with no application/route filter.
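As a rough illustration of how the KPI values and their trends could be derived from raw counts, the sketch below computes throughput, error rate, and the change against the previous period. The `PeriodCounts` type and all function names are assumptions made for this example; the actual return shape of `useExecutionStats()` is not shown in this spec.

```typescript
// Hypothetical counts for one time window (current or previous period).
interface PeriodCounts {
  totalCount: number;
  failedCount: number;
}

// Throughput in msg/s: exchanges divided by the window length in seconds.
function throughputPerSecond(totalCount: number, windowSeconds: number): number {
  return windowSeconds > 0 ? totalCount / windowSeconds : 0;
}

// Error rate in percent: failed / total, with 0 for an empty window.
function errorRatePercent(c: PeriodCounts): number {
  return c.totalCount === 0 ? 0 : (c.failedCount / c.totalCount) * 100;
}

// Relative change against the previous period, in percent.
// Returns null when the previous value is 0 and no ratio can be formed.
function trendPercent(current: number, previous: number): number | null {
  if (previous === 0) return null;
  return ((current - previous) / previous) * 100;
}
```

Returning `null` for an undefined trend lets the KPI strip render a neutral placeholder instead of a misleading "+Infinity%".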

Application Health Table

Columns: App name, Status dot, Throughput (msg/s), Error Rate (%), P99 (ms), Active Routes, Sparkline (12-bucket trend).

Status dot derivation:

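Green: error rate < 1% AND P99 < SLA threshold (300ms)
Yellow: error rate 1-5% OR P99 between 200-300ms
Red: error rate > 5% OR P99 > SLA threshold
Red takes precedence over Yellow, and Yellow over Green, so a P99 between 200ms and 300ms yields Yellow even when the error rate is below 1%.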
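The derivation rules can be sketched as a function that checks the most severe condition first. Because the Green and Yellow P99 ranges overlap between 200ms and 300ms, this sketch assumes Red is evaluated before Yellow and Yellow before Green; the constant and function names are illustrative, not existing code.

```typescript
type Status = "green" | "yellow" | "red";

const SLA_P99_MS = 300;  // SLA threshold from the rules above
const WARN_P99_MS = 200; // lower edge of the yellow P99 band

// Assumption: the most severe matching rule wins (red, then yellow, then green).
function deriveStatus(errorRatePct: number, p99Ms: number): Status {
  if (errorRatePct > 5 || p99Ms > SLA_P99_MS) return "red";
  if (errorRatePct >= 1 || p99Ms >= WARN_P99_MS) return "yellow";
  return "green";
}
```

For example, an application with a 0.2% error rate and a P99 of 250ms comes out yellow under this ordering, while the same error rate with a P99 of 120ms is green.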

Row click navigates to that application in the sidebar (transitions to Level 2).

Data source: useRouteMetrics() aggregated by application. The route-level metrics are grouped by appId and aggregated to produce application-level rows.
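A minimal sketch of the grouping step follows, assuming each route-level record carries an `appId`, exchange counts, and a P99 value. The field names are hypothetical and may not match what `useRouteMetrics()` returns. Per-route percentiles cannot be merged exactly, so the sketch uses the maximum route P99 as an upper bound for the application P99; that choice is an assumption, not something this spec prescribes.

```typescript
// Hypothetical route-level record; field names are illustrative.
interface RouteMetric {
  appId: string;
  routeId: string;
  exchangeCount: number;
  failedCount: number;
  p99DurationMs: number;
}

interface AppHealthRow {
  appId: string;
  throughputPerSec: number;
  errorRatePct: number;
  p99Ms: number;        // upper bound: max of the per-route P99 values
  activeRoutes: number; // number of route records seen for the app
}

function aggregateByApp(routes: RouteMetric[], windowSeconds: number): AppHealthRow[] {
  const totals = new Map<string, { total: number; failed: number; p99: number; routes: number }>();
  for (const r of routes) {
    const t = totals.get(r.appId) ?? { total: 0, failed: 0, p99: 0, routes: 0 };
    t.total += r.exchangeCount;
    t.failed += r.failedCount;
    t.p99 = Math.max(t.p99, r.p99DurationMs);
    t.routes += 1;
    totals.set(r.appId, t);
  }
  const rows: AppHealthRow[] = [];
  totals.forEach((t, appId) => {
    rows.push({
      appId,
      throughputPerSec: windowSeconds > 0 ? t.total / windowSeconds : 0,
      // Weighted by exchange count, not an average of per-route rates.
      errorRatePct: t.total === 0 ? 0 : (t.failed / t.total) * 100,
      p99Ms: t.p99,
      activeRoutes: t.routes,
    });
  });
  return rows;
}
```

Summing counts before dividing keeps the application error rate weighted by traffic, so a low-volume route with a high failure ratio does not dominate the row.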

Charts (2, side-by-side)

Throughput over time — Stacked area chart, one series per application. Shows relative volume and total.
Error rate over time — Line chart, one line per application. Highlights which app is misbehaving.

Data source: useStatsTimeseries() with no application filter, extended to support per-app breakdown. This requires a new API parameter or a new endpoint that returns timeseries grouped by application.

New API endpoint needed

GET /api/v1/search/stats/timeseries/by-app — Returns timeseries buckets grouped by application name. Query: stats_1m_app materialized view.

Response shape:

{
  "apps": {
    "order-service": [ { "time": "...", "totalCount": 42, "failedCount": 1, ... } ],
    "payment-gateway": [ ... ]
  }
}
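To feed a stacked area chart, the grouped response typically has to be pivoted into one point per timestamp with one value per application. The sketch below shows one way to do that. The types cover only the two bucket fields named above, the `StackedPoint` shape is hypothetical, and timestamps are assumed to be ISO-8601 strings in a single format so that they sort lexicographically.

```typescript
// Only the fields named in the response shape; further fields are omitted.
interface StatsBucket {
  time: string;
  totalCount: number;
  failedCount: number;
}

interface ByAppResponse {
  apps: Record<string, StatsBucket[]>;
}

// Hypothetical chart input: one point per bucket, one value per application.
interface StackedPoint {
  time: string;
  values: Record<string, number>;
}

function toStackedSeries(res: ByAppResponse): StackedPoint[] {
  const appNames = Object.keys(res.apps);
  const byTime = new Map<string, StackedPoint>();
  for (const app of appNames) {
    for (const bucket of res.apps[app]) {
      let point = byTime.get(bucket.time);
      if (!point) {
        // Start every app at 0 so gaps do not break the stack.
        point = { time: bucket.time, values: {} };
        for (const name of appNames) point.values[name] = 0;
        byTime.set(bucket.time, point);
      }
      point.values[app] = bucket.totalCount;
    }
  }
  const points: StackedPoint[] = [];
  byTime.forEach((p) => points.push(p));
  return points.sort((a, b) => (a.time < b.time ? -1 : a.time > b.time ? 1 : 0));
}
```

The same pivot would apply to a response keyed by route instead of application, since only the grouping key differs.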

Level 2: Single Application

KPI Strip (4 metrics)

Same four metrics as Level 1 but scoped to the selected application. Uses useExecutionStats({ application }).

Route Performance Table

Columns: Route ID, Throughput (msg/s), Success %, Avg Duration (ms), P99 Duration (ms), Error Rate (%), Sparkline.

Sortable by any column. Row click navigates to that route in the sidebar (transitions to Level 3).

Data source: useRouteMetrics({ appId }) — already exists and returns per-route data filtered by application.

Charts (2, side-by-side)

Throughput over time — Stacked area by route.
Latency percentiles over time — P50, P95, P99 lines with SLA threshold (300ms horizontal dashed line).

Data source: useStatsTimeseries({ application }) for the aggregate latency chart. The per-route throughput breakdown needs the by-app pattern from Level 1 extended to by-route:

New API endpoint needed

GET /api/v1/search/stats/timeseries/by-route — Returns timeseries buckets grouped by route ID, filtered by application. Query: stats_1m_route materialized view.

Response shape:

{
  "routes": {
    "process-order": [ { "time": "...", "totalCount": 42, ... } ],
    "validate-payment": [ ... ]
  }
}

Top 5 Errors

Compact table: Error Type, Route, Count, Last Seen.

Click navigates to the Exchanges tab with the error type pre-filled as a search filter. Section hidden when there are zero errors in the time window.

Data source: PostgreSQL query on executions table, aggregating by error_type column for the selected application and time range.

New API endpoint needed

GET /api/v1/search/errors/top — Returns top N errors grouped by error type.

Parameters: application (optional), routeId (optional), from, to, limit (default 5).

Response shape:

[
  { "errorType": "ConnectTimeoutException", "routeId": "validate-payment", "count": 47, "lastSeen": "2026-03-29T15:58:00Z" }
]
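A client-side sketch of the request could look like the following. The parameter names and the default limit come from the list above; the `TopErrorsParams` type, the `buildTopErrorsUrl` helper, and the string format of `from`/`to` are assumptions for illustration only.

```typescript
// Response item, mirroring the response shape above.
interface TopError {
  errorType: string;
  routeId: string;
  count: number;
  lastSeen: string;
}

// Hypothetical parameter object; from/to are passed through as strings.
interface TopErrorsParams {
  application?: string;
  routeId?: string;
  from: string;
  to: string;
  limit?: number;
}

function buildTopErrorsUrl(p: TopErrorsParams): string {
  const pairs: Array<[string, string]> = [];
  if (p.application !== undefined) pairs.push(["application", p.application]);
  if (p.routeId !== undefined) pairs.push(["routeId", p.routeId]);
  pairs.push(["from", p.from]);
  pairs.push(["to", p.to]);
  pairs.push(["limit", String(p.limit ?? 5)]); // default of 5 per the spec
  const query = pairs
    .map(([k, v]) => encodeURIComponent(k) + "=" + encodeURIComponent(v))
    .join("&");
  return "/api/v1/search/errors/top?" + query;
}
```

Omitting `application` and `routeId` yields the unscoped query, while passing both narrows it to a single route.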

Source: PostgreSQL query on executions table. Group by error_type column (added in V10 migration). Filter to status = 'FAILED' within the time range. Order by count descending, limit to N.
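The aggregation described here can be mirrored in memory to make the intended semantics concrete. The sketch is not the server-side query: the `ExecutionRow` fields are hypothetical, the rows are assumed to be pre-filtered to the time range, and the grouping uses the (error type, route) pair because the response shape includes `routeId`, which is an interpretation rather than something stated above.

```typescript
// Hypothetical projection of an executions row.
interface ExecutionRow {
  status: string;
  errorType: string | null;
  routeId: string;
  startTime: string; // ISO-8601, assumed comparable as a string
}

interface TopErrorRow {
  errorType: string;
  routeId: string;
  count: number;
  lastSeen: string;
}

function topErrors(rows: ExecutionRow[], limit = 5): TopErrorRow[] {
  const groups = new Map<string, TopErrorRow>();
  for (const row of rows) {
    // Keep only failed executions that recorded an error type.
    if (row.status !== "FAILED" || row.errorType === null) continue;
    const key = row.errorType + "\u0000" + row.routeId;
    const g = groups.get(key);
    if (g) {
      g.count += 1;
      if (row.startTime > g.lastSeen) g.lastSeen = row.startTime;
    } else {
      groups.set(key, { errorType: row.errorType, routeId: row.routeId, count: 1, lastSeen: row.startTime });
    }
  }
  const result: TopErrorRow[] = [];
  groups.forEach((g) => result.push(g));
  // Order by count descending, then keep the first N groups.
  return result.sort((a, b) => b.count - a.count).slice(0, limit);
}
```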

Level 3: Single Route

KPI Strip (4 metrics)

Same four metrics scoped to the selected route. Uses useExecutionStats({ application, routeId }).

Charts (3, in a row)

Throughput over time — Area chart, single series.
Latency percentiles over time — P50, P95, P99 lines with SLA threshold.
Error rate over time — Area chart, red-tinted.

Data source: useStatsTimeseries({ application, routeId }) — already exists.

Compact Process Diagram with Heatmap

Reuses the ProcessDiagram component but with a latency heatmap overlay instead of the execution overlay. Processor nodes are colored by aggregate performance:

Color scale: green (fast) → yellow (moderate) → red (slow), computed relative to the route's own processors (not absolute thresholds). The slowest processor in the route gets the warmest color.
Data source: useProcessorMetrics({ routeId, appId }) — already exists. Uses avgDurationMs or p99DurationMs for the color mapping.
Click interaction is limited to navigation: clicking a node scrolls to its row in the processor table below. The overlay is otherwise for visual identification only.
Compact sizing: fixed height (~250px), the diagram fits-to-view within that space.

Implementation: a new heatmapOverlay prop on ProcessDiagram (or a wrapper component) that takes a Map<processorId, { avgMs, p99Ms }> and colors nodes accordingly. Reuses the existing diagram layout and rendering — only the fill color logic changes.
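The relative color scale could be computed along these lines. This is a sketch under assumptions: the `heatmapColors` helper and the HSL mapping (hue 120 for green through 60 for yellow to 0 for red) are illustrative choices, and the fixed saturation and lightness values are placeholders rather than design tokens from the project.

```typescript
interface ProcessorLatency {
  avgMs: number;
  p99Ms: number;
}

// Maps each processor to a fill color, scaled relative to the fastest and
// slowest processor in the same route (no absolute thresholds).
function heatmapColors(
  metrics: Map<string, ProcessorLatency>,
  key: "avgMs" | "p99Ms" = "p99Ms"
): Map<string, string> {
  let min = Infinity;
  let max = -Infinity;
  metrics.forEach((m) => {
    min = Math.min(min, m[key]);
    max = Math.max(max, m[key]);
  });
  const colors = new Map<string, string>();
  metrics.forEach((m, processorId) => {
    // t = 0 for the fastest processor, 1 for the slowest.
    // A route whose processors all have the same latency stays green.
    const t = max > min ? (m[key] - min) / (max - min) : 0;
    const hue = Math.round(120 * (1 - t)); // 120 = green, 60 = yellow, 0 = red
    colors.set(processorId, `hsl(${hue}, 70%, 50%)`);
  });
  return colors;
}
```

A linear scale is the simplest option. If one processor is far slower than the rest, every other node will look green, so a logarithmic or rank-based scale may be worth evaluating.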

Processor Metrics Table

Columns: Processor ID, Type, Invocation Count, Avg Duration (ms), P99 Duration (ms), Error Rate (%).

Default sort: P99 descending (slowest processor first — highlights bottlenecks).

Data source: useProcessorMetrics({ routeId, appId }) — already exists.

Top 5 Errors

Same format as Level 2 but scoped to this route. Uses the same top-errors endpoint with routeId parameter.

Data Requirements Summary

Existing endpoints (no backend changes)

Endpoint	Used at
`GET /search/stats`	All levels (KPI strip)
`GET /search/stats/timeseries`	Level 2, Level 3 (charts)
`GET /routes/metrics`	Level 1 (app table, aggregated), Level 2 (route table)
`GET /routes/metrics/processors`	Level 3 (processor table + heatmap)

New endpoints needed

Endpoint	Used at	Source
`GET /search/stats/timeseries/by-app`	Level 1 (charts)	`stats_1m_app` view
`GET /search/stats/timeseries/by-route`	Level 2 (throughput chart)	`stats_1m_route` view
`GET /search/errors/top`	Level 2, Level 3 (top errors)	`executions` table

Existing frontend components reused

KpiStrip / TabKpis — KPI display with trends
DataTable — sortable tables
AreaChart, LineChart — time-series charts
Sparkline — compact trend in table cells
StatusDot — health indicators
ProcessDiagram — route visualization (extended with heatmap)

Scope Exclusions

No user-customizable panels or drag-and-drop layout (power users use Grafana)
No JVM/infrastructure metrics on Dashboard tab (that's Runtime tab)
No alerting or threshold configuration (out of scope)
No comparison mode (e.g., "this week vs last week" side-by-side)
No export/PDF functionality
