hsiegeln 6f39e29707 docs: add domain research (stack, features, architecture, pitfalls)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-11 11:05:37 +01:00

Feature Landscape

Domain: Transaction monitoring / observability for Apache Camel route executions
Researched: 2026-03-11
Confidence: MEDIUM (based on domain expertise from njams Server, Jaeger, Zipkin, Dynatrace; web search unavailable for latest feature sets)

Table Stakes

Features users expect. Missing = product feels incomplete.

Transaction Search and Filtering

Feature	Why Expected	Complexity	Notes
Search by time range	Every monitoring tool has this; primary axis for incident investigation	Low	Date picker with presets (last 15m, 1h, 24h, 7d, custom)
Filter by transaction state	SUCCESS/ERROR/WARNING is the first thing ops checks	Low	Multi-select checkboxes, counts per state
Filter by duration	Finding slow transactions is a core use case	Low	Min/max duration inputs, or predefined buckets
Full-text search across payload/attributes	Users need to find "that one order ID" across millions of records	Medium	Requires text index; match highlighting in results
Combined/compound filters	Users always combine: "errors in last hour on instance X"	Medium	AND-composition of all filter criteria
Paginated result list	Cannot load millions of rows; must page or virtual-scroll	Low	Cursor-based pagination preferred over offset for large datasets
Sort by time, duration, state	Basic result ordering	Low	Default: newest first
Filter by agent/instance	"Show me only transactions from production-instance-3"	Low	Dropdown populated from agent registry
Filter by route name	Users think in routes, not raw IDs	Low	Autocomplete from known route definitions
Save/bookmark search queries	Ops teams reuse the same searches during incidents	Medium	Named saved searches, shareable via URL
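
The AND-composition of filters and the cursor-based pagination noted in the table can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the project's implementation: the `Transaction` fields, the `search` function, and the `(start_time, id)` cursor shape are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    id: str
    start_time: int   # epoch milliseconds (assumed unit)
    state: str        # "SUCCESS", "WARNING" or "ERROR"
    instance: str


def search(rows, filters, limit, cursor=None):
    """Return (page, next_cursor), newest first.

    filters: predicates that must all match (AND-composition).
    cursor:  (start_time, id) of the last row of the previous page, or None.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    matching = [t for t in rows if all(f(t) for f in filters)]
    matching.sort(key=lambda t: (t.start_time, t.id), reverse=True)
    if cursor is not None:
        # Keyset step: keep only rows strictly older than the cursor position.
        matching = [t for t in matching if (t.start_time, t.id) < cursor]
    page = matching[:limit]
    has_more = len(matching) > limit
    next_cursor = (page[-1].start_time, page[-1].id) if has_more else None
    return page, next_cursor
```

Unlike an offset, the cursor stays valid when new transactions arrive between page requests, because it names a position in the sort order rather than a row count. A real store would push both the filters and the cursor condition into the query instead of filtering in memory.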

Transaction Detail and Drill-Down

Feature	Why Expected	Complexity	Notes
Transaction summary view	One-glance: state, start time, duration, instance, route entry point	Low	Header card in detail page
Activity list (per-route breakdown)	Hierarchical view of all route executions within a transaction	Medium	Tree or table showing each activity with timing
Activity timing waterfall	Visual timeline showing which routes executed when, and their overlap	Medium	Horizontal bar chart; critical for finding bottlenecks
Payload/attribute inspection	View message body, headers, properties at each activity step	Medium	Expandable sections; JSON/XML pretty-printing
Error detail with stack trace	When a transaction fails, users need the exception detail immediately	Low	Rendered stack trace with copy button
Cross-instance correlation	Transaction spans instances A and B -- show the full chain	High	Requires correlation ID propagation; single unified view
Link to route diagram	From any activity, jump to the diagram showing the route definition	Low	Hyperlink; depends on diagram storage existing
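
The activity timing waterfall amounts to mapping each activity's start and end time onto a shared horizontal axis. The sketch below shows one way to compute bar positions; the `Activity` fields and the `waterfall` function are hypothetical, and the integer column width stands in for whatever unit the UI actually draws in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    start_ms: int
    end_ms: int


def waterfall(activities, width=40):
    """Return (name, offset, length) bars scaled to `width` columns.

    Offsets are measured from the earliest activity start, so overlapping
    bars indicate routes that executed concurrently.
    """
    if not activities:
        return []
    t0 = min(a.start_ms for a in activities)
    t1 = max(a.end_ms for a in activities)
    span = max(t1 - t0, 1)  # avoid division by zero for instantaneous transactions
    bars = []
    for a in sorted(activities, key=lambda a: a.start_ms):
        offset = (a.start_ms - t0) * width // span
        length = max((a.end_ms - a.start_ms) * width // span, 1)  # keep tiny bars visible
        bars.append((a.name, offset, length))
    return bars
```

The longest bar that nothing else overlaps is usually the first place to look for a bottleneck.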

Route Diagram Visualization

Feature	Why Expected	Complexity	Notes
Render route diagram from stored definition	The core differentiator vs generic tracing tools; users think in Camel routes	High	Server-side or client-side rendering from graph model
Diagram versioning	Route changed last Tuesday -- show the diagram as it was when the transaction ran	Medium	Version stored per diagram; transaction references specific version
Zoom and pan	Diagrams can be large (50+ nodes); must be navigable	Medium	Standard canvas controls; minimap helpful for large diagrams
Execution overlay on diagram	Highlight which path the transaction actually took through the route	High	Color/annotate nodes with state (success/error), timing
Node click for activity detail	Click a node in the diagram to see the activity data for that step	Medium	Links diagram nodes to activity records
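
An execution overlay can be thought of as a join between the diagram's nodes and the activity records of one transaction. The following is a sketch only: the tuple layout of an activity, the `execution_overlay` name, and the choice to report the worst state per node are assumptions, not decisions taken from the project.

```python
# Higher number = more severe; used to pick the state shown on a node.
SEVERITY = {"SUCCESS": 0, "WARNING": 1, "ERROR": 2}


def execution_overlay(node_ids, activities):
    """Map each diagram node to its overlay data, or None if it was not executed.

    activities: iterable of (node_id, state, duration_ms) tuples.
    A node hit several times (e.g. inside a loop) reports its worst state,
    its summed duration, and its hit count.
    """
    overlay = {node_id: None for node_id in node_ids}
    for node_id, state, duration_ms in activities:
        if node_id not in overlay:
            continue  # activity refers to a node absent from this diagram version
        current = overlay[node_id]
        if current is None:
            overlay[node_id] = {"state": state, "duration_ms": duration_ms, "hits": 1}
        else:
            if SEVERITY[state] > SEVERITY[current["state"]]:
                current["state"] = state
            current["duration_ms"] += duration_ms
            current["hits"] += 1
    return overlay
```

Nodes that map to `None` are the branches the transaction did not take, which is what lets the UI grey them out.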

Agent Management

Feature	Why Expected	Complexity	Notes
Agent list with status	See all connected agents and their lifecycle state (LIVE/STALE/DEAD)	Low	Table with status indicator; auto-refresh
Agent heartbeat monitoring	Detect when an agent goes silent	Low	Timestamp of last heartbeat; threshold-based state transitions
Agent detail view	Instance name, version, connected routes, uptime, config	Low	Detail page per agent
Agent registration/deregistration	New agents register via bootstrap token; dead agents get cleaned up	Medium	Registration endpoint; TTL-based cleanup
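
The threshold-based state transitions for LIVE/STALE/DEAD can be illustrated with a small pure function. The threshold values below (30 seconds and 5 minutes) are placeholders chosen for the example, and `agent_state` is a hypothetical name; the source does not specify either.

```python
from enum import Enum


class AgentState(Enum):
    LIVE = "LIVE"
    STALE = "STALE"
    DEAD = "DEAD"


def agent_state(last_heartbeat_s, now_s, stale_after_s=30.0, dead_after_s=300.0):
    """Derive the lifecycle state from the time since the last heartbeat.

    Both thresholds are illustrative defaults, not values from the project.
    """
    silence = now_s - last_heartbeat_s
    if silence >= dead_after_s:
        return AgentState.DEAD
    if silence >= stale_after_s:
        return AgentState.STALE
    return AgentState.LIVE
```

Deriving the state from the stored heartbeat timestamp at read time, rather than storing the state itself, means no background job is needed just to flip LIVE to STALE; a TTL-based cleanup would still be needed to remove DEAD agents.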

Authentication and Security

Feature	Why Expected	Complexity	Notes
JWT-based API authentication	Secure the REST API; every enterprise monitoring tool requires auth	Medium	Token issuance, validation, refresh
Bootstrap token for agent registration	Agents need a way to initially register without pre-existing credentials	Low	Shared secret, single-use or time-limited
Ed25519 config signing	Agents must verify that config came from the server and was not tampered with	Medium	Key management, signature generation/verification
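
To make the bootstrap token row concrete, here is a sketch of a token store that is both single-use and time-limited (the table lists these as alternatives; combining them is an assumption made for the example). `BootstrapTokens`, its methods, and the one-hour default TTL are hypothetical.

```python
import hashlib
import secrets


class BootstrapTokens:
    """In-memory store of pending registration tokens (illustrative only)."""

    def __init__(self, ttl_s=3600.0):
        self._ttl_s = ttl_s
        self._pending = {}  # sha256(token) -> expiry time in seconds

    def issue(self, now_s):
        """Create a random token and remember only its hash and expiry."""
        token = secrets.token_urlsafe(32)
        self._pending[self._digest(token)] = now_s + self._ttl_s
        return token

    def redeem(self, token, now_s):
        """Return True once per token, and only before it expires."""
        # pop() removes the entry, so a second redeem of the same token fails.
        expiry = self._pending.pop(self._digest(token), None)
        return expiry is not None and now_s <= expiry

    @staticmethod
    def _digest(token):
        return hashlib.sha256(token.encode("utf-8")).hexdigest()
```

Storing only the hash means a leaked copy of the store does not reveal usable tokens. After a successful redeem, the agent would presumably receive longer-lived credentials such as a JWT.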

Dashboard and Overview

Feature	Why Expected	Complexity	Notes
Transaction volume chart (time series)	"How many transactions are we processing?" -- first question on login	Medium	Bar or line chart, grouped by time bucket
Error rate chart	"Is something broken right now?" -- second question	Medium	Error count or percentage over time
Active agents count	Quick health check of the agent fleet	Low	Simple counter with status breakdown
Recent errors list	Quick access to the latest failures without searching	Low	Pre-filtered list, auto-refreshing
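
Both charts reduce to grouping transactions by time bucket and counting. The sketch below computes volume and error percentage per bucket in memory; in practice this would likely be an aggregation query in the store. The `(epoch_seconds, state)` event shape and the function name are assumptions.

```python
from collections import defaultdict


def error_rate_by_bucket(events, bucket_s=60):
    """Group (epoch_seconds, state) events into fixed-size time buckets.

    Returns {bucket_start: (total, errors, error_percent)} ordered by time.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, state in events:
        bucket = ts - ts % bucket_s  # floor to the start of the bucket
        totals[bucket] += 1
        if state == "ERROR":
            errors[bucket] += 1
    return {
        b: (totals[b], errors[b], 100.0 * errors[b] / totals[b])
        for b in sorted(totals)
    }
```

The `total` column feeds the volume chart and the percentage feeds the error rate chart, so one pass over the data serves both.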

Differentiators

Features that set the product apart from generic tracing tools. Not expected, but valued.

Diagram-Centric Experience

Feature	Value Proposition	Complexity	Notes
Route diagram as primary navigation	Instead of trace waterfall, users navigate via the Camel route diagram -- this is how they think	High	Diagram becomes the entry point, not just a visualization
Execution heatmap on diagram	Color nodes by frequency/error rate over a time window -- shows hotspots	High	Aggregate stats per node; requires efficient querying
Side-by-side diagram comparison	Compare two diagram versions to see what changed in a route	Medium	Diff view highlighting added/removed/changed nodes
Diagram-based search	"Show me all failed transactions that passed through this node"	High	Click a node, get filtered transaction list
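
The side-by-side comparison row depends on a diff between two diagram versions. A minimal sketch, assuming each version can be reduced to a mapping from node ID to some comparable node definition (the `diff_diagrams` name and this representation are hypothetical):

```python
def diff_diagrams(old_nodes, new_nodes):
    """Classify nodes as added, removed, or changed between two versions.

    old_nodes / new_nodes: dict mapping node_id -> node definition,
    where definitions support equality comparison.
    """
    old_ids, new_ids = set(old_nodes), set(new_nodes)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "changed": sorted(
            n for n in old_ids & new_ids if old_nodes[n] != new_nodes[n]
        ),
    }
```

This only works if node IDs are stable across versions; if IDs are regenerated on each deployment, every node would appear as removed and re-added.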

Advanced Search and Analytics

Feature	Value Proposition	Complexity	Notes
Statistical duration analysis	P50/P95/P99 duration for a route over time -- detect degradation trends	Medium	Requires ClickHouse aggregation queries
Transaction comparison	Side-by-side diff of two transactions through the same route	Medium	Useful for "why did this one fail but that one succeed?"
Search result aggregations	Faceted counts: N errors, N warnings, distribution by route, by instance	Medium	ClickHouse GROUP BY queries alongside search results
Correlation graph	Visual graph showing how transactions flow across instances	High	Network diagram; requires correlation data
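
For readers less familiar with P50/P95/P99, the sketch below computes a percentile with the nearest-rank method over an in-memory list. It is meant only to show what the numbers mean; the table assumes the real computation happens in ClickHouse aggregation queries, and the `percentile` function here is a stand-in.

```python
import math


def percentile(durations_ms, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    the samples at or below it. `p` is in the range (0, 100]."""
    if not durations_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(durations_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

P50 describes the typical transaction, while P95 and P99 expose the slow tail that an average would hide, which is why degradation trends tend to show up there first.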

Configuration Push

Feature	Value Proposition	Complexity	Notes
Per-route tracing level control	Turn on detailed tracing for one problematic route without restarting the agent	Medium	SSE push of config change; agent applies dynamically
Bulk config push to agent groups	"Enable debug tracing on all production instances"	Medium	Agent tagging/grouping + batch SSE dispatch
Config history and rollback	See what config was active when, roll back a bad change	Medium	Versioned config storage with timestamps
Ad-hoc command dispatch	Send a "flush cache" or "reconnect" command to specific agents	Medium	Command/response pattern over SSE; command status tracking

Operational Intelligence

Feature	Value Proposition	Complexity	Notes
Alerting on error rate thresholds	Notify when error rate exceeds threshold for a route	High	Threshold evaluation, notification channels (email, webhook)
Anomaly detection on duration	Alert when P95 duration spikes compared to baseline	High	Statistical baseline computation; deviation detection
Scheduled data export	Export transaction data as CSV/JSON for compliance or reporting	Medium	Job scheduler; file generation; download endpoint
Retention policy management	Configure per-route or per-instance retention periods	Medium	TTL management in ClickHouse; UI for policy CRUD
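
One simple form of the deviation detection mentioned for anomaly detection is to compare the current P95 against the mean and standard deviation of earlier P95 values. This is a sketch of that one technique, not a recommendation; the function name, the `k=3.0` default, and the choice of a mean-plus-k-sigma rule are all assumptions.

```python
from statistics import mean, pstdev


def is_duration_anomaly(baseline_p95_ms, current_p95_ms, k=3.0):
    """Flag the current P95 if it exceeds baseline mean + k standard deviations.

    baseline_p95_ms: P95 values from earlier time windows.
    Returns False when there is too little baseline data to judge.
    """
    if len(baseline_p95_ms) < 2:
        return False
    mu = mean(baseline_p95_ms)
    sigma = pstdev(baseline_p95_ms)
    return current_p95_ms > mu + k * sigma
```

The early return reflects the point made under the deferral list: without accumulated baseline data there is nothing meaningful to compare against.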

Anti-Features

Features to explicitly NOT build.

Anti-Feature	Why Avoid	What to Do Instead
General APM metrics (CPU, memory, GC)	Out of scope; Cameleer is transaction-focused, not an APM tool. Adding metrics creates scope creep and competes with Prometheus/Grafana which do it better	Provide a link/integration point to external metrics tools if needed
Log aggregation/viewer	Transactions are not logs. Mixing them confuses the data model and competes with ELK/Loki	Store transaction payloads and attributes, not raw log lines
Custom dashboard builder	Enormous complexity for marginal value. Ops teams already have Grafana for custom dashboards	Provide good built-in dashboards; expose metrics via Prometheus endpoint for Grafana
Multi-tenancy	Adds auth complexity, data isolation, billing concerns. Single-tenant deployment is simpler and sufficient for the target audience	Deploy separate instances per environment/team
Mobile app	Ops teams use desktop browsers during incidents. Mobile adds huge UI complexity	Responsive web UI that works on tablets if needed
Plugin/extension system	Premature abstraction; adds API stability burden before the core is stable	Build features directly; consider plugins much later if demand emerges
Real-time streaming transaction view	"Firehose" views of all transactions in real-time look impressive but are useless at scale (millions/day). Users cannot process the stream	Provide auto-refreshing search results and recent errors list
AI/ML-powered root cause analysis	Hype-driven feature with poor reliability. Requires massive training data and domain-specific models	Provide good search, filtering, and comparison tools so humans can find root causes efficiently

Feature Dependencies

Agent Registration --> Agent List/Status
Agent Registration --> SSE Connection --> Config Push
Agent Registration --> SSE Connection --> Ad-hoc Commands

Transaction Ingestion --> Transaction Storage
Transaction Storage --> Transaction Search/Filtering
Transaction Search --> Transaction Detail View
Transaction Detail --> Activity Waterfall
Transaction Detail --> Payload Inspection
Transaction Detail --> Error Detail

Diagram Storage --> Diagram Rendering
Diagram Versioning --> Transaction-to-Diagram Linking
Diagram Rendering --> Execution Overlay (requires both diagram + activity data)
Diagram Rendering --> Execution Heatmap (requires aggregated activity data)
Diagram Rendering --> Diagram-based Search

Transaction Search --> Statistical Duration Analysis (aggregation of search results)
Transaction Search --> Search Result Aggregations

JWT Auth --> All REST API endpoints
Bootstrap Token --> Agent Registration
Ed25519 Signing --> Config Push

Transaction Storage --> Transaction Volume Chart (aggregation queries)
Transaction Storage --> Error Rate Chart (aggregation queries)
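
Reading each arrow above as "prerequisite --> feature it enables", a valid build order can be derived with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) on a subset of the edges; the `ENABLES` mapping and `build_order` function are hypothetical names introduced for the example.

```python
from graphlib import TopologicalSorter

# prerequisite -> features it enables (a subset of the arrows listed above)
ENABLES = {
    "Bootstrap Token": ["Agent Registration"],
    "Agent Registration": ["Agent List/Status", "SSE Connection"],
    "SSE Connection": ["Config Push", "Ad-hoc Commands"],
    "Ed25519 Signing": ["Config Push"],
    "Transaction Ingestion": ["Transaction Storage"],
    "Transaction Storage": ["Transaction Search/Filtering"],
    "Transaction Search/Filtering": ["Transaction Detail View"],
}


def build_order(enables):
    """Return features ordered so every prerequisite precedes what it enables."""
    sorter = TopologicalSorter()
    for prerequisite, features in enables.items():
        for feature in features:
            sorter.add(feature, prerequisite)  # feature depends on prerequisite
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(sorter.static_order())
```

Calling `build_order(ENABLES)` yields one of several valid orders; the check that matters is that no feature is scheduled before something it depends on.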

MVP Recommendation

Prioritize (Phase 1 -- Foundation):

1. Transaction ingestion and storage -- nothing works without data flowing in
2. Agent registration and lifecycle -- must know who is sending data
3. Basic transaction search (time range, state, duration) -- core value proposition
4. Transaction detail with activity breakdown -- users need to drill down

Prioritize (Phase 2 -- Core Experience):

5. Full-text search -- the "find that one transaction" use case
6. Route diagram rendering with version linking -- the Camel-specific differentiator
7. JWT authentication -- required before any production deployment
8. Dashboard overview (volume chart, error rate, agent status)

Prioritize (Phase 3 -- Differentiation):

9. Execution overlay on diagrams -- the killer feature that generic tools cannot offer
10. Config push via SSE -- operational value that justifies the agent-server architecture
11. Cross-instance correlation -- required for complex multi-instance Camel deployments

Defer:

Alerting: defer until core search and dashboard are solid; alerting without good data is noise
Data export: useful but not blocking; add when compliance demands arise
Anomaly detection: requires baseline data that only accumulates over time
Diagram-based search: powerful but depends on both diagram rendering and search being mature
Execution heatmap: requires significant aggregation infrastructure

Sources

Domain knowledge from njams Server (Integration Matters) feature set -- transaction monitoring for integration platforms, hierarchical transaction/activity model, route diagram visualization
Jaeger UI and Zipkin UI -- distributed tracing search, trace detail waterfall views, service dependency graphs
Dynatrace PurePath -- transaction-level drill-down, service flow visualization, statistical analysis
Apache Camel route model -- EIP-based visual representation, route definition structure
Project context from PROJECT.md and CLAUDE.md -- specific requirements, constraints, and architectural decisions

Confidence note: Feature categorization is based on training data knowledge of these products. Web search was unavailable to verify latest feature additions in 2025-2026 releases. The core feature landscape for this domain is mature and unlikely to have shifted dramatically, but specific UI patterns and newer differentiators may be missed. Confidence: MEDIUM.