diff --git a/camel-ops-prototype b/camel-ops-prototype
deleted file mode 160000
index e0a122f..0000000
--- a/camel-ops-prototype
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit e0a122f440fc918ae64cdec8c76a7922be6650c3
diff --git a/camel-ops-prototype/backend/control-plane/pom.xml b/camel-ops-prototype/backend/control-plane/pom.xml
new file mode 100644
index 0000000..8a739f8
--- /dev/null
+++ b/camel-ops-prototype/backend/control-plane/pom.xml
@@ -0,0 +1,20 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+    <parent>
+        <groupId>com.camelops</groupId>
+        <artifactId>camel-ops-backend</artifactId>
+        <version>0.0.1-SNAPSHOT</version>
+    </parent>
+    <artifactId>control-plane</artifactId>
+    <name>control-plane</name>
+    <description>Control Plane Application</description>
+
+    <dependencies>
+        <dependency>
+            <groupId>com.camelops</groupId>
+            <artifactId>shared-core</artifactId>
+            <version>${project.version}</version>
+        </dependency>
+    </dependencies>
+</project>
diff --git a/camel-ops-prototype/backend/pom.xml b/camel-ops-prototype/backend/pom.xml
new file mode 100644
index 0000000..8ee22d9
--- /dev/null
+++ b/camel-ops-prototype/backend/pom.xml
@@ -0,0 +1,22 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+    <groupId>com.camelops</groupId>
+    <artifactId>camel-ops-backend</artifactId>
+    <version>0.0.1-SNAPSHOT</version>
+    <packaging>pom</packaging>
+    <name>camel-ops-backend</name>
+    <description>Camel Ops Prototype Backend</description>
+
+    <modules>
+        <module>shared-core</module>
+        <module>control-plane</module>
+        <module>runner-appliance</module>
+    </modules>
+
+    <properties>
+        <java.version>17</java.version>
+        <spring-boot.version>3.2.3</spring-boot.version>
+    </properties>
+</project>
diff --git a/camel-ops-prototype/backend/runner-appliance/pom.xml b/camel-ops-prototype/backend/runner-appliance/pom.xml
new file mode 100644
index 0000000..f7f0713
--- /dev/null
+++ b/camel-ops-prototype/backend/runner-appliance/pom.xml
@@ -0,0 +1,20 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+    <parent>
+        <groupId>com.camelops</groupId>
+        <artifactId>camel-ops-backend</artifactId>
+        <version>0.0.1-SNAPSHOT</version>
+    </parent>
+    <artifactId>runner-appliance</artifactId>
+    <name>runner-appliance</name>
+    <description>Runner Appliance Application</description>
+
+    <dependencies>
+        <dependency>
+            <groupId>com.camelops</groupId>
+            <artifactId>shared-core</artifactId>
+            <version>${project.version}</version>
+        </dependency>
+    </dependencies>
+</project>
diff --git a/camel-ops-prototype/backend/shared-core/pom.xml b/camel-ops-prototype/backend/shared-core/pom.xml
new file mode 100644
index 0000000..7a9d030
--- /dev/null
+++ b/camel-ops-prototype/backend/shared-core/pom.xml
@@ -0,0 +1,13 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+    <parent>
+        <groupId>com.camelops</groupId>
+        <artifactId>camel-ops-backend</artifactId>
+        <version>0.0.1-SNAPSHOT</version>
+    </parent>
+
+    <artifactId>shared-core</artifactId>
+    <name>shared-core</name>
+    <description>Shared core entities and utilities</description>
+</project>
diff --git a/camel-ops-prototype/docs/ALERTING_PROMPT.md b/camel-ops-prototype/docs/ALERTING_PROMPT.md
new file mode 100644
index 0000000..cf1cd73
--- /dev/null
+++ b/camel-ops-prototype/docs/ALERTING_PROMPT.md
@@ -0,0 +1,9 @@
+# Architecture Update: Alerting & Anti-Spam
+
+**Context from User (Hendrik):**
+"I would expect we also need to add some kind of monitoring / alerting to our platform. using your example, we need to detect the failure at 3 am and be able to notify but not spam an operator."
+
+**Directives:**
+1. **Anti-Spam / Grouping:** We need smart alerting (grouping by route/business entity, debouncing, threshold-based). 100 failed messages = 1 alert, not 100 emails.
+2. **Local Evaluation:** Since TSDB/Logs are on the Customer Hub, alert evaluation MUST happen on the Hub (e.g., using `vmalert` and `Alertmanager`).
+3. **SaaS Management:** The SaaS Control Plane UI is where users configure the rules. These rules are pushed down to the Hub.
diff --git a/camel-ops-prototype/docs/EXTRACTION_PROMPT.md b/camel-ops-prototype/docs/EXTRACTION_PROMPT.md
new file mode 100644
index 0000000..c16f0f8
--- /dev/null
+++ b/camel-ops-prototype/docs/EXTRACTION_PROMPT.md
@@ -0,0 +1,9 @@
+# Architecture Update: nJAMS-Style Data Extraction (Java Agent)
+
+**Context from User (Hendrik):**
+"The architect and developer should think about a javaagent to mimic the njams agent for mulesoft and tibco to extract data from not-instrumented apache camel applications. we should be as 'apache camel native' as possible and also keep the monitoring overhead to an absolute minimum."
+
+**Directives for Architect & Developer:**
+1. **Zero-Code Change:** Customers should not have to rewrite their Camel XML or Java DSL. We attach an agent, and it works.
+2. **Camel Native:** Instead of blindly instrumenting JVM bytecode (which can break things and add overhead), the agent should ideally hook into Camel's native SPIs (`CamelContext`, `EventNotifier`, `InterceptStrategy`, `Tracer`).
+3. **Minimum Overhead:** Capturing payloads at every step can cause OutOfMemory errors and CPU spikes.
How do we extract data cheaply? (e.g., Ring buffers, sampling, payload truncation, extracting only on error, or dynamically toggling capture from the SaaS Control Plane).
diff --git a/camel-ops-prototype/docs/MULTI_NODE_PROMPT.md b/camel-ops-prototype/docs/MULTI_NODE_PROMPT.md
new file mode 100644
index 0000000..e0ad211
--- /dev/null
+++ b/camel-ops-prototype/docs/MULTI_NODE_PROMPT.md
@@ -0,0 +1,9 @@
+# Architecture Update: Multi-Node Hub/Worker Split
+
+**Context from User (Hendrik):**
+"Even from the very beginning on the installer must be able to install multiple worker nodes that can form a cluster. I also wonder, if we should seperate the roles more: worker nodes that host apache camel workloads, and control plane nodes that host the persistence (loki, the time series database, etc)... architect and developer need to know about this and update their spec. they should also critically think about that change and add to the conversation where required."
+
+**Directives:**
+1. We are moving from a single-node appliance to a multi-node cluster (embedded k3s).
+2. Role 1: "Appliance Hub" (runs local k3s control plane, VictoriaMetrics, Loki, SaaS tunnel).
+3. Role 2: "Appliance Worker" (runs Apache Camel workloads, OTEL forwarders).
diff --git a/camel-ops-prototype/docs/UX_AND_DEPLOY_PROMPT.md b/camel-ops-prototype/docs/UX_AND_DEPLOY_PROMPT.md
new file mode 100644
index 0000000..b6a428b
--- /dev/null
+++ b/camel-ops-prototype/docs/UX_AND_DEPLOY_PROMPT.md
@@ -0,0 +1,5 @@
+# Update: Modern UX and "Black Box" Deployment
+
+**Context from User (Hendrik):**
+1. **UX Modernization:** "We need to discuss the UI. I want a very modern, slick UI for the user. while inspired by njams, we should think about how the user could better navigate the ui to find what they are looking for... UX has evolved and users expect a better UI. make them working on the UX spec." (Focus on Cmd+K, Slide-outs, Visual Flow, Business-Entity First, 3 core screens).
+2.
**Deployment Flexibility (The "Black Box"):** "The installation should really be simple. The customer should just provide - for example - an empty Linux Alpine and we install into that. i do not want the customer to be required to know that we run kubernetes. the solution, however, should be designed in such a way that it could be installed into a customer's k8s and work nicely with that. Think about some customers do not have a skilled it ops department any more."
diff --git a/camel-ops-prototype/docs/WAR_ROOM_PROMPT.md b/camel-ops-prototype/docs/WAR_ROOM_PROMPT.md
new file mode 100644
index 0000000..24c5338
--- /dev/null
+++ b/camel-ops-prototype/docs/WAR_ROOM_PROMPT.md
@@ -0,0 +1,4 @@
+# Architecture Pivot: Local Data, SaaS Management
+
+**Context from User (Hendrik):**
+"The architect and developer need to update their strategies according - the three agents should discuss with each other to find the best fit. The solution should be very easy to install and operate, instrument the apache camel applications to retrieve data in nJAMS style. I understand we may have to move the persistence layer (like log aggregation, time series db) to the customer site. As mentioned before, that is totally ok. For the saas offering we will deploy it into the cloud, but manage it via the saas control plane. our customers will be managed via the saas control plane, but all data will reside with the customers."
diff --git a/camel-ops-prototype/docs/architecture_outline.md b/camel-ops-prototype/docs/architecture_outline.md
new file mode 100644
index 0000000..bfcc64a
--- /dev/null
+++ b/camel-ops-prototype/docs/architecture_outline.md
@@ -0,0 +1,96 @@
+# Technical Architecture Outline: Apache Camel Operations Platform
+
+**Objective:** Define the system design and boundary between the Runner (execution) and the Control Plane (management), strictly scoped for a 6-week prototype.
+
+## 1.
High-Level Architecture Paradigm
+The system follows a distributed execution model:
+* **Control Plane (SaaS):** Centralized management, monitoring, and deep observability hub.
+* **Runner (Appliance):** Packaged, deployable execution environments hosted where the customer's data gravity resides.
+
+## 2. The Runner (k3s Appliance)
+* **Role:** The execution engine for Apache Camel workloads.
+* **Core Infrastructure:** Packaged lightweight Kubernetes (k3s) cluster.
+* **Components for Prototype:**
+  * **Camel Engine:** The runtime environment (e.g., Camel K or plain Camel Spring Boot/Quarkus containers).
+  * **Telemetry Agent:** A local collector responsible for gathering deep execution metrics and traces (the USP) and buffering them.
+  * **Appliance Agent:** A lightweight operator managing the lifecycle of the Runner and communicating with the Control Plane.
+
+## 3. The Control Plane (SaaS)
+* **Role:** Management UI and centralized observability dashboard (nJAMS style).
+* **Components for Prototype:**
+  * **Ingestion API:** Secure endpoint to receive telemetry data from remote Runners.
+  * **Management API / UI:** Interface for the user to view deployments, logs, and deep traces.
+  * **Datastore:** Time-series/trace storage (e.g., managed Prometheus/Grafana stack or Jaeger for the prototype to save time).
+
+## 4. Communication & Security
+* **Challenge:** Securely connecting customer-hosted Runners to the SaaS Control Plane without requiring inbound firewall rules on the customer side.
+* **Prototype Strategy:**
+  * **Outbound-Only:** The Appliance Agent initiates an outbound connection (e.g., via gRPC, Secure WebSockets, or mTLS) to the Control Plane.
+  * **Authentication:** Mutual TLS (mTLS) or pre-shared bearer tokens provisioned during Runner bootstrap.
+
+## 5. Observability (The USP)
+* *Pragmatic Approach:* Leverage existing standards (OpenTelemetry) for transport and generic instrumentation to save time.
+* *Innovation Focus:* Build custom OTel processors/exporters specifically designed for Apache Camel internals to provide the "nJAMS-style" deep visibility that standard APMs miss.
+
+## 6. 6-Week Prototype Constraints & Trade-offs
+To hit the 6-week milestone, we must favor "boring," reliable tech for the core and only innovate on the observability USP:
+* **In Scope:**
+  * Single-node k3s Runner deployment.
+  * Basic SaaS ingestion endpoint.
+  * Deep trace visualization of a single, complex Camel route.
+  * Outbound-only secure telemetry push.
+* **Out of Scope (Post-Prototype):**
+  * High Availability (HA) Control Plane or multi-node k3s clustering.
+  * Complex RBAC and multi-tenant deep isolation (use basic logical separation for now).
+  * The "On-Premises" variant of the Control Plane (focus strictly on SaaS for the prototype to reduce deployment complexity).
+  * Complex lifecycle management/upgrades of the k3s appliance itself.
+
+## Pivot: Local Data Persistence
+
+Based on the latest strategy alignment, the architecture has pivoted to a localized data model to ensure strict data privacy and reduce SaaS infrastructure costs:
+
+* **SaaS Control Plane (Pure Management):** The cloud offering is now strictly a control plane. It handles user authentication, centralized UI, configuration management, and remote orchestration. It does *not* ingest, store, or process customer telemetry, logs, or payload data.
+* **Customer-Side Persistence:** The persistence layer (Time Series Database, log aggregation, and trace storage) is completely moved to the customer site, embedded within the Runner appliance. All sensitive execution data resides exclusively with the customer.
+* **Zero-Config Installation:** To ensure smooth adoption, the Runner appliance must be extremely easy to install.
It will function as a "1-click" or zero-config deployment (e.g., a single `docker compose up`, a self-contained binary, or a pre-configured k3s appliance) that automatically spins up the local datastores, the telemetry collectors, and the connection agent.
+* **nJAMS-style Instrumentation:** Apache Camel applications will be deeply instrumented to extract step-by-step route progression, properties, and payload snapshots (the nJAMS-style USP). This data is written directly to the local datastore. The SaaS Control Plane will view this data by securely querying the Runner's local API via an outbound-established tunnel (e.g., reverse proxy tunnel), ensuring no inbound firewall exceptions are required at the customer site.
+
+## 7. Alerting Data Flow & Anti-Spam
+
+Since telemetry and logs (Zero-Trust Payload) reside purely on the Customer Hub (Appliance/Runner), alert evaluation **must** occur locally on the Hub. This prevents sensitive data from ever hitting the SaaS Control Plane, while ensuring low-latency alerting.
+
+* **Rule Synchronization (SaaS -> Hub):**
+  * **Configuration in SaaS:** Users define alert rules (e.g., failure thresholds like ">5 errors/min", route groupings, or latency SLAs) within the SaaS Control Plane UI.
+  * **Secure Sync:** The SaaS plane pushes these rule definitions down to the Hub via the established outbound-only tunnel (e.g., gRPC/WebSocket).
+  * **Local Translation:** The Hub agent translates these definitions into the formats understood by the local stack (e.g., `vmalert` alerting rules for evaluation, with `Alertmanager` handling grouping and routing).
+* **Local Evaluation & Anti-Spam:**
+  * The Hub's localized time-series database and log stores are continuously evaluated against the synced rules.
+  * **Minimizing Alert Fatigue:** To prevent operator spam (e.g., a backend outage generating 1,000 separate failure alerts), the local evaluation engine heavily utilizes grouping (by Camel route or business entity), debouncing, and rate-limiting.
A spike in errors results in a single, aggregated "Degraded Route" alert rather than individual error emails.
+* **Notification Delivery:**
+  * **Direct Outbound (Preferred):** To maintain the Zero-Trust posture, the Hub dispatches notifications directly to target platforms (e.g., Slack webhooks, PagerDuty, or local SMTP). The SaaS securely provisions the necessary webhook URLs/API tokens to the Hub during the rule sync phase.
+  * **SaaS Relay (Fallback):** In strictly firewalled environments where the Hub cannot reach external APIs (like Slack) directly, it can route a sanitized, payload-free notification event back through the secure tunnel to the SaaS Control Plane, which then relays it to the third-party service.
+
+## 8. Java Agent (nJAMS Style Data Extraction)
+
+To deliver the core USP of deep, step-by-step visibility into un-instrumented Apache Camel applications without requiring customer code changes, we will deploy a custom Java Agent.
+
+### Architectural Trade-offs: Bytecode Instrumentation vs. Native Camel SPI
+
+When designing the agent, there are two primary approaches for extracting execution data:
+
+1. **Pure Bytecode Instrumentation (e.g., OpenTelemetry Auto-Instrumentation)**
+   * *Pros:* Broadest coverage across standard libraries (HTTP, JDBC, Kafka). Follows industry standards.
+   * *Cons:* Often treats Camel as a black box. Modifying bytecode dynamically is brittle, can introduce significant overhead, and risks breaking complex Camel routing logic or custom components. It lacks semantic understanding of Camel's internal state (Exchange, Message, properties).
+
+2. **Injecting Native Camel SPIs via the Agent (Our Approach)**
+   * *Pros:* Extremely "Camel Native." By using the Java Agent simply as a delivery mechanism to inject native Camel SPIs (like `EventNotifier`, `InterceptStrategy`, or `Tracer`) into the `CamelContext` at startup, we work *with* the framework, not against it. It provides precise access to the `Exchange`, routing slip, and payload without rewriting bytecode at every execution step.
+   * *Cons:* Tightly coupled to Camel versions (though the SPIs are relatively stable). Misses non-Camel JVM operations unless combined with standard OTEL.
+
+**Decision:** We will use the Java Agent primarily to bootstrap and inject a native Camel `EventNotifier` and/or `InterceptStrategy` into the application's `CamelContext`. This provides the safest, most semantically rich data extraction method.
+
+### Guaranteeing Absolute Minimum Overhead
+
+Extracting payloads at every step of a complex integration can easily cause OutOfMemory errors and CPU spikes. To guarantee "absolute minimum overhead," the agent architecture will implement the following safeguards:
+
+* **Dynamic Payload Capture Toggles:** By default, the agent captures *metadata only* (headers, step execution times, route paths). Full payload capture is disabled. It can be dynamically toggled on/off at runtime via the SaaS Control Plane (e.g., "Trace next 5 messages for Route X").
+* **Exception-Only / Dead-Letter Extraction:** The agent automatically captures the full payload and state *only* when an exception occurs or a message hits a Dead Letter Channel (DLC). This ensures critical debugging data is available without paying the cost for successful transactions.
+* **Bounded Queues & Ring Buffers:** Data extraction happens asynchronously. Extracted events are placed into bounded, non-blocking ring buffers (e.g., LMAX Disruptor or similar low-latency queues). If the telemetry exporter cannot keep up, older events are aggressively dropped rather than blocking the Camel execution thread or consuming heap space.
+* **Payload Truncation:** When payload capture is enabled, a hard limit (e.g., 50KB) is enforced. Large streams or files are never fully materialized in memory by the agent.
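The drop-oldest buffering and payload truncation described above can be sketched as follows. This is a minimal, hypothetical illustration: the `TelemetryBuffer` class, its method names, and the character-based limit (standing in for the 50KB byte limit) are our assumptions, not the agent's actual implementation, and a production agent would use a pre-allocated ring buffer (e.g., LMAX Disruptor) rather than a JDK queue.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical sketch: a bounded, drop-oldest telemetry buffer.
// If the exporter falls behind, the oldest events are discarded so the
// Camel execution thread is never blocked and heap usage stays bounded.
public final class TelemetryBuffer {
    private final ArrayBlockingQueue<String> events;

    public TelemetryBuffer(int capacity) {
        this.events = new ArrayBlockingQueue<>(capacity);
    }

    // Called from the hot path; never blocks. On overflow, drop the
    // oldest event instead of the newest so recent context survives.
    public void publish(String event) {
        while (!events.offer(event)) {
            events.poll(); // aggressively drop older events
        }
    }

    // Enforce a hard payload limit before buffering, so large bodies
    // are never retained in full by the agent.
    public static String truncate(String payload, int maxChars) {
        if (payload.length() <= maxChars) {
            return payload;
        }
        return payload.substring(0, maxChars) + "...[truncated]";
    }

    public int size() {
        return events.size();
    }
}
```

In a real agent the entries would be structured events rather than strings, and the drop path would increment a "dropped events" counter so that data loss is itself observable.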
\ No newline at end of file diff --git a/camel-ops-prototype/docs/audience_research.md b/camel-ops-prototype/docs/audience_research.md new file mode 100644 index 0000000..678b7da --- /dev/null +++ b/camel-ops-prototype/docs/audience_research.md @@ -0,0 +1,45 @@ +# Audience Research: Apache Camel Day 2 Operations + +## Introduction +This document aggregates real-world pain points regarding Apache Camel Day 2 operations, specifically focusing on observability, debugging, and general developer frustration. The insights are drawn directly from developer forums like Reddit and Stack Overflow. + +## Real-World Pain Points +1. **The "Black Box in Motion" Problem:** Developers struggle to trace a message as it flows through complex Camel routes. Once deployed, tracking exactly which step failed or transformed a payload incorrectly is extremely difficult without extensive manual logging. +2. **Inscrutable Error Messages:** The combination of Camel's DSL, diverse components, and underlying frameworks (like Spring or Karaf) often results in stack traces that are long, complex, and unhelpful. +3. **Silent Failures:** Scenarios where exceptions are thrown but the debugger never breaks, or where errors aren't properly logged, leaving developers guessing about what went wrong. +4. **Steep Learning Curve for Operations:** "You start needing to hire specialised Camel people." Maintaining and debugging routes requires deep framework knowledge rather than just standard Java debugging skills. +5. **Environment Complexity:** Running routes inside containers like Apache Karaf or complex Spring Boot setups makes traditional step-through debugging nearly impossible. 
+ +## Exact Language / Voice of Customer (VoC) +- *"Looks fucking Byzantine with DSL, components, who knows what else, and probably error messages that are totally inscrutable as well."* +- *"I definitely feel your pain, Camel has been a cause of much unmerriment for me, also."* +- *"Debugging a Camel route could sometimes be a challenge."* +- *"I tried to use `onException()`... and the debugger never broke. Nothing logged and nothing discovered when debugging."* +- *"Without observability across reasoning steps... it’s like debugging a black box in motion."* + +## Build in Public Drafts (Agitating Specific Pains) + +### Draft 1: The "Byzantine" DSL +**Hook:** Ever feel like Apache Camel error messages are intentionally hiding the actual problem from you? +**Body:** I was browsing Reddit today and a developer described Camel as "fucking Byzantine with DSL, components... and error messages that are totally inscrutable." They aren't wrong. Day 2 operations with Camel shouldn't require you to be a framework historian just to find out why a message dropped. +**Action/Teaser:** We're building a tool that translates inscrutable Camel stack traces into plain English, highlighting the exact component that failed. No more guessing. #BuildInPublic #ApacheCamel #DeveloperExperience + +### Draft 2: The Silent Failure Nightmare +**Hook:** "I set a breakpoint in `onException()`... and it never broke. Nothing logged. Nothing discovered." +**Body:** This is the reality of Day 2 operations in Apache Camel. A message enters the route, disappears into the ether, and your logs give you absolute silence. You shouldn't have to pepper your routes with `.log()` statements just to prove your app is alive. +**Action/Teaser:** I'm working on a visual observability layer for Camel that shows you exactly where a payload stops, even if the framework swallows the exception. Let's kill the silent failures. Who else hates this? 
#ApacheCamel #Observability #BuildInPublic + +### Draft 3: The "Black Box in Motion" +**Hook:** Debugging an Apache Camel route in production is like debugging a black box in motion. +**Body:** During Day 1, defining declarative routes feels like magic. But on Day 2, when a payload transforms unexpectedly halfway through a 10-step route, that magic turns into a nightmare. You're left digging through generic application logs trying to piece together a distributed puzzle. +**Action/Teaser:** We're building a way to replay and trace exact message flows step-by-step in production environments. See exactly what your payload looked like before and after that `.process()` step. #Integration #BuildInPublic #TechStartup + +### Draft 4: The Specialized Hire Problem +**Hook:** Why does maintaining Apache Camel routes require a Ph.D. in Enterprise Integration Patterns? +**Body:** A CTO on Reddit nailed it: "You start needing to hire specialised Camel people." When standard Java developers can't easily debug your integration layer, your bus factor drops to 1, and your maintenance costs skyrocket. Day 2 ops shouldn't be gated behind a massive learning curve. +**Action/Teaser:** I'm designing a platform that democratizes Camel observability. If you know how an API works, you should be able to debug a Camel route. Making the complex, simple. #CTO #EngineeringLeadership #BuildInPublic + +### Draft 5: The Karaf/OSGi Debugging Headache +**Hook:** "How can I debug Aries Blueprint running inside of Apache Karaf?" +**Body:** If reading that sentence just gave you a stress headache, you've probably worked with Enterprise Camel deployments. The environment complexity of Day 2 operations makes traditional step-through debugging nearly impossible. +**Action/Teaser:** We need better ways to inspect running routes without attaching remote debuggers to complex container environments. Working on a drop-in agent that visualizes your actual deployed routes and their live traffic. Stay tuned. 
#ApacheCamel #JavaDeveloper #BuildInPublic
diff --git a/camel-ops-prototype/docs/blackbox_appliance.md b/camel-ops-prototype/docs/blackbox_appliance.md
new file mode 100644
index 0000000..29bf74c
--- /dev/null
+++ b/camel-ops-prototype/docs/blackbox_appliance.md
@@ -0,0 +1,78 @@
+# Dual-Deployment Architecture: The Black Box Appliance & Native Kubernetes
+
+## Executive Summary
+To capture both ends of the market, from organizations lacking dedicated IT Ops to enterprises with mature DevSecOps practices, the Runner will employ a dual-deployment architecture. At its core, the application is packaged as a standard Kubernetes Helm Chart. However, the delivery mechanism forks into two distinct paths: a seamless, invisible "Black Box" appliance for bare-metal/VMs, and a raw Helm deployment for existing clusters.
+
+---
+
+## 1. The 'Naked VM' Appliance (The Black Box)
+
+**Target Audience:** Customers without a skilled IT Operations department. They want a turnkey solution that "just works" on a blank Linux machine (e.g., Alpine or Ubuntu).
+
+**User Experience:**
+The user provisions clean Linux VMs, runs role-specific commands, and a multi-node cluster is formed. They never see or interact with Kubernetes directly.
+
+```bash
+# Provision the Control Plane (Hub)
+curl -sfL https://get.our-runner.com | bash -s -- --role hub --token <TOKEN>
+
+# Provision Worker Nodes (Camel Workloads)
+curl -sfL https://get.our-runner.com | bash -s -- --role worker --hub-ip <HUB_IP> --token <TOKEN>
+```
+
+**Technical Mechanisms:**
+* **Multi-Node Embedded K3s:** The script silently downloads and installs K3s. The Hub runs the K3s control plane, VictoriaMetrics, Loki, and the SaaS tunnel. Workers join the Hub and run Apache Camel workloads and OTEL forwarders.
+* **Helm Bootstrapping:** Once K3s is up, the script uses a bundled `helm` binary to automatically deploy our standard Helm chart into the embedded cluster.
+* **Pre-configured Ingress & Storage:** K3s ships natively with Traefik (for Ingress) and Local Path Provisioner (for Persistent Volume Claims). The installer wires these up automatically, exposing the modern web UI on standard ports 80/443.
+* **The `runner-ctl` CLI Wrapper:** To completely hide K8s, we provide a host-level CLI binary (e.g., `runner-ctl`).
+  * `runner-ctl status` (maps to `kubectl get pods`)
+  * `runner-ctl logs` (maps to `kubectl logs`)
+  * `runner-ctl backup` (triggers a local volume snapshot/export)
+  * `runner-ctl update` (pulls new images and runs `helm upgrade`)
+
+---
+
+## 2. The 'Native K8s' Deployment
+
+**Target Audience:** Enterprise customers with dedicated Ops/Platform teams who manage their own Kubernetes clusters (EKS, GKE, AKS, or on-prem OpenShift).
+
+**User Experience:**
+Ops teams deploy the Runner exactly like any other modern cloud-native application using their existing GitOps pipelines (ArgoCD, Flux) or CLI tools.
+
+```bash
+helm repo add runner https://charts.our-runner.com
+helm install my-runner runner/runner-app -f my-values.yaml
+```
+
+**Technical Mechanisms:**
+* **Standardized Artifacts:** We distribute pure OCI-compliant Helm charts and container images hosted on a standard registry.
+* **Pluggable Infrastructure:** The Helm chart's `values.yaml` allows enterprise Ops to swap out embedded components for their own enterprise-grade equivalents:
+  * Overriding the Ingress class (e.g., to use NGINX or AWS ALB instead of Traefik).
+  * Specifying custom `storageClassName` (e.g., EBS, Portworx).
+  * Integrating with external databases if preferred over in-cluster databases.
+  * Emitting metrics via `ServiceMonitor` resources to plug into the customer's existing Prometheus/Grafana stack.
+
+---
+
+## How We Achieve This Seamlessly
+
+1. **Single Source of Truth:** We maintain **only one** core deployment artifact: the Helm Chart. The Black Box appliance is simply a lightweight wrapper that sets up the environment (K3s) and then consumes that exact same Helm chart. We do not maintain separate `docker-compose.yml` or native systemd service files for the application logic.
+2. **Graceful Defaults vs. Explicit Overrides:** The Helm chart is designed with smart defaults that assume an appliance environment (using standard k3s classes). Native K8s users simply override these defaults in their `values.yaml`.
+3. **Container-Native from Day 1:** By standardizing on Kubernetes primitives (Deployments, StatefulSets, PVCs, ConfigMaps), the application itself doesn't need to know whether it's running on a massive AWS EKS cluster or a tiny Alpine VM under someone's desk.
+
+## Critical Analysis: The Multi-Node Reality
+
+**Architectural Verdict for a 6-Week MVP: High Risk, Defer to Post-MVP.**
+
+While separating the Hub (control plane, persistence) and Workers (Camel workloads) provides a scalable foundation and maps cleanly to Kubernetes concepts, attempting to deliver a multi-node, "invisible" embedded K3s cluster to on-premise customers introduces severe operational complexities for a 6-week MVP.
+
+1. **Node-to-Node Networking Complexity:**
+   In a single-node "Black Box," localhost handles internal communication. In a multi-node cluster, we rely on a Container Network Interface (CNI like Flannel) establishing overlay networks across customer-provided infrastructure. If the customer's VMs are on different subnets, routing rules or MTU issues can cause silent failures that our installer cannot automatically diagnose.
+
+2. **Customer Firewalls & Ports:**
+   To cluster K3s nodes, specific ports must be open between the Hub and Workers (e.g., 6443 for Kube API, 8472 for Flannel VXLAN, 10250 for Kubelet). In strict enterprise environments lacking dedicated IT Ops (our target audience for the Black Box), these ports are almost certainly blocked by default host firewalls or network ACLs.
The "single command" promise breaks down into complex firewall troubleshooting sessions. + +3. **High Availability (HA) for the Hub:** + A single Hub node acts as a Single Point of Failure (SPOF). If the Hub goes down, the control plane is lost, and workers cannot schedule new workloads or reliably ship logs/metrics. Building true HA for the Hub requires a 3-node etcd cluster or an external datastore (like PostgreSQL), completely destroying the "simple Black Box appliance" value proposition. + +**Recommendation:** +For the 6-week MVP, we must restrict the "Black Box" appliance to a **Single-Node architecture**. It will still run embedded K3s, but it will host *both* the control plane/persistence and the Camel workloads on the same machine. This guarantees the "curl and run" experience without network troubleshooting. The multi-node cluster feature (and the `--role` flags) should be clearly scoped as a Fast-Follow or V2 feature, allowing us to validate Product-Market Fit with a reliable single-node appliance first. Enterprise customers who require multi-node scale out of the gate should be directed to the "Native K8s Deployment" method using their own pre-configured clusters. diff --git a/camel-ops-prototype/docs/build_in_public.md b/camel-ops-prototype/docs/build_in_public.md new file mode 100644 index 0000000..4fda6dc --- /dev/null +++ b/camel-ops-prototype/docs/build_in_public.md @@ -0,0 +1,61 @@ +# 'Build in Public' Strategy: First 2 Weeks +**Product:** Apache Camel Operations & Observability Platform +**Goal:** Validate market need, articulate the "nJAMS" value prop for modern Camel, and attract early-adopter design partners for our 6-week prototype phase. +**Target Audience:** Integration Architects, DevOps Engineers, and Camel Developers. + +--- + +## Strategy Overview +Our "Build in Public" (BiP) motion is not about sharing code; it’s about sharing **vision, architecture, and problem-solving**. 
We want to agitate the pain of "Day 2" operations and position our hybrid platform as the specialized solution that generic APMs (like Datadog/Dynatrace) cannot provide. + +### Content Pillars +1. **The Gap:** Why generic observability fails Apache Camel (payload visibility, traceability). +2. **The Architecture:** Hybrid SaaS + k3s edge runners (enterprise-ready from Day 1). +3. **The Journey:** Transparent progress on our 6-week MVP sprint. + +--- + +## Week 1: The Problem & The Pivot +**Theme:** Agitating the pain and establishing credibility. + +* **Day 1 (Monday): The "Why" - Exposing the Gap** + * *Hook:* "Generic APMs monitor infrastructure. They don't understand your integration payloads." + * *Content:* Discuss why Apache Camel needs specialized Day 2 observability. Highlight the pain of debugging complex routes without payload inspection. + * *CTA:* "Who else is struggling with Camel black boxes in production?" + +* **Day 3 (Wednesday): The Legacy & The Lesson** + * *Hook:* "What I learned building nJAMS, and why the modern Camel ecosystem needs a successor." + * *Content:* A founder's story tying 20+ years of enterprise integration experience to the current market gap. Establish authority. + * *CTA:* Follow the journey as we build the modern equivalent for Camel. + +* **Day 5 (Friday): The Architecture Tease** + * *Hook:* "Why we chose a Hybrid SaaS Control Plane + k3s Runner architecture." + * *Content:* High-level architectural diagram. Explain why we are prioritizing an on-prem/edge runner model for data privacy while keeping SaaS management. + * *CTA:* "Architects: Does this deployment model fit your security requirements?" + +--- + +## Week 2: The Solution & The MVP +**Theme:** Showing the product vision and securing design partners. + +* **Day 8 (Monday): Day 1 vs. Day 2 Operations** + * *Hook:* "Deployment isn't the finish line. It's the starting line." 
+ * *Content:* Contrast the ease of Day 1 deployment with the nightmare of Day 2 troubleshooting. Frame our platform as the ultimate Day 2 companion. + * *CTA:* Share your worst Day 2 Camel incident. + +* **Day 10 (Wednesday): Deep Observability (Show, Don't Tell)** + * *Hook:* "Stop guessing. Start inspecting." + * *Content:* Share a wireframe or low-fidelity mockup of the traceability UI showing step-by-step payload inspection across a Camel route. + * *CTA:* "What data point is missing from this view?" + +* **Day 12 (Friday): The 6-Week Prototype Challenge** + * *Hook:* "We're sprinting to an MVP in 6 weeks. We need 5 brutally honest teams to break it." + * *Content:* Announce the timeline. Clearly state what the MVP will (and won't) do. + * *CTA:* Call for early-adopter design partners. Link to a waitlist or direct message prompt. + +--- + +## Execution Notes +* **Channels:** LinkedIn (primary for enterprise architects) and Twitter/X (secondary for tech community). +* **Tone:** Strategic, opinionated, authoritative, and customer-obsessed. No fluff. +* **Metrics for Success:** Qualitative engagement from target personas (Architects/Ops), waitlist signups for the 6-week prototype. \ No newline at end of file diff --git a/camel-ops-prototype/docs/camel_agent_spec.md b/camel-ops-prototype/docs/camel_agent_spec.md new file mode 100644 index 0000000..acaee4c --- /dev/null +++ b/camel-ops-prototype/docs/camel_agent_spec.md @@ -0,0 +1,58 @@ +# Apache Camel Java Agent Specification + +## 1. Overview +To achieve "zero-code change" data extraction for Apache Camel applications (similar to the nJAMS agent for MuleSoft/TIBCO), we utilize the JVM's `java.lang.instrument` API via a `-javaagent` flag. This agent dynamically attaches to the target JVM, hooks into the Camel Context initialization, and injects native Camel telemetry SPIs (`InterceptStrategy` and `EventNotifier`) without requiring developers to alter their Camel routes or application code. + +## 2. 
Technical Implementation: The `-javaagent` Hook + +### JVM Instrumentation & Bytecode Manipulation +When the application starts with `-javaagent:/path/to/camel-agent.jar`, the JVM invokes the agent's `premain` method before the application's `main` method. + +1. **Bytecode Interception (ByteBuddy / ASM):** + The agent registers a `ClassFileTransformer` via the `Instrumentation` API. Using a library like [ByteBuddy](https://bytebuddy.net/) (which is safer and easier to maintain than raw ASM), we instrument the `org.apache.camel.impl.engine.AbstractCamelContext` or `DefaultCamelContext` classes depending on the Camel version. + +2. **Hooking Context Initialization:** + We intercept the `start()` or `build()` methods of the `CamelContext`. + + ```java + // Pseudo-code for ByteBuddy Advice + @Advice.OnMethodEnter + public static void onStart(@Advice.This CamelContext context) { + CamelAgentRegistrar.register(context); + } + ``` + +3. **Registering Native Camel SPIs:** + Inside `CamelAgentRegistrar.register(context)`, we inject our monitoring components directly into the context before the routes start processing messages. Because we are inside the JVM, we have direct access to the `CamelContext` object: + + * **`EventNotifier`:** Added via `context.getManagementStrategy().addEventNotifier(...)`. This captures macro lifecycle events (`ExchangeCreatedEvent`, `ExchangeCompletedEvent`, `ExchangeFailedEvent`) with extremely low overhead. + * **`InterceptStrategy`:** Added via `context.addInterceptStrategy(...)`. This wraps every `Processor` in the route definition. It allows us to capture the payload (`Exchange.getIn().getBody()`), headers, and properties before and after each discrete step in the integration flow. + +This approach guarantees zero-code instrumentation while remaining entirely "Camel Native," avoiding the fragility of intercepting arbitrary application methods. + +## 3. 
High-Performance Payload Extraction + +Capturing payloads at every step of a Camel route can introduce severe GC (Garbage Collection) pauses, CPU spikes, and memory bloat. We must extract data cheaply. + +### Serialization Strategy + +1. **Type-Aware Truncation (The First Line of Defense):** + Before attempting deep serialization, we inspect the payload type. If it's a `String`, `byte[]`, or `InputStream` (common in Camel), we read only the first *N* bytes (e.g., 4KB limit) and discard the rest. We use cached/pooled buffers to read streams without allocating new byte arrays per exchange. + +2. **Fast Object Serialization (Kryo):** + If the payload is a complex POJO, standard Java serialization or Jackson/Gson JSON mapping is too slow and allocates too much memory. + * We embed [Kryo](https://github.com/EsotericSoftware/kryo), a fast, efficient object graph serialization framework. + * Kryo instances are heavily pooled (e.g., using `ThreadLocal` or a fast object pool) because they are not thread-safe but are extremely fast and produce compact binary outputs when reused. + +### Memory & GC Pause Prevention + +1. **Asynchronous Ring Buffers (LMAX Disruptor):** + Instead of creating new Event objects on the heap for every intercepted payload and doing synchronous I/O, the `InterceptStrategy` writes the extracted (and truncated) data directly into a pre-allocated, lock-free Ring Buffer (e.g., LMAX Disruptor or a simple circular array). + * A dedicated, low-priority background thread consumes from this ring buffer, batches the events, and flushes them to the local OTEL Collector or Log storage. + * **Load Shedding:** If the system is under extreme load and the buffer fills up, the agent **drops** the telemetry data rather than blocking the Camel routing threads. Monitoring must never crash the host application. + +2. 
**Conditional Capture (Dynamic Sampling):** + To further reduce overhead, the agent queries the local Appliance Hub for rules: + * *Error-Only Mode:* Payloads are cached in a tiny thread-local circular buffer and only serialized/retained if the Exchange fails (`Exchange.isFailed()`). + * *Sampling:* Only 1 in 100 exchanges are deeply inspected. + * *Step Filtering:* Only capture payloads at the ingress and egress endpoints of the route, ignoring intermediate data transformation steps unless debug mode is triggered. diff --git a/camel-ops-prototype/docs/tech_stack.md b/camel-ops-prototype/docs/tech_stack.md new file mode 100644 index 0000000..c5c1f42 --- /dev/null +++ b/camel-ops-prototype/docs/tech_stack.md @@ -0,0 +1,120 @@ +# Tech Stack & Repository Structure +## 6-Week Prototype Phase + +Based on the requirements for a high-velocity 6-week prototype (SaaS Control Plane + Customer-side Runners), here is the updated technology stack and repository structure based on your decision: React for the frontend, Java across the backend. + +### πŸ—οΈ Tech Stack Recommendations + +**General Requirements:** +* **Conciseness:** As few different vendors as feasible. +* **Full-Text Search:** Essential for log analysis. +* **Horizontal Scaling:** Must support scaling for growing workloads. +* **Open Source Commitment:** No important features of OSS components should be behind a paywall. + +**1. SaaS Control Plane (Backend & API)** +* **Language/Framework:** Java (Spring Boot or Quarkus). + * *Rationale:* Standardizing on Java across the entire backend reduces cognitive load and allows sharing domain models, DTOs, and utility libraries between the Control Plane and the Runner. Quarkus offers faster startup/lower memory, but Spring Boot has massive ecosystem support. +* **Database:** PostgreSQL. + * *Rationale:* Rock-solid, handles relational data for tenant management and JSONB for flexible configuration payloads. 
+* **Message Broker / Event Bus:** NATS or Redis (Pub/Sub) - optional for MVP.
+  * *Rationale:* Lightweight, easy to deploy, perfect for Control Plane <-> Runner communication if HTTP/gRPC isn't enough.
+
+**2. Frontend (Web Dashboard)**
+* **Framework:** React (Vite) or Next.js.
+* **UI Library:** Tailwind CSS + shadcn/ui.
+  * *Rationale:* React has the largest ecosystem for observability dashboards (e.g., charting libraries, flow diagrams like React Flow). shadcn/ui provides pre-built, accessible, and highly customizable enterprise-grade components.
+
+**3. Customer-Side Runners (Appliance)**
+* **Framework:** Java with Quarkus + Apache Camel.
+  * *Rationale:* Quarkus is designed for Kubernetes and provides incredibly fast startup times and a low memory footprint. It natively integrates with Apache Camel, aligning perfectly with the core domain.
+* **Container Orchestration:** k3s (Lightweight Kubernetes).
+  * *Rationale:* Ideal for customer-side runner appliances. Easy to package, low overhead, and identical API to standard K8s.
+
+**4. Observability & Telemetry (The USP)**
+* **Telemetry Agent:** Custom Java Agent (`-javaagent`) for zero-code instrumentation. Hooks directly into Apache Camel's native `InterceptStrategy` and `EventNotifier` via ByteBuddy. Uses LMAX Disruptor ring buffers and Kryo serialization to keep GC pressure and monitoring overhead to a minimum.
+* **Tracing/Metrics:** OpenTelemetry (OTEL) Collector.
+  * *Rationale:* Standardizes telemetry collection. The custom Java agent pushes truncated, high-performance payloads and metrics to the local OTEL collector, which then routes to VictoriaMetrics or VictoriaLogs.
+
+---
+
+### πŸ“‚ Repository Structure
+
+With a Java backend and React frontend, a **Maven/Gradle Multi-Module** approach (or a tool like Nx if you want a true full-stack monorepo) works best.
Here is the recommended structure: + +```text +camel-ops-prototype/ +β”œβ”€β”€ frontend/ # React web dashboard (Vite/Next.js) +β”‚ β”œβ”€β”€ src/ # UI components, API clients +β”‚ └── package.json +β”œβ”€β”€ backend/ # Java Parent Project (Maven/Gradle) +β”‚ β”œβ”€β”€ control-plane/ # Java API & orchestrator (SaaS) +β”‚ β”œβ”€β”€ runner-appliance/ # Java + Quarkus + Camel (K3s agent) +β”‚ └── shared-core/ # Shared Java DTOs, utilities, and API contracts +β”œβ”€β”€ infra/ +β”‚ β”œβ”€β”€ k3s-runner/ # Manifests/Helm charts for the customer appliance +β”‚ β”œβ”€β”€ control-plane/ # Docker Compose / K8s manifests for the SaaS backend +β”‚ └── observability/ # OpenTelemetry configs for local dev +β”œβ”€β”€ .github/ # CI/CD pipelines (GitHub Actions) +β”œβ”€β”€ docker-compose.yml # One-click local spin-up for the whole stack +└── README.md # Quickstart guide +``` + +### πŸš€ CI/CD & Delivery +* **GitHub Actions:** Simple, free, and lives next to the code. Maven/Gradle builds for backend, npm/yarn for frontend. +* **Artifacts:** Docker images pushed to GitHub Container Registry (GHCR). +* **Local Dev:** Provide a `docker-compose.yml` that spins up PostgreSQL, the Control Plane, a local k3s/Runner instance, and the OTEL collector. +## Pivot: Multi-Node Hub/Worker Split & Local Persistence + +Based on the architectural decision to keep all operational data (metrics, logs, traces) at the customer site while managing it from the SaaS Control Plane, the customer's k3s runner appliance must support a multi-node cluster (embedded k3s) from day one. We are separating the roles into Hub and Worker nodes. + +* **Appliance Hub (Control Plane):** Runs the local k3s control plane, persistence layers (VictoriaMetrics for TSDB, VictoriaLogs for logs), and the SaaS secure tunnel agent. +* **Appliance Worker (Data Plane):** Dedicated to hosting the Apache Camel workloads and lightweight OpenTelemetry (OTEL) forwarders. 
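To make the Hub/Worker split concrete, here is a hedged sketch of how the placement could be expressed in the umbrella chart's Helm values. The subchart names and the `appliance-role` label key are illustrative assumptions, not the final chart layout:

```yaml
# Hypothetical values.yaml fragment for the customer-appliance umbrella chart.
# Observability/persistence pods are pinned to the Hub; Camel workloads to Workers.
victoria-metrics:
  nodeSelector:
    appliance-role: hub
  tolerations:
    - key: node-role.kubernetes.io/control-plane   # Hub runs on the tainted control-plane node
      operator: Exists
      effect: NoSchedule

victoria-logs:
  nodeSelector:
    appliance-role: hub
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule

camel-workloads:
  nodeSelector:
    appliance-role: worker   # keeps business integrations off the Hub
```

Because plain `nodeSelector`/`tolerations` values flow straight into pod specs, a converged single-node mode is just a matter of labeling one node with both roles (or leaving the selectors empty in the default values).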
+ +### πŸ—„οΈ Local Data Stores (Deployed on Hub) +* **Time Series Database (TSDB): VictoriaMetrics** (or Prometheus) + * *Rationale:* VictoriaMetrics is a high-performance, drop-in replacement for Prometheus. It uses a fraction of the RAM and disk space, making it perfect for resource-constrained customer k3s environments while fully supporting PromQL. + * *Integration:* The Java runner (using Micrometer or OpenTelemetry) will scrape or push Camel route metrics directly to this local instance. +* **Log Aggregation: VictoriaLogs** (with Promtail or OTEL Collector) + * *Rationale:* VictoriaLogs is designed for efficient log management, supports full-text search, and aligns with our goal of vendor conciseness by being part of the VictoriaMetrics ecosystem. + +### πŸ“¦ Packaging & Installation Strategy +To make this "very easy to install" for the customer, the entire local suite will be packaged as a **Single Helm Umbrella Chart**, enforcing node placement. + +* **The Bundle:** The Helm chart will contain: + 1. The Java-based Runner Agent (which maintains the secure tunnel to the SaaS Control Plane). + 2. VictoriaMetrics (for time-series data). + 3. VictoriaLogs (for log aggregation). +* **Workload Placement (nodeSelector & Tolerations):** + * **Hub Components:** VictoriaLogs, VictoriaMetrics, and the SaaS Tunnel Agent will be configured with `nodeSelector: { "node-role.kubernetes.io/control-plane": "true" }` (or a dedicated `appliance-role: hub` label). They will use tolerations to ensure they can run on the control-plane tainted nodes. + * **Worker Components:** Apache Camel workloads and the OTEL daemonsets will use `nodeSelector: { "appliance-role": "worker" }` to ensure they run strictly on worker nodes, isolating business integrations from cluster observability. 
+* **Installation Experience:** The customer installs the entire data plane and runner with a single command:
+  ```bash
+  helm repo add camel-ops https://charts.camel-ops.com
+  helm install camel-runner camel-ops/customer-appliance \
+    --set runner.token=<RUNNER_TOKEN> \
+    --namespace camel-ops --create-namespace
+  ```
+* **Operations & Architecture:** Our SaaS Control Plane UI will query the local k3s runner's Java API securely. The Java agent translates these requests, queries the local VictoriaMetrics and VictoriaLogs instances via PromQL/LogsQL, and formats the data into the requested "nJAMS style" before returning it to the SaaS dashboard. No persistent payload or log data ever leaves the customer's cluster.
+
+### ⚠️ Critical Analysis: The Multi-Node Dev Experience
+
+Introducing a multi-node requirement fundamentally shifts the development and testing experience:
+
+* **Impact on Local Development:** We can no longer rely on a simple `docker-compose.yml` to accurately simulate the customer environment. Developers must use tools like **k3d** or **kind** to spin up multi-node local clusters (e.g., `k3d cluster create camel-ops --servers 1 --agents 2`). This increases local resource requirements (RAM/CPU) and steepens the learning curve for developers used to pure Docker workflows.
+* **Impact on CI/CD:** Our automated tests in GitHub Actions will need to provision an ephemeral multi-node K8s cluster to validate Helm chart placement rules, taints, and inter-node networking before merging PRs. This adds runtime overhead and complexity to the test matrix.
+* **The 6-Week MVP Timeline:** Implementing and rigorously testing multi-node scheduling, distributed node-join processes, and potential networking nuances will consume a significant portion of our 6-week runway.
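To ground the CI/CD concern above, a sketch of a GitHub Actions job that provisions an ephemeral multi-node cluster with `kind` and exercises the chart's placement rules. The action version, kind config path, and chart path are assumptions:

```yaml
# Sketch: validate Helm placement rules on an ephemeral multi-node kind cluster.
name: helm-placement-test
on: [pull_request]
jobs:
  placement:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumed kind config file defining 1 control-plane + 2 worker nodes
      - uses: helm/kind-action@v1
        with:
          config: .github/kind-multinode.yaml
      - name: Install the umbrella chart
        run: |
          helm install camel-runner ./infra/k3s-runner \
            --namespace camel-ops --create-namespace
      - name: Assert pods landed on the intended nodes
        run: |
          kubectl wait --for=condition=Ready pods --all -n camel-ops --timeout=300s
          kubectl get pods -n camel-ops -o wide
```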
+ +**Technical Verdict:** +While separating the control/observability plane from the data plane is a **sound architectural best practice** that prevents runaway Camel routes from crashing the observability stack, enforcing it *strictly* from day one in a 6-week MVP is highly aggressive. +**Recommendation:** Proceed with defining the Helm `nodeSelector`/taint logic to formalize the Hub/Worker roles in the chart, but configure the MVP Helm chart values to default to a "Single-Node (Converged) Mode" for local dev and simple PoCs. This allows us to validate the multi-node design without bottlenecking the fast-iteration loop required for the MVP. + +### 🚨 Alerting & Anti-Spam (Deployed on Hub) +* **Rule Evaluation: vmalert (VictoriaMetrics)** + * *Rationale:* `vmalert` executes PromQL-based alerting rules against the local VictoriaMetrics TSDB. It allows us to set thresholds (e.g., 100 failed messages in 5 minutes) and keeps evaluation entirely on the customer's Appliance Hub, ensuring data privacy and reducing reliance on the SaaS Control Plane for real-time evaluation. +* **Notification Routing & Grouping: Prometheus Alertmanager** + * *Rationale:* Alertmanager receives firing alerts from `vmalert` and is responsible for dispatching them to configured receivers (email, Slack, PagerDuty). + * *Anti-Spam Mechanisms:* + * **Grouping:** Groups alerts of a similar nature (e.g., grouped by route ID or business entity) into a single notification to avoid overwhelming operators. If 100 routes fail simultaneously, it sends 1 grouped alert, not 100 emails. + * **Deduplication:** Suppresses duplicate notifications for the same active alert. + * **Debouncing & Inhibition:** Can mute lower-severity alerts (e.g., "High Latency") if a related high-severity alert (e.g., "Node Down") is already firing. +* **Management:** The SaaS Control Plane UI acts as the central pane of glass where users configure these alerting rules. 
The SaaS platform then pushes the configuration down to the customer's Appliance Hub to be executed by `vmalert` and `Alertmanager`.
diff --git a/camel-ops-prototype/docs/ui_tech_implementation.md b/camel-ops-prototype/docs/ui_tech_implementation.md
new file mode 100644
index 0000000..82187f6
--- /dev/null
+++ b/camel-ops-prototype/docs/ui_tech_implementation.md
@@ -0,0 +1,50 @@
+# UI Technology Implementation
+
+This document maps the PM's UX vision (as defined in the UX update) to concrete React components and modern frontend technologies. The goal is to deliver a slick, modern user experience that is fast, context-aware, and built on robust industry standards.
+
+## 1. Command Palette ('Cmd+K')
+
+**Vision:** A fast, keyboard-centric way to navigate, search for specific business entities (e.g., order IDs), or trigger actions without navigating through deep menu structures.
+
+**Implementation:**
+* **Library:** [cmdk](https://cmdk.paco.me/) (often wrapped by shadcn/ui's `Command` component).
+* **Structure:**
+  * A global listener binds to `Cmd+K` (macOS) and `Ctrl+K` (Windows/Linux).
+  * An overlay/modal opens with a search input focused by default.
+  * Results are grouped logically: Recent Searches, Business Entities (Orders, Customers), Actions (Restart Flow, View Logs), and Navigation (Go to Settings).
+* **Integration:** Hooked into the global state and router (e.g., React Router or Next.js App Router) to perform instant client-side navigation.
+
+## 2. Visual Flow Diagrams
+
+**Vision:** Business-Entity First visual representation of Apache Camel integration flows, allowing users to see the path of a message, bottlenecks, and failure points at a glance.
+
+**Implementation:**
+* **Library:** [React Flow](https://reactflow.dev/) (xyflow).
+* **Structure:**
+  * **Custom Nodes:** Implement custom node types for different Camel components (Endpoints, Processors, Splitters, Choice blocks).
These nodes will display status icons (success, error, processing) and metrics (processing time, message count). + * **Custom Edges:** Animated edges to simulate message flow or highlight specific paths taken by a specific business entity (e.g., tracing a specific Order ID). + * **Interactivity:** Clicking a node triggers a contextual slide-out (see below) to show detailed logs, payload data, or configuration for that specific step in the flow. + +## 3. Contextual Slide-outs + +**Vision:** Keep the user in the context of their primary task (e.g., viewing a flow) while providing deep-dive details. Avoid full page reloads and context switching. + +**Implementation:** +* **Library:** [Radix UI Dialog/Sheet](https://www.radix-ui.com/) or [shadcn/ui Sheet](https://ui.shadcn.com/docs/components/sheet). +* **Structure:** + * Rendered as an overlay that slides in from the right side of the screen. + * Can be stacked or replaced depending on the user's drill-down path. +* **Usage:** Clicking a node in the React Flow diagram, or a row in a data table, opens the slide-out displaying detailed metadata, logs, and payload traces for that specific element. + +## 4. State Management and Data Fetching (SaaS Proxy to Local Runner) + +**Vision:** The UI must feel snappy and responsive, fetching data from local Runners via the SaaS proxy without causing full page reloads or blocking the UI. + +**Implementation:** +* **Library:** [React Query (@tanstack/react-query)](https://tanstack.com/query/latest) combined with a robust global state solution like [Zustand](https://github.com/pmndrs/zustand) for client UI state. +* **Architecture:** + * **React Query:** Handles all asynchronous data fetching, caching, synchronization, and background updates. + * When a user opens a slide-out for a specific node, React Query fetches the logs/data for that node. If they close and reopen it, the cached data is shown instantly while a background re-fetch occurs (stale-while-revalidate). 
+ * Polling or WebSockets/Server-Sent Events (SSE) can be integrated with React Query to provide real-time updates of flow statuses from the Runners. + * **Zustand:** Manages transient UI state, such as which slide-out is currently open, the state of the Cmd+K palette, or the current zoom/pan level of the React Flow diagram. + * **SaaS Proxy Communication:** API calls are made from the frontend to the SaaS backend, which securely proxies the request down to the specific local Runner. React Query manages the loading states (spinners/skeletons) while waiting for the proxy response, ensuring the rest of the application remains interactive. diff --git a/camel-ops-prototype/docs/ux_spec.md b/camel-ops-prototype/docs/ux_spec.md new file mode 100644 index 0000000..a7f7fa7 --- /dev/null +++ b/camel-ops-prototype/docs/ux_spec.md @@ -0,0 +1,46 @@ +# 6-Week MVP UX Specification + +**Goal:** Modernize the nJAMS paradigm to deliver a slick, fast, and highly intuitive User Experience for Day 2 operations of Apache Camel solutions. + +## Core Philosophy +- **Speed & Navigation:** Zero full-page reloads for drill-downs. Keyboard-centric navigation. +- **Context over Clatter:** Present information progressively. Give operators the high-level business context first, then allow seamless deep dives into technical execution. +- **Visual Clarity:** Replace static tables with visual topologies where relationships and flows matter. + +## 1. Global "Business-Entity First" Dashboard +**The Old Paradigm:** A list of servers, nodes, or integration processes with green/red status lights. +**The New Paradigm:** An aggregated view of the business objects flowing through the system. +- **Key Elements:** + - High-level KPI cards (e.g., "Orders Processed", "Invoices Failed"). + - A timeline or heatmap of activity spikes. + - Global Search prominently displayed, defaulting to Business Correlation IDs (e.g., "Order #12345"). 
+- **MVP Scope:** A single dashboard page showing the health of the top 3-5 configured business flows and recent critical alerts. + +## 2. Route Topology View +**The Old Paradigm:** A static, paginated table listing steps in an integration route. +**The New Paradigm:** A visual, interactive map of the Camel route. +- **Key Elements:** + - Nodes representing Camel components (Endpoints, Processors, EIPs). + - Edges showing the message flow, with color-coding for success/failure rates. + - Selecting a node highlights its metrics (throughput, error rate) without leaving the view. +- **MVP Scope:** Auto-generated Directed Acyclic Graph (DAG) visualizations for individual Camel routes, utilizing standard libraries (e.g., React Flow). + +## 3. Trace/Payload Drill-down Drawer +**The Old Paradigm:** Clicking a trace opens a new page. Clicking a payload downloads a file or opens yet another page. Losing context is easy. +**The New Paradigm:** Contextual slide-outs (Drawers). +- **Key Elements:** + - When investigating a specific failed exchange from the Topology or Dashboard, a side drawer slides in from the right. + - The drawer contains tabs: **Headers**, **Properties**, **Payload**, **Stacktrace**. + - Users can view JSON/XML payloads with syntax highlighting directly in the drawer. + - Closing the drawer instantly returns the user to the exact context they were in (Dashboard or Topology). +- **MVP Scope:** Right-side slide-out component with basic syntax highlighting for standard payload types, preventing context loss during troubleshooting. + +## The Power User Enabler: Cmd+K Command Palette +To tie it all together and ensure lightning-fast navigation for power users (IT Ops, Developers): +- A globally accessible `Cmd+K` (or `Ctrl+K`) modal. 
+- **Actions supported in MVP:**
+  - Search by Correlation ID: "Find Order 998877"
+  - Jump to Route: "Go to Route: SAP-to-Salesforce"
+  - Change environments: "Switch to Staging"
+  - Quick filters: "Show recent failures"
+- **Why?** It drastically reduces click-depth and mouse travel, aligning with modern developer tool expectations (e.g., Linear, Raycast, VS Code).
diff --git a/camel-ops-prototype/frontend/package.json b/camel-ops-prototype/frontend/package.json
new file mode 100644
index 0000000..a2db79e
--- /dev/null
+++ b/camel-ops-prototype/frontend/package.json
@@ -0,0 +1,19 @@
+{
+  "name": "camel-ops-frontend",
+  "version": "0.1.0",
+  "private": true,
+  "description": "React frontend for Camel Ops (Vite)",
+  "scripts": {
+    "dev": "vite",
+    "build": "vite build",
+    "preview": "vite preview"
+  },
+  "dependencies": {
+    "react": "^18.2.0",
+    "react-dom": "^18.2.0"
+  },
+  "devDependencies": {
+    "@vitejs/plugin-react": "^4.2.0",
+    "vite": "^5.0.0"
+  }
+}
diff --git a/camel-ops-prototype/infra/docker-compose.yml b/camel-ops-prototype/infra/docker-compose.yml
new file mode 100644
index 0000000..770ce03
--- /dev/null
+++ b/camel-ops-prototype/infra/docker-compose.yml
@@ -0,0 +1,42 @@
+version: '3.8'
+
+services:
+  postgres:
+    image: postgres:15
+    container_name: camel_ops_db
+    environment:
+      POSTGRES_USER: camel_user
+      POSTGRES_PASSWORD: camel_password
+      POSTGRES_DB: camel_ops_db
+    ports:
+      - "5432:5432"
+    volumes:
+      - pg_data:/var/lib/postgresql/data
+    restart: unless-stopped
+
+  victoriametrics:
+    image: victoriametrics/victoria-metrics:v1.93.0
+    container_name: camel_ops_vm
+    ports:
+      - "8428:8428"
+    command:
+      - "--retentionPeriod=1y"
+    volumes:
+      - vm_data:/victoria-metrics-data
+    restart: unless-stopped
+
+  victorialogs:
+    image: victoriametrics/victoria-logs:v0.4.2-victorialogs
+    container_name: camel_ops_vlogs
+    ports:
+      - "9428:9428"
+    command:
+      - "--storageDataPath=/victoria-logs-data"
+    volumes:
+      - vlogs_data:/victoria-logs-data
+    restart: unless-stopped
+
+volumes:
+  pg_data:
+  vm_data:
+  vlogs_data:
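The local-dev plan earlier calls for the compose stack to also spin up the OTEL Collector, but the file above stops at the data stores. A sketch of the missing service entry, assuming the stock contrib image and a config file under `infra/observability/` (image tag and paths are assumptions):

```yaml
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    container_name: camel_ops_otel
    # Collector config would wire OTLP receivers to VictoriaMetrics/VictoriaLogs exporters
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    volumes:
      - ./observability/otel-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC ingest (e.g., from the Java agent)
      - "4318:4318"   # OTLP HTTP ingest
    restart: unless-stopped
```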