# Technical Architecture Outline: Apache Camel Operations Platform

**Objective:** Define the system design and the boundary between the Runner (execution) and the Control Plane (management), strictly scoped for a 6-week prototype.

## 1. High-Level Architecture Paradigm

The system follows a distributed execution model:

* **Control Plane (SaaS):** Centralized management, monitoring, and deep observability hub.
* **Runner (Appliance):** Packaged, deployable execution environments hosted where the customer's data gravity resides.

## 2. The Runner (k3s Appliance)

* **Role:** The execution engine for Apache Camel workloads.
* **Core Infrastructure:** Packaged lightweight Kubernetes (k3s) cluster.
* **Components for Prototype:**
  * **Camel Engine:** The runtime environment (e.g., Camel K or plain Camel Spring Boot/Quarkus containers).
  * **Telemetry Agent:** A local collector responsible for gathering deep execution metrics and traces (the USP) and buffering them.
  * **Appliance Agent:** A lightweight operator managing the lifecycle of the Runner and communicating with the Control Plane.

## 3. The Control Plane (SaaS)

* **Role:** Management UI and centralized observability dashboard (nJAMS style).
* **Components for Prototype:**
  * **Ingestion API:** Secure endpoint to receive telemetry data from remote Runners.
  * **Management API / UI:** Interface for the user to view deployments, logs, and deep traces.
  * **Datastore:** Time-series/trace storage (e.g., a managed Prometheus/Grafana stack or Jaeger for the prototype, to save time).

## 4. Communication & Security

* **Challenge:** Securely connecting customer-hosted Runners to the SaaS Control Plane without requiring inbound firewall rules on the customer side.
* **Prototype Strategy:**
  * **Outbound-Only:** The Appliance Agent initiates an outbound connection (e.g., via gRPC, secure WebSockets, or mTLS) to the Control Plane.
  * **Authentication:** Mutual TLS (mTLS) or pre-shared bearer tokens provisioned during Runner bootstrap.

## 5. Observability (The USP)

* **Pragmatic Approach:** Leverage existing standards (OpenTelemetry) for transport and generic instrumentation to save time.
* **Innovation Focus:** Build custom OTel processors/exporters designed specifically for Apache Camel internals to provide the "nJAMS-style" deep visibility that standard APMs miss.

## 6. 6-Week Prototype Constraints & Trade-offs

To hit the 6-week milestone, we must favor "boring," reliable tech for the core and innovate only on the observability USP:

* **In Scope:**
  * Single-node k3s Runner deployment.
  * Basic SaaS ingestion endpoint.
  * Deep trace visualization of a single, complex Camel route.
  * Outbound-only secure telemetry push.
* **Out of Scope (Post-Prototype):**
  * High Availability (HA) Control Plane or multi-node k3s clustering.
  * Complex RBAC and deep multi-tenant isolation (use basic logical separation for now).
  * The "On-Premises" variant of the Control Plane (focus strictly on SaaS for the prototype to reduce deployment complexity).
  * Complex lifecycle management/upgrades of the k3s appliance itself.

## Pivot: Local Data Persistence

Based on the latest strategy alignment, the architecture has pivoted to a localized data model to ensure strict data privacy and reduce SaaS infrastructure costs:

* **SaaS Control Plane (Pure Management):** The cloud offering is now strictly a control plane. It handles user authentication, the centralized UI, configuration management, and remote orchestration. It does *not* ingest, store, or process customer telemetry, logs, or payload data.
* **Customer-Side Persistence:** The persistence layer (time-series database, log aggregation, and trace storage) moves entirely to the customer site, embedded within the Runner appliance. All sensitive execution data resides exclusively with the customer.
* **Zero-Config Installation:** To ensure smooth adoption, the Runner appliance must be extremely easy to install.
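To make "zero-config" concrete, here is a minimal sketch of a single-file install, assuming a Docker Compose packaging; every image, service name, and URL below is an illustrative placeholder, not a chosen component:

```yaml
# Hypothetical Runner appliance: a single `docker compose up -d` starts everything.
services:
  tsdb:                    # local time-series storage for metrics
    image: victoriametrics/victoria-metrics
    volumes:
      - tsdb-data:/storage
  traces:                  # local trace storage
    image: jaegertracing/all-in-one
  collector:               # OpenTelemetry collector feeding the local stores
    image: otel/opentelemetry-collector-contrib
    depends_on: [tsdb, traces]
  appliance-agent:         # outbound-only connection to the SaaS Control Plane
    image: example/appliance-agent   # placeholder: the custom agent image
    environment:
      CONTROL_PLANE_URL: https://control-plane.example.com
volumes:
  tsdb-data: {}
```

A k3s-appliance build would express the same topology as pre-baked manifests or a Helm chart.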
The appliance will function as a "1-click" or zero-config deployment (e.g., a single `docker compose up`, a self-contained binary, or a pre-configured k3s appliance) that automatically spins up the local datastores, the telemetry collectors, and the connection agent.

* **nJAMS-style Instrumentation:** Apache Camel applications will be deeply instrumented to extract step-by-step route progression, properties, and payload snapshots (the nJAMS-style USP). This data is written directly to the local datastore. The SaaS Control Plane will view this data by securely querying the Runner's local API via an outbound-established tunnel (e.g., a reverse proxy tunnel), ensuring no inbound firewall exceptions are required at the customer site.

## 7. Alerting Data Flow & Anti-Spam

Since telemetry and logs (Zero-Trust payload) reside purely on the Customer Hub (Appliance/Runner), alert evaluation **must** occur locally on the Hub. This keeps sensitive data from ever reaching the SaaS Control Plane while ensuring low-latency alerting.

* **Rule Synchronization (SaaS -> Hub):**
  * **Configuration in SaaS:** Users define alert rules (e.g., failure thresholds such as ">5 errors/min", route groupings, or latency SLAs) in the SaaS Control Plane UI.
  * **Secure Sync:** The SaaS plane pushes these rule definitions down to the Hub via the established outbound-only tunnel (e.g., gRPC/WebSocket).
  * **Local Translation:** The Hub agent translates these definitions into formats understood by the local evaluation engine (e.g., `vmalert` or Prometheus Alertmanager rules).
* **Local Evaluation & Anti-Spam:**
  * The Hub's local time-series database and log stores are continuously evaluated against the synced rules.
  * **Minimizing Alert Fatigue:** To prevent operator spam (e.g., a backend outage generating 1,000 separate failure alerts), the local evaluation engine relies heavily on grouping (by Camel route or business entity), debouncing, and rate-limiting.
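As an illustration of the local translation step, a synced ">5 errors/min" rule might land on the Hub as a Prometheus-format rule consumable by `vmalert`; the metric name `camel_exchanges_failed_total` and the `route_id` label are assumptions about the local instrumentation, not confirmed names:

```yaml
groups:
  - name: camel-route-health
    rules:
      - alert: DegradedRoute
        # Per-route failure rate, scaled to errors/min (metric name assumed).
        expr: sum by (route_id) (rate(camel_exchanges_failed_total[5m])) * 60 > 5
        for: 1m            # debounce: the breach must be sustained, not a single blip
        labels:
          severity: warning
        annotations:
          summary: 'Route {{ $labels.route_id }} exceeds 5 errors/min'
```

Grouping by `route_id` here, combined with Alertmanager's `group_by`, `group_interval`, and `repeat_interval` settings, supplies the aggregation and rate-limiting described above.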
A spike in errors therefore results in a single, aggregated "Degraded Route" alert rather than individual error emails.

* **Notification Delivery:**
  * **Direct Outbound (Preferred):** To maintain the Zero-Trust posture, the Hub dispatches notifications directly to target platforms (e.g., Slack webhooks, PagerDuty, or local SMTP). The SaaS securely provisions the necessary webhook URLs/API tokens to the Hub during the rule sync phase.
  * **SaaS Relay (Fallback):** In strictly firewalled environments where the Hub cannot reach external APIs (such as Slack) directly, it can route a sanitized, payload-free notification event back through the secure tunnel to the SaaS Control Plane, which then relays it to the third-party service.

## 8. Java Agent (nJAMS-Style Data Extraction)

To deliver the core USP of deep, step-by-step visibility into un-instrumented Apache Camel applications without requiring customer code changes, we will deploy a custom Java Agent.

### Architectural Trade-offs: Bytecode Instrumentation vs. Native Camel SPI

There are two primary approaches for an agent to extract execution data:

1. **Pure Bytecode Instrumentation (e.g., OpenTelemetry Auto-Instrumentation)**
   * *Pros:* Broadest coverage across standard libraries (HTTP, JDBC, Kafka). Follows industry standards.
   * *Cons:* Often treats Camel as a black box. Modifying bytecode dynamically is brittle, can introduce significant overhead, and risks breaking complex Camel routing logic or custom components. It lacks semantic understanding of Camel's internal state (`Exchange`, `Message`, properties).
2. **Injecting Native Camel SPIs via the Agent (Our Approach)**
   * *Pros:* Extremely "Camel native." By using the Java Agent simply as a delivery mechanism to inject native Camel SPIs (such as `EventNotifier`, `InterceptStrategy`, or `Tracer`) into the `CamelContext` at startup, we work *with* the framework, not against it.
     It provides precise access to the `Exchange`, routing slip, and payload without rewriting bytecode at every execution step.
   * *Cons:* Tightly coupled to Camel versions (though the SPIs are relatively stable). Misses non-Camel JVM operations unless combined with standard OTel instrumentation.

**Decision:** We will use the Java Agent primarily to bootstrap and inject a native Camel `EventNotifier` and/or `InterceptStrategy` into the application's `CamelContext`. This provides the safest, most semantically rich data extraction method.

### Guaranteeing Absolute Minimum Overhead

Extracting payloads at every step of a complex integration can easily cause `OutOfMemoryError`s and CPU spikes. To guarantee "absolute minimum overhead," the agent architecture will implement the following safeguards:

* **Dynamic Payload Capture Toggles:** By default, the agent captures *metadata only* (headers, step execution times, route paths); full payload capture is disabled. It can be dynamically toggled on or off at runtime via the SaaS Control Plane (e.g., "Trace next 5 messages for Route X").
* **Exception-Only / Dead-Letter Extraction:** The agent automatically captures the full payload and state *only* when an exception occurs or a message hits a Dead Letter Channel (DLC). This ensures critical debugging data is available without paying the cost for successful transactions.
* **Bounded Queues & Ring Buffers:** Data extraction happens asynchronously. Extracted events are placed into bounded, non-blocking ring buffers (e.g., the LMAX Disruptor or similar low-latency queues). If the telemetry exporter cannot keep up, older events are aggressively dropped rather than blocking the Camel execution thread or consuming heap space.
* **Payload Truncation:** When payload capture is enabled, a hard limit (e.g., 50 KB) is enforced. Large streams or files are never fully materialized in memory by the agent.
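The last two safeguards can be sketched in plain Java. This toy `TelemetryBuffer` (an illustrative name, not the real agent class) uses a synchronized deque for brevity where the production agent would use a lock-free ring buffer such as the LMAX Disruptor:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

/**
 * Sketch of two safeguards: a bounded drop-oldest event buffer and
 * hard payload truncation. All names here are illustrative assumptions.
 */
final class TelemetryBuffer {
    private final ArrayDeque<String> events = new ArrayDeque<>();
    private final int capacity;
    private final int maxPayloadBytes;
    private long dropped = 0;

    TelemetryBuffer(int capacity, int maxPayloadBytes) {
        this.capacity = capacity;
        this.maxPayloadBytes = maxPayloadBytes;
    }

    /** Never blocks the Camel execution thread: when full, the oldest event is discarded. */
    synchronized void offer(String payload) {
        if (events.size() == capacity) {
            events.pollFirst();          // drop oldest rather than block or grow the heap
            dropped++;
        }
        events.addLast(truncate(payload));
    }

    /** Enforce the hard payload limit before the event ever reaches the buffer. */
    private String truncate(String payload) {
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxPayloadBytes) {
            return payload;
        }
        // May split a multi-byte character at the boundary; acceptable for a sketch.
        return new String(bytes, 0, maxPayloadBytes, StandardCharsets.UTF_8) + "...[truncated]";
    }

    synchronized int size() { return events.size(); }

    synchronized long droppedCount() { return dropped; }

    /** Drained by the asynchronous exporter, oldest event first. */
    synchronized String pollOldest() { return events.pollFirst(); }
}
```

The exporter drains the buffer on its own thread, so a slow SaaS tunnel degrades telemetry completeness (via `droppedCount()`) instead of route throughput.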