Camel Operations Platform - System Design Document (MVP)

Status: Draft / MVP Definition
Target Audience: Enterprise IT, DevOps, Integration Architects
Date: 2026-02-27


1. Executive Summary

Vision

To provide a unified, "Day 2 Operations" platform for Apache Camel that bridges the gap between modern cloud-native practices (GitOps, Kubernetes) and enterprise on-premise requirements (Zero Trust, Data Sovereignty).

Problem Statement

Enterprises heavily rely on Apache Camel for integration but lack a cohesive operational layer. Existing solutions are either legacy (heavyweight ESBs), lack deep Camel visibility (generic APMs), or require complex DIY Kubernetes management.

Key Value Propositions

  • "Managed Appliance" Experience: A single-binary installer that turns any Linux host into a managed Camel runtime (embedded K3s), removing K8s complexity from the developer.
  • Zero Trust Architecture: The runtime connects outbound-only to the SaaS Control Plane via a reverse tunnel. No inbound firewall ports required.
  • Camel-Native Observability: Deep introspection into Camel Routes, Exchanges, and Message bodies, superior to generic HTTP tracing.
  • GitOps from Day 0: All configurations and deployments are driven by Git state, ensuring auditability and rollback capabilities.

2. High-Level Architecture

The architecture follows a hybrid model: a centralized SaaS Control Plane for management and visibility, and distributed Runners deployed in customer environments (On-Prem, Private Cloud, Edge) to execute workloads.

Architecture Diagram (Mermaid)

```mermaid
graph TD
    subgraph "SaaS Control Plane"
        UI[Web Console]
        API[API Gateway]
        TunnelServer[Tunnel Server]
        TSDB[(Time-Series DB)]
        RelDB[(PostgreSQL)]
    end

    subgraph "Customer Environment (The Runner)"
        TunnelClient[Tunnel Client]
        K3s[Embedded K3s Cluster]

        subgraph "Camel Workload Pod"
            CamelApp[Camel Application]
            Sidecar[Observability Agent]
        end

        Build["Build Controller (Kaniko)"]
        Registry[Local Registry]
    end

    User[User/DevOps] --> UI
    Git[Git Provider] -- Webhook --> API

    %% Connections
    TunnelClient -- "Outbound mTLS (WebSocket/gRPC)" --> TunnelServer
    TunnelServer --> API

    CamelApp -- Traces/Metrics --> Sidecar
    Sidecar -- Telemetry --> TunnelClient
    TunnelClient -- Telemetry --> TSDB
```

3. Component Deep Dive

3.1 The Runner (Managed Appliance)

The Runner is a self-contained runtime environment installed on customer infrastructure. It abstracts the complexity of Kubernetes.

  • Core Engine: K3s (Lightweight Kubernetes). Selected for its single-binary footprint and low resource usage.
  • Ingress Layer: Traefik. Handles internal routing for deployed Camel services.
  • Connectivity: Reverse Tunnel Client. Establishes a persistent, multiplexed connection (using technologies like WebSocket or HTTP/2) to the Control Plane. This tunnel carries:
    • Control commands (Deploy, Restart, Scale).
    • Telemetry data (Logs, Traces, Metrics).
    • Proxy traffic (viewing internal Camel endpoints from SaaS UI).
  • Build System:
    • Kaniko: Performs in-cluster container builds from source code without requiring a Docker daemon.
    • Local Registry: A lightweight internal container registry to store built images before deployment.
  • Storage: Rancher Local Path Provisioner. Uses node-local storage for build artifacts and message buffering; data survives Pod restarts but is tied to its node, so it is not durable against node loss.
  • Security:
    • Namespace Isolation: Each "Environment" (Dev, Prod) maps to a K8s Namespace.
    • Network Policies: Deny-all by default; allow only whitelisted egress.
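
The deny-all posture described above can be made concrete as a pair of NetworkPolicy manifests per environment Namespace. A minimal sketch, assuming a `prod` Namespace; the label selector and egress CIDR are placeholders, not platform specifics:

```yaml
# Default-deny: selects every Pod in the namespace and, with no
# ingress/egress rules listed, blocks all traffic in both directions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# Whitelist example: allow egress from managed Camel workloads to one
# approved network range (labels and CIDR are illustrative).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-approved-egress
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/managed-by: camel-ops
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.20.0.0/16
```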

3.2 The Control Plane (SaaS)

The central brain of the platform.

  • Tech Stack:
    • Backend: Go (Golang) for high-performance concurrent handling of tunnel connections and telemetry ingestion.
    • Frontend: React / Next.js for a responsive, dashboard-like experience.
  • Data Stores:
    • Relational (PostgreSQL): Users, Organizations, Projects, Environment configurations, RBAC policies.
    • Telemetry (ClickHouse or TimescaleDB): High-volume storage for Camel traces (Exchanges), logs, and metrics. ClickHouse is preferred for query performance on massive trace datasets.
  • GitOps Engine:
    • Monitors connected Git repositories.
    • Generates Kubernetes manifests (Deployment, Service, ConfigMap) based on camel-context.xml or Route definitions.
    • Syncs desired state to the Runner via the Tunnel.
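
The manifest-generation step can be sketched in Go with a text template. This is a pared-down illustration: real output would also include the Service, ConfigMap, probes, and resource limits, and the names and registry address used below are assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// deploymentTmpl is a pared-down sketch of the Deployment the GitOps
// engine would emit per Camel application.
var deploymentTmpl = template.Must(template.New("deploy").Parse(`apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{.Name}}
  namespace: {{.Env}}
spec:
  replicas: {{.Replicas}}
  selector:
    matchLabels: {app: {{.Name}}}
  template:
    metadata:
      labels: {app: {{.Name}}}
    spec:
      containers:
      - name: camel
        image: {{.Image}}
`))

// app holds the fields the engine derives from the Git repo and the
// environment configuration.
type app struct {
	Name, Env, Image string
	Replicas         int
}

func renderDeployment(a app) (string, error) {
	var buf bytes.Buffer
	if err := deploymentTmpl.Execute(&buf, a); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	m, _ := renderDeployment(app{Name: "orders-route", Env: "prod",
		Image: "registry.local:5000/orders:0123456", Replicas: 2})
	fmt.Print(m)
}
```

The rendered manifest is what gets synced to the Runner over the Tunnel and applied against the embedded K3s API.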

3.3 The Observability Stack

Tailored specifically for Apache Camel integration patterns.

  • Camel Tracer (Java Agent / Sidecar):
    • Attaches to the Camel runtime (Quarkus, Spring Boot, Karaf).
    • Intercepts ExchangeCreated, ExchangeCompleted, ExchangeFailed events.
    • Smart Sampling: Configurable sampling rates to balance overhead vs. visibility.
    • Body Capture: Secure redaction (regex masking) of sensitive PII in message bodies before transmission.
  • Message Replay Mechanism:
    • The Control Plane stores metadata of failed exchanges (Headers, Body blobs).
    • Action: User clicks "Replay" in UI.
    • Flow: Control Plane sends "Replay Command" -> Tunnel -> Runner -> Observability Sidecar.
    • Execution: The Sidecar re-injects the message into the specific Camel Endpoint or Route start.

4. Data Flow

4.1 Deployment Flow (GitOps)

  1. Commit: Developer pushes code to Git repository.
  2. Webhook: Git provider notifies Control Plane API.
  3. Instruction: Control Plane determines the target Runner and sends a "Build Job" instruction via the Tunnel.
  4. Pull & Build: Runner's Build Controller (Kaniko) pulls source, builds container image, pushes to Local Registry.
  5. Deploy: Runner applies updated K8s manifests. K3s pulls image from Local Registry and rolls out the new Pod.
  6. Status: Runner reports DeploymentStatus: Ready back to Control Plane.
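
Steps 3-4 hinge on deriving a Local Registry image reference from the webhook payload. A hypothetical sketch; the `registry.local:5000` address and the short-SHA tag scheme are assumptions, not platform specifics:

```go
package main

import (
	"fmt"
	"strings"
)

// BuildJob is a hypothetical instruction sent from the Control Plane to
// a Runner's Build Controller over the tunnel (step 3 above).
type BuildJob struct {
	Repo     string // Git clone URL from the webhook payload
	Commit   string // full commit SHA
	ImageRef string // where Kaniko pushes the result in the Local Registry
}

// NewBuildJob derives the local-registry image reference from the pushed
// commit: repo basename as image name, short SHA as tag.
func NewBuildJob(repo, commit string) BuildJob {
	name := strings.TrimSuffix(repo[strings.LastIndex(repo, "/")+1:], ".git")
	short := commit
	if len(short) > 7 {
		short = short[:7]
	}
	return BuildJob{
		Repo:     repo,
		Commit:   commit,
		ImageRef: fmt.Sprintf("registry.local:5000/%s:%s", name, short),
	}
}

func main() {
	job := NewBuildJob("https://git.example.com/acme/orders.git", "0123456789abcdef")
	fmt.Println(job.ImageRef) // registry.local:5000/orders:0123456
}
```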

4.2 Telemetry Flow (Observability)

  1. Intercept: Camel App processes a message. Sidecar captures the trace data (Route ID, Node ID, Duration, Failure/Success, Payload).
  2. Buffer: Sidecar buffers traces in memory (ring buffer) to handle bursts.
  3. Transmit: Batched traces are sent to the local Runner Agent (Tunnel Client).
  4. Tunnel: Data flows upstream through the mTLS tunnel to the Control Plane Ingestor.
  5. Persist: Ingestor validates and writes data to ClickHouse/TimescaleDB.
  6. Visualize: User queries the "Route Diagram" in the UI; backend fetches aggregation from DB.
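
The buffer-then-batch behaviour of steps 2-3 can be sketched as a fixed-size ring that overwrites the oldest traces under sustained burst; a deliberate trade-off, since losing old traces beats blocking the Camel route. A minimal sketch (strings stand in for real trace structs):

```go
package main

import "fmt"

// traceRing models the Sidecar's in-memory ring buffer: newest traces
// overwrite the oldest when the buffer is full.
type traceRing struct {
	buf  []string
	head int // next write position
	size int // number of valid entries
}

func newTraceRing(capacity int) *traceRing {
	return &traceRing{buf: make([]string, capacity)}
}

// Push appends a trace, evicting the oldest entry when full.
func (r *traceRing) Push(t string) {
	r.buf[r.head] = t
	r.head = (r.head + 1) % len(r.buf)
	if r.size < len(r.buf) {
		r.size++
	}
}

// Drain returns buffered traces oldest-first and empties the ring,
// modelling a batch handed to the Tunnel Client (step 3).
func (r *traceRing) Drain() []string {
	out := make([]string, 0, r.size)
	start := (r.head - r.size + len(r.buf)) % len(r.buf)
	for i := 0; i < r.size; i++ {
		out = append(out, r.buf[(start+i)%len(r.buf)])
	}
	r.head, r.size = 0, 0
	return out
}

func main() {
	ring := newTraceRing(3)
	for _, t := range []string{"ex1", "ex2", "ex3", "ex4"} {
		ring.Push(t) // "ex4" evicts "ex1"
	}
	fmt.Println(ring.Drain()) // [ex2 ex3 ex4]
}
```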

5. Security Model

Zero Trust & Connectivity

  • No Inbound Ports: The Runner requires strictly outbound-only HTTPS (443) access to the Control Plane.
  • Authentication:
    • Runner registration uses a short-lived One-Time Token (OTT) generated in the UI.
    • Upon first connect, the Runner performs a certificate exchange (CSR) to obtain a unique mTLS client certificate.
  • mTLS Tunnel: All traffic between Runner and Control Plane is encrypted and mutually authenticated.

Secrets Management

  • At Rest: Secrets (API keys, DB passwords) are encrypted in the Control Plane database (AES-256).
  • In Transit: Delivered to the Runner only when needed for deployment.
  • On Runner: Stored as K8s Secrets, mounted as environment variables or files into the Camel Pods.
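
The at-rest encryption step can be illustrated with AES-256-GCM from the Go standard library. A sketch only: key management (KMS, rotation, envelope encryption) is out of scope here.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// sealSecret encrypts a plaintext secret with AES-256-GCM (a 32-byte
// key selects AES-256) and prepends the random nonce to the ciphertext.
func sealSecret(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openSecret splits off the nonce and decrypts; GCM also authenticates,
// so tampered ciphertext fails rather than decrypting to garbage.
func openSecret(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, fmt.Errorf("sealed value too short")
	}
	n := gcm.NonceSize()
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	key := make([]byte, 32)
	rand.Read(key)
	sealed, _ := sealSecret(key, []byte("db-password"))
	plain, _ := openSecret(key, sealed)
	fmt.Println(string(plain)) // db-password
}
```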

Multi-Tenancy

  • Control Plane: Logical isolation (Row-Level Security) ensures customers cannot see each other's data.
  • Runner: Typically single-tenant per install, but supports multi-environment isolation via Namespaces when shared by multiple teams within one enterprise.

6. Future Proofing & Scalability

High Availability (HA)

  • Control Plane: Stateless microservices, autoscaled on public cloud (AWS/GCP/Azure). DBs run in clustered mode.
  • Runner (MVP): Single-node K3s.
  • Runner (Future): Multi-node K3s cluster support. The "Appliance" installer will support joining additional nodes for worker capacity and control plane redundancy.

Scaling Strategy

  • Horizontal Pod Autoscaling (HPA): The Runner will support defining HPA rules (CPU/Memory based) for Camel workloads.
  • Partitioning: The Telemetry store (ClickHouse) will be partitioned by Time and Customer ID to support years of retention.
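
The HPA rules mentioned above could look like the following manifest; the workload name, replica bounds, and CPU threshold are illustrative:

```yaml
# CPU-based autoscaling for a Camel workload (names/thresholds are
# placeholders the platform would fill in from Environment config).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-route
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-route
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```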

Prepared by: Subagent (OpenClaw)