# Dual-Deployment Architecture: The Black Box Appliance & Native Kubernetes
## Executive Summary
To capture both ends of the market—organizations lacking dedicated IT Ops and enterprises with mature DevSecOps practices—the Runner will employ a dual-deployment architecture. At its core, the application is packaged as a standard Kubernetes Helm Chart. However, the delivery mechanism forks into two distinct paths: a seamless, invisible "Black Box" appliance for bare-metal/VMs, and a raw Helm deployment for existing clusters.

---
## 1. The 'Naked VM' Appliance (The Black Box)
**Target Audience:** Customers without a skilled IT Operations department. They want a turnkey solution that "just works" on a blank Linux machine (e.g., Alpine or Ubuntu).
**User Experience:**
The user provisions clean Linux VMs, runs role-specific commands, and a multi-node cluster is formed. They never see or interact with Kubernetes directly.
```bash
# Provision the Control Plane (Hub)
curl -sfL https://get.our-runner.com | bash -s -- --role hub --token <license_token>
# Provision Worker Nodes (Camel Workloads)
curl -sfL https://get.our-runner.com | bash -s -- --role worker --hub-ip <hub_internal_ip> --token <worker_token>
```
**Technical Mechanisms:**
* **Multi-Node Embedded K3s:** The script silently downloads and installs K3s. The Hub runs the K3s control plane, VictoriaMetrics, Loki, and the SaaS tunnel. Workers join the Hub and run Apache Camel workloads and OTEL forwarders.
* **Helm Bootstrapping:** Once K3s is up, the script uses a bundled `helm` binary to automatically deploy our standard Helm chart into the embedded cluster.
* **Pre-configured Ingress & Storage:** K3s ships natively with Traefik (for Ingress) and Local Path Provisioner (for Persistent Volume Claims). The installer wires these up automatically, exposing the modern web UI on standard ports 80/443.
* **The `runner-ctl` CLI Wrapper:** To completely hide K8s, we provide a host-level CLI binary (e.g., `runner-ctl`).
    * `runner-ctl status` (maps to `kubectl get pods`)
    * `runner-ctl logs` (maps to `kubectl logs`)
    * `runner-ctl backup` (triggers a local volume snapshot/export)
    * `runner-ctl update` (pulls new images and runs `helm upgrade`)
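As a minimal sketch, the wrapper can be a thin POSIX-shell dispatcher over the bundled binaries. The paths, namespace, and release name below are illustrative assumptions, not the shipped layout; this version echoes the underlying command (a dry run) so the mapping stays visible, whereas a real wrapper would exec it.

```shell
#!/bin/sh
# Hypothetical runner-ctl sketch: each subcommand maps onto the bundled
# kubectl/helm binaries. The kubectl path, namespace, release name, and
# backup hook are assumptions for illustration only.
KUBECTL="k3s kubectl"      # assumed: K3s's bundled kubectl
NS="runner-system"         # assumed application namespace

runner_ctl() {
  case "$1" in
    status) echo "$KUBECTL -n $NS get pods" ;;
    logs)   echo "$KUBECTL -n $NS logs deploy/runner-app" ;;
    update) echo "helm -n $NS upgrade runner-app /opt/runner/chart" ;;
    backup) echo "$KUBECTL -n $NS exec sts/runner-db -- backup" ;;  # placeholder hook
    *)      echo "usage: runner-ctl {status|logs|backup|update}" >&2; return 1 ;;
  esac
}
```

Keeping the dispatcher this thin matters: all operational logic stays in the Helm chart and cluster, so the wrapper never drifts from the native-K8s deployment path.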

---
## 2. The 'Native K8s' Deployment
**Target Audience:** Enterprise customers with dedicated Ops/Platform teams who manage their own Kubernetes clusters (EKS, GKE, AKS, or on-prem OpenShift).
**User Experience:**
Ops teams deploy the Runner exactly like any other modern cloud-native application using their existing GitOps pipelines (ArgoCD, Flux) or CLI tools.
```bash
helm repo add runner https://charts.our-runner.com
helm install my-runner runner/runner-app -f my-values.yaml
```
**Technical Mechanisms:**
* **Standardized Artifacts:** We distribute pure OCI-compliant Helm charts and container images hosted on a standard registry.
* **Pluggable Infrastructure:** The Helm chart's `values.yaml` allows enterprise Ops to swap out embedded components for their own enterprise-grade equivalents:
    * Overriding the Ingress class (e.g., to use NGINX or AWS ALB instead of Traefik).
    * Specifying custom `storageClassName` (e.g., EBS, Portworx).
    * Integrating with external databases if preferred over in-cluster databases.
    * Emitting metrics via `ServiceMonitor` resources to plug into the customer's existing Prometheus/Grafana stack.
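Concretely, the overrides above might look like the following `helm` invocation. The value keys (`ingress.className`, `persistence.storageClassName`, `metrics.serviceMonitor.enabled`) are assumptions about the chart's schema, shown for illustration rather than as the confirmed interface.

```shell
# Hypothetical override flags; key names are assumptions about the chart
# schema, not its confirmed values.yaml interface.
helm upgrade --install my-runner runner/runner-app \
  --set ingress.className=nginx \
  --set persistence.storageClassName=gp3 \
  --set metrics.serviceMonitor.enabled=true \
  -f my-values.yaml
```

In practice, teams would commit these overrides to `my-values.yaml` in Git rather than pass them as flags, so ArgoCD or Flux can reconcile them declaratively.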

---
## How We Achieve This Seamlessly
1. **Single Source of Truth:** We maintain **only one** core deployment artifact: the Helm Chart. The Black Box appliance is simply a lightweight wrapper that sets up the environment (K3s) and then consumes that exact same Helm chart. We do not maintain separate `docker-compose.yml` or native systemd service files for the application logic.
2. **Graceful Defaults vs. Explicit Overrides:** The Helm chart ships with smart defaults that assume an appliance environment (e.g., the standard K3s Traefik ingress class and `local-path` storage class). Native K8s users simply override these defaults in their `values.yaml`.
3. **Container-Native from Day 1:** By standardizing on Kubernetes primitives (Deployments, StatefulSets, PVCs, ConfigMaps), the application itself doesn't need to know whether it's running on a massive AWS EKS cluster or a tiny Alpine VM under someone's desk.
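The wrapper-over-one-chart idea can be sketched as a dry-run plan: prepare the environment, then install the same chart. The K3s install URL is the project's public one, but the chart path, namespace, and readiness wait are assumptions about the installer's internals.

```shell
# Dry-run sketch of the Black Box bootstrap: set up K3s, then install the
# exact same Helm chart used for native-K8s deployments. The chart path
# and namespace are illustrative assumptions.
bootstrap_plan() {
  echo "curl -sfL https://get.k3s.io | sh -"
  echo "k3s kubectl wait node --all --for=condition=Ready --timeout=300s"
  echo "helm install runner-app /opt/runner/chart --namespace runner-system --create-namespace"
}
bootstrap_plan    # print the plan instead of executing it
```

The point of the sketch is what is *absent*: no appliance-only manifests, no `docker-compose.yml`, nothing but environment setup followed by the shared chart.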
## Critical Analysis: The Multi-Node Reality
**Architectural Verdict for a 6-Week MVP: High Risk, Defer to Post-MVP.**
While separating the Hub (control plane, persistence) and Workers (Camel workloads) provides a scalable foundation and maps cleanly to Kubernetes concepts, attempting to deliver a multi-node, "invisible" embedded K3s cluster to on-premise customers introduces severe operational complexities for a 6-week MVP.
1. **Node-to-Node Networking Complexity:**
In a single-node "Black Box," localhost handles internal communication. In a multi-node cluster, we rely on a Container Network Interface (CNI) plugin such as Flannel to establish overlay networks across customer-provided infrastructure. If the customer's VMs sit on different subnets, routing rules or MTU mismatches can cause silent failures that our installer cannot automatically diagnose.
2. **Customer Firewalls & Ports:**
To cluster K3s nodes, specific ports must be open between the Hub and Workers (6443/tcp for the Kubernetes API, 8472/udp for Flannel VXLAN, 10250/tcp for the kubelet). In strict enterprise environments lacking dedicated IT Ops (our target audience for the Black Box), these ports are almost certainly blocked by default host firewalls or network ACLs. The "single command" promise then breaks down into complex firewall troubleshooting sessions.
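If we ever ship multi-node, the installer could at least pre-flight the TCP ports before a worker attempts to join. The helper below is a hypothetical sketch; note that 8472/udp cannot be verified with a simple TCP probe, so it can only be flagged for manual checking.

```shell
# Hypothetical preflight for a future multi-node installer. Probes the TCP
# ports a worker must reach on the Hub; 8472/udp (Flannel VXLAN) cannot be
# checked with a TCP connect, so it is only reported as a reminder.
required_tcp_ports() { echo "6443 10250"; }   # kube-api, kubelet

check_hub_ports() {  # usage: check_hub_ports <hub_ip>
  for p in $(required_tcp_ports); do
    if ! nc -z -w 2 "$1" "$p"; then
      echo "blocked: tcp/$p to $1" >&2
      return 1
    fi
  done
  echo "reminder: verify udp/8472 (VXLAN) reachability separately"
}
```

Even with such a check, a failed probe only tells the customer *that* a port is blocked, not how to unblock it in their environment, which is exactly the support burden this section warns about.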
3. **High Availability (HA) for the Hub:**
A single Hub node acts as a Single Point of Failure (SPOF). If the Hub goes down, the control plane is lost, and workers cannot schedule new workloads or reliably ship logs/metrics. Building true HA for the Hub requires a 3-node etcd cluster or an external datastore (like PostgreSQL), completely destroying the "simple Black Box appliance" value proposition.
**Recommendation:**
For the 6-week MVP, we must restrict the "Black Box" appliance to a **Single-Node architecture**. It will still run embedded K3s, but it will host *both* the control plane/persistence and the Camel workloads on the same machine. This guarantees the "curl and run" experience without network troubleshooting. The multi-node cluster feature (and the `--role` flags) should be clearly scoped as a Fast-Follow or V2 feature, allowing us to validate Product-Market Fit with a reliable single-node appliance first. Enterprise customers who require multi-node scale out of the gate should be directed to the "Native K8s Deployment" method using their own pre-configured clusters.
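Under that recommendation, the MVP install collapses to a single command with no role flags. The exact flag set is an assumption about the future installer, shown only for contrast with the multi-node `--role` commands above.

```shell
# Hypothetical MVP single-node install: no --role/--hub-ip flags, because
# the control plane, persistence, and Camel workloads share one machine.
curl -sfL https://get.our-runner.com | bash -s -- --token <license_token>
```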