# Tech Stack & Repository Structure
## 6-Week Prototype Phase
Based on the requirements for a high-velocity 6-week prototype (SaaS Control Plane + customer-side Runners), here is the updated technology stack and repository structure, reflecting your decision: React for the frontend, Java across the backend.
### 🏗️ Tech Stack Recommendations
**1. SaaS Control Plane (Backend & API)**
* **Language/Framework:** Java (Spring Boot or Quarkus).
* *Rationale:* Standardizing on Java across the entire backend reduces cognitive load and allows sharing domain models, DTOs, and utility libraries between the Control Plane and the Runner. Quarkus offers faster startup/lower memory, but Spring Boot has massive ecosystem support.
* **Database:** PostgreSQL.
* *Rationale:* Rock-solid, handles relational data for tenant management and JSONB for flexible configuration payloads.
* **Message Broker / Event Bus:** NATS or Redis (Pub/Sub) - optional for MVP.
* *Rationale:* Lightweight, easy to deploy, perfect for Control Plane <-> Runner communication if HTTP/gRPC isn't enough.
**2. Frontend (Web Dashboard)**
* **Framework:** React (Vite) or Next.js.
* **UI Library:** Tailwind CSS + shadcn/ui.
* *Rationale:* React has the largest ecosystem for observability dashboards (e.g., charting libraries, flow diagrams like React Flow). shadcn/ui provides pre-built, accessible, and highly customizable enterprise-grade components.
**3. Customer-Side Runners (Appliance)**
* **Framework:** Java with Quarkus + Apache Camel.
* *Rationale:* Quarkus is designed for Kubernetes and provides incredibly fast startup times and a low memory footprint. It natively integrates with Apache Camel, aligning perfectly with the core domain.
* **Container Orchestration:** k3s (Lightweight Kubernetes).
* *Rationale:* Ideal for customer-side runner appliances. Easy to package, low overhead, and identical API to standard K8s.
**4. Observability & Telemetry (The USP)**
* **Telemetry Agent:** Custom Java Agent (`-javaagent`) for zero-code instrumentation. Hooks directly into Apache Camel's native `InterceptStrategy` and `EventNotifier` via ByteBuddy. Uses LMAX Disruptor ring buffers and Kryo serialization to minimize allocation pressure (and hence GC pauses) and keep per-exchange overhead low.
* **Tracing/Metrics:** OpenTelemetry (OTEL) Collector.
* *Rationale:* Standardizes telemetry collection. The custom Java agent pushes truncated, high-performance payloads and metrics to the local OTEL collector, which then routes to VictoriaMetrics or Loki.
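The agent's hot-path design described above (intercept hook → ring buffer → background shipper) can be sketched with stdlib types alone. Here a bounded `ArrayBlockingQueue` stands in for the LMAX Disruptor ring buffer, a plain string for the Kryo-serialized payload, and the names (`ExchangeEvent`, `onExchange`) are illustrative, not the agent's actual API:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the agent's hand-off pattern: the Camel intercept hook only
// enqueues (never blocks, never serializes), while a background thread
// drains and ships events. In the real agent a Disruptor ring buffer and
// Kryo serialization would replace the queue and the string formatting.
public class TelemetrySketch {

    record ExchangeEvent(String routeId, long timestampMillis) {}

    // Bounded buffer: under backpressure we drop events rather than
    // stall the Camel route (offer() returns false when full).
    private static final BlockingQueue<ExchangeEvent> BUFFER =
            new ArrayBlockingQueue<>(1024);

    // Called from the (hypothetical) InterceptStrategy hook on the hot path.
    static boolean onExchange(String routeId) {
        return BUFFER.offer(new ExchangeEvent(routeId, System.currentTimeMillis()));
    }

    public static void main(String[] args) throws InterruptedException {
        Thread shipper = new Thread(() -> {
            try {
                // In the real agent this loop would batch, serialize with
                // Kryo, and forward to the local OTEL collector.
                ExchangeEvent e = BUFFER.take();
                System.out.println("shipped route=" + e.routeId());
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });
        shipper.start();
        onExchange("orders-inbound");
        shipper.join();
    }
}
```

The key property is that the instrumented thread never waits on I/O or serialization; dropping under overload is a deliberate trade-off for predictable route latency.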
---
### 📂 Repository Structure
With a Java backend and React frontend, a **Maven/Gradle Multi-Module** approach (or a tool like Nx if you want a true full-stack monorepo) works best. Here is the recommended structure:
```text
camel-ops-prototype/
├── frontend/              # React web dashboard (Vite/Next.js)
│   ├── src/               # UI components, API clients
│   └── package.json
├── backend/               # Java parent project (Maven/Gradle)
│   ├── control-plane/     # Java API & orchestrator (SaaS)
│   ├── runner-appliance/  # Java + Quarkus + Camel (k3s agent)
│   └── shared-core/       # Shared Java DTOs, utilities, and API contracts
├── infra/
│   ├── k3s-runner/        # Manifests/Helm charts for the customer appliance
│   ├── control-plane/     # Docker Compose / K8s manifests for the SaaS backend
│   └── observability/     # OpenTelemetry configs for local dev
├── .github/               # CI/CD pipelines (GitHub Actions)
├── docker-compose.yml     # One-click local spin-up for the whole stack
└── README.md              # Quickstart guide
```
### 🚀 CI/CD & Delivery
* **GitHub Actions:** Simple, free, and lives next to the code. Maven/Gradle builds for backend, npm/yarn for frontend.
* **Artifacts:** Docker images pushed to GitHub Container Registry (GHCR).
* **Local Dev:** Provide a `docker-compose.yml` that spins up PostgreSQL, the Control Plane, a local k3s/Runner instance, and the OTEL collector.
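A sketch of what that compose file could contain (service names, images, and ports are assumptions, not a finalized setup; the k3s runner piece is omitted here since it is better simulated with k3d, as discussed in the pivot section below):

```yaml
# docker-compose.yml (sketch only - images and ports are placeholders)
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: camelops
      POSTGRES_PASSWORD: dev-only   # local dev only, never production
  control-plane:
    build: ./backend/control-plane
    depends_on: [postgres]
    ports: ["8080:8080"]
  otel-collector:
    image: otel/opentelemetry-collector:latest
    volumes:
      - ./infra/observability/otel-config.yaml:/etc/otelcol/config.yaml
```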
## Pivot: Multi-Node Hub/Worker Split & Local Persistence
Based on the architectural decision to keep all operational data (metrics, logs, traces) at the customer site while managing it from the SaaS Control Plane, the customer's k3s runner appliance must support a multi-node cluster (embedded k3s) from day one. We are separating the roles into Hub and Worker nodes.
* **Appliance Hub (Control Plane):** Runs the local k3s control plane, persistence layers (VictoriaMetrics for TSDB, Grafana Loki for logs), and the SaaS secure tunnel agent.
* **Appliance Worker (Data Plane):** Dedicated to hosting the Apache Camel workloads and lightweight OpenTelemetry (OTEL) forwarders.
### 🗄️ Local Data Stores (Deployed on Hub)
* **Time Series Database (TSDB): VictoriaMetrics** (or Prometheus)
* *Rationale:* VictoriaMetrics is a high-performance, drop-in replacement for Prometheus. It uses a fraction of the RAM and disk space, making it perfect for resource-constrained customer k3s environments while fully supporting PromQL.
* *Integration:* The Java runner (instrumented with Micrometer or OpenTelemetry) will expose Camel route metrics for the local instance to scrape, or push them to it directly.
* **Log Aggregation: Grafana Loki** (with Promtail or OTEL Collector)
* *Rationale:* Unlike Elasticsearch (which requires heavy JVM memory overhead), Loki indexes only metadata/labels and stores raw log streams efficiently. This is ideal for lightweight edge deployments.
### 📦 Packaging & Installation Strategy
To make this "very easy to install" for the customer, the entire local suite will be packaged as a **Single Helm Umbrella Chart**, enforcing node placement.
* **The Bundle:** The Helm chart will contain:
1. The Java-based Runner Agent (which maintains the secure tunnel to the SaaS Control Plane).
2. VictoriaMetrics (for time-series data).
3. Loki & Promtail (for log aggregation).
* **Workload Placement (nodeSelector & Tolerations):**
* **Hub Components:** Loki, VictoriaMetrics, and the SaaS Tunnel Agent will be configured with `nodeSelector: { "node-role.kubernetes.io/control-plane": "true" }` (or a dedicated `appliance-role: hub` label). They will use tolerations to ensure they can run on the control-plane tainted nodes.
* **Worker Components:** Apache Camel workloads and the OTEL daemonsets will use `nodeSelector: { "appliance-role": "worker" }` to ensure they run strictly on worker nodes, isolating business integrations from cluster observability.
* **Installation Experience:** The customer installs the entire data plane and runner with a single command:
```bash
helm repo add camel-ops https://charts.camel-ops.com
helm install camel-runner camel-ops/customer-appliance \
  --set runner.token=<TENANT_TOKEN> \
  --namespace camel-ops --create-namespace
```
* **Operations & Architecture:** Our SaaS Control Plane will query the local k3s runner's Java API securely. The Java agent translates these requests, queries the local VictoriaMetrics and Loki instances via PromQL/LogQL, and formats the data into the requested "nJAMS style" before returning it to the SaaS dashboard. No persistent payload or log data ever leaves the customer's cluster.
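The placement rules above translate into Helm values roughly as follows. The label keys follow the bullet points; the value structure itself is a sketch, not the chart's finalized schema:

```yaml
# Sketch of chart values enforcing Hub/Worker placement.
# Hub components (Loki, VictoriaMetrics, SaaS tunnel agent): pin to the
# hub node and tolerate the control-plane taint.
hub:
  nodeSelector:
    appliance-role: hub
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule

# Worker components (Camel workloads, OTEL forwarders): strictly on workers.
worker:
  nodeSelector:
    appliance-role: worker
```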
### ⚠️ Critical Analysis: The Multi-Node Dev Experience
Introducing a multi-node requirement fundamentally shifts the development and testing experience:
* **Impact on Local Development:** We can no longer rely on a simple `docker-compose.yml` to accurately simulate the customer environment. Developers must use tools like **k3d** or **kind** to spin up multi-node local clusters (e.g., `k3d cluster create camel-ops --servers 1 --agents 2`). This increases local resource requirements (RAM/CPU) and steepens the learning curve for developers used to pure Docker workflows.
* **Impact on CI/CD:** Our automated tests in GitHub Actions will need to provision an ephemeral multi-node K8s cluster to validate Helm chart placement rules, taints, and inter-node networking before merging PRs. This adds runtime overhead and complexity to the test matrix.
* **The 6-Week MVP Timeline:** Implementing and rigorously testing multi-node scheduling, distributed node-join processes, and potential networking nuances will consume a significant portion of our 6-week runway.
**Technical Verdict:**
While separating the control/observability plane from the data plane is a **sound architectural best practice** that prevents runaway Camel routes from crashing the observability stack, enforcing it *strictly* from day one in a 6-week MVP is highly aggressive.
**Recommendation:** Proceed with defining the Helm `nodeSelector`/taint logic to formalize the Hub/Worker roles in the chart, but configure the MVP Helm chart values to default to a "Single-Node (Converged) Mode" for local dev and simple PoCs. This allows us to validate the multi-node design without bottlenecking the fast-iteration loop required for the MVP.
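One way to express that default in chart values (the `mode` flag and its semantics are an assumption to illustrate the recommendation, not a decided schema):

```yaml
# values.yaml (sketch): converged by default for dev/PoC, split for production.
mode: converged   # "converged" = no nodeSelector/taint enforcement,
                  # all components schedulable on a single node.
                  # "split"     = apply the Hub/Worker placement rules.
```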
### 🚨 Alerting & Anti-Spam (Deployed on Hub)
* **Rule Evaluation: vmalert (VictoriaMetrics)**
* *Rationale:* `vmalert` executes PromQL-based alerting rules against the local VictoriaMetrics TSDB. It allows us to set thresholds (e.g., 100 failed messages in 5 minutes) and keeps evaluation entirely on the customer's Appliance Hub, ensuring data privacy and reducing reliance on the SaaS Control Plane for real-time evaluation.
* **Notification Routing & Grouping: Prometheus Alertmanager**
* *Rationale:* Alertmanager receives firing alerts from `vmalert` and is responsible for dispatching them to configured receivers (email, Slack, PagerDuty).
* *Anti-Spam Mechanisms:*
* **Grouping:** Groups alerts of a similar nature (e.g., grouped by route ID or business entity) into a single notification to avoid overwhelming operators. If 100 routes fail simultaneously, it sends 1 grouped alert, not 100 emails.
* **Deduplication:** Suppresses duplicate notifications for the same active alert.
* **Debouncing & Inhibition:** Can mute lower-severity alerts (e.g., "High Latency") if a related high-severity alert (e.g., "Node Down") is already firing.
* **Management:** The SaaS Control Plane UI acts as the single pane of glass where users configure these alerting rules. The SaaS platform then pushes the configuration down to the customer's Appliance Hub to be executed by `vmalert` and Alertmanager.
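As a concrete illustration, the "100 failed messages in 5 minutes" threshold and route-level grouping described above could look like this (the metric name `camel_route_failures_total`, the `route_id` label, and the receiver name are assumptions for the sketch):

```yaml
# vmalert rule group (sketch)
groups:
  - name: camel-routes
    rules:
      - alert: RouteFailureBurst
        expr: increase(camel_route_failures_total[5m]) > 100
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Route {{ $labels.route_id }} is failing"
---
# Alertmanager routing (sketch): one grouped notification per failing route,
# not one email per failed message.
route:
  group_by: [route_id]
  group_wait: 30s
  group_interval: 5m
  receiver: ops-slack
```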