docs: add Phase 4 Observability Pipeline + Inbound Routing spec
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Phase 4: Observability Pipeline + Inbound Routing

**Date:** 2026-04-04
**Status:** Draft
**Depends on:** Phase 3 (Runtime Orchestration + Environments)
**Gitea issue:** #28

## Context

Phase 3 delivered the managed Camel runtime: customers upload a JAR, the platform builds a container with the cameleer3 agent injected, and deploys it. The agent connects to cameleer3-server and sends traces, metrics, diagrams, and logs to ClickHouse. But there is not yet any way for the user to see this data, and customer apps that expose HTTP endpoints are not reachable from outside the platform.

Phase 4 closes the loop: deploy an app, hit its endpoint, see the traces in the dashboard.

cameleer3-server already ships the complete observability stack: ClickHouse schemas with `tenant_id` partitioning, full search/stats/diagram/log REST APIs, and a React SPA dashboard. Phase 4 is a **wiring phase**, not a build-from-scratch phase.

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Observability UI | Serve existing cameleer3-server React SPA via Traefik | Already built. SaaS management UI is Phase 9; the observability UI is not SaaS-specific. |
| API access | Traefik routes directly to cameleer3-server with forward-auth | No proxy layer needed. Forward-auth validates the user and injects headers. The server API works as-is. |
| Server changes | None | Single-tenant Docker mode works out of the box. The `CAMELEER_TENANT_ID` env var is already supported. |
| Agent changes | None | The agent already sends `applicationId` and `environmentId`, and connects to `CAMELEER_EXPORT_ENDPOINT`. |
| Tenant ID | Set `CAMELEER_TENANT_ID` to the tenant slug in Docker Compose | Tags ClickHouse data with the real tenant identity from day one; avoids a `'default'`-to-real-id migration later. |
| Inbound routing | Optional `exposedPort` on the app, Traefik labels on customer containers | Thin feature. `{app}.{env}.{tenant}.{domain}` routing via a Traefik Host rule. |

## What's Already Working (Phase 3)

- Customer containers on the `cameleer` bridge network
- Agent configured: `CAMELEER_AUTH_TOKEN`, `CAMELEER_EXPORT_ENDPOINT=http://cameleer3-server:8081`, `CAMELEER_APPLICATION_ID`, `CAMELEER_ENVIRONMENT_ID`
- cameleer3-server writes traces/metrics/diagrams/logs to ClickHouse
- Traefik routes `/observe/*` to cameleer3-server with the forward-auth middleware
- Forward-auth endpoint at `/auth/verify` validates the JWT and returns `X-Tenant-Id`, `X-User-Id`, and `X-User-Email` headers

## Component 1: Serve cameleer3-server Dashboard

### Traefik Routing

Add Traefik labels to the cameleer3-server service in `docker-compose.yml` to serve the React SPA:

```yaml
# Existing (Phase 3):
- traefik.http.routers.observe.rule=PathPrefix(`/observe`)
- traefik.http.routers.observe.middlewares=forward-auth

# New (Phase 4):
- traefik.http.routers.dashboard.rule=PathPrefix(`/dashboard`)
- traefik.http.routers.dashboard.middlewares=forward-auth
- traefik.http.services.dashboard.loadbalancer.server.port=8080
```

The cameleer3-server SPA is served from its own embedded web server. The SPA already calls the server's API endpoints at relative paths; the existing `/observe/*` Traefik route handles those requests with forward-auth.

**Note:** If the cameleer3-server SPA expects to be served from `/` rather than `/dashboard`, a Traefik StripPrefix middleware may be needed:

```yaml
- traefik.http.middlewares.dashboard-strip.stripprefix.prefixes=/dashboard
- traefik.http.routers.dashboard.middlewares=forward-auth,dashboard-strip
```

Whether the strip is needed depends on how the cameleer3-server SPA is configured (its base path); verify during implementation.

### CAMELEER_TENANT_ID Configuration

Set `CAMELEER_TENANT_ID` on the cameleer3-server service so all ingested data is tagged with the real tenant slug:

```yaml
cameleer3-server:
  environment:
    CAMELEER_TENANT_ID: ${CAMELEER_TENANT_SLUG:-default}
```

In the Docker single-tenant stack, this is set once during initial setup (e.g., `CAMELEER_TENANT_SLUG=acme` in `.env`). All ClickHouse data is then partitioned under this tenant ID.

Add `CAMELEER_TENANT_SLUG` to `.env.example`.

## Component 2: Agent Connectivity Verification

New endpoint in cameleer-saas to check whether a deployed app's agent has successfully registered with cameleer3-server and is sending data.

### API

```
GET /api/apps/{appId}/agent-status
Returns: 200 + AgentStatusResponse
```

### AgentStatusResponse

```java
public record AgentStatusResponse(
    boolean registered,
    String state,            // ACTIVE, STALE, DEAD, UNKNOWN
    Instant lastHeartbeat,
    List<String> routeIds,
    String applicationId,
    String environmentId
) {}
```


### Implementation

`AgentStatusService` in cameleer-saas calls cameleer3-server's agent registry API:

```
GET http://cameleer3-server:8081/api/v1/agents
```


This returns the list of registered agents. The service filters by `applicationId` matching the app's slug and `environmentId` matching the environment's slug.

If cameleer3-server doesn't expose a public agent listing endpoint, the alternative is to query ClickHouse directly for recent data:

```sql
-- A bare aggregate returns a single row, so no LIMIT is needed
SELECT max(timestamp) AS last_seen
FROM container_logs
WHERE app_id = ? AND deployment_id = ?
```


The preferred approach is the agent registry API. If it requires authentication, cameleer-saas can use the shared `CAMELEER_AUTH_TOKEN` as a machine token.

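The filtering step above can be sketched as a small pure helper. `AgentInfo` is a hypothetical stand-in for whatever shape the agent listing actually returns; only the matching logic is the point here:

```java
import java.util.List;
import java.util.Optional;

// Sketch of the matching logic in AgentStatusService. AgentInfo is a
// hypothetical stand-in for the agent listing payload, not the real type.
public class AgentMatcher {
    public record AgentInfo(String applicationId, String environmentId) {}

    // Returns the first registered agent matching the app and environment slugs.
    public static Optional<AgentInfo> find(List<AgentInfo> agents,
                                           String appSlug, String envSlug) {
        return agents.stream()
                .filter(a -> a.applicationId().equals(appSlug))
                .filter(a -> a.environmentId().equals(envSlug))
                .findFirst();
    }
}
```

An empty `Optional` maps to `registered: false` in the response.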
### Integration with Deployment Status
|
||||
|
||||
After a deployment reaches `RUNNING` status (container healthy), the platform can poll agent-status to confirm the agent has registered. This could be surfaced as a sub-status:
|
||||
|
||||
- `RUNNING` — container is healthy
|
||||
- `RUNNING_CONNECTED` — container healthy + agent registered with server
|
||||
- `RUNNING_DISCONNECTED` — container healthy but agent not yet registered (timeout: 30s)
|
||||
|
||||
This is a nice-to-have enhancement on top of the basic agent-status endpoint.
|
||||
|
||||
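One way to derive the `state` field is from heartbeat age. The thresholds below (30s for active, 5min for stale) are illustrative assumptions, not spec'd values:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: deriving the agent-status state from heartbeat age.
// Thresholds (30s active, 5min stale) are assumptions for illustration.
public final class AgentStates {
    public static String derive(Instant lastHeartbeat, Instant now) {
        if (lastHeartbeat == null) return "UNKNOWN";
        Duration age = Duration.between(lastHeartbeat, now);
        if (age.compareTo(Duration.ofSeconds(30)) <= 0) return "ACTIVE";
        if (age.compareTo(Duration.ofMinutes(5)) <= 0) return "STALE";
        return "DEAD";
    }
}
```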
## Component 3: Inbound HTTP Routing for Customer Apps

### Data Model

Add an `exposed_port` column to the `apps` table:

```sql
ALTER TABLE apps ADD COLUMN exposed_port INTEGER;
```

This is the port the customer's Camel app listens on inside the container (e.g., 8080 for a Spring Boot REST app). When set, Traefik routes external traffic to this port.

### API
|
||||
|
||||
```
|
||||
PATCH /api/apps/{appId}/routing
|
||||
Body: { "exposedPort": 8080 } // or null to disable routing
|
||||
Returns: 200 + AppResponse
|
||||
```

The routable URL is computed and included in `AppResponse`:

```java
// In AppResponse, add:
String routeUrl   // e.g., "http://order-svc.default.acme.localhost", or null if no routing
```


### URL Pattern

```
{app-slug}.{env-slug}.{tenant-slug}.{domain}
```

Example: `order-svc.default.acme.localhost`

The `{domain}` comes from the `DOMAIN` env var (already in `.env.example`).

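The `routeUrl` computation is simple enough to pin down now. This helper is a sketch; the real code would live wherever `AppResponse` is assembled:

```java
// Sketch: computing routeUrl from the slugs and the configured domain.
// Returns null when no exposedPort is set, i.e. routing is disabled.
public final class RouteUrls {
    public static String routeUrl(String appSlug, String envSlug,
                                  String tenantSlug, String domain,
                                  Integer exposedPort) {
        if (exposedPort == null) return null;
        return "http://" + appSlug + "." + envSlug + "." + tenantSlug + "." + domain;
    }
}
```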
### DockerRuntimeOrchestrator Changes

When starting a container for an app that has `exposedPort` set, add Traefik labels:

```java
var labels = new HashMap<String, String>();
labels.put("traefik.enable", "true");
labels.put("traefik.http.routers." + containerName + ".rule",
        "Host(`" + app.getSlug() + "." + env.getSlug() + "." + tenant.getSlug() + "." + domain + "`)");
labels.put("traefik.http.services." + containerName + ".loadbalancer.server.port",
        String.valueOf(app.getExposedPort()));
```


These labels are set on the Docker container via docker-java's `withLabels()` on the `CreateContainerCmd`.

Traefik auto-discovers labeled containers on the `cameleer` network (already configured in `traefik.yml` with `exposedByDefault: false`).

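The label construction can be factored into a pure helper so it is unit-testable without a Docker daemon. This mirrors the snippet above; the container name and host values are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: pure helper mirroring the orchestrator's label construction,
// testable without docker-java or a running daemon.
public final class TraefikLabels {
    public static Map<String, String> forApp(String containerName, String host, int port) {
        var labels = new HashMap<String, String>();
        labels.put("traefik.enable", "true");
        labels.put("traefik.http.routers." + containerName + ".rule",
                "Host(`" + host + "`)");
        labels.put("traefik.http.services." + containerName
                + ".loadbalancer.server.port", String.valueOf(port));
        return labels;
    }
}
```

The orchestrator would pass the returned map straight to `withLabels()`.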
### StartContainerRequest Changes

Add an optional field to `StartContainerRequest`:

```java
public record StartContainerRequest(
    String imageRef,
    String containerName,
    String network,
    Map<String, String> envVars,
    long memoryLimitBytes,
    int cpuShares,
    int healthCheckPort,
    Map<String, String> labels   // NEW: Traefik routing labels
) {}
```


### RuntimeConfig Addition

```yaml
cameleer:
  runtime:
    domain: ${DOMAIN:localhost}
```


## Component 4: End-to-End Connectivity Health

### Startup Verification

On application startup, cameleer-saas verifies that cameleer3-server is reachable:

```java
@EventListener(ApplicationReadyEvent.class)
public void verifyConnectivity() {
    try {
        // restClient: an injected Spring RestClient (sketch)
        restClient.get().uri("http://cameleer3-server:8081/actuator/health")
                .retrieve().toBodilessEntity();
        log.info("cameleer3-server connectivity: OK");
    } catch (Exception e) {
        log.warn("cameleer3-server connectivity FAILED: {}", e.getMessage());
    }
}
```

This is a best-effort check, not a hard dependency. If cameleer3-server is not yet running (e.g., still starting up), the SaaS platform starts anyway; the check is logged for diagnostics.

### ClickHouse Data Verification

Add a lightweight endpoint for checking whether a deployed app is producing observability data:

```
GET /api/apps/{appId}/observability-status
Returns: 200 + ObservabilityStatusResponse
```


```java
public record ObservabilityStatusResponse(
    boolean hasTraces,
    boolean hasMetrics,
    boolean hasDiagrams,
    Instant lastTraceAt,
    long traceCount24h
) {}
```


Implementation queries ClickHouse:

```sql
SELECT
    count() > 0 AS has_traces,
    max(start_time) AS last_trace,
    count() AS trace_count_24h
FROM executions
WHERE tenant_id = ? AND application_id = ? AND environment = ?
  AND start_time > now() - INTERVAL 24 HOUR
```


This requires cameleer-saas to query ClickHouse directly (the `clickHouseDataSource` bean from Phase 3). The query is scoped by tenant + application + environment.

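Assembling the response from the scoped counts is mostly boolean derivation. A sketch, assuming the three counts come from per-table ClickHouse queries like the one above (plus analogous ones for metrics and diagrams):

```java
import java.time.Instant;

// Sketch: deriving the response fields from raw scoped counts.
// The counts are assumed inputs from separate ClickHouse queries.
public record ObservabilityStatus(
        boolean hasTraces, boolean hasMetrics, boolean hasDiagrams,
        Instant lastTraceAt, long traceCount24h) {

    public static ObservabilityStatus of(long traces, long metrics,
                                         long diagrams, Instant lastTrace) {
        return new ObservabilityStatus(traces > 0, metrics > 0, diagrams > 0,
                traces > 0 ? lastTrace : null, traces);
    }
}
```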
## Docker Compose Changes

### cameleer3-server labels (add dashboard route)

```yaml
cameleer3-server:
  environment:
    CAMELEER_TENANT_ID: ${CAMELEER_TENANT_SLUG:-default}
  labels:
    # Existing:
    - traefik.enable=true
    - traefik.http.routers.observe.rule=PathPrefix(`/observe`)
    - traefik.http.routers.observe.middlewares=forward-auth
    - traefik.http.services.observe.loadbalancer.server.port=8080
    # New:
    - traefik.http.routers.dashboard.rule=PathPrefix(`/dashboard`)
    - traefik.http.routers.dashboard.middlewares=forward-auth,dashboard-strip
    - traefik.http.middlewares.dashboard-strip.stripprefix.prefixes=/dashboard
    - traefik.http.services.dashboard.loadbalancer.server.port=8080
```

### .env.example addition

```
CAMELEER_TENANT_SLUG=default
```


## Database Migration

```sql
-- V010__add_exposed_port_to_apps.sql
ALTER TABLE apps ADD COLUMN exposed_port INTEGER;
```


## New Configuration Properties

```yaml
cameleer:
  runtime:
    domain: ${DOMAIN:localhost}
```


## Verification Plan

1. Deploy a sample Camel REST app with `exposedPort: 8080`
2. `curl http://order-svc.default.acme.localhost` hits the Camel app
3. The Camel route processes the request
4. The cameleer3 agent captures the trace and sends it to cameleer3-server
5. `GET /api/apps/{appId}/agent-status` shows `registered: true, state: ACTIVE`
6. `GET /api/apps/{appId}/observability-status` shows `hasTraces: true`
7. Open `http://localhost/dashboard` and confirm the cameleer3-server SPA loads
8. Traces are visible in the dashboard for the deployed app
9. The route topology graph shows the Camel route structure
10. ClickHouse data is tagged with the tenant slug via `CAMELEER_TENANT_ID`


## What Phase 4 Does NOT Touch

- No changes to cameleer3-server code (works as-is for single-tenant Docker mode)
- No changes to the cameleer3 agent
- No new ClickHouse schemas (cameleer3-server manages its own)
- No SaaS management UI (Phase 9)
- No K8s-specific changes (Phase 5)