# Cameleer SaaS Platform — Product Requirements Document

**Status:** Draft — Awaiting Review
**Date:** 2026-03-29
**Author:** Hendrik Siegeln + Claude (brainstorming session)
**Gitea Project:** cameleer/cameleer-saas
**Gitea Epics:** #1–#13

---

## 1. Product Definition

**Cameleer SaaS** is a Camel application runtime platform with built-in observability. Customers deploy Apache Camel applications and get zero-configuration tracing, topology mapping, payload lineage, distributed correlation, live debugging, and exchange replay — powered by the cameleer agent (auto-injected) and cameleer-server (managed per tenant).

### Three Pillars

1. **Runtime** — Deploy and run Camel applications with automatic agent injection
2. **Observability** — Per-tenant cameleer-server (traces, topology, lineage, correlation, debugger, replay)
3. **Management** — Auth, billing, teams, provisioning, secrets, environments

### Two Deployment Modes

- **SaaS (managed)** — Fully managed by the Cameleer platform
- **Self-hosted / Air-gapped** — Customer-operated, license-enforced feature parity with SaaS tiers

### Relationship to Existing Components

| Component | Role | Changes Required |
|-----------|------|------------------|
| cameleer (agent) | Zero-code Camel instrumentation, auto-injected into customer JARs | MOAT features (lineage, correlation, debugger, replay) |
| cameleer-server | Per-tenant observability backend | Managed mode (trust SaaS JWT), license module, MOAT features |
| cameleer-saas (this repo) | SaaS management platform — control plane | New: everything in this document |
| design-system | Shared React component library | Used by both SaaS shell and server UI |

---

## 2. Tier Structure

### Tier Matrix

| Dimension | Low | Mid | High | Business |
|-----------|-----|-----|------|----------|
| **Infrastructure** | Shared cluster, shared PG/OS | Shared cluster, shared PG/OS | Dedicated cluster(s) | Dedicated cluster(s) |
| **Pricing** | Base fee + usage (data vol, CPU, RAM) | Base fee + usage (data vol, CPU, RAM) | Committed resources | Committed resources |
| **Environments** | 1 (prod) | 2 (dev, prod) | Unlimited | Unlimited |
| **Agents** | Limited | Higher limit | Unlimited | Unlimited |
| **Data Retention** | 7 days | 30 days | 90 days | Custom |
| **Topology Graph** | Yes | Yes | Yes | Yes |
| **Payload Lineage** | Limited (route-scope only, max 10 captures/min) | Full | Full | Full |
| **Cross-Service Correlation** | No | Yes | Yes | Yes |
| **Live Route Debugger** | No | No | Yes | Yes |
| **Exchange Replay** | No | No | Yes | Yes |
| **SSO / OIDC** | No | No | Yes | Yes |
| **Custom Roles** | No | No | Yes | Yes |
| **Team Management** | Basic | Basic | Full | Full |
| **Secrets** | Platform-native | Platform-native + 1 vault | Unlimited vaults | Unlimited vaults |
| **Support** | Docs | Email | Priority | Dedicated CSM |
| **SLA** | Best effort | 99.5% | 99.9% | 99.95%+ custom |
| **VPN (future)** | No | No | Yes | Yes |

### Pricing Models

**Usage-based (Low/Mid):**
- Optional small monthly base fee
- Metered dimensions: data volume (GB ingested), CPU (core·hours), RAM (GB·hours)
- Stripe metered subscriptions with periodic usage reporting

**Committed resources (High/Business):**
- Fixed pricing based on reserved cluster capacity (CPU cores, RAM, storage, node count)
- Annual or multi-year contracts
- Overage alerts (upsell, not automatic billing)

---

## 3. System Architecture

### Approach: Modular Monolith Control Plane

Single Spring Boot application with well-bounded internal modules. K8s ingress handles tenant routing. Flux CD handles infrastructure reconciliation.

```
[Browser] → [Ingress (Traefik/Envoy)] → [SaaS Platform (modular Spring Boot)]
                      ↓ (tenant routes)              ↓ (provisioning)
           [Tenant cameleer-server]           [Flux CD → K8s]
```

### Component Map

```
                  ┌─────────────────────────────────────────┐
                  │         Ingress (Traefik/Envoy)         │
                  │     TLS termination, tenant routing     │
                  └──────┬──────────────┬──────────────┬────┘
                         │              │              │
              ┌──────────▼──────────┐   │    ┌─────────▼─────────┐
              │  SaaS Management    │   │    │ Grafana/Prometheus│
              │  Platform           │   │    │ (self-monitoring) │
              │  (Spring Boot)      │   │    └───────────────────┘
              │                     │   │
              │  Modules:           │   │
              │  ├─ Auth            │   │
              │  ├─ Billing         │   │
              │  ├─ Provisioning    │   │
              │  ├─ Runtime         │   │
              │  ├─ License         │   │
              │  ├─ Secrets         │   │
              │  └─ Audit           │   │
              └──┬───┬──────┬───────┘   │
                 │   │      │           │
        ┌────────┘   │      └──────┐    │
        ▼            ▼             ▼    │
┌──────────────┐ ┌────────┐  ┌──────────▼───────────────┐
│ Platform DB  │ │ Stripe │  │    Shared K8s Cluster    │
│ (PostgreSQL) │ │  API   │  │                          │
│ - tenants    │ └────────┘  │ ┌─────────────────────┐  │
│ - users      │             │ │ tenant-a namespace  │  │
│ - teams      │   ┌─────┐   │ │ ├─ cameleer-server  │  │
│ - audit log  │   │Flux │   │ │ ├─ camel-app-1      │  │
│ - licenses   │   │ CD  │   │ │ ├─ camel-app-2      │  │
└──────────────┘   └──┬──┘   │ │ └─ NetworkPolicies  │  │
                      │      │ └─────────────────────┘  │
              ┌───────▼──┐   │ ┌─────────────────────┐  │
              │ GitOps   │   │ │ tenant-b namespace  │  │
              │ Repo     │   │ │ └─ ...              │  │
              │(HelmRel) │   │ └─────────────────────┘  │
              └──────────┘   │                          │
                             │ Shared:                  │
                             │ ├─ PostgreSQL (tenant    │
                             │ │  schemas)              │
                             │ ├─ OpenSearch (tenant    │
                             │ │  indices)              │
                             │ └─ Container Registry    │
                             └──────────────────────────┘
```

### Dedicated Tier (High/Business)

Same management platform routes to dedicated cluster(s) per customer. Dedicated PostgreSQL, OpenSearch, and container registry within the customer's cluster. Provisioned semi-manually at launch (Flux bootstrap), full Cluster API automation deferred.

### Tech Stack

| Component | Technology |
|-----------|------------|
| Management Platform backend | Spring Boot 3, Java 21 |
| Management Platform frontend | React, @cameleer/design-system |
| Platform database | PostgreSQL |
| Tenant observability | cameleer-server (Spring Boot), PostgreSQL, OpenSearch |
| GitOps | Flux CD |
| K8s distribution | Talos (production), k3s (dev) |
| Ingress | Traefik or Envoy |
| Billing | Stripe (Subscriptions + Usage Records API) |
| Auth | Spring Security OAuth2, Ed25519 JWT |
| Secrets sync | K8s External Secrets Operator |
| Container registry | Platform-managed (Harbor or Gitea Container Registry) |
| Monitoring | Prometheus, Grafana, Loki, Alertmanager |
| Image signing | cosign/sigstore |
| Image scanning | Trivy |

### Key Architectural Decisions

1. **Modular monolith** — Single Spring Boot app with clean module boundaries. Extractable later if needed.
2. **K8s ingress handles routing** — Tenant routing via path or subdomain. No custom API gateway.
3. **Flux CD for reconciliation** — HelmRelease CRs per tenant. Drift detection, self-healing. K8s-distribution-agnostic.
4. **Platform DB separate from tenant data** — Management platform has its own PostgreSQL. Tenant observability data in separate shared (or dedicated) instances.
5. **Immutable artifact pipeline** — JAR upload → container image → promote through environments. Same binary everywhere.
6. **Dual-mode auth** — SaaS mode: platform is the IdP. Air-gapped mode: server uses standalone auth with local license file.
7. **SOC 2 baked in** — Not bolted on. Audit logging, encryption, image signing, SBOM from day 1.
8. **Self-monitoring** — Prometheus + Grafana stack, completely separate from tenant observability.

---

## 4. Data Architecture

### Platform Database (Management Platform)

Stores all SaaS control plane data — completely separate from tenant observability data.

| Table/Domain | Purpose |
|---|---|
| `tenants` | Tenant record: ID, name, tier, status, Stripe customer ID, created_at |
| `users` | Platform users: email, password hash, MFA, status |
| `tenant_members` | User-to-tenant mapping with role |
| `teams` | Team groupings within a tenant |
| `roles` / `permissions` | RBAC definitions (predefined + custom for high/business) |
| `licenses` | License records: tenant, tier, feature flags, limits, expiry, signing key |
| `audit_log` | Immutable append-only log: actor, action, resource, timestamp, IP, tenant |
| `applications` | Deployed Camel app metadata: name, tenant, version, image ref, status |
| `secrets_metadata` | Secret references (actual values in K8s Secrets or external vault) |
| `vault_configs` | External vault connection configs per tenant |
| `provisioning_events` | Tenant provisioning pipeline state and history |
| `billing_usage` | Aggregated usage snapshots before Stripe reporting |

### Tenant Data (Shared PostgreSQL)

Each tenant's cameleer-server uses its own PostgreSQL schema on the shared instance (dedicated instance for high/business). This is the existing cameleer-server data model — unchanged:

- Route executions, processor traces, metrics
- Route graph topology
- Agent registrations, config history
- Lineage captures, correlation traces, debug sessions

### Tenant Data (Shared OpenSearch)

- `{tenant_id}-executions-*` — time-series execution data
- `{tenant_id}-traces-*` — processor-level traces
- Full index-level isolation with index templates per tenant

### Self-Monitoring Data

Completely separate: Prometheus TSDB for metrics, Loki for logs.

---

## 5. Identity & Access Management

### Architecture

The SaaS management platform is the single identity plane. It owns authentication and authorization. Per-tenant cameleer-server instances trust SaaS-issued tokens.

- Spring Security OAuth2 for OIDC federation with customer IdPs
- Ed25519 JWT signing (consistent with existing cameleer-server pattern)
- Tokens carry: tenant ID, user ID, roles, feature entitlements
- cameleer-server validates SaaS-issued JWTs in managed mode
- Standalone mode retains its own auth for air-gapped deployments

### RBAC Model

| Role | Capabilities |
|------|-------------|
| Owner | Full tenant admin, billing, team management, delete tenant |
| Admin | Manage apps, secrets, team members, environments. No billing. |
| Developer | Deploy apps, view traces, use debugger/replay. No team management. |
| Viewer | Read-only access to dashboards, traces, topology |

High/Business tiers: custom roles with granular permissions (e.g., "can replay in dev but not prod").

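
The predefined roles in the table above could be modeled as an enum that maps each role to a fixed permission set. This is only a sketch: the `RbacSketch` class, the `Permission` names, and the exact split between Admin and Developer (for example, whether Admin may use the debugger) are assumptions made for illustration, not something the table specifies in full.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch: predefined roles mapped to permission sets. */
class RbacSketch {

    /** Illustrative permission names derived from the capabilities column. */
    enum Permission {
        VIEW, DEPLOY_APPS, USE_DEBUGGER_REPLAY, MANAGE_SECRETS,
        MANAGE_ENVIRONMENTS, MANAGE_TEAM, MANAGE_BILLING, DELETE_TENANT
    }

    enum Role {
        OWNER(EnumSet.allOf(Permission.class)),
        // Assumption: Admin has everything except billing and tenant deletion.
        ADMIN(EnumSet.complementOf(
                EnumSet.of(Permission.MANAGE_BILLING, Permission.DELETE_TENANT))),
        DEVELOPER(EnumSet.of(
                Permission.VIEW, Permission.DEPLOY_APPS, Permission.USE_DEBUGGER_REPLAY)),
        VIEWER(EnumSet.of(Permission.VIEW));

        private final Set<Permission> permissions;

        Role(Set<Permission> permissions) {
            this.permissions = permissions;
        }

        /** Returns true when this role includes the given permission. */
        boolean can(Permission permission) {
            return permissions.contains(permission);
        }
    }
}
```

Custom roles on the High/Business tiers would need a data-driven permission set (and probably an environment scope) instead of a fixed enum.
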
### Team Management

- Invite by email
- Role assignment per user
- Basic (low/mid): single team, predefined roles
- Full (high/business): multiple teams, custom roles, team-scoped permissions

---

## 6. Tenant Provisioning

### Shared Tier Flow (Low/Mid)

```
Customer signs up + payment
  → Create tenant record + Stripe customer/subscription
  → Generate signed license token (Ed25519)
  → Create Flux HelmRelease CR
  → Flux reconciles: namespace, ResourceQuota, NetworkPolicies, cameleer-server
  → Provision PostgreSQL schema + per-tenant credentials
  → Provision OpenSearch index template + per-tenant credentials
  → Readiness check: server healthy, DB migrated, auth working
  → Generate bootstrap tokens, present onboarding instructions
  → Tenant status → ACTIVE
```

**Target: < 5 minutes from payment to active environment.**

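
One way to picture the flow above is as an ordered list of steps whose completion is recorded, so that a retry resumes at the first incomplete step instead of starting over. The sketch below is illustrative only: the `ProvisioningSketch` class, the step names, and the in-memory set standing in for persisted state are assumptions, not the platform's actual implementation.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch: resumable, step-wise tenant provisioning. */
class ProvisioningSketch {

    /** Steps loosely mirror the shared-tier flow; names are illustrative. */
    enum Step {
        CREATE_TENANT_RECORD, GENERATE_LICENSE, CREATE_HELM_RELEASE,
        PROVISION_PG_SCHEMA, PROVISION_OS_TEMPLATE, READINESS_CHECK,
        GENERATE_BOOTSTRAP_TOKENS
    }

    enum Status { PROVISIONING, ACTIVE }

    /** Completed steps; stands in for per-step state kept in the platform DB. */
    private final Set<Step> completed = EnumSet.noneOf(Step.class);

    /**
     * Runs every step that has not completed yet, in declaration order.
     * The executor returns true on success; a failed step stops the run
     * so that a later call retries from that step.
     */
    Status run(Predicate<Step> executor) {
        for (Step step : Step.values()) {
            if (completed.contains(step)) {
                continue; // finished on a previous attempt
            }
            if (!executor.test(step)) {
                return Status.PROVISIONING; // leave the rest for a retry
            }
            completed.add(step);
        }
        return Status.ACTIVE;
    }

    /** Returns a copy of the steps completed so far. */
    Set<Step> completedSteps() {
        return EnumSet.copyOf(completed);
    }
}
```

The sketch only skips completed steps; each step's own side effects would still need to be idempotent for a retry after a mid-step crash to be safe.
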
### Dedicated Tier Flow (High/Business)

Semi-manual at launch:
1. Customer signs committed resource agreement
2. Operator provisions dedicated cluster (Talos)
3. Flux bootstrap deploys full stack
4. Management platform configured to route to dedicated cluster
5. From this point, automated (same lifecycle management as shared)

Full Cluster API automation deferred to future release.

### Lifecycle Operations

| Operation | Mechanism |
|-----------|-----------|
| Suspension (non-payment) | Scale tenant workloads to 0, license set to suspended |
| Reactivation | Scale back up, license reactivated |
| Deletion | Remove namespace, drop PG schema, delete OS indices, scrub audit log references. GDPR compliant. |
| Tier upgrade (shared → dedicated) | Provision dedicated cluster, migrate data, update routing. Downtime window coordinated. |
| Tier downgrade | Reverse of upgrade. Data retention limits applied. |

### Failure Handling

- Each provisioning step is idempotent and retryable
- State machine in platform DB tracks progress per step
- Failed provisioning → alert ops + notify customer with ETA
- Partial provisioning cleanup on permanent failure

---

## 7. Camel Application Runtime

### JAR Upload → Immutable Image

1. **Validation** — File type check, size limit per tier, SHA-256 checksum, Trivy security scan, secret detection (reject JARs with embedded credentials)
2. **Image Build** — Templated Dockerfile: distroless JRE base + customer JAR + cameleer-agent.jar + `-javaagent` flag + agent pre-configured for tenant server. Image tagged: `registry/{tenant}/{app}:v{N}-{sha256short}`. Signed with cosign. SBOM attached.
3. **Registry Push** — Per-tenant repository in platform container registry
4. **Deploy** — K8s Deployment in tenant namespace with resource limits, secrets mounted, config injected, NetworkPolicy applied, liveness/readiness probes

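
The tag format `registry/{tenant}/{app}:v{N}-{sha256short}` from the image build step can be derived directly from the uploaded JAR's checksum. A minimal sketch follows; the `ImageTagSketch` class, the 8-character short digest, and the example registry host are assumptions, since the document does not fix them.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Hypothetical sketch: derive an immutable image tag from a JAR's checksum. */
class ImageTagSketch {

    /** Length of the short digest; 8 hex characters is an assumption. */
    static final int SHORT_SHA_LENGTH = 8;

    /** Returns the lowercase hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(content));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Builds registry/{tenant}/{app}:v{N}-{sha256short}. */
    static String imageTag(String registry, String tenant, String app,
                           int version, byte[] jar) {
        String shortSha = sha256Hex(jar).substring(0, SHORT_SHA_LENGTH);
        return registry + "/" + tenant + "/" + app + ":v" + version + "-" + shortSha;
    }

    public static void main(String[] args) {
        byte[] jar = "demo".getBytes(StandardCharsets.UTF_8);
        // registry.example.com is a placeholder host.
        System.out.println(imageTag("registry.example.com", "tenant-a", "orders", 3, jar));
    }
}
```

Because the digest is part of the tag, two uploads with identical bytes map to the same suffix, which fits the "same binary everywhere" promotion model.
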
### Environment Promotion

```
dev → staging → prod
  (same image tag, different config + secrets per environment)
```

- Promotion = deploy existing image tag to target environment (no rebuild)
- Rollback = redeploy previous image tag
- Every promotion audit logged (who, what, from, to)

### Environment Model

| Tier | Default Environments | Custom Environments |
|------|---------------------|-------------------|
| Low | prod | No |
| Mid | dev, prod | No |
| High | dev, staging, prod | Unlimited |
| Business | dev, staging, prod | Unlimited |

### Application Deployment Page

Central UI for managing each deployed application:

- **Deploy** — Upload JAR, view build status, deploy to environment, promote, rollback
- **Configuration** — Environment variables, JVM options, agent config overrides, application properties. Per-environment. Changes trigger rolling restart.
- **Secrets** — Create/edit platform-managed secrets. Link external vault secrets. Scoped per environment. Masked in UI, reveal with audit log.
- **Status** — Pod health, resource usage, agent connection status, recent events
- **Logs** — Live stdout/stderr stream
- **Versions** — Image history, promotion history, rollback targets

### Application Lifecycle

| Action | Mechanism |
|--------|-----------|
| Deploy | Upload JAR → build image → deploy to environment |
| Promote | Redeploy same image tag to next environment |
| Rollback | Redeploy previous image tag |
| Scale | Update replica count |
| Stop | Scale to 0 (preserves config) |
| Delete | Remove Deployment + clean registry images per retention |
| Logs | Stream via K8s log API |

---

## 8. Observability Integration

### Architecture

Each tenant gets a dedicated cameleer-server instance:
- Shared tiers: deployed in tenant's namespace
- Dedicated tiers: deployed in tenant's cluster

The ingress routes `/t/{tenant}/api/*` to the correct server instance (there is no custom API gateway; see Key Architectural Decisions). The server's React UI is embedded in the SaaS shell (nav, tenant switcher, billing pages provided by the shell; product UI rendered inside).

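
The path rule `/t/{tenant}/api/*` can be illustrated with a small resolver that extracts the tenant and the path to forward. This is a sketch of the matching logic only: the `TenantRouteSketch` class, the allowed tenant slug characters, and the choice to strip the `/t/{tenant}` prefix before forwarding are assumptions, and in practice the rule would live in ingress configuration rather than application code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: resolve a tenant and upstream path from a request path. */
class TenantRouteSketch {

    /** Matches /t/{tenant}/api and anything below it; the slug charset is an assumption. */
    private static final Pattern ROUTE =
            Pattern.compile("^/t/([a-z0-9][a-z0-9-]*)/api(/.*)?$");

    /** Tenant slug plus the path to send to that tenant's server. */
    record Target(String tenant, String upstreamPath) {}

    /** Returns the routing target, or empty when the path is not a tenant API path. */
    static Optional<Target> resolve(String path) {
        Matcher matcher = ROUTE.matcher(path);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        String rest = matcher.group(2) == null ? "" : matcher.group(2);
        return Optional.of(new Target(matcher.group(1), "/api" + rest));
    }
}
```

For example, `resolve("/t/tenant-a/api/routes")` yields tenant `tenant-a` with upstream path `/api/routes`, while `/t/tenant-a/apiary` is rejected because `/api` must be a full path segment.
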
### Agent Connection

- Agent bootstrap tokens generated by the SaaS platform
- Agents connect directly to their tenant's cameleer-server instance
- Agent auto-injected into customer Camel apps deployed on the platform
- External agents (customer-hosted Camel apps) can also connect using bootstrap tokens

### MOAT Features (gated by license)

| Feature | Description | Tier Availability |
|---------|-------------|-------------------|
| **Topology Graph** | Route dependency visualization from existing execution data | All tiers |
| **Payload Flow Lineage** | Per-processor before/after capture + format-aware diff | Limited on Low (route-scope only, max 10 captures/min), Full on Mid+ |
| **Cross-Service Correlation** | Distributed trace assembly + service dependency graph | Mid+ |
| **Live Route Debugger** | Browser-based route stepping with breakpoints | High+ |
| **Exchange Replay** | Re-execute recorded exchange with modified payload, fully audited | High+ |

### Server Configuration

- SaaS platform pushes tier-specific config: feature flags, retention limits, resource limits
- Server runs in "managed mode": trusts SaaS-issued JWTs, reports metrics back to platform
- Air-gapped mode: standalone with local license file

---

## 9. Secrets Management

### Day 1 Requirements

- **Platform-native secret store** — Encrypted at rest in K8s Secrets (sealed-secrets or SOPS)
- **External vault integration** — HashiCorp Vault at launch. AWS Secrets Manager, Azure Key Vault, GCP Secret Manager deferred to future release.
- **Injection** — Secrets injected into Camel app containers as environment variables or mounted files
- **Rotation** — Update secret → rolling restart of affected apps
- **RBAC** — Only authorized team members can create/view/rotate secrets
- **Per-environment scoping** — Dev secrets ≠ prod secrets
- **K8s External Secrets Operator** — Syncs external vault secrets into K8s Secrets

### Tenant Isolation

- Secrets strictly scoped to tenant + environment
- No cross-tenant secret access possible
- Envelope encryption with per-tenant keys on shared storage

### Audit

- Every secret access logged (create, read, update, delete, inject)
- Audit trail queryable by tenant admins

---

## 10. License & Feature Gating

### License Token

Ed25519-signed JWT containing:
- Tenant ID, tier, expiry
- Feature flags (topology, lineage, correlation, debugger, replay)
- Resource limits (agents, retention, environments, vaults, debug sessions)
- SSO/OIDC and custom roles entitlements

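
To make the token shape concrete, the sketch below signs and verifies a compact JWS with the JDK's built-in Ed25519 support (available since Java 15), using the `EdDSA` algorithm name from the JOSE registry. The `LicenseTokenSketch` class and the claim names used in examples are hypothetical; a real implementation would more likely use a JWT library and would also validate expiry and claims.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;

/** Hypothetical sketch: sign and verify a license as a compact EdDSA JWS. */
class LicenseTokenSketch {

    private static final Base64.Encoder B64 = Base64.getUrlEncoder().withoutPadding();
    private static final String HEADER = "{\"alg\":\"EdDSA\",\"typ\":\"JWT\"}";

    /** Generates a fresh Ed25519 key pair. */
    static KeyPair newKeyPair() {
        try {
            return KeyPairGenerator.getInstance("Ed25519").generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns header.payload.signature, each part base64url-encoded. */
    static String sign(String claimsJson, PrivateKey key) {
        String signingInput = B64.encodeToString(HEADER.getBytes(StandardCharsets.UTF_8))
                + "." + B64.encodeToString(claimsJson.getBytes(StandardCharsets.UTF_8));
        try {
            Signature signer = Signature.getInstance("Ed25519");
            signer.initSign(key);
            signer.update(signingInput.getBytes(StandardCharsets.US_ASCII));
            return signingInput + "." + B64.encodeToString(signer.sign());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Checks the signature only; expiry and claim checks are out of scope here. */
    static boolean verify(String token, PublicKey key) {
        String[] parts = token.split("\\.");
        if (parts.length != 3) {
            return false;
        }
        try {
            Signature verifier = Signature.getInstance("Ed25519");
            verifier.initVerify(key);
            verifier.update((parts[0] + "." + parts[1]).getBytes(StandardCharsets.US_ASCII));
            return verifier.verify(Base64.getUrlDecoder().decode(parts[2]));
        } catch (GeneralSecurityException | IllegalArgumentException e) {
            return false; // malformed signature or encoding
        }
    }
}
```

Only the public key is needed for verification, which is what lets an air-gapped server validate a license file without contacting the platform.
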
### Dual Validation

| Mode | Mechanism |
|------|-----------|
| **SaaS** | Server polls platform API `GET /api/license/{tenant}`, caches 5 min, 24h grace on API failure |
| **Air-gapped** | Server validates local file `/etc/cameleer/license.jwt`, Ed25519 signature verification |

Both modes produce the same `LicenseContext` singleton used throughout the server.

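
The SaaS-mode behaviour in the table (5-minute cache, 24-hour grace when the platform API is unreachable) could look roughly like the sketch below. The `LicenseCacheSketch` class and its fetcher callback are hypothetical stand-ins for the HTTP call to the platform; time is passed in explicitly so the logic is easy to test.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical sketch: cached license lookup with a grace window on API failure. */
class LicenseCacheSketch {

    static final Duration CACHE_TTL = Duration.ofMinutes(5);
    static final Duration FAILURE_GRACE = Duration.ofHours(24);

    /** Fetches the token from the platform; empty means the API was unreachable. */
    private final Supplier<Optional<String>> fetcher;
    private String cachedToken;
    private Instant fetchedAt;

    LicenseCacheSketch(Supplier<Optional<String>> fetcher) {
        this.fetcher = fetcher;
    }

    /**
     * Returns the license token to use at the given instant. Empty when no
     * token was ever fetched, or the last successful fetch is older than
     * the grace window and the API is still failing.
     */
    Optional<String> current(Instant now) {
        boolean fresh = cachedToken != null && now.isBefore(fetchedAt.plus(CACHE_TTL));
        if (fresh) {
            return Optional.of(cachedToken);
        }
        Optional<String> fetched = fetcher.get();
        if (fetched.isPresent()) {
            cachedToken = fetched.get();
            fetchedAt = now;
            return fetched;
        }
        boolean withinGrace = cachedToken != null
                && now.isBefore(fetchedAt.plus(FAILURE_GRACE));
        return withinGrace ? Optional.of(cachedToken) : Optional.empty();
    }
}
```

The sketch is not thread-safe; a server-side version would need synchronization or an atomic holder around the cached state.
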
### Enforcement

- Feature endpoints return 403 with `not_entitled` reason and upgrade URL
- Graceful degradation: features disabled, not errors
- License expiry: 7-day grace period (read-only mode), then hard cutoff

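
The expiry rule above (7-day read-only grace period, then hard cutoff) reduces to a small time comparison. The sketch below assumes a hypothetical `LicenseExpirySketch` class and mode names; how the server reacts to each mode is not shown.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch: derive the enforcement mode from the license expiry. */
class LicenseExpirySketch {

    enum Mode { ACTIVE, READ_ONLY, EXPIRED }

    /** Grace period after expiry during which the server stays read-only. */
    static final Duration GRACE = Duration.ofDays(7);

    /** Returns the mode that applies at the given instant. */
    static Mode modeAt(Instant expiry, Instant now) {
        if (now.isBefore(expiry)) {
            return Mode.ACTIVE;
        }
        if (now.isBefore(expiry.plus(GRACE))) {
            return Mode.READ_ONLY; // within the grace period
        }
        return Mode.EXPIRED; // hard cutoff
    }
}
```
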
### Lifecycle

- Generated on tenant signup, regenerated on tier change
- Air-gapped: downloadable from management platform
- Non-payment: license suspended → grace period → expired

---

## 11. Networking & Tenant Isolation

### Day 1: Namespace Isolation (Shared Tiers)

K8s NetworkPolicies per tenant namespace:
- **Default deny** all ingress/egress between tenant namespaces
- **Allow:** tenant namespace → shared PostgreSQL/OpenSearch (authenticated per-tenant credentials)
- **Allow:** tenant namespace → public internet (Camel app external connectivity)
- **Allow:** SaaS platform namespace → all tenant namespaces (management access)
- **Allow:** tenant Camel apps → tenant cameleer-server (intra-namespace)

### Zero-Trust Tenant Boundary

- Per-tenant database credentials (not shared superuser with row filtering)
- Per-tenant OpenSearch roles with index-level ACLs
- Connection pooling per tenant (PgBouncer per namespace)
- A compromised tenant server holds no credentials that can read another tenant's data

### Future: VPN / Private Connectivity

- WireGuard or IPsec tunnels to customer infrastructure
- Private DNS resolution for customer internal hostnames
- Available on High/Business tiers only

---

## 12. Security & SOC 2 Compliance

### Encryption

| Layer | Mechanism |
|-------|-----------|
| In transit (external) | TLS 1.3 at ingress |
| In transit (internal) | mTLS between services |
| In transit (agent ↔ server) | TLS + Ed25519-signed config payloads |
| At rest (databases) | Volume encryption (LUKS or cloud-native) |
| At rest (secrets) | Envelope encryption with per-tenant keys |
| At rest (registry) | Encrypted storage backend |

### Audit Trail

Every state-changing action produces an immutable audit record:
- Actor, tenant, action, resource, environment, source IP, result, metadata
- Append-only table with no UPDATE/DELETE grants
- Minimum 1 year retention
- Shipped to separate write-only sink (survives platform DB compromise)
- Covers: auth events, provisioning, deployments, config changes, secret access, billing, replay executions, debug sessions, admin actions

### Container Hardening

- Distroless base images (no shell in production)
- Read-only filesystem
- K8s Pod Security Standards: restricted profile (no root, no privilege escalation, no host access)
- Resource limits enforced — compromised tenant can't fork-bomb the node

### Supply Chain Security

- Container images signed with cosign/sigstore
- SBOM generated per build
- Dependency pinning (no floating versions)
- Trivy scanning in CI — block on critical CVEs
- Customer JAR uploads scanned

### Breach Detection

- Anomaly alerting: unusual API patterns, auth failures, cross-namespace DNS queries
- Runtime security scanning (Falco or similar)
- Audit log anomaly detection

### Payload Protection

- Application-level encryption of customer exchange payloads with per-tenant keys before writing to PG/OS
- Tenant key rotation without downtime
- Payload redaction rules configurable per tenant (agent already supports this)

### Compliance

- SOC 2 Trust Services Criteria: Security (Common Criteria CC1–CC9, including CC6 access controls), Availability (A1), Processing Integrity (PI1), Confidentiality (C1), Privacy (P1–P8)
- Evidence collection: git history (change management), audit log (access), Prometheus (availability)
- Evaluate Vanta or Drata for continuous compliance monitoring

---

## 13. Platform Operations & Self-Monitoring

### Monitoring Stack

| Tool | Purpose |
|------|---------|
| Prometheus | Metrics collection (platform + tenant infra + K8s) |
| Grafana | Dashboards |
| Loki | Log aggregation |
| Alertmanager | Alert routing → PagerDuty/OpsGenie/Slack |
| Uptime Kuma or Checkly | External synthetic monitoring |

Completely separate from tenant observability data.

### Key Day-1 Alerts

- Control plane down/degraded
- Tenant provisioning failure
- Database connection pool exhaustion
- OpenSearch cluster red/yellow
- Flux reconciliation failure
- TLS certificate expiry < 14 days
- Metering pipeline stale > 1 hour
- Disk usage > 80% on any PV
- Tenant cameleer-server unhealthy > 5 minutes
- OOMKill on any tenant workload

### Dashboards

- Platform overview: tenant count, active agents, provisioning queue, error rates
- Per-tenant health: server status, app status, resource usage
- Billing: MRR, usage trends, metering pipeline health
- Infrastructure: cluster capacity, node utilization, storage growth
- Security: auth failures, audit log anomalies, certificate status

### SLA Reporting

- Automated uptime calculation per tenant
- SLA breach detection and alerting
- Monthly availability reports for high/business tier customers

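
The uptime and breach checks listed above come down to simple arithmetic on the reporting period. As a worked example, a 99.9% target over a 30-day month allows 43.2 minutes of downtime, and 99.5% allows 3 hours 36 minutes. The sketch below assumes a hypothetical `SlaSketch` class and a fixed-length period; how downtime is measured (probes, health checks) is left open.

```java
import java.time.Duration;

/** Hypothetical sketch: downtime budget and breach check for an SLA target. */
class SlaSketch {

    /** Downtime allowed in a period for an availability target given in percent. */
    static Duration allowedDowntime(double targetPercent, Duration period) {
        double fraction = 1.0 - targetPercent / 100.0;
        return Duration.ofMillis(Math.round(period.toMillis() * fraction));
    }

    /** Measured availability in percent for the given downtime. */
    static double availabilityPercent(Duration period, Duration downtime) {
        return 100.0 * (1.0 - (double) downtime.toMillis() / period.toMillis());
    }

    /** Returns true when the observed downtime exceeds the budget. */
    static boolean breached(double targetPercent, Duration period, Duration downtime) {
        return downtime.compareTo(allowedDowntime(targetPercent, period)) > 0;
    }
}
```
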
---

## 14. Billing & Metering

### Metering Pipeline (Low/Mid Tiers)

```
K8s Metrics → Metrics Collector → Usage Aggregator (hourly) → Stripe Usage Records API
```

| Dimension | Unit | Source |
|-----------|------|--------|
| CPU | core·hours | K8s metrics (namespace aggregate) |
| RAM | GB·hours | K8s metrics (namespace aggregate) |
| Data volume | GB ingested | cameleer-server reports |

- Aggregated per tenant, per hour, stored in platform DB before Stripe submission
- Idempotent aggregation (safe to re-run)
- Staleness alert if no data for > 1 hour
- Monthly reconciliation: platform records vs Stripe invoices

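
A sketch of the hourly aggregation for the CPU dimension follows. Core·hours are computed as cores × seconds / 3600 per sample and summed per tenant and hour; idempotency comes from overwriting the stored row for each (tenant, hour) key instead of adding to it. The `UsageAggregatorSketch` class, the sample shape, and the in-memory map standing in for the `billing_usage` table are assumptions.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: hourly usage aggregation that is safe to re-run. */
class UsageAggregatorSketch {

    /** One metrics sample: average cores used over a sampling interval. */
    record CpuSample(String tenantId, Instant start, long seconds, double cores) {}

    /** Aggregation key: one row per tenant per hour. */
    record HourKey(String tenantId, Instant hour) {}

    /** Stands in for the billing_usage table, keyed for upserts. */
    private final Map<HourKey, Double> coreHours = new HashMap<>();

    /**
     * Recomputes core-hours from the given samples and overwrites the stored
     * value per (tenant, hour). Assumes each call receives all samples for the
     * hours it covers and that no sample spans an hour boundary.
     */
    void aggregate(List<CpuSample> samples) {
        Map<HourKey, Double> batch = new HashMap<>();
        for (CpuSample s : samples) {
            HourKey key = new HourKey(s.tenantId(), s.start().truncatedTo(ChronoUnit.HOURS));
            batch.merge(key, s.cores() * s.seconds() / 3600.0, Double::sum);
        }
        coreHours.putAll(batch); // upsert: replace existing rows, never add to them
    }

    /** Returns the stored core-hours for a tenant and hour, or 0 if none. */
    double coreHours(String tenantId, Instant hour) {
        return coreHours.getOrDefault(new HourKey(tenantId, hour), 0.0);
    }
}
```

Two 30-minute samples at 2 and 4 cores in the same hour give 1 + 2 = 3 core·hours, and running the aggregation again over the same samples leaves that value unchanged.
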
### Committed Resources (High/Business)

- Fixed Stripe subscription per resource bundle
- Overage alerts (upsell, not automatic billing)
- Annual/multi-year contracts

### Billing UI

- Current period usage with live cost estimate
- Historical usage charts per dimension
- Invoice history
- Plan management (upgrade/downgrade)

---

## 15. Management Platform UI

### Navigation

| Section | Content |
|---------|---------|
| **Dashboard** | Platform overview: apps, health, usage summary |
| **Apps** | List deployed Camel applications |
| **App → Deploy** | Upload JAR, build status, deploy/promote/rollback |
| **App → Configuration** | Env vars, JVM options, agent config. Per environment. |
| **App → Secrets** | Manage secrets, link vaults. Per environment. |
| **App → Status** | Pod health, resource usage, agent connection, events |
| **App → Logs** | Live stdout/stderr stream |
| **App → Versions** | Image history, promotion log, rollback |
| **Observe** | Embedded cameleer-server UI (topology, traces, lineage, correlation, debugger, replay) |
| **Team** | Users, roles, invites |
| **Settings** | Tenant config, SSO/OIDC, vault connections |
| **Billing** | Usage, invoices, plan management |

### Design

- SaaS shell built with `@cameleer/design-system`
- cameleer-server React UI embedded (same design system, visual consistency)
- Responsive but desktop-primary (observability tooling is a desktop workflow)

---

## 16. Day-1 vs Future Scope

### Day 1 (Launch)

| Epic | Scope |
|------|-------|
| #1 Management Platform | Modular monolith, React shell |
| #2 Identity & Access | Registration, login, teams, JWT, OIDC (high/business) |
| #3 Tenant Provisioning | Automated shared tiers, semi-manual dedicated |
| #4 Billing & Metering | Stripe usage-based + committed. Full metering pipeline. |
| #5 Camel Runtime | JAR upload → immutable image → deploy. Agent auto-injection. |
| #6 Observability | Per-tenant server, embedded UI, all MOAT features gated by tier |
| #7 License Module | Dual-mode (SaaS API + local file), feature gating |
| #8 Networking | Namespace isolation, NetworkPolicies, public internet |
| #9 Secrets | Platform-native + HashiCorp Vault. Per-environment scoping. |
| #10 Environments | Build-once-deploy-often. Tier-based environment model. |
| #11 Security & SOC 2 | Full SOC 2 foundations, zero-trust tenant boundaries, audit logging |
| #12 Self-Monitoring | Prometheus/Grafana/Loki/Alertmanager, key alerts, dashboards |
| #13 Exchange Replay | MOAT feature, extends debugger infrastructure |

### Deferred (Future)

| Feature | Reason |
|---------|--------|
| Automated dedicated cluster provisioning (Cluster API) | Semi-manual sufficient for early high/business customers |
| Container image deployment | JAR upload covers day 1 |
| Git-based deployment | Nice-to-have |
| VPN / private connectivity | Public internet sufficient at launch |
| Auto-scaling (HPA) | Manual scaling sufficient |
| Data residency / region selection | Single region at launch |
| Cross-tenant correlation federation | Designed, deferred to v2 |
| Additional vault providers (AWS, Azure, GCP) | HashiCorp Vault covers day 1 |
| Compliance tooling integration (Vanta/Drata) | Manual evidence collection initially |
| Vulnerability scanning in registry | Trivy in CI covers basics |

---

## 17. Gitea Issue Map

| # | Epic | Labels |
|---|------|--------|
| 1 | SaaS Management Platform | epic, platform |
| 2 | Identity & Access Management | epic, auth |
| 3 | Tenant Provisioning & Lifecycle | epic, infra |
| 4 | Billing & Metering | epic, billing |
| 5 | Camel Application Runtime | epic, runtime |
| 6 | Observability Integration | epic, observability |
| 7 | License & Feature Gating | epic, licensing |
| 8 | Networking & Tenant Isolation | epic, networking |
| 9 | Secrets Management | epic, secrets |
| 10 | Environments & Promotion Pipeline | epic, runtime, day-1 |
| 11 | Security & SOC 2 Compliance | epic, security |
| 12 | Platform Operations & Self-Monitoring | epic, ops |
| 13 | MOAT: Exchange Replay | epic, observability |

MOAT features (Debugger, Lineage, Correlation) tracked in cameleer/cameleer #57–#72.