Epic: Platform Operations & Self-Monitoring #12

claude · 2026-03-29T23:21:44+02:00

Overview

The observability platform must itself be observable. This epic covers comprehensive monitoring of the SaaS control plane, tenant infrastructure, and all shared services, with alerting that pages before customers notice.

Platform Components to Monitor

Control Plane (SaaS Management Platform)

Application metrics: request latency, error rates, active sessions, JVM stats
Health endpoints: liveness, readiness
API response times and error rates per endpoint
Provisioning pipeline: success/failure rates, duration, queue depth

Per-Tenant Infrastructure

cameleer3-server instances: health, resource usage, connection counts
Camel application pods: health, restarts, resource consumption, OOMKills
Namespace resource quota utilization (approaching limits → alert)
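
A minimal sketch of the "approaching limits → alert" check for namespace quotas. The function names, the dictionary shape, and the 80% threshold are assumptions for illustration; in practice the used/hard values would come from each tenant namespace's ResourceQuota status.

```python
def quota_utilization(used: dict[str, float], hard: dict[str, float]) -> dict[str, float]:
    """Return the used/hard ratio per resource. Resources without a positive hard limit are skipped."""
    return {
        name: used.get(name, 0.0) / limit
        for name, limit in hard.items()
        if limit > 0
    }


def approaching_limits(
    used: dict[str, float],
    hard: dict[str, float],
    threshold: float = 0.8,  # assumed threshold, not specified in this epic
) -> list[str]:
    """Return the sorted names of resources whose utilization is at or above the threshold."""
    ratios = quota_utilization(used, hard)
    return sorted(name for name, ratio in ratios.items() if ratio >= threshold)


# Hypothetical tenant namespace: 3.5 of 4 CPU cores and 4 of 16 GiB memory in use.
flagged = approaching_limits({"cpu": 3.5, "memory": 4.0}, {"cpu": 4.0, "memory": 16.0})
```

Here only `cpu` is flagged, since 3.5 / 4 = 0.875 is above the assumed threshold while memory sits at 0.25.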

Shared Services

PostgreSQL: connections, query latency, replication lag, disk usage, vacuum stats
OpenSearch: cluster health, index sizes, query latency, shard balance
Flux CD: reconciliation status, failures, drift detection

Kubernetes Cluster

Node health, resource pressure (CPU, memory, disk, PID)
Pod scheduling failures
NetworkPolicy enforcement
Certificate expiry (TLS, JWT signing keys)

Billing Pipeline

Stripe webhook delivery success/failure
Metering data freshness (stale metrics → incorrect bills)
Usage reporting lag
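
The freshness check could look roughly like the sketch below. The one-hour limit mirrors the "Metering pipeline stale > 1 hour" alert listed under Key Alerts; the function name and the idea of tracking a single last-sample timestamp are assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_metering_stale(
    last_sample_at: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=1),
) -> bool:
    """Return True when the newest metering sample is older than max_age."""
    return now - last_sample_at > max_age


# Hypothetical timestamps, both timezone-aware so the subtraction is well defined.
now = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
fresh = is_metering_stale(now - timedelta(minutes=20), now)
stale = is_metering_stale(now - timedelta(hours=2), now)
```

A sample exactly one hour old is not treated as stale here, because the alert is phrased as "> 1 hour".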

Stack

Metrics: Prometheus + Grafana (self-hosted, not a dependency on the tenant stack)
Logs: Loki or OpenSearch (separate from tenant data)
Alerting: Alertmanager → PagerDuty/OpsGenie/Slack
Uptime: External synthetic monitoring (e.g., Uptime Kuma or Checkly)

Key Alerts (Day 1)

Control plane down / degraded
Tenant provisioning failure
Database connection pool exhaustion
OpenSearch cluster red/yellow
Flux reconciliation failure (tenant env drift)
TLS certificate expiry < 14 days
Metering pipeline stale > 1 hour
Disk usage > 80% on any persistent volume
Tenant cameleer3-server unhealthy for > 5 minutes
OOMKill on any tenant workload
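
To make the numeric thresholds concrete, here is a small sketch that evaluates three of the alerts above (certificate expiry, disk usage, unhealthy server) against one snapshot. The `Snapshot` type, its field names, and the alert identifiers are hypothetical; a real deployment would more likely express these as Prometheus alerting rules routed through Alertmanager.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time view of one tenant environment."""
    cert_not_after: datetime                     # TLS certificate expiry time
    disk_used_ratio: float                       # 0.0 to 1.0 for a persistent volume
    server_unhealthy_since: Optional[datetime]   # None while the server is healthy


def firing_alerts(snapshot: Snapshot, now: datetime) -> list[str]:
    """Return the names of the threshold alerts that would fire for this snapshot."""
    alerts = []
    if snapshot.cert_not_after - now < timedelta(days=14):
        alerts.append("tls_certificate_expiry")
    if snapshot.disk_used_ratio > 0.80:
        alerts.append("disk_usage")
    unhealthy_since = snapshot.server_unhealthy_since
    if unhealthy_since is not None and now - unhealthy_since > timedelta(minutes=5):
        alerts.append("tenant_server_unhealthy")
    return alerts


now = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
snapshot = Snapshot(
    cert_not_after=now + timedelta(days=10),
    disk_used_ratio=0.85,
    server_unhealthy_since=now - timedelta(minutes=3),
)
alerts = firing_alerts(snapshot, now)
```

In this example the certificate and disk alerts fire, while the server has been unhealthy for only three minutes and stays below the five-minute bar.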

Dashboards

Platform overview: tenant count, active agents, provisioning queue, error rates
Per-tenant health: server status, app status, resource usage
Billing: MRR, usage trends, metering pipeline health
Infrastructure: cluster capacity, node utilization, storage growth
Security: auth failures, audit log anomalies, certificate status

SLA Reporting

Automated uptime calculation per tenant
SLA breach detection and alerting
Monthly availability reports for high/business tier customers
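
The per-tenant uptime calculation can be sketched as below, assuming downtime is already aggregated into seconds per reporting period. The function names and the 99.95% target in the example are illustrative; this epic does not state the actual SLA targets.

```python
def availability_percent(period_seconds: float, downtime_seconds: float) -> float:
    """Return availability for a reporting period as a percentage between 0 and 100."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    # Clamp downtime so bad input cannot yield values outside 0..100.
    downtime = min(max(downtime_seconds, 0.0), period_seconds)
    return 100.0 * (period_seconds - downtime) / period_seconds


def breaches_sla(availability: float, target_percent: float) -> bool:
    """Return True when measured availability falls below the target."""
    return availability < target_percent


# Hypothetical 30-day month with 2592 seconds (43.2 minutes) of downtime.
month_seconds = 30 * 24 * 60 * 60
availability = availability_percent(month_seconds, 2592)
breached = breaches_sla(availability, target_percent=99.95)  # assumed target
```

With these inputs the tenant's availability is about 99.9%, which would count as a breach against the assumed 99.95% target.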


claude added the epic ops labels 2026-03-29 23:21:48 +02:00

claude referenced this issue

2026-03-30 09:24:29 +02:00

Phase 7: Security Hardening + Monitoring (was Phase 8) #30

Reference: cameleer/cameleer-saas#12