Epic: Platform Operations & Self-Monitoring #12

Open
opened 2026-03-29 23:21:44 +02:00 by claude · 0 comments

Overview

The observability platform must itself be observable. This epic covers comprehensive monitoring of the SaaS control plane, tenant infrastructure, and all shared services, with alerting that pages the team before customers notice a problem.

Platform Components to Monitor

Control Plane (SaaS Management Platform)

  • Application metrics: request latency, error rates, active sessions, JVM stats
  • Health endpoints: liveness, readiness
  • API response times and error rates per endpoint
  • Provisioning pipeline: success/failure rates, duration, queue depth
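
As a rough illustration of the provisioning pipeline numbers above, the sketch below derives a failure rate and a nearest-rank p95 duration from a batch of completed runs. The `ProvisioningRun` record and the `summarize` helper are hypothetical names introduced for this example, not part of the platform; queue depth is not covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvisioningRun:
    """Hypothetical record of one finished tenant provisioning attempt."""
    tenant_id: str
    succeeded: bool
    duration_s: float


def summarize(runs: list) -> dict:
    """Return total count, failure rate, and nearest-rank p95 duration."""
    total = len(runs)
    if total == 0:
        return {"total": 0, "failure_rate": 0.0, "p95_duration_s": None}

    failures = sum(1 for r in runs if not r.succeeded)
    durations = sorted(r.duration_s for r in runs)

    # Nearest-rank percentile: rank = ceil(0.95 * n), computed with
    # integer arithmetic to avoid floating-point rounding surprises.
    rank = -(-95 * total // 100)
    p95: Optional[float] = durations[rank - 1]

    return {
        "total": total,
        "failure_rate": failures / total,
        "p95_duration_s": p95,
    }
```

In a real deployment these figures would more likely come from Prometheus counters and histograms than from an in-memory list; the sketch only shows the arithmetic.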

Per-Tenant Infrastructure

  • cameleer3-server instances: health, resource usage, connection counts
  • Camel application pods: health, restarts, resource consumption, OOMKills
  • Namespace resource quota utilization (approaching limits → alert)
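
The "approaching limits" check for namespace quotas could look roughly like the sketch below. The `QuotaUsage` type and the 85% threshold are assumptions made for illustration; the issue does not specify a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaUsage:
    """Hypothetical snapshot of one quota-limited resource in a namespace."""
    namespace: str
    resource: str  # e.g. "requests.cpu", "limits.memory", "pods"
    used: float
    hard: float    # quota limit, in the same unit as `used`


def quota_utilization(q: QuotaUsage) -> float:
    """Return used/hard as a fraction; 0.0 when no positive limit is set."""
    if q.hard <= 0:
        return 0.0
    return q.used / q.hard


def approaching_limit(usages, threshold: float = 0.85):
    """Return usages at or above the threshold, most utilized first.

    The 0.85 default is an assumed value, not one taken from the issue.
    """
    hits = [q for q in usages if quota_utilization(q) >= threshold]
    return sorted(hits, key=quota_utilization, reverse=True)
```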

Shared Services

  • PostgreSQL: connections, query latency, replication lag, disk usage, vacuum stats
  • OpenSearch: cluster health, index sizes, query latency, shard balance
  • Flux CD: reconciliation status, failures, drift detection

Kubernetes Cluster

  • Node health, resource pressure (CPU, memory, disk, PID)
  • Pod scheduling failures
  • NetworkPolicy enforcement
  • Certificate expiry (TLS, JWT signing keys)

Billing Pipeline

  • Stripe webhook delivery success/failure
  • Metering data freshness (stale metrics → incorrect bills)
  • Usage reporting lag

Stack

  • Metrics: Prometheus + Grafana (self-hosted, not a dependency on the tenant stack)
  • Logs: Loki or OpenSearch (separate from tenant data)
  • Alerting: Alertmanager → PagerDuty/OpsGenie/Slack
  • Uptime: External synthetic monitoring (e.g., Uptime Kuma or Checkly)

Key Alerts (Day 1)

  • Control plane down / degraded
  • Tenant provisioning failure
  • Database connection pool exhaustion
  • OpenSearch cluster red/yellow
  • Flux reconciliation failure (tenant env drift)
  • TLS certificate expiry < 14 days
  • Metering pipeline stale > 1 hour
  • Disk usage > 80% on any persistent volume
  • Tenant cameleer3-server unhealthy for > 5 minutes
  • OOMKill on any tenant workload
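
The four alerts above that carry explicit thresholds (certificate expiry, metering staleness, disk usage, server health) can be written as simple predicates. The sketch below is an illustration only: the `Snapshot` type and the alert names are hypothetical, and in the proposed stack these conditions would presumably live in Prometheus alerting rules rather than application code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Thresholds taken from the Key Alerts list.
CERT_EXPIRY_WARNING = timedelta(days=14)
METERING_MAX_AGE = timedelta(hours=1)
DISK_USAGE_MAX = 0.80
SERVER_UNHEALTHY_MAX = timedelta(minutes=5)


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time view of the monitored values."""
    now: datetime
    cert_not_after: datetime
    last_metering_sample: datetime
    disk_usage_fraction: float                  # 0.0 to 1.0
    server_unhealthy_since: Optional[datetime]  # None while healthy


def firing_alerts(s: Snapshot) -> list:
    """Return the names of threshold alerts that fire for this snapshot."""
    alerts = []
    if s.cert_not_after - s.now < CERT_EXPIRY_WARNING:
        alerts.append("tls_certificate_expiring")
    if s.now - s.last_metering_sample > METERING_MAX_AGE:
        alerts.append("metering_pipeline_stale")
    if s.disk_usage_fraction > DISK_USAGE_MAX:
        alerts.append("disk_usage_high")
    if (
        s.server_unhealthy_since is not None
        and s.now - s.server_unhealthy_since > SERVER_UNHEALTHY_MAX
    ):
        alerts.append("tenant_server_unhealthy")
    return alerts
```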

Dashboards

  • Platform overview: tenant count, active agents, provisioning queue, error rates
  • Per-tenant health: server status, app status, resource usage
  • Billing: MRR, usage trends, metering pipeline health
  • Infrastructure: cluster capacity, node utilization, storage growth
  • Security: auth failures, audit log anomalies, certificate status

SLA Reporting

  • Automated uptime calculation per tenant
  • SLA breach detection and alerting
  • Monthly availability reports for high/business tier customers
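
A minimal sketch of the per-tenant uptime calculation follows, assuming outages are recorded as (start, end) intervals. Overlapping intervals are merged and clipped to the reporting period so that downtime is not double-counted. The function names and the 99.9% target are assumptions; the issue does not state SLA targets.

```python
from datetime import datetime, timedelta


def availability(period_start: datetime, period_end: datetime, outages) -> float:
    """Return the fraction of the period during which the tenant was up.

    `outages` is an iterable of (start, end) datetimes; intervals may
    overlap each other or extend beyond the reporting period.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")

    # Clip each outage to the period, then sort by start time.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in outages
    )

    down = timedelta(0)
    cur_start = cur_end = None
    for start, end in clipped:
        if start >= end:
            continue  # entirely outside the period, or empty
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                down += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # merge overlapping outage
    if cur_end is not None:
        down += cur_end - cur_start

    return 1.0 - down.total_seconds() / total


def breaches_sla(avail: float, target: float = 0.999) -> bool:
    """True when availability falls below the (assumed) SLA target."""
    return avail < target
```

For a 30-day month, two overlapping outages that together span 45 minutes give an availability just under 99.9%, which this sketch would flag as a breach against the assumed target.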
claude added the epic and ops labels 2026-03-29 23:21:48 +02:00
Reference: cameleer/cameleer-saas#12