cameleer-server/docs/superpowers/plans/2026-04-16-environment-scoping.md
hsiegeln 9b1ef51d77
feat!: scope per-app config and settings by environment
BREAKING: wipe dev PostgreSQL before deploying — V1 checksum changes.
Agents must now send environmentId on registration (400 if missing).

Two tables previously keyed on app name alone caused cross-environment
data bleed: writing config for (app=X, env=dev) would overwrite the row
used by (app=X, env=prod) agents, and agent startup fetches ignored env
entirely.

- V1 schema: application_config and app_settings are now PK (app, env).
- Repositories: env-keyed finders/saves; env is the authoritative column,
  stamped on the stored JSON so the row agrees with itself.
- ApplicationConfigController.getConfig is dual-mode — AGENT role uses
  JWT env claim (agents cannot spoof env); non-agent callers provide env
  via ?environment= query param.
- AppSettingsController endpoints now require ?environment=.
- SensitiveKeysAdminController fan-out iterates (app, env) slices so each
  env gets its own merged keys.
- DiagramController ingestion stamps env on TaggedDiagram; ClickHouse
  route_diagrams INSERT + findProcessorRouteMapping are env-scoped.
- AgentRegistrationController: environmentId is required on register;
  removed all "default" fallbacks from register/refresh/heartbeat auto-heal.
- UI hooks (useApplicationConfig, useProcessorRouteMapping, useAppSettings,
  useAllAppSettings, useUpdateAppSettings) take env, wired to
  useEnvironmentStore at all call sites.
- New ConfigEnvIsolationIT covers env-isolation for both repositories.

Plan in docs/superpowers/plans/2026-04-16-environment-scoping.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 22:25:21 +02:00

Environment-scoped config — fixing cross-env data bleed

Date: 2026-04-16
Status: Not started
Backwards compatibility: None (pre-1.0; user will wipe dev DB)

Problem

Two PostgreSQL tables key per-app state on the application name alone, despite environments (dev/staging/prod) being first-class in the rest of the system:

  • application_config PK (application) — traced processors, taps, route recording, per-app sensitive keys. All env-sensitive.
  • app_settings PK (application_id) — SLA threshold, health warn/crit thresholds. All env-sensitive.

Consequences:

  • Config corruption: PUT /api/v1/config/{app}?environment=dev correctly scopes the SSE fan-out but overwrites the single DB row, so when prod agents restart and fetch config they get the dev config.
  • Agent startup is env-blind: GET /api/v1/config/{app} reads neither JWT env claim nor any request parameter; returns whichever row exists.
  • Dashboard settings ambiguous: AppSettings endpoints have no env parameter; SLA/health displayed without env context.
  • Ancillary: ClickHouseDiagramStore.findProcessorRouteMapping(appId) doesn't filter by env even though the table has an environment column.
  • Ancillary: AgentRegistrationController accepts registrations without environmentId and silently defaults to "default" — masks misconfigured agents.
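
The config-corruption consequence can be reproduced with a minimal in-memory sketch. The class AppKeyedConfigStore and its methods are hypothetical stand-ins for a table keyed on application alone, not the server's actual repository code; the app name "orders" and the JSON bodies are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a table whose primary key is the application name only.
class AppKeyedConfigStore {
    private final Map<String, String> rowsByApp = new HashMap<>();

    // The environment argument is accepted but never used in the key,
    // which mirrors the bug described above.
    void put(String app, String environment, String configJson) {
        rowsByApp.put(app, configJson);
    }

    // Reads are env-blind: whichever row was written last is returned.
    String get(String app) {
        return rowsByApp.get(app);
    }

    public static void main(String[] args) {
        AppKeyedConfigStore store = new AppKeyedConfigStore();
        store.put("orders", "prod", "{\"taps\":[]}");
        store.put("orders", "dev", "{\"taps\":[\"debug-tap\"]}");
        // A prod agent restarting now would fetch the dev config.
        System.out.println(store.get("orders")); // prints {"taps":["debug-tap"]}
    }
}
```

Because the second write replaces the only row for the app, the prod configuration is lost without any error being raised.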

Non-goals / working correctly (do not touch)

  • All ClickHouse observability tables (executions, logs, metrics, stats_1m_*) — already env-scoped.
  • AgentCommandController / SSE command fan-out — already env-filtered via AgentRegistryService.findByApplicationAndEnvironment.
  • SearchController search path — fixed in commit e2d9428.
  • RBAC (users/roles/groups/claim mappings) — tenant-wide by design.
  • Global sensitive-keys push to all envs (SensitiveKeysAdminController.fanOutToAllAgents) — by design; global baseline.
  • Admin UI per-page env indicator — not needed, already shown in top-right of the shell.

Design decisions (fixed)

Question | Answer
--- | ---
Schema migration strategy | Edit V1__init.sql in place. User wipes dev DB.
Agent config fetch with no/unknown env | Return 404 Not Found. No "default" fallback.
cameleer-common ApplicationConfig model | Add environment field in-place; agent team coordinates the bump (SNAPSHOT).
Agent registration without environmentId | Return 400 Bad Request. Registration MUST include env.
UI per-screen env display | Already covered by top-right global env indicator; no extra UI work.

Plan

Phase 1 — PostgreSQL schema

  1. Edit cameleer-server-app/src/main/resources/db/migration/V1__init.sql:
    • application_config: add environment TEXT NOT NULL column; change PK to (application, environment).
    • app_settings: add environment TEXT NOT NULL column; change PK to (application_id, environment).
  2. Commit message MUST call out: "Wipe dev DB before deploying — Flyway V1 checksum changes."

Phase 2 — Shared model (cameleer-common)

  1. ApplicationConfig stays untouched on the server side. The agent team is adding environment to the common class separately; the server doesn't depend on it. On the server, environment flows as a sidecar parameter through repositories/controllers and as a dedicated environment column on application_config. The stored JSON body contains only the config content. If/when the field appears in the common class, we'll hydrate it from the DB column into the returned DTO — no code change needed today.
  2. Add environment field to AppSettings record in cameleer-server-core (admin/AppSettings.java). Done.

Phase 3 — Repositories

  1. PostgresApplicationConfigRepository:
    • findByApplicationAndEnvironment(String app, String env) replaces findByApplication(app).
    • findAll(String env) (env required) replaces findAll().
    • save(String app, String env, ApplicationConfig, String updatedBy) replaces save(app, config, updatedBy).
    • Keep behaviour identical except for the PK.
  2. AppSettingsRepository interface (core) and PostgresAppSettingsRepository (app) — same treatment with (applicationId, environment).
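
The repository shape above can be sketched in memory as follows. This is only an illustration under stated assumptions: the method names come from the plan, but ConfigKey, StoredConfig, the String-typed config body and the map-backed storage are hypothetical; the real implementation sits on PostgreSQL and uses the ApplicationConfig type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// In-memory sketch of an env-keyed config repository (not the real Postgres class).
class EnvKeyedConfigRepositorySketch {
    // Hypothetical composite key mirroring PK (application, environment).
    record ConfigKey(String application, String environment) {}

    // Hypothetical row type; the config body is a plain JSON string here.
    record StoredConfig(String application, String environment,
                        String configJson, String updatedBy) {}

    private final Map<ConfigKey, StoredConfig> rows = new LinkedHashMap<>();

    Optional<StoredConfig> findByApplicationAndEnvironment(String app, String env) {
        return Optional.ofNullable(rows.get(new ConfigKey(app, env)));
    }

    // Env is required: there is no variant that lists rows across environments.
    List<StoredConfig> findAll(String env) {
        List<StoredConfig> result = new ArrayList<>();
        for (StoredConfig row : rows.values()) {
            if (row.environment().equals(env)) {
                result.add(row);
            }
        }
        return result;
    }

    // Upsert on the composite key; a write for dev cannot touch the prod row.
    void save(String app, String env, String configJson, String updatedBy) {
        rows.put(new ConfigKey(app, env), new StoredConfig(app, env, configJson, updatedBy));
    }
}
```

With this key, saving ("orders", "dev") and then looking up ("orders", "prod") returns an empty Optional unless a prod row was written separately.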

Phase 4 — REST controllers

  1. ApplicationConfigController:
    • getConfig(@PathVariable app): dual-mode by caller role. For AGENT role → env taken from JWT env claim, query param ignored (agents cannot spoof env). For non-agent callers (admin UI, with user JWTs whose env="default" is a placeholder) → env must be passed via ?environment= query param. If neither produces a value → 404.
    • updateConfig(@PathVariable app, @RequestParam String environment, ...): make environment required. Forward to repo save. SSE push already env-scoped — keep.
    • listConfigs(@RequestParam String environment): require env; filter.
    • getProcessorRouteMapping(@PathVariable app, @RequestParam String environment): require env; forward to ClickHouse.
    • testExpression(@PathVariable app, @RequestParam String environment, ...): make env required (already accepted as optional — tighten).
  2. AppSettingsController:
    • GET /api/v1/admin/app-settings?environment=: list filtered.
    • GET /api/v1/admin/app-settings/{appId}?environment=: require env.
    • PUT /api/v1/admin/app-settings/{appId}?environment=: require env.
  3. SensitiveKeysAdminController: review — global sensitive keys are server-wide (one row in server_config), no change needed. Add code comment clarifying env-wide push is intentional.
  4. SearchController.stats: the SLA threshold lookup appSettingsRepository.findByApplicationId(app) becomes env-aware via the existing environment query param.
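
The dual-mode rule for getConfig can be sketched as a pure function, independent of Spring. ConfigEnvResolver, its parameter names and the boolean role flag are assumptions made for illustration; how the real controller obtains the role and the JWT claim is covered by Phase 6 and is not shown here.

```java
import java.util.Optional;

// Sketch of the env-resolution rule for getConfig (hypothetical helper class).
class ConfigEnvResolver {
    /**
     * Returns the environment to use for the lookup, or empty when the
     * caller should receive 404 Not Found.
     */
    static Optional<String> resolveEnvironment(boolean callerIsAgent,
                                               String jwtEnvClaim,
                                               String environmentQueryParam) {
        // Agents: only the JWT claim counts, so the query param cannot spoof env.
        // Non-agents: the JWT env is a placeholder, so only the query param counts.
        String candidate = callerIsAgent ? jwtEnvClaim : environmentQueryParam;
        if (candidate == null || candidate.isBlank()) {
            return Optional.empty();
        }
        return Optional.of(candidate);
    }

    public static void main(String[] args) {
        // Agent with env=dev in its JWT tries to pass ?environment=prod.
        System.out.println(resolveEnvironment(true, "dev", "prod"));      // prints Optional[dev]
        // Admin UI user whose JWT carries the "default" placeholder.
        System.out.println(resolveEnvironment(false, "default", "prod")); // prints Optional[prod]
        // Agent JWT without an env claim.
        System.out.println(resolveEnvironment(true, null, "prod"));       // prints Optional.empty
    }
}
```

Keeping the rule in one small function makes the 404 case easy to cover with a unit test before wiring it into the controller.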

Phase 5 — Storage

  1. ClickHouseDiagramStore.findProcessorRouteMapping(app) → findProcessorRouteMapping(app, env). Include environment = ? in WHERE.

Phase 6 — JWT surface

  1. Expose env claim via Spring Authentication principal — simplest path is a small custom AuthenticationPrincipal or @RequestAttribute("env") populated by JwtAuthenticationFilter. Keep scope minimal; only ApplicationConfigController.getConfig needs it directly for the 404 rule.

Phase 7 — Agent registration hardening

  1. AgentRegistrationController.register:
    • If request.environmentId() is null or blank → 400 Bad Request with an explicit error message. Drop the "default" fallback on line 122.
    • Log the rejection (agent identity + remote IP) at INFO for diagnostics.
  2. AgentRegistrationController.refreshToken: remove the "default" fallback at line 211 (dead after step 1 of this phase, but harmless to clean up).
  3. AgentRegistrationController.heartbeat: already falls back to JWT claim; after step 1 of this phase every JWT has a real env, so the "default" fallback at line 247 is dead code; remove it.
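
The registration check in step 1 amounts to a null-or-blank test with no fallback. The sketch below assumes a hypothetical RegistrationEnvCheck helper and a simple Result record; the error message text is illustrative, and the real controller would return a proper HTTP response and log the rejection.

```java
// Sketch of the registration env check (hypothetical helper, not the controller).
class RegistrationEnvCheck {
    // Hypothetical result type: HTTP status plus a human-readable message.
    record Result(int status, String message) {}

    static Result validate(String environmentId) {
        // No "default" fallback: a missing or blank env is a client error.
        if (environmentId == null || environmentId.isBlank()) {
            return new Result(400, "environmentId is required on agent registration");
        }
        return new Result(200, "ok");
    }

    public static void main(String[] args) {
        System.out.println(validate(null).status());   // prints 400
        System.out.println(validate("  ").status());   // prints 400
        System.out.println(validate("prod").status()); // prints 200
    }
}
```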

Phase 8 — UI queries

  1. ui/src/api/queries/dashboard.ts: useAppSettings(appId) → useAppSettings(appId, environment); same for useAllAppSettings(). Pull env from useEnvironmentStore.
  2. ui/src/api/queries/commands.ts: verify useApplicationConfig(appId) / useUpdateApplicationConfig already pass env. Add if missing. (Audit pass only, may be no-op.)
  3. Verify no other UI hook fetches per-app state without env.

Phase 9 — Tests

  1. Integration: write config for (app=X, env=dev); read for (app=X, env=prod) returns empty/default.
  2. Integration: agent JWT with env=dev calling GET /api/v1/config/X returns the dev config row. JWT with no env claim → 404.
  3. Integration: POST /api/v1/agents/register with no environmentId → 400.
  4. Unit: AppSettingsRepository env-isolation test.

Phase 10 — Documentation

  1. CLAUDE.md:
    • "Storage" section: update application_config and app_settings PK description.
    • Agent lifecycle section: note that registration requires environmentId (was optional, defaulted to "default").
    • Remove the "priority: heartbeat environmentId > JWT env claim > "default"" note — after fix, every agent has a real env on every path.
  2. .claude/rules/app-classes.md:
    • ApplicationConfigController — reflect env-required endpoints.
    • AppSettingsController — reflect env-required endpoints.
    • AgentRegistrationController — note env required.
  3. .claude/rules/core-classes.md:
    • PostgresApplicationConfigRepository, PostgresAppSettingsRepository — updated signatures.

Execution order

Phases are mostly sequential by dependency: 1 → 2 → 3 → (4 together with 6, 5 in parallel) → 7 → 8 → 9 → 10. Phase 6 (JWT surfacing) is a small dependency of the Phase 4 controller changes, so implement it alongside Phase 4 rather than after it.

Verification

  • mvn clean verify passes.
  • detect_changes scope matches the files touched per phase.
  • Manual: spin up two envs (dev + prod) locally; configure tap in dev; confirm prod agent doesn't receive it and its DB row is untouched.
  • Manual: start an agent without env in the registration payload; confirm server returns 400.

Out of scope / follow-ups

  • audit_log has no environment column; filtering audit by env would be nice-to-have but not a correctness issue. Defer.
  • Agent bootstrap-token scoping to env (so a dev token can't register as prod) — security hardening for after 1.0.