Files
cameleer-server/docs/superpowers/plans/2026-04-16-environment-scoping.md

136 lines
9.2 KiB
Markdown
Raw Normal View History

feat!: scope per-app config and settings by environment BREAKING: wipe dev PostgreSQL before deploying — V1 checksum changes. Agents must now send environmentId on registration (400 if missing). Two tables previously keyed on app name alone caused cross-environment data bleed: writing config for (app=X, env=dev) would overwrite the row used by (app=X, env=prod) agents, and agent startup fetches ignored env entirely. - V1 schema: application_config and app_settings are now PK (app, env). - Repositories: env-keyed finders/saves; env is the authoritative column, stamped on the stored JSON so the row agrees with itself. - ApplicationConfigController.getConfig is dual-mode — AGENT role uses JWT env claim (agents cannot spoof env); non-agent callers provide env via ?environment= query param. - AppSettingsController endpoints now require ?environment=. - SensitiveKeysAdminController fan-out iterates (app, env) slices so each env gets its own merged keys. - DiagramController ingestion stamps env on TaggedDiagram; ClickHouse route_diagrams INSERT + findProcessorRouteMapping are env-scoped. - AgentRegistrationController: environmentId is required on register; removed all "default" fallbacks from register/refresh/heartbeat auto-heal. - UI hooks (useApplicationConfig, useProcessorRouteMapping, useAppSettings, useAllAppSettings, useUpdateAppSettings) take env, wired to useEnvironmentStore at all call sites. - New ConfigEnvIsolationIT covers env-isolation for both repositories. Plan in docs/superpowers/plans/2026-04-16-environment-scoping.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 22:25:21 +02:00
# Environment-scoped config — fixing cross-env data bleed
**Date:** 2026-04-16
**Status:** Not started
**Backwards compatibility:** None (pre-1.0; user will wipe dev DB)
## Problem
Two PostgreSQL tables key per-app state on the application name alone, despite environments (`dev`/`staging`/`prod`) being first-class in the rest of the system:
- `application_config` PK `(application)` — traced processors, taps, route recording, per-app sensitive keys. All env-sensitive.
- `app_settings` PK `(application_id)` — SLA threshold, health warn/crit thresholds. All env-sensitive.
Consequences:
- **Config corruption**: `PUT /api/v1/config/{app}?environment=dev` correctly scopes the SSE fan-out but overwrites the single DB row, so when `prod` agents restart and fetch config they get the `dev` config.
- **Agent startup is env-blind**: `GET /api/v1/config/{app}` reads neither JWT `env` claim nor any request parameter; returns whichever row exists.
- **Dashboard settings ambiguous**: `AppSettings` endpoints have no env parameter; SLA/health displayed without env context.
- Ancillary: `ClickHouseDiagramStore.findProcessorRouteMapping(appId)` doesn't filter by env even though the table has an `environment` column.
- Ancillary: `AgentRegistrationController` accepts registrations without `environmentId` and silently defaults to `"default"` — masks misconfigured agents.
## Non-goals / working correctly (do not touch)
- All ClickHouse observability tables (executions, logs, metrics, stats_1m_*) — already env-scoped.
- `AgentCommandController` / SSE command fan-out — already env-filtered via `AgentRegistryService.findByApplicationAndEnvironment`.
- `SearchController` search path — fixed in commit `e2d9428`.
- RBAC (users/roles/groups/claim mappings) — tenant-wide by design.
- Global sensitive-keys push to all envs (`SensitiveKeysAdminController.fanOutToAllAgents`) — by design; global baseline.
- Admin UI per-page env indicator — not needed, already shown in top-right of the shell.
## Design decisions (fixed)
| Question | Answer |
|---|---|
| Schema migration strategy | Edit `V1__init.sql` in place. User wipes dev DB. |
| Agent config fetch with no/unknown env | Return `404 Not Found`. No `"default"` fallback. |
| `cameleer-common` `ApplicationConfig` model | Add `environment` field in-place; agent team coordinates the bump (SNAPSHOT). |
| Agent registration without `environmentId` | Return `400 Bad Request`. Registration MUST include env. |
| UI per-screen env display | Already covered by top-right global env indicator — no extra UI work. |
## Plan
### Phase 1 — PostgreSQL schema
1. Edit `cameleer-server-app/src/main/resources/db/migration/V1__init.sql`:
- `application_config`: add `environment TEXT NOT NULL` column; change PK to `(application, environment)`.
- `app_settings`: add `environment TEXT NOT NULL` column; change PK to `(application_id, environment)`.
2. Commit message MUST call out: "Wipe dev DB before deploying — Flyway V1 checksum changes."
### Phase 2 — Shared model (`cameleer-common`)
3. **`ApplicationConfig` stays untouched on the server side.** The agent team is adding `environment` to the common class separately; the server doesn't depend on it. On the server, `environment` flows as a sidecar parameter through repositories/controllers and as a dedicated `environment` column on `application_config`. The stored JSON body contains only the config content. If/when the field appears in the common class, we'll hydrate it from the DB column into the returned DTO — no code change needed today.
4. Add `environment` field to `AppSettings` record in `cameleer-server-core` (`admin/AppSettings.java`). **Done.**
### Phase 3 — Repositories
5. `PostgresApplicationConfigRepository`:
- `findByApplicationAndEnvironment(String app, String env)` replaces `findByApplication(app)`.
- `findAll(String env)` (env required) replaces `findAll()`.
- `save(String app, String env, ApplicationConfig, String updatedBy)` replaces `save(app, config, updatedBy)`.
- Keep behaviour identical except for the PK.
6. `AppSettingsRepository` interface (core) and `PostgresAppSettingsRepository` (app) — same treatment with `(applicationId, environment)`.
### Phase 4 — REST controllers
7. `ApplicationConfigController`:
- `getConfig(@PathVariable app)`: dual-mode by caller role. For AGENT role → env taken from JWT `env` claim, query param ignored (agents cannot spoof env). For non-agent callers (admin UI, with user JWTs whose `env="default"` is a placeholder) → env must be passed via `?environment=` query param. If neither produces a value → **404**.
- `updateConfig(@PathVariable app, @RequestParam String environment, ...)`: make `environment` required. Forward to repo save. SSE push already env-scoped — keep.
- `listConfigs(@RequestParam String environment)`: require env; filter.
- `getProcessorRouteMapping(@PathVariable app, @RequestParam String environment)`: require env; forward to ClickHouse.
- `testExpression(@PathVariable app, @RequestParam String environment, ...)`: make env required (already accepted as optional — tighten).
8. `AppSettingsController`:
- `GET /api/v1/admin/app-settings?environment=`: list filtered.
- `GET /api/v1/admin/app-settings/{appId}?environment=`: require env.
- `PUT /api/v1/admin/app-settings/{appId}?environment=`: require env.
9. `SensitiveKeysAdminController`: review — global sensitive keys are server-wide (one row in `server_config`), no change needed. Add code comment clarifying env-wide push is intentional.
10. `SearchController.stats`: the SLA threshold lookup `appSettingsRepository.findByApplicationId(app)` becomes env-aware via the existing `environment` query param.
### Phase 5 — Storage
11. `ClickHouseDiagramStore.findProcessorRouteMapping(app)``findProcessorRouteMapping(app, env)`. Include `environment = ?` in `WHERE`.
### Phase 6 — JWT surface
12. Expose `env` claim via Spring `Authentication` principal — simplest path is a small custom `AuthenticationPrincipal` or `@RequestAttribute("env")` populated by `JwtAuthenticationFilter`. Keep scope minimal; only `ApplicationConfigController.getConfig` needs it directly for the 404 rule.
### Phase 7 — Agent registration hardening
13. `AgentRegistrationController.register`:
- If `request.environmentId()` is null or blank → `400 Bad Request` with an explicit error message. Drop the `"default"` fallback on line 122.
- Log the rejection (agent identity + remote IP) at INFO for diagnostics.
14. `AgentRegistrationController.refreshToken`: remove the `"default"` fallback at line 211 (dead after Phase 7.13, but harmless to clean up).
15. `AgentRegistrationController.heartbeat`: already falls back to JWT claim; after Phase 7.13 every JWT has a real env, so the `"default"` fallback at line 247 is dead code — remove.
### Phase 8 — UI queries
16. `ui/src/api/queries/dashboard.ts`: `useAppSettings(appId)``useAppSettings(appId, environment)`; same for `useAllAppSettings()`. Pull env from `useEnvironmentStore`.
17. `ui/src/api/queries/commands.ts`: verify `useApplicationConfig(appId)` / `useUpdateApplicationConfig` already pass env. Add if missing. (Audit pass only, may be no-op.)
18. Verify no other UI hook fetches per-app state without env.
### Phase 9 — Tests
19. Integration: write config for `(app=X, env=dev)`; read for `(app=X, env=prod)` returns empty/default.
20. Integration: agent JWT with `env=dev` calling `GET /api/v1/config/X` returns the dev config row. JWT with no env claim → 404.
21. Integration: `POST /api/v1/agents/register` with no `environmentId` → 400.
22. Unit: `AppSettingsRepository` env-isolation test.
### Phase 10 — Documentation
23. `CLAUDE.md`:
- "Storage" section: update `application_config` and `app_settings` PK description.
- Agent lifecycle section: note that registration requires `environmentId` (was optional, defaulted to `"default"`).
- Remove the "priority: heartbeat `environmentId` > JWT `env` claim > `"default"`" note — after fix, every agent has a real env on every path.
24. `.claude/rules/app-classes.md`:
- `ApplicationConfigController` — reflect env-required endpoints.
- `AppSettingsController` — reflect env-required endpoints.
- `AgentRegistrationController` — note env required.
25. `.claude/rules/core-classes.md`:
- `PostgresApplicationConfigRepository`, `PostgresAppSettingsRepository` — updated signatures.
## Execution order
Phases are mostly sequential by dependency: 1 → 2 → 3 → (4, 5 in parallel) → 6 → 7 → 8 → 9 → 10. Phase 6 (JWT surfacing) is a small dependency for Phase 4 controller changes; do them together.
## Verification
- `mvn clean verify` passes.
- `detect_changes` scope matches the files touched per phase.
- Manual: spin up two envs (`dev` + `prod`) locally; configure tap in `dev`; confirm `prod` agent doesn't receive it and its DB row is untouched.
- Manual: stop an agent without env in the registration payload; confirm server returns 400.
## Out of scope / follow-ups
- `audit_log` has no `environment` column; filtering audit by env would be nice-to-have but not a correctness issue. Defer.
- Agent bootstrap-token scoping to env (so a dev token can't register as prod) — security hardening for after 1.0.