136 lines
9.2 KiB
Markdown
136 lines
9.2 KiB
Markdown
|
|
# Environment-scoped config — fixing cross-env data bleed
|
||
|
|
|
||
|
|
**Date:** 2026-04-16
|
||
|
|
**Status:** Not started
|
||
|
|
**Backwards compatibility:** None (pre-1.0; user will wipe dev DB)
|
||
|
|
|
||
|
|
## Problem
|
||
|
|
|
||
|
|
Two PostgreSQL tables key per-app state on the application name alone, despite environments (`dev`/`staging`/`prod`) being first-class in the rest of the system:
|
||
|
|
|
||
|
|
- `application_config` PK `(application)` — traced processors, taps, route recording, per-app sensitive keys. All env-sensitive.
|
||
|
|
- `app_settings` PK `(application_id)` — SLA threshold, health warn/crit thresholds. All env-sensitive.
|
||
|
|
|
||
|
|
Consequences:
|
||
|
|
|
||
|
|
- **Config corruption**: `PUT /api/v1/config/{app}?environment=dev` correctly scopes the SSE fan-out but overwrites the single DB row, so when `prod` agents restart and fetch config they get the `dev` config.
|
||
|
|
- **Agent startup is env-blind**: `GET /api/v1/config/{app}` reads neither JWT `env` claim nor any request parameter; returns whichever row exists.
|
||
|
|
- **Dashboard settings ambiguous**: `AppSettings` endpoints have no env parameter; SLA/health displayed without env context.
|
||
|
|
- Ancillary: `ClickHouseDiagramStore.findProcessorRouteMapping(appId)` doesn't filter by env even though the table has an `environment` column.
|
||
|
|
- Ancillary: `AgentRegistrationController` accepts registrations without `environmentId` and silently defaults to `"default"` — masks misconfigured agents.
|
||
|
|
|
||
|
|
## Non-goals / working correctly (do not touch)
|
||
|
|
|
||
|
|
- All ClickHouse observability tables (executions, logs, metrics, stats_1m_*) — already env-scoped.
|
||
|
|
- `AgentCommandController` / SSE command fan-out — already env-filtered via `AgentRegistryService.findByApplicationAndEnvironment`.
|
||
|
|
- `SearchController` search path — fixed in commit `e2d9428`.
|
||
|
|
- RBAC (users/roles/groups/claim mappings) — tenant-wide by design.
|
||
|
|
- Global sensitive-keys push to all envs (`SensitiveKeysAdminController.fanOutToAllAgents`) — by design; global baseline.
|
||
|
|
- Admin UI per-page env indicator — not needed, already shown in top-right of the shell.
|
||
|
|
|
||
|
|
## Design decisions (fixed)
|
||
|
|
|
||
|
|
| Question | Answer |
|
||
|
|
|---|---|
|
||
|
|
| Schema migration strategy | Edit `V1__init.sql` in place. User wipes dev DB. |
|
||
|
|
| Agent config fetch with no/unknown env | Return `404 Not Found`. No `"default"` fallback. |
|
||
|
|
| `cameleer-common` `ApplicationConfig` model | Add `environment` field in-place; agent team coordinates the bump (SNAPSHOT). |
|
||
|
|
| Agent registration without `environmentId` | Return `400 Bad Request`. Registration MUST include env. |
|
||
|
|
| UI per-screen env display | Already covered by top-right global env indicator — no extra UI work. |
|
||
|
|
|
||
|
|
## Plan
|
||
|
|
|
||
|
|
### Phase 1 — PostgreSQL schema
|
||
|
|
|
||
|
|
1. Edit `cameleer-server-app/src/main/resources/db/migration/V1__init.sql`:
|
||
|
|
- `application_config`: add `environment TEXT NOT NULL` column; change PK to `(application, environment)`.
|
||
|
|
- `app_settings`: add `environment TEXT NOT NULL` column; change PK to `(application_id, environment)`.
|
||
|
|
2. Commit message MUST call out: "Wipe dev DB before deploying — Flyway V1 checksum changes."
|
||
|
|
|
||
|
|
### Phase 2 — Shared model (`cameleer-common`)
|
||
|
|
|
||
|
|
3. **`ApplicationConfig` stays untouched on the server side.** The agent team is adding `environment` to the common class separately; the server doesn't depend on it. On the server, `environment` flows as a sidecar parameter through repositories/controllers and as a dedicated `environment` column on `application_config`. The stored JSON body contains only the config content. If/when the field appears in the common class, we'll hydrate it from the DB column into the returned DTO — no code change needed today.
|
||
|
|
4. Add `environment` field to `AppSettings` record in `cameleer-server-core` (`admin/AppSettings.java`). **Done.**
|
||
|
|
|
||
|
|
### Phase 3 — Repositories
|
||
|
|
|
||
|
|
5. `PostgresApplicationConfigRepository`:
|
||
|
|
- `findByApplicationAndEnvironment(String app, String env)` replaces `findByApplication(app)`.
|
||
|
|
- `findAll(String env)` (env required) replaces `findAll()`.
|
||
|
|
- `save(String app, String env, ApplicationConfig, String updatedBy)` replaces `save(app, config, updatedBy)`.
|
||
|
|
- Keep behaviour identical except for the PK.
|
||
|
|
6. `AppSettingsRepository` interface (core) and `PostgresAppSettingsRepository` (app) — same treatment with `(applicationId, environment)`.
|
||
|
|
|
||
|
|
### Phase 4 — REST controllers
|
||
|
|
|
||
|
|
7. `ApplicationConfigController`:
|
||
|
|
- `getConfig(@PathVariable app)`: dual-mode by caller role. For AGENT role → env taken from JWT `env` claim, query param ignored (agents cannot spoof env). For non-agent callers (admin UI, with user JWTs whose `env="default"` is a placeholder) → env must be passed via `?environment=` query param. If neither produces a value → **404**.
|
||
|
|
- `updateConfig(@PathVariable app, @RequestParam String environment, ...)`: make `environment` required. Forward to repo save. SSE push already env-scoped — keep.
|
||
|
|
- `listConfigs(@RequestParam String environment)`: require env; filter.
|
||
|
|
- `getProcessorRouteMapping(@PathVariable app, @RequestParam String environment)`: require env; forward to ClickHouse.
|
||
|
|
- `testExpression(@PathVariable app, @RequestParam String environment, ...)`: make env required (already accepted as optional — tighten).
|
||
|
|
8. `AppSettingsController`:
|
||
|
|
- `GET /api/v1/admin/app-settings?environment=`: list filtered.
|
||
|
|
- `GET /api/v1/admin/app-settings/{appId}?environment=`: require env.
|
||
|
|
- `PUT /api/v1/admin/app-settings/{appId}?environment=`: require env.
|
||
|
|
9. `SensitiveKeysAdminController`: review — global sensitive keys are server-wide (one row in `server_config`), no change needed. Add code comment clarifying env-wide push is intentional.
|
||
|
|
10. `SearchController.stats`: the SLA threshold lookup `appSettingsRepository.findByApplicationId(app)` becomes env-aware via the existing `environment` query param.
|
||
|
|
|
||
|
|
### Phase 5 — Storage
|
||
|
|
|
||
|
|
11. `ClickHouseDiagramStore.findProcessorRouteMapping(app)` → `findProcessorRouteMapping(app, env)`. Include `environment = ?` in `WHERE`.
|
||
|
|
|
||
|
|
### Phase 6 — JWT surface
|
||
|
|
|
||
|
|
12. Expose `env` claim via Spring `Authentication` principal — simplest path is a small custom `AuthenticationPrincipal` or `@RequestAttribute("env")` populated by `JwtAuthenticationFilter`. Keep scope minimal; only `ApplicationConfigController.getConfig` needs it directly for the 404 rule.
|
||
|
|
|
||
|
|
### Phase 7 — Agent registration hardening
|
||
|
|
|
||
|
|
13. `AgentRegistrationController.register`:
|
||
|
|
- If `request.environmentId()` is null or blank → `400 Bad Request` with an explicit error message. Drop the `"default"` fallback on line 122.
|
||
|
|
- Log the rejection (agent identity + remote IP) at INFO for diagnostics.
|
||
|
|
14. `AgentRegistrationController.refreshToken`: remove the `"default"` fallback at line 211 (dead after Phase 7.13, but harmless to clean up).
|
||
|
|
15. `AgentRegistrationController.heartbeat`: already falls back to JWT claim; after Phase 7.13 every JWT has a real env, so the `"default"` fallback at line 247 is dead code — remove.
|
||
|
|
|
||
|
|
### Phase 8 — UI queries
|
||
|
|
|
||
|
|
16. `ui/src/api/queries/dashboard.ts`: `useAppSettings(appId)` → `useAppSettings(appId, environment)`; same for `useAllAppSettings()`. Pull env from `useEnvironmentStore`.
|
||
|
|
17. `ui/src/api/queries/commands.ts`: verify `useApplicationConfig(appId)` / `useUpdateApplicationConfig` already pass env. Add if missing. (Audit pass only, may be no-op.)
|
||
|
|
18. Verify no other UI hook fetches per-app state without env.
|
||
|
|
|
||
|
|
### Phase 9 — Tests
|
||
|
|
|
||
|
|
19. Integration: write config for `(app=X, env=dev)`; read for `(app=X, env=prod)` returns empty/default.
|
||
|
|
20. Integration: agent JWT with `env=dev` calling `GET /api/v1/config/X` returns the dev config row. JWT with no env claim → 404.
|
||
|
|
21. Integration: `POST /api/v1/agents/register` with no `environmentId` → 400.
|
||
|
|
22. Unit: `AppSettingsRepository` env-isolation test.
|
||
|
|
|
||
|
|
### Phase 10 — Documentation
|
||
|
|
|
||
|
|
23. `CLAUDE.md`:
|
||
|
|
- "Storage" section: update `application_config` and `app_settings` PK description.
|
||
|
|
- Agent lifecycle section: note that registration requires `environmentId` (was optional, defaulted to `"default"`).
|
||
|
|
- Remove the "priority: heartbeat `environmentId` > JWT `env` claim > `"default"`" note — after fix, every agent has a real env on every path.
|
||
|
|
24. `.claude/rules/app-classes.md`:
|
||
|
|
- `ApplicationConfigController` — reflect env-required endpoints.
|
||
|
|
- `AppSettingsController` — reflect env-required endpoints.
|
||
|
|
- `AgentRegistrationController` — note env required.
|
||
|
|
25. `.claude/rules/core-classes.md`:
|
||
|
|
- `PostgresApplicationConfigRepository`, `PostgresAppSettingsRepository` — updated signatures.
|
||
|
|
|
||
|
|
## Execution order
|
||
|
|
|
||
|
|
Phases are mostly sequential by dependency: 1 → 2 → 3 → (4, 5 in parallel) → 6 → 7 → 8 → 9 → 10. Phase 6 (JWT surfacing) is a small dependency for Phase 4 controller changes; do them together.
|
||
|
|
|
||
|
|
## Verification
|
||
|
|
|
||
|
|
- `mvn clean verify` passes.
|
||
|
|
- `detect_changes` scope matches the files touched per phase.
|
||
|
|
- Manual: spin up two envs (`dev` + `prod`) locally; configure tap in `dev`; confirm `prod` agent doesn't receive it and its DB row is untouched.
|
||
|
|
- Manual: stop an agent without env in the registration payload; confirm server returns 400.
|
||
|
|
|
||
|
|
## Out of scope / follow-ups
|
||
|
|
|
||
|
|
- `audit_log` has no `environment` column; filtering audit by env would be nice-to-have but not a correctness issue. Defer.
|
||
|
|
- Agent bootstrap-token scoping to env (so a dev token can't register as prod) — security hardening for after 1.0.
|