Add Zot OCI registry as ArtifactStore backend (P1 security work) #158

Open
opened 2026-04-27 14:33:48 +02:00 by claude · 0 comments

Context

This is the deferred OCI-registry follow-up to the init-container JAR fetch work tracked in docs/superpowers/plans/2026-04-27-init-container-jar-fetch.md (and the multi-tenant hardening epic #152). The earlier work introduces an ArtifactStore interface in cameleer-server-core with one filesystem implementation. This issue covers adding a second implementation, OciArtifactStore, backed by Zot, and standing up Zot in the deployment stack.
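For orientation, the seam looks roughly like this. The interface, method names, and the in-memory stand-in below are illustrative assumptions, not the actual cameleer-server-core API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the ArtifactStore seam (the real interface lives in
// cameleer-server-core; these method names are assumptions for illustration).
interface ArtifactStore {
    void put(String ref, byte[] jarBytes); // persist uploaded JAR bytes
    byte[] get(String ref);                // read them back for the loader
}

// Minimal in-memory stand-in: a second backend (like OciArtifactStore)
// slots in behind the same interface without touching any callers.
class InMemoryArtifactStore implements ArtifactStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    @Override
    public void put(String ref, byte[] jarBytes) {
        blobs.put(ref, jarBytes.clone());
    }

    @Override
    public byte[] get(String ref) {
        byte[] bytes = blobs.get(ref);
        if (bytes == null) {
            throw new IllegalArgumentException("unknown artifact: " + ref);
        }
        return bytes.clone();
    }
}
```

The point of the sketch: OciArtifactStore becomes a second implementation of the same contract, so the upload path and the loader-facing download URL resolution never change.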

Why we deferred (and what changes when we do it)

OCI buys nothing for today's needs. We store JAR bytes and serve them over an authenticated HTTP URL — a registry adds zero value for that and adds real ops cost (a service, storage backend, TLS, auth, backups, monitoring). The payoff arrives only when we start needing what OCI bundles:

  • Image + artifact scanning (Trivy is built into Zot)
  • Signing (Cosign verification built into Zot)
  • SBOM attestation (sigstore / in-toto attached to the artifact)
  • Node-level pull caching (free with K8s once we get there)
  • Provenance — immutable, content-addressed, replay-safe

These are all explicitly P1 items in #152 (Falco rules, Trivy/Cosign pipeline, JAR SBOM scanner, etc.). When that work starts, OCI is the right shape for everything: customers still upload via REST, the upload path persists into the OCI registry instead of the filesystem, the loader init container still does an HTTP GET against either Cameleer or a pre-signed registry URL — same wire shape, different backend.

Why Zot specifically (and not the alternatives)

Considered options:

  • Zot (zotregistry.dev). Pick this. Single Go binary, no Postgres/Redis, OCI-native (not Docker-protocol-first), Trivy scanning + Cosign verification built in, ORAS-friendly. CNCF sandbox project, healthy momentum. Right size for our stage.
  • distribution/registry:2 (Docker reference impl). The cheap "just need storage" answer. Battle-tested, but no scanning/signing — we'd bolt those on ourselves and end up rebuilding what Zot ships. Skip.
  • Harbor (CNCF graduated). Full enterprise platform: RBAC, projects, replication, vulnerability DB, quotas. But also: Postgres + Redis + Trivy + several other components. It is a small platform to operate. Defer until we need multi-project RBAC as a customer-visible feature.
  • JFrog Artifactory / Sonatype Nexus. Speak OCI now, but heavyweight if we don't already have them. Only relevant if a customer says "can you push into our existing Nexus?" — then OCI-protocol against their infra is the play. Not our registry.
  • Cloud-managed (ECR, GCR/Artifact Registry, ACR, GHCR). Zero-ops but cloud-coupled. Fine if Cameleer-SaaS commits to one cloud. Bad fit for self-hosted / on-prem deployments.

Zot is the only one that gives us scanning + signing + OCI-native without dragging in a Postgres/Redis side-stack we'd otherwise have to operate.
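For scale, a minimal dev-stack sketch of what standing up Zot could look like. The image tag, ports, and paths here are assumptions to verify against the Zot docs, not a vetted deployment:

```yaml
# Sketch only: image tag, ports, and paths are assumptions to check
# against zotregistry.dev before use.
services:
  zot:
    image: ghcr.io/project-zot/zot-linux-amd64:v2.1.0
    command: ["serve", "/etc/zot/config.json"]
    ports:
      - "5000:5000"
    volumes:
      - zot-data:/var/lib/zot               # persistent blob storage
      - ./zot-config.json:/etc/zot/config.json:ro

volumes:
  zot-data:
```

The referenced zot-config.json is where Trivy scanning gets switched on (per the Zot docs, via the search/CVE extension, e.g. an `extensions.search.cve.updateInterval` setting); TLS and auth would sit in front via Traefik as described in the migration path below.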

Migration path

The ArtifactStore interface (shipped in the init-container plan) is the migration insurance. With it in place, this work is bounded to weeks, not a quarter:

  1. Stand up Zot in the stack (docker-compose service + persistent volume; later K8s StatefulSet). TLS via Traefik, auth via htpasswd or Zot's built-in OIDC integration.
  2. Implement OciArtifactStore in cameleer-server-app/storage/. Use oras-project/oras-java (or fall back to Apache HttpClient against the OCI distribution spec endpoints — well documented, ~200 lines).
  3. Dual-write for one release. Every new upload lands in both filesystem and OCI; reads still come from filesystem. Verify byte-for-byte equivalence in production for a week.
  4. Cut over reads. Flip the ArtifactStore bean to OciArtifactStore. Init container URLs now resolve against OCI (still HTTP, still token-auth — invisible to the loader).
  5. Backfill historical data. One-time job iterates app_versions, copies filesystem → OCI for old entries that never went through dual-write. The OCI ref convention is already locked: ArtifactCoordinates.ociRef() returns {tenantId}/{appId}:v{version}.
  6. Stop dual-write, decommission filesystem path.
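The dual-write phase (step 3) can be sketched as a thin decorator over two stores. The ArtifactStore interface and method names here are hypothetical stand-ins, not the real core API:

```java
// Sketch of step 3's dual-write phase: every write lands in both backends,
// reads stay on the filesystem store until cut-over (step 4).
interface ArtifactStore {
    void put(String ref, byte[] jarBytes);
    byte[] get(String ref);
}

class DualWriteArtifactStore implements ArtifactStore {
    private final ArtifactStore primary;   // filesystem: still serves reads
    private final ArtifactStore secondary; // OCI: receives shadow writes

    DualWriteArtifactStore(ArtifactStore primary, ArtifactStore secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    @Override
    public void put(String ref, byte[] jarBytes) {
        primary.put(ref, jarBytes);
        secondary.put(ref, jarBytes); // parity verified out-of-band via metrics
    }

    @Override
    public byte[] get(String ref) {
        return primary.get(ref); // step 4's cut-over flips this to secondary
    }
}
```

Keeping the decorator behind a feature flag makes rollback trivial: disable the flag and the OCI store simply stops receiving shadow writes.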

The ArtifactCoordinates value type was deliberately designed with ociRef() from day one so this migration doesn't need a coordinates schema change — see cameleer-server-core/src/main/java/com/cameleer/server/core/storage/ArtifactCoordinates.java.
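A minimal sketch of that value type, assuming a Java record; only the ociRef() convention is taken from the plan, the rest is illustrative:

```java
// Sketch of the ArtifactCoordinates value type described above; field and
// method names follow the issue text, everything else is an assumption.
record ArtifactCoordinates(String tenantId, String appId, int version) {
    // OCI ref convention locked in the plan: {tenantId}/{appId}:v{version}
    String ociRef() {
        return tenantId + "/" + appId + ":v" + version;
    }
}
```

Because the tenant ID is the leading path segment, per-tenant isolation and quota policies in the registry can key off the repository prefix without a schema change.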

What this issue does NOT cover

  • Customer-facing OCI push. Customers still upload via our REST API. If they want to use OCI at the source (e.g. oras push from CI), that's a separate feature and probably not worth building — the REST API is universal across every CI system.
  • "Pull by Maven coordinates" UX. Tracked separately. Resolves Maven coords against a customer's Artifactory at upload time and stores the bytes as a normal AppVersion. Independent of OCI backend choice.
  • Image signing of our own server/runtime images. That's tooling for our build pipeline (Cosign + GitHub Actions / equivalent), not Cameleer code.

Acceptance criteria

  • Zot service running in dev/staging stack with persistent storage
  • OciArtifactStore implements ArtifactStore, all FilesystemArtifactStoreTest cases pass against the OCI impl as well
  • Trivy scanning runs on every push; results queryable via Zot's API
  • Cosign signing wired into the upload path
  • Dual-write phase ships behind a feature flag; metrics confirm parity
  • Backfill script for historical app_versions
  • Filesystem-store path removed
  • .claude/rules/* updated to reflect OCI as the artifact backend
  • Loader init container behavior unchanged (HTTP GET against URL — invisible swap)

Trigger conditions

Pick this up when any of:

  • P1 security work from #152 starts (Trivy/Cosign/SBOM)
  • A customer asks for "verifiable signed deploys" or "where's the SBOM for this version"
  • We migrate to K8s and want node-level image pull caching for tenant artifacts
  • Compliance/regulated-tenant work begins

Until then, the filesystem store is the right answer.

References

  • Init-container plan: docs/superpowers/plans/2026-04-27-init-container-jar-fetch.md
  • Hardening epic: #152
  • Zot (zotregistry.dev): single-binary OCI-native registry
  • ORAS (oras.land): JAR-in-OCI tooling
  • OCI distribution spec: github.com/opencontainers/distribution-spec
  • Sigstore / Cosign: docs.sigstore.dev