Design proposal: self-hosted in-cluster registry for air-gapped Cozystack #21

Open
George Gaál (gecube) wants to merge 1 commit into main from proposal/airgap-in-cluster-registry

@gecube George Gaál (gecube) commented Jun 24, 2026

Migrates discussion cozystack/cozystack#3029 into the design-proposal process.

Adds design-proposals/airgap-in-cluster-registry/README.md: evolve air-gap from "bring your own registry" to a bundled, self-hosted flow — offline bundle (OCI images + Talos assets) → throwaway admin registry for bootstrap → self-hosted in-cluster registry (distribution/zot) as the persistent source of truth, with redirection via Talos/containerd registry mirrors.

Source discussion: cozystack/cozystack#3029

Sibling proposal (migrated together): #22

DCO: commit is signed off.

…ozystack

Migrated from discussion cozystack/cozystack#3029 to the design-proposal
process for review.

Signed-off-by: Gaál György <gb12335@gmail.com>

coderabbitai Bot commented Jun 24, 2026

📝 Walkthrough

A new design proposal document is added at design-proposals/airgap-in-cluster-registry/README.md. It defines an air-gapped in-cluster registry workflow for Cozystack, covering problem framing, bundle bootstrap, persistent in-cluster registry deployment, containerd mirror redirection, day-2 upgrade/rollback sequencing, Phase 1 security posture, testing plans, phased rollout milestones, open questions, and alternatives considered.

Changes

Air-gap in-cluster registry design proposal

| Layer / File(s) | Summary |
|---|---|
| **Problem statement, goals, and non-goals**<br>`design-proposals/airgap-in-cluster-registry/README.md` | Adds proposal front-matter, overview, artifact classes, cluster tiers, stated problem, goals, and explicit non-goals for the air-gap registry workflow. |
| **Core design rules, bootstrap sequence, and mirror configuration**<br>`design-proposals/airgap-in-cluster-registry/README.md` | Defines standardized tagging rules, throwaway bootstrap registry, persistent in-cluster registry (distribution/zot) deployment, containerd `machine.registries.mirrors` redirect config with example, end-to-end bootstrap-to-cutover flowchart, management vs tenant node handling, and distribution compatibility with deferred paas-hosted story. |
| **Day-2 upgrade/rollback, security posture, and failure cases**<br>`design-proposals/airgap-in-cluster-registry/README.md` | Specifies CLI/bundle tooling and runbook deliverables, pre-load → preflight → bump upgrade sequencing with rollback and GC behaviors, Phase 1 TLS trust posture with deferred signing/provenance/encryption, and failure/edge cases including registry outage, incomplete transfers, DNS requirements, and admission policy limitations. |
| **Testing, phased rollout, open questions, and alternatives**<br>`design-proposals/airgap-in-cluster-registry/README.md` | Covers unit/integration/e2e/manual cutover testing expectations, Phase 0–3 rollout milestones, open questions on registry choice/addressing/bundle granularity/tooling location, and alternatives considered including admission-based redirection, BYOB status quo, and Harbor reuse. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 A registry hidden away from the net,
Bundles and mirrors, the best plan yet!
Bootstrap the throwaway, load up the stash,
Cutover to in-cluster in one mighty dash.
No internet needed — the rabbit hops free! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the new design proposal for a self-hosted in-cluster registry in air-gapped Cozystack. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a design proposal for establishing a self-hosted, in-cluster registry to support air-gapped Cozystack installations. The proposed workflow bootstraps from a temporary local registry and transitions to a persistent in-cluster registry using containerd registry mirrors. The review feedback highlights two key areas for improvement: first, removing the public registry fallback from the mirror configuration to prevent connection timeouts in strictly air-gapped environments, and second, clarifying how the internal registry domain will be resolved at the host containerd level since Talos nodes typically bypass in-cluster DNS.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +73 to +79
machine:
  registries:
    mirrors:
      ghcr.io:
        endpoints:
          - https://registry.cozy-system # internal, preferred
          - https://ghcr.io # fallback

medium

In a strictly air-gapped environment, including the public registry (https://ghcr.io) as a fallback endpoint can cause containerd to experience long connection timeouts when the internal registry is temporarily unavailable, rather than failing fast. Consider omitting the public fallback or making it conditionally configured.

Suggested change
  machine:
    registries:
      mirrors:
        ghcr.io:
          endpoints:
            - https://registry.cozy-system # internal, preferred
-           - https://ghcr.io # fallback

- **Bump before pre-load** → immediate `ImagePullBackOff`; preflight check is the guard.
- **Total registry outage** → blocks new pulls only; already-running pods keep their cached images. Mitigated by multi-replica HA backing storage.
- **Incomplete bundle transfer** → caught by preflight digest verification before any version bump.
- **DNS for `registry.cozy-system` unresolvable at containerd level** → node cannot pull; the registry domain must resolve on every node, not just in-cluster.

medium

Since host-level containerd on Talos nodes typically uses the node's configured upstream DNS rather than the in-cluster CoreDNS (to prevent resolution loops), resolving registry.cozy-system can be problematic. It would be highly beneficial to clarify how this resolution is achieved (e.g., using Talos host entries or a local DNS forwarder).

Suggested change
- **DNS for `registry.cozy-system` unresolvable at containerd level** → node cannot pull; the registry domain must resolve on every node, not just in-cluster.
- **DNS for registry.cozy-system unresolvable at containerd level** → node cannot pull; the registry domain must resolve on every node, not just in-cluster (e.g., mapped via Talos host entries or a local DNS forwarder).
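
One way to satisfy the host-level resolution requirement is a static host entry in the Talos machine configuration. The fragment below is a minimal sketch using Talos's `machine.network.extraHostEntries`. The IP address `10.0.0.50` is a placeholder assumption for wherever the registry is actually reachable from the node network (for example a load-balancer address or VIP), not a value taken from the proposal.

```yaml
machine:
  network:
    extraHostEntries:
      # Placeholder IP (assumption): replace with the address at which
      # the in-cluster registry is reachable from the host network.
      - ip: 10.0.0.50
        aliases:
          - registry.cozy-system
```

A static entry avoids any dependency on in-cluster DNS, at the cost of updating every node's configuration if the registry address changes.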

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@design-proposals/airgap-in-cluster-registry/README.md`:
- Around line 23-28: Clarify the handling of non-OCI Talos assets in the air-gap
flow by updating the proposal around the “Artifact classes” and
upgrade/preflight sections. Specify whether Talos OS assets are stored and
served through the in-cluster registry as artifacts or via an
object-store-backed path with a manifest mapping, and describe how clients
verify them. Use the existing “Artifact classes” wording and the “OCI images and
Talos assets” push/upload steps to make the storage, serving, and verification
contract unambiguous.
- Around line 62-66: Tighten the bootstrap transport guidance in the airgap
registry README by explicitly limiting any plain HTTP or temporary insecure TLS
usage to a single-host or trusted admin-LAN PoC only, and require a clear
expiry/removal step before production cutover. Update the relevant registry push
and TLS verification sections referenced by the existing symbols so they state
these insecure settings must never become the default in runbooks and should be
replaced with secure transport before rollout.
- Around line 76-80: The air-gap mirror example still includes an external
fallback endpoint, which conflicts with the no-egress setup. Update the registry
mirror example in the README so the ghcr.io entry points only to internal
endpoints and remove the https://ghcr.io fallback from the example. Keep the
change localized to the mirror configuration snippet under the air-gap profile.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 70a8294f-8908-4d21-a2a6-824f00c9fe9b

📥 Commits

Reviewing files that changed from the base of the PR and between fbfc6ba and 836d2ec.

📒 Files selected for processing (1)
  • design-proposals/airgap-in-cluster-registry/README.md

Comment on lines +23 to +28
Air-gap delivery spans two artifact classes across two cluster tiers:

**Artifact classes**
- OCI container images — pulled by containerd at runtime.
- Talos OS assets — kernel, initramfs, ISO, metal images (bare-metal) and the nocloud disk image (VMs).

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Clarify where non-OCI Talos assets live and how they are verified.

Lines 23-28 define Talos boot artifacts, but Lines 119-120 say to push “OCI images and Talos assets” to the in-cluster registry (distribution/zot). Please specify the exact storage/serving contract for non-OCI assets (registry-as-artifact vs. object store path + manifest mapping); otherwise the upgrade preflight is underspecified.

Also applies to: 115-121
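
To make the second option concrete, a manifest mapping could look roughly like the sketch below. It is purely illustrative: the layout, the field names (`assets`, `name`, `path`, `sha256`), and the paths are hypothetical and do not come from the proposal, and the digests are left as placeholders.

```yaml
# Hypothetical manifest for non-OCI Talos assets (illustrative only,
# not a format defined by the proposal).
assets:
  - name: kernel
    path: talos/<talos-version>/<kernel-file>
    sha256: <hex-digest>
  - name: initramfs
    path: talos/<talos-version>/<initramfs-file>
    sha256: <hex-digest>
```

With a mapping of this kind, preflight could recompute each file's SHA-256 and compare it with the manifest before any version bump, in line with the digest verification the proposal already relies on for incomplete bundle transfers.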

Comment on lines +62 to +66
docker run -d -p 5000:5000 --name cozy-bootstrap-registry \
  -v /srv/cozy-registry:/var/lib/registry registry:2
cozystack images push --bundle cozystack-airgap-paas-full-v1.5.0.tar \
  --to http://ADMIN_IP:5000
```

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Tighten bootstrap transport security requirements.

Line 65 uses plain HTTP, and Line 138 allows temporary insecureSkipVerify. Even for PoC, define strict limits (single-host/admin LAN only, explicit expiry/removal before cutover) to prevent insecure defaults from leaking into production runbooks.

Also applies to: 138-138
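
If a registry has to be trusted before a publicly verifiable certificate is available, one alternative to disabling verification is to pin its CA in the Talos machine configuration. The fragment below is a sketch under assumptions, using Talos's `machine.registries.config` with a `tls.ca` entry. The hostname `bootstrap-registry.example.internal:5000` and the CA placeholder are hypothetical and not part of the proposal.

```yaml
machine:
  registries:
    config:
      # Hypothetical bootstrap registry host (assumption).
      "bootstrap-registry.example.internal:5000":
        tls:
          # Base64-encoded PEM CA certificate that signed the registry's
          # serving certificate (placeholder value).
          ca: <base64-encoded-CA-certificate>
          # Deliberately no `insecureSkipVerify: true` here.
```

Pinning a CA keeps certificate verification on throughout bootstrap, so there is no insecure setting to remember to remove before cutover.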

Comment on lines +76 to +80
      ghcr.io:
        endpoints:
          - https://registry.cozy-system # internal, preferred
          - https://ghcr.io # fallback
```

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Remove internet fallback from the air-gap mirror example.

Line 79 (https://ghcr.io) conflicts with the no-egress objective and can cause slow/failing pulls if resolution/connect attempts happen before failover logic settles. Use an air-gap profile that only points to internal endpoints.

Proposed doc edit
       ghcr.io:
         endpoints:
           - https://registry.cozy-system     # internal, preferred
-          - https://ghcr.io                   # fallback
+          # no external fallback in air-gapped mode