docs(operations): add containerized GPU workloads guide by lexfrei · Pull Request #555 · cozystack/website

Aleksei Sviridkin (lexfrei) · 2026-05-28T17:57:34Z

What this PR does

Add a new operations guide describing the container variant of cozystack.gpu-operator — the architectural mode for containerized GPU workloads (CUDA pods, ML training, inference) on Linux GPU nodes that already ship the NVIDIA driver and nvidia-container-toolkit via the distro package manager.

The new page lands at content/en/docs/next/operations/gpu-container-workloads.md and rounds out the GPU documentation surface:

Running VMs with GPU Passthrough — VFIO passthrough of whole GPUs to KubeVirt VMs (default variant).
GPU Sharing with HAMi — fractional GPU sharing in tenant Kubernetes clusters.
Running Containerized GPU Workloads — this page. Containerized GPU workloads on management nodes (container variant).

Content covers when to pick the variant (host driver + host toolkit + a containerd-registered nvidia runtime prerequisite), the host-driver reuse path (driver.enabled=false, so the operator uses the pre-installed driver at its standard location with no driverInstallDir override on a stock apt install), the Talos caveat with a pointer to the examples/values-native-talos.yaml reference, install steps with Package CR variant: container, a sample CUDA pod for verification, why stacking HAMi directly on this variant is not supported yet, and a three-row variant comparison matrix.

Companion to cozystack/cozystack#2766, which adds the container variant itself.

Release note

docs(operations): add guide for containerized GPU workloads via the gpu-operator `container` variant.

Summary by CodeRabbit

Documentation
- New guide for running containerized GPU workloads on cluster nodes: prerequisites, installation via the Package CR, explicit warning against using bundles.enabledPackages for this variant, operator health and GPU allocatable verification, sample CUDA Pod workflow, fractional GPU sharing via HAMi, and a comparison of container, default (VM passthrough), and vGPU variants.

netlify · 2026-05-28T17:57:40Z

✅ Deploy Preview for cozystack ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | `4c291c4` |
| 🔍 Latest deploy log | https://app.netlify.com/projects/cozystack/deploys/6a392af946c9560008c489b3 |
| 😎 Deploy Preview | https://deploy-preview-555--cozystack.netlify.app |

coderabbitai · 2026-05-28T17:57:44Z



📥 Commits

Reviewing files that changed from the base of the PR and between f2ae9b7 and 4c291c4.

📒 Files selected for processing (1)

content/en/docs/next/operations/gpu-container-workloads.md

Walkthrough

Adds a new operations guide documenting how to run containerized GPU workloads on Cozystack management nodes using the cozystack.gpu-operator container variant, including prerequisites, Package CR installation, health checks, CUDA smoke-test, HAMi fractional-sharing notes, and a variant comparison table.

Changes

GPU Container Workloads Documentation

| Layer / File(s) | Summary |
|---|---|
| GPU container variant guide `content/en/docs/next/operations/gpu-container-workloads.md` | New operations guide explains when to use the container variant (host has NVIDIA driver and `nvidia-container-toolkit`), installation prerequisites, Package CR setup with warnings against `bundles.enabledPackages`, operator health verification, `nvidia.com/gpu` allocatable checks, a CUDA smoke-test Pod example, HAMi fractional-sharing guidance, and a variant comparison table. |

Possibly related issues

cozystack/cozystack#2764: Directly addresses the same cozystack.gpu-operator container variant documentation and configuration guidance referenced in this PR.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'docs(operations): add containerized GPU workloads guide' directly and clearly summarizes the main change: adding a new documentation page for containerized GPU workloads, which matches the added content perfectly. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


gemini-code-assist

Code Review

This pull request adds a new documentation page detailing how to run containerized GPU workloads using the container variant of the cozystack.gpu-operator package. The review feedback suggests specifying the cozy-system namespace in both the kubectl patch command and the Package resource manifest to ensure they are applied to the correct namespace.

gemini-code-assist · 2026-05-28T17:58:19Z

+kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
+  -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'


In Cozystack, the Package resources (including cozystack.cozystack-platform) are typically located in the cozy-system namespace. Running kubectl patch without specifying the namespace will fail if the user's current context is set to another namespace (like default). Adding -n cozy-system ensures the command runs successfully.

Suggested change

-kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
-  -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
+kubectl patch packages.cozystack.io cozystack.cozystack-platform -n cozy-system --type=json \
+  -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'

This flow was removed — the page no longer patches cozystack.gpu-operator into bundles.enabledPackages (the bundle path only accepts the default/vgpu variants). It now applies a standalone Package CR, and packages.cozystack.io is a cluster-scoped resource (scope: Cluster in the CRD), so there is no namespace to set.

gemini-code-assist · 2026-05-28T17:58:20Z

+apiVersion: cozystack.io/v1alpha1
+kind: Package
+metadata:
+  name: cozystack.gpu-operator
+spec:
+  variant: container


The Package resource needs to be created in the cozy-system namespace for the Cozystack operator to detect and reconcile it. Adding namespace: cozy-system to the metadata ensures it is applied to the correct namespace.

Suggested change

-apiVersion: cozystack.io/v1alpha1
-kind: Package
-metadata:
-  name: cozystack.gpu-operator
-spec:
-  variant: container
+apiVersion: cozystack.io/v1alpha1
+kind: Package
+metadata:
+  name: cozystack.gpu-operator
+  namespace: cozy-system
+spec:
+  variant: container

packages.cozystack.io is a cluster-scoped resource (scope: Cluster in the CRD), so a Package has no namespace — the platform controller reconciles it cluster-wide. namespace: cozy-system would be ignored, so the example leaves it off intentionally.
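One way to confirm the scope claim on a live cluster is to read it from the CRD itself. This is a minimal sketch, assuming `kubectl` is pointed at the management cluster. The function name `package_crd_scope` is hypothetical and not part of Cozystack.

```bash
# Hypothetical helper (not a Cozystack command): print the scope of the
# Package CRD. Per the reply above, the expected value is "Cluster".
package_crd_scope() {
  kubectl get crd packages.cozystack.io -o jsonpath='{.spec.scope}'
}
```

If the CRD is cluster-scoped as described, calling `package_crd_scope` prints `Cluster`, which is why the example manifest carries no `metadata.namespace`.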

coderabbitai · 2026-05-28T18:42:37Z

Actionable comments posted: 0

myasnikovdaniil

Thanks — this is a well-researched page and most of it checks out against the companion PR cozystack/cozystack#2766 and the platform chart. A few substantive items before merge.

Main blocker: the Fractional GPU sharing section directs users into a device-plugin registration conflict (see inline comment). HAMi does not reuse the operator's device plugin — it ships its own, and the auto-disable that prevents the clash only exists in the tenant kubernetes app chart, not on the management cluster. The container variant pins devicePlugin.enabled: true, so stacking cozystack.hami on top as written runs two plugins both registering nvidia.com/gpu.

Sequencing: cozystack/cozystack#2766 (which adds the container variant) is still open. This page documents a variant that doesn't exist yet — please hold merge until #2766 lands, or confirm both ship in the same release train.

Smaller accuracy/UX fixes inline. Recommendation: request changes.

myasnikovdaniil · 2026-06-08T09:37:55Z

+
+## Fractional GPU sharing
+
+The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin. To slice one GPU across multiple pods (memory and compute quotas per pod), enable HAMi on top — HAMi reuses the same device plugin layer and is wired in via the `cozystack.hami` package, which already depends on `cozystack.gpu-operator`. See [GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/) for the tenant Kubernetes flow; for management-cluster workloads the wiring is the same package set with HAMi enabled.


⚠️ This HAMi claim is incorrect and would lead users into a resource conflict.

"HAMi reuses the same device plugin layer" is wrong. HAMi ships its own device plugin + scheduler extender. The page you link to states the opposite: "When HAMi is enabled, GPU Operator's built-in device plugin is automatically disabled to avoid resource registration conflicts."

That auto-disable only lives in the tenant kubernetes app chart (packages/apps/kubernetes/tests/gpu_operator_hami_test.yaml — "should disable devicePlugin when hami is enabled"). The management-cluster cozystack.hami PackageSource only declares dependsOn: cozystack.gpu-operator (install ordering); packages/system/hami/values.yaml does not touch the operator's device plugin.

The container variant pins devicePlugin.enabled: true (values-container.yaml in #2766). Stacking cozystack.hami on top, as written, runs two device plugins both registering nvidia.com/gpu — exactly the conflict the HAMi doc warns about.

Suggested rewrite:

The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin. For fractional sharing (per-pod memory and compute quotas), see [GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/) — currently documented for tenant Kubernetes clusters, where enabling HAMi automatically disables the GPU Operator's built-in device plugin to avoid resource-registration conflicts. Stacking the `cozystack.hami` package directly on top of the `container` variant on the management cluster is not a supported combination yet: the variant pins the NVIDIA device plugin on, and running it alongside HAMi's device plugin causes both to register `nvidia.com/gpu`.

The intro at line 10 ("you can stack HAMi on top once the container variant is up") echoes the same claim and should be softened to match.

myasnikovdaniil · 2026-06-08T09:37:55Z

+## Prerequisites
+
+- A Cozystack management cluster with at least one GPU-enabled node.
+- The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>` — it must enumerate the physical GPUs and report a working driver version.


The companion PR's own OS-support table (docs/gpu-vgpu.md in #2766) only covers Ubuntu 20.04–26.04 and Talos. Cozystack's documented node-OS surface is Talos + Ubuntu/Debian (ansible path). Listing RHEL/Fedora/openSUSE as "supported" presents untested territory as fact.

- The GPU node runs Ubuntu or Debian with the NVIDIA driver installed via the distro package manager (other distros with an equivalent driver + toolkit package layout should work the same way but are not regularly tested). Verify with `nvidia-smi` …

myasnikovdaniil · 2026-06-08T09:37:55Z

+
+- A Cozystack management cluster with at least one GPU-enabled node.
+- The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>` — it must enumerate the physical GPUs and report a working driver version.
+- `nvidia-container-toolkit` installed on the same node and registered with containerd (`grep nvidia /etc/containerd/config.toml` shows the runtime entry).


apt install nvidia-container-toolkit alone does not modify containerd config — registration is a separate manual step. A reader on a fresh node will fail this grep with no pointer to the fix. Suggest spelling out the registration:

- `nvidia-container-toolkit` installed on the same node and registered with containerd:

  ```bash
  sudo nvidia-ctk runtime configure --runtime=containerd
  sudo systemctl restart containerd
  grep nvidia /etc/containerd/config.toml   # must show the runtime entry
  ```
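The `grep` check in this suggestion can also be wrapped so that it gives a clear yes/no answer before anything is reconfigured. This is a rough sketch, not part of the proposed page. The function name is hypothetical, and the default path is simply the one quoted in the review.

```bash
# Hypothetical helper: succeeds only if the containerd config is readable
# and mentions the nvidia runtime. Defaults to the path used in the review.
nvidia_runtime_registered() {
  config="${1:-/etc/containerd/config.toml}"
  [ -r "$config" ] && grep -q 'nvidia' "$config"
}
```

A possible use is `nvidia_runtime_registered || echo "nvidia runtime not registered with containerd"`, run on the GPU node before installing the operator.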

myasnikovdaniil · 2026-06-08T09:37:55Z

+
+```bash
+kubectl apply -f cuda-smoke.yaml
+kubectl logs cuda-smoke


Run back-to-back, kubectl logs errors while the (large) CUDA base image is still pulling. Add a wait:

    kubectl apply -f cuda-smoke.yaml
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke --timeout=5m
    kubectl logs cuda-smoke

myasnikovdaniil · 2026-06-08T09:37:55Z

+- A Cozystack management cluster with at least one GPU-enabled node.
+- The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>` — it must enumerate the physical GPUs and report a working driver version.
+- `nvidia-container-toolkit` installed on the same node and registered with containerd (`grep nvidia /etc/containerd/config.toml` shows the runtime entry).
+- `kubectl` configured against the management cluster.


Minor gotcha worth one prerequisite line: the container variant relies on the upstream default workload container for unlabeled nodes. A node still carrying nvidia.com/gpu.workload.config=vm-passthrough from the GPU Passthrough guide overrides that per-node and the device plugin won't serve it — a likely trip-up when migrating a node off the passthrough setup.

- The GPU node must not carry a `nvidia.com/gpu.workload.config` label left over from the passthrough setup (`kubectl label node <node-name> nvidia.com/gpu.workload.config-` to remove).
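Before removing the label it can help to see which nodes still carry it. This sketch wraps the removal command from the suggestion together with a label-existence query. The function names are hypothetical; only the `kubectl` invocations themselves are real commands.

```bash
# Hypothetical helpers; only the kubectl invocations are real commands.

# List nodes that still carry the label, whatever its value.
nodes_with_workload_config() {
  kubectl get nodes -l nvidia.com/gpu.workload.config -o name
}

# Remove the label from one node (the trailing "-" deletes a label).
clear_workload_config() {
  kubectl label node "$1" nvidia.com/gpu.workload.config-
}
```

An empty result from `nodes_with_workload_config` means no node carries a leftover `nvidia.com/gpu.workload.config` label.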

Aleksei Sviridkin (lexfrei) · 2026-06-08T14:24:11Z

Thanks — addressed in the latest push.

HAMi (the blocker) — rewritten. You're right: HAMi ships its own device plugin, the operator-device-plugin auto-disable lives only in the tenant kubernetes app chart, and sources/hami.yaml only declares dependsOn for ordering. The page now says stacking cozystack.hami directly on the container variant on the management cluster is not supported yet (both register nvidia.com/gpu), and the intro line is softened to match.

OS support — narrowed to Ubuntu/Debian as tested; RHEL/Fedora/openSUSE are no longer presented as supported, just "should work but not regularly tested."

containerd registration — spelled out with the explicit nvidia-ctk runtime configure --runtime=containerd + restart + grep block.

Leftover nvidia.com/gpu.workload.config label — added as a prerequisite with the removal command.

CUDA smoke pod — added kubectl wait --for=jsonpath='{.status.phase}'=Succeeded before kubectl logs.

Validator path — same reframe as the code PR: dropped /host/usr/bin/nvidia-smi, now "host driver at its standard location, no driverInstallDir override on apt."

On the bot's namespace suggestions (-n cozy-system / namespace: cozy-system on the Package CR): left out deliberately — Cozystack's own canonical examples (packages/core/installer/example/platform.yaml, examples/values-native-talos.yaml) create Package CRs with no namespace, so adding one would diverge from the shipped convention. The current doc uses kubectl apply -f, not kubectl patch, so that suggestion doesn't apply either.

Sequencing: agreed — this should land with / after cozystack/cozystack#2766. The page is in the next/ tree so it tracks the unreleased variant.

myasnikovdaniil

NOT LGTM — the practical advice in the bundles.enabledPackages warning is right, but its stated failure mechanism is factually wrong and will mislead operators.

Business context: documents the container variant of cozystack.gpu-operator for running CUDA pods on management-cluster nodes that already ship the NVIDIA driver + container toolkit from the distro package manager.

Status of the requested changes (2026-06-08 review)

✅ HAMi device-plugin conflict — the Fractional GPU sharing section now explains cozystack.hami and the container variant both register nvidia.com/gpu and aren't a supported combination.
✅ OS support scope — Ubuntu/Debian primary, other distros "not regularly tested."
✅ containerd nvidia runtime registration — nvidia-ctk runtime configure + restart + verify present.
✅ leftover nvidia.com/gpu.workload.config label — prerequisite bullet with removal command added.
✅ CUDA smoke-pod — kubectl wait …Succeeded added before kubectl logs.
✅ host-driver / driver.enabled=false path — reframed clearly; Talos caveat points at the reference values file.

Outstanding

B1 (blocker) — bundles.enabledPackages warning states the wrong failure mechanism — inline at line 41. The text says the bundle "hardcodes spec.variant: default" and "any user Package CR with variant: container is overwritten on the next reconcile." Neither is what happens: iaas.yaml renders the GPU operator via cozystack.platform.package with $gpuVariant = bundles.iaas.gpuOperatorVariant | default "default", and fails the Helm render if that value isn't default/vgpu. So container via the bundle path is a hard render error, not a silent overwrite. Keep the conclusion (use a standalone Package CR); fix the reason. Suggested wording inline.

Non-blocking:

#2766 passed helm template + unit tests but no hardware CUDA run — a "provisional pending hardware validation" note would help calibrate trust.
Prerequisite ordering: the nvidia.com/gpu.workload.config removal bullet sits after the containerd-registration block; a node migrating from the passthrough guide would remove the label before/with toolkit registration.

Analysis — where the issues come from

Original code: B1 (wrong bundle-mechanism text) and the ordering nit are both in the initial commit f2ae9b7.
Introduced by post-review fixes: none — the branch is a single commit; no regressions added.
Unresolved from the previous review: none — all six asks addressed.

myasnikovdaniil · 2026-06-10T11:01:59Z

+
+## 1. Install the GPU Operator (container variant)
+
+**Do not** add `cozystack.gpu-operator` to `bundles.enabledPackages` for this variant. The platform Helm chart's optional-package template hardcodes `spec.variant: default` for every name in `enabledPackages` and reconciles the resulting `Package` CR under Helm ownership — any user `Package` CR with `variant: container` is overwritten on the next reconcile. Apply the `Package` CR directly instead; the cozystack platform controller installs it without the bundle entry.


The stated reason here is incorrect, though the practical advice is right. gpu-operator in the iaas bundle does not go through the cozystack.platform.package.optional.default helper and does not hardcode spec.variant: default. iaas.yaml renders it via cozystack.platform.package with $gpuVariant = bundles.iaas.gpuOperatorVariant | default "default", and immediately fails the Helm render if that value is anything other than "default" or "vgpu":

    {{- if not (or (eq $gpuVariant "default") (eq $gpuVariant "vgpu")) -}}
    {{- fail (printf "bundles.iaas.gpuOperatorVariant must be \"default\" or \"vgpu\", got %q" $gpuVariant) -}}
    {{- end -}}

So "container" via the bundle path causes a hard Helm render failure, not a silent overwrite — the user Package CR is never touched because the chart never renders. Suggested replacement:

Do not add cozystack.gpu-operator to bundles.enabledPackages for this variant. The iaas bundle template only accepts bundles.iaas.gpuOperatorVariant: default or vgpu; any other value — including container — causes a hard Helm render failure (packages/core/platform/templates/bundles/iaas.yaml). Apply the Package CR directly instead; the platform controller installs it without a bundle entry and without the variant restriction.

Fixed in 4c291c4. The warning no longer claims a hardcoded spec.variant: default / silent overwrite. It now states the real mechanism: the iaas bundle renders the operator from bundles.iaas.gpuOperatorVariant, which only accepts default/vgpu and fails the Helm render on anything else (including container). The conclusion — apply the Package CR directly — stays.
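To make the failure mode concrete, the template check quoted in this thread can be mirrored in shell. This is an illustrative sketch only. The function name is hypothetical, and the real validation lives in the Helm template, not in any script.

```bash
# Hypothetical mirror of the iaas bundle check quoted above.
validate_gpu_operator_variant() {
  # Same fallback as `| default "default"` in the template: empty means "default".
  variant="${1:-default}"
  case "$variant" in
    default|vgpu)
      return 0
      ;;
    *)
      printf 'bundles.iaas.gpuOperatorVariant must be "default" or "vgpu", got "%s"\n' "$variant" >&2
      return 1
      ;;
  esac
}
```

With this sketch, `validate_gpu_operator_variant container` returns non-zero and prints the error. That matches the hard render failure described above: the bundle path rejects `container` outright instead of overwriting a user's Package CR.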

Document the new container variant of cozystack.gpu-operator, paired with cozystack/cozystack#2766.

Covers the apt-installed-driver-and-toolkit Linux shape that the variant targets: when to pick it over the passthrough and vGPU variants, prerequisites (host driver + host nvidia-container-toolkit registered with containerd via nvidia-ctk runtime configure, validated with nvidia-smi over kubectl debug), the host-driver reuse path (driver.enabled=false, so the operator uses the pre-installed driver at its standard location with no driverInstallDir override needed on a stock apt install), the Talos caveat with a pointer to the values-native-talos.yaml reference, install steps, a sample CUDA pod for verification, the variant comparison matrix, and a note on why stacking HAMi directly on the container variant on the management cluster is not a supported combination yet (both register nvidia.com/gpu).

Lands under operations/, symmetric with virtualization/gpu.md (VM passthrough on management cluster) and kubernetes/gpu-sharing.md (HAMi in tenant Kubernetes addons).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

The container-variant warning claimed the optional-package template hardcodes spec.variant: default and silently overwrites a user Package CR. The iaas bundle actually renders the GPU operator from bundles.iaas.gpuOperatorVariant, which only accepts default or vgpu and fails the Helm render on any other value. Describe the real mechanism while keeping the conclusion to apply the Package CR directly.

Signed-off-by: Aleksei Sviridkin <f@lex.la>

Aleksei Sviridkin (lexfrei) · 2026-06-22T12:32:20Z

@myasnikovdaniil B1 is fixed in 4c291c4. The bundles.enabledPackages warning now describes the real failure mode: the iaas bundle renders the operator from bundles.iaas.gpuOperatorVariant (only default/vgpu) and fails the Helm render on any other value, rather than a hardcoded spec.variant: default overwriting a user CR. The conclusion (apply a standalone Package CR) is unchanged.

Sequencing is resolved too: cozystack/cozystack#2766 merged on 2026-06-09, so the container variant now exists in main (packages/core/platform/sources/gpu-operator.yaml).

Items 1-6 from the earlier pass remain addressed at this head (HAMi conflict, OS scope, containerd registration, leftover workload.config label, smoke-test wait, host-driver path). The two non-blocking notes — a "provisional, pending hardware validation" caveat and moving the leftover-label prerequisite up next to toolkit setup — are fair, but I'd keep them out of this change since they're independent of the blocker.

gemini-code-assist Bot reviewed May 28, 2026


Aleksei Sviridkin (lexfrei) force-pushed the feat/gpu-container-workloads-docs branch from 3170d45 to 8b83e54 Compare May 28, 2026 18:25

Aleksei Sviridkin (lexfrei) marked this pull request as ready for review May 28, 2026 18:36

Aleksei Sviridkin (lexfrei) requested review from Andrei Kvapil (kvaps) and Timofei Larkin (lllamnyp) as code owners May 28, 2026 18:36

Aleksei Sviridkin (lexfrei) self-assigned this May 28, 2026

Aleksei Sviridkin (lexfrei) mentioned this pull request Jun 2, 2026

Document out-of-the-box GPU passthrough for tenant Kubernetes clusters (gpu=on auto-label + NvLinkDisable default) #561

Open

Aleksei Sviridkin (lexfrei) force-pushed the feat/gpu-container-workloads-docs branch from 8b83e54 to b9cae43 Compare June 5, 2026 10:03

myasnikovdaniil requested changes Jun 8, 2026


Aleksei Sviridkin (lexfrei) force-pushed the feat/gpu-container-workloads-docs branch from b9cae43 to f2ae9b7 Compare June 8, 2026 14:16

Aleksei Sviridkin (lexfrei) requested a review from myasnikovdaniil June 8, 2026 14:26

myasnikovdaniil requested changes Jun 10, 2026


Aleksei Sviridkin (lexfrei) added 2 commits June 22, 2026 15:28

Aleksei Sviridkin (lexfrei) force-pushed the feat/gpu-container-workloads-docs branch from f2ae9b7 to 4c291c4 Compare June 22, 2026 12:30
