docs(operations): add containerized GPU workloads guide #555
Aleksei Sviridkin (lexfrei) wants to merge 2 commits into
📝 Walkthrough
Adds a new operations guide documenting how to run containerized GPU workloads on Cozystack management nodes using the
Changes: GPU Container Workloads Documentation
Code Review
This pull request adds a new documentation page detailing how to run containerized GPU workloads using the container variant of the cozystack.gpu-operator package. The review feedback suggests specifying the cozy-system namespace in both the kubectl patch command and the Package resource manifest to ensure they are applied to the correct namespace.
>     kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
>       -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
In Cozystack, the Package resources (including cozystack.cozystack-platform) are typically located in the cozy-system namespace. Running kubectl patch without specifying the namespace will fail if the user's current context is set to another namespace (like default). Adding -n cozy-system ensures the command runs successfully.
Suggested change:

    -kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
    -  -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
    +kubectl patch packages.cozystack.io cozystack.cozystack-platform -n cozy-system --type=json \
    +  -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
This flow was removed — the page no longer patches cozystack.gpu-operator into bundles.enabledPackages (the bundle path only accepts the default/vgpu variants). It now applies a standalone Package CR, and packages.cozystack.io is a cluster-scoped resource (scope: Cluster in the CRD), so there is no namespace to set.
>     apiVersion: cozystack.io/v1alpha1
>     kind: Package
>     metadata:
>       name: cozystack.gpu-operator
>     spec:
>       variant: container
The Package resource needs to be created in the cozy-system namespace for the Cozystack operator to detect and reconcile it. Adding namespace: cozy-system to the metadata ensures it is applied to the correct namespace.
Suggested change:

    -apiVersion: cozystack.io/v1alpha1
    -kind: Package
    -metadata:
    -  name: cozystack.gpu-operator
    -spec:
    -  variant: container
    +apiVersion: cozystack.io/v1alpha1
    +kind: Package
    +metadata:
    +  name: cozystack.gpu-operator
    +  namespace: cozy-system
    +spec:
    +  variant: container
packages.cozystack.io is a cluster-scoped resource (scope: Cluster in the CRD), so a Package has no namespace — the platform controller reconciles it cluster-wide. namespace: cozy-system would be ignored, so the example leaves it off intentionally.
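The scope claim can be checked against a live cluster rather than taken on trust. The following is a minimal sketch, assuming `kubectl` points at the management cluster and the CRD is installed under the name `packages.cozystack.io` as stated above; the helper name `package_crd_scope` is hypothetical and exists only for illustration.

```bash
# Hypothetical helper (not part of Cozystack): print the scope recorded in the
# Package CRD. Per the reply above the expected value is "Cluster"; a value of
# "Namespaced" would mean a namespace is required after all.
package_crd_scope() {
  kubectl get crd packages.cozystack.io -o jsonpath='{.spec.scope}'
}
```

Calling `package_crd_scope` prints the scope string. `kubectl api-resources --api-group=cozystack.io` reports the same information in its NAMESPACED column.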
3170d45 to 8b83e54
Actionable comments posted: 0
8b83e54 to b9cae43
myasnikovdaniil left a comment
Thanks — this is a well-researched page and most of it checks out against the companion PR cozystack/cozystack#2766 and the platform chart. A few substantive items before merge.
Main blocker: the Fractional GPU sharing section directs users into a device-plugin registration conflict (see inline comment). HAMi does not reuse the operator's device plugin — it ships its own, and the auto-disable that prevents the clash only exists in the tenant kubernetes app chart, not on the management cluster. The container variant pins devicePlugin.enabled: true, so stacking cozystack.hami on top as written runs two plugins both registering nvidia.com/gpu.
Sequencing: cozystack/cozystack#2766 (which adds the container variant) is still open. This page documents a variant that doesn't exist yet — please hold merge until #2766 lands, or confirm both ship in the same release train.
Smaller accuracy/UX fixes inline. Recommendation: request changes.
> ## Fractional GPU sharing
>
> The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin. To slice one GPU across multiple pods (memory and compute quotas per pod), enable HAMi on top; HAMi reuses the same device plugin layer and is wired in via the `cozystack.hami` package, which already depends on `cozystack.gpu-operator`. See [GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/) for the tenant Kubernetes flow; for management-cluster workloads the wiring is the same package set with HAMi enabled.
- "HAMi reuses the same device plugin layer" is wrong. HAMi ships its own device plugin + scheduler extender. The page you link to states the opposite: "When HAMi is enabled, GPU Operator's built-in device plugin is automatically disabled to avoid resource registration conflicts."
- That auto-disable only lives in the tenant `kubernetes` app chart (`packages/apps/kubernetes/tests/gpu_operator_hami_test.yaml`: "should disable devicePlugin when hami is enabled"). The management-cluster `cozystack.hami` PackageSource only declares `dependsOn: cozystack.gpu-operator` (install ordering); `packages/system/hami/values.yaml` does not touch the operator's device plugin.
- The `container` variant pins `devicePlugin.enabled: true` (`values-container.yaml` in #2766). Stacking `cozystack.hami` on top, as written, runs two device plugins both registering `nvidia.com/gpu`, exactly the conflict the HAMi doc warns about.
Suggested rewrite:

    The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin.
    For fractional sharing (per-pod memory and compute quotas), see
    [GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/), currently documented for
    tenant Kubernetes clusters, where enabling HAMi automatically disables the GPU Operator's
    built-in device plugin to avoid resource-registration conflicts. Stacking the
    `cozystack.hami` package directly on top of the `container` variant on the management
    cluster is not a supported combination yet: the variant pins the NVIDIA device plugin on,
    and running it alongside HAMi's device plugin causes both to register `nvidia.com/gpu`.

The intro at line 10 ("you can stack HAMi on top once the container variant is up") echoes the same claim and should be softened to match.
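One way to spot the double-registration situation described above on a running cluster is to list the device-plugin DaemonSets. This is only a sketch: matching on the substring `device-plugin` is an assumption about how the workloads are named, not something either chart guarantees, and the helper name `list_device_plugin_daemonsets` is hypothetical.

```bash
# Hypothetical helper: list DaemonSets in all namespaces whose name contains
# "device-plugin". The name match is an assumption about naming conventions,
# so an empty result does not prove that no device plugin is running.
list_device_plugin_daemonsets() {
  kubectl get daemonsets --all-namespaces --no-headers \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name \
    | grep -- 'device-plugin' || true
}
```

If more than one line comes back, it is worth checking whether both plugins advertise `nvidia.com/gpu` before scheduling GPU pods.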
> ## Prerequisites
>
> - A Cozystack management cluster with at least one GPU-enabled node.
> - The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>`; it must enumerate the physical GPUs and report a working driver version.
The companion PR's own OS-support table (docs/gpu-vgpu.md in #2766) only covers Ubuntu 20.04–26.04 and Talos. Cozystack's documented node-OS surface is Talos + Ubuntu/Debian (ansible path). Listing RHEL/Fedora/openSUSE as "supported" presents untested territory as fact.
    - The GPU node runs Ubuntu or Debian with the NVIDIA driver installed via the distro
      package manager (other distros with an equivalent driver + toolkit package layout
      should work the same way but are not regularly tested). Verify with `nvidia-smi` …

> - `nvidia-container-toolkit` installed on the same node and registered with containerd (`grep nvidia /etc/containerd/config.toml` shows the runtime entry).
apt install nvidia-container-toolkit alone does not modify containerd config — registration is a separate manual step. A reader on a fresh node will fail this grep with no pointer to the fix. Suggest spelling out the registration:
    - `nvidia-container-toolkit` installed on the same node and registered with containerd:
      ```bash
      sudo nvidia-ctk runtime configure --runtime=containerd
      sudo systemctl restart containerd
      grep nvidia /etc/containerd/config.toml # must show the runtime entry
      ```

> ```bash
> kubectl apply -f cuda-smoke.yaml
> kubectl logs cuda-smoke
> ```
Run back-to-back, kubectl logs errors while the (large) CUDA base image is still pulling. Add a wait:
    kubectl apply -f cuda-smoke.yaml
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke --timeout=5m
    kubectl logs cuda-smoke

> - `kubectl` configured against the management cluster.
Minor gotcha worth one prerequisite line: the container variant relies on the upstream default workload container for unlabeled nodes. A node still carrying nvidia.com/gpu.workload.config=vm-passthrough from the GPU Passthrough guide overrides that per-node and the device plugin won't serve it — a likely trip-up when migrating a node off the passthrough setup.
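To find nodes affected by this gotcha, a label-existence selector is enough. A minimal sketch, assuming `kubectl` targets the management cluster; the helper name `nodes_with_workload_config` is hypothetical.

```bash
# Hypothetical helper: list every node that carries the
# nvidia.com/gpu.workload.config label, whatever its value.
#   -l <key>  selects nodes where the label exists
#   -L <key>  adds a column showing the label's value
nodes_with_workload_config() {
  kubectl get nodes \
    -l 'nvidia.com/gpu.workload.config' \
    -L 'nvidia.com/gpu.workload.config'
}
```

A node listed with the value `vm-passthrough` is the case described in this comment. If no node carries the label, `kubectl` reports that no resources were found.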
    - The GPU node must not carry a `nvidia.com/gpu.workload.config` label left over from the
      passthrough setup (`kubectl label node <node-name> nvidia.com/gpu.workload.config-` to remove).

b9cae43 to f2ae9b7
Thanks, addressed in the latest push.

- HAMi (the blocker): rewritten. You're right: HAMi ships its own device plugin, the operator-device-plugin auto-disable lives only in the tenant
- OS support: narrowed to Ubuntu/Debian as tested; RHEL/Fedora/openSUSE are no longer presented as supported, just "should work but not regularly tested."
- containerd registration: spelled out with the explicit
- Leftover
- CUDA smoke pod: added
- Validator path: same reframe as the code PR: dropped

On the bot's namespace suggestions (

Sequencing: agreed, this should land with / after cozystack/cozystack#2766. The page is in the
myasnikovdaniil left a comment
NOT LGTM — the practical advice in the bundles.enabledPackages warning is right, but its stated failure mechanism is factually wrong and will mislead operators.
Business context: documents the container variant of cozystack.gpu-operator for running CUDA pods on management-cluster nodes that already ship the NVIDIA driver + container toolkit from the distro package manager.
Status of the requested changes (2026-06-08 review)
- ✅ HAMi device-plugin conflict: the Fractional GPU sharing section now explains that `cozystack.hami` and the `container` variant both register `nvidia.com/gpu` and aren't a supported combination.
- ✅ OS support scope: Ubuntu/Debian primary, other distros "not regularly tested."
- ✅ containerd `nvidia` runtime registration: `nvidia-ctk runtime configure` + restart + verify present.
- ✅ leftover `nvidia.com/gpu.workload.config` label: prerequisite bullet with removal command added.
- ✅ CUDA smoke-pod: `kubectl wait …Succeeded` added before `kubectl logs`.
- ✅ host-driver / `driver.enabled=false` path: reframed clearly; Talos caveat points at the reference values file.
Outstanding
B1 (blocker) — bundles.enabledPackages warning states the wrong failure mechanism — inline at line 41. The text says the bundle "hardcodes spec.variant: default" and "any user Package CR with variant: container is overwritten on the next reconcile." Neither is what happens: iaas.yaml renders the GPU operator via cozystack.platform.package with $gpuVariant = bundles.iaas.gpuOperatorVariant | default "default", and fails the Helm render if that value isn't default/vgpu. So container via the bundle path is a hard render error, not a silent overwrite. Keep the conclusion (use a standalone Package CR); fix the reason. Suggested wording inline.
Non-blocking:
- #2766 passed `helm template` + unit tests but no hardware CUDA run; a "provisional pending hardware validation" note would help calibrate trust.
- Prerequisite ordering: the `nvidia.com/gpu.workload.config` removal bullet sits after the containerd-registration block; a node migrating from the passthrough guide would remove the label before/with toolkit registration.
Analysis — where the issues come from
- Original code: B1 (wrong bundle-mechanism text) and the ordering nit are both in the initial commit `f2ae9b7`.
- Introduced by post-review fixes: none. The branch is a single commit; no regressions added.
- Unresolved from the previous review: none. All six asks addressed.
> ## 1. Install the GPU Operator (container variant)
>
> **Do not** add `cozystack.gpu-operator` to `bundles.enabledPackages` for this variant. The platform Helm chart's optional-package template hardcodes `spec.variant: default` for every name in `enabledPackages` and reconciles the resulting `Package` CR under Helm ownership; any user `Package` CR with `variant: container` is overwritten on the next reconcile. Apply the `Package` CR directly instead; the cozystack platform controller installs it without the bundle entry.
The stated reason here is incorrect, though the practical advice is right. gpu-operator in the iaas bundle does not go through the cozystack.platform.package.optional.default helper and does not hardcode spec.variant: default. iaas.yaml renders it via cozystack.platform.package with $gpuVariant = bundles.iaas.gpuOperatorVariant | default "default", and immediately fails the Helm render if that value is anything other than "default" or "vgpu":
{{- if not (or (eq $gpuVariant "default") (eq $gpuVariant "vgpu")) -}}
{{- fail (printf "bundles.iaas.gpuOperatorVariant must be \"default\" or \"vgpu\", got %q" $gpuVariant) -}}
{{- end -}}
So "container" via the bundle path causes a hard Helm render failure, not a silent overwrite — the user Package CR is never touched because the chart never renders. Suggested replacement:
    **Do not** add `cozystack.gpu-operator` to `bundles.enabledPackages` for this variant. The `iaas` bundle template only accepts `bundles.iaas.gpuOperatorVariant: default` or `vgpu`; any other value, including `container`, causes a hard Helm render failure (`packages/core/platform/templates/bundles/iaas.yaml`). Apply the `Package` CR directly instead; the platform controller installs it without a bundle entry and without the variant restriction.
Fixed in 4c291c4. The warning no longer claims a hardcoded spec.variant: default / silent overwrite. It now states the real mechanism: the iaas bundle renders the operator from bundles.iaas.gpuOperatorVariant, which only accepts default/vgpu and fails the Helm render on anything else (including container). The conclusion — apply the Package CR directly — stays.
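The accept/reject behaviour of that guard can be restated outside Helm to make the failure mode concrete. The function below is an illustrative shell re-statement of the template logic quoted in the review, not code from the chart; the name `check_gpu_operator_variant` is hypothetical.

```bash
# Illustrative only: mirrors the iaas.yaml guard quoted in the review.
# An unset or empty value falls back to "default", as the template's
# `| default "default"` does. Anything other than "default" or "vgpu"
# is rejected with a message on stderr and a non-zero exit status.
check_gpu_operator_variant() {
  variant="${1:-default}"
  case "$variant" in
    default|vgpu)
      return 0
      ;;
    *)
      printf 'bundles.iaas.gpuOperatorVariant must be "default" or "vgpu", got "%s"\n' "$variant" >&2
      return 1
      ;;
  esac
}
```

For example, `check_gpu_operator_variant vgpu` succeeds, while `check_gpu_operator_variant container` fails. This matches the hard render error described above and is why the `container` variant has to be applied as a standalone `Package` CR.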
Document the new container variant of cozystack.gpu-operator, paired with cozystack/cozystack#2766. Covers the apt-installed-driver-and-toolkit Linux shape that the variant targets: when to pick it over the passthrough and vGPU variants, prerequisites (host driver + host nvidia-container-toolkit registered with containerd via nvidia-ctk runtime configure, validated with nvidia-smi over kubectl debug), the host-driver reuse path (driver.enabled=false, so the operator uses the pre-installed driver at its standard location with no driverInstallDir override needed on a stock apt install), the Talos caveat with a pointer to the values-native-talos.yaml reference, install steps, a sample CUDA pod for verification, the variant comparison matrix, and a note on why stacking HAMi directly on the container variant on the management cluster is not a supported combination yet (both register nvidia.com/gpu). Lands under operations/ — symmetric with virtualization/gpu.md (VM passthrough on management cluster) and kubernetes/gpu-sharing.md (HAMi in tenant Kubernetes addons). Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
The container-variant warning claimed the optional-package template hardcodes spec.variant: default and silently overwrites a user Package CR. The iaas bundle actually renders the GPU operator from bundles.iaas.gpuOperatorVariant, which only accepts default or vgpu and fails the Helm render on any other value. Describe the real mechanism while keeping the conclusion to apply the Package CR directly. Signed-off-by: Aleksei Sviridkin <f@lex.la>
f2ae9b7 to 4c291c4
myasnikovdaniil B1 is fixed in 4c291c4. The

Sequencing is resolved too: cozystack/cozystack#2766 merged on 2026-06-09, so the

Items 1-6 from the earlier pass remain addressed at this head (HAMi conflict, OS scope, containerd registration, leftover
What this PR does

Add a new operations guide describing the `container` variant of `cozystack.gpu-operator`: the architectural mode for containerized GPU workloads (CUDA pods, ML training, inference) on Linux GPU nodes that already ship the NVIDIA driver and `nvidia-container-toolkit` via the distro package manager.

The new page lands at `content/en/docs/next/operations/gpu-container-workloads.md` and rounds out the GPU documentation surface:

- `default` variant).
- `container` variant).

Content covers when to pick the variant (host driver + host toolkit + a containerd-registered `nvidia` runtime prerequisite), the host-driver reuse path (`driver.enabled=false`, so the operator uses the pre-installed driver at its standard location with no `driverInstallDir` override on a stock apt install), the Talos caveat with a pointer to the `examples/values-native-talos.yaml` reference, install steps with `Package` CR `variant: container`, a sample CUDA pod for verification, why stacking HAMi directly on this variant is not supported yet, and a three-row variant comparison matrix.

Companion to cozystack/cozystack#2766, which adds the `container` variant itself.

Release note