v25.3.0-rc.3 #399
klueska
announced in
Announcements
Release notes
This release is an important milestone towards the general availability of the NVIDIA DRA Driver for GPUs. It focuses on improving support for NVIDIA's Multi-Node NVLink (MNNVL) in Kubernetes by delivering a number of ComputeDomain improvements and bug fixes.

All commits since the last release can be seen here: v25.3.0-rc.2...v25.3.0-rc.3. The changes are summarized below.
For background on how ComputeDomains enable support for MNNVL workloads on Kubernetes (and on NVIDIA GB200 systems in particular), see this doc and this slide deck.

Improvements
- ComputeDomain cleanup semantics: deletion of a ComputeDomain is now immediately followed by resource teardown (instead of waiting for the workload to complete).
- [...] nvidiaDriverRoot Helm chart variable and overcome common GPU driver setup issues.
- [...] ComputeDomain now communicate via Pod IP (using a virtual overlay network instead of using hostNetwork: true) to improve robustness and security.

Fixes
- ComputeDomain teardown now works even after a corresponding ResourceClaim was removed from the API server (#342).
- [...] kubectl logs ds/nvidia-dra-driver-gpu-kubelet-plugin actually yields plugin logs (#355).

Notable changes
- [...] v0.33.0 (cf. changes) for various robustness improvements (such as more reliable rolling upgrades).

Breaking changes
- The nvidiaCtkPath Helm chart variable no longer needs to be provided (see above); providing it now results in an error.

The path forward
ComputeDomains
Future versions of the NVIDIA GPU driver (580+) will include IMEX daemons with support for communicating using DNS names in addition to raw IP addresses. This feature allows us to overcome a number of limitations inherent to the existing ComputeDomain implementation, with no breaking changes to the user-facing API.

Highlights include:
- Removal of the numNodes field in the ComputeDomain abstraction. Users will no longer need to pre-calculate how many nodes their (static) multi-node workload will ultimately span.
- Support for elastic workloads. The number of pods associated with a multi-node workload will no longer need to be fixed and forced to match the value of the numNodes field in the ComputeDomain the workload is running in.
- Support for running more than one pod per ComputeDomain on a given node. As of now, all pods of a multi-node workload are artificially forced to run on different nodes, even if there are enough GPUs on a single node to service more than one of them. This new feature will remove this restriction.
- Support for running pods from different ComputeDomains on the same node. As of now, only one pod from any multi-node workload is allowed to run on a given node associated with a ComputeDomain (even if there are enough GPUs available to service more than one of them). This new feature will remove this restriction.
- Improved tolerance to node failures within an IMEX domain. As of now, if one node of an IMEX domain goes down, the entire workload needs to be shut down and rescheduled. This new feature will allow the failed node to be swapped in-place, without needing to shut down the entire IMEX domain (of course, many types of failures may still require the workloads to restart anyway to explicitly recover from a loss of state).
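
For readers unfamiliar with the numNodes field referenced in the highlights above, the following is a minimal sketch of how a ComputeDomain and a pod consuming it might be declared today. The API version (resource.nvidia.com/v1beta1), the exact field layout, and all object names (example-compute-domain, example-imex-channel, example-workload) are assumptions for illustration; check them against the CRD installed in your cluster before relying on them.

```yaml
# Sketch only: apiVersion and field layout are assumptions; verify them against
# the ComputeDomain CRD shipped with the driver version you have installed.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: example-compute-domain        # hypothetical name
spec:
  # Number of nodes the (static) multi-node workload is expected to span.
  # This is the value users currently have to pre-calculate.
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: example-imex-channel      # hypothetical name, referenced by the pod below
---
apiVersion: v1
kind: Pod
metadata:
  name: example-workload              # hypothetical name
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      claims:
      - name: imex-channel            # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: imex-channel
    # Ties the pod to the ComputeDomain via the claim template named above.
    resourceClaimTemplateName: example-imex-channel
```

Under the current constraints described above, a workload like this would need exactly as many pods as numNodes, each scheduled onto a different node.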
We also plan on adding improvements to the general debuggability and observability of ComputeDomains, including:

- [...] ComputeDomain can be in to allow for robust automation.

GPUs
The upcoming 25.3.0 release will not include official support for allocating GPUs (only ComputeDomains). In the following release (25.8.0), we will add official support for allocating GPUs. The 25.8.0 release will be integrated with the NVIDIA GPU Operator and will no longer need to be installed as a standalone Helm chart.

Note: The DRA feature in upstream Kubernetes is slated to go GA in August. The 25.8.0 release of the NVIDIA DRA driver for GPUs is planned to coincide with that.
Features we plan to include in the 25.8.0 release:
Features for future releases in the near term:
- [...] nvidia.com/gpu: 2)

This discussion was created from the release v25.3.0-rc.3.