Question: Comparison to Nvidia GPU Operator + GPU Feature Discovery #9
Apologies if Issues is the wrong place for my question, but I don't see a Discussions forum for this repo.

I've read your Medium article, which provides a nice summary of the problem `nvshare` solves. However, I also came across a blog post from VMware describing GPU virtualization in Kubernetes via Nvidia's GPU Operator and GPU Feature Discovery, which adds labels to the Nodes such as `nvidia.com/vgpu.present=true` and facilitates fractional allocation of GPUs to Pods.

How does `nvshare` differ, and/or what additional value does it provide?

Comments
The GPU Operator and GPU Feature Discovery are auxiliary mechanisms that make it easier to manage GPUs in a K8s cluster: the Operator automatically installs the Nvidia drivers and the Device Plugin, while Feature Discovery automatically adds labels/taints to nodes with GPUs. AFAIK, Nvidia offers two mechanisms for sharing a GPU between multiple containers. Both are exposed to a Kubernetes cluster through the official device plugin [1]. TL;DR:
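For a concrete picture of what Feature Discovery does (this sketch isn't from the thread; the node name is made up and the exact label set varies by GPU model and GFD version), a labeled GPU node looks roughly like this:

```yaml
# Sketch of a node after GPU Feature Discovery has run.
# "gpu-node-1" is a made-up name; label values depend on the hardware.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    nvidia.com/gpu.count: "1"
    nvidia.com/gpu.product: "Tesla-V100-SXM2-16GB"
    nvidia.com/gpu.memory: "16384"        # MiB
    nvidia.com/cuda.driver.major: "525"
```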
1. MIG (Multi-Instance GPU). This requires special hardware (Ampere-architecture GPUs). The GPU's hardware is segmented in a way that allows the driver to offer "true" splits of the GPU as independent devices. You can skim the official docs [2] for an overview of how that works.

2. Nvidia Device Plugin GPU sharing (time-slicing). NVIDIA device plugin 0.12.0 officially provides an option to enable sharing a GPU between multiple containers (https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/); a config sketch follows this list.
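To make (2) concrete, here is a minimal sketch of the time-slicing config format that blog post describes. The ConfigMap name and replica count are illustrative, and how the file reaches the plugin depends on how you deploy it (e.g. via Helm):

```yaml
# Illustrative time-slicing config for the NVIDIA k8s-device-plugin
# (v0.12.0+). With replicas: 2, each physical GPU is advertised as
# two nvidia.com/gpu resources. Note: replicas are scheduling slots;
# the plugin does not partition or cap GPU memory between them.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # made-up name
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
```

MIG, by contrast, surfaces real hardware partitions as separate resources (e.g. `nvidia.com/mig-1g.5gb`).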
Memory is still the core problem. Time-slicing simply solves the 1-1 device assignment on K8s and does nothing to prevent OOM failures and friction between co-located apps. My thesis [3] (the abstract and first chapter are especially worth a read) elaborates on this very important distinction, which we must always keep in mind when evaluating these alternative approaches.
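To illustrate that distinction with a hypothetical setup building on the time-slicing sketch above (pod and image names are made up): both of these pods get scheduled onto the same physical GPU, but nothing caps how much GPU memory each one allocates, so they can still OOM each other:

```yaml
# Two pods sharing one time-sliced GPU (replicas: 2). Scheduling-wise
# everything works, but each container can still try to allocate the
# whole device's memory, so one of them may hit a CUDA OOM.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-a                       # made-up name
spec:
  containers:
    - name: main
      image: registry.example.com/train:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1             # one time-slicing replica,
                                        # NOT a memory slice
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer-b                       # made-up name
spec:
  containers:
    - name: main
      image: registry.example.com/train:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```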
[1] https://github.com/NVIDIA/k8s-device-plugin
Great answer! Thank you! Ahhh, I did not realize that the Nvidia device plugin for GPU sharing does not gracefully handle fair-sharing of memory.