Waiting for gpu node to be ready before scheduling pods using NVML #615
You could consider adding a sample snippet.
@easyrider14 Is your pod requesting a GPU using resource requests/limits (e.g. requesting an nvidia.com/gpu resource in its limits)?
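For reference, that suggestion looks roughly like the sketch below (the names and image are hypothetical). The nvidia.com/gpu extended resource is only advertised by the NVIDIA device plugin once the driver stack is up, so a pod requesting it is not scheduled onto the node before then:

```yaml
# Sketch of a DaemonSet pod requesting a GPU via resource limits.
# The nvidia.com/gpu resource becomes allocatable only once the device
# plugin (deployed by the gpu-operator) has registered it, which happens
# after the driver is installed, so scheduling implicitly waits for it.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvml-monitor              # hypothetical name
spec:
  selector:
    matchLabels:
      app: nvml-monitor
  template:
    metadata:
      labels:
        app: nvml-monitor
    spec:
      containers:
      - name: monitor
        image: registry.example.com/nvml-monitor:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1     # reserves one GPU for this pod
```

Note that this does reserve a whole GPU for the pod.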
Hi @tariq1890, I've tried this after digging into the gpu-operator manifest files, but I still have the same result.
@cdesiniotis I don't need/want resources to be reserved for this pod, as it mainly keeps track of the available resources on the node in an etcd database. There is no workload running continuously, just an update of the available RAM/CPU/GPU at regular intervals. I don't want to reserve and block resources for that.
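If reserving a GPU is not acceptable, one alternative sketch (assuming GPU Feature Discovery is running, which the gpu-operator deploys by default; GFD is itself gated on driver validation, so its labels only appear once the driver is usable) is to require one of its labels via node affinity:

```yaml
# Sketch: only schedule on nodes that GPU Feature Discovery has labelled.
# The label key below is one GFD normally applies; verify it against the
# labels actually present on your nodes (kubectl get node <name> --show-labels).
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.count
                operator: Exists
```

This gates scheduling on the GPU stack having been validated at least once, without requesting any nvidia.com/gpu resources.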
Hi everyone,
I'm facing an issue with the gpu-operator and the scaling of my K8S cluster.
When adding a GPU node to the cluster, the gpu-operator will, among other things, install the container runtime and the drivers.
I've got a daemonset which uses NVML, and it is scheduled on the newly added GPU node as soon as the node is available. But the driver is not ready yet, so initializing NVML fails. The container in my pod exits, but the pod is restarted rather than deleted/recreated, so NVML initialization keeps failing. Which criteria should I use in my daemonset definition to make sure my pod will be able to initialize NVML and run correctly when it is scheduled on the node?
Thanks
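One possible approach (a sketch, not an official gpu-operator recipe) is to add an init container that blocks the NVML-using container until the operator's validator has marked the driver as ready. The hostPath and file name below are assumptions to verify against the validator daemonset of the gpu-operator version in use:

```yaml
# Sketch: init container that waits for the gpu-operator validator's
# driver-ready marker before the NVML-using container starts. The path and
# file name are assumptions; check /run/nvidia/validations on a GPU node.
spec:
  template:
    spec:
      initContainers:
      - name: wait-for-nvidia-driver
        image: busybox:1.36
        command:
        - sh
        - -c
        - until [ -f /run/nvidia/validations/driver-ready ]; do echo "waiting for NVIDIA driver"; sleep 10; done
        volumeMounts:
        - name: nvidia-run
          mountPath: /run/nvidia
          readOnly: true
      containers:
      - name: monitor            # the NVML-using container, unchanged
        image: registry.example.com/nvml-monitor:latest   # hypothetical image
      volumes:
      - name: nvidia-run
        hostPath:
          path: /run/nvidia
```

Even with a gate like this, it is worth keeping a retry loop around NVML initialization in the application itself, since the driver can be reinstalled (for example on operator upgrades) while the pod keeps running.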