No GPU node in the cluster, do not create DaemonSets #607
Comments
We have the same issue with the latest release (23.9.0), but it works with 23.6.1. GPU: Tesla V100 16 GB
@benjaminprevost when you refer to the "same issue", are you also running
Hi @joshpwrk, could you try https://microk8s.io/docs/addon-gpu instead of
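For reference, the MicroK8s GPU addon described at that link is normally enabled with a single command. This is only a sketch, assuming MicroK8s is already installed and the node has an NVIDIA GPU; the addon name and target namespace can differ between MicroK8s versions:

microk8s enable gpu
microk8s kubectl get pods -A | grep -i nvidia

The second command just lists whatever NVIDIA-related pods the addon deployed, so you can confirm the operator components actually came up.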
@shivamerla / @cdesiniotis do we support gaming cards with the Operator?
@ArangoGutierrez The GPUs that are officially supported can be found here
Looks like @joshpwrk's GPU card is not supported by the Operator.
@benjaminprevost Please file a new ticket for your use case.
Hey @ArangoGutierrez, I'm facing the same issue with an NVIDIA RTX 4080 series card. Did you guys find any solution?
I am facing the same issue with an NVIDIA H100. Is there anyone with a solution?
Goal: Have a Docker container within a k8s cluster run a PyTorch script using an Nvidia GPU on a local home computer.
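Once the operator is healthy, that end goal can be sanity-checked with a minimal GPU pod before moving on to the PyTorch workload. This is only a sketch: the pod name and CUDA image tag are illustrative, and it assumes the operator has already advertised the nvidia.com/gpu resource on the node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs gpu-smoke-test

If the pod schedules and the logs show the nvidia-smi table, the device plugin and toolkit are working end to end.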
1. Quick Debug Information
Hardware:
2. Issue or feature description
The gpu-operator pod is not able to find the GPU and outputs this error:
As a result:
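The "No GPU node in the cluster" message is generally driven by node labels: the operator looks for the NVIDIA PCI vendor label applied by NFD before it creates its DaemonSets. A quick way to check whether that label made it onto the node (label names below are the usual defaults and may vary by version):

kubectl get nodes --show-labels | tr ',' '\n' | grep -i -e nvidia -e 10de

If the feature.node.kubernetes.io/pci-10de.present=true label (0x10de is NVIDIA's PCI vendor ID) is missing, NFD never detected the GPU on the node, which is consistent with the operator deciding not to create any DaemonSets.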
3. Steps to reproduce the issue
4. Information to attach (optional if deemed irrelevant)
A. The drivers for my GPU were pre-installed when I first installed Linux. I am able to run nerfstudio locally (no Docker) and fully utilize my GPU and CUDA.
B. I noticed the NFD container is logging this message. Not sure if relevant:
C. Based on the docs, I believe the NVIDIA Container Toolkit container should be automatically launched by the gpu-operator Helm chart, but I do not see it in my pod list.
D. Notice that in the gpu-operator pod there's also a message saying "unable to get runtime info from cluster, defaulting to containerd". Not sure if this is an issue, since I'm running k8s via Docker Desktop and it technically should be running on the Docker engine (see the quick check after item E below).
E. One final thing: I'm not able to SSH into the gpu-operator-worker pod, as it gives me the error:
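Regarding item D, a simple way to see which container runtime Kubernetes itself reports for the node (just a sketch; the operator's runtime detection may consult other sources as well):

kubectl get nodes -o wide

The CONTAINER-RUNTIME column shows whether the node reports docker:// or containerd://, which may or may not match what the operator defaulted to.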
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log
NOTE: I sent an email with the full logs too