No GPU node in the cluster, do not create DaemonSets #607

Open
joshpwrk opened this issue Nov 12, 2023 · 9 comments

joshpwrk commented Nov 12, 2023

Goal: Run a PyTorch script that uses the NVIDIA GPU from a Docker container inside a Kubernetes cluster on my local home machine.

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: 6.2.0-36-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Docker (using Docker-Desktop for Linux)
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k8s using Docker Desktop v1.28.2
  • GPU Operator Version: v23.9.0

Hardware:

  • Nvidia GeForce RTX 4090
  • Intel 14900KF processor

2. Issue or feature description

The gpu-operator pod is not able to find any GPU node and logs the following:

{"level":"info","ts":1699757011.503861,"msg":"version: 762213f2"}
{"level":"info","ts":1699757011.5041952,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1699757011.5099807,"logger":"setup","msg":"starting manager"}
{"level":"info","ts":1699757011.5101135,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":1699757011.6101875,"msg":"starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I1112 02:43:31.610309       1 leaderelection.go:245] attempting to acquire leader lease gpu-operator/53822513.nvidia.com...
I1112 02:43:31.614497       1 leaderelection.go:255] successfully acquired lease gpu-operator/53822513.nvidia.com
{"level":"info","ts":1699757011.6146836,"msg":"Starting EventSource","controller":"clusterpolicy-controller","source":"kind source: *v1.ClusterPolicy"}
{"level":"info","ts":1699757011.6147175,"msg":"Starting EventSource","controller":"clusterpolicy-controller","source":"kind source: *v1.Node"}
{"level":"info","ts":1699757011.6147218,"msg":"Starting EventSource","controller":"clusterpolicy-controller","source":"kind source: *v1.DaemonSet"}
{"level":"info","ts":1699757011.614736,"msg":"Starting Controller","controller":"clusterpolicy-controller"}
{"level":"info","ts":1699757011.61475,"msg":"Starting EventSource","controller":"upgrade-controller","source":"kind source: *v1.ClusterPolicy"}
{"level":"info","ts":1699757011.6147707,"msg":"Starting EventSource","controller":"upgrade-controller","source":"kind source: *v1.Node"}
{"level":"info","ts":1699757011.6147761,"msg":"Starting EventSource","controller":"upgrade-controller","source":"kind source: *v1.DaemonSet"}
{"level":"info","ts":1699757011.614778,"msg":"Starting Controller","controller":"upgrade-controller"}
{"level":"info","ts":1699757011.6147697,"msg":"Starting EventSource","controller":"nvidia-driver-controller","source":"kind source: *v1alpha1.NVIDIADriver"}
{"level":"info","ts":1699757011.6147943,"msg":"Starting EventSource","controller":"nvidia-driver-controller","source":"kind source: *v1.ClusterPolicy"}
{"level":"info","ts":1699757011.6148026,"msg":"Starting EventSource","controller":"nvidia-driver-controller","source":"kind source: *v1.Node"}
{"level":"info","ts":1699757011.6148062,"msg":"Starting EventSource","controller":"nvidia-driver-controller","source":"kind source: *v1.DaemonSet"}
{"level":"info","ts":1699757011.6148114,"msg":"Starting Controller","controller":"nvidia-driver-controller"}
{"level":"info","ts":1699757011.7161403,"msg":"Starting workers","controller":"clusterpolicy-controller","worker count":1}
{"level":"info","ts":1699757011.716681,"msg":"Starting workers","controller":"upgrade-controller","worker count":1}
{"level":"info","ts":1699757011.716697,"msg":"Starting workers","controller":"nvidia-driver-controller","worker count":1}
{"level":"info","ts":1699757012.7282248,"logger":"controllers.Upgrade","msg":"Reconciling Upgrade","upgrade":{"name":"cluster-policy"}}
{"level":"info","ts":1699757012.7282693,"logger":"controllers.Upgrade","msg":"Using label selector","upgrade":{"name":"cluster-policy"},"key":"app","value":"nvidia-driver-daemonset"}
{"level":"info","ts":1699757012.728283,"logger":"controllers.Upgrade","msg":"Building state"}
{"level":"info","ts":1699757012.7292824,"logger":"controllers.ClusterPolicy","msg":"Kubernetes version detected","version":"v1.28.2"}
{"level":"info","ts":1699757012.729581,"logger":"controllers.ClusterPolicy","msg":"Operator metrics initialized."}
{"level":"info","ts":1699757012.7295918,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/pre-requisites"}
{"level":"info","ts":1699757012.7298443,"logger":"controllers.ClusterPolicy","msg":"PodSecurityPolicy API is not supported. Skipping..."}
{"level":"info","ts":1699757012.7298522,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-operator-metrics"}
{"level":"info","ts":1699757012.7301328,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-driver"}
{"level":"info","ts":1699757012.7314265,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-container-toolkit"}
{"level":"info","ts":1699757012.7318559,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-operator-validation"}
{"level":"info","ts":1699757012.7318525,"logger":"controllers.Upgrade","msg":"Propagate state to state manager","upgrade":{"name":"cluster-policy"}}
{"level":"info","ts":1699757012.7318795,"logger":"controllers.Upgrade","msg":"State Manager, got state update"}
{"level":"info","ts":1699757012.7318833,"logger":"controllers.Upgrade","msg":"Node states:","Unknown":0,"upgrade-done":0,"upgrade-required":0,"cordon-required":0,"wait-for-jobs-required":0,"pod-deletion-required":0,"upgrade-failed":0,"drain-required":0,"pod-restart-required":0,"validation-required":0,"uncordon-required":0}
{"level":"info","ts":1699757012.7318895,"logger":"controllers.Upgrade","msg":"Upgrades in progress","currently in progress":0,"max parallel upgrades":1,"upgrade slots available":0,"currently unavailable nodes":0,"total number of nodes":0,"maximum nodes that can be unavailable":0}
{"level":"info","ts":1699757012.7318926,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1699757012.731894,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1699757012.7318954,"logger":"controllers.Upgrade","msg":"ProcessUpgradeRequiredNodes"}
{"level":"info","ts":1699757012.7318974,"logger":"controllers.Upgrade","msg":"ProcessCordonRequiredNodes"}
{"level":"info","ts":1699757012.7318988,"logger":"controllers.Upgrade","msg":"ProcessWaitForJobsRequiredNodes"}
{"level":"info","ts":1699757012.7319002,"logger":"controllers.Upgrade","msg":"ProcessPodDeletionRequiredNodes"}
{"level":"info","ts":1699757012.731902,"logger":"controllers.Upgrade","msg":"ProcessDrainNodes"}
{"level":"info","ts":1699757012.7319036,"logger":"controllers.Upgrade","msg":"Node drain is disabled by policy, skipping this step"}
{"level":"info","ts":1699757012.7319052,"logger":"controllers.Upgrade","msg":"ProcessPodRestartNodes"}
{"level":"info","ts":1699757012.7319071,"logger":"controllers.Upgrade","msg":"Starting Pod Delete"}
{"level":"info","ts":1699757012.7319083,"logger":"controllers.Upgrade","msg":"No pods scheduled to restart"
{"level":"info","ts":1699757012.73191,"logger":"controllers.Upgrade","msg":"ProcessUpgradeFailedNodes"}
{"level":"info","ts":1699757012.7319114,"logger":"controllers.Upgrade","msg":"ProcessValidationRequiredNodes"}
{"level":"info","ts":1699757012.7319129,"logger":"controllers.Upgrade","msg":"ProcessUncordonRequiredNodes"}
{"level":"info","ts":1699757012.7319145,"logger":"controllers.Upgrade","msg":"State Manager, finished processing"}
{"level":"info","ts":1699757012.7331576,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-device-plugin"}
{"level":"info","ts":1699757012.7336576,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-dcgm"}
{"level":"info","ts":1699757012.733867,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-dcgm-exporter"}
{"level":"info","ts":1699757012.7341948,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/gpu-feature-discovery"}
{"level":"info","ts":1699757012.7345686,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-mig-manager"}
{"level":"info","ts":1699757012.7350636,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-node-status-exporter"}
{"level":"info","ts":1699757012.7354207,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-vgpu-manager"}
{"level":"info","ts":1699757012.735814,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-vgpu-device-manager"}
{"level":"info","ts":1699757012.7367134,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-sandbox-validation"}
{"level":"info","ts":1699757012.7371285,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-vfio-manager"}
{"level":"info","ts":1699757012.7375736,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-sandbox-device-plugin"}
{"level":"info","ts":1699757012.7379014,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-kata-manager"}
{"level":"info","ts":1699757012.7383142,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-cc-manager"}
{"level":"info","ts":1699757012.7393732,"logger":"controllers.ClusterPolicy","msg":"Sandbox workloads","Enabled":false,"DefaultWorkload":"container"}
{"level":"info","ts":1699757012.739426,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"docker-desktop","GpuWorkloadConfig":"container"}
{"level":"info","ts":1699757012.7394345,"logger":"controllers.ClusterPolicy","msg":"Number of nodes with GPU label","NodeCount":0}
{"level":"info","ts":1699757012.739452,"logger":"controllers.ClusterPolicy","msg":"Unable to get runtime info from the cluster, defaulting to containerd"}
{"level":"info","ts":1699757012.73946,"logger":"controllers.ClusterPolicy","msg":"Using container runtime: containerd"}

As a result:

  • no nvidia-device-plugin DaemonSet is deployed, and
  • the container is never able to find the GPU (a quick label check is sketched below).
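
The operator log above ends with "Number of nodes with GPU label", NodeCount 0, so a useful first check is whether NFD ever applied the NVIDIA PCI vendor label that the operator keys off. A minimal sketch, assuming the standard NFD PCI label for vendor 10de and using the node name from this setup:

# List nodes that NFD has labeled as having an NVIDIA PCI device (vendor ID 10de)
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# Inspect all labels on the Docker Desktop node and filter for NVIDIA-related ones
kubectl get node docker-desktop --show-labels | tr ',' '\n' | grep -i -e nvidia -e 10de

If neither command returns anything, the operator has no GPU node to act on, which matches the NodeCount 0 message.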

3. Steps to reproduce the issue

  1. Install Docker Desktop for Linux.
  2. Enable Kubernetes via Settings -> Kubernetes -> Enable Kubernetes.
  3. Install the operator with Helm (full sequence sketched after this list): helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false
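
For completeness, step 3 assumes the NVIDIA Helm repository is already configured; a minimal sketch of the full sequence, keeping the single override because the host driver is pre-installed:

# Add/refresh the NVIDIA Helm repository used by the chart above
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Same install as step 3: deploy the operator but skip the driver DaemonSet
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false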

4. Information to attach (optional if deemed irrelevant)

A. The drivers for my GPU were pre-installed when I first installed Linux. I am able to run nerfstudio locally (no docker) and fully utilize my GPU and CUDA.

B. I noticed the NFD container is logging the following. Not sure if it's relevant:

I1112 02:43:31.491890       1 main.go:66] "-server is deprecated, will be removed in a future release along with the deprecated gRPC API"
I1112 02:43:31.491969       1 nfd-worker.go:219] "Node Feature Discovery Worker" version="v0.14.2" nodeName="docker-desktop" namespace="gpu-operator"
I1112 02:43:31.492181       1 nfd-worker.go:520] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-worker.conf"
I1112 02:43:31.492386       1 nfd-worker.go:552] "configuration successfully updated" configuration={"Core":{"Klog":{},"LabelWhiteList":{},"NoPublish":false,"FeatureSources":["all"],"Sources":null,"LabelSources":["all"],"SleepInterval":{"Duration":60000000000}},"Sources":{"cpu":{"cpuid":{"attributeBlacklist":["BMI1","BMI2","CLMUL","CMOV","CX16","ERMS","F16C","HTT","LZCNT","MMX","MMXEXT","NX","POPCNT","RDRAND","RDSEED","RDTSCP","SGX","SGXLC","SSE","SSE2","SSE3","SSE4","SSE42","SSSE3","TDX_GUEST"]}},"custom":[],"fake":{"labels":{"fakefeature1":"true","fakefeature2":"true","fakefeature3":"true"},"flagFeatures":["flag_1","flag_2","flag_3"],"attributeFeatures":{"attr_1":"true","attr_2":"false","attr_3":"10"},"instanceFeatures":[{"attr_1":"true","attr_2":"false","attr_3":"10","attr_4":"foobar","name":"instance_1"},{"attr_1":"true","attr_2":"true","attr_3":"100","name":"instance_2"},{"name":"instance_3"}]},"kernel":{"KconfigFile":"","configOpts":["NO_HZ","NO_HZ_IDLE","NO_HZ_FULL","PREEMPT"]},"local":{},"pci":{"deviceClassWhitelist":["02","0200","0207","0300","0302"],"deviceLabelFields":["vendor"]},"usb":{"deviceClassWhitelist":["0e","ef","fe","ff"],"deviceLabelFields":["class","vendor","device"]}}}
I1112 02:43:31.492496       1 metrics.go:70] "metrics server starting" port=8081
E1112 02:43:31.495324       1 memory.go:91] "failed to detect NUMA nodes" err="failed to list numa nodes: open /host-sys/bus/node/devices: no such file or directory"
I1112 02:43:31.498944       1 nfd-worker.go:562] "starting feature discovery..."
I1112 02:43:31.499098       1 nfd-worker.go:577] "feature discovery completed"
I1112 02:43:31.512431       1 nfd-worker.go:698] "creating NodeFeature object" nodefeature=""
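
The "failed to detect NUMA nodes" error above suggests the node's sysfs does not look the way NFD expects inside the Docker Desktop VM. One way to see what the node actually exposes is a throwaway node debug pod; a sketch, assuming busybox as an arbitrary debug image and the node name from this setup (kubectl debug mounts the node's root filesystem at /host):

# Start a debug pod on the node; the node's filesystem is mounted at /host
kubectl debug node/docker-desktop -it --image=busybox -- sh

# Inside the debug pod: list PCI devices whose vendor is NVIDIA (0x10de)
grep -il 0x10de /host/sys/bus/pci/devices/*/vendor

If no NVIDIA PCI device shows up there, NFD has nothing to label, regardless of what the host's own nvidia-smi reports.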

C. Based on the docs, I believe the NVIDIA Container Toolkit DaemonSet should be launched automatically by the gpu-operator Helm chart, but I do not see it in my pod list (a quick check is sketched below).
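
A quick way to confirm whether the toolkit DaemonSet was created at all; a sketch (the DaemonSet is normally named something like nvidia-container-toolkit-daemonset, so the grep keeps the check robust to exact naming):

# Look for any container-toolkit DaemonSet created by the operator
kubectl get ds -n gpu-operator | grep -i toolkit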

D. In the gpu-operator pod there is also a message saying "Unable to get runtime info from the cluster, defaulting to containerd". Not sure whether this is an issue, since I'm running Kubernetes via Docker Desktop, which should technically be using the Docker engine (a quick way to read what the node advertises is sketched below).
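
For what it's worth, the runtime the operator is trying to detect is the one the node advertises to the API server, which can be read directly; a small check (node name taken from this setup):

# Show the container runtime the docker-desktop node reports to Kubernetes
kubectl get node docker-desktop -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'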

E. One final thing: I'm not able to exec into the gpu-operator worker pod; it gives me this error:

OCI runtime exec failed: exec failed: unable to start container process: exec: "sh": executable file not found in $PATH: unknown
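
That failure just means the image ships no shell. If a shell is really needed, an ephemeral debug container can be attached instead; a sketch, with POD_NAME/CONTAINER_NAME as placeholders in the same style as the template items below and busybox as an arbitrary debug image:

# Attach a busybox ephemeral container to the pod, optionally sharing the target container's process namespace
kubectl debug -n gpu-operator -it POD_NAME --image=busybox --target=CONTAINER_NAME -- sh
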
  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
NAME                                                              READY   STATUS    RESTARTS   AGE
gpu-operator-1699757009-node-feature-discovery-gc-d94b5686vk2bd   1/1     Running   0          90m
gpu-operator-1699757009-node-feature-discovery-master-67bfjbcvm   1/1     Running   0          90m
gpu-operator-1699757009-node-feature-discovery-worker-fn6w5       1/1     Running   0          90m
gpu-operator-6f74bc4cd4-tstq5                                     1/1     Running   0          90m
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
gpu-operator-1699757009-node-feature-discovery-worker   1         1         1       1            1           <none>          90m
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0  On |                  Off |
|  0%   45C    P8              37W / 480W |   1198MiB / 24564MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                        
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1502      G   /usr/lib/xorg/Xorg                          497MiB |
|    0   N/A  N/A      1723      G   /usr/bin/gnome-shell                         77MiB |
|    0   N/A  N/A      3946      G   ...ures=SpareRendererForSitePerProcess      229MiB |
|    0   N/A  N/A      5388      G   ...ures=SpareRendererForSitePerProcess       35MiB |
|    0   N/A  N/A      5642      G   ...sion,SpareRendererForSitePerProcess       93MiB |
|    0   N/A  N/A      5898      G   ...irefox/3358/usr/lib/firefox/firefox      193MiB |
|    0   N/A  N/A     11580      G   gnome-control-center                          6MiB |
|    0   N/A  N/A     36041      G   ...,WinRetrieveSuggestionsOnlyOnDemand       31MiB |
+---------------------------------------------------------------------------------------+

  • containerd logs journalctl -u containerd > containerd.log

NOTE: I sent an email with the full logs too

@benjaminprevost

benjaminprevost commented Nov 23, 2023

We have the same issue with the latest release (23.9.0), but it works with 23.6.1.

GPU: Tesla V100 16 GB
K8S: 1.27
OS: Ubuntu 22.04.3
Kernel: 5.15.0-88-generic

@ArangoGutierrez
Collaborator

@benjaminprevost, when you refer to the "same issue", are you also running Docker Desktop?

@ArangoGutierrez
Collaborator

Hi @joshpwrk, could you try https://microk8s.io/docs/addon-gpu instead of Docker Desktop and let us know?
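
For reference, the addon from that page is enabled with a single command; a sketch (the addon has been called gpu in older MicroK8s releases and nvidia in newer ones, so check the linked docs for the installed version):

# Enable the NVIDIA GPU addon, which deploys the GPU Operator inside MicroK8s
microk8s enable gpu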

@ArangoGutierrez
Collaborator

@shivamerla / @cdesiniotis, do we support gaming cards with the Operator?

@tariq1890
Contributor

tariq1890 commented Nov 27, 2023

@ArangoGutierrez The GPUs that are officially supported can be found here

@ArangoGutierrez
Collaborator

ArangoGutierrez commented Nov 27, 2023

Looks like @joshpwrk's GPU card is not supported by the Operator.

@ArangoGutierrez
Collaborator

ArangoGutierrez commented Nov 27, 2023

We have the same issue with the latest release (23.9.0), but it works with 23.6.1.

GPU: Tesla V100 16 GB, K8s: 1.27, OS: Ubuntu 22.04.3, Kernel: 5.15.0-88-generic

@benjaminprevost Please file a new ticket for your use case.
Could you tell us more about the Kubernetes solution you are using? Is it virtualized or bare metal? (In the new issue, not here, just to be clear.)

@harshsavasil

Hey @ArangoGutierrez,

I'm facing the same issue with an NVIDIA RTX 4080 series card. Did you guys find any solution?

@rogersaloo

I am facing the same issue with an NVIDIA H100. Has anyone found a solution?
