
NVIDIA GPU detection doesn't work with all the drivers & toolkits installed #10534

Closed

arshan-ritual opened this issue Jul 16, 2024 · 2 comments


arshan-ritual commented Jul 16, 2024

This may be a mistake on my side, but I've set up a cluster with only one master node. I've installed the NVIDIA drivers as well as the NVIDIA container runtime, and I've verified that containers can access the GPU through both Docker and containerd.

Environmental Info:
K3s Version:

➜  / k3s -v
k3s version v1.29.6+k3s2 (b4b156d9)
go version go1.21.11

Node(s) CPU architecture, OS, and Version:

uname -a 
Linux psduyqnrwtts 5.15.0-113-generic #123-Ubuntu SMP Mon Jun 10 08:16:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

Just one master node:

NAME           STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
psduyqnrwtts   Ready    control-plane,master   23m   v1.29.6+k3s2   10.64.4.86    184.105.5.117   Ubuntu 22.04.2 LTS   5.15.0-113-generic   containerd://1.7.17-k3s1

Describe the bug/Steps To Reproduce

  1. Installed the container runtime & drivers
  2. Verified nvidia-smi on the host machine, and in containers through both docker & containerd
  3. Started the master node:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--node-external-ip=$(master_ip) --flannel-backend=wireguard-native --flannel-external-ip" sh -
  4. kubectl describe node <my node> doesn't show the GPU:
Allocatable:
  cpu:                8
  ephemeral-storage:  98520635110
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             46223936Ki
  pods:               110

Running nvidia-smi from containerd

sudo ctr run --rm --gpus 0 -t docker.io/library/ubuntu:latest cuda-11.0-base nvidia-smi

Tue Jul 16 19:53:56 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               Off |   00000000:00:05.0 Off |                  Off |
| 41%   44C    P8             10W /  140W |       2MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Running nvidia-smi from docker

➜  / sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Tue Jul 16 19:54:47 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               Off |   00000000:00:05.0 Off |                  Off |
| 41%   44C    P8             10W /  140W |       2MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Expected behavior:
The GPU should be automatically detected and advertised by the node.

Actual behavior:
The GPU is not detected; nvidia.com/gpu never appears in the node's capacity or allocatable resources.

Additional context / logs:
Per the website's instructions, I ran this:

sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
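
For what it's worth, one way to sanity-check the runtime binary outside of k3s (a rough sketch; it assumes the NVIDIA Container Toolkit CLI is installed at the default paths):

# confirm the binary k3s points at actually exists
ls -l /usr/bin/nvidia-container-runtime

# ask the container toolkit what it can see on the host
nvidia-container-cli info

Both of these looked fine in my case, which matches the working ctr/docker runs above.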

I can see that the nvidia & nvidia-experimental runtime classes are registered:

kubectl get runtimeclasses

NAME                  HANDLER               AGE
crun                  crun                  29m
lunatic               lunatic               29m
nvidia                nvidia                29m
nvidia-experimental   nvidia-experimental   29m
slight                slight                29m
spin                  spin                  29m
wasmedge              wasmedge              29m
wasmer                wasmer                29m
wasmtime              wasmtime              29m
wws                   wws                   29m

I applied the pod spec from the website itself, and upon describing it:

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  23m                  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  2m59s (x4 over 17m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
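
For reference, the pod I applied is roughly of this shape (a sketch, not the exact manifest from the docs; the pod name and image tag here are illustrative):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                  # illustrative name
spec:
  runtimeClassName: nvidia        # runtime class generated by k3s
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image/tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1         # the resource the scheduler reports as insufficient
EOF

The scheduler never places it because the node does not advertise nvidia.com/gpu at all.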

Full output of kubectl describe node

Name:               psduyqnrwtts
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=psduyqnrwtts
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.64.4.86
                    flannel.alpha.coreos.com/backend-data: {"PublicKey":"gTTrS2fGKW52EFpki9hRixF2gjg/BHQ9t1us8gyZ3TE="}
                    flannel.alpha.coreos.com/backend-type: wireguard
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 184.105.5.117
                    flannel.alpha.coreos.com/public-ip-overwrite: 184.105.5.117
                    k3s.io/external-ip: 184.105.5.117
                    k3s.io/hostname: psduyqnrwtts
                    k3s.io/internal-ip: 10.64.4.86
                    k3s.io/node-args: ["server","--node-external-ip","184.105.5.117","--flannel-backend","wireguard-native","--flannel-external-ip"]
                    k3s.io/node-config-hash: YR2JCCTPZLYSZXHAO5IRMU3BZFXH2KZ7VN3RAN5VF7AM7JWTV6XA====
                    k3s.io/node-env: {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/c38ba7cc1669e7d80b8156ae743932fd86f5bce3871b8a88bef531dd4e3c02b2"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 16 Jul 2024 15:26:13 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  psduyqnrwtts
  AcquireTime:     <unset>
  RenewTime:       Tue, 16 Jul 2024 15:58:00 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 16 Jul 2024 15:57:21 -0400   Tue, 16 Jul 2024 15:26:13 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 16 Jul 2024 15:57:21 -0400   Tue, 16 Jul 2024 15:26:13 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 16 Jul 2024 15:57:21 -0400   Tue, 16 Jul 2024 15:26:13 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 16 Jul 2024 15:57:21 -0400   Tue, 16 Jul 2024 15:26:14 -0400   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.64.4.86
  ExternalIP:  184.105.5.117
  Hostname:    psduyqnrwtts
Capacity:
  cpu:                8
  ephemeral-storage:  101275324Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             46223936Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  98520635110
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             46223936Ki
  pods:               110
System Info:
  Machine ID:                 db7f86e412f94f248d2772ebfb6a5577
  System UUID:                7962ce5f-9903-6f29-75f3-725910b0b45c
  Boot ID:                    bee31d38-6f65-4221-b230-8fff4ffb972e
  Kernel Version:             5.15.0-113-generic
  OS Image:                   Ubuntu 22.04.2 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.17-k3s1
  Kubelet Version:            v1.29.6+k3s2
  Kube-Proxy Version:         v1.29.6+k3s2
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
ProviderID:                   k3s://psduyqnrwtts
Non-terminated Pods:          (5 in total)
  Namespace                   Name                                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                      ------------  ----------  ---------------  -------------  ---
  kube-system                 coredns-6799fbcd5-rvpws                   100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     31m
  kube-system                 local-path-provisioner-6f5d79df6-zvxwm    0 (0%)        0 (0%)      0 (0%)           0 (0%)         31m
  kube-system                 metrics-server-696765bb5b-wljgp           100m (1%)     0 (0%)      70Mi (0%)        0 (0%)         31m
  kube-system                 svclb-traefik-118aa6da-w4hx6              0 (0%)        0 (0%)      0 (0%)           0 (0%)         31m
  kube-system                 traefik-7d5f6474df-f4tnx                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         31m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                200m (2%)   0 (0%)
  memory             140Mi (0%)  170Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type     Reason                          Age                From                   Message
  ----     ------                          ----               ----                   -------
  Normal   Starting                        31m                kube-proxy             
  Normal   Starting                        31m                kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity             31m                kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory         31m (x2 over 31m)  kubelet                Node psduyqnrwtts status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced         31m                kubelet                Updated Node Allocatable limit across pods
  Normal   NodeReady                       31m                kubelet                Node psduyqnrwtts status is now: NodeReady
  Normal   NodePasswordValidationComplete  31m                k3s-supervisor         Deferred node password secret validation complete
  Normal   Synced                          31m                cloud-node-controller  Node synced successfully
  Normal   RegisteredNode                  31m                node-controller        Node psduyqnrwtts event: Registered Node psduyqnrwtts in Controller

@arshan-ritual (Author)

I ended up getting around this issue by following a blog post and installing the NVIDIA GPU Operator:

# first, install the helm utility
sudo snap install helm

# add the nvidia helm repo and update
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update

# install the operator
helm install --wait nvidiagpu \
     -n gpu-operator --create-namespace \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true \
     nvidia/gpu-operator
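
Once the operator's pods settle, the node starts advertising the GPU. A quick way to confirm (sketch; the jsonpath escaping assumes a single nvidia.com/gpu resource and uses my node name):

# wait for the operator / device-plugin pods to come up
kubectl -n gpu-operator get pods

# the node should now advertise nvidia.com/gpu
kubectl describe node psduyqnrwtts | grep -A1 -i 'nvidia.com/gpu'
kubectl get node psduyqnrwtts -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'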

But I'm leaving this issue open, as it seems like this is something that should just work out of the box with k3s.

brandond (Contributor) commented Jul 17, 2024

As the docs say, all we natively support is the runtimes. If you want things that the operator adds, including GPU resources in the node status, you need to install the plugin.

https://docs.k3s.io/advanced#nvidia-container-runtime-support

Note that the NVIDIA Container Runtime is also frequently used with NVIDIA Device Plugin
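
For anyone who only wants the nvidia.com/gpu resource advertised, without the full GPU Operator, a minimal sketch of installing the upstream device plugin via its Helm chart (repo URL, chart name, and the runtimeClassName value are taken from the NVIDIA k8s-device-plugin project; verify against that chart's documentation and values):

# add the device plugin chart repo (assumed upstream location)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update

# install it so its pods run under the nvidia runtime class k3s already created
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  -n kube-system \
  --set runtimeClassName=nvidia   # chart value assumed; check the chart's values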
