Description
GPU operator version: v24.6.1
Driver version: 535.154.05
Device plugin version: v0.16.2-ubi8
Kubernetes distribution: EKS
Kubernetes version: v1.27.0
Hi,
We installed the NVIDIA driver directly in our node's base image instead of having the GPU operator manage it. After doing so, the GPU resource requests and limits set on pods are no longer enforced, and every container in a pod can access all of the GPUs on the node.
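For context, this is roughly how the node's advertised GPU capacity can be checked (the node name below is a placeholder; on a g5.48xlarge the device plugin should report 8 GPUs):

# Check what the device plugin advertises on the node (placeholder node name)
kubectl describe node <g5-48xlarge-node> | grep -i "nvidia.com/gpu"
# Expected under Capacity / Allocatable:
#   nvidia.com/gpu:  8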
Sample pod spec
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-pod-3
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - g5.48xlarge
  containers:
  - name: nvidia-smi-container
    image: nvidia/cuda:12.6.2-cudnn-devel-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 5
      requests:
        nvidia.com/gpu: 5
    securityContext:
      capabilities:
        add:
        - SYS_NICE
      privileged: true
  tolerations:
  - key: "nvidia.com/gpu"
    value: "true"
    effect: "NoSchedule"
Here I am setting both the GPU request and the limit to 5. But when I exec into the container and check, I can see all 8 GPUs.
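For illustration, this is roughly the check I run (pod name taken from the spec above; nvidia-smi -L simply lists the devices visible inside the container):

kubectl exec -it nvidia-smi-pod-3 -- nvidia-smi -L
# With a limit of nvidia.com/gpu: 5, only 5 GPUs should be listed here,
# but all 8 GPUs of the g5.48xlarge node show up.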
We also ran the same pod in a different environment where the same driver version was installed by the GPU operator (instead of directly in the base image), and there it worked as expected: the container saw only the requested GPUs.
What could be the problem? Is there a way to fix it?