The resource requests and limits are not being applied to the pod as expected. #1145

@IndhumithaR

Description

GPU operator version: v24.6.1
driver.version: 535.154.05
Device plugin version: v0.16.2-ubi8

Kubernetes distribution
EKS

Kubernetes version
v1.27.0

Hi,

We attempted to install the NVIDIA driver directly in our node's base image instead of using the GPU operator. However, after doing so, the GPU resource requests and limits set on pods are no longer enforced, and every container in a pod can access all of the GPUs on the node.
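
A rough sketch of the basic checks involved (the node name is a placeholder, and the device plugin namespace/labels may differ depending on how the standalone plugin manifest was deployed):

# Does the node still advertise nvidia.com/gpu under Capacity/Allocatable?
kubectl describe node <gpu-node-name> | grep -A 8 "Capacity:"

# Is the standalone device plugin pod running on that node?
kubectl get pods -A -o wide | grep nvidia-device-plugin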

Sample pod spec

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-pod-3
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - g5.48xlarge
  containers:
  - name: nvidia-smi-container
    image: nvidia/cuda:12.6.2-cudnn-devel-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 5
      requests:
        nvidia.com/gpu: 5
    securityContext:
      capabilities:
        add:
        - SYS_NICE
      privileged: true
  tolerations:
  - key: "nvidia.com/gpu"
    value: "true"
    effect: "NoSchedule"

Here I am setting both the GPU request and limit to 5, but when I exec into the container and check, I can see all 8 GPUs on the node.

(Screenshot: nvidia-smi output inside the container listing all 8 GPUs)
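
As far as I understand, when isolation works the device plugin passes only the allocated GPUs to the container (via NVIDIA_VISIBLE_DEVICES, which the NVIDIA container runtime then honours). A quick way to see what the container was actually given, using the pod and container names from the spec above:

# Which GPUs did the device plugin hand to this container?
kubectl exec nvidia-smi-pod-3 -c nvidia-smi-container -- env | grep NVIDIA_VISIBLE_DEVICES

# What does the driver report inside the container?
kubectl exec nvidia-smi-pod-3 -c nvidia-smi-container -- nvidia-smi -L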

However, when we ran the same pod in a different environment, where the same driver version was installed by the GPU operator rather than baked into the base image, it worked as expected.

(Screenshot: nvidia-smi output in the GPU-operator environment showing only the requested GPUs)
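
One thing that can be compared between the two environments is how the container runtime on each node is configured. A sketch of what to look at, run on the nodes themselves (these are the default file locations and may differ on EKS AMIs):

# Is the nvidia runtime registered with containerd?
grep -A 5 "nvidia" /etc/containerd/config.toml

# NVIDIA Container Toolkit runtime settings
cat /etc/nvidia-container-runtime/config.toml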

What could be the problem? Is there a way to fix it?
