Cannot set custom daemonset tolerations #577

Closed

lmyslinski opened this issue Sep 7, 2023 · 2 comments

@lmyslinski
1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: AKSUbuntu-2204gen2containerd-202308.10.0
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): AKS
  • GPU Operator Version: any; tried the latest as well as 23.6.0

2. Issue or feature description

What the title says: I'm trying to set custom tolerations via daemonsets.tolerations as mentioned in the docs. I've tried several syntaxes, but the daemonsets never get the tolerations applied.

Syntax via yaml file:

helm upgrade -i gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator -f gpu-operator-values.yaml

Yaml file:

daemonsets:
  tolerations:
    - effect: NoSchedule
      key: kubernetes.azure.com/scalesetpriority
      value: spot

Syntax via --set:

helm upgrade -i gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator --set 'daemonsets.tolerations[0].effect=NoSchedule,daemonsets.tolerations[0].key=kubernetes.azure.com/scalesetpriority,daemonsets.tolerations[0].value=spot'

In either case, I cannot see the updated toleration values in the NFD daemonset:

kubectl get ds gpu-operator-node-feature-discovery-worker -n gpu-operator -o json | jq '.spec.template.spec.tolerations':

[
  {
    "effect": "NoSchedule",
    "key": "node-role.kubernetes.io/master",
    "operator": "Equal"
  },
  {
    "effect": "NoSchedule",
    "key": "node-role.kubernetes.io/control-plane",
    "operator": "Equal"
  },
  {
    "effect": "NoSchedule",
    "key": "nvidia.com/gpu",
    "operator": "Exists"
  }
]

Slightly related to @shivamerla's answer in #529.

Happy to provide more details if needed. Is there anything I'm missing here?

@shivamerla
Contributor

@lmyslinski NFD is deployed as a dependent chart of the operator chart, so daemonsets.tolerations does not apply to it. You need to set these tolerations under node-feature-discovery.master and node-feature-discovery.worker as well. Here are the defaults used by the subchart.
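For illustration, a combined values file might look roughly like the following. This is only a sketch based on the key layout of the gpu-operator chart and its node-feature-discovery subchart; keep in mind that Helm replaces list values rather than merging them, so repeat any default tolerations you still need:

daemonsets:
  tolerations:
    - effect: NoSchedule
      key: kubernetes.azure.com/scalesetpriority
      value: spot

node-feature-discovery:
  worker:
    tolerations:
      # Overriding this list replaces the defaults shown in the kubectl output
      # above (node-role.kubernetes.io/master, node-role.kubernetes.io/control-plane,
      # nvidia.com/gpu) -- add them back here if you still need them.
      - effect: NoSchedule
        key: kubernetes.azure.com/scalesetpriority
        value: spot
  master:
    tolerations:
      - effect: NoSchedule
        key: kubernetes.azure.com/scalesetpriority
        value: spot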

@karoldob

I solved a similar case.
As noted above, it is enough to pass both daemonsets.tolerations[0].* and node-feature-discovery.worker.tolerations[0].* to Helm.
Also, don't forget to add a resources.limits entry for nvidia.com/gpu to your deployment's container so the pod waits for the NVIDIA components to be ready (see the sketch below).
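A minimal sketch of that last point (container name and image are placeholders, not taken from this thread): requesting the nvidia.com/gpu resource keeps the pod pending until the device plugin has advertised GPU capacity on the node.

containers:
  - name: cuda-app                                       # hypothetical container name
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image only
    resources:
      limits:
        nvidia.com/gpu: 1   # pod schedules only once the device plugin reports GPUs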
