Cannot set custom daemonset tolerations #577

Closed

lmyslinski opened this issue Sep 7, 2023 · 2 comments

@lmyslinski
1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: AKSUbuntu-2204gen2containerd-202308.10.0
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): AKS
  • GPU Operator Version: any; tried the latest as well as 23.6.0

2. Issue or feature description

What the title says: I'm trying to set custom tolerations via daemonsets.tolerations as mentioned in the docs. I've tried several syntaxes, but the daemonsets never get the tolerations applied.

Syntax via yaml file:

helm upgrade -i gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator -f gpu-operator-values.yaml

Yaml file:

daemonsets:
  tolerations:
    - effect: NoSchedule
      key: kubernetes.azure.com/scalesetpriority
      value: spot

Syntax via --set:

helm upgrade -i gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator --set 'daemonsets.tolerations[0].effect=NoSchedule,daemonsets.tolerations[0].key=kubernetes.azure.com/scalesetpriority,daemonsets.tolerations[0].value=spot'

In either case, I cannot see the updated toleration values in the NFD daemonset:

kubectl get ds gpu-operator-node-feature-discovery-worker -n gpu-operator -o json | jq '.spec.template.spec.tolerations':

[
  {
    "effect": "NoSchedule",
    "key": "node-role.kubernetes.io/master",
    "operator": "Equal"
  },
  {
    "effect": "NoSchedule",
    "key": "node-role.kubernetes.io/control-plane",
    "operator": "Equal"
  },
  {
    "effect": "NoSchedule",
    "key": "nvidia.com/gpu",
    "operator": "Exists"
  }
]

Slightly related to @shivamerla's answer in #529.

Happy to provide more details if needed. Is there anything I'm missing here?

@shivamerla
Contributor

@lmyslinski NFD is deployed as a dependent chart of the operator chart, so daemonsets.tolerations does not apply to it. You need to set these tolerations under node-feature-discovery.master and node-feature-discovery.worker as well. Here are the defaults used by the subchart.
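For illustration, a combined values file might look roughly like the following. This is only a sketch based on the key layout of the gpu-operator chart and its node-feature-discovery subchart; keep in mind that Helm replaces list values rather than merging them, so repeat any default tolerations you still need:

daemonsets:
  tolerations:
    - effect: NoSchedule
      key: kubernetes.azure.com/scalesetpriority
      value: spot

node-feature-discovery:
  worker:
    tolerations:
      # Overriding this list replaces the defaults shown in the kubectl output
      # above (node-role.kubernetes.io/master, node-role.kubernetes.io/control-plane,
      # nvidia.com/gpu) -- add them back here if you still need them.
      - effect: NoSchedule
        key: kubernetes.azure.com/scalesetpriority
        value: spot
  master:
    tolerations:
      - effect: NoSchedule
        key: kubernetes.azure.com/scalesetpriority
        value: spot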

@karoldob

I solved a similar case.
As noted above, it is enough to pass both daemonsets.tolerations[0].* and node-feature-discovery.worker.tolerations[0].* to Helm.
Also, don't forget to add a resources.limits entry for nvidia.com/gpu to your deployment's container so the pod waits for the NVIDIA components to be ready (see the sketch below).
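A minimal sketch of that last point (container name and image are placeholders, not taken from this thread): requesting the nvidia.com/gpu resource keeps the pod pending until the device plugin has advertised GPU capacity on the node.

containers:
  - name: cuda-app                                       # hypothetical container name
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image only
    resources:
      limits:
        nvidia.com/gpu: 1   # pod schedules only once the device plugin reports GPUs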
