
Configurable tolerations #194

Open · jennydaman opened this issue Jan 31, 2022 · 8 comments

@jennydaman
Collaborator

GPU nodes are tainted with PreferNoSchedule. They can still be scheduled to if a pod specs its containers with resources.limits['nvidia.com/gpu'] = 1, but it would be better if pman could be configured to conditionally set tolerations on jobs.

https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
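Rough sketch of what I have in mind, assuming pman builds the job spec with the official kubernetes Python client; `TOLERATE_GPU_TAINT` and the `gpu=true` taint are placeholders, not existing pman options:

```python
from kubernetes import client

# Hypothetical per-compute-environment setting (not an existing pman option):
# whether jobs should tolerate the GPU taint.
TOLERATE_GPU_TAINT = True

def gpu_job_pod_spec(image: str, command: list) -> client.V1PodSpec:
    container = client.V1Container(
        name='job',
        image=image,
        command=command,
        # this alone already lets the pod land on a PreferNoSchedule-tainted GPU node
        resources=client.V1ResourceRequirements(limits={'nvidia.com/gpu': '1'}),
    )
    tolerations = None
    if TOLERATE_GPU_TAINT:
        # 'gpu=true' is a made-up taint; the real key/value depend on how the nodes are tainted
        tolerations = [client.V1Toleration(
            key='gpu', operator='Equal', value='true', effect='PreferNoSchedule',
        )]
    return client.V1PodSpec(
        containers=[container],
        restart_policy='Never',
        tolerations=tolerations,
    )
```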

P.S. Per-compute-env configuration is starting to get unwieldy, e.g. for Swarm and Kubernetes; how can we manage this more concisely?

@qxprakash

qxprakash commented Apr 20, 2022

@jennydaman so you want pman to set tolerations on pods instead of adding them in pod-definition.yml... I hope I am getting this correct.

jennydaman assigned and then unassigned qxprakash on Apr 26, 2022
@jennydaman
Collaborator Author

To rephrase my question in the P.S.:

pman is supposed to be a common interface over Kubernetes, Swarm, SLURM, ..., so scheduler-specific configuration is antithetical to its intention.

@jennydaman
Collaborator Author

@Prakashh21 what/where is pod-definition.yml?

@qxprakash

I used pod-definition.yml just as a reference name. Yes, what I meant was scheduler-specific configuration for setting tolerations; here we want pman to set the tolerations on the pods, right?

@qxprakash

  • Q1) Why did we set taints on GPU nodes? (Probably to not schedule just any pod on a GPU node, but only those pods which perform graphically intensive operations and require GPU power.)
  • Q2) It seems we are not setting any toleration on pods as of now; just passing the spec resources.limits['nvidia.com/gpu'] = 1 schedules the pod on the tainted GPU node.
  • Q3) What we want is for pman to set the toleration on the pod through the job description itself?
    @jennydaman am I getting this right?

@jennydaman
Collaborator Author

I still don't understand what you mean by pod-definition.yml but moving on...

  • Q1) yes*
  • Q2) yes
  • Q3) I think so?

Closely related issue: being able to configure pman with a set of affinity labels. Using tolerations and affinities, we can deploy multiple pman instances which correspond to different configurations, e.g. one pman instance will prefer low-CPU, high-memory nodes, another will prefer high-CPU, high-memory nodes, ...

*GPU-intensive does not necessarily mean graphically intensive, e.g. machine learning
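To sketch the affinity idea (made-up node labels, not an existing pman feature): each pman instance could be configured with a set of labels it turns into soft (preferred) node affinity on every job it creates, e.g. with the kubernetes Python client:

```python
from kubernetes import client

# Made-up node labels; a real cluster would define its own labeling scheme.
PREFERRED_NODE_LABELS = {'memory-tier': 'high', 'cpu-tier': 'low'}

def preferred_affinity(labels: dict) -> client.V1Affinity:
    """Soft node affinity built from this pman instance's configured labels."""
    terms = [
        client.V1PreferredSchedulingTerm(
            weight=50,
            preference=client.V1NodeSelectorTerm(
                match_expressions=[
                    client.V1NodeSelectorRequirement(key=key, operator='In', values=[value])
                ]
            ),
        )
        for key, value in labels.items()
    ]
    return client.V1Affinity(
        node_affinity=client.V1NodeAffinity(
            preferred_during_scheduling_ignored_during_execution=terms
        )
    )

# e.g. pod_spec.affinity = preferred_affinity(PREFERRED_NODE_LABELS)
```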

@qxprakash

@jennydaman pod-definition.yml is the configuration/manifest/specification of the pods which are to be scheduled on the cluster nodes. Tolerations are set on pods in their manifests, that is what I was saying, and pod-definition.yml was just an example name for such a manifest. I hope I was clear.

@qxprakash

qxprakash commented Apr 27, 2022

> I still don't understand what you mean by pod-definition.yml but moving on...
>
>   • Q1) yes*
>   • Q2) yes
>   • Q3) I think so?
>
> Closely related issue: being able to configure pman with a set of affinity labels. Using tolerations and affinities, we can deploy multiple pman instances which correspond to different configurations, e.g. one pman instance will prefer low-CPU, high-memory nodes, another will prefer high-CPU, high-memory nodes, ...
>
> *GPU-intensive does not necessarily mean graphically intensive, e.g. machine learning

So what you're saying is, we'll have multiple instances of pman, each preferring to schedule pods on a different set of nodes (catering to different types of workloads) through their configured tolerations and affinities. This sounds cool, but tell me this: if we have multiple instances of pman, then how would pfcon know which pman instance it should send the job description to? Will this be defined in the job description itself, or...?
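
One way I could imagine it working (purely hypothetical; the names below are made up and not pfcon's actual API): pfcon keeps a mapping from compute environment name to pman address, and the job request names the compute environment:

```python
# Purely hypothetical sketch of per-compute-environment routing.
import requests

# Made-up mapping from compute environment name to the corresponding pman instance.
PMAN_INSTANCES = {
    'gpu': 'http://pman-gpu:5010/api/v1/',
    'highmem': 'http://pman-highmem:5010/api/v1/',
}

def submit_job(compute_env: str, job_description: dict) -> requests.Response:
    """Send the job description to the pman instance configured for this compute env."""
    base_url = PMAN_INSTANCES[compute_env]
    return requests.post(base_url, json=job_description)
```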
