I am looking for a way to designate certain nodes in my cluster as "reduce only" nodes, i.e., nodes that are only available for executing the reduce stage of jobs.
Conversely, it would be useful to be able to designate "map only" nodes, i.e., nodes that are only available for executing the map stage of jobs.
In my cluster, I have two kinds of servers: one set of high performance servers for executing the heavy computations in the map stage, and another set of lower performance servers suitable for executing the less complex reduce stage.
So I don't want the high performance servers to be wasted executing the reduce stage of my jobs.
Disco 0.5.4 does not have this feature. If someone can point me to where in the code the logic for selecting the node that executes the reduce stage of a job lives, it would be greatly appreciated.
I don't believe this should be complex to add:
Add configuration settings for designating reduce-only and map-only nodes.
When selecting a node for either stage, the Disco master picks a node that belongs to the set designated for that stage.
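The two steps above can be sketched roughly as follows. This is only an illustration in Python, not Disco code: the names `MAP_ONLY`, `REDUCE_ONLY`, and `eligible_nodes` are hypothetical, the host names are made up, and the sketch assumes that nodes left out of both sets may run either stage.

```python
# Hypothetical settings: these names are NOT part of Disco's configuration.
MAP_ONLY = frozenset({"fast1", "fast2"})      # high performance servers
REDUCE_ONLY = frozenset({"slow1", "slow2"})   # lower performance servers


def eligible_nodes(stage, nodes, map_only=MAP_ONLY, reduce_only=REDUCE_ONLY):
    """Return the nodes from `nodes` that may run `stage`.

    Assumption: a node listed in neither set may run both stages.
    """
    if stage == "map":
        excluded = reduce_only
    elif stage == "reduce":
        excluded = map_only
    else:
        raise ValueError("unknown stage: %r" % (stage,))
    # Preserve the caller's ordering so any existing preference is kept.
    return [node for node in nodes if node not in excluded]


cluster = ["fast1", "fast2", "slow1", "slow2", "misc1"]
map_candidates = eligible_nodes("map", cluster)
reduce_candidates = eligible_nodes("reduce", cluster)
```

The master would then apply its usual selection logic to the filtered list instead of the full list of nodes.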
Thanks in advance!
Hi, the code that chooses a node is in job_coordinator:do_submit_tasks_in. That choice might be overridden later based on node availability.
Please note that this type of cluster is not very common. If the nodes are not uniform, you can already set the number of workers per node. Moreover, the idea is to push computation to the data. If a map is performed on a node, the output of the map will be on the same node and it makes sense to run reduce on the same node to avoid shipping the data to another node.
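To make the data-locality point concrete, here is a small Python sketch of the trade-off. It is not how Disco schedules tasks; `pick_reduce_node` and the host names are hypothetical, and the byte counts are made-up inputs. It picks the allowed node that already holds the most map output and reports how much data would have to be shipped to it.

```python
def pick_reduce_node(map_output_bytes, allowed):
    """Pick a reduce node and estimate the data that must be transferred.

    map_output_bytes: dict mapping node name -> bytes of map output stored there.
    allowed: iterable of node names permitted to run the reduce stage.
    Returns (node, bytes_to_transfer).
    """
    candidates = sorted(allowed)  # sorted for a deterministic tie-break
    if not candidates:
        raise ValueError("no node is allowed to run the reduce stage")
    total = sum(map_output_bytes.values())
    # Prefer the node that already stores the largest share of the input.
    best = max(candidates, key=lambda node: map_output_bytes.get(node, 0))
    return best, total - map_output_bytes.get(best, 0)


outputs = {"fast1": 100, "fast2": 50}

# Reduce may run anywhere: most of the data stays where it is.
unrestricted = pick_reduce_node(outputs, ["fast1", "fast2", "slow1"])

# Reduce restricted to a reduce-only node: all map output must be shipped.
restricted = pick_reduce_node(outputs, ["slow1"])
```

In this toy case the unrestricted choice moves 50 bytes, while the reduce-only restriction moves all 150, which is the cost a reduce-only setup would need to justify.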