
Using multi-node with LocalExecutor #130

Open
LopezGG opened this issue Dec 23, 2024 · 2 comments
Comments
LopezGG commented Dec 23, 2024

I noticed LocalExecutor has a hard-coded value for nnodes.

def nnodes(self) -> int:
    return 1

Is there a reason multi-node is disabled? It feeds into torch_run, which seems to support multi-node:

if max_nnodes == 1:
    # using port 0 makes elastic choose a free random port, which is OK
    # for single-node jobs since all workers run under a single agent.
    # When nnodes is 0 and max_nnodes is 1, it is still a single-node job
    # but stays pending until the resources become available.
    rdzv_endpoint = "localhost:0"
    num_nodes = nnodes_rep
    nproc_per_node = str(nproc_per_node)
    node_rank = "0"
else:
    # for multi-node, rely on the rank0_env environment variable set by
    # the schedulers (see the scheduler implementation for the actual env var this maps to).
    # Some schedulers (e.g. AWS Batch) make rank0's IP address available on all nodes BUT rank0,
    # so default to "localhost" if the env var is not set or is empty.
    # rdzv_endpoint resolves in bash to something to the effect of
    #   ${TORCHX_RANK0_HOST:=localhost}:29500
    # Use $$ in the prefix to escape the '$' literal (rather than a string Template substitution argument).
    rdzv_endpoint = torchx_dist._noquote(f"$${ExecutorMacros.HEAD_NODE_IP_VAR}:{rdzv_port}")
    num_nodes = torchx_dist._noquote(f"$${ExecutorMacros.NUM_NODES_VAR}")
    nproc_per_node = str(nproc_per_node)
    node_rank = torchx_dist._noquote(f"$${ExecutorMacros.NODE_RANK_VAR}")
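
For illustration, here is a hedged sketch of how the multi-node rendezvous endpoint ends up being resolved on each node at runtime. The env var name TORCHX_RANK0_HOST and the default port 29500 come from the comment in the excerpt above; the helper function itself is purely illustrative and not part of the project.

import os

# Illustrative only: mimics what the bash expansion
# "${TORCHX_RANK0_HOST:=localhost}:29500" evaluates to on each node.
def resolve_rdzv_endpoint(rdzv_port: int = 29500) -> str:
    # Schedulers that export the rank0 host make every worker agree on the
    # same rendezvous endpoint; if the variable is unset or empty, fall
    # back to "localhost", which only works for single-node jobs.
    head_host = os.environ.get("TORCHX_RANK0_HOST") or "localhost"
    return f"{head_host}:{rdzv_port}"

# On a multi-node job where the scheduler exports TORCHX_RANK0_HOST=10.0.0.4:
#   resolve_rdzv_endpoint()  ->  "10.0.0.4:29500"
# The single-node branch in the excerpt skips this entirely and uses "localhost:0".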

Asking because I am using this with AML, where I can usually get multi-node working with torchrun.

hemildesai (Collaborator) commented

Hey @LopezGG, thanks for the issue. Can you share a sample command you run with torchrun? We can add extra options based on that to support multi-node via LocalExecutor.

Currently, LocalExecutor assumes you are on your local workstation, which would just be a single node.


LopezGG commented Dec 23, 2024

Thank you for the quick reply, @hemildesai. Usually, with AML I use something like:

torchrun --nproc_per_node=${{inputs.nproc_per_node}} --nnodes=${{inputs.nnodes}} \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py

https://pytorch.org/docs/stable/elastic/run.html
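
For completeness, a minimal sketch of the same invocation driven from Python on each node, assuming the scheduler has already exported NODE_RANK, MASTER_ADDR, and MASTER_PORT; the nproc_per_node/nnodes values and the train.py entrypoint are placeholders.

import os
import shlex
import subprocess

# Hypothetical wrapper: rebuilds the torchrun command above from the
# per-node env vars set by AML/Slurm (values below are placeholders).
def launch(train_script: str = "train.py", nproc_per_node: int = 8, nnodes: int = 2) -> None:
    cmd = (
        f"torchrun --nproc_per_node={nproc_per_node} --nnodes={nnodes} "
        f"--node_rank={os.environ['NODE_RANK']} "
        f"--master_addr={os.environ['MASTER_ADDR']} "
        f"--master_port={os.environ['MASTER_PORT']} "
        f"{train_script}"
    )
    subprocess.run(shlex.split(cmd), check=True)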

$NODE_RANK, $MASTER_ADDR, and $MASTER_PORT are set automatically by AML or Slurm, or can be set manually. The change in your script might look like this (I need to test it, though):

(screenshot of the proposed change, not reproduced here)
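
Since the screenshot is not reproduced here, a minimal sketch of the kind of change being suggested, assuming LocalExecutor exposes the nnodes() helper shown below; the num_nodes argument and the NNODES env-var fallback are assumptions for illustration, not the project's actual API.

import os

# Purely illustrative: make the node count configurable instead of
# hard-coding nnodes() to return 1. Names here are assumptions.
class LocalExecutor:
    def __init__(self, num_nodes: int = 1):
        # Let a scheduler such as AML override the value via an env var;
        # "NNODES" is a hypothetical variable name used only in this sketch.
        self.num_nodes = int(os.environ.get("NNODES", num_nodes))

    def nnodes(self) -> int:
        """Helper function called by the torchrun component to determine --nnodes."""
        return self.num_nodes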

Changing num_nodes will feed into

def nnodes(self) -> int:
    """
    Helper function called by torchrun component
    to determine --nnodes.
    """
    raise NotImplementedError

and can be called from

(screenshot of the call site, not reproduced here)
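
A hedged sketch of the call site being pointed at, assuming the torchrun component queries the executor for --nnodes when it assembles the launch arguments; the function name and signature are illustrative only.

# Illustrative only: the torchrun component would ask the executor for the
# node count rather than assuming a single node.
def build_torchrun_args(executor, nproc_per_node: int) -> list[str]:
    return [
        f"--nnodes={executor.nnodes()}",   # currently always 1 for LocalExecutor
        f"--nproc_per_node={nproc_per_node}",
    ]

# With the LocalExecutor sketch above:
#   build_torchrun_args(LocalExecutor(num_nodes=2), nproc_per_node=8)
#   -> ["--nnodes=2", "--nproc_per_node=8"]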
