
Using multi-node with LocalExecutor #130

Open
LopezGG opened this issue Dec 23, 2024 · 2 comments
Comments
LopezGG commented Dec 23, 2024

I noticed LocalExecutor has a hard-coded value for nnodes.

def nnodes(self) -> int:
    return 1

Is there a reason multi-node is disabled? It feeds into torch_run, which seems to support multi-node:

if max_nnodes == 1:
    # using port 0 makes elastic choose a free random port, which is OK
    # for single-node jobs since all workers run under a single agent.
    # When nnodes is 0 and max_nnodes is 1, it is still a single-node job
    # but stays pending until the resources become available.
    rdzv_endpoint = "localhost:0"
    num_nodes = nnodes_rep
    nproc_per_node = str(nproc_per_node)
    node_rank = "0"
else:
    # for multi-node, rely on the rank0_env environment variable set by
    # the schedulers (see the scheduler implementation for the actual env var this maps to).
    # Some schedulers (e.g. AWS Batch) make rank0's IP address available on all nodes BUT rank0,
    # so default to "localhost" if the env var is not set or is empty.
    # rdzv_endpoint resolves in bash to something to the effect of
    #   ${TORCHX_RANK0_HOST:=localhost}:29500
    # Use $$ in the prefix to escape the '$' literal (rather than a string Template substitution argument).
    rdzv_endpoint = torchx_dist._noquote(f"$${ExecutorMacros.HEAD_NODE_IP_VAR}:{rdzv_port}")
    num_nodes = torchx_dist._noquote(f"$${ExecutorMacros.NUM_NODES_VAR}")
    nproc_per_node = str(nproc_per_node)
    node_rank = torchx_dist._noquote(f"$${ExecutorMacros.NODE_RANK_VAR}")
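
For illustration, here is a hedged sketch of how the multi-node rendezvous endpoint ends up being resolved on each node at runtime. The env var name TORCHX_RANK0_HOST and the default port 29500 come from the comment in the excerpt above; the helper function itself is purely illustrative and not part of the project.

import os

# Illustrative only: mimics what the bash expansion
# "${TORCHX_RANK0_HOST:=localhost}:29500" evaluates to on each node.
def resolve_rdzv_endpoint(rdzv_port: int = 29500) -> str:
    # Schedulers that export the rank0 host make every worker agree on the
    # same rendezvous endpoint; if the variable is unset or empty, fall
    # back to "localhost", which only works for single-node jobs.
    head_host = os.environ.get("TORCHX_RANK0_HOST") or "localhost"
    return f"{head_host}:{rdzv_port}"

# On a multi-node job where the scheduler exports TORCHX_RANK0_HOST=10.0.0.4:
#   resolve_rdzv_endpoint()  ->  "10.0.0.4:29500"
# The single-node branch in the excerpt skips this entirely and uses "localhost:0".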

Asking because I am using this with AML, where I can usually get multi-node working with torchrun.

hemildesai (Collaborator) commented

Hey @LopezGG, thanks for the issue. Can you share a sample command you run with torchrun? We can add extra options based on that to support multi-node via LocalExecutor.

Currently, LocalExecutor assumes you are on your local workstation, which would just be a single node.


LopezGG commented Dec 23, 2024

Thank you for the quick reply, @hemildesai. Usually, with AML I use something like:

torchrun --nproc_per_node=${{inputs.nproc_per_node}} --nnodes=${{inputs.nnodes}} \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py

https://pytorch.org/docs/stable/elastic/run.html
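
For completeness, a minimal sketch of the same invocation driven from Python on each node, assuming the scheduler has already exported NODE_RANK, MASTER_ADDR, and MASTER_PORT; the nproc_per_node/nnodes values and the train.py entrypoint are placeholders.

import os
import shlex
import subprocess

# Hypothetical wrapper: rebuilds the torchrun command above from the
# per-node env vars set by AML/Slurm (values below are placeholders).
def launch(train_script: str = "train.py", nproc_per_node: int = 8, nnodes: int = 2) -> None:
    cmd = (
        f"torchrun --nproc_per_node={nproc_per_node} --nnodes={nnodes} "
        f"--node_rank={os.environ['NODE_RANK']} "
        f"--master_addr={os.environ['MASTER_ADDR']} "
        f"--master_port={os.environ['MASTER_PORT']} "
        f"{train_script}"
    )
    subprocess.run(shlex.split(cmd), check=True)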

$NODE_RANK, $MASTER_ADDR, and $MASTER_PORT are set automatically by AML or Slurm, or can be set manually. The change in your script might look like this (I need to test it, though):

(screenshot of the proposed change, not reproduced here)
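
Since the screenshot is not reproduced here, a minimal sketch of the kind of change being suggested, assuming LocalExecutor exposes the nnodes() helper shown below; the num_nodes argument and the NNODES env-var fallback are assumptions for illustration, not the project's actual API.

import os

# Purely illustrative: make the node count configurable instead of
# hard-coding nnodes() to return 1. Names here are assumptions.
class LocalExecutor:
    def __init__(self, num_nodes: int = 1):
        # Let a scheduler such as AML override the value via an env var;
        # "NNODES" is a hypothetical variable name used only in this sketch.
        self.num_nodes = int(os.environ.get("NNODES", num_nodes))

    def nnodes(self) -> int:
        """Helper function called by the torchrun component to determine --nnodes."""
        return self.num_nodes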

Changing num_nodes will feed into

def nnodes(self) -> int:
    """
    Helper function called by torchrun component
    to determine --nnodes.
    """
    raise NotImplementedError

and can be called from

(screenshot of the call site, not reproduced here)
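
A hedged sketch of the call site being pointed at, assuming the torchrun component queries the executor for --nnodes when it assembles the launch arguments; the function name and signature are illustrative only.

# Illustrative only: the torchrun component would ask the executor for the
# node count rather than assuming a single node.
def build_torchrun_args(executor, nproc_per_node: int) -> list[str]:
    return [
        f"--nnodes={executor.nnodes()}",   # currently always 1 for LocalExecutor
        f"--nproc_per_node={nproc_per_node}",
    ]

# With the LocalExecutor sketch above:
#   build_torchrun_args(LocalExecutor(num_nodes=2), nproc_per_node=8)
#   -> ["--nnodes=2", "--nproc_per_node=8"]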
