Hey @LopezGG, thanks for the issue. Can you share a sample command you run with torchrun? We can add extra options based on that to support multi-node via LocalExecutor.
Currently, LocalExecutor assumes you are on your local workstation, which would be just a single node.
$NODE_RANK, $MASTER_ADDR, and $MASTER_PORT are set automatically by AML or Slurm, or can be set manually. The change in your script might look like this (I still need to test it, though):
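As a rough illustration only (this is not the snippet from the comment, and it is not NeMo-Run code), here is a minimal Python sketch of how those environment variables could be mapped onto torchrun's standard multi-node flags. `build_torchrun_cmd` is a hypothetical helper, `train.py` is a placeholder script name, and the single-node fallback values are assumptions.

```python
import os
import shlex


def build_torchrun_cmd(script, nnodes, nproc_per_node, env=None):
    """Hypothetical helper (not part of NeMo-Run): build a torchrun argv list.

    Reads NODE_RANK, MASTER_ADDR, and MASTER_PORT from `env`
    (defaults to os.environ) and maps them onto torchrun flags.
    """
    env = os.environ if env is None else env
    # Assumed fallbacks: describe a single-node run on the local machine.
    node_rank = env.get("NODE_RANK", "0")
    master_addr = env.get("MASTER_ADDR", "localhost")
    master_port = env.get("MASTER_PORT", "29500")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Print the command that would be run on this node for a 2-node, 8-GPU-per-node job.
    print(shlex.join(build_torchrun_cmd("train.py", nnodes=2, nproc_per_node=8)))
```

Each node would run the same command with its own `NODE_RANK`, while `MASTER_ADDR` and `MASTER_PORT` point at the rank-0 node. Whether LocalExecutor should expose these values as options is the open question in this thread.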
I noticed LocalExecutor has a hard-coded value for nnodes.
NeMo-Run/src/nemo_run/core/execution/local.py, lines 53 to 54 in b4e2258
Is there a reason multi-node runs are disabled? This value feeds into torch_run, which seems to support multiple nodes:
NeMo-Run/src/nemo_run/run/torchx_backend/components/torchrun.py, lines 104 to 124 in b4e2258
I'm asking because I am using this with AML, where I can usually get multi-node working with torchrun.