
Conversation

@kellyguo11
Contributor

Description

Removes the rendezvous backend from the multi-node training commands, since it doesn't appear to be necessary and it prevents multi-node setup on the DGX Spark.

Type of change

  • Documentation update

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label on Nov 8, 2025

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

Simplified multi-node training setup by replacing rendezvous backend parameters (--rdzv_id, --rdzv_backend, --rdzv_endpoint) with direct master address parameters (--master_addr, --master_port).

Key changes:

  • Updated all PyTorch multi-node training commands in multi_gpu.rst to use --master_addr=<ip_of_master> and --master_port=5555 instead of rendezvous parameters (a command sketch follows this list)
  • Commands affected: rl_games, rsl_rl, and skrl (PyTorch backend) training scripts
  • Removed DGX Spark limitation note about multi-node training requiring additional network configurations from installation/index.rst
  • JAX-based training commands remain unchanged (already use --coordinator_address)
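
For illustration, a minimal sketch of the new-style launch on the master node, assuming a two-node run with one GPU per node; the rl_games script path and the task name are assumptions for the example, not quoted from the diff:

    # Master node (node_rank=0); <ip_of_master> and <task_name> are placeholders.
    # Script path assumed from the rl_games example; adjust to your checkout.
    python -m torch.distributed.run --nnodes=2 --nproc_per_node=1 --node_rank=0 \
        --master_addr=<ip_of_master> --master_port=5555 \
        scripts/reinforcement_learning/rl_games/train.py --task=<task_name> --headless --distributed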

Issues found:

  • Minor spacing issue on line 163 in multi_gpu.rst (extra space before scripts/)

This simplification aligns with PyTorch's recommended approach and makes multi-node setup more straightforward, particularly for DGX Spark environments.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it's a documentation-only change that simplifies multi-node training setup.
  • Score of 4 reflects that the changes are documentation-only and use a valid PyTorch distributed training approach. The simplified parameter set (master_addr/master_port) is standard and well-supported. One minor spacing issue was found that should be fixed before merge.
  • Pay attention to docs/source/features/multi_gpu.rst line 163 which has an extra space that should be removed.

Important Files Changed

File Analysis

Filename | Score | Overview
docs/source/features/multi_gpu.rst | 4/5 | Replaced rendezvous parameters with simpler master_addr/master_port for multi-node training commands. One minor spacing issue found on line 163.
docs/source/setup/installation/index.rst | 5/5 | Removed limitation note about multi-node training requiring additional network configurations on DGX Spark, aligning with the simplified approach.

Sequence Diagram

sequenceDiagram
    participant User as User/Developer
    participant Docs as Documentation
    participant MasterNode as Master Node
    participant WorkerNode as Worker Node(s)
    participant PyTorch as PyTorch Distributed
    
    User->>Docs: Read multi-node training setup
    Note over Docs: OLD: rdzv_backend + rdzv_id + rdzv_endpoint
    Note over Docs: NEW: master_addr + master_port
    
    User->>MasterNode: Launch with --node_rank=0 --master_addr=<ip> --master_port=5555
    MasterNode->>PyTorch: Initialize distributed training (rank 0)
    
    User->>WorkerNode: Launch with --node_rank=1 --master_addr=<ip> --master_port=5555
    WorkerNode->>PyTorch: Connect to master using master_addr:master_port
    
    PyTorch->>MasterNode: Coordinate training processes
    PyTorch->>WorkerNode: Coordinate training processes
    
    Note over MasterNode,WorkerNode: Synchronize gradients across nodes
    Note over MasterNode,WorkerNode: No rendezvous backend needed (simpler setup)
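
As a concrete counterpart to the diagram, a sketch of the worker-node launch; the script path, task name, and single-GPU-per-node setup are illustrative assumptions:

    # Worker node (node_rank=1); must use the same --master_addr/--master_port as rank 0.
    # Script path and task name are assumptions; adjust to your setup.
    python -m torch.distributed.run --nnodes=2 --nproc_per_node=1 --node_rank=1 \
        --master_addr=<ip_of_master> --master_port=5555 \
        scripts/reinforcement_learning/rl_games/train.py --task=<task_name> --headless --distributed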

2 files reviewed, 1 comment


Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kelly Guo <[email protected]>

greptile-apps bot commented Nov 11, 2025

Greptile Overview

Greptile Summary

This PR simplifies multi-node training documentation by replacing PyTorch's rendezvous backend parameters (--rdzv_id, --rdzv_backend, --rdzv_endpoint) with the simpler --master_addr and --master_port parameters. This change improves compatibility with DGX Spark systems.

Key changes:

  • Updated all multi-node training commands across rl_games, rsl_rl, and skrl (PyTorch) frameworks
  • Removed the DGX Spark limitation note about multi-node training requiring direct connections
  • Commands now use --master_addr=<ip_of_master> --master_port=5555 instead of the rendezvous backend approach

Impact:

  • Simplifies multi-node setup process
  • Enables multi-node training on DGX Spark without additional network configuration
  • Maintains backward compatibility, as both approaches are valid PyTorch distributed training methods (the older rendezvous form is sketched after this list for comparison)
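
For comparison, a rough sketch of the older rendezvous-based form that the PR removes from the docs; the rendezvous id, script path, and task name are illustrative assumptions:

    # Previous style: explicit c10d rendezvous pointing at the master endpoint.
    # Values below are placeholders, not quoted from the diff.
    python -m torch.distributed.run --nnodes=2 --nproc_per_node=1 --node_rank=0 \
        --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=<ip_of_master>:5555 \
        scripts/reinforcement_learning/rl_games/train.py --task=<task_name> --headless --distributed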

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it only updates documentation
  • The changes are straightforward documentation updates that replace one valid PyTorch distributed training approach with another simpler one. Both --master_addr/--master_port and rendezvous backend parameters are valid methods supported by PyTorch. The change improves compatibility with DGX Spark and simplifies the user experience.
  • No files require special attention

Important Files Changed

File Analysis

Filename | Score | Overview
docs/source/features/multi_gpu.rst | 5/5 | Replaced rendezvous backend parameters with simpler --master_addr and --master_port parameters for multi-node training commands
docs/source/setup/installation/index.rst | 5/5 | Removed limitation note about multi-node training requiring direct connections on DGX Spark

Sequence Diagram

sequenceDiagram
    participant User as User
    participant Doc as Documentation
    participant Master as Master Node
    participant Worker as Worker Node(s)
    participant PyTorch as PyTorch Distributed
    
    User->>Doc: Read multi-node training docs
    Doc-->>User: Updated commands with --master_addr/--master_port
    
    User->>Master: Launch master (node_rank=0)
    Master->>PyTorch: torch.distributed.run with master_addr & master_port
    PyTorch->>Master: Initialize master process group
    
    User->>Worker: Launch worker (node_rank=1+)
    Worker->>PyTorch: torch.distributed.run with same master_addr & master_port
    PyTorch->>Master: Connect to master using TCP/IP
    Master-->>Worker: Acknowledge connection
    
    PyTorch->>PyTorch: Synchronize all nodes
    PyTorch-->>User: Multi-node training starts
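
As a rough sketch of why no separate rendezvous backend is needed here: torch.distributed.run hands the master address and port to each spawned process as environment variables, which PyTorch's default env:// initialization reads. The manual exports below illustrate that mechanism and are not a documented workflow:

    # What the launcher effectively provides to each training process (illustrative):
    export MASTER_ADDR=<ip_of_master>   # address where the rank-0 process listens
    export MASTER_PORT=5555             # TCP port for the process-group store
    export RANK=0                       # global rank of this process
    export WORLD_SIZE=2                 # total number of processes across all nodes
    # torch.distributed.init_process_group("nccl") can then rendezvous via these variables.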


@greptile-apps greptile-apps bot left a comment


2 files reviewed, no comments


@kellyguo11 kellyguo11 merged commit ec38e60 into isaac-sim:main Nov 12, 2025
10 checks passed