
Conversation

@kellyguo11
Contributor

Description

Removes the rendezvous backend from the multi-node training commands, since it doesn't appear to be necessary and it prevents multi-node setup on the DGX Spark.

Type of change

  • Documentation update

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label on Nov 8, 2025

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

Simplified multi-node training setup by replacing rendezvous backend parameters (--rdzv_id, --rdzv_backend, --rdzv_endpoint) with direct master address parameters (--master_addr, --master_port).

Key changes:

  • Updated all PyTorch multi-node training commands in multi_gpu.rst to use --master_addr=<ip_of_master> and --master_port=5555 instead of rendezvous parameters (a command sketch follows this list)
  • Commands affected: rl_games, rsl_rl, and skrl (PyTorch backend) training scripts
  • Removed DGX Spark limitation note about multi-node training requiring additional network configurations from installation/index.rst
  • JAX-based training commands remain unchanged (already use --coordinator_address)
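
For illustration, a minimal sketch of the new-style launch on the master node, assuming a two-node run with one GPU per node; the rl_games script path and the task name are assumptions for the example, not quoted from the diff:

    # Master node (node_rank=0); <ip_of_master> and <task_name> are placeholders.
    # Script path assumed from the rl_games example; adjust to your checkout.
    python -m torch.distributed.run --nnodes=2 --nproc_per_node=1 --node_rank=0 \
        --master_addr=<ip_of_master> --master_port=5555 \
        scripts/reinforcement_learning/rl_games/train.py --task=<task_name> --headless --distributed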

Issues found:

  • Minor spacing issue on line 163 in multi_gpu.rst (extra space before scripts/)

This simplification aligns with PyTorch's recommended approach and makes multi-node setup more straightforward, particularly for DGX Spark environments.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it's a documentation-only change that simplifies multi-node training setup.
  • Score of 4 reflects that the changes are documentation-only and use a valid PyTorch distributed training approach. The simplified parameter set (master_addr/master_port) is standard and well-supported. One minor spacing issue was found that should be fixed before merge.
  • Pay attention to docs/source/features/multi_gpu.rst line 163 which has an extra space that should be removed.

Important Files Changed

File Analysis

Filename | Score | Overview
docs/source/features/multi_gpu.rst | 4/5 | Replaced rendezvous parameters with simpler master_addr/master_port for multi-node training commands. One minor spacing issue found on line 163.
docs/source/setup/installation/index.rst | 5/5 | Removed limitation note about multi-node training requiring additional network configurations on DGX Spark, aligning with the simplified approach.

Sequence Diagram

sequenceDiagram
    participant User as User/Developer
    participant Docs as Documentation
    participant MasterNode as Master Node
    participant WorkerNode as Worker Node(s)
    participant PyTorch as PyTorch Distributed
    
    User->>Docs: Read multi-node training setup
    Note over Docs: OLD: rdzv_backend + rdzv_id + rdzv_endpoint
    Note over Docs: NEW: master_addr + master_port
    
    User->>MasterNode: Launch with --node_rank=0 --master_addr=<ip> --master_port=5555
    MasterNode->>PyTorch: Initialize distributed training (rank 0)
    
    User->>WorkerNode: Launch with --node_rank=1 --master_addr=<ip> --master_port=5555
    WorkerNode->>PyTorch: Connect to master using master_addr:master_port
    
    PyTorch->>MasterNode: Coordinate training processes
    PyTorch->>WorkerNode: Coordinate training processes
    
    Note over MasterNode,WorkerNode: Synchronize gradients across nodes
    Note over MasterNode,WorkerNode: No rendezvous backend needed (simpler setup)
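
As a concrete counterpart to the diagram, a sketch of the worker-node launch; the script path, task name, and single-GPU-per-node setup are illustrative assumptions:

    # Worker node (node_rank=1); must use the same --master_addr/--master_port as rank 0.
    # Script path and task name are assumptions; adjust to your setup.
    python -m torch.distributed.run --nnodes=2 --nproc_per_node=1 --node_rank=1 \
        --master_addr=<ip_of_master> --master_port=5555 \
        scripts/reinforcement_learning/rl_games/train.py --task=<task_name> --headless --distributed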

2 files reviewed, 1 comment


Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kelly Guo <[email protected]>

greptile-apps bot commented Nov 11, 2025

Greptile Overview

Greptile Summary

This PR simplifies multi-node training documentation by replacing PyTorch's rendezvous backend parameters (--rdzv_id, --rdzv_backend, --rdzv_endpoint) with the simpler --master_addr and --master_port parameters. This change improves compatibility with DGX Spark systems.

Key changes:

  • Updated all multi-node training commands across rl_games, rsl_rl, and skrl (PyTorch) frameworks
  • Removed the DGX Spark limitation note about multi-node training requiring direct connections
  • Commands now use --master_addr=<ip_of_master> --master_port=5555 instead of the rendezvous backend approach

Impact:

  • Simplifies multi-node setup process
  • Enables multi-node training on DGX Spark without additional network configuration
  • Maintains backward compatibility, as both approaches are valid PyTorch distributed training methods (the older rendezvous form is sketched after this list for comparison)
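
For comparison, a rough sketch of the older rendezvous-based form that the PR removes from the docs; the rendezvous id, script path, and task name are illustrative assumptions:

    # Previous style: explicit c10d rendezvous pointing at the master endpoint.
    # Values below are placeholders, not quoted from the diff.
    python -m torch.distributed.run --nnodes=2 --nproc_per_node=1 --node_rank=0 \
        --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=<ip_of_master>:5555 \
        scripts/reinforcement_learning/rl_games/train.py --task=<task_name> --headless --distributed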

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it only updates documentation
  • The changes are straightforward documentation updates that replace one valid PyTorch distributed training approach with another simpler one. Both --master_addr/--master_port and rendezvous backend parameters are valid methods supported by PyTorch. The change improves compatibility with DGX Spark and simplifies the user experience.
  • No files require special attention

Important Files Changed

File Analysis

Filename | Score | Overview
docs/source/features/multi_gpu.rst | 5/5 | Replaced rendezvous backend parameters with simpler --master_addr and --master_port parameters for multi-node training commands
docs/source/setup/installation/index.rst | 5/5 | Removed limitation note about multi-node training requiring direct connections on DGX Spark

Sequence Diagram

sequenceDiagram
    participant User as User
    participant Doc as Documentation
    participant Master as Master Node
    participant Worker as Worker Node(s)
    participant PyTorch as PyTorch Distributed
    
    User->>Doc: Read multi-node training docs
    Doc-->>User: Updated commands with --master_addr/--master_port
    
    User->>Master: Launch master (node_rank=0)
    Master->>PyTorch: torch.distributed.run with master_addr & master_port
    PyTorch->>Master: Initialize master process group
    
    User->>Worker: Launch worker (node_rank=1+)
    Worker->>PyTorch: torch.distributed.run with same master_addr & master_port
    PyTorch->>Master: Connect to master using TCP/IP
    Master-->>Worker: Acknowledge connection
    
    PyTorch->>PyTorch: Synchronize all nodes
    PyTorch-->>User: Multi-node training starts
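
As a rough sketch of why no separate rendezvous backend is needed here: torch.distributed.run hands the master address and port to each spawned process as environment variables, which PyTorch's default env:// initialization reads. The manual exports below illustrate that mechanism and are not a documented workflow:

    # What the launcher effectively provides to each training process (illustrative):
    export MASTER_ADDR=<ip_of_master>   # address where the rank-0 process listens
    export MASTER_PORT=5555             # TCP port for the process-group store
    export RANK=0                       # global rank of this process
    export WORLD_SIZE=2                 # total number of processes across all nodes
    # torch.distributed.init_process_group("nccl") can then rendezvous via these variables.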


@greptile-apps greptile-apps bot left a comment


2 files reviewed, no comments


@kellyguo11 kellyguo11 merged commit ec38e60 into isaac-sim:main Nov 12, 2025
10 checks passed