Skip to content

Migrate AllToAllv and AllGather tests to TcpStore-compatible bootstrap#1011

Open
Scusemua wants to merge 2 commits intometa-pytorch:mainfrom
Scusemua:export-D95996769
Open

Migrate AllToAllv and AllGather tests to TcpStore-compatible bootstrap#1011
Scusemua wants to merge 2 commits intometa-pytorch:mainfrom
Scusemua:export-D95996769

Conversation

@Scusemua
Copy link

Summary:
TL;DR: Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from ncclx_test_launcher (mpirun) to re_launcher (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI).

The fix introduces PipesDistTestUtils.h with:

  • hasTcpStoreEnv(): detects TcpStore env vars set by re_launcher/torchrun
  • PipesDistEnvironment: GTest environment that skips MPI_Init when TcpStore env is present
  • PipesDistTestFixture: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap

This follows the same pattern already used by Pipes benchmarks via BenchmarkTestFixture (comms/testinfra/BenchmarkTestFixture.h).

Changes:

  • Created comms/pipes/tests/PipesDistTestUtils.h with shared env detection + fixture
  • Migrated AllToAllvTest.cc and AllGatherTest.cc from MpiBaseTestFixture to PipesDistTestFixture
  • Replaced MPI_Barrier(MPI_COMM_WORLD) with bootstrap->barrierAll() (works with both backends)
  • Replaced direct MpiBootstrap construction with createBootstrap() factory method

Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue)

Differential Revision: D95996769

Summary:
`getTransportsArray()` is declared in `MultiPeerNvlTransport.h` but was never implemented, causing a linker error when called:

```
ld.lld: error: undefined symbol: comms::pipes::MultiPeerNvlTransport::getTransportsArray()
```

The equivalent method `getDeviceTransports()` already exists and does the same thing (lazy init via `initializeTransportsArray()`, returns `DeviceSpan<Transport>`). All callers were manually wrapping the raw pointer in a `DeviceSpan` anyway, so switching to `getDeviceTransports()` is a direct 1:1 replacement.

Changes:
- Replace 5 calls to `getTransportsArray()` with `getDeviceTransports()` across test and benchmark files
- Remove the orphaned `getTransportsArray()` declaration from `MultiPeerNvlTransport.h`

Differential Revision: D96001554
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 10, 2026
@meta-codesync
Copy link
Contributor

meta-codesync bot commented Mar 10, 2026

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95996769.

Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request Mar 10, 2026
meta-pytorch#1011)

Summary:

**TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI).

The fix introduces `PipesDistTestUtils.h` with:
- `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun
- `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present
- `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap

This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h).

Changes:
- Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture
- Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture`
- Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends)
- Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method

Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue)

Differential Revision: D95996769
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request Mar 10, 2026
meta-pytorch#1011)

Summary:
Pull Request resolved: meta-pytorch#1011

**TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI).

The fix introduces `PipesDistTestUtils.h` with:
- `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun
- `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present
- `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap

This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h).

Changes:
- Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture
- Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture`
- Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends)
- Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method

Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue)

Differential Revision: D95996769
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request Mar 10, 2026
meta-pytorch#1011)

Summary:

**TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI).

The fix introduces `PipesDistTestUtils.h` with:
- `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun
- `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present
- `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap

This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h).

Changes:
- Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture
- Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture`
- Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends)
- Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method

Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue)

Differential Revision: D95996769
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request Mar 10, 2026
meta-pytorch#1011)

Summary:

**TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI).

The fix introduces `PipesDistTestUtils.h` with:
- `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun
- `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present
- `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap

This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h).

Changes:
- Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture
- Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture`
- Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends)
- Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method

Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue)

Differential Revision: D95996769
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request Mar 10, 2026
meta-pytorch#1011)

Summary:
Pull Request resolved: meta-pytorch#1011

**TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI).

The fix introduces `PipesDistTestUtils.h` with:
- `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun
- `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present
- `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap

This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h).

Changes:
- Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture
- Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture`
- Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends)
- Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method

Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue)

Differential Revision: D95996769
meta-pytorch#1011)

Summary:
Pull Request resolved: meta-pytorch#1011

**TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI).

The fix introduces `PipesDistTestUtils.h` with:
- `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun
- `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present
- `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap

This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h).

Changes:
- Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture
- Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture`
- Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends)
- Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method

Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue)

Differential Revision: D95996769
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant