Migrate AllToAllv and AllGather tests to TcpStore-compatible bootstrap#1011
Open
Scusemua wants to merge 2 commits intometa-pytorch:mainfrom
Open
Migrate AllToAllv and AllGather tests to TcpStore-compatible bootstrap#1011Scusemua wants to merge 2 commits intometa-pytorch:mainfrom
Scusemua wants to merge 2 commits intometa-pytorch:mainfrom
Conversation
Summary: `getTransportsArray()` is declared in `MultiPeerNvlTransport.h` but was never implemented, causing a linker error when called: ``` ld.lld: error: undefined symbol: comms::pipes::MultiPeerNvlTransport::getTransportsArray() ``` The equivalent method `getDeviceTransports()` already exists and does the same thing (lazy init via `initializeTransportsArray()`, returns `DeviceSpan<Transport>`). All callers were manually wrapping the raw pointer in a `DeviceSpan` anyway, so switching to `getDeviceTransports()` is a direct 1:1 replacement. Changes: - Replace 5 calls to `getTransportsArray()` with `getDeviceTransports()` across test and benchmark files - Remove the orphaned `getTransportsArray()` declaration from `MultiPeerNvlTransport.h` Differential Revision: D96001554
Contributor
Scusemua
pushed a commit
to Scusemua/torchcomms
that referenced
this pull request
Mar 10, 2026
meta-pytorch#1011) Summary: **TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI). The fix introduces `PipesDistTestUtils.h` with: - `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun - `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present - `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h). Changes: - Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture - Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture` - Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends) - Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue) Differential Revision: D95996769
249c458 to
8e5cdd9
Compare
Scusemua
pushed a commit
to Scusemua/torchcomms
that referenced
this pull request
Mar 10, 2026
meta-pytorch#1011) Summary: Pull Request resolved: meta-pytorch#1011 **TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI). The fix introduces `PipesDistTestUtils.h` with: - `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun - `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present - `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h). Changes: - Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture - Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture` - Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends) - Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue) Differential Revision: D95996769
8e5cdd9 to
6062dac
Compare
Scusemua
pushed a commit
to Scusemua/torchcomms
that referenced
this pull request
Mar 10, 2026
meta-pytorch#1011) Summary: **TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI). The fix introduces `PipesDistTestUtils.h` with: - `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun - `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present - `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h). Changes: - Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture - Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture` - Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends) - Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue) Differential Revision: D95996769
6062dac to
a683ac0
Compare
Scusemua
pushed a commit
to Scusemua/torchcomms
that referenced
this pull request
Mar 10, 2026
meta-pytorch#1011) Summary: **TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI). The fix introduces `PipesDistTestUtils.h` with: - `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun - `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present - `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h). Changes: - Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture - Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture` - Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends) - Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue) Differential Revision: D95996769
a683ac0 to
0f031eb
Compare
Scusemua
pushed a commit
to Scusemua/torchcomms
that referenced
this pull request
Mar 10, 2026
meta-pytorch#1011) Summary: Pull Request resolved: meta-pytorch#1011 **TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI). The fix introduces `PipesDistTestUtils.h` with: - `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun - `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present - `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h). Changes: - Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture - Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture` - Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends) - Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue) Differential Revision: D95996769
0f031eb to
1da0d16
Compare
meta-pytorch#1011) Summary: Pull Request resolved: meta-pytorch#1011 **TL;DR:** Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from `ncclx_test_launcher` (mpirun) to `re_launcher` (no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI). The fix introduces `PipesDistTestUtils.h` with: - `hasTcpStoreEnv()`: detects TcpStore env vars set by re_launcher/torchrun - `PipesDistEnvironment`: GTest environment that skips MPI_Init when TcpStore env is present - `PipesDistTestFixture`: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrap This follows the same pattern already used by Pipes benchmarks via `BenchmarkTestFixture` (comms/testinfra/BenchmarkTestFixture.h). Changes: - Created `comms/pipes/tests/PipesDistTestUtils.h` with shared env detection + fixture - Migrated `AllToAllvTest.cc` and `AllGatherTest.cc` from `MpiBaseTestFixture` to `PipesDistTestFixture` - Replaced `MPI_Barrier(MPI_COMM_WORLD)` with `bootstrap->barrierAll()` (works with both backends) - Replaced direct `MpiBootstrap` construction with `createBootstrap()` factory method Related: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue) Differential Revision: D95996769
1da0d16 to
97e8c02
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
TL;DR: Pipes AllToAllv and AllGather tests fail on RE after D95485096 moved singlehost tests from
ncclx_test_launcher(mpirun) tore_launcher(no mpirun). Without mpirun, MPI_Init enters singleton mode (rank=0, size=1 for every process), causing 0-byte buffer allocations that crash at cudaIpcGetMemHandle. This diff adds dual-mode bootstrap detection so tests work on both RE (TcpStore) and devgpu (MPI).The fix introduces
PipesDistTestUtils.hwith:hasTcpStoreEnv(): detects TcpStore env vars set by re_launcher/torchrunPipesDistEnvironment: GTest environment that skips MPI_Init when TcpStore env is presentPipesDistTestFixture: test fixture that auto-selects TcpStoreBootstrap or MpiBootstrapThis follows the same pattern already used by Pipes benchmarks via
BenchmarkTestFixture(comms/testinfra/BenchmarkTestFixture.h).Changes:
comms/pipes/tests/PipesDistTestUtils.hwith shared env detection + fixtureAllToAllvTest.ccandAllGatherTest.ccfromMpiBaseTestFixturetoPipesDistTestFixtureMPI_Barrier(MPI_COMM_WORLD)withbootstrap->barrierAll()(works with both backends)MpiBootstrapconstruction withcreateBootstrap()factory methodRelated: D95485096 (re_launcher migration), D95974997 (NCCL/ctran fix for same issue)
Differential Revision: D95996769