
Conversation

@wujingyue (Collaborator) commented Jan 3, 2026

We've been adding C++ multi-GPU benchmarks organically, e.g., `void benchmarkP2PCommunication()` and others.

This PR adds a sample multi-GPU benchmark based on https://github.com/google/benchmark, which nvFuser already uses for single-GPU benchmarks.

Below is the repro:

```
$ mpirun -np 2 -output-filename /tmp/test_multidevice bin/test_multidevice --benchmarks=all

$ cat /tmp/test_multidevice/1/rank.0/stdout
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
MultiDeviceBenchmark/Reduction/4/iterations:10   20128420 ns     16788148 ns           10
MultiDeviceBenchmark/Reduction/8/iterations:10     100694 ns       100708 ns           10
```

@wujingyue wujingyue marked this pull request as draft January 3, 2026 18:37
@wujingyue (Collaborator, Author) commented:

!test

github-actions bot commented Jan 3, 2026

Review updated until commit 5da9fb0

Description

  • Refactor multi-device test infrastructure by separating concerns into MultiDeviceFixture base class

  • Add MultiDeviceBenchmark class for Google Benchmark integration with multi-device testing

  • Implement sample MultiDeviceBenchmark::Reduction benchmark for tensor reduction across devices

  • Update build system to include benchmark library dependencies

Changes walkthrough

Relevant files

Enhancement

tests/cpp/multidevice.h: Refactor test infrastructure with fixture separation (+25/-7)

  • Rename MultiDeviceTest to MultiDeviceFixture base class
  • Create new MultiDeviceTest class inheriting from NVFuserTest and MultiDeviceFixture
  • Add MultiDeviceBenchmark class inheriting from benchmark::Fixture and MultiDeviceFixture
  • Add benchmark and gtest includes

tests/cpp/multidevice.cpp: Implement fixture classes and benchmark integration (+42/-5)

  • Implement MultiDeviceFixture constructor/destructor
  • Add MultiDeviceBenchmark::TearDown with barrier synchronization
  • Add benchmark detection logic in main() function
  • Update main() to run benchmarks when requested

Tests

tests/cpp/test_multidevice_sharding.cpp: Add sample multi-device benchmark test (+36/-0)

  • Add MultiDeviceBenchmark::Reduction benchmark test
  • Register benchmark with Arg(4), Arg(8), and Iterations(10) (see the sketch after this walkthrough)
  • Include benchmark header for Google Benchmark support

Configuration changes

CMakeLists.txt: Add benchmark library dependencies (+2/-0)

  • Add benchmark include directory to test build configuration
  • Link benchmark library to test targets
PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Constructor ordering

The MultiDeviceFixture constructor is defined before the MultiDeviceTest constructor, but MultiDeviceTest inherits from MultiDeviceFixture. This could lead to initialization order issues. The constructors should be properly ordered, or MultiDeviceFixture should have a virtual destructor.
```cpp
MultiDeviceFixture::MultiDeviceFixture() {
  // Enable logging in c10d so debug messages can be printed out via
  // `TORCH_DISTRIBUTED_DEBUG`.
  c10d::setDebugLevelFromEnvironment();

  communicator_ = &Communicator::getInstance();
  tensor_options_ =
      at::TensorOptions().dtype(at::kFloat).device(communicator_->device());
  debug_print = getNvFuserEnv("MULTIDEVICE_DEBUG_PRINT") != nullptr;
}

MultiDeviceTest::MultiDeviceTest() {
  disable_skip = getNvFuserEnv("MULTIDEVICE_DISABLE_SKIP") != nullptr;
}
```
Benchmark iteration consistency

The comment mentions that iterations must be consistent across processes to prevent hanging, but there's no validation that all processes actually receive the same iteration count. Consider adding runtime validation.
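One way to do this, as a minimal sketch assuming plain MPI is reachable from the benchmark binary (the real code would presumably route through nvFuser's Communicator; validateIterationCount is a hypothetical helper):

```cpp
// Hypothetical helper: verify every rank agreed on the iteration count
// before entering the benchmark loop, since mismatched counts would leave
// some ranks blocked in a collective. Uses plain MPI for illustration.
#include <mpi.h>

#include <cstdint>
#include <stdexcept>

void validateIterationCount(int64_t local_iterations) {
  int64_t min_iters = 0;
  int64_t max_iters = 0;
  MPI_Allreduce(
      &local_iterations, &min_iters, 1, MPI_INT64_T, MPI_MIN, MPI_COMM_WORLD);
  MPI_Allreduce(
      &local_iterations, &max_iters, 1, MPI_INT64_T, MPI_MAX, MPI_COMM_WORLD);
  // If all ranks hold the same value, min == max.
  if (min_iters != max_iters) {
    throw std::runtime_error(
        "Benchmark iteration counts differ across ranks; "
        "collectives would hang.");
  }
}
```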

greptile-apps bot commented Jan 3, 2026

Greptile Summary

Refactored test infrastructure by extracting MultiDeviceFixture as a separate base class from MultiDeviceTest. This allows MultiDeviceTest to use multiple inheritance (from both NVFuserTest and MultiDeviceFixture), enabling better code reuse and separation of concerns. The resulting hierarchy is sketched after the list below.

Key changes:

• Created MultiDeviceFixture containing common multi-device testing utilities (communicator_, tensor_options_, debug_print, shardTensor methods)
• MultiDeviceTest now inherits from both NVFuserTest and MultiDeviceFixture
• Moved initialization logic to appropriate constructors based on class responsibilities
• Removed unused <mutex> include
• Removed comment about setting random seed (no related code was present)
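A rough sketch of the hierarchy those bullets describe, assuming gtest and Google Benchmark bases; NVFuserTest and Communicator are stubbed here for self-containment, and member lists are abbreviated:

```cpp
#include <benchmark/benchmark.h>
#include <gtest/gtest.h>

class NVFuserTest : public ::testing::Test {}; // stub for illustration
class Communicator; // real class lives elsewhere in nvFuser

// Common multi-device utilities shared by tests and benchmarks.
class MultiDeviceFixture {
 protected:
  Communicator* communicator_ = nullptr;
  bool debug_print = false;
};

// Multiple inheritance: gtest base + shared multi-device utilities.
class MultiDeviceTest : public NVFuserTest, public MultiDeviceFixture {
 protected:
  bool disable_skip = false;
};

// The same utilities reused under Google Benchmark's fixture API.
class MultiDeviceBenchmark : public benchmark::Fixture,
                             public MultiDeviceFixture {};
```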

Confidence Score: 5/5

• This PR is safe to merge with minimal risk
• Clean refactoring with proper separation of concerns; maintains backward compatibility for all existing tests that inherit from MultiDeviceTest and follows C++ multiple-inheritance best practices
• No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/cpp/multidevice.h | Extracted MultiDeviceFixture as a separate base class containing common testing utilities, allowing MultiDeviceTest to use multiple inheritance cleanly. |
| tests/cpp/multidevice.cpp | Moved initialization logic from MultiDeviceTest to MultiDeviceFixture, removed unused <mutex> include, cleaned up constructor responsibilities. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant TestRunner as Test Runner
    participant MDT as MultiDeviceTest
    participant NVFT as NVFuserTest
    participant MDF as MultiDeviceFixture
    participant Comm as Communicator

    TestRunner->>MDT: Construct MultiDeviceTest
    MDT->>NVFT: Call NVFuserTest()
    NVFT-->>MDT: Base initialized
    MDT->>MDF: Call MultiDeviceFixture()
    MDF->>Comm: getInstance()
    Comm-->>MDF: communicator instance
    MDF->>MDF: Setup tensor_options_
    MDF->>MDF: Setup debug_print
    MDF-->>MDT: Fixture initialized
    MDT->>MDT: Setup disable_skip
    MDT-->>TestRunner: Test object ready

    TestRunner->>MDT: SetUp()
    MDT->>NVFT: NVFuserTest::SetUp()
    NVFT-->>MDT: Base setup complete
    MDT->>MDT: Check communicator availability
    MDT-->>TestRunner: Setup complete

    TestRunner->>MDT: Run test
    Note over MDT,MDF: Test uses communicator_,<br/>tensor_options_ from fixture
    MDT-->>TestRunner: Test complete

    TestRunner->>MDT: Destroy MultiDeviceTest
    MDT->>MDF: Call ~MultiDeviceFixture()
    MDF->>Comm: barrier() if available
    MDF-->>MDT: Cleanup complete
    MDT-->>TestRunner: Destroyed
```

@wujingyue changed the title from "Create MultiDeviceFixture" to "Add a sample multi-GPU benchmark" on Jan 3, 2026
```cpp
testing::InitGoogleTest(&argc, argv);
testing::AddGlobalTestEnvironment(new nvfuser::MultiDeviceTestEnvironment());

if (wantsBenchmarks(argc, argv)) {
```
@wujingyue (Collaborator, Author) commented:

Benchmarks tend to run longer and don't need to run as frequently as tests, so it's worth separating benchmarks from (correctness) tests.

The question, though, is how.

1. In this version, the benchmarks are defined in the same set of files as the tests, and I'm reusing the same main function, which detects flags like --benchmarks (see the sketch below).
2. Alternatively, I could write two main functions (one for tests and the other for benchmarks) and link them into different binaries (test_multidevice vs benchmark_multidevice).
3. Furthermore, I could even split the test files from the benchmark files. It's harder to reuse code this way; for example, a FusionDefinition would have to be DRY'ed in order to be both tested and benchmarked.
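A minimal sketch of option 1, under the assumption that wantsBenchmarks simply scans argv for the --benchmarks flag (both names come from the diff hunk above; the body is illustrative, and the nvFuser-specific global test environment is omitted):

```cpp
// Sketch of a shared main(): run Google Benchmark when --benchmarks is
// passed, otherwise run gtest as usual. wantsBenchmarks is assumed from
// the diff context; its real implementation may differ.
#include <benchmark/benchmark.h>
#include <gtest/gtest.h>

#include <cstring>

namespace {
bool wantsBenchmarks(int argc, char** argv) {
  for (int i = 1; i < argc; i++) {
    // Accept both `--benchmarks` and `--benchmarks=<filter>`.
    if (std::strncmp(argv[i], "--benchmarks", 12) == 0) {
      return true;
    }
  }
  return false;
}
} // namespace

int main(int argc, char** argv) {
  testing::InitGoogleTest(&argc, argv);
  if (wantsBenchmarks(argc, argv)) {
    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    return 0;
  }
  return RUN_ALL_TESTS();
}
```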

@wujingyue requested a review from Priya2698 on January 3, 2026 21:26
