
Conversation

@wujingyue (Collaborator) commented Jan 3, 2026

We've been adding C++ multi-GPU benchmarks organically, e.g., `void benchmarkP2PCommunication()` and others.

This PR adds a sample multi-GPU benchmark based on https://github.com/google/benchmark, which nvFuser already uses for single-GPU benchmarks.

Below is the repro:

```
$ mpirun -np 2 -output-filename /tmp/test_multidevice bin/test_multidevice --benchmarks=all

$ cat /tmp/test_multidevice/1/rank.0/stdout
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
MultiDeviceBenchmark/Reduction/4/iterations:10   20128420 ns     16788148 ns           10
MultiDeviceBenchmark/Reduction/8/iterations:10     100694 ns       100708 ns           10
```

@wujingyue wujingyue marked this pull request as draft January 3, 2026 18:37
@wujingyue (Collaborator, Author) commented:

!test

github-actions bot commented Jan 3, 2026

Review updated until commit 5da9fb0

Description

  • Refactor multi-device test infrastructure by separating concerns into MultiDeviceFixture base class

  • Add MultiDeviceBenchmark class for Google Benchmark integration with multi-device testing

  • Implement sample MultiDeviceBenchmark::Reduction benchmark for tensor reduction across devices

  • Update build system to include benchmark library dependencies

Changes walkthrough

Relevant files

Enhancement

tests/cpp/multidevice.h: Refactor test infrastructure with fixture separation (+25/-7)

  • Rename MultiDeviceTest to MultiDeviceFixture base class
  • Create new MultiDeviceTest class inheriting from NVFuserTest and MultiDeviceFixture
  • Add MultiDeviceBenchmark class inheriting from benchmark::Fixture and MultiDeviceFixture
  • Add benchmark and gtest includes

tests/cpp/multidevice.cpp: Implement fixture classes and benchmark integration (+42/-5)

  • Implement MultiDeviceFixture constructor/destructor
  • Add MultiDeviceBenchmark::TearDown with barrier synchronization
  • Add benchmark detection logic in main() function
  • Update main() to run benchmarks when requested

Tests

tests/cpp/test_multidevice_sharding.cpp: Add sample multi-device benchmark test (+36/-0)

  • Add MultiDeviceBenchmark::Reduction benchmark test
  • Register benchmark with Arg(4), Arg(8), and Iterations(10) (see the sketch after this walkthrough)
  • Include benchmark header for Google Benchmark support

Configuration changes

CMakeLists.txt: Add benchmark library dependencies (+2/-0)

  • Add benchmark include directory to test build configuration
  • Link benchmark library to test targets
PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Constructor ordering

The MultiDeviceFixture constructor is defined before the MultiDeviceTest constructor, but MultiDeviceTest inherits from MultiDeviceFixture. This could lead to initialization order issues. The constructors should be properly ordered, or MultiDeviceFixture should have a virtual destructor.
```cpp
MultiDeviceFixture::MultiDeviceFixture() {
  // Enable logging in c10d so debug messages can be printed out via
  // `TORCH_DISTRIBUTED_DEBUG`.
  c10d::setDebugLevelFromEnvironment();

  communicator_ = &Communicator::getInstance();
  tensor_options_ =
      at::TensorOptions().dtype(at::kFloat).device(communicator_->device());
  debug_print = getNvFuserEnv("MULTIDEVICE_DEBUG_PRINT") != nullptr;
}

MultiDeviceTest::MultiDeviceTest() {
  disable_skip = getNvFuserEnv("MULTIDEVICE_DISABLE_SKIP") != nullptr;
}
```
Benchmark iteration consistency

The comment mentions that iterations must be consistent across processes to prevent hanging, but there's no validation that all processes actually receive the same iteration count. Consider adding runtime validation.
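One way to do this, as a minimal sketch assuming plain MPI is reachable from the benchmark binary (the real code would presumably route through nvFuser's Communicator; validateIterationCount is a hypothetical helper):

```cpp
// Hypothetical helper: verify every rank agreed on the iteration count
// before entering the benchmark loop, since mismatched counts would leave
// some ranks blocked in a collective. Uses plain MPI for illustration.
#include <mpi.h>

#include <cstdint>
#include <stdexcept>

void validateIterationCount(int64_t local_iterations) {
  int64_t min_iters = 0;
  int64_t max_iters = 0;
  MPI_Allreduce(
      &local_iterations, &min_iters, 1, MPI_INT64_T, MPI_MIN, MPI_COMM_WORLD);
  MPI_Allreduce(
      &local_iterations, &max_iters, 1, MPI_INT64_T, MPI_MAX, MPI_COMM_WORLD);
  // If all ranks hold the same value, min == max.
  if (min_iters != max_iters) {
    throw std::runtime_error(
        "Benchmark iteration counts differ across ranks; "
        "collectives would hang.");
  }
}
```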

greptile-apps bot commented Jan 3, 2026

Greptile Summary

Refactored test infrastructure by extracting MultiDeviceFixture as a separate base class from MultiDeviceTest. This allows MultiDeviceTest to use multiple inheritance (from both NVFuserTest and MultiDeviceFixture), enabling better code reuse and separation of concerns. The resulting hierarchy is sketched after the list below.

Key changes:

• Created MultiDeviceFixture containing common multi-device testing utilities (communicator_, tensor_options_, debug_print, shardTensor methods)
• MultiDeviceTest now inherits from both NVFuserTest and MultiDeviceFixture
• Moved initialization logic to appropriate constructors based on class responsibilities
• Removed unused <mutex> include
• Removed comment about setting random seed (no related code was present)
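A rough sketch of the hierarchy those bullets describe, assuming gtest and Google Benchmark bases; NVFuserTest and Communicator are stubbed here for self-containment, and member lists are abbreviated:

```cpp
#include <benchmark/benchmark.h>
#include <gtest/gtest.h>

class NVFuserTest : public ::testing::Test {}; // stub for illustration
class Communicator; // real class lives elsewhere in nvFuser

// Common multi-device utilities shared by tests and benchmarks.
class MultiDeviceFixture {
 protected:
  Communicator* communicator_ = nullptr;
  bool debug_print = false;
};

// Multiple inheritance: gtest base + shared multi-device utilities.
class MultiDeviceTest : public NVFuserTest, public MultiDeviceFixture {
 protected:
  bool disable_skip = false;
};

// The same utilities reused under Google Benchmark's fixture API.
class MultiDeviceBenchmark : public benchmark::Fixture,
                             public MultiDeviceFixture {};
```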

Confidence Score: 5/5

• This PR is safe to merge with minimal risk
• Clean refactoring with proper separation of concerns; maintains backward compatibility for all existing tests that inherit from MultiDeviceTest and follows C++ multiple-inheritance best practices
• No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/cpp/multidevice.h | Extracted MultiDeviceFixture as a separate base class containing common testing utilities, allowing MultiDeviceTest to use multiple inheritance cleanly. |
| tests/cpp/multidevice.cpp | Moved initialization logic from MultiDeviceTest to MultiDeviceFixture, removed unused <mutex> include, cleaned up constructor responsibilities. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant TestRunner as Test Runner
    participant MDT as MultiDeviceTest
    participant NVFT as NVFuserTest
    participant MDF as MultiDeviceFixture
    participant Comm as Communicator

    TestRunner->>MDT: Construct MultiDeviceTest
    MDT->>NVFT: Call NVFuserTest()
    NVFT-->>MDT: Base initialized
    MDT->>MDF: Call MultiDeviceFixture()
    MDF->>Comm: getInstance()
    Comm-->>MDF: communicator instance
    MDF->>MDF: Setup tensor_options_
    MDF->>MDF: Setup debug_print
    MDF-->>MDT: Fixture initialized
    MDT->>MDT: Setup disable_skip
    MDT-->>TestRunner: Test object ready

    TestRunner->>MDT: SetUp()
    MDT->>NVFT: NVFuserTest::SetUp()
    NVFT-->>MDT: Base setup complete
    MDT->>MDT: Check communicator availability
    MDT-->>TestRunner: Setup complete

    TestRunner->>MDT: Run test
    Note over MDT,MDF: Test uses communicator_,<br/>tensor_options_ from fixture
    MDT-->>TestRunner: Test complete

    TestRunner->>MDT: Destroy MultiDeviceTest
    MDT->>MDF: Call ~MultiDeviceFixture()
    MDF->>Comm: barrier() if available
    MDF-->>MDT: Cleanup complete
    MDT-->>TestRunner: Destroyed
```

@wujingyue changed the title from "Create MultiDeviceFixture" to "Add a sample multi-GPU benchmark" on Jan 3, 2026
```cpp
testing::InitGoogleTest(&argc, argv);
testing::AddGlobalTestEnvironment(new nvfuser::MultiDeviceTestEnvironment());

if (wantsBenchmarks(argc, argv)) {
```
@wujingyue (Collaborator, Author) commented:

Benchmarks tend to run longer and don't need to run as frequently as tests, so it's worth separating benchmarks from (correctness) tests.

The question, though, is how.

1. In this version, the benchmarks are defined in the same set of files as the tests, and I'm reusing the same main function, which detects flags like --benchmarks (see the sketch below).
2. Alternatively, I could write two main functions (one for tests and the other for benchmarks) and link them into different binaries (test_multidevice vs benchmark_multidevice).
3. Furthermore, I could even split the test files from the benchmark files. It's harder to reuse code this way; for example, a FusionDefinition would have to be DRY'ed in order to be both tested and benchmarked.
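A minimal sketch of option 1, under the assumption that wantsBenchmarks simply scans argv for the --benchmarks flag (both names come from the diff hunk above; the body is illustrative, and the nvFuser-specific global test environment is omitted):

```cpp
// Sketch of a shared main(): run Google Benchmark when --benchmarks is
// passed, otherwise run gtest as usual. wantsBenchmarks is assumed from
// the diff context; its real implementation may differ.
#include <benchmark/benchmark.h>
#include <gtest/gtest.h>

#include <cstring>

namespace {
bool wantsBenchmarks(int argc, char** argv) {
  for (int i = 1; i < argc; i++) {
    // Accept both `--benchmarks` and `--benchmarks=<filter>`.
    if (std::strncmp(argv[i], "--benchmarks", 12) == 0) {
      return true;
    }
  }
  return false;
}
} // namespace

int main(int argc, char** argv) {
  testing::InitGoogleTest(&argc, argv);
  if (wantsBenchmarks(argc, argv)) {
    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    return 0;
  }
  return RUN_ALL_TESTS();
}
```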

@wujingyue requested a review from Priya2698 on January 3, 2026 21:26
