
Conversation

@samnordmann (Collaborator) commented Nov 13, 2025

@greptile-apps greptile-apps bot (Contributor) commented Nov 13, 2025

Greptile Summary

  • Introduces SymmetricTensor runtime type enabling distributed symmetric memory allocation using CUDA VMM with IPC handle exchange, remote tensor access, NVLS multicast support (CUDA 13.0+), and contiguous view creation (a sketch of the VMM allocation path follows this summary)
  • Refactors serialization helpers (toBytes/fromBytes) into shared utilities header for reuse across IPC and symmetric tensor implementations
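For readers unfamiliar with the CUDA VMM path mentioned above, here is a minimal, hedged sketch of how a shareable allocation of this kind is typically created with the driver API. The function name and structure are illustrative assumptions, not the actual nvFuser implementation, and error checking is omitted for brevity.

#include <cuda.h>

// Hedged sketch (not the nvFuser code): create a VMM allocation that can later
// be exported as a POSIX file descriptor for IPC handle exchange.
CUdeviceptr allocateShareable(
    size_t requested_size,
    int device,
    CUmemGenericAllocationHandle* out_handle,
    size_t* out_aligned_size) {
  CUmemAllocationProp prop{};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device;
  // Request FD-exportable physical memory so peers can import it later.
  prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

  size_t granularity = 0;
  cuMemGetAllocationGranularity(
      &granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
  const size_t aligned_size =
      (requested_size + granularity - 1) / granularity * granularity;

  // Physical allocation, virtual address reservation, mapping, access setup.
  cuMemCreate(out_handle, aligned_size, &prop, 0);
  CUdeviceptr ptr;
  cuMemAddressReserve(&ptr, aligned_size, granularity, 0, 0);
  cuMemMap(ptr, aligned_size, 0, *out_handle, 0);

  CUmemAccessDesc access{};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id = device;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(ptr, aligned_size, &access, 1);

  *out_aligned_size = aligned_size;
  return ptr;
}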

Confidence Score: 2/5

  • This PR has critical file descriptor leaks that will cause resource exhaustion in production.
  • Score reflects two confirmed file descriptor leaks in setupRemoteHandles (line 284) and setupMulticast (line 467) where the exporter rank never closes the shared FDs after peers retrieve them. These leaks will accumulate with repeated calls.
  • Pay close attention to csrc/multidevice/symmetric_tensor.cpp - file descriptor leaks must be fixed before merge.
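A minimal sketch of the kind of fix being requested, assuming the shareable FD is only needed until every peer has duplicated it via pidfd_getfd (i.e. after the post-exchange barrier). It mirrors the exporter-rank cleanup block already present in the PR; the added close() call is the suggestion, not the actual patch.

#include <unistd.h>  // close()

// Exporter rank, after the barrier that guarantees all peers imported the handle:
if (my_rank == exporter_rank) {
  store->deleteKey(tag + "_fd");
  store->deleteKey(tag + "_pid");
  // Suggested addition: close the FD returned by cuMemExportToShareableHandle.
  // Peers that already imported the handle keep their mappings valid.
  close(shared_handle);
}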

Important Files Changed

• csrc/multidevice/symmetric_tensor.cpp: Introduces SymmetricTensor runtime type with VMM allocation, remote access, multicast, and contiguous view; contains file descriptor leaks in setupRemoteHandles and setupMulticast

Sequence Diagram

sequenceDiagram
    participant R0 as Rank 0
    participant Store as TCP Store
    participant R1 as Rank 1
    
    Note over R0,R1: setupRemoteHandles()
    
    R0->>R0: cuMemExportToShareableHandle(local_handle, &shared_fd)
    R0->>Store: set("sym_tensor_0_tag_fd", shared_fd)
    R0->>Store: set("sym_tensor_0_tag_pid", pid)
    
    R1->>R1: cuMemExportToShareableHandle(local_handle, &shared_fd)
    R1->>Store: set("sym_tensor_1_tag_fd", shared_fd)
    R1->>Store: set("sym_tensor_1_tag_pid", pid)
    
    Note over R0,R1: barrier()
    
    R0->>Store: get("sym_tensor_1_tag_fd")
    Store-->>R0: peer_fd
    R0->>R0: pidfd_getfd(peer_pid, peer_fd) -> local_fd
    R0->>R0: cuMemImportFromShareableHandle(local_fd)
    R0->>R0: cuMemMap(peer_ptr)
    
    R1->>Store: get("sym_tensor_0_tag_fd")
    Store-->>R1: peer_fd
    R1->>R1: pidfd_getfd(peer_pid, peer_fd) -> local_fd
    R1->>R1: cuMemImportFromShareableHandle(local_fd)
    R1->>R1: cuMemMap(peer_ptr)
    
    Note over R0,R1: barrier()
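For reference, a rough C++ sketch of the importer side of the exchange shown in the diagram. The helper name, signature, and store-key format are illustrative assumptions inferred from the diagram; toBytes/fromBytes are the serialization helpers this PR moves into utils.h. Error checking is omitted.

#include <cuda.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string>

// Illustrative sketch: import one peer's allocation after the first barrier.
template <typename Store>
CUdeviceptr importPeerAllocation(
    Store& store,
    int64_t peer_rank,
    const std::string& tag,
    size_t aligned_size,
    size_t granularity,
    int device) {
  const std::string key = "sym_tensor_" + std::to_string(peer_rank) + "_" + tag;
  const int fd_in_peer = fromBytes<int>(store->get(key + "_fd"));
  const pid_t peer_pid = fromBytes<pid_t>(store->get(key + "_pid"));

  // Duplicate the peer's FD into this process. This needs ptrace permission,
  // which is why the exporter calls prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY).
  const int pid_fd = static_cast<int>(syscall(SYS_pidfd_open, peer_pid, 0));
  const int local_fd =
      static_cast<int>(syscall(SYS_pidfd_getfd, pid_fd, fd_in_peer, 0));

  CUmemGenericAllocationHandle peer_handle;
  cuMemImportFromShareableHandle(
      &peer_handle,
      reinterpret_cast<void*>(static_cast<uint64_t>(local_fd)),
      CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR);

  // Map the imported handle into this process's address space.
  CUdeviceptr peer_ptr;
  cuMemAddressReserve(&peer_ptr, aligned_size, granularity, 0, 0);
  cuMemMap(peer_ptr, aligned_size, 0, peer_handle, 0);

  CUmemAccessDesc access{};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id = device;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(peer_ptr, aligned_size, &access, 1);

  // The duplicated FDs are no longer needed once the handle is imported.
  close(local_fd);
  close(pid_fd);
  return peer_ptr;
}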

@greptile-apps greptile-apps bot left a comment

7 files reviewed, 3 comments

@samnordmann (Collaborator Author)

!build

github-actions bot commented Nov 17, 2025

Review updated until commit ca8c70a

Description

  • Introduced SymmetricTensor class for symmetric memory allocation and management.

  • Added methods for remote access, multicast, and contiguous view setup.

  • Refactored and moved the toBytes/fromBytes serialization helpers to utils.h (a sketch of these helpers follows this list).

  • Updated CMakeLists.txt to include new source and test files.
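As noted in the description, the toBytes/fromBytes serialization helpers now live in csrc/multidevice/utils.h. A plausible sketch of such helpers (the exact code in the PR may differ):

#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Sketch of toBytes/fromBytes for trivially copyable values,
// e.g. the FDs and PIDs exchanged through the TCP store.
template <typename T>
std::vector<uint8_t> toBytes(const T& value) {
  static_assert(
      std::is_trivially_copyable_v<T>,
      "toBytes requires a trivially copyable type");
  std::vector<uint8_t> bytes(sizeof(T));
  std::memcpy(bytes.data(), &value, sizeof(T));
  return bytes;
}

template <typename T>
T fromBytes(const std::vector<uint8_t>& bytes) {
  static_assert(
      std::is_trivially_copyable_v<T>,
      "fromBytes requires a trivially copyable type");
  T value;
  std::memcpy(&value, bytes.data(), sizeof(T));
  return value;
}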

Changes walkthrough

Relevant files:

Refactoring
  • csrc/multidevice/ipc_handle.cpp (Remove unused serialization functions): Removed unused template functions for serialization; included utils.h for the serialization functions. (+1/-16)
  • tests/cpp/test_multidevice_ipc.cpp (Remove unused serialization functions): Removed unused template functions for serialization. (+0/-12)

Enhancement
  • csrc/multidevice/symmetric_tensor.cpp (Implement SymmetricTensor class): Implemented the SymmetricTensor class with methods for allocation, validation, remote access, multicast, and contiguous view. (+541/-0)
  • csrc/multidevice/symmetric_tensor.h (Define SymmetricTensor class): Defined the SymmetricTensor class with public and private members. (+92/-0)
  • csrc/multidevice/utils.h (Add serialization functions): Added template functions for serialization. (+13/-0)

Tests
  • tests/cpp/test_multidevice_symmetric_tensor.cpp (Add SymmetricTensor tests): Added tests for SymmetricTensor class functionality. (+235/-0)

Configuration changes
  • CMakeLists.txt (Update CMakeLists.txt): Added symmetric_tensor.cpp to the source files; added test_multidevice_symmetric_tensor.cpp to the test files. (+2/-0)
    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Multicast Support

    The multicast functionality is only supported for CUDA 13.0+. Ensure that this is clearly communicated and that the fallback mechanism is robust for older CUDA versions.

    void SymmetricTensor::setupMulticast(
        int64_t exporter_rank,
        const std::string& tag) {
    #if (CUDA_VERSION >= 13000)
      if (is_multicast_setup_) {
        return;
      }
    
      Communicator& comm = Communicator::getInstance();
      const int64_t my_rank = comm.deviceId();
      const int64_t local_rank = comm.local_rank();
    
      int is_multicast_supported;
      NVFUSER_CUDA_SAFE_CALL(cuDeviceGetAttribute(
          &is_multicast_supported,
          CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED,
          local_rank));
      NVF_CHECK(is_multicast_supported, "Multicast not supported");
    
      exporter_rank_ = exporter_rank;
    
      CUmulticastObjectProp mcast_prop{};
      mcast_prop.handleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
      mcast_prop.numDevices = world_size_;
      mcast_prop.size = aligned_size_;
    
      int shared_handle;
      auto store = comm.getTcpStore();
      pid_t root_pid;
    
      if (my_rank == exporter_rank) {
        NVFUSER_CUDA_SAFE_CALL(cuMulticastCreate(&mcast_handle_, &mcast_prop));
        NVFUSER_CUDA_SAFE_CALL(cuMemExportToShareableHandle(
            &shared_handle,
            mcast_handle_,
            CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR,
            0));
        prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY);
        root_pid = getpid();
    
        store->set(tag + "_fd", toBytes(shared_handle));
        store->set(tag + "_pid", toBytes(root_pid));
      }
    
      comm.barrier();
    
      if (my_rank != exporter_rank) {
        shared_handle = fromBytes<int>(store->get(tag + "_fd"));
        root_pid = fromBytes<pid_t>(store->get(tag + "_pid"));
    
        pid_fd_ = syscall(SYS_pidfd_open, root_pid, 0);
        NVF_CHECK(pid_fd_ >= 0, "pidfd_open failed");
        peer_fd_ = syscall(SYS_pidfd_getfd, pid_fd_, shared_handle, 0);
        NVF_CHECK(peer_fd_ >= 0, "pidfd_getfd failed");
    
        NVFUSER_CUDA_SAFE_CALL(cuMemImportFromShareableHandle(
            &mcast_handle_,
            reinterpret_cast<void*>(static_cast<uint64_t>(peer_fd_)),
            CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR));
      }
    
      NVFUSER_CUDA_SAFE_CALL(cuDeviceGet(&cu_dev_, static_cast<int>(local_rank)));
      NVFUSER_CUDA_SAFE_CALL(cuMulticastAddDevice(mcast_handle_, cu_dev_));
    
      CUdeviceptr local_ptr = remote_ptrs_[my_device_id_];
      CUdeviceptr base_ptr;
      size_t base_size;
      NVFUSER_CUDA_SAFE_CALL(
          cuMemGetAddressRange(&base_ptr, &base_size, local_ptr));
      size_t mem_offset = static_cast<size_t>(local_ptr - base_ptr);
    
      NVFUSER_CUDA_SAFE_CALL(cuMulticastBindMem(
          mcast_handle_,
          0,
          alloc_handles_[my_device_id_],
          mem_offset,
          aligned_size_,
          0));
    
      CUdeviceptr mc_ptr;
      NVFUSER_CUDA_SAFE_CALL(
          cuMemAddressReserve(&mc_ptr, aligned_size_, granularity_, 0, 0));
      NVFUSER_CUDA_SAFE_CALL(cuMemMap(mc_ptr, aligned_size_, 0, mcast_handle_, 0));
    
      CUmemAccessDesc access{};
      access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
      access.location.id = static_cast<int>(local_rank);
      access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
      NVFUSER_CUDA_SAFE_CALL(cuMemSetAccess(mc_ptr, aligned_size_, &access, 1));
    
      mc_ptr_ = reinterpret_cast<void*>(mc_ptr);
      is_multicast_setup_ = true;
    
      comm.barrier();
    
      if (my_rank == exporter_rank) {
        store->deleteKey(tag + "_fd");
        store->deleteKey(tag + "_pid");
      }
    #else
      (void)exporter_rank;
      (void)tag;
      NVF_ERROR("Multicast requires CUDA 13.0+");
    #endif
    }
    Error Handling

    Review the error handling in the setupMulticast function. Ensure that all CUDA API calls are properly checked for errors and that the function handles errors gracefully.

    (Same setupMulticast implementation as quoted above under Multicast Support.)
    Test Coverage

    Ensure that the test cases cover all possible scenarios, including edge cases and error conditions. Consider adding tests for different CUDA versions and device configurations (a sketch of one such additional test follows the quoted test file below).

    // clang-format off
    /*
     * SPDX-FileCopyrightText: Copyright (c) 2025-present NVIDIA CORPORATION & AFFILIATES.
     * All rights reserved.
     * SPDX-License-Identifier: BSD-3-Clause
     */
    // clang-format on
    #include <cuda.h>
    #include <multidevice/symmetric_tensor.h>
    #include <tests/cpp/multidevice.h>
    
    namespace nvfuser {
    
    using SymmetricTensorTest = MultiDeviceTest;
    
    TEST_F(SymmetricTensorTest, BasicAllocation) {
      if (communicator_->size() == 1) {
        GTEST_SKIP() << "Skipping test for single device";
      }
    
      const int64_t rank = communicator_->deviceId();
      const int64_t world_size = communicator_->size();
      NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(rank));
    
      // Create a symmetric tensor
      at::Tensor local_tensor = SymmetricTensor::allocate(
          {256, 512}, at::ScalarType::Float, communicator_->device());
      SymmetricTensor sym_tensor(local_tensor);
    
      // Validate local tensor
      EXPECT_TRUE(local_tensor.is_cuda());
      EXPECT_EQ(local_tensor.scalar_type(), at::ScalarType::Float);
      EXPECT_EQ(local_tensor.numel(), 256 * 512);
      EXPECT_EQ(local_tensor.sizes()[0], 256);
      EXPECT_EQ(local_tensor.sizes()[1], 512);
    
      // Write unique value to local tensor
      float local_value = static_cast<float>(rank + 100);
      local_tensor.fill_(local_value);
    
      sym_tensor.setupRemoteHandles();
    
      // Read from all remote tensors
      for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
        void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
        EXPECT_NE(peer_ptr, nullptr);
    
        // Copy first element from peer
        float peer_value;
        NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
            &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));
    
        float expected_value = static_cast<float>(peer_rank + 100);
        EXPECT_FLOAT_EQ(peer_value, expected_value)
            << "Rank " << rank << " reading from rank " << peer_rank;
      }
    }
    
    TEST_F(SymmetricTensorTest, PreallocatedTensor) {
      if (communicator_->size() == 1) {
        GTEST_SKIP() << "Skipping test for single device";
      }
    
      const int64_t rank = communicator_->deviceId();
      const int64_t world_size = communicator_->size();
      NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(rank));
    
      // Allocate tensor with symmetric memory
      at::Tensor local_tensor = SymmetricTensor::allocate(
          /*sizes=*/at::IntArrayRef({128, 256}),
          /*dtype=*/at::ScalarType::Double,
          /*device=*/c10::Device(c10::DeviceType::CUDA, rank));
    
      // Create SymmetricTensor from pre-allocated tensor
      SymmetricTensor sym_tensor(local_tensor);
    
      // Validate
      EXPECT_EQ(sym_tensor.localTensor().numel(), 128 * 256);
    
      // Write unique pattern to local tensor
      double local_value = static_cast<double>(rank * 1000 + 42);
      local_tensor.fill_(local_value);
    
      sym_tensor.setupRemoteHandles();
    
      // Verify remote access
      for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
        if (peer_rank == rank) {
          continue;
        }
    
        void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
        double peer_value;
        NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
            &peer_value, peer_ptr, sizeof(double), cudaMemcpyDeviceToHost));
    
        double expected = static_cast<double>(peer_rank * 1000 + 42);
        EXPECT_DOUBLE_EQ(peer_value, expected);
      }
    }
    
    TEST_F(SymmetricTensorTest, Multicast) {
    #if (CUDA_VERSION < 13000)
      GTEST_SKIP() << "Multicast requires CUDA 13.0+";
    #else
      if (communicator_->size() == 1) {
        GTEST_SKIP() << "Skipping test for single device";
      }
    
      const int64_t rank = communicator_->deviceId();
      const int64_t root = 0;
      NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(rank));
    
      // Check multicast support
      int is_multicast_supported;
      NVFUSER_CUDA_SAFE_CALL(cuDeviceGetAttribute(
          &is_multicast_supported, CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED, rank));
      if (!is_multicast_supported) {
        GTEST_SKIP() << "Device does not support multicast";
      }
    
      // Create symmetric tensor (2MB to meet granularity requirements)
      constexpr int64_t kNumElems = 524288; // 2MB / 4 bytes
      at::Tensor local_tensor = SymmetricTensor::allocate(
          /*sizes=*/at::IntArrayRef({kNumElems}),
          /*dtype=*/at::ScalarType::Int,
          /*device=*/c10::Device(c10::DeviceType::CUDA, rank));
      SymmetricTensor sym_tensor(local_tensor);
    
      // Setup multicast
      sym_tensor.setupMulticast(root, "test_multicast");
    
      // Root writes data to multicast buffer
      std::vector<int> host_data(kNumElems);
      if (rank == root) {
        void* mc_ptr = sym_tensor.multicastPtr();
        EXPECT_NE(mc_ptr, nullptr);
    
        // Prepare pattern data
        for (int64_t i = 0; i < kNumElems; ++i) {
          host_data[i] = static_cast<int>(i * 7 + 13);
        }
    
        // Write to multicast buffer
        NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
            mc_ptr,
            host_data.data(),
            kNumElems * sizeof(int),
            cudaMemcpyHostToDevice));
      }
    
      communicator_->barrier();
    
      // All ranks read from local tensor and validate
      const at::Tensor& local = sym_tensor.localTensor();
      std::vector<int> readback(kNumElems);
      NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
          readback.data(),
          local.data_ptr(),
          kNumElems * sizeof(int),
          cudaMemcpyDeviceToHost));
    
      for (int64_t i = 0; i < kNumElems; ++i) {
        int expected = static_cast<int>(i * 7 + 13);
        EXPECT_EQ(readback[i], expected)
            << "Rank " << rank << " failed to read multicast data at index " << i;
      }
    #endif
    }
    
    TEST_F(SymmetricTensorTest, ContiguousView) {
      if (communicator_->size() == 1) {
        GTEST_SKIP() << "Skipping test for single device";
      }
    
      const int64_t rank = communicator_->deviceId();
      const int64_t world_size = communicator_->size();
      NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(rank));
    
      // Create symmetric tensor
      at::Tensor local_tensor = SymmetricTensor::allocate(
          /*sizes=*/at::IntArrayRef({2, 262144}),
          /*dtype=*/at::ScalarType::Float,
          /*device=*/c10::Device(c10::DeviceType::CUDA, rank));
      SymmetricTensor sym_tensor(local_tensor);
    
      // Write rank-specific pattern to local tensor
      local_tensor.fill_(static_cast<float>(rank + 100));
    
      // Validate that localTensor has the correct values for this rank
      std::vector<float> local_data(local_tensor.numel());
      NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
          local_data.data(),
          local_tensor.data_ptr(),
          local_tensor.numel() * sizeof(float),
          cudaMemcpyDeviceToHost));
      for (int64_t i = 0; i < local_tensor.numel(); ++i) {
        ASSERT_EQ(local_data[i], static_cast<float>(rank + 100))
            << "localTensor value mismatch at index " << i << " for rank " << rank;
      }
    
      communicator_->barrier();
    
      // Setup and get contiguous view of all ranks
      sym_tensor.setupContiguousView("test_contiguous");
      at::Tensor contiguous_view = sym_tensor.getContiguousView();
    
      // Validate shape: [world_size, 2, 262144]
      EXPECT_EQ(contiguous_view.dim(), 3);
      EXPECT_EQ(contiguous_view.size(0), world_size);
      EXPECT_EQ(contiguous_view.size(1), 2);
      EXPECT_EQ(contiguous_view.size(2), 262144);
    
      // Validation: copy and check each per-rank slice from host buffer
      const int64_t slice_elems = contiguous_view.size(1) * contiguous_view.size(2);
      const int64_t total_elems = world_size * slice_elems;
      std::vector<float> all_data(total_elems);
      NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
          all_data.data(),
          contiguous_view.data_ptr(),
          total_elems * sizeof(float),
          cudaMemcpyDeviceToHost));
    
      for (int64_t r = 0; r < world_size; ++r) {
        for (int64_t i = 0; i < slice_elems; ++i) {
          float expected = static_cast<float>(r + 100);
          size_t idx = r * slice_elems + i;
          ASSERT_EQ(all_data[idx], expected)
              << "Rank " << rank << " view checking slice for rank " << r
              << " at offset " << i << " did not match expected value";
        }
      }
    }
    
    } // namespace nvfuser
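Following up on the test-coverage suggestion above, here is a hedged sketch of one additional edge-case test, using only the SymmetricTensor API already exercised in the file. It assumes sub-granularity sizes are meant to be supported (the allocator presumably rounds up to the VMM granularity internally); the test name and size are illustrative.

// Sketch only: a small, sub-granularity allocation should still round up to
// the VMM granularity and remain remotely accessible.
TEST_F(SymmetricTensorTest, SmallAllocation) {
  if (communicator_->size() == 1) {
    GTEST_SKIP() << "Skipping test for single device";
  }

  const int64_t rank = communicator_->deviceId();
  const int64_t world_size = communicator_->size();
  NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(rank));

  // 16 floats is far below the typical 2MB VMM granularity.
  at::Tensor local_tensor = SymmetricTensor::allocate(
      {16}, at::ScalarType::Float, c10::Device(c10::DeviceType::CUDA, rank));
  SymmetricTensor sym_tensor(local_tensor);
  local_tensor.fill_(static_cast<float>(rank));

  sym_tensor.setupRemoteHandles();

  for (int64_t peer = 0; peer < world_size; ++peer) {
    float value;
    NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
        &value,
        sym_tensor.remoteTensor(peer).data_ptr(),
        sizeof(float),
        cudaMemcpyDeviceToHost));
    EXPECT_FLOAT_EQ(value, static_cast<float>(peer));
  }
}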

    Test failures

    • (Medium, 3) nvFuser SymmetricTensor multidevice failures (pidfd_getfd) on the 4GPU_A100 (dist.) runner; see the note after this list
      • SymmetricTensorTest.BasicAllocation
      • SymmetricTensorTest.ContiguousView
      • SymmetricTensorTest.PreallocatedTensor
    • (Medium, 2) IPC peer-FD acquisition failures in nvFuser multi-GPU VMM tests (4GPU_A100, dist.)
      • IpcTest.IpcP2pWithVmm
      • IpcTest.VmmMultiRankContiguousMappingTest
    • (Low, 1) Tensor numerical mismatches in nvFuser HopperMatmulTest on the Matmul scheduler (H100 runner)
      • HopperMatmulTest.HSH_NT_UseScheduler_MultipleInstructionsPerWarpTile
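The pidfd_getfd failures listed above are consistent with the cross-process FD duplication path being permission-sensitive: SYS_pidfd_getfd requires ptrace permission over the exporting process, which the implementation requests via prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY). A minimal sketch of that mechanism follows; the EPERM remark is an assumption about a likely failure mode on restricted or containerized runners, not a confirmed root cause.

#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>

// Exporter side: allow other (same-user) processes to attach via ptrace so
// that they may duplicate this process's FDs with pidfd_getfd.
void allowFdRetrieval() {
  prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY);
}

// Importer side: duplicate `remote_fd` (an FD number valid in `remote_pid`).
// Returns the local FD on success, or a negative errno value on failure.
int fetchRemoteFd(pid_t remote_pid, int remote_fd) {
  const int pid_fd = static_cast<int>(syscall(SYS_pidfd_open, remote_pid, 0));
  if (pid_fd < 0) {
    return -errno;  // e.g. ESRCH if the exporting process already exited
  }
  const int local_fd =
      static_cast<int>(syscall(SYS_pidfd_getfd, pid_fd, remote_fd, 0));
  const int saved_errno = errno;
  close(pid_fd);
  if (local_fd < 0) {
    return -saved_errno;  // EPERM typically indicates a ptrace/Yama restriction
  }
  return local_fd;
}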

@greptile-apps greptile-apps bot left a comment

7 files reviewed, no comments

@samnordmann (Collaborator Author)

    !test

@samnordmann (Collaborator Author)

    !test

@greptile-apps greptile-apps bot left a comment

7 files reviewed, no comments


    namespace nvfuser {

    // SymmetricTensor wraps a local symmetric memory allocation and enables:
@samnordmann (Collaborator Author):
From the API perspective, there is no fundamental difference. Both aim at implementing the same standard.

Here I propose an implementation from scratch that we can better control and maintain.

I made this choice after studying torch's symmetric memory and trying to use that in the first place. But I found that Torch's symmetric allocator is for now very experimental and incomplete. For example, there is no multicast support.

In the future, we could have different backends for symmetric memory besides this one -- e.g. a PyTorch or NVSHMEM implementation. From the current interface, we could easily switch from one backend to another, similarly to how we can switch the communication backend and use external libraries like NCCL or UCC besides our own CUDA backend implementation.

Collaborator:
    cc @syed-ahmed

    Are you on top of the symmetric memory work regarding multicasting?

@wujingyue (Collaborator) commented Nov 19, 2025:
    cc @nvcastet who probably knows the latest about symmetric memory and multicast

@nvcastet commented Nov 19, 2025:
@samnordmann @wujingyue
The PyTorch symmetric API works well and supports three different backends (pure CUDA, NVSHMEM, and NCCL). The pure CUDA backend supports getting peer ptrs and multicast ptrs, and the NCCL backend is going to be finished soon, with NCCL 2.29 providing a host API to access host and peer ptrs.
Coming up soon is support for allowing torch.compile to trace allocations used in symmetric ops so they are allocated directly in the symmetric memory pool (pytorch/pytorch#162859).
It can be a real burden to maintain a new allocator / memory handle exchange, etc., as we have seen in past projects (flashinfer, wholegraph, etc.), so I would advise against it; use the PyTorch symmetric API and let the framework own the memory stack.
The best would be to contribute directly to PyTorch core for the improvements you want, and even add new backends.
It would be sad to have multiple symmetric memory pools and create more memory fragmentation in the application if there is not a strong reason for it.

Collaborator:
    Thanks for your honest opinion, @nvcastet!

    I assume @samnordmann has been digging into Torch’s symmetric memory and identified some gaps. While I’m not sure which of those gaps can be fixed by us versus what may require changes from Meta, what would you recommend as the best way to engage the PyTorch community so we can at least surface the issues he found?

@nvcastet:

It would be nice to write down the specifics of those gaps. What was listed is already supported by PyTorch symmetric memory. Then I would encourage engaging with the PyTorch community and looping in @kwen2501 from Meta. I am sure they would be happy to collaborate and take PRs.

    //
    // Design: Decouples local allocation from IPC handle exchange for better
    // interoperability and support for pre-allocated user buffers
    class SymmetricTensor {
Collaborator:
    I like your other comment. I'd include that in the code for posterity.

Suggested change:

-class SymmetricTensor {
+// From the API perspective, there is no fundamental difference. Both aim at implementing the same standard.
+// Here I propose an implementation from scratch that we can better control and maintain.
+// I made this choice after studying torch's symmetric memory and trying to use that in the first place. But I found that Torch's symmetric allocator is for now very experimental and incomplete. For example, there is no multicast support.
+// In the future, we could have different backends for symmetric memory besides this one -- e.g. a PyTorch or NVSHMEM implementation. From the current interface, we could easily switch from one backend to another, similarly to how we can switch the communication backend and use external libraries like NCCL or UCC besides our own CUDA backend implementation.
+class SymmetricTensor {
