
Conversation

@samnordmann (Collaborator) commented Nov 13, 2025

@greptile-apps greptile-apps bot (Contributor) commented Nov 13, 2025

Greptile Summary

  • Introduces SymmetricTensor runtime type enabling distributed symmetric memory allocation using CUDA VMM with IPC handle exchange, remote tensor access, NVLS multicast support (CUDA 13.0+), and contiguous view creation (a sketch of the VMM allocation path follows this summary)
  • Refactors serialization helpers (toBytes/fromBytes) into shared utilities header for reuse across IPC and symmetric tensor implementations
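For readers unfamiliar with the CUDA VMM path mentioned above, here is a minimal, hedged sketch of how a shareable allocation of this kind is typically created with the driver API. The function name and structure are illustrative assumptions, not the actual nvFuser implementation, and error checking is omitted for brevity.

#include <cuda.h>

// Hedged sketch (not the nvFuser code): create a VMM allocation that can later
// be exported as a POSIX file descriptor for IPC handle exchange.
CUdeviceptr allocateShareable(
    size_t requested_size,
    int device,
    CUmemGenericAllocationHandle* out_handle,
    size_t* out_aligned_size) {
  CUmemAllocationProp prop{};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device;
  // Request FD-exportable physical memory so peers can import it later.
  prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

  size_t granularity = 0;
  cuMemGetAllocationGranularity(
      &granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
  const size_t aligned_size =
      (requested_size + granularity - 1) / granularity * granularity;

  // Physical allocation, virtual address reservation, mapping, access setup.
  cuMemCreate(out_handle, aligned_size, &prop, 0);
  CUdeviceptr ptr;
  cuMemAddressReserve(&ptr, aligned_size, granularity, 0, 0);
  cuMemMap(ptr, aligned_size, 0, *out_handle, 0);

  CUmemAccessDesc access{};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id = device;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(ptr, aligned_size, &access, 1);

  *out_aligned_size = aligned_size;
  return ptr;
}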

Confidence Score: 2/5

  • This PR has critical file descriptor leaks that will cause resource exhaustion in production.
  • Score reflects two confirmed file descriptor leaks in setupRemoteHandles (line 284) and setupMulticast (line 467) where the exporter rank never closes the shared FDs after peers retrieve them. These leaks will accumulate with repeated calls.
  • Pay close attention to csrc/multidevice/symmetric_tensor.cpp - file descriptor leaks must be fixed before merge.
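A minimal sketch of the kind of fix being requested, assuming the shareable FD is only needed until every peer has duplicated it via pidfd_getfd (i.e. after the post-exchange barrier). It mirrors the exporter-rank cleanup block already present in the PR; the added close() call is the suggestion, not the actual patch.

#include <unistd.h>  // close()

// Exporter rank, after the barrier that guarantees all peers imported the handle:
if (my_rank == exporter_rank) {
  store->deleteKey(tag + "_fd");
  store->deleteKey(tag + "_pid");
  // Suggested addition: close the FD returned by cuMemExportToShareableHandle.
  // Peers that already imported the handle keep their mappings valid.
  close(shared_handle);
}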

Important Files Changed

• csrc/multidevice/symmetric_tensor.cpp: Introduces SymmetricTensor runtime type with VMM allocation, remote access, multicast, and contiguous view; contains file descriptor leaks in setupRemoteHandles and setupMulticast

Sequence Diagram

sequenceDiagram
    participant R0 as Rank 0
    participant Store as TCP Store
    participant R1 as Rank 1
    
    Note over R0,R1: setupRemoteHandles()
    
    R0->>R0: cuMemExportToShareableHandle(local_handle, &shared_fd)
    R0->>Store: set("sym_tensor_0_tag_fd", shared_fd)
    R0->>Store: set("sym_tensor_0_tag_pid", pid)
    
    R1->>R1: cuMemExportToShareableHandle(local_handle, &shared_fd)
    R1->>Store: set("sym_tensor_1_tag_fd", shared_fd)
    R1->>Store: set("sym_tensor_1_tag_pid", pid)
    
    Note over R0,R1: barrier()
    
    R0->>Store: get("sym_tensor_1_tag_fd")
    Store-->>R0: peer_fd
    R0->>R0: pidfd_getfd(peer_pid, peer_fd) -> local_fd
    R0->>R0: cuMemImportFromShareableHandle(local_fd)
    R0->>R0: cuMemMap(peer_ptr)
    
    R1->>Store: get("sym_tensor_0_tag_fd")
    Store-->>R1: peer_fd
    R1->>R1: pidfd_getfd(peer_pid, peer_fd) -> local_fd
    R1->>R1: cuMemImportFromShareableHandle(local_fd)
    R1->>R1: cuMemMap(peer_ptr)
    
    Note over R0,R1: barrier()
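For reference, a rough C++ sketch of the importer side of the exchange shown in the diagram. The helper name, signature, and store-key format are illustrative assumptions inferred from the diagram; toBytes/fromBytes are the serialization helpers this PR moves into utils.h. Error checking is omitted.

#include <cuda.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string>

// Illustrative sketch: import one peer's allocation after the first barrier.
template <typename Store>
CUdeviceptr importPeerAllocation(
    Store& store,
    int64_t peer_rank,
    const std::string& tag,
    size_t aligned_size,
    size_t granularity,
    int device) {
  const std::string key = "sym_tensor_" + std::to_string(peer_rank) + "_" + tag;
  const int fd_in_peer = fromBytes<int>(store->get(key + "_fd"));
  const pid_t peer_pid = fromBytes<pid_t>(store->get(key + "_pid"));

  // Duplicate the peer's FD into this process. This needs ptrace permission,
  // which is why the exporter calls prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY).
  const int pid_fd = static_cast<int>(syscall(SYS_pidfd_open, peer_pid, 0));
  const int local_fd =
      static_cast<int>(syscall(SYS_pidfd_getfd, pid_fd, fd_in_peer, 0));

  CUmemGenericAllocationHandle peer_handle;
  cuMemImportFromShareableHandle(
      &peer_handle,
      reinterpret_cast<void*>(static_cast<uint64_t>(local_fd)),
      CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR);

  // Map the imported handle into this process's address space.
  CUdeviceptr peer_ptr;
  cuMemAddressReserve(&peer_ptr, aligned_size, granularity, 0, 0);
  cuMemMap(peer_ptr, aligned_size, 0, peer_handle, 0);

  CUmemAccessDesc access{};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id = device;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(peer_ptr, aligned_size, &access, 1);

  // The duplicated FDs are no longer needed once the handle is imported.
  close(local_fd);
  close(pid_fd);
  return peer_ptr;
}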

@greptile-apps greptile-apps bot left a comment

7 files reviewed, 3 comments

@samnordmann (Collaborator Author)

!build

github-actions bot commented Nov 17, 2025

Review updated until commit ca8c70a

Description

  • Introduced SymmetricTensor class for symmetric memory allocation and management.

  • Added methods for remote access, multicast, and contiguous view setup.

  • Refactored and moved the toBytes/fromBytes serialization helpers to utils.h (a sketch of these helpers follows this list).

  • Updated CMakeLists.txt to include new source and test files.
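As noted in the description, the toBytes/fromBytes serialization helpers now live in csrc/multidevice/utils.h. A plausible sketch of such helpers (the exact code in the PR may differ):

#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Sketch of toBytes/fromBytes for trivially copyable values,
// e.g. the FDs and PIDs exchanged through the TCP store.
template <typename T>
std::vector<uint8_t> toBytes(const T& value) {
  static_assert(
      std::is_trivially_copyable_v<T>,
      "toBytes requires a trivially copyable type");
  std::vector<uint8_t> bytes(sizeof(T));
  std::memcpy(bytes.data(), &value, sizeof(T));
  return bytes;
}

template <typename T>
T fromBytes(const std::vector<uint8_t>& bytes) {
  static_assert(
      std::is_trivially_copyable_v<T>,
      "fromBytes requires a trivially copyable type");
  T value;
  std::memcpy(&value, bytes.data(), sizeof(T));
  return value;
}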

Changes walkthrough

Relevant files:

Refactoring
  • csrc/multidevice/ipc_handle.cpp (Remove unused serialization functions): Removed unused template functions for serialization; included utils.h for the serialization functions. (+1/-16)
  • tests/cpp/test_multidevice_ipc.cpp (Remove unused serialization functions): Removed unused template functions for serialization. (+0/-12)

Enhancement
  • csrc/multidevice/symmetric_tensor.cpp (Implement SymmetricTensor class): Implemented the SymmetricTensor class with methods for allocation, validation, remote access, multicast, and contiguous view. (+541/-0)
  • csrc/multidevice/symmetric_tensor.h (Define SymmetricTensor class): Defined the SymmetricTensor class with public and private members. (+92/-0)
  • csrc/multidevice/utils.h (Add serialization functions): Added template functions for serialization. (+13/-0)

Tests
  • tests/cpp/test_multidevice_symmetric_tensor.cpp (Add SymmetricTensor tests): Added tests for SymmetricTensor class functionality. (+235/-0)

Configuration changes
  • CMakeLists.txt (Update CMakeLists.txt): Added symmetric_tensor.cpp to the source files; added test_multidevice_symmetric_tensor.cpp to the test files. (+2/-0)
    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Multicast Support

    The multicast functionality is only supported for CUDA 13.0+. Ensure that this is clearly communicated and that the fallback mechanism is robust for older CUDA versions.

    void SymmetricTensor::setupMulticast(
        int64_t exporter_rank,
        const std::string& tag) {
    #if (CUDA_VERSION >= 13000)
      if (is_multicast_setup_) {
        return;
      }
    
      Communicator& comm = Communicator::getInstance();
      const int64_t my_rank = comm.deviceId();
      const int64_t local_rank = comm.local_rank();
    
      int is_multicast_supported;
      NVFUSER_CUDA_SAFE_CALL(cuDeviceGetAttribute(
          &is_multicast_supported,
          CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED,
          local_rank));
      NVF_CHECK(is_multicast_supported, "Multicast not supported");
    
      exporter_rank_ = exporter_rank;
    
      CUmulticastObjectProp mcast_prop{};
      mcast_prop.handleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
      mcast_prop.numDevices = world_size_;
      mcast_prop.size = aligned_size_;
    
      int shared_handle;
      auto store = comm.getTcpStore();
      pid_t root_pid;
    
      if (my_rank == exporter_rank) {
        NVFUSER_CUDA_SAFE_CALL(cuMulticastCreate(&mcast_handle_, &mcast_prop));
        NVFUSER_CUDA_SAFE_CALL(cuMemExportToShareableHandle(
            &shared_handle,
            mcast_handle_,
            CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR,
            0));
        prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY);
        root_pid = getpid();
    
        store->set(tag + "_fd", toBytes(shared_handle));
        store->set(tag + "_pid", toBytes(root_pid));
      }
    
      comm.barrier();
    
      if (my_rank != exporter_rank) {
        shared_handle = fromBytes<int>(store->get(tag + "_fd"));
        root_pid = fromBytes<pid_t>(store->get(tag + "_pid"));
    
        pid_fd_ = syscall(SYS_pidfd_open, root_pid, 0);
        NVF_CHECK(pid_fd_ >= 0, "pidfd_open failed");
        peer_fd_ = syscall(SYS_pidfd_getfd, pid_fd_, shared_handle, 0);
        NVF_CHECK(peer_fd_ >= 0, "pidfd_getfd failed");
    
        NVFUSER_CUDA_SAFE_CALL(cuMemImportFromShareableHandle(
            &mcast_handle_,
            reinterpret_cast<void*>(static_cast<uint64_t>(peer_fd_)),
            CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR));
      }
    
      NVFUSER_CUDA_SAFE_CALL(cuDeviceGet(&cu_dev_, static_cast<int>(local_rank)));
      NVFUSER_CUDA_SAFE_CALL(cuMulticastAddDevice(mcast_handle_, cu_dev_));
    
      CUdeviceptr local_ptr = remote_ptrs_[my_device_id_];
      CUdeviceptr base_ptr;
      size_t base_size;
      NVFUSER_CUDA_SAFE_CALL(
          cuMemGetAddressRange(&base_ptr, &base_size, local_ptr));
      size_t mem_offset = static_cast<size_t>(local_ptr - base_ptr);
    
      NVFUSER_CUDA_SAFE_CALL(cuMulticastBindMem(
          mcast_handle_,
          0,
          alloc_handles_[my_device_id_],
          mem_offset,
          aligned_size_,
          0));
    
      CUdeviceptr mc_ptr;
      NVFUSER_CUDA_SAFE_CALL(
          cuMemAddressReserve(&mc_ptr, aligned_size_, granularity_, 0, 0));
      NVFUSER_CUDA_SAFE_CALL(cuMemMap(mc_ptr, aligned_size_, 0, mcast_handle_, 0));
    
      CUmemAccessDesc access{};
      access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
      access.location.id = static_cast<int>(local_rank);
      access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
      NVFUSER_CUDA_SAFE_CALL(cuMemSetAccess(mc_ptr, aligned_size_, &access, 1));
    
      mc_ptr_ = reinterpret_cast<void*>(mc_ptr);
      is_multicast_setup_ = true;
    
      comm.barrier();
    
      if (my_rank == exporter_rank) {
        store->deleteKey(tag + "_fd");
        store->deleteKey(tag + "_pid");
      }
    #else
      (void)exporter_rank;
      (void)tag;
      NVF_ERROR("Multicast requires CUDA 13.0+");
    #endif
    }
    Error Handling

    Review the error handling in the setupMulticast function. Ensure that all CUDA API calls are properly checked for errors and that the function handles errors gracefully.

    (Same setupMulticast implementation as quoted above under Multicast Support.)
    Test Coverage

    Ensure that the test cases cover all possible scenarios, including edge cases and error conditions. Consider adding tests for different CUDA versions and device configurations (a sketch of one such additional test follows the quoted test file below).

    // clang-format off
    /*
     * SPDX-FileCopyrightText: Copyright (c) 2025-present NVIDIA CORPORATION & AFFILIATES.
     * All rights reserved.
     * SPDX-License-Identifier: BSD-3-Clause
     */
    // clang-format on
    #include <cuda.h>
    #include <multidevice/symmetric_tensor.h>
    #include <tests/cpp/multidevice.h>
    
    namespace nvfuser {
    
    using SymmetricTensorTest = MultiDeviceTest;
    
    TEST_F(SymmetricTensorTest, BasicAllocation) {
      if (communicator_->size() == 1) {
        GTEST_SKIP() << "Skipping test for single device";
      }
    
      const int64_t rank = communicator_->deviceId();
      const int64_t world_size = communicator_->size();
      NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(rank));
    
      // Create a symmetric tensor
      at::Tensor local_tensor = SymmetricTensor::allocate(
          {256, 512}, at::ScalarType::Float, communicator_->device());
      SymmetricTensor sym_tensor(local_tensor);
    
      // Validate local tensor
      EXPECT_TRUE(local_tensor.is_cuda());
      EXPECT_EQ(local_tensor.scalar_type(), at::ScalarType::Float);
      EXPECT_EQ(local_tensor.numel(), 256 * 512);
      EXPECT_EQ(local_tensor.sizes()[0], 256);
      EXPECT_EQ(local_tensor.sizes()[1], 512);
    
      // Write unique value to local tensor
      float local_value = static_cast<float>(rank + 100);
      local_tensor.fill_(local_value);
    
      sym_tensor.setupRemoteHandles();
    
      // Read from all remote tensors
      for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
        void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
        EXPECT_NE(peer_ptr, nullptr);
    
        // Copy first element from peer
        float peer_value;
        NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
            &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));
    
        float expected_value = static_cast<float>(peer_rank + 100);
        EXPECT_FLOAT_EQ(peer_value, expected_value)
            << "Rank " << rank << " reading from rank " << peer_rank;
      }
    }
    
    TEST_F(SymmetricTensorTest, PreallocatedTensor) {
      if (communicator_->size() == 1) {
        GTEST_SKIP() << "Skipping test for single device";
      }
    
      const int64_t rank = communicator_->deviceId();
      const int64_t world_size = communicator_->size();
      NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(rank));
    
      // Allocate tensor with symmetric memory
      at::Tensor local_tensor = SymmetricTensor::allocate(
          /*sizes=*/at::IntArrayRef({128, 256}),
          /*dtype=*/at::ScalarType::Double,
          /*device=*/c10::Device(c10::DeviceType::CUDA, rank));
    
      // Create SymmetricTensor from pre-allocated tensor
      SymmetricTensor sym_tensor(local_tensor);
    
      // Validate
      EXPECT_EQ(sym_tensor.localTensor().numel(), 128 * 256);
    
      // Write unique pattern to local tensor
      double local_value = static_cast<double>(rank * 1000 + 42);
      local_tensor.fill_(local_value);
    
      sym_tensor.setupRemoteHandles();
    
      // Verify remote access
      for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
        if (peer_rank == rank) {
          continue;
        }
    
        void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
        double peer_value;
        NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
            &peer_value, peer_ptr, sizeof(double), cudaMemcpyDeviceToHost));
    
        double expected = static_cast<double>(peer_rank * 1000 + 42);
        EXPECT_DOUBLE_EQ(peer_value, expected);
      }
    }
    
    TEST_F(SymmetricTensorTest, Multicast) {
    #if (CUDA_VERSION < 13000)
      GTEST_SKIP() << "Multicast requires CUDA 13.0+";
    #else
      if (communicator_->size() == 1) {
        GTEST_SKIP() << "Skipping test for single device";
      }
    
      const int64_t rank = communicator_->deviceId();
      const int64_t root = 0;
      NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(rank));
    
      // Check multicast support
      int is_multicast_supported;
      NVFUSER_CUDA_SAFE_CALL(cuDeviceGetAttribute(
          &is_multicast_supported, CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED, rank));
      if (!is_multicast_supported) {
        GTEST_SKIP() << "Device does not support multicast";
      }
    
      // Create symmetric tensor (2MB to meet granularity requirements)
      constexpr int64_t kNumElems = 524288; // 2MB / 4 bytes
      at::Tensor local_tensor = SymmetricTensor::allocate(
          /*sizes=*/at::IntArrayRef({kNumElems}),
          /*dtype=*/at::ScalarType::Int,
          /*device=*/c10::Device(c10::DeviceType::CUDA, rank));
      SymmetricTensor sym_tensor(local_tensor);
    
      // Setup multicast
      sym_tensor.setupMulticast(root, "test_multicast");
    
      // Root writes data to multicast buffer
      std::vector<int> host_data(kNumElems);
      if (rank == root) {
        void* mc_ptr = sym_tensor.multicastPtr();
        EXPECT_NE(mc_ptr, nullptr);
    
        // Prepare pattern data
        for (int64_t i = 0; i < kNumElems; ++i) {
          host_data[i] = static_cast<int>(i * 7 + 13);
        }
    
        // Write to multicast buffer
        NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
            mc_ptr,
            host_data.data(),
            kNumElems * sizeof(int),
            cudaMemcpyHostToDevice));
      }
    
      communicator_->barrier();
    
      // All ranks read from local tensor and validate
      const at::Tensor& local = sym_tensor.localTensor();
      std::vector<int> readback(kNumElems);
      NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
          readback.data(),
          local.data_ptr(),
          kNumElems * sizeof(int),
          cudaMemcpyDeviceToHost));
    
      for (int64_t i = 0; i < kNumElems; ++i) {
        int expected = static_cast<int>(i * 7 + 13);
        EXPECT_EQ(readback[i], expected)
            << "Rank " << rank << " failed to read multicast data at index " << i;
      }
    #endif
    }
    
    TEST_F(SymmetricTensorTest, ContiguousView) {
      if (communicator_->size() == 1) {
        GTEST_SKIP() << "Skipping test for single device";
      }
    
      const int64_t rank = communicator_->deviceId();
      const int64_t world_size = communicator_->size();
      NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(rank));
    
      // Create symmetric tensor
      at::Tensor local_tensor = SymmetricTensor::allocate(
          /*sizes=*/at::IntArrayRef({2, 262144}),
          /*dtype=*/at::ScalarType::Float,
          /*device=*/c10::Device(c10::DeviceType::CUDA, rank));
      SymmetricTensor sym_tensor(local_tensor);
    
      // Write rank-specific pattern to local tensor
      local_tensor.fill_(static_cast<float>(rank + 100));
    
      // Validate that localTensor has the correct values for this rank
      std::vector<float> local_data(local_tensor.numel());
      NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
          local_data.data(),
          local_tensor.data_ptr(),
          local_tensor.numel() * sizeof(float),
          cudaMemcpyDeviceToHost));
      for (int64_t i = 0; i < local_tensor.numel(); ++i) {
        ASSERT_EQ(local_data[i], static_cast<float>(rank + 100))
            << "localTensor value mismatch at index " << i << " for rank " << rank;
      }
    
      communicator_->barrier();
    
      // Setup and get contiguous view of all ranks
      sym_tensor.setupContiguousView("test_contiguous");
      at::Tensor contiguous_view = sym_tensor.getContiguousView();
    
      // Validate shape: [world_size, 2, 262144]
      EXPECT_EQ(contiguous_view.dim(), 3);
      EXPECT_EQ(contiguous_view.size(0), world_size);
      EXPECT_EQ(contiguous_view.size(1), 2);
      EXPECT_EQ(contiguous_view.size(2), 262144);
    
      // Validation: copy and check each per-rank slice from host buffer
      const int64_t slice_elems = contiguous_view.size(1) * contiguous_view.size(2);
      const int64_t total_elems = world_size * slice_elems;
      std::vector<float> all_data(total_elems);
      NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
          all_data.data(),
          contiguous_view.data_ptr(),
          total_elems * sizeof(float),
          cudaMemcpyDeviceToHost));
    
      for (int64_t r = 0; r < world_size; ++r) {
        for (int64_t i = 0; i < slice_elems; ++i) {
          float expected = static_cast<float>(r + 100);
          size_t idx = r * slice_elems + i;
          ASSERT_EQ(all_data[idx], expected)
              << "Rank " << rank << " view checking slice for rank " << r
              << " at offset " << i << " did not match expected value";
        }
      }
    }
    
    } // namespace nvfuser
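Following up on the test-coverage suggestion above, here is a hedged sketch of one additional edge-case test, using only the SymmetricTensor API already exercised in the file. It assumes sub-granularity sizes are meant to be supported (the allocator presumably rounds up to the VMM granularity internally); the test name and size are illustrative.

// Sketch only: a small, sub-granularity allocation should still round up to
// the VMM granularity and remain remotely accessible.
TEST_F(SymmetricTensorTest, SmallAllocation) {
  if (communicator_->size() == 1) {
    GTEST_SKIP() << "Skipping test for single device";
  }

  const int64_t rank = communicator_->deviceId();
  const int64_t world_size = communicator_->size();
  NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(rank));

  // 16 floats is far below the typical 2MB VMM granularity.
  at::Tensor local_tensor = SymmetricTensor::allocate(
      {16}, at::ScalarType::Float, c10::Device(c10::DeviceType::CUDA, rank));
  SymmetricTensor sym_tensor(local_tensor);
  local_tensor.fill_(static_cast<float>(rank));

  sym_tensor.setupRemoteHandles();

  for (int64_t peer = 0; peer < world_size; ++peer) {
    float value;
    NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
        &value,
        sym_tensor.remoteTensor(peer).data_ptr(),
        sizeof(float),
        cudaMemcpyDeviceToHost));
    EXPECT_FLOAT_EQ(value, static_cast<float>(peer));
  }
}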

    Test failures

    • (Medium, 3) nvFuser SymmetricTensor multidevice failures (pidfd_getfd) on the 4GPU_A100 (dist.) runner; see the note after this list
      • SymmetricTensorTest.BasicAllocation
      • SymmetricTensorTest.ContiguousView
      • SymmetricTensorTest.PreallocatedTensor
    • (Medium, 2) IPC peer-FD acquisition failures in nvFuser multi-GPU VMM tests (4GPU_A100, dist.)
      • IpcTest.IpcP2pWithVmm
      • IpcTest.VmmMultiRankContiguousMappingTest
    • (Low, 1) Tensor numerical mismatches in nvFuser HopperMatmulTest on the Matmul scheduler (H100 runner)
      • HopperMatmulTest.HSH_NT_UseScheduler_MultipleInstructionsPerWarpTile
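The pidfd_getfd failures listed above are consistent with the cross-process FD duplication path being permission-sensitive: SYS_pidfd_getfd requires ptrace permission over the exporting process, which the implementation requests via prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY). A minimal sketch of that mechanism follows; the EPERM remark is an assumption about a likely failure mode on restricted or containerized runners, not a confirmed root cause.

#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>

// Exporter side: allow other (same-user) processes to attach via ptrace so
// that they may duplicate this process's FDs with pidfd_getfd.
void allowFdRetrieval() {
  prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY);
}

// Importer side: duplicate `remote_fd` (an FD number valid in `remote_pid`).
// Returns the local FD on success, or a negative errno value on failure.
int fetchRemoteFd(pid_t remote_pid, int remote_fd) {
  const int pid_fd = static_cast<int>(syscall(SYS_pidfd_open, remote_pid, 0));
  if (pid_fd < 0) {
    return -errno;  // e.g. ESRCH if the exporting process already exited
  }
  const int local_fd =
      static_cast<int>(syscall(SYS_pidfd_getfd, pid_fd, remote_fd, 0));
  const int saved_errno = errno;
  close(pid_fd);
  if (local_fd < 0) {
    return -saved_errno;  // EPERM typically indicates a ptrace/Yama restriction
  }
  return local_fd;
}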

@greptile-apps greptile-apps bot left a comment

7 files reviewed, no comments

@samnordmann (Collaborator Author)

    !test

@samnordmann (Collaborator Author)

    !test

@greptile-apps greptile-apps bot left a comment

7 files reviewed, no comments


    namespace nvfuser {

    // SymmetricTensor wraps a local symmetric memory allocation and enables:
@samnordmann (Collaborator Author):
From the API perspective, there is no fundamental difference. Both aim at implementing the same standard.

Here I propose an implementation from scratch that we can better control and maintain.

I made this choice after studying torch's symmetric memory and trying to use that in the first place. But I found that Torch's symmetric allocator is for now very experimental and incomplete. For example, there is no multicast support.

In the future, we could have different backends for symmetric memory besides this one -- e.g. a PyTorch or NVSHMEM implementation. From the current interface, we could easily switch from one backend to another, similarly to how we can switch the communication backend and use external libraries like NCCL or UCC besides our own CUDA backend implementation.

Collaborator:
    cc @syed-ahmed

    Are you on top of the symmetric memory work regarding multicasting?

@wujingyue (Collaborator) commented Nov 19, 2025:
    cc @nvcastet who probably knows the latest about symmetric memory and multicast

@nvcastet commented Nov 19, 2025:
@samnordmann @wujingyue
The PyTorch symmetric API works well and supports three different backends (pure CUDA, NVSHMEM, and NCCL). The pure CUDA backend supports getting peer ptrs and multicast ptrs, and the NCCL backend is going to be finished soon, with NCCL 2.29 providing a host API to access host and peer ptrs.
Coming up soon is support for allowing torch.compile to trace allocations used in symmetric ops so they are allocated directly in the symmetric memory pool (pytorch/pytorch#162859).
It can be a real burden to maintain a new allocator / memory handle exchange, etc., as we have seen in past projects (flashinfer, wholegraph, etc.), so I would advise against it; use the PyTorch symmetric API and let the framework own the memory stack.
The best would be to contribute directly to PyTorch core for the improvements you want, and even add new backends.
It would be sad to have multiple symmetric memory pools and create more memory fragmentation in the application if there is not a strong reason for it.

Collaborator:
    Thanks for your honest opinion, @nvcastet!

    I assume @samnordmann has been digging into Torch’s symmetric memory and identified some gaps. While I’m not sure which of those gaps can be fixed by us versus what may require changes from Meta, what would you recommend as the best way to engage the PyTorch community so we can at least surface the issues he found?

@nvcastet:

It would be nice to write down the specifics of those gaps. What was listed is already supported by PyTorch symmetric memory. Then I would encourage engaging with the PyTorch community and looping in @kwen2501 from Meta. I am sure they would be happy to collaborate and take PRs.

    //
    // Design: Decouples local allocation from IPC handle exchange for better
    // interoperability and support for pre-allocated user buffers
    class SymmetricTensor {
Collaborator:
    I like your other comment. I'd include that in the code for posterity.

Suggested change:

-class SymmetricTensor {
+// From the API perspective, there is no fundamental difference. Both aim at implementing the same standard.
+// Here I propose an implementation from scratch that we can better control and maintain.
+// I made this choice after studying torch's symmetric memory and trying to use that in the first place. But I found that Torch's symmetric allocator is for now very experimental and incomplete. For example, there is no multicast support.
+// In the future, we could have different backends for symmetric memory besides this one -- e.g. a PyTorch or NVSHMEM implementation. From the current interface, we could easily switch from one backend to another, similarly to how we can switch the communication backend and use external libraries like NCCL or UCC besides our own CUDA backend implementation.
+class SymmetricTensor {
