[SymMem 2/5] SymmetricTensor runtime type
#5517
base: sym/VMM_bits
Conversation
Greptile Summary
Confidence Score: 2/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant R0 as Rank 0
    participant Store as TCP Store
    participant R1 as Rank 1
    Note over R0,R1: setupRemoteHandles()
    R0->>R0: cuMemExportToShareableHandle(local_handle, &shared_fd)
    R0->>Store: set("sym_tensor_0_tag_fd", shared_fd)
    R0->>Store: set("sym_tensor_0_tag_pid", pid)
    R1->>R1: cuMemExportToShareableHandle(local_handle, &shared_fd)
    R1->>Store: set("sym_tensor_1_tag_fd", shared_fd)
    R1->>Store: set("sym_tensor_1_tag_pid", pid)
    Note over R0,R1: barrier()
    R0->>Store: get("sym_tensor_1_tag_fd")
    Store-->>R0: peer_fd
    R0->>R0: pidfd_getfd(peer_pid, peer_fd) -> local_fd
    R0->>R0: cuMemImportFromShareableHandle(local_fd)
    R0->>R0: cuMemMap(peer_ptr)
    R1->>Store: get("sym_tensor_0_tag_fd")
    Store-->>R1: peer_fd
    R1->>R1: pidfd_getfd(peer_pid, peer_fd) -> local_fd
    R1->>R1: cuMemImportFromShareableHandle(local_fd)
    R1->>R1: cuMemMap(peer_ptr)
    Note over R0,R1: barrier()
```
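For readers unfamiliar with the CUDA VMM/IPC calls referenced in the diagram, the sketch below shows one rank's importer-side flow under stated assumptions: `localHandle` is the CUmemGenericAllocationHandle backing this rank's buffer, and `peerPid`/`peerFd` have already been read back from the TCP store (that exchange and all error checking are omitted). The function name and parameters are illustrative, not the PR's actual code; only the CUDA driver calls and the `pidfd_open`/`pidfd_getfd` syscalls are real APIs.

```cpp
#include <cuda.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical helper, for illustration only. Error checking omitted.
CUdeviceptr importPeerBuffer(
    CUmemGenericAllocationHandle localHandle,
    pid_t peerPid,
    int peerFd,
    size_t size,
    int deviceId) {
  // 1. Export our own allocation as a POSIX file descriptor; in the PR this
  //    fd and our pid are then published to the TCP store (omitted here).
  int sharedFd = -1;
  cuMemExportToShareableHandle(
      &sharedFd, localHandle, CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0);

  // 2. Duplicate the peer's fd into this process via pidfd_getfd(2).
  int pidFd = static_cast<int>(syscall(SYS_pidfd_open, peerPid, 0));
  int localFd = static_cast<int>(syscall(SYS_pidfd_getfd, pidFd, peerFd, 0));

  // 3. Import the peer allocation and map it into our virtual address space.
  CUmemGenericAllocationHandle peerHandle;
  cuMemImportFromShareableHandle(
      &peerHandle,
      reinterpret_cast<void*>(static_cast<intptr_t>(localFd)),
      CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR);

  CUdeviceptr peerPtr = 0;
  cuMemAddressReserve(&peerPtr, size, /*alignment=*/0, /*addr=*/0, /*flags=*/0);
  cuMemMap(peerPtr, size, /*offset=*/0, peerHandle, /*flags=*/0);

  // 4. Grant read/write access for the local device.
  CUmemAccessDesc access{};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id = deviceId;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(peerPtr, size, &access, /*count=*/1);

  close(pidFd);
  close(localFd);
  return peerPtr;
}
```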
7 files reviewed, 3 comments
!build
Review updated until commit ca8c70a

Description
Relevant files:
- Refactoring
- Enhancement
- Tests
- Configuration changes
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests

⚡ Recommended focus areas for review:
- Multicast Support
Test failures
- (Medium, 3) nvFuser SymmetricTensor multidevice failures (pidfd_getfd) on 4GPU_A100 runner

  | Test Name | A100 (dist.) | Source |
  |---|---|---|
  | SymmetricTensorTest.BasicAllocation | ❌ | Link |
  | SymmetricTensorTest.ContiguousView | ❌ | Link |
  | SymmetricTensorTest.PreallocatedTensor | ❌ | Link |

- (Medium, 2) IPC peer-FD acquisition failures in nvFuser multi-GPU VMM tests (4GPU_A100)

  | Test Name | A100 (dist.) | Source |
  |---|---|---|
  | IpcTest.IpcP2pWithVmm | ❌ | Link |
  | IpcTest.VmmMultiRankContiguousMappingTest | ❌ | Link |

- (Low, 1) Tensor numerical mismatches in nvFuser HopperMatmulTest on Matmul scheduler (H100 runner)

  | Test Name | H100 | Source |
  |---|---|---|
  | HopperMatmulTest.HSH_NT_UseScheduler_MultipleInstructionsPerWarpTile | ❌ | Link |
7 files reviewed, no comments
!test
!test
7 files reviewed, no comments
```cpp
namespace nvfuser {

// SymmetricTensor wraps a local symmetric memory allocation and enables:
```
From the API perspective, there is no fundamental difference: both aim at implementing the same standard.
Here I propose an implementation from scratch that we can better control and maintain.
I made this choice after studying Torch's symmetric memory and trying to use it in the first place. But I found that Torch's symmetric allocator is, for now, very experimental and incomplete. For example, there is no multicast support.
In the future, we could have different backends for symmetric memory besides this one -- e.g. a PyTorch or NVSHMEM implementation. With the current interface, we could easily switch from one backend to another, similarly to how we can switch communication backends and use external libraries like NCCL or UCC besides our own CUDA backend implementation.
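To make the backend-switching point above concrete, here is a minimal, hypothetical sketch of what a pluggable symmetric-memory backend interface could look like. None of these names come from this PR; they only illustrate how a CUDA VMM backend, a PyTorch-based backend, or an NVSHMEM backend could sit behind one interface, mirroring how communication backends are swapped today.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

namespace nvfuser {

// Hypothetical interface, for illustration only (not code from this PR).
class SymMemBackend {
 public:
  virtual ~SymMemBackend() = default;

  // Allocate `bytes` of symmetric memory on the local rank.
  virtual void* allocateLocal(size_t bytes) = 0;

  // Exchange handles with peers and return a pointer to `rank`'s replica of
  // the buffer previously returned by allocateLocal().
  virtual void* peerPtr(void* local, int64_t rank) = 0;
};

// Concrete backends could include the CUDA VMM implementation proposed here,
// PyTorch symmetric memory, or NVSHMEM.
enum class SymMemBackendKind { CudaVmm, PyTorch, Nvshmem };

std::unique_ptr<SymMemBackend> makeSymMemBackend(SymMemBackendKind kind);

} // namespace nvfuser
```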
cc @syed-ahmed
Are you on top of the symmetric memory work regarding multicasting?
cc @nvcastet who probably knows the latest about symmetric memory and multicast
@samnordmann @wujingyue
The PyTorch symmetric API works well and supports 3 different backends (pure CUDA, NVSHMEM, and NCCL). The pure CUDA backend supports getting peer ptrs and multicast ptrs, and the NCCL backend is going to be finished soon, with NCCL 2.29 providing a host API to access host and peer ptrs.
Coming up soon is support for allowing torch.compile to trace allocations used in symmetric ops so that they are allocated directly in the symmetric memory pool (pytorch/pytorch#162859).
It can be a real burden to maintain a new allocator, memory handle exchange, etc., as we have seen in past projects (flashinfer, wholegraph, etc.), so I would advise against it: use the PyTorch symmetric API and let the framework own the memory stack.
The best would be to contribute the improvements you want directly to pytorch core, and even add new backends there.
It would be sad to have multiple symmetric memory pools and create more memory fragmentation in the application if there is not a strong reason for it.
Thanks for your honest opinion, @nvcastet!
I assume @samnordmann has been digging into Torch’s symmetric memory and identified some gaps. While I’m not sure which of those gaps can be fixed by us versus what may require changes from Meta, what would you recommend as the best way to engage the PyTorch community so we can at least surface the issues he found?
It would be nice to write down the specifics of those gaps; what was listed here is already supported by PyTorch symmetric memory. Then I would encourage engaging with the PyTorch community and looping in @kwen2501 from Meta. I am sure they would be happy to collaborate and take PRs.
```cpp
//
// Design: Decouples local allocation from IPC handle exchange for better
// interoperability and support for pre-allocated user buffers
class SymmetricTensor {
```
I like your other comment. I'd include that in the code for posterity.
Suggested change:

```cpp
// From the API perspective, there is no fundamental difference: both aim at
// implementing the same standard. Here I propose an implementation from
// scratch that we can better control and maintain.
// I made this choice after studying Torch's symmetric memory and trying to
// use it in the first place. But I found that Torch's symmetric allocator is,
// for now, very experimental and incomplete. For example, there is no
// multicast support.
// In the future, we could have different backends for symmetric memory
// besides this one -- e.g. a PyTorch or NVSHMEM implementation. With the
// current interface, we could easily switch from one backend to another,
// similarly to how we can switch communication backends and use external
// libraries like NCCL or UCC besides our own CUDA backend implementation.
class SymmetricTensor {
```
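For readers following along without the full diff, here is a rough, hypothetical sketch of the shape of the class being discussed, based only on the comments and the sequence diagram above. Apart from setupRemoteHandles(), which the diagram names, all members here are guesses for illustration, not the PR's actual API.

```cpp
#include <ATen/ATen.h>

#include <cstdint>
#include <string>
#include <vector>

namespace nvfuser {

// Illustrative sketch only; the real class in the PR may differ.
class SymmetricTensor {
 public:
  // Wrap a local allocation. Decoupling construction from handle exchange is
  // what allows pre-allocated user buffers to be used.
  explicit SymmetricTensor(at::Tensor local) : local_(std::move(local)) {}

  // Export the local VMM handle, publish it through the store, then import
  // and map every peer's allocation (see the sequence diagram above).
  void setupRemoteHandles(const std::string& tag);

  // Pointer to `rank`'s replica; valid only after setupRemoteHandles().
  void* remotePtr(int64_t rank) const {
    return peer_ptrs_.at(rank);
  }

 private:
  at::Tensor local_;
  std::vector<void*> peer_ptrs_;
};

} // namespace nvfuser
```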
Related PRs:
- SymmetricTensor runtime type: #5517
- MemoryType::Symmetric: #5518
- Full branch for reference: #5515