🐛 Bug
The SDDMM operator fails when running in a distributed environment. It either returns a sparse tensor filled with zeros or encounters an illegal memory access.
To Reproduce
Steps to reproduce the behavior:
reproduce.py

```python
import torch
import torch.distributed as dist

import dgl.sparse as dglsp


def main():
    rank = dist.get_rank()
    indices = torch.tensor([[0, 0, 1, 1], [0, 1, 0, 1]]).cuda(rank)
    values = torch.tensor([1., 1., 1., 1.]).cuda(rank)
    A = dglsp.spmatrix(indices, val=values, shape=(2, 2))
    B = torch.ones(2, 4).cuda(rank)
    C = torch.ones(4, 2).cuda(rank)
    out = dglsp.sddmm(A, B, C)
    print(out)


if __name__ == "__main__":
    dist.init_process_group("nccl", world_size=2)
    main()
    dist.destroy_process_group()
```
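The script does not pass a rank to `init_process_group`, so it relies on the launcher to set the rendezvous environment variables. The exact launch command is not shown in the report; a plausible invocation for a single node with two GPUs would be:

```shell
# Assumed launch command (not in the original report): torchrun sets RANK,
# WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, which the NCCL process group
# initialization in reproduce.py reads from the environment.
torchrun --nproc_per_node=2 reproduce.py
```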
```
[rank1]: Traceback (most recent call last):
[rank1]: File "/test/bench/reproduce.py", line 17, in <module>
[rank1]: main()
[rank1]: File "/test/bench/reproduce.py", line 13, in main
[rank1]: print(out)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/dgl/sparse/sparse_matrix.py", line 15, in __repr__
[rank1]: return _sparse_matrix_str(self)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/dgl/sparse/sparse_matrix.py", line 1454, in _sparse_matrix_str
[rank1]: values_str = str(spmat.val)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor.py", line 464, in __repr__
[rank1]: return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 697, in _str
[rank1]: return _str_intern(self, tensor_contents=tensor_contents)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 617, in _str_intern
[rank1]: tensor_str = _tensor_str(self, indent)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
[rank1]: formatter = _Formatter(get_summarized_data(self) if summarize else self)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 138, in __init__
[rank1]: tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
[rank1]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Expected behavior
The SDDMM result should contain the correct nonzero values on every rank, with no zero-filled output and no CUDA errors.
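For reference, the values the operator should produce can be checked with a plain-torch sketch of the SDDMM semantics, assuming `dglsp.sddmm(A, B, C)` computes `(B @ C)` masked by A's sparsity pattern and scaled by A's values (names below mirror the reproducer; no GPU or process group needed):

```python
import torch

# Same inputs as the reproducer, on CPU.
indices = torch.tensor([[0, 0, 1, 1], [0, 1, 0, 1]])
values = torch.tensor([1., 1., 1., 1.])
B = torch.ones(2, 4)
C = torch.ones(4, 2)

# SDDMM semantics (assumed): out.val[k] = A.val[k] * (B @ C)[row[k], col[k]]
dense = B @ C  # every entry is 4.0
out_vals = values * dense[indices[0], indices[1]]
print(out_vals)  # tensor([4., 4., 4., 4.])
```

So each stored value should be 4.0, not 0.0, on every rank.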
Environment
DGL installed from (conda, pip, source): pip

Additional context