SDDMM operator fails in distributed environment #7697

Open
junliang-lin opened this issue Aug 14, 2024 · 0 comments

🐛 Bug

The SDDMM operator fails when run in a distributed environment: it either returns a sparse matrix whose values are all zeros (on one of the ranks) or hits an illegal memory access on the GPU.

To Reproduce

Steps to reproduce the behavior:

  1. Save the following script as reproduce.py:
import torch
import torch.distributed as dist
import dgl.sparse as dglsp

def main():
    rank = dist.get_rank()
    indices = torch.tensor([[0, 0, 1, 1], [0, 1, 0, 1]]).cuda(rank)
    values = torch.tensor([1., 1., 1., 1.]).cuda(rank)
    A = dglsp.spmatrix(indices, val=values, shape=(2, 2))
    B = torch.ones(2, 4).cuda(rank)
    C = torch.ones(4, 2).cuda(rank)
    out = dglsp.sddmm(A, B, C)
    print(out)

if __name__ == "__main__":
    dist.init_process_group("nccl", world_size=2)
    main()
    dist.destroy_process_group()
  2. Run the command:
torchrun --nproc_per_node=2 --master_port 47769 reproduce.py
  3. Output:
  • without errors: incorrect results (rank 1 returns all-zero values):
SparseMatrix(indices=tensor([[0, 0, 1, 1],
                             [0, 1, 0, 1]], device='cuda:0'),
             values=tensor([4., 4., 4., 4.], device='cuda:0'),
             shape=(2, 2), nnz=4)

SparseMatrix(indices=tensor([[0, 0, 1, 1],
                             [0, 1, 0, 1]], device='cuda:1'),
             values=tensor([0., 0., 0., 0.], device='cuda:1'),
             shape=(2, 2), nnz=4)
  • with errors:
[rank1]: Traceback (most recent call last):                                                                                                                  
[rank1]:   File "/test/bench/reproduce.py", line 17, in <module>                                                                                  
[rank1]:     main()                                                                                                                                          
[rank1]:   File "/test/bench/reproduce.py", line 13, in main                                                                                      
[rank1]:     print(out)                                                                                                                                      
[rank1]:   File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/dgl/sparse/sparse_matrix.py", line 15, in __repr__                     
[rank1]:     return _sparse_matrix_str(self)                                                                                                                 
[rank1]:   File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/dgl/sparse/sparse_matrix.py", line 1454, in _sparse_matrix_str         
[rank1]:     values_str = str(spmat.val)                                                                                                                     
[rank1]:   File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor.py", line 464, in __repr__                               
[rank1]:     return torch._tensor_str._str(self, tensor_contents=tensor_contents)                                                                            
[rank1]:   File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 697, in _str                               
[rank1]:     return _str_intern(self, tensor_contents=tensor_contents)                                                                                       
[rank1]:   File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 617, in _str_intern
[rank1]:     tensor_str = _tensor_str(self, indent)
[rank1]:   File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
[rank1]:     formatter = _Formatter(get_summarized_data(self) if summarize else self)
[rank1]:   File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 138, in __init__
[rank1]:     tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
[rank1]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
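
As the error message itself suggests, the reported stack frame may be inaccurate because CUDA errors are raised asynchronously; re-running with synchronous kernel launches, e.g.

CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=2 --master_port 47769 reproduce.py

surfaces the failure at the call that actually triggered it.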

Expected behavior

output:

SparseMatrix(indices=tensor([[0, 0, 1, 1],
                            [0, 1, 0, 1]], device='cuda:0'),
            values=tensor([4., 4., 4., 4.], device='cuda:0'),
            shape=(2, 2), nnz=4)

SparseMatrix(indices=tensor([[0, 0, 1, 1],
                            [0, 1, 0, 1]], device='cuda:1'),
            values=tensor([4., 4., 4., 4.], device='cuda:1'),
            shape=(2, 2), nnz=4)
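
For reference, dglsp.sddmm(A, B, C) samples the dense product B @ C at the nonzero positions of A and scales it by A's values, so with these all-ones inputs every nonzero should be 4.0. A minimal single-GPU sketch (not part of the original report) that checks the expected values:

import torch
import dgl.sparse as dglsp

# Single-process reference on one GPU: SDDMM is (B @ C) restricted to A's
# nonzero positions and multiplied elementwise by A's values.
indices = torch.tensor([[0, 0, 1, 1], [0, 1, 0, 1]]).cuda(0)
values = torch.tensor([1., 1., 1., 1.]).cuda(0)
A = dglsp.spmatrix(indices, val=values, shape=(2, 2))
B = torch.ones(2, 4).cuda(0)
C = torch.ones(4, 2).cuda(0)

print((B @ C) * A.to_dense())      # dense reference: every entry is 4.
print(dglsp.sddmm(A, B, C).val)    # expected: tensor([4., 4., 4., 4.])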

Environment

  • DGL Version (e.g., 1.0): 2.3.0+cu118
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): torch 2.3.0+cu118
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.9.19
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context
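
One thing that may be worth trying (an assumption about the cause, not a confirmed fix): pin each process to its own GPU before any DGL sparse call, in case the sparse kernels are launched on the process's current CUDA device (cuda:0 by default) rather than on the device holding the tensors:

# Hypothetical workaround, untested for this issue: make cuda:<rank> the
# current device for this process before building the sparse matrix.
rank = dist.get_rank()
torch.cuda.set_device(rank)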

rudongyu added the bug:confirmed (Something isn't working) label on Aug 15, 2024
github-project-automation bot moved this to 🏠 Backlog in DGL Project Tracker on Aug 15, 2024
frozenbugs self-assigned this on Aug 15, 2024