🐛 Bug
The SDDMM operator fails when running in a distributed environment. It either returns a sparse tensor filled with zeros or encounters an illegal memory access.
To Reproduce
Steps to reproduce the behavior:
reproduce.py

```python
import torch
import torch.distributed as dist

import dgl.sparse as dglsp


def main():
    rank = dist.get_rank()
    indices = torch.tensor([[0, 0, 1, 1], [0, 1, 0, 1]]).cuda(rank)
    values = torch.tensor([1., 1., 1., 1.]).cuda(rank)
    A = dglsp.spmatrix(indices, val=values, shape=(2, 2))
    B = torch.ones(2, 4).cuda(rank)
    C = torch.ones(4, 2).cuda(rank)
    out = dglsp.sddmm(A, B, C)
    print(out)


if __name__ == "__main__":
    dist.init_process_group("nccl", world_size=2)
    main()
    dist.destroy_process_group()
```
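The script does not pass a rank to `init_process_group`, so it relies on the launcher to set the rendezvous environment variables. The exact launch command is not shown in the report; a plausible invocation for a single node with two GPUs would be:

```shell
# Assumed launch command (not in the original report): torchrun sets RANK,
# WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, which the NCCL process group
# initialization in reproduce.py reads from the environment.
torchrun --nproc_per_node=2 reproduce.py
```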
```
[rank1]: Traceback (most recent call last):
[rank1]: File "/test/bench/reproduce.py", line 17, in <module>
[rank1]: main()
[rank1]: File "/test/bench/reproduce.py", line 13, in main
[rank1]: print(out)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/dgl/sparse/sparse_matrix.py", line 15, in __repr__
[rank1]: return _sparse_matrix_str(self)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/dgl/sparse/sparse_matrix.py", line 1454, in _sparse_matrix_str
[rank1]: values_str = str(spmat.val)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor.py", line 464, in __repr__
[rank1]: return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 697, in _str
[rank1]: return _str_intern(self, tensor_contents=tensor_contents)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 617, in _str_intern
[rank1]: tensor_str = _tensor_str(self, indent)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
[rank1]: formatter = _Formatter(get_summarized_data(self) if summarize else self)
[rank1]: File "/test/miniconda3/envs/new_dgl/lib/python3.9/site-packages/torch/_tensor_str.py", line 138, in __init__
[rank1]: tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
[rank1]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Expected behavior
The SDDMM result should contain the correct nonzero values on every rank, with no zero-filled output and no CUDA errors.
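For reference, the values the operator should produce can be checked with a plain-torch sketch of the SDDMM semantics, assuming `dglsp.sddmm(A, B, C)` computes `(B @ C)` masked by A's sparsity pattern and scaled by A's values (names below mirror the reproducer; no GPU or process group needed):

```python
import torch

# Same inputs as the reproducer, on CPU.
indices = torch.tensor([[0, 0, 1, 1], [0, 1, 0, 1]])
values = torch.tensor([1., 1., 1., 1.])
B = torch.ones(2, 4)
C = torch.ones(4, 2)

# SDDMM semantics (assumed): out.val[k] = A.val[k] * (B @ C)[row[k], col[k]]
dense = B @ C  # every entry is 4.0
out_vals = values * dense[indices[0], indices[1]]
print(out_vals)  # tensor([4., 4., 4., 4.])
```

So each stored value should be 4.0, not 0.0, on every rank.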
Environment
DGL installed from (conda, pip, source): pip

Additional context