[Bug] CUDA Graph Build Failure #2460

Open

dangxingyu opened this issue Dec 12, 2024 · 0 comments
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Hi,
I'm running offline generation on 8x L40S. While building the CUDA graph, it raises RuntimeError: CUDA error: operation not permitted on an event last recorded in a capturing stream.

Error Information

INFO 12-11 22:03:15 utils.py:961] Found nccl from library libnccl.so.2
INFO 12-11 22:03:15 utils.py:961] Found nccl from library libnccl.so.2
babel-12-29:256515:256515 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256515:256515 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256515:256515 [0] NCCL INFO cudaDriverVersion 12060
babel-12-29:256515:256515 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
babel-12-29:256515:256515 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
babel-12-29:256515:256515 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256515:256515 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
babel-12-29:256515:256515 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
babel-12-29:256515:256515 [0] NCCL INFO Using network Socket
babel-12-29:256515:256515 [0] NCCL INFO ncclCommInitRank comm 0xe528a80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4f000 commId 0xa9bdccc2329a5250 - Init START
babel-12-29:256515:256515 [0] NCCL INFO Bootstrap timings total 0.000354 (create 0.000023, send 0.000077, recv 0.000161, ring 0.000010, delay 0.000000)
babel-12-29:256515:256515 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
babel-12-29:256515:256515 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 1
babel-12-29:256515:256515 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
babel-12-29:256515:256515 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
babel-12-29:256515:256515 [0] NCCL INFO comm 0xe528a80 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
babel-12-29:256515:256515 [0] NCCL INFO Channel 00/02 : 0 1
babel-12-29:256515:256515 [0] NCCL INFO Channel 01/02 : 0 1
babel-12-29:256515:256515 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
babel-12-29:256515:256515 [0] NCCL INFO P2P Chunksize set to 131072
babel-12-29:256515:257048 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 37
babel-12-29:256515:257047 [0] NCCL INFO [Proxy Service] Device 0 CPU core 35
babel-12-29:256515:256515 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
babel-12-29:256515:256515 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
babel-12-29:256515:256515 [0] NCCL INFO Connected all rings
babel-12-29:256515:256515 [0] NCCL INFO Connected all trees
babel-12-29:256515:257051 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 6
babel-12-29:256515:256515 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
babel-12-29:256515:256515 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-12-29:256515:256515 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
babel-12-29:256515:256515 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
babel-12-29:256515:256515 [0] NCCL INFO ncclCommInitRank comm 0xe528a80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4f000 commId 0xa9bdccc2329a5250 - Init COMPLETE
babel-12-29:256515:256515 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.13 (kernels 0.09, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.04, rest 0.00)
babel-12-29:256515:256515 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256515:256515 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: Using internal network plugin.
babel-12-29:256515:256515 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.1
babel-12-29:256516:256516 [1] NCCL INFO cudaDriverVersion 12060
babel-12-29:256516:256516 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:256516 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256516:256516 [1] NCCL INFO NCCL version 2.23.4+cuda12.6
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
babel-12-29:256516:256516 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
babel-12-29:256516:256516 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:256516 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
babel-12-29:256516:256516 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
babel-12-29:256516:256516 [1] NCCL INFO Using network Socket
babel-12-29:256516:256516 [1] NCCL INFO ncclCommInitRank comm 0xf0f6040 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0xa9bdccc2329a5250 - Init START
babel-12-29:256516:256516 [1] NCCL INFO Bootstrap timings total 0.068977 (create 0.000023, send 0.000079, recv 0.068764, ring 0.000009, delay 0.000000)
babel-12-29:256516:256516 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
babel-12-29:256516:256516 [1] NCCL INFO NCCL_P2P_DISABLE set by environment to 1
babel-12-29:256516:256516 [1] NCCL INFO Setting affinity for GPU 1 to ffff,0000ffff
babel-12-29:256516:256516 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
babel-12-29:256516:256516 [1] NCCL INFO comm 0xf0f6040 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
babel-12-29:256516:256516 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
babel-12-29:256516:256516 [1] NCCL INFO P2P Chunksize set to 131072
babel-12-29:256516:257049 [1] NCCL INFO [Proxy Service] Device 1 CPU core 9
babel-12-29:256516:257050 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 44
babel-12-29:256516:256516 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
babel-12-29:256516:256516 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
babel-12-29:256516:256516 [1] NCCL INFO Connected all rings
babel-12-29:256516:256516 [1] NCCL INFO Connected all trees
babel-12-29:256516:257052 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 46
babel-12-29:256516:256516 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
babel-12-29:256516:256516 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-12-29:256516:256516 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
babel-12-29:256516:256516 [1] NCCL INFO ncclCommInitRank comm 0xf0f6040 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0xa9bdccc2329a5250 - Init COMPLETE
babel-12-29:256516:256516 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 2 total 0.21 (kernels 0.09, alloc 0.00, bootstrap 0.07, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.04, rest 0.00)
babel-12-29:256516:256516 [1] NCCL INFO cudaDriverVersion 12060
babel-12-29:256516:256516 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:256516 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: Using internal network plugin.
babel-12-29:256516:257069 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
babel-12-29:256516:257069 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:257069 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
babel-12-29:256516:257069 [1] NCCL INFO Using non-device net plugin version 0
babel-12-29:256516:257069 [1] NCCL INFO Using network Socket
babel-12-29:256516:257069 [1] NCCL INFO ncclCommInitRank comm 0x144f2a30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0x33fb9f7285907ed5 - Init START
babel-12-29:256
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  9.12it/s]

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  9.11it/s]

INFO 12-11 22:03:20 custom_all_reduce.py:224] Registering 49 cuda graph addresses
INFO 12-11 22:03:20 custom_all_reduce.py:224] Registering 49 cuda graph addresses
[2024-12-11 22:03:20 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
    work.wait()
RuntimeError: CUDA error: operation not permitted on an event last recorded in a capturing stream
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 341, in capture_one_batch_size
    out = run_once()
          ^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 325, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/models/qwen2.py", line 299, in forward
    return self.logits_processor(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/layers/logits_processor.py", line 184, in forward
    last_logits = tensor_model_parallel_all_gather(last_logits)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/distributed/communication_op.py", line 17, in tensor_model_parallel_all_gather
    return get_tp_group().all_gather(input_, dim)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 444, in all_gather
    torch.distributed.all_gather_into_tensor(output_tensor,
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 85, in wrapper
    msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 56, in _get_msg_dict
    "args": f"{args}, {kwargs}",
            ^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor.py", line 523, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
           ^^^^^^^^^^^^
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 207, in __init__
    self.capture()
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 268, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 340, in capture_one_batch_size
    with torch.cuda.graph(graph, pool=self.graph_memory_pool, stream=stream):
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/cuda/graphs.py", line 186, in __exit__
    self.cuda_graph.capture_end()
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/cuda/graphs.py", line 84, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/managers/scheduler.py", line 1493, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/scheduler.py", line 191, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in __init__
    self.init_cuda_graphs()
  File "/home/xdang/sglang/python/sglang/srt/model_executor/model_runner.py", line 631, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 209, in __init__
    raise Exception(
Exception: Capture cuda graph failed: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 


[2024-12-11 22:03:20 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
    work.wait()
RuntimeError: CUDA error: operation not permitted on an event last recorded in a capturing stream
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 341, in capture_one_batch_size
    out = run_once()
          ^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 325, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/models/qwen2.py", line 299, in forward
    return self.logits_processor(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/layers/logits_processor.py", line 184, in forward
    last_logits = tensor_model_parallel_all_gather(last_logits)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/distributed/communication_op.py", line 17, in tensor_model_parallel_all_gather
    return get_tp_group().all_gather(input_, dim)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 444, in all_gather
    torch.distributed.all_gather_into_tensor(output_tensor,
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 85, in wrapper
    msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 56, in _get_msg_dict
    "args": f"{args}, {kwargs}",
            ^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor.py", line 523, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
           ^^^^^^^^^^^^
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 207, in __init__
    self.capture()
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 268, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 340, in capture_one_batch_size
    with torch.cuda.graph(graph, pool=self.graph_memory_pool, stream=stream):
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/cuda/graphs.py", line 186, in __exit__
    self.cuda_graph.capture_end()
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/cuda/graphs.py", line 84, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/managers/scheduler.py", line 1493, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/scheduler.py", line 191, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in __init__
    self.init_cuda_graphs()
  File "/home/xdang/sglang/python/sglang/srt/model_executor/model_runner.py", line 631, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 209, in __init__
    raise Exception(
Exception: Capture cuda graph failed: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 



Reproduction

I'm using Qwen/Qwen2.5-0.5B with tensor parallel size 2.

import sglang as sgl
if __name__ == "__main__":
    llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B", tp_size=2)
    print(llm.generate("Hello, world!", max_tokens=10))
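As a possible workaround until the capture failure is understood, a minimal sketch that skips CUDA graph capture entirely, following the first suggestion in the error message. This is an assumption, not a verified fix: it presumes sgl.Engine accepts server arguments such as disable_cuda_graph as keyword arguments, mirroring the --disable-cuda-graph CLI flag.

```python
import sglang as sgl

if __name__ == "__main__":
    # Assumption: Engine forwards server arguments as kwargs, so
    # disable_cuda_graph=True corresponds to the --disable-cuda-graph flag
    # suggested in the error message, skipping the capture step that fails.
    llm = sgl.Engine(
        model_path="Qwen/Qwen2.5-0.5B",
        tp_size=2,
        disable_cuda_graph=True,
    )
    print(llm.generate("Hello, world!"))
```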

Script used to run it:

#! /bin/bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=lo
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_USE_CUDA_DSA=1
export MASTER_ADDR=localhost
export MASTER_PORT=12343
export CUDA_LAUNCH_BLOCKING=1

# python test.py
python test.py 2>&1 | tee bug_log.txt
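Note that the first RuntimeError surfaces inside work.wait(), whose blocking-event path is active because the script sets TORCH_NCCL_BLOCKING_WAIT=1, and blocking waits on events are not permitted inside a capturing stream; CUDA_LAUNCH_BLOCKING=1 can similarly interfere with graph capture. A debugging step worth trying (a hypothesis to test, not a confirmed fix) is to rerun the same launcher without those two variables:

```shell
#!/bin/bash
# Same launcher, minus the two settings that force blocking synchronization,
# which CUDA graph capture does not permit (hypothesis to verify).
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=lo
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
# TORCH_NCCL_BLOCKING_WAIT and CUDA_LAUNCH_BLOCKING intentionally left unset.
export MASTER_ADDR=localhost
export MASTER_PORT=12343

python test.py 2>&1 | tee bug_log_no_blocking.txt
```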

Environment

CUDA Version: 12.6
CUDA Driver Version: 560.35.03
GPU: 8xNVIDIA L40S

Here's the Conda Env:

  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - asttokens=2.4.1=pyhd8ed1ab_0
  - blas=1.0=mkl
  - brotli-python=1.0.9=py311h6a678d5_8
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.11.26=h06a4308_0
  - certifi=2024.8.30=pyhd8ed1ab_0
  - charset-normalizer=3.3.2=pyhd3eb1b0_0
  - comm=0.2.2=pyhd8ed1ab_0
  - cuda-cudart=12.1.105=0
  - cuda-cupti=12.1.105=0
  - cuda-libraries=12.1.0=0
  - cuda-nvrtc=12.1.105=0
  - cuda-nvtx=12.1.105=0
  - cuda-opencl=12.6.77=0
  - cuda-runtime=12.1.0=0
  - cuda-version=12.6=3
  - debugpy=1.8.9=py311hfdbb021_0
  - decorator=5.1.1=pyhd8ed1ab_0
  - exceptiongroup=1.2.2=pyhd8ed1ab_0
  - executing=2.1.0=pyhd8ed1ab_0
  - ffmpeg=4.3=hf484d3e_0
  - freetype=2.12.1=h4a9f257_0
  - giflib=5.2.2=h5eee18b_0
  - gmp=6.2.1=h295c915_3
  - gmpy2=2.1.2=py311hc9b5ff0_0
  - gnutls=3.6.15=he1e5248_0
  - idna=3.7=py311h06a4308_0
  - importlib-metadata=8.5.0=pyha770c72_0
  - intel-openmp=2023.1.0=hdb19cb5_46306
  - ipykernel=6.29.5=pyh3099207_0
  - ipython=8.30.0=pyh707e725_0
  - jedi=0.19.2=pyhff2d567_0
  - jinja2=3.1.4=py311h06a4308_1
  - jpeg=9e=h5eee18b_3
  - jupyter_client=8.6.3=pyhd8ed1ab_0
  - jupyter_core=5.7.2=pyh31011fe_1
  - krb5=1.21.3=h143b758_0
  - lame=3.100=h7b6447c_0
  - lcms2=2.12=h3be6417_0
  - ld_impl_linux-64=2.40=h12ee557_0
  - lerc=3.0=h295c915_0
  - libcublas=12.1.0.26=0
  - libcufft=11.0.2.4=0
  - libcufile=1.11.1.6=0
  - libcurand=10.3.7.77=0
  - libcusolver=11.4.4.55=0
  - libcusparse=12.0.2.55=0
  - libdeflate=1.17=h5eee18b_1
  - libedit=3.1.20230828=h5eee18b_0
  - libffi=3.4.4=h6a678d5_1
  - libgcc=14.2.0=h77fa898_1
  - libgcc-ng=14.2.0=h69a702a_1
  - libgomp=14.2.0=h77fa898_1
  - libiconv=1.16=h5eee18b_3
  - libidn2=2.3.4=h5eee18b_0
  - libjpeg-turbo=2.0.0=h9bf148f_0
  - libnpp=12.0.2.50=0
  - libnvjitlink=12.1.105=0
  - libnvjpeg=12.1.1.14=0
  - libpng=1.6.39=h5eee18b_0
  - libsodium=1.0.20=h4ab18f5_0
  - libstdcxx=14.2.0=hc0a3c3a_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libtasn1=4.19.0=h5eee18b_0
  - libtiff=4.5.1=h6a678d5_0
  - libunistring=0.9.10=h27cfd23_0
  - libuuid=1.41.5=h5eee18b_0
  - libwebp=1.3.2=h11a3e52_0
  - libwebp-base=1.3.2=h5eee18b_1
  - llvm-openmp=14.0.6=h9e868ea_0
  - lz4-c=1.9.4=h6a678d5_1
  - markupsafe=2.1.3=py311h5eee18b_0
  - matplotlib-inline=0.1.7=pyhd8ed1ab_0
  - mkl=2023.1.0=h213fc3f_46344
  - mkl-service=2.4.0=py311h5eee18b_1
  - mkl_fft=1.3.11=py311h5eee18b_0
  - mkl_random=1.2.8=py311ha02d727_0
  - mpc=1.1.0=h10f8cd9_1
  - mpfr=4.0.2=hb69a4c5_1
  - mpmath=1.3.0=py311h06a4308_0
  - ncurses=6.4=h6a678d5_0
  - nest-asyncio=1.6.0=pyhd8ed1ab_0
  - nettle=3.7.3=hbbd107a_1
  - networkx=3.2.1=py311h06a4308_0
  - openh264=2.1.1=h4ff587b_0
  - openjpeg=2.5.2=he7f1fd0_0
  - openssl=3.4.0=hb9d3cd8_0
  - packaging=24.2=pyhff2d567_1
  - parso=0.8.4=pyhd8ed1ab_0
  - pexpect=4.9.0=pyhd8ed1ab_0
  - pickleshare=0.7.5=py_1003
  - platformdirs=4.3.6=pyhd8ed1ab_0
  - prompt-toolkit=3.0.48=pyha770c72_0
  - psutil=6.1.0=py311h9ecbd09_0
  - ptyprocess=0.7.0=pyhd3deb0d_0
  - pure_eval=0.2.3=pyhd8ed1ab_0
  - pygments=2.18.0=pyhd8ed1ab_0
  - pysocks=1.7.1=py311h06a4308_0
  - python=3.11.10=he870216_0
  - python-dateutil=2.9.0.post0=pyhff2d567_0
  - python_abi=3.11=2_cp311
  - pytorch=2.5.1=py3.11_cuda12.1_cudnn9.1.0_0
  - pytorch-cuda=12.1=ha16c6d3_6
  - pytorch-mutex=1.0=cuda
  - pyyaml=6.0.2=py311h5eee18b_0
  - pyzmq=26.2.0=py311h7deb3e3_3
  - readline=8.2=h5eee18b_0
  - requests=2.32.3=py311h06a4308_1
  - setuptools=75.1.0=py311h06a4308_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.45.3=h5eee18b_0
  - stack_data=0.6.2=pyhd8ed1ab_0
  - tbb=2021.8.0=hdb19cb5_0
  - tk=8.6.14=h39e8969_0
  - torchaudio=2.5.1=py311_cu121
  - torchtriton=3.1.0=py311
  - torchvision=0.20.1=py311_cu121
  - tornado=6.4.2=py311h9ecbd09_0
  - traitlets=5.14.3=pyhd8ed1ab_0
  - urllib3=2.2.3=py311h06a4308_0
  - wcwidth=0.2.13=pyhd8ed1ab_0
  - wheel=0.44.0=py311h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - yaml=0.2.5=h7b6447c_0
  - zeromq=4.3.5=h3b0a872_7
  - zipp=3.21.0=pyhd8ed1ab_1
  - zlib=1.2.13=h5eee18b_1
  - zstd=1.5.6=hc292b87_0
  - pip:
      - accelerate==1.1.1
      - aiohappyeyeballs==2.4.3
      - aiohttp==3.11.8
      - aiosignal==1.3.1
      - airportsdata==20241001
      - annotated-types==0.7.0
      - anthropic==0.40.0
      - anyio==4.6.2.post1
      - astor==0.8.1
      - attrs==24.2.0
      - blake3==1.0.0
      - click==8.1.7
      - cloudpickle==3.1.0
      - compressed-tensors==0.8.0
      - contourpy==1.3.1
      - cuda-python==12.6.2.post1
      - cycler==0.12.1
      - datasets==3.1.0
      - decord==0.6.0
      - depyf==0.18.0
      - dill==0.3.8
      - diskcache==5.6.3
      - distro==1.9.0
      - docker-pycreds==0.4.0
      - einops==0.8.0
      - fastapi==0.115.5
      - filelock==3.16.1
      - flashinfer==0.1.6+cu121torch2.4
      - fonttools==4.55.0
      - frozenlist==1.5.0
      - fsspec==2024.9.0
      - gguf==0.10.0
      - gitdb==4.0.11
      - gitpython==3.1.43
      - h11==0.14.0
      - hf-transfer==0.1.8
      - httpcore==1.0.7
      - httptools==0.6.4
      - httpx==0.27.2
      - huggingface-hub==0.26.3
      - iniconfig==2.0.0
      - interegular==0.3.3
      - jiter==0.8.0
      - jsonlines==4.0.0
      - jsonschema==4.23.0
      - jsonschema-specifications==2024.10.1
      - kiwisolver==1.4.7
      - lark==1.2.2
      - litellm==1.54.1
      - llvmlite==0.43.0
      - lm-format-enforcer==0.10.9
      - matplotlib==3.9.3
      - mistral-common==1.5.1
      - modelscope==1.21.0
      - msgpack==1.1.0
      - msgspec==0.18.6
      - multidict==6.1.0
      - multiprocess==0.70.16
      - numba==0.60.0
      - numpy==1.26.4
      - nvidia-ml-py==12.560.30
      - openai==1.55.3
      - opencv-python-headless==4.10.0.84
      - orjson==3.10.12
      - outlines==0.0.46
      - outlines-core==0.1.24
      - pandas==2.2.3
      - partial-json-parser==0.2.1.1.post4
      - pillow==10.4.0
      - pip==24.3.1
      - pluggy==1.5.0
      - prometheus-client==0.21.0
      - prometheus-fastapi-instrumentator==7.0.0
      - propcache==0.2.0
      - protobuf==5.29.0
      - py-cpuinfo==9.0.0
      - pyairports==2.1.1
      - pyarrow==18.1.0
      - pybind11==2.13.6
      - pycountry==24.6.1
      - pydantic==2.10.2
      - pydantic-core==2.27.1
      - pyparsing==3.2.0
      - pytest==8.3.4
      - python-dotenv==1.0.1
      - python-multipart==0.0.19
      - pytz==2024.2
      - ray==2.39.0
      - referencing==0.35.1
      - regex==2024.11.6
      - rpds-py==0.21.0
      - safetensors==0.4.5
      - scipy==1.14.1
      - seaborn==0.13.2
      - sentencepiece==0.2.0
      - sentry-sdk==2.19.0
      - setproctitle==1.3.4
      - sglang==0.4.0.post1
      - smmap==5.0.1
      - sniffio==1.3.1
      - starlette==0.41.3
      - sympy==1.13.1
      - tiktoken==0.7.0
      - tokenizers==0.20.3
      - torchao==0.7.0
      - tqdm==4.67.1
      - transformers==4.46.3
      - typing-extensions==4.12.2
      - tzdata==2024.2
      - uvicorn==0.32.1
      - uvloop==0.21.0
      - vllm==0.6.4.post1
      - wandb==0.18.7
      - watchfiles==1.0.0
      - websockets==14.1
      - xformers==0.0.28.post3
      - xgrammar==0.1.6
      - xxhash==3.5.0
      - yarl==1.18.0