[Bug] CUDA Graph Build Failure #2460

Open

dangxingyu opened this issue Dec 12, 2024 · 0 comments
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Hi,
I'm running offline generation on 8x L40S. While building the CUDA graph, it raises RuntimeError: CUDA error: operation not permitted on an event last recorded in a capturing stream.

Error Information

INFO 12-11 22:03:15 utils.py:961] Found nccl from library libnccl.so.2
INFO 12-11 22:03:15 utils.py:961] Found nccl from library libnccl.so.2
babel-12-29:256515:256515 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256515:256515 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256515:256515 [0] NCCL INFO cudaDriverVersion 12060
babel-12-29:256515:256515 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
babel-12-29:256515:256515 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
babel-12-29:256515:256515 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256515:256515 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
babel-12-29:256515:256515 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
babel-12-29:256515:256515 [0] NCCL INFO Using network Socket
babel-12-29:256515:256515 [0] NCCL INFO ncclCommInitRank comm 0xe528a80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4f000 commId 0xa9bdccc2329a5250 - Init START
babel-12-29:256515:256515 [0] NCCL INFO Bootstrap timings total 0.000354 (create 0.000023, send 0.000077, recv 0.000161, ring 0.000010, delay 0.000000)
babel-12-29:256515:256515 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
babel-12-29:256515:256515 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 1
babel-12-29:256515:256515 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
babel-12-29:256515:256515 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
babel-12-29:256515:256515 [0] NCCL INFO comm 0xe528a80 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
babel-12-29:256515:256515 [0] NCCL INFO Channel 00/02 : 0 1
babel-12-29:256515:256515 [0] NCCL INFO Channel 01/02 : 0 1
babel-12-29:256515:256515 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
babel-12-29:256515:256515 [0] NCCL INFO P2P Chunksize set to 131072
babel-12-29:256515:257048 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 37
babel-12-29:256515:257047 [0] NCCL INFO [Proxy Service] Device 0 CPU core 35
babel-12-29:256515:256515 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
babel-12-29:256515:256515 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
babel-12-29:256515:256515 [0] NCCL INFO Connected all rings
babel-12-29:256515:256515 [0] NCCL INFO Connected all trees
babel-12-29:256515:257051 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 6
babel-12-29:256515:256515 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
babel-12-29:256515:256515 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-12-29:256515:256515 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
babel-12-29:256515:256515 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
babel-12-29:256515:256515 [0] NCCL INFO ncclCommInitRank comm 0xe528a80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4f000 commId 0xa9bdccc2329a5250 - Init COMPLETE
babel-12-29:256515:256515 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.13 (kernels 0.09, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.04, rest 0.00)
babel-12-29:256515:256515 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256515:256515 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: Using internal network plugin.
babel-12-29:256515:256515 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.1
babel-12-29:256516:256516 [1] NCCL INFO cudaDriverVersion 12060
babel-12-29:256516:256516 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:256516 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256516:256516 [1] NCCL INFO NCCL version 2.23.4+cuda12.6
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
babel-12-29:256516:256516 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
babel-12-29:256516:256516 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:256516 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
babel-12-29:256516:256516 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
babel-12-29:256516:256516 [1] NCCL INFO Using network Socket
babel-12-29:256516:256516 [1] NCCL INFO ncclCommInitRank comm 0xf0f6040 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0xa9bdccc2329a5250 - Init START
babel-12-29:256516:256516 [1] NCCL INFO Bootstrap timings total 0.068977 (create 0.000023, send 0.000079, recv 0.068764, ring 0.000009, delay 0.000000)
babel-12-29:256516:256516 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
babel-12-29:256516:256516 [1] NCCL INFO NCCL_P2P_DISABLE set by environment to 1
babel-12-29:256516:256516 [1] NCCL INFO Setting affinity for GPU 1 to ffff,0000ffff
babel-12-29:256516:256516 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
babel-12-29:256516:256516 [1] NCCL INFO comm 0xf0f6040 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
babel-12-29:256516:256516 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
babel-12-29:256516:256516 [1] NCCL INFO P2P Chunksize set to 131072
babel-12-29:256516:257049 [1] NCCL INFO [Proxy Service] Device 1 CPU core 9
babel-12-29:256516:257050 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 44
babel-12-29:256516:256516 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
babel-12-29:256516:256516 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
babel-12-29:256516:256516 [1] NCCL INFO Connected all rings
babel-12-29:256516:256516 [1] NCCL INFO Connected all trees
babel-12-29:256516:257052 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 46
babel-12-29:256516:256516 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
babel-12-29:256516:256516 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-12-29:256516:256516 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
babel-12-29:256516:256516 [1] NCCL INFO ncclCommInitRank comm 0xf0f6040 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0xa9bdccc2329a5250 - Init COMPLETE
babel-12-29:256516:256516 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 2 total 0.21 (kernels 0.09, alloc 0.00, bootstrap 0.07, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.04, rest 0.00)
babel-12-29:256516:256516 [1] NCCL INFO cudaDriverVersion 12060
babel-12-29:256516:256516 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:256516 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: Using internal network plugin.
babel-12-29:256516:257069 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
babel-12-29:256516:257069 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:257069 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
babel-12-29:256516:257069 [1] NCCL INFO Using non-device net plugin version 0
babel-12-29:256516:257069 [1] NCCL INFO Using network Socket
babel-12-29:256516:257069 [1] NCCL INFO ncclCommInitRank comm 0x144f2a30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0x33fb9f7285907ed5 - Init START
babel-12-29:256
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  9.12it/s]

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  9.11it/s]

INFO 12-11 22:03:20 custom_all_reduce.py:224] Registering 49 cuda graph addresses
INFO 12-11 22:03:20 custom_all_reduce.py:224] Registering 49 cuda graph addresses
[2024-12-11 22:03:20 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
    work.wait()
RuntimeError: CUDA error: operation not permitted on an event last recorded in a capturing stream
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 341, in capture_one_batch_size
    out = run_once()
          ^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 325, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/models/qwen2.py", line 299, in forward
    return self.logits_processor(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/layers/logits_processor.py", line 184, in forward
    last_logits = tensor_model_parallel_all_gather(last_logits)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/distributed/communication_op.py", line 17, in tensor_model_parallel_all_gather
    return get_tp_group().all_gather(input_, dim)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 444, in all_gather
    torch.distributed.all_gather_into_tensor(output_tensor,
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 85, in wrapper
    msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 56, in _get_msg_dict
    "args": f"{args}, {kwargs}",
            ^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor.py", line 523, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
           ^^^^^^^^^^^^
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 207, in __init__
    self.capture()
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 268, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 340, in capture_one_batch_size
    with torch.cuda.graph(graph, pool=self.graph_memory_pool, stream=stream):
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/cuda/graphs.py", line 186, in __exit__
    self.cuda_graph.capture_end()
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/cuda/graphs.py", line 84, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/managers/scheduler.py", line 1493, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/scheduler.py", line 191, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in __init__
    self.init_cuda_graphs()
  File "/home/xdang/sglang/python/sglang/srt/model_executor/model_runner.py", line 631, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 209, in __init__
    raise Exception(
Exception: Capture cuda graph failed: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 


[2024-12-11 22:03:20 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
    work.wait()
RuntimeError: CUDA error: operation not permitted on an event last recorded in a capturing stream
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 341, in capture_one_batch_size
    out = run_once()
          ^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 325, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/models/qwen2.py", line 299, in forward
    return self.logits_processor(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/layers/logits_processor.py", line 184, in forward
    last_logits = tensor_model_parallel_all_gather(last_logits)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/distributed/communication_op.py", line 17, in tensor_model_parallel_all_gather
    return get_tp_group().all_gather(input_, dim)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 444, in all_gather
    torch.distributed.all_gather_into_tensor(output_tensor,
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 85, in wrapper
    msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 56, in _get_msg_dict
    "args": f"{args}, {kwargs}",
            ^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor.py", line 523, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
           ^^^^^^^^^^^^
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 207, in __init__
    self.capture()
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 268, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 340, in capture_one_batch_size
    with torch.cuda.graph(graph, pool=self.graph_memory_pool, stream=stream):
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/cuda/graphs.py", line 186, in __exit__
    self.cuda_graph.capture_end()
  File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/cuda/graphs.py", line 84, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xdang/sglang/python/sglang/srt/managers/scheduler.py", line 1493, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/scheduler.py", line 191, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in __init__
    self.init_cuda_graphs()
  File "/home/xdang/sglang/python/sglang/srt/model_executor/model_runner.py", line 631, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 209, in __init__
    raise Exception(
Exception: Capture cuda graph failed: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 



Reproduction

I'm using Qwen/Qwen2.5-0.5B with tensor parallel size 2.

import sglang as sgl
if __name__ == "__main__":
    llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B", tp_size=2)
    print(llm.generate("Hello, world!", max_tokens=10))
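As a possible workaround until the capture failure is understood, a minimal sketch that skips CUDA graph capture entirely, following the first suggestion in the error message. This is an assumption, not a verified fix: it presumes sgl.Engine accepts server arguments such as disable_cuda_graph as keyword arguments, mirroring the --disable-cuda-graph CLI flag.

```python
import sglang as sgl

if __name__ == "__main__":
    # Assumption: Engine forwards server arguments as kwargs, so
    # disable_cuda_graph=True corresponds to the --disable-cuda-graph flag
    # suggested in the error message, skipping the capture step that fails.
    llm = sgl.Engine(
        model_path="Qwen/Qwen2.5-0.5B",
        tp_size=2,
        disable_cuda_graph=True,
    )
    print(llm.generate("Hello, world!"))
```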

Script used to run it:

#! /bin/bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=lo
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_USE_CUDA_DSA=1
export MASTER_ADDR=localhost
export MASTER_PORT=12343
export CUDA_LAUNCH_BLOCKING=1

# python test.py
python test.py 2>&1 | tee bug_log.txt
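Note that the first RuntimeError surfaces inside work.wait(), whose blocking-event path is active because the script sets TORCH_NCCL_BLOCKING_WAIT=1, and blocking waits on events are not permitted inside a capturing stream; CUDA_LAUNCH_BLOCKING=1 can similarly interfere with graph capture. A debugging step worth trying (a hypothesis to test, not a confirmed fix) is to rerun the same launcher without those two variables:

```shell
#!/bin/bash
# Same launcher, minus the two settings that force blocking synchronization,
# which CUDA graph capture does not permit (hypothesis to verify).
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=lo
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
# TORCH_NCCL_BLOCKING_WAIT and CUDA_LAUNCH_BLOCKING intentionally left unset.
export MASTER_ADDR=localhost
export MASTER_PORT=12343

python test.py 2>&1 | tee bug_log_no_blocking.txt
```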

Environment

CUDA Version: 12.6
CUDA Driver Version: 560.35.03
GPU: 8xNVIDIA L40S

Here's the Conda Env:

  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - asttokens=2.4.1=pyhd8ed1ab_0
  - blas=1.0=mkl
  - brotli-python=1.0.9=py311h6a678d5_8
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.11.26=h06a4308_0
  - certifi=2024.8.30=pyhd8ed1ab_0
  - charset-normalizer=3.3.2=pyhd3eb1b0_0
  - comm=0.2.2=pyhd8ed1ab_0
  - cuda-cudart=12.1.105=0
  - cuda-cupti=12.1.105=0
  - cuda-libraries=12.1.0=0
  - cuda-nvrtc=12.1.105=0
  - cuda-nvtx=12.1.105=0
  - cuda-opencl=12.6.77=0
  - cuda-runtime=12.1.0=0
  - cuda-version=12.6=3
  - debugpy=1.8.9=py311hfdbb021_0
  - decorator=5.1.1=pyhd8ed1ab_0
  - exceptiongroup=1.2.2=pyhd8ed1ab_0
  - executing=2.1.0=pyhd8ed1ab_0
  - ffmpeg=4.3=hf484d3e_0
  - freetype=2.12.1=h4a9f257_0
  - giflib=5.2.2=h5eee18b_0
  - gmp=6.2.1=h295c915_3
  - gmpy2=2.1.2=py311hc9b5ff0_0
  - gnutls=3.6.15=he1e5248_0
  - idna=3.7=py311h06a4308_0
  - importlib-metadata=8.5.0=pyha770c72_0
  - intel-openmp=2023.1.0=hdb19cb5_46306
  - ipykernel=6.29.5=pyh3099207_0
  - ipython=8.30.0=pyh707e725_0
  - jedi=0.19.2=pyhff2d567_0
  - jinja2=3.1.4=py311h06a4308_1
  - jpeg=9e=h5eee18b_3
  - jupyter_client=8.6.3=pyhd8ed1ab_0
  - jupyter_core=5.7.2=pyh31011fe_1
  - krb5=1.21.3=h143b758_0
  - lame=3.100=h7b6447c_0
  - lcms2=2.12=h3be6417_0
  - ld_impl_linux-64=2.40=h12ee557_0
  - lerc=3.0=h295c915_0
  - libcublas=12.1.0.26=0
  - libcufft=11.0.2.4=0
  - libcufile=1.11.1.6=0
  - libcurand=10.3.7.77=0
  - libcusolver=11.4.4.55=0
  - libcusparse=12.0.2.55=0
  - libdeflate=1.17=h5eee18b_1
  - libedit=3.1.20230828=h5eee18b_0
  - libffi=3.4.4=h6a678d5_1
  - libgcc=14.2.0=h77fa898_1
  - libgcc-ng=14.2.0=h69a702a_1
  - libgomp=14.2.0=h77fa898_1
  - libiconv=1.16=h5eee18b_3
  - libidn2=2.3.4=h5eee18b_0
  - libjpeg-turbo=2.0.0=h9bf148f_0
  - libnpp=12.0.2.50=0
  - libnvjitlink=12.1.105=0
  - libnvjpeg=12.1.1.14=0
  - libpng=1.6.39=h5eee18b_0
  - libsodium=1.0.20=h4ab18f5_0
  - libstdcxx=14.2.0=hc0a3c3a_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libtasn1=4.19.0=h5eee18b_0
  - libtiff=4.5.1=h6a678d5_0
  - libunistring=0.9.10=h27cfd23_0
  - libuuid=1.41.5=h5eee18b_0
  - libwebp=1.3.2=h11a3e52_0
  - libwebp-base=1.3.2=h5eee18b_1
  - llvm-openmp=14.0.6=h9e868ea_0
  - lz4-c=1.9.4=h6a678d5_1
  - markupsafe=2.1.3=py311h5eee18b_0
  - matplotlib-inline=0.1.7=pyhd8ed1ab_0
  - mkl=2023.1.0=h213fc3f_46344
  - mkl-service=2.4.0=py311h5eee18b_1
  - mkl_fft=1.3.11=py311h5eee18b_0
  - mkl_random=1.2.8=py311ha02d727_0
  - mpc=1.1.0=h10f8cd9_1
  - mpfr=4.0.2=hb69a4c5_1
  - mpmath=1.3.0=py311h06a4308_0
  - ncurses=6.4=h6a678d5_0
  - nest-asyncio=1.6.0=pyhd8ed1ab_0
  - nettle=3.7.3=hbbd107a_1
  - networkx=3.2.1=py311h06a4308_0
  - openh264=2.1.1=h4ff587b_0
  - openjpeg=2.5.2=he7f1fd0_0
  - openssl=3.4.0=hb9d3cd8_0
  - packaging=24.2=pyhff2d567_1
  - parso=0.8.4=pyhd8ed1ab_0
  - pexpect=4.9.0=pyhd8ed1ab_0
  - pickleshare=0.7.5=py_1003
  - platformdirs=4.3.6=pyhd8ed1ab_0
  - prompt-toolkit=3.0.48=pyha770c72_0
  - psutil=6.1.0=py311h9ecbd09_0
  - ptyprocess=0.7.0=pyhd3deb0d_0
  - pure_eval=0.2.3=pyhd8ed1ab_0
  - pygments=2.18.0=pyhd8ed1ab_0
  - pysocks=1.7.1=py311h06a4308_0
  - python=3.11.10=he870216_0
  - python-dateutil=2.9.0.post0=pyhff2d567_0
  - python_abi=3.11=2_cp311
  - pytorch=2.5.1=py3.11_cuda12.1_cudnn9.1.0_0
  - pytorch-cuda=12.1=ha16c6d3_6
  - pytorch-mutex=1.0=cuda
  - pyyaml=6.0.2=py311h5eee18b_0
  - pyzmq=26.2.0=py311h7deb3e3_3
  - readline=8.2=h5eee18b_0
  - requests=2.32.3=py311h06a4308_1
  - setuptools=75.1.0=py311h06a4308_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.45.3=h5eee18b_0
  - stack_data=0.6.2=pyhd8ed1ab_0
  - tbb=2021.8.0=hdb19cb5_0
  - tk=8.6.14=h39e8969_0
  - torchaudio=2.5.1=py311_cu121
  - torchtriton=3.1.0=py311
  - torchvision=0.20.1=py311_cu121
  - tornado=6.4.2=py311h9ecbd09_0
  - traitlets=5.14.3=pyhd8ed1ab_0
  - urllib3=2.2.3=py311h06a4308_0
  - wcwidth=0.2.13=pyhd8ed1ab_0
  - wheel=0.44.0=py311h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - yaml=0.2.5=h7b6447c_0
  - zeromq=4.3.5=h3b0a872_7
  - zipp=3.21.0=pyhd8ed1ab_1
  - zlib=1.2.13=h5eee18b_1
  - zstd=1.5.6=hc292b87_0
  - pip:
      - accelerate==1.1.1
      - aiohappyeyeballs==2.4.3
      - aiohttp==3.11.8
      - aiosignal==1.3.1
      - airportsdata==20241001
      - annotated-types==0.7.0
      - anthropic==0.40.0
      - anyio==4.6.2.post1
      - astor==0.8.1
      - attrs==24.2.0
      - blake3==1.0.0
      - click==8.1.7
      - cloudpickle==3.1.0
      - compressed-tensors==0.8.0
      - contourpy==1.3.1
      - cuda-python==12.6.2.post1
      - cycler==0.12.1
      - datasets==3.1.0
      - decord==0.6.0
      - depyf==0.18.0
      - dill==0.3.8
      - diskcache==5.6.3
      - distro==1.9.0
      - docker-pycreds==0.4.0
      - einops==0.8.0
      - fastapi==0.115.5
      - filelock==3.16.1
      - flashinfer==0.1.6+cu121torch2.4
      - fonttools==4.55.0
      - frozenlist==1.5.0
      - fsspec==2024.9.0
      - gguf==0.10.0
      - gitdb==4.0.11
      - gitpython==3.1.43
      - h11==0.14.0
      - hf-transfer==0.1.8
      - httpcore==1.0.7
      - httptools==0.6.4
      - httpx==0.27.2
      - huggingface-hub==0.26.3
      - iniconfig==2.0.0
      - interegular==0.3.3
      - jiter==0.8.0
      - jsonlines==4.0.0
      - jsonschema==4.23.0
      - jsonschema-specifications==2024.10.1
      - kiwisolver==1.4.7
      - lark==1.2.2
      - litellm==1.54.1
      - llvmlite==0.43.0
      - lm-format-enforcer==0.10.9
      - matplotlib==3.9.3
      - mistral-common==1.5.1
      - modelscope==1.21.0
      - msgpack==1.1.0
      - msgspec==0.18.6
      - multidict==6.1.0
      - multiprocess==0.70.16
      - numba==0.60.0
      - numpy==1.26.4
      - nvidia-ml-py==12.560.30
      - openai==1.55.3
      - opencv-python-headless==4.10.0.84
      - orjson==3.10.12
      - outlines==0.0.46
      - outlines-core==0.1.24
      - pandas==2.2.3
      - partial-json-parser==0.2.1.1.post4
      - pillow==10.4.0
      - pip==24.3.1
      - pluggy==1.5.0
      - prometheus-client==0.21.0
      - prometheus-fastapi-instrumentator==7.0.0
      - propcache==0.2.0
      - protobuf==5.29.0
      - py-cpuinfo==9.0.0
      - pyairports==2.1.1
      - pyarrow==18.1.0
      - pybind11==2.13.6
      - pycountry==24.6.1
      - pydantic==2.10.2
      - pydantic-core==2.27.1
      - pyparsing==3.2.0
      - pytest==8.3.4
      - python-dotenv==1.0.1
      - python-multipart==0.0.19
      - pytz==2024.2
      - ray==2.39.0
      - referencing==0.35.1
      - regex==2024.11.6
      - rpds-py==0.21.0
      - safetensors==0.4.5
      - scipy==1.14.1
      - seaborn==0.13.2
      - sentencepiece==0.2.0
      - sentry-sdk==2.19.0
      - setproctitle==1.3.4
      - sglang==0.4.0.post1
      - smmap==5.0.1
      - sniffio==1.3.1
      - starlette==0.41.3
      - sympy==1.13.1
      - tiktoken==0.7.0
      - tokenizers==0.20.3
      - torchao==0.7.0
      - tqdm==4.67.1
      - transformers==4.46.3
      - typing-extensions==4.12.2
      - tzdata==2024.2
      - uvicorn==0.32.1
      - uvloop==0.21.0
      - vllm==0.6.4.post1
      - wandb==0.18.7
      - watchfiles==1.0.0
      - websockets==14.1
      - xformers==0.0.28.post3
      - xgrammar==0.1.6
      - xxhash==3.5.0
      - yarl==1.18.0