Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
5. Please use English, otherwise it will be closed.
Describe the bug
Hi,
I'm running offline generation on 8x L40S GPUs. While building the CUDA graph, it raises: RuntimeError: CUDA error: operation not permitted on an event last recorded in a capturing stream.
Error Information
INFO 12-11 22:03:15 utils.py:961] Found nccl from library libnccl.so.2
INFO 12-11 22:03:15 utils.py:961] Found nccl from library libnccl.so.2
babel-12-29:256515:256515 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256515:256515 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256515:256515 [0] NCCL INFO cudaDriverVersion 12060
babel-12-29:256515:256515 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
babel-12-29:256515:256515 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
babel-12-29:256515:256515 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256515:256515 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
babel-12-29:256515:256515 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
babel-12-29:256515:256515 [0] NCCL INFO Using network Socket
babel-12-29:256515:256515 [0] NCCL INFO ncclCommInitRank comm 0xe528a80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4f000 commId 0xa9bdccc2329a5250 - Init START
babel-12-29:256515:256515 [0] NCCL INFO Bootstrap timings total 0.000354 (create 0.000023, send 0.000077, recv 0.000161, ring 0.000010, delay 0.000000)
babel-12-29:256515:256515 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
babel-12-29:256515:256515 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 1
babel-12-29:256515:256515 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
babel-12-29:256515:256515 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
babel-12-29:256515:256515 [0] NCCL INFO comm 0xe528a80 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
babel-12-29:256515:256515 [0] NCCL INFO Channel 00/02 : 0 1
babel-12-29:256515:256515 [0] NCCL INFO Channel 01/02 : 0 1
babel-12-29:256515:256515 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
babel-12-29:256515:256515 [0] NCCL INFO P2P Chunksize set to 131072
babel-12-29:256515:257048 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 37
babel-12-29:256515:257047 [0] NCCL INFO [Proxy Service] Device 0 CPU core 35
babel-12-29:256515:256515 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
babel-12-29:256515:256515 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
babel-12-29:256515:256515 [0] NCCL INFO Connected all rings
babel-12-29:256515:256515 [0] NCCL INFO Connected all trees
babel-12-29:256515:257051 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 6
babel-12-29:256515:256515 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
babel-12-29:256515:256515 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-12-29:256515:256515 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
babel-12-29:256515:256515 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
babel-12-29:256515:256515 [0] NCCL INFO ncclCommInitRank comm 0xe528a80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4f000 commId 0xa9bdccc2329a5250 - Init COMPLETE
babel-12-29:256515:256515 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.13 (kernels 0.09, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.04, rest 0.00)
babel-12-29:256515:256515 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256515:256515 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-12-29:256515:256515 [0] NCCL INFO NET/Plugin: Using internal network plugin.
babel-12-29:256515:256515 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.1
babel-12-29:256516:256516 [1] NCCL INFO cudaDriverVersion 12060
babel-12-29:256516:256516 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:256516 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256516:256516 [1] NCCL INFO NCCL version 2.23.4+cuda12.6
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
babel-12-29:256516:256516 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
babel-12-29:256516:256516 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:256516 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
babel-12-29:256516:256516 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
babel-12-29:256516:256516 [1] NCCL INFO Using network Socket
babel-12-29:256516:256516 [1] NCCL INFO ncclCommInitRank comm 0xf0f6040 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0xa9bdccc2329a5250 - Init START
babel-12-29:256516:256516 [1] NCCL INFO Bootstrap timings total 0.068977 (create 0.000023, send 0.000079, recv 0.068764, ring 0.000009, delay 0.000000)
babel-12-29:256516:256516 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
babel-12-29:256516:256516 [1] NCCL INFO NCCL_P2P_DISABLE set by environment to 1
babel-12-29:256516:256516 [1] NCCL INFO Setting affinity for GPU 1 to ffff,0000ffff
babel-12-29:256516:256516 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
babel-12-29:256516:256516 [1] NCCL INFO comm 0xf0f6040 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
babel-12-29:256516:256516 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
babel-12-29:256516:256516 [1] NCCL INFO P2P Chunksize set to 131072
babel-12-29:256516:257049 [1] NCCL INFO [Proxy Service] Device 1 CPU core 9
babel-12-29:256516:257050 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 44
babel-12-29:256516:256516 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
babel-12-29:256516:256516 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
babel-12-29:256516:256516 [1] NCCL INFO Connected all rings
babel-12-29:256516:256516 [1] NCCL INFO Connected all trees
babel-12-29:256516:257052 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 46
babel-12-29:256516:256516 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
babel-12-29:256516:256516 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-12-29:256516:256516 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
babel-12-29:256516:256516 [1] NCCL INFO ncclCommInitRank comm 0xf0f6040 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0xa9bdccc2329a5250 - Init COMPLETE
babel-12-29:256516:256516 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 2 total 0.21 (kernels 0.09, alloc 0.00, bootstrap 0.07, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.04, rest 0.00)
babel-12-29:256516:256516 [1] NCCL INFO cudaDriverVersion 12060
babel-12-29:256516:256516 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:256516 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-12-29:256516:256516 [1] NCCL INFO NET/Plugin: Using internal network plugin.
babel-12-29:256516:257069 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
babel-12-29:256516:257069 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
babel-12-29:256516:257069 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
babel-12-29:256516:257069 [1] NCCL INFO Using non-device net plugin version 0
babel-12-29:256516:257069 [1] NCCL INFO Using network Socket
babel-12-29:256516:257069 [1] NCCL INFO ncclCommInitRank comm 0x144f2a30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0x33fb9f7285907ed5 - Init START
babel-12-29:256
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 9.12it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 9.11it/s]
INFO 12-11 22:03:20 custom_all_reduce.py:224] Registering 49 cuda graph addresses
INFO 12-11 22:03:20 custom_all_reduce.py:224] Registering 49 cuda graph addresses
[2024-12-11 22:03:20 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
work.wait()
RuntimeError: CUDA error: operation not permitted on an event last recorded in a capturing stream
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 341, in capture_one_batch_size
out = run_once()
^^^^^^^^^^
File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 325, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/sglang/python/sglang/srt/models/qwen2.py", line 299, in forward
return self.logits_processor(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/sglang/python/sglang/srt/layers/logits_processor.py", line 184, in forward
last_logits = tensor_model_parallel_all_gather(last_logits)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/distributed/communication_op.py", line 17, in tensor_model_parallel_all_gather
return get_tp_group().all_gather(input_, dim)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 444, in all_gather
torch.distributed.all_gather_into_tensor(output_tensor,
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 85, in wrapper
msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 56, in _get_msg_dict
"args": f"{args}, {kwargs}",
^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor.py", line 523, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 708, in _str
return _str_intern(self, tensor_contents=tensor_contents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 625, in _str_intern
tensor_str = _tensor_str(self, indent)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor_str.py", line 339, in _tensor_str
self = self.float()
^^^^^^^^^^^^
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 207, in __init__
self.capture()
File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 268, in capture
) = self.capture_one_batch_size(bs, forward)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 340, in capture_one_batch_size
with torch.cuda.graph(graph, pool=self.graph_memory_pool, stream=stream):
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/cuda/graphs.py", line 186, in __exit__
self.cuda_graph.capture_end()
File "/home/xdang/anaconda3/envs/llm/lib/python3.11/site-packages/torch/cuda/graphs.py", line 84, in capture_end
super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/xdang/sglang/python/sglang/srt/managers/scheduler.py", line 1493, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/sglang/python/sglang/srt/managers/scheduler.py", line 191, in __init__
self.tp_worker = TpWorkerClass(
^^^^^^^^^^^^^^
File "/home/xdang/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/xdang/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in __init__
self.init_cuda_graphs()
File "/home/xdang/sglang/python/sglang/srt/model_executor/model_runner.py", line 631, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
^^^^^^^^^^^^^^^^^^^^^
File "/home/xdang/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 209, in __init__
raise Exception(
Exception: Capture cuda graph failed: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
[2024-12-11 22:03:20 TP1] Scheduler hit an exception: (traceback identical to TP0 above)
Reproduction
I'm using Qwen/Qwen2.5-0.5B with tensor parallel size = 2.
Script to run:
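(The exact script isn't pasted here; below is a minimal sketch of the offline-generation setup described above, assuming sglang's sgl.Engine offline API. The prompts and sampling parameters are illustrative placeholders, not the precise values from my run.)

# repro_sketch.py -- minimal offline generation with tensor parallelism
# (assumed API: sglang.Engine; prompts/sampling params are placeholders)
import sglang as sgl


def main():
    # Same model and tensor-parallel degree as in this report.
    # CUDA graph capture runs during Engine construction
    # (ModelRunner.init_cuda_graphs), which is where the error above is raised,
    # so generate() is never reached.
    llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B", tp_size=2)

    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = {"temperature": 0.8, "max_new_tokens": 64}

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(prompt, "->", output["text"])

    llm.shutdown()


if __name__ == "__main__":
    main()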
Environment
CUDA Version: 12.6
NVIDIA Driver Version: 560.35.03
GPU: 8x NVIDIA L40S
Here's the Conda Env: