[Bug] Deepseek-v2-lite AMD MI300 run failed #2384

Closed
5 tasks done
BruceXcluding opened this issue Dec 7, 2024 · 8 comments

@BruceXcluding
Contributor

BruceXcluding commented Dec 7, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

DeepSeek-V2-Lite fails to start in the ROCm environment: the Triton decode-attention kernel raises an out-of-shared-memory error (OutOfResources) during CUDA graph capture.

Bug report:

WARNING 12-07 02:43:18 rocm.py:17] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
[2024-12-07 02:43:23] server_args=ServerArgs(model_path='/data/deepseek-v2-lite/', tokenizer_path='/data/deepseek-v2-lite/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/data/deepseek-v2-lite/', chat_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.81, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, tp_size=8, stream_interval=1, random_seed=179983669, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2024-12-07 02:43:32 TP4] Process 3010 gpu_id 4 is running on CPUs: [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155]
[2024-12-07 02:43:33 TP4] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP4] Init torch distributed begin.
[2024-12-07 02:43:33 TP0] Process 3006 gpu_id 0 is running on CPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107]
[2024-12-07 02:43:33 TP1] Process 3007 gpu_id 1 is running on CPUs: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119]
[2024-12-07 02:43:33 TP5] Process 3011 gpu_id 5 is running on CPUs: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167]
[2024-12-07 02:43:33 TP7] Process 3139 gpu_id 7 is running on CPUs: [84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191]
[2024-12-07 02:43:33 TP6] Process 3075 gpu_id 6 is running on CPUs: [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179]
[2024-12-07 02:43:33 TP3] Process 3009 gpu_id 3 is running on CPUs: [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143]
[2024-12-07 02:43:33 TP1] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP1] Init torch distributed begin.
[2024-12-07 02:43:33 TP5] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP5] Init torch distributed begin.
[2024-12-07 02:43:33 TP0] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP0] Init torch distributed begin.
[2024-12-07 02:43:33 TP2] Process 3008 gpu_id 2 is running on CPUs: [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131]
[2024-12-07 02:43:33 TP7] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP7] Init torch distributed begin.
[2024-12-07 02:43:33 TP6] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP6] Init torch distributed begin.
[2024-12-07 02:43:33 TP3] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP3] Init torch distributed begin.
[2024-12-07 02:43:33 TP2] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP2] Init torch distributed begin.
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
[2024-12-07 02:43:36 TP4] Load weight begin. avail mem=185.83 GB
[2024-12-07 02:43:36 TP7] Load weight begin. avail mem=185.83 GB
[2024-12-07 02:43:36 TP0] Load weight begin. avail mem=184.31 GB
[2024-12-07 02:43:36 TP5] Load weight begin. avail mem=185.80 GB
[2024-12-07 02:43:36 TP6] Load weight begin. avail mem=185.70 GB
[2024-12-07 02:43:36 TP3] Load weight begin. avail mem=185.54 GB
[2024-12-07 02:43:36 TP2] Load weight begin. avail mem=186.38 GB
[2024-12-07 02:43:36 TP1] Load weight begin. avail mem=185.55 GB
[2024-12-07 02:43:36 TP7] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP4] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP7] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP4] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP3] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP6] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP3] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP6] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP5] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP5] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP2] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP0] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP1] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP0] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP2] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP1] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP7] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP4] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP5] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP6] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP3] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP0] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP2] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP1] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP7] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP4] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP5] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP6] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP3] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP0] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP2] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP1] lm_eval is not installed, GPTQ may not be usable
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [03:52<11:38, 232.83s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [07:58<08:01, 240.54s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [09:31<02:53, 173.19s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [10:55<00:00, 137.94s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [10:55<00:00, 163.93s/it]

[2024-12-07 02:54:33 TP2] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.61 GB
[2024-12-07 02:54:33 TP7] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.06 GB
[2024-12-07 02:54:33 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=180.54 GB
[2024-12-07 02:54:33 TP4] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.06 GB
[2024-12-07 02:54:33 TP5] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.04 GB
[2024-12-07 02:54:33 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=181.79 GB
[2024-12-07 02:54:33 TP6] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=181.93 GB
[2024-12-07 02:54:33 TP3] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=181.77 GB
[2024-12-07 02:54:33 TP5] Memory pool end. avail mem=34.51 GB
[2024-12-07 02:54:33 TP7] Memory pool end. avail mem=34.54 GB
[2024-12-07 02:54:33 TP3] Memory pool end. avail mem=34.25 GB
[2024-12-07 02:54:33 TP4] Memory pool end. avail mem=34.54 GB
[2024-12-07 02:54:33 TP6] Memory pool end. avail mem=34.41 GB
[2024-12-07 02:54:33 TP1] Memory pool end. avail mem=34.26 GB
[2024-12-07 02:54:33 TP0] Memory pool end. avail mem=33.02 GB
[2024-12-07 02:54:33 TP2] Memory pool end. avail mem=35.09 GB
[2024-12-07 02:54:35 TP1] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP2] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP6] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP7] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP4] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP3] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP5] Capture cuda graph begin. This can take up to several minutes.
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
[2024-12-07 02:54:42 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1493, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 191, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in __init__
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 631, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 207, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 268, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 332, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 325, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 823, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 784, in forward
    hidden_states, residual = layer(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 722, in forward
    hidden_states = self.self_attn(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 493, in forward
    return self.forward_absorb(positions, hidden_states, forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 579, in forward_absorb
    attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 58, in forward
    return forward_batch.attn_backend.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/__init__.py", line 59, in forward
    return self.forward_decode(q, k, v, layer, forward_batch, save_kv_cache)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 181, in forward_decode
    self.decode_attention_fwd(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py", line 701, in decode_attention_fwd
    decode_attention_fwd_grouped(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py", line 656, in decode_attention_fwd_grouped
    _decode_grouped_softmax_reducev_fwd(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py", line 567, in _decode_grouped_softmax_reducev_fwd
    _fwd_grouped_kernel_stage2[grid](
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/runtime/jit.py", line 687, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 392, in __getattribute__
    self._init_handles()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 385, in _init_handles
    raise OutOfResources(self.metadata.shared, max_shared, "shared memory")
triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 131072, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.

(The schedulers on TP0, TP1, TP2, TP3, TP5, TP6, and TP7 hit the same OutOfResources exception with identical tracebacks.)

Reproduction

python -m sglang.launch_server \
         --model-path /data/deepseek-v2-lite/ \
         --dp 1 \
         --tp 8 \
         --trust-remote-code

Environment

docker image henryx/haisgl:sgl0.3.2_vllm0.6.0_torch2.5_rocm6.2_triton3.0.0

@HaiShaw
Collaborator

HaiShaw commented Dec 7, 2024

@BruceXcluding Thanks for taking a look!

@cxmt-ai-tc

Same error when running /deepseek-coder-v2-instruct-awq on an A40.

@HaiShaw HaiShaw added the amd label Dec 9, 2024
@HaiShaw
Collaborator

HaiShaw commented Dec 9, 2024

@BruceXcluding @cxmt-ai-tc could you try changing BLOCK from 128 to 64 at the beginning of _decode_grouped_softmax_reducev_fwd?

@BruceXcluding
Contributor Author

It works after changing BLOCK to 64.

@cxmt-ai-tc

It works after changing BLOCK to 64.

How do I change it? Which file do I edit, and do I need to recompile or reinstall anything?

@BruceXcluding
Contributor Author

BruceXcluding commented Dec 9, 2024

It works after changing BLOCK to 64.

How do I change it? Which file do I edit, and do I need to recompile or reinstall anything?

Docker image: henryx/haisgl:sgl0.3.2_vllm0.6.0_torch2.5_rocm6.2_triton3.0.0
Edit /sglang/python/sglang/srt/layers/triton_attention/decode_attention.py at line 534 (e.g. with vim) and change BLOCK=128 to BLOCK=64.

@cxmt-ai-tc

It works after changing BLOCK to 64.

How do I change it? Which file do I edit, and do I need to recompile or reinstall anything?

Docker image: lmsysorg/sglang:v0.4.0.post1-rocm620. Edit /sglang/python/sglang/srt/layers/triton_attention/decode_attention.py at line 534 and change BLOCK=128 to BLOCK=64.

I changed BLOCK=128 to BLOCK=64 and got this error:

root@s0pgpuap12:/workspace# CUDA_VISIBLE_DEVICES=2,3,6,7 python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/ --port 50800 --host 0.0.0.0 --tp 4 --trust-remote-code
[2024-12-09 03:40:17] server_args=ServerArgs(model_path='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', tokenizer_path='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', chat_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=50800, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, tp_size=4, stream_interval=1, random_seed=741910078, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
INFO 12-09 03:40:17 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-09 03:40:24 TP2] MLA optimization is turned on. Use triton backend.
[2024-12-09 03:40:24 TP2] Init torch distributed begin.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-09 03:40:24 TP1] MLA optimization is turned on. Use triton backend.
[2024-12-09 03:40:24 TP1] Init torch distributed begin.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-09 03:40:24 TP3] MLA optimization is turned on. Use triton backend.
[2024-12-09 03:40:24 TP3] Init torch distributed begin.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-09 03:40:24 TP0] MLA optimization is turned on. Use triton backend.
[2024-12-09 03:40:24 TP0] Init torch distributed begin.
INFO 12-09 03:40:24 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-09 03:40:24 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-09 03:40:24 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-09 03:40:24 utils.py:1008] Found nccl from library libnccl.so.2
WARNING 12-09 03:40:25 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-09 03:40:25 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-09 03:40:25 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-09 03:40:25 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2024-12-09 03:40:25 TP3] Load weight begin. avail mem=44.06 GB
[2024-12-09 03:40:25 TP0] Load weight begin. avail mem=44.06 GB
[2024-12-09 03:40:25 TP1] Load weight begin. avail mem=44.06 GB
[2024-12-09 03:40:25 TP2] Load weight begin. avail mem=44.06 GB
[2024-12-09 03:40:26 TP1] lm_eval is not installed, GPTQ may not be usable
[2024-12-09 03:40:26 TP2] lm_eval is not installed, GPTQ may not be usable
[2024-12-09 03:40:26 TP3] lm_eval is not installed, GPTQ may not be usable
[2024-12-09 03:40:26 TP0] lm_eval is not installed, GPTQ may not be usable
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Loading safetensors checkpoint shards: 0% Completed | 0/26 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 4% Completed | 1/26 [00:00<00:18, 1.37it/s]
Loading safetensors checkpoint shards: 8% Completed | 2/26 [00:01<00:21, 1.10it/s]
Loading safetensors checkpoint shards: 12% Completed | 3/26 [00:02<00:22, 1.01it/s]
Loading safetensors checkpoint shards: 15% Completed | 4/26 [00:03<00:21, 1.01it/s]
Loading safetensors checkpoint shards: 19% Completed | 5/26 [00:06<00:31, 1.51s/it]
Loading safetensors checkpoint shards: 23% Completed | 6/26 [00:07<00:27, 1.37s/it]
Loading safetensors checkpoint shards: 27% Completed | 7/26 [00:08<00:24, 1.28s/it]
Loading safetensors checkpoint shards: 31% Completed | 8/26 [00:09<00:21, 1.21s/it]
Loading safetensors checkpoint shards: 35% Completed | 9/26 [00:10<00:20, 1.19s/it]
Loading safetensors checkpoint shards: 38% Completed | 10/26 [00:11<00:18, 1.15s/it]
Loading safetensors checkpoint shards: 42% Completed | 11/26 [00:12<00:16, 1.11s/it]
Loading safetensors checkpoint shards: 46% Completed | 12/26 [00:13<00:15, 1.09s/it]
Loading safetensors checkpoint shards: 50% Completed | 13/26 [00:14<00:13, 1.08s/it]
Loading safetensors checkpoint shards: 54% Completed | 14/26 [00:15<00:12, 1.07s/it]
Loading safetensors checkpoint shards: 58% Completed | 15/26 [00:16<00:11, 1.05s/it]
Loading safetensors checkpoint shards: 62% Completed | 16/26 [00:17<00:10, 1.04s/it]
Loading safetensors checkpoint shards: 65% Completed | 17/26 [00:18<00:09, 1.02s/it]
Loading safetensors checkpoint shards: 69% Completed | 18/26 [00:19<00:08, 1.01s/it]
Loading safetensors checkpoint shards: 73% Completed | 19/26 [00:20<00:07, 1.03s/it]
Loading safetensors checkpoint shards: 77% Completed | 20/26 [00:21<00:06, 1.03s/it]
Loading safetensors checkpoint shards: 81% Completed | 21/26 [00:22<00:05, 1.03s/it]
Loading safetensors checkpoint shards: 85% Completed | 22/26 [00:24<00:04, 1.03s/it]
Loading safetensors checkpoint shards: 88% Completed | 23/26 [00:25<00:03, 1.04s/it]
Loading safetensors checkpoint shards: 92% Completed | 24/26 [00:26<00:02, 1.04s/it]
Loading safetensors checkpoint shards: 96% Completed | 25/26 [00:27<00:01, 1.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:27<00:00, 1.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:27<00:00, 1.07s/it]

[2024-12-09 03:41:08 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-09 03:41:08 TP2] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-09 03:41:08 TP3] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-09 03:41:08 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-09 03:41:08 TP1] Memory pool end. avail mem=5.30 GB
[2024-12-09 03:41:08 TP2] Memory pool end. avail mem=5.30 GB
[2024-12-09 03:41:08 TP3] Memory pool end. avail mem=5.30 GB
[2024-12-09 03:41:08 TP0] Memory pool end. avail mem=5.30 GB
[2024-12-09 03:41:08 TP2] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-09 03:41:08 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-09 03:41:08 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-09 03:41:08 TP3] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-09 03:41:08 TP2] Capture cuda graph begin. This can take up to several minutes.
[2024-12-09 03:41:08 TP1] Capture cuda graph begin. This can take up to several minutes.
[2024-12-09 03:41:08 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-12-09 03:41:08 TP3] Capture cuda graph begin. This can take up to several minutes.
loc("/workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): loc(error: "operation scheduled before its operands/
workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
WARNING 12-09 03:41:12 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
WARNING 12-09 03:41:12 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
WARNING 12-09 03:41:12 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
WARNING 12-09 03:41:12 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
[2024-12-09 03:41:12 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 207, in init
self.capture()
File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 268, in capture
) = self.capture_one_batch_size(bs, forward)
File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 332, in capture_one_batch_size
run_once()
File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 325, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 823, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 784, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 739, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 151, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 555, in forward
final_hidden_states = self.quant_method.apply(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 452, in apply
return fused_marlin_moe(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 219, in fused_marlin_moe
sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 45, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 844, in moe_align_block_size
torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
File "/usr/local/lib/python3.10/dist-packages/torch/ops.py", line 1061, in call
return self
._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1527, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 192, in init
self.tp_worker = TpWorkerClass(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in init
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in init
self.model_runner = ModelRunner(
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in init
self.init_cuda_graphs()
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 631, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 209, in init
raise Exception(
Exception: Capture cuda graph failed: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Possible solutions:

  1. disable cuda graph by --disable-cuda-graph
  2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
  3. disable torch compile by not using --enable-torch-compile
    Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

[2024-12-09 03:41:12 TP0/TP2/TP3] (The same 'Scheduler hit an exception' traceback, ending in 'Capture cuda graph failed: CUDA error: invalid argument' and the same 'Possible solutions' list, is repeated verbatim for the other three ranks; omitted here for brevity.)

Killed

cc @HaiShaw @binarycrayon

@BruceXcluding
Contributor Author

@cxmt-ai-tc Can you try the instructions in #2601?
