[Bug] Deepseek-v2-lite AMD MI300 run failed #2384

Closed
5 tasks done
BruceXcluding opened this issue Dec 7, 2024 · 8 comments

@BruceXcluding
Contributor

BruceXcluding commented Dec 7, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

DeepSeek-V2-Lite fails to start in the ROCm environment: the Triton decode-attention kernel raises an out-of-shared-memory error (OutOfResources) during CUDA graph capture.

Bug report:

WARNING 12-07 02:43:18 rocm.py:17] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
[2024-12-07 02:43:23] server_args=ServerArgs(model_path='/data/deepseek-v2-lite/', tokenizer_path='/data/deepseek-v2-lite/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/data/deepseek-v2-lite/', chat_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.81, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, tp_size=8, stream_interval=1, random_seed=179983669, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2024-12-07 02:43:32 TP4] Process 3010 gpu_id 4 is running on CPUs: [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155]
[2024-12-07 02:43:33 TP4] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP4] Init torch distributed begin.
[2024-12-07 02:43:33 TP0] Process 3006 gpu_id 0 is running on CPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107]
[2024-12-07 02:43:33 TP1] Process 3007 gpu_id 1 is running on CPUs: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119]
[2024-12-07 02:43:33 TP5] Process 3011 gpu_id 5 is running on CPUs: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167]
[2024-12-07 02:43:33 TP7] Process 3139 gpu_id 7 is running on CPUs: [84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191]
[2024-12-07 02:43:33 TP6] Process 3075 gpu_id 6 is running on CPUs: [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179]
[2024-12-07 02:43:33 TP3] Process 3009 gpu_id 3 is running on CPUs: [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143]
[2024-12-07 02:43:33 TP1] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP1] Init torch distributed begin.
[2024-12-07 02:43:33 TP5] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP5] Init torch distributed begin.
[2024-12-07 02:43:33 TP0] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP0] Init torch distributed begin.
[2024-12-07 02:43:33 TP2] Process 3008 gpu_id 2 is running on CPUs: [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131]
[2024-12-07 02:43:33 TP7] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP7] Init torch distributed begin.
[2024-12-07 02:43:33 TP6] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP6] Init torch distributed begin.
[2024-12-07 02:43:33 TP3] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP3] Init torch distributed begin.
[2024-12-07 02:43:33 TP2] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP2] Init torch distributed begin.
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
[2024-12-07 02:43:36 TP4] Load weight begin. avail mem=185.83 GB
[2024-12-07 02:43:36 TP7] Load weight begin. avail mem=185.83 GB
[2024-12-07 02:43:36 TP0] Load weight begin. avail mem=184.31 GB
[2024-12-07 02:43:36 TP5] Load weight begin. avail mem=185.80 GB
[2024-12-07 02:43:36 TP6] Load weight begin. avail mem=185.70 GB
[2024-12-07 02:43:36 TP3] Load weight begin. avail mem=185.54 GB
[2024-12-07 02:43:36 TP2] Load weight begin. avail mem=186.38 GB
[2024-12-07 02:43:36 TP1] Load weight begin. avail mem=185.55 GB
[2024-12-07 02:43:36 TP7] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP4] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP7] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP4] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP3] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP6] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP3] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP6] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP5] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP5] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP2] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP0] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP1] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP0] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP2] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP1] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP7] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP4] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP5] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP6] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP3] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP0] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP2] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP1] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP7] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP4] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP5] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP6] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP3] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP0] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP2] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP1] lm_eval is not installed, GPTQ may not be usable
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [03:52<11:38, 232.83s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [07:58<08:01, 240.54s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [09:31<02:53, 173.19s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [10:55<00:00, 137.94s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [10:55<00:00, 163.93s/it]

[2024-12-07 02:54:33 TP2] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.61 GB
[2024-12-07 02:54:33 TP7] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.06 GB
[2024-12-07 02:54:33 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=180.54 GB
[2024-12-07 02:54:33 TP4] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.06 GB
[2024-12-07 02:54:33 TP5] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.04 GB
[2024-12-07 02:54:33 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=181.79 GB
[2024-12-07 02:54:33 TP6] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=181.93 GB
[2024-12-07 02:54:33 TP3] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=181.77 GB
[2024-12-07 02:54:33 TP5] Memory pool end. avail mem=34.51 GB
[2024-12-07 02:54:33 TP7] Memory pool end. avail mem=34.54 GB
[2024-12-07 02:54:33 TP3] Memory pool end. avail mem=34.25 GB
[2024-12-07 02:54:33 TP4] Memory pool end. avail mem=34.54 GB
[2024-12-07 02:54:33 TP6] Memory pool end. avail mem=34.41 GB
[2024-12-07 02:54:33 TP1] Memory pool end. avail mem=34.26 GB
[2024-12-07 02:54:33 TP0] Memory pool end. avail mem=33.02 GB
[2024-12-07 02:54:33 TP2] Memory pool end. avail mem=35.09 GB
[2024-12-07 02:54:35 TP1] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP2] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP6] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP7] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP4] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP3] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP5] Capture cuda graph begin. This can take up to several minutes.
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
[2024-12-07 02:54:42 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1493, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 191, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in __init__
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 631, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 207, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 268, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 332, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 325, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 823, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 784, in forward
    hidden_states, residual = layer(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 722, in forward
    hidden_states = self.self_attn(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 493, in forward
    return self.forward_absorb(positions, hidden_states, forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 579, in forward_absorb
    attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 58, in forward
    return forward_batch.attn_backend.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/__init__.py", line 59, in forward
    return self.forward_decode(q, k, v, layer, forward_batch, save_kv_cache)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 181, in forward_decode
    self.decode_attention_fwd(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py", line 701, in decode_attention_fwd
    decode_attention_fwd_grouped(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py", line 656, in decode_attention_fwd_grouped
    _decode_grouped_softmax_reducev_fwd(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py", line 567, in _decode_grouped_softmax_reducev_fwd
    _fwd_grouped_kernel_stage2[grid](
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/runtime/jit.py", line 687, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 392, in __getattribute__
    self._init_handles()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 385, in _init_handles
    raise OutOfResources(self.metadata.shared, max_shared, "shared memory")
triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 131072, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.

(The schedulers on TP0, TP1, TP2, TP3, TP5, TP6, and TP7 hit the same OutOfResources exception with identical tracebacks.)

Reproduction

python -m sglang.launch_server \
         --model-path /data/deepseek-v2-lite/ \
         --dp 1 \
         --tp 8 \
         --trust-remote-code

Environment

docker image henryx/haisgl:sgl0.3.2_vllm0.6.0_torch2.5_rocm6.2_triton3.0.0

@HaiShaw
Collaborator

HaiShaw commented Dec 7, 2024

@BruceXcluding Thanks for taking a look!

@cxmt-ai-tc

Same error when running /deepseek-coder-v2-instruct-awq on an A40.

@HaiShaw HaiShaw added the amd label Dec 9, 2024
@HaiShaw
Collaborator

HaiShaw commented Dec 9, 2024

@BruceXcluding @cxmt-ai-tc could you try changing BLOCK from 128 to 64 at the beginning of _decode_grouped_softmax_reducev_fwd?

@BruceXcluding
Contributor Author

It works after changing BLOCK to 64.

@cxmt-ai-tc

It works after changing BLOCK to 64.

How do I change it? Which file do I edit, and do I need to recompile or reinstall anything?

@BruceXcluding
Contributor Author

BruceXcluding commented Dec 9, 2024

It works after changing BLOCK to 64.

How do I change it? Which file do I edit, and do I need to recompile or reinstall anything?

Docker image: henryx/haisgl:sgl0.3.2_vllm0.6.0_torch2.5_rocm6.2_triton3.0.0
Edit /sglang/python/sglang/srt/layers/triton_attention/decode_attention.py at line 534 (e.g. with vim) and change BLOCK=128 to BLOCK=64.

@cxmt-ai-tc

It works after changing BLOCK to 64.

How do I change it? Which file do I edit, and do I need to recompile or reinstall anything?

Docker image: lmsysorg/sglang:v0.4.0.post1-rocm620. Edit /sglang/python/sglang/srt/layers/triton_attention/decode_attention.py at line 534 and change BLOCK=128 to BLOCK=64.

I changed BLOCK=128 to BLOCK=64 and got this error:

root@s0pgpuap12:/workspace# CUDA_VISIBLE_DEVICES=2,3,6,7 python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/ --port 50800 --host 0.0.0.0 --tp 4 --trust-remote-code
[2024-12-09 03:40:17] server_args=ServerArgs(model_path='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', tokenizer_path='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', chat_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=50800, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, tp_size=4, stream_interval=1, random_seed=741910078, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
INFO 12-09 03:40:17 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-09 03:40:24 TP2] MLA optimization is turned on. Use triton backend.
[2024-12-09 03:40:24 TP2] Init torch distributed begin.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-09 03:40:24 TP1] MLA optimization is turned on. Use triton backend.
[2024-12-09 03:40:24 TP1] Init torch distributed begin.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-09 03:40:24 TP3] MLA optimization is turned on. Use triton backend.
[2024-12-09 03:40:24 TP3] Init torch distributed begin.
INFO 12-09 03:40:24 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-09 03:40:24 TP0] MLA optimization is turned on. Use triton backend.
[2024-12-09 03:40:24 TP0] Init torch distributed begin.
INFO 12-09 03:40:24 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-09 03:40:24 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-09 03:40:24 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-09 03:40:24 utils.py:1008] Found nccl from library libnccl.so.2
WARNING 12-09 03:40:25 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-09 03:40:25 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-09 03:40:25 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-09 03:40:25 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2024-12-09 03:40:25 TP3] Load weight begin. avail mem=44.06 GB
[2024-12-09 03:40:25 TP0] Load weight begin. avail mem=44.06 GB
[2024-12-09 03:40:25 TP1] Load weight begin. avail mem=44.06 GB
[2024-12-09 03:40:25 TP2] Load weight begin. avail mem=44.06 GB
[2024-12-09 03:40:26 TP1] lm_eval is not installed, GPTQ may not be usable
[2024-12-09 03:40:26 TP2] lm_eval is not installed, GPTQ may not be usable
[2024-12-09 03:40:26 TP3] lm_eval is not installed, GPTQ may not be usable
[2024-12-09 03:40:26 TP0] lm_eval is not installed, GPTQ may not be usable
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Loading safetensors checkpoint shards: 0% Completed | 0/26 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 4% Completed | 1/26 [00:00<00:18, 1.37it/s]
Loading safetensors checkpoint shards: 8% Completed | 2/26 [00:01<00:21, 1.10it/s]
Loading safetensors checkpoint shards: 12% Completed | 3/26 [00:02<00:22, 1.01it/s]
Loading safetensors checkpoint shards: 15% Completed | 4/26 [00:03<00:21, 1.01it/s]
Loading safetensors checkpoint shards: 19% Completed | 5/26 [00:06<00:31, 1.51s/it]
Loading safetensors checkpoint shards: 23% Completed | 6/26 [00:07<00:27, 1.37s/it]
Loading safetensors checkpoint shards: 27% Completed | 7/26 [00:08<00:24, 1.28s/it]
Loading safetensors checkpoint shards: 31% Completed | 8/26 [00:09<00:21, 1.21s/it]
Loading safetensors checkpoint shards: 35% Completed | 9/26 [00:10<00:20, 1.19s/it]
Loading safetensors checkpoint shards: 38% Completed | 10/26 [00:11<00:18, 1.15s/it]
Loading safetensors checkpoint shards: 42% Completed | 11/26 [00:12<00:16, 1.11s/it]
Loading safetensors checkpoint shards: 46% Completed | 12/26 [00:13<00:15, 1.09s/it]
Loading safetensors checkpoint shards: 50% Completed | 13/26 [00:14<00:13, 1.08s/it]
Loading safetensors checkpoint shards: 54% Completed | 14/26 [00:15<00:12, 1.07s/it]
Loading safetensors checkpoint shards: 58% Completed | 15/26 [00:16<00:11, 1.05s/it]
Loading safetensors checkpoint shards: 62% Completed | 16/26 [00:17<00:10, 1.04s/it]
Loading safetensors checkpoint shards: 65% Completed | 17/26 [00:18<00:09, 1.02s/it]
Loading safetensors checkpoint shards: 69% Completed | 18/26 [00:19<00:08, 1.01s/it]
Loading safetensors checkpoint shards: 73% Completed | 19/26 [00:20<00:07, 1.03s/it]
Loading safetensors checkpoint shards: 77% Completed | 20/26 [00:21<00:06, 1.03s/it]
Loading safetensors checkpoint shards: 81% Completed | 21/26 [00:22<00:05, 1.03s/it]
Loading safetensors checkpoint shards: 85% Completed | 22/26 [00:24<00:04, 1.03s/it]
Loading safetensors checkpoint shards: 88% Completed | 23/26 [00:25<00:03, 1.04s/it]
Loading safetensors checkpoint shards: 92% Completed | 24/26 [00:26<00:02, 1.04s/it]
Loading safetensors checkpoint shards: 96% Completed | 25/26 [00:27<00:01, 1.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:27<00:00, 1.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:27<00:00, 1.07s/it]

[2024-12-09 03:41:08 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-09 03:41:08 TP2] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-09 03:41:08 TP3] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-09 03:41:08 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-09 03:41:08 TP1] Memory pool end. avail mem=5.30 GB
[2024-12-09 03:41:08 TP2] Memory pool end. avail mem=5.30 GB
[2024-12-09 03:41:08 TP3] Memory pool end. avail mem=5.30 GB
[2024-12-09 03:41:08 TP0] Memory pool end. avail mem=5.30 GB
[2024-12-09 03:41:08 TP2] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-09 03:41:08 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-09 03:41:08 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-09 03:41:08 TP3] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-09 03:41:08 TP2] Capture cuda graph begin. This can take up to several minutes.
[2024-12-09 03:41:08 TP1] Capture cuda graph begin. This can take up to several minutes.
[2024-12-09 03:41:08 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-12-09 03:41:08 TP3] Capture cuda graph begin. This can take up to several minutes.
loc("/workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): loc(error: "operation scheduled before its operands/
workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
WARNING 12-09 03:41:12 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
WARNING 12-09 03:41:12 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
WARNING 12-09 03:41:12 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
WARNING 12-09 03:41:12 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
[2024-12-09 03:41:12 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 207, in init
self.capture()
File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 268, in capture
) = self.capture_one_batch_size(bs, forward)
File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 332, in capture_one_batch_size
run_once()
File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 325, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 823, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 784, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 739, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 151, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 555, in forward
final_hidden_states = self.quant_method.apply(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 452, in apply
return fused_marlin_moe(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 219, in fused_marlin_moe
sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 45, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 844, in moe_align_block_size
torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
File "/usr/local/lib/python3.10/dist-packages/torch/ops.py", line 1061, in call
return self
._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1527, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 192, in init
self.tp_worker = TpWorkerClass(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in init
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in init
self.model_runner = ModelRunner(
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in init
self.init_cuda_graphs()
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 631, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 209, in init
raise Exception(
Exception: Capture cuda graph failed: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Possible solutions:

  1. disable cuda graph by --disable-cuda-graph
  2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
  3. disable torch compile by not using --enable-torch-compile
    Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

[2024-12-09 03:41:12 TP0/TP2/TP3] (The same 'Scheduler hit an exception' traceback, ending in 'Capture cuda graph failed: CUDA error: invalid argument' and the same 'Possible solutions' list, is repeated verbatim for the other three ranks; omitted here for brevity.)

Killed

cc @HaiShaw @binarycrayon

@BruceXcluding
Contributor Author

@cxmt-ai-tc Can you try the instructions in #2601?
