
[Feature, Hardware] Enable DeepseekV3 on AMD GPUs #2601

Merged: 27 commits merged into sgl-project:main on Jan 3, 2025

Conversation

@BruceXcluding (Contributor) commented on Dec 26, 2024:

Motivation

  • Support DeepseekV3 on AMD Instinct MI300X GPU

Modifications

  • Add a proper fix for AMD FP8 e4m3fnuz to support the DeepseekV3 FP8 model
  • Bypass the FlashInfer bmm_fp8 backend by casting FP8 to BF16 in MLA (see the sketch after this list)
  • Add the AMD Triton stages config
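
For illustration only, a minimal sketch of the bmm_fp8 bypass mentioned above; the helper name and the per-tensor scale are assumptions for the sketch, not the actual sglang code:

import torch

def mla_bmm_without_fp8_kernel(x: torch.Tensor, w_fp8: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    # Instead of calling flashinfer's bmm_fp8 (not usable on ROCm here),
    # dequantize the FP8 weight to BF16 and fall back to a plain torch.bmm.
    w_bf16 = w_fp8.to(torch.bfloat16) * w_scale  # per-tensor scale assumed
    return torch.bmm(x.to(torch.bfloat16), w_bf16)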

TODO

  • AMD base image testing (ROCm base image update #2692)
  • sgl-kernel: add AMD backend
  • DeepseekV3 MoE config optimization
  • Batched mm FP8 optimization on ROCm
  • Customized block FP8 quantization
  • DP attention optimization

How to run

build env

cd sglang/docker

docker build -t sglang-rocm:latest -f Dockerfile.rocm .
 
docker run -it --ipc=host \ 
               --cap-add=SYS_PTRACE \
               --network=host \ 
               --device=/dev/kfd --device=/dev/dri \
               --security-opt seccomp=unconfined \ 
               --group-add video \
               --privileged \
               -w /workspace sglang-rocm:latest 

offline:

python -m sglang.bench_one_batch --batch-size 32 --input 128 --output 32 --model /data/DeepSeek-V3-Base/ --tp 8 --trust-remote-code

Prefill. latency: 3.95045 s, throughput:   1036.84 token/s
Decode.  latency: 0.10960 s, throughput:    291.96 token/s
Decode.  latency: 0.10487 s, throughput:    305.14 token/s
Decode.  latency: 0.10468 s, throughput:    305.71 token/s
Decode.  latency: 0.10455 s, throughput:    306.07 token/s
Decode.  latency: 0.10458 s, throughput:    305.98 token/s
Decode.  median latency: 0.10458 s, median throughput:    305.98 token/s
Total. latency:  4.688 s, throughput:    928.38 token/s
Benchmark ...
Prefill. latency: 0.38250 s, throughput:  10708.55 token/s
Decode.  latency: 0.10400 s, throughput:    307.70 token/s
Decode.  latency: 0.10448 s, throughput:    306.28 token/s
Decode.  latency: 0.10446 s, throughput:    306.34 token/s
Decode.  latency: 0.10434 s, throughput:    306.70 token/s
Decode.  latency: 0.10454 s, throughput:    306.12 token/s
Decode.  median latency: 0.10429 s, median throughput:    306.83 token/s
Total. latency:  3.617 s, throughput:   1415.51 token/s

server:

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-Base --tp 8 --trust-remote-code

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8

Accuracy: 0.917
Invalid: 0.001
Latency: 164.768 s
Output throughput: 765.894 token/s
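
Once the server is up, a quick smoke test against SGLang's native /generate endpoint looks roughly like this (a sketch assuming the default port 30000):

import requests

# Send a single greedy-decoded prompt to the running server.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(resp.json()["text"])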

Issues

  • If you get an error like raise OutOfResources(self.metadata.shared, max_shared, "shared memory") (same as [Bug] Deepseek-v2-lite AMD MI300 run failed #2384), it is resolved by the change at python/sglang/srt/layers/attention/triton_ops/decode_attention.py +410.
  • If you get an error like ImportError: cannot import name 'build_regex_from_schema' from 'outlines.fsm.json_schema' (same as [Bug] SGLang v0.4.0 with AMD MI300X #2530), it is resolved by downgrading vllm.
  • If you get an error like RuntimeError: [enforce fail at /app/pytorch/third_party/gloo/gloo/transport/tcp/device.cc:83] ifa != nullptr. Unable to find address for: eth0, check your interface name with ifconfig and export GLOO_SOCKET_IFNAME=<your interface>.

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs added the bug and amd labels on Dec 26, 2024.
@carlushuang commented:

@HaiShaw

@HaiShaw (Collaborator) commented on Dec 26, 2024:

@BruceXcluding Can we just add the fix to unlock v3 from the triton kernel config error first?

@zhyncs (Member) commented on Dec 26, 2024:

@BruceXcluding Can we just add the fix to unlock v3 from the triton kernel config error first?

That would be nice. I plan to release v0.4.1.post1 soon to enable users to use AMD MI300X initially.

@HaiShaw (Collaborator) left a review:

@BruceXcluding Some comments to address, thanks!

@@ -37,7 +37,7 @@ ENV SGLANG_SET_CPU_AFFINITY=1
ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
ENV NCCL_MIN_NCHANNELS=112

- ENV MOE_PADDING=1
+ ENV MOE_PADDING=0
Collaborator:

We need to keep MOE_PADDING on for performance; the error it incurs is what we need to fix.

docker/Dockerfile.rocm (outdated comment, resolved)
@@ -402,7 +402,7 @@ def _decode_grouped_att_m_fwd(
sm_scale,
logit_cap,
):
- BLOCK = 32
+ BLOCK = 16 if is_hip() else 32
Collaborator:

We should not cut this in half for HIP globally here.

Contributor (author):

It doesn't work well with BLOCK = 32 on the latest vllm.

Collaborator:

We cannot take this part as is; it would cost all other models a large amount of performance.

@@ -217,7 +217,7 @@ def create_weights(

# WEIGHT
weight_dtype = (
- torch.float8_e4m3fn
+ torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn
Collaborator:

We should not have this; the serialized weight is always OCP (torch.float8_e4m3fn).

Contributor (author):

With torch.float8_e4m3fn, w8a8_block_fp8_matmul hits this error:

python/sglang/srt/layers/quantization/fp8_kernel.py:176:33: error: Unsupported conversion from 'f8E4M3FN' to 'f16'
    accumulator += tl.dot(a, b) * a_s[:, None] * b_s[None, :]

Collaborator:

Please check how normalize_e4m3fn_to_e4m3fnuz is used.
Basically, we do not expect a non-OCP (non-e4m3fn) dtype in the quantized model.
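
For context, a minimal sketch of what the fn-to-fnuz normalization at load time looks like (modeled loosely on vLLM's helper; the real implementation may differ in details):

import torch

def normalize_e4m3fn_to_e4m3fnuz(weight: torch.Tensor, weight_scale: torch.Tensor):
    # Reinterpret OCP e4m3fn weights as ROCm e4m3fnuz at load time.
    # The bit pattern 0x80 is -0.0 in e4m3fn but NaN in e4m3fnuz, so clear it first.
    assert weight.dtype == torch.float8_e4m3fn
    bits = weight.view(torch.int8)
    bits[bits == -128] = 0
    weight_fnuz = bits.view(torch.float8_e4m3fnuz)
    # The same bits encode half the value in fnuz (exponent bias 8 vs 7),
    # so double the scale to keep dequantized values unchanged.
    return weight_fnuz, weight_scale * 2.0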

@@ -430,7 +432,7 @@ def get_default_config(
dtype: Optional[str],
is_marlin: bool,
) -> Dict[str, int]:
- if dtype == "fp8_w8a8":
+ if dtype == "fp8_w8a8" and not is_hip():
Collaborator:

The following block isn't a breaker for HIP.

@HaiShaw (Collaborator) left a review:

@BruceXcluding
Also see this error below with your version of pyproject.toml:

  File "/dockerx/1226/HS/sglang/python/sglang/srt/constrained/outlines_backend.py", line 23, in <module>
    from outlines.fsm.json_schema import build_regex_from_schema
ImportError: cannot import name 'build_regex_from_schema' from 'outlines.fsm.json_schema' (/usr/local/lib/python3.12/dist-packages/outlines/fsm/json_schema.py)

@ZJLi2013 commented:

The CI failure in PR Test / unit-test-backend-2-gpu used the lite model 'deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct', which doesn't have the FP8 block-level quant feature.

@BruceXcluding marked this pull request as ready for review on December 27, 2024, 05:56.

@@ -432,7 +432,7 @@ def create_weights(
from sglang.srt.layers.moe.fused_moe_triton import FusedMoeWeightScaleSupported

if self.quant_config.is_checkpoint_fp8_serialized:
- params_dtype = torch.float8_e4m3fn
+ params_dtype = torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn
Collaborator:

Same problem here; check out the previous usage of normalize_e4m3fn_to_e4m3fnuz.


def is_hip() -> bool:
"""Return whether it is HIP on the AMD ROCm platform."""
return torch.version.hip is not None
Member:

Suggested change:
- return torch.version.hip is not None
+ return torch.cuda.is_available() and torch.version.hip
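
As a side note (my sketch, not from the thread): torch.version.hip is a version string on ROCm builds and None otherwise, so a strict-bool variant of that suggestion would look like:

import torch

def is_hip() -> bool:
    """Return True when torch is a ROCm/HIP build and a GPU is visible."""
    return bool(torch.cuda.is_available() and torch.version.hip)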


@@ -27,7 +27,7 @@ srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cu

# HIP (Heterogeneous-computing Interface for Portability) for AMD
# => base docker rocm/vllm-dev:20241022, not from public vllm whl
- srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.dev13"]
+ srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.post2.dev1+g1ef171e0.rocm624"]
Member:

What issues could occur if the image isn't updated? Minimize updating the base image whenever possible.

Collaborator:

@zhyncs we (AMD) will have to decide on this, so ignore it for now.

"sgl_kernel.ops.moe_align_block_size",
[
"src/sgl-kernel/csrc/moe_align_kernel.cu",
"src/sgl-kernel/csrc/sgl_kernel_ops.cu",
Member:

If you need to compile for AMD, I recommend not compiling sgl_kernel_ops.cu directly. Instead, use a separate file to avoid mixing NVIDIA and AMD .cu files; it's better to keep them separate. cc @HaiShaw @ispobock @merrymercy

Member:

Do you have any suggestions? @yzh119

Comment:

It seems we need to compile the reduce kernel here; otherwise, on some archs the import fails with No module named 'sgl_kernel.ops._kernels'.

Member:

Could we use is_hip there?

Collaborator:

@zhyncs For kernel files that are CUDA/HIP compatible, we don't use separate files (that is the point of HIP), and I believe this is one of those cases. We certainly do use separate files for AMD-specific kernels or kernel implementations.

Collaborator:

@zyeric the else: case seemingly has no impact on the NV side; can you be more specific?

Comment:

Maybe it's better to separate the AMD/NV kernels into two different backends? At the moment moe_align_kernel is only required for the AMD backend, and in the near future CK kernels will be added to the AMD backend.

Comment:

@HaiShaw I think the root cause is that the import path is still sgl_kernel.ops._kernels at https://github.com/BruceXcluding/sglang/blob/main/sgl-kernel/src/sgl-kernel/ops/__init__.py#L1

@zyeric commented on Dec 31, 2024:

Current version works for me, many thanks :D

Accuracy: 0.951
Invalid: 0.000
Latency: 160.916 s
Output throughput: 869.145 token/s

@BruceXcluding marked this pull request as a draft on December 31, 2024, 01:36.
@BruceXcluding (Contributor, author) commented:

@AdjectiveAllison we are targeting the FP8 accuracy issue; do you see garbled output with BF16 as well? We will tune performance with a provided config.json soon. Are you using MI308?

No, output on full BF16 works perfectly. I'm on an 8x MI300X machine, 192 GB of VRAM each.

@AdjectiveAllison Can you try with the latest instructions?

@@ -578,8 +578,9 @@ def _set_envs_and_config(server_args: ServerArgs):
os.environ["NCCL_NVLS_ENABLE"] = "0"
os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4"
if "GLOO_SOCKET_IFNAME" not in os.environ:
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"
# TODO(fix socket error with gpu backend)
Member:

Why is this commented out?

Comment:

Is this used for the CPU backend or a specific workstation? We get RuntimeError: [enforce fail at pytorch/third_party/gloo/gloo/transport/tcp/device.cc:83] ifa != nullptr. Unable to find address for: eth0

Member:

This is used for multi-node tensor parallelism. Instead of commenting it out, we suggest gating it with an is_hip flag.

@wufann commented:

I think the value set for the GLOO_SOCKET_IFNAME environment variable should depend on the name of the network interface card in each user's system and should not be hard-coded as eth0

Member:

@wufann If the user's interface is not eth0, they should specify it explicitly; the default applies only when no setting is provided, with eth0 as the fallback.

Comment:

@zhyncs A different network interface (e.g. "ens…") may be used. Also, they may test in a single-node environment where IB is not configured; in that case IB should be disabled.
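
To illustrate the suggestion in this thread (a sketch only, with a hypothetical helper name, not code from this PR), the default could be gated on both an is_hip check and the user not having set the variable:

import os
import torch

def maybe_set_gloo_ifname() -> None:
    # Only fall back to eth0 when the user has not set GLOO_SOCKET_IFNAME,
    # and skip the default on ROCm/HIP builds where it triggered
    # "Unable to find address for: eth0".
    is_hip = torch.version.hip is not None
    if not is_hip and "GLOO_SOCKET_IFNAME" not in os.environ:
        os.environ["GLOO_SOCKET_IFNAME"] = "eth0"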

@@ -0,0 +1,51 @@
cmake_minimum_required(VERSION 3.18)
@zhyncs (Member) commented on Dec 31, 2024:

Please remove this; we only use CMakeLists.txt for clangd indexing, so it's not necessary.

build-backend = "setuptools.build_meta"

[project]
name = "sgl-kernel"
Member:

Can we refer to the setup of flash-attention or vllm, which are compatible with both NVIDIA and AMD?
https://github.com/Dao-AILab/flash-attention/blob/main/setup.py
https://github.com/vllm-project/vllm/blob/main/setup.py

@zhyncs (Member) commented on Jan 2, 2025:

Hi @BruceXcluding @HaiShaw
#2712
You can now try using moe_align_block_size_triton on AMD.

@BruceXcluding (Contributor, author) commented:

Hi @BruceXcluding @HaiShaw #2712 You can now try using moe_align_block_size_triton on AMD.

Tested and it works well. We could build sgl-kernel-amd after we add the CK kernels.

@HaiShaw (Collaborator) commented on Jan 2, 2025:

Hi @BruceXcluding @HaiShaw #2712 You can now try using moe_align_block_size_triton on AMD.

Tested and works well. We could build sgl-kernel-amd after we add ck kernels

@BruceXcluding, how was the performance compared to sgl-kernel-amd?

@zhyncs (Member) commented on Jan 2, 2025:

Hi @BruceXcluding @HaiShaw
Before releasing v0.4.1.post4 #2713, I hope the main branch has a version compatible with AMD MI300X. What minimal changes are needed to achieve this? The requirement is just to get it running, performance optimization can be done later.

@zhyncs marked this pull request as ready for review on January 2, 2025, 18:42.
@HaiShaw (Collaborator) commented on Jan 2, 2025:

@zhyncs I am expecting @BruceXcluding to do the final update.
@BruceXcluding can you confirm the decode_attention.py change?

@HaiShaw (Collaborator) left a review:

@BruceXcluding thanks!

@HaiShaw dismissed merrymercy's stale review on January 3, 2025, 00:23:

"had been addressed above"

@HaiShaw merged commit c7ae474 into sgl-project:main on Jan 3, 2025.
15 checks passed.
@BruceXcluding (Contributor, author) commented:

Hi @BruceXcluding @HaiShaw Before releasing v0.4.1.post4 #2713, I hope the main branch has a version compatible with AMD MI300X. What minimal changes are needed to achieve this? The requirement is just to get it running, performance optimization can be done later.

Thanks @zhyncs @HaiShaw. We will keep the TODO list on track for performance improvements.

XiaotongJiang pushed a commit to XiaotongJiang/sglang that referenced this pull request Jan 3, 2025
Co-authored-by: root <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: Bruce Xue <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
@yiakwy-xpu-ml-framework-team commented:

Hi @BruceXcluding @HaiShaw Before releasing v0.4.1.post4 #2713, I hope the main branch has a version compatible with AMD MI300X. What minimal changes are needed to achieve this? The requirement is just to get it running, performance optimization can be done later.

Thanks @zhyncs @HaiShaw. we will keep the TODO list on track for performance improvement.

Yes, the theoretical throughput is roughly:

4800 (memory transaction speed) / (671 / 8) * 1.8 (MTP multiplier) ~ 100 tok/gpu/sec

There is room to improve.
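
Working that estimate through (a quick check, taking the figures above at face value):

# bandwidth / per-GPU weight footprint * MTP multiplier
print(4800 / (671 / 8) * 1.8)  # ~= 103 tok/gpu/sec, i.e. roughly 100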
