Issues with throughput for 405B model #1235
Closed
dmakhervaks started this conversation in General
Replies: 1 comment
-
Hello, I am trying to replicate the throughput numbers quoted in your blog post: https://lmsys.org/blog/2024-07-25-sglang-llama3/#llama-405b-on-8-x-h100-fp8
However, the throughput I am getting is roughly 10x lower. Can you please help?
Which arguments were used to run Llama 3.1 405B to produce those benchmark results? Here are some of the variants you have specified across your blog and GitHub repo:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8

GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph

GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph

python -m sglang.launch_server --model ~/llama-3.1-405b-fp8-dummy/ --load-format dummy --tp 8 --quant fp8 --disable-radix --mem-frac 0.87
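For reference, throughput in the blog appears to be measured with sglang's bundled serving benchmark (sglang.bench_serving). A minimal sketch of such a client run is below; the flag names are recalled from the blog_v0_2 instructions and may differ across versions, so treat them as assumptions and check the benchmark README.

# Sketch of a client-side throughput measurement against a running server.
# Flag names (--dataset-name, --num-prompts, --random-input, --random-output)
# are assumptions based on the blog_v0_2 benchmark; verify before use.
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 3000 \
  --random-input 1024 \
  --random-output 1024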
-
The blog post benchmark runs an FP8 checkpoint on a single node. See the instructions here: https://github.com/sgl-project/sglang/tree/main/benchmark/blog_v0_2
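Assuming the single-node FP8 launch quoted above, a quick smoke test against the server before benchmarking could look like the sketch below. It uses sglang's native /generate endpoint and default port 30000; the request shape follows sglang's documented native API but should be checked against the version in use.

# Minimal sketch: confirm the single-node FP8 server responds before benchmarking.
# Assumes sglang's default port (30000) and its native /generate endpoint.
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 16}}'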