Issues with throughput for 405B model #1235
Closed
dmakhervaks started this conversation in General
Replies: 1 comment
-
Hello, I am trying to replicate the throughput numbers quoted in your blog post: https://lmsys.org/blog/2024-07-25-sglang-llama3/#llama-405b-on-8-x-h100-fp8
However, the throughput I am getting is roughly 10x lower. Can you please help?
Which arguments were used to run Llama 3.1 405B to produce those benchmark results? Here are some of the variants you have specified across your blog and GitHub repo:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8

GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph

GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph

python -m sglang.launch_server --model ~/llama-3.1-405b-fp8-dummy/ --load-format dummy --tp 8 --quant fp8 --disable-radix --mem-frac 0.87
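For reference, throughput in the blog appears to be measured with sglang's bundled serving benchmark (sglang.bench_serving). A minimal sketch of such a client run is below; the flag names are recalled from the blog_v0_2 instructions and may differ across versions, so treat them as assumptions and check the benchmark README.

# Sketch of a client-side throughput measurement against a running server.
# Flag names (--dataset-name, --num-prompts, --random-input, --random-output)
# are assumptions based on the blog_v0_2 benchmark; verify before use.
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 3000 \
  --random-input 1024 \
  --random-output 1024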
-
The blog post benchmark runs an FP8 checkpoint on a single node. See the instructions here: https://github.com/sgl-project/sglang/tree/main/benchmark/blog_v0_2
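Assuming the single-node FP8 launch quoted above, a quick smoke test against the server before benchmarking could look like the sketch below. It uses sglang's native /generate endpoint and default port 30000; the request shape follows sglang's documented native API but should be checked against the version in use.

# Minimal sketch: confirm the single-node FP8 server responds before benchmarking.
# Assumes sglang's default port (30000) and its native /generate endpoint.
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 16}}'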