
AWQ performance tracking #1505

Open · 2 tasks

zhyncs (Member) opened this issue Sep 24, 2024 · 1 comment

Checklist

Motivation

Current Situation

SGLang

# v0.3.1.post3
pip install --upgrade pip
pip install "sglang[all]"

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
python3 -m sglang.launch_server --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --disable-radix-cache

python3 bench_serving.py --backend sglang --num-prompts 5000
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  161.16
Total input tokens:                      1130466
Total generated tokens:                  971613
Total generated tokens (retokenized):    970868
Request throughput (req/s):              31.02
Input token throughput (tok/s):          7014.49
Output token throughput (tok/s):         6028.81
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   87157.00
Median E2E Latency (ms):                 87767.15
---------------Time to First Token----------------
Mean TTFT (ms):                          52751.14
Median TTFT (ms):                        42772.56
P99 TTFT (ms):                           122414.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          289.26
Median TPOT (ms):                        202.07
P99 TPOT (ms):                           1915.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           183.11
Median ITL (ms):                         119.46
P99 ITL (ms):                            686.84
==================================================
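As a sanity check on the table above, the headline throughput numbers can be re-derived from the raw counts. This is a small illustrative sketch, not part of `bench_serving.py`; the variable names are local conveniences:

```python
# Re-derive the SGLang headline throughputs from the raw counts reported above.
duration_s = 161.16        # Benchmark duration (s)
num_requests = 5000        # Successful requests
input_tokens = 1_130_466   # Total input tokens
output_tokens = 971_613    # Total generated tokens

req_throughput = num_requests / duration_s          # ~31.02 req/s
input_tok_throughput = input_tokens / duration_s    # ~7014 tok/s
output_tok_throughput = output_tokens / duration_s  # ~6029 tok/s

print(f"{req_throughput:.2f} req/s, "
      f"{input_tok_throughput:.1f} in tok/s, "
      f"{output_tok_throughput:.1f} out tok/s")
```

The small rounding differences against the table come from the benchmark script using the unrounded duration internally.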

LMDeploy

pip3 install https://github.com/zhyncs/lmdeploy-build/releases/download/bf89a01/lmdeploy-0.6.0+cu121+bf89a01-cp310-cp310-manylinux2014_x86_64.whl
python3 -m lmdeploy serve api_server hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4

python3 bench_serving.py --backend lmdeploy --num-prompts 5000
============ Serving Benchmark Result ============
Backend:                                 lmdeploy
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  133.48
Total input tokens:                      1130466
Total generated tokens:                  971613
Total generated tokens (retokenized):    976379
Request throughput (req/s):              37.46
Input token throughput (tok/s):          8469.20
Output token throughput (tok/s):         7279.11
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   68692.60
Median E2E Latency (ms):                 69067.49
---------------Time to First Token----------------
Mean TTFT (ms):                          57053.45
Median TTFT (ms):                        56180.29
P99 TTFT (ms):                           117505.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.08
Median TPOT (ms):                        64.48
P99 TPOT (ms):                           161.57
---------------Inter-token Latency----------------
Mean ITL (ms):                           222.47
Median ITL (ms):                         196.81
P99 ITL (ms):                            902.97
==================================================
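To put the two runs side by side, the relative gap can be computed directly from the two tables. The numbers below are copied from the results above; the dict keys are just illustrative labels:

```python
# Compare the two AWQ runs: LMDeploy vs SGLang on the identical workload.
sglang = {"req_s": 31.02, "out_tok_s": 6028.81, "mean_tpot_ms": 289.26}
lmdeploy = {"req_s": 37.46, "out_tok_s": 7279.11, "mean_tpot_ms": 67.08}

for key in sglang:
    ratio = lmdeploy[key] / sglang[key]
    print(f"{key}: LMDeploy/SGLang = {ratio:.2f}x")
# Request and output-token throughput: LMDeploy is ~1.21x SGLang
# on this AWQ model; this is the gap the TODO below aims to close.
```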

TODO

Integrate TurboMind GEMM into SGLang to enhance AWQ performance.

https://github.com/internlm/turbomind


zhyncs (Member, Author) commented Sep 24, 2024

BF16 performance is close between the two engines; the gap appears only when AWQ is enabled.
