
AWQ performance tracking #1505

Open · 2 tasks

zhyncs (Member) opened this issue Sep 24, 2024 · 1 comment

Checklist

Motivation

Current Situation

SGLang

# v0.3.1.post3
pip install --upgrade pip
pip install "sglang[all]"

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
python3 -m sglang.launch_server --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --disable-radix-cache

python3 bench_serving.py --backend sglang --num-prompts 5000
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  161.16
Total input tokens:                      1130466
Total generated tokens:                  971613
Total generated tokens (retokenized):    970868
Request throughput (req/s):              31.02
Input token throughput (tok/s):          7014.49
Output token throughput (tok/s):         6028.81
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   87157.00
Median E2E Latency (ms):                 87767.15
---------------Time to First Token----------------
Mean TTFT (ms):                          52751.14
Median TTFT (ms):                        42772.56
P99 TTFT (ms):                           122414.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          289.26
Median TPOT (ms):                        202.07
P99 TPOT (ms):                           1915.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           183.11
Median ITL (ms):                         119.46
P99 ITL (ms):                            686.84
==================================================
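As a sanity check on the table above, the headline throughput numbers can be re-derived from the raw counts. This is a small illustrative sketch, not part of `bench_serving.py`; the variable names are local conveniences:

```python
# Re-derive the SGLang headline throughputs from the raw counts reported above.
duration_s = 161.16        # Benchmark duration (s)
num_requests = 5000        # Successful requests
input_tokens = 1_130_466   # Total input tokens
output_tokens = 971_613    # Total generated tokens

req_throughput = num_requests / duration_s          # ~31.02 req/s
input_tok_throughput = input_tokens / duration_s    # ~7014 tok/s
output_tok_throughput = output_tokens / duration_s  # ~6029 tok/s

print(f"{req_throughput:.2f} req/s, "
      f"{input_tok_throughput:.1f} in tok/s, "
      f"{output_tok_throughput:.1f} out tok/s")
```

The small rounding differences against the table come from the benchmark script using the unrounded duration internally.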

LMDeploy

pip3 install https://github.com/zhyncs/lmdeploy-build/releases/download/bf89a01/lmdeploy-0.6.0+cu121+bf89a01-cp310-cp310-manylinux2014_x86_64.whl
python3 -m lmdeploy serve api_server hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4

python3 bench_serving.py --backend lmdeploy --num-prompts 5000
============ Serving Benchmark Result ============
Backend:                                 lmdeploy
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  133.48
Total input tokens:                      1130466
Total generated tokens:                  971613
Total generated tokens (retokenized):    976379
Request throughput (req/s):              37.46
Input token throughput (tok/s):          8469.20
Output token throughput (tok/s):         7279.11
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   68692.60
Median E2E Latency (ms):                 69067.49
---------------Time to First Token----------------
Mean TTFT (ms):                          57053.45
Median TTFT (ms):                        56180.29
P99 TTFT (ms):                           117505.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.08
Median TPOT (ms):                        64.48
P99 TPOT (ms):                           161.57
---------------Inter-token Latency----------------
Mean ITL (ms):                           222.47
Median ITL (ms):                         196.81
P99 ITL (ms):                            902.97
==================================================
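To put the two runs side by side, the relative gap can be computed directly from the two tables. The numbers below are copied from the results above; the dict keys are just illustrative labels:

```python
# Compare the two AWQ runs: LMDeploy vs SGLang on the identical workload.
sglang = {"req_s": 31.02, "out_tok_s": 6028.81, "mean_tpot_ms": 289.26}
lmdeploy = {"req_s": 37.46, "out_tok_s": 7279.11, "mean_tpot_ms": 67.08}

for key in sglang:
    ratio = lmdeploy[key] / sglang[key]
    print(f"{key}: LMDeploy/SGLang = {ratio:.2f}x")
# Request and output-token throughput: LMDeploy is ~1.21x SGLang
# on this AWQ model; this is the gap the TODO below aims to close.
```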

TODO

Integrate TurboMind GEMM into SGLang to enhance AWQ performance.

https://github.com/internlm/turbomind


zhyncs (Member, Author) commented Sep 24, 2024

BF16 performance is close between the two engines; the gap appears only when AWQ is enabled.
