### Checklist

### Motivation

#### Current Situation

##### SGLang

```shell
# v0.3.1.post3
pip install --upgrade pip
pip install "sglang[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```
```shell
python3 -m sglang.launch_server --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --disable-radix
python3 bench_serving.py --backend sglang --num-prompts 5000
```
```
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  161.16
Total input tokens:                      1130466
Total generated tokens:                  971613
Total generated tokens (retokenized):    970868
Request throughput (req/s):              31.02
Input token throughput (tok/s):          7014.49
Output token throughput (tok/s):         6028.81
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   87157.00
Median E2E Latency (ms):                 87767.15
---------------Time to First Token----------------
Mean TTFT (ms):                          52751.14
Median TTFT (ms):                        42772.56
P99 TTFT (ms):                           122414.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          289.26
Median TPOT (ms):                        202.07
P99 TPOT (ms):                           1915.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           183.11
Median ITL (ms):                         119.46
P99 ITL (ms):                            686.84
==================================================
```
##### LMDeploy

```shell
pip3 install https://github.com/zhyncs/lmdeploy-build/releases/download/bf89a01/lmdeploy-0.6.0+cu121+bf89a01-cp310-cp310-manylinux2014_x86_64.whl
python3 -m lmdeploy serve api_server hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
python3 bench_serving.py --backend lmdeploy --num-prompts 5000
```
```
============ Serving Benchmark Result ============
Backend:                                 lmdeploy
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  133.48
Total input tokens:                      1130466
Total generated tokens:                  971613
Total generated tokens (retokenized):    976379
Request throughput (req/s):              37.46
Input token throughput (tok/s):          8469.20
Output token throughput (tok/s):         7279.11
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   68692.60
Median E2E Latency (ms):                 69067.49
---------------Time to First Token----------------
Mean TTFT (ms):                          57053.45
Median TTFT (ms):                        56180.29
P99 TTFT (ms):                           117505.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.08
Median TPOT (ms):                        64.48
P99 TPOT (ms):                           161.57
---------------Inter-token Latency----------------
Mean ITL (ms):                           222.47
Median ITL (ms):                         196.81
P99 ITL (ms):                            902.97
==================================================
```
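As a quick sanity check on the two tables above, the headline throughput figures can be recomputed from the raw counts; they agree with the reported values up to rounding of the benchmark durations.

```python
# Recompute the headline throughput numbers from the raw counts in the
# two benchmark tables above.
runs = {
    "sglang":   {"duration_s": 161.16, "requests": 5000, "output_tokens": 971613},
    "lmdeploy": {"duration_s": 133.48, "requests": 5000, "output_tokens": 971613},
}
for name, r in runs.items():
    req_tput = r["requests"] / r["duration_s"]       # reported: 31.02 / 37.46 req/s
    tok_tput = r["output_tokens"] / r["duration_s"]  # reported: 6028.81 / 7279.11 tok/s
    print(f"{name}: {req_tput:.2f} req/s, {tok_tput:.2f} output tok/s")
```

At the same request load, LMDeploy finishes roughly 17% sooner, and its median TPOT (64.48 ms vs 202.07 ms) points at the AWQ GEMM path as the bottleneck on the SGLang side, which motivates the TODO below.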
#### TODO

Integrate TurboMind GEMM into SGLang to enhance AWQ performance.

https://github.com/internlm/turbomind
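As a rough illustration of the integration surface, here is a minimal dispatch sketch. It is a sketch under stated assumptions, not SGLang's actual layer code: `AWQLinear`, `register_backend`, and `turbomind_awq_gemm` are hypothetical names, and the layer holds a dense fp16-style weight instead of packed int4 tensors so the example stays self-contained and runnable.

```python
# Minimal sketch: route an AWQ linear layer's GEMM through a pluggable
# backend registry. All names here are hypothetical -- this is not an
# existing SGLang or TurboMind API.
import torch


def reference_gemm(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Fallback path: plain matmul standing in for dequantize-then-GEMM."""
    return x @ weight.t()


class AWQLinear(torch.nn.Module):
    """Stand-in for an AWQ linear layer whose GEMM backend is swappable."""

    _backends = {"reference": reference_gemm}  # backend name -> GEMM callable

    def __init__(self, in_features: int, out_features: int, backend: str = "reference"):
        super().__init__()
        # A real AWQ layer stores packed int4 qweight plus scales/zeros;
        # a dense weight keeps this sketch self-contained.
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.backend = backend

    @classmethod
    def register_backend(cls, name: str, fn) -> None:
        cls._backends[name] = fn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gemm = self._backends.get(self.backend, reference_gemm)
        return gemm(x, self.weight)


# With a TurboMind kernel binding available, wiring it in would be one call:
# AWQLinear.register_backend("turbomind", turbomind_awq_gemm)  # hypothetical

layer = AWQLinear(16, 32)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 32])
```

The point of the registry is that the forward pass stays backend-agnostic, so adopting TurboMind's W4A16 GEMM would amount to a kernel binding plus a registration call rather than a fork of the layer.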
### Related resources

No response
---

BF16 performance is close between the two backends; the gap shows up only when AWQ is enabled.