[Benchmark] Benchmarks on different CUDA architectures with models of various sizes #815
Comments
Sampling (num_beam=1) doesn't seem to have much impact on performance, does it?
My understanding is that this refers to settings like temperature, top_p, and top_k.
I tested performance with profile_throughput.py on the llama-2-chat-7b model (tp=1) using different top_p, top_k, and temperature values; there was almost no difference in tokens/s.
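For reference, a minimal sketch of that kind of comparison using LMDeploy's Python API rather than benchmark/profile_throughput.py; the model path, prompt set, and the generate_token_len field on the response objects are assumptions for illustration.

```python
# Minimal sketch: compare tokens/s under different sampling settings with LMDeploy.
# Assumptions: llama-2-chat-7b weights at ./llama-2-chat-7b, tp=1, and that the
# returned Response objects expose generate_token_len.
import time

from lmdeploy import GenerationConfig, pipeline

pipe = pipeline('./llama-2-chat-7b')              # tp=1 by default
prompts = ['Explain continuous batching.'] * 64   # toy prompt set

settings = [
    dict(temperature=1.0, top_p=1.0, top_k=1),    # effectively greedy
    dict(temperature=0.8, top_p=0.95, top_k=40),  # typical chat-style sampling
]

for cfg in settings:
    gen_config = GenerationConfig(max_new_tokens=256, **cfg)
    start = time.perf_counter()
    responses = pipe(prompts, gen_config=gen_config)
    elapsed = time.perf_counter() - start
    total_tokens = sum(r.generate_token_len for r in responses)
    print(f'{cfg}: {total_tokens / elapsed:.1f} tokens/s')
```

If the per-token sampling cost is small relative to the forward pass, both settings should report nearly the same tokens/s, which matches the observation above.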
[Benchmark results] A100 (W4A16): Request Throughput (RPM); Static Inference Performance for llama2-7b, llama2-13b, internlm-20b, llama2-70b
A question: how is this static batch tested? Continuous batching is supported now, so isn't the inference batch size determined by the available GPU memory?
"Static batch" here is a relative concept. During inference it is still continuous batching; it is just that, for the vast majority of the inference time, the runtime batch equals the input batch (the --concurrency argument).
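To illustrate the point, a rough conceptual sketch of the client side of such a benchmark (not the actual profile_throughput.py code; send_request is a hypothetical stand-in): with --concurrency workers each sending requests back-to-back, the engine's continuous-batching scheduler almost always has exactly that many requests in flight.

```python
# Conceptual sketch of the "static batch" measurement: `concurrency` worker
# threads each send requests back-to-back, so under continuous batching the
# engine's in-flight batch stays at ~`concurrency` for most of the run.
# `send_request` is a hypothetical placeholder for the real client call.
import threading


def send_request(prompt: str) -> None:
    """Placeholder: issue one generation request and wait for it to finish."""
    ...


def worker(prompts: list[str]) -> None:
    for p in prompts:
        send_request(p)   # the next request starts as soon as the previous one finishes


def run_benchmark(all_prompts: list[str], concurrency: int) -> None:
    # Split the prompt set into `concurrency` shards, one per worker.
    shards = [all_prompts[i::concurrency] for i in range(concurrency)]
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```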
latest benchmark results https://buildkite.com/vllm/performance-benchmark/builds/3924
Maybe we could do something similar. cc @zhulinJulia24 @lvhan028
Background
We have found that most LLM inference engines turn off sampling when reporting inference performance. In real applications, however, sampling is almost always enabled. To provide a benchmark that is as close to real-world usage as possible, we opened this issue to report LMDeploy's performance with sampling enabled.
Test Models
Test Devices
Model compute precision: BF16 (FP16), W4A16, KV8
Model compute precision: FP16
Model compute precision: W4A16
Model compute precision: W4A16
Model compute precision: W4A16
Metrics
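For reference, a small sketch of how the two headline metrics, request throughput (RPM) and token throughput (tokens/s), can be computed from per-request records; the RequestRecord layout is an assumption for illustration.

```python
# Sketch: compute request throughput (RPM) and token throughput (tokens/s)
# from per-request records. The record layout is assumed for illustration.
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_time: float        # seconds, wall clock
    end_time: float          # seconds, wall clock
    output_tokens: int       # tokens generated for this request


def summarize(records: list[RequestRecord]) -> None:
    wall_time = max(r.end_time for r in records) - min(r.start_time for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    rpm = len(records) / wall_time * 60          # requests per minute
    tps = total_tokens / wall_time               # generated tokens per second
    print(f'RPM: {rpm:.2f}, tokens/s: {tps:.2f}')
```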