
support tp #14

Open

cyLi-Tiger opened this issue May 24, 2024 · 2 comments

@cyLi-Tiger
Hi, thanks for the great work!

What if I want to serve a larger model, one that exceeds a single GPU card's memory and therefore needs TP? Is there a reason why QServe doesn't support TP? If I shard the quantized weights on my own, will it affect the GEMM kernel you developed?
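For context, this is roughly what I mean by "sharding on my own": a minimal column-parallel (output-channel) split of a quantized linear layer. The tensor names and the assumption that the weights are stored as a packed quantized matrix with per-output-channel scales are mine, not QServe's actual layout.

```python
# Illustrative sketch only; the layout of qweight/scales is an assumption,
# not QServe's actual on-disk or in-kernel format.
import torch

def shard_quantized_linear(qweight, scales, tp_size, rank):
    """Split a quantized weight and its scales along the output dimension.

    qweight: [out_features, in_features] packed quantized weights (assumed layout)
    scales:  [out_features, ...] dequantization scales, one set per output channel
    """
    out_features = qweight.shape[0]
    assert out_features % tp_size == 0, "out_features must divide evenly across ranks"
    shard = out_features // tp_size
    sl = slice(rank * shard, (rank + 1) * shard)
    # The quantized weights and their scales must be sliced with the same
    # indices, otherwise per-rank dequantization would be wrong.
    return qweight[sl].contiguous(), scales[sl].contiguous()

# Each rank would then run the same quantized GEMM kernel on its shard, and
# the partial outputs are concatenated (column-parallel) or all-reduced
# (row-parallel) as in standard tensor parallelism.
```

My question is whether splitting the weights and scales like this breaks any assumptions inside your GEMM kernel.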

@ys-2020
Contributor

ys-2020 commented Jun 4, 2024

Hi! Thank you very much for your interest in QServe.

Yes, TP is definitely helpful for serving larger models. We do not support TP in QServe yet, but we believe QServe is compatible with TP and other parallelization strategies. That said, since QServe greatly compresses the weights and KV cache of LLMs, most open-source models can be served within a single A100 GPU, which avoids inter-GPU communication overhead (possibly combined with DP across GPUs).

@cyLi-Tiger
Author

cyLi-Tiger commented Jun 12, 2024

Thanks for your reply!

QServe can indeed serve a large model on one card. However, other frameworks that enable TP, for example vLLM with tp=4, may achieve higher throughput since each card has less compute to do. In our own experiment serving Qwen1.5-72B-Chat, vLLM with tp=4 roughly doubled QServe's throughput with tp=1. So I think TP is important and I look forward to QServe supporting it lol.
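For reference, the vLLM side of the comparison was essentially the standard tensor-parallel setup; a minimal sketch is below (the model name matches our test, but the prompts and sampling settings here are illustrative, not the exact benchmark configuration).

```python
# Minimal vLLM tensor-parallel setup (illustrative, not the exact benchmark).
from vllm import LLM, SamplingParams

# tensor_parallel_size=4 shards the model across 4 GPUs.
llm = LLM(model="Qwen/Qwen1.5-72B-Chat", tensor_parallel_size=4)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=128),
)
for out in outputs:
    print(out.outputs[0].text)
```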
