Hi, thanks for the great work! What if I want to support a larger model, say, one that goes beyond a single GPU card's memory and needs TP? Is there a reason why QServe doesn't support TP? If I shard the quantized weights on my own, will it affect the GEMM kernel you developed?
Hi! Thank you very much for your interest in QServe.
Yes, TP is definitely helpful for serving larger models. We have not added TP support to QServe yet, but we believe QServe is compatible with TP and other parallelization strategies. By the way, since QServe greatly compresses the weights and KV cache of LLMs, most open-source models can be served within a single A100 GPU, so the communication overhead between GPUs can be avoided (possibly using DP instead).
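On the question of whether sharding the quantized weights yourself would break the GEMM kernel: here is a minimal, hypothetical sketch (not QServe's actual API; the tensor shapes and names are assumptions) of the usual column-parallel approach, where the packed low-bit weights and their per-output-channel scales are split along the output dimension. Each shard then keeps the same per-channel layout a quantized GEMM kernel typically expects, only with fewer output channels.

```python
# Hypothetical sketch of column-parallel sharding for a quantized linear layer.
# Assumes per-output-channel scales; group-wise scales/zeros would be sliced the same way.
import torch

def shard_quantized_linear(qweight: torch.Tensor,
                           scales: torch.Tensor,
                           tp_size: int,
                           tp_rank: int):
    """Split packed weights [out_features, in_features_packed] and
    per-output-channel scales [out_features] along dim 0 for one TP rank."""
    out_features = qweight.shape[0]
    assert out_features % tp_size == 0, "out_features must divide evenly across ranks"
    shard = out_features // tp_size
    sl = slice(tp_rank * shard, (tp_rank + 1) * shard)
    return qweight[sl].contiguous(), scales[sl].contiguous()

if __name__ == "__main__":
    # Toy example: 4096 output channels, INT4 weights packed two per byte.
    qweight = torch.randint(0, 255, (4096, 2048), dtype=torch.uint8)
    scales = torch.rand(4096, dtype=torch.float16)
    w_shard, s_shard = shard_quantized_linear(qweight, scales, tp_size=4, tp_rank=0)
    print(w_shard.shape, s_shard.shape)  # torch.Size([1024, 2048]) torch.Size([1024])
```

Whether this is safe in practice depends on the kernel's packing granularity (e.g., any per-group or per-block metadata must stay aligned with the shard boundaries), so treat this as an illustration rather than a drop-in solution.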
QServe can indeed serve a large model on one card. However, other frameworks that enable TP, for example vLLM with tp=4, may achieve higher throughput since each card computes less. In our own experiment serving Qwen1.5-72B-Chat, vLLM with tp=4 doubled the throughput of QServe with tp=1. So I think TP is important and I look forward to QServe supporting it.