Hi, thanks for the great work! What if I want to support a larger model, say, one that goes beyond a single GPU card's memory and needs TP? Is there a reason why QServe doesn't support TP? If I shard the quantized weights on my own, will it affect the GEMM kernel you developed?
Hi! Thank you very much for your interest in QServe.
Yes, TP is definitely helpful for serving larger models. We have not added TP support to QServe yet, but we believe QServe is compatible with TP and other parallelization strategies. By the way, since QServe greatly compresses the weights and KV cache of LLMs, most open-source models can be served within a single A100 GPU, so the communication overhead between GPUs can be avoided (possibly using DP instead).
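On the question of whether sharding the quantized weights yourself would break the GEMM kernel: here is a minimal, hypothetical sketch (not QServe's actual API; the tensor shapes and names are assumptions) of the usual column-parallel approach, where the packed low-bit weights and their per-output-channel scales are split along the output dimension. Each shard then keeps the same per-channel layout a quantized GEMM kernel typically expects, only with fewer output channels.

```python
# Hypothetical sketch of column-parallel sharding for a quantized linear layer.
# Assumes per-output-channel scales; group-wise scales/zeros would be sliced the same way.
import torch

def shard_quantized_linear(qweight: torch.Tensor,
                           scales: torch.Tensor,
                           tp_size: int,
                           tp_rank: int):
    """Split packed weights [out_features, in_features_packed] and
    per-output-channel scales [out_features] along dim 0 for one TP rank."""
    out_features = qweight.shape[0]
    assert out_features % tp_size == 0, "out_features must divide evenly across ranks"
    shard = out_features // tp_size
    sl = slice(tp_rank * shard, (tp_rank + 1) * shard)
    return qweight[sl].contiguous(), scales[sl].contiguous()

if __name__ == "__main__":
    # Toy example: 4096 output channels, INT4 weights packed two per byte.
    qweight = torch.randint(0, 255, (4096, 2048), dtype=torch.uint8)
    scales = torch.rand(4096, dtype=torch.float16)
    w_shard, s_shard = shard_quantized_linear(qweight, scales, tp_size=4, tp_rank=0)
    print(w_shard.shape, s_shard.shape)  # torch.Size([1024, 2048]) torch.Size([1024])
```

Whether this is safe in practice depends on the kernel's packing granularity (e.g., any per-group or per-block metadata must stay aligned with the shard boundaries), so treat this as an illustration rather than a drop-in solution.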
QServe can indeed serve a large model on one card. However, other frameworks that enable TP, for example vLLM with tp=4, may achieve higher throughput since each card computes less. In our own experiment serving Qwen1.5-72B-Chat, vLLM with tp=4 doubled the throughput of QServe with tp=1. So I think TP is important and I look forward to QServe supporting it.