
Would this work on consumer hardware and integrated in frameworks like llama.cpp or others? #5

Open
Mayorc1978 opened this issue May 11, 2024 · 4 comments

Comments


Mayorc1978 commented May 11, 2024

As per title.
Example: with GPUs like 3060 12GB or 3090 24GB.

@Mayorc1978 Mayorc1978 changed the title Would this work on consumer hardware and with frameworks like llama.cpp or others? Would this work on consumer hardware and integrated in frameworks like llama.cpp or others? May 11, 2024
ys-2020 (Contributor) commented May 14, 2024

Hi @Mayorc1978, thank you very much for your interest in QServe! Although it targets large-scale LLM serving, QServe can also run on consumer GPUs like the RTX 4090 and 3090. On an RTX 4090 you can expect a speedup over TensorRT-LLM similar to what we report on the L40S. We have not run many experiments on the 3060 or 3090, but we believe the same principles still hold.
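
For a rough sense of whether a 4-bit-weight (W4) quantized model fits in 12 GB or 24 GB of VRAM, here is a minimal back-of-envelope sketch; the parameter counts and the headroom note are illustrative assumptions, not QServe measurements:

```python
# Back-of-envelope: memory needed for 4-bit (W4) quantized weights.
# Parameter counts are illustrative; the KV cache, activations, and CUDA
# context need additional headroom on top of the weights.

def w4_weight_gib(params_billion: float) -> float:
    """Approximate GiB needed to hold 4-bit weights of a model."""
    total_bytes = params_billion * 1e9 * 0.5  # 4 bits = 0.5 bytes per weight
    return total_bytes / 2**30

for name, params in [("7B model", 7.0), ("13B model", 13.0)]:
    print(f"{name}: ~{w4_weight_gib(params):.1f} GiB of 4-bit weights "
          f"(plus KV cache and runtime overhead)")
```

Under this rough estimate, the 4-bit weights of a 7B model take about 3.3 GiB and a 13B model about 6 GiB, leaving the rest of a 12 GB or 24 GB card for the KV cache and runtime.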


tp-nan commented May 17, 2024

Hi, how about the Tesla T4 and RTX 2080 Ti?

ys-2020 (Contributor) commented May 17, 2024

Hi @tp-nan, the Tesla T4 and RTX 2080 are not supported in QServe right now. Some of our kernels currently use instructions that can only be compiled for Ampere and newer architectures. We will consider supporting older GPUs after cleaning up the CUDA code. Thank you!
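
A minimal way to check the Ampere requirement locally, assuming a PyTorch environment (this snippet is an illustrative sketch, not part of QServe):

```python
# Check whether the local GPU is Ampere (compute capability 8.0) or newer,
# which the current QServe kernels require. The T4 and RTX 2080 Ti are sm_75.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device detected.")

major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
print(f"{name}: compute capability {major}.{minor}")

if major >= 8:
    print("Ampere or newer: the current QServe kernels should compile and run.")
else:
    print("Pre-Ampere GPU: not supported by QServe yet.")
```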

anaivebird commented

@ys-2020 will QServe outperform TensorRT-LLM with W4A8 on Llama 3 13B?
