As per title.
Example: with GPUs like 3060 12GB or 3090 24GB.
Mayorc1978 changed the title from "Would this work on consumer hardware and with frameworks like llama.cpp or others?" to "Would this work on consumer hardware and integrated in frameworks like llama.cpp or others?" on May 11, 2024.
Hi @Mayorc1978 , thank you very much for your interest in QServe! Although it targets large-scale LLM serving, QServe can also run on consumer GPUs like the RTX 4090 and 3090. On the RTX 4090 you can expect a speedup over TensorRT-LLM similar to what we observed on the L40S. We have not run many experiments on the 3060 or 3090, but we believe the same principles will hold.
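For the original 12 GB / 24 GB question, a rough back-of-the-envelope estimate of weight memory can help. The sketch below is not QServe code; the function name and parameter counts are illustrative, it assumes 4-bit weights (QServe's W4A8KV4 scheme stores weights in 4 bits), and it ignores KV cache and activation memory, which also need VRAM at serving time.

```python
# Hypothetical helper (not part of QServe): estimate how much VRAM the
# quantized weights alone would occupy for a given model size.
def estimate_weight_memory_gib(num_params_billions: float, bits_per_weight: int = 4) -> float:
    """Approximate weight memory in GiB for a parameter count and bit width."""
    total_bytes = num_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

if __name__ == "__main__":
    # Illustrative model sizes; excludes KV cache, activations, and runtime overhead.
    for name, params_b in [("8B", 8), ("13B", 13), ("70B", 70)]:
        gib = estimate_weight_memory_gib(params_b, bits_per_weight=4)
        print(f"{name}: ~{gib:.1f} GiB of 4-bit weights")
```

By this estimate, an 8B model's 4-bit weights are under 4 GiB and a 13B model's are around 6 GiB, so both leave headroom on a 12 GB or 24 GB card; the remaining budget goes to the KV cache, which grows with batch size and context length.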
Hi @tp-nan , Tesla T4 and RTX 2080 are not supported in QServe right now. The current kernels use instructions that can only be compiled for the Ampere architecture and newer. We will consider supporting older GPUs after cleaning up the CUDA code. Thank you!
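If you want to check whether your GPU meets the Ampere-or-newer requirement before trying to build, a minimal sketch using PyTorch's device query is below. This is not QServe code, just a generic check: Ampere corresponds to compute capability 8.x (e.g., RTX 3090 is sm_86, RTX 4090 is sm_89), while the T4 and RTX 2080 are sm_75.

```python
# Minimal sketch: report whether the local GPU is Ampere (compute capability 8.0) or newer.
import torch

def is_ampere_or_newer(device: int = 0) -> bool:
    major, _minor = torch.cuda.get_device_capability(device)
    return major >= 8

if __name__ == "__main__":
    if not torch.cuda.is_available():
        print("No CUDA device found.")
    else:
        name = torch.cuda.get_device_name(0)
        verdict = "meets" if is_ampere_or_newer(0) else "is below"
        print(f"{name}: compute capability {verdict} the Ampere (sm_80) requirement")
```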