Using vLLM to deploy LLM as an API to accelerate inference #100

Open · fx-hit opened this issue Jun 21, 2024 · 3 comments

Comments

fx-hit commented Jun 21, 2024

Based on practical tests, deploying omost-llama-3-8b on an A100 using torch==2.3.0+cu118, vllm==0.5.0.post1+cu118, and xformers==0.0.26.post1+cu118 works well. If you want to speed up the process, you can refer to this setup.

vllm: https://docs.vllm.ai/en/stable/getting_started/quickstart.html
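For reference, a minimal sketch of this kind of setup, assuming the lllyasviel/omost-llama-3-8b checkpoint from Hugging Face is served through vLLM's OpenAI-compatible server; the port and sampling parameters below are arbitrary illustrative choices, not values confirmed in this thread, and the Omost pipeline's own system prompt/rendering steps are not shown.

```python
# Launch vLLM's OpenAI-compatible server first (run in a shell), e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model lllyasviel/omost-llama-3-8b --dtype bfloat16 --port 8000
# Then query it with the standard OpenAI client pointed at the local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lllyasviel/omost-llama-3-8b",
    messages=[
        {"role": "user", "content": "generate an image of a cat on a windowsill"},
    ],
    temperature=0.6,
    max_tokens=4096,
)
# The model returns Omost "canvas" code, which the rest of the pipeline renders.
print(response.choices[0].message.content)
```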

@badcookie78

Hi, can I know if it is possible to run it with Ollama and then host the LLM locally?
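Not confirmed in this thread, but in principle this works once a GGUF conversion of omost-llama-3-8b has been imported into Ollama, which also exposes an OpenAI-compatible endpoint. A rough sketch under that assumption; the local tag omost-llama-3-8b below is hypothetical, not an official Ollama model.

```python
# Assumes the model was imported into Ollama first, e.g. with a Modelfile
# pointing at a GGUF conversion of omost-llama-3-8b:
#   ollama create omost-llama-3-8b -f Modelfile
# Ollama serves an OpenAI-compatible API on port 11434 by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="omost-llama-3-8b",  # hypothetical local tag created above
    messages=[{"role": "user", "content": "generate an image of a fox in a forest"}],
)
print(response.choices[0].message.content)
```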

@zk19971101

I found that comfyui_omost shows a way to accelerate inference with TGI (text generation inference):
https://github.com/huchenlei/ComfyUI_omost?tab=readme-ov-file#accelerating-llm
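For context, the linked approach boils down to serving the model with TGI and pointing the client at its HTTP API. A rough sketch, assuming the official TGI Docker image and default ports; the exact image tag and launch flags used by ComfyUI_omost may differ, see the linked README.

```python
# Start a TGI server first (run in a shell), for example:
#   docker run --gpus all -p 8080:80 \
#       ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id lllyasviel/omost-llama-3-8b
# Then call TGI's /generate endpoint over HTTP. Note that /generate takes a raw
# prompt; newer TGI versions also expose an OpenAI-style /v1/chat/completions.
import requests

payload = {
    "inputs": "generate an image of a mountain lake at sunrise",
    "parameters": {"max_new_tokens": 1024, "temperature": 0.6},
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["generated_text"])
```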


sudanl commented Sep 3, 2024

> Based on practical tests, deploying omost-llama-3-8b on an A100 using torch==2.3.0+cu118, vllm==0.5.0.post1+cu118, and xformers==0.0.26.post1+cu118 works well. If you want to speed up the process, you can refer to this setup.
>
> vllm: https://docs.vllm.ai/en/stable/getting_started/quickstart.html

Good idea! Could you kindly share the code?
