feat: allow parallel requests for same model #1142
Comments
I'm a fan of this, especially with the potential of LocalAGI. I want to run two GPU LLMs on different servers I run; currently I just load balance round robin between distinct LocalAI front-/back-ends. Ideally, one API front end could queue incoming requests and send each to the next available LLM. Allowing all LLMs in a pool to store and retrieve from long-term memory would be awesome.
This is actually already possible with llama.cpp by specifying PARALLEL_REQUESTS (or --parallel-requests) together with the number of parallel requests for llama.cpp (Line 72 in d6073ac).
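For illustration only, here is a minimal Go sketch of how a PARALLEL_REQUESTS-style environment variable could be turned into a backend option; the wiring and the fallback default shown here are assumptions, not the project's actual code:

```go
package main

import (
	"log"
	"os"
	"strconv"
)

// parallelRequestsFromEnv reads a PARALLEL_REQUESTS-style environment
// variable and falls back to 1 when it is unset or invalid, mimicking
// the current single-request behavior. The variable name mirrors the one
// mentioned above; everything else is illustrative.
func parallelRequestsFromEnv() int {
	v := os.Getenv("PARALLEL_REQUESTS")
	if v == "" {
		return 1
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 1 {
		log.Printf("invalid PARALLEL_REQUESTS=%q, falling back to 1", v)
		return 1
	}
	return n
}

func main() {
	log.Printf("llama.cpp backend would be started with %d parallel request slots",
		parallelRequestsFromEnv())
}
```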
Is your feature request related to a problem? Please describe.
Currently each model gets only a single instance.
Describe the solution you'd like
A flag or a YAML config option to set the maximum number of instances to spawn/connect to, defaulting to 1 (mimicking the current behavior).
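For illustration, a minimal sketch of how such an option could look on the Go side, assuming a per-model YAML entry; the `instances` field name and the `ModelConfig` struct are hypothetical, not existing LocalAI configuration keys:

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// ModelConfig is a hypothetical per-model configuration entry.
// Instances is the maximum number of backend instances to spawn or
// connect to for this model; unset or 0 is treated as 1 to keep the
// current single-instance behavior as the default.
type ModelConfig struct {
	Name      string `yaml:"name"`
	Instances int    `yaml:"instances"`
}

// MaxInstances clamps the configured value to at least 1.
func (c *ModelConfig) MaxInstances() int {
	if c.Instances < 1 {
		return 1
	}
	return c.Instances
}

func main() {
	data := []byte("name: llama-2-13b\ninstances: 3\n")
	var cfg ModelConfig
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.Name, cfg.MaxInstances()) // llama-2-13b 3
}
```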
Describe alternatives you've considered
Additional context
Ideally we should keep track of which models are in use and redirect requests to free slots.
We also need to take into account the logic that is already in place to handle a single GPU device (only one model loaded at a time).
Related: go-skynet/go-llama.cpp#253
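To illustrate the "redirect to free slots" idea, here is a minimal sketch of a per-model instance pool, assuming each model may have several backend endpoints; `InstancePool`, `Acquire`, and `Release` are made-up names, and a single-GPU setup would simply configure one slot:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// InstancePool hands out backend instances (e.g. addresses of running
// llama.cpp processes) for a single model. With one slot it degenerates
// to the current single-instance behavior, which also covers the
// single-GPU case where only one model can be loaded at a time.
type InstancePool struct {
	mu   sync.Mutex
	free []string // endpoints currently idle
	busy map[string]bool
}

func NewInstancePool(endpoints []string) *InstancePool {
	return &InstancePool{
		free: append([]string(nil), endpoints...),
		busy: map[string]bool{},
	}
}

// Acquire returns a free instance, or an error if all slots are busy
// (a real implementation might queue the request instead of failing).
func (p *InstancePool) Acquire() (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.free) == 0 {
		return "", errors.New("all instances busy")
	}
	ep := p.free[len(p.free)-1]
	p.free = p.free[:len(p.free)-1]
	p.busy[ep] = true
	return ep, nil
}

// Release marks an instance as idle again.
func (p *InstancePool) Release(ep string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.busy[ep] {
		delete(p.busy, ep)
		p.free = append(p.free, ep)
	}
}

func main() {
	pool := NewInstancePool([]string{"127.0.0.1:9001", "127.0.0.1:9002"})
	ep, _ := pool.Acquire()
	fmt.Println("routing request to", ep)
	pool.Release(ep)
}
```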