Commit cc94425 (1 parent: ef05c05), showing 1 changed file with 42 additions and 20 deletions.
# llama-swap

[llama.cpp's server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) can't swap models, so let's swap llama-server instead!

llama-swap is a proxy server that sits in front of llama-server. When a request for `/v1/chat/completions` comes in, it extracts the requested `model` and automatically swaps the underlying llama-server to the one configured for it.

Created because I wanted:
- ✅ easy to deploy: single binary with no dependencies
- ✅ full control over llama-server's startup settings
- ✅ ❤️ for Nvidia P40 users who rely on llama.cpp for inference (e.g. row split mode for large models)

## config.yaml

llama-swap's configuration is purposefully simple.

```yaml
# Seconds to wait for llama.cpp to load and be ready to serve requests
# Default (and minimum) is 15 seconds
healthCheckTimeout: 60

# define valid model values and how to start the upstream server for each
models:
  "llama":
    # environment variables to pass to the command
    env:
      - "CUDA_VISIBLE_DEVICES=0"

    cmd: "llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf"

    # where to proxy requests to; the port should match the one given to llama-server in cmd
    proxy: "http://127.0.0.1:8999"

    # other model names (aliases) that this configuration also serves
    aliases:
      - "gpt-4o-mini"
      - "gpt-3.5-turbo"

  "qwen":
    # environment variables to pass to the command
    env:
      - "CUDA_VISIBLE_DEVICES=0"
    cmd: "llama-server --port 8999 -m path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
    proxy: "http://127.0.0.1:8999"
```
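
A quick way to check that swapping works is to send chat completion requests with `curl` and watch the proxy switch servers. This sketch assumes llama-swap is listening on port 8080 and uses the model names from the example configuration above:

```bash
# served by the "llama" model defined above
curl http://localhost:8080/v1/chat/completions -N -d '{"messages":[{"role":"user","content":"write a 3 word story"}], "model":"llama"}' | jq -c '.choices[].message.content'

# "gpt-4o-mini" is an alias of "llama", so this reuses the running llama-server instance
curl http://localhost:8080/v1/chat/completions -N -d '{"messages":[{"role":"user","content":"write a 3 word story"}], "model":"gpt-4o-mini"}' | jq -c '.choices[].message.content'

# swaps the underlying server to Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
curl http://localhost:8080/v1/chat/completions -N -d '{"messages":[{"role":"user","content":"write a 3 word story"}], "model":"qwen"}' | jq -c '.choices[].message.content'
```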

## Deployment

1. Create a configuration file, see [config.example.yaml](config.example.yaml).
2. Download a [release](https://github.com/mostlygeek/llama-swap/releases) appropriate for your OS and architecture.
   * _Note: Windows is currently untested._
3. Run the binary with `llama-swap --config path/to/config.yaml` (see the example below).
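
As a concrete sketch, assuming the downloaded release asset is the single self-contained binary saved as `llama-swap`:

```bash
# make the downloaded binary executable and start it with your config
chmod +x llama-swap
./llama-swap --config path/to/config.yaml
```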

## Systemd Unit Files

Use this unit file to start llama-swap on boot.

`/etc/systemd/system/llama-swap.service`
```
[Unit]
Description=llama-swap
After=network.target

[Service]
User=nobody

# set this to match your environment
ExecStart=/path/to/llama-swap --config /path/to/llama-swap.config.yml

Restart=on-failure
RestartSec=3
StartLimitBurst=3
StartLimitInterval=30

[Install]
WantedBy=multi-user.target
```
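
After placing the unit file at `/etc/systemd/system/llama-swap.service` and adjusting the paths in `ExecStart`, it can be enabled with the standard systemd commands:

```bash
# reload unit files, then start llama-swap now and on every boot
sudo systemctl daemon-reload
sudo systemctl enable --now llama-swap

# check status and follow the logs
systemctl status llama-swap
journalctl -u llama-swap -f
```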