Commit cc94425 (1 parent: ef05c05), showing 1 changed file with 42 additions and 20 deletions.
# llama-swap

[llama.cpp's server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) can't swap models, so let's swap llama-server instead!

llama-swap is a proxy server that sits in front of llama-server. When a request for `/v1/chat/completions` comes in, it extracts the requested `model` and automatically swaps the underlying llama-server to the one configured for it.

Created because I wanted:
- ✅ easy to deploy: single binary with no dependencies
- ✅ full control over llama-server's startup settings
- ✅ ❤️ for Nvidia P40 users who rely on llama.cpp for inference (e.g. row split mode for large models)

## config.yaml

llama-swap's configuration is purposefully simple.

```yaml
# Seconds to wait for llama.cpp to load and be ready to serve requests
# Default (and minimum) is 15 seconds
healthCheckTimeout: 60

# define valid model values and how to start the upstream server for each
models:
  "llama":
    # environment variables to pass to the command
    env:
      - "CUDA_VISIBLE_DEVICES=0"

    cmd: "llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf"

    # where to proxy requests to; the port should match the one given to llama-server in cmd
    proxy: "http://127.0.0.1:8999"

    # other model names (aliases) that this configuration also serves
    aliases:
      - "gpt-4o-mini"
      - "gpt-3.5-turbo"

  "qwen":
    # environment variables to pass to the command
    env:
      - "CUDA_VISIBLE_DEVICES=0"
    cmd: "llama-server --port 8999 -m path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
    proxy: "http://127.0.0.1:8999"
```
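
A quick way to check that swapping works is to send chat completion requests with `curl` and watch the proxy switch servers. This sketch assumes llama-swap is listening on port 8080 and uses the model names from the example configuration above:

```bash
# served by the "llama" model defined above
curl http://localhost:8080/v1/chat/completions -N -d '{"messages":[{"role":"user","content":"write a 3 word story"}], "model":"llama"}' | jq -c '.choices[].message.content'

# "gpt-4o-mini" is an alias of "llama", so this reuses the running llama-server instance
curl http://localhost:8080/v1/chat/completions -N -d '{"messages":[{"role":"user","content":"write a 3 word story"}], "model":"gpt-4o-mini"}' | jq -c '.choices[].message.content'

# swaps the underlying server to Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
curl http://localhost:8080/v1/chat/completions -N -d '{"messages":[{"role":"user","content":"write a 3 word story"}], "model":"qwen"}' | jq -c '.choices[].message.content'
```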

## Deployment

1. Create a configuration file, see [config.example.yaml](config.example.yaml).
2. Download a [release](https://github.com/mostlygeek/llama-swap/releases) appropriate for your OS and architecture.
   * _Note: Windows is currently untested._
3. Run the binary with `llama-swap --config path/to/config.yaml` (see the example below).
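
As a concrete sketch, assuming the downloaded release asset is the single self-contained binary saved as `llama-swap`:

```bash
# make the downloaded binary executable and start it with your config
chmod +x llama-swap
./llama-swap --config path/to/config.yaml
```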

## Systemd Unit Files

Use this unit file to start llama-swap on boot.

`/etc/systemd/system/llama-swap.service`
```
[Unit]
Description=llama-swap
After=network.target

[Service]
User=nobody

# set this to match your environment
ExecStart=/path/to/llama-swap --config /path/to/llama-swap.config.yml

Restart=on-failure
RestartSec=3
StartLimitBurst=3
StartLimitInterval=30

[Install]
WantedBy=multi-user.target
```
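
After placing the unit file at `/etc/systemd/system/llama-swap.service` and adjusting the paths in `ExecStart`, it can be enabled with the standard systemd commands:

```bash
# reload unit files, then start llama-swap now and on every boot
sudo systemctl daemon-reload
sudo systemctl enable --now llama-swap

# check status and follow the logs
systemctl status llama-swap
journalctl -u llama-swap -f
```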