v0.9.4
Features
- server: auto max_batch_total_tokens for flash att models #630
- router: ngrok edge #642
- server: Add trust_remote_code to quantize script by @ChristophRaab #647
- server: Add exllama GPTQ CUDA kernel support #553 #666
- server: Directly load GPTBigCode to specified device by @Atry in #618
- server: add cuda memory fraction #659
- server: Using quantize_config.json instead of GPTQ_BITS env variables #671
- server: support new falcon config #712
Fix
- server: llama v2 GPTQ #648
- server: Fixing non-parameters in quantize script (bigcode/starcoder was an example) #661
- server: use mem_get_info to get kv cache size #664
- server: fix exllama buffers #689
- server: fix quantization python requirements #708
New Contributors
- @ChristophRaab made their first contribution in #647
- @fxmarty made their first contribution in #648
- @Atry made their first contribution in #618
Full Changelog: v0.9.3...v0.9.4