v0.9.4
Features
- server: auto max_batch_total_tokens for flash att models #630
- router: ngrok edge #642
- server: Add trust_remote_code to quantize script by @ChristophRaab #647
- server: Add exllama GPTQ CUDA kernel support #553 #666
- server: Directly load GPTBigCode to specified device by @Atry in #618
- server: add cuda memory fraction #659
- server: Using quantize_config.json instead of GPTQ_BITS env variables #671
- server: support new falcon config #712
Fix
- server: llama v2 GPTQ #648
- server: Fixing non-parameters in quantize script (bigcode/starcoder was an example) #661
- server: use mem_get_info to get kv cache size #664
- server: fix exllama buffers #689
- server: fix quantization python requirements #708
New Contributors
- @ChristophRaab made their first contribution in #647
- @fxmarty made their first contribution in #648
- @Atry made their first contribution in #618
Full Changelog: v0.9.3...v0.9.4