
v0.9.4

@OlivierDehaene OlivierDehaene released this 27 Jul 17:29
9f18f4c

Features

  • server: auto max_batch_total_tokens for flash attention models #630
  • router: ngrok edge #642
  • server: Add trust_remote_code to quantize script by @ChristophRaab #647
  • server: Add exllama GPTQ CUDA kernel support #553 #666
  • server: Directly load GPTBigCode to specified device by @Atry in #618
  • server: add cuda memory fraction #659
  • server: use quantize_config.json instead of GPTQ_BITS env variables #671 (see the sketch after this list)
  • server: support new falcon config #712
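
For the quantize_config.json change above, here is a minimal sketch of the general idea: GPTQ parameters are read from a file in the model directory rather than from GPTQ_BITS / GPTQ_GROUPSIZE environment variables. The field names (bits, group_size) assume the common AutoGPTQ layout; this is an illustration, not the actual TGI code from #671.

```python
# Minimal sketch, not the actual TGI implementation: read GPTQ settings
# from quantize_config.json in the model directory instead of the old
# GPTQ_BITS / GPTQ_GROUPSIZE environment variables.
import json
import os


def load_gptq_params(model_dir: str) -> tuple[int, int]:
    """Return (bits, group_size) for a GPTQ checkpoint."""
    config_path = os.path.join(model_dir, "quantize_config.json")
    with open(config_path) as f:
        cfg = json.load(f)
    # Field names assume the common AutoGPTQ quantize_config.json layout.
    return cfg["bits"], cfg.get("group_size", -1)
```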

Fixes

  • server: llama v2 GPTQ #648
  • server: fix non-parameter tensors in quantize script (bigcode/starcoder was an example) #661
  • server: use mem_get_info to get kv cache size #664 (see the sketch after this list)
  • server: fix exllama buffers #689
  • server: fix quantization python requirements #708
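
For the mem_get_info fix above, a minimal sketch of the technique: the KV cache budget is derived from the memory that is actually free on the device, scaled by a configurable fraction in the spirit of the new cuda memory fraction option. The function name and the memory_fraction parameter are assumptions for illustration, not the code from #664.

```python
# Minimal sketch, not the code from #664: size the KV cache from the memory
# that is actually free on the current CUDA device, scaled by a fraction.
import torch


def kv_cache_budget_bytes(memory_fraction: float = 1.0) -> int:
    """Bytes available for the KV cache on the current CUDA device."""
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    # Claim at most `memory_fraction` of total device memory,
    # but never more than what is currently free.
    return min(free_bytes, int(total_bytes * memory_fraction))
```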

New Contributors

Full Changelog: v0.9.3...v0.9.4