Releases: turboderp/exllamav2

0.2.2

14 Sep 19:20
  • Small fixes related to LMFE (lm-format-enforcer)
  • Allow SDPA during normal inference with a custom bias

Full Changelog: v0.2.1...v0.2.2

0.2.1

08 Sep 17:26
  • TP: fallback SDPA mode when flash-attn is unavailable
  • Faster filter/grammar path
  • Add DRY ("Don't Repeat Yourself") sampler
  • Fix issues introduced in 0.1.9 (streams/graphs) when loading certain models via Tabby
  • Banish Râul
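
DRY penalizes candidate tokens that would extend a verbatim repetition of earlier context, with a penalty that grows exponentially once the repeated run exceeds an allowed length. The sketch below is a generic, simplified illustration of that idea in plain Python. It is not exllamav2's implementation: the function names `dry_penalties` and `apply_dry` are hypothetical, the parameter values (`multiplier=0.8`, `base=1.75`, `allowed_length=2`) are illustrative, and the `breakers` set only approximates DRY's sequence breakers.

```python
def dry_penalties(context, multiplier=0.8, base=1.75, allowed_length=2,
                  breakers=frozenset()):
    """Map each token that would extend an earlier repetition to its penalty.

    Simplified sketch: scans the whole context and stops a match at any token
    listed in `breakers` (a stand-in for DRY's sequence breakers).
    """
    n = len(context)
    longest = {}  # candidate next token -> longest repeated run it would extend
    for i in range(1, n):
        # Count how many tokens ending at position i - 1 match the context tail.
        length = 0
        while (length < i
               and context[n - 1 - length] not in breakers
               and context[i - 1 - length] == context[n - 1 - length]):
            length += 1
        if length >= allowed_length:
            token = context[i]  # token that followed the earlier occurrence
            longest[token] = max(longest.get(token, 0), length)
    # Penalty grows exponentially with how far the run exceeds allowed_length.
    return {tok: multiplier * base ** (run - allowed_length)
            for tok, run in longest.items()}


def apply_dry(logits, context, **kwargs):
    """Return a copy of `logits` (indexed by token id) with DRY penalties subtracted."""
    out = list(logits)
    for token, penalty in dry_penalties(context, **kwargs).items():
        out[token] -= penalty
    return out
```

For the context `[1, 2, 3, 4, 1, 2, 3]`, the tail `1, 2, 3` repeats an earlier run that was followed by token `4`, so only token `4` is penalized, by `0.8 * 1.75 ** 1` (about 1.4).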

Full Changelog: v0.2.0...v0.2.1

0.2.0

28 Aug 21:00

Small release to fix various issues in 0.1.9

Full Changelog: v0.1.9...v0.2.0

0.1.9

22 Aug 11:54
  • Add experimental tensor-parallel mode. Currently supports Llama (1, 2 and 3), Qwen2 and Mistral models
  • CUDA Graphs to reduce overhead and CPU bottlenecking
  • Various other optimizations
  • Some bugfixes

Full Changelog: v0.1.8...v0.1.9

0.1.8

24 Jul 06:36
  • Support Llama 3.1 (correct RoPE scaling etc.)
  • Support IndexTeam architecture
  • Some bugfixes and QoL improvements

Full Changelog: v0.1.7...v0.1.8

0.1.7

11 Jul 13:20
  • Support Gemma2
  • Support InternLM2
  • Various bugfixes and optimizations

Full Changelog: v0.1.6...v0.1.7

0.1.6

24 Jun 00:36
  • Fix dynamic generator fallback mode (was broken for prompts longer than max_input_len)
  • Fix inference on ROCm wave64 devices
  • Make the model conversion script part of the exllamav2 package
  • CPU optimizations

Full Changelog: v0.1.5...v0.1.6

0.1.5

09 Jun 00:19
  • Added Q6 and Q8 cache modes
  • Defragment cache in dynamic generator
  • Use SDPA with Torch 2.3.0+
  • Updated wheels to Torch 2.3.1
  • Added Python 3.12 wheels, plus Python 3.9 for ROCm
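
These notes do not describe how exllamav2 quantizes its cache. The sketch below only illustrates the generic precision trade-off between 4-, 6- and 8-bit storage, using simple symmetric absmax quantization of one group of values. This scheme is an assumption chosen for illustration, not the library's method, and the function names are hypothetical.

```python
def quantize(values, bits):
    """Symmetric absmax quantization of one group to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4 bits, 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map quantized ints back to approximate float values."""
    return [x * scale for x in q]


def max_error(values, bits):
    """Largest absolute round-trip error for one group at the given bit width."""
    q, scale = quantize(values, bits)
    return max(abs(a - b) for a, b in zip(values, dequantize(q, scale)))
```

With absmax scaling, the round-trip error per value is at most half the scale. Each extra bit roughly halves that bound, which is why a Q6 or Q8 cache trades more memory for lower error than Q4.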

Full Changelog: v0.1.4...v0.1.5

0.1.4

03 Jun 23:34
  • Option to keep calibration states in VRAM while measuring
  • Fix for Q4 cache for odd key/value sizes (MiniCPM specifically)
  • Alternative fasttensors option on Windows to solve system memory issues
  • Prefix filter with multiple prefixes
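
A prefix filter constrains generation so the output starts with one of several allowed strings. The sketch below shows one generic way to compute which tokens remain legal at each step. It is not exllamav2's filter API: `allowed_tokens` and its arguments are hypothetical, and the vocabulary is modeled as a plain list of token strings indexed by token id.

```python
def allowed_tokens(vocab, text, prefixes):
    """Return ids of tokens that keep `text` consistent with at least one prefix.

    Returns None once `text` already starts with a complete prefix, meaning
    the filter no longer constrains generation.
    """
    if any(text.startswith(p) for p in prefixes):
        return None
    # Prefixes that the text generated so far is still compatible with.
    live = [p for p in prefixes if p.startswith(text)]
    allowed = set()
    for token_id, piece in enumerate(vocab):
        candidate = text + piece
        # Legal if the token stays inside a prefix or completes (and passes) one.
        if any(p.startswith(candidate) or candidate.startswith(p) for p in live):
            allowed.add(token_id)
    return allowed
```

For example, with prefixes `["Yes", "No"]` and nothing generated yet, the tokens `"Yes"`, `"No"` and `"Y"` are allowed. After `"Y"` is generated, only a continuation such as `"es"` stays legal.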

Full Changelog: v0.1.3...v0.1.4

0.1.3

01 Jun 19:32
  • Fix CFG

Full Changelog: v0.1.2...v0.1.3