
Releases: turboderp/exllamav2

0.1.2

01 Jun 17:58
  • Support MiniCPM architecture
  • Optimized prompt processing for paged generator with Q4 cache
  • New HumanEval and MMLU tests using dynamic generator
  • Some bugfixes and small QoL improvements

Full Changelog: v0.1.1...v0.1.2

0.1.1

27 May 16:53
  • Fix performance of Q4 cache in dynamic generator
  • Add paged attn support for FP16 models
  • Add xformers support

Full Changelog: v0.1.0...v0.1.1

0.1.0

25 May 20:56
  • Paged attention support (requires flash-attn>=2.5.7)
  • New generator with dynamic batching support (requires paged attn)
  • Examples updated for dynamic generator
  • Faster speculative decoding (SD) with a draft model
  • Various optimizations, bugfixes and QoL improvements
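Paged attention, and therefore the dynamic generator, needs flash-attn 2.5.7 or newer. As a minimal sketch of how one might check this before enabling paged attention (the helpers `parse_version`, `meets_minimum` and `paged_attn_available` are hypothetical and not part of exllamav2; the distribution name "flash-attn" is assumed):

```python
# Hypothetical helpers, not exllamav2 API: check whether the installed
# flash-attn satisfies the >=2.5.7 requirement for paged attention.
from importlib.metadata import PackageNotFoundError, version

MIN_FLASH_ATTN = (2, 5, 7)

def parse_version(v: str) -> tuple:
    # Keep the leading numeric components, e.g. "2.5.8.post1" -> (2, 5, 8)
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(v: str, minimum: tuple = MIN_FLASH_ATTN) -> bool:
    # Tuple comparison: (2, 5) < (2, 5, 7), so "2.5" correctly fails
    return parse_version(v) >= minimum

def paged_attn_available() -> bool:
    # Assumes the package is installed under the distribution name "flash-attn"
    try:
        return meets_minimum(version("flash-attn"))
    except PackageNotFoundError:
        return False
```

Comparing integer tuples avoids the string-comparison pitfall where "2.10" would sort before "2.5".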

Full Changelog: v0.0.21...v0.1.0

0.0.21

11 May 13:31
  • Support for Granite architecture
  • Support for GPT2 architecture
  • Support for banned strings in streaming generator
  • A bit more work on multimodal support (still unfinished)
  • A few bugfixes and other minor changes
  • Windows wheels for PyTorch 2.2.0 are included below to work around an apparent (likely temporary) issue in PyTorch. See #434 and pytorch/pytorch#125109

Full Changelog: v0.0.20...v0.0.21

0.0.20

27 Apr 00:56
  • Adds Phi3 support
  • Wheels compiled for PyTorch 2.3.0
  • ROCm 6.0 wheels

Full Changelog: v0.0.19...v0.0.20

0.0.19

19 Apr 06:44
ed118b4
  • More accurate Q4 cache using groupwise rotations
  • Better prompt ingestion speed when using flash-attn
  • Minor fixes related to issues quantizing Llama 3
  • New, more robust optimizer
  • Fix bug on long-sequence inference for GPTQ models

Full Changelog: v0.0.18...v0.0.19

0.0.18

07 Apr 18:41
dafb508
  • Support for Command-R-plus
  • Fix for pre-AVX2 CPUs
  • VRAM optimizations for quantization
  • Very preliminary multimodal support
  • Various other small fixes and optimizations

Full Changelog: v0.0.17...v0.0.18

0.0.17

31 Mar 03:19

Mostly just minor fixes and support for DBRX models.

Full Changelog: v0.0.16...v0.0.17

0.0.16

20 Mar 07:23
  • Adds support for Cohere models
  • N-gram decoding
  • A few bugfixes
  • Lots of optimizations

Full Changelog: v0.0.15...v0.0.16

0.0.15

07 Mar 02:26
  • Adds Q4 cache mode
  • Support for StarCoder2
  • Minor optimizations and a couple of bugfixes

Full Changelog: v0.0.14...v0.0.15