v2.2.0
Notable changes
- Llama 3.1 support (including 405B), with FP8 support in many mixed configurations (FP8, AWQ, GPTQ, FP8+FP16).
- Gemma2 softcap support
- Deepseek v2 support.
- Lots of internal reworks/cleanup (allowing for cool features)
- Lots of AWQ/GPTQ work with Marlin kernels (everything should be faster by default)
- Flash decoding support (enabled via the FLASH_DECODING=1 environment variable, which should unlock further improvements in the future; see the sketch after this list)
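
The two toggles above can be combined when launching the server. Below is a minimal sketch, assuming `text-generation-launcher` is on your PATH and that the `--model-id`/`--quantize` option names match your installed version (check `--help`); the model id is only illustrative.

```python
import os
import subprocess

# Opt into flash decoding via the environment variable from the notes above.
env = dict(os.environ, FLASH_DECODING="1")

# Launch the server with FP8 quantization (illustrative model id).
subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "--quantize", "fp8",
    ],
    env=env,
    check=True,
)
```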
What's Changed
- Preparing patch release. by @Narsil in #2186
- Adding "longrope" for Phi-3 (#2172) by @amihalik in #2179
- Refactor dead code - Removing all `flash_xxx.py` files. by @Narsil in #2166
- Fix Starcoder2 after refactor by @danieldk in #2189
- GPTQ CI improvements by @danieldk in #2151
- Consistently take `prefix` in model constructors by @danieldk in #2191
- fix dbrx & opt model prefix bug by @icyxp in #2201
- hotfix: Fix number of KV heads by @danieldk in #2202
- Fix incorrect cache allocation with multi-query by @danieldk in #2203
- Falcon/DBRX: get correct number of key-value heads by @danieldk in #2205
- add doc for intel gpus by @sywangyi in #2181
- fix: python deserialization by @jaluma in #2178
- update to metrics 0.23.0 or could work with metrics-exporter-promethe… by @sywangyi in #2190
- feat: use model name as adapter id in chat endpoints by @drbh in #2128
- Fix nccl regression on PyTorch 2.3 upgrade by @fxmarty in #2099
- Fix buildx cache + change runner type by @glegendre01 in #2176
- Fixed README ToC by @vinkamath in #2196
- Updating the self check by @Narsil in #2209
- Move quantized weight handling out of the `Weights` class by @danieldk in #2194
- Add support for FP8 on compute capability >=8.0, <8.9 by @danieldk in #2213
- fix: append DONE message to chat stream by @drbh in #2221
- [fix] Modifying base in yarn embedding by @SeongBeomLEE in #2212
- Use symmetric quantization in the `quantize` subcommand by @danieldk in #2120
- feat: simple mistral lora integration tests by @drbh in #2180
- fix custom cache dir by @ErikKaum in #2226
- fix: Remove bitsandbytes installation when running cpu-only install by @Hugoch in #2216
- Add support for AWQ-quantized Idefics2 by @danieldk in #2233
- `server quantize`: expose groupsize option by @danieldk in #2225
- Remove stray `quantize` argument in `get_weights_col_packed_qkv` by @danieldk in #2237
- fix(server): fix cohere by @OlivierDehaene in #2249
- Improve the handling of quantized weights by @danieldk in #2250
- Hotfix: fix of use of unquantized weights in Gemma GQA loading by @danieldk in #2255
- Hotfix: various GPT-based model fixes by @danieldk in #2256
- Hotfix: fix MPT after recent refactor by @danieldk in #2257
- Hotfix: pass through model revision in `VlmCausalLM` by @danieldk in #2258
- usage stats and crash reports by @ErikKaum in #2220
- add usage stats to toctree by @ErikKaum in #2260
- fix: adjust default tool choice by @drbh in #2244
- Add support for Deepseek V2 by @danieldk in #2224
- re-push to internal registry by @XciD in #2242
- Add FP8 release test by @danieldk in #2261
- feat(fp8): use fbgemm kernels and load fp8 weights directly by @OlivierDehaene in #2248
- fix(server): fix deepseekv2 loading by @OlivierDehaene in #2266
- Hotfix: fix of use of unquantized weights in Mixtral GQA loading by @icyxp in #2269
- legacy warning on text_generation client by @ErikKaum in #2271
- fix(ci): test new instances by @XciD in #2272
- fix(server): fix fp8 weight loading by @OlivierDehaene in #2268
- Softcapping for gemma2. by @Narsil in #2273
- use proper name for ci by @XciD in #2274
- Fixing mistral nemo. by @Narsil in #2276
- fix(l4): fix fp8 logic on l4 by @OlivierDehaene in #2277
- Add support for repacking AWQ weights for GPTQ-Marlin by @danieldk in #2278
- [WIP] Add support for Mistral-Nemo by supporting head_dim through config by @shaltielshmid in #2254
- Preparing for release. by @Narsil in #2285
- Add support for Llama 3 rotary embeddings by @danieldk in #2286
- hotfix: pin numpy by @danieldk in #2289
New Contributors
- @jaluma made their first contribution in #2178
- @vinkamath made their first contribution in #2196
- @ErikKaum made their first contribution in #2226
- @Hugoch made their first contribution in #2216
- @XciD made their first contribution in #2242
- @shaltielshmid made their first contribution in #2254
Full Changelog: v2.1.1...v2.2.0