Notable changes
- Choose input/total tokens automatically based on available VRAM
- Support Qwen2 VL (see the usage sketch below)
- Decrease latency of very large batches (batch size > 128)
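The first two notable changes pair well: token budgets are now derived from free VRAM, so a Qwen2 VL model can be launched without hand-tuning --max-input-tokens/--max-total-tokens and then queried through TGI's OpenAI-compatible Messages API. A minimal sketch, assuming a TGI 2.4.1 server is already running on localhost:8080 with a Qwen2 VL checkpoint (the image URL is a placeholder):

```python
# Minimal sketch: multimodal chat against a local TGI server through the
# OpenAI-compatible /v1/chat/completions endpoint. Assumes TGI 2.4.1 is
# already serving a Qwen2 VL model on localhost:8080; the image URL below
# is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

response = client.chat.completions.create(
    model="tgi",  # TGI serves a single model; this field is not used for routing
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```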
What's Changed
- feat: add triton kernels to decrease latency of large batches by @OlivierDehaene in #2687
- Avoiding timeout for bloom tests. by @Narsil in #2693
- Green main by @Narsil in #2697
- Choosing input/total tokens automatically based on available VRAM? by @Narsil in #2673
- We can have a tokenizer anywhere. by @Narsil in #2527
- Update poetry lock. by @Narsil in #2698
- Fixing auto bloom test. by @Narsil in #2699
- More timeout on docker start? by @Narsil in #2701
- Monkey patching as a desperate measure. by @Narsil in #2704
- add xpu triton in dockerfile, or will show "Could not import Flash At… by @sywangyi in #2702
- Support qwen2 vl by @drbh in #2689
- fix cuda graphs for qwen2-vl by @drbh in #2708
- fix: create position ids for text only input by @drbh in #2714
- fix: add chat_tokenize endpoint to api docs by @drbh in #2710 (see the sketch after this list)
- Hotfixing auto length (warmup max_s was wrong). by @Narsil in #2716
- Fix prefix caching + speculative decoding by @tgaddair in #2711
- Fixing linting on main. by @Narsil in #2719
- nix: move to tgi-nix main by @danieldk in #2718
- fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… by @sywangyi in #2717
- add trust_remote_code in tokenizer to fix baichuan issue by @sywangyi in #2725
- Add initial support for compressed-tensors checkpoints by @danieldk in #2732
- nix: update nixpkgs by @danieldk in #2746
- benchmark: fix prefill throughput by @danieldk in #2741
- Fix: Change model_type from ssm to mamba by @mokeddembillel in #2740
- Fix: Change embeddings to embedding by @mokeddembillel in #2738
- fix response type of document for Text Generation Inference by @jitokim in #2743
- Upgrade outlines to 0.1.1 by @aW3st in #2742
- Upgrading our deps. by @Narsil in #2750
- feat: return streaming errors as an event formatted for OpenAI's client by @drbh in #2668
- Remove vLLM dependency for CUDA by @danieldk in #2751
- fix: improve find_segments via numpy diff by @drbh in #2686
- add ipex moe implementation to support Mixtral and PhiMoe by @sywangyi in #2707
- Add support for compressed-tensors w8a8 int checkpoints by @danieldk in #2745
- feat: support flash attention 2 in qwen2 vl vision blocks by @drbh in #2721
- Simplify two ipex conditions by @danieldk in #2755
- Update to moe-kernels 0.7.0 by @danieldk in #2720
- PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAI's scheme by @drbh in #2645
- fix: adjust llama MLP name from dense to mlp to correctly apply lora by @drbh in #2760
- nix: update for outlines 0.1.4 by @danieldk in #2764
- Add support for wNa16 int 2:4 compressed-tensors checkpoints by @danieldk in #2758
- nix: build and cache impure devshells by @danieldk in #2765
- fix: set outlines version to 0.1.3 to avoid caching serialization issue by @drbh in #2766
- nix: downgrade to outlines 0.1.3 by @danieldk in #2768
- fix: incomplete generations w/ single-token generations and models that did not support chunking by @OlivierDehaene in #2770
- fix: tweak grammar test response by @drbh in #2769
- Add a README section about using Nix by @danieldk in #2767
- Remove guideline from API by @Wauplin in #2762
- feat: Add automatic nightly benchmarks by @Hugoch in #2591
- feat: add payload limit by @OlivierDehaene in #2726
- Update to marlin-kernels 0.3.6 by @danieldk in #2771
- chore: prepare 2.4.1 release by @OlivierDehaene in #2773
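For the chat_tokenize addition above (#2710), the endpoint can be exercised directly to inspect how a chat request is templated and tokenized. A minimal sketch, assuming a local server on port 8080 and that the request body mirrors the chat completions payload (an assumption, not confirmed by these notes):

```python
# Minimal sketch: call TGI's /chat_tokenize endpoint. Assumes a server on
# localhost:8080; the request shape mirroring /v1/chat/completions is an
# assumption.
import requests

resp = requests.post(
    "http://localhost:8080/chat_tokenize",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "Hello, world!"}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # tokenized view of the templated chat prompt
```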
New Contributors
- @tgaddair made their first contribution in #2711
- @mokeddembillel made their first contribution in #2740
- @jitokim made their first contribution in #2743
Full Changelog: v2.3.0...v2.4.1