v2.1.0
Notable changes
- New models: gemma2
- Multi-LoRA adapters: you can now serve multiple LoRAs on the same TGI deployment (#2010); launch sketch after this list
- Faster GPTQ inference and Marlin support (up to 2x speedup); quantized launch sketch after this list
- Reworked the entire scheduling logic (better block allocation, allowing further speedups in upcoming releases)
- Lots of ROCm support and bugfixes
- Lots of new contributors! Thanks a lot for these contributions.
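A minimal multi-LoRA sketch, assuming the `--lora-adapters` launcher flag and the per-request `adapter_id` parameter introduced with #2010; the model id, adapter ids, and port are illustrative placeholders, so check `text-generation-launcher --help` for the authoritative flag names:

```shell
# Sketch: serve one base model plus two LoRA adapters (ids are illustrative).
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:2.1.0 \
    --model-id mistralai/Mistral-7B-v0.1 \
    --lora-adapters predibase/customer_support,predibase/magicoder

# Select an adapter per request via the adapter_id parameter;
# omit it to query the base model.
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Hello", "parameters": {"adapter_id": "predibase/customer_support", "max_new_tokens": 64}}'
```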
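And a launch sketch for the quantization speedups, using the existing `--quantize gptq` flag (the model id is an illustrative placeholder); with this release, compatible GPTQ checkpoints can be served through the faster Marlin kernels, and Marlin-quantized models are supported as well (#2014, #2052):

```shell
# Sketch: serve a GPTQ-quantized checkpoint (model id illustrative).
# On supported GPUs this release can dispatch compatible GPTQ weights
# to the faster Marlin kernels.
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:2.1.0 \
    --model-id TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
    --quantize gptq
```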
What's Changed
- OpenAI function calling compatible support by @phangiabao98 in #1888 (request sketch after this list)
- Fixing types. by @Narsil in #1906
- Types. by @Narsil in #1909
- Fixing signals. by @Narsil in #1910
- Removing some unused code. by @Narsil in #1915
- MI300 compatibility by @fxmarty in #1764
- Add TGI monitoring guide through Grafana and Prometheus by @fxmarty in #1908
- Update grafana template by @fxmarty in #1918
- Fix TunableOp bug by @fxmarty in #1920
- Fix TGI issues with ROCm by @fxmarty in #1921
- Fixing the download strategy for ibm-fms by @Narsil in #1917
- ROCm: make CK FA2 default instead of Triton by @fxmarty in #1924
- docs: Fix grafana dashboard url by @edwardzjl in #1925
- feat: include token in client test like server tests by @drbh in #1932
- Creating doc automatically for supported models. by @Narsil in #1929
- fix: use path inside of speculator config by @drbh in #1935
- feat: add train medusa head tutorial by @drbh in #1934
- reenable xpu for tgi by @sywangyi in #1939
- Fixing some legacy behavior (big swapout of serverless on legacy stuff). by @Narsil in #1937
- Add completion route to client and add stop parameter where it's missing by @thomas-schillaci in #1869
- Improving the logging system. by @Narsil in #1938
- Fixing codellama loads by using purely `AutoTokenizer`. by @Narsil in #1947
- Fix seeded output. by @Narsil in #1949
- Fix (flash) Gemma prefix and enable tests by @danieldk in #1950
- Fix GPTQ for models which do not have float16 at the default dtype (simpler) by @danieldk in #1953
- Processor config chat template by @drbh in #1954
- fix small typo and broken link by @MoritzLaurer in #1958
- Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). by @Narsil in #1959
- Fix (non-container) pytest stdout buffering-related lock-up by @danieldk in #1963
- Fixing the text part from tokenizer endpoint. by @Narsil in #1967
- feat: adjust attn weight loading logic by @drbh in #1975
- Add support for exl2-quantized models by @danieldk in #1965
- Update documentation version to 2.0.4 by @fxmarty in #1980
- Purely refactors paged/attention into `layers/attention` and makes hardware differences more obvious with one file per hardware. by @Narsil in #1986
- Fixing exl2 scratch buffer. by @Narsil in #1990
- single char ` addition for docs by @nbroad1881 in #1989
- Fixing GPTQ imports. by @Narsil in #1994
- Re-enable xpu, broken by gptq and setuptools upgrade by @sywangyi in #1988
- router: send the input as chunks to the backend by @danieldk in #1981
- Fix Phi-2 with `tp>1` by @danieldk in #2003
- fix: update triton implementation reference by @emmanuel-ferdman in #2002
- feat: add SchedulerV3 by @OlivierDehaene in #1996
- Support GPTQ models with column-packed up/gate tensor by @danieldk in #2006
- Making `make install` work better by default. by @Narsil in #2004
- Hotfixing `make install`. by @Narsil in #2008
- Do not initialize scratch space when there are no ExLlamaV2 layers by @danieldk in #2015
- feat: move allocation logic to rust by @OlivierDehaene in #1835
- Fixing rocm. by @Narsil in #2021
- Fix GPTQWeight import by @danieldk in #2020
- Update version on `__init__.py` to 0.7.0 by @andimarafioti in #2017
- Add support for Marlin-quantized models by @danieldk in #2014
- marlin: support tp>1 when group_size==-1 by @danieldk in #2032
- marlin: improve build by @danieldk in #2031
- Internal runner? by @Narsil in #2023
- Xpu gqa by @sywangyi in #2013
- server: use chunked inputs by @danieldk in #1985
- ROCm and sliding windows fixes by @fxmarty in #2033
- Add Phi-3 medium support by @danieldk in #2039
- feat(ci): add trufflehog secrets detection by @McPatate in #2038
- fix(ci): remove unnecessary permissions by @McPatate in #2045
- Update LLMM1 bound by @fxmarty in #2050
- Support chat response format by @drbh in #2046
- fix(server): fix OPT implementation by @OlivierDehaene in #2061
- fix(layers): fix SuRotaryEmbedding by @OlivierDehaene in #2060
- PR #2049 CI run by @drbh in #2054
- implement Open Inference Protocol endpoints by @drbh in #1942
- Add support for GPTQ Marlin by @danieldk in #2052
- Update the link for qwen2 by @xianbaoqian in #2068
- Adding architecture document by @tengomucho in #2044
- Support different image sizes in prefill in VLMs by @danieldk in #2065
- Contributing guide & Code of Conduct by @LysandreJik in #2074
- fix build.rs watch files by @zirconium-n in #2072
- Set maximum grpc message receive size to 2GiB by @danieldk in #2075
- CI: Tailscale improvements by @glegendre01 in #2079
- CI: pass pre-commit hooks again by @danieldk in #2084
- feat: rotate tests ci token by @drbh in #2091
- Support exl2-quantized Qwen2 models by @danieldk in #2085
- Factor out sharding of packed tensors by @danieldk in #2059
- Fix `text-generation-server quantize` by @danieldk in #2103
- feat: sort cuda graphs in descending order by @drbh in #2104
- New runner. Manual squash. by @Narsil in #2110
- Fix cargo-chef prepare by @ur4t in #2101
- Support `HF_TOKEN` environment variable by @Wauplin in #2066
- Add OTLP Service Name Environment Variable by @KevinDuffy94 in #2076
- corrected Pydantic warning. by @yukiman76 in #2095
- use xpu-smi to dump used memory by @sywangyi in #2047
- fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api by @sunxichen in #2089
- Cpu tgi by @sywangyi in #1936
- feat: add simple tests for weights by @drbh in #2092
- Removing IPEX_AVAIL. by @Narsil in #2115
- fix cpu and xpu issue by @sywangyi in #2116
- Add pytest release marker by @danieldk in #2114
- Fix CI . by @Narsil in #2118
- Enable multiple LoRA adapters by @drbh in #2010
- Support AWQ quantization with bias by @danieldk in #2117
- Add support for Marlin 2:4 sparsity by @danieldk in #2102
- fix: simplify kserve endpoint and fix imports by @drbh in #2119
- Fixing prom leak by upgrading. by @Narsil in #2129
- Bumping to 2.1 by @Narsil in #2131
- Idefics2: sync added image tokens with transformers by @danieldk in #2080
- Fixing malformed rust tokenizers by @Narsil in #2134
- Fixing gemma2. by @Narsil in #2135
- fix: refactor post_processor logic and add test by @drbh in #2137
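For the OpenAI function calling compatibility referenced at the top of this list (#1888), a minimal request sketch against the OpenAI-compatible `/v1/chat/completions` route; the tool name, schema, and port are illustrative placeholders:

```shell
# Sketch: OpenAI-style tool calling (the tool definition is illustrative).
curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "tgi",
      "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }],
      "tool_choice": "auto"
    }'
```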
New Contributors
- @phangiabao98 made their first contribution in #1888
- @edwardzjl made their first contribution in #1925
- @thomas-schillaci made their first contribution in #1869
- @nbroad1881 made their first contribution in #1989
- @emmanuel-ferdman made their first contribution in #2002
- @andimarafioti made their first contribution in #2017
- @McPatate made their first contribution in #2038
- @xianbaoqian made their first contribution in #2068
- @tengomucho made their first contribution in #2044
- @LysandreJik made their first contribution in #2074
- @zirconium-n made their first contribution in #2072
- @glegendre01 made their first contribution in #2079
- @ur4t made their first contribution in #2101
- @KevinDuffy94 made their first contribution in #2076
- @yukiman76 made their first contribution in #2095
- @sunxichen made their first contribution in #2089
Full Changelog: v2.0.3...v2.1.0