# v2.3.0

## Important changes
- Renamed `HUGGINGFACE_HUB_CACHE` to use `HF_HOME`. This is done to harmonize environment variables across the HF ecosystem, so the data location moved from `/data/models-...` to `/data/hub/models-...` in the Docker image.
- Prefix caching by default! To help with long-running queries, TGI will use prefix caching and reuse pre-existing queries in the KV cache in order to speed up TTFT. This should be totally transparent for most users; however, it required an intense rewrite of the internals, so bugs can potentially exist. We also changed kernels from `paged_attention` to `flashinfer` (with `flashdecoding` as a fallback for some specific models that aren't supported by flashinfer).
- Lots of performance improvements with Marlin and quantization.
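The prefix-caching idea can be sketched as follows. This is a toy illustration only, not TGI's actual implementation (which uses a radix trie over paged KV blocks); the `PrefixCache` class and its string "KV state" are hypothetical stand-ins:

```python
# Toy sketch of prefix caching: completed requests leave their KV-cache
# state behind, keyed by token prefix. A new request reuses the state of
# its longest cached prefix, so only the remaining tokens need a prefill
# pass, which lowers time-to-first-token (TTFT).

class PrefixCache:
    def __init__(self):
        # Maps a token-id prefix (as a tuple) to its precomputed KV state.
        # Real implementations use a radix trie and paged KV blocks.
        self._cache = {}

    def insert(self, tokens, kv_state):
        self._cache[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (matched_len, kv_state) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            kv = self._cache.get(tuple(tokens[:end]))
            if kv is not None:
                return end, kv
        return 0, None


cache = PrefixCache()
cache.insert([1, 2, 3], kv_state="kv(1,2,3)")

# A new query sharing the [1, 2, 3] prefix only has to prefill token 4.
matched, kv = cache.longest_prefix([1, 2, 3, 4])
print(matched, kv)  # 3 kv(1,2,3)
```

In a real server the cached state must also be evicted and reference-counted, which is why this rewrite touched so much of the internals.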
## What's Changed
- chore: update to torch 2.4 by @OlivierDehaene in #2259
- fix crash in multi-modal by @sywangyi in #2245
- fix of use of unquantized weights in cohere GQA loading, also enable … by @sywangyi in #2291
- Split up `layers.marlin` into several files by @danieldk in #2292
- fix: refactor adapter weight loading and mapping by @drbh in #2193
- Using g6 instead of g5. by @Narsil in #2281
- Some small fixes for the Torch 2.4.0 update by @danieldk in #2304
- Fixing idefics on g6 tests. by @Narsil in #2306
- Fix registry name by @XciD in #2307
- Support tied embeddings in 0.5B and 1.5B Qwen2 models by @danieldk in #2313
- feat: add ruff and resolve issue by @drbh in #2262
- Run ci api key by @ErikKaum in #2315
- Install Marlin from standalone package by @danieldk in #2320
- fix: reject grammars without properties by @drbh in #2309
- patch-error-on-invalid-grammar by @ErikKaum in #2282
- fix: adjust test snapshots and small refactors by @drbh in #2323
- server quantize: store quantizer config in standard format by @danieldk in #2299
- Rebase TRT-llm by @Narsil in #2331
- Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` by @danieldk in #2300
- Pr 2290 ci run by @drbh in #2329
- refactor usage stats by @ErikKaum in #2339
- enable HuggingFaceM4/idefics-9b in intel gpu by @sywangyi in #2338
- Fix cache block size for flash decoding by @danieldk in #2351
- Unify attention output handling by @danieldk in #2343
- fix: attempt forward on flash attn2 to check hardware support by @drbh in #2335
- feat: include local lora adapter loading docs by @drbh in #2359
- fix: return the out tensor rather then the functions return value by @drbh in #2361
- feat: implement a templated endpoint for visibility into chat requests by @drbh in #2333
- feat: prefer stop over eos_token to align with openai finish_reason by @drbh in #2344
- feat: return the generated text when parsing fails by @drbh in #2353
- fix: default num_ln_in_parallel_attn to one if not supplied by @drbh in #2364
- fix: prefer original layernorm names for 180B by @drbh in #2365
- fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig by @almersawi in #2350
- add gptj modeling in TGI #2366 (CI RUN) by @drbh in #2372
- Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) by @drbh in #2371
- Pr 2374 ci branch by @drbh in #2378
- fix EleutherAI/gpt-neox-20b does not work in tgi by @sywangyi in #2346
- Pr 2337 ci branch by @drbh in #2379
- fix: prefer hidden_activation over hidden_act in gemma2 by @drbh in #2381
- Update Quantization docs and minor doc fix. by @Vaibhavs10 in #2368
- Pr 2352 ci branch by @drbh in #2382
- Add FlashInfer support by @danieldk in #2354
- Add experimental flake by @danieldk in #2384
- Using HF_HOME instead of CACHE to get token read in addition to models. by @Narsil in #2288
- flake: add fmt and clippy by @danieldk in #2389
- Update documentation for Supported models by @Vaibhavs10 in #2386
- flake: use rust-overlay by @danieldk in #2390
- Using an enum for flash backens (paged/flashdecoding/flashinfer) by @Narsil in #2385
- feat: add guideline to chat request and template by @drbh in #2391
- Update flake for 9.0a capability in Torch by @danieldk in #2394
- nix: add router to the devshell by @danieldk in #2396
- Upgrade fbgemm by @Narsil in #2398
- Adding launcher to build. by @Narsil in #2397
- Fixing import exl2 by @Narsil in #2399
- Cpu dockerimage by @sywangyi in #2367
- Add support for prefix caching to the v3 router by @danieldk in #2392
- Keeping the benchmark somewhere by @Narsil in #2401
- feat: validate template variables before apply and improve sliding wi… by @drbh in #2403
- fix: allocate tmp based on sgmv kernel if available by @drbh in #2345
- fix: improve completions to send a final chunk with usage details by @drbh in #2336
- Updating the flake. by @Narsil in #2404
- Pr 2395 ci run by @drbh in #2406
- fix: include create_exllama_buffers and set_device for exllama by @drbh in #2407
- nix: incremental build of the launcher by @danieldk in #2410
- Adding more kernels to flake. by @Narsil in #2411
- add numa to improve cpu inference perf by @sywangyi in #2330
- fix: adds causal to attention params by @drbh in #2408
- nix: partial incremental build of the router by @danieldk in #2416
- Upgrading exl2. by @Narsil in #2415
- More fixes trtllm by @mfuntowicz in #2342
- nix: build router incrementally by @danieldk in #2422
- Fixing exl2 and other quanize tests again. by @Narsil in #2419
- Upgrading the tests to match the current workings. by @Narsil in #2423
- nix: try to reduce the number of Rust rebuilds by @danieldk in #2424
- Improve the Consuming TGI + Streaming docs. by @Vaibhavs10 in #2412
- Further fixes. by @Narsil in #2426
- doc: Add metrics documentation and add a 'Reference' section by @Hugoch in #2230
- All integration tests back everywhere (too many failed CI). by @Narsil in #2428
- nix: update to CUDA 12.4 by @danieldk in #2429
- Prefix caching by @Narsil in #2402
- nix: add pure server to flake, add both pure and impure devshells by @danieldk in #2430
- nix: add `text-generation-benchmark` to pure devshell by @danieldk in #2431
- Adding eetq to flake. by @Narsil in #2438
- nix: add awq-inference-engine as server dependency by @danieldk in #2442
- nix: add default package by @danieldk in #2453
- Fix: don't apply post layernorm in SiglipVisionTransformer by @drbh in #2459
- Pr 2451 ci branch by @drbh in #2454
- Fixing CI. by @Narsil in #2462
- fix: bump minijinja version and add test for llama 3.1 tools by @drbh in #2463
- fix: improve regex expression by @drbh in #2468
- nix: build Torch against MKL and various other improvements by @danieldk in #2469
- Lots of improvements (Still 2 allocators) by @Narsil in #2449
- feat: add /v1/models endpoint by @drbh in #2433
- update doc with intel cpu part by @sywangyi in #2420
- Tied embeddings in MLP speculator. by @Narsil in #2473
- nix: improve impure devshell by @danieldk in #2478
- nix: add punica-kernels by @danieldk in #2477
- fix: enable chat requests in vertex endpoint by @drbh in #2481
- feat: support lora revisions and qkv_proj weights by @drbh in #2482
- hotfix: avoid non-prefilled block use when using prefix caching by @danieldk in #2489
- Adding links to Adyen blogpost. by @Narsil in #2492
- Add two handy gitignores for Nix environments by @danieldk in #2484
- hotfix: fix regression of attention api change in intel platform by @sywangyi in #2439
- nix: add pyright/ruff for proper LSP in the impure devshell by @danieldk in #2496
- Fix incompatibility with latest `syrupy` and update in Poetry by @danieldk in #2497
- radix trie: add assertions by @danieldk in #2491
- hotfix: add syrupy to the right subproject by @danieldk in #2499
- Add links to Adyen blogpost by @martinigoyanes in #2500
- Fixing more correctly the invalid drop of the batch. by @Narsil in #2498
- Add Directory Check to Prevent Redundant Cloning in Build Process by @vamsivallepu in #2486
- Prefix test - Different kind of load test to trigger prefix test bugs. by @Narsil in #2490
- Fix tokenization yi by @Narsil in #2507
- Fix truffle by @Narsil in #2514
- nix: support Python tokenizer conversion in the router by @danieldk in #2515
- Add nix test. by @Narsil in #2513
- fix: pass missing revision arg for lora adapter when loading multiple… by @drbh in #2510
- hotfix : enable intel ipex cpu and xpu in python3.11 by @sywangyi in #2517
- Use `ratatui` not (deprecated) `tui` by @strickvl in #2521
- Add tests for Mixtral by @danieldk in #2520
- Adding a test for FD. by @Narsil in #2516
- nix: pure Rust check/fmt/clippy/test by @danieldk in #2525
- fix: metrics unbounded memory by @OlivierDehaene in #2528
- Move to moe-kernels package and switch to common MoE layer by @danieldk in #2511
- Stream options. by @Narsil in #2533
- Update to moe-kenels 0.3.1 by @danieldk in #2535
- doc: clarify that `--quantize` is not needed for pre-quantized models by @danieldk in #2536
- hotfix: ipex fails since cuda moe kernel is not supported by @sywangyi in #2532
- fix: wrap python basic logs in debug assertion in launcher by @OlivierDehaene in #2539
- Preparing for release. by @Narsil in #2540
## New Contributors
- @almersawi made their first contribution in #2350
- @Vaibhavs10 made their first contribution in #2368
- @mfuntowicz made their first contribution in #2342
- @vamsivallepu made their first contribution in #2486
- @strickvl made their first contribution in #2521
**Full Changelog**: v2.2.0...v2.3.0