# v1.4.0

## Highlights
- OpenAI-compatible chat completions API #1427 (see the sketch after this list)
- ExLlamaV2 tensor parallelism #1490
- GPTQ support for AMD GPUs (ROCm) #1489
- Phi model support #1442
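
A minimal sketch of calling the new OpenAI-compatible endpoint with the `openai` Python client. The base URL, port, and `model` value are assumptions about a local deployment, not part of the release itself:

```python
from openai import OpenAI

# Assumes a TGI server is already running on localhost:8080
# (URL, port, and model name are placeholders for this sketch).
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="-",  # TGI does not validate the key by default
)

chat_completion = client.chat.completions.create(
    model="tgi",  # the server answers for whichever model it was launched with
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    temperature=0.7,  # sampling params are forwarded from the API per #1470
    stream=True,
)

# Stream the generated tokens as they arrive.
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
```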
## What's Changed
- fix: fix local loading for .bin models by @OlivierDehaene in #1419
- Fix missing make target platform for local install: 'install-flash-attention-v2' by @deepily in #1414
- fix: follow base model for tokenizer in router by @OlivierDehaene in #1424
- Fix local load for Medusa by @PYNing in #1420
- Return prompt vs generated tokens. by @Narsil in #1436
- feat: supports openai chat completions API by @drbh in #1427
- feat: support raise_exception, bos and eos tokens by @drbh in #1450
- chore: bump rust version and annotate/fix all clippy warnings by @drbh in #1455
- feat: conditionally toggle chat on invocations route by @drbh in #1454
- Disable `decoder_input_details` on OpenAI-compatible chat streaming, pass temp and top-k from API by @EndlessReform in #1470
- Fixing non divisible embeddings. by @Narsil in #1476
- Add messages api compatibility docs by @drbh in #1478
- Add a new `/tokenize` route to get the tokenized input by @Narsil in #1471 (see the sketch after this list)
- feat: adds phi model by @drbh in #1442
- fix: read stderr in download by @OlivierDehaene in #1486
- fix: show warning with tokenizer config parsing error by @drbh in #1488
- fix: launcher doc typos by @Narsil in #1473
- Reinstate exl2 with tp by @Narsil in #1490
- Add sealion mpt support by @Narsil in #1477
- Trying to fix that flaky test. by @Narsil in #1491
- fix: launcher doc typos by @thelinuxkid in #1462
- Update the docs to include newer models. by @Narsil in #1492
- GPTQ support on ROCm by @fxmarty in #1489
- feat: add tokenizer-config-path to launcher args by @drbh in #1495
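
A quick sketch of the new `/tokenize` route (#1471) using `requests`. The host, port, and the `inputs` payload field are assumptions for this example, mirroring the shape of the generation endpoints:

```python
import requests

# Assumes a TGI server on localhost:8080; the request body reuses the
# same "inputs" field as the generation routes (an assumption here).
resp = requests.post(
    "http://localhost:8080/tokenize",
    json={"inputs": "What is deep learning?"},
)
resp.raise_for_status()

# The response describes each token of the tokenized input.
for token in resp.json():
    print(token)
```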
## New Contributors
- @deepily made their first contribution in #1414
- @PYNing made their first contribution in #1420
- @drbh made their first contribution in #1427
- @EndlessReform made their first contribution in #1470
- @thelinuxkid made their first contribution in #1462
**Full Changelog**: v1.3.4...v1.4.0