Releases · huggingface/text-generation-inference
v0.9.4
Features
- server: auto max_batch_total_tokens for flash attention models #630
- router: ngrok edge #642
- server: Add trust_remote_code to quantize script by @ChristophRaab #647
- server: Add exllama GPTQ CUDA kernel support #553 #666
- server: Directly load GPTBigCode to specified device by @Atry in #618
- server: add cuda memory fraction #659
- server: Using quantize_config.json instead of GPTQ_BITS env variables #671 (see the sketch after this list)
- server: support new falcon config #712
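A minimal sketch of the new lookup order, assuming an AutoGPTQ-style `quantize_config.json` with `bits` and `group_size` fields (the field names and helper are illustrative, not taken from the TGI source):

```python
# Illustrative only: prefer quantize_config.json over the legacy
# GPTQ_BITS / GPTQ_GROUPSIZE environment variables.
import json
import os
from pathlib import Path


def load_gptq_params(model_path: str):
    """Return (bits, groupsize) for a GPTQ checkpoint."""
    config_file = Path(model_path) / "quantize_config.json"
    if config_file.exists():
        config = json.loads(config_file.read_text())
        return config["bits"], config["group_size"]
    # Fall back to the old environment variables for older checkpoints.
    return int(os.environ["GPTQ_BITS"]), int(os.environ["GPTQ_GROUPSIZE"])
```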
Fix
- server: llama v2 GPTQ #648
- server: Fixing non-parameters in the quantize script (bigcode/starcoder was an example) #661
- server: use mem_get_info to get kv cache size #664 (see the sketch after this list)
- server: fix exllama buffers #689
- server: fix quantization python requirements #708
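The kv cache sizing fix pairs naturally with the new CUDA memory fraction option. Below is a hedged sketch, not the actual TGI code, of deriving the memory budget from `torch.cuda.mem_get_info`:

```python
import torch


def kv_cache_memory_budget(cuda_memory_fraction: float = 1.0) -> int:
    """Bytes of free CUDA memory the KV cache may claim."""
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    # Never claim more than the configured fraction of the whole device.
    return min(free_bytes, int(total_bytes * cuda_memory_fraction))
```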
New Contributors
- @ChristophRaab made their first contribution in #647
- @fxmarty made their first contribution in #648
- @Atry made their first contribution in #618
Full Changelog: v0.9.3...v0.9.4
v0.9.3
Highlights
- server: add support for flash attention v2
- server: add support for llamav2
Features
- launcher: add debug logs
- server: rework the quantization to support all models
Full Changelog: v0.9.2...v0.9.3
v0.9.2
Features
- server: harden the choice of which weights to save on disk
- server: better errors for warmup and TP
- server: Support env values for GPTQ_BITS and GPTQ_GROUPSIZE
- server: Implements sharding for non-divisible vocab_size (see the sketch after this list)
- launcher: add arg validation and drop subprocess
- router: explicit warning if revision is not set
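Sharding a non-divisible `vocab_size` is commonly done by padding the vocabulary up to the next multiple of the world size; the snippet below is a conceptual sketch under that assumption, not TGI's actual implementation:

```python
def shard_bounds(vocab_size: int, world_size: int, rank: int):
    """Return the [start, stop) slice of the vocabulary owned by one shard."""
    # Pad the vocabulary up to the next multiple of world_size so every rank
    # gets a block of the same nominal size.
    padded = (vocab_size + world_size - 1) // world_size * world_size
    block = padded // world_size
    start = rank * block
    stop = min(start + block, vocab_size)  # the last shard may be smaller
    return start, stop
```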
Fix
- server: Fixing RW code (it's remote code, so the architecture check doesn't work to decide which weights to keep)
- server: T5 weights names
- server: Adding logger import to t5_modeling.py by @akowalsk
- server: Bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
- server: GPTQ Env vars: catch correct type of error by @ssmi153
- server: blacklist local files
New Contributors
- @akowalsk made their first contribution in #585
- @ssmi153 made their first contribution in #590
- @gary149 made their first contribution in #611
Full Changelog: v0.9.1...v0.9.2
v0.9.1
Highlights
- server: Non flash MPT
- server: decrease memory fragmentation
Features
- server: use latest flash attention
- router: add argument for hostname in router
- docs: Adding some help for the options in text-generation-benchmark
Fix
- makefile: Update server/Makefile to include Makefile-vllm
- server: Handle loading from local files for MPT
- server: avoid errors for very small top_p values
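For the small `top_p` fix, the usual safeguard is to always keep the most likely token so that even a tiny `top_p` cannot mask every candidate. A hedged sketch of such a warper (illustrative, not the TGI code):

```python
import torch


def top_p_warp(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    """Mask logits outside the top-p nucleus, always keeping the best token."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True, dim=-1)
    cumulative = sorted_probs.cumsum(dim=-1)
    # A token is dropped only if the tokens ranked above it already cover
    # top_p, so the highest-probability token always survives.
    sorted_remove = (cumulative - sorted_probs) > top_p
    remove = sorted_remove.scatter(-1, sorted_idx, sorted_remove)
    return logits.masked_fill(remove, float("-inf"))
```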
Full Changelog: v0.9.0...v0.9.1
v0.9.0
Highlights
- server: add paged attention to flash models
- server: Inference support for GPTQ (llama + falcon tested) + Quantization script
- server: only compute prefill logprobs when asked
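Prefill logprobs are now opt-in. A hedged example of requesting them over the HTTP API; the `decoder_input_details` field name is an assumption about the request schema around this release and may differ between versions:

```python
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {
            "max_new_tokens": 20,
            # Prompt (prefill) logprobs are only computed when requested.
            "decoder_input_details": True,
        },
    },
    timeout=60,
)
print(response.json())
```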
Features
- launcher: parse oom signals
- server: batch tokenization for flash causal lm
- server: Rework loading
- server: optimize dist ops
- router: add ngrok integration
- server: improve flash attention import errors
- server: Refactor conversion logic
- router: add header option to disable buffering for the generate_stream response by @rkimball
- router: add arg validation
Fix
- docs: CUDA_VISIBLE_DEVICES comment by @antferdom
- docs: Fix typo and use POSIX comparison in the makefile by @piratos
- server: fix warpers on CPU
- server: Fixing T5 in case the names are mixed up
- router: add timeout on flume sends
- server: Do not init process group if already initialized
- server: Add the option to force another dtype than f16 (see the sketch after this list)
- launcher: fix issue where launcher does not properly report shard failures
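A minimal sketch of the dtype override, assuming the option maps directly onto torch dtypes (the helper and flag plumbing are illustrative):

```python
from typing import Optional

import torch

_DTYPES = {"float16": torch.float16, "bfloat16": torch.bfloat16}


def resolve_dtype(requested: Optional[str]) -> torch.dtype:
    # float16 stays the default; an explicit request overrides it.
    return _DTYPES[requested] if requested else torch.float16
```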
New Contributors
- @antferdom made their first contribution in #441
- @piratos made their first contribution in #443
- @Yard1 made their first contribution in #388
- @rkimball made their first contribution in #498
Full Changelog: v0.8.2...v0.9.0
v0.8.2
Features
- server: remove trust_remote_code requirement for falcon models
- server: load santacoder/starcoder models with safetensors
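Loading santacoder/starcoder weights through safetensors essentially means swapping `torch.load` for `safetensors.torch.load_file`; a minimal, hedged example with an illustrative file name:

```python
from safetensors.torch import load_file

# Reads the tensors without going through pickle, unlike torch.load.
state_dict = load_file("model.safetensors", device="cpu")
```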
Fix
- server: fix has_position_ids
Full Changelog: v0.8.1...v0.8.2
v0.8.1
Features
- server: add retry on download
Fix
- server: fix bnb quantization for CausalLM models
Full Changelog: v0.8.0...v0.8.1
v0.8.0
Features
- router: support vectorized warpers in flash causal lm (co-authored by @jlamypoirier)
- proto: decrease IPC proto size
- benchmarker: add summary tables
- server: support RefinedWeb models
Fix
- server: Fix issue when loading AutoModelForSeq2SeqLM models (contributed by @CL-Shang)
New Contributors
- @CL-Shang made their first contribution in #370
- @jlamypoirier made their first contribution in #317
Full Changelog: v0.7.0...v0.8.0
v0.7.0
Features
- server: reduce VRAM requirements of continuous batching (contributed by @njhill)
- server: Support BLOOMChat-176B (contributed by @njhill)
- server: add watermarking tests (contributed by @ehsanmok)
- router: Adding response schema for compat_generate (contributed by @gsaivinay)
- router: use number of tokens in batch as input for dynamic batching (co-authored by @njhill; see the sketch after this list)
- server: improve download and decrease conversion to safetensors RAM requirements
- server: optimize flash causal lm decode token
- server: shard decode token
- server: use cuda graph in logits warping
- server: support trust_remote_code
- tests: add snapshot testing
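The router itself is written in Rust, so the snippet below is only a conceptual sketch of token-based dynamic batching: queued requests are admitted while their combined token cost stays within a total-token budget (the `max_batch_total_tokens` name mirrors the launcher option; the logic is simplified):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueuedRequest:
    prompt_tokens: int
    max_new_tokens: int


def next_batch(queue: List[QueuedRequest], max_batch_total_tokens: int) -> List[QueuedRequest]:
    """Greedily admit requests while the token budget allows it."""
    batch, budget = [], 0
    for request in queue:
        cost = request.prompt_tokens + request.max_new_tokens
        if budget + cost > max_batch_total_tokens:
            break
        batch.append(request)
        budget += cost
    return batch
```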
Fix
- server: use float16
- server: fix multinomial implementation in Sampling
- server: do not use device_map auto on single GPU
Misc
- docker: use nvidia base image
New Contributors
- @ehsanmok made their first contribution in #248
- @gsaivinay made their first contribution in #292
- @xyang16 made their first contribution in #343
- @oOraph made their first contribution in #359
Full Changelog: v0.6.0...v0.7.0
v0.6.0
Features
- server: flash attention past key values optimization (contributed by @njhill)
- router: remove requests when client closes the connection (co-authored by @njhill)
- server: support quantization for flash models
- router: add info route
- server: optimize token decode
- server: support flash sharded santacoder
- security: image signing with cosign
- security: image analysis with trivy
- docker: improve image size
Fix
- server: check cuda capability before importing flash attention (see the sketch after this list)
- server: fix hf_transfer issue with private repositories
- router: add auth token for private tokenizers
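A hedged sketch of the capability guard; the minimum compute capability and the module name are assumptions, not taken from the TGI source:

```python
import torch

MIN_CAPABILITY = (7, 5)  # assumed threshold for the flash attention kernels

flash_attn = None
if torch.cuda.is_available() and torch.cuda.get_device_capability() >= MIN_CAPABILITY:
    # Only import the CUDA extension on GPUs that can actually run it.
    import flash_attn
```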
Misc
- rust: update to 1.69