Releases · huggingface/text-generation-inference
v0.9.4
Features
- server: auto max_batch_total_tokens for flash attention models #630
- router: ngrok edge #642
- server: Add trust_remote_code to quantize script by @ChristophRaab #647
- server: Add exllama GPTQ CUDA kernel support #553 #666
- server: Directly load GPTBigCode to specified device by @Atry in #618
- server: add cuda memory fraction #659
- server: Using quantize_config.json instead of GPTQ_BITS env variables #671 (see the sketch after this list)
- server: support new falcon config #712
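A minimal sketch of the new lookup order, assuming an AutoGPTQ-style `quantize_config.json` with `bits` and `group_size` fields (the field names and helper are illustrative, not taken from the TGI source):

```python
# Illustrative only: prefer quantize_config.json over the legacy
# GPTQ_BITS / GPTQ_GROUPSIZE environment variables.
import json
import os
from pathlib import Path


def load_gptq_params(model_path: str):
    """Return (bits, groupsize) for a GPTQ checkpoint."""
    config_file = Path(model_path) / "quantize_config.json"
    if config_file.exists():
        config = json.loads(config_file.read_text())
        return config["bits"], config["group_size"]
    # Fall back to the old environment variables for older checkpoints.
    return int(os.environ["GPTQ_BITS"]), int(os.environ["GPTQ_GROUPSIZE"])
```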
Fix
- server: llama v2 GPTQ #648
- server: Fixing non-parameters in the quantize script (bigcode/starcoder was an example) #661
- server: use mem_get_info to get kv cache size #664 (see the sketch after this list)
- server: fix exllama buffers #689
- server: fix quantization python requirements #708
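The kv cache sizing fix pairs naturally with the new CUDA memory fraction option. Below is a hedged sketch, not the actual TGI code, of deriving the memory budget from `torch.cuda.mem_get_info`:

```python
import torch


def kv_cache_memory_budget(cuda_memory_fraction: float = 1.0) -> int:
    """Bytes of free CUDA memory the KV cache may claim."""
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    # Never claim more than the configured fraction of the whole device.
    return min(free_bytes, int(total_bytes * cuda_memory_fraction))
```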
New Contributors
- @ChristophRaab made their first contribution in #647
- @fxmarty made their first contribution in #648
- @Atry made their first contribution in #618
Full Changelog: v0.9.3...v0.9.4
v0.9.3
Highlights
- server: add support for flash attention v2
- server: add support for llamav2
Features
- launcher: add debug logs
- server: rework the quantization to support all models
Full Changelog: v0.9.2...v0.9.3
v0.9.2
Features
- server: harden the choice of which weights to save on disk
- server: better errors for warmup and TP
- server: Support env values for GPTQ_BITS and GPTQ_GROUPSIZE
- server: Implements sharding for non-divisible vocab_size (see the sketch after this list)
- launcher: add arg validation and drop subprocess
- router: explicit warning if revision is not set
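Sharding a non-divisible `vocab_size` is commonly done by padding the vocabulary up to the next multiple of the world size; the snippet below is a conceptual sketch under that assumption, not TGI's actual implementation:

```python
def shard_bounds(vocab_size: int, world_size: int, rank: int):
    """Return the [start, stop) slice of the vocabulary owned by one shard."""
    # Pad the vocabulary up to the next multiple of world_size so every rank
    # gets a block of the same nominal size.
    padded = (vocab_size + world_size - 1) // world_size * world_size
    block = padded // world_size
    start = rank * block
    stop = min(start + block, vocab_size)  # the last shard may be smaller
    return start, stop
```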
Fix
- server: Fixing RW code (it's remote code, so the architecture check doesn't work to decide which weights to keep)
- server: T5 weights names
- server: Adding logger import to t5_modeling.py by @akowalsk
- server: Bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
- server: GPTQ Env vars: catch correct type of error by @ssmi153
- server: blacklist local files
New Contributors
- @akowalsk made their first contribution in #585
- @ssmi153 made their first contribution in #590
- @gary149 made their first contribution in #611
Full Changelog: v0.9.1...v0.9.2
v0.9.1
Highlights
- server: Non flash MPT
- server: decrease memory fragmentation
Features
- server: use latest flash attention
- router: add argument for hostname in router
- docs: Adding some help for the options in text-generation-benchmark
Fix
- makefile: Update server/Makefile to include Makefile-vllm
- server: Handle loading from local files for MPT
- server: avoid errors for very small top_p values
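For the small `top_p` fix, the usual safeguard is to always keep the most likely token so that even a tiny `top_p` cannot mask every candidate. A hedged sketch of such a warper (illustrative, not the TGI code):

```python
import torch


def top_p_warp(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    """Mask logits outside the top-p nucleus, always keeping the best token."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True, dim=-1)
    cumulative = sorted_probs.cumsum(dim=-1)
    # A token is dropped only if the tokens ranked above it already cover
    # top_p, so the highest-probability token always survives.
    sorted_remove = (cumulative - sorted_probs) > top_p
    remove = sorted_remove.scatter(-1, sorted_idx, sorted_remove)
    return logits.masked_fill(remove, float("-inf"))
```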
Full Changelog: v0.9.0...v0.9.1
v0.9.0
Highlights
- server: add paged attention to flash models
- server: Inference support for GPTQ (llama + falcon tested) + Quantization script
- server: only compute prefill logprobs when asked
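Prefill logprobs are now opt-in. A hedged example of requesting them over the HTTP API; the `decoder_input_details` field name is an assumption about the request schema around this release and may differ between versions:

```python
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {
            "max_new_tokens": 20,
            # Prompt (prefill) logprobs are only computed when requested.
            "decoder_input_details": True,
        },
    },
    timeout=60,
)
print(response.json())
```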
Features
- launcher: parse oom signals
- server: batch tokenization for flash causal lm
- server: Rework loading
- server: optimize dist ops
- router: add ngrok integration
- server: improve flash attention import errors
- server: Refactor conversion logic
- router: add header option to disable buffering for the generate_stream response by @rkimball
- router: add arg validation
Fix
- docs: CUDA_VISIBLE_DEVICES comment by @antferdom
- docs: Fix typo and use POSIX comparison in the makefile by @piratos
- server: fix warpers on CPU
- server: Fixing T5 in case the names are mixed up
- router: add timeout on flume sends
- server: Do not init process group if already initialized
- server: Add the option to force another dtype than f16 (see the sketch after this list)
- launcher: fix issue where launcher does not properly report shard failures
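A minimal sketch of the dtype override, assuming the option maps directly onto torch dtypes (the helper and flag plumbing are illustrative):

```python
from typing import Optional

import torch

_DTYPES = {"float16": torch.float16, "bfloat16": torch.bfloat16}


def resolve_dtype(requested: Optional[str]) -> torch.dtype:
    # float16 stays the default; an explicit request overrides it.
    return _DTYPES[requested] if requested else torch.float16
```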
New Contributors
- @antferdom made their first contribution in #441
- @piratos made their first contribution in #443
- @Yard1 made their first contribution in #388
- @rkimball made their first contribution in #498
Full Changelog: v0.8.2...v0.9.0
v0.8.2
Features
- server: remove trust_remote_code requirement for falcon models
- server: load santacoder/starcoder models with safetensors
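Loading santacoder/starcoder weights through safetensors essentially means swapping `torch.load` for `safetensors.torch.load_file`; a minimal, hedged example with an illustrative file name:

```python
from safetensors.torch import load_file

# Reads the tensors without going through pickle, unlike torch.load.
state_dict = load_file("model.safetensors", device="cpu")
```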
Fix
- server: fix has_position_ids
Full Changelog: v0.8.1...v0.8.2
v0.8.1
Features
- server: add retry on download
Fix
- server: fix bnb quantization for CausalLM models
Full Changelog: v0.8.0...v0.8.1
v0.8.0
Features
- router: support vectorized warpers in flash causal lm (co-authored by @jlamypoirier)
- proto: decrease IPC proto size
- benchmarker: add summary tables
- server: support RefinedWeb models
Fix
- server: Fix issue when loading AutoModelForSeq2SeqLM models (contributed by @CL-Shang)
New Contributors
- @CL-Shang made their first contribution in #370
- @jlamypoirier made their first contribution in #317
Full Changelog: v0.7.0...v0.8.0
v0.7.0
Features
- server: reduce VRAM requirements of continuous batching (contributed by @njhill)
- server: Support BLOOMChat-176B (contributed by @njhill)
- server: add watermarking tests (contributed by @ehsanmok)
- router: Adding response schema for compat_generate (contributed by @gsaivinay)
- router: use number of tokens in batch as input for dynamic batching (co-authored by @njhill; see the sketch after this list)
- server: improve download and decrease conversion to safetensors RAM requirements
- server: optimize flash causal lm decode token
- server: shard decode token
- server: use cuda graph in logits warping
- server: support trust_remote_code
- tests: add snapshot testing
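The router itself is written in Rust, so the snippet below is only a conceptual sketch of token-based dynamic batching: queued requests are admitted while their combined token cost stays within a total-token budget (the `max_batch_total_tokens` name mirrors the launcher option; the logic is simplified):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueuedRequest:
    prompt_tokens: int
    max_new_tokens: int


def next_batch(queue: List[QueuedRequest], max_batch_total_tokens: int) -> List[QueuedRequest]:
    """Greedily admit requests while the token budget allows it."""
    batch, budget = [], 0
    for request in queue:
        cost = request.prompt_tokens + request.max_new_tokens
        if budget + cost > max_batch_total_tokens:
            break
        batch.append(request)
        budget += cost
    return batch
```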
Fix
- server: use float16
- server: fix multinomial implementation in Sampling
- server: do not use device_map auto on single GPU
Misc
- docker: use nvidia base image
New Contributors
- @ehsanmok made their first contribution in #248
- @gsaivinay made their first contribution in #292
- @xyang16 made their first contribution in #343
- @oOraph made their first contribution in #359
Full Changelog: v0.6.0...v0.7.0
v0.6.0
Features
- server: flash attention past key values optimization (contributed by @njhill)
- router: remove requests when client closes the connection (co-authored by @njhill)
- server: support quantization for flash models
- router: add info route
- server: optimize token decode
- server: support flash sharded santacoder
- security: image signing with cosign
- security: image analysis with trivy
- docker: improve image size
Fix
- server: check cuda capability before importing flash attention (see the sketch after this list)
- server: fix hf_transfer issue with private repositories
- router: add auth token for private tokenizers
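A hedged sketch of the capability guard; the minimum compute capability and the module name are assumptions, not taken from the TGI source:

```python
import torch

MIN_CAPABILITY = (7, 5)  # assumed threshold for the flash attention kernels

flash_attn = None
if torch.cuda.is_available() and torch.cuda.get_device_capability() >= MIN_CAPABILITY:
    # Only import the CUDA extension on GPUs that can actually run it.
    import flash_attn
```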
Misc
- rust: update to 1.69