Releases: huggingface/text-generation-inference

v0.9.4

27 Jul 17:29
9f18f4c

Features

  • server: auto max_batch_total_tokens for flash attention models #630
  • router: ngrok edge #642
  • server: Add trust_remote_code to quantize script by @ChristophRaab #647
  • server: Add exllama GPTQ CUDA kernel support #553 #666
  • server: Directly load GPTBigCode to specified device by @Atry in #618
  • server: add cuda memory fraction #659
  • server: use quantize_config.json instead of GPTQ_BITS env variables #671 (see the sketch after this list)
  • server: support new falcon config #712
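
The quantize_config.json change means GPTQ parameters ship with the model weights instead of through the environment. A minimal sketch of that lookup order, assuming a quantize_config.json with bits and group_size keys (the helper name is hypothetical):

```python
import json
import os
from pathlib import Path

def load_gptq_params(model_dir: str):
    """Prefer quantize_config.json shipped with the model; fall back to
    the legacy GPTQ_BITS / GPTQ_GROUPSIZE environment variables."""
    config_path = Path(model_dir) / "quantize_config.json"
    if config_path.exists():
        config = json.loads(config_path.read_text())
        return config["bits"], config["group_size"]
    # Legacy behavior: parameters passed through the environment.
    return int(os.environ["GPTQ_BITS"]), int(os.environ["GPTQ_GROUPSIZE"])
```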

Fix

  • server: llama v2 GPTQ #648
  • server: fix non-parameter tensors in the quantize script (bigcode/starcoder was an example) #661
  • server: use mem_get_info to get the KV cache size #664 (see the sketch after this list)
  • server: fix exllama buffers #689
  • server: fix quantization python requirements #708
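
torch.cuda.mem_get_info asks the CUDA driver for free and total device memory, which is more robust than tracking allocations by hand. A rough sketch of sizing the KV cache from it (the fraction and helper are illustrative, not the server's exact accounting):

```python
import torch

def kv_cache_budget_bytes(fraction: float = 0.9) -> int:
    """Size the KV cache from what the driver reports as free,
    keeping headroom for activations and fragmentation."""
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return int(free_bytes * fraction)
```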


Full Changelog: v0.9.3...v0.9.4

v0.9.3

18 Jul 16:53
5e6ddfd

Highlights

  • server: add support for flash attention v2 (see the sketch after this list)
  • server: add support for Llama v2
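
Flash Attention v2 ships as the same flash_attn package with a new major version, so support typically starts with a version probe. A hedged sketch, assuming flash_attn exposes __version__ (not the repository's exact detection code):

```python
try:
    import flash_attn
    # v2 keeps the package name but bumps the major version.
    HAS_FLASH_ATTN_V2 = int(flash_attn.__version__.split(".")[0]) >= 2
except ImportError:
    HAS_FLASH_ATTN_V2 = False
```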

Features

  • launcher: add debug logs
  • server: rework the quantization to support all models

Full Changelog: v0.9.2...v0.9.3

v0.9.2

14 Jul 14:36
c58a0c1

Features

  • server: harden the choice of which weights to save on disk
  • server: better errors for warmup and TP
  • server: support GPTQ_BITS and GPTQ_GROUPSIZE environment variables
  • server: implement sharding for non-divisible vocab_size (see the sketch after this list)
  • launcher: add arg validation and drop subprocess
  • router: explicit warning if revision is not set
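
Tensor-parallel sharding splits the embedding and LM-head matrices across ranks, which breaks when vocab_size is not a multiple of the world size. One common fix is to hand the remainder rows to the leading ranks; a minimal sketch with hypothetical names:

```python
def shard_bounds(vocab_size: int, world_size: int, rank: int):
    """Return the [start, stop) vocabulary slice owned by `rank`,
    distributing any remainder rows over the leading ranks."""
    base, remainder = divmod(vocab_size, world_size)
    start = rank * base + min(rank, remainder)
    stop = start + base + (1 if rank < remainder else 0)
    return start, stop
```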

Fix

  • server: fix RW (RefinedWeb) code (it's remote code, so the architecture check can't determine which weights to keep)
  • server: T5 weights names
  • server: Adding logger import to t5_modeling.py by @akowalsk
  • server: Bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
  • server: GPTQ Env vars: catch correct type of error by @ssmi153
  • server: blacklist local files


Full Changelog: v0.9.1...v0.9.2

v0.9.1

06 Jul 14:09
31b36cc

Highlights

  • server: non-flash MPT support
  • server: decrease memory fragmentation

Features

  • server: use latest flash attention
  • router: add argument for hostname in router
  • docs: Adding some help for the options in text-generation-benchmark

Fix

  • makefile: Update server/Makefile to include Makefile-vllm
  • server: Handle loading from local files for MPT
  • server: avoid errors for very small top_p values
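
With a very small top_p the nucleus can round down to zero tokens, leaving nothing to sample. The standard guard always keeps the single most probable token; a minimal sketch (illustrative, not the exact patch):

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    """Mask logits outside the top-p nucleus, always keeping >= 1 token."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cumprobs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    sorted_remove = cumprobs > top_p
    # Shift right so the token that crosses the threshold is kept,
    # and never drop the most probable token.
    sorted_remove[..., 1:] = sorted_remove[..., :-1].clone()
    sorted_remove[..., 0] = False
    remove = sorted_remove.scatter(-1, sorted_idx, sorted_remove)
    return logits.masked_fill(remove, float("-inf"))
```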

Full Changelog: v0.9.0...v0.9.1

v0.9.0

01 Jul 17:26
e28a809

Highlights

  • server: add paged attention to flash models (see the sketch after this list)
  • server: Inference support for GPTQ (llama + falcon tested) + Quantization script
  • server: only compute prefill logprobs when asked
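
Paged attention stores the KV cache in fixed-size blocks and keeps a per-sequence block table instead of one contiguous buffer, so memory is allocated on demand and fragmentation stays low. A toy sketch of the bookkeeping (block size and names are illustrative):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockTable:
    """Maps a sequence's logical token positions to physical cache blocks."""

    def __init__(self, free_blocks: list):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = []                # blocks owned by this sequence

    def slot_for(self, position: int):
        """Return (physical_block, offset) for a token position,
        allocating a block when the sequence crosses a boundary."""
        while position // BLOCK_SIZE >= len(self.blocks):
            self.blocks.append(self.free_blocks.pop())
        return self.blocks[position // BLOCK_SIZE], position % BLOCK_SIZE
```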

Features

  • launcher: parse OOM signals
  • server: batch tokenization for flash causal lm (see the sketch after this list)
  • server: rework weight loading
  • server: optimize dist ops
  • router: add ngrok integration
  • server: improve flash attention import errors
  • server: Refactor conversion logic
  • router: add header option to disable buffering for the generate_stream response by @rkimball
  • router: add arg validation
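
Batch tokenization encodes the whole prefill batch in one tokenizer call, which the 🤗 fast tokenizers parallelize in Rust, instead of looping over requests in Python. A minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token

prompts = ["Hello, world", "Paged attention is", "Batching amortizes"]

# One call encodes the whole batch rather than one request at a time.
batch = tokenizer(prompts, padding=True, return_tensors="pt")
input_ids, attention_mask = batch["input_ids"], batch["attention_mask"]
```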

Fix

  • docs: CUDA_VISIBLE_DEVICES comment by @antferdom
  • docs: Fix typo and use POSIX comparison in the makefile by @piratos
  • server: fix warpers on CPU
  • server: Fixing T5 in case the names are mixed up
  • router: add timeout on flume sends
  • server: do not init process group if already initialized (see the sketch after this list)
  • server: add an option to force a dtype other than f16
  • launcher: fix issue where launcher does not properly report shard failures
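
Guarding process-group creation keeps initialization idempotent, since init_process_group raises if called twice. A minimal sketch using the standard torch.distributed check:

```python
import torch.distributed as dist

def ensure_process_group(backend: str = "nccl") -> None:
    # Safe to call from code paths that may run more than once.
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)
```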


Full Changelog: v0.8.2...v0.9.0

v0.8.2

01 Jun 17:51

Features

  • server: remove trust_remote_code requirement for falcon models
  • server: load santacoder/starcoder models with safetensors

Fix

  • server: fix has_position_ids

Full Changelog: v0.8.1...v0.8.2

v0.8.1

31 May 10:10

Features

  • server: add retry on download

Fix

  • server: fix bnb quantization for CausalLM models

Full Changelog: v0.8.0...v0.8.1

v0.8.0

30 May 16:45

Features

  • router: support vectorized warpers in flash causal lm (co-authored by @jlamypoirier; see the sketch after this list)
  • proto: decrease IPC proto size
  • benchmarker: add summary tables
  • server: support RefinedWeb models
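
Vectorized warpers apply each request's sampling parameters to the whole batch as one tensor op instead of looping over rows. A minimal sketch for per-row temperature (top-k and top-p vectorize the same way):

```python
import torch

def warp_temperature(logits: torch.Tensor, temperatures: torch.Tensor) -> torch.Tensor:
    """logits: [batch, vocab]; temperatures: [batch].
    One broadcasted division replaces a Python loop over requests."""
    return logits / temperatures.unsqueeze(1)
```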

Fix

  • server: fix issue when loading AutoModelForSeq2SeqLM models (contributed by @CL-Shang)


Full Changelog: v0.7.0...v0.8.0

v0.7.0

23 May 19:21
d31562f

Features

  • server: reduce vram requirements of continuous batching (contributed by @njhill)
  • server: Support BLOOMChat-176B (contributed by @njhill)
  • server: add watermarking tests (contributed by @ehsanmok)
  • router: Adding response schema for compat_generate (contributed by @gsaivinay)
  • router: use the number of tokens in a batch as input for dynamic batching (co-authored by @njhill; see the sketch after this list)
  • server: improve downloads and decrease RAM requirements when converting to safetensors
  • server: optimize flash causal lm decode token
  • server: shard decode token
  • server: use CUDA graphs in logits warping
  • server: support trust_remote_code
  • tests: add snapshot testing
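
Budgeting the batcher by tokens rather than request count tracks what actually bounds memory: prefill tokens plus tokens still to be decoded. A toy admission loop (the request fields are hypothetical):

```python
from collections import deque

def build_batch(queue: deque, token_budget: int) -> list:
    """Admit queued requests while the running token total fits the budget."""
    batch, used = [], 0
    while queue:
        request = queue[0]
        cost = request.input_tokens + request.max_new_tokens
        if used + cost > token_budget:
            break
        batch.append(queue.popleft())
        used += cost
    return batch
```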

Fix

  • server: use float16
  • server: fix the multinomial implementation in Sampling (see the sketch after this list)
  • server: do not use device_map auto on single GPU
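
The sampling path draws the next token with torch.multinomial over the softmaxed, warped logits; a minimal sketch of the intended behavior (illustrative, not the exact patch):

```python
import torch

def sample_next_token(logits: torch.Tensor) -> torch.Tensor:
    """logits: [batch, vocab]. Return one sampled token id per row."""
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```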

Misc

  • docker: use nvidia base image


Full Changelog: v0.6.0...v0.7.0

v0.6.0

21 Apr 19:02
6ded76a

Features

  • server: flash attention past key values optimization (contributed by @njhill)
  • router: remove requests when client closes the connection (co-authored by @njhill)
  • server: support quantization for flash models
  • router: add info route
  • server: optimize token decode (see the sketch after this list)
  • server: support flash sharded santacoder
  • security: image signing with cosign
  • security: image analysis with trivy
  • docker: improve image size
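
Optimized token decode avoids re-detokenizing the whole sequence every step: decode a small window and emit only the text that extends what was already emitted, which also copes with tokens that only form printable text together. A hedged sketch of the prefix trick:

```python
def decode_new_text(tokenizer, ids, prefix_offset: int, read_offset: int):
    """Return (new_text, prefix_offset, read_offset). Offsets only advance
    once the freshly decoded ids form stable, printable text."""
    prefix_text = tokenizer.decode(ids[prefix_offset:read_offset])
    new_text = tokenizer.decode(ids[prefix_offset:])
    if len(new_text) > len(prefix_text) and not new_text.endswith("\ufffd"):
        return new_text[len(prefix_text):], read_offset, len(ids)
    # Partial multi-byte character: wait for more ids.
    return "", prefix_offset, read_offset
```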

Fix

  • server: check CUDA capability before importing flash attention (see the sketch after this list)
  • server: fix hf_transfer issue with private repositories
  • router: add auth token for private tokenizers
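
Flash attention kernels only build and run on recent GPUs, so the import is gated on compute capability. A minimal guard using torch.cuda.get_device_capability (the minimum version shown is illustrative):

```python
import torch

def flash_attention_supported(min_capability=(7, 5)) -> bool:
    """Check compute capability before importing flash attention kernels."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= min_capability
```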

Misc

  • rust: update to 1.69