
v2.3.0

@Narsil Narsil released this 20 Sep 16:20
· 138 commits to main since this release
169178b

Important changes

  • Replaced HUGGINGFACE_HUB_CACHE with HF_HOME, to harmonize environment variables across the HF ecosystem.
    As a result, data locations in the Docker image moved from /data/models-.... to /data/hub/models-....

  • Prefix caching is now on by default! To help with long-running queries, TGI uses prefix caching and reuses pre-existing queries in the KV cache to speed up TTFT (time to first token). This should be transparent for most users, but it required an intense rewrite of internals, so bugs may exist. We also changed kernels from paged_attention to flashinfer (with flashdecoding as a fallback for specific models that flashinfer does not support).

  • Lots of performance improvements with Marlin and quantization.
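
To illustrate the cache relocation from the first item, here is a minimal Python sketch that maps an old cache entry to its new location. It assumes HF_HOME is /data inside the container (which the paths above suggest) and that cache entries now live under the "hub" subdirectory of HF_HOME. The helper name new_model_path and the example directory name are hypothetical, chosen for illustration only.

```python
import posixpath


def new_model_path(old_path, hf_home="/data"):
    """Map a pre-2.3.0 cache entry to its location under HF_HOME/hub.

    Hypothetical helper: hf_home="/data" is an assumption based on the
    Docker paths quoted in the release notes.
    """
    # Keep only the final path component (the per-model directory name).
    entry = posixpath.basename(old_path.rstrip("/"))
    # Cache entries are now expected under the "hub" subdirectory of HF_HOME.
    return posixpath.join(hf_home, "hub", entry)


# Illustrative directory name, not taken from the release notes.
print(new_model_path("/data/models--example-org--example-model"))
```

If a volume was populated by an older image, moving each such directory to the path returned here is one way to avoid re-downloading weights, assuming the directory contents themselves are unchanged.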

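The idea behind prefix caching can be sketched with a toy model in Python. This is not TGI's implementation: the class ToyPrefixCache and the function tokens_to_prefill are hypothetical, and the sketch only counts how many leading tokens of a new query match a previously seen one, on the assumption that KV-cache entries for that shared prefix can be reused rather than recomputed.

```python
class ToyPrefixCache:
    """Toy model of prefix reuse (illustrative only, not TGI's data structure)."""

    def __init__(self):
        # Token sequences whose KV entries are assumed to still be cached.
        self._seen = []

    def match(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        best = 0
        for cached in self._seen:
            n = 0
            for a, b in zip(cached, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def insert(self, tokens):
        """Record a sequence as cached."""
        self._seen.append(list(tokens))


def tokens_to_prefill(cache, tokens):
    """Count the tokens that still need prefill after reusing a cached prefix."""
    reused = cache.match(tokens)
    cache.insert(tokens)
    return len(tokens) - reused


cache = ToyPrefixCache()
first = tokens_to_prefill(cache, [1, 2, 3, 4])       # nothing cached yet
second = tokens_to_prefill(cache, [1, 2, 3, 9, 10])  # shares the prefix [1, 2, 3]
print(first, second)
```

In this toy model the second query only needs prefill for the tokens after the shared prefix, which is the effect that shortens TTFT for long prompts with a common beginning, such as multi-turn chats.
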
What's Changed

New Contributors

Full Changelog: v2.2.0...v2.3.0