v0.9.0
Highlights
- server: add paged attention to flash models
- server: Inference support for GPTQ (llama + falcon tested) + Quantization script
- server: only compute prefill logprobs when asked (see the request sketch after this list)
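
The prefill-logprobs change means the server skips computing logprobs for the prompt tokens unless the request opts in. Below is a minimal sketch of opting in, assuming a TGI server on `localhost:8080` and assuming the `decoder_input_details` request parameter is what gates prefill logprobs; check the API docs for your deployment.

```python
import requests

# Hypothetical local deployment; adjust the URL to your server.
URL = "http://localhost:8080/generate"

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 20,
        # Assumption: prefill (prompt) logprobs are only computed when this
        # flag is set; omitting it avoids the extra prefill work.
        "decoder_input_details": True,
    },
}

response = requests.post(URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```
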
Features
- launcher: parse OOM signals
- server: batch tokenization for flash causal LM
- server: Rework loading
- server: optimize dist ops
- router: add ngrok integration
- server: improve flash attention import errors
- server: Refactor conversion logic
- router: add header option to disable buffering for the generate_stream response by @rkimball (see the streaming sketch after this list)
- router: add arg validation
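
The new buffering header matters mainly when TGI sits behind a reverse proxy that buffers responses before forwarding them. Below is a minimal consumer sketch, assuming the server streams server-sent events from `/generate_stream` on `localhost:8080`; the exact header name the router uses to disable proxy buffering is an implementation detail not shown here.

```python
import json
import requests

# Hypothetical local deployment; adjust the URL to your server.
URL = "http://localhost:8080/generate_stream"

payload = {"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}

with requests.post(URL, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    # Each server-sent event carries one generated token as a JSON payload.
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):].strip())
            print(event["token"]["text"], end="", flush=True)
print()
```

With buffering disabled, each token reaches the client as soon as it is generated instead of arriving in proxy-sized chunks.
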
Fix
- docs: correct the CUDA_VISIBLE_DEVICES comment by @antferdom
- docs: Fix typo and use POSIX comparison in the makefile by @piratos
- server: fix logits warpers on CPU
- server: fix T5 loading when weight names are mixed up
- router: add timeout on flume sends
- server: Do not init process group if already initialized (see the sketch after this list)
- server: Add the option to force a dtype other than f16
- launcher: fix issue where shard failures were not properly reported
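
The process-group fix guards against double initialization, which raises a `RuntimeError` in `torch.distributed`. Below is a minimal sketch of the guard pattern under that assumption, not the server's actual code.

```python
import torch.distributed as dist

def init_process_group_once(backend: str = "nccl") -> None:
    """Initialize the default process group only if it is not already up.

    Calling init_process_group a second time raises a RuntimeError, so the
    is_initialized() guard makes setup idempotent (e.g. when a launcher or
    framework has already created the group).
    """
    if not dist.is_initialized():
        # Assumes the usual env vars are set: RANK, WORLD_SIZE,
        # MASTER_ADDR, MASTER_PORT.
        dist.init_process_group(backend=backend, init_method="env://")
```
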
New Contributors
- @antferdom made their first contribution in #441
- @piratos made their first contribution in #443
- @Yard1 made their first contribution in #388
- @rkimball made their first contribution in #498
Full Changelog: v0.8.2...v0.9.0