v0.9.0
Highlights
- server: add paged attention to flash models
- server: Inference support for GPTQ (llama + falcon tested) + Quantization script
- server: only compute prefill logprobs when asked (see the request sketch after this list)
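
The prefill-logprobs change means the server skips computing logprobs for the prompt tokens unless the request opts in. Below is a minimal sketch of opting in, assuming a TGI server on `localhost:8080` and assuming the `decoder_input_details` request parameter is what gates prefill logprobs; check the API docs for your deployment.

```python
import requests

# Hypothetical local deployment; adjust the URL to your server.
URL = "http://localhost:8080/generate"

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 20,
        # Assumption: prefill (prompt) logprobs are only computed when this
        # flag is set; omitting it avoids the extra prefill work.
        "decoder_input_details": True,
    },
}

response = requests.post(URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```
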
Features
- launcher: parse OOM signals
- server: batch tokenization for flash causal LM
- server: Rework loading
- server: optimize dist ops
- router: add ngrok integration
- server: improve flash attention import errors
- server: Refactor conversion logic
- router: add header option to disable buffering for the generate_stream response by @rkimball (see the streaming sketch after this list)
- router: add arg validation
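
The new buffering header matters mainly when TGI sits behind a reverse proxy that buffers responses before forwarding them. Below is a minimal consumer sketch, assuming the server streams server-sent events from `/generate_stream` on `localhost:8080`; the exact header name the router uses to disable proxy buffering is an implementation detail not shown here.

```python
import json
import requests

# Hypothetical local deployment; adjust the URL to your server.
URL = "http://localhost:8080/generate_stream"

payload = {"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}

with requests.post(URL, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    # Each server-sent event carries one generated token as a JSON payload.
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):].strip())
            print(event["token"]["text"], end="", flush=True)
print()
```

With buffering disabled, each token reaches the client as soon as it is generated instead of arriving in proxy-sized chunks.
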
Fix
- docs: correct the CUDA_VISIBLE_DEVICES comment by @antferdom
- docs: Fix typo and use POSIX comparison in the makefile by @piratos
- server: fix logits warpers on CPU
- server: fix T5 loading when weight names are mixed up
- router: add timeout on flume sends
- server: Do not init process group if already initialized (see the sketch after this list)
- server: Add the option to force a dtype other than f16
- launcher: fix issue where shard failures were not properly reported
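
The process-group fix guards against double initialization, which raises a `RuntimeError` in `torch.distributed`. Below is a minimal sketch of the guard pattern under that assumption, not the server's actual code.

```python
import torch.distributed as dist

def init_process_group_once(backend: str = "nccl") -> None:
    """Initialize the default process group only if it is not already up.

    Calling init_process_group a second time raises a RuntimeError, so the
    is_initialized() guard makes setup idempotent (e.g. when a launcher or
    framework has already created the group).
    """
    if not dist.is_initialized():
        # Assumes the usual env vars are set: RANK, WORLD_SIZE,
        # MASTER_ADDR, MASTER_PORT.
        dist.init_process_group(backend=backend, init_method="env://")
```
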
New Contributors
- @antferdom made their first contribution in #441
- @piratos made their first contribution in #443
- @Yard1 made their first contribution in #388
- @rkimball made their first contribution in #498
Full Changelog: v0.8.2...v0.9.0