v0.7.0

OlivierDehaene released this 23 May 19:21


d31562f

Features

server: reduce vram requirements of continuous batching (contributed by @njhill)
server: Support BLOOMChat-176B (contributed by @njhill)
server: add watermarking tests (contributed by @ehsanmok)
router: Adding response schema for compat_generate (contributed by @gsaivinay)
router: use number of tokens in batch as input for dynamic batching (co-authored by @njhill)
server: improve download and reduce the RAM requirements of conversion to safetensors
server: optimize flash causal lm decode token
server: shard decode token
server: use cuda graph in logits warping
server: support trust_remote_code
tests: add snapshot testing
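
As a rough illustration of the router change above (batching on token count rather than request count), here is a minimal Python sketch. It is not the router's implementation; the `Request` fields, the `next_batch` helper, and the `token_budget` parameter are hypothetical names chosen for this example, and the cost model (input tokens plus requested new tokens) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    # Hypothetical request record; field names are illustrative only.
    request_id: int
    input_tokens: int
    max_new_tokens: int


def next_batch(queue: List[Request], token_budget: int) -> List[Request]:
    """Take requests from the front of the queue while the token budget allows."""
    batch: List[Request] = []
    used = 0
    for req in queue:
        # Assumed cost model: prompt tokens plus the tokens the request may generate.
        cost = req.input_tokens + req.max_new_tokens
        if used + cost > token_budget:
            break  # preserve FIFO order: stop at the first request that does not fit
        batch.append(req)
        used += cost
    return batch


queue = [Request(0, 100, 50), Request(1, 400, 100), Request(2, 30, 20)]
batch = next_batch(queue, token_budget=660)
print([r.request_id for r in batch])  # prints [0, 1]
```

With a budget of 660 tokens, the first two requests (150 and 500 tokens) fit and the third would exceed the budget, so it waits for the next batch. A fixed cap on the number of requests would not distinguish between short and long prompts in this way.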

Fix

server: use float16
server: fix multinomial implementation in Sampling
server: do not use device_map auto on single GPU

Misc

docker: use nvidia base image

New Contributors

@ehsanmok made their first contribution in #248
@gsaivinay made their first contribution in #292
@xyang16 made their first contribution in #343
@oOraph made their first contribution in #359

Full Changelog: v0.6.0...v0.7.0
