v0.7.0
Features
- server: reduce VRAM requirements of continuous batching (contributed by @njhill)
- server: support BLOOMChat-176B (contributed by @njhill)
- server: add watermarking tests (contributed by @ehsanmok)
- router: add response schema for compat_generate (contributed by @gsaivinay)
- router: use number of tokens in batch as input for dynamic batching (co-authored by @njhill)
- server: improve download and decrease RAM requirements of conversion to safetensors
- server: optimize flash causal lm decode token
- server: shard decode token
- server: use CUDA graph in logits warping
- server: support trust_remote_code
- tests: add snapshot testing
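
To illustrate the router entry above on token-based dynamic batching, here is a minimal sketch of forming a batch under a token budget rather than a request count. It is not the router's actual code: `Request`, `next_batch`, and `max_batch_tokens` are hypothetical names, and charging each request its input tokens plus its requested new tokens is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    # Hypothetical request record; not the router's actual type.
    input_tokens: int
    max_new_tokens: int


def next_batch(queue: List[Request], max_batch_tokens: int) -> List[Request]:
    """Take requests from the front of the queue while the token budget holds."""
    batch: List[Request] = []
    used = 0
    for req in queue:
        # Assumed cost model: prompt tokens plus the tokens the request may generate.
        cost = req.input_tokens + req.max_new_tokens
        if used + cost > max_batch_tokens:
            break  # preserve FIFO order: stop at the first request that does not fit
        batch.append(req)
        used += cost
    return batch
```

With a budget counted in tokens, a few long requests or many short ones can fill the same batch, which a fixed request count cannot express.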
Fix
- server: use float16
- server: fix multinomial implementation in Sampling
- server: do not use device_map auto on single GPU
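
For readers unfamiliar with the multinomial sampling mentioned in the fixes above, here is a minimal pure-Python sketch of drawing one token index from a probability vector by inverse-CDF sampling. It is illustrative only and is not the server's implementation; `sample_multinomial` is a hypothetical helper name.

```python
import random
from typing import Optional, Sequence


def sample_multinomial(probs: Sequence[float], rng: Optional[random.Random] = None) -> int:
    """Draw one index with probability proportional to probs[index]."""
    rng = rng or random.Random()
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    threshold = rng.random() * total  # uniform in [0, total)
    cumulative = 0.0
    last_nonzero = 0
    for index, p in enumerate(probs):
        if p > 0:
            last_nonzero = index
        cumulative += p
        if threshold < cumulative:
            return index
    # Only reached through floating-point round-off in the running sum.
    return last_nonzero
```

Passing a seeded `random.Random` makes the draw reproducible, which is useful when testing sampling code.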
Misc
- docker: use nvidia base image
New Contributors
- @ehsanmok made their first contribution in #248
- @gsaivinay made their first contribution in #292
- @xyang16 made their first contribution in #343
- @oOraph made their first contribution in #359
Full Changelog: v0.6.0...v0.7.0