v0.7.0

@OlivierDehaene released this 23 May 19:21
d31562f

Features

  • server: reduce VRAM requirements of continuous batching (contributed by @njhill)
  • server: Support BLOOMChat-176B (contributed by @njhill)
  • server: add watermarking tests (contributed by @ehsanmok)
  • router: add response schema for compat_generate (contributed by @gsaivinay)
  • router: use number of tokens in batch as input for dynamic batching (co-authored by @njhill)
  • server: improve download and reduce the RAM required for conversion to safetensors
  • server: optimize flash causal lm decode token
  • server: shard decode token
  • server: use CUDA graph in logits warping
  • server: support trust_remote_code
  • tests: add snapshot testing

Fixes

  • server: use float16
  • server: fix multinomial implementation in Sampling
  • server: do not use device_map auto on single GPU

Misc

  • docker: use nvidia base image

Full Changelog: v0.6.0...v0.7.0