Skip to content

v0.6.0

Compare
Choose a tag to compare
@github-actions github-actions released this 04 Sep 23:35
· 1321 commits to main since this release
32e7db2

Highlights

Performance Update

  • We are excited to announce a faster vLLM delivering 2x more throughput compared to v0.5.3. The default parameters should achieve great speed up, but we recommend also try out turning on multi step scheduling. You can do so by setting --num-scheduler-steps 8 in the engine arguments. Please note that it still have some limitations and being actively hardened, see #7528 for known issues.
    • Multi-step scheduler now supports LLMEngine and log_probs (#7789, #7652)
    • Asynchronous output processor overlaps the output data structures construction with GPU works, delivering 12% throughput increase. (#7049, #7911, #7921, #8050)
    • Using FlashInfer backend for FP8 KV Cache (#7798, #7985), rejection sampling in Speculative Decoding (#7244)

Model Support

  • Support bitsandbytes 8-bit and FP4 quantized models (#7445)
  • New LLMs: Exaone (#7819), Granite (#7436), Phi-3.5-MoE (#7729)
  • A new tokenizer mode for mistral models to use the native mistral-commons package (#7739)
  • Multi-modality:
    • multi-image input support for LLaVA-Next (#7230), Phi-3-vision models (#7783)
    • Ultravox support for multiple audio chunks (#7963)
    • TP support for ViTs (#7186)

Hardware Support

  • NVIDIA GPU: extend cuda graph size for H200 (#7894)
  • AMD: Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386)
  • Intel GPU: pipeline parallel support (#7810)
  • Neuron: context lengths and token generation buckets (#7885, #8062)
  • TPU: single and multi-host TPUs on GKE (#7613), Async output processing (#8011)

Production Features

  • OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models! (#5649)
  • Add json_schema support from OpenAI protocol (#7654)
  • Enable chunked prefill and prefix caching together (#7753, #8120)
  • Multimodal support in offline chat (#8098), and multiple multi-modal items in the OpenAI frontend (#8049)

Misc

  • Support benchmarking async engine in benchmark_throughput.py (#7964)
  • Progress in integration with torch.compile: avoid Dynamo guard evaluation overhead (#7898), skip compile for profiling (#7796)

What's Changed

New Contributors

Full Changelog: v0.5.5...v0.6.0