v0.6.2

@github-actions github-actions released this 25 Sep 21:50
· 1160 commits to main since this release
7193774

Highlights

Model Support

  • Support Llama 3.2 models (#8811, #8822)

     vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
    
  • Beam search has been soft-deprecated. We are moving toward a more performant version of beam search that also simplifies vLLM's core. (#8684, #8763, #8713)

    • ⚠️ This is a breaking change. You will now see the following error:

      Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the vllm.LLM.use_beam_search method for dedicated beam search instead, or set the environment variable VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 to suppress this error. For more details, see #8306

  • New models: Solar (#8386), MiniCPM3 (#8297), LLaVA-OneVision (#8486)

  • Enhancements: pipeline parallelism (pp) for Qwen2-VL (#8696), multiple images for Qwen-VL (#8247), Mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), tensor parallelism with bitsandbytes quantization (#8434)
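
For callers that still pass beam search as a sampling parameter, the opt-in mentioned in the error message above can be set from Python. This is a minimal sketch: the variable name comes from the error text, while the helper names are hypothetical, and it assumes the flag is set before vLLM is imported or the engine is created so that it is picked up.

```python
import os

# Variable name taken from the error message quoted in these notes.
BEAM_SEARCH_OPT_IN = "VLLM_ALLOW_DEPRECATED_BEAM_SEARCH"


def allow_deprecated_beam_search(environ=os.environ):
    """Set the opt-in flag; return the previous value (None if it was unset).

    Hypothetical helper, not part of vLLM.
    """
    previous = environ.get(BEAM_SEARCH_OPT_IN)
    environ[BEAM_SEARCH_OPT_IN] = "1"
    return previous


def is_opted_in(environ=os.environ):
    """Return True when the flag is set to "1" in the given environment."""
    return environ.get(BEAM_SEARCH_OPT_IN) == "1"


if __name__ == "__main__":
    # Assumption: run this before importing vllm or constructing an engine.
    allow_deprecated_beam_search()
    print(is_opted_in())
```

The flag only suppresses the error for now. The error message itself says the sampling-parameter form will be removed in a future release, so treat this as a stopgap while migrating.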

Hardware Support

  • TPU: implement multi-step scheduling (#8489), use Ray for default distributed backend (#8389)
  • CPU: Enable mrope and support Qwen2-VL on CPU backend (#8770)
  • AMD: custom paged attention kernel for ROCm (#8310) and FP8 KV cache support (#8577)

Production Engine

  • Initial support for priority scheduling (#5958)
  • Support LoRA lineage and base model metadata management (#6315)
  • Batch inference for llm.chat() API (#8648)
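
To illustrate what priority scheduling means for queued requests, here is a conceptual sketch in plain Python. It is not vLLM's scheduler: the class and method names are hypothetical, and it assumes that a lower priority value is served earlier, with arrival order breaking ties.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    # Ordering uses (priority, arrival); the prompt is excluded from comparison.
    priority: int
    arrival: int
    prompt: str = field(compare=False)


class PriorityQueueSketch:
    """Hypothetical waiting queue: lower priority value is scheduled first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # monotonically increasing arrival index

    def add(self, prompt, priority=0):
        request = QueuedRequest(priority, next(self._counter), prompt)
        heapq.heappush(self._heap, request)

    def next_batch(self, max_num_seqs):
        """Pop up to max_num_seqs prompts in (priority, arrival) order."""
        batch = []
        while self._heap and len(batch) < max_num_seqs:
            batch.append(heapq.heappop(self._heap).prompt)
        return batch


if __name__ == "__main__":
    queue = PriorityQueueSketch()
    queue.add("background summarization", priority=5)
    queue.add("interactive chat turn", priority=0)
    print(queue.next_batch(max_num_seqs=1))
```

A real engine also has to weigh memory pressure and running sequences, which this sketch deliberately leaves out.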

Performance

  • Introduce MQLLMEngine for the API server, boosting throughput by 30% in single-step and 7% in multi-step scheduling (#8157, #8761, #8584)
  • Multi-step scheduling enhancements
    • Prompt logprobs support in Multi-step (#8199)
    • Add output streaming support to multi-step + async (#8335)
    • Add FlashInfer backend (#7928)
  • Add CUDA graph support during decoding for encoder-decoder models (#7631)

Others

  • Support sampling from Hugging Face datasets and image input in benchmark_serving (#8495)
  • Progress in torch.compile integration (#8488, #8480, #8384, #8526, #8445)

Full Changelog: v0.6.1...v0.6.2