v0.6.3.post1
Released by github-actions on 17 Oct 17:26 · 932 commits to main since this release
## Highlights

### New Models
- Support Ministral 3B and Ministral 8B via interleaved attention (#9414)
- Support multiple and interleaved images for Llama3.2 (#9095)
- Support VLM2Vec, the first multimodal embedding model in vLLM (#9303)
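As a hedged sketch of what the new multiple/interleaved-image support for Llama 3.2 (#9095) looks like from the OpenAI-compatible chat API: the request body interleaves `text` and `image_url` content parts within a single user message. The helper below is illustrative only (not part of vLLM), and the image URLs are placeholders.

```python
def build_interleaved_message(segments):
    """Turn ("text", str) / ("image", url) segments into one OpenAI-style
    chat message whose content parts interleave text and images."""
    content = []
    for kind, value in segments:
        if kind == "text":
            content.append({"type": "text", "text": value})
        elif kind == "image":
            content.append({"type": "image_url", "image_url": {"url": value}})
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return {"role": "user", "content": content}

# Two images interleaved with text in a single message.
message = build_interleaved_message([
    ("text", "Compare these two charts:"),
    ("image", "https://example.com/chart_a.png"),
    ("text", "versus"),
    ("image", "https://example.com/chart_b.png"),
])
```

This payload shape matches the consolidated multimodal OpenAI-client examples referenced in #9412.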
### Important bug fixes
- Fix chat API continuous usage stats (#9357)
- Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
- Fix Molmo text-only input bug (#9397)
- Fix CUDA 11.8 Build (#9386)
- Fix `_version.py` not found issue (#9375)
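For context on the continuous usage stats fix (#9357): a streaming chat request opts into per-chunk usage reporting via `stream_options`. A minimal sketch of the request body, assuming the field names below (`include_usage` is the standard OpenAI option; `continuous_usage_stats` is a vLLM-specific extension, and the model name is a placeholder):

```python
# Sketch of a streaming request that asks for usage stats on every chunk.
request = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",  # placeholder
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
    "stream_options": {
        "include_usage": True,            # final usage summary (OpenAI-standard)
        "continuous_usage_stats": True,   # per-chunk usage (vLLM extension)
    },
}
```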
### Other Enhancements
- Remove block manager v1 and make block manager v2 default (#8704)
- Spec Decode Optimize ngram lookup performance (#9333)
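The ngram lookup optimized in #9333 is a prompt-lookup form of speculative decoding: match the most recent n-gram of generated tokens against earlier context and propose the tokens that followed it, so the target model only needs to verify them. The function below is a simplified illustration of that idea, not vLLM's actual implementation.

```python
def ngram_propose(tokens, ngram_size=2, num_speculative=3):
    """Propose draft tokens by matching the trailing n-gram against
    earlier positions in the sequence (most recent match wins)."""
    if len(tokens) < ngram_size:
        return []
    pattern = tokens[-ngram_size:]
    # Scan earlier positions right-to-left, excluding the suffix itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            follow = tokens[start + ngram_size:start + ngram_size + num_speculative]
            if follow:
                return follow
    return []

# The trailing bigram [1, 2] also appears at the start,
# so the tokens that followed it there are proposed as drafts.
drafts = ngram_propose([1, 2, 3, 4, 1, 2])  # -> [3, 4, 1]
```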
## What's Changed
- [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
- [Frontend] merge beam search implementations by @LunrEclipse in #9296
- [Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
- [Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
- [Frontend] Clarify model_type error messages by @stevegrubb in #9345
- [Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
- [Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
- [BugFix] Fix chat API continuous usage stats by @njhill in #9357
- pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
- [Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
- [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
- [Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
- [Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
- [Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
- [Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
- [CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
- [Core] Rename input data types by @DarkLight1337 in #8688
- [Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
- [Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
- Support mistral interleaved attn by @patrickvonplaten in #9414
- [Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
- [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
- [Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
- [CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
- [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
- [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
- Add notes on the use of Slack by @terrytangyuan in #9442
- [Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
- [Misc] Print stack trace using `logger.exception` by @DarkLight1337 in #9461
- [misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
- [Bugfix] Allow prefill of assistant response when using `mistral_common` by @sasha0552 in #9446
- [TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
- [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
- [Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
- [CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
## New Contributors
- @gracehonv made their first contribution in #9349
- @streaver91 made their first contribution in #9396
Full Changelog: v0.6.3...v0.6.3.post1