Highlights
- 2x performance improvement for large-batch prefill with the new flashinfer kernels #579
- Multi-node tensor parallelism #550
- New model support: ChatGLM #516
What's Changed
- Fix missing numpy dependency in pyproject.toml by @fpreiss in #524
- Fix RAG nb, parea setup (parea -> parea-ai) by @fpreiss in #525
- [Minor] Correct Optional type hints in api by @fpreiss in #526
- Add ChatGLM Model Support by @Qubitium in #516
- Fix Regression: Disable p2p for 4090 by @ZX-ModelCloud in #531
- Decode Incrementally by @hnyls2002 in #517
- Fix dependency by @merrymercy in #538
- Fix dependency & crash issues by @Ying1123 in #539
- Higher priority for user input of max_prefill_tokens & format by @Ying1123 in #540
- Add disk cache for loading the ShareGPT dataset by @hnyls2002 in #542
- Fix tp worker only checking req[0] for stream by @Qubitium in #546
- Fix the Jump-Forward with Chinese by @hnyls2002 in #551
- Update fused_moe by @merrymercy in #553
- Multi-node Tensor Parallelism by @Ying1123 in #550
- Update flashinfer to 0.0.5 by @merrymercy in #554
- Follow-up fixes for flashinfer 0.0.5 by @merrymercy in #556
- Fix latency benchmark by @hnyls2002 in #557
- Clean up logits processor by @merrymercy in #558
- Update test_flashinfer by @hnyls2002 in #560
- Allow running with vllm==0.4.3 by @merrymercy in #561
- Add a new argument log_level_http to control HTTP logging by @merrymercy in #563
- Add sglang.bench_latency for offline benchmark by @merrymercy in #564
- Warmup cublas by @merrymercy in #566
- Increase the thread limit for tp worker managers by @merrymercy in #567
- Update readme by @merrymercy in #568
- Expose dtype argument by @merrymercy in #569
- Update benchmark script by @Ying1123 in #571
- Minor fix in compiler & format by @ZackZeng999 in #545
- Update run_batch interface and max_prefill_tokens by @Ying1123 in #574
- Fix flashinfer version by @PanJason in #576
- [BugFix] gemma loading weights "lm_head.weight" key error by @dhgarcia in #577
- Turn on flashinfer by default by @Ying1123 in #578
- Fix the broken server args by @hnyls2002 in #585
- 2x performance improvement for large prefill & Fix workspace conflicts by @Ying1123 in #579
New Contributors
- @fpreiss made their first contribution in #524
- @ZackZeng999 made their first contribution in #545
- @PanJason made their first contribution in #576
- @dhgarcia made their first contribution in #577
Full Changelog: v0.1.17...v0.1.18