# v0.1.16
## Highlights
- Support more models: DBRX, Command-R, Gemma (a quick-start sketch follows this list)
- Support llava-video (#423, https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)
- Cache performance improvements (#418, #364)
- Marlin quantization kernels
- Many bug fixes
- Update dependencies to be compatible with their latest versions
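
For anyone trying the newly supported models, here is a minimal sketch using sglang's Python `Runtime` API; the Gemma model ID, prompt, and sampling settings are illustrative assumptions, and any of the newly supported checkpoints should work the same way:

```python
# Minimal sketch: run a newly supported model with the local Runtime API.
# Assumptions: the Hugging Face model ID and sampling settings are examples only.
import sglang as sgl

@sgl.function
def answer(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64)

runtime = sgl.Runtime(model_path="google/gemma-7b-it")  # or a DBRX / Command-R checkpoint
sgl.set_default_backend(runtime)

state = answer.run(question="What is new in this release?")
print(state["answer"])
runtime.shutdown()
```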
## What's Changed
- Fix Runtime missing some ServerArgs options by @Qubitium in #281
- Add a minimal Triton Docker build example by @amirarsalan90 in #242
- Fix flashinfer >= 0.0.3 compat by @Qubitium in #282
- Fix Incorrect CURL Request Example in README by @amirarsalan90 in #287
- Enable Marlin kernels by @qeternity in #286
- Fix env (docker) compat due to file usage by @Qubitium in #288
- Fix marlin model loading compat with autogptq by @Liurl21 in #290
- Fix outlines-0.0.35 incompatibility by @ZhouGongZaiShi in #291
- [Fix/Potential Bugs] Cannot correctly import models in `python/sglang/srt/models` by @Luodian in #311
- Use Anthropic messages API by @janimo in #304
- Add StableLM model by @janimo in #301
- Support oai in benchmark/mmlu by @merrymercy in #323
- Update version to v0.1.14 by @merrymercy in #324
- Cleanup codebase: removed unnecessary code/logic by @Qubitium in #298
- Update dependencies by @janimo in #326
- Openrouter usage example by @janimo in #327
- `model_rpc` style improvement by @hnyls2002 in #293
- `model_runner` simplify by @hnyls2002 in #329
- Logprobs refactor by @hnyls2002 in #331
- `DBRX` support by @hnyls2002 in #337
- Add support for new autogptq `quant_config.checkpoint_format` by @Qubitium in #332
- Fix llava parallelism/fork bug by @lockon-n in #315
- Eliminate 2 gpu ops during sampling when logit_bias is zero by @hnyls2002 in #343
- Revert "Eliminate 2 gpu ops during sampling when logit_bias is zero" by @hnyls2002 in #345
- Eliminate 2 gpu ops during sampling when logit_bias is zero by @Qubitium in #338
- Add timeout to get_meta_info by @SimoneRaponi in #346
- Fix typos in infer_batch.py by @tom-doerr in #354
- Time cost utils by @hnyls2002 in #355
- Update README.md by @eltociear in #358
- Support `command-r` by @ZhouXingg in #369
- Fix issue #367: system message not supported for Anthropic (anthropic.BadRequestError) by @fronx in #368
- Update model support in readme by @Ying1123 in #370
- Optimize radix tree matching by @ispobock in #364
- Reduce overhead when `fork(1)` by @hnyls2002 in #375
- Llama 3 instruct template by @qeternity in #372
- Add `.isort.cfg` by @hnyls2002 in #378
- Revert removing the unused imports by @hnyls2002 in #385
- Benchmark Updates by @hnyls2002 in #382
- Improve performance when running with full parallel by @hnyls2002 in #394
- Minor: style improvement of radix_cache and memory_pool by @hnyls2002 in #395
- Format Benchmark Code by @hnyls2002 in #399
- Fix chatml template by @merrymercy in #406
- Adding RAG tracing & eval cookbook using Parea by @joschkabraun in #390
- Add `spaces_between_special_tokens` argument to `SamplingParams` by @ZhouXingg in #392
- Organize Benchmark by @hnyls2002 in #381
- Add Cohere Command R chat template by @noah-kim-theori in #411
- Fix `sync()` when `fork(1)` by @hnyls2002 in #412
- Include finish reason in meta info response by @qeternity in #415
- Make public APIs more standard by @hnyls2002 in #416
- Compat with latest vLLM 0.4.2 main + `fork.number` rename + Flashinfer 0.0.4 by @Qubitium in #380
- Optimize the memory usage of logits processor by @merrymercy in #420
- Clean up by @merrymercy in #422
- Fix logit processor bugs by @merrymercy in #427
- Minor fix for the import path by @merrymercy in #428
- Move openai api server into a separate file by @merrymercy in #429
- Fix flashinfer by @merrymercy in #430
- Update version to 0.1.15 by @merrymercy in #431
- Misc fixes by @merrymercy in #432
- Allow `input_ids` in the input of the `/generate` endpoint by @lolipopshock in #363 (see the sketch after this list)
- Improve error handling by @merrymercy in #433
- Cache optimizations by @hnyls2002 in #418
- Update readme by @merrymercy in #434
- Raise errors for prompts that are too long by @merrymercy in #436
- Support llava-video by @ZhangYuanhan-AI in #426
- Fix streaming by @merrymercy in #437
- Update version to 0.1.16 by @merrymercy in #438
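
Two of the API-facing changes above pair naturally: #363 lets clients send pre-tokenized prompts to `/generate`, and #392 adds `spaces_between_special_tokens` to `SamplingParams`. Here is a rough sketch of exercising both against a running server; the model ID, port, and prompt are assumptions, not part of the release:

```python
# Rough sketch: call /generate with token IDs instead of raw text (#363)
# and set the new spaces_between_special_tokens argument (#392).
# Assumes a server was started with: python -m sglang.launch_server --port 30000
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # example model
input_ids = tokenizer.encode("The capital of France is")

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "input_ids": input_ids,  # pre-tokenized prompt, instead of a "text" field
        "sampling_params": {
            "max_new_tokens": 32,
            "temperature": 0,
            "spaces_between_special_tokens": True,
        },
    },
)
print(resp.json()["text"])
```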
## New Contributors
- @Qubitium made their first contribution in #281
- @amirarsalan90 made their first contribution in #242
- @Liurl21 made their first contribution in #290
- @ZhouGongZaiShi made their first contribution in #291
- @Luodian made their first contribution in #311
- @janimo made their first contribution in #304
- @lockon-n made their first contribution in #315
- @SimoneRaponi made their first contribution in #346
- @tom-doerr made their first contribution in #354
- @ZhouXingg made their first contribution in #369
- @fronx made their first contribution in #368
- @ispobock made their first contribution in #364
- @joschkabraun made their first contribution in #390
- @noah-kim-theori made their first contribution in #411
- @lolipopshock made their first contribution in #363
- @ZhangYuanhan-AI made their first contribution in #426
**Full Changelog**: v0.1.13...v0.1.16