Releases: sgl-project/sglang
Release v0.1.20
Highlights
- Enable CUDA graph by default. It brings 1.5x - 2x speedup for small batch size decoding (#612)
- Model support: Gemma2, minicpm, Qwen2 MoE
- Docker support (#217)
- Various latency optimizations
What's Changed
- Add docker file by @Ying1123 in #588
- Add Gemma2 by @Ying1123 in #592
- Format by @Ying1123 in #593
- Fix Llava model by @wisclmy0611 in #594
- Add `--enable-p2p-check` option by @hnyls2002 in #599
- Fix streaming by @hnyls2002 in #600
- Reduce number of workspaces for flashinfer by @wisclmy0611 in #601
- add `LogitsMetadata` by @hnyls2002 in #604
- add minicpm support by @Titan-p in #602
- Make sglang compat with vllm 0.5.1 by @M0gician in #598
- Add Qwen2 MoE support by @M0gician in #603
- Update chat template for qwen and yi-1.5. by @for-just-we in #530
- [Feat] Expose logprob options to `sgl.gen` API by @huyiwen in #503
- Fix bench latency by @merrymercy in #607
- Code clean up: Remove deprecated prefill move InputMetadata to infer_batch.py by @merrymercy in #609
- Clean up the usage of flashinfer by @merrymercy in #610
- Cleanup attention backend: flashinfer and triton by @merrymercy in #611
- Enable cuda graph by default by @merrymercy in #612
- Improve benchmark scripts & fix llava by @merrymercy in #613
- Memorypool chunked prefetch by @hnyls2002 in #614
- Improve benchmark scripts by @merrymercy in #615
- Fix memory pool index error by @Ying1123 in #616
- Bump version to 0.1.20 by @merrymercy in #618
New Contributors
- @wisclmy0611 made their first contribution in #594
- @Titan-p made their first contribution in #586
- @M0gician made their first contribution in #598
- @for-just-we made their first contribution in #530
Full Changelog: v0.1.18...v0.1.20
Release v0.1.18
Highlight
- 2x large batch prefill improvement with the new flashinfer kernels #579
- Multi-node tensor parallelism #550
- New model support: ChatGLM #516
What's Changed
- Fix missing numpy dependency in pyproject.toml by @fpreiss in #524
- Fix RAG nb, parea setup (parea -> parea-ai) by @fpreiss in #525
- [Minor] Correct Optional type hints in api by @fpreiss in #526
- Add ChatGLM Model Support by @Qubitium in #516
- Fix Regression: Disable p2p for 4090 by @ZX-ModelCloud in #531
- Decode Incrementally by @hnyls2002 in #517
- Fix dependency by @merrymercy in #538
- Fix dependency & crash issues by @Ying1123 in #539
- Higher priority for user input of max_prefill_tokens & format by @Ying1123 in #540
- Add disk cache for loading ShareGPT dataset. by @hnyls2002 in #542
- Fix tp worker only checking req[0] for stream by @Qubitium in #546
- Fix the Jump-Forward with Chinese by @hnyls2002 in #551
- Update fused_moe by @merrymercy in #553
- Multi-node Tensor Parallelism by @Ying1123 in #550
- Update flashinfer to 0.0.5 by @merrymercy in #554
- Follow-up fixes for flashinfer 0.0.5 by @merrymercy in #556
- Fix latency benchmark by @hnyls2002 in #557
- Clean up logits processor by @merrymercy in #558
- Update test_flashinfer by @hnyls2002 in #560
- Allow running with vllm==0.4.3 by @merrymercy in #561
- Add a new arguments log_level_http to control the HTTP logging by @merrymercy in #563
- Add sglang.bench_latency for offline benchmark by @merrymercy in #564
- Warmup cublas by @merrymercy in #566
- Increase the number of thread limitation for tp worker managers. by @merrymercy in #567
- Update readme by @merrymercy in #568
- Expose dtype argument by @merrymercy in #569
- Update benchmark script by @Ying1123 in #571
- Minor fix in compiler & format by @ZackZeng999 in #545
- Update run_batch interface and max_prefill_tokens by @Ying1123 in #574
- Fix flashinfer version by @PanJason in #576
- [BugFix] gemma loading weights "lm_head.weight" key error by @dhgarcia in #577
- Turn on flashinfer by default by @Ying1123 in #578
- fix the broken server args by @hnyls2002 in #585
- 2x performance improvement for large prefill & Fix workspace conflicts by @Ying1123 in #579
New Contributors
- @fpreiss made their first contribution in #524
- @ZackZeng999 made their first contribution in #545
- @PanJason made their first contribution in #576
- @dhgarcia made their first contribution in #577
Full Changelog: v0.1.17...v0.1.18
Release v0.1.17
Highlights
- Add data parallelism #480
- Add speculative execution for OpenAI API #250
- Update vllm to v0.4.3 for new quantization features #511
- Better error handling (#457, #449, #514)
What's Changed
- [Feat] Add llava qwen, llava mistral by @kcz358 in #419
- Format code by @hnyls2002 in #441
- Add finish_reason to OpenAI API by @mgerstgrasser in #446
- Simplify port allocation by @merrymercy in #447
- Add PUT for generate api by @Ying1123 in #448
- Improve error handling & abort disconnected requests by @merrymercy in #449
- Fix the broken `--disable-radix-cache` by @hnyls2002 in #451
- openai chat speculative execution by @ChuyueSun in #250
- Fix openai speculative execution by @Ying1123 in #456
- Abort disconnected requests by @merrymercy in #457
- Rename api_num_spec_tokens -> num_api_spec_tokens by @merrymercy in #458
- Use model loader from vllm by @merrymercy in #459
- port fp8 mixtral by @merrymercy in #460
- fix test bug in srt_llava_next_test.py by @bingwork in #470
- Add the instruction link to the LLaVA-NeXT-Video at README by @ZhangYuanhan-AI in #463
- Improve logging & add logit cap by @merrymercy in #471
- Optimize retract by @hnyls2002 in #440
- Add benchmark scripts by @Ying1123 in #476
- [Feat/Fix] Refactoring Llava models into single file by @Luodian in #475
- Improve benchmark scripts & rename some scripts by @merrymercy in #477
- Improve benchmark scripts & add more models by @merrymercy in #484
- Support data parallelism (static) by @Ying1123 in #480
- Make the server random by default by @merrymercy in #488
- Revert "Make the server random by default" by @Ying1123 in #492
- update the script: examples/usage/llava_video/srt_example_llava_v.sh by @ZhangYuanhan-AI in #491
- Make the server random by default by @merrymercy in #493
- Update vllm to v0.4.3 by @merrymercy in #511
- remove redundant pad_input_ids function by @amosyou in #500
- Litellm Backend by @huyiwen in #502
- Fix rid state map leak + Refractor .finished by @Qubitium in #505
- Crash the server when error or OOM happens by @merrymercy in #514
- Update version to 0.1.17 by @merrymercy in #515
New Contributors
- @kcz358 made their first contribution in #419
- @mgerstgrasser made their first contribution in #446
- @bingwork made their first contribution in #470
- @amosyou made their first contribution in #500
- @huyiwen made their first contribution in #502
Full Changelog: v0.1.16...v0.1.17
v0.1.16
Highlight
- Support more models: DBRX, Command-R, Gemma
- Support llava-video (#423, https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)
- Cache performance improvements (#418, #364)
- Marlin quantization kernels
- Many bug fixes
- Update dependencies to be compatible with their latest versions
What's Changed
- Fix Runtime missing some ServerArgs options by @Qubitium in #281
- adding the triton docker build minimal example by @amirarsalan90 in #242
- Fix flashinfer >= 0.0.3 compat by @Qubitium in #282
- Fix Incorrect CURL Request Example in README by @amirarsalan90 in #287
- enable marlin kernels by @qeternity in #286
- Fix env (docker) compat due to file usage by @Qubitium in #288
- Fix marlin model loading compat with autogptq by @Liurl21 in #290
- Fix outlines-0.0.35 incompatibility by @ZhouGongZaiShi in #291
- [Fix/Potential Bugs] Can not correctly import models in python/sglang/srt/models by @Luodian in #311
- Use Anthropic messages API by @janimo in #304
- Add StableLM model. by @janimo in #301
- Support oai in benchmark/mmlu by @merrymercy in #323
- Update version to v0.1.14 by @merrymercy in #324
- Cleanup codebase: removed unnecessary code/logic by @Qubitium in #298
- Update dependencies by @janimo in #326
- Openrouter usage example by @janimo in #327
- `model_rpc` style improvement by @hnyls2002 in #293
- `model_runner` simplify by @hnyls2002 in #329
- Logprobs Refractor by @hnyls2002 in #331
- `DBRX` support by @hnyls2002 in #337
- Add support for new autogptq quant_config.checkpoint_format by @Qubitium in #332
- Fix llava parallelism/fork bug by @lockon-n in #315
- Eliminate 2 gpu ops during sampling when logit_bias is zero by @hnyls2002 in #343
- Revert "Eliminate 2 gpu ops during sampling when logit_bias is zero" by @hnyls2002 in #345
- Eliminate 2 gpu ops during sampling when logit_bias is zero by @Qubitium in #338
- Add timeout to get_meta_info by @SimoneRaponi in #346
- Fix typos in infer_batch.py by @tom-doerr in #354
- Time cost utils by @hnyls2002 in #355
- Update README.md by @eltociear in #358
- support `command-r` by @ZhouXingg in #369
- Fix issue #367 – System message not supported for Anthropic (anthropic.BadRequestError) by @fronx in #368
- Update model support in readme by @Ying1123 in #370
- Optimize radix tree matching by @ispobock in #364
- Reduce overhead when `fork(1)` by @hnyls2002 in #375
- llama3 instruct template by @qeternity in #372
- add `.isort.cfg` by @hnyls2002 in #378
- Revert removing the unused imports by @hnyls2002 in #385
- Benchmark Updates by @hnyls2002 in #382
- Improve performance when running with full parallel by @hnyls2002 in #394
- Minor: style improvement of radix_cache and memory_pool by @hnyls2002 in #395
- Format Benchmark Code by @hnyls2002 in #399
- Fix chatml template by @merrymercy in #406
- Adding RAG tracing & eval cookbook using Parea by @joschkabraun in #390
- SamplingParams add "spaces_between_special_tokens" argument by @ZhouXingg in #392
- Organize Benchmark by @hnyls2002 in #381
- Add Cohere Command R chat template by @noah-kim-theori in #411
- Fix `sync()` when `fork(1)` by @hnyls2002 in #412
- Include finish reason in meta info response by @qeternity in #415
- Make public APIs more standard. by @hnyls2002 in #416
- Compat with latest VLLM 0.4.2 main + fork.number rename + Flashinfer 0.0.4 by @Qubitium in #380
- Optimize the memory usage of logits processor by @merrymercy in #420
- Clean up by @merrymercy in #422
- Fix logit processor bugs by @merrymercy in #427
- Minor fix for the import path by @merrymercy in #428
- Move openai api server into a separate file by @merrymercy in #429
- Fix flashinfer by @merrymercy in #430
- Update version to 0.1.15 by @merrymercy in #431
- Misc fixes by @merrymercy in #432
- Allow `input_ids` in the input of the `/generate` endpoint by @lolipopshock in #363
- Improve error handling by @merrymercy in #433
- Cache optimizations by @hnyls2002 in #418
- Update readme by @merrymercy in #434
- Raise errors for prompts that are too long by @merrymercy in #436
- support llava video by @ZhangYuanhan-AI in #426
- Fix streaming by @merrymercy in #437
- Update version to 0.1.16 by @merrymercy in #438
New Contributors
- @Qubitium made their first contribution in #281
- @amirarsalan90 made their first contribution in #242
- @Liurl21 made their first contribution in #290
- @ZhouGongZaiShi made their first contribution in #291
- @Luodian made their first contribution in #311
- @janimo made their first contribution in #304
- @lockon-n made their first contribution in #315
- @SimoneRaponi made their first contribution in #346
- @tom-doerr made their first contribution in #354
- @ZhouXingg made their first contribution in #369
- @fronx made their first contribution in #368
- @ispobock made their first contribution in #364
- @joschkabraun made their first contribution in #390
- @noah-kim-theori made their first contribution in #411
- @lolipopshock made their first contribution in #363
- @ZhangYuanhan-AI made their first contribution in #426
Full Changelog: v0.1.13...v0.1.16
Release v0.1.13
Highlights
- Gemma Support by @hnyls2002 in #256
- Add Together and AzureOpenAI examples by @merrymercy in #184
What's Changed
- correct a mistake on the README.md by @yaya-sy in #182
- correct reference dtype openai.py by @yaya-sy in #181
- Add Together and AzureOpenAI examples by @merrymercy in #184
- Fix server launch for jupyter notebook by @merrymercy in #186
- Refactor decoding logprob and add completion_tokens_wo_jump_forward by @comaniac in #189
- Pin outlines version by @comaniac in #196
- Adjust outlines version. by @hnyls2002 in #200
- Update README.md by @eltociear in #207
- Added the ability to Modify the Context Length by @psych0v0yager in #210
- Fix logprobs with logprob_start_len by @comaniac in #193
- Support outlines > 0.0.31 by @comaniac in #219
- Fix stop str merging by @hnyls2002 in #225
- Fix interpreter.py `get_var(var_name)` in text iter when `stream` is not enabled by @exceedzhang in #198
- fix chatml template by @qeternity in #195
- Upload `agent_calls.jsonl` download link by @hnyls2002 in #226
- Fix addr reuse in check_port by @hnyls2002 in #253
- Add SSL Cert Functionality by @nivibilla in #224
- Refactor ChatTemplate for Enhanced Clarity and Efficiency by @cubxxw in #201
- Add `set_var` to interpreter.py by @1024th in #263
- Add logo by @merrymercy in #275
- Fix qwen config by @hnyls2002 in #261
- replace skip_embed with input_embeds by @TideDra in #222
- Gemma Support by @hnyls2002 in #256
- Improve gemma and documentations by @merrymercy in #278
- Organize `server_args` by @hnyls2002 in #277
- Add Support for API Key Authentication by @alessiodallapiazza in #230
- Fix RuntimeEndpoint by @merrymercy in #279
- Update version to v0.1.13 by @merrymercy in #280
New Contributors
- @psych0v0yager made their first contribution in #210
- @exceedzhang made their first contribution in #198
- @qeternity made their first contribution in #195
- @cubxxw made their first contribution in #201
- @1024th made their first contribution in #263
- @TideDra made their first contribution in #222
- @alessiodallapiazza made their first contribution in #230
Full Changelog: v0.1.12...v0.1.13
Release v0.1.12
Highlights
- Fast JSON Decoding (blog)
- Output logprobs for decoding tokens
- Multiple bug fixes
What's Changed
- Fix no-cache mode by @Ying1123 in #136
- Support Faster JSON decoding for llava by @hnyls2002 in #137
- fix undfined variable by @yaya-sy in #142
- jump-forward rename by @hnyls2002 in #144
- Add warmup to SRT server by @comaniac in #146
- add openai error handler with retry and logger by @ChuyueSun in #148
- Temporary fix OpenAI API for Pydantic v1/v2 by @comaniac in #153
- Add gptq quantization model support by @Arcmoon-Hu in #141
- Support decode token logprobs by @comaniac in #130
- Format code & move functions by @merrymercy in #155
- [Submodule] Change FlashInfer to import by @comaniac in #156
- add `--disable-disk-cache` by @hnyls2002 in #160
- Add Auth Token to RuntimeEndpoint by @nivibilla in #162
- Fix BaseCache metric by @comaniac in #170
- import outlines by @hnyls2002 in #168
- Fix token usage with jump forward by @comaniac in #174
- Support extra field regex in OpenAI API by @comaniac in #172
- Fix the chat template for llava-v1.6-34b & format code by @merrymercy in #177
- Update version to 0.1.12 by @merrymercy in #178
New Contributors
- @yaya-sy made their first contribution in #142
- @ChuyueSun made their first contribution in #148
- @nivibilla made their first contribution in #162
Full Changelog: v0.1.11...v0.1.12
Release v0.1.11
Highlights
- Serve the official release demo of LLaVA v1.6 blog
- Support Yi-VL example
- Faster JSON decoding blog
- Support QWen 2
What's Changed
- Fix the error message and dependency of openai backend by @merrymercy in #71
- Add an async example by @Ying1123 in #37
- Add a note about triton version for older GPUs by @merrymercy in #72
- Support load fine-tuned LLaVA model by @isaac-vidas in #80
- Suppport qwen model and solve some problems by @Arcmoon-Hu in #75
- Fix after QWen support by @merrymercy in #82
- Fix the chat template for QWen by @merrymercy in #83
- Fix SRT endpoint api json syntax by @CSWellesSun in #84
- Return logprob for choices by @merrymercy in #87
- Add health endpoint to SGLang runtime server by @isaac-vidas in #90
- Llava-hd Support by @caoshiyi in #92
- Bump the version to v0.1.8 by @merrymercy in #93
- Improve Chinese character streaming when the last char is half Chinese word. by @haotian-liu in #95
- Handle grayscale images in expand2square by @isaac-vidas in #97
- support speculative execution for openai API by @parasol-aser in #48
- fix batch error for llava-hd by @caoshiyi in #98
- Dynamic model class loading by @comaniac in #101
- Flush Cache API by @hnyls2002 in #103
- Fix Mistral model loading by @comaniac in #108
- Improve the control of streaming and improve the first token latency in streaming by @merrymercy in #117
- Add qwen2 by @JustinLin610 in #114
- Format code by @merrymercy in #118
- Update quick start examples by @merrymercy in #120
- Improve docs & Add JSON decode example by @merrymercy in #121
- [Feature] Adds basic support for image content in OpenAI chat routes by @fozziethebeat in #113
- [Feature] Allow specifying all ports to use in advance by @Ja1Zhou in #116
- Add cache metrics by @comaniac in #119
- Fix model loading & format code by @merrymercy in #125
- Add city doc benchmark mode by @hnyls2002 in #129
- Yi-VL Model by @BabyChouSr in #112
- Fix `is_multimodal_model` judge by @hnyls2002 in #132
- Add max_prefill_num_token into server arguments by @Ying1123 in #133
- Release 0.1.11 by @Ying1123 in #134
New Contributors
- @isaac-vidas made their first contribution in #80
- @Arcmoon-Hu made their first contribution in #75
- @CSWellesSun made their first contribution in #84
- @haotian-liu made their first contribution in #95
- @parasol-aser made their first contribution in #48
- @JustinLin610 made their first contribution in #114
- @fozziethebeat made their first contribution in #113
- @Ja1Zhou made their first contribution in #116
Full Changelog: v0.1.6...v0.1.11
Release v0.1.6
Major features
- Add OpenAI-compatible API server (Completion and ChatCompletion)
- Fix `sgl.select`
All PRs
- Support v1/chat/completions by @comaniac in #50
- Fix select and normalized logprobs by @merrymercy in #67
- Bump version to 0.1.5 by @merrymercy in #33
- Use HTTP link in 3rdparty module by @comaniac in #42
- Document sampling parameters by @merrymercy in #45
- Increase interpreter parallelism by @merrymercy in #46
- Add a llava example by @merrymercy in #47
- Support stream=True in v1/completions by @comaniac in #49
- Format code & Improve readme by @merrymercy in #52
- Fix the possible bug of decode out of memory by @hnyls2002 in #36
- Improve error message & Add vicuna template by @merrymercy in #57
- Update README.md by @eltociear in #58
- Disk FSM cache and adjust code. by @hnyls2002 in #63
- Fix select by @merrymercy in #64
- Bump version to 0.1.6 by @merrymercy in #68
New Contributors
- @comaniac made their first contribution in #42
- @eltociear made their first contribution in #58
Full Changelog: v0.1.5...v0.1.6
Release v0.1.5
What's Changed
- Fix for T4 GPUs by @Ying1123 in #16
- Gemini Backend by @caoshiyi in #9
- Teak mem fraction by @merrymercy in #20
- Add option to return metadata in async streaming by @BabyChouSr in #18
- Expose more arguments to control the scheduling policy by @merrymercy in #32
- Rename image_url to image_file by @BabyChouSr in #15
- Improve docs by @merrymercy in #17
- Improve docs & Rename Gemini -> VertexAI by @merrymercy in #19
- Fix streaming by @merrymercy in #30
New Contributors
- @BabyChouSr made their first contribution in #15
- @Ying1123 made their first contribution in #16
- @caoshiyi made their first contribution in #9
Full Changelog: v0.1.3...v0.1.5