modelscope · Jintao-Huang · Apr 9, 2025 · Apr 9, 2025 · Apr 14, 2025 · Apr 14, 2025
diff --git a/README.md b/README.md
@@ -53,7 +53,7 @@ You can contact us and communicate with us by adding our group:
 ## 📝 Introduction
 🍲 ms-swift is an official framework provided by the ModelScope community for fine-tuning and deploying large language models and multi-modal large models. It currently supports the training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment of 500+ large models and 200+ multi-modal large models. These large language models (LLMs) include models such as Qwen3, Qwen3-MoE, Qwen2.5, InternLM3, GLM4, Mistral, DeepSeek-R1, Yi1.5, TeleChat2, Baichuan2, and Gemma2. The multi-modal LLMs include models such as Qwen2.5-VL, Qwen2-Audio, Llama4, Llava, InternVL3, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, and GOT-OCR2.
 
-🍔 Additionally, ms-swift incorporates the latest training technologies, including lightweight techniques such as LoRA, QLoRA, Llama-Pro, LongLoRA, GaLore, Q-GaLore, LoRA+, LISA, DoRA, FourierFt, ReFT, UnSloth, and Liger, as well as human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, and ORPO. ms-swift supports acceleration of inference, evaluation, and deployment modules using vLLM and LMDeploy, and it supports model quantization with technologies like GPTQ, AWQ, and BNB. Furthermore, ms-swift offers a Gradio-based Web UI and a wealth of best practices.
+🍔 Additionally, ms-swift incorporates the latest training technologies, including lightweight techniques such as LoRA, QLoRA, Llama-Pro, LongLoRA, GaLore, Q-GaLore, LoRA+, LISA, DoRA, FourierFt, ReFT, UnSloth, and Liger, as well as human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, and ORPO. ms-swift supports acceleration of inference, evaluation, and deployment modules using vLLM, SGLang and LMDeploy, and it supports model quantization with technologies like GPTQ, AWQ, and BNB. Furthermore, ms-swift offers a Gradio-based Web UI and a wealth of best practices.
 
 **Why choose ms-swift?**
 
@@ -68,12 +68,13 @@ You can contact us and communicate with us by adding our group:
 - **Interface Training**: Provides capabilities for training, inference, evaluation, quantization through an interface, completing the whole large model pipeline.
 - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
 - 🍉 **Toolbox Capabilities**: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment.
-- **Inference Acceleration**: Supports inference acceleration engines like PyTorch, vLLM, LmDeploy, and provides OpenAI API for accelerating inference, deployment, and evaluation modules.
+- **Inference Acceleration**: Supports inference acceleration engines like PyTorch, vLLM, SGLang, LmDeploy, and provides OpenAI API for accelerating inference, deployment, and evaluation modules.
 - **Model Evaluation**: Uses EvalScope as the evaluation backend and supports evaluation on 100+ datasets for both pure text and multi-modal models.
-- **Model Quantization**: Supports AWQ, GPTQ, and BNB quantized exports, with models that can use vLLM/LmDeploy for inference acceleration and continue training.
+- **Model Quantization**: Supports AWQ, GPTQ, and BNB quantized exports, with models that can use vLLM/SGLang/LmDeploy for inference acceleration and continue training.
 
 
 ## 🎉 News
+- 🎁 2025.06.18: Support for accelerating the ms-swift [inference](https://github.com/modelscope/ms-swift/blob/main/examples/infer/sglang), deployment, evaluation, and UI modules using the sglang inference acceleration engine. Simply set `--infer_backend sglang` to enable it.
 - 🎁 2025.06.15: Support for GKD training on both pure text large models and multimodal models. Training scripts can be found here: [Pure Text](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/gkd.sh), [Multimodal](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/gkd.sh).
 - 🎁 2025.06.11: Support for using Megatron parallelism techniques for RLHF training. The training script can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/rlhf).
 - 🎁 2025.05.29: Support sequence parallel in pt, sft, dpo and grpo, check script [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text).
@@ -124,6 +125,7 @@ Running Environment:
 | trl | >=0.13,<0.19 | 0.18 |RLHF|
 | deepspeed    | >=0.14       | 0.14.5 / 0.16.9 | Training                                  |
 | vllm         | >=0.5.1      | 0.8.5.post1       | Inference/Deployment/Evaluation           |
+| sglang |     | 0.4.6.post5 | Inference/Deployment/Evaluation |
 | lmdeploy     | >=0.5        | 0.8       | Inference/Deployment/Evaluation           |
 | evalscope | >=0.11       |  | Evaluation |
 

diff --git a/README_CN.md b/README_CN.md
@@ -51,7 +51,7 @@
 ## 📝 简介
 🍲 ms-swift是魔搭社区提供的大模型与多模态大模型微调部署框架，现已支持500+大模型与200+多模态大模型的训练（预训练、微调、人类对齐）、推理、评测、量化与部署。其中大模型包括：Qwen3、Qwen3-MoE、Qwen2.5、InternLM3、GLM4、Mistral、DeepSeek-R1、Yi1.5、TeleChat2、Baichuan2、Gemma2等模型，多模态大模型包括：Qwen2.5-VL、Qwen2-Audio、Llama4、Llava、InternVL3、MiniCPM-V-2.6、GLM4v、Xcomposer2.5、Yi-VL、DeepSeek-VL2、Phi3.5-Vision、GOT-OCR2等模型。
 
-🍔 除此之外，ms-swift汇集了最新的训练技术，包括LoRA、QLoRA、Llama-Pro、LongLoRA、GaLore、Q-GaLore、LoRA+、LISA、DoRA、FourierFt、ReFT、UnSloth、和Liger等轻量化训练技术，以及DPO、GRPO、RM、PPO、GKD、KTO、CPO、SimPO、ORPO等人类对齐训练方法。ms-swift支持使用vLLM和LMDeploy对推理、评测和部署模块进行加速，并支持使用GPTQ、AWQ、BNB等技术对大模型进行量化。ms-swift还提供了基于Gradio的Web-UI界面及丰富的最佳实践。
+🍔 除此之外，ms-swift汇集了最新的训练技术，包括LoRA、QLoRA、Llama-Pro、LongLoRA、GaLore、Q-GaLore、LoRA+、LISA、DoRA、FourierFt、ReFT、UnSloth、和Liger等轻量化训练技术，以及DPO、GRPO、RM、PPO、GKD、KTO、CPO、SimPO、ORPO等人类对齐训练方法。ms-swift支持使用vLLM、SGLang和LMDeploy对推理、评测和部署模块进行加速，并支持使用GPTQ、AWQ、BNB等技术对大模型进行量化。ms-swift还提供了基于Gradio的Web-UI界面及丰富的最佳实践。
 
 **为什么选择ms-swift？**
 - 🍎 **模型类型**：支持500+纯文本大模型、**200+多模态大模型**以及All-to-All全模态模型、序列分类模型、Embedding模型**训练到部署全流程**。
@@ -65,11 +65,12 @@
 - **界面训练**：以界面的方式提供训练、推理、评测、量化的能力，完成大模型的全链路。
 - **插件化与拓展**：支持自定义模型和数据集拓展，支持对loss、metric、trainer、loss-scale、callback、optimizer等组件进行自定义。
 - 🍉 **工具箱能力**：不仅提供大模型和多模态大模型的训练支持，还涵盖其推理、评测、量化和部署全流程。
-- **推理加速**：支持PyTorch、vLLM、LmDeploy推理加速引擎，并提供OpenAI接口，为推理、部署和评测模块提供加速。
+- **推理加速**：支持PyTorch、vLLM、SGLang和LmDeploy推理加速引擎，并提供OpenAI接口，为推理、部署和评测模块提供加速。
 - **模型评测**：以EvalScope作为评测后端，支持100+评测数据集对纯文本和多模态模型进行评测。
-- **模型量化**：支持AWQ、GPTQ和BNB的量化导出，导出的模型支持使用vLLM/LmDeploy推理加速，并支持继续训练。
+- **模型量化**：支持AWQ、GPTQ和BNB的量化导出，导出的模型支持使用vLLM/SGLang/LmDeploy推理加速，并支持继续训练。
 
 ## 🎉 新闻
+- 🎁 2025.06.18: 支持使用sglang推理加速引擎对ms-swift[推理](https://github.com/modelscope/ms-swift/blob/main/examples/infer/sglang)/部署/评测/ui模块进行加速，设置`--infer_backend sglang`即可。
 - 🎁 2025.06.15: 支持对纯文本大模型和多模态模型进行GKD训练。训练脚本参考这里：[纯文本](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/gkd.sh), [多模态](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/gkd.sh)。
 - 🎁 2025.06.11: 支持使用Megatron并行技术进行RLHF训练，训练脚本参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/rlhf)。
 - 🎁 2025.05.29: 支持pt、sft、dpo、grpo的序列并行，具体请查看[脚本](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text)。
@@ -120,6 +121,7 @@ pip install -e .
 | trl | >=0.13,<0.19 | 0.18 |RLHF|
 | deepspeed | >=0.14       | 0.14.5 / 0.16.9 |训练|
 | vllm | >=0.5.1      | 0.8.5.post1 |推理/部署/评测|
+| sglang |     | 0.4.6.post5 |推理/部署/评测|
 | lmdeploy | >=0.5        | 0.8 |推理/部署/评测|
 | evalscope | >=0.11       | |评测|
 

diff --git a/docs/source/Customization/自定义数据集.md b/docs/source/Customization/自定义数据集.md
@@ -165,6 +165,7 @@ query-response格式：
 该格式将自动转换数据集格式为对应模型的grounding任务格式，且选择对应模型的bbox归一化方式。该格式比通用格式多了objects字段，该字段包含的字段有：
  - ref: 用于替换`<ref-object>`。
  - bbox: 用于替换`<bbox>`。若bbox中每个box长度为2，则代表x和y坐标，若box长度为4，则代表2个点的x和y坐标。
+  - 注意：`<ref-object>`和`<bbox>`并没有对应关系，ref和bbox各自替换各自的占位符。
  - bbox_type: 可选项为'real'，'norm1'。默认为'real'，即bbox为真实bbox值。若是'norm1'，则bbox已经归一化为0~1。
  - image_id: 该参数只有当bbox_type为'real'时生效。代表bbox对应的图片是第几张，用于缩放bbox。索引从0开始，默认全为第0张。
 

diff --git a/docs/source/GetStarted/SWIFT安装.md b/docs/source/GetStarted/SWIFT安装.md
@@ -88,6 +88,7 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2
 | trl | >=0.13,<0.19 | 0.18 |RLHF|
 | deepspeed | >=0.14       | 0.14.5 / 0.16.9 |训练|
 | vllm | >=0.5.1      | 0.8.5.post1 |推理/部署/评测|
+| sglang |     | 0.4.6.post5 |推理/部署/评测|
 | lmdeploy | >=0.5        | 0.8 |推理/部署/评测|
 | evalscope | >=0.11       | |评测|
 

diff --git a/docs/source/GetStarted/快速开始.md b/docs/source/GetStarted/快速开始.md
@@ -13,9 +13,9 @@ ms-swift是魔搭社区提供的大模型与多模态大模型训练部署框架
 - 界面训练：以界面的方式提供训练、推理、评测、量化的能力，完成大模型的全链路。
 - 插件化与拓展：支持自定义模型和数据集拓展，支持对loss、metric、trainer、loss-scale、callback、optimizer等组件进行自定义。
 - 🍉 工具箱能力：除了对大模型和多模态大模型的训练支持外，还支持其推理、评测、量化和部署全流程。
-- 推理加速：支持PyTorch、vLLM、LmDeploy推理加速引擎，并提供OpenAI接口，为推理、部署和评测模块提供加速。
+- 推理加速：支持PyTorch、vLLM、SGLang和LmDeploy推理加速引擎，并提供OpenAI接口，为推理、部署和评测模块提供加速。
 - 模型评测：以EvalScope作为评测后端，支持100+评测数据集对纯文本和多模态模型进行评测。
-- 模型量化：支持AWQ、GPTQ和BNB的量化导出，导出的模型支持使用vLLM/LmDeploy推理加速，并支持继续训练。
+- 模型量化：支持AWQ、GPTQ和BNB的量化导出，导出的模型支持使用vLLM/SGLang/LmDeploy推理加速，并支持继续训练。
 
 
 ## 安装

diff --git a/docs/source/Instruction/命令行参数.md b/docs/source/Instruction/命令行参数.md
@@ -313,31 +313,46 @@ Vera使用`target_modules`, `target_regex`, `modules_to_save`三个参数.
 - reft_intervention_type: ReFT的类型, 支持'NoreftIntervention', 'LoreftIntervention', 'ConsreftIntervention', 'LobireftIntervention', 'DireftIntervention', 'NodireftIntervention', 默认为`LoreftIntervention`.
 - reft_args: ReFT Intervention中的其他支持参数, 以json-string格式输入.
 
-### LMDeploy参数
-参数含义可以查看[lmdeploy文档](https://lmdeploy.readthedocs.io/en/latest/api/pipeline.html#turbomindengineconfig)。
-
-- 🔥tp: tensor并行度。默认为`1`。
-- session_len: 默认为`None`。
-- cache_max_entry_count: 默认为`0.8`。
-- quant_policy: 默认为`0`。
-- vision_batch_size: 默认为`1`。
-
 ### vLLM参数
 参数含义可以查看[vllm文档](https://docs.vllm.ai/en/latest/serving/engine_args.html)。
 
-- 🔥gpu_memory_utilization: 默认值`0.9`。
-- 🔥tensor_parallel_size: 默认为`1`。
-- pipeline_parallel_size: 默认为`1`。
-- max_num_seqs: 默认为`256`。
-- 🔥max_model_len: 默认为`None`。
-- disable_custom_all_reduce: 默认为`True`。
+- 🔥gpu_memory_utilization: GPU内存比例，取值范围为0到1。默认值`0.9`。
+- 🔥tensor_parallel_size: tp并行数，默认为`1`。
+- pipeline_parallel_size: pp并行数，默认为`1`。
+- max_num_seqs: 单次迭代中处理的最大序列数，默认为`256`。
+- 🔥max_model_len: 默认为`None`，即从config.json中读取。
+- disable_custom_all_reduce: 禁用自定义的 all-reduce 内核，回退到 NCCL。为了稳定性，默认为`True`。
 - enforce_eager: vllm使用pytorch eager模式还是建立cuda graph，默认为`False`。设置为True可以节约显存，但会影响效率。
 - 🔥limit_mm_per_prompt: 控制vllm使用多图，默认为`None`。例如传入`--limit_mm_per_prompt '{"image": 5, "video": 2}'`。
 - vllm_max_lora_rank: 默认为`16`。vllm对于lora支持的参数。
 - vllm_quantization: vllm可以在内部量化模型，参数支持的值详见[这里](https://docs.vllm.ai/en/latest/serving/engine_args.html)。
 - enable_prefix_caching: 开启vllm的自动前缀缓存，节约重复查询前缀的处理时间。默认为`False`。
 - use_async_engine: vLLM backend下是否使用async engine。部署情况（swift deploy）默认为True，其他情况默认为False。
 
+### SGLang参数
+参数含义可以查看[sglang文档](https://docs.sglang.ai/backend/server_arguments.html)。
+
+- sglang_tp_size: tp数。默认为1。
+- sglang_pp_size: pp数。默认为1。
+- sglang_dp_size: dp数。默认为1。
+- sglang_ep_size: ep数。默认为1。
+- sglang_mem_fraction_static: 用于静态分配模型权重和KV缓存内存池的GPU内存比例。如果你遇到GPU内存不足错误，可以尝试降低该值。默认为None。
+- sglang_context_length: 模型的最大上下文长度。默认为 None，将使用模型的`config.json`中的值。
+- sglang_disable_cuda_graph: 禁用CUDA图。默认为False。
+- sglang_disable_radix_cache: 禁用 RadixAttention 用于前缀缓存。为了稳定性，默认为True。
+- sglang_quantization: 量化方法。默认为None。
+- sglang_kv_cache_dtype: 用于k/v缓存存储的数据类型。'auto'表示将使用模型的数据类型。'fp8_e5m2'和'fp8_e4m3'适用于CUDA 11.8及以上版本。默认为'auto'。
+- sglang_enable_dp_attention: 为注意力机制启用数据并行，为前馈网络（FFN）启用张量并行。数据并行的规模（dp size）应等于张量并行的规模（tp size）。目前支持DeepSeek-V2/3以及Qwen2/3 MoE模型。默认为False。
+
+### LMDeploy参数
+参数含义可以查看[lmdeploy文档](https://lmdeploy.readthedocs.io/en/latest/api/pipeline.html#turbomindengineconfig)。
+
+- 🔥tp: tensor并行度。默认为`1`。
+- session_len: 最大会话长度。默认为`None`。
+- cache_max_entry_count: k/v缓存占用的GPU内存百分比。默认为`0.8`。
+- quant_policy: 默认为0。当需要将k/v量化为4或8位时，分别将其设置为4或8。
+- vision_batch_size: 传入VisionConfig的max_batch_size参数。默认为`1`。
+
 ### 合并参数
 
 - 🔥merge_lora: 是否合并lora，本参数支持lora、llamapro、longlora，默认为False。例子参数[这里](https://github.com/modelscope/ms-swift/blob/main/examples/export/merge_lora.sh)。
@@ -498,7 +513,7 @@ soft overlong 奖励参数
 
 推理参数除包含[基本参数](#基本参数)、[合并参数](#合并参数)、[vLLM参数](#vllm参数)、[LMDeploy参数](#LMDeploy参数)外，还包含下面的部分：
 
-- 🔥infer_backend: 推理加速后端，支持'pt'、'vllm'、'lmdeploy'三种推理引擎。默认为'pt'。
+- 🔥infer_backend: 推理加速后端，支持'pt'、'vllm'、'sglang'、'lmdeploy'四种推理引擎。默认为'pt'。
 - 🔥max_batch_size: 指定infer_backend为pt时生效，用于批量推理，默认为1。若设置为-1，则不受限制。
 - 🔥result_path: 推理结果存储路径（jsonl），默认为None，保存在checkpoint目录（含args.json文件）或者'./result'目录，最终存储路径会在命令行中打印。
   - 注意：若已存在`result_path`文件，则会进行追加写入。

diff --git a/docs/source/Instruction/导出与推送.md b/docs/source/Instruction/导出与推送.md
@@ -35,7 +35,7 @@ pip install bitsandbytes -U
 - 支持[AWQ](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/awq.sh)/[GPTQ](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/gptq.sh)/[BNB](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/bnb.sh)量化导出。
 - 多模态量化: 支持使用GPTQ和AWQ对多模态模型进行量化，其中AWQ支持的多模态模型有限。参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize/mllm)。
 - 更多系列模型的支持: 支持[Bert](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize/bert)，[Reward Model](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize/reward_model)的量化导出。
-- 使用SWIFT量化导出的模型支持使用vllm/lmdeploy进行推理加速；也支持使用QLoRA继续进行SFT/RLHF。
+- 使用SWIFT量化导出的模型支持使用vllm/sglang/lmdeploy进行推理加速；也支持使用QLoRA继续进行SFT/RLHF。
 
 
 ## 推送模型