diff --git a/README.md b/README.md index 66890f2db4..48a1ed2a98 100644 --- a/README.md +++ b/README.md @@ -53,7 +53,7 @@ You can contact us and communicate with us by adding our group: ## 📝 Introduction 🍲 ms-swift is an official framework provided by the ModelScope community for fine-tuning and deploying large language models and multi-modal large models. It currently supports the training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment of 500+ large models and 200+ multi-modal large models. These large language models (LLMs) include models such as Qwen3, Qwen3-MoE, Qwen2.5, InternLM3, GLM4, Mistral, DeepSeek-R1, Yi1.5, TeleChat2, Baichuan2, and Gemma2. The multi-modal LLMs include models such as Qwen2.5-VL, Qwen2-Audio, Llama4, Llava, InternVL3, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, and GOT-OCR2. -🍔 Additionally, ms-swift incorporates the latest training technologies, including lightweight techniques such as LoRA, QLoRA, Llama-Pro, LongLoRA, GaLore, Q-GaLore, LoRA+, LISA, DoRA, FourierFt, ReFT, UnSloth, and Liger, as well as human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, and ORPO. ms-swift supports acceleration of inference, evaluation, and deployment modules using vLLM and LMDeploy, and it supports model quantization with technologies like GPTQ, AWQ, and BNB. Furthermore, ms-swift offers a Gradio-based Web UI and a wealth of best practices. +🍔 Additionally, ms-swift incorporates the latest training technologies, including lightweight techniques such as LoRA, QLoRA, Llama-Pro, LongLoRA, GaLore, Q-GaLore, LoRA+, LISA, DoRA, FourierFt, ReFT, UnSloth, and Liger, as well as human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, and ORPO. ms-swift supports acceleration of inference, evaluation, and deployment modules using vLLM, SGLang and LMDeploy, and it supports model quantization with technologies like GPTQ, AWQ, and BNB. Furthermore, ms-swift offers a Gradio-based Web UI and a wealth of best practices. **Why choose ms-swift?** @@ -68,12 +68,13 @@ You can contact us and communicate with us by adding our group: - **Interface Training**: Provides capabilities for training, inference, evaluation, quantization through an interface, completing the whole large model pipeline. - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer. - 🍉 **Toolbox Capabilities**: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment. -- **Inference Acceleration**: Supports inference acceleration engines like PyTorch, vLLM, LmDeploy, and provides OpenAI API for accelerating inference, deployment, and evaluation modules. +- **Inference Acceleration**: Supports inference acceleration engines like PyTorch, vLLM, SGLang, LmDeploy, and provides OpenAI API for accelerating inference, deployment, and evaluation modules. - **Model Evaluation**: Uses EvalScope as the evaluation backend and supports evaluation on 100+ datasets for both pure text and multi-modal models. -- **Model Quantization**: Supports AWQ, GPTQ, and BNB quantized exports, with models that can use vLLM/LmDeploy for inference acceleration and continue training. 
+- **Model Quantization**: Supports AWQ, GPTQ, and BNB quantized exports, with models that can use vLLM/SGLang/LmDeploy for inference acceleration and continue training. ## 🎉 News +- 🎁 2025.06.18: Support for accelerating the ms-swift [inference](https://github.com/modelscope/ms-swift/blob/main/examples/infer/sglang), deployment, evaluation, and UI modules using the [sglang](https://github.com/sgl-project/sglang) inference acceleration engine. Simply set `--infer_backend sglang` to enable it. - 🎁 2025.06.15: Support for GKD training on both pure text large models and multimodal models. Training scripts can be found here: [Pure Text](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/gkd.sh), [Multimodal](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/gkd.sh). - 🎁 2025.06.11: Support for using Megatron parallelism techniques for RLHF training. The training script can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/rlhf). - 🎁 2025.05.29: Support sequence parallel in pt, sft, dpo and grpo, check script [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text). @@ -124,6 +125,7 @@ Running Environment: | trl | >=0.13,<0.19 | 0.18 |RLHF| | deepspeed | >=0.14 | 0.14.5 / 0.16.9 | Training | | vllm | >=0.5.1 | 0.8.5.post1 | Inference/Deployment/Evaluation | +| sglang | | 0.4.6.post5 | Inference/Deployment/Evaluation | | lmdeploy | >=0.5 | 0.8 | Inference/Deployment/Evaluation | | evalscope | >=0.11 | | Evaluation | diff --git a/README_CN.md b/README_CN.md index 57288abefb..50b42f0c13 100644 --- a/README_CN.md +++ b/README_CN.md @@ -51,7 +51,7 @@ ## 📝 简介 🍲 ms-swift是魔搭社区提供的大模型与多模态大模型微调部署框架,现已支持500+大模型与200+多模态大模型的训练(预训练、微调、人类对齐)、推理、评测、量化与部署。其中大模型包括:Qwen3、Qwen3-MoE、Qwen2.5、InternLM3、GLM4、Mistral、DeepSeek-R1、Yi1.5、TeleChat2、Baichuan2、Gemma2等模型,多模态大模型包括:Qwen2.5-VL、Qwen2-Audio、Llama4、Llava、InternVL3、MiniCPM-V-2.6、GLM4v、Xcomposer2.5、Yi-VL、DeepSeek-VL2、Phi3.5-Vision、GOT-OCR2等模型。 -🍔 除此之外,ms-swift汇集了最新的训练技术,包括LoRA、QLoRA、Llama-Pro、LongLoRA、GaLore、Q-GaLore、LoRA+、LISA、DoRA、FourierFt、ReFT、UnSloth、和Liger等轻量化训练技术,以及DPO、GRPO、RM、PPO、GKD、KTO、CPO、SimPO、ORPO等人类对齐训练方法。ms-swift支持使用vLLM和LMDeploy对推理、评测和部署模块进行加速,并支持使用GPTQ、AWQ、BNB等技术对大模型进行量化。ms-swift还提供了基于Gradio的Web-UI界面及丰富的最佳实践。 +🍔 除此之外,ms-swift汇集了最新的训练技术,包括LoRA、QLoRA、Llama-Pro、LongLoRA、GaLore、Q-GaLore、LoRA+、LISA、DoRA、FourierFt、ReFT、UnSloth、和Liger等轻量化训练技术,以及DPO、GRPO、RM、PPO、GKD、KTO、CPO、SimPO、ORPO等人类对齐训练方法。ms-swift支持使用vLLM、SGLang和LMDeploy对推理、评测和部署模块进行加速,并支持使用GPTQ、AWQ、BNB等技术对大模型进行量化。ms-swift还提供了基于Gradio的Web-UI界面及丰富的最佳实践。 **为什么选择ms-swift?** - 🍎 **模型类型**:支持500+纯文本大模型、**200+多模态大模型**以及All-to-All全模态模型、序列分类模型、Embedding模型**训练到部署全流程**。 @@ -65,11 +65,12 @@ - **界面训练**:以界面的方式提供训练、推理、评测、量化的能力,完成大模型的全链路。 - **插件化与拓展**:支持自定义模型和数据集拓展,支持对loss、metric、trainer、loss-scale、callback、optimizer等组件进行自定义。 - 🍉 **工具箱能力**:不仅提供大模型和多模态大模型的训练支持,还涵盖其推理、评测、量化和部署全流程。 -- **推理加速**:支持PyTorch、vLLM、LmDeploy推理加速引擎,并提供OpenAI接口,为推理、部署和评测模块提供加速。 +- **推理加速**:支持PyTorch、vLLM、SGLang和LmDeploy推理加速引擎,并提供OpenAI接口,为推理、部署和评测模块提供加速。 - **模型评测**:以EvalScope作为评测后端,支持100+评测数据集对纯文本和多模态模型进行评测。 -- **模型量化**:支持AWQ、GPTQ和BNB的量化导出,导出的模型支持使用vLLM/LmDeploy推理加速,并支持继续训练。 +- **模型量化**:支持AWQ、GPTQ和BNB的量化导出,导出的模型支持使用vLLM/SGLang/LmDeploy推理加速,并支持继续训练。 ## 🎉 新闻 +- 🎁 2025.06.18: 支持使用[sglang](https://github.com/sgl-project/sglang)推理加速引擎对ms-swift[推理](https://github.com/modelscope/ms-swift/blob/main/examples/infer/sglang)/部署/评测/ui模块进行加速,设置`--infer_backend sglang`即可。 - 🎁 2025.06.15: 
支持对纯文本大模型和多模态模型进行GKD训练。训练脚本参考这里:[纯文本](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/gkd.sh), [多模态](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/gkd.sh)。 - 🎁 2025.06.11: 支持使用Megatron并行技术进行RLHF训练,训练脚本参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/rlhf)。 - 🎁 2025.05.29: 支持pt、sft、dpo、grpo的序列并行,具体请查看[脚本](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text)。 @@ -120,6 +121,7 @@ pip install -e . | trl | >=0.13,<0.19 | 0.18 |RLHF| | deepspeed | >=0.14 | 0.14.5 / 0.16.9 |训练| | vllm | >=0.5.1 | 0.8.5.post1 |推理/部署/评测| +| sglang | | 0.4.6.post5 |推理/部署/评测| | lmdeploy | >=0.5 | 0.8 |推理/部署/评测| | evalscope | >=0.11 | |评测| diff --git "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" index 8b5ec3db64..2b1b018599 100644 --- "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" +++ "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" @@ -165,6 +165,7 @@ query-response格式: 该格式将自动转换数据集格式为对应模型的grounding任务格式,且选择对应模型的bbox归一化方式。该格式比通用格式多了objects字段,该字段包含的字段有: - ref: 用于替换``。 - bbox: 用于替换``。若bbox中每个box长度为2,则代表x和y坐标,若box长度为4,则代表2个点的x和y坐标。 + - 注意:``和``并没有对应关系,ref和bbox各自替换各自的占位符。 - bbox_type: 可选项为'real','norm1'。默认为'real',即bbox为真实bbox值。若是'norm1',则bbox已经归一化为0~1。 - image_id: 该参数只有当bbox_type为'real'时生效。代表bbox对应的图片是第几张,用于缩放bbox。索引从0开始,默认全为第0张。 diff --git "a/docs/source/GetStarted/SWIFT\345\256\211\350\243\205.md" "b/docs/source/GetStarted/SWIFT\345\256\211\350\243\205.md" index 46bb30a9e2..93ad751cb6 100644 --- "a/docs/source/GetStarted/SWIFT\345\256\211\350\243\205.md" +++ "b/docs/source/GetStarted/SWIFT\345\256\211\350\243\205.md" @@ -88,6 +88,7 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2 | trl | >=0.13,<0.19 | 0.18 |RLHF| | deepspeed | >=0.14 | 0.14.5 / 0.16.9 |训练| | vllm | >=0.5.1 | 0.8.5.post1 |推理/部署/评测| +| sglang | | 0.4.6.post5 |推理/部署/评测| | lmdeploy | >=0.5 | 0.8 |推理/部署/评测| | evalscope | >=0.11 | |评测| diff --git "a/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" "b/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" index 833e01f4b2..cd00fec521 100644 --- "a/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" +++ "b/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" @@ -13,9 +13,9 @@ ms-swift是魔搭社区提供的大模型与多模态大模型训练部署框架 - 界面训练:以界面的方式提供训练、推理、评测、量化的能力,完成大模型的全链路。 - 插件化与拓展:支持自定义模型和数据集拓展,支持对loss、metric、trainer、loss-scale、callback、optimizer等组件进行自定义。 - 🍉 工具箱能力:除了对大模型和多模态大模型的训练支持外,还支持其推理、评测、量化和部署全流程。 -- 推理加速:支持PyTorch、vLLM、LmDeploy推理加速引擎,并提供OpenAI接口,为推理、部署和评测模块提供加速。 +- 推理加速:支持PyTorch、vLLM、SGLang和LmDeploy推理加速引擎,并提供OpenAI接口,为推理、部署和评测模块提供加速。 - 模型评测:以EvalScope作为评测后端,支持100+评测数据集对纯文本和多模态模型进行评测。 -- 模型量化:支持AWQ、GPTQ和BNB的量化导出,导出的模型支持使用vLLM/LmDeploy推理加速,并支持继续训练。 +- 模型量化:支持AWQ、GPTQ和BNB的量化导出,导出的模型支持使用vLLM/SGLang/LmDeploy推理加速,并支持继续训练。 ## 安装 diff --git "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" index b5d9d36353..c5309e361a 100644 --- "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" +++ 
"b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" @@ -313,24 +313,15 @@ Vera使用`target_modules`, `target_regex`, `modules_to_save`三个参数. - reft_intervention_type: ReFT的类型, 支持'NoreftIntervention', 'LoreftIntervention', 'ConsreftIntervention', 'LobireftIntervention', 'DireftIntervention', 'NodireftIntervention', 默认为`LoreftIntervention`. - reft_args: ReFT Intervention中的其他支持参数, 以json-string格式输入. -### LMDeploy参数 -参数含义可以查看[lmdeploy文档](https://lmdeploy.readthedocs.io/en/latest/api/pipeline.html#turbomindengineconfig)。 - -- 🔥tp: tensor并行度。默认为`1`。 -- session_len: 默认为`None`。 -- cache_max_entry_count: 默认为`0.8`。 -- quant_policy: 默认为`0`。 -- vision_batch_size: 默认为`1`。 - ### vLLM参数 参数含义可以查看[vllm文档](https://docs.vllm.ai/en/latest/serving/engine_args.html)。 -- 🔥gpu_memory_utilization: 默认值`0.9`。 -- 🔥tensor_parallel_size: 默认为`1`。 -- pipeline_parallel_size: 默认为`1`。 -- max_num_seqs: 默认为`256`。 -- 🔥max_model_len: 默认为`None`。 -- disable_custom_all_reduce: 默认为`True`。 +- 🔥gpu_memory_utilization: GPU内存比例,取值范围为0到1。默认值`0.9`。 +- 🔥tensor_parallel_size: tp并行数,默认为`1`。 +- pipeline_parallel_size: pp并行数,默认为`1`。 +- max_num_seqs: 单次迭代中处理的最大序列数,默认为`256`。 +- 🔥max_model_len: 默认为`None`,即从config.json中读取。 +- disable_custom_all_reduce: 禁用自定义的 all-reduce 内核,回退到 NCCL。为了稳定性,默认为`True`。 - enforce_eager: vllm使用pytorch eager模式还是建立cuda graph,默认为`False`。设置为True可以节约显存,但会影响效率。 - 🔥limit_mm_per_prompt: 控制vllm使用多图,默认为`None`。例如传入`--limit_mm_per_prompt '{"image": 5, "video": 2}'`。 - vllm_max_lora_rank: 默认为`16`。vllm对于lora支持的参数。 @@ -338,6 +329,30 @@ Vera使用`target_modules`, `target_regex`, `modules_to_save`三个参数. - enable_prefix_caching: 开启vllm的自动前缀缓存,节约重复查询前缀的处理时间。默认为`False`。 - use_async_engine: vLLM backend下是否使用async engine。部署情况(swift deploy)默认为True,其他情况默认为False。 +### SGLang参数 +参数含义可以查看[sglang文档](https://docs.sglang.ai/backend/server_arguments.html)。 + +- sglang_tp_size: tp数。默认为1。 +- sglang_pp_size: pp数。默认为1。 +- sglang_dp_size: dp数。默认为1。 +- sglang_ep_size: ep数。默认为1。 +- sglang_mem_fraction_static: 用于静态分配模型权重和KV缓存内存池的GPU内存比例。如果你遇到GPU内存不足错误,可以尝试降低该值。默认为None。 +- sglang_context_length: 模型的最大上下文长度。默认为 None,将使用模型的`config.json`中的值。 +- sglang_disable_cuda_graph: 禁用CUDA图。默认为False。 +- sglang_quantization: 量化方法。默认为None。 +- sglang_kv_cache_dtype: 用于k/v缓存存储的数据类型。'auto'表示将使用模型的数据类型。'fp8_e5m2'和'fp8_e4m3'适用于CUDA 11.8及以上版本。默认为'auto'。 +- sglang_enable_dp_attention: 为注意力机制启用数据并行,为前馈网络(FFN)启用张量并行。数据并行的规模(dp size)应等于张量并行的规模(tp size)。目前支持DeepSeek-V2/3以及Qwen2/3 MoE模型。默认为False。 +- sglang_disable_custom_all_reduce: 禁用自定义的 all-reduce 内核,回退到 NCCL。为了稳定性,默认为True。 + +### LMDeploy参数 +参数含义可以查看[lmdeploy文档](https://lmdeploy.readthedocs.io/en/latest/api/pipeline.html#turbomindengineconfig)。 + +- 🔥tp: tensor并行度。默认为`1`。 +- session_len: 最大会话长度。默认为`None`。 +- cache_max_entry_count: k/v缓存占用的GPU内存百分比。默认为`0.8`。 +- quant_policy: 默认为0。当需要将k/v量化为4或8位时,分别将其设置为4或8。 +- vision_batch_size: 传入VisionConfig的max_batch_size参数。默认为`1`。 + ### 合并参数 - 🔥merge_lora: 是否合并lora,本参数支持lora、llamapro、longlora,默认为False。例子参数[这里](https://github.com/modelscope/ms-swift/blob/main/examples/export/merge_lora.sh)。 @@ -498,7 +513,7 @@ soft overlong 奖励参数 推理参数除包含[基本参数](#基本参数)、[合并参数](#合并参数)、[vLLM参数](#vllm参数)、[LMDeploy参数](#LMDeploy参数)外,还包含下面的部分: -- 🔥infer_backend: 推理加速后端,支持'pt'、'vllm'、'lmdeploy'三种推理引擎。默认为'pt'。 +- 🔥infer_backend: 推理加速后端,支持'pt'、'vllm'、'sglang'、'lmdeploy'四种推理引擎。默认为'pt'。 - 🔥max_batch_size: 指定infer_backend为pt时生效,用于批量推理,默认为1。若设置为-1,则不受限制。 - 🔥result_path: 推理结果存储路径(jsonl),默认为None,保存在checkpoint目录(含args.json文件)或者'./result'目录,最终存储路径会在命令行中打印。 - 注意:若已存在`result_path`文件,则会进行追加写入。 diff --git 
"a/docs/source/Instruction/\345\257\274\345\207\272\344\270\216\346\216\250\351\200\201.md" "b/docs/source/Instruction/\345\257\274\345\207\272\344\270\216\346\216\250\351\200\201.md" index e532bcef76..cf7d3c4f60 100644 --- "a/docs/source/Instruction/\345\257\274\345\207\272\344\270\216\346\216\250\351\200\201.md" +++ "b/docs/source/Instruction/\345\257\274\345\207\272\344\270\216\346\216\250\351\200\201.md" @@ -35,7 +35,7 @@ pip install bitsandbytes -U - 支持[AWQ](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/awq.sh)/[GPTQ](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/gptq.sh)/[BNB](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/bnb.sh)量化导出。 - 多模态量化: 支持使用GPTQ和AWQ对多模态模型进行量化,其中AWQ支持的多模态模型有限。参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize/mllm)。 - 更多系列模型的支持: 支持[Bert](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize/bert),[Reward Model](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize/reward_model)的量化导出。 -- 使用SWIFT量化导出的模型支持使用vllm/lmdeploy进行推理加速;也支持使用QLoRA继续进行SFT/RLHF。 +- 使用SWIFT量化导出的模型支持使用vllm/sglang/lmdeploy进行推理加速;也支持使用QLoRA继续进行SFT/RLHF。 ## 推送模型 diff --git "a/docs/source/Instruction/\346\216\250\347\220\206\345\222\214\351\203\250\347\275\262.md" "b/docs/source/Instruction/\346\216\250\347\220\206\345\222\214\351\203\250\347\275\262.md" index aac9448cb0..8099948639 100644 --- "a/docs/source/Instruction/\346\216\250\347\220\206\345\222\214\351\203\250\347\275\262.md" +++ "b/docs/source/Instruction/\346\216\250\347\220\206\345\222\214\351\203\250\347\275\262.md" @@ -6,6 +6,7 @@ | ------------ | -------------- | ---------- | ------ | -------- | ------ | ----- | ----- | | pytorch | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/client/llm/chat/openai_client.py) | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh) |DDP/device_map | | [vllm](https://github.com/vllm-project/vllm) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/lora/server.sh) | ❌ | ✅ | TP/PP/DP | +| [sglang](https://github.com/sgl-project/sglang) | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | TP/PP/DP/EP | | [lmdeploy](https://github.com/InternLM/lmdeploy) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh) | ✅ | ❌ | ❌ | ✅ | TP/DP | @@ -102,7 +103,7 @@ CUDA_VISIBLE_DEVICES=0 swift infer \ - 界面推理:你可以将`swift infer`改成`swift app`。 - batch推理:`infer_backend=pt`可以指定`--max_batch_size`对大模型和多模态大模型进行batch推理,具体参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh)。在进行batch推理时,你不能设置`--stream true`。 - DDP/device_map推理:`infer_backend=pt`支持使用DDP/device_map技术进行并行推理,具体参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/mllm_device_map.sh)。 -- 推理加速:swift支持使用vllm/lmdeploy对推理、部署和评测模块进行推理加速,只需要额外指定`--infer_backend vllm/lmdeploy`即可。可以参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/ddp.sh)。 +- 推理加速:swift支持使用vllm/sglang/lmdeploy对推理、部署和评测模块进行推理加速,只需要额外指定`--infer_backend vllm/sglang/lmdeploy`即可。可以参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/ddp.sh)。 - 
多模态模型:我们提供了[pt](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/mllm_device_map.sh)/[vllm](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh)/[lmdeploy](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh)对多模态模型进行多GPU推理的shell脚本。 - 量化模型:直接选择GPTQ、AWQ、BNB量化的模型,例如:`--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4`即可。 - 更多模型类型:我们提供了[bert](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/bert.sh)、[reward_model](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/reward_model.sh)、[prm](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/prm.sh)的推理脚本。 @@ -111,7 +112,7 @@ CUDA_VISIBLE_DEVICES=0 swift infer \ **小帖士:** - SWIFT会将推理结果保存起来,你可以通过`--result_path`指定保存路径。 - 如果要输出logprobs,只需要在推理时,指定`--logprobs true`即可。SWIFT会保存。注意,设置`--stream true`将不会存储。 -- infer_backend为pt支持所有swift已支持模型的推理,而infer_backend为vllm/lmdeploy只支持部分模型,具体请参考[vllm](https://docs.vllm.ai/en/latest/models/supported_models.html)、[lmdeploy](https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html)文档。 +- infer_backend为pt支持所有swift已支持模型的推理,而infer_backend为vllm/sglang/lmdeploy只支持部分模型,具体请参考[vllm](https://docs.vllm.ai/en/latest/models/supported_models.html)、[sglang](https://docs.sglang.ai/supported_models/generative_models.html)、[lmdeploy](https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html)文档。 - 使用`--infer_backend vllm`出现OOM,可以通过降低`--max_model_len`,`--max_num_seqs`,选择合适的`--gpu_memory_utilization`,设置`--enforce_eager true`。或者使用tensor并行`--tensor_parallel_size`来解决。 - 使用`--infer_backend vllm`推理多模态模型,需要传入多张图片。可以设置`--limit_mm_per_prompt`解决,例如:`--limit_mm_per_prompt '{"image": 10, "video": 5}'`。 - 推理qwen2-vl/qwen2.5-vl出现OOM,可以通过设置`MAX_PIXELS`、`VIDEO_MAX_PIXELS`、`FPS_MAX_FRAMES`解决,可以参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh)。 @@ -180,7 +181,7 @@ print(f'response2: {resp_list[2].choices[0].message.content}') ``` 我们也提供了更多使用python推理的demo: -- 使用流式推理以及`VllmEngine`、`LmdeployEngine`进行推理加速,可以参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py)。 +- 使用流式推理以及`VllmEngine`、`SglangEngine`、`LmdeployEngine`进行推理加速,可以参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py)。 - 多模态推理:除了上述多模态输入格式外,swift兼容OpenAI的多模态输入格式,参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_mllm.py)。 - grounding任务:对多模态模型进行Grounding任务画框,可以参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py)。 - 多LoRA推理:参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py)。 diff --git "a/docs/source/Instruction/\350\257\204\346\265\213.md" "b/docs/source/Instruction/\350\257\204\346\265\213.md" index 462b48ad56..883c8adfeb 100644 --- "a/docs/source/Instruction/\350\257\204\346\265\213.md" +++ "b/docs/source/Instruction/\350\257\204\346\265\213.md" @@ -82,7 +82,7 @@ swift eval \ 其中: - model: 可指定本地模型路径或者modelscope上的模型ID - eval_backend: 可选 Native, OpenCompass, VLMEvalKit,默认为 Native -- infer_backend: 可选 pt, vllm, lmdeploy,默认为 pt +- infer_backend: 可选 pt, vllm, sglang, lmdeploy,默认为 pt - eval_limit: 每个评测集的采样数,默认为None,表示使用全部数据,可用于快速验证 - eval_dataset: 评测数据集,可设置多个数据集,用空格分割 diff --git "a/docs/source/Instruction/\351\242\204\350\256\255\347\273\203\344\270\216\345\276\256\350\260\203.md" "b/docs/source/Instruction/\351\242\204\350\256\255\347\273\203\344\270\216\345\276\256\350\260\203.md" index 2b6ac6d531..5e218c38f7 100644 --- 
"a/docs/source/Instruction/\351\242\204\350\256\255\347\273\203\344\270\216\345\276\256\350\260\203.md" +++ "b/docs/source/Instruction/\351\242\204\350\256\255\347\273\203\344\270\216\345\276\256\350\260\203.md" @@ -74,7 +74,7 @@ ms-swift使用了分层式的设计思想,用户可以使用命令行界面、 - 在使用`swift sft`通过LoRA技术微调base模型为chat模型时,有时需要手动设置模板。通过添加`--template default`参数来避免base模型因未见过对话模板中的特殊字符而无法正常停止的情况。具体参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/train/base_to_chat)。 - 如果需要在**断网**环境下进行训练,请设置`--model `和`--check_model false`。如果对应的模型需要`git clone`github的仓库,例如`deepseek-ai/Janus-Pro-7B`,请设置手动下载仓库,并设置`--local_repo_path `。具体参数含义请参考[命令行参数文档](命令行参数.md)。 -- 无法对QLoRA训练的模型进行Merge LoRA,因此不建议使用QLoRA进行微调,无法在推理和部署时使用vLLM/LMDeploy进行推理加速。建议使用LoRA/全参数进行微调,合并为完整权重后再使用GPTQ/AWQ/BNB进行[量化](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize)。 +- 无法对QLoRA训练的模型进行Merge LoRA,因此不建议使用QLoRA进行微调,无法在推理和部署时使用vLLM/Sglang/LMDeploy进行推理加速。建议使用LoRA/全参数进行微调,合并为完整权重后再使用GPTQ/AWQ/BNB进行[量化](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize)。 - 如果使用NPU进行训练,只需要将shell中的`CUDA_VISIBLE_DEVICES`修改为`ASCEND_RT_VISIBLE_DEVICES`。 - SWIFT默认在训练时设置`--gradient_checkpointing true`来节约显存,这会略微降低训练速度。 - 若使用DDP进行训练,出现报错:`RuntimeError: Expected to mark a variable ready only once.`,请额外设置参数`--gradient_checkpointing_kwargs '{"use_reentrant": false}'`或者使用DeepSpeed进行训练。 @@ -127,7 +127,7 @@ swift infer \ - adapters文件夹中包含了训练的参数文件`args.json`,因此不需要额外指定`--model`,`--system`,swift会自动读取这些参数。如果要关闭此行为,可以设置`--load_args false`。 - 如果使用全参数训练,请使用`--model`替代`--adapters`指定训练的checkpoint目录。更多参考[推理和部署文档](./推理和部署.md#推理)。 - 你可以使用`swift app`替代`swift infer`进行界面推理。 -- 你可以选择对LoRA进行merge(额外指定`--merge_lora true`),然后指定`--infer_backend vllm/lmdeploy`进行推理加速。 +- 你可以选择对LoRA进行merge(额外指定`--merge_lora true`),然后指定`--infer_backend vllm/sglang/lmdeploy`进行推理加速。 对数据集中的验证集进行批量推理: ```shell @@ -141,7 +141,7 @@ swift infer \ --max_batch_size 1 ``` -- 你可以设置`--max_batch_size 8`,从而使用`--infer_backend pt`进行批量处理。若使用`infer_backend vllm/lmdeploy`则无需指定,会进行自动batch。 +- 你可以设置`--max_batch_size 8`,从而使用`--infer_backend pt`进行批量处理。若使用`infer_backend vllm/sglang/lmdeploy`则无需指定,会进行自动batch。 - `--load_data_args true`会额外读取训练存储参数文件`args.json`中的数据参数。 若想对额外的测试集进行推理,而不使用训练时的验证集,使用`--val_dataset `进行推理: @@ -242,7 +242,7 @@ print(f'args.default_system: {args.system}') ``` - 对全参数训练的checkpoint进行推理,将`model`设置为checkpoint_dir,并将lora_checkpoint设置为None即可。更多参考[推理和部署文档](./推理和部署.md#推理)。 -- 使用流式推理以及`VllmEngine`、`LmdeployEngine`进行推理加速,可以参考[大模型](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py)和[多模态大模型](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_mllm.py)推理示例。 +- 使用流式推理以及`VllmEngine`、`SglangEngine`、`LmdeployEngine`进行推理加速,可以参考[大模型](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py)和[多模态大模型](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_mllm.py)推理示例。 - 微调后的模型使用huggingface transformers/peft生态推理,可以参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_hf.py)。 - 若训练了多个LoRA,要进行多LoRA切换,可以参考[推理](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py)、[部署](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/lora)样例。 - 对多模态模型进行Grounding任务的画框,可以参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py)。 diff --git a/docs/source_en/Customization/Custom-dataset.md b/docs/source_en/Customization/Custom-dataset.md index 7a0c4eb87f..ef0ad66c29 100644 --- a/docs/source_en/Customization/Custom-dataset.md +++ 
b/docs/source_en/Customization/Custom-dataset.md @@ -180,6 +180,7 @@ The format will automatically convert the dataset format to the corresponding mo - ref: Used to replace ``. - bbox: Used to replace ``. If the length of each box in the bbox is 2, it represents the x and y coordinates. If the box length is 4, it represents the x and y coordinates of two points. + - Note: `` and `` do not have a corresponding relationship; references and bounding boxes replace their own placeholders separately. - bbox_type: Optional values are 'real' and 'norm1'. The default is 'real', meaning the bbox represents the actual bounding box value. If set to 'norm1', the bbox is normalized to the range 0~1. - image_id: This parameter is only effective when bbox_type is 'real'. It indicates the index of the image corresponding to the bbox, used for scaling the bbox. The index starts from 0, and the default is 0 for all. diff --git a/docs/source_en/GetStarted/Quick-start.md b/docs/source_en/GetStarted/Quick-start.md index a126a57cfa..e8b6a2d874 100644 --- a/docs/source_en/GetStarted/Quick-start.md +++ b/docs/source_en/GetStarted/Quick-start.md @@ -13,9 +13,9 @@ ms-swift is a comprehensive training and deployment framework for large language - Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models. - Plugins and Extensions: Allows customization and extension of models and datasets, and supports customizations for components like loss, metric, trainer, loss-scale, callback, optimizer, etc. - 🍉 Toolbox Capabilities: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment. -- Inference Acceleration: Supports inference acceleration engines like PyTorch, vLLM, LmDeploy, and provides OpenAI interface, accelerating inference, deployment, and evaluation modules. +- Inference Acceleration: Supports inference acceleration engines like PyTorch, vLLM, SGLang, LmDeploy, and provides OpenAI interface, accelerating inference, deployment, and evaluation modules. - Model Evaluation: Uses EvalScope as the evaluation backend and supports evaluation of text-based and multimodal models with over 100 evaluation datasets. -- Model Quantization: Supports the export of quantized models in AWQ, GPTQ, and BNB formats, which can be accelerated using vLLM/LmDeploy for inference and support continued training. +- Model Quantization: Supports the export of quantized models in AWQ, GPTQ, and BNB formats, which can be accelerated using vLLM/SGLang/LmDeploy for inference and support continued training. 
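The Quick-start feature list above notes that inference, deployment, and evaluation can be accelerated with the PyTorch, vLLM, SGLang, or LmDeploy backends. As a rough Python illustration of how those backends are selected (mirroring `examples/infer/demo.py` later in this diff; the model ID and prompt are placeholders):

```python
from swift.llm import InferRequest, RequestConfig

model = 'Qwen/Qwen2.5-1.5B-Instruct'   # illustrative model ID
infer_backend = 'sglang'               # one of 'pt', 'vllm', 'sglang', 'lmdeploy'

if infer_backend == 'pt':
    from swift.llm import PtEngine
    engine = PtEngine(model)
elif infer_backend == 'vllm':
    from swift.llm import VllmEngine
    engine = VllmEngine(model, max_model_len=8192)
elif infer_backend == 'sglang':
    from swift.llm import SglangEngine   # added by this PR
    engine = SglangEngine(model)
else:
    from swift.llm import LmdeployEngine
    engine = LmdeployEngine(model)

# All engines expose the same infer() interface.
resp_list = engine.infer([InferRequest(messages=[{'role': 'user', 'content': 'Hello!'}])],
                         RequestConfig(max_tokens=128))
print(resp_list[0].choices[0].message.content)
```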
## Installation diff --git a/docs/source_en/GetStarted/SWIFT-installation.md b/docs/source_en/GetStarted/SWIFT-installation.md index 49442d46f3..c307aa116e 100644 --- a/docs/source_en/GetStarted/SWIFT-installation.md +++ b/docs/source_en/GetStarted/SWIFT-installation.md @@ -89,6 +89,7 @@ More images can be found [here](https://modelscope.cn/docs/intro/environment-set | trl | >=0.13,<0.19 | 0.18 | RLHF | | deepspeed | >=0.14 | 0.14.5 / 0.16.9 | Training | | vllm | >=0.5.1 | 0.8.5.post1 | Inference/Deployment/Evaluation | +| sglang | | 0.4.6.post5 | Inference/Deployment/Evaluation | | lmdeploy | >=0.5 | 0.8 | Inference/Deployment/Evaluation | | evalscope | >=0.11 | | Evaluation | diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md index 076c07c959..092ee08e05 100644 --- a/docs/source_en/Instruction/Command-line-parameters.md +++ b/docs/source_en/Instruction/Command-line-parameters.md @@ -320,26 +320,16 @@ The following parameters are effective when `train_type` is set to `reft`. - reft_intervention_type: Type of ReFT, supports 'NoreftIntervention', 'LoreftIntervention', 'ConsreftIntervention', 'LobireftIntervention', 'DireftIntervention', 'NodireftIntervention', default is `LoreftIntervention`. - reft_args: Other supported parameters for ReFT Intervention, input in json-string format. -### LMDeploy Arguments - -Parameter meanings can be found in the [lmdeploy documentation](https://lmdeploy.readthedocs.io/en/latest/api/pipeline.html#turbomindengineconfig). - -- 🔥tp: tensor parallelism degree. Default is `1`. -- session_len: Default is `None`. -- cache_max_entry_count: Default is `0.8`. -- quant_policy: Default is `0`. -- vision_batch_size: Default is `1`. - ### vLLM Arguments Parameter meanings can be found in the [vllm documentation](https://docs.vllm.ai/en/latest/serving/engine_args.html). -- 🔥gpu_memory_utilization: Default value is `0.9`. -- 🔥tensor_parallel_size: Default is `1`. -- pipeline_parallel_size: Default is `1`. -- max_num_seqs: Default is `256`. -- 🔥max_model_len: Default is `None`. -- disable_custom_all_reduce: Default is `True`. +- 🔥gpu_memory_utilization: GPU memory ratio, ranging from 0 to 1. Default is `0.9`. +- 🔥tensor_parallel_size: Tensor parallelism size. Default is `1`. +- pipeline_parallel_size: Pipeline parallelism size. Default is `1`. +- max_num_seqs: Maximum number of sequences to be processed in a single iteration. Default is `256`. +- 🔥max_model_len: Default is `None`, meaning it will be read from `config.json`. +- disable_custom_all_reduce: Disables the custom all-reduce kernel and falls back to NCCL. For stability, the default is `True`. - enforce_eager: Determines whether vllm uses PyTorch eager mode or constructs a CUDA graph, default is `False`. Setting it to True can save memory but may affect efficiency. - 🔥limit_mm_per_prompt: Controls the use of multiple media in vllm, default is `None`. For example, you can pass in `--limit_mm_per_prompt '{"image": 5, "video": 2}'`. - vllm_max_lora_rank: Default is `16`. This is the parameter supported by vllm for lora. @@ -347,6 +337,31 @@ Parameter meanings can be found in the [vllm documentation](https://docs.vllm.ai - enable_prefix_caching: Enable the automatic prefix caching of vllm to save processing time for querying repeated prefixes. The default is `False`. - use_async_engine: Whether to use the async engine under the vLLM backend. The deployment status (swift deploy) defaults to True, and other statuses default to False. 
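As a concrete illustration of the vLLM options documented above, they can also be supplied programmatically. A minimal sketch, assuming the `VllmEngine` keyword names match the flags listed here (the diff itself only demonstrates `max_model_len` being passed this way, and the model ID is a placeholder):

```python
from swift.llm import VllmEngine

engine = VllmEngine(
    'Qwen/Qwen2.5-7B-Instruct',                    # illustrative model ID
    gpu_memory_utilization=0.9,                    # fraction of GPU memory vLLM may use
    tensor_parallel_size=1,                        # tensor parallelism size
    max_num_seqs=256,                              # max sequences per iteration
    max_model_len=8192,                            # cap context length to save memory
    enforce_eager=False,                           # True lowers memory use, costs speed
    limit_mm_per_prompt={'image': 5, 'video': 2},  # multimodal items allowed per prompt
)
```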
+### SGLang Arguments +Parameter meanings can be found in the [sglang documentation](https://docs.sglang.ai/backend/server_arguments.html). + +- sglang_tp_size: Tensor parallelism size. Default is 1. +- sglang_pp_size: Pipeline parallelism size. Default is 1. +- sglang_dp_size: Data parallelism size. Default is 1. +- sglang_ep_size: Expert parallelism size. Default is 1. +- sglang_mem_fraction_static: The fraction of GPU memory used for static allocation (model weights and KV cache memory pool). If you encounter out-of-memory errors, try reducing this value. Default is None. +- sglang_context_length: The maximum context length of the model. Default is None, which means it will use the value from the model's `config.json`. +- sglang_disable_cuda_graph: Disables CUDA graph. Default is False. +- sglang_quantization: Quantization method. Default is None. +- sglang_kv_cache_dtype: Data type for KV cache storage. 'auto' means it will use the model's data type. 'fp8_e5m2' and 'fp8_e4m3' are supported on CUDA 11.8 and above. Default is 'auto'. +- sglang_enable_dp_attention: Enables data parallelism for attention and tensor parallelism for FFN. The data parallelism size (dp size) should be equal to the tensor parallelism size (tp size). Currently supports DeepSeek-V2/3 and Qwen2/3 MoE models. Default is False. +- sglang_disable_custom_all_reduce: Disables the custom all-reduce kernel and falls back to NCCL. For stability, the default is True. + +### LMDeploy Arguments + +Parameter meanings can be found in the [lmdeploy documentation](https://lmdeploy.readthedocs.io/en/latest/api/pipeline.html#turbomindengineconfig). + +- 🔥tp: tensor parallelism degree. Default is `1`. +- session_len: Maximum session length. Default is `None`. +- cache_max_entry_count: The percentage of GPU memory occupied by the k/v cache. Default is `0.8`. +- quant_policy: Default is `0`. Set it to `4` or `8` when quantizing k/v to 4-bit or 8-bit, respectively. +- vision_batch_size: The `max_batch_size` parameter passed to `VisionConfig`. Default is `1`. + ### Merge Arguments - 🔥merge_lora: Indicates whether to merge lora; this parameter supports lora, llamapro, and longlora, default is `False`. Example parameters [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/merge_lora.sh). @@ -519,7 +534,7 @@ Soft overlong reward parameters: Inference arguments include the [base arguments](#base-arguments), [merge arguments](#merge-arguments), [vLLM arguments](#vllm-arguments), [LMDeploy arguments](#LMDeploy-arguments), and also contain the following: -- 🔥infer_backend: Inference acceleration backend, supporting three inference engines: 'pt', 'vllm', and 'lmdeploy'. The default is 'pt'. +- 🔥infer_backend: Inference acceleration backend, supporting four inference engines: 'pt', 'vllm', 'sglang', and 'lmdeploy'. The default is 'pt'. - 🔥max_batch_size: Effective when infer_backend is set to 'pt'; used for batch inference, with a default value of 1. If set to -1, there is no restriction. - 🔥result_path: Path to store inference results (jsonl). The default is None, meaning results are saved in the checkpoint directory (with args.json file) or './result' directory. The final storage path will be printed in the command line. - Note: If the `result_path` file already exists, it will be appended to. 
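To make the argument mapping concrete: the `sglang_*` flags documented above correspond to `SglangEngine` constructor kwargs of the same name without the prefix (see `SglangArguments.get_sglang_engine_kwargs` added later in this PR). A minimal sketch, with an illustrative model and settings taken from `examples/infer/sglang/tp.sh`:

```python
from swift.llm import SglangEngine

# Roughly equivalent to:
#   swift infer --infer_backend sglang --sglang_tp_size 2 --sglang_context_length 8192 ...
engine = SglangEngine(
    'Qwen/Qwen3-8B',           # illustrative model ID
    tp_size=2,                 # --sglang_tp_size
    context_length=8192,       # --sglang_context_length
    mem_fraction_static=0.8,   # assumed value; lower it if you hit GPU OOM
    kv_cache_dtype='auto',     # --sglang_kv_cache_dtype
)
```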
diff --git a/docs/source_en/Instruction/Evaluation.md b/docs/source_en/Instruction/Evaluation.md index 16b132c285..03eaff6e42 100644 --- a/docs/source_en/Instruction/Evaluation.md +++ b/docs/source_en/Instruction/Evaluation.md @@ -82,7 +82,7 @@ swift eval \ Where: - model: Can specify a local model path or a model ID on modelscope - eval_backend: Options are Native, OpenCompass, VLMEvalKit; default is Native -- infer_backend: Options are pt, vllm, lmdeploy; default is pt +- infer_backend: Options are pt, vllm, sglang, lmdeploy; default is pt - eval_limit: Sample size for each evaluation set; default is None, which means using all data; can be used for quick validation - eval_dataset: Evaluation dataset(s); multiple datasets can be set, separated by spaces diff --git a/docs/source_en/Instruction/Export-and-push.md b/docs/source_en/Instruction/Export-and-push.md index e849b6f7b1..741061317f 100644 --- a/docs/source_en/Instruction/Export-and-push.md +++ b/docs/source_en/Instruction/Export-and-push.md @@ -35,7 +35,7 @@ We provide a series of scripts to demonstrate SWIFT's quantization export capabi - Supports [AWQ](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/awq.sh)/[GPTQ](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/gptq.sh)/[BNB](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/bnb.sh) quantization exports. - Multimodal quantization: Supports quantizing multimodal models using GPTQ and AWQ, with limited multimodal models supported by AWQ. Refer to [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize/mllm). - Support for more model series: Supports quantization exports for [BERT](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize/bert) and [Reward Model](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize/reward_model). -- Models exported with SWIFT's quantization support inference acceleration using vllm/lmdeploy; they also support further SFT/RLHF using QLoRA. +- Models exported with SWIFT's quantization support inference acceleration using vllm/sglang/lmdeploy; they also support further SFT/RLHF using QLoRA. 
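As a brief illustration of the last point above, a quantized export can be loaded by an accelerated engine directly. A minimal sketch, using an off-the-shelf GPTQ model ID that appears elsewhere in these docs (substitute your own export directory):

```python
from swift.llm import InferRequest, RequestConfig, VllmEngine

# GPTQ/AWQ/BNB exports can be served with vllm/sglang/lmdeploy; vLLM is shown here.
engine = VllmEngine('Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4')
resp = engine.infer([InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}])],
                    RequestConfig(max_tokens=64))
print(resp[0].choices[0].message.content)
```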
## Push Model diff --git a/docs/source_en/Instruction/Inference-and-deployment.md b/docs/source_en/Instruction/Inference-and-deployment.md index e7e1ee4881..0f239edb13 100644 --- a/docs/source_en/Instruction/Inference-and-deployment.md +++ b/docs/source_en/Instruction/Inference-and-deployment.md @@ -6,6 +6,7 @@ Below are the inference engines supported by Swift along with their correspondin | ------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------- | ------------------------------------------------------------ | ----- | ------------------------------------------------------------ | ------------------- | | pytorch | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/client/llm/chat/openai_client.py) | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh) | DDP/device_map | | [vllm](https://github.com/vllm-project/vllm) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/lora/server.sh) | ❌ | ✅ | TP/PP/DP | +| [sglang](https://github.com/sgl-project/sglang) | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | TP/PP/DP/EP | | [lmdeploy](https://github.com/InternLM/lmdeploy) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh) | ✅ | ❌ | ❌ | ✅ | TP/DP | ## Inference @@ -96,7 +97,7 @@ The above example provides streaming inference for both full parameters and LoRA - Interface Inference: You can change `swift infer` to `swift app`. - Batch Inference: For large models and multimodal models, you can specify `--max_batch_size` for batch inference by using `infer_backend=pt`. For specific details, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh). Note that you cannot set `--stream true` when performing batch inference. - DDP/device_map Inference: `infer_backend=pt` supports parallel inference using DDP/device_map technology. For further details, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/mllm_device_map.sh). -- Inference Acceleration: Swift supports using vllm/lmdeploy for inference acceleration across the inference, deployment, and evaluation modules by simply adding `--infer_backend vllm/lmdeploy`. You can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/ddp.sh). +- Inference Acceleration: Swift supports using vllm/sglang/lmdeploy for inference acceleration across the inference, deployment, and evaluation modules by simply adding `--infer_backend vllm/sglang/lmdeploy`. You can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/ddp.sh). - Multimodal Models: We provide shell scripts for multi-GPU inference for multimodal models using [pt](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/mllm_device_map.sh), [vllm](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh), and [lmdeploy](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh). - Quantized Models: You can directly select models that are quantized with GPTQ, AWQ, or BNB, for example: `--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4`. 
- More Model Types: We also provide inference scripts for [bert](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/bert.sh), [reward_model](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/reward_model.sh), and [prm](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/prm.sh). @@ -105,7 +106,7 @@ The above example provides streaming inference for both full parameters and LoRA - SWIFT saves inference results, and you can specify the save path using `--result_path`. - To output log probabilities, simply specify `--logprobs true` during inference. SWIFT will save these results. Note that setting `--stream true` will prevent storage of results. -- Using `infer_backend=pt` supports inference for all models supported by SWIFT, while `infer_backend=vllm/lmdeploy` supports only a subset of models. Please refer to the documentation for [vllm](https://docs.vllm.ai/en/latest/models/supported_models.html) and [lmdeploy](https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html). +- Using `infer_backend=pt` supports inference for all models supported by SWIFT, while `infer_backend=vllm/lmdeploy` supports only a subset of models. Please refer to the documentation for [vllm](https://docs.vllm.ai/en/latest/models/supported_models.html), [sglang](https://docs.sglang.ai/supported_models/generative_models.html) and [lmdeploy](https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html). - If you encounter OOM when using `--infer_backend vllm`, you can lower `--max_model_len`, `--max_num_seqs`, choose an appropriate `--gpu_memory_utilization`, or set `--enforce_eager true`. Alternatively, you can address this by using tensor parallelism with `--tensor_parallel_size`. - When inferring multimodal models using `--infer_backend vllm`, you need to input multiple images. You can set `--limit_mm_per_prompt` to resolve this, for example: `--limit_mm_per_prompt '{"image": 10, "video": 5}'`. - If you encounter OOM issues while inferring qwen2-vl/qwen2.5-vl, you can address this by setting `MAX_PIXELS`, `VIDEO_MAX_PIXELS`, and `FPS_MAX_FRAMES`. For more information, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh). @@ -178,7 +179,7 @@ print(f'response2: {resp_list[2].choices[0].message.content}') We also provide more demos for Python-based inference: -- For streaming inference using `VllmEngine` and `LmdeployEngine` for inference acceleration, you can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py). +- For streaming inference using `VllmEngine`, `SglangEngine` and `LmdeployEngine` for inference acceleration, you can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py). - Multimodal Inference: In addition to the aforementioned multimodal input formats, Swift is compatible with OpenAI's multimodal input format; refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_mllm.py). - Grounding Tasks: For performing grounding tasks with multimodal models, you can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py). - Multiple LoRA Inference: Refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py). 
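The streaming demo referenced in the first bullet above boils down to iterating over the per-request generator. A minimal hedged sketch using the new `SglangEngine` (model ID and prompt are placeholders; the same pattern works for `VllmEngine` and `LmdeployEngine`):

```python
from swift.llm import InferRequest, RequestConfig, SglangEngine

engine = SglangEngine('Qwen/Qwen2.5-1.5B-Instruct')
request_config = RequestConfig(stream=True, max_tokens=256)
gen = engine.infer([InferRequest(messages=[{'role': 'user', 'content': 'Tell me a story.'}])],
                   request_config)[0]
for chunk in gen:  # each chunk is a ChatCompletionStreamResponse
    if chunk is None:
        continue
    print(chunk.choices[0].delta.content, end='', flush=True)
print()
```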
diff --git a/docs/source_en/Instruction/Pre-training-and-Fine-tuning.md b/docs/source_en/Instruction/Pre-training-and-Fine-tuning.md index bdf6d647f1..1563d573ba 100644 --- a/docs/source_en/Instruction/Pre-training-and-Fine-tuning.md +++ b/docs/source_en/Instruction/Pre-training-and-Fine-tuning.md @@ -78,7 +78,7 @@ Additionally, we offer a series of scripts to help you understand the training c - When fine-tuning a base model to a chat model using LoRA technology with `swift sft`, you may sometimes need to manually set the template. Add the `--template default` parameter to avoid issues where the base model may fail to stop correctly due to encountering special characters in the dialogue template that it has not seen before. For more details, see [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/base_to_chat). - If you need to train in an **offline** environment, please set `--model ` and `--check_model false`. If the corresponding model requires `git clone` from GitHub repositories, such as `deepseek-ai/Janus-Pro-7B`, please manually download the repository and set `--local_repo_path `. For specific parameter meanings, refer to the [command line parameter documentation](./Command-line-parameters.md). -- Merging LoRA for models trained with QLoRA is not possible, so it is not recommended to use QLoRA for fine-tuning, as it cannot utilize vLLM/LMDeploy for inference acceleration during inference and deployment. It is recommended to use LoRA or full parameter fine-tuning, merge them into complete weights, and then use GPTQ/AWQ/BNB for [quantization](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize). +- Merging LoRA for models trained with QLoRA is not possible, so it is not recommended to use QLoRA for fine-tuning, as it cannot utilize vLLM/Sglang/LMDeploy for inference acceleration during inference and deployment. It is recommended to use LoRA or full parameter fine-tuning, merge them into complete weights, and then use GPTQ/AWQ/BNB for [quantization](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize). - If you are using an NPU for training, simply change `CUDA_VISIBLE_DEVICES` in the shell to `ASCEND_RT_VISIBLE_DEVICES`. - By default, SWIFT sets `--gradient_checkpointing true` during training to save memory, which may slightly slow down the training speed. - If you are using DDP for training and encounter the error: `RuntimeError: Expected to mark a variable ready only once.`, please additionally set the parameter `--gradient_checkpointing_kwargs '{"use_reentrant": false}'` or use DeepSpeed for training. @@ -134,7 +134,7 @@ swift infer \ - The adapters folder contains the trained parameter file `args.json`, so there is no need to specify `--model` or `--system` explicitly; Swift will automatically read these parameters. If you want to disable this behavior, you can set `--load_args false`. - If you are using full parameter training, please use `--model` instead of `--adapters` to specify the training checkpoint directory. For more information, refer to the [Inference and Deployment documentation](./Inference-and-deployment.md#Inference). - You can use `swift app` instead of `swift infer` for interactive inference. -- You can choose to merge LoRA (by additionally specifying `--merge_lora true`), and then specify `--infer_backend vllm/lmdeploy` for inference acceleration. 
+- You can choose to merge LoRA (by additionally specifying `--merge_lora true`), and then specify `--infer_backend vllm/sglang/lmdeploy` for inference acceleration. For batch inference on the validation set of the dataset: @@ -149,7 +149,7 @@ swift infer \ --max_batch_size 1 ``` -- You can set `--max_batch_size 8` to enable batch processing with `--infer_backend pt`. If you use `infer_backend vllm/lmdeploy`, it will automatically handle batching without needing to specify. +- You can set `--max_batch_size 8` to enable batch processing with `--infer_backend pt`. If you use `infer_backend vllm/sglang/lmdeploy`, it will automatically handle batching without needing to specify. - `--load_data_args true` will additionally read the data parameters from the training storage parameter file `args.json`. If you want to perform inference on an additional test set instead of using the training validation set, use `--val_dataset ` for inference: @@ -253,7 +253,7 @@ print(f'args.default_system: {args.system}') ``` - To perform inference on a checkpoint trained with full parameters, set `model` to `checkpoint_dir` and `lora_checkpoint` to `None`. For more information, refer to the [Inference and Deployment documentation](./Inference-and-deployment.md#Inference). -- For streaming inference and acceleration using `VllmEngine` and `LmdeployEngine`, you can refer to the inference examples for [large models](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py) and [multi-modal large models](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_mllm.py). +- For streaming inference and acceleration using `VllmEngine`, `SglangEngine` and `LmdeployEngine`, you can refer to the inference examples for [large models](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py) and [multi-modal large models](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_mllm.py). - For inference on fine-tuned models using the Hugging Face transformers/PEFT ecosystem, you can see [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_hf.py). - If you have trained multiple LoRAs and need to switch among them, refer to the [inference](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py) and [deployment](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/lora) examples. - For grounding tasks in multi-modal models, you can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py). 
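Tying the notes above together: once LoRA weights have been merged (`--merge_lora true`), the merged checkpoint directory can be handed to an accelerated engine directly. A minimal sketch; the checkpoint path is a hypothetical placeholder for your own output directory:

```python
from swift.llm import InferRequest, RequestConfig, SglangEngine

merged_ckpt = 'output/vx-xxx/checkpoint-xxx-merged'  # hypothetical local path
engine = SglangEngine(merged_ckpt)                   # VllmEngine/LmdeployEngine also work
resp = engine.infer([InferRequest(messages=[{'role': 'user', 'content': 'hello'}])],
                    RequestConfig(max_tokens=128))
print(resp[0].choices[0].message.content)
```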
diff --git a/examples/app/llm/sglang.sh b/examples/app/llm/sglang.sh new file mode 100644 index 0000000000..0cdef6262a --- /dev/null +++ b/examples/app/llm/sglang.sh @@ -0,0 +1,7 @@ +# test_env: pip install "sglang[all]==0.4.6.*" -U +CUDA_VISIBLE_DEVICES=0 swift app \ + --model Qwen/Qwen2.5-7B-Instruct \ + --stream true \ + --infer_backend sglang \ + --max_new_tokens 2048 \ + --lang zh diff --git a/examples/app/llm.sh b/examples/app/llm/vllm.sh similarity index 100% rename from examples/app/llm.sh rename to examples/app/llm/vllm.sh diff --git a/examples/eval/llm/sglang.sh b/examples/eval/llm/sglang.sh new file mode 100644 index 0000000000..66355f0e00 --- /dev/null +++ b/examples/eval/llm/sglang.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0 \ +swift eval \ + --model Qwen/Qwen2.5-1.5B-Instruct \ + --eval_backend OpenCompass \ + --infer_backend sglang \ + --eval_limit 100 \ + --eval_dataset gsm8k diff --git a/examples/eval/llm/eval.sh b/examples/eval/llm/vllm.sh similarity index 100% rename from examples/eval/llm/eval.sh rename to examples/eval/llm/vllm.sh diff --git a/examples/infer/demo.py b/examples/infer/demo.py index 08b837b3db..664563c967 100644 --- a/examples/infer/demo.py +++ b/examples/infer/demo.py @@ -41,6 +41,9 @@ def infer_stream(engine: 'InferEngine', infer_request: 'InferRequest'): elif infer_backend == 'vllm': from swift.llm import VllmEngine engine = VllmEngine(model, max_model_len=8192) + elif infer_backend == 'sglang': + from swift.llm import SglangEngine + engine = SglangEngine(model) elif infer_backend == 'lmdeploy': from swift.llm import LmdeployEngine engine = LmdeployEngine(model) diff --git a/examples/infer/sglang/demo.sh b/examples/infer/sglang/demo.sh new file mode 100644 index 0000000000..81362ba69e --- /dev/null +++ b/examples/infer/sglang/demo.sh @@ -0,0 +1,7 @@ +# test_env: pip install "sglang[all]==0.4.6.*" -U +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --model Qwen/Qwen2.5-1.5B-Instruct \ + --infer_backend sglang \ + --stream true \ + --max_new_tokens 2048 diff --git a/examples/infer/sglang/tp.sh b/examples/infer/sglang/tp.sh new file mode 100644 index 0000000000..5a139783e3 --- /dev/null +++ b/examples/infer/sglang/tp.sh @@ -0,0 +1,9 @@ +CUDA_VISIBLE_DEVICES=0,1 \ +swift infer \ + --model Qwen/Qwen3-8B \ + --infer_backend sglang \ + --val_dataset AI-ModelScope/alpaca-gpt4-data-zh#2000 \ + --max_new_tokens 2048 \ + --sglang_context_length 8192 \ + --sglang_tp_size 2 \ + --write_batch_size 1000 diff --git a/requirements/install_all.sh b/requirements/install_all.sh index 465a7da2e9..04f6dab9ac 100644 --- a/requirements/install_all.sh +++ b/requirements/install_all.sh @@ -1,5 +1,6 @@ # please use python=3.10, cuda12.* # sh requirements/install_all.sh +pip install "sglang[all]<0.4.7" -U pip install "vllm>=0.5.1,<0.9" -U pip install "lmdeploy>=0.5" -U --no-deps pip install autoawq -U --no-deps diff --git a/swift/llm/__init__.py b/swift/llm/__init__.py index fdfe412947..ff84e3549f 100644 --- a/swift/llm/__init__.py +++ b/swift/llm/__init__.py @@ -6,7 +6,8 @@ if TYPE_CHECKING: # Recommend using `xxx_main` from .infer import (VllmEngine, RequestConfig, LmdeployEngine, PtEngine, InferEngine, infer_main, deploy_main, - InferClient, run_deploy, AdapterRequest, prepare_model_template, BaseInferEngine, rollout_main) + InferClient, run_deploy, AdapterRequest, prepare_model_template, BaseInferEngine, SglangEngine, + rollout_main) from .export import (export_main, merge_lora, quantize_model, export_to_ollama) from .eval import eval_main from .app import app_main @@ -37,7 +38,8 @@ 
         'rlhf': ['rlhf_main'],
         'infer': [
             'deploy_main', 'VllmEngine', 'RequestConfig', 'LmdeployEngine', 'PtEngine', 'infer_main', 'InferClient',
-            'run_deploy', 'InferEngine', 'AdapterRequest', 'prepare_model_template', 'BaseInferEngine', 'rollout_main'
+            'run_deploy', 'InferEngine', 'AdapterRequest', 'prepare_model_template', 'BaseInferEngine', 'rollout_main',
+            'SglangEngine'
         ],
         'export': ['export_main', 'merge_lora', 'quantize_model', 'export_to_ollama'],
         'app': ['app_main'],
diff --git a/swift/llm/argument/infer_args.py b/swift/llm/argument/infer_args.py
index b475f51fbd..46d98e02e9 100644
--- a/swift/llm/argument/infer_args.py
+++ b/swift/llm/argument/infer_args.py
@@ -108,7 +108,38 @@ def get_vllm_engine_kwargs(self):
 
 
 @dataclass
-class InferArguments(MergeArguments, VllmArguments, LmdeployArguments, BaseArguments):
+class SglangArguments:
+    sglang_tp_size: int = 1
+    sglang_pp_size: int = 1
+    sglang_dp_size: int = 1
+    sglang_ep_size: int = 1
+    sglang_mem_fraction_static: Optional[float] = None
+    sglang_context_length: Optional[int] = None
+    sglang_disable_cuda_graph: bool = False
+    sglang_quantization: Optional[str] = None
+    sglang_kv_cache_dtype: str = 'auto'
+    sglang_enable_dp_attention: bool = False
+    sglang_disable_custom_all_reduce: bool = True
+
+    def get_sglang_engine_kwargs(self):
+        kwargs = {
+            'tp_size': self.sglang_tp_size,
+            'pp_size': self.sglang_pp_size,
+            'dp_size': self.sglang_dp_size,
+            'ep_size': self.sglang_ep_size,
+            'mem_fraction_static': self.sglang_mem_fraction_static,
+            'context_length': self.sglang_context_length,
+            'disable_cuda_graph': self.sglang_disable_cuda_graph,
+            'quantization': self.sglang_quantization,
+            'kv_cache_dtype': self.sglang_kv_cache_dtype,
+            'enable_dp_attention': self.sglang_enable_dp_attention,
+            'disable_custom_all_reduce': self.sglang_disable_custom_all_reduce,
+        }
+        return kwargs
+
+
+@dataclass
+class InferArguments(MergeArguments, LmdeployArguments, SglangArguments, VllmArguments, BaseArguments):
     """
     InferArguments is a dataclass that extends BaseArguments, MergeArguments, VllmArguments, and LmdeployArguments.
     It is used to define the arguments required for model inference.
@@ -121,7 +152,7 @@ class InferArguments(MergeArguments, VllmArguments, LmdeployArgum
         max_batch_size (int): Maximum batch size for the pt engine. Default is 1.
         val_dataset_sample (Optional[int]): Sample size for validation dataset. Default is None.
     """
-    infer_backend: Literal['vllm', 'pt', 'lmdeploy'] = 'pt'
+    infer_backend: Literal['vllm', 'pt', 'sglang', 'lmdeploy'] = 'pt'
 
     result_path: Optional[str] = None
     write_batch_size: int = 1000
diff --git a/swift/llm/infer/__init__.py b/swift/llm/infer/__init__.py
index 115a0d20e7..dceb30bac9 100644
--- a/swift/llm/infer/__init__.py
+++ b/swift/llm/infer/__init__.py
@@ -9,7 +9,7 @@
     from .deploy import deploy_main, SwiftDeploy, run_deploy
     from .protocol import RequestConfig, Function
     from .utils import prepare_model_template
-    from .infer_engine import (InferEngine, VllmEngine, LmdeployEngine, PtEngine, InferClient,
+    from .infer_engine import (InferEngine, VllmEngine, LmdeployEngine, SglangEngine, PtEngine, InferClient,
                                prepare_generation_config, AdapterRequest, BaseInferEngine)
 else:
     _import_structure = {
@@ -19,8 +19,8 @@
         'protocol': ['RequestConfig', 'Function'],
         'utils': ['prepare_model_template'],
         'infer_engine': [
-            'InferEngine', 'VllmEngine', 'LmdeployEngine', 'PtEngine', 'InferClient', 'prepare_generation_config',
-            'AdapterRequest', 'BaseInferEngine'
+            'InferEngine', 'VllmEngine', 'LmdeployEngine', 'SglangEngine', 'PtEngine', 'InferClient',
+            'prepare_generation_config', 'AdapterRequest', 'BaseInferEngine'
         ],
     }
 
diff --git a/swift/llm/infer/infer.py b/swift/llm/infer/infer.py
index 9819276889..35667bf1e7 100644
--- a/swift/llm/infer/infer.py
+++ b/swift/llm/infer/infer.py
@@ -72,6 +72,10 @@ def get_infer_engine(args: InferArguments, template=None, **kwargs):
             seed += get_dist_setting()[0] // args.tensor_parallel_size
             kwargs['distributed_executor_backend'] = 'external_launcher'
             kwargs['seed'] = seed
+    elif infer_backend == 'sglang':
+        from .infer_engine import SglangEngine
+        infer_engine_cls = SglangEngine
+        kwargs.update(args.get_sglang_engine_kwargs())
     else:
         from .infer_engine import LmdeployEngine
         infer_engine_cls = LmdeployEngine
diff --git a/swift/llm/infer/infer_engine/__init__.py b/swift/llm/infer/infer_engine/__init__.py
index 49a54005ea..99415ac405 100644
--- a/swift/llm/infer/infer_engine/__init__.py
+++ b/swift/llm/infer/infer_engine/__init__.py
@@ -7,6 +7,7 @@
     from .vllm_engine import VllmEngine
     from .grpo_vllm_engine import GRPOVllmEngine
     from .lmdeploy_engine import LmdeployEngine
+    from .sglang_engine import SglangEngine
     from .pt_engine import PtEngine
     from .infer_client import InferClient
     from .infer_engine import InferEngine
@@ -17,6 +18,7 @@
         'vllm_engine': ['VllmEngine'],
         'grpo_vllm_engine': ['GRPOVllmEngine'],
         'lmdeploy_engine': ['LmdeployEngine'],
+        'sglang_engine': ['SglangEngine'],
         'pt_engine': ['PtEngine'],
         'infer_client': ['InferClient'],
         'infer_engine': ['InferEngine'],
diff --git a/swift/llm/infer/infer_engine/infer_engine.py b/swift/llm/infer/infer_engine/infer_engine.py
index 7dbf2b6a47..a71062d9a2 100644
--- a/swift/llm/infer/infer_engine/infer_engine.py
+++ b/swift/llm/infer/infer_engine/infer_engine.py
@@ -72,7 +72,8 @@ async def _run_async_iter():
                 else:
                     queue.put(None)
 
-        thread = Thread(target=lambda: asyncio.run(_run_async_iter()))
+        loop = asyncio.get_event_loop()
+        thread = Thread(target=lambda: loop.run_until_complete(_run_async_iter()))
         thread.start()
         pre_output = None
         while True:
@@ -257,7 +258,12 @@ def func(target, queue, args, kwargs):
 
     @staticmethod
     def safe_asyncio_run(coro):
-        return InferEngine.thread_run(asyncio.run, args=(coro, ))
+        loop = asyncio.get_event_loop()
+
+        def asyncio_run(core):
+            return loop.run_until_complete(core)
+
+        return InferEngine.thread_run(asyncio_run, args=(coro, ))
 
     @staticmethod
     def _batch_encode(infer_requests: List[InferRequest], template: Template, strict: bool):
diff --git a/swift/llm/infer/infer_engine/pt_engine.py b/swift/llm/infer/infer_engine/pt_engine.py
index 9a0e6a15e5..a749d8091c 100644
--- a/swift/llm/infer/infer_engine/pt_engine.py
+++ b/swift/llm/infer/infer_engine/pt_engine.py
@@ -18,14 +18,11 @@
 
 from swift.llm import InferRequest, Template, TemplateMeta, get_model_tokenizer, safe_snapshot_download, to_device
 from swift.plugin import Metric
 from swift.tuners import Swift
-from swift.utils import get_logger
 from ..protocol import (ChatCompletionResponse, ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice,
                         ChatCompletionStreamResponse, ChatMessage, DeltaMessage, RequestConfig, random_uuid)
 from .infer_engine import InferEngine
 from .utils import AdapterRequest, InferStreamer, LogitsStreamer, TokensIteratorStreamer, prepare_generation_config
 
-logger = get_logger()
-
 
 class _GenerationConfig(GenerationConfig):
diff --git a/swift/llm/infer/infer_engine/sglang_engine.py b/swift/llm/infer/infer_engine/sglang_engine.py
new file mode 100644
index 0000000000..ddf3be4b1c
--- /dev/null
+++ b/swift/llm/infer/infer_engine/sglang_engine.py
@@ -0,0 +1,207 @@
+import asyncio
+import inspect
+import os
+from copy import deepcopy
+from typing import Any, AsyncIterator, Dict, Iterator, List, Optional, Union
+
+import sglang as sgl
+import torch
+from sglang.srt.sampling.sampling_params import SamplingParams
+from sglang.srt.server_args import ServerArgs
+from transformers import GenerationConfig
+
+from swift.llm import InferRequest, Template, TemplateMeta, get_model_tokenizer
+from swift.plugin import Metric
+from ..protocol import (ChatCompletionResponse, ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice,
+                        ChatCompletionStreamResponse, ChatMessage, DeltaMessage, RequestConfig, random_uuid)
+from .infer_engine import InferEngine
+
+
+class SglangEngine(InferEngine):
+
+    def __init__(
+            self,
+            model_id_or_path: str,
+            torch_dtype: Optional[torch.dtype] = None,
+            *,
+            model_type: Optional[str] = None,
+            use_hf: Optional[bool] = None,
+            hub_token: Optional[str] = None,
+            revision: Optional[str] = None,
+            # engine kwargs
+            tp_size: int = 1,
+            pp_size: int = 1,
+            dp_size: int = 1,
+            ep_size: int = 1,
+            mem_fraction_static: Optional[float] = None,
+            context_length: Optional[int] = None,
+            disable_cuda_graph: bool = False,
+            quantization: Optional[str] = None,
+            kv_cache_dtype: str = 'auto',
+            enable_dp_attention: bool = False,
+            disable_custom_all_reduce: bool = True,
+            log_level='error',
+            engine_kwargs: Optional[Dict[str, Any]] = None,
+            template: Optional[Template] = None,
+    ):
+        if engine_kwargs is None:
+            engine_kwargs = {}
+        self.processor = get_model_tokenizer(
+            model_id_or_path,
+            torch_dtype,
+            load_model=False,
+            download_model=True,
+            model_type=model_type,
+            use_hf=use_hf,
+            hub_token=hub_token,
+            revision=revision)[1]
+        self._post_init(template)
+        if self.max_model_len is not None:
+            self.max_model_len -= 1
+        parameters = inspect.signature(ServerArgs).parameters
+        if 'pp_size' in parameters:
+            engine_kwargs['pp_size'] = pp_size
+        self.server_args = ServerArgs(
+            model_path=self.model_dir,
+            dtype=self.model_info.torch_dtype,
+            tp_size=tp_size,
+            dp_size=dp_size,
+            ep_size=ep_size,
+            mem_fraction_static=mem_fraction_static,
+            context_length=context_length,
+            disable_cuda_graph=disable_cuda_graph,
+            quantization=quantization,
+            kv_cache_dtype=kv_cache_dtype,
+            enable_dp_attention=enable_dp_attention,
+            disable_custom_all_reduce=disable_custom_all_reduce,
+            log_level=log_level,
+            **engine_kwargs,
+        )
+        self.engine = sgl.Engine(server_args=self.server_args)
+        self._load_generation_config()
+
+    def _load_generation_config(self) -> None:
+        generation_config_path = os.path.join(self.model_dir, 'generation_config.json')
+        if os.path.isfile(generation_config_path):
+            generation_config = GenerationConfig.from_pretrained(self.model_dir)
+            kwargs = generation_config.to_dict()
+            top_k = kwargs.get('top_k')
+            if top_k == 0:
+                kwargs['top_k'] = -1
+
+            parameters = inspect.signature(SamplingParams).parameters
+            for k, v in kwargs.copy().items():
+                if k not in parameters or v is None:
+                    kwargs.pop(k)
+            self.generation_config = kwargs
+        else:
+            self.generation_config = {}
+
+    def _prepare_generation_config(self, request_config: RequestConfig) -> Dict[str, Any]:
+        kwargs = {'max_new_tokens': request_config.max_tokens}
+        for key in ['temperature', 'top_k', 'top_p', 'repetition_penalty']:
+            new_value = getattr(request_config, key)
+            if new_value is None:
+                kwargs[key] = self.generation_config.get(key)
+            else:
+                kwargs[key] = new_value
+        for key in ['n', 'frequency_penalty', 'presence_penalty']:
+            kwargs[key] = getattr(request_config, key)
+
+        return kwargs
+
+    def _add_stop_words(self, generation_config: Dict[str, Any], request_config: RequestConfig,
+                        template_meta: TemplateMeta) -> None:
+        stop_words = (request_config.stop or []) + (self.generation_config.get('stop') or []) + template_meta.stop_words
+        generation_config['stop'] = self._get_stop_words(stop_words)
+
+    def _create_chat_completion_response(self, output, template):
+        assert output is not None
+        meta_info = output['meta_info']
+        usage_info = self._get_usage_info(meta_info['prompt_tokens'], meta_info['completion_tokens'])
+        response = output['text']
+        toolcall = self._get_toolcall(response, template)
+        choice = ChatCompletionResponseChoice(
+            index=0,
+            message=ChatMessage(role='assistant', content=response, tool_calls=toolcall),
+            finish_reason=meta_info['finish_reason']['type'],
+            logprobs=None)
+        return ChatCompletionResponse(model=self.model_name, choices=[choice], usage=usage_info, id=random_uuid())
+
+    def infer(
+            self,
+            infer_requests: List[InferRequest],
+            request_config: Optional[RequestConfig] = None,
+            metrics: Optional[List[Metric]] = None,
+            *,
+            template: Optional[Template] = None,
+            use_tqdm: Optional[bool] = None,
+    ) -> List[Union[ChatCompletionResponse, Iterator[ChatCompletionStreamResponse]]]:
+        return super().infer(infer_requests, request_config, metrics, template=template, use_tqdm=use_tqdm)
+
+    async def infer_async(self,
+                          infer_request: InferRequest,
+                          request_config: Optional[RequestConfig] = None,
+                          *,
+                          template: Optional[Template] = None,
+                          pre_infer_hook=None,
+                          **kwargs) -> Union[ChatCompletionResponse, AsyncIterator[ChatCompletionStreamResponse]]:
+        request_config = deepcopy(request_config or RequestConfig())
+        if template is None:
+            template = self.default_template
+
+        template.set_mode('pt')
+        loop = asyncio.get_running_loop()
+        with torch.inference_mode():
+            inputs = await loop.run_in_executor(None, template.encode, infer_request)
+
+        self.set_default_max_tokens(request_config, inputs)
+        generation_config = self._prepare_generation_config(request_config)
+        self._add_stop_words(generation_config, request_config, template.template_meta)
+        kwargs.update({'template': template, 'inputs': inputs, 'generation_config': generation_config})
+        if pre_infer_hook:
+            kwargs = pre_infer_hook(kwargs)
+        if request_config.stream:
+            return self._infer_stream_async(**kwargs)
+        else:
+            return await self._infer_full_async(**kwargs)
+
+    async def _infer_full_async(self, template: Template, inputs: Dict[str, Any],
+                                generation_config: Dict[str, Any]) -> ChatCompletionResponse:
+        output = await self.engine.async_generate(**inputs, sampling_params=generation_config)
+        return self._create_chat_completion_response(output, template)
+
+    async def _infer_stream_async(self, template: Template, inputs: Dict[str, Any],
+                                  generation_config: Dict[str, Any]) -> AsyncIterator[ChatCompletionStreamResponse]:
+        result_generator = await self.engine.async_generate(**inputs, sampling_params=generation_config, stream=True)
+        idx = [0]
+        async for output in result_generator:
+            res = self._create_chat_completion_stream_response(output, template, generation_config, idx)
+            if res is None:
+                continue
+            yield res
+
+    def _create_chat_completion_stream_response(self, output, template, generation_config,
+                                                idx) -> Optional[ChatCompletionStreamResponse]:
+        assert output is not None
+        response = output['text']
+        meta_info = output['meta_info']
+        finish_reason = meta_info['finish_reason']
+        delta_text = response[idx[0]:]
+        idx[0] = len(response)
+        if not delta_text:
+            return
+        if finish_reason:
+            finish_reason = finish_reason['type']
+            toolcall = self._get_toolcall(response, template)
+        else:
+            toolcall = None
+        meta_info = output['meta_info']
+        usage_info = self._get_usage_info(meta_info['prompt_tokens'], meta_info['completion_tokens'])
+        # TODO: logprobs
+        choice = ChatCompletionResponseStreamChoice(
+            index=0,
+            delta=DeltaMessage(role='assistant', content=delta_text, tool_calls=toolcall),
+            finish_reason=finish_reason,
+            logprobs=None)
+        return ChatCompletionStreamResponse(model=self.model_name, choices=[choice], usage=usage_info)
diff --git a/swift/llm/infer/infer_engine/vllm_engine.py b/swift/llm/infer/infer_engine/vllm_engine.py
index 8cbaaac9e3..68d716c1f3 100644
--- a/swift/llm/infer/infer_engine/vllm_engine.py
+++ b/swift/llm/infer/infer_engine/vllm_engine.py
@@ -14,7 +14,7 @@
 
 from swift.llm import InferRequest, Template, TemplateMeta, get_model_tokenizer
 from swift.plugin import Metric
-from swift.utils import get_dist_setting, get_logger, get_seed, is_dist
+from swift.utils import get_dist_setting, is_dist
 from ..protocol import (ChatCompletionResponse, ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice,
                         ChatCompletionStreamResponse, ChatMessage, DeltaMessage, RequestConfig, random_uuid)
 from .infer_engine import InferEngine
@@ -30,7 +30,6 @@
 except Exception:
     raise
 
-logger = get_logger()
 
 
 dtype_mapping = {torch.float16: 'float16', torch.bfloat16: 'bfloat16', torch.float32: 'float32'}
diff --git a/tests/infer/test_sglang.py b/tests/infer/test_sglang.py
new file mode 100644
index 0000000000..8b246cfc39
--- /dev/null
+++ b/tests/infer/test_sglang.py
@@ -0,0 +1,50 @@
+import os
+
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+
+def test_engine():
+    from swift.llm import SglangEngine, load_dataset, RequestConfig
+    dataset = load_dataset('AI-ModelScope/alpaca-gpt4-data-zh#20')[0]
+    engine = SglangEngine('Qwen/Qwen2.5-0.5B-Instruct')
+    request_config = RequestConfig(max_tokens=1024)
+    resp_list = engine.infer(list(dataset), request_config=request_config)
+    for resp in resp_list[:5]:
+        print(resp)
+    resp_list = engine.infer(list(dataset), request_config=request_config)
+    for resp in resp_list[:5]:
+        print(resp)
+
+
+def test_engine_stream():
+    from swift.llm import SglangEngine, load_dataset, RequestConfig
+    dataset = load_dataset('AI-ModelScope/alpaca-gpt4-data-zh#1')[0]
+    engine = SglangEngine('Qwen/Qwen2.5-0.5B-Instruct')
+    request_config = RequestConfig(max_tokens=1024, stream=True)
+    gen_list = engine.infer(list(dataset), request_config=request_config)
+    for resp in gen_list[0]:
+        print(resp.choices[0].delta.content, flush=True, end='')
+
+
+def test_infer():
+    from swift.llm import infer_main, InferArguments
+    infer_main(
+        InferArguments(model='Qwen/Qwen2.5-0.5B-Instruct', stream=True, infer_backend='sglang', max_new_tokens=2048))
+
+
+def test_eval():
+    from swift.llm import EvalArguments, eval_main
+    eval_main(
+        EvalArguments(
+            model='Qwen/Qwen2-7B-Instruct',
+            eval_dataset='arc_c',
+            infer_backend='sglang',
+            eval_backend='OpenCompass',
+        ))
+
+
+if __name__ == '__main__':
+    test_engine()
+    # test_engine_stream()
+    # test_infer()
+    # test_eval()
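Below is a minimal usage sketch, not part of the patch above, showing one way to exercise the new backend once this diff is applied. It assumes `sglang` is installed and a single local GPU is available; the model id, sampling values, and the OpenAI-style `messages` payload are illustrative only, while `tp_size` and `mem_fraction_static` are simply constructor arguments introduced by `SglangEngine` above.

```python
# Hedged sketch: drive the new SGLang backend through the Python API added in this diff.
# The model id and sampling values are examples; they mirror the test file above.
from swift.llm import InferRequest, RequestConfig, SglangEngine

engine = SglangEngine(
    'Qwen/Qwen2.5-0.5B-Instruct',  # any model resolvable by get_model_tokenizer
    tp_size=1,                     # forwarded to SGLang's ServerArgs by the constructor
    mem_fraction_static=0.8,       # optional fraction of GPU memory for weights + KV cache
)
request_config = RequestConfig(max_tokens=512, temperature=0.0)

# InferRequest carries OpenAI-style chat messages (assumed; matches ms-swift's usual request format).
requests = [InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}])]
resp_list = engine.infer(requests, request_config=request_config)
print(resp_list[0].choices[0].message.content)
```

Streaming works as in `test_engine_stream` above: pass `RequestConfig(stream=True)` and iterate the returned generator's `choices[0].delta.content`. On the command line, the equivalent is setting `--infer_backend sglang` for the `swift infer` / `swift deploy` / `swift eval` entry points.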