support qwen2/2.5-vl in turbomind #3744
Conversation
Could you kindly share with us how much degradation you observed and how you measured the accuracy? That information would definitely be valuable for further improving our project. Thank you.
For my project I use the OCRFlux model, which has an architecture identical to Qwen2.5-VL. The first script below runs it with the TurboMind backend.
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
import time
from PIL import Image
import subprocess
import io
from ocrflux.lm_ocr import _build_page_to_markdown_query
gen_config = GenerationConfig(temperature=0.0, max_new_tokens=16384)
backend_config = TurbomindEngineConfig(session_len=16384)
question = (
    f"Below is the image of one page of a document. "
    f"Just return the plain text representation of this document as if you were reading it naturally.\n"
    f"ALL tables should be presented in HTML format.\n"
    f"If there are images or figures in the page, present them as \"<Image>(left,top),(right,bottom)</Image>\", (left,top,right,bottom) are the coordinates of the top-left and bottom-right corners of the image or figure.\n"
    f"Present all titles and headings as H1 headings.\n"
    f"Do not hallucinate.\n"
)
images = [_build_page_to_markdown_query("photo_2025-07-08_16-44-53.jpg", p) for p in range(1, 1 + 1)]
llm = pipeline('ChatDOC/OCRFlux-3B', backend_config=backend_config, chat_template_config=ChatTemplateConfig('ocrflux-qwen2_5_vl'))
inputs = [(question, image) for image in images]
start_time = time.time()
responses = llm(inputs, gen_config=gen_config)
print(responses)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
# import lmdeploy
# print(lmdeploy.__file__)
# The same pipeline again, this time with the PyTorch backend for comparison.
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig, PytorchEngineConfig
import time
from PIL import Image
import subprocess
import io
from ocrflux.lm_ocr import _build_page_to_markdown_query
gen_config = GenerationConfig(temperature=0.0, max_new_tokens=16384)
backend_config = PytorchEngineConfig(session_len=16384)
question = (
    f"Below is the image of one page of a document. "
    f"Just return the plain text representation of this document as if you were reading it naturally.\n"
    f"ALL tables should be presented in HTML format.\n"
    f"If there are images or figures in the page, present them as \"<Image>(left,top),(right,bottom)</Image>\", (left,top,right,bottom) are the coordinates of the top-left and bottom-right corners of the image or figure.\n"
    f"Present all titles and headings as H1 headings.\n"
    f"Do not hallucinate.\n"
)
images = [_build_page_to_markdown_query("photo_2025-07-08_16-44-53.jpg", p) for p in range(1, 1 + 1)]
llm = pipeline('ChatDOC/OCRFlux-3B', backend_config=backend_config, chat_template_config=ChatTemplateConfig('ocrflux-qwen2_5_vl'))
inputs = [(question, image) for image in images]
start_time = time.time()
responses = llm(inputs, gen_config=gen_config)
print(responses)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
# import lmdeploy
# print(lmdeploy.__file__)
Results for the TurboMind run and for the PyTorch run are attached.
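To quantify the difference between the two runs, one simple option is a plain-text diff of the outputs. The snippet below is a minimal sketch that assumes each script saved its response text to turbomind.txt and pytorch.txt (hypothetical file names, not part of the scripts above).

# Minimal sketch: compare the text produced by the two backends.
# Assumes each run saved response.text to the files below (hypothetical names).
import difflib

with open('turbomind.txt') as f1, open('pytorch.txt') as f2:
    turbomind_lines = f1.read().splitlines()
    pytorch_lines = f2.read().splitlines()

ratio = difflib.SequenceMatcher(
    None, '\n'.join(turbomind_lines), '\n'.join(pytorch_lines)).ratio()
print(f'similarity ratio: {ratio:.3f}')

for line in difflib.unified_diff(
        turbomind_lines, pytorch_lines,
        fromfile='turbomind', tofile='pytorch', lineterm=''):
    print(line)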
Since OCRFlux's config.json is a little different from Qwen2.5-VL's, I edited lines 182-183 of the file to the following, in order to get past the RuntimeError(f'Unsupported rope type: {scaling_type}') that was raised otherwise:

elif scaling_type == 'mrope':
    mrope_section = rope_scaling.get('mrope_section')
    rope_param.type = 'mrope'
    rope_param.mrope_section = mrope_section
else:
    pass
    # raise RuntimeError(f'Unsupported rope type: {scaling_type}')

I don't know if this is the reason for the reduced accuracy.
It appears that Qwen2.5-VL uses different notations for the rope field in config.json, e.g. Qwen2.5-VL-32B-Instruct vs. Qwen2.5-VL-7B-Instruct. I have updated the code to support these different notations. I can't run your test code because I couldn't find the …. And please note that the current mrope implementation in lmdeploy only supports the ….
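For illustration only, here is a minimal sketch of how the two rope_scaling notations could be normalized. The field names follow the snippets quoted later in this thread; the example values are assumptions, and this is not lmdeploy's actual reader:

# Sketch only: normalize the two rope_scaling notations seen in Qwen2.5-VL
# configs. Some checkpoints declare {'type': 'mrope', ...} while others declare
# {'rope_type': 'default', 'mrope_section': [...]} (assumed shapes based on the
# discussion above). Not lmdeploy's actual code.
def normalize_rope_scaling(rope_scaling: dict) -> dict:
    scaling_type = rope_scaling.get('type') or rope_scaling.get('rope_type')
    if rope_scaling.get('mrope_section') is not None:
        # mrope is layered on a base rope function, so treat any config that
        # carries mrope_section as mrope regardless of the declared type
        scaling_type = 'mrope'
    return {'type': scaling_type,
            'mrope_section': rope_scaling.get('mrope_section')}

# Example inputs (assumed values, for illustration only):
print(normalize_rope_scaling({'type': 'mrope', 'mrope_section': [16, 24, 24]}))
print(normalize_rope_scaling({'rope_type': 'default', 'mrope_section': [16, 24, 24]}))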
I will provide some additional information so that you can run the code I shared. The helper function is:

from PIL import Image
from ocrflux.image_utils import get_page_image

def _build_page_to_markdown_query(
    file_path: str,
    page_number: int,
    target_longest_image_dim: int = 1024,
    image_rotation: int = 0,
) -> Image.Image:
    assert image_rotation in [0, 90, 180, 270], "Invalid image rotation provided"
    # Render the requested page (or load the image file) as a PIL image
    image = get_page_image(
        file_path,
        page_number,
        target_longest_image_dim=target_longest_image_dim,
        image_rotation=image_rotation,
    )
    return image

The function ….

Lastly, the exact chat template registered with lmdeploy to make OCRFlux work is:

from lmdeploy.model import MODELS, BaseChatTemplate
@MODELS.register_module(name='ocrflux-qwen2_5_vl')
class OCRFluxChatTemplate(BaseChatTemplate):
    """Chat template simulating vLLM-style prompts for OCRFlux (a fine-tuned Qwen2.5-VL).

    Format:
        <|im_start|>system\n{meta_instruction}<|im_end|>\n
        <|im_start|>user\n{vision_tokens}{user_text}<|im_end|>\n
        <|im_start|>assistant\n

    - `vision_tokens` = n_images * ("<|vision_start|>" + IMAGE_TOKEN + "<|vision_end|>")
      (IMAGE_TOKEN is replaced with '<|image_pad|>' * n_grid when encoding for Qwen2.5-VL).
    - Supports multi-turn dialogue batching: each user message is appended in the same way;
      the assistant turn ends with `<|im_end|>` after each round.
    """

    def __init__(self,
                 system: str = '<|im_start|>system\n',
                 meta_instruction: str = 'You are a helpful assistant.',
                 user: str = '<|im_start|>user\n',
                 assistant: str = '<|im_start|>assistant\n',
                 eosys: str = '<|im_end|>\n',
                 eoh: str = '<|im_end|>\n',
                 eoa: str = '<|im_end|>',
                 separator: str = '\n',
                 stop_words=None):
        if stop_words is None:
            stop_words = ['<|im_end|>']
        super().__init__(system=system,
                         meta_instruction=meta_instruction,
                         eosys=eosys,
                         user=user,
                         eoh=eoh,
                         assistant=assistant,
                         eoa=eoa,
                         separator=separator,
                         stop_words=stop_words)

An important note: the configuration of [OCRFLUX](https://huggingface.co/ChatDOC/OCRFlux-3B/blob/main/config.json) and [Qwen2.5VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ/blob/main/config.json) is identical because they originate from the same architecture, similar to [Nanonets-OCR-s](https://huggingface.co/nanonets/Nanonets-OCR-s). They are all fine-tuned from Qwen2.5-VL-3B, so in theory they should work interchangeably without loss of accuracy. I'm not sure whether any additional information is needed; I've already provided the image file. My local machine is running: …
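For reference, the template above is intended to render a single-image, single-turn request roughly as follows. This is an illustrative sketch of the format described in the docstring, not output captured from lmdeploy, and IMAGE_TOKEN is an assumed placeholder name:

# Illustration only: the prompt string the template above should produce for
# one image and one user turn. IMAGE_TOKEN stands for the placeholder that
# lmdeploy expands to '<|image_pad|>' * n_grid when encoding for Qwen2.5-VL.
IMAGE_TOKEN = '<IMAGE_TOKEN>'  # assumed placeholder name for this sketch
user_text = 'Below is the image of one page of a document. ...'
prompt = (
    '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n'
    '<|im_start|>user\n'
    '<|vision_start|>' + IMAGE_TOKEN + '<|vision_end|>'
    + user_text + '<|im_end|>\n'
    '<|im_start|>assistant\n'
)
print(prompt)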
Are you using the official ChatDOC/OCRFlux-3B model or a model fine-tuned from it? I used your code with the official ChatDOC/OCRFlux-3B model, but got this response with both the PyTorch and TurboMind backends.
sha256sum
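One way to confirm both sides are loading exactly the same checkpoint is to compare checksums of the weight files. A minimal sketch, where the local path is an assumption:

# Sketch: print SHA-256 checksums of the local model shards so they can be
# compared against the official ChatDOC/OCRFlux-3B files. Path is assumed.
import hashlib
import pathlib

model_dir = pathlib.Path('ChatDOC/OCRFlux-3B')  # adjust to your local snapshot
for shard in sorted(model_dir.glob('*.safetensors')):
    digest = hashlib.sha256(shard.read_bytes()).hexdigest()
    print(shard.name, digest)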
In Llama.py:

if rope_scaling.get('mrope_section') is not None:
    # TODO: treat mrope as an option to the common rope functions
    scaling_type = 'mrope'

If you have any further issues, feel free to ask.
The change of … The result I obtained using the PyTorch backend is different from yours, and it can be considered incorrect, since the output does not stop until the length limit is reached.
Yes, I cloned and ran from your GitHub source via pip, and the result matches yours. What I mean is that I'm currently running with TurboMind; you can try testing with TurboMind to see whether the result matches mine. Please also comment out or delete the following code:

if rope_scaling.get('mrope_section') is not None:
    # TODO: treat mrope as an option to the common rope functions
    scaling_type = 'mrope'

I have reproduced your output and modified the code so that it reproduces the results I originally gave.
@kolmogorov-quyet But ChatDOC/OCRFlux-3B uses the mrope type; I can't get your result with …
You were able to reproduce the results of …. As for not being able to reproduce the results of …, hopefully you'll get the same results as before. Please make sure not to change any code when building from ….
params.rope_param.base += offset;
}
else if (rope_param_.type == RopeType::kMultimodal) {
    params.rope_param.multimodal.position_ids += offset * rope_param_.multimodal.session_len * 3;
Too many uses of the magic number "3".
lmdeploy/vl/model/qwen2.py
Outdated
# it seems mrope is built on top of the default/linear/dynamic rope functions and is not a rope type itself
# the current implementation binds mrope to the default rope function, which is the default
# behavior in the official model
self.vl_model.config.rope_scaling['type'] = 'mrope'
Please check whether this is still necessary with transformers >= 4.51.1.
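One quick way to check, as a sketch with an example model id, is to inspect what a recent transformers release reports for the rope scaling field:

# Sketch: inspect the rope_scaling field that transformers (>= 4.51) exposes
# for a Qwen2.5-VL checkpoint, to see whether forcing 'type' = 'mrope' is
# still required. The model id is an example, not taken from the diff above.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained('Qwen/Qwen2.5-VL-3B-Instruct')
print(getattr(cfg, 'rope_scaling', None))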
lmdeploy/vl/model/qwen2.py
Outdated
from accelerate import init_empty_weights
with init_empty_weights():
    config = self.hf_config
    config.quantization_config = {}  # disable vision part quantization
Does Qwen2.5-VL have quantization_config?
@@ -16,6 +16,7 @@ enum class RopeType
    kDynamic,
    kYarn,
    kLlama3,
    kMultimodal,
Why is this named 'multimodal' here but identified by 'mrope' everywhere else?
@@ -97,6 +105,9 @@ struct FastRoPE {
    template<typename T>
    __device__ void apply(Array<T, N>& x, float timestep)
    {
        if (param_.type == RopeType::kMultimodal) {
Need before/after performance comparison of related kernels on large data chunks, since the PR is adding dynamic branching to a device function that affects ALL models.
core::Copy(r->inputs.at("mrope_length").data<int>(), 1, state.mrope.length.data() + idx);
}
else {
    cudaMemsetAsync(state.mrope.length.data() + idx, 0, sizeof(int), stream_);
Check the cudaMemsetAsync call for errors (its return status is ignored here).
@@ -1522,6 +1552,14 @@ bool LlamaBatch::Forward(GenerationState& g)
    }
}

std::shared_ptr<MultimodalRope> mrope_sptr;
std::optional, or even std::vector, is better than explicit dynamic allocation.
@@ -743,6 +767,12 @@ void LlamaBatch::AllocateBuffer(ssize_t batch_size, ssize_t session_len, int cac

s.curand_state = {{batch_size, sizeof(curandState_t)}, kDEVICE};
Clear(s.curand_state.buffer());

if (model_->attn_param_.rope.type == RopeType::kMultimodal) {
    s.mrope.position_ids = {{batch_size, session_len_ * 3}, kDEVICE};
Can we use shape [batch_size, session_len, 3] to keep the magic number 3 from recurring?
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and easier to review. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through this PR.
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist