support qwen2/2.5-vl in turbomind #3744
Conversation
Could you kindly share with us how much degradation you observed and how you measured the accuracy? That information would definitely be valuable for further improving our project. Thank you.
For my project I use the OCRFlux model, which has an architecture identical to Qwen2.5-VL. The first script below runs it with the TurboMind backend.
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
import time
from PIL import Image
import subprocess
import io
from ocrflux.lm_ocr import _build_page_to_markdown_query
gen_config = GenerationConfig(temperature=0.0, max_new_tokens=16384)
backend_config = TurbomindEngineConfig(session_len=16384)
question = (
    f"Below is the image of one page of a document. "
    f"Just return the plain text representation of this document as if you were reading it naturally.\n"
    f"ALL tables should be presented in HTML format.\n"
    f"If there are images or figures in the page, present them as \"<Image>(left,top),(right,bottom)</Image>\", (left,top,right,bottom) are the coordinates of the top-left and bottom-right corners of the image or figure.\n"
    f"Present all titles and headings as H1 headings.\n"
    f"Do not hallucinate.\n"
)
images = [_build_page_to_markdown_query("photo_2025-07-08_16-44-53.jpg", p) for p in range(1, 1 + 1)]
llm = pipeline('ChatDOC/OCRFlux-3B', backend_config=backend_config, chat_template_config=ChatTemplateConfig('ocrflux-qwen2_5_vl'))
inputs = [(question, image) for image in images]
start_time = time.time()
responses = llm(inputs, gen_config=gen_config)
print(responses)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
# import lmdeploy
# print(lmdeploy.__file__)
# The same pipeline again, this time with the PyTorch backend for comparison.
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig, PytorchEngineConfig
import time
from PIL import Image
import subprocess
import io
from ocrflux.lm_ocr import _build_page_to_markdown_query
gen_config = GenerationConfig(temperature=0.0, max_new_tokens=16384)
backend_config = PytorchEngineConfig(session_len=16384)
question = (
    f"Below is the image of one page of a document. "
    f"Just return the plain text representation of this document as if you were reading it naturally.\n"
    f"ALL tables should be presented in HTML format.\n"
    f"If there are images or figures in the page, present them as \"<Image>(left,top),(right,bottom)</Image>\", (left,top,right,bottom) are the coordinates of the top-left and bottom-right corners of the image or figure.\n"
    f"Present all titles and headings as H1 headings.\n"
    f"Do not hallucinate.\n"
)
images = [_build_page_to_markdown_query("photo_2025-07-08_16-44-53.jpg", p) for p in range(1, 1 + 1)]
llm = pipeline('ChatDOC/OCRFlux-3B', backend_config=backend_config, chat_template_config=ChatTemplateConfig('ocrflux-qwen2_5_vl'))
inputs = [(question, image) for image in images]
start_time = time.time()
responses = llm(inputs, gen_config=gen_config)
print(responses)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
# import lmdeploy
# print(lmdeploy.__file__)
Results for the TurboMind run and for the PyTorch run are attached.
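To quantify the difference between the two runs, one simple option is a plain-text diff of the outputs. The snippet below is a minimal sketch that assumes each script saved its response text to turbomind.txt and pytorch.txt (hypothetical file names, not part of the scripts above).

# Minimal sketch: compare the text produced by the two backends.
# Assumes each run saved response.text to the files below (hypothetical names).
import difflib

with open('turbomind.txt') as f1, open('pytorch.txt') as f2:
    turbomind_lines = f1.read().splitlines()
    pytorch_lines = f2.read().splitlines()

ratio = difflib.SequenceMatcher(
    None, '\n'.join(turbomind_lines), '\n'.join(pytorch_lines)).ratio()
print(f'similarity ratio: {ratio:.3f}')

for line in difflib.unified_diff(
        turbomind_lines, pytorch_lines,
        fromfile='turbomind', tofile='pytorch', lineterm=''):
    print(line)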
Since OCRFlux's config.json is a little different from Qwen2.5-VL's, I edited lines 182-183 of the file to the following, in order to get past the RuntimeError(f'Unsupported rope type: {scaling_type}') that was raised otherwise:

elif scaling_type == 'mrope':
    mrope_section = rope_scaling.get('mrope_section')
    rope_param.type = 'mrope'
    rope_param.mrope_section = mrope_section
else:
    pass
    # raise RuntimeError(f'Unsupported rope type: {scaling_type}')

I don't know if this is the reason for the reduced accuracy.
It appears that Qwen2.5-VL uses different notations for the rope field in config.json, e.g. Qwen2.5-VL-32B-Instruct vs. Qwen2.5-VL-7B-Instruct. I have updated the code to support these different notations. I can't run your test code because I couldn't find the …. And please note that the current mrope implementation in lmdeploy only supports the ….
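For illustration only, here is a minimal sketch of how the two rope_scaling notations could be normalized. The field names follow the snippets quoted later in this thread; the example values are assumptions, and this is not lmdeploy's actual reader:

# Sketch only: normalize the two rope_scaling notations seen in Qwen2.5-VL
# configs. Some checkpoints declare {'type': 'mrope', ...} while others declare
# {'rope_type': 'default', 'mrope_section': [...]} (assumed shapes based on the
# discussion above). Not lmdeploy's actual code.
def normalize_rope_scaling(rope_scaling: dict) -> dict:
    scaling_type = rope_scaling.get('type') or rope_scaling.get('rope_type')
    if rope_scaling.get('mrope_section') is not None:
        # mrope is layered on a base rope function, so treat any config that
        # carries mrope_section as mrope regardless of the declared type
        scaling_type = 'mrope'
    return {'type': scaling_type,
            'mrope_section': rope_scaling.get('mrope_section')}

# Example inputs (assumed values, for illustration only):
print(normalize_rope_scaling({'type': 'mrope', 'mrope_section': [16, 24, 24]}))
print(normalize_rope_scaling({'rope_type': 'default', 'mrope_section': [16, 24, 24]}))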
I will provide some additional information so that you can run the code I shared. The helper function is:

from PIL import Image
from ocrflux.image_utils import get_page_image

def _build_page_to_markdown_query(
    file_path: str,
    page_number: int,
    target_longest_image_dim: int = 1024,
    image_rotation: int = 0,
) -> Image.Image:
    assert image_rotation in [0, 90, 180, 270], "Invalid image rotation provided"
    # Render the requested page (or load the image file) as a PIL image
    image = get_page_image(
        file_path,
        page_number,
        target_longest_image_dim=target_longest_image_dim,
        image_rotation=image_rotation,
    )
    return image

The function ….

Lastly, the exact chat template registered with lmdeploy to make OCRFlux work is:

from lmdeploy.model import MODELS, BaseChatTemplate
@MODELS.register_module(name='ocrflux-qwen2_5_vl')
class OCRFluxChatTemplate(BaseChatTemplate):
    """Chat template simulating vLLM-style prompts for OCRFlux (a fine-tuned Qwen2.5-VL).

    Format:
        <|im_start|>system\n{meta_instruction}<|im_end|>\n
        <|im_start|>user\n{vision_tokens}{user_text}<|im_end|>\n
        <|im_start|>assistant\n

    - `vision_tokens` = n_images * ("<|vision_start|>" + IMAGE_TOKEN + "<|vision_end|>")
      (IMAGE_TOKEN is replaced with '<|image_pad|>' * n_grid when encoding for Qwen2.5-VL).
    - Supports multi-turn dialogue batching: each user message is appended in the same way;
      the assistant turn ends with `<|im_end|>` after each round.
    """

    def __init__(self,
                 system: str = '<|im_start|>system\n',
                 meta_instruction: str = 'You are a helpful assistant.',
                 user: str = '<|im_start|>user\n',
                 assistant: str = '<|im_start|>assistant\n',
                 eosys: str = '<|im_end|>\n',
                 eoh: str = '<|im_end|>\n',
                 eoa: str = '<|im_end|>',
                 separator: str = '\n',
                 stop_words=None):
        if stop_words is None:
            stop_words = ['<|im_end|>']
        super().__init__(system=system,
                         meta_instruction=meta_instruction,
                         eosys=eosys,
                         user=user,
                         eoh=eoh,
                         assistant=assistant,
                         eoa=eoa,
                         separator=separator,
                         stop_words=stop_words)

An important note: the configuration of [OCRFLUX](https://huggingface.co/ChatDOC/OCRFlux-3B/blob/main/config.json) and [Qwen2.5VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ/blob/main/config.json) is identical because they originate from the same architecture, similar to [Nanonets-OCR-s](https://huggingface.co/nanonets/Nanonets-OCR-s). They are all fine-tuned from Qwen2.5-VL-3B, so in theory they should work interchangeably without loss of accuracy. I'm not sure whether any additional information is needed; I've already provided the image file. My local machine is running: …
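For reference, the template above is intended to render a single-image, single-turn request roughly as follows. This is an illustrative sketch of the format described in the docstring, not output captured from lmdeploy, and IMAGE_TOKEN is an assumed placeholder name:

# Illustration only: the prompt string the template above should produce for
# one image and one user turn. IMAGE_TOKEN stands for the placeholder that
# lmdeploy expands to '<|image_pad|>' * n_grid when encoding for Qwen2.5-VL.
IMAGE_TOKEN = '<IMAGE_TOKEN>'  # assumed placeholder name for this sketch
user_text = 'Below is the image of one page of a document. ...'
prompt = (
    '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n'
    '<|im_start|>user\n'
    '<|vision_start|>' + IMAGE_TOKEN + '<|vision_end|>'
    + user_text + '<|im_end|>\n'
    '<|im_start|>assistant\n'
)
print(prompt)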
Are you using the official ChatDOC/OCRFlux-3B model or a model fine-tuned from it? I used your code with the official ChatDOC/OCRFlux-3B model, but got this response with both the PyTorch and TurboMind backends.
sha256sum
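One way to confirm both sides are loading exactly the same checkpoint is to compare checksums of the weight files. A minimal sketch, where the local path is an assumption:

# Sketch: print SHA-256 checksums of the local model shards so they can be
# compared against the official ChatDOC/OCRFlux-3B files. Path is assumed.
import hashlib
import pathlib

model_dir = pathlib.Path('ChatDOC/OCRFlux-3B')  # adjust to your local snapshot
for shard in sorted(model_dir.glob('*.safetensors')):
    digest = hashlib.sha256(shard.read_bytes()).hexdigest()
    print(shard.name, digest)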
In Llama.py:

if rope_scaling.get('mrope_section') is not None:
    # TODO: treat mrope as an option to the common rope functions
    scaling_type = 'mrope'

If you have any further issues, feel free to ask.
The change of … The result I obtained using the PyTorch backend is different from yours, and it can be considered incorrect, since the output does not stop until the length limit is reached.
Yes, I cloned and ran from your GitHub source via pip, and the result matches yours. What I mean is that I'm currently running with TurboMind; you can try testing with TurboMind to see whether the result matches mine. Please also comment out or delete the following code:

if rope_scaling.get('mrope_section') is not None:
    # TODO: treat mrope as an option to the common rope functions
    scaling_type = 'mrope'

I have reproduced your output and modified the code so that it reproduces the results I originally gave.
@kolmogorov-quyet But ChatDOC/OCRFlux-3B uses the mrope type; I can't get your result with …
You were able to reproduce the results of …. As for not being able to reproduce the results of …, hopefully you'll get the same results as before. Please make sure not to change any code when building from ….
params.rope_param.base += offset;
}
else if (rope_param_.type == RopeType::kMultimodal) {
    params.rope_param.multimodal.position_ids += offset * rope_param_.multimodal.session_len * 3;
Too many uses of the magic number "3".
lmdeploy/vl/model/qwen2.py
Outdated
# it seems mrope is built on top of the default/linear/dynamic rope functions and is not a rope type itself
# the current implementation binds mrope to the default rope function, which is the default
# behavior in the official model
self.vl_model.config.rope_scaling['type'] = 'mrope'
Please check whether this is still necessary with transformers >= 4.51.1.
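One quick way to check, as a sketch with an example model id, is to inspect what a recent transformers release reports for the rope scaling field:

# Sketch: inspect the rope_scaling field that transformers (>= 4.51) exposes
# for a Qwen2.5-VL checkpoint, to see whether forcing 'type' = 'mrope' is
# still required. The model id is an example, not taken from the diff above.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained('Qwen/Qwen2.5-VL-3B-Instruct')
print(getattr(cfg, 'rope_scaling', None))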
lmdeploy/vl/model/qwen2.py
Outdated
from accelerate import init_empty_weights
with init_empty_weights():
    config = self.hf_config
    config.quantization_config = {}  # disable vision part quantization
Does Qwen2.5-VL have quantization_config?
@@ -16,6 +16,7 @@ enum class RopeType
    kDynamic,
    kYarn,
    kLlama3,
    kMultimodal,
Why is this named 'multimodal' here but identified by 'mrope' everywhere else?
@@ -97,6 +105,9 @@ struct FastRoPE {
    template<typename T>
    __device__ void apply(Array<T, N>& x, float timestep)
    {
        if (param_.type == RopeType::kMultimodal) {
Need before/after performance comparison of related kernels on large data chunks, since the PR is adding dynamic branching to a device function that affects ALL models.
core::Copy(r->inputs.at("mrope_length").data<int>(), 1, state.mrope.length.data() + idx);
}
else {
    cudaMemsetAsync(state.mrope.length.data() + idx, 0, sizeof(int), stream_);
Check the cudaMemsetAsync call for errors (its return status is ignored here).
@@ -1522,6 +1552,14 @@ bool LlamaBatch::Forward(GenerationState& g)
    }
}

std::shared_ptr<MultimodalRope> mrope_sptr;
std::optional, or even std::vector, is better than explicit dynamic allocation.
@@ -743,6 +767,12 @@ void LlamaBatch::AllocateBuffer(ssize_t batch_size, ssize_t session_len, int cac

s.curand_state = {{batch_size, sizeof(curandState_t)}, kDEVICE};
Clear(s.curand_state.buffer());

if (model_->attn_param_.rope.type == RopeType::kMultimodal) {
    s.mrope.position_ids = {{batch_size, session_len_ * 3}, kDEVICE};
Can we use shape [batch_size, session_len, 3] to keep the magic number 3 from recurring?
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and easier to review. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through this PR.
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist