
Feature Request: Qwen 2.5 VL #11483

Open
4 tasks done
bold84 opened this issue Jan 29, 2025 · 21 comments
Labels
enhancement New feature or request

Comments

@bold84

bold84 commented Jan 29, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Is anybody implementing this?

If not, I may give it a go. But it will take some time as I am new to the source side of llama.cpp/ggml.

Motivation

Well, it's not currently working. :-)

Possible Implementation

Based on the existing Qwen 2 VL implementation.

@bold84 bold84 added the enhancement New feature or request label Jan 29, 2025
@HimariO
Contributor

HimariO commented Jan 29, 2025

I'm currently looking into Transformers' Qwen2.5VL implementation and waiting for the paper to drop so I can better assess the differences between Qwen2VL and Qwen2.5VL. 👀

@3unnycheung

cool

@samkoesnadi
Contributor

I support this!

@Shyryp

Shyryp commented Feb 2, 2025

Our world definitely needs this!

@peter-ch

Any progress on this? Who added support for Qwen 2 VL?

@pszemraj

pszemraj commented Feb 20, 2025

qwen2.5-vl report is up! https://huggingface.co/papers/2502.13923

edit: official codebase here: https://github.com/QwenLM/Qwen2.5-VL

@vladislavdonchev

I can start working on this if no one else is already.

@vladislavdonchev

vladislavdonchev commented Feb 22, 2025

OK then!

First order of business would be to build the GGUF file(s). There seems to be an issue with that on the latest official Transformers:

python convert_hf_to_gguf.py .\build\bin\Release\Qwen2.5-VL-7B-Instruct\
INFO:hf-to-gguf:Loading model: Qwen2.5-VL-7B-Instruct
ERROR:hf-to-gguf:Model Qwen2_5_VLForConditionalGeneration is not supported

This is being actively discussed upstream:
huggingface/transformers#36292
QwenLM/Qwen2.5-VL#723

It appears a temporary workaround is to use the old Qwen2 templates. People are reporting that this works, so I'll post an update in a bit.

@vladislavdonchev

vladislavdonchev commented Feb 22, 2025

Right, so this one is a bit of a rabbit hole...

I. Reverting the Qwen2.5 config files to:

"processor_class": "Qwen2VLProcessor"

and

  "architectures": [
    "Qwen2VLForConditionalGeneration"
  ]

Produces a (seemingly) working model! We've started testing and quantizing it here:
https://huggingface.co/IAILabs/Qwen2.5-VL-7b-Instruct-GGUF/tree/main

Image
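For anyone who wants to reproduce this, the edit can be scripted. A minimal, untested sketch (it assumes processor_class lives in preprocessor_config.json, as in other Qwen2-VL checkpoints):

  import json
  from pathlib import Path

  model_dir = Path("Qwen2.5-VL-7B-Instruct")

  # Point the processor back at the Qwen2-VL class.
  pp_path = model_dir / "preprocessor_config.json"
  pp_cfg = json.loads(pp_path.read_text())
  pp_cfg["processor_class"] = "Qwen2VLProcessor"
  pp_path.write_text(json.dumps(pp_cfg, indent=2))

  # Rename the architecture so convert_hf_to_gguf.py takes the Qwen2-VL path.
  cfg_path = model_dir / "config.json"
  cfg = json.loads(cfg_path.read_text())
  cfg["architectures"] = ["Qwen2VLForConditionalGeneration"]
  cfg_path.write_text(json.dumps(cfg, indent=2))

After that, convert_hf_to_gguf.py runs as shown above.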

II. In order to get a usable experience, you need to make sure CLIP is running with hardware acceleration. This currently requires you to revert this commit:
#10896

For more information refer to:
#11322

The following PR seems to correct (at least) some of the issues that led to disabling hardware acceleration in the first place:
#11902

So, it is now up to us to prove that everything is working properly.

I'll start a stress / perf eval test alongside the quantization process, so we have a better idea about what's going on.

@vladislavdonchev

vladislavdonchev commented Feb 23, 2025

UPDATE: A few 4-bit quants have been uploaded, including two that support online auto-repacking.

The latest main looks stable with Vulkan CLIP and any model thrown at it so far. Some preliminary insights:

  • 1200x1200 is the maximum you can encode with 16 GB of VRAM. clip.cpp does not seem to support multi-GPU Vulkan yet.
  • A 4060 Ti-class GPU delivers 20-30 t/s with Q8_0 and double that with Q4 at 16-32K context.
  • Batching (multiple images) in a single CLI call seems to be working fine:
    llama-qwen2vl-cli --ctx-size 16000 -n 16000 -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --n_gpu_layers 9999 -p "Describe the image in detail. Extract all textual information from it. Output as detailed JSON." -p "Analyze the image." --image ~/Pictures/test_small.png --image ~/Pictures/test_small.png

Output quality looks very promising! We'll release all of the benchmark code when ready, so the process can be streamlined for other models.
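For reference, the eval harness is little more than a loop over the CLI. A rough, untested sketch (hypothetical image folder, same flags as the command above):

  import subprocess, time
  from pathlib import Path

  model = Path("~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf").expanduser()
  mmproj = Path("~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf").expanduser()
  images = sorted(Path("~/Pictures/bench").expanduser().glob("*.png"))  # hypothetical test set

  for img in images:
      start = time.time()
      subprocess.run([
          "llama-qwen2vl-cli",
          "--ctx-size", "16000", "-n", "16000",
          "-m", str(model), "--mmproj", str(mmproj),
          "--n_gpu_layers", "9999",
          "-p", "Describe the image in detail.",
          "--image", str(img),
      ], check=True)
      print(f"{img.name}: {time.time() - start:.1f}s")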

@hvico

hvico commented Feb 24, 2025

Hi! Excellent news, thank you very much for this!

I was able to run the model using code from git main on a 4 x Radeon 7900 XTX 24 GB workstation, but with CLIP on CPU. I tried to enable Vulkan acceleration for CLIP by uncommenting the lines in clip.cpp under examples, but then I get an OOM. I tried this with the FP16, Q4_K_M and IQ4_XS models. Telling the CLI to use just one Vulkan device does not help with the OOM / CLIP-on-GPU issue either.

@vladislavdonchev

vladislavdonchev commented Feb 24, 2025

> Hi! Excellent news, thank you very much for this!
>
> I was able to run the model using code from git main on a 4 x Radeon 7900 XTX 24 GB workstation, but with CLIP on CPU. I tried to enable Vulkan acceleration for CLIP by uncommenting the lines in clip.cpp under examples, but then I get an OOM. I tried this with the FP16, Q4_K_M and IQ4_XS models. Telling the CLI to use just one Vulkan device does not help with the OOM / CLIP-on-GPU issue either.

Hi, could you please confirm what the resolution of your input images is?

EDIT: As per Qwen2.5 docs:
min_pixels = 256x28x28
max_pixels = 1280x28x28

A RTFM moment for me...
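Something like this could be used to pre-shrink images before handing them to the CLI (untested Pillow sketch; the pixel budget comes from the numbers above):

  import math
  from PIL import Image

  MAX_PIXELS = 1280 * 28 * 28  # max_pixels from the Qwen2.5 docs

  def shrink_to_budget(src: str, dst: str) -> None:
      img = Image.open(src)
      w, h = img.size
      if w * h > MAX_PIXELS:
          scale = math.sqrt(MAX_PIXELS / (w * h))
          img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))), Image.LANCZOS)
      img.save(dst)

  shrink_to_budget("test.png", "test_small.png")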

@hvico

hvico commented Feb 24, 2025

> Hi, could you please confirm what the resolution of your input images is? With 24 GB of VRAM, you can expect an OOM with images >1400x1400 pixels, so you need to make sure the files are pre-processed correctly.

Thanks.

My image was 1475x1062. I was able to run inference successfully on a 1077x671 sample without OOM. Would it be possible to run CLIP and the VL model on separate GPUs? Thanks again.

@zrrraa

zrrraa commented Feb 25, 2025

> Right, so this one is a bit of a rabbit hole... […]

Thank you very much for your research and for sharing! I would like to ask how to get the mmproj from the Qwen2.5-VL model? The qwen2_vl_surgery.py script used for Qwen2-VL doesn't seem to work; could you share your method? Thank you very much!

@vladislavdonchev

> I would like to ask how to get the mmproj from the Qwen2.5-VL model? The qwen2_vl_surgery.py script used for Qwen2-VL doesn't seem to work; could you share your method?

Get it from our HF:
https://huggingface.co/IAILabs/Qwen2.5-VL-7b-Instruct-GGUF
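If you'd rather script the download than use the web UI, something along these lines should work (filenames taken from the command earlier in the thread; adjust them to whatever is actually listed in the repo):

  from huggingface_hub import hf_hub_download

  repo = "IAILabs/Qwen2.5-VL-7b-Instruct-GGUF"
  mmproj = hf_hub_download(repo, "mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf")
  model = hf_hub_download(repo, "Qwen2.5-VL-7B-Instruct-Q4_0.gguf")
  print(mmproj, model, sep="\n")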

@ChmHsm

ChmHsm commented Feb 27, 2025

Thank you for the effort, a lot of people really need this.

Any updates on the progress? Will this still take a few days, or is it more like a few weeks or months?

Thanks a lot again, we appreciate you guys a lot!

@samkoesnadi
Contributor

@vladislavdonchev Great work! Have you done the 3B version? I can also do it myself if you provide the conversion script :)

@vladislavdonchev

> @vladislavdonchev Great work! Have you done the 3B version? I can also do it myself if you provide the conversion script :)

Working on it as we speak, along with a quantization tool:

Image

https://github.com/Independent-AI-Labs/local-super-agents/tree/feat/additional-output-formats/quantbench

@vladislavdonchev

UPDATE:

Opened a draft PR here: #12119

Long story short, I'll need some help debugging the vision models and llama-qwen2vl-cli, as we're unable to produce a reliably working vision model.

In addition, this still isn't resolved:
#11322

I've also asked the Qwen folks for help:
QwenLM/Qwen2.5-VL#869

@ChmHsm

ChmHsm commented Feb 28, 2025

Thanks @vladislavdonchev for the effort and the update.

I took a look at the issue you opened with the Qwen team. Is it only affecting the 3B model? Can we at least expect progress to continue with the 7B?

Thank you!

@vladislavdonchev

vladislavdonchev commented Feb 28, 2025

> I took a look at the issue you opened with the Qwen team. Is it only affecting the 3B model? Can we at least expect progress to continue with the 7B?

Unfortunately, we're unable to reliably produce a working vision model from either the 7B or the 3B. I am not sure how the one in the repo was exported, but it seems to be working, so it's either some weird coincidence or a mistake. I've verified the LM part, including in quants, and it appears to match what you'd expect from Qwen2.5 (parameters in the .gguf look correct, responses are OK).
