Conversation

@HJYao00 commented Nov 15, 2025

This PR adds evaluation support for the MMReason benchmark, accepted at ICCV 2025, to assess the reasoning capabilities of MLLMs.

Before submitting, I tested that the code runs successfully on Qwen2.5-VL and Qwen3-VL. The commands to run the evaluation are as follows:

```
python3 run.py --data MMReason_testmini --model Qwen2.5-VL-7B-Instruct --verbose
python3 run.py --data MMReason_testmini --model Qwen3-VL-8B-Instruct --verbose
```

@FangXinyu-0913 (Collaborator) commented
Hi @HJYao00, have you evaluated popular MLLMs like Qwen3-VL and InternVL3.5 on MMReason? If so, could you share the results from VLMEvalKit along with your own evaluation data here? This would be very helpful for reproduction purposes.


@HJYao00 (Author) commented Nov 21, 2025

Hi @FangXinyu-0913. I have evaluated the popular MLLMs (Qwen2.5-VL, Qwen3-VL, and InternVL3.5) on the MMReason benchmark using VLMEvalKit. The performance of Qwen2.5-VL in VLMEvalKit closely matches the results reported in the paper. Thank you!

| Model | Source of Results | Accuracy |
| --- | --- | --- |
| Qwen2.5-VL-7B | VLMEvalKit | 16.9 |
| Qwen2.5-VL-7B | Paper | 17.3 |
| Qwen3-VL-8B | VLMEvalKit | 30.1 |
| InternVL3.5-8B | VLMEvalKit | 20.8 |
