The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity".
The dataset is available on Hugging Face at 🤗 yuecao0119/MMInstruct.
- [Oct 14, 2024] Our paper is accepted by SCIENCE CHINA Information Sciences!
- [Aug 6, 2024] The dataset is already accessible on Hugging Face at 🤗 yuecao0119/MMInstruct.
- [Jul 22, 2024] The paper has been released on arXiv!
- [Jul 22, 2024] Code has been released.
- Data Engine.
- Open Source Datasets.
- Release the checkpoint.
Vision-language supervised fine-tuning effectively enhances VLLM performance, but existing visual instruction tuning datasets have limitations:
- Instruction Annotation Quality: Despite strong performance, advanced VLLMs may generate instructions with inaccuracies, such as hallucinations.
- Instruction and Image Diversity: Limited instruction types and lack of diverse image data impact the model's ability to generate varied and realistic outputs.
To address these challenges, we created the MMInstruct dataset, featuring:
- 973K instructions from 24 domains
- Four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering.
![image](https://private-user-images.githubusercontent.com/23737120/355723000-92ef8128-89e3-4891-9dad-6c64da2c9de3.png)
The open source datasets on Hugging Face 🤗 yuecao0119/MMInstruct include:
- caption_cn: 144K Chinese detailed image caption data generated with gpt-4-vision-preview.
- caption_en: 18.2K English detailed image caption data generated with gpt-4-vision-preview.
- qa_en: 216K instruction data generated with gpt-3.5-turbo, including 161K multi-round long questions and answers and 55K manually corrected instructions covering 23 domains, as shown in the figure below.
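The files can be pulled directly from the Hugging Face Hub. Below is a minimal sketch using huggingface_hub.snapshot_download; the allow_patterns filter assumes the subsets live in folders named after the splits above (caption_cn, caption_en, qa_en), so check the dataset card for the actual repository layout before relying on it.

```python
# Sketch: download part of the MMInstruct dataset repo from Hugging Face.
# snapshot_download works for any dataset repo; the allow_patterns value below
# ("qa_en/*") is an assumption about the folder layout -- see the dataset card
# at https://huggingface.co/datasets/yuecao0119/MMInstruct for the real structure.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yuecao0119/MMInstruct",
    repo_type="dataset",
    allow_patterns=["qa_en/*"],  # hypothetical subfolder; drop this to fetch everything
)
print("Dataset downloaded to:", local_dir)
```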
We also expand MMInstruct with other open-source data, including:
| Domain | Dataset |
| --- | --- |
| mathematics | GEOS; UniGeo; GeoQA+; Geometry3k; CLEVR-Math; Super-CLEVR; TabMWP |
| charts and plots | DVQA (100K); FigureQA |
| scientific figures | TQA |
| map charts | MapQA |
We developed an instruction generation data engine leveraging GPT-4V, GPT-3.5, and manual correction. This engine allows semi-automatic, low-cost, multi-domain instruction generation at 1/6 the cost of manual construction.
![image](https://private-user-images.githubusercontent.com/23737120/355724717-8513df0f-f3d3-4145-bc81-baa1db656a4e.png)
As described in our paper, we propose a semi-automatic, low-cost instruction generation data engine built on GPT-4V, GPT-3.5, and manual correction. The engine consists of six steps: (a) image collection, (b) image caption generation, (c) seed question collection, (d) automatic instruction generation, (e) dataset expansion, and (f) manual correction.
(a) First, we collect a large number of diverse images from various sources: starting from a set of selected source images, we retrieve additional images with web crawlers and CLIP-based similarity search, as shown in image_retrieval_bing_spider.py and image_retrieval_clip.py.
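The CLIP-based retrieval is implemented in image_retrieval_clip.py; the sketch below only illustrates the general idea of filtering crawled images by their CLIP similarity to a seed image. The checkpoint, file names, and 0.8 threshold are illustrative assumptions, not the script's actual settings.

```python
# Illustrative sketch: keep crawled images that are visually similar to a seed
# image, using CLIP embeddings. Checkpoint, file names, and the 0.8 threshold
# are assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_features(paths):
    """Return L2-normalised CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

seed_feat = clip_image_features(["seed.jpg"])            # hypothetical seed image
candidates = ["crawl_001.jpg", "crawl_002.jpg"]          # hypothetical crawled images
cand_feats = clip_image_features(candidates)

# Cosine similarity of each candidate to the seed; keep only close matches.
similarity = (cand_feats @ seed_feat.T).squeeze(-1)
kept = [p for p, s in zip(candidates, similarity.tolist()) if s > 0.8]
print(kept)
```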
(b) We then use GPT-4V to generate detailed image captions, as shown in gpt4v_caption.py.
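gpt4v_caption.py drives this step through the OpenAI API. A minimal sketch of such a call with the official openai Python client is shown below; the prompt wording and parameters are illustrative assumptions, not the repository's actual prompt or batching logic.

```python
# Sketch of detailed caption generation with gpt-4-vision-preview.
# The prompt and parameters are illustrative; see gpt4v_caption.py for the
# actual prompt used to build MMInstruct.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content

print(caption_image("example.jpg"))  # hypothetical input image
```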
(c) Next, domain experts design seed questions for each domain.
(d) We use the image captions and seed questions to automatically generate rich and diverse instruction data with GPT-3.5, as shown in gpt35_qa.py.
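gpt35_qa.py implements this step. The sketch below shows the general pattern: condition gpt-3.5-turbo on an image caption plus a domain seed question and ask it for new question-answer pairs. The prompt template and example inputs are illustrative, not the actual ones used to build MMInstruct.

```python
# Sketch: generate instruction data from an image caption and a seed question
# with gpt-3.5-turbo. The prompt template below is an illustrative assumption;
# gpt35_qa.py contains the actual prompts.
from openai import OpenAI

client = OpenAI()

def generate_qa(caption: str, seed_question: str, n_pairs: int = 3) -> str:
    prompt = (
        f"Image caption:\n{caption}\n\n"
        f"Example question for this domain:\n{seed_question}\n\n"
        f"Based on the caption, write {n_pairs} new question-answer pairs "
        "about the image, in the same style as the example."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

caption = "A bar chart comparing quarterly revenue of three companies."  # hypothetical
seed = "Which company has the highest revenue in Q2?"                    # hypothetical
print(generate_qa(caption, seed))
```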
(e) We further expand the dataset through several additional methods. (f) Finally, manual correction is performed to ensure data quality and accuracy.
![image](https://private-user-images.githubusercontent.com/23737120/355725118-eca16ea4-8e73-4e92-8a5b-3036557abb94.png)
If this work is helpful for your research, please consider citing the following BibTeX entry.
```bibtex
@article{liu2024mminstruct,
  title={MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity},
  author={Liu, Yangzhou and Cao, Yue and Gao, Zhangwei and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Tian, Hao and Lu, Lewei and Zhu, Xizhou and Lu, Tong and others},
  journal={arXiv preprint arXiv:2407.15838},
  year={2024}
}
```