The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity".
The dataset is available on Hugging Face at 🤗 yuecao0119/MMInstruct.
- [Oct 14, 2024] Our paper is accepted by SCIENCE CHINA Information Sciences!
- [Aug 6, 2024] The dataset is already accessible on Hugging Face at 🤗 yuecao0119/MMInstruct.
- [Jul 22, 2024] The paper has been released on arXiv!
- [Jul 22, 2024] Code has been released.
- Data Engine.
- Open Source Datasets.
- Release the checkpoint.
Vision-language supervised fine-tuning effectively enhances VLLM performance, but existing visual instruction tuning datasets have limitations:
- Instruction Annotation Quality: Despite strong performance, advanced VLLMs may generate instructions with inaccuracies, such as hallucinations.
- Instruction and Image Diversity: Limited instruction types and lack of diverse image data impact the model's ability to generate varied and realistic outputs.
To address these challenges, we created the MMInstruct dataset, featuring:
- 973K instructions from 24 domains
- Four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering.
![image](https://private-user-images.githubusercontent.com/23737120/355723000-92ef8128-89e3-4891-9dad-6c64da2c9de3.png)
The open source datasets on Hugging Face 🤗 yuecao0119/MMInstruct include:
- caption_cn: 144K Chinese detailed image caption data generated with gpt-4-vision-preview.
- caption_en: 18.2K English detailed image caption data generated with gpt-4-vision-preview.
- qa_en: 216K instruction data generated with gpt-3.5-turbo, including 161K multi-round long questions and answers and 55K manually corrected instructions covering 23 domains, as shown in the figure below.
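The files can be pulled directly from the Hugging Face Hub. Below is a minimal sketch using huggingface_hub.snapshot_download; the allow_patterns filter assumes the subsets live in folders named after the splits above (caption_cn, caption_en, qa_en), so check the dataset card for the actual repository layout before relying on it.

```python
# Sketch: download part of the MMInstruct dataset repo from Hugging Face.
# snapshot_download works for any dataset repo; the allow_patterns value below
# ("qa_en/*") is an assumption about the folder layout -- see the dataset card
# at https://huggingface.co/datasets/yuecao0119/MMInstruct for the real structure.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yuecao0119/MMInstruct",
    repo_type="dataset",
    allow_patterns=["qa_en/*"],  # hypothetical subfolder; drop this to fetch everything
)
print("Dataset downloaded to:", local_dir)
```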
We also expand MMInstruct with other open-source data, including:
| Domain | Dataset |
| --- | --- |
| mathematics | GEOS; UniGeo; GeoQA+; Geometry3k; CLEVR-Math; Super-CLEVR; TabMWP |
| charts and plots | DVQA (100K); FigureQA |
| scientific figures | TQA |
| map charts | MapQA |
We developed an instruction generation data engine leveraging GPT-4V, GPT-3.5, and manual correction. This engine allows semi-automatic, low-cost, multi-domain instruction generation at 1/6 the cost of manual construction.
![image](https://private-user-images.githubusercontent.com/23737120/355724717-8513df0f-f3d3-4145-bc81-baa1db656a4e.png)
As described in our paper, we propose a semi-automatic, low-cost instruction generation data engine built on GPT-4V, GPT-3.5, and manual correction. The engine consists of six steps: (a) image collection, (b) image caption generation, (c) seed question collection, (d) automatic instruction generation, (e) dataset expansion, and (f) manual correction.
(a) First, we collect a large number of diverse images from various sources: starting from a set of selected source images, we retrieve additional images with web crawlers and CLIP-based similarity search, as shown in image_retrieval_bing_spider.py and image_retrieval_clip.py.
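The CLIP-based retrieval is implemented in image_retrieval_clip.py; the sketch below only illustrates the general idea of filtering crawled images by their CLIP similarity to a seed image. The checkpoint, file names, and 0.8 threshold are illustrative assumptions, not the script's actual settings.

```python
# Illustrative sketch: keep crawled images that are visually similar to a seed
# image, using CLIP embeddings. Checkpoint, file names, and the 0.8 threshold
# are assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_features(paths):
    """Return L2-normalised CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

seed_feat = clip_image_features(["seed.jpg"])            # hypothetical seed image
candidates = ["crawl_001.jpg", "crawl_002.jpg"]          # hypothetical crawled images
cand_feats = clip_image_features(candidates)

# Cosine similarity of each candidate to the seed; keep only close matches.
similarity = (cand_feats @ seed_feat.T).squeeze(-1)
kept = [p for p, s in zip(candidates, similarity.tolist()) if s > 0.8]
print(kept)
```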
(b) We then use GPT-4V to generate detailed image captions, as shown in gpt4v_caption.py.
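gpt4v_caption.py drives this step through the OpenAI API. A minimal sketch of such a call with the official openai Python client is shown below; the prompt wording and parameters are illustrative assumptions, not the repository's actual prompt or batching logic.

```python
# Sketch of detailed caption generation with gpt-4-vision-preview.
# The prompt and parameters are illustrative; see gpt4v_caption.py for the
# actual prompt used to build MMInstruct.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content

print(caption_image("example.jpg"))  # hypothetical input image
```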
(c) Next, domain experts design seed questions for each domain.
(d) We use the image captions and seed questions to automatically generate rich and diverse instruction data with GPT-3.5, as shown in gpt35_qa.py.
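gpt35_qa.py implements this step. The sketch below shows the general pattern: condition gpt-3.5-turbo on an image caption plus a domain seed question and ask it for new question-answer pairs. The prompt template and example inputs are illustrative, not the actual ones used to build MMInstruct.

```python
# Sketch: generate instruction data from an image caption and a seed question
# with gpt-3.5-turbo. The prompt template below is an illustrative assumption;
# gpt35_qa.py contains the actual prompts.
from openai import OpenAI

client = OpenAI()

def generate_qa(caption: str, seed_question: str, n_pairs: int = 3) -> str:
    prompt = (
        f"Image caption:\n{caption}\n\n"
        f"Example question for this domain:\n{seed_question}\n\n"
        f"Based on the caption, write {n_pairs} new question-answer pairs "
        "about the image, in the same style as the example."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

caption = "A bar chart comparing quarterly revenue of three companies."  # hypothetical
seed = "Which company has the highest revenue in Q2?"                    # hypothetical
print(generate_qa(caption, seed))
```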
(e) We further expand the dataset through several additional methods. (f) Finally, manual correction is performed to ensure data quality and accuracy.
![image](https://private-user-images.githubusercontent.com/23737120/355725118-eca16ea4-8e73-4e92-8a5b-3036557abb94.png)
If this work is helpful for your research, please consider citing the following BibTeX entry.
```bibtex
@article{liu2024mminstruct,
  title={MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity},
  author={Liu, Yangzhou and Cao, Yue and Gao, Zhangwei and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Tian, Hao and Lu, Lewei and Zhu, Xizhou and Lu, Tong and others},
  journal={arXiv preprint arXiv:2407.15838},
  year={2024}
}
```