🤗 Hugging Face | 📑 Paper | 📑 Datasets | 💬 WeChat (微信)
Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field.
- [2025.05.16]: 🔥 🔥 🔥 The supervised fine-tuning (SFT) and instruction tuning (IT) code of the MMLA benchmark is released (Link), enjoy it!
- [2025.05.06]: 🔥 🔥 🔥 The zero-shot inference code of the MMLA benchmark is released (Link), enjoy it!
- [2025.04.29]: The datasets of the MMLA benchmark are released on Hugging Face and Google Drive! The code will be released soon.
- [2025.04.24]: 📜 Our paper: Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark is released (arXiv, Hugging Face, alphaXiv). The official repo is released on GitHub.
- Various Sources: 9 datasets, 61K+ samples, 3 modalities, 76.6 hours of video. Both staged and real-world scenarios (films, TV series, YouTube, Vimeo, Bilibili, TED, improvised scripts, etc.).
- 6 Core Semantic Dimensions: Intent, Emotion, Sentiment, Dialogue Act, Speaking Style, and Communication Behavior.
- 3 Evaluation Methods: Zero-shot Inference, Supervised Fine-tuning, and Instruction Tuning (a minimal zero-shot sketch follows this list).
- 8 Mainstream Foundation Models: 5 MLLMs (Qwen2-VL, VideoLLaMA2, LLaVA-Video, LLaVA-OV, MiniCPM-V-2.6), 3 LLMs (InternLM2.5, Qwen2, LLaMA3).
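As a concrete illustration of the zero-shot inference setting, the sketch below prompts Qwen2-VL-7B with a video clip, its transcript, and a candidate label set, and asks it to output a single label. It assumes the Hugging Face `transformers` Qwen2-VL API together with `qwen_vl_utils`; the prompt wording, label list, and file path are illustrative placeholders rather than the exact templates used in our released evaluation code.

```python
# Illustrative zero-shot sketch: classify the intent of one utterance with
# Qwen2-VL-7B. Prompt wording, labels, and the video path are placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

labels = ["complain", "praise", "apologize", "inform", "joke"]    # hypothetical subset
transcript = "I can't believe you forgot my birthday again."      # hypothetical utterance
prompt = (
    "You are given a video of a speaker and the transcript of the utterance: "
    f"'{transcript}'. Choose the speaker's intent from {labels}. "
    "Answer with exactly one label."
)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/clip.mp4", "fps": 1.0},  # placeholder path
        {"type": "text", "text": prompt},
    ],
}]

# Build the chat prompt and pack the sampled video frames into tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Generate a short answer and strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=16)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```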
Dimension | Dataset | Source | Venue |
---|---|---|---|
Intent | MIntRec | Paper / GitHub | ACM MM 2022 |
Intent | MIntRec2.0 | Paper / GitHub | ICLR 2024 |
Emotion | MELD | Paper / GitHub | ACL 2019 |
Emotion | IEMOCAP | Paper / Website | Language Resources and Evaluation 2008 |
Dialogue Act | MELD-DA | Paper / GitHub | ACL 2020 |
Dialogue Act | IEMOCAP-DA | Paper / Website | ACL 2020 |
Sentiment | MOSI | Paper / GitHub | IEEE Intelligent Systems 2016 |
Sentiment | CH-SIMS v2.0 | Paper / GitHub | ICMI 2022 |
Speaking Style | UR-FUNNY-v2 | Paper / GitHub | EMNLP 2019 |
Speaking Style | MUStARD | Paper / GitHub | ACL 2019 |
Communication Behavior | Anno-MI (client) | Paper / GitHub | ICASSP 2022 |
Communication Behavior | Anno-MI (therapist) | Paper / GitHub | ICASSP 2022 |
The raw text and videos of each dataset are all released on Hugging Face and Google Drive.
Note that for the MOSI, IEMOCAP, and IEMOCAP-DA datasets, we only provide the raw text due to their restricted licenses. The raw videos of IEMOCAP can be downloaded from here. The raw videos of MOSI cannot be released because of the privacy limitations mentioned in CMU-MultimodalSDK.
Models | Model Scale and Link | Source | Type |
---|---|---|---|
Qwen2 | 🤗 0.5B / 1.5B / 7B | Paper / GitHub | LLM |
Llama3 | 🤗 8B | Paper / GitHub | LLM |
InternLM2.5 | 🤗 7B | Paper / GitHub | LLM |
VideoLLaMA2 | 🤗 7B | Paper / GitHub | MLLM |
Qwen2-VL | 🤗 7B / 72B | Paper / GitHub | MLLM |
LLaVA-Video | 🤗 7B / 72B | Paper / GitHub | MLLM |
LLaVA-OneVision | 🤗 7B / 72B | Paper / GitHub | MLLM |
MiniCPM-V-2.6 | 🤗 8B | Paper / GitHub | MLLM |
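For the SFT and IT settings, each backbone above is fine-tuned on the MMLA training splits. The sketch below shows one plausible recipe for a text-only backbone (Qwen2-7B) using LoRA adapters via `peft`; the `prompt`/`label` JSONL schema and the hyper-parameters are assumptions for illustration, not the exact configuration of our released training code.

```python
# Illustrative LoRA-based SFT sketch for a text-only backbone (Qwen2-7B).
# The prompt/label JSONL schema and hyper-parameters below are assumptions;
# see the released SFT/IT code for the actual recipe.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections; only these are trained.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Each record is assumed to hold a task prompt and the gold label as plain text.
raw = load_dataset("json", data_files="train.jsonl", split="train")

def tokenize(example):
    full = example["prompt"] + example["label"] + tokenizer.eos_token
    return tokenizer(full, truncation=True, max_length=1024)

train_ds = raw.map(tokenize, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft_out",
                           per_device_train_batch_size=2,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           bf16=True,
                           logging_steps=10),
    train_dataset=train_ds,
    # Causal-LM collator pads the batch and copies input_ids into labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```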
Rank | Models | ACC (%) | Type |
---|---|---|---|
🥇 | GPT-4o | 52.60 | MLLM |
🥈 | Qwen2-VL-72B | 52.55 | MLLM |
🥉 | LLaVA-OV-72B | 52.44 | MLLM |
4 | LLaVA-Video-72B | 51.64 | MLLM |
5 | InternLM2.5-7B | 50.28 | LLM |
6 | Qwen2-7B | 48.45 | LLM |
7 | Qwen2-VL-7B | 47.12 | MLLM |
8 | Llama3-8B | 44.06 | LLM |
9 | LLaVA-Video-7B | 43.32 | MLLM |
10 | VideoLLaMA2-7B | 42.82 | MLLM |
11 | LLaVA-OV-7B | 40.65 | MLLM |
12 | Qwen2-1.5B | 40.61 | LLM |
13 | MiniCPM-V-2.6-8B | 37.03 | MLLM |
14 | Qwen2-0.5B | 22.14 | LLM |
Rank | Models | ACC (%) | Type |
---|---|---|---|
🥇 | Qwen2-VL-72B (SFT) | 69.18 | MLLM |
🥈 | MiniCPM-V-2.6-8B (SFT) | 68.88 | MLLM |
🥉 | LLaVA-Video-72B (IT) | 68.87 | MLLM |
4 | LLaVA-OV-72B (SFT) | 68.67 | MLLM |
5 | Qwen2-VL-72B (IT) | 68.64 | MLLM |
6 | LLaVA-Video-72B (SFT) | 68.44 | MLLM |
7 | VideoLLaMA2-7B (SFT) | 68.30 | MLLM |
8 | Qwen2-VL-7B (SFT) | 67.60 | MLLM |
9 | LLaVA-OV-7B (SFT) | 67.54 | MLLM |
10 | LLaVA-Video-7B (SFT) | 67.47 | MLLM |
11 | Qwen2-VL-7B (IT) | 67.34 | MLLM |
12 | MiniCPM-V-2.6-8B (IT) | 67.25 | MLLM |
13 | Llama3-8B (SFT) | 66.18 | LLM |
14 | Qwen2-7B (SFT) | 66.15 | LLM |
15 | InternLM2.5-7B (SFT) | 65.72 | LLM |
16 | Qwen2-7B (IT) | 64.58 | LLM |
17 | InternLM2.5-7B (IT) | 64.41 | LLM |
18 | Llama3-8B (IT) | 64.16 | LLM |
19 | Qwen2-1.5B (SFT) | 64.00 | LLM |
20 | Qwen2-0.5B (SFT) | 62.80 | LLM |
The leaderboards above show the results of the three evaluation methods (i.e., zero-shot inference, SFT, and IT). The performance of state-of-the-art multimodal machine learning methods and GPT-4o is also shown in the figure below.
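For reference, the ACC values in both leaderboards are accuracies over the labels predicted by each model. A minimal sketch of such a metric is shown below; the normalization rule is an assumption, and the released evaluation code defines the exact matching logic.

```python
# Illustrative accuracy sketch: compare normalized predicted label strings
# against gold labels. The normalization rule here is an assumption.
def normalize(label: str) -> str:
    return label.strip().lower().rstrip(".")

def accuracy(predictions: list[str], references: list[str]) -> float:
    assert len(predictions) == len(references) and references
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

print(accuracy(["Complain", "joke ", "inform"], ["complain", "praise", "inform"]))  # ≈ 66.67
```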
If our work is helpful to your research, please consider giving us a star 🌟 and citing the following paper:
@article{zhang2025mmla,
author={Zhang, Hanlei and Li, Zhuohang and Zhu, Yeshuang and Xu, Hua and Wang, Peiwu and Zhu, Haige and Zhou, Jie and Zhang, Jinchao},
title={Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark},
year={2025},
journal={arXiv preprint arXiv:2504.16427},
}