Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

🤗 Hugging Face   |   📑 Paper   |   📑 Datasets  |   💬 WeChat (微信)  

(Figure: MMLA poster)

Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field.

Updates

  • [2025.05.16]: 🔥 🔥 🔥 The supervised fine-tuning (SFT) and instruction tuning (IT) code of the MMLA benchmark is released (Link), enjoy it!
  • [2025.05.06]: 🔥 🔥 🔥 The zero-shot inference code of the MMLA benchmark is released (Link), enjoy it!
  • [2025.04.29]: The datasets of the MMLA benchmark are released on Hugging Face and Google Drive! The code will be released soon.
  • [2025.04.24]: 📜 Our paper: Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark is released (arXiv, Hugging Face, alphaXiv). The official repo is released on GitHub.

Overview of the MMLA Benchmark

(Figure: overview of the MMLA benchmark)

Highlights

  • Various Sources: 9 datasets, 61K+ samples, 3 modalities, 76.6 hours of video, covering both staged and real-world scenarios (films, TV series, YouTube, Vimeo, Bilibili, TED, improvised scripts, etc.).
  • 6 Core Semantic Dimensions: Intent, Emotion, Sentiment, Dialogue Act, Speaking Style, and Communication Behavior.
  • 3 Evaluation Methods: Zero-shot Inference, Supervised Fine-tuning, and Instruction Tuning.
  • 8 Mainstream Foundation Models: 5 MLLMs (Qwen2-VL, VideoLLaMA2, LLaVA-Video, LLaVA-OV, MiniCPM-V-2.6), 3 LLMs (InternLM2.5, Qwen2, LLaMA3).

(Figures: radar charts of benchmark results)

Supported Datasets

| Dimension | Dataset | Source | Venue |
| --- | --- | --- | --- |
| Intent | MIntRec | Paper / GitHub | ACM MM 2022 |
| Intent | MIntRec2.0 | Paper / GitHub | ICLR 2024 |
| Emotion | MELD | Paper / GitHub | ACL 2019 |
| Emotion | IEMOCAP | Paper / Website | Language Resources and Evaluation 2008 |
| Dialogue Act | MELD-DA | Paper / GitHub | ACL 2020 |
| Dialogue Act | IEMOCAP-DA | Paper / Website | ACL 2020 |
| Sentiment | MOSI | Paper / GitHub | IEEE Intelligent Systems 2016 |
| Sentiment | CH-SIMS v2.0 | Paper / GitHub | ICMI 2022 |
| Speaking Style | UR-FUNNY-v2 | Paper / GitHub | ACL 2019 |
| Speaking Style | MUStARD | Paper / GitHub | ACL 2019 |
| Communication Behavior | Anno-MI (client) | Paper / GitHub | ICASSP 2022 |
| Communication Behavior | Anno-MI (therapist) | Paper / GitHub | ICASSP 2022 |

Release

The raw text and videos of each dataset are all released on Hugging Face and Google Drive.

Note that for the MOSI, IEMOCAP, and IEMOCAP-DA datasets, we only provide the raw text due to their restrictive licenses. The raw videos of IEMOCAP can be downloaded from here. The raw videos of MOSI cannot be released because of the privacy limitations mentioned in CMU-MultimodalSDK.
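
To mirror the released data locally, the Hugging Face Hub client can download the whole dataset repository. The sketch below is a minimal illustration: the repository id is a placeholder, so substitute the dataset id linked at the top of this README.

```python
# Minimal sketch: download the released MMLA data with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="THUIAR/MMLA-Datasets",  # placeholder id; replace with the linked dataset repo
    repo_type="dataset",
    local_dir="MMLA_data",
)
print(f"Datasets downloaded to: {local_dir}")
```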

Supported Models

| Models | Model Scale and Link | Source | Type |
| --- | --- | --- | --- |
| Qwen2 | 🤗 0.5B / 1.5B / 7B | Paper / GitHub | LLM |
| Llama3 | 🤗 8B | Paper / GitHub | LLM |
| InternLM2.5 | 🤗 7B | Paper / GitHub | LLM |
| VideoLLaMA2 | 🤗 7B | Paper / GitHub | MLLM |
| Qwen2-VL | 🤗 7B / 72B | Paper / GitHub | MLLM |
| LLaVA-Video | 🤗 7B / 72B | Paper / GitHub | MLLM |
| LLaVA-OneVision | 🤗 7B / 72B | Paper / GitHub | MLLM |
| MiniCPM-V-2.6 | 🤗 8B | Paper / GitHub | MLLM |
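
As a rough illustration of the zero-shot setting, the sketch below runs one of the supported MLLMs (Qwen2-VL-7B-Instruct) on a single video utterance with Hugging Face Transformers. The clip path and the candidate intent labels are placeholders; use the released inference code in this repository to reproduce the reported results.

```python
# Minimal zero-shot sketch with Qwen2-VL-7B-Instruct (placeholder clip path and labels).
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/utterance_clip.mp4"},  # placeholder path
        {"type": "text", "text": "Which intent label best describes the speaker's utterance? "
                                 "Choose one from: complain, praise, apologise."},  # illustrative subset of labels
    ],
}]

# Build the chat prompt and pack text + video frames into model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=16)

# Strip the prompt tokens before decoding the predicted label.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```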

Evaluation Results

Leaderboard

Rank of Zero-shot Inference

| Rank | Models | ACC | Type |
| --- | --- | --- | --- |
| 🥇 | GPT-4o | 52.60 | MLLM |
| 🥈 | Qwen2-VL-72B | 52.55 | MLLM |
| 🥉 | LLaVA-OV-72B | 52.44 | MLLM |
| 4 | LLaVA-Video-72B | 51.64 | MLLM |
| 5 | InternLM2.5-7B | 50.28 | LLM |
| 6 | Qwen2-7B | 48.45 | LLM |
| 7 | Qwen2-VL-7B | 47.12 | MLLM |
| 8 | Llama3-8B | 44.06 | LLM |
| 9 | LLaVA-Video-7B | 43.32 | MLLM |
| 10 | VideoLLaMA2-7B | 42.82 | MLLM |
| 11 | LLaVA-OV-7B | 40.65 | MLLM |
| 12 | Qwen2-1.5B | 40.61 | LLM |
| 13 | MiniCPM-V-2.6-8B | 37.03 | MLLM |
| 14 | Qwen2-0.5B | 22.14 | LLM |
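
The ACC column is an aggregate accuracy over the nine datasets. The toy sketch below shows one way such an aggregate can be computed, assuming an unweighted per-dataset average; see the paper and evaluation code for the exact protocol.

```python
# Toy sketch: per-dataset accuracy, averaged across datasets (unweighted average assumed).
from typing import Dict, List, Tuple

def dataset_accuracy(preds: List[str], labels: List[str]) -> float:
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def overall_acc(per_dataset: Dict[str, Tuple[List[str], List[str]]]) -> float:
    accs = [dataset_accuracy(p, l) for p, l in per_dataset.values()]
    return 100.0 * sum(accs) / len(accs)

# Hypothetical predictions/labels for two datasets -> prints 75.0
print(overall_acc({"MIntRec": (["complain"], ["complain"]),
                   "MELD": (["joy", "anger"], ["joy", "sadness"])}))
```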

Rank of Supervised Fine-tuning (SFT) and Instruction Tuning (IT)

| Rank | Models | ACC | Type |
| --- | --- | --- | --- |
| 🥇 | Qwen2-VL-72B (SFT) | 69.18 | MLLM |
| 🥈 | MiniCPM-V-2.6-8B (SFT) | 68.88 | MLLM |
| 🥉 | LLaVA-Video-72B (IT) | 68.87 | MLLM |
| 4 | LLaVA-OV-72B (SFT) | 68.67 | MLLM |
| 5 | Qwen2-VL-72B (IT) | 68.64 | MLLM |
| 6 | LLaVA-Video-72B (SFT) | 68.44 | MLLM |
| 7 | VideoLLaMA2-7B (SFT) | 68.30 | MLLM |
| 8 | Qwen2-VL-7B (SFT) | 67.60 | MLLM |
| 9 | LLaVA-OV-7B (SFT) | 67.54 | MLLM |
| 10 | LLaVA-Video-7B (SFT) | 67.47 | MLLM |
| 11 | Qwen2-VL-7B (IT) | 67.34 | MLLM |
| 12 | MiniCPM-V-2.6-8B (IT) | 67.25 | MLLM |
| 13 | Llama3-8B (SFT) | 66.18 | LLM |
| 14 | Qwen2-7B (SFT) | 66.15 | LLM |
| 15 | InternLM2.5-7B (SFT) | 65.72 | LLM |
| 16 | Qwen2-7B (IT) | 64.58 | LLM |
| 17 | InternLM2.5-7B (IT) | 64.41 | LLM |
| 18 | Llama3-8B (IT) | 64.16 | LLM |
| 19 | Qwen2-1.5B (SFT) | 64.00 | LLM |
| 20 | Qwen2-0.5B (SFT) | 62.80 | LLM |
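
For readers unfamiliar with the SFT setting, the sketch below shows a generic, text-only LoRA fine-tuning loop with Transformers and PEFT. It is only an illustration under assumed prompt/label formats and hyperparameters, not the released SFT/IT pipeline linked in the Updates section.

```python
# Generic text-only LoRA SFT sketch (illustrative prompt format and hyperparameters).
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen2-7B-Instruct"  # one of the supported LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Wrap the base model with LoRA adapters so only a small set of weights is trained.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Hypothetical prompt/label pair in a classification-style format.
examples = [{"text": "Utterance: I can't believe you did that! Intent label: complain"}]
ds = Dataset.from_list(examples).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft_out", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-5, bf16=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```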

Fine-grained Performance on Each Dimension

We show the results of the three evaluation methods (i.e., zero-shot inference, SFT, and IT). The performance of state-of-the-art multimodal machine learning methods and GPT-4o is also shown in the figures below.

Zero-shot Inference and Supervised Fine-tuning (SFT)

(Figure: zero-shot inference and SFT results on each dimension)

Instruction Tuning (IT)

(Figure: instruction tuning results on each dimension)

Acknowledgements

If our work is helpful to your research, please consider giving us a star 🌟 and citing the following paper:

@article{zhang2025mmla,
  author={Zhang, Hanlei and Li, Zhuohang and Zhu, Yeshuang and Xu, Hua and Wang, Peiwu and Zhu, Haige and Zhou, Jie and Zhang, Jinchao},
  title={Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark},
  year={2025},
  journal={arXiv preprint arXiv:2504.16427},
}
