
Multi-modal Multiplexing Modeling


Updates

  • 2025-04-02 First release of Open-Omni-Nexus, a fully open-source implementation of a GPT-4o-like speech-to-speech video understanding model.
  • 2025-04-01 Evaluation on OmniMMI, a comprehensive multi-modal interaction benchmark in streaming video contexts.
  • 2025-04-01 First release of M4, which enables multiplexed modeling capabilities for a visual language model at minimal cost.


Introduction

We introduce Multimodal Multiplexing Modeling (M4), a framework that enhances real-time interactive reasoning with minimal fine-tuning on pre-trained MLLMs.

  • M4-IT Dataset: A synthetic instruction fine-tuning dataset whose components are interleaved image-text instructions, noise instructions, and stop instructions.
  • M4 Model: Enhances proactive response generation and assesses new queries against noise by enabling parallel decoding (see the conceptual sketch below).
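
As a rough, conceptual illustration of the multiplexing idea (not the actual M4 implementation; the helper names below are hypothetical stubs), the turn-taking behaviour can be pictured like this:

# Conceptual sketch only: hypothetical stubs, not the real M4 parallel decoding.
def classify_query(query: str) -> str:
    """Stand-in for what M4 learns from the noise/stop data in M4-IT:
    decide whether an incoming query is noise, an interrupt, or a valid request."""
    q = query.lower()
    if q in {"okay, i see.", "uh-huh."}:
        return "noise"
    if "stop" in q or "interrupt" in q:
        return "interrupt"
    return "valid"

def stream_response(answer_tokens, incoming):
    """Decode the current answer token by token; when a new query arrives,
    assess it on the fly and either ignore it (noise) or yield the turn."""
    for t, token in enumerate(answer_tokens):
        if t in incoming:
            if classify_query(incoming[t]) != "noise":
                yield "<|im_end|>"                     # end the current turn
                yield f"[answer new query: {incoming[t]}]"
                return
        yield token

# A noise query arriving at step 2 is ignored and generation continues uninterrupted.
print(list(stream_response(["The", "video", "shows", "water."], {2: "Okay, I see."})))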

M4-IT Dataset

Building on the LLaVA-NeXT-Data, we crafted a small video-free synthetic instruction finetuning dataset, M4-IT, with the assistance of GPT-4o. M4-IT comprises four components:

  • the original instruction, replayed from the instruction data of our base model
  • the interleaved image-text instruction, created by reordering the question and image components of the original instruction
  • the noise instruction, where GPT-4 is prompted to generate statements that do not require a response
  • the stop instruction, where GPT-4 is prompted to generate stop phrases
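
For concreteness, here is a minimal sketch of how the interleave, noise, and stop variants could be derived from one LLaVA-style sample. It is illustrative only: in M4-IT the noise and stop texts come from GPT-4 prompts, so the hard-coded phrases below are placeholders.

import copy
import random

def make_variants(sample: dict) -> dict:
    """Illustrative construction of M4-IT-style variants from one sample
    whose first human turn is '<image>\n<question>'."""
    variants = {}

    # Interleave: split the combined image+question turn and reorder the pieces.
    interleave = copy.deepcopy(sample)
    first = interleave["conversations"][0]
    image_tag, question = first["value"].split("\n", 1)
    parts = [{"from": "human", "value": image_tag},
             {"from": "human", "value": question}]
    random.shuffle(parts)                       # image and question in either order
    interleave["conversations"][0:1] = parts
    variants["interleave"] = interleave

    # Noise: append a remark that needs no reply; the target is end-of-turn only.
    noise = copy.deepcopy(sample)
    noise["conversations"] += [{"from": "human", "value": "Okay, I see."},
                               {"from": "gpt", "value": "<|im_end|>"}]
    variants["noise"] = noise

    # Stop: append an interruption phrase; the target is to end the current turn.
    stop = copy.deepcopy(sample)
    stop["conversations"] += [{"from": "human", "value": "Could I stop you for a second?"},
                              {"from": "gpt", "value": "<|im_end|>"}]
    variants["stop"] = stop
    return variants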

In addition, to support audio instruction tuning, we convert the user queries into audio using CosyVoice, with a randomly selected VoiceAssistant sample as the voice prompt.

Data Statistics

The M4-IT dataset comprises a total of 9,963 instructions. The distribution across different categories is as follows:

Category Count
Original 2,624
Interleave 2,376
Noise 2,563
Stop 2,500

Data sample

    {
        "id": "000000240632",
        "image": "000000240632.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\n"
            },
            {
                "from": "human",
                "value": "<speech>\n" # provide the bounding box coordinates of the region that the given sentence describes
            },
            {
                "from": "gpt",
                "value": "[0.280,0.194,0.628,0.824]"
            },
            {
                "from": "human",
                "value": "<speech>\n" # Could I stop you for a second?
            },
            {
                "from": "gpt",
                "value": "<|im_end|>"
            }
        ],
        "speech": [
            "000000240632_0.wav",
            "000000240632_1.wav"
        ]
    },

If you are interested in how the audio instructions are constructed, refer to the scripts in preprocess/tts.
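
As a rough outline of what those scripts do, the conversion loop might look like the sketch below; synthesize() is only a placeholder standing in for the actual CosyVoice call, and it assumes the pre-conversion annotations still carry the query text.

import json
import random
from pathlib import Path

def synthesize(text: str, prompt_wav: Path, out_wav: Path) -> None:
    """Placeholder for the CosyVoice call used in preprocess/tts:
    speak `text` in the voice of `prompt_wav` and write the result to `out_wav`."""
    raise NotImplementedError("wire this up to CosyVoice")

def build_audio_instructions(anno_path: Path, prompt_dir: Path, out_dir: Path) -> None:
    """Turn every human query in an instruction file into a wav file and
    record the filenames in a 'speech' field, as in the sample above."""
    samples = json.loads(anno_path.read_text())
    prompts = list(prompt_dir.glob("*.wav"))    # VoiceAssistant voices used as prompts
    for sample in samples:
        queries = [c["value"] for c in sample["conversations"]
                   if c["from"] == "human" and c["value"].strip() != "<image>"]
        sample["speech"] = []
        for i, text in enumerate(queries):
            out_wav = out_dir / f"{sample['id']}_{i}.wav"
            synthesize(text, random.choice(prompts), out_wav)
            sample["speech"].append(out_wav.name)
    anno_path.with_suffix(".audio.json").write_text(json.dumps(samples, indent=2))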

Training

Installation

This codebase is tested on CUDA 11.8 and A800-80G.

conda create -n open_gpt4o python=3.10 -y && conda activate open_gpt4o
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu118
pip install -e "src/.[train]"
pip install packaging &&  pip install ninja && pip install flash-attn==2.6.3 --no-build-isolation --no-cache-dir
pip install -r requirements.txt


Data Preparation

Download M4-IT and organize it in the following format. To enhance audio instruction-following performance, you may also download VoiceAssistant-400K and sample a portion of this dataset based on your computational resources.

intersuit/inputs
    ├── images/ # images
      └── llava-next/
        ├── ...
        └── xxxx.jpg
    ├── speech/
      ├── voiceassistant/
        ├── ...
        └── xxxx.wav
      └── interinst/
        ├── ...
        └── xxxx.wav
    └── texts/
      ├── voiceassistant.json
      ├── m4-it-qwen.json
      └── m4-it-qwen-audio.json
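
If you only sample a portion of VoiceAssistant-400K, as suggested above, a minimal way to subsample its annotation file could look like this (the 10% ratio and the output filename are just examples):

import json
import random

# Keep a random ~10% of the VoiceAssistant instructions to fit a smaller compute budget.
random.seed(0)
with open("intersuit/inputs/texts/voiceassistant.json") as f:
    data = json.load(f)

subset = random.sample(data, k=int(0.1 * len(data)))

with open("intersuit/inputs/texts/voiceassistant_subset.json", "w") as f:
    json.dump(subset, f, indent=2)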

Pretrained Backbone Preparation

Download the pretrained large video language model weights LongVA-7B and the pretrained audio encoder weights Whisper, and place them in the intersuit/checkpoints directory.

intersuit/checkpoints
    ├── LongVA-7B-Qwen2
    └── whisper/large-v3.pt

If you wish to use other LLMs or instruction tuning data, feel free to follow the LLaVA-NeXT pipeline. Here, we provide a pipeline to do visual instruction tuning on Llama-3.1-8B using the datasets blip_laion_cc_sbu_558k, LLaVA-NeXT-Data, and ShareGPTVideo. Feel free to adapt it to other models.

bash lvlm_pretrain.sh
bash lvlm_finetune.sh
bash lvlm_dpo.sh

Start Training

Our training logic is essentially the same as standard visual instruction tuning. (The training process takes ~2 hours on 4 NVIDIA A800-80G GPUs.)

cd intersuit
# finetune on m4-it
bash scripts/finetune_m4.sh
# finetune on m4-it-audio
bash scripts/finetune_m4_audio.sh

Before fine-tuning the audio version, you are encouraged to tune the vision-language model on audio instructions to improve the generality of audio understanding. (This process takes ~100 hours on 4 A800 GPUs.)

bash scripts/finetune_voiceassistant.sh

To assist those with limited computational resources, we also provide an off-the-shelf checkpoint. Check it out at Model.

To enhance the model's visual-audio understanding capabilities, we offer a script to fine-tune it on the Dataset dataset, which aims to improve visual-audio alignment. (This process takes ~140 hours on 4 A800 GPUs.)

NOTE: We find that this process is more prone to collapse than audio instruction tuning alone, so we provide a model just for further study.

bash scripts/finetune_llavanextaudio.sh

For those with limited computational resources, we also provide a ready-to-use checkpoint (17,500 steps). You can access it here: Model.

Try the visual-audio base model with:

python -m local_demo.baseline_audio_cli --video_path local_demo/assets/water.mp4 --question_audio "local_demo/wav/water.mp4.wav"

Usage

Currently, we only provide a demo, but you are welcome to deploy it using your preferred framework.

(i) attention-based proactive reasoning

cd intersuit
python -m local_demo.proactive_cli  --model_path M4-LongVA-Qwen-7B --frame_fps 1 --video_file local_demo/assets/water.mp4

(ii) multiplexing modeling

text input

cd intersuit
# new valid query
python -m local_demo.turntaking_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "How many people in the video?" --new_query_pos 20
# new interrupt query
python -m local_demo.turntaking_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Sorry to interrupt?" --new_query_pos 20
# new noise query
python -m local_demo.turntaking_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Okay, I see." --new_query_pos 20

audio input

For better visualization, you can input text, and ChatTTS will automatically convert it into audio. You can then find the generated audio in local_demo/wav.

cd intersuit
# new valid query
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "How many people in the video?" --new_query_pos 20
# new interrupt query
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Sorry to interrupt?" --new_query_pos 20
# new noise query
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Okay, I see." --new_query_pos 20

Or you can specify the audio files directly:

cd intersuit
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question_audio "XXX.wav" --new_query_audio "XXX.wav" --new_query_pos 20

Evaluation

To evaluate the interaction ability of M4 in streaming video contexts, you are encouraged to try our OmniMMI!

Roadmap

  • This work does not cover audio decoding. We are working on an end-to-end interactive omni language model (visual/speech-to-speech) and are actively seeking additional computational resources😞. In the meantime, for those who also lack computational resources, a streaming TTS module can serve as an alternative without significant delay.

Check Open-Omni-Nexus!

Acknowledgement

We thank LLaVA-NeXT, LongVA, videollm-online, and LLaMA-Omni for open-sourcing their work.

Citation

If you find our work helpful, please consider citing it.

@article{omnimmi,
    title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts},
    author={Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong},
    journal={arxiv},
    year={2025}
}
