- 2025-04-02: First release of Open-Omni-Nexus, a fully open-source implementation of a GPT-4o-like speech-to-speech video understanding model.
- 2025-04-01: Evaluation on OmniMMI, a comprehensive multi-modal interaction benchmark in streaming video contexts.
- 2025-04-01: First release of M4. M4 enables multiplexed modeling capabilities for a visual language model at minimal cost.
We introduce Multimodal Multiplexing Modeling (M4), a framework that enhances real-time interactive reasoning with minimal fine-tuning on pre-trained MLLMs.
- M4-IT Dataset: a synthetic instruction fine-tuning dataset whose components include interleaved image-text instructions, noise instructions, and stop instructions.
- M4 Model: enables parallel decoding to support proactive response generation and to judge whether a newly arriving query is a valid request or noise (see the sketch below).
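The toy sketch below illustrates the multiplexing control flow only; it is not the `intersuit` implementation, and the helper functions and the 0.5 threshold are made up. While the model streams a reply to the current query, a newly arriving query is scored; noise leaves the main stream untouched, while a valid request or interruption switches decoding to the new query.

```python
# Toy control-flow sketch of multiplexed decoding (illustrative only).
def generate_stream(query):
    """Stand-in for token-by-token decoding of a reply."""
    for tok in f"Answering: {query}".split():
        yield tok

def score_as_noise(new_query):
    """Stand-in scorer; the trained model instead checks whether it would
    emit the stop token (<|im_end|>) right after the new query."""
    return 0.9 if new_query.lower().startswith("okay") else 0.1

def run(query, new_query, new_query_pos):
    reply = []
    for i, tok in enumerate(generate_stream(query)):
        reply.append(tok)
        if i == new_query_pos:                   # a new query arrives mid-reply
            if score_as_noise(new_query) > 0.5:  # noise: keep decoding the current reply
                continue
            reply.append("...")                  # valid/interrupt: switch to the new query
            reply.extend(generate_stream(new_query))
            break
    return " ".join(reply)

print(run("Can you describe the video?", "How many people in the video?", 2))
print(run("Can you describe the video?", "Okay, I see.", 2))
```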
Building on LLaVA-NeXT-Data, we crafted a small, video-free synthetic instruction fine-tuning dataset, M4-IT, with the assistance of GPT-4o. M4-IT comprises four components:
- original instruction: data replayed from the instruction data of our base model
- interleaved image-text instruction: created by reordering the question and image components of the original instruction
- noise instruction: GPT-4 is prompted to generate statements that do not require a response
- stop instruction: GPT-4 is prompted to generate phrases that ask the model to stop responding
In addition, to assist with audio instruction tuning, we convert user queries into audio using CosyVoice, with a randomly selected VoiceAssistant-400K sample as the voice prompt.
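As a minimal sketch (field names follow the data sample below; the stop phrase here is illustrative, whereas M4-IT uses GPT-4-generated ones), the interleaved and stop variants can be derived from a LLaVA-style record like this:

```python
import copy
import random

STOP_PHRASES = ["Could I stop you for a second?"]  # illustrative; M4-IT prompts GPT-4 for these

def make_interleaved(sample):
    """Swap the <image> turn and the question turn to interleave image and text."""
    s = copy.deepcopy(sample)
    turns = s["conversations"]
    if len(turns) >= 2 and turns[0]["value"].startswith("<image>"):
        turns[0], turns[1] = turns[1], turns[0]
    return s

def add_stop_turn(sample):
    """Append a stop query whose target output is just the end-of-turn token."""
    s = copy.deepcopy(sample)
    s["conversations"] = s["conversations"] + [
        {"from": "human", "value": random.choice(STOP_PHRASES)},
        {"from": "gpt", "value": "<|im_end|>"},
    ]
    return s
```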
Data Statistics
The M4-IT dataset comprises a total of 9,963 instructions. The distribution across different categories is as follows:
| Category   | Count |
|------------|-------|
| Original   | 2,624 |
| Interleave | 2,376 |
| Noise      | 2,563 |
| Stop       | 2,500 |
Data sample
{
    "id": "000000240632",
    "image": "000000240632.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\n"
        },
        {
            "from": "human",
            "value": "<speech>\n" # provide the bounding box coordinates of the region that the given sentence describes
        },
        {
            "from": "gpt",
            "value": "[0.280,0.194,0.628,0.824]"
        },
        {
            "from": "human",
            "value": "<speech>\n" # Could I stop you for a second?
        },
        {
            "from": "gpt",
            "value": "<|im_end|>"
        }
    ],
    "speech": [
        "000000240632_0.wav",
        "000000240632_1.wav"
    ]
},
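For reference, a small loading sketch that pairs each `<speech>` placeholder with its wav file in order (the json path follows the directory layout below):

```python
import json

# Path taken from the intersuit/inputs layout described below.
with open("intersuit/inputs/texts/m4-it-qwen-audio.json") as f:
    data = json.load(f)

sample = data[0]
wav_iter = iter(sample.get("speech", []))
for turn in sample["conversations"]:
    if turn["from"] == "human" and "<speech>" in turn["value"]:
        # Each <speech> placeholder consumes the next wav file in order.
        print("human (audio):", next(wav_iter))
    else:
        print(turn["from"] + ":", turn["value"].strip())
```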
If you are interested in how the audio instructions are constructed, refer to the scripts in preprocess/tts.
This codebase is tested on CUDA 11.8 and A800-80G.
conda create -n open_gpt4o python=3.10 -y && conda activate open_gpt4o
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu118
pip install -e "src/.[train]"
pip install packaging && pip install ninja && pip install flash-attn==2.6.3 --no-build-isolation --no-cache-dir
pip install -r requirements.txt
Download M4-IT and organize it in the following format. To enhance audio instruction-following performance, you may also download VoiceAssistant-400K and sample a portion of this dataset based on your computational resources.
intersuit/inputs
├── images/ # images
│   └── llava-next/
│       ├── ...
│       └── xxxx.jpg
├── speech/
│   ├── voiceassistant/
│   │   ├── ...
│   │   └── xxxx.wav
│   └── interinst/
│       ├── ...
│       └── xxxx.wav
└── texts/
    ├── voiceassistant.json
    ├── m4-it-qwen.json
    └── m4-it-qwen-audio.json
Download the pretrained large video language model weights LongVA-7B and the pretrained audio encoder weights Whisper, and place them in the intersuit/checkpoints directory.
intersuit/checkpoints
├── LongVA-7B-Qwen2
└── whisper/large-v3.pt
If you wish to use other LLMs or instruction-tuning data, feel free to follow the LLaVA-NeXT pipeline. Here, we provide a pipeline for visual instruction tuning on Llama-3.1-8B using blip_laion_cc_sbu_558k, LLaVA-NeXT-Data, and ShareGPTVideo; it can be adapted to other models.
bash lvlm_pretrain.sh
bash lvlm_finetune.sh
bash lvlm_dpo.sh
Our training logic is essentially the same as standard visual instruction tuning. (Training takes ~2 hours on 4 NVIDIA A800-80G GPUs.)
cd intersuit
# finetune on m4-it
bash scripts/finetune_m4.sh
# finetune on m4-it-audio
bash scripts/finetune_m4_audio.sh
Before fine-tuning the audio version, you are encouraged to tune the vision-language model on audio instructions to improve the generality of audio understanding. (This process takes ~100 hours on 4 A800 GPUs.)
bash scripts/finetune_voiceassistant.sh
To assist those with limited computational resources, we also provide an off-the-shelf checkpoint. Check it out at
To enhance the model's visual-audio understanding capabilities, we offer a script to fine-tune it on the dataset, aiming to improve visual-audio alignment. (This process takes ~140 hours on 4 A800 GPUs.)
NOTE: We find that this process is more prone to collapse than audio instruction tuning alone, so the released model is provided for further study only.
bash scripts/finetune_llavanextaudio.sh
For those with limited computational resources, we also provide a ready-to-use checkpoint (17,500 steps). You can access it here
Try the visual-audio base model with python -m local_demo.baseline_audio_cli --video_path local_demo/assets/water.mp4 --question_audio "local_demo/wav/water.mp4.wav"
Currently, we only provide a demo, but you are welcome to deploy it using your preferred framework.
(i) attention-based proactive reasoning
cd intersuit
python -m local_demo.proactive_cli --model_path M4-LongVA-Qwen-7B --frame_fps 1 --video_file local_demo/assets/water.mp4
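The snippet below is only a simplified illustration of attention-based triggering; the threshold, the token layout, and the helper name are invented and are not taken from proactive_cli. The idea: at each streamed frame, check how much attention the current decoding state places on the newest frame's visual tokens, and start a response once that mass is large enough.

```python
import numpy as np

THRESHOLD = 0.35  # illustrative trigger value, not the one used by proactive_cli

def should_respond(attn_last_token: np.ndarray, new_frame: slice) -> bool:
    """attn_last_token: attention distribution of the last decoded token over all
    context tokens; new_frame: positions of the newest frame's visual tokens."""
    return float(attn_last_token[new_frame].sum()) > THRESHOLD

# Toy usage with a random attention row standing in for real model output.
rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(128))           # softmax-like row that sums to 1
print(should_respond(attn, slice(96, 128)))  # assume the newest frame owns the last 32 positions
```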
(ii) multiplexing modeling
text input
cd intersuit
# new valid query
python -m local_demo.turntaking_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "How many people in the video?" --new_query_pos 20
# new interrupt query
python -m local_demo.turntaking_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Sorry to interrupt?" --new_query_pos 20
# new noise query
python -m local_demo.turntaking_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Okay, I see." --new_query_pos 20
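To see what the noise and stop instructions buy, here is a standalone sketch (not the turntaking_cli code path; the checkpoint name is a placeholder, the real demo also feeds video tokens, and only an M4-tuned model is trained to answer noise with an immediate stop token): score the probability that the assistant's first token after the new query is `<|im_end|>`. A high value suggests the query needs no reply, so the original decoding stream can simply continue.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2-7B-Instruct"  # placeholder backbone, not the released M4 checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

def stop_probability(new_query: str) -> float:
    """Probability that the assistant's very first token is <|im_end|>."""
    ids = tok.apply_chat_template(
        [{"role": "user", "content": new_query}],
        add_generation_prompt=True,   # prompt ends with "<|im_start|>assistant\n"
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        next_token_logits = model(ids).logits[0, -1]
    stop_id = tok.convert_tokens_to_ids("<|im_end|>")
    return next_token_logits.softmax(-1)[stop_id].item()

print(stop_probability("Okay, I see."))                   # noise: expect a higher stop probability
print(stop_probability("How many people in the video?"))  # valid query: expect a lower one
```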
audio input
For better visualization, you can input text, and ChatTTS will automatically convert it into audio. You can then find the generated audio in local_demo/wav.
cd intersuit
# new valid query
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "How many people in the video?" --new_query_pos 20
# new interrupt query
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Sorry to interrupt?" --new_query_pos 20
# new noise query
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Okay, I see." --new_query_pos 20
Or you can specify the audio files directly:
cd intersuit
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question_audio "XXX.wav" --new_query_audio "XXX.wav" --new_query_pos 20
To evaluate the interaction ability of M4 in streaming video contexts, you are encouraged to try our OmniMMI!
- This work does not cover audio decoding. I am working on an end-to-end interactive omni-language model (visual/speech-to-speech) and actively seeking additional computational resources. For those who also lack computational resources, I believe a streaming TTS could serve as an alternative without significant delay.
Check out Open-Omni-Nexus!
We thank LLaVA-NeXT, LongVA, videollm-online, and LLaMA-Omni for open-sourcing their work.
If you find our work helpful, please consider citing it.
@article{omnimmi,
title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts},
author={Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong},
journal={arXiv preprint},
year={2025}
}