- 2025-04-02: First release of Open-Omni-Nexus, a fully open-source implementation of a GPT-4o-like speech-to-speech video understanding model.
- 2025-04-01: Evaluation on OmniMMI, a comprehensive multi-modal interaction benchmark in streaming video contexts.
- 2025-04-01: First release of M4. M4 enables multiplexed modeling capabilities for a visual language model at minimal cost.
We introduce Multimodal Multiplexing Modeling (M4), a framework that enhances real-time interactive reasoning with minimal fine-tuning on pre-trained MLLMs.
- M4-IT Dataset: a synthetic instruction fine-tuning dataset whose components include interleaved image-text instructions, noise instructions, and stop instructions.
- M4 Model: enables parallel decoding to support proactive response generation and to judge whether a newly arriving query is a valid request or noise (see the sketch below).
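The toy sketch below illustrates the multiplexing control flow only; it is not the `intersuit` implementation, and the helper functions and the 0.5 threshold are made up. While the model streams a reply to the current query, a newly arriving query is scored; noise leaves the main stream untouched, while a valid request or interruption switches decoding to the new query.

```python
# Toy control-flow sketch of multiplexed decoding (illustrative only).
def generate_stream(query):
    """Stand-in for token-by-token decoding of a reply."""
    for tok in f"Answering: {query}".split():
        yield tok

def score_as_noise(new_query):
    """Stand-in scorer; the trained model instead checks whether it would
    emit the stop token (<|im_end|>) right after the new query."""
    return 0.9 if new_query.lower().startswith("okay") else 0.1

def run(query, new_query, new_query_pos):
    reply = []
    for i, tok in enumerate(generate_stream(query)):
        reply.append(tok)
        if i == new_query_pos:                   # a new query arrives mid-reply
            if score_as_noise(new_query) > 0.5:  # noise: keep decoding the current reply
                continue
            reply.append("...")                  # valid/interrupt: switch to the new query
            reply.extend(generate_stream(new_query))
            break
    return " ".join(reply)

print(run("Can you describe the video?", "How many people in the video?", 2))
print(run("Can you describe the video?", "Okay, I see.", 2))
```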
Building on LLaVA-NeXT-Data, we crafted a small, video-free synthetic instruction fine-tuning dataset, M4-IT, with the assistance of GPT-4o. M4-IT comprises four components:
- original instruction: data replayed from the instruction data of our base model
- interleaved image-text instruction: created by reordering the question and image components of the original instruction
- noise instruction: GPT-4 is prompted to generate statements that do not require a response
- stop instruction: GPT-4 is prompted to generate phrases that ask the model to stop responding
In addition, to assist with audio instruction tuning, we convert user queries into audio using CosyVoice, with a randomly selected VoiceAssistant-400K sample as the voice prompt.
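As a minimal sketch (field names follow the data sample below; the stop phrase here is illustrative, whereas M4-IT uses GPT-4-generated ones), the interleaved and stop variants can be derived from a LLaVA-style record like this:

```python
import copy
import random

STOP_PHRASES = ["Could I stop you for a second?"]  # illustrative; M4-IT prompts GPT-4 for these

def make_interleaved(sample):
    """Swap the <image> turn and the question turn to interleave image and text."""
    s = copy.deepcopy(sample)
    turns = s["conversations"]
    if len(turns) >= 2 and turns[0]["value"].startswith("<image>"):
        turns[0], turns[1] = turns[1], turns[0]
    return s

def add_stop_turn(sample):
    """Append a stop query whose target output is just the end-of-turn token."""
    s = copy.deepcopy(sample)
    s["conversations"] = s["conversations"] + [
        {"from": "human", "value": random.choice(STOP_PHRASES)},
        {"from": "gpt", "value": "<|im_end|>"},
    ]
    return s
```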
Data Statistics
The M4-IT dataset comprises a total of 9,963 instructions. The distribution across different categories is as follows:
| Category   | Count |
|------------|-------|
| Original   | 2,624 |
| Interleave | 2,376 |
| Noise      | 2,563 |
| Stop       | 2,500 |
Data sample
{
    "id": "000000240632",
    "image": "000000240632.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\n"
        },
        {
            "from": "human",
            "value": "<speech>\n" # provide the bounding box coordinates of the region that the given sentence describes
        },
        {
            "from": "gpt",
            "value": "[0.280,0.194,0.628,0.824]"
        },
        {
            "from": "human",
            "value": "<speech>\n" # Could I stop you for a second?
        },
        {
            "from": "gpt",
            "value": "<|im_end|>"
        }
    ],
    "speech": [
        "000000240632_0.wav",
        "000000240632_1.wav"
    ]
},
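For reference, a small loading sketch that pairs each `<speech>` placeholder with its wav file in order (the json path follows the directory layout below):

```python
import json

# Path taken from the intersuit/inputs layout described below.
with open("intersuit/inputs/texts/m4-it-qwen-audio.json") as f:
    data = json.load(f)

sample = data[0]
wav_iter = iter(sample.get("speech", []))
for turn in sample["conversations"]:
    if turn["from"] == "human" and "<speech>" in turn["value"]:
        # Each <speech> placeholder consumes the next wav file in order.
        print("human (audio):", next(wav_iter))
    else:
        print(turn["from"] + ":", turn["value"].strip())
```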
If you are interested in how the audio instructions are constructed, refer to the scripts in preprocess/tts.
This codebase is tested on CUDA 11.8 and A800-80G.
conda create -n open_gpt4o python=3.10 -y && conda activate open_gpt4o
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu118
pip install -e "src/.[train]"
pip install packaging && pip install ninja && pip install flash-attn==2.6.3 --no-build-isolation --no-cache-dir
pip install -r requirements.txt
Download M4-IT and organize it in the following format. To enhance audio instruction-following performance, you may also download VoiceAssistant-400K and sample a portion of this dataset based on your computational resources.
intersuit/inputs
├── images/ # images
│   └── llava-next/
│       ├── ...
│       └── xxxx.jpg
├── speech/
│   ├── voiceassistant/
│   │   ├── ...
│   │   └── xxxx.wav
│   └── interinst/
│       ├── ...
│       └── xxxx.wav
└── texts/
    ├── voiceassistant.json
    ├── m4-it-qwen.json
    └── m4-it-qwen-audio.json
Download the pretrained large video language model weights LongVA-7B and the pretrained audio encoder weights Whisper, and place them in the intersuit/checkpoints directory.
intersuit/checkpoints
├── LongVA-7B-Qwen2
└── whisper/large-v3.pt
If you wish to use other LLMs or instruction-tuning data, feel free to follow the LLaVA-NeXT pipeline. Here, we provide a pipeline for visual instruction tuning on Llama-3.1-8B using blip_laion_cc_sbu_558k, LLaVA-NeXT-Data, and ShareGPTVideo; it can be adapted to other models.
bash lvlm_pretrain.sh
bash lvlm_finetune.sh
bash lvlm_dpo.sh
Our training logic is essentially the same as standard visual instruction tuning. (Training takes ~2 hours on 4 NVIDIA A800-80G GPUs.)
cd intersuit
# finetune on m4-it
bash scripts/finetune_m4.sh
# finetune on m4-it-audio
bash scripts/finetune_m4_audio.sh
Before fine-tuning the audio version, you are encouraged to tune the vision-language model on audio instructions to improve the generality of audio understanding. (This process takes ~100 hours on 4 A800 GPUs.)
bash scripts/finetune_voiceassistant.sh
To assist those with limited computational resources, we also provide an off-the-shelf checkpoint. Check it out at
To enhance the model's visual-audio understanding capabilities, we offer a script to fine-tune it on the dataset, aiming to improve visual-audio alignment. (This process takes ~140 hours on 4 A800 GPUs.)
NOTE: We find that this process is more prone to collapse than audio instruction tuning alone, so the released model is provided for further study only.
bash scripts/finetune_llavanextaudio.sh
For those with limited computational resources, we also provide a ready-to-use checkpoint (17,500 steps). You can access it here
Try the visual-audio base model with python -m local_demo.baseline_audio_cli --video_path local_demo/assets/water.mp4 --question_audio "local_demo/wav/water.mp4.wav"
Currently, we only provide a demo, but you are welcome to deploy it using your preferred framework.
(i) attention-based proactive reasoning
cd intersuit
python -m local_demo.proactive_cli --model_path M4-LongVA-Qwen-7B --frame_fps 1 --video_file local_demo/assets/water.mp4
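The snippet below is only a simplified illustration of attention-based triggering; the threshold, the token layout, and the helper name are invented and are not taken from proactive_cli. The idea: at each streamed frame, check how much attention the current decoding state places on the newest frame's visual tokens, and start a response once that mass is large enough.

```python
import numpy as np

THRESHOLD = 0.35  # illustrative trigger value, not the one used by proactive_cli

def should_respond(attn_last_token: np.ndarray, new_frame: slice) -> bool:
    """attn_last_token: attention distribution of the last decoded token over all
    context tokens; new_frame: positions of the newest frame's visual tokens."""
    return float(attn_last_token[new_frame].sum()) > THRESHOLD

# Toy usage with a random attention row standing in for real model output.
rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(128))           # softmax-like row that sums to 1
print(should_respond(attn, slice(96, 128)))  # assume the newest frame owns the last 32 positions
```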
(ii) multiplexing modeling
text input
cd intersuit
# new valid query
python -m local_demo.turntaking_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "How many people in the video?" --new_query_pos 20
# new interrupt query
python -m local_demo.turntaking_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Sorry to interrupt?" --new_query_pos 20
# new noise query
python -m local_demo.turntaking_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Okay, I see." --new_query_pos 20
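To see what the noise and stop instructions buy, here is a standalone sketch (not the turntaking_cli code path; the checkpoint name is a placeholder, the real demo also feeds video tokens, and only an M4-tuned model is trained to answer noise with an immediate stop token): score the probability that the assistant's first token after the new query is `<|im_end|>`. A high value suggests the query needs no reply, so the original decoding stream can simply continue.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2-7B-Instruct"  # placeholder backbone, not the released M4 checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

def stop_probability(new_query: str) -> float:
    """Probability that the assistant's very first token is <|im_end|>."""
    ids = tok.apply_chat_template(
        [{"role": "user", "content": new_query}],
        add_generation_prompt=True,   # prompt ends with "<|im_start|>assistant\n"
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        next_token_logits = model(ids).logits[0, -1]
    stop_id = tok.convert_tokens_to_ids("<|im_end|>")
    return next_token_logits.softmax(-1)[stop_id].item()

print(stop_probability("Okay, I see."))                   # noise: expect a higher stop probability
print(stop_probability("How many people in the video?"))  # valid query: expect a lower one
```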
audio input
For better visualization, you can input text, and ChatTTS will automatically convert it into audio. You can then find the generated audio in local_demo/wav.
cd intersuit
# new valid query
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "How many people in the video?" --new_query_pos 20
# new interrupt query
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Sorry to interrupt?" --new_query_pos 20
# new noise query
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question "Can you describe the video?" --new_query "Okay, I see." --new_query_pos 20
Or you can specify the audio files directly:
cd intersuit
python -m local_demo.turntaking_audio_cli --video_path local_demo/assets/water.mp4 --question_audio "XXX.wav" --new_query_audio "XXX.wav" --new_query_pos 20
To evaluate the interaction ability of M4 in streaming video contexts, you are encouraged to try our OmniMMI!
- This work does not cover audio decoding. I am working on an end-to-end interactive omni-language model (visual/speech-to-speech) and actively seeking additional computational resources. For those who also lack computational resources, I believe a streaming TTS could serve as an alternative without significant delay.
Check out Open-Omni-Nexus!
We thank LLaVA-NeXT, LongVA, videollm-online, and LLaMA-Omni for open-sourcing their work.
If you find our work helpful, please consider citing it.
@article{omnimmi,
title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts},
author={Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong},
journal={arXiv preprint},
year={2025}
}