Switch to the Chinese version (切换至中文版)
- [2025/3/11] 🚀 Our codebase has been merged into OpenRLHF-M, the official multimodal RL infrastructure developed by OpenRLHF.
- [2025/3/11] ✨ We release our paper "LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL"!
- [2025/2/13] We release the code of LMM-R1!
Smaller Large Multimodal Models (LMMs) at the 3B scale struggle with reasoning tasks due to their limited parameter capacity and the inherent complexity of integrating visual perception with logical reasoning. High-quality multimodal reasoning data is also scarce, which further complicates training. To address these challenges, we propose LMM-R1, a two-stage rule-based RL framework that efficiently enhances reasoning capabilities:
- Foundational Reasoning Enhancement (FRE): Uses text-only data to build strong reasoning foundations
- Multimodal Generalization Training (MGT): Extends these capabilities to multimodal domains
This approach overcomes data limitations while significantly improving performance across diverse reasoning tasks.
Demos: Geometry Question and Sokoban.
git clone https://github.com/TideDra/lmm-r1.git
cd lmm-r1
pip install -e .[vllm]
pip install flash_attn --no-build-isolation
Note
We recommend using vLLM 0.7.2 or higher. We also provide Dockerfiles for vLLM and a one-click installation script for Nvidia-Docker.
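A quick way to confirm the installed vLLM meets this recommendation (assuming the installation above succeeded):
# Print the installed vLLM version; it should be 0.7.2 or higher.
python -c "import vllm; print(vllm.__version__)"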
LMM-R1 requires the multimodal prompt dataset to be in OpenAI-compatible message format:
[
  {
    "message": "[
      {
        \"role\": \"user\",
        \"content\": [
          {
            \"type\": \"image\",
            \"image\": \"file:///path/to/your/image.jpg\"
          },
          {\"type\": \"text\", \"text\": \"How many cats in the image?\"}
        ]
      }
    ]",
    "answer": "$3$"
  }
]
Note that `message` is a stringified list: the inner message list is stored as a single JSON string. An example dataset, `examples/data/test_message.jsonl`, is provided for reference.
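A minimal sketch of how such an entry can be appended to a JSONL file (assuming `jq` is installed; the output path is illustrative):
# Build the inner message list as a JSON string, then wrap it so that the
# "message" field of the resulting JSONL entry holds the stringified list.
message='[{"role": "user", "content": [{"type": "image", "image": "file:///path/to/your/image.jpg"}, {"type": "text", "text": "How many cats in the image?"}]}]'
jq -cn --arg message "$message" --arg answer '$3$' \
  '{message: $message, answer: $answer}' >> my_prompts.jsonl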
- We can use `--input_key` to specify the JSON key name of the input datasets passed via `--prompt_data {name or path}` (PPO) or `--dataset {name or path}`; see the launch sketch after this list. Do not use `--apply_chat_template` for multimodal prompts; the messages are processed internally.
- OpenRLHF also supports mixing multiple datasets using `--prompt_data_probs 0.1,0.4,0.5` (PPO) or `--dataset_probs 0.1,0.4,0.5`.
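For orientation, a hypothetical excerpt of a launch command using these data options (the `openrlhf.cli.train_ppo_ray` entrypoint and the base model path are assumptions carried over from upstream OpenRLHF; the scripts under `examples/scripts/lmm_r1/` are the tested configurations):
# Hypothetical excerpt: point PPO training at a multimodal JSONL prompt file.
# --apply_chat_template is intentionally omitted for multimodal prompts.
python3 -m openrlhf.cli.train_ppo_ray \
  --pretrain Qwen/Qwen2.5-VL-3B-Instruct \
  --prompt_data examples/data/test_message.jsonl \
  --input_key message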
Our training process follows the two-stage approach described in the paper. We provide scripts for each stage to facilitate reproduction of our results.
The first stage, Foundational Reasoning Enhancement (FRE), strengthens the model's reasoning capabilities using text-only data.
# Train with text-only data (FRE-Text)
bash examples/scripts/lmm_r1/train_fre_text.sh
# Train with multimodal data (FRE-Multi) for comparison
bash examples/scripts/lmm_r1/train_fre_multi.sh
The FRE-Text script uses the DeepScaler-40K dataset with rule-based RL to enhance the model's foundational reasoning capabilities. This stage is crucial for establishing strong reasoning abilities before moving to multimodal tasks.
The second stage, Multimodal Generalization Training (MGT), extends these reasoning capabilities to multimodal domains through continued training on specific tasks.
# Train on geometry domain (MGT-Geo)
bash examples/scripts/lmm_r1/train_mgt_geo.sh
# Train on perception-reasoning balanced domain (MGT-PerceReason)
bash examples/scripts/lmm_r1/train_mgt_percereas.sh
Each MGT script continues training from the FRE-Text checkpoint, focusing on a specific domain:
- MGT-Geo: Uses the VerMulti-Geo dataset (15K geometry problems) to enhance geometric reasoning.
- MGT-PerceReason: Uses the full VerMulti dataset to balance perception and reasoning capabilities.
We release our final model, MGT-PerceReason.
We also provide scripts for direct RL training without the FRE stage, which we use as comparison baselines in our paper:
# Direct RL training on geometry domain
bash examples/scripts/lmm_r1/train_direct_rl_geo.sh
These scripts train the baseline model directly on domain-specific data, skipping the FRE stage, which helps demonstrate the effectiveness of our two-stage approach.
LMM-R1 is a fork of OpenRLHF aimed at providing high-performance RL infrastructure for enhancing the multimodal reasoning capabilities of LMMs. We currently support PPO/REINFORCE++/RLOO training for LMMs and achieve a 4.7x speedup (RLOO) compared with R1-V (GRPO).
- Support LMM training (Qwen2.5-VL, Phi3.5-V, Phi4-Multimodal).
- Distributed PPO and REINFORCE++/RLOO implementations based on Ray.
- Ray-based Reinforced Finetuning.
- Support for Ray-based PPO and REINFORCE++/RLOO using the Hybrid Engine (`--colocate_all_models`, `--vllm_enable_sleep` and `--vllm_gpu_memory_utilization 0.5`); see the sketch after this list.
- Full RLHF fine-tuning support for models with over 70 billion parameters.
- Integration with vLLM for accelerated generation in RLHF tasks (`--vllm_num_engines`).
- Support for multiple reward models (`--reward_pretrain model1,model2...`) and remote reward models (`--remote_rm_url`).
- Integration of FlashAttention2 (`--flash_attn`).
- Support for QLoRA (`--load_in_4bit`) and LoRA (`--lora_rank`, `--target_modules`).
- Logging support with Wandb (`--use_wandb`) and TensorBoard (`--use_tensorboard`).
- Checkpoint recovery functionality (`--load_checkpoint` and `--save_steps`).
- Provided multi-node training scripts, such as Ray PPO.
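To show roughly how the Hybrid Engine, vLLM, logging, and checkpointing options above fit together, here is a hedged sketch (the entrypoint is assumed from upstream OpenRLHF and the values are placeholders; see `examples/scripts/lmm_r1/` for tested settings):
# Sketch only: hybrid-engine training with colocated vLLM engines, FlashAttention2,
# Wandb logging, and periodic checkpointing, using the flags listed above.
python3 -m openrlhf.cli.train_ppo_ray \
  --colocate_all_models \
  --vllm_enable_sleep \
  --vllm_gpu_memory_utilization 0.5 \
  --vllm_num_engines 4 \
  --flash_attn \
  --use_wandb {your_wandb_token} \
  --load_checkpoint \
  --save_steps 20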
We sincerely thank DeepSeek for their exploration of LLM reasoning, and OpenRLHF for their incredible RL infrastructure. We also thank open-r1 and simpleRL-reason, which gave us insights into reproducing R1. Yingzhe Peng's work was completed during his internship at Ant Group, where Kai Yang was his mentor. Special thanks to Kai Yang, Jie Liu, and Zhiyuan You for their valuable suggestions, and to the Big Data Computing Center of Southeast University for hardware support.
If you find LMM-R1 useful for your research and applications, please cite using this BibTeX:
@article{peng2025lmmr1,
title={LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL},
author={Peng, Yingzhe and Zhang, Gongrui and Zhang, Miaosen and You, Zhiyuan and Liu, Jie and Zhu, Qipeng and Yang, Kai and Xu, Xingzhong and Geng, Xin and Yang, Xu},
journal={arXiv preprint arXiv:2503.07536},
year={2025}
}