Switch to the Chinese version (切换至中文版)
- [2025/3/11] 🚀 Our codebase has been merged into OpenRLHF-M, the official multimodal RL infrastructure developed by OpenRLHF.
- [2025/3/11] ✨ We release our paper "LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL"!
- [2025/2/13] We release the code of LMM-R1!
Smaller Large Multimodal Models (LMMs) at the 3B scale struggle with reasoning tasks due to their limited parameter capacity and the inherent complexity of integrating visual perception with logical reasoning. High-quality multimodal reasoning data is also scarce, which further complicates training. To address these challenges, we propose LMM-R1, a two-stage rule-based RL framework that efficiently enhances reasoning capabilities:
- Foundational Reasoning Enhancement (FRE): Uses text-only data to build strong reasoning foundations
- Multimodal Generalization Training (MGT): Extends these capabilities to multimodal domains
This approach overcomes data limitations while significantly improving performance across diverse reasoning tasks.
Demos: Geometry Question and Sokoban.
git clone https://github.com/TideDra/lmm-r1.git
cd lmm-r1
pip install -e .[vllm]
pip install flash_attn --no-build-isolation
Note
We recommend using vLLM 0.7.2 or higher. We also provide Dockerfiles for vLLM and a one-click installation script for Nvidia-Docker.
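A quick way to confirm the installed vLLM meets this recommendation (assuming the installation above succeeded):
# Print the installed vLLM version; it should be 0.7.2 or higher.
python -c "import vllm; print(vllm.__version__)"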
LMM-R1 requires the multimodal prompt dataset to be in OpenAI-compatible message format:
[
  {
    "message": "[
      {
        \"role\": \"user\",
        \"content\": [
          {
            \"type\": \"image\",
            \"image\": \"file:///path/to/your/image.jpg\"
          },
          {\"type\": \"text\", \"text\": \"How many cats in the image?\"}
        ]
      }
    ]",
    "answer": "$3$"
  }
]
Note that `message` is a stringified list: the inner message list is stored as a single JSON string. An example dataset, `examples/data/test_message.jsonl`, is provided for reference.
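A minimal sketch of how such an entry can be appended to a JSONL file (assuming `jq` is installed; the output path is illustrative):
# Build the inner message list as a JSON string, then wrap it so that the
# "message" field of the resulting JSONL entry holds the stringified list.
message='[{"role": "user", "content": [{"type": "image", "image": "file:///path/to/your/image.jpg"}, {"type": "text", "text": "How many cats in the image?"}]}]'
jq -cn --arg message "$message" --arg answer '$3$' \
  '{message: $message, answer: $answer}' >> my_prompts.jsonl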
- We can use `--input_key` to specify the JSON key name of the input datasets passed via `--prompt_data {name or path}` (PPO) or `--dataset {name or path}`; see the launch sketch after this list. Do not use `--apply_chat_template` for multimodal prompts; the messages are processed internally.
- OpenRLHF also supports mixing multiple datasets using `--prompt_data_probs 0.1,0.4,0.5` (PPO) or `--dataset_probs 0.1,0.4,0.5`.
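For orientation, a hypothetical excerpt of a launch command using these data options (the `openrlhf.cli.train_ppo_ray` entrypoint and the base model path are assumptions carried over from upstream OpenRLHF; the scripts under `examples/scripts/lmm_r1/` are the tested configurations):
# Hypothetical excerpt: point PPO training at a multimodal JSONL prompt file.
# --apply_chat_template is intentionally omitted for multimodal prompts.
python3 -m openrlhf.cli.train_ppo_ray \
  --pretrain Qwen/Qwen2.5-VL-3B-Instruct \
  --prompt_data examples/data/test_message.jsonl \
  --input_key message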
Our training process follows the two-stage approach described in the paper. We provide scripts for each stage to facilitate reproduction of our results.
The first stage, Foundational Reasoning Enhancement (FRE), strengthens the model's reasoning capabilities using text-only data.
# Train with text-only data (FRE-Text)
bash examples/scripts/lmm_r1/train_fre_text.sh
# Train with multimodal data (FRE-Multi) for comparison
bash examples/scripts/lmm_r1/train_fre_multi.sh
The FRE-Text script uses the DeepScaler-40K dataset with rule-based RL to enhance the model's foundational reasoning capabilities. This stage is crucial for establishing strong reasoning abilities before moving to multimodal tasks.
The second stage, Multimodal Generalization Training (MGT), extends these reasoning capabilities to multimodal domains through continued training on specific tasks.
# Train on geometry domain (MGT-Geo)
bash examples/scripts/lmm_r1/train_mgt_geo.sh
# Train on perception-reasoning balanced domain (MGT-PerceReason)
bash examples/scripts/lmm_r1/train_mgt_percereas.sh
Each MGT script continues training from the FRE-Text checkpoint, focusing on a specific domain:
- MGT-Geo: Uses the VerMulti-Geo dataset (15K geometry problems) to enhance geometric reasoning.
- MGT-PerceReason: Uses the full VerMulti dataset to balance perception and reasoning capabilities.
We release our final model, MGT-PerceReason.
We also provide scripts for direct RL training without the FRE stage, which we use as comparison baselines in our paper:
# Direct RL training on geometry domain
bash examples/scripts/lmm_r1/train_direct_rl_geo.sh
These scripts train the baseline model directly on domain-specific data, skipping the FRE stage, which helps demonstrate the effectiveness of our two-stage approach.
LMM-R1 is a fork of OpenRLHF aimed at providing high-performance RL infrastructure for enhancing the multimodal reasoning capabilities of LMMs. We currently support PPO/REINFORCE++/RLOO training for LMMs and achieve a 4.7x speedup (RLOO) compared with R1-V (GRPO).
- Support LMM training (Qwen2.5-VL, Phi3.5-V, Phi4-Multimodal).
- Distributed PPO and REINFORCE++/RLOO implementations based on Ray.
- Ray-based Reinforced Finetuning.
- Support for Ray-based PPO and REINFORCE++/RLOO using the Hybrid Engine (`--colocate_all_models`, `--vllm_enable_sleep` and `--vllm_gpu_memory_utilization 0.5`); see the sketch after this list.
- Full RLHF fine-tuning support for models with over 70 billion parameters.
- Integration with vLLM for accelerated generation in RLHF tasks (`--vllm_num_engines`).
- Support for multiple reward models (`--reward_pretrain model1,model2...`) and remote reward models (`--remote_rm_url`).
- Integration of FlashAttention2 (`--flash_attn`).
- Support for QLoRA (`--load_in_4bit`) and LoRA (`--lora_rank`, `--target_modules`).
- Logging support with Wandb (`--use_wandb`) and TensorBoard (`--use_tensorboard`).
- Checkpoint recovery functionality (`--load_checkpoint` and `--save_steps`).
- Provided multi-node training scripts, such as Ray PPO.
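To show roughly how the Hybrid Engine, vLLM, logging, and checkpointing options above fit together, here is a hedged sketch (the entrypoint is assumed from upstream OpenRLHF and the values are placeholders; see `examples/scripts/lmm_r1/` for tested settings):
# Sketch only: hybrid-engine training with colocated vLLM engines, FlashAttention2,
# Wandb logging, and periodic checkpointing, using the flags listed above.
python3 -m openrlhf.cli.train_ppo_ray \
  --colocate_all_models \
  --vllm_enable_sleep \
  --vllm_gpu_memory_utilization 0.5 \
  --vllm_num_engines 4 \
  --flash_attn \
  --use_wandb {your_wandb_token} \
  --load_checkpoint \
  --save_steps 20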
We sincerely thank DeepSeek for their exploration of LLM reasoning, and OpenRLHF for their incredible RL infrastructure. We also thank open-r1 and simpleRL-reason, which gave us insights into reproducing R1. Yingzhe Peng's work was completed during his internship at Ant Group, where Kai Yang was his mentor. Special thanks to Kai Yang, Jie Liu, and Zhiyuan You for their valuable suggestions, and to the Big Data Computing Center of Southeast University for hardware support.
If you find LMM-R1 useful for your research and applications, please cite using this BibTeX:
@article{peng2025lmmr1,
title={LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL},
author={Peng, Yingzhe and Zhang, Gongrui and Zhang, Miaosen and You, Zhiyuan and Liu, Jie and Zhu, Qipeng and Yang, Kai and Xu, Xingzhong and Geng, Xin and Yang, Xu},
journal={arXiv preprint arXiv:2503.07536},
year={2025}
}