
[LMM-R1 logo]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities

Open-source / Comprehensive / Lightweight / Easy-to-use


🤗 HF Dataset 🤗 HF Model 📄 Paper 🌐 Project Page

Switch to the Chinese version (切换至中文版)


Introduction

Small 3B Large Multimodal Models (LMMs) struggle with reasoning tasks due to their limited parameter capacity and the inherent complexity of integrating visual perception with logical reasoning. High-quality multimodal reasoning data is also scarce, which further complicates training. To address these challenges, we propose LMM-R1, a two-stage rule-based RL framework that efficiently enhances reasoning capabilities:

  1. Foundational Reasoning Enhancement (FRE): Uses text-only data to build strong reasoning foundations
  2. Multimodal Generalization Training (MGT): Extends these capabilities to multimodal domains

This approach overcomes data limitations while significantly improving performance across diverse reasoning tasks.

[Figure: overview of the two-stage LMM-R1 training pipeline]

Demo

Geometry Question:

[Demo video: solving a geometry question]

Sokoban Demo:

[Demo video: playing Sokoban]

Quick Start

Installation

git clone https://github.com/TideDra/lmm-r1.git
cd lmm-r1
pip install -e .[vllm]
pip install flash_attn --no-build-isolation

Note

We recommend using vLLM 0.7.2 or higher. We also provide Dockerfiles for vLLM and a one-click installation script for Nvidia-Docker.

Prepare Datasets

LMM-R1 requires the multimodal prompt dataset to be in OpenAI-compatible message format:

[
  {
    "message":"[
      {
        \"role\": \"user\",
        \"content\": [
            {
                \"type\": \"image\",
                \"image\": \"file:///path/to/your/image.jpg\",
            },
            {\"type\": \"text\", \"text\": \"How many cats in the image?\"},
        ],
      }
    ]",
    "answer": "$3$"
  },
]

Note that message is a stringified list. See the example dataset examples/data/test_message.jsonl for reference.
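
For clarity, below is a minimal Python sketch of how such a JSONL entry could be produced; the image path, question, answer, and output filename (my_dataset.jsonl) are placeholders, and json.dumps performs the stringification:

import json

# Build one training sample: "message" is the stringified OpenAI-compatible
# message list, and "answer" holds the verifiable reference answer.
message = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "How many cats in the image?"},
        ],
    }
]

sample = {
    "message": json.dumps(message),  # stringified list, as shown above
    "answer": "$3$",
}

# Append the sample to a JSONL dataset (one JSON object per line).
with open("my_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")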

  • Use --input_key to specify the JSON key name of the input dataset passed via --prompt_data {name or path} (PPO) or --dataset {name or path}. Do not use --apply_chat_template for multimodal prompts; the message is processed internally.
  • OpenRLHF also supports mixing multiple datasets with --prompt_data_probs 0.1,0.4,0.5 (PPO) or --dataset_probs 0.1,0.4,0.5.

Training

Our training process follows the two-stage approach described in the paper. We provide scripts for each stage to facilitate reproduction of our results.

Stage 1: Foundational Reasoning Enhancement (FRE)

This stage focuses on enhancing the model's reasoning capabilities using text-only data.

# Train with text-only data (FRE-Text)
bash examples/scripts/lmm_r1/train_fre_text.sh

# Train with multimodal data (FRE-Multi) for comparison
bash examples/scripts/lmm_r1/train_fre_multi.sh

The FRE-Text script uses the DeepScaler-40K dataset with rule-based RL to enhance the model's foundational reasoning capabilities. This stage is crucial for establishing strong reasoning abilities before moving to multimodal tasks.
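
As a rough illustration of the rule-based reward idea (a hypothetical sketch, not the exact reward used by the training scripts), a verifiable reward can simply compare the model's final answer against the dataset's answer field:

import re

def rule_based_reward(response: str, reference: str) -> float:
    """Toy verifiable reward: 1.0 if the model's boxed answer matches the
    reference answer, 0.0 otherwise. The real reward may also check the
    response format; this is only an illustrative sketch."""
    # Extract the content of the last \boxed{...} in the response, if any.
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    predicted = boxed[-1].strip() if boxed else response.strip()
    # Normalize the reference (e.g. "$3$" -> "3") before exact comparison.
    target = reference.strip().strip("$").strip()
    return 1.0 if predicted == target else 0.0

# Example: a correct response earns reward 1.0.
print(rule_based_reward(r"... so the answer is \boxed{3}.", "$3$"))  # 1.0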

Stage 2: Multimodal Generalization Training (MGT)

This stage extends the reasoning capabilities to multimodal domains through continued training on specific tasks.

# Train on geometry domain (MGT-Geo)
bash examples/scripts/lmm_r1/train_mgt_geo.sh

# Train on perception-reasoning balanced domain (MGT-PerceReason)
bash examples/scripts/lmm_r1/train_mgt_percereas.sh

Each MGT script continues training from the FRE-Text checkpoint, focusing on a specific domain:

  • MGT-Geo: Uses the VerMulti-Geo dataset (15K geometry problems) to enhance geometric reasoning.
  • MGT-PerceReason: Uses the full VerMulti dataset to balance perception and reasoning capabilities.

We release our final model, MGT-PerceReason.

Direct RL Training (for comparison)

We also provide scripts for direct RL training without the FRE stage, which we use as comparison baselines in our paper:

# Direct RL training on geometry domain
bash examples/scripts/lmm_r1/train_direct_rl_geo.sh

These scripts train the baseline model directly on domain-specific data, skipping the FRE stage, which helps demonstrate the effectiveness of our two-stage approach.

Features

LMM-R1 is a fork of OpenRLHF, aimed at providing high-performance LMM Reinforcement Learning infrastructure for enhancing multimodal reasoning capabilities. We currently support PPO/REINFORCE++/RLOO training for LMMs and achieve a 4.7x speedup (RLOO) compared with R1-V (GRPO).

[Figure: training-time comparison with R1-V]

  • Support LMM training (Qwen2.5-VL, Phi3.5-V, Phi4-Multimodal).
  • Distributed PPO and REINFORCE++/RLOO implementations based on Ray.
  • Ray-based Reinforced Finetuning
  • Support Ray-based PPO and REINFORCE++/RLOO using Hybrid Engine (--colocate_all_models, --vllm_enable_sleep and --vllm_gpu_memory_utilization 0.5)
  • Full RLHF fine-tuning support for models with over 70 billion parameters.
  • Integration with vLLM for accelerated generation in RLHF tasks (--vllm_num_engines).
  • Support for multiple reward models (--reward_pretrain model1,model2...) and remote reward models (--remote_rm_url); see the sketch after this list.
  • Integration of FlashAttention2 (--flash_attn).
  • Support for QLoRA (--load_in_4bit) and LoRA (--lora_rank, --target_modules).
  • Logging support with Wandb (--use_wandb) and TensorBoard (--use_tensorboard).
  • Checkpoint recovery functionality (--load_checkpoint and --save_steps).
  • Multi-node training scripts provided, such as Ray PPO.
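
As an illustration of the remote reward model option, below is a minimal sketch of a reward service that --remote_rm_url could point to. The endpoint path and the request/response field names (query, rewards) are assumptions based on OpenRLHF's remote reward-model interface; check the OpenRLHF examples for the exact schema.

# Minimal sketch of a remote reward service usable via --remote_rm_url.
# The request/response field names ("query", "rewards") are assumptions;
# verify them against the OpenRLHF remote reward-model examples.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/get_reward", methods=["POST"])
def get_reward():
    data = request.get_json()
    queries = data.get("query", [])  # generated prompt+response strings
    # Hypothetical rule: reward 1.0 when the response contains a boxed answer.
    rewards = [1.0 if "\\boxed{" in q else 0.0 for q in queries]
    return jsonify({"rewards": rewards})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

The trainer would then be pointed at this endpoint, e.g. --remote_rm_url http://<host>:5000/get_reward.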

References & Acknowledgements

We sincerely thank DeepSeek for their exploration of LLM reasoning, and OpenRLHF for their incredible RL infrastructure. We also thank open-r1 and simpleRL-reason, which gave us insights into reproducing R1. Yingzhe Peng's work was completed during his internship at Ant Group, where Kai Yang was his mentor. Special thanks to Kai Yang, Jie Liu, and ZhiYuan You for their valuable suggestions, and to the Big Data Computing Center of Southeast University for the hardware support.

Citation

If you find LMM-R1 useful for your research and applications, please cite using this BibTeX:

@article{peng2025lmmr1,
  title={LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL},
  author={Peng, Yingzhe and Zhang, Gongrui and Zhang, Miaosen and You, Zhiyuan and Liu, Jie and Zhu, Qipeng and Yang, Kai and Xu, Xingzhong and Geng, Xin and Yang, Xu},
  journal={arXiv preprint arXiv:2503.07536},
  year={2025}
}
