We have witnessed the powerful capabilities of pure RL-based LLM reasoning. In this repository, we collect the newest papers, slides, and other interesting materials that enhance LLM reasoning with reinforcement learning, to help everyone learn quickly!
Starring this repository keeps you at the forefront of RL-based LLM reasoning.
In the teeth of the storm
- [2502] Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning (Shanghai AI Lab)
- [2502] Demystifying Long Chain-of-Thought Reasoning in LLMs (introduced a cosine length-scaling reward with a repetition penalty for stable CoT length growth; see the sketch after this paper list) (IN.AI)
- [2501] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (HKU, Berkeley)
- [2501] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek)
- [2501] Kimi k1.5: Scaling Reinforcement Learning with LLMs (Kimi)
- [2502] S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning (Tencent)
- [2502] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (THU)
- [2502] QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (UCLA-Yizhou Sun)
- [2312] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (PKU & DeepSeek)
- [2305] Let's verify step by step (OpenAI)
- [2211] Solving math word problems with process- and outcome-based feedback (DeepMind)
- [2502] On the Emergence of Thinking in LLMs I: Searching for the Right Intuition (Reinforcement Learning via Self-Play) (MIT)
- [2502] STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving (the scarcity of correct proofs yields sparse rewards, so performance quickly plateaus; to overcome this, the authors draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises, often variants of known results, and attempting to solve them) (Stanford-Tengyu Ma)
- [2409] Training Language Models to Self-Correct via Reinforcement Learning (DeepMind)
- [2502] Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls (Tencent)
- [2408] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search (DeepSeek)
- [2310] Solving olympiad geometry without human demonstrations (DeepMind)
- [2502] When More is Less: Understanding Chain-of-Thought Length in LLMs (I think this is also about overthinking) (PKU, MIT)
- [2502] Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning (Meta-Yuandong Tian)
- [2502] CoT-Valve: Length-Compressible Chain-of-Thought Tuning (overthinking) (NUS)
- [2502] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks (I think overthinking is a practical problem, interesting!) (Berkeley)
- [2502] ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (Princeton)
- [2502] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (current approaches to improving LM capabilities rely heavily on increasing model size or specialized prompting) (Max Planck)
- [2502] LIMO: Less is More for Reasoning (LIMO offers a more principled and direct path to complex reasoning ability through explicit trajectory design) (SJTU)
- [2502] Confidence Improves Self-Consistency in LLMs (weights self-consistency votes by model confidence to improve the quality of LLM outputs) (Google Research)
- [2502] LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters! (UC Berkeley)
- [2502] BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (Salesforce AI Research)
- [2502] LLMs Can Teach Themselves to Better Predict the Future (self-play to generate data) (LSE)
- [2501] s1: Simple test-time scaling (Stanford)
- [2412] Training Large Language Model to Reason in a Continuous Latent Space (Meta-Yuandong Tian)
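For the cosine length-scaling reward mentioned in the Demystifying Long CoT entry above, here is a minimal sketch in Python. The reward endpoints, `max_len`, and the repetition-penalty form are illustrative placeholders, not the paper's exact settings; the idea is that correct and wrong answers get different endpoints so length grows only when it helps, and repeated n-grams are penalized.

```python
import math

def cosine_interp(gen_len: int, max_len: int, r_start: float, r_end: float) -> float:
    """Cosine interpolation from r_start (at length 0) to r_end (at max_len)."""
    cos = math.cos(min(gen_len, max_len) * math.pi / max_len)
    return r_end + 0.5 * (r_start - r_end) * (1.0 + cos)

def repetition_penalty(tokens: list, n: int = 4, scale: float = -1.0) -> float:
    """Illustrative penalty: scale times the fraction of repeated n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    repeated_frac = 1.0 - len(set(ngrams)) / len(ngrams)
    return scale * repeated_frac

def length_scaled_reward(correct: bool, tokens: list, max_len: int = 4096) -> float:
    """Cosine length-scaling reward with repetition penalty (placeholder endpoints)."""
    if correct:
        # Shorter correct answers earn slightly more than longer ones.
        r = cosine_interp(len(tokens), max_len, r_start=2.0, r_end=1.0)
    else:
        # Longer wrong answers are penalized less, leaving room to keep thinking.
        r = cosine_interp(len(tokens), max_len, r_start=-10.0, r_end=0.0)
    return r + repetition_penalty(tokens)
```

With endpoints shaped like this, the policy is not rewarded for unbounded CoT growth on correct answers, yet it is not punished for spending more tokens when it is still wrong, which is what stabilizes length during training.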
- LLM Reasoning: Key Ideas and Limitations (Denny Zhou, DeepMind) (Video)
- Towards Reasoning in Large Language Models (Jie Huang, UIUC)
- Can LLMs Reason & Plan? (Subbarao Kambhampati, ASU)
- Inference-Time Techniques for LLM Reasoning (Xinyun Chen, DeepMind)
- Chain-of-Thought Reasoning In Language Models (Zhuosheng Zhang, SJTU)
- Learning to Self-Improve & Reason with LLMs (Jason Weston, Meta & NYU)
- Why did no one try abandoning fine-tuned alignment and training a chain-of-thought reasoning model purely via reinforcement learning before DeepSeek-R1-Zero appeared? (Zhihu)
- Kimi, by Flood Sung (Zhihu)
- A walkthrough of the DeepSeek paper series (Zhihu)
- ChatGPT and the Art of Post-Training (Stanford, 2025-02-18)
- [LLM+RL] A guided read-through of the R1 paper: SFT vs. RL, RL basics, GRPO details, and a discussion of reproduction efforts
- [LLM+RL] Understanding the GRPO formula and the TRL GRPOTrainer implementation (advantage and loss computation; see the sketch after this list)
- LLM-Based Reasoning: Opportunities and Pitfalls (LAVA Workshop at ACCV 2024)
- Reinforcement Learning in DeepSeek R1, Visualized (Chinese)
- EZ撸paper: DeepSeek-R1 paper explained, part 3: the history of GPT | scaling laws | training paradigms | emergent abilities
- EZ撸paper: DeepSeek-R1 paper explained, part 2: what is AGI? | a quick introduction to Reinforcement Learning | an AlphaGo overview
- EZ撸paper: DeepSeek-R1 paper explained, part 1: matching OpenAI-o1, how was it done?
- [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 Explained to your grandma
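As background for the GRPO explainers above, here is a minimal sketch of the group-normalized advantage and the PPO-style clipped surrogate loss that GRPO uses (following the DeepSeekMath paper). Tensor shapes are simplified and the per-token KL estimate and its weight `beta` are taken as inputs, so treat this as a reading aid rather than a faithful trainer implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantage: z-score each completion's reward within
    its group of samples drawn from the same prompt.

    rewards: (num_prompts, group_size)
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp: torch.Tensor, logp_old: torch.Tensor, adv: torch.Tensor,
              kl: torch.Tensor, clip_eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """Clipped surrogate objective plus a KL penalty to the reference policy.

    logp / logp_old / kl are per-token values; adv must broadcast against them.
    """
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)
    return -(surrogate - beta * kl).mean()
```

Note that GRPO has no learned value network: the group mean acts as the baseline, which is what makes it so much cheaper than PPO with a critic.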
- TinyZero (4×4090 GPUs are enough for a 0.5B LLM, but the aha moment is not observed)
- Open-r1
- Logic-RL
- Unsloth-GRPO (simplest R1 implementation; see the TRL sketch below)
- OpenR (An Open-Source Framework for Advanced Reasoning)
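To see what a minimal GRPO fine-tuning run looks like in practice, here is a sketch using TRL's GRPOTrainer, which several of the repos above build on. The model, dataset, and reward function are placeholders chosen for illustration, not the exact settings of any project listed here.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder prompt dataset; any dataset with a "prompt" column works.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy rule-based reward: prefer completions close to 100 characters.
# R1-style setups instead score answer correctness and output format.
def reward_len(completions, **kwargs):
    return [-abs(100 - len(c)) for c in completions]

training_args = GRPOConfig(output_dir="qwen-grpo-demo", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # placeholder small model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

The rule-based reward is the whole trick: swap in a verifier that checks the final answer and the reasoning format, and you have the skeleton of an R1-Zero-style run.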
- Compshare (registration comes with a 50-yuan credit, enough to run R1 with Unsloth)
- Feel free to contribute more papers or any other resources!