We have witnessed the powerful capabilities of pure RL-based LLM reasoning. In this repository, we collect the newest papers, slides, and other interesting materials that enhance LLM reasoning with reinforcement learning, to help everyone learn quickly!
Starring this repository keeps you at the forefront of RL-based LLM reasoning.
In the teeth of the storm
- [2502] Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning (Shanghai AI Lab)
- [2502] Demystifying Long Chain-of-Thought Reasoning in LLMs (introduced a cosine length-scaling reward with a repetition penalty for stable CoT length growth; see the sketch after this paper list) (IN.AI)
- [2501] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (HKU, Berkeley)
- [2501] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek)
- [2501] Kimi k1.5: Scaling Reinforcement Learning with LLMs (Kimi)
- [2502] S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning (Tencent)
- [2502] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (THU)
- [2502] QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (UCLA-Yizhou Sun)
- [2312] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (PKU & DeepSeek)
- [2305] Let's verify step by step (OpenAI)
- [2211] Solving math word problems with process- and outcome-based feedback (DeepMind)
- [2502] On the Emergence of Thinking in LLMs I: Searching for the Right Intuition (Reinforcement Learning via Self-Play) (MIT)
- [2502] STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving (the scarcity of correct proofs yields sparse rewards, so performance quickly plateaus; to overcome this, the authors draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises, often variants of known results, and attempting to solve them) (Stanford-Tengyu Ma)
- [2409] Training Language Models to Self-Correct via Reinforcement Learning (DeepMind)
- [2502] Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls (Tencent)
- [2408] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search (DeepSeek)
- [2310] Solving olympiad geometry without human demonstrations (DeepMind)
- [2502] When More is Less: Understanding Chain-of-Thought Length in LLMs (I think this is also about overthinking) (PKU, MIT)
- [2502] Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning (Meta-Yuandong Tian)
- [2502] CoT-Valve: Length-Compressible Chain-of-Thought Tuning (overthinking) (NUS)
- [2502] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks (I think overthinking is a practical problem, interesting!) (Berkeley)
- [2502] ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (Princeton)
- [2502] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (current approaches to improving LM capabilities rely heavily on increasing model size or specialized prompting) (Max Planck)
- [2502] LIMO: Less is More for Reasoning (LIMO offers a more principled and direct path to complex reasoning ability through explicit trajectory design) (SJTU)
- [2502] Confidence Improves Self-Consistency in LLMs (weights self-consistency votes by model confidence to improve the quality of LLM outputs) (Google Research)
- [2502] LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters! (UC Berkeley)
- [2502] BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (Salesforce AI Research)
- [2502] LLMs Can Teach Themselves to Better Predict the Future (self-play to generate data) (LSE)
- [2501] s1: Simple test-time scaling (Stanford)
- [2412] Training Large Language Model to Reason in a Continuous Latent Space (Meta-Yuandong Tian)
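For the cosine length-scaling reward mentioned in the Demystifying Long CoT entry above, here is a minimal sketch in Python. The reward endpoints, `max_len`, and the repetition-penalty form are illustrative placeholders, not the paper's exact settings; the idea is that correct and wrong answers get different endpoints so length grows only when it helps, and repeated n-grams are penalized.

```python
import math

def cosine_interp(gen_len: int, max_len: int, r_start: float, r_end: float) -> float:
    """Cosine interpolation from r_start (at length 0) to r_end (at max_len)."""
    cos = math.cos(min(gen_len, max_len) * math.pi / max_len)
    return r_end + 0.5 * (r_start - r_end) * (1.0 + cos)

def repetition_penalty(tokens: list, n: int = 4, scale: float = -1.0) -> float:
    """Illustrative penalty: scale times the fraction of repeated n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    repeated_frac = 1.0 - len(set(ngrams)) / len(ngrams)
    return scale * repeated_frac

def length_scaled_reward(correct: bool, tokens: list, max_len: int = 4096) -> float:
    """Cosine length-scaling reward with repetition penalty (placeholder endpoints)."""
    if correct:
        # Shorter correct answers earn slightly more than longer ones.
        r = cosine_interp(len(tokens), max_len, r_start=2.0, r_end=1.0)
    else:
        # Longer wrong answers are penalized less, leaving room to keep thinking.
        r = cosine_interp(len(tokens), max_len, r_start=-10.0, r_end=0.0)
    return r + repetition_penalty(tokens)
```

With endpoints shaped like this, the policy is not rewarded for unbounded CoT growth on correct answers, yet it is not punished for spending more tokens when it is still wrong, which is what stabilizes length during training.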
- LLM Reasoning: Key Ideas and Limitations (Denny Zhou, DeepMind) (Video)
- Towards Reasoning in Large Language Models (Jie Huang, UIUC)
- Can LLMs Reason & Plan? (Subbarao Kambhampati, ASU)
- Inference-Time Techniques for LLM Reasoning (Xinyun Chen, DeepMind)
- Chain-of-Thought Reasoning In Language Models (Zhuosheng Zhang, SJTU)
- Learning to Self-Improve & Reason with LLMs (Jason Weston, Meta & NYU)
- Why did no one try abandoning fine-tuned alignment and training a chain-of-thought reasoning model purely via reinforcement learning before DeepSeek-R1-Zero appeared? (Zhihu)
- Kimi, by Flood Sung (Zhihu)
- A walkthrough of the DeepSeek paper series (Zhihu)
- ChatGPT and the Art of Post-Training (Stanford, 2025-02-18)
- [LLM+RL] A guided read-through of the R1 paper: SFT vs. RL, RL basics, GRPO details, and a discussion of reproduction efforts
- [LLM+RL] Understanding the GRPO formula and the TRL GRPOTrainer implementation (advantage and loss computation; see the sketch after this list)
- LLM-Based Reasoning: Opportunities and Pitfalls (LAVA Workshop at ACCV 2024)
- Reinforcement Learning in DeepSeek R1, Visualized (Chinese)
- EZ撸paper: DeepSeek-R1 paper explained, part 3: the history of GPT | scaling laws | training paradigms | emergent abilities
- EZ撸paper: DeepSeek-R1 paper explained, part 2: what is AGI? | a quick introduction to Reinforcement Learning | an AlphaGo overview
- EZ撸paper: DeepSeek-R1 paper explained, part 1: matching OpenAI-o1, how was it done?
- [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 Explained to your grandma
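As background for the GRPO explainers above, here is a minimal sketch of the group-normalized advantage and the PPO-style clipped surrogate loss that GRPO uses (following the DeepSeekMath paper). Tensor shapes are simplified and the per-token KL estimate and its weight `beta` are taken as inputs, so treat this as a reading aid rather than a faithful trainer implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantage: z-score each completion's reward within
    its group of samples drawn from the same prompt.

    rewards: (num_prompts, group_size)
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp: torch.Tensor, logp_old: torch.Tensor, adv: torch.Tensor,
              kl: torch.Tensor, clip_eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """Clipped surrogate objective plus a KL penalty to the reference policy.

    logp / logp_old / kl are per-token values; adv must broadcast against them.
    """
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)
    return -(surrogate - beta * kl).mean()
```

Note that GRPO has no learned value network: the group mean acts as the baseline, which is what makes it so much cheaper than PPO with a critic.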
- TinyZero (4×4090 GPUs are enough for a 0.5B LLM, but the aha moment is not observed)
- Open-r1
- Logic-RL
- Unsloth-GRPO (simplest R1 implementation; see the TRL sketch below)
- OpenR (An Open-Source Framework for Advanced Reasoning)
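To see what a minimal GRPO fine-tuning run looks like in practice, here is a sketch using TRL's GRPOTrainer, which several of the repos above build on. The model, dataset, and reward function are placeholders chosen for illustration, not the exact settings of any project listed here.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder prompt dataset; any dataset with a "prompt" column works.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy rule-based reward: prefer completions close to 100 characters.
# R1-style setups instead score answer correctness and output format.
def reward_len(completions, **kwargs):
    return [-abs(100 - len(c)) for c in completions]

training_args = GRPOConfig(output_dir="qwen-grpo-demo", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # placeholder small model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

The rule-based reward is the whole trick: swap in a verifier that checks the final answer and the reasoning format, and you have the skeleton of an R1-Zero-style run.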
- Compshare (registration comes with a 50-yuan credit, enough to run R1 with Unsloth)
- Feel free to contribute more papers or any other resources!