- [🔗 link] — deepseek-ai/DeepSeek-R1: tech report
- [🔗 link] — huggingface/open-r1: Fully open reproduction of DeepSeek-R1
- [🔗 link] — Jiayi-Pan/TinyZero
- [🔗 link] — hkust-nlp/simpleRL-reason: This is a replicate of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data
- R1+Sonnet Shatters Benchmark at 14X Lower Cost: DeepSeek R1 paired with Sonnet achieved 64% on the aider polyglot benchmark, outperforming o1 while costing 14X less. Users highlighted its MIT license and adoption at top universities.
- R1 Re-Distillation Boosts Qwen-1.5B: Mobius Labs’ redistilled R1 variant surpassed the original, with plans to expand to other architectures.
- R1’s Arena Rankings Spark GPU Allocation Theories: R1 hit #3 in LMArena, matching o1’s coding performance at 20x cheaper, fueled by rumors of spare NVIDIA H100 usage and Chinese government backing.
- [🔗 link] — atfortes/Awesome-LLM-Reasoning: Reasoning in LLMs: Papers and Resources, including Chain-of-Thought, OpenAI o1, and DeepSeek-R1 🍓
- [🔗 link] — Nathan Lambert: DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
When we say "R1", it's ambiguous. DeepSeek actually dropped 8 R1 models - 2 "full" models, and 6 distillations onto open models:
- from Qwen 2.5: finetuned with 800k samples curated with DeepSeek-R1, in 1.5B, 7B, 14B, and 32B
- from Llama 3.1 8B Base: DeepSeek-R1-Distill-Llama-8B
- from Llama3.3-70B-Instruct: DeepSeek-R1-Distill-Llama-70B
- and DeepSeek-R1 and DeepSeek-R1-Zero, the full-size, 671B MoE models similar to DeepSeek V3. Surprisingly, these are MIT licensed rather than under a custom license, including an explicit OK for finetuning and distillation
Other notables from the launch:
- Pricing (per million tokens): 14 cents input (cache hit), 55 cents input (cache miss), and 219 cents output. This compares to o1 at 750 cents input (cache hit), 1500 cents input (cache miss), and 6000 cents output - roughly 27x cheaper on cache-miss input and output, and ~54x cheaper on cache-hit input.
- solves every problem from the o1 blogpost. every one.
- can run the distilled models on ollama (quick local-API sketch after this list)
- can write manim code really well
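
For the ollama point above, here is a minimal sketch of querying a locally running Ollama server over its REST API. The `deepseek-r1:7b` tag is an assumption, so check `ollama list` for whatever tag your pull actually used:

```python
# Minimal sketch (assumptions noted in comments): query a locally pulled R1 distill
# via Ollama's REST API after something like `ollama pull deepseek-r1:7b`.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

payload = {
    "model": "deepseek-r1:7b",  # assumed tag for the Qwen-7B distill; check `ollama list`
    "prompt": "How many r's are in 'strawberry'? Think it through.",
    "stream": False,  # return one JSON object instead of a token stream
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["response"])  # the distills typically emit their <think>...</think> trace inline
```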
Surprises from the paper:
- The process was:
  - V3 Base → R1 Zero (using GRPO - aka reward for correctness and style outcomes - no fancy PRM/MCTS/RMs; see the reward and GRPO sketches at the end of this section)
  - R1 Zero → R1 Finetuned Cold Start (distill long CoT samples from R1 Zero)
  - R1 Cold Start → R1 Reasoner with RL (focus on language consistency, to produce readable reasoning)
  - R1 Reasoner → R1 Finetuned-Reasoner (generate 600k samples: multi-response sampling, keeping only correct samples (using the previous rules) and using V3 as a judge to filter out mixed languages, long paragraphs, and code)
  - R1 Instruct-Reasoner → R1 Aligned (balance reasoning with helpfulness and harmlessness using GRPO)
- Supervised data, process reward models, and MCTS did NOT work
  - but they do use GRPO from DeepSeekMath (challenged by the DPO author) as "the RL framework to improve model performance in reasoning", where reasoning behaviors (like in-context backtracking) "naturally emerged" after "thousands of RL steps" - not quite the famous o1 scaling plot, but a close cousin (toy sketch at the end of this section)
  - using "aha moments" as pivot tokens, often mixing languages in a reader-unfriendly way
- R1 began training less than a month after the o1 announcement
- R1 distillations were remarkably effective, giving us this insane quote: "DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks with 28.9% on AIME and 83.9% on MATH." - and this is without even pushing distillation to its limits.
  - This is more effective than just RL-tuning a small model: "reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models." aka "total SFT victory"
- The most interesting new finding is that there is a lower bound to the distillation effect we covered yesterday - 1.5B is as low as you go. RLCoT reasoning is itself an emergent property.
- RL technique (PPO, DeepSeek's GRPO, or PRIME) doesn't really matter
- Starting from an Instruct model converges faster, but otherwise both end up the same (as per the R1 paper's observation)
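
To make the "reward for correctness and style outcomes, no fancy PRM/MCTS/RMs" step concrete, here is a toy sketch of what such a rule-based outcome reward could look like - the tag names, exact-match check, and weights are our assumptions for illustration, not DeepSeek's actual code:

```python
import re

# Assumed R1-Zero-style template: reasoning in <think> tags, final answer in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def rule_based_reward(completion: str, reference: str) -> float:
    """Outcome-only reward: correctness + format. No PRM, no MCTS, no learned reward model."""
    match = ANSWER_RE.search(completion)
    answer = match.group(1).strip() if match else ""
    accuracy = 1.0 if answer == reference.strip() else 0.0  # "correctness outcome"
    fmt = 0.5 if FORMAT_RE.search(completion) else 0.0      # "style outcome" (weight is assumed)
    return accuracy + fmt
```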
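
And the GRPO piece in one function, following the DeepSeekMath formulation: sample a group of completions per prompt, score each with a rule-based reward like the one above, and use group-normalized rewards as advantages so no value model/critic is needed (the surrounding PPO-style clipped policy update is omitted):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO's core trick: each completion's advantage is its reward normalized
    against the rest of its sampled group, so the group mean acts as the baseline."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. 8 completions sampled for one prompt, scored with a rule-based reward
group_rewards = [1.5, 0.5, 0.0, 1.5, 0.5, 0.0, 0.0, 1.5]
print(group_relative_advantages(group_rewards))
# each completion's tokens are then reinforced in proportion to its advantage
```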