📄 English Paper | 📄 Chinese Paper | 🌐 Website | 💡 Vision | 📖 Recipes for RL Scaling | 💻 Code Tutorial for RL Scaling | 📚 Curated Papers (300+) | 📑 Long CoT Resource | 📅 Timeline | 📜 Bib
The first AI generation ("Act I", 2020-2023) scaled parameters and data impressively but faced limitations in knowledge currency, reasoning depth, and cognitive flexibility. Prompt engineering became our primary AI interface. "Act II" (2024-present) transforms models from knowledge retrievers to thought constructors through test-time scaling, establishing mind-level AI connections. This work defines cognition engineering's foundations, provides tutorials and implementations, and democratizes access to AI's second paradigm.
- [2025-04-21] 🎉 🎉 🎉 We have released our paper "Generative AI Act II: Test Time Scaling Drives Cognition Engineering" along with all resources!
- 👩🔬 As an AI researcher, are you looking for a new research direction to break through the current bottlenecks of large language models?
- 💻 As an AI engineer, do you need a step-by-step tutorial to implement and optimize test-time scaling methods?
- 🎓 As a student or AI newcomer, do you want a systematic framework to understand the concept and application of "cognition engineering"?
- 👩🏫 As an educator, do you need structured teaching resources to explain test-time scaling?
- 💼 As an investor or decision-maker, would you like to understand what new stage generative AI has entered?
The three scaling phases illustrated as a progression of knowledge representation. Pre-training scaling (blue) forms isolated knowledge islands with fundamental physics concepts connected by limited innate associations. Post-training scaling (green) densifies these islands with more sophisticated learned connections between related concepts. Test-time scaling (red) enables dynamic reasoning pathway formation between previously disconnected concepts through extended computation, facilitating multi-hop inference across the entire knowledge space.
Test-time scaling builds bridges between knowledge islands, connecting distant nodes that remain isolated during pre-training and conventional post-training.
The emergence of cognition engineering through test-time scaling marks a fundamental paradigm shift in artificial intelligence. Far beyond mere technical implementation, this transformation carries profound implications for how we develop AI systems (Data Engineering 2.0), reimagine human-AI collaboration, and conduct scientific research. (See paper for details.)
Training Algorithm
Problem to Solve | Method Overview | Evidence | Related Studies |
---|---|---|---|
Computational inefficiency in traditional PPO for LLM training | GRPO (Group Relative Policy Optimization): Eliminates the need for a separate value model by using the average reward of multiple outputs from the same prompt as the baseline for advantage calculation (see the advantage sketch below this table). | Performance comparisons demonstrate computational efficiency while maintaining comparable effectiveness to traditional PPO, particularly well-suited for LLM reward modeling where rewards are often comparative in nature. | GRPO |
Token inefficiency and overthinking in long-form reasoning | Dr.GRPO (Doctor GRPO): Addresses optimization bias in GRPO by removing response-length normalization and reward standardization, implementing an unbiased policy gradient estimation. | Experimental results show significantly improved token efficiency with better controlled response lengths, effectively mitigating overthinking problems. | Dr.GRPO |
Instability with varying response lengths in long-form reasoning | DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): Implements token-level policy gradient calculation, allowing longer sequences to appropriately influence the gradient updates regardless of individual response lengths. | Comparative analysis reveals more stable training dynamics with healthier entropy management and better quality pattern recognition, particularly for handling varying response lengths effectively. | DAPO |
Limited policy exploration due to rigid constraints | GPG (Group Policy Gradient): Simplifies the policy gradient approach by removing reference models and policy constraints while maintaining stability through group-level reward normalization. | Comparative experiments demonstrate enhanced exploration capabilities with reduced computational requirements, providing more flexible policy updates. | GPG |
Repetitive or narrow reasoning patterns | Auxiliary entropy bonus: Incorporates an additive entropy term into the RL loss function to encourage token diversity and prevent deterministic response patterns. | Experimental results show more varied and creative reasoning paths without sacrificing solution accuracy. | T1 |
Limitations of fixed reference models | On-policy KL normalization: Combines KL normalization with Exponential Moving Average (EMA) updates to the reference model. | Dynamic reference model updating allows for more effective RL scaling while maintaining stable training dynamics. | T1 |
Value model misalignment with strong prior policies | Value-Pretraining Alignment: Implements a dedicated pretraining phase for the value model to ensure alignment with strong prior policies before RL begins. | Two-stage convergence pattern shows initial range alignment followed by crucial knowledge injection, preventing collapse in output length for long-CoT tasks. | VC-PPO, VAPO |
Conflicting variance-bias requirements between value and policy optimization | Decoupled-GAE (Generalized Advantage Estimation): Separates the λ parameter for value function and policy optimization, allowing unbiased value estimation while maintaining variance reduction benefits for policy updates. | Mathematical analysis and experimental results demonstrate improved convergence rates without introducing additional bias, particularly effective for trajectory-level rewards in long CoT tasks. | VC-PPO, VAPO |
Limited exploration in constrained policy optimization | KL Divergence Removal: Eliminates the KL penalty term that constrains policy divergence from the reference model, allowing the reasoning policy to explore more freely. | Experiments reveal significant performance gains when removing constraints on policy distribution shifts during extended reasoning training. | Open-Reasoner-Zero, DAPO |
Premature deterministic behavior in RL systems | Clip-Higher Strategy: Decouples lower and higher clipping ranges in PPO to specifically promote exploration of low-probability tokens while maintaining stability (see the DAPO sketch below this table). | Asymmetric clipping thresholds effectively counteract entropy collapse and maintain policy diversity throughout extended training. | DAPO |
Ineffective gradient signals in late-stage training | Dynamic Sampling: Implements an adaptive sampling approach that filters out prompts with accuracy values of exactly 0 or 1 to ensure effective gradient signals (also illustrated in the DAPO sketch below this table). | Comparative training curves demonstrate faster convergence to target performance despite the additional computational overhead of oversampling. | DAPO, Bae et al. |
Noisy reward signals from length-truncated samples | Overlong Filtering: Masks the loss contribution of truncated samples that exceed maximum length to prevent inappropriate penalization of otherwise sound reasoning. | Ablation studies highlight substantial training stability improvements when removing noisy reward signals from length-truncated samples. | DAPO |
Inconsistent advantage estimation across variable-length sequences | Length-Adaptive GAE: Dynamically adjusts the λ parameter in GAE based on sequence length, ensuring balanced TD-error influence for both short and long outputs (see the GAE sketch below this table). | Empirical tests reveal more balanced advantage estimation and improved training stability across sequences of varying lengths, particularly beneficial for long-form reasoning. | VAPO |
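To make the group-relative baseline in GRPO (and the Dr.GRPO fix) concrete, here is a minimal Python sketch of the per-prompt advantage computation; the function name and the binary-reward example are illustrative, not the released implementations.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for G responses sampled from the same prompt.

    The group mean replaces a learned value model as the baseline, and the group
    standard deviation rescales the signal (GRPO). Dr.GRPO's debiasing amounts to
    dropping this std division (and the per-response length normalization in the loss).
    """
    baseline = rewards.mean()
    scale = rewards.std() + eps  # guard against zero-variance groups
    return (rewards - baseline) / scale

# Example: four sampled responses to one prompt, rewarded 1.0 if correct, 0.0 otherwise.
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
```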
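The Clip-Higher and Dynamic Sampling rows above are sketched together below in the spirit of DAPO; the ε values follow the setting reported by the DAPO authors, while the function signatures are assumptions made for illustration.

```python
import torch

def clip_higher_loss(ratio: torch.Tensor, adv: torch.Tensor,
                     eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """PPO-style clipped objective with decoupled clipping ranges.

    A larger upper bound (eps_high > eps_low) leaves extra headroom for
    up-weighting low-probability tokens, which counteracts entropy collapse.
    The mean is taken over all tokens in the batch (token-level loss).
    """
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    return -torch.minimum(unclipped, clipped).mean()

def keep_prompt(group_rewards: list[float]) -> bool:
    """Dynamic sampling: discard prompts whose sampled group is all-correct or
    all-wrong, since their group-relative advantages are identically zero."""
    acc = sum(group_rewards) / len(group_rewards)
    return 0.0 < acc < 1.0
```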
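For the Decoupled-GAE and Length-Adaptive GAE rows, the sketch below pairs standard GAE with a length-dependent λ of the form λ = 1 − 1/(α·l) on the policy side (the value side can keep λ = 1 for unbiased estimation); the value of α here is purely illustrative.

```python
import numpy as np

def gae(rewards, values, gamma: float = 1.0, lam: float = 0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` carries one extra bootstrap entry: len(values) == len(rewards) + 1.
    """
    adv, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def length_adaptive_lambda(seq_len: int, alpha: float = 0.05) -> float:
    """lambda_policy = 1 - 1/(alpha * l): short responses get a smaller lambda
    (more bias, less variance); long responses push lambda toward 1.
    Clamped at 0.0 because very short sequences would otherwise go negative
    under this illustrative alpha."""
    return max(0.0, 1.0 - 1.0 / (alpha * seq_len))
```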
Reward Design
Problem to Solve | Method Overview | Evidence | Related Studies |
---|---|---|---|
Uncontrolled CoT length in reasoning tasks | Cosine Length Reward: Applies a cosine-based reward shaping that prioritizes shorter, correct CoTs while penalizing short, incorrect ones (see the sketch below this table). | Evaluation across diverse reasoning tasks reveals stabilized CoT length with preserved performance. | Demystify |
Reward hacking in deterministic reasoning tasks | Accuracy+Format Reward: Combines verification of answer correctness with structured formatting requirements that enforce explicit reasoning within specialized tags (see the rule-based reward sketch below this table). | Rule-based reward systems demonstrate greater resistance to reward hacking than neural alternatives while simplifying the training pipeline. | DeepSeek-R1, SimpleRL, T1, Logic-RL, STILL-3 |
Language mixing issues in multilingual environments | Language Consistency Incentive: Calculates rewards based on the proportion of target language words in the CoT to mitigate language mixing issues. | User studies indicate enhanced readability despite minor performance trade-offs in multilingual contexts. | DeepSeek-R1 |
Model overthinking and verbosity | Overthinking Length Penalty: Implements a weighted reward mechanism that penalizes excessive response length while preserving correctness to combat model overthinking. | Gradually introduced length penalties resulted in more token-efficient reasoning. | KIMI-K1.5, DAPO |
Inaccurate reward modeling in nuanced domains | Chain-of-Thought RM: Enhances reward modeling with explicit step-by-step reasoning before final correctness judgment, particularly for domains with nuanced evaluation criteria. | Manual verification confirmed that CoT reward models achieved significantly higher accuracy compared to classic reward models without reasoning steps. | KIMI-K1.5 |
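A minimal sketch of the cosine length-reward shaping above; the endpoint values (`r_correct`, `r_wrong`) are illustrative placeholders rather than the exact constants used in the cited work.

```python
import math

def cosine_length_reward(correct: bool, gen_len: int, max_len: int,
                         r_correct=(2.0, 1.0), r_wrong=(-10.0, 0.0)) -> float:
    """Cosine interpolation between a length-0 reward and a max-length reward.

    Correct answers decay from r_correct[0] to r_correct[1], so shorter correct
    CoTs earn more; wrong answers rise from r_wrong[0] to r_wrong[1], so short
    wrong answers are penalized hardest.
    """
    r0, r_end = r_correct if correct else r_wrong
    t = min(gen_len, max_len) / max_len
    return r0 + 0.5 * (r_end - r0) * (1.0 - math.cos(math.pi * t))
```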
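And a sketch of an Accuracy+Format rule-based reward; the `<think>`/`<answer>` tag names and the simple additive combination are assumptions modeled on the commonly used R1-style template, not a specific released reward function.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>...</think><answer>...</answer> template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), re.DOTALL) else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the reference answer exactly."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == gold.strip() else 0.0

def rule_based_reward(response: str, gold: str) -> float:
    # Simple additive combination; real systems weight or gate the two terms.
    return accuracy_reward(response, gold) + format_reward(response)
```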
Training Data
Problem to Solve | Method Overview | Evidence | Related Studies |
---|---|---|---|
Resource-constrained RL training environments | High-impact Sample Selection: Prioritizes training samples based on learning impact measurement. | Results show significant reduction in required training data while maintaining performance. | LIMR |
Training with noisy web-extracted data | Noise Reduction Filtering: Employs filtering mechanisms to remove noisy web-extracted data. | Filtered datasets demonstrate improved generalization capabilities on OOD tasks. | Demystify |
Multi-stage Training
Problem to Solve | Method Overview | Evidence | Related Studies |
---|---|---|---|
Poor readability and reasoning in direct RL approaches | Cold-start Progression: Implements a phased training approach beginning with high-quality CoT data fine-tuning before transitioning to large-scale reinforcement learning. | Models with cold-start initialization exhibit enhanced readability and reasoning capabilities compared to direct RL approaches. | DeepSeek-R1, T1, DeepScaleR, STILL-3 |
Inefficient training with problems of varied difficulty | Strategic Sampling: Combines curriculum-based progression from simple to complex problems with prioritization of difficult cases where model performance is weakest. | Targeted sampling approaches demonstrated faster convergence and more efficient use of computational resources during training. | KIMI-K1.5 |
Inefficient use of context in long-form reasoning | Progressive Context Scaling: Implements a multi-stage training approach that gradually increases context window size as model performance begins to plateau at each level (see the schedule sketch below this table). | Phased context window expansion demonstrates significant improvements in both computational efficiency and final performance metrics compared to fixed maximum context training. | DeepScaleR |
Performance gaps on challenging reasoning problems | Targeted Annealing: Implements a final training phase on specifically mined challenging problems with a linearly decaying learning rate to refine reasoning capabilities. | Enhanced performance metrics on complex reasoning tasks without compromising general capabilities. | Open-Reasoner-Zero |
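To illustrate the Progressive Context Scaling row, here is a schedule sketch in the spirit of DeepScaleR's staged context expansion; the specific token limits and the plateau criterion are assumptions for illustration, not the project's exact recipe.

```python
# Illustrative multi-stage context schedule: train at each maximum response
# length until the reward curve plateaus, then move to the next stage.
CONTEXT_SCHEDULE = [
    {"stage": 1, "max_response_tokens": 8_192},
    {"stage": 2, "max_response_tokens": 16_384},
    {"stage": 3, "max_response_tokens": 24_576},
]

def next_stage(stage_idx: int, reward_history: list[float],
               patience: int = 5, min_delta: float = 1e-3) -> int:
    """Advance to the next stage when recent reward improvement is negligible."""
    if stage_idx + 1 >= len(CONTEXT_SCHEDULE) or len(reward_history) < patience + 1:
        return stage_idx
    recent_gain = reward_history[-1] - reward_history[-(patience + 1)]
    return stage_idx + 1 if recent_gain < min_delta else stage_idx
```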
See the code for a hands-on tutorial.
See the papers.
Work | Application | Type | Source | Quantity | Modality | Link |
---|---|---|---|---|---|---|
O1 Journey--Part 1 | Math | Synthesize | GPT-4o | 0.3K | Text | GitHub HuggingFace |
Marco-o1 | Reasoning | Synthesize | Qwen2-7B-Instruct | 10K | Text | GitHub |
STILL-2 | Math, Code, Science, Puzzle | Distillation | DeepSeek-R1-Lite-Preview, QwQ-32B-Preview | 5K | Text | GitHub HuggingFace |
RedStar-math | Math | Distillation | QwQ-32B-Preview | 4K | Text | HuggingFace |
RedStar-code | Code | Distillation | QwQ-32B-Preview | 16K | Text | HuggingFace |
RedStar-multimodal | Math | Distillation | QwQ-32B-Preview | 12K | Vision, Text | HuggingFace |
S1K | Math, Science, Code | Distillation | Gemini Flash Thinking | 1K | Text | GitHub HuggingFace |
S1K-1.1 | Math, Science, Code | Distillation | DeepSeek R1 | 1K | Text | GitHub HuggingFace |
LIMO | Math | Distillation | DeepSeek R1, DeepSeek-R1-Distill-Qwen-32B | 0.8K | Text | GitHub HuggingFace |
OpenThoughts-114k | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 114K | Text | GitHub HuggingFace |
OpenR1-Math-220k | Math | Distillation | DeepSeek R1 | 220K | Text | GitHub HuggingFace |
OpenThoughts2-1M | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 1M | Text | GitHub HuggingFace |
CodeForces-CoTs | Code | Distillation | DeepSeek R1 | 47K | Text | GitHub HuggingFace |
Sky-T1-17k | Math, Code, Science, Puzzle | Distillation | QwQ-32B-Preview | 17K | Text | GitHub HuggingFace |
S²R | Math | Synthesize | Qwen2.5-Math-7B | 3K | Text | GitHub HuggingFace |
R1-Onevision | Science, Math, General | Distillation | DeepSeek R1 | 155K | Vision, Text | GitHub HuggingFace |
OpenO1-SFT | Math, Code | Synthesize | - | 77K | Text | GitHub HuggingFace |
Medical-o1 | Medical | Distillation | DeepSeek R1 | 25K | Text | GitHub HuggingFace |
O1 Journey--Part 3 | Medical | Distillation | o1-preview | 0.5K | Text | GitHub HuggingFace |
SCP-116K | Math, Science | Distillation | DeepSeek R1 | 116K | Text | GitHub HuggingFace |
open-r1-multimodal | Math | Distillation | GPT-4o | 8K | Vision, Text | GitHub HuggingFace |
Vision-R1-cold | Science, Math, General | Distillation | DeepSeek R1 | 200K | Vision, Text | GitHub HuggingFace |
MMMU-Reasoning-Distill-Validation | Science, Math, General | Distillation | DeepSeek R1 | 0.8K | Vision, Text | ModelScope |
Clevr-CoGenT | Vision Counting | Distillation | DeepSeek R1 | 37.8K | Vision, Text | GitHub HuggingFace |
VL-Thinking | Science, Math, General | Distillation | DeepSeek R1 | 158K | Vision, Text | GitHub HuggingFace |
Video-R1 | Video | Distillation | Qwen2.5-VL-72B | 158K | Vision, Text | GitHub HuggingFace |
Embodied-Reasoner | Embodied AI | Synthesize | GPT-4o | 9K | Vision, Text | GitHub HuggingFace |
OpenCodeReasoning | Code | Distillation | DeepSeek R1 | 736K | Text | HuggingFace |
SafeChain | Safety | Distillation | DeepSeek R1 | 40K | Text | GitHub HuggingFace |
KodCode | Code | Distillation | DeepSeek R1 | 2.8K | Text | GitHub HuggingFace |
The images present a comprehensive timeline of test-time scaling methods applied across various AI domains from 2020 to 2025. These visualizations track the evolution of key techniques including Parallel Sampling, Tree Search, Multi-turn Correction, and Long Chain-of-Thought (CoT) across different fields of application.
The research maps four primary test-time scaling approaches:
- Parallel Sampling (blue): Generating multiple candidate solutions in parallel
- Tree Search (green): Exploring decision trees to find optimal solutions
- Multi-turn Correction (red): Iterative refinement through multiple passes
- Long CoT (Chain-of-Thought) (purple): Extended reasoning chains for complex problem-solving
The methods are implemented using various training approaches:
- SFT (Supervised Fine-Tuning): Diamond symbol
- DPO (Direct Preference Optimization): Triangle symbol
- RL (Reinforcement Learning): Square symbol
- Inference-only: Circle symbol
If you find our work useful for your research, please cite the following paper:
@misc{xia2025generativeaiactii,
title={Generative AI Act II: Test Time Scaling Drives Cognition Engineering},
author={Shijie Xia and Yiwei Qin and Xuefeng Li and Yan Ma and Run-Ze Fan and Steffi Chern and Haoyang Zou and Fan Zhou and Xiangkun Hu and Jiahe Jin and Yanheng He and Yixin Ye and Yixiu Liu and Pengfei Liu},
year={2025},
eprint={2504.13828},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.13828},
}