Generative AI Act II: Test Time Scaling Drives Cognition Engineering

📄 English Paper  |  📄 Chinese Paper  |  🌐 Website  |  💡 Vision  |  📖 Recipes for RL Scaling  |  💻 Code Tutorial for RL Scaling  |  📚 Curated Papers (300+)  |  📑 Long CoT Resource  |  📅 Timeline  |  📜 Bib

The first AI generation ("Act I", 2020-2023) scaled parameters and data impressively but faced limitations in knowledge currency, reasoning depth, and cognitive flexibility. Prompt engineering became our primary AI interface. "Act II" (2024-present) transforms models from knowledge retrievers to thought constructors through test-time scaling, establishing mind-level AI connections. This work defines cognition engineering's foundations, provides tutorials and implementations, and democratizes access to AI's second paradigm.

🔥 News

  • [2025-04-21] 🎉 🎉 🎉 We have released our paper "Generative AI Act II: Test Time Scaling Drives Cognition Engineering" along with all resources!

👋 Is This Paper For You?

  • 👩‍🔬 As an AI researcher, are you looking for a new research direction to break through the current bottlenecks of large language models?

  • 💻 As an AI engineer, do you need a step-by-step tutorial to implement and optimize test-time scaling methods?

  • 🎓 As a student or AI newcomer, do you want a systematic framework to understand the concept and application of "cognition engineering"?

  • 👩‍🏫 As an educator, do you need structured teaching resources to explain test-time scaling?

  • 💼 As an investor or decision-maker, would you like to understand what new stage generative AI has entered?

Our Vision

The three scaling phases illustrated as a progression of knowledge representation. Pre-training scaling (blue) forms isolated knowledge islands with fundamental physics concepts connected by limited innate associations. Post-training scaling (green) densifies these islands with more sophisticated learned connections between related concepts. Test-time scaling (red) enables dynamic reasoning pathway formation between previously disconnected concepts through extended computation, facilitating multi-hop inference across the entire knowledge space. Test-time scaling builds bridges between knowledge islands, connecting distant nodes that remain isolated during pre-training and conventional post-training.

The emergence of cognition engineering through test-time scaling marks a fundamental paradigm shift in artificial intelligence. Far beyond mere technical implementation, this transformation carries profound implications for how we develop AI systems (Data Engineering 2.0), reimagine human-AI collaboration, and conduct scientific research. (See paper for details.)

Recipes/Tricks for RL Scaling

Training Algorithm
| Problem to Solve | Method Overview | Evidence | Related Studies |
| --- | --- | --- | --- |
| Computational inefficiency in traditional PPO for LLM training | GRPO (Group Relative Policy Optimization): Eliminates the need for a separate value model by using the average reward of multiple outputs from the same prompt as the baseline for advantage calculation (see the first sketch after this table). | Performance comparisons demonstrate computational efficiency while maintaining effectiveness comparable to traditional PPO; particularly well suited to LLM reward modeling, where rewards are often comparative in nature. | GRPO |
| Token inefficiency and overthinking in long-form reasoning | Dr.GRPO (GRPO Done Right): Addresses optimization bias in GRPO by removing response-length normalization and reward standardization, yielding an unbiased policy-gradient estimate. | Experimental results show significantly improved token efficiency with better-controlled response lengths, effectively mitigating overthinking. | Dr.GRPO |
| Instability with varying response lengths in long-form reasoning | DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): Implements token-level policy-gradient calculation, allowing longer sequences to influence gradient updates appropriately regardless of individual response lengths. | Comparative analysis reveals more stable training dynamics, healthier entropy management, and better quality-pattern recognition, particularly when handling responses of varying lengths. | DAPO |
| Limited policy exploration due to rigid constraints | GPG (Group Policy Gradient): Simplifies the policy-gradient approach by removing reference models and policy constraints while maintaining stability through group-level reward normalization. | Comparative experiments demonstrate enhanced exploration capabilities with reduced computational requirements, providing more flexible policy updates. | GPG |
| Repetitive or narrow reasoning patterns | Auxiliary entropy bonus: Adds an entropy term to the RL loss to encourage token diversity and prevent deterministic response patterns. | Experimental results show more varied and creative reasoning paths without sacrificing solution accuracy. | T1 |
| Limitations of fixed reference models | On-policy KL normalization: Combines KL normalization with exponential moving average (EMA) updates to the reference model. | Dynamically updating the reference model allows more effective RL scaling while maintaining stable training dynamics. | T1 |
| Value model misalignment with strong prior policies | Value-Pretraining Alignment: Adds a dedicated pretraining phase for the value model to align it with strong prior policies before RL begins. | A two-stage convergence pattern shows initial range alignment followed by crucial knowledge injection, preventing output-length collapse on long-CoT tasks. | VC-PPO, VAPO |
| Conflicting variance-bias requirements between value and policy optimization | Decoupled-GAE (Generalized Advantage Estimation): Separates the λ parameter for value-function and policy optimization, allowing unbiased value estimation while retaining variance-reduction benefits for policy updates. | Mathematical analysis and experimental results demonstrate improved convergence without introducing additional bias, particularly for trajectory-level rewards in long-CoT tasks. | VC-PPO, VAPO |
| Limited exploration in constrained policy optimization | KL Divergence Removal: Eliminates the KL penalty term that constrains policy divergence from the reference model, letting the reasoning policy explore more freely. | Experiments reveal significant performance gains when constraints on policy-distribution shift are removed during extended reasoning training. | Open-Reasoner-Zero, DAPO |
| Premature deterministic behavior in RL systems | Clip-Higher Strategy: Decouples the lower and upper clipping ranges in PPO to promote exploration of low-probability tokens while maintaining stability (see the second sketch after this table). | Asymmetric clipping thresholds effectively counteract entropy collapse and maintain policy diversity throughout extended training. | DAPO |
| Ineffective gradient signals in late-stage training | Dynamic Sampling: Adaptively filters out prompts whose accuracy is exactly 0 or 1 so that every batch provides an effective gradient signal. | Comparative training curves demonstrate faster convergence to target performance despite the additional computational overhead of oversampling. | DAPO, Bae et al. |
| Noisy reward signals from length-truncated samples | Overlong Filtering: Masks the loss contribution of truncated samples that exceed the maximum length, preventing inappropriate penalization of otherwise sound reasoning. | Ablation studies show substantial training-stability improvements once these noisy reward signals are removed. | DAPO |
| Inconsistent advantage estimation across variable-length sequences | Length-Adaptive GAE: Dynamically adjusts the λ parameter in GAE based on sequence length, ensuring balanced TD-error influence for both short and long outputs. | Empirical tests reveal more balanced advantage estimation and improved training stability across sequences of varying lengths, particularly for long-form reasoning. | VAPO |
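To make the group-relative baseline above concrete, here is a minimal sketch of how GRPO-style advantages can be computed from a group of rewards sampled for the same prompt, together with a Dr.GRPO-style variant that drops the standard-deviation normalization. This is illustrative code under simplifying assumptions, not the implementation from the cited papers.

```python
# Minimal sketch of group-relative advantage estimation (GRPO-style).
# Hypothetical, simplified code for illustration only.
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style: the baseline is the mean reward of the group sampled for the
    same prompt, and rewards are standardized by the group's std."""
    baseline = group_rewards.mean()
    return (group_rewards - baseline) / (group_rewards.std() + eps)

def dr_grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Dr.GRPO-style variant: keep the mean baseline but drop the std
    normalization (the full method also drops response-length normalization)."""
    return group_rewards - group_rewards.mean()

# Example: 4 responses sampled for one prompt, scored by a rule-based verifier.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(grpo_advantages(rewards))     # standardized group-relative advantages
print(dr_grpo_advantages(rewards))  # mean-centered advantages only
```

In a full trainer these advantages would be broadcast to every token of the corresponding response and plugged into a PPO-style clipped objective.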
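Likewise, the Clip-Higher and Dynamic Sampling rows can be read as two small modifications to a standard PPO-style loop. The sketch below is a rough, hypothetical illustration; the clipping thresholds and the accuracy-filter bounds are placeholder values, not the settings reported by DAPO.

```python
# Rough illustration of two DAPO-style tricks; names and values are placeholders.
import numpy as np

EPS_LOW, EPS_HIGH = 0.2, 0.28  # decoupled clip ranges; values are assumed

def clip_higher_objective(ratio: np.ndarray, advantage: np.ndarray) -> float:
    """PPO-style clipped surrogate with a larger upper clip range, so that
    low-probability tokens with positive advantage can still gain probability."""
    clipped = np.clip(ratio, 1.0 - EPS_LOW, 1.0 + EPS_HIGH)
    return float(np.minimum(ratio * advantage, clipped * advantage).mean())

def dynamic_sampling_filter(prompt_accuracy: dict) -> list:
    """Keep only prompts whose group accuracy lies strictly between 0 and 1;
    all-correct or all-wrong groups contribute no useful gradient signal."""
    return [p for p, acc in prompt_accuracy.items() if 0.0 < acc < 1.0]
```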
Reward Design
| Problem to Solve | Method Overview | Evidence | Related Studies |
| --- | --- | --- | --- |
| Uncontrolled CoT length in reasoning tasks | Cosine Length Reward: Applies cosine-based reward shaping that prioritizes shorter, correct CoTs while penalizing short, incorrect ones (see the sketch after this table). | Evaluation across diverse reasoning tasks shows stabilized CoT length with preserved performance. | Demystify |
| Reward hacking in deterministic reasoning tasks | Accuracy + Format Reward: Combines verification of answer correctness with structured formatting requirements that enforce explicit reasoning within specialized tags. | Rule-based reward systems prove more resistant to reward hacking than neural alternatives while simplifying the training pipeline. | DeepSeek-R1, SimpleRL, T1, Logic-RL, STILL-3 |
| Language mixing in multilingual settings | Language Consistency Incentive: Rewards the proportion of target-language words in the CoT to mitigate language mixing. | User studies indicate enhanced readability despite minor performance trade-offs in multilingual contexts. | DeepSeek-R1 |
| Model overthinking and verbosity | Overthinking Length Penalty: A weighted reward mechanism that penalizes excessive response length while preserving correctness. | Gradually introduced length penalties resulted in more token-efficient reasoning. | KIMI-K1.5, DAPO |
| Inaccurate reward modeling in nuanced domains | Chain-of-Thought RM: Enhances reward modeling with explicit step-by-step reasoning before the final correctness judgment, particularly for domains with nuanced evaluation criteria. | Manual verification confirmed that CoT reward models achieve significantly higher accuracy than classic reward models without reasoning steps. | KIMI-K1.5 |
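To illustrate the rule-based reward designs above, the sketch below combines an accuracy check, a format check, and a cosine-shaped length term. The tag format, weights, and length budget are assumptions made for illustration; they are not the exact rules used in the cited works.

```python
# Hypothetical rule-based reward sketch (accuracy + format + cosine length
# shaping). Tag format, weights, and the length budget are illustrative only.
import math
import re

MAX_LEN = 4096  # assumed CoT length budget (characters here, for simplicity)

def format_reward(response: str) -> float:
    """Require explicit reasoning inside <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*</think>", response, re.DOTALL) else 0.0

def accuracy_reward(predicted: str, gold: str) -> float:
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def cosine_length_reward(correct: bool, length: int) -> float:
    """Correct answers earn more reward when shorter; incorrect answers are
    penalized more heavily when shorter (discouraging confident short guesses)."""
    t = min(length, MAX_LEN) / MAX_LEN               # 0 = very short, 1 = at budget
    if correct:
        return 0.5 + 0.5 * math.cos(math.pi * t)     # 1.0 -> 0.0 as length grows
    return -1.0 + 0.5 * (1 - math.cos(math.pi * t))  # -1.0 -> 0.0 as length grows

def total_reward(response: str, predicted: str, gold: str) -> float:
    correct = accuracy_reward(predicted, gold) == 1.0
    return (accuracy_reward(predicted, gold)
            + 0.2 * format_reward(response)
            + 0.1 * cosine_length_reward(correct, len(response)))
```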
Training Data
| Problem to Solve | Method Overview | Evidence | Related Studies |
| --- | --- | --- | --- |
| Resource-constrained RL training environments | High-impact Sample Selection: Prioritizes training samples by measured learning impact (see the sketch after this table). | Results show a significant reduction in required training data while maintaining performance. | LIMR |
| Training with noisy web-extracted data | Noise Reduction Filtering: Applies filtering mechanisms to remove noisy web-extracted data. | Filtered datasets demonstrate improved generalization on OOD tasks. | Demystify |
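As a hypothetical illustration of how sample selection and noise filtering might be combined, the sketch below drops unverifiable items and keeps prompts in a mid-difficulty band as a crude proxy for learning impact. The field names, helper functions, and thresholds are assumptions, not the criteria used by LIMR or the cited filtering pipeline.

```python
# Hypothetical data-curation sketch; schema, helpers, and thresholds are assumed.
def curate(dataset, verify_answer, pass_rate, low=0.1, high=0.9):
    """dataset: iterable of dicts with 'question' and 'answer' keys (assumed schema).
    verify_answer(answer) -> bool: checks the answer is parseable / well-formed.
    pass_rate(question) -> float: current model accuracy on the prompt."""
    kept = []
    for item in dataset:
        if not verify_answer(item["answer"]):   # noise-reduction filtering
            continue
        rate = pass_rate(item["question"])      # crude proxy for learning impact
        if low <= rate <= high:                 # skip trivial and hopeless prompts
            kept.append(item)
    return kept
```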
Multi-stage Training
| Problem to Solve | Method Overview | Evidence | Related Studies |
| --- | --- | --- | --- |
| Poor readability and reasoning in direct RL approaches | Cold-start Progression: A phased approach that begins with fine-tuning on high-quality CoT data before transitioning to large-scale reinforcement learning. | Models with cold-start initialization exhibit better readability and reasoning than direct RL approaches. | DeepSeek-R1, T1, DeepScaleR, STILL-3 |
| Inefficient training on problems of varied difficulty | Strategic Sampling: Combines a curriculum from simple to complex problems with prioritization of difficult cases where model performance is weakest. | Targeted sampling yields faster convergence and more efficient use of compute during training. | KIMI-K1.5 |
| Inefficient use of context in long-form reasoning | Progressive Context Scaling: A multi-stage approach that gradually increases the context window as model performance plateaus at each level (see the sketch after this table). | Phased context-window expansion improves both computational efficiency and final performance compared to training at the maximum context from the start. | DeepScaleR |
| Performance gaps on challenging reasoning problems | Targeted Annealing: A final training phase on specifically mined challenging problems with a linearly decaying learning rate. | Enhanced performance on complex reasoning tasks without compromising general capabilities. | Open-Reasoner-Zero |
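To show how a cold-start phase and progressive context scaling might be wired into one schedule, here is a hypothetical configuration sketch; the stage names, context lengths, and plateau trigger are illustrative assumptions, not the schedules used by DeepScaleR or the other cited works.

```python
# Hypothetical multi-stage schedule: SFT cold start, then RL with a context
# window that grows when reward plateaus. All numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    method: str        # "sft" or "rl"
    max_context: int   # context window used for rollouts / training

SCHEDULE = [
    Stage("cold_start_sft", "sft", 8_192),
    Stage("rl_stage_1", "rl", 8_192),
    Stage("rl_stage_2", "rl", 16_384),   # expand once stage-1 reward plateaus
    Stage("rl_stage_3", "rl", 24_576),
]

def has_plateaued(reward_history, window: int = 50, tol: float = 1e-3) -> bool:
    """Assumed trigger: advance to the next stage when the moving-average reward
    stops improving by more than `tol` over the last `window` steps."""
    if len(reward_history) < 2 * window:
        return False
    recent = sum(reward_history[-window:]) / window
    earlier = sum(reward_history[-2 * window:-window]) / window
    return (recent - earlier) < tol
```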

Implementation of RL Scaling

The Practitioner’s Roadmap: How to Apply Test-Time Scaling to Your Applications?

See the code for a hands-on tutorial.

Curated Papers

See papers

Long CoT Resource

| Work | Application | Type | Source | Quantity | Modality | Link |
| --- | --- | --- | --- | --- | --- | --- |
| O1 Journey--Part 1 | Math | Synthesize | GPT-4o | 0.3K | Text | GitHub, HuggingFace |
| Marco-o1 | Reasoning | Synthesize | Qwen2-7B-Instruct | 10K | Text | GitHub |
| STILL-2 | Math, Code, Science, Puzzle | Distillation | DeepSeek-R1-Lite-Preview, QwQ-32B-Preview | 5K | Text | GitHub, HuggingFace |
| RedStar-math | Math | Distillation | QwQ-32B-Preview | 4K | Text | HuggingFace |
| RedStar-code | Code | Distillation | QwQ-32B-Preview | 16K | Text | HuggingFace |
| RedStar-multimodal | Math | Distillation | QwQ-32B-Preview | 12K | Vision, Text | HuggingFace |
| S1K | Math, Science, Code | Distillation | Gemini Flash Thinking | 1K | Text | GitHub, HuggingFace |
| S1K-1.1 | Math, Science, Code | Distillation | DeepSeek R1 | 1K | Text | GitHub, HuggingFace |
| LIMO | Math | Distillation | DeepSeek R1, DeepSeek-R1-Distill-Qwen-32B | 0.8K | Text | GitHub, HuggingFace |
| OpenThoughts-114k | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 114K | Text | GitHub, HuggingFace |
| OpenR1-Math-220k | Math | Distillation | DeepSeek R1 | 220K | Text | GitHub, HuggingFace |
| OpenThoughts2-1M | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 1M | Text | GitHub, HuggingFace |
| CodeForces-CoTs | Code | Distillation | DeepSeek R1 | 47K | Text | GitHub, HuggingFace |
| Sky-T1-17k | Math, Code, Science, Puzzle | Distillation | QwQ-32B-Preview | 17K | Text | GitHub, HuggingFace |
| S²R | Math | Synthesize | Qwen2.5-Math-7B | 3K | Text | GitHub, HuggingFace |
| R1-Onevision | Science, Math, General | Distillation | DeepSeek R1 | 155K | Vision, Text | GitHub, HuggingFace |
| OpenO1-SFT | Math, Code | Synthesize | - | 77K | Text | GitHub, HuggingFace |
| Medical-o1 | Medical | Distillation | DeepSeek R1 | 25K | Text | GitHub, HuggingFace |
| O1 Journey--Part 3 | Medical | Distillation | o1-preview | 0.5K | Text | GitHub, HuggingFace |
| SCP-116K | Math, Science | Distillation | DeepSeek R1 | 116K | Text | GitHub, HuggingFace |
| open-r1-multimodal | Math | Distillation | GPT-4o | 8K | Vision, Text | GitHub, HuggingFace |
| Vision-R1-cold | Science, Math, General | Distillation | DeepSeek R1 | 200K | Vision, Text | GitHub, HuggingFace |
| MMMU-Reasoning-Distill-Validation | Science, Math, General | Distillation | DeepSeek R1 | 0.8K | Vision, Text | ModelScope |
| Clevr-CoGenT | Vision Counting | Distillation | DeepSeek R1 | 37.8K | Vision, Text | GitHub, HuggingFace |
| VL-Thinking | Science, Math, General | Distillation | DeepSeek R1 | 158K | Vision, Text | GitHub, HuggingFace |
| Video-R1 | Video | Distillation | Qwen2.5-VL-72B | 158K | Vision, Text | GitHub, HuggingFace |
| Embodied-Reasoner | Embodied AI | Synthesize | GPT-4o | 9K | Vision, Text | GitHub, HuggingFace |
| OpenCodeReasoning | Code | Distillation | DeepSeek R1 | 736K | Text | HuggingFace |
| SafeChain | Safety | Distillation | DeepSeek R1 | 40K | Text | GitHub, HuggingFace |
| KodCode | Code | Distillation | DeepSeek R1 | 2.8K | Text | GitHub, HuggingFace |

Development Timeline

The images present a comprehensive timeline of test-time scaling methods applied across various AI domains from 2020 to 2025. These visualizations track the evolution of key techniques including Parallel Sampling, Tree Search, Multi-turn Correction, and Long Chain-of-Thought (CoT) across different fields of application.

Key Test-Time Scaling Methods

The research maps four primary test-time scaling approaches:

  • Parallel Sampling (blue): Generating multiple candidate solutions in parallel (see the sketch after this list)
  • Tree Search (green): Exploring decision trees to find optimal solutions
  • Multi-turn Correction (red): Iterative refinement through multiple passes
  • Long CoT (Chain-of-Thought) (purple): Extended reasoning chains for complex problem-solving
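As a concrete example of the first approach, the sketch below implements parallel sampling with simple majority voting over extracted answers (self-consistency). `generate` and `extract_answer` are placeholders for your own model call and answer parser; they are not part of any library named here.

```python
# Minimal parallel-sampling sketch (self-consistency by majority vote).
# `generate` and `extract_answer` are placeholders for your own model API.
from collections import Counter
from typing import Callable

def parallel_sample_vote(prompt: str,
                         generate: Callable[[str, float], str],
                         extract_answer: Callable[[str], str],
                         n: int = 16,
                         temperature: float = 0.8) -> str:
    """Sample n candidate solutions and return the most common final answer."""
    answers = []
    for _ in range(n):
        completion = generate(prompt, temperature)  # independent stochastic sample
        answers.append(extract_answer(completion))
    return Counter(answers).most_common(1)[0][0]
```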

Training Strategies

The methods are implemented using various training approaches:

  • SFT (Supervised Fine-Tuning): Diamond symbol
  • DPO (Direct Preference Optimization): Triangle symbol
  • RL (Reinforcement Learning): Square symbol
  • Inference-only: Circle symbol

(Timeline figures: test-time scaling methods across application domains, 2020-2025.)

Bib

If you find our paper useful for your research, please cite the following paper:

@misc{xia2025generativeaiactii,
      title={Generative AI Act II: Test Time Scaling Drives Cognition Engineering}, 
      author={Shijie Xia and Yiwei Qin and Xuefeng Li and Yan Ma and Run-Ze Fan and Steffi Chern and Haoyang Zou and Fan Zhou and Xiangkun Hu and Jiahe Jin and Yanheng He and Yixin Ye and Yixiu Liu and Pengfei Liu},
      year={2025},
      eprint={2504.13828},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.13828}, 
}
