📄 English Paper | 📄 Chinese Paper | 🌐 Website | 💡 Vision | 📖 Recipes for RL Scaling | 💻 Code Tutorial for RL Scaling | 📚 Curated Papers (300+) | 📑 Long CoT Resource | 📅 Timeline | 📜 Bib
The first AI generation ("Act I", 2020-2023) scaled parameters and data impressively but faced limitations in knowledge currency, reasoning depth, and cognitive flexibility. Prompt engineering became our primary AI interface. "Act II" (2024-present) transforms models from knowledge retrievers to thought constructors through test-time scaling, establishing mind-level AI connections. This work defines cognition engineering's foundations, provides tutorials and implementations, and democratizes access to AI's second paradigm.
- [2025-04-21] 🎉 🎉 🎉 We have released our paper "Generative AI Act II: Test Time Scaling Drives Cognition Engineering" along with all resources!
- 👩🔬 As an AI researcher, are you looking for a new research direction to break through the current bottlenecks of large language models?
- 💻 As an AI engineer, do you need a step-by-step tutorial to implement and optimize test-time scaling methods?
- 🎓 As a student or AI newcomer, do you want a systematic framework to understand the concept and application of "cognition engineering"?
- 👩🏫 As an educator, do you need structured teaching resources to explain test-time scaling?
- 💼 As an investor or decision-maker, would you like to understand what new stage generative AI has entered?
The three scaling phases illustrated as a progression of knowledge representation. Pre-training scaling (blue) forms isolated knowledge islands with fundamental physics concepts connected by limited innate associations. Post-training scaling (green) densifies these islands with more sophisticated learned connections between related concepts. Test-time scaling (red) enables dynamic reasoning pathway formation between previously disconnected concepts through extended computation, facilitating multi-hop inference across the entire knowledge space.
Test-time scaling builds bridges between knowledge islands, connecting distant nodes that remain isolated during pre-training and conventional post-training.
The emergence of cognition engineering through test-time scaling marks a fundamental paradigm shift in artificial intelligence. Far beyond mere technical implementation, this transformation carries profound implications for how we develop AI systems (Data Engineering 2.0), reimagine human-AI collaboration, and conduct scientific research. (See paper for details.)
Training Algorithm
Problem to Solve | Method Overview | Evidence | Related Studies |
---|---|---|---|
Computational inefficiency in traditional PPO for LLM training | GRPO (Group Relative Policy Optimization): Eliminates the need for a separate value model by using the average reward of multiple outputs from the same prompt as the baseline for advantage calculation (see the advantage sketch below this table). | Performance comparisons demonstrate computational efficiency while maintaining comparable effectiveness to traditional PPO, particularly well-suited for LLM reward modeling where rewards are often comparative in nature. | GRPO |
Token inefficiency and overthinking in long-form reasoning | Dr.GRPO (Doctor GRPO): Addresses optimization bias in GRPO by removing response-length normalization and reward standardization, implementing an unbiased policy gradient estimation. | Experimental results show significantly improved token efficiency with better controlled response lengths, effectively mitigating overthinking problems. | Dr.GRPO |
Instability with varying response lengths in long-form reasoning | DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): Implements token-level policy gradient calculation, allowing longer sequences to appropriately influence the gradient updates regardless of individual response lengths. | Comparative analysis reveals more stable training dynamics with healthier entropy management and better quality pattern recognition, particularly for handling varying response lengths effectively. | DAPO |
Limited policy exploration due to rigid constraints | GPG (Group Policy Gradient): Simplifies the policy gradient approach by removing reference models and policy constraints while maintaining stability through group-level reward normalization. | Comparative experiments demonstrate enhanced exploration capabilities with reduced computational requirements, providing more flexible policy updates. | GPG |
Repetitive or narrow reasoning patterns | Auxiliary entropy bonus: Incorporates an additive entropy term into the RL loss function to encourage token diversity and prevent deterministic response patterns. | Experimental results show more varied and creative reasoning paths without sacrificing solution accuracy. | T1 |
Limitations of fixed reference models | On-policy KL normalization: Combines KL normalization with Exponential Moving Average (EMA) updates to the reference model. | Dynamic reference model updating allows for more effective RL scaling while maintaining stable training dynamics. | T1 |
Value model misalignment with strong prior policies | Value-Pretraining Alignment: Implements a dedicated pretraining phase for the value model to ensure alignment with strong prior policies before RL begins. | Two-stage convergence pattern shows initial range alignment followed by crucial knowledge injection, preventing collapse in output length for long-CoT tasks. | VC-PPO, VAPO |
Conflicting variance-bias requirements between value and policy optimization | Decoupled-GAE (Generalized Advantage Estimation): Separates the λ parameter for value function and policy optimization, allowing unbiased value estimation while maintaining variance reduction benefits for policy updates. | Mathematical analysis and experimental results demonstrate improved convergence rates without introducing additional bias, particularly effective for trajectory-level rewards in long CoT tasks. | VC-PPO, VAPO |
Limited exploration in constrained policy optimization | KL Divergence Removal: Eliminates the KL penalty term that constrains policy divergence from the reference model, allowing the reasoning policy to explore more freely. | Experiments reveal significant performance gains when removing constraints on policy distribution shifts during extended reasoning training. | Open-Reasoner-Zero, DAPO |
Premature deterministic behavior in RL systems | Clip-Higher Strategy: Decouples lower and higher clipping ranges in PPO to specifically promote exploration of low-probability tokens while maintaining stability (see the DAPO sketch below this table). | Asymmetric clipping thresholds effectively counteract entropy collapse and maintain policy diversity throughout extended training. | DAPO |
Ineffective gradient signals in late-stage training | Dynamic Sampling: Implements an adaptive sampling approach that filters out prompts with accuracy values of exactly 0 or 1 to ensure effective gradient signals (also illustrated in the DAPO sketch below this table). | Comparative training curves demonstrate faster convergence to target performance despite the additional computational overhead of oversampling. | DAPO, Bae et al. |
Noisy reward signals from length-truncated samples | Overlong Filtering: Masks the loss contribution of truncated samples that exceed maximum length to prevent inappropriate penalization of otherwise sound reasoning. | Ablation studies highlight substantial training stability improvements when removing noisy reward signals from length-truncated samples. | DAPO |
Inconsistent advantage estimation across variable-length sequences | Length-Adaptive GAE: Dynamically adjusts the λ parameter in GAE based on sequence length, ensuring balanced TD-error influence for both short and long outputs (see the GAE sketch below this table). | Empirical tests reveal more balanced advantage estimation and improved training stability across sequences of varying lengths, particularly beneficial for long-form reasoning. | VAPO |
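To make the group-relative baseline in GRPO (and the Dr.GRPO fix) concrete, here is a minimal Python sketch of the per-prompt advantage computation; the function name and the binary-reward example are illustrative, not the released implementations.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for G responses sampled from the same prompt.

    The group mean replaces a learned value model as the baseline, and the group
    standard deviation rescales the signal (GRPO). Dr.GRPO's debiasing amounts to
    dropping this std division (and the per-response length normalization in the loss).
    """
    baseline = rewards.mean()
    scale = rewards.std() + eps  # guard against zero-variance groups
    return (rewards - baseline) / scale

# Example: four sampled responses to one prompt, rewarded 1.0 if correct, 0.0 otherwise.
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
```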
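The Clip-Higher and Dynamic Sampling rows above are sketched together below in the spirit of DAPO; the ε values follow the setting reported by the DAPO authors, while the function signatures are assumptions made for illustration.

```python
import torch

def clip_higher_loss(ratio: torch.Tensor, adv: torch.Tensor,
                     eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """PPO-style clipped objective with decoupled clipping ranges.

    A larger upper bound (eps_high > eps_low) leaves extra headroom for
    up-weighting low-probability tokens, which counteracts entropy collapse.
    The mean is taken over all tokens in the batch (token-level loss).
    """
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    return -torch.minimum(unclipped, clipped).mean()

def keep_prompt(group_rewards: list[float]) -> bool:
    """Dynamic sampling: discard prompts whose sampled group is all-correct or
    all-wrong, since their group-relative advantages are identically zero."""
    acc = sum(group_rewards) / len(group_rewards)
    return 0.0 < acc < 1.0
```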
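For the Decoupled-GAE and Length-Adaptive GAE rows, the sketch below pairs standard GAE with a length-dependent λ of the form λ = 1 − 1/(α·l) on the policy side (the value side can keep λ = 1 for unbiased estimation); the value of α here is purely illustrative.

```python
import numpy as np

def gae(rewards, values, gamma: float = 1.0, lam: float = 0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` carries one extra bootstrap entry: len(values) == len(rewards) + 1.
    """
    adv, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def length_adaptive_lambda(seq_len: int, alpha: float = 0.05) -> float:
    """lambda_policy = 1 - 1/(alpha * l): short responses get a smaller lambda
    (more bias, less variance); long responses push lambda toward 1.
    Clamped at 0.0 because very short sequences would otherwise go negative
    under this illustrative alpha."""
    return max(0.0, 1.0 - 1.0 / (alpha * seq_len))
```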
Reward Design
Problem to Solve | Method Overview | Evidence | Related Studies |
---|---|---|---|
Uncontrolled CoT length in reasoning tasks | Cosine Length Reward: Applies a cosine-based reward shaping that prioritizes shorter, correct CoTs while penalizing short, incorrect ones (see the sketch below this table). | Evaluation across diverse reasoning tasks reveals stabilized CoT length with preserved performance. | Demystify |
Reward hacking in deterministic reasoning tasks | Accuracy+Format Reward: Combines verification of answer correctness with structured formatting requirements that enforce explicit reasoning within specialized tags (see the rule-based reward sketch below this table). | Rule-based reward systems demonstrate greater resistance to reward hacking than neural alternatives while simplifying the training pipeline. | DeepSeek-R1, SimpleRL, T1, Logic-RL, STILL-3 |
Language mixing issues in multilingual environments | Language Consistency Incentive: Calculates rewards based on the proportion of target language words in the CoT to mitigate language mixing issues. | User studies indicate enhanced readability despite minor performance trade-offs in multilingual contexts. | DeepSeek-R1 |
Model overthinking and verbosity | Overthinking Length Penalty: Implements a weighted reward mechanism that penalizes excessive response length while preserving correctness to combat model overthinking. | Gradually introduced length penalties resulted in more token-efficient reasoning. | KIMI-K1.5, DAPO |
Inaccurate reward modeling in nuanced domains | Chain-of-Thought RM: Enhances reward modeling with explicit step-by-step reasoning before final correctness judgment, particularly for domains with nuanced evaluation criteria. | Manual verification confirmed that CoT reward models achieved significantly higher accuracy compared to classic reward models without reasoning steps. | KIMI-K1.5 |
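A minimal sketch of the cosine length-reward shaping above; the endpoint values (`r_correct`, `r_wrong`) are illustrative placeholders rather than the exact constants used in the cited work.

```python
import math

def cosine_length_reward(correct: bool, gen_len: int, max_len: int,
                         r_correct=(2.0, 1.0), r_wrong=(-10.0, 0.0)) -> float:
    """Cosine interpolation between a length-0 reward and a max-length reward.

    Correct answers decay from r_correct[0] to r_correct[1], so shorter correct
    CoTs earn more; wrong answers rise from r_wrong[0] to r_wrong[1], so short
    wrong answers are penalized hardest.
    """
    r0, r_end = r_correct if correct else r_wrong
    t = min(gen_len, max_len) / max_len
    return r0 + 0.5 * (r_end - r0) * (1.0 - math.cos(math.pi * t))
```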
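And a sketch of an Accuracy+Format rule-based reward; the `<think>`/`<answer>` tag names and the simple additive combination are assumptions modeled on the commonly used R1-style template, not a specific released reward function.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>...</think><answer>...</answer> template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), re.DOTALL) else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the reference answer exactly."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == gold.strip() else 0.0

def rule_based_reward(response: str, gold: str) -> float:
    # Simple additive combination; real systems weight or gate the two terms.
    return accuracy_reward(response, gold) + format_reward(response)
```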
Training Data
Problem to Solve | Method Overview | Evidence | Related Studies |
---|---|---|---|
Resource-constrained RL training environments | High-impact Sample Selection: Prioritizes training samples based on learning impact measurement. | Results show significant reduction in required training data while maintaining performance. | LIMR |
Training with noisy web-extracted data | Noise Reduction Filtering: Employs filtering mechanisms to remove noisy web-extracted data. | Filtered datasets demonstrate improved generalization capabilities on OOD tasks. | Demystify |
Multi-stage Training
Problem to Solve | Method Overview | Evidence | Related Studies |
---|---|---|---|
Poor readability and reasoning in direct RL approaches | Cold-start Progression: Implements a phased training approach beginning with high-quality CoT data fine-tuning before transitioning to large-scale reinforcement learning. | Models with cold-start initialization exhibit enhanced readability and reasoning capabilities compared to direct RL approaches. | DeepSeek-R1, T1, DeepScaleR, STILL-3 |
Inefficient training with problems of varied difficulty | Strategic Sampling: Combines curriculum-based progression from simple to complex problems with prioritization of difficult cases where model performance is weakest. | Targeted sampling approaches demonstrated faster convergence and more efficient use of computational resources during training. | KIMI-K1.5 |
Inefficient use of context in long-form reasoning | Progressive Context Scaling: Implements a multi-stage training approach that gradually increases context window size as model performance begins to plateau at each level (see the schedule sketch below this table). | Phased context window expansion demonstrates significant improvements in both computational efficiency and final performance metrics compared to fixed maximum context training. | DeepScaleR |
Performance gaps on challenging reasoning problems | Targeted Annealing: Implements a final training phase on specifically mined challenging problems with a linearly decaying learning rate to refine reasoning capabilities. | Enhanced performance metrics on complex reasoning tasks without compromising general capabilities. | Open-Reasoner-Zero |
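To illustrate the Progressive Context Scaling row, here is a schedule sketch in the spirit of DeepScaleR's staged context expansion; the specific token limits and the plateau criterion are assumptions for illustration, not the project's exact recipe.

```python
# Illustrative multi-stage context schedule: train at each maximum response
# length until the reward curve plateaus, then move to the next stage.
CONTEXT_SCHEDULE = [
    {"stage": 1, "max_response_tokens": 8_192},
    {"stage": 2, "max_response_tokens": 16_384},
    {"stage": 3, "max_response_tokens": 24_576},
]

def next_stage(stage_idx: int, reward_history: list[float],
               patience: int = 5, min_delta: float = 1e-3) -> int:
    """Advance to the next stage when recent reward improvement is negligible."""
    if stage_idx + 1 >= len(CONTEXT_SCHEDULE) or len(reward_history) < patience + 1:
        return stage_idx
    recent_gain = reward_history[-1] - reward_history[-(patience + 1)]
    return stage_idx + 1 if recent_gain < min_delta else stage_idx
```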
See the code for a hands-on tutorial.
See the papers.
Work | Application | Type | Source | Quantity | Modality | Link |
---|---|---|---|---|---|---|
O1 Journey--Part 1 | Math | Synthesize | GPT-4o | 0.3K | Text | GitHub HuggingFace |
Marco-o1 | Reasoning | Synthesize | Qwen2-7B-Instruct | 10K | Text | GitHub |
STILL-2 | Math, Code, Science, Puzzle | Distillation | DeepSeek-R1-Lite-Preview, QwQ-32B-Preview | 5K | Text | GitHub HuggingFace |
RedStar-math | Math | Distillation | QwQ-32B-Preview | 4K | Text | HuggingFace |
RedStar-code | Code | Distillation | QwQ-32B-Preview | 16K | Text | HuggingFace |
RedStar-multimodal | Math | Distillation | QwQ-32B-Preview | 12K | Vision, Text | HuggingFace |
S1K | Math, Science, Code | Distillation | Gemini Flash Thinking | 1K | Text | GitHub HuggingFace |
S1K-1.1 | Math, Science, Code | Distillation | DeepSeek R1 | 1K | Text | GitHub HuggingFace |
LIMO | Math | Distillation | DeepSeek R1, DeepSeek-R1-Distill-Qwen-32B | 0.8K | Text | GitHub HuggingFace |
OpenThoughts-114k | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 114K | Text | GitHub HuggingFace |
OpenR1-Math-220k | Math | Distillation | DeepSeek R1 | 220K | Text | GitHub HuggingFace |
OpenThoughts2-1M | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 1M | Text | GitHub HuggingFace |
CodeForces-CoTs | Code | Distillation | DeepSeek R1 | 47K | Text | GitHub HuggingFace |
Sky-T1-17k | Math, Code, Science, Puzzle | Distillation | QwQ-32B-Preview | 17K | Text | GitHub HuggingFace |
S²R | Math | Synthesize | Qwen2.5-Math-7B | 3K | Text | GitHub HuggingFace |
R1-Onevision | Science, Math, General | Distillation | DeepSeek R1 | 155K | Vision, Text | GitHub HuggingFace |
OpenO1-SFT | Math, Code | Synthesize | - | 77K | Text | GitHub HuggingFace |
Medical-o1 | Medical | Distillation | DeepSeek R1 | 25K | Text | GitHub HuggingFace |
O1 Journey--Part 3 | Medical | Distillation | o1-preview | 0.5K | Text | GitHub HuggingFace |
SCP-116K | Math, Science | Distillation | DeepSeek R1 | 116K | Text | GitHub HuggingFace |
open-r1-multimodal | Math | Distillation | GPT-4o | 8K | Vision, Text | GitHub HuggingFace |
Vision-R1-cold | Science, Math, General | Distillation | DeepSeek R1 | 200K | Vision, Text | GitHub HuggingFace |
MMMU-Reasoning-Distill-Validation | Science, Math, General | Distillation | DeepSeek R1 | 0.8K | Vision, Text | ModelScope |
Clevr-CoGenT | Vision Counting | Distillation | DeepSeek R1 | 37.8K | Vision, Text | GitHub HuggingFace |
VL-Thinking | Science, Math, General | Distillation | DeepSeek R1 | 158K | Vision, Text | GitHub HuggingFace |
Video-R1 | Video | Distillation | Qwen2.5-VL-72B | 158K | Vision, Text | GitHub HuggingFace |
Embodied-Reasoner | Embodied AI | Synthesize | GPT-4o | 9K | Vision, Text | GitHub HuggingFace |
OpenCodeReasoning | Code | Distillation | DeepSeek R1 | 736K | Text | HuggingFace |
SafeChain | Safety | Distillation | DeepSeek R1 | 40K | Text | GitHub HuggingFace |
KodCode | Code | Distillation | DeepSeek R1 | 2.8K | Text | GitHub HuggingFace |
The images present a comprehensive timeline of test-time scaling methods applied across various AI domains from 2020 to 2025. These visualizations track the evolution of key techniques including Parallel Sampling, Tree Search, Multi-turn Correction, and Long Chain-of-Thought (CoT) across different fields of application.
The research maps four primary test-time scaling approaches:
- Parallel Sampling (blue): Generating multiple candidate solutions in parallel
- Tree Search (green): Exploring decision trees to find optimal solutions
- Multi-turn Correction (red): Iterative refinement through multiple passes
- Long CoT (Chain-of-Thought) (purple): Extended reasoning chains for complex problem-solving
The methods are implemented using various training approaches:
- SFT (Supervised Fine-Tuning): Diamond symbol
- DPO (Direct Preference Optimization): Triangle symbol
- RL (Reinforcement Learning): Square symbol
- Inference-only: Circle symbol
If you find our work useful for your research, please cite the following paper:
@misc{xia2025generativeaiactii,
title={Generative AI Act II: Test Time Scaling Drives Cognition Engineering},
author={Shijie Xia and Yiwei Qin and Xuefeng Li and Yan Ma and Run-Ze Fan and Steffi Chern and Haoyang Zou and Fan Zhou and Xiangkun Hu and Jiahe Jin and Yanheng He and Yixin Ye and Yixiu Liu and Pengfei Liu},
year={2025},
eprint={2504.13828},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.13828},
}