Group Relative Policy Optimization (GRPO) is an algorithm proposed by DeepSeek [1] for training large language models with reinforcement learning. This repository aggregates and refactors four distinct implementations of GRPO, each demonstrating a different approach to the core algorithm while sharing common principles.
GRPO eliminates the need for additional value function approximation by using the average reward of multiple sampled responses to the same query as the baseline, significantly reducing computational and memory overhead in reinforcement learning training. This approach maintains policy optimization stability while removing the complexity of training a value function.
For each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the following objective:

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\Big[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)\Big]\;
\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\min\left[\frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}\hat{A}_{i,t},\ \mathrm{clip}\left(\frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{i,t}\right] - \beta\, \mathbb{D}_{KL}\left[\pi_{\theta}\,\|\,\pi_{ref}\right]\right\}
$$

Instead of adding a KL penalty to the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of $\hat{A}_{i,t}$. The KL term is estimated with the unbiased estimator

$$
\mathbb{D}_{KL}\left[\pi_{\theta}\,\|\,\pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})} - \log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})} - 1.
$$

A reward function is then used to score the outputs, yielding $G$ rewards $\{R(q, o_1), \ldots, R(q, o_G)\}$ per group, from which the advantages are computed as

$$
\hat{A}_{i,t} = \frac{R(q, o_i) - \mathrm{mean}\big(\{R(q, o_1), \ldots, R(q, o_G)\}\big)}{\mathrm{std}\big(\{R(q, o_1), \ldots, R(q, o_G)\}\big)}
$$
where:
- $\pi_{\theta}$ and $\pi_{\theta_{old}}$ are the current and old policy models,
- $\pi_{ref}$ is the reference model,
- $q$ are questions sampled from the question dataset,
- $o_i$ are outputs sampled from the old policy $\pi_{\theta_{old}}$,
- $R(q, o_i)$ is the reward for output $o_i$ to question $q$,
- $\epsilon$ is the clipping hyper-parameter introduced in PPO for stabilizing training,
- $\beta$ is the coefficient of the KL penalty,
- $\hat{A}_{i,t}$ is the advantage calculated from the relative rewards of the outputs inside each group (note: the advantage value is constant for all tokens within a single output sequence).
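
A minimal PyTorch sketch of this objective is shown below. It is illustrative only: the tensor names (`logprobs_new`, `logprobs_old`, `logprobs_ref`, `response_mask`) and the default values for `clip_eps` and `beta` are assumptions, not values taken from any of the bundled implementations.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within one group of G sampled outputs: (r - mean) / std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logprobs_new, logprobs_old, logprobs_ref, advantages, response_mask,
              clip_eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """Clipped surrogate with a per-token KL penalty toward the reference policy.

    All tensors have shape (batch, seq_len). `advantages` repeats the same
    group-normalized value across every token of an output; `response_mask`
    is 1 for response tokens and 0 for prompt/padding tokens.
    """
    ratio = torch.exp(logprobs_new - logprobs_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # k3 estimator of KL(pi_theta || pi_ref): exp(x) - x - 1 with x = log(pi_ref / pi_theta)
    log_ref_ratio = logprobs_ref - logprobs_new
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1.0
    per_token = surrogate - beta * kl
    token_counts = response_mask.sum(dim=1).clamp(min=1)
    # The objective is maximized, so the returned loss is its negative.
    return -((per_token * response_mask).sum(dim=1) / token_counts).mean()
```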
Algorithm (iterative GRPO):

```
Input: initial policy model π_θ_init; reward function r_φ; task prompts 𝓓; hyperparameters ε, β, μ
 1: policy model π_θ ← π_θ_init
 2: for iteration = 1, ..., I do
 3:     reference model π_ref ← π_θ
 4:     for step = 1, ..., M do
 5:         Sample a batch 𝓓_b from 𝓓
 6:         Update the old policy model π_θ_old ← π_θ
 7:         Sample G outputs {o_i}_{i=1..G} ∼ π_θ_old(·|q) for each question q ∈ 𝓓_b
 8:         Compute rewards {r_i}_{i=1..G} = {R(q, o_1), ..., R(q, o_G)} for each output o_i using r_φ
 9:         Compute the output-level advantage
            Â_i = (R(q, o_i) − mean(R(q, o_1), ..., R(q, o_G))) / std(R(q, o_1), ..., R(q, o_G))
            and set Â_{i,t} = Â_i for all tokens t in output o_i (constant within each output)
10:         for GRPO iteration = 1, ..., μ do
11:             Update the policy model π_θ by maximizing the GRPO objective
Output: π_θ
```
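
Expressed as code, the control flow of this algorithm looks roughly like the skeleton below. All callables (`sync_reference`, `snapshot`, `sample_batch`, `sample_group`, `reward_fn`, `grpo_update`) are hypothetical placeholders standing in for each implementation's own generation, reward, and optimization routines.

```python
from statistics import mean, pstdev

def train_grpo(policy, sync_reference, snapshot, sample_batch, sample_group,
               reward_fn, grpo_update, num_iterations, steps_per_iter, group_size, mu):
    """Schematic GRPO outer loop; every callable argument is an injected placeholder."""
    for _ in range(num_iterations):
        sync_reference(policy)                          # steps 2-3: π_ref ← π_θ
        for _ in range(steps_per_iter):
            batch = sample_batch()                      # step 5: 𝓓_b ~ 𝓓
            old_policy = snapshot(policy)               # step 6: π_θ_old ← π_θ
            for q in batch:
                outputs = sample_group(old_policy, q, group_size)       # step 7
                rewards = [reward_fn(q, o) for o in outputs]            # step 8
                mu_r, sd = mean(rewards), pstdev(rewards) or 1.0
                advantages = [(r - mu_r) / sd for r in rewards]         # step 9, shared by all tokens of o_i
                for _ in range(mu):                                     # steps 10-11: μ inner updates
                    grpo_update(policy, old_policy, q, outputs, advantages)
    return policy
```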
We provide four refactored implementations of GRPO, each with a different focus and design:
1. nanoAhaMoment
An implementation from nanoAhaMoment that separates each step of the GRPO loop into distinct components. It uses a rule-based reward function for a Countdown task and integrates with vLLM for efficient generation.
- Modular pipeline with separated components
- vLLM integration for efficient generation
- DeepSpeed training backend
- Format: `<think>...</think>\n<answer>...</answer>`
- Rule-based reward functions for Countdown tasks
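
As a rough sketch of what such a rule-based reward might look like (the actual scoring and weights in the repository may differ), the function below grants a small reward for matching the `<think>/<answer>` format and a larger reward when the answer is a valid arithmetic expression over the given numbers that evaluates to the Countdown target.

```python
import ast
import re

ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def countdown_reward(completion: str, numbers: list[int], target: int) -> float:
    """Toy rule-based reward: format bonus plus correctness bonus (weights are assumptions)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0                      # no valid <think>/<answer> structure
    reward = 0.1                        # format reward
    expr = match.group(1).strip()
    if not re.fullmatch(r"[\d\s+\-*/()]+", expr):
        return reward                   # answer contains characters other than digits and arithmetic operators
    try:
        value = eval(compile(ast.parse(expr, mode="eval"), "<expr>", "eval"), {"__builtins__": {}})
    except Exception:
        return reward                   # malformed expression, e.g. division by zero
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    # This variant requires every given number to be used exactly once.
    if value == target and used == sorted(numbers):
        reward += 1.0                   # correctness reward
    return reward
```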
2. GRPO-Zero
An implementation from GRPO-Zero that uses a separate server for the reference model to offload computation. It trains on the Countdown-Tasks-3to4 dataset with a combined reward for correctness and format.
- Qwen2.5-3B-Instruct base model
- Countdown-Tasks-3to4 dataset
- Simplified training workflow
- Reward Function: Combined reward for correctness and format
3. Simple GRPO
An implementation from Simple GRPO that uses DeepSpeed for training and a reference model server. It features a policy-gradient loss with a KL penalty and reward normalization within groups.
- Reference model server architecture (sketched after this list)
- GSM8K dataset
- KL divergence penalty term
- Per-token advantage calculation
- Distributed training support
- Loss Calculation: `loss = -(policy_ratio * advantage - beta * kl_divergence)`
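
To illustrate the reference-model-server idea from the list above: the frozen reference policy can run in a separate process and return per-token log-probabilities over HTTP, so the trainer only pays for the KL term's inputs and does not hold a second model in GPU memory. The endpoint URL and JSON payload below are illustrative assumptions, not the actual interface used by Simple GRPO.

```python
import requests  # client side; the server process would wrap the frozen reference model

REF_SERVER_URL = "http://localhost:8000/ref_logprobs"   # hypothetical endpoint

def fetch_reference_logprobs(token_ids: list[list[int]]) -> list[list[float]]:
    """Ask the separate reference-model process for per-token log-probabilities.

    Only the KL penalty term needs these values, so offloading this forward
    pass keeps the reference model out of the trainer's GPU memory budget.
    """
    resp = requests.post(REF_SERVER_URL, json={"token_ids": token_ids}, timeout=60)
    resp.raise_for_status()
    return resp.json()["logprobs"]
```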
An implementation from "The LM Book" by Andriy Burkov, that demonstrates the core GRPO algorithm step-by-step. It uses a copy of the reference model and performs multiple updates per batch.
- Periodic reference model updates
- Multiple updates per batch (μ PPO-style inner iterations)
- Comprehensive reward decomposition
- Memory optimization techniques (see the sketch after this list)
- Reward Function: Combined reward for correctness and format
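
One concrete example of such a memory optimization, assuming a Hugging Face transformers model, is gradient (activation) checkpointing, which trades extra compute in the backward pass for lower activation memory. The model name below is reused from the GRPO-Zero section purely for illustration.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",        # model name reused from above for illustration
    torch_dtype=torch.bfloat16,        # half-precision weights also roughly halve memory
)
model.gradient_checkpointing_enable()  # recompute activations during backward instead of storing them
model.config.use_cache = False         # the generation KV cache is incompatible with checkpointing during training
```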
All implementations share the following steps:
- Group Sampling: For each prompt, multiple completions are generated to form a group.
- Reward Calculation: Each completion receives a scalar reward, typically combining correctness and format adherence.
- Advantage Normalization: Within each group, rewards are normalized to have zero mean and unit variance to form advantages.
- Policy Update: The policy is updated using a policy gradient method (with or without clipping) and often includes a KL penalty to prevent deviation from a reference policy.
The implementations differ in the following aspects:
- Reward Functions: The implementations use different reward functions tailored to the task, with different weights for format and correctness.
  - Format Reward: Enforces the XML-style reasoning structure
  - Correctness Reward: Validates solution accuracy
  - Combined Reward: `R_total = R_format + R_correctness`
- Reference Model Handling: Some implementations use a fixed reference model (via a separate server or a frozen copy), while others update the reference model periodically.
- Training Framework: The implementations use different training frameworks (e.g., DeepSpeed, pure PyTorch) and optimization techniques (e.g., gradient checkpointing).
- Batching and Generation: The approaches to generation (vLLM, Hugging Face transformers) and batching vary; a group-sampling sketch for both backends follows this list.
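
For example, group sampling (generating G completions per prompt) looks slightly different under the two generation backends. The prompt, model name, group size, and sampling settings below are assumptions for illustration only.

```python
# vLLM: one call generates G completions per prompt via SamplingParams(n=G).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")
params = SamplingParams(n=8, temperature=1.0, max_tokens=512)
groups = llm.generate(["Using 3, 5, 7, 9 make 24."], params)   # groups[0].outputs holds the 8 samples

# Hugging Face transformers: num_return_sequences plays the same role.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", torch_dtype=torch.bfloat16)
inputs = tok(["Using 3, 5, 7, 9 make 24."], return_tensors="pt")
samples = model.generate(**inputs, do_sample=True, temperature=1.0,
                         max_new_tokens=512, num_return_sequences=8)
```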
[1] Zhihong Shao et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300, 2024.