Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm proposed by DeepSeek [1] for training large language models. This repository aggregates and refactors four distinct implementations of GRPO, each demonstrating a different approach to the core algorithm while sharing common principles.

Algorithm

GRPO eliminates the need for a separate value function (critic) by using the average reward of multiple responses sampled for the same query as the baseline, which significantly reduces the computational and memory overhead of reinforcement learning training. This approach keeps policy optimization stable while removing the complexity of training a value function.

For each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the following objective:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\ \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}, 1-\epsilon, 1+\epsilon \right) \hat{A}_{i,t} \right] - \beta\, \mathbb{D}_{KL}\left[\pi_\theta \,\|\, \pi_{ref}\right] \right) \right]$$

Instead of adding the KL penalty to the reward, GRPO regularizes by adding the KL divergence between the trained policy and the reference policy directly to the loss, which avoids complicating the calculation of $\hat{A}_{i,t}$. The KL divergence is estimated with the following unbiased estimator, which is guaranteed to be positive:

$$\mathbb{D}_{KL}\left[\pi_\theta \,\|\, \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$$
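
In code, this estimator only needs the per-token log-probabilities of the sampled tokens under the two models. Below is a minimal PyTorch sketch; the tensor names policy_logprobs and ref_logprobs are illustrative rather than taken from any of the implementations.

    import torch

    def kl_penalty(policy_logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
        # Unbiased, non-negative estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1,
        # evaluated per token from the log-probabilities of the sampled tokens.
        log_ratio = ref_logprobs - policy_logprobs
        return torch.exp(log_ratio) - log_ratio - 1.0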

 

A reward function is then used to score the outputs, yielding $G$ rewards $\mathbf{r}=\{R(q, o_1), R(q, o_2), \cdots, R(q, o_G)\}$ correspondingly. These rewards are normalized by subtracting the group average and dividing by the group standard deviation. The normalized rewards for each output $o_i$ define the advantages $\hat{A}_{i, t}$ for all tokens in the output. The policy is then optimized by maximizing the GRPO objective.

$$\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$

where:

  • $\pi_{\theta}$ and $\pi_{\theta_{old}}$ are the current and old policy models,
  • $\pi_{ref}$ is the reference model,
  • $q$ are questions from the question dataset,
  • $o_i$ are outputs sampled from the old policy $\pi_{\theta_{old}}$,
  • $R(q, o_i)$ is the reward for output $o_i$ to question $q$,
  • $\epsilon$ is a clipping hyperparameter introduced in PPO to stabilize training,
  • $\beta$ is the coefficient of the KL penalty,
  • $\hat{A}_{i,t}$ is the advantage computed from the relative rewards of the outputs within each group (the advantage is constant for all tokens of a given output).
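
Putting these pieces together, a minimal per-token GRPO loss for one group of $G$ sampled outputs could look like the following PyTorch sketch. It assumes per-token log-probabilities of the sampled tokens under the current, old, and reference policies, one scalar reward per output, and a response mask; all tensor names and the default values of eps and beta are illustrative rather than taken from the implementations.

    import torch

    def grpo_loss(logprobs, old_logprobs, ref_logprobs, rewards, mask, eps=0.2, beta=0.04):
        # logprobs, old_logprobs, ref_logprobs: (G, T) log-probs of the sampled tokens
        # rewards: (G,) scalar reward per output; mask: (G, T) 1 for response tokens, 0 for padding

        # Group-relative advantages: normalize rewards within the group and broadcast
        # the same advantage to every token of the corresponding output.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        adv = adv.unsqueeze(1)

        # PPO-style clipped surrogate with the probability ratio pi_theta / pi_theta_old.
        ratio = torch.exp(logprobs - old_logprobs)
        surrogate = torch.minimum(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

        # Unbiased KL estimator against the reference policy (always non-negative).
        log_ratio_ref = ref_logprobs - logprobs
        kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

        # Average over the tokens of each output, then over the group; negate to minimize.
        per_output = ((surrogate - beta * kl) * mask).sum(dim=1) / mask.sum(dim=1)
        return -per_output.mean()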

 

Algorithm:
Input: initial policy model π_θ_init; reward function r_φ; task prompts 𝓓; hyperparameters ε, β, μ
1:  policy model π_θ ← π_θ_init
2:  for iteration = 1, ..., I do
3:      reference model π_ref ← π_θ
4:      for step = 1, ..., M do
5:          Sample a batch 𝓓_b from 𝓓
6:          Update the old policy model π_θ_old ← π_θ
7:          Sample G outputs {o_i}, i = 1, ..., G, from π_θ_old(·|q) for each question q ∈ 𝓓_b
8:          Compute rewards {r_i} = {R(q, o_1), R(q, o_2), ..., R(q, o_G)} for each output o_i using r_φ
9:          Compute the output-level advantage
                Â_i = (R(q, o_i) − mean(R(q, o_1), ..., R(q, o_G))) / std(R(q, o_1), ..., R(q, o_G))
            and set Â_i,t = Â_i for all tokens t in output o_i (constant within each output)
10:         for GRPO iteration = 1, ..., μ do
11:             Update the policy model π_θ by maximizing the GRPO objective
Output: π_θ
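
The loop structure above maps onto code roughly as follows. This is only a skeleton: generation, reward computation, and the loss/update step are injected as callables, and every name here (train_grpo, generate_group, compute_reward, grpo_step) is a placeholder rather than an API of any of the four implementations.

    import copy
    import random
    import torch

    def train_grpo(policy, optimizer, prompts, generate_group, compute_reward, grpo_step,
                   num_iterations=1, steps_per_iter=10, batch_size=8, group_size=8, mu=1):
        for _ in range(num_iterations):
            ref_policy = copy.deepcopy(policy)                # line 3: refresh the reference model
            for _ in range(steps_per_iter):
                batch = random.sample(prompts, batch_size)    # line 5: sample a batch of questions
                old_policy = copy.deepcopy(policy)            # line 6: freeze pi_theta_old
                for q in batch:
                    outputs = generate_group(old_policy, q, group_size)               # line 7
                    rewards = torch.tensor([compute_reward(q, o) for o in outputs])   # line 8
                    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)         # line 9
                    for _ in range(mu):                       # lines 10-11: inner GRPO updates
                        loss = grpo_step(policy, old_policy, ref_policy, q, outputs, adv)
                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()
        return policy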

Implementations

We provide four refactored implementations of GRPO, each with a different focus and design:

An implementation from nanoAhaMoment that separates each step of the GRPO loop into distinct components. It uses a rule-based reward function for a Countdown task and integrates with vLLM for efficient generation.

  • Modular pipeline with separated components
  • vLLM integration for efficient generation
  • DeepSpeed training backend
  • Format: <think>...</think>\n<answer>...</answer>
  • Rule-based reward functions for Countdown tasks (see the sketch below)
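
For a Countdown-style task (combine a given set of numbers with +, -, *, / to hit a target value), a rule-based reward can be built from a few regular expressions. The sketch below is illustrative only; the exact tag checks and score weights in the nanoAhaMoment-based implementation may differ, and all function names are hypothetical.

    import re

    def format_reward(completion: str) -> float:
        # Reward the <think>...</think>\n<answer>...</answer> structure.
        pattern = r"^<think>.*?</think>\n<answer>.*?</answer>$"
        return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

    def correctness_reward(completion: str, numbers: list[int], target: int) -> float:
        # Extract the proposed equation and check that it uses exactly the given
        # numbers and evaluates to the target.
        match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
        if match is None:
            return 0.0
        equation = match.group(1).strip()
        if not re.fullmatch(r"[\d+\-*/() ]+", equation):
            return 0.0                      # reject anything that is not plain arithmetic
        used = [int(n) for n in re.findall(r"\d+", equation)]
        if sorted(used) != sorted(numbers):
            return 0.0                      # every provided number must be used exactly once
        try:
            value = eval(equation)          # the character whitelist above keeps this contained
        except Exception:
            return 0.0
        return 1.0 if abs(value - target) < 1e-6 else 0.0

    def total_reward(completion: str, numbers: list[int], target: int) -> float:
        return format_reward(completion) + correctness_reward(completion, numbers, target)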

An implementation from GRPO-Zero that uses a separate server for the reference model to offload computation. It uses the Qwen2.5-3B-Instruct model on the Countdown-Tasks-3to4 dataset and a combined reward for correctness and format.

  • Qwen2.5-3B-Instruct base model
  • Countdown-Tasks-3to4 dataset
  • Simplified training workflow
  • Reward Function: Combined reward for correctness and format

An implementation from Simple GRPO that uses DeepSpeed for training and a reference model server. It features a policy-gradient loss with a KL penalty and reward normalization within groups.

  • Reference model server architecture
  • GSM8K dataset
  • KL divergence penalty term
  • Per-token advantage calculation
  • Distributed training support
  • Loss Calculation: loss = -(policy_ratio * advantage - beta * kl_divergence) (see the sketch below)
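
A minimal sketch of that loss term, assuming per-token log-probabilities under the current, old, and reference policies, a per-token advantage tensor, and a response mask; the names and the default beta are illustrative.

    import torch

    def simple_grpo_loss(logprobs, old_logprobs, ref_logprobs, advantages, mask, beta=0.04):
        # Unclipped variant: policy ratio times advantage, minus a per-token KL penalty.
        policy_ratio = torch.exp(logprobs - old_logprobs)
        log_ratio_ref = ref_logprobs - logprobs
        kl_divergence = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0
        loss = -(policy_ratio * advantages - beta * kl_divergence)
        return (loss * mask).sum() / mask.sum()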

An implementation from "The LM Book" by Andriy Burkov, that demonstrates the core GRPO algorithm step-by-step. It uses a copy of the reference model and performs multiple updates per batch.

  • Periodic reference model updates
  • Multiple updates per batch (μ-PPO)
  • Comprehensive reward decomposition
  • Memory optimization techniques
  • Reward Function: Combined reward for correctness and format

Common Components

All implementations share the following steps:

  • Group Sampling: For each prompt, multiple completions are generated to form a group.
  • Reward Calculation: Each completion receives a scalar reward, typically combining correctness and format adherence.
  • Advantage Normalization: Within each group, rewards are normalized to have zero mean and unit variance to form advantages (a worked example follows this list).
  • Policy Update: The policy is updated using a policy gradient method (with or without clipping) and often includes a KL penalty to prevent deviation from a reference policy.
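
For example, normalizing the rewards of a group of four completions in PyTorch (using the default sample standard deviation; implementations typically also add a small epsilon to guard against a zero denominator):

    import torch

    rewards = torch.tensor([2.0, 1.0, 0.0, 1.0])     # one scalar reward per completion in the group
    advantages = (rewards - rewards.mean()) / rewards.std()
    print(advantages)                                # tensor([ 1.2247,  0.0000, -1.2247,  0.0000])

Every token of the first completion is reinforced, every token of the third is penalized, and the remaining two are left essentially unchanged.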

Variations

The implementations differ in the following aspects:

  • Reward Functions: The implementations use different reward functions tailored to the task and different weights for format and correctness.

    • Format Reward: Enforces XML-style reasoning structure
    • Correctness Reward: Validates solution accuracy
    • Combined Reward: R_total = R_format + R_correctness
  • Reference Model Handling: Some implementations use a fixed reference model (via a separate server or a frozen copy) while others update the reference model periodically.

  • Training Framework: The implementations use different training frameworks (e.g., DeepSpeed, pure PyTorch) and optimization techniques (e.g., gradient checkpointing).

  • Batching and Generation: The approaches to generation (vLLM, Hugging Face transformers) and batching vary.

References

[1] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
