Add History-Aware Adaptive Difficulty Weighting (HA-DW) to GRPO #4872
Summary
This PR implements History-Aware Adaptive Difficulty Weighting (HA-DW) for the GRPO trainer, based on the paper "Your Group-Relative Advantage Is Biased".
Problem
The paper identifies a fundamental issue in group-based RL: the group-relative advantage estimator is inherently biased. Because advantages are normalized within each group, the strength of the learning signal a question produces depends on its difficulty, and it vanishes entirely when every completion in a group receives the same reward. This systematic bias causes the policy to under-learn from hard questions while over-exploiting easy ones, ultimately hurting both training stability and generalization.
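For intuition (this snippet is not part of the PR), here is the standard group-relative estimator in isolation; note how a uniformly failed question yields no gradient signal at all:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Standard GRPO advantage: normalize rewards within one group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Moderately difficult question: mixed outcomes give a strong signal.
print(group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.0])))  # ~[+1, -1, +1, -1]

# Hard question: every completion fails, the advantages collapse to zero,
# and the policy learns nothing from it -- this is the bias HA-DW targets.
print(group_relative_advantages(np.array([0.0, 0.0, 0.0, 0.0])))  # [0, 0, 0, 0]
```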
Solution
HA-DW addresses this bias through two key components:
1. Evolving Difficulty Anchor
Tracks the model's solving capability across batches with a Kalman-style update that blends the previous anchor with the current batch's outcome, using a gain η_t = η * σ_t that adapts to training stability.
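A minimal sketch of what such an anchor could look like; the exact update rule lives in the PR, and the way `hadw_eta` and `hadw_history_window` map onto η and σ_t below is an assumption:

```python
from collections import deque
import numpy as np

class DifficultyAnchor:
    """Evolving estimate c_t of the model's solving capability (illustrative)."""

    def __init__(self, hadw_eta: float = 0.1, hadw_history_window: int = 10):
        self.eta = hadw_eta                # base forgetting factor η
        self.capability = 0.5              # prior c_0 (assumed initialization)
        self.history = deque(maxlen=hadw_history_window)

    def update(self, batch_mean_reward: float) -> float:
        self.history.append(batch_mean_reward)
        # σ_t as a training-stability proxy: std of recent batch rewards.
        sigma_t = float(np.std(self.history)) if len(self.history) > 1 else 1.0
        eta_t = self.eta * sigma_t         # η_t = η * σ_t
        # Kalman-style blend of prior capability and the new observation.
        self.capability = (1.0 - eta_t) * self.capability + eta_t * batch_mean_reward
        return self.capability
```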
2. Adaptive Reweighting
Computes per-question reweighting factors from the current capability anchor (scaled by `hadw_lambda_scale`) and multiplies them into the group-relative advantages, so that hard questions are up-weighted instead of systematically suppressed.
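The paper's exact formula is not reproduced here; purely as an illustration, a difficulty-aware multiplicative correction could take this shape (the functional form is an assumption):

```python
import numpy as np

def hadw_reweighting(group_rewards: np.ndarray,
                     capability: float,
                     hadw_lambda_scale: float = 1.0) -> float:
    """Illustrative reweighting factor for one question's group.

    Questions whose empirical pass rate falls below the capability anchor
    are up-weighted; questions the model already solves are down-weighted.
    """
    pass_rate = float(group_rewards.mean())          # difficulty proxy
    difficulty_gap = capability - pass_rate          # > 0 for hard questions
    weight = 1.0 + hadw_lambda_scale * difficulty_gap
    return max(weight, 0.0)                          # keep weights non-negative

# Applied multiplicatively to the group-relative advantages:
#   advantages *= hadw_reweighting(rewards, anchor.capability)
```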
Changes
GRPOConfig
Added four new hyperparameters (all opt-in, default disabled):
- `use_hadw` (bool, default=False): Enable/disable HA-DW
- `hadw_eta` (float, default=0.1): Base forgetting factor for capability updates
- `hadw_lambda_scale` (float, default=1.0): Scaling factor for reweighting
- `hadw_history_window` (int, default=10): Window for computing training stability
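Enabling HA-DW in a config would then look like this (values are illustrative):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-hadw",
    use_hadw=True,              # opt in to HA-DW (off by default)
    hadw_eta=0.1,               # base forgetting factor η
    hadw_lambda_scale=1.0,      # scaling of the reweighting factors
    hadw_history_window=10,     # batches used to estimate stability σ_t
)
```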
GRPOTrainer

- Added a `_compute_hadw_reweighting()` method that computes the reweighting factors
- Applied the factors in the `_generate_and_score_completions()` advantage computation

Results from Paper
The paper reports consistent improvements when HA-DW is integrated with GRPO and its variants, with similar improvements observed on Qwen-3-8B and LLaMA-3.2-3B models.
Testing
Local Testing on Apple Silicon (MPS)
We successfully tested the implementation end-to-end on Apple Silicon (MPS).
Results:
✅ HA-DW successfully integrated and functional
✅ Adaptive reweighting activated when batch had mixed results (50% accuracy)
✅ Model learned successfully (0% → 100% accuracy)
✅ No numerical instability or crashes
✅ All HA-DW metrics logged correctly:
- `hadw/capability_prior` and `hadw/capability_posterior` tracked the model's evolution
- `hadw/eta_t` showed the adaptive forgetting factor
- `hadw/reweighting_mean` = 1.13 and `hadw/reweighting_std` = 0.74 when activated

An example HA-DW activation can be seen in the training logs.
Numerical Stability
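The reweighting path needs guardrails against degenerate groups (e.g., zero reward variance). Here is a sketch of the kind of clamping this involves; the bounds are illustrative, not the PR's actual constants:

```python
import numpy as np

def safe_reweighting(weights: np.ndarray,
                     min_weight: float = 0.1,
                     max_weight: float = 5.0) -> np.ndarray:
    """Sanitize and clamp reweighting factors before scaling advantages."""
    weights = np.nan_to_num(weights, nan=1.0, posinf=max_weight, neginf=min_weight)
    return np.clip(weights, min_weight, max_weight)
```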
Test Script Included
- `test_hadw_grpo.py`: Standalone test script for quick validation
- `TEST_HADW_README.md`: Comprehensive testing documentation
- Supports comparison against baseline GRPO (`--no-hadw` flag)

Backward Compatibility
This implementation is fully backward compatible:
- HA-DW is disabled by default (`use_hadw=False`), so existing GRPO setups are unaffected

Usage Example
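A minimal end-to-end sketch; the dataset, model, and reward function are placeholders, and only the `hadw_*` arguments come from this PR:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: favors longer completions, capped at 1.0.
def reward_len(completions, **kwargs):
    return [min(len(c) / 100.0, 1.0) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

config = GRPOConfig(
    output_dir="qwen-grpo-hadw",
    use_hadw=True,              # enable HA-DW
    hadw_eta=0.1,
    hadw_lambda_scale=1.0,
    hadw_history_window=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```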
References

- "Your Group-Relative Advantage Is Biased" (the paper on which HA-DW is based)