Training Reasoning Models: Notes as I Try to Understand GRPO and DAPO
I got interested in this topic after DeepSeek came out last year, and reasoning models suddenly started making noticeable jumps on major reasoning benchmarks. I tried reading the papers, took notes as I went, and realized that a lot of the gains seemed to be coming from reinforcement learning during post-training rather than just bigger models.
I'm currently taking CDSS 94 - Building Thoughtful AI Systems at Berkeley and revisiting these topics, so I figured I'd organize my notes more cleanly here. Below are some notes from my attempt to understand GRPO and DAPO.
Notes on GRPO
Policy in RL
- A strategy an agent uses to decide which action to take in a given state
- Deterministic policy → always selects the same action for a given state
- Stochastic policy → defines a probability distribution over actions for a given state
- The goal in RL is to learn an optimal policy that maximizes cumulative reward
Value Function
- How good a state or action is in terms of future rewards
- State Value (V) → how good it is to be in a state
- Action Value (Q) → how good it is to take a specific action in a state
- A policy decides what actions to take, whereas the value function helps improve the policy by estimating which states or actions can lead to higher rewards
Actor-Critic Architecture
- A popular architecture that combines two components
- Actor → learns the policy and decides actions
- Critic → evaluates value functions to guide the actor
GRPO (Group Relative Policy Optimization) is a policy optimization method that trains language models by comparing multiple sampled outputs relative to one another, rather than relying on an explicit learned value function.
High-Level Overview of GRPO
Imagine you're trying to teach a robot to play a game where it has to choose between different paths to reach a goal. The robot needs to learn which paths are good and which ones are not.
GRPO helps the robot by:
- Trying different paths: the robot samples a few different actions from its current strategy (policy)
- Comparing how well each path worked
- Making small changes to its strategy to improve performance
GRPO uses group-based comparisons: for each prompt, it samples a group of N outputs instead of just 1, and each sample is scored with a reward function. Each action's advantage A(ai) is its reward minus the group's average reward, normalized by the group's standard deviation:
- If A(ai) > 0 → the action ai performed better than the group average
- If A(ai) < 0 → the action ai performed worse than the group average
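As a minimal sketch of this advantage computation (the reward values below are made up):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sample's reward is compared to the
    group mean and normalized by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every sample got the same reward: no relative signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four sampled answers scored 1 (correct) or 0 (wrong):
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# advs[0] > 0 (better than average), advs[1] < 0 (worse than average)
```

Note the degenerate case: when every sample in the group gets the same reward, the advantage is zero everywhere and there is nothing to learn from that prompt (DAPO's dynamic sampling, below, is a response to exactly this).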
Traditional actor-critic methods rely on learning a separate value function to compute advantage. GRPO removes this by using group-relative rewards instead.
- Because it drops the separate value network, GRPO is more memory-efficient, and the group-relative baseline is reported to give more stable updates
- A KL divergence penalty keeps the new policy's probability distribution close to the old (reference) policy's, making sure each update doesn't deviate too much from where it started
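The GRPO paper writes this KL term with an estimator of the form ratio − log(ratio) − 1 (with ratio = π_ref/π_θ), which is always non-negative; a tiny sketch assuming per-token log-probabilities:

```python
import math

def kl_estimate(logprob_new, logprob_ref):
    """Per-token KL estimate: ratio - log(ratio) - 1, where
    ratio = pi_ref / pi_new. It is zero when the two policies agree
    and grows as they diverge, in either direction."""
    ratio = math.exp(logprob_ref - logprob_new)
    return ratio - math.log(ratio) - 1.0
```

Because the estimate is non-negative token by token, it can be added to the loss as a penalty without introducing sign flips.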
Notes on DAPO
Keywords
- PPO (Proximal Policy Optimization) → a technique popular in RLHF. Clipping is used to make sure the model doesn't change too drastically after each timestep.
- Tokens with lower probability represent more exploratory paths; PPO's clipping restricts increasing the probability of these rare tokens
- This results in the model sticking to what it already knows (low token probabilities can't grow fast enough)
- Exploration dies off → entropy collapse
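To make the clipping point concrete, here is a toy version of PPO's clamp on the importance ratio (illustrative only, not a full PPO loss):

```python
def ppo_clipped_ratio(ratio, eps=0.2):
    """Clamp the importance ratio new_prob / old_prob into
    [1 - eps, 1 + eps], as in PPO's clipped surrogate objective."""
    return max(1.0 - eps, min(1.0 + eps, ratio))

# A rare token the model wants to boost from prob 0.01 to 0.03 has
# ratio 3.0, but the effective update is capped at 1.2 -- exactly the
# exploration-limiting behavior described above:
ppo_clipped_ratio(3.0)  # 1.2
```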
- AIME benchmark → a challenging high school math competition used in AI research to test complex mathematical reasoning
DAPO aims to increase transparency by open-sourcing the full system, code, and dataset, making reproduction possible.
DAPO = Decoupled Clip + Dynamic Sampling Policy Optimization
DAPO is a reinforcement learning optimization framework designed to stabilize and improve reasoning-focused LLM training by encouraging exploration, preserving gradient signal, and handling long-form outputs more effectively than PPO or GRPO.
Main Tricks Introduced by DAPO
1. Clip Higher
In normal PPO, clipping limits how much probabilities of actions (tokens) can change. This often hurts low-probability (exploratory) tokens.
DAPO separates the clipping thresholds:
- Keeps the standard lower bound (εlow = 0.2)
- But raises the upper bound (εhigh = 0.28)
This lets low-probability tokens grow more, encouraging exploration and avoiding entropy collapse.
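A sketch of the decoupled clip applied to a single token's surrogate term (toy numbers; the real objective averages this over tokens and groups):

```python
def dapo_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clip-Higher: the importance ratio is clamped into the asymmetric
    band [1 - eps_low, 1 + eps_high]; the surrogate then takes the more
    pessimistic of the unclipped and clipped terms, as in PPO."""
    clipped = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
    return min(ratio * advantage, clipped * advantage)

# Boosting a rare token (ratio 2.0, positive advantage): a symmetric
# PPO clip would cap the term at 1.2, the asymmetric band allows 1.28.
dapo_surrogate(2.0, 1.0)  # 1.28
```

Note that the lower bound still limits how fast a probability can shrink, so the extra headroom only affects upward moves.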
2. Dynamic Sampling: Keep Training Signals Useful
When all sampled answers to a prompt are perfect (accuracy = 1), their reward variance is 0 — no gradient, wasted compute.
Fix: Dynamically filter out samples where all responses are either perfect or completely wrong (accuracy = 0 or 1). Only prompts with partial success (some right, some wrong) are kept. This ensures gradient updates always have a signal.
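A sketch of the filter, assuming binary (0/1) correctness rewards per sampled answer:

```python
def has_signal(rewards):
    """Dynamic sampling: keep a prompt only if its group of sampled
    answers is mixed -- all-correct or all-wrong groups have zero reward
    variance, hence zero group-relative advantage and no gradient."""
    accuracy = sum(rewards) / len(rewards)
    return 0.0 < accuracy < 1.0

batch = {
    "p1": [1, 1, 1, 1],  # all correct -> filtered out
    "p2": [1, 0, 1, 0],  # mixed -> kept
    "p3": [0, 0, 0, 0],  # all wrong -> filtered out
}
kept = {p for p, r in batch.items() if has_signal(r)}  # {"p2"}
```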
3. Token-Level Policy Gradient Loss: Fair Weighting
Problem: GRPO averages the loss within each response before averaging across responses, so individual tokens in long responses are weighted less. High-quality long answers don't get enough credit, and bad long answers don't get enough blame.
Fix: Switch from sample-level loss (per-response) to token-level loss. This gives:
- More precise gradient updates
- Better handling of reasoning sequences of different lengths
- Penalizes gibberish in long outputs
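The difference between the two aggregation schemes, sketched with made-up per-token losses:

```python
def sample_level_loss(per_token_losses):
    """GRPO-style: average within each response first, then across
    responses -- every response counts equally regardless of length."""
    per_sample = [sum(t) / len(t) for t in per_token_losses]
    return sum(per_sample) / len(per_sample)

def token_level_loss(per_token_losses):
    """DAPO-style: pool all tokens and average once -- every token
    counts equally, so long responses contribute proportionally more."""
    flat = [x for t in per_token_losses for x in t]
    return sum(flat) / len(flat)

# One short response (2 tokens) and one long one (8 tokens):
losses = [[1.0, 1.0], [0.1] * 8]
sample_level_loss(losses)  # 0.55 -- the long answer's tokens are diluted
token_level_loss(losses)   # 0.28 -- each token weighted equally
```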
4. Overlong Reward Shaping: Tame Long Responses
Problem: Some outputs exceed max length and get truncated. Penalizing them harshly introduces reward noise, even if the reasoning was fine.
Fix:
- First, mask overlong samples from loss entirely
- Then, apply a Soft Overlong Penalty: gradually reduce the reward the further the response extends past a length threshold (rather than hard-cutting it)
This stabilizes training and avoids penalizing complex reasoning unfairly.
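A sketch of the soft penalty schedule (the threshold and buffer sizes here are illustrative, not the paper's exact hyperparameters):

```python
def soft_overlong_penalty(length, max_len=16384, buffer=4096):
    """Length-based reward shaping: no penalty up to a threshold, a
    linearly growing penalty inside the buffer zone before the cap,
    and the full -1 penalty once the cap is exceeded."""
    threshold = max_len - buffer
    if length <= threshold:
        return 0.0
    if length > max_len:
        return -1.0
    # Linear ramp from 0 at the threshold down to -1 at the cap.
    return (threshold - length) / buffer
```

The ramp means a response that runs slightly long is only slightly penalized, instead of every truncated response receiving the same harsh, noisy punishment.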
These are the notes I took while trying to understand how reasoning models are trained. A lot of the visible improvements in reasoning performance seem tightly connected to these post-training RL details rather than just larger models.