Training Reasoning Models: Notes as I Try to Understand GRPO and DAPO
I got interested in this topic after DeepSeek came out last year, and reasoning models suddenly started making noticeable jumps on major reasoning benchmarks. I tried reading the papers, took notes as I went, and realized that a lot of the gains seemed to be coming from reinforcement learning during post-training rather than just bigger models.
I'm currently taking CDSS 94 - Building Thoughtful AI Systems at Berkeley and revisiting these topics, so I figured I'd organize my notes more cleanly here. Below are some notes from my attempt to understand GRPO and DAPO.
Notes on GRPO
Policy in RL
- A strategy an agent uses to decide which action to take in a given state
- Deterministic policy → always selects the same action for a given state
- Stochastic policy → defines a probability distribution over actions for a given state
- The goal in RL is to learn an optimal policy that maximizes cumulative reward
Value Function
- How good a state or action is in terms of future rewards
- State Value (V) → how good it is to be in a state
- Action Value (Q) → how good it is to take a specific action in a state
- A policy decides what actions to take, whereas the value function helps improve the policy by estimating which states or actions can lead to higher rewards
Actor-Critic Architecture
- A popular architecture that combines two components
- Actor → learns the policy and decides actions
- Critic → evaluates value functions to guide the actor
GRPO (Group Relative Policy Optimization) is a policy optimization method that trains language models by comparing multiple sampled outputs relative to one another, rather than relying on an explicit learned value function.
High-Level Overview of GRPO
Imagine you're trying to teach a robot to play a game where it has to choose between different paths to reach a goal. The robot needs to learn which paths are good and which ones are not.
GRPO helps the robot by:
- Trying different paths: the robot samples a few different actions from its current strategy (policy)
- Comparing how well each path worked
- Making small changes to its strategy to improve performance
GRPO uses group-based comparisons: for each prompt, it samples a group of N outputs instead of just 1, and each sample is scored with a reward function. Each action's advantage A(ai) is its reward minus the group's average reward, normalized by the group's standard deviation:
- If A(ai) > 0 → the action ai performed better than the group average
- If A(ai) < 0 → the action ai performed worse than the group average
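As a minimal sketch of this advantage computation (the reward values below are made up):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sample's reward is compared to the
    group mean and normalized by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every sample got the same reward: no relative signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four sampled answers scored 1 (correct) or 0 (wrong):
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# advs[0] > 0 (better than average), advs[1] < 0 (worse than average)
```

Note the degenerate case: when every sample in the group gets the same reward, the advantage is zero everywhere and there is nothing to learn from that prompt (DAPO's dynamic sampling, below, is a response to exactly this).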
Traditional actor-critic methods rely on learning a separate value function to compute advantage. GRPO removes this by using group-relative rewards instead.
- Because it drops the separate value network, GRPO is more memory-efficient, and the group-relative baseline is reported to give more stable updates
- A KL divergence penalty keeps the new policy's probability distribution close to the old (reference) policy's, making sure each update doesn't deviate too much from where it started
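The GRPO paper writes this KL term with an estimator of the form ratio − log(ratio) − 1 (with ratio = π_ref/π_θ), which is always non-negative; a tiny sketch assuming per-token log-probabilities:

```python
import math

def kl_estimate(logprob_new, logprob_ref):
    """Per-token KL estimate: ratio - log(ratio) - 1, where
    ratio = pi_ref / pi_new. It is zero when the two policies agree
    and grows as they diverge, in either direction."""
    ratio = math.exp(logprob_ref - logprob_new)
    return ratio - math.log(ratio) - 1.0
```

Because the estimate is non-negative token by token, it can be added to the loss as a penalty without introducing sign flips.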
Notes on DAPO
Keywords
- PPO (Proximal Policy Optimization) → a technique popular in RLHF. Clipping is used to make sure the model doesn't change too drastically after each timestep.
- Tokens with lower probability represent more exploratory paths; PPO's clipping restricts increasing the probability of these rare tokens
- This results in the model sticking to what it already knows (low token probabilities can't grow fast enough)
- Exploration dies off → entropy collapse
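To make the clipping point concrete, here is a toy version of PPO's clamp on the importance ratio (illustrative only, not a full PPO loss):

```python
def ppo_clipped_ratio(ratio, eps=0.2):
    """Clamp the importance ratio new_prob / old_prob into
    [1 - eps, 1 + eps], as in PPO's clipped surrogate objective."""
    return max(1.0 - eps, min(1.0 + eps, ratio))

# A rare token the model wants to boost from prob 0.01 to 0.03 has
# ratio 3.0, but the effective update is capped at 1.2 -- exactly the
# exploration-limiting behavior described above:
ppo_clipped_ratio(3.0)  # 1.2
```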
- AIME benchmark → a challenging high school math competition used in AI research to test complex mathematical reasoning
DAPO aims to increase transparency by open-sourcing the full system, code, and dataset, making reproduction possible.
DAPO = Decoupled Clip + Dynamic Sampling Policy Optimization
DAPO is a reinforcement learning optimization framework designed to stabilize and improve reasoning-focused LLM training by encouraging exploration, preserving gradient signal, and handling long-form outputs more effectively than PPO or GRPO.
Main Tricks Introduced by DAPO
1. Clip Higher
In normal PPO, clipping limits how much probabilities of actions (tokens) can change. This often hurts low-probability (exploratory) tokens.
DAPO separates the clipping thresholds:
- Keeps the standard lower bound (εlow = 0.2)
- But raises the upper bound (εhigh = 0.28)
This lets low-probability tokens grow more, encouraging exploration and avoiding entropy collapse.
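A sketch of the decoupled clip applied to a single token's surrogate term (toy numbers; the real objective averages this over tokens and groups):

```python
def dapo_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clip-Higher: the importance ratio is clamped into the asymmetric
    band [1 - eps_low, 1 + eps_high]; the surrogate then takes the more
    pessimistic of the unclipped and clipped terms, as in PPO."""
    clipped = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
    return min(ratio * advantage, clipped * advantage)

# Boosting a rare token (ratio 2.0, positive advantage): a symmetric
# PPO clip would cap the term at 1.2, the asymmetric band allows 1.28.
dapo_surrogate(2.0, 1.0)  # 1.28
```

Note that the lower bound still limits how fast a probability can shrink, so the extra headroom only affects upward moves.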
2. Dynamic Sampling: Keep Training Signals Useful
When all sampled answers to a prompt are perfect (accuracy = 1), their reward variance is 0 — no gradient, wasted compute.
Fix: Dynamically filter out samples where all responses are either perfect or completely wrong (accuracy = 0 or 1). Only prompts with partial success (some right, some wrong) are kept. This ensures gradient updates always have a signal.
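A sketch of the filter, assuming binary (0/1) correctness rewards per sampled answer:

```python
def has_signal(rewards):
    """Dynamic sampling: keep a prompt only if its group of sampled
    answers is mixed -- all-correct or all-wrong groups have zero reward
    variance, hence zero group-relative advantage and no gradient."""
    accuracy = sum(rewards) / len(rewards)
    return 0.0 < accuracy < 1.0

batch = {
    "p1": [1, 1, 1, 1],  # all correct -> filtered out
    "p2": [1, 0, 1, 0],  # mixed -> kept
    "p3": [0, 0, 0, 0],  # all wrong -> filtered out
}
kept = {p for p, r in batch.items() if has_signal(r)}  # {"p2"}
```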
3. Token-Level Policy Gradient Loss: Fair Weighting
Problem: GRPO averages the loss within each response before averaging across responses, so individual tokens in long responses are weighted less. High-quality long answers don't get enough credit, and bad long answers don't get enough blame.
Fix: Switch from sample-level loss (per-response) to token-level loss. This gives:
- More precise gradient updates
- Better handling of reasoning sequences of different lengths
- Penalizes gibberish in long outputs
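The difference between the two aggregation schemes, sketched with made-up per-token losses:

```python
def sample_level_loss(per_token_losses):
    """GRPO-style: average within each response first, then across
    responses -- every response counts equally regardless of length."""
    per_sample = [sum(t) / len(t) for t in per_token_losses]
    return sum(per_sample) / len(per_sample)

def token_level_loss(per_token_losses):
    """DAPO-style: pool all tokens and average once -- every token
    counts equally, so long responses contribute proportionally more."""
    flat = [x for t in per_token_losses for x in t]
    return sum(flat) / len(flat)

# One short response (2 tokens) and one long one (8 tokens):
losses = [[1.0, 1.0], [0.1] * 8]
sample_level_loss(losses)  # 0.55 -- the long answer's tokens are diluted
token_level_loss(losses)   # 0.28 -- each token weighted equally
```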
4. Overlong Reward Shaping: Tame Long Responses
Problem: Some outputs exceed max length and get truncated. Penalizing them harshly introduces reward noise, even if the reasoning was fine.
Fix:
- First, mask overlong samples from loss entirely
- Then, apply a Soft Overlong Penalty: gradually reduce the reward the further the response extends past a length threshold (rather than hard-cutting it)
This stabilizes training and avoids penalizing complex reasoning unfairly.
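A sketch of the soft penalty schedule (the threshold and buffer sizes here are illustrative, not the paper's exact hyperparameters):

```python
def soft_overlong_penalty(length, max_len=16384, buffer=4096):
    """Length-based reward shaping: no penalty up to a threshold, a
    linearly growing penalty inside the buffer zone before the cap,
    and the full -1 penalty once the cap is exceeded."""
    threshold = max_len - buffer
    if length <= threshold:
        return 0.0
    if length > max_len:
        return -1.0
    # Linear ramp from 0 at the threshold down to -1 at the cap.
    return (threshold - length) / buffer
```

The ramp means a response that runs slightly long is only slightly penalized, instead of every truncated response receiving the same harsh, noisy punishment.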
These are the notes I took while trying to understand how reasoning models are trained. A lot of the visible improvements in reasoning performance seem tightly connected to these post-training RL details rather than just larger models.