Training Reasoning Models: Notes as I Try to Understand GRPO and DAPO

I got interested in this topic after DeepSeek came out last year, and reasoning models suddenly started making noticeable jumps on major reasoning benchmarks. I tried reading the papers, took notes as I went, and realized that a lot of the gains seemed to be coming from reinforcement learning during post-training rather than just bigger models.

I'm currently taking CDSS 94 - Building Thoughtful AI Systems at Berkeley and revisiting these topics, so I figured I'd organize my notes more cleanly here. Below are some notes from my attempt to understand GRPO and DAPO.

Notes on GRPO

Policy in RL

The policy is the model's decision rule: given a state (here, the prompt plus the tokens generated so far), it outputs a probability distribution over next actions (tokens). Training adjusts the policy to make high-reward outputs more likely.

Value Function

A value function estimates the expected future reward from a given state. In actor-critic methods it serves as a baseline: subtracting it from the observed reward gives the advantage, a measure of how much better an action was than expected.

Actor-Critic Architecture

An actor-critic setup trains two models together: the actor (the policy) chooses actions, and the critic (a learned value function) scores them to reduce the variance of the policy's gradient updates.

GRPO (Group Relative Policy Optimization) is a policy optimization method that trains language models by comparing multiple sampled outputs relative to one another, rather than relying on an explicit learned value function.

High-Level Overview of GRPO

Imagine you're trying to teach a robot to play a game where it has to choose between different paths to reach a goal. The robot needs to learn which paths are good and which ones are not.

GRPO helps the robot by having it try a group of paths at once and grading each attempt against the others.

Concretely, GRPO samples N outputs per prompt instead of just 1. Each sample is scored with a reward function, and the advantage of each output is computed by comparing its reward to the group's statistics:

A_i = (r_i - mean(r_1, ..., r_N)) / std(r_1, ..., r_N)

Traditional actor-critic methods rely on learning a separate value function to compute advantage. GRPO removes this by using group-relative rewards instead.

Key Idea: Instead of learning a separate value function to estimate advantage, GRPO computes advantage using relative comparisons within a sampled group of outputs.
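As a sanity check on the idea, here's a minimal sketch of the group-relative advantage computation (function and variable names are my own):

```python
def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group's mean and standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # eps guards against division by zero when all rewards are identical
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled answers to one prompt, reward = 1 if correct else 0.
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct answers land above the group mean (positive advantage) and wrong ones below it (negative advantage), with no critic network involved.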

Notes on DAPO

DAPO = Decoupled Clip + Dynamic Sampling Policy Optimization

DAPO is a reinforcement learning optimization framework designed to stabilize and improve reasoning-focused LLM training by encouraging exploration, preserving gradient signal, and handling long-form outputs more effectively than PPO or GRPO.

The authors also aim to increase transparency by open-sourcing the full system, code, and dataset, making reproduction possible.

Main Tricks Introduced by DAPO

1. Clip Higher

In normal PPO, clipping limits how much the probability ratio of an action (token) can move in a single update. Because the clip range is symmetric, a low-probability (exploratory) token can only grow by a small absolute amount, which suppresses exploration.

DAPO separates the clipping thresholds into a lower bound eps_low and a larger upper bound eps_high (0.2 and 0.28 in the paper), so the ratio can rise further than it can fall.

This lets low-probability tokens grow more, encouraging exploration and helping avoid entropy collapse.
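A minimal sketch of the decoupled clip on the standard PPO surrogate, using the clip values reported in the paper (names are my own):

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clip-higher: the importance ratio may rise to 1 + eps_high
    but fall only to 1 - eps_low, giving rare tokens room to grow."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # standard pessimistic PPO surrogate: take the smaller of the two terms
    return min(ratio * advantage, clipped * advantage)

# A low-probability token whose ratio jumps to 1.5 with positive advantage:
# symmetric PPO (eps = 0.2) would cap its contribution at 1.2,
# clip-higher allows 1.28.
```

Note that the lower bound is untouched, so the usual pessimism for downweighting bad actions is preserved.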

2. Dynamic Sampling: Keep Training Signals Useful

When all sampled answers to a prompt are correct (accuracy = 1), their rewards are identical, so the group-relative advantage is zero for every sample: no gradient, wasted compute. The same happens when all answers are wrong (accuracy = 0).

Fix: Dynamically filter out prompts whose responses are all correct or all wrong, and keep sampling until the batch is filled with prompts that have partial success (some right, some wrong). This ensures every gradient update carries signal.
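The filter itself is simple; a sketch assuming binary correctness rewards (names are my own):

```python
def has_signal(rewards):
    """Keep a prompt only if its sampled responses disagree,
    i.e. mean accuracy is strictly between 0 and 1."""
    acc = sum(rewards) / len(rewards)
    return 0.0 < acc < 1.0

# Three prompts with 4 samples each: only the mixed one survives.
batch = {"p1": [1, 1, 1, 1], "p2": [1, 0, 1, 0], "p3": [0, 0, 0, 0]}
kept = [p for p, rs in batch.items() if has_signal(rs)]
```

In the real pipeline you would keep sampling new prompts until the filtered batch reaches its target size, so every step trains on full batches of informative examples.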

3. Token-Level Policy Gradient Loss: Fair Weighting

Problem: GRPO averages the loss within each response before averaging across responses, so each token in a long response contributes less to the gradient than a token in a short one. High-quality long answers don't get proper credit, and low-quality long answers (rambling, repetition) don't get proper blame.

Fix: Switch from sample-level loss (per-response averaging) to token-level loss, averaging over all tokens in the batch. This gives every token equal weight, so a response influences the update in proportion to its length.
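The difference is easy to see numerically; a sketch with made-up per-token losses (names are my own):

```python
def sample_level_mean(responses):
    """GRPO-style: average tokens within each response, then across responses."""
    per_resp = [sum(toks) / len(toks) for toks in responses]
    return sum(per_resp) / len(per_resp)

def token_level_mean(responses):
    """DAPO-style: average over all tokens, so each token weighs equally."""
    flat = [x for toks in responses for x in toks]
    return sum(flat) / len(flat)

# A 1-token response (loss 2.0) next to a 9-token response (loss 1.0 per token):
responses = [[2.0], [1.0] * 9]
```

Sample-level averaging gives 1.5, treating both responses equally despite the 9x length difference; token-level averaging gives 1.1, weighting the long response by its token count.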

4. Overlong Reward Shaping: Tame Long Responses

Problem: Some outputs exceed max length and get truncated. Penalizing them harshly introduces reward noise, even if the reasoning was fine.

Fix: Mask out the loss for truncated samples (overlong filtering), or apply a soft, length-aware punishment: no penalty up to a threshold, then a penalty that ramps up gradually inside a buffer zone before the hard length limit.

This stabilizes training and avoids penalizing complex reasoning unfairly.
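A sketch of a soft length penalty in this spirit (the threshold and buffer sizes here are illustrative, not necessarily the paper's configuration):

```python
def overlong_penalty(length, max_len=20480, buffer=4096):
    """Length-aware reward shaping: zero penalty below max_len - buffer,
    a linear ramp down to -1 inside the buffer, and the full -1 penalty
    once the response hits the hard limit and is truncated."""
    start = max_len - buffer
    if length <= start:
        return 0.0
    if length >= max_len:
        return -1.0
    return (start - length) / buffer  # linear interpolation, 0 down to -1
```

The gradual ramp means a response that slightly overshoots the comfortable length gets a mild nudge rather than the same harsh penalty as one that blows past the hard limit.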


These are the notes I took while trying to understand how reasoning models are trained. A lot of the visible improvements in reasoning performance seem tightly connected to these post-training RL details rather than just larger models.