Reward Modeling and RLHF

April 26, 2026

I've been trying to build some intuition for reward modeling and RLHF recently. It shows up everywhere in modern LLMs, and I wanted to understand what is actually going on under the hood. So I figured I'd try to organize my thoughts.

At a high level, RLHF (reinforcement learning from human feedback) is reinforcement learning where the reward function isn't given up front. We have to learn it from human judgments.

In standard RL, you assume there's some reward signal telling you what's good or bad, and you optimize a policy to maximize expected reward over trajectories. RLHF fits into this loop, but replaces the "true" reward with a learned one.
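In symbols, the swap looks roughly like this. The notation is my own shorthand and not from any particular paper: r is the given reward, r_\phi is the learned reward model, x is a prompt drawn from some dataset D, and y is the model's response.

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t} r(s_t, a_t) \right]
\quad \longrightarrow \quad
J_{\mathrm{RLHF}}(\pi) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\left[ r_\phi(x, y) \right]
```

The left side is the usual expected return over trajectories. The right side treats a whole response as one action and scores it with the learned reward, which is a simplification but captures the idea.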

Reward Modeling

Instead of asking humans to assign scalar rewards, we usually collect preferences: given two model outputs, which one is better? From a dataset of these comparisons, we train a reward model that scores outputs so that preferred ones get higher scores. Under the hood, this often looks like a Bradley-Terry style setup: the probability that one output beats another is modeled as a sigmoid of the difference in their reward scores, and we train the reward model to maximize the likelihood of the choices humans actually made.
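To make the Bradley-Terry part concrete, here's a minimal sketch of the pairwise loss in plain Python. The function name and the list-of-floats interface are my own choices for illustration. In a real setup the scores would come from a neural reward model and you'd backpropagate through this loss.

```python
import math


def bradley_terry_loss(chosen_scores, rejected_scores):
    """Mean negative log-likelihood of the observed preferences.

    Under Bradley-Terry, P(chosen beats rejected) = sigmoid(r_chosen - r_rejected),
    so the per-pair loss is -log(sigmoid(margin)).
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("need one rejected score per chosen score")
    total = 0.0
    for r_chosen, r_rejected in zip(chosen_scores, rejected_scores):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign of m
        # so exp() never sees a large positive argument.
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(chosen_scores)


# Usage: the loss shrinks as the preferred output's score pulls ahead.
print(bradley_terry_loss([0.0], [0.0]))  # tie: the model is 50/50
print(bradley_terry_loss([2.0], [0.0]))  # preferred output scored higher
print(bradley_terry_loss([0.0], [2.0]))  # preferred output scored lower
```

Note that only the difference between scores matters here, so the reward model's outputs are only defined up to an additive constant.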

Once you have this reward model, you treat it like a proxy for human judgment. Now you can run RL, typically policy gradient methods, to fine-tune the base model so that it generates outputs that score higher under this learned reward.
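Here's a toy sketch of that RL step, assuming a deliberately tiny setup: the "policy" is a softmax over three candidate responses, and the reward model is just a fixed list of scores. The update is plain REINFORCE, the simplest policy gradient method, and not what production systems run. All names here are hypothetical, and the baseline is computed exactly, which only works because the action space is this small.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, rewards, lr=0.5, n_samples=64, rng=random):
    """One REINFORCE update on a softmax policy over a few fixed responses."""
    probs = softmax(logits)
    # Exact expected reward as a baseline (feasible only in a toy like this).
    baseline = sum(p * r for p, r in zip(probs, rewards))
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = rng.choices(range(len(logits)), weights=probs)[0]
        advantage = rewards[a] - baseline
        for k in range(len(logits)):
            # d/d logit_k of log softmax(logits)[a] = 1[k == a] - probs[k]
            indicator = 1.0 if k == a else 0.0
            grad[k] += advantage * (indicator - probs[k]) / n_samples
    # Gradient ascent: move logits toward higher expected reward.
    return [l + lr * g for l, g in zip(logits, grad)]


# Usage: scores from a stand-in "reward model" for three candidate responses.
rng = random.Random(0)
rewards = [0.1, 0.9, 0.4]
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = reinforce_step(logits, rewards, rng=rng)
print(softmax(logits))  # probability mass should shift toward the second response
```

The policy never sees why a response is good. It only sees the score, which is why everything downstream depends on the reward model being a decent proxy.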

The Pipeline

Conceptually, it's pretty clean:

Supervised fine-tuning gets you reasonable behavior
Reward modeling defines what "good" means
RL pushes the model toward that notion of good

But a few subtleties make this tricky.

First, reward models are imperfect. If the policy starts exploiting quirks in the reward model, you get reward hacking: outputs score high but are not actually better.

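Second, policy gradients in this setting are noisy and high variance, just as they are in standard RL. That makes training unstable, so PPO clips each update to keep it small, and RLHF setups commonly add a KL penalty that keeps the policy close to the supervised model. DPO takes a different route: it skips the explicit reward model and the RL loop, and optimizes the policy directly on the preference pairs.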
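To see what "keeping updates small" can look like, here's a sketch of two common pieces for a single sample: PPO's clipped surrogate objective, and a reward with a KL-style penalty against a reference model. The function names and the default values for clip_eps and beta are illustrative assumptions, not settings from any specific system.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the min makes the objective pessimistic: the policy gets no
    # extra credit for pushing the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_penalized_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward model score minus a penalty for drifting from the reference.

    logp_policy - logp_reference is a single-sample estimate of the KL
    divergence between the current policy and the reference model.
    """
    return rm_score - beta * (logp_policy - logp_reference)


# Usage: a sample whose probability doubled under the new policy.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))   # gain is capped by the clip
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=-1.0))  # penalty is not capped
print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_reference=-2.0))
```

The asymmetry in the first function is the point: improvements are capped, but mistakes are charged in full.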

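Third, the whole system is somewhat circular. The reward model is trained on data from earlier policies, but then used to optimize newer ones. As the policy drifts away from the outputs the reward model was trained on, the reward model's scores become less reliable, which is exactly where reward hacking tends to creep in.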

Overall, RLHF feels less like a single clean algorithm and more like a pipeline that works because each piece compensates for the others. But the core idea is simple: learn what humans prefer, then optimize for it.