Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why RLHF exists

A pretrained language model knows everything and answers nothing. RLHF exists because the gap between “predict the next token” and “do what the user asked” is wider than prompt engineering can paper over.

AI & ML intermediate Apr 29, 2026

Why it exists

If you take a freshly pretrained LLM — one that has only seen the next-token objective on a giant pile of internet text — and ask it “what is the capital of France?”, you are not guaranteed to get “Paris.” You might get a plausible continuation like “what is the capital of Germany? what is the capital of Spain?” — because in the training data, that question often shows up in a list of similar questions.

The model isn’t broken. It’s doing exactly what it was trained to do: predict likely next tokens given the prompt. The problem is that “likely continuation of this string on the internet” and “answer this question helpfully” are different distributions, and the second one is the one users want.

You can paper over this with prompt engineering — few-shot examples, instruction phrasings, formatting tricks — and that worked for a while. But the gap is too wide and too varied to close with prompts alone. The model needs to be trained to prefer the helpful continuation over the merely likely one. The natural move is to write down a hand-coded reward function — “give +1 for helpful, −1 for unhelpful” — gradient-descend on it, and call it a day. The catch: nobody has a clean hand-written function for “helpful.” It’s the kind of thing humans recognize when they see it but cannot specify in advance.

RLHF exists because that’s exactly what RL was built for: optimizing against rewards you can score but can’t hand-write. The “reward” here is still a function — but a learned one, a second model trained on humans comparing pairs of outputs. The LLM is then nudged, via reinforcement learning, to produce outputs that the reward model rates highly. That whole stack is the foundation of what “alignment” means in practice for the current generation of chatbots, even as the techniques on top of it have evolved.

Why it matters now

Every chatbot you’ve actually enjoyed using is a post-RLHF (or RLHF-adjacent) model. The base pretrained checkpoint is, by modern product standards, almost unusable as a conversational assistant: it rambles, ignores instructions, refuses inconsistently, and confabulates with the same confidence it states facts.

What RLHF (and its successors: DPO, RLAIF, constitutional methods, and the reasoning-model RL recipes) buys you in production is roughly the inverse of that list: a model that stays on task, follows instructions, and refuses predictably.

For an engineer building on top of these models, the practical consequence is that much of what you observe at the API — refusal patterns, formatting habits, verbosity, the way it handles ambiguous instructions — is post-training-shaped, not pretraining-shaped. (Not all of that post-training is literally RLHF in 2026; SFT, DPO-family methods, AI-feedback variants, and reasoning-model RL all live in the same stage.) When the model surprises you, the surprise often lives in post-training, not the base.

The short answer

RLHF = supervised fine-tuning + reward model trained on human preference comparisons + RL loop that optimizes the LLM against the reward model

You can’t write a loss function for “be helpful,” so you train a second model to predict which of two responses a human would prefer, and then use reinforcement learning to push the LLM toward outputs the second model rates highly. The reward model is the proxy; the RL loop is how you cash that proxy into weight updates.
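In symbols, that is the InstructGPT-style objective (notation mine; the pretraining-mix term from the paper is omitted):

```latex
% KL-regularized RLHF objective, InstructGPT-style.
% pi_theta: policy being trained; pi_SFT: frozen SFT model;
% r_phi: learned reward model; beta: KL penalty coefficient.
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \left[\, r_\phi(x, y) \;-\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{SFT}}(y \mid x)} \,\right]
```

All three pieces of the recipe appear here: pi_SFT comes out of stage 1, r_phi out of stage 2, and the maximization over theta is stage 3.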

How it works

The canonical recipe — the one OpenAI used for InstructGPT (Ouyang et al., 2022, arXiv:2203.02155) and that became the default template for the field — has three stages.

Stage 1 — Supervised fine-tuning (SFT)

Start with the pretrained base model. Collect a modest dataset of prompts paired with high-quality human-written responses. Fine-tune with the ordinary next-token objective. This is the cheapest, simplest step and it does most of the visible work: after SFT alone, the model already feels much more like an assistant. Most of “instruction following” is already happening here.
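A minimal sketch of that objective, assuming per-position logits from a torch-style model; the prompt-masking convention shown is one common choice, and all names and shapes here are illustrative:

```python
# SFT objective sketch: ordinary next-token cross-entropy, with prompt
# tokens masked so only the human-written response contributes to the loss.
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """logits: [seq_len, vocab]; input_ids: [seq_len]; prompt_len: int."""
    pred = logits[:-1]               # position t predicts token t+1
    target = input_ids[1:].clone()
    # Mask positions whose target token is still inside the prompt.
    target[: prompt_len - 1] = -100  # -100 is cross_entropy's ignore_index
    return F.cross_entropy(pred, target, ignore_index=-100)

# Toy call with random tensors, just to show the shapes.
vocab, seq_len, prompt_len = 100, 12, 5
logits = torch.randn(seq_len, vocab, requires_grad=True)
input_ids = torch.randint(0, vocab, (seq_len,))
sft_loss(logits, input_ids, prompt_len).backward()
```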

SFT alone has limits, though. You can only show the model so many human-written examples, and “good response” is high-dimensional — the SFT data captures one cross-section of it, not the whole space. You also can’t easily teach negative preferences (don’t be confidently wrong, don’t pad) by showing more positive examples.

Stage 2 — Train a reward model on preference pairs

Take the SFT model. For a batch of prompts, sample several responses from it. Show pairs to human raters and ask: which one is better? Train a small model (often initialized from the LLM itself, with the language-modeling head replaced by a scalar score head) to predict the human’s choice. The standard loss is the Bradley-Terry preference loss — for a pair (chosen, rejected), maximize the log-probability that the reward of “chosen” exceeds the reward of “rejected.”
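Written out (notation mine), that loss is logistic regression on the score difference:

```latex
% Bradley-Terry preference loss for the reward model r_phi.
% y_w = chosen response, y_l = rejected response, sigma = logistic sigmoid.
\mathcal{L}(\phi) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
  \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]
```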

This is the load-bearing trick of the whole pipeline. You’ve turned an unspecified concept (“helpful”) into a learned function — one that takes a prompt and a candidate response and returns a number you can differentiate against. You haven’t defined helpfulness; you’ve curve-fit human judgments of it.
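Here is that trick in miniature, with an nn.Linear standing in for a full backbone whose LM head has been swapped for a scalar score head; everything is illustrative:

```python
# The learned proxy in miniature: scalar score head + Bradley-Terry loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 64
reward_head = nn.Linear(hidden, 1)  # features -> one scalar per response

# Stand-in features for (prompt + chosen) and (prompt + rejected) pairs.
feat_chosen = torch.randn(8, hidden)
feat_rejected = torch.randn(8, hidden)

r_chosen = reward_head(feat_chosen).squeeze(-1)      # [batch]
r_rejected = reward_head(feat_rejected).squeeze(-1)  # [batch]

# Maximize log P(chosen beats rejected) = log sigma(r_chosen - r_rejected).
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()  # a number you can differentiate against, as promised
```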

Stage 3 — RL against the reward model

Now use reinforcement learning to update the LLM’s weights so its sampled outputs score higher under the reward model. The InstructGPT paper used PPO, which became the standard choice in early RLHF pipelines (the open-source landscape has since diversified — DPO, GRPO and others). To stop the model from drifting into bizarre high-reward regions, the loss includes a KL penalty against the SFT model: “go up the reward gradient, but don’t move too far from where you started.”
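A sketch of how the leash is typically wired in, following the InstructGPT-style per-token penalty; the numbers and the beta value are made up, and real pipelines fold this shaped reward into PPO's advantage estimates rather than using it raw:

```python
# KL leash sketch: subtract beta * KL(policy || SFT) from the RM score.
import torch

beta = 0.1      # penalty strength; a hyperparameter, not a known default
seq_len = 20

# Log-probs of the sampled response tokens under each model (stand-ins).
logp_policy = torch.randn(seq_len) - 2.0  # log pi_theta(y_t | x, y_<t)
logp_sft = torch.randn(seq_len) - 2.0     # log pi_SFT(y_t | x, y_<t)
rm_score = torch.tensor(1.3)              # reward model's scalar for the text

# Single-sample per-token KL estimate: log(pi_theta / pi_SFT).
kl_per_token = logp_policy - logp_sft

# The scalar the RL step actually optimizes toward.
shaped_reward = rm_score - beta * kl_per_token.sum()
```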

Without that KL leash, you get reward hacking: the LLM finds outputs that score high on the reward model but a human would call gibberish. The reward model is a proxy, and the LLM is good at finding holes in proxies.

Why “RL” specifically?

This is the part that confused everyone (including me) at first. If you have a differentiable reward model, why not just backprop through it? The honest answer: there’s no clean gradient path. The LLM produces tokens by sampling, and the sampling step (argmax / categorical draw over the vocab) is non-differentiable. You can finesse this with relaxations or sequence-level objectives, but policy-gradient RL is the well-trodden workaround: PPO and its cousins push the LLM’s distribution in directions that, on average, raise the reward, without needing a gradient through the sampling step.
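A toy version of the score-function trick, using REINFORCE (the simplest member of the policy-gradient family PPO belongs to); shapes and the constant reward are placeholders:

```python
# Policy gradients sidestep non-differentiable sampling: the gradient flows
# through the log-probs of tokens already drawn, never through sample().
import torch

vocab, seq_len = 100, 10
logits = torch.randn(seq_len, vocab, requires_grad=True)

dist = torch.distributions.Categorical(logits=logits)
tokens = dist.sample()  # discrete draw; no gradient here, and none needed

reward = 1.0  # stand-in for the reward model's score of the decoded text

# REINFORCE: reweight the log-probs of what was actually sampled.
loss = -(reward * dist.log_prob(tokens).sum())
loss.backward()  # gradients reach logits via log_prob, not via sample()
```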

The 2017 paper that put preference-based RL on the map for deep learning is Christiano et al., Deep Reinforcement Learning from Human Preferences, which trained agents on Atari and simulated locomotion using human comparisons of trajectory pairs. The InstructGPT line is the application of that idea to language models.

What goes wrong, and what came after

A few seams worth knowing:

- Reward hacking: the policy finds outputs the reward model scores highly but a human would reject. The KL leash contains it; it doesn't eliminate it.
- Sycophancy: human raters tend to prefer agreement and flattery, the reward model learns that preference, and the policy obliges.
- Mode collapse: optimizing hard against one scalar reward narrows the output distribution, which is part of why RLHF'd models sample less diversely than their base checkpoints.

What RLHF really is, when you squint, is a way to convert one of the most scalable kinds of human training signal — pairwise preferences — into weight updates. (Demonstrations, critiques, and rubrics are alternatives, each with their own tradeoffs.) Pairwise preference data won out for RLHF largely because it’s cheap to collect and it sidesteps the “write down what ‘helpful’ means” problem. The field is actively figuring out which variant of preference-based post-training wins long-term. But the core problem RLHF solves — the pretraining objective is the wrong objective for what users want — is permanent, even as the technique evolves.

Going deeper

What I’m confident about: the three-stage recipe, the reason RL is used (sampling is non-differentiable), and the failure modes (reward hacking, sycophancy, mode collapse) — these are well-documented. What I’m less confident about: the exact post-training recipes used by frontier labs in 2026. The public papers describe the shape of the pipeline, but the data mixtures, the reward-model architectures, and how PPO-vs-DPO-vs-something-else has shaken out at scale are mostly proprietary.