Why RLHF exists
A pretrained language model knows everything and answers nothing. RLHF exists because the gap between “predict the next token” and “do what the user asked” is wider than prompt engineering can paper over.
Why it exists
If you take a freshly pretrained LLM — one that has only seen the next-token objective on a giant pile of internet text — and ask it “what is the capital of France?”, you are not guaranteed to get “Paris.” You might get a plausible continuation like “what is the capital of Germany? what is the capital of Spain?” — because in the training data, that question often shows up in a list of similar questions.
The model isn’t broken. It’s doing exactly what it was trained to do: predict likely next tokens given the prompt. The problem is that “likely continuation of this string on the internet” and “answer this question helpfully” are different distributions, and the second one is the one users want.
You can paper over this with prompt engineering — few-shot examples, instruction phrasings, formatting tricks — and that worked for a while. But the gap is too wide and too varied to close with prompts alone. The model needs to be trained to prefer the helpful continuation over the merely likely one. The natural move is to write down a hand-coded reward function — “give +1 for helpful, −1 for unhelpful” — gradient-descend on it, and call it a day. The catch: nobody has a clean hand-written function for “helpful.” It’s the kind of thing humans recognize when they see it but cannot specify in advance.
RLHF exists because that’s exactly what RL was built for: optimizing against rewards you can score but can’t hand-write. The “reward” here is still a function — but a learned one, a second model trained on humans comparing pairs of outputs. The LLM is then nudged, via reinforcement learning, to produce outputs that the reward model rates highly. That whole stack is the foundation of what “alignment” means in practice for the current generation of chatbots, even as the techniques on top of it have evolved.
Why it matters now
Every chatbot you’ve actually enjoyed using is a post-RLHF (or RLHF-adjacent) model. The base pretrained checkpoint is, by modern product standards, almost unusable as a conversational assistant: it rambles, ignores instructions, refuses inconsistently, and confabulates with the same confidence it states facts.
What RLHF (and its successors — DPO, RLAIF, constitutional methods, and the reasoning-model RL recipes) buys you in production:
- Instruction following. “Summarize this in three bullets” actually gets three bullets, not a continuation of the document.
- Refusals and tone. The model declines requests that violate the policy it was trained against, in a recognizable voice.
- Factuality nudges. RLHF doesn’t fix hallucination, but it pushes models toward “I’m not sure” on the kinds of questions human raters flagged as confidently-wrong.
- The reasoning-model leap. The o-series, DeepSeek-R1, and extended-thinking modes use the same RL machinery — different reward (verifiable correctness on math/code) — to train models that produce long internal chains of thought. Same hammer, different nail.
For an engineer building on top of these models, the practical consequence is that much of what you observe at the API — refusal patterns, formatting habits, verbosity, the way it handles ambiguous instructions — is post-training-shaped, not pretraining-shaped. (Not all of that post-training is literally RLHF in 2026; SFT, DPO-family methods, AI-feedback variants, and reasoning-model RL all live in the same stage.) When the model surprises you, the surprise often lives in post-training, not the base.
The short answer
RLHF = supervised fine-tuning + reward model trained on human preference comparisons + RL loop that optimizes the LLM against the reward model
You can’t write a loss function for “be helpful,” so you train a second model to predict which of two responses a human would prefer, and then use reinforcement learning to push the LLM toward outputs the second model rates highly. The reward model is the proxy; the RL loop is how you cash that proxy into weight updates.
How it works
The canonical recipe — the one OpenAI used for InstructGPT (Ouyang et al., 2022, arXiv:2203.02155) and that became the default template for the field — has three stages.
Stage 1 — Supervised fine-tuning (SFT)
Start with the pretrained base model. Collect a modest dataset of prompts paired with high-quality human-written responses. Fine-tune with the ordinary next-token objective. This is the cheapest, simplest step and it does most of the visible work: after SFT alone, the model already feels much more like an assistant. Most of “instruction following” is already happening here.
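The objective in miniature, as a sketch with a toy four-token vocabulary (all the numbers and names here are illustrative): next-token cross-entropy, with a mask so that only the response tokens contribute to the loss, a common SFT convention.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(step_logits, target_ids, response_mask):
    """Mean next-token cross-entropy, counted only at positions where
    response_mask is 1 (the response tokens), skipping the prompt tokens."""
    total, count = 0.0, 0
    for logits, target, in_response in zip(step_logits, target_ids, response_mask):
        if in_response:
            total += -math.log(softmax(logits)[target])
            count += 1
    return total / count

# Toy example: vocab of 4 tokens, 3 positions, first position is prompt.
step_logits = [[2.0, 0.1, 0.0, -1.0],   # prompt position (masked out)
               [0.0, 3.0, 0.0, 0.0],
               [0.0, 0.0, 2.5, 0.0]]
target_ids = [0, 1, 2]
response_mask = [0, 1, 1]
loss = sft_loss(step_logits, target_ids, response_mask)
```

Note the sanity check the mask buys you: a uniform distribution over 4 tokens gives a loss of exactly log 4 per position, regardless of what the prompt positions contain.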
SFT alone has limits, though. You can only show the model so many human-written examples, and “good response” is high-dimensional — the SFT data captures one cross-section of it, not the whole space. You also can’t easily teach negative preferences (don’t be confidently wrong, don’t pad) by showing more positive examples.
Stage 2 — Train a reward model on preference pairs
Take the SFT model. For a batch of prompts, sample several responses from it. Show pairs to human raters and ask: which one is better? Train a small model (often initialized from the LLM itself, with the language-modeling head replaced by a scalar score head) to predict the human’s choice. The standard loss is the Bradley-Terry preference loss — for a pair (chosen, rejected), maximize the log-probability that the reward of “chosen” exceeds the reward of “rejected.”
This is the load-bearing trick of the whole pipeline. You’ve turned an unspecified concept (“helpful”) into a learned function — one that takes a prompt and a candidate response and returns a number you can differentiate against. You haven’t defined helpfulness; you’ve curve-fit human judgments of it.
Stage 3 — RL against the reward model
Now use reinforcement learning to update the LLM’s weights so its sampled outputs score higher under the reward model. The InstructGPT paper used PPO, which became the standard choice in early RLHF pipelines (the open-source landscape has since diversified — DPO, GRPO and others). To stop the model from drifting into bizarre high-reward regions, the loss includes a KL penalty against the SFT model: “go up the reward gradient, but don’t move too far from where you started.”
Without that KL leash, you get reward hacking: the LLM finds outputs that score high on the reward model but a human would call gibberish. The reward model is a proxy, and the LLM is good at finding holes in proxies.
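What the KL leash looks like per sampled response, as a sketch (the function name and beta value are mine, and real pipelines typically fold the KL term in per token rather than per sequence):

```python
def shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty against the frozen SFT
    reference: high reward is worth less the further the policy drifts.
    Inputs are per-token log-probs of the sampled response under each model."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate
```

If the policy still matches the reference, the penalty is zero and the reward-model score passes through unchanged; a response the policy has made much more likely than the reference would pays a tax proportional to the drift.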
Why “RL” specifically?
This is the part that confused everyone (including me) at first. If you have a differentiable reward model, why not just backprop through it? The honest answer: there’s no clean gradient path. The LLM produces tokens by sampling, and the sampling step (argmax / categorical draw over the vocab) is non-differentiable. You can finesse this with relaxations or sequence-level objectives, but policy-gradient RL is the well-trodden workaround: PPO and its cousins push the LLM’s distribution in directions that, on average, raise the reward, without needing a gradient through the sampling step.
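Here is the score-function workaround in miniature, with a three-arm categorical "policy" standing in for the LLM and fixed scores standing in for the reward model (everything here is a toy). The point: the gradient of log π(a) with respect to the logits is computable in closed form even though the sample itself is not differentiable.

```python
import math, random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy "LLM": a categorical policy over 3 canned responses.
logits = [0.0, 0.0, 0.0]
rewards = [0.0, 1.0, 0.0]   # response 1 is the one the reward model likes
lr = 0.5

for _ in range(2000):
    probs = softmax(logits)
    a = random.choices(range(3), weights=probs)[0]   # non-differentiable sample
    # REINFORCE: d/d logit_i of log pi(a) = 1[i == a] - probs[i],
    # so the update is that score function scaled by the sampled reward.
    for i in range(3):
        grad = ((1.0 if i == a else 0.0) - probs[i]) * rewards[a]
        logits[i] += lr * grad
```

After training, the probability mass concentrates on the rewarded response, and at no point did we differentiate through the sampling step. PPO adds clipping, baselines, and trust-region machinery on top, but this is the underlying trick.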
The 2017 paper that put preference-based RL on the map for deep learning is Christiano et al., Deep Reinforcement Learning from Human Preferences, which trained agents on Atari and simulated locomotion using human comparisons of trajectory pairs. The InstructGPT line is the application of that idea to language models.
What goes wrong, and what came after
A few seams worth knowing:
- Reward hacking. The model finds reward-model blind spots. Outputs that confidently sound helpful, agree with the user, hedge a lot, or are weirdly formal can score high without being genuinely better. The reward model is not “ground truth helpfulness”; it’s a curve fit to a slice of human judgments.
- Sycophancy. A well-documented failure mode of RLHF’d models: agreeing with the user even when the user is wrong. Anthropic’s Towards Understanding Sycophancy in Language Models (Sharma et al., 2023) found evidence that preference data and reward-model optimization can incentivize specific sycophantic behaviors — not because raters explicitly want flattery, but because the signal humans give is correlated with it in subtle ways.
- Mode collapse. RLHF tends to narrow the output distribution. The model becomes more consistent and more samey. For a chatbot that’s a feature; for creative writing it’s a wound.
- PPO is fiddly. A whole research line — DPO (Rafailov et al., 2023, arXiv:2305.18290), KTO, IPO, and others — exists to skip the explicit RL loop. DPO’s trick is rewriting the RLHF objective so that, with a particular parameterization, you can train directly on preference pairs with a classification-style loss. It’s simpler and often as good. Whether DPO has fully replaced PPO at the frontier labs isn’t fully public; my read is that the recipe varies by lab and by model generation.
- RLAIF. Replace human raters with another LLM. Cheaper and scales further, at the cost of inheriting the rater-LLM’s biases. Anthropic’s Constitutional AI (Bai et al., 2022) is the canonical variant: a supervised stage where the AI critiques and revises its own outputs against a written list of principles, plus an RL stage where a preference model is trained on AI-generated comparisons rather than human ones.
- Reasoning-model RL. When the reward isn’t “human preference” but “did the math problem’s answer match the verifier?”, you don’t need a learned reward model — you have a real one. The published DeepSeek-R1 paper describes this kind of recipe in detail: large-scale RL with rule-based / verifiable rewards (their algorithm is GRPO, a PPO variant). OpenAI’s o-series writeup says only that “large-scale RL” improves chain-of-thought; the exact reward recipe isn’t public. The general pattern — sometimes called RLVR — is: same RL machinery as RLHF, much cleaner reward signal, very different downstream behavior.
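The DPO trick from the list above fits in a few lines, as a sketch (beta and the log-probability values are illustrative): each response's implicit reward is beta times its policy-vs-reference log-ratio, plugged into the same Bradley-Terry form as the reward-model loss. No reward model, no sampling loop.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sequence log-probs under the policy (pi_*) and the frozen reference
    (ref_*). Implicit reward = beta * (policy logprob - reference logprob);
    loss = -log sigmoid(reward_chosen - reward_rejected), computed stably."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy hasn't moved from the reference, the loss is log 2; raising the chosen response's log-probability relative to the reference (or lowering the rejected one's) drives it down. The reference model plays the same role the KL leash plays in the PPO version.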
What RLHF really is, when you squint, is a way to convert one of the most scalable kinds of human training signal — pairwise preferences — into weight updates. (Demonstrations, critiques, and rubrics are alternatives, each with their own tradeoffs.) Pairwise preference data won out for RLHF largely because it’s cheap to collect and it sidesteps the “write down what ‘helpful’ means” problem. The field is actively figuring out which variant of preference-based post-training wins long-term. But the core problem RLHF solves — the pretraining objective is the wrong objective for what users want — is permanent, even as the technique evolves.
Famous related terms
- SFT — SFT = pretrained model + (prompt, ideal-response) pairs + next-token loss. Stage 1 of the recipe; does most of the visible work on its own.
- Reward model — reward model = LLM with a scalar head + Bradley-Terry loss on human preference pairs. The learned proxy for “helpfulness.”
- PPO — PPO = policy-gradient RL + clipped objective for stability. The RL algorithm InstructGPT used; the standard early choice for stage 3. Fiddly to tune; partly displaced in open-source pipelines by simpler alternatives like DPO and by PPO variants like GRPO.
- DPO — DPO = RLHF objective rewritten as a classification loss on preference pairs, no separate reward model, no RL loop. Simpler, often competitive. See Rafailov et al., 2023.
- RLAIF — RLAIF ≈ RLHF with an LLM replacing the human rater. Cheaper, scales further, inherits the rater-LLM’s biases.
- Constitutional AI — Constitutional AI = written principles + AI self-critique + RL on AI-generated preferences. Anthropic’s variant; reduces (but doesn’t eliminate) human labeling.
- Chain-of-thought — the prompt-time trick. Reasoning models build on it with the same RL machinery as RLHF — but, where the recipe is public (DeepSeek-R1), with verifiable rewards instead of (or alongside) a learned reward model.
Going deeper
- Christiano et al., Deep Reinforcement Learning from Human Preferences (2017) — the foundational paper. RL from preferences, applied to Atari and locomotion.
- Ouyang et al., Training Language Models to Follow Instructions with Human Feedback (InstructGPT, 2022) — the canonical three-stage recipe applied to GPT-3.
- Rafailov et al., Direct Preference Optimization (2023) — the “skip the RL loop” reformulation that’s now everywhere.
- Bai et al., Constitutional AI (2022) — Anthropic’s variant; the most-cited AI-feedback paper.
- Nathan Lambert’s RLHF Book — an open, regularly-updated treatment that goes deeper than any single paper.
What I’m confident about: the three-stage recipe, the reason RL is used (sampling is non-differentiable), and the failure modes (reward hacking, sycophancy, mode collapse) — these are well-documented. What I’m less confident about: the exact post-training recipes used by frontier labs in 2026. The public papers describe the shape of the pipeline, but the data mixtures, the reward-model architectures, and how PPO-vs-DPO-vs-something-else has shaken out at scale are mostly proprietary.