Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why beam search died for LLMs

Beam search was the default way to decode neural sequence models for years. Then chatbots arrived and quietly stopped using it. The reason is stranger than “sampling is more creative.”

AI & ML intermediate Apr 29, 2026

Why it exists

If you trained a neural sequence model in 2016, you almost certainly used beam search to generate from it. It was the standard decoder for NMT, summarization, and image captioning. The intuition is clean: greedy decoding picks the locally best token and gets stuck; beam search keeps the top-k partial sequences alive at every step, so it has a shot at finding a globally higher-probability sequence. More search, better answer.

Then LLMs arrived, ChatGPT shipped, and beam search quietly stopped being the default for open-ended generation. The OpenAI chat completions API doesn’t expose it at all; most production chat stacks ship sampling as the default.

That’s the puzzle. The thing that made decoding “work” for a decade got dropped exactly when models got big enough that you’d think more search would help even more.

Why it matters now

Every time you call a chat model, something is choosing one of tens of thousands of tokens per step. That choice is a strategy, not a fact about the model. Picking the wrong strategy makes a state-of-the-art model produce slop — bland paragraphs, repetitive loops, weirdly short answers. Engineers who reach for beam search expecting “higher quality output” get the opposite, and the failure mode looks like the model itself is bad.

The shift also reshaped how people think about prompting. If decoding is sampling, then “the answer” isn’t a thing the model has — it’s a distribution the model has, and you’re drawing from it. That mental model is load-bearing for understanding why the same prompt gives different outputs, why temperature matters, and why two perfectly correct answers can sit side by side.

The short answer

Beam search died for LLMs because: maximum-likelihood decoding + open-ended generation = degenerate text.

Beam search optimizes for the highest-probability sequence under the model. That works when there is one right answer and the model is well-calibrated about it (translation, transcription). It breaks when there are many valid continuations and the model’s own probability surface has a pathological mode at “boring repeating text.” For open-ended generation, the most probable sequence is worse than a randomly sampled one. So sampling won.

How it works

The mechanism is counter-intuitive enough that it’s worth walking through.

Beam search, briefly

At each step, keep the top-k partial sequences (the “beam”) ranked by cumulative log-probability. Expand each by one token, score all k·V candidates (V being the vocabulary size), keep the top k again. At the end, return the highest-scoring full sequence. With k=1 you get greedy decoding; as k grows, you approach exact MAP decoding (a true exhaustive argmax over sequences needs more than a wide beam, but the intuition is right).

For a translation system, this is great. There’s roughly one correct translation; the model concentrates probability on tokens near that translation; searching harder finds it.
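The procedure is simple enough to sketch in a few lines. This is a toy illustration, not any production decoder: `step_logprobs` and `toy_model` are hypothetical stand-ins for a model that returns next-token log-probabilities given a prefix.

```python
import math

def beam_search(step_logprobs, eos, k=4, max_len=20):
    """Minimal beam search sketch.

    step_logprobs(prefix) -> {token: log_prob} for the next token.
    Keeps the k highest-scoring prefixes each step; returns the best
    finished (or, failing that, unfinished) sequence.
    """
    beams = [([], 0.0)]          # (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:
            if prefix[-1] == eos:
                finished.append((prefix, score))   # sequence ended
            else:
                beams.append((prefix, score))      # keep expanding
            if len(beams) == k:
                break
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]

def toy_model(prefix):
    # Hypothetical 3-token model: token 1 then EOS (0) is the best sequence.
    if not prefix:
        return {1: math.log(0.6), 2: math.log(0.4)}
    return {0: math.log(0.9), 1: math.log(0.1)}

print(beam_search(toy_model, eos=0, k=2))  # [1, 0]
```

Note that ranking by *cumulative* log-probability is exactly the property that backfires below: the search is faithful to the model’s sequence probabilities, for better or worse.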

What goes wrong on open-ended prompts

Holtzman et al.’s 2019 paper The Curious Case of Neural Text Degeneration (ICLR 2020) is the canonical write-up. They showed something weird: if you take a strong language model and ask it for the most likely continuation of a prompt, you get text that loops. Schematically (paraphrased — not a quote from the paper):

The unicorns were extremely friendly. The unicorns were extremely friendly. The unicorns were extremely friendly…

This isn’t a bug in beam search. The model genuinely assigns higher probability to the looping text than to a coherent paragraph. Why the distribution is shaped that way is still debated — Holtzman et al. document the effect; later work (e.g. Finlayson et al. 2024) traces it back to the softmax bottleneck and the way model errors compound on rare tokens. What’s solid is the observational fact: the argmax is degenerate, even though sampling from the same distribution produces human-like text. Sampling gives prose; maximizing gives mush.
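A toy distribution makes the mode-vs-sample gap concrete. This is purely illustrative (the vocabulary and probabilities are invented, and real models condition on context rather than drawing i.i.d. per step), but it shows how a single bland token can own the mode while almost all probability mass lives elsewhere:

```python
import random

random.seed(0)

# Toy per-step distribution: one bland token holds the mode (p = 0.30),
# while the remaining 70% of mass is spread thinly over many tokens.
vocab = ["the"] + [f"w{i}" for i in range(70)]
probs = [0.30] + [0.01] * 70

def greedy(steps):
    # Argmax picks "the" every step -> a degenerate loop.
    return [max(zip(vocab, probs), key=lambda t: t[1])[0] for _ in range(steps)]

def sample(steps):
    # Draw from the same distribution instead of maximizing it.
    return random.choices(vocab, weights=probs, k=steps)

# The single most likely 6-token sequence is "the" x 6 (p = 0.3**6, about
# 7e-4), yet a sample from the same distribution almost never looks like it.
print(greedy(6))  # ['the', 'the', 'the', 'the', 'the', 'the']
print(sample(6))  # varied tokens most of the time
```

The point of the toy: nothing is broken. Maximizing and sampling are just different questions to ask the same distribution, and they give qualitatively different answers.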

Their conclusion is the load-bearing one: maximization is an inappropriate decoding objective for open-ended generation. The fix isn’t a smarter search; it’s a different objective. They proposed nucleus sampling (top-p), which is now ubiquitous.
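The mechanics of nucleus sampling are compact: sort tokens by probability, keep the smallest prefix whose cumulative mass reaches p, renormalize, and draw. A minimal sketch (the function name and dict-based interface are illustrative, not from any library):

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Top-p (nucleus) sampling over a {token: probability} dict.

    Keeps the smallest set of most-probable tokens whose cumulative
    probability reaches p, then samples from that set (renormalizing
    is implicit in sampling by weight).
    """
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, pr in items:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break   # the "nucleus" is complete; drop the long tail
    toks, weights = zip(*kept)
    return rng.choices(toks, weights=weights, k=1)[0]

# With p=0.8, only "a" and "b" (cumulative 0.8) can ever be drawn;
# the unlikely tail ("c", "d") is cut off entirely.
print(nucleus_sample({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, p=0.8))
```

The design rationale is that the tail, not the head, is where degenerate samples come from: truncating it keeps diversity among plausible tokens while refusing to ever draw a wildly unlikely one.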

The “beam search curse” in translation, too

Even in machine translation — beam search’s home turf — the picture is messier than “more search is better.” Koehn and Knowles (2017, Six Challenges for Neural Machine Translation) documented the beam search curse: past modest beam widths, BLEU scores stop improving and eventually degrade. Larger beams find higher-probability sequences that are systematically too short relative to the reference; length-normalization heuristics push the sweet spot wider, but very large beams still hurt. The underlying fact is uncomfortable: even when MAP-decoding is roughly the right idea, doing it harder eventually hurts.
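Length normalization itself is usually a one-line rescoring tweak. Here is a sketch of one widely used variant, the GNMT length penalty from Wu et al. (2016); alpha ≈ 0.6 is a common setting, and the exact constants vary across implementations:

```python
def length_normalized_score(logprob, length, alpha=0.6):
    """GNMT-style length penalty (Wu et al., 2016): divide the cumulative
    log-prob by ((5 + length) / 6) ** alpha, so longer hypotheses aren't
    punished merely for accumulating more negative log-prob terms."""
    return logprob / (((5 + length) / 6) ** alpha)

# Invented numbers for illustration: a short hypothesis can beat a
# longer one on raw cumulative log-prob...
short, longer = (-6.0, 4), (-8.0, 12)
assert short[0] > longer[0]
# ...but lose once both scores are length-normalized.
assert length_normalized_score(*longer) > length_normalized_score(*short)
```

This is exactly the "heuristic patch" flavor the section describes: the penalty doesn't fix the model's length bias, it rescores around it, which is why very large beams still eventually hurt.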

The shape of the lesson is the same in both worlds: the model’s probability-of-the-whole-sequence is not directly the thing you want to maximize.

Why sampling won

Sampling has three things going for it:

  1. It matches the model’s own objective. Models are trained to imitate the data distribution; drawing from that distribution gives outputs shaped like training data. Argmax-ing it produces a different beast — the mode, not a sample.
  2. It’s cheap. One forward pass per token, for a single hypothesis. Beam search with width k roughly multiplies decode memory and bandwidth by k — and with KV-cache-bound serving, that’s a real cost.
  3. It composes with the modern toolkit. Temperature, top-p, and top-k all live inside the sampling frame. They give you a knob for diversity without abandoning the model’s distribution.
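The three knobs compose into one short pipeline. A sketch of a common ordering (temperature, then top-k, then top-p); note that real implementations differ on filter order and defaults, and the function name here is invented:

```python
import math
import random

def sample_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Sample one token from {token: logit} via temperature -> top-k -> top-p."""
    # Temperature: rescale logits before the softmax. Low temperature
    # sharpens the distribution toward the argmax; high flattens it.
    scaled = {t: l / temperature for t, l in logits.items()}
    # Numerically stable softmax.
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exps.values())
    ranked = sorted(((t, e / z) for t, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)
    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches p.
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= top_p:
            break
    toks, weights = zip(*kept)
    return rng.choices(toks, weights=weights, k=1)[0]
```

Crucially, every step here is a filter or reshaping of the model’s own distribution — none of it searches over sequences, which is why these knobs add diversity control without reintroducing beam search’s failure mode.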

There are exceptions where beam search still earns its keep: machine translation systems shipping to production, speech recognition with a clear ground-truth target, constrained decoding where you genuinely need the highest-scoring valid output. But for “talk to me,” it’s rarely the default anymore.

Where this gets murky

A few things here I’m genuinely not certain about, and the post shouldn’t pretend otherwise:

Going deeper