Why long-context models still get lost in the middle
Your model has a 1M token context window. It can recall the first paragraph perfectly. It can recall the last paragraph perfectly. The thing in the middle? Coin flip. This is not a bug — it's what happens when you ask a model trained one way to behave a different way.
Why it exists
You upgrade to the model with the giant context window. You stuff in a 200-page document. You ask a question whose answer is on page 100. The model misses, or paraphrases, or quietly invents.
You try the same question with the answer on page 1. Perfect recall. You move it to page 200. Perfect recall. Page 100 again — gone.
This is the lost in the middle effect. It is the most reliably surprising thing about long-context LLMs: the published context length is not the useful context length. A 1M-token window does not mean the model treats all 1M tokens equally. The recall curve is roughly U-shaped — strong at the front, strong at the back, sagging through the middle — and the sag gets worse as the context gets longer.
The paper that popularized and cleanly quantified this is Liu et al., Lost in the Middle: How Language Models Use Long Contexts (arXiv 2023, TACL 2024). They built a multi-document QA task with 10, 20, and 30 retrieved documents — exactly one of which contained the answer. They moved the answer document around. Performance peaked at the first and last positions and dropped sharply in the middle. In the 20-document condition, GPT-3.5-Turbo’s accuracy in the middle dropped below its closed-book baseline of 56.1% — i.e., adding the relevant document hurt more than it helped. That last fact is the one that should sting.
The reason the effect exists is not “the model is dumb” or “context windows are fake.” It’s that the way models are trained, the way attention scales, and the way positions are encoded all conspire to over-weight the edges. The middle has to fight for attention it was never trained to receive.
Why it matters now
Three years ago this was a research curiosity. Today it shapes a lot of production decisions:
- RAG pipelines depend on it. The whole point of RAG is “retrieve relevant chunks and put them in the prompt.” If the model only reliably reads the top and bottom, retrieval ranking matters more than retrieval recall. Putting the best chunk in position 1 and the second-best last is a plausible practical mitigation that follows directly from the U-curve, even if I can’t point to a single canonical “this is the right ordering” paper.
- Long-document agents misbehave silently. A coding agent looking at a 100k-token codebase dump cannot be trusted to have actually used the file in the middle. The output looks confident either way.
- Needle-in-a-haystack benchmarks oversold long context. The original needle-in-a-haystack test, popularized by Greg Kamradt in late 2023, hides a single out-of-place sentence in a long document and asks the model to retrieve it. Models often pass that easily. Real long-context use — multiple needles, paraphrased queries, distractors — is much harder, and that’s where the U-curve shows up.
The gap between “context length the model accepts” and “context length the model actually uses well” is one of the bigger sources of silent quality regressions in LLM applications today. If you ship without testing for it, you’ll find out from a confused user rather than a benchmark.
The short answer
lost-in-the-middle ≈ uneven training distribution + softmax pressure + edge-biased positional encoding
Models likely attend most strongly to the start and end of their input because (a) attention’s softmax forces some probability mass to land somewhere even when nothing is relevant, and the first tokens become a default “sink” for that mass; (b) document structure and training distribution put salient cues near the edges; and (c) positional encodings make distant tokens easier to ignore. The middle of a long context is the unlucky region where none of those forces is helping. The exact weighting between these three causes is not nailed down in the literature — treat the equation as a synthesis, not a proof.
How it works
The U-shape probably isn’t one mechanism — it’s at least three forces stacking. The literature doesn’t have a clean ablation that disentangles them on a frontier model, so what follows is a reasonable synthesis, not a derivation.
1. Attention sinks: the softmax has to put its mass somewhere
Self-attention ends with a softmax over a row of scores. By construction, that row sums to 1. Every query token must distribute exactly one unit of attention weight across the keys, even when no key is actually relevant.
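To make that constraint concrete, here is a toy NumPy sketch (my own illustration, not any model's actual code): one query scored against ten random keys, none of which is relevant, still ends up with attention weights that sum to exactly 1.

```python
import numpy as np

def attention_row(q, K):
    """One query against a set of keys: raw scores -> softmax weights."""
    scores = K @ q / np.sqrt(q.shape[-1])    # scaled dot-product score per key
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    return weights / weights.sum()           # forced to sum to exactly 1

rng = np.random.default_rng(0)
q = rng.normal(size=64)                      # a query with no particular target
K = rng.normal(size=(10, 64))                # ten keys, none actually "relevant"

w = attention_row(q, K)
print(w.round(3), w.sum())                   # the mass still has to land somewhere
```

Nothing in this toy shows a sink by itself; it only shows that the mass has to go somewhere. Xiao et al.'s observation is that, in trained models, it lands disproportionately on the first tokens.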
What happens to the leftover mass when the model has nothing in particular to attend to? Xiao et al., Efficient Streaming Language Models with Attention Sinks (ICLR 2024) found that it overwhelmingly lands on the first few tokens of the sequence. They named this an attention sink. Trained-model attention maps often show strong attention to the first few tokens regardless of content.
Why the first tokens? Because every token in a causal model can see them — they’re in everyone’s attention window — so they’re the only universally available “place to dump mass.” A natural interpretation is that the model treats them as a safe no-op target during training and that pattern becomes self-reinforcing; this is a mechanistic story consistent with the data, not a directly proven causal claim. What the Xiao paper does establish is the practical consequence: you can recover most of the model’s quality on long sequences by keeping the first 4 tokens of KV cache around plus a recent window — strong evidence that the front of the sequence is doing structural work beyond its semantic content.
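The practical recipe from that finding is small enough to sketch. This is a simplified illustration of the StreamingLLM idea, not the paper's implementation; the function name and defaults here are mine.

```python
def streaming_kv_keep_indices(seq_len, n_sink=4, window=1024):
    """Which cached positions to keep, StreamingLLM-style (simplified sketch).

    Keep the first `n_sink` positions (the attention-sink tokens) plus the
    most recent `window` positions; everything in between gets evicted.
    """
    sinks = list(range(min(n_sink, seq_len)))
    recent = list(range(max(n_sink, seq_len - window), seq_len))
    return sinks + recent

# Example: a 10,000-token sequence keeps only ~1,028 cached positions.
keep = streaming_kv_keep_indices(10_000, n_sink=4, window=1024)
print(len(keep), keep[:6], keep[-3:])
```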
This is the first reason the front gets favored. There is no equivalent “back sink” by the same mechanism, but the back gets its own boost from a different source.
2. Recency: the back is just closer
The next-token prediction objective overwhelmingly cares about local structure. To predict token N+1, the most informative tokens are usually nearby — the previous sentence, the previous paragraph. Training data reflects this: the gradient signal for paying close attention to recent context is strong and constant. The gradient signal for “attend carefully to a token 50,000 positions ago” is sparse, weak, and only present in a small fraction of training examples.
This shows up in two ways:
- Positional encodings degrade with distance. Modern models use RoPE, which encodes position by rotating the query and key vectors. The original RoFormer paper argues for a long-term decay property where the inter-token dependency tends to weaken as |i − j| grows. Various long-context techniques (NTK-aware scaling, YaRN, position interpolation) reshape that decay curve. The honest version of this story: the literature broadly attributes long-distance attention decay to RoPE’s frequency structure, and engineering tricks to fix the decay are an active subfield. I am not going to claim a precise functional form here without re-deriving it; a toy sketch of the rotation mechanics follows at the end of this subsection.
- Training-distribution mismatch. Many common document formats — articles, papers, code files — place salient cues near the beginning (titles, intros, thesis sentences) or end (conclusions, returns, answer lines). Pretraining examples where the load-bearing fact sits at the 51% mark of a 200k-token document are rarer. The model has plausibly seen far fewer examples of “attend to the middle of a long document and use it correctly,” so it’s likely worse at it. (This is a structural-prior argument; I don’t have a clean ablation to point at.)
The recency effect rewards the end of the context: a token at position N can attend to tokens at positions N−1, N−2, … using the strongest, best-trained part of the attention curve. The middle is the only region with neither benefit.
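Here is that RoPE rotation in deliberately minimal form (rotate-half pairing, my own toy code, not any library's API). It scores a query against a key with identical content at increasing relative distance; the point is only that the relative rotation grows with |i − j|, not that this particular curve is what frontier models see.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding, rotate-half style (toy version)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]                 # each (x1[i], x2[i]) pair gets rotated
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
content = rng.normal(size=128)                  # identical "content" for query and key

q = rope(content, pos=0)                        # query fixed at position 0
for dist in [0, 1, 10, 100, 1000, 10_000]:
    k = rope(content, pos=dist)                 # key at increasing relative distance
    print(dist, round(float(q @ k) / np.sqrt(128), 2))
```

At distance 0 the score is just the scaled squared norm; as distance grows, the per-pair rotation angles spread out and the score drifts toward zero. That is the "distant tokens tend to get weaker scores" tendency in toy form; the exact shape depends on the content vectors and on whatever scaling tricks the model was trained with.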
3. The Liu et al. evidence
Liu and collaborators ran the experiment cleanly. Their multi-document QA setup: 20 short documents in the prompt, exactly one of which contains the answer to a natural-language question. They varied the position of the answer document from 1 to 20 and measured accuracy. The shape of the resulting curve, across multiple frontier models of the era (the paper tested commercial and open models including GPT-3.5-Turbo and Claude variants, plus open models like LongChat), was the now-famous U: highest accuracy at positions 1 and 20, lowest somewhere in the middle. The drop from best to worst position was on the order of 20 percentage points or more for several models, and the effect did not go away for models marketed as long-context.
I’m being deliberately fuzzy on the exact numbers because the paper went through revisions and the per-model accuracies depend on which version and which task variant you read. What’s solid is the shape. The U-curve has been replicated many times since on different models and tasks. Here’s the bit that should worry you: in some configurations the middle position scored below the closed-book baseline — i.e., providing the relevant document at the middle position hurt accuracy compared to not providing it at all. Whatever you call that mechanism, the practical lesson is the same.
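The harness for this kind of test is small enough to write yourself. A hedged sketch of the position sweep, in the spirit of Liu et al.'s setup rather than their released code: `call_model` is whatever you use to query your LLM (assumed, not provided), and the substring check is a crude stand-in for their evaluation.

```python
def build_prompt(question, answer_doc, distractors, answer_position):
    """Place the answer-bearing document at a chosen position among distractors."""
    docs = list(distractors)
    docs.insert(answer_position, answer_doc)
    numbered = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    return f"{numbered}\n\nQuestion: {question}\nAnswer:"

def sweep_positions(call_model, question, gold_answer, answer_doc, distractors):
    """Accuracy as a function of where the answer document sits in the prompt."""
    results = {}
    for pos in range(len(distractors) + 1):
        prompt = build_prompt(question, answer_doc, distractors, pos)
        reply = call_model(prompt)                           # your LLM call goes here
        results[pos] = gold_answer.lower() in reply.lower()  # crude substring match
    return results
```

Aggregate the resulting per-position accuracies over many questions before trusting the curve; a single question at each position tells you very little.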
Why “longer context window” alone doesn’t fix it
A natural assumption is that the next generation of models, trained on longer sequences, will simply make this go away. The evidence is mixed and time-sensitive. On the better side: Google’s Gemini 1.5 technical report claimed near-perfect single-needle recall out to very long context. On the not-fully-fixed side: harder evals like NoLiMa (2025), which require the model to make a small inferential hop rather than match a literal string, show frontier models still degrading sharply with context length. The U-shape gets shallower; it doesn’t disappear.
Why the partial fix? The training-distribution argument is plausibly load-bearing: as long as it’s expensive to construct training examples that force the model to attend to the middle of a long document, the model is likely to undertrain that region relative to the edges. Synthetic long-context training data (artificial needles, multi-document tasks, position permutation augmentations) is plausibly how progress continues, but the closed labs don’t publish their data mixes, so this is informed speculation, not a sourced claim.
What you do about it (the practical part)
If you can’t make the model fix this, you reshape the prompt:
- Put the most important context first or last. This is the cheapest mitigation and follows directly from the U-curve. A common practical tactic is to sort retrieved chunks by relevance and place the top-ranked chunks at the boundaries (a minimal reordering sketch follows this list).
- Re-rank, then truncate. If you have 50 retrieved chunks but the middle ones won’t be read carefully, including them anyway may hurt — they’re distractors. Aggressive re-ranking and a tighter context budget often beat “just include more.”
- Test with the answer at multiple positions. A long-context eval that only tests position-0 needles is lying to you. Sweep the position; look at the curve; that curve is your real context length.
- Use RAG instead of stuffing. A small, well-ranked context that fits in the strong-recall region beats a giant context where the answer is at position 47%. This is one of the under-appreciated arguments for retrieval — not “the model can’t read 200k tokens” but “the model reads them unevenly.”
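A minimal sketch of the edge-first ordering from the first bullet above (the function name is mine; some RAG libraries ship similar reorderers, but this is not any particular library's API): given chunks already sorted best-first, alternate them between the front and the back so the weakest chunks end up in the middle.

```python
def edge_first_order(chunks_by_relevance):
    """Reorder relevance-sorted chunks so the best ones sit at the prompt's edges.

    Input: chunks sorted best-first. Output: best chunk at the front, second-best
    at the back, weakest in the middle (the region the model reads least reliably).
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

chunks = ["c1 (best)", "c2", "c3", "c4", "c5 (worst)"]
print(edge_first_order(chunks))
# ['c1 (best)', 'c3', 'c5 (worst)', 'c4', 'c2']
```

Whether this beats simply truncating to the top few chunks is an empirical question for your task; measure it with a position-sweep harness like the one sketched earlier rather than assuming.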
Where I’m not sure
A few honest gaps in the above story:
- The relative weight of the three causes (sinks, RoPE decay, training distribution) almost certainly differs between architectures, and I haven’t seen a clean ablation study that disentangles them on a frontier model. The mechanistic-interpretability literature is making progress, but I don’t have a single citation that nails the decomposition.
- “Frontier models in 2025/2026 still show middle-position weakness” is partially supported by NoLiMa-style evals; whether the precise U-shape (rather than a more general degradation) survives at the very latest models is something I’m reading off secondhand reports rather than running myself.
- Whether the effect can be fully eliminated by training, or whether it’s an architectural property that survives any training mix, is an open question. Bet against “fully eliminated.”
The reason this all matters: every time you write a prompt longer than a few thousand tokens, you are silently entering territory where position-in-context is a quality variable. Not knowing that is how surprises ship.
Famous related terms
- Attention — attention = softmax(QKᵀ/√d) · V over all token pairs. The N² operation whose softmax is the source of the attention sink.
- Attention sink — attention sink = first few tokens absorbing leftover softmax probability mass. Why “the first 4 tokens are special” turns up in so many long-context tricks.
- Needle-in-a-haystack — needle-in-a-haystack ≈ hide one fact in a long document, see if the model can find it. Easy version of the problem; passing it is necessary but not sufficient.
- RoPE — RoPE = position encoded as a rotation of Q and K. Argued in the original paper to give a long-term decay property; one of several reasons distant tokens get weaker attention scores.
- Position interpolation / YaRN / NTK scaling — position scaling ≈ stretch the RoPE frequencies so old models work at new lengths. The patches (often combined with some continued training) that helped extend models to longer contexts without training from scratch.
- RAG — RAG = retrieval + LLM. The architectural answer to “the middle of long context is unreliable.”
- Why LLM eval is hard — single-position needle tests overstating long-context quality is a textbook example of evals lying to you.
Going deeper
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts (TACL 2024 / arXiv 2307.03172) — the canonical reference. Read at least the introduction and the U-curve plot; the rest is detail.
- Xiao et al., Efficient Streaming Language Models with Attention Sinks (ICLR 2024) — the attention-sink phenomenon, and why keeping the first few tokens of KV cache rescues sliding-window models.
- Greg Kamradt’s needle-in-a-haystack repository — the popularized eval. Useful baseline; insufficient on its own.
- LangChain’s Multi Needle in a Haystack writeup — the eval that breaks models that pass the single-needle version.
- Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding (2021) — the RoPE paper, if you want the math behind the position-decay claim.