Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

RAG: why retrieval didn't die when context windows got huge

Long context windows were supposed to kill retrieval-augmented generation. They didn't. Here's why the bottleneck moved instead of disappearing.

AI & ML intro Apr 29, 2026

Why it exists

Around 2023, every serious app built on top of an LLM hit the same wall: the model didn’t know your data. It knew Wikipedia and a slice of the open web up to its training cutoff, and that was it. Your codebase, your customer’s tickets, your internal wiki, last week’s Slack — invisible.

The obvious fix was “just put the data in the prompt.” But context windows were a few thousand tokens, and your wiki was a few million. So people did the next thing: chunk the corpus, embed each chunk, store the vectors, and at query time fetch only the chunks that look relevant. Stuff those into the prompt. That pipeline is Retrieval-Augmented Generation, and for two years it was synonymous with “doing AI on your own data.”
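
To make that pipeline concrete, here's a minimal sketch. The embed function is a deliberately dumb stand-in (hashed character trigrams) so the example runs end to end; a real system would call an embedding model there and keep the vectors in a proper store rather than a numpy array.

    import numpy as np

    def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
        """Split a document into overlapping fixed-size character chunks."""
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def embed(texts: list[str], dim: int = 256) -> np.ndarray:
        """Toy stand-in for an embedding model: hashed character trigrams.
        A real system calls an embedding model here; this just keeps the sketch runnable."""
        out = np.zeros((len(texts), dim))
        for row, t in enumerate(texts):
            t = t.lower()
            for i in range(len(t) - 2):
                out[row, hash(t[i:i + 3]) % dim] += 1.0
        return out

    class TinyIndex:
        """Chunk the corpus once, embed every chunk, answer queries by cosine similarity."""
        def __init__(self, docs: list[str]):
            self.chunks = [c for d in docs for c in chunk(d)]
            vecs = embed(self.chunks)
            self.vecs = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)

        def search(self, query: str, k: int = 5) -> list[str]:
            q = embed([query])[0]
            q = q / (np.linalg.norm(q) + 1e-9)
            scores = self.vecs @ q                      # cosine similarity against every chunk
            return [self.chunks[i] for i in np.argsort(-scores)[:k]]

Build it once with TinyIndex(docs); each query then costs one embedding call plus a dot product against every chunk, which is plenty until the corpus is large enough to want a real vector index.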

Then context windows exploded. 128k. 200k. 1M. 2M. And a recurring take appeared: RAG is dead, just paste the whole corpus in.

It didn’t happen. Retrieval is more entrenched in 2026 than it was in 2024. The interesting question is why.

Why it matters now

If you’re building anything that talks to an LLM about non-public information — a coding agent, a support bot, a search-over-docs feature, a notes app — you are doing retrieval, whether you call it that or not. Understanding why the bottleneck moved instead of disappearing is the difference between a system that works at 100 documents and one that works at 100 million.

The short answer

RAG = retriever + generator + prompt assembly

A retriever picks a small number of relevant chunks from a big corpus, an LLM generates an answer conditioned on them, and a thin layer in between decides what actually goes into the prompt. The reason huge context windows didn’t kill it: attention compute grows quadratically with prompt length, the bill grows with every input token you send, and even when you can afford both, the model gets worse at actually using a long prompt.
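
Here's a sketch of that thin layer, reusing the toy index from above. The chars-divided-by-four token estimate and the prompt wording are arbitrary placeholder choices, not something any particular model requires.

    def build_prompt(index, question: str, token_budget: int = 4000) -> str:
        """Retrieve, pack chunks under a token budget, assemble the prompt."""
        picked, used = [], 0
        for c in index.search(question, k=10):
            cost = len(c) // 4                  # rough chars-to-tokens estimate
            if used + cost > token_budget:
                break                           # stop packing once the budget is spent
            picked.append(c)
            used += cost
        context = "\n\n---\n\n".join(picked)
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )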

How it works

Three things keep retrieval alive after context windows got big.

1. Cost and latency scale with prompt length, not just window size. The model doesn’t charge you for the window; it charges you for the tokens you actually send. A 1M-token window is a capacity, not a free lunch. Stuffing a 500k-token corpus into every request means paying to process 500k tokens every turn. Prompt caching helps when the prefix is stable, but it doesn’t help when the relevant slice changes per query, which is the whole point of search. Retrieval is, among other things, a cost optimization: send 4k tokens instead of 500k (the first sketch after this list puts rough numbers on that).

2. Models get worse at long contexts than the marketing implies. This is the “lost in the middle” / “context rot” finding: as the relevant fact moves from the start or end of a long prompt toward the middle, accuracy drops, sometimes sharply. I don’t have a single canonical citation that covers every frontier model in 2026 — vendors publish their own needle-in-a-haystack scores and they’re not directly comparable — but the qualitative pattern (recall sags in the middle of long contexts, and degrades further when there are many distractors) has held up across multiple independent evals. The practical consequence: even if you can fit the whole corpus, putting irrelevant text next to the relevant text measurably hurts the answer.

3. Most corpora aren’t static. Your docs change. New tickets arrive. The codebase gets a commit. A retrieval index can be incrementally updated; a 1M-token monolithic prompt can’t. As soon as freshness or scale matters, you’re maintaining an index whether you wanted to or not. The second sketch below shows what incremental updates look like.
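
Two quick sketches to make points 1 and 3 concrete. First, the cost math, assuming a made-up price of $3 per million input tokens and 10,000 requests a day; substitute your provider's real pricing and your own traffic.

    # Back-of-the-envelope input-token cost; the price and request volume are assumptions.
    PRICE_PER_TOKEN = 3.00 / 1_000_000

    def monthly_input_cost(tokens_per_request: int, requests_per_day: int = 10_000) -> float:
        return tokens_per_request * PRICE_PER_TOKEN * requests_per_day * 30

    print(f"whole corpus every turn: ${monthly_input_cost(500_000):,.0f}/month")  # ~$450,000
    print(f"retrieved 4k slice:      ${monthly_input_cost(4_000):,.0f}/month")    # ~$3,600

Second, the incremental-update shape: re-chunk and re-embed only the document that changed. This reuses chunk and embed from the first sketch; the doc_id bookkeeping is the part a naive one-shot index lacks.

    import numpy as np

    class UpdatableIndex:
        def __init__(self):
            self.by_doc = {}                         # doc_id -> (chunks, normalized vectors)

        def upsert(self, doc_id: str, text: str):
            """Re-embed just this document; everything else stays put."""
            chunks = chunk(text)
            vecs = embed(chunks)
            vecs = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)
            self.by_doc[doc_id] = (chunks, vecs)     # silently replaces any stale version

        def delete(self, doc_id: str):
            self.by_doc.pop(doc_id, None)

        def search(self, query: str, k: int = 5) -> list[str]:
            q = embed([query])[0]
            q = q / (np.linalg.norm(q) + 1e-9)
            scored = [(float(v @ q), c)
                      for chunks, vecs in self.by_doc.values()
                      for c, v in zip(chunks, vecs)]
            return [c for _, c in sorted(scored, reverse=True)[:k]]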

A modern RAG system is rarely just “embed and cosine-similarity.” It’s usually a small pipeline: rewrite or expand the query so it looks more like the corpus, run hybrid retrieval (dense vectors plus keyword search, since embeddings miss exact identifiers and rare terms), rerank the top candidates with a slower but sharper model, then dedupe and trim what survives into the prompt.
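
In sketch form, with the model-dependent stages left as labeled stand-ins: rewrite_query would usually be an LLM rephrasing the query, rerank a cross-encoder or LLM scorer, and keyword_score is a crude substitute for BM25. The index is the toy one from earlier.

    def rewrite_query(q: str) -> str:
        # Stand-in: many pipelines ask an LLM to rephrase or expand the query first.
        return q

    def keyword_score(query: str, text: str) -> float:
        # Crude term-overlap score; a real system would use BM25 or similar.
        qs, ts = set(query.lower().split()), set(text.lower().split())
        return len(qs & ts) / (len(qs) or 1)

    def rerank(query: str, candidates: list[str], k: int) -> list[str]:
        # Stand-in: a cross-encoder or LLM reranker would score (query, chunk) pairs here.
        return candidates[:k]

    def retrieve(index, query: str, k: int = 5) -> list[str]:
        q = rewrite_query(query)
        dense = index.search(q, k=20)                                    # vector recall
        keyword = sorted(index.chunks, key=lambda c: -keyword_score(q, c))[:20]
        candidates = list(dict.fromkeys(dense + keyword))                # union, order-preserving dedupe
        return rerank(q, candidates, k)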

The shape of the problem shifts depending on what you’re retrieving over. Code retrieval cares about symbol graphs and call edges, not just embedding similarity. Conversation retrieval cares about recency. Legal retrieval cares about exact phrasing. There isn’t one “RAG”; there’s a family of pipelines that share the same shape.

The seam: agents complicate the picture

The cleaner version of “stuff everything in the prompt” that’s actually winning ground from RAG isn’t long context — it’s agentic retrieval. Instead of one retrieval call before generation, the model uses tools to search, read files, follow links, and decide what to look at next. The corpus stays external, but the model drives retrieval instead of a fixed pipeline.

This is closer to how a human researcher works: you don’t pre-fetch the whole library, and you don’t read the whole book — you skim, follow citations, and stop when you have enough. Whether agentic retrieval fully replaces classical RAG, or whether the two stay layered (agent on top, vector search underneath), is genuinely unsettled as of early 2026. I don’t think there’s enough public production data to call it yet.
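
Stripped to its control flow, the agentic version looks something like this. The JSON action format and call_llm are stand-ins (real agents use the provider's native tool-calling interface and richer tools than a single search), but the point is the loop: the model, not a fixed pipeline, decides when to search again and when to stop.

    import json

    def agent_answer(question: str, index, call_llm, max_steps: int = 6) -> str:
        notes: list[str] = []
        for _ in range(max_steps):
            prompt = (
                'Reply with JSON: {"search": "<query>"} to look something up, '
                'or {"answer": "<text>"} once you can answer.\n'
                f"Question: {question}\n"
                "Notes so far:\n" + "\n".join(notes) + "\nNext action:"
            )
            action = json.loads(call_llm(prompt))
            if "answer" in action:
                return action["answer"]
            notes.extend(index.search(action["search"], k=3))   # the model chose this query
        return "(no answer within max_steps)"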

Going deeper