Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

RAG: why retrieval didn't die when context windows got huge

Long context windows were supposed to kill retrieval-augmented generation. They didn't. Here's why the bottleneck moved instead of disappearing.

AI & ML intro Apr 29, 2026

Why it exists

Around 2023, every serious app built on top of an LLM hit the same wall: the model didn’t know your data. It knew Wikipedia and a slice of the open web up to its training cutoff, and that was it. Your codebase, your customer’s tickets, your internal wiki, last week’s Slack — invisible.

The obvious fix was “just put the data in the prompt.” But context windows were a few thousand tokens, and your wiki was a few million. So people did the next thing: chunk the corpus, embed each chunk, store the vectors, and at query time fetch only the chunks that look relevant. Stuff those into the prompt. That pipeline is Retrieval-Augmented Generation, and for two years it was synonymous with “doing AI on your own data.”
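
To make that pipeline concrete, here's a minimal sketch. The embed function is a deliberately dumb stand-in (hashed character trigrams) so the example runs end to end; a real system would call an embedding model there and keep the vectors in a proper store rather than a numpy array.

    import numpy as np

    def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
        """Split a document into overlapping fixed-size character chunks."""
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def embed(texts: list[str], dim: int = 256) -> np.ndarray:
        """Toy stand-in for an embedding model: hashed character trigrams.
        A real system calls an embedding model here; this just keeps the sketch runnable."""
        out = np.zeros((len(texts), dim))
        for row, t in enumerate(texts):
            t = t.lower()
            for i in range(len(t) - 2):
                out[row, hash(t[i:i + 3]) % dim] += 1.0
        return out

    class TinyIndex:
        """Chunk the corpus once, embed every chunk, answer queries by cosine similarity."""
        def __init__(self, docs: list[str]):
            self.chunks = [c for d in docs for c in chunk(d)]
            vecs = embed(self.chunks)
            self.vecs = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)

        def search(self, query: str, k: int = 5) -> list[str]:
            q = embed([query])[0]
            q = q / (np.linalg.norm(q) + 1e-9)
            scores = self.vecs @ q                      # cosine similarity against every chunk
            return [self.chunks[i] for i in np.argsort(-scores)[:k]]

Build it once with TinyIndex(docs); each query then costs one embedding call plus a dot product against every chunk, which is plenty until the corpus is large enough to want a real vector index.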

Then context windows exploded. 128k. 200k. 1M. 2M. And a recurring take appeared: RAG is dead, just paste the whole corpus in.

It didn’t happen. Retrieval is more entrenched in 2026 than it was in 2024. The interesting question is why.

Why it matters now

If you’re building anything that talks to an LLM about non-public information — a coding agent, a support bot, a search-over-docs feature, a notes app — you are doing retrieval, whether you call it that or not. Understanding why the bottleneck moved instead of disappearing is the difference between a system that works at 100 documents and one that works at 100 million.

The short answer

RAG = retriever + generator + prompt assembly

A retriever picks a small number of relevant chunks from a big corpus, an LLM generates an answer conditioned on them, and a thin layer in between decides what actually goes into the prompt. The reason huge context windows didn’t kill it: attention compute grows quadratically with prompt length, the bill grows with every input token you send, and even when you can afford both, the model gets worse at actually using a long prompt.
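
Here's a sketch of that thin layer, reusing the toy index from above. The chars-divided-by-four token estimate and the prompt wording are arbitrary placeholder choices, not something any particular model requires.

    def build_prompt(index, question: str, token_budget: int = 4000) -> str:
        """Retrieve, pack chunks under a token budget, assemble the prompt."""
        picked, used = [], 0
        for c in index.search(question, k=10):
            cost = len(c) // 4                  # rough chars-to-tokens estimate
            if used + cost > token_budget:
                break                           # stop packing once the budget is spent
            picked.append(c)
            used += cost
        context = "\n\n---\n\n".join(picked)
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )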

How it works

Three things keep retrieval alive after context windows got big.

1. Cost and latency scale with prompt length, not just window size. The model doesn’t charge you for the window; it charges you for the tokens you actually send. A 1M-token window is a capacity, not a free lunch. Stuffing a 500k-token corpus into every request means paying to process 500k tokens every turn. Prompt caching helps when the prefix is stable, but it doesn’t help when the relevant slice changes per query, which is the whole point of search. Retrieval is, among other things, a cost optimization: send 4k tokens instead of 500k (the first sketch after this list puts rough numbers on that).

2. Models get worse at long contexts than the marketing implies. This is the “lost in the middle” / “context rot” finding: as the relevant fact moves from the start or end of a long prompt toward the middle, accuracy drops, sometimes sharply. I don’t have a single canonical citation that covers every frontier model in 2026 — vendors publish their own needle-in-a-haystack scores and they’re not directly comparable — but the qualitative pattern (recall sags in the middle of long contexts, and degrades further when there are many distractors) has held up across multiple independent evals. The practical consequence: even if you can fit the whole corpus, putting irrelevant text next to the relevant text measurably hurts the answer.

3. Most corpora aren’t static. Your docs change. New tickets arrive. The codebase gets a commit. A retrieval index can be incrementally updated; a 1M-token monolithic prompt can’t. As soon as freshness or scale matters, you’re maintaining an index whether you wanted to or not. The second sketch below shows what incremental updates look like.
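
Two quick sketches to make points 1 and 3 concrete. First, the cost math, assuming a made-up price of $3 per million input tokens and 10,000 requests a day; substitute your provider's real pricing and your own traffic.

    # Back-of-the-envelope input-token cost; the price and request volume are assumptions.
    PRICE_PER_TOKEN = 3.00 / 1_000_000

    def monthly_input_cost(tokens_per_request: int, requests_per_day: int = 10_000) -> float:
        return tokens_per_request * PRICE_PER_TOKEN * requests_per_day * 30

    print(f"whole corpus every turn: ${monthly_input_cost(500_000):,.0f}/month")  # ~$450,000
    print(f"retrieved 4k slice:      ${monthly_input_cost(4_000):,.0f}/month")    # ~$3,600

Second, the incremental-update shape: re-chunk and re-embed only the document that changed. This reuses chunk and embed from the first sketch; the doc_id bookkeeping is the part a naive one-shot index lacks.

    import numpy as np

    class UpdatableIndex:
        def __init__(self):
            self.by_doc = {}                         # doc_id -> (chunks, normalized vectors)

        def upsert(self, doc_id: str, text: str):
            """Re-embed just this document; everything else stays put."""
            chunks = chunk(text)
            vecs = embed(chunks)
            vecs = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)
            self.by_doc[doc_id] = (chunks, vecs)     # silently replaces any stale version

        def delete(self, doc_id: str):
            self.by_doc.pop(doc_id, None)

        def search(self, query: str, k: int = 5) -> list[str]:
            q = embed([query])[0]
            q = q / (np.linalg.norm(q) + 1e-9)
            scored = [(float(v @ q), c)
                      for chunks, vecs in self.by_doc.values()
                      for c, v in zip(chunks, vecs)]
            return [c for _, c in sorted(scored, reverse=True)[:k]]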

A modern RAG system is rarely just “embed and cosine-similarity.” It’s usually a small pipeline: rewrite or expand the query so it looks more like the corpus, run hybrid retrieval (dense vectors plus keyword search, since embeddings miss exact identifiers and rare terms), rerank the top candidates with a slower but sharper model, then dedupe and trim what survives into the prompt.
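
In sketch form, with the model-dependent stages left as labeled stand-ins: rewrite_query would usually be an LLM rephrasing the query, rerank a cross-encoder or LLM scorer, and keyword_score is a crude substitute for BM25. The index is the toy one from earlier.

    def rewrite_query(q: str) -> str:
        # Stand-in: many pipelines ask an LLM to rephrase or expand the query first.
        return q

    def keyword_score(query: str, text: str) -> float:
        # Crude term-overlap score; a real system would use BM25 or similar.
        qs, ts = set(query.lower().split()), set(text.lower().split())
        return len(qs & ts) / (len(qs) or 1)

    def rerank(query: str, candidates: list[str], k: int) -> list[str]:
        # Stand-in: a cross-encoder or LLM reranker would score (query, chunk) pairs here.
        return candidates[:k]

    def retrieve(index, query: str, k: int = 5) -> list[str]:
        q = rewrite_query(query)
        dense = index.search(q, k=20)                                    # vector recall
        keyword = sorted(index.chunks, key=lambda c: -keyword_score(q, c))[:20]
        candidates = list(dict.fromkeys(dense + keyword))                # union, order-preserving dedupe
        return rerank(q, candidates, k)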

The shape of the problem shifts depending on what you’re retrieving over. Code retrieval cares about symbol graphs and call edges, not just embedding similarity. Conversation retrieval cares about recency. Legal retrieval cares about exact phrasing. There isn’t one “RAG”; there’s a family of pipelines that share the same shape.

The seam: agents complicate the picture

The cleaner version of “stuff everything in the prompt” that’s actually winning ground from RAG isn’t long context — it’s agentic retrieval. Instead of one retrieval call before generation, the model uses tools to search, read files, follow links, and decide what to look at next. The corpus stays external, but the model drives retrieval instead of a fixed pipeline.

This is closer to how a human researcher works: you don’t pre-fetch the whole library, and you don’t read the whole book — you skim, follow citations, and stop when you have enough. Whether agentic retrieval fully replaces classical RAG, or whether the two stay layered (agent on top, vector search underneath), is genuinely unsettled as of early 2026. I don’t think there’s enough public production data to call it yet.
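
Stripped to its control flow, the agentic version looks something like this. The JSON action format and call_llm are stand-ins (real agents use the provider's native tool-calling interface and richer tools than a single search), but the point is the loop: the model, not a fixed pipeline, decides when to search again and when to stop.

    import json

    def agent_answer(question: str, index, call_llm, max_steps: int = 6) -> str:
        notes: list[str] = []
        for _ in range(max_steps):
            prompt = (
                'Reply with JSON: {"search": "<query>"} to look something up, '
                'or {"answer": "<text>"} once you can answer.\n'
                f"Question: {question}\n"
                "Notes so far:\n" + "\n".join(notes) + "\nNext action:"
            )
            action = json.loads(call_llm(prompt))
            if "answer" in action:
                return action["answer"]
            notes.extend(index.search(action["search"], k=3))   # the model chose this query
        return "(no answer within max_steps)"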

Going deeper