Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why does prompt caching exist?

Your agent sends the same 50,000-token system prompt on every turn. The provider charges you 90% less when they recognize it. They're not being generous — they're charging you for work they didn't do.

AI & ML · intermediate · Apr 29, 2026

Why it exists

Build an agent for five minutes and you’ll notice something uncomfortable.

Every turn, you’re sending the model the same long system prompt, the same tool definitions, the same scrollback of “here’s what we did last step.” The variable part — the user’s new message, or the result of the last tool call — is tiny. Maybe 1% of the input tokens. The other 99% is identical to what you sent on the previous request.

The provider, on the other hand, is in an even more uncomfortable position. From their side: a request comes in with 50,000 tokens of prefix. Their server runs the full prefill pass — quadratic in prompt length, the most expensive thing the model does — to populate the KV cache for those 50,000 tokens. The model emits a few hundred tokens of output. The request ends. The KV cache is freed. Three seconds later, the same client sends 50,001 tokens — the original 50,000 plus one new user message. The server does the entire 50,000-token prefill again, from scratch, because nothing was kept around.

That is an absurd amount of duplicated work. Both sides know it. Prompt caching is the obvious move: keep the KV cache for the prefix around between requests (in GPU memory, or on a slower tier behind it), and bill the user for the cache hit instead of the recomputation. Prefill is the most expensive part of inference for long prompts. Skipping it is what gets you the 90% discount the docs advertise.
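To make the duplication concrete, here's a toy tally of input tokens across a ten-turn agent loop with no caching. The 50,000-token prefix and 500 tokens per turn are illustrative numbers, not measurements:

```python
SYSTEM_AND_TOOLS = 50_000  # stable prefix: system prompt + tool definitions
PER_TURN = 500             # new message + tool result added each turn

def tokens_sent(turn: int) -> int:
    """Input tokens on request number `turn` (1-indexed): the stable
    prefix plus every earlier turn's messages and tool results."""
    return SYSTEM_AND_TOOLS + PER_TURN * turn

turns = 10
billed_as_input = sum(tokens_sent(t) for t in range(1, turns + 1))
actually_new = SYSTEM_AND_TOOLS + PER_TURN * turns  # each token counted once

print(f"tokens billed as input over {turns} turns: {billed_as_input:,}")
print(f"tokens that were actually new:            {actually_new:,}")
```

Without caching, the provider prefills roughly ten times more tokens than the conversation ever introduced. That gap is the work being duplicated.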

Why it matters now

A few years ago, prompts were short. “Summarize this paragraph.” Caching across requests would have saved nothing worth the engineering. Three things changed:

- System prompts and tool definitions ballooned to tens of thousands of tokens.
- Agents arrived: multi-turn loops that resend the entire scrollback of prior steps on every request.
- Context windows grew long enough that carrying all of that, every turn, became the normal case rather than the exception.

In all three cases, the prefix is stable and the suffix is small. That’s the exact regime where caching wins. Without it, agentic workloads would be priced as if every step were a fresh request — which is to say, mostly priced as prefill cost on a prompt that was already prefilled five seconds ago.

It also matters because the cost model is not symmetric. A cache hit on Anthropic’s API is priced at roughly 10% of the base input rate; a cache write is priced at roughly 125% of base (for the 5-minute TTL) or 200% (for the 1-hour TTL). If you misuse caching — putting the breakpoint on something that changes every request — you pay the premium every time and never get the discount. The feature has a foot-gun.
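The asymmetry is easy to sanity-check with arithmetic. A sketch using the multipliers quoted above (hit ≈ 0.1× base, 5-minute write ≈ 1.25×); real per-token rates vary by model, so treat these as shapes, not prices:

```python
BASE, HIT, WRITE_5M = 1.00, 0.10, 1.25  # multipliers on the base input rate

def cost_without_cache(prefix_tokens: int, turns: int) -> float:
    # Every turn re-pays the full prefix at the base rate.
    return prefix_tokens * BASE * turns

def cost_with_cache(prefix_tokens: int, turns: int) -> float:
    # One write on turn 1, then hits -- assumes every follow-up
    # arrives inside the TTL and the prefix never changes.
    return prefix_tokens * (WRITE_5M + HIT * (turns - 1))

def cost_all_writes(prefix_tokens: int, turns: int) -> float:
    # The foot-gun: a breakpoint on content that changes every
    # request pays the write premium on every single turn.
    return prefix_tokens * WRITE_5M * turns

for turns in (1, 2, 10):
    saved = 1 - cost_with_cache(50_000, turns) / cost_without_cache(50_000, turns)
    print(f"{turns:2d} turns: {saved:+.1%} vs no caching")
```

With these multipliers, a single-shot request is a net loss (you paid 1.25× for a cache nobody reused); break-even arrives on the second turn, and the savings compound from there. The all-writes case is strictly worse than not caching at all.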

The short answer

prompt caching = KV cache for the prefix + persisted across requests + priced as a discount

A request’s KV cache is normally thrown away when the request ends. Prompt caching keeps the cache for a marked prefix in GPU memory (or on a tier behind it) for some TTL — typically a few minutes — so a follow-up request that starts with the same prefix can skip prefill and start decoding almost immediately. The provider charges less because they did less work. That’s the entire idea.

How it works

Two layers, and it’s worth keeping them separate in your head.

Layer 1: the underlying optimization. The KV cache for a prefix — the per-layer key and value tensors for tokens 1…N — is exactly the artifact that prefill produced. If the provider preserves it, the next request that begins with the same N tokens can attend against it directly and start generating. Prefill on N tokens is roughly O(N²) in the attention layers; reusing the cache means you skip recomputing the prefix entirely, and only pay to process the new M-token suffix against the reused prefix (roughly O(M·N + M²)). When M ≪ N, that’s a huge saving — the quadratic explosion in N is gone.
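Plugging numbers into those complexity shapes shows the scale of the saving. A toy cost model that counts only pairwise attention interactions (ignoring the linear MLP terms, so this overstates the real-world speedup somewhat):

```python
def attention_work(n_prefix: int, m_suffix: int, cached: bool) -> int:
    """Pairwise attention interactions for processing m_suffix new
    tokens after an n_prefix-token prefix."""
    if cached:
        # Only the M new tokens are processed: each attends to the
        # reused prefix (M*N) and to the suffix itself (~M^2).
        return m_suffix * n_prefix + m_suffix ** 2
    # Full prefill: every one of the N+M tokens attends to
    # everything at or before its position, ~(N+M)^2 total.
    return (n_prefix + m_suffix) ** 2

N, M = 50_000, 500
speedup = attention_work(N, M, cached=False) / attention_work(N, M, cached=True)
print(f"~{speedup:.0f}x less attention work with a cache hit")
# prints: ~101x less attention work with a cache hit
```

The ratio simplifies to (N+M)/M, so the win grows linearly as the stable prefix dwarfs the new suffix — exactly the agent-loop regime.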

Layer 2: the API surface. Different providers expose this differently:

- Anthropic makes it explicit: you mark a breakpoint in the request (`cache_control`), and everything up to that point becomes the cacheable prefix, billed at the write and hit rates above.
- DeepSeek makes it automatic: matching prefixes are detected and cached without any markup, with the cache persisted on disk rather than pinned in GPU memory.

The mechanism details differ, but the underlying physics doesn’t. Somebody, somewhere, is keeping the prefix’s intermediate state around so prefill doesn’t have to run again.
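As a concrete example of the explicit style, here is roughly what an Anthropic-style request body looks like with a cache breakpoint. The field names follow their published Messages API, but the model id and prompt text are placeholders — verify the exact shape against the current docs:

```python
# Stable, expensive content goes before the breakpoint; the small
# per-turn suffix goes after it.
long_system_prompt = "You are an agent with these tools: ..." * 1_000

request = {
    "model": "claude-sonnet-4-5",  # placeholder: any cache-supporting model
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": long_system_prompt,
            # Marks the breakpoint: everything up to and including
            # this block is eligible for caching.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        # Only this small, changing suffix is prefilled fresh.
        {"role": "user", "content": "Run the next step."}
    ],
}
```

The response's usage block then reports cache-read and cache-write token counts separately, which is how you verify hits are actually happening.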

The hash-and-match step

The provider has to recognize that your new request shares a prefix with a cached one. The standard move is to hash the prefix in chunks — typically token blocks — and look the hash up in a table of recently-computed prefixes belonging to your account. If the hash matches, the cache entry is reused. If even one token in the prefix differs, the hash diverges and you get a miss.

That’s why “exact prefix match” is not a marketing simplification — it’s load-bearing. Insert a timestamp in your system prompt and every request hashes differently. Reorder your tool list and the cache misses. Swap a single token of whitespace and you’re paying full price.
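A minimal sketch of chunked, chained hashing — not any provider's actual scheme. Chaining each block's hash into the next means a matching hash at block k implies the whole prefix through block k is identical, which is what makes reusing the longest matching run safe:

```python
import hashlib

BLOCK = 128  # tokens per hashed block; real block sizes vary by provider

def prefix_block_hashes(tokens: list[int], block: int = BLOCK) -> list[str]:
    """Hash a token prefix in fixed-size blocks, folding the previous
    block's hash into each new one so every hash commits to the
    entire prefix before it."""
    hashes: list[str] = []
    prev = b""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        payload = prev + repr(tokens[i:i + block]).encode()
        digest = hashlib.sha256(payload).hexdigest()
        hashes.append(digest)
        prev = digest.encode()
    return hashes

a = prefix_block_hashes(list(range(512)))           # 4 full blocks
b = prefix_block_hashes(list(range(512)))
assert a == b                                       # identical prefix: full hit

c = prefix_block_hashes([999] + list(range(1, 512)))
assert a[0] != c[0] and a[-1] != c[-1]              # early change: total miss

d = prefix_block_hashes(list(range(511)) + [999])
assert a[:3] == d[:3] and a[3] != d[3]              # late change: partial reuse
```

The third case is the interesting one: a change near the end of the prompt still lets every earlier block hit, which is why the standard advice is to put volatile content as late in the prompt as possible.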

This is also why caches are scoped per organization (or per workspace, depending on the provider). Two different customers happening to share a prefix do not share a cache — both for security and because the cache lives on a specific GPU server in a specific data center, and routing your request to that server is part of the trick.

What gets cached, what doesn’t

Across the major providers, the rule of thumb is: anything in the request that’s part of the prefix can be cached — system messages, tool definitions, message history, even images and documents in some cases. The output of the model is not “cached” in any useful sense; the same input still has to be re-decoded each time you ask for new tokens. (You can think of it this way: prefill is what gets reused, decode is what you pay for fresh.)

What breaks caching:

- Any token-level change inside the prefix: a timestamp in the system prompt, a reordered tool list, even a swapped character of whitespace. The hash diverges and you pay full price.
- A cache breakpoint placed on content that changes every request: you pay the write premium each time and never collect the hit discount.
- Letting the TTL lapse: once the entry is evicted, the next request is a full cold prefill (and a fresh cache write) again.

TTL and why caches expire fast

Five minutes feels short. My read on why it’s short: GPU memory is the most expensive memory in the data center, and a single long-context KV cache can be multiple gigabytes per request. Keeping millions of users’ prefixes resident “just in case” would fill the GPUs immediately. The TTL is a brutal eviction policy disguised as a feature. Anthropic’s 1-hour option costs 2× the write price; the docs don’t spell out the internal cost rationale, but it’s consistent with either pinning more VRAM or pushing the cache to a slower tier — I don’t have a source on which.

DeepSeek’s disk-backed approach is a different bet: cheaper to retain, slower to fetch. Whether disk- or VRAM-backed wins depends on the access pattern. For long-tail prefixes that get hit hours later, disk wins. For chatty agent loops where the next request lands in 2 seconds, VRAM wins because you can avoid the reload entirely.
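The eviction policy itself is simple to model. A toy TTL cache with the refresh-on-use behavior Anthropic documents for its 5-minute tier (TTL shortened here so the sketch runs quickly, and the "KV state" is a stand-in string, not real tensors):

```python
import time

class TTLPrefixCache:
    """Toy prefix cache with time-based eviction and refresh-on-hit."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def put(self, prefix_hash: str, kv_state: object) -> None:
        self._store[prefix_hash] = (time.monotonic(), kv_state)

    def get(self, prefix_hash: str):
        entry = self._store.get(prefix_hash)
        if entry is None:
            return None
        stored_at, kv_state = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[prefix_hash]  # expired: evict lazily on access
            return None
        self.put(prefix_hash, kv_state)   # a hit restarts the TTL clock
        return kv_state

cache = TTLPrefixCache(ttl_seconds=0.05)
cache.put("prefix-abc", "fake-kv-tensors")
assert cache.get("prefix-abc") == "fake-kv-tensors"  # hit inside the TTL
time.sleep(0.06)
assert cache.get("prefix-abc") is None               # evicted after expiry
```

The refresh-on-hit line is why a chatty agent loop keeps its cache alive indefinitely for one flat write cost, while a workload with gaps longer than the TTL pays the write price over and over.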

Where it gets subtle

The core thing to take away: prompt caching is not a separate feature stapled onto inference. It is the KV cache — the optimization that already makes generation fast within a single request — extended in lifetime so it can pay for itself across requests. The pricing is just the bookkeeping that makes the optimization legible to the customer.

Going deeper