Why does prompt caching exist?
Your agent sends the same 50,000-token system prompt on every turn. The provider charges you 90% less when they recognize it. They're not being generous — they're charging you for work they didn't do.
Why it exists
Build an agent for five minutes and you’ll notice something uncomfortable.
Every turn, you’re sending the model the same long system prompt, the same tool definitions, the same scrollback of “here’s what we did last step.” The variable part — the user’s new message, or the result of the last tool call — is tiny. Maybe 1% of the input tokens. The other 99% is identical to what you sent on the previous request.
The provider, on the other hand, is in an even more uncomfortable position. From their side: a request comes in with 50,000 tokens of prefix. Their server runs the full prefill pass — quadratic in prompt length, the most expensive thing the model does — to populate the KV cache for those 50,000 tokens. The model emits a few hundred tokens of output. The request ends. The KV cache is freed. Three seconds later, the same client sends 50,001 tokens — the original 50,000 plus one new user message. The server does the entire 50,000-token prefill again, from scratch, because nothing was kept around.
That is an absurd amount of duplicated work. Both sides know it. Prompt caching is the obvious move: keep the KV cache for the prefix around between requests (in GPU memory, or on a slower tier behind it), and bill the user for the cache hit instead of the recomputation. Prefill is the most expensive part of inference for long prompts. Skipping it is what gets you the 90% discount the docs advertise.
Why it matters now
A few years ago, prompts were short. “Summarize this paragraph.” Caching across requests would have saved nothing worth the engineering. Three things changed:
- Long system prompts. Modern assistants and coding agents ship with 5,000–50,000 tokens of system instructions, persona, formatting rules, and examples before the user has typed a single character.
- Long tool / MCP definitions. A capable agent harness registers dozens of tools. Their JSON schemas alone are often thousands of tokens. They almost never change between turns.
- Multi-turn loops. An agent that runs ten tool calls to answer one question sends the entire prior trace as input on each step. Turn N’s input is turn N-1’s input plus a few hundred tokens.
In all three cases, the prefix is stable and the suffix is small. That’s the exact regime where caching wins. Without it, agentic workloads would be priced as if every step is a fresh request — which is to say, mostly priced as prefill cost on a prompt that was already prefilled five seconds ago.
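To see the shape in code: a bare-bones agent loop is append-only, so each request's body is the previous one plus a small delta, and the stable prefix grows monotonically. This is a schematic sketch; `call_model` and `SYSTEM_PROMPT` are placeholders, not a real SDK.

```python
# Schematic of the agent-loop regime, not a real SDK call. `call_model` is a
# hypothetical stand-in for any chat-completions API.

SYSTEM_PROMPT = "..."  # imagine ~50,000 tokens of instructions + tool schemas

def call_model(messages):
    return "assistant reply"  # canned reply, for illustration only

history = [{"role": "system", "content": SYSTEM_PROMPT}]  # the stable prefix

def run_turn(new_message):
    history.append(new_message)   # the tiny variable suffix
    reply = call_model(history)   # request body = old prefix + small delta
    history.append({"role": "assistant", "content": reply})
    return reply

# Each turn re-sends everything accumulated so far, plus one new message:
for text in ["user question", "tool result 1", "tool result 2"]:
    run_turn({"role": "user", "content": text})
```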
It also matters because the cost model is not symmetric. A cache hit on Anthropic’s API is priced at roughly 10% of the base input rate; a cache write is priced at roughly 125% of base (for the 5-minute TTL) or 200% (for the 1-hour TTL). If you misuse caching — putting the breakpoint on something that changes every request — you pay the premium every time and never get the discount. The feature has a foot-gun.
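The asymmetry is easy to put numbers on. A back-of-envelope check using the multipliers above (read ≈ 0.1× base, 5-minute write ≈ 1.25× base); treat the exact ratios as a snapshot, since they drift:

```python
# Back-of-envelope cost model using the multipliers quoted above
# (read = 0.10x base, 5-minute write = 1.25x base). Ratios drift; the
# shape (read << base << write) is the durable part.

BASE, WRITE, READ = 1.00, 1.25, 0.10  # relative price per input token

def cost_with_cache(prefix_tokens, hits):
    # one cache write, then `hits` requests that read the cached prefix
    return prefix_tokens * (WRITE + hits * READ)

def cost_without_cache(prefix_tokens, hits):
    # every request pays full prefill on the same prefix
    return prefix_tokens * (1 + hits) * BASE

N = 50_000
for hits in (0, 1, 10):
    print(hits, cost_with_cache(N, hits), cost_without_cache(N, hits))
# hits=0:  62,500 vs  50,000 -> caching loses (the foot-gun case)
# hits=1:  67,500 vs 100,000 -> a single hit already pays for the write
# hits=10: 112,500 vs 550,000 -> roughly 5x cheaper
```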
The short answer
`prompt caching = KV cache for the prefix + persisted across requests + priced as a discount`
A request’s KV cache is normally thrown away when the request ends. Prompt caching keeps the cache for a marked prefix in GPU memory (or on a tier behind it) for some TTL — typically a few minutes — so a follow-up request that starts with the same prefix can skip prefill and start decoding almost immediately. The provider charges less because they did less work. That’s the entire idea.
How it works
Two layers, and it’s worth keeping them separate in your head.
Layer 1: the underlying optimization. The KV cache for a prefix — the per-layer key and value tensors for tokens 1…N — is exactly the artifact that prefill produced. If the provider preserves it, the next request that begins with the same N tokens can attend against it directly and start generating. Prefill on N tokens is roughly O(N²) in the attention layers; reusing the cache means you skip recomputing the prefix entirely, and only pay to process the new M-token suffix against the reused prefix (roughly O(M·N + M²)). When M ≪ N, that’s a huge saving — the quadratic explosion in N is gone.
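A toy cost model makes the asymmetry concrete. It counts only the (query, key) pairs the attention layers touch and ignores the large linear terms in real prefill, so read it as a shape argument, not a benchmark:

```python
# Toy attention-cost model: cost ~ number of (query, key) pairs touched.
# Real prefill also has big linear-in-N terms; this isolates the quadratic one.

def prefill_cost(n):
    return n * n  # fresh prefill over n tokens: O(n^2)

def cached_suffix_cost(m, n):
    return m * n + m * m  # m new tokens over n cached + m new: O(m*n + m^2)

N, M = 50_000, 500  # 50k-token stable prefix, 500-token new suffix
print(prefill_cost(N + M))       # 2,550,250,000 pair-ops from scratch
print(cached_suffix_cost(M, N))  #    25,250,000 with the cache: ~100x less
```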
Layer 2: the API surface. Different providers expose this differently:
- Anthropic (Claude) — supports both automatic caching (a single top-level `cache_control` field that moves forward as the conversation grows) and explicit block-level breakpoints (`cache_control: { type: "ephemeral" }` on individual content blocks, up to 4 per request). Default TTL 5 minutes; an extended 1-hour TTL is available at a higher write cost. Minimum cacheable size depends on the model (1024–4096 tokens at the time of writing). Cache reads cost ~10% of base input; cache writes cost 1.25× (5m) or 2× (1h) of base input. (A request sketch follows this list.)
- OpenAI — automatic, no opt-in. Announced at DevDay on October 1, 2024. Prefixes of ≥1024 tokens are eligible; cache hits land in 128-token increments. The launch announcement gave a 50% discount on cached input tokens; current pricing is model-specific and varies, so check the model’s pricing page rather than trusting “50%” as a stable rule.
- DeepSeek — automatic, disk-backed (not in-VRAM). At the August 2024 launch, cache hits were priced at 1/10 of the base input rate. Current ratios vary per model. The disk-tier choice is a different point on the cost-vs-latency tradeoff: cheaper to keep around for longer, slower to load back.
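To ground the explicit form, here is roughly what a breakpoint request looks like with Anthropic's Python SDK. The `cache_control` shape and usage fields are from their public docs; the model id and prompt are placeholders:

```python
# Explicit-breakpoint request with Anthropic's Python SDK. The cache_control
# field shape and the usage fields are from Anthropic's public docs; the
# model id and prompt below are placeholders.
import anthropic

LONG_SYSTEM_PROMPT = "..."  # imagine 5,000+ tokens of stable instructions

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id; use a current one
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache up to here
        }
    ],
    messages=[{"role": "user", "content": "the small, changing suffix"}],
)

# The usage block reports whether this request wrote or read the cache:
print(response.usage.cache_creation_input_tokens)  # > 0 on a cache write
print(response.usage.cache_read_input_tokens)      # > 0 on a cache hit
```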
The mechanism details differ, but the underlying physics doesn’t. Somebody, somewhere, is keeping the prefix’s intermediate state around so prefill doesn’t have to run again.
The hash-and-match step
The provider has to recognize that your new request shares a prefix with a cached one. The standard move is to hash the prefix in chunks — typically token blocks — and look the hash up in a table of recently-computed prefixes belonging to your account. If the hash matches, the cache entry is reused. If even one token in the prefix differs, the hash diverges and you get a miss.
That’s why “exact prefix match” is not a marketing simplification — it’s load-bearing. Insert a timestamp in your system prompt and every request hashes differently. Reorder your tool list and the cache misses. Swap a single token of whitespace and you’re paying full price.
This is also why caches are scoped per organization (or per workspace, depending on the provider). Two different customers happening to share a prefix do not share a cache — both for security and because the cache lives on a specific GPU server in a specific data center, and routing your request to that server is part of the trick.
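No provider publishes their matching code, but the shape the docs imply looks something like this toy version: hash the token stream in fixed-size chunks, cumulatively, and count the longest chain of known chunk hashes. An illustration of the idea, not anyone's real implementation:

```python
# Toy chunked prefix matching (an illustration, not any provider's real code).
# Hashes are cumulative, so a chunk's hash commits to everything before it;
# the longest chain of known hashes is the reusable prefix.
import hashlib

CHUNK = 128  # e.g. OpenAI reports cache hits in 128-token increments

def chunk_hashes(tokens):
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % CHUNK, CHUNK):
        h.update(" ".join(map(str, tokens[i:i + CHUNK])).encode())
        hashes.append(h.copy().hexdigest())  # depends on all prior chunks
    return hashes

cache = {}  # chunk hash -> KV-cache handle, scoped to one org

def match_and_store(tokens):
    matched, missed = 0, False
    for hh in chunk_hashes(tokens):
        if not missed and hh in cache:
            matched += CHUNK         # prefix tokens that skip prefill
        else:
            missed = True
            cache[hh] = "kv-handle"  # cache write for the new extension
    return matched

a = list(range(2000))
print(match_and_store(a))             # 0: cold cache, everything is a write
print(match_and_store(a + [7]))       # 1920: shared prefix hits, in 128-token steps
print(match_and_store([-1] + a[1:]))  # 0: one token changed up front, every hash diverges
```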
What gets cached, what doesn’t
Across the major providers, the rule of thumb is: anything in the request that’s part of the prefix can be cached — system messages, tool definitions, message history, even images and documents in some cases. The output of the model is not “cached” in any useful sense; the same input still has to be re-decoded each time you ask for new tokens. (You can think of it this way: prefill is what gets reused, decode is what you pay for fresh.)
What breaks caching:
- Anything that changes the prefix bytes. Timestamps, request IDs, randomized examples, anything dated.
- Anything earlier in the hierarchy changing. Most providers cascade invalidation: if your tools change, the system-message and message caches downstream of them also invalidate, even if those bytes didn’t change. The KV entries at position N depend on all tokens 1…N before them, so this isn’t an arbitrary rule — it’s forced by the math.
- Model swaps or parameter changes. Different weights produce different keys and values; the cache is model-specific. Some non-obvious parameters (e.g. tool-choice mode, certain feature flags) can invalidate parts of the cache too.
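The practical fix for most of these is mechanical: keep anything dynamic out of the prefix. A before/after sketch, with generic request shapes rather than any particular SDK:

```python
# Keeping dynamic bytes out of the cacheable prefix. Request shapes are
# generic, not any particular SDK; STATIC_INSTRUCTIONS is a placeholder.
from datetime import datetime, timezone

STATIC_INSTRUCTIONS = "You are a helpful agent. ..."  # byte-identical every request
user_message = "What changed since yesterday?"
now = datetime.now(timezone.utc).isoformat()

# Bad: the timestamp sits in the system prompt, so every request's prefix
# hashes differently. You pay the write premium each time and never hit.
bad_system = f"Current time: {now}\n{STATIC_INSTRUCTIONS}"

# Good: the prefix stays byte-identical; the timestamp rides in the uncached
# suffix (here, the user turn) instead.
good_system = STATIC_INSTRUCTIONS
good_user = f"[time: {now}] {user_message}"
```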
TTL and why caches expire fast
Five minutes feels short. My read on why it’s short: GPU memory is the most expensive memory in the data center, and a single long-context KV cache can be multiple gigabytes per request. Keeping millions of users’ prefixes resident “just in case” would exhaust the GPUs immediately. The TTL is a brutal eviction policy disguised as a feature. Anthropic’s 1-hour option costs 2× the write price; the docs don’t spell out the internal cost rationale, but it’s consistent with either pinning more VRAM or pushing the cache to a slower tier — I don’t have a source on which.
DeepSeek’s disk-backed approach is a different bet: cheaper to retain, slower to fetch. Whether disk- or VRAM-backed wins depends on the access pattern. For long-tail prefixes that get hit hours later, disk wins. For chatty agent loops where the next request lands in 2 seconds, VRAM wins because you can avoid the reload entirely.
Where it gets subtle
- Concurrent requests can race. If two requests with the same fresh prefix arrive simultaneously, both will trigger a cache write before either has finished. Anthropic’s docs explicitly recommend serializing the first request to populate the cache before fanning out, which is a tell that this case is real and annoying (see the sketch after this list).
- The breakpoint must sit on stable content. If you put your `cache_control` mark on the dynamic suffix, the prefix hash includes the changing bytes and you cache-write a fresh entry every request — paying the write premium and never getting a hit. Mark the end of the static prefix instead.
- The “discount” is also a sales argument for longer prompts. Once caching is on, the marginal cost of stuffing more examples or more docs into your system prompt drops by ~10× on a hit. My read — not something I have a source for — is that this is part of why agent system prompts ballooned over 2024–2025: caching made size much less expensive than it used to be. Whether that’s a good thing for prompt quality is a separate question.
- Output-token cost is not affected. Caching only touches input pricing. If your agent generates lots of tokens, prompt caching helps less than you think.
- Numbers I’m soft on. I’m pulling concrete pricing ratios and minimum-token thresholds from each provider’s public docs as of early 2026. They drift. The shape of the cost (read ≪ base ≪ write, with write only worth it if you’ll hit it more than ~once) is the durable part; treat the specific multipliers as a snapshot.
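On the first bullet, the fix Anthropic recommends is equally mechanical: run one request to completion so it populates the cache, then fan out. A sketch with asyncio; `call_api` is a hypothetical stand-in for any async client:

```python
# Serialize the first request to warm the cache, then fan out (the pattern
# from the first bullet above). `call_api` is a hypothetical async client.
import asyncio

async def call_api(prefix, suffix):
    await asyncio.sleep(0.1)  # stand-in for network + inference
    return f"reply to {suffix!r}"

async def fan_out(prefix, suffixes):
    # First request alone: it pays the cache write and populates the entry.
    first = await call_api(prefix, suffixes[0])
    # The rest in parallel: each should now be a cheap cache read.
    rest = await asyncio.gather(*(call_api(prefix, s) for s in suffixes[1:]))
    return [first, *rest]

print(asyncio.run(fan_out("big stable prefix", ["q1", "q2", "q3"])))
```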
The core thing to take away: prompt caching is not a separate feature stapled onto inference. It is the KV cache — the optimization that already makes generation fast within a single request — extended in lifetime so it can pay for itself across requests. The pricing is just the bookkeeping that makes the optimization legible to the customer.
Famous related terms
- KV cache — `KV cache = per-layer K and V tensors for past tokens, reused on the next decode step`. Prompt caching is this, persisted past the end of a request.
- Prefill vs. decode — `prefill = one quadratic pass over the prompt; decode = cheap one-token-at-a-time steps`. Prompt caching specifically targets prefill — the expensive phase. It does nothing for decode.
- Cache breakpoint — `breakpoint = a marker on a content block saying "the prefix up to here is cacheable"`. Anthropic’s explicit form. OpenAI’s automatic mode is roughly “the breakpoint is the longest matched prefix, found for you.”
- Cache hit rate — `hit rate = fraction of input tokens served from cache`. The metric to actually optimize. Reported under provider-specific names — `cache_read_input_tokens` (Anthropic), `cached_tokens` (OpenAI), `prompt_cache_hit_tokens` (DeepSeek). (A sketch for computing it follows this list.)
- Continuous batching — `continuous batching = scheduling new tokens from many requests together each step`. Orthogonal optimization on the decode side; stacks with prompt caching on the prefill side.
- Speculative decoding — `speculative decoding = small model proposes, big model verifies in parallel`. Another decode-side trick. Stacks with prompt caching.
- Context window — `context window = max tokens the attention mechanism can address`. Prompt caching makes large windows affordable; it doesn’t make them larger.
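To compute the hit rate in practice, it falls out of the usage block each response carries. A sketch for Anthropic-style usage fields; the field names are from their docs, and my assumption here is that `input_tokens` counts only uncached input, with cache reads and writes reported separately:

```python
# Hit rate from Anthropic-style usage fields. Assumes input_tokens counts
# only uncached input, with cache reads/writes reported separately.

def hit_rate(usage):
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = uncached + read + written
    return read / total if total else 0.0

# A steady-state agent turn: almost all input served from cache.
print(hit_rate({
    "input_tokens": 400,                # the new suffix
    "cache_read_input_tokens": 50_000,  # the reused prefix
    "cache_creation_input_tokens": 0,
}))  # ~0.992
```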
Going deeper
- Anthropic — Prompt caching with Claude (announcement, August 14, 2024) — original launch post.
- Anthropic prompt caching docs — the explicit-breakpoint API, TTLs, pricing tiers, invalidation rules.
- OpenAI — Prompt Caching in the API (DevDay 2024 announcement) — the automatic-caching version, 50% discount, ≥1024-token threshold.
- DeepSeek — Context Caching on Disk — the disk-backed variant; useful contrast to the VRAM-backed approach.
- Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM, 2023) — the canonical treatment of how KV cache memory is actually managed in a serving system. Prompt caching is what you build on top of a system that already handles the per-request case well.