Visual guide
The KV Cache, Illustrated
How modern LLMs actually remember the conversation while they generate. From the redundant computation that made naïve attention untenable, through what's exactly inside the cache, to the optimizations every production team uses on top: paged attention, GQA, prompt caching, attention sinks, speculative decoding. With worked math and thirteen diagrams.
25 min read
The first token of a 50,000-token request comes back in four seconds. Every token after that comes back in thirty milliseconds. The model is the same model. The GPUs are the same GPUs. The request is the same request. Strip out one engineering hack and that thirty-millisecond number becomes minutes, the API costs you’ve quietly accepted become a hundred times higher, and “long context” stops meaning anything useful.
The hack is the KV cache: a quiet mountain of intermediate state that lives in GPU memory while your request runs. It is, depending on how you count, the single most consequential data structure in modern LLM serving. Almost every cost curve, every latency oddity, every “prompt caching saves 90%” line item on a vendor’s pricing page is downstream of it. Once you see it, the whole inference stack (paged attention, GQA, prompt caching, attention sinks) stops being a list of acronyms and starts being the inevitable consequence of one problem and one fix.
The problem is wasteful redundancy. The fix is to cache the parts that don’t change. Everything past that is a refinement on the same idea.
What’s in this guide: the problem · the cache, exactly · anatomy · one step worked out · growth · memory math · prefill vs decode · GQA · paged attention · prompt caching · attention sinks · where it’s going · what it isnt · in your code · references
The problem: naive attention redoes its homework
To understand what the cache solves, look at what would happen without it. A transformer generating one token at a time, with no cache, has to recompute attention over the entire sequence so far on every step.
Recall how attention works inside a single layer. The input sequence (each token already an embedding vector) gets projected three ways to produce a Query matrix Q, a Key matrix K, and a Value matrix V. Attention is then softmax(Q · Kᵀ / √d) · V. A long way of saying: for each token’s query, take a weighted average over all tokens’ values, where the weights come from how well each token’s key matches that query. The output passes up the layer stack and eventually gets decoded into the next token.
When you generate the next token from a sequence of length N, the part that actually matters is the very last row of that big attention output: the row corresponding to “what should the next token be.” Everything else is by-the-way. But to compute that one row, the naïve implementation builds the entire Q, K, V matrices for all N tokens and runs the whole N×N attention. Most of that work is identical to the work you did the previous step.
The redundancy is staggering. To generate a sequence of N tokens, you’ve recomputed the K and V projections for token 1 roughly N times. For token 2, N-1 times. Total work scales as N² per layer, which is fine for N=100 and catastrophic for N=10,000 and physically impossible for N=200,000.
In practice this means that a 200K-token completion would take longer than a model release cycle, and cost more than a small mortgage. The cache exists because nobody could afford the bill, and nobody could wait for the response.
The insight: K and V of past tokens never change
Look at the attention computation again. The Q, K, V projections are linear: a matrix multiply with the layer’s projection weights, no nonlinearity in the way, no dependence on anything but the token’s own embedding and the layer’s frozen weights. So for any given past token at a given layer, its K vector and its V vector are deterministic. They are computed once and never need to change.
The only thing that does change between generation steps is the new token. Its query, q_new, hasn’t been computed yet. It’s the question “what should I attend to next?” But it doesn’t need the keys and values of past tokens to be recomputed. They’re already known.
The cache is just that recognition, made operational. Per layer, per attention head, store the K and V vectors of every past token, indexed by position. When generating token N+1, compute only q_new (one vector, not N). Look up K and V from the cache (no matmul, just memory load). Run attention from q_new against the cached K and V to get the output. Done.
The shape of the work has flipped. Each new token now costs one fresh projection and one row of attention against the cache, instead of a full re-projection of the past plus a square attention matrix. The cache itself grows linearly: one new K vector and one new V vector per layer per head per token.
This is what makes streaming generation feel instant after the first token. The model isn’t doing less attention. It’s doing exactly as much attention as it ever was. It’s just stopped redoing work it already did.
What the cache does and doesn’t fix
Worth being precise here, because this is where careful readers overcorrect in the other direction. The cache does not make attention constant-time. Generating token N+1 still computes a score against every one of the N cached keys. Per-token attention cost still grows linearly with context length, cache or no cache. Add that up over a full generation and decoding is still quadratic in total. There is no free lunch on context.
What the cache eliminates is recomputation:
- K/V projection work per step: N tokens re-projected → 1 token projected
- Attention scoring per step: an N × N square → a single 1 × N row
Summed over a whole N-token generation, projection work drops from O(N²) to O(N), and attention scoring drops from O(N³) to O(N²). A much cheaper quadratic, and one that is paid in memory reads rather than matmuls.
The 1 × N row is the part that never goes away, and it’s exactly why decode is memory-bound: every step must read the entire cache even though it computes almost nothing. Token #50,000 of a response is slower and more expensive than token #50 not because the math got harder but because the cache got bigger. The KV cache turned generation from computationally impossible into bandwidth-expensive. Most of the rest of this guide is about engineering around bandwidth-expensive.
What actually lives in the cache
Per layer, per attention head, per token, the cache stores two vectors of length head_dim: one K vector and one V vector. If you stack the K vectors for every token a head has seen, you get a (seq_len × head_dim) matrix. Same for V. Multiply by number of heads, you have the per-layer cache. Multiply by number of layers, you have the full per-request cache. Multiply by batch size, you have the full GPU-resident cache for a server handling multiple requests at once.
A useful mental model: think of the cache as a 5-dimensional tensor of shape [n_layers, 2, n_kv_heads, seq_len, head_dim], where the “2” is the K-vs-V distinction. Most serving stacks add a sixth dimension for batch, but within one request the first five are what you have.
Some concrete numbers to anchor it. Llama 3 8B has 32 layers, 32 query heads (and 8 K/V heads after grouped-query attention; more on that in a moment), head_dim 128. At a 2,048-token context, batch size 1, fp16, the cache size is:
cache_bytes = 2 (K + V) × 32 layers × 8 KV heads × 2048 tokens × 128 head_dim × 2 bytes ≈ 268 MB
Per request. At 8K context it’s just over 1 GB. At 32K, just over 4 GB. The Llama 3 8B model weights themselves are about 16 GB at fp16. The cache, at 32K context, is a quarter of the model’s footprint per concurrent request. This is why serving long context at scale is a memory problem, not a compute problem.
For comparison, Llama 3 70B (80 layers, 64 query heads / 8 KV heads, head_dim 128) at the same 32K context: 2 × 80 × 8 × 32768 × 128 × 2 ≈ 10.7 GB per request. The 70B model weights are about 140 GB. At 64K context it’s 21 GB: a quarter of an H100’s entire HBM, for a single request. This is where the bills go.
One attention step, worked out
Pick toy numbers small enough to follow in your head: 2 layers, 4 heads, head_dim 8. You are about to generate token #7 of a sequence. The cache for layer 0 currently holds 6 K vectors and 6 V vectors per head: a (4 × 6 × 8) K tensor and an identically shaped V tensor. Same for layer 1.
The flow for generating token #7:
- Look up the embedding of the last generated token (token #6). Call it
x, a vector of lengthd_model. For our toy:d_model = head_dim × num_heads = 8 × 4 = 32. - Enter layer 0. Compute three projections:
q = x · W_Q,k = x · W_K,v = x · W_V. Each is a vector of length 32. Reshape each into 4 heads of dimension 8. - For each head
h ∈ {0, 1, 2, 3}:- Append
k[h]toK_cache[layer=0, head=h]. The cache is now 7 vectors wide, shape(7 × 8). - Append
v[h]toV_cache[layer=0, head=h]. - Compute attention output for the new token (a one-row matmul, not a square one):
scores = q[h] · K_cache[0, h]ᵀ / √8→ shape(1 × 7)weights = softmax(scores)→ shape(1 × 7)out[h] = weights · V_cache[0, h]→ shape(1 × 8)
- Append
- Concatenate the four head outputs →
(1 × 32). Apply the output projectionW_O. Add the residual, run the feed-forward block, residual again. This is the layer-0 output for token #7. - Pass that vector up to layer 1 and repeat the whole flow.
- After all layers, run the final layer-norm and the unembedding to get logits over the vocabulary. Sample → token #7.
Two things are worth burning into the picture. First, the cache append is a memory operation: there’s no matmul against the past. Second, the attention matmul is now a single row of Q against the entire K cache, not a square N×N matmul. The compute cost of attention per token is linear in sequence length, not quadratic.
The corollary is that decode is memory-bandwidth-bound, not compute-bound. The GPU is mostly reading K and V vectors out of HBM and feeding them into a tiny matmul. The tensor cores sit at single-digit-percent utilization. This is the lever almost every inference optimization pulls on.
The cache grows one token at a time
Animate it in your head. The cache starts empty. Token #1 arrives, three projections happen, K_cache and V_cache each gain one column per head per layer. Token #2 the same, with attention now seeing two K and two V vectors per head. By token N, every layer has accumulated an N-column K matrix and an N-column V matrix per head.
This is the linear-in-time-and-space curve that makes long context tractable. It is also the curve that turns into a hockey stick at scale: every concurrent request keeps its own cache, every cache grows with every token, and HBM has a hard ceiling. The cache is what every serving framework spends its life trying to manage.
Why long context is expensive
The cost of holding a request in memory is the formula we keep referring back to:
cache_bytes = 2 × n_layers × n_kv_heads × seq_len × head_dim × bytes_per_param
For a few real models at typical contexts, fp16:
A serving box typically holds many concurrent requests in HBM at once. The KV cache, not the model weights, is what determines how many requests you can batch. This is why batching strategy is the central problem in inference serving, and why frameworks like vLLM, TGI, TensorRT-LLM, and SGLang build their architecture around managing this one tensor efficiently.
It also explains a subtle thing about pricing. Output tokens cost more than input tokens, and not (only) because they are “generated.” Each output token holds the entire cache in memory for the duration of its decode step. The marginal cost of generating token #50,000 of a response is much higher than generating token #50, because the cache is much bigger and every step is dominated by memory bandwidth.
Why the first token is slow
The full generation of a response has two phases that look nothing like each other.
Prefill processes the entire prompt at once. The model runs forward through every layer and computes K and V for every prompt token in parallel. This is one big GPU-friendly matmul per layer. It produces the initial cache and the logits for the first generated token.
Decode generates one token at a time. Each step is the small matmul we walked through above, plus a memory read of the entire current cache.
The asymmetry is brutal. A 50K-token prompt with a 1K-token response: prefill processes 50,000 tokens in two or three seconds; decode generates 1,000 tokens, each one bottlenecked by HBM bandwidth, in maybe ten seconds. Total wall clock: thirteen seconds. The first token sits behind the prefill. That is the “time to first token” you see in spec sheets. After that, decode is what you feel.
This asymmetry is also what prompt caching exploits. If the prefill is the same prefill, the work to build the cache is identical, so save the cache for next time.
Sharing K and V across heads: MHA, MQA, GQA
The single most effective architectural change to the KV cache in the last few years is not changing how the cache works. It’s making it smaller.
Classical multi-head attention (MHA) gives every query head its own K and V. Llama 1 with 32 query heads had 32 K heads and 32 V heads. That is a 32× multiplier on cache size that nobody is excited about paying.
Multi-query attention (MQA) shares one K and one V across all query heads. Cache shrinks by 32×. Quality drops noticeably; the model loses some of its capacity for fine-grained attention patterns.
Grouped-query attention (GQA) is the negotiated middle. Divide the query heads into G groups, each group shares one K and one V. Llama 3 uses 32 query heads in 8 groups of 4: a 4× cache reduction with quality within noise of full MHA.
Every modern open model (Llama 3, Mistral, Mixtral, Qwen, Gemma, DeepSeek) uses GQA. It is the cheapest architectural lever that exists for the KV cache, and it is the first thing to look at when you are sizing a deployment. The cost is a single line in the model config; the saving compounds across every byte of cache you allocate.
Paged attention: the cache as virtual memory
The naïve cache layout (one contiguous KV tensor per request) has the same problem early operating systems had with contiguous memory allocation. You cannot reuse fragments, you cannot share memory between processes, and one request’s footprint blocks others from fitting.
The vLLM team’s contribution was to treat the KV cache the way an operating system treats RAM. Slice the cache into fixed-size blocks: typically 16 tokens per block, per layer, per head. Maintain a block table per request that maps logical sequence position to physical blocks. The physical blocks live in a global pool managed by the server.
The wins compound. Memory fragmentation drops to a few percent. Two requests that share a prefix (the same system prompt, for instance) can literally share the same physical pages. Sequence-level parallelism, beam search, and parallel sampling all become memory-efficient because branches can share their common past.
The original paper claimed a 2-4× throughput improvement over previous serving stacks. The principle has since shown up everywhere: vLLM, TensorRT-LLM, SGLang, llamacpp’s continuous-batching mode, and most production inference frameworks now run on some variant of it. The KV cache stopped being a tensor and became an address space.
Prompt caching: the cache that outlives the request
In most APIs, the KV cache lives and dies with one request. The compute that built it (the prefill) gets thrown away the moment the response returns. The next request, even if it shares a 99% identical prefix, redoes that prefill from scratch.
Prompt caching breaks the throwaway. Anthropic, OpenAI, Google, and several inference vendors now persist the KV cache for the prefix of a request, keyed by the exact prefix bytes, for a short TTL; Anthropic’s default is around five minutes, with a longer extended-cache option. The next request that arrives with a matching prefix loads the cache directly into HBM instead of re-running prefill on those tokens.
The economic effect is large. For a long-system-prompt agentic workflow, the cache hit ratio is typically >95% on every turn after the first. Anthropic charges roughly 1/10 of normal input price for cached input. A loop that would cost a dollar per turn at full pricing costs about ten cents. The latency savings are comparable: time-to-first-token drops from “however long it takes to prefill 50K tokens” to “however long it takes to load the cache from a fast memory tier.” A tenfold improvement is normal.
There are real failure modes. The cache key is the exact prefix, byte-for-byte. Add an extra space, switch “USA” to “U.S.A”, change the date in the system prompt by one day: cache miss, full reprefill, normal price. Production teams crystallize their system prompts and pass anything dynamic as a later message in the request, after the cached boundary.
This is also the optimization that is most often left on the table. If your application is sending a long system prompt or a large RAG context on every request, you are almost certainly doing redundant prefill on most of every request. Adding the cache_control: { type: "ephemeral" } annotation on Anthropic, or the equivalent flag on the OpenAI Responses API, typically takes an afternoon and saves 70% or more on input-token spend.
Designing prompts for cache hits
Once caching is on, prompt engineering and cache engineering become the same discipline. The key is the byte-identical prefix, so the design rule is simple: static content first, dynamic content last, cache boundary between them. The classic self-inflicted miss is interpolating the clock into the top of the system prompt:
# Bad: the date changes, the bytes change, the cache misses every day
You are a support agent for Acme.
Today's date: 2026-06-15
[...8,000 more tokens of instructions and tool definitions...]
# Good: the prefix never changes; the date rides in with the first user turn
You are a support agent for Acme.
The current date is provided in the first user message.
[...8,000 more tokens of instructions and tool definitions...]
The same rule generalizes well past the date:
- Serialize tool definitions deterministically. A JSON encoder that reorders keys between deploys silently invalidates every cached prefix in production. I learned this one from a client’s bill, not from documentation.
- RAG retrievals go after the static boundary. They change per request; don’t let them poison the shared prefix. Anthropic allows up to four cache breakpoints, so a long agent loop can cache the system prompt and tools at one boundary and the accumulated conversation at another.
- Shared scaffolding before per-user content. If every tenant runs on the same instructions, keep those bytes identical across users and put user-specific memory after them; the expensive shared prefix then caches across your whole user base.
- Treat cache hit rate as an SLO. The API response reports how many input tokens were read from cache. Log it. A refactor that quietly drops your hit rate from 95% to zero looks like nothing in code review and shows up as a 10× input bill at the end of the month.
Sliding window with attention sinks
What if you don’t want the cache to grow forever? Mistral models, and a few others, cap attention to a sliding window: the last W tokens, drop the rest. The cache stays bounded at W. Generation can in principle continue forever.
The problem the StreamingLLM paper surfaced is what happens at the edge of the window. The very first tokens (the absolute oldest few) turn out to have a structural role. The model has learned to dump excess attention onto them: “attention sinks.” Throw them out, and the softmax distribution collapses, and generation quality degrades sharply within a few hundred steps.
The fix is simple: keep the first 4 tokens of the cache always, plus the last W tokens. The cache stays bounded; the sinks stay in place; the model keeps generating coherently for arbitrary sequence lengths.
This is the closest thing the field has to “infinite context” right now, and it shows up in different forms in Mistral, Gemini, and many open models. The constraint it accepts is honest: the model can’t actually attend to anything beyond the window. The sinks aren’t a memory of what happened; they’re a softmax stabilizer. Anything important from outside the window has to be re-injected by the application as part of the input.
Where the field is going
Everything above is the settled layer: what production serving already looks like. The frontier is still circling the same object, and the easiest way to keep your bearings in the paper flood is to see every new technique as pulling one of six levers on the same constraint.
Amortize the read: speculative decoding and Medusa. Every decode step reads the whole cache to emit one token. Speculative decoding has a small draft model propose several tokens cheaply, which the big model then verifies in a single forward pass. One full-bandwidth read of the cache buys up to k tokens instead of one. Medusa gets a similar effect without a separate draft model, by training extra decoding heads that guess several tokens ahead. In a memory-bound regime, doing more arithmetic per byte read is nearly free, which is why this works at all.
Shrink the bytes: KV quantization and eviction. bytes_per_param is a straight multiplier in the memory formula, and the cache tolerates low precision surprisingly well. FP8 caches are routine in TensorRT-LLM and vLLM; KIVI pushes to 2-bit with modest quality cost. A separate line of work (H2O, SnapKV) watches attention patterns and evicts cache entries that have stopped receiving attention mass. Either approach can halve the footprint (combined, better) without touching the model weights.
Schedule for sharing: prefix-aware routing. Paged attention made prefix sharing possible; the next step is making it likely. SGLang’s RadixAttention keeps a radix tree over every cached prefix on a replica and routes incoming requests toward the machine that already holds their prefix. Cache hit rate stops being a happy accident and becomes a scheduling objective.
Split the phases: disaggregated serving. Prefill is compute-bound; decode is memory-bound. Running both on the same GPU means sizing the hardware for one and wasting it on the other. Disaggregated architectures (DistServe, Mooncake) run prefill and decode on separate GPU pools, each sized for its own bottleneck, and ship the KV cache between them over the interconnect. Moving gigabytes of cache between machines sounds absurd and benchmarks fine; the interconnects got fast enough.
Tier the storage: KV offloading. A parked agent session doesn’t need its cache sitting in HBM. Offloading systems tier the cache the way an operating system tiers memory: HBM while the request is active, CPU RAM when it idles, NVMe when it parks, reload on resume. This is the piece that makes day-long agent sessions economical rather than ruinous.
None of these is exotic anymore. If you’re choosing a serving stack in 2026, you are mostly choosing between implementations of this list.
What the KV cache is not
Half the productive work of this guide is fencing off the things people confuse with the KV cache.
- It is not “the model’s memory.” The model’s parameters are frozen, set in training. The KV cache is intermediate state for one in-flight request, not retained knowledge.
- It does not persist past the request, except in prompt caching, and even then only for a short prefix-keyed window.
- It is not what RAG fills. RAG retrieves text and appends it to the prompt. That text then participates in prefill and ends up in the cache like any other tokens. The cache is downstream of context-window engineering, not a replacement for it.
- It is not what makes agents “remember” across sessions. That is external state (databases, vector stores, summarization), covered in A Visual Guide to AI Agent Memory.
- It does not “learn” from the conversation. Nothing in the KV cache feeds back into the weights. The fact that the model “knows” things from earlier in the conversation is because those tokens are still in the cache. Once they’re not, they’re gone.
The KV cache is mechanism, not memory. It is the data structure that makes attention tractable during decoding. Everything we colloquially call “memory” lives elsewhere, in engineering you build around the model.
Why this matters in your code
If you take three operational things from this guide, take these.
Your bill grows with the cache, not just with token count. The marginal cost of generating token #50,000 of a response is much higher than generating token #50: same number of “output tokens” on the invoice, but vastly more memory bandwidth consumed. Output token pricing is averaged across this, which makes it look linear when the underlying cost isn’t. If your application generates long responses, your effective cost per token climbs with response length, and your tail latencies climb with it.
Prompt caching is the highest-ROI optimization most teams haven’t done. If your application sends a long system prompt or large context (RAG retrievals, multi-turn agent state) on every request, you are very likely doing redundant prefill on most of every request. Turning on the cache header typically takes a single afternoon and saves 70%+ on input-token spend. The reason it isn’t on by default is that you have to be deliberate about what your prefix is. Once you are, it’s almost free money.
Long context isn’t “free” just because the model accepts it. A 1M-token Claude or Gemini request has a KV cache in the multiple-gigabyte range. Time-to-first-token reflects the prefill of that whole prompt. Decode rate reflects HBM bandwidth pulling those gigabytes per token. Long context is not a magic substitute for retrieval; it pushes the cost from your vector DB to the model’s HBM, and the model’s HBM is more expensive.
A back-of-the-envelope for your model:
- Pick your model’s
n_layers,n_kv_heads,head_dim. These are in every model card. - Multiply:
2 × n_layers × n_kv_heads × head_dim × seq_len × 2 bytes(for fp16) → cache size in bytes. - Compare to your serving GPU’s HBM (typically 40-80 GB), minus weights, minus framework overhead.
- That’s a rough upper bound on how many concurrent requests you can fit, assuming each runs at full context.
If you find yourself running out of headroom, the levers, in order of effort: turn on prompt caching, switch to a model with GQA if you somehow aren’t on one, reduce max context, deploy on a paged serving stack (vLLM / SGLang / TensorRT-LLM). The improvements roughly multiply.
A short opinion
The next eighteen months of inference work will be more about the KV cache than about model architecture. The architectures have mostly stabilized: decoder-only transformer, RoPE, GQA, RMSNorm, MoE in some lines, dense in others. The cache layout has not. Paged attention was 2023. Prompt caching was 2024. Every direction in the section above is still being measured in 2× and 3× wins over the previous baseline, in a field where architecture papers fight over half a percent. Nothing else in the stack is moving that fast.
If you want to be the person on your team who actually understands the production cost curve of your LLM stack, internalize this data structure. It is the single most useful piece of LLM-systems knowledge you can hold. Everything else (better serving, lower latency, lower bills, longer contexts) is a refinement on top of it.
References and further reading
- Vaswani et al. (2017). Attention Is All You Need. The original paper that made every later optimization possible. The KV cache isn’t named here; it is an inference-side trick that emerged later.
- Shazeer (2019). Fast Transformer Decoding: One Write-Head is All You Need. The original MQA paper.
- Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. The architectural compromise that every modern open model uses.
- Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. The vLLM paper. Required reading if you serve LLMs.
- Xiao et al. (2024). Efficient Streaming Language Models with Attention Sinks. StreamingLLM and the discovery of the attention-sink phenomenon.
- Pope et al. (2022). Efficiently Scaling Transformer Inference. The canonical reference for inference-time math, prefill vs decode, and the memory wall.
- Leviathan et al. (2023). Fast Inference from Transformers via Speculative Decoding. One full cache read, k tokens out.
- Cai et al. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Speculation without a draft model.
- Zheng et al. (2024). SGLang: Efficient Execution of Structured Language Model Programs. RadixAttention and prefix-aware scheduling.
- Liu et al. (2024). KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. How far the cache can be compressed before quality breaks.
- Zhang et al. (2023). H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. Attention-guided cache eviction.
- Zhong et al. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. The case for splitting the two phases onto separate hardware.
- Anthropic prompt-caching documentation. Current best reference for the API-level mechanics.
- Sasha Rush’s The Annotated Transformer. For the Q/K/V math in PyTorch you can run line by line.