Pranav Belhekar

Visual guide

A Visual Guide to AI Agent Memory

Your agent rebuilds its entire mind from scratch on every single turn. This guide explains — visually, from first principles — how agents remember, forget, and what actually works in production.

20 min read

In 1953, a surgeon named William Scoville removed most of the hippocampus from a young man’s brain to stop his seizures. It worked — and it also stopped something else. Henry Molaison, known to science for the next half-century as Patient H.M., could no longer form new long-term memories. He could hold a conversation. He could reason. He was, by every in-the-moment measure, intelligent. But walk out of the room, come back five minutes later, and you were a stranger introducing yourself again (Scoville & Milner, 1957).

If you’ve built anything on top of a large language model, you have met H.M. Every LLM ships with exactly his condition: fluent, capable, present — and structurally incapable of remembering you.

Everything we call “memory” in an AI agent is engineering built around that fact. This guide is a tour of that engineering — how it works, why each layer exists, and where it falls over in production. I’ve built these systems for client work and for my own tools, and most of what’s below is scar tissue. The diagrams carry the load; the text fills in the why.

What’s in this guide: the goldfish problem · there is no memory · working memory · long context won’t save you · forgetting on purpose · the memory map · episodic memory · semantic memory · procedural memory · the full architecture · what breaks in production · the long-context objection · references

The goldfish problem

Here’s a conversation every agent builder has watched with a sinking feeling:

Turn 2 — user: One ground rule: never deploy straight to prod. Always staging first.

Turn 2 — agent: Understood. Staging first, always.

…forty minutes and many turns later…

Turn 87 — agent: Build passed! Deploying to production now. 🚀

The agent didn’t disobey. It didn’t “forget” in any willful sense. Something more mechanical happened: the conversation grew past the model’s capacity, the oldest turns were dropped to make room, and the ground rule from turn 2 simply stopped being part of the input. The instruction still exists in your logs. It just doesn’t exist for the model.

User — turn 2: “Never deploy straight to prod.” the context window — everything the model can see user t1 user t2 agent t3 user t4 agent t5 user t6 agent t7 user t8 agent t9 user t10 Turns 1–4 still exist in your logs. For the model, they were never said. current turn earlier now
A ten-turn conversation. The model sees only what's inside the window. Turn 2 — the one with the critical instruction — isn't 'hard to recall'. It is not part of the input at all.

That distinction — not retrieved versus not present — is the single most important idea in this guide. Human forgetting is usually a retrieval failure; the trace is in there somewhere. When an agent forgets, the information is typically gone from the input entirely. No amount of “try harder to remember” prompting can fix that, for the same reason H.M. couldn’t try harder to remember meeting you yesterday: the machinery to carry it forward isn’t there.

So before we can talk about memory, we have to be honest about what the model actually is.

There is no memory

A large language model is a stateless function. Tokens go in, tokens come out, and between one API call and the next, nothing persists. The weights — the billions of parameters encoding everything the model learned in training — are frozen. Your conversation does not update them. There is no hidden scratchpad inside the provider’s datacenter accumulating your relationship with the model. Call it a million times and it is byte-for-byte the same function on call one and call one million.

CALL 1 input tokens instructions · history · memories · the question the model weights frozen reply 1 nothing crosses this gap no state saved · no weights updated · no trace that call 1 ever happened CALL 2 — one second later input tokens re-send everything — or it never existed the same model unchanged reply 2
Two calls, one second apart. The model is identical both times, and nothing — no state, no trace, no learning — crosses the gap. If you want call 2 to 'remember' call 1, you must re-send the relevant parts yourself.

This surprises people because chat interfaces hide it well. When ChatGPT or Claude “remembers” what you said three messages ago, what’s really happening is that the application re-sends the entire conversation — every prior message — with each new turn. The model isn’t remembering the conversation; it’s re-reading it from scratch, every time, the way an actor doesn’t remember yesterday’s performance but is handed the full script before every show.

(One nuance worth pre-empting: providers do cache things between calls — Anthropic and OpenAI both offer prompt caching, which reuses computation for a prompt prefix they’ve seen before. But it’s an ephemeral cost-and-latency optimization that expires within minutes to an hour. It stores nothing about meaning. It is not memory, any more than a CPU cache is your hard drive.)

Once you accept statelessness, the design question of this whole field snaps into focus:

Memory is not something an LLM has. Memory is whatever you choose to put back into the prompt.

Every memory system you’ve heard of — vector databases, fact stores, knowledge graphs, running summaries, MemGPT, Mem0, Zep — is an answer to one question: given limited space, what exactly should we re-send?

And that space really is limited. Which brings us to the window.

Working memory: the context window

The context window is the model’s entire perceptual universe for one call: the maximum number of tokens it can process at once. As of mid-2026, frontier models from Anthropic, OpenAI, and Google have converged on windows around a million tokens, with smaller production models at 200K. A million tokens is roughly ten novels. It sounds infinite. It is not, and the reason is what the window has to hold.

People imagine the prompt is “the user’s question.” Here’s what’s actually inside a production agent’s context window on any given turn:

your actual message ~100 tokens out of 200,000 system prompt 2–5K tool definitions 5–15K memories & retrieved docs 5–20K conversation history every turn lands here reserved for the reply 4–32K 0 200K tokens the only segment that grows without bound every turn adds to it — until something has to give
Anatomy of one turn's context for a production agent with a 200K window (proportions schematic). The user's actual message is the sliver. Everything else is scaffolding — and one segment grows every single turn.

The system prompt, the tool definitions, the retrieved memories — those are roughly constant. The conversation history is not. Every turn appends to it, and nothing about the transformer architecture compresses it for you. The arithmetic is unforgiving, and the best way to feel it is to watch it happen:

Watch a context window fill up

Drag the conversation forward. The fixed parts never shrink — only the history can give.

40 turns
system · tools · memories (17K, fixed) history (~900 tokens/turn) reply reserve (16K)

Cognitive psychology has a name for a fast, vivid, strictly-bounded store that holds whatever you’re working on right now: working memory. George Miller famously put its capacity at seven items, plus or minus two (1956); Nelson Cowan’s modern estimate is closer to four chunks (2001). The context window is the agent’s working memory, with the same defining property: it’s the only place where thinking actually happens, and it is always, always too small for the life it’s asked to hold.

But the analogy runs deeper than capacity — because it turns out models don’t even use the window they have evenly.

Long context won’t save you

In 1962, Bennet Murdock ran one of psychology’s most replicated experiments: read people a list of words, ask for free recall. The result is the serial position curve — people reliably remember the start of the list (primacy) and the end of the list (recency), and lose the middle.

In 2023, Nelson Liu and colleagues at Stanford ran what is essentially the same experiment on language models — place a needed fact at different positions inside a long context and measure whether the model uses it. The result, published as “Lost in the Middle”, is a curve any memory researcher would recognize on sight:

LLM retrieval (after Liu et al., 2023) human free recall (after Murdock, 1962) retrieval accuracy primacy the start sticks recency the end sticks the middle is where facts go to die start of context middle end of context position of the fact the model needs (curves schematic)
Two systems, one failure shape. Solid: LLM accuracy by position of the relevant fact in context (after Liu et al., 2023). Dashed: human free recall by position in the studied list (after Murdock, 1962). Curves schematic.

I want to be careful here: the curves rhyme for different underlying reasons, and the parallel is an aid to intuition, not a claim about mechanism. But the engineering consequence is identical — position is part of the API. Where you place information in the context changes whether it gets used. Critical instructions go at the start (and are often worth repeating near the end); the stuff in the middle is where facts go to die.

And the problem compounds with scale. Newer evaluations keep finding that a window’s nominal size overstates its usable size. NVIDIA’s RULER benchmark found models’ effective context is often half their claimed length or less. NoLiMa (2025) showed that when a question doesn’t share literal keywords with the fact it needs, 11 of 13 models tested dropped below half of their short-context accuracy by just 32K tokens. Chroma’s “Context Rot” report (2025) measured 18 frontier models and found performance degrades as input grows even when the task itself stays equally hard. There’s also a quieter cost: long prompts are slower, and they’re priced accordingly — some providers bill input tokens at double rate beyond a threshold (OpenAI does above ~272K as of this writing; Google’s Gemini pricing tiers above 200K).

So the honest mental model of a long context window isn’t a hard drive. It’s a desk. A bigger desk is genuinely useful — but pile it high enough and you stop being able to find things on it, and nobody works well at a desk a kilometer wide.

If you can’t keep everything on the desk, you need a policy for what leaves it. That policy has a blunt name: forgetting.

Forgetting on purpose

Forgetting has terrible PR. We treat it as failure — but a system that forgets nothing and a system that remembers nothing fail the same way: both drown the relevant in the irrelevant. Hermann Ebbinghaus mapped the human forgetting curve in 1885 and found we shed most new information within days — and a long tradition in psychology since has argued this is largely a feature: aggressive triage that keeps working memory clear for what matters. Your agent needs the same triage, except nothing about it is automatic. You have to choose a strategy, and there are only three basic moves:

1 · Drop the oldest predictable, free but decisions vanish silently 2 · Summarize everything summary (10:1) keeps the whole arc, cheaply but the loss compounds 3 · Hybrid — gist + verbatim summary old turns as gist, recent turns exact what production systems do the same 8-turn conversation, three ways of making it fit
The same eight-turn conversation made to fit, three ways. Dropping the oldest is predictable but silently discards decisions. Summarizing keeps the arc but is lossy. The hybrid — gist for the old, verbatim for the recent — is what most production systems (including Claude Code's compaction) actually do.

Strategy 1, drop the oldest, is what most chat applications quietly do, and it’s exactly what produced the prod-deploy disaster in the opening. Cheap, predictable, and it deletes ground rules without ceremony.

Strategy 2, summarize everything, asks an LLM to compress the older conversation into a running summary. This is the industry’s workhorse — it’s what happens when your coding agent announces it’s “compacting” the conversation. It preserves the arc of the conversation at a fraction of the tokens. It also has a failure mode that deserves its own diagram, because it’s the one people don’t see coming:

THE ORIGINAL — TURNS 1–30 · staging first, always · prod needs Priya's sign-off · CTA must be blue — brand   team insisted compact AFTER 1 COMPACTION · deploy via staging · Priya approves prod · CTA: blue compact AFTER 2 · deploys need approval · UI feedback given AFTER 3 · process discussed Each pass keeps the gist and sheds a constraint. By the third summary, “the CTA must be blue” has stopped existing — and the agent will repaint it.
Summarization drift: each compaction is a summary of a summary. Track the highlighted constraint. By generation three it isn't weakened — it's gone, and the agent will violate it with total confidence.

A summary of a summary of a summary behaves like a photocopy of a photocopy: each generation looks faithful to the last, and the cumulative loss is invisible until something specific — a number, a name, a never do X — has quietly evaporated. Psychologists know a version of this too: human memories aren’t replayed, they’re reconstructed on every recall, and each reconstruction can edit them (Nader et al., 2000; Loftus’s false-memory work showed how far that editing can go). Summarization is reconstruction with the same property: every compaction is a chance to lose a constraint or, worse, smooth a specific fact into a plausible-sounding generality.

Strategy 3, the hybrid — keep a gist of the old, keep the recent turns verbatim, and pin the truly load-bearing instructions somewhere eviction can’t touch them — is the production answer, and it works precisely because it admits that not all tokens are equal.

But notice what even the hybrid concedes: inside the window, everything is temporary. If something must survive indefinitely — across compactions, across sessions, across months — it has to live somewhere else entirely. The window was never going to be enough. We need the other memory systems.

The memory map

When cognitive scientists took LLM agents seriously as an object of study, they reached for a taxonomy psychology had spent a century refining. The CoALA paper (Sumers et al., 2023) made the mapping explicit, and it has quietly become the industry’s shared vocabulary — frameworks like LangChain’s LangMem ship “semantic / episodic / procedural” as literal product categories. Endel Tulving drew the original human distinction in 1972: episodic memory replays events; semantic memory knows facts with no memory of learning them; procedural memory is skill — how to do, not what happened.

IN YOUR HEAD IN YOUR AGENT Working memory what you're holding in mind right now — about four chunks of it (Cowan, 2001) The context window everything in the current prompt — the only memory the model actually has Episodic memory events you can replay — what happened, when, in what order (Tulving, 1972) Conversation logs + retrieval raw transcripts, embedded and searched — pulled back in when they look relevant Semantic memory facts you simply know, detached from when or where you learned them Fact stores & user profiles extracted facts, preferences, decisions — structured, queryable, updatable Procedural memory skills you run without thinking — typing, riding a bike, your accent System prompt, tools, instruction files how to behave — installed before the conversation starts, rarely rewritten Only the top row lives inside the model. Everything below it is engineering.
The map for the rest of this guide. Each human memory system has an engineering counterpart — and only working memory lives inside the model itself.

Here’s what makes this more than a cute analogy. H.M. — the patient from the opening — lost the ability to form new episodic and semantic memories. But Brenda Milner later showed his procedural memory still worked: practicing a mirror-drawing task across three days, he improved like anyone else, while having no recollection of ever doing the task before (Milner, 1962). The memory systems are genuinely separable — in brains, and as it turns out, in agents. You can ship an agent with perfect procedural memory (a great system prompt) and zero episodic memory (no logs). You can bolt on episodic recall without any semantic layer. They’re different subsystems solving different problems, and the next three sections take them one at a time.

Episodic memory: remembering what happened

The episodic problem: ten thousand conversation turns live in your logs, the window can hold fifty, and somewhere in there the user once said something that matters right now. How do you find it?

The dominant answer is retrieval over embeddings. Every chunk of past conversation is run through an embedding model that maps it to a point in a high-dimensional space — typically 1,500 to 3,000 dimensions — arranged so that similar meanings land near each other. When a new turn arrives, you embed it too, find the nearest stored neighbors, and splice the top few into the context. This is retrieval-augmented generation pointed at your own history instead of a document corpus.

the query: “what did we decide about deploys?” deploy decisions design feedback food & preferences “I love sushi” “I hate sushi” cosine similarity ≈ 0.95 same topic ≠ same meaning a stored memory chunk retrieved — top 3 nearest the query the real space has ~1,500–3,000 dimensions; two are shown
Past conversation chunks as points in embedding space (two of ~1,500 dimensions shown). The query lands near what it means, not near what it spells. But notice the callout: 'I love sushi' and 'I hate sushi' are near-neighbors too — similarity is about topic, not truth.

When this works, it feels like magic — ask “what did we decide about deploys?” and the system surfaces a conversation from six weeks ago that shares zero keywords with your question. But the geometry that makes it work is also the source of its two classic production failures:

Similar is not relevant. Embeddings encode topic far more strongly than logical content. Sentences that differ only by a negation routinely embed with cosine similarity above 0.9 — research on “negation blindness” (arXiv:2504.00584) documents the effect across embedding model families. Your retriever cannot reliably tell we chose Postgres from we ruled out Postgres. It hands the model the right neighborhood and the wrong house.

Relevant is not current. Pure similarity search has no concept of time. The user’s old budget and new budget are both “about the budget”; both retrieve. The most-cited fix comes from Stanford’s Generative Agents (Park et al., 2023), which scored every memory as a weighted blend of relevance (embedding similarity), recency (exponential decay — an Ebbinghaus curve, implemented in code), and importance (an LLM-assigned weight). That three-factor score, in some form, is in nearly every serious memory product today — Zep goes further and models facts as a temporal knowledge graph with explicit validity intervals.

Episodic retrieval gives your agent a past it can search. What it doesn’t give you is distilled truth — for that you don’t want to replay the meeting, you want the decision. That’s a different memory system.

Semantic memory: remembering what’s true

You don’t remember learning that Paris is the capital of France. The episode is gone; the fact remains. That’s semantic memory, and its agent counterpart is the most product-shaped layer in the stack: extract durable facts from conversations, store them in structured form, load the relevant ones every turn. When ChatGPT’s memory says “User is vegetarian. User prefers TypeScript.” — this is the layer you’re looking at.

The pipeline is simple to draw and brutally hard to get right:

THE CONVERSATION “We're a Next.js shop.” “Budget is $40K.” … three weeks later … “Actually, the budget moved to $25K.” extraction extract & reconcile THE FACT STORE fact value noted frontend stack Next.js May 2 project budget $40K May 2 project budget $25K May 23 supersedes the row above The hard part isn't writing facts down — it's noticing when a new fact replaces an old one. Store both as equals, and the agent will confidently quote whichever one it happens to retrieve.
The semantic pipeline: an extraction pass (an LLM call) turns conversation into structured facts. The hard part is highlighted: the new budget must supersede the old one, not coexist with it.

Three design problems decide whether this layer helps or hurts:

What’s worth keeping? Extract too little and the agent stays a goldfish; too much and you’ve built a hoard of trivia that bloats every prompt. The bar I use in client work: would a competent human assistant write this down? “User’s company runs on Next.js” — yes. “User said thanks” — no.

The update problem. Facts change, and new facts don’t announce themselves as replacements. The budget moves from $40K to $25K; naïve systems store both, and the agent confidently quotes whichever it retrieves first. Every serious memory framework — Mem0, Zep, LangMem — is in large part an answer to this reconciliation problem: detect when a new fact supersedes, contradicts, or merely refines an old one, and resolve before write, with timestamps as tiebreakers.

Confidence. “I’m vegetarian” stated outright is not the same as probably vegetarian inferred from three salad orders. Production systems that skip confidence tracking end up asserting inferences as facts — which is how an agent ends up insisting you’re vegetarian because you once said you liked a salad. Psychology has unflattering names for the human versions of these failures; an agent that misattributes its own shaky inference as a user-stated truth is making a source-monitoring error in JSON.

Procedural memory: remembering how to behave

The quietest memory system is the one nobody calls memory: the system prompt, the tool definitions, the instruction files (CLAUDE.md, agent skills) — everything installed before the conversation starts that shapes how the agent behaves. This is procedural memory: knowledge of how, not what. You don’t recall how to type; you just type. Your agent doesn’t recall its code-review standards; it just has them, every session, indestructibly — because they’re re-sent verbatim at position zero of every single prompt.

Two properties make the procedural layer special. It’s the only layer eviction can’t touch (it’s pinned outside the history). And it’s the natural destination for learned behavior: when an agent gets corrected the same way twice, the durable fix is usually not another fact row — it’s a new standing instruction. The frontier of this layer is agents that edit their own instruction files: Anthropic’s memory tool gives Claude a file directory it reads and writes across sessions, and the write-up reports the headline numbers for the whole approach — on a 100-turn agentic eval, memory plus context editing improved task performance by 39% while cutting token consumption by 84%. H.M. improving at mirror-drawing without remembering practice was the proof that procedural learning survives when episodic memory is gone; an agent improving its own instructions is the same trick, in reverse.

The full architecture

Now assemble all of it. Here is the diagram I wish someone had handed me before my first production agent — every layer from this guide, in the two motions that define the system: the read path that builds the window before each turn, and the write path that pays the memory forward after it.

write back — extract facts · embed episodes · (rarely) update standing instructions READ PATH — before every turn Procedural store system prompt · skills · instruction files Semantic store facts · preferences · decisions, reconciled Episodic store full transcripts · embedding index THE CONTEXT WINDOW — rebuilt from scratch, every turn standing instructions — who to be, how to behave relevant facts — profile, decisions, constraints retrieved episodes — top-k similar past moments running summary of older turns (gist) recent turns, verbatim the new user message the model one forward pass — still stateless the assistant's turn WRITE PATH — after the turn, async memory writer append turn · extract facts · embed chunks · refresh summary reads run on the critical path and cost tokens on every single call — keep them lean writes cost an extra LLM call — run them in the background, never block the reply
The whole machine. Before every turn, the read path assembles a fresh context window from the three stores plus the session itself. After every turn, an asynchronous writer updates the stores. The model in the middle remains completely stateless — the 'memory' is the loop around it.

Read it with three observations:

The window is rebuilt from scratch, every turn. Nothing carries over inside the model. Continuity is an illusion the read path reconstructs sixty times a minute, the way your sense of a continuous visual field is something your brain assembles between saccades. If the read path misses, the agent doesn’t feel forgetful — it feels wrong, the way the deploy bot felt wrong.

Reads are synchronous; writes must not be. The read path sits on the critical path of every reply — its token cost and retrieval latency are paid by the user, every turn, so it has to stay lean. The write path (extraction, embedding, summarization — each an LLM call or close to it) belongs in the background, after the reply ships. Run your fact extraction inline and you’ve added seconds to every turn to record things no one may ever ask about.

Every arrow is a failure mode. Retrieval can surface the wrong episode; extraction can store a wrong fact; the summary can drift; the bus back into the stores can write poison. Which is the right segue to the section this whole guide has been building toward.

What breaks in production

Everything above is the brochure. Here’s the version after a quarter of real traffic — the four failure modes I’ve been bitten by or been called in to fix, in rough order of how expensive they were:

1. Memory poisoning. A wrong fact gets written once — a hallucinated “decision,” a user’s sarcastic aside extracted as a preference, a typo’d budget — and then it retrieves. The agent states it confidently; the user’s confused reply mentions it again; the writer re-extracts it with fresh evidence. The false memory now has citations. This loop (store → retrieve → restate → re-store) is reconsolidation’s evil twin, and it’s why memory needs an immune system more than it needs more capacity: provenance on every fact (who said it, when, verbatim or inferred), confidence that decays without re-confirmation, and a way for users to see and delete what’s stored about them. Deletion, by the way, is a product feature and in much of the world a legal requirement.

2. Unbounded growth. Every turn writes; nothing deletes; six months later retrieval is sifting through forty thousand memories — most stale, near-duplicates of each other, or trivia. Recall quality degrades slowly enough that no single week looks like an incident. The fix is the thing human memory does free of charge and engineers must schedule: consolidation. A background pass that merges duplicates, ages out the unused, and promotes repeated episodes into single semantic facts. Sleep, basically. Budget for it from day one, because retrofitting it onto a poisoned, bloated store is miserable.

3. Retrieval latency on the read path. Embedding the query, searching the index, possibly re-ranking — all before the model can start thinking. Do it naïvely and you’ve added a second of dead air to every turn, which in a voice agent is fatal and in a chat agent is brand damage. The standard production answer: retrieve against a small hot index, cache aggressively, and push anything heavy to the write side.

4. Trusting the window because it’s big. The subtle one. A team sees a million-token window and decides memory engineering is legacy complexity — just append forever. It demos beautifully. Then conversations cross a few hundred thousand tokens and the context-rot curve quietly takes over: instructions in the middle stop binding, costs balloon (remember the long-context surcharges), latency stretches, and the agent gets vaguer rather than failing loudly. Nothing pages you. Your evals — all written on 5K-token sessions — stay green. This failure mode is invisible precisely because it’s a degradation, not an error, which is what makes it the most dangerous item on this list.

Will long context kill memory engineering?

The standing objection to everything above: windows grew from 4K to a million in three years; surely the next order of magnitude makes this whole guide a historical curiosity?

I’d bet against, for three reasons that don’t depend on windows staying small. Attention doesn’t scale like storage — the lost-in-the-middle and context-rot results show usable context lagging far behind nominal context, and a bigger desk has never been an organizational system. Economics — you pay for every token in the window on every turn, forever; re-sending a megabyte of history to answer “what time is the standup?” fails cost review even when it passes the demo. And selection is the actual product. Deciding what matters, what supersedes what, what to carry forward — that’s not a workaround for small windows; that’s what remembering is. Psychology figured this out a century ago: a perfect, unfiltered record isn’t superhuman memory. The rare humans who approach total recall describe it as debilitating. Forgetting is the feature.

So the window will keep growing, and the architecture in this guide will keep mattering — the proportions shift, the loop stays. The model stays stateless; the memory stays yours to build.

That’s also the design philosophy behind Nexus, where the memory layer ships in v2 this month — and Chapter 4 of Atlas is the book-length, code-included version of this guide. If you build one of these systems and discover a failure mode I haven’t, I want to hear about it.

References and further reading

The agent side:

The human side:

Diagrams are hand-drawn SVG and free to reuse with attribution. If this guide helped, the subscribe box below is how you hear about the next one.

Get the next guide when it ships

Plus field notes from production AI, roughly twice a month.