When teams discuss AI memory, the conversation tends to dissolve into a list of disconnected tactics: pass the history, summarize, use a vector store. These tactics are fine, but without a structure to organize them, you cannot reason about which to apply when, or how they fit together. You end up bolting on memory features reactively, one bug at a time.
This article offers a structure: the Three-Horizon Model. It organizes everything an AI feature needs to remember into three distinct memory horizons, each with its own purpose, lifespan, and storage mechanism. The horizons are not arbitrary; they fall out naturally from the fact that the model is stateless and that information has different lifespans.
Once you have the framework, the tactical questions answer themselves. You stop asking "should I summarize?" and start asking "which horizon does this information belong to?" That shift is what turns memory from a pile of hacks into a design.
The core premise: the model remembers nothing
The framework rests on one fact. The model is stateless: it retains nothing between requests, and everything it appears to know comes from the text in the current request. All memory, therefore, is something your application constructs and feeds in.
Given that, the design question is not whether to build memory but how to organize the information you feed in. Different information has different lifespans and different relevance, and treating it all uniformly is the root of most memory problems. The Three-Horizon Model exists to sort information by lifespan and route each piece to the right mechanism.
Why three horizons
- Some information matters only for the current exchange.
- Some matters for the duration of a session.
- Some must persist indefinitely, across sessions.
These three lifespans map cleanly to three storage strategies, which is why three horizons, no more and no fewer, capture the design.
Horizon 1: working memory
Working memory is the immediate context for the current turn: the user's latest message, the system prompt, and any facts retrieved specifically for this exchange. Its lifespan is a single request.
This is the highest-fidelity, most expensive horizon, because it occupies the model's active attention. Everything here competes for the same token budget and directly shapes the answer. The discipline at this horizon is curation: include exactly what the current turn needs and nothing more.
What lives in working memory
- The current user message and the immediate instructions.
- The system prompt, re-sent every request because nothing persists.
- Facts retrieved as relevant to this specific question.
The guiding rule: working memory should be lean. Padding it with marginally relevant context degrades the answer, a point our best practices guide argues at length.
Horizon 2: session memory
Session memory is the continuity of a single conversation: the back-and-forth that gives the exchange coherence. Its lifespan is the session, and it lives in the conversation history you re-send each turn.
The defining constraint here is the context window. Session memory naturally grows until it threatens the budget, at which point you must compress it. The horizon's discipline is graceful compression: summarize older turns while keeping recent ones verbatim, preserving the facts that matter.
Managing the session horizon
- Keep recent turns raw for fidelity.
- Summarize older turns when you approach the token budget.
- Protect names, decisions, and numbers through compression.
Session memory is where most overflow failures occur, because teams forget it has a hard ceiling. Mishandling this horizon produces the classic forgot-my-instructions failures.
Horizon 3: durable memory
Durable memory is everything that must survive beyond a single session: user preferences, profiles, key decisions, reference knowledge. Its lifespan is indefinite, and it lives in external storage, typically a database or vector store, retrieved into working memory when relevant.
This horizon is what makes an assistant feel like it knows you across days and weeks. It is also where governance lives, because durable storage means durable responsibility for the data you keep. The discipline here is selective promotion and selective retrieval: store only what deserves permanence, and retrieve only what a given question needs.
The durable memory loop
- After an exchange, decide what facts deserve to persist.
- Write those discrete facts to external storage.
- On future requests, retrieve the relevant facts into working memory.
- Keep retrieval lean so it does not overwhelm the budget.
Applying the framework: routing by lifespan
The framework's payoff is a simple decision procedure. For any piece of information, ask how long it must matter, and route it accordingly.
If it matters only for this turn, it belongs in working memory and should be assembled fresh. If it provides continuity within the conversation, it belongs in session memory, subject to compression as the conversation grows. If it must outlive the session, it belongs in durable memory, stored externally and retrieved on demand.
Worked routing examples
- "Summarize the document I just pasted" → working memory; the document is needed only now.
- "Earlier I said I prefer concise answers" → session memory; relevant for this conversation.
- "I'm allergic to shellfish, remember that" → durable memory; must persist across sessions.
This routing logic is exactly what the step-by-step guide implements in sequence, and what the pre-ship checklist verifies before launch. The horizons give those tactics a coherent home.
When to use each horizon
Not every feature needs all three. The framework also tells you how much to build.
A one-shot tool, summarize this, classify that, needs only working memory. A multi-turn chat needs working and session memory. An assistant that remembers users across visits needs all three. Match the horizons you implement to the lifespans your feature actually has, and you avoid both under-building and over-engineering.
How the horizons interact in a single request
It is tempting to picture the three horizons as separate systems, but in practice they converge on every request, because the model only ever sees one assembled block of text. Understanding that convergence is what makes the framework operational rather than merely conceptual.
For any given turn, your application assembles the prompt by drawing from all three horizons at once. Durable memory contributes the relevant stored facts. Session memory contributes the recent conversation and any summary of older turns. Working memory contributes the current message and the system prompt. These streams merge into a single request, and they all compete for the same token budget.
That competition is the crux. When the budget tightens, you are forced to trade across horizons: include one more retrieved fact, or one more verbatim turn of conversation? The framework helps here by clarifying what each horizon is for, so the trade is principled rather than arbitrary.
Resolving budget conflicts across horizons
- Pinned durable facts win first, because losing a stated allergy is worse than losing a paragraph of chat.
- Recent session turns beat old ones, since recency carries the most relevant nuance.
- Retrieved facts are sized to relevance, not stuffed in to fill space.
Seen this way, the framework is not three boxes; it is a priority ordering for assembling one prompt under a fixed budget. That ordering is what you reach for in the moment when everything will not fit, which, in any nontrivial feature, is most of the time.
Frequently Asked Questions
How is this different from just "context window plus a database"?
Those are mechanisms; the framework is about lifespans. The Three-Horizon Model tells you which mechanism a given piece of information belongs to based on how long it must matter, which prevents the common error of putting session-spanning facts in conversation history or transient details in permanent storage.
Which horizon causes the most production bugs?
Session memory, because teams forget it has a hard ceiling at the context window. Information grows until it overflows, and silent truncation drops early facts. Managing the session horizon with summarization rather than truncation prevents the most common class of failures.
Can information move between horizons?
Yes, and it should. A fact stated in session memory may deserve promotion to durable memory if it must persist, like a stated allergy or preference. Defining clear promotion rules, what graduates from session to durable storage, is a core part of applying the framework well.
Do small features really need the framework?
Even small features benefit from asking the routing question, because it tells you what not to build. A one-shot tool needs only working memory, and recognizing that saves you from over-engineering. The framework guides both what to build and what to skip.
Key Takeaways
- The Three-Horizon Model organizes AI memory by lifespan: working, session, and durable.
- Working memory is the lean, high-fidelity context for the current turn and demands ruthless curation.
- Session memory provides conversational continuity but has a hard ceiling at the context window, so compress it.
- Durable memory persists across sessions in external storage and carries data-governance responsibility.
- Route every piece of information by how long it must matter, and implement only the horizons your feature needs.