How One Team Rebuilt a Bot That Couldn't Remember

This is the story of a product team that shipped an AI onboarding assistant, watched it fail in a specific and frustrating way, diagnosed the root cause as a misunderstanding of statelessness, and rebuilt it into something reliable. The details are composited from common patterns we see, but every decision and trade-off described here is one that real teams face.

The value of a case study is that it follows the actual arc of a problem: the situation, the moment the team realized what was wrong, the decisions they made, how they executed, and what the numbers said afterward. Abstract principles become concrete when you watch them collide with a real deadline and a real user complaint.

If you have ever shipped an AI feature that demoed perfectly and then disappointed users, this arc will feel familiar. Here is how one team worked through it.

The situation: an assistant that lost the plot

The team built an AI assistant to guide new users through setting up an account: collecting their goals, configuring preferences, and walking through a multi-step setup. In testing with short scripts, it was delightful. It asked smart questions and responded naturally.

Then real users arrived, and the complaints started. Partway through onboarding, the assistant would ask for information the user had already provided. It would forget the user's stated goal and recommend the wrong setup path. Some users gave up entirely.

The symptom pattern

Failures clustered in longer onboarding sessions, not short ones.
The assistant always lost early information, never recent information.
Restarting the flow temporarily fixed it, then the problem recurred.

That pattern, early facts lost in long sessions, is a fingerprint. The team just had not learned to read it yet.

The realization: the model never remembered anything

The breakthrough came when an engineer ran a simple experiment: send the model a fact, then in a separate call ask it to recall the fact. It could not. The team had been building on an implicit assumption that the model carried state between calls. It did not.

Their implementation passed recent conversation history with each request, which worked fine until the history grew past the context window. At that point their code silently dropped the oldest messages, the exact ones containing the user's goal and early answers. The model was not forgetting; it was never being shown that text anymore.

This reframed the entire problem. As our beginner's guide puts it, the model is stateless, and all memory is the application's job. The team had been blaming the model for their own context management.

The decision: separate durable facts from conversation

With the cause clear, the team faced a design decision. The naive fix, just send more history, would only delay the failure and inflate cost. They chose a more durable approach: extract the facts that mattered and store them separately from the raw conversation.

They drew a deliberate line. Transient back-and-forth could live in the conversation history and expire. But durable facts, the user's stated goal, their preferences, their completed steps, would be extracted into a structured profile that was always available, independent of conversation length.

What they decided to pin

The user's primary goal, captured once and retained throughout.
Configuration choices already made, so they were never re-asked.
Progress through the setup steps, so the flow never restarted itself.

This mirrors the short-term versus long-term separation in our framework.

The execution: building the memory layer

Implementation proceeded in deliberate steps. First, they added token counting so they knew exactly when a session approached the context limit, instead of discovering it through user complaints.

Next, they built fact extraction: after each turn, a lightweight pass identified any durable facts and wrote them to the user's profile. That profile was then injected into every prompt as pinned content, never eligible for trimming. Finally, they introduced rolling summarization for the conversational middle, compressing older turns while preserving the verbatim recent ones.

Guardrails they added

Summarization instructions explicitly preserved names, choices, and numbers.
Pinned profile facts were capped in size so they never crowded the budget.
Retrieved and injected context was logged per request for debugging.

The team followed essentially the sequence laid out in our step-by-step approach, which kept the rebuild orderly.

The outcome: measured, not just felt

The team did not declare victory on vibes. They tracked specific metrics before and after, comparing the same onboarding flow under both implementations.

The "re-asked for known information" rate, the core complaint, dropped sharply once durable facts were pinned. Onboarding completion improved meaningfully, because users stopped abandoning a flow that kept restarting. Average cost per session actually went down despite the added machinery, because they stopped re-sending unbounded history and summarized instead.

The numbers that mattered

Repeated questions for already-provided information fell to near zero.
Onboarding completion rose as the restart-loop frustration disappeared.
Per-session token cost decreased thanks to summarization replacing unbounded history.

The lesson the team took away was not "AI memory is hard." It was "memory is our responsibility, and once we owned it deliberately, the problems became ordinary engineering."

What the team would do differently next time

In the retrospective, the team was candid that the whole episode was avoidable. The failure was not exotic; it was the most common memory mistake there is, dressed up as a forgetful model. With hindsight, three changes stood out.

First, they would have run the recall experiment on day one. A single test, send a fact then ask the model to recall it in a separate call, would have established the correct mental model before a line of memory code was written. They wasted weeks operating on a false assumption that ninety seconds of testing would have corrected.

Second, they would have instrumented context from the start. For most of the project they had no visibility into what was actually being sent to the model. Adding token counting and per-request context logging early would have surfaced the silent truncation long before users did.

The cultural shift that mattered most

The team stopped saying "the model forgot" and started saying "we did not send it."
Memory design became a first-class part of the spec, not an implementation detail discovered in production.
"What horizon does this fact belong to?" became a routine design question for every new feature.

That last shift is why the next feature they built shipped without a single memory regression. They had internalized that the model is a stateless tool, and that owning its memory is simply part of the job.

Frequently Asked Questions

Why did the assistant only fail in longer sessions?

Short sessions never exceeded the context window, so no messages were trimmed and nothing was lost. Longer sessions crossed the limit, and the team's code silently dropped the oldest messages, which happened to contain the user's early answers. The failure was a direct function of conversation length.

Couldn't they have just used a model with a bigger context window?

A larger window would have postponed the failure, not eliminated it, since any session can eventually grow past any window. Storing durable facts in a structured profile solved the problem permanently and also reduced cost, which a bigger window would not have done.

Why did costs go down after adding more machinery?

Previously they re-sent the entire growing conversation every turn, paying for all of it repeatedly. After the redesign, older turns were compressed into short summaries and durable facts lived in a compact profile, so each request carried far fewer tokens despite remembering more.

What was the single most important change?

Separating durable facts from the disposable conversation. By extracting the user's goal, choices, and progress into a pinned profile, the team guaranteed those facts survived regardless of how long the conversation ran, which directly fixed the core complaint.

Key Takeaways

The assistant's failure was a classic context-overflow problem masquerading as a forgetful model.
A simple experiment, send a fact then ask the model to recall it, exposed that the model was stateless all along.
The durable fix was extracting session-spanning facts into a pinned profile, separate from the conversation.
Rolling summarization preserved continuity in the conversational middle without inflating the prompt.
Measured outcomes improved across the board, including lower cost, because owning memory deliberately beat re-sending unbounded history.

If you have ever shipped an AI feature that demoed perfectly and then disappointed users, this arc will feel familiar. Here is how one team worked through it.

The situation: an assistant that lost the plot

The symptom pattern

Failures clustered in longer onboarding sessions, not short ones.
The assistant always lost early information, never recent information.
Restarting the flow temporarily fixed it, then the problem recurred.

That pattern, early facts lost in long sessions, is a fingerprint. The team just had not learned to read it yet.

The realization: the model never remembered anything

The decision: separate durable facts from conversation

What they decided to pin

The user's primary goal, captured once and retained throughout.
Configuration choices already made, so they were never re-asked.
Progress through the setup steps, so the flow never restarted itself.

This mirrors the short-term versus long-term separation in our framework.

The execution: building the memory layer

Implementation proceeded in deliberate steps. First, they added token counting so they knew exactly when a session approached the context limit, instead of discovering it through user complaints.

Guardrails they added

Summarization instructions explicitly preserved names, choices, and numbers.
Pinned profile facts were capped in size so they never crowded the budget.
Retrieved and injected context was logged per request for debugging.

The team followed essentially the sequence laid out in our step-by-step approach, which kept the rebuild orderly.

The outcome: measured, not just felt

The team did not declare victory on vibes. They tracked specific metrics before and after, comparing the same onboarding flow under both implementations.

The numbers that mattered

Repeated questions for already-provided information fell to near zero.
Onboarding completion rose as the restart-loop frustration disappeared.
Per-session token cost decreased thanks to summarization replacing unbounded history.

The lesson the team took away was not "AI memory is hard." It was "memory is our responsibility, and once we owned it deliberately, the problems became ordinary engineering."

What the team would do differently next time

The cultural shift that mattered most

The team stopped saying "the model forgot" and started saying "we did not send it."
Memory design became a first-class part of the spec, not an implementation detail discovered in production.
"What horizon does this fact belong to?" became a routine design question for every new feature.

Frequently Asked Questions

Why did the assistant only fail in longer sessions?

Couldn't they have just used a model with a bigger context window?

Why did costs go down after adding more machinery?

What was the single most important change?

Key Takeaways

The assistant's failure was a classic context-overflow problem masquerading as a forgetful model.
A simple experiment, send a fact then ask the model to recall it, exposed that the model was stateless all along.
The durable fix was extracting session-spanning facts into a pinned profile, separate from the conversation.
Rolling summarization preserved continuity in the conversational middle without inflating the prompt.
Measured outcomes improved across the board, including lower cost, because owning memory deliberately beat re-sending unbounded history.

How One Team Rebuilt a Bot That Couldn't Remember

The situation: an assistant that lost the plot

The symptom pattern

The realization: the model never remembered anything

The decision: separate durable facts from conversation

What they decided to pin

The execution: building the memory layer

Guardrails they added

The outcome: measured, not just felt

The numbers that mattered

What the team would do differently next time

The cultural shift that mattered most

Frequently Asked Questions

Why did the assistant only fail in longer sessions?

Couldn't they have just used a model with a bigger context window?

Why did costs go down after adding more machinery?

What was the single most important change?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

How One Team Rebuilt a Bot That Couldn't Remember

The situation: an assistant that lost the plot

The symptom pattern

The realization: the model never remembered anything

The decision: separate durable facts from conversation

What they decided to pin

The execution: building the memory layer

Guardrails they added

The outcome: measured, not just felt

The numbers that mattered

What the team would do differently next time

The cultural shift that mattered most

Frequently Asked Questions

Why did the assistant only fail in longer sessions?

Couldn't they have just used a model with a bigger context window?

Why did costs go down after adding more machinery?

What was the single most important change?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?