A mid-sized software company ran a renewals assistant that handled inbound chats from customers whose subscriptions were lapsing. The bot's job was to identify the account, surface the renewal offer, answer objections, and either close the renewal or route to a human. On paper it worked. In practice, the team kept getting escalations from frustrated customers who said the bot "didn't listen."
This is a narrative case study of how that team diagnosed the problem, decided on a redesign, executed it over a few weeks, and measured the result. The arc is deliberately specific because the lessons live in the details — the moment they realized the transcript alone was not enough, and the exact change that turned things around.
The names and numbers here are illustrative of a common pattern rather than a single named client, but the sequence of decisions reflects how these projects actually unfold. What makes this account worth reading is not that the team succeeded — most teams eventually do — but the order in which they discovered things, because that order is where the transferable lessons live.
The Situation: A Bot That Could Not Keep Its Place
The original assistant was built the way most first versions are. Each turn, the prompt received the full chat transcript plus a system instruction describing the bot's goal. The model was expected to infer everything — which account, what had been offered, what the customer had already declined — from reading the history.
The symptoms
- Customers were asked again for an account email they had provided minutes earlier.
- The bot re-pitched a discount the customer had explicitly rejected.
- After a customer agreed to renew, the bot sometimes continued objection-handling as if nothing had been decided.
The cost
Escalation rates were climbing, and the renewals team estimated that roughly one in four bot conversations ended with a customer more annoyed than when they started. For a retention product, that is precisely the wrong direction.
The Decision: Stop Inferring State, Start Tracking It
The team's lead engineer made the call that defined the project: the model would no longer be responsible for reconstructing state from the transcript. Instead, the application would maintain an explicit state object and inject it into every prompt.
What they chose to track
- account_identified: boolean, plus the resolved account record
- offers_presented: list of offers shown
- offers_declined: list of offers the customer rejected with reasons
- renewal_status: open, agreed, or declined
- escalation_requested: boolean
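One way to picture the tracked fields is as a small typed record owned by the application. The sketch below is an illustration only: the class names RenewalState and DeclinedOffer, the account_record field, and the sample values are assumptions, not the team's actual code.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class DeclinedOffer:
    offer_id: str  # hypothetical identifier for the offer
    reason: str    # the customer's stated reason for rejecting it


@dataclass
class RenewalState:
    account_identified: bool = False
    account_record: Optional[dict] = None  # resolved account, once known
    offers_presented: list[str] = field(default_factory=list)
    offers_declined: list[DeclinedOffer] = field(default_factory=list)
    renewal_status: Literal["open", "agreed", "declined"] = "open"
    escalation_requested: bool = False


# Example: the account is known and one offer has been shown and rejected.
state = RenewalState()
state.account_identified = True
state.account_record = {"email": "pat@example.com", "plan": "team"}
state.offers_presented.append("annual_10_off")
state.offers_declined.append(DeclinedOffer("annual_10_off", "too expensive"))
```

Keeping the declined offer and its reason together means a later prompt can explain why an offer is off-limits, not just that it is.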
This decision aligns with the principle laid out in A Reusable Model for Tracking Dialogue State in Prompts: the model consumes state, it does not own it.
The Execution: Three Iterations Over Three Weeks
Iteration one — the state block
The first change was purely additive. They prepended a labeled state block to the existing prompt without removing the transcript. Immediately, the re-asking-for-email problem dropped sharply, because account_identified told the model the account was already known.
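A minimal sketch of what "prepending a labeled state block" could look like follows. The delimiter strings and the function names render_state_block and build_prompt are assumptions made for illustration; the source does not specify the block's exact format.

```python
def render_state_block(state: dict) -> str:
    """Format the state as a labeled block the model can read directly."""
    lines = ["[CONVERSATION STATE]"]
    for key, value in state.items():
        lines.append(f"{key}: {value}")
    lines.append("[END CONVERSATION STATE]")
    return "\n".join(lines)


def build_prompt(system_instruction: str, state: dict, transcript: list[str]) -> str:
    """Iteration one is additive: the state block is placed ahead of the
    full transcript, and nothing is removed from the existing prompt."""
    return "\n\n".join(
        [system_instruction, render_state_block(state), "\n".join(transcript)]
    )
```

Because the change only adds a section, it can be shipped and rolled back without touching the rest of the prompt.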
Iteration two — constraints anchored to state
Next they added negative constraints: "Never present an offer that appears in offers_declined." This killed the re-pitching behavior. They drew the specific constraint patterns from Concrete Scenarios That Reveal Whether Your Dialogue State Holds.
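Constraints like this can be generated from the state rather than hard-coded, so they appear only when they apply. In the sketch below, the function name and the second and third rules are assumptions extrapolated from the symptoms described earlier; only the declined-offers rule is quoted in this account.

```python
def constraints_from_state(state: dict) -> list[str]:
    """Derive negative constraints from specific state fields."""
    rules = []
    declined = state.get("offers_declined", [])
    if declined:
        # The rule named in this account: declined offers are off-limits.
        rules.append(
            "Never present these declined offers again: " + ", ".join(declined) + "."
        )
    if state.get("account_identified"):
        # Assumed rule, addressing the re-asking symptom.
        rules.append("Do not ask for the account email; the account is already identified.")
    if state.get("renewal_status") == "agreed":
        # Assumed rule, addressing objection-handling after agreement.
        rules.append("The customer has agreed to renew; do not handle objections or pitch offers.")
    return rules
```

Each rule names the condition that triggered it, which makes a misbehaving conversation easier to trace back to a specific state field.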
Iteration three — trimming the transcript
With reliable structured state, they discovered they could trim the raw transcript to the last few turns. The state block carried the durable facts; the recent transcript carried only conversational tone. This cut token cost meaningfully and, counterintuitively, improved accuracy because the model had less noise to wade through.
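The trimming step itself is small. The sketch below assumes the transcript is a list of turn strings and picks a window of four turns as a stand-in for "the last few"; both the function name and the default are illustrative, not values reported by the team.

```python
def trim_transcript(turns: list[str], keep_last: int = 4) -> list[str]:
    """Keep only the most recent turns; durable facts live in the state block."""
    if keep_last <= 0:
        # Guard: turns[-0:] would return the whole list, not an empty one.
        return []
    return turns[-keep_last:]
```

The window size is worth tuning against real conversations, since too small a window can drop the question the customer is currently waiting on.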
The Outcome: What Actually Moved
The team instrumented the rollout as an A/B test, sending half of eligible conversations to the new design.
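A common way to split traffic for a test like this is to hash a stable identifier, so a conversation always lands in the same variant. The sketch below is an assumption about how such a split could work, not a description of the team's tooling; the variant names and the conversation_id parameter are hypothetical.

```python
import hashlib


def assign_variant(conversation_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a conversation to a variant."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "state_tracking" if bucket < treatment_share else "baseline"
```

Hashing avoids storing an assignment table and keeps a reconnecting customer in the same variant.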
Measured results
- Re-asking for already-provided information fell to near zero in the new variant.
- Escalations driven by "the bot didn't listen" complaints dropped by more than half.
- Token cost per conversation decreased after the transcript trimming, despite the added state block.
- Successful self-serve renewals rose, because conversations stayed coherent long enough to close.
The metrics they chose to watch came straight from Reading the Signal: Metrics for Dialogue State in Prompts, which gave them a vocabulary for what "better" meant.
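Comparing variants on these metrics comes down to per-variant rates. The sketch below assumes each logged conversation is a dict with a variant label and three boolean flags; those field names and the function name are hypothetical, chosen only to mirror the results listed above.

```python
from collections import defaultdict


def summarize_by_variant(conversations: list[dict]) -> dict[str, dict[str, float]]:
    """Compute re-ask, escalation, and renewal rates for each variant."""
    counts = defaultdict(
        lambda: {"total": 0, "re_asked": 0, "escalated": 0, "renewed": 0}
    )
    for convo in conversations:
        bucket = counts[convo["variant"]]
        bucket["total"] += 1
        for flag in ("re_asked", "escalated", "renewed"):
            bucket[flag] += 1 if convo[flag] else 0
    return {
        variant: {
            "re_ask_rate": c["re_asked"] / c["total"],
            "escalation_rate": c["escalated"] / c["total"],
            "renewal_rate": c["renewed"] / c["total"],
        }
        for variant, c in counts.items()
    }
```

Running the same summary on logs collected before any prompt change gives the baseline that a before-and-after comparison needs.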
The Lessons That Generalized
Inference is not a feature
The most important lesson was philosophical. Asking a model to re-derive state every turn is not clever prompting; it is a liability. State the team already knows should be handed to the model, not rediscovered by it.
Constraints are where state pays off
The state object was useful for telling the model what was true, but it was transformative when used to tell the model what was off-limits. The biggest behavior fixes all came from negative constraints anchored to state.
Less transcript, more structure
Trimming history once structured state existed was the surprise win. It lowered cost and raised quality at the same time, a rare combination. Teams evaluating the broader payoff will find that reasoning developed further in Putting Numbers Behind Dialogue State Management in Prompts.
What the Team Would Do Differently
Hindsight produced a short list of things the team wished they had done from the start, and these are arguably more valuable than the wins because they save the next team the detour.
Instrument before changing anything
The team built the new design and then scrambled to add measurement so they could prove it worked. Reversing that order would have been better. Had they captured re-ask rate and escalation reasons before touching the prompt, the before-and-after story would have been cleaner and the internal sell easier.
Start with the constraint, not the state block
Their first iteration added the state block, which helped, but the largest single improvement came from the negative constraint in iteration two. In retrospect, they could have led with the constraint against re-presenting declined offers, because that was the behavior generating the angriest escalations. Fixing the most painful symptom first builds organizational confidence faster.
Treat the transcript as a liability sooner
The team kept the full transcript far longer than necessary out of caution, fearing that trimming it would lose context. When they finally trimmed it, both cost and accuracy improved. The lesson: once structured state reliably carries the durable facts, a long transcript is more noise than safety net.
How This Maps to Other Domains
The renewals assistant is specific, but the arc is not. A checkout flow, a technical support bot, and an onboarding assistant all hit the same wall — the model cannot reliably reconstruct state from history — and all resolve it the same way.
The portable sequence
- Diagnose the re-asking and contradicting as state failures, not model failures.
- Inject an explicit, labeled state object built from authoritative data.
- Constrain behavior with negative rules anchored to specific state fields.
- Trim the now-redundant transcript to recover cost and accuracy.
- Measure the change against re-ask and escalation rates to prove it.
Any team facing "the bot doesn't listen" complaints can run this same sequence. The domains differ; the playbook does not.
Frequently Asked Questions
Why not just give the model the full transcript every time?
It works until conversations get long; then the model starts missing or misweighting facts buried in the history. Structured state surfaces the facts that matter, which is both cheaper and more reliable.
How long did the redesign take?
In this account, roughly three weeks across three iterations. The first two iterations fixed the most visible behaviors, with the second delivering the largest single improvement; the third reduced token cost.
What was the single highest-impact change?
Adding negative constraints anchored to state — specifically, never re-presenting a declined offer. That fixed the most damaging behavior the bot exhibited.
Did the state block increase token costs?
Initially yes, but trimming the now-redundant transcript more than offset it, producing a net reduction in tokens per conversation.
How did they prevent state from drifting out of sync with reality?
The application updated state from authoritative events — payment systems, CRM records — rather than from the model's outputs. The model never wrote canonical state.
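One way to enforce that rule is to route every state change through a single function that checks where the event came from. The sketch below is illustrative: the source labels, event types, and field names are assumptions, not the team's schema.

```python
AUTHORITATIVE_SOURCES = {"crm", "payments"}  # hypothetical source labels


def apply_event(state: dict, event: dict) -> dict:
    """Return a new state updated by one event.

    Events from any non-authoritative source, including the model's own
    output, are ignored, so the model never writes canonical state.
    """
    if event.get("source") not in AUTHORITATIVE_SOURCES:
        return state
    updated = dict(state)  # copy so the caller's state is not mutated
    if event.get("type") == "account_resolved":
        updated["account_identified"] = True
        updated["account_record"] = event["account"]
    elif event.get("type") == "renewal_paid":
        updated["renewal_status"] = "agreed"
    return updated
```

Returning a new dict rather than mutating in place makes it straightforward to log each transition and replay a conversation's state history.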
Could a smaller team replicate this?
Yes. The core change is conceptual, not infrastructural: stop asking the model to infer state, and start injecting it. A small team can plausibly implement a basic state block in a day.
Key Takeaways
- The root problem was asking the model to reconstruct state from the transcript every turn.
- Injecting an explicit, labeled state object cut re-asking for already-provided information to near zero.
- Negative constraints anchored to state produced the largest behavioral improvements.
- Trimming the transcript after adding structured state lowered cost and raised accuracy together.
- Canonical state must come from authoritative systems, never from the model's own output.
- Measuring with the right metrics let the team prove the redesign worked rather than assume it.