This is the story of a customer support team that deployed an AI assistant, watched it confidently mislead customers, and pulled the fabrication rate down through prompt changes alone. The names and product are generalized, but the arc—situation, decision, execution, outcome—reflects a common path that many teams walk.
What makes the story useful is not that they succeeded. It is the sequence of decisions: what they tried first, what failed, and what finally worked. Each step illustrates a principle in a setting where the stakes were real and the feedback was immediate. Read it as a worked example of the techniques rather than a triumphant before-and-after.
The concepts referenced here are laid out in full in Stop Your Model From Inventing Facts at the Prompt Layer.
The Situation
The team had launched an assistant to handle first-line support questions for a software product. It deflected a meaningful share of tickets, which the team celebrated—until the complaints arrived.
The complaints
Customers reported being told about settings that did not exist, given steps that led nowhere, and quoted policies that contradicted the actual terms. The assistant was wrong often enough to erode trust, and each wrong answer generated a follow-up ticket.
The hidden cost
Deflection numbers looked good, but the assistant was manufacturing new work and damaging the brand. A confident wrong answer was worse than no answer, because customers acted on it.
The Decision
The team's first instinct was to wait for a newer model, assuming the problem was the model's intelligence. They reconsidered after a closer look.
Diagnosing the real cause
They examined the prompt and found it asked the model to answer support questions with no product documentation supplied and no permission to abstain. The model was reconstructing answers from memory and never allowed to say it did not know.
Choosing the prompt path
Recognizing that the failure lived in the prompt, not the model, they decided to rebuild the prompt before changing anything more expensive. The bet was that grounding and abstention would move the rate further than a model upgrade.
The Execution
They rebuilt the prompt in stages, measuring after each change rather than shipping everything at once.
Stage one: grounding
They connected the assistant to the help-center articles and instructed it to answer only from the retrieved text. Invented features dropped immediately, but gaps remained where articles were thin.
Stage two: abstention
They added an explicit clause: if the articles did not cover the question, the assistant should say so and offer to connect the customer to a human. This caught most of the remaining fabrication, converting guesses into honest hand-offs. The staged approach mirrors Build a Fabrication-Resistant Prompt in Eight Moves.
Stage three: evidence and verification
For sensitive topics like billing and policy, they required the assistant to quote the supporting article and added a second pass that confirmed the answer matched the source before sending it. This handled the subtler errors that grounding alone missed.
The Outcome
The team did not reach perfection, and they did not claim to. What they achieved was a controlled, measurable improvement.
What improved
Confident wrong answers on covered topics became rare. The assistant now abstained and handed off when it lacked a grounded answer, which customers tolerated far better than being misled. Follow-up tickets from bad answers fell substantially.
What it cost
Abstention meant the assistant deflected fewer tickets outright, since it now declined questions it would previously have answered—wrongly. The team accepted this trade, judging an honest hand-off more valuable than a confident error.
The new failure mode
Early on, the assistant over-abstained, declining questions the articles actually covered. Tuning the abstention clause and improving retrieval brought that back into balance, a calibration problem rather than a fabrication one.
The Lessons
A few takeaways generalized beyond this team's situation.
The prompt was the lever, not the model
The biggest gains came from grounding and abstention, both prompt-level changes, achieved without any model upgrade. The instinct to blame the model would have delayed the fix and cost more.
Measurement made it real
Staging changes and measuring after each one let the team attribute improvement to specific moves and catch the over-abstention regression early. For the mistakes they narrowly avoided, see 7 Prompting Habits That Make AI Fabricate More, Not Less, and to lock the result into a repeatable routine, see The Pre-Ship Checklist for Keeping AI Answers Grounded.
Honesty beat coverage
The final system answered fewer questions but misled almost none. The team learned that a lower deflection rate with trustworthy answers outperformed a higher one built on confident fabrication.
How the Team Measured Success
The improvement was only believable because the team measured it, and the way they measured shaped what they built.
Building the test set from real tickets
Rather than invent test questions, the team pulled real customer questions from past tickets, including many the help center did not cover. Those uncovered questions became the heart of the test set, because they were exactly the cases that triggered fabrication. Each prompt change was scored against the same set, so improvement was attributable rather than anecdotal.
Scoring grounding, not just correctness
The team scored each answer on whether it was actually supported by the cited article, not merely whether it happened to be right. An answer that was correct but ungrounded was treated as a near miss, because it would fail on the next similar question. This focus on grounding caught fragile answers that a correctness-only score would have passed.
Watching the two failure modes together
They tracked fabrication and unnecessary abstention side by side, on one dashboard. Watching both at once was what let them catch the over-abstention regression early, before customers noticed the assistant had become unhelpfully cautious. Optimizing one number in isolation would have hidden the cost showing up in the other.
Frequently Asked Questions
Why did the team consider a model upgrade first?
Because fabrication is widely assumed to be a measure of model intelligence, so a smarter model feels like the natural fix. The closer look revealed the prompt supplied no source material and no abstention option, meaning even a better model would have kept guessing from memory. The cause was structural, not capability.
Did grounding alone solve the problem?
Mostly for covered topics, but not entirely. Grounding stopped invented features where articles existed, but gaps in the documentation still produced guesses until the abstention clause was added. The two changes together, not grounding alone, brought fabrication under control.
Why was lower deflection an acceptable trade?
Because a confident wrong answer generated a follow-up ticket and eroded trust, making it more costly than an honest hand-off. Deflecting fewer tickets while misleading almost none netted out better for both customer satisfaction and total support load.
How did the team handle over-abstention?
They treated it as a separate, opposite problem: the assistant declining questions the articles actually answered. Tuning the abstention clause and improving which articles were retrieved brought it back into balance. The goal became calibration, not maximum caution.
What would they do differently next time?
Start with grounding and abstention from day one rather than launching on a memory-only prompt, and build the measurement set—including unanswerable questions—before going live. Most of the pain came from diagnosing in production what a small upfront test set would have revealed.
Key Takeaways
- The team's confident wrong answers traced to a prompt with no supplied source material and no permission to abstain, not to model limitations.
- Grounding the assistant in help-center articles cut invented features, and an abstention clause converted remaining guesses into honest hand-offs.
- Evidence requirements and a verification pass handled the subtler errors on sensitive topics like billing and policy.
- The trade was lower deflection for far more trustworthy answers, which the team judged a clear net win.
- Staging changes with measurement after each one made improvement attributable and caught an over-abstention regression early.