How One Support Team Cut Invented Answers by Prompting

Agency Script Editorial

Editorial Team

December 19, 2023 · 8 min read

This is the story of a customer support team that deployed an AI assistant, watched it confidently mislead customers, and pulled the fabrication rate down through prompt changes alone. The names and product are generalized, but the arc (situation, decision, execution, outcome) reflects a path many teams walk.

What makes the story useful is not that they succeeded. It is the sequence of decisions: what they tried first, what failed, and what finally worked. Each step illustrates a principle in a setting where the stakes were real and the feedback was immediate. Read it as a worked example of the techniques rather than a triumphant before-and-after.

The concepts referenced here are laid out in full in Stop Your Model From Inventing Facts at the Prompt Layer.

The Situation

The team had launched an assistant to handle first-line support questions for a software product. It deflected a meaningful share of tickets, which the team celebrated until the complaints arrived.

The complaints

Customers reported being told about settings that did not exist, given steps that led nowhere, and quoted policies that contradicted the actual terms. The assistant was wrong often enough to erode trust, and each wrong answer generated a follow-up ticket.

The hidden cost

Deflection numbers looked good, but the assistant was manufacturing new work and damaging the brand. A confident wrong answer was worse than no answer, because customers acted on it.

The Decision

The team's first instinct was to wait for a newer model, assuming the problem was the model's intelligence. They reconsidered after a closer look.

Diagnosing the real cause

They examined the prompt and found it asked the model to answer support questions with no product documentation supplied and no permission to abstain. The model was reconstructing answers from memory and was never allowed to say it did not know.

Choosing the prompt path

Recognizing that the failure lived in the prompt, not the model, they decided to rebuild the prompt before changing anything more expensive. The bet was that grounding and abstention would move the rate further than a model upgrade.

The Execution

They rebuilt the prompt in stages, measuring after each change rather than shipping everything at once.

Stage one: grounding

They connected the assistant to the help-center articles and instructed it to answer only from the retrieved text. Invented features dropped immediately, but gaps remained where articles were thin.
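The article does not reproduce the team's prompt, so the following is only a minimal sketch of what a grounding step like this might look like. The `Article` type, the `build_grounded_prompt` function, and the instruction wording are all illustrative assumptions, and retrieval itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """A retrieved help-center article (hypothetical shape)."""
    title: str
    body: str


def build_grounded_prompt(question: str, articles: list[Article]) -> str:
    """Assemble a prompt that restricts the model to the retrieved articles."""
    # Number each article so the model can refer to its sources.
    sources = "\n\n".join(
        f"[{i}] {a.title}\n{a.body}" for i, a in enumerate(articles, start=1)
    )
    return (
        "You are a support assistant. Answer using ONLY the help-center "
        "articles below. Do not rely on prior knowledge of the product.\n\n"
        f"Articles:\n{sources}\n\n"
        f"Customer question: {question}"
    )
```

Numbering the articles is a design choice that makes later citation checks easier; it is not something the source describes.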

Stage two: abstention

They added an explicit clause: if the articles did not cover the question, the assistant should say so and offer to connect the customer to a human. This caught most of the remaining fabrication, converting guesses into honest hand-offs. The staged approach mirrors Build a Fabrication-Resistant Prompt in Eight Moves.
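One way an abstention clause like this could be wired up is sketched below. The sentinel string, the clause wording, and the `route_reply` helper are hypothetical; the source only says the assistant should admit the gap and offer a human hand-off.

```python
# Hypothetical sentinel the model is asked to emit when the articles fall short.
ABSTAIN_TOKEN = "NO_GROUNDED_ANSWER"

# Illustrative clause to append to the prompt; the wording is an assumption.
ABSTENTION_CLAUSE = (
    "If the articles do not cover the question, reply with exactly "
    f"{ABSTAIN_TOKEN} and nothing else. Do not guess."
)

HANDOFF_MESSAGE = (
    "I could not find this in our help center. "
    "Would you like me to connect you with a member of our support team?"
)


def route_reply(model_reply: str) -> tuple[str, bool]:
    """Return (message for the customer, whether to hand off to a human)."""
    if model_reply.strip() == ABSTAIN_TOKEN:
        return HANDOFF_MESSAGE, True
    return model_reply, False
```

Using a fixed sentinel rather than free-text apologies keeps the hand-off decision in code, where it can be counted and tested.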

Stage three: evidence and verification

For sensitive topics like billing and policy, they required the assistant to quote the supporting article and added a second pass that confirmed the answer matched the source before sending it. This handled the subtler errors that grounding alone missed.
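The source does not say how the second pass was implemented. As a rough illustration, the cheapest part of such a check is confirming that the quoted passage really appears in the cited article. The function names and the sample policy text below are made up for the example.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_supported(quote: str, article_body: str) -> bool:
    """True only if a non-empty quote appears verbatim in the article."""
    q = normalize(quote)
    return bool(q) and q in normalize(article_body)


def verify_before_sending(answer: str, quote: str, article_body: str, fallback: str) -> str:
    """Send the answer only when its supporting quote is found in the source."""
    return answer if quote_is_supported(quote, article_body) else fallback


# Sample data invented for illustration.
article = "Policy: refunds are available\nwithin 30 days of purchase."
print(verify_before_sending(
    "You can request a refund within 30 days.",
    "Refunds are available within 30 days",
    article,
    "Let me connect you with a teammate who can confirm this.",
))
```

A verbatim match only shows the quote is real, not that the answer follows from it, so a check like this would complement rather than replace a fuller verification pass.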

The Outcome

The team did not reach perfection, and they did not claim to. What they achieved was a controlled, measurable improvement.

What improved

Confident wrong answers on covered topics became rare. The assistant now abstained and handed off when it lacked a grounded answer, which customers tolerated far better than being misled. Follow-up tickets from bad answers fell substantially.

What it cost

Abstention meant the assistant deflected fewer tickets outright, since it now declined questions it would previously have answered wrongly. The team accepted this trade, judging an honest hand-off more valuable than a confident error.

The new failure mode

Early on, the assistant over-abstained, declining questions the articles actually covered. Tuning the abstention clause and improving retrieval brought that back into balance. This was a calibration problem rather than a fabrication one.

The Lessons

A few takeaways generalized beyond this team's situation.

The prompt was the lever, not the model

The biggest gains came from grounding and abstention, both prompt-level changes, achieved without any model upgrade. The instinct to blame the model would have delayed the fix and cost more.

Measurement made it real

Staging changes and measuring after each one let the team attribute improvement to specific moves and catch the over-abstention regression early. For the mistakes they narrowly avoided, see 7 Prompting Habits That Make AI Fabricate More, Not Less, and to lock the result into a repeatable routine, see The Pre-Ship Checklist for Keeping AI Answers Grounded.

Honesty beat coverage

The final system answered fewer questions but misled almost none. The team learned that a lower deflection rate with trustworthy answers outperformed a higher one built on confident fabrication.

How the Team Measured Success

The improvement was only believable because the team measured it, and the way they measured shaped what they built.

Building the test set from real tickets

Rather than invent test questions, the team pulled real customer questions from past tickets, including many the help center did not cover. Those uncovered questions became the heart of the test set, because they were exactly the cases that triggered fabrication. Each prompt change was scored against the same set, so improvement was attributable rather than anecdotal.
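A fixed test set of this kind could be represented as simply as the sketch below. The field names, ticket IDs, and questions are placeholders, not the team's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TicketCase:
    """One real customer question, labeled by whether the help center covers it."""
    ticket_id: str
    question: str
    covered: bool


# Placeholder entries; a real set would be pulled from past tickets.
TEST_SET = (
    TicketCase("T-001", "How do I reset my password?", covered=True),
    TicketCase("T-002", "Can I export reports to a format you do not list?", covered=False),
    TicketCase("T-003", "Where do I change my billing email?", covered=True),
)


def split_by_coverage(cases):
    """Separate covered questions from the uncovered ones that tempt fabrication."""
    covered = [c for c in cases if c.covered]
    uncovered = [c for c in cases if not c.covered]
    return covered, uncovered
```

Freezing the cases and keeping them in a tuple is one way to make sure every prompt change is scored against exactly the same questions.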

Scoring grounding, not just correctness

The team scored each answer on whether it was actually supported by the cited article, not merely whether it happened to be right. An answer that was correct but ungrounded was treated as a near miss, because it would fail on the next similar question. This focus on grounding caught fragile answers that a correctness-only score would have passed.
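The scoring rule described above can be sketched as a small function. The `Verdict` labels and the assumption that a reviewer supplies both judgments as booleans are illustrative choices, not details from the source.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"            # correct and supported by the cited article
    NEAR_MISS = "near_miss"  # correct, but not supported by the cited article
    FAIL = "fail"            # incorrect, regardless of what was cited


def score_answer(correct: bool, supported_by_cited_article: bool) -> Verdict:
    """Grade an answer on grounding as well as correctness."""
    if not correct:
        return Verdict.FAIL
    if supported_by_cited_article:
        return Verdict.PASS
    return Verdict.NEAR_MISS
```

A correctness-only score would collapse `PASS` and `NEAR_MISS` into one bucket, which is the blind spot the paragraph warns about.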

Watching the two failure modes together

They tracked fabrication and unnecessary abstention side by side, on one dashboard. Watching both at once was what let them catch the over-abstention regression early, before customers noticed the assistant had become unhelpfully cautious. Optimizing one number in isolation would have hidden the cost showing up in the other.
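To make the pairing concrete, here is a minimal sketch of computing both rates from the same set of labeled results. The `Result` fields and the exact rate definitions are assumptions; the source does not give the team's formulas.

```python
from dataclasses import dataclass


@dataclass
class Result:
    covered: bool     # did the help center cover the question?
    abstained: bool   # did the assistant hand off instead of answering?
    fabricated: bool  # did a reviewer flag the answer as invented?


def fabrication_rate(results: list[Result]) -> float:
    """Share of all questions that received an invented answer."""
    if not results:
        return 0.0
    return sum(r.fabricated for r in results) / len(results)


def over_abstention_rate(results: list[Result]) -> float:
    """Share of covered questions the assistant declined anyway."""
    covered = [r for r in results if r.covered]
    if not covered:
        return 0.0
    return sum(r.abstained for r in covered) / len(covered)


# Invented sample: one unnecessary abstention and one fabrication.
sample = [
    Result(covered=True, abstained=False, fabricated=False),
    Result(covered=True, abstained=True, fabricated=False),
    Result(covered=False, abstained=True, fabricated=False),
    Result(covered=False, abstained=False, fabricated=True),
]
print(fabrication_rate(sample), over_abstention_rate(sample))  # 0.25 0.5
```

Reporting the two numbers together means a prompt change that lowers fabrication by abstaining on everything shows up immediately as a jump in the second rate.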

Frequently Asked Questions

Why did the team consider a model upgrade first?

Because fabrication is widely assumed to be a measure of model intelligence, so a smarter model feels like the natural fix. The closer look revealed the prompt supplied no source material and no abstention option, meaning even a better model would have kept guessing from memory. The cause was structural, not a matter of capability.

Did grounding alone solve the problem?

Mostly for covered topics, but not entirely. Grounding stopped invented features where articles existed, but gaps in the documentation still produced guesses until the abstention clause was added. The two changes together, not grounding alone, brought fabrication under control.

Why was lower deflection an acceptable trade?

Because a confident wrong answer generated a follow-up ticket and eroded trust, making it more costly than an honest hand-off. Deflecting fewer tickets while misleading almost none netted out better for both customer satisfaction and total support load.

How did the team handle over-abstention?

They treated it as a separate, opposite problem: the assistant declining questions the articles actually answered. Tuning the abstention clause and improving which articles were retrieved brought it back into balance. The goal became calibration, not maximum caution.

What would they do differently next time?

Start with grounding and abstention from day one rather than launching on a memory-only prompt, and build the measurement set, including unanswerable questions, before going live. Most of the pain came from diagnosing in production what a small upfront test set would have revealed.

Key Takeaways

  • The team's confident wrong answers traced to a prompt with no supplied source material and no permission to abstain, not to model limitations.
  • Grounding the assistant in help-center articles cut invented features, and an abstention clause converted remaining guesses into honest hand-offs.
  • Evidence requirements and a verification pass handled the subtler errors on sensitive topics like billing and policy.
  • The trade was lower deflection for far more trustworthy answers, which the team judged a clear net win.
  • Staging changes with measurement after each one made improvement attributable and caught an over-abstention regression early.
