AGENCYSCRIPT
CoursesEnterpriseBlog
đź‘‘FoundersSign inJoin Waitlist
AGENCYSCRIPT

Governed Certification Framework

The operating system for AI-enabled agency building. Certify judgment under constraint. Standards over scale. Governance over shortcuts.

Stay informed

Governance updates, certification insights, and industry standards.

Products

  • Platform
  • Certification
  • Launch Program
  • Vault
  • The Book

Certification

  • Foundation (AS-F)
  • Operator (AS-O)
  • Architect (AS-A)
  • Principal (AS-P)

Resources

  • Blog
  • Verify Credential
  • Enterprise
  • Partners
  • Pricing

Company

  • About
  • Contact
  • Careers
  • Press
© 2026 Agency Script, Inc.·
Privacy PolicyTerms of ServiceCertification AgreementSecurity

Standards over scale. Judgment over volume. Governance over shortcuts.

On This Page

The SituationThe DecisionThe ExecutionScoping the assistantDefining the unknownBuilding the test setThe OutcomeThe LessonsWhat changed in how the team workedFrequently Asked QuestionsWhy did the original prompt fail when it demoed so well?Would adding "don't make things up" have fixed it?How big does a test set need to be to be useful?Did making the assistant more honest hurt user experience?What is the single most transferable takeaway?Key Takeaways
Home/Blog/How One Missing Line Broke a Live Support Assistant
General

How One Missing Line Broke a Live Support Assistant

A

Agency Script Editorial

Editorial Team

·July 12, 2024·7 min read
system promptssystem prompts case studysystem prompts guideprompt engineering

The most useful lessons about system prompts rarely come from theory. They come from watching one fail in front of real users and then rebuilding it under pressure. This case study follows one such arc from start to finish: the situation, the decision to rebuild, the execution, the measurable outcome, and the lessons that generalized.

The details have been composited and anonymized, but the sequence is faithful to how these failures actually unfold. The point is not the specific company. It is the pattern, because the same pattern repeats across teams that ship AI without treating the system prompt as a tested artifact.

Read it as a story with a moral, then apply the moral to your own deployment.

One thing to keep in mind as you read: at every stage, the team's choices were reasonable given what they knew at the time. The original prompt was approved because it genuinely looked good. The pressure to patch quickly came from a real desire to stop a visible problem fast. The lessons below are not about blame; they are about the gap between what feels sufficient and what production actually demands. That gap catches careful people, which is precisely why it is worth studying.

The Situation

A mid-sized software company launched an AI support assistant to handle first-line customer questions. The original system prompt was a single paragraph: a friendly persona, a request to be helpful, and a note to keep answers short. It demoed beautifully. Leadership approved it in a meeting, and it went live to thousands of users.

For two weeks it looked fine. Then the support team noticed a thread where the assistant had confidently told a customer that a feature existed when it did not, walking them through steps for a screen the product did not have. The customer followed the steps, got confused, and escalated angrily.

It was not an isolated case. Once the team went looking, they found dozens of confident fabrications about features, settings, and policies. The assistant never said "I don't know." It always answered, because nothing in the prompt told it not to.

The Decision

The instinct in the room was to patch: add a line saying "don't make things up." The lead resisted. A one-line patch to an untested prompt would likely fix the visible case while leaving the underlying fragility in place, exactly the trap described in 7 Common Mistakes with System Prompts (and How to Avoid Them).

Instead, the team decided to rebuild the prompt properly: define scope, add explicit edge handling, and stand up a regression test set before shipping anything. It would take a day longer. They accepted the delay because the cost of more public fabrications was higher.

The Execution

The rebuild followed a deliberate sequence rather than ad-hoc edits, mirroring the process in A Step-by-Step Approach to System Prompts.

Scoping the assistant

They narrowed the mandate. The assistant would answer questions only from a provided knowledge base of real features and policies. Anything outside that base was out of scope by definition.

Defining the unknown

They added the missing behavior explicitly: when a question cannot be answered from the knowledge base, the assistant says it does not have that information and offers to connect the user to a human. This single addition addressed the root cause, the absence of a graceful unknown.

Building the test set

They collected thirty real conversations, including the ones that had gone wrong, plus a handful of deliberately tricky questions about features that did not exist. Every change to the prompt was now run against this set before shipping.

The Outcome

The rebuilt assistant behaved differently in a way the team could measure. Against the test set, fabrications dropped from frequent to none, because questions outside the knowledge base now triggered the honest fallback instead of an invented answer.

In production, the pattern held. The support team stopped finding confident falsehoods in the logs. Escalations driven by bad AI answers fell sharply, and the ones that remained were genuine edge cases routed cleanly to humans, which is what should happen.

There was a soft cost. The assistant now said "I don't have that information" more often, which felt less impressive in a demo. But a slightly less dazzling assistant that tells the truth beat a dazzling one that fabricated. That trade-off, structure and honesty over surface polish, is a recurring theme in System Prompts: Best Practices That Actually Work.

The Lessons

Several lessons generalized beyond this one deployment.

  • A prompt that demos well is not a prompt that is production-ready. Demos exercise the happy path; production exercises the edges.
  • The absence of a graceful unknown is one of the most damaging gaps a prompt can have, because it manufactures confident wrong answers.
  • Resist the one-line patch on an untested prompt. Patches fix symptoms and leave the disease.
  • A regression test set turns prompt quality from a feeling into a measurement, and it is what made the rebuild trustworthy.

None of these required exotic technique. They required treating the system prompt as something to engineer and test, which is the entire lesson, expanded in The Complete Guide to System Prompts.

What changed in how the team worked

The most durable outcome was not the rebuilt prompt itself but the change in process around it. After this incident, no prompt shipped without passing the regression set, and "what does it do when it doesn't know" became a standard review question for every new assistant. The team also started treating a successful demo as the beginning of validation rather than the end of it.

That cultural shift outlasted the specific assistant. New projects inherited the discipline, so the same class of failure did not recur even as the team built different tools for different purposes. The incident was painful, but it bought a practice that paid back repeatedly. A single visible failure, handled well, often teaches a team more than a dozen quiet successes.

Frequently Asked Questions

Why did the original prompt fail when it demoed so well?

Demos test the situations the author imagined, which are the easy ones. The prompt had no instruction for what to do when it lacked an answer, so under real traffic it filled the gap with fabrication. The flaw was invisible until inputs strayed from the happy path.

Would adding "don't make things up" have fixed it?

Partially and unreliably. A single instruction can reduce fabrication, but without a defined fallback behavior and a test set to verify it, the fix is unproven and fragile. The durable solution paired an explicit unknown-handling rule with regression testing.

How big does a test set need to be to be useful?

Even thirty representative cases, including the ones that already failed and a few adversarial ones, provide real protection. The goal is coverage of the realistic and the tricky, not exhaustiveness. A small disciplined set beats no set by a wide margin.

Did making the assistant more honest hurt user experience?

In demos it looked slightly less impressive because it admitted gaps more often. In real use it improved experience, because honest hand-offs to humans frustrate users far less than confident wrong answers that waste their time. The trade favored honesty clearly.

What is the single most transferable takeaway?

Build and run a regression test set before shipping a system prompt, and always define what the model should do when it does not know. Those two practices would have prevented the entire incident.

Key Takeaways

  • A system prompt that demos well can still fail in production, because demos exercise the happy path and real traffic exercises the edges.
  • The root cause here was a missing graceful unknown: with no fallback, the assistant fabricated confident answers.
  • The team rejected a one-line patch in favor of a methodical rebuild with scope, edge handling, and a regression test set.
  • Fabrications dropped to none against the test set and escalations from bad answers fell sharply in production.
  • The transferable lessons are to test before shipping and to always define what the model does when it lacks an answer.

Search Articles

Categories

OperationsSalesDeliveryGovernance

Popular Tags

prompt engineeringai fundamentalsai toolsthe difference between AIMLagency operationsagency growthenterprise sales

Share Article

A

Agency Script Editorial

Editorial Team

The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.

Related Articles

General

Prompt Quality Decides Whether AI Earns Its Keep

Prompt quality is the single biggest variable in whether AI delivers real work or expensive noise. The model matters, the platform matters — but the prompt you write determines whether you get a first

A
Agency Script Editorial
June 1, 2026·10 min read
General

Counting the Real Cost of Every Token You Send

Tokens and context windows sit at the intersection of AI capability and operational cost—yet most business cases treat them as technical footnotes. That's a mistake that costs real money. Every time y

A
Agency Script Editorial
June 1, 2026·10 min read
General

Rolling Out AI Hallucinations Across a Team

Most teams discover AI hallucinations the hard way — a confident-sounding wrong answer makes it into a client deliverable, a legal brief, or a published report. The damage isn't just to the output; it

A
Agency Script Editorial
June 1, 2026·11 min read

Ready to certify your AI capability?

Join the professionals building governed, repeatable AI delivery systems.

Explore Certification