How One Missing Line Broke a Live Support Assistant

The most useful lessons about system prompts rarely come from theory. They come from watching one fail in front of real users and then rebuilding it under pressure. This case study follows one such arc from start to finish: the situation, the decision to rebuild, the execution, the measurable outcome, and the lessons that generalized.

The details have been composited and anonymized, but the sequence is faithful to how these failures actually unfold. The point is not the specific company. It is the pattern, because the same pattern repeats across teams that ship AI without treating the system prompt as a tested artifact.

Read it as a story with a moral, then apply the moral to your own deployment.

One thing to keep in mind as you read: at every stage, the team's choices were reasonable given what they knew at the time. The original prompt was approved because it genuinely looked good. The pressure to patch quickly came from a real desire to stop a visible problem fast. The lessons below are not about blame; they are about the gap between what feels sufficient and what production actually demands. That gap catches careful people, which is precisely why it is worth studying.

The Situation

A mid-sized software company launched an AI support assistant to handle first-line customer questions. The original system prompt was a single paragraph: a friendly persona, a request to be helpful, and a note to keep answers short. It demoed beautifully. Leadership approved it in a meeting, and it went live to thousands of users.

For two weeks it looked fine. Then the support team noticed a thread where the assistant had confidently told a customer that a feature existed when it did not, walking them through steps for a screen the product did not have. The customer followed the steps, got confused, and escalated angrily.

It was not an isolated case. Once the team went looking, they found dozens of confident fabrications about features, settings, and policies. The assistant never said "I don't know." It always answered, because nothing in the prompt told it not to.

The Decision

The instinct in the room was to patch: add a line saying "don't make things up." The lead resisted. A one-line patch to an untested prompt would likely fix the visible case while leaving the underlying fragility in place, exactly the trap described in 7 Common Mistakes with System Prompts (and How to Avoid Them).

Instead, the team decided to rebuild the prompt properly: define scope, add explicit edge handling, and stand up a regression test set before shipping anything. It would take a day longer. They accepted the delay because the cost of more public fabrications was higher.

The Execution

The rebuild followed a deliberate sequence rather than ad-hoc edits, mirroring the process in A Step-by-Step Approach to System Prompts.

Scoping the assistant

They narrowed the mandate. The assistant would answer questions only from a provided knowledge base of real features and policies. Anything outside that base was out of scope by definition.

Defining the unknown

They added the missing behavior explicitly: when a question cannot be answered from the knowledge base, the assistant says it does not have that information and offers to connect the user to a human. This single addition addressed the root cause, the absence of a graceful unknown.

Building the test set

They collected thirty real conversations, including the ones that had gone wrong, plus a handful of deliberately tricky questions about features that did not exist. Every change to the prompt was now run against this set before shipping.

The Outcome

The rebuilt assistant behaved differently in a way the team could measure. Against the test set, fabrications dropped from frequent to none, because questions outside the knowledge base now triggered the honest fallback instead of an invented answer.

In production, the pattern held. The support team stopped finding confident falsehoods in the logs. Escalations driven by bad AI answers fell sharply, and the ones that remained were genuine edge cases routed cleanly to humans, which is what should happen.

There was a soft cost. The assistant now said "I don't have that information" more often, which felt less impressive in a demo. But a slightly less dazzling assistant that tells the truth beat a dazzling one that fabricated. That trade-off, structure and honesty over surface polish, is a recurring theme in System Prompts: Best Practices That Actually Work.

The Lessons

Several lessons generalized beyond this one deployment.

A prompt that demos well is not a prompt that is production-ready. Demos exercise the happy path; production exercises the edges.
The absence of a graceful unknown is one of the most damaging gaps a prompt can have, because it manufactures confident wrong answers.
Resist the one-line patch on an untested prompt. Patches fix symptoms and leave the disease.
A regression test set turns prompt quality from a feeling into a measurement, and it is what made the rebuild trustworthy.

None of these required exotic technique. They required treating the system prompt as something to engineer and test, which is the entire lesson, expanded in The Complete Guide to System Prompts.

What changed in how the team worked

The most durable outcome was not the rebuilt prompt itself but the change in process around it. After this incident, no prompt shipped without passing the regression set, and "what does it do when it doesn't know" became a standard review question for every new assistant. The team also started treating a successful demo as the beginning of validation rather than the end of it.

That cultural shift outlasted the specific assistant. New projects inherited the discipline, so the same class of failure did not recur even as the team built different tools for different purposes. The incident was painful, but it bought a practice that paid back repeatedly. A single visible failure, handled well, often teaches a team more than a dozen quiet successes.

Frequently Asked Questions

Why did the original prompt fail when it demoed so well?

Demos test the situations the author imagined, which are the easy ones. The prompt had no instruction for what to do when it lacked an answer, so under real traffic it filled the gap with fabrication. The flaw was invisible until inputs strayed from the happy path.

Would adding "don't make things up" have fixed it?

Partially and unreliably. A single instruction can reduce fabrication, but without a defined fallback behavior and a test set to verify it, the fix is unproven and fragile. The durable solution paired an explicit unknown-handling rule with regression testing.

How big does a test set need to be to be useful?

Even thirty representative cases, including the ones that already failed and a few adversarial ones, provide real protection. The goal is coverage of the realistic and the tricky, not exhaustiveness. A small disciplined set beats no set by a wide margin.

Did making the assistant more honest hurt user experience?

In demos it looked slightly less impressive because it admitted gaps more often. In real use it improved experience, because honest hand-offs to humans frustrate users far less than confident wrong answers that waste their time. The trade favored honesty clearly.

What is the single most transferable takeaway?

Build and run a regression test set before shipping a system prompt, and always define what the model should do when it does not know. Those two practices would have prevented the entire incident.

Key Takeaways

A system prompt that demos well can still fail in production, because demos exercise the happy path and real traffic exercises the edges.
The root cause here was a missing graceful unknown: with no fallback, the assistant fabricated confident answers.
The team rejected a one-line patch in favor of a methodical rebuild with scope, edge handling, and a regression test set.
Fabrications dropped to none against the test set and escalations from bad answers fell sharply in production.
The transferable lessons are to test before shipping and to always define what the model does when it lacks an answer.

Read it as a story with a moral, then apply the moral to your own deployment.

The Situation

The Decision

The Execution

The rebuild followed a deliberate sequence rather than ad-hoc edits, mirroring the process in A Step-by-Step Approach to System Prompts.

Scoping the assistant

They narrowed the mandate. The assistant would answer questions only from a provided knowledge base of real features and policies. Anything outside that base was out of scope by definition.

Defining the unknown

Building the test set

The Outcome

The Lessons

Several lessons generalized beyond this one deployment.

A prompt that demos well is not a prompt that is production-ready. Demos exercise the happy path; production exercises the edges.
The absence of a graceful unknown is one of the most damaging gaps a prompt can have, because it manufactures confident wrong answers.
Resist the one-line patch on an untested prompt. Patches fix symptoms and leave the disease.
A regression test set turns prompt quality from a feeling into a measurement, and it is what made the rebuild trustworthy.

None of these required exotic technique. They required treating the system prompt as something to engineer and test, which is the entire lesson, expanded in The Complete Guide to System Prompts.

What changed in how the team worked

Frequently Asked Questions

Why did the original prompt fail when it demoed so well?

Would adding "don't make things up" have fixed it?

How big does a test set need to be to be useful?

Did making the assistant more honest hurt user experience?

What is the single most transferable takeaway?

Build and run a regression test set before shipping a system prompt, and always define what the model should do when it does not know. Those two practices would have prevented the entire incident.

Key Takeaways

A system prompt that demos well can still fail in production, because demos exercise the happy path and real traffic exercises the edges.
The root cause here was a missing graceful unknown: with no fallback, the assistant fabricated confident answers.
The team rejected a one-line patch in favor of a methodical rebuild with scope, edge handling, and a regression test set.
Fabrications dropped to none against the test set and escalations from bad answers fell sharply in production.
The transferable lessons are to test before shipping and to always define what the model does when it lacks an answer.

How One Missing Line Broke a Live Support Assistant

The Situation

The Decision

The Execution

Scoping the assistant

Defining the unknown

Building the test set

The Outcome

The Lessons

What changed in how the team worked

Frequently Asked Questions

Why did the original prompt fail when it demoed so well?

Would adding "don't make things up" have fixed it?

How big does a test set need to be to be useful?

Did making the assistant more honest hurt user experience?

What is the single most transferable takeaway?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

How One Missing Line Broke a Live Support Assistant

The Situation

The Decision

The Execution

Scoping the assistant

Defining the unknown

Building the test set

The Outcome

The Lessons

What changed in how the team worked

Frequently Asked Questions

Why did the original prompt fail when it demoed so well?

Would adding "don't make things up" have fixed it?

How big does a test set need to be to be useful?

Did making the assistant more honest hurt user experience?

What is the single most transferable takeaway?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?