How a Research Team Stopped Shipping Fabricated References

A small research team inside a consulting practice had adopted an AI model to speed up the production of evidence-backed client briefs. The briefs looked great—well-organized, confidently argued, dense with citations. Then a senior reviewer, checking one before it went out, tried to follow a citation to its source and found that the source did not exist. A second check turned up a citation to a real document that said nothing like what was claimed. The team had been one careless reviewer away from sending a client a brief built on fabricated evidence.

This is the story of how that team diagnosed the problem, redesigned how they instructed the model, and turned a liability into a reliable capability. The names and specifics are a composite of common patterns rather than a single identifiable client, but the arc (situation, decision, execution, outcome) reflects how these turnarounds actually unfold. The point is not the particular team; it is the sequence of choices that any team can copy.

If you want the principles abstracted out of the narrative, the best practices article distills them. Here we watch the principles get discovered the hard way, under real pressure, with a real reputation at stake.

The Situation

The team had moved fast and gotten a nasty surprise. Understanding exactly what was wrong came before any fix.

What was happening

The model was citing from memory, not from supplied sources.
Citations were perfectly formatted, which disguised the fabricated ones.
Reviewers spot-checked formatting but rarely traced citations to sources.

Why it had gone unnoticed

The output looked more rigorous than the team's manual briefs, breeding trust.
Verification was treated as the model's job, not the reviewer's.
No one had distinguished "asked for citations" from "verified citations."

The Diagnosis

Rather than abandon the tool, the team ran a small audit to understand the failure precisely.

How they audited

They pulled a sample of recent briefs and traced every citation.
They classified each as real-and-supporting, real-but-unsupporting, or fabricated.
They recorded which prompts and sources had produced each outcome.

What the audit revealed

Fabricated and unsupporting citations clustered in ungrounded prompts.
Briefs built on supplied source material had far fewer problems.
The failure mode matched the well-documented pattern described in the article on common mistakes with generative tools.
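
The audit above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual tooling: the record fields (brief_id, grounded), the function name, and the sample data are all assumptions made for the example. The three verdicts are the classification described above, and each verdict is assumed to come from a person who traced the citation by hand.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    REAL_AND_SUPPORTING = "real-and-supporting"
    REAL_BUT_UNSUPPORTING = "real-but-unsupporting"
    FABRICATED = "fabricated"


@dataclass(frozen=True)
class AuditedCitation:
    brief_id: str     # hypothetical identifier for the brief
    grounded: bool    # True if the prompt supplied source material
    verdict: Verdict  # assigned by a human who traced the citation


def problem_rate_by_grounding(citations):
    """Map each grounding flag to the share of citations that are not real-and-supporting."""
    totals = Counter()
    problems = Counter()
    for citation in citations:
        totals[citation.grounded] += 1
        if citation.verdict is not Verdict.REAL_AND_SUPPORTING:
            problems[citation.grounded] += 1
    return {flag: problems[flag] / totals[flag] for flag in totals}


# Invented sample data, for illustration only.
sample = [
    AuditedCitation("brief-1", False, Verdict.FABRICATED),
    AuditedCitation("brief-1", False, Verdict.REAL_AND_SUPPORTING),
    AuditedCitation("brief-2", True, Verdict.REAL_AND_SUPPORTING),
    AuditedCitation("brief-2", True, Verdict.REAL_AND_SUPPORTING),
]
rates = problem_rate_by_grounding(sample)
print(rates)  # {False: 0.5, True: 0.0}
```

Splitting the rate by whether the prompt was grounded is what lets an audit show where failures cluster, rather than reporting a single overall error rate.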

The Decision

The audit pointed to a clear strategic choice: stop letting the model cite from memory.

The core decision

Ground every brief in sources the team supplied, never recalled references.
Treat verification as a mandatory reviewer step, not an optional courtesy.
Instruct the model to abstain when sources fell short, and treat that abstention as a success rather than a failure.

Why they chose grounding first

It addressed the root cause the audit had isolated.
It made every citation mechanically checkable.
It was the highest-leverage change for the least effort, as argued in the article on grounding outputs in sources.

The Execution

The decision became a redesigned process, rolled out deliberately rather than all at once.

What changed in the prompts

The model was instructed to cite only from provided sources, with a supporting quote per claim.
It was told plainly that an unsupported claim was worse than admitting uncertainty.
A dedicated "could not ground" section surfaced gaps for the reviewer.
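
The three prompt changes above might be assembled along these lines. This is a sketch under stated assumptions: the function name, the bracketed source IDs such as "S1", and the exact wording of the rules are invented for illustration and are not the team's actual prompt.

```python
def build_grounded_prompt(question, sources):
    """Assemble a prompt that restricts citations to the supplied sources.

    `sources` maps a short source ID (hypothetical, e.g. "S1") to its text.
    """
    if not sources:
        # Without supplied sources the model can only cite from memory.
        raise ValueError("refusing to build a prompt with no supplied sources")

    source_block = "\n\n".join(
        f"[{source_id}]\n{text}" for source_id, text in sources.items()
    )
    rules = (
        "Rules:\n"
        "1. Cite only the sources provided below, by their bracketed ID.\n"
        "2. After each claim, give a verbatim supporting quote from the cited source.\n"
        "3. An unsupported claim is worse than admitting uncertainty.\n"
        '4. List anything you cannot support under a final heading "Could not ground".\n'
    )
    return f"{rules}\nSources:\n{source_block}\n\nTask:\n{question}\n"


prompt = build_grounded_prompt(
    "Summarize the evidence on remote-work productivity.",  # invented task
    {"S1": "Placeholder text of the first supplied source."},
)
```

Raising an error when no sources are supplied is a design choice in this sketch: it makes the ungrounded prompt impossible to send by accident.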

What changed in the workflow

Sources were assembled before drafting; for large source sets, the team used retrieval to select the relevant passages, following the retrieval-augmented generation pattern.
Reviewers traced every citation in client-facing briefs.
The working prompts were saved to a shared library so the practice spread.
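
Because each claim carries a supporting quote, part of the reviewer's tracing can be prepared mechanically. The sketch below flags citations whose quote does not appear in the cited source. The function names and the (claim, source_id, quote) tuple shape are assumptions for illustration. A passing check shows only that the quote exists in the source, not that it supports the claim, so it narrows the reviewer's work rather than replacing it.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so line wrapping does not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_unverifiable(citations, sources):
    """Return citations whose quote is not found in the cited source.

    `citations` is a list of (claim, source_id, quote) tuples.
    `sources` maps source_id to the full source text.
    """
    flagged = []
    for claim, source_id, quote in citations:
        source_text = sources.get(source_id)
        missing_source = source_text is None
        empty_quote = not quote.strip()
        if missing_source or empty_quote or normalize(quote) not in normalize(source_text):
            flagged.append((claim, source_id, quote))
    return flagged


# Invented sample data, for illustration only.
sources = {"S1": "Revenue grew 12 percent\nin 2021."}
citations = [
    ("Revenue grew in 2021.", "S1", "revenue grew 12 percent in 2021"),
    ("Profits fell.", "S1", "profits fell sharply"),
    ("Costs rose.", "S2", "costs rose"),
]
flagged = find_unverifiable(citations, sources)
```

In this sample the first citation passes, the second is flagged because the quote is not in the source, and the third is flagged because source S2 was never supplied.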

The Outcome

The team measured the change against the same audit they had run at the start, which made the improvement legible.

What improved

Fabricated citations in audited briefs dropped to near zero, because the model now cited only supplied material.
Reviewers caught the rare remaining unsupporting citation, because quotes made checking fast.
Honest abstentions surfaced real gaps the team then filled with additional sources.

What it cost

Assembling sources up front added time to each brief.
Tracing citations added a verification step reviewers had skipped before.
The team judged both costs trivial against the avoided risk of shipping fabricated evidence.

The Lessons

Stripped of the specifics, the turnaround came down to a handful of transferable lessons.

What any team can take from this

Asking for citations is not a safeguard; verifying them is.
Grounding the model in supplied sources removes the largest source of fabrication.
Quotes make verification cheap enough that reviewers actually do it.

How to institutionalize it

Codify grounding, abstention, and verification in prompt review standards.
Audit periodically against a fixed classification so improvement stays visible.
Treat the working prompts as shared assets, not personal tricks.
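
A periodic audit against a fixed classification can be reduced to a simple comparison. The sketch below is illustrative only: the function names, the default tolerance of 0.02, and the sample counts are assumptions, not figures from the team. It rejects any label outside the three fixed categories so that later audits stay comparable with the baseline.

```python
LABELS = ("real-and-supporting", "real-but-unsupporting", "fabricated")


def label_shares(labels):
    """Return the share of each fixed label in one audit's list of labels."""
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"labels outside the fixed classification: {sorted(unknown)}")
    total = len(labels)
    if total == 0:
        raise ValueError("an audit needs at least one citation")
    return {label: labels.count(label) / total for label in LABELS}


def drifted(baseline, latest, label="fabricated", tolerance=0.02):
    """True if the latest share of `label` exceeds the baseline share by more than `tolerance`."""
    return label_shares(latest)[label] > label_shares(baseline)[label] + tolerance


# Invented audits, for illustration only.
baseline_audit = ["real-and-supporting"] * 9 + ["fabricated"]
latest_audit = ["real-and-supporting"] * 8 + ["fabricated"] * 2
print(drifted(baseline_audit, latest_audit))  # True
```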

What Did Not Work Along The Way

A clean narrative hides the false starts. The team tried two things before grounding that are worth naming, because they are the tempting shortcuts most teams reach for first.

The shortcuts that failed

They first tried adding a stern instruction—"only use real citations"—without supplying sources. The model complied in tone and fabricated anyway, because it had nothing real to cite.
They then tried having a second model "fact-check" the citations of the first. This caught some errors but introduced its own fabrications, layering uncertainty rather than removing it.
Only when they supplied the sources themselves did the fabrication problem actually collapse.

Why the shortcuts were tempting

Adding an instruction is nearly free and feels like it should work, which is exactly why teams try it first.
Automating verification with another model promises to remove human effort, the scarcest resource.
Both avoid the real cost—assembling sources up front—which is the thing that actually fixes the problem.

The lesson buried in the false starts is that there is no prompt-only shortcut around grounding. Instructions shape behavior but cannot conjure sources the model does not have, and a second model inherits the same failure mode as the first.

Frequently Asked Questions

What was the actual root cause of the fabricated citations?

The model was citing from memory rather than from sources the team supplied. Recalled citations are far more likely to be fabricated, and their perfect formatting disguised the problem. The team's audit isolated this clearly: briefs built on supplied source material had far fewer issues than those where the model cited from its training data. The fix followed directly: ground every brief in real, supplied sources.

Why did the problem go unnoticed for so long?

Because the AI-assisted briefs looked more rigorous than the team's manual work, which bred misplaced trust. Reviewers checked that citations were well-formatted but rarely traced them to their sources, and everyone implicitly assumed verification was the model's responsibility. The gap between "the prompt asked for citations" and "someone confirmed those citations are real and supporting" was where the failure lived, invisible until a careful reviewer happened to look.

How did the audit help beyond just finding the problem?

It turned a vague worry into a measurable baseline. By classifying every citation as real-and-supporting, real-but-unsupporting, or fabricated, the team could see exactly where failures clustered—in ungrounded prompts—and could later measure improvement against the same classification. The audit both diagnosed the root cause and created the yardstick that made the eventual turnaround legible rather than a matter of vibes.

Was grounding alone enough to fix everything?

It fixed the largest part—fabricated citations dropped to near zero because the model could only cite supplied material. But grounding did not eliminate the rarer failure where a real source was cited that did not actually support the claim. Required quotes and reviewer verification caught those. The lesson is that grounding is the highest-leverage change, but quotes and verification remain necessary to close the remaining gap between a real source and a supporting one.

Did the new process slow the team down?

Modestly. Assembling sources before drafting and tracing citations in client-facing briefs both added steps the team had previously skipped. But they judged these costs trivial against the alternative: shipping a brief built on fabricated evidence to a paying client. The verification step was not new work so much as work that should have been happening all along, now made fast by the required supporting quotes.

How did they keep the improvement from eroding over time?

By codifying the practice and re-running the audit. Grounding, abstention, and verification expectations went into shared review standards so new team members inherited the discipline. The periodic audit against the fixed classification kept the improvement visible and caught any drift early. And the working prompts lived in a shared library rather than in one person's habits, so the capability belonged to the team rather than depending on whoever had championed it.

Key Takeaways

A research team nearly shipped client briefs built on fabricated citations because the model was citing from memory, not supplied sources.
An audit classifying every citation isolated the root cause—ungrounded prompts—and created a baseline for measuring improvement.
Grounding the model in supplied sources was the highest-leverage fix, dropping fabricated citations to near zero.
Required quotes made verification cheap enough that reviewers actually traced citations, catching the rare unsupporting one.
The turnaround stuck because the team codified grounding, abstention, and verification into standards and kept auditing against a fixed yardstick.

Agency Script Editorial

