AGENCYSCRIPT
CoursesEnterpriseBlog
đź‘‘FoundersSign inJoin Waitlist
AGENCYSCRIPT

Governed Certification Framework

The operating system for AI-enabled agency building. Certify judgment under constraint. Standards over scale. Governance over shortcuts.

Stay informed

Governance updates, certification insights, and industry standards.

Products

  • Platform
  • Certification
  • Launch Program
  • Vault
  • The Book

Certification

  • Foundation (AS-F)
  • Operator (AS-O)
  • Architect (AS-A)
  • Principal (AS-P)

Resources

  • Blog
  • Verify Credential
  • Enterprise
  • Partners
  • Pricing

Company

  • About
  • Contact
  • Careers
  • Press
© 2026 Agency Script, Inc.·
Privacy PolicyTerms of ServiceCertification AgreementSecurity

Standards over scale. Judgment over volume. Governance over shortcuts.

On This Page

The SituationWhat was happeningWhy it had gone unnoticedThe DiagnosisHow they auditedWhat the audit revealedThe DecisionThe core decisionWhy they chose grounding firstThe ExecutionWhat changed in the promptsWhat changed in the workflowThe OutcomeWhat improvedWhat it costThe LessonsWhat any team can take from thisHow to institutionalize itWhat Did Not Work Along The WayThe shortcuts that failedWhy the shortcuts were temptingFrequently Asked QuestionsWhat was the actual root cause of the fabricated citations?Why did the problem go unnoticed for so long?How did the audit help beyond just finding the problem?Was grounding alone enough to fix everything?Did the new process slow the team down?How did they keep the improvement from eroding over time?Key Takeaways
Home/Blog/How a Research Team Stopped Shipping Fabricated References
General

How a Research Team Stopped Shipping Fabricated References

A

Agency Script Editorial

Editorial Team

·December 8, 2020·8 min read
instructing models to cite sourcesinstructing models to cite sources case studyinstructing models to cite sources guideprompt engineering

A small research team inside a consulting practice had adopted an AI model to speed up the production of evidence-backed client briefs. The briefs looked great—well-organized, confidently argued, dense with citations. Then a senior reviewer, checking one before it went out, tried to follow a citation to its source and found that the source did not exist. A second check turned up a citation to a real document that said nothing like what was claimed. The team had been one careless reviewer away from sending a client a brief built on fabricated evidence.

This is the story of how that team diagnosed the problem, redesigned how they instructed the model, and turned a liability into a reliable capability. The names and specifics are composited from common patterns rather than a single identifiable client, but the arc—situation, decision, execution, outcome—reflects how these turnarounds actually unfold. The point is not the particular team; it is the sequence of choices that any team can copy.

If you want the principles abstracted out of the narrative, the best practices article distills them. Here we watch the principles get discovered the hard way, under real pressure, with a real reputation at stake.

The Situation

The team had moved fast and gotten a nasty surprise. Understanding exactly what was wrong came before any fix.

What was happening

  • The model was citing from memory, not from supplied sources.
  • Citations were perfectly formatted, which disguised the fabricated ones.
  • Reviewers spot-checked formatting but rarely traced citations to sources.

Why it had gone unnoticed

  • The output looked more rigorous than the team's manual briefs, breeding trust.
  • Verification was treated as the model's job, not the reviewer's.
  • No one had distinguished "asked for citations" from "verified citations."

The Diagnosis

Rather than abandon the tool, the team ran a small audit to understand the failure precisely.

How they audited

  • They pulled a sample of recent briefs and traced every citation.
  • They classified each as real-and-supporting, real-but-unsupporting, or fabricated.
  • They recorded which prompts and sources had produced each outcome.

What the audit revealed

  • Fabricated and unsupporting citations clustered in ungrounded prompts.
  • Briefs built on supplied source material had far fewer problems.
  • The failure mode matched the well-documented pattern in common mistakes with generative tools.

The Decision

The audit pointed to a clear strategic choice: stop letting the model cite from memory.

The core decision

  • Ground every brief in sources the team supplied, never recalled references.
  • Treat verification as a mandatory reviewer step, not an optional courtesy.
  • Reward the model for abstaining when sources fell short.

Why they chose grounding first

  • It addressed the root cause the audit had isolated.
  • It made every citation mechanically checkable.
  • It was the highest-leverage change for the least effort, as argued in grounding outputs in sources.

The Execution

The decision became a redesigned process, rolled out deliberately rather than all at once.

What changed in the prompts

  • The model was instructed to cite only from provided sources, with a supporting quote per claim.
  • It was told plainly that an unsupported claim was worse than admitting uncertainty.
  • A dedicated "could not ground" section surfaced gaps for the reviewer.

What changed in the workflow

  • Sources were assembled before drafting, using retrieval for large source sets, following retrieval-augmented generation.
  • Reviewers traced every citation in client-facing briefs.
  • The working prompts were saved to a shared library so the practice spread.

The Outcome

The team measured the change against the same audit they had run at the start, which made the improvement legible.

What improved

  • Fabricated citations in audited briefs dropped to near zero, because the model now cited only supplied material.
  • Reviewers caught the rare remaining unsupporting citation, because quotes made checking fast.
  • Honest abstentions surfaced real gaps the team then filled with additional sources.

What it cost

  • Assembling sources up front added time to each brief.
  • Tracing citations added a verification step reviewers had skipped before.
  • The team judged both costs trivial against the avoided risk of shipping fabricated evidence.

The Lessons

Stripped of the specifics, the turnaround came down to a handful of transferable lessons.

What any team can take from this

  • Asking for citations is not a safeguard; verifying them is.
  • Grounding the model in supplied sources removes the largest source of fabrication.
  • Quotes make verification cheap enough that reviewers actually do it.

How to institutionalize it

  • Codify grounding, abstention, and verification in prompt review standards.
  • Audit periodically against a fixed classification so improvement stays visible.
  • Treat the working prompts as shared assets, not personal tricks.

What Did Not Work Along The Way

A clean narrative hides the false starts. The team tried two things before grounding that are worth naming, because they are the tempting shortcuts most teams reach for first.

The shortcuts that failed

  • They first tried adding a stern instruction—"only use real citations"—without supplying sources. The model complied in tone and fabricated anyway, because it had nothing real to cite.
  • They then tried having a second model "fact-check" the citations of the first. This caught some errors but introduced its own fabrications, layering uncertainty rather than removing it.
  • Only when they supplied the sources themselves did the fabrication problem actually collapse.

Why the shortcuts were tempting

  • Adding an instruction is nearly free and feels like it should work, which is exactly why teams try it first.
  • Automating verification with another model promises to remove human effort, the scarcest resource.
  • Both avoid the real cost—assembling sources up front—which is the thing that actually fixes the problem.

The lesson buried in the false starts is that there is no prompt-only shortcut around grounding. Instructions shape behavior but cannot conjure sources the model does not have, and a second model inherits the same failure mode as the first.

Frequently Asked Questions

What was the actual root cause of the fabricated citations?

The model was citing from memory rather than from sources the team supplied. Recalled citations are far more likely to be fabricated, and their perfect formatting disguised the problem. The team's audit isolated this clearly: briefs built on supplied source material had far fewer issues than those where the model cited from training. The fix followed directly—ground every brief in real, supplied sources.

Why did the problem go unnoticed for so long?

Because the AI-assisted briefs looked more rigorous than the team's manual work, which bred misplaced trust. Reviewers checked that citations were well-formatted but rarely traced them to their sources, and everyone implicitly assumed verification was the model's responsibility. The gap between "the output asked for citations" and "someone confirmed those citations are real and supporting" was where the failure lived, invisible until a careful reviewer happened to look.

How did the audit help beyond just finding the problem?

It turned a vague worry into a measurable baseline. By classifying every citation as real-and-supporting, real-but-unsupporting, or fabricated, the team could see exactly where failures clustered—in ungrounded prompts—and could later measure improvement against the same classification. The audit both diagnosed the root cause and created the yardstick that made the eventual turnaround legible rather than a matter of vibes.

Was grounding alone enough to fix everything?

It fixed the largest part—fabricated citations dropped to near zero because the model could only cite supplied material. But grounding did not eliminate the rarer failure where a real source was cited that did not actually support the claim. Required quotes and reviewer verification caught those. The lesson is that grounding is the highest-leverage change, but quotes and verification remain necessary to close the remaining gap between a real source and a supporting one.

Did the new process slow the team down?

Modestly. Assembling sources before drafting and tracing citations in client-facing briefs both added time the team had previously skipped. But they judged these costs trivial against the alternative—shipping a brief built on fabricated evidence to a paying client. The verification step was not new work so much as work that should have been happening all along, now made fast by the required supporting quotes.

How did they keep the improvement from eroding over time?

By codifying the practice and re-running the audit. Grounding, abstention, and verification expectations went into shared review standards so new team members inherited the discipline. The periodic audit against the fixed classification kept the improvement visible and caught any drift early. And the working prompts lived in a shared library rather than in one person's habits, so the capability belonged to the team rather than depending on whoever had championed it.

Key Takeaways

  • A research team nearly shipped client briefs built on fabricated citations because the model was citing from memory, not supplied sources.
  • An audit classifying every citation isolated the root cause—ungrounded prompts—and created a baseline for measuring improvement.
  • Grounding the model in supplied sources was the highest-leverage fix, dropping fabricated citations to near zero.
  • Required quotes made verification cheap enough that reviewers actually traced citations, catching the rare unsupporting one.
  • The turnaround stuck because the team codified grounding, abstention, and verification into standards and kept auditing against a fixed yardstick.

Search Articles

Categories

OperationsSalesDeliveryGovernance

Popular Tags

prompt engineeringai fundamentalsai toolsthe difference between AIMLagency operationsagency growthenterprise sales

Share Article

A

Agency Script Editorial

Editorial Team

The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.

Related Articles

General

Prompt Quality Decides Whether AI Earns Its Keep

Prompt quality is the single biggest variable in whether AI delivers real work or expensive noise. The model matters, the platform matters — but the prompt you write determines whether you get a first

A
Agency Script Editorial
June 1, 2026·10 min read
General

Counting the Real Cost of Every Token You Send

Tokens and context windows sit at the intersection of AI capability and operational cost—yet most business cases treat them as technical footnotes. That's a mistake that costs real money. Every time y

A
Agency Script Editorial
June 1, 2026·10 min read
General

Rolling Out AI Hallucinations Across a Team

Most teams discover AI hallucinations the hard way — a confident-sounding wrong answer makes it into a client deliverable, a legal brief, or a published report. The damage isn't just to the output; it

A
Agency Script Editorial
June 1, 2026·11 min read

Ready to certify your AI capability?

Join the professionals building governed, repeatable AI delivery systems.

Explore Certification