Catching Fabrications Before Your Client Does

Measuring AI hallucinations is harder than it sounds, and most teams discover this the wrong way—after a client receives a document citing a policy that doesn't exist, or a chatbot confidently invents a product SKU. The problem isn't that hallucinations are invisible. It's that teams lack the instrumentation to see them systematically, so they only catch failures when the damage is already done.

The good news: hallucination measurement is a solvable engineering and process problem. You don't need a research team or proprietary infrastructure. You need the right definitions, a small set of metrics that behave differently depending on your use case, and a workflow that turns those metrics into decisions. This article gives you all three—plus a clear read on the failure modes that fool even experienced practitioners.

One framing note before diving in: hallucinations aren't a single phenomenon. A model that fabricates a legal citation is doing something different from a model that correctly identifies a fact but applies it to the wrong entity. Measuring them requires distinguishing between types, not just counting errors. Once you can do that, you can act on the signal.

What Counts as a Hallucination

The term "hallucination" has become imprecise through overuse. For measurement purposes, you need a working taxonomy.

Factual Hallucinations

The model asserts something that is verifiably false—a name, date, statistic, URL, or regulatory requirement that does not exist or is materially wrong. These are the easiest to detect with automated tools because they can be checked against a ground-truth source.

Faithfulness Hallucinations

The model produces output that contradicts or goes beyond the source material it was given. This is the dominant failure mode in retrieval-augmented generation (RAG) systems. The source document says revenue grew 8%; the summary says 18%. The document never mentions a guarantee; the chatbot mentions one. Faithfulness hallucinations are particularly dangerous because the model appears to be grounded—it's citing real context—but it's misrepresenting it.

Attribution Hallucinations

The model claims a statement comes from a specific source when it doesn't, or invents a source entirely. These are common in research assistant tools and content generation workflows where users rely on citations without verifying them independently.

Instruction-Drift Hallucinations

The model technically addresses the prompt but drifts from what was asked in ways that introduce fabricated detail. A user asks for a three-bullet summary; the model produces five bullets and adds a claim that wasn't in the source. This type is underreported because it looks like thoroughness rather than error.

Getting this taxonomy right before you instrument anything saves significant rework. Different hallucination types require different detectors.

The Core Metrics

Hallucination Rate

The baseline metric: the percentage of outputs that contain at least one hallucination, out of total outputs evaluated. Hallucination rate = (hallucinated outputs / total evaluated outputs) × 100.

What "evaluated" means matters. If you're sampling 5% of production traffic, your denominator is that sample. If you're running a weekly red-team suite, it's that suite. Keep the denominator consistent so the rate is comparable over time.
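As a minimal sketch of the formula above, the snippet below computes the rate over a fixed evaluation sample. The `EvaluatedOutput` record and its field names are hypothetical, chosen only for illustration; any structure that tells you whether an output contained at least one hallucination would work.

```python
from dataclasses import dataclass


@dataclass
class EvaluatedOutput:
    """One evaluated response. Field names are hypothetical."""
    output_id: str
    hallucination_count: int  # hallucinated claims found during review


def hallucination_rate(evaluated: list[EvaluatedOutput]) -> float:
    """Percentage of evaluated outputs with at least one hallucination."""
    if not evaluated:
        raise ValueError("no evaluated outputs: the rate is undefined")
    hallucinated = sum(1 for o in evaluated if o.hallucination_count > 0)
    # The denominator is the evaluated sample, not all production traffic.
    return 100 * hallucinated / len(evaluated)


sample = [
    EvaluatedOutput("a", 0),
    EvaluatedOutput("b", 2),
    EvaluatedOutput("c", 0),
    EvaluatedOutput("d", 1),
]
rate = hallucination_rate(sample)
print(f"{rate:.1f}%")  # 2 of the 4 sampled outputs are flagged
```

Note that an output with two hallucinated claims counts once, which matches the "at least one hallucination" definition.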

Typical hallucination rates in production vary widely: well-tuned RAG systems with strong prompts and retrieval quality often fall in the 3–8% range; general-purpose assistants without retrieval can run 15–30% depending on domain and query type. These are operational ranges, not targets—your use case determines what's acceptable.

Faithfulness Score

Faithfulness measures how well model output aligns with the context it was given. It's expressed as a score (commonly 0–1 or 0–100) and applies only when there's a retrievable ground truth—a document, database, or knowledge base the model is drawing from.

Common approaches:

NLI-based scoring: A natural language inference model is used to check whether the output is entailed by the source. Tools like HHEM (Hughes Hallucination Evaluation Model) and frameworks like RAGAS implement this.
LLM-as-judge: A separate model (often GPT-4 class) evaluates the output against the source and returns a structured score. This adds cost but catches nuanced misrepresentations that NLI models miss.
Sentence-level decomposition: Break the output into atomic claims, verify each against the source individually, and average the results. This is more labor-intensive but gives you granular signal on which claim types fail most.
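The decomposition approach can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the `Verifier` callable stands in for whatever NLI model or LLM judge you use, and `naive_substring_verifier` is a deliberately crude toy so the example runs without external dependencies.

```python
from typing import Callable

# A verifier answers: is this claim supported by the source text?
# In practice it would wrap an NLI model or an LLM judge (assumption).
Verifier = Callable[[str, str], bool]


def faithfulness_score(claims: list[str], source: str, is_supported: Verifier) -> float:
    """Fraction of claims supported by the source, between 0 and 1."""
    if not claims:
        # Assumption: an output with no claims cannot contradict the source.
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, source))
    return supported / len(claims)


def naive_substring_verifier(claim: str, source: str) -> bool:
    """Toy verifier: a claim is supported only if it appears verbatim."""
    return claim.lower() in source.lower()


source = "Revenue grew 8% in 2023. The company opened two new offices."
claims = ["Revenue grew 8% in 2023", "The company offers a guarantee"]
score = faithfulness_score(claims, source, naive_substring_verifier)
print(score)  # one of the two claims is supported
```

Keeping the verifier pluggable lets you swap a cheap check for a more expensive judge without changing how the score is aggregated.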

Factual Precision and Recall

Borrowed from information retrieval, these metrics apply when you have a verified answer set:

Factual precision: Of the factual claims in the output, what fraction are correct?
Factual recall: Of the true facts that should have appeared, what fraction did the model include?

Precision exposes fabrication: it falls whenever the model asserts something false. Recall exposes omission: it falls whenever the model leaves out a fact it should have included. For most professional workflows (legal, medical, financial), precision matters more because a wrong fact actively harms, while a missing fact is survivable.
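A minimal sketch of both metrics, assuming claims have already been extracted and normalized into comparable strings (that normalization step is the hard part and is not shown). It also assumes the verified answer set is exhaustive for the query, so any claim outside it counts as incorrect.

```python
def factual_precision_recall(
    output_claims: set[str], verified_facts: set[str]
) -> tuple[float, float]:
    """Return (precision, recall) for one output against a verified answer set."""
    correct = output_claims & verified_facts
    # Precision: share of the output's claims that are correct.
    precision = len(correct) / len(output_claims) if output_claims else 0.0
    # Recall: share of the expected facts that the output included.
    recall = len(correct) / len(verified_facts) if verified_facts else 0.0
    return precision, recall


verified = {"founded in 2010", "headquartered in Austin", "employs 500 people"}
claims = {"founded in 2010", "headquartered in Dallas"}
precision, recall = factual_precision_recall(claims, verified)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one of the two claims is correct and one of the three expected facts appears, so the output is both partly fabricated and incomplete.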

Refusal Rate

A model that refuses to answer when uncertain is doing the right thing, but a model that refuses too often is useless. Track the percentage of queries where the model declines to answer or hedges heavily. A rising refusal rate after a prompt update can signal over-correction. A falling refusal rate with no corresponding drop in hallucination rate signals the opposite problem: the model is now answering questions it should have declined.
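One rough way to track this is a phrase-matching heuristic, sketched below. The marker list is hypothetical and would need tuning to your model's actual refusal wording; a classifier or judge model would likely be more robust.

```python
# Hypothetical refusal phrases; adjust to how your model actually declines.
REFUSAL_MARKERS = (
    "i can't answer",
    "i cannot answer",
    "i don't have enough information",
    "i'm not able to",
)


def is_refusal(response: str) -> bool:
    """Heuristic: does the response contain a known refusal phrase?"""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Percentage of responses classified as refusals."""
    if not responses:
        raise ValueError("no responses to evaluate")
    return 100 * sum(is_refusal(r) for r in responses) / len(responses)


responses = [
    "The policy covers dental care.",
    "I don't have enough information to answer that.",
    "Your plan renews in March.",
    "The deductible is $500.",
]
rate = refusal_rate(responses)
print(f"{rate:.1f}%")
```

Comparing this number before and after each prompt update, alongside the hallucination rate, is what makes it useful.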

Human Review Escalation Rate

In workflows where outputs are reviewed before delivery, track what percentage get flagged for revision or rejection due to suspected inaccuracy. This is a lagging indicator but a real-world one—it reflects actual operational cost, not just model behavior in isolation.

How to Instrument These Metrics

Define Evaluation Sets Before You Deploy

The most common mistake is trying to build evaluation infrastructure reactively. Define three sets up front:

Static benchmark set: 100–200 representative queries with verified correct answers. Run this on every model or prompt change. It gives you a stable baseline.
Adversarial set: Queries specifically designed to elicit hallucinations—ambiguous dates, rare entities, leading questions. This is your red-team suite.
Production sample: A random 3–10% sample of live traffic, reviewed on a rolling basis. This catches distribution shift—real users ask things your benchmark didn't anticipate.
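For the production sample, one possible approach is deterministic hash-based sampling, sketched here. The `request_id` format and the 5% default are assumptions for illustration; the point is that the same request always gets the same sampling decision, so reruns and backfills stay consistent.

```python
import hashlib


def in_review_sample(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically decide whether a request joins the review sample."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


request_ids = [f"req-{i}" for i in range(10_000)]  # hypothetical IDs
sampled = [r for r in request_ids if in_review_sample(r)]
print(len(sampled))  # close to 5% of 10,000, not exactly 500
```

Hashing the ID rather than calling a random number generator means you do not need to store the decision to reproduce it later.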

Use Layered Detectors

No single detector catches everything. A practical stack:

Automated NLI or embedding-based checks as the first pass. Fast, cheap, catches the obvious faithfulness failures.
LLM-as-judge for flagged outputs. More expensive, but higher signal quality on edge cases.
Human review for high-stakes outputs and a calibration sample. Human review also lets you audit whether your automated detectors are well-calibrated—if humans and the detector disagree at rates above 15–20%, your automated scoring needs tuning.
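The stack above can be sketched as a routing function plus a calibration check. Both `first_pass` and `judge` are placeholders for your own detectors (assumptions, not real library calls), and the routing labels are hypothetical.

```python
from typing import Callable

# A check returns True when the output looks faithful to the source.
Check = Callable[[str, str], bool]


def route_output(
    output: str, source: str, first_pass: Check, judge: Check, high_stakes: bool
) -> str:
    """Return 'pass', 'fail', or 'human_review' for one output."""
    if high_stakes:
        return "human_review"  # high-stakes outputs always get a human
    if first_pass(output, source):
        return "pass"  # the cheap check found nothing suspicious
    # Only outputs flagged by the first pass reach the costlier judge.
    return "pass" if judge(output, source) else "fail"


def disagreement_rate(detector_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of calibration items where detector and human disagree (0 to 1)."""
    if len(detector_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    disagreements = sum(d != h for d, h in zip(detector_labels, human_labels))
    return disagreements / len(human_labels)


calibration = disagreement_rate(
    [True, True, False, False], [True, False, False, False]
)
print(calibration)  # above the 15-20% range, so the detector needs tuning
```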

Log at the Claim Level, Not Just the Response Level

Response-level logging tells you a response failed. Claim-level logging tells you which type of claim failed. Over time, claim-level data reveals patterns: the model consistently hallucinates when asked about dates before a certain year, or when the retrieved context exceeds a certain length. Those patterns point directly to fixes—prompt changes, retrieval tuning, or context window management.

Context window behavior is worth watching specifically because truncation and ordering effects within the context window directly affect faithfulness. When retrieved content is cut off or buried, models tend to fill gaps from parametric memory, which is where many faithfulness hallucinations originate. If you're building RAG systems, pairing your hallucination metrics with token and context window metrics gives you a much cleaner picture of root cause.
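A claim-level log record might look like the sketch below. Every field name is hypothetical; the idea is simply to store enough per claim (type, verdict, prompt version, context size) that failures can be grouped later.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimRecord:
    """One verified claim. All field names are illustrative assumptions."""
    response_id: str
    claim_text: str
    claim_type: str      # e.g. "date", "statistic", "citation"
    supported: bool      # verdict from your detector or reviewer
    prompt_version: str  # lets you separate prompt effects from model effects
    context_tokens: int  # size of the retrieved context for this response


def failure_rate_by_type(records: list[ClaimRecord]) -> dict[str, float]:
    """Share of unsupported claims per claim type (0 to 1)."""
    totals = Counter(r.claim_type for r in records)
    failures = Counter(r.claim_type for r in records if not r.supported)
    return {claim_type: failures[claim_type] / totals[claim_type] for claim_type in totals}


records = [
    ClaimRecord("r1", "Founded in 1998", "date", False, "v3", 1200),
    ClaimRecord("r1", "Revenue grew 8%", "statistic", True, "v3", 1200),
    ClaimRecord("r2", "Launched in 2021", "date", True, "v3", 5400),
]
by_type = failure_rate_by_type(records)
print(by_type)
```

The same grouping can be done on `prompt_version` or on bucketed `context_tokens` to look for the context-length patterns described above.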

Set Thresholds by Use Case, Not Universally

A hallucination rate of 5% might be acceptable in a first-draft copywriting tool where humans review everything. It's not acceptable in a benefits eligibility assistant where a wrong answer costs a client money or legal exposure. Define acceptable thresholds before launch, tie them to business risk, and build alerts that trigger when thresholds are breached.
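A per-use-case threshold check could be as simple as the sketch below. The use-case names and the 0.5% figure are hypothetical placeholders; only the 5% copywriting example comes from the paragraph above, and your own limits should come from a business risk assessment.

```python
# Maximum acceptable hallucination rate (percent) per use case.
# Values are illustrative assumptions, not recommendations.
THRESHOLDS = {
    "copywriting_draft": 5.0,
    "benefits_eligibility": 0.5,
}


def threshold_breached(use_case: str, observed_rate: float) -> bool:
    """True when the observed rate exceeds the limit for this use case."""
    if use_case not in THRESHOLDS:
        raise KeyError(f"no threshold defined for use case: {use_case}")
    return observed_rate > THRESHOLDS[use_case]


observed = 3.0  # the same observed rate, judged against two different limits
alerts = {name: threshold_breached(name, observed) for name in THRESHOLDS}
print(alerts)
```

Raising an error for an undefined use case is a deliberate choice here: it forces a threshold to be set before launch rather than defaulting silently.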

Reading the Signal Correctly

Distinguish Model Failures from Prompt Failures

Many hallucination events are prompt engineering problems, not model capability problems. If you change the prompt and the hallucination rate drops 40%, that's signal that the model had the capability all along. Track prompt version as a variable in your logging so you can separate these effects.

Watch for the Confident-Wrong Pattern

High confidence with high error rate is your most dangerous failure mode. Some evaluation frameworks include a calibration metric—essentially measuring whether the model's expressed certainty matches its actual accuracy. A model that says "I'm not certain, but..." when it's wrong is safer than one that states incorrect facts declaratively. If you can extract uncertainty signals from model outputs (logprobs, explicit hedging language, or confidence scores in structured outputs), track them alongside accuracy.
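If you do have a usable confidence signal, one simple number to track is the error rate among high-confidence answers. The sketch below assumes each item is a (confidence, correct) pair with confidence between 0 and 1; the 0.8 cutoff is an arbitrary illustrative choice.

```python
def confident_wrong_rate(
    items: list[tuple[float, bool]], confidence_cutoff: float = 0.8
) -> float:
    """Share of high-confidence answers that were wrong (0 to 1)."""
    confident = [correct for confidence, correct in items if confidence >= confidence_cutoff]
    if not confident:
        return 0.0  # no high-confidence answers in this sample
    return sum(1 for correct in confident if not correct) / len(confident)


# (expressed confidence, was the answer correct?)
items = [
    (0.95, True),
    (0.90, False),
    (0.85, True),
    (0.40, False),  # wrong but hedged, so excluded from this metric
    (0.90, False),
]
cw_rate = confident_wrong_rate(items)
print(cw_rate)
```

This is a coarse stand-in for a full calibration metric, but it isolates exactly the pattern described above.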

Seasonality and Distribution Shift

Hallucination rates can drift without any change to your model or prompt. The cause is usually distribution shift in user queries—a new product launch means users ask about things the model wasn't trained or optimized for. Build monitoring that compares rolling 7-day and 30-day rates so drift is visible before it becomes a support incident.
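The 7-day versus 30-day comparison can be sketched like this. The data layout (a mapping from day to hallucinated and evaluated counts) and the 1.5x drift ratio are assumptions for illustration, not recommended values.

```python
from datetime import date, timedelta


def rolling_rate(
    daily: dict[date, tuple[int, int]], end: date, window_days: int
) -> float:
    """Hallucination rate (percent) over the window ending on `end`, inclusive.

    `daily` maps each day to (hallucinated_outputs, evaluated_outputs).
    """
    start = end - timedelta(days=window_days - 1)
    hallucinated = evaluated = 0
    for day, (h, n) in daily.items():
        if start <= day <= end:
            hallucinated += h
            evaluated += n
    if evaluated == 0:
        raise ValueError("no evaluated outputs in the window")
    return 100 * hallucinated / evaluated


today = date(2024, 6, 30)
# Synthetic data: 23 quieter days, then 7 recent days with more failures.
daily = {
    today - timedelta(days=offset): (10, 100) if offset < 7 else (4, 100)
    for offset in range(30)
}
rate_7 = rolling_rate(daily, today, 7)
rate_30 = rolling_rate(daily, today, 30)
drifting = rate_7 > 1.5 * rate_30  # hypothetical alert rule
print(rate_7, rate_30, drifting)
```

Pooling counts across the window, rather than averaging daily rates, keeps low-traffic days from distorting the result.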

Avoid Measuring Only What's Easy to Measure

Factual hallucinations with verifiable answers are the easiest to catch. But attribution hallucinations and instruction-drift hallucinations cause significant real-world harm and are harder to automate. If your metrics are all coming from automated detectors with no human review, you're likely undercounting total hallucination rate by a meaningful margin—possibly 2–4× in document-heavy workflows.

Building a Review Workflow

Instrumentation without a response loop is just data collection. The measurement system needs to connect to action.

A minimal working review loop looks like this:

Daily: Automated metrics computed on the production sample and benchmark suite. Alerts fire if thresholds are breached.
Weekly: A human reviewer audits 20–30 flagged outputs and 10–15 random non-flagged outputs. The random sample catches false negatives in the automated detector.
Monthly: Root cause analysis on the month's flags. Patterns get triaged: is this a retrieval problem (wrong documents being pulled), a prompt problem (the model isn't told to stay grounded), or a model capability limit?
On every prompt or retrieval change: Full benchmark suite run before the change ships.

This workflow scales down to a two-person team. The key discipline is running the benchmark before changes ship, not after something goes wrong.

Frequently Asked Questions

What's the difference between hallucination rate and accuracy?

Accuracy measures whether the model's output matches a correct answer. Hallucination rate specifically tracks outputs that contain invented or unsupported information. A model can be inaccurate without hallucinating (it may have a reasoning error rather than a fabrication), and in theory a model could hallucinate the correct answer by chance. For most operational purposes, you want both metrics—accuracy tells you whether the model is right; hallucination rate tells you whether the model is making things up.

Can I use an LLM to grade its own hallucinations?

Generally, no—self-evaluation is unreliable because the same biases that produced a hallucination tend to persist in self-review. Using a different, stronger model as the judge (LLM-as-judge) is a credible approach, but it works best when the judge model has access to the ground-truth source and a clear rubric. Without a source to compare against, even a stronger model is guessing.

How do I measure hallucinations in long-form outputs?

Long-form outputs require decomposition—breaking the response into individual claims before evaluating each. Tools like RAGAS, DeepEval, and custom GPT-4 prompts can automate this at scale. The key design choice is claim granularity: too coarse and you miss embedded errors; too fine and you're flagging paraphrases as hallucinations. A working rule of thumb is one claim per sentence for factual prose, with compound sentences split.
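As a starting point for the one-claim-per-sentence rule, a naive splitter is sketched below. It is deliberately simple: it will mis-split abbreviations such as "Inc." and it does not break up compound sentences, which is why production setups usually hand this step to a model or a dedicated sentence segmenter.

```python
import re


def split_into_claims(text: str) -> list[str]:
    """Naive decomposition: treat each sentence as one candidate claim."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if sentence]


summary = "Revenue grew 8%. The firm opened two offices. It has no debt."
claims = split_into_claims(summary)
print(claims)
```

Each resulting claim can then be verified against the source individually.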

Does a smaller context window reduce hallucinations?

It can, under specific conditions. When context is short enough that the model reads all of it reliably, faithfulness hallucinations from truncation or positional bias decrease. But a context window that's too small forces the model to rely more on parametric memory, which increases factual hallucinations. The relationship is nonlinear—see the trade-offs discussion around context windows for a fuller treatment of how window sizing interacts with output quality.

How often should I run human evaluation?

At minimum, weekly on a random sample of 20–30 outputs plus any outputs flagged by automated detectors. If you're in a high-stakes domain (legal, medical, financial), a higher sampling rate—5–10% of daily volume with human review—is worth the cost relative to the risk. Human review also serves a calibration function: it tells you how well your automated detectors are performing.

What should I do when I detect a spike in hallucination rate?

First, isolate the variable that changed: model version, prompt, retrieval configuration, or user query distribution. Check your static benchmark suite—if the rate is up on the benchmark too, the change is in the model or prompt, not the query distribution. If the benchmark is stable but production rate spiked, you're likely seeing distribution shift. Then triage by severity: outputs involving factual claims with direct consequences (pricing, policy, eligibility) get reviewed and corrected first; lower-stakes outputs get flagged for the next review cycle.
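The first triage step can be expressed as a small decision function. The inputs are changes in hallucination rate in percentage points, and the 1.0-point tolerance is a hypothetical noise floor you would set from your own historical variance.

```python
def diagnose_spike(
    benchmark_delta: float, production_delta: float, tolerance: float = 1.0
) -> str:
    """Suggest where to look first, given rate changes in percentage points."""
    if benchmark_delta > tolerance:
        # The fixed benchmark got worse, so the queries are not the cause.
        return "model_or_prompt_change"
    if production_delta > tolerance:
        # Benchmark is stable but live traffic got worse.
        return "distribution_shift"
    return "no_significant_spike"


print(diagnose_spike(benchmark_delta=0.2, production_delta=4.5))
print(diagnose_spike(benchmark_delta=3.0, production_delta=4.5))
```

This only narrows the search; retrieval configuration changes can affect both numbers and still need to be checked separately.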

Key Takeaways

Hallucination is not one thing. Factual, faithfulness, attribution, and instruction-drift hallucinations require different detectors and have different risk profiles.
Hallucination rate is the baseline metric, but it only becomes useful when paired with faithfulness score, factual precision, and refusal rate.
Instrument at the claim level, not just the response level. Claim-level logging reveals patterns that response-level logging obscures.
Layer your detectors: automated NLI or embedding checks as first pass, LLM-as-judge for flagged outputs, human review for calibration and high-stakes content.
Context window behavior is a contributing factor to faithfulness hallucinations—truncation and ordering effects matter, especially in RAG systems.
Set thresholds by use case risk, not universally. A 5% hallucination rate means something different in a brainstorming tool than in a compliance assistant.
Connect metrics to a review loop. Measurement without a response workflow is just data collection—it doesn't reduce harm.
Watch for confident-wrong outputs. High expressed confidence combined with high error rate is the most operationally dangerous pattern.



Agency Script Editorial
