Extraction is one of those topics where confident folklore outpaces practice. Because pulling data out of a document looks simple, people form strong intuitions about how it should work, what it costs, and what it takes to do well. Many of those intuitions are wrong in ways that lead to brittle pipelines, wasted budget, and misplaced trust. The myths persist because each contains a grain of truth that makes it feel right.
Untangling the folklore matters because the misconceptions are not harmless. Believing extraction is a solved, set-and-forget task leads teams to skip the monitoring that catches silent failures. Believing it requires fine-tuning leads to weeks of unnecessary labeling. Each myth has a concrete cost, and replacing it with the accurate picture changes how you build.
This article takes the most common misconceptions one at a time and lays out what is actually true, and why.
A useful way to read what follows is to notice the shared shape of these myths. Almost all of them stem from judging extraction by its best case rather than its worst case — by the clean demo, the easy field, the document that happened to look familiar. Language-model extraction is defined by its tail behavior, and every myth here is a way of looking away from the tail. Once you internalize that the interesting question is always what happens on the documents you did not anticipate, the folklore starts to fall apart on its own.
Myth: Once It Works, It Keeps Working
This is the most expensive belief in extraction, because it sounds so reasonable.
The accurate picture
Input formats drift. New vendors, new templates, new document layouts arrive constantly, and a pipeline tuned on last quarter's documents quietly degrades on this quarter's. The decay is rarely dramatic enough to notice from a single output; it accumulates as a slow slide in accuracy that nobody is watching for. Extraction is not a fixed function over a fixed input; it is a function over a moving input distribution. Treating it as set-and-forget guarantees silent decay, which is exactly the unmonitored-accuracy risk detailed in Silent Failures That Make Extraction Pipelines Dangerous. The accurate model is continuous monitoring, not one-time validation.
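One way to picture continuous monitoring is a rolling accuracy window fed by periodic audits. The sketch below assumes a sample of production outputs is checked against ground truth and each check is recorded as correct or incorrect; the class name `RollingAccuracy`, the window size, and the alert threshold are all hypothetical choices for illustration.

```python
from collections import deque


class RollingAccuracy:
    """Tracks accuracy over the most recent audited extractions."""

    def __init__(self, window: int, alert_below: float) -> None:
        self.results = deque(maxlen=window)  # oldest audits drop off automatically
        self.alert_below = alert_below

    def record(self, correct: bool) -> None:
        self.results.append(correct)

    def accuracy(self) -> float:
        # With no audits yet there is nothing to report; return 1.0 so we do not alert.
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def should_alert(self) -> bool:
        # Alert only once the window is full, to avoid noise from tiny samples.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.accuracy() < self.alert_below


monitor = RollingAccuracy(window=4, alert_below=0.9)

for outcome in [True, True, True, True]:    # audits on familiar templates
    monitor.record(outcome)
alert_before = monitor.should_alert()

for outcome in [False, True, False, True]:  # audits after a new template arrives
    monitor.record(outcome)
alert_after = monitor.should_alert()
```

No single output in the second batch looks alarming on its own; the alert comes from the trend across the window, which is the signal a one-time validation never sees.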
Myth: You Need to Fine-Tune for Good Accuracy
Teams burn weeks on this one before testing the cheaper path.
The accurate picture
Modern models with schema constraints and a few well-chosen examples reach high accuracy on most extraction tasks without any fine-tuning. Fine-tuning earns its place only at high volume with critical accuracy and a stable, labelable input distribution: a specific corner, not a default. Reaching for it first usually means spending weeks to recover accuracy that prompting could have delivered in an afternoon. The honest decision rule is in Choosing Between Few-Shot, Schema, and Fine-Tuned Extraction.
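As a rough illustration of the cheaper path, the sketch below builds a few-shot prompt from a small schema and checks a model's raw reply against that schema. It does not call any model. The field names, the single example, and the helpers `build_prompt` and `validate` are assumptions made up for this sketch, and the check is a simple after-the-fact validation rather than any particular library's schema enforcement.

```python
import json

# Hypothetical schema: field name -> Python types allowed after JSON parsing.
SCHEMA = {
    "vendor": (str,),
    "invoice_date": (str, type(None)),
    "total": (int, float, type(None)),
}

# Hypothetical few-shot example: (document text, expected output).
EXAMPLES = [
    (
        "Invoice from Acme Corp. Total due: 120.50",
        {"vendor": "Acme Corp", "invoice_date": None, "total": 120.5},
    ),
]


def build_prompt(document: str) -> str:
    """Assemble instructions, worked examples, and the target document."""
    lines = [
        "Extract these fields as a JSON object. Use null when a field is absent.",
        "Fields: " + ", ".join(SCHEMA),
    ]
    for text, expected in EXAMPLES:
        lines.append("Document: " + text)
        lines.append("Output: " + json.dumps(expected))
    lines.append("Document: " + document)
    lines.append("Output:")
    return "\n".join(lines)


def validate(raw: str) -> list:
    """Return a list of schema violations; an empty list means the reply conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["not a JSON object"]
    problems = []
    for field, types in SCHEMA.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], types):
            problems.append(f"wrong type for {field}")
    for field in data:
        if field not in SCHEMA:
            problems.append(f"unexpected field: {field}")
    return problems


prompt = build_prompt("Invoice from Globex. Total due: 80.00")
good = validate('{"vendor": "Acme", "invoice_date": null, "total": 12.5}')
bad = validate('{"vendor": "Acme", "total": "12.5"}')
```

Here `good` is an empty list, while `bad` reports the missing `invoice_date` and the string-typed `total`. A reply that fails validation can be retried or routed to review before anyone considers labeling data for fine-tuning.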
Myth: A Passing Demo Means It Works
The demo is the most misleading artifact in extraction.
The accurate picture
Demos run on clean, hand-picked documents — exactly the cases that do not break. Production faces the long tail of messy formats and edge cases the demo never touched. A pipeline that looks flawless on five documents can be wrong a third of the time at scale. The only honest measure of "it works" is field-level accuracy on a representative gold set, the discipline laid out in How to Measure Prompting for Data Extraction: Metrics That Matter.
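Field-level accuracy on a gold set can be computed in a few lines. The sketch below assumes predictions and gold records are dictionaries aligned by position and that exact equality is the right comparison; real fields often need normalization first. The function name `field_accuracy` and the sample records are illustrative only.

```python
from collections import defaultdict


def field_accuracy(predictions: list, gold: list) -> dict:
    """Per-field fraction of documents where the prediction equals the gold value."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, truth in zip(predictions, gold):
        for field, expected in truth.items():
            total[field] += 1
            # A field missing from the prediction is treated as None.
            if pred.get(field) == expected:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


gold = [
    {"vendor": "Acme Corp", "total": 120.5},
    {"vendor": "Globex", "total": 80.0},
    {"vendor": "Initech", "total": None},   # total is genuinely absent
]
predictions = [
    {"vendor": "Acme Corp", "total": 120.5},
    {"vendor": "Globex", "total": 8.0},     # wrong amount
    {"vendor": "Initech", "total": 45.0},   # invented value
]

scores = field_accuracy(predictions, gold)
```

In this toy set every vendor is right while the total is right in only one document of three. Reporting the two fields separately shows a failure that a glance at the well-formed outputs would miss.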
Myth: The Model Will Tell You When It Is Unsure
People assume the model fails loudly. It fails quietly.
The accurate picture
By default, a model handed a document missing a field will often invent a plausible value rather than flag uncertainty. It produces confident, well-formed, wrong output unless you explicitly design for absence and uncertainty. Silence is not a signal of correctness; it is the absence of any signal. You have to build the confidence and null-handling in deliberately — the model will not volunteer it.
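A minimal sketch of designing for absence and uncertainty follows. It assumes the prompt asks the model to return, for each field, a value (or null) and a self-reported confidence between 0 and 1. That output shape, the `REVIEW_BELOW` cut-off, and the `route` helper are assumptions for illustration. Self-reported confidence is best treated as a rough triage signal, not a calibrated probability.

```python
REVIEW_BELOW = 0.8  # hypothetical cut-off; tune it against audited data


def route(extracted: dict) -> dict:
    """Split fields into accepted values, explicit nulls, and values needing review."""
    routed = {"accepted": {}, "absent": [], "review": []}
    for field, cell in extracted.items():
        if cell["value"] is None:
            # An explicit null is a legitimate answer: the field is not in the document.
            routed["absent"].append(field)
        elif cell["confidence"] < REVIEW_BELOW:
            routed["review"].append(field)
        else:
            routed["accepted"][field] = cell["value"]
    return routed


extracted = {
    "vendor": {"value": "Acme Corp", "confidence": 0.97},
    "po_number": {"value": None, "confidence": 0.9},
    "total": {"value": 120.5, "confidence": 0.55},
}

routed = route(extracted)
```

With this input, `vendor` is accepted, `po_number` is recorded as absent, and `total` goes to review. Every field ends up in one of the three buckets, so nothing is silently trusted.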
Myth: Higher Accuracy Is Always Worth Chasing
The pursuit of a perfect number wastes real resources.
The accurate picture
Accuracy has a cost, and not every field needs the same level. A category tag is fine at ninety percent; a payment amount needs far more. Spending effort to push an unimportant field from ninety-five to ninety-nine is waste, while leaving a critical field under target is negligence. The right target is set per field by business consequence, not chased uniformly — the cost of an error in that field is what tells you how much accuracy it actually warrants.
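Per-field targets can be made explicit in code so that measured accuracy is compared against what each field warrants. In the sketch below the target values, the `slack` margin, and the `triage` function are hypothetical; the real numbers should come from the business cost of an error in each field.

```python
# Hypothetical targets, set by the cost of an error in each field.
TARGETS = {"category": 0.90, "payment_amount": 0.995}


def triage(measured: dict, targets: dict, slack: float = 0.03) -> dict:
    """Label each field as below, on, or comfortably above its own target."""
    status = {}
    for field, target in targets.items():
        score = measured[field]
        if score < target:
            status[field] = "below target"
        elif score > target + slack:
            status[field] = "above target, stop investing"
        else:
            status[field] = "on target"
    return status


measured = {"category": 0.95, "payment_amount": 0.97}
status = triage(measured, TARGETS)
```

Although `payment_amount` scores higher than `category` in absolute terms, it is the one flagged as below target, while `category` is flagged as a place to stop spending effort. A single uniform threshold would have reported the opposite priority.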
Myth: More Examples Always Improve Accuracy
People treat few-shot examples as a dial that only turns up.
The accurate picture
Examples help most when they cover the cases the model gets wrong, and they help little or not at all when they merely repeat cases it already handles. Worse, examples consume context and cost, and a prompt stuffed with redundant easy cases can crowd out the document itself or dilute the model's attention. The right move is to curate examples for the failing tail and stop adding them once accuracy stops moving, not to pile on examples in the hope that more is always better. This targeted approach is exactly how the deeper techniques in Edge Cases, Confidence, and Multi-Pass Extraction Tactics spend their example budget.
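The curation idea can be sketched as a greedy loop: try each candidate example, keep it only if measured accuracy rises by a meaningful margin, and skip it otherwise. The `curate_examples` function and the `min_gain` threshold are assumptions for illustration, and `fake_evaluate` is a stand-in for running the prompt against a gold set, not a real measurement.

```python
def curate_examples(candidates: list, evaluate, min_gain: float = 0.01) -> tuple:
    """Keep only the candidates that raise accuracy by at least min_gain."""
    chosen = []
    best = evaluate(chosen)
    for candidate in candidates:
        score = evaluate(chosen + [candidate])
        if score - best < min_gain:
            continue  # redundant example: costs context without moving accuracy
        chosen.append(candidate)
        best = score
    return chosen, best


def fake_evaluate(examples: list) -> float:
    # Stand-in for a gold-set run: each distinct failure case covered by an
    # example adds 0.1 accuracy on top of a 0.7 baseline. Easy cases add nothing.
    covered = {ex["covers"] for ex in examples if ex["covers"] != "easy"}
    return 0.7 + 0.1 * len(covered)


candidates = [
    {"name": "clean_invoice", "covers": "easy"},
    {"name": "handwritten_total", "covers": "handwriting"},
    {"name": "another_clean_invoice", "covers": "easy"},
    {"name": "multi_currency", "covers": "currency"},
    {"name": "second_handwritten", "covers": "handwriting"},
]

chosen, final_accuracy = curate_examples(candidates, fake_evaluate)
chosen_names = [ex["name"] for ex in chosen]
```

Under these made-up numbers only `handwritten_total` and `multi_currency` survive: the easy cases and the duplicate handwriting case are dropped because they leave accuracy unchanged. In practice each call to `evaluate` means a full gold-set run, so the candidate list should itself be drawn from observed failures.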
Myth: Extraction Is Too Simple to Need Real Skill
The simplicity is a surface illusion.
The accurate picture
The basic loop is easy, which is exactly why people underestimate it. Doing it reliably requires judgment about trade-offs, discipline around measurement, and engineering for the long tail of edge cases — none of which a tutorial hands you. The gap between a working demo and a trustworthy production pipeline is precisely the skill that the myth dismisses, and it is mapped end to end in The Complete Guide to Prompting for Data Extraction. The same illusion of simplicity is why extraction work is chronically under-resourced: because it looks like it should take an afternoon, teams budget an afternoon, and then spend months patching the pipeline they shipped before it was ready.
Frequently Asked Questions
Is fine-tuning ever the right call?
Yes, but only in a narrow corner: high volume, critical accuracy, and a stable input distribution you can label. For most tasks, schema constraints plus a few examples reach high accuracy without it. Defaulting to fine-tuning wastes weeks recovering accuracy that prompting could deliver immediately.
If the model does not flag uncertainty, how do I know it is wrong?
You build the signal yourself: instruct the model to return null for missing fields, add confidence reporting for triage, and run deterministic consistency checks. The model will not volunteer doubt, so detecting silent errors is a system you design rather than a feature you receive.
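Deterministic consistency checks are the cheapest part of that system to build. The sketch below assumes an invoice-like record with `line_items`, `total`, `invoice_date`, and `due_date` fields and ISO 8601 date strings; those field names, the tolerance, and the `consistency_errors` helper are hypothetical.

```python
def consistency_errors(record: dict, tolerance: float = 0.01) -> list:
    """Return rule violations that need no model and no ground truth to detect."""
    errors = []

    items = record.get("line_items") or []
    total = record.get("total")
    if total is not None and items:
        items_sum = sum(item["amount"] for item in items)
        if abs(items_sum - total) > tolerance:
            errors.append("line items do not sum to total")

    issued = record.get("invoice_date")
    due = record.get("due_date")
    # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings.
    if issued and due and due < issued:
        errors.append("due date precedes invoice date")

    return errors


consistent = {
    "line_items": [{"amount": 100.0}, {"amount": 20.5}],
    "total": 120.5,
    "invoice_date": "2024-03-01",
    "due_date": "2024-03-31",
}
inconsistent = {
    "line_items": [{"amount": 100.0}, {"amount": 20.5}],
    "total": 125.0,
    "invoice_date": "2024-03-01",
    "due_date": "2024-02-15",
}

clean_result = consistency_errors(consistent)
flagged_result = consistency_errors(inconsistent)
```

Checks like these catch a subset of silent errors. A record can pass every rule and still be wrong, so they complement a gold set and do not replace it.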
Why is a passing demo not proof?
Demos use clean, hand-picked documents that do not represent the messy long tail of production. Real proof is field-level accuracy measured on a representative gold set, including the ugly formats. Without that, a flawless demo can mask a pipeline that fails a third of the time at scale.
Should I always aim for the highest possible accuracy?
No. Accuracy costs effort, and the right target depends on each field's business consequence. Over-investing in an unimportant field is waste; under-serving a critical one is negligence. Set per-field targets by impact rather than chasing a single uniform number.
Key Takeaways
- Extraction is not set-and-forget; input formats drift, so continuous monitoring beats one-time validation.
- Schema constraints plus a few examples reach high accuracy without fine-tuning in most cases.
- A passing demo proves nothing; only field-level accuracy on a representative gold set does.
- The model will not flag uncertainty by default — you must design null-handling and confidence signals in.
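- Set accuracy targets per field by business consequence rather than chasing one uniform number.
- Curate few-shot examples for the cases the model gets wrong, and stop adding them once accuracy stops moving.
- Reliable extraction is a real skill, not a trivial one.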