Opinionated Habits That Make Extraction Trustworthy

Best-practice lists for prompting tend to collapse into the same vague advice: be clear, be specific, give context. None of it is wrong, and none of it tells you what to actually do when a contract has two dates or a scanned invoice has a smudged total. The practices below are different because each one is opinionated, comes from real extraction work, and carries the reasoning that earned it a place in a production pipeline.

The framing that ties them together is that extraction is a contract between you and the model. Your half of the contract is an unambiguous specification of what you want and how to handle every situation the input can throw at the model. The model's half is to fill that specification faithfully. Most extraction failures are really failures of your half, and these practices are the terms that make the contract enforceable.

Adopt them selectively if you must, but understand the trade-off you are making each time you skip one. They are ordered roughly from most to least impactful, so if you only adopt a few, adopt the early ones.

Make JSON the Default and Validate It

The format of your output determines how much downstream pain you inherit.

Why Structured Output Wins

JSON is unambiguous, parseable, and supported by the structured-output modes that many modern models offer to constrain output to valid syntax. Valid syntax is not the same as correct content, though, so specify the exact keys and types, then validate the parsed result against your schema in code. Asking for prose or loosely formatted tables instead forces brittle parsing later. The reasoning: output you cannot mechanically check is output you cannot trust, no matter how good it looks.
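As a rough illustration of validating in code, here is a minimal sketch using only the Python standard library. The schema, its field names (invoice_number, total_raw, issue_date_raw), and the validate_record helper are assumptions made up for this example; a real pipeline might use a dedicated schema library instead.

```python
import json
from typing import Any, Optional

# Hypothetical schema: field name -> Python types allowed after json.loads.
INVOICE_SCHEMA: dict[str, tuple[type, ...]] = {
    "invoice_number": (str,),
    "total_raw": (str, type(None)),
    "issue_date_raw": (str, type(None)),
}


def validate_record(
    raw_output: str, schema: dict[str, tuple[type, ...]] = INVOICE_SCHEMA
) -> tuple[Optional[dict[str, Any]], list[str]]:
    """Return (record, errors); record is None whenever errors is non-empty."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return None, [f"parse error: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["top-level value is not an object"]

    errors: list[str] = []
    for key, allowed in schema.items():
        if key not in record:
            errors.append(f"missing key: {key}")
        elif not isinstance(record[key], allowed):
            errors.append(f"wrong type for {key}: {type(record[key]).__name__}")
    # Keys the schema does not know about are treated as failures too.
    for key in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected key: {key}")

    return (None, errors) if errors else (record, [])
```

The function never tries to fix a bad record. It only reports what is wrong, which leaves the accept-or-reject decision to the caller.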

Reject, Do Not Repair

When a record fails validation, reject it and flag it rather than patching it. A repaired record hides the underlying problem and lets it recur. Rejection surfaces the failure where you can fix the root cause.

Treat Examples as Specification

A worked example does more than illustrate; it specifies behavior the prose cannot.

Encode the Hard Case

Choose an example that shows how to handle a missing field or a document with competing values, because that is what the model will copy. An example of a clean document teaches nothing about the cases that actually break. The full step-by-step for selecting examples is in A Step-by-Step Approach to Prompting for Data Extraction.

One example covers the common case and the most frequent edge case
Additional examples earn their place only when a specific failure recurs
Keep examples in sync with the schema; a stale example teaches stale conventions
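One way to picture an example that encodes the hard case is the sketch below. The field names, the sample invoice text, and the build_prompt helper are all hypothetical; the point is only that the example shows two competing amounts and a missing due date, and that the code refuses to build a prompt when the example has drifted from the schema.

```python
import json

# Hypothetical schema keys; the example output must use exactly these.
SCHEMA_KEYS = ("vendor", "total_raw", "due_date_raw")

# Hard case on purpose: two amounts appear and no due date is printed.
EXAMPLE_INPUT = (
    "ACME Supply\n"
    "Subtotal: $90.00\n"
    "Total: $97.20\n"
    "(no due date printed)"
)
EXAMPLE_OUTPUT = {"vendor": "ACME Supply", "total_raw": "$97.20", "due_date_raw": None}


def build_prompt(document: str) -> str:
    """Assemble an extraction prompt containing one worked example."""
    # Fail fast if the example no longer matches the schema.
    if set(EXAMPLE_OUTPUT) != set(SCHEMA_KEYS):
        raise ValueError("example output is out of sync with the schema")
    return (
        "Extract the fields below as JSON with keys: " + ", ".join(SCHEMA_KEYS) + ".\n\n"
        "Example document:\n" + EXAMPLE_INPUT + "\n\n"
        "Example output:\n" + json.dumps(EXAMPLE_OUTPUT) + "\n\n"
        "Document:\n" + document + "\n\n"
        "Output:"
    )
```

Because json.dumps renders None as null, the example output shows the model the exact missing-value convention rather than describing it.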

Extract Raw, Transform Later

The instinct to have the model clean values during extraction is a trap worth naming explicitly.

Why Separation Beats Convenience

A model normalizing currency or dates can introduce silent errors that no validation catches because the result looks valid. Extracting the raw value preserves fidelity, and a separate code step performs transformation with logic you can test and reason about. The reasoning: you can debug code; you cannot debug a model's silent decision to drop a currency symbol.
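To make the separation concrete, here is a small sketch of a transformation step that runs on the raw extracted string. The symbol table, the parse_amount name, and the assumption of US-style thousands separators are illustrative choices, not a general currency parser.

```python
from decimal import Decimal, InvalidOperation

# Assumed mapping from symbols to ISO codes; extend it for your own data.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}


def parse_amount(raw: str) -> tuple[Decimal, str]:
    """Parse a raw string such as "$1,200.50" into (amount, currency code).

    Raises ValueError instead of guessing when the input is not understood.
    Assumes a leading symbol and commas as thousands separators.
    """
    text = raw.strip()
    if not text or text[0] not in CURRENCY_SYMBOLS:
        raise ValueError(f"no recognized currency symbol in {raw!r}")
    currency = CURRENCY_SYMBOLS[text[0]]
    digits = text[1:].strip().replace(",", "")
    try:
        amount = Decimal(digits)
    except InvalidOperation:
        raise ValueError(f"unparseable amount in {raw!r}") from None
    if not amount.is_finite():
        raise ValueError(f"non-finite amount in {raw!r}")
    return amount, currency
```

A missing currency symbol becomes a loud error here, where a unit test can pin the behavior down, instead of a silent decision made inside the model.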

Write Rules for Every Ambiguity Your Data Contains

Ambiguity the prompt does not address becomes randomness in the output.

Disambiguate Deterministically

For every field that can have multiple candidates, write a selection rule. "If several amounts appear, return the one labeled total" converts a guess into a deterministic choice. The cost of omitting these rules is detailed in 7 Common Mistakes with Prompting for Data Extraction (and How to Avoid Them).

Define Absence Explicitly

State what a missing field returns, every time. Pick one convention (null, an empty string, or a sentinel like "not found") and apply it consistently. An unspecified gap is where fabrication creeps in.
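The two kinds of rule can be kept as data and rendered into the prompt, as in the sketch below. Everything here is an assumption for illustration: the FieldRule type, the field names, the second selection rule, the choice of null as the missing-value convention, and the list of stand-in strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    name: str
    selection_rule: str  # what to return when several candidates appear


# Hypothetical fields; the contract_date_raw rule is invented for this sketch.
RULES = [
    FieldRule("total_raw", "If several amounts appear, return the one labeled total."),
    FieldRule("contract_date_raw", "If several dates appear, return the one labeled effective date."),
]

# Assumed convention for this sketch: missing fields come back as null.
ABSENCE_RULE = "If a field is not present in the document, return null for it. Never guess."


def render_rules(rules: list[FieldRule] = RULES) -> str:
    """Render the selection and absence rules as plain prompt text."""
    lines = [f"- {r.name}: {r.selection_rule}" for r in rules]
    lines.append(f"- All fields: {ABSENCE_RULE}")
    return "\n".join(lines)


def absence_violations(record: dict) -> list[str]:
    """Return fields that use a string stand-in for "missing" instead of null."""
    stand_ins = {"", "n/a", "none", "not found", "unknown"}
    return [
        key
        for key, value in record.items()
        if isinstance(value, str) and value.strip().lower() in stand_ins
    ]
```

Checking the output for stray stand-ins is a cheap way to confirm the model is following the one convention the prompt asked for.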

Match the Model to the Difficulty

Defaulting to the most capable model for everything wastes money; defaulting to the cheapest wastes accuracy.

Calibrate by Input Quality

Use larger models for messy, varied input where reasoning matters and smaller, cheaper models for clean, structured documents. Routing documents by difficulty optimizes cost without sacrificing reliability where it counts. The trade-offs across the tooling landscape appear in The Best Tools for Prompting for Data Extraction.
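Routing by difficulty could look something like the following sketch. The difficulty signals (is_scanned, ocr_confidence, known_template), the 0.9 threshold, and the tier names are placeholders chosen for the example, not real model identifiers or recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentInfo:
    is_scanned: bool        # the text came through OCR
    ocr_confidence: float   # 0.0 to 1.0, assumed to be supplied upstream
    known_template: bool    # layout matches a template seen before


# Placeholder tier names; substitute whatever models you actually use.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"


def choose_model(doc: DocumentInfo, min_ocr_confidence: float = 0.9) -> str:
    """Send clean, familiar documents to the small tier and the rest to the large one."""
    clean_text = (not doc.is_scanned) or doc.ocr_confidence >= min_ocr_confidence
    if doc.known_template and clean_text:
        return SMALL_MODEL
    return LARGE_MODEL
```

The rule is deliberately conservative: anything unfamiliar or low-confidence goes to the larger tier, so cost savings come only from documents that are easy on both counts.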

Instrument Everything

A pipeline you cannot observe is a pipeline you cannot trust over time.

Log, Measure, Alert

Log every input and output, track parse-failure and validation-failure rates as standing metrics, and alert when they move. Extraction quality drifts as input distributions shift, and instrumentation is how you catch the drift before it becomes a backlog of bad records.
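A minimal sketch of the two standing metrics, tracked over a sliding window, is shown below. The FailureRateMonitor class, the window size, and the 5% alert threshold are assumptions for illustration; in practice these numbers would feed whatever metrics system you already run.

```python
from collections import deque

OUTCOMES = ("ok", "parse_failure", "validation_failure")


class FailureRateMonitor:
    """Track failure rates over the most recent `window` records."""

    def __init__(self, window: int = 500, alert_threshold: float = 0.05):
        self.outcomes: deque = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, outcome: str) -> None:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        self.outcomes.append(outcome)

    def rate(self, outcome: str) -> float:
        """Share of recent records with this outcome (0.0 when empty)."""
        if not self.outcomes:
            return 0.0
        return sum(o == outcome for o in self.outcomes) / len(self.outcomes)

    def alerts(self) -> list[str]:
        """Names of the failure metrics currently above the threshold."""
        return [
            name
            for name in ("parse_failure", "validation_failure")
            if self.rate(name) > self.alert_threshold
        ]
```

A sliding window keeps the rate responsive to recent drift, where an all-time average would be dominated by history.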

Keep Prompts Versioned and Reviewable

A prompt is code, and the practices that keep code maintainable apply to it directly. Teams that treat prompts as throwaway text accumulate a pile of slightly different versions with no record of which one is in production or why it changed.

Version Control Your Prompts

Store prompts in version control alongside the validation code that depends on them, so a change to the schema and the corresponding change to the prompt move together. When a validation-failure rate spikes, version history tells you exactly what changed and when, turning a mystery into a diff. The reasoning: a prompt you cannot trace is a prompt you cannot safely change.
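Version control answers what changed. To connect a given output back to the prompt that produced it, one option is to stamp each record with a short fingerprint of the prompt text. The sketch below assumes that approach; the prompt_version and tag_output names and the _prompt_version key are hypothetical.

```python
import hashlib


def prompt_version(prompt_text: str) -> str:
    """Short, stable fingerprint of the exact prompt text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def tag_output(record: dict, prompt_text: str) -> dict:
    """Return a copy of the record stamped with the prompt fingerprint."""
    return {**record, "_prompt_version": prompt_version(prompt_text)}
```

When a failure rate spikes, grouping logged records by this fingerprint shows whether the spike began with a particular prompt revision.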

Review Prompt Changes Like Code

A small wording change can shift extraction behavior in ways that are hard to predict, so subject prompt edits to the same review you would give a code change, including a rerun against your sample set. This catches regressions before they reach production. The step-by-step revision discipline that supports this lives in A Step-by-Step Approach to Prompting for Data Extraction.
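The rerun against a sample set can be as simple as the sketch below, which compares fresh output to stored expected records and reports field-level differences. The regression_report function and the shape of the sample set are assumptions; extract stands in for whatever function calls your model with the edited prompt.

```python
from typing import Callable


def regression_report(
    extract: Callable[[str], dict],
    samples: list[tuple[str, dict]],
) -> list[dict]:
    """Run extract over (document, expected) pairs and list every mismatch.

    Each diff entry maps a field name to an (expected, actual) pair.
    """
    failures = []
    for index, (document, expected) in enumerate(samples):
        actual = extract(document)
        if actual == expected:
            continue
        diff = {
            key: (expected.get(key), actual.get(key))
            for key in sorted(expected.keys() | actual.keys())
            if expected.get(key) != actual.get(key)
        }
        failures.append({"sample": index, "diff": diff})
    return failures
```

An empty report is the condition for merging a prompt change. A non-empty one tells the reviewer exactly which documents and fields shifted.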

Design for Graceful Failure

No extraction pipeline succeeds on every document, and the difference between a robust system and a fragile one is what happens when extraction fails.

Route Failures, Do Not Drop Them

When a record fails validation, send it to a human-review queue rather than discarding it or forcing it through. A dropped document becomes a missing record that surfaces only during an audit, long after the cause is forgotten. A routed failure gets resolved while the context is fresh. The reasoning: the cost of a missed document is almost always higher than the cost of reviewing a flagged one.
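In code, the rule amounts to making sure every document lands in exactly one of two places. The sketch below uses in-memory lists as stand-ins for a real datastore and review queue; the Router class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Router:
    """Every document ends up either accepted or queued for human review."""

    accepted: list = field(default_factory=list)
    review_queue: list = field(default_factory=list)

    def route(self, document_id: str, record: Optional[dict], errors: list[str]) -> str:
        if errors or record is None:
            # Keep the errors and any partial record so the reviewer has context.
            self.review_queue.append(
                {"document_id": document_id, "errors": list(errors), "record": record}
            )
            return "review"
        self.accepted.append({"document_id": document_id, "record": record})
        return "accepted"
```

Because route has no branch that discards a document, the count of accepted plus queued items always equals the count of documents processed, which is easy to assert in monitoring.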

Make the Failure Visible

Surface failure rates on a dashboard the team actually watches, not buried in logs no one reads. A pipeline whose health is invisible degrades silently until a downstream report breaks. Visible metrics turn a slow, quiet decline into an early, actionable signal, which is the whole point of instrumentation. This operational discipline complements the audit cadence detailed in The Prompting for Data Extraction Checklist for 2026.

Frequently Asked Questions

Why insist on JSON over a simpler format?

JSON is unambiguous and mechanically parseable, and many modern models offer a structured-output mode that constrains output to valid syntax. That lets you validate every record in code against a defined schema, which is the only reliable way to catch malformed or fabricated output. Simpler formats like loose tables read fine to humans but force brittle parsing and make programmatic validation harder, which is where errors slip through.

How many examples is too many?

There is no fixed limit, but each example adds length, cost, and a chance for inconsistency if examples conflict. Start with one that covers the common case and the most frequent edge case. Add another only when a specific failure recurs that the existing example does not address. Padding the prompt with many redundant examples rarely helps and can dilute the conventions you most want the model to follow.

Should I always use the most capable model available?

No. The most capable model is the right choice for messy, highly varied input where reasoning matters, but it is overkill and expensive for clean, well-structured documents that a smaller model handles reliably. Matching model capability to input difficulty, and routing documents accordingly, gives you the accuracy you need where it counts without paying premium rates for easy extractions.

What metrics should I actually monitor?

Track parse-failure rate (how often output cannot be parsed as valid JSON) and validation-failure rate (how often parsed output fails your schema checks). Both should be low and stable; a rising trend signals that your input changed or the model behaved differently. Logging every input and output alongside these metrics lets you investigate the cause quickly rather than discovering a problem only when a downstream report breaks.

Key Takeaways

Default to JSON output and validate every record against a schema in code, rejecting failures rather than patching them
Use worked examples as specification, encoding the hard edge case the model will copy
Extract raw values and transform in testable code rather than asking the model to normalize
Write explicit disambiguation and missing-value rules for every ambiguity your data contains
Match model capability to input difficulty and route documents accordingly
Instrument the pipeline by logging inputs, outputs, and failure rates, and alert when they drift

Agency Script Editorial
