If you have ever copied information out of a PDF and pasted it into a spreadsheet by hand, you already understand the problem this article solves. That copying is tedious, error-prone work, and it is exactly the kind of task a language model can do for you in seconds. The skill of getting a model to do it reliably is called prompting for data extraction, and you do not need to be a programmer or a data scientist to learn the basics.
This guide assumes you know nothing about the topic. We will define every term as it comes up, start from the simplest possible idea, and build toward something you could actually use. By the end you will understand what extraction is, why the structured output matters, and how to write a first prompt that returns clean, usable information instead of a paragraph you have to read and retype.
The mindset to hold onto is that you are giving the model a job with clear instructions, not asking it a vague question. The clearer the job, the better the result. Everything that follows is really just ways of making the job clearer.
What Data Extraction Actually Means
Before writing anything, it helps to be precise about the goal. The words sound technical but the idea is simple.
Unstructured Versus Structured Data
Unstructured data is information written for humans: an email, a contract, a customer review. It has no fixed slots. Structured data is information organized into named fields: a row in a spreadsheet with columns for name, date, and amount. Extraction is the act of reading the unstructured version and producing the structured version.
Why Structure Matters
A computer can find a word inside a paragraph, but it cannot easily sort, filter, or total what the paragraph says. With a table it can do all three instantly. When you extract the customer name and order total from a hundred emails into a spreadsheet, you turn prose that resists analysis into something you can filter, count, and analyze. That conversion is the entire point.
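For readers curious what "filter, count, and analyze" looks like once the data is structured, here is a minimal sketch in Python. The three records and their field names are made up for illustration; they stand in for what an extraction might return from three emails.

```python
# Hypothetical records, as they might look after extraction from three emails.
records = [
    {"name": "Dana", "order_date": "2024-03-03", "total": 48.50},
    {"name": "Luis", "order_date": "2024-03-05", "total": 120.00},
    {"name": "Mina", "order_date": "2024-03-04", "total": 75.25},
]

# Filter: keep only the orders above 50.
large_orders = [r for r in records if r["total"] > 50]

# Count and analyze: how many orders, and how much in total.
order_count = len(records)
revenue = sum(r["total"] for r in records)

# Sort by date (dates written as YYYY-MM-DD sort correctly as plain text).
by_date = sorted(records, key=lambda r: r["order_date"])

print(len(large_orders), order_count, revenue)
```

None of these one-line operations is possible on the original emails without first reading each one.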
Your First Extraction Prompt
The good news is that a usable first prompt is short. You give the model the text and tell it precisely what to pull out.
The Three Parts of a Good Request
Every solid extraction prompt has three pieces: the instruction (what to do), the format (how to return it), and the input (the text itself). Leave out any one and quality drops.
- Instruction: "Extract the following fields from the text below"
- Format: "Return the result as JSON with keys name, date, and total"
- Input: the actual document, pasted in or referenced
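If you later move from a chat box to a script, the three parts map naturally onto a small helper. This is only a sketch: the function name build_prompt and the sample document are assumptions for illustration, and the wording of the instruction is taken from the list above.

```python
def build_prompt(fields, document):
    """Assemble an extraction prompt from instruction, format, and input."""
    field_list = ", ".join(fields)
    instruction = "Extract the following fields from the text below."
    output_format = f"Return the result as JSON with keys {field_list}."
    # The input goes last, clearly separated from the instructions.
    return f"{instruction}\n{output_format}\n\nText:\n{document}"


# A made-up one-line document, just to show the assembled prompt.
prompt = build_prompt(
    ["name", "date", "total"],
    "Order for Dana, 3 March 2024, $48.50",
)
print(prompt)
```

Keeping the three parts as separate pieces makes it easy to change one, for example the list of fields, without disturbing the others.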
A Worked Example
Suppose you paste in a short order confirmation. Your prompt might say: "From the email below, extract the customer name, order date, and total amount. Return JSON with keys name, order_date, and total. If a field is missing, use null." The model reads the email and hands back a tidy record you can drop straight into a spreadsheet.
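To show what "drop straight into a spreadsheet" can mean in practice, the sketch below takes a reply of the kind that prompt asks for and turns it into CSV, a format any spreadsheet program can open. The reply text is invented for illustration; a real model's reply will depend on your document.

```python
import csv
import io
import json

# Hypothetical model reply for the order confirmation prompt.
model_reply = '{"name": "Dana Reyes", "order_date": "2024-03-03", "total": 48.5}'

# Turn the JSON text into a Python dictionary.
record = json.loads(model_reply)

# Write the record as CSV: one header row, then one data row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "order_date", "total"])
writer.writeheader()
writer.writerow(record)

csv_text = buffer.getvalue()
print(csv_text)
```

Here the CSV is written to an in-memory buffer so the example runs anywhere; writing to a file instead would give you something to open directly in a spreadsheet.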
Telling the Model What to Do When Information Is Missing
Beginners are often surprised when a model invents a value that was never in the document. This is the most important early lesson, so it gets its own section.
Why Models Make Things Up
A language model is built to produce plausible text. If you ask for an invoice number and the document has none, the model may supply a realistic-looking number rather than admitting it is absent. This is called hallucination, and it is the biggest risk in extraction.
The Simple Fix
Tell the model what to do when a field is missing. Add a sentence like "If a value is not present in the text, return null and do not guess." This one instruction sharply reduces fabrication, though it does not eliminate it, which is why checking the output still matters. The broader pattern of mistakes to watch for is collected in 7 Common Mistakes with Prompting for Data Extraction (and How to Avoid Them).
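One rough way to spot an invented value is to check whether it actually appears in the source text. The sketch below assumes the model copies values word for word, so it will raise false alarms on values the model reformats (a date rewritten in another style, for instance). The document, the replies, and the function name unsupported_fields are all made up for illustration.

```python
import json

document = "Thanks for your order, Dana Reyes. Your total is $48.50."

# Hypothetical reply: the document has no invoice number, so that field is null.
model_reply = '{"name": "Dana Reyes", "invoice_number": null, "total": "48.50"}'
record = json.loads(model_reply)  # JSON null becomes Python None


def unsupported_fields(record, document):
    """Return fields whose non-null value does not appear verbatim in the text."""
    return [
        key
        for key, value in record.items()
        if value is not None and str(value) not in document
    ]


honest = unsupported_fields(record, document)

# The same record, but with an invoice number the document never mentions.
invented = unsupported_fields({**record, "invoice_number": "INV-1042"}, document)

print(honest)    # []
print(invented)  # ['invoice_number']
```

A flagged field is not proof of fabrication, only a prompt to look at the original document.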
Making Results Consistent
Once your prompt works on one document, you will want it to work the same way on the next fifty. Consistency is what separates a fun experiment from a useful tool.
Show, Do Not Just Tell
Giving the model one example of an input and its correct output teaches it your conventions far better than description alone. A single example is called a one-shot example; giving several is called few-shot prompting. Include one, and the model tends to copy the pattern. The full sequence of steps for building a repeatable process is laid out in A Step-by-Step Approach to Prompting for Data Extraction.
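Here is a sketch of a prompt that carries one worked example. The example text, its output, and the function name build_prompt_with_example are invented for illustration; the point is the layout, where the example pair comes first and the real document follows in the same shape.

```python
import json

# One worked example: an input and its correct output (both made up).
example_input = "Order placed by Mr. Luis Ortega on 5 March 2024. Amount due: $120.00"
example_output = {"name": "Luis Ortega", "order_date": "2024-03-05", "total": 120.00}


def build_prompt_with_example(document):
    """Build an extraction prompt that shows one solved example first."""
    return (
        "Extract name, order_date, and total from the text. "
        "Return JSON only. If a value is not present in the text, "
        "return null and do not guess.\n\n"
        f"Example text:\n{example_input}\n"
        f"Example output:\n{json.dumps(example_output)}\n\n"
        f"Text:\n{document}\n"
        "Output:"
    )


prompt = build_prompt_with_example("Dr. Dana Reyes ordered on March 3, 2024. Total: $48.50")
print(prompt)
```

Notice how much the example says without any description: the title is dropped from the name, the date is written as year-month-day, and the total is a plain number with no currency symbol.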
Keep the Format Fixed
Always ask for the same field names and the same structure. If one run returns "total" and another returns "amount," your spreadsheet breaks. Pin the format down in the prompt and the model is far more likely to stick to it.
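If you process replies with a script, a short check can catch a drifting field name before it reaches your spreadsheet. This is a minimal sketch; the expected field names and the function name key_problems are assumptions chosen to match the earlier order example.

```python
# The field names every record is supposed to have (assumed for this example).
EXPECTED_KEYS = {"name", "order_date", "total"}


def key_problems(record):
    """Describe any difference between a record's keys and the expected set."""
    keys = set(record)
    problems = []
    if EXPECTED_KEYS - keys:
        problems.append(f"missing: {sorted(EXPECTED_KEYS - keys)}")
    if keys - EXPECTED_KEYS:
        problems.append(f"unexpected: {sorted(keys - EXPECTED_KEYS)}")
    return problems


good = {"name": "Dana Reyes", "order_date": "2024-03-03", "total": 48.5}
drifted = {"name": "Dana Reyes", "order_date": "2024-03-03", "amount": 48.5}

print(key_problems(good))     # []
print(key_problems(drifted))  # ["missing: ['total']", "unexpected: ['amount']"]
```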
Checking the Results
A model's output looks confident whether it is right or wrong, so checking is not optional even for beginners.
Read a Few by Hand
When you start, compare the extracted record to the original document for the first several items. You will quickly spot whether the model is reading a field correctly or consistently getting one wrong, which tells you what to clarify in the prompt.
Watch for Quiet Errors
The dangerous errors are the ones that look fine: a date in the wrong format, a number with a misplaced decimal, a name with a title attached. These slip through unless you look. Reviewing early builds the instinct for what to clarify, and seeing how others structured the work in A Framework for Prompting for Data Extraction helps you organize the checks.
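Some quiet errors can be caught mechanically. The sketch below checks three conventions that are assumptions for this example: dates written as YYYY-MM-DD, totals stored as plain numbers, and names without a title. The function name quiet_errors and the short list of titles are likewise illustrative.

```python
from datetime import datetime

# A few common titles to look for; extend as your documents require.
TITLES = ("Mr.", "Mrs.", "Ms.", "Dr.")


def quiet_errors(record):
    """Return a list of convention problems found in one extracted record."""
    problems = []

    # Date should parse as year-month-day. Null (None) is allowed.
    date = record.get("order_date")
    if date is not None:
        try:
            datetime.strptime(date, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append("order_date is not YYYY-MM-DD")

    # Total should be a plain number, not text like "$48.50".
    total = record.get("total")
    if total is not None:
        if isinstance(total, bool) or not isinstance(total, (int, float)):
            problems.append("total is not a number")

    # Name should not carry a title.
    name = record.get("name")
    if isinstance(name, str) and name.startswith(TITLES):
        problems.append("name includes a title")

    return problems


clean = {"name": "Dana Reyes", "order_date": "2024-03-03", "total": 48.5}
messy = {"name": "Dr. Dana Reyes", "order_date": "03/03/2024", "total": "$48.50"}

print(quiet_errors(clean))
print(quiet_errors(messy))
```

A check like this cannot tell that 4.85 should have been 48.50. A misplaced decimal still has the right type, so comparing against the original document remains necessary.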
A Simple Practice Exercise
The fastest way to internalize these ideas is to try them on a real document you already have, so the abstract steps become concrete actions.
Pick Something You Understand
Choose a document whose contents you know well, like a recent order confirmation or a short bio, so you can immediately tell whether the model got it right. Familiarity with the correct answer is what makes practice useful; you cannot judge an extraction of a document you do not understand. Start small, with three or four fields, before attempting anything elaborate.
Run, Check, Adjust
Write a prompt with the three parts, paste in your document, and read the result against the original. When a field comes back wrong or missing, ask what you could add to the instruction to fix it, then try again. This run-check-adjust loop is the entire craft in miniature, and a few rounds on a familiar document teach more than any amount of reading. The same loop, scaled up with testing and validation, is exactly what A Step-by-Step Approach to Prompting for Data Extraction describes for production work.
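The "check" step can be as simple as comparing what the model returned with the answer you already know. In this sketch the expected and extracted records are invented, and compare is a hypothetical helper name.

```python
def compare(expected, extracted):
    """Return {field: (expected, got)} for every field that differs."""
    return {
        key: (value, extracted.get(key))
        for key, value in expected.items()
        if extracted.get(key) != value
    }


# What you know the document says, written down before running the prompt.
expected = {"name": "Dana Reyes", "order_date": "2024-03-03", "total": 48.5}

# What the model returned (made up): the date is right but in the wrong format.
extracted = {"name": "Dana Reyes", "order_date": "March 3, 2024", "total": 48.5}

differences = compare(expected, extracted)
print(differences)  # {'order_date': ('2024-03-03', 'March 3, 2024')}
```

Each difference points at something to adjust. Here the natural fix is to add a sentence to the prompt asking for dates in YYYY-MM-DD form, then run it again.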
Where to Go From Here
Once a first extraction works, a few natural next steps deepen the skill without overwhelming you.
Build Toward Reliability
The leap from a one-off extraction to something you can trust on many documents comes from three additions: testing on varied examples including messy ones, validating the output, and handling the edge cases your documents contain. You do not need all of this on day one, but knowing it exists keeps your early experiments pointed in the right direction. The most common stumbles to avoid as you grow are collected in 7 Common Mistakes with Prompting for Data Extraction (and How to Avoid Them), and a structured overview of the whole topic waits in The Complete Guide to Prompting for Data Extraction.
Frequently Asked Questions
Do I need to know how to code to extract data this way?
No. You can paste text into a chat interface, write a clear instruction, and get structured output without writing any code. Coding becomes useful only when you want to process many documents automatically or validate output at scale. For learning the concepts and handling a handful of documents, plain instructions in a chat box are enough.
What does JSON mean and do I have to use it?
JSON stands for JavaScript Object Notation. Despite the name, you do not need to know JavaScript to use it: it is a simple text format that organizes information into labeled fields, written as name and value pairs inside curly braces. You do not have to use it, but it is the cleanest way to ask for structured output because it is unambiguous and easy to move into a spreadsheet or database. For a beginner, asking for JSON is the most reliable choice.
Why did the model return a value that was not in my document?
Language models are designed to produce plausible-sounding text, so when a field is missing they sometimes fill it with an invented value. The fix is to explicitly tell the model to return null and not guess when information is absent. This single instruction sharply reduces fabricated values, though it does not eliminate them, and it is the most important habit to build early.
How many examples should I give the model?
For most beginner tasks, one clear example of an input and its correct output is enough to teach the model your format and conventions. Adding a second or third example helps with tricky documents that vary a lot, but start with one, see how the model does, and add more only if results are inconsistent.
Key Takeaways
- Extraction turns unstructured text into structured fields you can search and sort
- Every good prompt has three parts: instruction, format, and the input text
- Ask for JSON output with fixed field names to keep results consistent
- Always tell the model to return null and not guess when a field is missing
- Include one worked example so the model copies your conventions
- Check the first several results by hand to catch quiet errors before trusting the output