You have read the theory, and now you want a procedure you can run start to finish. This article is that procedure. It lays out the steps to take a document, prompt a model to extract knowledge-graph triples from it, and end up with validated structured data ready to load into a graph. Each step is concrete and sequential—do this, then this, then this.
The process assumes you have access to a capable language model and a way to send it prompts, whether through a chat interface for experimentation or an API for automation. It does not assume you have built extraction pipelines before. Follow the steps in order the first time; once you understand them you can adapt.
We will use a running example throughout: extracting facts about companies, their founders, and their acquisitions from business news articles. Keeping one example across all steps makes the flow easier to follow.
Step 1: Define Your Schema
List Entity Types
Write down the kinds of things you care about. For our example: Person, Organization. Keep the list short. Every entity type you add is more for the model to track and more chances for inconsistency.
List Relation Types
Write down the connections you care about. For our example: founded, acquired. Give each a one-line definition so the model resolves ambiguity your way—"acquired: the subject organization purchased the object organization." This list is your closed vocabulary; the model may use nothing outside it.
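One way to keep the schema in a single place is to hold it as plain data that both the prompt and the validation code read from. The sketch below is only one possible layout: the names `ENTITY_TYPES`, `RELATION_TYPES`, and `schema_as_text` are illustrative, and the wording of the founded definition is an assumption written in the same style as the acquired definition above.

```python
# Illustrative schema for the running example; names and layout are assumptions.
ENTITY_TYPES = ("Person", "Organization")

# Closed vocabulary of relations, each with a one-line definition.
RELATION_TYPES = {
    # Assumed wording, modeled on the "acquired" definition.
    "founded": "the subject person founded the object organization",
    "acquired": "the subject organization purchased the object organization",
}


def schema_as_text() -> str:
    """Render the schema as the text block that goes into the prompt."""
    lines = ["Entity types: " + ", ".join(ENTITY_TYPES), "Relation types:"]
    for name, definition in RELATION_TYPES.items():
        lines.append(f"- {name}: {definition}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(schema_as_text())
```

Keeping the definitions next to the names means a change to the vocabulary shows up in the prompt and in the conformance check at the same time.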
Step 2: Write the Extraction Prompt
Assemble the Core Blocks
Build the prompt in four parts: the schema (your entity and relation types with definitions), the task instruction ("extract all triples that match this schema from the text below"), the grounding rule (extract only relationships that the text itself states, each backed by a quoted source span), and the output format. Putting these in a consistent order makes the prompt easy to maintain.
State the Output Contract Precisely
Tell the model to return a JSON array where each object has subject, subject_type, predicate, object, object_type, and source_span. Instruct it to output only the JSON with no surrounding text. A precise contract means your code can parse the result without cleanup. The full anatomy of these blocks is laid out in Turning Unstructured Text Into Connected Entity Graphs.
Add One Worked Example
Include a short input paragraph and its correct JSON output inside the prompt. One concrete example teaches format more reliably than any description. Choose an example that exercises every field, including the source span.
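The blocks described in this step can be assembled with plain string handling. What follows is a minimal sketch, not a recommended wording: every block's text is an assumption, the field names are written with underscores (subject_type, object_type), and the worked example uses invented names (Nora Finch, Lumen Labs) that do not come from any real article.

```python
import json

# Hypothetical wording for each block; adapt it to your own schema and model.
SCHEMA_BLOCK = (
    "Entity types: Person, Organization\n"
    "Relation types:\n"
    "- founded: the subject person founded the object organization\n"
    "- acquired: the subject organization purchased the object organization"
)
TASK_BLOCK = "Extract all triples that match this schema from the text below."
GROUNDING_BLOCK = (
    "Only extract relationships that the text states. "
    "Each source_span must be an exact quote from the text."
)
FORMAT_BLOCK = (
    "Return a JSON array. Each object has the keys subject, subject_type, "
    "predicate, object, object_type, and source_span. "
    "Output only the JSON, with no surrounding text."
)

# Invented worked example that exercises every field.
EXAMPLE_INPUT = "Nora Finch founded Lumen Labs in a garage."
EXAMPLE_OUTPUT = [
    {
        "subject": "Nora Finch",
        "subject_type": "Person",
        "predicate": "founded",
        "object": "Lumen Labs",
        "object_type": "Organization",
        "source_span": "Nora Finch founded Lumen Labs",
    }
]


def build_prompt(text: str) -> str:
    """Join the blocks in a fixed order, ending with the text to process."""
    example = (
        f"Example input:\n{EXAMPLE_INPUT}\n"
        f"Example output:\n{json.dumps(EXAMPLE_OUTPUT, indent=2)}"
    )
    blocks = [
        SCHEMA_BLOCK,
        TASK_BLOCK,
        GROUNDING_BLOCK,
        FORMAT_BLOCK,
        example,
        f"Text:\n{text}",
    ]
    return "\n\n".join(blocks)
```

Generating the example output with `json.dumps` rather than typing it by hand guarantees that the example the model sees is itself valid JSON.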
Step 3: Prepare Your Documents
Clean and Segment the Text
Strip boilerplate—navigation, ads, repeated headers—so the model focuses on content. Then split long documents into chunks that fit comfortably within the model's context window, leaving room for the prompt itself.
Overlap Chunks to Preserve Relationships
When you split, overlap consecutive chunks by a sentence or two. A relationship described across a chunk boundary would otherwise be lost. Overlap gives the model a chance to see both halves of such a statement.
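Sentence-level chunking with overlap might look like the sketch below. It assumes a naive regular-expression sentence splitter, which will break on abbreviations such as "Inc.", and a word budget per chunk; the function names and the default values for `max_words` and `overlap` are illustrative, not prescribed.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Abbreviations such as "Inc." will be split incorrectly; a real pipeline
    would likely use a proper sentence segmenter.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(
    sentences: list[str], max_words: int = 500, overlap: int = 2
) -> list[str]:
    """Group sentences into chunks under a word budget, repeating the last
    `overlap` sentences of each chunk at the start of the next one."""
    chunks = []
    start = 0
    while start < len(sentences):
        end = start
        words = 0
        # Always take at least one sentence, then keep adding while under budget.
        while end < len(sentences):
            n = len(sentences[end].split())
            if end > start and words + n > max_words:
                break
            words += n
            end += 1
        chunks.append(" ".join(sentences[start:end]))
        if end == len(sentences):
            break
        # Step back by `overlap` sentences, but always advance by at least one.
        start = max(end - overlap, start + 1)
    return chunks
```

The `max(end - overlap, start + 1)` step guarantees progress even when a single sentence is longer than the budget or the overlap is as large as the chunk.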
Step 4: Run Extraction
Process One Chunk at a Time
Send each chunk with your prompt and collect the JSON output. Keep a record of which document and chunk each result came from—this provenance is essential for later verification and conflict resolution.
Parse and Validate Immediately
As each result returns, parse the JSON. If parsing fails, the model broke the output contract; log the failure and retry with a reminder to emit valid JSON only. Catching format errors at this stage prevents corrupt data from entering your pipeline.
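The parse-log-retry loop, with provenance attached to each result, can be sketched as follows. The `call_model` argument is a hypothetical stand-in for however you send a prompt and get text back (no particular API is assumed), and the reminder wording, the `doc_id` and `chunk_id` field names, and the two-attempt limit are all assumptions.

```python
import json
import logging
from typing import Callable

REQUIRED_KEYS = {
    "subject", "subject_type", "predicate",
    "object", "object_type", "source_span",
}
REMINDER = "\n\nReminder: output only a valid JSON array and nothing else."


def parse_triples(raw: str) -> list[dict]:
    """Parse model output; raise ValueError if it breaks the output contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    for item in data:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            raise ValueError(f"malformed triple: {item!r}")
    return data


def extract_chunk(
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    prompt: str,
    doc_id: str,
    chunk_id: int,
    max_attempts: int = 2,
) -> list[dict]:
    """Run one chunk, retrying with a reminder when the output does not parse."""
    for attempt in range(max_attempts):
        raw = call_model(prompt if attempt == 0 else prompt + REMINDER)
        try:
            triples = parse_triples(raw)
        except ValueError as err:
            logging.warning(
                "parse failure doc=%s chunk=%s attempt=%d: %s",
                doc_id, chunk_id, attempt + 1, err,
            )
            continue
        # Attach provenance so every triple can be traced back later.
        for triple in triples:
            triple["doc_id"] = doc_id
            triple["chunk_id"] = chunk_id
        return triples
    return []  # nothing unparseable ever reaches the rest of the pipeline
```

Passing the model call in as a function keeps this step testable with a fake model and independent of any one provider.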
Step 5: Resolve Entities
Canonicalize Names
The same company may appear as "Acme," "Acme Inc.," and "Acme Corporation." Decide on a canonical form and map variants to it. You can prompt the model to normalize against a reference list or run a matching pass in code. Without this, your graph holds duplicate nodes for a single real-world entity.
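A matching pass in code can be as simple as a lookup table built from a reference list. In this sketch the reference list, the choice of "Acme Inc." as the canonical form, and the normalization rule (lowercase, punctuation stripped) are all assumptions made for illustration; names that are not in the list pass through unchanged.

```python
import re

# Hypothetical reference list: canonical name -> known variants.
REFERENCE = {
    "Acme Inc.": ["Acme", "Acme Inc.", "Acme Corporation"],
}


def _key(name: str) -> str:
    """Normalize a name for matching: lowercase, no punctuation, single spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


# Lookup from normalized variant to canonical name.
ALIASES = {
    _key(variant): canonical
    for canonical, variants in REFERENCE.items()
    for variant in variants
}


def canonicalize(name: str) -> str:
    """Return the canonical form if the name is a known variant, else the name."""
    return ALIASES.get(_key(name), name)


def canonicalize_triples(triples: list[dict]) -> list[dict]:
    """Return copies of the triples with subject and object canonicalized."""
    return [
        {
            **t,
            "subject": canonicalize(t["subject"]),
            "object": canonicalize(t["object"]),
        }
        for t in triples
    ]
```

An exact-match table like this is deliberately conservative: it never merges two names it has not been told about, which avoids wrongly fusing distinct companies.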
Merge Across Chunks and Documents
Once names are canonical, identical entities from different chunks and documents become the same node. This is how facts scattered across many sources assemble into one connected graph. The payoff of doing this well shows up clearly in How a Research Team Mapped 4,000 Papers Into One Graph.
Step 6: Validate the Triples
Check Schema Conformance
Verify in code that every entity type and relation type appears in your schema. Any value outside the vocabulary signals prompt drift or a gap in your schema, and should be flagged for review rather than silently loaded.
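A conformance check of this kind might be sketched as below. The vocabulary is the one from the running example; the function name and the shape of the flagged output are assumptions.

```python
ENTITY_TYPES = {"Person", "Organization"}
RELATION_TYPES = {"founded", "acquired"}


def split_by_conformance(triples: list[dict]):
    """Separate triples that fit the schema from those that need review.

    Returns (conforming, flagged), where each flagged entry is a
    (triple, problems) pair listing the fields that fell outside the schema.
    """
    conforming, flagged = [], []
    for t in triples:
        problems = []
        if t.get("subject_type") not in ENTITY_TYPES:
            problems.append("subject_type")
        if t.get("object_type") not in ENTITY_TYPES:
            problems.append("object_type")
        if t.get("predicate") not in RELATION_TYPES:
            problems.append("predicate")
        if problems:
            flagged.append((t, problems))
        else:
            conforming.append(t)
    return conforming, flagged
```

Returning the flagged triples instead of dropping them keeps the evidence you need to decide whether the prompt drifted or the schema has a gap.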
Verify Source Spans
For a sample of triples, confirm the source span exists in the document and genuinely expresses the relationship. This catches subtle errors that aggregate counts miss. The failure modes this step guards against are catalogued in Why Graph Extraction Prompts Silently Drop Half Your Entities.
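The first half of this check, whether the span exists in the text, is mechanical and can run on every triple; whether the span genuinely expresses the relationship still needs a human or a separate review step. The sketch below covers only the mechanical part plus drawing a reproducible sample for review. The function names, the whitespace normalization, and the default sample size are assumptions.

```python
import random


def _normalize_ws(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not cause false misses."""
    return " ".join(text.split())


def span_exists(triple: dict, source_text: str) -> bool:
    """True if the triple's source_span is a non-empty quote from the text."""
    span = _normalize_ws(triple.get("source_span", ""))
    return bool(span) and span in _normalize_ws(source_text)


def sample_for_review(triples: list[dict], k: int = 20, seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of triples for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(triples, min(k, len(triples)))
```

A fixed seed means two reviewers looking at the same run see the same sample.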
Step 7: Load and Iterate
Deduplicate and Load
Remove duplicate triples, decide how to handle conflicting ones, and load the result into your graph store with provenance attached. The graph is now queryable.
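Deduplication that preserves provenance can be sketched as a merge keyed on subject, predicate, and object. This assumes names are already canonical and that each triple carries `doc_id` and `chunk_id` fields (hypothetical names). It does not attempt conflict handling, and loading depends on your graph store, so both are left out.

```python
def deduplicate(triples: list[dict]) -> list[dict]:
    """Merge identical (subject, predicate, object) triples, keeping every
    distinct source they came from in a provenance list."""
    merged: dict[tuple, dict] = {}
    for t in triples:
        key = (t["subject"], t["predicate"], t["object"])
        entry = merged.setdefault(
            key,
            {
                "subject": t["subject"],
                "predicate": t["predicate"],
                "object": t["object"],
                "provenance": [],
            },
        )
        source = (t.get("doc_id"), t.get("chunk_id"))
        if source not in entry["provenance"]:
            entry["provenance"].append(source)
    # Dicts preserve insertion order, so output follows first appearance.
    return list(merged.values())
```

A fact seen in several sources ends up as one edge with several provenance entries, which is also a rough signal of how well supported it is.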
Measure and Refine the Prompt
Compare output against a small gold-standard set to compute precision and recall. Where the prompt misses true triples or invents false ones, adjust the schema definitions, the grounding rule, or the worked example—then rerun. Use the running checks in Ship-Ready Verification Steps for Graph Extraction Prompts to keep each iteration honest.
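Precision and recall over triples reduce to set arithmetic once each triple is treated as a (subject, predicate, object) tuple. The sketch below assumes exact matching on canonical names, which is the strictest possible comparison; the function name is illustrative.

```python
def precision_recall(predicted: list[dict], gold: list[dict]) -> tuple[float, float]:
    """Compute precision and recall of predicted triples against a gold set,
    using exact match on (subject, predicate, object)."""
    def as_set(triples):
        return {(t["subject"], t["predicate"], t["object"]) for t in triples}

    pred_set, gold_set = as_set(predicted), as_set(gold)
    true_positives = len(pred_set & gold_set)
    # Guard against division by zero when either side is empty.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall
```

Low precision points at invented triples, so look at the grounding rule first; low recall points at missed triples, so look at the schema definitions and the worked example.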
Worked Run on the Company Example
A Sample Input and Its Triples
Take a sentence from a business article: "Acme Inc., founded by Dana Reyes, acquired rival Bolt Systems last year." Sent through the prompt above, the model should return two triples: (Dana Reyes, founded, Acme Inc.) and (Acme Inc., acquired, Bolt Systems). Dana Reyes should be typed as Person, Acme Inc. and Bolt Systems as Organization, and each triple should carry a source span quoting the sentence. If the model adds an edge that is not in the text (for example, that Dana Reyes founded Bolt Systems), your grounding rule is too weak and needs sharpening.
Reading the Result Critically
Do not just confirm the triples look right; confirm they are complete. Did the model capture both relationships, or did it stop at the first? Missing the acquisition would be a recall failure pointing at the schema definition or the example. Walking one sentence through the full pipeline by hand, before you automate, teaches you more about your prompt's behavior than any amount of bulk processing.
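The completeness check for this one sentence can be made explicit by comparing the model's output with the two expected triples and reporting what is missing and what is extra. The sketch below assumes the output has already been parsed into dictionaries; the function name is illustrative.

```python
SENTENCE = "Acme Inc., founded by Dana Reyes, acquired rival Bolt Systems last year."

# The two triples the sentence supports.
EXPECTED = {
    ("Dana Reyes", "founded", "Acme Inc."),
    ("Acme Inc.", "acquired", "Bolt Systems"),
}


def diff_against_expected(triples: list[dict], expected: set = EXPECTED) -> dict:
    """Report expected triples the model missed and triples it added."""
    got = {(t["subject"], t["predicate"], t["object"]) for t in triples}
    return {
        "missing": sorted(expected - got),  # recall failures
        "extra": sorted(got - expected),    # possible grounding failures
    }
```

An empty "missing" list and an empty "extra" list together are what "complete and correct" means for this sentence.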
Promoting the Sentence to a Test Case
Once you are satisfied with the output, save this input and its correct triples into your gold-standard set. Every sentence you hand-verify becomes a permanent check against regressions, so the manual work you do early pays dividends every time you change the prompt later.
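One simple way to store the gold-standard set is a JSON Lines file, with one hand-verified input and its triples per line. This is a sketch under that assumption; the file layout and the function names are hypothetical, and any format you can reload reliably would serve.

```python
import json
from pathlib import Path


def append_gold_case(path: str, text: str, triples: list[dict]) -> None:
    """Append one hand-verified input and its correct triples to the gold set."""
    record = {"text": text, "triples": triples}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_gold_cases(path: str) -> list[dict]:
    """Load every gold case, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because the file is append-only and line-oriented, each newly verified sentence is a one-line addition that shows up cleanly in version control.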
Frequently Asked Questions
How long should each document chunk be?
Short enough that the chunk plus your full prompt fits within the model's context window with room to spare, and long enough to keep related sentences together. A few hundred to a thousand words per chunk is a common starting range. Overlap consecutive chunks slightly to avoid splitting relationships.
What do I do when the model returns invalid JSON?
Catch the parse failure, log the offending chunk, and retry with an added reminder to output only valid JSON and nothing else. If failures persist, simplify the output schema or shorten the chunk. Never load unparseable output into your pipeline.
Should I resolve entities during extraction or afterward?
Afterward is usually cleaner. Extract triples first, then run a dedicated resolution pass that canonicalizes names and merges variants. Trying to do both in one prompt overloads the model and produces less consistent results.
How do I keep provenance for each triple?
Record the source document and chunk alongside every triple as it is extracted, and carry that metadata through resolution and loading. Provenance lets you trace any fact back to its origin and resolve conflicts when two sources disagree.
How many iterations does it take to get a good prompt?
Expect several. Build a small gold-standard set early, measure precision and recall after each change, and adjust one element at a time—schema definitions, grounding rule, or example. Disciplined iteration converges faster than random tweaking.
Can I run this whole process without writing code?
You can run the extraction step by hand in a chat interface for small experiments, but chunking, parsing, entity resolution, deduplication, and loading at scale realistically require code. Start manual to learn the flow, then automate the repetitive steps.
Key Takeaways
- Define a tight schema of entity and relation types first; it anchors every later step.
- Build the prompt from four consistent blocks: schema, task, grounding rule, and a precise output contract with one worked example.
- Clean and chunk documents with slight overlap so relationships are not split across boundaries.
- Process chunks one at a time, parse and validate output immediately, and record provenance for every triple.
- Resolve entities in a dedicated pass after extraction, then validate schema conformance and source spans before loading.
- Measure precision and recall against a gold-standard set and refine one prompt element at a time.