When Strict Schemas Beat Open-Ended Graph Extraction

Every team that builds a knowledge graph from text eventually hits the same fork in the road. Do you tell the model exactly which entity types and relationship types it is allowed to produce, or do you let it discover the structure of the document and report whatever it finds? The first path gives you a clean, queryable, predictable graph. The second path captures nuance you never anticipated. You cannot maximize both at once, and pretending you can is how projects drift into expensive rework.

This is not a question with a universal answer. It is a set of trade-offs whose right resolution depends on your domain, your tolerance for noise, and what you plan to do with the graph downstream. A graph feeding a strict compliance query needs different guarantees than a graph supporting exploratory research. The mistake is choosing by default rather than by analysis.

What follows lays out the competing approaches honestly, names the axes along which they differ, and ends with a decision rule you can actually apply. The aim is to make the trade-off visible so you choose it deliberately instead of inheriting it from whichever tutorial you happened to read first.

It helps to remember that this trade-off is not unique to graph extraction. It is the same tension between structure and flexibility that runs through database schema design, taxonomy work, and any effort to impose order on messy information. The reason it feels especially sharp here is that language models make both extremes cheap to attempt, so the constraint that used to come from implementation difficulty now has to come from your own judgment. That shift puts the burden of discipline squarely on the designer.

The Two Poles of Extraction Strategy

At one end sits closed-schema extraction. You define an ontology up front: entity types, relationship types, attribute constraints. The model fills in that template and nothing else.

At the other end sits open extraction, sometimes called open information extraction. The model returns subject-predicate-object triples in whatever vocabulary the text suggests, and you discover the schema after the fact by clustering what came back.

Why closed schemas feel safe

A closed schema means every node and edge is one of a known set of types. Queries are predictable. Validation is mechanical. Downstream consumers can build against a stable contract. The cost is that anything outside the schema is invisible, and writing a complete schema for a rich domain is genuinely hard.
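To make "validation is mechanical" concrete, here is a minimal sketch of a closed-schema check. The ontology (`Person`, `Company`, `Product`, `WORKS_FOR`, `MAKES`) and the names `Entity`, `Edge`, and `validate_edge` are invented for illustration and are not taken from any particular library.

```python
from dataclasses import dataclass

# Hypothetical ontology, invented for illustration.
ENTITY_TYPES = {"Person", "Company", "Product"}

# Each relationship type maps to the (subject type, object type) pair it allows.
RELATION_TYPES = {
    "WORKS_FOR": ("Person", "Company"),
    "MAKES": ("Company", "Product"),
}


@dataclass(frozen=True)
class Entity:
    name: str
    type: str


@dataclass(frozen=True)
class Edge:
    subject: Entity
    relation: str
    obj: Entity


def validate_edge(edge: Edge) -> list[str]:
    """Return schema violations for one edge; an empty list means it conforms."""
    errors = []
    for entity in (edge.subject, edge.obj):
        if entity.type not in ENTITY_TYPES:
            errors.append(f"unknown entity type: {entity.type}")
    if edge.relation not in RELATION_TYPES:
        errors.append(f"unknown relationship type: {edge.relation}")
    elif (edge.subject.type, edge.obj.type) != RELATION_TYPES[edge.relation]:
        errors.append(
            f"{edge.relation} does not connect {edge.subject.type} to {edge.obj.type}"
        )
    return errors


ada = Entity("Ada", "Person")
acme = Entity("Acme", "Company")
print(validate_edge(Edge(ada, "WORKS_FOR", acme)))  # conforms to the schema
print(validate_edge(Edge(acme, "WORKS_FOR", ada)))  # wrong direction
print(validate_edge(Edge(ada, "MENTORS", ada)))     # relationship outside the schema
```

The last call shows the cost side of the bargain: a mentoring relationship may be true in the text, but this schema has no way to express it.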

Why open extraction feels powerful

Open extraction surfaces relationships you would never have thought to encode. It adapts to documents whose structure you do not fully understand yet. The cost is noise: inconsistent predicates, near-duplicate relationships, and a normalization burden that lands on you after extraction rather than before.
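The normalization burden can be illustrated with a small sketch. It assumes open extraction returned plain (subject, predicate, object) tuples; the sample triples and the function names are hypothetical. Surface normalization merges spelling variants, but it does nothing for true synonyms.

```python
import re
from collections import defaultdict


def normalize_predicate(predicate: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters to one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", predicate.lower()).strip("_")


def group_predicates(triples):
    """Map each normalized predicate to the set of raw spellings that produced it."""
    groups = defaultdict(set)
    for _subject, predicate, _obj in triples:
        groups[normalize_predicate(predicate)].add(predicate)
    return dict(groups)


# Hypothetical open-extraction output.
raw_triples = [
    ("Ada", "works for", "Acme"),
    ("Bob", "Works-For", "Acme"),
    ("Cy", "employed by", "Acme"),
]

groups = group_predicates(raw_triples)
for key, spellings in sorted(groups.items()):
    print(key, sorted(spellings))
```

Here "works for" and "Works-For" collapse into one key, while "employed by" stays separate even though it means the same thing. Merging synonyms like that takes a hand-built alias table, embeddings, or human review, which is the cost that lands on you after extraction.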

The Axes That Actually Matter

The closed-versus-open framing is too coarse. Real decisions turn on a handful of independent axes.

Schema stability. Does your domain have a settled vocabulary, or are you still learning what matters? Settled vocabularies favor closed schemas.
Downstream consumer tolerance. Will the graph feed an automated system that breaks on surprises, or a human who can interpret messiness? Automated consumers favor closed.
Recall sensitivity. Is missing a true relationship costly, or merely annoying? High recall sensitivity favors open extraction with later filtering.
Normalization budget. How much engineering can you spend cleaning up after the model? Low budgets favor closed schemas that produce clean output natively.

Because the axes vary independently, a project can point toward closed on one and open on another, which is why the decision resists a one-line rule until you weigh them together.

Hybrid Approaches and Why They Win in Practice

Most mature systems land in the middle. They run a closed schema for the entities and relationships they care about most, and an open pass to surface candidates for schema expansion.

Seeded open extraction

You provide the model a partial ontology and explicitly invite it to propose new types when the text demands. You get the predictability of a closed schema with a release valve for genuine novelty. The proposed types feed a review queue rather than the production graph directly.
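The routing step might look like the sketch below. It assumes the model has already returned triples whose relationship label is either a seed type or a newly proposed one; `SEED_RELATIONS` and `route_extractions` are hypothetical names, and prompting the model is out of scope here.

```python
from collections import defaultdict

# Hypothetical partial ontology handed to the model as a seed.
SEED_RELATIONS = {"WORKS_FOR", "MAKES"}


def route_extractions(triples, seed_relations=SEED_RELATIONS):
    """Split triples into graph-ready edges and proposals awaiting human review."""
    accepted = []
    review_queue = defaultdict(list)  # proposed relationship type -> supporting triples
    for subject, relation, obj in triples:
        if relation in seed_relations:
            accepted.append((subject, relation, obj))
        else:
            review_queue[relation].append((subject, relation, obj))
    return accepted, dict(review_queue)


model_output = [
    ("Ada", "WORKS_FOR", "Acme"),
    ("Acme", "ACQUIRED", "Bolt"),
    ("Acme", "ACQUIRED", "Cogs"),
]
accepted, review_queue = route_extractions(model_output)
print(accepted)
print({relation: len(examples) for relation, examples in review_queue.items()})
```

Keeping the supporting triples next to each proposed type gives reviewers the evidence they need to decide whether the type deserves a place in the ontology.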

Two-stage pipelines

The first stage extracts liberally. The second stage maps the liberal output onto your canonical ontology, dropping or flagging anything that will not map. This separates recall from precision, letting you tune each independently. The same separation underpins good evaluation, which is why this pairs naturally with the practices in Scoring Whether Your Extracted Triples Are Actually Right.
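A minimal sketch of the second stage follows, assuming the first stage has already produced free-vocabulary triples. The alias table and the name `map_to_canonical` are illustrative assumptions; a real mapping might rely on embeddings or a classifier instead of exact lookup.

```python
# Hypothetical alias table from free-text predicates to canonical relationship types.
ALIASES = {
    "works for": "WORKS_FOR",
    "is employed by": "WORKS_FOR",
    "manufactures": "MAKES",
}


def map_to_canonical(raw_triples, aliases=ALIASES):
    """Stage two: rewrite predicates to canonical types and flag whatever will not map."""
    mapped, flagged = [], []
    for subject, predicate, obj in raw_triples:
        canonical = aliases.get(predicate.strip().lower())
        if canonical is None:
            flagged.append((subject, predicate, obj))
        else:
            mapped.append((subject, canonical, obj))
    return mapped, flagged


# Stand-in for liberal stage-one output.
stage_one = [
    ("Ada", "Works for", "Acme"),
    ("Bob", "is employed by", "Acme"),
    ("Acme", "sponsors", "PyCon"),
]
mapped, flagged = map_to_canonical(stage_one)
print(mapped)
print(flagged)
```

Recall is set by how liberally stage one extracts, and precision by how strict the mapping is, so either side can be adjusted without touching the other.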

How Model Choice Interacts With the Trade-off

The approach you can afford depends partly on the model. Stronger models follow a closed schema more faithfully and propose better open relationships, which widens your options.

Structured-output support

If your model and tooling support grammar-constrained or function-call output, closed-schema extraction becomes nearly free to enforce. Without that support, closed schemas leak: the model occasionally emits a type outside the ontology or output that does not parse. The trade-off shifts because you now pay a validation tax either way.
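One common way to express a closed schema for structured-output tooling is a JSON Schema whose type fields are `enum` constraints. The sketch below only builds that schema as a Python dictionary. The ontology and function names are hypothetical, and how the schema is handed to a given model API varies by provider and is not shown.

```python
import json

# Hypothetical ontology, invented for illustration.
ENTITY_TYPES = ["Company", "Person", "Product"]
RELATION_TYPES = ["MAKES", "WORKS_FOR"]


def entity_schema(entity_types):
    """JSON Schema for one entity whose type must come from the closed list."""
    return {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "type": {"type": "string", "enum": list(entity_types)},
        },
        "required": ["name", "type"],
        "additionalProperties": False,
    }


def build_triple_schema(entity_types, relation_types):
    """JSON Schema for one triple; the enum keywords are what close the schema."""
    return {
        "type": "object",
        "properties": {
            "subject": entity_schema(entity_types),
            "relation": {"type": "string", "enum": list(relation_types)},
            "object": entity_schema(entity_types),
        },
        "required": ["subject", "relation", "object"],
        "additionalProperties": False,
    }


triple_schema = build_triple_schema(ENTITY_TYPES, RELATION_TYPES)
print(json.dumps(triple_schema, indent=2))
```

When decoding is constrained to a schema like this, an out-of-ontology label cannot be emitted at all. Without constrained decoding, the same schema can still drive a validator that runs after generation.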

Context window and document length

Long documents strain both approaches. A model that loses track of entities across a long context will assign inconsistent identities to the same entity regardless of schema strategy. This interacts with the deeper handling problems covered in Coreference, Long Context, and Other Graph Extraction Hard Parts.

The Decision Rule

Here is a rule you can apply in a meeting.

Default to a closed schema. Open extraction is seductive but its normalization cost is routinely underestimated, and a clean partial graph beats a messy complete one for almost every production use. Add a seeded open pass only when two conditions hold: your domain is still revealing new relationship types, and you have the review capacity to triage proposals without flooding the graph.

If you are feeding an automated downstream system, never let open output reach production unmapped. If a human is the consumer and exploration is the goal, relax the constraint and accept the noise as the price of discovery. Choosing the right tools to implement either path is the subject of Software That Turns Messy Text Into Clean Triples.
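The rule above can be written down as a small function, which makes the inputs explicit. This is one reading of the rule, not a definitive encoding; the `Strategy` names and the four boolean parameters are assumptions introduced for the sketch.

```python
from enum import Enum


class Strategy(Enum):
    CLOSED = "closed schema only"
    CLOSED_PLUS_SEEDED_OPEN = "closed schema plus a seeded open pass into a review queue"
    OPEN_EXPLORATORY = "open extraction, with noise accepted"


def choose_strategy(
    *,
    automated_consumer: bool,
    exploration_is_goal: bool,
    domain_still_evolving: bool,
    has_review_capacity: bool,
) -> Strategy:
    """Encode the decision rule: closed by default, open only under stated conditions."""
    # A human consumer doing exploration can tolerate unmapped open output.
    if not automated_consumer and exploration_is_goal:
        return Strategy.OPEN_EXPLORATORY
    # Both conditions must hold before adding a seeded open pass.
    if domain_still_evolving and has_review_capacity:
        return Strategy.CLOSED_PLUS_SEEDED_OPEN
    return Strategy.CLOSED


print(choose_strategy(
    automated_consumer=True,
    exploration_is_goal=False,
    domain_still_evolving=True,
    has_review_capacity=False,
))
```

The example call returns the closed-only strategy: the domain is still evolving, but without review capacity the proposals would have nowhere safe to go.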

Revisit the decision as the project matures

The right answer early in a project is rarely the right answer later. A graph that began as exploratory, with open extraction surfacing the domain's vocabulary, should harden into a closed schema once that vocabulary stabilizes. A graph that began closed may need an occasional open audit to catch relationships the original ontology never anticipated. Treat the closed-versus-open choice as a setting you revisit at each phase, not a decision you make once and inherit forever. The teams that get into trouble are usually the ones that locked in an early choice and never asked whether their project had outgrown it.

Frequently Asked Questions

Is open extraction ever the right default?

Rarely. It is the right default only when you genuinely do not know your domain's vocabulary and the graph's purpose is discovery rather than reliable query. For most production systems, the normalization burden makes a closed schema the better starting point.

Can I migrate from open to closed later?

Yes, and many teams do. They run open extraction to learn the domain, cluster the results into a candidate ontology, then lock that ontology into a closed schema. The open phase becomes a research step, not the production design.

Does a hybrid approach double my cost?

Not usually, because the open pass can run on a sample rather than every document. You use open extraction to find new types occasionally and closed extraction for the bulk of throughput, which keeps cost close to the closed-only baseline.
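One way to pick that sample is to hash each document ID, so the same documents are selected on every run. This is a sketch under assumptions: the function name, the 5 percent rate, and the ID format are all illustrative.

```python
import hashlib


def in_open_sample(doc_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of documents for the open pass."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


doc_ids = [f"doc-{i}" for i in range(1000)]
sampled = [doc_id for doc_id in doc_ids if in_open_sample(doc_id, 0.05)]
print(len(sampled), "of", len(doc_ids), "documents get the open pass")
```

Hashing instead of calling a random number generator means a document's membership in the sample never changes between runs, which keeps audits comparable over time.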

How do I know my schema is too restrictive?

Watch what the model wants to say but cannot. If a high fraction of documents contain relationships your schema cannot express, your recall is suffering silently. A periodic open audit pass reveals what the closed schema is missing.
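The audit can be reduced to a single number to track over time. The sketch below assumes an open audit pass has produced, for each document, the list of relationship labels it found after any mapping to canonical names; the data and function name are hypothetical.

```python
def inexpressible_fraction(audit, schema_relations):
    """Fraction of audited documents containing at least one relationship the schema lacks."""
    if not audit:
        return 0.0
    affected = sum(
        1
        for relations in audit.values()
        if any(relation not in schema_relations for relation in relations)
    )
    return affected / len(audit)


SCHEMA_RELATIONS = {"WORKS_FOR", "MAKES"}

# Hypothetical audit output: document ID -> relationship labels found by the open pass.
audit = {
    "d1": ["WORKS_FOR"],
    "d2": ["WORKS_FOR", "ACQUIRED"],
    "d3": [],
    "d4": ["FUNDED_BY"],
}
print(inexpressible_fraction(audit, SCHEMA_RELATIONS))  # 0.5
```

What counts as too high depends on how costly a missed relationship is in your domain, so treat the threshold as a project decision and not a constant.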

Which approach is cheaper to maintain?

Closed schemas are cheaper to maintain because the contract is stable and validation is mechanical. Open extraction shifts cost from design time to perpetual normalization, which compounds as your graph grows.

Key Takeaways

The core trade-off is closed-schema predictability versus open-extraction coverage, and you cannot maximize both.
Decide along independent axes: schema stability, consumer tolerance, recall sensitivity, and normalization budget.
Hybrid and two-stage pipelines capture most of the upside by separating liberal extraction from canonical mapping.
Model strength and structured-output support change which approach you can afford to enforce.
Default to a closed schema and add open extraction only when your domain is still evolving and you have review capacity.

Agency Script Editorial
