Scoring Whether Your Extracted Triples Are Actually Right

Agency Script Editorial

Editorial Team

October 20, 2019 · 8 min read

A knowledge graph built from text is only as trustworthy as your ability to measure it. Teams routinely ship extraction pipelines whose accuracy they have never quantified, then act surprised when a downstream query returns nonsense. The graph looked plausible in spot checks, so it shipped. Plausibility is not measurement, and spot checks miss exactly the systematic errors that hurt most.

The difficulty is that graph extraction has more failure modes than a typical classification task. A triple can be wrong because the entity is wrong, because the relationship is wrong, because the entity was correct but duplicated, or because a true relationship was never extracted at all. A single accuracy number flattens all of that into a figure that tells you almost nothing about where to spend your next engineering hour.

This piece defines the metrics that actually distinguish a good extraction pipeline from a bad one, explains how to instrument them without a labeling army, and shows how to read the resulting signal so you intervene on the right problem rather than the loudest one.

The underlying principle is simple even though the execution is not: you cannot improve what you cannot see, and a graph hides its own errors better than almost any other data artifact. A broken web service throws an error; a broken graph quietly returns a confident wrong answer. Measurement is what converts that silence into a signal you can act on, and teams that take measurement seriously tend to be the ones whose graphs people end up trusting with real decisions.

The Metrics That Separate Real Quality From Hope

Borrow the precision and recall framing from information retrieval, but apply it at the level of triples, not documents.

Triple-level precision and recall

Precision asks: of the triples you extracted, what fraction are correct? Recall asks: of the triples that should have been extracted, what fraction did you capture? The two usually pull against each other as you tune the prompt, so reporting only one hides the trade you are making.
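As a rough illustration, the two questions above can be computed with plain set arithmetic. This is a minimal sketch that assumes exact-match comparison on normalized strings; the `Triple` type, the `triple_scores` function, and the sample data are hypothetical and not part of any particular library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def triple_scores(extracted: set[Triple], gold: set[Triple]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over two sets of triples."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold labels and pipeline output for one document.
gold = {
    Triple("Acme Corp", "acquired", "Widget Ltd"),
    Triple("Acme Corp", "headquartered_in", "Austin"),
    Triple("Jane Doe", "ceo_of", "Acme Corp"),
    Triple("Widget Ltd", "founded_in", "2009"),
}
extracted = {
    Triple("Acme Corp", "acquired", "Widget Ltd"),
    Triple("Jane Doe", "ceo_of", "Acme Corp"),
    Triple("Jane Doe", "ceo_of", "Widget Ltd"),  # not in the gold set
}

scores = triple_scores(extracted, gold)
# Two of three extracted triples are correct (precision 2/3);
# two of four gold triples were captured (recall 1/2).
```

Exact matching is the strictest choice. A real pipeline would likely normalize casing and entity names first, otherwise entity-resolution errors get counted twice.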

Entity resolution accuracy

Separate from relationship correctness is the question of identity. If "Acme Corp" and "Acme Corporation" become two nodes, your graph is wrong even though every individual triple is correct. Measure the rate at which distinct surface forms collapse to the right canonical node.
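One simple way to put a number on this, sketched under the assumption that annotators have mapped each surface form to a canonical node ID and that the pipeline emits IDs in the same namespace. The function name and the IDs are hypothetical.

```python
def resolution_accuracy(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold surface forms that landed on the expected canonical node."""
    if not gold:
        return 0.0
    correct = sum(
        1 for surface, canonical in gold.items()
        if predicted.get(surface) == canonical
    )
    return correct / len(gold)


# Hypothetical annotator mapping: surface form -> canonical node ID.
gold_nodes = {
    "Acme Corp": "acme",
    "Acme Corporation": "acme",
    "ACME": "acme",
    "Widget Ltd": "widget",
}
# Hypothetical pipeline output; one surface form became its own node.
predicted_nodes = {
    "Acme Corp": "acme",
    "Acme Corporation": "acme_corporation",  # duplicate node
    "ACME": "acme",
    "Widget Ltd": "widget",
}

accuracy = resolution_accuracy(predicted_nodes, gold_nodes)  # 3 of 4 correct
```

If the pipeline invents its own node IDs, a pairwise comparison (do two mentions share a node in both mappings?) would be needed instead of the direct ID match used here.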

Schema conformance rate

If you use a closed schema, measure how often the output actually conforms before any validation cleanup. A low raw conformance rate signals that your prompt or model is fighting the schema, which predicts trouble at scale.
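A minimal sketch of a raw conformance check, assuming the model is asked to return one JSON object per triple. The required keys and the allowed predicate list are hypothetical stand-ins for whatever your closed schema defines.

```python
import json

# Hypothetical closed schema.
REQUIRED_KEYS = {"subject", "predicate", "object"}
ALLOWED_PREDICATES = {"acquired", "ceo_of", "headquartered_in"}


def conforms(raw: str) -> bool:
    """True if the raw model output parses and matches the schema exactly."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
        return False
    if not all(isinstance(record[k], str) and record[k].strip() for k in REQUIRED_KEYS):
        return False
    return record["predicate"] in ALLOWED_PREDICATES


def conformance_rate(raw_outputs: list[str]) -> float:
    """Share of raw outputs that conform before any cleanup is applied."""
    if not raw_outputs:
        return 0.0
    return sum(conforms(raw) for raw in raw_outputs) / len(raw_outputs)


raw_outputs = [
    '{"subject": "Acme Corp", "predicate": "acquired", "object": "Widget Ltd"}',
    '{"subject": "Jane Doe", "predicate": "runs", "object": "Acme Corp"}',  # predicate outside schema
    '{"subject": "Acme Corp", "predicate": "acquired"}',  # missing key
    "Acme Corp acquired Widget Ltd",  # not JSON at all
]

rate = conformance_rate(raw_outputs)  # 1 of 4 conforms
```

The point of measuring on raw output is that a repair step downstream would otherwise mask how hard the model is fighting the schema.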

Building a Gold Set Without Drowning in Labels

Every meaningful metric needs ground truth, and ground truth needs human judgment. The trick is spending that judgment efficiently.

  • Stratify your sample. Pull documents across the range of types you actually process, not just the easy ones. A gold set of clean documents flatters a pipeline that fails on messy input.
  • Label triples, not documents. Have annotators mark which extracted triples are correct and which true triples were missed. This directly yields precision and recall.
  • Reuse and grow. Each labeling round adds to a permanent evaluation set. Over time you accumulate a regression suite that catches degradation when you change prompts or models.

A few hundred carefully labeled documents beat tens of thousands of unlabeled ones. The discipline is choosing what to label, not labeling more.
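The stratification step might look something like the sketch below. It assumes each document record carries a `doc_type` field; that field name, the `stratified_sample` helper, and the corpus sizes are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(docs: list[dict], per_stratum: int, seed: int = 0) -> list[dict]:
    """Draw up to `per_stratum` documents from every document type."""
    rng = random.Random(seed)  # fixed seed so the gold set is reproducible
    by_type = defaultdict(list)
    for doc in docs:
        by_type[doc["doc_type"]].append(doc)

    sample = []
    for doc_type in sorted(by_type):
        group = by_type[doc_type]
        # Rare types contribute everything they have rather than being skipped.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Hypothetical corpus skewed heavily toward emails.
docs = (
    [{"id": f"contract-{i}", "doc_type": "contract"} for i in range(50)]
    + [{"id": f"email-{i}", "doc_type": "email"} for i in range(200)]
    + [{"id": f"scan-{i}", "doc_type": "scanned_pdf"} for i in range(5)]
)

gold_candidates = stratified_sample(docs, per_stratum=10)
```

A uniform random draw from this corpus would be dominated by emails and might contain no scanned PDFs at all. Sampling per stratum keeps the messy minority types in the gold set.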

Instrumenting the Pipeline in Production

Offline metrics on a gold set tell you about a frozen snapshot. Production metrics tell you what is happening now.

Confidence and abstention signals

Have the model report confidence or allow it to abstain on uncertain extractions. The rate of low-confidence outputs is a leading indicator: a sudden rise usually means your input distribution shifted, often before precision visibly drops.
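A sketch of how that leading indicator could be tracked, assuming each extraction carries a self-reported confidence between 0 and 1, with `None` recorded when the model abstains. The 0.6 cutoff and the 0.10 tolerance are hypothetical values, and self-reported confidence is rarely well calibrated, so the trend is more meaningful than the absolute level.

```python
from typing import Optional


def low_confidence_rate(confidences: list[Optional[float]], threshold: float = 0.6) -> float:
    """Share of extractions that are abstentions (None) or below the threshold."""
    if not confidences:
        return 0.0
    low = sum(1 for c in confidences if c is None or c < threshold)
    return low / len(confidences)


def distribution_shifted(baseline_rate: float, current_rate: float, tolerance: float = 0.10) -> bool:
    """Flag when the low-confidence rate rises by more than the tolerance."""
    return current_rate - baseline_rate > tolerance


# Hypothetical confidence values from two batches.
last_week = [0.9, 0.8, 0.95, 0.7, 0.85, 0.4, 0.9, 0.88, 0.92, 0.75]
today = [0.9, 0.5, None, 0.45, 0.85, 0.3, 0.9, None, 0.55, 0.75]

baseline = low_confidence_rate(last_week)  # 1 of 10
current = low_confidence_rate(today)       # 6 of 10
alert = distribution_shifted(baseline, current)
```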

Provenance coverage

Every edge should point to a source span. Measure the fraction of edges with valid provenance. Missing provenance is both a quality problem and a debugging blocker, and it pairs directly with the governance concerns in "Silent Schema Drift and Other Graph Extraction Traps."
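Here is one way "valid provenance" could be checked, as a sketch. It assumes each edge stores a `provenance` dict holding a document ID and character offsets; that layout and the function names are hypothetical.

```python
def has_valid_provenance(edge: dict, documents: dict[str, str]) -> bool:
    """True if the edge points at a non-empty span inside a known document."""
    prov = edge.get("provenance")
    if not prov:
        return False
    text = documents.get(prov.get("doc_id"))
    if text is None:
        return False
    start, end = prov.get("start"), prov.get("end")
    if not (isinstance(start, int) and isinstance(end, int)):
        return False
    return 0 <= start < end <= len(text)


def provenance_coverage(edges: list[dict], documents: dict[str, str]) -> float:
    """Fraction of edges whose provenance passes the validity check."""
    if not edges:
        return 0.0
    return sum(has_valid_provenance(e, documents) for e in edges) / len(edges)


documents = {"doc-1": "Acme Corp acquired Widget Ltd in 2021."}
edges = [
    {"triple": ("Acme Corp", "acquired", "Widget Ltd"),
     "provenance": {"doc_id": "doc-1", "start": 0, "end": 29}},
    {"triple": ("Acme Corp", "acquired", "Widget Ltd"),
     "provenance": {"doc_id": "doc-1", "start": 10, "end": 500}},  # span runs past the text
    {"triple": ("Jane Doe", "ceo_of", "Acme Corp"),
     "provenance": None},  # no source recorded
    {"triple": ("Widget Ltd", "founded_in", "2009"),
     "provenance": {"doc_id": "doc-9", "start": 0, "end": 5}},  # unknown document
]

coverage = provenance_coverage(edges, documents)  # 1 of 4 edges is traceable
```

This only verifies that the span exists. Whether the span actually supports the triple is a correctness question that still needs the gold set.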

Reading the Signal Without Fooling Yourself

Numbers invite self-deception. A high precision figure on an easy gold set means nothing if production data is harder.

Watch the precision-recall frontier, not a single point

When you change a prompt, plot where precision and recall land relative to before. An improvement that trades a lot of recall for a little precision may be a regression in disguise, depending on what your downstream consumer needs.
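If plotting feels heavy for a quick check, the same comparison can be sketched in a few lines. The `compare` function and the before/after numbers are hypothetical; the idea is only to label a change as a clear win, a clear loss, or a trade that needs a human decision.

```python
def compare(before: dict[str, float], after: dict[str, float]) -> dict:
    """Classify a prompt change by how precision and recall moved together."""
    dp = after["precision"] - before["precision"]
    dr = after["recall"] - before["recall"]
    if dp >= 0 and dr >= 0:
        verdict = "improvement" if (dp > 0 or dr > 0) else "no change"
    elif dp <= 0 and dr <= 0:
        verdict = "regression"
    else:
        # One metric rose while the other fell: the numbers alone cannot decide.
        verdict = "trade-off"
    return {"delta_precision": dp, "delta_recall": dr, "verdict": verdict}


# Hypothetical gold-set results for the old and new prompt.
old_prompt = {"precision": 0.82, "recall": 0.74}
new_prompt = {"precision": 0.85, "recall": 0.58}

comparison = compare(old_prompt, new_prompt)
# Precision rose by about 0.03 while recall fell by about 0.16.
```

A "trade-off" verdict is the case the paragraph above warns about. Whether it counts as a regression depends on what the downstream consumer needs, which the code cannot know.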

Segment by document type

An aggregate metric hides per-segment failures. If your pipeline is excellent on contracts and terrible on emails, the average can look acceptable while the email-derived portion of your graph is garbage. Always report metrics sliced by the dimensions that vary.
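A small sketch of the slicing, assuming each labeled triple is stored alongside the type of document it came from. The data is made up to mirror the contracts-versus-emails example.

```python
from collections import defaultdict


def precision_by_segment(labeled: list[tuple[str, bool]]) -> dict[str, float]:
    """labeled holds (doc_type, is_correct) pairs, one per extracted triple."""
    totals = defaultdict(lambda: [0, 0])  # doc_type -> [correct, total]
    for doc_type, is_correct in labeled:
        totals[doc_type][0] += int(is_correct)
        totals[doc_type][1] += 1
    return {doc_type: correct / total for doc_type, (correct, total) in totals.items()}


# Hypothetical labels: strong on contracts, weak on emails.
labeled = (
    [("contract", True)] * 95 + [("contract", False)] * 5
    + [("email", True)] * 40 + [("email", False)] * 60
)

by_segment = precision_by_segment(labeled)
overall = sum(is_correct for _, is_correct in labeled) / len(labeled)
# overall is 0.675, which hides a 0.95 / 0.40 split between the two segments.
```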

Connecting Metrics to Decisions

Metrics earn their cost only when they change what you do. Tie each metric to an action.

Thresholds that trigger work

Set a precision floor below which output gets human review rather than auto-ingestion. Set a recall floor below which you revisit the prompt or schema. Set a conformance floor below which you suspect a model or formatting regression. Without thresholds, metrics become decoration.
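The three floors can be written down as data so they are hard to ignore. This is a sketch; the numeric floors are hypothetical and should come from your own downstream tolerance, while the action strings follow the paragraph above.

```python
# Hypothetical floors; pick values that match your own use case.
FLOORS = {"precision": 0.90, "recall": 0.70, "conformance": 0.98}

ACTIONS = {
    "precision": "route output to human review instead of auto-ingestion",
    "recall": "revisit the prompt or schema",
    "conformance": "investigate a model or formatting regression",
}


def triggered_actions(metrics: dict[str, float]) -> list[str]:
    """Return the action for every metric below its floor.

    A metric that was not reported is treated as 0.0, so missing
    measurements trigger work rather than passing silently.
    """
    return [
        ACTIONS[name]
        for name, floor in FLOORS.items()
        if metrics.get(name, 0.0) < floor
    ]


actions = triggered_actions({"precision": 0.93, "recall": 0.61, "conformance": 0.99})
# Only the recall floor is breached here.
```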

Cost-aware quality targets

Higher quality usually costs more tokens, more review, or both. The right target is the one where marginal quality stops being worth marginal cost for your use case, a calculation that connects directly to "What Knowledge Graph Extraction Actually Saves a Data Team."
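The marginal comparison can be sketched as a walk up a list of pipeline configurations ordered by cost. Every name and number below is invented for illustration, including the choice of F1 as the quality measure and the gain-per-dollar cutoff.

```python
def pick_target(options: list[tuple[str, float, float]], min_gain_per_dollar: float) -> str:
    """options: (name, cost per 1k documents, F1), sorted by ascending cost.

    Step up to the next option only while the extra quality per extra
    dollar stays at or above the cutoff.
    """
    chosen = options[0]
    for candidate in options[1:]:
        extra_cost = candidate[1] - chosen[1]
        extra_quality = candidate[2] - chosen[2]
        if extra_cost <= 0 or extra_quality / extra_cost < min_gain_per_dollar:
            break
        chosen = candidate
    return chosen[0]


# Hypothetical configurations and measurements.
options = [
    ("single pass", 4.0, 0.78),
    ("single pass + validation retry", 6.0, 0.86),
    ("two passes + human review", 40.0, 0.90),
]

target = pick_target(options, min_gain_per_dollar=0.01)
# The retry step buys about 0.04 F1 per dollar; the review step only about 0.001.
```

The cutoff is where the business judgment lives. A graph feeding legal decisions might justify the expensive tier; a graph feeding internal search probably would not.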

Common Measurement Pitfalls

Even teams that measure can measure badly, and a misleading metric is more dangerous than no metric because it manufactures false confidence.

Optimizing the metric instead of the graph

When a single number becomes the goal, people tune the pipeline to move that number rather than to improve the graph. A prompt change that lifts precision by suppressing every uncertain extraction looks like progress and quietly destroys recall. Always watch the metrics you are not optimizing, because that is where the regression hides.

Grading on the easy cases

A gold set assembled from clean, cooperative documents reports flattering numbers that collapse the moment real input arrives. Stratify the gold set across the full difficulty range you actually process, including the messy documents you wish you did not have to handle. A metric is only as honest as the sample it runs on.

Confusing conformance with correctness

A high schema-conformance rate tempts teams into believing the graph is good. Conformance only proves the output has the right shape, not that it states the truth. Treat the two as independent and measure both, because a perfectly shaped graph full of false triples passes every structural check while being worthless.

Frequently Asked Questions

What single metric should I report if I can only pick one?

Resist picking one. If forced, report triple-level F1, the harmonic mean of precision and recall, because it punishes you for ignoring either. But always keep the underlying precision and recall visible, since F1 alone hides which direction you are failing.
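For reference, the harmonic mean written out, with a worked example using made-up values P = 0.90 and R = 0.50 to show how F1 penalizes the imbalance more than a plain average would:

```latex
\[
F_1 = \frac{2PR}{P + R}
\]
\[
P = 0.90,\; R = 0.50:\qquad
F_1 = \frac{2(0.90)(0.50)}{0.90 + 0.50} = \frac{0.90}{1.40} \approx 0.64,
\qquad
\frac{P + R}{2} = 0.70
\]
```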

How large does my gold set need to be?

Large enough that your metrics are stable across resampling, which for most extraction tasks means a few hundred labeled documents spanning your real input variety. Stability matters more than raw size; a noisy metric from a tiny set will mislead you.
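One common way to check stability across resampling is a bootstrap over documents. The sketch below is an assumption about how you might do it, not a prescribed method; the per-document counts are invented, and the larger set is just the smaller one repeated to isolate the effect of size.

```python
import random


def bootstrap_interval(doc_results: list[tuple[int, int]],
                       n_resamples: int = 1000, seed: int = 0) -> tuple[float, float]:
    """Approximate 95% percentile interval for precision.

    doc_results holds (correct_extracted, total_extracted) per gold document.
    Documents are resampled with replacement, since they are the labeled unit.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(doc_results, k=len(doc_results))
        correct = sum(c for c, _ in resample)
        total = sum(t for _, t in resample)
        estimates.append(correct / total if total else 0.0)
    estimates.sort()
    lower = estimates[n_resamples * 25 // 1000]
    upper = estimates[n_resamples * 975 // 1000 - 1]
    return lower, upper


# Hypothetical per-document results: 10 documents, overall precision 0.71.
small_set = [(9, 10), (4, 10), (10, 10), (6, 10), (8, 10),
             (3, 10), (10, 10), (7, 10), (9, 10), (5, 10)]
large_set = small_set * 30  # 300 documents with the same mix

small_low, small_high = bootstrap_interval(small_set)
large_low, large_high = bootstrap_interval(large_set)
small_width = small_high - small_low
large_width = large_high - large_low
```

If the interval on your gold set is wider than the difference between two prompts you are comparing, the set is too small to tell them apart.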

Can I use the model to grade itself?

Model-assisted grading is useful for triage and scaling, but never let it replace a human-labeled gold set entirely. A model that makes a systematic extraction error will often make the same error when grading, hiding the very problem you need to find.
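Before trusting a model grader for triage, it can be audited against the human labels you already have. A sketch, assuming both sets of labels are parallel lists of booleans over the same triples; the names and data are hypothetical.

```python
def grader_audit(human: list[bool], model: list[bool]) -> dict[str, float]:
    """Compare model-assigned grades with human grades (True means 'triple is correct')."""
    if len(human) != len(model):
        raise ValueError("label lists must be the same length")
    agree = sum(h == m for h, m in zip(human, model))
    # Grades the model gave to triples that humans marked as wrong.
    grades_on_wrong = [m for h, m in zip(human, model) if not h]
    return {
        "agreement": agree / len(human) if human else 0.0,
        "false_accept_rate": (
            sum(grades_on_wrong) / len(grades_on_wrong) if grades_on_wrong else 0.0
        ),
    }


# Hypothetical labels for ten extracted triples.
human_labels = [True, True, True, True, True, True, False, False, False, False]
model_labels = [True, True, True, True, True, True, True, True, False, False]

audit = grader_audit(human_labels, model_labels)
# Agreement is 0.8, yet the grader accepted half of the triples humans rejected.
```

The false-accept rate is the number to watch, because it measures exactly the shared blind spot described above. Overall agreement can look healthy while it stays high.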

How do I measure recall when I do not know all the true triples?

You estimate it on the gold set where humans have enumerated the true triples for those documents. You cannot measure recall on unlabeled production data directly, which is exactly why a representative gold set is irreplaceable.

What does a sudden drop in conformance rate mean?

Usually a change in input distribution or a model update altering output formatting. Treat it as an early warning to investigate before the quality degradation reaches your stored graph.

Key Takeaways

  • Measure at the triple level with precision, recall, and F1; a single accuracy number hides the failure mode you most need to see.
  • Entity resolution accuracy and schema conformance are distinct quality dimensions that triple correctness alone does not capture.
  • A small, stratified, reusable gold set outperforms a large unlabeled one and becomes your regression suite.
  • Instrument production with confidence signals and provenance coverage to catch distribution shifts early.
  • Tie every metric to a threshold and an action, or it becomes decoration rather than a decision tool.
