How One Team Replaced Manual Invoice Entry in a Month

Agency Script Editorial

Editorial Team

January 27, 2023 · 7 min read
Tags: prompting for data extraction, prompting for data extraction case study, prompting for data extraction guide, prompt engineering

The most instructive way to understand extraction is to watch a team work through a real problem from start to finish. This case study follows an operations team at a mid-sized services firm that processed vendor invoices by hand, the decisions they faced, the path they chose, and what it produced. The details are composed to illustrate the realistic arc of such a project rather than to report a specific company's figures, but every decision reflects choices these projects genuinely require.

The value here is in the sequence: how the team scoped the problem, where they almost went wrong, what changed their approach, and how they measured whether it worked. Extraction projects fail less from bad prompts than from skipping these steps, and seeing them in order makes the discipline concrete.

Read it as a story with a thesis. The thesis is that extraction succeeds when teams treat it as a process with a defined schema, explicit edge-case handling, and validation, and that the prompt itself is the smallest part of the work.

The Situation

The team received roughly four hundred vendor invoices a month, each hand-keyed into their accounting system by two staff members.

The Cost of the Status Quo

Manual entry took most of two people's time, introduced occasional transcription errors, and created a backlog at month-end that delayed payments. The errors were the real pain: a transposed total or wrong vendor occasionally triggered a payment dispute. The team wanted speed, but accuracy was the non-negotiable requirement.

The Decision

Faced with the choice between an off-the-shelf OCR product and a custom extraction pipeline, the team weighed control against convenience.

Why They Built

The off-the-shelf tools handled standard layouts but choked on the firm's long tail of irregular vendor formats. The team chose a language-model extraction pipeline because it could handle varied layouts and because they could tune the edge-case rules themselves. Before writing any prompt, they defined the output schema, a lesson reinforced in The Complete Guide to Prompting for Data Extraction.

The Execution

The build followed a deliberate sequence rather than jumping straight to a prompt.

Schema, Prompt, Validation

They first defined the record: invoice number, vendor, date, terms, subtotal, tax, and total, each typed and marked required or optional. They wrote a prompt with one worked example showing how to handle a missing due date, then added a rule to leave it null rather than calculate it. Finally they validated every parsed record against the schema in code. The full ordering they followed mirrors A Step-by-Step Approach to Prompting for Data Extraction.
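A prompt of the kind described here might be assembled as in the sketch below. It is illustrative only: the field names are taken from the record above, but the `due_date` field, the instruction wording, the sample invoice, and the `build_prompt` helper are all assumptions made for the example, not the team's actual prompt.

```python
import json

# Field names follow the record described in the text. Including due_date
# as its own field is an assumption made so the null rule has a target.
FIELDS = ["invoice_number", "vendor", "date", "terms", "due_date",
          "subtotal", "tax", "total"]

# Hypothetical invoice text used as the single worked example.
WORKED_EXAMPLE_INPUT = (
    "INVOICE 1042\n"
    "Acme Supply Co.\n"
    "Date: 2023-01-05   Terms: Net 30\n"
    "Subtotal 100.00  Tax 8.00  Total 108.00"
)

# The invoice prints terms but no due date, so due_date stays null
# instead of being calculated from the terms.
WORKED_EXAMPLE_OUTPUT = {
    "invoice_number": "1042",
    "vendor": "Acme Supply Co.",
    "date": "2023-01-05",
    "terms": "Net 30",
    "due_date": None,
    "subtotal": "100.00",
    "tax": "8.00",
    "total": "108.00",
}


def build_prompt(invoice_text: str) -> str:
    """Assemble an extraction prompt containing one worked example."""
    return "\n".join([
        "Extract these fields from the invoice and reply with JSON only:",
        ", ".join(FIELDS),
        "If a field is not printed on the invoice, return null. "
        "Do not calculate or infer missing values such as the due date.",
        "",
        "Example invoice:",
        WORKED_EXAMPLE_INPUT,
        "Example output:",
        json.dumps(WORKED_EXAMPLE_OUTPUT),
        "",
        "Invoice:",
        invoice_text,
        "Output:",
    ])
```

Keeping the example output as a Python dictionary and serializing it with `json.dumps` means the example shown to the model is always valid JSON, including the literal `null` for the missing due date.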

  • Week 1: gathered varied invoice samples and defined the schema
  • Week 2: wrote and tested the prompt against the full sample set
  • Week 3: built code validation and a human-review queue for failures
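The Week 3 work, validation in code plus a review queue, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the text says each field was typed and marked required or optional but does not say which, so the required flags below are guesses, and `SCHEMA`, `coerce`, `validate`, and `route` are hypothetical names.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# (expected type, required?) per field. The required flags are assumptions.
SCHEMA = {
    "invoice_number": (str, True),
    "vendor": (str, True),
    "date": (date, True),
    "terms": (str, False),
    "subtotal": (Decimal, False),
    "tax": (Decimal, False),
    "total": (Decimal, True),
}


def coerce(value, expected_type):
    """Convert a raw JSON value to the expected type or raise ValueError."""
    if expected_type is str:
        if not isinstance(value, str) or not value.strip():
            raise ValueError("expected a non-empty string")
        return value.strip()
    if expected_type is date:
        return date.fromisoformat(value)  # ValueError or TypeError on bad input
    if expected_type is Decimal:
        try:
            amount = Decimal(str(value))
        except InvalidOperation as exc:
            raise ValueError("expected a decimal amount") from exc
        if not amount.is_finite():
            raise ValueError("expected a finite amount")
        return amount
    raise ValueError(f"unsupported type {expected_type!r}")


def validate(raw: dict):
    """Return (record, errors); errors is empty when the record is valid."""
    record, errors = {}, []
    for name, (expected_type, required) in SCHEMA.items():
        value = raw.get(name)
        if value is None:
            if required:
                errors.append(f"{name}: missing required field")
            record[name] = None  # optional and absent stays null
            continue
        try:
            record[name] = coerce(value, expected_type)
        except (ValueError, TypeError) as exc:
            errors.append(f"{name}: {exc}")
    return record, errors


def route(raw: dict, posted: list, review_queue: list) -> None:
    """Post valid records; send everything else to human review."""
    record, errors = validate(raw)
    if errors:
        review_queue.append({"raw": raw, "errors": errors})
    else:
        posted.append(record)
```

Amounts are parsed as `Decimal` rather than `float` so that a total such as 108.00 is compared exactly, and a misread character in a scanned total fails parsing instead of silently producing a number.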

The Near-Miss

Early in testing, the pipeline looked perfect on the demo invoices and the team nearly shipped.

What Caught Them

A reviewer noticed that every test invoice was a clean PDF, while a quarter of real invoices were scanned images with smudged totals. Tuning only on clean documents is among the most common traps, as 7 Common Mistakes with Prompting for Data Extraction (and How to Avoid Them) describes. They expanded the sample set, added the messy scans, and discovered the prompt needed a stricter total-disambiguation rule. The near-miss cost a week and prevented a flood of bad records.
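The case study does not spell out the total-disambiguation rule itself, which lived in the prompt. One way a team could back such a rule up in code is an arithmetic cross-check on the extracted amounts. The sketch below is an assumption for illustration, not the team's rule; `totals_consistent` and its one-cent tolerance are hypothetical.

```python
from decimal import Decimal
from typing import Optional


def totals_consistent(
    subtotal: Optional[Decimal],
    tax: Optional[Decimal],
    total: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> Optional[bool]:
    """Check that subtotal + tax matches total within a small tolerance.

    Returns None when subtotal or tax is missing, because there is
    nothing to cross-check against and the code should not guess.
    """
    if subtotal is None or tax is None:
        return None
    return abs(subtotal + tax - total) <= tolerance
```

A `False` result would be a natural reason to send the record to human review, since it suggests the model picked a competing figure, such as a subtotal or a balance carried forward, as the total.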

The Outcome

After three weeks of build and a two-week supervised rollout, the pipeline ran in production with a human queue for flagged records.

What Changed

Most invoices flowed through automatically and were validated in code; the small share that failed validation went to a person rather than into the system. The two staff members shifted from keying every invoice to reviewing only exceptions, and the month-end backlog disappeared. The errors that had triggered payment disputes dropped because validation caught malformed records before they posted.

How They Measured Success

The team resisted declaring victory based on the demo and instead defined what success would look like before the rollout, then measured against it.

The Metrics That Mattered

Rather than tracking a vague sense of speed, they watched three concrete numbers: the share of invoices that passed validation automatically (the straight-through rate), the validation-failure rate that routed records to human review, and the number of payment disputes traced to entry errors. Defining these up front meant the rollout could be judged objectively rather than by impression. The straight-through rate told them how much manual work was eliminated, the failure rate told them where the prompt still needed work, and the dispute count told them whether accuracy had genuinely improved.
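The three numbers are simple to compute once each invoice's validation result is recorded. The following is a minimal sketch; `RolloutMetrics` and `compute_metrics` are hypothetical names, and the inputs are assumed to be monthly counts kept by the team.

```python
from dataclasses import dataclass


@dataclass
class RolloutMetrics:
    straight_through_rate: float  # share that passed validation automatically
    failure_rate: float           # share routed to human review
    entry_error_disputes: int     # payment disputes traced to entry errors


def compute_metrics(
    total_invoices: int,
    passed_validation: int,
    entry_error_disputes: int,
) -> RolloutMetrics:
    """Derive the three rollout metrics from raw counts."""
    if total_invoices <= 0:
        raise ValueError("total_invoices must be positive")
    if not 0 <= passed_validation <= total_invoices:
        raise ValueError("passed_validation must be between 0 and total_invoices")
    failed = total_invoices - passed_validation
    return RolloutMetrics(
        straight_through_rate=passed_validation / total_invoices,
        failure_rate=failed / total_invoices,
        entry_error_disputes=entry_error_disputes,
    )
```

Because every invoice either passes or fails validation, the two rates always sum to one; tracking both is still useful because each one answers a different question for the team.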

The Supervised Rollout

For the first two weeks, a person reviewed every record the pipeline produced, comparing it to the source invoice even when validation passed. This supervised period was not wasted effort; it surfaced a handful of subtle errors that validation alone would not have caught, such as a correct-looking total pulled from the wrong line. Only after the supervised review confirmed the pipeline's accuracy did the team allow validated records to post without a human glance. The discipline of auditing a sample against sources continued afterward on a weekly cadence.
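The weekly audit can be as light as drawing a random sample of posted records for a person to compare against the source invoices. The sketch below assumes records are identified by some ID; `pick_audit_sample` and the sample size are hypothetical choices, since the case study does not say how the sample was drawn.

```python
import random
from typing import Iterable, List, Optional


def pick_audit_sample(
    record_ids: Iterable,
    sample_size: int,
    seed: Optional[int] = None,
) -> List:
    """Pick record IDs at random, without replacement, for manual audit."""
    ids = list(record_ids)
    rng = random.Random(seed)  # a seed makes the draw reproducible
    k = min(sample_size, len(ids))
    return rng.sample(ids, k)
```

Sampling at random, rather than reviewing whichever records look suspicious, is what lets the audit catch the errors validation misses, such as a correct-looking total pulled from the wrong line.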

Lessons Other Teams Can Borrow

The specifics of vendor invoices are less important than the decisions that generalize to any extraction project.

The Transferable Principles

The team's success rested on a sequence any team can follow: define the schema first, gather realistic and messy samples, write a prompt with one edge-case example, add explicit rules for absent and competing values, validate in code, and route failures to people. The near-miss with scanned invoices is the most portable lesson, because almost every project has a hidden category of messy input that the demo never exercised. Looking for that category before launch, rather than after, is what separated a smooth rollout from an incident. The full discipline these choices reflect is mapped in A Framework for Prompting for Data Extraction, and the practices that hardened the pipeline appear in Prompting for Data Extraction: Best Practices That Actually Work.

Frequently Asked Questions

Why did the team build instead of buying an off-the-shelf tool?

Off-the-shelf OCR tools handled standard invoice layouts well but failed on the firm's long tail of irregular vendor formats, which made up a meaningful share of volume. A language-model pipeline let the team handle varied layouts and tune edge-case rules themselves. The trade-off was more upfront work, but it bought the flexibility and accuracy the standard tools could not deliver for their specific document mix.

What was the most important decision in the project?

It was really a pair of decisions: defining the output schema before writing the prompt, and gathering messy sample documents before shipping. The schema gave every later step a clear target, and the messy samples exposed the scanned-invoice problem that nearly slipped through. Both decisions reflect the same principle: treat extraction as a process with a defined contract and realistic test data, not as a clever prompt you write once and trust.

How did the team prevent bad records from reaching the accounting system?

They validated every parsed record against the schema in code, rejecting anything with a missing required field or wrong type and routing it to a human review queue. Records that passed validation flowed through automatically; the rest got human attention. This safety net is what reduced the payment disputes that hand-keying errors had previously caused, because malformed records never posted.

Could a smaller team replicate this?

Yes. The core work (defining a schema, writing a prompt with one good example, adding edge-case rules, and validating output in code) scales down to a single person handling a smaller volume. The human-review queue can be as simple as a flagged spreadsheet. The discipline matters more than the team size; the same sequence produces reliable results whether you process forty invoices or four hundred.

Key Takeaways

  • Manual data entry's real cost is often accuracy, not just speed
  • Build over buy when off-the-shelf tools cannot handle your irregular document formats
  • Define the schema before the prompt and follow a deliberate build sequence
  • Test against messy, realistic samples before shipping, not just clean demos
  • Validate every record in code and route failures to a human review queue
  • The prompt is the smallest part of the work; the process around it determines success
