PROVE: A Five-Stage Loop for the Examples Decision

Agency Script Editorial

Editorial Team

June 18, 2025 · 6 min read
Teams decide between zero-shot and few-shot prompting by instinct, and instinct defaults to adding examples. A framework replaces that reflex with a repeatable process where every step produces evidence. The one below — call it PROVE — has five stages: Prime, Run, Observe, Validate, Evolve. It is deliberately a loop, because the right answer changes as your model and data change.

The framework's core principle is that examples are a cost you must justify, not a default you assume. Each stage exists to make sure you only pay for examples that measurably earn their tokens.

Stage 1: Prime — Specify the Task Without Examples

Priming means writing the sharpest possible zero-shot instruction. Name the output format, enumerate the categories or fields, and state how to handle edge cases explicitly.
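
As a minimal sketch of what a primed instruction can look like when assembled in code, the helper below builds a zero-shot prompt from an output format, a category list, and explicit edge-case rules. The function name `build_prime`, the ticket categories, and the rules are hypothetical stand-ins for your own task, not part of the framework itself.

```python
def build_prime(task: str, output_format: str, categories: list[str],
                edge_cases: dict[str, str]) -> str:
    """Assemble a zero-shot instruction with no examples (hypothetical helper)."""
    lines = [task.strip(), "", f"Output format: {output_format}", "", "Categories:"]
    lines += [f"- {name}" for name in categories]
    lines += ["", "Edge cases:"]
    # Each rule is stated explicitly instead of being left for examples to imply.
    lines += [f"- If {condition}, {action}." for condition, action in edge_cases.items()]
    return "\n".join(lines)


# Illustrative task only; the categories and rules are assumptions.
prime = build_prime(
    task="Classify the support ticket into exactly one category.",
    output_format="a single category name, lowercase, no punctuation",
    categories=["billing", "shipping", "technical"],
    edge_cases={
        "the ticket fits two categories": "choose the one the customer asks about first",
        "the ticket fits no category": "answer 'technical'",
    },
)
print(prime)
```

Keeping the format, categories, and edge cases as separate arguments makes it easy to see which part of the specification is missing when a later stage surfaces a failure.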

Why Prime comes first

If the instruction can fully specify the task, you may need no examples at all — and a strong instruction transfers across models far better than a tuned example set. Priming also surfaces ambiguity early: if you cannot describe the task clearly in words, examples will only hide that gap, not close it.

The test for a good prime: hand the instruction to a competent human with no examples. If they produce the right output, your prime is strong.

Stage 2: Run — Baseline Zero-Shot on Real Data

Running means executing the primed instruction against a labeled eval set drawn from real inputs, including messy and ambiguous cases.

This stage produces your reference number: zero-shot accuracy per category, plus prompt token count and latency. Every later decision is measured against this baseline. Without it, you are guessing. The discipline here mirrors our best practices guide.
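
The baseline can be sketched as a small scoring loop. Everything here is an assumption made for illustration: `fake_model` stands in for a real model call, the three-row eval set is far smaller than a real one, and the whitespace token count is only a rough proxy for your model's tokenizer.

```python
import time
from collections import defaultdict


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own client."""
    return "billing" if "invoice" in prompt else "shipping"


def run_baseline(prime: str, eval_set: list[dict]) -> dict:
    """Score a zero-shot prime per category and record rough cost figures."""
    correct = defaultdict(int)
    total = defaultdict(int)
    latencies = []
    for row in eval_set:
        prompt = f"{prime}\n\nInput: {row['text']}"
        start = time.perf_counter()
        prediction = fake_model(prompt)
        latencies.append(time.perf_counter() - start)
        total[row["label"]] += 1
        correct[row["label"]] += int(prediction == row["label"])
    return {
        "accuracy_by_category": {c: correct[c] / total[c] for c in total},
        # Whitespace split is a crude proxy; use your model's tokenizer in practice.
        "prompt_tokens_approx": len(prime.split()),
        "mean_latency_s": sum(latencies) / len(latencies),
    }


# Tiny hypothetical eval set; a real one is drawn from labeled production inputs.
eval_set = [
    {"text": "My invoice is wrong", "label": "billing"},
    {"text": "Where is my parcel?", "label": "shipping"},
    {"text": "The app crashes on login", "label": "technical"},
]
baseline = run_baseline("Classify the ticket as billing, shipping, or technical.", eval_set)
print(baseline["accuracy_by_category"])
```

Reporting accuracy per category rather than as one headline number is what lets the next stage see which part of the task is actually failing.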

If the baseline meets your bar, you exit the framework here with the cheapest possible prompt. Most teams are surprised by how often that happens on modern models.

Stage 3: Observe — Diagnose the Failures

If zero-shot fell short, Observe is where you diagnose why, input by input. The critical distinction:

  • Missing definition: the instruction was vague about a category or edge case. The fix is a better prime, not examples.
  • Missing demonstration: the task carries an implicit rule — a schema convention, brand voice, code style — that words struggle to convey. This is where examples genuinely help.

Conflating these is the most expensive mistake teams make, and our common mistakes guide shows how it inflates prompts with examples that paper over a fixable instruction.
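
One way to keep the two diagnoses separate is to record one per failed input and tally them. The sketch below assumes a hand-labeled failure log; the field names, the sample rows, and the `triage` helper are hypothetical.

```python
from collections import Counter

# Hypothetical failure log; each diagnosis is assigned by a human reviewer.
failures = [
    {"input_id": 12, "diagnosis": "missing_definition"},
    {"input_id": 31, "diagnosis": "missing_demonstration"},
    {"input_id": 47, "diagnosis": "missing_definition"},
]

# Each diagnosis maps to a different lever, as described above.
FIX_FOR = {
    "missing_definition": "sharpen the prime",
    "missing_demonstration": "candidate for examples in Validate",
}


def triage(failures: list[dict]) -> dict[str, dict]:
    """Count failures per diagnosis and attach the fix each one calls for."""
    counts = Counter(item["diagnosis"] for item in failures)
    return {d: {"count": n, "fix": FIX_FOR[d]} for d, n in counts.items()}


report = triage(failures)
for diagnosis, info in report.items():
    print(diagnosis, info["count"], "->", info["fix"])
```

In this made-up log, two of the three failures call for a better instruction, so only one input would move on to Validate.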

Stage 4: Validate — Add Examples and Measure the Delta

Only failures diagnosed as "missing demonstration" justify moving here. Validate means adding examples deliberately and measuring whether they help.

  • Pull examples from real data, including hard cases.
  • Balance labels to avoid majority bias.
  • Start with two; add more only on measured accuracy gains.
  • Re-measure tokens, latency, and order-bias stability.
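
The balancing and order-stability points in the list above can be sketched as follows. The pool, the labels, and the helper names `pick_balanced` and `orderings` are assumptions for illustration; in practice each ordering would be scored against the eval set to check that accuracy does not swing with example order.

```python
import itertools
import random


def pick_balanced(pool: list[dict], per_label: int, seed: int = 0) -> list[dict]:
    """Pick the same number of examples per label from a labeled pool."""
    rng = random.Random(seed)
    by_label: dict[str, list[dict]] = {}
    for row in pool:
        by_label.setdefault(row["label"], []).append(row)
    chosen = []
    for label in sorted(by_label):
        # random.sample raises ValueError if a label has too few candidates.
        chosen += rng.sample(by_label[label], per_label)
    return chosen


def orderings(examples: list[dict]) -> list[tuple]:
    """Every ordering of the example set, for an order-bias check."""
    return list(itertools.permutations(examples))


# Hypothetical pool drawn from real data, skewed toward one label.
pool = [
    {"text": "Invoice charged twice", "label": "billing"},
    {"text": "Refund not received", "label": "billing"},
    {"text": "Card declined at checkout", "label": "billing"},
    {"text": "Parcel arrived damaged", "label": "shipping"},
]
examples = pick_balanced(pool, per_label=1)
print([e["label"] for e in examples], len(orderings(examples)))
```

Starting with one example per label keeps the set at two here, matching the advice to begin small and grow only on measured gains.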

The output of Validate is a precise delta: this many examples buy this much accuracy at this much cost. If the delta does not justify the cost, you revert. Our Real-World Examples and Use Cases guide shows what good example sets look like across task types.
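
The delta can be reduced to a few lines of arithmetic. This is a sketch under stated assumptions: the accuracy figures, the token count, and the threshold are invented numbers, and "accuracy gain per 100 added tokens" is just one possible way to express the trade.

```python
def validate_delta(baseline_acc: float, few_shot_acc: float,
                   added_tokens: int, min_gain_per_100_tokens: float) -> dict:
    """Express the accuracy gain relative to the tokens the examples add."""
    if added_tokens <= 0:
        raise ValueError("added_tokens must be positive")
    gain = few_shot_acc - baseline_acc
    gain_per_100 = gain / added_tokens * 100
    return {
        "gain": gain,
        "gain_per_100_tokens": gain_per_100,
        # Keep the examples only if they clear the bar you set in advance.
        "keep_examples": gain_per_100 >= min_gain_per_100_tokens,
    }


# Hypothetical numbers: 10 points of accuracy for 200 extra prompt tokens.
result = validate_delta(baseline_acc=0.80, few_shot_acc=0.90,
                        added_tokens=200, min_gain_per_100_tokens=0.02)
print(result)
```

Setting the threshold before running the comparison keeps the decision from drifting toward whatever the examples happened to deliver.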

Stage 5: Evolve — Re-Run the Loop on Every Change

Evolve is the stage teams skip, and it is why prompts rot. Prompts that were correct a year ago are frequently over-engineered today.

What triggers an Evolve pass

  • A model upgrade — re-run Prime and Run; you can often delete examples.
  • An input-distribution shift — refresh the eval set and re-Observe.
  • Rising example token spend on a stable task — consider whether fine-tuning now amortizes better than prompting.

Evolve closes the loop. The framework is not a one-time decision; it is a maintenance cycle. When Evolve points toward fine-tuning, see the trade-offs guide.
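
The three triggers can be written down as a simple checklist function, which is easy to wire into a recurring review. The function name and the wording of the returned actions are hypothetical; how you detect each trigger is left to your own monitoring.

```python
def evolve_actions(model_upgraded: bool, inputs_shifted: bool,
                   example_spend_rising: bool) -> list[str]:
    """Map each Evolve trigger to the stage it sends you back to."""
    actions = []
    if model_upgraded:
        actions.append("re-run Prime and Run; try deleting examples")
    if inputs_shifted:
        actions.append("refresh the eval set and re-Observe")
    if example_spend_rising:
        actions.append("compare fine-tuning cost against prompting cost")
    return actions


# Example review: a new model version shipped, nothing else changed.
print(evolve_actions(model_upgraded=True, inputs_shifted=False,
                     example_spend_rising=False))
```

An empty list means no trigger fired and the current prompt stands until the next review.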

A Worked Pass Through PROVE

To make the stages concrete, walk through a single task: extracting structured shipping details from freeform customer emails into a fixed JSON schema.

Prime. Write an instruction naming every field — recipient, address, requested date, special instructions — and stating how to handle missing values (null) and multiple candidates (take the most recent). Hand it to a colleague; they extract correctly from a sample email. The prime is strong.
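
A fixed schema like this one is easy to enforce on the model's output before scoring. The sketch below assumes snake_case key names for the four fields and treats an absent key the same as null; both are illustrative choices, and a stricter check might reject absent keys instead.

```python
import json

# Assumed key names for the four fields named in the prime.
FIELDS = ("recipient", "address", "requested_date", "special_instructions")


def parse_extraction(raw: str) -> dict:
    """Parse model output and normalize it to the fixed schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    unexpected = set(data) - set(FIELDS)
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    # Absent keys become None, matching the 'missing values are null' rule.
    return {field: data.get(field) for field in FIELDS}


# Hypothetical model output for one email.
raw = '{"recipient": "Ana Ruiz", "address": "12 Elm St", "requested_date": null}'
print(parse_extraction(raw))
```

Normalizing every output to the same four keys makes the per-field scoring in the Run step a straightforward comparison against the labels.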

Run. Score it against 200 labeled real emails, broken down by field. Recipient and address hit high accuracy zero-shot. The "requested date" field lags — the model formats dates inconsistently and mishandles relative dates like "next Tuesday."

Observe. Diagnose: recipient and address are fine, so no examples needed there. The date failures are a missing demonstration problem — the convention for resolving relative dates is genuinely hard to state in words. This is a Validate candidate, not a Prime fix.

Validate. Add two examples showing relative-date resolution, including one ambiguous case. Re-score: date accuracy jumps, other fields unchanged, prompt grows by 180 tokens. The accuracy-per-token delta justifies the cost. Ship it.

Evolve. Three months later, a model upgrade lands. Re-run Prime and Run; the new model resolves relative dates zero-shot. Delete the examples, reclaim the tokens. The loop closes.

This single pass shows why each stage exists: Prime prevents wasted examples, Observe prevents the wrong fix, Validate measures the trade, and Evolve reclaims what newer models make free.

Common Ways PROVE Is Misapplied

The framework fails when teams shortcut stages. The most expensive shortcut is skipping Observe: treating any zero-shot shortfall as a signal to add examples, when the real fix is often a sharper prime. Adding examples to a missing-definition failure works locally but leaves the vague instruction in place to cause the next gap, exactly the pattern in our common mistakes guide.

The second failure is running Validate without a real eval set, so "the examples help" is a hand-wave instead of a measured delta. And the third is never scheduling Evolve, which lets prompts ossify across model generations. Each shortcut feels reasonable under deadline pressure and each one compounds into expensive, fragile prompts over time.

Applying PROVE in Practice

In practice the framework is fast. Prime and Run take an afternoon if you have an eval set. Observe is where the judgment lives. Validate is mechanical once you know which failures need demonstration. Evolve is a recurring calendar item, not a project. The payoff is that you never again add examples on reflex — every one in your prompt has a measured justification behind it.

Frequently Asked Questions

What problem does the PROVE framework actually solve?

It replaces gut-feel prompt decisions with an evidence-driven loop, so you only pay for examples that measurably help. It also forces the Prime-vs-Validate distinction that prevents teams from papering over vague instructions with examples.

Which stage do teams most often skip?

Evolve. They write a prompt once and never revisit it, so it goes stale across model upgrades. Making Evolve a recurring calendar item is the cheapest high-leverage habit in the whole framework.

How is Observe different from just looking at accuracy?

Observe diagnoses why each input failed, separating "missing definition" (fix the prime) from "missing demonstration" (add an example). Headline accuracy tells you something is wrong; Observe tells you which lever fixes it.

Can I use PROVE for reasoning tasks?

Yes. In Validate, your examples should demonstrate the reasoning process rather than just the final answers, and in Run you should test a zero-shot "reason step by step" prime, which now closes much of the gap on capable models.

How long does one full PROVE cycle take?

With an existing eval set, Prime and Run take an afternoon, Observe and Validate a day or two depending on task complexity. Evolve passes are short re-runs triggered by model or data changes, not full projects.

Key Takeaways

  • PROVE — Prime, Run, Observe, Validate, Evolve — turns prompt decisions into an evidence loop.
  • Prime first: a strong instruction may make examples unnecessary and transfers across models.
  • Observe separates "missing definition" (fix the prime) from "missing demonstration" (add examples).
  • Validate measures the precise accuracy-versus-cost delta of each example.
  • Evolve is a recurring maintenance cycle, not a one-time decision.
