PROVE: A Five-Stage Loop for the Examples Decision

Teams often decide between zero-shot and few-shot prompting by instinct, and instinct defaults to adding examples. A framework replaces that reflex with a repeatable process in which every step produces evidence. The one below, call it PROVE, has five stages: Prime, Run, Observe, Validate, Evolve. It is deliberately a loop, because the right answer changes as your model and data change.

The framework's core principle is that examples are a cost you must justify, not a default you assume. Each stage exists to make sure you only pay for examples that measurably earn their tokens.

Stage 1: Prime — Specify the Task Without Examples

Priming means writing the sharpest possible zero-shot instruction. Name the output format, enumerate the categories or fields, and state how to handle edge cases explicitly.

Why Prime comes first

If the instruction can fully specify the task, you may need no examples at all — and a strong instruction transfers across models far better than a tuned example set. Priming also surfaces ambiguity early: if you cannot describe the task clearly in words, examples will only hide that gap, not close it.

The test for a good prime: hand the instruction to a competent human with no examples. If they produce the right output, your prime is strong.
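One way to keep a prime honest is to assemble it from an explicit spec, so a missing format, field, or edge-case rule is visible before any model sees it. The sketch below is only an illustration: `TaskSpec` and `build_prime` are hypothetical names, and the field definitions are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical container for the three things a prime should state."""
    output_format: str
    fields: dict[str, str]  # field name -> one-line definition
    edge_cases: list[str] = field(default_factory=list)


def build_prime(spec: TaskSpec) -> str:
    """Render a zero-shot instruction that contains no examples."""
    lines = [f"Return {spec.output_format}.", "", "Fields:"]
    lines += [f"- {name}: {definition}" for name, definition in spec.fields.items()]
    if spec.edge_cases:
        lines += ["", "Edge cases:"]
        lines += [f"- {rule}" for rule in spec.edge_cases]
    return "\n".join(lines)


# Illustrative spec; the definitions are assumptions, not a recommended wording.
spec = TaskSpec(
    output_format="a single JSON object",
    fields={
        "recipient": "full name of the person receiving the shipment",
        "address": "delivery address as written in the email",
    },
    edge_cases=["If a field is missing, use null."],
)
prime = build_prime(spec)
print(prime)
```

The printed text is what you would hand to a colleague for the human test described above.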

Stage 2: Run — Baseline Zero-Shot on Real Data

Running means executing the primed instruction against a labeled eval set drawn from real inputs, including messy and ambiguous cases.

This stage produces your reference number: zero-shot accuracy per category, plus prompt token count and latency. Every later decision is measured against this baseline. Without it, you are guessing. The discipline here mirrors our best practices guide.

If the baseline meets your bar, you exit the framework here with the cheapest possible prompt. Most teams are surprised how often that happens on modern models.
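The baseline described in this stage can be sketched as a small harness. Everything here is an assumption for illustration: the row shape, the `call_model` callable (a stand-in for whatever client you use), exact-match scoring, and the whitespace token count, which is only a crude proxy for a real tokenizer.

```python
import time
from collections import defaultdict
from typing import Callable


def run_baseline(
    eval_set: list[dict],
    call_model: Callable[[str], str],
    prime: str,
) -> dict:
    """Score a zero-shot prime per category and record mean latency.

    Rows are assumed to look like {"input": str, "label": str, "category": str}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    latencies = []
    for row in eval_set:
        start = time.perf_counter()
        prediction = call_model(f"{prime}\n\n{row['input']}")
        latencies.append(time.perf_counter() - start)
        total[row["category"]] += 1
        if prediction.strip() == row["label"]:
            correct[row["category"]] += 1
    return {
        "accuracy_by_category": {c: correct[c] / total[c] for c in total},
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
        # Crude whitespace count; swap in your model's tokenizer for real numbers.
        "prompt_tokens_approx": len(prime.split()),
    }


def fake_model(prompt: str) -> str:
    """Stand-in model for illustration only: always answers 'refund'."""
    return "refund"


eval_set = [
    {"input": "I want my money back", "label": "refund", "category": "refund"},
    {"input": "Where is my parcel?", "label": "shipping", "category": "shipping"},
]
report = run_baseline(eval_set, fake_model, "Classify the email as refund or shipping.")
print(report["accuracy_by_category"])
```

Keeping the result keyed by category matters later: Observe and Validate both work on the categories that fell short, not on the headline number.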

Stage 3: Observe — Diagnose the Failures

If zero-shot fell short, Observe is where you diagnose why, input by input. The critical distinction:

Missing definition: the instruction was vague about a category or edge case. The fix is a better prime, not examples.
Missing demonstration: the task carries an implicit rule — a schema convention, brand voice, code style — that words struggle to convey. This is where examples genuinely help.

Conflating these is the most expensive mistake teams make, and our common mistakes guide shows how it inflates prompts with examples that paper over a fixable instruction.
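Once a person has read each failing input and labeled it, the routing decision is mechanical. The sketch below assumes a hypothetical row shape and one design choice that is not stated in the text: any missing-definition failure sends the whole category back to Prime before examples are considered.

```python
from collections import Counter

# The two diagnoses named in this stage.
MISSING_DEFINITION = "missing_definition"
MISSING_DEMONSTRATION = "missing_demonstration"


def next_steps(diagnosed_failures: list[dict]) -> dict[str, list[str]]:
    """Group failing categories by the lever that fixes them.

    Each row is assumed to be {"category": str, "diagnosis": str}, where the
    diagnosis was assigned by a person reading the failing input.
    """
    by_category: dict[str, Counter] = {}
    for row in diagnosed_failures:
        by_category.setdefault(row["category"], Counter())[row["diagnosis"]] += 1

    plan: dict[str, list[str]] = {"rewrite_prime": [], "try_examples": []}
    for category, counts in by_category.items():
        # Assumption: fix the instruction first, since examples would only
        # hide a vague definition.
        if counts[MISSING_DEFINITION] > 0:
            plan["rewrite_prime"].append(category)
        elif counts[MISSING_DEMONSTRATION] > 0:
            plan["try_examples"].append(category)
    return plan


# Illustrative labels, not real data.
failures = [
    {"category": "billing", "diagnosis": MISSING_DEFINITION},
    {"category": "tone", "diagnosis": MISSING_DEMONSTRATION},
    {"category": "tone", "diagnosis": MISSING_DEMONSTRATION},
]
plan = next_steps(failures)
print(plan)
```

Only the categories under `try_examples` should proceed to the next stage.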

Stage 4: Validate — Add Examples and Measure the Delta

Only failures diagnosed as "missing demonstration" justify moving here. Validate means adding examples deliberately and measuring whether they help.

Pull examples from real data, including hard cases.
Balance labels to avoid majority bias.
Start with two; add more only on measured accuracy gains.
Re-measure tokens, latency, and order-bias stability.
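Two of the checks above, label balance and order-bias stability, can be sketched in a few lines. The helper names and the `score` callable are hypothetical; `score` stands in for "build the prompt with the examples in this order and return eval-set accuracy".

```python
import random
from collections import defaultdict
from itertools import permutations
from typing import Callable


def pick_balanced(pool: list[dict], per_label: int, seed: int = 0) -> list[dict]:
    """Sample the same number of examples for every label in the pool."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in pool:
        by_label[example["label"]].append(example)
    chosen = []
    for label in sorted(by_label):
        candidates = by_label[label]
        chosen += rng.sample(candidates, min(per_label, len(candidates)))
    return chosen


def order_spread(examples: list[dict], score: Callable[[list[dict]], float]) -> float:
    """Max minus min score across every ordering of the examples.

    Only practical for small sets: the number of orderings grows factorially.
    """
    scores = [score(list(order)) for order in permutations(examples)]
    return max(scores) - min(scores)


# Illustrative pool with a three-to-one label imbalance.
pool = [
    {"text": "a1", "label": "a"},
    {"text": "a2", "label": "a"},
    {"text": "a3", "label": "a"},
    {"text": "b1", "label": "b"},
]
chosen = pick_balanced(pool, per_label=1)
print([example["label"] for example in chosen])
```

A large spread from `order_spread` suggests the measured gain depends on example order and should not be trusted yet.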

The output of Validate is a precise delta: this many examples buy this much accuracy at this much cost. If the delta does not justify the cost, you revert. The companion piece Real-World Examples and Use Cases shows what good example sets look like across task types.
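The delta can be written down as simple arithmetic. This is a sketch under stated assumptions: the `Measurement` type, the points-per-100-tokens metric, and the threshold are illustrative choices, and the numbers in the usage lines are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    accuracy: float      # fraction correct on the eval set
    prompt_tokens: int   # tokens in the full prompt


def example_delta(baseline: Measurement, with_examples: Measurement) -> dict:
    """Report what the added examples bought and what they cost."""
    gain = with_examples.accuracy - baseline.accuracy
    added = with_examples.prompt_tokens - baseline.prompt_tokens
    rate: Optional[float] = None
    if added > 0:
        # Accuracy points gained per 100 extra prompt tokens.
        rate = (gain * 100) / (added / 100)
    return {"accuracy_gain": gain, "added_tokens": added, "points_per_100_tokens": rate}


def keep_examples(delta: dict, min_points_per_100_tokens: float) -> bool:
    """Keep the examples only if they clear a threshold chosen in advance."""
    rate = delta["points_per_100_tokens"]
    return rate is not None and rate >= min_points_per_100_tokens


# Invented numbers for illustration.
delta = example_delta(Measurement(0.80, 200), Measurement(0.90, 400))
print(delta, keep_examples(delta, min_points_per_100_tokens=2.0))
```

Fixing the threshold before measuring keeps the revert decision from being argued after the fact.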

Stage 5: Evolve — Re-Run the Loop on Every Change

Evolve is the stage teams skip, and it is why prompts rot. Prompts that were correct a year ago are frequently over-engineered today.

What triggers an Evolve pass

A model upgrade — re-run Prime and Run; you can often delete examples.
An input-distribution shift — refresh the eval set and re-Observe.
Rising example token spend on a stable task — consider whether fine-tuning now amortizes better than prompting.

Evolve closes the loop. The framework is not a one-time decision; it is a maintenance cycle. For when Evolve points toward fine-tuning, see the trade-offs guide.
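The three triggers can be checked automatically if a small record is stored next to each shipped prompt. The sketch below is hypothetical throughout: `PromptRecord`, the drift flag, and the monthly token budget are assumptions about how a team might track this, not part of the framework.

```python
from dataclasses import dataclass


@dataclass
class PromptRecord:
    """Hypothetical snapshot stored alongside a shipped prompt."""
    model_version: str
    example_tokens_per_call: int
    calls_per_month: int


def evolve_triggers(
    shipped: PromptRecord,
    current_model_version: str,
    input_drift_detected: bool,
    monthly_example_token_budget: int,
) -> list[str]:
    """Return the reasons, if any, to schedule an Evolve pass."""
    reasons = []
    if current_model_version != shipped.model_version:
        reasons.append("model upgrade: re-run Prime and Run, try deleting examples")
    if input_drift_detected:
        reasons.append("input shift: refresh the eval set and re-Observe")
    spend = shipped.example_tokens_per_call * shipped.calls_per_month
    if spend > monthly_example_token_budget:
        reasons.append("example token spend over budget: compare against fine-tuning")
    return reasons


# Invented values for illustration.
record = PromptRecord(model_version="v1", example_tokens_per_call=180, calls_per_month=10_000)
reasons = evolve_triggers(
    record,
    current_model_version="v2",
    input_drift_detected=False,
    monthly_example_token_budget=1_000_000,
)
print(reasons)
```

An empty list means nothing has changed and the prompt can stay as shipped until the next scheduled check.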

A Worked Pass Through PROVE

To make the stages concrete, walk through a single task: extracting structured shipping details from freeform customer emails into a fixed JSON schema.

Prime. Write an instruction naming every field — recipient, address, requested date, special instructions — and stating how to handle missing values (null) and multiple candidates (take the most recent). Hand it to a colleague; they extract correctly from a sample email. The prime is strong.

Run. Score it against 200 labeled real emails, broken down by field. Recipient and address hit high accuracy zero-shot. The "requested date" field lags — the model formats dates inconsistently and mishandles relative dates like "next Tuesday."
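A per-field breakdown like the one in this step could be computed as follows. This is a minimal sketch: the snake_case key names are an assumed rendering of the four fields, exact string match is an assumed scoring rule, and the sample rows are invented.

```python
import json

# Assumed key names for the four fields named in the Prime step.
FIELDS = ["recipient", "address", "requested_date", "special_instructions"]


def field_accuracy(predictions: list[str], gold: list[dict]) -> dict[str, float]:
    """Exact-match accuracy per field; unparseable output counts as wrong everywhere."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and the same length")
    hits = {name: 0 for name in FIELDS}
    for raw, expected in zip(predictions, gold):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(parsed, dict):
            continue
        for name in FIELDS:
            # A missing key and an explicit null both read as None here.
            if parsed.get(name) == expected.get(name):
                hits[name] += 1
    return {name: hits[name] / len(gold) for name in FIELDS}


# Invented model outputs and labels for illustration.
predictions = [
    '{"recipient": "Ana", "address": "1 Main St", '
    '"requested_date": "Tuesday", "special_instructions": null}',
    "not json",
]
gold = [
    {"recipient": "Ana", "address": "1 Main St",
     "requested_date": "2024-05-07", "special_instructions": None},
    {"recipient": "Bo", "address": "2 Oak Ave",
     "requested_date": None, "special_instructions": None},
]
scores = field_accuracy(predictions, gold)
print(scores)
```

Scoring field by field is what makes it possible to see that one field lags while the others are already good enough.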

Observe. Diagnose: recipient and address are fine, so no examples needed there. The date failures are a missing demonstration problem — the convention for resolving relative dates is genuinely hard to state in words. This is a Validate candidate, not a Prime fix.

Validate. Add two examples showing relative-date resolution, including one ambiguous case. Re-score: date accuracy jumps, other fields unchanged, prompt grows by 180 tokens. The accuracy-per-token delta justifies the cost. Ship it.

Evolve. Three months later, a model upgrade lands. Re-run Prime and Run; the new model resolves relative dates zero-shot. Delete the examples, reclaim the tokens. The loop closes.

This single pass shows why each stage exists: Prime prevents wasted examples, Observe prevents the wrong fix, Validate measures the trade, and Evolve reclaims what newer models make free.

Common Ways PROVE Is Misapplied

The framework fails when teams shortcut stages. The most expensive shortcut is skipping Observe: treating any zero-shot shortfall as a signal to add examples, when the real fix is often a sharper prime. Adding examples to a missing-definition failure works locally but leaves the vague instruction in place to cause the next gap, exactly the pattern in our common mistakes guide.

The second failure is running Validate without a real eval set, so "the examples help" is a hand-wave instead of a measured delta. And the third is never scheduling Evolve, which lets prompts ossify across model generations. Each shortcut feels reasonable under deadline pressure and each one compounds into expensive, fragile prompts over time.

Applying PROVE in Practice

In practice the framework is fast. Prime and Run take an afternoon if you have an eval set. Observe is where the judgment lives. Validate is mechanical once you know which failures need demonstration. Evolve is a recurring calendar item, not a project. The payoff is that you never again add examples on reflex — every one in your prompt has a measured justification behind it.

Frequently Asked Questions

What problem does the PROVE framework actually solve?

It replaces gut-feel prompt decisions with an evidence-driven loop, so you only pay for examples that measurably help. It also forces the Prime-vs-Validate distinction that prevents teams from papering over vague instructions with examples.

Which stage do teams most often skip?

Evolve. They write a prompt once and never revisit it, so it goes stale across model upgrades. Making Evolve a recurring calendar item is the cheapest high-leverage habit in the whole framework.

How is Observe different from just looking at accuracy?

Observe diagnoses why each input failed, separating "missing definition" (fix the prime) from "missing demonstration" (add an example). Headline accuracy tells you something is wrong; Observe tells you which lever fixes it.

Can I use PROVE for reasoning tasks?

Yes. In Validate, your examples should demonstrate the reasoning process rather than only the final answers, and in Run you should test a zero-shot "reason step by step" prime, which often closes much of the gap to few-shot prompting on capable models.

How long does one full PROVE cycle take?

With an existing eval set, Prime and Run take an afternoon, Observe and Validate a day or two depending on task complexity. Evolve passes are short re-runs triggered by model or data changes, not full projects.

Key Takeaways

PROVE — Prime, Run, Observe, Validate, Evolve — turns prompt decisions into an evidence loop.
Prime first: a strong instruction may make examples unnecessary and transfers across models.
Observe separates "missing definition" (fix the prime) from "missing demonstration" (add examples).
Validate measures the precise accuracy-versus-cost delta of each example.
Evolve is a recurring maintenance cycle, not a one-time decision.

Agency Script Editorial
