The DRIVE Model for Deciding Whether a Prompt Is Ready

Ad hoc prompt evaluation works until it does not. The moment more than one person touches a prompt, or the same person evaluates many prompts, the lack of a shared structure produces inconsistent judgments and forgotten steps. A named framework solves this by giving everyone the same stages to move through, the same vocabulary, and a clear answer to the question of what comes next.

This guide introduces DRIVE: a five-stage model for evaluating prompt quality. The stages are Define, Represent, Instrument, Verify, and Elect. They run in order, each producing an output the next consumes, and together they take you from a fuzzy sense that a prompt might be good to a documented, defensible decision about whether it ships.

DRIVE is deliberately lightweight. It is a sequence of thinking steps, not a process meant to slow you down, and it scales from a ten-minute evaluation of an internal prompt to a rigorous vetting of a customer-facing one. The stages stay the same; only the depth changes.

Stage 1: Define

Everything starts with a precise definition of what the prompt is supposed to do and what a good output looks like. This stage produces your success criteria.

Write criteria that are specific and testable, meaning a stranger could score an output without consulting you. Name the dimensions that matter for this task, whether accuracy, format adherence, tone, or safety, and set the quality floor: the minimum result you will accept before shipping. Identify the single most dangerous failure mode up front so it stays in view. Skipping this stage is the root cause of most worthless evaluations, because without a definition every output looks acceptable.
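One way to make the Define output concrete is to record each criterion as data and check the record for gaps. The sketch below is illustrative only: the Criterion class, its field names, and the check_definition helper are assumptions made for this example, not part of DRIVE itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical record for one success criterion."""
    name: str
    dimension: str                # e.g. "accuracy", "format", "tone", "safety"
    pass_condition: str           # phrased so a stranger could score it
    most_dangerous: bool = False  # marks the failure mode to keep in view


def check_definition(criteria: list[Criterion]) -> list[str]:
    """Return a list of problems with the Define output (empty means none found)."""
    problems = []
    if not criteria:
        problems.append("no criteria defined")
    for c in criteria:
        if not c.pass_condition.strip():
            problems.append(f"{c.name}: missing pass condition")
    flagged = [c.name for c in criteria if c.most_dangerous]
    if len(flagged) != 1:
        problems.append(
            f"expected exactly one most-dangerous failure mode, found {len(flagged)}"
        )
    return problems


criteria = [
    Criterion("valid_json", "format", "Output parses as a JSON object"),
    Criterion("no_pii", "safety", "Output contains no email addresses",
              most_dangerous=True),
    Criterion("polite_tone", "tone", ""),  # deliberately left untestable
]
print(check_definition(criteria))
```

Here the tone criterion is reported because it has no pass condition, which is the kind of gap that is cheap to fix in Define and expensive to discover later.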

For why criteria-first discipline matters, see Evaluating Prompt Quality: Best Practices That Actually Work.

Stage 2: Represent

The second stage builds a test set that represents the real distribution the prompt will face. The output is a curated collection of inputs.

A good representation includes common cases, edge cases, and adversarial cases in roughly the proportions you expect in production. Sample from real data where possible. Crucially, split the set into a tuning portion you may iterate against and a held-out portion you reserve for final measurement. This stage is where you decide whether your evaluation will describe reality or a convenient fiction.
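The tuning and held-out split can be sketched as a deterministic assignment, so that an input never drifts from one portion to the other between runs. This is a minimal sketch under assumptions: the function name split_test_set, the 30 percent default, and the choice of hashing the input text are all illustrative, and a random shuffle with a fixed seed would serve equally well.

```python
import hashlib


def split_test_set(inputs, held_out_fraction=0.3):
    """Assign each input to the tuning or held-out portion by a stable hash.

    The hash depends only on the input text, so the same input always lands
    in the same portion, and duplicate inputs cannot straddle the split.
    """
    tuning, held_out = [], []
    for text in inputs:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        bucket = digest[0] / 256  # stable value in [0, 1)
        if bucket < held_out_fraction:
            held_out.append(text)
        else:
            tuning.append(text)
    return tuning, held_out


inputs = [f"customer question {i}" for i in range(20)]
tuning, held_out = split_test_set(inputs)
print(len(tuning), len(held_out))
```

Because the assignment is a hash rather than an exact count, the held-out share only approximates the requested fraction on small sets.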

Stage 3: Instrument

With criteria and a test set in hand, you choose how each criterion will be scored. The output is a scoring plan.

Match each criterion to the cheapest valid method:

Programmatic checks for structured requirements like valid JSON or value ranges.
Reference comparison for criteria with known correct answers.
Rubric-based human or model judgment for subjective qualities like tone.

Instrumenting before you generate outputs keeps scoring consistent across dozens of cases and prevents you from inventing standards on the fly.
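The first two methods in the list above can be sketched as small functions collected into a scoring plan before any output is generated. The function names, the "score" field, and the 0 to 10 range are assumptions chosen for illustration; rubric-based judgment is omitted because it needs a human or model in the loop.

```python
import json


def check_valid_json(output: str) -> bool:
    """Programmatic check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def check_in_range(output: str, field: str, low: float, high: float) -> bool:
    """Programmatic check: is a numeric field present and within [low, high]?"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    value = data.get(field)
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return low <= value <= high


def check_matches_reference(output: str, reference: str) -> bool:
    """Reference comparison: case- and whitespace-insensitive exact match."""
    return output.strip().lower() == reference.strip().lower()


# The scoring plan: one named check per criterion, fixed before generation.
SCORING_PLAN = {
    "valid_json": check_valid_json,
    "score_in_range": lambda out: check_in_range(out, "score", 0, 10),
}


def score(output: str) -> dict:
    """Apply every check in the plan to one output."""
    return {name: check(output) for name, check in SCORING_PLAN.items()}


print(score('{"score": 7}'))
print(score('{"score": 42}'))
```

Keeping the plan in one dictionary means every test case is scored by the same checks, which is the consistency this stage is meant to buy.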

Stage 4: Verify

Now you run the prompt against the test set, score it, and probe its robustness. The output is evidence: a pass rate, a failure analysis, and a consistency measurement.

Run important inputs multiple times to capture variance, because a single pass hides intermittent failure. Group failures by cause so the pattern points at the fix, then iterate one targeted change at a time, re-running the full set after each to catch regressions. Verify is the stage where the prompt's true behavior finally becomes visible.
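The three pieces of evidence can be computed from a flat list of scored runs. The sketch below assumes a hypothetical record shape (input_id, passed, cause) and treats an input as inconsistent when its repeated runs disagree; both are illustrative choices, not a prescribed format.

```python
from collections import Counter


def summarize(results):
    """Summarize scored runs.

    Each result is a dict with 'input_id', 'passed' (bool), and 'cause'
    (a short label for failures, None for passes).
    """
    total = len(results)
    pass_rate = sum(r["passed"] for r in results) / total if total else 0.0

    # Failure analysis: group failures by cause so the pattern points at the fix.
    causes = Counter(r["cause"] for r in results if not r["passed"])

    # Consistency: inputs whose repeated runs did not all agree.
    outcomes_by_input = {}
    for r in results:
        outcomes_by_input.setdefault(r["input_id"], set()).add(r["passed"])
    flaky = sorted(i for i, seen in outcomes_by_input.items() if len(seen) > 1)

    return {
        "pass_rate": pass_rate,
        "failure_causes": dict(causes),
        "flaky_inputs": flaky,
    }


results = [
    {"input_id": "a", "passed": True, "cause": None},
    {"input_id": "a", "passed": True, "cause": None},
    {"input_id": "b", "passed": True, "cause": None},
    {"input_id": "b", "passed": False, "cause": "invalid_json"},
    {"input_id": "c", "passed": False, "cause": "wrong_tone"},
    {"input_id": "c", "passed": False, "cause": "wrong_tone"},
    {"input_id": "d", "passed": True, "cause": None},
    {"input_id": "d", "passed": True, "cause": None},
]
print(summarize(results))
```

In this sample, input "b" passes once and fails once, so a single pass over the set could have reported it either way; running it twice is what exposes the intermittent failure.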

For the mechanics of this stage in detail, see A Step-by-Step Approach to Evaluating Prompt Quality.

Stage 5: Elect

The final stage turns evidence into a decision. The output is a documented choice: ship, keep iterating, or escalate.

Weigh quality against cost, latency, and failure rate, and compare the result against the quality floor you set during Define. Document the test set, the final pass rate, and the reasoning so the decision is reproducible. Electing is what separates an evaluation that informs action from one that merely produces a number.
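A decision rule of this kind can be written down so the reasoning travels with the result. The sketch below is one possible rule, not the rule: the Evidence and Floor classes, their fields, the thresholds in the example, and the choice to escalate on any critical failure are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical summary of what Verify measured on the held-out set."""
    pass_rate: float
    critical_failures: int   # occurrences of the most dangerous failure mode
    p95_latency_s: float
    cost_per_call_usd: float


@dataclass(frozen=True)
class Floor:
    """Hypothetical limits recorded during Define."""
    min_pass_rate: float
    max_p95_latency_s: float
    max_cost_per_call_usd: float


def elect(evidence: Evidence, floor: Floor) -> tuple[str, list[str]]:
    """Return ('ship' | 'iterate' | 'escalate', reasons)."""
    # Assumed policy: any critical failure goes to a human decision-maker.
    if evidence.critical_failures > 0:
        return "escalate", [
            f"{evidence.critical_failures} critical failure(s) observed"
        ]
    reasons = []
    if evidence.pass_rate < floor.min_pass_rate:
        reasons.append(
            f"pass rate {evidence.pass_rate:.2f} below floor {floor.min_pass_rate:.2f}"
        )
    if evidence.p95_latency_s > floor.max_p95_latency_s:
        reasons.append("p95 latency above limit")
    if evidence.cost_per_call_usd > floor.max_cost_per_call_usd:
        reasons.append("cost per call above limit")
    return ("iterate" if reasons else "ship"), reasons


floor = Floor(min_pass_rate=0.95, max_p95_latency_s=2.0, max_cost_per_call_usd=0.01)
print(elect(Evidence(0.96, 0, 1.2, 0.002), floor))
print(elect(Evidence(0.90, 0, 1.2, 0.002), floor))
```

Returning the reasons alongside the decision gives the next person the material they need to reproduce or challenge it.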

When to Apply Each Stage

DRIVE scales by depth, not by skipping stages. For a low-stakes internal prompt, Define might be two sentences and Represent might be ten inputs. For a customer-facing prompt, Define becomes a detailed rubric, Represent grows to hundreds of sampled inputs, and Verify includes adversarial probing and variance analysis. The discipline is to always pass through all five stages, even quickly, because each one defends against a distinct way evaluations mislead you.

To see the framework's failure modes named directly, read 7 Common Mistakes with Evaluating Prompt Quality.

How the Stages Reinforce Each Other

DRIVE is more than five steps in a row; the stages depend on one another in ways that make the order matter. The criteria from Define determine what your Represent stage must cover, because a test set that does not stress your most important criterion is not representative no matter how many inputs it holds. The scoring plan from Instrument is only valid against the criteria Define produced. The evidence from Verify is only meaningful if Represent built an honest test set. And the decision in Elect is only defensible if every prior stage was sound.

This dependency is why a weak early stage poisons everything downstream. A vague Define produces criteria too fuzzy to instrument, which produces scores too soft to verify, which produces a decision built on sand. Conversely, investing in the first two stages pays off through the rest of the framework, which is why experienced practitioners spend a disproportionate share of their time there.

A Quick Self-Check Per Stage

Before moving from one stage to the next, ask a single question:

Define: Could a stranger score an output using only what I wrote?
Represent: Does this test set stress my most dangerous failure mode?
Instrument: Is each criterion scored by the cheapest method that actually measures it?
Verify: Have I run important inputs enough times to see their variance?
Elect: Can the next person reproduce my decision from what I documented?

If the answer is no, you are not ready to advance, and pushing forward only defers the problem to a stage where it is more expensive to fix.

Frequently Asked Questions

How is DRIVE different from just following a checklist?

A checklist verifies discrete items; DRIVE organizes the entire evaluation into ordered stages where each produces an input for the next. The framework gives you a shared mental model and vocabulary, which matters most when multiple people evaluate prompts or one person evaluates many. The two complement each other: use the framework to structure the work and a checklist to verify each stage.

Can I skip stages for a quick evaluation?

Skip depth, not stages. Each of the five stages defends against a specific failure: Define against vague criteria, Represent against unrepresentative tests, Instrument against inconsistent scoring, Verify against hidden variance, and Elect against undocumented decisions. For a quick pass you can do each in minutes, but omitting one reintroduces the exact problem it exists to prevent.

Where do most evaluations go wrong in the DRIVE stages?

Most failures trace to a weak Define or Represent stage. Vague success criteria make every score meaningless, and an unrepresentative test set produces confident numbers that do not hold in production. Getting those two stages right resolves the majority of evaluation problems, which is why the framework front-loads them.

Does DRIVE work for non-text tasks like image prompts?

The stages generalize because they describe a way of thinking, not a text-specific procedure. You still define success, represent the input distribution, instrument scoring, verify behavior across runs, and elect a decision. The scoring methods in the Instrument stage change for images, but the structure of moving from definition to documented decision stays the same.

Key Takeaways

DRIVE structures prompt evaluation into five ordered stages: Define, Represent, Instrument, Verify, Elect.
Define produces testable success criteria and names the most dangerous failure mode.
Represent builds a realistic test set split into tuning and held-out portions.
Instrument assigns the cheapest valid scoring method to each criterion before generating outputs.
Verify produces a pass rate, failure analysis, and variance measurement through iterative testing.
Elect weighs quality against cost and latency and documents a reproducible decision.
Scale depth, never skip stages.

Agency Script Editorial
