Run a Real Model Evaluation in One Afternoon

Most advice about evaluating AI models stops at "build your own evaluation set" and leaves you staring at a blank document. This article does the opposite. It is a sequential process you can start this afternoon and finish before the day ends, producing a defensible decision about which model fits your task. No research infrastructure, no specialized tooling, just a method.

The premise is that public leaderboards get you a shortlist and nothing more. The actual decision comes from running candidate models against examples that look like your real work and scoring them against your real standard. That sounds laborious; in practice it is a focused half-day spent on a decision that you might otherwise get wrong for months.

Follow the steps in order. Each one builds on the last, and skipping ahead is the surest way to produce a result you cannot trust.

Step 1: Write Down What "Good" Means

Before you touch a model, define success in one or two sentences. "Good" might mean factually correct, or correctly formatted, or in the right tone, or some weighted mix. If you cannot articulate it, you cannot score it, and you will end up choosing the model whose answers merely feel impressive.

Make it observable

Turn your definition into things you can check by looking. Instead of "high quality," write "no factual errors, under 150 words, ends with a clear next step." Observable criteria are what separate a real evaluation from a vibe check.
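Some observable criteria can be written down as small checks. The sketch below is one possible way to encode two of the example criteria above; the cue phrases and function names are assumptions made for illustration, and "no factual errors" is deliberately left out because it still needs a human reader.

```python
import re

# Hypothetical threshold and cue phrases, based on the example criteria above.
MAX_WORDS = 150
NEXT_STEP_CUES = ("next step", "please reply", "you can now")

def under_word_limit(text: str, limit: int = MAX_WORDS) -> bool:
    """True when the answer contains fewer than `limit` words."""
    return len(text.split()) < limit

def ends_with_next_step(text: str) -> bool:
    """Rough heuristic: the final sentence mentions one of the cue phrases."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return False
    return any(cue in sentences[-1].lower() for cue in NEXT_STEP_CUES)

def check_answer(text: str) -> dict:
    """Run the machine-checkable criteria; factual accuracy still needs a human."""
    return {
        "under_word_limit": under_word_limit(text),
        "ends_with_next_step": ends_with_next_step(text),
    }
```

If a criterion cannot be expressed as a check like this, or as a question a reviewer can answer yes or no, it is probably not observable yet.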

Step 2: Gather 30 to 50 Real Examples

Collect actual inputs from your workflow: real support tickets, real briefs, real documents, whatever your task involves. For each one, write or paste the output you would consider correct. This is your answer key.

Thirty examples is the floor for noticing patterns; fifty is comfortable. With fewer than twenty, a single lucky or unlucky case can swing your conclusion. Pull from a spread of difficulty, including the awkward edge cases that actually cause problems, not just the easy middle.
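The arithmetic behind that warning is simple: with pass or fail scoring, each example is worth 100 divided by the set size in percentage points. A minimal sketch (the function name is illustrative):

```python
def swing_per_example(n_examples: int) -> float:
    """Percentage points the pass rate moves when a single example flips."""
    return 100.0 / n_examples

for n in (10, 20, 30, 50):
    print(f"{n} examples: one flipped case moves the pass rate by "
          f"{swing_per_example(n):.1f} points")
```

At ten examples one flipped case is worth ten points, which is larger than the gap between many closely ranked models; at fifty it is worth two.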

Include the hard ones on purpose

The easy examples will look fine on every model and tell you nothing. The hard ones, the ambiguous tickets and the documents with tricky formatting, are where models separate. Weight your set toward them.

Step 3: Pick Your Shortlist From the Leaderboards

Now consult public rankings, but only to choose two or three candidates. Look across several independent leaderboards and pick models that rank consistently well on tasks resembling yours. Why consistency matters is covered in our definitive guide to leaderboards and evaluation; for now, just narrow to a manageable shortlist.

Resist the urge to test ten models. Three is plenty, and a tight shortlist keeps the scoring effort sane.

Step 4: Run Every Example Through Every Candidate

Feed each of your examples to each shortlisted model using the same prompt. Keep the prompt identical across models so you are comparing the models, not your prompt variations. Save every output in a table with columns for the input, each model's answer, and your reference answer.

Keep the conditions identical

Same prompt, same instructions, same settings. If you let the prompt drift between models, you are no longer running a fair comparison, and the whole exercise loses its meaning.
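If you prefer to script this step, the sketch below shows one way to hold the prompt and settings fixed while looping over examples and models. It assumes you supply your own `call_model(model, prompt, settings)` function that wraps whatever interface you use; that function, the prompt template, and the settings shown here are placeholders, not any specific vendor's API.

```python
import csv

# Hypothetical frozen prompt and settings; replace with your own and do not vary them per model.
PROMPT_TEMPLATE = "Complete the task for the following input:\n\n{input}"
SETTINGS = {"temperature": 0}

def run_grid(examples, models, call_model,
             prompt_template=PROMPT_TEMPLATE, settings=SETTINGS):
    """Return one row per example: the input, each model's answer, the reference."""
    rows = []
    for example in examples:
        prompt = prompt_template.format(input=example["input"])
        row = {"input": example["input"], "reference": example["reference"]}
        for model in models:
            # Every model receives exactly the same prompt and settings.
            row[model] = call_model(model, prompt, settings)
        rows.append(row)
    return rows

def save_table(rows, models, path):
    """Write the comparison table to a CSV file for scoring."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["input", *models, "reference"])
        writer.writeheader()
        writer.writerows(rows)
```

Building the prompt once per example, outside the inner loop, makes it hard to introduce per-model variations by accident.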

Step 5: Score the Outputs Against Your Criteria

Go row by row and score each model's answer against the observable criteria from Step 1. For objective tasks, a simple pass or fail works. For subjective ones, a one-to-five scale per criterion is enough. Have one person score everything in one sitting so the standard stays consistent.

If your task has clear right answers, you can speed this up by checking outputs automatically. If it is subjective, human reading is unavoidable and worth it. Our common mistakes article explains why outsourcing all your scoring to another AI model can quietly corrupt your results.
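For tasks with a single right answer, automatic checking can be as simple as a normalized exact match. This is a sketch under the assumption that case and extra whitespace do not matter for your task; anything looser than that (partial credit, paraphrases) is a judgment call this code does not attempt.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as failures."""
    return " ".join(text.lower().split())

def exact_match(output: str, reference: str) -> bool:
    """Pass when the output equals the reference after normalization."""
    return normalize(output) == normalize(reference)

def pass_rate(outputs, references) -> float:
    """Fraction of outputs that match their reference answer."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must be the same length")
    if not outputs:
        return 0.0
    passed = sum(exact_match(o, r) for o, r in zip(outputs, references))
    return passed / len(outputs)
```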

Step 6: Read the Failures, Not Just the Totals

Tally the scores, but do not stop there. The total tells you which model won; the failures tell you whether you can live with the winner's mistakes. A model that scores slightly lower but fails gracefully may beat a higher scorer that fails catastrophically.

Look for patterns in the misses

Group the failures. Does one model always stumble on a particular document type? Does another fabricate details under ambiguity? These patterns predict how the model will behave on the inputs you have not tested yet, which is most of them.
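Grouping is easier if each scored row carries a short tag describing the input, such as its document type. The sketch below counts failures per model and tag; the row fields (`model`, `tag`, `passed`) are an assumed layout, not a required one.

```python
from collections import Counter, defaultdict

def failure_patterns(scored_rows):
    """Count failed rows per model, broken down by the tag on each input."""
    counts = defaultdict(Counter)
    for row in scored_rows:
        if not row["passed"]:
            counts[row["model"]][row["tag"]] += 1
    return counts

def worst_tags(scored_rows, model, top=3):
    """Return the tags where the given model failed most often."""
    return failure_patterns(scored_rows)[model].most_common(top)
```

A model whose failures cluster under one tag is telling you which inputs to route elsewhere or review by hand.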

Step 7: Decide, Document, and Set a Recheck Date

Pick the model, write one paragraph explaining why, and note the date. The documentation matters because in three months a new model will tempt you, and you will want to know what bar it has to clear. Plan the recheck for when a genuinely better model appears rather than at a fixed calendar interval. Our reusable framework turns this one-off process into something you can repeat in an hour each time.

Write down what almost won

Note the runner-up and why it lost, not just the winner. When a new model arrives, comparing it against the runner-up is often faster than re-running the whole set, because you already know where the second-place model fell short. That single line of context saves real time on every future recheck.
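The decision note can live in a document, but a small structured record is easier to find later. The sketch below is one possible shape; every field name is an assumption chosen to mirror the items this step asks you to write down.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class DecisionRecord:
    winner: str
    runner_up: str
    reason: str                  # the one-paragraph explanation
    runner_up_lost_because: str  # where the second-place model fell short
    recheck_trigger: str         # what would justify running the evaluation again
    decided_on: str = ""         # ISO date; filled with today's date if left empty

    def __post_init__(self):
        if not self.decided_on:
            self.decided_on = date.today().isoformat()

def to_json(record: DecisionRecord) -> str:
    """Serialize the record so it can be stored next to the evaluation table."""
    return json.dumps(asdict(record), indent=2)
```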

A Worked Micro-Example

To make the steps tangible, walk through a compressed version. Suppose your task is turning meeting notes into action-item lists. Your one-sentence definition of good: "every decision and owner captured, no invented items, formatted as a bullet list." Your observable criteria become completeness, no fabrication, and correct formatting.

You collect thirty real sets of meeting notes and write the ideal action-item list for each. You shortlist three models that rank well on summarization across two leaderboards. You run all thirty notes through each model with one frozen prompt, then score each output as pass or fail on the three criteria.

Interpreting the result

Model A captures every item but occasionally invents an action nobody agreed to. Model B never fabricates but sometimes misses an item. Model C matches your reference closely on both. Even before tallying, the fabrication failures in Model A should worry you more than Model B's omissions, because an invented commitment can cause real harm while a missed one is usually caught in review. Model C wins, and you can articulate exactly why. That articulation is the deliverable, far more than the raw score.
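To see which criterion a model fails on, and not only how often it fails, the pass or fail marks can be tallied per criterion. The sketch below uses the three criteria from this example; the sample scores are made up purely to show the shape of the data and are not results from any model.

```python
CRITERIA = ("complete", "no_fabrication", "formatted")

def criterion_pass_rates(scores):
    """scores: one dict per output, mapping each criterion to True or False."""
    total = len(scores)
    return {c: sum(s[c] for s in scores) / total for c in CRITERIA}

def failure_counts(scores):
    """Number of outputs that failed each criterion."""
    return {c: sum(not s[c] for s in scores) for c in CRITERIA}

# Made-up marks for four outputs from one hypothetical model, for illustration only.
sample_scores = [
    {"complete": True, "no_fabrication": True, "formatted": True},
    {"complete": True, "no_fabrication": False, "formatted": True},
    {"complete": True, "no_fabrication": True, "formatted": True},
    {"complete": False, "no_fabrication": True, "formatted": True},
]

print(criterion_pass_rates(sample_scores))
print(failure_counts(sample_scores))
```

Two models with the same overall pass rate can look very different in this breakdown, which is exactly the distinction between fabrication and omission drawn above.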

Common Snags and How to Get Past Them

A few predictable obstacles trip up first-time evaluators, and knowing them in advance keeps your afternoon on track.

Analysis paralysis on criteria. If you cannot settle on what "good" means, pick the two criteria that matter most and start. You can refine after the first scoring pass reveals what you missed.
Examples that are all easy. If every model passes everything, your set lacks hard cases. Deliberately hunt for the inputs that caused trouble in the past and add them.
Scoring fatigue. Fifty examples across three models is 150 outputs. Take a short break at the halfway point, but finish in one day so your standard does not drift overnight.

Frequently Asked Questions

How many examples are really necessary?

Aim for thirty to fifty. Below twenty, random luck on individual cases distorts the result, and you cannot distinguish a real difference from noise. Above fifty you hit diminishing returns for most tasks. If your task is high-stakes or highly varied, lean toward the upper end.

Can I use another AI model to score the outputs?

For objective tasks with clear answers, yes, and it saves time. For subjective tasks like tone or judgment, model-based scoring inherits the grader's biases and can mislead you. Use a human reviewer for anything where taste or nuance matters.

What if two models score almost the same?

When scores are close, decide on the failure modes and on practical factors like cost and speed. A model that fails safely, or costs half as much, can be the right pick even if it trails by a point or two on your scoreboard.

Do I need to redo this for every new model release?

No. Redo it only when a new model shows a meaningful jump on tasks like yours, or when your task itself changes. Re-evaluating on every release is wasteful churn. Document your current choice so you know what a challenger must beat.

How do I keep the comparison fair across models?

Use an identical prompt, identical instructions, and identical settings for every candidate, and score everything in a single sitting with one reviewer. Any drift in prompt or scoring standard turns a fair comparison into a misleading one.

Key Takeaways

Define "good" as observable criteria before you evaluate anything, or you will just rank vibes.
Gather thirty to fifty real examples with reference answers, weighted toward the hard cases.
Use leaderboards only to pick a shortlist of two or three candidates.
Run every example through every candidate with an identical prompt, then score against your criteria.
Read the failure patterns, decide, document the reasoning, and recheck only when a genuinely better model appears.

Agency Script Editorial
