Skip One Eval Step Under Pressure, Lose Months to It

Checklists exist because even people who know what to do skip steps under pressure. Model evaluation is full of steps that are easy to know and easy to skip, and skipping them is how teams end up with the wrong model for months. This is a working checklist for 2026, organized in the order you should actually do things, with a one-line justification for each item so you understand why it earns its place.

Treat it as a tool, not an essay. Copy it, keep it where you make model decisions, and run through it every time. The items are deliberately concrete, because a checklist that says "evaluate carefully" helps no one. Each box is something you can check off as genuinely done or not done.

The sequence matters. Define before you test, test before you decide, document before you move on. Working out of order is where most of the value leaks away.

Before You Look at Any Leaderboard

Get your own house in order first, so external rankings inform rather than drive you.

Write a one-sentence definition of "good" for this task. Without it you cannot score anything and will default to vibes.
List the observable criteria that definition implies. Things you can check by looking, like length, accuracy, and tone.
Catalog the failure modes that would actually hurt. Knowing what bad looks like is as important as knowing what good looks like.
Gather thirty to fifty real examples with reference answers. Below thirty, a few lucky or unlucky outputs can swing the result; this set is your ground truth.

Our step-by-step guide expands each of these into a full procedure if you need detail.
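To see why the size of the example set matters, here is a rough sketch of how much a measured pass rate can wobble at different set sizes. It uses the textbook normal approximation for a proportion, which is itself crude on small sets, so treat the numbers as an illustration. The function name and the 80 percent pass rate are assumptions made for this example.

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n examples.

    Uses the normal approximation for a proportion, which is only a rough
    guide when n is small.
    """
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


if __name__ == "__main__":
    # Hypothetical candidate that passes 80% of the examples it is shown.
    for n in (10, 15, 30, 50):
        moe = margin_of_error(0.8, n)
        print(f"n={n:>2}: 80% pass rate, plus or minus about {moe * 100:.0f} points")
```

Under these assumptions, the uncertainty at fifteen examples is about twenty points, and it shrinks to about eleven at fifty. Two candidates a few points apart are hard to separate on a small set.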

When You Consult Leaderboards

Use rankings as a filter, with healthy skepticism baked in.

Check that the benchmark resembles your task. A coding leaderboard is irrelevant to copywriting; relevance comes before rank.
Cross-reference at least two independent leaderboards. Single sources carry blind spots you will inherit.
Note whether scoring is correctness or preference. Preference voting tends to favor longer, more polished answers; correctness scoring rewards being right. Know which you need.
Watch for contamination risk. Fresh or rotating test sets resist memorization; static ones can leak into training data and inflate scores.
Narrow to a shortlist of two or three candidates. More than three makes scoring unwieldy for little gain.

The reasoning behind the contamination and preference items is laid out in our common mistakes article.
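The cross-referencing and shortlisting items can be sketched in a few lines. This is a minimal illustration, not a recommended ranking method: the leaderboard names, model names, and the `shortlist` function are all hypothetical. It also assumes you have already discarded leaderboards that do not resemble your task.

```python
def shortlist(
    leaderboards: dict[str, list[str]],
    top_n: int = 5,
    max_candidates: int = 3,
) -> list[str]:
    """Keep models that appear in the top_n of at least two leaderboards.

    Each leaderboard is a list of model names, best first. Models are
    ordered by their average position across the boards that list them.
    """
    counts: dict[str, int] = {}
    rank_sum: dict[str, int] = {}
    for ranking in leaderboards.values():
        for position, model in enumerate(ranking[:top_n]):
            counts[model] = counts.get(model, 0) + 1
            rank_sum[model] = rank_sum.get(model, 0) + position
    agreed = [model for model, count in counts.items() if count >= 2]
    agreed.sort(key=lambda model: (rank_sum[model] / counts[model], model))
    return agreed[:max_candidates]


# Hypothetical rankings from two independent, task-relevant leaderboards.
boards = {
    "board_one": ["model-a", "model-b", "model-c", "model-d"],
    "board_two": ["model-b", "model-e", "model-a", "model-c"],
}
candidates = shortlist(boards, top_n=3)
```

A model that tops one board but is missing from the other never reaches the shortlist, which is the point of cross-referencing.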

During the Private Test

This is where the real decision gets made, so protect its fairness.

Freeze one identical prompt across all candidates. Otherwise you compare prompts, not models.
Run every example through every candidate. Partial coverage produces partial conclusions.
Score in a single sitting with one reviewer. Consistency of standard matters more than who does it.
Score failure modes explicitly, not just averages. A confident wrong answer can cost more than the average suggests.
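A minimal harness for the four items above might look like the following sketch. Everything named here is an assumption for illustration: the candidates are plain Python callables standing in for real model calls, the prompt text is made up, and the scoring function returns a label rather than a number so that failure modes are counted explicitly.

```python
from collections import Counter
from typing import Callable

# One prompt template, frozen for the whole test (hypothetical wording).
FROZEN_PROMPT = "Answer with the number only: {input}"


def run_private_test(
    candidates: dict[str, Callable[[str], str]],
    examples: list[dict[str, str]],
    score: Callable[[str, str], str],
) -> dict[str, Counter]:
    """Run every example through every candidate with the same prompt.

    Returns, per candidate, a tally of labels such as "pass" or a named
    failure mode, so failures are visible rather than averaged away.
    """
    results: dict[str, Counter] = {}
    for name, generate in candidates.items():
        tally: Counter = Counter()
        for example in examples:
            output = generate(FROZEN_PROMPT.format(input=example["input"]))
            tally[score(output, example["reference"])] += 1
        results[name] = tally
    return results


def label_output(output: str, reference: str) -> str:
    """Toy scorer: exact match passes; other outputs get a failure label."""
    if output.strip() == reference:
        return "pass"
    if not output.strip():
        return "empty_answer"
    return "confident_wrong"


# Stub candidates standing in for real model calls.
stub_candidates = {
    "candidate-a": lambda prompt: "4",
    "candidate-b": lambda prompt: "",
}
stub_examples = [
    {"input": "2+2", "reference": "4"},
    {"input": "3+3", "reference": "6"},
]
results = run_private_test(stub_candidates, stub_examples, label_output)
```

In practice the scoring function is often the human reviewer's judgment recorded as a label. The structure still helps, because it forces full coverage and a single prompt.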

Before You Commit

Convert the test into a durable decision.

Read the failure patterns, not just the totals. Patterns predict behavior on inputs you have not tested.
Weigh practical factors when scores are close. Cost, speed, and safe failure can break a near-tie.
Write one paragraph explaining the choice. Documentation gives future challengers a clear bar to beat.
Set a re-evaluation trigger, not a date. Re-test when a new model shows a meaningful jump on tasks like yours, not because the calendar says so.

The discipline of triggers over schedules is covered in our best practices article.
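The near-tie item can be made concrete with a small sketch. The five-point threshold, the field names, and the order of the tie-breakers (harmful failures first, then cost) are assumptions for this example. Your own priorities may rank them differently.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    pass_rate: float          # share of private-test examples passed
    harmful_failures: int     # count of failures from your "would hurt" list
    cost_per_1k_calls: float  # hypothetical cost figure


def choose(candidates: list[Candidate], near_tie: float = 0.05) -> Candidate:
    """Pick the best scorer unless others are within near_tie of it.

    Among near-tied contenders, prefer fewer harmful failures, then lower
    cost, then the higher pass rate.
    """
    best = max(c.pass_rate for c in candidates)
    contenders = [c for c in candidates if best - c.pass_rate <= near_tie]
    return min(
        contenders,
        key=lambda c: (c.harmful_failures, c.cost_per_1k_calls, -c.pass_rate),
    )


# Hypothetical results from a private test.
pool = [
    Candidate("candidate-a", pass_rate=0.86, harmful_failures=3, cost_per_1k_calls=12.0),
    Candidate("candidate-b", pass_rate=0.84, harmful_failures=1, cost_per_1k_calls=4.0),
    Candidate("candidate-c", pass_rate=0.70, harmful_failures=0, cost_per_1k_calls=1.0),
]
winner = choose(pool)
```

Here the second candidate wins despite the slightly lower score, because it sits inside the near-tie band and fails more safely. The cheapest candidate is excluded because its score is not close.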

Keeping the Checklist Alive

A checklist decays if you never update it. Add new failure modes to your evaluation set as you discover them in production, and revisit your definition of "good" when your task evolves. The reusable framework shows how to fold this checklist into a repeatable cycle so it improves rather than ossifies.

One last item

Re-run the full checklist whenever the stakes of a decision rise. A low-stakes internal tool can skip some items; a customer-facing one should hit every box. Calibrate the rigor to the consequences.

How to Use This Checklist in a Team

A checklist owned by no one gets followed by no one. Assign a single owner per evaluation who is accountable for running through every box, while keeping the checklist itself shared so anyone can pick it up. The owner role can rotate from one evaluation to the next; what cannot rotate within a single decision is the scoring standard, which one person should hold start to finish.

Make the boxes auditable

Each item should be answerable with a clear yes or no, not a vague "sort of." "Did we freeze one prompt across candidates?" is auditable. "Did we test carefully?" is not. When you adapt this checklist to your context, sharpen any item that has gone soft until it can be honestly checked or left blank. A box you cannot honestly check is a box doing real work, because it just told you where your process is thin.
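One way to keep boxes auditable is to record each as a strict yes, no, or "cannot honestly answer" and then list whatever is not a clear yes. The sketch below assumes that three-state convention; the item wording and the `audit` function are illustrative, not a prescribed format.

```python
from typing import Optional

# True = done, False = not done, None = cannot honestly answer yet.
checklist: dict[str, Optional[bool]] = {
    "Wrote a one-sentence definition of good": True,
    "Froze one identical prompt across all candidates": True,
    "Ran every example through every candidate": False,
    "Wrote one paragraph explaining the choice": None,
}


def audit(items: dict[str, Optional[bool]]) -> tuple[list[str], list[str]]:
    """Return (not_done, unclear) so thin spots in the process stand out."""
    not_done = [item for item, answer in items.items() if answer is False]
    unclear = [item for item, answer in items.items() if answer is None]
    return not_done, unclear


not_done, unclear = audit(checklist)
```

Items that land in the second list are the ones to sharpen, since they could not be answered with a plain yes or no.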

A Lightweight Version for Quick Decisions

Not every decision deserves the full sequence. For a fast, low-stakes call, this five-item subset preserves most of the value:

Define good in one sentence. Never skip this, regardless of stakes.
Gather ten to fifteen real examples. Fewer than ideal, but far better than zero.
Shortlist two candidates from a leaderboard. Check relevance, even if you consult only one ranking.
Run both through your examples with one frozen prompt. The minimum fair comparison.
Note the choice and a recheck trigger in one line. Cheap insurance for later.

This compressed version takes under an hour and still beats choosing on a headline. Graduate to the full checklist the moment the decision touches customers or money.

Why even the lightweight version works

The reason this short list still protects you is that it preserves the two irreplaceable moves: defining good and testing on real examples. Everything else in the full checklist sharpens those two moves, but without them no amount of additional rigor helps. A team that does only these five things, honestly, will out-decide a team that owns expensive tooling but skips the definition step. Rigor is valuable in proportion to how much of it points at your actual task, and these five items all do.

Pairing the Checklist With Your Calendar

A checklist governs a single decision; your calendar governs when decisions recur. Pair them by attaching the recheck trigger from the final item to whatever system you actually watch. If your team lives in a project tracker, file the trigger there. If you watch a model provider's release notes, note your current bar alongside them. The checklist only delivers its full value when its last item, the recheck trigger, lands somewhere you will genuinely see it later. An unwatched trigger is the same as no trigger, and it is the most common way a sound evaluation quietly goes stale.
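A recheck trigger can be as small as one recorded number and one comparison. The sketch below assumes you store the chosen model's pass rate on your own example set as the bar, and that a challenger's score comes from the same set. The five-point jump is an arbitrary illustration, not a recommended threshold.

```python
def should_reevaluate(
    current_bar: float,
    challenger_score: float,
    min_jump: float = 0.05,
) -> bool:
    """Return True when a challenger beats the recorded bar by a meaningful jump.

    Both scores are assumed to be pass rates on the same private example set.
    """
    return challenger_score - current_bar >= min_jump


# Hypothetical bar recorded when the current model was chosen.
recorded_bar = 0.80
rerun_for_big_jump = should_reevaluate(recorded_bar, challenger_score=0.90)
rerun_for_small_jump = should_reevaluate(recorded_bar, challenger_score=0.82)
```

Storing the bar next to the trigger, in whatever tracker you actually watch, is what lets someone else apply the check months later.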

Frequently Asked Questions

Can I skip items for a low-stakes decision?

Yes, deliberately. For an internal experiment, you might gather fewer examples and skip formal documentation. The checklist is a maximum for high-stakes choices; scale the rigor down consciously rather than by accident, and never skip defining "good."

Why is "write a one-sentence definition of good" the first item?

Because every later step depends on it. You cannot pick criteria, score outputs, or interpret results without knowing what success means. Skipping it guarantees you will end up ranking impressions instead of measuring fit.

How is this checklist specific to 2026?

The structure is timeless, but the 2026 emphasis on contamination risk and agentic, long-horizon benchmarks reflects where current rankings most often mislead. As models and benchmarks evolve, the items about freshness and task-matching only grow more important.

What if I do not have thirty real examples yet?

Start collecting them now and use what you have as a provisional set, clearly labeled as low-confidence. A decision on fifteen examples is shakier than one on forty, so treat it as a temporary call and firm it up as your set grows.

Should every team member run this checklist?

One owner per decision keeps the scoring standard consistent, but the checklist should be shared knowledge so anyone can pick it up. The goal is a common practice, with a single accountable reviewer for each specific evaluation.

Key Takeaways

Define "good" and gather thirty to fifty real examples before you look at any leaderboard.
Use leaderboards as a relevance-filtered, cross-referenced shortlist of two or three candidates.
Freeze one prompt, test every candidate, and score failure modes explicitly in a single sitting.
Read failure patterns, weigh practical factors on near-ties, and document the decision.
Set re-evaluation triggers, keep the checklist updated, and scale rigor to the stakes.

Agency Script Editorial
