Benchmark Practices We'd Defend in an Argument

There's no shortage of benchmark advice that amounts to "use good data" and "be objective." It's true and useless. The practices that actually change outcomes are more specific, more opinionated, and occasionally inconvenient, because they ask you to do work the shortcut-takers skip.

What follows are the practices we'd defend in an argument, each paired with the reasoning that makes it worth the cost. Some contradict the conventional wisdom. Where they do, the reasoning is the point: adopt the practice because the logic holds for your situation, not because it's on a list.

The practices are ordered by leverage: the ones that most change your decision quality come first. You don't need all of them on day one, but you should know why each one earns its place.

Build a Private Evaluation Before You Read Any Leaderboard

The single highest-leverage practice is to write down what you need from a model before you look at what's winning. Reverse that order and the leaderboard anchors your judgment.

Why order matters

If you read the leaderboard first, you'll unconsciously redefine your needs to match whatever's on top. Defining success first, in concrete terms, gives you a fixed target the benchmarks have to serve, rather than the other way around. It's the difference between buying a tool for the job and finding a job for the tool.

What it looks like in practice

Write a one-paragraph decision statement, list your hard constraints on cost and latency, and define three to five success criteria. Only then open the leaderboard, and use it to build a shortlist. A Step-by-Step Approach to AI Model Benchmarks lays out the full sequence.
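One way to make that write-up hard to fudge later is to keep it as data that the shortlist has to pass through. The sketch below is illustrative only: the class names, field names, example decision statement, and success criteria are all hypothetical, and the cost and latency units are assumptions you would replace with your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    # Hypothetical hard limits; pick the units that match your billing and SLOs.
    max_cost_per_1k_requests_usd: float
    max_p95_latency_ms: float


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_requests_usd: float
    p95_latency_ms: float


# Written BEFORE opening any leaderboard (example wording, not a template).
DECISION = "Pick a model to draft first-pass support replies that agents edit."
SUCCESS_CRITERIA = [
    "Reply addresses the customer's actual question",
    "No invented policy or pricing details",
    "Tone matches the support style guide",
]


def shortlist(candidates, constraints):
    """Keep only candidates that satisfy every hard constraint."""
    return [
        c
        for c in candidates
        if c.cost_per_1k_requests_usd <= constraints.max_cost_per_1k_requests_usd
        and c.p95_latency_ms <= constraints.max_p95_latency_ms
    ]
```

The leaderboard then only supplies the list of candidates; the constraints and criteria were fixed before you saw who was on top.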

Weight Execution-Scored Benchmarks Over Opinion-Scored Ones

Not all benchmarks are equally trustworthy. The ones where correctness is verified by running code or checking against a known answer are far harder to game than ones scored by judgment.

The reasoning

A coding benchmark that runs the model's output against unit tests has an objective pass-fail signal. A benchmark scored by a model or human judge introduces the judge's biases and inconsistencies. When two benchmarks disagree, give more weight to the one with the harder-to-fake scoring.
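To make the pass-fail signal concrete, here is a minimal sketch of execution-based scoring. It assumes each model output has already been loaded as a Python callable, which is a simplification: a real harness would run untrusted generated code in a sandbox with timeouts. The function names are hypothetical.

```python
def passes_all(candidate, cases):
    """Return True only if the candidate gives the expected result on every case.

    `cases` is a list of (args_tuple, expected_value) pairs.
    Any exception counts as a failure.
    """
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


def pass_rate(solutions, cases):
    """Fraction of candidate solutions that pass every test case."""
    if not solutions:
        raise ValueError("need at least one solution to score")
    return sum(passes_all(s, cases) for s in solutions) / len(solutions)
```

Nothing in this score depends on anyone's opinion of the output, which is what makes it harder to game than a judged rating.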

The trade-off

Execution-scored benchmarks cover fewer tasks because not everything can be auto-verified, so you can't rely on them alone for open-ended work like writing. Use them as your most trusted signal where they exist, and supplement with carefully validated judged evaluations elsewhere.

Always Look at the Distribution, Never Just the Mean

A single average score throws away the information that often matters most: how the model fails.

Why the tail wins

For most production systems, the worst 5% of outputs determine user trust and risk, not the average. A model that's brilliant on average but occasionally produces something unsafe or badly wrong can be a worse choice than a steadier alternative with a slightly lower score.

How to do it

After scoring, sort outputs from worst to best and read the bottom of the list. Segment scores by task type to find where the model breaks down. The mistake of optimizing for the average is common enough that we devoted a section to it in 7 Common Mistakes with AI Model Benchmarks.
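Both steps are a few lines once scores are in a list. The sketch below assumes each result is a dict with hypothetical `score` and `task_type` keys and that higher scores are better; the 5% default mirrors the figure above and is easy to change.

```python
from collections import defaultdict
from statistics import mean


def worst_outputs(results, fraction=0.05):
    """Return the lowest-scoring slice of results, worst first (at least one)."""
    ranked = sorted(results, key=lambda r: r["score"])
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]


def mean_by_task_type(results):
    """Average score per task type, to show where a model breaks down."""
    groups = defaultdict(list)
    for r in results:
        groups[r["task_type"]].append(r["score"])
    return {task_type: mean(scores) for task_type, scores in groups.items()}
```

The numbers only tell you where to look. The point of `worst_outputs` is to hand you a short list to actually read.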

Hold Test Conditions Identical, and Record Them

This sounds obvious and is violated constantly. The moment one model gets a better prompt or more attempts than another, the comparison is dead.

The discipline

Use the same prompt, temperature, attempt count, and tool access for every model in a comparison. Write these settings down with the results. If you tune the prompt to help one model, you must rerun all of them under the new setup. Half-updated comparisons are worse than no comparison because they look rigorous.

Why recording matters

Six weeks later, when someone questions the decision or a new model appears, you need to reproduce the exact conditions. Undocumented conditions make your evaluation unrepeatable, which means it can't be trusted or updated.
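A lightweight way to enforce both halves is to store the settings as a record and compare fingerprints before comparing scores. This is a sketch under assumptions: the field names are hypothetical, and your setup may have more settings worth recording (model version, system prompt, max tokens).

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConditions:
    prompt_template: str
    temperature: float
    attempts: int
    tools: tuple  # names of tools the model may call

    def fingerprint(self):
        """Short, stable hash of the settings, to store next to the results."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def comparable(runs):
    """True if every model in `runs` (model name -> RunConditions) used the same settings."""
    return len({conditions.fingerprint() for conditions in runs.values()}) <= 1
```

If `comparable` returns False after a prompt tweak, that is the signal to rerun every model rather than publish a half-updated table.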

Re-Run on a Cadence, Treat Decisions as Perishable

A model choice is not a one-time event. Models get updated, sometimes silently, and a decision that was right last quarter can quietly go stale.

The practice

Keep your task set and rubric as a reusable asset. Rerun the evaluation when a model you use is updated, when a strong new model ships, or on a fixed cadence like quarterly. The first run is expensive; every rerun is cheap because the hard work is already built.
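The three triggers can be written down as a small check that runs on a schedule. The sketch below is one possible shape; the function name, the version strings, and the 90-day default standing in for "quarterly" are all assumptions.

```python
from datetime import date, timedelta


def should_rerun(last_run, today, deployed_version, evaluated_version,
                 new_contender=False, cadence_days=90):
    """Return the list of reasons to rerun the evaluation (empty means no rerun)."""
    reasons = []
    if deployed_version != evaluated_version:
        reasons.append("model updated")
    if new_contender:
        reasons.append("new contender")
    if today - last_run >= timedelta(days=cadence_days):
        reasons.append("cadence elapsed")
    return reasons
```

Returning reasons rather than a bare boolean keeps a record of why each rerun happened, which helps when someone revisits the decision later.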

The payoff

This turns model selection from a gut-feel scramble at launch time into a standing capability. You'll catch regressions from version updates that pure leaderboard-watching would miss entirely. To formalize this, see A Framework for AI Model Benchmarks.

Validate Your Judge Before You Trust It at Scale

Model-as-judge scoring is how you scale open-ended evaluations, but an unvalidated judge can quietly bias every conclusion you draw.

The check

Before relying on a model to score outputs, have a human score a sample of 20 to 30 of the same outputs and measure agreement. If the judge disagrees with humans often or in a consistent direction, fix the rubric or the judge before scaling. A judge that's wrong in the same direction every time will give you confident, consistently wrong rankings.

Why directional bias is worse than noise

Random disagreement between judge and humans averages out across a large task set; a consistent lean does not. If your judge systematically favors longer answers, or penalizes a particular style, that bias survives every average you compute and quietly tilts the ranking toward whichever model happens to match the judge's preference. That's why you check the direction of disagreement, not just its size.
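Size and direction of disagreement can be measured separately on the human-scored sample. The sketch below assumes both human and judge use the same numeric scale (for example 1 to 5) and that scores are paired by output; the function name and returned keys are hypothetical.

```python
from statistics import mean


def judge_agreement(human_scores, judge_scores, tolerance=0):
    """Compare judge scores to human scores on the same outputs.

    agreement        -> share of outputs where judge is within `tolerance` of the human
    mean_signed_diff -> average of (judge - human); far from zero means a directional lean
    over / under     -> how many outputs the judge rated higher / lower than the human
    """
    if len(human_scores) != len(judge_scores) or not human_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for h, j in zip(human_scores, judge_scores)]
    return {
        "agreement": sum(abs(d) <= tolerance for d in diffs) / len(diffs),
        "mean_signed_diff": mean(diffs),
        "over": sum(d > 0 for d in diffs),
        "under": sum(d < 0 for d in diffs),
    }
```

A judge with modest agreement but a signed difference near zero is noisy; one whose `over` count dwarfs its `under` count is leaning, and that is the case to fix before scaling.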

Prefer Reversible Decisions Early

A subtle practice that saves real pain: when you can, structure the rollout so the model choice is easy to undo. Ship behind a flag, route a fraction of traffic first, and keep the runner-up wired up. This lowers the stakes of every evaluation, because a wrong call becomes a quick switch rather than a re-platforming. The discipline of the private evaluation still matters, but reversibility turns it from a high-pressure one-shot into a fast feedback loop.
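As a rough illustration of "ship behind a flag, route a fraction of traffic first", the sketch below picks a model per user with a stable hash. Everything here is an assumption for illustration: the function name, the idea of keying on a user ID, and the model labels. A real rollout would usually read the flag and share from your feature-flag system.

```python
import hashlib


def pick_model(user_id, current, candidate, candidate_share, enabled=True):
    """Route a stable fraction of users to the candidate model.

    candidate_share is between 0.0 and 1.0. Setting enabled=False or
    candidate_share=0 sends everyone back to the current model.
    """
    if not enabled or candidate_share <= 0:
        return current
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return candidate if bucket < candidate_share else current
```

Because the bucket depends only on the user ID, a given user sees the same model across requests, and undoing the choice is a configuration change rather than a redeploy.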

Frequently Asked Questions

Should I ever trust public benchmarks alone?

For low-stakes or exploratory decisions, public benchmarks alone can be enough to pick a reasonable default. For anything you'll deploy and depend on, no. Use public benchmarks to shortlist, then validate on your own tasks. The cost of a private evaluation is small next to the cost of a wrong production choice.

How is execution-scored different from judged scoring?

Execution-scored benchmarks verify correctness mechanically, by running code or matching a known answer, so they're hard to fake. Judged scoring uses a human or model to rate outputs, which adds the judge's biases. Both have a place, but execution-scored signals are more trustworthy where they're available.

Why validate a model judge if it's faster?

Speed is worthless if the judge is biased. An unvalidated judge can systematically over- or under-rate certain outputs, skewing your whole ranking in a way you won't notice. Validating against human scores on a sample is a small cost that protects every conclusion built on the judge.

How often is too often to re-run an evaluation?

Re-running daily is overkill for most teams and adds noise. Quarterly, plus event-triggered reruns when a model updates or a new contender appears, strikes the right balance. The goal is to catch meaningful changes without drowning in run-to-run variance.

What if my team doesn't have time for a private evaluation?

Then scope it down rather than skipping it. Even 30 well-chosen tasks scored by hand will catch most of the gap between a leaderboard champion and the right model for you. A small private evaluation beats a large public one for your specific decision.

Key Takeaways

Define your needs and success criteria before reading any leaderboard, so the benchmark serves your decision instead of anchoring it.
Trust execution-scored benchmarks more than judged ones, and validate any model judge against human scores before scaling.
Read the worst outputs and segment by task type; the tail usually matters more than the mean.
Hold test conditions identical across models and record them so the evaluation is repeatable.
Keep your task set reusable and rerun on a cadence; model decisions are perishable.

Agency Script Editorial
