Could Anyone on Your Team Reproduce Your Numbers?

The difference between a benchmark you ran once and a workflow you can hand off is whether anyone else on your team can reproduce your numbers without asking you a single question. Most benchmarking lives in someone's head and a folder of one-off scripts. That works until that person is on vacation and a new model ships.

This article is about turning model evaluation into a documented, repeatable process that survives staff changes and produces comparable results every time. The goal is boring on purpose. A good benchmarking workflow should feel like a checklist, not a research project, by the third time you run it.

Why repeatability is the whole point

A single benchmark run gives you a number. A repeatable workflow gives you a number you can compare with last month's and trust the comparison. That trust is the entire value. Without it, you cannot tell whether a model got better, your test got easier, or you just changed a setting.

Repeatability rests on three things being identical across runs: the test set, the scoring method, and the model settings. Lock all three and your comparisons mean something. Let any one drift and your numbers become anecdotes.

Step one: assemble a frozen test set

Your test set is the foundation, and it must stop changing once you commit to it. Pull 30 to 100 real examples from your actual workload, with expected outputs or scoring rubrics attached. Cover the easy cases, the common cases, and, crucially, the hard 10 percent where models tend to fail.

What makes a good frozen set

Drawn from real inputs, not invented examples that flatter the model.
Versioned and stored, so you know exactly which set produced which numbers.
Stable, meaning you do not quietly add or remove examples between runs.

If you must change the set, bump its version and re-baseline every model on the new version. Never compare scores across set versions. The reasoning behind freezing is laid out further in A Step-by-Step Approach to AI Model Benchmarks.
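
One way to make "versioned and stable" checkable is to fingerprint the set's contents and store that fingerprint with every result. The sketch below is a minimal illustration, not a prescribed method: the function name `fingerprint_test_set`, the example fields (`id`, `input`, `expected`), and the 12-character truncation are all assumptions made for this example.

```python
import hashlib
import json


def fingerprint_test_set(examples):
    """Return a short, stable SHA-256 fingerprint for a list of example dicts."""
    # Sorted keys and fixed separators mean the same content always
    # serializes to the same bytes, regardless of dict key order.
    canonical = json.dumps(
        examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical examples; a real set would be loaded from a versioned file.
examples_v1 = [
    {"id": "ex-001", "input": "Refund request, order missing", "expected": "billing"},
    {"id": "ex-002", "input": "App crashes on login", "expected": "technical"},
]
fingerprint_v1 = fingerprint_test_set(examples_v1)

# Adding even one example produces a different fingerprint,
# which is the signal to bump the version and re-baseline.
examples_edited = examples_v1 + [
    {"id": "ex-003", "input": "Change my email", "expected": "account"}
]
fingerprint_edited = fingerprint_test_set(examples_edited)
```

If the fingerprint recorded with last month's results does not match the one computed today, the set has changed and the two runs should not be compared.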

Step two: define scoring before you look at outputs

Decide how you will score each example before you run anything. This sounds obvious and is constantly violated. When you score after seeing outputs, you unconsciously bend the rubric toward whichever model you already prefer.

Three scoring approaches

Exact or programmatic match for tasks with a single correct answer, like classification or extraction.
Rubric scoring for open-ended tasks, where a human or a judge model rates against fixed criteria.
Pairwise preference, where a judge picks the better of two outputs without scoring each in isolation.

Write the rubric down. Anyone running the workflow should score the same output the same way you would. Ambiguous rubrics are the most common reason two people get different numbers from the same test.
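
For the exact-match case, writing the rubric down can mean writing it as code, so that two people cannot score the same output differently. This is a sketch under stated assumptions: the normalization rule (trim, lowercase, collapse whitespace) is an illustrative choice, and your task may need a stricter or looser one.

```python
def normalize(text):
    """Trim, lowercase, and collapse runs of whitespace to single spaces."""
    return " ".join(text.strip().lower().split())


def exact_match_score(output, expected):
    """Return 1.0 if the normalized output equals the normalized expected answer, else 0.0."""
    return 1.0 if normalize(output) == normalize(expected) else 0.0


# Formatting differences do not affect the score; a wrong label does.
score_formatting = exact_match_score("  Billing ", "billing")
score_wrong = exact_match_score("technical", "billing")
```

Whatever normalization you choose, it is part of the scoring method and should stay fixed between runs.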

Step three: pin your run configuration

Every variable you do not pin is a variable that will move and corrupt your comparison. Document the model version, temperature, system prompt, maximum tokens, and the number of examples shown to the model in the prompt. Store this configuration alongside the results.

The silent killer here is model version drift. Providers update models behind stable names, so "the same model" three months apart may not be the same model at all. Record the exact version identifier every time, and when results shift unexpectedly, this is the first place to look. Many of these pitfalls are catalogued in 7 Common Mistakes with AI Model Benchmarks (and How to Avoid Them).
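
A pinned configuration can be as simple as an immutable record that is serialized next to the results. The sketch below assumes a hypothetical `RunConfig` class; the field names mirror the variables listed above, and the model identifier and prompt are placeholders rather than real values.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class RunConfig:
    model_version: str   # exact version identifier, not a stable alias
    temperature: float
    system_prompt: str
    max_tokens: int
    num_shots: int       # number of examples shown to the model in the prompt


config = RunConfig(
    model_version="example-model-2024-01-01",  # hypothetical identifier
    temperature=0.0,
    system_prompt="You are a support ticket classifier.",
    max_tokens=256,
    num_shots=3,
)

# Serialize with sorted keys so two identical configs produce identical text.
config_json = json.dumps(asdict(config), sort_keys=True, indent=2)
```

Saving `config_json` with each result file means that when numbers shift, the first comparison to make is a diff of two small JSON documents.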

Step four: run multiple passes and record everything

A single pass is noisy because model outputs vary between identical requests. Run each example three to five times and record every result, not just the average. The spread tells you how reliable the model is, which is sometimes more important than the mean.
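
To see why the spread matters, consider two examples with three passes each. The sketch below uses Python's standard `statistics` module; the function name `summarize_passes` and the choice of sample standard deviation as the spread measure are assumptions for illustration, and the scores are invented.

```python
from statistics import mean, stdev


def summarize_passes(scores_by_example):
    """Compute mean, spread, min, and max of the pass scores for each example."""
    summary = {}
    for example_id, scores in scores_by_example.items():
        summary[example_id] = {
            "mean": mean(scores),
            # stdev needs at least two values; a single pass has no measurable spread.
            "spread": stdev(scores) if len(scores) > 1 else 0.0,
            "min": min(scores),
            "max": max(scores),
        }
    return summary


# Illustrative scores: ex-001 is stable, ex-002 fails one pass in three.
scores = {
    "ex-001": [1.0, 1.0, 1.0],
    "ex-002": [1.0, 0.0, 1.0],
}
summary = summarize_passes(scores)
```

Here `ex-002` still averages about 0.67, but its nonzero spread and minimum of 0.0 show that the model sometimes gets it wrong, which the mean alone would blur.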

What to capture per run

Raw model output for every example and every pass.
The score assigned and who or what assigned it.
Latency and token count for cost and speed analysis.
The full configuration used.

Storing raw outputs is what makes the workflow auditable. When someone questions a number, you can show them the exact output that produced it instead of re-running and hoping for the same result.
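
The capture list above maps naturally onto one record per example per pass, appended to a log as a line of JSON. This is a minimal sketch: `PassRecord`, `append_record`, and every field name are hypothetical, and an in-memory buffer stands in for a file opened in append mode.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class PassRecord:
    example_id: str
    pass_index: int
    raw_output: str    # exact text the model returned
    score: float
    scored_by: str     # e.g. "exact_match", a judge model name, or a person
    latency_ms: float
    total_tokens: int
    config: dict       # the full pinned configuration for this run


def append_record(stream, record):
    """Write one record to the stream as a single JSON line."""
    stream.write(json.dumps(asdict(record), sort_keys=True) + "\n")


log = io.StringIO()  # stands in for open("results.jsonl", "a")
append_record(log, PassRecord(
    example_id="ex-001",
    pass_index=0,
    raw_output="billing",
    score=1.0,
    scored_by="exact_match",
    latency_ms=412.0,
    total_tokens=57,
    config={"model_version": "example-model-2024-01-01", "temperature": 0.0},
))
lines = log.getvalue().splitlines()
```

Because each line is self-describing, anyone questioning a score can find the exact output and configuration behind it without re-running anything.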

Step five: produce a comparable summary

The output of the workflow is a short, standardized report. Same columns every time: model, version, quality score, score spread, median latency, cost per request, and the date. Consistency in the report format is what lets you stack runs side by side over months.
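
A fixed column list enforced in code is one way to keep the report format from drifting. The sketch below is illustrative only: the column names, the rounding, and the use of the mean for quality and cost are assumptions, and the model name, version, and figures are invented placeholders.

```python
from statistics import mean, median, stdev

# Same columns, in the same order, for every run.
COLUMNS = [
    "model", "version", "quality", "spread",
    "median_latency_ms", "cost_per_request", "date",
]


def summary_row(model, version, scores, latencies_ms, costs, run_date):
    """Collapse one run's per-pass measurements into a single report row."""
    return {
        "model": model,
        "version": version,
        "quality": round(mean(scores), 3),
        "spread": round(stdev(scores), 3) if len(scores) > 1 else 0.0,
        "median_latency_ms": median(latencies_ms),
        "cost_per_request": round(mean(costs), 5),
        "date": run_date,
    }


def format_table(rows):
    """Render rows as pipe-separated text with a header line."""
    lines = [" | ".join(COLUMNS)]
    for row in rows:
        lines.append(" | ".join(str(row[column]) for column in COLUMNS))
    return "\n".join(lines)


# Hypothetical run data for illustration.
row = summary_row(
    model="example-model",
    version="2024-01-01",
    scores=[1.0, 1.0, 0.0, 1.0],
    latencies_ms=[400, 500, 450, 600],
    costs=[0.002, 0.002, 0.003, 0.001],
    run_date="2024-01-15",
)
table = format_table([row])
```

Appending each new run as another row of the same table is what makes month-over-month comparison a matter of reading down a column.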

Keep the report blunt. One table and a two-sentence recommendation beat a long narrative nobody reads. The summary feeds directly into the decision plays described in The AI Model Benchmarks Playbook, where it becomes a go or no-go.

Making the workflow hand-off-able

A workflow that only you can run is not a workflow. Write a one-page runbook that lists the steps, points to the frozen test set, states the scoring rubric, and shows the configuration to pin. The test is simple: hand it to a teammate who has never run it and see if they reproduce your last numbers within a small margin.

If they cannot, the gap reveals what was living in your head instead of on the page. Patch the runbook and try again. After two or three handoffs the document gets genuinely tight, and benchmarking becomes a task anyone can pick up rather than a bottleneck attached to one person.
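
The handoff test can be made explicit by agreeing in advance what "within a small margin" means. In the sketch below, the function name `find_mismatches` and the default margin of 0.02 are assumptions chosen for illustration; pick a margin that reflects the spread you see between your own repeated runs.

```python
def find_mismatches(baseline, rerun, margin=0.02):
    """Return metrics whose rerun value is missing or differs from baseline by more than margin."""
    mismatches = {}
    for name, expected in baseline.items():
        actual = rerun.get(name)
        if actual is None or abs(actual - expected) > margin:
            mismatches[name] = (expected, actual)
    return mismatches


# Hypothetical numbers: your last run versus two teammates' reruns.
baseline = {"quality": 0.81, "spread": 0.10}
close_rerun = {"quality": 0.80, "spread": 0.11}
far_rerun = {"quality": 0.70, "spread": 0.10}

close_result = find_mismatches(baseline, close_rerun)
far_result = find_mismatches(baseline, far_rerun)
```

An empty result means the runbook reproduced the numbers. A non-empty one names the metric to investigate, which usually points back to an unpinned setting or an ambiguous rubric.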

Frequently Asked Questions

How big should my frozen test set be?

Thirty to a hundred real examples works for most teams, with the upper end giving more statistical confidence. The composition matters more than the size. A set of 40 examples that covers your hard cases beats 200 easy ones. Make sure the difficult 10 percent of your workload is represented.

How many passes per example do I need?

Three to five passes per example lets you average out the randomness in model outputs and see the spread. A single pass gives a noisy number you should not trust for close decisions. The closer your candidate models are, the more passes you need to separate them confidently.

Should I use a model to score outputs?

Judge models work well for open-ended tasks at scale, but they have biases and need a clear rubric just like human scorers. For high-stakes decisions, validate the judge against human scores on a sample before trusting it broadly. For programmatic tasks with clear answers, skip judges and use exact matching.
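
Validating a judge against human scores can start with a simple agreement rate on a shared sample. This is a rough sketch: `agreement_rate` is a hypothetical helper, the pass/fail labels are invented, and plain agreement is only one of several possible measures.

```python
def agreement_rate(judge_scores, human_scores):
    """Return the fraction of examples where the judge and the human gave the same score."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge and human score lists must be the same length")
    if not judge_scores:
        raise ValueError("need at least one scored example")
    matches = sum(1 for j, h in zip(judge_scores, human_scores) if j == h)
    return matches / len(judge_scores)


# Illustrative pass/fail labels on four sampled outputs.
judge = [1, 0, 1, 1]
human = [1, 0, 0, 1]
rate = agreement_rate(judge, human)
```

Reading the individual disagreements is often more informative than the rate itself, since they show where the rubric or the judge prompt is ambiguous.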

What is model version drift and how do I handle it?

Version drift happens when a provider updates a model behind a stable name, changing behavior without notice. Handle it by recording the exact version identifier on every run and keeping a frozen baseline of past scores. When results shift with no change on your side, drift is the prime suspect.

How do I know my workflow is actually repeatable?

Hand the runbook to a teammate who has never run it and check whether they reproduce your last numbers within a small margin. If they cannot, something important is undocumented. Each handoff exposes hidden assumptions, and after a few iterations the runbook becomes genuinely portable.

Key Takeaways

A repeatable workflow produces numbers you can compare across months, not just a one-time result.
Freeze and version your test set so comparisons stay valid over time.
Define scoring rubrics before viewing outputs to avoid bending them toward a favored model.
Pin every configuration variable, especially model version, to catch silent drift.
Run multiple passes and store raw outputs so results are auditable, not just averaged.
Write a one-page runbook and prove it works by handing it to a teammate.

Agency Script Editorial
