Could Anyone on Your Team Reproduce Your Numbers?

Agency Script Editorial

Editorial Team

November 14, 2025 · 7 min read
Tags: AI model benchmarks, AI model benchmarks workflow, AI model benchmarks guide, ai fundamentals

The difference between a benchmark you ran once and a workflow you can hand off is whether anyone else on your team can reproduce your numbers without asking you a single question. Most benchmarking lives in someone's head and a folder of one-off scripts. That works until that person is on vacation and a new model ships.

This article is about turning model evaluation into a documented, repeatable process that survives staff changes and produces comparable results every time. The goal is boring on purpose. A good benchmarking workflow should feel like a checklist, not a research project, by the third time you run it.

Why repeatability is the whole point

A single benchmark run gives you a number. A repeatable workflow gives you a number you can compare to last month's number and trust the comparison. That trust is the entire value. Without it, you cannot tell whether a model got better, your test got easier, or you just changed a setting.

Repeatability rests on three things being identical across runs: the test set, the scoring method, and the model settings. Lock all three and your comparisons mean something. Let any one drift and your numbers become anecdotes.

Step one: assemble a frozen test set

Your test set is the foundation, and it must stop changing once you commit to it. Pull 30 to 100 real examples from your actual workload, with expected outputs or scoring rubrics attached. Cover the easy cases, the common cases, and, crucially, the hard 10 percent where models tend to fail.

What makes a good frozen set

  • Drawn from real inputs, not invented examples that flatter the model.
  • Versioned and stored, so you know exactly which set produced which numbers.
  • Stable, meaning you do not quietly add or remove examples between runs.

If you must change the set, bump its version and re-baseline every model on the new version. Never compare scores across set versions. The reasoning behind freezing is laid out further in A Step-by-Step Approach to AI Model Benchmarks.
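One way to make "frozen" enforceable is to fingerprint the set when you freeze it and refuse to run if the fingerprint changes. The sketch below is a minimal illustration, not a prescribed tool: the example records, the `set_version` label, and the function names are all hypothetical.

```python
import hashlib
import json


def fingerprint_test_set(examples):
    """Return a short SHA-256 fingerprint of the examples.

    Keys are sorted so field order inside an example does not matter.
    The order of examples in the list does matter.
    """
    canonical = json.dumps(
        examples, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical examples; a real set would hold 30 to 100 items from your workload.
frozen_set = [
    {"id": "ex-001", "input": "Classify: refund request", "expected": "billing"},
    {"id": "ex-002", "input": "Classify: password reset", "expected": "account"},
]

set_version = "v1"  # assumed naming scheme; bump it whenever the set changes
set_fingerprint = fingerprint_test_set(frozen_set)  # record this next to the results


def assert_unchanged(examples, expected_fingerprint):
    """Raise if the set no longer matches the fingerprint recorded at freeze time."""
    actual = fingerprint_test_set(examples)
    if actual != expected_fingerprint:
        raise ValueError(
            f"Test set changed ({actual} != {expected_fingerprint}); "
            "bump the version and re-baseline every model."
        )
```

Calling `assert_unchanged(frozen_set, set_fingerprint)` at the start of every run turns a quiet edit to the set into a loud failure.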

Step two: define scoring before you look at outputs

Decide how you will score each example before you run anything. This sounds obvious and is constantly violated. When you score after seeing outputs, you unconsciously bend the rubric toward whichever model you already prefer.

Three scoring approaches

  • Exact or programmatic match for tasks with a single correct answer, like classification or extraction.
  • Rubric scoring for open-ended tasks, where a human or a judge model rates against fixed criteria.
  • Pairwise preference where a judge picks the better of two outputs without scoring each in isolation.

Write the rubric down. Anyone running the workflow should score the same output the same way you would. Ambiguous rubrics are the most common reason two people get different numbers from the same test.
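Writing scoring down as code before any run is one way to keep it from bending later. The sketch below shows a programmatic exact match and a weighted rubric. The normalization rule, the rubric criteria, and their weights are illustrative assumptions, not a recommended rubric.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.strip().lower().split())


def exact_match(output, expected):
    """Score 1.0 when the normalized output equals the normalized expected answer."""
    return 1.0 if normalize(output) == normalize(expected) else 0.0


# Hypothetical rubric: (criterion, weight). Fix this before looking at any output.
RUBRIC = (
    ("answers the question asked", 2),
    ("adds no facts absent from the input", 2),
    ("stays under the length limit", 1),
)


def rubric_score(checks):
    """Turn per-criterion pass/fail judgments into a score between 0 and 1.

    `checks` maps each criterion name to True or False, as decided by a
    human or a judge model. Every criterion must be judged.
    """
    missing = [name for name, _ in RUBRIC if name not in checks]
    if missing:
        raise KeyError(f"Unscored criteria: {missing}")
    earned = sum(weight for name, weight in RUBRIC if checks[name])
    total = sum(weight for _, weight in RUBRIC)
    return earned / total
```

Because the rubric is data, two people who judge the criteria the same way get the same number, and any disagreement is visible at the level of a single criterion.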

Step three: pin your run configuration

Every variable you do not pin is a variable that will move and corrupt your comparison. Document the model version, temperature, system prompt, maximum tokens, and the number of few-shot examples included in the prompt. Store this configuration alongside the results.

The silent killer here is model version drift. Providers update models behind stable names, so "the same model" three months apart may not be the same model at all. Record the exact version identifier every time, and when results shift unexpectedly, this is the first place to look. Many of these pitfalls are catalogued in 7 Common Mistakes with AI Model Benchmarks (and How to Avoid Them).
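A pinned configuration can be as simple as an immutable record that is saved with every run. The sketch below assumes hypothetical field names and a made-up version identifier. It also reads "the number of examples shown to the model" as the few-shot count, which is an interpretation rather than something the steps above spell out.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class RunConfig:
    model_version: str      # exact version identifier, not a floating alias
    temperature: float
    system_prompt: str
    max_tokens: int
    few_shot_examples: int  # assumed meaning of "examples shown to the model"


config = RunConfig(
    model_version="example-model-2025-01-01",  # hypothetical identifier
    temperature=0.0,
    system_prompt="You are a support ticket classifier.",
    max_tokens=256,
    few_shot_examples=3,
)


def config_to_json(cfg):
    """Serialize the configuration so it can be stored next to the results."""
    return json.dumps(asdict(cfg), sort_keys=True, indent=2)


def config_diff(a, b):
    """Return the fields that differ between two configurations."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}
```

When results shift unexpectedly, running `config_diff` on the stored configurations of the two runs is a quick first check for a changed version identifier or setting.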

Step four: run multiple passes and record everything

A single pass is noisy because model outputs vary between identical requests. Run each example three to five times and record every result, not just the average. The spread tells you how reliable the model is, which is sometimes more important than the mean.

What to capture per run

  • Raw model output for every example and every pass.
  • The score assigned and who or what assigned it.
  • Latency and token count for cost and speed analysis.
  • The full configuration used.

Storing raw outputs is what makes the workflow auditable. When someone questions a number, you can show them the exact output that produced it instead of re-running and hoping for the same result.
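The capture list above maps naturally onto one record per example per pass. The sketch below is a minimal illustration with hypothetical field names. It uses the population standard deviation as the spread, which is one reasonable choice among several.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RunRecord:
    example_id: str
    pass_index: int
    raw_output: str    # stored verbatim so any score can be audited later
    score: float
    scored_by: str     # for example "exact_match", "judge-model", or a person's name
    latency_ms: float
    total_tokens: int
    config: dict       # the full configuration used for this run


def summarize_scores(records):
    """Group scores by example and report the mean, spread, and number of passes."""
    by_example = {}
    for record in records:
        by_example.setdefault(record.example_id, []).append(record.score)
    return {
        example_id: {
            "mean": statistics.mean(scores),
            "spread": statistics.pstdev(scores),  # assumed spread measure
            "passes": len(scores),
        }
        for example_id, scores in by_example.items()
    }
```

An example whose mean is high but whose spread is also high is one the model gets right only some of the time, which an average alone would hide.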

Step five: produce a comparable summary

The output of the workflow is a short, standardized report. Same columns every time: model, version, quality score, score spread, median latency, cost per request, and the date. Consistency in the report format is what lets you stack runs side by side over months.

Keep the report blunt. One table and a two-sentence recommendation beat a long narrative nobody reads. The summary feeds directly into the decision plays described in The AI Model Benchmarks Playbook, where it becomes a go or no-go.
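A fixed column list in code is one way to guarantee the report has the same shape every time. The sketch below builds one row from raw scores and latencies. The column keys, the rounding, and the plain-text table layout are assumptions made for illustration.

```python
import statistics

# Same columns, same order, every run.
COLUMNS = (
    "model", "version", "quality", "spread",
    "median_latency_ms", "cost_per_request", "date",
)


def build_row(model, version, scores, latencies_ms, total_cost, run_date):
    """Reduce one benchmark run to a single standardized report row."""
    return {
        "model": model,
        "version": version,
        "quality": round(statistics.mean(scores), 3),
        "spread": round(statistics.pstdev(scores), 3),
        "median_latency_ms": statistics.median(latencies_ms),
        "cost_per_request": round(total_cost / len(scores), 5),
        "date": run_date,
    }


def render_table(rows):
    """Render rows as a pipe-separated plain-text table with a header line."""
    lines = [" | ".join(COLUMNS)]
    for row in rows:
        lines.append(" | ".join(str(row[column]) for column in COLUMNS))
    return "\n".join(lines)
```

Since every row has the same keys, rows from runs months apart can be appended to one list and rendered together for a side-by-side view.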

Making the workflow hand-off-able

A workflow that only you can run is not a workflow. Write a one-page runbook that lists the steps, points to the frozen test set, states the scoring rubric, and shows the configuration to pin. The test is simple: hand it to a teammate who has never run it and see if they reproduce your last numbers within a small margin.

If they cannot, the gap reveals what was living in your head instead of on the page. Patch the runbook and try again. After two or three handoffs the document gets genuinely tight, and benchmarking becomes a task anyone can pick up rather than a bottleneck attached to one person.
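The handoff test can itself be made mechanical. The sketch below compares a teammate's rerun against your last recorded scores. The `margin` default of 0.05 is an arbitrary placeholder for "a small margin", and the function name is hypothetical.

```python
def reproduces(baseline, rerun, margin=0.05):
    """Check whether rerun scores land within `margin` of the baseline scores.

    Both arguments map a model name to its quality score. Returns a tuple
    (ok, misses), where `misses` maps each model that failed the check to
    (baseline_score, rerun_score). A model missing from the rerun counts
    as a miss with a rerun score of None.
    """
    misses = {}
    for name, base_score in baseline.items():
        if name not in rerun:
            misses[name] = (base_score, None)
        elif abs(rerun[name] - base_score) > margin:
            misses[name] = (base_score, rerun[name])
    return (not misses, misses)
```

A non-empty `misses` result points at the models whose numbers could not be reproduced, which is where to start looking for the undocumented step.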

Frequently Asked Questions

How big should my frozen test set be?

Thirty to a hundred real examples works for most teams, with the upper end giving more statistical confidence. The composition matters more than the size. A set of 40 examples that covers your hard cases beats 200 easy ones. Make sure the difficult 10 percent of your workload is represented.

How many passes per example do I need?

Three to five passes per example lets you average out the randomness in model outputs and see the spread. A single pass gives a noisy number you should not trust for close decisions. The closer your candidate models are, the more passes you need to separate them confidently.

Should I use a model to score outputs?

Judge models work well for open-ended tasks at scale, but they have biases and need a clear rubric just like human scorers. For high-stakes decisions, validate the judge against human scores on a sample before trusting it broadly. For programmatic tasks with clear answers, skip judges and use exact matching.
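Validating a judge against human scores can start with a simple agreement rate on a shared sample. The sketch below assumes both scorers rated the same outputs in the same order. The `tolerance` parameter and the function name are illustrative, and what counts as "enough" agreement is a decision left to you.

```python
def agreement_rate(human_scores, judge_scores, tolerance=0.0):
    """Fraction of sampled outputs where the judge is within `tolerance` of the human."""
    if not human_scores or len(human_scores) != len(judge_scores):
        raise ValueError("Need two non-empty score lists of equal length.")
    matches = sum(
        1
        for human, judge in zip(human_scores, judge_scores)
        if abs(human - judge) <= tolerance
    )
    return matches / len(human_scores)
```

For example, if a human and a judge agree on three of four sampled outputs, `agreement_rate([1, 0, 1, 1], [1, 0, 0, 1])` returns 0.75.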

What is model version drift and how do I handle it?

Version drift happens when a provider updates a model behind a stable name, changing behavior without notice. Handle it by recording the exact version identifier on every run and keeping a frozen baseline of past scores. When results shift with no change on your side, drift is the prime suspect.

How do I know my workflow is actually repeatable?

Hand the runbook to a teammate who has never run it and check whether they reproduce your last numbers within a small margin. If they cannot, something important is undocumented. Each handoff exposes hidden assumptions, and after a few iterations the runbook becomes genuinely portable.

Key Takeaways

  • A repeatable workflow produces numbers you can compare across months, not just a one-time result.
  • Freeze and version your test set so comparisons stay valid over time.
  • Define scoring rubrics before viewing outputs to avoid bending them toward a favored model.
  • Pin every configuration variable, especially model version, to catch silent drift.
  • Run multiple passes and store raw outputs so results are auditable, not just averaged.
  • Write a one-page runbook and prove it works by handing it to a teammate.
