Stop Evaluating Models From Scratch Every Single Time

Most teams evaluate models the same way every time: badly, and from scratch. Each decision is a one-off scramble, the reasoning lives in someone's head, and when a new model ships nobody can reproduce how the last choice was made. The work doesn't compound, so the team never gets better at it.

A framework fixes that by turning evaluation into a repeatable structure with named stages, so each pass produces reusable assets and each decision stands on the last. This piece introduces a four-stage framework, FRAME, and explains what each stage produces, when it matters most, and how the stages connect. It's deliberately lightweight; the value is in the structure, not in ceremony.

Use it as a scaffold, not a straitjacket. For a quick exploratory choice you might run the stages in an afternoon; for a production standardization you might spend days on each. The stages stay the same; only the depth changes.

The FRAME Framework Overview

FRAME has four stages: Frame the decision, Reference public signals, Assemble a private evaluation, and Measure and maintain. Each stage hands a concrete artifact to the next.

Why a named structure helps

A named structure gives the team shared language and a reusable template. Instead of relitigating how to evaluate every time, you run the stages and inherit last quarter's task set, rubric, and documented reasoning. The first run is the most expensive; every subsequent run is cheaper because the artifacts persist.

When to apply the full framework

Apply all four stages for any model you'll deploy and depend on. For throwaway experiments, the first two stages may be enough. The deeper your commitment, the further into the framework you should go.

Stage 1: Frame the Decision

The first stage produces a decision statement, a constraint list, and success criteria. Nothing downstream is valid without it.

What this stage produces

A one-sentence statement of the actual decision, the hard constraints (cost, latency, region, compliance), and three to five concrete success criteria. These are written down before any benchmark is consulted, so the rest of the process serves the decision rather than being anchored by a leaderboard.
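
One way to keep this artifact honest is to write it down as a small structured record rather than a paragraph in a chat thread. The sketch below is a minimal illustration, not a prescribed format: the class name, field names, and the example decision are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionFrame:
    """Stage 1 artifact, written before any benchmark is consulted.

    All names here are hypothetical; adapt them to your own template.
    """
    decision: str                      # one-sentence statement of the decision
    constraints: dict[str, str]        # hard limits: cost, latency, region, compliance
    success_criteria: tuple[str, ...]  # three to five concrete criteria

    def problems(self) -> list[str]:
        """Return reasons the frame is incomplete (empty list if none)."""
        issues = []
        if not self.decision.strip():
            issues.append("decision statement is empty")
        if not self.constraints:
            issues.append("no hard constraints listed")
        if not 3 <= len(self.success_criteria) <= 5:
            issues.append("expected three to five success criteria")
        return issues


# Illustrative values only.
frame = DecisionFrame(
    decision="Pick the default model for support-ticket summarization.",
    constraints={"latency": "under 2 s per ticket", "region": "EU only"},
    success_criteria=(
        "summary keeps every customer-stated deadline",
        "no invented order numbers",
        "fits in 80 words",
    ),
)
print(frame.problems())
```

Checking the frame before moving on is a cheap way to enforce the "written down first" rule: if `problems()` returns anything, Stage 2 has not earned its start.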

When it matters most

Framing matters most when the decision is high-stakes or long-lived, because a vague frame produces a vague answer. A standardization across a whole team lives or dies on how precisely you stated success here. Skipping this stage is the root cause of most evaluations that feel rigorous but pick the wrong model.

Stage 2: Reference Public Signals

The second stage uses public benchmarks to build a shortlist, producing a ranked set of candidates worth your testing time.

What this stage produces

A shortlist of two to four candidate models, chosen by matching benchmark categories to your task and filtering for disclosed methodology, independence, and meaningful score gaps. The output is explicitly a shortlist, not a winner, because public scores can't see your specific work.
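
The filter described above can be sketched in a few lines. This is an illustration under stated assumptions: the `Candidate` fields, the `noise_margin` of 2.0 points, and the scores used in the usage are all hypothetical, and what counts as a "meaningful" gap depends on the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    category_scores: dict[str, float]  # benchmark category -> published score
    methodology_disclosed: bool
    independent_source: bool


def build_shortlist(candidates, relevant_categories, noise_margin=2.0, max_size=4):
    """Rank eligible candidates on the categories that match the task.

    Returns (name, mean_score, within_noise_of_leader) tuples. A gap no larger
    than noise_margin is treated as not meaningful, so those candidates are
    flagged as effectively tied with the leader.
    """
    if not relevant_categories:
        raise ValueError("name at least one benchmark category that matches the task")

    def mean_score(c):
        total = sum(c.category_scores[cat] for cat in relevant_categories)
        return total / len(relevant_categories)

    eligible = [
        c for c in candidates
        if c.methodology_disclosed
        and c.independent_source
        and all(cat in c.category_scores for cat in relevant_categories)
    ]
    ranked = sorted(eligible, key=mean_score, reverse=True)[:max_size]
    if not ranked:
        return []
    best = mean_score(ranked[0])
    return [(c.name, mean_score(c), best - mean_score(c) <= noise_margin) for c in ranked]


# Made-up candidates and scores, for illustration only.
pool = [
    Candidate("A", {"coding": 80.0}, True, True),
    Candidate("B", {"coding": 79.0}, True, True),
    Candidate("C", {"coding": 70.0}, True, True),
    Candidate("D", {"coding": 90.0}, False, True),   # undisclosed methodology
    Candidate("E", {"math": 95.0}, True, True),      # no score in the relevant category
]
print(build_shortlist(pool, ["coding"]))
```

Note that the function returns a ranked list with tie flags rather than a single name. That is deliberate: the output of this stage is a set of candidates to test, and a one-point lead on a public benchmark is not a reason to skip testing the runner-up.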

When it matters most

This stage earns its keep when the field of candidate models is large. Public benchmarks are a free filter that saves you from privately testing every model on the market. The discipline is to use them only to narrow, never to decide. The details of reading scores well are in The Complete Guide to AI Model Benchmarks.

Stage 3: Assemble a Private Evaluation

The third stage is where the decision is actually made. It produces a scored comparison of the shortlist on your own tasks.

What this stage produces

A task set of 50 to 200 real examples, a rubric written before any output exists, and the shortlisted models' scored results under identical conditions. This is the artifact that, more than any public number, tells you which model fits your work.
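
A minimal harness makes the "identical conditions" requirement concrete: every model sees the same tasks in the same order and is scored by the same rubric function. The sketch below assumes each model can be wrapped as a plain function from prompt to text; the stub models, the two-task set, and the substring rubric are placeholders invented for this example, not a recommended rubric.

```python
from typing import Callable

Task = dict[str, str]                    # {"input": ..., "expected": ...}
Model = Callable[[str], str]             # hypothetical wrapper around a real model call
Rubric = Callable[[Task, str], float]    # scores one output between 0.0 and 1.0


def run_private_eval(tasks: list[Task], models: dict[str, Model], rubric: Rubric) -> dict[str, float]:
    """Score every model on the same tasks with the same rubric; return mean scores."""
    if not tasks:
        raise ValueError("the task set is empty")
    results = {}
    for name, model in models.items():
        scores = [rubric(task, model(task["input"])) for task in tasks]
        results[name] = sum(scores) / len(scores)
    return results


def contains_expected(task: Task, output: str) -> float:
    """Toy rubric: full credit if the expected answer appears in the output."""
    return 1.0 if task["expected"].lower() in output.lower() else 0.0


# Stub models standing in for real API calls.
def model_a(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def model_b(prompt: str) -> str:
    return "4"


tasks = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "paris"},
]
scores = run_private_eval(tasks, {"model_a": model_a, "model_b": model_b}, contains_expected)
print(scores)
```

Passing the rubric in as an argument, rather than writing scoring logic inside the loop, mirrors the rule that the rubric exists before any output does.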

When it matters most

Always, for anything you'll deploy. This stage is what defends against contamination, mismatched benchmarks, and the gap between test and reality, all at once. Teams that skip it because the leaderboard "already answered" are the ones surprised in production. The full procedure is in A Step-by-Step Approach to AI Model Benchmarks.

The sub-discipline of judging

If you score with a model judge to scale, validate it against human scores on a sample first. An unvalidated judge can bias the entire comparison in a consistent, invisible direction.
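
A simple validation check compares judge scores with human scores on the same sample and reports two numbers: how often they agree, and the average direction of the disagreement. This is a sketch, assuming both sets of scores are on the same numeric scale; the function name, the tolerance parameter, and the sample scores are illustrative, and what level of agreement is acceptable is a judgment call this code does not make.

```python
def judge_agreement(human: list[float], judge: list[float], tolerance: float = 0.0):
    """Compare a model judge with human scores on the same items.

    Returns (agreement_rate, mean_bias). A positive mean_bias means the judge
    scores higher than humans on average; a negative one means lower.
    """
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty score lists of equal length")
    n = len(human)
    agreement_rate = sum(abs(h - j) <= tolerance for h, j in zip(human, judge)) / n
    mean_bias = sum(j - h for h, j in zip(human, judge)) / n
    return agreement_rate, mean_bias


# Made-up scores on a 1-to-5 scale.
human_scores = [3, 4, 2, 5, 1]
judge_scores = [4, 4, 3, 5, 2]
rate, bias = judge_agreement(human_scores, judge_scores)
print(f"agreement={rate:.2f}, bias={bias:+.2f}")
```

The bias figure is the one to watch. A judge that disagrees with humans at random adds noise, but a judge that is consistently generous or harsh can shift the whole comparison without anyone noticing.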

Stage 4: Measure and Maintain

The final stage produces the decision record and the re-run trigger that keeps the choice current over time.

What this stage produces

A documented decision with evidence and date, the trade-off analysis folding in cost and latency, and a maintained, reusable task set and rubric. Critically, it also produces a trigger: a defined event or cadence that prompts a re-run.
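
The decision record and its trigger can live in the same small structure. The sketch below is one possible shape, with assumptions flagged: the field names, the 90-day default cadence, and the example values are hypothetical. It also assumes the provider exposes a version string you can compare; when models update without a version change, the cadence check is the only one that fires.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DecisionRecord:
    chosen_model: str
    model_version: str           # version string recorded at decision time
    decided_on: date
    evidence: str                # pointer to the scored comparison
    rerun_after_days: int = 90   # hypothetical cadence; pick your own


def rerun_reasons(record: DecisionRecord, today: date, current_version: str) -> list[str]:
    """Return the reasons a re-run is due (empty list if none)."""
    reasons = []
    if current_version != record.model_version:
        reasons.append("model version changed")
    if today - record.decided_on >= timedelta(days=record.rerun_after_days):
        reasons.append("cadence elapsed")
    return reasons


# Illustrative record.
record = DecisionRecord(
    chosen_model="model_a",
    model_version="v1",
    decided_on=date(2025, 1, 1),
    evidence="private eval results, 120 tasks",
)
print(rerun_reasons(record, date(2025, 2, 1), "v1"))
print(rerun_reasons(record, date(2025, 5, 1), "v2"))
```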

When it matters most

This stage matters most over the long run, where it's also the most neglected. Models update silently, and a decision made on old behavior can quietly underperform. The maintenance stage is what turns a one-time choice into a standing capability and catches regressions that pure leaderboard-watching never would.

Applying FRAME at Different Stakes

The framework flexes to the decision's importance, and knowing when to go shallow saves real time.

Exploratory or throwaway: Run Stage 1 lightly and Stage 2 to pick a reasonable default. Skip the private evaluation. The cost of being wrong is low.
Production but reversible: Add a scoped Stage 3 with 30 to 50 tasks, smaller than the full 50-to-200 task set. That is enough to catch major mismatches without heavy investment.
Standardization or high-stakes: Run all four stages at full depth, with 100-plus tasks and a maintained re-run cadence. The cost of testing is trivial against the cost of a locked-in wrong choice.
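
The three tiers above can be captured as a small lookup so the depth of an evaluation is chosen up front rather than drifting mid-project. This is only a sketch: the tier keys and field names are invented for this example, and the middle tier's stage list (1 through 3) is one reading of "add a scoped Stage 3", since the text does not say whether a reversible decision also needs Stage 4.

```python
# Tier names and fields are hypothetical labels for the three cases above.
DEPTH_BY_STAKES = {
    "exploratory": {"stages": (1, 2), "min_tasks": 0, "max_tasks": 0},
    "reversible": {"stages": (1, 2, 3), "min_tasks": 30, "max_tasks": 50},
    "high_stakes": {"stages": (1, 2, 3, 4), "min_tasks": 100, "max_tasks": None},
}


def plan_for(stakes: str) -> dict:
    """Look up which stages to run and how large the private task set should be."""
    try:
        return DEPTH_BY_STAKES[stakes]
    except KeyError:
        known = ", ".join(sorted(DEPTH_BY_STAKES))
        raise ValueError(f"unknown stakes level {stakes!r}; expected one of: {known}") from None


print(plan_for("reversible"))
```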

To operationalize the framework as a working tool, pair it with The AI Model Benchmarks Checklist for 2026, and review 7 Common Mistakes with AI Model Benchmarks to see what each stage is designed to prevent.

Frequently Asked Questions

Why use a named framework instead of just good judgment?

Judgment doesn't compound or transfer. A named framework gives the team shared language and reusable artifacts, so each evaluation inherits the last one's task set, rubric, and reasoning. It also ensures the easy-to-skip stages, like framing and maintenance, actually happen under deadline pressure.

Can I skip the public-benchmark stage and go straight to private testing?

You can, but you'll waste time testing models that a quick public filter would have eliminated. The reference stage exists to narrow a large field cheaply. It's most valuable when many candidates exist and least necessary when you already have a shortlist.

How is this different from the step-by-step process?

The step-by-step process is the detailed procedure inside Stage 3. The framework is the larger structure that surrounds it, adding framing before and maintenance after. Think of the framework as the lifecycle and the step-by-step guide as the core of its most important stage.

What's the most neglected stage?

Stage 4, measure and maintain. Teams celebrate the decision and never revisit it, even as models update silently beneath them. The maintenance stage and its re-run trigger are what keep the choice from quietly going stale months later.

How deep should each stage go?

It scales with stakes. Exploratory choices run the first two stages lightly; high-stakes standardizations run all four at full depth. The stages stay constant; only the effort per stage changes with how costly a wrong decision would be.

Key Takeaways

FRAME has four stages: Frame, Reference, Assemble, Measure and maintain, each producing a reusable artifact.
Stage 1 frames the decision and success criteria before any benchmark anchors your judgment.
Stage 2 uses public benchmarks only to build a shortlist, never to decide.
Stage 3 is where a private evaluation on your own tasks actually picks the model.
Stage 4, the most neglected, documents the decision and triggers re-runs as models change, making evaluation a standing capability.

Agency Script Editorial
