Five Evaluations In, Still Starting From Scratch

A one-off decision and a repeatable workflow are not the same thing, and the gap between them is where teams quietly lose months. The first time you evaluate open vs closed for a feature, it feels like deep architectural work. The fifth time, if you haven't written anything down, it's still deep architectural work, because every evaluation starts from scratch and lives in whoever happened to run it.

The fix is to make the evaluation itself a documented, repeatable process with defined inputs, stages, artifacts, and a hand-off point. When a new model launches or a new feature needs an AI backend, you run the workflow instead of re-litigating the philosophy. This article lays out that workflow stage by stage. For the underlying decision logic, the companion piece "A Framework for Open vs Closed Source AI Models" is the one you'll lean on inside several of these stages.

Why a Workflow Beats a Decision

The argument for process is simple: consistency and hand-off. A workflow makes every evaluation produce the same artifacts in the same format, so two engineers reach comparable conclusions and a third can pick up the work without a meeting. It also forces you to separate the parts that change (the specific models and prices) from the parts that don't (the steps and the criteria).

The inputs every run needs

The workload definition: what task, what volume, what latency tolerance.
The constraint set: data residency, compliance, budget ceiling.
The current candidate models, open and closed, worth comparing this quarter.

Lock these three down before any evaluation starts. A common way for an evaluation to fail is that someone skips the constraint set and discovers a regulatory blocker after building.
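One way to make "lock these three down" checkable is to give the inputs a fixed shape and refuse to start until none are missing. The sketch below is a minimal illustration, not a prescribed schema: the class names, field names, and the `missing` helper are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Workload:
    task: str             # what task, e.g. "support ticket routing"
    daily_volume: int     # expected requests per day
    max_latency_ms: int   # latency tolerance for one request


@dataclass(frozen=True)
class Constraints:
    data_residency: str           # e.g. "EU only" or "none"
    compliance: Tuple[str, ...]   # regimes that apply; may be empty
    monthly_budget_usd: float     # budget ceiling


@dataclass(frozen=True)
class EvaluationInputs:
    workload: Optional[Workload] = None
    constraints: Optional[Constraints] = None
    candidates: Tuple[str, ...] = ()   # model names, open and closed

    def missing(self) -> List[str]:
        """Name every input that has not been locked down yet."""
        gaps = []
        if self.workload is None:
            gaps.append("workload")
        if self.constraints is None:
            gaps.append("constraints")
        if not self.candidates:
            gaps.append("candidates")
        return gaps
```

An evaluation would only proceed when `missing()` returns an empty list, which makes a skipped constraint set visible on day one instead of after the build.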

Stage 1: Classify the Workload

Every run starts by sorting the workload into a category, because the category shortcuts most of the decision.

The categories

Sensitive-and-regulated: data can't leave your perimeter. This often forces open self-hosting regardless of cost.
High-volume-low-stakes: classification, extraction, routing. Strong open-model candidate.
Frontier-hard: complex reasoning, long context, agentic chains. Closed models usually still lead.
Exploratory: you don't know if the feature works yet. Always closed first for speed.

Most workloads fall cleanly into one bucket. The ones that straddle two are exactly where the rest of the workflow earns its keep, because you'll need real evidence rather than a category default.
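The category defaults above can be written down as a small lookup so that every run records the same starting point. This is only a sketch: the enum, the `DEFAULT_LEAN` table, and the `starting_point` function are hypothetical names, and the rule that any straddling workload goes to the full comparison is one reading of the paragraph above.

```python
from enum import Enum
from typing import Set


class Category(Enum):
    SENSITIVE_REGULATED = "sensitive-and-regulated"
    HIGH_VOLUME_LOW_STAKES = "high-volume-low-stakes"
    FRONTIER_HARD = "frontier-hard"
    EXPLORATORY = "exploratory"


# Default lean per category, taken from the list above.
DEFAULT_LEAN = {
    Category.SENSITIVE_REGULATED: "open, self-hosted",
    Category.HIGH_VOLUME_LOW_STAKES: "open",
    Category.FRONTIER_HARD: "closed",
    Category.EXPLORATORY: "closed",
}


def starting_point(categories: Set[Category]) -> str:
    """Return the category default, or flag a straddling workload."""
    if not categories:
        raise ValueError("classify the workload into at least one category")
    if len(categories) == 1:
        return DEFAULT_LEAN[next(iter(categories))]
    # Two or more buckets: a default is not enough, gather evidence.
    return "no default: run the full comparison"
```

The output is a starting hypothesis, not a decision; the later stages still have to confirm it.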

Stage 2: Define the Eval Set

This is the stage teams skip and the one that determines whether the whole workflow is trustworthy. Before you compare any models, you build a fixed evaluation set from real inputs.

What goes in the eval set

50 to 200 real or realistic inputs for the workload, including edge cases.
A defined notion of an acceptable output: exact match, rubric score, or human judgment.
A way to run the set automatically against any model behind your abstraction layer.

The eval set is a reusable asset. Build it once per workload and you reuse it for every future model that comes along. This is what makes the workflow repeatable instead of a fresh research project each time. The companion article "A Step-by-Step Approach to Open vs Closed Source AI Models" walks through assembling a first eval set in practice.
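To illustrate what "run the set automatically against any model" can look like, here is a minimal sketch. It assumes a model can be treated as a plain function from prompt to text and uses exact match as the acceptance rule; `EvalCase`, `exact_match`, and `run_eval` are names invented for this example, and a rubric or human-judgment scorer would plug in through the same `score` parameter.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str     # a real or realistic input
    expected: str   # the acceptable output for exact-match scoring


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected text, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: List[EvalCase],
    score: Callable[[str, str], float] = exact_match,
) -> float:
    """Run every case through the model and return the mean score."""
    if not cases:
        raise ValueError("the eval set is empty")
    total = sum(score(model(case.prompt), case.expected) for case in cases)
    return total / len(cases)
```

Because `run_eval` only needs a callable, the same fixed set can be pointed at any candidate that sits behind the abstraction layer.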

Stage 3: Run the Comparison

With a workload classified and an eval set built, the comparison is now mechanical, which is the whole point.

The comparison matrix

For each candidate model, record:

Quality score on the eval set.
Cost per task at projected volume.
Latency at the percentiles you care about, not just the average.
Operational burden: managed API versus GPUs you run.

Fill the same matrix for every run. Standardizing the columns is what lets you compare a decision made this quarter against one made last quarter without re-explaining anything.
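As a rough sketch of one matrix row, the code below computes cost per task and latency percentiles from raw measurements. The field names, the choice of p50 and p95, and the nearest-rank percentile method are assumptions for illustration; use whichever percentiles your workload actually cares about.

```python
import math
from dataclasses import dataclass
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass(frozen=True)
class MatrixRow:
    model: str
    quality: float             # score on the eval set
    cost_per_task_usd: float   # at projected volume
    p50_ms: float
    p95_ms: float
    ops_burden: str            # "managed API" or "self-hosted GPUs"


def build_row(model: str, quality: float, total_cost_usd: float,
              latencies_ms: List[float], ops_burden: str) -> MatrixRow:
    """Turn one model's measurements (one latency sample per task) into a standard row."""
    return MatrixRow(
        model=model,
        quality=quality,
        cost_per_task_usd=total_cost_usd / len(latencies_ms),
        p50_ms=percentile(latencies_ms, 50),
        p95_ms=percentile(latencies_ms, 95),
        ops_burden=ops_burden,
    )
```

A single slow request barely moves the average but shows up clearly at p95, which is why the row records percentiles rather than a mean.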

Stage 4: Decide and Document

Now you apply the decision rule and, critically, write down why.

The decision artifact

Produce a short, standard document for every evaluation containing:

The chosen model and the runner-up.
The deciding factor: was it cost, quality, compliance, or operational load?
The conditions that would reverse the decision.

That last line is the secret weapon. "We chose closed; revisit if daily volume exceeds X or if open quality on this eval set reaches Y" turns a static decision into a tripwire. Future-you doesn't have to remember the reasoning; the artifact carries it. This discipline is what separates a workflow from a guess, and it's the antidote to several traps in "7 Common Mistakes with Open vs Closed Source AI Models."
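The tripwire idea can be made literal by storing the reversal conditions as data next to the decision. The sketch below assumes exactly two reversal conditions, a volume ceiling and a runner-up quality bar, mirroring the X and Y in the quoted sentence; the class and field names are hypothetical, and a real artifact may carry different conditions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DecisionArtifact:
    workload: str
    chosen: str
    runner_up: str
    deciding_factor: str             # cost, quality, compliance, or operational load
    revisit_above_daily_volume: int  # the "X" in the decision sentence
    revisit_at_runner_up_quality: float  # the "Y" in the decision sentence

    def tripped(self, daily_volume: int, runner_up_quality: float) -> List[str]:
        """List every reversal condition that current numbers now satisfy."""
        reasons = []
        if daily_volume > self.revisit_above_daily_volume:
            reasons.append("daily volume exceeded")
        if runner_up_quality >= self.revisit_at_runner_up_quality:
            reasons.append("runner-up quality reached")
        return reasons
```

An empty list means the decision still stands; anything else is a reason to re-run the comparison rather than to reopen the whole debate.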

Stage 5: Implement Behind the Abstraction

Whatever you chose, it goes in behind the same internal model interface every workload uses. No raw vendor SDK calls scattered through application code. This is non-negotiable, because it's what makes the next stage possible and keeps switching cheap.

Implementation checklist

All calls route through the abstraction layer.
The eval set runs in CI against the deployed model.
Cost and latency dashboards exist for the workload.
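For the first checklist item, the abstraction layer can be as small as one interface plus a per-workload registry. This is a sketch under stated assumptions: `TextModel`, `register`, and `complete` are invented names, and `UppercaseStub` is a stand-in backend so the example runs without any vendor SDK. A real adapter would wrap a vendor client or a self-hosted server behind the same `complete` method.

```python
from typing import Dict, Protocol


class TextModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class UppercaseStub:
    """Stand-in backend for this sketch; not a real model."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


_BACKENDS: Dict[str, TextModel] = {}


def register(workload: str, model: TextModel) -> None:
    """Bind a workload to the model the decision artifact chose."""
    _BACKENDS[workload] = model


def complete(workload: str, prompt: str) -> str:
    """Application entry point: callers name the workload, never the vendor."""
    if workload not in _BACKENDS:
        raise KeyError(f"no model registered for workload {workload!r}")
    return _BACKENDS[workload].complete(prompt)
```

Switching models then means changing one `register` call, and the eval set in CI exercises the same `complete` path that production uses.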

Stage 6: Schedule the Re-Run

The workflow isn't done when you ship; it loops. You schedule a re-evaluation trigger so the decision doesn't silently rot.

Triggers that fire a re-run

A major new model release in either camp.
A meaningful price change from a provider.
Crossing a volume threshold you wrote into the decision artifact.
A fixed cadence; quarterly is sane for most teams.

When a trigger fires, you don't start over. You re-run Stages 3 and 4 with the existing eval set and the standard matrix. A re-evaluation that used to take weeks now takes an afternoon. That speed is the entire return on building the workflow.
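The four triggers listed above can be checked by a small scheduled job. The function below is a minimal sketch: its name and parameters are assumptions, the release and price flags are expected to come from whatever monitoring you already have, and the 90-day default is just one way to approximate "quarterly".

```python
from datetime import date
from typing import List


def rerun_reasons(
    last_run: date,
    today: date,
    new_major_release: bool,
    price_changed: bool,
    daily_volume: int,
    volume_threshold: int,
    cadence_days: int = 90,  # assumed stand-in for "quarterly"
) -> List[str]:
    """Return every trigger that currently calls for re-running Stages 3 and 4."""
    reasons = []
    if new_major_release:
        reasons.append("new model release")
    if price_changed:
        reasons.append("price change")
    if daily_volume > volume_threshold:
        reasons.append("volume threshold crossed")
    if (today - last_run).days >= cadence_days:
        reasons.append("scheduled cadence")
    return reasons
```

Returning the reasons, not just a yes or no, gives the next decision artifact a ready-made note on why the re-run happened.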

Making It Hand-Off-Able

The final test of a workflow is whether someone new can run it from the documents alone. Store the eval sets, the comparison matrices, and the decision artifacts in a shared, versioned location. Write a one-page runbook that points to each stage's template. When the engineer who built it leaves, the workflow stays, and that's the difference between a process and a person.

Frequently Asked Questions

How long does the first full run take?

The first run is slow, often a couple of weeks, mostly because you're building the eval set and the abstraction layer from nothing. Every subsequent run on the same workload is dramatically faster because those assets are reusable. The upfront cost is the investment that makes the workflow worth having.

What's the single most-skipped stage?

Defining the eval set. Teams jump straight to running models against vibes and anecdotes. Without a fixed eval set you can't compare models honestly, you can't re-run the decision later, and you can't hand it off. It's the load-bearing stage.

Can this workflow be partly automated?

Yes. Once the eval set and abstraction layer exist, running the comparison matrix against a new model can be largely scripted. The classification and the final decision still need human judgment, but the mechanical comparison should be push-button by your third or fourth run.

How detailed should the decision artifact be?

One page. The goal is to capture the chosen model, the deciding factor, and the conditions that would reverse it. Longer documents don't get read. The reversal conditions are the part you must not omit.

Where should all these artifacts live?

In a shared, versioned repository alongside your code or in a wiki linked from it, never in someone's local notes or a private doc. The whole value is hand-off and repeatability, which dies the moment the artifacts are personal rather than shared.

Key Takeaways

A repeatable workflow turns open vs closed from recurring research into an afternoon's re-run.
The six stages are: classify the workload, build an eval set, run the comparison, decide and document, implement behind an abstraction, and schedule the re-run.
The eval set and the abstraction layer are reusable assets; building them is the real investment.
Every decision produces a one-page artifact that names the deciding factor and the conditions that would reverse it.
Store all artifacts in a shared, versioned place so the workflow survives the person who built it.



Agency Script Editorial
