A one-off decision and a repeatable workflow are not the same thing, and the gap between them is where teams quietly lose months. The first time you evaluate open vs closed for a feature, it feels like deep architectural work. The fifth time, if you haven't written anything down, it's still deep architectural work, because every evaluation starts from scratch and its reasoning lives only in the head of whoever happened to run it.
The fix is to make the evaluation itself a documented, repeatable process with defined inputs, stages, artifacts, and a hand-off point. When a new model launches or a new feature needs an AI backend, you run the workflow instead of re-litigating the philosophy. This article lays out that workflow stage by stage. For the underlying decision logic, the companion piece, A Framework for Open vs Closed Source AI Models, is what you'll lean on inside several of these stages.
Why a Workflow Beats a Decision
The argument for process is simple: consistency and hand-off. A workflow makes every evaluation produce the same artifacts in the same format, so two engineers reach comparable conclusions and a third can pick up the work without a meeting. It also forces you to separate the parts that change (the specific models and prices) from the parts that don't (the steps and the criteria).
The inputs every run needs
- The workload definition: what task, what volume, what latency tolerance.
- The constraint set: data residency, compliance, budget ceiling.
- The current candidate models, open and closed, worth comparing this quarter.
Lock these three down before any evaluation starts. A common way for an evaluation to fail is skipping the constraint set and discovering a regulatory blocker only after the build is done.
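One way to make the three inputs hard to skip is to capture them in a small typed record that refuses to call itself complete. The sketch below is illustrative only: the class names, field names, and the `missing` check are assumptions, not part of any standard tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    task: str            # what the model is asked to do
    daily_volume: int    # projected requests per day
    max_latency_ms: int  # latency tolerance for one request


@dataclass(frozen=True)
class Constraints:
    data_residency: str          # e.g. "eu-only" or "none"
    compliance: tuple[str, ...]  # regimes that apply; may be empty
    monthly_budget_usd: float    # budget ceiling


@dataclass(frozen=True)
class RunInputs:
    workload: Workload
    constraints: Constraints
    candidates: tuple[str, ...]  # model names, open and closed

    def missing(self) -> list[str]:
        """Return the names of inputs that have been left blank."""
        problems = []
        if not self.workload.task:
            problems.append("workload.task")
        if not self.constraints.data_residency:
            problems.append("constraints.data_residency")
        if not self.candidates:
            problems.append("candidates")
        return problems
```

A run would only proceed when `missing()` returns an empty list, which puts the constraint set in front of the evaluator before any model is touched.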
Stage 1: Classify the Workload
Every run starts by sorting the workload into a category, because the category shortcuts most of the decision.
The categories
- Sensitive-and-regulated: data can't leave your perimeter. This often forces open self-hosting regardless of cost.
- High-volume-low-stakes: classification, extraction, routing. Strong open-model candidate.
- Frontier-hard: complex reasoning, long context, agentic chains. Closed models usually still lead.
- Exploratory: you don't know if the feature works yet. Default to a closed API first for speed.
Most workloads fall cleanly into one bucket. The ones that straddle two are exactly where the rest of the workflow earns its keep, because you'll need real evidence rather than a category default.
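The four categories can be sketched as a small classifier that reports every bucket a workload matches, so a straddling workload is flagged rather than silently forced into one. The boolean questions used here are assumed stand-ins for whatever intake questions your team actually asks.

```python
from enum import Enum


class Category(Enum):
    SENSITIVE_REGULATED = "sensitive-and-regulated"
    HIGH_VOLUME_LOW_STAKES = "high-volume-low-stakes"
    FRONTIER_HARD = "frontier-hard"
    EXPLORATORY = "exploratory"


def matching_categories(
    *,
    data_must_stay_in_perimeter: bool,
    high_volume_low_stakes: bool,
    needs_frontier_capability: bool,
    feature_unproven: bool,
) -> list[Category]:
    """Return every category whose defining condition holds for the workload."""
    flags = {
        Category.SENSITIVE_REGULATED: data_must_stay_in_perimeter,
        Category.HIGH_VOLUME_LOW_STAKES: high_volume_low_stakes,
        Category.FRONTIER_HARD: needs_frontier_capability,
        Category.EXPLORATORY: feature_unproven,
    }
    return [category for category, applies in flags.items() if applies]


def category_default_suffices(categories: list[Category]) -> bool:
    """True only when the workload sits cleanly in exactly one bucket."""
    return len(categories) == 1
```

When `category_default_suffices` returns False, the workload either straddles buckets or fits none, and the later stages have to supply the evidence.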
Stage 2: Define the Eval Set
This is the stage teams skip and the one that determines whether the whole workflow is trustworthy. Before you compare any models, you build a fixed evaluation set from real inputs.
What goes in the eval set
- 50 to 200 real or realistic inputs for the workload, including edge cases.
- A defined notion of an acceptable output: exact match, rubric score, or human judgment.
- A way to run the set automatically against any model behind your abstraction layer.
The eval set is a reusable asset. Build it once per workload and reuse it for every future model that comes along. This is what makes the workflow repeatable instead of a fresh research project each time. The companion piece, A Step-by-Step Approach to Open vs Closed Source AI Models, walks through assembling a first eval set in practice.
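A minimal sketch of such a harness, assuming the simplest acceptance rule (exact match) and treating a model as any function from prompt to text. `EvalCase`, `run_eval`, and the toy `keyword_router` model are hypothetical names for illustration; a rubric or human-judgment scorer would plug in through the `is_acceptable` parameter.

```python
from dataclasses import dataclass
from typing import Callable

# Any model behind the abstraction layer, reduced to prompt in, text out.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Acceptance rule: equal after trimming whitespace and ignoring case."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    model: ModelFn,
    cases: list[EvalCase],
    is_acceptable: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return the fraction of cases whose output is acceptable."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(is_acceptable(model(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


def keyword_router(prompt: str) -> str:
    """Toy stand-in model, used only to show the harness running."""
    return "billing" if "invoice" in prompt.lower() else "support"


CASES = [
    EvalCase("Where is my invoice?", "billing"),
    EvalCase("App crashes on login", "support"),
    EvalCase("Invoice total is wrong", "billing"),
    EvalCase("Refund my last payment", "billing"),  # edge case the toy model misses
]

score = run_eval(keyword_router, CASES)  # 3 of 4 cases pass
```

Because the harness only sees a `ModelFn`, the same `CASES` list can be replayed against any future candidate without modification.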
Stage 3: Run the Comparison
With a workload classified and an eval set built, the comparison is now mechanical, which is the whole point.
The comparison matrix
For each candidate model, record:
- Quality score on the eval set.
- Cost per task at projected volume.
- Latency at the percentiles you care about, not just the average.
- Operational burden: managed API versus GPUs you run.
Fill the same matrix for every run. Standardizing the columns is what lets you compare a decision made this quarter against one made last quarter without re-explaining anything.
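As a sketch of what one standardized row might look like, the code below fixes the four columns in a dataclass and derives the latency percentiles from raw samples. The nearest-rank percentile method, the choice of p50 and p95, and all names are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


@dataclass(frozen=True)
class MatrixRow:
    model: str
    quality: float            # score on the eval set, 0.0 to 1.0
    cost_per_task_usd: float  # at projected volume
    p50_latency_ms: float
    p95_latency_ms: float
    operational_burden: str   # e.g. "managed API" or "self-hosted GPUs"


def build_row(
    model: str,
    quality: float,
    cost_per_task_usd: float,
    latencies_ms: list[float],
    operational_burden: str,
) -> MatrixRow:
    """Fill one row of the comparison matrix from raw measurements."""
    return MatrixRow(
        model=model,
        quality=quality,
        cost_per_task_usd=cost_per_task_usd,
        p50_latency_ms=percentile(latencies_ms, 50),
        p95_latency_ms=percentile(latencies_ms, 95),
        operational_burden=operational_burden,
    )
```

With ten samples of which one is a 900 ms outlier and the rest fall between 100 and 140 ms, the mean lands near 200 ms while p50 stays at 120 ms and p95 reports the 900 ms tail, which is why the matrix records percentiles rather than an average.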
Stage 4: Decide and Document
Now you apply the decision rule and, critically, write down why.
The decision artifact
Produce a short, standard document for every evaluation containing:
- The chosen model and the runner-up.
- The deciding factor: was it cost, quality, compliance, or operational load?
- The conditions that would reverse the decision.
That last line is the secret weapon. "We chose closed; revisit if daily volume exceeds X or if open quality on this eval set reaches Y" turns a static decision into a tripwire. Future-you doesn't have to remember the reasoning; the artifact carries it. This discipline is what separates a workflow from a guess, and it's the antidote to several traps in 7 Common Mistakes with Open vs Closed Source AI Models.
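The tripwire idea can be made literal by storing the reversal conditions as data next to the decision. This is a sketch under assumed field names; the two thresholds play the roles of X and Y from the example sentence above, and real artifacts may carry other conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionArtifact:
    chosen: str
    runner_up: str
    deciding_factor: str  # "cost", "quality", "compliance", or "operational load"
    revisit_if_daily_volume_over: int             # the "X" threshold
    revisit_if_runner_up_quality_at_least: float  # the "Y" threshold

    def tripped(self, daily_volume: int, runner_up_quality: float) -> list[str]:
        """Return the reversal conditions that current numbers have met."""
        reasons = []
        if daily_volume > self.revisit_if_daily_volume_over:
            reasons.append("daily volume exceeded threshold")
        if runner_up_quality >= self.revisit_if_runner_up_quality_at_least:
            reasons.append("runner-up quality reached threshold")
        return reasons


artifact = DecisionArtifact(
    chosen="closed-model-b",       # hypothetical model names
    runner_up="open-model-a",
    deciding_factor="quality",
    revisit_if_daily_volume_over=100_000,
    revisit_if_runner_up_quality_at_least=0.90,
)
```

Calling `artifact.tripped(...)` with fresh numbers from a dashboard or a scheduled eval run tells you whether the decision still stands, without anyone having to recall why it was made.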
Stage 5: Implement Behind the Abstraction
Whatever you chose, it goes in behind the same internal model interface every workload uses. No raw vendor SDK calls scattered through application code. This is non-negotiable, because it's what makes the next stage possible and keeps switching cheap.
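A minimal sketch of what such an interface could look like, using a structural `Protocol` so that any backend with a matching `complete` method qualifies. The two backend classes are stand-ins that return tagged strings rather than calling a real service, and every name here is hypothetical.

```python
from typing import Protocol


class TextModel(Protocol):
    """The one interface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class HostedApiModel:
    """Stand-in for a closed model; a real version would wrap the vendor SDK here."""

    def complete(self, prompt: str) -> str:
        return f"[hosted] {prompt}"


class SelfHostedModel:
    """Stand-in for an open model served on your own hardware."""

    def complete(self, prompt: str) -> str:
        return f"[self-hosted] {prompt}"


# Switching a workload's backend is a change to this mapping only.
MODEL_FOR_WORKLOAD: dict[str, TextModel] = {
    "ticket-summary": HostedApiModel(),
    "ticket-routing": SelfHostedModel(),
}


def run_workload(workload: str, prompt: str) -> str:
    """Application code calls this and never imports a vendor SDK directly."""
    return MODEL_FOR_WORKLOAD[workload].complete(prompt)
```

The design choice is that vendor-specific code lives in exactly one class per backend, so reversing a decision touches the mapping and nothing else.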
Implementation checklist
- All calls route through the abstraction layer.
- The eval set runs in CI against the deployed model.
- Cost and latency dashboards exist for the workload.
Stage 6: Schedule the Re-Run
The workflow isn't done when you ship; it loops. You schedule a re-evaluation trigger so the decision doesn't silently rot.
Triggers that fire a re-run
- A major new model release in either camp.
- A meaningful price change from a provider.
- Crossing a volume threshold you wrote into the decision artifact.
- A fixed cadence; quarterly is sane for most teams.
When a trigger fires, you don't start over. You re-run Stages 3 and 4 with the existing eval set and the standard matrix. A re-evaluation that used to take weeks now takes an afternoon. That speed is the entire return on building the workflow.
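The four triggers can be checked mechanically, for example from a scheduled job. The sketch below assumes the release and price signals are set by a person or another system, and it approximates the quarterly cadence as 90 days; all names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Signals:
    major_release_since_last_run: bool
    price_change_since_last_run: bool
    daily_volume: int
    last_run: date


def fired_triggers(
    signals: Signals,
    volume_threshold: int,
    today: date,
    cadence_days: int = 90,  # assumed approximation of "quarterly"
) -> list[str]:
    """Return the reasons, if any, to re-run Stages 3 and 4."""
    fired = []
    if signals.major_release_since_last_run:
        fired.append("major model release")
    if signals.price_change_since_last_run:
        fired.append("provider price change")
    if signals.daily_volume > volume_threshold:
        fired.append("volume threshold crossed")
    if (today - signals.last_run).days >= cadence_days:
        fired.append("scheduled cadence reached")
    return fired
```

An empty list means the existing decision stands; any entry means the existing eval set and matrix get replayed against the current candidates.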
Making It Hand-Off-Able
The final test of a workflow is whether someone new can run it from the documents alone. Store the eval sets, the comparison matrices, and the decision artifacts in a shared, versioned location. Write a one-page runbook that points to each stage's template. When the engineer who built it leaves, the workflow stays, and that's the difference between a process and a person.
Frequently Asked Questions
How long does the first full run take?
The first run is slow, often a couple of weeks, mostly because you're building the eval set and the abstraction layer from nothing. Every subsequent run on the same workload is dramatically faster because those assets are reusable. The upfront cost is the investment that makes the workflow worth having.
What's the single most-skipped stage?
Defining the eval set. Teams jump straight to running models against vibes and anecdotes. Without a fixed eval set you can't compare models honestly, you can't re-run the decision later, and you can't hand it off. It's the load-bearing stage.
Can this workflow be partly automated?
Yes. Once the eval set and abstraction layer exist, running the comparison matrix against a new model can be largely scripted. The classification and the final decision still need human judgment, but the mechanical comparison should be push-button by your third or fourth run.
How detailed should the decision artifact be?
One page. The goal is to capture the chosen model, the deciding factor, and the conditions that would reverse it. Longer documents don't get read. The reversal conditions are the part you must not omit.
Where should all these artifacts live?
In a shared, versioned repository alongside your code or in a wiki linked from it, never in someone's local notes or a private doc. The whole value is hand-off and repeatability, which dies the moment the artifacts are personal rather than shared.
Key Takeaways
- A repeatable workflow turns open vs closed from recurring research into an afternoon's re-run.
- The six stages are: classify the workload, define the eval set, run the comparison, decide and document, implement behind the abstraction, and schedule the re-run.
- The eval set and the abstraction layer are reusable assets; building them is the real investment.
- Every decision produces a one-page artifact that names the deciding factor and the conditions that would reverse it.
- Store all artifacts in a shared, versioned place so the workflow survives the person who built it.