AGENCYSCRIPT

Five Evaluations In, Still Starting From Scratch

Agency Script Editorial

Editorial Team

November 12, 2025 · 8 min read

A one-off decision and a repeatable workflow are not the same thing, and the gap between them is where teams quietly lose months. The first time you evaluate open vs closed for a feature, it feels like deep architectural work. The fifth time, if you haven't written anything down, it's still deep architectural work, because every evaluation starts from scratch and lives in whoever happened to run it.

The fix is to make the evaluation itself a documented, repeatable process with defined inputs, stages, artifacts, and a hand-off point. When a new model launches or a new feature needs an AI backend, you run the workflow instead of re-litigating the philosophy. This article lays out that workflow stage by stage. For the underlying decision logic, the companion piece A Framework for Open vs Closed Source AI Models is what you'll lean on inside several of these stages.

Why a Workflow Beats a Decision

The argument for process is simple: consistency and hand-off. A workflow makes every evaluation produce the same artifacts in the same format, so two engineers reach comparable conclusions and a third can pick up the work without a meeting. It also forces you to separate the parts that change (the specific models and prices) from the parts that don't (the steps and the criteria).

The inputs every run needs

  • The workload definition: what task, what volume, what latency tolerance.
  • The constraint set: data residency, compliance, budget ceiling.
  • The current candidate models, open and closed, worth comparing this quarter.

Lock these three down before any evaluation starts. Evaluations most often fail when someone skips the constraint set and discovers a regulatory blocker after building.
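One way to enforce the three inputs is to refuse to start a run until all of them exist. The sketch below is illustrative only: the names Workload, Constraints, and missing_inputs, and every field on them, are assumptions made for this example rather than part of any established tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Workload:
    task: str              # what task
    daily_volume: int      # what volume
    max_latency_ms: int    # what latency tolerance


@dataclass(frozen=True)
class Constraints:
    data_residency: Optional[str]   # hypothetical value such as "eu-only"; None if unrestricted
    compliance: Tuple[str, ...]     # regimes that apply to this workload
    monthly_budget_usd: float       # budget ceiling


def missing_inputs(
    workload: Optional[Workload],
    constraints: Optional[Constraints],
    candidates: Sequence[str],
) -> List[str]:
    """Return the names of any inputs that are still missing."""
    problems = []
    if workload is None:
        problems.append("workload definition")
    if constraints is None:
        problems.append("constraint set")
    if not candidates:
        problems.append("candidate models")
    return problems


# A run with no constraint set should be blocked before any model is tested.
blockers = missing_inputs(
    Workload(task="ticket routing", daily_volume=50_000, max_latency_ms=800),
    None,
    ["open-model-a", "closed-model-b"],
)
```

An empty list means the run can start; anything else names what to go and collect first.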

Stage 1: Classify the Workload

Every run starts by sorting the workload into a category, because the category shortcuts most of the decision.

The categories

  • Sensitive-and-regulated: data can't leave your perimeter. This often forces open self-hosting regardless of cost.
  • High-volume-low-stakes: classification, extraction, routing. Strong open-model candidate.
  • Frontier-hard: complex reasoning, long context, agentic chains. Closed models usually still lead.
  • Exploratory: you don't know if the feature works yet. Default to closed first for speed.

Most workloads fall cleanly into one bucket. The ones that straddle two are exactly where the rest of the workflow earns its keep, because you'll need real evidence rather than a category default.
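The category defaults can be written down as a small lookup, with straddling workloads routed to the full comparison instead of a default. This is a sketch under stated assumptions: the four yes/no questions and the function names are hypothetical simplifications of the categories above, not a definitive classifier.

```python
from enum import Enum
from typing import List


class Category(Enum):
    SENSITIVE_REGULATED = "sensitive-and-regulated"
    HIGH_VOLUME_LOW_STAKES = "high-volume-low-stakes"
    FRONTIER_HARD = "frontier-hard"
    EXPLORATORY = "exploratory"


# Starting points taken from the category descriptions above.
DEFAULTS = {
    Category.SENSITIVE_REGULATED: "open, self-hosted",
    Category.HIGH_VOLUME_LOW_STAKES: "open candidate",
    Category.FRONTIER_HARD: "closed",
    Category.EXPLORATORY: "closed first",
}


def classify(
    data_must_stay_in_perimeter: bool,
    high_volume_low_stakes: bool,
    needs_frontier_reasoning: bool,
    feature_unproven: bool,
) -> List[Category]:
    """Return every category the workload matches (hypothetical yes/no questions)."""
    answers = [
        (data_must_stay_in_perimeter, Category.SENSITIVE_REGULATED),
        (high_volume_low_stakes, Category.HIGH_VOLUME_LOW_STAKES),
        (needs_frontier_reasoning, Category.FRONTIER_HARD),
        (feature_unproven, Category.EXPLORATORY),
    ]
    return [category for matched, category in answers if matched]


def starting_point(matches: List[Category]) -> str:
    """Use the category default only when exactly one category matches."""
    if len(matches) == 1:
        return DEFAULTS[matches[0]]
    return "needs evidence: run the full comparison"


clean = starting_point(classify(True, False, False, False))
straddling = starting_point(classify(False, True, True, False))
```

A workload that matches zero or several categories gets no default, which mirrors the point that straddling cases need real evidence.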

Stage 2: Define the Eval Set

This is the stage teams skip and the one that determines whether the whole workflow is trustworthy. Before you compare any models, you build a fixed evaluation set from real inputs.

What goes in the eval set

  • 50 to 200 real or realistic inputs for the workload, including edge cases.
  • A defined notion of an acceptable output: exact match, rubric score, or human judgment.
  • A way to run the set automatically against any model behind your abstraction layer.

The eval set is a reusable asset. Build it once per workload and you reuse it for every future model that comes along. This is what makes the workflow repeatable instead of a fresh research project each time. The companion piece A Step-by-Step Approach to Open vs Closed Source AI Models walks through assembling a first eval set in practice.
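A minimal eval harness might look like the sketch below. It assumes the simplest notion of acceptable output, exact match, and treats a model as any function from prompt to text. EvalCase, run_eval, and the toy keyword_router are hypothetical names invented for illustration; a real eval set would hold 50 to 200 cases rather than four.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Simplest acceptance rule: case-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    model: Callable[[str], str],
    cases: List[EvalCase],
    is_acceptable: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return the fraction of cases whose output is acceptable."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if is_acceptable(model(case.prompt), case.expected))
    return passed / len(cases)


def keyword_router(prompt: str) -> str:
    """Toy stand-in for a model: routes on a single keyword."""
    return "billing" if "invoice" in prompt.lower() else "support"


CASES = [
    EvalCase("Where is my invoice?", "billing"),
    EvalCase("App crashes on login", "support"),
    EvalCase("I was charged twice", "billing"),  # edge case: no keyword present
    EvalCase("", "support"),                     # edge case: empty input
]

score = run_eval(keyword_router, CASES)
```

Because run_eval accepts any callable, the same case list can be pointed at each new candidate without modification, and a rubric or human-judgment scorer can replace exact_match through the is_acceptable parameter.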

Stage 3: Run the Comparison

With a workload classified and an eval set built, the comparison is now mechanical, which is the whole point.

The comparison matrix

For each candidate model, record:

  • Quality score on the eval set.
  • Cost per task at projected volume.
  • Latency at the percentiles you care about, not just the average.
  • Operational burden: managed API versus GPUs you run.

Fill the same matrix for every run. Standardizing the columns is what lets you compare a decision made this quarter against one made last quarter without re-explaining anything.
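The matrix row can be fixed in code so every run records the same columns. The sketch below is one possible shape: MatrixRow, build_row, and the choice of p50 and p95 are assumptions for illustration, and the percentile helper uses the nearest-rank method, which is only one of several common definitions.

```python
import math
from dataclasses import dataclass
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


@dataclass(frozen=True)
class MatrixRow:
    model: str
    quality: float              # score on the eval set, 0.0 to 1.0
    cost_per_task_usd: float    # at projected volume
    p50_latency_ms: float
    p95_latency_ms: float
    operational_burden: str     # e.g. "managed API" or "self-hosted GPUs"


def build_row(
    model: str,
    quality: float,
    total_cost_usd: float,
    tasks: int,
    latencies_ms: List[float],
    operational_burden: str,
) -> MatrixRow:
    """Turn raw measurements from one run into a standard matrix row."""
    return MatrixRow(
        model=model,
        quality=quality,
        cost_per_task_usd=total_cost_usd / tasks,
        p50_latency_ms=percentile(latencies_ms, 50),
        p95_latency_ms=percentile(latencies_ms, 95),
        operational_burden=operational_burden,
    )


# Illustrative numbers only; one slow outlier barely moves the median but dominates p95.
row = build_row(
    model="closed-model-b",
    quality=0.75,
    total_cost_usd=2.0,
    tasks=10,
    latencies_ms=[100, 120, 110, 130, 900, 105, 115, 125, 140, 135],
    operational_burden="managed API",
)
```

The sample latencies show why the average alone misleads: the median is 120 ms while the p95 is 900 ms.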

Stage 4: Decide and Document

Now you apply the decision rule and, critically, write down why.

The decision artifact

Produce a short, standard document for every evaluation containing:

  • The chosen model and the runner-up.
  • The deciding factor: was it cost, quality, compliance, or operational load?
  • The conditions that would reverse the decision.

That last line is the secret weapon. "We chose closed; revisit if daily volume exceeds X or if open quality on this eval set reaches Y" turns a static decision into a tripwire. Future-you doesn't have to remember the reasoning; the artifact carries it. This discipline is what separates a workflow from a guess, and it's the antidote to several traps in 7 Common Mistakes with Open vs Closed Source AI Models.
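The reversal conditions can be stored as data so a script, not a person's memory, checks them. The following sketch assumes just two tripwires, a volume threshold and a runner-up quality threshold, matching the example sentence above. DecisionArtifact, tripped, and all values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DecisionArtifact:
    workload: str
    chosen: str
    runner_up: str
    deciding_factor: str                          # cost, quality, compliance, or operational load
    revisit_if_daily_volume_exceeds: int          # the "X" in the decision sentence
    revisit_if_runner_up_quality_reaches: float   # the "Y", measured on the same eval set


def tripped(artifact: DecisionArtifact, daily_volume: int, runner_up_quality: float) -> List[str]:
    """Return the reversal conditions that currently hold."""
    reasons = []
    if daily_volume > artifact.revisit_if_daily_volume_exceeds:
        reasons.append("daily volume exceeded threshold")
    if runner_up_quality >= artifact.revisit_if_runner_up_quality_reaches:
        reasons.append("runner-up quality reached threshold")
    return reasons


# Illustrative values only.
artifact = DecisionArtifact(
    workload="ticket-routing",
    chosen="closed-model-b",
    runner_up="open-model-a",
    deciding_factor="quality",
    revisit_if_daily_volume_exceeds=100_000,
    revisit_if_runner_up_quality_reaches=0.90,
)

quiet = tripped(artifact, daily_volume=40_000, runner_up_quality=0.82)
fired = tripped(artifact, daily_volume=120_000, runner_up_quality=0.91)
```

An empty result means the decision stands; a non-empty one names exactly which written condition triggered the revisit.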

Stage 5: Implement Behind the Abstraction

Whatever you chose, it goes in behind the same internal model interface every workload uses. No raw vendor SDK calls scattered through application code. This is non-negotiable, because it's what makes the next stage possible and keeps switching cheap.

Implementation checklist

  • All calls route through the abstraction layer.
  • The eval set runs in CI against the deployed model.
  • Cost and latency dashboards exist for the workload.
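An internal model interface can be very small. The sketch below shows one possible shape, assuming a single complete method and a per-workload routing table. TextModel, StubModel, and ModelGateway are invented names, and the stub backends stand in for real vendor or self-hosted clients, which are deliberately not shown.

```python
from typing import Dict, Protocol


class TextModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str:
        ...


class StubModel:
    """Stand-in backend; a real one would wrap a vendor SDK or a self-hosted server."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


class ModelGateway:
    """Routes each workload to whichever backend the latest decision selected."""

    def __init__(self) -> None:
        self._backends: Dict[str, TextModel] = {}
        self._routes: Dict[str, str] = {}

    def register(self, name: str, backend: TextModel) -> None:
        self._backends[name] = backend

    def route(self, workload: str, backend_name: str) -> None:
        if backend_name not in self._backends:
            raise KeyError(f"unknown backend: {backend_name}")
        self._routes[workload] = backend_name

    def complete(self, workload: str, prompt: str) -> str:
        return self._backends[self._routes[workload]].complete(prompt)


gateway = ModelGateway()
gateway.register("closed-api", StubModel("closed-api"))
gateway.register("open-selfhosted", StubModel("open-selfhosted"))

gateway.route("ticket-routing", "closed-api")
first = gateway.complete("ticket-routing", "Where is my invoice?")

# Switching models is a one-line routing change; the calling code is untouched.
gateway.route("ticket-routing", "open-selfhosted")
second = gateway.complete("ticket-routing", "Where is my invoice?")
```

The same complete method is what an eval harness in CI would call, so the deployed model and the evaluated model are guaranteed to be reached through one path.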

Stage 6: Schedule the Re-Run

The workflow isn't done when you ship; it loops. You schedule a re-evaluation trigger so the decision doesn't silently rot.

Triggers that fire a re-run

  • A major new model release in either camp.
  • A meaningful price change from a provider.
  • Crossing a volume threshold you wrote into the decision artifact.
  • A fixed cadence; quarterly is sane for most teams.

When a trigger fires, you don't start over. You re-run Stages 3 and 4 with the existing eval set and the standard matrix. A re-evaluation that used to take weeks now takes an afternoon. That speed is the entire return on building the workflow.
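The four triggers can be checked by a scheduled job. This is a sketch under stated assumptions: the Signals fields are hypothetical inputs that a team would populate from its own release tracking and dashboards, and the 90-day default is a stand-in for a quarterly cadence.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Signals:
    new_major_release: bool   # in either camp, open or closed
    price_changed: bool       # a meaningful change from a provider
    daily_volume: int
    volume_threshold: int     # copied from the decision artifact
    last_evaluated: date


def fired_triggers(signals: Signals, today: date, cadence_days: int = 90) -> List[str]:
    """Return the names of every re-run trigger that currently applies."""
    fired = []
    if signals.new_major_release:
        fired.append("new model release")
    if signals.price_changed:
        fired.append("price change")
    if signals.daily_volume > signals.volume_threshold:
        fired.append("volume threshold crossed")
    if (today - signals.last_evaluated).days >= cadence_days:
        fired.append("fixed cadence elapsed")
    return fired


# Illustrative values only.
signals = Signals(
    new_major_release=False,
    price_changed=True,
    daily_volume=40_000,
    volume_threshold=100_000,
    last_evaluated=date(2025, 1, 1),
)

due = fired_triggers(signals, today=date(2025, 4, 15))
not_due = fired_triggers(
    Signals(False, False, 40_000, 100_000, date(2025, 4, 1)),
    today=date(2025, 4, 15),
)
```

Any non-empty result is the cue to re-run Stages 3 and 4 against the existing eval set.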

Making It Hand-Off-Able

The final test of a workflow is whether someone new can run it from the documents alone. Store the eval sets, the comparison matrices, and the decision artifacts in a shared, versioned location. Write a one-page runbook that points to each stage's template. When the engineer who built it leaves, the workflow stays, and that's the difference between a process and a person.

Frequently Asked Questions

How long does the first full run take?

The first run is slow, often a couple of weeks, mostly because you're building the eval set and the abstraction layer from nothing. Every subsequent run on the same workload is dramatically faster because those assets are reusable. The upfront cost is the investment that makes the workflow worth having.

What's the single most-skipped stage?

Defining the eval set. Teams jump straight to running models against vibes and anecdotes. Without a fixed eval set you can't compare models honestly, you can't re-run the decision later, and you can't hand it off. It's the load-bearing stage.

Can this workflow be partly automated?

Yes. Once the eval set and abstraction layer exist, running the comparison matrix against a new model can be largely scripted. The classification and the final decision still need human judgment, but the mechanical comparison should be push-button by your third or fourth run.

How detailed should the decision artifact be?

One page. The goal is to capture the chosen model, the deciding factor, and the conditions that would reverse it. Longer documents don't get read. The reversal conditions are the part you must not omit.

Where should all these artifacts live?

In a shared, versioned repository alongside your code or in a wiki linked from it, never in someone's local notes or a private doc. The whole value is hand-off and repeatability, which dies the moment the artifacts are personal rather than shared.

Key Takeaways

  • A repeatable workflow turns open vs closed from recurring research into an afternoon's re-run.
  • The six stages are: classify the workload, build an eval set, run the comparison, decide and document, implement behind an abstraction, and schedule the re-run.
  • The eval set and the abstraction layer are reusable assets; building them is the real investment.
  • Every decision produces a one-page artifact that names the deciding factor and the conditions that would reverse it.
  • Store all artifacts in a shared, versioned place so the workflow survives the person who built it.
