Most teams turn the open-versus-closed decision into a religious debate that burns weeks and resolves nothing. The fix is to treat it as a procedure, not an opinion. This article gives you an ordered process you can run today, from defining the workload to making a final call, with a clear output at each step.
Work through the steps in order. Do not skip ahead to model selection before you have characterized the workload, because the workload is what actually decides the answer. By the end you will have a written rationale you can defend to a skeptical stakeholder.
Step 1: Characterize the Workload First
Before you compare a single model, write down what you are actually building. The decision flips entirely based on these properties, so get them on paper.
Capture These Numbers
- Volume: Expected tokens or requests per day, and how spiky it is.
- Latency: Acceptable response time, including the worst case.
- Data sensitivity: Does data fall under HIPAA, GDPR residency, or contractual restrictions?
- Task difficulty: Is this frontier-level reasoning or routine summarization and extraction?
If you cannot fill these in yet, that is your real first task. Guessing here invalidates everything downstream.
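One way to get these properties on paper is a small record that makes the gaps obvious. The sketch below is only an illustration: the `WorkloadProfile` class, its field names, and the sample values are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class WorkloadProfile:
    """Hypothetical record of the Step 1 numbers; None means 'not measured yet'."""
    requests_per_day: Optional[int] = None
    peak_to_average_ratio: Optional[float] = None   # how spiky the traffic is
    typical_latency_budget_s: Optional[float] = None
    worst_case_latency_budget_s: Optional[float] = None
    data_sensitivity: Optional[str] = None          # e.g. "HIPAA", "GDPR residency", "none"
    task_difficulty: Optional[str] = None           # e.g. "frontier reasoning", "routine extraction"

    def missing(self) -> List[str]:
        """Return the names of properties that are still unknown."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Illustrative values only.
profile = WorkloadProfile(requests_per_day=50_000, data_sensitivity="GDPR residency")
print(profile.missing())
```

Anything `missing()` still lists is work to finish before you move on to Step 2.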
Step 2: Set Hard Constraints
Some requirements are non-negotiable and instantly eliminate options. Identify them now so you do not waste time evaluating models that can never qualify.
The most common hard constraint is data residency. If a contract states that customer data must physically remain in your environment, a standard closed API that sends requests to the vendor's servers is disqualified regardless of how good it is. Conversely, if you have no infrastructure team and a hard launch date next week, self-hosting an open model is disqualified. Write your hard constraints down and treat the survivors as your candidate pool.
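The constraint screen amounts to a filter over your options. Here is a minimal sketch of the two examples above, assuming a hypothetical `Candidate` record with two yes/no properties; your real constraints will likely need more fields than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical description of one deployment option."""
    name: str
    data_leaves_environment: bool   # True if requests go to a vendor's servers
    needs_infra_team: bool          # True if you must run the serving stack yourself


def surviving_candidates(candidates: List[Candidate],
                         data_must_stay_in_house: bool,
                         has_infra_team: bool) -> List[Candidate]:
    """Drop every option that violates a hard constraint."""
    survivors = []
    for c in candidates:
        if data_must_stay_in_house and c.data_leaves_environment:
            continue  # residency requirement rules this option out
        if c.needs_infra_team and not has_infra_team:
            continue  # nobody available to operate it
        survivors.append(c)
    return survivors


candidates = [
    Candidate("closed-api", data_leaves_environment=True, needs_infra_team=False),
    Candidate("self-hosted-open", data_leaves_environment=False, needs_infra_team=True),
]
pool = surviving_candidates(candidates, data_must_stay_in_house=True, has_infra_team=True)
print([c.name for c in pool])
```

If the pool comes back empty, one of your constraints has to change before any model comparison is worth doing.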
Step 3: Estimate Cost Both Ways
Now model the economics for your specific volume from Step 1. Do this for two scenarios: closed API pricing, and self-hosted open-weight on rented GPUs.
What to Include
- Closed path: Per-token price times your projected monthly volume.
- Open path: GPU rental cost, plus a realistic estimate of engineering hours to deploy and maintain, plus observability tooling.
Do not stop at the GPU bill. The hidden cost of open self-hosting is senior engineering time. A cheap-looking GPU setup that needs two engineers babysitting it is not cheap. Our common mistakes guide explains why this estimate is where teams most often fool themselves.
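The two estimates are simple arithmetic, and writing them out keeps the engineering-time term from quietly disappearing. In the sketch below, every figure (token volume, per-token price, GPU rate, hours, hourly rates) is a made-up placeholder rather than a real quote; substitute your own numbers from Step 1 and from vendor pricing pages.

```python
def closed_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Closed path: per-token price times projected monthly volume."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def open_monthly_cost(gpu_hourly_rate: float, gpu_count: int, hours_per_month: float,
                      eng_hours_per_month: float, eng_hourly_rate: float,
                      observability_monthly: float) -> float:
    """Open path: GPU rental plus engineering time plus observability tooling."""
    gpu = gpu_hourly_rate * gpu_count * hours_per_month
    engineering = eng_hours_per_month * eng_hourly_rate
    return gpu + engineering + observability_monthly


# Placeholder inputs, not real prices.
closed = closed_monthly_cost(tokens_per_month=2_000_000_000, price_per_million_tokens=5.0)
open_ = open_monthly_cost(gpu_hourly_rate=2.5, gpu_count=2, hours_per_month=730,
                          eng_hours_per_month=60, eng_hourly_rate=120.0,
                          observability_monthly=300.0)
print(f"closed: {closed:.0f}/month, open: {open_:.0f}/month")
```

With these placeholder inputs the GPU bill alone looks far cheaper than the API, but the engineering hours push the open path above it. That is the pattern to check for in your own numbers.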
Step 4: Build a Representative Evaluation Set
You cannot pick a model on vibes or public benchmarks. Assemble 30 to 100 real examples from your actual use case, each with a known good answer or a clear quality rubric. This eval set is the single most valuable artifact you will produce.
Public benchmarks tell you how a model does on someone else's test, not yours. A model that tops a leaderboard can still fail your specific extraction format or tone requirements. Your eval set catches that before it reaches users.
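An eval set can be as plain as a list of prompts, each paired with a pass/fail check. The sketch below assumes a hypothetical `EvalExample` record and a stand-in `stub_model` function; in practice the model would be a call to a closed API or to your own serving endpoint, and the checks would encode your actual rubric.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalExample:
    """One real example from your use case plus a rubric check."""
    prompt: str
    passes: Callable[[str], bool]   # returns True if the output meets the rubric
    tag: str = "typical"            # e.g. "typical", "hard", "edge"


def pass_rate(model: Callable[[str], str], examples: List[EvalExample]) -> float:
    """Fraction of examples whose output passes its own check."""
    passed = sum(1 for ex in examples if ex.passes(model(ex.prompt)))
    return passed / len(examples)


# Two illustrative extraction examples; a real set would hold 30 to 100.
examples = [
    EvalExample("Extract the total: 'Total due: $42.00'",
                lambda out: out.strip() == "42.00"),
    EvalExample("Extract the total: 'Amount: 17,50 EUR'",
                lambda out: out.strip() == "17.50", tag="edge"),
]


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same answer.
    return "42.00"


print(pass_rate(stub_model, examples))
```

Tagging examples as typical, hard, or edge makes it easy to check later whether quality holds outside the easy middle of your distribution.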
Step 5: Run a Bake-Off
Take your two or three surviving candidates and run them against your eval set under realistic conditions. Include at least one closed model and one open model so you have a true comparison.
Score on More Than Accuracy
- Quality: How often does the output meet your rubric?
- Latency: Measured at your expected concurrency, not in isolation.
- Cost per successful task: Not cost per token; cost per task that actually passes.
- Consistency: Does quality hold across edge cases, or only on easy examples?
Cost per successful task is the metric that exposes false economies. A model that is cheaper per token but fails far more often can end up costing more for every result you can actually use.
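To make the metric concrete, divide what you spent on a candidate by the number of tasks that passed your rubric. The figures below are hypothetical bake-off results, chosen only to show how the ranking can flip.

```python
def cost_per_successful_task(total_cost: float, tasks_passed: int) -> float:
    """Total spend divided by the number of tasks that met the rubric."""
    if tasks_passed <= 0:
        raise ValueError("no passing tasks, so cost per success is undefined")
    return total_cost / tasks_passed


# Hypothetical results over the same 100 eval examples.
model_a = {"total_cost": 2.00, "passed": 70}   # pricier per attempt, 30 failures
model_b = {"total_cost": 1.30, "passed": 40}   # cheaper per attempt, 60 failures

cost_a = cost_per_successful_task(model_a["total_cost"], model_a["passed"])
cost_b = cost_per_successful_task(model_b["total_cost"], model_b["passed"])
print(f"A: {cost_a:.4f} per pass, B: {cost_b:.4f} per pass")
```

In this made-up case model B has the smaller total bill, yet each passing result from it costs more than one from model A.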
Step 6: Pilot the Winner in Production Conditions
Do not roll out to everyone. Run the winning model on a slice of real traffic with monitoring in place. Watch for the failure modes that only appear at scale: latency spikes under load, quality drift on inputs your eval set missed, and operational pain like GPU availability for the open path.
This pilot is where the open path's true operational burden becomes visible. If your team is drowning in inference firefighting during the pilot, that is critical data, not a temporary nuisance.
What to Watch During the Pilot
- Latency under real concurrency, not the clean numbers from your isolated bake-off.
- Quality drift on inputs your eval set missed, which is how you discover the gaps in your test coverage.
- Operational load on your team, measured honestly in hours spent keeping the system healthy.
- Cost per successful task at real traffic, which sometimes differs from your estimate once retries and edge cases appear.
Run the pilot long enough to hit a realistic spread of inputs. A few hours of clean traffic tells you nothing; a week that includes your messy real-world distribution tells you everything.
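A rough sketch of how pilot logs might be summarized follows. It assumes you record a latency and a pass/fail flag per request; the sample values, the `worst_case_budget_s` threshold, and the nearest-rank percentile helper are illustrative choices, not the output of any particular monitoring tool.

```python
import math
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical pilot samples: latency in seconds and whether each task passed.
latencies_s = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 4.5, 1.1, 1.0]
passed = [True, True, True, False, True, True, True, False, True, True]
worst_case_budget_s = 3.0  # would come from your Step 1 latency requirement

p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
pass_rate = sum(passed) / len(passed)
print(f"p50={p50}s p95={p95}s pass_rate={pass_rate:.0%}")
if p95 > worst_case_budget_s:
    print("Tail latency exceeds the worst-case budget; investigate before rollout.")
```

Reporting the tail alongside the median matters here because a single slow request under load is invisible in the median but is exactly what the worst-case requirement is about.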
Step 7: Decide, Document, and Revisit
Make the call and write a one-page rationale: the workload properties, the constraints, the cost estimates, the bake-off scores, and the pilot findings. This document protects the decision from being relitigated every time someone reads a new headline.
Finally, set a calendar reminder to revisit. Model capability and pricing move fast. A decision that was right six months ago may be wrong today. For a reusable structure to run this whole process repeatedly, see our framework article, and for the full landscape of trade-offs, the complete guide.
How Long This Process Takes
Teams often assume this process means weeks of work, then stall before starting. In practice, the heavy lifting is concentrated in two steps and the rest is fast. Characterizing the workload (Step 1) and building the eval set (Step 4) take the most effort, usually a day or two combined, because they require gathering real data and real examples.
Once those exist, the constraint screen, cost modeling, and bake-off can each be done in a few hours. The pilot is calendar time rather than effort: you set it up once and let it run for a week. The whole process, from a cold start to a documented decision, is realistically a week of part-time work, and most of that is waiting on the pilot. The payoff is that you only build these artifacts once; every future model decision reuses the same workload profile and eval set, collapsing the work to an afternoon.
Frequently Asked Questions
Can I skip the bake-off and just trust benchmarks?
No. Benchmarks measure performance on generic tasks that rarely match yours. The bake-off against your own eval set is the step that prevents an expensive wrong choice, and it usually takes less than a day once your eval set exists.
How big should my evaluation set be?
For an initial decision, 30 to 100 representative examples is enough to surface meaningful differences. The examples matter more than the count; include your hard cases and edge cases, not just the easy middle of your distribution.
What if cost favors open but my team lacks infrastructure skills?
Then the honest cost of the open path includes hiring or training, which usually erases the apparent savings. Many teams in this position use managed open-model hosting as a middle ground, getting open-weight benefits without owning raw infrastructure.
How often should I revisit the decision?
Every three to six months, or whenever a major model release or pricing change lands. Re-running your existing eval set against new candidates is fast and keeps you from being locked into a stale choice.
Key Takeaways
- Characterize the workload before evaluating any model; volume, latency, data sensitivity, and difficulty drive the answer.
- Identify hard constraints early to eliminate disqualified options immediately.
- Estimate cost both ways and include engineering time, not just GPU or token bills.
- Decide with a bake-off against your own eval set rather than public benchmarks, and score it on cost per successful task.
- Pilot in real conditions, document the rationale, and schedule a revisit as models and prices change.