A one-time evaluation is useful. A reusable framework is worth far more, because models keep changing: a team with a repeatable process adapts in about an hour, while everyone else starts from scratch each time. This article introduces such a framework, the FIT Loop: a named structure you can apply to any model decision and re-run whenever the landscape shifts.
FIT stands for Filter, Investigate, and Test, after which you loop back. It is deliberately simple, because a framework you cannot remember is a framework you will not use. The value is not novelty; it is that giving the stages names and an explicit order makes the practice repeatable and teachable across a team.
We will define each stage, explain when to apply it, and show how the loop closes so that re-evaluation becomes routine rather than heroic. By the end you should be able to run the FIT Loop from memory.
Stage F: Filter With Leaderboards
The first stage uses public rankings for the one thing they do well: reducing a large field to a workable shortlist. You are not deciding here; you are eliminating.
How to filter well
- Discard models whose strongest benchmarks do not resemble your task.
- Cross-reference two or three independent leaderboards and keep models that rank consistently.
- Reduce to two or three candidates, no more.
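The filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `shortlist`, the `top_n` cutoff, and the idea of treating "ranks consistently" as "appears in the top N of every board" are all assumptions made for the example.

```python
from statistics import mean

def shortlist(leaderboards, relevant, top_n=10, max_candidates=3):
    """Reduce a field of models to a short list.

    leaderboards: dict of board name -> list of model names, best first.
    relevant: set of models whose strongest benchmarks resemble your task
              (that judgment is made by a person, not by this function).
    top_n and max_candidates are illustrative defaults, not fixed rules.
    """
    boards = list(leaderboards.values())
    consistent = []
    for model in relevant:
        # Rank of the model on each board where it appears in the top N.
        ranks = [board.index(model) + 1 for board in boards if model in board[:top_n]]
        # Keep only models that make the cut on every board.
        if len(ranks) == len(boards):
            consistent.append((mean(ranks), model))
    consistent.sort()  # lowest average rank first, ties broken by name
    return [model for _, model in consistent[:max_candidates]]
```

Note that the function only eliminates. It returns candidates to test, never a decision.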
The Filter stage is where most teams stop, and stopping here is the central mistake our definitive guide warns against. Filtering is necessary but never sufficient.
Stage I: Investigate Your Own Requirements
Before testing, you turn inward. This stage is about knowing what you are actually selecting for, and it is the stage teams most often skip.
The investigation questions
- What does "good" mean for this task, in one sentence?
- Which observable criteria does that imply?
- What failure modes would genuinely hurt, and how badly?
The output of Investigate is a scoring rubric and an evaluation set of thirty to fifty real examples with reference answers. Without this stage, the Test stage has nothing meaningful to measure against. Our step-by-step guide details how to produce both artifacts.
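One way to keep the two Investigate artifacts together is a small data structure. The sketch below is an assumption about how you might store them; the class and field names (`Criterion`, `Example`, `EvalArtifacts`, `size_ok`) are hypothetical and not part of the framework itself.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Criterion:
    name: str          # short label, e.g. "accuracy"
    description: str   # what a reviewer should look for

@dataclass(frozen=True)
class Example:
    input_text: str    # a real input from your task
    reference: str     # the reference answer for that input

@dataclass
class EvalArtifacts:
    definition_of_good: str                       # the one-sentence definition
    rubric: List[Criterion]                       # observable criteria it implies
    examples: List[Example] = field(default_factory=list)

    def size_ok(self, low=30, high=50):
        """True when the evaluation set has thirty to fifty examples."""
        return low <= len(self.examples) <= high
```

Because these objects persist between passes, they are what makes a later re-run cheap.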
Stage T: Test the Shortlist
Now you run each filtered candidate against your evaluation set, using one identical, frozen prompt for all of them, and score every output against the rubric from the Investigate stage. Use one reviewer, in one sitting, applying one consistent standard.
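As a rough sketch, the Test stage is a double loop over candidates and examples with the prompt held constant. Everything named here is an assumption for illustration: `generate` stands in for whatever call reaches your model, `score` stands in for the reviewer applying the rubric, and the prompt text is a placeholder.

```python
# Hypothetical frozen prompt; the wording is a placeholder, but it must not
# change between candidates or between examples.
FROZEN_PROMPT = "Task input:\n{input}"

def run_test(candidates, examples, generate, score, prompt=FROZEN_PROMPT):
    """Run every candidate over the whole evaluation set.

    candidates: list of model names.
    examples: list of (input_text, reference) pairs.
    generate(model, prompt_text) -> output text   (caller-supplied)
    score(output, reference) -> number            (caller-supplied)
    Returns a dict of model name -> list of scores, in example order.
    """
    results = {}
    for model in candidates:
        scores = []
        for input_text, reference in examples:
            output = generate(model, prompt.format(input=input_text))
            scores.append(score(output, reference))
        results[model] = scores
    return results
```

Passing `generate` and `score` in as arguments keeps the loop independent of any particular model provider or scoring method.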
Reading the results
Tally the scores, but weight your reading toward failure patterns. A model that wins on average but fails catastrophically on a key case may lose to a steadier runner-up. The examples article shows several cases where the average score and the right decision diverged.
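To make that reading concrete, the sketch below reports the average alongside failures on key cases, and prefers a model with no key-case failures over one with a higher mean. The names `summarize` and `pick`, the `floor` threshold, and the rule of excluding any model that fails a key case are assumptions chosen for illustration; your own weighting may differ.

```python
from statistics import mean

def summarize(scores, key_cases, floor):
    """Summarize one model's scores.

    scores: per-example scores, in evaluation-set order.
    key_cases: indices of examples that must not fail.
    floor: minimum acceptable score on a key case (an assumed threshold).
    """
    key_failures = [i for i in key_cases if scores[i] < floor]
    return {"mean": mean(scores), "worst": min(scores), "key_failures": key_failures}

def pick(results, key_cases, floor):
    """Choose the best mean among models with no key-case failures."""
    summaries = {m: summarize(s, key_cases, floor) for m, s in results.items()}
    safe = {m: s for m, s in summaries.items() if not s["key_failures"]}
    # If every model fails a key case, fall back to comparing all of them.
    pool = safe or summaries
    winner = max(pool, key=lambda m: pool[m]["mean"])
    return winner, summaries
```

With scores of `[5, 5, 5, 2]` for one model and `[4, 4, 4, 4]` for another, and the last example marked as a key case with a floor of 3, the second model wins despite its lower average.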
The Test stage produces a decision and, crucially, a documented rationale: which model, why, and what bar a future challenger must clear.
Closing the Loop: When to Run FIT Again
The loop is what distinguishes FIT from a one-off process. You do not re-run it on a schedule; you re-run it on a trigger.
The triggers
- A new model shows a meaningful jump on benchmarks resembling your task.
- Your own requirements change, altering what "good" means.
- Production reveals a failure mode your evaluation set missed, which you then add.
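The three triggers can be written down as an explicit check, which makes "has a trigger fired?" a question with a yes-or-no answer. This is a sketch under stated assumptions: the `Signals` fields, the trigger labels, and especially the `min_gain` threshold for what counts as a "meaningful jump" are hypothetical and should be set by your team.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    benchmark_gain: float        # a new model's gain over your current one on task-like benchmarks
    requirements_changed: bool   # has the definition of "good" changed?
    uncovered_failures: int      # production failures your evaluation set did not cover

def fired_triggers(signals, min_gain=0.05):
    """Return the list of triggers that have fired (empty means: do not re-run)."""
    fired = []
    if signals.benchmark_gain >= min_gain:   # min_gain is an assumed threshold
        fired.append("new_model")
    if signals.requirements_changed:
        fired.append("requirements")
    if signals.uncovered_failures > 0:
        fired.append("production_failure")
    return fired
```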
When a trigger fires, the loop is cheap because the Investigate artifacts already exist. You refresh the shortlist in Filter, reuse or extend your evaluation set, and re-run Test in under an hour. This is the compounding return our best practices article describes: each pass through the loop makes the next one faster.
Why a Named Framework Beats Ad Hoc Evaluation
Naming the stages does real work. It gives a team shared language, so "we are in the Investigate stage" communicates instantly. It enforces order, preventing the classic jump from Filter straight to a decision. And it makes the practice teachable, so evaluation does not live in one person's head. An ad hoc process, by contrast, gets reinvented and degraded every time the original author is busy.
Adapting FIT to your stakes
For low-stakes decisions, run a lightweight FIT: a smaller evaluation set, lighter documentation. For high-stakes ones, run it fully. The loop scales with consequences, which is part of what makes it durable across very different decisions.
Walking Through One Pass of the Loop
Picture applying FIT to choosing a model for drafting product descriptions. In Filter, you check two leaderboards, discard models whose strengths are coding or math, and keep three that rank well on general writing. In Investigate, you define good as "accurate to the spec, on-brand in tone, under eighty words," build a rubric from those three criteria, and gather forty real product specs with ideal descriptions. In Test, you run all forty through the three candidates with one frozen prompt and score each output against the rubric in a single sitting.
The winner emerges with a documented rationale, and you record the runner-up and the bar to beat. Two months later a new model launches; a trigger fires. You spend twenty minutes refreshing the shortlist, reuse the same forty specs, and re-run Test. The loop has closed, and the second pass cost a fraction of the first. That asymmetry, an expensive first pass followed by cheap subsequent ones, is the entire economic argument for naming and reusing the process.
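The rubric in this walk-through mixes two human judgments with one check a script can make. A minimal sketch, assuming a simple pass/fail scale per criterion (the scale, and the function names `under_word_limit` and `score_description`, are illustrative choices rather than part of the example above):

```python
def under_word_limit(text, limit=80):
    """True when the text has fewer than `limit` whitespace-separated words."""
    return len(text.split()) < limit

def score_description(text, accurate, on_brand, limit=80):
    """Score one product description against the three criteria.

    accurate, on_brand: True/False judgments entered by the human reviewer.
    The length criterion is checked automatically.
    Returns (number of criteria passed, per-criterion results).
    """
    checks = {
        "accurate": accurate,
        "on_brand": on_brand,
        "under_limit": under_word_limit(text, limit),
    }
    return sum(checks.values()), checks
```

Automating only the criterion that is mechanically checkable leaves the reviewer free to concentrate on accuracy and tone.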
Where FIT Fits Among Other Methods
FIT is not in competition with checklists or step-by-step guides; it is the connective tissue that makes them recurring rather than one-off. A checklist tells you what to verify within a single evaluation. A step-by-step guide tells you how to execute one. FIT wraps both in a loop with a memory, so that the artifacts you build, the rubric and the evaluation set, persist and accelerate every future decision. Think of the checklist as the contents of one pass and FIT as the reason there is always a next pass ready to run cheaply.
The cultural payoff
Beyond efficiency, a named loop changes how a team talks. "Should we switch models?" becomes "Has a trigger fired, and if so, shall we run FIT?", which is a calmer, more answerable question. The framework turns an emotional debate into a procedural one, and procedural debates resolve faster and leave less residue.
Frequently Asked Questions
What makes FIT different from just following a checklist?
A checklist is a list of actions; FIT is a loop with named stages and an explicit re-entry condition. The loop structure is what turns evaluation from a one-time event into an ongoing practice that compounds, because the artifacts from one pass accelerate the next.
Which FIT stage do teams skip most often?
Investigate. Teams filter with leaderboards and jump straight to a decision, never defining what "good" means or building an evaluation set. Skipping Investigate leaves the Test stage with nothing trustworthy to measure against, which is the root of most bad model choices.
How long does a full FIT Loop take the first time?
The first pass, including building the evaluation set, typically takes a focused day. Subsequent passes take under an hour because the evaluation set and rubric already exist. The upfront investment is what makes future re-evaluations cheap.
Can FIT handle subjective tasks?
Yes. For subjective work, the Investigate stage defines criteria a human can judge, and the Test stage uses human scoring rather than automated checks. The framework is agnostic to whether scoring is objective or subjective; only the scoring method changes.
When should I extend my evaluation set?
Whenever production surfaces a failure mode your set did not cover. Adding that case closes the gap and strengthens every future loop. Over time your set becomes a precise map of what matters for your task, which is the framework's most valuable long-term output.
Key Takeaways
- The FIT Loop has three stages (Filter, Investigate, and Test) plus a trigger-based return.
- Filter uses leaderboards only to shortlist; it is necessary but never the decision.
- Investigate defines "good" and builds the evaluation set, and it is the most-skipped stage.
- Test runs candidates against your set with a frozen prompt and produces a documented decision.
- Re-run FIT on triggers, not schedules; reusable artifacts make each loop faster than the last.