Stop Asking "What Do We Do Here?" Every Time

A playbook is not a tutorial and not a list of best practices. It is an operating manual: a set of named plays, each with a trigger that tells you when to run it, an owner who is accountable, and a sequence that says what comes first. The value of a playbook is that it removes the recurring question "what do we do here?" Without one, every new foundation-model situation gets handled from scratch, which is slow, inconsistent, and dependent on whoever happens to be in the room.

This is that operating manual for foundation models. It assumes you already grasp the fundamentals (if not, The Complete Guide to Foundation Models is the place to start) and focuses on how to run an AI capability as a repeatable operation. Each play below is something I have seen distinguish organizations that compound their AI advantage from those that keep relearning the same lessons. Read it as a reference you return to, not a narrative you read once.

Play 1: Evaluate Before You Build

Trigger: Someone proposes a new foundation-model use case. Owner: The person who will maintain the solution, not the person who pitched it.

Before any building, run a fast feasibility evaluation. Take ten to twenty representative inputs, run them through a candidate model with a rough prompt, and judge the output honestly. This costs an afternoon and prevents the far more expensive outcome of building a system around a task the model cannot reliably do.

The decision this play produces is binary: the model handles this well enough to be worth engineering, or it does not. If the rough version is hopeless, no amount of polish will save it, and you have learned that cheaply. The patterns that make this evaluation reliable are in Foundation Models: Best Practices That Actually Work.
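As a rough illustration, the feasibility evaluation can be as small as the sketch below. Everything in it is an assumption made for the example: the names `run_feasibility_eval` and `FeasibilityResult`, the 0.7 threshold, and the stand-in model and judge, which exist only so the code runs without API access. In practice the model would be a call to your candidate model and the judge would usually be a person reading the outputs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class FeasibilityResult:
    passed: int
    total: int
    threshold: float

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0

    @property
    def worth_engineering(self) -> bool:
        # The binary decision the play produces.
        return self.pass_rate >= self.threshold


def run_feasibility_eval(
    inputs: Sequence[str],
    model: Callable[[str], str],
    judge: Callable[[str, str], bool],
    threshold: float = 0.7,  # hypothetical bar; pick your own
) -> FeasibilityResult:
    """Run each sample input through the model and count acceptable outputs."""
    passed = sum(1 for text in inputs if judge(text, model(text)))
    return FeasibilityResult(passed=passed, total=len(inputs), threshold=threshold)


# Stand-in model and judge, invented for illustration only.
def fake_model(text: str) -> str:
    return text.upper()


def fake_judge(text: str, output: str) -> bool:
    return output == text.upper() and len(text) > 3


samples = ["refund request", "late delivery", "hi", "invoice question"]
result = run_feasibility_eval(samples, fake_model, fake_judge)
print(f"{result.passed}/{result.total} passed, worth engineering: {result.worth_engineering}")
```

The point of keeping the harness this small is that it should cost an afternoon, not a sprint. If the rough version fails the bar, you stop here.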

Play 2: Start With the Smallest Viable Approach

Trigger: A use case passed the feasibility evaluation. Owner: The implementer.

Resist the urge to build the sophisticated version first. The correct sequence climbs a ladder, stopping at the first rung that works:

1. Plain prompting: a well-structured prompt against a hosted model.
2. Few-shot prompting: add examples when plain prompting is inconsistent.
3. Retrieval: add external information when the model lacks the facts.
4. Fine-tuning: only when behavior is stable and the above have plateaued.

Most use cases stop at rung two or three. Teams that start at rung four waste weeks on machinery they did not need. The selection logic behind each rung is in A Framework for Foundation Models.
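The ladder logic can be sketched as a loop that tries each rung in order and stops at the first one that clears a quality bar. The function name `first_rung_that_works`, the 0.80 target, and the scores are all hypothetical; the scores in particular are invented placeholders standing in for whatever your own evaluation reports.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Each rung pairs a name with a function that returns an eval score in [0, 1].
Rung = Tuple[str, Callable[[], float]]


def first_rung_that_works(
    rungs: Sequence[Rung], target: float
) -> Tuple[Optional[str], List[str]]:
    """Try rungs in order; return the first that meets the target, plus the rungs tried."""
    tried: List[str] = []
    for name, evaluate in rungs:
        tried.append(name)
        if evaluate() >= target:
            return name, tried  # stop climbing: later rungs are never built
    return None, tried


# Placeholder scores, invented for illustration only.
ladder: List[Rung] = [
    ("plain prompting", lambda: 0.62),
    ("few-shot prompting", lambda: 0.81),
    ("retrieval", lambda: 0.90),
    ("fine-tuning", lambda: 0.93),
]

chosen, tried = first_rung_that_works(ladder, target=0.80)
print(chosen, tried)
```

Note that the later rungs score higher in this made-up example and are still skipped. The play optimizes for the cheapest approach that is good enough, not the best possible score.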

Play 3: Instrument Before You Scale

Trigger: A solution works in testing and is about to handle real volume. Owner: The implementer, with sign-off from whoever owns reliability.

Never scale an uninstrumented system. Before real traffic arrives, put three things in place:

A regression eval — a fixed test set you run on a schedule to catch quality drift when the model or your prompt changes.
Production monitoring — track output validity, latency, cost per call, and refusal rate, and alert on shifts.
A fallback path — define what happens when the model is slow, unavailable, or returns garbage.

This play is the one teams skip and regret. The risks it guards against — drift, leakage, vendor failure — are detailed in The Hidden Risks of Foundation Models (and How to Manage Them).
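A minimal sketch of the fallback path with basic monitoring counters might look like the following. It assumes the primary model client raises an exception when it times out or is unavailable, and that a valid output is a JSON object with a `label` key; both are assumptions for the example, as are the names `CallStats` and `call_with_fallback`. The three stand-in "models" simulate a healthy call, a garbage response, and an outage.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CallStats:
    calls: int = 0
    errors: int = 0   # primary raised (timeout, outage)
    invalid: int = 0  # primary returned output that failed validation
    latencies: List[float] = field(default_factory=list)

    @property
    def fallback_rate(self) -> float:
        return (self.errors + self.invalid) / self.calls if self.calls else 0.0


def is_valid(output: str) -> bool:
    """Assumed validity rule: output parses as a JSON object with a 'label' key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "label" in parsed


def call_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    stats: CallStats,
) -> str:
    stats.calls += 1
    start = time.monotonic()
    try:
        output = primary(prompt)  # assumed to raise on timeout or outage
    except Exception:
        output = None
    stats.latencies.append(time.monotonic() - start)

    if output is None:
        stats.errors += 1
    elif not is_valid(output):
        stats.invalid += 1
    else:
        return output
    return fallback(prompt)


# Stand-ins, invented for illustration only.
def healthy(prompt: str) -> str:
    return '{"label": "billing"}'


def garbage(prompt: str) -> str:
    return "???"


def down(prompt: str) -> str:
    raise TimeoutError("model unavailable")


def route_to_human(prompt: str) -> str:
    return '{"label": "needs_human"}'


stats = CallStats()
results = [
    call_with_fallback("ticket text", model, route_to_human, stats)
    for model in (healthy, garbage, down)
]
print(results, stats.errors, stats.invalid)
```

In a real system the counters would feed whatever alerting you already use, so a rising fallback rate is noticed before customers notice it.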

Play 4: Standardize and Share What Works

Trigger: Someone produces a prompt or workflow that works notably well. Owner: Whoever owns the team's shared assets.

Good prompts and workflows are organizational assets, and they evaporate if they live in one person's head. When something works, codify it: add it to a shared prompt library, document the use case and the trade-offs, and make it discoverable. This single discipline is the difference between an organization that compounds its AI capability and one that solves the same problem repeatedly. The enablement side of spreading these assets is in Rolling Out Foundation Models Across a Team.
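One possible shape for a shared prompt library entry is sketched below. The field names and the `PromptLibrary` class are hypothetical, chosen to mirror what the play asks you to record: the prompt itself, the use case, the trade-offs, and an owner. A shared document or a repository folder serves the same purpose; what matters is that the entry is written down and searchable.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PromptEntry:
    name: str
    use_case: str
    template: str    # uses str.format placeholders
    trade_offs: str
    owner: str
    tags: Tuple[str, ...] = ()


class PromptLibrary:
    def __init__(self) -> None:
        self._entries: Dict[str, PromptEntry] = {}

    def add(self, entry: PromptEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"prompt '{entry.name}' already exists")
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[PromptEntry]:
        """Find entries whose use case mentions the keyword or whose tags include it."""
        kw = keyword.lower()
        return [
            e for e in self._entries.values()
            if kw in e.use_case.lower() or kw in e.tags
        ]


# Example entry, invented for illustration only.
library = PromptLibrary()
library.add(PromptEntry(
    name="ticket-triage",
    use_case="Classify inbound support tickets by urgency",
    template="Classify the urgency of this ticket as low, medium, or high:\n{ticket}",
    trade_offs="Fast and cheap; struggles with tickets that mix several issues.",
    owner="support-ops",
    tags=("support", "classification"),
))

found = library.search("support")
rendered = found[0].template.format(ticket="My card was charged twice")
print(found[0].name)
print(rendered)
```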

Play 5: Route by Difficulty and Cost

Trigger: Volume or cost on a working system grows large enough to matter. Owner: The implementer.

Once a system handles real volume, stop sending every request to your largest model. Introduce routing:

Classify or score incoming requests by difficulty.
Send easy requests to a small, fast, cheap model.
Escalate only the hard or uncertain ones to the large model.

This play often cuts cost substantially while keeping quality steady, because most production traffic is easier than the worst case you designed for. It is a tuning play, run after the system works, not before.
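The three routing steps can be sketched as follows. The difficulty scorer (word count), the thresholds, and the convention that the small model returns an answer together with a confidence value are all assumptions made for this example, not properties of any particular model or API.

```python
from typing import Callable, Tuple


def score_difficulty(request: str) -> float:
    """Hypothetical scorer: longer requests count as harder, capped at 1.0."""
    return min(len(request.split()) / 20, 1.0)


def route(
    request: str,
    small_model: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
    large_model: Callable[[str], str],
    max_difficulty: float = 0.5,
    min_confidence: float = 0.7,
) -> Tuple[str, str]:
    """Return (tier, answer), escalating hard or uncertain requests."""
    if score_difficulty(request) <= max_difficulty:
        answer, confidence = small_model(request)
        if confidence >= min_confidence:
            return "small", answer
    return "large", large_model(request)


# Stand-in models, invented for illustration only.
def small_model(request: str) -> Tuple[str, float]:
    confidence = 0.4 if "refund" in request else 0.9
    return f"small answer to: {request}", confidence


def large_model(request: str) -> str:
    return f"large answer to: {request}"


requests = [
    "reset my password",
    "refund for order",
    "compare the termination clauses in these two contracts and explain "
    "which one exposes us to more liability",
]
tiers = [route(r, small_model, large_model)[0] for r in requests]
print(tiers)
```

The first request is easy and confident, so it stays on the small model. The second is short but the small model is unsure, and the third is scored as hard up front, so both escalate.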

Play 6: Review High-Stakes Output

Trigger: Any model output that reaches a customer or drives a consequential decision. Owner: A qualified human reviewer.

Define explicitly which outputs require human review, and make that review substantive. The depth of review scales with stakes: a casual internal draft needs none, while a customer-facing legal statement needs a careful read. The failure mode this play prevents is the silent shipping of a confident hallucination to someone who trusted it. Where the model is reliable enough to skip review is a judgment your team calibrates over time, covered in Foundation Models: Myths vs Reality.
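Making the review rule explicit can be as simple as a lookup table that code enforces. The output types and the two review levels below are hypothetical examples; the one deliberate design choice is that any output type missing from the table defaults to full review, so a new use case cannot skip review by accident.

```python
from enum import Enum
from typing import Optional


class Review(Enum):
    NONE = "none"
    FULL = "full review"


# Hypothetical policy table; your team defines the real one.
REVIEW_POLICY = {
    "internal_draft": Review.NONE,
    "customer_email": Review.FULL,
    "legal_statement": Review.FULL,
}


def required_review(output_type: str) -> Review:
    # Unknown output types fail safe to full review.
    return REVIEW_POLICY.get(output_type, Review.FULL)


def release(output_type: str, text: str, approved_by: Optional[str] = None) -> str:
    """Return the text for release, refusing if required review has not happened."""
    if required_review(output_type) is Review.FULL and not approved_by:
        raise PermissionError(f"{output_type} requires human review before release")
    return text


draft = release("internal_draft", "rough notes")
approved = release("legal_statement", "final wording", approved_by="reviewer-1")

try:
    release("legal_statement", "unreviewed wording")
    blocked = False
except PermissionError:
    blocked = True

print(draft, approved, blocked)
```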

Play 7: Reassess on a Cadence

Trigger: A fixed schedule — monthly or quarterly — regardless of whether anything seems wrong. Owner: Whoever owns the AI capability overall.

The landscape moves fast enough that decisions made six months ago may now be wrong. On a regular cadence, revisit four questions: Is there a better or cheaper model now? Are the standards still right? What broke since last time? What new use cases have emerged? This play keeps the capability current without requiring a crisis to force a review. It is the maintenance loop that keeps the other six plays honest.

How the Plays Sequence Together

The plays are not independent; they form a lifecycle. A use case enters through Play 1 (evaluate), gets built through Play 2 (smallest viable), is hardened through Play 3 (instrument), and once live, its lessons feed Play 4 (standardize). As it scales, Play 5 (route) optimizes cost and Play 6 (review) governs quality, while Play 7 (reassess) keeps the whole thing current. Assign each play a clear owner and the operation runs without depending on heroics. Skip the owners and you have a document nobody follows.

Frequently Asked Questions

What makes a playbook different from best practices?

A playbook adds triggers and owners. Best practices tell you what good looks like; a playbook tells you when to act and who is accountable. That structure is what turns knowledge into a repeatable operation rather than a document people nod at and ignore.

Who should own the overall playbook?

Someone accountable for the AI capability as a whole, with the authority to maintain standards and the shared prompt library. Individual plays have individual owners, but one person should own the lifecycle and the cadence reassessment.

Do small teams need a playbook this formal?

Scale it down, but keep the structure. Even a two-person team benefits from "evaluate before building," "start small," and "share what works." The triggers and owners can be lightweight; the discipline is what matters.

When do we actually fine-tune?

Only at the top rung of Play 2, after prompting and retrieval have plateaued and you have a stable behavior to teach at sufficient scale. Reaching for it earlier is the most common expensive mistake in the lifecycle.

How often should we run the reassessment play?

Monthly or quarterly, depending on how central AI is to your work and how fast the models you rely on are changing. The point is a fixed cadence so reassessment happens by default, not only when something breaks.

Key Takeaways

A playbook adds triggers and owners to best practices, turning knowledge into a repeatable operation.
Evaluate feasibility cheaply before building, and start at the smallest viable approach on the prompting-to-fine-tuning ladder.
Instrument with regression evals, monitoring, and fallbacks before scaling, never after.
Standardize and share what works so capability compounds instead of being relearned.
Route by difficulty to control cost, review high-stakes output, and reassess on a fixed cadence.

Agency Script Editorial
