Most advice about controlling token spend reads like a list of good ideas with no instructions for when to use them. That gap is why teams know they should cache, summarize, and route, yet keep paying for prompts nobody has touched in months. Knowing the levers is not the same as knowing which one to pull, who pulls it, and in what order.
A playbook fixes that. Instead of a pile of tactics, you get plays. Each play has a trigger that tells you when it applies, an owner who is accountable for running it, and a sequence that keeps you from making things worse. The point is to turn cost control from a heroic one-off project into something your team executes on cue.
What follows is an operating playbook you can adapt. Treat the triggers as defaults to tune, not commandments. The sequencing matters more than the exact thresholds, because pulling levers in the wrong order is how teams degrade quality while chasing savings.
Play One: Audit Before You Cut
You cannot optimize a system you have not measured. Every other play depends on this one running first.
Trigger
Run this the moment token cost becomes a topic, or on a fixed cadence such as monthly, whichever comes first.
The Sequence
- Instrument every call to log input tokens, output tokens, and the use case.
- Group spend by use case and sort by total cost, not request count.
- Identify the top two or three use cases that drive the majority of spend.
Owner
An engineer who can add logging and read the data. Optimization without this baseline is guessing, and guessing is how teams cut the wrong thing.
Play Two: Right-Size the Model
Paying premium rates for trivial work is the most common and most fixable form of waste.
Trigger
A high-volume use case runs on a top-tier model, or a single model handles tasks of wildly different difficulty.
The Sequence
- List the use cases hitting the expensive model.
- For each, ask whether a smaller model could meet the quality bar.
- Route by difficulty: cheap model for the easy majority, capable model for the hard minority.
Owner
The engineer who owns the use case, validated against a fixed evaluation set so the routing decision is evidence, not vibes.
Play Three: Shrink the Input
Input usually dominates spend, so this play has the largest blast radius.
Trigger
Input tokens consistently exceed output tokens, or the system prompt has grown past what anyone can justify line by line.
The Sequence
- Strip redundant and aspirational instructions from the system prompt.
- Remove few-shot examples and test whether quality holds without them.
- Cap retrieval to the smallest number of chunks that still answers the question.
Owner
A prompt engineer paired with whoever owns quality, because each cut must be tested against real inputs before it ships.
The mechanics of each cut are covered in the questions-and-answers companion if you need the underlying reasoning.
Play Four: Tame Conversation History
Chat-style products bleed tokens through history that grows without limit.
Trigger
Multi-turn conversations resend the full transcript on every request, or long sessions cost noticeably more than short ones.
The Sequence
- Set a maximum number of recent turns to keep verbatim.
- Replace older turns with a running summary refreshed periodically.
- Confirm the summary preserves the facts the model needs to stay coherent.
Owner
The application engineer, since this play changes how the conversation state is assembled, not just the prompt text.
Play Five: Exploit Caching and Batching
These are discounts you are leaving on the table if your request structure does not support them.
Caching
Move stable context to the front of the prompt so a cached prefix survives across requests. Reuse it for any workload that sends the same large block repeatedly, such as a fixed instruction set or a reference document.
Batching
For non-interactive jobs that tolerate latency, route them through a batch path to capture the discount. Overnight evaluations, bulk summarization, and backfills are ideal candidates.
Owner
The engineer who structures requests. Both plays require deliberate ordering and routing rather than a content change.
Play Six: Set Caps and Hold the Line
Optimization decays. Without guardrails, prompts drift back toward bloat as people add one more instruction at a time.
Trigger
Run this once you have realized your savings, then keep it running forever.
The Sequence
- Define a maximum context size and maximum output length per use case.
- Enforce the caps in code so violations fail loudly instead of silently inflating cost.
- Add cost per outcome to a dashboard people actually look at.
Owner
Whoever owns the budget. Guardrails are a governance job, not a one-time engineering task, and they are what make the savings durable.
For embedding these caps into day-to-day work, the repeatable workflow turns this final play into standing process. For where the discipline is heading, see the forward-looking view.
Play Seven: Trim the Output, Not Just the Input
Input gets most of the attention, but generous output formats quietly inflate cost on every single call.
Trigger
The model returns verbose answers, repeats the question back, or wraps every response in boilerplate that no consumer reads.
The Sequence
- Instruct the model to answer directly without restating the prompt.
- Set a maximum output length appropriate to the use case.
- For structured outputs, request the minimal schema rather than a chatty narrative.
Owner
The prompt engineer, validated against the evaluation set so brevity does not strip out detail the consumer actually needs. Output tokens often cost more per token than input, so even modest trims compound across high volume.
Play Eight: Route by Difficulty, Not by Default
Many systems send every request to whatever model the prototype happened to use. That default is rarely the right economic choice.
Trigger
A single model handles a mix of trivial and hard requests, or you suspect most traffic is easier than the model you are paying for.
The Sequence
- Classify incoming requests by difficulty, using a cheap heuristic or a small classifier.
- Send the easy majority to a smaller model and escalate only when needed.
- Measure escalation rate and quality so the routing stays honest.
Owner
The engineer who owns the request path. Routing is structural, so it belongs with whoever controls how requests are dispatched, not with whoever writes the prompt text.
Sequencing the Plays
Order matters. Audit first, because the rest depends on data. Right-size the model and shrink the input next, since they carry the most savings. Then tackle history, caching, and batching, which are structural and slower to change. Finish with caps, because they protect the gains. Running the plays out of order is how teams burn effort and still watch costs creep back.
Frequently Asked Questions
How often should I run the full playbook?
Run the audit play monthly or whenever cost becomes a topic. The cutting plays run as the audit surfaces opportunities. The caps play, once set, runs continuously through enforcement rather than as a recurring project.
Who should own token optimization overall?
A named person responsible for the budget, supported by the engineers who own each use case. Diffuse ownership is why optimization stalls; someone has to be accountable for the number.
What if cutting tokens degrades quality?
That is why every cutting play pairs with a fixed evaluation set. You test each change against real inputs before shipping. If quality drops, you keep the tokens. The plays are designed to find waste, not to cut blindly.
Can I skip the audit if I already know where the waste is?
You can skip it once, and you will usually be wrong about something. Even experienced teams misjudge the input-output split or miss a use case. The audit is cheap insurance against optimizing the wrong target.
Do these plays apply to agentic systems?
Yes, and they matter more there. Agentic loops multiply calls, so bloated prompts and uncapped history compound across every step. Caps and input-shrinking plays are especially valuable when one task triggers dozens of model calls.
Key Takeaways
- Turn tactics into plays: each needs a trigger, an owner, and a sequence.
- Audit first; every other play depends on real per-use-case data.
- Right-sizing the model and shrinking the input carry the largest savings.
- Tame conversation history with capped turns and running summaries.
- Caching and batching are structural discounts you must design for.
- Set and enforce caps last so optimization gains do not decay over time.