Plays That Pull Token Spend Down Without Breaking Output

Most advice about controlling token spend reads like a list of good ideas with no instructions for when to use them. That gap is why teams know they should cache, summarize, and route, yet keep paying for prompts nobody has touched in months. Knowing the levers is not the same as knowing which one to pull, who pulls it, and in what order.

A playbook fixes that. Instead of a pile of tactics, you get plays. Each play has a trigger that tells you when it applies, an owner who is accountable for running it, and a sequence that keeps you from making things worse. The point is to turn cost control from a heroic one-off project into something your team executes on cue.

What follows is an operating playbook you can adapt. Treat the triggers as defaults to tune, not commandments. The sequencing matters more than the exact thresholds, because pulling levers in the wrong order is how teams degrade quality while chasing savings.

Play One: Audit Before You Cut

You cannot optimize a system you have not measured. Every other play depends on this one running first.

Trigger

Run this the moment token cost becomes a topic, or on a fixed cadence such as monthly, whichever comes first.

The Sequence

Instrument every call to log input tokens, output tokens, and the use case.
Group spend by use case and sort by total cost, not request count.
Identify the top two or three use cases that drive the majority of spend.

Owner

An engineer who can add logging and read the data. Optimization without this baseline is guessing, and guessing is how teams cut the wrong thing.

Play Two: Right-Size the Model

Paying premium rates for trivial work is the most common and most fixable form of waste.

Trigger

A high-volume use case runs on a top-tier model, or a single model handles tasks of wildly different difficulty.

The Sequence

List the use cases hitting the expensive model.
For each, ask whether a smaller model could meet the quality bar.
Route by difficulty: cheap model for the easy majority, capable model for the hard minority.

Owner

The engineer who owns the use case, validated against a fixed evaluation set so the routing decision is evidence, not vibes.

Play Three: Shrink the Input

Input usually dominates spend, so this play has the largest blast radius.

Trigger

Input tokens consistently exceed output tokens, or the system prompt has grown past what anyone can justify line by line.

The Sequence

Strip redundant and aspirational instructions from the system prompt.
Remove few-shot examples and test whether quality holds without them.
Cap retrieval to the smallest number of chunks that still answers the question.

Owner

A prompt engineer paired with whoever owns quality, because each cut must be tested against real inputs before it ships.

The mechanics of each cut are covered in the questions-and-answers companion if you need the underlying reasoning.

Play Four: Tame Conversation History

Chat-style products bleed tokens through history that grows without limit.

Trigger

Multi-turn conversations resend the full transcript on every request, or long sessions cost noticeably more than short ones.

The Sequence

Set a maximum number of recent turns to keep verbatim.
Replace older turns with a running summary refreshed periodically.
Confirm the summary preserves the facts the model needs to stay coherent.

Owner

The application engineer, since this play changes how the conversation state is assembled, not just the prompt text.

Play Five: Exploit Caching and Batching

These are discounts you are leaving on the table if your request structure does not support them.

Caching

Move stable context to the front of the prompt so a cached prefix survives across requests. Reuse it for any workload that sends the same large block repeatedly, such as a fixed instruction set or a reference document.

Batching

For non-interactive jobs that tolerate latency, route them through a batch path to capture the discount. Overnight evaluations, bulk summarization, and backfills are ideal candidates.

Owner

The engineer who structures requests. Both plays require deliberate ordering and routing rather than a content change.

Play Six: Set Caps and Hold the Line

Optimization decays. Without guardrails, prompts drift back toward bloat as people add one more instruction at a time.

Trigger

Run this once you have realized your savings, then keep it running forever.

The Sequence

Define a maximum context size and maximum output length per use case.
Enforce the caps in code so violations fail loudly instead of silently inflating cost.
Add cost per outcome to a dashboard people actually look at.

Owner

Whoever owns the budget. Guardrails are a governance job, not a one-time engineering task, and they are what make the savings durable.

For embedding these caps into day-to-day work, the repeatable workflow turns this final play into standing process. For where the discipline is heading, see the forward-looking view.

Play Seven: Trim the Output, Not Just the Input

Input gets most of the attention, but generous output formats quietly inflate cost on every single call.

Trigger

The model returns verbose answers, repeats the question back, or wraps every response in boilerplate that no consumer reads.

The Sequence

Instruct the model to answer directly without restating the prompt.
Set a maximum output length appropriate to the use case.
For structured outputs, request the minimal schema rather than a chatty narrative.

Owner

The prompt engineer, validated against the evaluation set so brevity does not strip out detail the consumer actually needs. Output tokens often cost more per token than input, so even modest trims compound across high volume.

Play Eight: Route by Difficulty, Not by Default

Many systems send every request to whatever model the prototype happened to use. That default is rarely the right economic choice.

Trigger

A single model handles a mix of trivial and hard requests, or you suspect most traffic is easier than the model you are paying for.

The Sequence

Classify incoming requests by difficulty, using a cheap heuristic or a small classifier.
Send the easy majority to a smaller model and escalate only when needed.
Measure escalation rate and quality so the routing stays honest.

Owner

The engineer who owns the request path. Routing is structural, so it belongs with whoever controls how requests are dispatched, not with whoever writes the prompt text.

Sequencing the Plays

Order matters. Audit first, because the rest depends on data. Right-size the model and shrink the input next, since they carry the most savings. Then tackle history, caching, and batching, which are structural and slower to change. Finish with caps, because they protect the gains. Running the plays out of order is how teams burn effort and still watch costs creep back.

Frequently Asked Questions

How often should I run the full playbook?

Run the audit play monthly or whenever cost becomes a topic. The cutting plays run as the audit surfaces opportunities. The caps play, once set, runs continuously through enforcement rather than as a recurring project.

Who should own token optimization overall?

A named person responsible for the budget, supported by the engineers who own each use case. Diffuse ownership is why optimization stalls; someone has to be accountable for the number.

What if cutting tokens degrades quality?

That is why every cutting play pairs with a fixed evaluation set. You test each change against real inputs before shipping. If quality drops, you keep the tokens. The plays are designed to find waste, not to cut blindly.

Can I skip the audit if I already know where the waste is?

You can skip it once, and you will usually be wrong about something. Even experienced teams misjudge the input-output split or miss a use case. The audit is cheap insurance against optimizing the wrong target.

Do these plays apply to agentic systems?

Yes, and they matter more there. Agentic loops multiply calls, so bloated prompts and uncapped history compound across every step. Caps and input-shrinking plays are especially valuable when one task triggers dozens of model calls.

Key Takeaways

Turn tactics into plays: each needs a trigger, an owner, and a sequence.
Audit first; every other play depends on real per-use-case data.
Right-sizing the model and shrinking the input carry the largest savings.
Tame conversation history with capped turns and running summaries.
Caching and batching are structural discounts you must design for.
Set and enforce caps last so optimization gains do not decay over time.

Play One: Audit Before You Cut

You cannot optimize a system you have not measured. Every other play depends on this one running first.

Trigger

Run this the moment token cost becomes a topic, or on a fixed cadence such as monthly, whichever comes first.

The Sequence

Instrument every call to log input tokens, output tokens, and the use case.
Group spend by use case and sort by total cost, not request count.
Identify the top two or three use cases that drive the majority of spend.

Owner

An engineer who can add logging and read the data. Optimization without this baseline is guessing, and guessing is how teams cut the wrong thing.

Play Two: Right-Size the Model

Paying premium rates for trivial work is the most common and most fixable form of waste.

Trigger

A high-volume use case runs on a top-tier model, or a single model handles tasks of wildly different difficulty.

The Sequence

List the use cases hitting the expensive model.
For each, ask whether a smaller model could meet the quality bar.
Route by difficulty: cheap model for the easy majority, capable model for the hard minority.

Owner

The engineer who owns the use case, validated against a fixed evaluation set so the routing decision is evidence, not vibes.

Play Three: Shrink the Input

Input usually dominates spend, so this play has the largest blast radius.

Trigger

Input tokens consistently exceed output tokens, or the system prompt has grown past what anyone can justify line by line.

The Sequence

Strip redundant and aspirational instructions from the system prompt.
Remove few-shot examples and test whether quality holds without them.
Cap retrieval to the smallest number of chunks that still answers the question.

Owner

A prompt engineer paired with whoever owns quality, because each cut must be tested against real inputs before it ships.

The mechanics of each cut are covered in the questions-and-answers companion if you need the underlying reasoning.

Play Four: Tame Conversation History

Chat-style products bleed tokens through history that grows without limit.

Trigger

Multi-turn conversations resend the full transcript on every request, or long sessions cost noticeably more than short ones.

The Sequence

Set a maximum number of recent turns to keep verbatim.
Replace older turns with a running summary refreshed periodically.
Confirm the summary preserves the facts the model needs to stay coherent.

Owner

The application engineer, since this play changes how the conversation state is assembled, not just the prompt text.

Play Five: Exploit Caching and Batching

These are discounts you are leaving on the table if your request structure does not support them.

Caching

Batching

For non-interactive jobs that tolerate latency, route them through a batch path to capture the discount. Overnight evaluations, bulk summarization, and backfills are ideal candidates.

Owner

The engineer who structures requests. Both plays require deliberate ordering and routing rather than a content change.

Play Six: Set Caps and Hold the Line

Optimization decays. Without guardrails, prompts drift back toward bloat as people add one more instruction at a time.

Trigger

Run this once you have realized your savings, then keep it running forever.

The Sequence

Define a maximum context size and maximum output length per use case.
Enforce the caps in code so violations fail loudly instead of silently inflating cost.
Add cost per outcome to a dashboard people actually look at.

Owner

Whoever owns the budget. Guardrails are a governance job, not a one-time engineering task, and they are what make the savings durable.

For embedding these caps into day-to-day work, the repeatable workflow turns this final play into standing process. For where the discipline is heading, see the forward-looking view.

Play Seven: Trim the Output, Not Just the Input

Input gets most of the attention, but generous output formats quietly inflate cost on every single call.

Trigger

The model returns verbose answers, repeats the question back, or wraps every response in boilerplate that no consumer reads.

The Sequence

Instruct the model to answer directly without restating the prompt.
Set a maximum output length appropriate to the use case.
For structured outputs, request the minimal schema rather than a chatty narrative.

Owner

Play Eight: Route by Difficulty, Not by Default

Many systems send every request to whatever model the prototype happened to use. That default is rarely the right economic choice.

Trigger

A single model handles a mix of trivial and hard requests, or you suspect most traffic is easier than the model you are paying for.

The Sequence

Classify incoming requests by difficulty, using a cheap heuristic or a small classifier.
Send the easy majority to a smaller model and escalate only when needed.
Measure escalation rate and quality so the routing stays honest.

Owner

The engineer who owns the request path. Routing is structural, so it belongs with whoever controls how requests are dispatched, not with whoever writes the prompt text.

Sequencing the Plays

Frequently Asked Questions

How often should I run the full playbook?

Who should own token optimization overall?

A named person responsible for the budget, supported by the engineers who own each use case. Diffuse ownership is why optimization stalls; someone has to be accountable for the number.

What if cutting tokens degrades quality?

Can I skip the audit if I already know where the waste is?

Do these plays apply to agentic systems?

Key Takeaways

Turn tactics into plays: each needs a trigger, an owner, and a sequence.
Audit first; every other play depends on real per-use-case data.
Right-sizing the model and shrinking the input carry the largest savings.
Tame conversation history with capped turns and running summaries.
Caching and batching are structural discounts you must design for.
Set and enforce caps last so optimization gains do not decay over time.

Plays That Pull Token Spend Down Without Breaking Output

Play One: Audit Before You Cut

Trigger

The Sequence

Owner

Play Two: Right-Size the Model

Trigger

The Sequence

Owner

Play Three: Shrink the Input

Trigger

The Sequence

Owner

Play Four: Tame Conversation History

Trigger

The Sequence

Owner

Play Five: Exploit Caching and Batching

Caching

Batching

Owner

Play Six: Set Caps and Hold the Line

Trigger

The Sequence

Owner

Play Seven: Trim the Output, Not Just the Input

Trigger

The Sequence

Owner

Play Eight: Route by Difficulty, Not by Default

Trigger

The Sequence

Owner

Sequencing the Plays

Frequently Asked Questions

How often should I run the full playbook?

Who should own token optimization overall?

What if cutting tokens degrades quality?

Can I skip the audit if I already know where the waste is?

Do these plays apply to agentic systems?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Plays That Pull Token Spend Down Without Breaking Output

Play One: Audit Before You Cut

Trigger

The Sequence

Owner

Play Two: Right-Size the Model

Trigger

The Sequence

Owner

Play Three: Shrink the Input

Trigger

The Sequence

Owner

Play Four: Tame Conversation History

Trigger

The Sequence

Owner

Play Five: Exploit Caching and Batching

Caching

Batching

Owner

Play Six: Set Caps and Hold the Line

Trigger

The Sequence

Owner

Play Seven: Trim the Output, Not Just the Input

Trigger

The Sequence

Owner

Play Eight: Route by Difficulty, Not by Default

Trigger

The Sequence

Owner