Token Budgets in the Wild: Five Scenarios That Teach

Abstract advice about token budgets only goes so far. What makes the discipline click is seeing it applied to a specific feature, with specific components consuming specific shares of the window, and a specific decision that either worked or backfired. This article walks through five scenarios drawn from common LLM features. Each one shows where the tokens went, what choice was made, and why it succeeded or failed.

The scenarios are deliberately varied — a support chatbot, a document question answerer, a batch classifier, a code assistant, and a summarization pipeline — because the right budgeting decision depends heavily on the shape of the feature. A technique that saves a fortune in one context is irrelevant in another. Seeing the contrast is the point.

Read each scenario as a small case to reason about. The numbers are illustrative rather than measured from any single system, but the structure of each decision is exactly the kind you will face in your own work.

Scenario One: The Support Chatbot That Forgot

A customer support chatbot appended every conversation turn to its context and resent the whole history on each request.

What Went Wrong

Short test conversations worked fine. In production, support sessions ran twenty or thirty turns. By the end, each request carried the entire conversation, so cost per turn climbed steadily. Long sessions eventually overflowed the context window, and the earliest turns, including the customer's original problem, were silently dropped.

The Fix and Why It Worked

The team kept the last four turns verbatim and replaced older turns with a running summary capturing the customer's issue, what had been tried, and any account details. History was capped at a fixed token budget. Costs per turn flattened, and the bot stopped forgetting the original problem because the summary preserved it deliberately. The mechanics mirror those in Cut Your Token Costs This Afternoon: An Ordered Routine.
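The keep-recent-and-summarize pattern described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the team's implementation: `count_tokens` is a crude word-count stand-in for a real tokenizer, `naive_summarize` is a placeholder for what would in practice be a model call, and the 200-token budget is a hypothetical figure.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def naive_summarize(summary: str, turn: str, max_words: int = 40) -> str:
    """Placeholder summarizer (assumption): a real system would call a model here."""
    words = (summary + " " + turn).split()
    return " ".join(words[:max_words])


@dataclass
class History:
    keep_recent: int = 4   # turns kept verbatim, as in the scenario
    budget: int = 200      # hypothetical cap on history tokens
    summary: str = ""
    recent: List[str] = field(default_factory=list)

    def add_turn(self, turn: str,
                 summarize: Callable[[str, str], str] = naive_summarize) -> None:
        self.recent.append(turn)
        # Fold any turn that falls out of the verbatim window into the summary.
        while len(self.recent) > self.keep_recent:
            oldest = self.recent.pop(0)
            self.summary = summarize(self.summary, oldest)

    def render(self) -> str:
        parts = [f"Summary so far: {self.summary}"] if self.summary else []
        text = "\n".join(parts + self.recent)
        # Fail loudly instead of letting the window overflow silently.
        if count_tokens(text) > self.budget:
            raise ValueError("history exceeds its token budget")
        return text
```

The explicit budget check is the design choice worth noting: an overflow becomes a visible error to handle, not a silent loss of the earliest context.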

Scenario Two: The Document Q&A That Sent Everything

A question-answering feature pulled the most relevant document and pasted its full text into the prompt.

What Went Wrong

Documents ran long, so a typical request sent thousands of tokens, most of them irrelevant to the question. The cost was high, and answer quality was inconsistent because the model had to find the relevant passage inside a wall of unrelated content.

The Fix and Why It Worked

The team chunked documents, reranked chunks by relevance to the question, and included only the top three. Token usage dropped sharply, and answer quality improved because the model saw a focused context instead of a noisy one. This is the classic case where cutting tokens and improving quality point the same direction, a theme in Hard-Won Habits for Keeping Token Spend Under Control.
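The chunk, rerank, and keep-the-top-three flow can be illustrated with a small sketch. The scoring here is a deliberately crude word-overlap count standing in for whatever reranker a real system would use; the function names and the 50-word chunk size are assumptions made for illustration.

```python
from typing import List


def chunk(text: str, size: int = 50) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(question: str, passage: str) -> int:
    """Crude relevance score (assumption): number of shared lowercase words."""
    question_words = set(question.lower().split())
    return len(question_words & set(passage.lower().split()))


def top_chunks(question: str, document: str, k: int = 3, size: int = 50) -> List[str]:
    """Return the k chunks that score highest against the question."""
    chunks = chunk(document, size)
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    return ranked[:k]
```

Only the returned chunks go into the prompt, so the context size is bounded by `k * size` words regardless of how long the source document is.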

Scenario Three: The Classifier With a Novel-Length Prompt

A batch classification job categorized incoming tickets using a system prompt loaded with examples.

What Went Wrong

The system prompt contained two dozen few-shot examples, sent on every one of tens of thousands of daily classifications. Most of the examples were redundant, covering the same categories repeatedly, but they were paid for on every request.

The Fix and Why It Worked

The team trimmed the examples to a representative handful, one per category, and confirmed accuracy held on a validation set. Because the saving applied to every request at high volume, the trimmed prompt cut the job's cost substantially with no measurable accuracy loss. Volume turned a small per-request saving into a large one.
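To see how volume amplifies the saving, the trim can be sketched together with the arithmetic. The helper names, the word-count tokenizer, and the figure of 50,000 requests per day are illustrative assumptions; the sketch also does not replace the validation-set check the team ran before shipping the shorter prompt.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (ticket text, category label)


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def one_per_category(examples: List[Example]) -> List[Example]:
    """Keep the first example seen for each category, preserving order."""
    kept: Dict[str, Example] = {}
    for text, label in examples:
        kept.setdefault(label, (text, label))
    return list(kept.values())


def prompt_tokens(examples: List[Example]) -> int:
    """Tokens the few-shot block adds to every request."""
    return sum(count_tokens(text) + count_tokens(label) for text, label in examples)


def daily_savings(before: List[Example], after: List[Example],
                  requests_per_day: int) -> int:
    """Input tokens saved per day by sending `after` instead of `before`."""
    return (prompt_tokens(before) - prompt_tokens(after)) * requests_per_day
```

With 24 eight-token examples across four categories, trimming to one per category saves 160 tokens per request, which at 50,000 requests a day comes to 8,000,000 input tokens daily.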

Scenario Four: The Code Assistant That Capped Too Hard

A coding assistant set an aggressive maximum output length to control costs.

What Went Wrong

The cap was tuned for short answers, but users often asked for full functions or multi-file changes. The model's responses got truncated mid-function, producing broken code that users had to repeatedly ask it to continue. Each continuation was another paid request, so the aggressive cap increased total cost while degrading the experience.

The Fix and Why It Worked

The team raised the output cap to fit typical code responses and added structure so the model signaled when more output was needed rather than being cut off blindly. Total cost fell because fewer continuation requests were needed, illustrating that the cheapest cap is not always the lowest one. The trade-off is examined further in Case Study: Token Budget Management and Optimization in Practice.
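A rough cost model shows why the low cap backfired. It rests on assumptions not stated in the scenario: each continuation request resends the original prompt plus everything generated so far, the user's "continue" message is ignored, and the prices (1 unit per input token, 3 per output token) and token counts are hypothetical.

```python
import math
from typing import Tuple


def tokens_for_response(prompt_tokens: int, response_tokens: int,
                        cap: int) -> Tuple[int, int, int]:
    """Return (requests, total input tokens, total output tokens) for one answer."""
    requests = math.ceil(response_tokens / cap)
    # Request i resends the prompt plus the i * cap tokens already generated.
    input_tokens = sum(prompt_tokens + i * cap for i in range(requests))
    return requests, input_tokens, response_tokens


def total_cost(prompt_tokens: int, response_tokens: int, cap: int,
               in_price: int = 1, out_price: int = 3) -> int:
    """Cost in hypothetical price units of delivering one complete answer."""
    _, input_tokens, output_tokens = tokens_for_response(
        prompt_tokens, response_tokens, cap)
    return input_tokens * in_price + output_tokens * out_price


# A 1,000-token function requested with a 1,500-token prompt:
low_cap = total_cost(1500, 1000, cap=256)    # 4 requests, cost 10536
high_cap = total_cost(1500, 1000, cap=2048)  # 1 request, cost 4500
```

Under these assumptions the output bill is identical in both cases; the extra cost comes entirely from resending input on every continuation.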

Scenario Five: The Summarizer That Ignored Output

A pipeline summarized long articles, and the team optimized hard on input while ignoring output.

What Went Wrong

They compressed and trimmed the input articles carefully, but left output length unbounded. Some summaries ran nearly as long as the source. Because output tokens were priced higher than input tokens, the unbounded summaries dominated the bill despite all the input work.

The Fix and Why It Worked

The team set a target summary length, capped output accordingly, and asked for a structured summary with a fixed number of bullet points. Costs dropped on the expensive side of the ledger, and the summaries became more useful for being concise. The lesson: optimize the half of the budget that actually dominates your cost.
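The arithmetic behind "the half that dominates" is simple enough to sketch. The token counts and the 4:1 output-to-input price ratio below are hypothetical, chosen only to show the shape of the calculation.

```python
from typing import Tuple


def bill_share(input_tokens: int, output_tokens: int,
               in_price: float, out_price: float) -> Tuple[float, float]:
    """Return the (input, output) fractions of a request's total cost."""
    in_cost = input_tokens * in_price
    out_cost = output_tokens * out_price
    total = in_cost + out_cost
    if total == 0:
        return 0.0, 0.0
    return in_cost / total, out_cost / total


# Trimmed input of 2,000 tokens, unbounded summary of 1,500 tokens:
_, out_before = bill_share(2000, 1500, in_price=1, out_price=4)  # 0.75
# Same input, summary capped at 200 tokens:
_, out_after = bill_share(2000, 200, in_price=1, out_price=4)
```

In this example output accounts for three quarters of the cost before the cap, so further input trimming could never have touched most of the bill.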

What the Scenarios Have in Common

Five different features, five different fixes — but a few threads run through all of them, and those threads are the transferable part.

The Right Move Depends on the Feature

Notice that no single technique solved every case. Summarizing history saved the chatbot but was irrelevant to the stateless classifier. Capping output rescued the summarizer but hurt the code assistant when set too aggressively. The lesson is that token budgeting is not a checklist of universal cuts; it is a matter of finding which component dominates a particular feature and treating that one. The shape of the feature decides the fix.

Measurement Pointed to the Target Every Time

In each scenario, the winning move became obvious only once someone looked at where the tokens actually went. The chatbot team did not guess that history was the problem; they saw it grow turn by turn. The classifier team did not assume the examples were redundant; they counted them. Intuition would have sent several of these teams after the wrong component. The measurement-first stance is argued in full in Hard-Won Habits for Keeping Token Spend Under Control.
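Looking at where the tokens go can be as simple as counting each prompt component separately before a request is sent. The sketch below assumes the prompt is assembled from named parts and uses a word count in place of a real tokenizer; the component names are hypothetical.

```python
from typing import Dict, List, Tuple


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def breakdown(components: Dict[str, str]) -> List[Tuple[str, int, float]]:
    """Return (name, tokens, share of total) rows, largest component first."""
    counts = {name: count_tokens(text) for name, text in components.items()}
    total = sum(counts.values())
    rows = [(name, n, n / total if total else 0.0) for name, n in counts.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def dominant(components: Dict[str, str]) -> str:
    """Name of the component consuming the most tokens."""
    return breakdown(components)[0][0]
```

Logging this breakdown for a sample of production requests gives the measured starting point the scenarios rely on.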

Cheaper Is Not Always Less

The code assistant is the cautionary one. The team cut the most obvious cost — output length — and ended up paying more, because truncated answers spawned paid continuations. The cheapest-looking limit was not the cheapest outcome. Whenever a cut forces the system to do extra work to compensate, count the total, not the per-request figure.

Frequently Asked Questions

Why did summarizing history fix the chatbot rather than just truncating?

Truncation drops the oldest turns blindly, which is exactly where the customer's original problem lived. Summarization preserves the important facts in fewer tokens, so the bot keeps the context it needs while staying within budget.

Did trimming retrieved documents hurt answer quality?

It improved quality. A focused context of the most relevant passages removes distracting noise that can pull the model off target, so the answers got both cheaper and better.

Why did an aggressive output cap increase cost?

Because truncated answers forced users to request continuations, and each continuation was another paid request. A cap tuned below typical answer length trades one large request for several, often costing more overall.

Why focus on output in the summarizer?

Output tokens usually cost more than input, so unbounded summaries dominated the bill even after careful input optimization. Capping output addressed the larger, pricier half of the budget.

Are these numbers from a real system?

The structure of each decision reflects common real patterns, but the specific figures are illustrative. The point is the reasoning behind each choice, which transfers to your own measured numbers.

Key Takeaways

Unbounded chat history inflates cost and overflows the window; summarize older turns and keep recent ones verbatim.
Sending whole documents wastes tokens and distracts the model; chunk, rerank, and include only top passages.
High-volume features turn small per-request savings, like trimming few-shot examples, into large total savings.
An output cap set below typical answer length can raise cost by forcing paid continuation requests.
Output is usually the pricier half of the budget, so optimizing input alone can miss the dominant cost.

Agency Script Editorial
