AGENCYSCRIPT
Token Budgets in the Wild: Five Scenarios That Teach

Agency Script Editorial

Editorial Team

August 27, 2022 · 8 min read

Abstract advice about token budgets only goes so far. What makes the discipline click is seeing it applied to a specific feature, with specific components consuming specific shares of the window, and a specific decision that either worked or backfired. This article walks through five scenarios drawn from common LLM features. Each one shows where the tokens went, what choice was made, and why it succeeded or failed.

The scenarios are deliberately varied — a support chatbot, a document question answerer, a batch classifier, a code assistant, and a summarization pipeline — because the right budgeting decision depends heavily on the shape of the feature. A technique that saves a fortune in one context is irrelevant in another. Seeing the contrast is the point.

Read each scenario as a small case to reason about. The numbers are illustrative rather than measured from any single system, but the structure of each decision is exactly the kind you will face in your own work.

Scenario One: The Support Chatbot That Forgot

A customer support chatbot appended every conversation turn to its context and resent the whole history on each request.

What Went Wrong

Short test conversations worked fine. In production, support sessions ran twenty or thirty turns. By the end, each request carried the entire conversation, so cost per turn climbed steadily. Long sessions eventually overflowed the context window, and the earliest turns were silently dropped, including the customer's original problem.

The Fix and Why It Worked

The team kept the last four turns verbatim and replaced older turns with a running summary capturing the customer's issue, what had been tried, and any account details. History was capped at a fixed token budget. Costs per turn flattened, and the bot stopped forgetting the original problem because the summary preserved it deliberately. The mechanics mirror those in Cut Your Token Costs This Afternoon: An Ordered Routine.
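The history policy can be sketched in a few lines. This is a minimal illustration, not the team's code: `count_tokens` is a crude whitespace stand-in for a real tokenizer, `build_history` and the `Turn` type are hypothetical names, the budget of 200 is arbitrary, and the running summary is assumed to be produced elsewhere (for example by a separate model call).

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    # Assumption: whitespace words stand in for tokens. Use your model's tokenizer in practice.
    return len(text.split())


@dataclass
class Turn:
    role: str
    text: str


def build_history(summary: str, turns: list[Turn], keep_last: int = 4, budget: int = 200) -> list[str]:
    """Return the running summary plus the most recent turns, within a token budget."""
    lines = [f"Summary so far: {summary}"]
    used = count_tokens(lines[0])
    kept: list[str] = []
    # Walk newest to oldest so the newest turns survive when the budget is tight.
    for turn in reversed(turns[-keep_last:]):
        line = f"{turn.role}: {turn.text}"
        cost = count_tokens(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    # Restore chronological order for the prompt.
    return lines + list(reversed(kept))
```

The summary line is always included first, so the customer's original problem stays in context no matter how long the session runs, and the history size stays flat instead of growing with every turn.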

Scenario Two: The Document Q&A That Sent Everything

A question-answering feature pulled the most relevant document and pasted its full text into the prompt.

What Went Wrong

Documents ran long, so most requests sent thousands of tokens of mostly irrelevant text. The cost was high, and answer quality was inconsistent because the model had to find the relevant passage inside a wall of unrelated content.

The Fix and Why It Worked

The team chunked documents, reranked chunks by relevance to the question, and included only the top three. Token usage dropped sharply, and answer quality improved because the model saw a focused context instead of a noisy one. This is the classic case where cutting tokens and improving quality point the same direction, a theme in Hard-Won Habits for Keeping Token Spend Under Control.
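A rough sketch of the chunk, rerank, and select steps follows. The scoring function here is simple word overlap, chosen only to keep the example self-contained; it is a stand-in for whatever reranker a real system would use. The function names and the 40-word chunk size are assumptions.

```python
def chunk(text: str, size: int = 40) -> list[str]:
    """Split a document into fixed-size chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(question: str, passage: str) -> int:
    # Assumption: shared lowercase words approximate relevance. A real reranker would replace this.
    return len(set(question.lower().split()) & set(passage.lower().split()))


def top_chunks(question: str, document: str, k: int = 3, size: int = 40) -> list[str]:
    """Return the k chunks that score highest against the question."""
    ranked = sorted(chunk(document, size), key=lambda c: overlap_score(question, c), reverse=True)
    return ranked[:k]
```

Only the output of `top_chunks` goes into the prompt, so the prompt size is bounded by `k * size` words regardless of how long the source document is.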

Scenario Three: The Classifier With a Novel-Length Prompt

A batch classification job categorized incoming tickets using a system prompt loaded with examples.

What Went Wrong

The system prompt contained two dozen few-shot examples, sent on every one of tens of thousands of daily classifications. Most of the examples were redundant, covering the same categories repeatedly, but they were paid for on every request.

The Fix and Why It Worked

The team trimmed the examples to a representative handful, one per category, and confirmed accuracy held on a validation set. Because the saving applied to every request at high volume, the trimmed prompt cut the job's cost substantially with no measurable accuracy loss. Volume turned a small per-request saving into a large one.
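The arithmetic behind that saving is easy to check. The sketch below keeps one example per category and compares daily prompt tokens before and after. The figures in the usage note (four categories, 50 tokens per example, 20,000 requests a day) are invented for illustration, and the validation-set accuracy check the team ran is not shown.

```python
def one_per_category(examples: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first (text, label) example seen for each label, preserving order."""
    seen: set[str] = set()
    kept: list[tuple[str, str]] = []
    for text, label in examples:
        if label not in seen:
            seen.add(label)
            kept.append((text, label))
    return kept


def daily_prompt_tokens(examples: list[tuple[str, str]], requests_per_day: int, tokens_per_example: int) -> int:
    """Tokens spent per day on few-shot examples alone (illustrative model)."""
    return len(examples) * tokens_per_example * requests_per_day
```

With 24 examples spread over 4 hypothetical categories, 50 tokens each, and 20,000 requests a day, the examples cost 24,000,000 tokens a day before trimming and 4,000,000 after.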

Scenario Four: The Code Assistant That Capped Too Hard

A coding assistant set an aggressive maximum output length to control costs.

What Went Wrong

The cap was tuned for short answers, but users often asked for full functions or multi-file changes. The model's responses were truncated mid-function, leaving broken code and forcing users to ask repeatedly for a continuation. Each continuation was another paid request, so the aggressive cap increased total cost while degrading the experience.

The Fix and Why It Worked

The team raised the output cap to fit typical code responses and added structure so the model signaled when more output was needed rather than being cut off blindly. Total cost fell because fewer continuation requests were needed, illustrating that the cheapest cap is not always the lowest one. The trade-off is examined further in Case Study: Token Budget Management and Optimization in Practice.
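A small model makes the effect concrete. It assumes each continuation request resends the original prompt plus the output produced so far, and it counts tokens rather than money; both are simplifications, and the numbers are illustrative.

```python
import math


def total_tokens(needed_output: int, cap: int, prompt_tokens: int) -> tuple[int, int]:
    """Return (requests, total billed tokens) to produce `needed_output` tokens under an output cap."""
    requests = math.ceil(needed_output / cap)
    total_input = 0
    produced = 0
    for _ in range(requests):
        # Assumption: every continuation resends the prompt plus everything generated so far.
        total_input += prompt_tokens + produced
        produced += min(cap, needed_output - produced)
    return requests, total_input + produced


print(total_tokens(needed_output=1200, cap=300, prompt_tokens=2000))   # (4, 11000)
print(total_tokens(needed_output=1200, cap=1500, prompt_tokens=2000))  # (1, 3200)
```

Under these assumptions the tight cap bills more than three times as many tokens for the same answer, because the prompt is paid for four times instead of once.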

Scenario Five: The Summarizer That Ignored Output

A pipeline summarized long articles, and the team optimized hard on input while ignoring output.

What Went Wrong

They compressed and trimmed the input articles carefully, but left output length unbounded. Some summaries ran nearly as long as the source. Because output tokens were priced higher than input tokens, the unbounded summaries dominated the bill despite all the input work.

The Fix and Why It Worked

The team set a target summary length, capped output accordingly, and asked for a structured summary with a fixed number of bullet points. Costs dropped on the expensive side of the ledger, and the summaries became more useful for being concise. The lesson: optimize the half of the budget that actually dominates your cost.
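To see which half dominates, compute the output share of a request's cost. The prices below (1.0 per million input tokens, 4.0 per million output tokens) are hypothetical placeholders, not any provider's rates; substitute your own.

```python
def request_cost(input_tokens: int, output_tokens: int, in_price: float = 1.0, out_price: float = 4.0) -> float:
    """Cost of one request, given hypothetical prices per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def output_share(input_tokens: int, output_tokens: int, in_price: float = 1.0, out_price: float = 4.0) -> float:
    """Fraction of the request's cost that comes from output tokens."""
    out = output_tokens * out_price
    return out / (input_tokens * in_price + out)


# Illustrative figures: a 3,000-token trimmed article with an unbounded vs. a capped summary.
print(round(output_share(3000, 2500), 2))
print(round(output_share(3000, 300), 2))
```

At these assumed prices, a 2,500-token summary of a 3,000-token input accounts for roughly three quarters of the request's cost, while a 300-token summary accounts for less than a third.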

What the Scenarios Have in Common

Five different features, five different fixes — but a few threads run through all of them, and those threads are the transferable part.

The Right Move Depends on the Feature

Notice that no single technique solved every case. Summarizing history saved the chatbot but was irrelevant to the stateless classifier. Capping output rescued the summarizer but hurt the code assistant when set too aggressively. The lesson is that token budgeting is not a checklist of universal cuts; it is a matter of finding which component dominates a particular feature and treating that one. The shape of the feature decides the fix.

Measurement Pointed to the Target Every Time

In each scenario, the winning move became obvious only once someone looked at where the tokens actually went. The chatbot team did not guess that history was the problem; they saw it grow turn by turn. The classifier team did not assume the examples were redundant; they counted them. Intuition would have sent several of these teams after the wrong component. The measurement-first stance is argued in full in Hard-Won Habits for Keeping Token Spend Under Control.
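Measurement of this kind can start very simply: count tokens per prompt component and sort by size. The sketch below uses a whitespace word count as a stand-in for a real tokenizer, and the component names are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace words stand in for tokens. Use your model's tokenizer in practice.
    return len(text.split())


def token_breakdown(components: dict[str, str]) -> list[tuple[str, int, float]]:
    """Return (name, tokens, share of total) per prompt component, largest first."""
    counts = {name: count_tokens(text) for name, text in components.items()}
    total = sum(counts.values()) or 1  # avoid division by zero on an empty prompt
    rows = [(name, n, n / total) for name, n in counts.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Logging this breakdown for a sample of production requests shows at a glance whether history, retrieved text, examples, or output is the component worth treating.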

Cheaper Is Not Always Less

The code assistant is the cautionary one. The team cut the most obvious cost — output length — and ended up paying more, because truncated answers spawned paid continuations. The cheapest-looking limit was not the cheapest outcome. Whenever a cut forces the system to do extra work to compensate, count the total, not the per-request figure.

Frequently Asked Questions

Why did summarizing history fix the chatbot rather than just truncating?

Truncation drops the oldest turns blindly, which is exactly where the customer's original problem lived. Summarization preserves the important facts in fewer tokens, so the bot keeps the context it needs while staying within budget.

Did trimming retrieved documents hurt answer quality?

It improved quality. A focused context of the most relevant passages removes distracting noise that can pull the model off target, so the answers got both cheaper and better.

Why did an aggressive output cap increase cost?

Because truncated answers forced users to request continuations, and each continuation was another paid request. A cap tuned below typical answer length trades one large request for several, often costing more overall.

Why focus on output in the summarizer?

Output tokens usually cost more than input tokens, so unbounded summaries dominated the bill even after careful input optimization. Capping output addressed the larger, pricier half of the budget.

Are these numbers from a real system?

The structure of each decision reflects common real patterns, but the specific figures are illustrative. The point is the reasoning behind each choice, which transfers to your own measured numbers.

Key Takeaways

  • Unbounded chat history inflates cost and overflows the window; summarize older turns and keep recent ones verbatim.
  • Sending whole documents wastes tokens and distracts the model; chunk, rerank, and include only top passages.
  • High-volume features turn small per-request savings, like trimming few-shot examples, into large total savings.
  • An output cap set below typical answer length can raise cost by forcing paid continuation requests.
  • Output is usually the pricier half of the budget, so optimizing input alone can miss the dominant cost.

The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.

Related Articles

General

Prompt Quality Decides Whether AI Earns Its Keep

Prompt quality is the single biggest variable in whether AI delivers real work or expensive noise. The model matters, the platform matters — but the prompt you write determines whether you get a first

A
Agency Script Editorial
June 1, 2026·10 min read
General

Counting the Real Cost of Every Token You Send

Tokens and context windows sit at the intersection of AI capability and operational cost—yet most business cases treat them as technical footnotes. That's a mistake that costs real money. Every time y

A
Agency Script Editorial
June 1, 2026·10 min read
General

Rolling Out AI Hallucinations Across a Team

Most teams discover AI hallucinations the hard way — a confident-sounding wrong answer makes it into a client deliverable, a legal brief, or a published report. The damage isn't just to the output; it

A
Agency Script Editorial
June 1, 2026·11 min read

Ready to certify your AI capability?

Join the professionals building governed, repeatable AI delivery systems.

Explore Certification