
How One Team Halved Its LLM Bill Without Losing Quality

Agency Script Editorial

Editorial Team

August 23, 2022 · 8 min read

The fastest way to understand token budgeting is to follow a team through a real problem from start to finish. This is a composite account of a support automation team whose language model bill grew faster than its traffic for three straight months, while nobody could explain the gap between usage and cost. It follows them through diagnosis, decision, execution, and the measured outcome, and ends with the lessons they took away.

The situation is common enough to be instructive: a feature that worked well in testing, shipped without a deliberate token budget, and slowly turned into the largest line item on the engineering invoice. What makes the story useful is not that they fixed it — many teams do — but the specific decisions they made and the order in which they made them.

The names and exact figures are illustrative, but the shape of the problem and the moves that resolved it reflect patterns we see repeatedly. Read it as a worked example of the discipline applied under real pressure.

The Situation

The team ran a support assistant that answered customer questions using both conversation history and retrieved help-center articles.

Symptoms

Over a quarter, traffic grew about 30 percent while the model bill nearly doubled. Long support sessions occasionally produced incoherent answers, as if the assistant had forgotten earlier context. Nobody could point to a cause because nobody was measuring where tokens went.
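The gap between traffic and cost is easier to see as a per-request figure. The sketch below assumes round values for the two figures in the text (1.30 for "about 30 percent" and 2.00 for "nearly doubled"); the variable names are illustrative, not from the team's tooling.

```python
# Illustrative figures only: rounded from "about 30 percent" and "nearly doubled".
traffic_growth = 1.30  # requests now / requests at start of quarter
bill_growth = 2.00     # bill now / bill at start of quarter

# bill = requests * cost_per_request, so the per-request ratio is the quotient.
cost_per_request_growth = bill_growth / traffic_growth

print(f"cost per request grew {cost_per_request_growth:.2f}x")  # prints 1.54x
```

Under these assumed numbers, each request cost roughly one and a half times what it did at the start of the quarter, which is the part traffic alone cannot explain.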

The Pressure

Finance flagged the line item, and leadership asked for a plan to cut it by half without degrading the support experience. The team had two weeks. The full discipline they reached for is laid out in Spending Tokens Like Money: A Working Manual for LLM Budgets.

The Diagnosis

Before touching anything, the team instrumented the system to see where tokens actually went.

Instrumenting the Prompt

They logged token counts for each prompt component — system prompt, retrieved articles, conversation history, and user message — plus output tokens, across a sample of real sessions. The breakdown was revealing.
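A minimal sketch of what per-component logging could look like. The whitespace-based counter is a stand-in assumption and should be replaced with the tokenizer that matches the model in use; `token_breakdown` and the component names are hypothetical.

```python
import json
from typing import Callable, Dict

def count_tokens_approx(text: str) -> int:
    """Stand-in tokenizer: whitespace word count. Swap in the model's real tokenizer."""
    return len(text.split())

def token_breakdown(components: Dict[str, str], output: str,
                    count: Callable[[str], int] = count_tokens_approx) -> Dict[str, int]:
    """Count tokens for each prompt component, plus input total and output."""
    breakdown = {name: count(text) for name, text in components.items()}
    breakdown["input_total"] = sum(breakdown.values())  # computed before output is added
    breakdown["output"] = count(output)
    return breakdown

record = token_breakdown(
    {
        "system_prompt": "You are a support assistant.",
        "retrieved_articles": "Full text of a help-center article about refunds ...",
        "conversation_history": "Customer: my refund is late. Agent: let me check.",
        "user_message": "Any update on my refund?",
    },
    output="Your refund was issued yesterday.",
)
print(json.dumps(record))  # one structured log line per request
```

Emitting one structured record per request keeps the data easy to aggregate later, whatever logging pipeline is already in place.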

What They Found

Retrieved articles were sent as full documents and accounted for nearly half of input tokens. Conversation history was unbounded: it grew with every turn of a long session, occasionally overflowing the context window and silently dropping the customer's original question. Output was uncapped, and a tail of very long answers drove a disproportionate share of cost. The diagnostic approach matches Cut Your Token Costs This Afternoon: An Ordered Routine.
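Turning logged counts into a ranked breakdown might look like the sketch below. The session numbers are invented for illustration and chosen only so that retrieval lands near half of input tokens, as in the account; they are not the team's data.

```python
from collections import Counter

# Made-up per-session input token counts, for illustration only.
sessions = [
    {"system_prompt": 300, "retrieved_articles": 2400,
     "conversation_history": 1900, "user_message": 100},
    {"system_prompt": 300, "retrieved_articles": 2300,
     "conversation_history": 2200, "user_message": 200},
]

totals = Counter()
for session in sessions:
    totals.update(session)  # adds counts per component

grand_total = sum(totals.values())
# most_common() orders components from largest to smallest consumer.
shares = {name: count / grand_total for name, count in totals.most_common()}

for name, share in shares.items():
    print(f"{name:22s} {share:6.1%}")
```

Sorting by share is what makes the order of attack a measured decision rather than a guess.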

The Decisions

With the breakdown in hand, the targets were obvious, and the team prioritized by expected return.

Retrieval First

Because retrieved articles were the single largest consumer, they led with it. The decision was to chunk articles, rerank chunks against the question, and include only the top three rather than full documents.
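The chunk, rerank, and top-three decision can be sketched as follows. The word-overlap score is a toy stand-in for whatever reranker the team actually used, which the account does not specify; all function names and the chunk size are assumptions.

```python
import re
from typing import List

def chunk_article(text: str, max_words: int = 60) -> List[str]:
    """Split an article into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def overlap_score(question: str, chunk: str) -> int:
    """Toy relevance score: distinct question words that also appear in the chunk."""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    c_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q_words & c_words)

def top_chunks(question: str, articles: List[str], k: int = 3,
               max_words: int = 60) -> List[str]:
    """Chunk every article, rank chunks against the question, keep the best k."""
    chunks = [c for article in articles for c in chunk_article(article, max_words)]
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    return ranked[:k]

articles = [
    "Refunds are issued within five days. Contact billing for refund status.",
    "Password reset links expire after one hour. Use the login page to request another.",
    "Shipping takes three days.",
]
top = top_chunks("Where is my refund?", articles, k=3, max_words=6)
print(top[0])  # Contact billing for refund status.
```

A production reranker would likely use embeddings or a cross-encoder, but the shape is the same: the prompt receives a fixed number of small chunks instead of whole documents.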

History Second

The unbounded history both cost money and caused the incoherence, so it was the next target. The decision was to keep the last four turns verbatim, summarize older turns into a running record, and cap history at a fixed token budget.
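One way the history policy could be expressed is sketched below. The summarizer is a hypothetical stub (in practice probably a model call), the tokenizer is a whitespace stand-in, and trimming the summary to fit the remaining budget is an assumption of this sketch rather than something the account describes.

```python
from typing import Callable, List

def count_tokens_approx(text: str) -> int:
    """Stand-in tokenizer (whitespace words); replace with the model's real tokenizer."""
    return len(text.split())

def naive_summary(older_turns: List[str]) -> str:
    """Hypothetical summarizer stub; a real one would condense, not concatenate."""
    return "Summary of earlier turns: " + " | ".join(older_turns)

def build_history(turns: List[str],
                  summarize: Callable[[List[str]], str] = naive_summary,
                  keep_last: int = 4, budget: int = 200) -> List[str]:
    """Keep the last keep_last turns verbatim, summarize the rest, stay within budget.

    Assumes keep_last >= 1.
    """
    recent = list(turns[-keep_last:])
    older = list(turns[:-keep_last])
    # If the verbatim turns alone exceed the budget, fold the oldest into the summary.
    while len(recent) > 1 and sum(map(count_tokens_approx, recent)) > budget:
        older.append(recent.pop(0))
    remaining = max(budget - sum(map(count_tokens_approx, recent)), 0)
    # Crude cap: truncate the summary to whatever budget is left.
    summary_words = summarize(older).split()[:remaining] if older else []
    summary = [" ".join(summary_words)] if summary_words else []
    return summary + recent

turns = [f"turn {i} about the refund" for i in range(10)]
history = build_history(turns, keep_last=4, budget=30)
print(len(history), sum(count_tokens_approx(p) for p in history))  # 5 30
```

The summary goes first so the customer's original problem stays in front of the recent turns instead of falling off the end of the window.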

Output Third

Finally, they decided to cap output length and ask for more concise, structured answers, since the long tail of verbose responses fell on output tokens, which are typically priced higher than input tokens.

The Execution

The team rolled the changes out carefully rather than all at once, to isolate effects and catch regressions.

Staged Rollout

They shipped retrieval changes first to a fraction of traffic, compared token counts and answer quality against the baseline, and confirmed both improved before expanding. History and output changes followed the same pattern.
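Routing a fixed fraction of traffic to a change is often done with deterministic hashing, so a given session always sees the same variant. The sketch below assumes a session identifier is available; `in_rollout` and the feature name are hypothetical.

```python
import hashlib

def in_rollout(session_id: str, feature: str, fraction: float) -> bool:
    """Deterministically assign a session to the rollout group for one feature."""
    digest = hashlib.sha256(f"{feature}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < fraction

# The same session always gets the same answer for the same feature.
print(in_rollout("session-42", "chunked_retrieval", 0.10))
```

Including the feature name in the hash keeps the retrieval, history, and output experiments from all landing on the same subset of sessions.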

Guarding Quality

For each change they compared answers before and after on a fixed set of real questions. When the first history summary lost some account details, they adjusted what the summary preserved and re-verified. This is where the work became engineering rather than arithmetic, the same tension covered in Token Budget Management and Optimization: Real-World Examples and Use Cases.
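A before-and-after comparison on a fixed question set might be automated along these lines. The string check for required details is a deliberately simple assumption; the account does not say how the team judged answers, and human review or a graded rubric would be reasonable alternatives. The case data is invented.

```python
from typing import Callable, Dict, List

def find_regressions(cases: List[Dict],
                     baseline: Callable[[str], str],
                     candidate: Callable[[str], str]) -> List[str]:
    """Return questions where the candidate drops a required detail the baseline kept."""
    regressions = []
    for case in cases:
        before = baseline(case["question"]).lower()
        after = candidate(case["question"]).lower()
        for detail in case["must_mention"]:
            needle = detail.lower()
            if needle in before and needle not in after:
                regressions.append(case["question"])
                break  # one missing detail is enough to flag the case
    return regressions

# Invented example: the candidate answer loses the order number.
cases = [{"question": "Where is my refund?", "must_mention": ["order 1234"]}]
baseline_answers = {"Where is my refund?": "Refund for order 1234 was issued."}
candidate_answers = {"Where is my refund?": "Your refund was issued."}

regressed = find_regressions(cases, baseline_answers.get, candidate_answers.get)
print(regressed)  # ['Where is my refund?']
```

A check like this would surface the kind of lost account detail the first history summary produced, before the change reached more traffic.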

The Outcome

The measured result met the target and produced a few unexpected benefits.

The Numbers

Input tokens per request fell by more than half, driven mostly by the retrieval change. Output cost dropped once length was capped. Overall cost per request came down enough to bring the bill back below its level from the start of the quarter, despite higher traffic.

The Surprises

Answer quality went up, not down. The focused retrieval context removed distracting noise, and the deliberate history summary stopped the assistant from forgetting the customer's original problem. Cutting cost and improving the experience turned out to point in the same direction.

What They Kept

They moved every limit into central configuration and added a monthly review of token telemetry, so the gains would not quietly erode. The resulting review routine resembles The Token Budget Management and Optimization Checklist for 2026.
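Central configuration could be as small as one immutable object that every prompt-building path reads from. In this sketch the top-three and last-four values come from the account, while the two token budgets are placeholder assumptions; the class and field names are hypothetical.

```python
import dataclasses
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenLimits:
    retrieval_top_k: int = 3          # from the account: top three chunks
    history_verbatim_turns: int = 4   # from the account: last four turns verbatim
    history_token_budget: int = 1500  # assumed placeholder value
    max_output_tokens: int = 400      # assumed placeholder value

LIMITS = TokenLimits()

# Experiments derive a copy instead of mutating the shared defaults.
experiment = dataclasses.replace(LIMITS, retrieval_top_k=5)
print(LIMITS.retrieval_top_k, experiment.retrieval_top_k)  # 3 5
```

Freezing the dataclass means a limit can only change through a visible edit to the configuration, which is what makes a monthly review meaningful.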

The Lessons They Carried Forward

The team came out of the two weeks with more than a smaller bill. They came out with a set of habits they applied to every feature after it.

Measurement Is Not Optional

Their biggest regret was shipping the feature without any token instrumentation in the first place. For three months the cost grew and nobody could say why, because the data did not exist. After this, every new LLM feature shipped with per-component token logging from day one. The cost of instrumentation was trivial next to the cost of flying blind.

Cost and Quality Are Not Always Opposed

Going in, the team assumed cutting cost meant accepting worse answers, and they braced leadership for that trade. The opposite happened. Focused retrieval and a deliberate history summary improved the experience while reducing tokens. The lesson was not that this always happens, but that the trade-off should be measured rather than assumed. Sometimes the cheaper design is also the better one.

Order of Attack Matters

They saved the most by going after the largest consumer first. Had they started with the system prompt, which was already lean, they would have spent effort for little return. Prioritizing by measured size, rather than by what felt wasteful, directed their limited time to where it paid. This prioritization is the spine of Cut Your Token Costs This Afternoon: An Ordered Routine.

Gains Need Guardrails

The final habit was distrust of their own discipline. They knew that without enforcement, the components would creep back. Centralized limits and a monthly review were not bureaucracy; they were the only thing standing between the savings and their slow reversal.
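The review itself can be backed by a small check that compares observed averages against the configured budgets and flags any component that has crept up. The budgets, observed values, and 10 percent tolerance below are all invented for illustration.

```python
from typing import Dict, List

def components_over_budget(observed_avg: Dict[str, float],
                           budgets: Dict[str, float],
                           tolerance: float = 0.10) -> List[str]:
    """Return components whose average tokens exceed budget by more than tolerance."""
    return sorted(
        name for name, avg in observed_avg.items()
        if name in budgets and avg > budgets[name] * (1 + tolerance)
    )

# Invented numbers: history has crept well past its budget, output only slightly.
budgets = {"retrieved_articles": 900, "conversation_history": 1500, "output": 400}
observed = {"retrieved_articles": 880, "conversation_history": 1900, "output": 430}

flagged = components_over_budget(observed, budgets)
print(flagged)  # ['conversation_history']
```

Run against the same per-component telemetry used for the original diagnosis, a check like this turns the review from a reading exercise into a short list of components to investigate.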

Frequently Asked Questions

Why did the bill grow faster than traffic?

Because two components grew with use rather than staying fixed. Unbounded history got more expensive every turn, and full-document retrieval sent large amounts of text per request. Both compounded as sessions lengthened and traffic rose.

Why start with retrieval instead of history?

Retrieval was the single largest token consumer, so equal effort there returned the most. The team prioritized by measured size, and the breakdown put retrieved articles at the top.

How did quality improve while cost fell?

Focused retrieval removed distracting irrelevant text, helping the model find the right answer, and a deliberate history summary preserved the customer's original problem instead of dropping it. Both changes happened to improve answers while reducing tokens.

Was a staged rollout necessary?

It was prudent. Shipping changes to a fraction of traffic first let the team compare against the baseline and catch a history-summary regression before it reached everyone. Isolating each change made the effects legible.

How did they prevent the savings from reversing?

By centralizing every limit in configuration and instituting a monthly review of token telemetry. Enforcement and regular review keep components from creeping back to their old sizes.

Key Takeaways

  • A bill that grows faster than traffic usually points to components that scale with use, like unbounded history and full-document retrieval.
  • Instrument prompt assembly to see per-component token usage before deciding what to change.
  • Prioritize optimization by measured size, attacking the largest consumer first for the best return.
  • Roll out changes in stages and verify quality against a fixed set of real cases at each step.
  • Centralize limits and review telemetry regularly so the savings do not erode over time.
