Spending Tokens Like Money: A Working Manual for LLM Budgets

Agency Script Editorial

Editorial Team

September 16, 2022 · 8 min read
Tags: token budget management and optimization, token budget management and optimization guide, prompt engineering

Every call you make to a language model has a price, and that price is denominated in tokens. A token is roughly three-quarters of an English word, and both the text you send and the text the model returns count against your bill and against the fixed size of the model's context window. Treat tokens carelessly and two things happen at once: your invoice climbs faster than your usage, and your prompts start hitting the ceiling of what the model can hold in working memory at any one time.

Token budget management is the discipline of deciding, deliberately, how many tokens each part of your application is allowed to consume. It is part cost control and part engineering constraint. The cost angle is obvious — fewer tokens, smaller bill. The engineering angle is subtler and often more important: a context window is finite, so every token spent on boilerplate or stale conversation history is a token unavailable for the instructions and data that actually drive a good answer.

This manual treats the subject end to end. It explains where tokens come from, how to measure them before they surprise you, how to allocate a fixed budget across the moving parts of a prompt, and how to compress without losing meaning. The goal is not to make every prompt as short as possible. The goal is to spend deliberately, so that the tokens you do use buy the maximum amount of useful output.

What a Token Actually Is

Before you can budget tokens, you need an honest mental model of what one is.

Tokens Versus Words

Tokenizers split text into subword units. Common words usually map to a single token, but rare words, code, punctuation, and non-English text often fragment into several. A useful rule of thumb for English prose is that 1,000 tokens is about 750 words, but do not rely on the rule when precision matters; measure instead. Code and structured data typically use more tokens per character than prose, because braces, indentation, and identifiers each consume tokens.
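
As a rough illustration of the rule of thumb, the sketch below estimates a token count from a word count. The name `estimate_tokens` and the 0.75 ratio are assumptions that only make sense for English prose; a real tokenizer will report different numbers, especially for code.

```python
import math

# Rule-of-thumb ratio for English prose. An assumption, not a measurement.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


# 750 words comes out at about 1,000 tokens under this ratio.
prose_estimate = estimate_tokens("word " * 750)
```

An estimator like this is only useful for early sizing. Anything that enforces a limit should use the provider's real count.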

Input Tokens Versus Output Tokens

Your budget has two halves. Input tokens are everything you send: system prompt, instructions, retrieved documents, conversation history, and the user's current message. Output tokens are what the model generates. Most providers price these differently, with output usually costing more per token than input. That asymmetry matters: a verbose answer can dominate your cost even when your prompt is lean.
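
To make the asymmetry concrete, here is a minimal cost calculation. The prices are hypothetical placeholders chosen for illustration, not any provider's actual rates, and `request_cost` is a name invented for this sketch.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_million: float,
                 output_price_per_million: float) -> float:
    """Cost of one request, in the same currency as the prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Hypothetical prices, used only to show the effect of pricier output.
INPUT_PRICE = 1.00   # per million input tokens
OUTPUT_PRICE = 4.00  # per million output tokens

lean_prompt_long_answer = request_cost(500, 2_000, INPUT_PRICE, OUTPUT_PRICE)
long_prompt_short_answer = request_cost(2_000, 500, INPUT_PRICE, OUTPUT_PRICE)
```

Both requests move 2,500 tokens in total, but under these assumed prices the one with the long answer costs roughly twice as much.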

The Context Window Ceiling

The context window is the hard limit on input plus output combined. When a conversation grows past it, the oldest content must be dropped or summarized or the request fails outright. A token budget is therefore not just a money question — it is a capacity question, and the two interact constantly.

Measuring Before You Optimize

Optimization without measurement is guessing. Establish a baseline first.

Count Tokens at the Source

Most major providers offer a tokenizer library or a token-counting endpoint you can call before sending a request. Counting tokens at the point where a prompt is assembled lets you log the size of each component (system prompt, retrieved context, history, user input) separately. Once you can see where tokens go, the expensive parts usually become obvious.
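
One way to get per-component numbers is to pass each piece of the prompt through a counting function at assembly time. In the sketch below, `measure_components` is a hypothetical helper and `whitespace_counter` is a stand-in that counts words, not real tokens; in practice you would swap in your provider's tokenizer.

```python
from typing import Callable, Dict


def whitespace_counter(text: str) -> int:
    """Placeholder counter: whitespace-separated words, NOT real tokens."""
    return len(text.split())


def measure_components(components: Dict[str, str],
                       count_tokens: Callable[[str], int]) -> Dict[str, int]:
    """Return the size of each named prompt component plus a total."""
    sizes = {name: count_tokens(text) for name, text in components.items()}
    sizes["total"] = sum(sizes.values())
    return sizes


sizes = measure_components(
    {
        "system": "You are a helpful assistant.",
        "history": "",
        "user": "Summarize this report.",
    },
    whitespace_counter,
)
```

Because the counter is injected, the same measurement code keeps working when you change models or tokenizers.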

Track Per-Request and Aggregate Cost

Log token counts on every request along with the feature that triggered it. Aggregate by feature, by user, and by time window. A single chat turn that looks cheap can become your largest line item when it runs ten thousand times a day. The same instinct applies when you weigh model choice; the trade-offs in The Best Tools for Token Budget Management and Optimization start from exactly this kind of telemetry.
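
A minimal sketch of that aggregation, assuming each request is logged as a small record. The `UsageRecord` shape and the feature names are hypothetical, and the sample values are illustrative rather than real measurements.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class UsageRecord:
    feature: str        # hypothetical label for the feature that made the call
    input_tokens: int
    output_tokens: int


def aggregate_by_feature(records: Iterable[UsageRecord]) -> Dict[str, Dict[str, int]]:
    """Sum request counts and token counts per feature."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"requests": 0, "input_tokens": 0, "output_tokens": 0}
    )
    for record in records:
        bucket = totals[record.feature]
        bucket["requests"] += 1
        bucket["input_tokens"] += record.input_tokens
        bucket["output_tokens"] += record.output_tokens
    return dict(totals)


# Illustrative records only; real data would come from your request logs.
log = [
    UsageRecord("chat", 400, 150),
    UsageRecord("chat", 420, 900),
    UsageRecord("search", 2_000, 100),
]
summary = aggregate_by_feature(log)
```

The same loop extends to grouping by user or by time window by changing the key.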

Watch the Output Side

Because output tokens often cost more, measure the length distribution of your responses, not just the average. A long tail of runaway generations can quietly double your bill. Setting a sensible maximum output length is one of the highest-leverage controls you have.
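
The sketch below shows why the average hides the tail. It uses a simple nearest-rank percentile written for this example; the response lengths are made-up values, with one runaway generation included on purpose.

```python
import math
from typing import List


def percentile(values: List[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list, with pct from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative output lengths in tokens; the last one is a runaway response.
lengths = [120, 130, 110, 125, 115, 140, 135, 118, 122, 2400]

mean_length = sum(lengths) / len(lengths)  # 351.5, pulled up by one outlier
p50 = percentile(lengths, 50)              # 122
p90 = percentile(lengths, 90)              # 140
p99 = percentile(lengths, 99)              # 2400
```

Here the median response is 122 tokens, yet a single outlier nearly triples the mean. A maximum output length set a little above the typical range would cap that tail.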

Allocating a Fixed Budget

Once you can measure, you can allocate. Treat the context window like a fixed container and decide in advance how much each component gets.

Reserve Space for the Answer

Start by carving out room for the output. If you need up to 800 tokens of answer, those tokens are not available for input. Reserve them first, then divide the remainder among input components.
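
The arithmetic is simple enough to write down. The window size below is a hypothetical figure, not a specific model's limit, and `input_budget` is a name chosen for this sketch.

```python
def input_budget(context_window: int, reserved_output: int) -> int:
    """Tokens left for input after reserving room for the answer."""
    if reserved_output >= context_window:
        raise ValueError("output reserve must be smaller than the context window")
    return context_window - reserved_output


# Hypothetical 8,000-token window with 800 tokens reserved for the answer.
available_for_input = input_budget(8_000, 800)  # 7200
```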

Rank Components by Value

Not every input token earns its keep. The system prompt and current user message are usually non-negotiable. Retrieved context and conversation history are where most waste hides. Rank these by how much they improve answers, and give the budget to the high-value items first.
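
One possible way to turn a ranking into an allocation is a greedy pass in priority order. This is a sketch under assumptions: the `Component` type, the priority numbers, and the sizes are all hypothetical, and a real system might prefer proportional or minimum-guarantee schemes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Component:
    name: str
    tokens: int    # measured size of the component
    priority: int  # lower number means more valuable (hypothetical ranking)


def allocate(components: List[Component], budget: int) -> Dict[str, int]:
    """Grant tokens in priority order; later components get what is left."""
    grants: Dict[str, int] = {}
    remaining = budget
    for component in sorted(components, key=lambda c: c.priority):
        granted = min(component.tokens, remaining)
        grants[component.name] = granted
        remaining -= granted
    return grants


grants = allocate(
    [
        Component("history", 3_000, 2),
        Component("system", 600, 0),
        Component("retrieved", 5_000, 1),
        Component("user", 200, 0),
    ],
    budget=7_200,
)
```

With these numbers the system prompt, user message, and retrieved context are granted in full, and history is cut to the 1,400 tokens that remain.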

Set Caps, Not Hopes

A budget that exists only as an intention will be exceeded. Enforce caps in code: truncate retrieved documents to a token limit, cap history to the last N turns or a token ceiling, and refuse or summarize when input would exceed the reserve. This kind of structured discipline pairs well with the staged model in A Framework for Token Budget Management and Optimization.
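
A minimal sketch of a history cap that enforces both limits at once: at most N turns and at most a token ceiling, keeping the most recent turns. The helper names are hypothetical, and `word_counter` is a stand-in for a real tokenizer.

```python
from typing import Callable, List


def word_counter(text: str) -> int:
    """Placeholder counter: whitespace-separated words, NOT real tokens."""
    return len(text.split())


def cap_history(turns: List[str], max_turns: int, max_tokens: int,
                count_tokens: Callable[[str], int]) -> List[str]:
    """Keep the most recent turns that satisfy both caps, in original order."""
    recent = turns[-max_turns:] if max_turns > 0 else []
    kept: List[str] = []
    used = 0
    for turn in reversed(recent):
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break  # stop at the first turn that would break the ceiling
        kept.append(turn)
        used += cost
    kept.reverse()
    return kept


capped = cap_history(["a b c", "d e", "f g h i", "j"],
                     max_turns=3, max_tokens=5, count_tokens=word_counter)
```

Walking backward from the newest turn means the cap always drops the oldest material first.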

Compression Techniques That Preserve Meaning

Cutting tokens is easy. Cutting tokens without cutting quality is the skill.

Summarize Conversation History

Long chat sessions do not need every prior turn verbatim. Periodically replace older turns with a compact summary that preserves decisions, facts, and open questions. This keeps the running context small while retaining what the model needs to stay coherent.
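
The shape of that replacement can be sketched as follows. The `summarize` argument is a hypothetical callable; in a real system it would probably call a model, while the stand-in here only reports how many turns were folded away.

```python
from typing import Callable, List


def compact_history(turns: List[str], keep_recent: int,
                    summarize: Callable[[List[str]], str]) -> List[str]:
    """Replace all but the last keep_recent turns with a single summary turn."""
    if keep_recent < 0:
        raise ValueError("keep_recent must not be negative")
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    summary_turn = "Summary of earlier conversation: " + summarize(older)
    return [summary_turn] + recent


def placeholder_summarizer(older: List[str]) -> str:
    """Stand-in only; a real summarizer would preserve decisions and facts."""
    return f"{len(older)} earlier turns omitted"


compacted = compact_history(["t1", "t2", "t3", "t4", "t5"],
                            keep_recent=2, summarize=placeholder_summarizer)
```

Keeping the last few turns verbatim leaves the detail where the model is most likely to need it.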

Trim Retrieved Context

When you pull documents from a retrieval system, you rarely need the full text. Rerank passages by relevance and include only the top few. Strip headers, navigation, and repeated boilerplate before the text reaches the prompt.
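
A sketch of both steps, assuming passages already arrive with relevance scores from some reranker. The function names, the boilerplate set, and the word-based counter are all hypothetical stand-ins.

```python
from typing import Callable, List, Set, Tuple


def word_counter(text: str) -> int:
    """Placeholder counter: whitespace-separated words, NOT real tokens."""
    return len(text.split())


def strip_boilerplate(text: str, boilerplate: Set[str]) -> str:
    """Drop lines that exactly match known navigation or header strings."""
    return "\n".join(
        line for line in text.splitlines() if line.strip() not in boilerplate
    )


def select_passages(scored: List[Tuple[float, str]], top_k: int, max_tokens: int,
                    count_tokens: Callable[[str], int]) -> List[str]:
    """Keep up to top_k highest-scoring passages that fit the token cap."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
    chosen: List[str] = []
    used = 0
    for _score, text in ranked:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue  # skip this one; a smaller passage may still fit
        chosen.append(text)
        used += cost
    return chosen


cleaned = strip_boilerplate("Home\nReal content here\nShare Article",
                            {"Home", "Share Article"})
selected = select_passages(
    [(0.2, "a b c d"), (0.9, "e f g"), (0.5, "h i j k l"), (0.7, "m n")],
    top_k=3, max_tokens=6, count_tokens=word_counter,
)
```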

Prune the System Prompt

System prompts accumulate cruft. Instructions get added for one-off problems and never removed. Audit the system prompt regularly and delete anything that no longer changes behavior. A tight system prompt pays a dividend on every single request.

Prefer Structure Over Prose

Asking for a structured response — a short list, a small object, defined fields — often produces a more useful answer in fewer output tokens than asking for free-form prose. Constraint reduces both cost and the chance of rambling.

Building Budgets Into Your System

A budget that lives only in a developer's head erodes the moment that developer moves on.

Centralize the Limits

Keep token limits in configuration, not scattered across the codebase. One place to see and adjust the system prompt cap, history cap, retrieval cap, and output cap makes the whole budget legible and tunable.
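
One way to centralize the limits is a single immutable configuration object. The field names follow the four caps named above; the class itself and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    context_window: int
    system_prompt_cap: int
    history_cap: int
    retrieval_cap: int
    output_cap: int

    def headroom(self) -> int:
        """Tokens left for the user message after every cap is reserved."""
        return self.context_window - (
            self.system_prompt_cap + self.history_cap
            + self.retrieval_cap + self.output_cap
        )

    def validate(self) -> None:
        """Reject configurations whose caps already fill the window."""
        if self.headroom() <= 0:
            raise ValueError("caps leave no room for the user message")


# Hypothetical values; tune them against your own telemetry.
BUDGET = TokenBudget(
    context_window=8_000,
    system_prompt_cap=600,
    history_cap=1_500,
    retrieval_cap=4_000,
    output_cap=800,
)
BUDGET.validate()
```

Validating at startup surfaces a bad combination of caps before any request is sent.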

Fail Gracefully

When a request would exceed the window, the system should degrade predictably — summarize history, drop the lowest-ranked context, or return a clear error — rather than crash or silently truncate something important. Concrete patterns for this appear in Token Budget Management and Optimization: Real-World Examples and Use Cases.
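
The degradation order described above can be sketched as a small planning function. Everything here is an assumption made for illustration: the function name, the exception type, and the idea that a history summary has a known estimated size.

```python
from typing import Dict, List


class BudgetExceeded(Exception):
    """Raised when no degradation step can make the request fit."""


def plan_request(fixed: int, history: int, passages: List[int],
                 budget: int, summary_size: int) -> Dict[str, object]:
    """Degrade step by step: summarize history, drop passages, then fail.

    passages holds token counts ordered from highest to lowest rank.
    """
    passages = list(passages)
    actions: List[str] = []

    def total() -> int:
        return fixed + history + sum(passages)

    if total() > budget and history > summary_size:
        history = summary_size
        actions.append("summarized_history")
    while total() > budget and passages:
        passages.pop()  # remove the lowest-ranked passage
        actions.append("dropped_passage")
    if total() > budget:
        raise BudgetExceeded(
            f"request needs {total()} tokens but the input budget is {budget}"
        )
    return {"history_tokens": history, "passage_tokens": passages,
            "actions": actions}


plan = plan_request(fixed=800, history=3_000, passages=[1_500, 1_500, 1_500],
                    budget=5_000, summary_size=400)
```

Returning the list of actions taken makes the degradation visible in logs instead of silent.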

Review on a Schedule

Usage patterns drift. New features add new prompts. Revisit your token telemetry monthly, and treat any feature whose cost grew faster than its usage as a candidate for optimization.
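
A monthly review can start from a simple comparison between two periods. The sketch below uses total tokens as a proxy for cost, which is an assumption, and the feature names and figures are invented for illustration.

```python
from typing import Dict, List, Tuple

# Maps a feature name to (request count, total tokens) for one period.
Period = Dict[str, Tuple[int, int]]


def flag_for_review(previous: Period, current: Period) -> List[str]:
    """Return features whose token growth outpaced their request growth."""
    flagged: List[str] = []
    for feature, (requests_now, tokens_now) in current.items():
        if feature not in previous:
            continue  # new feature, nothing to compare against yet
        requests_prev, tokens_prev = previous[feature]
        if requests_prev == 0 or tokens_prev == 0:
            continue  # avoid dividing by zero on idle features
        if tokens_now / tokens_prev > requests_now / requests_prev:
            flagged.append(feature)
    return sorted(flagged)


# Illustrative figures only.
last_month = {"chat": (1_000, 500_000), "search": (2_000, 400_000)}
this_month = {"chat": (1_200, 900_000), "search": (3_000, 600_000)}
candidates = flag_for_review(last_month, this_month)
```

In this example chat traffic grew by 20 percent while its tokens grew by 80 percent, so it is flagged. Search grew in step with its usage and is not.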

Frequently Asked Questions

How many tokens is a typical prompt?

It varies enormously. A simple chat message might be 50 tokens, while a retrieval-augmented prompt with several documents can run 4,000 or more. The only reliable answer comes from measuring your own prompts with the provider's tokenizer rather than estimating.

Should I always minimize tokens?

No. The objective is deliberate spending, not minimal spending. Cutting a retrieved document that the model needed to answer correctly is a false economy. Spend tokens where they improve answers and trim where they do not.

Why is my bill dominated by output when my prompts are short?

Output tokens usually cost more per token than input, and unbounded generations can grow long. Check your response length distribution and set a maximum output length. That single control often produces the largest savings.

Does summarizing conversation history hurt quality?

Done carelessly, yes. Done well, summarization preserves the decisions, facts, and open threads the model needs while discarding redundant phrasing. The trick is summarizing the right things and keeping recent turns verbatim where detail matters.

How is token budget different from rate limiting?

Rate limiting controls how many requests you make over time. Token budgeting controls how large each request is and how the fixed context window is divided. They are complementary: one bounds frequency, the other bounds size.

Key Takeaways

  • A token is a subword unit; both input and output count against cost and the context window, with output usually pricier.
  • Measure token usage per component before optimizing, and log it alongside the feature that triggered each request.
  • Reserve output space first, then allocate the remaining window to input components ranked by the value they add.
  • Compress by summarizing history, reranking and trimming retrieved context, pruning the system prompt, and preferring structured responses.
  • Enforce caps in code and configuration, fail gracefully when limits are hit, and review token telemetry on a regular schedule.
