Do the Token Math Before the API Throws Errors

Agency Script Editorial

Editorial Team

October 26, 2025 · 8 min read

Most context-length problems come from skipping the math. People build a prompt, paste in some content, and find out it does not fit only when the API throws an error or the answer comes back truncated. The fix is not a bigger model. It is a repeatable process you run before shipping anything.

This article gives you that process as ordered steps. Do them in sequence, and you will know exactly how much room you have, whether your content fits, and which strategy to reach for when it does not. No theory beyond what each step requires. You should be able to apply this to a real prompt today.

We will use a running example throughout: a support assistant that answers questions using a product manual, on a model with a 32,000-token window. Plug in your own numbers as we go.

Step 1: Find the Model's Hard Window

Start by writing down the exact context window for the specific model and version you are using, in tokens. Do not approximate from memory and do not assume two models from the same vendor share a window. Check the current model documentation.

For our example, the window is 32,000 tokens. That number is your total budget for everything: instructions, history, documents, and the answer. Write it at the top of the page.
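One way to make this step stick is to record the window in code instead of in memory. The sketch below is a minimal illustration: the model name `example-model-v1` is hypothetical, and the 32,000 figure is the running example from this article, not a value for any real model.

```python
# Hypothetical model name; 32,000 is this article's running example,
# not the documented window of any real model.
CONTEXT_WINDOWS = {
    "example-model-v1": 32_000,
}


def context_window(model: str) -> int:
    """Return the recorded window for a model, failing loudly if it is unknown."""
    try:
        return CONTEXT_WINDOWS[model]
    except KeyError:
        raise ValueError(f"No context window recorded for {model!r}") from None


WINDOW = context_window("example-model-v1")
print(WINDOW)  # 32000
```

Failing on an unknown model, rather than falling back to a default, keeps a guessed window from slipping into the budget.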

Step 2: Measure the Fixed Costs

Some parts of every request are constant. Measure them once, precisely.

System prompt and tool schemas

Run your system prompt through the actual tokenizer for your model, not a generic estimator. Tool and function definitions count too, and they are often larger than people expect because they are verbose JSON.

In our example, the system prompt is 800 tokens and there are no tools. Fixed cost so far: 800 tokens.
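A measurement helper for fixed costs might look like the following sketch. It assumes you pass in a `count` function backed by your model's real tokenizer; `placeholder_count` is only a whitespace stand-in so the example runs on its own, and it will not match real token counts. Measuring tool schemas as serialized JSON is also an assumption, since providers may format them differently before the model sees them.

```python
import json
from typing import Callable

TokenCounter = Callable[[str], int]


def placeholder_count(text: str) -> int:
    """Stand-in counter (whitespace split). NOT accurate; swap in the real tokenizer."""
    return len(text.split())


def fixed_cost(system_prompt: str, tool_schemas: list[dict], count: TokenCounter) -> int:
    """Tokens spent on every request by the system prompt and tool definitions."""
    total = count(system_prompt)
    for schema in tool_schemas:
        # Assumption: tools travel as serialized JSON, so measure that form.
        total += count(json.dumps(schema))
    return total


# No tools, as in the running example.
print(fixed_cost("You are a support assistant.", [], placeholder_count))
```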

Reserve output space

Decide the maximum output you will ever ask for and reserve it. If your assistant should answer in up to 600 tokens, reserve 600. The model cannot borrow from input space to write a longer answer, so this reservation is non-negotiable.

Running total reserved: 800 + 600 = 1,400 tokens.

Step 3: Subtract a Safety Margin

Tokenizers and content vary, and you never want to run flush against the ceiling. Subtract a margin of 10 to 15 percent of the full window.

For 32,000 tokens, a 12 percent margin is 3,840 tokens. Reserve it.

Now compute your true working budget:

  • Window: 32,000
  • Fixed costs and output: 1,400
  • Safety margin: 3,840
  • Working budget for documents and history: 26,760 tokens

That number, not the headline 32,000, is what you actually have for content.
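The arithmetic above can be captured in a few lines so it is recomputed whenever a number changes. This is a sketch using the running example's figures; the class and field names are illustrative, not from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    window: int             # hard context window, in tokens
    fixed_cost: int         # system prompt plus tool schemas
    reserved_output: int    # maximum answer length
    margin_fraction: float  # safety margin as a fraction of the window

    @property
    def safety_margin(self) -> int:
        return round(self.window * self.margin_fraction)

    @property
    def working_budget(self) -> int:
        return self.window - self.fixed_cost - self.reserved_output - self.safety_margin


budget = TokenBudget(window=32_000, fixed_cost=800, reserved_output=600, margin_fraction=0.12)
print(budget.safety_margin)   # 3840
print(budget.working_budget)  # 26760
```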

Step 4: Measure Your Content

Now measure the thing you want to send. Run your product manual, or whichever document you are using, through the tokenizer.

Suppose the full manual is 41,000 tokens. It does not fit your 26,760-token budget, and it is not close. Most projects discover this problem in production. You discovered it on paper, before writing integration code. Good.

If your content had fit with room to spare, you would be done: send it whole. Since it does not, proceed to the next step.
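As a quick check of the example numbers, the size of the overflow is a subtraction. The function name here is illustrative.

```python
def overflow(content_tokens: int, working_budget: int) -> int:
    """Tokens by which the content exceeds the budget; 0 means it fits."""
    return max(0, content_tokens - working_budget)


print(overflow(41_000, 26_760))  # 14240
print(overflow(20_000, 26_760))  # 0
```

An overflow of 14,240 tokens is more than half the working budget, which is why trimming alone will not rescue the manual.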

Step 5: Choose a Strategy for Oversized Content

When content exceeds the budget, pick one of three approaches based on the shape of the problem. The framework article explains the decision logic in depth, but here is the short version.

If the overflow is conversation history, summarize

For a long chat, compress older turns into a running summary and keep only recent turns verbatim. Trigger summarization when history crosses a threshold, say 60 percent of your working budget, so you never hit the wall mid-conversation.
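A threshold trigger for history could be sketched as follows. Both `count` and `summarize` are assumptions supplied by the caller: `count` should wrap the real tokenizer, and `summarize` would typically be a separate model call that condenses older turns. The default of four recent turns kept verbatim is an arbitrary illustration.

```python
from typing import Callable


def compact_history(
    turns: list[str],
    working_budget: int,
    count: Callable[[str], int],
    summarize: Callable[[list[str]], str],
    threshold: float = 0.60,
    keep_recent: int = 4,
) -> list[str]:
    """Replace older turns with one summary once history crosses the threshold."""
    total = sum(count(turn) for turn in turns)
    if total <= threshold * working_budget or len(turns) <= keep_recent:
        return turns  # still comfortably under the trigger
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return [summarize(older)] + recent


# Stub helpers so the sketch runs on its own; replace both in real use.
word_count = lambda text: len(text.split())
stub_summary = lambda older: f"Summary of {len(older)} earlier turns"

history = ["one two three"] * 10  # 30 placeholder tokens
print(compact_history(history, 40, word_count, stub_summary)[0])
```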

If the overflow is a large static corpus, retrieve

For our 41,000-token manual against a 26,760-token budget, retrieval is the right call. Split the manual into chunks, index them, and at query time pull in only the handful of chunks relevant to the user's question. You might send 4,000 tokens of relevant passages instead of all 41,000.
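The retrieval flow can be sketched without any external service. The version below is deliberately simplified: it chunks by word count and ranks chunks by keyword overlap with the query, which stands in for the embedding index a production system would more likely use. The `count` argument is assumed to wrap the real tokenizer.

```python
from typing import Callable


def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def select_chunks(
    chunks: list[str], query: str, token_limit: int, count: Callable[[str], int]
) -> list[str]:
    """Pick the best-matching chunks that fit within token_limit."""
    query_terms = set(query.lower().split())

    def score(chunk: str) -> int:
        # Keyword overlap: a simple stand-in for a real relevance score.
        return len(query_terms & set(chunk.lower().split()))

    selected, used = [], 0
    for chunk in sorted(chunks, key=score, reverse=True):
        if score(chunk) == 0:
            break  # chunks are sorted, so nothing relevant remains
        cost = count(chunk)
        if used + cost <= token_limit:
            selected.append(chunk)
            used += cost
    return selected
```

In the running example, `token_limit` would be around 4,000, well inside the 26,760-token working budget, which leaves room for conversation history.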

If the overflow is borderline, trim

If you are only slightly over, sometimes removing boilerplate, redundant examples, or verbose formatting is enough to fit. Measure again after trimming to confirm.
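A conservative trimming pass might collapse runs of whitespace and drop blank or exactly repeated lines, as in this sketch. Whether it saves many tokens depends on your tokenizer, and dropping repeated lines assumes the repeats are boilerplate, so review the output and re-measure before relying on it.

```python
import re


def trim_text(text: str) -> str:
    """Collapse whitespace and drop blank or exactly repeated lines."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in text.splitlines():
        normalized = re.sub(r"\s+", " ", line).strip()
        if not normalized or normalized in seen:
            continue  # skip blank lines and verbatim repeats
        seen.add(normalized)
        kept.append(normalized)
    return "\n".join(kept)


sample = "Note:  see manual\n\nNote: see manual\nStep  1"
print(trim_text(sample))
```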

Step 6: Build a Pre-Send Guard

Do not trust that your estimates hold at runtime. Add a check immediately before every API call that counts the assembled prompt's tokens and compares against the budget.

  1. Assemble the full prompt: system, history, retrieved content, everything.
  2. Count its tokens with the real tokenizer.
  3. If the count plus reserved output exceeds the window less your safety margin, shrink before sending.
  4. Shrink by dropping or summarizing the lowest-priority content, not by random truncation.

This guard is what separates a system that degrades gracefully from one that returns mysterious errors under load. The common mistakes guide lists the failure modes this single step prevents.
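The four steps above can be sketched as a single function. The `Part` type and its numeric priority scheme are hypothetical, and `count` is assumed to wrap the real tokenizer. Pass whatever ceiling you enforce as `limit`; subtracting the safety margin from the window first keeps some headroom.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Part:
    text: str
    priority: int  # hypothetical scheme: higher means more important


def guard_prompt(
    parts: list[Part], limit: int, reserved_output: int, count: Callable[[str], int]
) -> list[Part]:
    """Drop the lowest-priority parts until prompt plus reserved output fits."""
    kept = list(parts)

    def total() -> int:
        return sum(count(part.text) for part in kept) + reserved_output

    while kept and total() > limit:
        # Remove the least important part, preserving the order of the rest.
        lowest = min(range(len(kept)), key=lambda i: kept[i].priority)
        kept.pop(lowest)
    if total() > limit:
        raise ValueError("Reserved output alone exceeds the limit")
    return kept
```

This version shrinks only by dropping whole parts. Summarizing a low-priority part instead of discarding it is the natural refinement, at the cost of an extra model call.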

Step 7: Test at the Edges

Estimates from average content lie when content is unusual. Test with deliberately hard cases.

  • A document full of tables and code, which tokenizes inefficiently.
  • A conversation at maximum realistic length.
  • A query that retrieves the maximum number of chunks you allow.
  • Non-English input if your users might send it.

Confirm that each case stays under budget and produces a complete answer. If any case fails, tighten your margins or lower your retrieval limits. For inspiration on what these edge cases look like in practice, see the real-world examples.
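A small harness can run the hard cases through the tokenizer and report any that break the budget. This sketch checks input size only; confirming that the answer is complete still requires calling the model. The case names and `count` function are placeholders.

```python
from typing import Callable


def over_budget_cases(
    cases: dict[str, str], working_budget: int, count: Callable[[str], int]
) -> dict[str, int]:
    """Return {case name: token count} for every case that exceeds the budget."""
    failures: dict[str, int] = {}
    for name, text in cases.items():
        tokens = count(text)
        if tokens > working_budget:
            failures[name] = tokens
    return failures


# Placeholder counter and toy cases so the sketch runs on its own.
word_count = lambda text: len(text.split())
edge_cases = {"table_heavy": "a | b | c | d", "short_query": "hi"}
print(over_budget_cases(edge_cases, 5, word_count))  # {'table_heavy': 7}
```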

Step 8: Monitor in Production

Once live, log the token count of every request. Watch for inputs creeping toward the ceiling and for any truncation events. A slow drift upward, as conversations get longer or documents grow, will eventually breach the limit if you are not watching. Treat a near-limit request the same way you treat a near-full disk: a warning to act on before it becomes an outage. The full checklist turns this monitoring into a standing routine.
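Per-request logging with a near-limit warning takes only the standard library. In this sketch the 85 percent warning threshold and the logger name are assumptions; choose values that match your own alerting.

```python
import logging

logger = logging.getLogger("token_usage")


def record_usage(
    request_id: str, prompt_tokens: int, window: int, warn_fraction: float = 0.85
) -> bool:
    """Log a request's token usage; return True when it is near the limit."""
    usage = prompt_tokens / window
    near_limit = usage >= warn_fraction
    level = logging.WARNING if near_limit else logging.INFO
    logger.log(
        level,
        "request=%s prompt_tokens=%d window=%d usage=%.1f%%",
        request_id, prompt_tokens, window, usage * 100,
    )
    return near_limit


logging.basicConfig(level=logging.INFO)
record_usage("req-1", 28_000, 32_000)  # logged as a warning (87.5% of window)
record_usage("req-2", 10_000, 32_000)  # logged as routine info
```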

Frequently Asked Questions

How do I count tokens accurately?

Use the tokenizer that matches your specific model rather than a generic word-to-token ratio. Most providers ship a tokenizer library or an endpoint that returns exact counts. Estimating from word count is fine for a rough sanity check but not for a production guard.

How much output space should I reserve?

Reserve the maximum length you ever expect the model to produce for that task, measured in tokens. If your answers should cap at 600 tokens, reserve 600. Under-reserving causes answers to cut off mid-sentence when the input is large.

What safety margin is reasonable?

Ten to fifteen percent of the full window is a sensible default. The margin absorbs variation in tokenization and small estimation errors so you never run flush against the ceiling, where even minor surprises cause hard failures.

When should I summarize versus retrieve?

Summarize when the overflow comes from growing conversation history you want to preserve the gist of. Retrieve when the overflow comes from a large, mostly static corpus where only a small slice is relevant per query. Many systems use both.

Do I really need a pre-send token guard?

Yes, if reliability matters. Estimates made at design time drift at runtime as inputs vary, and a guard is the only thing that catches an oversized prompt before the API rejects it or silently truncates it. It is a few lines of code that prevents an entire class of outages.

Key Takeaways

  • Compute your true working budget by subtracting fixed costs, reserved output, and a safety margin from the full window.
  • Measure both fixed costs and content with the real tokenizer, never with word-count estimates, for anything production-bound.
  • The headline window size is not your budget; the remainder after reservations is.
  • Choose summarize for growing history, retrieve for large static corpora, and trim for borderline overflow.
  • Add a pre-send guard that counts the assembled prompt and shrinks low-priority content before sending.
  • Test with token-heavy edge cases and monitor request sizes in production to catch drift before it breaches the limit.
