Most context-length problems come from skipping the math. People build a prompt, paste in some content, and find out it does not fit only when the API throws an error or the answer comes back truncated. The fix is not a bigger model. It is a repeatable process you run before shipping anything.
This article gives you that process as ordered steps. Do them in sequence, and you will know exactly how much room you have, whether your content fits, and which strategy to reach for when it does not. No theory beyond what each step requires. You should be able to apply this to a real prompt today.
We will use a running example throughout: a support assistant that answers questions using a product manual, on a model with a 32,000-token window. Plug in your own numbers as we go.
Step 1: Find the Model's Hard Window
Start by writing down the exact context window for the specific model and version you are using, in tokens. Do not approximate from memory and do not assume two models from the same vendor share a window. Check the current model documentation.
For our example, the window is 32,000 tokens. That number is your total budget for everything: instructions, history, documents, and the answer. Write it at the top of the page.
Step 2: Measure the Fixed Costs
Some parts of every request are constant. Measure them once, precisely.
System prompt and tool schemas
Run your system prompt through the actual tokenizer for your model, not a generic estimator. Tool and function definitions count too, and they are often larger than people expect because they are verbose JSON.
In our example, the system prompt is 800 tokens and there are no tools. Fixed cost so far: 800 tokens.
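As a rough sketch of this measurement, the function below sums the tokens in a system prompt and any tool schemas. The names `fixed_cost` and `placeholder_count` are hypothetical, and the whitespace counter is only a stand-in so the example runs on its own. In real use, pass the counting function from your model's actual tokenizer. Counting tool schemas as serialized JSON is also an assumption, since providers may format them differently before the model sees them.

```python
import json
from typing import Callable


def fixed_cost(
    system_prompt: str,
    tool_schemas: list[dict],
    count_tokens: Callable[[str], int],
) -> int:
    """Tokens used by the parts of the request that never change."""
    total = count_tokens(system_prompt)
    for schema in tool_schemas:
        # Assumption: tools are counted in their serialized JSON form.
        total += count_tokens(json.dumps(schema))
    return total


def placeholder_count(text: str) -> int:
    """Stand-in counter for demonstration only. NOT a real tokenizer."""
    return len(text.split())


no_tools = fixed_cost("You are a support assistant.", [], placeholder_count)
with_tool = fixed_cost(
    "You are a support assistant.", [{"name": "lookup"}], placeholder_count
)
print(no_tools, with_tool)
```

Injecting the counter as a parameter keeps the measurement logic unchanged when you swap the placeholder for the real tokenizer.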
Reserve output space
Decide the maximum output you will ever ask for and reserve it. If your assistant should answer in up to 600 tokens, reserve 600. The model cannot borrow from input space to write a longer answer, so this reservation is non-negotiable.
Running total reserved: 800 + 600 = 1,400 tokens.
Step 3: Subtract a Safety Margin
Token counts vary with the content you send, and you never want to run flush against the ceiling. Subtract a safety margin of 10 to 15 percent of the full window.
For 32,000 tokens, a 12 percent margin is 3,840 tokens. Reserve it.
Now compute your true working budget:
- Window: 32,000
- Fixed costs and output: 1,400
- Safety margin: 3,840
- Working budget for documents and history: 26,760 tokens
That number, not the headline 32,000, is what you actually have for content.
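The arithmetic above can be captured in a few lines so it is recomputed whenever a number changes. This is a minimal sketch using the running example's figures. The class and field names are illustrative, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    window: int           # hard context window, in tokens
    system_tokens: int    # measured fixed cost (system prompt, tools)
    reserved_output: int  # maximum answer length you will request
    margin_percent: int   # safety margin as a percent of the full window

    @property
    def safety_margin(self) -> int:
        # Integer arithmetic avoids floating-point rounding surprises.
        return self.window * self.margin_percent // 100

    @property
    def working_budget(self) -> int:
        return (
            self.window
            - self.system_tokens
            - self.reserved_output
            - self.safety_margin
        )


budget = ContextBudget(
    window=32_000, system_tokens=800, reserved_output=600, margin_percent=12
)
print(budget.safety_margin)   # 3840
print(budget.working_budget)  # 26760
```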
Step 4: Measure Your Content
Now measure the thing you want to send. Run your product manual, or whichever document you are using, through the tokenizer.
Suppose the full manual is 41,000 tokens. It does not fit your 26,760-token budget, and it is not close. Most projects discover this problem in production. You discovered it on paper, before writing integration code. Good.
If your content had fit with room to spare, you would be done: send it whole. Since it does not, proceed to the next step.
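To make the comparison explicit, here is a small sketch that reports whether content fits and by how much it overshoots. The function name `fit_report` is hypothetical. The inputs are the example's measured manual size and working budget.

```python
def fit_report(content_tokens: int, working_budget: int) -> dict:
    """Compare measured content against the working budget."""
    overflow = max(0, content_tokens - working_budget)
    return {
        "fits": overflow == 0,
        "overflow_tokens": overflow,
        "fraction_of_budget": content_tokens / working_budget,
    }


report = fit_report(content_tokens=41_000, working_budget=26_760)
print(report["fits"])             # False
print(report["overflow_tokens"])  # 14240
```

An overflow of more than half the budget, as here, is a strong hint that trimming alone will not be enough.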
Step 5: Choose a Strategy for Oversized Content
When content exceeds the budget, pick one of three approaches based on the shape of the problem. The framework article explains the decision logic in depth, but here is the short version.
If the overflow is conversation history, summarize
For a long chat, compress older turns into a running summary and keep only recent turns verbatim. Trigger summarization when history crosses a threshold, say 60 percent of your working budget, so you never hit the wall mid-conversation.
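One way to sketch the threshold trigger is shown below. Everything here is an assumption for illustration: `compact_history`, the `keep_recent` default of four turns, and the `summarize` callable, which in a real system would likely be a model call. The placeholder counter is not a real tokenizer.

```python
from typing import Callable


def compact_history(
    turns: list[str],
    working_budget: int,
    count_tokens: Callable[[str], int],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 4,
    trigger_fraction: float = 0.60,
) -> list[str]:
    """Replace older turns with one summary once history crosses the threshold."""
    total = sum(count_tokens(turn) for turn in turns)
    if total <= working_budget * trigger_fraction or len(turns) <= keep_recent:
        return turns  # still under the threshold, nothing to do
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(older)] + recent


def placeholder_count(text: str) -> int:
    """Stand-in counter for demonstration only. NOT a real tokenizer."""
    return len(text.split())


def placeholder_summarize(older: list[str]) -> str:
    """Stand-in for a real summarization call."""
    return f"[summary of {len(older)} earlier turns]"


turns = [f"turn {i} " + "word " * 10 for i in range(10)]  # 12 "tokens" each
compacted = compact_history(turns, 100, placeholder_count, placeholder_summarize)
print(len(compacted), compacted[0])
```

Checking the threshold on every turn keeps compaction incremental, so no single request has to absorb a large summarization step.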
If the overflow is a large static corpus, retrieve
For our 41,000-token manual against a 26,760-token budget, retrieval is the right call. Split the manual into chunks, index them, and at query time pull in only the handful of chunks relevant to the user's question. You might send 4,000 tokens of relevant passages instead of all 41,000.
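The query-time selection might look like the sketch below, which ranks chunks and packs the best ones under a token cap. The keyword-overlap score is a deliberately crude stand-in for whatever index or embedding search you actually use. `select_chunks` and the placeholder counter are hypothetical names for this example.

```python
from typing import Callable


def select_chunks(
    chunks: list[str],
    query: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Greedily pack the most relevant chunks without exceeding max_tokens."""
    query_terms = set(query.lower().split())

    def score(chunk: str) -> int:
        # Stand-in relevance score: number of words shared with the query.
        return len(query_terms & set(chunk.lower().split()))

    selected: list[str] = []
    used = 0
    for chunk in sorted(chunks, key=score, reverse=True):
        if score(chunk) == 0:
            break  # remaining chunks share nothing with the query
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # skip chunks that would break the cap
        selected.append(chunk)
        used += cost
    return selected


def placeholder_count(text: str) -> int:
    """Stand-in counter for demonstration only. NOT a real tokenizer."""
    return len(text.split())


manual_chunks = [
    "reset the router by holding the power button",
    "warranty lasts two years",
    "the router power light blinks red on error",
]
picked = select_chunks(manual_chunks, "how do I reset the router", 10, placeholder_count)
print(picked)
```

The cap passed as `max_tokens` plays the role of the 4,000-token allowance in the example: retrieval decides what is relevant, and the cap decides how much of it is sent.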
If the overflow is borderline, trim
If you are only slightly over, sometimes removing boilerplate, redundant examples, or verbose formatting is enough to fit. Measure again after trimming to confirm.
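As one narrow illustration of trimming, the sketch below drops lines that repeat verbatim, such as a header copied onto every page, and then measures again. The function name is hypothetical and the counter is a placeholder, not a real tokenizer. Real trimming usually needs judgment about which content is actually redundant.

```python
def drop_repeated_lines(text: str) -> str:
    """Keep the first occurrence of each non-empty line and drop exact repeats."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            continue  # exact repeat of an earlier line
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def placeholder_count(text: str) -> int:
    """Stand-in counter for demonstration only. NOT a real tokenizer."""
    return len(text.split())


manual = (
    "Acme Manual v2\nPress reset.\n"
    "Acme Manual v2\nHold for five seconds.\n"
    "Acme Manual v2"
)
trimmed = drop_repeated_lines(manual)
before, after = placeholder_count(manual), placeholder_count(trimmed)
print(before, after)
```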
Step 6: Build a Pre-Send Guard
Do not trust that your estimates hold at runtime. Add a check immediately before every API call that counts the assembled prompt's tokens and compares against the budget.
- Assemble the full prompt: system, history, retrieved content, everything.
- Count its tokens with the real tokenizer.
- If the count plus reserved output exceeds the window, shrink before sending.
- Shrink by dropping or summarizing the lowest-priority content, not by random truncation.
This guard is what separates a system that degrades gracefully from one that returns mysterious errors under load. The common mistakes guide lists the failure modes this single step prevents.
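The guard steps listed above can be sketched as follows. This version only drops content and does not summarize it. The `Part` structure, its priority scheme, and the `required` flag are assumptions made for this example, and the counter is a placeholder that you would replace with the real tokenizer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Part:
    text: str
    priority: int           # higher means more important (hypothetical scheme)
    required: bool = False  # required parts are never dropped


def enforce_window(
    parts: list[Part],
    window: int,
    reserved_output: int,
    count_tokens: Callable[[str], int],
) -> list[Part]:
    """Drop lowest-priority parts until prompt plus reserved output fits."""
    limit = window - reserved_output
    kept = list(parts)
    while sum(count_tokens(p.text) for p in kept) > limit:
        droppable = [p for p in kept if not p.required]
        if not droppable:
            raise ValueError("required parts alone exceed the window")
        # Remove the least important part; original order is preserved.
        kept.remove(min(droppable, key=lambda p: p.priority))
    return kept


def placeholder_count(text: str) -> int:
    """Stand-in counter for demonstration only. NOT a real tokenizer."""
    return len(text.split())


parts = [
    Part("sys " * 5, priority=100, required=True),
    Part("old " * 10, priority=1),       # oldest history
    Part("doc " * 10, priority=5),       # retrieved passage
    Part("question " * 3, priority=50),  # current user message
]
safe = enforce_window(parts, window=30, reserved_output=8, count_tokens=placeholder_count)
print([p.priority for p in safe])
```

Raising an error when required parts alone do not fit is a deliberate choice: that situation is a design problem, and it is better surfaced loudly than hidden by truncating the system prompt.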
Step 7: Test at the Edges
Estimates from average content lie when content is unusual. Test with deliberately hard cases.
- A document full of tables and code, which tokenizes inefficiently.
- A conversation at maximum realistic length.
- A query that retrieves the maximum number of chunks you allow.
- Non-English input if your users might send it.
Confirm that each case stays under budget and produces a complete answer. If any case fails, tighten your margins or lower your retrieval limits. For inspiration on what these edge cases look like in practice, see the real-world examples.
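A tiny harness along these lines can run the budget half of that check. It covers only token counts, not whether the answer is complete. The case names, sample texts, and counter below are placeholders. Real cases should be actual hard inputs measured with the real tokenizer.

```python
from typing import Callable


def over_budget_cases(
    cases: dict[str, str],
    working_budget: int,
    count_tokens: Callable[[str], int],
) -> dict[str, int]:
    """Return each failing case with the number of tokens it is over budget."""
    failures: dict[str, int] = {}
    for name, text in cases.items():
        overage = count_tokens(text) - working_budget
        if overage > 0:
            failures[name] = overage
    return failures


def placeholder_count(text: str) -> int:
    """Stand-in counter for demonstration only. NOT a real tokenizer."""
    return len(text.split())


edge_cases = {
    "max_conversation": "turn " * 50,
    "short_query": "reset router",
}
failures = over_budget_cases(edge_cases, working_budget=40, count_tokens=placeholder_count)
print(failures)
```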
Step 8: Monitor in Production
Once live, log the token count of every request. Watch for inputs creeping toward the ceiling and for any truncation events. A slow drift upward, as conversations get longer or documents grow, will eventually breach the limit if you are not watching. Treat a near-limit request the same way you treat a near-full disk: a warning to act on before it becomes an outage. The full checklist turns this monitoring into a standing routine.
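A minimal logging hook for this might look like the sketch below, using Python's standard `logging` module. The 85 percent warning threshold, the logger name, and the status labels are assumptions for illustration. Pick values that match your own margins.

```python
import logging

logger = logging.getLogger("context_budget")


def record_request(prompt_tokens: int, window: int, warn_fraction: float = 0.85) -> str:
    """Log one request's token usage and classify it against the window."""
    usage = prompt_tokens / window
    if prompt_tokens > window:
        status = "over_limit"
        logger.error("prompt of %d tokens exceeds window of %d", prompt_tokens, window)
    elif usage >= warn_fraction:
        status = "near_limit"
        logger.warning("prompt uses %.0f%% of the window", usage * 100)
    else:
        status = "ok"
        logger.info("prompt uses %.0f%% of the window", usage * 100)
    return status


print(record_request(20_000, 32_000))  # ok
print(record_request(28_000, 32_000))  # near_limit
print(record_request(33_000, 32_000))  # over_limit
```

Returning the status as well as logging it lets callers feed the same signal into metrics or alerts without parsing log lines.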
Frequently Asked Questions
How do I count tokens accurately?
Use the tokenizer that matches your specific model rather than a generic word-to-token ratio. Most providers ship a tokenizer library or an endpoint that returns exact counts. Estimating from word count is fine for a rough sanity check but not for a production guard.
How much output space should I reserve?
Reserve the maximum length you ever expect the model to produce for that task, measured in tokens. If your answers should cap at 600 tokens, reserve 600. Under-reserving causes answers to cut off mid-sentence when the input is large.
What safety margin is reasonable?
Ten to fifteen percent of the full window is a sensible default. The margin absorbs variation in tokenization and small estimation errors so you never run flush against the ceiling, where even minor surprises cause hard failures.
When should I summarize versus retrieve?
Summarize when the overflow comes from growing conversation history you want to preserve the gist of. Retrieve when the overflow comes from a large, mostly static corpus where only a small slice is relevant per query. Many systems use both.
Do I really need a pre-send token guard?
Yes, if reliability matters. Estimates made at design time drift at runtime as inputs vary, and a guard is the only thing that catches an oversized prompt before the API rejects it or silently truncates it. A few lines of code prevent an entire class of outages.
Key Takeaways
- Compute your true working budget by subtracting fixed costs, reserved output, and a safety margin from the full window.
- Measure both fixed costs and content with the real tokenizer, never with word-count estimates, for anything production-bound.
- The headline window size is not your budget; the remainder after reservations is.
- Summarize growing history, retrieve from large static corpora, and trim borderline overflow.
- Add a pre-send guard that counts the assembled prompt and shrinks low-priority content before sending.
- Test with token-heavy edge cases and monitor request sizes in production to catch drift before it breaches the limit.