Token budgets are rarely wrecked by a single dramatic error. They erode through small habits repeated thousands of times a day. A few extra lines in a system prompt, an uncapped answer, a conversation that never forgets — each looks harmless in isolation and adds up to a bill nobody can explain. The damage is quiet, which is exactly what makes it dangerous.
This article names seven specific failure modes we see again and again. For each one, it explains why the mistake happens, what it actually costs you, and the corrective practice that fixes it. None of these are exotic. They are the ordinary ways that real systems leak tokens, and recognizing them in your own code is most of the battle.
Read this with your own prompts in mind. Odds are good that at least two of these are happening somewhere in your application right now, and the fixes are usually small.
Mistake One: No Output Length Cap
The most common and most expensive oversight is letting the model write as much as it wants.
Why It Happens
Output limits are optional in many APIs, and the default is often generous or effectively unlimited. A developer building a feature focuses on getting a good answer, not on bounding a bad one, so the cap never gets set.
What It Costs
Output tokens usually cost more than input. A minority of requests that generate very long answers can dominate your bill while the average looks fine. You pay for verbosity you never wanted.
The Fix
Set a maximum output token count appropriate to each feature. If answers should be a paragraph, do not leave room for an essay. This single setting prevents the most common runaway-cost problem.
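One way to make the cap hard to forget is to keep it in a per-feature table and route every request through a single builder. The sketch below assumes hypothetical feature names and limits, and the parameter name "max_tokens" is itself an assumption; substitute whatever your provider calls its output limit.

```python
# Hypothetical per-feature caps; names and numbers are illustrative only.
MAX_OUTPUT_TOKENS = {
    "autocomplete": 64,
    "support_answer": 300,
    "report_summary": 800,
}
DEFAULT_MAX_OUTPUT_TOKENS = 256  # conservative fallback for unlisted features


def build_request(feature: str, prompt: str) -> dict:
    """Return request parameters that always carry an explicit output cap.

    The key "max_tokens" is an assumed name, not a specific provider's API.
    """
    cap = MAX_OUTPUT_TOKENS.get(feature, DEFAULT_MAX_OUTPUT_TOKENS)
    return {"prompt": prompt, "max_tokens": cap}


request = build_request("support_answer", "How do I reset my password?")
print(request["max_tokens"])  # 300
```

Because the fallback is a small number rather than "no limit", a new feature that nobody configured still ships with a bound.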
Mistake Two: The Bloated System Prompt
The system prompt is paid for on every request, which makes its waste relentless.
Why It Happens
System prompts grow by accretion. An instruction gets added to fix one edge case, then another, then another. Nobody removes the old ones because nobody is sure they are safe to remove.
What It Costs
Every unnecessary token in the system prompt is multiplied by your entire request volume. A few hundred wasted tokens per request become millions per day on a busy service, and the charge recurs for as long as the instruction stays in the prompt.
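To make the multiplication concrete, here is the arithmetic with made-up numbers. The waste per request, the request volume, and the price are all assumptions chosen for illustration; plug in your own figures.

```python
# All numbers are illustrative assumptions, not measurements.
wasted_tokens_per_request = 300
requests_per_day = 50_000
price_per_million_input_tokens = 3.00  # hypothetical price in dollars

wasted_tokens_per_day = wasted_tokens_per_request * requests_per_day
daily_cost = wasted_tokens_per_day / 1_000_000 * price_per_million_input_tokens

print(f"{wasted_tokens_per_day:,} wasted tokens per day")
# 15,000,000 wasted tokens per day
print(f"${daily_cost:.2f} per day, ${daily_cost * 365:,.2f} per year")
# $45.00 per day, $16,425.00 per year
```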
The Fix
Audit the system prompt regularly. Remove any instruction that no longer changes behavior, testing as you go. A lean system prompt pays a dividend on every call, as detailed in Spending Tokens Like Money: A Working Manual for LLM Budgets.
Mistake Three: Unbounded Conversation History
Chat features that remember everything eventually drown in their own history.
Why It Happens
The simplest chat implementation appends every turn to the context and sends the whole thing each time. It works perfectly in testing, where conversations are short, and fails in production, where they are not.
What It Costs
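Each turn gets more expensive than the last because it carries all prior turns, so the total cost of a session grows roughly with the square of its length rather than in proportion to it. Long sessions become disproportionately costly, and eventually they overflow the context window entirely.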
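A small calculation shows how quickly this adds up. It assumes, purely for illustration, that every turn adds about the same number of tokens and that the full history is resent each time.

```python
# Illustrative assumption: every turn adds about the same number of tokens.
tokens_per_turn = 200


def tokens_sent_on_turn(turn: int) -> int:
    """Input tokens for a given turn when the full history is resent each time."""
    return turn * tokens_per_turn


def cumulative_tokens(turns: int) -> int:
    """Total input tokens billed across a whole session of `turns` turns."""
    return sum(tokens_sent_on_turn(t) for t in range(1, turns + 1))


for turns in (10, 20, 40):
    print(turns, cumulative_tokens(turns))
# 10 11000
# 20 42000
# 40 164000
```

Under these assumptions, doubling the length of a session roughly quadruples its total input cost.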
The Fix
Keep recent turns verbatim, summarize older ones into a running summary, and enforce a hard token cap on history. This turns an unbounded cost into a predictable one. The ordered version of this fix appears in A Step-by-Step Approach to Token Budget Management and Optimization.
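The three parts of that fix can be sketched in a few lines. This is a minimal illustration, not a production implementation: count_tokens is a crude word-count stand-in for a real tokenizer, the summarize callback is a placeholder for whatever summarization you use, and the defaults for keep_recent and max_tokens are arbitrary.

```python
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def bound_history(
    turns: List[str],
    summary: str,
    summarize: Callable[[str, List[str]], str],
    keep_recent: int = 4,
    max_tokens: int = 500,
) -> Tuple[str, List[str]]:
    """Return (running summary, recent turns) that together fit within max_tokens."""
    split = max(len(turns) - keep_recent, 0)
    older, recent = turns[:split], turns[split:]
    if older:
        # Fold everything older than the recent window into the running summary.
        summary = summarize(summary, older)
    # Hard cap, step 1: the summary may use at most half of the budget.
    summary = " ".join(summary.split()[: max_tokens // 2])
    # Hard cap, step 2: drop the oldest verbatim turns until everything fits.
    while recent and count_tokens(summary) + sum(count_tokens(t) for t in recent) > max_tokens:
        recent = recent[1:]
    return summary, recent


def toy_summarize(previous: str, older_turns: List[str]) -> str:
    """Placeholder summarizer; a real system would call a model here."""
    return f"{previous} [{len(older_turns)} earlier turns condensed]".strip()


turns = [f"turn {i} " + "word " * 10 for i in range(10)]
summary, recent = bound_history(turns, "", toy_summarize, keep_recent=4, max_tokens=40)
print(summary, len(recent))
```

Note one design choice: turns dropped by the hard cap in this sketch are discarded rather than summarized. Whether that is acceptable depends on the feature.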
Mistake Four: Sending Whole Documents Instead of Relevant Passages
Stuffing entire documents into the prompt wastes tokens and often hurts quality.
Why It Happens
It is easier to include a full document than to chunk, rank, and select passages. The full text is right there, so it gets sent whole.
What It Costs
You pay for large amounts of irrelevant text, and the irrelevant text distracts the model, sometimes producing worse answers than a tighter context would. You lose on both cost and quality.
The Fix
Chunk documents, rerank chunks by relevance to the query, and include only the top few. Strip boilerplate before the text reaches the prompt. Less context, better answers, lower cost.
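The chunk, rank, and select steps can be sketched as follows. The relevance score here is a deliberately simple word-overlap count standing in for a real reranker or embedding similarity, and the chunk size and top_k values are arbitrary assumptions. Boilerplate stripping is left out.

```python
import re
from typing import List, Set


def chunk_text(text: str, max_words: int = 80) -> List[str]:
    """Split text into chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def _words(text: str) -> Set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, chunk: str) -> int:
    """Stand-in relevance score: distinct query words that appear in the chunk."""
    return len(_words(query) & _words(chunk))


def select_context(query: str, documents: List[str], top_k: int = 3) -> List[str]:
    """Chunk every document, rank chunks by the stand-in score, keep the top few."""
    chunks = [c for doc in documents for c in chunk_text(doc)]
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:top_k]


docs = [
    "The refund policy allows returns within thirty days.",
    "Our office is closed on public holidays.",
]
print(select_context("refund policy returns", docs, top_k=1))
```

Only the selected chunks go into the prompt; everything else stays out of the bill.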
Mistake Five: Ignoring the Output Side Entirely
Teams obsess over input tokens and forget the more expensive half.
Why It Happens
Input is visible — you write the prompt. Output feels like it belongs to the model. So optimization effort concentrates on shrinking prompts while output runs free.
What It Costs
Since output is usually pricier per token, neglecting it leaves the larger lever untouched. You can trim input heroically and still miss the bigger savings.
The Fix
Measure output length distributions, cap maximum output, and prefer structured responses that say what is needed concisely. A short list often beats a paragraph at lower cost.
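Measuring the distribution matters because the mean hides the tail. The sketch below uses a hypothetical sample of output token counts, as might be pulled from request logs, and compares the median with the mean, the 95th percentile, and the share of tokens consumed by the two longest answers.

```python
import statistics

# Hypothetical sample of output token counts pulled from request logs.
output_tokens = [120, 95, 140, 110, 2300, 130, 105, 90, 1800, 125]

mean = statistics.mean(output_tokens)
median = statistics.median(output_tokens)
# quantiles(n=20) returns 19 cut points; the last approximates the 95th percentile.
p95 = statistics.quantiles(output_tokens, n=20, method="inclusive")[-1]
# Fraction of all output tokens produced by the two longest answers.
share_of_top_two = sum(sorted(output_tokens)[-2:]) / sum(output_tokens)

print(f"median={median} mean={mean} p95={p95}")
print(f"two longest answers account for {share_of_top_two:.0%} of output tokens")
```

In this made-up sample the typical answer is short, yet two outliers produce most of the output tokens. That is the pattern a cap is meant to catch.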
Mistake Six: Optimizing Without Measuring
Cutting tokens by intuition leads to cutting the wrong things.
Why It Happens
Measurement requires instrumentation, and instrumentation feels like overhead. So developers guess at where tokens go and optimize based on the guess.
What It Costs
You spend effort shrinking components that were never the problem while the real consumer goes untouched. Worse, you sometimes cut context the model needed and degrade quality for no real saving.
The Fix
Instrument prompt assembly to count tokens per component on a representative sample before optimizing. Let the data choose your targets. The full diagnostic loop is described in Token Budget Management and Optimization: Real-World Examples and Use Cases.
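A minimal version of that instrumentation might look like the following. It assumes each sampled request is available as a mapping from component name to text; the component names are hypothetical, and count_tokens is a word-count stand-in that you would replace with your model's tokenizer.

```python
from collections import Counter
from typing import Dict, Iterable


def count_tokens(text: str) -> int:
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def component_totals(samples: Iterable[Dict[str, str]]) -> Counter:
    """Sum token counts per prompt component across a sample of requests."""
    totals: Counter = Counter()
    for components in samples:
        for name, text in components.items():
            totals[name] += count_tokens(text)
    return totals


def report(totals: Counter) -> None:
    """Print components from largest to smallest with their share of the total."""
    grand_total = sum(totals.values()) or 1  # avoid dividing by zero on empty input
    for name, tokens in totals.most_common():
        print(f"{name:<12}{tokens:>8}{tokens / grand_total:>8.1%}")


# Hypothetical sample: two requests, each split into named components.
samples = [
    {"system": "You are a helpful assistant.", "history": "hi there", "question": "Why?"},
    {"system": "You are a helpful assistant.", "history": "hi there how are you today", "question": "And now?"},
]
report(component_totals(samples))
```

The largest rows in the report are the optimization targets; the rest can usually wait.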
Mistake Seven: Treating Limits as Suggestions
A budget that lives only as an intention will always be exceeded.
Why It Happens
It is easy to decide on a token limit and never enforce it in code. The limit exists in a design document or a developer's head, not in the request path.
What It Costs
Without enforcement, components creep back to their old sizes. The savings you worked for quietly evaporate over the following months.
The Fix
Put every cap in configuration and enforce it in code — truncate, summarize, or reject when a limit would be exceeded. A budget enforced is a budget kept.
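Enforcement in the request path can be as small as one function that every component passes through before prompt assembly. The sketch below is an illustration under stated assumptions: the component names, caps, and policies are hypothetical, count_tokens is a word-count stand-in for a real tokenizer, and the summarize option is omitted for brevity.

```python
from typing import Dict, Tuple


class BudgetExceeded(Exception):
    """Raised when a component is over its cap and its policy is 'reject'."""


# Hypothetical configuration: component name -> (token cap, policy).
CAPS: Dict[str, Tuple[int, str]] = {
    "system": (400, "reject"),
    "history": (1500, "truncate"),
    "retrieval": (2000, "truncate"),
}


def count_tokens(text: str) -> int:
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def enforce(component: str, text: str, caps: Dict[str, Tuple[int, str]] = CAPS) -> str:
    """Return text that fits the component's cap, or raise if the policy is 'reject'."""
    cap, policy = caps[component]
    used = count_tokens(text)
    if used <= cap:
        return text
    if policy == "truncate":
        # Keep the first `cap` words; a real system might keep the most recent instead.
        return " ".join(text.split()[:cap])
    raise BudgetExceeded(f"{component} uses {used} tokens, cap is {cap}")


print(enforce("retrieval", "a b c d e", {"retrieval": (3, "truncate")}))  # a b c
```

Rejecting an oversized system prompt, rather than truncating it, is a deliberate choice in this sketch: the failure shows up in testing instead of silently changing behavior.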
How These Mistakes Compound
Each of the seven failure modes is manageable alone. The reason token bills spiral is that they rarely arrive alone — they reinforce one another.
Waste Multiplies Against Waste
A bloated system prompt is bad on its own. A bloated system prompt sent on every turn of an unbounded conversation is far worse, because the fixed overhead is billed again on every turn on top of a history that keeps growing. Full-document retrieval combined with no output cap means you pay heavily on both halves of the budget at once. The mistakes are not independent; they stack, and a feature carrying three of them costs far more than the sum of three separate problems would suggest.
The Common Root Is Missing Measurement
Look closely and most of these failures trace to the same root: nobody was measuring. Without per-component token counts, a bloated system prompt is invisible, an unbounded history looks fine until it overflows, and an output tail goes unnoticed. Missing measurement is more than Mistake Six on a list; it is the absence that lets the other six hide. Fixing the measurement gap surfaces all of them at once, which is why the diagnostic-first approach in Token Budget Management and Optimization: Real-World Examples and Use Cases pays off so broadly.
Fixing One Often Reveals Another
Teams that cap output frequently discover their history was unbounded too, because once the obvious cost is gone the next-largest consumer becomes visible. Treat the first fix as a flashlight rather than a finish line. Each correction clears the noise and exposes whatever waste was hiding behind it.
Frequently Asked Questions
Which mistake costs the most?
Usually an uncapped output length, because output is the pricier side and a long tail of verbose answers can dominate the bill. Setting a maximum output length is often the single highest-return fix.
How do I find a bloated system prompt?
Read it line by line and ask of each instruction whether removing it changes behavior. Test the removals on real cases. Most long-lived system prompts contain instructions that no longer do anything.
Is summarizing history risky?
Only if done carelessly. Keep recent turns verbatim and summarize older ones, preserving decisions and facts. Verify against real conversations afterward to confirm the model stays coherent.
Why does sending full documents hurt quality?
Irrelevant text distracts the model and can pull its answer off target. A focused context of the most relevant passages usually produces better answers than a large, noisy one.
How do I keep limits from drifting back?
Enforce them in code and configuration rather than relying on discipline. When a component exceeds its cap, the system should truncate, summarize, or reject automatically.
Key Takeaways
- Always cap output length; unbounded generation is the most common and most expensive mistake.
- Audit the system prompt regularly since its waste is multiplied across every single request.
- Bound conversation history by summarizing older turns and enforcing a hard token cap.
- Chunk and rerank retrieved documents instead of sending whole files, which saves cost and improves quality.
- Measure before optimizing and enforce limits in code, or savings will quietly erode over time.