Blaming the Model When It's Really a Token Problem

Most professionals who hit a wall with AI outputs — answers that cut off mid-sentence, summaries that miss critical details, chatbots that seem to forget what they were just told — are actually running into a token and context window problem. They just don't know it yet. Understanding what's going wrong is the difference between blaming the model and fixing the workflow.

Tokens are the units AI models use to process text. Not words, not characters — roughly four characters per token on average, or about 75 words per 100 tokens. A context window is the total amount of text a model can hold in working memory at one time: your instructions, the conversation history, any documents you've pasted in, and the model's own response. Once you hit that ceiling, something gets dropped. The model can't go back and retrieve it. It simply doesn't exist in the model's processing space anymore.
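
The two rules of thumb above can be turned into a quick estimator. This is a minimal sketch that assumes English prose and uses only the ratios quoted in this article; the function names are illustrative, and a real tokenizer will give different counts for code, structured data, or other languages.

```python
CHARS_PER_TOKEN = 4          # rule of thumb: ~4 characters per token
WORDS_PER_100_TOKENS = 75    # rule of thumb: ~75 words per 100 tokens


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens as character count / 4, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate tokens as word count * 100 / 75, rounded up."""
    return -(-len(text.split()) * 100 // WORDS_PER_100_TOKENS)
```

The two estimates will rarely agree exactly. Treating the larger one as the planning figure is the more cautious choice.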

The mistakes below aren't edge cases. They're the predictable failure modes that show up across agencies, consultancies, and professional workflows every week. Each one has a specific cause, a real cost, and a fix you can apply today.

Mistake 1: Treating the Context Window as Infinite

The most common error is also the most foundational. Professionals new to working with large language models assume that a 128,000-token context window — which sounds enormous — means they never have to think about space management. It doesn't.

Even a 128k window fills faster than most people expect. A typical PDF report might run 15,000–30,000 tokens. A long email thread can hit 5,000 tokens. Add a system prompt, a chain of back-and-forth conversation, and a request to cross-reference multiple documents, and you're burning through the window at a rate that surprises people every time.

Why this matters

When the window fills, one of two things happens: older content is silently dropped from the active context, or the API call fails with an error. Many conversational interfaces apply a sliding window that trims the oldest messages first, so the beginning of your conversation, including critical instructions, simply disappears. Users experience this as the model "forgetting" earlier decisions or constraints.

The fix: Treat context as a budget, not a warehouse. Before any substantial workflow, estimate your token load. Tools like OpenAI's web tokenizer, OpenAI's open-source tiktoken library, or Anthropic's token counter let you measure before you send. If your content is large, design a chunking or summarization strategy before you hit the wall, not after.
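
One way to make the budget idea concrete is to add up the components before sending anything. The sketch below is illustrative only: the `ContextBudget` class, the component names, and the 4,000-token output reserve are assumptions, and the token figures are simply the rough ranges quoted in this article.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    window: int           # total context window, in tokens
    reserved_output: int  # tokens held back for the model's response (assumed value)

    def remaining(self, components: dict[str, int]) -> int:
        """Tokens left after all input components and the output reserve."""
        return self.window - self.reserved_output - sum(components.values())

    def fits(self, components: dict[str, int]) -> bool:
        """True if every component plus the output reserve fits in the window."""
        return self.remaining(components) >= 0


budget = ContextBudget(window=128_000, reserved_output=4_000)
load = {"system_prompt": 2_500, "pdf_report": 30_000, "email_thread": 5_000}
print(budget.remaining(load))  # tokens still available for conversation history
```

Adding three more reports of the same size to `load` leaves almost no room, which is the point at which chunking or summarization needs to be in place.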

Mistake 2: Burying the Most Important Instructions at the End

Instruction placement matters. The model processes the entire context, but research and consistent practitioner experience show that information at the beginning and the end of a prompt tends to be weighted more heavily than content in the middle, a phenomenon often called the "lost in the middle" problem.

If you're working with a long document or an extended system prompt, critical instructions that get sandwiched between large blocks of text frequently get underweighted in the model's output.

What this looks like in practice

You paste a 20-page policy document into the context, add your instructions at the bottom, and ask the model to flag compliance issues. The model produces a generic summary that misses the specific clause types you care about. The instructions were there — they were just buried after 15,000 tokens of dense regulatory language.

The fix: Lead with your most important instructions. State the task, the format, and the critical constraints in the first 200–400 tokens. Restate key constraints at the end as a closing checklist. For very long contexts, consider a "bookend" structure: task framing at the top, the source material in the middle, and a brief reminder of your requirements at the bottom.
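
The bookend structure described above can be sketched as a small prompt builder. The function name and the section labels (TASK, CONSTRAINTS, SOURCE MATERIAL, REMINDER) are assumptions chosen for illustration, not a required format.

```python
def build_bookend_prompt(task: str, constraints: list[str], source: str) -> str:
    """Place task framing first, source material in the middle, a reminder last."""
    checklist = "\n".join(f"- {c}" for c in constraints)
    return (
        f"TASK\n{task}\n\n"
        f"CONSTRAINTS\n{checklist}\n\n"
        f"SOURCE MATERIAL\n{source}\n\n"
        f"REMINDER: before answering, confirm each constraint is met:\n{checklist}"
    )


prompt = build_bookend_prompt(
    task="Flag compliance issues in the policy below.",
    constraints=["Cite the clause number", "Only flag data-retention clauses"],
    source="(policy text goes here)",
)
```

Repeating the constraints costs a few dozen tokens, which is small next to a 15,000-token document.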

Mistake 3: Copy-Pasting Raw Documents Without Preprocessing

When people first discover they can paste entire documents into a prompt, they do exactly that — including headers, footers, page numbers, navigation menus, boilerplate legal text, and repeated disclaimers. All of that burns tokens without adding signal.

A raw website paste can be 40–60% noise. A PDF converted to text drags in table artifacts, footnote numbering, and formatting characters. Every junk token is a token not spent on useful content or model reasoning.

The real cost

Noise in the context doesn't just waste space. It degrades output quality. The model allocates attention across all tokens, including irrelevant ones. More noise means less precise outputs — summaries that pick up boilerplate language, analyses that get distracted by repeated legalese, and extractions that confuse header text for content.

The fix: Strip documents before inserting them. Remove repeated headers, navigation, legal boilerplate, and formatting artifacts. Use Markdown formatting for structure rather than pasted table layouts. When working programmatically, pipe documents through a cleaning function before they ever reach the model. This is one of the highest-leverage, lowest-effort improvements you can make — more detail on this is covered in Tokens and Context Windows: Best Practices That Actually Work.
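
A cleaning function of the kind mentioned above might look like the following. It is a heuristic sketch, not a complete solution: it assumes that bare page numbers and lines repeated more than twice are boilerplate, which will occasionally be wrong for real documents. The function name and threshold are illustrative.

```python
import re
from collections import Counter

# Matches lines such as "7", "Page 7", or "Page 7 of 20".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def clean_document(text: str, max_repeats: int = 2) -> str:
    """Drop page numbers and heavily repeated lines, and collapse blank runs."""
    lines = [line.strip() for line in text.splitlines()]
    counts = Counter(line for line in lines if line)
    kept: list[str] = []
    for line in lines:
        if PAGE_NUMBER.match(line):
            continue  # bare page number
        if line and counts[line] > max_repeats:
            continue  # line repeats too often: treated as header/footer (all copies dropped)
        if not line and (not kept or not kept[-1]):
            continue  # avoid leading or consecutive blank lines
        kept.append(line)
    return "\n".join(kept).strip()
```

Comparing the token estimate before and after cleaning is a quick way to see how much of a paste was noise.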

Mistake 4: Ignoring the System Prompt's Token Cost

System prompts are where professionals configure the model's persona, rules, constraints, and output format. They're also frequently the most bloated part of a context — because they were written once, never audited, and grow by accretion over time.

A system prompt that started at 300 tokens can balloon to 2,500+ tokens as teams add clarifications, edge case handling, example outputs, and formatting instructions. On every single API call, that overhead is paid. On a high-volume application handling thousands of requests per day, an oversized system prompt is both a quality problem and a direct cost driver.

The compounding effect

System prompt bloat compounds in multi-turn conversations. The system prompt is re-sent with every turn, so a 2,000-token system prompt in a 10-turn conversation accounts for 20,000 input tokens of overhead across the conversation, before any user input or model output is counted.
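
The arithmetic is easy to check. The sketch below reproduces the 2,000-token, 10-turn figure and adds a second, simplified calculation for conversation history. That second function assumes each completed exchange adds a fixed number of tokens and ignores the new user message on each turn; the 500-token figure is a made-up illustration.

```python
def system_prompt_overhead(prompt_tokens: int, turns: int) -> int:
    """Total input tokens spent on the system prompt across a conversation."""
    return prompt_tokens * turns


def cumulative_input_tokens(prompt_tokens: int, tokens_per_exchange: int, turns: int) -> int:
    """System prompt plus re-sent history, summed over every turn.

    Turn t (0-indexed) re-sends the system prompt and t earlier exchanges.
    """
    return sum(prompt_tokens + t * tokens_per_exchange for t in range(turns))


print(system_prompt_overhead(2_000, 10))        # 20000
print(cumulative_input_tokens(2_000, 500, 10))  # 42500
```

Under these assumptions, trimming the system prompt to 1,000 tokens would cut 10,000 input tokens from the same conversation.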

The fix: Audit your system prompts quarterly. Every instruction should earn its place by reducing errors or improving output quality in a measurable way. Remove examples that can be handled by clear instruction instead. If you maintain multiple deployments, you're probably maintaining multiple system prompts — a governance habit is worth building. See The Tokens and Context Windows Checklist for 2026 for a structured audit template.

Mistake 5: Running Long Conversations Without Session Management

Chat applications feel conversational. That's by design. But every exchange adds tokens to the running history, and without deliberate session management, long conversations degrade predictably: responses get vague, the model starts contradicting earlier decisions, and eventually critical early instructions fall out of the window entirely.

This is one of the most common failure modes in client-facing deployments — a chatbot that works perfectly in a 5-turn demo and starts producing inconsistent results by turn 25.

What good session management looks like

There are two standard patterns: rolling summarization and hard resets. Rolling summarization compresses older conversation history into a condensed summary that gets prepended to new exchanges — preserving the essential context of what was decided without keeping every word. Hard resets wipe the history at natural breakpoints (end of topic, end of session) and reload just the system prompt and any persistent facts.

The fix: Never let conversation history grow unbounded. Decide upfront whether your use case needs continuity across long sessions (rolling summarization) or whether clean breaks are acceptable (hard resets). Build session management into the architecture from the start; retrofitting it is significantly harder. Real-world examples of both patterns are documented in Tokens and Context Windows: Real-World Examples and Use Cases.
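
Both patterns can be sketched in one small class. Everything here is an assumption made for illustration: the `Session` name, the thresholds, the four-characters-per-token estimate, and above all the `summarize` callable, which stands in for whatever model call or function would produce the summary in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token, rounded up."""
    return -(-len(text) // 4)


@dataclass
class Session:
    system_prompt: str
    summarize: Callable[[list[str]], str]  # hypothetical: would call a model in practice
    max_history_tokens: int = 2_000
    keep_recent: int = 4                   # most recent turns kept verbatim
    summary: str = ""
    turns: list[str] = field(default_factory=list)

    def add_turn(self, text: str) -> None:
        """Append a turn; compress older turns once the history budget is exceeded."""
        self.turns.append(text)
        over_budget = sum(estimate_tokens(t) for t in self.turns) > self.max_history_tokens
        if over_budget and len(self.turns) > self.keep_recent:
            cut = len(self.turns) - self.keep_recent
            older, self.turns = self.turns[:cut], self.turns[cut:]
            previous = [self.summary] if self.summary else []
            self.summary = self.summarize(previous + older)  # rolling summarization

    def hard_reset(self, persistent_facts: str = "") -> None:
        """Wipe history at a natural breakpoint, keeping only persistent facts."""
        self.turns = []
        self.summary = persistent_facts

    def build_context(self) -> list[str]:
        """System prompt, then carried-forward summary (if any), then recent turns."""
        parts = [self.system_prompt]
        if self.summary:
            parts.append("Carried forward: " + self.summary)
        return parts + self.turns
```

The system prompt is never part of the trimmed history, so early instructions survive however long the conversation runs.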

Mistake 6: Confusing Token Limits with Quality Limits

Professionals who discover longer context windows sometimes assume bigger is automatically better. They design prompts that use the entire 128k or 200k window because they can — loading in every piece of potentially relevant information and expecting the model to handle the complexity.

In practice, signal-to-noise ratio matters more than raw context size. A focused 8,000-token prompt with well-curated content consistently outperforms a 90,000-token dump of loosely related material. Model attention is not uniformly distributed across 200,000 tokens; performance on retrieval and reasoning tasks tends to degrade with context bloat even when the technical limit isn't exceeded.

Retrieval as a better alternative

For workflows that genuinely require access to large knowledge bases, retrieval-augmented generation (RAG) is almost always the right architectural choice. Instead of loading everything into context, a retrieval layer surfaces only the 3–10 most relevant document chunks for each query. The context stays lean, the relevant signal stays high, and costs stay manageable.

The fix: Match context size to the actual information needs of the task. Before defaulting to "paste everything in," ask what the model genuinely needs to produce the output. For large document sets, build or use a retrieval layer. A full framework for making these architectural decisions is laid out in A Framework for Tokens and Context Windows.
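
To show the shape of a retrieval layer, here is a deliberately simple sketch that ranks chunks by word overlap with the query. This is a toy lexical scorer, not the embedding-based search most production RAG systems use; the function names and chunk size are assumptions.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split a document into consecutive chunks of at most max_words words."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_words]) for i in range(0, len(tokens), max_words)]


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return up to k chunks sharing the most words with the query."""
    q = words(query)
    scored = [(len(q & words(c)), i, c) for i, c in enumerate(chunks)]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))  # best score first, then original order
    return [c for score, _, c in ranked[:k] if score > 0]
```

Only the returned chunks go into the prompt, so context size depends on `k` and chunk length rather than on the size of the knowledge base.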

Mistake 7: Not Accounting for Output Tokens in Budget Calculations

Input tokens — the prompt — get most of the attention. Output tokens — the model's response — are frequently ignored in cost and capacity planning. This leads to two distinct failure modes.

First, for API deployments charged per token, output tokens typically cost the same as or more than input tokens. A workflow that generates long-form documents at scale will have output token costs that dwarf input costs — and teams that didn't plan for this get surprised by invoices.

Second, max output length is a separate parameter from context window size. Setting max_tokens too low truncates responses mid-sentence. Not setting it at all on some APIs means the model defaults to a conservative output length that cuts off outputs before the task is complete.

Getting the budget right

A reasonable planning heuristic: for analytical and summarization tasks, output tokens are typically 10–25% of input tokens. For generative tasks (drafting, writing, synthesis), output can equal or exceed input. Build both into your per-task token budget.

The fix: Always specify max_tokens deliberately. For tasks requiring long outputs, calculate whether your prompt + anticipated output fits within the context window. For cost forecasting, model output token volume separately and assume it will be higher than you expect on the first pass. This discipline is part of a broader production readiness habit covered in the Case Study: Tokens and Context Windows in Practice.
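
The two checks above, fit and cost, reduce to a few lines of arithmetic. In this sketch the function names are illustrative and the per-million-token prices in the example call are placeholders, not any provider's actual rates.

```python
def fits_in_window(prompt_tokens: int, max_output_tokens: int, window: int) -> bool:
    """True if the prompt plus the requested output length fits in the context window."""
    return prompt_tokens + max_output_tokens <= window


def planned_output_tokens(input_tokens: int, ratio: float) -> int:
    """Output estimate as a fraction of input (e.g. 0.10-0.25 for summarization)."""
    return round(input_tokens * ratio)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float, output_price_per_million: float) -> float:
    """Cost of a workload, with input and output tokens priced separately."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Placeholder prices: 3.00 per million input tokens, 15.00 per million output tokens.
output = planned_output_tokens(1_000_000, 0.25)
print(estimate_cost(1_000_000, output, 3.0, 15.0))
```

With these placeholder prices, output at a quarter of the input volume already costs more than the input itself, which is why it deserves its own line in the forecast.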

Frequently Asked Questions

What actually happens when you exceed a context window?

It depends on the interface. In most API implementations, you receive an error and the call fails entirely. In conversational products, the model typically drops the oldest tokens from the context — meaning early instructions, conversation history, or document content silently disappear. The model has no awareness that content was removed; it processes whatever fits within the window as if it were the complete context.

Do all models handle long contexts equally well?

No. Token limits vary significantly by model, from around 4,000 tokens for older or smaller models to 128,000 for GPT-4o and 200,000 for Claude 3.5 models. But even within a single model, performance on tasks requiring attention to content near the middle of very long contexts tends to be weaker than on tasks where relevant content sits near the beginning or end. Larger windows expand capacity; they don't guarantee uniform quality across all positions.

Is there a way to check how many tokens a prompt will use before sending it?

Yes. OpenAI publishes a tokenizer tool at platform.openai.com, and the Python library tiktoken lets you count tokens programmatically for OpenAI models. Anthropic's API includes a token counting endpoint. For general estimation, the rough rule of 100 tokens ≈ 75 words holds reasonably well for English prose, though code, structured data, and non-English languages tokenize differently.

How often should system prompts be audited?

At minimum, quarterly — more frequently if the application is in active development or user-facing. The trigger events for an audit should include: any time output quality degrades noticeably, any time a new model version is deployed (tokenization and behavior may differ), and any time the team adds new instructions without removing outdated ones. Treat the system prompt as living documentation, not a set-and-forget configuration.

Can you avoid context window problems by just using a model with a bigger window?

Partially. A larger context window removes certain hard limits and gives you more headroom, but it doesn't eliminate the need for thoughtful context management. Costs scale with tokens, quality degrades with noise, and architectural problems like unbounded conversation history or bloated system prompts remain problems regardless of window size. Bigger windows are useful; they're not a substitute for good design.

Key Takeaways

Context windows are budgets, not warehouses. Estimate token load before building workflows, not after hitting errors.
Instruction placement matters. Lead with your most critical requirements; don't bury them after large content blocks.
Raw document pasting is one of the most common sources of avoidable noise. Strip and clean before inserting.
System prompt bloat compounds on every call. Audit regularly and remove instructions that don't demonstrably improve output.
Long conversations need session management. Design for it from the start: rolling summarization or hard resets depending on the use case.
Bigger context windows improve capacity, not quality. Signal-to-noise ratio matters more than raw window size.
Output tokens have real cost and length constraints. Model them separately from input tokens in any production planning.

Agency Script Editorial
