Which Variable Failed Your AI Task, Almost Always the Window

Q: Does a bigger context window always mean better performance?

No. Larger context windows change what's possible, not what's reliable. Models can underweight content placed far from the beginning or end of a long context, a pattern sometimes called "lost in the middle." A well-curated 10,000-token context often outperforms a carelessly assembled 80,000-token one. Use the largest window when you need it; use a focused context when you don't.

Tokens and context windows sound like infrastructure details — the kind of thing you learn once and forget. In practice, they're the reason a perfectly reasonable AI task succeeds or collapses, and most professionals don't realize which variable failed them. When a model truncates a long document, hallucinates details from earlier in a conversation, or returns weirdly shallow output on a complex brief, the context window is almost always part of the story.

This article walks through specific scenarios: what happened, why the token math mattered, and what a competent operator would do differently. The goal isn't to turn you into an ML engineer. It's to give you the mental models to diagnose problems and design AI tasks that actually work at the edges where things get hard.

A quick anchor before the examples: a token is roughly 0.75 words in English, meaning 1,000 words costs about 1,333 tokens. Context windows are measured in tokens — the combined total of everything the model holds in its working memory: the system prompt, conversation history, documents you've pasted in, and the output it's generating. When that window fills, something has to give, and it's rarely something you'd choose to cut.

The Document Summarization That Lost Its Ending

A marketing agency fed a 40-page white paper into an early GPT-4 build with an 8,192-token context window. The paper was about 12,000 words — roughly 16,000 tokens. The model received only the first portion and summarized what it saw confidently, with no indication that the rest of the document had been silently cut.

The summary missed the paper's entire conclusion and recommendations section, which happened to be where the client's key differentiators lived. The agency delivered the summary. The client noticed immediately.

What went wrong

The problem wasn't the model's intelligence — it was a capacity problem dressed up as a quality problem. The operator assumed the model would flag overflow. It didn't. Models don't typically say "I only read 60% of your document." They summarize what they received and treat it as the whole.

What works instead

Chunk the document into sections of 2,000–3,000 words each, summarize each chunk separately, then pass those summaries into a final synthesis prompt. This is called a map-reduce pattern and it scales to any document length.
Always verify token counts before you submit. Tools like the OpenAI Tokenizer or LangChain's token counter take seconds. See The Best Tools for Tokens and Context Windows for options worth bookmarking.
Check output against source. For any summarization of consequential material, confirm that the final section of the source document appears in the output.

The same failure mode appears constantly with legal contracts, research reports, and meeting transcripts. The fix is always the same: stop assuming the model read the whole thing.

The Long Customer Service Thread That Started Forgetting

A SaaS support team used an AI assistant for tier-1 support. On short threads, it performed well. After a customer conversation crossed roughly 6,000–7,000 tokens — about 35–40 back-and-forth exchanges — the assistant started contradicting earlier commitments. It offered a refund it had already declined, forgot account details the user had stated clearly, and re-asked for information it had already been given.

The mechanics of mid-conversation drift

Most production deployments use a sliding window or truncation strategy: when the context fills, older messages drop out. The model genuinely no longer has access to them. This isn't a memory lapse in any human sense — it's that the information no longer exists in the input. The model isn't confused; it's operating on an incomplete record.

What actually helps

Structured memory injection: At each turn, prepend a short system-level summary of confirmed facts — account ID, issue type, resolution offered so far — into the system prompt. This persists even when raw messages scroll out.
Hard conversation limits: Set a maximum turn count per session and prompt users to open a new thread if they exceed it. Brief the team on why.
Model selection by task depth: For support threads expected to run long, a 128K-context model changes the math entirely, though cost-per-token increases accordingly. Tokens and Context Windows: Trade-offs, Options, and How to Decide breaks down when the upgrade is worth it.

The Code Review That Worked — Because the Context Was Right-Sized

Not every tokens-and-context-windows example is a failure. A software consultancy used Claude to review pull requests on a mid-sized Python codebase. Their PRs typically ran 300–600 lines of changed code, which at a rough average comes to 3,000–5,000 tokens of code content. They padded their system prompt with the team's style guide and a set of error patterns to watch for — another ~800 tokens.

Total context load: comfortably under 10,000 tokens, well within almost any modern model's window. The output was specific, actionable, and consistently high quality across dozens of reviews.

Why this worked

The task was scoped to fit the window. They didn't try to feed the entire codebase.
The system prompt was precise, not padded. Instructions said exactly what patterns to catch, with examples. Vague instructions inflate token cost without improving output.
They ran the review on the diff only, not the full file. Cutting irrelevant context is just as important as counting total tokens.

This is the design principle professionals often overlook: context windows work best when you're intentional about what earns a place in them. Treating the context window like a dump zone — pasting everything in and hoping the model figures it out — degrades output quality even when you're nowhere near the limit.

The 100K-Context Experiment That Underperformed

When models with 100,000+ token windows became accessible, a content agency tried what seemed obvious: paste an entire competitor's website (crawled to markdown), a full brand guide, and a long client brief into a single prompt and ask for a strategy document.

Total input: ~85,000 tokens. The model processed it without truncation. The output was disappointing — generic recommendations that could have applied to any brand, surface-level competitive observations, and a structure that ignored most of what was in the brief.

The "lost in the middle" problem

Research into long-context model behavior consistently finds a pattern: models attend most reliably to material near the beginning and end of a long context. Content buried in the middle — say, the most important competitive insight at token 47,000 of 85,000 — is more likely to be underweighted or missed entirely. Longer isn't always better; it can dilute the signal.

What a better design looks like

Curate, don't dump. Pull the 10 most relevant competitor pages, not all 200. Distill the brand guide to its 3–5 core principles before it enters the prompt.
Position matters. Place the most critical instructions and context at the top of the prompt (or the bottom if that's where your model attends best — worth testing for your specific use case).
Break the task. A strategy document is several tasks: competitive landscape, brand positioning, channel prioritization, messaging. Run them sequentially with focused context for each.

For a deeper look at how to structure these decisions systematically, A Framework for Tokens and Context Windows offers a step-by-step method.

The RAG Pipeline That Got the Retrieval Wrong

A consulting firm built a retrieval-augmented generation (RAG) system over 5,000 internal documents. The system retrieved the top 5 chunks by vector similarity and injected them into the context before generating an answer.

On test questions, accuracy was strong. In production, it degraded noticeably on nuanced questions that required combining information across documents. The model gave confident answers that blended retrieved content incorrectly.

The chunk size problem

Their chunks were 500 tokens each, optimized for retrieval precision. That's a reasonable default, but it meant each chunk often lacked the surrounding context that would make its content interpretable. A sentence like "this exception applies only in cases described above" — retrieved without the "above" — gave the model incomplete information it then confidently completed from its parametric (pre-training) memory.

Calibrations that matter

Chunk with overlap. Using 100–150 token overlaps between chunks preserves sentence context across boundaries.
Retrieve more, then re-rank. Pull 10–15 chunks, re-rank by relevance, then pass the top 4–5 to the model. This improves quality without blowing the context budget.
Add metadata to chunks. Including document title, section header, and date in each chunk gives the model structural context to reason about what it's reading.
Test retrieval separately from generation. Most RAG quality failures are retrieval failures, not generation failures. Evaluate whether the right chunks are coming back before you blame the model's output.

See the Case Study: Tokens and Context Windows in Practice for a full walkthrough of a RAG build that went through exactly this diagnostic process.

The Cost Blowout on a Batch Summarization Job

An agency ran 1,200 article summarizations overnight. Their prompt template included a detailed system prompt (~900 tokens), each article averaged 2,200 words (~2,900 tokens), and they requested 300-word summaries (~400 tokens output). Per request: roughly 4,200 tokens in, 400 tokens out.

At $15 per million input tokens and $60 per million output tokens (approximate rates for a premium model at time of writing), each request cost about $0.087. Across 1,200 articles: ~$104. Against their client's budget, that was a 3× overage.

The token economy of prompt design

The system prompt was the culprit. It included elaborate stylistic guidance, several worked examples, and lengthy disclaimers — most of which was redundant with clearer, shorter instructions. Cutting it from 900 to 280 tokens reduced per-request cost by roughly 15%, which at scale was meaningful.

Audit your system prompts regularly. Every 100 tokens of system prompt costs you across every single request. The economics compound.
Match model to task. A smaller, cheaper model handles routine summarization competently. Reserve premium models for tasks that genuinely require their capability.
Use the [Tokens and Context Windows Checklist for 2026](/blog/tokens-and-context-windows-checklist) before any batch job. A 20-minute review before 1,200 requests is worth considerably more than the time spent explaining overages.

Frequently Asked Questions

How do I know if my content is exceeding the context window?

Most model APIs return an error when you exceed the context limit, but many interfaces silently truncate rather than error out. The safest practice is to count tokens before submitting using a tokenizer tool — the OpenAI Tokenizer is free and accurate for GPT models; Anthropic publishes token counting methods for Claude. Budget for your system prompt, input, and expected output simultaneously, not just the input alone.

Does a bigger context window always mean better performance?

No. Larger context windows change what's possible, not what's reliable. Models can underweight content placed far from the beginning or end of a long context, a pattern sometimes called "lost in the middle." A well-curated 10,000-token context often outperforms a carelessly assembled 80,000-token one. Use the largest window when you need it; use a focused context when you don't.

What's the practical difference between truncation and summarization when context fills up?

Truncation simply cuts content — usually the oldest messages in a conversation or the end of a document — with no indication to the user. Summarization (sometimes called compression) replaces older turns with a condensed summary, preserving key facts at lower token cost. Summarization preserves more information but can introduce distortions; it requires more engineering to implement correctly. For most production systems, structured memory injection (writing key facts into the system prompt) is more reliable than either approach alone.

Why does my model give worse output on longer prompts even when I'm under the context limit?

Several factors: instruction dilution (more context means the model has to weight competing signals), decreased attention to buried content, and increased opportunity for contradictory information in the input. A focused, well-organized prompt typically outperforms a long, loosely assembled one even when both fit comfortably in the window. Treat context as a budget, not a capacity ceiling.

How should I think about tokens when pricing a client project?

Estimate average token count per request (input + output), multiply by your expected request volume, and apply the model's per-token rate. Add a 20–30% buffer for prompt iterations and edge cases. For any job over a few thousand requests, model selection is a legitimate cost lever — a task that runs on a mid-tier model at one-fifth the price of a premium model, with comparable output quality, is the better business decision.

Key Takeaways

Token overflow fails silently. Models rarely flag that they've truncated your input — verify counts before you submit anything consequential.
Long context ≠ good context. Buried content is underweighted; curating what enters the window matters as much as the window's size.
Chunk and map-reduce for long documents. This pattern scales to any length and produces more reliable summaries than pasting everything in at once.
Persistent memory injection beats conversation history. For multi-turn applications, write key facts into the system prompt explicitly rather than relying on the model to hold them across a long thread.
System prompts are paid per request. Audit them for bloat before any batch operation; the cost compounds at scale.
RAG quality failures are usually retrieval failures. Evaluate chunk size, overlap, and re-ranking logic before blaming generation quality.
Match model size to task complexity. Cost optimization is a design decision, not an afterthought.

The Document Summarization That Lost Its Ending

What went wrong

What works instead

Chunk the document into sections of 2,000–3,000 words each, summarize each chunk separately, then pass those summaries into a final synthesis prompt. This is called a map-reduce pattern and it scales to any document length.
Always verify token counts before you submit. Tools like the OpenAI Tokenizer or LangChain's token counter take seconds. See The Best Tools for Tokens and Context Windows for options worth bookmarking.
Check output against source. For any summarization of consequential material, confirm that the final section of the source document appears in the output.

The same failure mode appears constantly with legal contracts, research reports, and meeting transcripts. The fix is always the same: stop assuming the model read the whole thing.

The Long Customer Service Thread That Started Forgetting

The mechanics of mid-conversation drift

What actually helps

Structured memory injection: At each turn, prepend a short system-level summary of confirmed facts — account ID, issue type, resolution offered so far — into the system prompt. This persists even when raw messages scroll out.
Hard conversation limits: Set a maximum turn count per session and prompt users to open a new thread if they exceed it. Brief the team on why.
Model selection by task depth: For support threads expected to run long, a 128K-context model changes the math entirely, though cost-per-token increases accordingly. Tokens and Context Windows: Trade-offs, Options, and How to Decide breaks down when the upgrade is worth it.

The Code Review That Worked — Because the Context Was Right-Sized

Total context load: comfortably under 10,000 tokens, well within almost any modern model's window. The output was specific, actionable, and consistently high quality across dozens of reviews.

Why this worked

The task was scoped to fit the window. They didn't try to feed the entire codebase.
The system prompt was precise, not padded. Instructions said exactly what patterns to catch, with examples. Vague instructions inflate token cost without improving output.
They ran the review on the diff only, not the full file. Cutting irrelevant context is just as important as counting total tokens.

The 100K-Context Experiment That Underperformed

The "lost in the middle" problem

What a better design looks like

Curate, don't dump. Pull the 10 most relevant competitor pages, not all 200. Distill the brand guide to its 3–5 core principles before it enters the prompt.
Position matters. Place the most critical instructions and context at the top of the prompt (or the bottom if that's where your model attends best — worth testing for your specific use case).
Break the task. A strategy document is several tasks: competitive landscape, brand positioning, channel prioritization, messaging. Run them sequentially with focused context for each.

For a deeper look at how to structure these decisions systematically, A Framework for Tokens and Context Windows offers a step-by-step method.

The RAG Pipeline That Got the Retrieval Wrong

The chunk size problem

Calibrations that matter

Chunk with overlap. Using 100–150 token overlaps between chunks preserves sentence context across boundaries.
Retrieve more, then re-rank. Pull 10–15 chunks, re-rank by relevance, then pass the top 4–5 to the model. This improves quality without blowing the context budget.
Add metadata to chunks. Including document title, section header, and date in each chunk gives the model structural context to reason about what it's reading.
Test retrieval separately from generation. Most RAG quality failures are retrieval failures, not generation failures. Evaluate whether the right chunks are coming back before you blame the model's output.

See the Case Study: Tokens and Context Windows in Practice for a full walkthrough of a RAG build that went through exactly this diagnostic process.

The Cost Blowout on a Batch Summarization Job

The token economy of prompt design

Audit your system prompts regularly. Every 100 tokens of system prompt costs you across every single request. The economics compound.
Match model to task. A smaller, cheaper model handles routine summarization competently. Reserve premium models for tasks that genuinely require their capability.
Use the [Tokens and Context Windows Checklist for 2026](/blog/tokens-and-context-windows-checklist) before any batch job. A 20-minute review before 1,200 requests is worth considerably more than the time spent explaining overages.

Frequently Asked Questions

How do I know if my content is exceeding the context window?

Does a bigger context window always mean better performance?

What's the practical difference between truncation and summarization when context fills up?

Why does my model give worse output on longer prompts even when I'm under the context limit?

How should I think about tokens when pricing a client project?

Key Takeaways

Token overflow fails silently. Models rarely flag that they've truncated your input — verify counts before you submit anything consequential.
Long context ≠ good context. Buried content is underweighted; curating what enters the window matters as much as the window's size.
Chunk and map-reduce for long documents. This pattern scales to any length and produces more reliable summaries than pasting everything in at once.
Persistent memory injection beats conversation history. For multi-turn applications, write key facts into the system prompt explicitly rather than relying on the model to hold them across a long thread.
System prompts are paid per request. Audit them for bloat before any batch operation; the cost compounds at scale.
RAG quality failures are usually retrieval failures. Evaluate chunk size, overlap, and re-ranking logic before blaming generation quality.
Match model size to task complexity. Cost optimization is a design decision, not an afterthought.

Which Variable Failed Your AI Task, Almost Always the Window

The Document Summarization That Lost Its Ending

What went wrong

What works instead

The Long Customer Service Thread That Started Forgetting

The mechanics of mid-conversation drift

What actually helps

The Code Review That Worked — Because the Context Was Right-Sized

Why this worked

The 100K-Context Experiment That Underperformed

The "lost in the middle" problem

What a better design looks like

The RAG Pipeline That Got the Retrieval Wrong

The chunk size problem

Calibrations that matter

The Cost Blowout on a Batch Summarization Job

The token economy of prompt design

Frequently Asked Questions

How do I know if my content is exceeding the context window?

Does a bigger context window always mean better performance?

What's the practical difference between truncation and summarization when context fills up?

Why does my model give worse output on longer prompts even when I'm under the context limit?

How should I think about tokens when pricing a client project?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Which Variable Failed Your AI Task, Almost Always the Window

The Document Summarization That Lost Its Ending

What went wrong

What works instead

The Long Customer Service Thread That Started Forgetting

The mechanics of mid-conversation drift

What actually helps

The Code Review That Worked — Because the Context Was Right-Sized

Why this worked

The 100K-Context Experiment That Underperformed

The "lost in the middle" problem

What a better design looks like

The RAG Pipeline That Got the Retrieval Wrong

The chunk size problem

Calibrations that matter

The Cost Blowout on a Batch Summarization Job

The token economy of prompt design

Frequently Asked Questions

How do I know if my content is exceeding the context window?

Does a bigger context window always mean better performance?

What's the practical difference between truncation and summarization when context fills up?

Why does my model give worse output on longer prompts even when I'm under the context limit?

How should I think about tokens when pricing a client project?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?