A mid-sized content agency discovered its AI-assisted editorial workflow was producing inconsistent output — sometimes sharp, sometimes weirdly truncated or repetitive — and couldn't figure out why. The quality varied by project, by writer, by day. After three weeks of troubleshooting prompts, swapping models, and adjusting temperature settings, the real culprit emerged: nobody on the team understood how tokens and context windows actually worked in practice. They had been treating the AI like a human editor with unlimited memory. It isn't, and it doesn't have any.
This case study walks through what that agency learned — not as abstract theory, but as a real operational sequence with real trade-offs and measurable results. The situation, the decisions made, the execution, and the outcomes are representative of a pattern we see across agencies and professional teams adopting AI. If you've ever had an AI seem to "forget" instructions halfway through a long document, produce a great first section and a hollow last one, or inexplicably ignore context you're sure you provided, this article is for you.
Understanding tokens and context windows isn't optional background knowledge for AI-powered teams. It's core infrastructure thinking — the difference between a workflow that scales and one that quietly degrades.
The Situation: When Output Quality Becomes Unpredictable
The agency — a ten-person editorial shop producing long-form content for B2B SaaS clients — had integrated an LLM into their production workflow about six months before the problems became obvious. Early wins came fast. Drafting time dropped. Outlines got sharper. Research summaries were useful.
Then, around the time average article length crept past 3,000 words and briefs started carrying more client history, quality variance spiked. Editors were spending more time correcting AI output than they had in month one.
What the Symptoms Actually Indicated
The team diagnosed the problem as a prompt quality issue. They weren't wrong, exactly — but they were diagnosing downstream effects rather than root causes.
The real issues were:
- Context window overflow: Long briefs, conversation histories, and pasted reference documents were pushing total token counts past the effective context limit of the model they were using.
- Attention dilution: Even within a nominally large context window, models weight earlier and later tokens more heavily than content buried in the middle. Instructions placed in the center of a 12,000-token prompt were being functionally ignored.
- No token accounting: Nobody had ever measured how many tokens a typical brief consumed. They were operating blind.
The agency had no framework for thinking about this. They needed one fast.
Defining the Problem Operationally
Before fixing anything, the team spent two days measuring. This is unglamorous work, but it's where real understanding begins.
They pulled ten recent projects and, using a tokenizer tool, calculated the token count of every element they were passing to the model: the system prompt, the editorial brief, the reference documents, the conversation history, and any draft content included for revision.
The results surprised them. A typical project bundle was running between 8,000 and 14,000 tokens before a single word of new output was generated. The model they were using had a 16,000-token context window. On several projects, they had been operating with fewer than 2,000 tokens of "headroom" for the actual output — and on two projects, they had almost certainly been truncating their own inputs without realizing it.
This kind of audit is the first step in any serious tokens and context windows work. How to Measure Tokens and Context Windows: Metrics That Matter covers the instrumentation in detail, but the minimum viable version is simply: count what you're putting in, and compare it to what the model can hold.
The Decision: Model Selection and Prompt Architecture
Armed with actual numbers, the agency faced a genuine decision with real trade-offs. They had three options.
Option A: Move to a Larger Context Window Model
Models with 100,000+ token context windows were available. Switching would eliminate the overflow problem immediately. The cost: inference cost per token was meaningfully higher, and latency on very long contexts was slower in ways that would affect the team's turnaround expectations.
Option B: Redesign Prompt Architecture to Fit Existing Constraints
Rather than expand the container, shrink what goes into it. This meant compressing briefs, summarizing reference material, and being ruthless about what actually needed to be in the context at any given moment.
Option C: Hybrid — Model Upgrade Plus Structural Redesign
Upgrade the model for complex, high-value projects while redesigning prompt architecture for routine work. More operationally complex, but it optimized cost and quality simultaneously.
They chose Option C. The reasoning was sound: paying for a 100K context window to process a routine 800-word blog post brief is waste. Trying to compress a nuanced 40-page brand guide into 2,000 tokens is destruction of signal. The right tool depends on the task.
This is the core insight of Tokens and Context Windows: Trade-offs, Options, and How to Decide — there is no universally correct configuration, only configurations that match specific workload profiles.
The Execution: Five Concrete Changes
The team implemented changes over three weeks, rolling out one change at a time to isolate effects.
Change 1: Token Budgeting Per Project Type
They created three project tiers based on complexity and brief length. Each tier had a defined token budget broken into components: system prompt (fixed, optimized once), brief (variable, capped by tier), reference material (summarized to a hard limit), conversation history (pruned after three exchanges unless flagged as critical).
Tier 1 (routine content): 6,000-token total input budget, standard model Tier 2 (complex/strategic): 20,000-token budget, upgraded model Tier 3 (deep research/long-form): 80,000+ token budget, highest-capability model
Change 2: System Prompt Compression
Their original system prompt was 1,100 tokens. After editing for redundancy and converting verbose instructions into structured rules, it came down to 340 tokens without losing any functional guidance. This freed up meaningful space across every single project.
Change 3: Reference Document Summarization
Instead of pasting client brand guides and research documents directly, a junior editor ran a preprocessing step: use the AI itself to generate a structured 500-token summary of each document, flagging the five to seven most operationally relevant facts. This summary replaced the raw document in the main prompt.
Change 4: Instruction Placement
The team moved all critical instructions to the top of the prompt (before any content) and added a brief restatement at the end. This addressed the attention dilution problem — leveraging the primacy and recency effects that most transformer-based models exhibit. Instructions buried in the middle of long prompts became a firing offense, organizationally speaking.
Change 5: Conversation History Pruning
For multi-turn workflows, they implemented a manual pruning rule: after three exchanges, an editor reviews the conversation history and collapses it into a summary of decisions made. The summary replaces the raw exchange history. This kept context windows clean across longer projects without losing continuity.
The Outcome: Measurable Improvements Over 60 Days
Sixty days after full implementation, the agency measured across four dimensions.
Output consistency: Editor revision time dropped by roughly 35–40%. The "hollow last section" problem essentially disappeared — a direct result of no longer overflowing context windows.
Cost efficiency: Routing routine projects to the standard model instead of defaulting to the premium model reduced inference costs by approximately 45% month-over-month, while quality on those projects held steady or improved due to better prompt architecture.
Throughput: Faster, more predictable output meant the editorial team could move from roughly 22 completed pieces per month to 29 with the same headcount.
Team confidence: This one is harder to quantify but real. Writers and editors stopped experiencing the AI as unpredictable. When you understand why something works, you can make it work reliably. Competence replaced learned helplessness.
The underlying lesson is structural: the agency didn't get better results by finding a smarter prompt or a better model. They got better results by building a disciplined system around how they used what they already had. Referencing A Framework for Tokens and Context Windows during their redesign gave them a principled starting point rather than ad hoc fixes.
What Most Teams Get Wrong
The agency's experience is not unusual. Here are the failure modes that appear most consistently across similar case studies.
Treating context windows as binary. Teams think in terms of "fits" or "doesn't fit." In reality, quality degrades before you hit the hard limit. Effective context — the range where the model reliably attends to all your content — is typically 60–80% of the nominal maximum, not 100%.
Optimizing the wrong layer. Prompt engineering gets all the attention. Token budgeting gets almost none. Both matter. Fixing the wrong layer first wastes time.
No measurement before intervention. Most teams diagnose by feel and fix by experimentation. Running a simple token count audit before touching anything else would save weeks in a typical case.
Static configurations for dynamic workloads. A brief that works fine at 2,000 tokens will break your workflow when the client sends a 40-page strategy document. Build tiering in from the start.
If you're ready to operationalize what you've read here, the The Tokens and Context Windows Checklist for 2026 translates these principles into a step-by-step audit you can run on your own workflow this week. For teams evaluating specific software to support this work, The Best Tools for Tokens and Context Windows covers the current landscape.
Frequently Asked Questions
What is a context window, and why does it limit output quality?
A context window is the total amount of text — measured in tokens — that a model can "see" at once during a single inference call. When your inputs consume most of that window, the model has limited space to generate output and may truncate or compress its response. More critically, models attend unevenly to content across long contexts, so instructions and information placed in the middle of a packed prompt may receive less weight than the same content placed at the start or end.
How many tokens does a typical business document consume?
Rough rule of thumb: one token equals approximately three to four characters, or about 0.75 words in English. A 1,000-word document runs roughly 1,300–1,400 tokens. A 10-page brand guide might run 6,000–9,000 tokens depending on formatting. The practical implication is that pasting long documents directly into prompts consumes context budget fast — summarization or chunking is usually necessary.
Should I always choose the model with the largest context window?
No. Larger context windows generally come with higher per-token cost and, in some cases, higher latency. For routine tasks with short inputs, a well-configured smaller-context model will often outperform a large-context model used carelessly. Match model capability to task complexity, and design your prompt architecture to fit within the window you're paying for.
What's the fastest way to diagnose a context window problem in an existing workflow?
Run a token count on your ten most recent prompts using any free tokenizer (most model providers offer one). Compare total input tokens to the model's stated context limit. If you're consistently using more than 70% of the available window before output generation, you have a structural problem worth addressing before changing anything else.
Does splitting a task across multiple API calls solve context window limitations?
Partially. Breaking a long document into chunks and processing each separately avoids overflow but introduces a coherence challenge: the model has no memory of previous chunks unless you explicitly pass summaries forward. Multi-call workflows require deliberate context handoff design — summarizing decisions or key information from each call and prepending it to the next.
How often should a team revisit their token budgeting configuration?
Any time the nature of the work changes significantly: new client types, new document formats, a change in model, or a notable shift in average brief length. A lightweight monthly review — comparing actual token usage to budgets — is sufficient for most teams. Major model updates (new releases, context window expansions) warrant a full reassessment.
Key Takeaways
- Context window overflow is a structural problem, not a prompt problem. Diagnose before you optimize.
- Effective context is roughly 60–80% of nominal maximum — quality degrades before the hard limit.
- Token budgeting by project tier is the most practical way to match cost, model capability, and task complexity.
- System prompt compression, reference document summarization, and instruction placement are the three highest-leverage changes most teams can make immediately.
- Instruction placement matters: lead with critical guidance, restate at the end, and never bury key directives in the middle of a long prompt.
- Measurement comes first. A simple token count audit on recent prompts takes under an hour and changes what you see.
- Teams that build systems around token management consistently outperform teams that rely on model capability alone.