Why Models Forget Instructions Three Paragraphs Back

Tokens and context windows are the infrastructure layer most teams skip. They read tutorials on prompting, experiment with ChatGPT, and then hit a wall — outputs degrade halfway through a long document, costs spike without explanation, or the model seems to "forget" instructions given three paragraphs earlier. The root cause is almost always the same: nobody explained how the model actually reads and remembers.

This playbook closes that gap. You'll get a clear mental model of what tokens and context windows are, specific plays for managing them in real workflows, the triggers that tell you when to act, who should own each decision, and how to sequence all of it. Whether you're deploying AI across a client services team or building repeatable internal workflows, this is the operational layer that makes everything else work.

Think of tokens as the unit of measurement for everything that passes through a language model — input, output, memory, cost. Think of the context window as the desk the model works on: big enough to hold a lot, but finite, and everything that falls off the edge is gone. Mismanage those two constraints and no amount of prompt craft will save you.

What Tokens Actually Are

A token is not a word and it's not a character. It's a chunk of text that the model's tokenizer has learned to treat as a single unit. In English, one token is roughly three to four characters, or about three quarters of a word, which means 100 tokens is approximately 75 words. Short common words such as "the," "is," and "and" often map one-to-one with tokens. Long technical terms, proper nouns, and non-English text tend to split into multiple tokens.

Why does this matter operationally? Because pricing, rate limits, and context capacity are all measured in tokens, not words. A 2,000-word brief is roughly 2,700 tokens. A dense legal contract at 10,000 words is closer to 13,000 tokens, and often more, because legal and technical vocabulary tends to split into extra tokens. Underestimate this conversion and you'll exceed context limits mid-task or blow past your monthly token budget without understanding why.
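The conversion can be sketched in a few lines. This is a rough estimator only, assuming the common rule of thumb that one token is about three quarters of an English word; the constant and function names are illustrative, and a real tokenizer will give different counts for technical or non-English text.

```python
import math

# Rule of thumb for English prose (an assumption, not a tokenizer):
# one token is about three quarters of a word.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a given number of English words will use."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return math.floor(token_count * WORDS_PER_TOKEN)


print(estimate_tokens(2_000))    # 2667
print(estimate_tokens(10_000))   # 13334
print(estimate_words(128_000))   # 96000
```

Treat the result as a planning figure. Count with the model's actual tokenizer before committing to a budget.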

The Tokenizer Gap

Different model families use different tokenizers. GPT-4 uses a tokenizer called cl100k_base. Claude uses its own. Llama models use yet another. This means the same input produces different token counts across models, which affects both cost comparison and context planning. When you're evaluating models, count tokens for your actual use cases — don't extrapolate from marketing examples.

Use OpenAI's Tokenizer tool or the tiktoken library to count tokens before you commit to a prompt architecture. For teams using Claude, Anthropic's API returns token counts in the response metadata. Make counting a standard step in prompt design, not an afterthought.
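A minimal counting helper might look like the sketch below. It assumes the tiktoken package is installed when no encoder is supplied, and it accepts any other encoder as a plain function so the same prompt can be counted under several tokenizers. The helper names and the whitespace encoder are illustrative stand-ins, not part of any library.

```python
from typing import Callable, Optional, Sequence


def count_tokens(text: str,
                 encode: Optional[Callable[[str], Sequence]] = None) -> int:
    """Count tokens in `text` using the supplied encoder function.

    With no encoder, fall back to tiktoken's cl100k_base encoding,
    which requires `pip install tiktoken`.
    """
    if encode is None:
        import tiktoken  # imported lazily so other encoders work without it
        encode = tiktoken.get_encoding("cl100k_base").encode
    return len(encode(text))


def whitespace_encode(text: str) -> list:
    """Stand-in encoder for illustration only; NOT a real tokenizer."""
    return text.split()


print(count_tokens("count these four words", encode=whitespace_encode))  # 4
```

Passing the encoder in as an argument keeps the comparison honest: run the same representative prompts through each candidate model's tokenizer and compare the counts side by side.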

What Context Windows Are and Why They Have Edges

The context window is the total number of tokens a model can hold in active memory at once, input and output combined. A model with a 128,000-token context window can "see" roughly 96,000 words of English text at a time. That sounds like a lot until you feed it a 200-page report, a full codebase, or a long conversation thread with system prompt overhead included.

When content exceeds the window, one of three things happens depending on the implementation: the oldest content gets truncated silently, the API returns an error, or the application summarizes and re-injects. None of these is free: all three have quality or cost consequences.
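One way to avoid finding out the hard way is a pre-flight check that fails loudly before the request is sent. The sketch below is an assumption about how a team might wire this up; the exception and function names are hypothetical, and the token counts are expected to come from whatever tokenizer matches the model in use.

```python
class ContextOverflowError(ValueError):
    """Raised when a request would not fit in the model's context window."""


def check_fits(input_tokens: int, max_output_tokens: int,
               context_window: int) -> int:
    """Return the remaining headroom in tokens, or raise if the request
    (input plus reserved output) exceeds the context window."""
    needed = input_tokens + max_output_tokens
    if needed > context_window:
        raise ContextOverflowError(
            f"request needs {needed} tokens but the window is {context_window}"
        )
    return context_window - needed


print(check_fits(100_000, 4_000, 128_000))  # 24000
```

Raising an error in your own code turns a silent truncation into a visible decision point: chunk the input, prune history, or pick a larger window.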

The Recency Bias Problem

Even when content fits inside the context window, position matters. Research on long-context retrieval has found that models tend to use information at the beginning and end of a context more reliably than information in the middle, an effect often called "lost in the middle." Instructions near the end of the prompt benefit from recency; instructions buried in the center get the least reliable attention. If you bury your key instructions in paragraph eight of a twenty-paragraph prompt, expect degraded compliance.

This isn't a bug you can file a ticket for. It's a structural property of the architecture. The operating response is to put critical instructions at the top and the bottom, and to test retrieval at different positions before deploying any long-context workflow.
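The "top and bottom" placement can be sketched as a small prompt builder. The section labels and the function name are illustrative choices, not a required format; the point is only that the instructions appear before and after the long material.

```python
def build_long_prompt(instructions: str, documents: list[str]) -> str:
    """Assemble a long prompt with the instructions stated first and
    repeated last, so they never sit only in the middle of the context."""
    instructions = instructions.strip()
    parts = ["INSTRUCTIONS:\n" + instructions]
    for number, document in enumerate(documents, start=1):
        parts.append(f"DOCUMENT {number}:\n{document.strip()}")
    parts.append("REMINDER OF INSTRUCTIONS:\n" + instructions)
    return "\n\n".join(parts)


prompt = build_long_prompt(
    "Summarize each document in two sentences.",
    ["First report text...", "Second report text..."],
)
print(prompt)
```

Repeating the instructions costs a few extra tokens per call. Whether that trade is worth it is something to confirm by testing retrieval at different positions, as described above.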

The Six Core Plays

This is the operational heart of the playbook. Each play has a trigger (when to run it) and an owner (who decides). The order in which to run the plays is covered under Sequencing the Plays for a New Deployment.

Play 1: Token Budget Setting

Trigger: Before any prompt is deployed to production or shared with a team. Owner: The prompt designer or AI lead.

Set explicit maximum token budgets for input and output separately. For most business tasks — summarization, drafting, classification — outputs rarely need to exceed 1,000 tokens. Reserving too much headroom for output wastes context space that could carry richer input. A useful starting default: allocate 70% of the context window to input, 20% to output, and hold 10% in reserve for system prompt overhead and conversation history.
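The 70/20/10 default can be expressed as a small allocation helper. This is a sketch under the stated percentages; the class and function names are illustrative, and integer percentages are used so the three parts always add up to the full window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    input_tokens: int
    output_tokens: int
    reserve_tokens: int


def split_budget(context_window: int, input_pct: int = 70,
                 output_pct: int = 20) -> TokenBudget:
    """Split a context window into input, output, and reserve allocations.

    Whatever is not assigned to input or output becomes the reserve.
    """
    if input_pct < 0 or output_pct < 0 or input_pct + output_pct > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    input_tokens = context_window * input_pct // 100
    output_tokens = context_window * output_pct // 100
    reserve_tokens = context_window - input_tokens - output_tokens
    return TokenBudget(input_tokens, output_tokens, reserve_tokens)


print(split_budget(128_000))
# TokenBudget(input_tokens=89600, output_tokens=25600, reserve_tokens=12800)
```

For tasks whose outputs stay under 1,000 tokens, lowering `output_pct` frees that space for input instead.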

Play 2: Chunking for Long Documents

Trigger: Source material exceeds 20% of the model's context window. Owner: Whoever owns document workflows — often an operations lead or a technically literate account manager.

Break long documents into semantically coherent chunks rather than character-count chunks. A 50-page report should be chunked by section, not by arbitrary 2,000-token slices. Run each chunk through the model with a consistent task prompt, then aggregate and synthesize results in a second pass. This two-stage architecture is more reliable and auditable than stuffing everything into one prompt.
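A section-aware chunker might be sketched as follows. It assumes sections are separated by blank lines, which will not hold for every document, and it uses a rough four-characters-per-token estimate as a default counter; both are assumptions to replace with your own section detection and tokenizer.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Rough estimate (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def chunk_by_section(document: str, max_tokens: int,
                     count: Callable[[str], int] = rough_token_count) -> list[str]:
    """Pack whole sections into chunks of at most `max_tokens` (estimated).

    Chunk boundaries always fall on section breaks. A single section larger
    than `max_tokens` is kept intact as its own oversized chunk.
    """
    sections = [s.strip() for s in document.split("\n\n") if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for section in sections:
        section_tokens = count(section)
        # Start a new chunk if adding this section would exceed the cap.
        if current and current_tokens + section_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += section_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each returned chunk can then go through the same task prompt in the first pass, with the per-chunk results combined in the second, synthesizing pass.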

Play 3: Context Pruning

Trigger: Any conversational AI tool that accumulates history (chatbots, AI assistants, long ideation sessions). Owner: The tool administrator or the individual operator depending on the deployment type.

Conversation history is a context sink. Every prior message re-enters the context window on each turn, so input tokens per turn grow with the length of the conversation and the cumulative cost grows roughly with the square of the number of turns, while output quality degrades as the window fills. Prune by summarizing earlier turns into a compact state block (a structured paragraph capturing key decisions, constraints, and context) and replacing the raw history with that summary. Many enterprise chat tools support this natively. If yours doesn't, build a manual pruning step into your team's AI usage protocol.
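Both halves of this play can be sketched briefly: why the cost climbs, and what pruning looks like. The model below assumes every turn adds the same number of tokens and resends the full history, which is a simplification. The `summarize` argument is a hypothetical callable (in practice, usually another model call) and is not provided by any library.

```python
from typing import Callable


def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent over a conversation when each turn resends
    all earlier messages: tokens_per_turn * (1 + 2 + ... + turns)."""
    return tokens_per_turn * turns * (turns + 1) // 2


def prune_history(messages: list[str], keep_last: int,
                  summarize: Callable[[list[str]], str]) -> list[str]:
    """Replace all but the last `keep_last` messages with one summary block."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    if len(messages) <= keep_last:
        return list(messages)
    split = len(messages) - keep_last
    older, recent = messages[:split], messages[split:]
    return ["STATE SUMMARY: " + summarize(older)] + recent


print(cumulative_input_tokens(10, 200))  # 11000
print(cumulative_input_tokens(40, 200))  # 164000
```

In this simplified model, a conversation four times as long sends roughly fifteen times as many input tokens, which is why pruning pays off quickly in long sessions.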

Play 4: System Prompt Hygiene

Trigger: Any time a system prompt exceeds 500 tokens or hasn't been audited in 30 days. Owner: The AI lead or whoever owns prompt governance — a natural fit for the roles described in The Large Language Models Playbook.

System prompts compete for context space with the actual task content. Audit them the way you audit code: remove redundancy, eliminate examples that aren't earning their keep, and version-control every change. A bloated system prompt can silently consume 15–25% of your usable context window before the user has typed a single word.
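The audit trigger and the footprint measurement are simple enough to automate. The sketch below uses the 500-token and 30-day thresholds from this play as defaults; the function names are illustrative, and the token count is assumed to come from the tokenizer of the model in use.

```python
def system_prompt_share(system_prompt_tokens: int, context_window: int) -> float:
    """Fraction of the context window consumed by the system prompt."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return system_prompt_tokens / context_window


def needs_audit(system_prompt_tokens: int, days_since_audit: int,
                max_tokens: int = 500, max_days: int = 30) -> bool:
    """True when the prompt is over the size limit or the audit is overdue."""
    return system_prompt_tokens > max_tokens or days_since_audit > max_days


print(needs_audit(system_prompt_tokens=650, days_since_audit=10))  # True
print(system_prompt_share(1_600, 8_000))                           # 0.2
```

Running a check like this whenever a system prompt changes in version control keeps the footprint visible instead of letting it creep.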

Play 5: Model-Context Matching

Trigger: When selecting a model for a new use case, or when an existing workflow is underperforming. Owner: The AI lead in consultation with the budget owner.

Not every task needs a 128K-token context window. Short classification tasks, tone checks, and single-paragraph rewrites run fine on smaller models with 8K–16K windows, which often cost several times less per token than frontier models with maximum context; check current provider pricing rather than relying on a fixed ratio. Match window size to task requirements, not to what sounds most impressive. Rolling Out Large Language Models Across a Team covers the model selection framework in more depth if you're doing this at scale.
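Matching can be reduced to a simple rule: among the models whose window fits the task, take the cheapest. The sketch below illustrates that rule only. The model names, window sizes, and prices in the usage example are made-up placeholders, not real products or real pricing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelOption:
    name: str
    context_window: int
    price_per_million_tokens: float  # placeholder figure for illustration


def pick_model(options: list[ModelOption],
               required_tokens: int) -> Optional[ModelOption]:
    """Return the cheapest model whose window fits `required_tokens`
    (input plus output), or None if nothing fits."""
    fitting = [m for m in options if m.context_window >= required_tokens]
    return min(fitting, key=lambda m: m.price_per_million_tokens, default=None)


# Hypothetical catalog: names and prices are invented for the example.
catalog = [
    ModelOption("small-8k", 8_000, 0.50),
    ModelOption("medium-16k", 16_000, 1.00),
    ModelOption("large-128k", 128_000, 10.00),
]
print(pick_model(catalog, required_tokens=3_000).name)   # small-8k
print(pick_model(catalog, required_tokens=40_000).name)  # large-128k
```

Price is only one criterion. A real selection would also weigh output quality on the task, which this sketch leaves out.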

Play 6: Output Compression

Trigger: When output quality feels high but verbosity is inflating token costs or downstream processing time. Owner: Prompt designer.

Instruct the model explicitly on output format and length. "Respond in no more than 150 words using bullet points" is not just a stylistic preference — it reduces output tokens, which lowers cost and keeps the response tight enough for downstream use without post-processing. Models will fill available space unless you constrain them. Constrain them.

Failure Modes to Anticipate

Every playbook needs an honest failure section. Here are the four most common ways teams mismanage tokens and context windows, drawn from recurring patterns in AI deployments.

Silent truncation. The model processes what it can and doesn't tell you what it missed. You get a confident-sounding answer that's missing a third of the source material. Build verification steps into any workflow where completeness matters.

Context poisoning. Irrelevant content in a large context window can confuse the model's attention even when it doesn't exceed the limit. A 10,000-token context full of loosely related documents will produce worse output than a 3,000-token context containing only the relevant passage. Relevance beats volume.

Cost drift. Token costs compound invisibly in team deployments. A workflow that costs $0.03 per run seems trivial until it's running 5,000 times a month across a team of twenty. Instrument your deployments to track token usage per workflow, not just total monthly spend. The Hidden Risks of Large Language Models (and How to Manage Them) covers cost governance in the broader risk management frame.
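Per-workflow instrumentation does not need much machinery. The ledger below is a minimal in-memory sketch; the class name and workflow labels are illustrative, and a real deployment would persist these records and read the token counts from the provider's response metadata.

```python
from collections import defaultdict


class UsageLedger:
    """Accumulate token usage per workflow rather than as one monthly total."""

    def __init__(self) -> None:
        self._tokens: dict[str, int] = defaultdict(int)
        self._runs: dict[str, int] = defaultdict(int)

    def record(self, workflow: str, input_tokens: int, output_tokens: int) -> None:
        """Log one run of a workflow."""
        self._tokens[workflow] += input_tokens + output_tokens
        self._runs[workflow] += 1

    def tokens_per_run(self, workflow: str) -> float:
        """Average tokens per run, or 0.0 if the workflow has no runs."""
        runs = self._runs.get(workflow, 0)
        return self._tokens.get(workflow, 0) / runs if runs else 0.0

    def report(self) -> dict[str, int]:
        """Total tokens per workflow, largest consumer first."""
        return dict(sorted(self._tokens.items(), key=lambda kv: -kv[1]))


ledger = UsageLedger()
ledger.record("summarize", input_tokens=1_200, output_tokens=300)
ledger.record("summarize", input_tokens=1_800, output_tokens=300)
ledger.record("classify", input_tokens=200, output_tokens=10)
print(ledger.report())  # {'summarize': 3600, 'classify': 210}
```

Sorting by total tokens puts the workflows worth auditing at the top of the report.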

False confidence in large windows. The availability of 128K or 200K context windows tempts teams to stop designing prompts carefully and just throw everything in. Large windows are a capability, not a strategy. The recency bias problem doesn't disappear at scale — it gets harder to diagnose.

Sequencing the Plays for a New Deployment

If you're standing up a new AI workflow from scratch, run the plays in this order:

1. Model-context matching: choose the right model before writing a single prompt.
2. Token budget setting: establish input/output allocations.
3. System prompt hygiene: draft a lean system prompt and measure its token footprint.
4. Chunking design: if source material is long, architect the chunking strategy before testing.
5. Output compression: specify format and length constraints in the prompt.
6. Context pruning protocol: if the workflow is conversational, define when and how history gets summarized.

This sequence prevents the most expensive mistakes: over-engineering the prompt before the model is chosen, or discovering the context architecture is broken after a workflow is already in production. Teams working through broader LLM adoption questions will find the Large Language Models: The Questions Everyone Asks, Answered article a useful companion here.

Ownership and Governance

Context window management is not a one-time configuration. Token costs shift as models update their pricing. Context windows expand (and occasionally contract) as providers release new versions. Your prompt library grows, system prompts drift, and the team's usage patterns evolve.

Assign a single owner — usually an AI lead or operations lead — to audit token efficiency quarterly. That audit has three components: review system prompt sizes against baseline, sample 20–30 recent outputs to check for truncation or context poisoning signs, and compare token spend per workflow against the previous quarter. This takes two to four hours per quarter and prevents the compounding drift that makes AI deployments expensive and unreliable over time.

For teams that are earlier in their AI literacy journey, this governance work connects directly to the skill-building covered in Large Language Models: Myths vs Reality, which addresses the misconceptions that lead to poor context decisions in the first place.

Frequently Asked Questions

What's the difference between a token limit and a context window?

They're related but not identical. The context window is the total capacity — the amount of text the model can process at once. Token limits are constraints applied at different points: the maximum input tokens, the maximum output tokens, and sometimes a rate limit on tokens per minute. All of these are measured in the same unit, but they operate at different levels of the system.

Does a larger context window mean better output quality?

Not automatically. A larger window gives the model more information to work with, but it also introduces the "lost in the middle" problem where content positioned in the center of a long context is retrieved less reliably. Larger windows are useful when you genuinely need them; using them indiscriminately often produces worse results than a carefully scoped smaller context.

How do I know if my workflow is hitting context limits silently?

Look for output that seems incomplete, stops mid-sentence, or answers a question about a part of your document without acknowledging other parts. You can also check API response metadata — most providers return the input and output token counts per call. Compare those counts against your expected input size. A significant discrepancy suggests truncation.
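The comparison described here can be sketched as a single check. The 10% tolerance is an assumption chosen for illustration, since a locally computed count and the provider's reported count rarely match exactly; the function name is hypothetical.

```python
def looks_truncated(expected_input_tokens: int, reported_input_tokens: int,
                    tolerance: float = 0.10) -> bool:
    """True when the provider-reported input count falls short of the
    expected count by more than `tolerance` (a fraction of the expected)."""
    if expected_input_tokens <= 0:
        raise ValueError("expected_input_tokens must be positive")
    shortfall = (expected_input_tokens - reported_input_tokens) / expected_input_tokens
    return shortfall > tolerance


print(looks_truncated(12_000, 8_000))   # True: a third of the input is missing
print(looks_truncated(12_000, 11_800))  # False: within normal variation
```

A flagged call is a prompt to investigate, not proof of truncation: a tokenizer mismatch between your estimate and the provider's count can also produce a gap.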

Is it worth using retrieval-augmented generation (RAG) instead of large context windows?

For many business use cases, yes. RAG retrieves only the relevant passages from a large document corpus and injects them into the prompt, keeping context lean and targeted. It requires more infrastructure to set up than a direct large-context approach, but it scales better and often produces higher-quality outputs because the context is more relevant. The right choice depends on how often the source material changes and how much retrieval precision your use case requires.
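The retrieval step can be illustrated with a deliberately simple stand-in. Production RAG systems typically rank passages with embeddings and a vector index; the sketch below swaps that for plain word overlap so it runs with the standard library alone, and the function names and sample passages are invented for the example.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric words in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Return up to `top_k` passages sharing the most words with the query.

    Passages with no overlap are dropped; ties keep their original order.
    """
    query_words = _words(query)
    scored = [(len(query_words & _words(p)), i, p) for i, p in enumerate(passages)]
    scored = [entry for entry in scored if entry[0] > 0]
    scored.sort(key=lambda entry: (-entry[0], entry[1]))
    return [p for _, _, p in scored[:top_k]]


passages = [
    "Refunds are issued within 14 days of a return.",
    "Our office is closed on public holidays.",
    "Return shipping is free for damaged items.",
]
for passage in retrieve("How long do refunds take after a return?", passages):
    print(passage)
```

Only the retrieved passages go into the prompt, which is what keeps the context lean regardless of how large the underlying corpus grows.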

How should teams track token costs without becoming obsessed with micro-optimization?

Set cost-per-workflow benchmarks, not just total monthly spend. If a summarization workflow costs $0.05 per run at launch and creeps to $0.12 six months later without a corresponding quality improvement, that's a signal to audit. Below that threshold, focus on output quality rather than token counting. The goal is cost awareness, not cost anxiety.
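A benchmark check of this kind can be sketched as a ratio against the launch cost. The 1.5x audit threshold is an assumption for illustration, not a recommended standard, and the function names are hypothetical.

```python
def cost_drift(baseline_cost: float, current_cost: float) -> float:
    """Ratio of current to baseline cost per run (1.0 means unchanged)."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return current_cost / baseline_cost


def should_audit(baseline_cost: float, current_cost: float,
                 max_ratio: float = 1.5) -> bool:
    """True when cost per run has grown past `max_ratio` times the baseline."""
    return cost_drift(baseline_cost, current_cost) > max_ratio


print(should_audit(0.05, 0.12))  # True: cost per run has more than doubled
print(should_audit(0.05, 0.06))  # False: within the chosen threshold
```

A single threshold per workflow is enough to separate real drift from noise without turning cost tracking into a daily task.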

Key Takeaways

Tokens are the unit everything is measured in — input, output, cost, and memory. One token is roughly 3–4 characters in English.
The context window is finite and shared between input and output. Position within the window affects how reliably content is used.
Run the six plays in sequence for new deployments: model matching, budget setting, system prompt hygiene, chunking design, output compression, context pruning.
System prompts, conversation history, and irrelevant content all consume context space silently. Audit regularly.
Large context windows are a capability, not a strategy. Relevance beats volume; targeted context beats maximum context.
Assign a single owner to quarterly token efficiency audits. Two to four hours per quarter prevents significant cost and quality drift.
The most dangerous failure mode is silent truncation — build verification into any workflow where completeness matters.


Agency Script Editorial

