Tokens and context windows are the two mechanical facts that explain more about how large language models behave than almost anything else. Understanding them isn't optional for anyone who uses AI seriously — it's the difference between building reliable workflows and being constantly surprised by model behavior you can't diagnose.
The core problem most practitioners hit is this: they treat an AI model like a search engine or a human colleague, neither of which has the constraints that LLMs operate under. A search engine has no memory between queries. A human colleague carries months of context in their head. A language model sits in a precise middle ground — it has a defined window of memory that resets, processes information in a specific unit called a token, and charges you (in time, money, or quality) based on both. Knowing that changes how you design prompts, structure documents, and build agent pipelines.
This guide covers the full picture: what tokens actually are, how context windows work mechanically, what happens at the boundaries, how to think about cost and performance, and how to make deliberate decisions when working within these constraints. If you're new to the topic, you may want to start with the Tokens and Context Windows: A Beginner's Guide before diving in here. If you've already got the basics, read on.
What a Token Actually Is
A token is the fundamental unit a language model reads and writes. It is not a word, a character, or a syllable — though it loosely correlates with all three.
Modern models use a technique called byte-pair encoding (BPE) or similar subword tokenization schemes. The tokenizer splits input text into the most statistically efficient chunks based on how often those character sequences appear in training data. In practical terms:
- Common English words are usually a single token: "the," "house," "running"
- Less common or longer words often split: "unbelievable" might become ["un", "believ", "able"] — three tokens
- Numbers tokenize inconsistently: "2024" might be one token, "2025" might be two
- Spaces and punctuation count: a space before a word is often part of the token, not separate
- Code tokenizes differently than prose — some languages are more token-efficient than others
A useful rule of thumb for English text: roughly 750 words ≈ 1,000 tokens, or about 4 characters per token on average. This is an approximation, not a law. Technical content, non-English text, and heavy formatting can push the ratio significantly.
Why Tokenization Is Model-Specific
Each model family has its own tokenizer. GPT-4 and GPT-4o use the cl100k_base tokenizer. Claude uses Anthropic's proprietary tokenizer. Gemini uses a different one still. A prompt that is 800 tokens on one model may be 900 on another. This matters when you're estimating costs, comparing model capacities, or building tools that need precise token budgets. OpenAI's public Tokenizer tool and the tiktoken Python library let you count tokens before sending them. Anthropic exposes token counts in its API response metadata.
What a Context Window Is
The context window is the maximum number of tokens a model can process in a single inference call — the sum of your input (system prompt + conversation history + documents) and the model's output.
Think of it as working memory. Everything the model "knows" during a single call must fit inside that window. When you start a new conversation, the window resets. There is no passive background memory unless you build one explicitly.
Context Window Sizes Across Major Models
Window sizes have grown dramatically. As of mid-2025, typical ranges look like this:
- Smaller/faster models (GPT-4o mini, Claude Haiku): 128K tokens
- Mid-tier models (GPT-4o, Claude Sonnet): 128K–200K tokens
- Long-context frontier models (Claude Opus, Gemini 1.5 Pro): 200K–1M+ tokens
- Specialized research models: experimental contexts exceeding 2M tokens exist but are not yet production standard
Larger is not automatically better. Long-context models are generally slower, more expensive per token, and can exhibit "lost in the middle" behavior — where information placed in the center of a very long prompt is retrieved less reliably than information at the start or end.
Input Tokens, Output Tokens, and Why the Distinction Matters
Most LLM APIs price input and output tokens separately, with output tokens typically costing 3–5x more per token than input. This is because generating tokens is computationally more expensive than reading them.
The practical implication: a model that reads a 50,000-token document and produces a 500-word summary is actually quite cheap. A model asked to write a 5,000-word detailed report from a short prompt is significantly more expensive — and slower — because output token generation dominates both cost and latency.
Max output tokens is a separate parameter from context window size. A model with a 200K-token context window might cap output at 4K or 16K tokens depending on configuration. If your use case requires long outputs — full reports, detailed code modules, multi-chapter drafts — verify the output limit separately from the context limit.
The Context Window in Practice: What Fills It
Understanding what consumes tokens in a real API call helps you budget deliberately.
The Anatomy of a Prompt
In a typical API call, the context window fills from several sources:
- System prompt: instructions, persona, rules, format guidance. Can range from 100 tokens to 5,000+ for complex agent setups
- Conversation history: every prior message in the thread. In long conversations, this grows fast
- Retrieved documents or context: RAG results, pasted documents, tool outputs
- The current user message: often the smallest part
- Model output (reserved): the model needs room to generate its response
A common failure mode: you build a chat interface, the conversation runs for 30 turns, and by turn 25 the context is full. The model either truncates older messages (losing important context), starts making errors, or throws a hard API error. Planning for conversation length is not optional — it's architecture.
Context Window Limits and Quality Degradation
Hitting the hard limit causes a crash. Approaching it causes something subtler: quality degradation.
Research and practitioner experience consistently show that:
- Recency bias is real: models weight information closer to the end of context more heavily
- Lost-in-the-middle effect: in very long documents, middle sections are retrieved less reliably
- Instruction drift: if your system prompt is 4,000 tokens and your conversation history is 90,000 tokens, the model may not adhere to the system prompt as reliably
This is why 7 Common Mistakes with Tokens and Context Windows (and How to Avoid Them) consistently surfaces "assuming large context windows = reliable full-document comprehension" as one of the most costly errors practitioners make.
The practical rule: don't fill the context window just because you can. Use the minimum context required for the task. More tokens mean more noise, more cost, and in some configurations, lower accuracy.
Strategies for Working Within Context Constraints
Constraints are architectural inputs, not obstacles. Here's how skilled practitioners handle them.
Prompt Compression
Reduce the token footprint of your inputs without reducing the information density:
- Summarize conversation history periodically rather than passing all raw messages
- Strip formatting from retrieved documents — markdown headers, whitespace, and decorative punctuation all cost tokens
- Use structured shorthand in system prompts where possible: "Reply only with JSON" is cheaper than a paragraph explaining it
Retrieval-Augmented Generation (RAG)
Instead of stuffing entire documents into context, retrieve only the relevant chunks. A 500K-word knowledge base shouldn't live in the context window. A well-built retrieval system surfaces the 2,000–5,000 tokens actually relevant to the question. This is the right architecture for document-heavy workflows. For a detailed walkthrough, see A Step-by-Step Approach to Tokens and Context Windows.
Chunking and Windowing
For tasks that require processing long documents end-to-end — legal review, financial analysis, research synthesis — break the document into chunks that fit comfortably (not maximally) in context. Process each chunk, extract structured outputs, then aggregate. This is more reliable than one massive call.
Conversation Management
Build explicit context management into multi-turn applications:
- Track token counts per message using library tools
- Implement rolling summaries: when conversation history exceeds a threshold, summarize the oldest N messages into a compact summary and replace the originals
- Preserve high-value context (key decisions, user preferences, established facts) explicitly rather than relying on raw history
Cost and Performance Trade-offs
Token volume is your primary cost driver in most LLM-heavy workflows, and understanding the math lets you make real decisions.
At typical 2025 pricing for mid-tier models, processing 1 million input tokens costs roughly $2–$15 depending on the model. At the premium end (frontier models with large context), it can reach $30–$60 per million input tokens. For high-volume production workloads processing millions of calls per month, token efficiency directly determines whether a workflow is profitable or not.
Beyond cost, tokens affect latency. Time-to-first-token and total generation time both scale with context length. For real-time user-facing applications, this matters. For async batch processing, it may not.
See Tokens and Context Windows: Best Practices That Actually Work for specific optimization patterns tied to workflow types.
Applying This to Real Workflows
The theory lands differently depending on what you're building. Tokens and Context Windows: Real-World Examples and Use Cases covers these in depth, but here's the high-level map:
- Chatbots and assistants: conversation history management is the dominant concern. Budget tokens per turn and build rolling summaries.
- Document processing: chunking strategy and output format efficiency matter most. Avoid asking for verbose outputs when structured data is sufficient.
- Agent pipelines: tool call results and multi-step reasoning can eat context fast. Each tool invocation may add hundreds of tokens of scaffolding. Monitor cumulative context across steps.
- Code generation: code is relatively token-dense. Complex codebases passed as context exhaust windows quickly. Use file-level chunking and targeted retrieval.
Frequently Asked Questions
What's the difference between context window and memory?
Context window refers to the tokens a model can process in a single inference call — it resets when the call ends. "Memory" is a product-level or application-level feature that persists information across calls, usually by storing and retrieving it externally. Claude's memory feature and ChatGPT's memory are built on top of the context window, not inside it.
Do I always need a bigger context window?
Not necessarily. Larger context windows are slower, more expensive, and can produce less focused outputs for simple tasks. Use the smallest context that comfortably fits your task. A 4K-token prompt with a 128K context model works just as well as filling the window — often better.
How do I count tokens before sending a request?
OpenAI provides the tiktoken library for Python, and their Tokenizer tool at platform.openai.com is useful for spot-checking. Anthropic returns token counts in API response metadata. Most SDKs expose token counts in usage fields. For production systems, build token estimation into your pipeline before sending calls, not after.
Why does the same prompt cost different amounts on different models?
Token pricing varies by model, and different tokenizers produce different token counts for the same text. A prompt processed by GPT-4o and Claude Sonnet may have different token counts due to different underlying tokenizers, and the per-token price differs between providers. Always verify token counts and pricing for the specific model in your workflow.
What happens when you exceed the context window?
Hard limit breaches return an API error and the call fails. More commonly, client libraries or hosted products silently truncate the input — usually dropping the oldest messages first. The model never sees the dropped content, which can cause factual errors, broken references, and instructions being ignored. Build explicit length management into anything longer than a single-turn call.
Key Takeaways
- A token is a subword unit, roughly 4 characters or ¾ of a word in English; tokenization is model-specific and affects both cost and capacity
- The context window is working memory for a single inference call — it resets after each call and holds all input plus generated output
- Input and output tokens are priced separately; output tokens typically cost significantly more per unit
- Larger context windows don't guarantee better results — lost-in-the-middle degradation and instruction drift are real failure modes at high fill rates
- Token efficiency is an engineering discipline: prompt compression, RAG, chunking, and conversation management are the core tools
- For multi-turn applications, proactive context management (rolling summaries, token tracking) is architectural, not optional
- Cost scales directly with token volume; understanding the math is required for building economically viable AI workflows