Stop Treating Language Models Like a Search Engine

Q: How do I count tokens before sending a request?

OpenAI provides the tiktoken library for Python, and their Tokenizer tool at platform.openai.com is useful for spot-checking. Anthropic returns token counts in API response metadata. Most SDKs expose token counts in usage fields. For production systems, build token estimation into your pipeline before sending calls, not after.

Tokens and context windows are the two mechanical facts that explain more about how large language models behave than almost anything else. Understanding them isn't optional for anyone who uses AI seriously — it's the difference between building reliable workflows and being constantly surprised by model behavior you can't diagnose.

The core problem most practitioners hit is this: they treat an AI model like a search engine or a human colleague, neither of which has the constraints that LLMs operate under. A search engine has no memory between queries. A human colleague carries months of context in their head. A language model sits in a precise middle ground — it has a defined window of memory that resets, processes information in a specific unit called a token, and charges you (in time, money, or quality) based on both. Knowing that changes how you design prompts, structure documents, and build agent pipelines.

This guide covers the full picture: what tokens actually are, how context windows work mechanically, what happens at the boundaries, how to think about cost and performance, and how to make deliberate decisions when working within these constraints. If you're new to the topic, you may want to start with the Tokens and Context Windows: A Beginner's Guide before diving in here. If you've already got the basics, read on.

What a Token Actually Is

A token is the fundamental unit a language model reads and writes. It is not a word, a character, or a syllable — though it loosely correlates with all three.

Modern models use a technique called byte-pair encoding (BPE) or similar subword tokenization schemes. The tokenizer splits input text into the most statistically efficient chunks based on how often those character sequences appear in training data. In practical terms:

Common English words are usually a single token: "the," "house," "running"
Less common or longer words often split: "unbelievable" might become ["un", "believ", "able"] — three tokens
Numbers tokenize inconsistently: "2024" might be one token, "2025" might be two
Spaces and punctuation count: a space before a word is often part of the token, not separate
Code tokenizes differently than prose — some languages are more token-efficient than others

A useful rule of thumb for English text: roughly 750 words ≈ 1,000 tokens, or about 4 characters per token on average. This is an approximation, not a law. Technical content, non-English text, and heavy formatting can push the ratio significantly.

Why Tokenization Is Model-Specific

Each model family has its own tokenizer. GPT-4 and GPT-4o use the cl100k_base tokenizer. Claude uses Anthropic's proprietary tokenizer. Gemini uses a different one still. A prompt that is 800 tokens on one model may be 900 on another. This matters when you're estimating costs, comparing model capacities, or building tools that need precise token budgets. OpenAI's public Tokenizer tool and the tiktoken Python library let you count tokens before sending them. Anthropic exposes token counts in its API response metadata.

What a Context Window Is

The context window is the maximum number of tokens a model can process in a single inference call — the sum of your input (system prompt + conversation history + documents) and the model's output.

Think of it as working memory. Everything the model "knows" during a single call must fit inside that window. When you start a new conversation, the window resets. There is no passive background memory unless you build one explicitly.

Context Window Sizes Across Major Models

Window sizes have grown dramatically. As of mid-2025, typical ranges look like this:

Smaller/faster models (GPT-4o mini, Claude Haiku): 128K tokens
Mid-tier models (GPT-4o, Claude Sonnet): 128K–200K tokens
Long-context frontier models (Claude Opus, Gemini 1.5 Pro): 200K–1M+ tokens
Specialized research models: experimental contexts exceeding 2M tokens exist but are not yet production standard

Larger is not automatically better. Long-context models are generally slower, more expensive per token, and can exhibit "lost in the middle" behavior — where information placed in the center of a very long prompt is retrieved less reliably than information at the start or end.

Input Tokens, Output Tokens, and Why the Distinction Matters

Most LLM APIs price input and output tokens separately, with output tokens typically costing 3–5x more per token than input. This is because generating tokens is computationally more expensive than reading them.

The practical implication: a model that reads a 50,000-token document and produces a 500-word summary is actually quite cheap. A model asked to write a 5,000-word detailed report from a short prompt is significantly more expensive — and slower — because output token generation dominates both cost and latency.

Max output tokens is a separate parameter from context window size. A model with a 200K-token context window might cap output at 4K or 16K tokens depending on configuration. If your use case requires long outputs — full reports, detailed code modules, multi-chapter drafts — verify the output limit separately from the context limit.

The Context Window in Practice: What Fills It

Understanding what consumes tokens in a real API call helps you budget deliberately.

The Anatomy of a Prompt

In a typical API call, the context window fills from several sources:

System prompt: instructions, persona, rules, format guidance. Can range from 100 tokens to 5,000+ for complex agent setups
Conversation history: every prior message in the thread. In long conversations, this grows fast
Retrieved documents or context: RAG results, pasted documents, tool outputs
The current user message: often the smallest part
Model output (reserved): the model needs room to generate its response

A common failure mode: you build a chat interface, the conversation runs for 30 turns, and by turn 25 the context is full. The model either truncates older messages (losing important context), starts making errors, or throws a hard API error. Planning for conversation length is not optional — it's architecture.

Context Window Limits and Quality Degradation

Hitting the hard limit causes a crash. Approaching it causes something subtler: quality degradation.

Research and practitioner experience consistently show that:

Recency bias is real: models weight information closer to the end of context more heavily
Lost-in-the-middle effect: in very long documents, middle sections are retrieved less reliably
Instruction drift: if your system prompt is 4,000 tokens and your conversation history is 90,000 tokens, the model may not adhere to the system prompt as reliably

This is why 7 Common Mistakes with Tokens and Context Windows (and How to Avoid Them) consistently surfaces "assuming large context windows = reliable full-document comprehension" as one of the most costly errors practitioners make.

The practical rule: don't fill the context window just because you can. Use the minimum context required for the task. More tokens mean more noise, more cost, and in some configurations, lower accuracy.

Strategies for Working Within Context Constraints

Constraints are architectural inputs, not obstacles. Here's how skilled practitioners handle them.

Prompt Compression

Reduce the token footprint of your inputs without reducing the information density:

Summarize conversation history periodically rather than passing all raw messages
Strip formatting from retrieved documents — markdown headers, whitespace, and decorative punctuation all cost tokens
Use structured shorthand in system prompts where possible: "Reply only with JSON" is cheaper than a paragraph explaining it

Retrieval-Augmented Generation (RAG)

Instead of stuffing entire documents into context, retrieve only the relevant chunks. A 500K-word knowledge base shouldn't live in the context window. A well-built retrieval system surfaces the 2,000–5,000 tokens actually relevant to the question. This is the right architecture for document-heavy workflows. For a detailed walkthrough, see A Step-by-Step Approach to Tokens and Context Windows.

Chunking and Windowing

For tasks that require processing long documents end-to-end — legal review, financial analysis, research synthesis — break the document into chunks that fit comfortably (not maximally) in context. Process each chunk, extract structured outputs, then aggregate. This is more reliable than one massive call.

Conversation Management

Build explicit context management into multi-turn applications:

Track token counts per message using library tools
Implement rolling summaries: when conversation history exceeds a threshold, summarize the oldest N messages into a compact summary and replace the originals
Preserve high-value context (key decisions, user preferences, established facts) explicitly rather than relying on raw history

Cost and Performance Trade-offs

Token volume is your primary cost driver in most LLM-heavy workflows, and understanding the math lets you make real decisions.

At typical 2025 pricing for mid-tier models, processing 1 million input tokens costs roughly $2–$15 depending on the model. At the premium end (frontier models with large context), it can reach $30–$60 per million input tokens. For high-volume production workloads processing millions of calls per month, token efficiency directly determines whether a workflow is profitable or not.

Beyond cost, tokens affect latency. Time-to-first-token and total generation time both scale with context length. For real-time user-facing applications, this matters. For async batch processing, it may not.

See Tokens and Context Windows: Best Practices That Actually Work for specific optimization patterns tied to workflow types.

Applying This to Real Workflows

The theory lands differently depending on what you're building. Tokens and Context Windows: Real-World Examples and Use Cases covers these in depth, but here's the high-level map:

Chatbots and assistants: conversation history management is the dominant concern. Budget tokens per turn and build rolling summaries.
Document processing: chunking strategy and output format efficiency matter most. Avoid asking for verbose outputs when structured data is sufficient.
Agent pipelines: tool call results and multi-step reasoning can eat context fast. Each tool invocation may add hundreds of tokens of scaffolding. Monitor cumulative context across steps.
Code generation: code is relatively token-dense. Complex codebases passed as context exhaust windows quickly. Use file-level chunking and targeted retrieval.

Frequently Asked Questions

What's the difference between context window and memory?

Context window refers to the tokens a model can process in a single inference call — it resets when the call ends. "Memory" is a product-level or application-level feature that persists information across calls, usually by storing and retrieving it externally. Claude's memory feature and ChatGPT's memory are built on top of the context window, not inside it.

Do I always need a bigger context window?

Not necessarily. Larger context windows are slower, more expensive, and can produce less focused outputs for simple tasks. Use the smallest context that comfortably fits your task. A 4K-token prompt with a 128K context model works just as well as filling the window — often better.

How do I count tokens before sending a request?

OpenAI provides the tiktoken library for Python, and their Tokenizer tool at platform.openai.com is useful for spot-checking. Anthropic returns token counts in API response metadata. Most SDKs expose token counts in usage fields. For production systems, build token estimation into your pipeline before sending calls, not after.

Why does the same prompt cost different amounts on different models?

Token pricing varies by model, and different tokenizers produce different token counts for the same text. A prompt processed by GPT-4o and Claude Sonnet may have different token counts due to different underlying tokenizers, and the per-token price differs between providers. Always verify token counts and pricing for the specific model in your workflow.

What happens when you exceed the context window?

Hard limit breaches return an API error and the call fails. More commonly, client libraries or hosted products silently truncate the input — usually dropping the oldest messages first. The model never sees the dropped content, which can cause factual errors, broken references, and instructions being ignored. Build explicit length management into anything longer than a single-turn call.

Key Takeaways

A token is a subword unit, roughly 4 characters or ¾ of a word in English; tokenization is model-specific and affects both cost and capacity
The context window is working memory for a single inference call — it resets after each call and holds all input plus generated output
Input and output tokens are priced separately; output tokens typically cost significantly more per unit
Larger context windows don't guarantee better results — lost-in-the-middle degradation and instruction drift are real failure modes at high fill rates
Token efficiency is an engineering discipline: prompt compression, RAG, chunking, and conversation management are the core tools
For multi-turn applications, proactive context management (rolling summaries, token tracking) is architectural, not optional
Cost scales directly with token volume; understanding the math is required for building economically viable AI workflows

What a Token Actually Is

A token is the fundamental unit a language model reads and writes. It is not a word, a character, or a syllable — though it loosely correlates with all three.

Common English words are usually a single token: "the," "house," "running"
Less common or longer words often split: "unbelievable" might become ["un", "believ", "able"] — three tokens
Numbers tokenize inconsistently: "2024" might be one token, "2025" might be two
Spaces and punctuation count: a space before a word is often part of the token, not separate
Code tokenizes differently than prose — some languages are more token-efficient than others

Why Tokenization Is Model-Specific

What a Context Window Is

The context window is the maximum number of tokens a model can process in a single inference call — the sum of your input (system prompt + conversation history + documents) and the model's output.

Context Window Sizes Across Major Models

Window sizes have grown dramatically. As of mid-2025, typical ranges look like this:

Smaller/faster models (GPT-4o mini, Claude Haiku): 128K tokens
Mid-tier models (GPT-4o, Claude Sonnet): 128K–200K tokens
Long-context frontier models (Claude Opus, Gemini 1.5 Pro): 200K–1M+ tokens
Specialized research models: experimental contexts exceeding 2M tokens exist but are not yet production standard

Input Tokens, Output Tokens, and Why the Distinction Matters

The Context Window in Practice: What Fills It

Understanding what consumes tokens in a real API call helps you budget deliberately.

The Anatomy of a Prompt

In a typical API call, the context window fills from several sources:

System prompt: instructions, persona, rules, format guidance. Can range from 100 tokens to 5,000+ for complex agent setups
Conversation history: every prior message in the thread. In long conversations, this grows fast
Retrieved documents or context: RAG results, pasted documents, tool outputs
The current user message: often the smallest part
Model output (reserved): the model needs room to generate its response

Context Window Limits and Quality Degradation

Hitting the hard limit causes a crash. Approaching it causes something subtler: quality degradation.

Research and practitioner experience consistently show that:

Recency bias is real: models weight information closer to the end of context more heavily
Lost-in-the-middle effect: in very long documents, middle sections are retrieved less reliably
Instruction drift: if your system prompt is 4,000 tokens and your conversation history is 90,000 tokens, the model may not adhere to the system prompt as reliably

Strategies for Working Within Context Constraints

Constraints are architectural inputs, not obstacles. Here's how skilled practitioners handle them.

Prompt Compression

Reduce the token footprint of your inputs without reducing the information density:

Summarize conversation history periodically rather than passing all raw messages
Strip formatting from retrieved documents — markdown headers, whitespace, and decorative punctuation all cost tokens
Use structured shorthand in system prompts where possible: "Reply only with JSON" is cheaper than a paragraph explaining it

Retrieval-Augmented Generation (RAG)

Chunking and Windowing

Conversation Management

Build explicit context management into multi-turn applications:

Track token counts per message using library tools
Implement rolling summaries: when conversation history exceeds a threshold, summarize the oldest N messages into a compact summary and replace the originals
Preserve high-value context (key decisions, user preferences, established facts) explicitly rather than relying on raw history

Cost and Performance Trade-offs

Token volume is your primary cost driver in most LLM-heavy workflows, and understanding the math lets you make real decisions.

See Tokens and Context Windows: Best Practices That Actually Work for specific optimization patterns tied to workflow types.

Applying This to Real Workflows

The theory lands differently depending on what you're building. Tokens and Context Windows: Real-World Examples and Use Cases covers these in depth, but here's the high-level map:

Chatbots and assistants: conversation history management is the dominant concern. Budget tokens per turn and build rolling summaries.
Document processing: chunking strategy and output format efficiency matter most. Avoid asking for verbose outputs when structured data is sufficient.
Agent pipelines: tool call results and multi-step reasoning can eat context fast. Each tool invocation may add hundreds of tokens of scaffolding. Monitor cumulative context across steps.
Code generation: code is relatively token-dense. Complex codebases passed as context exhaust windows quickly. Use file-level chunking and targeted retrieval.

Frequently Asked Questions

What's the difference between context window and memory?

Do I always need a bigger context window?

How do I count tokens before sending a request?

Why does the same prompt cost different amounts on different models?

What happens when you exceed the context window?

Key Takeaways

A token is a subword unit, roughly 4 characters or ¾ of a word in English; tokenization is model-specific and affects both cost and capacity
The context window is working memory for a single inference call — it resets after each call and holds all input plus generated output
Input and output tokens are priced separately; output tokens typically cost significantly more per unit
Larger context windows don't guarantee better results — lost-in-the-middle degradation and instruction drift are real failure modes at high fill rates
Token efficiency is an engineering discipline: prompt compression, RAG, chunking, and conversation management are the core tools
For multi-turn applications, proactive context management (rolling summaries, token tracking) is architectural, not optional
Cost scales directly with token volume; understanding the math is required for building economically viable AI workflows

Stop Treating Language Models Like a Search Engine

What a Token Actually Is

Why Tokenization Is Model-Specific

What a Context Window Is

Context Window Sizes Across Major Models

Input Tokens, Output Tokens, and Why the Distinction Matters

The Context Window in Practice: What Fills It

The Anatomy of a Prompt

Context Window Limits and Quality Degradation

Strategies for Working Within Context Constraints

Prompt Compression

Retrieval-Augmented Generation (RAG)

Chunking and Windowing

Conversation Management

Cost and Performance Trade-offs

Applying This to Real Workflows

Frequently Asked Questions

What's the difference between context window and memory?

Do I always need a bigger context window?

How do I count tokens before sending a request?

Why does the same prompt cost different amounts on different models?

What happens when you exceed the context window?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Stop Treating Language Models Like a Search Engine

What a Token Actually Is

Why Tokenization Is Model-Specific

What a Context Window Is

Context Window Sizes Across Major Models

Input Tokens, Output Tokens, and Why the Distinction Matters

The Context Window in Practice: What Fills It

The Anatomy of a Prompt

Context Window Limits and Quality Degradation

Strategies for Working Within Context Constraints

Prompt Compression

Retrieval-Augmented Generation (RAG)

Chunking and Windowing

Conversation Management

Cost and Performance Trade-offs

Applying This to Real Workflows

Frequently Asked Questions

What's the difference between context window and memory?

Do I always need a bigger context window?

How do I count tokens before sending a request?

Why does the same prompt cost different amounts on different models?

What happens when you exceed the context window?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?