AI model pricing looks simple on a vendor's homepage and then surprises you on the first real invoice. The headline numbers — a few dollars per million tokens — feel trivial until you multiply them by a production workload that runs thousands of times a day, each call dragging along a growing context window. The gap between "this is basically free" and "why did we spend $40,000 last month" is almost always a failure to understand the underlying pricing structure, not the unit price itself.
This guide is the definitive reference for how AI model costs actually work. We cover the token-based billing model that nearly every provider uses, the difference between input and output pricing, how context windows and caching change the math, and the structural choices — pay-as-you-go, committed throughput, batch discounts, self-hosting — that determine whether your AI spend scales gracefully or balloons.
By the end you should be able to read any model's pricing page, estimate the cost of a workload before you build it, and recognize which lever to pull when the bill comes in higher than expected.
How Token-Based Pricing Works
Almost all hosted language models bill by the token. A token is a chunk of text, roughly four characters or three-quarters of a word in English. The sentence you are reading is about 12 tokens. Providers count tokens because models process text in these units, which makes token count the most direct measure of computational work.
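As a rough illustration of the four-characters-per-token rule of thumb, here is a minimal sketch. The function name estimate_tokens and the chars_per_token default are assumptions made for this example; real tokenizers differ by model and language, so treat the result as an approximation rather than a billing figure.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate token count using the ~4 characters per token heuristic.

    This is a rule of thumb for English text, not a real tokenizer.
    """
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)

# A 52-character sentence comes out at 13 estimated tokens.
sample = "Almost all hosted language models bill by the token."
estimated = estimate_tokens(sample)
```

For anything that feeds an actual budget, count tokens with the tokenizer your provider documents for the specific model.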
The critical detail that trips up nearly everyone: input and output tokens are priced separately, and output is usually three to five times more expensive. Input is the prompt you send — instructions, context, retrieved documents, conversation history. Output is what the model generates. A model might charge $3 per million input tokens and $15 per million output tokens. If your application sends huge prompts but gets short answers, your bill skews toward input. If it generates long essays from short prompts, output dominates.
The two numbers that matter
- Cost per million input tokens — multiply by your average prompt size.
- Cost per million output tokens — multiply by your average response length.
Estimate a single call, then multiply by call volume. That arithmetic is the entire game. If you want the gentle, first-principles version of this math, start with our Beginner's Guide.
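The arithmetic above can be sketched in a few lines. The function name call_cost is hypothetical, the $3 and $15 rates are the illustrative figures from this section rather than any provider's actual prices, and the token counts and the 10,000 calls per day are assumptions chosen for the example.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one call, with prices quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Illustrative rates from the text: $3 per million input, $15 per million output.
prompt_heavy = call_cost(8_000, 200, 3.0, 15.0)    # long prompt, short answer: about $0.027
output_heavy = call_cost(200, 2_000, 3.0, 15.0)    # short prompt, long answer: about $0.031

# Multiply by call volume (10,000 calls per day is an assumed figure).
daily_prompt_heavy = prompt_heavy * 10_000         # about $270 per day
```

In the first case roughly 89 percent of the cost is input; in the second, about 98 percent is output, which is why the two rates have to be estimated separately.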
Pricing Tiers and Model Families
Every major provider ships a family of models at different price-performance points. There is usually a flagship model (most capable, most expensive), a mid-tier workhorse, and a small fast model that costs a fraction of the flagship. The price spread within a single family is often 10x to 30x.
This tiering is your single biggest cost lever. Most teams reflexively reach for the flagship model for everything, then discover that 70 percent of their calls — classification, extraction, simple routing — work just as well on the small model at a tenth of the price. The discipline of matching task difficulty to model tier is where serious savings live.
A practical tiering heuristic
- Small/fast models: classification, tagging, routing, short extractions, high-volume simple tasks.
- Mid-tier models: most production reasoning, summarization, and customer-facing generation.
- Flagship models: genuinely hard reasoning, long-horizon agents, anything where a wrong answer is expensive.
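One way to encode this heuristic is a small lookup table with an override for high-stakes calls. This is only a sketch: the task labels, the Tier enum, and the choice to default unknown tasks to the mid tier are all assumptions for illustration, and a real router would map tiers to whatever model names your provider offers.

```python
from enum import Enum

class Tier(Enum):
    SMALL = "small"
    MID = "mid"
    FLAGSHIP = "flagship"

# Hypothetical task labels mapped to tiers, following the heuristic above.
TASK_TIERS = {
    "classification": Tier.SMALL,
    "tagging": Tier.SMALL,
    "routing": Tier.SMALL,
    "short_extraction": Tier.SMALL,
    "reasoning": Tier.MID,
    "summarization": Tier.MID,
    "customer_generation": Tier.MID,
    "hard_reasoning": Tier.FLAGSHIP,
    "long_horizon_agent": Tier.FLAGSHIP,
}

def pick_tier(task_type: str, high_stakes: bool = False) -> Tier:
    """Choose a model tier for a task.

    Calls where a wrong answer is expensive go to the flagship tier.
    Unknown task types fall back to the mid tier (an assumed default).
    """
    if high_stakes:
        return Tier.FLAGSHIP
    return TASK_TIERS.get(task_type, Tier.MID)
```

Keeping the mapping in data rather than in scattered conditionals makes it easy to move a task down a tier and measure whether quality holds.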
Context Windows, Caching, and the Hidden Multipliers
The context window is the maximum amount of text a model can consider at once. Larger windows let you stuff in more documents and history, but every token you put in the window is a token you pay for on every single call. A retrieval-augmented chatbot that prepends 8,000 tokens of context to every message pays for those 8,000 tokens thousands of times a day, even though the user only typed a sentence.
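To put a number on that overhead, here is a small sketch. The function name is hypothetical, and the 5,000 calls per day and the $3 per million input rate are assumed figures used only to make the arithmetic concrete.

```python
def context_overhead_per_month(context_tokens: int, calls_per_day: int,
                               input_price_per_m: float, days: int = 30) -> float:
    """Monthly dollar cost of the context prepended to every call."""
    return context_tokens * calls_per_day * days * input_price_per_m / 1_000_000

# 8,000 tokens of prepended context, 5,000 calls per day, $3 per million input tokens.
full_context = context_overhead_per_month(8_000, 5_000, 3.0)     # 3600.0 dollars per month
trimmed_context = context_overhead_per_month(2_000, 5_000, 3.0)  # 900.0 dollars per month
```

Under these assumptions, trimming the prepended context from 8,000 to 2,000 tokens removes three quarters of that line item without touching the model or the call volume.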
Prompt caching is the most underused cost control in the entire space. When a large portion of your prompt is identical across calls (a system prompt, a knowledge base, a long set of instructions), many providers let you cache it, typically on the condition that the repeated portion sits at the start of the prompt. Cached input tokens are billed at a steep discount, often 75 to 90 percent off. For agents and chatbots with stable system prompts, caching alone can cut total spend by half. The mistakes that come from ignoring it are covered in our Common Mistakes guide.
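The effect of caching on input cost can be sketched as follows. The function name and the hit_rate parameter are assumptions for this example, and the model is deliberately simplified: it ignores any surcharge a provider may apply when writing to the cache and any cache expiry rules, so check your provider's pricing page for those details.

```python
def input_cost_with_cache(prefix_tokens: int, dynamic_tokens: int,
                          input_price_per_m: float, cache_discount: float,
                          hit_rate: float = 1.0) -> float:
    """Input cost per call, in dollars, when a stable prefix is cached.

    cache_discount: fraction taken off cached tokens (0.9 means 90 percent off).
    hit_rate: fraction of calls whose prefix is actually served from cache.
    Simplification: cache-write surcharges and expiry are not modeled.
    """
    cached_price = input_price_per_m * (1 - cache_discount)
    # Blend the cached and uncached price for the prefix by hit rate.
    prefix_price = hit_rate * cached_price + (1 - hit_rate) * input_price_per_m
    return (prefix_tokens * prefix_price
            + dynamic_tokens * input_price_per_m) / 1_000_000

# 8,000-token stable prefix, 200 dynamic tokens, $3 per million, 90 percent discount.
uncached = input_cost_with_cache(8_000, 200, 3.0, 0.9, hit_rate=0.0)  # about $0.0246
cached = input_cost_with_cache(8_000, 200, 3.0, 0.9)                  # about $0.0030
input_saving = 1 - cached / uncached                                  # about 0.88
```

The saving on total spend is smaller than the saving on input, because output tokens are unaffected by caching.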
Structural Pricing Models Beyond Pay-As-You-Go
Token rates are the default, but they are not the only structure available.
- Pay-as-you-go: you pay per token with no commitment. Best for variable or early-stage workloads.
- Batch processing: submit jobs that can tolerate a delay (minutes to a day) and pay roughly half the standard rate. Ideal for nightly enrichment, bulk classification, and offline pipelines.
- Provisioned/committed throughput: reserve dedicated capacity for a fixed monthly fee. Predictable cost and more consistent latency, but you pay whether or not you use it.
- Self-hosting open models: run an open-weight model on your own GPUs. No per-token fee, but you carry infrastructure, scaling, and reliability costs.
Choosing among these is a workload-shape decision, not a price-list decision. We walk through the trade-offs in our Framework article.
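A minimal sketch of how workload shape drives the choice: compare the monthly cost of each structure for a given volume. Everything here is an assumption for illustration, including the function name, the blended price per million tokens, the $10,000 fixed fee, and the 50 percent batch discount. The sketch also assumes the fixed capacity is large enough to serve the whole volume.

```python
def cheapest_structure(tokens_m_per_month: float, price_per_m: float,
                       fixed_monthly_fee: float, batch_discount: float = 0.5,
                       delay_tolerant: bool = False) -> tuple[str, float]:
    """Return the cheapest pricing structure and its monthly cost in dollars.

    tokens_m_per_month: volume in millions of tokens per month.
    price_per_m: blended pay-as-you-go price per million tokens.
    Batch is only considered when the workload can tolerate delay.
    """
    options = {
        "pay_as_you_go": tokens_m_per_month * price_per_m,
        "fixed_capacity": fixed_monthly_fee,
    }
    if delay_tolerant:
        options["batch"] = tokens_m_per_month * price_per_m * (1 - batch_discount)
    name = min(options, key=options.get)
    return name, options[name]

low_volume = cheapest_structure(500, 5.0, 10_000)        # ("pay_as_you_go", 2500.0)
high_volume = cheapest_structure(4_000, 5.0, 10_000)     # ("fixed_capacity", 10000)
offline_job = cheapest_structure(500, 5.0, 10_000, delay_tolerant=True)  # ("batch", 1250.0)
```

With these assumed numbers the break-even between pay-as-you-go and the fixed fee sits at 2,000 million tokens per month; below it the commitment is wasted, above it the commitment wins.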
Estimating Cost Before You Build
Never start a build without a back-of-envelope cost model. The process is mechanical:
- Estimate average input tokens per call (prompt + context + history).
- Estimate average output tokens per call.
- Multiply each by the per-token rate for your chosen model.
- Multiply the per-call cost by expected daily volume, then by 30 for a monthly figure.
- Apply realistic caching and batch discounts.
If that number is uncomfortable, you change the inputs — smaller model, trimmed context, caching, batching — before you write a line of code. For a sequential walkthrough of this estimation, see our Step-by-Step Approach.
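The steps above can be sketched as a single function. The Workload fields and the way discounts are combined are assumptions for illustration: the cache discount is applied to the cached share of input tokens, and the batch discount is applied to the batched share of calls, which may not match how a given provider stacks discounts. The rates and volumes are example figures, not real prices.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    input_tokens: int              # average input tokens per call
    output_tokens: int             # average output tokens per call
    calls_per_day: int
    input_price_per_m: float       # dollars per million input tokens
    output_price_per_m: float      # dollars per million output tokens
    cached_fraction: float = 0.0   # share of input tokens served from cache
    cache_discount: float = 0.0    # fraction off cached input tokens
    batch_fraction: float = 0.0    # share of calls sent through batch
    batch_discount: float = 0.0    # fraction off batched calls

def monthly_cost(w: Workload, days: int = 30) -> float:
    """Back-of-envelope monthly cost in dollars."""
    # Discount only the cached share of the input.
    effective_input = w.input_tokens * (1 - w.cached_fraction * w.cache_discount)
    per_call = (effective_input * w.input_price_per_m
                + w.output_tokens * w.output_price_per_m) / 1_000_000
    # Discount only the batched share of calls.
    batch_multiplier = 1 - w.batch_fraction * w.batch_discount
    return per_call * w.calls_per_day * days * batch_multiplier

baseline = monthly_cost(Workload(4_000, 500, 10_000, 3.0, 15.0))    # about $5,850
with_cache = monthly_cost(Workload(4_000, 500, 10_000, 3.0, 15.0,
                                   cached_fraction=0.75, cache_discount=0.9))  # about $3,420
```

Changing one field at a time shows which lever moves the total most for your particular mix of input and output.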
Monitoring and Optimizing in Production
Estimation gets you to launch; monitoring keeps you solvent. Instrument every call with token counts, model used, and a feature tag so you can attribute spend. Set per-feature budgets and alerts. The teams that control costs treat their AI spend like any other observable metric, not like a surprise that arrives once a month.
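A minimal sketch of that instrumentation, assuming you record one entry per call. The CallRecord fields, feature names, and budget figures are hypothetical; in practice the token counts would come from the usage data your provider returns with each response, and the records would go to your existing metrics or logging system rather than an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CallRecord:
    feature: str         # which product feature made the call
    model: str           # which model served it
    input_tokens: int
    output_tokens: int
    cost_usd: float      # computed from token counts and your rate card

def spend_by_feature(records: list[CallRecord]) -> dict[str, float]:
    """Total spend per feature tag."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.feature] += record.cost_usd
    return dict(totals)

def over_budget(totals: dict[str, float], budgets: dict[str, float]) -> list[str]:
    """Features whose spend exceeds their budget, sorted by name."""
    return sorted(f for f, spent in totals.items()
                  if f in budgets and spent > budgets[f])

records = [
    CallRecord("search", "small-model", 1_200, 50, 0.50),
    CallRecord("search", "small-model", 1_100, 40, 0.25),
    CallRecord("assistant", "flagship-model", 9_000, 800, 4.00),
]
totals = spend_by_feature(records)
alerts = over_budget(totals, {"search": 1.00, "assistant": 3.00})  # ["assistant"]
```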
The highest-leverage optimizations, in rough order of payoff: downgrade tasks to smaller models, enable prompt caching, trim context aggressively, move delay-tolerant work to batch, and cap output length. Our Best Practices article ranks these by real-world impact.
Frequently Asked Questions
Why is output more expensive than input?
Generating tokens is computationally heavier than reading them — the model runs a full forward pass for each output token, sequentially. Input can be processed in parallel. The price difference, typically three to five times, reflects that asymmetry, which is why capping response length is one of the easiest savings available.
How much does prompt caching actually save?
For workloads with a large, stable prefix — long system prompts, knowledge bases, instruction sets — caching commonly reduces total spend by 40 to 70 percent. The savings come from billing the repeated portion at a discount of 75 to 90 percent. Workloads with no repeated content see little benefit.
Should I just self-host an open model to avoid token fees?
Only if your volume is high and steady. Self-hosting trades per-token fees for fixed GPU and operational costs, which only pay off above a meaningful utilization threshold. Below that, hosted pay-as-you-go is almost always cheaper and far less work.
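To see why utilization matters, here is a rough sketch that converts a GPU's hourly cost into an effective price per million tokens. The function name, the $3.60 hourly rate, and the 1,000 tokens per second throughput are made-up figures for illustration, and the sketch ignores engineering time, redundancy, and other operational costs, which would push the real number higher.

```python
def self_host_cost_per_m(gpu_cost_per_hour: float, tokens_per_second: float,
                         utilization: float) -> float:
    """Effective dollars per million tokens on self-hosted hardware.

    utilization: fraction of each hour the GPU spends serving real traffic.
    Operational and staffing costs are not included.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

busy = self_host_cost_per_m(3.60, 1_000, utilization=1.0)   # about $1 per million tokens
idle = self_host_cost_per_m(3.60, 1_000, utilization=0.1)   # about $10 per million tokens
```

The hardware bill is the same in both cases; only the number of tokens it is spread over changes, which is why steady, high volume is the precondition for self-hosting to pay off.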
What's the single biggest mistake teams make on cost?
Using a flagship model for every task. Most production workloads are a mix of hard and easy calls, and the easy majority runs fine on a model that costs a tenth as much. Routing by task difficulty is the highest-leverage change most teams can make.
How do I forecast costs for a workload I haven't built yet?
Estimate average input and output tokens per call, multiply by the per-token rates, then by your expected call volume. Apply realistic caching and batch discounts. The number is rough but reliably tells you whether your design is affordable before you commit to it.
Key Takeaways
- AI models bill by the token, with output usually priced three to five times higher than input.
- Matching task difficulty to model tier is the single largest cost lever; routing easy calls to a small model can cut their cost to roughly a tenth.
- Every token of context you send is billed on every request, so trim it.
- Prompt caching can cut total spend by half for workloads with stable prefixes.
- Batch, committed throughput, and self-hosting are structural choices driven by workload shape.
- Always build a back-of-envelope cost model before you build, and instrument spend in production.