Every Inference Choice Buys Speed and Costs Something Else

There is no universally correct way to run AI inference. Every choice you make—which model, where it runs, how you batch requests, whether you stream tokens—buys you something and costs you something else. The teams that ship reliable AI features are not the ones who found the fastest setup. They are the ones who understood what they were trading away and chose deliberately.

This article lays out the competing approaches to managing inference latency, the axes that actually move the needle, and a decision rule you can use to pick without endless benchmarking. The goal is not to crown a winner. It is to help you reason about the specific trade-off in front of you.

If you are new to the underlying concepts, start with AI Inference and Latency: A Beginner's Guide and come back. The rest of this assumes you know what a token is and why time-to-first-token differs from total generation time.

The Four Axes That Govern Every Decision

Before comparing approaches, fix the variables. Almost every inference trade-off reduces to four axes, and you can only optimize three at once.

Latency

Split this into two numbers, because they fail differently. Time-to-first-token (TTFT) is how long until the user sees anything. Inter-token latency is the gap between successive tokens once streaming starts; its inverse is the per-request throughput in tokens per second. A chatbot lives or dies on TTFT. A batch summarization job does not care about TTFT at all and only cares about total tokens per second.
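To make the two numbers concrete, here is a minimal Python sketch that consumes any token iterator and reports TTFT and mean inter-token latency separately. The function name `measure_stream` and the injectable `clock` parameter are illustrative assumptions, not part of any provider SDK; in practice you would pass in whatever streaming iterator your client library returns.

```python
import time
from typing import Callable, Iterable


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT and mean inter-token latency.

    `tokens` is any iterable that yields tokens as they arrive.
    `clock` is injectable so the function can be tested deterministically.
    """
    start = clock()
    arrivals = []
    for _ in tokens:
        arrivals.append(clock())  # timestamp each token as it arrives

    if not arrivals:
        return {"ttft_s": None, "mean_itl_s": None, "total_s": None, "tokens": 0}

    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft_s": arrivals[0] - start,                       # time to first token
        "mean_itl_s": sum(gaps) / len(gaps) if gaps else 0.0,  # inter-token latency
        "total_s": arrivals[-1] - start,                     # total generation time
        "tokens": len(arrivals),
    }
```

Reporting the two figures separately matters because a change can improve one and worsen the other; a single "latency" number hides that.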

Cost

Measured per million tokens or per request. Cost scales with model size, with whether you run it yourself, and with how much idle capacity you pay for. Self-hosting a small model can be cheaper than an API at high volume and far more expensive at low volume.
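The volume dependence can be sketched as simple arithmetic. The Python below is a rough model under stated assumptions: it counts only GPU rental cost, ignores engineering and on-call time, and every number in the usage note is hypothetical rather than a real price.

```python
def self_hosted_cost_per_million(gpu_cost_per_hour: float,
                                 tokens_per_second_at_full_load: float,
                                 utilization: float) -> float:
    """Effective dollars per million tokens for a GPU busy `utilization` of the time.

    You pay for the GPU every hour, but only produce tokens while it is busy,
    so idle time inflates the per-token cost.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second_at_full_load * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_cost_per_hour: float,
                           tokens_per_second_at_full_load: float,
                           api_price_per_million: float) -> float:
    """Utilization above which self-hosting beats the API on GPU cost alone.

    A result above 1.0 means self-hosting never wins at these prices.
    """
    full_load_cost = self_hosted_cost_per_million(
        gpu_cost_per_hour, tokens_per_second_at_full_load, 1.0)
    return full_load_cost / api_price_per_million
```

With a hypothetical $2.00/hour GPU sustaining 1,000 tokens per second and a hypothetical API price of $1.00 per million tokens, the break-even utilization comes out near 56%. At 10% utilization the same GPU costs roughly $5.56 per million tokens, well above the API.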

Quality

The hardest to measure and the easiest to sacrifice by accident. A smaller, faster model that gets 8% more answers wrong is not a win if those wrong answers reach customers.

Operational burden

Who maintains the GPUs, patches the serving stack, and handles a traffic spike at 2 a.m.? An API hides this cost. Self-hosting makes it yours.

Hold these four in mind. Every option below is just a different bet across them.

The Competing Approaches

Hosted API, largest model

Call a frontier model through a provider's API. You get top-tier quality and zero operational burden. You pay the most per token and you accept whatever latency the provider's queue gives you, which can spike unpredictably under load.

Choose this when quality is non-negotiable and volume is low to moderate—legal drafting, complex reasoning, anything customer-facing where a wrong answer is expensive.

Hosted API, smaller or distilled model

Same convenience, but you pick a lighter model from the same provider. TTFT drops, cost drops sharply, quality drops by an amount you must measure. Distilled and "mini" models have closed much of the quality gap for routine tasks like classification, extraction, and short rewrites.

This is the default starting point for most production features. Start small, upgrade only where evals show you need to.

Self-hosted open-weight model

Run an open-weight model on your own GPUs or a serving platform. You control latency, you can optimize the stack, and at high steady volume the per-token cost can undercut any API. The price is real operational burden: capacity planning, autoscaling, kernel-level tuning, and on-call.

Choose this when volume is high and predictable, when data residency forbids sending tokens to a third party, or when you need latency guarantees an API cannot promise.

Edge or on-device inference

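Run a small model on the user's device or a nearby edge node. On-device, TTFT drops to the model's own prompt-processing time because there is no network round trip, and it works offline; an edge node keeps a short network hop but removes most of the distance. The trade-off is a hard ceiling on model size and therefore quality.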

Reserve this for narrow, latency-critical tasks: autocomplete, on-device transcription, simple intent detection.

For a fuller catalog of what runs where, The Best Tools for AI Inference and Latency breaks down serving frameworks and managed platforms by use case.

Techniques That Cut Across Every Approach

Independent of where the model runs, a handful of techniques shift the trade-off curve. They are not approaches in themselves; they are levers you pull on top of one.

Streaming. Stream tokens as they generate. This does not reduce total latency, but it slashes perceived latency by giving the user something to read immediately. Almost always worth it for interactive features.
Prompt caching. Reuse the computed state of a long, stable prefix—a system prompt, a document, a few-shot block. On repeated calls this can cut TTFT and cost dramatically.
Batching. Group concurrent requests so the GPU processes them together. Raises throughput and lowers cost per token, but adds queueing latency to individual requests. Great for self-hosted backends serving many users, bad for a single low-traffic endpoint.
Speculative decoding. A small draft model proposes tokens that the large model verifies in parallel. Speeds up generation with no quality loss, at the cost of extra complexity and memory.
Quantization. Run the model at lower numerical precision. Cuts memory and speeds inference, with a usually-small quality hit you must verify on your own data.
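Of the levers above, speculative decoding is the least intuitive, so here is a toy Python sketch of the greedy variant. It assumes both models are deterministic next-token functions; `toy_target` and `toy_draft` are made-up stand-ins, and the verification loop stands in for what a real serving stack would do in a single batched forward pass of the large model.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function: context -> token id


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one call to the model per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    if k < 1:
        raise ValueError("k must be at least 1")
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(min(k, limit - len(out))):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target checks each proposed position in order.
        accepted = []
        for token in proposal:
            wanted = target(out + accepted)
            if wanted != token:
                accepted.append(wanted)  # first mismatch: keep the target's token, stop
                break
            accepted.append(token)
        else:
            # Every proposal matched; the target's pass also yields one bonus token.
            if len(out) + len(accepted) < limit:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[len(prompt):]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical 'large' model: an arbitrary deterministic function."""
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except at every third position."""
    token = toy_target(ctx)
    return token if len(ctx) % 3 else (token + 1) % 50
```

Every token that reaches the output is one the target itself would have chosen, which is why the result matches plain greedy decoding exactly. The speedup in a real system comes from verifying several positions in one pass; this sketch shows the acceptance logic only, not the performance gain.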

The order to reach for these: streaming first because it is nearly free, then caching, then batching, then the heavier techniques. Pulling every lever at once makes regressions impossible to diagnose. For the disciplined sequence, see A Step-by-Step Approach to AI Inference and Latency.

A Decision Rule You Can Actually Use

Skip the matrix paralysis. Answer these in order and stop at the first one that fits.

Is the task latency-critical and narrow? (autocomplete, simple intent) → Edge or on-device, smallest viable model.
Is quality non-negotiable and volume low-to-moderate? → Hosted API, largest model. Optimize perceived latency with streaming, not by shrinking the model.
Is volume high, steady, and predictable, or is data residency a hard requirement? → Self-hosted open-weight model with batching.
Everything else (the common case) → Hosted API, smallest model that passes your evals. Add streaming and prompt caching. Upgrade the model only where evals prove you must.
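The four questions translate directly into a first-match function. The sketch below is one possible encoding in Python; the `Workload` fields and the returned strings are illustrative names chosen for this example, and the yes/no inputs still depend on judgment the code cannot supply.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    latency_critical_and_narrow: bool   # e.g. autocomplete, simple intent detection
    quality_non_negotiable: bool
    volume: str                         # "low", "moderate", or "high"
    steady_and_predictable: bool
    data_residency_required: bool


def choose_approach(w: Workload) -> str:
    """Apply the decision rule in order and stop at the first question that fits."""
    if w.latency_critical_and_narrow:
        return "edge or on-device, smallest viable model"
    if w.quality_non_negotiable and w.volume in ("low", "moderate"):
        return "hosted API, largest model, with streaming"
    if (w.volume == "high" and w.steady_and_predictable) or w.data_residency_required:
        return "self-hosted open-weight model with batching"
    # The common case: start cheap and simple, upgrade only where evals require it.
    return "hosted API, smallest model that passes evals, with streaming and prompt caching"
```

Because the checks run in order, a workload that matches several questions gets the answer for the earliest one, which mirrors the "stop at the first one that fits" instruction.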

The rule encodes one opinion: default to the cheap, simple, hosted option and earn your way to complexity. Most teams over-engineer inference before they have traffic to justify it.

The Failure Modes to Watch

Each approach fails in a characteristic way. Knowing the failure mode is half of avoiding it.

Largest-model API: latency spikes under provider load you cannot control, and cost balloons silently as usage grows.
Smallest-model API: quality regressions that no one catches because there are no evals. The model gets quietly worse on edge cases.
Self-hosting: under-provisioned GPUs cause request queueing that looks like model slowness; over-provisioned GPUs burn money on idle capacity.
Edge/on-device: the model is too small for a task that grew in scope, and quality quietly craters.
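The self-hosting failure mode is easier to spot when queue wait and model time are recorded as separate fields. The Python sketch below simulates a first-in, first-out queue in front of a fixed number of GPU workers; it assumes a constant per-request model time, which real workloads will not have, and the function name is illustrative.

```python
import heapq
from typing import Dict, List


def simulate_fifo(arrival_times: List[float], model_time_s: float,
                  workers: int) -> List[Dict[str, float]]:
    """Split each request's latency into queue wait and model time.

    Requests are served first-in, first-out by `workers` identical slots,
    each taking `model_time_s` per request.
    """
    free_at = [0.0] * workers          # when each worker next becomes free
    heapq.heapify(free_at)
    rows = []
    for arrived in sorted(arrival_times):
        worker_free = heapq.heappop(free_at)   # earliest available worker
        started = max(arrived, worker_free)    # wait if every worker is busy
        finished = started + model_time_s
        heapq.heappush(free_at, finished)
        rows.append({
            "queue_wait_s": started - arrived,
            "model_s": model_time_s,
            "total_s": finished - arrived,
        })
    return rows
```

With four simultaneous requests, a one-second model time, and a single worker, the last request sees four seconds of total latency while `model_s` is still one second. If you log only the total, that looks like a slow model rather than an under-provisioned queue.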

Two of these four—missing evals and confusing queue latency with model latency—are the most common production incidents we see. 7 Common Mistakes with AI Inference and Latency covers the full list and the fixes.

Frequently Asked Questions

Is self-hosting always cheaper than an API at scale?

No. Self-hosting wins on per-token cost only at high, steady volume where your GPUs stay busy. Bursty or low traffic means you pay for idle capacity, and once you add the engineering and on-call cost, an API is often cheaper in total. Run the math on your actual traffic shape, not peak.

What is the single highest-leverage change for perceived latency?

Streaming. It does not reduce total generation time at all, but it lets the user start reading within the first few hundred milliseconds instead of staring at a spinner. For any interactive feature, enable streaming before you touch the model or the infrastructure.

How do I choose between a smaller model and a faster setup?

Measure quality first with a real eval set, then optimize speed. A smaller model that fails your evals is not faster—it is just wrong sooner. Lock in the smallest model that passes, then apply caching and batching to make that model fast.
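"Lock in the smallest model that passes" can be sketched as a loop over candidates ordered from cheapest to most expensive. The Python below assumes exact-match scoring and a single accuracy threshold, both simplifications; the function and parameter names are hypothetical, and a real eval would likely need task-specific grading.

```python
from typing import Callable, List, Optional, Tuple

Predict = Callable[[str], str]


def smallest_passing_model(candidates: List[Tuple[str, Predict]],
                           eval_set: List[Tuple[str, str]],
                           threshold: float) -> Tuple[Optional[str], Optional[float]]:
    """Return the first (cheapest) candidate whose accuracy meets the threshold.

    `candidates` must be ordered smallest or cheapest first.
    `eval_set` is a list of (prompt, expected_answer) pairs.
    Scoring here is exact string match, which is an assumption of this sketch.
    """
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    for name, predict in candidates:
        correct = sum(1 for prompt, expected in eval_set if predict(prompt) == expected)
        accuracy = correct / len(eval_set)
        if accuracy >= threshold:
            return name, accuracy
    return None, None  # nothing passed: do not ship a smaller model
```

Returning `None` when no candidate passes is deliberate: the rule is to gate on quality first, so a failing eval should block the rollout rather than fall back to the fastest option.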

Does quantization hurt quality?

Usually only a little, but "usually" is not "always" and the loss is task-dependent. Quantization can degrade performance on numerical reasoning or long-context tasks more than on classification. Always validate a quantized model on your own data before shipping it.
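To see where the precision loss comes from, here is a minimal Python sketch of symmetric 8-bit quantization applied to a short list of weights. It is a teaching example under simplifying assumptions (one scale for the whole list, pure Python, no real model); production schemes typically quantize per channel or per group and use optimized kernels.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats; the difference from the originals is the quantization error."""
    return [q * scale for q in quantized]


def max_abs_error(weights: List[float]) -> float:
    """Largest round-trip error for a given set of weights."""
    quantized, scale = quantize_int8(weights)
    restored = dequantize(quantized, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))
```

Because the scale is set by the largest weight, values much smaller than that can round to zero. Whether such losses matter depends on the task, which is why the measurement has to happen on your own data rather than on a generic benchmark.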

When should I use batching?

Use batching when a single backend serves many concurrent requests and you care about cost and throughput more than individual request latency—self-hosted serving, background jobs, bulk processing. Avoid it for a low-traffic interactive endpoint, where the added queueing delay hurts more than the throughput gain helps.
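The queueing cost can be estimated before you deploy anything. The Python sketch below models a simple fixed-window batcher: a batch dispatches when it is full or when a maximum wait has passed since its first request. This is an assumption about the batching policy, not a description of any particular serving framework, and it ignores GPU processing time entirely.

```python
from typing import List, Tuple

Batch = Tuple[float, List[float]]  # (dispatch_time, arrival times in the batch)


def batch_requests(arrival_times: List[float], max_batch_size: int,
                   max_wait_s: float) -> List[Batch]:
    """Group requests into batches that dispatch when full or when the window expires."""
    batches: List[Batch] = []
    current: List[float] = []
    for t in sorted(arrival_times):
        # The open batch's window expired before this request arrived.
        if current and t > current[0] + max_wait_s:
            batches.append((current[0] + max_wait_s, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:   # full batch dispatches immediately
            batches.append((t, current))
            current = []
    if current:                              # final partial batch waits out its window
        batches.append((current[0] + max_wait_s, current))
    return batches


def mean_added_wait(batches: List[Batch]) -> float:
    """Average time a request spends waiting for its batch to dispatch."""
    waits = [dispatch - t for dispatch, requests in batches for t in requests]
    return sum(waits) / len(waits)
```

With a batch size of 4 and a 50 ms window, four requests arriving 10 ms apart share one batch and wait 15 ms on average. Three requests arriving a second apart each sit alone for the full 50 ms and get no throughput benefit in return, which is the low-traffic case the answer above warns about.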

Key Takeaways

Every inference decision trades among four axes: latency (TTFT and throughput), cost, quality, and operational burden. You can optimize three at a time, not all four.
There is no best approach. Match the approach to the task: edge for narrow latency-critical work, large API for high-stakes low-volume, self-hosting for high steady volume or data residency, small API for everything else.
Streaming, prompt caching, batching, speculative decoding, and quantization are cross-cutting levers. Reach for them in roughly that order—cheapest and safest first.
Use the decision rule: default to the cheapest hosted option, gate every upgrade behind evals, and earn your way to complexity rather than starting there.
Know your approach's failure mode. Missing evals and mistaking queue latency for model latency cause most real-world inference incidents.

Agency Script Editorial
