One Number Decides If Your Model Feels Magical

Inference is the moment a trained model produces an answer. Training gets the headlines and the GPU budgets, but inference is where almost all of the real-world cost, risk, and user experience lives. Every chatbot reply, every autocomplete suggestion, every fraud score is an inference call. And the single number that determines whether those calls feel magical or maddening is latency.

This guide treats inference and latency as one system, because they are. You cannot reason about one without the other. A model that produces a perfect answer in nine seconds is, for most interactive products, a failure. A model that responds in 200 milliseconds but with a worse answer might win the market. The art is knowing which trade-offs to make, and where.

We will move from the mechanics of what happens during an inference request, through the specific places latency hides, to the levers you actually control. By the end you should be able to look at a slow AI feature and form a real hypothesis about why it is slow — rather than guessing.

What Inference Actually Is

Inference is forward propagation through a trained network: inputs go in, the model computes, outputs come out. No weights change. For a large language model, the unit of work is the token, and inference happens in two distinct phases that behave very differently.

Prefill versus decode

The prefill phase processes your entire prompt at once. It is compute-bound and highly parallel: the GPU processes all input tokens in what is effectively one large parallel pass. The decode phase generates the output one token at a time, each new token depending on the ones before it. Decode is memory-bandwidth-bound and inherently sequential.

This split matters enormously. A long prompt with a short answer is dominated by prefill. A short prompt with a long answer is dominated by decode. They respond to completely different optimizations, which is why a single "make it faster" instinct so often fails.
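A back-of-envelope model makes the split concrete. The sketch below is a rough estimate, not a benchmark: it assumes about 2 FLOPs per parameter per prompt token for prefill, and one full read of the weights per generated token for decode. The `Hardware` figures and the function names are illustrative placeholders, not measurements of any particular GPU.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    peak_flops: float     # usable FLOP/s (assumed, illustrative)
    mem_bandwidth: float  # memory bandwidth in bytes/s (assumed, illustrative)


def prefill_seconds(params: float, prompt_tokens: int, hw: Hardware) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token."""
    return 2.0 * params * prompt_tokens / hw.peak_flops


def decode_seconds(params: float, output_tokens: int,
                   bytes_per_param: float, hw: Hardware) -> float:
    """Bandwidth-bound estimate: every weight is read once per output token."""
    return output_tokens * params * bytes_per_param / hw.mem_bandwidth


def dominant_phase(params: float, prompt_tokens: int, output_tokens: int,
                   bytes_per_param: float, hw: Hardware) -> str:
    """Return which phase takes longer under the two estimates above."""
    prefill = prefill_seconds(params, prompt_tokens, hw)
    decode = decode_seconds(params, output_tokens, bytes_per_param, hw)
    return "prefill" if prefill > decode else "decode"


# Hypothetical accelerator: 200 TFLOP/s usable compute, 1 TB/s bandwidth.
hw = Hardware(peak_flops=200e12, mem_bandwidth=1e12)

# 7B parameters at 2 bytes each (16-bit weights).
print(dominant_phase(7e9, prompt_tokens=8000, output_tokens=20,
                     bytes_per_param=2, hw=hw))
print(dominant_phase(7e9, prompt_tokens=50, output_tokens=500,
                     bytes_per_param=2, hw=hw))
```

With these placeholder numbers, the 8,000-token prompt with a 20-token answer comes out prefill-dominated (about 0.56 s against 0.28 s), while the 50-token prompt with a 500-token answer spends nearly all of its time in decode. Real systems add batching, attention cost, and kernel overheads on top of this, so treat the output as a direction, not a prediction.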

The Anatomy of a Latency Number

When someone says a request "took two seconds," they are collapsing several independent delays into one figure. To improve latency you have to separate them.

Network round trip — the time to reach the inference server and return.
Queue time — how long the request waits before a worker picks it up.
Time to first token (TTFT) — prefill plus scheduling, the delay before anything appears.
Inter-token latency — the gap between each streamed token during decode.
Total generation time — TTFT plus inter-token latency times the number of output tokens.

For streaming interfaces, TTFT is the number users feel as "responsiveness," while inter-token latency determines whether the text crawls or flows. Optimizing the wrong one wastes effort. If you have never instrumented these separately, that is the first thing to fix — a theme we expand on in A Step-by-Step Approach to AI Inference and Latency.
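Separating these numbers on the client side takes very little code. The sketch below assumes your client exposes the response as an iterable of tokens; `measure_stream` and `fake_stream` are hypothetical names for illustration, not part of any specific SDK. Because timing starts at the call, network and queue time are folded into TTFT here unless you measure them separately on the server.

```python
import time
from typing import Callable, Iterable


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and record client-side timing."""
    start = clock()
    # Take a timestamp as each token arrives from the stream.
    arrivals = [clock() for _ in tokens]
    if not arrivals:
        return {"tokens": 0, "ttft": None, "mean_itl": None, "total": None}
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "tokens": len(arrivals),
        "ttft": arrivals[0] - start,                       # time to first token
        "mean_itl": sum(gaps) / len(gaps) if gaps else 0.0,  # inter-token latency
        "total": arrivals[-1] - start,                     # total generation time
    }


def fake_stream():
    """Hypothetical stand-in for a real streaming response."""
    for word in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield word


if __name__ == "__main__":
    print(measure_stream(fake_stream()))
```

The `clock` parameter exists so the function can be tested with a fake clock. In production you would log these three figures per request and aggregate them separately, since TTFT and inter-token latency usually regress for different reasons.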

Where Latency Comes From

Model size and architecture

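Bigger models are slower, and the slowdown depends on more than parameter count. Memory bandwidth, not raw FLOPs, often dominates decode, because each generated token requires reading the weights from memory. At the same precision on the same hardware, a 70-billion-parameter model moves roughly ten times as many weight bytes per token as a 7-billion one, so it can be around ten times slower per token, and that gap can widen under load.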

Batching and concurrency

Inference servers batch concurrent requests to keep the GPU busy. Batching raises throughput but can raise individual latency if a request waits to join a batch. Continuous batching, where new requests slot into an in-flight batch as soon as a slot frees up, is the modern answer: it typically raises throughput and cuts queue wait at the same time, although each decode step can still get slower as the batch grows.
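A toy simulation shows why slotting requests into an in-flight batch helps. The sketch below counts decode steps only: it ignores prefill, assumes every step costs the same regardless of batch size, and assumes each request generates at least one token. Both function names are made up for this illustration and do not correspond to any serving framework's API.

```python
from collections import deque
from typing import List


def static_batching(lengths: List[int], batch_size: int) -> List[int]:
    """Finish step per request when a batch runs until its longest member ends."""
    finish = []
    clock = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        finish.extend(clock + n for n in batch)
        clock += max(batch)  # the next batch waits for the longest request
    return finish


def continuous_batching(lengths: List[int], batch_size: int) -> List[int]:
    """Finish step per request when freed slots are refilled immediately."""
    waiting = deque(enumerate(lengths))
    running = {}  # request index -> tokens still to generate
    finish = [0] * len(lengths)
    step = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            idx, n = waiting.popleft()
            running[idx] = n
        step += 1
        for idx in list(running):
            running[idx] -= 1
            if running[idx] <= 0:
                finish[idx] = step
                del running[idx]
    return finish


# One long request followed by three short ones, two slots available.
lengths = [10, 2, 2, 2]
print(static_batching(lengths, batch_size=2))      # [10, 2, 12, 12]
print(continuous_batching(lengths, batch_size=2))  # [10, 2, 4, 6]
```

Under static batching the two short requests at the back wait for the long one to finish. Under continuous batching they take over the slot freed at step 2 and are done by steps 4 and 6, with the same total number of decode steps.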

Context length

Longer prompts mean more prefill compute and a larger key-value cache to hold in memory during decode. Latency grows with context length, and beyond a point the KV cache competes for the same memory the model weights need.
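The size of the KV cache is easy to estimate. The sketch below uses the usual accounting for transformer attention, one key tensor and one value tensor per layer. The model shape in the example (32 layers, 32 key-value heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers, not the spec of a named model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for `batch_size` sequences of `seq_len` tokens."""
    # The leading 2 accounts for one key and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 1024 ** 3

# Hypothetical model shape, 16-bit cache values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
one_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
sixteen_requests = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)

print(per_token)                   # 524288 bytes, i.e. 0.5 MiB per token
print(one_request / GIB)           # 2.0 GiB for a single 4,096-token context
print(sixteen_requests / GIB)      # 32.0 GiB for sixteen such contexts
```

At this shape, sixteen concurrent 4,096-token contexts need as much cache memory as many models need for their weights, which is the competition the paragraph above describes. Models that share key-value heads across query heads have a smaller `num_kv_heads` and a proportionally smaller cache.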

The Levers You Control

Some latency is physics. Most of it is choices. The high-leverage moves, roughly in order of impact:

Pick a smaller model when quality allows. The cheapest optimization is needing less compute.
Quantize weights to 8-bit or 4-bit to cut memory traffic, usually with minor quality loss.
Stream tokens so perceived latency tracks TTFT, not total time.
Cache aggressively — both prompt prefixes and full responses for repeated queries.
Shorten outputs with tighter prompts and lower max-token limits.

For the deeper reasoning behind each of these, see AI Inference and Latency: Best Practices That Actually Work.
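Of these levers, full-response caching is the simplest to sketch. The example below is a minimal exact-match cache with a time-to-live, assuming responses are safe to reuse across callers (deterministic, non-personalized queries). `ResponseCache` and its `generate` callback are hypothetical names, and treating surrounding whitespace as insignificant is an assumption you would need to check for your prompts. Prefix caching of the prompt itself happens inside the inference server and is not shown here.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class ResponseCache:
    """Exact-match response cache with a time-to-live (illustrative sketch)."""

    def __init__(self, generate: Callable[[str], str],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._generate = generate  # the slow call, e.g. a model request
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Assumption: leading and trailing whitespace do not change meaning.
        return hashlib.sha256(prompt.strip().encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> str:
        key = self._key(prompt)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]  # served without touching the model
        self.misses += 1
        response = self._generate(prompt)
        self._store[key] = (now, response)
        return response
```

A cache hit costs a hash and a dictionary lookup instead of a full prefill and decode, so the hit rate (`hits / (hits + misses)`) is worth tracking next to your latency percentiles. This sketch never evicts entries, so a long-running service would also need a size bound.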

Measuring It Honestly

Averages lie. A 300-millisecond mean TTFT can hide a tail where one in twenty users waits four seconds. Always report percentiles — p50, p95, p99 — because the tail is what generates complaints and churn.
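The arithmetic behind that claim is worth seeing once. The sketch below uses a nearest-rank percentile, which is one of several common definitions, and a made-up sample of 100 TTFT measurements in which 95 requests take 105 ms and 5 take 4 seconds.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: one in twenty requests hits a four-second tail.
ttft_seconds = [0.105] * 95 + [4.0] * 5

mean = sum(ttft_seconds) / len(ttft_seconds)
print(round(mean, 3))                 # 0.3
print(percentile(ttft_seconds, 50))   # 0.105
print(percentile(ttft_seconds, 99))   # 4.0
```

The mean sits near 300 ms, a value that no request in the sample actually experienced. The median shows what a typical user sees, and p99 exposes the four-second tail that the mean hides.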

Measure under realistic load, not on an idle server. Latency that looks great with one user often collapses at fifty concurrent requests because queueing and batching dynamics change entirely. Synthetic single-request benchmarks are nearly useless for capacity planning.

The Latency-Throughput-Cost Triangle

You cannot reason about inference by optimizing one number in isolation, because latency, throughput, and cost are bound together. Push one and the others move.

How the three interact

Larger batches raise throughput and lower cost per request, but can raise individual latency as requests wait to assemble. A bigger, more capable model improves quality but worsens both latency and cost. Quantization improves latency and cost but may cost a little quality. Every meaningful decision is a move within this triangle, not a free win.

The practical implication is that "make it faster" is an incomplete instruction. Faster at what throughput? At what cost? At what quality bar? A team that answers those questions makes deliberate trades; a team that does not just shuffles the problem around. The right framing is: meet the latency target at the lowest cost without dropping below the quality bar.
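That framing translates directly into a selection rule. The sketch below assumes you have already measured a p95 latency, a cost, and a quality score for each candidate deployment. The `Config` fields, the candidate names, and every number in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    name: str
    p95_latency_s: float         # measured under realistic load
    cost_per_1k_requests: float  # in whatever currency you budget in
    quality: float               # score on your own evaluation set


def cheapest_meeting_targets(configs: Iterable[Config],
                             max_p95_latency_s: float,
                             min_quality: float) -> Optional[Config]:
    """Cheapest config that meets the latency target and the quality bar."""
    eligible = [c for c in configs
                if c.p95_latency_s <= max_p95_latency_s
                and c.quality >= min_quality]
    return min(eligible, key=lambda c: c.cost_per_1k_requests, default=None)


# Hypothetical candidates with made-up measurements.
candidates = [
    Config("large-16bit", p95_latency_s=1.8, cost_per_1k_requests=9.0, quality=0.92),
    Config("large-4bit", p95_latency_s=0.9, cost_per_1k_requests=5.0, quality=0.90),
    Config("small-16bit", p95_latency_s=0.4, cost_per_1k_requests=1.5, quality=0.84),
    Config("small-4bit", p95_latency_s=0.3, cost_per_1k_requests=1.0, quality=0.80),
]

best = cheapest_meeting_targets(candidates, max_p95_latency_s=1.0, min_quality=0.82)
print(best.name if best else "no config meets both targets")  # small-16bit
```

Latency and quality act as constraints and cost is the thing being minimized. A `None` result is useful information too: it means the targets cannot be met by any current option and one of them has to move.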

Interactive Versus Batch Workloads

The single most useful distinction in inference is whether a human is waiting. It changes which metric matters and inverts your entire optimization strategy.

Interactive (chat, autocomplete, voice): latency dominates. Time to first token and streaming smoothness decide the experience. Small batches, tight windows, aggressive caching.
Batch (overnight classification, bulk embedding): throughput dominates. Latency per item is irrelevant. Large batches, no streaming, maximize items per dollar.

Applying interactive tuning to a batch job wastes money; applying batch tuning to an interactive feature makes it feel broken. Sorting your workload into one bucket or the other is the first decision, and it shapes everything downstream. The contrast plays out vividly across AI Inference and Latency: Real-World Examples and Use Cases.

Frequently Asked Questions

What is the difference between inference and training latency?

Training latency concerns how long it takes to update model weights over many examples, measured in hours or days. Inference latency is the time to produce a single prediction from an already-trained model, measured in milliseconds to seconds. They optimize for different things, and a fast-training setup says nothing about fast inference.

Is throughput or latency more important?

It depends on the workload. Interactive products like chat live and die by latency, especially time to first token. Batch jobs like nightly document classification care mostly about throughput and the cost efficiency that follows from it: items processed per hour and per dollar. Most teams need to optimize both, but you must decide which one wins when they conflict.

Why does my model get slower as more users arrive?

Concurrent requests compete for GPU memory and compute, and they queue. Without continuous batching, requests wait for the current batch to finish. The KV cache for many simultaneous long contexts can also exhaust memory, forcing evictions or rejections that spike tail latency.

Can I reduce latency without changing the model?

Often, yes. Streaming responses, caching repeated prompts, trimming output length, co-locating your app and inference servers, and tuning batch settings can all cut perceived or actual latency with the same model. These are usually the first moves before reaching for a smaller or quantized model.

What is a good latency target?

For conversational interfaces, aim for time to first token under 500 milliseconds and a steady stream after. For autocomplete or search, you want sub-100-millisecond responses. Background and batch tasks can tolerate seconds. The right target is set by the user's expectation, not by what your infrastructure happens to deliver.

Key Takeaways

Inference is where most of a model's lifetime cost is incurred and where user experience is shaped.
LLM inference has two phases — prefill (compute-bound) and decode (bandwidth-bound) — that need different optimizations.
A latency number is several delays combined; separate network, queue, TTFT, and inter-token latency before improving anything.
Model size, batching strategy, and context length are the dominant sources of latency.
Streaming, caching, quantization, and smaller models are your highest-leverage levers.
Always measure percentiles under realistic concurrent load — averages and single-request tests will mislead you.

Agency Script Editorial
