Opinionated Inference Habits Worth the Extra Work

Most "best practices" lists for inference latency are wallpaper — true but useless. "Use efficient models." "Cache when possible." Thanks. This article is the opposite: a set of opinionated practices we actually stand behind, each with the reasoning that makes it more than a slogan. Some of these will feel like extra work. They earn their keep.

The through-line is a bias toward measurement and toward doing less. Most latency wins come from not computing things you do not need, and from knowing precisely where time goes before you spend any of yours fixing it. Speed is mostly subtraction.

If you adopt only three of these, adopt the ones about percentiles, caching, and right-sizing the model. They cover the majority of real-world latency problems.

Right-Size the Model Before Anything Else

The fastest inference is the smallest model that still meets your quality bar. Teams reflexively reach for the most capable model, then fight its latency forever.

Why this comes first

Model size drives both prefill and decode cost. A model that is twice as fast to begin with makes every downstream optimization easier. Run your real prompts against a few model sizes, score quality honestly, and pick the smallest that passes. Capability you do not use is latency you pay for.
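The selection rule above can be sketched in a few lines. This is a minimal illustration, not a prescribed evaluation harness: the `Candidate` fields, the model labels, and every number in `CANDIDATES` are made up for the example, and the quality score is assumed to come from your own eval set run on real prompts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str        # hypothetical model label
    params_b: float  # model size in billions of parameters
    quality: float   # score on your own eval set, 0.0 to 1.0
    p95_ms: float    # p95 latency measured on your real prompts


def smallest_passing(candidates: List[Candidate], quality_bar: float) -> Optional[Candidate]:
    """Return the smallest candidate that meets the quality bar, or None."""
    passing = [c for c in candidates if c.quality >= quality_bar]
    if not passing:
        return None
    return min(passing, key=lambda c: c.params_b)


# Illustrative numbers only; replace with your own measurements.
CANDIDATES = [
    Candidate("small", 3, 0.81, 220),
    Candidate("medium", 13, 0.90, 540),
    Candidate("large", 70, 0.93, 1900),
]

choice = smallest_passing(CANDIDATES, quality_bar=0.88)
```

With these made-up scores, a bar of 0.88 selects "medium": "large" also passes, but the extra capability would only add latency.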

Manage Latency by Percentile

Never let an average represent your latency. Set service targets against p95 or p99, because the tail is what users feel and what generates churn.

The reasoning is statistical: response times are right-skewed, so the mean sits well below the experience of your slowest users. A system can have a beautiful average and a brutal p99. Track both; promise on the tail.
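A small sketch shows how far the mean can drift from the tail. The sample below is synthetic and deliberately right-skewed, and `percentile` uses the nearest-rank definition, which is one of several common conventions. Your metrics system may interpolate instead.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms: most requests are fast, a few are very slow.
LATENCIES_MS = [120] * 90 + [400] * 8 + [3000] * 2

mean_ms = statistics.mean(LATENCIES_MS)
p50_ms = percentile(LATENCIES_MS, 50)
p95_ms = percentile(LATENCIES_MS, 95)
p99_ms = percentile(LATENCIES_MS, 99)
```

On this sample the mean is 200 ms, the median is 120 ms, p95 is 400 ms, and p99 is 3000 ms. A target written against the mean would never notice the two requests that took three seconds.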

Cache at Every Layer You Can

Caching is the highest-leverage latency practice that most teams underuse. There are three layers worth caching:

Full responses for repeated or near-identical queries.
Prompt prefixes so a fixed system prompt is not reprocessed each call.
Retrieved context so you do not re-fetch and re-embed stable documents.

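Even a modest hit rate pays off because a cache hit returns in near-zero time compared with a full inference. Measure your hit rate; if it is low, your cache key is probably too strict, for example keyed on raw text so that differences in whitespace or casing split otherwise identical queries.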
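Here is a minimal sketch of the first layer, a full-response cache that also reports its own hit rate. The `ResponseCache` class and `cache_key` function are hypothetical names for this example. The normalization (lowercasing and collapsing whitespace) is an assumption about what counts as "near-identical", and a real deployment would also need eviction and expiry.

```python
import hashlib


def cache_key(prompt: str) -> str:
    """Normalize case and whitespace so trivially different prompts share a key."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory response cache that counts hits and misses."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prompt, compute):
        """Return the cached response, or call compute(prompt) and store the result."""
        key = cache_key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(prompt)
        self._store[key] = value
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Pass your real inference call as `compute`. If `hit_rate` stays low on traffic you know is repetitive, look at what `cache_key` treats as different.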

Stream Everything Interactive

For any user-facing feature, stream tokens. This is non-negotiable. Streaming decouples perceived latency from total generation time — the user starts reading at time-to-first-token, not at completion.

The corollary: optimize TTFT obsessively for interactive features, and worry less about total time. A 300 ms TTFT with steady streaming beats a 1.5-second complete response nearly every time. This pairs with the diagnosis steps in A Step-by-Step Approach to AI Inference and Latency.
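To optimize TTFT you have to record it separately from total time. The sketch below wraps any iterator of tokens. `measure_stream` is a hypothetical helper, and the injectable `clock` parameter exists only so the timing logic can be tested without real delays.

```python
import time


def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token stream and report time to first token and total time, in seconds."""
    start = clock()
    ttft = None
    count = 0
    for _token in token_iter:
        if ttft is None:
            # First token has arrived: this is what the user perceives as the wait.
            ttft = clock() - start
        count += 1
    total = clock() - start
    return {"ttft_s": ttft, "total_s": total, "tokens": count}
```

Wrap the streaming iterator your client library returns and log both numbers. `ttft_s` is `None` when the stream produced no tokens, which is worth alerting on separately.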

Control Context Length Deliberately

Long context is a silent latency tax. Every extra input token adds prefill compute and enlarges the KV cache that each decode step has to read.

Practices that work

Cap conversation history and summarize older turns.
Retrieve fewer, higher-relevance documents instead of dumping everything.
Keep system prompts tight and cache them as a prefix.

The discipline is treating context as a budget, not a dumping ground. Most prompts can lose a third of their tokens with no quality loss — and the common mistakes guide covers how this bloat sneaks in.
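Treating context as a budget can be made literal. The sketch below keeps the system prompt plus the most recent turns that fit, and returns the older turns separately so they can be summarized. `fit_history` is a hypothetical helper, and `count_tokens` is a crude whitespace proxy used only to keep the example self-contained; swap in your model's real tokenizer.

```python
def count_tokens(text):
    """Assumption: whitespace word count as a stand-in for a real tokenizer."""
    return len(text.split())


def fit_history(system_prompt, turns, budget_tokens, count=count_tokens):
    """Keep the newest turns that fit the budget; return (kept, dropped)."""
    remaining = budget_tokens - count(system_prompt)
    if remaining < 0:
        raise ValueError("system prompt alone exceeds the budget")
    kept = []
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for turn in reversed(turns):
        cost = count(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    kept.reverse()
    dropped = turns[: len(turns) - len(kept)]
    return kept, dropped
```

The `dropped` list is the input to a summarization step. Sending a short summary of it ahead of `kept` preserves continuity without paying for every old token.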

Tune Batching for Your Workload

Use continuous batching so new requests join in-flight batches at the next decode step rather than waiting for the current batch to finish. Compared with static batching, this improves throughput and latency at the same time, which is rare.

The trade-off to understand: aggressive batching maximizes GPU utilization but can add a few milliseconds of wait as requests assemble. For latency-critical paths, tune batch windows tighter; for throughput-critical batch jobs, let them grow. Match the configuration to whether the workload is interactive or background.
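The window trade-off can be seen in a toy model. This sketch does not implement continuous batching; it simulates a simple window-based assembler, where a batch is dispatched when it is full or when the window since its first request expires. The function names and the arrival times are invented for illustration.

```python
def assemble_batches(arrivals_ms, window_ms, max_batch):
    """Group arrival times into batches under a wait window and a size cap."""
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        window_expired = bool(current) and t - current[0] > window_ms
        batch_full = len(current) == max_batch
        if window_expired or batch_full:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


def worst_wait_ms(batches, window_ms, max_batch):
    """Longest time the first request in any batch waits before dispatch."""
    worst = 0
    for batch in batches:
        # A full batch leaves when its last member arrives; otherwise at window expiry.
        dispatch = batch[-1] if len(batch) == max_batch else batch[0] + window_ms
        worst = max(worst, dispatch - batch[0])
    return worst


ARRIVALS_MS = [0, 1, 2, 10, 11, 30]  # made-up arrival times
tight = assemble_batches(ARRIVALS_MS, window_ms=5, max_batch=4)
loose = assemble_batches(ARRIVALS_MS, window_ms=50, max_batch=4)
```

On these arrivals the 5 ms window yields three small batches with a worst-case wait of 5 ms, while the 50 ms window yields two larger batches with a worst-case wait of 50 ms. That is the interactive versus background choice in miniature.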

Quantize When Decode Is the Bottleneck

If your diagnosis shows slow per-token streaming, quantization to 8-bit (and often 4-bit) cuts memory bandwidth demand — the actual constraint in decode — usually with minor quality loss.

The reason this works is that decode is memory-bound, not compute-bound. Smaller weights move faster through memory, so each token comes quicker. Test the quality impact on your real tasks before shipping; the loss is usually small but task-dependent.
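A back-of-the-envelope calculation makes the memory-bound argument concrete. At batch size 1, each decoded token requires reading roughly all model weights once, so weight bytes divided by memory bandwidth gives a floor on per-token time. The sketch below computes that floor. It ignores KV cache reads, kernel overhead, and batching, and the 7-billion-parameter model and 1000 GB/s bandwidth in the usage lines are illustrative assumptions, not measurements.

```python
def decode_floor_ms_per_token(params_billion, bits_per_weight, bandwidth_gb_s):
    """Lower bound on per-token decode time when weight reads dominate."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1000


# Hypothetical 7B-parameter model on hardware with 1000 GB/s memory bandwidth.
fp16_ms = decode_floor_ms_per_token(7, 16, 1000)
int8_ms = decode_floor_ms_per_token(7, 8, 1000)
int4_ms = decode_floor_ms_per_token(7, 4, 1000)
```

Under these assumptions the floor falls from about 14 ms per token at 16-bit to about 7 ms at 8-bit and 3.5 ms at 4-bit. Real speedups are smaller than the ratio suggests, which is one more reason to measure on your own tasks.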

Set a Latency Budget and Defend It

The practices above are tactics. The practice that makes them stick is treating latency as a budget set during design, not a number you discover after launch. A budget is a single sentence: "time to first token must stay under 500 ms at p95 for the chat path." Everything else flows from it.

Why a budget beats good intentions

Without a written budget, latency erodes one reasonable decision at a time. Someone adds a few retrieved documents "to improve quality." Someone enriches the system prompt. Each change is defensible in isolation, and collectively they push TTFT past the point users tolerate. A budget gives you a line to defend: if a change pushes you over it, the change has to pay for itself or get cut.

Tie the budget to the user, not the infrastructure. The right number is the one set by what the person on the other end expects, which differs sharply by use case — a theme explored across AI Inference and Latency: Real-World Examples and Use Cases. Autocomplete and chat live in different worlds, and a single shared budget for both is a budget that fits neither.
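A written budget is easiest to defend when a check can fail on it. The sketch below keeps one TTFT budget per path and compares it against the observed p95. The 500 ms chat figure comes from the example sentence above; the autocomplete figure is a made-up placeholder, and `check_budget` is a hypothetical helper that could run in a load test or a release gate.

```python
import math

# "chat" is the budget quoted in the text; "autocomplete" is a hypothetical value.
TTFT_BUDGETS_MS = {"chat": 500, "autocomplete": 150}


def p95(samples):
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return ordered[rank - 1]


def check_budget(path, ttft_samples_ms, budgets=TTFT_BUDGETS_MS):
    """Compare observed p95 TTFT for a path against its written budget."""
    observed = p95(ttft_samples_ms)
    limit = budgets[path]
    return {
        "path": path,
        "p95_ms": observed,
        "budget_ms": limit,
        "ok": observed < limit,  # the budget says "must stay under"
    }
```

Keeping budgets in a per-path table makes the point structurally: the same measurements can pass for chat and fail for autocomplete.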

Make Latency a Team Habit, Not a Heroic Project

The teams that stay fast are not the ones that run a one-time optimization sprint. They are the ones that built latency into their reflexes. A few habits do most of the work:

Show percentiles on the main dashboard so the tail is always visible, not something you go hunting for during an incident.
Re-test under load on every traffic milestone, because the configuration that was fine at one scale rarely survives the next.
Review token counts in code review, catching context bloat before it ships rather than after it slows everything down.

The deeper point is cultural. When latency is a number the whole team watches, regressions get caught early and cheaply. When it is something one engineer checks during a crisis, you are always fixing it the expensive way, after users have already felt it.

Frequently Asked Questions

What is the single highest-leverage practice?

Right-sizing the model, because it multiplies every other optimization. A smaller starting model makes caching, batching, and streaming all easier and cheaper. The catch is doing the quality comparison honestly rather than assuming the biggest model is required.

Why is caching emphasized so heavily?

Because a cache hit costs almost nothing compared to a full inference, and real traffic is far more repetitive than teams assume. Prompt-prefix caching alone can slash TTFT on every request that shares a system prompt. Underused caching is the most common missed win.

Should I always quantize?

No. Quantize when decode speed is the proven bottleneck and after you have verified quality holds on your tasks. Quantizing a model whose latency problem is actually queueing or oversized context wastes effort and may degrade output for no speed gain.

How tight should my batch window be?

For interactive features, keep it small so requests do not wait long to assemble — a few milliseconds at most. For background batch jobs, let batches grow large to maximize throughput. The right answer depends entirely on whether a human is waiting.

Do these practices apply to hosted APIs?

Many do. Even without controlling the model or server, you control context length, output length, caching, streaming, and region. Those alone cover most of the practices here. Right-sizing becomes choosing the right tier of the provider's model lineup.

Key Takeaways

Pick the smallest model that meets your quality bar before optimizing anything else.
Manage and promise latency by percentile (p95/p99), never the average.
Cache full responses, prompt prefixes, and retrieved context — the most underused win.
Stream all interactive responses and obsess over time to first token.
Treat context length as a budget; trim and summarize aggressively.
Use continuous batching tuned to the workload, and quantize only when decode is the proven bottleneck.

Agency Script Editorial
