Seven Latency Decisions That Feel Reasonable and Cost You

Latency problems are rarely exotic. The same handful of mistakes appears in team after team, and they are expensive precisely because they look reasonable from the inside. Each one feels like a sensible decision until you trace a slow request and discover the real culprit.

This article names seven of the most common mistakes, explains why each happens, what it costs, and the corrective practice. None of them require deep expertise to avoid — they require knowing the trap exists. Most of these we have watched smart engineers walk into more than once.

Read these as a diagnostic checklist. If you recognize three or more in your own setup, you have found your weekend project.

Mistake 1: Optimizing the Average

Teams report a "200 ms average response time" and call it fast. Then support tickets pile up. The reason is almost always the tail: while the median user gets 200 ms, the p99 user waits four seconds.

Why it happens: averages are the default metric in most dashboards and they feel intuitive.

The cost: the slowest one percent of requests generate a disproportionate share of complaints and churn, and you never see them.

The fix: report p50, p95, and p99 everywhere. Set targets against p95 or p99, not the mean.

Mistake 2: Measuring on an Idle Server

A benchmark with one request looks blazing fast. In production with fifty concurrent users it falls apart. Single-request testing hides queueing and batching dynamics entirely.

Why it happens: single-request tests are easy and the numbers look flattering.

The cost: capacity surprises in production, usually during your busiest hour.

The fix: always load test at realistic concurrency. The step-by-step process in A Step-by-Step Approach to AI Inference and Latency builds this in from the start.

Mistake 3: Swapping Models Blindly

The feature is slow, so someone switches to a smaller model. Sometimes it helps. Often it does nothing, because the bottleneck was queueing or context length, not the model itself.

Why it happens: model choice is the most visible knob, so it gets turned first.

The cost: wasted effort, plus a possible quality regression that hurts users without fixing speed.

The fix: diagnose the dominant cost before changing the model. Only swap when decode speed is genuinely the bottleneck.

Mistake 4: Ignoring Time to First Token

A team optimizes total generation time but leaves a long blank pause before the first token. Users perceive that pause as the system being broken, even when total time is fine.

Why it happens: total time is easier to think about than the streaming experience.

The fix

Measure TTFT separately and stream tokens as they generate. A response that starts in 300 ms and streams steadily feels faster than one that arrives complete in two seconds. Perceived speed is governed by TTFT, not totals.

Mistake 5: Sending Bloated Context

Every request stuffs the full conversation history, a giant system prompt, and ten retrieved documents into the model. Prefill cost balloons and TTFT climbs with every extra token.

Why it happens: "more context is safer" feels true, so context only ever grows.

The cost: higher latency and higher token bills, often for context the model does not even use.

The fix: trim ruthlessly. Cache static prompt prefixes, summarize old history, and retrieve fewer, more relevant documents. This is a recurring theme in AI Inference and Latency: Best Practices That Actually Work.

Mistake 6: No Caching

Identical or near-identical requests hit the model fresh every time. Common questions, repeated system prompts, and stable document sets all get recomputed needlessly.

Why it happens: caching feels like extra infrastructure and gets deferred.

The cost: you pay full inference latency and cost for work you already did.

The fix: cache full responses for repeated queries and use prompt-prefix caching so the static portion of prompts is not reprocessed. Caching is often the single biggest latency win available.

Mistake 7: Treating Latency as a Last-Minute Concern

Latency gets attention only when users complain, after the architecture is locked in. By then the cheap fixes are gone and the expensive ones remain.

Why it happens: correctness ships first; speed is assumed to be tunable later.

The cost: retrofitting streaming, caching, and batching into a system not designed for them is far harder than building them in.

The fix: set a latency target during design, not after launch. Treat it as a requirement alongside correctness, as outlined in The Complete Guide to AI Inference and Latency.

The Pattern Behind All Seven

Step back and these mistakes share a single root: acting before measuring. Optimizing averages, testing on idle servers, swapping models blindly — each is a decision made without the data that would have pointed somewhere else. The trap is that every one of them feels productive. You are doing something, the dashboard looks plausible, and the real bottleneck sits untouched in the dark.

How to break the pattern

The corrective discipline is the same across all seven: instrument first, then act. Specifically:

Split latency into segments and report percentiles before you change anything.
Reproduce the problem under realistic load, so the number you optimize is the number users feel.
Name the single dominant cost, then fix that one thing and re-measure.

This is not glamorous, and it is slower in the first hour. It is dramatically faster over the first week, because you stop fixing the wrong problem. Teams that internalize this stop guessing and start converging. The diagnostic loop in A Framework for AI Inference and Latency formalizes exactly this sequence into repeatable stages.

A Quick Self-Audit

Run through these questions against your own system. An honest "no" to any of them points at one of the seven mistakes hiding in your stack:

Can you state your p99 TTFT under peak load right now, from a dashboard, without running a one-off test?
When you last made a feature faster, did you re-measure to confirm the dominant cost actually shrank?
Do you know your cache hit rate, or are you assuming caching helps without checking?
Is your latency target written down, tied to a percentile and a use case?

Most teams answer "no" to at least two. That is not a failure; it is a map of where the cheap wins are. The mistakes in this list are common precisely because they are easy to make and easy to miss — and just as easy to fix once you know they are there.

Frequently Asked Questions

Which mistake is the most common?

Optimizing the average and measuring on an idle server tie for first. Both produce flattering numbers that collapse in production. They are seductive because the dashboards look great right up until the support tickets arrive.

Is swapping to a smaller model ever the right move?

Yes — when you have confirmed that decode speed is the actual bottleneck and the smaller model holds acceptable quality. The mistake is doing it as a first reflex without diagnosis. Done deliberately, model right-sizing is one of the strongest levers you have.

How do I know if my context is bloated?

Log input token counts per request and look at the distribution. If many requests carry thousands of tokens the model barely uses, you have bloat. Try trimming and watch whether TTFT drops with no quality loss; usually it does.

Why is caching skipped so often?

It feels like extra moving parts, and teams underestimate how repetitive their traffic is. In practice a large share of requests are near-duplicates or share a fixed prompt prefix. Once you measure the hit rate, caching almost always justifies itself quickly.

Can perceived speed really substitute for real speed?

Up to a point. Streaming and instant typing indicators make a system feel responsive even when total time is unchanged. But they do not help batch jobs or fix a genuinely overloaded server. Use them alongside real optimizations, not instead of them.

Key Takeaways

Report and target percentiles (p95, p99), never the average.
Load test at realistic concurrency; idle-server benchmarks lie.
Diagnose the dominant cost before swapping models.
Measure and optimize time to first token, and stream tokens to win perceived speed.
Trim bloated context and cache aggressively — often the biggest wins.
Make latency a design requirement, not a post-launch scramble.

Read these as a diagnostic checklist. If you recognize three or more in your own setup, you have found your weekend project.

Mistake 1: Optimizing the Average

Teams report a "200 ms average response time" and call it fast. Then support tickets pile up. The reason is almost always the tail: while the median user gets 200 ms, the p99 user waits four seconds.

Why it happens: averages are the default metric in most dashboards and they feel intuitive.

The cost: the slowest one percent of requests generate a disproportionate share of complaints and churn, and you never see them.

The fix: report p50, p95, and p99 everywhere. Set targets against p95 or p99, not the mean.

Mistake 2: Measuring on an Idle Server

A benchmark with one request looks blazing fast. In production with fifty concurrent users it falls apart. Single-request testing hides queueing and batching dynamics entirely.

Why it happens: single-request tests are easy and the numbers look flattering.

The cost: capacity surprises in production, usually during your busiest hour.

The fix: always load test at realistic concurrency. The step-by-step process in A Step-by-Step Approach to AI Inference and Latency builds this in from the start.

Mistake 3: Swapping Models Blindly

The feature is slow, so someone switches to a smaller model. Sometimes it helps. Often it does nothing, because the bottleneck was queueing or context length, not the model itself.

Why it happens: model choice is the most visible knob, so it gets turned first.

The cost: wasted effort, plus a possible quality regression that hurts users without fixing speed.

The fix: diagnose the dominant cost before changing the model. Only swap when decode speed is genuinely the bottleneck.

Mistake 4: Ignoring Time to First Token

A team optimizes total generation time but leaves a long blank pause before the first token. Users perceive that pause as the system being broken, even when total time is fine.

Why it happens: total time is easier to think about than the streaming experience.

The fix

Mistake 5: Sending Bloated Context

Every request stuffs the full conversation history, a giant system prompt, and ten retrieved documents into the model. Prefill cost balloons and TTFT climbs with every extra token.

Why it happens: "more context is safer" feels true, so context only ever grows.

The cost: higher latency and higher token bills, often for context the model does not even use.

Mistake 6: No Caching

Identical or near-identical requests hit the model fresh every time. Common questions, repeated system prompts, and stable document sets all get recomputed needlessly.

Why it happens: caching feels like extra infrastructure and gets deferred.

The cost: you pay full inference latency and cost for work you already did.

The fix: cache full responses for repeated queries and use prompt-prefix caching so the static portion of prompts is not reprocessed. Caching is often the single biggest latency win available.

Mistake 7: Treating Latency as a Last-Minute Concern

Latency gets attention only when users complain, after the architecture is locked in. By then the cheap fixes are gone and the expensive ones remain.

Why it happens: correctness ships first; speed is assumed to be tunable later.

The cost: retrofitting streaming, caching, and batching into a system not designed for them is far harder than building them in.

The fix: set a latency target during design, not after launch. Treat it as a requirement alongside correctness, as outlined in The Complete Guide to AI Inference and Latency.

The Pattern Behind All Seven

How to break the pattern

The corrective discipline is the same across all seven: instrument first, then act. Specifically:

Split latency into segments and report percentiles before you change anything.
Reproduce the problem under realistic load, so the number you optimize is the number users feel.
Name the single dominant cost, then fix that one thing and re-measure.

A Quick Self-Audit

Run through these questions against your own system. An honest "no" to any of them points at one of the seven mistakes hiding in your stack:

Can you state your p99 TTFT under peak load right now, from a dashboard, without running a one-off test?
When you last made a feature faster, did you re-measure to confirm the dominant cost actually shrank?
Do you know your cache hit rate, or are you assuming caching helps without checking?
Is your latency target written down, tied to a percentile and a use case?

Frequently Asked Questions

Which mistake is the most common?

Is swapping to a smaller model ever the right move?

How do I know if my context is bloated?

Why is caching skipped so often?

Can perceived speed really substitute for real speed?

Key Takeaways

Report and target percentiles (p95, p99), never the average.
Load test at realistic concurrency; idle-server benchmarks lie.
Diagnose the dominant cost before swapping models.
Measure and optimize time to first token, and stream tokens to win perceived speed.
Trim bloated context and cache aggressively — often the biggest wins.
Make latency a design requirement, not a post-launch scramble.

Seven Latency Decisions That Feel Reasonable and Cost You

Mistake 1: Optimizing the Average

Mistake 2: Measuring on an Idle Server

Mistake 3: Swapping Models Blindly

Mistake 4: Ignoring Time to First Token

The fix

Mistake 5: Sending Bloated Context

Mistake 6: No Caching

Mistake 7: Treating Latency as a Last-Minute Concern

The Pattern Behind All Seven

How to break the pattern

A Quick Self-Audit

Frequently Asked Questions

Which mistake is the most common?

Is swapping to a smaller model ever the right move?

How do I know if my context is bloated?

Why is caching skipped so often?

Can perceived speed really substitute for real speed?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Seven Latency Decisions That Feel Reasonable and Cost You

Mistake 1: Optimizing the Average

Mistake 2: Measuring on an Idle Server

Mistake 3: Swapping Models Blindly

Mistake 4: Ignoring Time to First Token

The fix

Mistake 5: Sending Bloated Context

Mistake 6: No Caching

Mistake 7: Treating Latency as a Last-Minute Concern

The Pattern Behind All Seven

How to break the pattern

A Quick Self-Audit

Frequently Asked Questions

Which mistake is the most common?

Is swapping to a smaller model ever the right move?

How do I know if my context is bloated?

Why is caching skipped so often?

Can perceived speed really substitute for real speed?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?