Most teams attack inference latency by guessing. They swap models, fiddle with settings, and hope. This guide replaces guessing with a process: a sequential set of steps that takes you from "the feature feels slow" to "I know exactly which delay to fix and I have verified the fix worked."
You can run this whole sequence in a day or two. It does not require rewriting your stack. It requires discipline about measuring before changing and changing one thing at a time. Skip the discipline and you will end up exactly where you started, just more tired.
Work through the steps in order. Do not jump to step five because it sounds fun. The early steps tell you whether the later ones are even worth doing.
Step 1: Define the Target
Before measuring anything, decide what "fast enough" means for this specific feature. A target is a number tied to a percentile and a use case.
- Conversational chat: time to first token under 500 ms at p95.
- Inline autocomplete: full response under 150 ms at p95.
- Background summarization: total time under 10 seconds at p99.
Without a target you will optimize forever and never know when to stop. Write the number down before you touch anything.
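One way to make the target hard to ignore is to record it as data rather than prose. The sketch below is only an illustration: the `LatencyTarget` class and its field names are hypothetical, and the numbers are the three examples listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyTarget:
    """A latency goal: a use case, a metric, a percentile, and a threshold.

    Hypothetical helper, not part of any library.
    """
    use_case: str
    metric: str          # e.g. "ttft" or "total"
    percentile: int      # e.g. 95 for p95
    threshold_ms: float

    def is_met(self, observed_ms: float) -> bool:
        # Met when the observed percentile value is under the threshold.
        return observed_ms < self.threshold_ms


TARGETS = [
    LatencyTarget("conversational chat", "ttft", 95, 500),
    LatencyTarget("inline autocomplete", "total", 95, 150),
    LatencyTarget("background summarization", "total", 99, 10_000),
]
```

Keeping the target in code means the verification step later can check it mechanically instead of by eye.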
Step 2: Instrument the Pipeline
You cannot fix what you cannot see. Add timing around each segment of a request and log them separately:
- Network transit: from the moment the request leaves your app until it arrives at the inference server.
- Queue wait before processing begins.
- Time to first token.
- Inter-token latency during decode.
- Total end-to-end time.
Log the token counts too — input tokens and output tokens — because they explain most variation. If you only have one timer wrapping the whole thing, you are blind. This separation is the same anatomy described in The Complete Guide to AI Inference and Latency.
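As a rough sketch of per-segment timing, assuming your client exposes the response as an iterator of tokens: `RequestTrace` and `timed_stream` are hypothetical names, and the segment labels are placeholders. Queue wait is usually only visible from server-side metrics, so from the client you may be able to record only network, TTFT, inter-token latency, and total.

```python
import time
from contextlib import contextmanager


class RequestTrace:
    """Collects per-segment durations and token counts for one request."""

    def __init__(self):
        self.segments_ms = {}
        self.input_tokens = 0
        self.output_tokens = 0

    @contextmanager
    def segment(self, name):
        # perf_counter is monotonic, so it is suitable for short durations.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.segments_ms[name] = self.segments_ms.get(name, 0.0) + elapsed_ms


def timed_stream(token_iter, trace):
    """Wrap a token iterator and record TTFT, mean inter-token gap, and total.

    Gaps are measured between token arrivals, so they also include any time
    the consumer spends handling the previous token.
    """
    start = time.perf_counter()
    last = start
    gaps = []
    for i, token in enumerate(token_iter):
        now = time.perf_counter()
        if i == 0:
            trace.segments_ms["ttft"] = (now - start) * 1000.0
        else:
            gaps.append((now - last) * 1000.0)
        last = now
        trace.output_tokens += 1
        yield token
    if gaps:
        trace.segments_ms["inter_token_mean"] = sum(gaps) / len(gaps)
    trace.segments_ms["total"] = (time.perf_counter() - start) * 1000.0
```

A request handler would wrap the network call in `with trace.segment("network"):`, pass the response through `timed_stream`, and log `trace.segments_ms` together with the token counts as one record per request.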
Step 3: Collect Data Under Realistic Load
Run the feature at the concurrency you actually expect, not with a single test request. Latency behaves completely differently at fifty simultaneous users than at one.
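A minimal load generator can be sketched with `asyncio`, assuming your inference client is awaitable. Here `fake_request` is a stand-in that only sleeps; replace it with your real call. The function names and parameters are illustrative.

```python
import asyncio
import random
import time


async def fake_request(rng):
    """Stand-in for a real inference call; it only sleeps for a few ms."""
    await asyncio.sleep(rng.uniform(0.001, 0.005))


async def run_load(concurrency, total_requests, seed=0):
    """Issue total_requests calls with at most `concurrency` in flight.

    Returns per-request latencies in milliseconds. Timing starts once a
    request is actually sent, so time spent waiting for a free slot on the
    client side is not counted.
    """
    rng = random.Random(seed)
    limiter = asyncio.Semaphore(concurrency)
    latencies_ms = []

    async def one_call():
        async with limiter:
            start = time.perf_counter()
            await fake_request(rng)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)

    await asyncio.gather(*(one_call() for _ in range(total_requests)))
    return latencies_ms
```

For example, `asyncio.run(run_load(concurrency=50, total_requests=500))` approximates fifty simultaneous users. A dedicated load-testing tool is a reasonable alternative; the point is only that concurrency matches what you expect in production.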
Capture percentiles, not averages
Record p50, p95, and p99 for each segment. The average will hide the tail, and the tail is what your users complain about. A feature with a great mean and a terrible p99 usually has a queue problem you have not found yet.
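To see how a mean can hide the tail, here is a small sketch using the nearest-rank percentile definition (one of several common definitions; your metrics system may interpolate instead). The sample data is invented for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values):
    """Return the mean alongside p50, p95, and p99."""
    return {
        "mean": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
    }


# Invented data: 95 requests at 200 ms and 5 requests at 4000 ms.
samples = [200.0] * 95 + [4000.0] * 5
summary = summarize(samples)
# summary == {"mean": 390.0, "p50": 200.0, "p95": 200.0, "p99": 4000.0}
```

In this invented data set the mean of 390 ms looks healthy and even p95 sits at 200 ms, while one request in twenty takes four seconds. Only p99 exposes it.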
Step 4: Find the Dominant Cost
Now read the data and find which segment dominates. There is almost always one clear culprit. Common patterns:
- High TTFT, short prompt: likely queueing or cold start, not the model.
- High TTFT, long prompt: prefill cost; your context is too large.
- Slow per-token streaming: decode is memory-bandwidth-bound; the model is too big or unquantized.
- Spiky tail only: concurrency and batching configuration.
Do not move on until you can name the single biggest contributor. If two segments are close, attack the larger one first.
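The patterns above can be turned into a first-pass triage. This is a sketch under stated assumptions: the function names are hypothetical, and the thresholds `long_prompt_tokens` and `slow_token_ms` are illustrative guesses that you should replace with values that fit your model and hardware.

```python
def dominant_segment(p95_by_segment):
    """Return the segment with the largest p95 and its share of the summed p95s.

    Pass only non-overlapping segments (leave out "total"). Percentiles do not
    add up exactly, so treat the share as a rough indicator.
    """
    name = max(p95_by_segment, key=p95_by_segment.get)
    share = p95_by_segment[name] / sum(p95_by_segment.values())
    return name, share


def diagnose(ttft_p95_ms, ttft_target_ms, input_tokens, inter_token_p95_ms,
             long_prompt_tokens=1000, slow_token_ms=50):
    """Map the listed patterns to a likely cause (illustrative thresholds)."""
    if ttft_p95_ms > ttft_target_ms:
        if input_tokens >= long_prompt_tokens:
            return "prefill: context too large"
        return "queueing or cold start"
    if inter_token_p95_ms > slow_token_ms:
        return "decode: model too big or unquantized"
    return "no dominant cost found; inspect the tail and batching configuration"
```

For instance, `dominant_segment({"network": 40.0, "queue": 900.0, "prefill": 200.0, "decode": 60.0})` names `"queue"` with a share of 0.75. A rule-based helper like this only narrows the search; confirm the diagnosis against the raw per-segment data before acting on it.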
Step 5: Apply One Targeted Fix
Match the fix to the cause you identified. Resist the urge to do five things at once.
- For long-prompt prefill: trim context, cache the static prompt prefix, or use prompt compression.
- For slow decode: quantize the model to 8-bit or try a smaller model.
- For queueing: enable continuous batching or add capacity.
- For network: co-locate the app and inference server in the same region.
Each of these targets a specific failure mode. The mistakes catalog in 7 Common Mistakes with AI Inference and Latency shows what happens when teams apply the wrong fix to the wrong cause.
Step 6: Verify Against the Target
Re-run the exact same load test from step three and compare percentiles. Did the dominant segment shrink? Did the target get met? Did anything else regress — for example, did a smaller model hurt answer quality?
Verification is non-negotiable. Plenty of "optimizations" make one number better and another worse. If you did not re-measure, you do not actually know what you changed.
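A before-and-after comparison can be sketched as follows. `compare_runs` is a hypothetical helper, the 5% regression tolerance is an arbitrary choice, and the numbers in the usage note are invented. It checks latency only; answer quality needs its own evaluation.

```python
def compare_runs(before, after, target_metric, target_ms, tolerance=0.05):
    """Compare two {metric: p95_ms} dicts produced by the same load test.

    A metric counts as a regression when it grew by more than `tolerance`
    (5% by default) relative to the earlier run.
    """
    regressions = [
        name for name in before
        if name in after and after[name] > before[name] * (1 + tolerance)
    ]
    return {
        "target_met": after[target_metric] < target_ms,
        "improvement_ms": before[target_metric] - after[target_metric],
        "regressions": regressions,
    }
```

With `before = {"ttft": 3500.0, "inter_token": 30.0}` and `after = {"ttft": 550.0, "inter_token": 45.0}` against a 500 ms TTFT target, the report shows a 2950 ms improvement, a target that is still not met, and a regression in `inter_token`. That is exactly the kind of trade-off that goes unnoticed without re-measuring.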
Step 7: Add Perceived-Speed Wins
Once actual latency is acceptable, improve how fast it feels:
- Stream tokens so users see output immediately.
- Show a typing indicator within 100 ms so nothing feels frozen.
- Pre-warm connections to avoid cold-start spikes on the first request.
Streaming and a typing indicator do not lower the real numbers, and pre-warming only helps the first request, but all three lower the felt delay, which is what retains users. Perceived speed is half the battle and often the cheaper half.
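The first two items can be sketched in a few lines, assuming the response arrives as an iterator of tokens and that `write` is whatever function pushes text to your UI. Both names, the indicator text, and the carriage-return trick for clearing it are hypothetical.

```python
def render_stream(token_iter, write):
    """Show an indicator immediately, then emit each token as it arrives."""
    write("[typing...]")        # shown before the first token is available
    first = True
    for token in token_iter:
        if first:
            write("\r")         # placeholder for clearing the indicator
            first = False
        write(token)
```

The indicator is written before the first token is requested from the iterator, so the user sees a response even while the request is still queued or in prefill.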
A Worked Example of the Process
To make the steps concrete, walk through how they play out on a typical slow chatbot. The feature feels sluggish; users are dropping off. Here is the sequence in action.
- Target (step 1): TTFT under 500 ms at p95 for the chat path.
- Instrument (step 2): add timers and discover you have been logging only total time.
- Load test (step 3): at fifty concurrent users, p50 TTFT is 300 ms but p95 is 3.5 seconds.
- Isolate (step 4): the user message is short, so the long tail is mostly queueing, with a smaller contribution from a 1,500-token system prompt that is reprocessed on every call.
- Fix (step 5): enable continuous batching to address the queueing, re-measure, then cache the system prompt as a prefix.
- Verify (step 6): re-run the same load test; p95 TTFT drops from 3.5 seconds to 550 ms, a large improvement that still sits just above the 500 ms target.
- Perceived speed (step 7): add streaming, and the experience now feels instant even though some tail remains.
Notice you never swapped the model. The bottleneck was queueing and a repeated prompt, not decode speed. That is the entire reason step four exists — to stop you from optimizing the wrong thing.
What to Do When the First Fix Is Not Enough
Sometimes step six shows the dominant cost shrank but you still missed the target. That is normal. Fixing the biggest bottleneck reveals the next one, which was hidden behind it.
Loop, do not pile on
Return to step four with your fresh data and find the new dominant cost. Resist the urge to apply three more fixes at once; you will lose the ability to tell which one helped. The discipline that makes the first pass work makes the second pass work too. Most features need one or two iterations of this loop to land their target, and the framework version of this process is built around exactly that loop.
Frequently Asked Questions
How long does this whole process take?
Instrumenting and collecting data usually takes a day. Diagnosing and applying one fix takes another day, plus verification. The discipline of doing it in order saves far more time than it costs, because you stop chasing the wrong problem.
What if I do not control the model or server?
If you use a hosted API, you still control prompt length, output length, caching, streaming, and region selection. Instrument what you can see — at minimum TTFT and total time — and optimize the levers available to you. You have more control than you think.
Should I optimize before I have a target?
No. Without a target you cannot tell success from motion. A clear number — tied to a percentile and a use case — is what turns optimization from an endless activity into a finite task with a finish line.
Why measure under load instead of a single request?
Because real systems serve many requests at once, and queueing plus batching dynamics only appear under concurrency. A single-request test can look perfect while the system falls apart at peak. Load testing reveals the tail latency that drives real complaints.
What is the most common dominant cost?
In our experience it is either oversized context inflating prefill, or queueing under load because batching is not tuned. Both are fixable without changing the model, which is why steps two through four matter so much before you reach for a different model.
Key Takeaways
- Set a concrete latency target tied to a percentile and use case before doing anything.
- Instrument each request segment separately — network, queue, TTFT, decode, total.
- Always collect data under realistic concurrent load and report percentiles.
- Identify the single dominant cost before applying any fix.
- Apply one targeted fix at a time and verify against your target by re-running the same test.
- Finish with perceived-speed wins like streaming and instant typing indicators.