Six Workloads, Six Very Different Latency Budgets

Latency advice stays abstract until you tie it to a real use case. A 500-millisecond target is excellent for a chatbot and catastrophic for code autocomplete. The right way to build intuition is to walk through concrete scenarios and see what made each one succeed or fall on its face.

This article runs through six distinct use cases, each with a different latency profile and a different dominant bottleneck. The point is not to memorize numbers but to internalize how the same underlying mechanics (prefill, decode, batching, caching) express themselves differently depending on what the product is trying to do.

For each, we name the constraint, the failure mode, and the fix that worked.

Use Case 1: Conversational Chatbot

A customer-support chat assistant. The user types a question and expects a human-paced reply.

What matters here

Time to first token dominates the experience. Users tolerate a steadily streaming answer that takes a few seconds total, but they will not tolerate a blank four-second pause before anything appears.

Failure mode: a team optimized total generation time, leaving TTFT at 1.8 seconds. Users perceived the bot as frozen and abandoned conversations.

What worked: streaming tokens, caching the large static system prompt as a prefix, and trimming conversation history to the last few turns. TTFT dropped under 400 ms and abandonment fell.
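The two cheapest parts of that fix, trimming history and measuring TTFT separately from total time, can be sketched as follows. This is a minimal illustration, not the team's code: `trim_history`, `measure_stream`, the six-turn default, and the injectable `clock` are all assumptions made for the example, and `tokens` stands in for whatever streaming iterator your model client returns.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def trim_history(turns: List[Dict[str, str]], max_turns: int = 6) -> List[Dict[str, str]]:
    """Keep only the most recent turns so the prompt (and prefill) stays small."""
    if max_turns <= 0:
        return []
    return turns[-max_turns:]


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, str]:
    """Consume a token stream and return (ttft_seconds, total_seconds, text).

    ttft_seconds is None if the stream produced no tokens at all.
    """
    start = clock()
    ttft: Optional[float] = None
    parts: List[str] = []
    for token in tokens:
        if ttft is None:
            # First token arrived: this is the number the user actually feels.
            ttft = clock() - start
        parts.append(token)
    total = clock() - start
    return ttft, total, "".join(parts)
```

Reporting the two numbers separately is the point: a change that improves total time while leaving TTFT untouched will show up immediately instead of hiding in an average.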

Use Case 2: Code Autocomplete

Inline suggestions in an editor that appear as the developer types.

This is the most latency-brutal use case in this article. Suggestions must arrive in roughly 100 milliseconds or they are useless — the developer has already typed past them.

A large model was too slow per token; the team moved to a small, specialized completion model.
Aggressive caching of common prefixes handled a surprising share of requests instantly.
Outputs were capped short, since suggestions are a few tokens, not paragraphs.

The lesson: when the latency budget is tiny, model size and output length are the knobs that matter most, and caching absorbs whatever repeats. There is no room for a slow first token.
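A rough sketch of the caching and output-cap ideas, under stated assumptions: `PrefixCache`, `suggest`, the 64-character key, and the 8-token cap are hypothetical choices for illustration, and `complete` stands in for a call to whatever small completion model is in use.

```python
from collections import OrderedDict
from typing import Callable, Optional


class PrefixCache:
    """Small LRU cache keyed on the trailing characters of the editor buffer."""

    def __init__(self, capacity: int = 1024, key_chars: int = 64) -> None:
        self.capacity = capacity
        self.key_chars = key_chars
        self._store: "OrderedDict[str, str]" = OrderedDict()

    def _key(self, buffer: str) -> str:
        # Only the text right before the cursor is used as the cache key.
        return buffer[-self.key_chars:]

    def get(self, buffer: str) -> Optional[str]:
        key = self._key(buffer)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, buffer: str, suggestion: str) -> None:
        key = self._key(buffer)
        self._store[key] = suggestion
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry


def suggest(
    buffer: str,
    cache: PrefixCache,
    complete: Callable[[str, int], str],
    max_tokens: int = 8,
) -> str:
    """Return a cached suggestion if present, else call the model with a short cap."""
    cached = cache.get(buffer)
    if cached is not None:
        return cached
    suggestion = complete(buffer, max_tokens)
    cache.put(buffer, suggestion)
    return suggestion
```

A cache hit skips inference entirely, which is the only way a request finishes "instantly"; the `max_tokens` cap bounds decode time on every miss.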

Use Case 3: Real-Time Fraud Scoring

A model scores transactions as they happen, inside a payment flow with a hard time budget.

Here the constraint is a strict deadline, not user perception. The score must return within, say, 50 milliseconds or the transaction proceeds without it. Tail latency is the enemy because a slow p99 means missed fraud or blocked legitimate payments.

What worked: a small, quantized model with no streaming (the output is a single score), tight batching windows, and co-locating the model with the transaction service to kill network latency. The team monitored p99 obsessively, since the average was never the risk.
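To see why the average hides the risk, here is a small sketch that compares mean, p99, and deadline miss rate over a set of latency samples. The function names and the nearest-rank percentile definition are choices made for this example, and the 50 ms default simply mirrors the illustrative deadline above.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def deadline_miss_rate(samples: Sequence[float], deadline_ms: float) -> float:
    """Fraction of requests that came back after the deadline."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(1 for s in samples if s > deadline_ms) / len(samples)


def summarize(samples: Sequence[float], deadline_ms: float = 50.0) -> Dict[str, float]:
    """Report the three numbers side by side so the tail is never hidden."""
    return {
        "mean_ms": mean(samples),
        "p99_ms": percentile(samples, 99),
        "miss_rate": deadline_miss_rate(samples, deadline_ms),
    }
```

With 97 requests at 20 ms and 3 at 80 ms, the mean is 21.8 ms, comfortably inside a 50 ms deadline, while p99 is 80 ms and 3% of transactions go unscored.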

Use Case 4: Voice Assistant

Speech in, speech out, where conversational rhythm makes latency painfully obvious.

Voice is unforgiving because humans expect sub-second turn-taking. The pipeline stacks delays: speech-to-text, then inference, then text-to-speech. Each adds latency, and they compound.

What worked

Streaming partial transcripts into the model before the user finishes speaking.
Streaming the model's tokens into the speech synthesizer so audio begins before generation completes.
Right-sizing the language model so per-token decode stayed fast.

Overlapping the stages — rather than running them strictly in sequence — was the breakthrough. The diagnosis discipline in A Step-by-Step Approach to AI Inference and Latency is what surfaced where the compounding delay actually lived.
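One way to picture the overlap between inference and synthesis is a generator that hands each completed sentence to the speech synthesizer while the model is still decoding the rest. This is a simplified sketch: `sentence_chunks` is a hypothetical helper, splitting on sentence-ending punctuation is an assumption, and real synthesizers may prefer other chunk boundaries.

```python
from typing import Iterable, Iterator, List


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group a token stream into sentences, yielding each one as soon as it ends.

    Because this is a generator, the first sentence is available to the
    consumer before the rest of the stream has been produced.
    """
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip().endswith((".", "!", "?")):
            yield "".join(buffer).strip()
            buffer = []
    # Flush any trailing text that did not end with punctuation.
    tail = "".join(buffer).strip()
    if tail:
        yield tail
```

A consumer would loop over `sentence_chunks(model_stream)` and pass each chunk to the synthesizer, so time to first audio depends on the first sentence rather than on the full reply.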

Use Case 5: Overnight Batch Classification

Classifying millions of documents in a nightly job. No human is waiting.

This is the inverse of every prior case. Latency per item is irrelevant; throughput per dollar is everything. Optimizing for low single-request latency here would actively waste money.

What worked: large batch windows to saturate the GPU, no streaming, and a model sized for accuracy rather than speed. The team measured items per hour and cost per million, not milliseconds. Trying to apply chatbot-style latency tuning would have lowered throughput and raised cost.
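The metrics the team tracked are simple arithmetic, sketched below together with a basic batching helper. The names `batched` and `throughput_report` and the hourly GPU price parameter are assumptions for illustration; the right batch size depends on the model and hardware and is not specified in the case.

```python
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield fixed-size batches; the final batch may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def throughput_report(
    items_done: int,
    elapsed_seconds: float,
    gpu_cost_per_hour: float,
) -> Tuple[float, float]:
    """Return (items_per_hour, cost_per_million_items)."""
    if items_done <= 0 or elapsed_seconds <= 0:
        raise ValueError("items_done and elapsed_seconds must be positive")
    items_per_hour = items_done / elapsed_seconds * 3600
    cost_per_million = gpu_cost_per_hour / items_per_hour * 1_000_000
    return items_per_hour, cost_per_million
```

For example, 500,000 documents in 30 minutes on hardware priced at 2.00 per hour works out to about one million items per hour and about 2.00 per million items. Milliseconds never appear in the report.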

Use Case 6: Retrieval-Augmented Question Answering

An internal knowledge assistant that retrieves company documents and answers questions grounded in them. This case is interesting because the latency comes from a pipeline, not a single model call.

Where the time actually goes

The naive assumption is that the language model is the slow part. Instrumentation often says otherwise. The pipeline is: embed the query, search the vector store, rank results, assemble context, then run inference. Each stage adds latency, and the retrieval stages can rival or exceed the model.

One team found their vector search was returning twenty documents, inflating prefill enough to dominate TTFT.
Cutting retrieval to the five most relevant documents dropped context size and TTFT together, with no measurable quality loss.
Caching embeddings for repeated queries removed an entire stage for common questions.

The lesson generalizes: in any multi-stage AI feature, instrument every stage, because the bottleneck is frequently not the model. This is exactly the kind of misdiagnosis the common mistakes guide warns against — blaming the model when the real cost lives upstream.
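Per-stage instrumentation does not need much machinery. The sketch below uses a context manager to record how long each stage takes; `StageTimer` and the stage names are hypothetical, and a production setup would more likely emit these timings to a tracing or metrics system.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self.clock = clock
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = self.clock()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            elapsed = self.clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self) -> str:
        """Name of the stage with the largest accumulated time."""
        if not self.durations:
            raise ValueError("no stages recorded")
        return max(self.durations, key=lambda name: self.durations[name])
```

Usage follows the pipeline order: wrap each step in `with timer.stage("embed"):`, `with timer.stage("vector_search"):`, and so on, then inspect `timer.durations` or `timer.slowest()` before deciding what to optimize.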

What These Cases Have in Common

Six very different products, and a few patterns repeat in every one:

The right metric is set by the user, not the infrastructure. Each case has a different target because each user has a different tolerance.
The bottleneck is often not where you first look. Retrieval, queueing, and context size masquerade as "the model is slow" again and again.
Streaming helps where a human watches and is irrelevant where one does not. Voice and chat lean on it; fraud scoring and batch jobs ignore it.

The mechanics — prefill, decode, batching, caching — never change. What changes is which of them dominates, and that is entirely a function of what the product is for.

Frequently Asked Questions

Why do these use cases need such different targets?

Because the human (or system) expectation differs. A developer waiting on autocomplete has a 100 ms tolerance; an overnight batch job has no per-item latency requirement at all. The mechanics are the same, but which metric matters (TTFT, p99, or throughput) changes completely with the use case.

Which use case is hardest to optimize?

Code autocomplete and voice are the toughest. Autocomplete has the smallest latency budget, leaving almost no room. Voice compounds delays across multiple stages, so even modest per-stage latency adds up to a sluggish feel unless you overlap the stages.

When should I not stream?

When the output is a single value, like a fraud score or a classification label, there is nothing to stream — you need the whole answer. Streaming also adds no value to batch jobs where no human watches the output appear.

How does caching help across these cases?

It helps most where requests repeat or share structure — chatbots with a fixed system prompt, autocomplete with common code prefixes. It helps least in fraud scoring, where each transaction is unique, though even there the model weights and warm connections act as a kind of cache.

Can one model serve multiple use cases?

Sometimes, but the latency profiles often conflict. A model tuned for batch throughput will disappoint in interactive chat, and vice versa. It is usually cleaner to right-size a model per use case than to force one model to do everything, as argued in AI Inference and Latency: Best Practices That Actually Work.

Key Takeaways

Chatbots live or die by time to first token; stream and cache the system prompt.
Code autocomplete has a ~100 ms budget; small models, short outputs, and prefix caching are what make it reachable.
Real-time fraud scoring is about p99 and a hard deadline, not averages.
Voice assistants compound delays across stages; overlap the stages to win.
Batch classification ignores per-item latency entirely and optimizes throughput per dollar.
Retrieval-augmented QA often spends its time upstream of the model; instrument every stage and trim the retrieved context.
The same mechanics express differently per use case — match the metric to the product.

Agency Script Editorial
