Plain Answers to TTFT, Bills, and Stubborn GPUs

If you work near AI products, the same questions about inference and latency come up again and again — in standups, in budget reviews, in incident channels. What does TTFT actually mean? Why is our bill so high? Why did the bigger GPU not help? This article answers those high-frequency questions directly, in plain language, with enough specificity to act on.

It is organized as a structured Q&A grouped by theme: the basics, performance, cost, scaling, and quality trade-offs, followed by a short FAQ and key takeaways. Treat it as a reference you can jump into. For systematic depth on any one thread, the linked guides go further, starting with The Complete Guide to AI Inference and Latency.

The Basics

What is AI inference, exactly?

Inference is the act of running a trained model to produce an output — generating a response, a classification, an embedding. Training builds the model once; inference uses it on every request, forever. For most teams, inference is the recurring cost and the latency users actually feel.

What is latency in the context of AI?

Latency is the time between sending a request and getting a usable result. For generative models it is not one number but several: time to first token, the speed at which subsequent tokens arrive, and total end-to-end time. Each has a different cause and a different fix.

What does "time to first token" mean and why does everyone mention it?

Time to first token (TTFT) is how long you wait before the model produces anything at all. It is the number users perceive as responsiveness — a fast TTFT feels snappy even if the full answer takes a while. It is dominated by prompt length and how busy the system is. The full metric set is in How to Measure AI Inference and Latency.
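As a rough illustration of how these numbers can be pulled apart, the sketch below times any iterable of tokens and reports TTFT, the mean gap between tokens, and total time. The function name `measure_stream` and the injectable `clock` parameter are assumptions made for this example, not part of any particular client library; in practice the iterable would be whatever streaming response object your client returns.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Consume a token stream and report TTFT, mean inter-token gap, and total time."""
    start = clock()
    arrivals = []
    for _ in tokens:
        # Record the arrival time of every token as it is yielded.
        arrivals.append(clock())
    if not arrivals:
        raise ValueError("stream produced no tokens")

    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft_s": arrivals[0] - start,            # wait before anything appears
        "mean_inter_token_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "total_s": arrivals[-1] - start,          # end-to-end time
        "tokens": float(len(arrivals)),
    }
```

Start the clock when the request is sent, not when the response object is created, or queueing time will be hidden from the TTFT figure.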

Performance Questions

Why is my inference slow even though my model is small?

Small models can still be slow for reasons that have nothing to do with the model: a bloated system prompt inflating prefill, an uncapped output letting the model ramble, or queueing under load. Diagnose which component is slow before assuming the model is the problem. Often the fix is trimming the prompt or capping output.

Why didn't a faster GPU fix my latency?

Because token generation is usually limited by memory bandwidth rather than raw compute, and because your real bottleneck may be the prompt, the output length, or queueing, none of which more compute addresses. This is one of the most common and expensive misconceptions, covered in AI Inference and Latency: Myths vs Reality.
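A back-of-the-envelope calculation makes the bandwidth point concrete. Under the simplifying assumption that every generated token requires reading all model weights from GPU memory once, the single-stream decode rate cannot exceed bandwidth divided by weight size. The model size and bandwidth below are hypothetical round numbers, and the estimate ignores KV cache reads and batching, so treat it as a ceiling rather than a prediction.

```python
def decode_ceiling_tokens_per_s(
    params: float,
    bytes_per_param: float,
    bandwidth_gb_per_s: float,
) -> float:
    """Rough upper bound on single-stream decode speed when memory-bound."""
    weight_bytes = params * bytes_per_param
    return bandwidth_gb_per_s * 1e9 / weight_bytes


# Hypothetical 7B-parameter model on a GPU with 1000 GB/s of memory bandwidth.
fp16 = decode_ceiling_tokens_per_s(7e9, 2, 1000)   # 16-bit weights
int8 = decode_ceiling_tokens_per_s(7e9, 1, 1000)   # 8-bit weights
print(f"fp16 ceiling: {fp16:.1f} tok/s, int8 ceiling: {int8:.1f} tok/s")
```

Note that compute throughput does not appear in the formula at all: in this regime only more bandwidth or fewer bytes per token moves the ceiling.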

How do I make responses feel faster without making them actually faster?

Turn on streaming so tokens appear as they generate instead of all at once at the end. Total time is unchanged, but perceived latency drops sharply, which is what users judge. For interactive features this is often the single highest-impact change you can make.

What is a good latency target?

It depends on the use case. Interactive chat wants TTFT under about a second at p95 and a smooth token rate. Inline autocomplete needs a much faster first token but fewer tokens. Background batch jobs do not care about TTFT at all and should optimize throughput and cost. Set a budget per use case rather than chasing one universal number.
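One way to make per-use-case budgets explicit is to write them down as data and check observations against them. The sketch below is only an illustration: the chat TTFT value follows the "about a second" guidance, while the autocomplete and token-rate figures, the `LatencyBudget` type, and the `violations` helper are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LatencyBudget:
    ttft_p95_s: Optional[float]        # None: TTFT is not a goal for this use case
    min_tokens_per_s: Optional[float]  # None: no streaming-rate requirement


# Illustrative values only; tune them to your own product.
BUDGETS: Dict[str, LatencyBudget] = {
    "chat": LatencyBudget(ttft_p95_s=1.0, min_tokens_per_s=20.0),
    "autocomplete": LatencyBudget(ttft_p95_s=0.2, min_tokens_per_s=None),
    "batch": LatencyBudget(ttft_p95_s=None, min_tokens_per_s=None),
}


def violations(use_case: str, ttft_p95_s: float, tokens_per_s: float) -> List[str]:
    """Return a human-readable list of budget breaches for one use case."""
    budget = BUDGETS[use_case]
    problems: List[str] = []
    if budget.ttft_p95_s is not None and ttft_p95_s > budget.ttft_p95_s:
        problems.append(
            f"ttft_p95 {ttft_p95_s:.2f}s exceeds budget {budget.ttft_p95_s:.2f}s"
        )
    if budget.min_tokens_per_s is not None and tokens_per_s < budget.min_tokens_per_s:
        problems.append(
            f"token rate {tokens_per_s:.1f}/s is below budget {budget.min_tokens_per_s:.1f}/s"
        )
    return problems
```

The batch entry deliberately has no latency limits, which encodes the point that throughput and cost, not TTFT, are what matter there.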

Cost Questions

Why is my inference bill so high?

Inference cost scales with token volume: input tokens plus output tokens, each multiplied by its per-token price (output is often priced higher than input), times your request count. The usual culprits are a long system prompt charged on every request, uncapped or verbose output, and defaulting to an oversized model. Each is a direct, controllable lever. The cost model is laid out in The ROI of AI Inference and Latency.
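The arithmetic is simple enough to sketch. Many providers price input and output tokens at different rates, so the function below takes two prices. All prices and volumes here are made-up illustrative figures, not any vendor's actual rates.

```python
def monthly_cost_usd(
    requests: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Monthly spend given average tokens per request and per-million-token prices."""
    input_cost = requests * input_tokens * usd_per_m_input
    output_cost = requests * output_tokens * usd_per_m_output
    return (input_cost + output_cost) / 1_000_000


# Hypothetical workload: 1M requests/month at $1 per million input tokens
# and $4 per million output tokens.
before = monthly_cost_usd(1_000_000, 1500, 400, 1.0, 4.0)
# Same workload after trimming the prompt to 500 tokens and capping output at 200.
after = monthly_cost_usd(1_000_000, 500, 200, 1.0, 4.0)
print(f"before: ${before:,.0f}  after: ${after:,.0f}")
```

Because the prompt is paid for on every request, a fixed reduction in prompt length is multiplied by the full request count.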

What is the cheapest way to cut inference cost?

Three quality-neutral moves: trim the system prompt, cap and tighten the output, and right-size the model to the task. If you self-host, raising GPU utilization through batching cuts cost per request further with no quality trade-off. These are the foundation of Getting Started with AI Inference and Latency.

Should I self-host or use an API?

APIs are simpler and cheaper at low to moderate volume because you pay only for what you use, with no idle cost. Self-hosting wins at high, steady volume where you can keep GPUs well-utilized, and when data must stay inside your boundary. The crossover depends on your volume and utilization, not on a fixed rule.
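The crossover can be estimated with a few lines of arithmetic. The sketch below compares a per-token API price against a fixed monthly GPU cost and then asks what utilization the GPUs would need to reach that volume. Every number is a hypothetical placeholder, and the model ignores engineering time, redundancy, and other operating costs that a real comparison should include.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def self_host_monthly_usd(gpu_count: int, usd_per_gpu_hour: float) -> float:
    """Fixed cost of keeping GPUs running all month, regardless of traffic."""
    return gpu_count * usd_per_gpu_hour * HOURS_PER_MONTH


def breakeven_tokens_per_month(
    gpu_count: int, usd_per_gpu_hour: float, api_usd_per_m_tokens: float
) -> float:
    """Monthly token volume at which the API bill equals the GPU bill."""
    return self_host_monthly_usd(gpu_count, usd_per_gpu_hour) / api_usd_per_m_tokens * 1e6


def required_utilization(
    tokens_per_month: float, gpu_count: int, peak_tokens_per_s_per_gpu: float
) -> float:
    """Fraction of peak GPU throughput needed to serve the given volume."""
    capacity = gpu_count * peak_tokens_per_s_per_gpu * HOURS_PER_MONTH * 3600
    return tokens_per_month / capacity


# Hypothetical: one GPU at $2/hour versus an API at $2 per million tokens.
volume = breakeven_tokens_per_month(1, 2.0, 2.0)
util = required_utilization(volume, 1, peak_tokens_per_s_per_gpu=1000)
print(f"break-even: {volume / 1e6:.0f}M tokens/month at {util:.0%} utilization")
```

If your traffic cannot sustain the required utilization, the idle hours are what make self-hosting more expensive than it first appears.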

Scaling Questions

Why does my system slow down under load when it was fast in testing?

Concurrency fills the KV cache — the memory holding attention state for in-flight requests — which forces queueing or eviction and spikes tail latency. Gentle, uniform test traffic never reproduces this. Load-test with realistic concurrency and long prompts, and watch p99, not just the median.
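To see why concurrency and long prompts interact so badly, it helps to size the KV cache. For a standard transformer, each token stores one key and one value vector per layer per KV head. The model shape and memory budget below are hypothetical, chosen only to give round numbers; real servers also reserve memory for weights and activations.

```python
def kv_bytes_per_token(
    layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2
) -> int:
    """Bytes of KV cache one token occupies (factor 2 covers keys and values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(
    kv_budget_bytes: int, tokens_per_request: int, bytes_per_token: int
) -> int:
    """How many requests of a given length fit in the KV memory budget."""
    return kv_budget_bytes // (tokens_per_request * bytes_per_token)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
budget = 20 * 2**30  # assume 20 GiB left for KV cache after loading weights

print(per_token // 1024, "KiB per token")
print(max_concurrent_requests(budget, 4096, per_token), "requests at 4K context")
print(max_concurrent_requests(budget, 16384, per_token), "requests at 16K context")
```

Quadrupling the context length cuts the number of requests that fit by four, which is why a test with short prompts can look healthy while production traffic queues.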

How do I serve more requests on the same hardware?

Continuous batching keeps the GPU saturated by admitting new requests as others finish, and paged attention lets you fit more concurrent requests in memory. Together they can raise throughput substantially without new hardware, though the size of the gain depends on how much your request lengths vary, so measure it on your own workload. These internals are covered in Advanced AI Inference and Latency.
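A toy simulation shows where the continuous-batching gain comes from. It counts decode steps for a queue of requests with different output lengths under two policies: static batching, where a batch runs until its longest request finishes, and continuous batching, where a freed slot is refilled immediately. This is a deliberately simplified model (one token per request per step, no prefill cost, no memory limits, and paged attention is not modelled), not how any specific serving engine is implemented.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each batch occupies the GPU until its longest request is done."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """Freed slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Mixed workload: a few long outputs among many short ones, two batch slots.
workload = [10, 2, 2, 2, 10, 2, 2, 2]
print("static:", static_batching_steps(workload, slots=2))
print("continuous:", continuous_batching_steps(workload, slots=2))
```

The more uneven the output lengths, the more slot time static batching wastes waiting on the longest request, and the larger the benefit of refilling slots early.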

How do I roll this out across a whole team, not just one service?

Make the efficient path the default through a shared client library and serving layer that bake in streaming, caching, output caps, and instrumentation. Set a published latency budget per use case and govern it continuously. The playbook is in Rolling Out AI Inference and Latency Across a Team.

Quality and Trade-Off Questions

Will optimizing for speed make my answers worse?

It can, if you are careless. Aggressive quantization, an over-eager smaller model, or a too-tight output cap can degrade quality — usually unevenly, hurting hard inputs while easy ones still pass. The protection is to run a hard-case evaluation set on every optimization and watch production quality signals like retries and thumbs-down, never accepting a speed win without confirming quality held.

How do I decide between a faster model and a more accurate one?

By the use case and a real evaluation. For autocomplete and high-volume routine tasks, speed usually wins because users value immediacy and the task is easy. For high-stakes reasoning, accuracy wins. The most robust answer is a cascade: a fast model handles the easy majority and escalates the hard minority to the accurate model, capturing both.
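A cascade can be sketched in a few lines. The version below assumes each model call returns an answer together with some confidence score between 0 and 1; how that score is produced (a verifier, a classifier, a format check) is left open and is an assumption of this example. The two stub models and the 0.8 threshold are hypothetical stand-ins.

```python
from typing import Callable, Tuple

# A model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(
    prompt: str, fast: Model, accurate: Model, threshold: float = 0.8
) -> Tuple[str, str]:
    """Try the fast model first; escalate when its confidence is too low."""
    answer, confidence = fast(prompt)
    if confidence >= threshold:
        return answer, "fast"
    answer, _ = accurate(prompt)
    return answer, "accurate"


# Hypothetical stubs: the fast model is only confident on short prompts.
def fast_stub(prompt: str) -> Tuple[str, float]:
    return "quick answer", 0.95 if len(prompt) < 40 else 0.3


def accurate_stub(prompt: str) -> Tuple[str, float]:
    return "careful answer", 0.99


print(cascade("What is TTFT?", fast_stub, accurate_stub))
print(cascade("Explain the trade-offs of quantizing a 70B model for legal review.",
              fast_stub, accurate_stub))
```

Escalated requests pay for both calls, so the cascade only wins when the fast model handles a large share of traffic; track the escalation rate alongside quality.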

Is there a downside to caching responses?

The main risk is staleness — serving an outdated answer when the underlying data has changed. This is manageable with sensible cache invalidation tied to data freshness. There is also a privacy consideration, since caching can inadvertently store sensitive content, so be deliberate about what you cache. Handled carefully, caching is one of the safest, highest-leverage optimizations available.
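Both risks can be addressed in the cache itself. The sketch below is a minimal in-memory response cache with a time-to-live for staleness and an explicit opt-out for sensitive content. The class name `TTLResponseCache` and the `sensitive` flag are assumptions made for illustration; a production cache would more likely sit in a shared store and invalidate on data changes rather than on time alone.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class TTLResponseCache:
    """In-memory response cache with expiry and a sensitive-content opt-out."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Hash the prompt so the raw text is not kept as the cache key.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        key = self._key(prompt)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at > self._ttl_s:
            del self._entries[key]  # expired: treat as a miss
            return None
        return response

    def put(self, prompt: str, response: str, sensitive: bool = False) -> None:
        if sensitive:
            return  # never store responses flagged as sensitive
        self._entries[self._key(prompt)] = (self._clock(), response)
```

Choose the TTL from how quickly the underlying data changes, not from how much hit rate you would like.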

What is the one habit that prevents most latency mistakes?

Measuring before and after every change. The recurring root cause of wasted effort is optimizing without a baseline — buying hardware that does not help, or shipping a change that quietly hurt quality. A disciplined measure-change-measure loop catches both, and it is the cheapest insurance in the entire practice.

Frequently Asked Questions

What is the difference between inference and training?

Training builds the model once, at high upfront cost. Inference runs the finished model on every request, indefinitely. For teams that consume rather than build models, inference is the cost that matters and the latency users experience.

Which latency metric should I watch first?

Time to first token at the 95th percentile (p95). It is what users feel as responsiveness, and the percentile captures the common unlucky experience that an average hides. If you track one number, track that one.
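A small example shows how an average can hide the slow tail. The sketch uses the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values), and the sample data is invented: ninety fast requests and ten slow ones.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented TTFT samples in seconds: 90 fast requests and 10 slow ones.
ttft = [0.3] * 90 + [3.0] * 10

print(f"mean: {mean(ttft):.2f}s")
print(f"p50:  {percentile(ttft, 50):.2f}s")
print(f"p95:  {percentile(ttft, 95):.2f}s")
```

Here the mean suggests a comfortable half-second wait, while one request in ten actually waits three seconds, and only the p95 makes that visible.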

Is it cheaper to use a smaller model?

Almost always, and usually faster too, because cost and latency both scale with model size and token volume. For most tasks a right-sized smaller model passes your real quality bar, with the large model reserved as an escalation path for hard cases.

Why does streaming help if it does not reduce total time?

Because users judge perceived latency, not total latency. Streaming shows tokens as they generate, making the response feel immediate even though the full answer finishes at the same moment it would have without streaming. It is highly effective for interactive chat and much less so for models that reason at length before answering, because the first visible token still arrives late.

When should I move from an API to self-hosting?

When your volume is high and steady enough to keep GPUs well-utilized, or when data residency requires it. At low or bursty volume, APIs are cheaper because you avoid paying for idle capacity. The crossover is a function of your utilization, not a universal threshold.

Key Takeaways

Inference is the recurring cost and felt latency; training is a one-time upfront cost.
Time to first token at p95 is the metric to watch first.
Slow inference is often the prompt, output length, or queueing — not the model or GPU.
Cost scales with token volume; trimming prompts, capping output, and right-sizing the model are the cheap wins.
Systems slow under load because the KV cache fills; load-test realistically and watch p99.
Scale across a team by making the efficient path the default and governing a per-use-case budget.

Agency Script Editorial

