You have already done the obvious work. You cache, you stream, you cap output, you run a right-sized model. Your p95 is decent and your bills are under control. The further gains now live in the internals of how transformers actually serve tokens — the KV cache, the batching scheduler, the decode loop. This is where a well-tuned stack pulls away from a merely competent one, often doubling throughput on the same hardware.
This article assumes you know the fundamentals and want the depth: how the KV cache governs your memory ceiling, why continuous batching beats static batching, how speculative decoding cheats the sequential nature of generation, and where each technique breaks. These are the levers behind the trends in AI Inference and Latency: Trends and What to Expect in 2026, pulled apart so you can apply them deliberately.
The KV Cache Is Your Real Constraint
Transformer decode is usually memory-bound, not compute-bound: each step does very little arithmetic per byte it reads, and it has to read both the model weights and the key-value (KV) cache. The cache holds the attention keys and values of every prior token in GPU memory so they are not recomputed for each new token. It grows linearly with sequence length and with the number of concurrent requests.
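To make that growth concrete, here is a rough sizing sketch. The formula (keys plus values, per layer, per KV head, per token) is the usual back-of-envelope approximation; the model shape in the example is hypothetical and not taken from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a batch of equal-length sequences."""
    # Factor of 2: one tensor for keys and one for values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical shape: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
one_request = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=1)
sixteen_requests = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16)

print(one_request / 2**30)       # GiB for a single 8192-token request
print(sixteen_requests / 2**30)  # GiB for sixteen of them
```

Under these assumed numbers a single long request costs about 1 GiB of cache, and the cost scales linearly with both context length and batch size.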
Why It Dominates Everything
The KV cache, not the model weights, usually determines how many requests you can serve concurrently. When it fills, you must evict or queue, which is where tail latency spikes come from. Long contexts are expensive precisely because they balloon the KV cache. Understanding this reframes a lot of optimization: shortening context is not just about prefill speed, it is about fitting more concurrent users in memory.
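A small calculation shows why shorter contexts translate directly into more concurrent users. This is only a sketch: the GPU size, weight footprint, reserve, and per-token cache cost below are assumed values chosen for round numbers.

```python
GIB = 2**30


def max_concurrent_requests(gpu_bytes, weight_bytes, reserve_bytes,
                            kv_bytes_per_token, context_tokens):
    """Requests that fit after subtracting weights and a working reserve."""
    usable = gpu_bytes - weight_bytes - reserve_bytes
    if usable <= 0:
        return 0
    return usable // (kv_bytes_per_token * context_tokens)


# Assumed setup: 80 GiB GPU, 16 GiB of weights, 8 GiB held back for
# activations and fragmentation, 128 KiB of KV cache per token.
long_ctx = max_concurrent_requests(80 * GIB, 16 * GIB, 8 * GIB, 128 * 1024, 8192)
short_ctx = max_concurrent_requests(80 * GIB, 16 * GIB, 8 * GIB, 128 * 1024, 2048)

print(long_ctx, short_ctx)  # 56 224
```

With these assumptions, cutting the average context from 8192 to 2048 tokens quadruples the number of requests that fit, without touching the model.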
Paged Attention
Naive KV cache allocation reserves contiguous memory for the maximum possible sequence length, which wastes most of the reservation whenever real sequences are much shorter than the maximum. Paged attention allocates the cache in small fixed-size blocks, much as an operating system pages virtual memory, so memory is consumed only as sequences actually grow. The result is dramatically higher concurrency on the same GPU. If your serving framework supports it, enabling it is one of the largest throughput wins available.
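The allocation idea can be sketched with a toy block allocator. This is an illustration of the bookkeeping only, not the implementation of any serving framework; the class name, block size, and workload are all hypothetical.

```python
class PagedKVAllocator:
    """Toy allocator: each sequence owns a list of fixed-size cache blocks."""

    def __init__(self, total_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(total_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Make room for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.get(seq_id, [])
        if length == len(table) * self.block_size:  # existing blocks are full
            if not self.free_blocks:
                return False  # caller has to queue or preempt
            table = table + [self.free_blocks.pop()]
        self.block_tables[seq_id] = table
        self.lengths[seq_id] = length + 1
        return True

    def free(self, seq_id):
        """Return all blocks of a finished or preempted sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


# 64 blocks of 16 tokens = 1024 token slots in total.
alloc = PagedKVAllocator(total_blocks=64, block_size=16)

# Naive scheme: reserve the assumed maximum length (512 tokens) per sequence.
naive_capacity = (64 * 16) // 512

# Paged scheme: admit 40-token sequences until the pool runs dry.
admitted = 0
for seq_id in range(100):
    if all(alloc.append_token(seq_id) for _ in range(40)):
        admitted += 1
    else:
        alloc.free(seq_id)
        break

print(naive_capacity, admitted)  # 2 21
```

In this toy workload the same memory holds 2 sequences under worst-case reservation and 21 under block allocation, because each 40-token sequence only ever claims three blocks.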
Continuous Batching Beats Static Batching
Static batching groups a fixed set of requests, runs them together, and waits for the slowest to finish before starting the next batch. The problem is obvious: a batch is held hostage by its longest generation, while the slots of already-finished short requests sit idle and new requests wait in the queue.
Continuous batching (also called in-flight batching) solves this. As soon as any request in the batch completes, its slot is freed and a waiting request takes its place mid-flight. The GPU stays saturated, throughput rises sharply, and short requests are not penalized for sharing a batch with long ones. For mixed traffic — which is most real traffic — this is the difference between good and wasteful utilization, and it directly improves the cost model in The ROI of AI Inference and Latency.
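A tiny simulation makes the difference visible. It counts decode steps only and assumes every step costs the same regardless of how many slots are busy, which is a simplification; the request lengths are invented.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each fixed batch runs until its longest request has finished."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """A finished request frees its slot for the next waiting request at once."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1
        # Requests with one step left finish now and leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Mixed traffic: two long generations among six short ones, four slots.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)

print(static_steps, continuous_steps)  # 200 110
```

In the static case the second long request cannot start until the first batch drains; in the continuous case it starts as soon as the first short request completes, so the two long generations overlap.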
Speculative Decoding: Cheating Sequential Generation
Generation is inherently sequential — each token depends on the last — which caps how fast a single request can go. Speculative decoding breaks this limit.
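A small, fast "draft" model proposes several tokens ahead. The large target model then verifies them all in a single forward pass. When the draft guesses correctly, you get multiple tokens for the cost of one verification step; when it guesses wrong, the first rejected token is replaced by the target model's own choice and the rest of the draft is discarded, so correctness is preserved. With greedy decoding the output matches the target model alone token for token; with sampling, the standard accept/reject rule preserves the target model's output distribution. Either way you get the target model's output, only faster.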
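The propose-and-verify loop can be sketched for the greedy case. Both "models" below are toy next-token functions invented for illustration, and the verification loop only imitates what a real engine computes in one batched forward pass.

```python
def greedy_generate(next_token, prompt, max_new_tokens):
    """Plain autoregressive decoding: one model call per token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, target verification passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks all positions (one batched pass in a real engine).
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + accepted)
            accepted.append(expected)
            # Stop at the bonus token or at the first disagreement; in the
            # latter case the appended token is the target's correction.
            if i == k or draft[i] != expected:
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes


def target_next(tokens):
    # Toy target: counts upward modulo 10.
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    # Toy draft: agrees with the target except right after a 2.
    return 0 if tokens[-1] == 2 else (tokens[-1] + 1) % 10


baseline = greedy_generate(target_next, [0], 20)
spec_tokens, passes = speculative_generate(target_next, draft_next, [0], 20, k=4)

print(spec_tokens == baseline, passes)  # True 5
```

The toy draft is wrong some of the time, yet the output is the same 20 tokens the target alone would produce, reached in 5 target passes instead of 20.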
The trade-offs:
- It helps most on predictable, lower-entropy text and helps least on highly creative or unpredictable output.
- The draft model consumes memory and must be fast enough that its overhead does not eat the gains.
- Acceptance rate is the metric to watch; below a break-even point that depends on draft length and draft cost, the technique stops paying.
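A back-of-envelope model shows where that break-even point sits. It assumes each drafted token is accepted independently with probability alpha and that a verification pass costs about the same as a normal target step; both are simplifications, and the parameter names and example values are illustrative.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens produced per target pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    alpha; the accepted prefix is followed by one token from the target.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha, k, draft_cost_ratio):
    """Rough speedup versus plain decoding.

    draft_cost_ratio is the cost of one draft step relative to one target
    step (an assumed input you would measure on your own hardware).
    """
    return expected_tokens_per_pass(alpha, k) / (1 + k * draft_cost_ratio)


for alpha in (0.8, 0.5, 0.2):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost_ratio=0.1), 2))
```

With a draft that costs a tenth of a target step and four drafted tokens, this model predicts a clear win at an acceptance rate of 0.8 and a net slowdown at 0.2, which is why acceptance rate deserves a dashboard of its own.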
Quantization and the Quality Frontier
Quantization stores weights and sometimes activations at lower precision, shrinking memory footprint and speeding memory-bound decode. The advanced practitioner's job is finding the precision floor for their specific task.
- Moderate quantization is usually near-lossless and a clear win for latency and memory.
- Aggressive quantization can degrade quality unevenly — fine on easy inputs, noticeably worse on hard ones.
- The only reliable evaluation is your own task-specific test set, because benchmark scores hide task-level regressions.
Treat quantization as a quality experiment, not a free lunch. Measure on the tasks that matter, the way AI Inference and Latency: Best Practices That Actually Work recommends for any model change.
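To see where the precision goes, here is a minimal round trip through symmetric per-tensor quantization in pure Python. Production schemes are more elaborate (per-channel or per-group scales, calibration data); the weights below are made up.

```python
def quantize(weights, bits):
    """Symmetric per-tensor quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [v * scale for v in q]


def max_abs_error(weights, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


weights = [0.02, -0.5, 0.31, 1.27, -0.004]  # invented values
err8 = max_abs_error(weights, bits=8)
err4 = max_abs_error(weights, bits=4)

print(err8, err4)
```

The error per weight is bounded by half the scale, and the scale grows as the bit width shrinks, so small weights are the first to be rounded away. Whether that matters for your outputs is exactly what the task-specific test set is for.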
Routing, Cascades, and Adaptive Compute
The highest-leverage advanced pattern is not making one model faster — it is not running the big model when you do not need to.
Confidence-Based Cascades
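A fast small model answers first. If its confidence is high, you ship that answer. If low, you escalate to the large model. When most traffic is easy, most requests never touch the expensive model, so blended latency and cost drop. Quality holds only as far as the confidence signal is calibrated: a small model that is confidently wrong ships a bad answer, so validate the threshold on labeled traffic.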
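The routing logic itself is short. In this sketch both models are stub functions, and the confidence score and threshold are assumptions; in practice the score might come from token log-probabilities or a separate verifier, and the threshold from your own evaluation data.

```python
def cascade_answer(query, small_model, large_model, threshold=0.8):
    """Try the small model first; escalate when its confidence is too low."""
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"
    return large_model(query), "large"


def small_model(query):
    # Hypothetical stub: confident only on short queries.
    confidence = 0.95 if len(query.split()) <= 5 else 0.4
    return f"small:{query}", confidence


def large_model(query):
    # Hypothetical stub for the expensive model.
    return f"large:{query}"


easy = cascade_answer("What is 2 + 2?", small_model, large_model)
hard = cascade_answer(
    "Explain why this distributed lock can deadlock under partition",
    small_model, large_model,
)
print(easy[1], hard[1])  # small large
```

Returning the route alongside the answer makes it easy to log the escalation rate, which is the number that tells you whether the cascade is earning its complexity.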
Difficulty-Aware Reasoning Budgets
For reasoning models, allocate thinking tokens by query difficulty: near zero for trivial queries, generous for hard ones. A lightweight classifier in front of the model decides. This prevents the model from spending seconds reasoning about a question it could answer instantly.
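A sketch of the budgeting step, with a keyword heuristic standing in for the lightweight classifier. The tiers, token budgets, and marker words are invented for illustration; a real classifier would be trained on your own traffic.

```python
# Thinking-token budgets per difficulty tier (illustrative values only).
BUDGETS = {"trivial": 0, "moderate": 1024, "hard": 8192}

HARD_MARKERS = ("prove", "derive", "optimize", "debug")


def classify_difficulty(query):
    """Hypothetical heuristic standing in for a trained classifier."""
    text = query.lower()
    if any(marker in text for marker in HARD_MARKERS):
        return "hard"
    if len(text.split()) > 12:
        return "moderate"
    return "trivial"


def reasoning_budget(query, classifier=classify_difficulty):
    """Thinking tokens to allow for this query."""
    return BUDGETS[classifier(query)]


print(reasoning_budget("What is the capital of France?"))                    # 0
print(reasoning_budget("Prove that the sum of two even numbers is even."))   # 8192
```

The classifier is passed in as a parameter so the heuristic can be swapped for a real model without touching the budgeting code.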
Where These Techniques Break
Advanced optimization has sharp edges. Continuous batching can starve long requests under heavy load without fair scheduling. Speculative decoding adds latency instead of removing it when acceptance rates are low. Aggressive quantization fails silently on hard inputs. Cascades add a routing layer that can itself become a latency and failure point. Every technique here trades simplicity for speed, so instrument each one and keep the ability to turn it off. The failures these create are exactly the kind catalogued in The Hidden Risks of AI Inference and Latency.
Sequencing Advanced Optimizations
Knowing the techniques is half the job; knowing the order to apply them is the other half. Stacking everything at once produces a system you cannot debug, because when latency moves you cannot attribute the change. Apply advanced techniques one at a time, measuring after each.
A sensible order for a memory-bound serving setup: first enable paged attention and continuous batching, since these raise concurrency and throughput with the least quality risk. Next, add prefix caching, which reuses the KV cache of a prompt prefix shared across requests (a common system prompt, for example) instead of recomputing it. Only then consider quantization, treating it as a quality experiment on your own test set. Speculative decoding comes after that, gated on a healthy draft acceptance rate. Routing and cascades come last, because they add the most architectural surface area and the most new failure points.
The reason for this order is risk-adjusted return. Batching and paging give large, quality-safe gains and should be exhausted first. Each subsequent technique buys less, costs more complexity, and demands more careful evaluation. Most teams discover they reach their latency budget after the first two or three steps and never need the exotic tail of the list — which is exactly the right outcome. The art of advanced optimization is knowing when to stop, not how many techniques you can stack.
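The one-at-a-time discipline can be sketched as a small rollout loop. The `measure` callback, the quality tolerance, and every number in the fake measurements are assumptions for demonstration; in practice `measure` would run your load test and evaluation set.

```python
def staged_rollout(steps, measure, max_quality_drop=0.01):
    """Enable one optimization at a time and keep it only if it pays off.

    measure(enabled) is a hypothetical callback returning
    (p95_latency_ms, quality_score) for a list of enabled steps.
    """
    enabled = []
    latency, reference_quality = measure(enabled)
    log = []
    for step in steps:
        new_latency, new_quality = measure(enabled + [step])
        keep = (new_latency < latency
                and new_quality >= reference_quality - max_quality_drop)
        log.append((step, new_latency, new_quality, keep))
        if keep:
            enabled.append(step)
            latency = new_latency
    return enabled, log


# Made-up effects (latency delta in ms, quality delta), for demonstration only.
FAKE_EFFECTS = {
    "paged_attention": (-100, 0.0),
    "continuous_batching": (-150, 0.0),
    "prefix_caching": (-50, 0.0),
    "quantization": (-80, -0.05),
    "speculative_decoding": (-60, 0.0),
}


def fake_measure(enabled):
    latency = 800 + sum(FAKE_EFFECTS[s][0] for s in enabled)
    quality = 0.90 + sum(FAKE_EFFECTS[s][1] for s in enabled)
    return latency, quality


enabled, log = staged_rollout(list(FAKE_EFFECTS), fake_measure)
print(enabled)
```

Because each step is measured against the configuration that preceded it, the log attributes every latency change to exactly one technique, and a step that costs too much quality (quantization, in this fabricated data) is rejected without disturbing the rest.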
Frequently Asked Questions
Why is the KV cache more important than model size for concurrency?
Decode is memory-bound, and the key-value cache grows with sequence length and concurrent requests, consuming GPU memory on top of the fixed footprint of the weights. When the cache fills you must queue or evict, which causes tail-latency spikes. Managing it, through paged attention and shorter contexts, usually governs how many users you can serve at once.
What is the difference between static and continuous batching?
Static batching waits for the slowest request in a fixed batch before starting the next, wasting the GPU on finished short requests. Continuous batching frees each slot the moment its request completes and admits a waiting one mid-flight, keeping the GPU saturated and sharply raising throughput on mixed traffic.
Does speculative decoding change the model's output?
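No. The large target model verifies every drafted token, so with greedy decoding the final output matches the target model alone token for token, and with sampling the standard accept/reject rule preserves the target model's output distribution. You only gain speed when the draft guesses correctly; wrong guesses are replaced by the target model's own token with no loss of correctness.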
How far can I quantize before quality suffers?
Moderate quantization is usually near-lossless, while aggressive quantization degrades unevenly — fine on easy inputs, worse on hard ones. The only reliable test is your own task-specific evaluation set, because public benchmark scores hide task-level regressions.
When should I use a cascade instead of a single model?
When most of your traffic is easy and only a minority is hard. A fast small model handles the bulk and escalates low-confidence cases to a large model, dropping blended latency and cost while keeping quality close to the large model's, provided the confidence signal is well calibrated. The cost is an extra routing layer to instrument and maintain.
Key Takeaways
- The KV cache, not model weights, usually sets your concurrency ceiling; paged attention raises it.
- Continuous batching keeps the GPU saturated and beats static batching on mixed traffic.
- Speculative decoding accelerates generation without changing what the target model would produce, but depends on draft acceptance rate.
- Quantization is a quality experiment; find the precision floor on your own test set.
- Cascades and adaptive reasoning budgets avoid running the big model when you do not need it.
- Every advanced technique trades simplicity for speed — instrument it and keep an off switch.