A latency win that lives in one engineer's head is a liability. You've seen it: someone tunes the inference stack, the dashboards look great, and then they go on vacation and the p99 quietly climbs because nobody else knows which lever moved it. Performance work that isn't written down isn't really done.
This article is about the unglamorous part: turning inference latency work from a hero effort into a repeatable, hand-off-able workflow. Not a one-time optimization, but a documented loop with inputs, steps, checkpoints, and a clear definition of done. The goal is that a new hire can run it in week two and get the same result the senior engineer would.
A workflow is different from a playbook. The AI Inference and Latency Playbook tells you which move to make when a trigger fires. The workflow is the standing process that runs whether or not anything is on fire, so the system stays fast by default instead of being rescued.
Start by Defining the Unit of Work
Every repeatable process needs a clear unit. For latency, the unit is a single inference path: one route, one model, one prompt template, under one traffic profile. Don't try to optimize "the app." Optimize the checkout-assistant path, or the document-summary path, one at a time.
What goes into the workflow's intake
For each path you put through the workflow, capture a small intake record before touching anything:
- The latency target (time to first token, or TTFT, and p95 end-to-end) and where it came from.
- The current baseline measured under realistic load.
- The prompt template and its stable-versus-variable parts.
- The model and provider, plus any routing rules already in place.
This intake is what makes the work hand-off-able. Without a written baseline, the next person can't tell whether they improved anything or just got lucky with quiet traffic.
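One way to keep the intake consistent across paths is to give it a fixed shape. The sketch below is a minimal illustration, not a prescribed schema: the class name, field names, and `gap_ms` helper are all hypothetical, and the example values are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntakeRecord:
    """Written intake for one inference path (all field names are illustrative)."""
    path_name: str                       # e.g. "checkout-assistant"
    ttft_target_ms: float                # time-to-first-token target
    p95_target_ms: float                 # end-to-end p95 target
    target_source: str                   # where the target came from
    baseline_p95_ms: float               # measured under realistic load
    baseline_conditions: str             # load profile used for the baseline
    prompt_stable_parts: tuple[str, ...]
    prompt_variable_parts: tuple[str, ...]
    model: str
    provider: str
    routing_rules: tuple[str, ...] = ()

    def gap_ms(self) -> float:
        """Baseline p95 minus target p95; zero or negative means the target is met."""
        return self.baseline_p95_ms - self.p95_target_ms


# Placeholder values for illustration only.
record = IntakeRecord(
    path_name="checkout-assistant",
    ttft_target_ms=600.0,
    p95_target_ms=2500.0,
    target_source="placeholder: product requirement",
    baseline_p95_ms=3100.0,
    baseline_conditions="placeholder: replayed production traffic sample",
    prompt_stable_parts=("system instructions",),
    prompt_variable_parts=("cart contents", "user question"),
    model="placeholder-model",
    provider="placeholder-provider",
)
```

Freezing the record is a deliberate choice here: the baseline should not be edited after the fact, only superseded by a new record.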
The Five-Stage Loop
The workflow itself is a loop you run per path. Keep the stages fixed so the process is teachable.
Stage 1: Measure
Capture the baseline under load that looks like production, including long prompts and concurrent requests. The most common mistake is measuring with a single warm request and declaring a number. Measure p50, p95, and p99, and record the test conditions so the result is reproducible. If you can't reproduce the baseline, you don't have one.
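As a rough sketch of the summary step, the function below computes p50, p95, and p99 from a list of recorded latencies using the nearest-rank definition. It assumes the samples were already collected under production-like load; the function names are illustrative, and other percentile definitions (such as interpolation) give slightly different numbers, so whichever one you pick should be recorded with the test conditions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples_ms):
    """Return the three percentiles the workflow tracks."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# With samples 1..100 ms, each percentile equals its own rank.
print(summarize(list(range(1, 101))))  # {'p50': 50, 'p95': 95, 'p99': 99}
```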
Stage 2: Diagnose
Decompose the number. Is the time in the queue, in TTFT, or in generation? A long TTFT points to prompt size, cold starts, or routing. Long generation points to output length. High tail latency with a fine median points to capacity or a bad route. Junior engineers tend to skip this step and jump straight to fixes, which is why diagnosis has to be an explicit stage.
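The decision rules above can be written down so that diagnosis is not left to intuition. This is a minimal sketch under stated assumptions: the three phase timings are measured separately (with TTFT counted after the request leaves the queue), the dominant phase is simply the largest one, and the `tail_ratio_limit` of 3.0 is an arbitrary illustrative threshold, not a recommended value.

```python
def diagnose(queue_ms, ttft_ms, generation_ms, p50_ms, p99_ms, tail_ratio_limit=3.0):
    """Return a list of suspected causes for one path's latency."""
    phases = {"queue": queue_ms, "ttft": ttft_ms, "generation": generation_ms}
    hints = {
        "queue": "time is in the queue: requests wait before the model starts",
        "ttft": "long TTFT: check prompt size, cold starts, or routing",
        "generation": "long generation: check output length",
    }
    # The phase with the largest share of time gets looked at first.
    dominant = max(phases, key=phases.get)
    findings = [hints[dominant]]
    # A tail far above the median is a separate signal from the phase split.
    if p99_ms > tail_ratio_limit * p50_ms:
        findings.append("high tail with a fine median: check capacity or a bad route")
    return findings


for finding in diagnose(queue_ms=50, ttft_ms=200, generation_ms=1800,
                        p50_ms=1000, p99_ms=5000):
    print(finding)
```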
Stage 3: Change one lever
Apply exactly one change: cache the prefix, cap output tokens, switch routes, adjust concurrency. One lever per pass is non-negotiable, because two simultaneous changes make the result uninterpretable. The discipline here is the same one that separates real engineering from guessing.
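The one-lever rule is easy to enforce mechanically if each pass records its configuration before and after. The sketch below assumes the levers are stored as a flat dictionary; the key names (`prefix_cache`, `max_output_tokens`, `route`, `concurrency`) are hypothetical stand-ins for whatever your stack exposes.

```python
def changed_levers(before, after):
    """Names of the settings whose values differ between two configs."""
    keys = before.keys() | after.keys()
    return sorted(k for k in keys if before.get(k) != after.get(k))


def assert_single_lever(before, after):
    """Raise unless exactly one setting changed; return its name."""
    changed = changed_levers(before, after)
    if len(changed) != 1:
        raise ValueError(f"expected exactly one changed lever, got {changed}")
    return changed[0]


# Hypothetical configs for one pass of the loop.
before = {"prefix_cache": False, "max_output_tokens": 1024, "route": "primary", "concurrency": 8}
after = dict(before, prefix_cache=True)
print(assert_single_lever(before, after))  # prefix_cache
```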
Stage 4: Verify
Re-run the Stage 1 measurement under the same conditions and compare. Confirm the latency improved and, critically, that quality did not regress. A faster wrong answer is a failure. Pair every latency check with a quality sample, drawing from the same evaluation set each time.
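A verification gate can encode the rule that quality is checked first and speed second. The following is a sketch, assuming quality is summarized as a single score on the fixed evaluation set where higher is better; the type, function name, and thresholds are illustrative, and real quality checks are often richer than one number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    p95_ms: float         # end-to-end p95 from the standard measurement
    quality_score: float  # mean score on the fixed evaluation set, higher is better


def verify(before, after, min_gain_ms=0.0, max_quality_drop=0.0):
    """Fail on any quality regression first, then require a latency gain."""
    quality_drop = before.quality_score - after.quality_score
    if quality_drop > max_quality_drop:
        return "fail: quality regressed"
    latency_gain = before.p95_ms - after.p95_ms
    if latency_gain <= min_gain_ms:
        return "fail: no latency improvement"
    return "pass"


# A faster run with lower quality still fails.
print(verify(RunResult(3100, 0.92), RunResult(1800, 0.85)))  # fail: quality regressed
```

Checking quality before latency is the design choice that matters: a large speed gain can never mask a regression.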
Stage 5: Document and ship
Write down what changed, the before and after numbers, and the trade-off accepted. Ship behind a flag if the change is risky. This record is the deliverable that makes the next pass faster, and it feeds directly into A Step-by-Step Approach to AI Inference and Latency for anyone learning the moves.
The documentation also closes the loop on accountability. A change with a named owner, a recorded baseline, and a verified result can be trusted by the rest of the team without a meeting. A change with none of those gets re-litigated every time someone new looks at the dashboard, which is how teams burn weeks re-deriving decisions they already made.
Make It Hand-Off-Able
A workflow that only the author can run is just a long memory. Three things make it transferable.
Write the runbook, not the result
The deliverable isn't "we got TTFT to 600ms." It's the runbook that someone else can follow to get there again. Include the commands, the dashboard links, the test harness, and the decision rules. If a step says "tune the batch size," it must also say how to know which value is right.
Standardize the measurement harness
If everyone measures differently, nobody can compare results. Pick one load-testing tool and one fixed prompt set per path, and require that all before/after numbers come from it. This single decision removes most of the arguments about whether a change helped.
Define done explicitly
A path exits the workflow when it hits its target, the change is documented, and quality is verified. Not when the engineer is tired of it. A written definition of done is what keeps the loop from becoming an endless tinkering session, a failure mode covered in 7 Common Mistakes with AI Inference and Latency (and How to Avoid Them).
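The exit criteria are simple enough to express as a checklist function. This sketch assumes the target is stated as an end-to-end p95; the function name and messages are illustrative.

```python
def unmet_exit_criteria(p95_ms, p95_target_ms, documented, quality_verified):
    """Return the exit criteria a path still fails; an empty list means done."""
    unmet = []
    if p95_ms > p95_target_ms:
        unmet.append("latency target not met")
    if not documented:
        unmet.append("change not documented")
    if not quality_verified:
        unmet.append("quality not verified")
    return unmet


# Fast enough and verified, but the change was never written down.
print(unmet_exit_criteria(2400, 2500, documented=False, quality_verified=True))
```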
Wire It Into the Calendar
A workflow that isn't scheduled won't run. Attach it to events you already have:
- On every new inference path: run the full five-stage loop before launch.
- On prompt template changes: re-run Stage 1 and Stage 4, because prompt edits change latency more than people expect.
- Monthly: sweep your top three paths by traffic and confirm no drift.
- Before any traffic-doubling launch: run the loop under projected load, not current load.
The monthly sweep is the cheapest insurance you'll buy. Latency drifts as prompts grow, traffic shifts, and providers change behavior. Catching a 200ms regression in a scheduled review beats discovering it in a customer escalation.
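The triggers can be kept as a small lookup table next to the runbook, so nobody has to remember which stages each event requires. This is a sketch with hypothetical names; treating the monthly sweep as a measure-only pass is an assumption, on the reasoning that a full loop only starts if the measurement shows drift.

```python
STAGES = ("measure", "diagnose", "change_one_lever", "verify", "document")

# Which stages each calendar trigger requires (names are illustrative).
TRIGGER_STAGES = {
    "new_inference_path": STAGES,                  # full loop before launch
    "prompt_template_change": ("measure", "verify"),
    "monthly_sweep": ("measure",),                 # assumption: measure, then escalate on drift
    "traffic_doubling_launch": STAGES,             # full loop under projected load
}


def top_paths_by_traffic(requests_per_path, n=3):
    """Pick the n busiest paths for the monthly sweep."""
    return sorted(requests_per_path, key=requests_per_path.get, reverse=True)[:n]


# Placeholder traffic counts for illustration.
traffic = {"checkout-assistant": 50, "document-summary": 30, "search": 10, "onboarding": 5}
print(top_paths_by_traffic(traffic))
```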
The launch trigger specifies projected load for a reason. A path that's fast at a thousand requests an hour can fall off a cliff at ten thousand, because queueing delay grows nonlinearly as utilization approaches capacity. Wiring the workflow into launches is how you surface that cliff before customers find it. If you can't generate projected load in a test, model it conservatively and assume the tail will be worse than your point estimate.
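The nonlinearity can be illustrated with the textbook M/M/1 queue, where mean time in the system is 1 / (service rate - arrival rate). Real inference serving does not follow this model exactly (batching, multiple replicas, and variable output lengths all break its assumptions), so treat this as a sketch of the shape of the curve, not a capacity planner. The capacity of 12,000 requests per hour is a made-up figure.

```python
def mm1_mean_latency_s(arrivals_per_hour, capacity_per_hour):
    """Mean time in system (waiting plus service) for an M/M/1 queue, in seconds."""
    if arrivals_per_hour >= capacity_per_hour:
        raise ValueError("arrival rate must stay below capacity or the queue grows without bound")
    return 3600.0 / (capacity_per_hour - arrivals_per_hour)


CAPACITY = 12_000  # hypothetical requests per hour one path can serve

for load in (1_000, 10_000, 11_500):
    latency = mm1_mean_latency_s(load, CAPACITY)
    print(f"{load:>6} req/h -> {latency:.2f} s mean latency")
```

Under these assumptions, ten times the traffic gives roughly five and a half times the mean latency even though the service time per request never changed, and the last few percent of utilization cost far more than the first ninety.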
Frequently Asked Questions
How is a workflow different from just optimizing latency once?
Optimization is a single event; a workflow is a repeatable loop with intake, stages, and a definition of done. The workflow assumes latency will drift and builds in scheduled re-runs, so you maintain performance instead of rescuing it. It also makes the work transferable, which one-off optimization never is.
Who should own the workflow?
One engineer owns the process and the runbook, but anyone trained on it can execute a pass. The owner's real job is keeping the measurement harness and runbook current, not personally running every loop. That separation is what lets the team scale beyond the original expert.
What's the smallest version I can start with?
Pick your single highest-traffic inference path, write down its baseline, and run the five stages once with full documentation. That one documented loop is more valuable than ten undocumented optimizations, because it becomes the template for everything else.
How do I keep quality from regressing as I chase speed?
Make Stage 4 verification mandatory and pair every latency measurement with a quality sample from a fixed evaluation set. If quality drops, the change fails regardless of the speed gain. This is the guardrail that keeps the workflow honest.
Does this work for managed APIs or only self-hosted?
It works for both. The levers in Stage 3 differ (managed APIs lean on routing, caching, and output shaping, while self-hosting adds batching and concurrency), but the five-stage loop and the documentation discipline are identical.
Key Takeaways
- A latency win that isn't documented isn't done; build a workflow, not a hero effort.
- Define the unit of work as a single inference path and capture a written baseline first.
- Run the fixed five-stage loop: measure, diagnose, change one lever, verify, document.
- Make it hand-off-able with a runbook, a standardized harness, and an explicit definition of done.
- Schedule the loop against launches and a monthly sweep so latency never silently drifts.