Most teams size their AI compute by feel. They pick a GPU that sounds powerful, hope it works, and adjust in a panic when it does not. This guide replaces guesswork with a sequence. Follow the steps in order and you will arrive at a defensible hardware decision instead of a hopeful one.
The process works for any workload — a chatbot, a batch summarization job, a fine-tuning run. What changes between them are the numbers you plug in, not the method. Each step builds on the previous one, so resist the urge to skip ahead to "which GPU should I buy." That question is the last step, not the first.
Have a notepad ready. You will be writing down a handful of numbers as you go.
Step 1: Define the Workload Precisely
Before any hardware talk, write down exactly what you are doing in one sentence.
- Are you training, fine-tuning, or running inference? These have radically different needs.
- What model size will you use, in billions of parameters?
- Is the work interactive (a user waiting for a reply) or batch (jobs that can run overnight)?
Interactive work cares about latency; batch work cares about throughput and cost. Naming this now prevents you from optimizing the wrong thing later. If any of these terms are unfamiliar, pause and read the beginner's guide first.
Step 2: Calculate Memory Requirements
This is the gate. If the model does not fit in VRAM, nothing else matters.
- Start with parameter count in billions.
- For inference, multiply by 2 for FP16 (2 bytes per parameter), 1 for 8-bit, or 0.5 for 4-bit quantization to get the base memory in GB.
- For full training, multiply by 16 to 20 instead — gradients and optimizer states are heavy.
- Add 25 percent overhead for the KV cache and framework.
Example: a 13B model for inference at FP16 is 13 × 2 = 26 GB, plus 25 percent = roughly 33 GB. That rules out a 24 GB card and points to a 48 GB one. Write your number down.
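The arithmetic in this step can be captured in a few lines of Python. This is a minimal sketch of the rule of thumb, not a precise memory model. The function names and the `GB_PER_BILLION` table are illustrative, the 8-bit entry assumes one byte per parameter (half of FP16), and applying the 25 percent overhead to the training range is one reading of the list above.

```python
# Rule-of-thumb VRAM estimate from Step 2. All names here are illustrative.

# GB of memory per billion parameters for inference weights.
GB_PER_BILLION = {
    "fp16": 2.0,  # 2 bytes per parameter
    "int8": 1.0,  # 1 byte per parameter (half of FP16)
    "int4": 0.5,  # half a byte per parameter
}

OVERHEAD = 0.25  # 25 percent for the KV cache and framework


def estimate_vram_gb(params_billion: float, gb_per_billion: float,
                     overhead: float = OVERHEAD) -> float:
    """Return base memory plus overhead, in GB."""
    base_gb = params_billion * gb_per_billion
    return base_gb * (1.0 + overhead)


def estimate_inference_gb(params_billion: float, precision: str = "fp16") -> float:
    """Inference estimate for one of the precisions in GB_PER_BILLION."""
    return estimate_vram_gb(params_billion, GB_PER_BILLION[precision])


def estimate_training_gb(params_billion: float) -> tuple[float, float]:
    """Low and high estimates using the 16x and 20x full-training multipliers."""
    return (estimate_vram_gb(params_billion, 16.0),
            estimate_vram_gb(params_billion, 20.0))


if __name__ == "__main__":
    print(f"13B FP16 inference: {estimate_inference_gb(13, 'fp16'):.1f} GB")  # 32.5 GB
    low, high = estimate_training_gb(13)
    print(f"13B full training: {low:.0f} to {high:.0f} GB")  # 260 to 325 GB
```

Treat the output as a starting estimate to write down, and expect Step 7 to correct it.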
Step 3: Estimate Throughput Needs
Now decide how fast the work must happen.
For interactive workloads
Write down your target tokens per second per user and the number of concurrent users; multiplying them gives the aggregate throughput you need. A chatbot serving 50 simultaneous users needs far more aggregate throughput than a demo serving one.
For batch workloads
Estimate the total tokens to process and note your deadline. If you must summarize a million documents by morning, work backward from that: divide the total tokens by the seconds available to get the required throughput. The complete guide explains the throughput math in more depth.
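Both cases reduce to simple arithmetic, sketched below. The function names are illustrative, and the sample figures (20 tokens per second per user, 1,000 tokens per document, an 8-hour window) are placeholder assumptions, not recommendations.

```python
def interactive_throughput(tokens_per_sec_per_user: float,
                           concurrent_users: int) -> float:
    """Aggregate tokens per second needed to serve all users at once."""
    return tokens_per_sec_per_user * concurrent_users


def batch_throughput(total_tokens: float, deadline_hours: float) -> float:
    """Tokens per second needed to finish the job before the deadline."""
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    return total_tokens / (deadline_hours * 3600.0)


if __name__ == "__main__":
    # 50 concurrent users at an assumed 20 tokens/s each.
    print(interactive_throughput(20, 50))  # 1000

    # One million documents at an assumed 1,000 tokens each, due in 8 hours.
    total_tokens = 1_000_000 * 1_000
    print(round(batch_throughput(total_tokens, 8)))  # 34722
```

Write down whichever number applies; it feeds the tier choice in Step 5.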
Step 4: Choose Precision and Optimization First
Counterintuitively, optimize before you size hardware, because optimization changes the answer.
- Apply quantization. Relative to FP16, 8-bit halves memory at little cost in quality, and 4-bit quarters it.
- Use batching for throughput workloads to keep the GPU busy.
- Consider a smaller model if quality holds; it is the biggest cost lever of all.
Run these decisions through Step 2 again. Often a workload that looked like it needed a datacenter GPU now fits on a consumer card.
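A quick way to re-run Step 2 is to compute the footprint at every precision and see which ones fit a given card. The sketch below assumes the Step 2 rule of thumb (2, 1, and 0.5 GB per billion parameters for FP16, 8-bit, and 4-bit, plus 25 percent overhead); the function names are illustrative.

```python
OVERHEAD = 0.25  # 25 percent for the KV cache and framework
GB_PER_BILLION = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def footprint_by_precision(params_billion: float) -> dict[str, float]:
    """Estimated VRAM in GB at each precision, overhead included."""
    return {
        precision: params_billion * gb * (1.0 + OVERHEAD)
        for precision, gb in GB_PER_BILLION.items()
    }


def precisions_that_fit(params_billion: float, card_gb: float) -> list[str]:
    """Precisions whose estimated footprint fits within card_gb."""
    footprints = footprint_by_precision(params_billion)
    return [p for p, gb in footprints.items() if gb <= card_gb]


if __name__ == "__main__":
    for precision, gb in footprint_by_precision(13).items():
        print(f"13B at {precision}: {gb:.1f} GB")
    print(precisions_that_fit(13, card_gb=24))  # ['int8', 'int4']
```

For a 13B model this shows FP16 exceeding a 24 GB card while both quantized options fit, which is the kind of shift this step is meant to surface.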
Step 5: Map Requirements to a GPU Tier
With memory and throughput numbers in hand, pick a tier.
- Under 24 GB and modest throughput: consumer card.
- 24–48 GB: workstation or prosumer card.
- Above 48 GB or multi-GPU training: datacenter accelerator.
Match the tier to your worst-case requirement, not your average. Our tools roundup names specific options per tier.
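The tier boundaries above translate directly into a lookup. This is a sketch of the mapping only: the function name `gpu_tier` is illustrative, it sizes on memory and the multi-GPU flag, and it leaves the "modest throughput" judgment to you.

```python
def gpu_tier(worst_case_vram_gb: float, multi_gpu_training: bool = False) -> str:
    """Map a worst-case VRAM estimate to one of the three tiers in Step 5."""
    if multi_gpu_training or worst_case_vram_gb > 48:
        return "datacenter accelerator"
    if worst_case_vram_gb >= 24:
        return "workstation or prosumer card"
    return "consumer card"


if __name__ == "__main__":
    # Size on the worst case across the scenarios you expect, not the average.
    estimates_gb = [8.1, 19.5, 32.5]
    print(gpu_tier(max(estimates_gb)))  # workstation or prosumer card
```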
Step 6: Decide Buy, Rent, or API
Now the financial step.
- Estimate sustained utilization — what fraction of the day the GPU will actually work.
- Below ~50 percent, rent cloud GPUs or use a managed API.
- Above ~50 percent sustained, owning may pay off; run the crossover math.
Be honest about utilization. Most teams overestimate it badly, then pay for idle hardware.
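One simple form of the crossover math is to find the utilization at which renting costs as much as owning over the hardware's useful life. The sketch below is a simplified model under stated assumptions: it counts only purchase price and a per-hour running cost for owned hardware, ignores resale value, staffing, and financing, and every price in the example is a placeholder rather than a real quote.

```python
HOURS_PER_YEAR = 8760


def break_even_utilization(purchase_price: float,
                           rent_per_hour: float,
                           own_running_cost_per_hour: float,
                           lifetime_years: float) -> float:
    """Fraction of all hours the GPU must be busy for owning to match renting.

    Returns float('inf') when renting is never more expensive per hour
    than running owned hardware, so owning cannot pay off.
    """
    saving_per_busy_hour = rent_per_hour - own_running_cost_per_hour
    if saving_per_busy_hour <= 0:
        return float("inf")
    break_even_hours = purchase_price / saving_per_busy_hour
    return break_even_hours / (lifetime_years * HOURS_PER_YEAR)


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    utilization = break_even_utilization(
        purchase_price=10_000,
        rent_per_hour=1.00,
        own_running_cost_per_hour=0.10,
        lifetime_years=3,
    )
    print(f"Owning pays off above {utilization:.0%} sustained utilization")
```

With these placeholder inputs the crossover lands a little above 40 percent; your own prices will move it, which is why the 50 percent figure is a rule of thumb rather than a constant.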
Step 7: Validate With a Small Run
Never commit at full scale untested.
- Run a small version of the job and measure actual VRAM and tokens per second.
- Compare to your estimates and adjust.
- Only then provision the full environment.
A thirty-minute test run routinely catches a sizing error that would have cost far more to discover in production. This habit is reinforced in our best practices guide.
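A test run needs only a timer and a comparison against your estimates. The harness below is a sketch: `generate` stands in for whatever call produces tokens in your stack and is assumed to return the number of tokens generated, and `fake_generate` is a simulated placeholder so the example runs anywhere. If the workload runs on PyTorch, peak allocated memory can be read with `torch.cuda.max_memory_allocated()` and compared the same way.

```python
import time
from typing import Callable


def measure_tokens_per_second(generate: Callable[[str], int],
                              prompt: str, runs: int = 3) -> float:
    """Time several generate() calls and return the observed tokens per second."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def relative_error(estimated: float, measured: float) -> float:
    """Signed fraction by which the measurement exceeds the estimate."""
    return (measured - estimated) / estimated


def fake_generate(prompt: str) -> int:
    """Simulated model call: sleeps briefly and reports 64 tokens."""
    time.sleep(0.01)
    return 64


if __name__ == "__main__":
    tps = measure_tokens_per_second(fake_generate, "Summarize this document.")
    print(f"Measured throughput: {tps:.0f} tokens/s")

    # Compare an estimated 33 GB against a hypothetical measured 36.3 GB.
    print(f"VRAM estimate off by {relative_error(33.0, 36.3):+.0%}")
```

A positive error on memory means the estimate was too low, which is the direction that causes out-of-memory failures in production.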
Step 8: Set Up Cost Controls Before Going Live
Sizing correctly is necessary but not sufficient. Without controls, even a well-sized deployment leaks money.
- Attach an auto-shutdown timer to every rented instance so nothing runs idle.
- Use spot or preemptible capacity for any work that can be interrupted and resumed, such as batch jobs and training.
- Set a budget alert that notifies you before spend crosses a threshold, not after.
These take minutes to configure and prevent the single most common source of overspend: capacity that keeps billing after the useful work has stopped. Treat them as part of provisioning, not an afterthought.
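The decision logic behind an idle shutdown and a budget alert is small enough to sketch. In practice you would usually configure both through your cloud provider's own tooling; the code below only illustrates the rules, and the `CostPolicy` class, its field names, and the default thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostPolicy:
    idle_limit_minutes: float = 30.0  # shut down after this much idle time
    monthly_budget: float = 1000.0    # in your billing currency
    alert_fraction: float = 0.8       # alert at 80 percent of budget


def should_shut_down(policy: CostPolicy, idle_minutes: float) -> bool:
    """True once an instance has been idle for the configured limit."""
    return idle_minutes >= policy.idle_limit_minutes


def should_alert(policy: CostPolicy, spend_to_date: float) -> bool:
    """True once spend reaches the alert threshold, before the budget is exhausted."""
    return spend_to_date >= policy.alert_fraction * policy.monthly_budget


if __name__ == "__main__":
    policy = CostPolicy()
    print(should_shut_down(policy, idle_minutes=45))  # True
    print(should_alert(policy, spend_to_date=650.0))  # False
    print(should_alert(policy, spend_to_date=820.0))  # True
```

Setting `alert_fraction` below 1.0 is what makes the alert fire before the threshold is crossed rather than after.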
Step 9: Instrument and Review
The final step turns a one-time decision into a durable one.
- Add a dashboard showing real utilization and throughput for the workload.
- Review it on a regular cadence — weekly is reasonable for a live service.
- Re-run this whole process whenever the model, traffic, or budget changes materially.
Compute requirements are not static. A model swap, a traffic surge, or a longer context window can all invalidate yesterday's sizing. Instrumentation is what tells you when to revisit, so you are never surprised by a bill again. The same discipline underpins the repeatable model in our framework guide.
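A review can be made mechanical by comparing what the dashboard reports against the numbers you sized for. The sketch below flags any metric that has drifted beyond a tolerance; the function name, the metric names, the 20 percent default, and the sample values are all illustrative assumptions.

```python
def drifted_metrics(sized_for: dict[str, float],
                    observed: dict[str, float],
                    tolerance: float = 0.2) -> list[str]:
    """Return names of metrics whose observed value differs from the
    sizing assumption by more than the given fraction."""
    drifted = []
    for name, planned in sized_for.items():
        actual = observed.get(name)
        if actual is None or planned == 0:
            continue  # nothing to compare against
        if abs(actual - planned) / planned > tolerance:
            drifted.append(name)
    return drifted


if __name__ == "__main__":
    # Hypothetical sizing assumptions and a week of observed averages.
    sized_for = {"peak_vram_gb": 8.1, "tokens_per_sec": 500, "utilization": 0.30}
    observed = {"peak_vram_gb": 8.4, "tokens_per_sec": 900, "utilization": 0.32}
    print(drifted_metrics(sized_for, observed))  # ['tokens_per_sec']
```

Any non-empty result is a cue to re-run the process from Step 1 with the new numbers.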
A Worked Example, Start to Finish
To make the sequence concrete, here is the whole process applied to one workload.
Suppose you are building an internal assistant on a 13B model for 25 concurrent users. In Step 1 you classify it as interactive inference. In Step 2, 13B at FP16 is 26 GB plus overhead, around 33 GB, which is too big for a 24 GB card. Step 3 sets the throughput target for 25 concurrent users. In Step 4 you quantize to 4-bit, which drops the base to 6.5 GB and the total with overhead to roughly 8 GB, and you batch requests across the 25 users to keep one GPU busy. Re-running Step 2 confirms the model now fits a 24 GB card with headroom, so Step 5 points to a consumer card. Step 6, with sporadic daytime use, points to renting rather than owning. Step 7's test run confirms real VRAM and throughput. Steps 8 and 9 add a shutdown timer and a dashboard. One careful pass, one cheap GPU, no surprises.
Frequently Asked Questions
What if my memory estimate is borderline?
Round up and add a buffer, or quantize one step further to create headroom. A model that barely fits will fail the moment context length grows, so leave margin rather than running at the edge.
Should I optimize before or after choosing hardware?
Before. Quantization and model choice can change which GPU tier you need entirely. Sizing hardware first and optimizing later means you buy capacity you do not use.
How do I estimate utilization if I have no usage data yet?
Make a conservative guess, choose a rentable option, and measure real utilization for a week or two. Switch to owned hardware only after the data justifies it. Renting first de-risks the decision.
Do I really need a test run?
Yes. Estimates are directional, not exact. A short validation run on real data catches surprises in memory use and speed before they become expensive production incidents.
Can I follow these steps for fine-tuning too?
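Yes, with one change: for full fine-tuning, use the training memory multiplier (16–20× parameters) in Step 2 instead of the inference one. Full fine-tuning updates every weight, so its hardware demands are closer to training than to inference. Parameter-efficient methods such as LoRA need much less memory, so size those from a test run rather than from the training multiplier.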
Key Takeaways
- Define the workload type and model size before discussing any hardware.
- Calculate VRAM first — it is the gate that decides whether anything else matters.
- Optimize with quantization and model choice before sizing, since it changes the answer.
- Match GPU tier to your worst-case need, not your average.
- Choose buy-versus-rent based on honest sustained-utilization estimates, not optimism.
- Always validate with a small test run before provisioning at full scale.