When a 70B Model Won't Fit: Plays for Real Weight Trouble

A playbook is not a tutorial. A tutorial teaches you what weights are; a playbook tells you what to do when a 70 billion parameter model won't fit on your hardware, when a fine-tune degrades production quality, or when a vendor swaps the model behind your API without telling you. This document is organized around situations you will actually hit, each with a clear trigger, a recommended play, and the person who should own the call.

The premise is that parameters and weights are operational concerns, not just academic ones. The model you choose, how you store its weights, how you quantize them, and how you version them all have direct consequences on cost, latency, and quality. Teams that manage these deliberately ship faster and break less. Teams that treat them as someone else's problem discover the costs at the worst possible time—usually during an incident.

Use this as a reference. You do not need to run every play. You need to know which play applies when the trigger fires, and who pulls the trigger.

Play 1: Right-sizing the model to the task

Trigger: You are starting a new feature or evaluating a vendor and someone proposes the biggest model available "to be safe."

The default instinct toward maximum size is almost always wrong. Larger models cost more per token, add latency, and frequently outperform smaller ones by margins your users will never notice. The play is to start small and scale up only on evidence.

Owner: The engineer building the feature, with budget sign-off from the team lead.
Sequence: Define the task, pick the smallest credible model, benchmark on your real data, and only move up a size class if the smaller one fails a measurable threshold.
Failure mode to avoid: Choosing by parameter count on a spec sheet instead of by performance on your workload.

A disciplined right-sizing process is the single highest-leverage habit here. The Ai Model Parameters and Weights: Best Practices That Actually Work goes deeper on how to structure that benchmark.

Play 2: Fitting weights onto your hardware

Trigger: A model you want to self-host does not fit in available GPU memory.

You have a precise problem with a small set of known solutions. Run the memory math first—parameters times bytes per parameter—then choose a lever.

The levers, in order of preference

Quantize. Drop from FP16 to INT8 or INT4. Halves or quarters the weight memory. Test for accuracy loss on your task.
Use a smaller model. If quantization isn't enough, step down a size class.
Shard across GPUs. Split the weights across multiple devices. More cost, more complexity, but keeps the large model.
Offload to CPU or disk. Last resort. Works, but latency suffers badly.

Owner: The infrastructure or ML engineer. The team lead owns the cost trade-off if sharding adds GPUs.

Play 3: Fine-tuning without breaking the base

Trigger: A prompt-only approach isn't getting you the quality or consistency you need on a specialized task.

Fine-tuning adjusts weights toward your data, but full fine-tuning is expensive and risks catastrophic forgetting, where the model loses general ability while gaining your narrow task. The play is to reach for parameter-efficient methods first.

Start with LoRA or a similar adapter. Freeze the base weights, train a small set of new ones. Cheap, fast, reversible.
Keep a held-out evaluation set that tests general capability, not just your task, so you catch regressions.
Version every adapter alongside the base model version it was trained against.

Owner: The ML engineer running the fine-tune, with a reviewer signing off on the evaluation results before anything ships.

Play 4: Locking down weight provenance and versioning

Trigger: You are putting a model into production and need to know exactly which weights are running.

This is the play teams skip and regret. Weights are artifacts; treat them like any other production dependency.

Pin exact versions. Record the model checkpoint, quantization scheme, and adapter hashes.
Store weights in a registry, not on someone's laptop. Use checksums to verify integrity.
Watch for silent vendor swaps. Closed API providers update models behind stable names. Build evaluation canaries that alert you when behavior shifts.

Owner: The platform or MLOps engineer. This is foundational reliability work, not optional polish.

Play 5: Responding to a quality regression

Trigger: Output quality drops in production and you suspect the model.

Resist the urge to immediately swap models. Diagnose first, because the cause is often not where you think.

Diagnostic sequence

Confirm the weights didn't change. Check your version pins and any vendor changelog.
Check quantization. If you recently quantized, that is your prime suspect—roll back to higher precision and compare.
Check the data path. Most "model" regressions are actually changes in prompts, retrieval, or input formatting.
Only then consider the weights themselves.

Owner: Whoever is on call, escalating to the ML engineer if the issue isolates to the model. Working through this kind of diagnosis methodically is exactly what A Step-by-Step Approach to Ai Model Parameters and Weights is built to support.

Play 6: Deciding open weights versus closed API

Trigger: A new project forces the build-versus-buy question for the model layer.

This is a strategic call with long consequences, so make it explicitly rather than by default.

Choose open weights when you need portability, on-premise deployment, data residency control, or custom fine-tuning at depth.
Choose a closed API when you want the strongest raw capability, zero infrastructure burden, and you can tolerate vendor dependence.
Hybrid is valid. Many teams use closed APIs for hard tasks and self-hosted open weights for high-volume, simpler ones.

Owner: The engineering lead or architect, with input from security and finance. The The Best Tools for Ai Model Parameters and Weights overview can help map the options on each side.

Sequencing the plays

These plays are not a checklist to run top to bottom; they are a library you draw from. But there is a natural order for a new project: right-size first (Play 1), fit to hardware if self-hosting (Play 2), establish provenance before launch (Play 4), and keep the fine-tuning and regression plays (Plays 3 and 5) ready for when they trigger. The open-versus-closed decision (Play 6) usually comes before all of them, since it shapes which plays even apply.

Frequently Asked Questions

Who should own model and weight decisions on a team?

It is shared but specific. Engineers own task-level model selection and fitting weights to hardware. Platform or MLOps owns versioning and provenance. The engineering lead owns strategic calls like open versus closed and any decision that materially affects budget. Naming the owner per play prevents the common failure where everyone assumes someone else is watching.

When should I quantize versus just use a smaller model?

Quantize first if you want to keep a specific model's behavior but need it to fit or run faster, since quantization usually preserves most quality. Switch to a smaller model when even aggressive quantization isn't enough, or when the smaller model already passes your benchmark. Test both options on your real task rather than deciding in the abstract.

How do I avoid being surprised by a vendor changing their model?

Build evaluation canaries—small, fixed test sets you run on a schedule against the API. When scores move, you get an early warning even if the vendor publishes nothing. Combine this with version pinning where the provider supports it. Treating a closed model as a moving dependency rather than a fixed one is the core discipline.

Is full fine-tuning ever worth it over LoRA?

Occasionally, when you have a large, high-quality dataset and the task differs substantially from the base model's training. Even then, start with LoRA to validate the approach cheaply before committing the compute. Full fine-tuning also demands more rigorous regression testing because of the catastrophic forgetting risk.

What is the first thing to check during a quality regression?

Confirm nothing changed in the weights or quantization, then check the data path—prompts, retrieval, and input formatting. The majority of regressions blamed on the model turn out to be upstream changes. Ruling those out first saves you from an expensive and unnecessary model swap.

Key Takeaways

Treat parameters and weights as operational concerns with named owners, not as a black box.
Right-size to the task first; start small and scale up only on measured evidence.
Quantization is your primary lever for fitting weights to hardware—test it on your real workload.
Prefer adapter-based fine-tuning like LoRA, and always keep a general-capability evaluation set.
Pin weight versions and build evaluation canaries so vendor swaps and regressions don't surprise you.
Diagnose quality regressions in order: weights, then quantization, then the data path.

Use this as a reference. You do not need to run every play. You need to know which play applies when the trigger fires, and who pulls the trigger.

Play 1: Right-sizing the model to the task

Trigger: You are starting a new feature or evaluating a vendor and someone proposes the biggest model available "to be safe."

Owner: The engineer building the feature, with budget sign-off from the team lead.
Sequence: Define the task, pick the smallest credible model, benchmark on your real data, and only move up a size class if the smaller one fails a measurable threshold.
Failure mode to avoid: Choosing by parameter count on a spec sheet instead of by performance on your workload.

A disciplined right-sizing process is the single highest-leverage habit here. The Ai Model Parameters and Weights: Best Practices That Actually Work goes deeper on how to structure that benchmark.

Play 2: Fitting weights onto your hardware

Trigger: A model you want to self-host does not fit in available GPU memory.

You have a precise problem with a small set of known solutions. Run the memory math first—parameters times bytes per parameter—then choose a lever.

The levers, in order of preference

Quantize. Drop from FP16 to INT8 or INT4. Halves or quarters the weight memory. Test for accuracy loss on your task.
Use a smaller model. If quantization isn't enough, step down a size class.
Shard across GPUs. Split the weights across multiple devices. More cost, more complexity, but keeps the large model.
Offload to CPU or disk. Last resort. Works, but latency suffers badly.

Owner: The infrastructure or ML engineer. The team lead owns the cost trade-off if sharding adds GPUs.

Play 3: Fine-tuning without breaking the base

Trigger: A prompt-only approach isn't getting you the quality or consistency you need on a specialized task.

Start with LoRA or a similar adapter. Freeze the base weights, train a small set of new ones. Cheap, fast, reversible.
Keep a held-out evaluation set that tests general capability, not just your task, so you catch regressions.
Version every adapter alongside the base model version it was trained against.

Owner: The ML engineer running the fine-tune, with a reviewer signing off on the evaluation results before anything ships.

Play 4: Locking down weight provenance and versioning

Trigger: You are putting a model into production and need to know exactly which weights are running.

This is the play teams skip and regret. Weights are artifacts; treat them like any other production dependency.

Pin exact versions. Record the model checkpoint, quantization scheme, and adapter hashes.
Store weights in a registry, not on someone's laptop. Use checksums to verify integrity.
Watch for silent vendor swaps. Closed API providers update models behind stable names. Build evaluation canaries that alert you when behavior shifts.

Owner: The platform or MLOps engineer. This is foundational reliability work, not optional polish.

Play 5: Responding to a quality regression

Trigger: Output quality drops in production and you suspect the model.

Resist the urge to immediately swap models. Diagnose first, because the cause is often not where you think.

Diagnostic sequence

Confirm the weights didn't change. Check your version pins and any vendor changelog.
Check quantization. If you recently quantized, that is your prime suspect—roll back to higher precision and compare.
Check the data path. Most "model" regressions are actually changes in prompts, retrieval, or input formatting.
Only then consider the weights themselves.

Play 6: Deciding open weights versus closed API

Trigger: A new project forces the build-versus-buy question for the model layer.

This is a strategic call with long consequences, so make it explicitly rather than by default.

Choose open weights when you need portability, on-premise deployment, data residency control, or custom fine-tuning at depth.
Choose a closed API when you want the strongest raw capability, zero infrastructure burden, and you can tolerate vendor dependence.
Hybrid is valid. Many teams use closed APIs for hard tasks and self-hosted open weights for high-volume, simpler ones.

Owner: The engineering lead or architect, with input from security and finance. The The Best Tools for Ai Model Parameters and Weights overview can help map the options on each side.

Sequencing the plays

Frequently Asked Questions

Who should own model and weight decisions on a team?

When should I quantize versus just use a smaller model?

How do I avoid being surprised by a vendor changing their model?

Is full fine-tuning ever worth it over LoRA?

What is the first thing to check during a quality regression?

Key Takeaways

Treat parameters and weights as operational concerns with named owners, not as a black box.
Right-size to the task first; start small and scale up only on measured evidence.
Quantization is your primary lever for fitting weights to hardware—test it on your real workload.
Prefer adapter-based fine-tuning like LoRA, and always keep a general-capability evaluation set.
Pin weight versions and build evaluation canaries so vendor swaps and regressions don't surprise you.
Diagnose quality regressions in order: weights, then quantization, then the data path.

When a 70B Model Won't Fit: Plays for Real Weight Trouble

Play 1: Right-sizing the model to the task

Play 2: Fitting weights onto your hardware

The levers, in order of preference

Play 3: Fine-tuning without breaking the base

Play 4: Locking down weight provenance and versioning

Play 5: Responding to a quality regression

Diagnostic sequence

Play 6: Deciding open weights versus closed API

Sequencing the plays

Frequently Asked Questions

Who should own model and weight decisions on a team?

When should I quantize versus just use a smaller model?

How do I avoid being surprised by a vendor changing their model?

Is full fine-tuning ever worth it over LoRA?

What is the first thing to check during a quality regression?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

When a 70B Model Won't Fit: Plays for Real Weight Trouble

Play 1: Right-sizing the model to the task

Play 2: Fitting weights onto your hardware

The levers, in order of preference

Play 3: Fine-tuning without breaking the base

Play 4: Locking down weight provenance and versioning

Play 5: Responding to a quality regression

Diagnostic sequence

Play 6: Deciding open weights versus closed API

Sequencing the plays

Frequently Asked Questions

Who should own model and weight decisions on a team?

When should I quantize versus just use a smaller model?

How do I avoid being surprised by a vendor changing their model?

Is full fine-tuning ever worth it over LoRA?

What is the first thing to check during a quality regression?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?