When You Have to Justify the Compute Bill

The distinction between training a model from scratch and fine-tuning an existing one sounds academic until you're the person who has to justify the compute bill or explain why the chatbot still doesn't understand your industry's terminology. Both approaches produce capable models. They differ dramatically in cost, time, data requirements, and the kinds of problems they actually solve.

Most organizations default to fine-tuning because that's the option that gets talked about in vendor documentation. A smaller group—usually enterprises with genuinely unique data or competitive moats to protect—pursue full training. Both groups make expensive mistakes by treating their chosen approach as the obvious one rather than the right one. This article walks through specific scenarios, what made them work, and where things went wrong, so you can map your situation to the correct choice before committing resources.

Understanding the mechanics matters here, but only as a foundation for judgment. If you want a broader grounding in the underlying concepts, Machine Learning Basics: A Beginner's Guide is a useful place to start before reading further.

What Each Approach Actually Means

Training from scratch means initializing a model with random weights and running it through massive datasets until it learns language patterns, reasoning structures, or domain knowledge from the ground up. This is how the base versions of GPT-4, Claude, Llama, and similar foundation models came to exist. The compute requirements are enormous: thousands of GPUs running for weeks to months, with costs ranging from hundreds of thousands to tens of millions of dollars depending on model size.

Fine-tuning starts with a pre-trained model's existing weights and adjusts them using a much smaller, task-specific dataset. The model already knows how language works, what logic looks like, and how to follow instructions. Fine-tuning teaches it to apply those capabilities differently—more formally, more narrowly, in a specific persona, or with particular knowledge baked in.

The Spectrum Between Them

These are not binary options. Between full training and standard fine-tuning, you have:

Continued pre-training: taking a foundation model and running it through additional unlabeled domain text (legal corpora, medical literature, financial filings) before any task-specific tuning
Instruction fine-tuning: training on input-output pairs that teach a model to follow specific formats or instructions
RLHF and preference tuning: using human feedback to shape tone, safety, and response quality
LoRA and adapter methods: freezing the base model's weights and training only a small set of added parameters, which reduces cost while preserving most of the base model's behavior

Knowing this spectrum matters because many failures come from choosing the wrong point on it—using full fine-tuning when LoRA would have worked, or doing instruction tuning when what was actually needed was continued pre-training on domain text.
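To make the "small fraction of parameters" point concrete, here is a minimal arithmetic sketch of LoRA applied to a single weight matrix. It assumes the standard LoRA formulation, in which a frozen matrix W of shape d_out x d_in is adapted by two trainable low-rank factors, B (d_out x r) and A (r x d_in). The layer size and rank below are hypothetical values chosen for illustration, not figures from any model discussed here.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the full weight matrix W (d_out x d_in), ignoring any bias."""
    return d_in * d_out


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 8

trainable = lora_trainable_params(d_in, d_out, rank)
total = full_params(d_in, d_out)
fraction = trainable / total

print(f"full matrix:  {total:,} parameters")
print(f"LoRA factors: {trainable:,} parameters ({fraction:.2%} of the full matrix)")
```

With these assumed sizes, the adapter trains 65,536 parameters against 16,777,216 in the full matrix, which is under half a percent. Real savings depend on which layers are adapted and at what rank.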

Example 1: The Legal Tech Company That Fine-Tuned When It Needed to Train

A mid-size legal technology firm wanted a model that could draft contract clauses with the precision of a specialized attorney. They fine-tuned GPT-3.5-class models on roughly 5,000 labeled examples of contract language. The model produced cleaner output than the base model and used correct terminology more often.

The problem surfaced in production: the model occasionally confabulated legal standards—citing plausible-sounding but nonexistent case law concepts. The fine-tuning had taught it the style and vocabulary of legal drafting without giving it a deep understanding of legal reasoning. The base model's general patterns were still driving the underlying logic.

What would have worked better: continued pre-training on a large corpus of actual contracts, statutes, and legal commentary, followed by fine-tuning on labeled examples. The continued pre-training would have shifted the model's underlying knowledge toward legal reasoning, not just legal language. This is a common mistake described in detail in 7 Common Mistakes with Machine Learning Basics (and How to Avoid Them): optimizing surface performance without testing for deeper failure modes.

Example 2: The Healthcare Startup That Got Fine-Tuning Right

A clinical documentation startup built a fine-tuned model to generate structured SOAP notes (Subjective, Objective, Assessment, Plan) from physician dictations. They had roughly 12,000 examples of raw dictation paired with clinician-reviewed SOAP notes. Their process:

Started with a strong base model (a Llama-class open-source model)
Applied continued pre-training on publicly available clinical text (de-identified case reports, medical textbooks)
Fine-tuned on their proprietary dictation-to-SOAP dataset
Evaluated on a held-out test set reviewed by two practicing physicians

The result outperformed the base model significantly on clinical accuracy metrics and reduced the time physicians spent correcting documentation by 40–60% in pilot testing.

What Made It Work

Data quality over quantity: every training example had been reviewed by a licensed clinician, not just labeled by contractors unfamiliar with clinical norms
The right base model: they chose an open-source model they could deploy on-premise, which was necessary for HIPAA compliance
Staged approach: continued pre-training before task-specific fine-tuning meant the model's knowledge substrate matched the domain before it was shaped by labeled examples
Conservative evaluation: they tested for failure modes, not just average performance
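One way to read "failure modes, not just average performance" is to report accuracy per slice of the held-out set and look at the worst slice alongside the mean. The sketch below is an illustration under that assumption, not the startup's actual evaluation harness. The slice names and results are made up.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """results: iterable of (slice_name, is_correct) pairs from a held-out test set."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in results:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


def summarize(results):
    """Return (overall accuracy, name of worst slice, accuracy of worst slice)."""
    results = list(results)
    per_slice = accuracy_by_slice(results)
    overall = sum(int(ok) for _, ok in results) / len(results)
    worst = min(per_slice, key=per_slice.get)
    return overall, worst, per_slice[worst]


# Made-up reviewer judgments: 25 notes across two hypothetical slices.
sample = (
    [("routine visit", True)] * 18
    + [("routine visit", False)] * 2
    + [("medication change", True)] * 2
    + [("medication change", False)] * 3
)

overall, worst_slice, worst_acc = summarize(sample)
print(f"overall accuracy: {overall:.0%}")
print(f"worst slice: {worst_slice} at {worst_acc:.0%}")
```

In this made-up sample the overall figure is 80%, which hides a slice that is right only 40% of the time. That gap is the kind of thing an average-only evaluation misses.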

Example 3: An Agency That Used Fine-Tuning for Brand Voice—and Scaled It

A marketing agency managing content for 30+ clients wanted a way to produce first drafts that matched each client's voice without heavy human editing. They fine-tuned a model on 200–400 examples per client: existing approved content tagged with brand guidelines.

The economics worked because they ran fine-tuning on smaller, cheaper models and used the fine-tuned versions for first-draft generation only—human editors still reviewed everything. They weren't trying to replace judgment; they were reducing the mechanical labor of starting from zero.

Realistic numbers for this kind of project: fine-tuning a 7–13 billion parameter open-source model on a few hundred examples costs between $20 and $200 depending on infrastructure choices. Running inference is cheap. The ROI math closes quickly when editors are spending 30% less time on revisions.
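The break-even arithmetic behind that claim can be sketched as follows. The $20 to $200 tuning cost and the 30% reduction come from the paragraph above. The baseline editing time and the editor's hourly rate are assumptions introduced purely for illustration. The sketch also ignores inference cost and the labor of preparing training data.

```python
import math


def break_even_drafts(tuning_cost_usd, minutes_saved_per_draft, editor_rate_usd_per_hour):
    """Number of drafts after which editor time saved covers the one-off tuning cost."""
    saving_per_draft = editor_rate_usd_per_hour * minutes_saved_per_draft / 60
    return math.ceil(tuning_cost_usd / saving_per_draft)


BASELINE_EDIT_MINUTES = 60  # assumption: revision time per draft before fine-tuning
REDUCTION_PERCENT = 30      # from the article: editors spend 30% less time on revisions
EDITOR_RATE = 50            # assumption: loaded editor cost in USD per hour

minutes_saved = BASELINE_EDIT_MINUTES * REDUCTION_PERCENT / 100

low = break_even_drafts(20, minutes_saved, EDITOR_RATE)
high = break_even_drafts(200, minutes_saved, EDITOR_RATE)
print(f"break-even after {low} to {high} drafts per client model")
```

Under these assumptions each draft saves 18 minutes of editor time, worth $15, so a per-client model pays for its tuning run within 2 to 14 drafts. Different rates or baselines shift the numbers, but not by orders of magnitude.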

What Almost Derailed It

The early runs used unfiltered client content—including drafts that had been rejected internally. Including bad examples taught the model some of the bad patterns. The fix was curatorial: only approved, published content made it into training sets. This connects directly to a principle covered in Machine Learning Basics: Best Practices That Actually Work—garbage in, garbage out applies with amplified consequences in fine-tuning because there's no large dataset to dilute noise.
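The curatorial fix amounts to a filter in front of the training set. The sketch below assumes each piece of content carries a workflow status, which is a hypothetical data model and not something the agency described. It also drops duplicates after whitespace and case normalization, an extra step added here as an assumption. The client names and texts are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Draft:
    client: str
    text: str
    status: str  # hypothetical workflow states: "rejected", "approved", "published"


def training_examples(drafts, client):
    """Keep only one client's published content, dropping normalized duplicates."""
    seen = set()
    kept = []
    for draft in drafts:
        if draft.client != client or draft.status != "published":
            continue
        normalized = " ".join(draft.text.split()).lower()
        if normalized in seen:
            continue
        seen.add(normalized)
        kept.append(draft.text)
    return kept


# Invented sample content for two made-up clients.
sample = [
    Draft("acme", "Fresh coffee, delivered weekly.", "published"),
    Draft("acme", "fresh coffee,  delivered weekly.", "published"),  # duplicate once normalized
    Draft("acme", "COFFEE!!! BUY NOW!!!", "rejected"),
    Draft("globex", "Reliable logistics since 1989.", "published"),
]

examples = training_examples(sample, "acme")
print(examples)
```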

Example 4: Training From Scratch for a Narrow, High-Stakes Domain

A government contractor built a model for analyzing signals intelligence reports. The data was classified, highly specialized, and had no overlap with anything in standard pre-training corpora. Fine-tuning on a commercial foundation model was not an option: the data could not leave a classified environment, and commercial model weights often carry usage restrictions, which in this case prohibited the application.

They trained a smaller model (sub-7B parameters) from scratch on a curated corpus assembled over several years. It was expensive, took significant infrastructure investment, and required a team with genuine ML engineering depth. But the alternative—trying to fine-tune a foundation model they couldn't legally use in that context—wasn't actually available.

This is the clearest legitimate case for training from scratch: data that is genuinely unavailable to foundation model trainers, legal constraints that prevent using commercial weights, or a competitive moat that depends on keeping the model's knowledge proprietary.

Example 5: Fine-Tuning That Failed Because the Problem Needed Prompt Engineering

A SaaS company tried to fine-tune a model to produce consistent JSON output for a structured data extraction task. They spent several weeks building a training set and running fine-tuning jobs. Performance improved modestly.

A contractor they brought in for a second opinion spent two hours iterating on system prompts with schema definitions and few-shot examples. That approach matched the fine-tuned model's performance and, in some cases, exceeded it.

Fine-tuning is not always the right tool for consistency or format adherence. Modern foundation models respond well to explicit instructions, structured prompts, and output parsers. Before committing to a fine-tuning project, run a thorough prompting evaluation. The framework in A Step-by-Step Approach to Machine Learning Basics applies here: define the task precisely, establish a baseline, then determine whether the gap requires tuning or just better prompting.
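A prompting baseline of the kind described here can be as small as a system prompt built from a schema and a few examples, plus a parser that rejects malformed replies. The sketch below is one possible shape, not the contractor's actual setup. The field names, the example invoice, and both function names are hypothetical, and the model call itself is left out because it depends on the provider.

```python
import json

# Hypothetical extraction schema: field name -> expected Python type.
SCHEMA = {"invoice_id": str, "total": float, "currency": str}

# Hypothetical few-shot example (input text, expected output).
FEW_SHOT = [
    ("Invoice INV-7 for 120.50 EUR",
     {"invoice_id": "INV-7", "total": 120.5, "currency": "EUR"}),
]


def build_system_prompt(schema, examples):
    """Assemble instructions, the field list, and few-shot examples into one prompt."""
    fields = "\n".join(f'- "{name}": {tp.__name__}' for name, tp in schema.items())
    shots = "\n\n".join(
        f"Input: {text}\nOutput: {json.dumps(out)}" for text, out in examples
    )
    return (
        "Extract the fields below and reply with a single JSON object and nothing else.\n"
        f"Fields:\n{fields}\n\nExamples:\n{shots}"
    )


def parse_output(raw, schema):
    """Return the parsed object, or None if the reply does not match the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(schema):
        return None
    for name, tp in schema.items():
        value = data[name]
        if tp is float:
            # Accept JSON integers for float fields, but not booleans.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:
            ok = isinstance(value, tp)
        if not ok:
            return None
    return data


prompt = build_system_prompt(SCHEMA, FEW_SHOT)
print(prompt)
print(parse_output('{"invoice_id": "A1", "total": 10, "currency": "USD"}', SCHEMA))
print(parse_output("Sure! Here is the JSON you asked for.", SCHEMA))
```

Counting how often `parse_output` returns `None` over a test set gives the baseline figure a fine-tuned model would have to beat.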

Deciding Which Approach Fits Your Situation

The decision tree is less complicated than vendors make it seem:

Choose fine-tuning when:

A capable foundation model already understands your domain generally
You have a few hundred to roughly 50,000 high-quality labeled examples (a few hundred can be enough for format and style tasks)
You need consistent format, tone, or task specialization
Budget is constrained and time to deployment matters

Consider continued pre-training + fine-tuning when:

Your domain has specialized terminology and reasoning patterns underrepresented in standard training data
Surface-level style adaptation isn't enough—you need the model to reason differently
You have large amounts of unlabeled domain text available

Train from scratch only when:

Your data cannot leave your environment, and no open-weight model with a suitable license can be brought into it
You need full ownership of weights for legal or competitive reasons
You're building a genuinely novel capability with no foundation model analog
You have the infrastructure and ML engineering depth to execute it
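The three lists above can be collapsed into a rough decision function. This is a deliberately simplified sketch: the field names, the order of the checks, and the inclusion of a prompting check (taken from Example 5) are assumptions made for illustration, and real decisions involve judgment that booleans cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    prompting_meets_target: bool          # a prompt-only baseline already does the job
    usable_base_model_exists: bool        # licensing and data constraints allow a base model
    base_model_knows_domain: bool         # the base model already understands the domain generally
    has_large_unlabeled_corpus: bool      # plenty of unlabeled domain text is available
    has_ml_infrastructure_and_team: bool  # infrastructure and ML engineering depth exist


def recommend(s: Situation) -> str:
    """Map a situation to one of the approaches discussed in this article."""
    if s.usable_base_model_exists:
        if s.prompting_meets_target:
            return "prompt engineering"
        if s.base_model_knows_domain:
            return "fine-tuning"
        if s.has_large_unlabeled_corpus:
            return "continued pre-training + fine-tuning"
        # No domain corpus to pre-train on: fine-tune, but test for the
        # reasoning failures described in Example 1.
        return "fine-tuning"
    if s.has_ml_infrastructure_and_team:
        return "train from scratch"
    return "revisit constraints"


# Roughly the healthcare startup from Example 2.
print(recommend(Situation(False, True, False, True, True)))
# Roughly the government contractor from Example 4.
print(recommend(Situation(False, False, False, True, True)))
```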

Cost is not the only variable. Training from scratch without the team to maintain and iterate on the resulting model creates a different kind of debt—a capable model that nobody inside the organization fully understands is fragile in ways that surface at the worst times. See Machine Learning Basics: Real-World Examples and Use Cases for how similar trade-offs play out across other ML contexts.

Frequently Asked Questions

How much data do you actually need for fine-tuning?

There is no universal answer, but useful results often appear with 200–500 high-quality examples for format and style tasks, and 5,000–50,000 for tasks requiring deeper knowledge or reasoning shifts. Quality consistently matters more than volume—100 rigorously reviewed examples typically outperform 1,000 hastily labeled ones. Start small, evaluate honestly, and scale the dataset only where you can demonstrate clear gaps.

Can fine-tuning make a model forget what it already knows?

Yes. This is called catastrophic forgetting, and it's a real risk with aggressive fine-tuning on small datasets. The model can overfit to the training examples and lose general capability. Techniques like LoRA (Low-Rank Adaptation) reduce this risk by freezing the base weights and training only a small set of added parameters, preserving most of the base model's behavior while teaching new patterns.

Is fine-tuning worth it if I can just use a good system prompt?

Often, no. Many format, tone, and task-consistency problems are solvable with well-structured prompts and few-shot examples at a fraction of the cost and complexity. Fine-tuning earns its place when you need consistent behavior at scale across thousands of inferences, when prompts would need to be prohibitively long to carry all the necessary context, or when latency and cost favor smaller, specialized models over large general ones.

What's the biggest hidden cost in fine-tuning projects?

Data preparation. The compute for a fine-tuning run is relatively cheap. The labor to collect, clean, label, and review training examples is where projects actually stall or overrun. Budget 60–70% of project effort toward data work if you want a realistic picture of what a fine-tuning project actually costs.

When does training from scratch make economic sense for a company that isn't a major lab?

Rarely, and only under specific conditions: proprietary data that can't touch external infrastructure, regulated environments with strict model governance requirements, or applications so specialized that no foundation model provides a useful starting point. Even then, continued pre-training on a permissively licensed open-source foundation model is often a better path than true training from scratch.

Key Takeaways

Fine-tuning adjusts an existing model's behavior; training from scratch builds a model's knowledge from the ground up—the choice depends on data constraints, legal context, and what the base model already knows
Most organizations benefit more from fine-tuning (or even prompt engineering) than from training from scratch; the exceptions are real but narrow
Data quality is the single highest-leverage variable in any fine-tuning project—bad examples teach bad patterns with no large dataset to dilute them
Catastrophic forgetting is a genuine risk; parameter-efficient methods like LoRA reduce it while keeping fine-tuning costs manageable
Prompt engineering should be evaluated thoroughly before committing to a fine-tuning project—many consistency and format problems are solvable without model modification
Continued pre-training followed by fine-tuning is the underused middle path that solves domain-reasoning problems that style-level fine-tuning cannot

Agency Script Editorial
