Conflating Two Things That Cost Months of GPU Time

Agency Script Editorial

Editorial Team

·March 31, 2026·11 min read

Most teams who are new to working with large language models conflate two fundamentally different things: training a model from scratch and fine-tuning a pre-trained one. The distinction matters enormously — not just as a technical detail, but as a budget decision, a risk decision, and a quality decision. When you get it wrong, you either waste months of GPU time building something a fine-tuning run could have delivered, or you waste weeks fine-tuning a model that was never going to give you what you needed.

The failure modes are predictable, and they repeat across agencies, startups, and enterprise teams alike. They follow a recognizable pattern: someone misreads what a problem actually requires, applies the wrong method, and either burns resources or ships something broken. This article names seven of those failure modes and, for each one, explains why it happens, what it costs, and what the corrective practice looks like. If you're deciding right now whether to train or fine-tune, or you're debugging why your current approach isn't working, this is the article to read before making your next move.

Understanding these distinctions also builds on the foundations covered in The Complete Guide to Machine Learning Basics, which is worth reading alongside this piece if you want the broader conceptual scaffolding.


Mistake 1: Treating Fine-Tuning as a Knowledge Injection Tool

Why it happens

The most common mistake is assuming that fine-tuning "teaches" a model new facts — that you can take a base model, fine-tune it on your company's documentation, and it will now "know" your product deeply. This logic is intuitive but wrong.

The reality

Fine-tuning adjusts the model's behavior and style — how it responds, how it formats outputs, what tone it adopts. It does not reliably inject factual knowledge the way training on a large corpus does. If you fine-tune on a dataset of 500 product FAQs, the model won't reliably recall specific facts from that dataset; it will learn to sound like someone who knows your product. Factual retrieval under novel phrasing will often fail.

The corrective practice

Use retrieval-augmented generation (RAG) for knowledge injection, and fine-tuning for style and behavior. If your goal is that the model knows specific facts, attach a retrieval system. If your goal is that the model responds in a specific format or follows a specific interaction pattern, then fine-tuning is the right lever.
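To make the retrieval side concrete, here is a minimal sketch of the RAG pattern, assuming a tiny in-memory document list and a simple bag-of-words similarity. The FAQ text, the function names, and the prompt wording are all hypothetical; a production system would more likely use an embedding model and a vector store.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True
    )
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Place the retrieved facts in the prompt instead of in the weights."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical product FAQ entries, for illustration only.
FAQS = [
    "Refunds are available within 30 days of purchase.",
    "The Pro plan includes up to 10 team seats.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
]

prompt = build_prompt("How many seats does the Pro plan include?", FAQS, k=1)
print(prompt)
```

The facts live in the document store, so updating a fact means editing a document rather than launching a new training run.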


Mistake 2: Fine-Tuning When Prompt Engineering Would Have Worked

Why it happens

Fine-tuning feels more rigorous. It sounds like a real technical solution. Teams that have budget and access to fine-tuning infrastructure often reach for it immediately, skipping the simpler step of testing whether a well-constructed system prompt already solves the problem.

The cost

A fine-tuning run on a mid-size model typically takes anywhere from several hours to several days of compute time, plus the labor to curate, clean, and format training data. If a 400-token system prompt with a few examples achieves 85–90% of the target behavior, you've spent significant resources on marginal gains — and made your system harder to iterate on, since changing fine-tuned behavior requires a new run.

The corrective practice

Always run a structured prompt engineering phase before committing to fine-tuning. Test at minimum: a zero-shot prompt, a few-shot prompt with 3–5 examples, and a chain-of-thought variant. If none of these approaches gets you within striking distance of the target behavior, fine-tuning is justified. If one does, ship it and optimize later.
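One way to structure that phase is sketched below: build the three prompt variants from the same instruction, score each on your eval set, and only then decide. The helper names and the 0.10 margin standing in for "striking distance" are assumptions made for illustration, not fixed thresholds.

```python
def zero_shot(instruction, user_input):
    """Instruction only, no examples."""
    return f"{instruction}\n\nInput: {user_input}\nOutput:"


def few_shot(instruction, examples, user_input):
    """Instruction plus a handful of (input, output) demonstrations."""
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {user_input}\nOutput:"


def chain_of_thought(instruction, user_input):
    """Instruction plus a request to reason before answering."""
    return (
        f"{instruction}\n"
        "Think through the problem step by step, "
        "then give the final answer on its own line.\n\n"
        f"Input: {user_input}\nOutput:"
    )


def needs_fine_tuning(variant_scores, target, margin=0.10):
    """True when no prompt variant scores within `margin` of the target.

    `margin` is a hypothetical stand-in for "striking distance".
    """
    return max(variant_scores.values()) < target - margin


# Hypothetical eval scores for each variant (fraction of eval items passed).
scores = {"zero_shot": 0.60, "few_shot": 0.85, "chain_of_thought": 0.70}
print(needs_fine_tuning(scores, target=0.90))
```

With these illustrative scores the few-shot variant is already within the margin, so the sketch recommends shipping the prompt rather than starting a fine-tuning run.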


Mistake 3: Training from Scratch When Fine-Tuning Was the Right Call

Why it happens

This mistake is less common but far more expensive when it occurs. It tends to happen in organizations with ML researchers on staff who are accustomed to training pipelines, or in contexts where leadership wants "our own model" for IP or compliance reasons without understanding the actual cost curve.

The cost

Training a useful general-purpose language model from scratch requires compute budgets that start in the tens of thousands of dollars and scale rapidly into the millions, depending on parameter count and dataset size. Beyond compute, you need data curation infrastructure, evaluation frameworks, and iteration cycles that take months, not weeks. For the vast majority of business use cases, even sophisticated ones, a fine-tuned version of an existing open-weights model will outperform a from-scratch model, and it will do so at a fraction of the budget.
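A back-of-envelope estimate shows where the cost gap comes from. The sketch below uses the common rule of thumb that training a dense transformer costs roughly 6 × parameters × tokens in floating-point operations. That rule is an approximation brought in for illustration, not a figure from this article, and the GPU throughput, utilization, price, and token counts are all hypothetical placeholders to replace with your own numbers.

```python
def training_flops(params, tokens):
    """Approximate training compute using the C = 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def gpu_hours(flops, peak_flops_per_sec, utilization):
    """Convert total FLOPs into GPU-hours at a given sustained utilization."""
    return flops / (peak_flops_per_sec * utilization) / 3600.0


# Hypothetical hardware and pricing assumptions.
PEAK_FLOPS_PER_SEC = 3.0e14   # assumed peak throughput of one accelerator
UTILIZATION = 0.4             # assumed fraction of peak actually sustained
PRICE_PER_GPU_HOUR = 2.0      # assumed rental price in USD

# Hypothetical workloads: the same 7B-parameter model in both cases.
PRETRAIN_TOKENS = 1.4e11      # from-scratch training corpus
FINETUNE_TOKENS = 2.0e6       # e.g. 2,000 examples of about 1,000 tokens each

pretrain_hours = gpu_hours(
    training_flops(7e9, PRETRAIN_TOKENS), PEAK_FLOPS_PER_SEC, UTILIZATION
)
finetune_hours = gpu_hours(
    training_flops(7e9, FINETUNE_TOKENS), PEAK_FLOPS_PER_SEC, UTILIZATION
)

print(f"pre-training: {pretrain_hours:,.0f} GPU-hours, "
      f"about ${pretrain_hours * PRICE_PER_GPU_HOUR:,.0f}")
print(f"fine-tuning:  {finetune_hours:,.2f} GPU-hours, "
      f"about ${finetune_hours * PRICE_PER_GPU_HOUR:,.2f}")
print(f"ratio: {pretrain_hours / finetune_hours:,.0f}x")
```

Under this approximation the ratio depends only on the token counts, so the hardware assumptions cancel out. The estimate covers a single full-parameter pass and ignores failed runs, hyperparameter sweeps, and data preparation, which push real from-scratch budgets higher.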

The gap in capability between a fine-tuned 7B–70B parameter open-weights model and a from-scratch model you trained with a small team and limited data is typically enormous. This is one of the central insights in understanding The Future of Neural Networks: the economics of AI have shifted so that differentiation comes from adaptation, not origination.

The corrective practice

Default to fine-tuning open-weights models (Mistral, LLaMA variants, Qwen, Phi, and others) before considering pre-training. The only legitimate cases for training from scratch are: you need a domain so specialized that no existing model's pre-training corpus covers it meaningfully (rare), you have proprietary data at a scale that would genuinely shift model capability (extremely rare), or you have legal constraints that prevent using any third-party model weights.


Mistake 4: Using Low-Quality or Misaligned Training Data

Why it happens

Teams underestimate the data problem. They assume that if they have enough examples, quality will average out. It won't. Fine-tuning is highly sensitive to data quality because you're applying strong gradient updates to a relatively small set of examples. One bad signal propagates more than you'd expect.

Specific failure modes

  • Inconsistent labels: If your dataset has three different "correct" ways to respond to the same input type, the model learns to be inconsistent too.
  • Wrong distribution: If you fine-tune on idealized, polished examples but the model will encounter messy real-world inputs, you'll see performance cliffs.
  • Contaminated negatives: If your dataset includes low-quality outputs without explicit negative labeling, the model may learn those patterns as acceptable.

The corrective practice

Invest more time in data curation than you think you need. For most fine-tuning runs, 500–2,000 high-quality, consistently formatted examples outperform 10,000 noisy ones. Run a human review pass on a random sample of at least 10% of your dataset before training. Define a rubric for what "correct" looks like before you start collecting examples, not after.
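Two of these checks are easy to automate before any human review. The sketch below, which assumes each example is a dict with "input" and "output" keys (a hypothetical schema), draws a reproducible 10% review sample and flags inputs that appear with more than one distinct output.

```python
import math
import random
from collections import defaultdict


def review_sample(examples, fraction=0.10, seed=0):
    """Draw a reproducible random sample for human review."""
    k = max(1, math.ceil(len(examples) * fraction))
    return random.Random(seed).sample(examples, k)


def conflicting_inputs(examples):
    """Map each normalized input to its outputs when there is more than one."""
    outputs = defaultdict(set)
    for ex in examples:
        key = " ".join(ex["input"].lower().split())  # collapse case and spacing
        outputs[key].add(ex["output"].strip())
    return {k: v for k, v in outputs.items() if len(v) > 1}


# Hypothetical dataset: 24 consistent examples plus one conflicting duplicate.
examples = [{"input": f"Question {i}", "output": f"Answer {i}"} for i in range(24)]
examples.append({"input": "question 3 ", "output": "A different answer"})

print(len(review_sample(examples)), "examples queued for human review")
for text, outs in conflicting_inputs(examples).items():
    print(f"inconsistent labels for {text!r}: {sorted(outs)}")
```

Exact matching after normalization only catches literal duplicates. Near-duplicate inputs with conflicting outputs still need the rubric-based human pass.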


Mistake 5: Catastrophic Forgetting — Overwriting What the Base Model Knew

Why it happens

Catastrophic forgetting is a well-documented failure mode in neural network training: when you train on a new task, the model degrades on tasks it previously handled well. In fine-tuning, this happens when the fine-tuning dataset is too narrow, too large relative to the base training data, or when learning rates are set too aggressively.

What it looks like in practice

You fine-tune a model on customer service dialogues. It becomes excellent at customer service phrasing, but its general reasoning ability degrades. It starts making errors on logical inferences that the base model handled correctly. Or it loses calibrated uncertainty — it starts confidently stating things that are wrong.

The corrective practice

Use conservative learning rates during fine-tuning (typically 1e-5 to 5e-5 for most transformer models). Mix a small percentage of general-purpose examples into your fine-tuning dataset, roughly 5–10% of total volume, to preserve general capability. Run your evaluation suite on both target tasks and general capability benchmarks after every training run. Parameter-efficient fine-tuning methods like LoRA (Low-Rank Adaptation) keep the base model weights frozen and train only a small set of added low-rank parameters, which tends to reduce catastrophic forgetting. For most agency and professional use cases, LoRA or QLoRA should be your default approach.
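The data-mixing step can be sketched as follows. The function name, the 8% default, and the example records are hypothetical. The point of the arithmetic is that the share is measured against the total mixed dataset, not against the task examples alone.

```python
import random


def mix_in_general(task_examples, general_pool, general_share=0.08, seed=0):
    """Return a shuffled dataset in which about `general_share` of all
    examples come from the general-purpose pool."""
    if not 0 <= general_share < 1:
        raise ValueError("general_share must be in [0, 1)")
    # Solve n_general / (n_task + n_general) = general_share for n_general.
    n_general = round(len(task_examples) * general_share / (1 - general_share))
    n_general = min(n_general, len(general_pool))
    rng = random.Random(seed)
    mixed = list(task_examples) + rng.sample(general_pool, n_general)
    rng.shuffle(mixed)
    return mixed


# Hypothetical records; real examples would carry prompts and completions.
task = [{"source": "task", "id": i} for i in range(920)]
general = [{"source": "general", "id": i} for i in range(500)]

mixed = mix_in_general(task, general)
share = sum(ex["source"] == "general" for ex in mixed) / len(mixed)
print(f"{len(mixed)} examples, {share:.0%} general-purpose")
```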

This connects to broader patterns described in A Step-by-Step Approach to Machine Learning Basics, where evaluation and iteration loops are emphasized as the discipline that separates working systems from perpetually broken ones.


Mistake 6: No Evaluation Framework Before or After Training

Why it happens

Evaluation is treated as a final step, if it's done at all. Teams run a fine-tuning job, manually check a handful of outputs, decide it "looks good," and ship. This is how regressions go undetected and how you discover months later that your model performs well on the examples you checked and poorly on everything else.

The cost

Without a systematic evaluation framework, you cannot know whether a model change improved performance or degraded it. You cannot compare runs. You cannot detect regressions. You are flying blind, and the only feedback signal is production failures.

The corrective practice

Build your evaluation set before you build your training set, and treat it as a fixed artifact. Your eval set should include:

  • A representative sample of target-task inputs
  • Edge cases and adversarial inputs
  • Inputs that test general capability preservation
  • Clear, rubric-based scoring criteria

Run automated scoring where possible (for tasks with objective answers), and human scoring on a consistent sample for subjective outputs. Compare every new training run against a baseline. A 3–5% improvement on eval metrics that aligns with your rubric is meaningful. A model that "looks better" to one reviewer is not a measurement.
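A baseline comparison along these lines might look like the sketch below. It assumes a fixed eval set of dicts with "id", "category", and "reference" keys and uses exact-match scoring. The schema, the category names, and the sample predictions are hypothetical, and subjective tasks would swap in a rubric-based scorer.

```python
from collections import defaultdict


def exact_match(prediction, reference):
    """1.0 when the normalized strings are identical, otherwise 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def score_by_category(eval_set, predictions, scorer=exact_match):
    """Mean score per category for one model's predictions."""
    scores = defaultdict(list)
    for item in eval_set:
        scores[item["category"]].append(
            scorer(predictions[item["id"]], item["reference"])
        )
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


def compare_runs(eval_set, baseline_preds, candidate_preds, tolerance=0.0):
    """Per-category baseline score, candidate score, delta, regression flag."""
    base = score_by_category(eval_set, baseline_preds)
    cand = score_by_category(eval_set, candidate_preds)
    return {
        cat: {
            "baseline": base[cat],
            "candidate": cand[cat],
            "delta": cand[cat] - base[cat],
            "regressed": cand[cat] < base[cat] - tolerance,
        }
        for cat in base
    }


# Hypothetical fixed eval set and model outputs.
EVAL_SET = [
    {"id": "t1", "category": "target_task", "reference": "refund approved"},
    {"id": "t2", "category": "target_task", "reference": "escalate"},
    {"id": "g1", "category": "general", "reference": "4"},
    {"id": "g2", "category": "general", "reference": "paris"},
]
baseline = {"t1": "escalate", "t2": "escalate", "g1": "4", "g2": "Paris"}
candidate = {"t1": "Refund approved", "t2": "escalate", "g1": "4", "g2": "London"}

for category, row in compare_runs(EVAL_SET, baseline, candidate).items():
    print(category, row)
```

In this toy data the candidate improves on the target task and regresses on the general category, which is the pattern a spot check of a few target-task outputs would miss.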

The habit of rigorous evaluation is one of the most common gaps covered in 7 Common Mistakes with Machine Learning Basics (and How to Avoid Them), and it applies with equal force here.


Mistake 7: Ignoring Inference Cost and Latency in Model Selection

Why it happens

Teams optimize for benchmark performance during training and fine-tuning, then deploy and discover that the model is too slow or too expensive to run at their required query volume. This happens because the person selecting and training the model is often not the person responsible for infrastructure costs.

The cost

A 70B parameter model may outperform a 7B model on your eval set by 8–12%, but it may cost 6–10x more per inference token and add 300–800ms of latency. For a customer-facing product with a 1–2 second response expectation, that latency alone can be a product-killer. For an internal tool running 50,000 queries per day, the cost difference is material on an annual basis.

The corrective practice

Set latency and cost constraints before you select a base model. Define: maximum acceptable latency at P95, maximum cost per 1,000 queries, and required throughput. Then select the smallest model that can meet your performance targets within those constraints. In many cases, a well-fine-tuned 7B model with strong data will outperform a poorly-prompted 70B model on your specific task — and at a fraction of the cost. Size is not a proxy for quality on specialized tasks.
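The selection rule can be written down directly. The sketch below filters candidates by the constraints first and then picks the smallest model that survives. The model names, scores, latency samples, prices, and throughput figures are hypothetical, and P95 is computed with the simple nearest-rank method.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    params_b: float               # parameter count in billions
    eval_score: float             # score on your fixed eval set
    cost_per_1k_queries: float    # USD per 1,000 queries
    max_qps: float                # sustained queries per second
    latencies_ms: list = field(default_factory=list)


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = -(-95 * len(ordered) // 100)  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def smallest_viable(candidates, min_score, max_p95_ms, max_cost_per_1k, required_qps):
    """Smallest model meeting every constraint, or None if none does."""
    viable = [
        c for c in candidates
        if c.eval_score >= min_score
        and p95(c.latencies_ms) <= max_p95_ms
        and c.cost_per_1k_queries <= max_cost_per_1k
        and c.max_qps >= required_qps
    ]
    return min(viable, key=lambda c: c.params_b, default=None)


def annual_cost(cost_per_1k_queries, queries_per_day):
    """Yearly spend at a constant daily query volume."""
    return queries_per_day * 365 / 1000 * cost_per_1k_queries


# Hypothetical measurements for two candidate models.
small = Candidate("hypothetical-7b", 7, 0.84, 0.50, 40, [400] * 19 + [900])
large = Candidate("hypothetical-70b", 70, 0.91, 4.00, 8, [900] * 19 + [1800])

choice = smallest_viable(
    [large, small], min_score=0.80, max_p95_ms=1000,
    max_cost_per_1k=1.00, required_qps=10,
)
print(choice.name if choice else "no model meets the constraints")
for c in (small, large):
    print(c.name, f"${annual_cost(c.cost_per_1k_queries, 50_000):,.0f} per year")
```

Writing the constraints as code forces the team to state them as numbers before training starts, which is when they are cheapest to change.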


Frequently Asked Questions

What is the core difference between training and fine-tuning?

Training from scratch means building a model's weights from random initialization using a large dataset — it determines everything the model knows and how it reasons. Fine-tuning starts from a pre-trained model's existing weights and updates them (fully or partially) on a smaller, task-specific dataset to adjust behavior. For almost all practical business applications, fine-tuning is the appropriate starting point, not training from scratch.

How much data do I need to fine-tune a model effectively?

There is no universal answer, but most practitioners see useful behavior changes with as few as 200–500 high-quality examples for style and format tasks. For more complex behavioral changes, 1,000–5,000 curated examples is a reasonable working range. Beyond roughly 10,000 examples, added volume tends to yield diminishing returns, and the quality and consistency of the data matter far more.

Can fine-tuning make a model more accurate about specific facts?

Not reliably. Fine-tuning is better at shaping how a model responds than at reliably encoding specific facts it will accurately recall. For factual accuracy on specific knowledge, retrieval-augmented generation (RAG) — where the model queries a document store at inference time — is significantly more reliable than fine-tuning alone.

What is LoRA and when should I use it?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains a small number of additional parameters while keeping base model weights frozen. It reduces compute requirements, reduces catastrophic forgetting, and makes it easier to swap fine-tuned adapters in and out. For most professional and agency use cases, LoRA or its quantized variant QLoRA should be the default approach over full fine-tuning.
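The arithmetic behind "a small number of additional parameters" is short enough to sketch. For a frozen weight matrix W of shape d_out × d_in, LoRA adds two trainable matrices, B (d_out × r) and A (r × d_in), and computes W·x + (alpha / r)·B·(A·x). The dimensions, rank, and toy values below are hypothetical, and plain Python lists stand in for tensors.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


# Parameter count for one hypothetical 4096 x 4096 projection at rank 8.
full = 4096 * 4096
added = lora_trainable_params(4096, 4096, rank=8)
print(f"full: {full:,}  LoRA: {added:,}  share: {added / full:.2%}")

# Toy forward pass: B starts at zero, so the adapter initially changes nothing.
W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weights
A = [[0.5, -0.5]]              # rank-1 "down" projection (trainable)
B = [[0.0], [0.0]]             # rank-1 "up" projection, initialized to zero
x = [1.0, 1.0]
print(lora_forward(W, A, B, x, alpha=16, rank=1), matvec(W, x))
```

Because the base weights never change, removing or swapping the adapter restores the original model exactly, which is what makes adapters easy to swap in and out.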

How do I know if fine-tuning actually improved my model?

You need a fixed evaluation set with clear scoring criteria established before training begins. Run the base model and the fine-tuned model against the same eval set and compare scores. "It looks better" is not a measurement. Track both target-task performance and general capability preservation across every training run.

When does training from scratch actually make sense?

Training from scratch makes sense when you have a genuinely novel domain not covered by existing models' pre-training data, proprietary data at a scale that would shift model capability, or hard legal constraints on using third-party weights. These conditions apply to a small minority of organizations. Most teams that think they need to train from scratch actually need better fine-tuning practices.


Key Takeaways

  • Fine-tuning shapes behavior and style; it does not reliably inject factual knowledge — use RAG for that.
  • Always test prompt engineering thoroughly before committing to a fine-tuning run; it's faster and often sufficient.
  • Training from scratch is rarely justified for business applications; fine-tuning open-weights models is almost always the right starting point.
  • Data quality outweighs data volume in fine-tuning; 500–2,000 clean examples typically beat 10,000 noisy ones.
  • Catastrophic forgetting is real — use conservative learning rates, mixed training data, and parameter-efficient methods like LoRA.
  • Build your evaluation framework before you build your training set; without it, you cannot measure progress or detect regressions.
  • Set latency and cost constraints before selecting a base model; smaller, well-fine-tuned models frequently beat larger, poorly-prompted ones on specific tasks.
