Choosing the Right Relationship With Your AI Model

Agency Script Editorial (Editorial Team) · March 30, 2026 · 11 min read

Most AI projects fail not because the underlying model is wrong, but because the team chose the wrong relationship with that model. They either throw expensive compute at a problem that a few dozen examples could solve, or they expect a general-purpose model to perform specialized work it was never equipped to handle. Getting the training-versus-fine-tuning decision right is one of the highest-leverage choices you'll make in any AI deployment.

This article is not a glossary. If you need a primer on the underlying mechanics, The Complete Guide to Machine Learning Basics covers that ground well. What follows is a set of opinionated, reasoned practices drawn from what actually works — and what quietly destroys projects — when teams decide how to adapt large models for real work.

The stakes are meaningful. Pre-training a large language model from scratch typically costs millions of dollars in compute and months of engineering time. Fine-tuning a model from the same family costs a fraction of that. Prompting costs almost nothing. Picking the right tier isn't just a technical question; it's a business judgment with a wide financial spread.

Understand What You're Actually Choosing Between

Before you can apply best practices, you need a clear mental model of the spectrum. "Training" and "fine-tuning" are often used interchangeably in casual conversation, and that imprecision causes real damage.

Pre-training: Building Knowledge From Scratch

Pre-training is the process by which a model learns general representations from a massive corpus of billions to trillions of tokens. The model learns syntax, world knowledge, reasoning patterns, and latent structure. This is what happens when organizations like OpenAI, Google, or Meta train foundation models. Almost no agency or enterprise team should be doing this. The compute cost alone (typically $1M–$100M+ depending on model scale) rules it out for the vast majority of use cases.

Fine-tuning: Specializing an Existing Model

Fine-tuning starts from a pre-trained checkpoint and continues training on a smaller, domain-specific dataset — typically thousands to hundreds of thousands of examples. The goal is to shift the model's behavior, tone, output format, or domain knowledge in a targeted direction. You preserve the vast general capability while steering it. This is the correct choice for a specific class of problems, which we'll define below.

Prompting and RAG: Often Underestimated

Retrieval-augmented generation (RAG) and sophisticated prompting aren't lesser options; they're frequently the correct ones. If your problem is about giving the model access to current, proprietary, or voluminous information, RAG is usually the better fit, because retrieved documents can be updated without retraining the model. If your problem is about consistency of format or tone, a well-crafted system prompt often gets you most of the way there at near-zero cost. A common mistake, covered in depth in 7 Common Mistakes with Machine Learning Basics (and How to Avoid Them), is skipping these cheaper options and going straight to fine-tuning because it feels more "serious."

The Decision Framework That Actually Works

Stop asking "should I fine-tune?" and start asking a sequence of sharper questions.

Question 1: Is this a knowledge problem or a behavior problem?

If your model doesn't know something (recent events, your proprietary documents, your client's product catalog), that's a knowledge problem. Fine-tuning won't reliably fix it — you'd have to constantly re-fine-tune as facts change, and models can still hallucinate even on material they've been trained on. Use RAG.

If your model knows the relevant domain but behaves wrong — it uses the wrong format, the wrong voice, the wrong level of technical depth, or it's inconsistent across outputs — that's a behavior problem. Fine-tuning addresses behavior problems well.

Question 2: Do you have the data to do it properly?

Fine-tuning on garbage data produces a confidently garbage model. You need clean, diverse, representative examples. As a practical floor: 500–1,000 high-quality input/output pairs for simple behavior shaping; 10,000+ for substantive capability development. If you can't assemble that, fix your data problem before touching training.

Question 3: Can you measure success?

If you can't define a clear eval — a test set with ground truth that you can score programmatically or with human review — you cannot safely fine-tune. You'll have no way to know whether the model improved or regressed, and regressions in fine-tuning are common and subtle.
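
The three questions can be read as a short routing function. The sketch below is one hypothetical encoding, not a prescribed tool: the names `ProjectProfile` and `recommend_approach` are invented for illustration, and the default of 500 examples simply reuses the practical floor mentioned above.

```python
from dataclasses import dataclass


@dataclass
class ProjectProfile:
    """Hypothetical summary of a project; field names are illustrative."""
    problem_type: str    # "knowledge" or "behavior"
    clean_examples: int  # reviewed input/output pairs available
    has_eval_set: bool   # a scoreable test set with ground truth exists


def recommend_approach(p: ProjectProfile, min_examples: int = 500) -> str:
    """Walk the three questions in order and return a recommendation."""
    # Question 1: knowledge problems go to retrieval, not training.
    if p.problem_type == "knowledge":
        return "use RAG"
    if p.problem_type != "behavior":
        raise ValueError(f"unknown problem_type: {p.problem_type!r}")
    # Question 2: too little clean data means the data problem comes first.
    if p.clean_examples < min_examples:
        return "fix data first"
    # Question 3: without an eval, improvement cannot be verified.
    if not p.has_eval_set:
        return "build eval first"
    return "fine-tune"


print(recommend_approach(ProjectProfile("behavior", 1200, True)))
```

The order of the checks matters: a knowledge problem exits before data volume is even considered, which mirrors the sequence of questions above.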

Data Quality Is the Only Moat That Matters

Practitioners routinely underestimate how much the quality of training data determines the ceiling of the fine-tuned model. Quantity helps, but quality determines the shape of what you get.

What "Quality" Actually Means

  • Accuracy: Every example should represent the behavior you want. A single style of error repeated across 30% of your examples will be learned as correct.
  • Diversity: If all your examples are from the same document type, the model won't generalize. Sample across topics, lengths, edge cases, and difficulty levels.
  • Label consistency: If humans are labeling outputs, inter-annotator agreement should be measured and enforced. Inconsistency is noise; noise degrades models.
  • Negative coverage: Include examples of things the model should decline or handle carefully. Models trained only on "do this" without "don't do that" examples develop blind spots.
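
Label consistency is the easiest of these to put a number on. The article does not name a metric, so as an assumption the sketch below uses Cohen's kappa, a common agreement measure for two annotators that corrects for chance agreement. The sample labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for six outputs, judged by two reviewers.
annotator_1 = ["good", "good", "bad", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # → 0.333
```

Here the reviewers agree on four of six items, but half of that agreement is expected by chance, so kappa lands well below the raw 67% agreement rate. What threshold counts as "enforced" is a team decision, not something this sketch settles.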

The Deduplication Problem

Near-duplicate examples don't add information — they add bias. If 40% of your dataset is paraphrases of the same five scenarios, the model will overfit to those scenarios and underperform on anything that looks different. Run deduplication before training, not after you notice the model behaving strangely.
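
One simple way to catch paraphrase-level duplicates is to compare word n-gram overlap. The following is a minimal sketch under stated assumptions: the 0.6 threshold and the three-word shingle size are arbitrary illustrative values, and the pairwise comparison is quadratic, so it only suits small datasets.

```python
def shingles(text, n=3):
    """Set of word n-grams from lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(examples, threshold=0.6):
    """Keep the first example of each near-duplicate group (O(n^2) pass)."""
    kept, kept_shingles = [], []
    for text in examples:
        s = shingles(text)
        # Keep the text only if it is dissimilar to everything kept so far.
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


data = [
    "Please summarize the attached contract for the client",
    "please summarize the attached contract for the client today",
    "Extract every invoice number from this email thread",
]
print(len(drop_near_duplicates(data)))  # → 2
```

For datasets in the tens of thousands, approximate methods such as MinHash with locality-sensitive hashing are the usual replacement for the pairwise loop.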

Choosing the Right Fine-tuning Method

Full fine-tuning (updating all model weights) is rarely the right choice anymore. Parameter-efficient methods have matured significantly and should be your default starting point.

LoRA and QLoRA: Your Practical Default

Low-Rank Adaptation (LoRA) freezes the pre-trained weights and trains a small set of low-rank adapter matrices alongside them. This reduces memory requirements dramatically (often by 60–80%) while capturing most of the behavioral benefit of full fine-tuning. QLoRA combines LoRA with a quantized base model, which makes it possible to fine-tune large models on a single GPU, in some cases a consumer-grade one. For most agency-level fine-tuning tasks, start here.
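
A quick bit of arithmetic shows why the adapters are so small. For a frozen weight matrix of shape (d_out, d_in), LoRA learns an update that is the product of two thin matrices of rank r, so the trainable count is r × (d_out + d_in) instead of d_out × d_in. The layer size and rank below are illustrative assumptions, not figures for any particular model.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters for one LoRA-adapted weight matrix.

    The frozen weight W has shape (d_out, d_in). LoRA learns an update
    B @ A with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    return rank * (d_out + d_in)


d_out = d_in = 4096  # illustrative layer size, not tied to a specific model
rank = 8             # illustrative rank

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)
print(full, lora, f"{lora / full:.2%}")
```

The fraction of trainable parameters is not the same thing as the memory saving. The frozen weights still have to sit in memory for the forward pass; the saving comes mostly from not storing gradients and optimizer state for them.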

When Full Fine-tuning Earns Its Cost

Full fine-tuning makes sense when you need deep behavioral change across the model's entire capability surface — not just output formatting, but underlying reasoning style or domain-specific generation patterns. It also makes sense when you have the infrastructure and the eval suite to detect regressions safely. Without both, full fine-tuning is just expensive risk.

Instruction Tuning vs. Task-Specific Tuning

Instruction tuning teaches a model to follow a range of instructions in a chat-style interface. Task-specific tuning optimizes a model for a single task type (e.g., structured data extraction, code completion in a specific style). Task-specific models can dramatically outperform general instruction-tuned models on the narrow task — but they're brittle outside it. Know which you need before you start.

Evaluation: The Practice Most Teams Skip

Skipping structured evaluation is the single most common way fine-tuning projects quietly go wrong. You push a new model version, it feels better in the demo, and three weeks later a user finds a regression that has been failing silently the whole time.

Build Your Eval Before You Train

Your evaluation suite should exist before you collect a single training example. This forces you to define success concretely. A good eval for a fine-tuned model typically includes:

  • A held-out test set (never touched during training) of at least 200 examples
  • Automated metrics where possible (exact match, ROUGE, BLEU for text tasks; accuracy for classification)
  • A human eval protocol with defined rubrics for subjective qualities like tone and coherence
  • Regression checks on the base model's strengths — you want to know if fine-tuning broke something it previously did well
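
Two of the items above, an automated metric and a regression check, fit in a few lines. This is a sketch with hypothetical names (`exact_match`, `regressions`) and an arbitrary 0.02 tolerance; the suite names and scores are made up.

```python
def exact_match(predictions, references):
    """Fraction of predictions equal to the reference after trimming whitespace."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equal-length, non-empty lists")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def regressions(base_scores, tuned_scores, tolerance=0.02):
    """Eval suites where the tuned model dropped by more than the tolerance.

    A suite missing from tuned_scores is treated as a score of 0.0,
    so forgetting to run it shows up as a regression.
    """
    return sorted(
        name for name, base in base_scores.items()
        if tuned_scores.get(name, 0.0) < base - tolerance
    )


refs = ["42", "Paris", "blue"]
base_out = ["42", "Paris", "green"]
tuned_out = ["42 ", "paris", "blue"]  # trailing space is fine, wrong case is not

print(exact_match(base_out, refs), exact_match(tuned_out, refs))
print(regressions({"task": 0.70, "general_qa": 0.90},
                  {"task": 0.85, "general_qa": 0.80}))  # → ['general_qa']
```

In the made-up scores, the tuned model improves on the target task while losing ten points on general question answering, which is exactly the kind of trade the regression check exists to surface.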

Track Generations, Not Just Metrics

Aggregate metrics can hide important failure modes. Always spot-check raw generations from your test set. A model can achieve high average scores while producing catastrophically wrong outputs on a subset of inputs. Manual review of 50–100 generations per evaluation run catches what metrics miss.
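
A small example makes the point concrete. The records below are fabricated for illustration, and the `slice` field is a hypothetical tag you would attach to each test input yourself (for example, the input category).

```python
from collections import defaultdict
from statistics import mean


def worst_examples(records, k=3):
    """Return the k lowest-scoring records for manual review."""
    return sorted(records, key=lambda r: r["score"])[:k]


def score_by_slice(records):
    """Mean score per slice (e.g. input category)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["slice"]].append(r["score"])
    return {name: mean(scores) for name, scores in buckets.items()}


# Fabricated results: nine easy inputs pass, the single legal input fails.
records = (
    [{"id": i, "slice": "short_email", "score": 1.0} for i in range(9)]
    + [{"id": 9, "slice": "legal_clause", "score": 0.0}]
)

print(mean(r["score"] for r in records))  # overall average looks healthy
print(score_by_slice(records))            # per-slice view exposes the failure
print([r["id"] for r in worst_examples(records, k=1)])
```

The overall average is 0.9, yet every input in the `legal_clause` slice fails. Sorting by score gives reviewers a starting point for the manual pass.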

Deployment and Iteration Practices

A fine-tuned model isn't a finished artifact. It's a starting point for an iteration cycle.

Version Everything

Treat fine-tuned models like software. Every training run should be tagged with: the dataset version used, the base model checkpoint, hyperparameters, eval results, and the date. Without this, you can't reproduce results or safely roll back.
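
A run record can be as plain as a dataclass written to JSON next to the model artifact. The sketch below is one hypothetical layout; the class name, field names, and every value in the example are placeholders rather than a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class TrainingRun:
    """Hypothetical record of one fine-tuning run; field names are illustrative."""
    run_id: str
    dataset_version: str
    base_checkpoint: str
    hyperparameters: dict
    eval_results: dict
    run_date: str = field(default_factory=lambda: date.today().isoformat())

    def to_json(self) -> str:
        # sort_keys keeps the output stable so diffs between runs stay readable.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


run = TrainingRun(
    run_id="run-001",                      # placeholder identifier
    dataset_version="support-tickets-v3",  # placeholder dataset tag
    base_checkpoint="example-base-7b",     # placeholder, not a real model name
    hyperparameters={"learning_rate": 2e-4, "epochs": 3, "lora_rank": 8},
    eval_results={"exact_match": 0.81},    # made-up score
    run_date="2026-03-30",
)
print(run.to_json())
```

Dedicated experiment trackers cover the same ground with more features. The point of the sketch is only that the five items listed above should travel together with each model.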

Shadow Deployment Before Full Rollout

Before replacing your current model in production, run the fine-tuned version in shadow mode — it processes real inputs and logs outputs but doesn't serve users. Compare logged outputs against the production model using your eval rubric. Shadow deployment surfaces real-world distribution shift that your test set doesn't capture.
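
The core of shadow mode is a request handler that always returns the production answer and records the candidate's answer on the side. The sketch below uses stand-in functions in place of real models, and all names are hypothetical.

```python
def handle_request(user_input, production_model, shadow_model, shadow_log):
    """Serve the production answer; record the shadow answer for later review."""
    served = production_model(user_input)
    try:
        shadow_output = shadow_model(user_input)
        error = None
    except Exception as exc:  # a shadow failure must never reach the user
        shadow_output, error = None, repr(exc)
    shadow_log.append({
        "input": user_input,
        "production": served,
        "shadow": shadow_output,
        "shadow_error": error,
    })
    return served


# Stand-in callables; real models would sit behind these.
def production(text):
    return text.upper()


def candidate(text):
    if not text:
        raise ValueError("empty input")
    return text.title()


log = []
print(handle_request("hello world", production, candidate, log))
print(repr(handle_request("", production, candidate, log)))
print(log[1]["shadow_error"])
```

The second call shows the important property: the candidate crashes on empty input, the user still gets the production response, and the failure is captured in the log. In a real service the shadow call would normally run asynchronously so it cannot add latency, which this synchronous sketch does not attempt.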

Plan for Re-fine-tuning

Fine-tuned models go stale. The weights stay fixed while the world, your product, and user behavior keep changing. Budget for quarterly or semi-annual re-fine-tuning cycles as part of your AI operating model. Teams that treat fine-tuning as a one-time event end up with increasingly misaligned models and no clear moment when they noticed the drift. The Step-by-Step Approach to Machine Learning Basics offers a useful framework for building this kind of ongoing discipline into your workflow.

When to Train From Scratch (And Why It's Almost Never You)

There is one legitimate case for training a model from scratch: when the domain is so specialized, the data so proprietary, and the general-purpose model so inadequate that no amount of fine-tuning can close the gap. This describes a small fraction of problems — rare medical imaging tasks, highly specialized scientific language, scenarios where data cannot leave a secure environment and no suitable open-weight model exists.

For everyone else: use a foundation model and adapt it. The Machine Learning Basics: Best Practices That Actually Work article makes this point in a broader context — building on existing foundations is almost always faster, cheaper, and safer than starting over.

Frequently Asked Questions

How much data do I actually need to fine-tune a model?

The honest answer is that it depends on how much behavioral change you need and how good your base model already is. As a practical range, 500–1,000 high-quality examples can meaningfully shift tone and output format; 10,000 or more are appropriate for deeper capability work. Quality matters more than raw count: 300 excellent examples often outperform 3,000 noisy ones.

Will fine-tuning make my model hallucinate less?

Not reliably, and it can make things worse if done carelessly. Fine-tuning teaches behavior, not factual accuracy. If you fine-tune on data that contains errors, the model will learn to reproduce those errors confidently. For factual accuracy, RAG is a more reliable lever than fine-tuning.

Can I fine-tune a model and then use RAG on top of it?

Yes, and this is often the right architecture. Fine-tune to establish behavioral patterns, voice, and output format; layer RAG on top to provide current, accurate, retrievable knowledge. The two approaches are complementary, not competing.

How do I know if fine-tuning is actually better than a good system prompt?

Run both against the same eval set and compare. This is not a philosophical question; it's an empirical one. Many teams are surprised to find that a carefully engineered system prompt with a handful of few-shot examples achieves 85–90% of the performance of a fine-tuned model at a fraction of the cost. Always establish this baseline before investing in fine-tuning.

What's the biggest mistake teams make when fine-tuning?

Training on data they haven't manually reviewed. Automated pipelines that scrape, filter, and package training data without human spot-checking routinely inject noise, bias, and errors that shape the model in unintended ways. Review at least a random sample of your training data before every run. This applies to beginners and experienced teams equally — as noted in Machine Learning Basics: A Beginner's Guide, foundational discipline matters at every level.

How long does a typical fine-tuning run take?

For LoRA-style fine-tuning on a model in the 7B–13B parameter range with a dataset of a few thousand examples, expect 1–4 hours on a single A100 GPU. Full fine-tuning of the same model takes longer and requires more memory — often multiple GPUs and 8–24 hours. Larger models scale accordingly. Cloud providers like AWS, GCP, and Azure have made this compute accessible; budget $50–$500 for a typical experimental run depending on scale.

Key Takeaways

  • The training vs. fine-tuning decision is a business decision first. Cost, data availability, and measurable outcomes should drive it, not technical enthusiasm.
  • Exhaust prompting and RAG before fine-tuning. These cheaper options solve more problems than most practitioners realize.
  • Knowledge problems need RAG; behavior problems need fine-tuning. Conflating the two wastes resources and produces weak results.
  • Data quality is the ceiling. No technique compensates for noisy, inconsistent, or unreviewed training data.
  • Start with LoRA or QLoRA, not full fine-tuning. Parameter-efficient methods deliver most of the gain at a fraction of the cost.
  • Build your eval suite before your dataset. If you can't measure success before you start, you can't know if you've achieved it after.
  • Version, shadow-deploy, and plan for re-fine-tuning. A fine-tuned model is a living artifact, not a finished product.
  • Pre-training from scratch is almost never the right call. If you're considering it, make sure you've genuinely exhausted all the alternatives.
