AGENCYSCRIPT
Stop Treating Model Projects as One-Off Experiments

Agency Script Editorial

Editorial Team

March 12, 2026 · 11 min read

Most teams that treat AI adoption as a series of one-off experiments never build durable capability. They fine-tune a model for one client, train something from scratch for another, and document neither. Six months later, nobody can reproduce the results, the person who ran the process has moved on, and the next project starts from zero. The fix isn't more experimentation. It's a documented, repeatable workflow that makes the train-vs-fine-tune decision predictable and the execution easy to hand off.

The training vs. fine-tuning distinction matters more than most teams realize. Training a model from scratch means building all its weights from random initialization using a large, task-representative dataset. Fine-tuning means taking a pre-trained model that already understands language, images, or code and adjusting its weights — or a subset of them — using a smaller, task-specific dataset. These are not interchangeable options. Each belongs to a specific class of problem, budget, and risk profile. Choosing the wrong path wastes months of compute and produces models that underperform approaches costing a fraction of the effort.

What follows is a practical, documented process for making that decision correctly, running whichever path you choose, and handing the work off to someone else without losing context. It works whether you're an internal AI team at a mid-sized company or an agency delivering AI products to clients. If you want grounding on how models learn before diving into the workflow, Machine Learning Basics: The Questions Everyone Asks, Answered covers the fundamentals without assuming a PhD.

Step 1: Define the Decision Gate

Before touching a dataset or a model card, you need a documented decision gate — a structured checkpoint where you commit to training vs. fine-tuning (or neither) based on explicit criteria. This gate is the most important part of the workflow because it prevents the most expensive mistake: defaulting to "let's fine-tune" because it sounds more accessible, or "let's train from scratch" because it sounds more rigorous.

The Four Decision Variables

Document your answer to each before proceeding:

  • Task novelty. Does the task require understanding that no existing pre-trained model has been exposed to? Truly novel domains — proprietary scientific notation, highly specialized industrial sensor data, undocumented languages — may require training from scratch. Most enterprise and agency tasks do not clear this bar.
  • Data volume. Fine-tuning typically requires hundreds to tens of thousands of labeled examples for good results. Training from scratch requires millions to hundreds of millions. If your client has 800 labeled support tickets, fine-tuning is the ceiling.
  • Compute budget. Training a large model from scratch can cost anywhere from tens of thousands to millions of dollars in cloud compute, depending on model size. Fine-tuning a 7B-parameter model with a parameter-efficient method such as LoRA on a single A100 for a few hours might cost under $50. Budget constraints are a legitimate part of the decision, not a footnote.
  • Control requirements. If compliance or IP rules prohibit building on any third-party model weights, training from scratch may be forced on you regardless of the other variables. On-premise deployment alone does not force it, because open-weight models can be fine-tuned and hosted on your own hardware. Document explicitly which constraint applies.

Record each answer in a shared decision log — a simple table in Notion, Confluence, or even a spreadsheet works. The goal is auditability. When a client asks why you chose fine-tuning six months later, you show them the log, not a vague memory.
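One way to keep the gate auditable is to store each decision as structured data instead of free text. The sketch below is a minimal illustration: the `DecisionGate` fields, the `recommend` function, and every threshold in it are assumptions made for this example, not fixed standards. Adjust them to your own criteria.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json


@dataclass
class DecisionGate:
    """One row of the decision log. All field names are illustrative."""
    project: str
    task_is_novel: bool                # no pre-trained model covers this domain
    labeled_examples: int              # size of the usable labeled dataset
    compute_budget_usd: float
    third_party_weights_allowed: bool  # the control requirement
    decided_on: str = ""
    rationale: str = ""


def recommend(gate: DecisionGate) -> str:
    """Rule-of-thumb recommendation. Thresholds are assumptions, not standards."""
    must_train = gate.task_is_novel or not gate.third_party_weights_allowed
    if must_train:
        # Training from scratch needs millions of examples and a large budget.
        if gate.labeled_examples >= 1_000_000 and gate.compute_budget_usd >= 10_000:
            return "train_from_scratch"
        return "blocked_requirements_exceed_resources"
    if gate.labeled_examples >= 100:
        return "fine_tune_if_baseline_falls_short"
    return "baseline_only"


gate = DecisionGate(
    project="support-ticket-triage",
    task_is_novel=False,
    labeled_examples=800,
    compute_budget_usd=500.0,
    third_party_weights_allowed=True,
    decided_on=date.today().isoformat(),
    rationale="Common task, small dataset, no restriction on base weights.",
)

# The serialized row is what goes into the shared log.
log_row = {**asdict(gate), "recommendation": recommend(gate)}
print(json.dumps(log_row, indent=2))
```

The recommendation is deliberately conditional on the baseline: the gate narrows the options, and the next step decides whether any weight modification is needed at all.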

Step 2: Run the Baseline-First Rule

Whichever path you choose, never start with it. Start with a baseline. This is the single rule most teams skip, and it causes the most wasted effort.

A baseline is the simplest possible version of your solution: a pre-trained model used zero-shot or few-shot, with good prompting, no weight modification. Run it against your evaluation set before you train or fine-tune anything.

Why This Matters

  • If the baseline already solves 80% of the problem, you may not need to fine-tune at all. At modest request volumes, paying for inference on an existing model is usually cheaper than funding and maintaining a fine-tuning project.
  • The baseline gives you a performance floor. Fine-tuning should beat it. If it doesn't, suspect the dataset or the training configuration before you blame the model choice.
  • It gives stakeholders a concrete before/after comparison, which is more persuasive than abstract claims about model improvement.

Document the baseline results — accuracy, F1, latency, cost per inference, whatever metrics matter for your task — in the same shared log as your decision gate. The baseline row in that log becomes the benchmark every subsequent run is measured against.
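As a rough sketch of what a baseline row might contain, the snippet below computes accuracy and F1 for a small classification task using only the standard library. The labels and predictions are made-up placeholders standing in for real zero-shot model outputs, and the row's field names are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Placeholder data: in practice these come from your frozen evaluation set
# and from the pre-trained model's zero-shot or few-shot answers.
labels = ["bug", "billing", "bug", "bug", "billing"]
baseline_preds = ["bug", "bug", "bug", "billing", "billing"]

baseline_row = {
    "run": "baseline-zero-shot",
    "accuracy": accuracy(labels, baseline_preds),
    "f1_bug": f1_for_class(labels, baseline_preds, positive="bug"),
}
print(baseline_row)
```

Latency and cost per inference would be added to the same row from your serving logs; they are left out here because they depend on the deployment.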

Step 3: Prepare the Dataset as a First-Class Artifact

Data preparation is where most fine-tuning projects fail, and where most training-from-scratch projects are won or lost. The dataset is not a preprocessing step — it is a deliverable, and it should be treated with the same documentation discipline as code.

For Fine-Tuning

  • Format consistency. Your examples need to follow the exact prompt-response format your model expects. Inconsistent formatting degrades performance faster than small dataset size.
  • Quality over quantity. 500 clean, diverse, correctly labeled examples routinely outperform 5,000 noisy ones. Build a labeling rubric before your labelers touch the data.
  • Split discipline. Use a train/validation/test split. Never evaluate on training data. Keep the test set frozen — don't iterate against it, or you'll overfit your decisions to it.
  • Version the dataset. Every iteration of the dataset gets a version number, a change log entry, and a storage location. This is non-negotiable for reproducibility. Tools like DVC or even timestamped S3 folders work; what doesn't work is "the latest CSV on someone's laptop."
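Split discipline and versioning can both be made mechanical. The sketch below assigns each example to a split by hashing its ID, so an example never migrates from train to test when the dataset grows, and computes a content fingerprint to record next to the version number. The function names, the `id` field, and the 80/10/10 ratio are assumptions for illustration.

```python
import hashlib
import json


def assign_split(example_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Hash the ID so an example keeps the same split across dataset versions."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def dataset_fingerprint(examples: list) -> str:
    """Short content hash of the dataset, independent of row order."""
    rows = sorted(json.dumps(ex, sort_keys=True) for ex in examples)
    return hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()[:12]


# Placeholder examples standing in for real labeled data.
examples = [
    {"id": f"ticket-{i}", "text": f"example text {i}", "label": "billing"}
    for i in range(1000)
]

splits = {"train": [], "validation": [], "test": []}
for ex in examples:
    splits[assign_split(ex["id"])].append(ex)

manifest = {
    "dataset_version": "v1",
    "fingerprint": dataset_fingerprint(examples),
    "split_sizes": {name: len(rows) for name, rows in splits.items()},
}
print(json.dumps(manifest, indent=2))
```

Committing the manifest alongside the change log gives a reviewer a quick way to confirm that the file they hold is the one a given run was trained on.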

For Training from Scratch

Everything above applies, at a scale several orders of magnitude larger. Add data provenance documentation (where each corpus came from and what its licensing terms are) and deduplication checks. Training a large model on duplicate data is a known path to memorization and poor generalization, a risk catalogued in detail in The Hidden Risks of Machine Learning Basics (and How to Manage Them).
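A deduplication check can start very simply. The sketch below removes exact duplicates after normalizing whitespace and case; `normalize` and `deduplicate` are hypothetical helper names, and the approach is an assumption about what a first pass might look like. It will not catch near-duplicates such as lightly edited copies, which need fuzzy techniques beyond this example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs):
    """Return (unique_docs, removed_count), keeping the first occurrence."""
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique, len(docs) - len(unique)


# Placeholder corpus with one duplicate that differs only in case and spacing.
corpus = [
    "Reset your password.",
    "reset  your   password.",
    "Contact billing.",
]
unique_docs, removed = deduplicate(corpus)
print(f"kept {len(unique_docs)} documents, removed {removed} duplicates")
```

Recording the removed count per corpus in the provenance notes makes it visible when a new data source is unusually repetitive.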

Step 4: Configure and Run the Training Process

This is the step most tutorials treat as the whole job. It's important, but in a repeatable workflow, documenting the configuration matters nearly as much as executing the run.

Fine-Tuning Configuration Checklist

Document each of these in your run config file (not in your head):

  • Base model and version (e.g., mistral-7b-instruct-v0.2, not just "Mistral")
  • Fine-tuning method: full fine-tuning, LoRA, QLoRA, prefix tuning. Each trades off compute cost against parameter coverage.
  • Learning rate and scheduler: full fine-tuning typically uses learning rates around 1e-5 to 5e-5, lower than pre-training from scratch; LoRA-style adapters commonly use higher rates, roughly 1e-4 to 5e-4
  • Batch size and gradient accumulation steps
  • Number of epochs or steps, and early stopping criteria
  • Evaluation frequency: how often you run the validation set during training

Use a framework like Hugging Face Transformers + PEFT, Axolotl, or LlamaFactory, and commit your config file to version control alongside your training script. If someone else needs to reproduce your run, they should be able to do it from the repo alone.
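The checklist above maps naturally onto a small, serializable config object. The sketch below is framework-agnostic and uses only the standard library; `RunConfig` and its field names are assumptions for illustration and do not correspond to the config schema of Transformers, PEFT, Axolotl, or LlamaFactory. The hyperparameter values are placeholders, not recommendations.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class RunConfig:
    """Everything needed to reproduce one fine-tuning run (illustrative)."""
    base_model: str
    method: str                      # "full", "lora", "qlora", or "prefix"
    learning_rate: float
    scheduler: str
    batch_size: int                  # per device
    gradient_accumulation_steps: int
    num_epochs: int
    early_stopping_patience: int     # evaluations without improvement
    eval_every_steps: int
    dataset_version: str
    seed: int = 42

    def effective_batch_size(self) -> int:
        """Examples per optimizer step, assuming a single device."""
        return self.batch_size * self.gradient_accumulation_steps

    def validate(self) -> None:
        if self.method not in {"full", "lora", "qlora", "prefix"}:
            raise ValueError(f"unknown fine-tuning method: {self.method}")
        if not 0 < self.learning_rate < 1:
            raise ValueError("learning_rate must be between 0 and 1")


config = RunConfig(
    base_model="mistral-7b-instruct-v0.2",
    method="lora",
    learning_rate=2e-4,
    scheduler="cosine",
    batch_size=4,
    gradient_accumulation_steps=8,
    num_epochs=3,
    early_stopping_patience=2,
    eval_every_steps=100,
    dataset_version="v1",
)
config.validate()

# This JSON is what gets committed next to the training script.
config_json = json.dumps(asdict(config), indent=2, sort_keys=True)
print(config_json)
```

Freezing the dataclass is a small design choice that prevents a config from being mutated mid-run, so the committed file and the executed run cannot silently diverge.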

Training from Scratch — Additional Considerations

Add architecture decisions (number of layers, attention heads, context window), tokenizer design or selection, and distributed training configuration. These choices have downstream consequences that are difficult to reverse, which is why The Machine Learning Basics Playbook treats architecture selection as a governance decision, not just a technical one.

Step 5: Evaluate Against the Baseline, Not Just Absolute Metrics

Evaluation is a step most teams rush because the numbers feel good in isolation. A model with 91% accuracy sounds strong until you remember your baseline was 89%.

Build an Evaluation Suite, Not a Single Number

  • Task-specific metrics first: ROUGE for summarization, exact match or F1 for QA, precision/recall for classification
  • Behavioral tests: does the model handle edge cases your baseline failed on? Does it refuse appropriately? Does it hallucinate less?
  • Regression tests: does fine-tuning on your task degrade performance on general capabilities you still need?
  • Cost and latency: a fine-tuned smaller model that matches a larger baseline on quality often wins on cost per inference, which matters at scale

Document every evaluation run with the model version, dataset version, and hardware used. Evaluation results without this metadata are nearly useless for reproducibility.
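To make the comparison explicit, an evaluation report can carry deltas against the baseline instead of raw scores, and refuse to be written without its metadata. The sketch below uses the 91% versus 89% example from above; `compare_to_baseline`, the 5-point minimum gain, and the metadata values are assumptions for illustration. It treats every metric as higher-is-better, so cost and latency would need their sign flipped.

```python
REQUIRED_METADATA = ("model_version", "dataset_version", "hardware")


def compare_to_baseline(baseline, candidate, min_gain, metadata):
    """Build an evaluation record expressed as deltas against the baseline."""
    missing = [key for key in REQUIRED_METADATA if not metadata.get(key)]
    if missing:
        raise ValueError(f"evaluation run is missing metadata: {missing}")
    # Only metrics present in both runs can be compared.
    deltas = {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }
    beats = bool(deltas) and all(d >= min_gain for d in deltas.values())
    return {**metadata, "deltas": deltas, "beats_baseline": beats}


report = compare_to_baseline(
    baseline={"accuracy": 0.89},
    candidate={"accuracy": 0.91},
    min_gain=0.05,  # assumed bar for "worth the fine-tuning effort"
    metadata={
        "model_version": "v1.0.0",
        "dataset_version": "v1",
        "hardware": "1x A100",
    },
)
print(report)
```

With these inputs the run improves on the baseline by two points but does not clear the assumed five-point bar, which is exactly the situation a single headline number hides.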

Step 6: Package for Handoff

The workflow isn't complete until someone who wasn't in the room can pick it up. This is where most AI work fails operationally — knowledge lives in Slack threads and individual memory, not documentation.

The Handoff Package

A complete handoff for any training or fine-tuning project includes:

  • Decision log: the four decision variables, the baseline results, and why training vs. fine-tuning was chosen
  • Dataset documentation: version, source, labeling rubric, split ratios, known limitations
  • Run configuration: committed to version control, not in a Google Doc
  • Evaluation report: metrics, behavioral test results, comparison to baseline
  • Deployment notes: where the model is hosted, how it's versioned, what monitoring is in place
  • Known failure modes: what inputs break the model, what edge cases weren't covered in training data
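The package list above can double as a pre-handoff check. The sketch below is a minimal illustration, assuming each artifact is tracked as a path or URL in a simple dictionary; the key names and file paths are hypothetical.

```python
# Artifact names mirror the handoff list; the identifiers are illustrative.
REQUIRED_ARTIFACTS = [
    "decision_log",
    "dataset_documentation",
    "run_configuration",
    "evaluation_report",
    "deployment_notes",
    "known_failure_modes",
]


def missing_artifacts(package: dict) -> list:
    """Return required artifacts that are absent or have an empty location."""
    return [
        name
        for name in REQUIRED_ARTIFACTS
        if not str(package.get(name) or "").strip()
    ]


# Placeholder package: two artifacts have not been written yet.
package = {
    "decision_log": "docs/decision-log.md",
    "dataset_documentation": "docs/dataset-v1.md",
    "run_configuration": "configs/run.json",
    "evaluation_report": "reports/eval.md",
    "deployment_notes": "",
}

gaps = missing_artifacts(package)
if gaps:
    print("Handoff incomplete, missing:", ", ".join(gaps))
else:
    print("Handoff package complete")
```

A check like this only confirms that each artifact exists, not that it is any good; a human review of the contents is still part of the handoff.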

This package is what distinguishes an agency or internal team that builds durable AI capability from one that rebuilds from scratch on every project. Building a Repeatable Workflow for Machine Learning Basics covers the broader documentation discipline this approach fits into.

Step 7: Build the Iteration Loop

A single training or fine-tuning run is rarely the end state. Production data drifts, task requirements change, and the model you deployed in Q1 may underperform by Q3. The workflow needs an iteration protocol.

Iteration Triggers

Define these in advance, not reactively:

  • Performance drops below a defined threshold on a monitored metric
  • New labeled data exceeds a defined volume (e.g., 500 new examples accumulated)
  • Task definition changes materially
  • A new base model is released that, on your eval set, outperforms the model you fine-tuned

When a trigger fires, you don't start over from a blank page. Revisit the Step 1 decision gate with updated inputs to decide whether to re-train, re-fine-tune, or switch base models, then return to Step 3 with the accumulated data, re-run from there, and document the delta in the decision log. This turns the workflow into a loop, not a one-time event.
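Defining triggers in advance is easier when they live in code or config instead of in someone's judgment. The sketch below checks the four triggers listed above against thresholds fixed before deployment; the `TriggerThresholds` class, the function name, and the example values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerThresholds:
    """Agreed before deployment and stored with the run config (illustrative)."""
    min_metric: float        # monitored metric must stay at or above this
    max_new_examples: int    # retrain once this many new labels accumulate


def fired_triggers(
    current_metric: float,
    new_examples: int,
    task_changed: bool,
    better_base_model: bool,
    thresholds: TriggerThresholds,
) -> list:
    """Return the names of every iteration trigger that has fired."""
    fired = []
    if current_metric < thresholds.min_metric:
        fired.append("performance_drop")
    if new_examples >= thresholds.max_new_examples:
        fired.append("new_data_volume")
    if task_changed:
        fired.append("task_definition_changed")
    if better_base_model:
        fired.append("better_base_model_available")
    return fired


thresholds = TriggerThresholds(min_metric=0.85, max_new_examples=500)
result = fired_triggers(
    current_metric=0.84,   # placeholder production reading
    new_examples=120,
    task_changed=False,
    better_base_model=False,
    thresholds=thresholds,
)
print(result or "no triggers fired")
```

Run on a schedule against production metrics, a check like this turns "the model feels worse lately" into a logged event with a named cause.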

For teams new to distinguishing which problems genuinely benefit from model adaptation versus better prompting, Machine Learning Basics: Myths vs Reality is worth a read before defining your iteration triggers — it challenges several assumptions that lead teams to fine-tune unnecessarily.

Frequently Asked Questions

What's the most common mistake teams make when choosing between training and fine-tuning?

Skipping the baseline. Teams assume fine-tuning is necessary when a well-prompted general model would already meet the performance bar. Running a zero-shot or few-shot baseline first takes a few hours and can save weeks of fine-tuning work. Always know what you're improving on before you start improving it.

How much data do you actually need to fine-tune a large language model?

It depends on task complexity and base model quality, but useful fine-tuning has been demonstrated with as few as 100–500 high-quality examples for narrow, well-defined tasks. For broader capability improvements, 5,000–50,000 examples is a more typical range. Data quality — consistent formatting, accurate labels, diverse coverage — matters more than raw volume past a certain threshold.

Is training from scratch ever the right call for an agency or mid-market team?

Rarely, and less often than people expect. The compute cost, data volume requirements, and specialized expertise needed put it out of reach for most teams on most projects. The realistic cases are highly proprietary domains with no useful pre-trained base, or regulatory environments that prohibit using any third-party model weights as a starting point. For almost everything else, fine-tuning a strong open-source base model is the better path.

How do you version models in a repeatable workflow?

Treat model weights like software releases: semantic versioning (v1.0.0, v1.1.0, v2.0.0), stored in a consistent location (a model registry, S3, or a private Hugging Face Hub repo), with a changelog entry for each version documenting what changed, why, and what the evaluation delta was. Never overwrite a production model in place. Always deploy to a new version slot and maintain rollback capability.
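The no-overwrite and rollback rules can be enforced by the registry itself. The sketch below is an in-memory stand-in, not the API of any real model registry: `ModelRegistry`, its methods, and the storage URIs are hypothetical and exist only to show the behavior.

```python
class ModelRegistry:
    """In-memory illustration of versioned model releases with rollback."""

    def __init__(self):
        self._entries = {}
        self._promoted = []  # production history, oldest first

    def register(self, version, weights_uri, changelog, eval_delta):
        """Record a new version. Existing versions are never overwritten."""
        if version in self._entries:
            raise ValueError(f"{version} already exists; publish a new version")
        self._entries[version] = {
            "weights_uri": weights_uri,
            "changelog": changelog,
            "eval_delta": eval_delta,
        }

    def promote(self, version):
        """Make a registered version the production model."""
        if version not in self._entries:
            raise KeyError(version)
        self._promoted.append(version)

    def rollback(self):
        """Return production to the previously promoted version."""
        if len(self._promoted) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self._promoted.pop()
        return self._promoted[-1]

    @property
    def production(self):
        return self._promoted[-1] if self._promoted else None


registry = ModelRegistry()
registry.register("v1.0.0", "s3://example-bucket/models/v1.0.0/",
                  "Initial fine-tune.", {"accuracy": 0.02})
registry.register("v1.1.0", "s3://example-bucket/models/v1.1.0/",
                  "Added new labeled examples.", {"accuracy": 0.01})
registry.promote("v1.0.0")
registry.promote("v1.1.0")
print("production:", registry.production)
print("after rollback:", registry.rollback())
```

A hosted registry or a bucket with object versioning gives the same guarantees durably; the point of the sketch is that overwriting should be an error, not a convention people are asked to follow.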

When should fine-tuning be replaced by better prompting or retrieval-augmented generation?

When the task is primarily about accessing specific knowledge rather than learning a behavioral pattern or style. If the problem is "the model doesn't know our product documentation," RAG is almost always the right answer. If the problem is "the model writes in the wrong tone, follows the wrong format, or reliably fails on a narrow class of tasks," fine-tuning addresses something RAG cannot. These approaches also combine well.

How do you know when a fine-tuned model needs to be retrained?

Monitor task-specific metrics in production — not just uptime. When production performance diverges from your held-out test set results by more than a defined threshold, or when a significant volume of new labeled data has accumulated, it's time to iterate. Set the trigger criteria before deployment; don't wait until a client complaint surfaces the problem.

Key Takeaways

  • The training vs. fine-tuning decision belongs in a documented gate, not an informal conversation. Four variables drive it: task novelty, data volume, compute budget, and control requirements.
  • Always run a baseline before fine-tuning or training. If a zero-shot model meets the bar, you're done.
  • Treat your dataset as a versioned deliverable with documentation, not a preprocessing step.
  • Commit run configurations to version control. Reproducibility is a minimum standard, not a bonus.
  • Evaluate against the baseline, not just absolute numbers. Include behavioral and regression tests alongside task metrics.
  • A complete handoff package — decision log, dataset docs, run config, evaluation report, failure modes — is what separates durable AI capability from one-off experiments.
  • Build iteration triggers before deployment so the loop is defined, not reactive.
