A guide tells you what is true. A playbook tells you what to do next, who owns it, and what triggers the move. This is the operating version of how AI training data gets collected, written for the person who actually has to run the pipeline rather than admire it from a distance.
Each play below has a trigger that fires it, an owner who runs it, and a clear handoff to the next play. Run them in sequence the first time, then let the triggers drive the loop after that. If you want the conceptual scaffolding behind these moves, read A Framework for How AI Training Data Is Collected alongside this.
Play 1: Define the data contract
Trigger: A new model or fine-tune is approved. Owner: ML lead plus a product stakeholder.
Before anyone touches a crawler, write down exactly what the dataset must contain and what it must exclude. A data contract is a one-page document that specifies the task, the input and output format, the coverage you need across categories, the minimum label quality, and the prohibited content. Without this, every later play drifts.
What the contract must pin down
- The exact task the model performs and the examples that represent it.
- Required coverage, for example a minimum count per language, topic, or difficulty band.
- Hard exclusions: personal data, licensed content you lack rights to, known benchmark sets.
- The acceptance bar for labels, expressed as inter-annotator agreement or expert sign-off.
Handoff: The contract becomes the spec the sourcing play is measured against.
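One way to make the contract measurable is to keep a machine-readable copy next to the one-page document. The sketch below is a minimal illustration, not a prescribed schema: the `DataContract` class, its field names, the `coverage_gaps` helper, and the example values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical machine-readable contract; every field name is illustrative."""
    task: str                                # the exact task the model performs
    input_format: str
    output_format: str
    min_count_per_category: dict[str, int]   # required coverage
    hard_exclusions: tuple[str, ...]         # content that must never appear
    min_agreement: float                     # acceptance bar for labels


def coverage_gaps(contract: DataContract, counts: dict[str, int]) -> dict[str, int]:
    """Return how many more records each under-covered category needs."""
    return {
        category: required - counts.get(category, 0)
        for category, required in contract.min_count_per_category.items()
        if counts.get(category, 0) < required
    }


# Example values are made up for illustration.
contract = DataContract(
    task="classify support tickets by intent",
    input_format="plain text",
    output_format="one label from a fixed set",
    min_count_per_category={"en": 500, "de": 200},
    hard_exclusions=("personal data", "unlicensed content", "benchmark test sets"),
    min_agreement=0.8,
)

print(coverage_gaps(contract, {"en": 520, "de": 150}))  # {'de': 50}
```

Keeping the coverage requirement in code means the sourcing and validation plays can both call the same check instead of re-reading the document.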
Play 2: Source against the contract
Trigger: A signed data contract exists. Owner: Data engineering.
Now you gather. Pull from the cheapest acceptable source first and escalate only when coverage gaps remain. Start with curated public corpora and data you already own. Move to licensed datasets for gaps that public data cannot fill. Use web crawling last, because it carries the most cleaning and legal overhead.
For every source, record provenance at the moment of collection: where it came from, the license or rights basis, the date, and the collecting tool. This metadata is non-negotiable. You cannot reconstruct it later, and you will need it the first time someone asks where a specific example came from. The tools roundup covers what to use for crawling, storage, and provenance tracking.
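A minimal sketch of capturing those four fields at ingestion time follows. The `Provenance` class, the `ingest` function, and the example source and tool names are hypothetical; the point is only that a record cannot enter the dataset without its provenance attached.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class Provenance:
    source: str          # where the record came from
    rights_basis: str    # license or other rights basis
    collected_on: date   # collection date
    tool: str            # the collecting tool

    def __post_init__(self):
        # Refuse blank fields so provenance cannot be silently skipped.
        for name in ("source", "rights_basis", "tool"):
            if not getattr(self, name).strip():
                raise ValueError(f"provenance field '{name}' must not be empty")


def ingest(text: str, provenance: Provenance) -> dict:
    """Attach provenance to a record at the moment it is collected."""
    meta = asdict(provenance)
    meta["collected_on"] = provenance.collected_on.isoformat()
    return {"text": text, "provenance": meta}


# Illustrative values only.
record = ingest(
    "An example document.",
    Provenance(
        source="https://example.com/articles/1",
        rights_basis="CC-BY-4.0",
        collected_on=date(2024, 1, 15),
        tool="hypothetical-crawler 0.1",
    ),
)
print(record["provenance"]["collected_on"])  # 2024-01-15
```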
Handoff: A raw dataset with full provenance metadata moves to cleaning.
Play 3: Clean and deduplicate
Trigger: A raw dataset lands in staging. Owner: Data engineering.
Raw data is never training-ready. Run the standard pipeline in order, because order matters.
The cleaning sequence
- Deduplicate. Remove exact and near-duplicate records first so you are not cleaning the same junk twice.
- Strip boilerplate. Remove navigation, ads, and templated noise from crawled pages.
- Filter quality. Drop spam, gibberish, and machine-generated noise using heuristics or a quality classifier.
- Redact PII. Scrub names, emails, and identifiers that the contract prohibits.
- Decontaminate. Remove any overlap with benchmark test sets so evaluation stays honest.
Log how many records each step removed. A sudden spike or drop in any step's removal count is your earliest signal that something upstream broke.
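The ordering and the per-step removal log can be sketched as a small runner. This is a simplified illustration under stated assumptions: it covers only three of the five steps, deduplication is exact-match after normalization (near-duplicate detection is left out), the quality filter is a bare word-count heuristic, and PII redaction handles email addresses only. All function names are hypothetical.

```python
import hashlib
import re

# Simple email pattern for illustration; real PII redaction needs far more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def dedupe(records):
    """Drop exact duplicates after case and whitespace normalization."""
    seen, kept = set(), []
    for text in records:
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def filter_quality(records, min_words=3):
    """Toy heuristic: drop records with fewer than min_words words."""
    return [t for t in records if len(t.split()) >= min_words]


def redact_emails(records):
    """Replace email addresses with a placeholder token."""
    return [EMAIL.sub("[EMAIL]", t) for t in records]


def run_pipeline(records, steps):
    """Run steps in order and log how many records each one removed."""
    log = []
    for name, step in steps:
        before = len(records)
        records = step(records)
        log.append((name, before - len(records)))
    return records, log


raw = [
    "Contact me at a@b.com for details",
    "contact me at  a@b.com for details",
    "buy now",
    "A clean, useful sentence here",
]
cleaned, log = run_pipeline(raw, [
    ("deduplicate", dedupe),
    ("filter_quality", filter_quality),
    ("redact_pii", redact_emails),
])
print(log)  # [('deduplicate', 1), ('filter_quality', 1), ('redact_pii', 0)]
```

Redaction rewrites records rather than dropping them, so its removal count stays at zero; a nonzero value there would itself be a signal worth investigating.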
Handoff: A clean, decontaminated corpus moves to labeling.
Play 4: Label and annotate
Trigger: A clean corpus needs human judgment added. Owner: Annotation lead plus vendor.
This is where quality is won or lost. Write annotation guidelines with worked examples for the easy cases and, more importantly, the edge cases. Run a calibration round on a small batch, measure agreement between annotators, and fix the guidelines before scaling. Only when agreement clears the contract's bar do you open the floodgates.
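Agreement in a calibration round can be measured in several ways. The sketch below uses Cohen's kappa for two annotators, which is one common choice and an assumption here, since the playbook does not name a specific metric. The labels and the 0.8 bar are made-up example values.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["spam", "spam", "ham", "ham"]
annotator_b = ["spam", "ham", "ham", "ham"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(kappa)  # 0.5

# Hypothetical bar taken from the data contract.
ready_to_scale = kappa >= 0.8
```

With more than two annotators, a multi-rater measure would be needed instead; the gate itself (compare against the contract's bar before scaling) stays the same.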
Build in a review layer. A second annotator or an expert spot-checks a sample of every batch, and disagreements feed back into the guidelines. Skipping this is the most common way teams ship a dataset that looks finished but quietly teaches the model the wrong thing. The examples guide shows what good and bad labeled batches look like in practice.
Handoff: A labeled dataset with measured agreement scores moves to validation.
Play 5: Validate before training
Trigger: Labeling is complete. Owner: ML lead.
Do not train yet. First confirm the dataset matches the contract. Check coverage against the required counts, sample records by hand and read them, and run automated checks for label distribution and leakage. Hold out a clean evaluation slice that never enters training.
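Two of these checks, coverage against required counts and leakage between training and the held-out slice, can be sketched as follows. The record layout (`text` and `category` keys), the function names, and the exact-match leakage test are assumptions for illustration; real leakage checks often need fuzzy matching as well.

```python
import random
from collections import Counter


def split_holdout(records, eval_fraction=0.1, seed=0):
    """Shuffle deterministically and carve off a held-out evaluation slice."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[cut:], shuffled[:cut]


def validate(train, eval_set, required_counts):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    counts = Counter(r["category"] for r in train)
    for category, required in required_counts.items():
        if counts[category] < required:
            problems.append(
                f"coverage: {category} has {counts[category]}, needs {required}"
            )
    # Exact-text overlap between training and evaluation data.
    leaked = {r["text"] for r in train} & {r["text"] for r in eval_set}
    if leaked:
        problems.append(f"leakage: {len(leaked)} eval texts also appear in training")
    return problems


# Toy dataset: 20 unique records across two categories.
records = [
    {"text": f"example {i}", "category": "en" if i % 2 == 0 else "de"}
    for i in range(20)
]
train, eval_set = split_holdout(records, eval_fraction=0.25)
print(validate(train, eval_set, {"en": 1, "de": 1}))  # []
```

These automated checks complement reading sampled records by hand; they do not replace it.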
This play exists because training is expensive and slow to debug. Catching a coverage gap or a leaked benchmark here costs an afternoon. Catching it after a training run costs a week and your credibility.
Handoff: A validated dataset and a held-out eval set go to training.
Play 6: Train, evaluate, and feed back
Trigger: A validated dataset is ready. Owner: ML lead.
Train, then evaluate on the held-out slice and on real user-style inputs. Read the failures. Most failures trace back to a data problem: missing coverage, mislabeled examples, or a filtering choice that erased something important.
The feedback loop
- Cluster the model's errors by type.
- Trace each cluster back to a data cause using your provenance metadata.
- Write the fix as an amendment to the data contract.
- Re-trigger Play 2 for just the affected slice.
This is what makes the playbook a loop rather than a one-shot. With each cycle the dataset gets sharper, because failures point straight at their data cause.
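The first two steps of the loop can be sketched as a grouping over tagged failures. This assumes each failure has already been given an `error_type` tag (by hand or by some clustering method not shown) and carries the `source` from its provenance metadata; both field names and all example values are hypothetical.

```python
from collections import Counter, defaultdict


def trace_errors(errors):
    """Group failures by error type and count the data sources behind each group."""
    by_type = defaultdict(Counter)
    for err in errors:
        by_type[err["error_type"]][err["source"]] += 1
    # most_common() lists the likeliest data cause first.
    return {etype: sources.most_common() for etype, sources in by_type.items()}


# Illustrative failures only.
errors = [
    {"error_type": "wrong_language", "source": "crawl-2024"},
    {"error_type": "wrong_language", "source": "crawl-2024"},
    {"error_type": "wrong_language", "source": "licensed-set"},
    {"error_type": "mislabel", "source": "vendor-batch-3"},
]
report = trace_errors(errors)
print(report["wrong_language"])  # [('crawl-2024', 2), ('licensed-set', 1)]
```

A source that dominates one error type is the natural candidate for the contract amendment and the slice to re-source.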
Sequencing and ownership at a glance
Run Plays 1 through 6 in order for the first build. After that, the feedback loop in Play 6 re-triggers earlier plays surgically rather than rebuilding everything. Keep ownership tight: one ML lead accountable for the contract and validation, one data engineering owner for sourcing and cleaning, one annotation lead for labels. When ownership blurs, provenance gaps and quality drift creep in immediately.
Frequently Asked Questions
How long does a full pass through this playbook take?
For a focused fine-tune, a small team can run Plays 1 through 6 in two to four weeks. Large foundation datasets take months. Labeling is the longest stage, and the calibration round in Play 4 is almost always the biggest time sink within it, so budget for it generously rather than discovering it under deadline.
Can I skip the data contract for a quick experiment?
You can, but you will pay for it. Even a half-page contract prevents the most common waste: collecting the wrong data and discovering it after labeling. For a true throwaway prototype, write three bullet points instead of a full contract. For anything you might ship, do not skip it.
Who should own provenance tracking?
Whoever does the sourcing in Play 2 owns it, because provenance must be captured at collection time. Trying to bolt it on later never works, since the original context is gone. Make it a required field in your ingestion tooling so it cannot be skipped.
What is the most commonly skipped play?
Validation in Play 5. Teams are eager to train and treat a labeled dataset as finished. Skipping validation is how leaked benchmarks and coverage gaps survive into the model, where they are far more expensive to find. Treat Play 5 as mandatory.
How does this playbook handle synthetic data?
Synthetic data slots into Play 2 as another source, but it needs extra scrutiny in Plays 3 and 5. Watch for repetition, narrow patterns, and inherited model errors. Keep a human review sample larger than you would for human-written data, because synthetic data fails quietly.
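One rough way to watch for repetition in a synthetic batch is the share of distinct word n-grams. The sketch below is an assumption about how such a check might look, not part of the playbook itself; the function name and the example texts are hypothetical, and any alert threshold would need tuning against your own human-written data.

```python
def distinct_ngram_ratio(texts, n=3):
    """Share of word n-grams that are unique across the batch (1.0 = no repetition)."""
    total, unique = 0, set()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            unique.add(tuple(words[i:i + n]))
            total += 1
    return len(unique) / total if total else 1.0


batch = ["the cat sat on the mat", "the cat sat on the rug"]
print(distinct_ngram_ratio(batch))  # 0.625
```

A ratio that falls well below the value measured on comparable human-written data suggests the generator is recycling phrasing and the batch deserves a larger review sample.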
Key Takeaways
- Run six plays in sequence: contract, source, clean, label, validate, then train and feed back.
- A one-page data contract written first prevents the most expensive failures downstream.
- Capture provenance at collection time; it cannot be reconstructed later.
- Labeling quality is won in calibration, not in volume; measure agreement before scaling.
- Validation before training is the most-skipped and most-valuable play.
- The feedback loop in Play 6 re-triggers earlier plays surgically, turning a one-shot build into a repeatable loop. Pair this with the step-by-step how-to for execution detail.