Generic advice tells you to use clean, diverse, high-quality data. That is true and almost useless, because it does not tell you what to do on a Tuesday afternoon when you are staring at a pile of raw text. This article takes the opposite approach: specific, opinionated practices with the reasoning behind each, drawn from how strong data pipelines behave in practice.
These are not laws. They are defaults that hold up under pressure. Where a practice has a trade-off, we name it, because a best practice you cannot reason about is just a slogan.
Treat Data Work as the Real Work
The biggest mindset shift is to stop treating data collection as setup for the "real" modeling work. For applied projects, the data is the work. The model is a commodity you borrow.
In practice this means budgeting most of your time for collection, cleaning, and labeling, and accepting that the unglamorous parts deserve your best people. Teams that respect data work tend to outperform teams that rush through it to get to training. If you want the end-to-end sequence, the step-by-step guide lays it out.
Curate Aggressively, Collect Conservatively
The reflex to gather everything is wrong for most projects. A smaller, sharply curated dataset usually beats a sprawling one.
Why Curation Wins
- Every low-quality example you remove raises the average signal.
- Smaller datasets are cheaper to clean, label, and audit thoroughly.
- You can actually understand what is in a curated dataset, which makes debugging possible.
The trade-off: aggressive curation risks dropping rare-but-important cases. Counter it by auditing for coverage gaps and collecting specifically to fill them, rather than collecting broadly and hoping.
Make Provenance Non-Negotiable
Record source, date, and usage rights for every batch at the moment you collect it. This sounds bureaucratic until the day someone asks whether you have the right to use a dataset, or you need to remove one source cleanly.
The reasoning is risk asymmetry. Capturing provenance costs minutes. Reconstructing it later costs days and is often impossible. The expected value strongly favors doing it upfront. Missing provenance is one of the most common failures in our common mistakes roundup.
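As a minimal sketch of what capturing provenance at collection time can look like, the snippet below appends one record per batch to a JSON Lines log and looks up every batch from a given source. The field names, the `ProvenanceRecord` class, and both helper functions are illustrative assumptions, not a standard schema; adapt them to whatever your pipeline already tracks.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProvenanceRecord:
    """One row per collected batch. All field names are hypothetical."""
    batch_id: str
    source: str        # where the data came from (URL, vendor, internal system)
    collected_on: str  # ISO date of collection, e.g. "2024-01-05"
    usage_rights: str  # license or agreement that covers this batch


def append_record(path: str, record: ProvenanceRecord) -> None:
    """Append one record to a JSON Lines log at the moment of collection."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def batches_from_source(path: str, source: str) -> list[str]:
    """Return the ids of every batch from one source, so it can be removed cleanly."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r["batch_id"] for r in records if r["source"] == source]
```

An append-only log is a deliberate choice here: it takes seconds to write and never requires reconstructing history, which is the cheap side of the asymmetry described above.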
Decontaminate Before You Trust Any Number
Always remove training examples that overlap with your evaluation set. Without this, a model can score brilliantly by memorizing the answers and then fail on genuinely new inputs.
Keep one held-out test set that never touches training, ever. Treat it as the only honest signal you have. Every metric downstream of a contaminated test set is fiction, and fiction makes you ship the wrong thing.
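One simple way to approximate this for text data is n-gram overlap: drop any training example that shares a run of n consecutive tokens with a test example. The sketch below assumes plain-text examples and whitespace tokenization; the function names and the default window of 8 tokens are assumptions chosen for illustration, and near-duplicate or paraphrase detection would need something stronger.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive tokens; texts shorter than n yield one short tuple."""
    tokens = normalize(text).split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train: list[str], test: list[str], n: int = 8) -> list[str]:
    """Keep only training examples that share no n-gram with any test example."""
    test_grams: set[tuple[str, ...]] = set()
    for example in test:
        test_grams |= ngrams(example, n)
    return [ex for ex in train if ngrams(ex, n).isdisjoint(test_grams)]
```

A smaller `n` removes more aggressively and risks discarding examples that only share common phrases; a larger `n` is more permissive. Whatever value you pick, run the filter before computing any metric, not after.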
Invest Disproportionately in Labeling Quality
When your task needs labels, the labeling scheme sets your model's ceiling. More data does not fix sloppy labels.
Practices That Hold Up
- Write instructions with concrete examples of edge cases, not just the easy middle.
- Label a sample yourself before handing the task off, so you know where the ambiguity lives.
- Have multiple people label a shared subset and measure agreement; low agreement usually means your instructions, not your annotators, are the problem.
- Route disagreements back into sharper guidelines instead of silently picking a winner.
The trade-off is speed. High-quality labeling is slower per example. It is worth it, because the alternative is a model that learned contradictions.
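To make "measure agreement" concrete, here is a minimal sketch for two annotators who labeled the same shared subset. It computes Cohen's kappa, a common agreement statistic that corrects raw agreement for chance, and lists the disagreements so they can be routed back into the guidelines. Kappa is one choice among several, and the function names are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(items: list[str], labels_a: list[str], labels_b: list[str]):
    """Items the annotators labeled differently, for review against the guidelines."""
    return [(item, a, b) for item, a, b in zip(items, labels_a, labels_b) if a != b]
```

The disagreement list is the more actionable output: reading those items usually shows which instruction was ambiguous, which is where the sharpening effort should go.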
Audit Composition on Purpose
Do not assume your dataset is balanced. Break it down by the categories that matter for your task and look for thin or missing groups. Skewed data produces models that excel on common cases and fail on the rest, often invisibly.
The corrective is targeted collection: identify the gap, then collect specifically for it. Padding the dataset with more of what is already abundant makes the imbalance worse, not better.
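A composition audit can be as small as a counter. The sketch below assumes each example is a dict and that you supply a `key` function returning the category that matters for your task; the `min_share` threshold and the `expected` list of categories are hypothetical parameters you would set from your own requirements.

```python
from collections import Counter


def audit_composition(examples, key, min_share=0.05, expected=()):
    """Count examples per category and flag thin or entirely missing groups.

    key:       function mapping an example to its category
    min_share: categories below this fraction of the data are marked thin
    expected:  categories that should exist even if no example has them yet
    """
    counts = Counter(key(ex) for ex in examples)
    total = sum(counts.values())
    report = {}
    for category in sorted(set(counts) | set(expected)):
        count = counts.get(category, 0)
        share = count / total if total else 0.0
        report[category] = {"count": count, "share": share, "thin": share < min_share}
    return report
```

Passing `expected` matters because a category with zero examples never shows up in a plain count; listing what should be there is what surfaces the missing groups to collect for.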
Use Synthetic Data as a Scalpel, Not a Firehose
Synthetic data, generated by an existing model, is genuinely useful for filling specific gaps and balancing rare cases. It becomes dangerous when it dominates the dataset, because the model learns the generating model's quirks rather than the real task.
The rule of thumb: ground the dataset in real data, then use synthetic data surgically where real data is scarce. Always measure whether the synthetic additions actually improve evaluation, and cut them if they do not. The complete guide covers where synthetic data fits in the broader pipeline.
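The "measure, then cut" rule can be sketched as a small gate. Everything here is an assumption for illustration: examples are dicts with a boolean `synthetic` flag, `train_and_evaluate` is a hypothetical callable you provide that trains on a dataset and returns a score on your sealed test set, and the 30% cap is an arbitrary placeholder rather than a recommended value.

```python
from typing import Callable, Sequence


def synthetic_share(examples: Sequence[dict]) -> float:
    """Fraction of examples flagged as synthetic."""
    if not examples:
        return 0.0
    return sum(1 for ex in examples if ex.get("synthetic")) / len(examples)


def keep_synthetic(
    real: Sequence[dict],
    synthetic: Sequence[dict],
    train_and_evaluate: Callable[[list], float],  # hypothetical: returns a test score
    max_share: float = 0.3,   # placeholder cap, not a recommendation
    min_gain: float = 0.0,
) -> bool:
    """Keep a synthetic batch only if it stays a minority and improves evaluation."""
    combined = list(real) + list(synthetic)
    if synthetic_share(combined) > max_share:
        return False  # synthetic data would dominate; skip the training runs
    baseline = train_and_evaluate(list(real))
    with_synthetic = train_and_evaluate(combined)
    return with_synthetic - baseline > min_gain
```

The comparison against a real-data-only baseline is the important part: without it, there is no evidence that the synthetic additions help rather than merely enlarge the dataset.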
Iterate on Data, Not the Model
When a model underperforms, the reflexive move is to reach for a bigger or different model. For applied projects, that reflex is usually wrong. The highest-leverage fix is almost always better data.
The reasoning is leverage. The model architecture is a commodity you borrowed, one that thousands of people have already optimized. Your dataset is the part unique to your problem, and it is where the easy wins live. A model that fails on a category of inputs is usually telling you that category is missing or mislabeled in your data, not that the architecture is wrong.
In practice, build the habit of responding to weak evaluation with a data question: is coverage complete, are labels consistent, is anything contaminated? The step-by-step guide turns that habit into a diagnostic you can run.
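One way to turn a weak aggregate score into a data question is to break accuracy down by category and look at the worst slices first. This is a minimal sketch, assuming you can export evaluation results as `(category, gold_label, predicted_label)` tuples; the function names are illustrative.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Accuracy per category from (category, gold_label, predicted_label) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, gold, predicted in records:
        totals[category] += 1
        correct[category] += int(gold == predicted)
    return {category: correct[category] / totals[category] for category in totals}


def weakest_categories(records, k=3):
    """The k categories with the lowest accuracy, worst first."""
    scores = accuracy_by_category(records)
    return sorted(scores.items(), key=lambda item: item[1])[:k]
```

Each weak category then maps back to the three questions above: check how many training examples it has, whether its labels are consistent, and whether its test examples leaked into training.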
Keep the Feedback Loop Short
The best data practitioners run the collect-clean-evaluate loop many times, fast, rather than trying to get the dataset perfect in one pass.
Why Short Loops Win
- You discover what the model actually struggles with instead of guessing.
- You collect targeted data for real gaps rather than padding broadly.
- You catch labeling and contamination problems early, when they are cheap to fix.
The trade-off is that short loops require discipline to evaluate honestly each time. But a team that goes around the loop five times with a sealed test set will out-build a team that spends the same effort assembling one enormous dataset they never validate. Speed of iteration, grounded in honest measurement, is itself a best practice.
Frequently Asked Questions
What is the single highest-leverage practice here?
Decontaminating against a held-out test set, because it determines whether any of your other measurements are trustworthy. If your evaluation is contaminated, every decision you make downstream is based on fiction. Honest measurement is the foundation everything else rests on.
How do I balance curation against missing rare cases?
Curate aggressively for quality, but pair it with a deliberate coverage audit. Identify the rare cases that matter, then collect specifically for them. The goal is a small dataset that is still complete, not a small dataset that is merely convenient.
Is synthetic data ever a bad idea?
It is a bad idea when it dominates your dataset or when you do not verify that it improves results. Used surgically to fill real gaps, it is valuable. Used as a cheap substitute for real data at scale, it teaches the model the generator's flaws instead of the real task.
How much time should data work take versus modeling?
For applied projects, expect the majority of your time on collection, cleaning, and labeling. The model is usually borrowed and tuned. Teams consistently underestimate data work and overestimate modeling, which is why so many projects stall on weak data.
What does good labeling agreement look like?
There is no universal number, but you want annotators to agree on the clear cases consistently and to disagree only on genuine edge cases. Persistent disagreement on ordinary examples signals that your instructions are ambiguous and need sharpening before you scale up.
Key Takeaways
- For applied projects, data work is the real work; budget your best people and most of your time for it.
- Curate aggressively for quality but audit coverage so you do not drop rare, important cases.
- Make provenance logging non-negotiable and capture it at collection time.
- Keep one untouched test set and decontaminate training data against it before trusting any metric.
- Invest in labeling quality and use synthetic data surgically, never as a bulk substitute for real data.