Opinionated Defaults for the Tuesday You Face Raw Data

Generic advice tells you to use clean, diverse, high-quality data. That is true and almost useless, because it does not tell you what to actually do on a Tuesday afternoon when you are staring at a pile of raw text. This article is the opposite: specific, opinionated practices with the reasoning behind each, drawn from how strong data pipelines actually behave.

These are not laws. They are defaults that hold up under pressure. Where a practice has a trade-off, we name it, because a best practice you cannot reason about is just a slogan.

Treat Data Work as the Real Work

The biggest mindset shift is to stop treating data collection as setup for the "real" modeling work. For applied projects, the data is the work. The model is a commodity you borrow.

In practice this means budgeting most of your time for collection, cleaning, and labeling, and accepting that the unglamorous parts deserve your best people. Teams that respect data work tend to outperform teams that rush through it to get to training. If you want the end-to-end sequence, the step-by-step guide lays it out.

Curate Aggressively, Collect Conservatively

The reflex to gather everything is wrong for most projects. A smaller, sharply curated dataset usually beats a sprawling one.

Why Curation Wins

Every low-quality example you remove raises the average signal.
Smaller datasets are cheaper to clean, label, and audit thoroughly.
You can actually understand what is in a curated dataset, which makes debugging possible.

The trade-off: aggressive curation risks dropping rare-but-important cases. Counter it by auditing for coverage gaps and collecting specifically to fill them, rather than collecting broadly and hoping.

Make Provenance Non-Negotiable

Record source, date, and usage rights for every batch at the moment you collect it. This sounds bureaucratic until the day someone asks whether you have the right to use a dataset, or you need to remove one source cleanly.

The reasoning is risk asymmetry. Capturing provenance costs minutes. Reconstructing it later costs days and is often impossible. The expected value strongly favors doing it upfront. Missing provenance is one of the most common failures in our common mistakes roundup.
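
As a rough illustration of how little it takes to capture provenance at collection time, here is a minimal Python sketch that writes one JSON line per batch. The `BatchProvenance` class, its field names, and the sample values are all hypothetical, chosen only to mirror the three items named above (source, date, usage rights); adapt them to whatever your pipeline already tracks.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True)
class BatchProvenance:
    # All field names are illustrative assumptions, not a standard schema.
    batch_id: str      # identifier for one collected batch
    source: str        # where the batch came from
    collected_on: str  # ISO date string
    usage_rights: str  # license or permission note


def to_jsonl(record: BatchProvenance) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


def batch_ids_for_source(log_lines: Iterable[str], source: str) -> list[str]:
    """Return the IDs of every logged batch that came from `source`."""
    ids = []
    for line in log_lines:
        if not line.strip():
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        if entry["source"] == source:
            ids.append(entry["batch_id"])
    return ids


# Made-up example entries, for illustration only.
log = [
    to_jsonl(BatchProvenance("b001", "forum-export", "2024-05-14", "CC BY 4.0")),
    to_jsonl(BatchProvenance("b002", "vendor-feed", "2024-05-15", "internal use")),
    to_jsonl(BatchProvenance("b003", "forum-export", "2024-05-21", "CC BY 4.0")),
]
print(batch_ids_for_source(log, "forum-export"))  # ['b001', 'b003']
```

With a log like this, removing one source cleanly becomes a lookup: find the batch IDs for that source, then drop those batches.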

Decontaminate Before You Trust Any Number

Always remove training examples that overlap with your evaluation set. Without this, a model can score brilliantly by memorizing the answers and then fail on genuinely new inputs.

Keep one held-out test set that never touches training, ever. Treat it as the only honest signal you have. Every metric downstream of a contaminated test set is fiction, and fiction makes you ship the wrong thing.
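
What counts as "overlap" is a choice this article does not pin down. One common approximation is shared word n-grams, and the sketch below assumes that definition. The function names and the default of 8 words are illustrative assumptions; exact-match hashing or fuzzy matching may suit your data better.

```python
import re


def word_ngrams(text: str, n: int) -> set:
    """Lowercased word n-grams; texts shorter than n words yield one short gram."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train, test, n=8):
    """Split `train` into (kept, removed) by n-gram overlap with `test`."""
    test_grams = set()
    for example in test:
        test_grams |= word_ngrams(example, n)

    kept, removed = [], []
    for example in train:
        # Any shared n-gram is treated as contamination in this sketch.
        if word_ngrams(example, n) & test_grams:
            removed.append(example)
        else:
            kept.append(example)
    return kept, removed


test_set = ["What is the capital of France? Paris is the capital of France."]
train_set = [
    "Q: What is the capital of France? Paris is the capital of France, obviously.",
    "The Loire is the longest river in France.",
]
kept, removed = decontaminate(train_set, test_set, n=5)
print(len(kept), len(removed))  # 1 1
```

A smaller `n` removes more aggressively and risks false positives on common phrases; a larger `n` misses paraphrased leaks. Whatever you choose, read a sample of what was removed before trusting the filter.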

Invest Disproportionately in Labeling Quality

When your task needs labels, the labeling scheme sets your model's ceiling. Sloppy labels cannot be fixed by more data.

Practices That Hold Up

Write instructions with concrete examples of edge cases, not just the easy middle.
Label a sample yourself before handing the task off, so you know where the ambiguity lives.
Have multiple people label a shared subset and measure agreement; low agreement usually means your instructions, not your annotators, are the problem.
Route disagreements back into sharper guidelines instead of silently picking a winner.

The trade-off is speed. High-quality labeling is slower per example. It is worth it, because the alternative is a model that learned contradictions.
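
To make "measure agreement" concrete, here is a small sketch using Cohen's kappa for two annotators. Kappa is one common statistic, not one this article prescribes, and the labels below are made up. The second helper lists the items where annotators disagreed, so they can be routed back into the guidelines.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def disagreements(items, a, b):
    """Items where the two annotators chose different labels."""
    return [(item, x, y) for item, x, y in zip(items, a, b) if x != y]


items = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
ann_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
ann_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

print(cohen_kappa(ann_a, ann_b))           # 0.5
print(disagreements(items, ann_a, ann_b))  # t2 and t8
```

Here the annotators agree on 6 of 8 items (0.75 raw agreement), but with balanced labels half of that is expected by chance, so kappa comes out at 0.5. The disagreement list is usually more useful than the number itself, because it shows which guideline to sharpen.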

Audit Composition on Purpose

Do not assume your dataset is balanced. Break it down by the categories that matter for your task and look for thin or missing groups. Skewed data produces models that excel on common cases and fail on the rest, often invisibly.

The corrective is targeted collection: identify the gap, then collect specifically for it. Padding the dataset with more of what is already abundant makes the imbalance worse, not better.
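
A composition audit can start as a simple count. The sketch below flags categories whose share of the dataset falls under a threshold, including expected categories that are missing entirely. The function name, the 5% threshold, and the category names are assumptions for illustration; pick the breakdown and cutoff that matter for your task.

```python
from collections import Counter


def thin_groups(categories, expected=(), min_share=0.05):
    """Map each under-represented or missing category to its example count."""
    counts = Counter(categories)
    total = sum(counts.values())
    flagged = {}
    # Include expected categories so that fully missing groups are reported too.
    for category in sorted(set(counts) | set(expected)):
        share = counts[category] / total if total else 0.0
        if share < min_share:
            flagged[category] = counts[category]
    return flagged


# Made-up category labels for a hypothetical support-ticket dataset.
labels = ["billing"] * 60 + ["shipping"] * 37 + ["refund"] * 3
print(thin_groups(labels, expected=["billing", "shipping", "refund", "legal"]))
# {'legal': 0, 'refund': 3}
```

The output is a collection to-do list: these are the groups to gather more of, rather than adding further examples to the categories that are already abundant.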

Use Synthetic Data as a Scalpel, Not a Firehose

Synthetic data, generated by an existing model, is genuinely useful for filling specific gaps and balancing rare cases. It becomes dangerous when it dominates the dataset, because the model learns the generating model's quirks rather than the real task.

The rule of thumb: ground the dataset in real data, then use synthetic data surgically where real data is scarce. Always measure whether the synthetic additions actually improve evaluation, and cut them if they do not. The complete guide covers where synthetic data fits in the broader pipeline.
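
The rule of thumb can be sketched in two small functions: one caps synthetic examples at a fixed share of the final dataset, the other makes the keep-or-cut decision from two evaluation scores. The names, the cap, and the scores are illustrative assumptions, and the second function assumes a metric where higher is better.

```python
import math
import random


def cap_synthetic(real, synthetic, max_share=0.2, seed=0):
    """Return real data plus a random synthetic sample of at most `max_share`."""
    if not 0.0 <= max_share < 1.0:
        raise ValueError("max_share must be in [0, 1)")
    # Solve s / (r + s) <= max_share for s, the number of synthetic examples.
    limit = math.floor(max_share * len(real) / (1.0 - max_share))
    limit = min(limit, len(synthetic))
    sampled = random.Random(seed).sample(list(synthetic), limit)
    return list(real) + sampled


def keep_synthetic(real_only_score, with_synthetic_score, min_gain=0.0):
    """Keep the synthetic additions only if the held-out score improves enough."""
    return (with_synthetic_score - real_only_score) > min_gain


real = [f"real-{i}" for i in range(30)]
synthetic = [f"syn-{i}" for i in range(100)]
mixed = cap_synthetic(real, synthetic, max_share=0.25)
print(len(mixed))                   # 40: 30 real plus 10 synthetic
print(keep_synthetic(0.81, 0.80))   # False: cut the synthetic data
```

Both scores should come from the same sealed test set, with one model trained on real data only and one trained on the mixed set.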

Iterate on Data, Not the Model

When a model underperforms, the reflexive move is to reach for a bigger or different model. For applied projects, that reflex is usually wrong. The highest-leverage fix is almost always better data.

The reasoning is leverage. The model architecture is a commodity you borrowed and that thousands of people have already optimized. Your dataset is the part unique to your problem, and it is where the easy wins live. A model that fails on a category of inputs is telling you that category is missing or mislabeled in your data, not that the architecture is wrong.

In practice, build the habit of responding to weak evaluation with a data question: is coverage complete, are labels consistent, is anything contaminated? The step-by-step guide turns that habit into a diagnostic you can run.
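
One way to turn a weak evaluation into a data question is to break the score down by category and look at the weakest group first. This sketch assumes each evaluation row is a hypothetical `(category, gold_label, predicted_label)` tuple; the rows shown are made up.

```python
from collections import defaultdict


def accuracy_by_category(rows):
    """Accuracy per category from (category, gold, predicted) rows."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, gold, predicted in rows:
        totals[category] += 1
        correct[category] += int(gold == predicted)
    return {category: correct[category] / totals[category] for category in totals}


def weakest_first(rows):
    """Categories sorted from lowest to highest accuracy."""
    return sorted(accuracy_by_category(rows).items(), key=lambda item: item[1])


rows = [
    ("billing", "yes", "yes"),
    ("billing", "no", "no"),
    ("billing", "yes", "yes"),
    ("billing", "no", "no"),
    ("refund", "yes", "no"),
    ("refund", "no", "no"),
    ("refund", "yes", "no"),
]
for category, accuracy in weakest_first(rows):
    print(category, round(accuracy, 2))
```

A category at the top of this list is where to check coverage, label consistency, and contamination before considering a different model.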

Keep the Feedback Loop Short

The best data practitioners run the collect-clean-evaluate loop many times, fast, rather than trying to get the dataset perfect in one pass.

Why Short Loops Win

You discover what the model actually struggles with instead of guessing.
You collect targeted data for real gaps rather than padding broadly.
You catch labeling and contamination problems early, when they are cheap to fix.

The trade-off is that short loops require discipline to evaluate honestly each time. But a team that goes around the loop five times with a sealed test set will out-build a team that spends the same effort assembling one enormous dataset they never validate. Speed of iteration, grounded in honest measurement, is itself a best practice.

Frequently Asked Questions

What is the single highest-leverage practice here?

Decontaminating against a held-out test set, because it determines whether any of your other measurements are trustworthy. If your evaluation is contaminated, every decision you make downstream is based on fiction. Honest measurement is the foundation everything else rests on.

How do I balance curation against missing rare cases?

Curate aggressively for quality, but pair it with a deliberate coverage audit. Identify the rare cases that matter, then collect specifically for them. The goal is a small dataset that is still complete, not a small dataset that is merely convenient.

Is synthetic data ever a bad idea?

It is a bad idea when it dominates your dataset or when you do not verify that it improves results. Used surgically to fill real gaps, it is valuable. Used as a cheap substitute for real data at scale, it teaches the model the generator's flaws instead of the real task.

How much time should data work take versus modeling?

For applied projects, expect the majority of your time on collection, cleaning, and labeling. The model is usually borrowed and tuned. Teams consistently underestimate data work and overestimate modeling, which is why so many projects stall on weak data.

What does good labeling agreement look like?

There is no universal number, but you want annotators to agree on the clear cases consistently and to disagree only on genuine edge cases. Persistent disagreement on ordinary examples signals that your instructions are ambiguous and need sharpening before you scale up.

Key Takeaways

For applied projects, data work is the real work; budget your best people and most of your time for it.
Curate aggressively for quality but audit coverage so you do not drop rare, important cases.
Make provenance logging non-negotiable and capture it at collection time.
Keep one untouched test set and decontaminate training data against it before trusting any metric.
Invest in labeling quality and use synthetic data surgically, never as a bulk substitute for real data.

Agency Script Editorial
