A checklist is only useful if you can actually run it. This one is built to be worked through in order, from before you generate a single record to long after your model ships. Each item carries a one-line justification so you know why it earns its place, not just that someone said to do it.
Copy it. Adapt it. Run it against your next project. For the reasoning behind these items in depth, pair it with the best practices guide and the step-by-step workflow.
Before You Generate Anything
The most consequential work happens before generation. Skip this section and the rest cannot save you.
- [ ] Write a one-sentence gap statement. Name exactly what synthetic data will fix and how much you need. If you cannot write it, you are not ready.
- [ ] Lock a real holdout set. Carve out representative real data and freeze it before generation touches anything. It is your only ungameable test.
- [ ] Confirm the holdout is representative. Check that it spans the classes, segments, and tails you care about. A biased referee is worse than none.
- [ ] Decide your primary success metric. Recall, precision, or downstream task accuracy: pick one before you start so you cannot rationalize later.
- [ ] Identify your privacy stakes. If real records are sensitive, privacy testing becomes mandatory, not optional.
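One way to make the "lock a real holdout" item concrete is to split once with a fixed seed, stratify by class so rare labels are represented, and record a hash of the result so any later edit is detectable. The sketch below is a minimal illustration under assumptions: records are plain dictionaries, and the names `split_holdout`, `fingerprint`, and the `label` field are hypothetical, not part of any particular tool.

```python
import hashlib
import json
import random
from collections import defaultdict

def split_holdout(records, label_key, holdout_frac=0.2, seed=0):
    """Stratified split: returns (train, holdout) lists of dicts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)
    train, holdout = [], []
    for label in sorted(by_label, key=str):
        group = by_label[label][:]
        rng.shuffle(group)
        # Keep at least one record per class in the holdout.
        k = max(1, round(len(group) * holdout_frac))
        holdout.extend(group[:k])
        train.extend(group[k:])
    return train, holdout

def fingerprint(records):
    """SHA-256 over a canonical JSON encoding; changes if the holdout is edited."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Toy data (assumption): 10% of records belong to the rare "fraud" class.
records = [{"id": i, "label": "fraud" if i % 10 == 0 else "ok"} for i in range(100)]
train, holdout = split_holdout(records, "label")
frozen_hash = fingerprint(holdout)  # store this next to the holdout file
```

Before each evaluation, recomputing `fingerprint(holdout)` and comparing it with the stored hash is a cheap way to confirm the referee has not been touched.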
Choosing a Generation Method
Match the method to the data and the risk. Simpler is better until it demonstrably is not.
- [ ] Pick the simplest viable generator. Do not reach for a complex generative model when rules or statistical sampling solve the problem. Complexity is a cost.
- [ ] Match the method to the data type. Tabular, text, image, and sensor data each have generators suited to them; mismatches break correlations.
- [ ] Plan for reproducibility. Fix seeds, version source data, and store parameters so you can regenerate on demand.
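The reproducibility item can be sketched as a small manifest that captures everything needed to regenerate a batch: the seed, the generator parameters, and a hash of the source data. The generator here is a toy Gaussian sampler standing in for whatever method you chose; `make_manifest`, `generate`, and the manifest fields are illustrative assumptions.

```python
import hashlib
import json
import random

def make_manifest(source_rows, generator_name, params, seed):
    """Record what is needed to regenerate: method, parameters, seed, source version."""
    source_hash = hashlib.sha256(
        json.dumps(source_rows, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {
        "generator": generator_name,
        "params": params,
        "seed": seed,
        "source_sha256": source_hash,
    }

def generate(manifest, n):
    """Toy generator (assumption): Gaussian samples driven only by the manifest."""
    rng = random.Random(manifest["seed"])
    mu, sigma = manifest["params"]["mu"], manifest["params"]["sigma"]
    return [rng.gauss(mu, sigma) for _ in range(n)]

source_rows = [{"x": 1.0}, {"x": 2.5}]
manifest = make_manifest(source_rows, "gaussian_toy", {"mu": 0.0, "sigma": 1.0}, seed=42)
batch = generate(manifest, 5)

# The manifest survives a round trip through JSON and reproduces the same batch.
restored = json.loads(json.dumps(manifest))
```

Because the generator reads only from the manifest, storing that one JSON document alongside the versioned source data is enough to regenerate on demand.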
After the First Small Batch
Generate a few hundred records before generating millions. This stage is cheap and catches expensive errors.
- [ ] Read the sample by hand. View images, scan tables, read text. Humans catch in seconds the gross failures that metrics miss.
- [ ] Check for impossible values. Look for combinations that violate real-world constraints; generators routinely produce them.
- [ ] Look for leaked real records. If synthetic samples look identical to real ones, you have a memorization and privacy problem.
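Two of the checks above lend themselves to a few lines of code that complement, rather than replace, reading the sample by hand. The sketch below assumes tabular rows stored as dictionaries; the constraint rules and field names (`age`, `years_employed`) are hypothetical examples you would swap for your own domain constraints.

```python
def find_impossible(rows, rules):
    """Return (row_index, rule_name) for every violated constraint."""
    return [
        (i, name)
        for i, row in enumerate(rows)
        for name, ok in rules.items()
        if not ok(row)
    ]

def find_exact_leaks(synthetic, real, keys):
    """Return indices of synthetic rows identical to a real row on the given keys."""
    real_keys = {tuple(r[k] for k in keys) for r in real}
    return [i for i, s in enumerate(synthetic) if tuple(s[k] for k in keys) in real_keys]

# Hypothetical real-world constraints for a toy employment table.
rules = {
    "age_non_negative": lambda r: r["age"] >= 0,
    "employment_fits_age": lambda r: r["years_employed"] <= r["age"] - 14,
}

synthetic = [
    {"age": 34, "years_employed": 10},
    {"age": 19, "years_employed": 12},  # employed since age 7
    {"age": -3, "years_employed": 0},   # negative age
    {"age": 52, "years_employed": 30},  # identical to a real record
]
real = [{"age": 52, "years_employed": 30}, {"age": 41, "years_employed": 20}]

violations = find_impossible(synthetic, rules)
leaks = find_exact_leaks(synthetic, real, ["age", "years_employed"])
```

Exact-match detection only catches verbatim copies; near-copies need a distance-based check, which belongs in the privacy section.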
Validating Fidelity
Prove the synthetic distribution matches the real one before scaling. Marginals are not enough.
- [ ] Compare marginal distributions. Each column or feature should match the real data's distribution.
- [ ] Compare pairwise correlations. This is where weak generators fail silently while passing marginal checks.
- [ ] Check coverage of the tails. Confirm the synthetic data reaches the rare values the real data has, not just the common middle.
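The three fidelity checks can be sketched with standard-library Python. The example uses a two-sample Kolmogorov-Smirnov statistic for marginals, Pearson correlation for pairwise structure, and a simple quantile-based tail measure; these are one reasonable choice of metrics, not the only one. The "weak generator" is simulated by shuffling one column, which preserves its marginal exactly while destroying its correlation with the other.

```python
import bisect
import math
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

def tail_coverage(real, synthetic, q=0.99):
    """Share of synthetic values at or beyond the real q-quantile (expect roughly 1 - q)."""
    cut = sorted(real)[int(q * (len(real) - 1))]
    return sum(v >= cut for v in synthetic) / len(synthetic)

rng = random.Random(0)
real_x = [rng.gauss(0, 1) for _ in range(500)]
real_y = [2 * x + rng.gauss(0, 0.5) for x in real_x]

# Simulated weak generator: same y values, but shuffled independently of x.
weak_y = real_y[:]
rng.shuffle(weak_y)

marginal_gap = ks_statistic(real_y, weak_y)   # marginal check passes
real_corr = pearson(real_x, real_y)           # strong dependence in real data
weak_corr = pearson(real_x, weak_y)           # dependence lost in synthetic data
```

Here the marginal check reports a perfect match while the correlation check exposes the failure, which is the silent failure mode the checklist warns about.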
The Decisive Utility Check
Fidelity is necessary but not sufficient. This is the gate that matters.
- [ ] Run train-on-synthetic, test-on-real. Train a model on synthetic data alone, evaluate on the real holdout. This is the verdict.
- [ ] Compare against a real-data baseline. Know how far synthetic-trained performance sits from a model trained on available real data.
- [ ] Do not proceed on fidelity alone. If utility is poor, return to the generator regardless of how good the fidelity scores look.
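A train-on-synthetic, test-on-real run can be sketched end to end as follows. To stay dependency-free, the model is a nearest-centroid classifier and the data are two toy Gaussian classes; both are stand-ins (assumptions) for your real model and data. The `shift` parameter mimics a generator that places one class slightly off target.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, xi):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(xi, centroids[c])))

def accuracy(centroids, X, y):
    return sum(predict(centroids, xi) == yi for xi, yi in zip(X, y)) / len(y)

def make_data(rng, n, shift):
    """Two Gaussian classes; `shift` moves class 1 to mimic generator error."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 3.0 * label + (shift if label else 0.0)
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y

rng = random.Random(0)
real_X, real_y = make_data(rng, 200, shift=0.0)   # available real training data
hold_X, hold_y = make_data(rng, 200, shift=0.0)   # locked real holdout
syn_X, syn_y = make_data(rng, 200, shift=0.5)     # imperfect synthetic data

tstr = accuracy(fit_centroids(syn_X, syn_y), hold_X, hold_y)        # the verdict
baseline = accuracy(fit_centroids(real_X, real_y), hold_X, hold_y)  # real-data baseline
gap = baseline - tstr
```

The number to watch is `gap`, measured on your primary success metric: how much performance you give up, or gain, relative to training on the real data you already have.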
Blending and Tuning
Pure synthetic is rarely optimal. Blend, then search for the right ratio.
- [ ] Blend synthetic with real data. Real data anchors the model; synthetic fills the named gap.
- [ ] Sweep the synthetic-to-real ratio. Treat it as a hyperparameter and measure utility at each setting on the real holdout.
- [ ] Stop at the empirical optimum. It is usually well below maximum synthetic; more synthetic past the optimum reliably hurts.
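The ratio sweep can be sketched as a loop that builds a blend at each ratio and scores it against the real holdout. In the sketch below, `score_fn` is a placeholder for "train your model on this blend and evaluate on the holdout"; the toy version scores how well the blend's mean matches the holdout mean, and the synthetic data are deliberately biased. All names and numbers are illustrative assumptions.

```python
import random

def sweep_ratio(real_train, synthetic, ratios, score_fn, seed=0):
    """Score a blend at each synthetic-to-real ratio; return (ratio, score) pairs."""
    rng = random.Random(seed)
    results = []
    for ratio in ratios:
        n_syn = min(len(synthetic), round(ratio * len(real_train)))
        blend = real_train + rng.sample(synthetic, n_syn)
        results.append((ratio, score_fn(blend)))
    return results

rng = random.Random(1)
real_train = [rng.gauss(10.0, 2.0) for _ in range(20)]   # scarce real data
holdout = [rng.gauss(10.0, 2.0) for _ in range(500)]     # locked real holdout
synthetic = [rng.gauss(10.8, 2.0) for _ in range(400)]   # slightly biased generator

holdout_mean = sum(holdout) / len(holdout)

def score(blend):
    """Toy utility (higher is better): closeness of the blend mean to the holdout mean."""
    return -abs(sum(blend) / len(blend) - holdout_mean)

ratios = [0, 0.5, 1, 2, 5, 10]
results = sweep_ratio(real_train, synthetic, ratios, score)
best_ratio, best_score = max(results, key=lambda r: r[1])
```

Including ratio 0 in the sweep gives you the real-only baseline for free, so you can see whether any amount of synthetic data helps at all.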
Privacy Verification
If privacy motivated the project, prove it rather than assume it.
- [ ] Run membership inference tests. Confirm an attacker cannot tell which real records were in the generator's training set.
- [ ] Measure nearest-neighbor distances. Synthetic samples sitting too close to real records indicate memorization and leakage.
- [ ] Apply differential privacy if stakes are high. Accept the fidelity trade-off when the privacy stakes justify it.
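The nearest-neighbor check can be sketched as a brute-force distance computation from each synthetic record to its closest real record. This assumes numeric features already scaled to comparable ranges, and the threshold is a placeholder: a sensible value depends on your data, for example how close real records typically sit to one another.

```python
import math

def nearest_real_distance(synthetic, real):
    """Euclidean distance from each synthetic point to its closest real point."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

def flag_too_close(synthetic, real, threshold):
    """Indices of synthetic points closer than `threshold` to some real point."""
    distances = nearest_real_distance(synthetic, real)
    return [i for i, d in enumerate(distances) if d < threshold]

real = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
synthetic = [
    (0.5, 0.4),    # plausible new point
    (1.0, 1.001),  # near-copy of a real record
    (3.0, 3.0),    # far from every real record
]

distances = nearest_real_distance(synthetic, real)
suspects = flag_too_close(synthetic, real, threshold=0.05)  # threshold is an assumption
```

Brute force is quadratic in the number of records, so at scale you would run this on a sample or use an approximate nearest-neighbor index.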
After You Ship
Synthetic data is perishable. Plan for its decay before it bites you.
- [ ] Monitor the real data distribution. Watch for drift away from what your synthetic data modeled.
- [ ] Set a regeneration trigger. Regenerate on drift, not on the calendar; document the threshold.
- [ ] Document the whole pipeline. Method, parameters, ratio, and utility numbers so regeneration is a button, not an excavation.
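A drift-based regeneration trigger can be sketched with the Population Stability Index (PSI), one common drift score among several. The reference sample is the real data your synthetic set was modeled on; the current sample is fresh production data. The 0.2 threshold is a widely quoted rule of thumb, used here as an assumption; the checklist's point is that you pick a threshold and document it.

```python
import math
import random

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index over equal-width bins fitted to the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

REGENERATE_THRESHOLD = 0.2  # assumption: document whatever value you choose

def should_regenerate(reference, current):
    return psi(reference, current) > REGENERATE_THRESHOLD

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # data the generator modeled
stable = reference[:]                                   # no drift
drifted = [rng.gauss(1.5, 1.0) for _ in range(1000)]    # mean has shifted
```

Running `should_regenerate` per feature on a schedule gives you monitoring on the calendar while keeping regeneration itself tied to measured drift.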
Adapting the Checklist to Your Context
This checklist is deliberately complete, which means parts of it will not apply to every project. Knowing what to drop is as important as knowing what to run.
- Low-stakes augmentation of public data. Relax the privacy section. If the source data carries no personal information, membership inference and distance checks are optional. Keep everything else.
- Simulation-heavy domains. The fidelity section shifts toward closing the sim-to-real gap rather than matching tabular correlations. The utility check on real data matters even more, because rendering artifacts are the dominant risk.
- One-off research experiments. You can run a lighter version, but never drop the locked holdout or the utility check. Those two items are non-negotiable regardless of stakes, because without them you cannot honestly know whether the synthetic data helped.
The rule for adaptation: you may scale verification down to match your stakes, but the pre-generation gate and the utility check survive every context. They are the load-bearing items.
How to Use This Checklist
Run the "before you generate" section as a gate; do not pass it until every box is checked. Treat the utility check as a hard stop. Everything after blending is about keeping the model healthy over time. Teams that fail have almost always left a box unchecked in the first or fourth section; these are the gaps the common mistakes article catalogs.
Treat the checklist as a living document. After each project, note which item caught a real problem and which felt like ceremony. Over a few projects you will learn which checks earn their keep in your domain and which you can streamline. The structure stays; the emphasis adapts to where your failures actually occur.
Frequently Asked Questions
Which section is the most important?
"Before you generate anything," especially locking a real holdout. Every later check depends on having ground truth to measure against. Skip it and the rest of the checklist measures nothing.
Can I skip the manual inspection step at scale?
No. You inspect a small sample, not the full dataset, and it costs minutes. It catches gross failures, like impossible values and leaked records, that automated metrics routinely pass.
Do I always need the privacy section?
Only if your real data is sensitive. If you are augmenting non-sensitive public data, you can relax privacy testing. When real records carry personal information, it is mandatory.
How do I know where to stop tuning the ratio?
Sweep ratios, measure utility on the real holdout, and stop where the metric plateaus or starts degrading. The optimum is empirical and dataset-specific, never a fixed number.
When should I regenerate synthetic data?
When you detect drift in the real data distribution, not on a fixed schedule. Drift-based regeneration keeps the synthetic data aligned with current reality.
Key Takeaways
- The pre-generation gate, especially a locked real holdout, is the most important section to clear.
- Inspect a small batch by hand before scaling; it is cheap and catches expensive errors.
- Validate correlations, not just marginals, and treat the utility check as a hard stop.
- Blend and tune the ratio empirically; stop at the optimum, which is usually well below maximum synthetic.
- Treat synthetic data as perishable: monitor drift, set regeneration triggers, and document everything.