Case Study: Synthetic Data in AI Training in Practice

This is an illustrative case study, a composite built from patterns common to real synthetic data projects, written as a single narrative so you can see how the decisions connect. The numbers are representative of typical outcomes, not measurements from one specific deployment. The point is the arc: how a stalled project found its way to a shipped model, and where it nearly went wrong.

Follow it end to end. Then read the principles it illustrates in The Complete Guide and the workflow it followed in the step-by-step approach.

The Situation

A mid-sized lender built a model to flag fraudulent loan applications. It had two years of application data: roughly 400,000 applications, of which about 1,200 were confirmed fraud. That is a positive class of 0.3 percent.

The first model, trained on the raw data, achieved high overall accuracy and was useless. It caught little fraud because predicting "legitimate" every time already scores 99.7 percent accuracy, so accuracy gave the model almost no reason to find the rare positives. Recall on the fraud class sat near 20 percent. The fraud team was drowning in cases the model missed and had stopped paying attention to its output.
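The arithmetic behind that trap can be checked in a few lines. This is a minimal sketch using the counts quoted above; the "always legitimate" baseline is an illustrative stand-in, not the team's actual first model.

```python
# Counts from the case study; the baseline below predicts "legitimate" for everything.
total = 400_000
fraud = 1_200
legit = total - fraud

true_positives = 0        # the baseline never flags fraud
true_negatives = legit    # every legitimate application is "correct"
false_negatives = fraud   # every fraud case is missed

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy: {accuracy:.1%}")    # accuracy: 99.7%
print(f"fraud recall: {recall:.1%}")  # fraud recall: 0.0%
```

A model that learns nothing still posts 99.7 percent accuracy, which is why the team judged progress by recall on the fraud class.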

Collecting more real fraud was not an option. Fraud accumulates at its own pace, and the team needed a working model in weeks, not years.

The Decision

The team considered three paths: aggressive class weighting, undersampling the majority class, and synthetic generation of the minority class. Class weighting alone had not moved recall enough. Undersampling threw away most of the legitimate data and made the model jittery.
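For context on the first path, here is a sketch of one common weighting scheme, inverse class frequency. The case study does not say which scheme the team used, so the formula and the function name `balanced_class_weights` are assumptions for illustration.

```python
def balanced_class_weights(class_counts):
    """Weight each class by total / (n_classes * class_count).

    Rare classes get large weights, so each of their errors costs more in the loss.
    """
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count) for label, count in class_counts.items()}

# Counts from the case study.
weights = balanced_class_weights({"legitimate": 398_800, "fraud": 1_200})
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

Each fraud example ends up weighted several hundred times more heavily than a legitimate one, but the model still sees the same 1,200 fraud records. Weighting changes how much each example counts, not how many distinct patterns exist.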

They chose synthetic generation, but with a constraint they wrote down explicitly: synthetic data would be used only to balance the training set, never to evaluate, and the blend ratio would be tuned, not assumed. That single discipline, set before any code, shaped everything that followed.

The Execution

Locking the holdout

Before generating anything, they carved out a real evaluation set: 80,000 applications including roughly 240 real fraud cases, untouched by generation. This holdout became the referee for every later decision.
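A stratified split is one way to carve out a holdout like this so that it keeps the same fraud rate as the full data. The sketch below uses only the standard library; the function name `lock_holdout`, the field name `is_fraud`, and the toy records are hypothetical, and the data is at one tenth of the case study's scale.

```python
import random

def lock_holdout(records, label_key, holdout_fraction, seed=0):
    """Split records into (train, holdout), preserving the class ratio."""
    rng = random.Random(seed)  # fixed seed so the holdout is reproducible
    by_class = {}
    for record in records:
        by_class.setdefault(record[label_key], []).append(record)

    train, holdout = [], []
    for group in by_class.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * holdout_fraction)
        holdout.extend(shuffled[:cut])
        train.extend(shuffled[cut:])
    return train, holdout

# Toy data: 40,000 applications, 120 of them fraud (0.3 percent).
records = [{"id": i, "is_fraud": i < 120} for i in range(40_000)]
train, holdout = lock_holdout(records, "is_fraud", holdout_fraction=0.2)

holdout_fraud = sum(r["is_fraud"] for r in holdout)
print(len(holdout), holdout_fraud)  # 8000 24
```

The generator should only ever see `train`. Writing the holdout IDs to a file once and never regenerating the split is a simple way to keep it locked.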

Choosing the generator

The data was tabular with strong correlations between fields like income, requested amount, and employment history. A naive sampler broke those correlations immediately. They moved to a conditional tabular generative model that learned the joint distribution of the fraud class.

Inspecting before scaling

They generated 500 synthetic fraud records and read them by hand. The first batch had impossible combinations: applicants with zero income whose large requests were approved at high rates. The generator had not learned a hard constraint. They added the constraint and regenerated. This manual inspection, covered in the best practices guide, caught a flaw that automated fidelity metrics had scored as acceptable.
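Once a reader has spotted an impossible combination, it can be written down as a rule and applied to every later batch. The following is a sketch only: the field names, the `LARGE_AMOUNT` threshold, and the rules themselves are hypothetical, since the case study does not specify the team's schema.

```python
LARGE_AMOUNT = 50_000  # hypothetical threshold for a "large" request

def constraint_violations(record):
    """Return the hard-constraint rules that a synthetic record breaks."""
    violations = []
    if record["income"] < 0:
        violations.append("negative income")
    if record["income"] == 0 and record["requested_amount"] >= LARGE_AMOUNT:
        violations.append("zero income with large requested amount")
    return violations

synthetic_batch = [
    {"income": 52_000, "requested_amount": 15_000},
    {"income": 0, "requested_amount": 80_000},
]

flagged = []
for index, record in enumerate(synthetic_batch):
    problems = constraint_violations(record)
    if problems:
        flagged.append((index, problems))

print(flagged)  # [(1, ['zero income with large requested amount'])]
```

Rule checks like this complement hand inspection rather than replace it. They only catch the violations someone has already thought to encode.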

Validating fidelity

The corrected generator's synthetic fraud matched the real fraud on marginal distributions and, critically, on the pairwise correlations that the first attempt had broken. Only then did they scale up.
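A pairwise correlation comparison of the kind described can be sketched as below. Pearson correlation is an assumption here (the case study does not name the statistic), and the field names and toy values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_gaps(real, synthetic, fields):
    """Absolute difference in correlation, real vs synthetic, for every field pair."""
    gaps = {}
    for i, a in enumerate(fields):
        for b in fields[i + 1:]:
            r_real = pearson([r[a] for r in real], [r[b] for r in real])
            r_syn = pearson([r[a] for r in synthetic], [r[b] for r in synthetic])
            gaps[(a, b)] = abs(r_real - r_syn)
    return gaps

fields = ["income", "requested_amount"]
real = [{"income": x, "requested_amount": 0.5 * x} for x in (20, 40, 60, 80)]
# A "naive sampler" style batch: same marginals in spirit, correlation destroyed.
broken = [
    {"income": x, "requested_amount": y}
    for x, y in zip((20, 40, 60, 80), (40, 10, 30, 20))
]

gaps = correlation_gaps(real, broken, fields)
print(gaps)
```

A large gap on any pair signals that the generator has broken a relationship, even when each column looks fine on its own. The acceptable gap is a judgment call for each dataset.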

Tuning the ratio

They swept the synthetic-to-real blend for the fraud class: 10, 25, 40, 50, and 70 percent synthetic. At each setting they measured recall and precision on the real holdout. Recall climbed steadily up to 40 percent synthetic, then plateaued, while precision began to erode past that point. They settled at 40 percent synthetic for the fraud class.
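To make the sweep settings concrete, the sketch below converts each synthetic percentage into a record count. It assumes "percent synthetic" means the synthetic share of fraud-class training rows, and that about 960 real fraud cases remained for training after the holdout (1,200 minus roughly 240). Both are readings of the case study rather than stated facts, and `synthetic_count` is a hypothetical helper.

```python
def synthetic_count(n_real, synthetic_share):
    """Synthetic rows needed so they form `synthetic_share` of the class.

    Solves s = n_syn / (n_syn + n_real) for n_syn.
    """
    if not 0 <= synthetic_share < 1:
        raise ValueError("synthetic_share must be in [0, 1)")
    return round(n_real * synthetic_share / (1 - synthetic_share))

n_real_fraud_train = 960  # assumed: 1,200 confirmed fraud minus ~240 held out

sweep = {
    share: synthetic_count(n_real_fraud_train, share)
    for share in (0.10, 0.25, 0.40, 0.50, 0.70)
}
for share, count in sweep.items():
    print(f"{share:.0%} synthetic -> {count} synthetic fraud rows")
```

Under these assumptions the chosen 40 percent setting corresponds to 640 synthetic rows alongside the 960 real ones. Each setting would then be trained and scored on the real holdout only.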

The Outcome

On the real holdout, fraud recall rose from roughly 20 percent to about 68 percent, while precision stayed high enough that the fraud team's caseload remained manageable. Overall accuracy barely changed, which was expected and irrelevant; the goal was catching fraud, not inflating an accuracy number that the imbalance had already maxed out.
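In terms of the holdout's roughly 240 real fraud cases, those recall figures translate into counts as follows. This is plain arithmetic on the numbers quoted in this case study, rounded to whole cases.

```python
holdout_fraud = 240  # real fraud cases in the locked holdout

for label, recall in (("before", 0.20), ("after", 0.68)):
    caught = round(holdout_fraud * recall)
    missed = holdout_fraud - caught
    print(f"{label}: caught {caught}, missed {missed}")
# before: caught 48, missed 192
# after: caught 163, missed 77
```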

The model shipped. The fraud team began trusting and acting on its flags. The project that had stalled for a quarter moved to production in about five weeks once synthetic generation was chosen.

What They Would Do Differently

Two lessons surfaced in the retrospective.

Test privacy earlier. They ran membership inference checks only near the end and got lucky; the generator had not memorized records. Had it leaked, they would have rebuilt late. Privacy testing belongs in the first generation cycle, not the last.
Plan for drift sooner. Fraud patterns evolve as fraudsters adapt. Six months later, recall had drifted down because the synthetic data described an older fraud distribution. They had no regeneration schedule and scrambled to rebuild. Treating synthetic data as perishable from the start would have caught this.

These are the exact failure modes catalogued in 7 Common Mistakes, seen here in a project that mostly avoided them.
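On the privacy lesson, one cheap check that can run in the very first generation cycle is a distance-to-closest-record test: flag any synthetic row that sits on top of a real training row. This is a simpler proxy, not the membership inference test the team ran, and the field names, values, and threshold below are hypothetical.

```python
import math

def closest_real_distance(synthetic_record, real_records, fields):
    """Smallest Euclidean distance from one synthetic record to any real record."""
    point = [synthetic_record[f] for f in fields]
    return min(math.dist(point, [real[f] for f in fields]) for real in real_records)

def near_copies(synthetic, real, fields, threshold):
    """Indices of synthetic records within `threshold` of some real record."""
    return [
        index
        for index, record in enumerate(synthetic)
        if closest_real_distance(record, real, fields) <= threshold
    ]

fields = ["income", "requested_amount"]
real = [
    {"income": 30_000, "requested_amount": 5_000},
    {"income": 45_000, "requested_amount": 12_000},
]
synthetic = [
    {"income": 30_000, "requested_amount": 5_000},  # exact copy of a real row
    {"income": 38_000, "requested_amount": 9_000},
]

copies = near_copies(synthetic, real, fields, threshold=0.0)
print(copies)  # [0]
```

In practice the fields would need scaling to comparable ranges before distances mean much, and a clean result here does not rule out subtler leakage.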

The Alternatives They Did Not Take

It is worth naming the paths the team rejected, because the road not taken clarifies why this one worked. They could have bought an end-to-end synthetic data platform and trusted its built-in validation. They chose open-source generation with their own validation instead, reasoning that the validation was too important to outsource to a vendor's dashboard they could not fully inspect. That choice is examined in the tools roundup.

They also considered generating a much larger synthetic fraud set, on the theory that more minority-class data would only help. The ratio sweep disproved that theory directly: past 40 percent synthetic, precision fell and the model began over-flagging legitimate applications. Had they skipped the sweep and assumed more was better, they would have shipped a model that buried the fraud team in false positives, trading one failure mode for another. The empirical sweep, not intuition, kept them at the right point.

Why This Worked

Strip the story to its spine and the success came from three decisions made before generation: a locked real holdout, a tuned ratio rather than an assumed one, and manual inspection that caught a broken constraint. The generative method mattered less than the discipline around it. Swap the conditional tabular model for a different generator and, with the same discipline, the outcome would likely have held.

The broader lesson is that synthetic data did not solve this problem on its own. Process did. The generator manufactured fraud examples, but it was the holdout, the inspection, and the ratio sweep that turned those examples into a model the fraud team trusted. A team that ran the same generator without those guardrails would have produced plausible-looking data and an unreliable model, and likely blamed the technique rather than the missing discipline.

Frequently Asked Questions

Why not just use class weighting instead of synthetic data?

The team tried it. Class weighting alone did not raise recall enough because the model still saw too few distinct fraud patterns. Synthetic generation gave the model varied minority-class examples to learn from, which weighting cannot manufacture.

How did they know 40 percent was the right ratio?

They swept multiple ratios and measured recall and precision on the real holdout at each. Recall plateaued and precision started eroding past 40 percent, so that was the empirical sweet spot, not a guess.

Was the accuracy improvement the success metric?

No. Overall accuracy was already 99.7 percent from the imbalance and barely moved. The real metric was fraud recall, which rose from about 20 percent to about 68 percent.

What caused the later performance drift?

Fraud patterns evolved while the synthetic data stayed frozen at an older distribution. Without a regeneration schedule, the model slowly decayed. The fix was treating synthetic data as perishable and regenerating on drift.
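A regeneration trigger can be as small as the sketch below, which compares recent recall readings on newly confirmed real fraud against the launch baseline. The function name, the window, the tolerated drop, and the sample readings are all hypothetical values chosen for illustration.

```python
def needs_regeneration(baseline_recall, recent_recalls, max_drop=0.05, window=3):
    """True when the mean of the last `window` readings is more than
    `max_drop` below the baseline recall."""
    if len(recent_recalls) < window:
        return False  # not enough evidence yet
    recent_mean = sum(recent_recalls[-window:]) / window
    return baseline_recall - recent_mean > max_drop

baseline = 0.68  # recall at launch, from the case study

stable = [0.68, 0.67, 0.66]               # hypothetical monthly readings
drifting = [0.67, 0.66, 0.61, 0.58, 0.55]  # hypothetical monthly readings

print(needs_regeneration(baseline, stable))    # False
print(needs_regeneration(baseline, drifting))  # True
```

Averaging over a window keeps one noisy month from triggering a rebuild, at the cost of reacting a little later to real drift.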

What was the single most important decision?

Locking a real holdout before generating anything. It made every subsequent ratio and fidelity decision verifiable against ground truth rather than against synthetic data the team had shaped.

Key Takeaways

A 0.3 percent fraud class produced a useless high-accuracy model; the goal was recall, not accuracy.
Synthetic minority-class generation, with a tuned ratio, lifted fraud recall from about 20 to about 68 percent.
Manual inspection of a small batch caught a broken constraint that fidelity metrics had passed.
A locked real holdout made every later decision verifiable.
Skipping early privacy testing and drift planning were the retrospective's main regrets.

Agency Script Editorial
