Case Study: Supervised vs Unsupervised Learning in Practice

Two agencies. Same general goal: use machine learning to grow revenue. Same six-month window. Radically different approaches — and outcomes that reveal something most introductory ML content glosses over entirely.

The first agency, a mid-size e-commerce consultancy, wanted to predict which customers would churn within 90 days. They had labeled historical data: thousands of accounts with known outcomes. The second, a content marketing firm, wanted to understand why their newsletter audience had stopped engaging — but they had no clear definition of "disengaged" yet, no tidy labels, just raw behavioral data sitting in their ESP.

Both chose machine learning. Both succeeded, eventually. But the path that led each agency there — the decision about which type of machine learning to use — determined how long it took, how much it cost, and how useful the output actually was. That decision was supervised vs. unsupervised learning, and the stakes were higher than either team initially understood.

This article walks through both case studies in full: situation, decision rationale, execution, measurable outcomes, and the lessons that transfer to your own projects.

The Core Distinction (and Why It Actually Matters)

Most explanations of supervised vs. unsupervised learning stop at the textbook definition. Supervised learning trains on labeled data — input/output pairs — so the model learns to predict a known target. Unsupervised learning finds structure in unlabeled data, grouping or compressing it without a predefined answer key.

That distinction sounds clean in theory. In practice, the choice forces you to answer a harder upstream question: Do you already know what you're looking for?

If you have a defined outcome — churn, conversion, fraud, diagnosis — and historical examples of that outcome, supervised learning is usually the right tool. If you're still in discovery mode, trying to find patterns you haven't named yet, unsupervised learning is often more appropriate. Getting this wrong doesn't just slow you down; it can produce confidently wrong results that are harder to catch than an obvious failure.

Case Study 1: Predicting Churn with Supervised Learning

The Situation

The e-commerce consultancy managed retention programs for roughly 40 clients. Each client's customer base ranged from 5,000 to 200,000 accounts. The consultancy suspected they were losing clients partly because they couldn't quantify churn risk proactively — they were reacting to cancellations instead of preventing them.

Their dataset: 18 months of transaction history, login frequency, support ticket volume, and subscription tier changes for one anchor client with 85,000 accounts. Crucially, they had a binary label for each account: churned or retained at the 90-day mark.

The Decision

They chose a gradient-boosted classifier (XGBoost, specifically) — a supervised approach. The reasoning was straightforward: the target variable existed, the historical labels were reliable, and the business outcome they cared about was specific and binary. There was no discovery phase needed. They knew what churn was.

This is where many teams waste weeks: debating model architecture before confirming data quality. Their first two weeks were spent entirely on label validation — auditing whether the "churned" flags in the database accurately reflected actual cancellations, not just paused accounts or payment failures. Roughly 8% of their initial labels were wrong. Fixing that before training saved them from building a model on a corrupted foundation.

Execution

Feature engineering took longer than model training. The team derived 34 features from raw transaction logs:

Days since last purchase
Purchase frequency trend over 30/60/90-day rolling windows
Average order value trajectory
Support ticket sentiment (extracted with a lightweight NLP classifier)
Ratio of browsed SKUs to purchased SKUs
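
Two of the features above can be sketched in a few lines. This is a minimal illustration, not the consultancy's actual pipeline: the function name `purchase_features`, the input format, and the sample dates are hypothetical, and the "trend" is assumed to start from simple purchase counts in each 30/60/90-day window.

```python
from datetime import date, timedelta

def purchase_features(purchase_dates, as_of):
    """Derive recency and rolling-window purchase counts for one account.

    purchase_dates: list of datetime.date objects (hypothetical input format).
    as_of: the date the prediction is made. Later purchases are ignored so
           the features never use information from the future.
    """
    past = [d for d in purchase_dates if d <= as_of]
    features = {
        "days_since_last_purchase": (as_of - max(past)).days if past else None,
    }
    for window in (30, 60, 90):
        cutoff = as_of - timedelta(days=window)
        # Count purchases that fall strictly after the window's start date.
        features[f"purchases_last_{window}d"] = sum(1 for d in past if d > cutoff)
    return features

# Hypothetical account: the April purchase is after as_of and is excluded.
example = purchase_features(
    [date(2024, 1, 5), date(2024, 2, 20), date(2024, 3, 25), date(2024, 4, 2)],
    as_of=date(2024, 3, 31),
)
```

Filtering on `as_of` is the design choice that matters here: it keeps every feature computable at prediction time, which is the guard against the leaky-feature failure described later in this article.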

They split the data 70/15/15 (train/validation/test), trained the classifier, and evaluated it using AUC-ROC rather than raw accuracy, because with a churn rate of roughly 12%, accuracy alone would be misleading. A model that predicted "retained" for everyone would be 88% accurate and completely useless.
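
The accuracy trap is easy to reproduce. The sketch below uses a made-up sample of 100 accounts with the 12% churn rate mentioned above and a plain-Python AUC-ROC (the probability that a randomly chosen churner outscores a randomly chosen retained account). The function names are illustrative; in practice a library implementation would normally be used.

```python
def accuracy(labels, predictions):
    """Share of accounts where the predicted class matches the true class."""
    return sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)

def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n^2), which is fine
    for a small illustration but not for 85,000 accounts.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up sample: 12 churners (1) and 88 retained accounts (0).
sample_labels = [1] * 12 + [0] * 88

# A "model" that predicts retained for everyone and gives everyone one score.
baseline_accuracy = accuracy(sample_labels, [0] * 100)
baseline_auc = auc_roc(sample_labels, [0.0] * 100)
```

Here `baseline_accuracy` comes out at 0.88 while `baseline_auc` is 0.5, the same as random guessing, which is why AUC-ROC is the more honest metric for this class balance.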

Final AUC-ROC on the holdout set: 0.83. That's a solid result for a first production model in this domain, though not exceptional. Precision at a 0.6 probability threshold was 71%, meaning 71% of the accounts the model flagged as high-risk actually churned.
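
Precision at a threshold is a simple ratio: of the accounts scored at or above the cutoff, how many actually churned. A minimal sketch, using ten invented holdout accounts and the 0.6 threshold from the text (the scores and labels below are illustrative, not the consultancy's data):

```python
def precision_at_threshold(labels, scores, threshold=0.6):
    """Fraction of flagged accounts (score >= threshold) that truly churned."""
    flagged = [y for y, s in zip(labels, scores) if s >= threshold]
    if not flagged:
        return None  # nothing was flagged, so precision is undefined
    return sum(flagged) / len(flagged)

# Invented holdout sample: predicted churn probability and true outcome.
holdout_scores = [0.91, 0.85, 0.72, 0.66, 0.61, 0.58, 0.40, 0.22, 0.10, 0.05]
holdout_labels = [1,    1,    0,    1,    1,    1,    0,    0,    0,    0]

precision = precision_at_threshold(holdout_labels, holdout_scores)
```

In this toy sample five accounts clear the threshold and four of them churned. Raising the threshold usually trades recall for precision, so the cutoff is a business decision as much as a modeling one.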

Measurable Outcome

Over the following quarter, the consultancy ran targeted retention campaigns only on model-flagged accounts (top 15% risk score). Compared to the prior quarter's blanket campaigns:

Intervention cost dropped by roughly 40% (fewer accounts receiving expensive outreach)
Churn rate among flagged accounts decreased by approximately 22 percentage points versus the control group
The anchor client renewed its contract and expanded scope
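
Selecting the top 15% by risk score is a ranking step rather than a fixed probability cutoff. One way it might look, with hypothetical account IDs and scores:

```python
def top_risk_accounts(risk_scores, fraction=0.15):
    """Return account IDs in the top `fraction` by predicted churn risk.

    risk_scores: dict mapping account ID -> predicted churn probability
                 (hypothetical structure for illustration).
    """
    n_flagged = max(1, round(len(risk_scores) * fraction))
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    return ranked[:n_flagged]

# Twenty made-up accounts with evenly spaced scores from 0.00 to 0.95.
risk_scores = {f"acct_{i:02d}": i / 20 for i in range(20)}
flagged = top_risk_accounts(risk_scores)
```

Ranking by percentile keeps the campaign size predictable from quarter to quarter, whereas a fixed threshold lets the number of flagged accounts drift with the score distribution.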

The model became a productized offering. Within six months, the consultancy had deployed a version of it for four additional clients, with client-specific retraining on each dataset.

Case Study 2: Discovering Audience Segments with Unsupervised Learning

The Situation

The content marketing firm had a newsletter list of 110,000 subscribers, accumulated over four years. Open rates had slid from 28% to 19% over 18 months. The instinct was to "re-engage" the list — but with what message, to whom?

Here's the problem: "disengaged" meant different things. Some subscribers had never opened anything. Others had been highly active and then gone cold. Others opened occasionally but never clicked. Treating all of them the same way with a single re-engagement campaign was already failing.

They had no labels. No one had sat down and hand-coded 110,000 subscribers as "segment A" or "segment B." That's precisely when unsupervised learning earns its place.

The Decision

They chose k-means clustering — an unsupervised algorithm that partitions data into k groups by minimizing intra-cluster variance. The appeal: interpretable outputs, fast iteration, and clusters that could be profiled and named by the marketing team without needing a data scientist to decode them.
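
For readers who want to see the mechanics, here is a bare-bones version of the standard k-means loop (Lloyd's algorithm) in plain Python. It is a teaching sketch on six toy points, not the firm's implementation. A production project would normally rely on a library, and the `seed` and `iterations` parameters are choices made for this example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points and updating centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the loop has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids

def inertia(points, assignments, centroids):
    """Within-cluster sum of squared distances: the quantity k-means minimizes."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))

toy_points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
toy_assignments, toy_centroids = kmeans(toy_points, k=2)
toy_inertia = inertia(toy_points, toy_assignments, toy_centroids)
```

The `inertia` value is the "intra-cluster variance" the algorithm drives down, and it is also the number plotted when choosing k with the elbow method.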

The harder choice was what to cluster on. They built a behavioral feature matrix for each subscriber:

Total opens, total clicks, click-to-open rate
Recency of last open (days)
Topic category affinity (derived from which content pillars generated clicks)
Enrollment channel (organic search, referral, paid, event)

One important feature-selection note: they didn't include subscriber demographics like job title or company size in the initial clustering. Demographics can dominate the distance calculations and mask the behavioral patterns they actually needed to understand.
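
One way a demographic feature can dominate is sheer numeric scale. The toy example below is an assumption-laden illustration (three invented subscribers, one behavioral feature, one unscaled demographic feature), showing how the nearest neighbor flips once company size enters the distance.

```python
import math

# Invented subscribers: A and C behave alike, A and B work at similar-size firms.
subscribers = {
    "A": {"click_to_open": 0.40, "company_size": 50},
    "B": {"click_to_open": 0.05, "company_size": 60},
    "C": {"click_to_open": 0.42, "company_size": 5000},
}

def nearest(target, feature_names):
    """Name of the subscriber closest to `target` on the chosen features."""
    t = [subscribers[target][f] for f in feature_names]
    others = [name for name in subscribers if name != target]
    return min(
        others,
        key=lambda name: math.dist(t, [subscribers[name][f] for f in feature_names]),
    )

behavior_only = nearest("A", ["click_to_open"])
with_demographics = nearest("A", ["click_to_open", "company_size"])
```

On behavior alone, A's nearest neighbor is C. Add raw company size and it becomes B, even though B barely clicks. Scaling features helps with the magnitude problem, but leaving demographics out of the first pass, as the firm did, avoids the question entirely.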

Execution

They ran the elbow method to choose k, testing k=2 through k=9. The elbow appeared at k=5, suggesting five meaningfully distinct clusters. They also validated with silhouette scores; the average score at k=5 was 0.41, which is moderate but acceptable for behavioral marketing data.
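
The silhouette score compares, for each point, how close it sits to its own cluster versus the nearest other cluster. A plain-Python sketch of the standard definition follows. The toy points and the two candidate assignments are invented for illustration, and a library implementation would be the usual choice on 110,000 subscribers.

```python
import math

def silhouette_score(points, assignments):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = {}
    for p, a in zip(points, assignments):
        clusters.setdefault(a, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, a in zip(points, assignments):
        own = clusters[a]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # `own` includes p itself at distance 0, so divide by len(own) - 1.
        cohesion = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Mean distance to the closest neighboring cluster.
        separation = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for label, other in clusters.items()
            if label != a
        )
        denom = max(cohesion, separation)
        if denom > 0:
            total += (separation - cohesion) / denom
    return total / len(points)

sil_points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
good_split = silhouette_score(sil_points, [0, 0, 0, 1, 1, 1])
bad_split = silhouette_score(sil_points, [0, 1, 0, 1, 0, 1])
```

Clean toy data like this scores close to 1 for the sensible split. Real behavioral data rarely separates that neatly, which is why a figure like 0.41 can still describe usable clusters.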

The five clusters, after the marketing team reviewed centroid profiles:

Loyalists — High open rate, high CTR, recent activity. ~11% of the list.
Skimmers — High open rate, very low CTR. Open but rarely act. ~23%.
Topic specialists — Low overall open rate but high engagement on specific content pillars. ~19%.
Early dropoffs — Opened 1–3 times at enrollment, nothing since. ~28%.
Ghosts — No meaningful engagement in 12+ months. ~19%.

This taxonomy didn't exist before the analysis. It was found, not defined.

Measurable Outcome

Each cluster received a different re-engagement treatment:

Loyalists: upsell sequences and referral programs
Skimmers: format experiments (shorter emails, stronger subject lines)
Topic specialists: personalized content tracks
Early dropoffs: a single high-value lead magnet with a 30-day send window
Ghosts: suppressed from active sends, reducing deliverability risk

After one quarter:

Overall open rate climbed from 19% to 24%
Unsubscribe rate dropped by roughly 30% (less irrelevant mail)
List size shrank by 14% as Ghosts were suppressed — which the team initially resisted but then recognized as a win: smaller, cleaner, more engaged list

Revenue attributed to newsletter-driven conversions increased by approximately 35% despite the smaller list.
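
Putting the last two figures together shows how much the per-subscriber picture changed. The calculation below assumes the 14% shrinkage and the roughly 35% revenue increase describe the same before/after comparison, and it uses an arbitrary revenue index of 100 because the article does not report absolute revenue.

```python
list_before = 110_000
list_after = round(list_before * (1 - 0.14))  # subscribers left after suppression

revenue_before = 100.0  # arbitrary index, not a real revenue figure
revenue_after = revenue_before * 1.35

# Relative change in revenue per remaining subscriber.
per_subscriber_change = (
    (revenue_after / list_after) / (revenue_before / list_before) - 1
)
```

Under those assumptions, revenue per subscriber rose by about 57% (1.35 / 0.86 ≈ 1.57), a larger shift than the headline 35% suggests.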

Where the Two Approaches Intersect

These case studies didn't stay cleanly separated. After the content marketing firm completed their clustering, they faced a new question: could they predict which new subscribers would become Topic Specialists vs. Early Dropoffs, and intervene earlier?

That's a supervised learning problem — and it required the unsupervised output as its foundation. The cluster labels from k-means became the training labels for a classifier. This sequence — unsupervised discovery, then supervised prediction — is a legitimate and underused workflow. Building a repeatable workflow for how generative AI works requires the same kind of sequenced thinking: establish what you're working with before you try to optimize it.
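
A stripped-down version of that hand-off might look like the following. Everything here is hypothetical: the early-behavior features (opens and clicks in the first 30 days), the six sample subscribers, and the choice of a nearest-centroid classifier, which stands in for whatever model the firm actually trained.

```python
from collections import defaultdict

def fit_nearest_centroid(features, cluster_labels):
    """Average the feature vectors belonging to each cluster label."""
    grouped = defaultdict(list)
    for x, label in zip(features, cluster_labels):
        grouped[label].append(x)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )

# Hypothetical (opens, clicks) in the first 30 days after signup, labeled
# with the segment each subscriber later landed in according to k-means.
early_behavior = [(2, 0), (1, 0), (3, 0), (4, 3), (5, 4), (3, 3)]
segment_labels = ["early_dropoff"] * 3 + ["topic_specialist"] * 3

segment_model = fit_nearest_centroid(early_behavior, segment_labels)
new_subscriber_segment = predict(segment_model, (4, 2))
```

The classifier can only be as good as the clusters it learns from, so any noise in the k-means assignments carries straight into the predictions.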

Similarly, the churn prediction model eventually hit a ceiling. At AUC-ROC 0.83, marginal improvements came slowly. The consultancy started exploring whether hidden sub-segments of churners behaved differently — and used clustering on the false negatives (churners the model missed) to find a behavioral pattern they hadn't encoded in their features. Unsupervised learning surfaced the blind spot that supervised learning couldn't self-diagnose.

Understanding how these models make decisions at the layer level — which features they weight, how representations form — is covered in depth in The Complete Guide to Neural Networks. For readers newer to how models learn from data at all, Neural Networks: A Beginner's Guide builds the conceptual scaffolding first.

The Four Decision Factors That Actually Drive the Choice

When you're standing at the fork between supervised and unsupervised, four factors determine which road to take:

1. Label availability. If you have reliable labels, supervised learning almost always outperforms unsupervised on a defined prediction task. If you don't — or if the act of creating labels would take months and introduce bias — unsupervised is your starting point.

2. Problem definition clarity. "Predict churn" is defined. "Understand our audience better" is not. Vague problems fed into supervised models produce precise-sounding nonsense.

3. The cost of a wrong prediction. Supervised models make specific, auditable predictions. Unsupervised outputs are interpretations. If someone's livelihood or safety depends on the answer, the auditability of supervised models is a significant advantage.

4. Your data volume and quality. Supervised learning needs enough labeled examples of each class (as a rough guide: hundreds at minimum, ideally thousands per class for tabular data). Unsupervised methods can surface patterns with smaller datasets but are more sensitive to noisy features.
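
As a quick check against that guideline, the churn case study's own numbers can be run through it. The sketch assumes the 70% training split was stratified, so the training set keeps the same roughly 12% churn rate as the full dataset.

```python
accounts = 85_000
churn_rate = 0.12       # approximate figure from the case study
train_fraction = 0.70

churners_total = round(accounts * churn_rate)
churners_in_training = round(churners_total * train_fraction)
retained_in_training = round(accounts * train_fraction) - churners_in_training
```

That works out to roughly 7,100 churned accounts in training, comfortably in the "thousands per class" range even for the minority class.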

Common Failure Modes

Both approaches have predictable failure modes worth naming before you're inside one:

Supervised learning failures:

Training on leaky features (data that wouldn't exist at prediction time)
Imbalanced classes evaluated by raw accuracy
Concept drift (the world changes but the model doesn't)

Unsupervised learning failures:

Choosing k by intuition rather than validation metrics
Including irrelevant features that swamp meaningful signal
Treating cluster labels as ground truth rather than as hypotheses requiring validation

The pattern underlying most of these failures is the same: moving from data to model without adequate investment in understanding the data itself. A step-by-step approach to neural networks makes the same point about architecture decisions — process discipline before technical sophistication.

Frequently Asked Questions

Can you use supervised and unsupervised learning together in one project?

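Yes, and this combination is often more powerful than either alone. A common workflow is to use unsupervised clustering to discover natural groupings, then use those cluster assignments as labels to train a supervised classifier for future predictions. The newsletter case study above moved in this direction when the firm set out to predict which new subscribers would become Topic Specialists or Early Dropoffs. The churn case study combined the two in the opposite order, clustering the classifier's false negatives to find a pattern its features had missed.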

How much data do you need for each approach?

For supervised learning on tabular classification tasks, rough practical minimums are a few hundred labeled examples per class, with performance improving significantly through the thousands. Unsupervised methods are less data-hungry but more sensitive to feature quality — garbage features produce meaningless clusters regardless of sample size.

What's the difference between a supervised vs unsupervised learning case study and just A/B testing?

A/B testing measures the effect of a specific intervention with a predefined hypothesis. Machine learning case studies use algorithms to discover patterns (unsupervised) or predict outcomes (supervised) from historical data. They're complementary: ML can tell you who to target; A/B testing tells you whether your intervention actually worked.

Is unsupervised learning harder to implement than supervised?

Not necessarily harder technically, but harder to evaluate. Supervised models have clear metrics — AUC, precision, recall. Unsupervised outputs require human interpretation to validate, which introduces subjectivity. The difficulty is less in running the algorithm and more in determining whether the output is meaningful.

How do these approaches relate to generative AI?

Large language models are pretrained with self-supervised learning: the raw text carries no human-written labels, but predicting the next token turns the text itself into the target, so the training objective looks like supervised learning at scale. Many models are then fine-tuned on explicitly labeled examples. The future of how generative AI works explores how this training paradigm is evolving. Understanding the supervised/unsupervised distinction gives you a more grounded mental model of what large language models actually are.

Key Takeaways

Supervised learning requires labeled data and a defined target variable. If you don't have both, you're not ready for it.
Unsupervised learning is a discovery tool, best used when you need to find structure before you can define a prediction problem.
The two approaches are often sequential, not alternatives — unsupervised first, then supervised on the discovered categories.
Data quality matters more than algorithm sophistication. The churn team spent its first two weeks validating labels before training anything, and the newsletter team's harder choice was what to cluster on, not which algorithm to run.
Evaluation metrics must match the problem. AUC-ROC for imbalanced classification; silhouette scores or elbow analysis for clustering.
Shrinking a list, removing noisy features, or suppressing bad data is a win, not a concession — both case studies benefited from narrowing scope.
Start with the business question, not the algorithm. The question determines the approach; the approach determines the data requirements; the data requirements reveal the real project scope.

Agency Script Editorial