Case Study: Machine Learning Basics in Practice

A case study is the fastest shortcut to genuine understanding. Reading about gradient descent or cross-validation in the abstract is one thing; watching a real team make decisions, absorb surprises, and measure what actually changed is another. This article follows a mid-sized marketing agency — 22 people, a mix of account managers, strategists, and creatives — through their first deliberate machine learning project. The goal was practical and scoped: use ML to reduce the time spent manually scoring inbound leads before routing them to sales.

The narrative is real in structure even if some details are composited for clarity. The decisions, failure modes, and outcomes represent what happens when intelligent non-specialists approach ML with seriousness and reasonable resources. If you're a professional or agency operator trying to understand where machine learning actually fits into your work, this is the case study worth reading.

The Situation: A Real Problem Worth Solving

The agency received between 80 and 140 inbound leads per month through a mix of contact forms, event signups, and referral introductions. Each lead was manually reviewed by a senior account manager, who would spend 15–25 minutes researching the company, estimating fit, and assigning a priority tier. The process worked, but it was expensive in attention and inconsistent in output — two different people scoring the same lead would often reach different conclusions.

The team had historical data: three years of CRM entries, including the original lead attributes, the assigned tier, and the eventual outcome (won, lost, stalled). That data covered roughly 2,800 records. It wasn't a massive dataset, but it was labeled, relatively clean, and directly connected to a decision the business made every week.

Why This Problem Was ML-Ready

Not every business problem benefits from machine learning. This one passed the basic tests:

Labeled historical outcomes existed. The CRM recorded what happened to each lead — a prerequisite for supervised learning.
The input features were structured. Company size, industry, source channel, geographic region, deal size estimate — all fields with consistent values.
The cost of errors was asymmetric but tolerable. A false positive (routing a weak lead as high priority) wasted time. A false negative (missing a good lead) cost revenue. Neither was catastrophic enough to require near-perfect precision before deployment.
Human effort was the bottleneck. The process cost roughly 4–6 hours of senior time per week, week after week.

Before doing anything technical, the team walked through A Framework for Machine Learning Basics to confirm they were solving the right problem in the right way. That discipline matters. ML applied to the wrong problem is expensive in both time and credibility.

The Decision: Choosing the Right Approach

The team considered three paths:

Buy a scoring tool baked into their CRM (HubSpot's native lead scoring)
Use a no-code ML platform like Obviously AI or Akkio
Build a lightweight custom model using Python and scikit-learn, supported by a part-time data analyst contractor

They rejected option one quickly. The native tool applied generic scoring logic and couldn't be trained on their specific historical data without an expensive upgrade tier. Option three felt premature — they didn't have in-house ML expertise, and the contractor cost would take months to justify.

They chose option two: a no-code platform that allowed CSV import, automatic feature selection, model training, and output as a scored probability. The full comparison of tools for this kind of work helped frame the trade-offs. No-code platforms sacrifice some precision and interpretability for speed and accessibility — an acceptable trade for a first deployment.

What They Decided to Predict

The precise framing of the prediction target matters enormously. The team initially said they wanted to predict "lead quality." That's not an ML target; it's a concept. They had to translate it into something measurable in the data.

After reviewing the CRM, they settled on a binary classification task: predict whether a lead would move to proposal stage within 90 days of first contact. Proposal stage was a concrete CRM milestone, consistently logged, and a reliable proxy for genuine opportunity. This reframing is one of the most important moves in any ML project, and it happens before any model training.
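As a rough illustration of that reframing, the sketch below turns two CRM dates into the binary target. The field names (first_contact, proposal_date) and the three sample leads are hypothetical, not taken from the agency's CRM; only the 90-day window comes from the case.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=90)  # the 90-day window described in the case


def reached_proposal_in_window(first_contact, proposal_date):
    """Return 1 if the lead hit proposal stage within 90 days, else 0.

    proposal_date is None when the lead never reached proposal stage.
    """
    if proposal_date is None:
        return 0
    elapsed = proposal_date - first_contact
    return int(timedelta(0) <= elapsed <= WINDOW)


# Hypothetical CRM rows: (lead id, first contact, proposal date or None)
leads = [
    ("A-001", date(2023, 1, 10), date(2023, 2, 20)),  # 41 days later
    ("A-002", date(2023, 1, 15), None),               # never reached proposal
    ("A-003", date(2023, 2, 1), date(2023, 6, 30)),   # 149 days later
]

labels = {
    lead_id: reached_proposal_in_window(first, proposal)
    for lead_id, first, proposal in leads
}
print(labels)  # {'A-001': 1, 'A-002': 0, 'A-003': 0}
```

One consequence of a windowed target like this: leads that are less than 90 days old at export time do not have a final label yet, so it is usually safer to leave them out of training.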

The Execution: Building and Testing the Model

Even with a no-code platform, the data had to be cleaned before import. A part-time data analyst contractor spent three days on data preparation. This is typical: data preparation is often estimated to take 50–70% of the total project time in applied ML work. The tasks:

Removing duplicates and null-heavy rows. About 180 records were dropped, leaving 2,620.
Encoding categorical variables. Industry and source channel were converted to numeric representations.
Splitting the dataset. 80% for training, 20% held out for testing. The holdout set was kept untouched until evaluation.
Checking class imbalance. Only about 28% of leads had reached proposal stage — a meaningful imbalance. They applied a mild oversampling technique (SMOTE) to avoid the model learning to predict "no" for everything.
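The split and rebalancing steps above can be sketched with the standard library alone. This is a simplified stand-in, not the contractor's actual workflow: it uses plain random oversampling (duplicating minority rows) rather than SMOTE, which synthesizes new minority points, and the 40% target share is an assumed reading of "mild". Oversampling is applied to the training portion only, so the holdout set stays untouched.

```python
import random


def stratified_split(rows, label_key, test_frac=0.2, seed=0):
    """Split rows into train/test while keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for value in {r[label_key] for r in rows}:
        group = [r for r in rows if r[label_key] == value]
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def random_oversample(rows, label_key, target_ratio=0.4, seed=0):
    """Duplicate minority rows until they make up about target_ratio of the data."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    # Solve minority / (minority + majority) = target_ratio for the minority count.
    wanted = int(target_ratio * len(majority) / (1 - target_ratio))
    extra = [rng.choice(minority) for _ in range(max(0, wanted - len(minority)))]
    return majority + minority + extra


# Synthetic stand-in for the cleaned data: 2,620 rows, roughly 28% positives.
rows = [{"id": i, "label": 1 if i % 100 < 28 else 0} for i in range(2620)]

train, test = stratified_split(rows, "label")
balanced_train = random_oversample(train, "label")  # holdout is never resampled

share = sum(r["label"] for r in balanced_train) / len(balanced_train)
print(len(train), len(test), len(balanced_train), round(share, 2))
```

Resampling before the split would let copies of the same lead land on both sides of it, which tends to inflate the holdout score.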

The no-code platform trained several model types automatically — logistic regression, a decision tree, and a gradient boosting variant. It selected the gradient boosting model based on AUC score on the validation split.

Interpreting the First Results

The first model run produced an AUC of 0.74. For context: 0.5 is random chance, 0.7–0.8 is generally considered functional for business applications, and 0.9+ is either excellent or suspicious, since a score that high on messy business data often points to information about the outcome leaking into the inputs. The team resisted the temptation to declare success and instead examined the confusion matrix.
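One way to read an AUC figure: it is the probability that a randomly chosen lead that reached proposal stage receives a higher score than a randomly chosen lead that did not. The sketch below computes it by counting pairs. The labels and scores are invented for illustration and are not the agency's data.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: 3 leads that reached proposal stage, 5 that did not.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.81, 0.30, 0.52, 0.58, 0.12, 0.66, 0.45, 0.20]

print(round(auc(labels, scores), 3))  # 14 of 15 pairs ranked correctly -> 0.933
```

The pairwise loop is quadratic, which is fine for a holdout set of a few hundred leads; library implementations use a sort-based method instead.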

At their chosen threshold (0.55 probability = "high priority"), the model was:

Correctly flagging about 68% of leads that would eventually reach proposal stage
Incorrectly flagging roughly 22% of leads that would not reach proposal stage (false positives)
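These two rates become easier to interpret once they are turned into counts. The sketch below rebuilds an approximate confusion matrix under two assumptions the case does not state outright: that the holdout set was 524 leads (20% of 2,620) and that it had the same 28% positive rate as the full dataset.

```python
def confusion_from_rates(n_total, positive_rate, recall, false_positive_rate):
    """Rebuild approximate confusion-matrix counts from summary rates."""
    positives = round(n_total * positive_rate)
    negatives = n_total - positives
    tp = round(positives * recall)               # good leads flagged
    fn = positives - tp                          # good leads missed
    fp = round(negatives * false_positive_rate)  # weak leads flagged
    tn = negatives - fp                          # weak leads correctly left alone
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn}


counts = confusion_from_rates(
    n_total=524, positive_rate=0.28, recall=0.68, false_positive_rate=0.22
)
precision = counts["tp"] / (counts["tp"] + counts["fp"])

print(counts)               # {'tp': 100, 'fn': 47, 'fp': 83, 'tn': 294}
print(round(precision, 2))  # 0.55
```

Under those assumptions, a little over half of the leads flagged as high priority would go on to reach proposal stage, compared with a base rate of about 28%.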

The account manager who had been doing manual scoring estimated her own consistency was around 70–75% for top-tier leads, based on outcome review. The model was operating in that range, without the 15–25 minutes of manual effort per lead.

They also reviewed feature importance. The top predictors, in order, were:

Source channel (referrals converted at 3× the rate of inbound form fills)
Company revenue band
Industry vertical
Time between form fill and first response (faster responses predicted a higher chance of reaching proposal stage)
Geographic region

None of these were surprising, which was actually a good sign. When a model's top features are completely counterintuitive, something is usually wrong — either with the data or the framing.

The Outcome: What Actually Changed

They deployed the model in a lightweight way: the contractor built a simple spreadsheet-based scoring tool that the operations coordinator ran each Monday morning, producing a priority score for every new lead from the previous week. It wasn't automated via API — it was a deliberate first step that kept humans in the loop.

Over the following four months, they tracked the outcomes. Compared to the four-month baseline before deployment:

Senior account manager time on lead review dropped by roughly 60%, from ~5 hours/week to ~2 hours/week.
Lead-to-proposal conversion rate increased from 24% to 31%. Higher-priority leads received faster follow-up, which mattered.
One significant lead was nearly missed — flagged low by the model, rescued by a junior account manager who noticed the company in the news. This became a standing policy: model scores inform but don't override human review.

The dollar value wasn't dramatic in absolute terms, but reclaiming ~12 hours of senior time per month adds up to roughly 144 hours over a year, which is meaningful for a 22-person firm. More importantly, the team gained something harder to quantify: a shared understanding of what their data actually contained and what their best leads looked like.

Where It Broke Down: The Honest Failures

No ML deployment is clean. These were the real failure modes:

Data quality degraded over time. By month three, the operations team had gotten inconsistent about logging source channel — one of the top predictors. The model's accuracy drifted noticeably, and they had to retrain on fresher data. This is a perpetual challenge: measuring model performance over time isn't optional, it's the job.

The threshold choice needed adjustment. Their initial 0.55 cutoff was arbitrary. After two months, they reviewed which threshold produced the best actual business outcome (not just AUC) and moved it to 0.62, reducing false positives at a modest cost to recall.

Stakeholder skepticism created friction. Two senior account managers were resistant to using the scores, viewing them as a threat to their judgment. This was never fully resolved through data — it was resolved through transparency: showing them the feature importance list and framing the tool as "this is what you already know, made faster."

Understanding these failure modes in advance is part of what the machine learning basics trade-offs discussion is designed to surface. Every deployment has versions of these problems. The question is whether you're prepared to navigate them.
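The threshold adjustment described above can be sketched as a search over candidate cutoffs that minimizes the cost of errors rather than maximizing AUC. The cost weights, the candidate list, and the ten scored leads here are all hypothetical, and the sketch does not reproduce the agency's actual review.

```python
def cost_at_threshold(labels, scores, threshold, fp_cost=1.0, fn_cost=5.0):
    """Total cost of the errors made when flagging scores >= threshold."""
    cost = 0.0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if flagged and y == 0:
            cost += fp_cost  # weak lead routed as high priority
        elif not flagged and y == 1:
            cost += fn_cost  # good lead missed
    return cost


def best_threshold(labels, scores, candidates, **costs):
    """Return the candidate threshold with the lowest total error cost."""
    return min(candidates, key=lambda t: cost_at_threshold(labels, scores, t, **costs))


# Hypothetical scored leads: 1 = reached proposal stage, 0 = did not.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.75, 0.64, 0.50, 0.60, 0.58, 0.40, 0.35, 0.20, 0.10]
candidates = [0.45, 0.55, 0.62, 0.70]

# If both error types cost the same, the higher cutoff wins.
print(best_threshold(labels, scores, candidates, fp_cost=1.0, fn_cost=1.0))  # 0.62
# If a missed good lead is assumed to cost five times more, a lower cutoff wins.
print(best_threshold(labels, scores, candidates, fp_cost=1.0, fn_cost=5.0))  # 0.45
```

The same scores produce different "best" thresholds depending on how the two error types are priced, which is why the cutoff is a business decision and not a property of the model.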

What This Case Study Teaches About ML Basics

A few principles this project illustrates:

Problem framing precedes everything. "Lead quality" became "proposal stage within 90 days" — that translation was more important than any algorithm choice.
Data preparation is the real work. Three days of cleaning, encoding, and balancing for a 2,600-row dataset. Plan for this.
Start with interpretable outputs. A probability score that a human reviews is safer and more instructive than a fully automated routing system.
Model accuracy and business accuracy are not the same thing. An AUC of 0.74 delivered real value. Chasing 0.90 would likely have cost more than it returned.
Retraining is not a failure. It's the normal maintenance cycle of a model that operates in a changing environment.

Before beginning a project like this, teams benefit from working through the machine learning basics checklist as a pre-flight — it surfaces data readiness issues and scoping problems before they become expensive.

Frequently Asked Questions

Do you need a data scientist to run a project like this?

Not necessarily for a first deployment of this scope. A part-time data analyst or a technically capable operations person, combined with a no-code ML platform, can get a functional model running. Where a data scientist becomes important is in model interpretability, handling complex feature engineering, and managing larger or messier datasets.

How much historical data do you need to train a useful model?

For a structured, binary classification problem like lead scoring, 1,500–3,000 labeled records is generally sufficient to produce a model worth testing. Below 1,000, you're likely to see high variance and unreliable predictions. Above 10,000 clean records, you can start exploring more complex architectures with confidence.

How do you know when a model has drifted and needs retraining?

Monitor the business metric the model was built to improve — in this case, lead-to-proposal conversion rates — on a monthly basis. If the rate drops and you haven't changed anything else, drift is a likely suspect. You can also track prediction distribution: if the model starts scoring most leads near the middle of the probability range, its discriminative power has probably degraded.
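The prediction-distribution check can be sketched in a few lines: compare how many recent scores fall in the middle of the probability range against a baseline period. The 0.4 to 0.6 band, the 0.15 alert margin, and both score lists are illustrative assumptions, not values from the case.

```python
def mid_band_share(scores, low=0.4, high=0.6):
    """Fraction of scores that fall inside the undecided middle band."""
    if not scores:
        raise ValueError("no scores to summarize")
    return sum(low <= s <= high for s in scores) / len(scores)


def drift_suspected(baseline_scores, recent_scores, margin=0.15):
    """Flag possible drift when the mid-band share grows by more than margin."""
    growth = mid_band_share(recent_scores) - mid_band_share(baseline_scores)
    return growth > margin


# Illustrative scores: the baseline is spread out, the recent batch bunches up.
baseline = [0.05, 0.12, 0.20, 0.33, 0.48, 0.71, 0.80, 0.88, 0.93, 0.15]
recent = [0.41, 0.45, 0.52, 0.55, 0.58, 0.47, 0.62, 0.38, 0.50, 0.44]

print(mid_band_share(baseline))           # 0.1
print(mid_band_share(recent))             # 0.8
print(drift_suspected(baseline, recent))  # True
```

A check like this only raises a flag. Confirming drift still means comparing scores against actual outcomes once enough leads have had time to reach, or miss, proposal stage.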

What's the difference between a no-code ML platform and a CRM's built-in scoring?

Built-in CRM scoring tools usually apply generic rules or are trained on population-level data across all their customers, not your specific historical outcomes. No-code platforms let you train on your own labeled data, which produces a model tuned to your actual conversion patterns. The trade-off is setup time and the need for clean historical data.

Is this kind of project worth it for small teams?

It depends on whether the bottleneck is truly recurring human effort. If a task consumes 3–5 hours of senior time per week and the inputs are consistent and logged, the payoff on even a modest ML deployment is usually positive within three to six months. If the task is occasional or the data is sparse, the overhead is harder to justify.

Key Takeaways

The most important ML decision is precise problem framing — translating a vague goal into a specific, measurable prediction target.
Data preparation takes longer than model training. Budget accordingly and don't skip it.
Start with humans in the loop. A scored output that informs judgment is safer and more credible than full automation on a first deployment.
Track business metrics, not just model metrics. AUC and accuracy are proxies — what matters is the outcome you set out to change.
Model drift is inevitable. Build monitoring and periodic retraining into the plan from the start, not as an afterthought.
Stakeholder buy-in requires transparency, not just results. Show people what the model learned and why it makes sense.

Agency Script Editorial
