General

ML Projects Fail at a Predictable Set of Skipped Steps

Agency Script Editorial

Editorial Team

March 23, 2026 · 10 min read
Tags: machine learning basics, machine learning basics checklist, machine learning basics guide, ai fundamentals

Machine learning projects fail at a surprisingly predictable set of points. The cause is rarely the algorithms; it is that practitioners skip foundational steps they assume are obvious or already handled. This checklist exists to close that gap, whether you're evaluating an ML initiative at your agency, building your first pipeline, or auditing work someone else delivered.

Each item below pairs a concrete action with a brief justification. Work through it linearly the first time; use it as a spot-check on subsequent projects. The checklist covers the full arc: problem framing, data, modeling, evaluation, deployment, and ongoing governance. Miss items in the early phases and you pay for it in every phase that follows.

This is not a theoretical overview. If you want conceptual scaffolding before diving in, start with A Framework for Machine Learning Basics and return here once you're ready to act. The goal of this checklist is to hand you a working instrument, not a reading list.


Phase 1: Problem Framing

Before any data is touched, the problem must be defined with enough precision that success is falsifiable.

□ Translate the business problem into an ML task type

Classify it: supervised classification, supervised regression, clustering, ranking, recommendation, anomaly detection, or generative. Each has different data requirements, evaluation strategies, and deployment constraints. Misidentifying the task type early is one of the most expensive mistakes you can make — it leads to collecting the wrong labels and training the wrong model architecture.

□ Write a one-sentence success criterion

Example: "The model correctly identifies at-risk customers at least 80% of the time with a false positive rate below 15%." Vague success criteria ("the model performs well") guarantee scope creep and post-launch disputes. Numeric thresholds force honest conversation before work begins.
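A criterion like the example above can be turned into an executable check. The sketch below assumes that "correctly identifies at-risk customers" means recall on the at-risk class; the function name and the confusion counts are hypothetical.

```python
def meets_success_criterion(tp, fp, tn, fn, min_recall=0.80, max_fpr=0.15):
    """Return (passed, recall, false_positive_rate) from confusion counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    passed = recall >= min_recall and fpr < max_fpr
    return passed, recall, fpr


# Hypothetical counts from a validation run.
passed, recall, fpr = meets_success_criterion(tp=85, fp=90, tn=810, fn=15)
print(passed, round(recall, 2), round(fpr, 2))
```

Writing the thresholds as function arguments keeps the agreed numbers in one place, so a later change to the criterion is a visible code change rather than a quiet reinterpretation.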

□ Confirm that ML is the right tool

Ask: Can a lookup table, a rule set, or a simple statistical threshold solve this? If yes, use that instead. ML adds latency, cost, maintenance overhead, and explainability debt. Reserve it for problems where patterns are too complex or numerous to encode manually.

□ Map the decision that the output will drive

Who acts on the model's output? What action do they take? What's the cost of a false positive versus a false negative? A fraud detection model deployed in a high-stakes payment flow has an asymmetric cost structure that a content recommendation model does not. This asymmetry should shape every subsequent decision.
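One way to make the asymmetry concrete is to price each error type and see how the preferred decision threshold moves. This is a minimal sketch under assumed costs; the scores, labels, candidate thresholds, and cost figures are all hypothetical.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives at a threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 0:
            cost += cost_fp
        elif not predicted_positive and label == 1:
            cost += cost_fn
    return cost


def cheapest_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total error cost."""
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
candidates = [0.2, 0.5, 0.7]

# Missed positives are expensive (fraud-like): a low threshold wins.
print(cheapest_threshold(scores, labels, candidates, cost_fp=1, cost_fn=20))
# False alarms are expensive: a high threshold wins.
print(cheapest_threshold(scores, labels, candidates, cost_fp=10, cost_fn=1))
```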


Phase 2: Data Readiness

Data quality is the single largest determinant of model quality. Most teams underinvest here.

□ Audit data availability against your task definition

List every feature you hypothesize is predictive. For each, confirm: Is it available at prediction time? Is historical data sufficient to train on (typically 1,000–100,000+ labeled examples depending on task complexity)? Training on a feature that exists in your data warehouse but isn't available at inference time, for example one that is only populated after the outcome is known, creates data leakage. It is one of the most common ways models appear to work in development and fail in production.

□ Check class balance (for classification tasks)

If 95% of your examples belong to one class, a model that always predicts the majority class achieves 95% accuracy while being useless. Note the imbalance ratio and decide in advance whether you'll use oversampling, undersampling, synthetic data generation (SMOTE), or class-weighted loss functions.
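The imbalance check can be sketched in a few lines. The class weights below use a common inverse-frequency heuristic (total samples divided by the number of classes times the class count); that choice and the function name are assumptions for illustration, not the only option.

```python
from collections import Counter


def class_balance_report(labels):
    """Summarize class imbalance and suggest inverse-frequency weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    majority_class, majority_count = counts.most_common(1)[0]
    return {
        "majority_class": majority_class,
        # Accuracy of a model that always predicts the majority class.
        "majority_share": majority_count / n_samples,
        "imbalance_ratio": majority_count / min(counts.values()),
        "class_weights": {
            cls: n_samples / (n_classes * count) for cls, count in counts.items()
        },
    }


labels = [0] * 95 + [1] * 5
report = class_balance_report(labels)
print(report["majority_share"], report["imbalance_ratio"])
print(report["class_weights"])
```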

□ Profile missing data by feature, not just overall

A dataset that is "3% missing" overall might have one critical feature that is 40% null. Run column-level missingness analysis. Decide on imputation strategy (mean/median/mode, model-based, or drop) and document it — the same strategy must be applied identically at inference time.
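A column-level missingness profile might look like the sketch below. It assumes rows are dictionaries and that `None` marks a missing value; the column names and data are hypothetical.

```python
def missing_rate_by_column(rows):
    """Return the share of missing (None) values for each column."""
    columns = sorted({key for row in rows for key in row})
    n_rows = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / n_rows
        for col in columns
    }


rows = [
    {"age": 34, "income": None, "region": "west"},
    {"age": 51, "income": 72000, "region": "east"},
    {"age": None, "income": None, "region": "east"},
    {"age": 29, "income": 58000, "region": "north"},
    {"age": 45, "income": None, "region": "west"},
]
rates = missing_rate_by_column(rows)
overall = sum(rates.values()) / len(rates)
print(rates)              # per-column view exposes the problem column
print(round(overall, 3))  # the overall figure hides it
```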

□ Detect and document distribution shifts

Compare your training data's time range to the deployment context. If you're training on 2021–2023 data and deploying in 2026, behavioral shifts, market changes, or policy changes may have invalidated your features. The risk is especially acute when model environments evolve faster than a team's retraining cadence; see Machine Learning Basics: Trends and What to Expect in 2026.

□ Establish a data versioning practice

Use a tool or convention that ties each model version to a specific snapshot of training data. Without this, you cannot reproduce results, investigate failures, or compare experiments reliably. Even a naming convention with timestamps is better than nothing; dedicated tooling (DVC, Delta Lake, feature stores) is better still.
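Where dedicated tooling is not yet in place, a content hash is one lightweight convention for tying a model to its data. The sketch below is an assumption about how that could look, not how DVC or Delta Lake work; the function names, record fields, and snapshot label are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """SHA-256 over the rows; stable across dict key order, not row order."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def model_record(model_version, snapshot_label, rows):
    """Tie a model version to the exact training snapshot it was fit on."""
    return {
        "model_version": model_version,
        "data_snapshot": snapshot_label,
        "data_sha256": dataset_fingerprint(rows),
    }


rows_a = [{"id": 1, "spend": 10.5}, {"id": 2, "spend": 3.0}]
rows_b = [{"spend": 10.5, "id": 1}, {"spend": 3.0, "id": 2}]  # same data
rows_c = [{"id": 1, "spend": 10.5}, {"id": 2, "spend": 3.1}]  # one edit

record = model_record("churn-v3", "2026-03-01", rows_a)
print(record["data_sha256"][:12])
print(dataset_fingerprint(rows_a) == dataset_fingerprint(rows_b))
print(dataset_fingerprint(rows_a) == dataset_fingerprint(rows_c))
```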


Phase 3: Feature Engineering and Preprocessing

Raw data almost never goes directly into a model. This phase is where most domain expertise gets operationalized.

□ Encode categoricals explicitly and document the schema

One-hot encoding, ordinal encoding, target encoding, and embedding all make different assumptions. Choose based on cardinality and model type. High-cardinality categoricals (ZIP codes, product IDs) handled with one-hot encoding will explode your feature space; target encoding or learned embeddings are usually better. Document the encoding so your inference pipeline is identical to your training pipeline.
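To illustrate what "document the schema" can mean in practice, here is a minimal one-hot sketch in which the category order is frozen at training time and reused at inference. Mapping unseen categories to an all-zero vector is an assumption made for this example; the category values are hypothetical.

```python
def fit_one_hot_schema(training_values):
    """Freeze the ordered category list seen in training."""
    return sorted(set(training_values))


def one_hot(value, schema):
    """Encode against the frozen schema; unseen values become all zeros."""
    return [1 if value == category else 0 for category in schema]


schema = fit_one_hot_schema(["email", "phone", "email", "chat"])
print(schema)                    # the documented column order
print(one_hot("email", schema))
print(one_hot("fax", schema))    # category never seen in training
```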

□ Scale numerical features when your model requires it

Tree-based models (random forest, gradient boosting) are insensitive to feature scale. Linear models, SVMs, and neural networks are not. Applying standard scaling or min-max normalization inconsistently, or fitting the scaler on the test set instead of reusing training-set statistics, introduces subtle bugs that are hard to diagnose.

□ Build and freeze your preprocessing pipeline before training begins

The preprocessing steps applied to training data must be reproducible for every batch of inference data. Use a pipeline object (scikit-learn's Pipeline, Spark ML pipelines, or equivalent) that encapsulates fit-and-transform logic. If you transform training and inference data with separate ad-hoc scripts, you will eventually introduce a mismatch.
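The idea of a frozen pipeline can be sketched without any library: learn parameters from training data once, persist them, and apply exactly those parameters at inference. This is a simplified stand-in for a pipeline object such as scikit-learn's Pipeline, not its API; the function names and the median-impute-then-standardize steps are assumptions.

```python
import json
from statistics import mean, median, pstdev


def fit_preprocessing(train_values):
    """Learn imputation and scaling parameters from training data only."""
    observed = [v for v in train_values if v is not None]
    fill_value = median(observed)
    filled = [fill_value if v is None else v for v in train_values]
    # Fall back to 1.0 so a constant feature does not divide by zero.
    return {"fill_value": fill_value, "mean": mean(filled),
            "std": pstdev(filled) or 1.0}


def apply_preprocessing(values, params):
    """Apply the frozen parameters; nothing is re-learned here."""
    filled = [params["fill_value"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]


params = fit_preprocessing([1.0, None, 3.0, 5.0])
frozen = json.dumps(params)     # store alongside the model artifact
restored = json.loads(frozen)   # load in the inference service

print(apply_preprocessing([None, 5.0], restored))
```

Because inference only ever calls `apply_preprocessing` with the restored parameters, training and serving cannot drift apart through separately maintained scripts.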


Phase 4: Model Selection and Training

This phase is where practitioners most often over-invest relative to the value it returns.

□ Establish a baseline first

Before training any ML model, compute the simplest possible baseline: majority class prediction, mean value prediction, or a simple rule. Your ML model must beat this baseline meaningfully, or you have not yet proven the ML approach is warranted. Baselines also calibrate your expectations for what "good" looks like. See Machine Learning Basics: Trade-offs, Options, and How to Decide for guidance on model selection logic.
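Both baselines named above are a few lines each. In this sketch the baseline is learned from training labels and scored on held-out labels; the function names and data are hypothetical.

```python
from collections import Counter
from statistics import mean


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


def mean_baseline_mae(train_targets, test_targets):
    """Mean absolute error of always predicting the training mean."""
    guess = mean(train_targets)
    return mean([abs(y - guess) for y in test_targets])


print(majority_class_accuracy([0, 0, 0, 1], [0, 1, 0, 0]))
print(mean_baseline_mae([1.0, 2.0, 3.0], [2.0, 4.0]))
```

A candidate model's held-out score is only meaningful next to these numbers: an accuracy of 0.78 looks very different against a 0.75 baseline than against a 0.50 one.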

□ Choose a model family appropriate to your data size and latency constraints

Gradient boosting (XGBoost, LightGBM, CatBoost) typically outperforms other model families on tabular data with moderate dataset sizes. Neural networks win on unstructured data (images, text, audio) when data volume is high. Logistic regression remains competitive for linear problems and is far easier to deploy and explain. Match the tool to the context rather than defaulting to whatever is most familiar.

□ Use cross-validation, not a single train/test split

A single split introduces variance based on which samples land where. K-fold cross-validation (typically k=5 or k=10) gives you a more reliable estimate of generalization performance. For time-series data, use time-based splits that respect chronological order — shuffling time-series data leaks future information into training.
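The two splitting schemes can be sketched as index generators. The k-fold version assumes rows have already been shuffled (appropriate only for non-temporal data); the expanding-window version is one common time-based scheme, chosen here as an illustration. Both function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx); every sample is validated exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def expanding_window_splits(n_samples, n_splits, min_train):
    """Yield time-ordered splits: training rows always precede validation."""
    step = (n_samples - min_train) // n_splits
    for i in range(n_splits):
        train_end = min_train + i * step
        yield list(range(train_end)), list(range(train_end, train_end + step))


folds = list(k_fold_indices(10, 5))
time_splits = list(expanding_window_splits(10, 3, 4))
print(folds[0])
print(time_splits)
```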

□ Track experiments with versioned logs

Log hyperparameters, training data versions, preprocessing steps, and evaluation metrics for every run. Without this, you cannot determine which decisions improved performance or reproduce your best result. MLflow, Weights & Biases, and Neptune are purpose-built for this. A spreadsheet works if you maintain discipline.
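For teams not yet using a tracking tool, an append-only JSON Lines log captures the same four things. This is a minimal sketch, not the MLflow or Weights & Biases API; the field names, parameter values, and data version string are hypothetical, and an in-memory buffer stands in for a log file.

```python
import io
import json
from datetime import datetime, timezone


def log_run(stream, params, data_version, preprocessing, metrics):
    """Append one experiment record as a single JSON line."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "data_version": data_version,
        "preprocessing": preprocessing,
        "metrics": metrics,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(log_text, metric):
    """Return the logged run with the highest value of the given metric."""
    runs = [json.loads(line) for line in log_text.splitlines() if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


log = io.StringIO()  # in practice, a file opened in append mode
log_run(log, {"max_depth": 3}, "2026-03-01", ["median_impute"], {"auc": 0.81})
log_run(log, {"max_depth": 6}, "2026-03-01", ["median_impute"], {"auc": 0.84})
print(best_run(log.getvalue(), "auc")["params"])
```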


Phase 5: Evaluation

Evaluation deserves its own phase because evaluation errors are how bad models reach production. For a deeper treatment of metrics, see How to Measure Machine Learning Basics: Metrics That Matter.

□ Choose your primary metric before training begins

Accuracy, AUC-ROC, F1, precision, recall, RMSE, MAE, NDCG — each emphasizes different things. Define which metric is primary (optimized for) and which are guardrails (must not drop below threshold). Defining this after you see results creates confirmation bias.
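Committing the primary metric and guardrails to code before training makes the plan hard to revise after the fact. The sketch below assumes higher-is-better metrics (error metrics such as RMSE would need a ceiling instead of a floor); the function name and numbers are hypothetical.

```python
def evaluate_against_plan(metrics, primary, guardrails):
    """Check guardrail floors; guardrails maps metric name to minimum value."""
    violations = {
        name: metrics[name]
        for name, floor in guardrails.items()
        if metrics[name] < floor
    }
    return {
        "primary": primary,
        "primary_value": metrics[primary],
        "violations": violations,
        "acceptable": not violations,
    }


result = evaluate_against_plan(
    metrics={"f1": 0.72, "precision": 0.65, "recall": 0.81},
    primary="f1",
    guardrails={"precision": 0.70, "recall": 0.75},
)
print(result)
```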

□ Evaluate on a held-out test set that never touched model selection

If you used a validation set to tune hyperparameters, your validation performance is optimistically biased. Reserve a final test set that is used exactly once, after all decisions are frozen, to get an unbiased generalization estimate.
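A three-way split with a fixed seed is one simple way to set the test portion aside up front. The fractions, seed, and function name below are assumptions; for time-series data a chronological split would replace the shuffle.

```python
import random


def three_way_split(rows, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle once with a fixed seed, then carve out test, val, and train."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))
```

The validation portion is for tuning; the test portion is scored once, after every modeling decision is frozen.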

□ Slice your evaluation by subgroups

Aggregate metrics can mask poor performance on important subgroups (demographic segments, low-frequency categories, edge cases). A model with 90% accuracy overall might perform at 60% for the subgroup that matters most to your client. Subgroup analysis is both an ethical requirement and a practical quality control step.
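The 90% overall versus 60% subgroup scenario is easy to reproduce. The sketch below uses hypothetical labels, predictions, and group names constructed to match those figures.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Compute accuracy separately for each subgroup label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical data: 40 rows in group A (39 correct), 10 in group B (6 correct).
y_true = [1] * 50
y_pred = [1] * 39 + [0] + [1] * 6 + [0] * 4
groups = ["A"] * 40 + ["B"] * 10

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
by_group = accuracy_by_group(y_true, y_pred, groups)
print(overall)
print(by_group)
```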

□ Stress-test with adversarial and out-of-distribution examples

Feed the model data it wasn't designed for and observe how it fails. Does it fail silently with high-confidence wrong predictions, or does it surface uncertainty? High-confidence failures are the most dangerous in production.


Phase 6: Deployment and Monitoring

A model that performs well in development and fails silently in production is worse than no model.

□ Define your serving architecture before you start building

Batch prediction (run nightly, store results) is simpler to operate and appropriate for many use cases. Real-time inference adds latency constraints and infrastructure complexity. Choose the simpler option unless real-time is a genuine requirement. The right tools depend heavily on this choice — consult The Best Tools for Machine Learning Basics for environment-specific recommendations.

□ Build monitoring for data drift and prediction drift

Data drift: the distribution of input features shifts over time. Prediction drift: the distribution of model outputs shifts. Both are early warning signs that the model is encountering a world it wasn't trained on. Set alert thresholds on key feature statistics and output distributions. Review them monthly at minimum.
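One way to put a number on either kind of drift is the Population Stability Index (PSI), which compares binned shares between a reference sample and a recent one. PSI is not named in this article; it is used here as an assumed example metric, and the bin edges, data, and the 0.2 alert level (a commonly cited rule of thumb) are illustrative.

```python
import math


def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index is the number of edges the value meets or exceeds.
            counts[sum(1 for edge in bin_edges if v >= edge)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training_values = list(range(1, 11))   # reference feature values
recent_values = list(range(6, 16))     # the same feature, shifted upward
edges = [4, 7]

stable = population_stability_index(training_values, training_values, edges)
shifted = population_stability_index(training_values, recent_values, edges)
print(stable, round(shifted, 2))
print("alert" if shifted > 0.2 else "ok")
```

The same function works for prediction drift by passing model output scores instead of feature values.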

□ Establish a retraining cadence and trigger conditions

Retraining on a fixed schedule (monthly, quarterly) is a floor, not a ceiling. Define trigger conditions — prediction drift exceeds X%, business metric drops below Y%, a known external event invalidates training data — that prompt unscheduled retraining. Document who owns this process.
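Trigger conditions are easiest to audit when they live in one function. In this sketch the thresholds stand in for the X and Y above and are placeholders, not recommendations; the function and argument names are hypothetical.

```python
def retraining_reasons(prediction_drift, business_metric, days_since_training,
                       external_event=False, max_drift=0.10,
                       min_business_metric=0.75, max_age_days=90):
    """Return every reason retraining is due; an empty list means none."""
    reasons = []
    if days_since_training >= max_age_days:
        reasons.append("scheduled cadence reached")
    if prediction_drift > max_drift:
        reasons.append("prediction drift above threshold")
    if business_metric < min_business_metric:
        reasons.append("business metric below floor")
    if external_event:
        reasons.append("external event invalidated training data")
    return reasons


print(retraining_reasons(0.02, 0.80, 30))
print(retraining_reasons(0.15, 0.80, 30))
print(retraining_reasons(0.02, 0.70, 95, external_event=True))
```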

□ Document the model for future maintainers

At minimum: training data source and date range, features used, preprocessing steps, model type, evaluation results, known limitations, and the business decision it supports. A model without documentation is a liability that compounds with time.
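The minimum documentation list can be captured as a structured record with a completeness check, so that gaps are flagged before release. The class name, field names, and sample values below are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class ModelRecord:
    training_data_source: str = ""
    training_date_range: str = ""
    features: tuple = ()
    preprocessing_steps: tuple = ()
    model_type: str = ""
    evaluation_results: str = ""
    known_limitations: str = ""
    business_decision: str = ""


def undocumented_fields(record):
    """List the fields that are still empty."""
    return [name for name, value in asdict(record).items() if not value]


record = ModelRecord(
    training_data_source="crm_export",   # hypothetical source name
    training_date_range="2024-01 to 2025-12",
    features=("tenure_months", "support_tickets"),
    model_type="gradient boosting",
)
print(undocumented_fields(record))
```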


Frequently Asked Questions

What makes a machine learning basics checklist useful versus a generic tutorial?

A checklist enforces sequential decision-making and surfaces the specific failure points that tutorials often gloss over. Its value is operational: you use it on live projects to catch omissions before they become expensive. A tutorial explains; a checklist audits.

How long does it take to complete this checklist on a real project?

The checklist items themselves take minutes to review, but the work they point to varies enormously. Data readiness alone can take days to weeks on messy enterprise datasets. Treat the checklist as a scoping tool: early completion of Phase 1 and Phase 2 items will tell you most of what you need to know about project feasibility and timeline.

Can this checklist be used to audit ML work delivered by a vendor or contractor?

Yes, and that's one of its most practical uses. Work through Phases 1 and 5 first — problem framing and evaluation. If a vendor cannot produce a clearly defined success criterion, a reproducible preprocessing pipeline, and a held-out test set evaluation, the work has not met a professional standard regardless of what the demo shows.

Do all checklist items apply to projects using pre-trained models or APIs?

Not all, but most. If you're fine-tuning a foundation model or calling a third-party ML API, you can skip some training-specific items. You still need rigorous problem framing, evaluation on your specific data distribution, subgroup analysis, deployment monitoring, and documentation. The evaluation and monitoring phases are non-negotiable regardless of whether you trained the model yourself.

What's the most commonly skipped item on this checklist in practice?

Subgroup evaluation. Teams get excited about strong aggregate metrics and ship. The most consequential failures — both in terms of business impact and potential harm — typically show up in subgroup performance gaps that aggregate numbers completely hide.


Key Takeaways

  • Frame the problem as a specific ML task type and write a numeric success criterion before touching data.
  • Data quality and leakage prevention are higher-leverage than model selection; invest accordingly.
  • Build preprocessing as a frozen, reproducible pipeline — not a collection of ad-hoc scripts.
  • Always establish a baseline before claiming an ML model adds value.
  • Choose your primary evaluation metric before training, and evaluate on a true held-out test set.
  • Slice evaluation by subgroups; aggregate metrics routinely hide the failures that matter most.
  • Deployment requires ongoing monitoring of data drift and prediction drift, not just a launch-day check.
  • A model without documentation, retraining triggers, and a defined owner is a liability in progress.
