Most people approach overfitting and underfitting as a set of disconnected tips: add dropout, get more data, try regularization. That scattershot habit is why models stay broken. The fixes are not interchangeable. Most of them target one problem and can make the other worse. What you need is a sequence that diagnoses before it treats.
This guide is that sequence. Follow it in order, on your own model, today. Each step has a clear input, a clear output, and a decision that routes you to the next step. Do not skip the diagnosis steps to get to the fixes faster. Skipping diagnosis is how people spend a week adding regularization to a model that was underfitting the whole time.
By the end you will have a model whose generalization you understand and a clear record of which knob did what. That record is worth as much as the model.
Step 1: Build an Honest Data Split
Before any modeling, partition your data into three parts: training, validation, and a final test set you will not touch until the very end.
- Training set: the model learns from this.
- Validation set: you tune against this.
- Test set: you look at this exactly once, at the end.
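The three-way partition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the function name `three_way_split`, the 70/15/15 proportions, and the fixed seed are all assumptions, and the random shuffle assumes your rows are independent (it is not appropriate for time-ordered data).

```python
import random

def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # set aside, untouched until the end
    val = shuffled[n_test:n_test + n_val]    # used for tuning decisions
    train = shuffled[n_test + n_val:]        # the only part the model learns from
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed matters more than the exact proportions: if the split changes between runs, you cannot compare one experiment's validation error with the next.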
Avoid Leakage Now
Fit every preprocessing transform (scaling, encoding, imputation) on the training set only, then apply it to the validation and test sets. If you compute a mean across the whole dataset before splitting, you have leaked information from the held-out data, and every downstream number is optimistic. This single discipline prevents the most common silent failure.
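As a small illustration of that rule, here is a hand-rolled standard scaler whose parameters come from the training values alone. The helper names `fit_scaler` and `apply_scaler` and the toy numbers are made up for this sketch; a library scaler follows the same fit-then-apply pattern.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn scaling parameters from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against a constant column
    return mu, sigma

def apply_scaler(values, params):
    """Apply already-fitted parameters to any split."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

train_x = [1.0, 2.0, 3.0, 4.0]
val_x = [10.0, 12.0]

params = fit_scaler(train_x)                 # statistics come from train only
train_scaled = apply_scaler(train_x, params)
val_scaled = apply_scaler(val_x, params)     # validation reuses them unchanged
```

The validation values land far from zero after scaling, which is correct: they really are unlike the training data, and a scaler fitted on everything would have hidden that.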
Step 2: Train a Deliberately Simple Baseline
Start with the simplest reasonable model: a linear or logistic regression, or a shallow tree. Record training error and validation error.
This baseline is not your final model. It is a reference point. Every later change is judged against it. If a complex model does not beat your simple baseline on the validation set, the complexity is buying you nothing.
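A baseline can be as small as a one-feature least-squares line. The sketch below fits one by hand and records both errors; the data points and the `record` dictionary layout are invented for illustration, and in practice you would likely use your library's linear or logistic regression instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(model, xs, ys):
    """Mean squared error of a fitted line on any split."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data, assumed for the example.
train_x, train_y = [0, 1, 2, 3, 4], [1.1, 2.9, 5.2, 6.8, 9.1]
val_x, val_y = [5, 6], [11.2, 12.7]

baseline = fit_line(train_x, train_y)
record = {
    "model": "linear baseline",
    "train_mse": mse(baseline, train_x, train_y),
    "val_mse": mse(baseline, val_x, val_y),
}
print(record)
```

Keep the record. Every later model gets a row in the same format, so the comparison against the baseline is a lookup rather than a memory test.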
Step 3: Read the Learning Curve to Diagnose
Now plot training and validation error against training set size, or at minimum compare the two final numbers. The pattern routes everything that follows.
- Both errors high, gap small: you are underfitting. Go to Step 4.
- Training error low, validation error high, gap large: you are overfitting. Go to Step 5.
- Both errors low, gap small: you are in good shape. Go to Step 6.
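The three-way routing above can be written down as a small function. Treat this as a sketch: the function name `diagnose` is hypothetical, and the `acceptable_err` and `max_gap` thresholds are assumptions you must set for your own problem, since the guide does not prescribe numeric cutoffs.

```python
def diagnose(train_err, val_err, acceptable_err, max_gap):
    """Route to the next step from the two final error numbers.

    acceptable_err and max_gap are problem-specific thresholds (assumed here).
    """
    gap = val_err - train_err
    train_high = train_err > acceptable_err
    gap_large = gap > max_gap

    if train_high and not gap_large:
        return "underfitting"   # both errors high, gap small: Step 4
    if not train_high and gap_large:
        return "overfitting"    # training low, validation high: Step 5
    if not train_high and not gap_large:
        return "good"           # both low, gap small: Step 6
    return "ambiguous"          # high error and a large gap: probe one step at a time

print(diagnose(0.30, 0.32, acceptable_err=0.10, max_gap=0.05))
print(diagnose(0.02, 0.25, acceptable_err=0.10, max_gap=0.05))
```

Writing the thresholds down before you look at the numbers keeps the diagnosis from drifting toward whichever fix you already wanted to try.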
This is the fork in the road. The whole reason for Steps 1 and 2 was to make this diagnosis trustworthy. For the deeper theory of why this gap maps to bias and variance, see The Complete Guide to AI Model Overfitting and Underfitting.
Step 4: The Underfitting Fix Sequence
If you are underfitting, apply these in order, re-checking the learning curve after each change. Stop as soon as the validation error reaches an acceptable level.
- Add model capacity. Move from linear to a tree ensemble, or add layers and units to a network.
- Engineer features. Add interaction terms, polynomial features, or domain signals the model cannot derive on its own.
- Reduce regularization. If you set a penalty earlier, lower it.
- Train longer. Increase epochs or iterations if the loss was still falling.
Change one thing, re-measure, then decide whether to continue. Changing several at once makes it impossible to know what helped.
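One way to enforce the one-change-at-a-time rule is to drive the fix sequence from a loop that logs every trial. The sketch below is illustrative only: `run_fix_sequence`, the config keys, and `fake_evaluate` (a stand-in for "train the model and return validation error") are all hypothetical, and the error values are invented.

```python
def run_fix_sequence(base_config, changes, evaluate, target_val_err):
    """Try changes in order, one at a time, keeping only those that help."""
    config = dict(base_config)
    best_err = evaluate(config)
    log = [("baseline", best_err, True)]
    for name, update in changes:
        if best_err <= target_val_err:
            break                        # acceptable level reached: stop
        trial = {**config, **update}     # exactly one change on top of the kept config
        err = evaluate(trial)
        kept = err < best_err
        log.append((name, err, kept))
        if kept:
            config, best_err = trial, err
    return config, log

def fake_evaluate(config):
    # Stand-in for real training; returns a made-up validation error.
    err = 0.40
    if config.get("capacity") == "ensemble":
        err -= 0.15
    if config.get("penalty", 1.0) < 1.0:
        err += 0.02                      # in this toy setup, this change hurts
    if config.get("epochs", 10) > 10:
        err -= 0.10
    return round(err, 2)

changes = [
    ("add capacity", {"capacity": "ensemble"}),
    ("reduce regularization", {"penalty": 0.1}),
    ("train longer", {"epochs": 50}),
]
final_config, log = run_fix_sequence({}, changes, fake_evaluate, target_val_err=0.20)
for name, err, kept in log:
    print(name, err, "kept" if kept else "reverted")
```

The log is the record of which knob did what. Because each trial differs from the kept configuration by a single change, a harmful change shows up as its own row instead of hiding behind two helpful ones.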
Step 5: The Overfitting Fix Sequence
If you are overfitting, work this list in order. Most of these steps trade a little bias for less variance.
- Get more training data. The most durable fix. Variance shrinks as examples grow.
- Add regularization. L2 for linear models, dropout for networks, max-depth and min-samples limits for trees.
- Use early stopping. Halt training when validation error stops improving.
- Reduce capacity. Fewer parameters, shallower model, lower polynomial degree.
- Augment data. For images or audio, transformations multiply your effective sample size.
Re-check after each change. When the gap between training and validation error closes to an acceptable level, stop. The common errors people make in this sequence are catalogued in 7 Common Mistakes with AI Model Overfitting and Underfitting.
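Of the items in the list above, early stopping is the one most easily shown in code. The sketch below applies a patience rule to a recorded history of per-epoch validation errors; the `patience` and `min_delta` parameters and the error values are assumptions for illustration, and a real training loop would check the same condition after each epoch rather than after the fact.

```python
def early_stopping_epoch(val_errors, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training halts once `patience` epochs pass without an improvement
    larger than `min_delta` over the best error seen so far.
    """
    best_err = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_err - min_delta:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch     # stop here, restore weights from best_epoch
    return best_epoch, len(val_errors) - 1

history = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
best_epoch, stop_epoch = early_stopping_epoch(history, patience=3)
print(best_epoch, stop_epoch)
```

Note the trade-off in the toy history: with a patience of 3 the loop stops before it ever sees the later, lower error. A larger patience would catch it at the cost of more training time, so patience is itself a knob to record.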
Step 6: Validate With Cross-Validation
A single validation split can be lucky or unlucky. Replace it with k-fold cross-validation, typically five or ten folds, to get a stable estimate and see how much your error varies across folds.
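For a sense of what k-fold does mechanically, here is a minimal index generator plus a per-fold summary. It is a sketch under stated assumptions: `k_fold_indices` is a made-up helper, the "model" is a stand-in that predicts the training mean, and the target values are synthetic.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, val_indices) so each row is validated exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, val_idx

# Synthetic targets, assumed for the example.
ys = [float(i % 7) for i in range(50)]

fold_errors = []
for train_idx, val_idx in k_fold_indices(len(ys), k=5):
    pred = mean(ys[i] for i in train_idx)  # stand-in model: predict the train mean
    fold_errors.append(mean(abs(ys[i] - pred) for i in val_idx))

print("mean error:", mean(fold_errors))
print("spread across folds:", stdev(fold_errors))
```

Report both numbers. The mean is your estimate; the spread tells you how much to trust it.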
Read the Variance Across Folds
If error swings widely from fold to fold, your model is sensitive to the specific training data, a sign of residual overfitting. Tight, consistent fold scores indicate a model that generalizes. For time-series data, replace random folds with forward-chaining splits so you never train on the future.
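Forward-chaining splits can be sketched as an expanding training window followed by the next block of rows. The helper name `forward_chaining_splits` and the equal-sized blocks are assumptions for illustration; the rows are assumed to be sorted oldest to newest.

```python
def forward_chaining_splits(n, n_splits=4):
    """Yield (train_indices, val_indices) where validation always follows training in time."""
    if n < n_splits + 1:
        raise ValueError("need at least n_splits + 1 rows")
    block = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = block * i
        # The last split absorbs any leftover rows.
        val_end = block * (i + 1) if i < n_splits else n
        yield list(range(train_end)), list(range(train_end, val_end))

for train_idx, val_idx in forward_chaining_splits(10, n_splits=4):
    print("train:", train_idx, "validate:", val_idx)
```

Every validation index is larger than every training index in the same split, which is the whole point: the model is only ever scored on rows that come after what it learned from.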
Step 6b: Pressure-Test Across Segments
Before you trust an aggregate cross-validation score, slice it. Group your validation predictions by meaningful segments (customer tier, geography, time period, device type, or whatever matters for your problem) and compute error within each group.
Why Aggregates Lie
A model can post a respectable overall score while quietly failing an important slice. Suppose it underfits new customers but overfits long-tenured ones; the average looks fine and hides both problems. Per-segment evaluation exposes this unevenness, which an aggregate number cannot.
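A short sketch makes the effect concrete. The segment names, the predictions, and the `error_by_segment` helper below are invented for illustration; the error metric is mean absolute error, chosen only for simplicity.

```python
from collections import defaultdict

def error_by_segment(records):
    """records: iterable of (segment, y_true, y_pred). Returns {segment: mean abs error}."""
    totals = defaultdict(lambda: [0.0, 0])  # segment -> [sum of abs errors, count]
    for segment, y_true, y_pred in records:
        totals[segment][0] += abs(y_true - y_pred)
        totals[segment][1] += 1
    return {segment: s / n for segment, (s, n) in totals.items()}

# Made-up validation predictions: four long-tenured customers, one new one.
records = [
    ("long_tenured", 10.0, 10.1),
    ("long_tenured", 12.0, 11.9),
    ("long_tenured", 9.0, 9.1),
    ("long_tenured", 11.0, 11.1),
    ("new", 5.0, 7.0),
]

overall = sum(abs(y - p) for _, y, p in records) / len(records)
per_segment = error_by_segment(records)
print("overall:", overall)
print("per segment:", per_segment)
```

In this toy data the overall error is dominated by the larger segment, so the new-customer slice can be badly wrong while the headline number still looks tolerable.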
When you find a struggling segment, decide whether it deserves a targeted fix (more data for that slice, or a segment-specific feature) or whether the aggregate model is acceptable for your use case. Either way, you made the call with eyes open rather than discovering the weakness in production.
Step 7: Run the Final Test Once and Stop
Take the configuration that won on cross-validation and evaluate it on the test set you have not touched. This number is your honest estimate of production performance.
Critically, do not now go back and tune to improve this number. The moment you optimize against the test set, it stops being a test set and your estimate becomes fiction. If the test number disappoints, the correct response is to collect more data or rethink the problem, not to keep poking the holdout. The best practices behind this discipline are expanded in AI Model Overfitting and Underfitting: Best Practices That Actually Work.
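If you want the code itself to hold you to that rule, one option is a thin wrapper that refuses a second evaluation. This is only a sketch of the idea: the class name `OneShotTestSet`, the mean-absolute-error metric, and the toy data are all assumptions, and nothing stops a determined person from reading the data another way.

```python
class OneShotTestSet:
    """Holds the test data and allows exactly one evaluation."""

    def __init__(self, xs, ys):
        self._xs, self._ys = list(xs), list(ys)
        self._used = False

    def evaluate(self, predict):
        """Return mean absolute error of predict(x) on the test set, once."""
        if self._used:
            raise RuntimeError("test set already used; do not tune against it")
        self._used = True
        errors = [abs(predict(x) - y) for x, y in zip(self._xs, self._ys)]
        return sum(errors) / len(errors)

holdout = OneShotTestSet([1, 2, 3], [2.0, 4.0, 6.5])
final_error = holdout.evaluate(lambda x: 2.0 * x)  # the configuration that won on CV
print(final_error)

try:
    holdout.evaluate(lambda x: 2.1 * x)            # a second look is refused
except RuntimeError as exc:
    print("blocked:", exc)
```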
Frequently Asked Questions
How many fixes should I apply before re-checking?
Exactly one. The entire value of this sequence comes from isolating cause and effect. If you apply three changes and the model improves, you have learned nothing about which change mattered, and you may be carrying a harmful change masked by two helpful ones.
What if my diagnosis is ambiguous, with a moderate gap?
A moderate gap with moderate error often means you have headroom in both directions. Try adding a little capacity first; if the gap widens sharply, you have hit the overfitting regime and should back off and regularize instead. Treat the ambiguous zone as a place to probe carefully, one step at a time.
Can I skip the simple baseline to save time?
No. Without the baseline you have no reference for whether complexity helps, and you are far more likely to ship an overcomplicated model that overfits. The baseline takes minutes and saves hours. It is the cheapest insurance in the workflow.
Why fit preprocessing only on the training set?
Because fitting on the full dataset lets information from validation and test data influence your transforms, which leaks the answer and inflates your scores. The model appears to generalize better than it will in production, and you discover the gap only after deployment. Fit on training folds, apply to the rest.
When do I stop iterating?
Stop when validation error reaches a level acceptable for your use case and additional changes yield diminishing returns, or when you have exhausted the relevant fix sequence. Chasing marginal gains past the point of diminishing returns often introduces fragility. Ship the simplest model that meets the bar.
Key Takeaways
- Split into training, validation, and an untouched test set before doing anything else.
- Train a simple baseline as your reference point.
- Diagnose with the training-versus-validation gap before choosing any fix.
- Underfitting and overfitting have opposite fix sequences; apply only the one your diagnosis points to.
- Change one thing at a time and re-measure.
- Run the final test set once and resist the urge to tune against it.