Reading about recommendation systems is one thing. Building one that actually returns sensible suggestions is another. The gap between the two is filled with small, unglamorous decisions about data, evaluation, and serving that tutorials tend to skip. This article does not skip them.
What follows is a sequential build process you could start today. Each step produces something concrete that the next step depends on, so you can stop at any point and still have a working artifact. We will not assume a specific framework, because the steps matter more than the library. The point is to understand how recommendation systems work by assembling one yourself.
We will build toward a modest but real goal: given a user, return a ranked list of items they are likely to engage with, served fast enough to use in an app. Let us begin where every recommender begins, with the data.
Step 1: Assemble and Clean Your Interaction Data
Before any modeling, you need a clean record of who interacted with what.
Build the interaction table
Create a table with three essential columns: user identifier, item identifier, and an interaction value. The value might be an explicit rating, or an implicit signal like a click, watch time, or purchase converted into a number. Strip out bots, test accounts, and duplicate events first, because garbage interactions poison everything downstream.
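As a minimal sketch of that cleaning pass, assuming a raw event log of (user, item, event type, event id) tuples: the event names, the weights, and the blocked-account list below are all hypothetical placeholders, not recommended values.

```python
from collections import defaultdict

# Hypothetical raw event log: (user_id, item_id, event_type, event_id).
RAW_EVENTS = [
    ("u1", "i1", "click", "e1"),
    ("u1", "i1", "click", "e1"),     # duplicate delivery of event e1
    ("u1", "i2", "purchase", "e2"),
    ("bot7", "i1", "click", "e3"),   # known bot account
    ("u2", "i1", "click", "e4"),
    ("u2", "i1", "purchase", "e5"),
]

# Assumed weights for converting implicit signals into numbers.
EVENT_WEIGHTS = {"click": 1.0, "purchase": 5.0}
BLOCKED_USERS = {"bot7", "qa_test"}


def build_interaction_table(events, weights, blocked_users):
    """Return {(user_id, item_id): value}, skipping blocked users,
    repeated event ids, and event types that have no weight."""
    seen_event_ids = set()
    table = defaultdict(float)
    for user_id, item_id, event_type, event_id in events:
        if user_id in blocked_users or event_id in seen_event_ids:
            continue
        if event_type not in weights:
            continue
        seen_event_ids.add(event_id)
        table[(user_id, item_id)] += weights[event_type]
    return dict(table)


interactions = build_interaction_table(RAW_EVENTS, EVENT_WEIGHTS, BLOCKED_USERS)
```

Summing weights per user-item pair is only one possible aggregation; taking the maximum, or keeping the most recent event, may suit your data better.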
Decide what counts as a positive
You must decide what a "positive" interaction means. A purchase is obvious. A two-second video view is not. Set explicit thresholds, document them, and be consistent. This single decision shapes every recommendation your system will ever make.
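One way to keep those thresholds explicit and documented is to put them in a single rule table that every job imports. The sketch below assumes events are dictionaries with a `type` field; the 30-second and 5-second cutoffs are illustrative assumptions, not recommendations.

```python
# Assumed thresholds; record whatever values you actually choose.
POSITIVE_RULES = {
    "purchase": lambda event: True,
    "video_view": lambda event: event["watch_seconds"] >= 30,
    "click": lambda event: event["dwell_seconds"] >= 5,
}


def is_positive(event):
    """Return True if the event counts as a positive interaction.
    Event types without a rule are never positive."""
    rule = POSITIVE_RULES.get(event["type"])
    return bool(rule and rule(event))


sample_events = [
    {"type": "purchase"},
    {"type": "video_view", "watch_seconds": 2},
    {"type": "video_view", "watch_seconds": 45},
    {"type": "hover"},
]
labels = [is_positive(event) for event in sample_events]
```

Because the rules live in one place, changing a threshold is a visible, reviewable change rather than a constant buried in a query.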
Handle the implicit-feedback trap
If you are using implicit signals, remember that you only ever observe positives. A user who never clicked an item might dislike it, or might simply never have seen it. Treating every non-interaction as a negative will teach your model that unseen items are bad, which is wrong. The standard fix is to treat non-interactions as weak, uncertain negatives rather than hard ones, often by sampling a subset of them per training round. Getting this right early saves you from a model that systematically buries anything it has not already shown, which is the seed of a damaging feedback loop down the line.
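The per-round sampling idea can be sketched as follows. This is a uniform sampler under simple assumptions (each user's positives are held as a set, and the ratio of two negatives per positive is arbitrary); real systems often sample by popularity or use other schemes.

```python
import random


def sample_negatives(user_positives, all_items, per_positive=2, rng=None):
    """Return (user, item, label) triples: every observed positive with
    label 1, plus a random subset of unseen items with label 0.
    Call it again each training round to draw a fresh subset."""
    rng = rng or random.Random(0)
    triples = []
    for user_id in sorted(user_positives):
        positives = user_positives[user_id]
        unseen = sorted(set(all_items) - positives)
        triples.extend((user_id, item_id, 1) for item_id in sorted(positives))
        n_negatives = min(len(unseen), per_positive * len(positives))
        triples.extend(
            (user_id, item_id, 0) for item_id in rng.sample(unseen, n_negatives)
        )
    return triples


# Hypothetical toy data.
user_positives = {"u1": {"i1", "i2"}, "u2": {"i3"}}
catalog = ["i1", "i2", "i3", "i4", "i5", "i6"]
epoch_batch = sample_negatives(
    user_positives, catalog, per_positive=2, rng=random.Random(42)
)
```

Because the negatives are redrawn every round, no single unseen item is permanently labeled as disliked.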
Step 2: Split Data Honestly
If you evaluate on the same data you trained on, your numbers will lie. Split your interactions before you model.
For recommendations, a time-based split is usually wiser than a random one. Train on older interactions and test on newer ones, because that mirrors how the system will actually be used: predicting the future from the past. A random split lets the model peek at information it would never have in production, inflating your results. The common mistakes article covers this leakage trap in more detail.
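A time-based split can be sketched in a few lines, assuming each event is a (user, item, timestamp) tuple with comparable timestamps. The 20 percent test share is an arbitrary assumption.

```python
def time_based_split(events, test_fraction=0.2):
    """Split (user, item, timestamp) events so that every test event is
    at least as recent as the cutoff and every train event is older."""
    if not events:
        return [], []
    ordered = sorted(events, key=lambda event: event[2])
    n_test = max(1, round(len(ordered) * test_fraction))
    cutoff_ts = ordered[len(ordered) - n_test][2]
    train = [event for event in ordered if event[2] < cutoff_ts]
    test = [event for event in ordered if event[2] >= cutoff_ts]
    return train, test


# Hypothetical events with increasing integer timestamps.
events = [(f"u{n % 3}", f"i{n}", 1000 + n) for n in range(10)]
train, test = time_based_split(events, test_fraction=0.2)
```

Splitting on the cutoff timestamp rather than on a row index keeps events that share a timestamp on the same side of the boundary.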
Step 3: Start With a Baseline You Can Beat
Resist the urge to begin with a neural network. Start simple.
The popularity baseline
Recommend the most popular items to everyone. It is trivial, it requires no personalization, and it is shockingly hard to beat for new users. If your fancy model cannot outperform "show everyone the top sellers," something is wrong, and you want to know that early.
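The popularity baseline fits in a dozen lines. This sketch counts raw training events per item; counting distinct users instead would be an equally reasonable choice. The toy events are hypothetical.

```python
from collections import Counter


def popularity_ranking(train_events):
    """Rank items by how many training events they received,
    breaking ties by item id so the order is deterministic."""
    counts = Counter(item for _user, item, _ts in train_events)
    return [item for item, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def recommend_popular(ranking, seen_items, k=10):
    """Return the top-k popular items the user has not already seen."""
    return [item for item in ranking if item not in seen_items][:k]


# Hypothetical (user, item, timestamp) training events.
train_events = [
    ("u1", "a", 1), ("u2", "a", 2), ("u3", "a", 3),
    ("u1", "b", 4), ("u2", "b", 5), ("u3", "c", 6),
]
ranking = popularity_ranking(train_events)
```

Build the ranking from the training split only, so the baseline is evaluated under the same conditions as every later model.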
A simple collaborative baseline
Next, build item-based collaborative filtering: for each item, precompute which other items are frequently engaged with by the same users, then recommend the neighbors of a user's recent items. This often captures a large share of the available value with very little complexity, and it gives you a real personalization baseline to measure against.
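Here is one possible sketch of that precompute-then-look-up pattern. It assumes binary interactions and uses cosine similarity over co-occurrence counts, which is a common choice but not the only one. The toy data is hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_item_neighbors(user_items, top_n=50):
    """For each item, precompute its most similar items using cosine
    similarity on binary user-item interactions."""
    item_counts = defaultdict(int)
    co_counts = defaultdict(int)
    for items in user_items.values():
        for item in items:
            item_counts[item] += 1
        for a, b in combinations(sorted(items), 2):
            co_counts[(a, b)] += 1
    neighbors = defaultdict(list)
    for (a, b), together in co_counts.items():
        similarity = together / sqrt(item_counts[a] * item_counts[b])
        neighbors[a].append((b, similarity))
        neighbors[b].append((a, similarity))
    return {
        item: sorted(pairs, key=lambda p: (-p[1], p[0]))[:top_n]
        for item, pairs in neighbors.items()
    }


def recommend_from_neighbors(neighbors, recent_items, seen_items, k=10):
    """Sum neighbor similarities over the user's recent items."""
    scores = defaultdict(float)
    for item in recent_items:
        for other, similarity in neighbors.get(item, []):
            if other not in seen_items:
                scores[other] += similarity
    ranked = sorted(scores.items(), key=lambda p: (-p[1], p[0]))
    return [item for item, _ in ranked[:k]]


# Hypothetical interactions: which items each user engaged with.
user_items = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"a", "c"}, "u4": {"c", "d"}}
neighbors = build_item_neighbors(user_items)
suggestions = recommend_from_neighbors(neighbors, ["b"], {"b"}, k=2)
```

The pairwise loop is quadratic in the number of items per user, so very active users may need to be capped or sampled on real data.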
Step 4: Train a Matrix Factorization Model
Once your baselines work, reach for matrix factorization, the workhorse of personalized recommendation.
The idea is to learn a short vector for every user and every item, such that their dot product predicts the interaction value. Libraries handle the optimization; your job is to choose the number of factors, set regularization to prevent overfitting, and pick a training objective that matches whether your feedback is implicit or explicit. Start with a modest number of factors, around thirty to sixty, and increase only if validation metrics improve. The definitive guide explains the intuition behind these latent factors if you want the conceptual grounding.
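To make the dot-product idea concrete, here is a deliberately small sketch that fits user and item vectors with plain stochastic gradient descent on explicit ratings. It is for intuition only: a library will be faster and will offer objectives suited to implicit feedback. The learning rate, regularization strength, factor count, and toy ratings are all assumptions.

```python
import random


def predict(P, Q, user, item):
    """Predicted interaction value: dot product of the two vectors."""
    return sum(a * b for a, b in zip(P[user], Q[item]))


def train_mf(ratings, n_factors=8, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Fit user vectors P and item vectors Q by SGD on squared error
    with L2 regularization. ratings: list of (user, item, value)."""
    rng = random.Random(seed)
    users = sorted({user for user, _, _ in ratings})
    items = sorted({item for _, item, _ in ratings})
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)
        for user, item, value in data:
            pu, qi = P[user], Q[item]
            err = value - sum(a * b for a, b in zip(pu, qi))
            for f in range(n_factors):
                puf, qif = pu[f], qi[f]
                pu[f] += lr * (err * qif - reg * puf)
                qi[f] += lr * (err * puf - reg * qif)
    return P, Q


def mse(P, Q, ratings):
    return sum((v - predict(P, Q, u, i)) ** 2 for u, i, v in ratings) / len(ratings)


# Hypothetical explicit ratings on a 1 to 5 scale.
ratings = [
    ("u1", "a", 5), ("u1", "b", 5), ("u1", "c", 1),
    ("u2", "a", 5), ("u2", "b", 4), ("u2", "d", 1),
    ("u3", "c", 5), ("u3", "d", 5), ("u3", "a", 1),
    ("u4", "c", 4), ("u4", "d", 5), ("u4", "b", 1),
]
P, Q = train_mf(ratings)
```

The `mse` helper here measures training error only; when choosing the number of factors, compare error on held-out data instead.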
Step 5: Evaluate Like You Mean It
Now measure whether your model beats the baselines, using the held-out future data from Step 2.
- Precision at K: of the top K items you recommended, what fraction did the user actually engage with?
- Recall at K: of all items the user engaged with in the test period, what fraction appeared in your top K?
- NDCG: rewards putting the right items higher in the list, not just including them.
Compare every model against the popularity baseline on the same metric. If matrix factorization does not clearly win, debug your data split before blaming the model. A short, working recommendation checklist can help you verify you have not skipped a step here.
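The three metrics above can be sketched for a single user as follows, assuming binary relevance (an item is either in the user's held-out set or it is not). In practice you would average each metric over all test users.

```python
from math import log2


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for item in recommended[:k] if item in relevant) / k


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in recommended[:k] if item in relevant) / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Discounted gain of the list divided by the gain of an ideal
    ordering, so hits near the top count for more."""
    dcg = sum(
        1 / log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal = sum(1 / log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Hypothetical model output and held-out engagements for one user.
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "z"}
precision = precision_at_k(recommended, relevant, 4)
recall = recall_at_k(recommended, relevant, 4)
ndcg = ndcg_at_k(recommended, relevant, 4)
```

Running the popularity baseline and each model through the same functions, on the same test users, keeps the comparison honest.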
Step 6: Build the Serving Funnel
A model is not a product. To serve recommendations at speed, split the work into stages.
Candidate generation
Precompute item embeddings and load them into a fast similarity index. At request time, take the user's recent items, fetch their nearest neighbors, and assemble a few hundred candidates. This step must be fast and favor recall over precision.
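The candidate-generation flow might look like the sketch below. `BruteForceIndex` is a hypothetical stand-in that scans every item exactly; a production system would replace it with an approximate nearest-neighbor index. The embeddings are made-up two-dimensional vectors.

```python
from math import sqrt


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosines."""
    norm = sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


class BruteForceIndex:
    """Exact cosine search over all items, O(n) per query."""

    def __init__(self, item_vectors):
        self.vectors = {item: normalize(v) for item, v in item_vectors.items()}

    def nearest(self, item_id, n):
        query = self.vectors[item_id]
        scored = [
            (sum(a * b for a, b in zip(query, vector)), other)
            for other, vector in self.vectors.items()
            if other != item_id
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [other for _, other in scored[:n]]


def generate_candidates(index, recent_items, per_item=100, limit=300):
    """Union the neighbors of each recent item, skipping duplicates
    and the recent items themselves."""
    seen = set(recent_items)
    candidates = []
    for item in recent_items:
        if item not in index.vectors:
            continue
        for other in index.nearest(item, per_item):
            if other not in seen:
                seen.add(other)
                candidates.append(other)
    return candidates[:limit]


# Hypothetical precomputed item embeddings.
index = BruteForceIndex({"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]})
candidates = generate_candidates(index, ["a", "c"], per_item=1)
```

Erring on the side of too many candidates is usually fine here, since the ranking stage decides what is actually shown.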
Ranking and re-ranking
Score those candidates with your trained model, then apply business rules: remove already-purchased items, enforce some diversity, and boost fresh content. The output is the final ordered list you show. The framework for recommendation design maps these stages onto a reusable mental model.
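The three business rules can be sketched as a single re-ranking pass. The per-category cap is just one simple way to enforce diversity, and the boost size, category map, and scores are illustrative assumptions.

```python
def rerank(scored_candidates, purchased, category_of, fresh_items,
           k=10, max_per_category=2, fresh_boost=0.1):
    """Apply business rules to (item, score) pairs: drop purchased
    items, boost fresh ones, then cap items per category."""
    adjusted = [
        (score + (fresh_boost if item in fresh_items else 0.0), item)
        for item, score in scored_candidates
        if item not in purchased
    ]
    adjusted.sort(key=lambda pair: (-pair[0], pair[1]))
    per_category = {}
    final = []
    for _, item in adjusted:
        category = category_of.get(item, "unknown")
        if per_category.get(category, 0) >= max_per_category:
            continue  # diversity rule: this category is already full
        per_category[category] = per_category.get(category, 0) + 1
        final.append(item)
        if len(final) == k:
            break
    return final


# Hypothetical model scores and catalog metadata.
scored = [("a", 0.9), ("b", 0.85), ("c", 0.8), ("d", 0.75), ("e", 0.7)]
category_of = {"a": "shoes", "b": "shoes", "c": "shoes", "d": "shoes", "e": "hats"}
final_list = rerank(scored, purchased={"a"}, category_of=category_of,
                    fresh_items={"e"}, k=4, max_per_category=2, fresh_boost=0.2)
```

In this toy run the purchased item is removed, the fresh item rises to the top, and the third remaining shoe is held back by the category cap, so the list comes back shorter than `k`.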
Keep training and serving consistent
A subtle bug will haunt you if you are not careful: a feature computed one way during training and a slightly different way during serving. Even a small mismatch, such as a different default value or a different rounding rule, can quietly degrade live performance while your offline metrics stay pristine. Compute features through shared code wherever possible, and log the exact features used to serve a recommendation so you can replay them later. This training-serving skew is one of the most common reasons a model that looked great in development underperforms in production, and it is almost invisible unless you go looking for it.
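A minimal sketch of both habits, shared feature code and replayable logs, might look like this. The feature names, the default price, and `toy_score` are hypothetical; the point is that training and serving both call the same `compute_features` function.

```python
import json

DEFAULT_PRICE = 0.0  # one default, defined once, used everywhere


def compute_features(user, item):
    """Single source of truth for features, imported by both the
    training job and the serving path."""
    price = item.get("price")
    if price is None:
        price = DEFAULT_PRICE
    return {
        "price_bucket": int(price // 10),
        "user_recent_count": min(len(user.get("recent_items", [])), 50),
        "same_category": int(item.get("category") in user.get("recent_categories", [])),
    }


def serve_and_log(user, item, score_fn, log):
    """Score one candidate and record the exact features that were used."""
    features = compute_features(user, item)
    score = score_fn(features)
    log.append(json.dumps(
        {"user": user["id"], "item": item["id"], "features": features, "score": score},
        sort_keys=True,
    ))
    return score


def replay(log_line, score_fn):
    """Re-score logged features; return (replayed_score, logged_score)."""
    record = json.loads(log_line)
    return score_fn(record["features"]), record["score"]


def toy_score(features):
    # Hypothetical stand-in for a trained ranking model.
    return 0.1 * features["price_bucket"] + features["same_category"]


user = {"id": "u1", "recent_items": ["i1", "i2"], "recent_categories": ["shoes"]}
item = {"id": "i9", "price": 25.0, "category": "shoes"}
log = []
score = serve_and_log(user, item, toy_score, log)
```

If a replayed score ever disagrees with the logged one, either the model or the feature code has drifted since the request was served.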
Step 7: Ship, Measure, Iterate
Deploy behind an A/B test, never as a silent full rollout. Send a fraction of traffic to the new recommender and compare engagement and retention against the existing experience. Only promote it if the live numbers, not the offline ones, improve. Then return to Step 1 with fresh interaction data and repeat. Recommendation systems are never finished; they are tended.
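Two pieces of that A/B workflow can be sketched with the standard library: a deterministic traffic split, and a basic significance check on a conversion-style metric. The experiment name, the 10 percent treatment share, and the counts are hypothetical, and the two-proportion z-test is only one of several reasonable analyses.

```python
import hashlib
from math import erfc, sqrt


def assign_variant(user_id, experiment="recs_v2", treatment_share=0.1):
    """Hash the user into [0, 1) so the same user always lands in the
    same variant for a given experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two
    conversion rates, using the pooled standard error."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (rate_b - rate_a) / se
    return z, erfc(abs(z) / sqrt(2))


# Hypothetical results: control converts 500 of 10000, treatment 600 of 10000.
z, p_value = two_proportion_z_test(500, 10000, 600, 10000)
```

Hashing on the experiment name as well as the user id means a new experiment reshuffles users instead of reusing the previous split.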
Frequently Asked Questions
Do I really need a popularity baseline if I plan to use a fancier model?
Yes. The baseline is your reality check. If a sophisticated model cannot beat simply recommending popular items, you have a data or evaluation bug. The baseline costs almost nothing to build and saves you from shipping a model that adds no value.
Why use a time-based split instead of a random one?
Because production recommenders always predict the future from the past. A random split lets the model train on interactions that happen after the ones it is tested on, which leaks future information and inflates your metrics. A time-based split mirrors real conditions.
How many latent factors should I use in matrix factorization?
Start small, around thirty to sixty, and increase only if validation metrics clearly improve. More factors can capture finer patterns but also overfit and slow training. Tune it like any hyperparameter, watching held-out performance rather than guessing.
Can I skip the serving funnel and just score every item?
For a tiny catalog, yes. For anything large, no. Scoring every item per request is too slow, which is why candidate generation narrows millions of items to a few hundred before the heavier ranking model runs. The funnel is what makes recommendations fast.
When is my recommender "done"?
Never, in the sense that it benefits from ongoing iteration. Tastes shift, catalogs change, and new interaction data arrives constantly. The realistic goal is a system you can retrain and re-evaluate on a regular cadence, improving it over time.
Key Takeaways
- Start by building and cleaning an honest interaction table, deciding explicitly what counts as a positive.
- Use a time-based split so your evaluation reflects predicting the future from the past.
- Always build a popularity baseline and a simple collaborative baseline before reaching for complex models.
- Matrix factorization is the practical workhorse; tune factors and regularization against held-out metrics.
- Serve through a candidate-generation, ranking, and re-ranking funnel, and validate every change with a live A/B test.