Federated learning is a way to train a machine learning model across many separate data sources without ever collecting that data in one place. Instead of shipping everyone's data to a central server, you ship the model to the data, train locally, and send back only the resulting parameter updates. The central server averages those updates into a new global model and repeats the cycle. The raw data never leaves the device or the organization that owns it.
That single architectural inversion changes a lot. It lets you build models on data that is too private, too regulated, or too large to centralize. It is the reason your phone's keyboard can improve its next-word predictions without uploading everything you type, and the reason hospitals can collaborate on a diagnostic model without sharing patient records across institutional walls.
This guide walks through the full picture: the mechanics, the variants, the privacy story, the hard parts, and how to decide whether federated learning is the right tool for a given problem. If you are brand new to the topic, start with our What Is Federated Learning: A Beginner's Guide and come back here once the vocabulary feels comfortable.
The Core Mechanic: Move the Model, Not the Data
Every federated learning system runs a loop with the same four steps:
1. Distribute. The central server sends the current global model to a selected set of clients (phones, hospitals, banks, sensors).
2. Train locally. Each client trains the model on its own local data for a few steps, producing an updated set of weights.
3. Aggregate. Clients send only their weight updates back to the server. The server combines them, usually by a weighted average called Federated Averaging (FedAvg), where each client's contribution is weighted by how much data it has.
4. Repeat. The new global model goes back out, and the loop continues for many rounds until the model converges.
The key property is that step 2 happens on hardware the server does not control, and step 3 transmits gradients or weights rather than examples. A medical image, a typed message, or a transaction record stays put.
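The loop can be sketched in a few lines of NumPy. This is a minimal simulation under stated assumptions, not a production implementation: the linear-regression task, the three in-memory "clients", and the names `local_train` and `fedavg_round` are all hypothetical, chosen only to make the four steps visible.

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, steps=5):
    """Train locally: a few gradient steps of linear regression on one client's data."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    """Distribute the model, train on each client, aggregate by a size-weighted average."""
    local_ws = [local_train(global_w, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(local_ws, axis=0, weights=sizes)

# Simulated clients with different amounts of data (assumed for illustration).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (20, 50, 200):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    clients.append((X, y))

# Repeat: only weights cross the client boundary, never X or y.
global_w = np.zeros(2)
for _ in range(50):
    global_w = fedavg_round(global_w, clients)
```

In this toy setting the clients share one underlying relationship, so the global weights settle close to `true_w`. Real deployments replace the list comprehension with network calls to devices the server does not control.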
Why averaging works at all
It is not obvious that averaging models trained on different data should produce a good combined model. It works because the local updates all point, roughly, in the direction of lower loss on the shared objective. Averaging them cancels out the data-specific noise and reinforces the common signal. It works best when client data is reasonably similar; it struggles when data distributions differ wildly between clients, a problem we return to below.
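One special case makes the intuition exact. The notation here is an assumption introduced for illustration: $F_k$ is client $k$'s local loss, $n_k$ its number of examples, $n = \sum_k n_k$, $w_t$ the global weights at round $t$, and $\eta$ the learning rate. If every client takes a single gradient step from the same starting point, the weighted average of the local models is identical to one gradient step on the pooled objective:

```latex
\begin{aligned}
F(w) &= \sum_{k} \frac{n_k}{n}\, F_k(w), \\
w_k &= w_t - \eta\, \nabla F_k(w_t), \\
\sum_{k} \frac{n_k}{n}\, w_k
  &= w_t - \eta \sum_{k} \frac{n_k}{n}\, \nabla F_k(w_t)
   = w_t - \eta\, \nabla F(w_t).
\end{aligned}
```

With several local steps per round the equality becomes an approximation, because each client's later gradients are evaluated at its own drifting weights rather than at $w_t$. The more the client distributions differ, the looser that approximation gets.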
The Two Main Flavors
Federated learning splits into two settings that look similar but have very different engineering realities.
Cross-device federated learning
Here the clients are millions of unreliable edge devices: phones, watches, browsers. Any single device is slow, frequently offline, and may drop out mid-round. No device holds much data. The server samples a few thousand available devices each round and tolerates massive churn. This is the setting behind Google's keyboard, Apple's on-device personalization, and similar consumer-scale systems.
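A rough sketch of per-round sampling with churn follows. Everything in it is an assumption made for illustration: the function name `sample_round`, the over-selection factor, the dropout rate, and the minimum report count are hypothetical values, not figures from any real system.

```python
import random

def sample_round(device_ids, is_available, target=1000, over_select=1.3,
                 dropout_rate=0.2, min_reports=800, seed=0):
    """Pick more devices than needed, simulate dropouts, and decide if the round counts."""
    rng = random.Random(seed)
    available = [d for d in device_ids if is_available(d)]
    # Over-select so that the round survives devices vanishing mid-training.
    k = min(len(available), int(target * over_select))
    selected = rng.sample(available, k)
    # Each selected device independently fails to report with probability dropout_rate.
    reported = [d for d in selected if rng.random() >= dropout_rate]
    committed = len(reported) >= min_reports
    return selected, reported, committed

# Hypothetical population: 100,000 devices, one in ten currently available.
devices = range(100_000)
selected, reported, committed = sample_round(devices, is_available=lambda d: d % 10 == 0)
```

The design point is that the server never waits for a specific device. It asks for more participants than it needs and commits the round as soon as enough of them report.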
Cross-silo federated learning
Here the clients are a handful of organizations: hospitals, banks, manufacturers. Each silo is a reliable, always-on data center holding a large, valuable dataset. There might be five to fifty participants, not millions. The hard problems shift from device flakiness to governance, trust, and incentive alignment between competing institutions. Most enterprise federated learning is cross-silo.
Privacy Is a Feature, Not a Guarantee
The most common misconception is that federated learning is automatically private because the data never moves. That is half true. Raw data staying local is a real benefit, but the weight updates themselves can leak information. A gradient computed on a single example can, with effort, be partially inverted to reconstruct that example.
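The simplest case shows why. For a single linear layer with a bias, the weight gradient of one example is the outer product of the bias gradient and the input, so anyone holding both gradients can divide one by the other and recover the input. The sketch below assumes a toy squared-error loss and random data. Deep networks and averaging over batches make the attack much harder, which is why real reconstructions are partial and take effort.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)        # the private example that never leaves the client
target = rng.normal(size=4)
W = rng.normal(size=(4, 8))   # model weights, known to the server
b = rng.normal(size=4)

# Client side: forward pass and gradients of loss = 0.5 * ||W x + b - target||^2
delta = (W @ x + b) - target  # dL/dz, which is also dL/db
grad_W = np.outer(delta, x)   # dL/dW = delta * x^T
grad_b = delta

# Server side: sees only grad_W and grad_b, yet row i of grad_W is grad_b[i] * x.
i = int(np.argmax(np.abs(grad_b)))  # pick the row with the largest bias gradient
x_reconstructed = grad_W[i] / grad_b[i]
```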
Serious deployments layer additional protections on top of the basic architecture:
- Secure aggregation uses cryptography so the server only ever sees the sum of client updates, never any individual contribution. No single update is readable.
- Differential privacy adds calibrated noise to updates so that the presence or absence of any single record changes the final model only by a bounded amount, giving a mathematical bound on what can be inferred.
- Client-side clipping limits how much any one client can influence the model, which both helps privacy and blunts malicious participants.
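The three protections above compose naturally, and a toy version fits in one block. This is a sketch under heavy assumptions: real secure aggregation uses key agreement and finite-field arithmetic and must survive client dropouts, and real differential privacy calibrates the noise scale to the clipping norm and a privacy budget. Here the pairwise masks are plain random vectors, the noise scale is arbitrary, and the function names are hypothetical.

```python
import numpy as np

def clip_update(update, max_norm):
    """Client-side clipping: scale the update down so its L2 norm is at most max_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, max_norm / (norm + 1e-12))

def pairwise_masks(num_clients, dim, rng):
    """Each pair (i, j) shares a random vector; i adds it and j subtracts it."""
    masks = np.zeros((num_clients, dim))
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            shared = rng.normal(size=dim)  # stands in for a secret agreed between i and j
            masks[i] += shared
            masks[j] -= shared
    return masks

rng = np.random.default_rng(0)
updates = rng.normal(size=(5, 3)) * 4.0  # raw client updates (simulated)
clipped = np.array([clip_update(u, max_norm=1.0) for u in updates])

# What the server receives: each row looks random on its own.
masked = clipped + pairwise_masks(5, 3, rng)

# Summing cancels every mask, so only the aggregate is revealed.
total = masked.sum(axis=0)

# Illustrative Gaussian noise; the scale 0.5 is NOT a calibrated privacy guarantee.
noisy_total = total + rng.normal(scale=0.5, size=3)
```

Clipping comes first because it bounds how much any one client can move the sum, which is the quantity the noise has to hide.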
If you skip these and assume the architecture alone protects you, you have built something less private than you think. We cover this trap in depth in 7 Common Mistakes with What Is Federated Learning.
The Hard Parts You Will Actually Hit
Federated learning is not free. The genuine challenges are concrete and predictable.
Non-IID data
In a normal training set, you assume examples are independent and identically distributed (IID). In federated learning they are not: each client's data reflects that client. One hospital sees more of one disease; one user types in a different language. This skew slows convergence and can make the global model worse for everyone. Techniques like FedProx and adaptive optimizers exist specifically to handle it.
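FedProx, for example, adds a proximal penalty of (mu / 2) * ||w - w_global||^2 to each client's local loss, which pulls local training back toward the global model and limits drift on skewed data. The sketch below assumes a linear-regression client and hypothetical values for `mu`, `lr`, and `steps`. It illustrates the proximal term only and is not a full FedProx implementation.

```python
import numpy as np

def fedprox_local_train(global_w, X, y, mu=0.1, lr=0.05, steps=20):
    """Local training with a proximal term that penalizes distance from global_w."""
    w = global_w.copy()
    for _ in range(steps):
        loss_grad = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        prox_grad = mu * (w - global_w)               # gradient of (mu/2) * ||w - global_w||^2
        w -= lr * (loss_grad + prox_grad)
    return w

# One skewed client whose data pulls far away from the current global model.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -2.0])
global_w = np.zeros(2)

drift_plain = np.linalg.norm(fedprox_local_train(global_w, X, y, mu=0.0) - global_w)
drift_prox = np.linalg.norm(fedprox_local_train(global_w, X, y, mu=5.0) - global_w)
```

Setting `mu=0.0` recovers plain local training as used in FedAvg. A larger `mu` keeps the client closer to the global weights, at the cost of fitting its own data less tightly.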
Communication cost
Sending model weights over consumer networks, round after round, is expensive. A large model, times thousands of rounds, times thousands of participating devices per round, is a real bandwidth bill. Practical systems compress updates, run more local training per round to cut the number of rounds, and select clients carefully.
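A back-of-the-envelope estimate and one compression technique make the cost concrete. The figures below (10 million parameters, 1,000 clients per round, 4-byte floats) are hypothetical. Top-k sparsification, which sends only the largest-magnitude entries of an update, is one common compression choice among several and is named here as an example, not as the method any particular system uses.

```python
import numpy as np

def round_upload_bytes(num_params, clients_per_round, bytes_per_param=4):
    """Total bytes uploaded to the server in one round, without compression."""
    return num_params * bytes_per_param * clients_per_round

def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries; return their indices and values."""
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return idx, update[idx]

def densify(idx, values, size):
    """Server side: rebuild a full-size update with zeros in the dropped positions."""
    out = np.zeros(size)
    out[idx] = values
    return out

# 10M parameters * 4 bytes * 1,000 clients = 40 GB uploaded per round.
per_round = round_upload_bytes(10_000_000, 1000)

update = np.array([0.1, -3.0, 0.02, 2.0, -0.5])
idx, values = top_k_sparsify(update, k=2)
restored = densify(idx, values, update.size)
```

Sparsification trades accuracy per round for bandwidth, so it usually needs more rounds or an error-feedback mechanism to compensate for the entries it drops.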
Systems heterogeneity
Clients differ in compute, battery, and connectivity. A round can only move as fast as its stragglers, so robust scheduling and dropout tolerance are mandatory, especially cross-device.
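One simple scheduling policy is a reporting deadline: the server closes the round at a fixed time and aggregates whatever arrived, instead of waiting for the slowest client. The sketch below assumes simulated latencies and a hypothetical `close_round` helper. Real schedulers also account for battery state, retries, and fairness across slow devices.

```python
def close_round(latencies, deadline, min_reports):
    """latencies maps client id to seconds until it reports, or None if it dropped out."""
    on_time = sorted(c for c, t in latencies.items() if t is not None and t <= deadline)
    finished = [t for t in latencies.values() if t is not None]
    # How long the server would wait if it insisted on every client that eventually reports.
    wait_for_all = max(finished) if finished else None
    committed = len(on_time) >= min_reports
    return on_time, committed, wait_for_all

# Simulated round: one straggler ("c") and one dropout ("d").
latencies = {"a": 12.0, "b": 15.5, "c": 240.0, "d": None, "e": 18.0}
on_time, committed, wait_for_all = close_round(latencies, deadline=30.0, min_reports=3)
```

With the deadline, this round closes after 30 seconds with three reports. Without it, the server would wait 240 seconds for the straggler and would still never hear from the client that dropped out.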
When to Reach for It and When Not To
Federated learning earns its complexity only under specific conditions. Use it when data genuinely cannot be centralized (regulation, contracts, or physics), when the data lives in many places, and when a model trained on the union of that data would be meaningfully better than one trained on any single silo.
Do not use it when you could simply centralize the data with consent, when one party already holds enough data, or when the coordination overhead exceeds the privacy benefit. For a structured way to make this call, see A Framework for What Is Federated Learning. When you are ready to build, A Step-by-Step Approach to What Is Federated Learning lays out the sequence.
Frequently Asked Questions
Is federated learning the same as distributed training?
No. Distributed training splits one centralized dataset across many machines you control to train faster. Federated learning trains across data you do not control and cannot move, with privacy and governance as first-class constraints. The goals are different even though both spread computation across machines.
Does the data really never leave the device?
The raw training data does not. What leaves are model weight updates. Those updates can leak information without extra safeguards, which is why secure aggregation and differential privacy are standard in serious systems rather than optional extras.
How many clients do you need?
It depends on the setting. Cross-device systems sample thousands of devices per round out of millions. Cross-silo systems can work with anywhere from two to a few dozen organizations. More clients generally help, but data diversity matters more than raw count.
Is it slower than normal training?
Usually yes, in wall-clock terms, because of communication rounds and stragglers. The trade-off is access to data you otherwise could not use at all. You accept slower training in exchange for a better model, or for a model that would not be feasible to train any other way.
What tools should I start with?
Open-source frameworks like Flower, TensorFlow Federated, and NVIDIA FLARE cover most needs. See The Best Tools for What Is Federated Learning for selection criteria and trade-offs.
Key Takeaways
- Federated learning trains a shared model by moving the model to the data and aggregating updates, never centralizing raw data.
- The loop is distribute, train locally, aggregate (FedAvg), repeat.
- Cross-device (millions of flaky phones) and cross-silo (a few reliable organizations) are very different engineering problems.
- The architecture alone is not private; add secure aggregation and differential privacy.
- Non-IID data, communication cost, and device heterogeneity are the real challenges.
- Use it only when data cannot be centralized and a combined model is genuinely better.