A checklist is only useful if every item earns its place and you understand why it is there. A list of vague reminders ("consider fairness") is decoration. This one is built to be a working tool: each item is something you can verify as done or not done, and each carries a short justification so you know what it is protecting against.
Work through it in order before any model that affects people goes to production, and revisit the monitoring section on a schedule afterward. If an item does not apply to your context, mark it explicitly rather than skipping it silently. The sections map to the lifecycle the framework article lays out in full.
One note on how to use it well. A checklist worked in a rush, ticked from memory without actually verifying each item, is worse than no checklist, because it manufactures a paper trail of diligence that did not happen. Treat each box as a claim you could defend to an auditor. If you cannot point to the evidence, the box is not checked, no matter how confident you feel.
Before You Build: Framing Checks
These catch the bias that enters before a single row of data is processed.
The items
- Named the protected groups for this domain. You cannot measure fairness for groups you never defined.
- Confirmed the target variable measures what you actually care about. A proxy target, like spending standing in for need, bakes in bias no model can remove.
- Identified who is accountable for fairness decisions. "The model decided" is not a defensible answer.
- Wrote the fairness definition into the spec. Choosing after results invites cherry-picking, as the common mistakes article details.
- Confirmed the decision the model drives is one that should be automated at all. Some uses are high-stakes enough that the right answer is not to automate them.
Data Checks
Most bias enters here, so these items carry the most weight.
The items
- Checked representation of each group in the training data. Underrepresented groups get unreliable predictions by default.
- Reviewed how labels were generated and by whom. Human labelers encode their own assumptions.
- Tested for proxy features that reconstruct a protected attribute. Removing the attribute does not remove its proxies.
- Retained the protected attribute for auditing, separate from model inputs. You cannot measure a gap for a variable you deleted.
- Verified the evaluation set represents every group adequately rather than reproducing the training data's skew. A test set that shares the training bias hides it.
- Confirmed each group is large enough to measure reliably. A gap on forty examples may be sampling noise, not unfairness.
- Checked at least a few intersectional subgroups, not just single attributes. A model fair on each axis alone can fail at the overlap.
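The representation and sample-size items above can be verified with a few lines of code. What follows is a minimal sketch, not a prescribed method. It assumes records are plain dicts with a hypothetical `group` key, it uses a 95% Wilson score interval as one reasonable way to show how uncertain a rate is on a small group, and the `min_count` cutoff of 100 is an arbitrary illustration rather than a standard.

```python
import math
from collections import Counter


def group_shares(records, group_key="group"):
    """Return {group: (row_count, share_of_all_rows)}."""
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return {g: (n, n / total) for g, n in counts.items()}


def small_groups(records, min_count=100, group_key="group"):
    """Groups with fewer than min_count rows (the cutoff is illustrative)."""
    shares = group_shares(records, group_key)
    return sorted(g for g, (n, _) in shares.items() if n < min_count)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly 95%."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Toy data: group "b" has only 40 rows out of 1,000.
records = [{"group": "a"}] * 960 + [{"group": "b"}] * 40
shares = group_shares(records)
flagged = small_groups(records)

# Suppose 10 of those 40 rows were errors: an observed rate of 25%.
low, high = wilson_interval(10, 40)
print(shares)
print("too small to measure reliably:", flagged)
print(f"error rate interval for b: {low:.2f} to {high:.2f}")
```

On these toy numbers the interval for group "b" runs from roughly 14% to 40%, which is why a gap measured on forty examples should not be reported as a finding without more data.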
Measurement Checks
This is where a hidden gap becomes a visible number.
The items
- Computed the key metric per group, not just in aggregate. Aggregates are dominated by the majority.
- Built a per-group confusion matrix. Identical accuracy can hide opposite error patterns.
- Checked calibration per group. A score of 0.7 should correspond to roughly the same observed outcome rate, about 70%, in every group.
- Reported the gap between best and worst group. That gap, not the average, is the headline.
- Confirmed results against the fairness definition chosen up front. Measuring against a definition picked afterward proves nothing.
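The per-group metric, confusion matrix, and gap items above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: labels and predictions are binary (0 or 1), the group labels "a" and "b" and all data are invented toy values, and the function names are hypothetical. It does not cover calibration, which needs predicted scores rather than hard predictions.

```python
from collections import defaultdict


def per_group_confusion(y_true, y_pred, groups):
    """Count TP/FP/TN/FN separately for each group (binary labels 0/1)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        if p == 1:
            key = "tp" if t == 1 else "fp"
        else:
            key = "fn" if t == 1 else "tn"
        counts[g][key] += 1
    return dict(counts)


def rates(c):
    """Accuracy, false positive rate, and false negative rate for one group."""
    n = c["tp"] + c["fp"] + c["tn"] + c["fn"]
    neg = c["fp"] + c["tn"]
    pos = c["tp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / n if n else float("nan"),
        "fpr": c["fp"] / neg if neg else float("nan"),
        "fnr": c["fn"] / pos if pos else float("nan"),
    }


def gap(per_group_rates, metric):
    """Highest minus lowest value of a metric across groups."""
    values = [r[metric] for r in per_group_rates.values()]
    return max(values) - min(values)


# Toy data: each group has 5 positives followed by 5 negatives.
y_true = ([1] * 5 + [0] * 5) * 2
y_pred = (
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]    # group a: two false positives
    + [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]  # group b: two false negatives
)
groups = ["a"] * 10 + ["b"] * 10

by_group = {g: rates(c) for g, c in per_group_confusion(y_true, y_pred, groups).items()}
for g, r in sorted(by_group.items()):
    print(g, r)
print("accuracy gap:", gap(by_group, "accuracy"))
print("fpr gap:", gap(by_group, "fpr"))
print("fnr gap:", gap(by_group, "fnr"))
```

Both toy groups score the same accuracy, yet one group absorbs all the false positives and the other all the false negatives. The accuracy gap alone would report nothing wrong, which is the case the confusion-matrix item exists to catch.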
Mitigation Checks
Only reach these after measurement gives you a real gap to close.
The items
- Traced the gap to its cause before applying any fix. Treating a symptom without the cause usually fails.
- Tried the cheapest effective intervention first. Fixing the data is usually more transparent than adjusting model outputs after the fact.
- Re-measured after every change. Closing one gap can open another.
- Documented the trade-off accepted. Every mitigation costs something; pretending otherwise fails audits.
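The re-measurement item above is easy to skip in practice, so it helps to make the comparison mechanical. Here is a minimal sketch, assuming you already have a gap (best group minus worst group) for each metric before and after the change. The metric names, the toy values, and the `tolerance` of 0.01 are all hypothetical placeholders.

```python
def gap_changes(before, after, tolerance=0.01):
    """Classify how each per-metric gap moved after a mitigation.

    before / after map a metric name to the gap between the best and
    worst group. A metric present before but absent after is reported
    as "not re-measured" so the omission is visible.
    """
    result = {}
    for metric, old_gap in before.items():
        if metric not in after:
            result[metric] = "not re-measured"
            continue
        delta = after[metric] - old_gap
        if delta > tolerance:
            result[metric] = "worsened"
        elif delta < -tolerance:
            result[metric] = "improved"
        else:
            result[metric] = "unchanged"
    return result


# Toy values: the fix narrowed the false positive gap but widened another.
before = {"fpr_gap": 0.12, "fnr_gap": 0.03, "calibration_gap": 0.02}
after = {"fpr_gap": 0.04, "fnr_gap": 0.09}

report = gap_changes(before, after)
for metric, status in report.items():
    print(metric, "->", status)
```

In this toy run the false positive gap improves, the false negative gap worsens, and the calibration gap was never re-measured. The last two are exactly what a single-metric check after mitigation would miss.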
Deployment and Monitoring Checks
Fairness is not permanent, so these keep it alive.
The items
- Added per-group metrics to production monitoring. Aggregate dashboards hide drift in a single group.
- Set a threshold that triggers re-auditing on drift. Populations shift and models age.
- Defined a retraining trigger and owner. A monitor nobody answers is theater.
- Recorded the full audit in a one-page summary. An audit that lives only in someone's head is gone next quarter.
- Established a channel for affected users to report unfair outcomes. Monitoring catches drift in aggregate; individuals catch failures the metrics miss.
- Scheduled the next full re-audit, not just automated monitoring. Drift alerts catch degradation; a periodic human review catches assumptions that quietly went stale.
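The per-group monitoring and drift-threshold items above might look like the following in code. This is a sketch under assumptions, not a monitoring design: the metric is a hypothetical per-group false positive rate, the baseline comes from the last full audit, and the `threshold` of 0.05 is an arbitrary example value you would set for your own system.

```python
def drift_alerts(baseline, current, threshold=0.05):
    """Return groups whose metric moved more than threshold from baseline.

    baseline / current map a group name to a metric value. A group that
    is missing from the current window is flagged with None, because
    silence from a group is itself worth investigating.
    """
    alerts = {}
    for group, base_value in baseline.items():
        if group not in current:
            alerts[group] = None
            continue
        change = current[group] - base_value
        if abs(change) > threshold:
            alerts[group] = change
    return alerts


# Toy values: false positive rate per group at audit time and this week.
baseline = {"a": 0.08, "b": 0.10, "c": 0.09}
current = {"a": 0.09, "b": 0.19}

alerts = drift_alerts(baseline, current)
if alerts:
    print("re-audit triggered for:", alerts)
```

Here group "a" moved by one point and stays quiet, group "b" moved by about nine points and trips the threshold, and group "c" is flagged because it vanished from the current window. An alert like this only matters if it is routed to the named owner from the checklist.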
How to Adapt the Checklist to Your Stakes
This list is a maximum, not a uniform minimum. Applying every item with equal intensity to a playlist recommender and a loan model wastes effort on one and under-serves the other. The right move is to scale rigor to consequence.
Tiering the work
- High-stakes decisions (lending, hiring, healthcare, anything affecting access to opportunity or care): run every item in full, add legal review, and treat the monitoring section as non-negotiable.
- Medium-stakes decisions (content ranking, pricing, prioritization): run framing, data, and measurement checks fully; apply mitigation and monitoring proportionally.
- Low-stakes decisions (cosmetic recommendations, internal tooling): run a light pass, but keep two items mandatory at every tier, defining the fairness goal and measuring per group, because they are nearly free and catch the most.
The discipline is to make the tier an explicit decision rather than a default. Writing down "this is a medium-stakes system, so we are running these items and consciously deferring those" is itself a fairness decision you can defend. The danger is not skipping items; it is skipping them silently, so that nobody can tell later whether an omission was reasoned or careless.
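One way to make the tier an explicit, reviewable decision is to record it as data and validate it. The sketch below is illustrative only: the `TierDecision` class, the item names, and the example system are hypothetical, and the rules encode just what this section states (high-stakes systems run everything, two items are mandatory at every tier, and nothing is deferred without a written reason).

```python
from dataclasses import dataclass, field

# Hypothetical names for the two items that stay mandatory at every tier.
MANDATORY_ITEMS = {"fairness_definition_in_spec", "per_group_metric"}
TIERS = {"high", "medium", "low"}


@dataclass
class TierDecision:
    system: str
    tier: str
    owner: str
    # Maps a deferred checklist item to the written reason for deferring it.
    deferred: dict = field(default_factory=dict)

    def problems(self):
        """Return a list of reasons this decision record is not defensible."""
        issues = []
        if self.tier not in TIERS:
            issues.append(f"unknown tier: {self.tier!r}")
        if self.tier == "high" and self.deferred:
            issues.append("high-stakes systems run every item in full")
        for item, reason in self.deferred.items():
            if item in MANDATORY_ITEMS:
                issues.append(f"{item} is mandatory at every tier")
            if not reason.strip():
                issues.append(f"{item} deferred without a written reason")
        return issues


decision = TierDecision(
    system="playlist-recommender",
    tier="low",
    owner="named-owner",
    deferred={
        "intersectional_subgroups": "subgroups too small to measure; revisit at next audit",
        "per_group_metric": "",
    },
)
problems = decision.problems()
for p in problems:
    print(p)
```

The record above fails validation twice: it tries to defer a mandatory item, and it gives no reason for doing so. The deferral with a written reason passes, which is the distinction between a reasoned omission and a careless one.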
Frequently Asked Questions
How often should I re-run the monitoring section?
It depends on how fast your population and inputs change, but a quarterly cadence is a reasonable default for most systems, with automated per-group monitoring catching anything between reviews. High-stakes or fast-drifting domains warrant more frequent checks. The trigger threshold is your safety net for changes that arrive faster than the schedule.
Can I skip the data checks if I trust my data source?
No, because trusted sources are exactly where unexamined bias hides. "We have always used this data" is how historical patterns get treated as ground truth. The data checks take little time relative to their value, and they catch the failures that aggregate testing never will. Trust is not a substitute for measurement.
What if I cannot complete the measurement checks because I lack the protected attribute?
Then mark that explicitly as a known gap and treat your fairness claims as unverified. Options are to collect the attribute for a representative evaluation sample or to use a validated proxy strictly for auditing. What you should not do is quietly skip the checks and imply the model passed them.
Is this checklist enough on its own?
It is a strong floor, not a ceiling. It enforces the practices that prevent the most common failures, but it does not replace domain expertise about the specific population affected or legal review for regulated decisions. Treat the items your stakes tier calls for as the baseline the project clears, then add domain-specific checks on top.
How do I get a team to actually use a checklist like this?
Embed it in the workflow rather than leaving it as an optional document people are supposed to remember. Make the framing and data checks part of the project kickoff template, make per-group measurement a required field in the evaluation report, and make the monitoring items part of the deployment runbook. A checklist that lives inside the steps people already take gets used; one that lives in a separate file gets forgotten. Accountability matters too: assign a named owner who signs off that the relevant items were genuinely completed.
Key Takeaways
- Every checklist item is verifiable and names the specific failure it prevents.
- Framing checks catch bias before data, including a wrong target variable and an undefined fairness goal.
- Data checks carry the most weight because most bias enters during collection and labeling.
- Measurement checks turn hidden gaps into reported numbers via per-group metrics and calibration.
- Deployment checks keep fairness alive with per-group monitoring, drift thresholds, and a named owner.