Most fairness efforts die as a one-time audit. Someone runs the numbers before launch, ships a slide deck, and the model drifts unwatched for the next eighteen months. A playbook is the antidote: a set of named plays, each with a trigger that fires it, an owner who runs it, and a defined place in the sequence. The point is to make fairness an operating routine, not a heroic event.
This playbook is written for the person who has to make it happen, usually an ops lead or a delivery manager rather than a research scientist. Every play below assumes limited time and a need to defend decisions to a client. We're optimizing for "good enough to stand behind," not academic completeness.
The operating principle: plays, triggers, owners
A play is a small, repeatable procedure. A trigger is the event that should make it run. An owner is the single person accountable for it happening. If any play lacks a trigger, it never runs; if it lacks an owner, it runs inconsistently. Write all three down before you write any code.
The sequencing matters because the plays build on each other. You cannot pick a fairness metric (Play 3) until you've defined the decision (Play 1) and the affected groups (Play 2). Skipping ahead is the most common way programs produce numbers nobody can interpret.
One more design rule before the plays: every play should produce a small artifact, not just an outcome. A play that "happens" but leaves no record can't be audited, handed off, or trusted six months later. Treat the artifact—a paragraph, a table, a signed decision—as the real output of each play, with the underlying work as the means. This is what separates a program that can prove fairness from one that merely claims it.
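The play, trigger, owner, and artifact structure described above can be written down as data before any modeling starts. The following is a minimal sketch, not a prescribed schema: the `Play` class, its field names, and the `missing_fields` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Play:
    """One play in the playbook. All field names here are illustrative."""
    number: int
    name: str
    trigger: str   # the event that should make the play run
    owner: str     # the single accountable person or role
    artifact: str  # the record the play must leave behind


def missing_fields(play: Play) -> list[str]:
    """Return the names of required text fields that are blank."""
    required = ("trigger", "owner", "artifact")
    return [field for field in required if not getattr(play, field).strip()]


plays = [
    Play(1, "Scope the decision", "Person-affecting AI project",
         "Project lead", "One-paragraph scope statement"),
    # Deliberately incomplete entry (no owner) to show the check firing.
    Play(7, "Monitor in production", "Calendar + updates",
         "", "Dated audit report"),
]

for play in plays:
    gaps = missing_fields(play)
    if gaps:
        print(f"Play {play.number} is not runnable yet: missing {', '.join(gaps)}")
```

Running a check like this in a review meeting makes the rule concrete: a play with a blank owner or trigger is flagged before anyone depends on it.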
Play 1: Scope the decision
Trigger: any project that uses AI to influence a decision about a person. Owner: project lead.
Before touching any data, write one paragraph: what decision does this model influence, who is affected, and what's the worst-case harm? This screens out low-stakes uses (no material impact on a person, so skip the heavy machinery) and flags high-stakes ones (hiring, credit, eligibility) that need the full sequence. This single step prevents the two opposite failures: over-engineering a copywriting helper and under-engineering a screening tool.
Play 2: Map the affected groups and collect attributes
Trigger: Play 1 marks the use as person-affecting. Owner: data lead.
List the groups whose treatment you'd have to defend—legally protected classes plus context-specific ones. Then arrange to collect those attributes for auditing only, with access controls. As covered in the Beginner's Guide, you can't measure disparity you refuse to record. The failure mode here is "fairness through unawareness"—deleting the attribute and calling it solved.
Play 3: Choose and lock a fairness metric
Trigger: groups and attributes are defined. Owner: project lead with data lead.
Pick the metric that matches the harm you most need to prevent: equalized odds (equal error rates across groups), demographic parity (equal selection rates across groups), or calibration (a score means the same thing for every group). Write down why. The Framework walks through this choice in depth. Lock it before you see results, so you're not metric-shopping for the one that makes your model look best.
Set the disparity threshold up front
Decide the gap you'd defend publicly—say, no group's false-negative rate exceeds the best group's by more than a fixed margin. Pre-committing removes the temptation to rationalize whatever number you get.
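One way to make the lock tangible is to record the metric, the margin, and the rationale in an immutable object, then test audit results against it. This is a sketch under stated assumptions: the `FairnessCommitment` class, the 0.05 margin, the rationale text, and the audit numbers are all hypothetical placeholders, and the check assumes a metric where lower is better (an error rate).

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class FairnessCommitment:
    metric: str     # e.g. "false_negative_rate"
    max_gap: float  # largest gap to the best group you would defend publicly
    rationale: str  # why this metric and margin were chosen


def groups_over_threshold(rates: dict[str, float],
                          commitment: FairnessCommitment) -> dict[str, float]:
    """Return {group: gap} for each group whose rate exceeds the best
    (lowest) group's rate by more than the committed margin."""
    best = min(rates.values())
    return {
        group: rate - best
        for group, rate in rates.items()
        if rate - best > commitment.max_gap
    }


commitment = FairnessCommitment(
    metric="false_negative_rate",
    max_gap=0.05,  # assumed margin, for illustration only
    rationale="Hypothetical: wrongly rejecting qualified people is the key harm.",
)

# Made-up audit numbers, for illustration only.
fnr_by_group = {"group_a": 0.10, "group_b": 0.12, "group_c": 0.18}
violations = groups_over_threshold(fnr_by_group, commitment)
print(sorted(violations))
```

With these made-up numbers, only `group_c` sits more than the margin above the best group. Because the dataclass is frozen, changing the margin after seeing results means creating a new, visibly different commitment rather than quietly editing the old one.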
Play 4: Run the pre-launch audit
Trigger: a model candidate is ready to evaluate. Owner: data lead.
Report your chosen metrics disaggregated by every group from Play 2. Include sample sizes—a "fair" result on 11 examples is noise. If a group is too small to evaluate, that itself is a finding: you lack the data to make claims about them. Document results in a short, dated artifact, not a chat message.
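A disaggregated audit can be small. The sketch below computes a false-negative rate per group and reports the sample size next to it, marking groups too small to evaluate. The function name, the record layout, and the `MIN_POSITIVES` floor of 30 are assumptions chosen for illustration; pick your own floor and record the reason.

```python
from collections import defaultdict

MIN_POSITIVES = 30  # assumed floor, for illustration only


def false_negative_rate_by_group(records, min_positives=MIN_POSITIVES):
    """records: iterable of (group, actual, predicted), with 0/1 labels.

    Returns {group: {"n_positives", "fnr", "evaluable"}}. The rate is
    computed over actual positives only; "evaluable" is False when a
    group has too few positives to support a claim either way.
    """
    positives = defaultdict(int)
    misses = defaultdict(int)
    groups = set()
    for group, actual, predicted in records:
        groups.add(group)
        if actual == 1:
            positives[group] += 1
            if predicted == 0:
                misses[group] += 1

    report = {}
    for group in sorted(groups):
        n = positives[group]
        report[group] = {
            "n_positives": n,
            "fnr": misses[group] / n if n else None,
            "evaluable": n >= min_positives,
        }
    return report


# Made-up records: group "a" has 50 actual positives, group "b" only 11.
records = (
    [("a", 1, 1)] * 40 + [("a", 1, 0)] * 10
    + [("b", 1, 1)] * 8 + [("b", 1, 0)] * 3 + [("b", 0, 0)] * 5
)
report = false_negative_rate_by_group(records)
for group, row in report.items():
    print(group, row)
```

Group "b" still gets a number, but the `evaluable` flag says not to lean on it. That flag belongs in the dated artifact alongside the rates.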
Play 5: Mitigate, then re-audit
Trigger: the audit shows a disparity beyond threshold. Owner: data lead.
Work through the mitigations below, weighing cost against reversibility:
- Re-sample or augment underrepresented data—addresses root cause but slow.
- Adjust group-aware thresholds—fast and transparent, but politically sensitive.
- Reweight or constrain training—powerful but harder to explain.
Re-run Play 4 after any change. The Common Mistakes guide details how teams quietly trade a fixed disparity for a new one they didn't measure.
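To catch a disparity that moved rather than disappeared, compare the full before and after audits across every metric you track, not only the one you set out to fix. A minimal sketch, assuming audits are stored as nested dictionaries of error rates where lower is better; the `regressions` helper, the 0.01 tolerance, and the numbers are hypothetical.

```python
def regressions(before, after, tolerance=0.01):
    """before/after: {metric: {group: rate}}, lower rates are better.

    Returns (metric, group, old_rate, new_rate) for every rate that
    got worse by more than `tolerance` after the mitigation.
    """
    worse = []
    for metric, by_group in after.items():
        for group, new_rate in by_group.items():
            old_rate = before.get(metric, {}).get(group)
            if old_rate is not None and new_rate - old_rate > tolerance:
                worse.append((metric, group, old_rate, new_rate))
    return worse


# Made-up audits: the mitigation fixed group b's false negatives
# but raised its false positives.
before = {"fnr": {"a": 0.10, "b": 0.20}, "fpr": {"a": 0.05, "b": 0.06}}
after = {"fnr": {"a": 0.10, "b": 0.12}, "fpr": {"a": 0.05, "b": 0.15}}

found = regressions(before, after)
print(found)
```

In this example the targeted metric improves while a second one degrades for the same group, which is the trade the re-audit exists to surface.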
Play 6: Gate the launch
Trigger: the most recent audit is complete, whether that is the first Play 4 run or a re-audit after Play 5. Owner: accountable executive (not the builder).
A human who didn't build the model decides go/no-go against the pre-set threshold. Separating builder from gatekeeper is the structural safeguard that survives turnover and deadline pressure. Record the decision and its rationale.
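The gate record can enforce its own rules. The sketch below refuses a decision where the gatekeeper is also the builder or where the rationale is blank, and it marks approvals that exceed the pre-set threshold as overrides. The `GateDecision` class, its fields, and every value in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GateDecision:
    model_id: str
    built_by: str
    decided_by: str
    worst_gap: float   # largest disparity found in the latest audit
    threshold: float   # margin locked in before results were seen
    approved: bool
    rationale: str
    decided_on: date

    def __post_init__(self):
        # Structural safeguards: reject records that break the gate's rules.
        if self.decided_by == self.built_by:
            raise ValueError("gatekeeper must not be the builder")
        if not self.rationale.strip():
            raise ValueError("a written rationale is required")

    @property
    def is_override(self) -> bool:
        """True when launch was approved despite exceeding the threshold."""
        return self.approved and self.worst_gap > self.threshold


# Placeholder values throughout, for illustration only.
decision = GateDecision(
    model_id="model-candidate-1",
    built_by="data lead",
    decided_by="accountable executive",
    worst_gap=0.03,
    threshold=0.05,
    approved=True,
    rationale="Worst gap is within the pre-set margin for all evaluable groups.",
    decided_on=date(2024, 1, 15),
)
print(decision.is_override)
```

Storing `is_override` as a derived property rather than a free-text note means nobody has to remember to flag an override; the numbers do it.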
Play 7: Monitor in production
Trigger: model is live; fires on a calendar and on every data or model update. Owner: ops lead.
Re-run the disaggregated audit on a fixed cadence (quarterly for higher-stakes uses), plus a drift check on input distributions. The Best Practices guide covers lightweight monitoring that doesn't require a dedicated team. A model that was fair at launch can degrade silently as the world shifts.
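The text above does not name a drift statistic. One common choice is the population stability index (PSI), which compares the share of each category, or bin, in a baseline sample against a current one. The sketch below is a plain implementation for a single categorical input; the function name, the `epsilon` floor, and the sample data are assumptions, and any alert level you attach to the result is a judgment call to document, not a standard this playbook defines.

```python
import math
from collections import Counter


def population_stability_index(baseline, current, epsilon=1e-6):
    """PSI between two samples of one categorical (or pre-binned) feature.

    Sums (current_share - baseline_share) * ln(current_share / baseline_share)
    over all categories. Zero means identical distributions; larger values
    mean more drift.
    """
    base_counts, curr_counts = Counter(baseline), Counter(current)
    psi = 0.0
    for category in set(base_counts) | set(curr_counts):
        # Floor each share at epsilon so an empty category avoids log(0).
        b = max(base_counts[category] / len(baseline), epsilon)
        c = max(curr_counts[category] / len(current), epsilon)
        psi += (c - b) * math.log(c / b)
    return psi


# Made-up samples of a single input feature, for illustration only.
launch_sample = ["x"] * 50 + ["y"] * 50
stable_sample = ["x"] * 50 + ["y"] * 50
shifted_sample = ["x"] * 80 + ["y"] * 20

print(population_stability_index(launch_sample, stable_sample))
print(population_stability_index(launch_sample, shifted_sample))
```

A drift check like this does not replace the disaggregated audit. It is a cheap early signal that the inputs have moved and the audit may be due sooner than the calendar says.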
Sequencing the whole program
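Run Plays 1 through 6 in order for each model before launch. Play 5 fires only when the audit exceeds the threshold, and every mitigation sends you back through Play 4. Play 7 then runs for as long as the model is live. The cardinal rule: never let a play run without its owner and trigger documented. A program whose only instruction is "someone should check fairness" has no plays at all; it has hopes.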
| Play | Trigger | Owner |
| --- | --- | --- |
| 1 Scope | Person-affecting AI project | Project lead |
| 2 Map groups | Use is person-affecting | Data lead |
| 3 Pick metric | Groups defined | Project + data lead |
| 4 Pre-launch audit | Candidate ready | Data lead |
| 5 Mitigate | Disparity over threshold | Data lead |
| 6 Gate | Re-audit done | Executive |
| 7 Monitor | Calendar + updates | Ops lead |
Frequently Asked Questions
How is a playbook different from a checklist?
A checklist tells you what to verify; a playbook tells you what to do, when it's triggered, and who owns it. The checklist is an artifact a play produces. You need both, but the playbook is what makes the work actually happen on schedule.
Who should own the program overall?
Operations, not data science. The hard part is consistency over time—running Play 7 every quarter, enforcing the launch gate under deadline pressure. That's an operational discipline, with data science as a specialist input rather than the owner.
What if we can't collect sensitive attributes legally?
You may be able to use validated proxies or aggregate-level analysis, but you must document the limitation. Inability to measure is a known gap to disclose, not permission to claim fairness you can't verify.
Can a small team run all seven plays?
Yes. The plays scale down—for a low-stakes use, Play 1 may end the sequence in a paragraph. The discipline is matching effort to stakes, which the playbook makes explicit rather than leaving to instinct.
How do we keep the launch gate from being rubber-stamped?
Give the gatekeeper a pre-set threshold and require a written rationale for any override. A gate with no objective criterion and no paper trail is theater; the threshold and the record are what give it teeth.
Key Takeaways
- Every play needs a trigger and a named owner, or it won't run consistently.
- Sequence matters: scope the decision and map groups before choosing a metric or auditing.
- Lock your fairness metric and disparity threshold before seeing results to avoid metric-shopping.
- Separate the person who builds the model from the person who gates its launch.
- Production monitoring is the play most teams skip and the one that catches silent drift.
For deeper builds on individual plays, see the Framework and the Step-by-Step How-To.