The difference between a team that does fairness work once and a team that does it reliably is documentation. Not the dense, governance-committee kind—the practical kind that lets a new hire run the same audit you ran, get a comparable result, and know what to do with it. A workflow is fairness made portable: any competent person can pick it up, follow the steps, and produce a defensible output.
This guide is about the mechanics of repeatability. We'll cover the stages a fairness review passes through, the artifacts each stage produces, the hand-off points where work changes hands, and the failure modes that quietly break repeatability even when the steps look followed. If you've read the Playbook's "what to do," this is the "how to make it stick."
Why repeatability is the whole game
A brilliant one-time audit that lives in someone's head is worthless the moment that person leaves. Bias creeps back in not through malice but through inconsistency—a slightly different metric here, a forgotten subgroup there, an un-rerun audit after a model update. Repeatability is what converts a fairness result into a fairness guarantee you can renew on schedule.
Repeatability also makes the work auditable. When a client asks "how do you know this is fair," a documented workflow with dated artifacts is a credible answer. "We checked once, I think it was fine" is not.
There's a second, quieter benefit. A repeatable workflow forces you to make your judgment calls explicit. The first time you run it, you have to decide which metric, which groups, and which threshold. Those decisions get written down. From then on, anyone reviewing your work is reviewing a stated position, not reverse-engineering an intuition. That alone defuses many of the arguments that stall fairness programs, because the disagreement moves from "is this fair" to "is this the right threshold," which is a question a team can actually settle.
Stage 1: Intake and triage
Every model entering production passes through a single intake. The intake form captures the decision the model influences, the population affected, the stakes, the relevant groups, and the metric set the later audit will use. This produces the scope artifact: a one-page record that determines whether the model gets the full workflow or a lightweight pass.
The failure mode here is letting models skip intake because they're "just a small tool." Make intake a hard gate: no model ships without a scope artifact, even if the artifact concludes "low stakes, no further review."
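As a rough illustration of the scope artifact and the hard gate, here is a minimal Python sketch. The field names, the three-level stakes scale, and the triage rule (low stakes gets the lightweight pass, everything else gets the full workflow) are assumptions made for the example, not a prescribed form.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

STAKES_LEVELS = ("low", "medium", "high")  # assumed scale, adjust to your own


@dataclass(frozen=True)
class ScopeArtifact:
    """One-page intake record. Field names are illustrative, not a standard."""
    model_name: str
    decision_influenced: str
    population_affected: str
    stakes: str
    relevant_groups: Tuple[str, ...]

    def review_track(self) -> str:
        """Assumed triage rule: only low-stakes models get the lightweight pass."""
        if self.stakes not in STAKES_LEVELS:
            raise ValueError(f"unknown stakes level: {self.stakes!r}")
        return "lightweight" if self.stakes == "low" else "full"


def can_ship(scope: Optional[ScopeArtifact]) -> bool:
    """Hard gate: a model with no scope artifact does not ship."""
    return scope is not None


# Hypothetical example: even a "small tool" gets a scope artifact.
small_tool = ScopeArtifact(
    model_name="email-priority-sorter",
    decision_influenced="order in which support emails are answered",
    population_affected="customers contacting support",
    stakes="low",
    relevant_groups=("language", "region"),
)
print(small_tool.review_track())  # lightweight
print(can_ship(None))             # False
```

The point of the gate is that "low stakes, no further review" is still a recorded conclusion: the artifact exists either way, and only its contents decide how much review follows.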
Stage 2: Data and target documentation
Before any metric, document two things: where the training data came from and how it was sampled, and exactly how the prediction target was defined. The Examples and Use Cases guide shows how target definition silently encodes value judgments. This stage produces the data sheet—a record a successor can read to understand what the model actually learned.
The hand-off point
This is where the work usually passes from the project lead to whoever runs the technical audit. The data sheet is the hand-off contract. If the auditor has to interview people to understand the data, the document failed and the workflow isn't yet repeatable.
Stage 3: Standardized audit
The audit must be runnable the same way every time, which means a fixed metric set chosen during intake, a fixed list of groups, and a fixed reporting format. Every run outputs the same three items:
- Disaggregated performance by group, with sample sizes.
- The chosen fairness metric measured against its pre-set threshold.
- A plain-language verdict: pass, fail, or insufficient data.
Standardizing the format is what makes results comparable across models and across time. The Step-by-Step How-To details the actual computation; the workflow's contribution is forcing it into the same template every run.
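The three outputs above can be sketched as a single function that always returns the same report shape. This is a minimal sketch under stated assumptions: the metric is the gap between the highest and lowest group selection rates, the threshold is 0.10, and a group with fewer than 30 records counts as insufficient data. All three are placeholders for whatever was fixed at intake, and `run_audit` is a hypothetical name.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 30   # assumed floor below which a group is "insufficient data"
MAX_RATE_GAP = 0.10   # assumed pre-set threshold on the selection-rate gap


def run_audit(records, groups, min_n=MIN_GROUP_SIZE, max_gap=MAX_RATE_GAP):
    """records: iterable of (group, predicted_positive) pairs.
    groups: the fixed group list chosen at intake."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n, positives]
    for group, predicted_positive in records:
        counts[group][0] += 1
        counts[group][1] += int(bool(predicted_positive))

    # Report every listed group, even one with no records at all.
    by_group = {}
    for g in groups:
        n, positives = counts[g]
        by_group[g] = {"n": n, "selection_rate": positives / n if n else None}

    report = {"by_group": by_group, "threshold": max_gap, "gap": None}
    if any(row["n"] < min_n for row in by_group.values()):
        report["verdict"] = "insufficient data"
        return report

    rates = [row["selection_rate"] for row in by_group.values()]
    report["gap"] = max(rates) - min(rates)
    report["verdict"] = "pass" if report["gap"] <= max_gap else "fail"
    return report


# Toy data: group A is selected 20 times out of 40, group B 10 times out of 40.
records = [("A", 1)] * 20 + [("A", 0)] * 20 + [("B", 1)] * 10 + [("B", 0)] * 30
report = run_audit(records, groups=["A", "B"])
print(report["verdict"])  # fail: selection rates are 0.50 vs 0.25, a gap of 0.25
```

Iterating over the fixed group list rather than over whatever groups appear in the data is deliberate: a group that vanished from the data shows up as "insufficient data" instead of silently dropping out of the report.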
Stage 4: Remediation loop
When the audit fails, remediation follows a documented decision tree rather than improvisation. Try the cheapest reversible fix first (threshold adjustment), then data work, then training-time constraints, re-running the standardized audit after each. The loop's exit condition is explicit: pass the threshold, or escalate the trade-off to a human decision-maker. Document which mitigation was applied and why, so the next person doesn't undo it by accident.
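The loop described above can be sketched as follows. This is an illustration, not a prescribed implementation: `remediation_loop`, the `(name, reason, apply_fn)` tuple shape, and the idea that `audit` returns a verdict string are all assumptions made for the example.

```python
def remediation_loop(model, mitigations, audit):
    """Apply mitigations in order until the audit passes or the list runs out.

    mitigations: ordered list of (name, reason, apply_fn), cheapest reversible
        fix first. apply_fn takes a model and returns the adjusted model.
    audit: callable returning "pass", "fail", or "insufficient data".
    Returns (model, log, outcome), where outcome is "pass" or "escalate".
    """
    log = []
    verdict = audit(model)
    for name, reason, apply_fn in mitigations:
        if verdict == "pass":
            break
        model = apply_fn(model)
        verdict = audit(model)  # re-run the standardized audit after each step
        log.append({"mitigation": name, "reason": reason, "verdict_after": verdict})
    outcome = "pass" if verdict == "pass" else "escalate"
    return model, log, outcome
```

The returned log is the remediation record: it states which mitigation was applied, why, and what the audit said afterward. An outcome of "escalate" means the documented options ran out and the trade-off now belongs to a human decision-maker.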
Stage 5: Sign-off and archival
A defined approver signs off against the threshold, and every artifact—scope, data sheet, audit, remediation log—gets archived together with a date and the model version. This archive is the spine of repeatability: the next review starts by reading the last one. Skipping archival is the most common reason a "repeatable" workflow turns out to be repeated guesswork.
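One way to make "archived together" concrete is to refuse to write an incomplete bundle. The sketch below assumes the archive is a folder of JSON files, one per review, named by date and model version. The function names, the file layout, and the artifact keys are hypothetical choices for illustration.

```python
import json
from datetime import date
from pathlib import Path

REQUIRED_ARTIFACTS = ("scope", "data_sheet", "audit", "remediation_log")


def archive_review(root, model_name, model_version, approver, artifacts, review_date=None):
    """Write one review bundle to disk; an incomplete artifact set is rejected."""
    missing = [name for name in REQUIRED_ARTIFACTS if name not in artifacts]
    if missing:
        raise ValueError(f"cannot archive, missing artifacts: {missing}")
    stamp = (review_date or date.today()).isoformat()
    bundle = {
        "model_name": model_name,
        "model_version": model_version,
        "review_date": stamp,
        "approver": approver,
        "artifacts": artifacts,
    }
    # ISO dates sort chronologically, so the file name doubles as the ordering.
    target = Path(root) / model_name / f"{stamp}_{model_version}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(bundle, indent=2, sort_keys=True), encoding="utf-8")
    return target


def latest_review(root, model_name):
    """Return the most recent archived bundle, or None if nothing is archived yet."""
    files = sorted((Path(root) / model_name).glob("*.json"))
    return json.loads(files[-1].read_text(encoding="utf-8")) if files else None
```

`latest_review` is the code form of "the next review starts by reading the last one": a new review begins by loading the previous bundle rather than by asking around.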
Stage 6: Scheduled re-run
The workflow re-runs on two triggers: a calendar cadence, and every model or data update. Re-running pulls the archived artifacts, re-executes the standardized audit, and compares against the baseline. The Best Practices guide covers keeping this lightweight. The goal is that the fifth re-run looks identical in procedure to the first; only the numbers change.
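The triggers and the baseline comparison can be sketched as two small functions. This is a minimal sketch: the function names and parameters are hypothetical, and the 90-day cadence in the usage line is only an example value.

```python
from datetime import date, timedelta


def rerun_reasons(last_review, today, cadence_days,
                  reviewed_version, deployed_version, data_changed=False):
    """Return every trigger that currently applies; an empty list means no re-run is due."""
    reasons = []
    if today - last_review >= timedelta(days=cadence_days):
        reasons.append("calendar cadence reached")
    if deployed_version != reviewed_version:
        reasons.append("model updated")
    if data_changed:
        reasons.append("data updated")
    return reasons


def compare_to_baseline(baseline, current):
    """Per-group change since the archived run. Both inputs map group -> metric value."""
    if set(baseline) != set(current):
        raise ValueError("group lists differ; the runs are not comparable")
    return {group: current[group] - baseline[group] for group in sorted(baseline)}


# Hypothetical check: last reviewed 1 Jan on version 1.2, now 15 Apr on version 1.3.
print(rerun_reasons(date(2024, 1, 1), date(2024, 4, 15), 90, "1.2", "1.3"))
# ['calendar cadence reached', 'model updated']
```

Returning the list of reasons, rather than a bare yes or no, gives the re-run record something to cite when someone later asks why the audit was repeated.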
Wiring the workflow into how you already work
A workflow that lives in a separate ceremony nobody attends will be skipped under deadline pressure. The durable move is to attach each stage to a moment that already exists in your delivery process. Intake hooks onto project kickoff, where scope is already being discussed. The standardized audit hooks onto the pre-launch review you already run for everything else. The scheduled re-run hooks onto whatever quarterly business review or maintenance window you already hold with the client.
When the fairness step is a line item inside a meeting that happens anyway, it stops being optional. When it's a standalone task on someone's list, it becomes the first thing dropped when the week gets tight. The artifacts make this possible: because each stage produces a small, defined document, the fairness step inside an existing meeting is just "attach the data sheet" rather than "go do an open-ended analysis." Reducing each stage to attaching a known artifact is what lets the workflow ride along on rituals you already keep.
Failure modes that break repeatability
Even disciplined teams lose repeatability in predictable ways:
- Metric drift: someone "improves" the metric mid-stream, so new results aren't comparable to old. Freeze the metric per model.
- Subgroup amnesia: a group audited last time gets dropped this time. The archived group list prevents it.
- Tribal knowledge: a step works only because one person remembers a caveat. If it's not in the artifact, it's not in the workflow.
- Silent skips: an update ships without re-running the audit. Tie the trigger to your deployment process so it can't be forgotten.
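Two of these failure modes, metric drift and subgroup amnesia, can be caught mechanically by comparing the current audit configuration with the archived one before each run. The sketch below assumes each configuration is a dictionary with "metric", "threshold", and "groups" keys; that shape and the function name are illustrative assumptions.

```python
def check_frozen_config(archived, current):
    """Return a list of problems; an empty list means the run is comparable."""
    problems = []
    if current["metric"] != archived["metric"]:
        problems.append(
            f"metric drift: {archived['metric']!r} changed to {current['metric']!r}"
        )
    if current["threshold"] != archived["threshold"]:
        problems.append(
            f"threshold changed: {archived['threshold']} to {current['threshold']}"
        )
    dropped = sorted(set(archived["groups"]) - set(current["groups"]))
    if dropped:
        problems.append(f"subgroup amnesia: dropped {dropped}")
    return problems


archived = {"metric": "selection_rate_gap", "threshold": 0.10, "groups": ["A", "B", "C"]}
current = {"metric": "selection_rate_gap", "threshold": 0.10, "groups": ["A", "B"]}
print(check_frozen_config(archived, current))  # ["subgroup amnesia: dropped ['C']"]
```

Adding a new group is allowed here; only dropping one is flagged. Tribal knowledge and silent skips are process failures that a check like this cannot see, so they still depend on the artifacts and the deployment trigger.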
Frequently Asked Questions
How much documentation is too much?
The test is the hand-off: a competent colleague should be able to run the workflow from the documents alone, without interviewing you. If they can, you have enough; if they can't, you have too little—regardless of page count. Aim for the minimum that passes that test.
Does this need special tooling?
No. A shared folder with templated documents and a calendar reminder runs the entire workflow. Tooling helps at scale, but the bottleneck is almost always discipline and standardization, not software. Start with documents.
How do we handle models we inherited without documentation?
Treat them as new intakes: run them through Stages 1 and 2 retroactively, reconstructing the data sheet as best you can. Note the reconstruction's gaps honestly. An imperfect retroactive record beats no record.
Who keeps the workflow itself up to date?
Assign one owner for the workflow templates, separate from the people running individual audits. Regulations and best practices evolve; without an owner, the templates calcify and the workflow slowly stops matching reality.
What if two reviewers get different results?
That's a signal the workflow isn't yet standardized, usually because a metric, group list, or data slice wasn't fully specified. Treat divergence as a bug in the workflow: fix the ambiguity in the template, have both reviewers re-run the audit, and confirm the results now match.
Key Takeaways
- Repeatability, not brilliance, is what makes fairness durable—document so anyone can re-run your audit.
- Each stage produces a specific artifact; the artifact is the hand-off contract between people.
- Freeze the metric and group list per model so results stay comparable across time.
- Archive every review together with the model version; the next review starts from the last.
- Watch for metric drift, subgroup amnesia, and silent skips—the quiet ways repeatability dies.
To see this workflow applied end to end, read the Case Study and pair it with the Playbook.