A Chicago AI agency delivered a tenant screening model to a property management company. The agency tested the model for bias using overall accuracy across racial groups and found no significant disparity. The model was deployed. Six months later, a fair housing organization filed a complaint alleging that the model disproportionately rejected applicants from predominantly Black zip codes. The agency's bias testing had looked at race as a direct feature, but the model was not using race directly. It was using zip code, credit score, and employment history, all of which correlated strongly with race due to historical redlining and systemic inequality. The agency's bias audit had checked the obvious places and missed the actual discrimination. The resulting settlement cost the property management company $320,000, and the agency lost the client and two other clients in the real estate vertical who learned about the incident.
Bias audits that check the obvious places and declare the model fair are worse than no audit at all. They create a false sense of security that enables exactly the kind of harm they were supposed to prevent. A real bias audit framework digs deeper than surface-level metrics. It examines proxy discrimination, intersectional effects, feedback loops, and the full impact of the model on the people it affects.
Why Most Bias Audits Fail
Before building a better framework, understand why the typical approach falls short.
Testing only protected attributes. Most agencies test whether the model's accuracy differs across racial or gender groups. But models that do not use protected attributes can still discriminate through proxy variables. Zip code, school attended, name, and even shopping behavior can serve as proxies for race, ethnicity, or socioeconomic status.
Using the wrong fairness metrics. There are dozens of fairness metrics, and they often conflict with each other. A model can satisfy demographic parity while violating equalized odds. An agency that picks one metric and declares victory may be ignoring a fairness violation that a different metric would catch.
Testing only aggregate groups. Testing whether the model is fair to women as a whole may miss that it discriminates against Black women specifically. Intersectional analysis, examining combinations of protected attributes, catches patterns that single-attribute analysis misses.
Ignoring the decision threshold. A model's bias profile changes depending on where you set the decision threshold. A model that is fair at a 50% threshold may be unfair at the 70% threshold the client actually uses. Bias audits must evaluate bias at the deployed threshold.
Treating the audit as a one-time event. Bias can emerge or shift over time as the data distribution changes, as the population served evolves, or as feedback loops amplify initial disparities. A single pre-deployment audit is necessary but not sufficient.
Not involving affected communities. Technical bias audits measure statistical disparities. They do not capture the lived experience of people affected by the model. Without input from affected communities, auditors may miss forms of harm that statistics alone do not reveal.
The Bias Audit Framework
Your framework should cover seven phases: scoping, data analysis, model analysis, threshold analysis, intersectional analysis, impact assessment, and ongoing monitoring.
Phase 1: Scoping
Before running any bias tests, define what you are testing, why, and what constitutes an acceptable result.
Protected attributes identification. Identify every protected attribute relevant to the model's use case and jurisdiction.
- Federal protections: race, color, religion, sex, national origin, age, disability, genetic information
- State and local protections: may include additional categories like sexual orientation, gender identity, marital status, source of income, and veteran status
- Domain-specific protections: fair housing adds familial status, fair lending adds receipt of public assistance
- For each protected attribute, identify proxy variables in your data that correlate with it
Fairness metric selection. Select the fairness metrics appropriate for your use case. Different contexts demand different metrics.
- Demographic parity. The model's positive outcome rate should be similar across groups. Appropriate when equal access is the goal regardless of base rates.
- Equalized odds. The model's true positive rate and false positive rate should be similar across groups. Appropriate when you want the model's errors to fall evenly across groups, so that no group bears more wrongful denials or wrongful approvals than another.
- Predictive parity. The model's precision should be similar across groups. Appropriate when you want positive predictions to be equally reliable for all groups.
- Calibration. A predicted probability of X% should mean X% across all groups. Appropriate for scoring models where the probability itself is used for decisions.
- Individual fairness. Similar individuals should receive similar predictions regardless of group membership. Appropriate when individual treatment consistency matters.
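As a rough illustration of how the group-level metrics above relate to each other, the sketch below computes selection rate (demographic parity), true and false positive rates (equalized odds), and precision (predictive parity) per group. The function name `group_metrics` and the toy records are hypothetical, not taken from any particular library or from the case described earlier.

```python
from collections import defaultdict

def group_metrics(records):
    """Per-group rates from (group, y_true, y_pred) triples with 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        if y_pred == 1:
            key = "tp" if y_true == 1 else "fp"
        else:
            key = "fn" if y_true == 1 else "tn"
        counts[group][key] += 1

    def ratio(num, den):
        # Return NaN rather than raising when a group has no relevant cases.
        return num / den if den else float("nan")

    out = {}
    for group, c in counts.items():
        n = sum(c.values())
        out[group] = {
            "selection_rate": ratio(c["tp"] + c["fp"], n),  # demographic parity
            "tpr": ratio(c["tp"], c["tp"] + c["fn"]),       # equalized odds, part 1
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),       # equalized odds, part 2
            "precision": ratio(c["tp"], c["tp"] + c["fp"]), # predictive parity
        }
    return out

# Hypothetical toy data: (group, actual outcome, model decision)
records = (
    [("A", 1, 1)] * 40 + [("A", 1, 0)] * 10 + [("A", 0, 1)] * 10 + [("A", 0, 0)] * 40
    + [("B", 1, 1)] * 20 + [("B", 1, 0)] * 10 + [("B", 0, 1)] * 20 + [("B", 0, 0)] * 50
)
metrics = group_metrics(records)
for group in sorted(metrics):
    print(group, metrics[group])
```

In this toy data the selection rates are 0.5 and 0.4, while precision is 0.8 for group A and 0.5 for group B, so a model can look acceptable on one metric and much worse on another.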
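Threshold definition. Define what level of disparity constitutes a bias violation before you run the audit. These disparity thresholds are separate from the model's decision threshold, which Phase 4 covers. Setting thresholds after seeing results is a form of bias itself.
- The four-fifths rule from U.S. employment selection guidelines treats a selection rate for any group that is less than 80% of the rate for the group with the highest rate as evidence of adverse impact. It is a rule of thumb rather than a legal bright line, but it is a common starting point.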
- Statistical significance thresholds for group differences
- Domain-specific thresholds from applicable regulations
- Document your thresholds and the rationale for choosing them
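The four-fifths check reduces to a short calculation. The sketch below is a minimal version, assuming you already have a selection rate per group; the group names, rates, and function names are hypothetical.

```python
def adverse_impact_ratios(selection_rates):
    """Each group's selection rate divided by the highest group's rate."""
    best = max(selection_rates.values())
    if best == 0:
        raise ValueError("no group has a positive selection rate")
    return {g: r / best for g, r in selection_rates.items()}

def flag_four_fifths(selection_rates, floor=0.8):
    """Groups whose impact ratio falls below the predefined floor."""
    ratios = adverse_impact_ratios(selection_rates)
    return sorted(g for g, r in ratios.items() if r < floor)

# Hypothetical selection rates per group
rates = {"group_a": 0.60, "group_b": 0.45, "group_c": 0.54}
ratios = adverse_impact_ratios(rates)
flagged = flag_four_fifths(rates)
print(ratios)
print(flagged)
```

Here group_b is selected at 75% of group_a's rate and is flagged, while group_c at 90% is not. The `floor` parameter is where a documented, predefined threshold would go.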
Phase 2: Data Analysis
Bias often originates in the data before the model ever sees it. Audit the data first.
Representation analysis. Measure the representation of each protected group in your training data.
- Calculate the proportion of each group in the training data
- Compare against the proportion in the target population the model will serve
- Identify underrepresented groups that may receive less accurate predictions
- Document representation gaps and their potential impact on model fairness
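A minimal sketch of the representation comparison, assuming group labels are available for the training rows and that you have an external estimate of each group's share of the target population. All names, counts, and the 0.8 cutoff are hypothetical.

```python
from collections import Counter

def representation_gaps(training_groups, target_shares):
    """Compare each group's share of the training data with its target share."""
    counts = Counter(training_groups)
    total = sum(counts.values())
    gaps = {}
    for group, target in target_shares.items():
        train_share = counts.get(group, 0) / total
        gaps[group] = {
            "train_share": train_share,
            "target_share": target,
            "ratio": train_share / target,  # below 1.0 means underrepresented
        }
    return gaps

# Hypothetical training rows and target population shares
training_groups = ["A"] * 700 + ["B"] * 250 + ["C"] * 50
target_shares = {"A": 0.55, "B": 0.30, "C": 0.15}

gaps = representation_gaps(training_groups, target_shares)
# Illustrative cutoff: flag groups with under 80% of their expected share.
underrepresented = sorted(g for g, v in gaps.items() if v["ratio"] < 0.8)
print(underrepresented)
```

In this example group C makes up 5% of the training data against 15% of the target population, so it is the group most likely to receive less accurate predictions.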
Label bias analysis. If your training data includes human-assigned labels, test those labels for bias.
- Compare labeling rates across groups for the same underlying conditions
- Test for inter-annotator agreement disparities across groups
- Check whether labeling criteria were applied consistently across populations
- If labels are historical outcomes like loan defaults or job performance, assess whether those outcomes were influenced by historical discrimination
Feature bias analysis. Examine individual features for discriminatory patterns.
- Calculate the correlation between each feature and each protected attribute
- Identify features that are strong proxies for protected attributes
- Assess whether proxy features are justified by business necessity
- Document the proxy analysis and decisions about feature inclusion or exclusion
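For categorical features such as zip code, ordinary correlation does not apply directly. One common choice is Cramér's V, which measures association between two categorical variables on a 0 to 1 scale. The sketch below is one possible way to screen for proxies, not the only one; the feature names and data are hypothetical.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Association between two categorical variables: 0 (none) to 1 (perfect)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_tot = Counter(xs)
    y_tot = Counter(ys)
    chi2 = 0.0
    for x, nx in x_tot.items():
        for y, ny in y_tot.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_tot), len(y_tot)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Hypothetical data: zip code tracks group membership closely, employment type does not.
groups = ["X"] * 45 + ["Y"] * 5 + ["X"] * 5 + ["Y"] * 45
zip_codes = ["zip_a"] * 50 + ["zip_b"] * 50
employment = ["salaried", "hourly"] * 50

v_zip = cramers_v(zip_codes, groups)
v_emp = cramers_v(employment, groups)
print(round(v_zip, 3), round(v_emp, 3))
```

With this toy data zip code scores 0.8 and employment type 0.04, which would mark zip code as a strong proxy candidate that needs a business-necessity justification.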
Historical bias assessment. Determine whether the training data reflects historical patterns of discrimination.
- Identify data fields that were influenced by historical discrimination, such as credit scores, property values, arrest records, and educational attainment
- Assess whether training on this data would perpetuate historical discrimination
- Consider debiasing techniques or alternative data sources
- Document the historical bias assessment and any mitigation applied
Phase 3: Model Analysis
Audit the model's behavior across protected groups.
Group performance comparison. Calculate your selected fairness metrics across all protected groups.
- Calculate each metric for each group and compute the disparity ratios
- Compare disparities against your predefined thresholds
- Investigate any metric where disparity exceeds the threshold
- Document all results including passing and failing metrics
Proxy discrimination testing. Test whether the model discriminates through proxy variables even when protected attributes are not used as features.
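- Remove protected attributes from the model if they are included and retrain. Compare predictions to the original model. If predictions are similar, the protected attributes were not the primary driver, but the remaining features may be carrying the same information as proxies, so similar predictions do not by themselves show that the model is fair.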
- Add protected attributes to the model and compare fairness metrics. If including protected attributes improves fairness, the model may be using harmful proxies.
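- Conduct counterfactual analysis: change a data point's protected attribute value while keeping everything else the same and observe how the prediction changes. Large changes indicate that the model depends directly on the protected attribute. If the protected attribute is not a model input, this test alone will show no change, so also vary the correlated features, such as zip code, the way they would plausibly differ between groups. Large changes in that second test indicate proxy discrimination.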
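The counterfactual test can be sketched as follows. The `score` function is a hypothetical stand-in for a deployed model that never reads the protected attribute, which also illustrates a caveat: changing the attribute alone cannot move the prediction of such a model, so the sketch also changes a correlated feature.

```python
def score(applicant):
    """Hypothetical stand-in for a deployed model. It never reads 'group'."""
    s = 0.5
    s += 0.3 if applicant["zip_code"] == "zip_a" else -0.2
    s += 0.1 if applicant["credit_band"] == "high" else 0.0
    return s

def counterfactual_delta(model, applicant, changes):
    """Change selected fields, hold the rest fixed, and return the score shift."""
    altered = {**applicant, **changes}
    return model(altered) - model(applicant)

# Hypothetical applicant record
applicant = {"group": "X", "zip_code": "zip_a", "credit_band": "high"}

# Test 1: flip only the protected attribute.
delta_attribute_only = counterfactual_delta(score, applicant, {"group": "Y"})

# Test 2: flip the attribute together with a feature that tracks it.
delta_with_proxy = counterfactual_delta(
    score, applicant, {"group": "Y", "zip_code": "zip_b"}
)
print(delta_attribute_only, round(delta_with_proxy, 3))
```

The first delta is zero because the model ignores `group`. The second is about -0.5, which is the signal a proxy audit is looking for. Which features to vary together, and how, is a judgment call that should come from the proxy analysis in Phase 2.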
Feature importance by group. Analyze which features drive predictions for each group.
- Calculate feature importance using SHAP values or similar methods for each group separately
- Compare the most important features across groups
- Investigate features that are disproportionately important for specific groups, which could indicate proxy discrimination
- Document feature importance analysis results
Error pattern analysis. Analyze the model's errors to understand who bears the cost of model mistakes.
- Calculate false positive and false negative rates for each group
- Identify whether specific types of errors are concentrated in specific groups
- Assess the real-world impact of these errors. For example, if false negatives in disease screening are concentrated in one group, that group receives less healthcare benefit from the model.
- Document error patterns and their potential impact
Phase 4: Threshold Analysis
The decision threshold determines who gets a positive or negative outcome. Bias at the threshold level is where abstract model behavior becomes concrete real-world impact.
Threshold sensitivity analysis. Evaluate fairness metrics across a range of decision thresholds.
- Calculate fairness metrics at multiple threshold levels
- Identify whether there are thresholds where the model is fair and thresholds where it is not
- If the deployed threshold creates unfair outcomes, assess whether an alternative threshold would be fairer without unacceptable performance trade-offs
- Document the threshold analysis and the rationale for the chosen threshold
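A threshold sweep can be sketched in a few lines. The example assumes you have model scores grouped by protected attribute; the scores and the two thresholds are hypothetical.

```python
def selection_rate(scores, threshold):
    """Share of scores at or above the decision threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def threshold_sweep(scores_by_group, thresholds):
    """For each threshold, the lowest group selection rate over the highest."""
    results = {}
    for t in thresholds:
        rates = {g: selection_rate(s, t) for g, s in scores_by_group.items()}
        top = max(rates.values())
        results[t] = min(rates.values()) / top if top else float("nan")
    return results

# Hypothetical model scores for two groups
scores_by_group = {
    "A": [0.35, 0.55, 0.62, 0.71, 0.74, 0.78, 0.81, 0.85, 0.90, 0.95],
    "B": [0.30, 0.52, 0.56, 0.58, 0.61, 0.66, 0.68, 0.72, 0.79, 0.88],
}
sweep = threshold_sweep(scores_by_group, [0.5, 0.7])
for t, ratio in sweep.items():
    print(t, round(ratio, 3))
```

With these scores both groups are selected at the same rate at a 0.5 threshold, but at 0.7 group B is selected at roughly 43% of group A's rate. The same model passes or fails depending on the threshold the client deploys.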
Multi-threshold governance. Consider using different thresholds for different groups if it would improve fairness.
- Evaluate whether group-specific thresholds reduce disparate impact
- Assess the legal and ethical implications of group-specific thresholds, which vary by jurisdiction and domain
- If group-specific thresholds are used, document the justification and ensure compliance with applicable law
Phase 5: Intersectional Analysis
Testing bias across individual protected attributes misses discrimination at the intersection of multiple attributes.
Intersection identification. Define the intersections to test.
- Generate combinations of protected attributes that are relevant to your use case
- Prioritize intersections where historical discrimination has been documented, such as the intersection of race and gender
- Identify intersections where sample sizes are sufficient for meaningful analysis
- Document which intersections were tested and which were not testable due to sample size
Intersectional metric calculation. Calculate fairness metrics for each intersection.
- Apply the same fairness metrics used in Phase 3 to intersectional groups
- Compare disparities against the same thresholds
- Pay particular attention to intersections where disparity is larger than for either individual attribute, which indicates compounded discrimination
- Document intersectional analysis results
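The sketch below shows how single-attribute results can hide an intersectional gap. It groups records by one attribute and then by the combination of two. The attribute values, counts, and helper names are hypothetical.

```python
from collections import defaultdict

def selection_rates_by(records, keys):
    """Selection rate for each combination of the given attribute keys."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for r in records:
        group = tuple(r[k] for k in keys)
        totals[group] += 1
        selected[group] += r["selected"]
    return {g: selected[g] / totals[g] for g in totals}

def worst_ratio(rates):
    """Lowest selection rate divided by the highest."""
    return min(rates.values()) / max(rates.values())

def cohort(race, sex, n_selected, n_total):
    """Build hypothetical records for one intersectional group."""
    return [
        {"race": race, "sex": sex, "selected": int(i < n_selected)}
        for i in range(n_total)
    ]

records = (
    cohort("r1", "f", 60, 100) + cohort("r1", "m", 40, 100)
    + cohort("r2", "f", 30, 100) + cohort("r2", "m", 60, 100)
)

by_race = worst_ratio(selection_rates_by(records, ["race"]))
by_sex = worst_ratio(selection_rates_by(records, ["sex"]))
by_both = worst_ratio(selection_rates_by(records, ["race", "sex"]))
print(round(by_race, 2), round(by_sex, 2), round(by_both, 2))
```

In this toy data the worst ratio is 0.9 by race alone and 0.9 by sex alone, but 0.5 across the four intersections, which is the compounded pattern this phase is meant to catch.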
Small sample handling. Some intersectional groups will have small sample sizes. Handle these carefully.
- Acknowledge the limitation of small sample analysis in your audit documentation
- Use statistical methods appropriate for small samples such as exact tests or bootstrapping
- Do not dismiss potential bias simply because the sample is small
- Consider targeted data collection to improve representation of underrepresented intersections
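As one example of an exact test, the sketch below implements a two-sided Fisher exact test for a 2x2 table of selected and not-selected counts in two groups, using only the standard library. The counts are hypothetical, and in practice a vetted statistics package would be the safer choice.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups; columns are selected / not selected.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x selected in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one
    # (small relative tolerance guards against float ties).
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9)
    )

# Hypothetical: 2 of 12 selected in a small intersectional group,
# 100 of 200 selected in the reference group.
p_small = fisher_exact_two_sided(2, 10, 100, 100)
print(round(p_small, 4))
```

A non-significant p-value from a group this small does not show the model is fair. It may only mean the sample cannot support a conclusion either way, which is why the limitation belongs in the audit documentation.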
Phase 6: Impact Assessment
Translate statistical findings into real-world impact.
Affected population estimation. Estimate how many individuals are affected by identified biases.
- Calculate the number of individuals in each affected group who would receive different outcomes due to the bias
- Extrapolate to the full population the model will serve over its expected lifetime
- Express the impact in concrete terms such as number of individuals denied a loan, rejected for housing, or not selected for an interview
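The estimation itself is simple arithmetic. A minimal sketch, assuming a constant applicant volume and a constant gap in approval rates over the model's lifetime, both of which are simplifications; every number here is hypothetical.

```python
def excess_denials(applicants_per_year, group_rate, reference_rate, years):
    """Estimated extra denials for a group compared with being approved
    at the reference group's rate, over the model's expected lifetime."""
    gap = reference_rate - group_rate
    return round(applicants_per_year * gap * years)

# Hypothetical: 4,000 applicants per year from the affected group,
# approved at 45% versus 60% for the reference group, over 3 years.
estimate = excess_denials(4000, 0.45, 0.60, 3)
print(estimate)  # 1800
```

Stating the result as "about 1,800 additional applicants rejected for housing over three years" is usually more useful to a client than a 15-point gap in approval rates.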
Harm characterization. Characterize the nature and severity of the harm from identified biases.
- Is the harm financial, such as denied loans or higher prices?
- Is the harm related to opportunities, such as denied jobs or housing?
- Is the harm related to dignity, such as stereotyping or surveillance?
- Is the harm reversible or irreversible?
- Are affected individuals aware of and able to contest the biased treatment?
Cumulative impact assessment. Consider the cumulative effect of bias across multiple systems.
- If this model is one of several AI systems that affect the same population, consider the cumulative bias across all systems
- A model with modest individual bias can contribute to significant cumulative harm when combined with other biased systems
- Document the cumulative impact consideration
Remediation recommendation. Based on the impact assessment, recommend specific remediation actions.
- Rank recommendations by impact and feasibility
- Include both immediate mitigations such as threshold adjustments and human review and longer-term improvements such as data collection and model redesign
- Estimate the cost of each remediation action
- Define success criteria for each action
Phase 7: Ongoing Monitoring
A bias audit is not a one-time event. Implement continuous monitoring.
Production bias monitoring. Track fairness metrics in production continuously.
- Calculate fairness metrics on rolling windows such as daily, weekly, and monthly
- Compare production metrics against the baselines established during the audit
- Alert when disparities cross the thresholds defined in Phase 1
- Investigate and address drift in fairness metrics
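A rolling-window monitor can be sketched as below. It keeps the most recent decisions, recomputes the selection-rate ratio, and signals when the ratio drops below a floor. The class name, window size, floor, and minimum group size are all hypothetical choices, and a production system would also need persistence and time-based windows.

```python
from collections import defaultdict, deque

class FairnessMonitor:
    """Tracks the selection-rate ratio over the most recent decisions."""

    def __init__(self, window=500, floor=0.8, min_per_group=30):
        self.decisions = deque(maxlen=window)  # oldest decisions drop off
        self.floor = floor
        self.min_per_group = min_per_group

    def record(self, group, selected):
        self.decisions.append((group, int(selected)))

    def impact_ratio(self):
        totals, hits = defaultdict(int), defaultdict(int)
        for group, selected in self.decisions:
            totals[group] += 1
            hits[group] += selected
        # Ignore groups with too few decisions in the window to be meaningful.
        rates = {
            g: hits[g] / totals[g]
            for g in totals
            if totals[g] >= self.min_per_group
        }
        if len(rates) < 2 or max(rates.values()) == 0:
            return None
        return min(rates.values()) / max(rates.values())

    def should_alert(self):
        ratio = self.impact_ratio()
        return ratio is not None and ratio < self.floor

# Hypothetical production stream: group A selected 60% of the time, group B 30%.
monitor = FairnessMonitor(window=200, min_per_group=20)
for i in range(100):
    monitor.record("A", i % 10 < 6)
    monitor.record("B", i % 10 < 3)

current_ratio = monitor.impact_ratio()
print(current_ratio, monitor.should_alert())
```

The `floor` should be the same disparity threshold documented during scoping, so that production alerts and the original audit are judged against one standard.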
Feedback loop monitoring. Monitor for self-reinforcing bias loops.
- Track whether the model's decisions are changing the composition of future training data
- Monitor for patterns where biased decisions create data that reinforces the bias
- Implement periodic recalibration to break feedback loops
- Document feedback loop monitoring results
Re-audit cadence. Conduct full bias re-audits on a defined schedule.
- Annual re-audits at minimum for all models
- Quarterly re-audits for high-risk models
- Triggered re-audits when significant model changes, data changes, or fairness metric shifts occur
- Triggered re-audits when applicable regulations change
Audit Documentation and Reporting
Internal audit report. A comprehensive document covering all phases of the audit.
- Methodology documentation including metrics selected, thresholds defined, and analysis performed
- Data analysis results including representation, label bias, and feature bias findings
- Model analysis results including group performance, proxy discrimination, and error pattern findings
- Threshold and intersectional analysis results
- Impact assessment results and remediation recommendations
- Reviewer qualifications and independence declaration
Client-facing audit summary. A concise document for clients.
- Key findings and their business implications
- Remediation recommendations with estimated costs and timelines
- Ongoing monitoring plan
- Compliance implications and regulatory alignment
Regulatory audit documentation. Documentation that meets regulatory requirements.
- For NYC Local Law 144, which covers automated employment decision tools: summary of results, including impact ratios by sex, race/ethnicity, and their intersections, as required
- For EU AI Act: documentation per high-risk AI system requirements
- For sector-specific regulators: documentation aligned with their examination procedures
- Maintain audit documentation for the period required by applicable regulations
Your Next Step
Pull up your most recent AI model delivery that affects individuals, whether it is a scoring model, a classification model, a recommendation system, or any system that produces different outcomes for different people. Run it through Phase 2 and Phase 3 of this framework: data analysis and model analysis. Check for proxy discrimination specifically, not just direct protected attribute performance.
If you do not have the protected attribute data needed for bias testing, that is itself a governance gap. You cannot audit for bias if you cannot identify which individuals belong to which groups. Work with your client to establish a bias testing data strategy that allows meaningful audit while respecting privacy. The agencies that can demonstrate rigorous, defensible bias auditing will win the contracts that matter. The ones that treat bias auditing as a checkbox will eventually pay the price, and the price is always higher than the audit would have been.