Bias in AI systems shows up along several dimensions: representation, performance parity, and impact disparities. Measurement is necessary but not sufficient. Fairness metrics often conflict, and context determines which one matters. This post covers the dimensions of AI bias, techniques to measure each, and the honest reality that measurement alone doesn't fix bias.
Representation bias
Data coverage by group. Is the training data representative of the user population? Underrepresented groups tend to see worse model performance.
Output diversity. Does the output reflect the diversity of the world? An image generator that returns only men when asked for 'doctor' shows representation bias.
Who's visible, who's absent. Which perspectives, languages, cultures are present in training data? Which are missing?
Measurement. Compare data distribution to target population. Simple but often revealing.
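The comparison described above can be sketched in a few lines. This is a minimal illustration, not a standard tool: the function name `representation_ratios`, the group labels, and all the numbers are made up for the example, and it assumes each record carries a single known group label.

```python
from collections import Counter


def representation_ratios(samples, population_shares):
    """Return each group's share of the data divided by its target population share.

    A ratio near 1.0 means the group appears roughly in proportion;
    a ratio well below 1.0 means it is under-represented.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    ratios = {}
    for group, target_share in population_shares.items():
        data_share = counts.get(group, 0) / total
        ratios[group] = data_share / target_share
    return ratios


# Toy, invented numbers purely for illustration.
training_groups = ["a"] * 80 + ["b"] * 15 + ["c"] * 5
target_population = {"a": 0.60, "b": 0.25, "c": 0.15}

ratios = representation_ratios(training_groups, target_population)
for group, ratio in sorted(ratios.items()):
    print(f"{group}: {ratio:.2f}")
```

In this toy data, group "c" makes up 5% of the records against a 15% target share, so its ratio is about 0.33. What counts as "too low" is a judgment call, not something the ratio decides.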
Performance parity
Accuracy by subgroup. Does the model perform better on some groups than others? A well-documented historical issue in face recognition.
False positive / false negative rates. Overall accuracy is not enough; error types disaggregated by group matter.
Calibration. Confidence scores should mean the same thing across groups. Often they don't.
Measurement. Compute metrics separately for each group. Gaps are the signal.
Frameworks. Fairlearn, AI Fairness 360 provide standard metrics.
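Computing metrics separately per group needs nothing beyond a confusion matrix per group. The sketch below assumes binary labels and predictions encoded as 0/1 and one group label per example; `rates_by_group` and the toy data are illustrative names and values, not part of any framework mentioned above.

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups):
    """Compute accuracy, false positive rate and false negative rate per group.

    Assumes binary labels and predictions encoded as 0 or 1.
    """
    cells = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1 and pred == 1:
            cells[group]["tp"] += 1
        elif truth == 0 and pred == 1:
            cells[group]["fp"] += 1
        elif truth == 0 and pred == 0:
            cells[group]["tn"] += 1
        else:  # truth == 1 and pred == 0
            cells[group]["fn"] += 1

    def safe_div(num, den):
        # A group with no negatives (or no positives) has an undefined rate.
        return num / den if den else float("nan")

    report = {}
    for group, c in cells.items():
        n = c["tp"] + c["fp"] + c["tn"] + c["fn"]
        report[group] = {
            "accuracy": safe_div(c["tp"] + c["tn"], n),
            "fpr": safe_div(c["fp"], c["fp"] + c["tn"]),
            "fnr": safe_div(c["fn"], c["fn"] + c["tp"]),
        }
    return report


# Toy data, invented for illustration.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a"] * 4 + ["b"] * 4

report = rates_by_group(y_true, y_pred, groups)
for group, metrics in sorted(report.items()):
    print(group, metrics)
```

Here group "a" is classified perfectly while group "b" has a 50% false positive rate and a 50% false negative rate, a gap that the pooled accuracy of 75% hides. With small groups the per-group rates are noisy, so confidence intervals are worth adding before drawing conclusions.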
Impact disparities
Downstream outcomes. Same accuracy can produce different outcomes if decisions affect groups differently.
Allocation effects. Who gets approved vs denied, by group? Real-world consequences.
Feedback loops. Biased decisions shape future training data; bias compounds.
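Measurement. Track outcomes disaggregated by group. Legal frameworks provide reference points, such as the EEOC four-fifths rule, which treats a group selection rate below 80% of the highest group's rate as evidence of adverse impact.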
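As a rough sketch of the four-fifths check: compute each group's selection rate, divide by the highest group's rate, and flag ratios under 0.8. The function names and data are invented for illustration, decisions are assumed to be encoded as 1 (selected) or 0 (not selected), and the 0.8 threshold is a rule of thumb for screening rather than a legal determination.

```python
def selection_rates(decisions, groups):
    """Fraction of positive decisions (1 = selected) per group."""
    totals, selected = {}, {}
    for decision, group in zip(decisions, groups):
        totals[group] = totals.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + decision
    return {g: selected[g] / totals[g] for g in totals}


def impact_ratios(rates):
    """Each group's selection rate divided by the highest group's rate."""
    best = max(rates.values())
    if best == 0:
        # Nobody was selected in any group, so the ratio is undefined.
        return {g: float("nan") for g in rates}
    return {g: r / best for g, r in rates.items()}


# Toy data, invented for illustration: 50 of 100 selected in "a", 30 of 100 in "b".
decisions = [1] * 50 + [0] * 50 + [1] * 30 + [0] * 70
groups = ["a"] * 100 + ["b"] * 100

rates = selection_rates(decisions, groups)
ratios = impact_ratios(rates)
flagged = [g for g, r in ratios.items() if r < 0.8]
print(rates, ratios, flagged)
```

Group "b" is selected at 30% against 50% for group "a", a ratio of 0.6, so it is flagged. A flag is a prompt to investigate, not a finding of discrimination.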
Metrics conflict
Demographic parity. Equal approval rates across groups.
Equal opportunity. Equal true positive rates across groups.
Equalized odds. Equal true positive AND false positive rates across groups.
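Calibration. A given confidence score corresponds to the same observed outcome rate in every group.
These cannot all be satisfied simultaneously except in special cases, such as equal base rates across groups or a perfect predictor. Impossibility results are known (Kleinberg et al., Chouldechova).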
Tradeoffs required. Which metric matters in your context?
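One way to see the conflict numerically is the confusion-matrix identity that, as far as I understand it, underlies Chouldechova's argument: FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR), where p is the group's base rate and PPV is the positive predictive value. The sketch below plugs in invented numbers; `implied_fpr` is a hypothetical helper name. Equal PPV across groups is usually called predictive parity, a thresholded relative of calibration.

```python
def implied_fpr(prevalence, ppv, fnr):
    """False positive rate implied by base rate, PPV and FNR.

    Derivation from the confusion matrix:
      TP = (1 - FNR) * p            (per unit of population)
      FP = TP * (1 - PPV) / PPV
      FPR = FP / (1 - p)
    """
    return prevalence / (1 - prevalence) * (1 - ppv) / ppv * (1 - fnr)


# Two hypothetical groups with identical PPV and FNR but different base rates.
fpr_low = implied_fpr(prevalence=0.2, ppv=0.7, fnr=0.3)
fpr_high = implied_fpr(prevalence=0.5, ppv=0.7, fnr=0.3)

print(f"base rate 0.2 -> FPR {fpr_low:.3f}")
print(f"base rate 0.5 -> FPR {fpr_high:.3f}")
```

With the same PPV and the same false negative rate, the group with the higher base rate ends up with a false positive rate of about 0.30 against about 0.075 for the other. Holding two of the quantities equal forces the third apart whenever base rates differ.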
Context determines which metric matters
Criminal justice risk assessment. Equalized odds often argued appropriate — no group should have higher false positive rate.
Medical diagnosis. Equal opportunity typically — all groups should have equal true positive rates for diseases.
Hiring. Demographic parity sometimes (quotas, diversity goals); equal opportunity other times (meritocracy argument).
Credit scoring. Calibration important — same score should mean same risk across groups.
No universal answer. Context-specific judgments required.
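For the credit-scoring case, a basic calibration check bins scores and compares the observed outcome rate per bin across groups. This is a minimal sketch under stated assumptions: scores lie in [0, 1], outcomes are 0/1, bins are equal-width, and `calibration_by_group` and the toy data are invented for the example.

```python
from collections import defaultdict


def calibration_by_group(scores, outcomes, groups, n_bins=5):
    """Observed positive rate for each (group, score bin) pair.

    Scores are assumed to lie in [0, 1]; bins are equal-width, and a
    score of exactly 1.0 is placed in the top bin.
    """
    cells = defaultdict(lambda: [0, 0])  # (group, bin) -> [positives, count]
    for score, outcome, group in zip(scores, outcomes, groups):
        bin_index = min(int(score * n_bins), n_bins - 1)
        cell = cells[(group, bin_index)]
        cell[0] += outcome
        cell[1] += 1
    return {key: pos / count for key, (pos, count) in cells.items()}


# Toy data, invented for illustration: everyone receives a score of 0.9.
scores = [0.9] * 8
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4

observed = calibration_by_group(scores, outcomes, groups)
for (group, bin_index), rate in sorted(observed.items()):
    print(f"group {group}, bin {bin_index}: observed rate {rate:.2f}")
```

In the toy data a score of 0.9 corresponds to a 75% observed rate in group "a" and 25% in group "b", so the same score does not mean the same risk. Real checks need enough examples per bin for the rates to be stable.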
Measurement alone doesn't fix bias
Measurement reveals gaps. Doesn't close them.
Mitigation techniques. Reweighting, fairness-aware training, post-processing adjustment. Each has tradeoffs.
Upstream fixes. Better data, more diverse teams, different problem formulations often matter more than algorithmic fixes.
Organizational. Governance, accountability, remediation processes. Technical measures alone insufficient.
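To make the reweighting technique mentioned above concrete, here is a minimal sketch in the style of Kamiran and Calders' reweighing, which is one pre-processing approach among several. Each example gets weight P(group) * P(label) / P(group, label), so that group and label look statistically independent in the weighted data. The function name and data are invented for illustration.

```python
from collections import Counter


def reweighing_weights(labels, groups):
    """Per-example weights w = P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    weights = []
    for group, label in zip(groups, labels):
        # Count expected for this (group, label) cell if the two were independent.
        expected = group_counts[group] * label_counts[label] / n
        weights.append(expected / joint_counts[(group, label)])
    return weights


# Toy data, invented for illustration: positives are common in "a", rare in "b".
labels = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4

weights = reweighing_weights(labels, groups)


def weighted_positive_rate(group):
    """Weighted share of positive labels within one group."""
    pairs = [(w, y) for w, y, g in zip(weights, labels, groups) if g == group]
    return sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)


print(weighted_positive_rate("a"), weighted_positive_rate("b"))
```

After weighting, both groups have a weighted positive rate of 0.5, compared with 0.75 and 0.25 in the raw data. The tradeoff is that the model now trains on a distribution that differs from the observed one, which can cost accuracy and does nothing about label bias.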
Regulatory context
EEOC (US employment). Four-fifths rule for disparate impact.
Credit (US). Equal Credit Opportunity Act; adverse action notices.
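GDPR (EU). Article 22 restricts solely automated decisions with legal or similarly significant effects; a "right to explanation" is often read into the regulation, though its scope is debated.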
EU AI Act. High-risk AI systems require bias assessment, human oversight.
State and local laws. New York City's Local Law 144 on automated employment decision tools (AEDT) requires bias audits. Others are emerging.
LLM-specific bias
Language coverage. English dominates; output quality in other languages is often systematically worse.
Cultural perspectives. Western, English-speaking, online perspectives over-represented.
Occupational associations. 'Nurse' vs 'doctor' gender associations; hiring implications.
Safety tuning. Refusal patterns sometimes correlate with identity categories in problematic ways.
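A crude probe for the occupational associations mentioned above is to collect completions for prompts about different occupations and count gendered pronouns. The sketch below only covers the counting step. The sample sentences are hand-written stand-ins, not real model output, the helper names are hypothetical, and a binary pronoun count is a rough proxy that ignores "they", names, and context.

```python
import re
from collections import Counter

FEMALE = {"she", "her", "hers"}
MALE = {"he", "him", "his"}


def pronoun_counts(completions):
    """Count female and male pronouns across a list of text completions."""
    counts = Counter()
    for text in completions:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in FEMALE:
                counts["female"] += 1
            elif token in MALE:
                counts["male"] += 1
    return counts


def female_share(completions):
    """Share of gendered pronouns that are female; NaN if none were found."""
    counts = pronoun_counts(completions)
    total = counts["female"] + counts["male"]
    return counts["female"] / total if total else float("nan")


# Hand-written placeholder sentences; in practice these would be sampled
# from the model under test using a fixed prompt template per occupation.
nurse_samples = ["She checked the chart.", "He adjusted the drip.", "She asked her colleague."]
doctor_samples = ["He reviewed his notes.", "She ordered a scan."]

print(f"nurse: {female_share(nurse_samples):.2f}")
print(f"doctor: {female_share(doctor_samples):.2f}")
```

A large gap between occupations across many samples suggests a learned association. Whether the gap is a problem depends on the downstream use, for example whether the model is drafting job ads or screening candidates.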
Tools
Fairlearn (Microsoft). Metrics, mitigation, visualization.
AI Fairness 360 (IBM). Open-source toolkit.
Aequitas. Open-source bias audit toolkit, used in criminal justice and similar public-policy settings.
Commercial. Credo AI, Fiddler, Arthur for enterprise bias management.