Bias in AI systems shows up along several dimensions: representation, performance parity, and impact disparities. Measurement is necessary but not sufficient. Fairness metrics often conflict; context determines which one matters. This post covers the dimensions of AI bias, techniques to measure each, and the honest reality that measurement alone doesn't fix bias.

Three dimensions

Representation: data coverage, output diversity. Performance parity: accuracy by subgroup, false positives/negatives. Impact disparities: downstream outcomes, allocation effects.

Representation bias

Data coverage by group. Is the training data representative of the user population? Underrepresented groups are likely to see worse performance.

Output diversity. Does the output reflect the diversity of the world? An image generator that returns only men when asked for a 'doctor' shows representation bias.

Who's visible, who's absent. Which perspectives, languages, cultures are present in training data? Which are missing?

Measurement. Compare data distribution to target population. Simple but often revealing.
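The comparison above can be sketched in a few lines. This is a minimal illustration, not a standard tool: the function name, the group labels, and the population shares are all made up for the example, and in practice the reference shares would come from whatever source best describes your actual users.

```python
from collections import Counter

def representation_ratios(sample_groups, population_shares):
    """Return each group's share of the data divided by its share of the target population.

    A ratio near 1.0 means roughly proportional coverage; a ratio well
    below 1.0 means the group is under-represented in the data.
    """
    counts = Counter(sample_groups)
    total = len(sample_groups)
    ratios = {}
    for group, population_share in population_shares.items():
        data_share = counts.get(group, 0) / total
        ratios[group] = data_share / population_share
    return ratios

# Hypothetical group labels in a training set, and assumed user-population shares.
training_groups = ["a"] * 70 + ["b"] * 25 + ["c"] * 5
user_population = {"a": 0.5, "b": 0.3, "c": 0.2}

coverage = representation_ratios(training_groups, user_population)
for group, ratio in sorted(coverage.items()):
    print(f"{group}: {ratio:.2f}")
```

In this toy data, group c makes up 5% of the training set against an assumed 20% of users, so its ratio is 0.25. That is the kind of gap worth chasing before looking at model metrics.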

Performance parity

Accuracy by subgroup. Does the model perform better on some groups than others? Historically a well-documented issue in face recognition.

False positive / false negative rates. Not just accuracy overall; disaggregated error types matter.

Calibration. Confidence scores should mean the same thing across groups. Often they don't.

Measurement. Compute metrics separately for each group. Gaps are the signal.
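A plain-Python sketch of disaggregated metrics, assuming binary labels and predictions (1 = positive) and one group label per example. The function name and the toy data are illustrative only.

```python
from collections import defaultdict

def rates_by_group(y_true, y_pred, groups):
    """Compute accuracy, false positive rate, and false negative rate per group."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1 and pred == 1:
            key = "tp"
        elif truth == 0 and pred == 1:
            key = "fp"
        elif truth == 0 and pred == 0:
            key = "tn"
        else:
            key = "fn"
        counts[group][key] += 1

    report = {}
    for group, c in counts.items():
        n = sum(c.values())
        negatives = c["fp"] + c["tn"]
        positives = c["tp"] + c["fn"]
        report[group] = {
            "accuracy": (c["tp"] + c["tn"]) / n,
            # Rates are undefined when a group has no negatives or no positives.
            "fpr": c["fp"] / negatives if negatives else float("nan"),
            "fnr": c["fn"] / positives if positives else float("nan"),
        }
    return report

# Made-up data: the model is perfect on group "a" and much worse on group "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

report = rates_by_group(y_true, y_pred, groups)
for group, metrics in sorted(report.items()):
    print(group, metrics)
```

Overall accuracy here is 75%, which hides the fact that every error falls on group b. With real data, small groups produce noisy rates, so it is worth reporting counts or confidence intervals alongside the gaps.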

Frameworks. Fairlearn, AI Fairness 360 provide standard metrics.
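Calibration needs a slightly different check from error rates: bin the scores, then compare the average predicted score with the observed positive rate in each bin, separately per group. The sketch below assumes scores in [0, 1] and binary outcomes; the function name, bin count, and data are hypothetical.

```python
def calibration_by_group(scores, outcomes, groups, n_bins=5):
    """Map (group, bin) to (mean predicted score, observed positive rate, count)."""
    table = {}
    for score, outcome, group in zip(scores, outcomes, groups):
        # Clamp so that a score of exactly 1.0 lands in the last bin.
        bin_index = min(int(score * n_bins), n_bins - 1)
        cell = table.setdefault((group, bin_index), [0.0, 0, 0])
        cell[0] += score
        cell[1] += outcome
        cell[2] += 1
    return {
        key: (total_score / n, positives / n, n)
        for key, (total_score, positives, n) in sorted(table.items())
    }

# Made-up data: both groups receive a score of 0.9, but outcomes differ.
scores = [0.9] * 8
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4

calibration = calibration_by_group(scores, outcomes, groups)
for (group, bin_index), (mean_score, observed, n) in calibration.items():
    print(f"group={group} bin={bin_index} predicted={mean_score:.2f} observed={observed:.2f} n={n}")
```

In this toy case a score of 0.9 corresponds to a 75% positive rate in group a and 25% in group b, so the score does not mean the same thing across groups.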

Impact disparities

Downstream outcomes. Same accuracy can produce different outcomes if decisions affect groups differently.

Allocation effects. Who gets approved vs denied, by group? Real-world consequences.

Feedback loops. Biased decisions shape future training data; bias compounds.

Measurement. Track outcomes disaggregated by group. Legal frameworks (such as the EEOC four-fifths rule) provide reference points.
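As a worked example of the four-fifths reference point: divide each group's selection rate by the highest group's rate and flag anything under 0.8. The counts below are invented, and the threshold is a rule of thumb for spotting possible adverse impact, not a legal determination.

```python
def adverse_impact_ratios(selected, total):
    """Selection rate of each group divided by the highest group's selection rate."""
    rates = {group: selected[group] / total[group] for group in total}
    highest = max(rates.values())
    return {group: rate / highest for group, rate in rates.items()}

# Hypothetical approval counts by group.
applicants = {"a": 100, "b": 80}
approved = {"a": 50, "b": 24}

impact_ratios = adverse_impact_ratios(approved, applicants)
flagged = [group for group, ratio in impact_ratios.items() if ratio < 0.8]

print(impact_ratios)
print("below four-fifths:", flagged)
```

Group a is approved at 50% and group b at 30%, giving an impact ratio of 0.6 for group b, which falls below the 0.8 reference point.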

Metrics conflict

Demographic parity. Equal approval rates across groups.

Equal opportunity. Equal true positive rates across groups.

Equalized odds. Equal true positive AND false positive rates across groups.

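Calibration. A given confidence score corresponds to the same observed outcome rate in every group.

These cannot all be satisfied simultaneously, except in trivial cases such as equal base rates across groups or a perfect predictor. The impossibility results are well known (Kleinberg et al.; Chouldechova).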
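One way to see the conflict numerically is the confusion-matrix identity behind Chouldechova's result: with base rate p, FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR). If two groups share the same positive predictive value and the same false negative rate but have different base rates, their false positive rates are forced apart. The numbers below are hypothetical.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate implied by base rate, PPV, and FNR.

    Uses the identity FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR),
    which follows directly from the definitions of the four quantities.
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Same PPV and FNR in both groups; only the base rate differs (illustrative values).
fpr_a = implied_fpr(base_rate=0.5, ppv=0.8, fnr=0.2)
fpr_b = implied_fpr(base_rate=0.2, ppv=0.8, fnr=0.2)

print(f"group a FPR: {fpr_a:.3f}")
print(f"group b FPR: {fpr_b:.3f}")
```

With these inputs the implied false positive rates are 0.20 and 0.05. Holding predictive value and miss rate equal across groups therefore means accepting unequal false positive rates, unless the base rates are equal.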

Tradeoffs required. Which metric matters in your context?

Context determines which metric matters

Criminal justice risk assessment. Equalized odds is often argued to be appropriate: no group should have a higher false positive rate.

Medical diagnosis. Typically equal opportunity: all groups should have equal true positive rates for a disease.

Hiring. Demographic parity sometimes (quotas, diversity goals); equal opportunity other times (meritocracy argument).

Credit scoring. Calibration is important: the same score should mean the same risk across groups.

No universal answer. Context-specific judgments required.

Measurement alone doesn't fix bias

Measurement reveals gaps. Doesn't close them.

Mitigation techniques. Reweighting, fairness-aware training, post-processing adjustment. Each has tradeoffs.
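To make reweighting concrete, here is a sketch in the style of Kamiran and Calders' reweighing: each training example gets weight P(group) * P(label) / P(group, label), so that group and label become statistically independent in the weighted data. The function name and data are illustrative; this is one pre-processing option among several, not a recommended default.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight each example by P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]

def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    rows = [(y, w) for g, y, w in zip(groups, labels, weights) if g == group]
    return sum(w for y, w in rows if y == 1) / sum(w for _, w in rows)

# Made-up training labels: group "a" is 4/6 positive, group "b" is 1/4 positive.
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
print(f"weighted positive rate: a={rate_a:.2f}, b={rate_b:.2f}")
```

After weighting, both groups have the same weighted positive rate as the overall data. The tradeoff is that rare group and label combinations get large weights, which can increase variance in the trained model.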

Upstream fixes. Better data, more diverse teams, different problem formulations often matter more than algorithmic fixes.

Organizational. Governance, accountability, remediation processes. Technical measures alone insufficient.

Regulatory context

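EEOC (US employment). Four-fifths rule for disparate impact: a group's selection rate below 80% of the highest group's rate is generally treated as evidence of adverse impact.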

Credit (US). Equal Credit Opportunity Act; adverse action notices.

GDPR (EU). Restrictions on solely automated decision-making (Article 22); a 'right to explanation' is often cited, though its scope is debated.

EU AI Act. High-risk AI systems require bias assessment, human oversight.

State and local laws. New York City's Local Law 144 requires bias audits of automated employment decision tools (AEDTs). Others emerging.

LLM-specific bias

Language coverage. English dominates; quality in other languages is systematically worse.

Cultural perspectives. Western, English-speaking, online perspectives over-represented.

Occupational associations. 'Nurse' vs 'doctor' gender associations; hiring implications.

Safety tuning. Refusal patterns sometimes correlate with identity categories in problematic ways.
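A common way to probe these patterns is with counterfactual prompts: fill the same templates with different identity terms and compare how often the model refuses. In the sketch below, `toy_refuses` is a stand-in for a real model call plus a refusal classifier, and the templates and terms are placeholders; none of it reflects measurements from an actual model.

```python
def refusal_rates(templates, terms, refuses):
    """Refusal rate per term over prompts that differ only in the swapped term."""
    rates = {}
    for term in terms:
        prompts = [template.format(term=term) for template in templates]
        rates[term] = sum(refuses(prompt) for prompt in prompts) / len(prompts)
    return rates

def toy_refuses(prompt):
    """Hypothetical stand-in: pretends the model refuses one specific combination."""
    return "group B" in prompt and "joke" in prompt

templates = [
    "Write a short story about a person from {term}.",
    "Tell me a joke about someone from {term}.",
]
terms = ["group A", "group B"]

llm_rates = refusal_rates(templates, terms, toy_refuses)
print(llm_rates)
```

The same template-swap structure works for occupational associations: swap the occupation instead of the identity term and count which pronouns or names appear in the completions.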

Tools

Fairlearn (Microsoft). Metrics, mitigation, visualization.

AI Fairness 360 (IBM). Open-source toolkit.

Aequitas. Open-source bias audit toolkit, aimed at criminal justice and similar public-policy uses.

Commercial. Credo AI, Fiddler, Arthur for enterprise bias management.

