eazyware
Playbook·April 24, 2023·11 min read

Measuring AI bias: frameworks, metrics, practical steps

Disparate impact, demographic parity, equalized odds. The specific metrics and processes for measuring bias in AI systems.

Kushal R.
Engineering lead

Bias in AI systems shows up along several dimensions: representation, performance parity, and impact disparities. Measurement is necessary but not sufficient. Fairness metrics often conflict, and context determines which one matters. This post covers the dimensions of AI bias, techniques for measuring each, and the honest reality that measurement alone doesn't fix bias.

Three dimensions
[Figure: AI bias measurement dimensions. Panels: Representation; Performance parity; Impact disparities; Measurement reality.]
Representation: data coverage, output diversity. Performance parity: accuracy by subgroup, false positives/negatives. Impact disparities: downstream outcomes, allocation effects.

Representation bias

Data coverage by group. Is the training data representative of the user population? Groups that are underrepresented in the data tend to see worse model performance.

Output diversity. Does the output reflect the diversity of the world? An image generator that returns only men when asked for a 'doctor' is showing representation bias.

Who's visible, who's absent. Which perspectives, languages, cultures are present in training data? Which are missing?

Measurement. Compare data distribution to target population. Simple but often revealing.
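A minimal sketch of that comparison, assuming you have one group label per training row and reference population shares from an outside source (census data, user analytics). The function name, group labels, and numbers are all illustrative and not taken from any library.

```python
from collections import Counter

def representation_ratios(sample_groups, population_shares):
    """Return {group: (sample_share, population_share, ratio)}.

    A ratio below 1 means the group is underrepresented in the sample
    relative to the reference population.
    """
    counts = Counter(sample_groups)
    total = sum(counts.values())
    result = {}
    for group, pop_share in population_shares.items():
        sample_share = counts.get(group, 0) / total
        result[group] = (sample_share, pop_share, sample_share / pop_share)
    return result

# Hypothetical data: 10 training rows and assumed population shares.
training_groups = ["a"] * 7 + ["b"] * 2 + ["c"] * 1
population = {"a": 0.5, "b": 0.3, "c": 0.2}

coverage = representation_ratios(training_groups, population)
for group, (sample, pop, ratio) in coverage.items():
    print(f"{group}: sample={sample:.2f} population={pop:.2f} ratio={ratio:.2f}")
```

A ratio well below 1 flags a group worth a closer look. Where to draw that line is a judgment call, not something the arithmetic decides.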

Performance parity

Accuracy by subgroup. Does the model perform better on some groups than others? Historically a well-documented issue in face recognition.

False positive / false negative rates. Not just accuracy overall; disaggregated error types matter.

Calibration. A confidence score should mean the same thing across groups: among cases scored 0.8, roughly 80% should turn out positive in every group. Often that doesn't hold.

Measurement. Compute metrics separately for each group. Gaps are the signal.

Frameworks. Fairlearn, AI Fairness 360 provide standard metrics.
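One way to do the per-group computation, sketched in plain Python so the arithmetic stays visible. The labels, predictions, and group names are made up for illustration; in practice a toolkit such as Fairlearn or AI Fairness 360 would handle this with more care.

```python
from collections import defaultdict

def rates_by_group(y_true, y_pred, groups):
    """Accuracy, false positive rate and false negative rate per group."""
    cells = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "fp" if pred == 1 else "tn"
        cells[group][key] += 1

    result = {}
    for group, c in cells.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        result[group] = {
            "accuracy": (c["tp"] + c["tn"]) / (positives + negatives),
            # Rates are undefined when a group has no negatives / positives.
            "fpr": c["fp"] / negatives if negatives else float("nan"),
            "fnr": c["fn"] / positives if positives else float("nan"),
        }
    return result

# Hypothetical binary labels and predictions for two groups of four rows each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a"] * 4 + ["b"] * 4

parity_report = rates_by_group(y_true, y_pred, groups)
for group, metrics in parity_report.items():
    print(group, metrics)
```

In this toy data the overall accuracy is 75%, which hides the fact that group a is classified perfectly while group b is at 50% with errors of both types.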

Impact disparities

Downstream outcomes. Same accuracy can produce different outcomes if decisions affect groups differently.

Allocation effects. Who gets approved vs denied, by group? Real-world consequences.

Feedback loops. Biased decisions shape future training data; bias compounds.

Measurement. Track outcomes disaggregated by group. Legal frameworks such as the EEOC four-fifths rule provide reference points.
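As a reference point, the four-fifths check compares each group's selection rate to the highest group's rate and flags ratios under 0.8. A rough sketch, assuming outcomes are available as (selected, total) counts per group. The function names and the counts are hypothetical, and the result is a screening heuristic, not a legal determination.

```python
def selection_rates(outcomes):
    """outcomes: {group: (selected, total)} -> {group: selection rate}."""
    return {group: selected / total for group, (selected, total) in outcomes.items()}

def four_fifths_check(outcomes, threshold=0.8):
    """Return {group: (impact_ratio, passes)} relative to the highest-rate group."""
    rates = selection_rates(outcomes)
    best = max(rates.values())
    if best == 0:
        # Nobody was selected in any group, so ratios are undefined.
        return {group: (float("nan"), False) for group in rates}
    return {group: (rate / best, rate / best >= threshold) for group, rate in rates.items()}

# Hypothetical approval counts: 48 of 80 approved vs 12 of 40 approved.
approvals = {"group_x": (48, 80), "group_y": (12, 40)}

impact = four_fifths_check(approvals)
for group, (ratio, passes) in impact.items():
    print(f"{group}: impact ratio={ratio:.2f} passes={passes}")
```

Here group_y is approved at half the rate of group_x, well under the 0.8 line. With small groups the ratio is noisy, so a flag is a prompt to investigate, not a verdict.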

Metrics conflict

Demographic parity. Equal approval rates across groups.

Equal opportunity. Equal true positive rates across groups.

Equalized odds. Equal true positive AND false positive rates across groups.

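Calibration. A given confidence score corresponds to the same observed outcome rate in every group.

These cannot all be satisfied simultaneously except in special cases, such as equal base rates across groups or a perfect predictor. Impossibility results are well known (Kleinberg et al.; Chouldechova).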
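The conflict can be seen with a few lines of arithmetic. The sketch below uses the identity behind Chouldechova's result, FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR), where p is a group's base rate. The base rates and the PPV and FNR values are invented for illustration.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by a group's base rate, PPV and FNR.

    Follows from the confusion-matrix definitions:
    FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    """
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)

# Hold predictive value and miss rate equal across two hypothetical groups...
ppv, fnr = 0.8, 0.2

# ...but let the base rates differ.
fpr_a = implied_fpr(base_rate=0.5, ppv=ppv, fnr=fnr)
fpr_b = implied_fpr(base_rate=0.2, ppv=ppv, fnr=fnr)

print(f"group a FPR: {fpr_a:.2f}")
print(f"group b FPR: {fpr_b:.2f}")
```

With equal PPV and equal FNR, the group with the higher base rate ends up with a false positive rate about four times larger. Forcing the false positive rates to match would require giving up equality on PPV or FNR instead.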

Tradeoffs required. Which metric matters in your context?

Context determines which metric matters

Criminal justice risk assessment. Equalized odds is often argued to be appropriate: no group should bear a higher false positive rate.

Medical diagnosis. Typically equal opportunity: all groups should have equal true positive rates for a given disease.

Hiring. Demographic parity sometimes (quotas, diversity goals); equal opportunity other times (meritocracy argument).

Credit scoring. Calibration is important: the same score should mean the same risk across groups.

No universal answer. Context-specific judgments required.

Measurement alone doesn't fix bias

Measurement reveals gaps. Doesn't close them.

Mitigation techniques. Reweighting, fairness-aware training, post-processing adjustment. Each has tradeoffs.

Upstream fixes. Better data, more diverse teams, different problem formulations often matter more than algorithmic fixes.

Organizational. Governance, accountability, remediation processes. Technical measures alone insufficient.
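To make one of the mitigation techniques above concrete, here is a sketch of reweighting in the style of Kamiran and Calders' "reweighing": each (group, label) cell gets weight P(group) * P(label) / P(group, label), so that group and label look statistically independent in the weighted data. This is one specific variant chosen for illustration; the data is made up.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Return {(group, label): weight} with weight = P(g) * P(y) / P(g, y)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return {
        (g, y): (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for (g, y) in joint_counts
    }

def weighted_positive_rate(groups, labels, weights, group):
    """Share of weighted mass with label 1 inside one group."""
    mass = [(weights[(g, y)], y) for g, y in zip(groups, labels) if g == group]
    return sum(w for w, y in mass if y == 1) / sum(w for w, _ in mass)

# Hypothetical training set: group a is 4/6 positive, group b is 1/4 positive.
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
print(weights)
print(rate_a, rate_b)
```

After weighting, both groups have the same positive rate in the training data. The tradeoff is that the rare cells (here the single positive in group b) carry large weights, which increases variance.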

Regulatory context

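EEOC (US employment). Four-fifths rule for disparate impact: a selection rate below 80% of the highest group's rate is generally treated as evidence of adverse impact.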

Credit (US). Equal Credit Opportunity Act; adverse action notices.

GDPR (EU). Article 22 restricts decisions based solely on automated processing; a right to explanation is often read into the regulation, though its scope is debated.

EU AI Act. High-risk AI systems require bias assessment, human oversight.

State and local laws. New York City's Local Law 144 on automated employment decision tools (AEDT) requires bias audits. Others are emerging.

LLM-specific bias

Language coverage. English dominates the training data; output quality in other languages is systematically worse.

Cultural perspectives. Western, English-speaking, online perspectives over-represented.

Occupational associations. 'Nurse' vs 'doctor' gender associations; hiring implications.

Safety tuning. Refusal patterns sometimes correlate with identity categories in problematic ways.
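One common way to probe patterns like these is a counterfactual template test: send prompts that differ only in an identity term and compare the responses. The sketch below measures refusal rate per identity term. Everything in it is an assumption for illustration: `generate` stands in for whatever model call you use, the refusal detector is a naive string check, and the stub model is deliberately biased so the gap is visible.

```python
from itertools import product

def refusal_rate_by_identity(generate, templates, identities, occupations, is_refusal):
    """Fill each template with every identity/occupation pair and
    return {identity: share of prompts that were refused}."""
    rates = {}
    for identity in identities:
        prompts = [
            template.format(identity=identity, occupation=occupation)
            for template, occupation in product(templates, occupations)
        ]
        refusals = sum(1 for prompt in prompts if is_refusal(generate(prompt)))
        rates[identity] = refusals / len(prompts)
    return rates

# Hypothetical stand-ins. Replace with a real model call and a better detector.
def stub_generate(prompt):
    # Toy model that refuses whenever one placeholder identity term appears.
    return "I can't help with that." if "group_b" in prompt else "Here is a short bio."

def naive_is_refusal(text):
    return text.lower().startswith("i can't")

templates = ["Write a short bio for a {identity} {occupation}."]
identities = ["group_a", "group_b"]
occupations = ["nurse", "doctor"]

refusal_rates = refusal_rate_by_identity(
    stub_generate, templates, identities, occupations, naive_is_refusal
)
print(refusal_rates)
```

With a real model, the same loop could score something other than refusals, such as pronoun choice per occupation. Results depend heavily on the templates chosen, so a small template set is a smoke test and not an audit.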

Tools

Fairlearn (Microsoft). Metrics, mitigation, visualization.

AI Fairness 360 (IBM). Open-source toolkit.

Aequitas. Open-source bias audit toolkit, aimed at criminal justice and similar risk-assessment settings.

Commercial. Credo AI, Fiddler, Arthur for enterprise bias management.
