This is the seventh of 8 Foundations modules. You can now build models (Module 4), evaluate them (Module 5), and choose architectures (Module 6). This module asks a different kind of question: should you deploy this model? And if so, under what conditions? Responsible AI is not an add-on. It is a prerequisite for any system that affects real people.

Investigation · May 2016
In May 2016, ProPublica published an investigation into COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a risk assessment tool used by US courts to predict whether a defendant would reoffend. The tool scored defendants on a scale of 1 to 10. Judges used these scores to inform bail, sentencing, and parole decisions.
ProPublica's analysis of over 7,000 defendants in Broward County, Florida found that the algorithm was approximately twice as likely to incorrectly label Black defendants as high-risk (false positive) compared to white defendants. Conversely, it was approximately twice as likely to incorrectly label white defendants as low-risk (false negative).
Northpointe (the tool's developer, now Equivant) countered that the tool achieved equal predictive accuracy across racial groups: among those scored as high-risk, the reoffending rate was similar regardless of race. Both claims were statistically true. The debate revealed that different mathematical definitions of fairness are mutually incompatible when base rates differ between groups.
If two people commit the same offence, should an algorithm be allowed to predict different reoffending risks based on factors correlated with race?
The COMPAS controversy is not a story about a bad algorithm. It is a story about the impossibility of satisfying all fairness criteria simultaneously when the underlying populations have different characteristics. This mathematical reality, not a technical bug, is what makes responsible AI genuinely hard. This module gives you the vocabulary and frameworks to navigate it.
If fairness metrics, LIME/SHAP, and model cards are already familiar, test yourself with the knowledge checks and proceed to Module 8: Foundations capstone.
With the learning outcomes established, the module begins by examining fairness and why its competing definitions conflict.
Fairness in machine learning is not a single concept. It is a family of mathematical criteria, several of which are provably incompatible when applied simultaneously. Two of the most widely used are:
A model satisfies demographic parity (also called statistical parity) if the proportion of positive predictions is the same across all demographic groups. If 30% of applicants from Group A are approved for a loan, then 30% of applicants from Group B should also be approved, regardless of any other factors.
The appeal is intuitive: equal treatment at the output level. The limitation is that it ignores base rates. If Group A genuinely has a higher default rate due to historical economic disadvantage, demographic parity may require approving applicants who are likely to default, which harms both the lender and the borrower.
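To make the criterion concrete, here is a minimal sketch in Python that checks demographic parity by comparing positive-prediction rates across groups. The predictions and group labels are invented purely for illustration, not real lending data:

    import numpy as np

    # Invented predictions: 1 = approved, 0 = denied (illustrative only)
    y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
    group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

    # Demographic parity compares the positive-prediction rate per group
    for g in np.unique(group):
        rate = y_pred[group == g].mean()
        print(f"Group {g}: approval rate = {rate:.2f}")

A large gap between the per-group rates signals a demographic parity violation; note that the check says nothing about whether the approvals were correct.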
A model satisfies equalized odds if the true positive rate and false positive rate are equal across groups. This means the model is equally accurate (and equally wrong) for everyone. If it catches 80% of actual reoffenders in Group A, it should catch 80% in Group B. If it falsely flags 10% of non-reoffenders in Group A, it should falsely flag 10% in Group B.
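Equalized odds is checked on the error rates rather than the raw approval rates. A minimal sketch, again with invented labels and predictions:

    import numpy as np

    # Invented data: 1 = reoffended (y_true) / flagged high-risk (y_pred)
    y_true = np.array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0])
    y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
    group = np.array(["A"] * 5 + ["B"] * 5)

    # Equalized odds requires equal TPR and equal FPR across groups
    for g in np.unique(group):
        m = group == g
        tpr = y_pred[m & (y_true == 1)].mean()  # true positive rate
        fpr = y_pred[m & (y_true == 0)].mean()  # false positive rate
        print(f"Group {g}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")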
COMPAS approximately satisfied equal predictive accuracy (a related criterion) but violated equalized odds: Black defendants had a higher false positive rate. The mathematical impossibility result, proved by Chouldechova (2017) and by Kleinberg, Mullainathan, and Raghavan (2016), shows that when base rates differ between groups, you cannot simultaneously achieve both calibration (equal predictive accuracy) and equalized odds.
“Any test that satisfies predictive parity cannot also satisfy equal false positive and false negative rates across groups when the base rates differ.”
Chouldechova, A., 'Fair prediction with disparate impact: A study of bias in recidivism prediction instruments' (2017) - Theorem 1
This impossibility result is fundamental. It means fairness is not a technical problem to be solved but a value judgment about which trade-offs are acceptable. Different stakeholders (defendants, judges, communities) may reasonably prioritise different fairness criteria.
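The trade-off can be seen in a back-of-the-envelope calculation using the relation Chouldechova derives between these quantities: FPR = p/(1 - p) * (1 - PPV)/PPV * TPR, where p is a group's base rate. If calibration (equal PPV) and equal TPR are held fixed, different base rates force different false positive rates. The numbers below are illustrative, not COMPAS figures:

    # Relation from Chouldechova (2017): FPR = p/(1-p) * (1-PPV)/PPV * TPR,
    # where p is the group's base rate of reoffending.
    def implied_fpr(base_rate, ppv, tpr):
        return base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr

    # Illustrative numbers: hold calibration (PPV) and TPR equal for both groups
    ppv, tpr = 0.6, 0.7
    print(implied_fpr(0.5, ppv, tpr))  # base rate 0.5 -> FPR ~ 0.47
    print(implied_fpr(0.3, ppv, tpr))  # base rate 0.3 -> FPR ~ 0.20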
Common misconception
“A fair algorithm treats everyone the same.”
Equal treatment and equal outcomes are different things, and achieving one often prevents the other. When underlying populations have different characteristics (due to historical discrimination, socioeconomic factors, or genuine differences), treating everyone identically reproduces existing inequalities. Fairness requires choosing which type of equality matters most in a given context, and that is a moral and political decision, not a technical one.
With the conflicting definitions of fairness in place, the discussion turns to explainability: opening the black box.
A model that cannot explain its decisions cannot be trusted, audited, or challenged. Explainability methods provide post-hoc interpretations of why a model made a specific prediction. Two dominant approaches are LIME and SHAP.
LIME explains individual predictions by perturbing the input and observing how the output changes. For a loan denial, LIME might create hundreds of slightly modified versions of the application (changing income, age, employment status one at a time) and fit a simple, interpretable model (like a linear regression) to the local region around the original prediction. The coefficients of that local model tell you which features drove this specific decision.
LIME is model-agnostic: it works on any model because it only needs input-output pairs, not access to model internals. The trade-off is that explanations are local (they explain one prediction, not the model as a whole) and can be unstable (different perturbation samples may produce different explanations).
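A minimal sketch of how this looks in practice, assuming the lime and scikit-learn packages are available; the loan-application feature names and data below are synthetic and purely illustrative:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from lime.lime_tabular import LimeTabularExplainer

    # Synthetic loan-application data (feature names and values are invented)
    rng = np.random.default_rng(0)
    feature_names = ["income", "age", "years_employed", "existing_debt"]
    X_train = rng.normal(size=(500, 4))
    y_train = (X_train[:, 0] - X_train[:, 3] > 0).astype(int)

    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    # LIME perturbs one application and fits a local linear model around it
    explainer = LimeTabularExplainer(
        X_train,
        feature_names=feature_names,
        class_names=["denied", "approved"],
        mode="classification",
    )
    explanation = explainer.explain_instance(
        X_train[0], model.predict_proba, num_features=4
    )
    print(explanation.as_list())  # [(feature condition, local weight), ...]

Running the explanation more than once can produce somewhat different weightings, which is the instability trade-off noted above.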
SHAP assigns each feature a contribution value based on Shapley values from cooperative game theory. The idea: what is each feature's marginal contribution to the prediction when considering all possible feature combinations? Unlike LIME, SHAP values have a solid theoretical foundation with provable properties (local accuracy, missingness, consistency). They provide both local explanations (why this prediction) and global explanations (which features matter most across all predictions).
SHAP is more computationally expensive than LIME, especially for large models. For tree-based models, TreeSHAP provides exact Shapley values in polynomial time. For deep neural networks, approximations are necessary.
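A corresponding sketch with the shap package (assumed installed); the data and model are synthetic, and the exact shape of the returned values can differ across model types and shap versions:

    import numpy as np
    import shap
    from sklearn.ensemble import GradientBoostingClassifier

    # Synthetic data (invented for illustration)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = (X[:, 0] - X[:, 3] > 0).astype(int)
    model = GradientBoostingClassifier(random_state=0).fit(X, y)

    # TreeSHAP: exact Shapley values for tree ensembles in polynomial time
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X[:100])  # one row of contributions per prediction

    print(shap_values[0])                    # local: why this single prediction
    print(np.abs(shap_values).mean(axis=0))  # global: average feature importance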
With explainability methods in hand, the discussion turns to accountability: model cards and documentation.
Accountability requires knowing who built the model, what data it was trained on, how it was evaluated, what its limitations are, and for whom it is intended. Without documentation, there is no accountability.
A model card, proposed by Mitchell et al. (2019) at Google, is a standardised document accompanying a trained model. It includes: model details (who built it, when, and what type of model it is), intended use and out-of-scope uses, the factors relevant to performance (demographic groups, environments, instrumentation), the metrics reported, descriptions of the evaluation and training data, quantitative analyses disaggregated by group, ethical considerations, and caveats and recommendations.
Model cards are not a bureaucratic exercise. They are the minimum viable documentation for responsible deployment. If you cannot fill out a model card, you do not understand your model well enough to deploy it.
“Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups.”
Mitchell, M. et al., 'Model Cards for Model Reporting', FAT* Conference (2019) - Section 1, Introduction
The model card framework was developed at Google as a response to the lack of standardised documentation for deployed ML models. It draws on precedents from other industries: pharmaceutical package inserts, nutritional labels, and electronics datasheets.
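As an illustration only, a model card skeleton can be kept alongside the model as structured data. The field names below paraphrase the sections proposed by Mitchell et al. (2019); every value is a placeholder:

    # Hypothetical skeleton paraphrasing the Mitchell et al. (2019) sections;
    # all values are placeholders to be filled in before deployment.
    model_card = {
        "model_details": {"developers": "", "version": "", "model_type": ""},
        "intended_use": {"primary_uses": "", "out_of_scope_uses": ""},
        "factors": ["demographic groups", "environments", "instrumentation"],
        "metrics": {"overall": None, "disaggregated_by_group": {}},
        "evaluation_data": {"source": "", "preprocessing": ""},
        "training_data": {"source": "", "known_gaps": ""},
        "quantitative_analyses": {},
        "ethical_considerations": "",
        "caveats_and_recommendations": "",
    }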
With model cards and documentation covered, the discussion turns to detecting bias before deployment.
Bias detection is not a one-time check. It requires systematic evaluation across every stage of the ML pipeline: data collection, feature engineering, model training, and post-deployment monitoring. Key practices include: auditing training data for representation gaps and historical bias, checking engineered features for proxies of protected attributes (such as zip code standing in for race), reporting evaluation metrics disaggregated by demographic group rather than only in aggregate, and monitoring production predictions for drift in performance and in group-level error rates. A minimal sketch of disaggregated evaluation follows.
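This sketch uses invented labels, predictions, and an age-band grouping; scikit-learn is assumed available:

    import numpy as np
    from sklearn.metrics import f1_score

    # Invented arrays: true labels, model predictions, and an age-band group label
    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])
    y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 1, 0])
    age_band = np.array(["under_65"] * 6 + ["over_65"] * 4)

    # Disaggregated evaluation: report the metric per group, not just overall
    print("overall F1:", f1_score(y_true, y_pred))
    for band in np.unique(age_band):
        m = age_band == band
        print(band, "F1:", f1_score(y_true[m], y_pred[m]))

A strong overall score can hide a weak per-group score, which is exactly the failure mode in the credit scoring scenario below.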
A hiring model approves 40% of male applicants and 40% of female applicants. It satisfies demographic parity. However, among qualified candidates, it approves 80% of males but only 60% of females. Which fairness criterion is violated?
A loan approval model denies an application. LIME generates an explanation showing that 'zip code' was the most influential feature. Why might this be concerning from a fairness perspective?
A team deploys a credit scoring model with a model card showing 0.82 F1 score overall. Six months later, they discover it performs at 0.56 F1 for applicants over 65. What went wrong?
Angwin, J. et al., 'Machine Bias', ProPublica (May 2016)
Full investigation
The investigation that brought algorithmic fairness into public discourse. Demonstrated racial disparities in COMPAS false positive rates. Used as the opening case study.
Chouldechova, A., 'Fair prediction with disparate impact: A study of bias in recidivism prediction instruments' (2017)
Theorem 1
Proves the impossibility of simultaneously satisfying calibration and equalized odds when base rates differ. This result is foundational to understanding why algorithmic fairness is a value judgment, not a technical fix.
Mitchell, M. et al., 'Model Cards for Model Reporting', FAT* Conference (2019)
Sections 1-4
Introduced the model card framework for standardised ML model documentation. Establishes the minimum documentation standard for responsible deployment, including disaggregated metrics and intended use constraints.
Ribeiro, M. T. et al., 'Why Should I Trust You? Explaining the Predictions of Any Classifier', KDD Conference (2016)
Sections 3-5
Introduced LIME (Local Interpretable Model-Agnostic Explanations). Demonstrated that model-agnostic local explanations can reveal unexpected model behaviour and proxy variable reliance.
Lundberg, S. M. and Lee, S.-I., 'A Unified Approach to Interpreting Model Predictions', NeurIPS (2017)
Full paper
Introduced SHAP values for ML explanation. Unified multiple existing explanation methods under the Shapley value framework, providing theoretical guarantees (local accuracy, missingness, consistency) that ad-hoc methods lack.
You now have the vocabulary for responsible AI: fairness criteria, explainability methods, model cards, and bias detection. The Foundations stage concludes with a capstone that integrates everything from Modules 1-7: a hospital wants to deploy AI for patient triage. You will evaluate the system across all the dimensions you have learned: data quality, model evaluation, architecture choice, fairness, and accountability.
Module 7 of 24 · AI Foundations