Case study on evaluating statistical tools that predict recidivism.
This case study is based on two articles that were published in 2016:
“Machine Bias”, by Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, published by ProPublica.
A response by Sam Corbett-Davies, Emma Pierson, Avi Feller and Sharad Goel: “A computer program used for bail and sentencing decisions was labeled biased against blacks. It’s actually not that clear.”, published in the Washington Post.
Both articles are about COMPAS, a statistical tool used in the justice system to assign defendants a “risk score” that is intended to reflect the risk that they will commit another crime if released.
The ProPublica article evaluates COMPAS as a binary classifier and compares its error rates for black and white defendants. It concludes that COMPAS is unfair to black defendants because they are more likely to be misclassified as high risk.
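As a minimal sketch of what it means to evaluate a tool as a binary classifier, the following computes the two error rates from confusion-matrix counts. The function name `error_rates` and the counts for the two groups are made up for illustration; they are not taken from the COMPAS data or from the notebooks.

```python
def error_rates(tp, fp, fn, tn):
    """Return (false positive rate, false negative rate) from confusion-matrix counts."""
    fpr = fp / (fp + tn)  # share of non-recidivists classified high risk
    fnr = fn / (fn + tp)  # share of recidivists classified low risk
    return fpr, fnr


# Made-up counts for two hypothetical groups (NOT the COMPAS data)
group_a = dict(tp=300, fp=200, fn=100, tn=400)
group_b = dict(tp=150, fp=100, fn=150, tn=600)

for name, counts in [("A", group_a), ("B", group_b)]:
    fpr, fnr = error_rates(**counts)
    print(f"group {name}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```

In this toy example, group A has the higher false positive rate, which is the kind of disparity the ProPublica analysis reports.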
In response, the Washington Post article shows that COMPAS has the same predictive value for black and white defendants. The authors also explain that, when two groups have different rates of recidivism, a test cannot have both the same predictive values and the same error rates for the two groups.
The purpose of this case study is to understand these conflicting claims, to learn about classification algorithms and the metrics we use to evaluate them, and to think about fairness and the ethics of data science.
In the first notebook I replicate the analysis from the ProPublica article and define the basic metrics we use to evaluate classification algorithms, including error rates and predictive values.
In the second notebook I replicate the analysis from the WaPo article and define the calibration curve, the ROC curve, and a related metric, concordance.
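One way to read concordance is as the probability that a randomly chosen recidivist has a higher risk score than a randomly chosen non-recidivist, which equals the area under the ROC curve. The sketch below computes it by brute force over all pairs, counting ties as half. The function name `concordance` and the scores are invented for illustration and are not from the notebooks.

```python
from itertools import product


def concordance(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive case
    has the higher score; tied pairs count as half."""
    total = 0.0
    for sp, sn in product(scores_pos, scores_neg):
        if sp > sn:
            total += 1.0
        elif sp == sn:
            total += 0.5
    return total / (len(scores_pos) * len(scores_neg))


# Made-up risk scores on a 1-10 scale (NOT the COMPAS data)
recidivists = [3, 5, 8, 9]
non_recidivists = [1, 3, 4, 6]
print(concordance(recidivists, non_recidivists))
```

A value of 0.5 would mean the scores carry no information about the outcome, and 1.0 would mean every recidivist scored higher than every non-recidivist.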
In the third notebook I use the same methods to evaluate the performance of COMPAS for male and female defendants, and lay out the fundamental conflict between two definitions of fairness.
These three notebooks are intended to support a module in a data science class that engages students in the context and ethical challenges of machine learning.
I used these notebooks for a module of my Data Science class at Olin College.
Over the course of three class sessions, I presented these slides and led a discussion with students. This happened in Spring 2020 when classes were run remotely, so the discussions were not as effective as they could have been. For next time I hope to develop a richer set of discussion questions.
This repository contains three additional notebooks with further explorations that you might be interested in. They are not essential to understanding the issues, and they are less complete than the first three notebooks.
The fourth notebook proves what I asserted in the second notebook: if you are given prevalence and error rates, you can compute predictive values; and if you are given prevalence and predictive values, you can compute error rates.
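The two conversions can be sketched as follows, working with fractions of the population rather than counts. This is an illustration under stated assumptions, not the code from the notebook: the function names are hypothetical, prevalence must be strictly between 0 and 1, and the inverse assumes that PPV + NPV is not equal to 1.

```python
def predictive_values(prev, fpr, fnr):
    """Compute (PPV, NPV) from prevalence and error rates."""
    tp = prev * (1 - fnr)        # recidivists classified high risk
    fn = prev * fnr              # recidivists classified low risk
    fp = (1 - prev) * fpr        # non-recidivists classified high risk
    tn = (1 - prev) * (1 - fpr)  # non-recidivists classified low risk
    return tp / (tp + fp), tn / (tn + fn)


def error_rates_from_pv(prev, ppv, npv):
    """Compute (FPR, FNR) from prevalence and predictive values."""
    # q is the fraction classified high risk; it solves
    # prev = q * ppv + (1 - q) * (1 - npv)
    q = (prev + npv - 1) / (ppv + npv - 1)
    fp = q * (1 - ppv)
    fn = (1 - q) * (1 - npv)
    return fp / (1 - prev), fn / prev


# Made-up numbers: same error rates, two different prevalences
for prev in [0.3, 0.5]:
    ppv, npv = predictive_values(prev, fpr=0.3, fnr=0.4)
    print(f"prevalence={prev}: PPV={ppv:.3f}, NPV={npv:.3f}")
```

With the error rates held fixed, the predictive values change with prevalence, which is why two groups with different rates of recidivism cannot match on both sets of metrics.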
The fifth notebook demonstrates that the challenge of defining fairness between groups gets harder as we consider more groups, and identifies the groups with the highest and lowest errors and predictive values.
The sixth notebook explores what I call “the other calibration curve”, the probability of being classified high risk as a function of the probability of recidivism.
I include these notebooks in part to resist the temptation to hide my development process. I worked on this case study on and off over several years. I explored a lot of things and took a lot of wrong turns. It took me a long time to find the story, get it organized, and strike a balance between two conflicting goals: maintaining the scientific detachment that lets us tackle difficult topics while keeping sight of the context, the people, and the human consequences.
I hope these materials will be engaging and informative for readers, and useful for teaching and learning the ethical practice of data science.