# Fairness and Fallacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Set the random seed so we get the same results every time
np.random.seed(17)
In the criminal justice system, we use algorithms to guide decisions about who should be released on bail or kept in jail, and who should be kept in prison or released on parole.
Of course, we want those algorithms to be fair. For example, suppose we use an algorithm to predict whether a candidate for parole is likely to commit another crime, if released. If the algorithm is fair:
- Its predictions should mean the same thing for different groups of people. For example, if the algorithm predicts that a group of women and a group of men are equally likely to reoffend, we expect the numbers of women and men who actually reoffend to be the same, on average.
- Also, since the algorithm will make mistakes – some people who reoffend will be assigned low probabilities, and some people who do not reoffend will be assigned high probabilities – it should make these mistakes at the same rate in different groups of people.
It is hard to argue with either of these requirements. Suppose the algorithm assigns a probability of 30% to 100 women and 100 men; if 20 of the women commit another crime, and 40 of the men do, it seems like the algorithm is unfair.
Or suppose black candidates are more likely to be wrongly assigned a high probability, and white candidates are more likely to be wrongly assigned a low probability. That seems unfair, too.
But here’s the problem: it is not possible for an algorithm – or a human – to satisfy both requirements. Unless two groups commit crimes at precisely the same rate, any classification that is equally predictive for both groups will necessarily make different kinds of errors between the groups. And if we calibrate it to make the same kind of errors, the meaning of the predictions will be different. To understand why, we have to understand the base rate fallacy.
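Here is a minimal numeric sketch of the tension, using made-up base rates and a hypothetical helper function (not code from this chapter): if two groups have different base rates, a classifier with identical sensitivity and specificity in both groups necessarily produces different false discovery rates.

```python
def false_discovery_rate(base_rate, sensitivity, specificity, n=1000):
    """Fraction of positive predictions that are wrong, out of n people."""
    true_pos = n * base_rate * sensitivity
    false_pos = n * (1 - base_rate) * (1 - specificity)
    return false_pos / (true_pos + false_pos)

# Same error rates for both groups, different base rates (made up)
for base_rate in [0.2, 0.4]:
    fdr = false_discovery_rate(base_rate, sensitivity=0.8, specificity=0.8)
    print(f"base rate {base_rate:.0%}: false discovery rate {fdr:.0%}")
```

With identical 80% sensitivity and specificity, the group with the lower base rate ends up with a much larger share of wrongly accused people, so the two fairness requirements pull in opposite directions.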
In this chapter, I’ll demonstrate the base rate fallacy using three examples:
- Suppose you take a medical test and the result is positive, which indicates that you have a particular disease. If the test is 99% accurate, you might think there is a 99% chance that you have the disease. But the actual probability could be much less.
- Suppose a driver is arrested because a testing device found that their blood alcohol was above the legal limit. If the device is 99% accurate, you might think there’s a 99% chance they are guilty. In fact, the probability depends strongly on the reason the driver was stopped.
- Suppose you hear that 70% of people who die from a disease had been vaccinated against it. You might think that the vaccine was not effective. In fact, such a vaccine might prevent 80% of deaths or more, and save a large number of lives.
These examples might be surprising if you have not seen them before, but once you understand what’s going on, I think they will make sense. At the end of the chapter, we’ll return to the problem of algorithms and criminal justice.
## Medical Tests
During the COVID-19 pandemic, we all got a crash course in the science and statistics of infectious disease. One of the things we learned about is the accuracy of medical testing and the possibility of errors, both false positives and false negatives.
As an example, suppose a friend tells you they have tested positive for COVID, and they want to know whether they are really infected, or whether the result might be a false positive. What information would you need to answer their question?
One thing you clearly need to know is the accuracy of the test, but that turns out to be a little tricky. In the context of medical testing, we have to consider two kinds of accuracy: sensitivity and specificity.
Sensitivity is the ability of the test to detect the presence of an infection, usually expressed as a probability. For example, if the sensitivity of the test is 87%, that means 87 out of 100 people who are actually infected will get a positive test result, on average. The other 13 will get a false negative.
Specificity is the ability of the test to indicate the absence of an infection. For example, if the specificity is 98%, that means 98 out of 100 people who are not infected will get a negative test result, on average. The other 2 will get a false positive.
I did not make those numbers up. They are the reported sensitivity and specificity of a particular rapid antigen test, the kind used for at-home testing, in December 2021. By the numbers, it sounds like the test is accurate, so if your friend tested positive, you might think they are likely to be infected.
But that’s not necessarily true. It turns out that there is another piece of information we need to consider: the base rate, which is the probability that your friend was infected, based on everything we know about them except the outcome of the test.
For example, if they live someplace where the infection rate is high, we know they have been in a room with someone who was infected, and they currently have symptoms, the base rate might be quite high. If they have been in strict isolation for 14 days and have no symptoms, it would be quite low.
To see why it matters, let’s consider a case where the base rate is relatively low, like 1%. And let’s imagine a group of 1000 people who all take the test. In a group this size, we expect 10 people to be infected, because 10 out of 1000 is 1%.
Of the 10 who are actually infected, we expect 9 to get a positive test result, because the sensitivity of the test is 87%.
Of the other 990, we expect 970 to get a negative test result, because the specificity is 98%. But that means we expect 20 people to get a false positive.
Before we go on, let’s put the numbers we have so far in a table.
table = pd.DataFrame(index=["Infected", "Not infected"])
table["# of people"] = 10, 990
table["Prob positive"] = 0.87, 0.02
table["# positive"] = (
(table["# of people"] * table["Prob positive"]).round().astype(int)
)
table
| | # of people | Prob positive | # positive |
|---|---|---|---|
| Infected | 10 | 0.87 | 9 |
| Not infected | 990 | 0.02 | 20 |
The first column is the number of people in each group: infected or not.
The second column is the probability of a positive test for each group. For someone who is actually infected, the probability of a positive test is 0.87, because sensitivity is 87%. For someone who is not infected, the probability of a negative test is 0.98, because specificity is 98%, so the probability of a positive test is 0.02.
The third column is the product of the first two columns, which is the number of positive tests we expect in each group, on average. Out of 1000 test results, 9 are true positives and 20 are false positives, for a total of 29 positive results.
Now we are ready to answer your friend’s question: Given a positive test result, what is the probability that they are actually infected? In this example, the answer is 9 out of 29, or 31%.
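As a check on the table, the same answer can be computed directly with Bayes’s rule; the variable names here are mine, not from the chapter’s code. The exact value is about 30.5%, which agrees with 9 out of 29 up to the rounding of counts to whole people.

```python
# P(infected | positive) via Bayes's rule, using the numbers above
base_rate = 0.01   # prior probability of infection
sens = 0.87        # P(positive | infected)
spec = 0.98        # P(negative | not infected)

# Total probability of a positive result
p_pos = base_rate * sens + (1 - base_rate) * (1 - spec)
p_infected_given_pos = base_rate * sens / p_pos
print(round(p_infected_given_pos, 3))
```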
Here’s the table again, with a fourth column showing the probability of actual infection and the complementary probability that the test result is a false positive.
total = table["# positive"].sum()
table["% of positive tests"] = (table["# positive"] / total).round(3) * 100
table
| | # of people | Prob positive | # positive | % of positive tests |
|---|---|---|---|---|
| Infected | 10 | 0.87 | 9 | 31.0 |
| Not infected | 990 | 0.02 | 20 | 69.0 |
Although the sensitivity and specificity of the test are high, after a positive result, the probability that your friend is infected is only 31%. The reason it’s so low is that the base rate in this example is only 1%.
### More prevalence
To see why it matters, let’s change the scenario. Suppose your friend has mild flu-like symptoms; in that case it seems more likely that they are infected, compared to someone with no symptoms. Let’s say it is ten times more likely, so the probability that your friend is infected is 10% before we get the test results. In that case, out of 1000 people with the same symptoms, we would expect 100 to be infected. If we modify the first column of the table accordingly, here are the results.
def make_test_table(prior, likelihood, as_int=True):
    """Create a table showing test results for infected and not infected groups.

    prior: tuple of (infected_count, not_infected_count)
    likelihood: tuple of (sensitivity, 1 - specificity)
    as_int: if True, round positive test counts to integers
    """
    table = pd.DataFrame(index=["Infected", "Not infected"])
    table["# of people"] = prior
    table["Prob positive"] = likelihood
    table["# positive"] = table["# of people"] * table["Prob positive"]
    if as_int:
        table["# positive"] = table["# positive"].round().astype(int)
    total = table["# positive"].sum()
    table["% of positive tests"] = (table["# positive"] / total).round(3) * 100
    return table
sens = 0.87
spec = 0.98
prior = 100, 900
likelihood = sens, 1 - spec
make_test_table(prior, likelihood)
| | # of people | Prob positive | # positive | % of positive tests |
|---|---|---|---|---|
| Infected | 100 | 0.87 | 87 | 82.9 |
| Not infected | 900 | 0.02 | 18 | 17.1 |
Now the probability is about 83% that your friend is actually infected, and about 17% that the result is a false positive. This example demonstrates two things:
- The base rate makes a big difference, and
- Even with an accurate test and a 10% base rate, the probability of a false positive is still surprisingly high.
If the test is more sensitive, that helps, but maybe not as much as you expect. For example, another brand of rapid antigen tests claims 95% sensitivity, substantially better than the first brand, which was 87%. With this test, assuming the same specificity, 98%, and the same base rate, 10%, here’s what we get.
sens = 0.95
spec = 0.98
prior = 100, 900
likelihood = sens, 1 - spec
make_test_table(prior, likelihood)
| | # of people | Prob positive | # positive | % of positive tests |
|---|---|---|---|---|
| Infected | 100 | 0.95 | 95 | 84.1 |
| Not infected | 900 | 0.02 | 18 | 15.9 |
Increasing the sensitivity from 87% to 95% has only a small effect: the probability that the test result is a false positive goes from 17% to 16%.
### More specificity
Increasing specificity has a bigger effect. For example, lab tests that use PCR (polymerase chain reaction) are highly specific, about as close to 100% as can be. However, in practice it is always possible that a specimen is contaminated, a device malfunctions, or a result is reported incorrectly.
For example, in a retirement community near my house in Massachusetts, 18 employees and one resident tested positive for COVID in August 2020. But all 19 turned out to be false positives, produced by a lab in Boston that was suspended by the Department of Public Health after they reported at least 383 false positive results.
It’s hard to say how often something like that goes wrong, but if it happens one time in 1000, the specificity of the test would be 99.9%. Let’s see what effect that has on the results.
sens = 0.95
spec = 0.999
prior = 100, 900
likelihood = sens, 1 - spec
make_test_table(prior, likelihood, as_int=False)
| | # of people | Prob positive | # positive | % of positive tests |
|---|---|---|---|---|
| Infected | 100 | 0.950 | 95.0 | 99.1 |
| Not infected | 900 | 0.001 | 0.9 | 0.9 |
With 95% sensitivity, 99.9% specificity, and 10% base rate, the probability is about 99% that your friend is actually infected, given a positive PCR test result.
However, the base rate still matters. Suppose you tell me that your friend lives in New Zealand where (at least at the time I am writing) the rate of COVID infection is very low. In that case the base rate for someone with mild flu-like symptoms might be 1 in 1000.
Here’s the table with 95% sensitivity, 99.9% specificity, and base rate 1 in 1000.
sens = 0.95
spec = 0.999
prior = 1, 999
likelihood = sens, 1 - spec
make_test_table(prior, likelihood, as_int=False)
| | # of people | Prob positive | # positive | % of positive tests |
|---|---|---|---|---|
| Infected | 1 | 0.950 | 0.950 | 48.7 |
| Not infected | 999 | 0.001 | 0.999 | 51.3 |
In this example, the numbers in the third column aren’t integers, but that’s okay. The calculation works the same way. Out of 1000 tests, we expect 0.95 true positives, on average, and 0.999 false positives. So the probability is about 49% that a positive test is correct. That’s lower than most people think, including most doctors.
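To see how strongly the answer depends on the base rate, here is a short sweep over base rates with sensitivity and specificity held fixed at the PCR values above; the helper function is hypothetical, introduced for illustration.

```python
def prob_infected_given_positive(base_rate, sens=0.95, spec=0.999):
    """P(infected | positive test) for a given base rate."""
    p_pos = base_rate * sens + (1 - base_rate) * (1 - spec)
    return base_rate * sens / p_pos

# Base rates from 1-in-1000 (New Zealand example) to 10% (symptoms example)
for base_rate in [0.001, 0.01, 0.1]:
    p = prob_infected_given_positive(base_rate)
    print(f"base rate {base_rate:.1%}: P(infected | positive) = {p:.1%}")
```

Even with a highly specific test, the probability that a positive result is correct ranges from about 49% to about 99% as the base rate moves from 0.1% to 10%.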
### Bad medicine
A 2014 paper in The Journal of the American Medical Association reports the result of a sneaky experiment. The researchers asked a “convenience sample” of doctors (probably their friends and colleagues) the following question:
“If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person’s symptoms or signs?”
What they call “prevalence” is what I’ve been calling “base rate”. And what they call the “false positive rate” is the complement of specificity, so a false positive rate of 5% corresponds to a specificity of 95%.
Before I tell you the results of the experiment, let’s work out the answer to the question. We are not given the sensitivity of the test, so I’ll make the optimistic assumption that it is 99%. The following table shows the results.
sens = 0.99
spec = 0.95
prior = 1, 999
likelihood = sens, 1 - spec
make_test_table(prior, likelihood, as_int=False)
| | # of people | Prob positive | # positive | % of positive tests |
|---|---|---|---|---|
| Infected | 1 | 0.99 | 0.99 | 1.9 |
| Not infected | 999 | 0.05 | 49.95 | 98.1 |
The correct answer is about 2%.
Now, here are the results of the experiment:
“Approximately three-quarters of respondents answered the question incorrectly. In our study, 14 of 61 respondents (23%) gave a correct response. […] the most common answer was 95%, given by 27 of 61 respondents.”
If the correct answer is 2% and the most common response is 95%, that is an alarming level of misunderstanding.
To be fair, the wording of the question might have been confusing. Informally, “false positive rate” could mean either:
- The fraction of uninfected people who get a positive test result,
- The fraction of positive test results that are false.
The first is the technical definition of “false positive rate”; the second is called the “false discovery rate”. But even statisticians have trouble keeping these terms straight, and doctors are experts at medicine, not statistics.
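The two quantities are easy to compute side by side for the JAMA question, again assuming 99% sensitivity as above; the variable names are mine.

```python
# One group of 1000 people, prevalence 1/1000, false positive rate 5%
infected, uninfected = 1, 999
true_pos = infected * 0.99       # assumed 99% sensitivity
false_pos = uninfected * 0.05    # 5% of uninfected test positive

# Fraction of uninfected people who test positive (the number in the question)
false_positive_rate = false_pos / uninfected

# Fraction of positive tests that are false (what the question asks about)
false_discovery_rate = false_pos / (true_pos + false_pos)

print(round(false_positive_rate, 3))
print(round(false_discovery_rate, 3))
```

The false positive rate is 5% by construction, but the false discovery rate is about 98%, so the chance that a positive result is correct is about 2%.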
However, even if the respondents misunderstood the question, their confusion could have real consequences for patients. In the case of COVID testing, a false positive test result might lead to an unnecessary period of isolation, which would be disruptive, expensive, and possibly harmful. A state investigation of the lab that produced hundreds of false positive results concluded that their failures “put patients at immediate risk of harm”.
Other medical tests involve similar risks. For example, in the case of cancer screening, a false positive might lead to additional tests, unnecessary biopsy or surgery, and substantial costs, not to mention emotional difficulty for the patient and their family.
Doctors and patients need to know about the base rate fallacy. As we’ll see in the next section, lawyers, judges, and jurors do, too.
## Driving Under the Influence
The challenges of the base rate fallacy have become more salient as some states have cracked down on “drugged driving”.
In September 2017 the American Civil Liberties Union (ACLU) filed suit against Cobb County, Georgia on behalf of four drivers who were arrested for driving under the influence of cannabis. All four were evaluated by Officer Tracy Carroll, who had been trained as a “Drug Recognition Expert” (DRE) as part of a program developed by the Los Angeles Police Department in the 1970s.
At the time of their arrest, all four insisted that they had not smoked or ingested any cannabis products, and when their blood was tested, all four results were negative; that is, the blood tests found no evidence of recent cannabis use.
In each case, prosecutors dismissed the charges related to impaired driving. Nevertheless, the arrests were disruptive and costly, and the plaintiffs were left with a permanent and public arrest record.
At issue in the case is the assertion by the ACLU that, “Much of the DRE protocol has never been rigorously and independently validated.”
So I investigated that claim. What I found was a collection of studies that are, across the board, deeply flawed. Every one of them features at least one methodological error so blatant it would be embarrassing at a middle school science fair.
As an example, the lab study most often cited to show that the DRE protocol is valid was conducted at Johns Hopkins University School of Medicine in 1985. It concludes, “Overall, in 98.7% of instances of judged intoxication the subject had received some active drug”. In other words, in the cases where one of the Drug Recognition Experts believed that a subject was under the influence, they were right 98.7% of the time.
That sounds impressive, but there are several problems with this study. The biggest is that the subjects were all “normal, healthy” male volunteers between 18 and 35 years old, who were screened and “trained on the psychomotor tasks and subjective effect questionnaires used in the study”.
By design, the study excluded women, anyone older than 35, and anyone in poor health. Then the screening excluded anyone who had any difficulty passing a sobriety test while they were sober – for example, anyone with shaky hands, poor coordination, or poor balance.
But those are exactly the people most likely to be falsely accused. How can you estimate the number of false positives if you exclude from the study everyone likely to yield a false positive? You can’t.
Another frequently-cited study reports that “When DREs claimed drugs other than alcohol were present, they [the drugs] were almost always detected in the blood (94% of the time)”. Again, that sounds impressive until you look at the methodology.
Subjects in this study had already been arrested because they were suspected of driving while impaired, most often because they had failed a field sobriety test.
Then, while they were in custody, they were evaluated by a DRE, that is, a different officer trained in the drug evaluation procedure. If the DRE thought that the suspect was under the influence of a drug, the suspect was asked to consent to a blood test; otherwise they were released.
Of 219 suspects, 18 were released after a DRE performed a “cursory examination” and concluded that there was no evidence of drug impairment.
The remaining 201 suspects were asked for a blood sample. Of those, 22 refused and 6 provided a urine sample only.
Of the 173 blood samples, 162 were found to contain a drug other than alcohol. That’s about 94%, which is the statistic they reported.
But the base rate in this study is extraordinarily high, because it includes only cases that were suspected by the arresting officer and then confirmed by the DRE. With a few generous assumptions, I estimate that the base rate is 86%; in reality, it was probably higher.
To estimate the base rate, let’s assume:
- All 18 of the suspects who were released were, in fact, not under the influence of a drug, and
- The 28 suspects who refused a blood test were impaired at the same rate as the 173 who agreed, 94%.
Both of these assumptions are generous; that is, they probably overestimate the accuracy of the DREs. Even so, they imply that 188 out of 219 blood tests would have been positive, if they had been tested. That’s a base rate of 86%.
def percent(y, n):
    """Calculate percentage: 100 * y / (y + n)."""
    return 100 * y / (y + n)
no_drug = 11
drug = 173 - 11
drug
162
rate = percent(drug, no_drug)
rate
93.64161849710983
refused = 28
total_pos = rate / 100 * refused + drug
total_pos
188.21965317919074
total_pos / 219
0.8594504711378573
Because the suspects who were released were not tested, there is no way to estimate the sensitivity of the test, but let’s assume it’s 99%, so if a suspect is under the influence of a drug, there is a 99% chance a DRE would detect it. In reality, it is probably lower.
With these generous assumptions, we can use the following table to estimate the specificity of the DRE protocol.
col1 = "Suspects"
col2 = "Prob positive"
col3 = "Cases"
col4 = "Percent"
table = pd.DataFrame(
index=["Impaired", "Not impaired"], columns=[col1, col2, col3, col4]
)
table[col1] = 86, 14
table[col2] = 0.99, 0.4
table[col3] = table[col1] * table[col2]
total = table[col3].sum()
table[col4] = (table[col3] / total).round(3) * 100
table
| | Suspects | Prob positive | Cases | Percent |
|---|---|---|---|---|
| Impaired | 86 | 0.99 | 85.14 | 93.8 |
| Not impaired | 14 | 0.40 | 5.60 | 6.2 |
With 86% base rate, we expect 86 impaired suspects out of 100, and 14 unimpaired. With 99% sensitivity, we expect the DRE to detect about 85 true positives. And with 60% specificity, we expect the DRE to wrongly accuse 5.6 suspects. Out of 91 positive tests, 85 would be correct; that’s about 94%, as reported in the study.
But this accuracy is only possible because the base rate in the study is so high. Remember that most of the subjects had been arrested because they had failed a field sobriety test. Then they were tested by a DRE, who was effectively offering a second opinion.
But that’s not what happened when Officer Tracy Carroll arrested Katelyn Ebner, Princess Mbamara, Ayokunle Oriyomi, and Brittany Penwell. In each of those cases, the driver was stopped for driving erratically, which is evidence of possible impairment. But when Officer Carroll began his evaluation, that was the only evidence of impairment.
So the relevant base rate is not 86%, as in the study; it is the fraction of erratic drivers who are under the influence of drugs. And there are many other reasons for erratic driving, including distraction, sleepiness, and the influence of alcohol. It’s hard to say which explanation is most common. I’m sure it depends on time and location. But as an example, let’s suppose it is 50%; the following table shows the results with this base rate.
table = pd.DataFrame(
index=["Impaired", "Not impaired"], columns=[col1, col2, col3, col4]
)
table[col1] = 50, 50
table[col2] = 0.99, 0.4
table[col3] = table[col1] * table[col2]
total = table[col3].sum()
table[col4] = (table[col3] / total).round(3) * 100
table
| | Suspects | Prob positive | Cases | Percent |
|---|---|---|---|---|
| Impaired | 50 | 0.99 | 49.5 | 71.2 |
| Not impaired | 50 | 0.40 | 20.0 | 28.8 |
With 50% base rate, 99% sensitivity, and 60% specificity, the predictive value of the test is only 71%; under these assumptions, almost 30% of the accused would be innocent. In fact, the base rate, sensitivity, and specificity are probably lower, which means that the value of the test is even worse.
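The same calculation can be repeated for a range of base rates, keeping the generous assumptions above (99% sensitivity, 60% specificity); the helper function is hypothetical.

```python
def predictive_value(base_rate, sens=0.99, spec=0.60):
    """Fraction of positive DRE evaluations that are correct."""
    true_pos = base_rate * sens
    false_pos = (1 - base_rate) * (1 - spec)
    return true_pos / (true_pos + false_pos)

# 86% is the estimated base rate in the study; lower values are
# plausible for drivers stopped only for driving erratically
for base_rate in [0.86, 0.5, 0.3]:
    pv = predictive_value(base_rate)
    print(f"base rate {base_rate:.0%}: predictive value {pv:.0%}")
```

At the study’s 86% base rate the predictive value is about 94%, matching the reported figure; at a 50% base rate it falls to about 71%, and it keeps dropping as the base rate falls.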
The suit filed by the ACLU was not successful. The court decided that the arrests were valid because the results of the field sobriety tests constituted “probable cause” for an arrest. As a result, the court did not consider the evidence for, or against, the validity of the DRE protocol. The ACLU has appealed the decision.
## Vaccine Effectiveness
Now that we understand the base rate fallacy, we’re ready to untangle a particularly confusing example of COVID disinformation. In October 2021, a journalist appeared on a well-known podcast with a surprising claim. He said, “In the [United Kingdom] 70-plus percent of the people who die now from COVID are fully vaccinated.”
The incredulous host asked, “Seventy percent?” and the journalist repeated, “Seven in ten of the people [who died] – I want to keep saying because nobody believes it but the numbers are there in the government documents – the vast majority of people in Britain who died in September were fully vaccinated.”
Then, to his credit, he showed a table from a report published by Public Health England in September 2021. From the table, he read off the number of deaths in each age group: “1270 out of 1500 in the over 80 category […] 607 of 800 of the 70 year-olds […] They were almost all fully vaccinated. Most people who die of this now are fully vaccinated in the UK. Those are the numbers.”
It’s true; those are the numbers. But the implication that the vaccine is useless, or actually harmful, is wrong. In fact, we can use these numbers, along with additional information from the same table, to compute the effectiveness of the vaccine and estimate the number of lives it saved.
Berenson: The vaccine still appears to have some protective effect … that would still imply that the vaccine was doing some good.
Rogan: So when you say that most of the people who are dying are vaccinated, is that because the [] rate of vaccination is very high?
Berenson: “Yes, but… there’s another complexity here – and this is the part that the vaccine [advocates?] never admit – when you get to a place like Britain or Israel where almost everybody in that age range is vaccinated, who’s not being vaccinated? Do you think there’s a lot of people in the old age home who are saying, ‘You know what, I’m insisting on my personal rights; you can’t vaccinate me.’ Some 88 year old, no. The only people who aren’t being vaccinated in that age group? Are probably too sick or too close to the end of their lives …”
Rogan: Isn’t that speculative, though?
Berenson: You caught me, because you’re right, it is speculative. That is my speculation that there is this difference in these two groups.
death_vax = 1272
death_unvax = 1521 - death_vax
death_vax + death_unvax
1521
death_vax / (death_vax + death_unvax)
0.8362919132149902
rate_vax = 495
rate_unvax = 1560
effectiveness = 1 - (rate_vax / rate_unvax)
effectiveness
0.6826923076923077
Let’s start with the oldest age group, people who were 80 or more years old. In this group, there were 1521 deaths attributed to COVID during the four week period from August 23 to September 19, 2021. Of the people who died, 1272 had been fully vaccinated. The others were either unvaccinated or partially vaccinated; for simplicity I’ll consider them all not fully vaccinated. So, in this age group, 84% of the people who died had been fully vaccinated. On the face of it, that sounds like the vaccine was not effective.
However, the same table also reports death rates among the vaccinated and unvaccinated, that is, the number of deaths as a fraction of the population in each age group. During the same four week period, the death rates due to COVID were 1,560 per million people among the unvaccinated and 495 per million among the vaccinated. So, the death rate was substantially lower among the vaccinated.
The following table shows these death rates in the second column, and the number of deaths in the third column. Given these numbers, we can work forward to compute the fourth column, which shows again that 84% of the people who died had been vaccinated.
We can also work backward to compute the first column, which shows that there were about 2.57 million people in this age group who had been vaccinated, and only 0.16 million who had not. So, more than 94% of this age group had been vaccinated.
col1 = "Population"
col2 = "Death rate"
col3 = "Deaths"
col4 = "Percent"
table = pd.DataFrame(
index=["Vaccinated", "Not vaccinated"], columns=[col1, col2, col3, col4]
)
table[col3] = death_vax, death_unvax
table[col2] = rate_vax, rate_unvax
table[col1] = table[col3] / table[col2]
total = table[col3].sum()
table[col4] = (table[col3] / total * 100).round(1)
table.round(2)
| | Population | Death rate | Deaths | Percent |
|---|---|---|---|---|
| Vaccinated | 2.57 | 495 | 1272 | 83.6 |
| Not vaccinated | 0.16 | 1560 | 249 | 16.4 |
percent_vaccinated = table[col1] / table[col1].sum()
percent_vaccinated
Vaccinated 0.941518
Not vaccinated 0.058482
Name: Population, dtype: float64
From this table, we can also compute the effectiveness of the vaccine, which is the fraction of deaths the vaccine prevented. The difference in the death rate from 1560 per million to 495 is a decrease of 68%. By definition, this decrease is the “effectiveness” of the vaccine in this age group.
Finally, we can estimate the number of lives saved by answering a counterfactual question: if the death rate among the vaccinated had been the same as the death rate among the unvaccinated, how many deaths would there have been? The answer is that there would have been 4,009 deaths. In reality, there were 1,272, so we can estimate that the vaccine saved about 2,737 lives in this age group, in just four weeks.
In the United Kingdom right now, there are a lot of people visiting parents and grandparents at their homes, rather than a cemetery, because of the COVID vaccine.
counterfact = (
table.loc["Vaccinated", "Population"] * table.loc["Not vaccinated", "Death rate"]
)
counterfact
4008.727272727273
# actual number, not rate, in one month
lives_saved = counterfact - table.loc["Vaccinated", "Deaths"]
death_vax, lives_saved
(1272, 2736.727272727273)
Of course, this analysis is based on some assumptions, most notably that the vaccinated and unvaccinated were similar except for their vaccination status. That might not be true: people with high risk or poor general health might have been more likely to seek out the vaccine. If so, our estimate would be too low, and the vaccine might have saved more lives. If not, and people in poor health were less likely to be vaccinated, our estimate would be too high. I’ll leave it to you to judge which is more likely.
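To get a feel for how sensitive the lives-saved estimate is to that assumption, here is a quick check with hypothetical adjustment factors: if the vaccinated group would have had a counterfactual death rate somewhat below or above the unvaccinated rate of 1560 per million, the estimate scales accordingly.

```python
# Numbers from the table above
pop_vax_millions = 1272 / 495   # vaccinated population, about 2.57 million
deaths_vax = 1272               # actual deaths among the vaccinated

# Hypothetical factors: counterfactual death rate relative to 1560 per million
for factor in [0.8, 1.0, 1.2]:
    counterfactual_rate = 1560 * factor
    lives_saved = pop_vax_millions * counterfactual_rate - deaths_vax
    print(f"factor {factor}: about {lives_saved:.0f} lives saved")
```

A factor of 1.0 reproduces the estimate of about 2,737 lives saved; even if the vaccinated group’s counterfactual rate were 20% lower, the estimate would still be close to 2,000 lives in four weeks.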
We can repeat this analysis with the other age groups. The following table shows the number of deaths in each age group and the percentage of the people who died who were vaccinated.
I’ve omitted the “Under 18” age group because there were only 6 deaths in this group, 4 among the unvaccinated and 2 with unknown status. With such small numbers, we can’t make useful estimates for the death rate or effectiveness of the vaccine.
ages = [
# "Under 18",
"18 to 29",
"30 to 39",
"40 to 49",
"50 to 59",
"60 to 69",
"70 to 79",
"80+",
]
# deaths per 28 days, August 23 to September 19
deaths_vax = np.array([5, 10, 30, 102, 258, 607, 1272])
deaths_total = np.array([17, 48, 104, 250, 411, 801, 1521])
deaths_unvax = deaths_total - deaths_vax
deaths_vax.sum(), deaths_total.sum(), deaths_vax.sum() / deaths_total.sum()
(2284, 3152, 0.7246192893401016)
table2 = pd.DataFrame(index=ages, dtype=float)
table2["Deaths vax"] = deaths_vax
table2["Deaths total"] = deaths_total
percent = (deaths_vax / deaths_total) * 100
table2["Percent"] = percent.round(0).astype(int)
table2
| | Deaths vax | Deaths total | Percent |
|---|---|---|---|
| 18 to 29 | 5 | 17 | 29 |
| 30 to 39 | 10 | 48 | 21 |
| 40 to 49 | 30 | 104 | 29 |
| 50 to 59 | 102 | 250 | 41 |
| 60 to 69 | 258 | 411 | 63 |
| 70 to 79 | 607 | 801 | 76 |
| 80+ | 1272 | 1521 | 84 |
Adding up the columns, there were a total of 3,152 deaths, 2,284 of them among the vaccinated. So 72% of the people who died had been vaccinated, as the journalist reported. Among people over 80, it was even higher, as we’ve already seen.
However, in the younger age groups, the percentage of deaths among the vaccinated is substantially lower, which is a hint that this number might reflect something about the groups, not about the vaccine.
To compute something about the vaccine, we can use death rates rather than number of deaths. The following table shows death rates per million people, reported by Public Health England for each age group, and the implied effectiveness of the vaccine, which is the percent reduction in death rate.
rates_vax = [1, 2, 5, 14, 45, 131, 495]
rates_unvax = [3, 12, 38, 124, 231, 664, 1560]
table3 = pd.DataFrame(index=ages, dtype=float)
table3["Death rate vax"] = rates_vax
table3["Death rate unvax"] = rates_unvax
effectiveness = 100 * (1 - np.array(rates_vax) / rates_unvax)
table3["Effectiveness"] = effectiveness.round(0).astype(int)
table3
| | Death rate vax | Death rate unvax | Effectiveness |
|---|---|---|---|
| 18 to 29 | 1 | 3 | 67 |
| 30 to 39 | 2 | 12 | 83 |
| 40 to 49 | 5 | 38 | 87 |
| 50 to 59 | 14 | 124 | 89 |
| 60 to 69 | 45 | 231 | 81 |
| 70 to 79 | 131 | 664 | 80 |
| 80+ | 495 | 1560 | 68 |
The effectiveness of the vaccine is more than 80% in most age groups. In the youngest group it is 67%, but that might be inaccurate because the number of deaths is low and the estimated death rates are not precise. In the oldest group it is 68%, which suggests that the vaccine is less effective for older people, possibly because their immune systems are weaker. However, a treatment that reduces the probability of dying by 68% is still very good.
Effectiveness is nearly the same in most age groups because it reflects primarily something about the vaccines and only secondarily something about the groups.
Now, given the number of deaths and death rates, we can infer the number of people in each age group who were vaccinated or not, and the percentage who had been vaccinated.
table4 = pd.DataFrame(index=ages, dtype=float)
table4["# vax (millions)"] = deaths_vax / rates_vax
table4["# unvax (millions)"] = deaths_unvax / rates_unvax
pop_total = table4.sum().sum()
percent = 100 * table4["# vax (millions)"] / table4.sum(axis=1)
table4["Percent vax"] = percent.round(0).astype(int)
table4.round(1)
| | # vax (millions) | # unvax (millions) | Percent vax |
|---|---|---|---|
| 18 to 29 | 5.0 | 4.0 | 56 |
| 30 to 39 | 5.0 | 3.2 | 61 |
| 40 to 49 | 6.0 | 1.9 | 75 |
| 50 to 59 | 7.3 | 1.2 | 86 |
| 60 to 69 | 5.7 | 0.7 | 90 |
| 70 to 79 | 4.6 | 0.3 | 94 |
| 80+ | 2.6 | 0.2 | 94 |
By August 2021, nearly everyone in England over 60 years old had been vaccinated. In the younger groups, the percentages were lower, but even in the youngest group it was more than half.
With this, it becomes clear why most deaths were among the vaccinated:
Most deaths were in the oldest age groups, and
In those age groups, almost everyone was vaccinated.
Taking this logic to the extreme, if everyone is vaccinated, we expect all deaths to be among the vaccinated.
In the vocabulary of this chapter, the percentage of deaths among the vaccinated depends on the effectiveness of the vaccine and the base rate of vaccination in the population. If the base rate is low, as in the younger groups, the percentage of deaths among the vaccinated is low. If the base rate is high, as in the older groups, the percentage of deaths is high. Because this percentage depends so strongly on the properties of the group, it doesn’t tell us much about the properties of the vaccine.
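To make that relationship explicit, here's a small function of my own (a sketch, not part of the Public Health England analysis) that computes the percentage of deaths among the vaccinated from just two inputs: the vaccination base rate and the effectiveness of the vaccine.

```python
def percent_deaths_vax(base_rate, effectiveness):
    """Percent of deaths among the vaccinated, given the fraction of
    the population vaccinated and the vaccine's effectiveness
    (fractional reduction in the death rate)."""
    # relative death weights of the vaccinated and unvaccinated groups
    vax = base_rate * (1 - effectiveness)
    unvax = 1 - base_rate
    return 100 * vax / (vax + unvax)

# holding effectiveness fixed at 80%, the percentage tracks the base rate
for base_rate in [0.56, 0.75, 0.94, 1.0]:
    print(f"{base_rate:.0%} vaccinated -> "
          f"{percent_deaths_vax(base_rate, 0.80):.0f}% of deaths")
```

With 94% of a group vaccinated, an 80% effective vaccine still leaves about three quarters of the deaths among the vaccinated, which is close to what we saw in the 70 to 79 age group.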
Finally, we can estimate the number of lives saved in each age group. First we compute the hypothetical number of the vaccinated who would have died if their death rate had been the same as among the unvaccinated, then we subtract off the actual number of deaths.
table5 = pd.DataFrame(index=ages, dtype=float)
table5["Hypothetical deaths"] = table4["# vax (millions)"] * table3["Death rate unvax"]
table5["Actual deaths"] = deaths_vax
table5["Lives saved"] = table5["Hypothetical deaths"] - table5["Actual deaths"]
table5.round(0).astype(int)
| | Hypothetical deaths | Actual deaths | Lives saved |
|---|---|---|---|
| 18 to 29 | 15 | 5 | 10 |
| 30 to 39 | 60 | 10 | 50 |
| 40 to 49 | 228 | 30 | 198 |
| 50 to 59 | 903 | 102 | 801 |
| 60 to 69 | 1324 | 258 | 1066 |
| 70 to 79 | 3077 | 607 | 2470 |
| 80+ | 4009 | 1272 | 2737 |
lives_saved = table5["Lives saved"].sum()
lives_saved
7332.25813423218
pop_total
47.64403757147204
pop_total / lives_saved * 1e6  # people in the population per life saved
6497.866918928548
In total, the COVID vaccine saved more than 7000 lives in a four-week period, in a relevant population of about 48 million.
If you created a vaccine that saved 7000 lives in less than a month, in just one country, you would feel pretty good about yourself. And if you used misleading statistics to persuade a large, international audience that they should not get that vaccine, you should feel very bad.
Predicting Crime#
If we understand the base rate fallacy, we can correctly interpret medical and impaired driving tests, and we can avoid being misled by headlines about COVID vaccines. We can also shed light on an ongoing debate about the use of data and algorithms in the criminal justice system.
In 2016 a team of journalists at ProPublica published a now-famous article about COMPAS, which is a statistical tool used in some states to inform decisions about which defendants should be released on bail before trial, how long convicted defendants should be imprisoned, and whether prisoners should be released on probation.
COMPAS uses information about defendants to generate a “risk score” which is supposed to quantify the probability that the defendant will commit another crime if released.
The authors of the ProPublica article used public data to assess the accuracy of COMPAS risk scores. They explain:
We obtained the risk scores assigned to more than 7,000 people arrested in Broward County, Florida, in 2013 and 2014 and checked to see how many were charged with new crimes over the next two years, the same benchmark used by the creators of the algorithm.
They published the data they obtained, so we can use it to replicate their analysis and do our own.
NOTE: The statistics reported in the rest of this chapter are from the Recidivism Case Study, to be published as part of Elements of Data Science.
If we think of COMPAS as a diagnostic test, a high risk score is like a positive test result and a low risk score is like a negative result. Under those definitions, we can use the data to compute the sensitivity and specificity of the test. As it turns out, they are not very good:
Sensitivity: Of the people who were charged with another crime during the period of observation, only 63% were given high risk scores.
Specificity: Of the people who were not charged with another crime, only 68% were given low risk scores.
Now suppose you are a judge considering a bail request from a defendant who has been assigned a high risk score. Among other things, you would like to know the probability that they will commit a crime if released. Let’s see if we can figure that out.
As you might guess by now, we need another piece of information: the base rate. In the sample from Broward County, it is 45%; that is, 45% of the defendants released from jail were charged with a crime within two years.
The following table shows the results with this base rate, sensitivity, and specificity.
def make_risk_table(prior, likelihood, as_int=True):
"""Create a table showing risk scores for charged and not charged groups.
prior: array of (charged_count, not_charged_count)
likelihood: tuple of (sensitivity, 1 - specificity) for high risk scores
as_int: if True, truncate high risk counts to integers
"""
table = pd.DataFrame(index=["Charged again", "Not charged"])
table["# of people"] = prior.astype(int)
table["P(high risk)"] = likelihood
table["# high risk"] = table["# of people"] * table["P(high risk)"]
if as_int:
table["# high risk"] = table["# high risk"].astype(int)
total = table["# high risk"].sum()
table["Percent"] = (table["# high risk"] / total).round(3) * 100
return table
sens = 0.63
spec = 0.68
prev = 0.45
prior = np.array([prev, 1 - prev]) * 1000
likelihood = sens, 1 - spec
make_risk_table(prior, likelihood)
| | # of people | P(high risk) | # high risk | Percent |
|---|---|---|---|---|
| Charged again | 450 | 0.63 | 283 | 61.8 |
| Not charged | 550 | 0.32 | 175 | 38.2 |
Out of 1000 people in this dataset, 450 will be charged with a crime, on average; the other 550 will not.
Based on the sensitivity and specificity of the test, we expect 283 of the offenders to be assigned a high risk score, along with 175 of the non-offenders. So, of all people with high risk scores, about 62% will be charged with another crime.
This result is called the “positive predictive value”, or PPV, because it quantifies the accuracy of a positive test result. In this case, 62% of the positive tests turn out to be correct.
We can do the same analysis with low risk scores.
sens = 1 - 0.63
spec = 1 - 0.68
prev = 0.45
prior = np.array([prev, 1 - prev]) * 1000
likelihood = sens, 1 - spec
table = make_risk_table(prior, likelihood)
table.columns = ['# of people', 'P(low risk)', '# low risk', 'Percent']
table
| | # of people | P(low risk) | # low risk | Percent |
|---|---|---|---|---|
| Charged again | 450 | 0.37 | 166 | 30.7 |
| Not charged | 550 | 0.68 | 374 | 69.3 |
Out of 450 offenders, we expect 166 to get an incorrect low score. Out of 550 non-offenders, we expect 374 to get a correct low score. So, of all people with low risk scores, 69% were not charged with another crime.
This result is called the “negative predictive value” of the test, or NPV, because it indicates what fraction of negative tests are correct.
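As a check on the tables, we can compute both predictive values directly with Bayes's rule; this sketch just restates the same arithmetic in closed form.

```python
def predictive_values(sens, spec, prev):
    """Compute PPV and NPV from sensitivity, specificity, and base rate."""
    # P(charged again | high risk score)
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    # P(not charged | low risk score)
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

ppv, npv = predictive_values(sens=0.63, spec=0.68, prev=0.45)
print(f"PPV: {ppv:.0%}, NPV: {npv:.0%}")  # about 62% and 69%
```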
On one hand, these results show that risk scores provide useful information. If someone gets a high risk score, the probability is 62% that they will be charged with a crime. If they get a low risk score, it is only 31%. So, people with high risk scores are about twice as likely to re-offend.
On the other hand, these results are not as accurate as we would like when we make decisions that affect people’s lives so seriously. And they might not be fair.
Comparing Groups#
The authors of the ProPublica article considered whether COMPAS has the same accuracy for different groups. With respect to racial groups, they find:
… In forecasting who would re-offend, the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways.
The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.
White defendants were mislabeled as low risk more often than black defendants.
This discrepancy suggests that the use of COMPAS in the criminal justice system is racially biased.
I will use the data they obtained to replicate their analysis, and we will see that their numbers are correct. But interpreting these results turns out to be complicated; I think it will be clearer if we start by considering sex, and then race.
In the data from Broward County, 81% of defendants are male and 19% are female. The sensitivity and specificity of the risk scores are almost the same in both groups:
Sensitivity is 63% for male defendants and 61% for female defendants.
Specificity is close to 68% for both groups.
But the base rate is different: about 47% of male defendants were charged with another crime, compared to 36% of female defendants.
In a group of 1000 male defendants, the following table shows the number we expect to get a high risk score and the fraction of them that will re-offend.
sens = 0.63
spec = 0.68
prev = 0.47
prior = np.array([prev, 1 - prev]) * 1000
likelihood = sens, 1 - spec
make_risk_table(prior, likelihood)
| | # of people | P(high risk) | # high risk | Percent |
|---|---|---|---|---|
| Charged again | 470 | 0.63 | 296 | 63.7 |
| Not charged | 530 | 0.32 | 169 | 36.3 |
Of the high risk male defendants, about 64% were charged with another crime.
Here is the corresponding table for 1000 female defendants.
sens = 0.61
spec = 0.68
prev = 0.36
prior = np.array([prev, 1 - prev]) * 1000
likelihood = sens, 1 - spec
make_risk_table(prior, likelihood)
| | # of people | P(high risk) | # high risk | Percent |
|---|---|---|---|---|
| Charged again | 360 | 0.61 | 219 | 51.8 |
| Not charged | 640 | 0.32 | 204 | 48.2 |
Of the high risk female defendants, only 52% were charged with another crime.
And that’s what we should expect: if the test has the same sensitivity and specificity, but the groups have different base rates, the test will have different predictive values in the two groups.
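We can see this effect directly by holding sensitivity and specificity fixed and varying only the base rate. This sketch uses Bayes's rule rather than the tables, but the arithmetic is the same.

```python
def ppv(sens, spec, prev):
    """P(charged again | high risk score), via Bayes's rule."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

# same test characteristics, different base rates
for prev in [0.36, 0.45, 0.47]:
    print(f"base rate {prev:.0%} -> PPV {ppv(0.63, 0.68, prev):.0%}")
```

With the same sensitivity and specificity, the PPV ranges from about 53% at a 36% base rate to about 64% at a 47% base rate.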
Now let’s consider racial groups. As the ProPublica article reports, the sensitivity and specificity of COMPAS are substantially different for white and black defendants:
Sensitivity for white defendants is 52%; for black defendants it is 72%.
Specificity for white defendants is 77%; for black defendants it is 55%.
The complement of sensitivity is the “false negative rate”, or FNR, which in this context is the fraction of offenders who were wrongly classified as low risk. The false negative rate for white defendants is 48% (the complement of 52%); for black defendants it is 28%.
And the complement of specificity is the “false positive rate”, or FPR, which is the fraction of non-offenders who were wrongly classified as high risk. The false positive rate for white defendants is 23% (the complement of 77%); for black defendants it is 45%.
In other words, black non-offenders were almost twice as likely to bear the cost of an incorrect high score. And black offenders were substantially less likely to get the benefit of an incorrect low score.
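The “almost twice” claim is simple arithmetic on the complements; here's a quick check using the sensitivity and specificity quoted above.

```python
groups = {"white": (0.52, 0.77), "black": (0.72, 0.55)}  # (sens, spec)

for name, (sens, spec) in groups.items():
    fnr = 1 - sens  # offenders wrongly classified as low risk
    fpr = 1 - spec  # non-offenders wrongly classified as high risk
    print(f"{name}: FNR {fnr:.0%}, FPR {fpr:.0%}")
```

The false positive rates are 23% and 45%, so the ratio is close to two.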
That seems patently unfair. As U.S. Attorney General Eric Holder wrote in 2014 (as quoted in the ProPublica article), “Although these measures were crafted with the best of intentions, I am concerned that they inadvertently undermine our efforts to ensure individualized and equal justice [and] they may exacerbate unwarranted and unjust disparities that are already far too common in our criminal justice system and in our society.”
But that’s not the end of the story.
Fairness is Hard to Define#
A few months after the ProPublica article, the Washington Post published a response with the expository title: “A computer program used for bail and sentencing decisions was labeled biased against blacks. It’s actually not that clear.”
It acknowledges that the results of the ProPublica article are correct: the false positive rate for black defendants is higher and the false negative rate is lower. But it points out that the PPV and NPV are nearly the same for both groups:
Positive predictive value: Of people with high risk scores, 59% of white defendants and 63% of black defendants were charged with another crime.
Negative predictive value: Of people with low risk scores, 71% of white defendants and 65% of black defendants were not charged again.
So in this sense the test is fair: a high risk score in either group means the same thing; that is, it corresponds to roughly the same probability of recidivism. And a low risk score corresponds to roughly the same probability of non-recidivism.
Strangely, COMPAS achieves one kind of fairness based on sex, and another kind of fairness based on race.
For male and female defendants, the error rates (false positive and false negative) are roughly the same, but the predictive values are different.
For black and white defendants, the error rates are substantially different, but the predictive values (PPV and NPV) are about the same.
The COMPAS algorithm is a trade secret, so there is no way to know why it is designed this way, or even whether the discrepancy is deliberate. But the discrepancy is not inevitable. COMPAS could be calibrated to have equal error rates in all four groups, or equal predictive values.
However, it cannot have the same error rates and the same predictive values. We have already seen why: if the error rates are the same and the base rates are different, we get different predictive values. And, going the other way, if the predictive values are the same and the base rates are different, we get different error rates.
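The impossibility is easy to demonstrate: if we fix the sensitivity and require the same PPV in two groups with different base rates, Bayes's rule forces different false positive rates. The base rates below are hypothetical, chosen only for illustration.

```python
def implied_fpr(sens, ppv, prev):
    """False positive rate required to achieve a given PPV, solving
    ppv = sens*prev / (sens*prev + fpr*(1-prev)) for fpr."""
    return sens * prev * (1 - ppv) / (ppv * (1 - prev))

# hypothetical: same sensitivity and PPV, two different base rates
for prev in [0.40, 0.50]:
    print(f"base rate {prev:.0%} -> "
          f"required FPR {implied_fpr(0.63, 0.62, prev):.0%}")
```

The two groups end up with false positive rates of about 26% and 39%: equal predictive values buy unequal error rates.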
At this point it is tempting to conclude that algorithms are inherently unfair, so maybe we should rely on humans instead of algorithms. But this option is not as appealing as it might seem.
First, it doesn’t actually solve the problem: just like an algorithm, a human judge cannot achieve equal error rates for different groups and equal predictive values at the same time. The math is inescapable by man or machine.
Second, if the task is to use data to generate predictions, humans are almost always worse than algorithms. To see why, let’s consider the reasons a person and an algorithm might disagree:
A human might consider additional information that’s not available to the algorithm. For example, a judge might assess whether a defendant seems contrite, based on their behavior in court.
A human might consider the same information as an algorithm, but give different weight to different factors. For example, a judge might give more weight to age than the algorithm, and less weight to prior arrests.
A human might be influenced by factors that don’t affect algorithms, like political beliefs, personal biases, and mood.
Taking these in turn:
If a human judge uses more information than the algorithm, the additional information may or may not be valid. If not, it provides no advantage. If so, it could be included in the algorithm. For example, if judges record their belief about whether a defendant is contrite or not, we could check whether their assessment is actually predictive, and if so, we could add it to the algorithm.
If a human judge gives more weight to some factors, relative to the algorithm, and less weight to others, the results are unlikely to be better. After all, figuring out which factors are predictive, and how much weight to give each one, is exactly what algorithms are designed to do, and they are generally better at it than humans.
Finally, there is ample evidence that judges differ from each other in consistent ways, and differ from themselves over time. The outcome of a case should not depend on whether it is assigned to a harsh or a lenient judge, or whether it is heard before or after lunch. And it certainly should not depend on prejudices the judge may have based on race, sex, and other group membership.
I don’t mean to say that algorithms are guaranteed to be free of this kind of prejudice. If they are based on previous outcomes, and if those outcomes are subject to bias, algorithms can replicate and perpetuate that bias.
For example, the dataset used by ProPublica to validate COMPAS indicates whether each defendant was charged with another crime during the period of observation. But what we really want to know is whether the defendant committed another crime, and that is not the same thing.
Not everyone who commits a crime gets charged – not even close. The probability of getting charged for a particular crime depends on the type of crime and location; the presence of witnesses and their willingness to work with police; the decisions of police about where to patrol, what crimes to investigate, and who to arrest; and decisions of prosecutors about who to charge. It is likely that every one of these factors depends on the race and sex of the defendant.
This kind of data bias is a problem for algorithms like COMPAS. But it is also a problem for humans: exposed to biased data, we tend to make biased judgments. The difference is that humans can handle less data, and we are less good at extracting reliable information from it. Trained with the same data, an algorithm will be about as biased as the average judge, less biased than the worst judge, and less noisy than any judge.
Also, algorithms are easier to correct than humans. If we discover that an algorithm is biased, and we can figure out how, we can often unbias it. If we could do that with humans, the world would be a better place.
For all of these reasons, I think algorithms like COMPAS have a place in the criminal justice system. But that brings us back to the question of calibration.
Fairness is Hard to Achieve#
Even if you think we should not use predictive algorithms in the criminal justice system, the reality is that we do. So at least for now we have a difficult question to answer:
Should we calibrate algorithms so predictive values are the same in all groups, and accept different error rates (as we see with black and white defendants)?
Or should we calibrate them so error rates are the same in all groups, and accept different predictive values (as we see with male and female defendants)?
Or should we compromise between the extremes, and accept different error rates and different predictive values?
If we choose either of the first two options, we run into two problems: the number of groups is large, and every defendant belongs to several of them.
Consider a defendant who is a 50-year old African-American woman. What is the false positive rate for her group? As we’ve already seen, FPR for black defendants is 45%. But for black women it’s 40%, for women older than 45 it’s 15%, and for black women older than 45 it’s 24%.
We have the same problem with the false negative rate. For example, FNR for white defendants is 48%, but for white women it is 43%, for women younger than 25, it’s 18% and for white women younger than 25, it’s just 4%!
Predictive values (PPV and NPV) don’t differ as much between groups, but if you search for the extremes, you can find substantial differences. Among the subgroups I looked at (excluding very small groups):
COMPAS has the highest positive predictive value for black men younger than 25, 70%. It has the lowest PPV for Hispanic defendants older than 45, 29%.
It has the highest negative predictive value for white women younger than 25, 95%, and the lowest NPV for men under 25 whose racial category is “Other”, 49%.
With six racial categories, three age groups, and two sexes, there are 36 subgroups. It is not possible to calibrate any algorithm to achieve the same error rates or the same predictive values in all of these groups.
So, suppose we use an algorithm that allows error rates and predictive values to vary between groups. How should we design it, and how should we evaluate it? Let me suggest a few principles to start with:
If one of the goals of incarceration is to reduce crime, it is better to keep in prison someone who will commit another crime, if released, than someone who will not. Of course we don’t know with certainty who will re-offend, but we can make probabilistic predictions.
The public interest is better served if our predictions are accurate, otherwise we will keep more people in prison than necessary, or suffer more crime than necessary, or both.
However, we should be willing to sacrifice some accuracy in the interest of justice. For example, suppose we find that, comparing male and female defendants who are alike in every other way, women are more likely to re-offend. In that case, including sex in the algorithm might improve its accuracy. Nevertheless, we might decide to exclude this information on the grounds that using it would violate the principle of equality before the law.
The criminal justice system should be fair, and it should be perceived to be fair. However, we have seen that there are conflicting definitions of fairness, and it is mathematically impossible to satisfy all of them.
Even if we agree that these principles should guide our decisions, they provide a framework for a discussion rather than a resolution.
For example, reasonable people could disagree about what factors should be included in the algorithm. I suggested that sex should be excluded even if it improves the accuracy of the predictions. For the same reason, we might choose to exclude race.
But what about age? If two defendants are similar except that one is 25 years old and the other is 50, the younger person is substantially more likely to re-offend. So an algorithm that includes age will be more accurate than one that does not. And on the face of it, releasing someone from prison because they are old does not seem obviously unjust. But a person does not choose their age any more than they choose their race or sex. So I’m not sure what principle justifies the decision to include age while excluding race and sex.
The point of this example is that these decisions are hard because they depend on values that are not universal.
Fortunately, we have tools for making decisions when people disagree, including public debate and representative democracy. But the key words in that sentence are “public” and “representative”. The algorithms we use in the criminal justice system should be a topic of public discussion, not a trade secret. And the debate should include everyone involved, including perpetrators and victims of crime.
All about the base rate#
Sometimes the base rate fallacy is funny. There’s a very old joke that goes something like this: “I read that 21% of car crashes are caused by drunk drivers. Do you know what that means? It means that 79% are caused by sober drivers. Those sober drivers aren’t safe – get them off the road!”
And sometimes the base rate fallacy is obvious, like the xkcd comic that says “Remember, right-handed people commit 90% of all base rate errors”.
But often it is more subtle. When someone says a medical test is accurate, they usually mean that it is sensitive and specific: that is, likely to be positive if the condition it detects is present, and likely to be negative if the condition is absent. And those are good properties for a test to have.
But they are not enough to tell us what we really want to know, which is whether a particular result is correct. For that, we need the base rate, and it often depends on the circumstances of the test.
For example, if you go to a doctor because you have symptoms of a particular disease and they test for the disease, that’s a diagnostic test. If the test is sensitive and specific, and the result is positive, it’s likely that you have the disease.
But if you go to the doctor for a regular checkup, you have no symptoms, and they test for a rare disease, that’s a screening test. In that case, if the result is positive, the probability that you have the disease might be small, even if the test is highly specific.
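To see how much the circumstances matter, we can compute the probability that a positive result is correct in the two situations. The sensitivity, specificity, and base rates here are hypothetical, chosen only to illustrate the contrast.

```python
def prob_positive_correct(sens, spec, prev):
    """Probability of having the condition, given a positive result."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

sens, spec = 0.90, 0.90  # hypothetical: a fairly accurate test

# diagnostic test: symptoms make the condition plausible to begin with
print(f"diagnostic (50% base rate): "
      f"{prob_positive_correct(sens, spec, 0.5):.1%}")

# screening test: a rare disease in a patient with no symptoms
print(f"screening (0.1% base rate): "
      f"{prob_positive_correct(sens, spec, 0.001):.1%}")
```

With the same test, a positive result means a 90% chance of disease in the diagnostic setting, but under 1% in the screening setting.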
And it’s important for you to know this, because there’s a good chance your doctor does not.