The Laws of Probability¶
import pandas as pd
import numpy as np
from utils import values
Introduction¶
This notebook takes a computational approach to understanding probability. We’ll use data from the General Social Survey to compute the probability of propositions like:
If I choose a random survey respondent, what is the probability they are female?
If I choose a random survey respondent, what is the probability they work in banking?
From there, we will explore two related concepts:
Conjunction, which is the probability that two propositions are both true; for example, what is the probability of choosing a female banker?
Conditional probability, which is the probability that one proposition is true, given that another is true; for example, given that a respondent is female, what is the probability that she is a banker?
I chose these examples because they are related to a famous experiment by Tversky and Kahneman, who posed the following question:
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?
Linda is a bank teller.
Linda is a bank teller and is active in the feminist movement.
Many people choose the second answer, presumably because it seems more consistent with the description. It seems unlikely that Linda would be just a bank teller; if she is a bank teller, it seems likely that she would also be a feminist.
But the second answer cannot be “more probable”, as the question asks. Suppose we find 1000 people who fit Linda’s description and 10 of them work as bank tellers. How many of them are also feminists? At most, all 10 of them are; in that case, the two options are equally likely. More likely, only some of them are; in that case the second option is less likely. But there can’t be more than 10 out of 10, so the second option cannot be more likely.
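The counting argument can be checked by brute force. The sketch below uses the hypothetical numbers from the paragraph above (1000 people, 10 bank tellers) and tries every possible number of feminist bank tellers; the variable names are only illustrative.

```python
# Hypothetical numbers from the text: 1000 people fit Linda's description,
# and 10 of them are bank tellers.
n_people = 1000
n_tellers = 10

p_teller = n_tellers / n_people

# The number of tellers who are also feminists can be anything from 0 to 10.
# In every case, the conjunction is no more probable than "bank teller" alone.
p_both_by_count = [k / n_people for k in range(n_tellers + 1)]
conjunction_never_larger = all(p <= p_teller for p in p_both_by_count)

print(p_teller)                  # 0.01
print(max(p_both_by_count))      # 0.01, reached only when all 10 are feminists
print(conjunction_never_larger)  # True
```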
The error people make if they choose the second option is called the conjunction fallacy. It’s called a fallacy because it’s a logical error and “conjunction” because “bank teller AND feminist” is a logical conjunction.
If this example makes you uncomfortable, you are in good company. The biologist Stephen J. Gould wrote:
I am particularly fond of this example because I know that the [second] statement is least probable, yet a little homunculus in my head continues to jump up and down, shouting at me, “but she can’t just be a bank teller; read the description.”
If the little person in your head is still unhappy, maybe this notebook will help.
Probability¶
At this point I should define probability, but that turns out to be surprisingly difficult. To avoid getting bogged down before we get started, I’ll start with a simple definition: a probability is a fraction of a dataset.
For example, if we survey 1000 people, and 20 of them are bank tellers, the fraction that work as bank tellers is 0.02 or 2%. If we choose a person from this population at random, the probability that they are a bank teller is 2%.
(By “at random” I mean that every person in the dataset has the same chance of being chosen, and by “they” I mean the singular, gender-neutral pronoun, which is a correct and useful feature of English.)
With this definition and an appropriate dataset, we can compute probabilities by counting.
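As a minimal sketch of this definition, the following cell builds a made-up Boolean series that matches the example above (1000 people, 20 bank tellers), computes the probability by counting, and then simulates choosing people at random. The series, the number of draws, and the random seed are assumptions for illustration, not part of the GSS data.

```python
import pandas as pd

# Made-up data matching the example in the text: 1000 people, 20 bank tellers.
teller = pd.Series([True] * 20 + [False] * 980)

# Probability by counting: the fraction of the dataset that is True.
p_teller = teller.sum() / len(teller)
print(p_teller)  # 0.02

# Choosing "at random" means every row is equally likely to be drawn.
# Over many draws (with replacement), the share of tellers should be close to 2%.
draws = teller.sample(n=100_000, replace=True, random_state=1)
p_simulated = draws.mean()
print(p_simulated)
```

The simulated fraction will not be exactly 0.02, but with this many draws it should land close to it.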
To demonstrate, I’ll use a data set from the General Social Survey or GSS. The following cell reads the data.
gss = pd.read_csv('gss_bayes.csv', index_col=0)
The result is a Pandas DataFrame with one row for each person surveyed and one column for each variable I selected.
Here are the number of rows and columns:
gss.shape
(49290, 6)
And here are the first few rows:
gss.head()
caseid | year | age | sex | polviews | partyid | indus10 |
---|---|---|---|---|---|---|
1 | 1974 | 21.0 | 1 | 4.0 | 2.0 | 4970.0 |
2 | 1974 | 41.0 | 1 | 5.0 | 0.0 | 9160.0 |
5 | 1974 | 58.0 | 2 | 6.0 | 1.0 | 2670.0 |
6 | 1974 | 30.0 | 1 | 5.0 | 4.0 | 6870.0 |
7 | 1974 | 48.0 | 1 | 5.0 | 4.0 | 7860.0 |
The columns are:
`caseid`: Respondent id (which is the index of the table).
`year`: Year when the respondent was surveyed.
`age`: Respondent’s age when surveyed.
`sex`: Male or female.
`polviews`: Political views on a range from liberal to conservative.
`partyid`: Political party affiliation, Democrat, Independent, or Republican.
`indus10`: Code for the industry the respondent works in.
Let’s look at these variables in more detail, starting with `indus10`.
Banking¶
The code for “Banking and related activities” is 6870, so we can select bankers like this:
banker = (gss['indus10'] == 6870)
The result is a Boolean series, which is a Pandas Series that contains the values `True` and `False`. Here are the first few entries:
banker.head()
caseid
1 False
2 False
5 False
6 True
7 False
Name: indus10, dtype: bool
We can use `values` to see how many times each value appears.
values(banker)
values | counts |
---|---|
False | 48562 |
True | 728 |
In this dataset, there are 728 bankers.
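The source of `values` (imported from `utils`) is not shown in this notebook. As a rough stand-in, assuming it simply tabulates each distinct value with its count in sorted order, the hypothetical `values_sketch` below produces a similar table. It is a guess at the behavior, not the author's implementation.

```python
import pandas as pd

def values_sketch(series):
    """Hypothetical stand-in for utils.values: counts of each value, sorted by value."""
    counts = series.value_counts().sort_index()
    return counts.rename_axis('values').to_frame('counts')

# Made-up Boolean series: 3 True values and 5 False values.
example = pd.Series([True, False, False, True, False, False, True, False])
table_sketch = values_sketch(example)
print(table_sketch)
```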
If we use the `sum` function on this Series, it treats `True` as 1 and `False` as 0, so the total is the number of bankers.
banker.sum()
728
To compute the fraction of bankers, we can divide by the number of people in the dataset:
banker.sum() / banker.size
0.014769730168391155
But we can also use the `mean` function, which computes the fraction of `True` values in the Series:
banker.mean()
0.014769730168391155
About 1.5% of the respondents work in banking.
That means if we choose a random person from the dataset, the probability they are a banker is about 1.5%.
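To see why `sum` and `mean` give counts and fractions, here is a small sketch on a made-up Boolean series (not GSS data) where the answers are easy to check by hand.

```python
import pandas as pd

# Made-up Boolean series: 2 True values out of 8.
flags = pd.Series([False, True, False, False, True, False, False, False])

count_true = flags.sum()  # True counts as 1, False as 0
fraction_by_division = count_true / flags.size
fraction_by_mean = flags.mean()

print(count_true)            # 2
print(fraction_by_division)  # 0.25
print(fraction_by_mean)      # 0.25
```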
Exercise: The values of the column `sex` are encoded like this:
1 Male
2 Female
The following cell creates a Boolean series that is `True` for female respondents and `False` otherwise.
female = (gss['sex'] == 2)
Use `values` to display the number of `True` and `False` values in `female`.
Use `sum` to count the number of female respondents.
Use `mean` to compute the fraction of female respondents.
The fraction of women in this dataset is higher than in the adult U.S. population because the GSS does not include people living in institutions, including prisons and military housing, and those populations are more likely to be male.
Exercise: The designers of the General Social Survey chose to represent sex as a binary variable. What alternatives might they have considered? What are the advantages and disadvantages of their choice?
For more on this topic, you might be interested in this article: Westbrook and Saperstein, New categories are not enough: rethinking the measurement of sex and gender in social surveys
Political views¶
The values of `polviews` are on a seven-point scale:
1 Extremely liberal
2 Liberal
3 Slightly liberal
4 Moderate
5 Slightly conservative
6 Conservative
7 Extremely conservative
Here is the number of people who gave each response:
values(gss['polviews'])
values | counts |
---|---|
1.0 | 1442 |
2.0 | 5808 |
3.0 | 6243 |
4.0 | 18943 |
5.0 | 7940 |
6.0 | 7319 |
7.0 | 1595 |
I’ll define `liberal` to be `True` for anyone whose response is “Extremely liberal”, “Liberal”, or “Slightly liberal”.
liberal = (gss['polviews'] < 4)
Here are the numbers of `True` and `False` values:
values(liberal)
values | counts |
---|---|
False | 35797 |
True | 13493 |
And here is the fraction of respondents who are “liberal”:
liberal.mean()
0.27374721038750255
If we choose a random person in this dataset, the probability they are liberal is about 27%.
The probability function¶
To summarize what we have done so far:
To represent a logical proposition like “this respondent is liberal”, we are using a Boolean series, which contains the values `True` and `False`.
To compute the probability that a proposition is true, we are using the `mean` function, which computes the fraction of `True` values in a series.
To make this computation more explicit, I’ll define a function that takes a Boolean series and returns a probability:
def prob(A):
    """Computes the probability of a proposition, A.

    A: Boolean series

    returns: probability
    """
    assert isinstance(A, pd.Series)
    assert A.dtype == 'bool'
    return A.mean()
The `assert` statements check whether `A` is a Boolean series. If it is not, they raise an `AssertionError`.
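As a quick sketch of what the checks do, the cell below repeats the definition of `prob` and passes it a numeric series. The `ages` series is made up for illustration.

```python
import pandas as pd

def prob(A):
    """Same definition as above: the fraction of True values in a Boolean series."""
    assert isinstance(A, pd.Series)
    assert A.dtype == 'bool'
    return A.mean()

# A numeric column is a Series, but not a Boolean one, so the second assert fails.
ages = pd.Series([21.0, 41.0, 58.0])
try:
    prob(ages)
    rejected = False
except AssertionError:
    rejected = True

print(rejected)         # True
print(prob(ages > 40))  # comparing first yields a Boolean series, which is accepted
```

Without the checks, `prob(ages)` would quietly return the average age, which is not a probability.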
Using this function to compute probabilities makes the code more readable. Here are the probabilities for the propositions we have computed so far.
prob(banker)
0.014769730168391155
prob(female)
0.5378575776019476
prob(liberal)
0.27374721038750255
Exercise: The values of `partyid` are encoded like this:
0 Strong democrat
1 Not str democrat
2 Ind,near dem
3 Independent
4 Ind,near rep
5 Not str republican
6 Strong republican
7 Other party
I’ll define `democrat` to include respondents who chose “Strong democrat” or “Not str democrat”:
democrat = (gss['partyid'] <= 1)
Use `mean` to compute the fraction of Democrats in this dataset.
Use `prob` to compute the same fraction, which we will think of as a probability.
Conjunction¶
Now that we have a definition of probability and a function that computes it, let’s move on to conjunction.
“Conjunction” is another name for the logical `and` operation. If you have two propositions, `A` and `B`, the conjunction `A and B` is `True` if both `A` and `B` are `True`, and `False` otherwise.
I’ll demonstrate using two Boolean series constructed to enumerate every combination of `True` and `False`:
A = pd.Series((True, True, False, False))
A
0 True
1 True
2 False
3 False
dtype: bool
B = pd.Series((True, False, True, False))
B
0 True
1 False
2 True
3 False
dtype: bool
To compute the conjunction of `A` and `B`, we can use the `&` operator, like this:
A & B
0 True
1 False
2 False
3 False
dtype: bool
The result is `True` only when both `A` and `B` are `True`.
To show this operation more clearly, I’ll put the operands and the result in a DataFrame:
table = pd.DataFrame()
table['A'] = A
table['B'] = B
table['A & B'] = A & B
table
| | A | B | A & B |
---|---|---|---|
0 | True | True | True |
1 | True | False | False |
2 | False | True | False |
3 | False | False | False |
This way of representing a logical operation is called a truth table.
In a previous section, we computed the probability that a random respondent is a banker:
prob(banker)
0.014769730168391155
And the probability that a respondent is a Democrat:
prob(democrat)
0.3662609048488537
Now we can compute the probability that a random respondent is a banker and a Democrat:
prob(banker & democrat)
0.004686548995739501
As we should expect, `prob(banker & democrat)` is less than `prob(banker)`, because not all bankers are Democrats.
Exercise: Use `prob` and the `&` operator to compute the following probabilities.
What is the probability that a random respondent is a banker and liberal?
What is the probability that a random respondent is female, a banker, and liberal?
What is the probability that a random respondent is female, a banker, and a liberal Democrat?
Notice that as we add more conjunctions, the probabilities get smaller (or, at best, stay the same); they can never get larger.
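Adding a condition to a conjunction can only remove rows, never add them, so the probability cannot increase. Here is a small sketch on made-up Boolean series (six hypothetical respondents, not GSS data) where the counts can be checked by hand.

```python
import pandas as pd

# Made-up Boolean series for six hypothetical respondents.
a = pd.Series([True, True, True, False, True, False])
b = pd.Series([True, False, True, True, True, False])
c = pd.Series([True, True, False, True, False, False])

p_a = a.mean()              # 4 of 6 rows are True
p_ab = (a & b).mean()       # 3 of 6 rows are True in both a and b
p_abc = (a & b & c).mean()  # 1 of 6 rows is True in all three

print(p_a >= p_ab >= p_abc)  # True
```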
Exercise: We expect conjunction to be commutative; that is, `A & B` should be the same as `B & A`.
To check, compute these two probabilities:
What is the probability that a random respondent is a banker and liberal?
What is the probability that a random respondent is liberal and a banker?
prob(banker & liberal)
0.003306958815175492
prob(liberal & banker)
0.003306958815175492
If they are not the same, something has gone very wrong!
Conditional probability¶
Conditional probability is a probability that depends on a condition, but that might not be the most helpful definition. Here are some examples:
What is the probability that a respondent is a Democrat, given that they are liberal?
What is the probability that a respondent is female, given that they are a banker?
What is the probability that a respondent is liberal, given that they are female?
Let’s start with the first one, which we can interpret like this: “Of all the respondents who are liberal, what fraction are Democrats?”
We can compute this probability in two steps:
Select all respondents who are liberal.
Compute the fraction of the selected respondents who are Democrats.
To select liberal respondents, we can use the bracket operator, `[]`, like this:
selected = democrat[liberal]
The result is a Boolean series that contains a subset of the values in `democrat`. Specifically, it contains only the values where `liberal` is `True`.
To confirm that, let’s check the length of the result:
len(selected)
13493
If things have gone according to plan, that should be the same as the number of `True` values in `liberal`:
liberal.sum()
13493
Good.
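To make the selection step concrete, here is a sketch with two made-up series for five hypothetical respondents. The `_toy` names are only for illustration and are not part of the dataset.

```python
import pandas as pd

# Made-up data for five hypothetical respondents, indexed by case id.
index = [1, 2, 5, 6, 7]
liberal_toy = pd.Series([True, False, True, True, False], index=index)
democrat_toy = pd.Series([True, True, False, True, False], index=index)

# Keep the values of democrat_toy only where liberal_toy is True.
selected_toy = democrat_toy[liberal_toy]

print(list(selected_toy.index))                # [1, 5, 6]
print(list(selected_toy))                      # [True, False, True]
print(len(selected_toy) == liberal_toy.sum())  # True
```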
`selected` contains the values of `democrat` for liberal respondents, so the mean of `selected` is the fraction of liberals who are Democrats:
selected.mean()
0.5206403320240125
A little more than half of liberals are Democrats. If the result is lower than you expected, keep in mind:
We used a somewhat strict definition of “Democrat”, excluding Independents who “lean” democratic.
The dataset includes respondents as far back as 1974; in the early part of this interval, there was less alignment between political views and party affiliation, compared to the present.
Let’s try the second example, “What is the probability that a respondent is female, given that they are a banker?”
We can interpret that to mean, “Of all respondents who are bankers, what fraction are female?”
Again, we’ll use the bracket operator to select only the bankers:
selected = female[banker]
len(selected)
728
As we’ve seen, there are 728 bankers in the dataset.
Now we can use `mean` to compute the conditional probability that a respondent is female, given that they are a banker:
selected.mean()
0.7706043956043956
About 77% of the bankers in this dataset are female.
We can get the same result using `prob`:
prob(selected)
0.7706043956043956
Remember that we defined `prob` to make the code easier to read. We can do the same thing with conditional probability.
I’ll define `conditional` to take two Boolean series, `A` and `B`, and compute the conditional probability of `A` given `B`:
def conditional(A, B):
    """Conditional probability of A given B.

    A: Boolean series
    B: Boolean series

    returns: probability
    """
    return prob(A[B])
Now we can use it to compute the probability that a liberal is a Democrat:
conditional(democrat, liberal)
0.5206403320240125
And the probability that a banker is female:
conditional(female, banker)
0.7706043956043956
The results are the same as what we computed above.
Exercise: Use `conditional` to compute the probability that a respondent is liberal given that they are female.
Hint: The answer should be less than 30%. If your answer is about 54%, you have made a mistake (see the next exercise).
Exercise: In a previous exercise, we saw that conjunction is commutative; that is, `prob(A & B)` is always equal to `prob(B & A)`.
But conditional probability is NOT commutative; that is, `conditional(A, B)` is not, in general, the same as `conditional(B, A)`.
That should be clear if we look at an example. Previously, we computed the probability that a respondent is female, given that they are a banker.
conditional(female, banker)
0.7706043956043956
The result shows that the majority of bankers are female. That is not the same as the probability that a respondent is a banker, given that they are female:
conditional(banker, female)
0.02116102749801969
Only about 2% of female respondents are bankers.
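The asymmetry comes from the denominator: the second argument decides which rows are selected before the fraction is computed. The following sketch shows this on made-up data (10 hypothetical respondents), using simplified versions of `prob` and `conditional` without the checks.

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean series (simplified, no asserts)."""
    return A.mean()

def conditional(A, B):
    """Probability of A among the rows where B is True."""
    return prob(A[B])

# Made-up data: 10 hypothetical respondents, 6 female, 2 bankers (both female).
female_toy = pd.Series([True] * 6 + [False] * 4)
banker_toy = pd.Series([True, True] + [False] * 8)

p_female_given_banker = conditional(female_toy, banker_toy)  # 2 of 2 bankers
p_banker_given_female = conditional(banker_toy, female_toy)  # 2 of 6 women

print(p_female_given_banker)  # 1.0
print(p_banker_given_female)
```

Both calls count the same two female bankers; they differ only in whether that count is divided by the number of bankers or the number of women.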
Exercise: Use `conditional` to compute the following probabilities:
What is the probability that a respondent is liberal, given that they are a Democrat?
What is the probability that a respondent is a Democrat, given that they are liberal?
Think carefully about the order of the series you pass to `conditional`.
conditional(liberal, democrat)
0.3891320002215698
conditional(democrat, liberal)
0.5206403320240125
Conditions and conjunctions¶
We can combine conditional probability and conjunction. For example, here’s the probability a respondent is female, given that they are a liberal Democrat.
conditional(female, liberal & democrat)
0.576085409252669
About 58% of liberal Democrats are female.
And here’s the probability they are a liberal female, given that they are a banker:
conditional(liberal & female, banker)
0.17307692307692307
About 17% of bankers are liberal women.
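Where the conjunction goes changes the question being asked. As a sketch on made-up data (eight hypothetical respondents, with simplified versions of `prob` and `conditional`), compare a conjunction in the condition with a conjunction in the proposition.

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean series (simplified, no asserts)."""
    return A.mean()

def conditional(A, B):
    """Probability of A among the rows where B is True."""
    return prob(A[B])

# Made-up data for eight hypothetical respondents.
female_toy = pd.Series([True, True, True, True, False, False, False, False])
liberal_toy = pd.Series([True, True, False, False, True, True, False, False])
banker_toy = pd.Series([True, False, True, False, True, False, True, False])

# Conjunction as the condition: of the 2 liberal bankers, 1 is female.
p_female_given_liberal_banker = conditional(female_toy, liberal_toy & banker_toy)

# Conjunction as the proposition: of the 4 bankers, 1 is a liberal woman.
p_liberal_female_given_banker = conditional(liberal_toy & female_toy, banker_toy)

print(p_female_given_liberal_banker)  # 0.5
print(p_liberal_female_given_banker)  # 0.25
```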
Exercise: What fraction of female bankers are liberal Democrats?
Hint: If your answer is less than 1%, you have it backwards. Remember that conditional probability is not commutative.
Summary¶
At this point, you should understand the definition of probability, at least in the simple case where we have a finite dataset. Later we will consider cases where the definition of probability is more controversial.
And you should understand conjunction and conditional probability. In the next notebook, we will explore the relationship between conjunction and conditional probability, and use it to derive Bayes’s Theorem, which is the foundation of Bayesian statistics.