# Distributions

## Contents

# Distributions#

Click here to run this notebook on Colab or click here to download it.

In this chapter we’ll see three ways to describe a set of values:

A probability mass function (PMF), which represents a set of values and the number of times each one appears in a dataset.

A cumulative distribution function (CDF), which contains the same information as a PMF in a form that makes it easier to visualize, make comparisons, and perform some computations.

A kernel density estimate (KDE), which is like a smooth, continuous version of a histogram.

For examples, we’ll use data from the General Social Survey (GSS) to look at distributions of age and income, and to explore the relationship between income and education.

But we’ll start with one of the most important ideas in statistics, the distribution.

## Distributions#

A distribution is a set of values and their corresponding probabilities. For example, if you roll a six-sided die, there are six possible outcomes, the numbers `1`

through `6`

, and they all have the same probability, `1/6`

.

We can represent this distribution of outcomes with a table, like this:

Value |
Probability |
---|---|

1 |
1/6 |

2 |
1/6 |

3 |
1/6 |

4 |
1/6 |

5 |
1/6 |

6 |
1/6 |

More generally, a distributions can have any number of values, the values can be any type, and the probabilities do not have to be equal.

To represent distributions in Python, we will use a library called `empiricaldist`

, for “empirical distribution”, where “empirical” means it is based on data rather than a mathematical formula.

`empiricaldist`

provides an object called `Pmf`

, which stands for “probability mass function”.
A `Pmf`

object contains a set of possible outcomes and their probabilities.

For example, here’s a `Pmf`

that represents the outcome of rolling a six-sided die:

```
from empiricaldist import Pmf
outcomes = [1,2,3,4,5,6]
die = Pmf(1/6, outcomes)
```

The first argument is the probability of each outcome; the second argument is the list of outcomes. We can display the result like this.

```
die
```

probs | |
---|---|

1 | 0.166667 |

2 | 0.166667 |

3 | 0.166667 |

4 | 0.166667 |

5 | 0.166667 |

6 | 0.166667 |

A `Pmf`

object is a specialized version of a Pandas `Series`

, so it provides all of the attributes and methods of a `Series`

, plus some additional methods we’ll see soon.

## Distribution of Education#

To get started with this dataset, let’s look at the distribution of `EDUC`

, which records the number of years of education for each respondent. First we’ll select a column from the `DataFrame`

and use `value_counts`

to see what values are in it.

```
gss['EDUC'].value_counts().sort_index()
```

```
0 165
1 47
2 152
3 257
4 319
5 402
6 828
7 879
8 2724
9 2083
10 2880
11 3743
12 19663
13 5360
14 7160
15 2910
16 8355
17 1967
18 2384
19 920
20 1439
98 73
99 104
Name: EDUC, dtype: int64
```

The result from `value_counts`

is a set of possible values and the number of times each one appears, so it is a kind of distribution.

The values `98`

and `99`

are special codes for “Don’t know” and “No answer”.
We’ll use `replace`

to replace these codes with `NaN`

.

```
import numpy as np
educ = gss['EDUC'].replace([98, 99], np.nan)
```

We’ve already seen one way to visualize a distribution, a histogram. Here’s the histogram of education level.

```
import matplotlib.pyplot as plt
educ.hist(grid=False)
plt.xlabel('Years of education')
plt.ylabel('Number of respondents')
plt.title('Histogram of education level');
```

Based on the histogram, we can see the general shape of the distribution and the central tendency – it looks like the peak is near 12 years of education. But a histogram is not the best way to visualize this distribution because it obscures some important details.

An alternative is to use a `Pmf`

.
The function `Pmf.from_seq`

takes any kind of sequence – like a list, tuple, or Pandas `Series`

– and computes the distribution of the values in the sequence.

```
pmf_educ = Pmf.from_seq(educ, normalize=False)
type(pmf_educ)
```

```
empiricaldist.empiricaldist.Pmf
```

The keyword argument `normalize=False`

indicates that we don’t want to normalize this PMF. I’ll explain what that means soon.

Here’s what the first few rows look like.

```
pmf_educ.head()
```

probs | |
---|---|

0.0 | 165 |

1.0 | 47 |

2.0 | 152 |

In this dataset, there are `165`

respondents who report that they have had no formal education, and `47`

who have only one year.
Here the last few rows.

```
pmf_educ.tail()
```

probs | |
---|---|

18.0 | 2384 |

19.0 | 920 |

20.0 | 1439 |

There are `1439`

respondents who report that they have 20 or more years of formal education, which probably means they attended college and graduate school.

You can use the bracket operator to look up a value in a `Pmf`

and get the corresponding count:

```
pmf_educ[20]
```

```
1439
```

Usually when we make a PMF, we want to know the *fraction* of respondents with each value, rather than the counts. We can do that by setting `normalize=True`

; then we get a **normalized** PMF, that is, a PMF where the values in the second column add up to 1.

```
pmf_educ_norm = Pmf.from_seq(educ, normalize=True)
pmf_educ_norm.head()
```

probs | |
---|---|

0.0 | 0.002553 |

1.0 | 0.000727 |

2.0 | 0.002352 |

Now if we use the bracket operator, the result is a fraction. For example, the fraction of people with 12 years of education is about 30%:

```
pmf_educ_norm[12]
```

```
0.30420656899299164
```

`Pmf`

provides a `bar`

method that plots the values and their probabilities as a bar chart.

```
pmf_educ_norm.bar(label='EDUC')
plt.xlabel('Years of education')
plt.xticks(range(0, 21, 4))
plt.ylabel('PMF')
plt.title('Distribution of years of education')
plt.legend();
```

In this figure, we can see that the most common value is 12 years, but there are also peaks at 14 and 16, which correspond to two and four years of college.

For this data, the PMF is probably a better choice than the histogram. The PMF shows all unique values, so we can see where the peaks are. Because the histogram puts values into bins, it obscures these details. With this dataset, and the default number of bins, we couldn’t see the peaks at 14 and 16 years. But PMFs have limitations, too, as we’ll see.

First, here’s an exercise where you can practice with PMFs.

**Exercise:** Let’s look at the `YEAR`

column in the `DataFrame`

, which represents the year each respondent was interviewed.

Make an unnormalized `Pmf`

for `YEAR`

and display the result.
How many respondents were interviewed in 2018?

## Cumulative Distribution Functions#

Now we’ll see another way to represent a distribution, the cumulative distribution function (CDF).
`empiricaldist`

provides a `Cdf`

object that represents a CDF.
We can import it like this:

```
from empiricaldist import Cdf
```

As an example, suppose we have a sequence of five values:

```
values = 1, 2, 2, 3, 5
```

Here’s the `Pmf`

of these values.

```
Pmf.from_seq(values)
```

probs | |
---|---|

1 | 0.2 |

2 | 0.4 |

3 | 0.2 |

5 | 0.2 |

If you draw a random value from `values`

, the `Pmf`

tells you the chance of getting `x`

, for any value of `x`

.
So the probability of the value `1`

is `1/5`

; the probability of the value `2`

is `2/5`

; and the probabilities for `3`

and `5`

are `1/5`

each.

A CDF is similar to a PMF in the sense that it contains values and their probabilities; the difference is that the probabilities in the CDF are the cumulative sum of the probabilities in the PMF.

Here’s a `Cdf`

object for the same five values.

```
Cdf.from_seq(values)
```

probs | |
---|---|

1 | 0.2 |

2 | 0.6 |

3 | 0.8 |

5 | 1.0 |

If you draw a random value from `values`

, `Cdf`

tells you the chance of getting a value *less than or equal to* `x`

, for any given `x`

.

So the `Cdf`

of `1`

is `1/5`

because one of the five values in the sequence is less than or equal to 1.

The `Cdf`

of 2 is `3/5`

because three of the five values are less than or equal to 2.

And the `Cdf`

of 5 is `5/5`

because all of the values are less than or equal to 5.

## CDF of Age#

Now let’s look at a more substantial `Cdf`

, the distribution of ages for respondents in the General Social Survey.

The variable we’ll use is `'AGE'`

.
According to the codebook, the range of the values is from `18`

to `89`

, where `89`

means “89 or older”.
The special codes `98`

and `99`

mean “Don’t know” and “Didn’t answer”.
See https://gssdataexplorer.norc.org/variables/53/vshow.

We can use `replace`

to replace the special codes with `NaN`

.

```
age = gss['AGE'].replace([98, 99], np.nan)
```

We can compute the `Cdf`

of these values like this:

```
cdf_age = Cdf.from_seq(age)
```

`Cdf`

provides a method called `plot`

that plots the CDF as a line. Here’s what it looks like.

```
cdf_age.plot()
plt.xlabel('Age (years)')
plt.ylabel('CDF')
plt.title('Distribution of age');
```

The \(x\)-axis is the ages, from 18 to 89. The \(y\)-axis is the cumulative probabilities, from 0 to 1.

`cdf_age`

can be used as a function, so if you give it an age, it returns the corresponding probability (in a NumPy array).

```
q = 51
p = cdf_age(q)
p
```

```
array(0.63318676)
```

`q`

stands for “quantity”, which is what we are looking up. `p`

stands for probability, which is the result.
In this example, the quantity is age 51, and the corresponding probability is about `0.63`

.
That means that about 63% of the respondents are 51 years old or younger.

The arrow in the following figure shows how you could read this value from the CDF, at least approximately.

```
cdf_age.plot()
x = 17
draw_line(p, q, x)
draw_arrow_left(p, q, x)
plt.xlabel('Age (years)')
plt.xlim(x-1, 91)
plt.ylabel('CDF')
plt.title('Distribution of age');
```

The CDF is an invertible function, which means that if you have a probability, `p`

, you can look up the corresponding quantity, `q`

.
`Cdf`

provides a method called `inverse`

that computes the inverse of the cumulative distribution function.

```
p1 = 0.25
q1 = cdf_age.inverse(p1)
q1
```

```
array(31.)
```

In this example, we look up the probability `0.25`

and the result is `31`

.

That means that 25% of the respondents are age 31 or less. Another way to say the same thing is “age 31 is the 25th percentile of this distribution”.

If we look up probability `0.75`

, it returns `59`

, so 75% of the respondents are 59 or younger.

```
p2 = 0.75
q2 = cdf_age.inverse(p2)
q2
```

```
array(59.)
```

In the following figure, the arrows show how you could read these values from the CDF.

```
cdf_age.plot()
x = 17
draw_line(p1, q1, x)
draw_arrow_down(p1, q1, 0)
draw_line(p2, q2, x)
draw_arrow_down(p2, q2, 0)
plt.xlabel('Age (years)')
plt.xlim(x-1, 91)
plt.ylabel('CDF')
plt.title('Distribution of age');
```

The distance from the 25th to the 75th percentile is called the **interquartile range**, or IQR. It measures the spread of the distribution, so it is similar to standard deviation or variance.

Because it is based on percentiles, it doesn’t get thrown off by extreme values or outliers, the way standard deviation does. So IQR is more **robust** than variance, which means it works well even if there are errors in the data or extreme values.

**Exercise:** Using `cdf_age`

, compute the fraction of the respondents in the GSS dataset that are *older* than 65.

**Exercise:** The distribution of income in almost every country is long-tailed, which means there are a small number of people with very high incomes.
In the GSS dataset, the column `REALINC`

represents total household income, converted to 1986 dollars. We can get a sense of the shape of this distribution by plotting the CDF.

Select `REALINC`

from the `gss`

dataset, make a `Cdf`

called `cdf_income`

, and plot it. Remember to label the axes!

## Comparing Distributions#

So far we’ve seen two ways to represent distributions, PMFs and CDFs. Now we’ll use PMFs and CDFs to compare distributions, and we’ll see the pros and cons of each.

One way to compare distributions is to plot multiple PMFs on the same axes. For example, suppose we want to compare the distribution of age for male and female respondents.

First we’ll create a Boolean Series that’s true for male respondents.

```
male = (gss['SEX'] == 1)
```

And another that’s true for female respondents.

```
female = (gss['SEX'] == 2)
```

Now we can select ages for the male and female respondents.

```
male_age = age[male]
female_age = age[female]
```

And plot a PMF for each.

```
pmf_male_age = Pmf.from_seq(male_age)
pmf_male_age.plot(label='Male')
pmf_female_age = Pmf.from_seq(female_age)
pmf_female_age.plot(label='Female')
plt.xlabel('Age (years)')
plt.ylabel('PMF')
plt.title('Distribution of age by sex')
plt.legend();
```

The plot is pretty noisy. In the range from 40 to 50, it looks like the PMF is higher for men. And from 70 to 80, it is higher for women. But both of those differences might be due to random variation.

Now let’s do the same thing with CDFs; everything is the same except we replace `Pmf`

with `Cdf`

.

```
cdf_male_age = Cdf.from_seq(male_age)
cdf_male_age.plot(label='Male')
cdf_female_age = Cdf.from_seq(female_age)
cdf_female_age.plot(label='Female')
plt.xlabel('Age (years)')
plt.ylabel('CDF')
plt.title('Distribution of age by sex')
plt.legend();
```

In general, CDFs are smoother than PMFs. Because they smooth out randomness, we can often get a better view of real differences between distributions. In this case, the lines are close together until age 40; after that, the CDF is higher for men than women. So what does that mean?

One way to interpret the difference is that the fraction of men below a given age is generally more than the fraction of women below the same age. For example, about 79% of men are 60 or less, compared to 76% of women.

```
cdf_male_age(60), cdf_female_age(60)
```

```
(array(0.78599958), array(0.75529908))
```

Going the other way, we could also compare percentiles. For example, the median age woman is older than the median age man, by about one year.

```
cdf_male_age.inverse(0.5), cdf_female_age.inverse(0.5)
```

```
(array(43.), array(44.))
```

**Exercise:** What fraction of men are over 80? What fraction of women?

## Comparing Incomes#

As another example, let’s look at household income and compare the distribution before and after 1995 (I chose 1995 because it’s roughly the midpoint of the survey).
The variable `REALINC`

represents household income in 1986 dollars.

We’ll make a Boolean `Series`

to select respondents interviewed before and after 1995.

```
pre95 = (gss['YEAR'] < 1995)
post95 = (gss['YEAR'] >= 1995)
```

Now we can plot the PMFs.

```
income = gss['REALINC'].replace(0, np.nan)
Pmf.from_seq(income[pre95]).plot(label='Before 1995')
Pmf.from_seq(income[post95]).plot(label='After 1995')
plt.xlabel('Income (1986 USD)')
plt.ylabel('PMF')
plt.title('Distribution of income')
plt.legend();
```

There are a lot of unique values in this distribution, and none of them appear very often. As a result, the PMF is so noisy and we can’t really see the shape of the distribution.

It’s also hard to compare the distributions. It looks like there are more people with high incomes after 1995, but it’s hard to tell. We can get a clearer picture with a CDF.

```
Cdf.from_seq(income[pre95]).plot(label='Before 1995')
Cdf.from_seq(income[post95]).plot(label='After 1995')
plt.xlabel('Income (1986 USD)')
plt.ylabel('CDF')
plt.title('Distribution of income')
plt.legend();
```

Below $30,000 the CDFs are almost identical; above that, we can see that the post-1995 distribution is shifted to the right. In other words, the fraction of people with high incomes is about the same, but the income of high earners has increased.

In general, I recommend CDFs for exploratory analysis. They give you a clear view of the distribution, without too much noise, and they are good for comparing distributions, especially if you have more than two.

**Exercise:** In the previous figure, the dollar amounts are big enough that the labels on the `x`

axis are crowded. Improve the figure by expressing income in 1000s of dollars (and update the `x`

label accordingly).

**Exercise:** Let’s compare incomes for different levels of education in the GSS dataset

To do that we’ll create Boolean `Series`

to identify respondents with different levels of education.

In the U.S, 12 years of education usually means the respondent has completed high school (secondary education).

A respondent with 14 years of education has probably completed an associate degree (two years of college)

Someone with 16 years has probably completed a bachelor’s degree (four years of college or university).

Define Boolean `Series`

named `high`

, `assc`

, and `bach`

that are true for respondents with

12 or fewer years of education,

13, 14, or 15 years, and

16 or more.

Compute and plot the distribution of income for each group. Remember to label the CDFs, display a legend, and label the axes. Write a few sentences that describe and interpret the results.

## Modeling Distributions#

Some distributions have names. For example, you might be familiar with the normal distribution, also called the Gaussian distribution or the bell curve. And you might have heard of others like the exponential distribution, binomial distribution, or maybe Poisson distribution.

These “distributions with names” are called **analytic** because they are described by analytic mathematical functions, as contrasted with empirical distributions, which are based on data.

It turns out that many things we measure in the world have distributions that are well approximated by analytic distributions, so these distributions are sometimes good models for the real world.

In this context, what I mean by a “model” is a simplified description of the world that is accurate enough for its intended purpose.

In this section, we’ll compute the CDF of a normal distribution and compare it to an empirical distribution of data. But before we get to real data, we’ll start with fake data.

The following statement uses NumPy’s `random`

library to generate 1000 values from a normal distribution with mean `0`

and standard deviation `1`

.

```
np.random.seed(17)
```

```
sample = np.random.normal(size=1000)
```

Here’s what the empirical distribution of the sample looks like.

```
cdf_sample = Cdf.from_seq(sample)
cdf_sample.plot(label='Random sample')
plt.xlabel('x')
plt.ylabel('CDF')
plt.legend();
```

If we did not know that this sample was drawn from a normal distribution, and we wanted to check, we could compare the CDF of the data to the CDF of an ideal normal distribution, which we can use the SciPy library to compute.

```
from scipy.stats import norm
xs = np.linspace(-3, 3)
ys = norm(0, 1).cdf(xs)
```

First we import `norm`

from `scipy.stats`

, which is a collection of functions related to statistics.

Then we use `linspace()`

to create an array of equally-spaced points from -3 to 3; those are the `x`

values where we will evaluate the normal CDF.

Next, `norm(0, 1)`

creates an object that represents a normal distribution with mean `0`

and standard deviation `1`

.

Finally, `cdf`

computes the CDF of the normal distribution, evaluated at each of the `xs`

.

I’ll plot the normal CDF with a gray line and then plot the CDF of the data again.

```
plt.plot(xs, ys, color='gray', label='Normal CDF')
cdf_sample.plot(label='Random sample')
plt.xlabel('x')
plt.ylabel('CDF')
plt.legend();
```

The CDF of the random sample agrees with the normal model. And that’s not surprising because the data were actually sampled from a normal distribution. When we collect data in the real world, we do not expect it to fit a normal distribution as well as this. In the next exercise, we’ll try it and see.

**Exercise:** Is the normal distribution a good model for the distribution of ages in the U.S. population?

To answer this question:

Compute the mean and standard deviation of ages in the GSS dataset.

Use

`linspace`

to create an array of equally spaced values between 18 and 89.Use

`norm`

to create a normal distribution with the same mean and standard deviation as the data, then use it to compute the normal CDF for each value in the array.Plot the normal CDF with a gray line.

Plot the CDF of the ages in the GSS.

How well do the plotted CDFs agree?

**Exercise:** In many datasets, the distribution of income is approximately **lognormal**, which means that the logarithms of the incomes fit a normal distribution. Let’s see whether that’s true for the GSS data.

Extract

`REALINC`

from`gss`

and compute its logarithm using`np.log10()`

. Hint: Replace the value`0`

with`NaN`

before computing logarithms.Compute the mean and standard deviation of the log-transformed incomes.

Use

`norm`

to make a normal distribution with the same mean and standard deviation as the log-transformed incomes.Plot the CDF of the normal distribution.

Compute and plot the CDF of the log-transformed incomes.

How similar are the CDFs of the log-transformed incomes and the normal distribution?

## Kernel Density Estimation#

We have seen two ways to represent distributions, PMFs and CDFs. Now we’ll learn another way: a probability density function, or PDF.
The `norm`

function, which we used to compute the normal CDF, can also compute the normal PDF:

```
xs = np.linspace(-3, 3)
ys = norm(0,1).pdf(xs)
plt.plot(xs, ys, color='gray', label='Normal PDF')
plt.xlabel('x')
plt.ylabel('PDF')
plt.title('Normal density function')
plt.legend();
```

The normal PDF is the classic “bell curve”.

It is tempting to compare the PMF of the data to the PDF of the normal distribution, but that doesn’t work. Let’s see what happens if we try:

```
plt.plot(xs, ys, color='gray', label='Normal PDF')
pmf_sample = Pmf.from_seq(sample)
pmf_sample.plot(label='Random sample')
plt.xlabel('x')
plt.ylabel('PDF')
plt.title('Normal density function')
plt.legend();
```

The PMF of the sample is a flat line across the bottom. In the random sample, every value is unique, so they all have the same probability, one in 1000.

However, we can use the points in the sample to estimate the PDF of the distribution they came from.
This process is called **kernel density estimation**, or KDE. It’s a way of getting from a PMF, a probability mass function, to a PDF, a probability density function.

To generate a KDE plot, we’ll use the Seaborn library, imported as `sns`

.
Seaborn provides `kdeplot`

, which takes the sample, estimates the PDF, and plots it.

```
import seaborn as sns
sns.kdeplot(sample, label='Estimated sample PDF')
plt.xlabel('x')
plt.ylabel('PDF')
plt.title('Normal density function')
plt.legend();
```

Now we can compare the KDE plot and the normal PDF.

```
plt.plot(xs, ys, color='gray', label='Normal PDF')
sns.kdeplot(sample, label='Estimated sample PDF')
plt.xlabel('x')
plt.ylabel('PDF')
plt.title('Normal density function')
plt.legend();
```

The KDE plot matches the normal PDF pretty well, although the differences look bigger when we compare PDFs than they did with the CDFs.
That means that the PDF is a more sensitive way to look for differences, but often it is too sensitive.

It’s hard to tell whether apparent differences mean anything, or if they are just random, as in this case.

**Exercise:** In a previous exercise, we asked “Is the normal distribution a good model for the distribution of ages in the U.S. population?” To answer this question, we plotted the CDF of the data and compared it to the CDF of a normal distribution with the same mean and standard deviation.

Now we’ll compare the estimated density of the data with the normal PDF.

Again, compute the mean and standard deviation of ages in the GSS dataset.

Use

`linspace`

to create an array of values between 18 and 89.Use

`norm`

to create a normal distribution with the same mean and standard deviation as the data, then use it to compute the normal PDF for each value in the array.Plot the normal PDF with a gray line.

Use

`sns.kdeplot`

to estimate and plot the density of the ages in the GSS.

Note: Seaborn can’t handle NaNs, so use `dropna`

to drop them before calling `kdeplot`

.

How well do the PDF and KDE plots agree?

**Exercise:** In a previous exercise, we used CDFs to see if the distribution of income fits a lognormal distribution. We can make the same comparison using a PDF and KDE.

Again, extract

`REALINC`

from`gss`

and compute its logarithm using`np.log10()`

.Compute the mean and standard deviation of the log-transformed incomes.

Use

`norm`

to make a normal distribution with the same mean and standard deviation as the log-transformed incomes.Plot the PDF of the normal distribution.

Use

`sns.kdeplot()`

to estimate and plot the density of the log-transformed incomes.

## Summary#

In this chapter, we’ve seen four ways to visualize distributions, PMFs, CDFs, and KDE plots.

In general, I use CDFs when I am exploring data. That way, I get the best view of what’s going on without getting distracted by noise.

Then, if I am presenting results to an audience unfamiliar with CDFs, I might use a PMF if the dataset contains a small number of unique values, or KDE if there are many unique values.