4. Cumulative Distribution Functions#


%load_ext nb_black
%load_ext autoreload
%autoreload 2
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/thinkstats.py")
try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from thinkstats import decorate

4.1. Percentiles and Percentile Ranks#

If you have taken a standardized test, you probably got your results in the form of a raw score and a percentile rank. In this context, the percentile rank is the percentage of people who got the same score as you or lower. So if you are “in the 90th percentile,” you did as well as or better than 90% of the people who took the exam.

To understand percentiles and percentile ranks, let’s consider an example based on running speeds. Some years ago I ran the James Joyce Ramble, which is a 10 kilometer road race in Massachusetts. After the race, I downloaded the results to see how my time compared to other runners. The relay.py module provides a function that reads the results and returns a Pandas DataFrame.

download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/relay.py")
download(
    "https://github.com/AllenDowney/ThinkStats/raw/v3/data/Apr25_27thAn_set1.shtml"
)
from relay import read_results

results = read_results()
results.head()
Place Div/Tot Division Guntime Nettime Min/Mile MPH
0 1 1/362 M2039 30:43 30:42 4:57 12.121212
1 2 2/362 M2039 31:36 31:36 5:06 11.764706
2 3 3/362 M2039 31:42 31:42 5:07 11.726384
3 4 4/362 M2039 32:28 32:27 5:14 11.464968
4 5 5/362 M2039 32:52 32:52 5:18 11.320755

results contains one row for each of 1633 runners who finished the race. The column we’ll use to quantify performance is MPH, which contains each runner’s average speed in miles per hour. We’ll select this column and use values to extract the speeds as a NumPy array.

speeds = results["MPH"].values

I finished in 42:44, so we can find my row like this.

my_result = results.query("Nettime == '42:44'")
my_result
Place Div/Tot Division Guntime Nettime Min/Mile MPH
96 97 26/256 M4049 42:48 42:44 6:53 8.716707

The index of my row is 96, so we can extract my speed like this.

my_speed = speeds[96]

We can use sum to count the number of runners at my speed or slower.

(speeds <= my_speed).sum()
1537

And we can use mean to compute the percentage of runners at my speed or slower.

(speeds <= my_speed).mean() * 100
94.12124923453766

The result is my percentile rank in the field, which was about 94%.

More generally, the following function computes the percentile rank of a particular value in a sequence of values.

def percentile_rank(x, seq):
    """Percentile rank of x.

    x: value
    seq: sequence of values

    returns: percentile rank 0-100
    """
    return (seq <= x).mean() * 100
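As a check, we can use this function to recompute my percentile rank in the whole field, which matches the result we computed directly.

percentile_rank(my_speed, speeds)
94.12124923453766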

In results, the Division column indicates the division each runner was in, identified by gender and age range – for example, I was in the M4049 division, which includes male runners aged 40 to 49. We can use the query method to select the rows for people in my division and extract their speeds.

my_division = results.query("Division == 'M4049'")
my_division_speeds = my_division["MPH"].values

Now we can use percentile_rank to compute my percentile rank in my division.

percentile_rank(my_speed, my_division_speeds)
90.234375

Going in the other direction, if we are given a percentile rank, the following function finds the corresponding value in a sequence.

def percentile(p, seq):
    n = len(seq)
    i = (1 - p / 100) * (n + 1)
    return seq[round(i)]

n is the number of elements in the sequence; i is the index of the element with the given percentile rank. This function assumes that the sequence is sorted in descending order, as these race results are, with the fastest runner first.

When we look up a percentile rank, the corresponding value is called a percentile.

percentile(90, my_division_speeds)
8.591885441527447

In my division, the 90th percentile was about 8.6 mph.
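The percentile and percentile_rank functions are roughly inverses of each other. Because the data are discrete, the round trip is not exact, but if we look up the 90th percentile and then compute its percentile rank, we should get a result close to 90.

percentile_rank(percentile(90, my_division_speeds), my_division_speeds)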

If I am still running in 10 years (and I hope I am), I will be in the M5059 division. [NOTE FOR THE THIRD EDITION: I am.] Assuming that my percentile rank in my division is the same, how much slower should I expect to be? We can answer that question by converting my percentile rank in the M4049 division, which is about 90.2%, to a speed in the M5059 division.

next_division = results.query("Division == 'M5059'")
next_division_speeds = next_division["MPH"].values

percentile(90.2, next_division_speeds)
8.017817371937639

The person in the M5059 division with the same percentile rank as me ran just over 8 mph. We can use query to find him.

next_division.query("MPH > 8.01").tail(1)
Place Div/Tot Division Guntime Nettime Min/Mile MPH
222 223 18/171 M5059 46:30 46:25 7:29 8.017817

He finished in 46:25 and came in 18th out of 171 people in his division.

The goal of these examples is to introduce percentile ranks and percentiles, which are the key to understanding cumulative distribution functions.

4.2. CDFs#

A cumulative distribution function, or CDF, is a way to describe the distribution of a set of values. Given a value x, the CDF computes the fraction of values less than or equal to x. As an example, we’ll start with a short sequence.

t = [1, 2, 2, 3, 5]

One way to compute a CDF is to start with a PMF. Here is a Pmf object that represents the distribution of values in t.

from empiricaldist import Pmf

pmf = Pmf.from_seq(t)
pmf
probs
1 0.2
2 0.4
3 0.2
5 0.2

We can use the bracket operator to look up a value in a Pmf.

pmf[2]
0.4

The result is the proportion of values in the sequence equal to the given value. In this example, two out of five values are equal to 2, so the result is 0.4, which is two fifths. We can also think of this proportion as the probability that a randomly chosen value from the sequence equals 2.

Pmf has a make_cdf method that computes the cumulative sum of the probabilities in the Pmf.

cdf = pmf.make_cdf()
cdf
probs
1 0.2
2 0.6
3 0.8
5 1.0
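Because a Pmf is a kind of Pandas Series, we can compute the same cumulative sums ourselves – a sketch of the idea, not necessarily how make_cdf is implemented.

pmf.cumsum()  # same probabilities as the Cdf above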

The result is a Cdf object, which is a kind of Pandas Series. We can use the bracket operator to look up a value.

cdf[2]
0.6000000000000001

The result is the proportion of values in the sequence less than or equal to the given value. In this example, three out of five of the values in the sequence are less than or equal to 2, so the result is 0.6, which is three fifths. We can also think of this proportion as the probability that a randomly chosen value from the sequence is less than or equal to 2.

We can also use parentheses to call the Cdf object like a function.

cdf(3)
array(0.8)

The cumulative distribution function is defined for all numbers, not just the ones that appear in the sequence.

cdf(4)
array(0.8)
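Below the smallest value in the sequence, the cumulative probability is 0; above the largest value, it is 1.

cdf(0), cdf(10)
(array(0.), array(1.))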

To visualize the Cdf, we can use the step method, which plots the Cdf as a step function.

cdf.step()
decorate(xlabel="x", ylabel="CDF")
[Figure: step plot of the CDF of t]

As a second example, let’s make a Cdf that represents the distribution of running speeds from the previous section. The Cdf class provides a from_seq method we can use to create a Cdf object from a sequence, without making a Pmf first.

from empiricaldist import Cdf

cdf_speeds = Cdf.from_seq(speeds)

And here’s what it looks like.

cdf_speeds.step()
decorate(xlabel="Speed (mph)", ylabel="CDF")
[Figure: CDF of running speeds]

If we look up my speed, the result is the fraction of runners at my speed or slower. If we multiply by 100, we get my percentile rank.

cdf_speeds(my_speed) * 100
94.12124923453766

So that’s one way to think about the Cdf – given a value, it computes something like a percentile rank, except that it’s a proportion between 0 and 1 rather than a percentage between 0 and 100.

Cdf provides an inverse method that computes the inverse of the cumulative distribution function – given a proportion between 0 and 1, it finds the corresponding value.

For example, if someone says they ran as fast or faster than 50% of the field, we can find their speed like this.

cdf_speeds.inverse(0.5)
array(6.70391061)

If you have a proportion and you use the inverse CDF to find the corresponding value, the result is called a quantile – so the inverse CDF is sometimes called the quantile function.

If you have a quantile and you use the CDF to find the corresponding proportion, the result doesn’t really have a name, strangely. To be consistent with “percentile” and “percentile rank”, it could be called a “quantile rank”, but as far as I can tell, no one calls it that. Most often, it is called a cumulative probability.
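Whatever we call it, going in one direction and then the other approximately recovers where we started. For example, if we find the speed at the 0.5 quantile and evaluate the Cdf there, we get back a cumulative probability close to 0.5 – not exactly 0.5, because with discrete data the inverse returns the smallest value whose cumulative probability is at least 0.5.

cdf_speeds(cdf_speeds.inverse(0.5))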

4.3. Comparing CDFs#

CDFs are especially useful for comparing distributions. As an example, let’s compare the distribution of birth weights for first babies and others. We’ll load the NSFG dataset again, and divide it into three DataFrames: all live births, first babies, and others. Instructions for downloading the data are in the notebook for this chapter.

The following cells download the data files and install statadict, which we need to read the data.

download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/nsfg.py")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dct")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dat.gz")
try:
    import statadict
except ImportError:
    !pip install statadict
from nsfg import get_nsfg_groups

live, firsts, others = get_nsfg_groups()

From firsts and others we’ll select total birth weights in pounds, using dropna to remove values that are nan.

first_weights = firsts["totalwgt_lb"].dropna()
first_weights.mean()
7.201094430437772
other_weights = others["totalwgt_lb"].dropna()
other_weights.mean()
7.325855614973262

It looks like first babies are a little lighter on average. But there are several ways a difference like that could happen – for example, there might be a small number of first babies who are especially light, or a small number of other babies who are especially heavy. In those cases, the distributions would have different shapes. As another possibility, the distributions might have the same shape, but different locations.

To compare the distributions, we can try plotting the PMFs.

from empiricaldist import Pmf

first_pmf = Pmf.from_seq(first_weights, name="first")
other_pmf = Pmf.from_seq(other_weights, name="other")

But it doesn’t work very well.

from thinkstats import two_bar_plots

two_bar_plots(first_pmf, other_pmf, width=0.06, alpha=0.5)
decorate(xlabel="Weight (pounds)", ylabel="PMF")
[Figure: PMFs of birth weights for first babies and others]

I adjusted the width and transparency of the bars to show the distributions as clearly as possible, but it is hard to compare them. There are many peaks and valleys, and some apparent differences, but it is hard to tell which of these features are meaningful. Also, it is hard to see overall patterns – for example, it is not visually apparent which distribution has the higher mean.

These problems can be mitigated by binning the data – that is, dividing the range of quantities into non-overlapping intervals and counting the number of quantities in each bin. Binning can be useful, but it is tricky to get the size of the bins right. If they are big enough to smooth out noise, they might also smooth out useful information.
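For example, here is a sketch of one way to bin the birth weights with NumPy – the one-pound bin width is an arbitrary choice, not something the data dictates.

# count first-baby weights in one-pound bins (bin edges are a choice)
bins = np.arange(0, 17)
counts, edges = np.histogram(first_weights, bins=bins)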

A good alternative is to plot the CDFs.

first_cdf = first_pmf.make_cdf()
other_cdf = other_pmf.make_cdf()

Here’s what they look like.

first_cdf.plot()
other_cdf.plot()
decorate(xlabel="Weight (pounds)", ylabel="CDF")
[Figure: CDFs of birth weights for first babies and others]

This figure makes the shape of the distributions, and the differences between them, much clearer. We can see that first babies are slightly lighter throughout the distribution, with a larger discrepancy above the midpoint.

4.4. Percentile-Based Statistics#

In Section xxx we computed the arithmetic mean, which identifies a central point in a distribution, and the standard deviation, which quantifies how spread out the distribution is. And in Section xxx we computed Pearson’s skewness, which indicates whether a distribution is skewed left or right. One drawback of all of these statistics is that they are sensitive to outliers: a single extreme value in a dataset can have a large effect on the mean, standard deviation, and skewness.

An alternative is to use statistics based on percentiles of the distribution, which tend to be more robust – that is, less sensitive to outliers.

To demonstrate, let’s load the NSFG data again without doing any data cleaning.

from nsfg import read_stata

dct_file = "2002FemPreg.dct"
dat_file = "2002FemPreg.dat.gz"

preg = read_stata(dct_file, dat_file)

Recall that birth weight is recorded in two columns, one for the pounds and one for the ounces.

birthwgt_lb = preg["birthwgt_lb"]
birthwgt_oz = preg["birthwgt_oz"]

If we make a Hist object with the values from birthwgt_oz, we can see that they include the special values 97, 98, and 99, which indicate missing data.

from empiricaldist import Hist

Hist.from_seq(birthwgt_oz).tail(5)
freqs
birthwgt_oz
14.0 475
15.0 378
97.0 1
98.0 1
99.0 46

The birthwgt_lb column includes the same special values; it also includes the value 51, which has to be a mistake.

Hist.from_seq(birthwgt_lb).tail(5)
freqs
birthwgt_lb
15.0 1
51.0 1
97.0 1
98.0 1
99.0 57

Now let’s imagine two scenarios. In one scenario, we clean these variables by replacing missing and invalid values with nan, and then compute total weight in pounds.

birthwgt_lb_clean = birthwgt_lb.replace([51, 97, 98, 99], np.nan)
birthwgt_oz_clean = birthwgt_oz.replace([97, 98, 99], np.nan)

total_weight_clean = birthwgt_lb_clean + birthwgt_oz_clean / 16

In the other scenario, we neglect to clean the data and accidentally compute the total weight with these bogus values.

total_weight_bogus = birthwgt_lb + birthwgt_oz / 16

I’ll call the first result total_weight_clean and the second total_weight_bogus, so we remember which is which. The bogus dataset contains only 49 invalid values, which is about 0.5% of the data.

count1, count2 = total_weight_bogus.count(), total_weight_clean.count()
diff = count1 - count2

diff, diff / count2 * 100
(49, 0.5421553441026776)

Now let’s compute the mean of the data in both scenarios.

mean1, mean2 = total_weight_bogus.mean(), total_weight_clean.mean()
mean1, mean2
(7.319680587652691, 7.265628457623368)

The bogus values have a moderate effect on the mean. If we take the mean of the cleaned data to be correct, the mean of the bogus data is off by less than 1%, so an error like that might go undetected.

(mean1 - mean2) / mean2 * 100
0.74394294099376

Now let’s see what happens to the standard deviations.

std1, std2 = total_weight_bogus.std(), total_weight_clean.std()
std1, std2
(2.096001779161835, 1.4082934455690173)
(std1 - std2) / std2 * 100
48.83274403900607

The standard deviation of the bogus data is off by almost 50%, so that’s pretty bad.

Finally, here’s the skewness of the two datasets.

def skewness(seq):
    """Compute the skewness of a sequence

    seq: sequence of numbers

    returns: float skewness
    """
    deviations = seq - seq.mean()
    return np.mean(deviations**3) / seq.std(ddof=0) ** 3

skew1, skew2 = skewness(total_weight_bogus), skewness(total_weight_clean)
skew1, skew2
(22.251846195422484, -0.5895062687577697)
# how much is skew1 off by?
(skew1 - skew2) / skew2
-38.74658112171128

The skewness of the bogus dataset is off by a factor of almost 40, and it has the wrong sign! With the outliers added to the data, the distribution is strongly skewed to the right, as indicated by large positive skewness. But the distribution of the valid data is slightly skewed to the left, as indicated by small negative skewness.

These results show that a small number of outliers have a moderate effect on the mean, a strong effect on the standard deviation, and a disastrous effect on skewness.

An alternative is to use statistics based on percentiles. Specifically:

  • The median, which is the 50th percentile, identifies a central point in a distribution, like the mean.

  • The interquartile range, which is the difference between the 25th and 75th percentiles, quantifies the spread of the distribution, like the standard deviation.

  • The quartile skewness uses the quartiles of the distribution (25th, 50th, and 75th percentiles) to quantify the skewness.

The Cdf object provides an efficient way to compute these percentile-based statistics. To demonstrate, let’s make a Cdf object from the bogus and clean datasets.

cdf_total_weight_bogus = Cdf.from_seq(total_weight_bogus)
cdf_total_weight_clean = Cdf.from_seq(total_weight_clean)

The following function takes a Cdf and uses its inverse method to compute the 50th percentile, which is the median.

def median(cdf):
    m = cdf.inverse(0.5)
    return m

Now we can compute the median of both datasets.

median(cdf_total_weight_bogus), median(cdf_total_weight_clean)
(array(7.375), array(7.375))

The results are identical, so in this case, the outliers have no effect on the median at all. In general, outliers have a smaller effect on the median than on the mean.

The interquartile range (IQR) is the difference between the 75th and 25th percentiles. The following function takes a Cdf and returns the IQR.

def iqr(cdf):
    low, high = cdf.inverse([0.25, 0.75])
    return high - low

And here are the interquartile ranges of the two datasets.

iqr(cdf_total_weight_bogus), iqr(cdf_total_weight_clean)
(1.625, 1.625)

In general, outliers have less effect on the IQR than on the standard deviation – in this case they have no effect at all.
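As an aside, if you don’t have a Cdf handy, NumPy can compute percentiles directly from a sequence. Note that np.percentile interpolates between data points by default, so its results can differ slightly from Cdf.inverse, which always returns values that appear in the data.

np.percentile(total_weight_clean.dropna(), [25, 50, 75])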

Finally, here’s a function that computes quartile skewness, which depends on three statistics:

  • The median,

  • The midpoint of the 25th and 75th percentiles, and

  • The semi-IQR, which is half of the IQR.
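Putting these together, quartile skewness – sometimes called Bowley’s skewness – is

\[ \frac{(Q_1 + Q_3)/2 - Q_2}{(Q_3 - Q_1)/2} \]

where \(Q_1\), \(Q_2\), and \(Q_3\) are the 25th, 50th, and 75th percentiles. It is 0 for a symmetric distribution and always falls between -1 and 1.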

def quartile_skewness(cdf):
    # look up the three quartiles with the inverse CDF
    low, median, high = cdf.inverse([0.25, 0.5, 0.75])
    midpoint = (high + low) / 2
    semi_iqr = (high - low) / 2
    return (midpoint - median) / semi_iqr

qskew1 = quartile_skewness(cdf_total_weight_bogus)
qskew2 = quartile_skewness(cdf_total_weight_clean)
qskew1, qskew2
(-0.07692307692307693, -0.07692307692307693)

These examples show that percentile-based statistics are less sensitive to outliers and errors in the data.

4.5. Random Numbers#

Cdf objects provide an efficient way to generate random numbers from a distribution. First we generate random numbers from a uniform distribution between 0 and 1. Then we evaluate the inverse CDF at those points. The following function implements this algorithm.

def sample_from_cdf(cdf, n):
    # generate n random probabilities, uniform between 0 and 1
    ps = np.random.random(size=n)
    # evaluate the inverse CDF at those points
    return cdf.inverse(ps)

sample = sample_from_cdf(cdf_speeds, 1633)
cdf_sample = Cdf.from_seq(sample)

cdf_speeds.plot(label="original")
cdf_sample.plot(label="sample")

decorate(xlabel="Speed (mph)", ylabel="CDF")
[Figure: CDFs of the original speeds and a random sample]

To understand how this algorithm works, consider this question:

Suppose we choose a random sample from the population of running speeds and look up the percentile ranks of the speeds in the sample. Now suppose we compute the CDF of the percentile ranks. What do you think it will look like?

Let’s find out. Here are the percentile ranks for the sample we generated.

percentile_ranks = cdf_speeds(sample) * 100

And here is the CDF of the percentile ranks.

cdf_percentile_rank = Cdf.from_seq(percentile_ranks)
cdf_percentile_rank.plot()

decorate(xlabel="Percentile rank", ylabel="CDF")
[Figure: CDF of the percentile ranks]

The CDF of the percentile ranks is close to a straight line, which suggests that the percentile ranks are uniformly distributed between 0 and 100. And that makes sense, because in any distribution, the proportion of values with percentile rank less than 50 is 0.5, the proportion with percentile rank less than 90 is 0.9, and so on.

So, if we take a random sample from a distribution and compute the percentile rank of each value, the percentile ranks have a uniform distribution.
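As a sanity check, the fraction of percentile ranks less than 50 in this sample should be close to one half.

(percentile_ranks < 50).mean()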

Cdf provides a sample method that uses this algorithm, so we could also generate a sample like this. Note that samples generated this way are drawn with replacement – the same value can appear more than once.

sample = cdf_speeds.sample(1633)

4.6. Glossary#

  • percentile rank: The percentage of quantities in a distribution that are less than or equal to a given quantity.

  • percentile: The quantity associated with a given percentile rank.

  • cumulative distribution function (CDF): A function that maps from quantities to their cumulative probabilities. \(CDF(x)\) is the fraction of the sample less than or equal to \(x\).

  • inverse CDF: A function that maps from a cumulative probability, \(p\), to the corresponding quantity.

  • median: The 50th percentile, often used as a measure of central tendency.

  • interquartile range: The difference between the 75th and 25th percentiles, used as a measure of spread.

  • quantile: A sequence of quantities that correspond to equally spaced percentile ranks; for example, the quartiles of a distribution are the 25th, 50th and 75th percentiles.

  • replacement: A property of a sampling process. “With replacement” means that the same quantity can be chosen more than once; “without replacement” means that once a quantity is chosen, it is removed from the population.

4.7. Exercises#

4.7.1. Exercise#

How much did you weigh at birth? If you don’t know, call your mother or someone else who knows. Using the NSFG data (all live births), compute the distribution of birth weights and use it to find your percentile rank. If you were a first baby, find your percentile rank in the distribution for first babies. Otherwise use the distribution for others. If you are in the 90th percentile or higher, call your mother back and apologize.

from nsfg import get_nsfg_groups

live, firsts, others = get_nsfg_groups()

4.7.2. Exercise#

For live births in the NSFG dataset, the column babysex indicates whether the baby was male or female. We can use query to select the rows for male and female babies.

male = live.query("babysex == 1")
female = live.query("babysex == 2")
len(male), len(female)
(4641, 4500)

Make Cdf objects that represent the distribution of birth weights for male and female babies. Plot the two CDFs. What are the differences in the shape and location of the distributions?

If a male baby weighs 8.5 pounds, what is his percentile rank? What is the weight of a female baby with the same percentile rank?

4.7.3. Exercise#

From the NSFG pregnancy data, select the agepreg column, divide by 100 to convert to years, and make a Cdf to represent the distribution of age at conception for each pregnancy.

Use the CDF to compute the percentage of ages less than or equal to 20, and the percentage less than or equal to 30.

Use those results to compute the percentage between 20 and 30.

from nsfg import read_fem_preg

preg = read_fem_preg()

4.7.4. Exercise#

Here are the running speeds of the people who finished the James Joyce Ramble, described in Section xxx.

speeds = results["MPH"].values

Make a Cdf that represents the distribution of these speeds, and use it to compute the median, IQR, and quartile skewness. Does the distribution skew to the left or right?

4.7.5. Exercise#

The numbers generated by np.random.random are supposed to be uniform between 0 and 1, which means that the CDF of a sample should be a straight line. Let’s see if that’s true.

Here’s a sample of 1000 numbers. Plot the CDF of this sample. Does it look like a straight line?

t = np.random.random(1000)

Think Stats: Exploratory Data Analysis in Python, 3rd Edition

Copyright 2024 Allen B. Downey

Code license: MIT License

Text license: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International