6. Probability Density Functions#

In the previous chapter, we modeled empirical distributions with theoretical distributions including the binomial, Poisson, exponential, and normal distributions.

The binomial and Poisson distributions are discrete, which means that the outcomes have to be distinct or separate elements, like an integer number of hits and misses, or goals scored. In a discrete distribution, each outcome is associated with a probability mass.

The exponential and normal distributions are continuous, which means the outcomes can be locations on a spectrum – like a wavelength – or ratios – like height and weight, which are ratios relative to their units of measure. In a continuous distribution, each outcome is associated with a probability density. To understand probability density, let's start by comparing distributions.


from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/thinkstats.py")
try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from thinkstats import decorate

6.1. Comparing Distributions#

In the previous chapter, when we compared discrete distributions, we used a bar plot to show their probability mass functions (PMFs). When we compared continuous distributions, we used a line plot to show their cumulative distribution functions (CDFs).

For the discrete distributions, we could also have used CDFs. For example, here’s the PMF of a Poisson distribution with parameter lam=2.2, which is a good model for the distribution of household size in the NSFG. Instructions for downloading the data are in the notebook for this chapter.

The following cells download the data files and install statadict, which we need to read the data.

download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/nsfg.py")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemResp.dct")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemResp.dat.gz")
try:
    import statadict
except ImportError:
    !pip install statadict
from nsfg import read_fem_resp

resp = read_fem_resp()
older = resp.query("age >= 25")
num_family = older["numfmhh"]
from empiricaldist import Pmf

pmf_family = Pmf.from_seq(num_family, name="data")
from thinkstats import poisson_pmf

lam = 2.2
ks = np.arange(11)
ps = poisson_pmf(ks, lam)
pmf_poisson = Pmf(ps, ks, name="Poisson model")
from thinkstats import two_bar_plots

two_bar_plots(pmf_family, pmf_poisson)
decorate(xlabel="Number of family members")

Comparing the PMFs, we can see where the data deviates from the model. But PMFs tend to emphasize small differences that might be due to random variation. Sometimes we can see the big picture more clearly by comparing CDFs.

For example, here’s the CDF of family size compared to the CDF of the Poisson model.

cdf_family = pmf_family.make_cdf()
cdf_poisson = pmf_poisson.make_cdf()
from thinkstats import two_cdf_plots

two_cdf_plots(cdf_poisson, cdf_family)
decorate(xlabel="Number of family members")

In my opinion, CDFs are usually the best way to compare data to a model, or to compare data from different groups.

Also, CDFs work well with continuous data.

download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dct")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dat.gz")
from nsfg import read_fem_preg

preg = read_fem_preg()
birth_weights = preg["totalwgt_lb"].dropna()
from scipy.stats import trimboth

trimmed = trimboth(birth_weights, 0.01)
from thinkstats import make_normal_model

cdf_model = make_normal_model(trimmed)
from empiricaldist import Cdf

cdf_birth_weight = Cdf.from_seq(birth_weights, name="sample")
two_cdf_plots(cdf_model, cdf_birth_weight, xlabel="Birth weight (pounds)")

6.2. Probability Density#

The familiar bell curve is actually the probability density function (PDF) of the normal distribution, which we can compute like this.

def normal_pdf(xs, mu, sigma):
    """Evaluates the normal probability density function.

    xs: float or sequence of floats
    mu: mean of the distribution
    sigma: standard deviation of the distribution

    returns: float or NumPy array of probability density
    """
    z = (xs - mu) / sigma
    return np.exp(-(z**2) / 2) / sigma / np.sqrt(2 * np.pi)
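
As a quick check, we can confirm that this function agrees with SciPy's implementation of the normal PDF (the test values here are arbitrary).

from scipy.stats import norm

test_xs = np.linspace(5, 10, 5)  # arbitrary test values
np.allclose(normal_pdf(test_xs, 7.5, 1.2), norm.pdf(test_xs, 7.5, 1.2))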

thinkstats provides a NormalPdf object that represents the PDF of a normal distribution. We can use it to create a NormalPdf with the same mean and standard deviation as the birth weights in the NSFG dataset.

from thinkstats import NormalPdf

m, s = np.mean(trimmed), np.std(trimmed)
pdf_model = NormalPdf(m, s, name="normal model")
pdf_model
NormalPdf(7.280883100022579, 1.2430657948614345, name='normal model')

If we call the NormalPdf object as a function, it evaluates the normal PDF.

low = m - 4 * s
high = m + 4 * s
qs = np.linspace(low, high, 201)
ps = pdf_model(qs)

The result looks like a bell curve.

plt.plot(qs, ps, label="normal model")
decorate(xlabel="Birth weight (pounds)", ylabel="Density")

The NormalPdf object provides a plot function that does the same thing.

The peak of the distribution is at m. If we evaluate the PDF there, the result is a probability density.

pdf_model(m)
0.32093416297880123
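
This value follows from the formula for the normal PDF – at the mean, z is 0, so the density is 1 / (sigma * sqrt(2 * pi)).

1 / (s * np.sqrt(2 * np.pi))  # the peak density of a normal PDF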

By itself, a probability density doesn’t mean much – most importantly, it is not a probability. It would be incorrect to say that the probability is 32% that a randomly-chosen birth weight equals m. In fact, the probability that a birth weight is truly, exactly, and precisely equal to m – or any other specific value – is effectively zero.
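
What a density does tell us is the probability per unit of measure near a value. As a supplementary check – the small width eps here is an arbitrary choice – multiplying the density at m by a narrow interval width approximates the probability of falling in that interval.

from scipy.stats import norm

eps = 0.01  # an arbitrary small interval width, in pounds
approx = pdf_model(m) * eps  # density times width
exact = norm.cdf(m + eps / 2, m, s) - norm.cdf(m - eps / 2, m, s)
approx, exact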

However, we can use these probability densities to compute the probability that a birth weight falls in an interval between two values, by computing the area under the curve. The following function takes a NormalPdf object and the bounds of an interval, low and high. It evaluates the normal PDF at equally-spaced quantities between low and high, and uses Simpson’s method to estimate the area under the curve.

from scipy.integrate import simpson


def area_under(pdf, low, high):
    """Find the area under a PDF.

    pdf: Pdf object
    low: low end of the interval
    high: high end of the interval
    """
    qs = np.linspace(low, high, 501)
    ps = pdf(qs)
    return simpson(y=ps, x=qs)

If we compute the area under the curve from the lowest to the highest point in the graph, the result is close to 1.

area_under(pdf_model, 2, 12)
0.9999158086616793

If we extend the interval from negative infinity to positive infinity, the total area is exactly 1.
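
We can confirm that numerically with SciPy's quad function, which handles infinite limits of integration.

from scipy.integrate import quad

total, _ = quad(pdf_model, -np.inf, np.inf)
total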

If we start from 0 – or any value far below the mean – we can compute the fraction of birth weights less than or equal to 8.5 pounds.

area_under(pdf_model, 0, 8.5)
0.8366380335513807

You might recall that the “fraction less than or equal to a given value” is the definition of the CDF. So we could compute the same result using the CDF of the normal distribution.

from scipy.stats import norm

norm.cdf(8.5, m, s)
0.8366380358092718

Similarly, we can use the area under the density curve to compute the fraction of birth weights between 6 and 8 pounds.

area_under(pdf_model, 6, 8)
0.5671317752927691

Or we can get the same result using the CDF to compute the fraction less than 8 and then subtracting off the fraction less than 6.

norm.cdf(8, m, s) - norm.cdf(6, m, s)
0.5671317752921801

So the CDF evaluated at a given quantity is the area under the PDF up to that quantity. If you know calculus, another way to say the same thing is that the CDF is the integral of the PDF. And conversely, the PDF is the derivative of the CDF.
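
We can check that relationship numerically – here's a supplementary sketch that approximates the derivative of the normal CDF with NumPy's gradient function and compares it to the PDF.

from scipy.stats import norm

grid = np.linspace(low, high, 201)
slope = np.gradient(norm.cdf(grid, m, s), grid)  # numerical derivative of the CDF
np.allclose(slope, norm.pdf(grid, m, s), atol=1e-3)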

6.3. The Exponential PDF#

To get your head around probability density, it might help to look at another example. In the previous chapter, we used an exponential distribution to model the time until the first goal in a hockey game. We used the following function to compute the exponential CDF.

def exponential_cdf(x, lam):
    """Compute the exponential CDF.

    x: float or sequence of floats
    lam: rate parameter

    returns: float or NumPy array of cumulative probability
    """
    return 1 - np.exp(-lam * x)

Here, lam is the rate parameter in goals per unit of time. We can compute the exponential PDF like this.

def exponential_pdf(x, lam):
    """Evaluates the exponential PDF.

    x: float or sequence of floats
    lam: rate parameter

    returns: float or NumPy array of probability density
    """
    return lam * np.exp(-lam * x)
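
As with the normal PDF, we can check this function against SciPy – in scipy.stats, the exponential distribution is parameterized by scale, which is 1 / lam.

from scipy.stats import expon

xs = np.linspace(0, 2, 5)  # arbitrary test values
np.allclose(exponential_pdf(xs, 6), expon.pdf(xs, scale=1/6))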

thinkstats provides an ExponentialPdf object that uses this function to compute the exponential PDF. We can use one to represent an exponential distribution with rate parameter 6 goals per game.

from thinkstats import ExponentialPdf

lam = 6
pdf_expo = ExponentialPdf(lam, name="model")
pdf_expo
ExponentialPdf(6, name='model')

ExponentialPdf provides a plot function we can use to plot the PDF – notice that the unit of time is games here, rather than seconds as in the previous chapter.

qs = np.linspace(0, 1.5, 201)
pdf_expo.plot(qs)
decorate(xlabel="Time (games)", ylabel="Density")

Looking at the y-axis, you might notice that some of these densities are greater than 1, which is a reminder that a probability density is not a probability. But the area under a density curve is a probability, so it should never be greater than 1.
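
In fact, the density at 0 is lam itself – with lam=6, the curve starts at 6.

exponential_pdf(0, lam)  # the exponential PDF at 0 equals lam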

If we compute the area under this curve from 0 to 1.5 games, we can confirm that the result is close to 1.

area_under(pdf_expo, 0, 1.5)
0.999876590779019

If we extend the interval much farther, the result is slightly greater than 1, but that's because we're approximating the area numerically. Mathematically, it is exactly 1, as we can confirm using the exponential CDF.

from thinkstats import exponential_cdf

exponential_cdf(7, lam)
1.0

We can use the area under the density curve to compute the probability of a goal during any interval. For example, here is the probability of a goal during the first minute of a 60-minute game.

area_under(pdf_expo, 0, 1 / 60)
0.09516258196404043

We can compute the same result using the exponential CDF.

exponential_cdf(1 / 60, lam)
0.09516258196404048

6.4. Comparing PMFs and PDFs#

It is a common error to compare the PMF of a sample with the PDF of a theoretical model. For example, suppose we want to compare the distribution of birth weights to a normal model. Here’s a Pmf that represents the distribution of the data.

pmf_birth_weight = Pmf.from_seq(birth_weights, name="data")

And we already have pdf_model, which represents the PDF of the normal distribution with the same mean and standard deviation. Here’s what happens if we plot them on the same axis.

pmf_birth_weight.plot()
pdf_model.plot(color="gray")

decorate(xlabel="Birth weight (pounds)", ylabel="PMF? Density?")

It doesn’t work very well. One reason is that they are not in the same units. A Pmf contains probability masses and a NormalPdf contains probability densities, so we can’t compare them, and we shouldn’t plot them on the same axes.

As a first attempt to solve the problem, we can make a Pmf that approximates the normal distribution by evaluating the PDF at a discrete set of points. NormalPdf provides a make_pmf method that does that.

pmf_model = pdf_model.make_pmf()
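
The details are in the thinkstats source, but the idea is something like the following sketch – evaluate the PDF on a grid of equally spaced quantities and normalize the results so they add up to 1 (the grid here is my choice, not necessarily the one make_pmf uses).

grid = np.linspace(m - 4 * s, m + 4 * s, 201)  # an assumed grid
densities = pdf_model(grid)
pmf_sketch = Pmf(densities / densities.sum(), grid)  # normalize to masses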

The result of make_pmf contains probability masses, so we can at least plot it on the same axes as the PMF of the data.

pmf_model.plot(color="gray")
pmf_birth_weight.plot()

decorate(xlabel="Birth weight (pounds)", ylabel="PMF")

But this is still not a good way to compare distributions. One problem is that the two Pmf objects contain different numbers of quantities, and the quantities in pmf_birth_weight are not equally spaced, so the probability masses are not really comparable.

len(pmf_model), len(pmf_birth_weight)
(201, 184)

The other problem is that the Pmf of the data is noisy.

6.5. Kernel Density Estimation#

So let’s try something else – instead of converting the model to a PMF, we can convert the distribution of the data to a PDF. To show how that works, I’ll start with a small sample of the data.

np.random.seed(3)

n = 10
sample = birth_weights.sample(n)

The Pmf of this sample looks like this.

for weight in sample:
    pmf = Pmf.from_seq([weight]) / n
    pmf.bar(width=0.08, alpha=0.5)

xlim = [1.5, 12.5]
decorate(xlabel="Birth weight (pounds)", ylabel="PMF", xlim=xlim)

This way of representing the distribution treats the data as if they were discrete, so each probability mass is stacked up on a single point. But birth weight is actually a continuous quantity, so values between the measurements are also possible. We can represent that possibility by replacing each discrete probability mass with a continuous probability density, like this.

qs = np.linspace(2, 12, 201)

for weight in sample:
    ps = NormalPdf(weight, 0.75)(qs) / n
    plt.plot(qs, ps, alpha=0.5)

decorate(xlabel="Birth weight (pounds)", ylabel="PDF", xlim=xlim)

For each weight in the sample, we create a NormalPdf with the observed weight as the mean – now let’s add them up.

low_ps = np.zeros_like(qs)

for weight in sample:
    ps = NormalPdf(weight, 0.75)(qs) / n
    high_ps = low_ps + ps
    plt.fill_between(qs, low_ps, high_ps, alpha=0.5, lw=0)
    low_ps = high_ps

decorate(xlabel="Birth weight (pounds)", ylabel="PDF", xlim=xlim)

When we add up the probability densities for each data point, the result is an estimate of the probability density for the whole sample. This process is called kernel density estimation or KDE. In this context, a “kernel” is one of the small density functions we added up. Because the kernels we used are normal distributions – also known as Gaussians – we could say more specifically that we computed a Gaussian KDE.
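
To make the algorithm concrete, here is a minimal version of Gaussian KDE built on the normal_pdf function from Section 6.2 – the fixed bandwidth 0.75 matches the kernels we just plotted, although a production implementation chooses the bandwidth from the data.

def simple_kde(data, qs, bw=0.75):
    """Minimal Gaussian KDE: average a normal kernel centered at each point.

    data: sequence of observed quantities
    qs: quantities where the density is evaluated
    bw: bandwidth, the standard deviation of each kernel
    """
    ps = np.zeros_like(qs, dtype=float)
    for x in data:
        ps += normal_pdf(qs, x, bw)
    return ps / len(data)


ps_kde = simple_kde(sample, qs)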

SciPy provides a function called gaussian_kde that implements this algorithm. Here’s how we can use it to estimate the distribution of birth weights.

from scipy.stats import gaussian_kde

kde = gaussian_kde(birth_weights)

The result is an object that represents the estimated PDF, which we can evaluate by calling it like a function.

ps = kde(qs)

Here’s what the result looks like.

plt.plot(qs, ps)

decorate(xlabel="Birth weight (pounds)", ylabel="Density")

thinkstats provides a Pdf object that takes the result from gaussian_kde, and a domain that indicates where the density should be evaluated. Here’s how we make one.

from thinkstats import Pdf

domain = np.min(birth_weights), np.max(birth_weights)
kde_birth_weights = Pdf(kde, domain, name="data")

Pdf provides a plot method we can use to compare the estimated PDF of the sample to the PDF of a normal distribution.

kde_birth_weights.plot()
pdf_model.plot(color="gray")

decorate(xlabel="Birth weight (pounds)", ylabel="Density")

KDE makes it possible to compare the distribution of a dataset to a theoretical model, and for some audiences, this is a good way to visualize the comparison. But for continuous quantities, I think it’s usually better to compare CDFs.

6.6. The Distribution Framework#

At this point we have a complete set of ways to represent distributions: PMFs, CDFs, and PDFs. The following figure shows these representations and the transitions from one to another. For example, if we have a Pmf, we can use the cumsum function to compute the cumulative sum of the probabilities and get a Cdf that represents the same distribution.

https://github.com/AllenDowney/ThinkStats/raw/v3/figs/distribution_framework.png

To demonstrate these transitions, we’ll use a new dataset that “contains the time of birth, sex, and birth weight for each of 44 babies born in one 24-hour period at a Brisbane, Australia, hospital.”

thinkstats provides a function that reads the data and returns a DataFrame.

download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/babyboom.dat")
from thinkstats import read_baby_boom

boom = read_baby_boom()
boom.head()
   time  sex  weight_g  minutes
0     5    1      3837        5
1   104    1      3334       64
2   118    2      3554       78
3   155    2      3838      115
4   257    2      3625      177

The minutes column records "the number of minutes since midnight for each birth", so we can use the diff method to compute the interval between successive births.

diffs = boom["minutes"].diff().dropna()

If births happen with equal probability during any minute of the day, we expect these intervals to follow an exponential distribution. In reality, that assumption is not precisely true, but the exponential distribution might still be a good model for the data.
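
To see what that assumption implies, here's a small simulation – the seed and the uniform model are my choices – that draws 44 birth times uniformly over a 24-hour day and computes the gaps between them; the gaps follow an exponential distribution, at least approximately.

rng = np.random.default_rng(1)  # arbitrary seed
sim_times = np.sort(rng.uniform(0, 24 * 60, size=44))  # minutes since midnight
sim_diffs = np.diff(sim_times)
sim_diffs.mean()  # the average gap should be roughly half an hour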

To find out, we’ll start by making a Pmf that represents the distribution of intervals.

pmf_diffs = Pmf.from_seq(diffs, name="data")
pmf_diffs.bar(width=1)

decorate(xlabel="Interval (minutes)", ylabel="PMF")

The make_cdf method computes the cumulative sum of the probabilities in the Pmf and returns a Cdf object.

cdf_diffs = pmf_diffs.make_cdf()
cdf_diffs.step(color="C1")

decorate(xlabel="Interval (minutes)", ylabel="CDF")

The Pmf and Cdf are equivalent in the sense that if we are given either one, we can compute the other. To demonstrate, we’ll use the make_pmf method, which computes the differences between successive probabilities in a Cdf and returns a Pmf.

pmf_diffs2 = cdf_diffs.make_pmf()

The result should be identical to the original Pmf, but there might be small floating-point errors. We can use allclose to check that the result is close to the original Pmf.

np.allclose(pmf_diffs, pmf_diffs2)
True

And it is.
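
Under the hood, these transformations amount to a cumulative sum in one direction and successive differences in the other. Here's a sketch using NumPy directly, assuming the ps attributes of the Pmf and Cdf are aligned on the same quantities.

np.allclose(np.cumsum(pmf_diffs.ps), cdf_diffs.ps)  # masses -> cumulative probabilities
np.allclose(np.diff(cdf_diffs.ps, prepend=0), pmf_diffs.ps)  # cumulative -> masses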

From a Pmf, we can estimate a density function by calling gaussian_kde with the probabilities from the Pmf as weights.

kde = gaussian_kde(pmf_diffs.qs, weights=pmf_diffs.ps)
domain = np.min(pmf_diffs.qs), np.max(pmf_diffs.qs)
kde_diffs = Pdf(kde, domain=domain, name="estimated density")

kde_diffs.plot()
decorate(xlabel="Interval (minutes)", ylabel="Density")

To see whether the estimated density follows an exponential model, we can make an ExponentialPdf with the same mean as the data.

m = diffs.mean()
lam = 1 / m
pdf_model = ExponentialPdf(lam, name="exponential PDF")

Here’s what it looks like compared to the estimated density.

pdf_model.plot(color="gray")
kde_diffs.plot()
decorate(xlabel="Interval (minutes)", ylabel="Density")

Comparing the estimated density to an exponential PDF, it doesn't seem like the data fit the model well. But that might be misleading – again, I think comparing CDFs provides a clearer picture.

Here’s the CDF of an exponential distribution compared to the CDF of the data.

from thinkstats import ExponentialCdf

cdf_model = ExponentialCdf(lam, name="exponential CDF")
cdf_model.plot(color="gray")
cdf_diffs.step(color="C1")

decorate(xlabel="Interval (minutes)", ylabel="CDF")

The exponential CDF fits the CDF of the data well.

Given an ExponentialPdf, we can use make_pmf to discretize the PDF – that is, to make a discrete approximation by evaluating the PDF at a sequence of equally spaced quantities.

qs = np.linspace(0, 175, 31)
pmf_model = pdf_model.make_pmf(qs)
pmf_model.bar(width=2)

decorate(xlabel="Interval (minutes)", ylabel="PMF")

And given an ExponentialCdf, we can use make_cdf to make a discrete approximation of the exponential CDF.

discrete_cdf_model = cdf_model.make_cdf(qs)
discrete_cdf_model.step()

decorate(xlabel="Interval (minutes)", ylabel="CDF")

Finally, to get from a discrete CDF to a continuous CDF, we can interpolate between the steps, which is what we see if we use the plot method instead of the step method.

discrete_cdf_model.plot()

decorate(xlabel="Interval (minutes)", ylabel="CDF")

As we saw earlier, a PDF is the derivative of a continuous CDF, and a CDF is the integral of a PDF.

To demonstrate, we can use SymPy to define the CDF of an exponential distribution and compute its derivative.

import sympy as sp

x = sp.Symbol("x", real=True, positive=True)
λ = sp.Symbol("λ", real=True, positive=True)

cdf = 1 - sp.exp(-λ * x)
cdf
\[\displaystyle 1 - e^{- x λ}\]
pdf = sp.diff(cdf, x)
pdf
\[\displaystyle λ e^{- x λ}\]

And if we integrate the result, we get the CDF back – although we lose the constant of integration in the process.

sp.integrate(pdf, x)
\[\displaystyle - e^{- x λ}\]
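
One way to recover the constant is to compute a definite integral from 0, using a separate symbol for the variable of integration.

t = sp.Symbol("t", real=True, positive=True)
sp.integrate(λ * sp.exp(-λ * t), (t, 0, x))  # recovers 1 - exp(-λx)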

6.7. Glossary#

  • Probability density function (PDF): The derivative of a continuous CDF, a function that maps a value to its probability density.

  • Probability density: A quantity that can be integrated over a range of values to yield a probability. If the values are in units of cm, for example, probability density is in units of probability per cm.

  • Kernel density estimation (KDE): An algorithm that estimates a PDF based on a sample.

  • Discretize: To approximate a continuous function or distribution with a discrete function. The opposite of smoothing.

6.8. Exercises#

6.8.1. Exercise#

In World Cup soccer (football), suppose the time until the first goal is well modeled by an exponential distribution with rate parameter lam=2.5 goals per game. Make an ExponentialPdf to represent this distribution and use area_under to compute the probability a goal is scored in the first half of a game. Then use an ExponentialCdf to compute the same probability and check that the results are consistent.

Use ExponentialPdf to compute the probability the first goal is scored in the second half of the game. Then use an ExponentialCdf to compute the same probability and check that the results are consistent.

6.8.2. Exercise#

In order to join Blue Man Group, you have to be a male between 5’10” and 6’1”, which is roughly 178 to 185 centimeters. Let’s see what fraction of the adult male population in the United States meets this requirement.

The heights of male participants in the BRFSS are well modeled by a normal distribution with mean 178 cm and standard deviation 7 cm.

download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/CDBRFS08.ASC.gz")
from thinkstats import read_brfss

brfss = read_brfss()
male = brfss.query("sex == 1")
heights = male["htm3"].dropna()
from scipy.stats import trimboth

trimmed = trimboth(heights, 0.01)
m, s = np.mean(trimmed), np.std(trimmed)
m, s
(178.10278947124948, 7.017054887136004)

Here’s a NormalCdf object that represents a normal distribution with the same mean and standard deviation as the trimmed data.

from thinkstats import NormalCdf

cdf_normal_model = NormalCdf(m, s)

And here’s how it compares to the CDF of the data.

cdf_height = Cdf.from_seq(heights, name="data")
cdf_normal_model.plot(color="gray")
cdf_height.step()

xlim = [140, 210]
decorate(xlabel="Height (cm)", ylabel="CDF", xlim=xlim)

Use gaussian_kde to make a Pdf that represents the PDF of male height. Hint: Investigate the bw_method argument, which can be used to control the smoothness of the estimated density.

Plot the estimated density and compare it to a NormalPdf with mean m and standard deviation s.

Use a NormalPdf and area_under to compute the fraction of people in the normal model that are between 178 and 185 centimeters. Use a NormalCdf to compute the same fraction, and check that the results are consistent. Finally, use the empirical Cdf of the data to see what fraction of people in the dataset are in the same range.

Think Stats: Exploratory Data Analysis in Python, 3rd Edition

Copyright 2024 Allen B. Downey

Code license: MIT License

Text license: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International