10. Least Squares#


%load_ext nb_black
%load_ext autoreload
%autoreload 2
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/thinkstats.py")
try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from thinkstats import decorate

10.1. Least Squares Fit#

As an example, let’s return to the scenario from Chapter xxx. Suppose you are a researcher in Antarctica, studying local populations of penguins. As part of your data collection, you capture a sample of penguins, measure and weigh them – and then release them unharmed. But it is notoriously difficult to get penguins to stay on the scale long enough to get an accurate measurement. So let’s suppose that for some penguins we have measurements like flipper length, but no weights. Maybe we can use the other measurements to fill in the missing data – this process is called imputation.

We’ll start by exploring the relationship between the weights and measurements, using data collected between 2007 and 2010 by researchers at Palmer Station in Antarctica. The data they collected is freely available – instructions for downloading it are in the notebook for this chapter.

The following cell downloads the data from a repository created by Allison Horst.

download(
    "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/c19a904462482430170bfe2c718775ddb7dbb885/inst/extdata/penguins_raw.csv"
)
penguins = pd.read_csv("penguins_raw.csv").dropna(subset=["Body Mass (g)"])
penguins.shape
(342, 17)

The dataset includes measurements of 151 Adélie penguins. We can use query to select the rows that contain this data.

adelie = penguins.query('Species.str.startswith("Adelie")')
len(adelie)
151

Now suppose we know the flipper length of an Adélie penguin – let’s see how well we can predict its weight. First we’ll select these columns from the DataFrame.

xvar = "Flipper Length (mm)"
yvar = "Body Mass (g)"

flipper_length = adelie[xvar]
body_mass = adelie[yvar]

Here’s a scatter plot showing the relationship between these quantities.

plt.scatter(flipper_length, body_mass, marker=".", alpha=0.5)
decorate(xlabel=xvar, ylabel=yvar)

It looks like they are related – we can quantify the strength of the relationship by computing the coefficient of correlation.

np.corrcoef(flipper_length, body_mass)[0, 1]
0.4682016942179394

The correlation is about 0.47, so penguins with longer flippers tend to be heavier. That’s useful because it means we can guess a penguin’s weight more accurately if we know its flipper length, but correlation alone doesn’t tell us how to make those guesses. For that, we need to choose a line of best fit.

There are many ways to define the “best” line, but for data like this a common choice is a linear least squares fit, which is the straight line that minimizes the mean squared error (MSE).
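In symbols, if the fitted line is \(y = a + b x\), the quantity being minimized is the average of the squared vertical distances between the data points and the line:

\[\mathrm{MSE}(a, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (a + b\,x_i) \right)^2\]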

SciPy provides a function called linregress that computes a least squares fit. The name is short for linear regression, which is another term for a least squares fit. Here’s how we use it.

from scipy.stats import linregress

result = linregress(flipper_length, body_mass)
result
LinregressResult(slope=32.83168975115009, intercept=-2535.8368022002514, rvalue=0.46820169421793933, pvalue=1.3432645947790051e-09, stderr=5.076138407990821, intercept_stderr=964.7984274994059)

The result is a LinregressResult object that contains the slope and intercept of the fitted line, along with other information we’ll unpack soon. The slope is about 32.8, which means that each additional millimeter of flipper length is associated with an additional 32.8 grams of body weight.

The intercept is -2535 grams, which might seem nonsensical, since a measured weight can’t be negative. But the intercept is the predicted weight for a penguin with flipper length 0 mm, which is far outside the range of the data. It might make more sense if we use the slope and intercept to evaluate the fitted line at the average flipper length.

x = flipper_length.mean()
y = result.intercept + result.slope * x
x, y
(189.95364238410596, 3700.662251655629)

For a penguin with flipper length 190 mm, the expected body weight is about 3700 grams.

The following function takes the result object and a sequence of xs and finds the point on the fitted line for each value of x.

def predict(result, xs):
    ys = result.intercept + result.slope * xs
    return ys

The name predict might seem odd here – in natural language, a prediction usually pertains to something happening in the future, but in the context of regression, the points on the fitted line are called predictions.

We can use predict to compute the points on the line for a range of flipper sizes.

fit_xs = np.linspace(np.min(flipper_length), np.max(flipper_length))
fit_ys = predict(result, fit_xs)

Here’s the fitted line along with the scatter plot of the data.

plt.scatter(flipper_length, body_mass, marker=".", alpha=0.5)
plt.plot(fit_xs, fit_ys, color="C1")
decorate(xlabel=xvar, ylabel=yvar)

As expected, the fitted line goes through the center of the data and follows the trend. And some of the predictions are accurate – but many of the data points are far from the line. To get a sense of how good (or bad) the predictions are, we can compute the prediction error, which is the vertical distance of each point from the line. The following function computes these errors, which are also called residuals.

def compute_residuals(result, xs, ys):
    fit_ys = predict(result, xs)
    return ys - fit_ys

Here are the residuals for body mass as a function of flipper length.

residuals = compute_residuals(result, flipper_length, body_mass)

As an example, we can look at the results for the first penguin in the dataset.

x = flipper_length[0]
y = predict(result, x)
x, y
(181.0, 3406.699042757914)

The flipper length of the selected penguin is 181 mm and the predicted body mass is 3407 grams.

body_mass[0], residuals[0]
(3750.0, 343.30095724208604)

The actual mass of this penguin is 3750 grams, so the residual – after subtracting away the prediction – is 343 grams.

The average of the squared residuals is the mean squared error (MSE) of the predictions.

mse = np.mean(residuals**2)
mse
163098.85902884745

By itself, this number doesn’t mean very much. We can make more sense of it by computing the coefficient of determination.
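Before that, a quick aside: taking the square root of the MSE puts it back on the same scale as the data, so it gives a rough sense of the typical size of a prediction error – about 400 grams in this example.

np.sqrt(mse)  # root mean squared error (RMSE), in grams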

10.2. Coefficient of Determination#

Suppose you want to guess the weight of a penguin. If you know its flipper length, you can use the least squares fit to inform your guess, and the MSE quantifies the accuracy of your guesses, on average.

But what if you don’t know the flipper length – what would you guess? It turns out that the best strategy is to guess the mean. In that case, the prediction errors are the deviations from the mean.

deviations = body_mass - np.mean(body_mass)

And the MSE is the mean squared deviation.

np.mean(deviations**2)
208890.28989956583
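As a quick check (not a proof) that the mean is the best constant guess, we can compare this value to the MSE we would get by guessing a little above or a little below the mean.

for guess in [body_mass.mean() - 100, body_mass.mean(), body_mass.mean() + 100]:
    # MSE if we always guess `guess`
    print(guess, np.mean((body_mass - guess) ** 2))

Guessing the mean yields the smallest of the three.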

And you might remember that the mean squared deviation is the variance.

np.var(body_mass)
208890.28989956583

So we can think of the variance of the masses as the MSE if we always guess the mean, and the variance of the residuals as the MSE if we use the regression line. If we compute the ratio of these variances and subtract it from 1, the result indicates how much the MSE is reduced if we use flipper lengths to inform our guesses.

The following function computes this value, which is technically called the coefficient of determination, but because it is denoted \(R^2\), most people call it “R squared”.

def coefficient_of_determination(ys, residuals):
    return 1 - np.var(residuals) / np.var(ys)

In the example, \(R^2\) is about 0.22, which means that the fitted line reduces MSE by 22%.

R2 = coefficient_of_determination(body_mass, residuals)
R2
0.21921282646854912

It turns out that there’s a relationship between the coefficient of determination, \(R^2\), and the coefficient of correlation, \(r\). As you might guess, just based on the notation, \(r^2 = R^2\).

We can show that’s true by computing the square root of \(R^2\).

r = np.sqrt(R2)
r
0.4682016942179397

And comparing it to the correlation we computed earlier.

corr = np.corrcoef(flipper_length, body_mass)[0, 1]
corr
0.4682016942179394

The linregress function also computes this value and returns it as an attribute of the LinregressResult object.

result.rvalue
0.46820169421793933

The coefficients of determination and correlation convey mostly the same information, but they are interpreted differently:

  • Correlation quantifies the strength of the relationship on a scale from -1 to 1.

  • \(R^2\) quantifies the ability of the fitted line to reduce MSE.

Also, \(R^2\) is never negative, so it doesn’t indicate whether the correlation is positive or negative.
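If you need the sign as well, you can take it from the slope of the fitted line, which always has the same sign as the correlation.

np.sign(result.slope) * np.sqrt(R2)  # reconstructs r, including its sign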

10.3. Minimizing MSE#

Earlier I said that the least squares fit is the straight line that minimizes the mean squared error (MSE). We won’t prove that, but we can test it by adding small random values to the intercept and slope, and checking whether the MSE gets worse.

The types module provides a class called SimpleNamespace, which we can use to make an object with intercept and slope attributes.

from types import SimpleNamespace

intercept = result.intercept + np.random.normal(0, 1)
slope = result.slope + np.random.normal(0, 1)
fake_result = SimpleNamespace(intercept=intercept, slope=slope)
fake_result
namespace(intercept=-2536.6741320174224, slope=33.61896590421531)

We can pass this object to compute_residuals and use the residuals to compute the MSE.

fake_residuals = compute_residuals(fake_result, flipper_length, body_mass)
fake_mse = np.mean(fake_residuals**2)

If we compare the result to the MSE of the least squares line, it is always worse.

mse, fake_mse, fake_mse > mse
(163098.85902884745, 185239.44960298543, True)

Minimizing MSE is nice, but it’s not the only definition of “best”. One alternative is to minimize the absolute values of the errors – the mean absolute error (MAE). Another is to minimize the perpendicular distance from each point to the fitted line, an approach known as “total least squares”. In some contexts, guessing too high might be better (or worse) than guessing too low. In that case you might want to compute a cost function for each residual, and minimize total cost.
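As a sketch of how one of these alternatives might be computed – this is just an illustration using scipy.optimize, not something we’ll need later – here is one way to search for the line that minimizes the mean absolute error, starting from the least squares fit.

from scipy.optimize import minimize


def mean_abs_error(params, xs, ys):
    # params contains the intercept and slope of a candidate line
    intercept, slope = params
    return np.mean(np.abs(ys - (intercept + slope * xs)))


x0 = [result.intercept, result.slope]
minimize(mean_abs_error, x0, args=(flipper_length, body_mass), method="Nelder-Mead").x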

But the least squares fit is much more widely used than the alternatives, primarily because it is efficient to compute. The following function shows how.

def least_squares(xs, ys):
    xbar = np.mean(xs)
    ybar = np.mean(ys)

    xdev = xs - xbar
    ydev = ys - ybar

    slope = np.sum(xdev * ydev) / np.sum(xdev**2)
    intercept = ybar - slope * xbar

    return intercept, slope
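In math notation, these are the standard formulas the function implements, where \(b\) is the slope and \(a\) is the intercept:

\[b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \qquad a = \bar{y} - b\,\bar{x}\]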

To test this function, we’ll use flipper length and body mass again.

intercept, slope = least_squares(flipper_length, body_mass)
intercept, slope
(-2535.8368022002524, 32.831689751150094)

And we can confirm that we get the same results we got from linregress.

np.allclose([intercept, slope], [result.intercept, result.slope])
True

Minimizing MSE made sense when computational efficiency was more important than choosing the method most appropriate to the problem at hand. But that’s no longer the case, so it is worth considering whether squared residuals are the right thing to minimize.

10.4. Estimation#

The parameters slope and intercept are estimates based on a sample; like other estimates, they are vulnerable to non-representative sampling, measurement error, and variability due to random sampling. As usual, it’s hard to quantify the effect of non-representative sampling and measurement error. It’s easier to quantify the effect of random sampling – by running simulated experiments and computing sampling distributions of the estimates.

We can simulate the experiments by resampling the data; that is, by treating the sample as if it were the whole population and drawing new samples, with replacement, from the observed data. The following function takes a DataFrame and uses the sample method to resample the rows and return a new DataFrame.

def resample(df):
    n = len(df)
    return df.sample(n, replace=True)

And the following function takes a DataFrame, finds the least squares fit, and returns the slope of the fitted line.

def estimate_slope(df):
    xs, ys = df["Flipper Length (mm)"], df["Body Mass (g)"]
    result = linregress(xs, ys)
    return result.slope

We can use these functions to generate many simulated datasets and compute the slope for each one.

resampled_slopes = [estimate_slope(resample(adelie)) for i in range(1001)]

The result is a sample from the sampling distribution of the slope. Here’s what it looks like.

from thinkstats import plot_kde

plot_kde(resampled_slopes)

decorate(xlabel="Slope of the fitted line (g / mm)", ylabel="Density")

We can use percentile to compute a 90% confidence interval.

ci90 = np.percentile(resampled_slopes, [5, 95])
print(result.slope, ci90)
32.83168975115009 [25.70959875 40.96580407]

So we could report that the estimated slope is 33 grams / mm with a 90% CI [26, 41] grams / mm.

The standard error of the estimate is the standard deviation of the sampling distribution, which we can approximate with the standard deviation of the resampled slopes.

stderr = np.std(resampled_slopes)
stderr
4.663133263255994

The LinregressResult object we got from linregress provides an approximation of the standard error, based on some assumptions about the shape of the distribution.

result.stderr
5.076138407990821

The standard error we computed by resampling is a little smaller, but the difference probably doesn’t matter in practice.

Finally, to compute a p-value, we can approximate the sampling distribution with a normal distribution that has the same mean and standard deviation.

from scipy.stats import norm

m, s = np.mean(resampled_slopes), np.std(resampled_slopes)
sampling_dist = norm(m, s)

Since the estimated slope is positive, the p-value is the probability that the resampled slope is negative, which we can compute with the CDF of the normal distribution.

p_value = sampling_dist.cdf(0)
p_value
6.564638066114596e-13

The p-value is very small, so we can conclude that the observed slope would have been very unlikely if there were actually no relationship between flipper length and body weight.

10.5. Visualizing Uncertainty#

Each time we resample the dataset, we get a different fitted line. To see how much variation there is in the lines, one option is to loop through them and plot them all. The following function takes a resampled DataFrame, computes a least squares fit, and generates predicted values for a sequence of xs.

def fit_line(df, fit_xs):
    xs, ys = df["Flipper Length (mm)"], df["Body Mass (g)"]
    result = linregress(xs, ys)
    fit_ys = predict(result, fit_xs)
    return fit_ys

Here’s the sequence of xs we’ll use.

xs = adelie["Flipper Length (mm)"]
fit_xs = np.linspace(np.min(xs), np.max(xs))

And here’s what the fitted lines look like, along with a scatter plot of the data.

plt.scatter(flipper_length, body_mass, marker=".", alpha=0.5)

for i in range(101):
    fit_ys = fit_line(resample(adelie), fit_xs)
    plt.plot(fit_xs, fit_ys, color="C1", alpha=0.05)

decorate(xlabel=xvar, ylabel=yvar)

Near the middle, the fitted lines are close together – at the extremes, they are farther apart.

Another way to represent the variability of the fitted lines is to plot a 90% confidence interval for each predicted value. We can do that by collecting the fitted lines as a list of arrays.

fitted_ys = [fit_line(resample(adelie), fit_xs) for i in range(1001)]

We can think of this list of arrays as a two-dimensional array with one row for each fitted line and one column corresponding to each of the xs.
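To confirm, we can convert the list to an array and check its shape – with 1001 resamplings and the default number of points from np.linspace, it should have 1001 rows and 50 columns.

np.array(fitted_ys).shape  # one row per fitted line, one column per value in fit_xs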

Now we can use percentile with the axis=0 argument to find the 5th, 50th, and 95th percentiles of the ys corresponding to each of the xs.

low, median, high = np.percentile(fitted_ys, [5, 50, 95], axis=0)

Now we’ll use fill_between to plot a region between the 5th and 95th percentiles, which represents the 90% CI, along with the median value in each column and a scatter plot of the data.

plt.scatter(flipper_length, body_mass, marker=".", alpha=0.5)

plt.fill_between(fit_xs, low, high, color="C1", lw=0, alpha=0.2)
plt.plot(fit_xs, median, color="C1")

decorate(xlabel=xvar, ylabel=yvar)

This is my favorite way to represent the variability of a fitted line due to random sampling.

10.6. Transformation#

Before fitting a line to data, it is sometimes useful to transform one or both variables, for example by computing the squares of the values, their square roots, or their logarithms. To demonstrate, we’ll use heights and weights from the BRFSS, which we saw in Section xxx.

The following cell downloads the BRFSS data.

download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/CDBRFS08.ASC.gz")

We can load the BRFSS data like this.

from thinkstats import read_brfss

brfss = read_brfss()

Next we’ll select the rows with valid data and select the columns containing heights in centimeters and weights in kilograms.

valid = brfss.dropna(subset=["htm3", "wtkg2"])
heights, weights = valid["htm3"], valid["wtkg2"]

We can use linregress to compute the slope and intercept of the least squares fit.

result_brfss = linregress(heights, weights)
result_brfss.intercept, result_brfss.slope
(-82.65926054409877, 0.957074585033226)

The slope is about 0.96, which means that an increase of 1 centimeter corresponds to an increase of almost 1 kilogram, on average. We can use predict again to generate predicted values for a range of xs.

fit_xs = np.linspace(heights.min(), heights.max())
fit_ys = predict(result_brfss, fit_xs)

Before we make a scatter plot of the data, it’s useful to jitter the heights and weights – that is, add a small amount of random noise – so that values rounded to the same number don’t all land on the same point.

from thinkstats import jitter

jittered_heights = jitter(heights, 2)
jittered_weights = jitter(weights, 1.5)

And we’ll use the mean and standard deviation of the heights to choose the limits of the \(x\) axis.

m, s = heights.mean(), heights.std()
xlim = m - 4 * s, m + 4 * s
ylim = 0, 200

Here’s a scatter plot of the jittered data along with the fitted line.

plt.scatter(jittered_heights, jittered_weights, alpha=0.01, s=0.1)
plt.plot(fit_xs, fit_ys, color="C1")
decorate(xlabel="Height (cm)", ylabel="Weight (kg)", xlim=xlim, ylim=ylim)

It might seem like the fitted line doesn’t pass through the densest part of the scatter plot. That’s because the weights don’t follow a normal distribution. As we saw in Section xxx (Chapter 5), adult weights tend to follow a lognormal distribution, which is skewed toward larger values – and that pulls the fitted line up.

Another cause for concern is the distribution of the residuals, which looks like this.

residuals = compute_residuals(result_brfss, heights, weights)
from thinkstats import make_pmf

pmf_kde = make_pmf(residuals, -60, 120)
pmf_kde.plot()

decorate(xlabel="Residual (kg)", ylabel="Density")

The distribution of the residuals is skewed to the right. By itself, that’s not necessarily a problem, but it suggests that the least squares fit has not described the relationship between these variables precisely.
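If we want to quantify the asymmetry, one option – an aside, not something we’ll use again – is the sample skewness, which is positive when the right tail is longer.

from scipy.stats import skew

skew(residuals)  # positive, consistent with the right-skewed plot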

If the weights follow a lognormal distribution, their logarithms follow a normal distribution. So let’s see what happens if we fit a line to the logarithms of weight as a function of height.

log_weights = np.log10(weights)
result_brfss2 = linregress(heights, log_weights)
result_brfss2.intercept, result_brfss2.slope
(0.9930804163932876, 0.005281454169417777)

Because we transformed one of the variables, the slope and intercept are harder to interpret directly.
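One way to read the new slope: because the weights are on a log10 scale, each additional centimeter of height multiplies the predicted weight by 10 raised to the slope.

10**result_brfss2.slope  # multiplicative change in weight per additional centimeter

That works out to an increase of about 1.2% per centimeter. As before, we can use predict to compute the fitted line.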

fit_xs = np.linspace(heights.min(), heights.max())
fit_ys = predict(result_brfss2, fit_xs)

And then plot it along with a scatter plot of the transformed data.

# jitter the weights on the original scale, then transform
jittered_log_weights = np.log10(jittered_weights)
plt.scatter(jittered_heights, jittered_log_weights, alpha=0.01, s=0.1)
plt.plot(fit_xs, fit_ys, color="C1")
decorate(xlabel="Height (cm)", ylabel="Weight (log10 kg)", xlim=xlim)

The fitted line passes through the densest part of the plot, and the actual values extend about the same distance above and below the line. And the distribution of the residuals is roughly symmetric.

residuals = compute_residuals(result_brfss2, heights, log_weights)
pmf_kde = make_pmf(residuals, -0.6, 0.6)
pmf_kde.plot()

decorate(xlabel="Residual (kg)", ylabel="Density")

The appearance of the scatter plot and the distribution of the residuals suggest that the relationship of height and log-transformed weight is well described by the fitted line. If we compare the \(r\) values of the two regressions, we see that the correlation of height with log-transformed weights is slightly higher.

result_brfss.rvalue, result_brfss2.rvalue
(0.5087364789734582, 0.5317282605983435)

That means the \(R^2\) value is slightly higher, too.

result_brfss.rvalue**2, result_brfss2.rvalue**2
(0.2588128050383119, 0.28273494311893993)

If we use heights to guess weights, the guesses will be a little better if we work with the log-transformed weights.

However, transforming the data can make the results harder to interpret – it can help to invert the transformation before presenting the results. For example, the inverse of a logarithm in base 10 is exponentiation with base 10. Here’s what the fitted line looks like after the inverse transformation, along with the untransformed data.

plt.scatter(jittered_heights, jittered_weights, alpha=0.01, s=0.1)
plt.plot(fit_xs, 10**fit_ys, color="C1")
decorate(xlabel="Height (cm)", ylabel="Weight (kg)", xlim=xlim, ylim=ylim)

A fitted line that’s straight with respect to the transformed data is curved with respect to the original data.

10.7. Glossary#

  • linear fit: A line intended to model the relationship between variables.

  • least squares fit: A model of a dataset that minimizes the sum of squares of the residuals.

  • residual: The deviation of an actual value from a model.

  • goodness of fit: A measure of how well a model fits data.

  • coefficient of determination: A statistic intended to quantify goodness of fit.

  • sampling weight: A value associated with an observation in a sample that indicates what part of the population it represents.

10.8. Exercises#

10.8.1. Exercise#

In this chapter we computed a least squares fit for penguin weights as a function of flipper length. There are two other measurements in the dataset we can also consider: culmen length and culmen depth (the culmen is the top ridge of the bill).

Compute the least squares fit for weight as a function of culmen length. Make a scatter plot of these variables and plot the fitted line.

Based on the rvalue attribute of the LinregressResult object, what is the correlation of these variables? What is the coefficient of determination? Which is a better predictor of weight, culmen length or flipper length?

xvar = "Culmen Length (mm)"
yvar = "Body Mass (g)"

culmen_length = adelie[xvar]
body_mass = adelie[yvar]

10.8.2. Exercise#

In this chapter we used resampling to compute the sampling distribution for the slope of a fitted line. We can compute the sampling distribution of the intercept the same way:

  1. Write a function called estimate_intercept that takes a resampled DataFrame as an argument, computes the least squares fit of penguin weight as a function of flipper length, and returns the intercept.

  2. Call the function with many resampled versions of adelie and collect the intercepts.

  3. Use plot_kde to plot the sampling distribution of the intercept.

  4. Compute the standard error and a 90% confidence interval.

  5. Check that the standard error you get from resampling is consistent with the intercept_stderr attribute in the LinregressResult object – it might be a little smaller.

10.8.3. Exercise#

A person’s Body Mass Index (BMI) is their weight in kilograms divided by their height in meters raised to the second power. In the BRFSS dataset, we can compute BMI like this, after converting heights from centimeters to meters.

heights_m = heights / 100
bmis = weights / heights_m**2

In this definition, heights are squared, rather than raised to some other exponent, because of the observation – early in the history of statistics – that average weight increases roughly in proportion to height squared.

To see whether that’s true, we can use data from the BRFSS, a least squares fit, and a little bit of math. Suppose weight is proportional to height raised to an unknown exponent, \(a\). In that case, we can write:

\[w = b h^a\]

where \(w\) is weight, \(h\) is height, and \(b\) is an unknown constant of proportionality. Taking logarithms of both sides:

\[\log w = \log b + a \log h\]

So, if we compute a least squares fit for log-transformed weights as a function of log-transformed heights, the slope of the fitted line estimates the unknown exponent \(a\).

Compute the logarithms of height and weight. You can use any base for the logarithms, as long as it’s the same for both transformations. Compute a least squares fit. Is the slope close to 2?
