11. Regression#

The linear least squares fit in the previous chapter is an example of regression, which is the more general problem of quantifying the relationship between one set of variables, called dependent variables or outcomes, and another set of variables, called independent variables or predictors.

In the examples in the previous chapter, there is only one dependent and one explanatory variable, which is called simple regression. In this chapter, we move on to multiple regression, with more than one explanatory variable, but still only one dependent variable. If there is more than one dependent variable, that’s multivariate regression, which we won’t cover in this book.


%load_ext nb_black
%load_ext autoreload
%autoreload 2
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/thinkstats.py")
try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from thinkstats import decorate

11.1. StatsModels#

In the previous chapter we used the SciPy function linregress to compute a least squares fit. This function performs simple regression, but not multiple regression. For that, we’ll use StatsModels, a package that provides several forms of regression and other analyses.

As a first example, we’ll continue exploring the penguin data.

The following cell downloads the data.

download(
    "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/c19a904462482430170bfe2c718775ddb7dbb885/inst/extdata/penguins_raw.csv"
)

When we load the data, we’ll use the following dictionary to give the columns names that don’t contain spaces, which will make them easier to use with StatsModels.

columns = {
    "Body Mass (g)": "mass",
    "Flipper Length (mm)": "flipper_length",
    "Culmen Length (mm)": "culmen_length",
    "Culmen Depth (mm)": "culmen_depth",
}

Now we can load the data, drop the rows with missing weights, and rename the columns.

penguins = (
    pd.read_csv("penguins_raw.csv")
    .dropna(subset=["Body Mass (g)"])
    .rename(columns=columns)
)
penguins.shape
(342, 17)

The dataset contains three species of penguins. We’ll work with just the Adélie penguins.

adelie = penguins.query('Species.str.startswith("Adelie")').copy()
len(adelie)
151

In the previous chapter, we computed a least squares fit between the penguins’ flipper lengths and weights.

flipper_length = adelie["flipper_length"]
body_mass = adelie["mass"]

As a reminder, here’s how we did that with linregress.

from scipy.stats import linregress

result_linregress = linregress(flipper_length, body_mass)
result_linregress.intercept, result_linregress.slope
(-2535.8368022002514, 32.83168975115009)

StatsModels provides two interfaces (APIs) – we’ll use the “formula” API, which uses the Patsy formula language to identify the dependent and explanatory variables. The following formula string specifies that the dependent variable, mass, is a linear function of one explanatory variable, flipper_length.

import statsmodels.formula.api as smf

formula = "mass ~ flipper_length"

We can pass this formula to the StatsModels function ols, along with the data.

model = smf.ols(formula, data=adelie)
type(model)
statsmodels.regression.linear_model.OLS

The name ols stands for “ordinary least squares”, which indicates that this function computes a least squares fit under the most common, or “ordinary”, set of assumptions.

The result is an OLS object that represents the model. In general, a model is a simplified description of the relationship between variables. In this example, it’s a linear model, which means that it assumes that the dependent variable is a linear function of the explanatory variables.
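In symbols, and assuming the usual ordinary least squares setup, this model says that \(\text{mass} = \beta_0 + \beta_1 \cdot \text{flipper\_length} + \varepsilon\), where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is a random residual.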

The fit method fits the model to the data and returns a RegressionResults object that contains the result.

result_ols = model.fit()
# Technically it's a RegressionResultsWrapper

type(result_ols)
statsmodels.regression.linear_model.RegressionResultsWrapper

The RegressionResults object contains a lot of information, so thinkstats provides a function that just displays the information we need for now.

from thinkstats import display_summary

display_summary(result_ols)
coef std err t P>|t| [0.025 0.975]
Intercept -2535.8368 964.798 -2.628 0.009 -4442.291 -629.382
flipper_length 32.8317 5.076 6.468 0.000 22.801 42.862
R-squared: 0.2192

The first column contains the intercept and slope, which are the coefficients of the model, also called parameters. We can confirm that they are the same as the parameters we got from linregress.

result_linregress.intercept, result_linregress.slope
(-2535.8368022002514, 32.83168975115009)
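The coefficients are also available programmatically as the params attribute of the results object, a Series indexed by variable name – a quick sketch, not part of the original example.

result_ols.params["Intercept"], result_ols.params["flipper_length"]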

The second column contains the standard errors of the estimated parameters – again, they are the same as the values we got from linregress.

result_linregress.intercept_stderr, result_linregress.stderr
(964.7984274994059, 5.076138407990821)

The next column reports t statistics, which are used to compute p-values – we can ignore them, because the p-values are in the next column, labeled P>|t|. The p-value for flipper_length is rounded down to 0, but we can display it like this.

result_ols.pvalues["flipper_length"]
1.3432645947789321e-09

And confirm that linregress computed the same result.

result_linregress.pvalue
1.3432645947790051e-09

The p-value is very small, which means that if there were actually no relationship between weight and flipper length, it is very unlikely we would see a slope as big as the estimated value – 32.8 grams per mm – by chance.

The last two columns, labeled [0.025 and 0.975], report 95% confidence intervals for the intercept and slope. So the 95% CI for the slope is [22.8, 42.9].
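If we want the confidence intervals programmatically rather than reading them from the table, the results object provides a conf_int method – a quick sketch, not part of the original example.

result_ols.conf_int()

The result is a DataFrame with one row per coefficient and the lower and upper bounds in two columns.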

The last line reports the \(R^2\) value of the model, which is about 0.22 – that means we can reduce MSE by about 22% if we use flipper length to predict weight.

For a simple regression, \(R^2\) is the square of the correlation, \(r\), between the dependent and explanatory variables. So we can compare the \(R^2\) computed by ols with the \(r^2\) computed by linregress.

result_linregress.rvalue**2
0.21921282646854875
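To connect \(R^2\) to prediction error, we can also compute it from the residuals – the fraction of variance left unexplained is the ratio of the residual variance to the variance of the dependent variable. This check is not in the original text.

result_ols.rsquared, 1 - result_ols.resid.var() / body_mass.var()

Both expressions give the same value, about 0.22.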

Before we go on to multiple regression, let’s compute one more simple regression, with culmen_length as the explanatory variable (the culmen is the top ridge of the bill).

formula = "mass ~ culmen_length"
result = smf.ols(formula, data=adelie).fit()
display_summary(result)
coef std err t P>|t| [0.025 0.975]
Intercept 34.8830 458.439 0.076 0.939 -870.998 940.764
culmen_length 94.4998 11.790 8.015 0.000 71.202 117.798
R-squared: 0.3013

Again, the p-value of the slope is very small, which means that if there were actually no relationship between mass and culmen length, it is unlikely we would see a slope this big by chance. You might notice that the p-value associated with the intercept is large, but that’s not a problem because we are not concerned about whether the intercept might be zero. In this model, the intercept is close to zero, but that’s just a coincidence – it doesn’t indicate a problem with the model.

The \(R^2\) value for this model is about 0.30, so the reduction in MSE is a little higher if we use culmen length rather than flipper length as an explanatory variable (the \(R^2\) value with flipper length is 0.22).

Now, let’s see what happens if we combine them.

11.2. On To Multiple Regression#

Here’s the formula for a multiple regression model where mass is a linear function of both flipper length and culmen length.

formula = "mass ~ flipper_length + culmen_length"

And here is the result of fitting this model to the data.

result = smf.ols(formula, data=adelie).fit()
display_summary(result)
coef std err t P>|t| [0.025 0.975]
Intercept -3573.0817 866.739 -4.122 0.000 -5285.864 -1860.299
flipper_length 22.7024 4.742 4.787 0.000 13.331 32.074
culmen_length 76.3402 11.644 6.556 0.000 53.331 99.350
R-squared: 0.3949

This model has three coefficients, an intercept and two slopes. The slope associated with flipper length is 22.7, which means we expect a penguin with a longer flipper, by one millimeter, to weigh more, by 22.7 grams – assuming that culmen length is the same. Similarly, we expect a penguin with a longer culmen, by one millimeter, to weigh more, by 76.3 grams – assuming that flipper length is the same.
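To see what that means in practice, here’s a small sketch that predicts weights for two hypothetical penguins that differ only in flipper length, by one millimeter – the specific measurements are made up for illustration.

two_penguins = pd.DataFrame(
    {"flipper_length": [190, 191], "culmen_length": [38.8, 38.8]}
)
result.predict(two_penguins).diff()

The difference between the two predictions matches the flipper_length coefficient, about 22.7 grams.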

The p-values associated with both slopes are small, which means that the contribution of both explanatory variables would be unlikely to happen by chance.

And the \(R^2\) value is 0.39, higher than the model with only culmen length (0.30) and the model with only flipper length (0.22). So predictions based on both explanatory variables are better than predictions based on either one alone.

But it is not as much better as we might have hoped. If flipper length reduces MSE by 22% and culmen length reduces it by 30%, why don’t the two of them together reduce it by a total of 52%? The reason is that the explanatory variables are correlated with each other.

from thinkstats import corrcoef

corrcoef(adelie, "flipper_length", "culmen_length")
0.32578471516515944

A penguin with a longer flipper also has a longer culmen, on average. The explanatory variables contain some information about each other, which means that they contain some of the same information about the dependent variable. When we add an explanatory variable to the model, the improvement in \(R^2\) reflects only the new information provided by the new variable.
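To see these pairwise relationships at once, we can compute a correlation matrix for the dependent variable and the two explanatory variables – a quick sketch, not part of the original example.

adelie[["mass", "flipper_length", "culmen_length"]].corr()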

We see the same pattern if we add culmen depth as a third explanatory variable.

formula = "mass ~ flipper_length + culmen_length + culmen_depth"
result = smf.ols(formula, data=adelie).fit()
display_summary(result)
coef std err t P>|t| [0.025 0.975]
Intercept -4341.3019 795.117 -5.460 0.000 -5912.639 -2769.964
flipper_length 17.4215 4.385 3.973 0.000 8.756 26.087
culmen_length 55.3676 11.133 4.973 0.000 33.366 77.369
culmen_depth 140.8946 24.216 5.818 0.000 93.037 188.752
R-squared: 0.5082

This model has four coefficients. All of the p-values are small, which means that the contribution of each explanatory variable would be unlikely to happen by chance. And the \(R^2\) value is about 0.51, somewhat better than the previous model with two explanatory variables (0.39), and better than either model with a single variable (0.22 and 0.30).

But again, the incremental improvement is smaller than we might have hoped, because culmen depth is correlated with the other two measurements.

[
    corrcoef(adelie, "culmen_depth", "flipper_length"),
    corrcoef(adelie, "culmen_depth", "culmen_length"),
]
[0.30762017939668534, 0.39149169183587634]

This example demonstrates a common use of multiple regression, combining multiple explanatory variables to make better predictions. Another common use is to quantify the contribution of one set of variables while controlling for the contribution of another set.

11.3. Control Variables#

In Section <<section_comparing_cdfs>> we saw that first babies are lighter than other babies, on average. And in Section <<section_test_correlation>> we saw that birth weight is correlated with the mother’s age – older mothers have heavier babies on average.

These results might be related. If mothers of first babies are younger than mothers of other babies – which seems likely – that might explain why their babies are lighter. We can use multiple regression to test this conjecture, by estimating the difference in birth weight between first babies and others while controlling for the mothers’ ages.

Instructions for downloading the data are in the notebook for this chapter.

The following cells download the data files and install statadict, which we need to read the data.

download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/nsfg.py")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dct")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dat.gz")
try:
    import statadict
except ImportError:
    !pip install statadict

We can use get_nsfg_groups to read the NSFG data, select live births, and group live births into first babies and others.

from nsfg import get_nsfg_groups

live, firsts, others = get_nsfg_groups()

We’ll use dropna to select the rows with valid birth weights, birth order, and mother’s ages.

valid = live.dropna(subset=["agepreg", "birthord", "totalwgt_lb"]).copy()

Now we can use StatsModels to confirm that birth weight is correlated with age, and to estimate the slope – assuming that it is a linear relationship.

formula = "totalwgt_lb ~ agepreg"
result_age = smf.ols(formula, data=valid).fit()
display_summary(result_age)
coef std err t P>|t| [0.025 0.975]
Intercept 6.8304 0.068 100.470 0.000 6.697 6.964
agepreg 0.0175 0.003 6.559 0.000 0.012 0.023
R-squared: 0.004738

The slope is small, only 0.0175 pounds per year. So if two mothers differ in age by a decade, we expect their babies to differ in weight by 0.175 pounds. But the p-value is small, so this slope – small as it is – would be unlikely if there were actually no relationship.

The \(R^2\) value is also small, which means that mother’s age is not very useful as a predictive variable. If we know the mother’s age, our ability to predict the baby’s weight is hardly improved at all.

This combination of a small p-value and a small \(R^2\) value is a common source of confusion, because it seems contradictory – if the relationship is statistically significant, it seems like it should be predictive. But this example shows that there is no contradiction – a relationship can be statistically significant but not very useful for prediction.

If we visualize the results, we’ll see why. First let’s select the relevant columns.

totalwgt = valid["totalwgt_lb"]
agepreg = valid["agepreg"]

To compute the fitted line, we could extract the intercept and slope from result_age, but we don’t have to. The RegressionResults object provides a predict method we can use instead. First we’ll compute a range of values for agepreg.

agepreg_range = np.linspace(agepreg.min(), agepreg.max())

To use predict, we have to put values for the explanatory variables in a DataFrame.

df = pd.DataFrame({"agepreg": agepreg_range})

The columns in the DataFrame have to have the same names as the explanatory variables. Then we can pass it to predict.

fit_ys = result_age.predict(df)

The result is a Series containing the predicted values. Here’s what they look like, along with a scatter plot of the data.

plt.scatter(agepreg, totalwgt, marker=".", alpha=0.1, s=5)
plt.plot(agepreg_range, fit_ys, color="C1", label="Linear model")

decorate(xlabel="Maternal Age", ylabel="Birth weight (pounds)")
[Figure: scatter plot of birth weight versus maternal age, with the fitted line]

Because the slope of the fitted line is small, we can barely see the difference in the expected birth weight between the youngest and oldest mothers. The variation in birth weight, at every maternal age, is much larger.
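We can quantify that observation by comparing the spread of the residuals to the spread of the birth weights themselves – a quick check, not in the original text.

totalwgt.std(), result_age.resid.std()

The two standard deviations are nearly identical, which is another way to see that maternal age explains very little of the variation in birth weight.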

Next we’ll use StatsModels to confirm that first babies are lighter than others. To make that work, we’ll add a column to valid, a Boolean Series that is True for first babies and False for others.

valid["is_first"] = valid["birthord"] == 1
from thinkstats import value_counts

# check the results
value_counts(valid["is_first"])
is_first
False    4675
True     4363
Name: count, dtype: int64

Here’s the formula for a model with birth weight as the dependent variable and is_first as the explanatory variable. In the Patsy formula language, C and the parentheses around the variable name indicate that it is categorical, not numerical.

formula = "totalwgt_lb ~ C(is_first)"

Now we can fit the model and display the results, as usual.

result_first = smf.ols(formula, data=valid).fit()
display_summary(result_first)
coef std err t P>|t| [0.025 0.975]
Intercept 7.3259 0.021 356.007 0.000 7.286 7.366
C(is_first)[T.True] -0.1248 0.030 -4.212 0.000 -0.183 -0.067
R-squared: 0.00196

In the results, the label C(is_first)[T.True] indicates that is_first is a categorical variable and the coefficient is associated with the value True. The T before True stands for “treatment” – in the language of a controlled experiment, first babies are considered the treatment group and other babies are considered the reference group. These designations are arbitrary – we could consider first babies to be the reference group and others to be the treatment group. But we need to know which is which in order to interpret the results.
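In the Patsy formula language, we could also make first babies the reference group by specifying the reference level explicitly with a Treatment contrast – this is just a sketch of the option, which we won’t use in the rest of the chapter.

formula_flipped = "totalwgt_lb ~ C(is_first, Treatment(reference=True))"
result_flipped = smf.ols(formula_flipped, data=valid).fit()
display_summary(result_flipped)

With this coding, the intercept is the mean for first babies and the coefficient is the difference for the others, with the opposite sign.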

The intercept is about 7.3, which means that the average weight of the reference group is 7.3 pounds. The “slope” is -0.12, which means that the average weight of the treatment group – first babies – is 0.12 pounds lighter. We can check both of these results by computing them directly.

others["totalwgt_lb"].mean()
7.325855614973262
diff_weight = firsts["totalwgt_lb"].mean() - others["totalwgt_lb"].mean()
diff_weight
-0.12476118453549034

In addition to these coefficients, StatsModels also computes p-values, confidence intervals, and \(R^2\). The p-value associated with first babies is small, which means that the difference between the groups is statistically significant. And the \(R^2\) value is small, which means that if we’re trying to guess the weight of a baby, it doesn’t help much to know whether it is a first baby.

Now let’s see if it’s plausible that the difference in birth weight is due to the difference in maternal age. On average, mothers of first babies are about 3.6 years younger than other mothers.

diff_age = firsts["agepreg"].mean() - others["agepreg"].mean()
diff_age
-3.5864347661500275

And the slope of birth weight as a function of age is 0.0175 pounds per year.

slope = result_age.params["agepreg"]
slope
0.017453851471802638

If we multiply the slope by the difference in ages, we get the expected difference in birth weight for first babies and others, due to mother’s age.

slope * diff_age
-0.0625970997216918

The result is 0.063 pounds, which is about half of the observed difference. So it seems like the observed difference in birth weight can be partly explained by the difference in mother’s age.
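Subtracting that from the observed difference gives the part that maternal age does not account for – a quick check, not in the original text.

diff_weight - slope * diff_age

The remainder is about 0.06 pounds, roughly the other half of the observed difference.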

Using multiple regression, we can estimate coefficients for maternal age and first babies at the same time.

formula = "totalwgt_lb ~ agepreg + C(is_first)"
result = smf.ols(formula, data=valid).fit()
display_summary(result)
coef std err t P>|t| [0.025 0.975]
Intercept 6.9142 0.078 89.073 0.000 6.762 7.066
C(is_first)[T.True] -0.0698 0.031 -2.236 0.025 -0.131 -0.009
agepreg 0.0154 0.003 5.499 0.000 0.010 0.021
R-squared: 0.005289

The coefficient of is_first is -0.0698, which means that first babies are 0.0698 pounds lighter than others, on average, after accounting for the difference due to maternal age. That’s about half of the difference we get without accounting for maternal age.

And the p-value is 0.025, which is still considered statistically significant, but it is in the borderline range where we can’t exclude the possibility that a difference this size could happen by chance.

Because this model takes into account the weight difference due to maternal age, we can say that it controls for maternal age. But it assumes that the relationship between weight and maternal age is linear. So let’s see if that’s true.

11.4. Nonlinear Relationships#

To check whether the contribution of agepreg might be nonlinear, we can add a new column to the dataset, which contains the values of agepreg squared.

valid["agepreg2"] = valid["agepreg"] ** 2

Now we can define a model that includes a linear relationship and a quadratic relationship.

formula = "totalwgt_lb ~ agepreg + agepreg2"

We can fit the model in the usual way.

result_age2 = smf.ols(formula, data=valid).fit()
display_summary(result_age2)
coef std err t P>|t| [0.025 0.975]
Intercept 5.5720 0.275 20.226 0.000 5.032 6.112
agepreg 0.1186 0.022 5.485 0.000 0.076 0.161
agepreg2 -0.0019 0.000 -4.714 0.000 -0.003 -0.001
R-squared: 0.00718

The p-value associated with the quadratic term, agepreg2, is very small, which suggests that it contributes more information about birth weight than we would expect by chance. And the \(R^2\) value for this model is 0.0072, higher than for the linear model (0.0047).
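As an aside, we could get the same fit without adding a column to the dataset – Patsy’s I() function lets us square agepreg inside the formula. Here’s a sketch of the equivalent model.

formula_alt = "totalwgt_lb ~ agepreg + I(agepreg**2)"
result_alt = smf.ols(formula_alt, data=valid).fit()
result_alt.rsquared  # should match the R-squared of the model above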

By estimating parameters for agepreg and agepreg2, we are effectively fitting a parabola to the data. To see that, we can use the RegressionResults object to generate predictions for a range of maternal ages.

First we’ll create a temporary DataFrame that contains columns named agepreg and agepreg2, based on the range of ages in agepreg_range.

df = pd.DataFrame({"agepreg": agepreg_range})
df["agepreg2"] = df["agepreg"] ** 2

Now we can use the predict method, passing the DataFrame as an argument and getting back a Series of predictions.

fit_ys = result_age2.predict(df)

Here’s what the fitted parabola looks like, along with a scatter plot of the data.

plt.plot(agepreg_range, fit_ys, color="C2", label="Quadratic model")

plt.scatter(agepreg, totalwgt, marker=".", alpha=0.1, s=5)

decorate(xlabel="Maternal Age", ylabel="Birth weight (pounds)")
[Figure: scatter plot of birth weight versus maternal age, with the fitted parabola]

The curvature is subtle, but it suggests that birth weights are lower for the youngest and oldest mothers, and higher in the middle.
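We can locate the peak of the fitted parabola by finding where its derivative is zero, at \(-\beta_1 / (2 \beta_2)\), where \(\beta_1\) and \(\beta_2\) are the coefficients of agepreg and agepreg2 – a quick sketch using the unrounded coefficients.

params = result_age2.params
-params["agepreg"] / (2 * params["agepreg2"])

The peak falls in the early thirties, consistent with the shape of the curve.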

The quadratic model captures the relationship between these variables better than the linear model, which means it can account more effectively for the difference in birth weight due to maternal age. So let’s see what happens when we add is_first to the quadratic model.

formula = "totalwgt_lb ~ agepreg + agepreg2 + C(is_first)"
result = smf.ols(formula, data=valid).fit()
display_summary(result)
coef std err t P>|t| [0.025 0.975]
Intercept 5.6923 0.286 19.937 0.000 5.133 6.252
C(is_first)[T.True] -0.0504 0.031 -1.602 0.109 -0.112 0.011
agepreg 0.1124 0.022 5.113 0.000 0.069 0.155
agepreg2 -0.0018 0.000 -4.447 0.000 -0.003 -0.001
R-squared: 0.007462

With a more effective control for maternal age, the estimated difference between first babies and others is 0.0504 pounds, smaller than the estimate in the linear model (0.0698 pounds). And the p-value associated with is_first is 0.109, which means it is plausible that the remaining difference between these groups is due to chance.

We can conclude that the difference in birth weight is explained – at least in part and possibly in full – by the difference in mother’s age.

11.5. Logistic Regression#

Linear regression is based on a model where the dependent variable is the weighted sum of the explanatory variables, an intercept, and a random residual. This model is appropriate when the dependent variable is a continuous quantity like birth weight or penguin weight, but not when the dependent variable is a discrete quantity like a count or a categorical variable.

For these kinds of dependent variables, we can use generalized linear models or GLMs. For example:

  • If the dependent variable is a count, we can use Poisson regression.

  • If it’s categorical with only two categories, we can use logistic regression.

  • If it’s categorical with more than two categories, we can use multinomial logistic regression.

  • If it’s categorical and ordered, we can use ordered logistic regression.

We won’t cover all of them – just logistic regression, which is the most widely used. As an example, we’ll use the penguin dataset again, and see if we can tell whether a penguin is male or female, based on its weight and other measurements.

StatsModels provides a function that does logistic regression – it’s called logit because that’s the name of a mathematical function that appears in the definition of logistic regression. Before we can use the logit function, we have to transform the dependent variable so the values are 0 and 1.

adelie["y"] = (adelie["Sex"] == "MALE").astype(int)
adelie["y"].value_counts()
y
0    78
1    73
Name: count, dtype: int64
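As an aside, the logit function that gives this method its name maps a probability to its log-odds, and its inverse – the logistic or sigmoid function – maps a log-odds back to a probability. Logistic regression estimates coefficients on the log-odds scale, which is one reason they are hard to interpret directly. Here is a minimal sketch of the two functions, not part of the original example.

def logit(p):
    # log-odds of a probability p
    return np.log(p / (1 - p))


def expit(x):
    # inverse of logit, also called the logistic or sigmoid function
    return 1 / (1 + np.exp(-x))


expit(logit(0.25))  # the round trip recovers the probability, 0.25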

We’ll start with a simple model with y as the dependent variable and mass as the explanatory variable. Here’s how we make and fit the model – the argument disp=False suppresses messages about the fitting process.

model = smf.logit("y ~ mass", data=adelie)
result = model.fit(disp=False)

And here are the results.

display_summary(result)
coef std err z P>|z| [0.025 0.975]
Intercept -25.9871 4.221 -6.156 0.000 -34.261 -17.713
mass 0.0070 0.001 6.138 0.000 0.005 0.009
Pseudo R-squared: 0.5264

The coefficient of determination, \(R^2\), does not apply to logistic regression, but there are several alternatives that are used as “pseudo \(R^2\) values.” The pseudo \(R^2\) value for this model is about 0.526, which doesn’t mean much by itself, but we will use it to compare models.
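If we want the pseudo \(R^2\) value programmatically rather than reading it from the summary, it is available as the prsquared attribute of the results object – a quick sketch.

result.prsquared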

The parameter of mass is positive, which means that heavier penguins are more likely to be male. Other than that, the estimated coefficients are not easy to interpret. But we can understand the model better by plotting the predictions. We’ll make a DataFrame with a range of values for mass, and use predict to compute a Series of predictions.

mass = adelie["mass"]
mass_range = np.linspace(mass.min(), mass.max())
df = pd.DataFrame({"mass": mass_range})
fit_ys = result.predict(df)

Each predicted value is the probability a penguin is male as a function of its weight. Here’s what the predicted values look like.

plt.plot(mass_range, fit_ys)

decorate(xlabel="Mass (g)", ylabel="Prob(male)")
[Figure: predicted probability that a penguin is male, as a function of mass]

The lightest penguins are almost certain to be female, and the heaviest are likely to be male – in the middle, a penguin weighing 3750 grams is equally likely to be male or female.
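We can compute that crossover point directly: the predicted probability is 0.5 where the linear predictor – the intercept plus the slope times mass – equals zero. Here’s a quick check using the unrounded coefficients, not in the original text.

-result.params["Intercept"] / result.params["mass"]

The result is close to the crossover visible in the plot.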

Now let’s see what happens if we add the other measurements as explanatory variables.

formula = "y ~ mass + flipper_length + culmen_length + culmen_depth"
model = smf.logit(formula, data=adelie)
result = model.fit(disp=False)
display_summary(result)
coef std err z P>|z| [0.025 0.975]
Intercept -60.6075 13.793 -4.394 0.000 -87.642 -33.573
mass 0.0059 0.001 4.153 0.000 0.003 0.009
flipper_length -0.0209 0.052 -0.403 0.687 -0.123 0.081
culmen_length 0.6208 0.176 3.536 0.000 0.277 0.965
culmen_depth 1.0111 0.349 2.896 0.004 0.327 1.695
Pseudo R-squared: 0.6622

The pseudo \(R^2\) value of this model is 0.662, higher than the previous model (0.526) – so the additional measurements contain additional information that distinguishes male and female penguins.

The p-values for culmen length and depth are small, which indicates that they contribute more information than we expect by chance. The p-value for flipper length is large, which suggests that if you know a penguin’s weight and bill dimensions, flipper length doesn’t contribute any additional information.

To understand this model, let’s look at some of its predictions. We’ll use the following function, which takes a sequence of masses and a specific value for culmen_length. It sets the other measurements to their mean values, computes predicted probabilities as a function of mass, and plots the results.

def plot_predictions(mass_range, culmen_length, **options):
    """Plot predicted probabilities as a function of mass."""
    df = pd.DataFrame({"mass": mass_range})
    df["flipper_length"] = adelie["flipper_length"].mean()
    df["culmen_length"] = culmen_length
    df["culmen_depth"] = adelie["culmen_depth"].mean()
    fit_ys = result.predict(df)
    plt.plot(mass_range, fit_ys, **options)

Here’s what the results look like for three values of culmen_length: one standard deviation above average, average, and one standard deviation below average.

culmen_length = adelie["culmen_length"]
m, s = culmen_length.mean(), culmen_length.std()
plot_predictions(mass_range, m + s, ls="--", label="Above average culmen length")
plot_predictions(mass_range, m, label="Average culmen length")
plot_predictions(mass_range, m - s, ls=":", label="Below average culmen length")

decorate(xlabel="Mass (g)", ylabel="Prob(male)")
[Figure: predicted probability of being male versus mass, for three culmen lengths]

As we saw in the simpler model, heavier penguins are more likely to be male. Also, at any weight, a penguin with a longer bill is more likely to be male.

This model is more useful than it might seem – in fact, it is similar to models that were used in the original research paper this dataset was collected for. The primary topic of that research is sexual dimorphism, which is the degree to which male and female bodies differ. One way to quantify dimorphism is to use measurements to identify males and females. In a species with higher dimorphism, we expect these identifications to be more accurate.
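One rough way to quantify that accuracy, using the model we already fit, is to classify each penguin as male when the predicted probability exceeds 0.5 and compare the predictions with the recorded sex. This is only an in-sample sketch, not the evaluation used in the original paper.

predicted = (result.predict(adelie) > 0.5).astype(int)
(predicted == adelie["y"]).mean()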

The original paper is Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081

To test this methodology, let’s try the same model on a different species. In addition to the Adélie penguins we’ve worked with so far, the dataset also contains measurements from 123 Gentoo penguins. We’ll use the following function to select them.

def get_species(penguins, species):
    df = penguins.query(f'Species.str.startswith("{species}")').copy()
    df["y"] = (df["Sex"] == "MALE").astype(int)
    return df


gentoo = get_species(penguins, "Gentoo")
len(gentoo)
123

And here are the results.

formula = "y ~ mass + flipper_length + culmen_length + culmen_depth"
model = smf.logit(formula, data=gentoo)
result = model.fit(disp=False)
display_summary(result)
coef std err z P>|z| [0.025 0.975]
Intercept -173.9123 62.326 -2.790 0.005 -296.069 -51.756
mass 0.0105 0.004 2.948 0.003 0.004 0.017
flipper_length 0.2839 0.183 1.549 0.121 -0.075 0.643
culmen_length 0.2734 0.285 0.958 0.338 -0.286 0.833
culmen_depth 3.0843 1.291 2.389 0.017 0.554 5.614
Pseudo R-squared: 0.848

The pseudo-\(R^2\) value is 0.848, higher than what we got with Adélie penguins (0.662) – that suggests that Gentoo penguins are more dimorphic.

11.6. Glossary#

  • regression: One of several related processes for estimating parameters that fit a model to data.

  • dependent variables: The variables in a regression model we would like to predict. Also known as endogenous variables.

  • explanatory variables: The variables used to predict or explain the dependent variables. Also known as independent, or exogenous, variables.

  • simple regression: A regression with only one dependent and one explanatory variable.

  • multiple regression: A regression with multiple explanatory variables, but only one dependent variable.

  • linear regression: A regression based on a linear model.

  • ordinary least squares: A linear regression that estimates parameters by minimizing the squared error of the residuals.

  • spurious relationship: A relationship between two variables that is caused by a statistical artifact or a factor, not included in the model, that is related to both variables.

  • control variable: A variable included in a regression to eliminate or “control for” a spurious relationship.

  • proxy variable: A variable that contributes information to a regression model indirectly because of a relationship with another factor, so it acts as a proxy for that factor.

  • categorical variable: A variable that can have one of a discrete set of unordered values.

  • join: An operation that combines data from two DataFrames using a key to match up rows in the two frames.

  • data mining: An approach to finding relationships between variables by testing a large number of models.

  • logistic regression: A form of regression used when the dependent variable is boolean.

  • Poisson regression: A form of regression used when the dependent variable is a non-negative integer, usually a count.

  • odds: An alternative way of representing a probability, \(p\), as the ratio of the probability and its complement, \(p / (1-p)\).

11.7. Exercises#

11.7.1. Exercise#

Are baby boys heavier than baby girls? To answer this question, fit a linear regression model with totalwgt_lb as the dependent variable and babysex as a categorical explanatory variable – the value of this variable is 1 for boys and 2 for girls. What is the estimated difference in weight? Is it statistically significant? What if you control for the mother’s age – does maternal age account for some or all of the apparent difference?

11.7.2. Exercise#

The Trivers-Willard hypothesis suggests that for many mammals the sex ratio depends on “maternal condition” – that is, factors like the mother’s age, size, health, and social status. Some studies have shown this effect among humans, but results are mixed.

See https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis.

Let’s see if there is a relationship between mother’s age and the probability of having a boy. Fit a logistic regression model with the baby’s sex as the dependent variable and mother’s age as an explanatory variable. Are older mothers more or less likely to have boys? What if you use a quadratic model of maternal age?

In order to use babysex as a dependent variable in a logistic regression, we’ll recode it with the value 1 for boys and 0 for girls.

valid["y"] = (valid["babysex"] == 1).astype(int)

11.7.3. Exercise#

For the Adelie penguins, fit a linear regression model that predicts penguin weights as a function of flipper_length, culmen_depth, and Sex as a categorical variable. If we control for flipper length and culmen depth, how much heavier are male penguins? Generate and plot predictions for a range of flipper lengths, for male and female penguins, with culmen_depth set to its average value.

11.7.4. Exercise#

Let’s see if Chinstrap penguins are more or less dimorphic than the other penguin species in the dataset, as quantified by the pseudo \(R^2\) value of the model. Use get_species to select the Chinstrap penguins, then fit a logistic regression model with sex as the dependent variable and all four measurements as explanatory variables. How does the pseudo \(R^2\) value compare to the other models?

chinstrap = get_species(penguins, "Chinstrap")
len(chinstrap)
68

Think Stats: Exploratory Data Analysis in Python, 3rd Edition

Copyright 2024 Allen B. Downey

Code license: MIT License

Text license: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International