Resampling and Logistic Regression#


Different ways of computing sampling distributions – and the statistics derived from them, like standard errors and confidence intervals – yield different results. None of them are right or wrong; rather, they are based on different modeling assumptions. In practice, the differences are often small compared to other sources of error, so we are free to use whichever is convenient.

This article is prompted by a recent question on Reddit:

I am trying to do parametric bootstrap for logistic regression but I don’t know what the random error terms should be. I know how to do it with linear regression since the error terms follow a normal distribution. Really appreciate any pointers to resources about this topic.
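For context, the procedure the questioner already knows looks something like this: in linear regression, the parametric bootstrap simulates new responses by adding normally distributed noise to the fitted values, then refits the model. Here is a minimal sketch of that idea, with hypothetical arrays x and y; it is not part of the original question or the paper discussed below.

import numpy as np
import pandas as pd

def parametric_bootstrap_linear(x, y, n_boot=1000, seed=0):
    # fit a line and estimate the residual standard deviation
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    sigma = (y - fitted).std(ddof=2)

    # simulate new responses by adding normal noise to the fitted
    # values, then refit the model to each simulated dataset
    rng = np.random.default_rng(seed)
    draws = [np.polyfit(x, fitted + rng.normal(0, sigma, size=len(x)), 1)
             for _ in range(n_boot)]
    return pd.DataFrame(draws, columns=['slope', 'intercept'])

The question, then, is what plays the role of the normal noise when the dependent variable is binary.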

One of the responses recommends this paper: “An Application of Bootstrapping in Logistic Regression Model”, by Isaac Akpor Adjei and Rezaul Karim.

The paper suggests two ways to compute the sampling distribution of the parameters of a logistic regression by bootstrap sampling. They characterize one as parametric and the other as non-parametric, and advise:

The non-parametric bootstrap [relies] on weaker assumptions (or no assumptions at all) about the underlying distribution of the population. Statistical practitioners should use non-parametric procedures only in so far as the assumptions about the underlying distribution are seriously doubtful in their validity. … However, when assumptions are not violated, non-parametric procedures will usually have greater variance (in point estimation), less power (in hypothesis testing), wider intervals (in confidence interval estimation), lower probability of correct selection (in ranking and selection) and higher risk (in decision theory) when compared to a corresponding parametric procedure (Efron and Tibshirani, 1994 [1] ).

The premise of this advice is that the parametric and non-parametric methods are answering the same question in different ways, and that one might be preferred over the other because the results are preferable in one way or another. It is also based on the assumption that the sampling distribution of the parameters – and the statistics derived from it – are uniquely and objectively defined.

I disagree with these premises. The parametric and non-parametric methods they present are based on different modeling decisions, so they compute answers to different questions. And the models they are based on are only two of many possible models.

To explain what I mean, I will implement the methods they propose and explain the assumptions each is based on. Then I will propose a third method that is a hybrid of the two. I’ll show that all three methods yield different results, and suggest criteria for when one might be preferred over the others.

The Data#

As an example, I’ll use data from the General Social Survey (GSS). I’ll download an HDF file from the Elements of Data Science repository, which contains a subset of the GSS data that has been resampled to correct for stratified sampling.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(0)
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
download('https://github.com/AllenDowney/' +
         'ElementsOfDataScience/raw/master/data/gss_eda.hdf')
gss = pd.read_hdf('gss_eda.hdf', 'gss')
gss.shape
(64814, 8)

It includes 64,814 respondents and 8 variables for each respondent.

gss.head()
YEAR ID_ AGE EDUC SEX GUNLAW GRASS REALINC
0 1972 1 23.0 16.0 2 1.0 NaN 18951.0
1 1972 2 70.0 10.0 1 1.0 NaN 24366.0
2 1972 3 48.0 12.0 2 1.0 NaN 24366.0
3 1972 4 27.0 17.0 2 1.0 NaN 30458.0
4 1972 5 61.0 12.0 2 1.0 NaN 50763.0

To demonstrate logistic regression, I’ll use one of the questions in the General Social Survey, which asks “Do you think the use of marijuana should be made legal or not?” The responses are in a column called GRASS; here are the values.

gss['GRASS'].value_counts()
2.0    25662
1.0    11884
Name: GRASS, dtype: int64

I’ll use StatsModels, which provides a function that does logistic regression. First we have to recode the dependent variable so 1 means “yes” and 0 means “no”. We can do that by replacing 2 with 0.

gss['GRASS'] = gss['GRASS'].replace(2, 0)

To model quadratic relationships, I’ll add columns that contain the values of AGE and EDUC squared.

gss['AGE2'] = gss['AGE']**2
gss['EDUC2'] = gss['EDUC']**2

And I’ll drop the rows that have missing values for the variables we’ll need.

data = gss.dropna(subset=['AGE', 'EDUC', 'SEX', 'GUNLAW', 'GRASS'])

Logistic Regression#

Here are the results of logistic regression with these variables.

import statsmodels.formula.api as smf

formula = 'GRASS ~ AGE + AGE2 + EDUC + EDUC2 + C(SEX)'
result_hat = smf.logit(formula, data=data).fit()
result_hat.summary()
Optimization terminated successfully.
         Current function value: 0.593464
         Iterations 6
Logit Regression Results
Dep. Variable: GRASS No. Observations: 20475
Model: Logit Df Residuals: 20469
Method: MLE Df Model: 5
Date: Tue, 17 Jan 2023 Pseudo R-squ.: 0.05003
Time: 08:58:57 Log-Likelihood: -12151.
converged: True LL-Null: -12791.
Covariance Type: nonrobust LLR p-value: 1.555e-274
coef std err z P>|z| [0.025 0.975]
Intercept -1.6788 0.240 -6.988 0.000 -2.150 -1.208
C(SEX)[T.2] -0.3849 0.031 -12.394 0.000 -0.446 -0.324
AGE -0.0278 0.005 -5.399 0.000 -0.038 -0.018
AGE2 0.0001 5.28e-05 2.190 0.029 1.21e-05 0.000
EDUC 0.2000 0.031 6.412 0.000 0.139 0.261
EDUC2 -0.0029 0.001 -2.450 0.014 -0.005 -0.001

To get a sense of what the results look like, we can plot the predicted probability of saying “yes” as a function of age, for male and female respondents.

df = pd.DataFrame()
df['AGE'] = np.linspace(18, 89)
df['EDUC'] = 16

df['AGE2'] = df['AGE']**2
df['EDUC2'] = df['EDUC']**2

df['SEX'] = 1
pred1 = result_hat.predict(df)
pred1.index = df['AGE']

df['SEX'] = 2
pred2 = result_hat.predict(df)
pred2.index = df['AGE']

pred1.plot(label='Male', alpha=0.6)
pred2.plot(label='Female', alpha=0.6)

plt.xlabel('Age')
plt.ylabel('Fraction')
plt.title('Support for legal marijuana')
plt.legend();
[Figure: Support for legal marijuana; predicted probability vs. age for male and female respondents]

Men are more likely to support legalization than women, and younger people are more likely than older people.

Bootstrap sampling#

Now let’s implement the method characterized as non-parametric, which is based on bootstrap resampling. The following function samples the rows of data with replacement and runs the regression model on the resampled data.

options = dict(disp=False, start_params=result_hat.params) 

def bootstrap(i):
    # resample the rows of the original dataset with replacement
    bootstrapped = data.sample(n=len(data), replace=True)
    # fit the same regression model to the resampled data
    results = smf.logit(formula, data=bootstrapped).fit(**options)
    return results.params

The options passed to fit suppress the convergence messages (disp=False) and start the optimization from the previously estimated parameters (start_params). They make it run faster, but don’t affect the results by much.

Each time we run this process, the result represents a single draw from the sampling distribution of the parameters.

%timeit bootstrap(0)
82.2 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

By running it many times, we generate a sample from the sampling distribution. pqdm runs the sampling process in multiple threads.

try:
    from pqdm.threads import pqdm
except ImportError:
    !pip install pqdm
    from pqdm.threads import pqdm
estimates = pqdm(range(101), bootstrap, n_jobs=4)
sampling_dist = pd.DataFrame(estimates)
sampling_dist.head()
Intercept C(SEX)[T.2] AGE AGE2 EDUC EDUC2
0 -1.747448 -0.413202 -0.027473 0.000114 0.213299 -0.003302
1 -2.055303 -0.411384 -0.029943 0.000129 0.276153 -0.005563
2 -2.050773 -0.414960 -0.023190 0.000094 0.242778 -0.004530
3 -2.031659 -0.392391 -0.029538 0.000147 0.251942 -0.004616
4 -1.847295 -0.363829 -0.031532 0.000156 0.238870 -0.004339

Here’s what the sampling distribution looks like for one of the parameters.

ci90 = sampling_dist['C(SEX)[T.2]'].quantile([0.05, 0.95])
sns.kdeplot(sampling_dist['C(SEX)[T.2]'])
[plt.axvline(x, ls=':') for x in ci90]
plt.title('Sampling distribution of a parameter');
[Figure: sampling distribution of the parameter C(SEX)[T.2], with the 90% interval marked]

The mean of the sampling distribution should be close to the parameters we estimated with the original dataset, and it is.

pd.DataFrame({"Sampling mean": sampling_dist.mean(), 
              "Estimates": result_hat.params})
Sampling mean Estimates
Intercept -1.681880 -1.678838
C(SEX)[T.2] -0.379621 -0.384919
AGE -0.027804 -0.027825
AGE2 0.000115 0.000116
EDUC 0.201137 0.200002
EDUC2 -0.002912 -0.002855

The standard deviations of the sampling distributions should be close to the standard errors computed by StatsModels, and they are. Most of them are close enough that the difference probably doesn’t matter in practice.

def standard_errors(sampling_dist, result_hat):
    # compare the standard deviation of each sampling distribution
    # to the standard error computed by StatsModels
    df = pd.DataFrame({"Sampling std": sampling_dist.std(),
                       "Standard error": result_hat.bse})
    num, den = df.values.T
    df['Percent diff'] = (num / den - 1) * 100
    return df
standard_errors(sampling_dist, result_hat)
Sampling std Standard error Percent diff
Intercept 0.248814 0.240243 3.567779
C(SEX)[T.2] 0.027137 0.031057 -12.622418
AGE 0.004944 0.005153 -4.066840
AGE2 0.000051 0.000053 -4.198399
EDUC 0.031756 0.031193 1.804886
EDUC2 0.001197 0.001166 2.687507

Of course, there is nothing magic about the standard errors computed by StatsModels; they are approximations based on a model of the sampling process, just like the results from resampling.

The Resampling Framework#

The purpose of the sampling distribution is to quantify the variability in an estimate due to random sampling. It is the answer to the question, “If we ran this sampling process many times, how much would we expect the results to vary?”

To answer that question, we need a model of the sampling process. We use the model to simulate the sampling process and generate counterfactual datasets that represent other samples the process could have generated.

The following figure represents this framework.

[Figure: diagram of the resampling framework]

In bootstrap resampling, we treat the rows of the original dataset as if they are the population and draw a random sample from it. We use this simulated data to compute the sample statistic; in this example, it’s the parameters of the logistic regression model. We collect the results to form a sample from the sampling distribution, which we can use to compute confidence intervals and standard errors.

This way of simulating the sampling process asks, in effect, what would have happened if we had surveyed different people. But that’s not the only possible model of the sampling process. An alternative is to ask what would happen if we surveyed the same people, but they gave different answers. That’s the model that underlies the parametric method.
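To make the framework concrete, here is a minimal sketch (mine, not the paper’s) that factors it into two pieces: a simulate function that generates one counterfactual dataset, and an estimate function that computes the sample statistic. Each method in this article is an instance of this pattern with a different simulate step.

def sampling_distribution(simulate, estimate, n=101):
    # `simulate` encodes the model of the sampling process;
    # `estimate` computes the sample statistic from one dataset
    return pd.DataFrame([estimate(simulate()) for _ in range(n)])

# For example, the bootstrap method above corresponds to
# simulate = lambda: data.sample(n=len(data), replace=True)
# estimate = lambda df: smf.logit(formula, data=df).fit(**options).params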

Parametric bootstrap#

So, how do we simulate a sampling process where we survey the same people, but they give different answers? One way is to assume, for the purposes of the simulation, that the parameters we estimated from the original data are correct. If so, we can use the regression model to compute the predicted probability for each respondent, like this:

pi_hat = result_hat.predict(data)
pi_hat.describe()
count    20475.000000
mean         0.317167
std          0.113654
min          0.025989
25%          0.234792
50%          0.317927
75%          0.400791
max          0.630366
dtype: float64

Now we can use these probabilities to generate a biased coin toss for each respondent.

from scipy.stats import bernoulli

simulated = bernoulli.rvs(pi_hat.values)
simulated.mean()
0.3174114774114774

Then we can run the regression model with these simulated values as the dependent variable. The following function shows how.

def bootstrap2(i):
    # keep the same respondents, but simulate new responses
    # as biased coin tosses with the predicted probabilities
    flipped = data.assign(GRASS=bernoulli.rvs(pi_hat.values))
    results = smf.logit(formula, data=flipped).fit(**options)
    return results.params

Again, the result from a single simulation is a random value from the sampling distribution.

bootstrap2(0)
Intercept     -1.805556
C(SEX)[T.2]   -0.352934
AGE           -0.023101
AGE2           0.000058
EDUC           0.196758
EDUC2         -0.002599
dtype: float64

If we run it many times, we get a sample from the sampling distribution.

from pqdm.threads import pqdm

estimates = pqdm(range(101), bootstrap2, n_jobs=4)
sampling_dist2 = pd.DataFrame(estimates)
sampling_dist2.head()
Intercept C(SEX)[T.2] AGE AGE2 EDUC EDUC2
0 -1.804015 -0.384881 -0.014708 -0.000017 0.180796 -0.002302
1 -1.432664 -0.422171 -0.023207 0.000044 0.166255 -0.001940
2 -1.415658 -0.406837 -0.033815 0.000168 0.180158 -0.001914
3 -1.840253 -0.394937 -0.028435 0.000126 0.221295 -0.003461
4 -1.642430 -0.375719 -0.034946 0.000177 0.216985 -0.003430

Here’s the sampling distribution for a single parameter.

ci90 = sampling_dist2['C(SEX)[T.2]'].quantile([0.05, 0.95])
sns.kdeplot(sampling_dist2['C(SEX)[T.2]'])
[plt.axvline(x, ls=':') for x in ci90]
plt.title('Sampling distribution of a parameter');
[Figure: sampling distribution of C(SEX)[T.2] from the parametric bootstrap]

Again, the standard deviations of the sampling distribution should be close to the standard errors. Most of them are, although the ones for the Intercept, EDUC, and EDUC2 come out smaller by nearly 20 percent.

standard_errors(sampling_dist2, result_hat)
Sampling std Standard error Percent diff
Intercept 0.199178 0.240243 -17.092962
C(SEX)[T.2] 0.030494 0.031057 -1.812430
AGE 0.005085 0.005153 -1.327560
AGE2 0.000054 0.000053 2.118557
EDUC 0.024899 0.031193 -20.178173
EDUC2 0.000935 0.001166 -19.819110

So we have two models of the sampling process that yield different sampling distributions. But these are not the only models.

If the first method asks, “What if we surveyed different people?” and the second asks “What if we surveyed the same people and they gave different answers?”, let’s consider a third method that asks “¿Por qué no los dos?” (“Why not both?”).

The Hybrid Model#

The following function uses bootstrap sampling to simulate surveying different people; then it uses the parametric method to simulate their responses.

def bootstrap3(i):
    # resample rows to simulate surveying different people
    bootstrapped = data.sample(n=len(data), replace=True)
    # compute each resampled respondent's predicted probability
    pi_hat = result_hat.predict(bootstrapped)
    # simulate their responses as biased coin tosses
    flipped = bootstrapped.assign(GRASS=bernoulli.rvs(pi_hat.values))
    results = smf.logit(formula, data=flipped).fit(**options)
    return results.params

bootstrap3(0)
Intercept     -1.655314
C(SEX)[T.2]   -0.386624
AGE           -0.024265
AGE2           0.000083
EDUC           0.182822
EDUC2         -0.002334
dtype: float64
from pqdm.threads import pqdm

estimates = pqdm(range(101), bootstrap3, n_jobs=4)
sampling_dist3 = pd.DataFrame(estimates)
sampling_dist3.head()
Intercept C(SEX)[T.2] AGE AGE2 EDUC EDUC2
0 -2.019658 -0.385612 -0.025974 0.000108 0.249717 -0.005012
1 -1.648528 -0.351063 -0.025192 0.000100 0.182666 -0.002312
2 -1.780499 -0.445770 -0.032862 0.000173 0.242922 -0.004569
3 -1.534347 -0.464670 -0.029799 0.000125 0.189228 -0.002218
4 -1.336757 -0.404636 -0.030217 0.000137 0.160754 -0.001453

Here’s the sampling distribution for one of the parameters.

ci90 = sampling_dist3['C(SEX)[T.2]'].quantile([0.05, 0.95])
sns.kdeplot(sampling_dist3['C(SEX)[T.2]'])
[plt.axvline(x, ls=':') for x in ci90]
plt.title('Sampling distribution of a parameter');
[Figure: sampling distribution of C(SEX)[T.2] from the hybrid method]

Again, the standard errors are comparable to the ones computed by StatsModels.

standard_errors(sampling_dist3, result_hat)
Sampling std Standard error Percent diff
Intercept 0.236901 0.240243 -1.390859
C(SEX)[T.2] 0.035600 0.031057 14.629106
AGE 0.005769 0.005153 11.939358
AGE2 0.000061 0.000053 15.337773
EDUC 0.031469 0.031193 0.884552
EDUC2 0.001173 0.001166 0.613167
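As a final comparison, here’s a quick sketch that puts the three resampled distributions side by side, showing the 90% confidence intervals for one parameter. It assumes the three sampling_dist DataFrames computed above are still in scope, and the exact numbers will vary from run to run.

param = 'C(SEX)[T.2]'
methods = {'non-parametric': sampling_dist,
           'parametric': sampling_dist2,
           'hybrid': sampling_dist3}

for name, dist in methods.items():
    # 90% CI from the 5th and 95th percentiles of the sampling distribution
    low, high = dist[param].quantile([0.05, 0.95])
    print(f'{name:>15}: [{low:.3f}, {high:.3f}]')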

Now we have four ways to compute sampling distributions, confidence intervals, and standard errors – and they yield different results. So you might wonder…

Which One Is Right?#

None of them are right. They are based on different models of the sampling process, so they quantify different sources of uncertainty.

In some cases we might find that different methods converge on the same results, asymptotically, under certain assumptions. But that doesn’t really matter, because with finite data sets, the results are not generally the same. So the important question is whether the differences are big enough to matter in practice.

In this example, it is easy to implement multiple models and compare the results. If they were substantially different, we would need to think more carefully about the modeling assumptions they are based on and choose the one we think is the best description of the data-generating process.

But in this example, the differences are small enough that they probably don’t matter in practice. So we are free to choose whichever is easiest to implement, or fastest to compute, or convenient in some other way.

It is a common error to presume that the result of an analytic method is uniquely correct, and that results from computational methods like resampling are approximations to it. Analytic methods are often fast to compute, but they are always based on modeling assumptions and usually based on approximations, so they are no more correct than computational methods.

Copyright 2023 Allen Downey

Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)