{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bootstrap Sampling" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-print" ] }, "source": [ "[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ElementsOfDataScience/blob/v1/12_bootstrap.ipynb)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-print" ] }, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", "\n", " local, _ = urlretrieve(url, filename)\n", " print(\"Downloaded \" + str(local))\n", " return filename\n", "\n", "download('https://raw.githubusercontent.com/AllenDowney/ElementsOfDataScience/v1/utils.py')\n", "\n", "import utils" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous chapter we used resampling to compute sampling distributions, which quantify the variability in an estimate due to random sampling.\n", "\n", "In this chapter, we'll use data from the General Social Survey (GSS) to estimate average income and the 10th percentile of income.\n", "We'll see that the resampling method we used in the previous chapter works for the average but not for the 10th percentile.\n", "To solve this problem, we'll use another kind of resampling, called bootstrapping.\n", "\n", "Then we'll use bootstrapping to compute sampling distributions for correlations and the parameters of linear regression.\n", "Finally, I'll point out a problem with bootstrap resampling when there are not enough different values in a dataset, and a way to solve it with KDE resampling." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimating Average Income\n", "\n", "As a first example, we'll use data from the General Social Survey to estimate average family income.\n", "We'll work with an extract that contains just the columns we need, as we did in Chapter 8.\n", "Instructions for downloading the extract are in the notebook for this chapter." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "try:\n", " import empiricaldist\n", "except ImportError:\n", " !pip install empiricaldist" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "download('https://github.com/AllenDowney/ElementsOfDataScience/' +\n", " 'raw/v1/data/gss_extract_2022.hdf');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can load the data like this and display the first few rows." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>year</th><th>id</th><th>age</th><th>educ</th><th>degree</th><th>sex</th><th>gunlaw</th><th>grass</th><th>realinc</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>1972</td><td>1</td><td>23.0</td><td>16.0</td><td>3.0</td><td>2.0</td><td>1.0</td><td>NaN</td><td>18951.0</td></tr>\n",
"    <tr><th>1</th><td>1972</td><td>2</td><td>70.0</td><td>10.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>NaN</td><td>24366.0</td></tr>\n",
"    <tr><th>2</th><td>1972</td><td>3</td><td>48.0</td><td>12.0</td><td>1.0</td><td>2.0</td><td>1.0</td><td>NaN</td><td>24366.0</td></tr>\n",
"    <tr><th>3</th><td>1972</td><td>4</td><td>27.0</td><td>17.0</td><td>3.0</td><td>2.0</td><td>1.0</td><td>NaN</td><td>30458.0</td></tr>\n",
"    <tr><th>4</th><td>1972</td><td>5</td><td>61.0</td><td>12.0</td><td>1.0</td><td>2.0</td><td>1.0</td><td>NaN</td><td>50763.0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " year id age educ degree sex gunlaw grass realinc\n", "0 1972 1 23.0 16.0 3.0 2.0 1.0 NaN 18951.0\n", "1 1972 2 70.0 10.0 0.0 1.0 1.0 NaN 24366.0\n", "2 1972 3 48.0 12.0 1.0 2.0 1.0 NaN 24366.0\n", "3 1972 4 27.0 17.0 3.0 2.0 1.0 NaN 30458.0\n", "4 1972 5 61.0 12.0 1.0 2.0 1.0 NaN 50763.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "gss = pd.read_hdf('gss_extract_2022.hdf', 'gss')\n", "gss.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column `realinc` records family income, converted to 1986 dollars.\n", "The following figure uses the Seaborn function `kdeplot` to show the distribution of family income.\n", "The argument `cut=0` cuts off the curve so it doesn't extend beyond the observed minimum and maximum values." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "sns.kdeplot(gss['realinc'] / 1000, label='GSS data', cut=0)\n", "\n", "plt.xlabel('Family income ($1000s)')\n", "plt.ylabel('PDF')\n", "plt.title('Distribution of income')\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distribution of income is skewed to the right; most household incomes are less than $60,000, but a few are substantially higher.\n", "Here are the mean and standard deviation of the reported incomes." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "32537.399981032493 30883.22609399141\n" ] } ], "source": [ "mean_realinc = gss['realinc'].mean()\n", "std_realinc = gss['realinc'].std()\n", "print(mean_realinc, std_realinc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The average family income in this sample is $32,537.\n", "But if we ran the GSS survey again, the average might be higher or lower.\n", "To see how much it might vary, we can use this function from the previous chapter to simulate the sampling process." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def simulate_sample_mean(n, mu, sigma):\n", " sample = np.random.normal(mu, sigma, size=n)\n", " return sample.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`simulate_sample_mean` takes as parameters the sample size and the mean and standard deviation.\n", "It generates a sample from a normal distribution and returns the mean.\n", "\n", "Before we call this function, we have to count the number of valid responses." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "64912" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n_realinc = gss['realinc'].count()\n", "n_realinc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, if we call `simulate_sample_mean` once, we get a single value from the sampling distribution of the mean." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# set the random seed so we get the same results every time\n", "np.random.seed(18)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "32573.420195135117" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simulate_sample_mean(n_realinc, mean_realinc, std_realinc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we call it many times, we get a random sample from the sampling distribution." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "t1 = [simulate_sample_mean(n_realinc, mean_realinc, std_realinc)\n", " for i in range(1001)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what the sampling distribution looks like." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.kdeplot(t1)\n", "\n", "plt.xlabel('Family income (1986 $)')\n", "plt.ylabel('PDF')\n", "plt.title('Sampling distribution of mean income');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This distribution shows how much we would expect the observed mean to vary if we ran the GSS survey again.\n", "We'll use the following function to summarize the sampling distribution." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def summarize(t, digits=2, label=''):\n", " est = np.mean(t).round(digits)\n", " SE = np.std(t).round(digits)\n", " CI90 = np.percentile(t, [5, 95]).round(digits)\n", " data = [est, SE, CI90]\n", " columns = ['Estimate', 'SE', 'CI90']\n", " table = pd.DataFrame([data], index=[label], columns=columns)\n", " return table" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>Estimate</th><th>SE</th><th>CI90</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th></th><td>32533.8</td><td>120.7</td><td>[32331.4, 32724.2]</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Estimate SE CI90\n", " 32533.8 120.7 [32331.4, 32724.2]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary1 = summarize(t1, digits=1)\n", "summary1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result shows the mean of the sampling distribution, the standard error, and a 90% confidence interval.\n", "The mean of the sampling distribution is close to the mean of the data, as we expect.\n", "The standard error quantifies the width of the sampling distribution, which is about $121.\n", "Informally, that's how much we would expect the sample mean to change if we ran the survey again.\n", "And if we ran the survey many times and computed the average income each time, we would expect 90\\% of the results to fall in the range from 32,331 to 32,724." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "32537.399981032493" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_realinc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we used a normal distribution to simulate the sampling process.\n", "The normal distribution is not a particularly good model for the distribution of income, but it works well enough for this example, and the results are reasonable.\n", "In the next section we'll see an example where the normal distribution is not good enough and the results are not reasonable.\n", "Then we'll see how to fix the problem." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimating Percentiles\n", "\n", "Suppose that, instead of estimating the average income, we want to estimate the 10th percentile.\n", "Computing percentiles of income is often relevant to discussions of income inequality.\n", "\n", "To compute the 10th percentile of the data, we can use the Pandas method `quantile`, which is similar to the NumPy function `percentile`, except that it drops `NaN` values.\n", "Also, the parameter of `quantile` is a probability between 0 and 1, rather than a percentage between 0 and 100." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5730.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gss['realinc'].quantile(0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 10th percentile of the sample is $5730, but if we collected another sample, the result might be higher or lower.\n", "To see how much it would vary, we can use the following function to simulate the sampling process: `simulate_sample_percentile` generates a sample from a normal distribution and returns the 10th percentile." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def simulate_sample_percentile(n, mu, sigma):\n", " sample = np.random.normal(mu, sigma, size=n)\n", " return np.percentile(sample, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we call it many times, the result is a sample from the sampling distribution of the 10th percentile." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "t2 = [simulate_sample_percentile(n_realinc, mean_realinc, std_realinc)\n", " for i in range(1001)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what that sampling distribution looks like." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.kdeplot(t2)\n", "\n", "plt.xlabel('Family income (1986 $)')\n", "plt.ylabel('PDF')\n", "plt.title('Sampling distribution of the 10th percentile');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that something has gone wrong.\n", "All of the values in the sampling distribution are negative, even though no one in the sample reported a negative income.\n", "To see what happened, let's look at the distribution of reported incomes again compared to the normal distribution with the same mean and standard deviation." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import norm\n", "\n", "xs = np.linspace(-50, 150)\n", "ys = norm(mean_realinc/1000, std_realinc/1000).pdf(xs)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.kdeplot(gss['realinc'] / 1000, label='GSS data', cut=0)\n", "plt.plot(xs, ys, color='0.7', label='normal model')\n", "\n", "plt.xlabel('Family income ($1000s)')\n", "plt.ylabel('PDF')\n", "plt.title('Distribution of income')\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem is that the normal model extends past the lower bound of the observed values, so it doesn't produce sensible results.\n", "Fortunately there is a simple alternative that is more robust: bootstrapping." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bootstrapping\n", "\n", "Bootstrapping is a kind of resampling, based on the framework we saw in the previous chapter:\n", "\n", "![](https://github.com/AllenDowney/ElementsOfDataScience/raw/master/figs/resampling.png)\n", "\n", "The idea is that we treat the original sample as if it were the entire population, and simulate the sampling process by choosing random rows with replacement.\n", "`DataFrame` provides a method called `sample` we can use to select a random sample of the rows." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(64912, 9)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bootstrapped = gss.sample(n=n_realinc, replace=True)\n", "bootstrapped.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The argument `n=n_realinc` means that the bootstrapped sample has the same size as the original. \n", "`replace=True` means that sampling is done with replacement -- that is, the same row can be chosen more than once.\n", "To see how many times each row appears in the bootstrapped sample, we can use `value_counts` and the `id` column, which contains a unique identifier for each respondent. 
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id\n", "90 55\n", "373 49\n", "322 47\n", "190 46\n", "975 46\n", "Name: count, dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repeats = bootstrapped['id'].value_counts()\n", "repeats.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Several of the rows appear more than 40 times.\n", "Since some rows appear many times, other rows don't appear at all. To see how many, we can use `set` subtraction to count the values of `id` that appear in the original dataset but not the bootstrapped sample." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "228" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unused = set(gss['id']) - set(bootstrapped['id'])\n", "len(unused)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use bootstrapping to generate a sampling distribution.\n", "For example, the following function takes a `DataFrame`, generates a bootstrapped sample, and returns the average income." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def bootstrap_mean(df, varname):\n", " bootstrapped = df.sample(n=len(df), replace=True)\n", " return bootstrapped[varname].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we run it many times, we get a sample from the sampling distribution of the mean." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "t3 = [bootstrap_mean(gss, 'realinc')\n", " for i in range(1001)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a summary of the results, compared to the results based on the normal model." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>Estimate</th><th>SE</th><th>CI90</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>normal model</th><td>32533.80</td><td>120.70</td><td>[32331.4, 32724.2]</td></tr>\n",
"    <tr><th>bootstrapping</th><td>32540.97</td><td>120.44</td><td>[32345.43, 32735.15]</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Estimate SE CI90\n", "normal model 32533.80 120.70 [32331.4, 32724.2]\n", "bootstrapping 32540.97 120.44 [32345.43, 32735.15]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary3 = summarize(t3)\n", "table = pd.concat([summary1, summary3])\n", "table.index=['normal model', 'bootstrapping']\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results from bootstrap sampling are consistent with the results based on the normal model.\n", "Now let's see what happens when we estimate the 10th percentile.\n", "The following function generates a bootstrapped sample and returns the 10th percentile." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def bootstrap_income_percentile(df):\n", " bootstrapped = df.sample(n=len(df), replace=True)\n", " return bootstrapped['realinc'].quantile(0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use it to generate a sample from the sampling distribution of the 10th percentile." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "t4 = [bootstrap_income_percentile(gss)\n", " for i in range(1001)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the results from bootstrapping compared to the results from the normal model." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>Estimate</th><th>SE</th><th>CI90</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>normal model</th><td>-7036.41</td><td>206.15</td><td>[-7377.72, -6709.54]</td></tr>\n",
"    <tr><th>bootstrapping</th><td>5687.12</td><td>91.42</td><td>[5512.5, 5827.5]</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Estimate SE CI90\n", "normal model -7036.41 206.15 [-7377.72, -6709.54]\n", "bootstrapping 5687.12 91.42 [5512.5, 5827.5]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary2 = summarize(t2)\n", "summary4 = summarize(t4)\n", "table = pd.concat([summary2, summary4])\n", "table.index=['normal model', 'bootstrapping']\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results from bootstrapping are more sensible -- the mean of the sampling distribution and the bounds of the confidence interval are positive and consistent with the 10th percentile of the data." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "5730.0" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gss['realinc'].quantile(0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, bootstrapping is robust -- that is, it works well with many different distributions and many different statistics.\n", "However, at the end of the chapter, we'll see one example where it fails." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with Bigger Data\n", "\n", "As sample size increases, errors due to random sampling get smaller.\n", "To demonstrate this effect, we'll use data from the Behavioral Risk Factor Surveillance System (BRFSS) to estimate the average height for men in the United States.\n", "\n", "First, let's read the 2021 data, which I have stored in an HDF file.\n", "Instructions for downloading it are in the notebook for this chapter." 
] }, { "cell_type": "code", "execution_count": 32, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "download('https://github.com/AllenDowney/ElementsOfDataScience/raw/v1/data/brfss_2021.hdf');" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(438693, 10)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "brfss = pd.read_hdf('brfss_2021.hdf', 'brfss')\n", "brfss.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset contains 438,693 rows, one for each respondent, and 10 columns, one for each variable in the extract.\n", "Here are the first few rows." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>SEQNO</th><th>HTM4</th><th>WTKG3</th><th>_SEX</th><th>_AGEG5YR</th><th>_VEGESU1</th><th>_INCOMG1</th><th>_LLCPWT</th><th>_HTM4G10</th><th>AGE</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>2021000001</td><td>150.0</td><td>32.66</td><td>2</td><td>11.0</td><td>2.14</td><td>3.0</td><td>744.745531</td><td>140.0</td><td>72.0</td></tr>\n",
"    <tr><th>1</th><td>2021000002</td><td>168.0</td><td>NaN</td><td>2</td><td>10.0</td><td>1.28</td><td>NaN</td><td>299.137394</td><td>160.0</td><td>67.0</td></tr>\n",
"    <tr><th>2</th><td>2021000003</td><td>165.0</td><td>77.11</td><td>2</td><td>11.0</td><td>0.71</td><td>2.0</td><td>587.862986</td><td>160.0</td><td>72.0</td></tr>\n",
"    <tr><th>3</th><td>2021000004</td><td>163.0</td><td>88.45</td><td>2</td><td>9.0</td><td>1.65</td><td>5.0</td><td>1099.621570</td><td>160.0</td><td>62.0</td></tr>\n",
"    <tr><th>4</th><td>2021000005</td><td>180.0</td><td>93.44</td><td>1</td><td>12.0</td><td>2.58</td><td>2.0</td><td>1711.825870</td><td>170.0</td><td>77.0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " SEQNO HTM4 WTKG3 _SEX _AGEG5YR _VEGESU1 _INCOMG1 _LLCPWT \\\n", "0 2021000001 150.0 32.66 2 11.0 2.14 3.0 744.745531 \n", "1 2021000002 168.0 NaN 2 10.0 1.28 NaN 299.137394 \n", "2 2021000003 165.0 77.11 2 11.0 0.71 2.0 587.862986 \n", "3 2021000004 163.0 88.45 2 9.0 1.65 5.0 1099.621570 \n", "4 2021000005 180.0 93.44 1 12.0 2.58 2.0 1711.825870 \n", "\n", " _HTM4G10 AGE \n", "0 140.0 72.0 \n", "1 160.0 67.0 \n", "2 160.0 72.0 \n", "3 160.0 62.0 \n", "4 170.0 77.0 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brfss.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `HTM4` column contains the respondents' heights in centimeters." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "height = brfss['HTM4']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select male respondents, we'll use the `_SEX` column to make a Boolean `Series`." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "203760" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "male = (brfss['_SEX'] == 1)\n", "male.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `notna` and `sum` to count the number of male respondents with valid height data." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "193701" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n_height = height[male].notna().sum()\n", "n_height" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the mean and standard deviation of these values." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(178.14807357731763, 7.987083970017878)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_height = height[male].mean()\n", "std_height = height[male].std()\n", "mean_height, std_height" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The average height of the men in this sample is about 178 cm, which we can use as an estimate of the average height for men in the U.S.\n", "To see how precise this estimate is, we can use bootstrapping to generate values from the sampling distribution.\n", "To reduce computation time, I set the number of iterations to 201." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>Estimate</th><th>SE</th><th>CI90</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th></th><td>178.148</td><td>0.018</td><td>[178.121, 178.176]</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Estimate SE CI90\n", " 178.148 0.018 [178.121, 178.176]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t5 = [bootstrap_mean(brfss[male], 'HTM4')\n", " for i in range(201)]\n", "\n", "summarize(t5, digits=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the sample size is so large, the standard error is small and the confidence interval is narrow.\n", "This result suggests that our estimate is very precise, which is true in the sense that the error due to random sampling is small.\n", "\n", "But there are other sources of error.\n", "For example, the heights and weights in this dataset are self-reported, so they are vulnerable to **social desirability bias**, which is the tendency of people to represent themselves in a positive light. \n", "\n", "It's also possible that there are errors in recording the data.\n", "In a previous year of the BRFSS, there are a suspicious number of heights recorded as 60 or 61 centimeters.\n", "I suspect that many of them are six feet tall, or six feet and one inch, and something went wrong in recording the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that brings us to the point of this example: \n", "\n", "> With large sample sizes, variability due to random sampling is small, but with real-world data, that often means that other sources of error are bigger. So we can't be sure that the estimate is accurate.\n", "\n", "In fact, there is another source of error in this example that we have not taken into account: oversampling." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Weighted Bootstrapping\n", "\n", "By design, the BRFSS oversamples some demographic groups -- that is, people in some groups are more likely than others to appear in the sample.\n", "If people in these groups are taller than others on average, or shorter, our estimated mean would not be accurate.\n", "\n", "We encountered this issue in Chapter 7, where we used data from the National Survey of Family Growth (NSFG) to compute the average birth weight for babies in the United States.\n", "In that example, we corrected for oversampling by computing a weighted mean.\n", "\n", "In this example, we'll use a different method, **weighted bootstrapping**, to estimate the mean and compute a confidence interval.\n", "The BRFSS dataset includes a column, `_LLCPWT`, that contains sampling weights.\n", "The sampling weight for each respondent is the number of people in the population they represent.\n", "People in oversampled groups have lower sampling weights; people in undersampled groups have higher sampling weights.\n", "Here's what the range of values looks like." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 438693.000000\n", "mean 560.851529\n", "std 1136.781547\n", "min 0.545800\n", "25% 95.573000\n", "50% 248.677287\n", "75% 592.546811\n", "max 49028.547000\n", "Name: _LLCPWT, dtype: float64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brfss['_LLCPWT'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lowest sampling weight is about 0.5; the largest is about 49,000 -- so that's a very wide range!\n", "We can take these weights into account by passing them as an argument to `sample`.\n", "That way, the probability that any row is selected is proportional to its sampling weight." 
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "n = len(brfss)\n", "bootstrapped = brfss.sample(n=n, replace=True, weights='_LLCPWT')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw with unweighted bootstrapping, the same row can appear more than once.\n", "To see how many times, we can use `value_counts` and the `SEQNO` column, which contains a unique identifier for each respondent." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SEQNO\n", "2021000019 144\n", "2021001348 132\n", "2021000044 129\n", "2021003808 127\n", "2021000091 124\n", "Name: count, dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repeats = bootstrapped['SEQNO'].value_counts()\n", "repeats.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some rows appear more than 100 times.\n", "Most likely, these are the rows for people from undersampled groups, who have the highest sampling weights.\n", "\n", "To see how many rows don't appear at all, we can use `set` subtraction to count the values of `SEQNO` that appear in the original dataset but not the sample." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14616" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unused = set(brfss['SEQNO']) - set(bootstrapped['SEQNO'])\n", "len(unused)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are thousands of rows that don't appear in this sample, but they are not dropped altogether -- when we repeat this process, they will appear in other samples.\n", "\n", "Now we can use weighted bootstrapping to generate values from the sampling distribution of the mean.\n", "The following function uses `sample` and the `_LLCPWT` column to generate a bootstrapped sample, then returns the average height." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "def weighted_bootstrap_mean(df):\n", " n = len(df)\n", " sample = df.sample(n=n, replace=True, weights='_LLCPWT')\n", " return sample['HTM4'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'll test this function with a `DataFrame` that contains only male respondents.\n", "If we run it once, we get a single value from the sampling distribution of the weighted mean." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "177.569630553049" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "male_df = brfss[male]\n", "weighted_bootstrap_mean(male_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we run it many times, we get a random sample from the sampling distribution." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Estimate</th><th>SE</th><th>CI90</th></tr></thead>\n", "<tbody><tr><th></th><td>177.541</td><td>0.018</td><td>[177.513, 177.573]</td></tr></tbody>\n", "</table>
" ], "text/plain": [ " Estimate SE CI90\n", " 177.541 0.018 [177.513, 177.573]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t6 = [weighted_bootstrap_mean(male_df) \n", " for i in range(201)]\n", "\n", "summarize(t6, digits=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean of the sampling distribution estimates the average height for men in the U.S., corrected for oversampling.\n", "If we compare it to the unweighted mean we computed, it is a little lower." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "177.54149968876962 178.14807357731763\n" ] } ], "source": [ "print(np.mean(t6), mean_height)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So it seems like people in the oversampled groups are taller than others, on average, by enough to bring the unweighted mean up by about half a centimeter.\n", "\n", "The difference between the weighted and unweighted averages is bigger than the width of the confidence interval.\n", "So in this example the error if we fail to correct for oversampling is bigger than variability due to random sampling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation and Regression\n", "\n", "Bootstrap resampling can be used to estimate other statistics and their sampling distributions.\n", "For example, in Chapter 9 we computed the correlation between height and weight, which is about 0.47." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4693981914367917" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "var1, var2 = 'HTM4', 'WTKG3'\n", "corr = brfss[var1].corr(brfss[var2])\n", "corr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That correlation does not take into account oversampling.\n", "We can correct it with this function, which generates a weighted bootstrapped sample and uses it to compute the correlation of the columns specified by `var1` and `var2`." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "def weighted_bootstrap_corr(df, var1, var2):\n", " n = len(df)\n", " sample = df.sample(n=n, replace=True, weights='_LLCPWT')\n", " corr = sample[var1].corr(sample[var2])\n", " return corr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Use this function to draw 101 values from the sampling distribution of the correlation between height and weight.\n", "What is the mean of these values? Is it substantially different from the correlation we computed without correcting for oversampling?\n", "Compute the standard error and 90% confidence interval for the estimated correlation." ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Estimate</th><th>SE</th><th>CI90</th></tr></thead>\n", "<tbody><tr><th></th><td>0.45492</td><td>0.00125</td><td>[0.45244, 0.45678]</td></tr></tbody>\n", "</table>
" ], "text/plain": [ " Estimate SE CI90\n", " 0.45492 0.00125 [0.45244, 0.45678]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "t7 = [weighted_bootstrap_corr(brfss, 'HTM4', 'WTKG3')\n", " for i in range(101)]\n", "\n", "summarize(t7, digits=5)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "# The estimated correlation with weighted bootstrapping is\n", "# slightly smaller, but the difference is not enough to matter \n", "# in practice.\n", "\n", "# The error due to oversampling, although small,\n", "# is bigger than variability due to random sampling,\n", "# which is small because the sample size is so large." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** In Chapter 9 we also computed the slope of the regression line for weight as a function of height.\n", "Here's the result." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9366891536604244" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import linregress\n", "\n", "subset = brfss.dropna(subset=['WTKG3', 'HTM4'])\n", "res = linregress(subset['HTM4'], subset['WTKG3'])\n", "res.slope" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The estimated slope is 0.94 kg/cm, which means that we expect someone 1 cm taller than average to be about 0.94 kg heavier than average.\n", "\n", "Write a function called `weighted_bootstrap_slope` that takes a `DataFrame`, generates a weighted bootstrapped sample, runs `linregress` with height and weight, and returns the slope of the regression line.\n", "\n", "Run it 101 times and collect the results. Use the sampling distribution to compute the mean of the slope estimates, standard error, and a 90% confidence interval." 
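, "\n", "\n", "
As a hint for the last step, here is a minimal sketch of how a list of values from a sampling distribution can be summarized by hand with NumPy. The values are synthetic stand-ins generated with made-up parameters, not results from the BRFSS, and this is not necessarily how `summarize` is implemented.

```python
import numpy as np

# Made-up stand-in for a list of 101 bootstrapped slope estimates.
rng = np.random.default_rng(17)
values = rng.normal(loc=0.9, scale=0.003, size=101)

# Center of the sampling distribution.
estimate = np.mean(values)

# Standard error: the standard deviation of the sampling distribution.
stderr = np.std(values)

# 90% confidence interval: the 5th and 95th percentiles.
ci90 = np.percentile(values, [5, 95])

print(estimate, stderr, ci90)
```

In the exercise, the list returned by your 101 calls to `weighted_bootstrap_slope` would take the place of `values`.
"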
] }, { "cell_type": "code", "execution_count": 53, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def weighted_bootstrap_slope(df):\n", " n = len(df)\n", " sample = df.sample(n=n, replace=True, weights='_LLCPWT')\n", " subset = sample.dropna(subset=['WTKG3', 'HTM4'])\n", " res = linregress(subset['HTM4'], subset['WTKG3'])\n", " return res.slope" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Estimate</th><th>SE</th><th>CI90</th></tr></thead>\n", "<tbody><tr><th></th><td>0.90525</td><td>0.00307</td><td>[0.90018, 0.90964]</td></tr></tbody>\n", "</table>
" ], "text/plain": [ " Estimate SE CI90\n", " 0.90525 0.00307 [0.90018, 0.90964]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "t8 = [weighted_bootstrap_slope(brfss)\n", " for i in range(101)]\n", "\n", "summarize(t8, digits=5)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "# In this example, again the value we get with weighted resampling is a little\n", "# different, but not enough to matter in practice.\n", "\n", "# Because the sample size is large, the standard error is small and \n", "# the confidence interval is narrow, but there might be sources of error,\n", "# other than random sampling, that have a bigger effect." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Limitations of Bootstrapping\n", "\n", "One limitation of bootstrapping is that it can be computationally expensive.\n", "With small datasets, it is usually fast enough that we can generate a thousand values from the sampling distribution, which means that we can compute standard errors and confidence intervals precisely.\n", "With larger datasets, we can cut the computation time by generating fewer values.\n", "With 100-200 values, the standard errors we get are usually precise enough, but the bounds of the confidence intervals might be noisier.\n", "\n", "The other limitation, which can be more problematic, is that bootstrap sampling does not work well with datasets that contain a small number of different values.\n", "To demonstrate, I'll select data from the GSS for one year, 2018:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "one_year = gss['year']==2018\n", "gss2018 = gss[one_year]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And I'll use bootstrapping to generate values from the sampling distribution of the 10th percentile." 
] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "t9 = [bootstrap_income_percentile(gss2018)\n", " for i in range(1001)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the results." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Estimate</th><th>SE</th><th>CI90</th></tr></thead>\n", "<tbody><tr><th></th><td>5155.46</td><td>223.92</td><td>[5107.5, 5107.5]</td></tr></tbody>\n", "</table>
" ], "text/plain": [ " Estimate SE CI90\n", " 5155.46 223.92 [5107.5, 5107.5]" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary9 = summarize(t9)\n", "summary9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The estimate and the standard error look plausible at first glance, but the width of the confidence interval is 0, which suggests that something has gone wrong!\n", "The problem is that `realinc` is not really a numerical variable -- it is a categorical variable in disguise.\n", "Using `value_counts`, we can see that there are only 26 distinct values in this column." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "26" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(gss2018['realinc'].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason is that the GSS does not ask respondents to report their incomes.\n", "Instead, it gives them a list of ranges and asks them to pick the range their income falls in. \n", "Then GSS analysts compute the midpoint of each range and convert to 1986 dollars by adjusting for inflation.\n", "As a result, there are only 26 distinct values for each year of the survey.\n", "When we generate a bootstrapped sample and compute the 10th percentile, we get a small subset of them.\n", "Here are the values that appear in our sample." ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-print" ] }, "source": [ "Details of the methodology are available [here](https://gss.norc.org/Documents/reports/methodological-reports/MR101%20Getting%20the%20Most%20Out%20of%20the%20GSS%20Income%20Measures.pdf)."
] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5107.5 955\n", "5221.0 1\n", "5448.0 1\n", "5561.5 1\n", "5675.0 1\n", "5902.0 2\n", "6015.5 1\n", "6129.0 2\n", "6242.5 37\n", "Name: count, dtype: int64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(t9).value_counts().sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are only nine different values, and one of them appears more than 95% of the time.\n", "When we compute a 90% confidence interval, this value is both the 5th and the 95th percentile.\n", "\n", "Bootstrapping works well for most distributions and most statistics, but the one thing it can't handle is lack of diversity in the data.\n", "Fortunately, this problem can be solved.\n", "The cause of the problem is that the data have been discretized excessively, so the solution is to smooth them.\n", "Jittering is one option.\n", "Another is to use kernel density estimation (KDE)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resampling with KDE\n", "\n", "We have used KDE several times to estimate and plot a probability density based on a sample.\n", "We can also use it to smooth data that have been discretized.\n", "\n", "In Chapter 8 we saw that the distribution of income is well modeled by a lognormal distribution, so if we take the log of income, it is well modeled by a normal distribution.\n", "Here are the logarithms of the income data." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "log_realinc = np.log10(gss2018['realinc'].dropna())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's what the estimated density looks like." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.kdeplot(log_realinc)\n", "\n", "plt.xlabel('Income (log10 1986 dollars)')\n", "plt.ylabel('Probability density')\n", "plt.title('Estimated distribution of income');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To draw samples from this distribution, we'll use a SciPy function called `gaussian_kde`, which takes the data and returns an object that represents the estimated density." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import gaussian_kde\n", "\n", "kde = gaussian_kde(log_realinc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`kde` provides a method called `resample` that draws random values from the estimated density.\n", "As we've done in previous examples, we'll generate a resampled dataset with the same size as the original -- which is stored as `kde.n`." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "sample = kde.resample(kde.n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can compute the 10th percentile and convert from a logarithm to a dollar value." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5235.936465561343" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "10 ** np.percentile(sample, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a random value from the sampling distribution of the 10th percentile.\n", "The following function encapsulates these steps."
] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "def resample_kde_percentile(kde):\n", " sample = kde.resample(kde.n)\n", " return 10 ** np.percentile(sample, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can generate a sample from the sampling distribution." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "t10 = [resample_kde_percentile(kde)\n", " for i in range(1000)]\n", "\n", "summary10 = summarize(t10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following table compares the result to the previous result with bootstrapping." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Estimate</th><th>SE</th><th>CI90</th></tr></thead>\n", "<tbody>\n", "<tr><th>bootstrapping</th><td>5155.46</td><td>223.92</td><td>[5107.5, 5107.5]</td></tr>\n", "<tr><th>KDE resampling</th><td>5097.59</td><td>246.25</td><td>[4692.62, 5485.93]</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Estimate SE CI90\n", "bootstrapping 5155.46 223.92 [5107.5, 5107.5]\n", "KDE resampling 5097.59 246.25 [4692.62, 5485.93]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = pd.concat([summary9, summary10])\n", "table.index=['bootstrapping', 'KDE resampling']\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The means and standard errors are about the same with either method.\n", "But the confidence interval we get from KDE resampling is sensible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "There are ten examples in this chapter, so let's review them:\n", "\n", "1. First we used resampling based on a normal model to estimate average family income in the GSS and compute a confidence interval.\n", "\n", "2. Then we used the same method to estimate the 10th percentile of income, and we found that all of the values in the sampling distribution are negative. The problem is that the normal model does not fit the distribution of income. \n", "\n", "3. To solve this problem, we switched to bootstrap sampling. First we estimated average family income and confirmed that the results are consistent with the results based on the normal model.\n", "\n", "4. Then we used bootstrapping to estimate the 10th percentile of income. The results are much more plausible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Next we used data from the BRFSS to estimate the average height of men in the U.S. Since this dataset is large, the confidence interval is very small. That means that the estimate is precise, in the sense that variability due to random sampling is small, but we don't know whether it is accurate, because there are other sources of error.\n", "\n", "6. One of those sources of error is oversampling -- that is, some people are more likely to appear in the sample than others. 
In the BRFSS, each respondent has a sampling weight that indicates how many people in the population they represent. We used these weights to do weighted bootstrapping, and found that the error due to oversampling is larger than the variability due to random sampling.\n", "\n", "7. In one exercise you used weighted bootstrapping to estimate the correlation of height and weight and compute a confidence interval.\n", "\n", "8. In another exercise you estimated the slope of a regression line and computed a confidence interval.\n", "\n", "9. Then I demonstrated a problem with bootstrap sampling when the dataset has only a few different values.\n", "\n", "10. Finally, I presented a solution using KDE to smooth the data and draw samples from an estimated distribution.\n", "\n", "In the exercise below, you can work on one more example.\n", "It is a little more involved than the previous exercises, but I will walk you through it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** In Chapter 10 we used logistic regression to model support for legalizing marijuana as a function of age, sex, and education level.\n", "Going back to that example, let's explore changes in support over time and generate predictions for the next decade.\n", "\n", "To prepare the data for logistic regression, we have to recode the `grass` column so `1` means in favor of legalization and `0` means not in favor."
] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "grass\n", "0.0 25997\n", "1.0 12672\n", "Name: count, dtype: int64" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gss['grass'] = gss['grass'].replace(2, 0)\n", "gss['grass'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As explanatory variables we'll use `year` and `year` squared, which I'll store in a column called `year2`.\n", "Subtracting 1990 from `year` before squaring keeps the values of `year2` relatively small, which makes logistic regression work better. " ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "gss['year2'] = (gss['year'] - 1990) ** 2.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can run the model like this:" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.585064\n", " Iterations 5\n" ] } ], "source": [ "import statsmodels.formula.api as smf\n", "\n", "formula = 'grass ~ year + year2'\n", "results = smf.logit(formula, data=gss).fit()" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
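<table>\n", "<caption>Logit Regression Results</caption>\n", "<tr><th>Dep. Variable:</th><td>grass</td><th>No. Observations:</th><td>38669</td></tr>\n", "<tr><th>Model:</th><td>Logit</td><th>Df Residuals:</th><td>38666</td></tr>\n", "<tr><th>Method:</th><td>MLE</td><th>Df Model:</th><td>2</td></tr>\n", "<tr><th>Date:</th><td>Wed, 08 May 2024</td><th>Pseudo R-squ.:</th><td>0.07506</td></tr>\n", "<tr><th>Time:</th><td>14:15:47</td><th>Log-Likelihood:</th><td>-22624.</td></tr>\n", "<tr><th>converged:</th><td>True</td><th>LL-Null:</th><td>-24460.</td></tr>\n", "<tr><th>Covariance Type:</th><td>nonrobust</td><th>LLR p-value:</th><td>0.000</td></tr>\n", "</table>\n", "<table>\n", "<tr><td></td><th>coef</th><th>std err</th><th>z</th><th>P&gt;|z|</th><th>[0.025</th><th>0.975]</th></tr>\n", "<tr><th>Intercept</th><td>-42.0901</td><td>2.372</td><td>-17.746</td><td>0.000</td><td>-46.739</td><td>-37.441</td></tr>\n", "<tr><th>year</th><td>0.0205</td><td>0.001</td><td>17.203</td><td>0.000</td><td>0.018</td><td>0.023</td></tr>\n", "<tr><th>year2</th><td>0.0016</td><td>6.3e-05</td><td>25.436</td><td>0.000</td><td>0.001</td><td>0.002</td></tr>\n", "</table>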
" ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & grass & \\textbf{ No. Observations: } & 38669 \\\\\n", "\\textbf{Model:} & Logit & \\textbf{ Df Residuals: } & 38666 \\\\\n", "\\textbf{Method:} & MLE & \\textbf{ Df Model: } & 2 \\\\\n", "\\textbf{Date:} & Wed, 08 May 2024 & \\textbf{ Pseudo R-squ.: } & 0.07506 \\\\\n", "\\textbf{Time:} & 14:15:47 & \\textbf{ Log-Likelihood: } & -22624. \\\\\n", "\\textbf{converged:} & True & \\textbf{ LL-Null: } & -24460. \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ LLR p-value: } & 0.000 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{z} & \\textbf{P$> |$z$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & -42.0901 & 2.372 & -17.746 & 0.000 & -46.739 & -37.441 \\\\\n", "\\textbf{year} & 0.0205 & 0.001 & 17.203 & 0.000 & 0.018 & 0.023 \\\\\n", "\\textbf{year2} & 0.0016 & 6.3e-05 & 25.436 & 0.000 & 0.001 & 0.002 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{Logit Regression Results}\n", "\\end{center}" ], "text/plain": [ "\n", "\"\"\"\n", " Logit Regression Results \n", "==============================================================================\n", "Dep. Variable: grass No. 
Observations: 38669\n", "Model: Logit Df Residuals: 38666\n", "Method: MLE Df Model: 2\n", "Date: Wed, 08 May 2024 Pseudo R-squ.: 0.07506\n", "Time: 14:15:47 Log-Likelihood: -22624.\n", "converged: True LL-Null: -24460.\n", "Covariance Type: nonrobust LLR p-value: 0.000\n", "==============================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept -42.0901 2.372 -17.746 0.000 -46.739 -37.441\n", "year 0.0205 0.001 17.203 0.000 0.018 0.023\n", "year2 0.0016 6.3e-05 25.436 0.000 0.001 0.002\n", "==============================================================================\n", "\"\"\"" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate predictions, I'll create a `DataFrame` with a range of values of `year` up to 2030, and corresponding values of `year2`." ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "years = np.linspace(1972, 2030)\n", "df_pred = pd.DataFrame()\n", "df_pred['year'] = years\n", "df_pred['year2'] = (years - 1990) **2\n", "\n", "pred = results.predict(df_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'll use `groupby` to compute the fraction of respondents in favor of legalization during each year." ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [], "source": [ "grass_by_year = gss.groupby('year')['grass'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function plots the data and decorates the axes." 
] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [], "source": [ "def plot_data():\n", " grass_by_year.plot(style='o', alpha=0.5, label='data')\n", " plt.xlabel('Year')\n", " plt.ylabel('Fraction in favor')\n", " plt.title('Support for legalization of marijuana')\n", " plt.legend(loc='upper left');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what the predictions look like, plotted along with the data." ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(years, pred, label='logistic model', color='gray', alpha=0.4)\n", "plot_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model fits past data reasonably well and makes plausible predictions for the next decade, although we can never be sure that trends like this will continue.\n", "\n", "This way of representing the results could be misleading because it does not show our uncertainty about the predictions.\n", "Random sampling is just one source of uncertainty among many, and for this kind of prediction it is certainly not the biggest.\n", "But it is the easiest to quantify, so let's do it, if only as an exercise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write a function called `bootstrap_regression_line` that takes a `DataFrame` as a parameter, uses `sample` to resample the rows, runs the logistic regression model, generates predictions for the rows in `df_pred`, and returns the predictions.\n", "\n", "Call this function 101 times and save the results as a list of `Series` objects.\n", "To visualize the results, you have two options:\n", "\n", "1. Loop through the list and plot each prediction using a gray line with a low value of `alpha`. The overlapping lines will form a region showing the range of uncertainty over time.\n", "\n", "2. Pass the list of `Series` to `np.percentile` with the argument `axis=0` to compute the 5th and 95th percentile in each column. Plot these percentiles as two lines, or use `plt.fill_between` to plot a shaded region between them." 
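, "\n", "\n", "
As a hint for the second option, here is a minimal sketch of how `np.percentile` with `axis=0` turns a list of curves into lower and upper bounds. The curves are synthetic stand-ins (noisy copies of a straight line with made-up parameters), not predictions from the logistic model.

```python
import numpy as np

# Made-up stand-in for a list of bootstrapped prediction curves:
# 101 shifted copies of a straight line, each evaluated at 50 points.
rng = np.random.default_rng(3)
xs = np.linspace(0, 1, 50)
curves = [0.2 + 0.5 * xs + rng.normal(0, 0.02) for _ in range(101)]

# axis=0 computes the percentiles across the curves,
# separately at each of the 50 x positions.
low, high = np.percentile(curves, [5, 95], axis=0)

print(low.shape, high.shape)
```

Here `low` and `high` each have one value per x position, so they can be plotted as two lines or passed to `plt.fill_between(xs, low, high)` to shade the region between them.
"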
] }, { "cell_type": "code", "execution_count": 119, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def bootstrap_regression_line(df):\n", " n = len(df)\n", " sample = df.sample(n=n, replace=True)\n", " results = smf.logit(formula, data=sample).fit(disp=False)\n", " pred = results.predict(df_pred)\n", " return pred" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "t11 = [bootstrap_regression_line(gss)\n", " for i in range(101)]" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Solution\n", "\n", "for pred in t11:\n", " plt.plot(years, pred, color='gray', alpha=0.01)\n", "plot_data()" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Solution\n", "\n", "low, high = np.percentile(t11, [5, 95], axis=0)\n", "plt.fill_between(years, low, high, \n", " color='gray', alpha=0.4, label='model')\n", "plot_data()" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "# In this example, the width of the CI is quite narrow, which might suggest that\n", "# the predictions are nearly certain. But remember that the CI only quantifies\n", "# uncertainty due to random sampling. In this example, there are many other sources\n", "# of uncertainty; one of the big ones is that there is no guarantee that the trends\n", "# we see in the past will continue in the future." ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "*Elements of Data Science*\n", "\n", "Copyright 2021 [Allen B. Downey](https://allendowney.com)\n", "\n", "License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 2 }