8. Estimation#

Suppose you live in a town with a population of 10,000 people, and you want to predict who will win an upcoming election for mayor. In theory, you could ask everyone in town who they plan to vote for, and if they all answered honestly, you could make a reliable prediction.

But even in a small town, it is probably not practical to survey the entire population. Fortunately, it is not necessary. If you survey a random subset of the people, you can use the sample to infer the voting preferences of the population. This process – using a sample to make inferences about a population – is called statistical inference.

Two steps of statistical inference are estimation, which is the topic of this chapter, and hypothesis testing, which is the topic of the next chapter.

Click here to run this notebook on Colab.

%load_ext nb_black
%load_ext autoreload
%autoreload 2
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/thinkstats.py")
try:
    import statadict
except ImportError:
    !pip install statadict
try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from thinkstats import decorate

8.1. Weighing Penguins#

Suppose you are a researcher in Antarctica, studying local populations of penguins. One of your tasks is to monitor the average weight of the penguins as it varies over the course of the year. It would be impractical to weigh every penguin in the environment, so your plan is to collect a random sample of 10 penguins each week, weigh them, and use the sample to estimate the mean of the entire population – which is called the population mean.

There are many ways you could use the sample to estimate the population mean, but we’ll consider just two: the sample mean and the sample median. They are both reasonable choices, but let’s see which is better – and think about what we mean by “better”.

For the purposes of these experiments, we'll assume that the penguins' weights are drawn from a normal distribution, and that we know the actual mean and standard deviation of that distribution, in kilograms – which I'll denote mu and sigma.

mu = 3.7
sigma = 0.46

With these parameters, we can use NumPy to simulate the sampling process and generate a sample of any size. For example, here’s a hypothetical sample of 10 weights.

np.random.seed(1)
sample = np.random.normal(mu, sigma, size=10)
sample
array([4.44719887, 3.41859205, 3.45704099, 3.20643443, 4.09808751,
       2.6412922 , 4.50261341, 3.34984483, 3.84675798, 3.58528963])

And here are the mean and median of the sample.

np.mean(sample), np.median(sample)
(3.6553151902291945, 3.521165310619601)

The mean and median are different enough that we should wonder which is a better estimate. To find out, we’ll use the following function to generate hypothetical samples with the given size, n.

def make_sample(n):
    return np.random.normal(mu, sigma, size=n)

As a first experiment, let's see how the sample mean and sample median behave as the sample size increases. We'll use logspace to make a range of ns from 10 to 100,000, equally spaced on a logarithmic scale.

ns = np.logspace(1, 5).astype(int)

For each value of n, we’ll generate a hypothetical sample and compute its mean.

means = [np.mean(make_sample(n)) for n in ns]

And we’ll do the same for the median.

medians = [np.median(make_sample(n)) for n in ns]

A statistic, like the sample mean or median, that’s used to estimate a property of a population is called an estimator.

The following figure shows how these estimators behave as we increase the sample size. The horizontal line shows the actual mean in the population.

plt.axhline(mu, color="0.9")
plt.plot(ns, means, "--", label="mean")
plt.plot(ns, medians, label="median")

decorate(xlabel="Sample size", xscale="log", ylabel="Estimate")

For both estimators, the estimates converge to the actual value as the sample size increases. This demonstrates that they are consistent, which is one of the properties a good estimator should have. Based on this property, the mean and median seem equally good.

In the previous figure, you might notice that the estimates are sometimes too high and sometimes too low – and it looks like the variation is roughly symmetric around the true value. That suggests another experiment: if we collect many samples with the same size and compute many estimates, what is the average of the estimates?

The following loop simulates this scenario, collecting 10,001 samples of 10 penguins and computing the mean of each sample.

means = [np.mean(make_sample(n=10)) for i in range(10001)]
np.mean(means)
3.70034508492869

The average of the means is very close to the actual mean we used to generate the samples: 3.7 kg.

The following loop simulates the same scenario, but this time it computes the median of each sample.

medians = [np.median(make_sample(n=10)) for i in range(10001)]
np.mean(medians)
3.701214089907223

The average of these hypothetical medians is also very close to the actual population mean.

These results demonstrate that the sample mean and median are unbiased estimators, which means that they are correct on average. The word “bias” means different things in different contexts, which can be a source of confusion. In this context, it means that the average of the estimates is the actual value.
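
To make this concrete, one way to check is to compute the average of the estimates minus the actual value – for the results above, both differences are close to zero.

# estimated bias: average of the estimates minus the actual value
np.mean(means) - mu, np.mean(medians) - mu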

So far, we’ve shown that both estimators are consistent and unbiased, but it’s still not clear which is better. Let’s try one more experiment: let’s see which estimator is more accurate. The word “accurate” also means different things in different contexts – as an example, let’s consider the mean squared error (MSE).

The following function takes a sequence of estimates and the actual value, computes the differences between the estimates and the actual value, and returns the average of the squares of these errors.

def mse(estimates, actual):
    """Mean squared error of a sequence of estimates."""
    errors = np.asarray(estimates) - actual
    return np.mean(errors**2)

Notice that we can only compute MSE if we know the actual value. In practice, we usually don’t – after all, if we knew the actual value, we wouldn’t have to estimate it. But in our experiment, we know that the actual population mean is 3.7 kg, so we can use it to compute the MSE of the sample means.

mse(means, mu)
0.020871984891289382

If we have samples with size 10 and we use the sample mean to estimate the population mean, the average squared error is about 0.021 kilograms squared.

And here’s the MSE of the sample medians.

mse(medians, mu)
0.029022273128644173

If we use the sample medians to estimate the population mean, the average squared error is about 0.029 kilograms squared, a bit higher than the MSE of the sample mean. In fact, the sample mean is optimal in this sense – it minimizes MSE.

Minimizing MSE is a good property for an estimator to have, but MSE is not always the best way to quantify estimation error. For one thing, it is hard to interpret. In this example, the units of MSE are kilograms squared, so it’s hard to say what that means.

One solution is to use the square root of MSE, called “root mean squared error”, or RMSE. Another option is to use the average of the absolute values of the errors, called the “mean absolute error” or MAE. The following function computes MAE for a sequence of estimates.

def mae(estimates, actual):
    """Mean absolute error of a sequence of estimates."""
    errors = np.asarray(estimates) - actual
    return np.mean(np.abs(errors))

Here’s the MAE of the sample means.

mae(means, mu)
0.11540433749505272

And the sample medians.

mae(medians, mu)
0.13654429774596036

On average, we expect the sample mean to be off by about 0.115 kg, and the sample median to be off by 0.137 kg. So the sample mean is probably the better choice, at least for this example.
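
For an error measure in the original units, we could also take the square root of MSE – the RMSE mentioned earlier. Here's one way to compute it, reusing the mse function; applied to the results above, it gives roughly 0.14 kg for the sample means and 0.17 kg for the sample medians.

def rmse(estimates, actual):
    """Root mean squared error of a sequence of estimates."""
    return np.sqrt(mse(estimates, actual))

rmse(means, mu), rmse(medians, mu)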

8.2. Robustness#

Now let’s consider a different scenario. Suppose that 2% of the time, when you try to weigh a penguin, it accidentally presses the units button on the scale and the weight gets recorded in pounds instead of kilograms. Assuming that the error goes unnoticed, it introduces an outlier in the sample.

The following function simulates this scenario, multiplying 2% of the weights by the conversion factor 2.2 pounds per kilogram.

def make_sample_with_errors(n):
    sample = np.random.normal(mu, sigma, size=n)
    factor = np.random.choice([1, 2.2], p=[0.98, 0.02], size=n)
    return sample * factor

To see what effect this has on the distribution, we’ll generate a large sample.

sample = make_sample_with_errors(n=1000)

To plot the distribution of the sample, we’ll use KDE and the Pdf object from Section XXX.

from scipy.stats import gaussian_kde
from thinkstats import Pdf

kde = gaussian_kde(sample)
domain = 0, 10
pdf = Pdf(kde, domain)
pdf.plot()
decorate(xlabel="Penguin weight (kg)", ylabel="Density")

In addition to the mode near 3.7 kg, the measurement errors introduce a second mode near 8 kilograms.

Now let’s repeat the previous experiment, simulating many samples with size 10, computing the mean of each sample, and then computing the average of the sample means.

means = [np.mean(make_sample_with_errors(n=10)) for i in range(10001)]
np.mean(means)
3.786352945690677

The measurement errors cause the sample mean to be higher, on average, than 3.7 kg.

Now here’s the same experiment using sample medians.

medians = [np.median(make_sample_with_errors(n=10)) for i in range(10001)]
np.mean(medians)
3.7121869836715353

The average of the sample medians is also higher than 3.7 kg, but it is not off by nearly as much. If we compare the MSE of the estimates, we see that the sample medians are substantially more accurate.

mse(means, mu), mse(medians, mu)
(0.06853430354724438, 0.031164467796883758)

If measurements actually come from a normal distribution, the sample mean minimizes MSE, but this scenario violates that assumption, so the sample mean doesn’t minimize MSE. The sample median is less sensitive to outliers than the sample mean, so it is less biased and its MSE is smaller. Estimators that deal well with outliers – and similar violations of assumptions – are said to be robust.

8.3. Estimating Variance#

As another example, suppose we want to estimate the variance of the penguins' weights. In Section XXX, we saw that there are two ways to compute the variance of a sample. I promised to explain the difference later – and later is now.

The reason there are two ways to compute the variance of a sample is that one is a biased estimator of the population variance, and the other is unbiased. The following function computes the biased estimator, which is the sum of the squared deviations divided by n.

def biased_var(xs):
    """Population variance with n in the denominator"""
    n = len(xs)
    deviations = xs - np.mean(xs)
    return np.sum(deviations**2) / n

To test it, we’ll simulate many samples with size 10, compute the biased variance of each sample, and then compute the average of the variances.

biased_vars = [biased_var(make_sample(n=10)) for i in range(10001)]
np.mean(biased_vars)
0.19049277659404473

The result is about 0.19, but in this case, we know that the actual population variance is about 0.21, so the biased sample variance is too low on average – in other words, it is biased.

actual_var = sigma**2
actual_var
0.2116

The following function computes the unbiased estimator, which is the sum of the squared deviations divided by n-1.

def unbiased_var(xs):
    """Population variance with n in the denominator"""
    n = len(xs)
    deviations = xs - np.mean(xs)
    return np.sum(deviations**2) / (n - 1)

We can test it by generating many samples and computing the unbiased variance for each one.

unbiased_vars = [unbiased_var(make_sample(n=10)) for i in range(10001)]
np.mean(unbiased_vars)
0.21159109492300626

The average of the unbiased sample variances is very close to the actual value – as advertised, it is an unbiased estimator.

With sample size 10, the difference between the biased and unbiased estimators is about 10%, which might be non-negligible. With sample size 100, the difference is only 1%, which is small enough that it probably doesn't matter in practice.

n = 10
1 - (n - 1) / n
0.09999999999999998
n = 100
1 - (n - 1) / n
0.010000000000000009
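
As a cross-check, NumPy's var function computes the biased version by default and the unbiased version if we pass ddof=1, so it should agree with biased_var and unbiased_var on the same sample. Here's a quick sketch.

sample = make_sample(n=10)

# ddof=0 (the default) divides by n; ddof=1 divides by n - 1
print(np.var(sample), biased_var(sample))
print(np.var(sample, ddof=1), unbiased_var(sample))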

8.4. Sampling Distributions#

This section introduces an important idea in statistical inference, the sampling distribution. To demonstrate, we'll use a sample of penguin weights – so let's start with the data.

Between 2007 and 2010, researchers at Palmer Station in Antarctica measured and weighed 342 penguins from local populations. The data they collected is freely available – instructions for downloading it are in the notebook for this chapter.

The following cell downloads the data from a repository created by Allison Horst.

download(
    "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/c19a904462482430170bfe2c718775ddb7dbb885/inst/extdata/penguins_raw.csv"
)

We can use Pandas to read the data.

penguins = pd.read_csv("penguins_raw.csv").dropna(subset=["Body Mass (g)"])
penguins.shape
(342, 17)

The dataset includes three penguin species.

penguins["Species"].value_counts()
Species
Adelie Penguin (Pygoscelis adeliae)          151
Gentoo penguin (Pygoscelis papua)            123
Chinstrap penguin (Pygoscelis antarctica)     68
Name: count, dtype: int64

For the first example we’ll select just the Chinstrap penguins.

chinstrap = penguins.query('Species.str.startswith("Chinstrap")')

We’ll use this function to plot estimated PDFs.

def plot_kde(sample, name="", **options):
    """Plot an estimated PDF."""

    kde = gaussian_kde(sample)
    m, s = np.mean(sample), np.std(sample)
    plt.axvline(m, ls=":", color="0.3")

    domain = m - 4 * s, m + 4 * s
    pdf = Pdf(kde, domain, name)
    pdf.plot(**options)

Here’s the distribution of chinstrap penguin weights in kilograms. The vertical dotted line shows the sample mean.

weights = chinstrap["Body Mass (g)"] / 1000
plot_kde(weights)
decorate(xlabel="Penguin weight (kg)", ylabel="Density")

The sample mean is about 3.7 kg.

sample_mean = np.mean(weights)
sample_mean
3.733088235294118

If you are asked to estimate the population mean, 3.7 kg is a reasonable choice – but how accurate is that estimate?

One way to answer that question is to compute a sampling distribution, which shows how much the mean varies from one sample to another. If we knew the actual mean and standard deviation in the population, we could simulate the sampling process and compute the sampling distribution. But if we knew the actual population mean, we wouldn’t have to estimate it!

Fortunately, there’s a simple way to approximate the sampling distribution, called resampling. The core idea is to use the sample to make a model of the population, then use the model to simulate the sampling process.

More specifically, we’ll use bootstrap resampling, which works by treating the sample as if it is the entire population, and drawing new samples from it. Even more specifically, we’ll use parametric bootstrap resampling, which means we’ll use the sample to estimate the parameters of the population – the mean and standard deviation – and then use a normal distribution to generate new samples.

The following function implements this process.

def resample(weights):
    """Generate a sample from a normal distribution."""
    m, s = np.mean(weights), np.std(weights)
    return np.random.normal(m, s, len(weights))
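
By contrast, a non-parametric bootstrap skips the normal model and draws new samples directly from the observed weights, with replacement. Here's a sketch of that variant – we won't use it in what follows, but with a reasonably large sample it often gives similar results.

def resample_nonparametric(weights):
    """Draw a sample from the observed values, with replacement."""
    return np.random.choice(weights, size=len(weights), replace=True)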

The following loop uses resample to generate many samples and compute the mean of each one.

sample_means = [np.mean(resample(weights)) for i in range(1001)]

The following figure shows the distribution of these sample means.

plot_kde(sample_means, "Chinstrap")
decorate(xlabel="Sample mean of weight (kg)", ylabel="Density")

This result is the sampling distribution of the sample mean. It shows how much we expect the sample mean to vary if we collect many samples of the same size – assuming that our model of the population is accurate.

Informally, we can see that if we collected another sample, the sample mean could plausibly be as low as 3.55 kg or as high as 3.9 kg.

8.5. Standard Error#

To quantify the width of the sampling distribution, one option is to compute its standard deviation – the result is called the standard error.

standard_error = np.std(sample_means)
standard_error
0.04626531069684985

In this case, the standard error is about 0.046 kg – so if we collect many samples of the same size, we expect the sample means to vary by about 0.046 kg, on average.

With larger samples, there is less variation in the sample means, so the standard error is smaller. To demonstrate, let’s compute the sampling distribution and standard error for the Adélie penguins.

adelie = penguins.query('Species.str.startswith("Adelie")')
len(adelie), len(chinstrap)
(151, 68)

The dataset includes weights for 151 Adélie penguins, compared to only 68 chinstraps. The sample mean for Adélie penguins is about 3.7 kg.

weights2 = adelie["Body Mass (g)"] / 1000
sample_mean2 = np.mean(weights2)
sample_mean2
3.7006622516556287

The following loop uses resample to generate simulated samples and compute their means.

sample_means2 = [np.mean(resample(weights2)) for i in range(1001)]

Here’s the sampling distribution for Adélie penguins, compared to the sampling distribution for chinstraps.

plot_kde(sample_means, "Chinstrap")
plot_kde(sample_means2, "Adélie", ls="--")
decorate(xlabel="Sample mean of weight (kg)", ylabel="Density")

The sampling distribution for Adélie penguins is narrower because the sample size is bigger. And the standard error is smaller, too.

standard_error2 = np.std(sample_means2)
standard_error2
0.03637933631301007

People often confuse standard error and standard deviation. Remember:

  • Standard deviation quantifies variation in measurements.

  • Standard error quantifies the precision of an estimate.

In this dataset, the standard deviation of penguin weights is about 0.38 kg for chinstrap penguins and 0.46 kg for Adélie penguins.

np.std(weights), np.std(weights2)
(0.3814986213564681, 0.457045172712245)

The standard error of the average weight is about 0.046 kg for chinstrap penguins and 0.036 kg for Adélie penguins.

np.std(sample_means), np.std(sample_means2)
(0.04626531069684985, 0.03637933631301007)

Standard deviation tells you how much penguins differ in weight. Standard error tells you how precise an estimate is. They are answers to different questions.

However, there is a relationship between them. If we know the standard deviation and sample size, we can approximate the standard error like this:

def approximate_standard_error(sample):
    n = len(sample)
    return np.std(sample) / np.sqrt(n)
approximate_standard_error(weights)
0.046263503290595163
approximate_standard_error(weights2)
0.037193808669029625

These results are close to what we got by resampling.

8.6. Confidence Intervals#

Another way to summarize the sampling distribution is to compute a confidence interval. For example, a 90% confidence interval contains 90% of the values in the sampling distribution, which we can find by computing the 5th and 95th percentiles. Here's the 90% confidence interval for the average weight of chinstrap penguins.

ci90 = np.percentile(sample_means, [5, 95])
ci90
array([3.6576334 , 3.80737506])

And here’s the 90% confidence interval for the average weight of Adélie penguins.

ci90 = np.percentile(sample_means2, [5, 95])
ci90
array([3.64335199, 3.76389419])

To interpret a confidence interval, it is tempting to say that there is a 90% chance that the true value of the population parameter falls in the 90% confidence interval. In this example, we would say there is a 90% chance that the population mean for chinstrap penguins is between 3.66 and 3.81 kg.

Under a strict philosophy of probability called frequentism, this interpretation would not be allowed, and in many statistics books, you will be told that this interpretation is wrong.

In my opinion, this prohibition is unnecessarily strict. Under reasonable philosophies of probability, a confidence interval means what people expect it to mean: there is a 90% chance that the true value falls in the 90% confidence interval.

However, confidence intervals only quantify variability due to sampling – that is, measuring only part of the population. The sampling distribution does not account for other sources of error, notably sampling bias and measurement error, which are the topics of the next section.

8.7. Sources of Error#

Suppose that instead of the average weight of penguins in Antarctica, you want to know the average weight of women in the city where you live. It is probably not practical to choose a representative sample of women and weigh them.

A simple alternative would be “telephone sampling” – that is, you could choose random numbers from the phone book, call and ask to speak to an adult woman, and ask how much she weighs.

But telephone sampling has obvious problems. For example, the sample is limited to people whose telephone numbers are listed, so it eliminates people without phones (who might be poorer than average) and people with unlisted numbers (who might be richer). Also, if you call home telephones during the day, you are less likely to sample people with jobs. And if you only sample the person who answers the phone, you are less likely to sample people who share a phone line.

If factors like income, employment, and household size are related to weight – and it is plausible that they are – the results of your survey would be affected one way or another. This problem is called sampling bias because it is a property of the sampling process.

This sampling process is also vulnerable to self-selection, which is a kind of sampling bias. Some people will refuse to answer the question, and if the tendency to refuse is related to weight, that would affect the results.

Finally, if you ask people how much they weigh, rather than weighing them, the results might not be accurate. Even helpful respondents might round up or down if they are uncomfortable with their actual weight. And not all respondents are helpful. These inaccuracies are examples of measurement error.

When you report an estimated quantity, it is useful to quantify variability due to sampling by reporting a standard error or a confidence interval. But remember that sampling error is only one source of error – and often it is not the biggest.

8.8. Glossary#

  • estimation: The process of inferring the parameters of a distribution from a sample.

  • estimator: A statistic used to estimate a parameter.

  • mean squared error (MSE): A measure of estimation error.

  • root mean squared error (RMSE): The square root of MSE, a more meaningful representation of typical error magnitude.

  • maximum likelihood estimator (MLE): An estimator that computes the point estimate most likely to be correct.

  • bias (of an estimator): The tendency of an estimator to be above or below the actual value of the parameter, when averaged over repeated experiments.

  • sampling error: Error in an estimate due to the limited size of the sample and variation due to chance.

  • sampling bias: Error in an estimate due to a sampling process that is not representative of the population.

  • measurement error: Error in an estimate due to inaccuracy in collecting or recording data.

  • sampling distribution: The distribution of a statistic if an experiment is repeated many times.

  • standard error: The RMSE of an estimate, which quantifies variability due to sampling error (but not other sources of error).

  • confidence interval: An interval that represents the expected range of an estimator if an experiment is repeated many times.

8.9. Exercises#

8.9.1. Exercise#

One of the strengths of resampling methods is that they are easy to extend to other statistics. In this chapter, we computed the sample mean of penguin weights and then used resampling to approximate the sampling distribution of the mean.

As an exercise, compute the sample standard deviation of weights for chinstrap penguins. Then use resample to approximate the sampling distribution of the standard deviation. Use the sampling distribution to compute the standard error of the estimate and a 90% confidence interval.

8.9.2. Exercise#

The Behavioral Risk Factor Surveillance System (BRFSS) dataset includes self-reported heights and weights for a sample of adults in the United States. Use this data to estimate the average height of male adults. Use resample to approximate the sampling distribution and compute a 90% confidence interval.

Because the sample size is very large, the confidence interval is very small, which means that variability due to random sampling is small. But other sources of error might be bigger. What other sources of error do you think affect the results?

The following cells download the data, read it into a DataFrame, and select the heights of male respondents.

download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/CDBRFS08.ASC.gz")
from thinkstats import read_brfss

brfss = read_brfss()
male = brfss.query("sex == 1")
heights = male["htm3"]
heights.describe()
count    154407.000000
mean        178.066221
std           7.723563
min          61.000000
25%         173.000000
50%         178.000000
75%         183.000000
max         236.000000
Name: htm3, dtype: float64

8.9.3. Exercise#

In games like soccer and hockey, the time between goals tends to follow an exponential distribution (as we saw in Section xxx). Suppose we observe a sample of times between goals. If we assume that the sample came from an exponential distribution, how can we estimate the actual mean of the distribution? We might consider using either the sample mean or the sample median. Let’s see if either of them is a consistent, unbiased estimator.

For the experiments, we’ll assume that the actual mean time between goals is 10 minutes.

actual_mean = 10

The following function generates a sample from an exponential distribution with this mean and the given sample size.

def make_exponential(n):
    return np.random.exponential(actual_mean, size=n)

Use this function to generate samples with a range of sizes and compute the mean of each one. As n increases, do the sample means converge to the actual mean?

Next, generate samples with a range of sizes and compute the median of each one. Do the sample medians converge to the actual median? Here's how you can compute the actual median of an exponential distribution with the given mean – setting the exponential CDF, 1 - exp(-x / mean), equal to 1/2 and solving for x gives log(2) times the mean.

actual_median = np.log(2) * actual_mean
actual_median
6.931471805599453

Next, generate many samples with size 10 and check whether the sample mean is an unbiased estimator of the actual mean.

Finally, check whether the sample median is an unbiased estimator of the actual median.

8.9.4. Exercise#

In Section 8.3 we tested a biased estimator of variance and showed that it is, in fact, biased. And we showed that the unbiased estimator is unbiased. Now let's try standard deviation.

To estimate the standard deviation of a population, we can compute the square root of the biased or unbiased estimator of variance, like this:

def biased_std(sample):
    """Square root of the biased estimator of variance."""
    var = biased_var(sample)
    return np.sqrt(var)


def unbiased_std(sample):
    """Square root of the unbiased estimator of variance."""
    var = unbiased_var(sample)
    return np.sqrt(var)

Use make_sample to compute many samples of size 10 from a normal distribution with mean 3.7 and standard deviation 0.46. Check whether either of these is an unbiased estimator of standard deviation.

# Here's an example using `make_sample`

mu, sigma = 3.7, 0.46
make_sample(n=10)
array([3.12391738, 3.42108139, 2.72577285, 3.9949709 , 3.969503  ,
       4.05017156, 4.08291732, 4.14870438, 3.52032859, 3.66219622])

8.9.5. Exercise#

This exercise is based on the German tank problem, which is a simplified version of an actual analysis performed by the Economic Warfare Division of the American Embassy in London during World War II.

Suppose you are an Allied spy and your job is to estimate how many tanks the Germans have built. As data, you have serial numbers recovered from k captured tanks.

If we assume that the Germans have N tanks numbered from 1 to N, and that all tanks in this range were equally likely to be captured, we can estimate N like this:

def estimate_tanks(sample):
    m = np.max(sample)
    k = len(sample)
    return m + (m - k) / k

As an example, suppose N is 122.

N = 122
tanks = np.arange(1, N + 1)

We can use the following function to generate a random sample of k tanks.

def sample_tanks(k):
    return np.random.choice(tanks, replace=False, size=k)

Here’s an example.

np.random.seed(17)
sample = sample_tanks(5)
sample
array([74, 71, 95, 10, 17])

And here is the estimate based on this sample.

estimate_tanks(sample)
113.0

Check whether this estimator is biased.

For more on this problem, see the Wikipedia article on the German tank problem (https://en.wikipedia.org/wiki/German_tank_problem) and Ruggles and Brodie, "An Empirical Approach to Economic Intelligence in World War II", Journal of the American Statistical Association, March 1947, available here.

For an explanation of how this estimator works, you might like this video.

8.9.6. Exercise#

In several sports – especially basketball – many players and fans believe in a phenomenon called the “hot hand”, which implies that a player who has hit several consecutive shots is more likely to hit the next, and a player who has missed several times is more likely to miss.

A famous paper proposed a way to test whether the hot hand is real or a statistical illusion, by looking at sequences of hits and misses from professional basketball games. For each player, the authors computed the overall probability of making a shot, and the conditional probability of making a shot after three consecutive hits. For eight out of nine players, they found that the probability of making a shot was lower after three hits. Based on this and other results, they concluded that there is “no evidence for a positive correlation between the outcomes of successive shots”. And for several decades, many people believed that the hot hand had been debunked.

However, this conclusion is based on a statistical error, at least in part. A 2018 paper showed that the statistic used in the first paper – the probability of making a shot after three hits – is biased. Even if the probability of making every shot is exactly 0.5, and there is no correlation between the outcomes, the probability of making a shot after three hits is less than 0.5.

It is not obvious why that’s true, which is why the error went undetected for so long, and I won’t try to explain it here. But we can use the methods from this chapter to check it. We’ll use the following function to generate a sequence of 0s and 1s with probability 0.5 and no correlation.

def make_hits_and_misses(n):
    """Generate a sequence of 0s and 1s."""
    return np.random.choice([0, 1], size=n)

In the notebook for this chapter, I provide a function that finds all subsequences of three hits (1s) and returns the element of the sequence that follows.

import numpy as np


def get_successors(seq, target_sum=3):
    """Get the successors of each subsequence that sums to a target value.

    Parameters:
    seq (array-like): Sequence of 1s and 0s.
    target_sum (int): The target sum of the subsequence. Default is 3.

    Returns:
    np.ndarray: Array of successors to subsequences that sum to `target_sum`.
    """
    # Check if the input sequence is too short
    if len(seq) < 3:
        return np.array([])

    # Compute the sum of each subsequence of length 3
    kernel = [1, 1, 1]
    corr = np.correlate(seq, kernel, mode="valid")

    # Find the indices where the subsequence sums to the target value
    indices = np.nonzero(corr == target_sum)[0]

    # Remove cases where the subsequence is at the end of the sequence
    indices = indices[indices < len(seq) - 3]

    # Find the successors of each valid subsequence
    successors = seq[indices + 3] if len(indices) > 0 else np.array([])

    return successors

Generate a large number of sequences with length 100 and for each sequence, find each shot that follows three hits. Compute the percentage of these shots that are hits. Hint: if the sequence does not contain three consecutive hits, the function returns an empty sequence, so your code will have to handle that.

If you run this simulation many times, what is the average percentage of hits? How does this result vary as you increase or decrease the length of the sequence?

The famous paper is Gilovich, T., Vallone, R., & Tversky, A. (1985). The hot hand in basketball: On the misperception of random sequences. Cognitive psychology, 17(3), 295-314.

The paper showing the statistical error is Miller, J. B., & Sanjurjo, A. (2018). Surprised by the hot hand fallacy? A truth in the law of small numbers. Econometrica, 86(6), 2019-2047.

The first paper is available here. The second is available here. For an overview of the topic and an explanation of the error, you might like this video.
