Relay Races and Revolving Doors#


I gave a talk related to this chapter at PyData NYC 2019. You can watch the video here.

# Install empiricaldist if we don't already have it

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist
# download utils.py

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
        
download("https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/notebooks/utils.py")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set the random seed so we get the same results every time
np.random.seed(17)

When you run 209 miles, you have a lot of time to think. In 2010, I was a member of a 12-person team that ran a 209-mile relay race in New Hampshire.

Long-distance relay races are an unusual format, and this was the first (and last!) time I participated in one. I ran the third leg, so when I joined the race, it had been going for a few hours and runners were spread out over several miles of the course.

After I ran a few miles, I noticed something unusual:

  • There were more fast runners in the race than I expected. Several times I was overtaken by runners much faster than me.

  • There were also more slow runners than I expected. When I passed other runners, I was often much faster than them.

At first I thought this pattern might reflect the kind of people who sign up for a 209-mile race. Maybe, for some reason, this format appeals primarily to runners who are much faster than average or much slower, and not as much to middle-of-the-pack runners like me.

After the race, with more oxygen available to my brain, I realized that this explanation is wrong. To my embarrassment, I was fooled by a common statistical error, one that I teach students in my classes!

The error is called length-biased sampling, and its effect is called the inspection paradox. If you have not heard of it, this chapter will change your life, because once you learn about the inspection paradox, you see it everywhere.

To explain it, I’ll start with simple examples, and we’ll work our way up. Some of the examples are fun, but some are more serious. For example, length-biased sampling shows up in the criminal justice system, where it distorts our perception of prison sentences and the risk of repeat offenders.

But it’s not all bad news: if you are aware of the inspection paradox, you can sometimes use it to measure, indirectly, quantities that would be hard or impossible to measure directly. As an example, I’ll explain a clever system used during the COVID pandemic to track infections and identify superspreaders.

But let’s start with class sizes.

Class Size#

The class sizes in this section come from data reported by Purdue University for undergraduate class sizes in the 2013–14 academic year.

The data was originally posted here and is now archived here.

I typed in the number of classes in the row labeled “Total all classes”.

sizes = [(1, 1), (2, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 99), (100, 300)]

counts = [138, 635, 1788, 1979, 796, 354, 487, 333]

Since the class sizes are specified in ranges, I used the following function to generate uniform random samples within each range.

def generate_sample(sizes, counts):
    """Generate a sample from a distribution.

    sizes: sequence of (low, high) pairs
    counts: sequence of integers

    returns: NumPy array
    """
    t = []
    for (low, high), count in zip(sizes, counts):
        print(low, high, count)
        sample = np.random.randint(low, high + 1, count)
        t.extend(sample)
    return np.array(t)
unbiased = generate_sample(sizes, counts)
1 1 138
2 9 635
10 19 1788
20 29 1979
30 39 796
40 49 354
50 99 487
100 300 333

The following function takes a sample and generates a new random sample, using the given weights.

def resample_weighted(sample, weights):
    """Generate a biased sample.

    sample: NumPy array
    weights: NumPy array

    returns: NumPy array
    """
    n = len(sample)
    p = weights / np.sum(weights)
    return np.random.choice(sample, n, p=p)
biased = resample_weighted(unbiased, unbiased)

Here’s the distribution of class sizes as given in the report.

from utils import decorate, kdeplot

xs = np.arange(1, 300)
kdeplot(unbiased, xs, "Reported by the Dean", color="C1")

decorate(xlabel="Class size", title="Distribution of class sizes")
[Figure: Distribution of class sizes, as reported by the dean]

The upper bound in this figure, 300, is just my guess. The original data indicates how many classes are bigger than 100, but it doesn’t say how much bigger. For this example, though, we don’t have to be too precise.

Now here’s what the distribution would look like if we sampled students.

xs = np.arange(1, 300)
kdeplot(unbiased, xs, "Reported by the Dean", color="C1")
kdeplot(biased, xs, "Reported by students", ls="--")

decorate(xlabel="Class size", title="Distribution of class sizes")
[Figure: Distribution of class sizes, as reported by the dean and as reported by students]
np.mean(unbiased)
34.611827956989245
np.mean(biased)
92.59815668202765

The average class size reported by the dean is about 35 students; the average as seen by students is almost 93, more than two and a half times as big.

Unbiasing the Data#

Now suppose we collect a biased sample of 500 class sizes and apply the inverse weights to estimate the unbiased distribution.

sample = np.random.choice(biased, 500)
reweighted = resample_weighted(sample, 1 / sample)
xs = np.arange(1, 300)
kdeplot(unbiased, xs, "Reported by the Dean", color="C1")
kdeplot(biased, xs, "Reported by students", ls="--")
kdeplot(reweighted, xs, "Reweighted student sample", color="C2", ls=":")

decorate(xlabel="Class size", title="Distribution of class sizes")
[Figure: Distribution of class sizes, with the reweighted student sample]

If the estimate were perfect, the solid and dotted lines would be identical. But with a limited sample size, we underestimate the number of small classes by a little and overestimate the number of classes with 50-80 students. Nevertheless, it works pretty well.

This strategy also works in cases where the actual distribution is not available, whether by design or by accident. If we can collect a good-quality sample from the biased distribution, we can approximate the actual distribution by drawing an inverse-weighted sample from the biased data. This process is an example of weighted resampling. It’s “weighted” in the sense that some items are given more weight than others, that is, a higher probability of being sampled. And it’s called “resampling” because we’re drawing a random sample from something that is itself a random sample.
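
As a quick check (mine, not part of the original analysis), we can compare the mean of the reweighted sample to the mean of the unbiased distribution; they should be close, though the exact values depend on the random seed and the sample size.

# The reweighted mean should approximate the unbiased mean
np.mean(unbiased), np.mean(reweighted)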

Where’s My Train?#

I collected the following data from the Red Line, which is a subway line in Boston, Massachusetts. The MBTA, which operates the Red Line, provides a real-time data service, which I used to record the arrival times for 70 trains between 4pm and 5pm over several days. I don’t remember when, but I think it was in 2010.

unbiased = [
    428.0,
    705.0,
    407.0,
    465.0,
    433.0,
    425.0,
    204.0,
    506.0,
    143.0,
    351.0,
    450.0,
    598.0,
    464.0,
    749.0,
    341.0,
    586.0,
    754.0,
    256.0,
    378.0,
    435.0,
    176.0,
    405.0,
    360.0,
    519.0,
    648.0,
    374.0,
    483.0,
    537.0,
    578.0,
    534.0,
    577.0,
    619.0,
    538.0,
    331.0,
    186.0,
    629.0,
    193.0,
    360.0,
    660.0,
    484.0,
    512.0,
    315.0,
    457.0,
    404.0,
    740.0,
    388.0,
    357.0,
    485.0,
    567.0,
    160.0,
    428.0,
    387.0,
    901.0,
    187.0,
    622.0,
    616.0,
    585.0,
    474.0,
    442.0,
    499.0,
    437.0,
    620.0,
    351.0,
    286.0,
    373.0,
    232.0,
    393.0,
    745.0,
    636.0,
    758.0,
]
unbiased = np.array(unbiased) / 60
biased = resample_weighted(unbiased, unbiased)

Here’s the unbiased distribution.

xs = np.linspace(1, 16.5, 101)
kdeplot(unbiased, xs, "Seen by MBTA", color="C1")

decorate(
    xlabel="Time between trains (min)", title="Distribution of time between trains"
)
[Figure: Distribution of time between trains, as seen by the MBTA]

And here’s the biased distribution as seen by passengers.

xs = np.linspace(1, 16.5, 101)
kdeplot(unbiased, xs, "Seen by MBTA", color="C1")
kdeplot(biased, xs, "Seen by passengers", ls="--")

decorate(
    xlabel="Time between trains (min)", title="Distribution of time between trains"
)
[Figure: Distribution of time between trains, as seen by the MBTA and as seen by passengers]
np.mean(biased), np.mean(unbiased)
(8.570714285714285, 7.7680952380952375)
(np.mean(biased) - np.mean(unbiased)) / np.mean(unbiased) * 100
10.332250352479615

In the actual distribution, the average time between trains is 7.8 minutes; in the biased distribution, as seen by a random passenger, it is about 8.6 minutes, roughly 10% longer.
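
To see why passengers oversample long gaps, here’s a minimal simulation (a sketch I added, not part of the original analysis): lay the observed gaps end to end and drop arrival times uniformly along the timeline. The gap each arrival lands in is length-biased, so the mean of the sampled gaps should be close to the biased mean above.

# Lay the gaps end to end to form a timeline
ends = np.cumsum(unbiased)

# Simulate passengers arriving at uniformly random times
arrivals = np.random.uniform(0, ends[-1], size=10000)

# Find the gap each arrival falls in; longer gaps are hit more often
sampled_gaps = unbiased[np.searchsorted(ends, arrivals)]
np.mean(sampled_gaps)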

Finding Superspreaders#

To quantify the effect of length-biased sampling on contact tracing, let’s suppose that 70% of infected people don’t infect anyone else, as in the Hong Kong study, and that the other 30% infect between 1 and 15 other people, uniformly distributed. The average of this distribution is 2.4, which is a plausible value of \(R\), the average number of people each infected person infects.

Now suppose we discover an infected patient, trace forward, and find someone the patient infected. On average, we expect this person to infect 2.4 other people.

import numpy as np

t1 = np.arange(1, 16)
t2 = [0] * 70
unbiased = np.concatenate([t1, t1, t2])
unbiased
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,  1,  2,
        3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])
np.mean(unbiased)
2.4

Here’s the mean of the biased distribution, which is what we get if we trace backward: select an infected person at random and count how many people their infector infected.

biased = resample_weighted(unbiased, unbiased)
np.mean(biased)
10.29
np.mean(biased) / np.mean(unbiased)
4.2875
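
This result is consistent with the closed form for length-biased sampling: the mean of the biased distribution is \(E[x^2] / E[x]\). Here’s that check (an identity I added, not part of the original analysis).

# Closed-form check: the length-biased mean is E[x^2] / E[x]
np.mean(unbiased**2) / np.mean(unbiased)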

Road Rage#

In the following figure, the solid line shows the actual distribution of speeds from the James Joyce Ramble, a 10K race in Massachusetts.

The original web page is no longer available, but we can get a copy from the Internet Archive:

download("https://web.archive.org/web/20100429073703/http://coolrunning.com/results/10/ma/Apr25_27thAn_set1.shtml")
from bs4 import BeautifulSoup

# Parse the archived results page
soup = BeautifulSoup(open("Apr25_27thAn_set1.shtml"), "html.parser")
import pandas as pd

speeds = pd.Series([], dtype=float)

# The results are in a preformatted text table; rows with 13 or 14
# fields contain finishers, and the sixth field is the pace in
# minutes:seconds per mile
table = soup.find("pre")
for line in table.text.split("\n"):
    t = line.split()
    if len(t) in [13, 14]:
        place, place_in_div, div, gun, net, pace = t[0:6]
        place = int(place)
        m, s = [int(x) for x in pace.split(":")]
        secs = m * 60 + s
        # Convert seconds per mile to kilometers per hour
        kph = 1.61 / secs * 60 * 60
        speeds[place] = kph

len(speeds)
1591

Here’s the distribution of speeds a spectator would see if they watched all of the runners go by.

unbiased = speeds.values
xs = np.linspace(speeds.min(), speeds.max(), 101)
kdeplot(unbiased, xs, "Seen by spectator", color="C1")

decorate(xlabel="Running speed (kph)", title="Distribution of running speed")
[Figure: Distribution of running speed, as seen by a spectator]

And here’s the biased view, seen by someone running 11 kph, shown as a dashed line.

# The chance of passing or being passed is proportional to the
# difference in speed
weights = np.abs(unbiased - 11)
biased = resample_weighted(unbiased, weights)
kdeplot(unbiased, xs, "Seen by spectator", color="C1")
kdeplot(biased, xs, "Seen by runner at 11 kph", ls="--")

decorate(xlabel="Running speed (kph)", title="Distribution of running speed")
[Figure: Distribution of running speed, as seen by a spectator and by a runner at 11 kph]

In the actual distribution, there are a lot of runners near 11 kph, but if you run at that speed, you are unlikely to see them. As a result, the biased distribution has few runners near 11 kph and more at the extremes. And it has two modes, one near 9 kph and one near 13 kph. So that explains my oxygen-deprived confusion.
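
To back up that claim, here’s a quick check I added: the fraction of runners within 1 kph of 11 kph should be much smaller in the biased sample than in the actual distribution.

# Fraction of runners within 1 kph of 11 kph in each distribution
np.mean(np.abs(unbiased - 11) < 1), np.mean(np.abs(biased - 11) < 1)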

Just Visiting#

I downloaded data from the U.S. Federal Bureau of Prisons (BOP) in June 2019. I archived the page so you can download it here.

# Here's how I read the file, but it doesn't work on Windows:

# tables = pd.read_html("BOP Statistics_ Sentences Imposed.html")
# df = tables[0]

# So here's the CSV version I saved

download(
    "https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/data/BOP_Statistics_Sentences_Imposed.csv"
)
df = pd.read_csv('BOP_Statistics_Sentences_Imposed.csv', index_col=0)
df
   Sentence                      # of Inmates   % of Inmates
0  0 to 1 year*                          5155           2.3%
1  > 1 year to < 3 years**              18619          11.3%
2  3 years to < 5 years                 17897          10.9%
3  5 years to < 10 years                41887          25.4%
4  10 years to < 15 years               34995          21.3%
5  15 years to < 20 years               18674          11.3%
6  20 years or more but < Life          22738          13.8%
7  Life                                  4600           2.8%

Here are the low and high ends of each sentencing range, in years. I assume that the minimum sentence is about a week (0.02 years), that the longest sentences “less than life” are 40 years, and that a life sentence is effectively between 40 and 60 years.

sentences = [(0.02, 1), (1, 3), (3, 5), (5, 10), (10, 15), (15, 20), (20, 40), (40, 60)]
counts = df["# of Inmates"]
def generate_sample_continuous(sizes, counts):
    """Generate a sample from a distribution.

    sizes: sequence of (low, high) pairs
    counts: sequence of integers

    returns: NumPy array
    """
    t = []
    for (low, high), count in zip(sizes, counts):
        print(count, low, high)
        sample = np.random.uniform(low, high, count)
        t.extend(sample)
    return np.array(t)
biased = generate_sample_continuous(sentences, counts)
5155 0.02 1
18619 1 3
17897 3 5
41887 5 10
34995 10 15
18674 15 20
22738 20 40
4600 40 60
# The factor of 0.85 accounts for time served: federal prisoners
# typically serve about 85% of their nominal sentences
weights = 1 / (0.85 * np.array(biased))
unbiased = resample_weighted(biased, weights)

In the following figure, the dashed line shows the distribution of sentences as reported by the BOP. The solid line shows the unbiased distribution I estimated.

xs = np.linspace(0, 60, 101)
kdeplot(unbiased, xs, "Unbiased distribution", color="C1")
kdeplot(biased, xs, "Sampled BOP data", ls="--")

decorate(
    xlabel="Prison sentence (years)", title="Distribution of federal prison sentences"
)
[Figure: Distribution of federal prison sentences, unbiased vs. sampled BOP data]

In the BOP sample, sentences less than three years are underrepresented and longer sentences are overrepresented.

If the length of your stay is \(y\), the chance of overlapping with a prisoner whose sentence is \(x\) is proportional to \(x + y\). In the following figure, the dotted line shows what the resulting sample looks like when \(y\) is 13 months.

# Time actually served, assuming 85% of the nominal sentence
x = 0.85 * unbiased

# Length of the observer's stay, in years
y = 13 / 12

# The chance of overlapping with a prisoner is proportional to x + y
weights = x + y
kerman = resample_weighted(unbiased, weights)
xs = np.linspace(0, 45, 101)
kdeplot(unbiased, xs, 'Actual distribution', color='C1')
kdeplot(biased, xs, 'Sampled BOP data', ls='--')
kdeplot(kerman, xs, 'Seen during a 13 month sentence', color='C2', ls=':')

decorate(xlabel='Prison sentence (years)',
         title='Distribution of federal prison sentences')
[Figure: Distribution of federal prison sentences, actual, sampled BOP data, and seen during a 13-month sentence]
  • The mean of the actual distribution is 3.5 years; the mean of the biased distribution is almost 13 years, more than three times as long! To a 13-month observer, the mean is about 10 years, still much greater than the actual mean.

  • In the actual distribution, about 45% of prisoners have sentences less than a year. If you visit a prison once, fewer than 5% of the prisoners you see are short-timers. If you stay for 13 months, your estimate is better but still not accurate: about 15% of the prisoners you meet are short-timers.

# In the unbiased distribution, almost half of prisoners serve less than one year.

np.mean(unbiased < 1)
0.4573633518670434
np.mean(biased < 1)
0.0313250083553611
np.mean(kerman < 1)
0.14339014978883724
np.mean(unbiased)
3.5116851817207166
np.mean(biased)
12.770277541686598
np.mean(kerman)
10.196918727005256

Recidivism#

Rhodes, William, Gerald Gaes, Jeremy Luallen, Ryan Kling, Tom Rich, and Michael Shively. “Following incarceration, most released offenders never return to prison.” Crime & Delinquency 62, no. 8 (2016): 1003-1025.

This 2016 paper in the journal Crime & Delinquency shows how the inspection paradox affects our estimates of recidivism, that is, the rate at which people released from prison later return to prison. Based on data from 17 states, collected between 2000 and 2012, the authors report the distribution of the number of prison admissions per person in an event-based sample (page 1015).

from empiricaldist import Pmf

ps = [0.51, 0.25, 0.13, 0.07, 0.03, 0.01]
ks = np.arange(1, len(ps) + 1)
pmf = Pmf(ps, ks)
pmf.sum()
1.0
pmf.plot(color="C1", label="Event-based sample")

decorate(
    xlabel="Number of admissions",
    ylabel="Fraction of prisoners",
    title="Distribution of admissions to prison",
)
[Figure: Distribution of admissions to prison, event-based sample]
# fraction of recidivists in the event-based sample

1 - pmf[1]
0.49

In this sample, 51% of the prisoners served only one sentence; the other 49% were recidivists. We can use this data to simulate an individual-based sample. In the following figure, the dashed line shows the result.

# To unbias the sample, divide by the number of admissions and
# renormalize: a person with k admissions appears k times in the
# event-based sample
unbiased = pmf / ks
unbiased.normalize()

# fraction of recidivists in the individual-based sample
1 - unbiased[1], unbiased[1]
(0.2750533049040512, 0.7249466950959488)
pmf.plot(color="C1", label="Event-based sample")
unbiased.plot(ls="--", label="Individual-based sample")

decorate(
    xlabel="Total number of sentences",
    ylabel="Fraction of prisoners",
    title="Distribution of sentences",
)
[Figure: Distribution of sentences, event-based vs. individual-based sample]
pmf.mean(), unbiased.mean()
(1.89, 1.421464108031272)
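
As a consistency check (mine, not from the paper), re-biasing the individual-based distribution by the number of admissions should recover the event-based distribution, up to floating-point error.

# Re-bias by the number of admissions; the result should match
# the event-based Pmf we started with
rebiased = unbiased * ks
rebiased.normalize()
rebiased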

In the individual-based sample, most prisoners serve one sentence; only 28% are recidivists. That’s substantially lower than the recidivism rate in the event-based sample, which is 49%.

Probably Overthinking It

Copyright 2022 Allen Downey

The code in this notebook and utils.py is under the MIT license.