The Poincaré Problem


Selection bias is the hardest problem in statistics because it’s almost unavoidable in practice, and once the data have been collected, it’s usually not possible to quantify the effect of selection or recover an unbiased estimate of what you are trying to measure.

And because the effect is systematic, not random, it doesn’t help to collect more data. In fact, larger sample sizes make the problem worse, because they give the false impression of precision.

But sometimes, if we are willing to make assumptions about the data generating process, we can use Bayesian methods to infer the effect of selection bias and produce an unbiased estimate.

As an example, let’s solve an exercise from Chapter 7 of Think Bayes. It’s based on a fictional anecdote about the mathematician Henri Poincaré:

Supposedly Poincaré suspected that his local bakery was selling loaves of bread that were lighter than the advertised weight of 1 kg, so every day for a year he bought a loaf of bread, brought it home and weighed it. At the end of the year, he plotted the distribution of his measurements and showed that it fit a normal distribution with mean 950 g and standard deviation 50 g. He brought this evidence to the bread police, who gave the baker a warning.

For the next year, Poincaré continued to weigh his bread every day. At the end of the year, he found that the average weight was 1000 g, just as it should be, but again he complained to the bread police, and this time they fined the baker.

Why? Because the shape of the new distribution was asymmetric. Unlike the normal distribution, it was skewed to the right, which is consistent with the hypothesis that the baker was still making 950 g loaves, but deliberately giving Poincaré the heavier ones.

To see whether this anecdote is plausible, let’s suppose that when the baker sees Poincaré coming, he hefts k loaves of bread and gives Poincaré the heaviest one. How many loaves would the baker have to heft to make the average of the maximum 1000 g?


The following function generates the maximum of k values from a normal distribution with the given parameters, repeated n times.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

def generate_sample(mu, sigma, k, n):
    """Generate n values, each the maximum of k draws from Normal(mu, sigma)."""
    return norm(mu, sigma).rvs((k, n)).max(axis=0)
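
The plotting cells below also use a decorate helper from the Think Bayes utilities, which isn't defined in this excerpt. Here's a minimal stand-in that labels the axes and adds a legend; the original helper may do more.

def decorate(**options):
    """Minimal stand-in for the Think Bayes plotting helper:
    apply axis labels/titles and show a legend if any lines are labeled."""
    ax = plt.gca()
    ax.set(**options)
    handles, labels = ax.get_legend_handles_labels()
    if labels:
        ax.legend()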

Here are distributions with the same underlying normal distribution and different values of k.

mu_true, sigma_true = 950, 50

for k in range(1, 5):
    sample = generate_sample(mu_true, sigma_true, k, n=1000)
    sns.kdeplot(sample, label=f'k={k}')
    
decorate(xlabel='Weight (g)')
plt.savefig('bread_dist_max.png', dpi=300)
[Figure: distributions of the heaviest of k loaves, for k = 1 to 4]

As k increases, the mean increases and the standard deviation decreases.

When k=4, the mean is close to 1000. So let’s assume the baker hefted four loaves and gave the heaviest to Poincaré.
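
As a rough cross-check (my addition, not part of the original notebook), we can also compute the expected maximum by numerical integration, using the density of the maximum derived later in this post. For k=4 the result should come out close to 1000 g.

from scipy.integrate import quad
from scipy.stats import norm

def expected_max(mu, sigma, k):
    """Expected value of the maximum of k draws from Normal(mu, sigma),
    computed by integrating z * k * phi(z) * Phi(z)**(k-1) and rescaling."""
    integrand = lambda z: z * k * norm.pdf(z) * norm.cdf(z)**(k - 1)
    ez, _ = quad(integrand, -8, 8)
    return mu + sigma * ez

for k in range(1, 5):
    print(k, expected_max(mu_true, sigma_true, k))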

k_true = 4

sample4 = generate_sample(mu_true, sigma_true, k_true, n=365)
mean, std = sample4.mean(), sample4.std()
mean, std
(1001.9175597523723, 36.27105788650583)

At the end of one year, can we tell the difference between these hypotheses?

  • Innocent: The baker actually increased the mean to 1000, and k=1.

  • Shenanigans: The mean was still 950, but the baker selected with k=4.

Here’s a sample under the k=4 scenario, compared to 10 samples with k=1.

for i in range(10):
    sample = generate_sample(mean, std, k=1, n=365)
    sns.kdeplot(sample, color='gray', lw=0.5)

sns.kdeplot(sample4, color='C3', label='k=4')
decorate(xlabel='Weight (g)')
plt.savefig('bread_dist_compare.png', dpi=300)
[Figure: the k=4 sample compared to ten k=1 samples with the same mean and standard deviation]

The k=4 distribution falls mostly within the range of variation we’d expect from the k=1 distribution with the same mean and standard deviation. If you were on the jury and saw this evidence, would you convict the baker?
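
The anecdote hinges on asymmetry, so another rough check (again, my addition) is the sample skewness: the max-of-4 sample should be mildly right-skewed, while a normal sample of the same size should have skewness near zero. With only 365 loaves, though, the sample skewness is noisy, which is consistent with how hard the curves are to tell apart by eye.

from scipy.stats import skew

print(skew(sample4))                                  # max-of-4 sample: mildly right-skewed
print(skew(generate_sample(mean, std, k=1, n=365)))   # normal sample: skewness near zero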

Ask a Bayesian

Taking a Bayesian approach to this problem, let's see whether we can use the data to estimate k and the parameters of the underlying distribution. To compute the likelihood of the data, we have to do a little math.

Assume that \(m\) is the maximum of \(k\) draws from a distribution with CDF \(F(x)\). In general, the CDF of \(m\) is

\[ F_M(m) = F(m)^k \]

which we can differentiate to get the PDF:

\[ f_M(m) = k f(m) F(m)^{k-1} \]

If the draws are from a normal distribution with parameters \(\mu\) and \(\sigma\),

\[ F(m) = \Phi(z) \]
\[ f(m) = \frac{1}{\sigma} \phi(z) \]

where \(\phi\) and \(\Phi\) are the standard normal PDF and CDF, and \(z\) is the standardized value \((m-\mu) / \sigma\). Plugging in, we get the PDF of \(m\):

\[ f_M(m) = \frac{k}{\sigma} \phi(z) \Phi(z)^{k-1} \]

Here’s a function that evaluates the log of this PDF at \(m\).

import pymc as pm
import arviz as az

LOG2PI = np.log(2 * np.pi)
SQRT2 = np.sqrt(2)

def max_normal_logp(m, mu, sigma, k):
    """Evaluate the log PDF of the distribution of the max of k draws at m."""
    z = (m - mu) / sigma
    log_phi = -0.5 * (z**2 + LOG2PI)

    # Compute Φ(z) = 0.5 * erfc(-z/√2) with PyTensor's erfc for numerical stability
    Phi = 0.5 * pm.math.erfc(-z / SQRT2)
    log_Phi = pm.math.log(Phi)

    return pm.math.log(k / sigma) + log_phi + (k - 1) * log_Phi
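
As a quick sanity check (my addition), with k=1 this should reduce to an ordinary normal log density. Because pm.math operations build PyTensor expressions, we call .eval() to get a number.

# With k=1 the log PDF should match scipy's normal logpdf
print(max_normal_logp(1000.0, 950, 50, 1).eval())
print(norm(950, 50).logpdf(1000.0))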

Now here’s a PyMC model that

  • Defines prior distributions for mu, sigma, and k, and

  • Uses the previous function to define a custom distribution that computes the likelihood of the data for a hypothetical set of parameters.

def make_model(sample):
    with pm.Model() as model:
        mu = pm.Normal("mu", mu=950, sigma=30)
        sigma = pm.HalfNormal("sigma", sigma=30)
        k = pm.Uniform("k", lower=0.5, upper=15)

        obs = pm.CustomDist(
            "obs",
            mu, sigma, k,
            logp=max_normal_logp,
            observed=sample,
        )
    return model

Notice that we treat k as continuous. That's because continuous parameters are much easier to sample (and the log PDF function allows non-integer values of k). But it also makes sense in the context of the problem – for example, if the baker sometimes hefts three loaves and sometimes four, we can approximate the distribution of the maximum with k=3.5.
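
Here's a rough illustration of that point (my addition, using the expected_max helper sketched earlier): the average of a 50/50 mixture of k=3 and k=4 maxima should land close to the expected value that the fractional-k density gives for k=3.5.

# Mixture of hefting 3 loaves half the time and 4 loaves half the time
mix = np.concatenate([
    generate_sample(mu_true, sigma_true, 3, n=50_000),
    generate_sample(mu_true, sigma_true, 4, n=50_000),
])
print(mix.mean(), expected_max(mu_true, sigma_true, 3.5))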

Now let’s sample.

model = make_model(sample4)
with model:
    trace = pm.sample()

The model runs quickly and the diagnostics look good.
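
Beyond the summary table below, a common visual check is ArviZ's trace plot (my addition):

az.plot_trace(trace)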

az.summary(trace)
          mean      sd   hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
mu     938.913  20.729  898.587  974.891      0.835    0.495     611.0     683.0    1.0
sigma   54.009   5.556   44.617   65.455      0.225    0.130     608.0     796.0    1.0
k        5.484   2.632    1.809   11.045      0.108    0.092     619.0     696.0    1.0

Here are the posterior distributions of the parameters compared to their known values.
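
The plot_posterior helper used below isn't defined in this excerpt. Here's a minimal sketch of what it might do: extract the posterior samples with ArviZ, plot a KDE, and mark the known true value. The original helper may differ in details.

def plot_posterior(trace, name, true_value, xlabel=None):
    """Sketch of a helper: KDE of the posterior samples for one parameter,
    with a vertical line at the known true value."""
    samples = az.extract(trace)[name].to_numpy()
    sns.kdeplot(samples, label=f'posterior of {name}')
    plt.axvline(true_value, color='gray', ls=':', label='true value')
    decorate(xlabel=xlabel or name)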

plot_posterior(trace, 'mu', mu_true)
plt.savefig('bread_posterior_mu.png', dpi=300)
[Figure: posterior distribution of mu, with the true value marked]
plot_posterior(trace, 'sigma', sigma_true)
plt.savefig('bread_posterior_sigma.png', dpi=300)
[Figure: posterior distribution of sigma, with the true value marked]
plot_posterior(trace, 'k', k_true, xlabel='Number of loaves')
plt.savefig('bread_posterior_k.png', dpi=300)
[Figure: posterior distribution of k, the number of loaves, with the true value marked]

With one year of data, we can recover the parameters pretty well. The true values fall comfortably inside the posterior distributions, and the posterior mode of k is close to the true value, 4.

But the posterior distributions are still quite wide. There is even some possibility that the baker is innocent, although it is small.

# Posterior probability that k is near 1, i.e., that the baker is innocent
samples_k = az.extract(trace)['k'].to_numpy()
(samples_k < 1.5).mean()
0.004

This example shows that we can use the shape of an observed distribution to estimate the effect of selection bias and recover the unbiased latent distribution. But we might need a lot of data, and the inference depends on strong assumptions about the data generating process.

Credits: I don’t remember where I got this example from (maybe here?), but it appears in Leonard Mlodinov, The Drunkard’s Walk (2008). Mlodinov credits Bart Holland, What Are the Chances? (2002). The ultimate source seems to be George Gamow and Marvin Stern, Puzzle Math (1958) – but their version is about a German professor, not Poincaré.

Think Bayes, Second Edition

Copyright 2020 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)