Estimating vaccine efficacy

In a recent vaccine trial, no one in the vaccinated group got the disease, so the makers of the vaccine reported that it is 100% effective. Of course, nothing in the real world is 100% effective, so this example raises a question: how do we estimate efficacy when the number of cases is 0?

In this article, I present a Bayesian way to answer this question – and the answer turns out to be about 94%.

If you like this article, you might also like the second edition of Think Bayes.

Click here to run this article on Colab

The Data

According to this press release

…in a Phase 3 trial in adolescents 12 to 15 years of age with or without prior evidence of SARS-CoV-2 infection, the Pfizer-BioNTech COVID-19 vaccine BNT162b2 demonstrated 100% efficacy and robust antibody responses…

Of course, it is good news that the vaccine seems to work in this age group as well as it does in adults, and possibly better. But reporting that the vaccine is 100% effective is a little strange.

The press release includes the following details about the results

The trial enrolled 2,260 adolescents 12 to 15 years of age in the United States. In the trial, 18 cases of COVID-19 were observed in the placebo group (n=1,129) versus none in the vaccinated group (n=1,131).

Since there were no cases in the vaccinated group, it is reasonable to think that the vaccine works, but in the real world, nothing is 100% effective. In a bigger trial, we should expect to see at least one case in the vaccinated group. But how can we estimate the frequency of something that has not happened yet?

Bayesian statistics provides a sensible way to do just that.

Here’s the data as reported in the press release.

n_control = 1129
n_treatment = 1131

cases_control = 18
cases_treatment = 0

To estimate the efficacy of the vaccine, I’ll start by computing the risk in each group separately.

The Beta Distribution

If we start with a uniform prior distribution, the posterior is a beta distribution with parameters

alpha = cases + 1
beta = n - cases + 1

We can use SciPy to make beta distributions that represents the posteriors.

from scipy.stats import beta

def make_beta(n, cases):
    return beta(cases+1, n-cases+1)

Here’s the posterior distribution for the control branch.

dist_control = make_beta(n_control, cases_control)
dist_control.mean() * 100

The posterior mean is close to the observed proportion:

cases_control / n_control * 100

Here’s the posterior distribution for the treatment branch.

dist_treatment = make_beta(n_treatment, cases_treatment)
dist_treatment.mean() * 100

The posterior mean is about 0.1%, which is small, but not 0.

To compute the distribution of risk ratios, I’ll make a discrete approximation to the two posterior distributions, using the Pmf object from empiricaldist:

import numpy as np
from empiricaldist import Pmf

def make_pmf(dist):
    """PMF to approximate a beta distribution.
    dist: SciPy `beta` object
    returns: Pmf
    qs = np.linspace(8e-6, 0.03, 1000)
    ps = dist.pdf(qs)
    pmf = Pmf(ps, qs)
    return pmf

Here are the Pmf objects:

pmf_control = make_pmf(dist_control).mul_dist(100)
pmf_treatment = make_pmf(dist_treatment).mul_dist(100)

And here’s what they look like:

import matplotlib.pyplot as plt


plt.xlabel('Infection rate (%)')
plt.title('Posterior distributions of infection rate')

Computing Efficacy

Now, to compute efficacy, I’ll use these distributions to compute the distribution of risk ratio, RR, and then compute the distribution of efficacy, which is the complement of risk ration, 1-RR.

We can use div_dist to compute the risk ratio.

pmf_ratio = pmf_treatment.div_dist(pmf_control)

Here’s the CDF of the risk ratio. I cut it off at 1 because higher values have very low probabilities; that is, we are pretty sure the treatment is effective.

plt.xlim([0, 1])

plt.xlabel('Risk ratio')
plt.title('Posterior distribution of risk ratio');

To compute the distribution of efficacy, we have to compute the distribution of 1-RR, where RR is the risk ratio. We can do that with empiricaldist by creating a deterministic Pmf with the quantity 1 and using sub_dist to subtract two Pmfs.

efficacy = Pmf.from_seq(1).sub_dist(pmf_ratio)

Here’s the result.

plt.xlim([0, 1])

plt.title('Posterior distribution of efficacy');

The posterior mean is about 94%.


And the 95% credible interval is between 79% and 99.9%.

array([0.78696377, 0.99943336])

By these estimates, the efficacy in teenagers is about the same as in adults.

If you like this article, you might also like the second edition of Think Bayes.

Copyright 2020 Allen Downey

Code: MIT License

Text: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)