Survival Analysis

Run this notebook on Colab

Introduction

This notebook presents four ways to represent a distribution: PMFs, CDFs, survival functions, and hazard functions.

These representations are equivalent in the sense that if you are given any of them, you can compute the others. So you might wonder why we need four ways to represent the same information. There are two reasons:

  1. Each representation is useful for computing different values and answering different questions. In this notebook I’ll show you what each representation is good for.

  2. In some cases we can use a sample to compute a PMF, and use the PMF to compute the other representations. In other cases it is easier to estimate the hazard function and use it to compute the others.

In this notebook I will demonstrate the first process, starting with the PMF. In the next notebook we’ll see the second process, starting with the hazard function.

Light bulb lifetimes

As a first example, I’ll use data from an experiment that measures the lifetimes of 50 light bulbs. I downloaded the data from this gist, which includes this documentation:

Dataset from:
    
V.J. Menon and D.C. Agrawal,  Renewal Rate of Filament Lamps: 
Theory and Experiment. Journal of Failure Analysis and Prevention. 
December 2007, p. 421, Table 2/
DOI: 10.1007/s11668-007-9074-9

Description:

An assembly of 50 new Philips (India) lamps with the 
rating 40 W, 220 V (AC) was taken and installed in the horizontal 
orientation and uniformly distributed over a lab area 11 m by 7 m. 

The assembly was monitored at regular intervals of 12 h to
look for failures. The instants of recorded failures were
called t‘ and a total of 32 data points were obtained such
that even the last bulb failed. 

Variables:

i - observation number
h - time in hours since experiment start
f - number of failed lamps at particular time h
K - number of surviving lamps  at particular time h

Because of the design of this experiment, we can consider the data to be a representative sample from the distribution of lifetimes, at least for light bulbs that are lit continuously.

The following cell downloads the data (thanks to Evgeny Pogrebnyak for making it available).
If that fails, you should be able to get it from my repository.

import os

datafile = 'lamps.csv'
if not os.path.exists(datafile):
    !wget https://gist.github.com/epogrebnyak/7933e16c0ad215742c4c104be4fbdeb1/raw/c932bc5b6aa6317770c4cbf43eb591511fec08f9/lamps.csv

We can use Pandas to load the data from the CSV file.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('lamps.csv', index_col=0)
df.head()
      h  f   K
i
0     0  0  50
1   840  2  48
2   852  1  47
3   936  1  46
4   960  1  45

PMF

This dataset is complete in the sense that the experiment ran until all light bulbs failed, so the lifetime for every bulb is known. So we can use the data to estimate the probability mass function (PMF) of lifetimes.

To represent a PMF, I’ll use a Pandas Series with the lifetimes (column h) as the index and the number of failures (column f) as the values.

pmf = pd.Series(df['f'].values, index=df['h'])
pmf.index.name = 't'
pmf.head()

To normalize the PMF, we divide through by the total number of lightbulbs.

pmf /= pmf.sum()
pmf.head()

For a given lifetime, t, the PMF contains the fraction of lifetimes equal to t. For example, we can compute the fraction of light bulbs that lasted 840 hours:

pmf[840]

Or the fraction equal to 1524 hours.

pmf[1524]

Here’s what the PMF looks like.

def decorate(**options):
    """Decorate the current axes.
    Call decorate with keyword arguments like
    decorate(title='Title',
             xlabel='x',
             ylabel='y')
    The keyword arguments can be any of the axis properties
    https://matplotlib.org/api/axes_api.html
    """
    plt.gca().set(**options)
    plt.tight_layout()
pmf.plot()
decorate(xlabel='Lifetime (hours)', 
         ylabel='PMF', 
         title='PMF of lightbulb lifetimes')

This way of visualizing the data can be useful for validation, and it gives a sense of the location and spread of the data. But it does not make the shape of the distribution clear. For that, the CDF is better.

CDF

The Cumulative Distribution Function (CDF) is the cumulative sum of the PMF.

cdf = pmf.cumsum()
cdf.head()

For a given lifetime, t, the CDF contains the fraction of lifetimes less than or equal to t. For example, here is the fraction of light bulbs that expired at or before 840 hours:

cdf[840]

And the fraction that expired at or before 1524 hours.

cdf[1524]

Here’s what the CDF looks like.

cdf.plot()
decorate(xlabel='Lifetime (hours)', 
         ylabel='CDF', 
         title='CDF of lightbulb lifetimes')

In my opinion, the CDF is often the best way to visualize the distribution of a sample.

Between 800 and 1800 hours, the CDF is roughly a straight line, which suggests that the distribution is uniform in this range.

Survival function

The survival function is the complement of the CDF; that is, for a given lifetime, t:

  • The CDF is the fraction of the population less than or equal to t.

  • The survival function is the fraction (strictly) greater than t.

We can compute the survival function like this:

surv = 1 - cdf
surv.head()

We can use surv to compute the fraction of lightbulbs that live longer than 1524 hours:

surv[1524]

If we plot the survival function along with the CDF, we can see that they are complements.

cdf.plot(color='gray', alpha=0.4)
surv.plot()
decorate(xlabel='Lifetime (hours)', 
         ylabel='Prob(lifetime>t)', 
         title='Survival function of lightbulb lifetimes')

It might not be obvious why the survival function is useful. Given the CDF and a little arithmetic, we can answer all of the same questions.

Still, there are two reasons the survival function is worth having:

  • In some domains it is more natural, or at least conventional, to represent distributions in terms of survival rates.

  • The survival function is a step on the way to the hazard function, which we’ll get to now.
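To see how little arithmetic is involved, here is a quick check with a small synthetic PMF (the lifetimes and probabilities below are made up for illustration): any question about the survival function can be answered from the CDF with one subtraction.

```python
import pandas as pd

# a small made-up PMF, used only to illustrate the arithmetic
pmf_demo = pd.Series([0.2, 0.5, 0.3], index=[100, 200, 300])
cdf_demo = pmf_demo.cumsum()
surv_demo = 1 - cdf_demo

# Prob(lifetime > 200) from the survival function...
p1 = surv_demo[200]
# ...and the same quantity from the CDF, with one subtraction
p2 = 1 - cdf_demo[200]
assert p1 == p2
```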

Hazard function

For a given lifetime t, the hazard function computes the “hazard rate” at t. Using the vocabulary of the light bulb example, the hazard rate at t is the fraction of light bulbs that fail at t, among those that survive until t.

We can compute the hazard rate by computing these quantities:

  • pmf(t) is the fraction of light bulbs that fail at t.

  • surv(t) is the fraction of light bulbs that live longer than t.

  • The sum, pmf(t) + surv(t), is the fraction that survive until t; that is, the fraction that have not failed before t.

So the hazard rate is the ratio of pmf(t) to the sum pmf(t) + surv(t). We can compute it like this:

haz = pmf / (pmf + surv)
haz.head()
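As a sanity check on this formula, note that pmf(t) + surv(t) is the fraction of bulbs at risk at t, which in this dataset’s format is (f + K) / 50. So the hazard rate should match f / (f + K) computed directly from the raw counts. Here is the check on a tiny made-up table with the same columns:

```python
import pandas as pd
import numpy as np

# a tiny table in the same format as the lamps data: f failures at
# time h, K survivors after time h (made-up counts for illustration)
df_demo = pd.DataFrame({'h': [0, 10, 20, 30],
                        'f': [0, 4, 4, 8],
                        'K': [16, 12, 8, 0]})

pmf_demo = pd.Series(df_demo['f'].values, index=df_demo['h']) / df_demo['f'].sum()
surv_demo = 1 - pmf_demo.cumsum()
haz_demo = pmf_demo / (pmf_demo + surv_demo)

# the number at risk at time h is f + K, so the hazard rate should
# equal f / (f + K) computed directly from the raw counts
haz_direct = df_demo['f'] / (df_demo['f'] + df_demo['K'])
assert np.allclose(haz_demo.values, haz_direct.values)
```

The same check works on the lamps data, where f + K in each row is the number of bulbs at risk at that time.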

Here’s what the hazard function looks like:

haz.plot()
decorate(xlabel='Lifetime (hours)', 
         ylabel='Hazard rate', 
         title='Hazard function of lightbulb lifetimes')

With this kind of data, plotting the hazard function does not provide a clear picture of what’s happening. There are two problems:

  1. The plot shows spikes at the locations of the data, but it is hard to see the shape of the curve.

  2. The large values on the right are unreliable because they are based on a small number of values.

To explain the second point, let’s look at the last few rows of the failure column, f:

df.tail()

We can see that one bulb failed at each of 1812, 1836, 1860, 1980, and 2568 hours.

So the last value of the hazard function is based on only one bulb; the second-to-last point is based on 2 bulbs; and so on.

The cumulative hazard function

To get a better sense for the shape of the curve, we can plot the cumulative hazard function, like this:

haz.cumsum().plot()

decorate(xlabel='Lifetime (hours)', 
         ylabel='Cumulative hazard rate', 
         title='Hazard function of lightbulb lifetimes')

The slope of the cumulative hazard function is proportional to the hazard rate, so we can use this curve to see when the hazard rate is low or high, and when it is increasing or decreasing.

Between 0 and 1000 hours, the slope is low, so the hazard rate is low.

Between 1000 and 2000 hours, the slope is increasing, so the hazard rate is increasing.

But notice that the vertical scale goes to 4. You might wonder what that means; the answer is “not much”.

The values of the hazard function are rates; that is, fractions of the surviving light bulbs that expire at each point in time. When you add up these rates, the result does not have a clear interpretation.

When you look at a cumulative hazard function, you should pay attention to the slope of the curve and ignore the values.

Resampling

In the previous section, I suggested that we have to be careful not to overinterpret the right side of the hazard function, because it is based on a small number of data points.

To see how precise the estimated hazard function is, we can use resampling.

First I will use the PMF of lifetimes to make a kernel density estimate (KDE) of the distribution.

from scipy.stats import gaussian_kde
    
kde = gaussian_kde(pmf.index, weights=pmf)
size = df['f'].sum()

We can use the KDE to draw a new sample of lifetimes, with the same size as the original data set, like this.

sample = kde.resample(size).flatten()
sample.shape

The following function takes a sample and computes its PMF, CDF, survival function, and hazard function.

def make_hazard(sample):
    pmf = pd.Series(sample).value_counts(normalize=True).sort_index()    
    cdf = pmf.cumsum()
    surv = 1 - cdf
    haz = pmf / (pmf + surv)
    return pmf, cdf, surv, haz

I’ll generate 100 samples and plot their survival functions.

for i in range(100):
    sample = kde.resample(size).flatten()
    _, _, sf, _ = make_hazard(sample)
    sf.plot(color='gray', alpha=0.1)
    
surv.plot()
decorate(xlabel='Lifetime (hours)', 
         ylabel='Prob(lifetime>t)', 
         title='Survival function of lightbulb lifetimes')

By plotting the resampled survival functions on top of each other, we can get a sense of how much the results vary due to random sampling.

We can do the same thing with the hazard function:

for i in range(100):
    sample = kde.resample(size).flatten()
    _, _, _, hf = make_hazard(sample)
    hf.cumsum().plot(color='gray', alpha=0.1)

haz.cumsum().plot()

decorate(xlabel='Lifetime (hours)', 
         ylabel='Cumulative hazard rate', 
         title='Hazard function of lightbulb lifetimes')

This plot gives us a sense of which parts of the cumulative hazard function are reliable and which are not.

Below 2000 hours, all of the resampled curves are similar; they increase with increasing slope.

After that, the variability of the curves is much wider, which means we don’t have enough data to characterize this part of the hazard function.

Summary

We have seen four ways to represent a distribution of lifetimes:

  • Probability mass function (PMF), which maps from each value in the distribution to its probability.

  • Cumulative distribution function (CDF), which maps from each value, x, to its cumulative probability, that is, the probability of being less than or equal to x.

  • Survival function, which is the complement of the CDF; that is, the probability of exceeding x.

  • Hazard function, which is the number of failures at x as a fraction of the number of cases that survive until x.

These representations are equivalent in the sense that they all contain the same information. Given any of them, we can compute any of the others.
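To make that equivalence concrete, here is a round trip on a small synthetic PMF (the values are made up for illustration): compute the CDF and survival function, then recover the PMF by differencing the CDF.

```python
import pandas as pd
import numpy as np

# a small made-up PMF, used only to demonstrate the round trip
pmf_demo = pd.Series([0.25, 0.25, 0.5], index=[10, 20, 30])
cdf_demo = pmf_demo.cumsum()
surv_demo = 1 - cdf_demo

# differencing the CDF recovers the PMF; the first difference is NaN,
# and the missing value is just the first CDF value
pmf_back = cdf_demo.diff().fillna(cdf_demo.iloc[0])
assert np.allclose(pmf_demo.values, pmf_back.values)
```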

In this notebook, we computed the PMF directly from the data, then computed the CDF, survival function, and hazard function, in that order.

In the next notebook we will use these representations to deal with a case where we cannot compute the PMF directly from the data.

Part of Survival Analysis in Python

Allen B. Downey

Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)