1. Exploratory Data Analysis#
The thesis of this book is we can use data to answer questions, resolve debates, and make better decions.
This chapter introduces the steps we’ll use to do that: loading and validating data, exploring, and choosing statistics that measure what your are interested in. As an example, it uses data from the National Survey of Family Growth (NSFG) to answer a question I heard when my wife and I were expecting our first child: do first babies tend to arrive late?
Click here to run this notebook on Colab.
Show code cell content
%load_ext nb_black
%load_ext autoreload
%autoreload 2
Show code cell content
from os.path import basename, exists
def download(url):
filename = basename(url)
if not exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print("Downloaded " + local)
download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/thinkstats.py")
Show code cell content
try:
import empiricaldist
except ImportError:
!pip install empiricaldist
Show code cell content
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from thinkstats import decorate
1.1. Evidence#
You might have heard that first babies are more likely to be late. If you Google this question, you will find plenty of discussion. Some people claim it’s true, others say it’s a myth, and some people say it’s the other way around: first babies come early.
In many of these discussions, people provide data to support their claims. I found many examples like these:
“My two friends that have given birth recently to their first babies, BOTH went almost 2 weeks overdue before going into labour or being induced.”
“My first one came 2 weeks late and now I think the second one is going to come out two weeks early!!”
“I don’t think that can be true because my sister was my mother’s first and she was early, as with many of my cousins.”
Reports like these are called anecdotal evidence because they are based on data that is unpublished and usually personal. In casual conversation, there is nothing wrong with anecdotes, so I don’t mean to pick on the people I quoted.
But we might want evidence that is more persuasive and an answer that is more reliable. By those standards, anecdotal evidence usually fails, because:
Small number of observations: If pregnancy length is longer for first babies, the difference is probably small compared to natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.
Selection bias: People who join a discussion of this question might be interested because their first babies were late. In that case the process of selecting data would bias the results.
Confirmation bias: People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.
Inaccuracy: Anecdotes are often personal stories, and often misremembered, misrepresented, repeated inaccurately, etc.
To address the limitations of anecdotes, we will use the tools of statistics, which include:
Data collection: We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.
Descriptive statistics: We will generate statistics that summarize the data concisely, and evaluate different ways to visualize data.
Exploratory data analysis: We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.
Estimation: We will use data from a sample to estimate characteristics of the general population.
Hypothesis testing: Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.
By performing these steps with care to avoid pitfalls, we can reach conclusions that are more justifiable and more likely to be correct.
1.2. The National Survey of Family Growth#
Since 1973 the U.S. Centers for Disease Control and Prevention (CDC) have conducted the National Survey of Family Growth (NSFG), which is intended to gather “information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health. The survey results are used…to plan health services and health education programs, and to do statistical studies of families, fertility, and health.”
You can read more about the NSFG at http://cdc.gov/nchs/nsfg.htm.
We will use data collected by this survey to investigate whether first babies tend to come late, and other questions. In order to use this data effectively, we have to understand the design of the study.
The NSFG is a cross-sectional study, which means that it captures a snapshot of a group at a point in time. The most common alternative is a longitudinal study, which observes a group repeatedly over time.
The NSFG has been conducted seven times; each deployment is called a cycle. We will use data from Cycle 6, which was conducted from January 2002 to March 2003.
The goal of the survey is to draw conclusions about a population; the target population of the NSFG is people in the United States aged 15-44. Ideally surveys would collect data from every member of the population, but that’s seldom possible. Instead we collect data from a subset of the population called a sample. The people who participate in a survey are called respondents.
In general, cross-sectional studies are meant to be representative, which means that every member of the target population has an equal chance of participating. That ideal is hard to achieve in practice, but people who conduct surveys come as close as they can.
The NSFG is not representative; instead it is deliberately oversampled. The designers of the study recruited three groups – Hispanics, African-Americans and teenagers – at rates higher than their representation in the U.S. population, in order to make sure that the number of respondents in each of these groups is large enough to draw valid statistical inferences.
Of course, the drawback of oversampling is that it is not as easy to draw conclusions about the general population based on statistics from the survey. We will come back to this point later.
When working with this kind of data, it is important to be familiar with the codebook, which documents the design of the study, the survey questions, and the encoding of the responses.
The codebook and user’s guide for the NSFG data are available from http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm
1.3. Reading the Data#
TODO: Fill this in…
Instructions for downloading the data are in the notebook for this chapter.
The following cells download the data files and install statadict
, which we need to read the data.
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dct")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dat.gz")
try:
import statadict
except ImportError:
!pip install statadict
from statadict import parse_stata_dict
def read_stata(dct_file, dat_file):
"""Read NSFG data.
dct_file: string file name
dat_file: string file name
returns: DataFrame
"""
stata_dict = parse_stata_dict(dct_file)
resp = pd.read_fwf(
dat_file,
names=stata_dict.names,
colspecs=stata_dict.colspecs,
compression="gzip",
)
return resp
Here’s how we use it.
dct_file = "2002FemPreg.dct"
dat_file = "2002FemPreg.dat.gz"
preg = read_stata(dct_file, dat_file)
The result is a DataFrame
, which is the fundamental data structure provided by Pandas.
A DataFrame
contains a row for each record – in this case one row per pregnancy – and a column for each variable.
In addition to the data, a DataFrame
also contains the variable names and their types, and it provides methods for accessing and modifying the data.
The DataFrame
has an attribute called shape
that contains the number of rows and columns.
preg.shape
(13593, 243)
And it has a method called head
that displays the first few rows.
preg.head()
caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | poverty_i | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | 1231 |
1 | 1 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | 1231 |
2 | 2 | 1 | NaN | NaN | NaN | NaN | 5.0 | NaN | 3.0 | 5.0 | ... | 0 | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | 1231 |
3 | 2 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | 1231 |
4 | 2 | 3 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | 1231 |
5 rows × 243 columns
The first column is the Index
of the DataFrame
, which contains a label for each row.
In this case, the labels are integers starting from 0, but they can also be strings and other types.
The DataFrame
also has an attribute called columns
that contains the names of the columns.
preg.columns
Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
...
'poverty_i', 'laborfor_i', 'religion_i', 'metro_i', 'basewgt',
'adj_mod_basewgt', 'finalwgt', 'secu_p', 'sest', 'cmintvw'],
dtype='object', length=243)
To access a column from a DataFrame
, you can use the column name as a key:
pregordr = preg["pregordr"]
type(pregordr)
pandas.core.series.Series
The result is a Pandas Series
, which represents a sequence of values.
Series
also provides head
, which displays the first few values in the Series
and their labels.
pregordr.head()
0 1
1 2
2 1
3 2
4 3
Name: pregordr, dtype: int64
The last line includes the name of the Series
and dtype
, which is the type of the values.
In this example, int64
indicates that the values are 64-bit integers.
You can access a value in a Series
using the bracket operator integer with a single index.
pregordr[0]
1
Or a subsequence of the values using a slice index.
pregordr[2:5]
2 1
3 2
4 3
Name: pregordr, dtype: int64
You can also access the columns of a DataFrame
using dot notation:
pregordr = preg.pregordr # NOT RECOMMENDED
But this notation only works if the column name is a valid Python identifier, so it has to begin with a letter, can’t contain spaces, etc.
Because of this limitation, I recommend using bracket operators to select columns from a DataFrame
.
1.4. Variables#
We have already seen two variables in the NSFG dataset, caseid
and pregordr
, and we have seen that there are 243 variables in total.
For the explorations in this book, we’ll use the following variables:
caseid
is the integer ID of the respondent.prglngth
is the integer duration of the pregnancy in weeks.outcome
is an integer code for the outcome of the pregnancy. The code 1 indicates a live birth.pregordr
is a pregnancy serial number; for example, the code for a respondent’s first pregnancy is 1, for the second pregnancy is 2, and so on.birthord
is a serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank.birthwgt_lb
andbirthwgt_oz
contain the pounds and ounces parts of the birth weight of the baby.agepreg
is the mother’s age at the end of the pregnancy.finalwgt
is the statistical weight associated with the respondent. It is a floating-point value that indicates the number of people in the U.S. population this respondent represents.
If you read the codebook carefully, you will see that many of the variables are recodes, which means that they are not part of the raw data collected by the survey – they are calculated using the raw data.
For example, prglngth
for live births is equal to the raw variable wksgest
(weeks of gestation) if it is available; otherwise it is estimated using mosgest * 4.33
(months of gestation times the average number of weeks in a month).
Recodes are often based on logic that checks the consistency and accuracy of the data. In general it is a good idea to use recodes when they are available, unless there is a compelling reason to process the raw data yourself.
1.5. Validation#
When data is exported from one software environment and imported into another, errors might be introduced. And when you are getting familiar with a new dataset, you might interpret data incorrectly or introduce other misunderstandings. If you take time to validate the data, you can save time later and avoid errors.
One way to validate data is to compute basic statistics and compare them with published results.
For example, the NSFG codebook includes tables that summarize each variable.
Here is the table for outcome
, which encodes the outcome of each pregnancy.
The Series
class provides a method, value_counts
, that counts the number of times each value appears.
Here’s how we use it to count the values in the outcome
column.
preg["outcome"].value_counts().sort_index()
outcome
1 9148
2 1862
3 120
4 1921
5 190
6 352
Name: count, dtype: int64
The result of value_counts
is a Series
; sort_index()
sorts the Series
by index, so the values appear in order.
Comparing the results with the published table, it looks like the values in outcome
are correct.
Similarly, here is the published table for birthwgt_lb
And here are the value counts:
preg.birthwgt_lb.value_counts().sort_index()
birthwgt_lb
0.0 8
1.0 40
2.0 53
3.0 98
4.0 229
5.0 697
6.0 2223
7.0 3049
8.0 1889
9.0 623
10.0 132
11.0 26
12.0 10
13.0 3
14.0 3
15.0 1
51.0 1
97.0 1
98.0 1
99.0 57
Name: count, dtype: int64
The counts for 6, 7, and 8 pounds check out, and if you add up the counts for 0-5 and 9-95, they check out, too. But if you look more closely, you will notice one value that has to be an error, a 51 pound baby!
To deal with this error, we can use a comparison operator to identify the rows with invalid data.
invalid = preg["birthwgt_lb"] > 20
invalid.dtype
dtype('bool')
The result is a Boolean Series
that is True
where the value of birthwgt_lb
is invalid and False
otherwise.
We can use this Series
to replace invalid data with nan
.
preg.loc[invalid, "birthwgt_lb"] = np.nan
The attribute loc
selects rows and columns from a DataFrame
.
In this example, the first expression in brackets selects the invalid rows; the second expression selects the column.
1.6. Transformation#
When you read data like this, you often have to check for errors, deal with special values, convert data into different formats, and perform calculations. These operations are called data cleaning.
As an example, agepreg
contains the mother’s age at the end of the pregnancy, but it is encoded as an integer number of centiyears (hundredths of a year).
To convert it to year, we can divide through by 100.
preg["agepreg"] /= 100.0
As another example, birthwgt_lb
and birthwgt_oz
contain the weight of the baby, in pounds and ounces, for pregnancies that end in live birth – but they also contain several special values.
Special values encoded as numbers are dangerous because if they are not handled properly, they can generate bogus results, like a 99-pound baby.
The replace
method replaces these values with np.nan
, which is a special floating-point value that represents “not a number”.
na_vals = [97, 98, 99]
preg["birthwgt_lb"] = preg["birthwgt_lb"].replace(na_vals, np.nan)
preg["birthwgt_oz"] = preg["birthwgt_oz"].replace(na_vals, np.nan)
Now we can use the cleaned values to create a new column that combines pounds and ounces into a single quantity, in pounds.
preg["totalwgt_lb"] = preg["birthwgt_lb"] + preg["birthwgt_oz"] / 16.0
When you add a new column to a DataFrame
, you have to use the bracket operator, not dot notation, like this.
# WRONG!
# preg.totalwgt_lb = preg["birthwgt_lb"] + preg["birthwgt_oz"] / 16.0
The version with dot notation adds an attribute to the DataFrame
object, but that attribute is not treated as a new column.
1.7. Summary Statistics#
A statistic is a number derived from a dataset, usually intended to quantify some aspect of the data. Examples include the mean, variance, and standard deviation.
A mean is a statistic intended to locate the center of a set of values. But there are many ways to define “center”, so there are many ways to compute a mean, including arithmetic, geometric, and harmonic means. For now, we’ll stick with the arithmetic mean, which is the sum of a set of values divided by the number of values.
A Series
object has a count
method that returns the number of values that are not nan
.
weights = preg["totalwgt_lb"]
n = weights.count()
n
9038
It also provides a sum
method that returns the sum of the values – we can use it to compute the mean like this.
mean = weights.sum() / n
mean
7.265628457623368
But as you might expect, there’s also a mean
method that does the same thing.
weights.mean()
7.265628457623368
In this dataset, the mean of the birth weights is about 7.3 pounds.
Variance is a statistic that quantifies the spread of a set of values. It is the mean of the squared deviations, which are the distances of each point from the mean.
squared_deviations = (weights - mean) ** 2
We can compute the mean of the squared deviations like this.
var = squared_deviations.sum() / n
var
1.983070989750022
As you might expect, Series
provides a var
method that does the same thing, not exactly the same thing.
weights.var()
1.9832904288326545
The result is slightly different because when the var
method computes the mean of the squared deviations, it divides by n-1
rather than n
.
That’s because there are two ways to compute the variance of a sample, depending on what you are trying to do.
I’ll explain the difference in Chapter XXX – but in practice it usually doesn’t matter.
If you prefer the version with n
in the denominator, you can get it by passing ddof=0
as a keyword argument to the var
method.
weights.var(ddof=0)
1.983070989750022
In this dataset, the variance of the birth weights is about 1.98, but that value is hard to interpret – for one thing, it is in units of pounds squared. Variance is useful in some computations, but not a good way to describe a dataset.
A better option is the standard deviation, which is the square root of variance. We can compute it like this.
std = np.sqrt(var)
std
1.40821553384062
Or, we can use the std
method.
weights.std(ddof=0)
1.40821553384062
In this dataset, the standard deviation of birth weights is about 1.4 pounds.
1.8. Interpretation#
To work with data effectively, you have to think on two levels at the same time: the level of statistics and the level of context.
As an example, let’s select the rows in the pregnancy file with caseid
10229.
The query
method takes a string that can contain column names, comparison operators, and numbers, among other things.
subset = preg.query("caseid == 10229")
subset.shape
(7, 244)
The result is a DataFrame
that contains only the rows where the query is True
.
This respondent reported seven pregnancies – here are their outcomes, which are recorded in chronological order.
subset["outcome"]
11093 4
11094 4
11095 4
11096 4
11097 4
11098 4
11099 1
Name: outcome, dtype: int64
The outcome code 1
indicates a live birth.
Code 4
indicates a miscarriage – that is, a pregnancy that ended spontaneously, usually with no known medical cause.
Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more. But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage. Her seventh and most recent pregnancy ended in a live birth. If we consider this data with empathy, it is natural to be moved by the story it tells.
Each record in the NSFG dataset represents a person who provided honest answers to many personal and difficult questions. We can use this data to answer statistical questions about family life, reproduction, and health. At the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude.
1.9. Exercises#
1.9.1. Exercise#
Select the birthord
column, print the value counts, and compare to results published in the codebook.
The codebook for this dataset is at https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NSFG/Cycle6Codebook-Pregnancy.pdf.
1.9.2. Exercise#
Create a new column named totalwgt_kg
that contains birth weight in kilograms (there are approximately 2.2 pounds per kilogram).
Remember that when you create a new column, you have to use the bracket operator, not dot notation.
Compute the mean and standard deviation of the new column.
1.9.3. Exercise#
What are the pregnancy lengths for the respondent with caseid
2298?
What was the birth weight of the first baby born to the respondent with caseid
5013?
Hint: You can use and
to check more than one condition in a query.
# TODO: add credit cell
Think Stats: Exploratory Data Analysis in Python, 3rd Edition
Copyright 2024 Allen B. Downey
Code license: MIT License
Text license: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International