DataFrames and Series¶
This chapter introduces Pandas, which is a powerful library for working with data. Pandas provides functions for reading and writing data files, exploring and analyzing data, and generating visualizations. And it provides two new types for working with data,
We will use these tools to answer a data question: what is the average birth weight of babies in the United States? This example will demonstrate important steps in almost any data science project:
Identifying data that can answer a question.
Obtaining the data and loading it in Python.
Checking the data and dealing with errors.
Selecting relevant subsets from the data.
Using histograms to visualize a distribution of values.
Using summary statistics to describe the data in a way that best answers the question.
Considering possible sources of error and limitations in our conclusions.
Let’s start by getting the data.
Reading the data¶
We’ll use data from the National Survey of Family Growth (NSFG), which is available from the National Center for Health Statistics at https://www.cdc.gov/nchs/nsfg/index.htm.
To download the data, you have to agree to the Data User’s Agreement at https://www.cdc.gov/nchs/data_access/ftp_dua.htm. You should read those terms carefully, but let me draw your attention to what I think is the most important one:
Make no attempt to learn the identity of any person or establishment included in these data.
Respondents to the NSFG provide general information about themselves, which is stored in the respondent file, and information about each time they have been pregnant, which is stored in the pregnancy file.
We will work with the pregnancy file, which contains one row for each pregnancy and 248 variables. Each variable represents responses to a question on the NSFG questionnaire.
The data is stored in a fixed-width format, which means that every row is the same length and each variable spans a fixed range of columns (see https://www.ibm.com/docs/en/baw/19.x?topic=formats-fixed-width-format).
In addition to the data file, we also need the data dictionary, which includes the names of the variables and specifies the range of columns where each variable appears.
dict_file = '2015_2017_FemPregSetup.dct' data_file = '2015_2017_FemPregData.dat'
Pandas can read data in most common formats, including CSV, Excel, and fixed-width format, but it cannot read the data dictionary, which is in Stata format.
For that, we’ll use a Python library called
parse_stata_dict, we’ll import
parse_stata_dict, which reads the data dictionary.
from statadict import parse_stata_dict stata_dict = parse_stata_dict(dict_file) stata_dict
<statadict.base.StataDict at 0x7ff25831bdd0>
The result is an object that contains
names, which is a list of variable names, and
colspecs, which is a list of tuples.
Each tuple in
colspecs specifies the first and last column where a variable appears.
These values are exactly the arguments we need to use
read_fwf, which is the Pandas function that reads a file in fixed-width format.
import pandas as pd nsfg = pd.read_fwf(data_file, names=stata_dict.names, colspecs=stata_dict.colspecs) type(nsfg)
The result from
read_hdf() is a
DataFrame, which is the primary type Pandas uses to store data.
DataFrame has a method called
head() that shows the first 5 rows:
5 rows × 248 columns
The first column is
CASEID, which is a unique identifier for each respondent.
The first three rows have the same
CASEID, so this respondent reported information about three pregnancies.
The second column is
PREGORDR, which indicates the order of the pregnancies for each respondent, starting from
We will learn more about the other variables as we go along.
In addition to methods like
nsfg has several attributes, which are variables associated with a particular type. For example,
nsfg has an attribute called
shape, which is the number of rows and columns:
There are 9553 rows in this dataset, one for each pregnancy, and 248 columns, one for each variable.
nsfg also has an attribute called
columns, which contains the column names:
Index(['CASEID', 'PREGORDR', 'HOWPREG_N', 'HOWPREG_P', 'MOSCURRP', 'NOWPRGDK', 'PREGEND1', 'PREGEND2', 'HOWENDDK', 'NBRNALIV', ... 'SECU', 'SEST', 'CMINTVW', 'CMLSTYR', 'CMJAN3YR', 'CMJAN4YR', 'CMJAN5YR', 'QUARTER', 'PHASE', 'INTVWYEAR'], dtype='object', length=248)
The column names are stored in an
Index, which is another Pandas type, similar to a list.
Based on the column names, you might be able to guess what some of the variables are, but in general you have to read the documentation.
When you work with datasets like the NSFG, it is important to read the documentation carefully. If you interpret a variable incorrectly, you can generate nonsense results and never realize it. So, before we start looking at data, let’s get familiar with the NSFG codebook, which describes every variable.
Until recently, the NSFG codebook was available in an interactive online format. Unfortunately, it is no longer available, so we have to make due with the PDF file at https://github.com/AllenDowney/ElementsOfDataScience/raw/master/data/2015-2017_NSFG_FemPregFile_Codebook-508.pdf, which contains a short description of each variable.
If you search that document for “weigh at birth” you should find these variables related to birth weight.
BIRTHWGT_LB1: Birthweight in Pounds - 1st baby from this pregnancy
BIRTHWGT_OZ1: Birthweight in Ounces - 1st baby from this pregnancy
There are similar variables for a 2nd or 3rd baby, in the case of twins or triplets. For now we will focus on the first baby from each pregnancy, and we will come back to the issue of multiple births.
In many ways a
DataFrame is like a Python dictionary, where the column names are the keys and the columns are the values. You can select a column from a
DataFrame using the bracket operator, with a string as the key.
pounds = nsfg['BIRTHWGT_LB1'] type(pounds)
The result is a
Series, which is another Pandas type.
In this case the
Series contains the birth weight, in pounds, for each live birth.
head shows the first five values in the series, the name of the series, and the data type:
0 7.0 1 NaN 2 9.0 3 6.0 4 7.0 Name: BIRTHWGT_LB1, dtype: float64
One of the values is
NaN, which stands for “Not a Number”.
NaN is a special value used to indicate invalid or missing data. In this example, the pregnancy did not end in live birth, so birth weight is inapplicable.
Exercise: The variable
BIRTHWGT_OZ1 contains the ounces part of birth weight.
Select the column
'BIRTHWGT_OZ1' from the
nsfg DataFrame and assign it to a new variable called
ounces. Then display the first 5 elements of
Exercise: The Pandas types we have seen so far are
Series. You can find the documentation of these types at:
This documentation can be overwhelming; I don’t recommend trying to read it all now. But you might want to skim it so you know where to look later.
At this point we have identified the columns we need to answer the question and assigned them to variables named
pounds = nsfg['BIRTHWGT_LB1'] ounces = nsfg['BIRTHWGT_OZ1']
Before we do anything with this data, we have to validate it. One part of validation is confirming that we are interpreting the data correctly.
We can use the
value_counts method to see what values appear in
pounds and how many times each value appears.
By default, the results are sorted with the most frequent value first, but we can use
sort_index to sort them by value instead, with the lightest babies first and heaviest babies last.
0.0 2 1.0 28 2.0 46 3.0 76 4.0 179 5.0 570 6.0 1644 7.0 2268 8.0 1287 9.0 396 10.0 82 11.0 17 12.0 2 13.0 1 14.0 1 98.0 2 99.0 89 Name: BIRTHWGT_LB1, dtype: int64
As we’d expect, the most frequent values are 6-8 pounds, but there are some very light babies, a few very heavy babies, and two special values, 98, and 99. According to the codebook, these values indicate that the respondent declined to answer the question (98) or did not know (99).
We can validate the results by comparing them to the codebook, which lists the values and their frequencies.
UNDER 6 POUNDS
9 POUNDS OR MORE
The results from
value_counts agree with the codebook, so we have some confidence that we are reading and interpreting the data correctly.
Exercise: In the
nsfg DataFrame, the column
'OUTCOME' encodes the outcome of each pregnancy as shown below:
value_counts to display the values in this column and how many times each value appears. Are the results consistent with the codebook?
Another way to validate the data is with
describe, which computes summary statistics like the mean, standard deviation, minimum, and maximum.
Here are the results for
count 6690.000000 mean 8.008819 std 10.771360 min 0.000000 25% 6.000000 50% 7.000000 75% 8.000000 max 99.000000 Name: BIRTHWGT_LB1, dtype: float64
count is the number of values, not including
For this variable, there are 6690 value that are not
std are the mean and standard deviation.
max are the minimum and maximum values, and in between are the 25th, 50th, and 75th percentiles. The 50th percentile is the median.
The mean is about
8.05, but that doesn’t mean much because it includes the special values 98 and 99. Before we can really compute the mean, we have to replace those values with
NaN to identify them as missing data.
replace() method does what we want:
import numpy as np pounds_clean = pounds.replace([98, 99], np.nan)
replace takes a list of the values we want to replace and the value we want to replace them with.
np.nan means we are getting the special value
NaN from the NumPy library, which is imported as
The result from
replace() is a new Series, which I assign to
If we run
describe again, we see that
count includes only the valid values.
count 6599.000000 mean 6.754357 std 1.383268 min 0.000000 25% 6.000000 50% 7.000000 75% 8.000000 max 14.000000 Name: BIRTHWGT_LB1, dtype: float64
The mean of the new series is about 6.7 pounds. Remember that the mean of the original series was more than 8 pounds. It makes a big difference when you remove a few 99-pound babies!
describe to summarize
replace to replace the special values 98 and 99 with NaN, and assign the result to
describe again. How much does this cleaning affect the results?
Now we want to combine
ounces into a single Series that contains total birth weight.
Arithmetic operators work with Series objects; so, for example, to convert
pounds to ounces, we could write
pounds * 16
Then we could add in
ounces like this
pounds * 16 + ounces
ounces_clean to compute the total birth weight expressed in kilograms (there are roughly 2.2 pounds per kilogram). What is the mean birth weight in kilograms?
Exercise: For each pregnancy in the NSFG dataset, the variable
'AGECON' encodes the respondent’s age at conception, and
'AGEPREG' the respondent’s age at the end of the pregnancy.
Both variables are recorded as integers with two implicit decimal places, so the value
2575 means that the respondent’s age was
Read the documentation of these variables. Are there any special values we have to deal with?
'AGEPREG', divide them by
100, and assign them to variables named
Compute the difference, which is an estimate of the duration of the pregnancy.
.describe()to compute the mean duration and other summary statistics.
If the mean length of pregnancy seems short, remember that this dataset includes all pregnancies, not just the ones that ended in live birth.
Let’s get back to the original queston: what is the average birth weight for babies in the U.S.?
As an answer we could take the results from the previous section and compute the mean:
pounds_clean = pounds.replace([98, 99], np.nan) ounces_clean = ounces.replace([98, 99], np.nan) birth_weight = pounds_clean + ounces_clean / 16 birth_weight.mean()
But it is risky to compute a summary statistic, like the mean, before we look at the whole distribution of values.
A distribution is a set of possible values and their frequencies. One way to visualize a distribution is a histogram, which shows values on the
x axis and their frequencies on the
Series provides a
hist method that makes histograms.
And we can use Matplotlib to label the axes.
import matplotlib.pyplot as plt birth_weight.hist(bins=30) plt.xlabel('Birth weight in pounds') plt.ylabel('Number of live births') plt.title('Distribution of U.S. birth weight');
The keyword argument,
hist to divide the range of weights into 30 intervals, called bins, and count how many values fall in each bin.
x axis is birth weight in pounds; the
y axis is the number of births in each bin.
The distribution looks a little like a bell curve, but the tail is longer on the left than on the right; that is, there are more light babies than heavy babies. That makes sense, because the distribution includes some babies that were born preterm.
hist takes keyword arguments that specify the type and appearance of the histogram. Find the documentation of
hist and see if you can figure out how to plot the histogram as an unfilled line.
Exercise: As we saw in a previous exercise, the NSFG dataset includes a column called
AGECON that records age at conception for each pregnancy.
Select this column from the
DataFrameand divide by 100 to convert it to years.
Plot the histogram of these values with 20 bins.
We have seen that the distribution of birth weights is skewed to the left; that is, there are more light babies than heavy ones and they are farther from the mean. That’s because preterm babies tend to be lighter. The most common duration for pregnancy is 39 weeks, which is “full term”; “preterm” is usually defined to be less than 37 weeks.
To see which babies are preterm, we can use
PRGLNGTH, which records pregnancy length in weeks and compute it to
preterm = (nsfg['PRGLNGTH'] < 37) preterm.dtype
When you compare a
Series to a value, the result is a Boolean
Series; that is, each element is a Boolean value,
False. In this case, it’s
True for each preterm baby and
False otherwise. We can use
head to see the first 5 elements.
0 False 1 True 2 False 3 False 4 False Name: PRGLNGTH, dtype: bool
If you compute the sum of a Boolean
Series, it treats
True as 1 and
False as 0, so the sum is the number of
True values, which is the number of preterm babies, about 3700.
If you compute the mean of a Boolean
Series, you get the fraction of
In this case, it’s about 0.38; that is, about 38% of the pregnancies are less than 37 weeks.
However, this result might be misleading because it includes all pregnancy outcomes, not just live births.
We can create another Boolean
Series to indicate which pregnancies ended in live birth:
live = (nsfg['OUTCOME'] == 1) live.mean()
Now we can use the logical operator
& to identify pregnancies where the outcome is a preterm live birth:
live_preterm = (live & preterm) live_preterm.mean()
Exercise: Of all live births, what fraction are preterm?
The other common logical operators are:
|, which is the OR operator; for example
live | pretermis true if either
liveis true, or
pretermis true, or both.
~, which is the NOT operator; for example
~liveis true if
liveis false or
The logical operators treat
NaN the same as False. So you should be careful about using the NOT operator with a Series that contains
~preterm would include not just full term pregnancies, but also pregnancies with unknown length.
Exercise: Of all pregnancies, what fraction are full term, that is, 37 weeks or more? Of all live births, what fraction are full term?
We can use a Boolean
Series as a filter; that is, we can select only rows that satisfy a condition or meet some criterion.
For example, we can use
preterm and the bracket operator to select values from
preterm_weight gets birth weights for preterm babies.
preterm_weight = birth_weight[preterm] preterm_weight.mean()
To select full-term babies, we can create a Boolean
Series like this:
fullterm = (nsfg['PRGLNGTH'] >= 37)
And use it to select birth weights for full term babies:
full_term_weight = birth_weight[fullterm] full_term_weight.mean()
As expected, full term babies are heavier, on average, than preterm babies. To be more explicit, we could also limit the results to live births, like this:
full_term_weight = birth_weight[live & fullterm] full_term_weight.mean()
But in this case we get the same result because
birth_weight is only valid for live births.
Exercise: Let’s see if there is a difference in weight between single births and multiple births (twins, triplets, etc.).
NBRNALIV represents the number of babies born alive from a single pregnancy.
nbrnaliv = nsfg['NBRNALIV'] nbrnaliv.value_counts()
1.0 6573 2.0 111 3.0 6 Name: NBRNALIV, dtype: int64
live to create a Boolean series called
multiple that is true for multiple live births.
Of all live births, what fraction are multiple births?
Exercise: Make a Boolean series called
single that is true for single live births.
Of all single births, what fraction are preterm?
Of all multiple births, what fraction are preterm?
Exercise: What is the average birth weight for live, single, full-term births?
We are almost ready to compute the average birth weight, but there’s one more problem we have to solve: oversampling.
The NSFG is not exactly representative of the U.S. population. By design, some groups are more likely to appear in the sample than others; that is, they are oversampled. Oversampling helps to ensure that you have enough people in every subgroup to get reliable statistics, but it makes data analysis a little more complicated.
Each pregnancy in the dataset has a sampling weight that indicates how many pregnancies it represents. In
nsfg, the sampling weight is stored in a column named
Here’s what it looks like.
sampling_weight = nsfg['WGT2015_2017'] sampling_weight.describe()
count 9553.000000 mean 13337.425944 std 16138.878271 min 1924.916000 25% 4575.221221 50% 7292.490835 75% 15724.902673 max 106774.400000 Name: WGT2015_2017, dtype: float64
The median value (50th percentile) in this column is about 7292, which means that a pregnancy with that weight represents 7292 total pregnancies in the population. But the range of values is wide, so some rows represent many more pregnancies than others.
To take these weights into account, we can compute a weighted mean. Here are the steps:
Multiply the birth weights for each pregnancy by the sampling weights and add up the products.
Add up the sampling weights.
Divide the first sum by the second.
To do this correctly, we have to be careful with missing data.
To help with that, we’ll use two
isna returns a Boolean
Series that is
True where the corresponding value is
missing = birth_weight.isna() missing.sum()
birth_weight there are 3013 missing values (mostly for pregnancies that did not end in live birth).
notna returns a Boolean
Series that is
True where the corresponding value is not
valid = birth_weight.notna() valid.sum()
We can combine
valid with the other Boolean
Series we have computed to identify single, full term, live births with valid birth weights.
single = (nbrnaliv == 1) selected = valid & live & single & fullterm selected.sum()
sampling_weight to compute the weighted mean of birth weight for live, single, full term births.
You should find that the weighted mean is a little higher than the unweighted mean we computed in the previous section. That’s because the groups that are oversampled in the NSFG tend to have lighter babies, on average.
This chapter poses what seems like a simple question: what is the average birth weight of babies in the United States?
To answer it, we found an appropriate dataset and read the files. Then we validated the data and dealt with special values, missing data, and errors.
To explore the data, we used
describe, and other Pandas methods.
And to select relevant data, we used Boolean
Along the way, we had to think more about the question. What do we mean by “average”, and which babies should we include? Should we include all live births or exclude preterm babies or multiple births?
And we had to think about the sampling process. By design, the NSFG respondents are not representative of the U.S. population, but we can use sampling weights to correct for this effect.
Even a simple question can be a challenging data science project.