Cleaning and Validation#

This is the first in a series of notebooks that make up a case study in exploratory data analysis. This case study is part of the Elements of Data Science curriculum. Click here to run this notebook on Colab.

In this notebook, we

  1. Read data from the General Social Survey (GSS),

  2. Clean the data, particularly dealing with special codes that indicate missing data,

  3. Validate the data by comparing the values in the dataset with values documented in the codebook,

  4. Generate resampled datasets that correct for deliberate oversampling in the dataset, and

  5. Store the resampled data in a binary format (HDF5) that makes it easier to work with in the notebooks that follow this one.

The following cell loads the packages we need. If you have everything installed, there should be no error messages.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Reading the data#
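
The cells below use a download function to fetch the data file. If it isn't already defined in your environment, here is a minimal stand-in based on the standard library; it skips the download if the file is already present.

from os.path import basename, exists
from urllib.request import urlretrieve

def download(url):
    """Download the file at `url` unless it's already in the working directory."""
    filename = basename(url)
    if not exists(filename):
        urlretrieve(url, filename)
        print("Downloaded " + filename)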

The data we’ll use is from the General Social Survey (GSS). Using the GSS Data Explorer, I selected a subset of the variables in the GSS and made it available along with this notebook. The following cell downloads this extract.

download("https://github.com/AllenDowney/GssExtract/raw/main/data/interim/gss_pacs_2022.hdf")
gss = pd.read_hdf("gss_pacs_2022.hdf", "gss")
gss.shape
(72390, 207)

We can use head to see what the DataFrame looks like.

gss.head()
abany abdefect abhlth abnomore abpoor abrape absingle acqntsex adults affrmact ... trdunion trust union wkharsex wkracism wksexism wtssall wtssps xmarsex year
0 NaN 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 NaN ... NaN 3.0 NaN NaN NaN NaN 0.4446 0.663196 NaN 1972
1 NaN 1.0 1.0 2.0 2.0 1.0 1.0 NaN 2.0 NaN ... NaN 1.0 NaN NaN NaN NaN 0.8893 0.917370 NaN 1972
2 NaN 1.0 1.0 1.0 1.0 1.0 1.0 NaN 2.0 NaN ... NaN 2.0 NaN NaN NaN NaN 0.8893 0.897413 NaN 1972
3 NaN 2.0 1.0 2.0 1.0 1.0 1.0 NaN 2.0 NaN ... NaN 2.0 NaN NaN NaN NaN 0.8893 1.066341 NaN 1972
4 NaN 1.0 1.0 1.0 1.0 1.0 1.0 NaN 2.0 NaN ... NaN 2.0 NaN NaN NaN NaN 0.8893 0.944324 NaN 1972

5 rows × 207 columns

This dataset has one row for each respondent and one column for each variable.

Validation#

Now that we’ve got the data loaded, it is important to validate it, which means checking for errors.

The kinds of errors you have to check for depend on the nature of the data, the collection process, how the data is stored and transmitted, etc.

For this dataset, there are three kinds of validation we’ll think about:

  1. We need to check the integrity of the dataset; that is, whether the data were corrupted or changed during transmission, storage, or conversion from one format to another.

  2. We need to check our interpretation of the data; for example, whether the numbers used to encode the data mean what we think they mean.

  3. We will also keep an eye out for invalid data; for example, missing data might be represented using special codes, or there might be patterns in the data that indicate problems with the survey process and the recording of the data.

In a different dataset I worked with, I found a surprising number of respondents whose height was supposedly 62 centimeters. After investigating, I concluded that they were probably 6 feet, 2 inches, and their heights were recorded incorrectly.

Validating data can be a tedious process, but it is important. If you interpret data incorrectly and publish invalid results, you will be embarrassed in the best case, and in the worst case you might do real harm. See this article for a recent example.

However, I don’t expect you to validate every variable in this dataset. Instead, I will demonstrate the process, and then ask you to validate one additional variable as an exercise.

The first variable we’ll validate is called polviews. It records responses to the following question:

We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal–point 1–to extremely conservative–point 7. Where would you place yourself on this scale?

You can read the documentation of this variable in the GSS codebook.

The responses are encoded like this:

1	Extremely liberal
2	Liberal
3	Slightly liberal
4	Moderate
5	Slightly conservative
6	Conservative
7	Extremely conservative
8	Don't know
9	No answer
0	Not applicable
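
If we want human-readable labels later, one option is to record this encoding in a dictionary. The name polviews_labels is a convenience introduced here, not part of the GSS extract:

# labels for the valid responses; 0, 8, and 9 are missing-data codes
polviews_labels = {
    1: "Extremely liberal",
    2: "Liberal",
    3: "Slightly liberal",
    4: "Moderate",
    5: "Slightly conservative",
    6: "Conservative",
    7: "Extremely conservative",
}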

The following function, values, takes a Series that represents a single variable and returns the values in the series and their frequencies.

def values(series):
    """Count the values and sort.

    series: pd.Series

    returns: series mapping from values to frequencies
    """
    return series.value_counts().sort_index()

Here are the values for the variable polviews.

polviews = gss["polviews"]
values(polviews)
1.0     2081
2.0     7623
3.0     7900
4.0    23992
5.0     9596
6.0     9361
7.0     2165
Name: polviews, dtype: int64

To check the integrity of the data and confirm that we have loaded it correctly, we’ll do a “spot check”; that is, we’ll pick one year and compare the values we see in the dataset to the values reported in the codebook.

We can make a Boolean Series that selects the responses from a single year, like this:

one_year = gss["year"] == 1974

And look at the values and their frequencies:

values(polviews[one_year])
1.0     22
2.0    201
3.0    207
4.0    564
5.0    221
6.0    160
7.0     35
Name: polviews, dtype: int64

If you compare these results to the values in the codebook, you should see that they agree.

Exercise: Go back and change 1974 to another year, and compare the results to the codebook.
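
For example, here is the same check for 1984; any year the question was asked works the same way.

one_year = gss["year"] == 1984
values(polviews[one_year])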

Missing data#

For many variables, missing values are encoded with numerical codes that we need to replace before we do any analysis.

For polviews, the values 8, 9, and 0 represent “Don’t know”, “No answer”, and “Not applicable”. “Not applicable” usually means the respondent was not asked a particular question.

To keep things simple, we’ll treat all of these values as equivalent, but we lose some information by doing that. For example, if a respondent refuses to answer a question, that might suggest something about their answer. If so, treating their response as missing data might bias the results.

Fortunately, for most questions the number of respondents who refused to answer is small.

I’ll replace the numeric codes 8, 9, and 0 with np.nan, which is a special value used to indicate missing data.

NA = np.nan
clean = polviews.replace([0, 8, 9], NA)
clean
0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
72385    1.0
72386    4.0
72387    3.0
72388    3.0
72389    2.0
Name: polviews, Length: 72390, dtype: float64

When np.nan is displayed, it appears as NaN, which stands for “not a number”.
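
One property of NaN worth knowing: it is not equal to anything, including itself, so we can't test for missing values with ==. That's why the next cells use notna and isna instead.

np.nan == np.nan  # False: NaN compares unequal even to itself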

We can use notna and sum to count the valid responses:

clean.notna().sum()
62718

And we use isna to count the missing responses:

clean.isna().sum()
9672

We can check these results against the codebook; at the bottom of that page, it reports the number of “Valid cases” and “Missing cases”.

In this example, the results don’t match. The codebook reports 53081 valid cases and 9385 missing cases. It turns out that the numbers in the codebook are not up to date, so for now we’ll have to assume that the numbers in the dataset are correct.

Replacing missing data#

For the other variables in this dataset, I read through the codebook and identified the special values that indicate missing data. I recorded that information in the following functions, which replace those special values with NaN.

def replace_invalid(df, columns, bad):
    """Replace the given invalid codes with NaN in the given columns."""
    for column in columns:
        df[column] = df[column].replace(bad, NA)


def gss_replace_invalid(df):
    """Replace invalid data with NaN.

    df: DataFrame
    """
    # different variables use different codes for invalid data
    df["cohort"] = df["cohort"].replace([0, 9999], NA)

    # since there are a lot of variables that use 0, 8, and 9 for invalid data,
    # I'll use a loop to replace all of them
    columns = [
        "abany",
        "abdefect",
        "abhlth",
        "abnomore",
        "abpoor",
        "abrape",
        "absingle",
        "acqntsex",
        "affrmact",
        "bible",
        "cappun",
        "colath",
        "colcom",
        "colhomo",
        "colmil",
        "colmslm",
        "colrac",
        "colsoc",
        "compuse",
        "conarmy",
        "conbus",
        "conclerg",
        "coneduc",
        "confed",
        "confinan",
        "conjudge",
        "conlabor",
        "conlegis",
        "conmedic",
        "conpress",
        "consci",
        "contv",
        "databank",
        "discaffm",
        "discaffw",
        "divlaw",
        "divorce",
        "eqwlth",
        "fair",
        "fear",
        "fechld",
        "fefam",
        "fehelp",
        "fehire",
        "fehome",
        "fejobaff",
        "fepol",
        "fepres",
        "fepresch",
        "fework",
        "finrela",
        "frndsex",
        "fund",
        "god",
        "goodlife",
        "grass",
        "gunlaw",
        "hapmar",
        "happy",
        "health",
        "helpful",
        "hhrace",
        "homosex",
        "hunt",
        "libath",
        "libcom",
        "libhomo",
        "libmil",
        "libmslm",
        "librac",
        "libsoc",
        "life",
        "matesex",
        "meovrwrk",
        "miracles",
        "nataid",
        "natarms",
        "natchld",
        "natcity",
        "natcrime",
        "natdrug",
        "nateduc",
        "natenrgy",
        "natenvir",
        "natfare",
        "natheal",
        "natmass",
        "natpark",
        "natrace",
        "natroad",
        "natsci",
        "natsoc",
        "natspac",
        "othersex",
        "paidsex",
        "pikupsex",
        "polabuse",
        "polattak",
        "polescap",
        "polhitok",
        "polmurdr",
        "polviews",
        "popespks",
        "pornlaw",
        "postlife",
        "pray",
        "prayer",
        "premarsx",
        "pres00",
        "pres04",
        "pres08",
        "pres12",
        "pres96",
        "racchurh",
        "racclos",
        "racdif1",
        "racdif2",
        "racdif3",
        "racdif4",
        "racdin",
        "racdis",
        "racfew",
        "rachaf",
        "rachome",
        "racinteg",
        "raclive",
        "racmar",
        "racmost",
        "racopen",
        "racpres",
        "racpush",
        "racschol",
        "racseg",
        "racwork",
        "reborn",
        "relexp",
        "relexper",
        "reliten",
        "relpersn",
        "res16",
        "rowngun",
        "satfin",
        "satjob",
        "savesoul",
        "sexbirth",
        "sexeduc",
        "sexnow",
        "sexornt",
        "sexsex",
        "sexsex5",
        "spanking",
        "spkath",
        "spkcom",
        "spkhomo",
        "spklang",
        "spkmil",
        "spkmslm",
        "spkrac",
        "spksoc",
        "sprtprsn",
        "teensex",
        "trdunion",
        "trust",
        "union",
        "wkharsex",
        "wkracism",
        "wksexism",
        "xmarsex",
        "commun",
    ]
    replace_invalid(df, columns, [0, 8, 9])

    columns = ["degree", "partyid"]
    replace_invalid(df, columns, [8, 9])

    df["phone"] = df["phone"].replace([0, 2, 9], NA)
    df["owngun"] = df["owngun"].replace([0, 3, 8, 9], NA)
    df["pistol"] = df["pistol"].replace([0, 3, 8, 9], NA)
    df["class"] = df["class"].replace([0, 5, 8, 9], NA)

    df["chldidel"] = df["chldidel"].replace([-1, 8, 9], NA)
    df["sexfreq"] = df["sexfreq"].replace([-1, 8, 9], NA)

    df["attend"] = df["attend"].replace([9], NA)
    df["childs"] = df["childs"].replace([9], NA)
    df["adults"] = df["adults"].replace([9], NA)

    df["age"] = df["age"].replace([0, 98, 99], NA)
    df["relig"] = df["relig"].replace([0, 98, 99], NA)
    df["relig16"] = df["relig16"].replace([0, 98, 99], NA)
    df["relactiv"] = df["relactiv"].replace([0, 98, 99], NA)

    # note: sibs contains some unlikely numbers
    df["sibs"] = df["sibs"].replace([-1, 98, 99], NA)
    df["hrsrelax"] = df["hrsrelax"].replace([-1, 98, 99], NA)

    df["educ"] = df["educ"].replace([97, 98, 99], NA)

    df["realinc"] = df["realinc"].replace([0], NA)
    df["realrinc"] = df["realrinc"].replace([0], NA)

    df["income"] = df["income"].replace([0, 13, 98, 99], NA)
    df["rincome"] = df["rincome"].replace([0, 13, 98, 99], NA)

gss_replace_invalid(gss)

# check that we've cleaned all columns that need it;
# all columns we've added NaN to should be floats

# some columns have no missing values
clean_columns = ["id", "year", "ballot", "sex", "race", "reg16", "region", "srcbelt"]

for column in gss.columns:
    if gss[column].dtype == int and column not in clean_columns:
        print(f"'{column}', ", end="")
print()

At this point, I have only moderate confidence that this code is correct. I’m not sure I have dealt with every variable in the dataset, and I’m not sure that the special values for every variable are correct.

So I will ask for your help.

Exercise: In order to validate the other variables, I’d like each person who works with this notebook to validate one variable.

If you run the following cell, it will choose one of the columns from the dataset at random. That’s the variable you will check.

If you get year or id, run the cell again to get a different variable name.

np.random.seed(None)
np.random.choice(gss.columns)
'abhlth'

Go back through the previous two sections of this notebook and replace polviews with your randomly chosen variable. Then run the cells again and go to this online survey to report the results.

Note: Not all questions were asked during all years. If your variable doesn’t have data for 1974 or 2018, you might have to choose different years.
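
To see which years have data for your variable, you can count the valid responses per year. Here is a sketch using abhlth, the variable chosen above; years with a count of zero were not asked:

# count non-missing responses per year
gss.groupby("year")["abhlth"].count()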

Resampling#

The GSS uses stratified sampling, which means that some groups are deliberately oversampled to help with statistical validity.

As a result, each respondent has a sampling weight which is proportional to the number of people in the population they represent.

Before running any analysis, we can compensate for stratified sampling by “resampling”, that is, by drawing a random sample from the dataset, where each respondent’s chance of appearing in the sample is proportional to their sampling weight.

def resample_rows_weighted(df, column):
    """Resamples a DataFrame using probabilities proportional to given column.

    df: DataFrame
    column: string column name to use as weights

    returns: DataFrame
    """
    weights = df[column]
    sample = df.sample(n=len(df), replace=True, weights=weights)
    return sample


def resample_by_year(df, column):
    """Resample rows within each year.

    df: DataFrame
    column: string name of weight variable

    returns DataFrame
    """
    grouped = df.groupby("year")
    samples = [resample_rows_weighted(group, column) for _, group in grouped]
    sample = pd.concat(samples, ignore_index=True)
    return sample

np.random.seed(19)
sample = resample_by_year(gss, "wtssall")

Saving the results#

I’ll save the results to an HDF5 file, which is a binary format that makes it much faster to read the data back.

First I’ll save the original (not resampled) data.

An HDF5 file is like a dictionary on disk. It contains keys and corresponding values.

to_hdf takes the following arguments:

  • The filename, gss_pacs_clean.hdf.

  • The key, gss.

  • The mode, "w", which indicates that if the file already exists, we should overwrite it.

  • The compression level, complevel=6, which controls how hard the algorithm works to compress the file. At this level, it reduces the size of the file by a factor of more than 10.

So the file contains a single key, gss, which maps to the DataFrame with the original GSS data.

# if the file already exists, remove it
import os

if os.path.isfile("gss_pacs_clean.hdf"):
    !rm gss_pacs_clean.hdf
# save the original

gss.to_hdf("gss_pacs_clean.hdf", key="gss", mode="w", complevel=6)
!ls -lh gss_pacs_clean.hdf
-rw-rw-r-- 1 downey downey 11M Apr  3 11:42 gss_pacs_clean.hdf
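
As a quick check that the file is readable, we can open it with pd.HDFStore and list its keys. We expect to see a single key, gss.

# confirm the file contains the one key we stored
with pd.HDFStore("gss_pacs_clean.hdf", "r") as store:
    print(store.keys())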

And I’ll create a second file with three random resamplings of the original dataset.

# if the file already exists, remove it
import os

if os.path.isfile("gss_pacs_resampled.hdf"):
    !rm gss_pacs_resampled.hdf

This file contains three keys, gss0, gss1, and gss2, which map to three DataFrames.

# generate and store three resamplings
keys = ["gss0", "gss1", "gss2"]

for i in range(3):
    np.random.seed(i)
    sample = resample_by_year(gss, "wtssall")

    sample.to_hdf("gss_pacs_resampled.hdf", key=keys[i], complevel=7)
!ls -lh gss_pacs_resampled.hdf
-rw-rw-r-- 1 downey downey 31M Apr  3 11:43 gss_pacs_resampled.hdf

For the other notebooks in this case study, we’ll load this resampled data rather than reading and cleaning the data every time.
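
For example, the first resampling can be loaded back like this:

# read one resampled DataFrame from the file we just created
gss0 = pd.read_hdf("gss_pacs_resampled.hdf", "gss0")
gss0.shape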