{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cleaning and Validation" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-print" ] }, "source": [ "This is the first in a series of notebooks that make up a [case study in exploratory data analysis](https://allendowney.github.io/PoliticalAlignmentCaseStudy/).\n", "This case study is part of the [*Elements of Data Science*](https://allendowney.github.io/ElementsOfDataScience/) curriculum.\n", "[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/PoliticalAlignmentCaseStudy/blob/v1/01_clean.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we:\n", "\n", "1. Read data from the General Social Survey (GSS),\n", "\n", "2. Clean the data, particularly dealing with special codes that indicate missing data,\n", "\n", "3. Validate the data by comparing the values in the dataset with values documented in the codebook,\n", "\n", "4. Generate resampled datasets that correct for deliberate oversampling in the dataset, and\n", "\n", "5. Store the resampled data in a binary format (HDF5) that makes it easier to work with in the notebooks that follow this one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell loads the packages we need. If you have everything installed, there should be no error messages." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", "\n", " local, _ = urlretrieve(url, filename)\n", " print(\"Downloaded \" + local)\n", "\n", "download(\"https://github.com/AllenDowney/PoliticalAlignmentCaseStudy/raw/v1/utils.py\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading the data\n", "\n", "The data we'll use is from the General Social Survey (GSS). Using the [GSS Data Explorer](https://gssdataexplorer.norc.org), I selected a subset of the variables in the GSS and made it available along with this notebook.\n", "The following cell downloads this extract." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/GssExtract/raw/main/data/interim/gss_pacs_2022.hdf\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(72390, 207)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gss = pd.read_hdf(\"gss_pacs_2022.hdf\", \"gss\")\n", "gss.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `head` to see what the `DataFrame` looks like." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
 "<thead>\n",
 "<tr><th></th><th>abany</th><th>abdefect</th><th>abhlth</th><th>abnomore</th><th>abpoor</th><th>abrape</th><th>absingle</th><th>acqntsex</th><th>adults</th><th>affrmact</th><th>...</th><th>trdunion</th><th>trust</th><th>union</th><th>wkharsex</th><th>wkracism</th><th>wksexism</th><th>wtssall</th><th>wtssps</th><th>xmarsex</th><th>year</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>0</th><td>NaN</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>NaN</td><td>1.0</td><td>NaN</td><td>...</td><td>NaN</td><td>3.0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.4446</td><td>0.663196</td><td>NaN</td><td>1972</td></tr>\n",
 "<tr><th>1</th><td>NaN</td><td>1.0</td><td>1.0</td><td>2.0</td><td>2.0</td><td>1.0</td><td>1.0</td><td>NaN</td><td>2.0</td><td>NaN</td><td>...</td><td>NaN</td><td>1.0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.8893</td><td>0.917370</td><td>NaN</td><td>1972</td></tr>\n",
 "<tr><th>2</th><td>NaN</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>NaN</td><td>2.0</td><td>NaN</td><td>...</td><td>NaN</td><td>2.0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.8893</td><td>0.897413</td><td>NaN</td><td>1972</td></tr>\n",
 "<tr><th>3</th><td>NaN</td><td>2.0</td><td>1.0</td><td>2.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>NaN</td><td>2.0</td><td>NaN</td><td>...</td><td>NaN</td><td>2.0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.8893</td><td>1.066341</td><td>NaN</td><td>1972</td></tr>\n",
 "<tr><th>4</th><td>NaN</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>NaN</td><td>2.0</td><td>NaN</td><td>...</td><td>NaN</td><td>2.0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.8893</td><td>0.944324</td><td>NaN</td><td>1972</td></tr>\n",
 "</tbody>\n",
 "</table>\n",
 "<p>5 rows × 207 columns</p>\n", "