{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Printed copies of *Elements of Data Science* are available now, with a **full color interior**.\n", "\n", "From July 17 to July 31, [get 20% off at Lulu.com](https://www.lulu.com/shop/allen-downey/elements-of-data-science/paperback/product-9dyrwn.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Regression" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-print" ] }, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "def download(url):\n", "    filename = basename(url)\n", "    if not exists(filename):\n", "        from urllib.request import urlretrieve\n", "\n", "        local, _ = urlretrieve(url, filename)\n", "        print(\"Downloaded \" + str(local))\n", "    return filename\n", "\n", "download('https://raw.githubusercontent.com/AllenDowney/ElementsOfDataScience/v1/utils.py')\n", "\n", "import utils" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-print" ] }, "source": [ "[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ElementsOfDataScience/blob/v1/10_regression.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous chapter we used simple linear regression to quantify the relationship between two variables.\n", "In this chapter we'll go further into regression, including multiple regression and one of my all-time favorite tools, logistic regression.\n", "These tools will allow us to explore relationships among sets of variables.\n", "As an example, we will use data from the General Social Survey (GSS) to explore the relationships among education, sex, age, and income.\n", "\n", "The GSS dataset contains hundreds of columns.\n", "We'll work with an extract that contains just the columns we need, as we did in Chapter 8.\n", "Instructions for downloading the extract are in the notebook for this chapter." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "download('https://github.com/AllenDowney/ElementsOfDataScience/' +\n", " 'raw/v1/data/gss_extract_2022.hdf');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can read the `DataFrame` like this and display the first few rows." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | year | id | age | educ | degree | sex | gunlaw | grass | realinc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1972 | 1 | 23.0 | 16.0 | 3.0 | 2.0 | 1.0 | NaN | 18951.0 |
| 1 | 1972 | 2 | 70.0 | 10.0 | 0.0 | 1.0 | 1.0 | NaN | 24366.0 |
| 2 | 1972 | 3 | 48.0 | 12.0 | 1.0 | 2.0 | 1.0 | NaN | 24366.0 |
| 3 | 1972 | 4 | 27.0 | 17.0 | 3.0 | 2.0 | 1.0 | NaN | 30458.0 |
| 4 | 1972 | 5 | 61.0 | 12.0 | 1.0 | 2.0 | 1.0 | NaN | 50763.0 |