Relationships#

Click here to run this notebook on Colab.

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://raw.githubusercontent.com/AllenDowney/ElementsOfDataScience/v1/utils.py')

import utils

This chapter explores relationships between variables.

  • We will visualize relationships using scatter plots, box plots, and violin plots,

  • And we will quantify relationships using correlation and simple regression.

The most important lesson in this chapter is that you should always visualize the relationship between variables before you try to quantify it – otherwise, you are likely to be misled.

Exploring relationships#

So far we have mostly considered one variable at a time. Now it’s time to explore relationships between variables. As a first example, we’ll look at the relationship between height and weight. We’ll use data from the Behavioral Risk Factor Surveillance System (BRFSS), which is run by the Centers for Disease Control and Prevention (CDC). Based on the BRFSS data from 2021, I have created an extract with one row for each survey respondent and one column for each of the variables I selected.

You can read about the BRFSS at https://www.cdc.gov/brfss.

import pandas as pd

brfss = pd.read_hdf('brfss_2021.hdf', 'brfss')
brfss.shape
(438693, 9)

Here are the first few rows.

brfss.drop(columns='SEQNO').head()
    HTM4  WTKG3  _SEX  _AGEG5YR  _VEGESU1  _INCOMG1  _HTMG10   AGE
0  160.0  65.77     2       1.0      3.14       5.0    150.0  22.0
1  168.0  78.93     1       8.0       NaN       6.0    160.0  57.0
2    NaN  58.97     1       NaN      1.00       NaN      NaN   NaN
3  168.0  74.84     1       3.0      2.28       7.0    160.0  32.0
4  168.0  81.65     2       2.0      1.28       4.0    160.0  27.0

The BRFSS includes hundreds of variables. For the examples in this chapter, we’ll work with just these nine. The ones we’ll start with are HTM4, which records each respondent’s height in centimeters, and WTKG3, which records weight in kilograms.

height = brfss['HTM4']
weight = brfss['WTKG3']

To visualize the relationship between these variables, we’ll make a scatter plot, which shows one marker for each pair of values. Scatter plots are common and readily understood, but they are surprisingly hard to get right. As a first attempt, we’ll use plot with the style string o, which plots a circle for each data point.

import matplotlib.pyplot as plt

plt.plot(height, weight, 'o')

plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.title('Scatter plot of weight versus height');

Each marker represents the height and weight of one person. Based on the shape of the result, it looks like taller people are heavier, but there are a few things about this plot that make it hard to interpret. Most importantly, it is overplotted, which means that there are markers piled on top of each other so you can’t tell where there are a lot of data points and where there is just one. When that happens, the results can be misleading.

One way to improve the plot is to use transparency, which we can do with the keyword argument alpha. The lower the value of alpha, the more transparent each data point is.
Here’s what it looks like with alpha=0.01.

plt.plot(height, weight, 'o', alpha=0.01)

plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.title('Scatter plot of weight versus height');

This is better, but there are so many data points that the scatter plot is still overplotted. The next step is to make the markers smaller. With markersize=0.5 and a low value of alpha, the scatter plot is less saturated. Here’s what it looks like.

plt.plot(height, weight, 'o', alpha=0.01, markersize=0.5)

plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.title('Scatter plot of weight versus height');

Again, this is better, but now we can see that the points fall in discrete columns. That’s because most heights were reported in inches and converted to centimeters. We can break up the columns by adding some random noise to the values – in effect, we are filling in the values that got rounded off. Adding random noise like this is called jittering. We can use NumPy to add noise from a normal distribution with mean 0 and standard deviation 2.

import numpy as np

noise = np.random.normal(0, 2, size=len(brfss))
height_jitter = height + noise

Here’s what the plot looks like with jittered heights.

plt.plot(height_jitter, weight, 'o', alpha=0.01, markersize=0.5)

plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.title('Scatter plot of weight versus height');

The columns are gone, but now we can see that there are rows where people rounded off their weight. We can fix that by jittering weight, too.

noise = np.random.normal(0, 2, size=len(brfss))
weight_jitter = weight + noise

Here’s what it looks like.

plt.plot(height_jitter, weight_jitter, 'o', 
         alpha=0.01, markersize=0.5)

plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.title('Scatter plot of weight versus height');

Finally, let’s zoom in on the area where most of the data points are. The functions xlim and ylim set the lower and upper limits of the \(x\)-axis and \(y\)-axis; in this case, we plot heights from 140 to 200 centimeters and weights up to 160 kilograms. Here’s what it looks like.

plt.plot(height_jitter, weight_jitter, 'o', alpha=0.01, markersize=0.5)

plt.xlim([140, 200])
plt.ylim([0, 160])
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.title('Scatter plot of weight versus height');

Now we have a reliable picture of the relationship between height and weight. Below you can see the misleading plot we started with and the more reliable one we ended with. They are clearly different, and they suggest different relationships between these variables.

# Set the figure size
plt.figure(figsize=(8, 3))

# Create subplots with 1 row, 2 columns, and select the first
plt.subplot(1, 2, 1)
plt.plot(height, weight, 'o')

plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.title('Scatter plot of weight versus height')

# Adjust the layout so the two plots don't overlap
plt.tight_layout()

# Start plot 2
plt.subplot(1, 2, 2)

plt.plot(height_jitter, weight_jitter, 'o', alpha=0.01, markersize=0.5)

plt.xlim([140, 200])
plt.ylim([0, 160])
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.title('Scatter plot of weight versus height')
plt.tight_layout()

The point of this example is that it takes some effort to make an effective scatter plot.

Exercise: Do people tend to gain weight as they get older? We can answer this question by visualizing the relationship between weight and age. But before we make a scatter plot, it is a good idea to visualize distributions one variable at a time. So let’s look at the distribution of age.

The BRFSS dataset includes a column, AGE, which represents each respondent’s age in years. To protect respondents’ privacy, ages are rounded off into 5-year bins. The values in AGE are the midpoints of the bins.

  • Extract the variable 'AGE' from brfss and assign it to age.

  • Plot the PMF of age as a bar chart, using Pmf from empiricaldist.

from empiricaldist import Pmf
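If you want to check your approach without the BRFSS data at hand, here is one possible shape of a solution on synthetic ages, using pandas value_counts as a stand-in for empiricaldist’s Pmf.from_seq. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for brfss['AGE']: midpoints of 5-year bins
rng = np.random.default_rng(0)
age = pd.Series(rng.choice(np.arange(22, 83, 5), size=1000))

# A PMF maps each value to its probability; this is what
# Pmf.from_seq(age) computes in empiricaldist
pmf = age.value_counts(normalize=True).sort_index()

pmf.plot.bar()
plt.xlabel('Age in years')
plt.ylabel('PMF')
plt.title('Distribution of age');
```

With the real data, Pmf.from_seq(age) followed by its bar method produces the same kind of figure.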

Exercise: Now let’s look at the distribution of weight. The column that contains weight in kilograms is WTKG3. Because this column contains many unique values, displaying it as a PMF doesn’t work very well. Instead, use Cdf from empiricaldist to plot the CDF of weight.
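As a sketch of what a CDF is, here is the empirical CDF computed by hand with NumPy on synthetic weights; empiricaldist’s Cdf.from_seq does this work for you. The lognormal parameters here are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for brfss['WTKG3']
rng = np.random.default_rng(0)
weight = rng.lognormal(mean=4.4, sigma=0.2, size=1000)

# Empirical CDF: sort the values; the CDF at the i-th smallest
# value is the fraction of values at or below it, (i+1)/n
xs = np.sort(weight)
ps = np.arange(1, len(xs) + 1) / len(xs)

plt.plot(xs, ps)
plt.xlabel('Weight in kg')
plt.ylabel('CDF')
plt.title('Distribution of weight');
```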

Exercise: Now make a scatter plot of weight versus age. Adjust alpha and markersize to avoid overplotting. Use ylim to limit the y axis from 0 to 200 kilograms.

Exercise: In the previous exercise, the ages fall in columns because they’ve been rounded into 5-year bins. If we jitter them, the scatter plot will show the relationship more clearly.

  • Add random noise to age with mean 0 and standard deviation 2.5.

  • Make a scatter plot and adjust alpha and markersize again.

Visualizing relationships#

In the previous section we used scatter plots to visualize relationships between variables, and in the exercises, you explored the relationship between age and weight. In this section, we’ll see other ways to visualize these relationships, including box plots and violin plots. Let’s start with a scatter plot of weight versus age.

age = brfss['AGE']
noise = np.random.normal(0, 1.0, size=len(brfss))
age_jitter = age + noise

plt.plot(age_jitter, weight_jitter, 'o', alpha=0.01, markersize=0.5)

plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.ylim([0, 200])
plt.title('Weight versus age');

In this version of the scatter plot, the ages are jittered, but there’s still space between the columns. That makes it possible to see the shape of the distribution in each age group, and the differences between groups. With this view, it looks like average weights increase until age 40 or 50, and then start to decrease.

If we take this idea one step further, we can use KDE to estimate the density function in each column and plot it. And there’s a name for that – it’s called a violin plot. Seaborn provides a function that makes violin plots, but before we can use it, we have to get rid of any rows with missing data. Here’s how:

data = brfss.dropna(subset=['AGE', 'WTKG3'])
data.shape
(393047, 9)
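As an aside, a toy example shows which rows dropna with subset keeps – the values here are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AGE': [22.0, np.nan, 57.0, 32.0],
                   'WTKG3': [65.8, 78.9, np.nan, 74.8],
                   '_SEX': [2, 1, 1, 1]})

# Drop rows where AGE or WTKG3 is NaN; a NaN in a column
# outside the subset would not cause a row to be dropped
clean = df.dropna(subset=['AGE', 'WTKG3'])
clean.shape  # two of the four rows survive
```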

dropna creates a new DataFrame that omits the rows in brfss where AGE or WTKG3 are NaN. Now we can call violinplot.

import seaborn as sns

sns.violinplot(x='AGE', y='WTKG3', data=data, inner=None)

plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Weight versus age');

Documentation of violinplot is at https://seaborn.pydata.org/generated/seaborn.violinplot.html.

data is the DataFrame we just created; the x and y arguments are the names of columns from data. The argument inner=None simplifies the plot a little. In the result each shape shows the distribution of weight in one age group. The width of each shape is proportional to the estimated density, so it’s like two vertical KDEs plotted back to back.

Another, related way to look at data like this is called a box plot, which represents summary statistics for the values in each group.
The code to generate a box plot is very similar.

sns.boxplot(x='AGE', y='WTKG3', data=data, whis=10)

plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Weight versus age');

The argument whis=10 extends the whiskers to 10 times the interquartile range, which effectively reaches the minimum and maximum values, so no points are drawn as outliers. If you are curious about it, you can read more in the Seaborn documentation.

Documentation of boxplot is at https://seaborn.pydata.org/generated/seaborn.boxplot.html.

Each box represents the distribution of weight in an age group. The height of each box represents the range from the 25th to the 75th percentile. The line in the middle of each box is the median. The whiskers sticking out of the top and bottom show the minimum and maximum values.
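The statistics a box plot displays can be computed directly with NumPy – a minimal sketch on synthetic weights (the distribution parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.lognormal(mean=4.4, sigma=0.2, size=1000)

# The box spans the interquartile range, with a line at the median
q25, median, q75 = np.percentile(weights, [25, 50, 75])

# With whis=10, the whiskers effectively reach the extremes
low, high = weights.min(), weights.max()

print(f'box: {q25:.1f} to {q75:.1f} kg, median {median:.1f} kg')
print(f'whiskers: {low:.1f} to {high:.1f} kg')
```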

In my opinion, this plot gives us the best view of the relationship between weight and age.

  • Looking at the medians, it seems like people in their 40s are the heaviest; younger and older people are lighter.

  • Looking at the sizes of the boxes, it seems like people in their 40s have the most variability in weight, too.

  • These plots also show how skewed the distribution of weight is – that is, the heaviest people are much farther from the median than the lightest people.

For data that skews toward higher values, it is sometimes useful to look at it on a logarithmic scale. We can do that with the Pyplot function yscale.

sns.boxplot(x='AGE', y='WTKG3', data=data, whis=10)

plt.yscale('log')
plt.xlabel('Age in years')
plt.ylabel('Weight in kg (log scale)')
plt.title('Weight versus age');

On a log scale, the distributions are symmetric – so the whiskers extend about the same distance in both directions, the boxes are close to the middle of the figure, and we can see the relationship between age and weight clearly.

In the following exercises, you’ll have a chance to generate violin and box plots for other variables.

Exercise: Previously we looked at a scatter plot of height and weight, and saw that taller people tend to be heavier. Now let’s take a closer look using a box plot. The brfss DataFrame contains a column named _HTMG10 that represents height in centimeters, binned into 10 cm groups.

  • Make a box plot that shows the distribution of weight in each height group.

  • Plot the y-axis on a logarithmic scale.

Suggestion: If the labels on the x axis collide, you can rotate them like this:

plt.xticks(rotation=45)

Exercise: As a second example, let’s look at the relationship between income and height. In the BRFSS, income is represented as a categorical variable – that is, respondents are assigned to one of seven income categories. The column name is _INCOMG1. Before we connect income with anything else, let’s look at the distribution by computing the PMF. Extract _INCOMG1 from brfss and assign it to income – then plot the PMF of income as a bar chart.

Exercise: Generate a violin plot that shows the distribution of height in each income group. Can you see a relationship between these variables?

Quantifying Correlation#

In the previous section, we visualized relationships between pairs of variables. Now we’ll learn about the coefficient of correlation, which quantifies the strength of these relationships. When people say “correlation” casually, they might mean any relationship between two variables. In statistics, it usually means Pearson’s correlation coefficient, which is a number between -1 and 1 that quantifies the strength of a linear relationship between variables. To demonstrate, we’ll select three columns from the BRFSS dataset:

columns = ['HTM4', 'WTKG3', 'AGE']
subset = brfss[columns]

The result is a DataFrame with just those columns. With this subset of the data, we can use the corr method, like this:

subset.corr()
HTM4 WTKG3 AGE
HTM4 1.000000 0.453259 -0.092798
WTKG3 0.453259 1.000000 0.001107
AGE -0.092798 0.001107 1.000000

The result is a correlation matrix. Reading across the first row, the correlation of HTM4 with itself is 1. That’s expected – the correlation of anything with itself is 1.

The next entry is more interesting: the correlation of height and weight is about 0.45. It’s positive, which means taller people are heavier, and it’s moderate in strength, which means it has some predictive value – if you know someone’s height, you can make a somewhat better guess about their weight.

The correlation between height and age is about -0.09. It’s negative, which means that older people tend to be shorter, but it’s weak, which means that knowing someone’s age would not help much if you were trying to guess their height.
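For reference, Pearson’s correlation is the covariance divided by the product of the standard deviations. A quick check on synthetic data (not the BRFSS values) confirms that computing it from the definition matches NumPy’s built-in version:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

# Pearson's correlation from its definition:
# covariance divided by the product of standard deviations
r = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# NumPy's built-in computation should agree
r_np = np.corrcoef(x, y)[0, 1]
```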

The correlation between weight and age is even smaller. It is tempting to conclude that there is no relationship between age and weight, but we have already seen that there is. So why is the correlation so low? Remember that the relationship between weight and age looks like this.

data = brfss.dropna(subset=['AGE', 'WTKG3'])
sns.boxplot(x='AGE', y='WTKG3', data=data, whis=10)

plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Weight versus age');

People in their forties are the heaviest; younger and older people are lighter. So this relationship is nonlinear. But correlation only measures linear relationships. If the relationship is nonlinear, correlation generally underestimates how strong it is.

To demonstrate this point, I’ll generate some fake data: xs contains equally-spaced points between -1 and 1; ys is xs squared plus some random noise.

xs = np.linspace(-1, 1)
ys = xs**2 + np.random.normal(0, 0.05, len(xs))

Here’s the scatter plot of xs and ys.

plt.plot(xs, ys, 'o', alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot of a fake dataset');

It’s clear that this is a strong relationship – if you are given x, you can make a much better guess about y. But here’s the correlation matrix:

np.corrcoef(xs, ys)
array([[1.        , 0.01993483],
       [0.01993483, 1.        ]])

Even though there is a strong non-linear relationship, the computed correlation is close to 0. In general, if correlation is high – that is, close to 1 or -1 – you can conclude that there is a strong linear relationship. But if correlation is close to 0, that doesn’t mean there is no relationship; there might be a non-linear relationship. This is one reason correlation can be misleading.

And there’s another reason to be careful with correlation – it doesn’t mean what people take it to mean. Specifically, correlation says nothing about slope. If we say that two variables are correlated, that means we can use one to predict the other. But that might not be what we care about.

For example, suppose we are concerned about the health effects of weight gain, so we plot weight versus age from 20 to 50 years old. I’ll generate two fake datasets to demonstrate the point. In each dataset, xs represents age and ys represents weight.

np.random.seed(18)
xs1 = np.linspace(20, 50)
ys1 = 75 + 0.02 * xs1 + np.random.normal(0, 0.15, len(xs1))

plt.plot(xs1, ys1, 'o', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Fake dataset #1');

And here’s the second dataset:

np.random.seed(18)
xs2 = np.linspace(20, 50)
ys2 = 65 + 0.2 * xs2 + np.random.normal(0, 3, len(xs2))

plt.plot(xs2, ys2, 'o', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Fake dataset #2');

I constructed these examples so they look similar, but they have substantially different correlations.

rho1 = np.corrcoef(xs1, ys1)[0][1]
rho1
0.7579660563439401
rho2 = np.corrcoef(xs2, ys2)[0][1]
rho2
0.4782776976576317

In the first example, the correlation is strong, close to 0.75. In the second example, the correlation is moderate, close to 0.5. So we might think the first relationship is more important. But look more closely at the y axis in both figures.

In the first example, the average weight gain over 30 years is less than 1 kilogram; in the second it is more than 5 kilograms! If we are concerned about the health effects of weight gain, the second relationship is probably more important, even though the correlation is lower.
In this scenario, the statistic we really care about is the line of best fit, not the coefficient of correlation. In the next section, we’ll see how to fit a line to this data, but first let’s practice with correlation.

Exercise: The purpose of the BRFSS is to explore health risk factors, so it includes questions about diet. The column _VEGESU1 represents the number of servings of vegetables respondents reported eating per day. Before we compute correlations, let’s look at the distribution of this variable. Extract _VEGESU1 from brfss and assign it to vegesu – then plot the CDF of the values.

The original dataset includes a small number of values greater than 10, some of them unreasonably large. For this extract, I have cut off the values at 10.

Exercise: Now let’s visualize the relationship between age and vegetables. Make a box plot that summarizes the distribution of vegetable servings in each age group. How would you describe the relationship, if any?

Exercise: Finally, let’s look at correlations between age, income, and vegetable servings.

  • From brfss, select the columns 'AGE', '_INCOMG1', and '_VEGESU1'.

  • Compute the correlation matrix for these variables.

Is the correlation between age and vegetable servings what you expected based on the box plot?

Simple Linear Regression#

In the previous section we saw that correlation does not always measure what we really want to know. In this section, we look at an alternative: simple linear regression. Here “simple” means there are only two variables, as opposed to “multiple” regression, which can work with any number of variables.

Let’s look again at the relationship between weight and age. In the previous section, I generated two fake datasets to make a point:

plt.figure(figsize=(8, 3))

plt.subplot(1, 2, 1)
plt.plot(xs1, ys1, 'o', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Fake dataset #1')
plt.tight_layout()

plt.subplot(1, 2, 2)
plt.plot(xs2, ys2, 'o', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Fake dataset #2')
plt.tight_layout()

The one on the left has higher correlation, about 0.75 compared to 0.5. But in this context, the statistic we probably care about is the slope of the line, not the correlation coefficient. To estimate the slope, we can use linregress from the SciPy stats library.

from scipy.stats import linregress

res1 = linregress(xs1, ys1)
res1._asdict()
{'slope': 0.018821034903244386,
 'intercept': 75.08049023710964,
 'rvalue': 0.7579660563439402,
 'pvalue': 1.8470158725246148e-10,
 'stderr': 0.002337849260560818,
 'intercept_stderr': 0.08439154079040358}

linregress takes two sequences as arguments. The result is a LinregressResult object: slope is the slope of the line of best fit for the data, intercept is the intercept, and rvalue is the correlation coefficient. We’ll interpret some of the other values later.
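One useful connection between these statistics: for simple least squares, the slope equals the correlation times the ratio of the standard deviations, slope = rvalue * std(y) / std(x). Here’s a quick check on synthetic data (not the fake datasets above):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

res = linregress(x, y)

# The slope of the least-squares line is the correlation
# scaled by the ratio of standard deviations
slope_from_r = res.rvalue * y.std(ddof=1) / x.std(ddof=1)
```

This identity explains why a relationship can have high correlation but a small slope, as in the fake datasets.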

For Fake Dataset #1, the estimated slope is about 0.019 kilograms per year or about 0.56 kilograms over the 30-year range.

res1.slope * 30
0.5646310470973316

Here are the results for Fake Dataset #2.

res2 = linregress(xs2, ys2)
res2._asdict()
{'slope': 0.17642069806488855,
 'intercept': 66.60980474219305,
 'rvalue': 0.47827769765763173,
 'pvalue': 0.0004430600283776241,
 'stderr': 0.04675698521121631,
 'intercept_stderr': 1.6878308158080697}

The estimated slope is almost 10 times higher: about 0.18 kilograms per year or about 5.3 kilograms per 30 years.

res2.slope * 30
5.292620941946657

rvalue confirms what we saw before – the first example has higher correlation, about 0.75 compared to 0.5.

res1.rvalue, res2.rvalue
(0.7579660563439402, 0.47827769765763173)

But the strength of the effect, as measured by the slope of the line, is about 10 times higher in the second example.

We can use the results from linregress to compute the line of best fit. First we get the minimum and maximum of the observed xs; then we multiply by the slope and add the intercept. Here’s what that looks like for the first example.

plt.plot(xs1, ys1, 'o', alpha=0.5)

low, high = xs1.min(), xs1.max()
fx = np.array([low, high])
fy = res1.intercept + res1.slope * fx
plt.plot(fx, fy, '-')

plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Fake Dataset #1');

And here’s what it looks like for the second example.

plt.plot(xs2, ys2, 'o', alpha=0.5)

low, high = xs2.min(), xs2.max()
fx = np.array([low, high])
fy = res2.intercept + res2.slope * fx
plt.plot(fx, fy, '-')

plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Fake Dataset #2');

The visualization here might be misleading unless you look closely at the vertical scales – the slope in the second figure is almost 10 times higher.

Regression of Height and Weight#

Now let’s look at an example of regression with real data. Here’s the scatter plot of height and weight one more time.

plt.plot(height_jitter, weight_jitter, 'o', 
         alpha=0.01, markersize=0.5)

plt.xlim([140, 200])
plt.ylim([0, 160])
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.title('Scatter plot of weight versus height');

To compute the regression line, we’ll use linregress again. But it can’t handle NaN values, so we have to use dropna to remove rows that are missing the data we need.

data = brfss.dropna(subset=['WTKG3', 'HTM4'])

Now we can compute the linear regression.

res_hw = linregress(data['HTM4'], data['WTKG3'])
res_hw._asdict()
{'slope': 0.9012473927929429,
 'intercept': -70.83347519772558,
 'rvalue': 0.45325892670525614,
 'pvalue': 0.0,
 'stderr': 0.0028318189296177936,
 'intercept_stderr': 0.48327199271627963}

The slope is about 0.9 kilograms per centimeter, which means that we expect a person one centimeter taller to be almost a kilogram heavier. That’s quite a lot. As before, we can compute the line of best fit:

low, high = data['HTM4'].min(), data['HTM4'].max()
fx = np.array([low, high])
fy = res_hw.intercept + res_hw.slope * fx

And here’s what that looks like.

plt.plot(height_jitter, weight_jitter, 'o', alpha=0.01, markersize=0.5)
plt.plot(fx, fy, '-')

plt.xlim([140, 200])
plt.ylim([0, 160])
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.title('Scatter plot of weight versus height');

The slope of this line seems consistent with the scatter plot. As another example, here’s the scatter plot of weight versus age, which we saw earlier.

plt.plot(age_jitter, weight_jitter, 'o', 
         alpha=0.01, markersize=0.5)

plt.ylim([0, 160])
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Weight versus age');

People in their 40s are the heaviest; younger and older people are lighter. So the relationship is nonlinear. If we don’t look at the scatter plot and blindly compute the regression line, here’s what we get.

subset = brfss.dropna(subset=['WTKG3', 'AGE']) 
res_aw = linregress(subset['AGE'], subset['WTKG3'])
res_aw._asdict()
{'slope': 0.0013125832660571195,
 'intercept': 82.50816942598159,
 'rvalue': 0.0011068212407106782,
 'pvalue': 0.4877433280947321,
 'stderr': 0.001891594202330922,
 'intercept_stderr': 0.09723506804688767}

The estimated slope is close to zero. And here’s what the line of best fit looks like.

plt.plot(age_jitter, weight_jitter, 'o', 
         alpha=0.01, markersize=0.5)

low, high = subset['AGE'].min(), subset['AGE'].max()
fx = np.array([low, high])
fy = res_aw.intercept + res_aw.slope * fx
plt.plot(fx, fy, '-')

plt.ylim([0, 160])
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
plt.title('Weight versus age');

A straight line does not capture the relationship between these variables well. That’s because linear regression has the same problem as correlation – it only measures the strength of a linear relationship.

In the next chapter, we’ll see how to use multiple regression to estimate non-linear relationships. But first, let’s practice simple regression.

Exercise: Who do you think eats more vegetables, people with low income, or people with high income? Let’s find out. As we’ve seen previously, the column _INCOMG1 represents income level and _VEGESU1 represents the number of vegetable servings respondents reported eating per day. Make a scatter plot with vegetable servings versus income, that is, with vegetable servings on the y axis and income group on the x axis. You might want to use ylim to zoom in on the bottom half of the y axis.

Exercise: Now let’s estimate the slope of the relationship between vegetable consumption and income.

  • Use dropna to select rows where _INCOMG1 and _VEGESU1 are not NaN.

  • Extract _INCOMG1 and _VEGESU1 and compute the simple linear regression of these variables.

What is the slope of the regression line? What does this slope mean in the context of the question we are exploring?
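If you want to check the shape of a solution without the data, here is the dropna-then-linregress pipeline on synthetic stand-ins for the two columns – the column names are real, but the values are made up:

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Synthetic stand-ins: income groups 1-7, vegetable servings per day
rng = np.random.default_rng(0)
income = pd.Series(rng.integers(1, 8, size=1000).astype(float))
vegesu = 1.5 + 0.1 * income + rng.normal(0, 0.5, size=1000)
df = pd.DataFrame({'_INCOMG1': income, '_VEGESU1': vegesu})

# Simulate missing responses
df.loc[rng.choice(1000, size=50, replace=False), '_VEGESU1'] = np.nan

# Drop incomplete rows, then regress servings on income group
valid = df.dropna(subset=['_INCOMG1', '_VEGESU1'])
res = linregress(valid['_INCOMG1'], valid['_VEGESU1'])
# res.slope is additional servings per step up in income category
```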

Exercise: Finally, plot the regression line on top of the scatter plot.

Summary#

This chapter presents three ways to visualize the relationship between two variables: a scatter plot, violin plot, and box plot. A scatter plot is often a good choice when you are exploring a new data set, but it can take some attention to avoid overplotting. Violin plots and box plots are particularly useful when one of the variables only takes on a few discrete values.

We considered two ways to quantify the strength of a relationship: the coefficient of correlation and the slope of a regression line. These statistics capture different aspects of what we might mean by “strength”. The coefficient of correlation indicates how well we can predict one variable, given the other. The slope of the regression line indicates how much difference we expect in one variable as we vary the other. One or the other might be more relevant, depending on the context.