Is the Ideology Gap Growing?#

This tweet from John Burn-Murdoch links to an article in the Financial Times, “A new global gender divide is emerging”, which includes this figure:

The article claims:

In the US, Gallup data shows that after decades where the sexes were each spread roughly equally across liberal and conservative world views, women aged 18 to 30 are now 30 percentage points more liberal than their male contemporaries. That gap took just six years to open up.

The figure says it is based on General Social Survey data and the text says it’s based on Gallup data, so I’m not sure which it is.

And I don’t know what it means that “All figures are adjusted for time trend in the overall population”.

Anyway, since I used GSS data in the last three chapters of Probably Overthinking It, this analysis did not sound right to me.

This notebook is my attempt to replicate the analysis with GSS data. I conclude:

  • The GSS data does not look like the figure in the FT.

  • Men are a more likely to say that they are conservative, by 5-10 percentage points.

  • The only evidence that the gap is growing depends entirely on a data point from 2022 that is almost certainly wrong.

  • If we drop the 2022 data and apply moderate smoothing, we see no evidence that the gap is growing.

Most of the functions in this notebook are the ones I used to write Probably Overthinking It. All of the notebooks for that book are available in this repository.

Click here to run this notebook on Colab

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess


def make_lowess(series, frac=0.5):
    """Use LOWESS to compute a smooth line.

    series: pd.Series

    returns: pd.Series
    """
    y = series.values
    x = series.index.values

    smooth = lowess(y, x, frac=frac)
    index, data = np.transpose(smooth)

    return pd.Series(data, index=index)
Hide code cell content
def plot_series_lowess(series, color, plot_series=True, frac=0.5, **options):
    """Plots a series of data points and a smooth line.

    series: pd.Series
    color: string or tuple
    """
    if "label" not in options:
        options["label"] = series.name

    if plot_series or len(series) == 1:
        x = series.index
        y = series.values
        plt.plot(x, y, "o", color=color, alpha=0.3, label="_")

    if not plot_series and len(series) == 1:
        x = series.index
        y = series.values
        plt.plot(x, y, "o", color=color, alpha=0.6, label=options["label"])

    if len(series) > 1:
        smooth = make_lowess(series, frac=frac)
        smooth.plot(color=color, **options)
Hide code cell content
def decorate(**options):
    """Decorate the current axes.

    Call decorate with keyword arguments like
    decorate(title='Title',
             xlabel='x',
             ylabel='y')

    The keyword arguments can be any of the axis properties
    https://matplotlib.org/api/axes_api.html
    """
    ax = plt.gca()
    ax.set(**options)

    handles, labels = ax.get_legend_handles_labels()
    if handles:
        ax.legend(handles, labels)

    plt.tight_layout()

Read the Data#

I’m using data I previous cleaned in this notebook.

from os.path import basename, exists
from pathlib import Path


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename
download("https://github.com/AllenDowney/GssExtract/raw/main/data/interim/gss_pacs_2022.hdf")
'gss_pacs_2022.hdf'
gss = pd.read_hdf("gss_pacs_2022.hdf", "gss")
gss.shape
(72390, 205)

The primary variable we’ll use is polviews, which asks

We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal–point 1–to extremely conservative–point 7. Where would you place yourself on this scale?

The points on the scale are Extremely liberal, Liberal, and Slightly liberal; Moderate; Slightly conservative, Conservative, and Extremely conservative.

I’ll clean polviews, replacing cases where the question was not asked or the respondent did not answer.

gss['polviews'].replace([0, 8, 9], np.nan, inplace=True)

To confirm that my cleaning did not introduce errors, I compared the following cross-tabulation with the one in the GSS Explorer.

year = gss["year"]
column = gss["polviews"]
xtab = pd.crosstab(column, year)
xtab
year 1974 1975 1976 1977 1978 1980 1982 1983 1984 1985 ... 2004 2006 2008 2010 2012 2014 2016 2018 2021 2022
polviews
1.0 22 46 31 37 22 36 48 16 29 35 ... 46 139 69 76 81 94 136 122 207 192
2.0 201 179 187 169 142 121 199 67 133 163 ... 120 524 240 259 244 304 350 278 623 486
3.0 207 196 186 214 241 208 267 98 177 171 ... 153 517 221 232 208 263 310 256 490 400
4.0 564 559 559 564 549 582 694 319 568 566 ... 497 1683 740 746 713 989 1032 855 1377 1245
5.0 221 232 221 251 263 258 235 142 276 271 ... 214 618 268 265 268 334 382 283 476 430
6.0 160 150 190 179 188 180 229 108 186 214 ... 223 685 327 315 292 358 426 354 617 514
7.0 35 35 27 39 30 44 67 20 41 42 ... 56 167 68 80 68 107 120 99 174 159

7 rows × 32 columns

I’ll lump the first three points into “Liberal” and the last three into “Conservative”.

recode_polviews = {
    1: "Liberal",
    2: "Liberal",
    3: "Liberal",
    4: "Moderate",
    5: "Conservative",
    6: "Conservative",
    7: "Conservative",
}
gss["polviews3"] = gss["polviews"].replace(recode_polviews)
gss["polviews3"].value_counts()
polviews3
Moderate        23992
Conservative    21122
Liberal         17604
Name: count, dtype: int64

Make the plot with all respondents#

The following functions generate a figure like the one in the FT.

def make_diff(df):
    """Compute % liberal minus % conservative.
    """
    year = df["year"]
    column = df["polviews3"]

    xtab = pd.crosstab(year, column, normalize='index')
    diff = xtab['Liberal'] - xtab['Conservative']
    
    return diff * 100
def make_plot(df, title=''):
    """Plot % liberal - % conservative for male and female respondents.
    """
    male = df.query('sex==1')
    female = df.query('sex==2')

    diff_male = make_diff(male)
    diff_female = make_diff(female)

    plot_series_lowess(diff_male, color='C0', label='Male')
    plot_series_lowess(diff_female, color='C1', label='Female')
    decorate_plot(title)
def decorate_plot(title):
    decorate(xlabel='Year',
             ylabel='% liberal - % conservative',
             title=title)

Generate the plot for all respondents.

make_plot(gss, 'All respondents')
_images/226f0cae1c29bfc7f3335ab221fc18066c45636c5b3fd08f14881f46c136d3d5.png

In the general population, men are more likely to say they are conservative by 5-10 percentage points.

The gap might have increased in the most recent data, depending on how seriously we take the last two points in a noisy series.

Just young people#

Now let’s select just people under 30.

subset = gss.query('age < 30')
subset.shape
(14360, 206)

And make the same figure.

make_plot(subset, 'Age < 30')
_images/5d72f4588d6871fdab57f3f1b61ab17cc392f3091d9d822d08a1fc32c6b33b6a.png

The trends here are pretty much the same as in the general population. Men are more likely to say they are conservative, by 5-10 percentage points.

It’s possible that the gap has grown in the most recent data, but the evidence is weak and depends on how we draw a smooth curve through noisy data.

Anyway, there is no evidence the trend for men is going down, and the gap in the most recent data is nowhere near 30 percentage points.

Here are the sample sizes.

year = subset["year"]
column = subset["sex"]
xtab = pd.crosstab(column, year)
xtab
year 1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 ... 2004 2006 2008 2010 2012 2014 2016 2018 2021 2022
sex
1.0 223 180 176 181 187 175 167 142 222 182 ... 253 351 157 153 155 190 228 185 177 246
2.0 177 203 204 224 201 193 240 215 274 233 ... 288 415 202 222 176 198 253 213 226 307

2 rows × 34 columns

With Sampling Weights#

In the previous figures, I have not taken into account the sampling weights. I didn’t expect them to make much difference, and they don’t except for men in 2022 – and as we’ll see, there is almost certainly something wrong with that data point.

male = subset.query('sex==1')
female = subset.query('sex==2')

diff_male = make_diff(male)
diff_female = make_diff(female)

We only have weighted data since 1988, since that’s how far back the wtssps variable goes.

recent = gss.dropna(subset=['wtssps']).query('age < 30')
def resample_rows_weighted(df, column):
    """Resamples a DataFrame using probabilities proportional to given column.

    df: DataFrame
    column: string column name to use as weights

    returns: DataFrame
    """
    weights = df[column]
    sample = df.sample(n=len(df), replace=True, weights=weights)
    return sample
def resample_by_year(df, column):
    """Resample rows within each year.

    df: DataFrame
    column: string name of weight variable

    returns DataFrame
    """
    grouped = df.groupby("year")
    samples = [resample_rows_weighted(group, column) for _, group in grouped]
    sample = pd.concat(samples, ignore_index=True)
    return sample
def percentile_rows(series_seq, ps):
    """Computes percentiles from aligned series.

    series_seq: list of sequences
    ps: cumulative probabilities

    returns: Series of x-values, NumPy array with selected rows
    """
    df = pd.concat(series_seq, axis=1).dropna()
    xs = df.index
    array = df.values.transpose()
    array = np.sort(array, axis=0)
    nrows, _ = array.shape

    ps = np.asarray(ps)
    indices = (ps * nrows).astype(int)
    rows = array[indices]
    return xs, rows
def plot_percentiles(series_seq, ps=None, label=None, **options):
    """Plot the low, median, and high percentiles.

    series_seq: sequence of Series
    ps: percentiles to use for low, medium and high
    label: string label for the median line
    options: options passed plt.plot and plt.fill_between
    """
    if ps is None:
        ps = [0.05, 0.5, 0.95]
    assert len(ps) == 3

    xs, rows = percentile_rows(series_seq, ps)
    low, med, high = rows
    plt.plot(xs, med, alpha=0.5, label=label, **options)
    plt.fill_between(xs, low, high, linewidth=0, alpha=0.2, **options)
def resample_diffs(df, query, iters=101):
    diffs = []
    for i in range(iters):
        sample = resample_by_year(df, "wtssps").query(query)
        diff = make_diff(sample)
        diffs.append(diff)
    return diffs
diffs_male = resample_diffs(recent, 'sex==1')
diffs_female = resample_diffs(recent, 'sex==2')

This figure shows the median of 101 resamplings and a 90% CI, along with the unweighted data.

plot_percentiles(diffs_male)
plot_percentiles(diffs_female)

diff_male.plot(style='.', color='C0', label='Male')
diff_female.plot(style='.', color='C1', label='Female')

decorate_plot('Age < 30 with sampling weights')
_images/69391ea6ce88072987e89a860decf128ce4e3f1fd76c7b1f959a91bcd50e0a7b.png

In most cases, the unweighted data falls in the CI of the weighted data, but for male respondents in 2022, the weighting moves the needle by almost 30 percentage points.

So something is not right there. I think the best option is to drop the 2022 data, but just for completeness, let’s see what happens if we apply some smoothing.

Resampling and smoothing#

def resample_diffs_smooth(df, query, iters=101):
    diffs = []
    for i in range(iters):
        sample = resample_by_year(df, "wtssps").query(query)
        diff = make_diff(sample)
        smooth = make_lowess(diff)
        diffs.append(smooth)
    return diffs
diffs_male = resample_diffs_smooth(recent, 'sex==1')
diffs_female = resample_diffs_smooth(recent, 'sex==2')

Here’s a version of the same plot with moderate smoothing, and dropping the unweighted data.

plot_percentiles(diffs_male, label='Male')
plot_percentiles(diffs_female, label='Female')

decorate_plot('Age < 30 with sampling weights')
_images/a6a82db57debf8341c79fefd29be17ce375791a3eba5f37db305818bf92019e4.png

You could make the argument that this figure shows evidence for an increasing gap, but the error bounds are very wide, and as we’ll see in the next figure, the entire effect is due to the likely error in the 2022 data.

Resampling and smoothing without 2022#

Finally, here’s the analysis I think is the best choice, dropping the 2022 data for both men and women.

pre2022 = recent.query('year < 2022')
diffs_male = resample_diffs_smooth(pre2022, 'sex==1')
diffs_female = resample_diffs_smooth(pre2022, 'sex==2')
plot_percentiles(diffs_male, label='Male')
plot_percentiles(diffs_female, label='Female')

decorate_plot('Age < 30 with sampling weights, no 2022')
_images/a17ad0a364193bb520fa13256ef7e6fdf28a190187003ed75e5302cba36e0633.png

Since the 1990s, both men and women have become more likely to identify as liberal.

Men are more likely to identify as conservative by 5-10 percentage points.

But this figure shows no evidence that the ideology gap is growing.

What’s wrong with 2022?#

I don’t know yet, but I’ll add some explorations here.

For one thing, the magnitudes of the weights are substantially different than in previous years, which is why they are able to drag the results so far.

subset.query('year>2010 and sex==2').groupby('year')['wtssall'].describe()
count mean std min 25% 50% 75% max
year
2012 176.0 1.172650 0.662591 0.411898 0.823796 0.823796 1.647593 3.495950
2014 198.0 1.162238 0.612888 0.448002 0.896003 0.896003 1.380591 3.451477
2016 253.0 1.121762 0.558590 0.478497 0.782182 0.956994 1.435490 3.910908
2018 213.0 1.197720 0.725489 0.471499 0.942997 0.942997 1.414496 5.897420
2021 226.0 1.638973 1.090648 0.361567 0.835459 1.411978 2.014895 6.159929
2022 307.0 1.230579 1.438955 0.143659 0.420716 0.743450 1.295847 9.343494