Is the Ideology Gap Growing?

Is the Ideology Gap Growing?#

This tweet from John Burn-Murdoch links to an article in the Financial Times, “A new global gender divide is emerging”, which includes this figure:

The article claims:

In the US, Gallup data shows that after decades where the sexes were each spread roughly equally across liberal and conservative world views, women aged 18 to 30 are now 30 percentage points more liberal than their male contemporaries. That gap took just six years to open up.

The figure says it is based on General Social Survey data and the text says it’s based on Gallup data, so I’m not sure which it is. UPDATE: In this tweet Burn-Murdoch explains that the figure shows Gallup data, backfilled with GSS data from before the Gallup series began.

And I don’t know what it means that “All figures are adjusted for time trend in the overall population”. UPDATE: See the section below that explains the adjustment.

Anyway, since I used GSS data in the last three chapters of Probably Overthinking It, this analysis did not sound right to me.

This notebook is my attempt to replicate the analysis with GSS data. I conclude:

The GSS data does not look like the figure in the FT.
Men are a more likely to say that they are conservative, by 5-10 percentage points.
The only evidence that the gap is growing depends entirely on a data point from 2022 that is almost certainly wrong.
If we drop the 2022 data and apply moderate smoothing, we see no evidence that the gap is growing.

Most of the functions in this notebook are the ones I used to write Probably Overthinking It. All of the notebooks for that book are available in this repository.

Click here to run this notebook on Colab

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from statsmodels.nonparametric.smoothers_lowess import lowess


def make_lowess(series, frac=0.5):
    """Use LOWESS to compute a smooth line.

    series: pd.Series

    returns: pd.Series
    """
    y = series.values
    x = series.index.values

    smooth = lowess(y, x, frac=frac)
    index, data = np.transpose(smooth)

    return pd.Series(data, index=index)

Read the Data#

I’m using data I previous cleaned in this notebook.

from os.path import basename, exists
from pathlib import Path


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download("https://github.com/AllenDowney/GssExtract/raw/main/data/interim/gss_pacs_2022.hdf")

'gss_pacs_2022.hdf'

gss = pd.read_hdf("gss_pacs_2022.hdf", "gss")
gss.shape

(72390, 205)

The primary variable we’ll use is polviews, which asks

We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal–point 1–to extremely conservative–point 7. Where would you place yourself on this scale?

The points on the scale are Extremely liberal, Liberal, and Slightly liberal; Moderate; Slightly conservative, Conservative, and Extremely conservative.

I’ll clean polviews, replacing cases where the question was not asked or the respondent did not answer.

gss['polviews'].replace([0, 8, 9], np.nan, inplace=True)

To confirm that my cleaning did not introduce errors, I compared the following cross-tabulation with the one in the GSS Explorer.

year = gss["year"]
column = gss["polviews"]
xtab = pd.crosstab(column, year)
xtab

year	1974	1975	1976	1977	1978	1980	1982	1983	1984	1985	...	2004	2006	2008	2010	2012	2014	2016	2018	2021	2022
polviews
1.0	22	46	31	37	22	36	48	16	29	35	...	46	139	69	76	81	94	136	122	207	192
2.0	201	179	187	169	142	121	199	67	133	163	...	120	524	240	259	244	304	350	278	623	486
3.0	207	196	186	214	241	208	267	98	177	171	...	153	517	221	232	208	263	310	256	490	400
4.0	564	559	559	564	549	582	694	319	568	566	...	497	1683	740	746	713	989	1032	855	1377	1245
5.0	221	232	221	251	263	258	235	142	276	271	...	214	618	268	265	268	334	382	283	476	430
6.0	160	150	190	179	188	180	229	108	186	214	...	223	685	327	315	292	358	426	354	617	514
7.0	35	35	27	39	30	44	67	20	41	42	...	56	167	68	80	68	107	120	99	174	159

7 rows × 32 columns

I’ll lump the first three points into “Liberal” and the last three into “Conservative”.

recode_polviews = {
"Liberal",
"Liberal",
"Liberal",
"Moderate",
"Conservative",
"Conservative",
"Conservative",
}

gss["polviews3"] = gss["polviews"].replace(recode_polviews)
gss["polviews3"].value_counts()

polviews3
Moderate        23992
Conservative    21122
Liberal         17604
Name: count, dtype: int64

def make_diff(df):
    """Compute % liberal minus % conservative.
    """
    year = df["year"]
    column = df["polviews3"]

    xtab = pd.crosstab(year, column, normalize='index')
    diff = xtab['Liberal'] - xtab['Conservative']
    
    return diff * 100

def make_mean(df):
    """Compute % liberal minus % conservative.
    """
    series = df.groupby('year')['polviews'].mean()
    
    return series

def decorate_plot(title):
    decorate(xlabel='Year',
             ylabel='% liberal - % conservative',
             title=title)

Make the plot with all respondents#

The following functions generate a figure like the one in the FT.

def make_plot(df, title=''):
    """Plot % liberal - % conservative for male and female respondents.
    """
    male = df.query('sex==1')
    female = df.query('sex==2')

    diff_male = make_diff(male)
    diff_female = make_diff(female)

    plot_series_lowess(diff_male, color='C0', label='Male')
    plot_series_lowess(diff_female, color='C1', label='Female')
    decorate_plot(title)

Generate the plot for all respondents.

def savefig(filename, **options):
    if 'dpi' not in options:
        options['dpi'] = 300
    plt.savefig(filename, **options)

make_plot(gss, 'All respondents')
savefig('ideology_gap1.png')

_images/226f0cae1c29bfc7f3335ab221fc18066c45636c5b3fd08f14881f46c136d3d5.png

In the general population, men are more likely to say they are conservative by 5-10 percentage points.

The gap might have increased in the most recent data, depending on how seriously we take the last two points in a noisy series.

Just young people#

Now let’s select just people under 30.

subset = gss.query('age < 30')
subset.shape

(14360, 206)

And make the same figure.

make_plot(subset, 'Age < 30')
savefig('ideology_gap2.png')

_images/5d72f4588d6871fdab57f3f1b61ab17cc392f3091d9d822d08a1fc32c6b33b6a.png

The trends here are pretty much the same as in the general population. Men are more likely to say they are conservative, by 5-10 percentage points.

It’s possible that the gap has grown in the most recent data, but the evidence is weak and depends on how we draw a smooth curve through noisy data.

Anyway, there is no evidence the trend for men is going down, and the gap in the most recent data is nowhere near 30 percentage points.

Here are the sample sizes.

year = subset["year"]
column = subset["sex"]
xtab = pd.crosstab(column, year)
xtab

year	1972	1973	1974	1975	1976	1977	1978	1980	1982	1983	...	2004	2006	2008	2010	2012	2014	2016	2018	2021	2022
sex
1.0	223	180	176	181	187	175	167	142	222	182	...	253	351	157	153	155	190	228	185	177	246
2.0	177	203	204	224	201	193	240	215	274	233	...	288	415	202	222	176	198	253	213	226	307

2 rows × 34 columns

With Sampling Weights#

In the previous figures, I have not taken into account the sampling weights. I didn’t expect them to make much difference, and they don’t except for men in 2022 – and as we’ll see, there is almost certainly something wrong with that data point.

male = subset.query('sex==1')
female = subset.query('sex==2')

diff_male = make_diff(male)
diff_female = make_diff(female)

We only have weighted data since 1988, since that’s how far back the wtssps variable goes.

recent = gss.dropna(subset=['wtssps']).query('age < 30')

def resample_rows_weighted(df, column):
    """Resamples a DataFrame using probabilities proportional to given column.

    df: DataFrame
    column: string column name to use as weights

    returns: DataFrame
    """
    weights = df[column]
    sample = df.sample(n=len(df), replace=True, weights=weights)
    return sample

def resample_by_year(df, column):
    """Resample rows within each year.

    df: DataFrame
    column: string name of weight variable

    returns DataFrame
    """
    grouped = df.groupby("year")
    samples = [resample_rows_weighted(group, column) for _, group in grouped]
    sample = pd.concat(samples, ignore_index=True)
    return sample

def percentile_rows(series_seq, ps):
    """Computes percentiles from aligned series.

    series_seq: list of sequences
    ps: cumulative probabilities

    returns: Series of x-values, NumPy array with selected rows
    """
    df = pd.concat(series_seq, axis=1).dropna()
    xs = df.index
    array = df.values.transpose()
    array = np.sort(array, axis=0)
    nrows, _ = array.shape

    ps = np.asarray(ps)
    indices = (ps * nrows).astype(int)
    rows = array[indices]
    return xs, rows

def plot_percentiles(series_seq, ps=None, label=None, **options):
    """Plot the low, median, and high percentiles.

    series_seq: sequence of Series
    ps: percentiles to use for low, medium and high
    label: string label for the median line
    options: options passed plt.plot and plt.fill_between
    """
    if ps is None:
        ps = [0.05, 0.5, 0.95]
    assert len(ps) == 3

    xs, rows = percentile_rows(series_seq, ps)
    low, med, high = rows
    plt.plot(xs, med, alpha=0.5, label=label, **options)
    plt.fill_between(xs, low, high, linewidth=0, alpha=0.2, **options)

def resample_diffs(df, query, iters=101):
    diffs = []
    for i in range(iters):
        sample = resample_by_year(df, "wtssps").query(query)
        diff = make_diff(sample)
        diffs.append(diff)
    return diffs

diffs_male = resample_diffs(recent, 'sex==1')
diffs_female = resample_diffs(recent, 'sex==2')

This figure shows the median of 101 resamplings and a 90% CI, along with the unweighted data.

plot_percentiles(diffs_male)
plot_percentiles(diffs_female)

diff_male.plot(style='.', color='C0', label='Male')
diff_female.plot(style='.', color='C1', label='Female')

decorate_plot('Age < 30 with sampling weights')
savefig('ideology_gap3.png')

_images/e5fac031cc7e5557751490690c894c7e9a903f4257b3eb7446fd8cf9d9194820.png

In most cases, the unweighted data falls in the CI of the weighted data, but for male respondents in 2022, the weighting moves the needle by almost 30 percentage points.

So something is not right there. I think the best option is to drop the 2022 data, but just for completeness, let’s see what happens if we apply some smoothing.

Resampling and smoothing#

def resample_diffs_smooth(df, query, iters=101):
    diffs = []
    for i in range(iters):
        sample = resample_by_year(df, "wtssps").query(query)
        diff = make_diff(sample)
        smooth = make_lowess(diff)
        diffs.append(smooth)
    return diffs

diffs_male = resample_diffs_smooth(recent, 'sex==1')
diffs_female = resample_diffs_smooth(recent, 'sex==2')

Here’s a version of the same plot with moderate smoothing, and dropping the unweighted data.

plot_percentiles(diffs_male, label='Male')
plot_percentiles(diffs_female, label='Female')

decorate_plot('Age < 30 with sampling weights, smoothed')
savefig('ideology_gap4.png')

_images/baf2755acfbd444ac55a4e3282c65a30bcff0ee307d45ee56aea317e3492e253.png

You could make the argument that this figure shows evidence for an increasing gap, but the error bounds are very wide, and as we’ll see in the next figure, the entire effect is due to the likely error in the 2022 data.

Resampling and smoothing without 2022#

Finally, here’s the analysis I think is the best choice, dropping the 2022 data for both men and women.

pre2022 = recent.query('year < 2022')

diffs_male = resample_diffs_smooth(pre2022, 'sex==1')
diffs_female = resample_diffs_smooth(pre2022, 'sex==2')

plot_percentiles(diffs_male, label='Male')
plot_percentiles(diffs_female, label='Female')

decorate_plot('Age < 30 with sampling weights, smoothed, no 2022')
savefig('ideology_gap5.png')

_images/21e3230f146cefcc1f430020661d16e914b0f713b2b09f39aaaea54c30afcb7d.png

Since the 1990s, both men and women have become more likely to identify as liberal.

Men are more likely to identify as conservative by 5-10 percentage points.

But this figure shows no evidence that the ideology gap is growing.

Adjusting#

In this tweet, Burn-Murdoch explains that the adjustment mentioned in the figure is to subtract off the overall trend.

Here’s the overall trend, not including the 2022 data.

diff_overall = make_diff(pre2022)

plot_series_lowess(diff_overall, color='C2', label='Overall')
decorate_plot('Overall')

_images/271676c236f954004d04fd6cd7d1c62c2e344514eefee04b37b9de4a91815d2f.png

def resample_diffs_smooth_adjusted(df, query, iters=101):
    diff_overall = make_diff(df)
    diff_overall_smooth = make_lowess(diff_overall)

    diffs = []
    for i in range(iters):
        sample = resample_by_year(df, "wtssps").query(query)
        diff = make_diff(sample)
        smooth = make_lowess(diff) - diff_overall_smooth
        diffs.append(smooth)
    return diffs

diffs_male = resample_diffs_smooth_adjusted(pre2022, 'sex==1')
diffs_female = resample_diffs_smooth_adjusted(pre2022, 'sex==2')

Here’s what the adjusted data looks like.

plot_percentiles(diffs_male, label='Male')
plot_percentiles(diffs_female, label='Female')

decorate_plot('Age < 30 with sampling weights, smoothed, no 2022, adjusted')
savefig('ideology_gap6.png')

_images/322455b233fb27300e45dd1fb881fe2566066e524269925649c636db76160193.png

This figure gives a stronger visual sense that the gap is growing, but even then, it is probably not more than 10 percentage points, and smaller than it was in the 1980s.

This way of showing the data makes it seem as if men are increasingly likely to say they are conservative, which is misleading. They are increasingly likely to say they are liberal, but not increasing as fast as the overall average.

What’s wrong with 2022?#

I don’t know yet, but I’ll add some explorations here.

For one thing, the magnitudes of the weights are substantially different than in previous years, which is why they are able to drag the results so far.

subset.query('year>2010 and sex==2').groupby('year')['wtssall'].describe()

	count	mean	std	min	25%	50%	75%	max
year
2012	176.0	1.172650	0.662591	0.411898	0.823796	0.823796	1.647593	3.495950
2014	198.0	1.162238	0.612888	0.448002	0.896003	0.896003	1.380591	3.451477
2016	253.0	1.121762	0.558590	0.478497	0.782182	0.956994	1.435490	3.910908
2018	213.0	1.197720	0.725489	0.471499	0.942997	0.942997	1.414496	5.897420
2021	226.0	1.638973	1.090648	0.361567	0.835459	1.411978	2.014895	6.159929
2022	307.0	1.230579	1.438955	0.143659	0.420716	0.743450	1.295847	9.343494

year	1972	1973	1974	1975	1976	1977	1978	1980	1982	1983	...	2004	2006	2008	2010	2012	2014	2016	2018	2021	2022
sex
1.0	223	180	176	181	187	175	167	142	222	182	...	253	351	157	153	155	190	228	185	177	246
2.0	177	203	204	224	201	193	240	215	274	233	...	288	415	202	222	176	198	253	213	226	307

year	1972	1973	1974	1975	1976	1977	1978	1980	1982	1983	...	2004	2006	2008	2010	2012	2014	2016	2018	2021	2022
sex
1.0	223	180	176	181	187	175	167	142	222	182	...	253	351	157	153	155	190	228	185	177	246
2.0	177	203	204	224	201	193	240	215	274	233	...	288	415	202	222	176	198	253	213	226	307