Data Inventory#

Allen Downey

MIT License

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Data#

GSS released 2022_r3a in April 2024.

Download the Stata data from https://gss.norc.org/get-the-data/stata

Move the downloaded archive to the nb directory and unzip it.

!ls GSS_stata/
'2022 Release Variables.pdf'	 gss7222_r3a.dta  'Release Notes 7222.pdf'
'GSS 2022 Codebook.pdf'		 gss7222_r3.dta
"GSS 2022 - What's New R3.pdf"	 ReadMe.txt
filename = "GSS_stata/gss7222_r3a.dta"

The following subset includes all of the fe variables that were asked in more than a few years, the standard set of demographic variables, and a few related topics we might explore at some point.

columns = sorted(
    [
        'abany',
        'abdefect',
        'abhlth',
        'abnomore',
        'abpoor',
        'abrape',
        'absingle',
        'acqntsex',
        'age',
        'attend',
        'ballot',
        'cohort',
        'degree',
        'discaffm',
        'discaffw',
        'divorce',
        'educ',
        'fair',
        'fechld',
        'fefam',
        'fehelp',
        'fehire',
        'fehome',
        'fejobaff',
        'fepol',
        'fepres',
        'fepresch',
        'fework',
        'frndsex',
        'fund',
        'hapmar',
        'happy',
        'health',
        'helpful',
        'id',
        'life',
        'matesex',
        'othersex',
        'paidsex',
        'partyid',
        'pikupsex',
        'polviews',
        'race',
        'realinc',
        'realrinc',
        'region',
        'relig',
        'reliten',
        'rincome',
        'sex',
        'sexbirth',
        'sexfreq',
        'sexnow',
        'sexornt',
        'sexsex',
        'sexsex5',
        'spanking',
        'srcbelt',
        'trust',
        'wtssall',
        'wtssps',
        'year'
    ]
)
gss = pd.read_stata(filename, columns=columns, convert_categoricals=False)
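
As a rough illustration of what the `columns` and `convert_categoricals` arguments do, here is a small round trip through a temporary Stata file. The toy DataFrame, its values, and the file name are made up for this sketch; only `pd.read_stata` and its two arguments come from the cell above.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the GSS file; names and values are hypothetical.
toy = pd.DataFrame(
    {
        "year": [1972, 1972, 2022],
        "age": [23, 70, 48],
        "sex": pd.Categorical(["male", "female", "female"]),
        "unused": [1.5, 2.5, 3.5],
    }
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.dta")
    toy.to_stata(path, write_index=False)

    # Read back only the requested columns, keeping numeric codes
    # rather than converting labeled values to pandas categoricals.
    subset = pd.read_stata(
        path, columns=["age", "sex", "year"], convert_categoricals=False
    )

print(subset.dtypes)
```

With `convert_categoricals=False`, the labeled column comes back as numeric codes, which is why the responses later in this notebook appear as numbers like 1.0 and 2.0.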
# weights are different in 2021 and 2022 so mixing them in might seem like a bad idea,
# but we only use them for resampling within one year of the survey,
# so I think it's ok
gss["wtssall"] = gss["wtssall"].fillna(gss["wtssps"])
gss["wtssall"].describe()
count    72390.000000
mean         1.000014
std          0.550871
min          0.073972
25%          0.549300
50%          0.961700
75%          1.098500
max         14.272462
Name: wtssall, dtype: float64
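
To make the weight-filling step concrete, here is a minimal sketch on toy data. The layout, with `wtssall` missing in the later years and `wtssps` present there, is an assumption for illustration, and the weights are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical weights: each row has exactly one of the two columns filled in.
toy = pd.DataFrame(
    {
        "year": [2018, 2018, 2021, 2022],
        "wtssall": [0.9, 1.1, np.nan, np.nan],
        "wtssps": [np.nan, np.nan, 0.7, 1.3],
    }
)

# Wherever wtssall is missing, take the value from wtssps in the same row.
toy["wtssall"] = toy["wtssall"].fillna(toy["wtssps"])
print(toy)
```

`fillna` aligns the two Series on the index, so each missing value is replaced by the value from the same row.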
del gss["wtssps"]
print(gss.shape)
gss.head()
(72390, 61)
abany abdefect abhlth abnomore abpoor abrape absingle acqntsex age attend ... sexfreq sexnow sexornt sexsex sexsex5 spanking srcbelt trust wtssall year
0 NaN 1.0 1.0 1.0 1.0 1.0 1.0 NaN 23.0 2.0 ... NaN NaN NaN NaN NaN NaN 3.0 3.0 0.4446 1972
1 NaN 1.0 1.0 2.0 2.0 1.0 1.0 NaN 70.0 7.0 ... NaN NaN NaN NaN NaN NaN 3.0 1.0 0.8893 1972
2 NaN 1.0 1.0 1.0 1.0 1.0 1.0 NaN 48.0 4.0 ... NaN NaN NaN NaN NaN NaN 3.0 2.0 0.8893 1972
3 NaN 2.0 1.0 2.0 1.0 1.0 1.0 NaN 27.0 0.0 ... NaN NaN NaN NaN NaN NaN 3.0 2.0 0.8893 1972
4 NaN 1.0 1.0 1.0 1.0 1.0 1.0 NaN 61.0 0.0 ... NaN NaN NaN NaN NaN NaN 3.0 2.0 0.8893 1972

5 rows × 61 columns

Inventory#

Here are the 10 fe variables and the text of the questions.

fechld

A. A working mother can establish just as warm and secure a relationship with her children as a mother who does not work.

fefam

D. It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.

fehelp

B. It is more important for a wife to help her husband’s career than to have one herself.

fehire

Because of past discrimination, employers should make special efforts to hire and promote qualified women.

fehome

Women should take care of running their homes and leave running the country up to men.

fejobaff

Some people say that because of past discrimination, women should be given preference in hiring and promotion. Others say that such preference in hiring and promotion of women is wrong because it discriminates against men. What about your opinion - are you for or against preferential hiring and promotion of women? IF FOR: Do you favor preference in hiring and promotion strongly or not strongly? IF AGAINST: Do you oppose preference in hiring and promotion strongly or not strongly?

fepol

A. Tell me if you agree or disagree with this statement: Most men are better suited emotionally for politics than are most women.

fepres

If your party nominated a woman for President, would you vote for her if she were qualified for the job?

fepresch

C. A preschool child is likely to suffer if his or her mother works.

fework

Do you approve or disapprove of a married woman earning money in business or industry if she has a husband capable of supporting her?

fe_columns = [x for x in gss.columns if x.startswith('fe')]
fe_columns
['fechld',
 'fefam',
 'fehelp',
 'fehire',
 'fehome',
 'fejobaff',
 'fepol',
 'fepres',
 'fepresch',
 'fework']
len(fe_columns)
10


from utils import decorate

grouped = gss.groupby('year')
intervals = pd.DataFrame(columns=['first', 'last', '# years'], dtype=int)

for column in fe_columns:
    plt.figure()
    counts = grouped[column].count()
    counts.plot.bar()
    nonzero = counts.replace(0, np.nan).dropna()
    n_years = len(nonzero)
    first, last = nonzero.index.min(), nonzero.index.max()
    intervals.loc[column] = first, last, n_years
    decorate()
[Figures: bar charts of the number of responses per year, one for each of the 10 fe variables]
intervals
first last # years
fechld 1977 2022 23
fefam 1977 2022 23
fehelp 1977 1998 11
fehire 1996 2022 13
fehome 1974 1998 16
fejobaff 1996 2022 13
fepol 1974 2022 27
fepres 1972 2010 19
fepresch 1977 2022 23
fework 1972 1998 17
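
The first year, last year, and number of years for a variable can be traced on a toy example. The column name `item` and the data are hypothetical; the sketch filters with `counts > 0`, which selects the same years as the `replace(0, np.nan).dropna()` idiom in the loop.

```python
import numpy as np
import pandas as pd

# Hypothetical question that was not asked in 1972 (all responses missing).
toy = pd.DataFrame(
    {
        "year": [1972, 1972, 1974, 1974, 1977, 1977],
        "item": [np.nan, np.nan, 1.0, 2.0, 1.0, np.nan],
    }
)

# count() ignores NaN, so a year where the question was not asked gets 0.
counts = toy.groupby("year")["item"].count()

# Keep only the years with at least one valid response.
asked = counts[counts > 0]
first, last, n_years = asked.index.min(), asked.index.max(), len(asked)
print(first, last, n_years)
```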

Responses#

Most are on a four-point scale:

1	STRONGLY AGREE
2	AGREE
3	DISAGREE
4	STRONGLY DISAGREE

fehire is on a five-point scale:

1	STRONGLY AGREE
2	AGREE
3	NEITHER AGREE NOR DISAGREE
4	DISAGREE
5	STRONGLY DISAGREE

Some are on a two-point scale.

from utils import values

for col in fe_columns:
    print(values(gss[col]))
fechld
1.0     9240
2.0    15202
3.0     8666
4.0     2342
NaN    36940
Name: count, dtype: int64
fefam
1.0     2810
2.0     9839
3.0    15198
4.0     7284
NaN    37259
Name: count, dtype: int64
fehelp
1.0      769
2.0     3769
3.0     7732
4.0     3041
NaN    57079
Name: count, dtype: int64
fehire
1.0     2817
2.0     5945
3.0     2048
4.0     2389
5.0      580
NaN    58611
Name: count, dtype: int64
fehome
1.0     5424
2.0    17114
NaN    49852
Name: count, dtype: int64
fejobaff
1.0     2299
2.0     1311
3.0     2906
4.0     3936
NaN    61938
Name: count, dtype: int64
fepol
1.0     9982
2.0    25715
NaN    36693
Name: count, dtype: int64
fepres
1.0    23257
2.0     3531
5.0        4
NaN    45598
Name: count, dtype: int64
fepresch
1.0     2817
2.0    11254
3.0    16303
4.0     4731
NaN    37285
Name: count, dtype: int64
fework
1.0    18753
2.0     5648
NaN    47989
Name: count, dtype: int64
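
The `values` helper is imported from `utils` and its source is not shown here. Judging only from the output above, it behaves roughly like the following sketch; the name `values_sketch` and the toy responses are hypothetical.

```python
import numpy as np
import pandas as pd


def values_sketch(series):
    """Count each distinct value, including NaN, sorted by value.

    A guess at the behavior of utils.values, not its actual source.
    """
    return series.value_counts(dropna=False).sort_index()


responses = pd.Series([1, 2, 2, np.nan, 1, 2, np.nan, np.nan], name="item")
counts = values_sketch(responses)
print(counts)
```

Passing `dropna=False` is what keeps the NaN row, which shows how many respondents were not asked the question or did not answer it.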

fepol and fehome: 1 agree, 2 disagree

fework: 1 approve, 2 disapprove

fepres: 1 yes, 2 no, 5 would not vote. Let’s recode 5 as no.

gss['fepres'] = gss['fepres'].replace(5, 2)
values(gss['fepres'])
fepres
1.0    23257
2.0     3535
NaN    45598
Name: count, dtype: int64

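For the four- and five-point items, I’ll select codes 1 and 2 (“strongly agree” and “agree”, or their equivalents for fejobaff). For the two-point items, I’ll select code 1: “agree” for fehome and fepol, “yes” for fepres, and “approve” for fework.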

agree_responses = {
    'fechld': [1, 2],
    'fefam': [1, 2],
    'fehelp': [1, 2],
    'fehire': [1, 2],
    'fehome': [1],
    'fejobaff': [1, 2],
    'fepol': [1],
    'fepres': [1],
    'fepresch': [1, 2],
    'fework': [1],
}
from utils import plot_series_lowess

def plot_series(data, column, color, title):
    xtab = pd.crosstab(data['year'], data[column], normalize='index')
    series = xtab[agree_responses[column]].sum(axis=1)
    plot_series_lowess(series, color=color, label=column)
    decorate(title=title)
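
The first two lines of `plot_series` compute, for each year, the fraction of valid responses that fall in the selected codes. Here is that computation on toy data, without the plotting; the column name `item` and the responses are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical four-point item asked in two years, with one missing response.
toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2000, 2010, 2010, 2010, 2010, 2010],
        "item": [1, 2, 3, 4, 1, 3, 4, 4, np.nan],
    }
)

# Each row of the cross tabulation sums to 1; with the default settings,
# rows with a missing response are left out of the table.
xtab = pd.crosstab(toy["year"], toy["item"], normalize="index")

# Add up the columns for codes 1 and 2 to get the fraction who agree.
agree = xtab[[1, 2]].sum(axis=1)
print(agree)
```

Because missing responses are excluded, the denominator in each year is the number of people who answered the question, not the number of respondents in that year.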

All respondents#

Note that these results have not yet been corrected for stratified sampling, so think of this as an inventory of the data, not inferences about the population.

  • Last two points of fehire have gone wonky – I’ve seen things like this in the 2021 and 2022 data. Not sure what the issue is.

for column in fe_columns:
    plt.figure()
    plot_series(gss, column, 'C2', 'All respondents')
[Figures: fraction of selected responses by year for all respondents, plotted with plot_series_lowess, one for each of the 10 fe variables]

Female respondents#

female = gss.query('sex == 2')
for column in fe_columns:
    plt.figure()
    plot_series(female, column, 'C1', title='Female respondents')
[Figures: fraction of selected responses by year for female respondents, plotted with plot_series_lowess, one for each of the 10 fe variables]

Young females#

Noisier series due to smaller sample sizes.

young_female = female.query('age < 30')
for column in fe_columns:
    plt.figure()
    plot_series(young_female, column, 'C4', title='Female respondents age < 30')
[Figures: fraction of selected responses by year for female respondents under 30, plotted with plot_series_lowess, one for each of the 10 fe variables]
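
To get a rough sense of how much noisier a subgroup series is, consider the standard error of a proportion, sqrt(p(1-p)/n). The sample sizes below are hypothetical round numbers, not GSS counts, and the formula assumes simple random sampling, which the GSS is not, so treat this as an order-of-magnitude sketch.

```python
import math


def standard_error(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical sizes: a full yearly sample and a subgroup one tenth as large.
se_full = standard_error(0.5, 1500)
se_sub = standard_error(0.5, 150)

print(round(se_full, 4), round(se_sub, 4))
print(round(se_sub / se_full, 2))
```

Cutting the sample size by a factor of 10 increases the standard error by a factor of sqrt(10), a little more than 3.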

Male respondents#

No indications of recent reversals, except fehire, which is wonky for everybody.

Strange pattern in fepol.

male = gss.query('sex == 1')
for column in fe_columns:
    plt.figure()
    plot_series(male, column, 'C0', title='Male respondents')
[Figures: fraction of selected responses by year for male respondents, plotted with plot_series_lowess, one for each of the 10 fe variables]

Young males#

Noisier series due to smaller sample sizes.

No indications of reversals that are anything other than random, with the possible exception of fework. But I doubt it’s real, and even if it were, it happened in 1990.

young_male = male.query('age < 30')
for column in fe_columns:
    plt.figure()
    plot_series(young_male, column, 'C3', title='Male respondents age < 30')
[Figures: fraction of selected responses by year for male respondents under 30, plotted with plot_series_lowess, one for each of the 10 fe variables]

Write extracts#

!rm -f gss_feminism_2022.hdf
gss.to_hdf("gss_feminism_2022.hdf", key="gss", complevel=6)
!ls -lh gss_feminism_2022.hdf
-rw-rw-r-- 1 downey downey 3.2M Jun  3 20:35 gss_feminism_2022.hdf

Resample by year, using wtssall as the weights.

from utils import resample_by_year
sample = resample_by_year(gss, "wtssall")
!rm -f gss_feminism_resampled.hdf
sample.to_hdf("gss_feminism_resampled.hdf", key="gss", complevel=6)
!ls -lh gss_feminism_resampled.hdf
-rw-rw-r-- 1 downey downey 3.3M Jun  3 20:35 gss_feminism_resampled.hdf
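
The source of `resample_by_year` is in `utils` and is not shown here. One plausible reading, consistent with the earlier comment that the weights are only used for resampling within one year, is a weighted bootstrap within each year. The function name `resample_by_year_sketch`, the `seed` parameter, and the toy data below are all assumptions for this sketch.

```python
import numpy as np
import pandas as pd


def resample_by_year_sketch(df, weight_col, seed=None):
    """Weighted resampling with replacement within each year.

    A guess at what utils.resample_by_year does, not its actual source.
    """
    rng = np.random.RandomState(seed)
    samples = []
    for _, group in df.groupby("year"):
        # Draw as many rows as the year has, with probability
        # proportional to the weight column.
        resampled = group.sample(
            n=len(group), replace=True, weights=weight_col, random_state=rng
        )
        samples.append(resampled)
    return pd.concat(samples, ignore_index=True)


# Hypothetical respondents; the row with weight 0 can never be drawn.
toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2010, 2010],
        "id": [1, 2, 3, 4, 5],
        "wtssall": [0.5, 1.5, 0.0, 1.0, 1.0],
    }
)
sample = resample_by_year_sketch(toy, "wtssall", seed=17)
print(sample.groupby("year").size())
```

Because `DataFrame.sample` normalizes the weights within each call, weights only need to be comparable within a year, which is why mixing `wtssall` and `wtssps` across years would not matter under this reading.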