Data Inventory#

Allen Downey

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Data#

GSS released 2022_r3a in April 2024.

Download the Stata data from https://gss.norc.org/get-the-data/stata

Move the downloaded archive to the notebook directory and unzip it.
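The unzip step can also be done from Python. This is a minimal sketch, not part of the original notebook: `unzip_into` is a hypothetical helper and `GSS_stata.zip` is an assumed archive name, so adjust it to whatever the download is actually called.

```python
import zipfile
from pathlib import Path


def unzip_into(archive, dest="."):
    """Extract every member of a zip archive into dest and return their names."""
    with zipfile.ZipFile(Path(archive)) as zf:
        zf.extractall(Path(dest))
        return sorted(zf.namelist())


# Assumed archive name (hypothetical); nothing happens if the file is absent
archive = Path("GSS_stata.zip")
if archive.exists():
    print(unzip_into(archive))
```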

!ls GSS_stata/

'2022 Release Variables.pdf'	 gss7222_r3a.dta  'Release Notes 7222.pdf'
'GSS 2022 Codebook.pdf'		 gss7222_r3.dta
"GSS 2022 - What's New R3.pdf"	 ReadMe.txt

filename = "GSS_stata/gss7222_r3a.dta"

The following subset includes all of the fe variables that were asked in more than a few years, the standard set of demographic variables, and a few related topics we might explore at some point.

columns = sorted(
    [
        'abany',
        'abdefect',
        'abhlth',
        'abnomore',
        'abpoor',
        'abrape',
        'absingle',
        'acqntsex',
        'age',
        'attend',
        'ballot',
        'cohort',
        'degree',
        'discaffm',
        'discaffw',
        'divorce',
        'educ',
        'fair',
        'fechld',
        'fefam',
        'fehelp',
        'fehire',
        'fehome',
        'fejobaff',
        'fepol',
        'fepres',
        'fepresch',
        'fework',
        'frndsex',
        'fund',
        'hapmar',
        'happy',
        'health',
        'helpful',
        'id',
        'life',
        'matesex',
        'othersex',
        'paidsex',
        'partyid',
        'pikupsex',
        'polviews',
        'race',
        'realinc',
        'realrinc',
        'region',
        'relig',
        'reliten',
        'rincome',
        'sex',
        'sexbirth',
        'sexfreq',
        'sexnow',
        'sexornt',
        'sexsex',
        'sexsex5',
        'spanking',
        'srcbelt',
        'trust',
        'wtssall',
        'wtssps',
        'year'
    ]
)

gss = pd.read_stata(filename, columns=columns, convert_categoricals=False)
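`pd.read_stata` raises an error if any name in `columns` is not in the file, so it can help to check the list first. The sketch below is one way to do that, assuming the variable names can be taken from the keys of `StataReader.variable_labels()`. `missing_columns` is a hypothetical helper, and the toy file only stands in for the GSS extract.

```python
import os
import tempfile

import pandas as pd


def missing_columns(path, wanted):
    """Return the names in wanted that are not variables in the Stata file."""
    with pd.read_stata(path, iterator=True) as reader:
        # variable_labels() maps each variable name to its label
        available = set(reader.variable_labels())
    return sorted(set(wanted) - available)


# Toy file standing in for the GSS extract (hypothetical data)
toy = pd.DataFrame({"year": [1972, 1974], "fepol": [1.0, 2.0]})
toy_path = os.path.join(tempfile.mkdtemp(), "toy.dta")
toy.to_stata(toy_path, write_index=False)

missing = missing_columns(toy_path, ["year", "fepol", "fework"])
print(missing)
```

Run against the real file, an empty result would mean every requested column is present.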

# weights are different in 2021 and 2022 so mixing them in might seem like a bad idea,
# but we only use them for resampling within one year of the survey,
# so I think it's ok
gss["wtssall"] = gss["wtssall"].fillna(gss["wtssps"])
gss["wtssall"].describe()

count    72390.000000
mean         1.000014
std          0.550871
min          0.073972
25%          0.549300
50%          0.961700
75%          1.098500
max         14.272462
Name: wtssall, dtype: float64
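The comment above says the weights are only used for resampling within one year of the survey. As a rough illustration of what that step might look like (a sketch under assumptions, not the code used elsewhere in this project), the hypothetical helper `resample_by_year` draws a weighted bootstrap sample separately for each year, so weights from different years are never compared with each other.

```python
import numpy as np
import pandas as pd


def resample_by_year(df, weight_col="wtssall", year_col="year", seed=None):
    """Weighted bootstrap within each year, keeping each year's sample size."""
    rng = np.random.RandomState(seed)
    samples = [
        # Weights are normalized within the group by DataFrame.sample
        group.sample(n=len(group), replace=True, weights=weight_col, random_state=rng)
        for _, group in df.groupby(year_col)
    ]
    return pd.concat(samples, ignore_index=True)


# Toy data (hypothetical); the row with id 99 has zero weight
toy = pd.DataFrame(
    {
        "id": [1, 2, 99, 4, 5],
        "year": [2021, 2021, 2021, 2022, 2022],
        "wtssall": [0.5, 1.5, 0.0, 1.0, 2.0],
    }
)
resampled = resample_by_year(toy, seed=17)
print(resampled["year"].value_counts().sort_index())
```

Because sampling happens one year at a time, only the relative sizes of the weights within a year matter.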

del gss["wtssps"]

print(gss.shape)
gss.head()

(72390, 61)

	abany	abdefect	abhlth	abnomore	abpoor	abrape	absingle	acqntsex	age	attend	...	sexfreq	sexnow	sexornt	sexsex	sexsex5	spanking	srcbelt	trust	wtssall	year
0	NaN	1.0	1.0	1.0	1.0	1.0	1.0	NaN	23.0	2.0	...	NaN	NaN	NaN	NaN	NaN	NaN	3.0	3.0	0.4446	1972
1	NaN	1.0	1.0	2.0	2.0	1.0	1.0	NaN	70.0	7.0	...	NaN	NaN	NaN	NaN	NaN	NaN	3.0	1.0	0.8893	1972
2	NaN	1.0	1.0	1.0	1.0	1.0	1.0	NaN	48.0	4.0	...	NaN	NaN	NaN	NaN	NaN	NaN	3.0	2.0	0.8893	1972
3	NaN	2.0	1.0	2.0	1.0	1.0	1.0	NaN	27.0	0.0	...	NaN	NaN	NaN	NaN	NaN	NaN	3.0	2.0	0.8893	1972
4	NaN	1.0	1.0	1.0	1.0	1.0	1.0	NaN	61.0	0.0	...	NaN	NaN	NaN	NaN	NaN	NaN	3.0	2.0	0.8893	1972

5 rows × 61 columns

Inventory#

Here are the 10 fe variables and the text of the questions.

fechld

A. A working mother can establish just as warm and secure a relationship with her children as a mother who does not work.

fefam

D. It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.

fehelp

B. It is more important for a wife to help her husband’s career than to have one herself.

fehire

Because of past discrimination, employers should make special efforts to hire and promote qualified women.

fehome

Women should take care of running their homes and leave running the country up to men.

fejobaff

Some people say that because of past discrimination, women should be given preference in hiring and promotion. Others say that such preference in hiring and promotion of women is wrong because it discriminates against men. What about your opinion - are you for or against preferential hiring and promotion of women? IF FOR: Do you favor preference in hiring and promotion strongly or not strongly? IF AGAINST: Do you oppose preference in hiring and promotion strongly or not strongly?

fepol

A. Tell me if you agree or disagree with this statement: Most men are better suited emotionally for politics than are most women.

fepres

If your party nominated a woman for President, would you vote for her if she were qualified for the job?

fepresch

C. A preschool child is likely to suffer if his or her mother works.

fework

Do you approve or disapprove of a married woman earning money in business or industry if she has a husband capable of supporting her?

fe_columns = [x for x in gss.columns if x.startswith('fe')]
fe_columns

['fechld',
 'fefam',
 'fehelp',
 'fehire',
 'fehome',
 'fejobaff',
 'fepol',
 'fepres',
 'fepresch',
 'fework']

len(fe_columns)

from utils import decorate

grouped = gss.groupby('year')
intervals = pd.DataFrame(columns=['first', 'last', '# years'], dtype=int)

for column in fe_columns:
    plt.figure()
    counts = grouped[column].count()
    counts.plot.bar()
    nonzero = counts.replace(0, np.nan).dropna()
    n_years = len(nonzero)
    first, last = nonzero.index.min(), nonzero.index.max()
    intervals.loc[column] = first, last, n_years
    decorate()

_images/1a103f63767f222ea5b0ed1216134e8fa86b3db2d3bada5840444f4fc0888577.png

_images/19e32009f1d4b361ae7a938c92b1cba21f536380e2588489d6c41626716aee30.png

_images/001924647265d9ea9c50c0bca1333d4fa4ccf65100f2660a6838daaa1684f9c6.png

_images/02da431e246a662fc0df2bc19fbd4cbcc27af798ba1128b05a4d85f8142ba8e1.png

_images/87b6963ac61b6ff5cb27e9541fd9548f3a707c2fbb500f1de3e2f7524aead3db.png

_images/cc924047fc7d5201b69bcba4b10069aa3a00c3e8d165eb40ae5c4355731e262f.png

_images/72e4ba8193eba6bf28f67aee60402ac06d223da761a622e527b7a3f696726e2e.png

_images/a11663c956de63886004e0a8263cdaec0ef3904284b03c461e0485ac21eb6a20.png

_images/998d63e7785329f21f30e0bd9731d9382b01e5efd289b89a60da2e0c14fb3154.png

_images/8a4a5b6dc6246a80c3142b06f65e1429b3f634a6f5bf4011a8de46ca6146e1bc.png

intervals

	first	last	# years
fechld	1977	2022	23
fefam	1977	2022	23
fehelp	1977	1998	11
fehire	1996	2022	13
fehome	1974	1998	16
fejobaff	1996	2022	13
fepol	1974	2022	27
fepres	1972	2010	19
fepresch	1977	2022	23
fework	1972	1998	17
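The same summary can be computed without the plotting code. The sketch below assumes only that a question counts as "asked" in a year when at least one valid response was recorded. `survey_intervals` is a hypothetical helper, shown here on toy data.

```python
import numpy as np
import pandas as pd


def survey_intervals(df, columns, year_col="year"):
    """First year, last year, and number of years with valid responses."""
    # count() ignores NaN, so a year where the question was skipped gets 0
    counts = df.groupby(year_col)[columns].count()
    rows = {}
    for col in columns:
        asked = (counts[col] > 0).to_numpy()
        years = counts.index[asked]
        rows[col] = {
            "first": int(years.min()),
            "last": int(years.max()),
            "# years": len(years),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy data (hypothetical): "a" is skipped in 2002, "b" is skipped in 2000
toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2002, 2004],
        "a": [1.0, np.nan, np.nan, 2.0],
        "b": [np.nan, np.nan, 1.0, 1.0],
    }
)
toy_intervals = survey_intervals(toy, ["a", "b"])
print(toy_intervals)
```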

Responses#

Five of the ten variables are on a four-point scale, typically:

STRONGLY AGREE	
AGREE	
DISAGREE	
STRONGLY DISAGREE

fehire is on a five-point scale:

STRONGLY AGREE
AGREE	
NEITHER AGREE NOR DISAGREE	
DISAGREE	
STRONGLY DISAGREE

The others (fehome, fepol, fepres, and fework) are on a two-point scale, apart from one rarely used extra code in fepres.

from utils import values

for col in fe_columns:
    print(values(gss[col]))

fechld
1.0     9240
2.0    15202
3.0     8666
4.0     2342
NaN    36940
Name: count, dtype: int64
fefam
1.0     2810
2.0     9839
3.0    15198
4.0     7284
NaN    37259
Name: count, dtype: int64
fehelp
1.0      769
2.0     3769
3.0     7732
4.0     3041
NaN    57079
Name: count, dtype: int64
fehire
1.0     2817
2.0     5945
3.0     2048
4.0     2389
5.0      580
NaN    58611
Name: count, dtype: int64
fehome
1.0     5424
2.0    17114
NaN    49852
Name: count, dtype: int64
fejobaff
1.0     2299
2.0     1311
3.0     2906
4.0     3936
NaN    61938
Name: count, dtype: int64
fepol
1.0     9982
2.0    25715
NaN    36693
Name: count, dtype: int64
fepres
1.0    23257
2.0     3531
5.0        4
NaN    45598
Name: count, dtype: int64
fepresch
1.0     2817
2.0    11254
3.0    16303
4.0     4731
NaN    37285
Name: count, dtype: int64
fework
1.0    18753
2.0     5648
NaN    47989
Name: count, dtype: int64
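The `values` helper lives in `utils` and is not shown here. Judging from the output above, it behaves roughly like the sketch below, which counts each distinct response, including missing values, and sorts by response code. This is an assumption about its behavior, and `values_sketch` is a hypothetical stand-in.

```python
import numpy as np
import pandas as pd


def values_sketch(series):
    """Count each distinct value, including NaN, sorted by value."""
    return series.value_counts(dropna=False).sort_index()


# Toy responses (hypothetical data)
toy = pd.Series([1.0, 2.0, 2.0, np.nan, 5.0], name="fepres")
counts = values_sketch(toy)
print(counts)
```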

fepol and fehome: 1 agree, 2 disagree

fework: 1 approve, 2 disapprove

fepres: 1 yes, 2 no, 5 would not vote. Only 4 respondents gave 5, so let’s recode it as no.

gss['fepres'] = gss['fepres'].replace(5, 2)
values(gss['fepres'])

fepres
1.0    23257
2.0     3535
NaN    45598
Name: count, dtype: int64

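For each variable, I’ll select the responses that indicate agreement: “strongly agree” and “agree” on the agree/disagree scales, the two “for” responses for fejobaff, “yes” for fepres, and “approve” for fework.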

agree_responses = {
    'fechld': [1, 2],
    'fefam': [1, 2],
    'fehelp': [1, 2],
    'fehire': [1, 2],
    'fehome': [1],
    'fejobaff': [1, 2],
    'fepol': [1],
    'fepres': [1],
    'fepresch': [1, 2],
    'fework': [1],
}

from utils import plot_series_lowess

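def plot_series(data, column, color, title):
    """Plot the fraction of respondents per year who gave an agreeing response."""
    # Row-normalized crosstab: within each year, response fractions sum to 1
    xtab = pd.crosstab(data['year'], data[column], normalize='index')
    # Add up the fractions for the codes listed in agree_responses
    series = xtab[agree_responses[column]].sum(axis=1)
    plot_series_lowess(series, color=color, label=column)
    decorate(title=title)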
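To see what the crosstab step produces without any plotting, here is a small sketch on toy data. `fraction_agreeing` is a hypothetical helper that mirrors the first two lines of `plot_series`. It adds one guard the original does not have: it skips agreement codes that never occur in the subset, which would otherwise raise a `KeyError`.

```python
import numpy as np
import pandas as pd


def fraction_agreeing(data, column, agree_codes, year_col="year"):
    """Fraction of valid responses per year that fall in agree_codes."""
    # Rows with a missing response are dropped by crosstab
    xtab = pd.crosstab(data[year_col], data[column], normalize="index")
    present = [code for code in agree_codes if code in xtab.columns]
    return xtab[present].sum(axis=1)


# Toy data (hypothetical): one missing response in 2002
toy = pd.DataFrame(
    {
        "year": [2000] * 4 + [2002] * 4,
        "fechld": [1.0, 2.0, 3.0, 4.0, 1.0, 1.0, np.nan, 4.0],
    }
)
series = fraction_agreeing(toy, "fechld", [1.0, 2.0])
print(series)
```

In 2002 the denominator is the three valid responses, not all four rows, because missing values are excluded before normalizing.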

All respondents#

Note that these results have not yet been corrected for stratified sampling, so think of this as an inventory of the data, not inferences about the population.

The last two points of fehire look anomalous. I’ve seen things like this in the 2021 and 2022 data, and I’m not sure what the issue is.

for column in fe_columns:
    plt.figure()
    plot_series(gss, column, 'C2', 'All respondents')

_images/63764164469ab9b0fdf29a41b86ac485f056229406ad1a4f59b9d1c39ac1742b.png

_images/6202d4451f76a3d785e17e823afe64489fc605156881fdb25a038cce01d77dc3.png

_images/a0c3b27e31aed3f9c32160057b6691de343d97191aafcc512963a3178e34135b.png

_images/bc3bd352cab66571ba2d0c60e925e31ff9086b9481cf4cacb1e2ed62301768b7.png

_images/0eaa0332f691cc830be34fa567f7a5401362f45ba5c8de9a7f0e2beabf553553.png

_images/9714334efb0099dd3e9d23c11d06623db57cb2531a62dbde15b0d1dc35a29623.png

_images/586095b062d080f621299d0a7929e48c106152c6d8b3005e92709bdf1187ec6c.png

_images/250f3d736d7f921083fdbcea7e2f93d74b3d677011beb340b3fb9ad36a05074e.png

_images/7e99d873f572d40637bb402342be5c54d2f1953321cc7ab939f2cb35e2cbd012.png

_images/e9411b575fe6c78f7736218e157bfe9fbde119eb5818531a55371db4c021ed8e.png

Female respondents#

female = gss.query('sex == 2')
for column in fe_columns:
    plt.figure()
    plot_series(female, column, 'C1', title='Female respondents')

_images/816ee6a1f69ea000776511967de951a76ef6ac54b0d49ad9a8d593f6aaa909e2.png

_images/4e70e1dc74d6cd0f6fccf77e34b33a2f11eebdaf17e5b6e5ca5f7fd5067b164a.png

_images/fcd3c6bf079440b249a56ded609538741cd633cf7e5050f4436636b8a0123793.png

_images/237d10d20b8e8eb77f081245d421577874650a9e447623e1a7ed1dd4bb77f29a.png

_images/812cb860080b26a4fb879d0a8459e35b90adb0e457c8fd500583a6c298d11747.png

_images/7821ba3de836d4bd9e3f7840d522f16384cb81b38909a538ece5772f7b953080.png

_images/b701c11027810ed2aaa9e85ff804940af0956088cbb26d1b2f8e2dc5aff8db19.png

_images/7901a29a6568c3488d3f6e0e3ad866633260cf1ba8da2c4a4c39aa8462576f49.png

_images/b8bc23f56340aba940e46360c8e0aebaba51b330274c80a587a9717c79c939ba.png

_images/2ff25f1721c646bb366605c3161539aea813eb5259c1e4e3426e622038c5c2e2.png

Young females#

These series are noisier because the sample sizes are smaller.

young_female = female.query('age < 30')
for column in fe_columns:
    plt.figure()
    plot_series(young_female, column, 'C4', title='Female respondents age < 30')

_images/37c373ed677faa03104c097e7f5845386aae5260131cce4e07aee2cfb41dcb33.png

_images/738f3a6035d2b4985d1caf5481f7b0f23aeadfd497199c446b74ce3026a396f0.png

_images/deeef9e295e09cb28c82e8f33d46543a77020d2641070eb76780284112eee820.png

_images/08409c958b3865b127e963168ceb05d636992bb28e4cdc3399029a44de7f9c4b.png

_images/411f8249e36c8e0a5494fbb877e5c6fe54589232a60133db2ba45180354a8eec.png

_images/b4dd85b3f96c50dc79e303b200c0296095799ecfafe90ba7dce398d12abe48d8.png

_images/6bd8ae28c5f78dfe8f2800d3e7b788fb088e298db7557488ba33d793f17a73b7.png

_images/43163e239fb3a02184aacef53a0b54408797a10cebfdd97225836e8f57ff4362.png

_images/0f057fb801427330d0bf50c29579879e3a1288afed3342c3a286bc2a82bdec79.png

_images/b38b63f168e7fd93d2ad93d65237f88b16f3de045a6ade6781b702eca1e289dc.png
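The link between sample size and noise in the young female series can be made concrete with the standard error of a proportion. The sketch below assumes simple random sampling, so it ignores the survey weights and design. The helper names and the sample sizes of 1500 and 100 are illustrative, not taken from the data.

```python
import numpy as np
import pandas as pd


def valid_counts(df, column, year_col="year"):
    """Number of non-missing responses to column in each year."""
    return df.groupby(year_col)[column].count()


def proportion_standard_error(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return np.sqrt(p * (1 - p) / n)


# Illustrative sample sizes (hypothetical): a full-year sample and a subgroup
se_large = proportion_standard_error(0.5, 1500)
se_small = proportion_standard_error(0.5, 100)
print(se_large, se_small)

# Toy data (hypothetical) to show the per-year counts
toy = pd.DataFrame(
    {"year": [2000, 2000, 2002], "fepol": [1.0, np.nan, 2.0]}
)
toy_counts = valid_counts(toy, "fepol")
print(toy_counts)
```

Applying `valid_counts` to `young_female` and to `gss` for the same column would show how much smaller the per-year samples are.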

Male respondents#

There are no indications of recent reversals, except in fehire, which is wonky for everybody.

There is a strange pattern in fepol.

male = gss.query('sex == 1')
for column in fe_columns:
    plt.figure()
    plot_series(male, column, 'C0', title='Male respondents')

_images/60406ce35a1f9fc3428de3169194a8f6394dc99d3f2500ab8d8614de0d6396f5.png

_images/2919ccaed79b4fdc4d4402028c78e55e54e79daa62dcea9ce64526a6df8b89d1.png

_images/ea80eb75052137275ae91019dd50158dac18ae0e4121362bc6326ba0cdf7445e.png

_images/57f6c0dc934f5bad1b021872e38ed4edd1db490f334298ec0a5bfb8fce0d25af.png

_images/14c667065761bded732f3e1d6a452df55560aa2119a413304995176b17bd3a51.png

_images/a5d2719b7b6390128858af632e241ee6b7359507a8b553f343f484392990f48a.png

_images/a0f89c265837d7a3e2da4083387ec7825ec0ac17147a425122decd7b80677ab8.png

_images/0095c9de701e8c937b4f38207b010dd6a76c3c901514100894279f16072aed10.png

_images/ca0161217e1dfa95806bf9c2d26baecb3b21dcd000acf4f9b9c1080fded3fe9e.png

_images/090aa5889a4bfed8120060d36ed00aec3d4efa290f724d7f58644502a72d12ad.png

Young males#