Survey Data with Pandas and StatsModels#

Tutorial for PyCon 2025

Allen B. Downey

Click here to run this notebook on Colab.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Configure Matplotlib

plt.rcParams["figure.figsize"] = 7, 3.5
plt.rcParams["figure.dpi"] = 75
plt.rcParams["axes.titlelocation"] = "left"
plt.rcParams["axes.spines.top"] = False
plt.rcParams["axes.spines.bottom"] = False
plt.rcParams["axes.spines.left"] = False
plt.rcParams["axes.spines.right"] = False
from os.path import basename, exists

def download(url):
    """Download a file from a URL, if it's not already in the local directory."""
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
def decorate(**options):
    """Decorate the current axes.

    Call decorate with keyword arguments like
    decorate(title='Title',
             xlabel='x',
             ylabel='y')

    The keyword arguments can be any of the axis properties
    https://matplotlib.org/api/axes_api.html
    """
    legend = options.pop("legend", True)
    loc = options.pop("loc", "best")

    # Pass options to Axis.set
    ax = plt.gca()
    ax.set(**options)

    # Add a legend if there are any labeled elements
    handles, labels = ax.get_legend_handles_labels()
    if handles and legend:
        ax.legend(handles, labels, loc=loc)

    # Tight layout is generally a good idea
    plt.tight_layout()

Data#

We will use data from the General Social Survey (GSS). The raw dataset is big, so I’ve prepared an extract, which the following cell downloads. The code to generate the extract is in this notebook.

# This dataset is prepared in the GssExtract repository

DATA_PATH = "https://github.com/AllenDowney/GssExtract/raw/main/data/interim/"

filename = "gss_extract_pacs_workshop.hdf"
download(DATA_PATH + filename)
# Solution

gss = pd.read_hdf(filename, "gss")
gss.shape
(72390, 29)
# Solution

gss.head()
age attend ballot class cohort degree educ fair fear goodlife ... region relig rincome satfin satjob sex srcbelt trust wtssall year
0 24.0 1.0 NaN 3.0 1948.0 3.0 16.0 2.0 NaN NaN ... 1 1.0 NaN 1.0 NaN 1.0 5.0 1.0 1.3339 1972
1 33.0 2.0 NaN 2.0 1939.0 0.0 10.0 2.0 NaN NaN ... 1 2.0 NaN 3.0 2.0 1.0 1.0 1.0 0.8893 1972
2 40.0 8.0 NaN 2.0 1932.0 1.0 12.0 2.0 NaN NaN ... 1 5.0 NaN 1.0 NaN 2.0 3.0 2.0 1.3339 1972
3 24.0 2.0 NaN 2.0 1948.0 1.0 12.0 1.0 NaN NaN ... 9 2.0 NaN 2.0 NaN 1.0 3.0 2.0 1.7786 1972
4 21.0 7.0 NaN 3.0 1951.0 1.0 12.0 1.0 NaN NaN ... 9 1.0 NaN 1.0 NaN 2.0 2.0 NaN 1.7786 1972

5 rows × 29 columns

# Solution

gss.columns
Index(['age', 'attend', 'ballot', 'class', 'cohort', 'degree', 'educ', 'fair',
       'fear', 'goodlife', 'hapmar', 'happy', 'health', 'helpful', 'id',
       'life', 'partyid', 'polviews', 'race', 'region', 'relig', 'rincome',
       'satfin', 'satjob', 'sex', 'srcbelt', 'trust', 'wtssall', 'year'],
      dtype='object')
# Solution

gss.isna().mean().sort_values()
id          0.000000
year        0.000000
wtssall     0.000000
region      0.000000
srcbelt     0.000718
race        0.001796
sex         0.002086
degree      0.002942
educ        0.004338
cohort      0.005567
relig       0.007087
partyid     0.007294
attend      0.009711
age         0.010416
class       0.048501
satfin      0.063641
happy       0.065106
polviews    0.131620
health      0.236552
satjob      0.265064
ballot      0.302183
fear        0.365147
life        0.386669
rincome     0.402086
trust       0.410982
helpful     0.414090
fair        0.415665
hapmar      0.468241
goodlife    0.631924
dtype: float64

Distributions#

We’ll see two ways to describe distributions: value counts and KDE (kernel density estimation).

And for joint distributions, we’ll use a cross tabulation.

def value_counts(seq, **options):
    """Make a series of values and the number of times they appear.

    Args:
        seq: sequence

    returns: Pandas Series
    """
    return pd.Series(seq).value_counts(**options).sort_index()
# Solution

pmf = value_counts(gss['year'])
pmf.plot(kind='bar')
decorate()
[figure: bar chart of the number of respondents per year]
# Solution

x, y = pmf.index, pmf.values
plt.bar(x, y)
decorate()
[figure: the same bar chart, made with plt.bar]
# Solution

gss['age'].describe()
count    71636.000000
mean        44.985943
std         17.201666
min         18.000000
25%         30.000000
50%         43.000000
75%         58.000000
max         89.000000
Name: age, dtype: float64
# Solution

sns.kdeplot(gss['age'], cut=0)
decorate()
[figure: KDE of respondent ages]
# Solution

gss['cohort'].describe()
count    71987.000000
mean      1991.908095
std        561.717010
min       1883.000000
25%       1938.000000
50%       1954.000000
75%       1968.000000
max       9999.000000
Name: cohort, dtype: float64
# Solution

gss['cohort'] = gss['cohort'].replace(9999, np.nan)

Exercise#

Use value_counts to plot the distribution of cohort as a bar plot. Then use sns.kdeplot to show the same distribution as a continuous quantity.

# Solution

pmf = value_counts(gss['cohort'])
pmf.plot(kind='bar')
decorate(ylabel='Count')
[figure: bar chart of the cohort distribution]
# Solution

x, y = pmf.index, pmf.values
plt.bar(x, y)
decorate(ylabel='Count')
[figure: the same bar chart, made with plt.bar]
# Solution

sns.kdeplot(gss['cohort'], cut=0)
decorate()
[figure: KDE of birth cohorts]

Cross tabulation#

Cross tabulation is a way to represent the joint distribution of two variables.
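As a quick illustration (with made-up data, not from the GSS), pd.crosstab counts how many times each pair of values appears together:

# A toy example of crosstab (hypothetical data, not part of the tutorial)
df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': ['x', 'y', 'x', 'x', 'y']})
pd.crosstab(df['a'], df['b'])
# b  x  y
# a
# 1  1  1
# 2  2  1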

# Solution

xtab = pd.crosstab(gss['cohort'], gss['age'])
xtab
age 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 ... 80.0 81.0 82.0 83.0 84.0 85.0 86.0 87.0 88.0 89.0
cohort
1883.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1884.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 1
1885.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 5
1886.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 4 4
1887.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 2 1 0 1 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2000.0 36 0 0 71 77 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2001.0 0 0 68 63 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2002.0 0 59 69 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2003.0 7 70 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2004.0 25 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

122 rows × 72 columns

# Solution

sns.heatmap(
    xtab,
    cmap="YlGnBu",
)

plt.title("Number of Respondents by Birth Cohort and Age Group")
plt.xlabel("Age Group")
plt.ylabel("Birth Cohort")
decorate()
[figure: heatmap of respondent counts by birth cohort and age group]

Target variable#

I’ve selected four variables you might want to work with. Whichever one you choose, run its cell last: the most recent assignment to varname determines which variable the rest of the notebook uses.

# https://gssdataexplorer.norc.org/variables/434/vshow

varname = 'happy'

question = """Taken all together, how would you say things are these days--
would you say that you are very happy, pretty happy, or not too happy?
"""

responses = ['Very happy', 'Pretty happy', 'Not too happy']

ylabel = "Percent saying 'very happy'"
# https://gssdataexplorer.norc.org/variables/439/vshow

varname = 'helpful'

question = """Would you say that most of the time people try to be helpful,
or that they are mostly just looking out for themselves?
"""

responses = ['Helpful', 'Look out\nfor themselves', 'Depends']

ylabel = "Percent saying 'helpful'"
# https://gssdataexplorer.norc.org/variables/440/vshow

varname = 'fair'

question = """Do you think most people would try to take advantage of you 
if they got a chance, or would they try to be fair?
"""

# notice that the negative response is first here!
responses = ['Take advantage', 'Fair', 'Depends']

ylabel = "Percent saying 'fair'"
# https://gssdataexplorer.norc.org/variables/441/vshow

varname = 'trust'

question = """Generally speaking, would you say that most people can be trusted
or that you can't be too careful in dealing with people?
"""

responses = ['Trusted', "Can't be\ntoo careful", 'Depends']

ylabel = "Percent saying 'trusted'"
# Solution

column = gss[varname]
column
0        1.0
1        1.0
2        2.0
3        2.0
4        NaN
        ... 
72385    2.0
72386    2.0
72387    NaN
72388    NaN
72389    1.0
Name: trust, Length: 72390, dtype: float64
# Solution

value_counts(column, dropna=False)
trust
1.0    15783
2.0    24890
3.0     1966
NaN    29751
Name: count, dtype: int64
# Solution

pmf = value_counts(column, normalize=True)
pmf
trust
1.0    0.370154
2.0    0.583738
3.0    0.046108
Name: proportion, dtype: float64
# Solution

pmf.plot(kind='bar')
decorate(ylabel='PMF')
[figure: bar chart of the PMF]
# Solution

plt.bar(pmf.index, pmf)
plt.xticks(pmf.index, responses)

decorate(ylabel='PMF')
[figure: bar chart of the PMF with response labels]

Exercise#

Do you think people who say “trust” are also more likely to say “helpful”? To find out, make a cross tabulation of trust and helpful. Consider using normalize='index'. Make a heatmap that shows this joint distribution. What relationships can you see in the responses?

# Solution

xtab = pd.crosstab(gss['trust'], gss['helpful'], normalize='index')
xtab
helpful 1.0 2.0 3.0
trust
1.0 0.707773 0.243133 0.049094
2.0 0.340947 0.592560 0.066493
3.0 0.417197 0.283970 0.298832
# Solution

plt.figure(figsize=(12, 6))
sns.heatmap(
    xtab,
    cmap="YlGnBu",
)

decorate(xlabel='helpful', ylabel='trust')
[figure: heatmap of the distribution of helpful responses conditioned on trust]

Time Series#

To see how responses have changed over time, we’ll use cross tabulation again.

# Solution

xtab = pd.crosstab(gss["year"], gss[varname], normalize='index')
xtab.head()
trust 1.0 2.0 3.0
year
1972 0.463551 0.490343 0.046106
1973 0.483656 0.495664 0.020680
1975 0.391099 0.564396 0.044504
1976 0.455518 0.513043 0.031438
1978 0.396597 0.549738 0.053665
# Solution

gss72 = gss.query('year == 1972')
gss72.shape
(1613, 29)
# Solution

pmf72 = value_counts(gss72[varname], normalize=True)
pmf72
trust
1.0    0.463551
2.0    0.490343
3.0    0.046106
Name: proportion, dtype: float64
# Solution

time_series = xtab[1] * 100
time_series.name = "data"
# Solution

time_series.plot(style='o')
decorate(
    xlabel="Year of survey",
    ylabel=ylabel,
    title=question,
)
[figure: scatter plot of the time series]

Smoothing#

To smooth the time series, we’ll use LOWESS (locally weighted scatterplot smoothing), which fits a simple regression in the neighborhood of each point.

# Solution

x = time_series.index.to_numpy()
y = time_series.to_numpy()
# Solution

from statsmodels.nonparametric.smoothers_lowess import lowess

smooth_array = lowess(y, x)
# Solution

index, data = np.transpose(smooth_array)
smooth_series = pd.Series(data, index=index)
# Solution

smooth_series.plot()
decorate()
[figure: the smoothed time series]
def make_lowess(series, frac=0.5):
    """Use LOWESS to compute a smooth line.

    series: pd.Series
    frac: fraction of the data used to fit each local regression

    returns: pd.Series
    """
    y = series.to_numpy()
    x = series.index.to_numpy()

    smooth_array = lowess(y, x, frac)
    index, data = np.transpose(smooth_array)

    return pd.Series(data, index=index)

Exercise#

Use this function to make smooth_series. Plot the result along with time_series.

# Solution

smooth_series = make_lowess(time_series)
# Solution

time_series.plot(style='o', alpha=0.5)
smooth_series.plot(label='smoothed', color='C0')

decorate(
    xlabel="Year of survey",
    ylabel=ylabel,
    title=question,
)
[figure: the time series with the smoothed line]

Binary series#

It will be useful to convert the responses to a binary variable represented with 0s and 1s.

# Solution

# Tempting but wrong: `== 1` maps NaN to False, so missing
# responses are silently counted as negative

gss['y'] = gss[varname] == 1
value_counts(gss['y'], dropna=False)
y
False    56607
True     15783
Name: count, dtype: int64
# Solution

gss['y'] = np.where(gss[varname].notna(), gss[varname] == 1, np.nan)
value_counts(gss['y'], dropna=False)
y
0.0    26856
1.0    15783
NaN    29751
Name: count, dtype: int64
# Solution

# Reminder: this needs to be updated depending on the target variable

gss['y'] = gss[varname].replace([1, 2, 3], [1, 0, 0])
value_counts(gss['y'], dropna=False)
y
0.0    26856
1.0    15783
NaN    29751
Name: count, dtype: int64

Pivot tables#

Now we’ll use a pivot table to put respondents in groups and see how the responses have changed over time.

polviews_map = {
    1: 'Liberal',
    2: 'Liberal',
    3: 'Liberal',
    4: 'Moderate',
    5: 'Conservative',
    6: 'Conservative',
    7: 'Conservative',
}
# Solution

gss['polviews3'] = gss['polviews'].replace(polviews_map)
# Solution

value_counts(gss['polviews3'], dropna=False)
polviews3
Conservative    21580
Liberal         17370
Moderate        23912
NaN              9528
Name: count, dtype: int64
# Solution

table = gss.pivot_table(
    index='year', columns='polviews3', values='y', aggfunc='mean'
) * 100
table.head()
polviews3 Conservative Liberal Moderate
year
1975 41.891892 39.192399 38.733706
1976 49.590164 45.961003 44.561404
1978 42.699115 43.262411 37.347295
1980 47.609562 45.896657 46.204620
1983 41.935484 31.914894 39.240506
# Solution

table.plot(style='.:')
decorate()
[figure: time series for the three political groups]

Better colors#

We’ll select colors from a Seaborn color palette.

# Solution

muted = sns.color_palette('muted', 5)
sns.palplot(muted)
[figure: swatches of the muted color palette]
# Solution

color_map = {
    'Conservative': muted[3], 
    'Moderate': muted[4], 
    'Liberal': muted[0]
}

Exercise#

Loop through the groups in color_map. For each one, extract a column from table and plot the data points; then plot a smooth line.

# Solution

for group in color_map:
    series = table[group]
    color = color_map[group]
    series.plot(style='o', label='', color=color, alpha=0.3)
    smooth = make_lowess(series)
    smooth.plot(label=group, color=color)

decorate(
    xlabel='Year',
    ylabel=ylabel,
    title=question,
)
[figure: data points and smoothed lines for each political group]

Groupby#

Now that we have y as a binary variable, we have another way to compute the time series and the pivot table: group the valid responses and take the mean of y. Each method has pros and cons; for example, crosstab can normalize counts directly, while groupby extends naturally to other aggregation functions.

# Solution

valid = gss.dropna(subset=['y'])
# Solution

time_series2 = valid.groupby('year')['y'].mean() * 100
assert np.allclose(time_series, time_series2)
# Solution

table2 = valid.groupby(['year', 'polviews3'])['y'].mean().unstack() * 100
assert np.allclose(table, table2)

Logistic regression#

We’ll use logistic regression to model changes over time.

# Solution

# Center the years near zero so the quadratic fit is numerically well behaved
year_shift = gss['year'].median()
gss['x'] = gss['year'] - year_shift
# Solution

subset = gss.dropna(subset=['y', 'x'])
# Solution

import statsmodels.formula.api as smf

formula = 'y ~ x + I(x**2)'

model = smf.logit(formula, data=subset).fit(disp=False)
model.params
Intercept   -0.588941
x           -0.014565
I(x ** 2)    0.000057
dtype: float64
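As a sanity check (this cell is not part of the original notebook), we can convert the fitted parameters to a probability by hand. Logistic regression models the log odds of a positive response, and expit, the inverse of the logit function, converts log odds back to a probability. The year 2000 here is an arbitrary example.

# Compute the log odds for an example year by hand, then convert to a probability
from scipy.special import expit

x0 = 2000 - year_shift
log_odds = (model.params['Intercept']
            + model.params['x'] * x0
            + model.params['I(x ** 2)'] * x0 ** 2)
expit(log_odds)  # should match model.predict for the same x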
# Solution

year_range = np.arange(1972, 2026)

pred_df = pd.DataFrame(dict(x=year_range - year_shift))

pred_series = model.predict(pred_df) * 100
pred_series.index = year_range
# Solution

time_series.plot(style='o', alpha=0.5)
pred_series.plot(label='model', color='C0')

decorate(
    xlabel="Year of survey",
    ylabel=ylabel,
    title=question,
)
[figure: the time series with the fitted model]
def fit_model(data, x_range, x_shift):
    """Fit a logistic model of y as a function of x and generate predictions.

    data: DataFrame with columns x and y
    x_range: sequence of unshifted values to predict for
    x_shift: amount to subtract from x_range to align with x

    returns: pd.Series of predicted percentages
    """
    formula = 'y ~ x + I(x**2)'
    model = smf.logit(formula, data=data).fit(disp=False)

    pred_df = pd.DataFrame(dict(x=x_range - x_shift))
    pred = model.predict(pred_df) * 100
    pred.index = x_range
    pred.name = 'model'

    return pred
# Solution

pred_series2 = fit_model(subset, year_range, year_shift)
assert np.allclose(pred_series, pred_series2)

Exercise#

Use subset.groupby('polviews3') to loop through the groups. For each one, use fit_model to generate and plot a “prediction” for the group.

# Solution

for name, group in subset.groupby('polviews3'):
    pred_series = fit_model(group, year_range, year_shift)
    pred_series.plot(label=name, color=color_map[name])
    
decorate(
    xlabel='Year',
    ylabel=ylabel,
    title=question,
)
[figure: model predictions for each political group]

Group by cohort#

Now let’s group respondents by decade of birth and see how each cohort changes over time.

def round_into_bins(series, bin_width, low=0, high=None):
    """Rounds values down to the bin they belong in.

    series: pd.Series
    bin_width: number, width of the bins
    low: low end of the first bin (default 0)
    high: high end of the last bin (defaults to the series maximum)

    returns: Series of bin values (with NaN preserved)
    """
    if high is None:
        high = series.max()

    bins = np.arange(low, high + bin_width, bin_width)
    indices = np.digitize(series, bins)
    result = pd.Series(bins[indices - 1], index=series.index, dtype='float')

    # np.digitize maps NaN into the last bin, so restore the NaNs
    result[series.isna()] = np.nan
    return result
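As a quick check (not part of the original notebook), here is how the function rounds values down to the start of their bin and preserves missing values:

# Hypothetical example: 1883 falls in the 1880s bin, 1947 in the 1940s bin
round_into_bins(pd.Series([1883, 1947, np.nan]), 10, 1880)
# 0    1880.0
# 1    1940.0
# 2       NaN
# dtype: float64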
# Solution

gss["cohort10"] = round_into_bins(gss["cohort"], 10, 1880)
value_counts(gss["cohort10"], dropna=False)
cohort10
1880.0       45
1890.0      501
1900.0     1722
1910.0     3616
1920.0     5862
1930.0     7115
1940.0    10925
1950.0    14273
1960.0    11699
1970.0     7681
1980.0     5088
1990.0     2563
2000.0      545
NaN         755
Name: count, dtype: int64
# Solution

cohort_df = gss.query("cohort10 == 1940").dropna(subset=['y', 'x'])
cohort_df.shape
(6825, 33)
# Solution

cohort_series = cohort_df.groupby('year')['y'].mean() * 100
cohort_series.name = '1940 cohort'
# Solution

x_range = cohort_series.index
pred_series = fit_model(cohort_df, x_range, year_shift)
# Solution

cohort_series.plot(style='.', color='C1', alpha=0.5)
pred_series.plot(color='C1')

decorate(
    xlabel="Year",
    ylabel=ylabel,
    title=question,
)
[figure: the 1940s cohort with its fitted model]

All the Cohorts#

Now let’s compare cohorts. The following function encapsulates the steps from the previous section.

def plot_cohort(df, cohort, color='C0'):
    """Fit a model to one birth cohort and plot its predictions.

    df: DataFrame
    cohort: value of cohort10 that identifies the cohort
    color: string or color tuple passed to Matplotlib
    """
    cohort_df = df.query("cohort10 == @cohort").dropna(subset=['y', 'x'])

    cohort_series = cohort_df.groupby('year')['y'].mean() * 100

    year_range = cohort_series.index

    pred_series = fit_model(cohort_df, year_range, year_shift)
    x, y = pred_series.index[-1], pred_series.values[-1]
    plt.text(x, y, cohort, ha='left', va='center')

    pred_series.plot(label=cohort, color=color)
# Solution

first, last = 1920, 2000
subset = gss.query("@first <= cohort <= @last" )
subset.shape
(65390, 33)
# Solution

pred_all = fit_model(subset, year_range, year_shift)
# Solution

cohorts = np.sort(subset['cohort10'].unique().astype(int))
cohorts
array([1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000])
# Solution

cmap = plt.get_cmap('Purples')

colors = [cmap(x) for x in np.linspace(0.3, 0.9, len(cohorts))]

Exercise#

Plot pred_all. Then loop through cohorts and colors, and call plot_cohort for each group.

Depending on which target variable you chose, you might get a warning from StatsModels.

# Solution

pred_all.plot(ls=':', color='C1')

for cohort, color in zip(cohorts, colors):
    plot_cohort(gss, cohort, color)
    
decorate(
    xlabel="Year",
    ylabel=ylabel,
    title=question,
    legend=False
)
/home/downey/miniconda3/envs/SurveyDataPandas/lib/python3.13/site-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
[figure: model predictions for each cohort, with the overall model as a dotted line]

Simpson’s Paradox#

The composition of the population changes over time, so the trend in the whole population can differ from the trends within the cohorts: an example of Simpson’s paradox.
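To see why that matters, here is a toy example (with made-up numbers, not GSS data): even if the percentage within each group never changes, the overall percentage can fall as the mix of groups shifts.

# Hypothetical within-group percentages that stay constant over time
group_pcts = np.array([60, 30])

# Made-up group shares early and late in the survey period
mix_early = np.array([0.8, 0.2])
mix_late = np.array([0.2, 0.8])

print(group_pcts @ mix_early)  # 54.0
print(group_pcts @ mix_late)   # 36.0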

# Solution

xtab_cohort = pd.crosstab(gss["cohort10"], gss["year"], normalize="columns")
xtab_cohort.head()
year 1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 ... 2004 2006 2008 2010 2012 2014 2016 2018 2021 2022
cohort10
1880.0 0.003113 0.003995 0.008108 0.006739 0.004692 0.000000 0.003277 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
1890.0 0.047945 0.038615 0.029730 0.037062 0.047587 0.027613 0.028834 0.017088 0.018929 0.005013 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
1900.0 0.110212 0.099201 0.129730 0.082210 0.070375 0.091387 0.083224 0.059467 0.051379 0.038221 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
1910.0 0.152553 0.158455 0.131081 0.130728 0.144772 0.125575 0.105505 0.124402 0.103840 0.093358 ... 0.015714 0.007562 0.007944 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
1920.0 0.224782 0.180426 0.193243 0.187332 0.159517 0.184747 0.152687 0.160629 0.149270 0.137218 ... 0.038929 0.040036 0.038232 0.035294 0.025381 0.014235 0.009807 0.008576 0.0 0.0

5 rows × 34 columns

# Solution

xtab_cohort[1972].plot(label="1972")
xtab_cohort[1998].plot(label="1998")
xtab_cohort[2022].plot(label="2022")

decorate(xlabel="Year of birth", ylabel="Fraction of sample")
[figure: cohort composition of the sample in 1972, 1998, and 2022]

Sampling weights#

The GSS uses stratified sampling, so some respondents represent more people in the population than others. We can correct for this by resampling the data with the sampling weights.

# Solution

gss['wtssall'].describe()
count    72390.000000
mean         1.303960
std          0.889990
min          0.113721
25%          0.918400
50%          1.062700
75%          1.547800
max         14.272462
Name: wtssall, dtype: float64
# Solution

gss.groupby('year')['wtssall'].describe()
count mean std min 25% 50% 75% max
year
1972 1613.0 1.146481 0.486191 0.444600 0.889300 0.889300 1.333900 3.557100
1973 1504.0 1.130645 0.417802 0.457200 0.914500 0.914500 1.371700 2.743500
1974 1484.0 1.160032 0.443681 0.466000 0.932000 0.932000 1.397900 3.727900
1975 1490.0 1.140258 0.466167 0.466000 0.932000 0.932000 1.397900 3.261900
1976 1499.0 1.147392 0.459938 0.485400 0.970900 0.970900 1.456300 3.398100
1977 1530.0 1.161231 0.458057 0.494100 0.988100 0.988100 1.482200 2.964400
1978 1532.0 1.148797 0.448502 0.506300 1.012700 1.012700 1.519000 3.038000
1980 1468.0 1.168663 0.486058 0.511200 1.022500 1.022500 1.533700 3.578700
1982 1860.0 1.189170 0.513264 0.506400 1.031900 1.031900 1.547800 4.127500
1983 1599.0 1.174535 0.489286 0.505800 1.011600 1.011600 1.517500 3.540700
1984 1473.0 1.163876 0.473032 0.516500 1.033100 1.033100 1.549600 3.615700
1985 1534.0 1.158889 0.463941 0.518100 1.036300 1.036300 1.554400 3.626900
1986 1470.0 1.164447 0.467082 0.503600 1.007200 1.007200 1.510800 3.021600
1987 1819.0 1.186371 0.499286 0.505200 1.010400 1.010400 1.515500 3.536200
1988 1481.0 1.163860 0.463531 0.531300 1.062700 1.062700 1.062700 3.719400
1989 1537.0 1.184300 0.504330 0.510700 1.021500 1.021500 1.532200 3.575100
1990 1372.0 1.161439 0.489095 0.531300 1.062700 1.062700 1.062700 3.188100
1991 1517.0 1.174147 0.478842 0.531100 1.062100 1.062100 1.593200 3.717500
1993 1606.0 1.169310 0.448682 0.528500 1.057100 1.057100 1.585600 3.699800
1994 2992.0 1.166058 0.457253 0.541200 1.082500 1.082500 1.082500 3.247500
1996 2904.0 1.181024 0.479229 0.543200 1.086400 1.086400 1.086400 3.259100
1998 2832.0 1.177809 0.462729 0.550100 1.100100 1.100100 1.100100 3.850400
2000 2817.0 1.165950 0.453726 0.549300 1.098500 1.098500 1.098500 3.295600
2002 2765.0 1.198505 0.507470 0.557800 1.115600 1.115600 1.115600 2.788900
2004 2812.0 1.299424 0.762699 0.459200 0.918400 0.961700 1.836800 6.428700
2006 4510.0 1.360152 0.841060 0.429700 0.859300 0.954800 1.909600 5.728800
2008 2023.0 1.425750 0.911863 0.437745 0.875490 1.067670 2.135340 6.406021
2010 2044.0 1.313222 0.726111 0.463128 0.926255 0.926255 1.684101 4.210252
2012 1974.0 1.366495 0.884637 0.411898 0.823796 1.235694 1.747975 8.739876
2014 2538.0 1.273430 0.674193 0.448002 0.896003 1.344005 1.380591 5.376020
2016 2867.0 1.227412 0.607020 0.391825 0.956994 0.956994 1.564363 4.306471
2018 2348.0 1.382192 0.936160 0.471499 0.942997 0.942997 1.414496 5.897420
2021 4032.0 1.551637 1.175611 0.227357 0.790829 1.195842 1.869697 7.557038
2022 3544.0 2.551821 2.489079 0.113721 0.714314 1.781121 3.529911 14.272462
def resample_by_year(df, column='wtssall'):
    """Resample rows within each year using weighted sampling.

    df: DataFrame
    column: string name of weight variable

    returns: DataFrame
    """
    grouped = df.groupby('year')
    samples = [group.sample(n=len(group), replace=True, weights=group[column])
               for _, group in grouped]
    sample = pd.concat(samples, ignore_index=True)
    return sample
# Solution

sample = resample_by_year(gss)
# Solution

sample['y'].mean()
np.float64(0.36038778568752344)
# Solution

gss['y'].mean()
np.float64(0.3701540842890312)
# Solution

weighted_mean = np.average(valid['y'], weights=valid['wtssall'])
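As a rough check (not part of the original notebook), the weighted mean should be close to the mean of the resampled data, although they won't match exactly because the resampling is done separately within each year.

# Compare the weighted mean to the mean of the resampled data
print(weighted_mean)
print(sample['y'].mean())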

Copyright 2025 Allen Downey

The code in this notebook is under the MIT license.