Technical Report: Bayesian Cohort-Period Model for GSS Data

Introduction

This project develops a hierarchical Bayesian model for analyzing period-cohort effects in General Social Survey (GSS) data. The model addresses limitations of previous approaches that used ad hoc smoothing and binned cohorts, providing a principled framework for estimating how attitudes and behaviors change across birth cohorts and over time.

Problem Statement

Previous analyses of GSS data using period-cohort decomposition faced several challenges:

Proposed Solution

We develop a hierarchical Bayesian model that makes smoothing assumptions explicit and principled:

Model Specification

Basic Structure

For binary response data collapsed to cells (cohort × nominal year):

The cell count y(c,t) is Binomial with success probability p(c,t). The linear predictor on the logit scale is additive:

η(c,t) = α + f(c) + g(t) + ε(t)

where α is the baseline intercept, f(c) is a smooth cohort effect, g(t) is a smooth period effect, and ε(t) ~ N(0, σ_short) is extra-Binomial observation noise per nominal year (overdispersion), capturing year-level residual variance beyond the additive cohort+period structure. The probability is p(c,t) = logistic(η(c,t)).
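As a minimal sketch of how the additive predictor maps to cell probabilities, the NumPy snippet below evaluates η(c,t) and p(c,t) on a toy grid. The function name `cell_probabilities`, the grid sizes, and all numeric values are illustrative assumptions, not quantities from the fitted model.

```python
import numpy as np

def cell_probabilities(alpha, f, g, eps):
    """Return p[c, t] = logistic(alpha + f[c] + g[t] + eps[t])."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    eps = np.asarray(eps, dtype=float)
    # Cohort effects vary down the rows; period effects and the
    # per-year noise term vary across the columns.
    eta = alpha + f[:, None] + (g + eps)[None, :]
    return 1.0 / (1.0 + np.exp(-eta))

# Toy values (assumed for illustration only).
alpha = 0.5
f = np.array([-0.2, 0.0, 0.2])    # 3 cohorts
g = np.array([-0.1, 0.1])         # 2 nominal years
eps = np.array([0.05, -0.05])     # one noise term per nominal year

p = cell_probabilities(alpha, f, g, eps)
print(p.shape)
```

Because ε is indexed by nominal year only, every cohort observed in the same year shares the same shift on the logit scale.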

RW2 Prior

A Gaussian Random Walk of order 2 (RW2) encourages locally linear trajectories:

Δ²f_c ~ N(0, σ_f)

This means the second differences are Gaussian with standard deviation σ_f:

f_c - 2f_{c-1} + f_{c-2} ~ N(0, σ_f)
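To make the second-difference definition concrete, the following NumPy sketch draws second differences, integrates them twice, and checks that differencing recovers them. For simplicity it fixes the first two values at zero; K, sigma_f, the seed, and the added trend are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8            # number of cohorts (toy value)
sigma_f = 0.1    # smoothing scale (toy value)

# Draw K - 2 second differences, then integrate twice.
delta2 = rng.normal(0.0, sigma_f, size=K - 2)
f = np.concatenate([np.zeros(2), np.cumsum(np.cumsum(delta2))])

# Second differences of f recover delta2 exactly.
second_diff = f[2:] - 2.0 * f[1:-1] + f[:-2]

# Adding an intercept plus a linear trend leaves the second differences
# unchanged, so the RW2 prior carries no information about level or slope.
k = np.arange(K)
shifted = f + 3.0 + 0.5 * k
second_diff_shifted = np.diff(shifted, n=2)

print(np.allclose(second_diff, delta2))
print(np.allclose(second_diff_shifted, delta2))
```

The second check illustrates the two-dimensional null space of the RW2 penalty: level and slope have to be pinned down by other means, such as initial values or centering.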

Why RW2 over RW1:

Overdispersion

The model includes extra-Binomial observation noise on the logit scale: ε_t ~ N(0, σ_short) per nominal year, so p = logistic(η + ε_t). This captures year-level residual variance (news cycles, wording drift, context effects) that the single additive trajectory cannot explain. Prior: σ_short ~ Exponential with mean 0.2 (rate 5). Indexing ε by nominal year only (one ε per year, not per cell) keeps the parameter count small.
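Exponential priors can be specified by rate or by mean, and libraries differ: PyMC's `pm.Exponential` takes the rate, while NumPy's sampler takes the scale (the mean). The short check below, a sketch with an arbitrary seed and sample size, confirms that a mean of 0.2 corresponds to a rate of 5.

```python
import numpy as np

mean_sigma_short = 0.2
rate_sigma_short = 1.0 / mean_sigma_short   # rate that corresponds to mean 0.2

rng = np.random.default_rng(42)
# NumPy parameterizes the exponential by its scale (the mean), not the rate.
draws = rng.exponential(scale=mean_sigma_short, size=200_000)

print(rate_sigma_short)
print(draws.mean())
```

The same conversion applies to the smoothing scales: a rate of 8.0 corresponds to a prior mean of 0.125.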

Boundary Noise Handling

RW priors naturally handle sparse cells:

Implementation

PyMC Implementation

PyMC does not have built-in RW2, so we implement it manually using second differences and double cumulative sum:

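import pymc as pm
import pytensor.tensor as pt

# Fragment: runs inside a `with pm.Model(coords=...)` block that defines "year_idx".
# K (number of cohorts), cells (cell-level DataFrame), and eta (additive
# predictor per cell, before overdispersion) are defined elsewhere.

# Smoothing scale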
sigma_rw2 = pm.Exponential("sigma_rw2", 8.0)  # mean = 0.125

# Second differences
delta2 = pm.Normal(
    "delta2",
    mu=0.0,
    sigma=sigma_rw2,
    shape=K - 2
)

# Reconstruct latent function (double cumsum)
# Free init: f_init ~ N(0,1) for level/slope at boundary (recommended)
# Legacy: pt.zeros(2) fixes first two at zero
f_init = pm.Normal("f_init", 0, 1, shape=2)
f_raw = pt.concatenate([f_init, pt.cumsum(pt.cumsum(delta2))])

# Center for identifiability (removes intercept redundancy)
f = f_raw - pt.mean(f_raw)

# Overdispersion: extra-Binomial noise per nominal year
sigma_short = pm.Exponential("sigma_short", 1.0 / 0.2)
epsilon = pm.Normal("epsilon", 0, sigma_short, dims="year_idx")
epsilon_cells = epsilon[cells["year_idx"].values]
p_obs = pm.math.sigmoid(eta + epsilon_cells)

Key Implementation Details

  1. Null Space Handling: RW2 has a 2D null space (intercept + linear trend). We use free initial values (f_init ~ N(0,1)) at the boundary, then center for identifiability. We do not remove the linear trend; detrending in addition to centering would over-constrain the model. Free init produces more interpretable cohort trajectories than pinning the first two values at zero.

  2. Overdispersion: Extra-Binomial noise ε_t ~ N(0, σ_short) per nominal year captures year-level residual variance. Prior: σ_short ~ Exponential with mean 0.2 (rate 5). Time-only parameterization keeps the model parsimonious.

  3. Survey Weights: We use weighted_mean (Kish effective sample size) to compute cell proportions and n_eff, incorporating GSS survey weights for population-level inference.

  4. Prior Choice: Exponential priors with mean 0.125 (rate=8.0) for the smoothing scales balance sampling geometry against flexibility.

  5. Single-Year Cohorts: We use single-year birth cohorts rather than bins, letting the RW2 prior provide the smoothing. This avoids arbitrary discontinuities at bin edges.
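The survey-weight step in item 3 can be sketched with the standard Kish formula, n_eff = (Σw)² / Σw². The function below is a hypothetical stand-in, not the project's `weighted_mean`; its name, signature, and the toy inputs are assumptions.

```python
import numpy as np

def weighted_cell_summary(y, w):
    """Weighted proportion and Kish effective sample size for one cell.

    y: 0/1 responses; w: positive survey weights of the same length.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    p_hat = np.sum(w * y) / np.sum(w)          # weighted proportion
    n_eff = np.sum(w) ** 2 / np.sum(w ** 2)    # Kish effective sample size
    return p_hat, n_eff

# Toy cell with four respondents (values assumed for illustration).
p_hat, n_eff = weighted_cell_summary([1, 0, 1, 1], [1.0, 2.0, 1.0, 0.5])
print(p_hat, n_eff)
```

With equal weights n_eff equals the raw cell size; unequal weights reduce it, which widens the Binomial uncertainty for that cell.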

Model Development and Testing

Version History

We tested multiple model versions to optimize performance:

Version 10 (current recommended version):

Model Comparison: Version 9 vs Version 10

We compared the yearly grid (Version 9) and two-year cadence (Version 10) approaches:

Version 9 (Yearly Grid):

Version 10 (Two-Year Cadence):

Trajectory Comparison:

Data Preparation

GSS Data

We use data from the General Social Survey (GSS), which has been conducted since 1972. The current analysis uses data through 2024.

Data Structure:

Variable Selection: We focus on variables that:

  1. Are binary or easily binarizable

  2. Have been asked frequently over many years

  3. Have clear substantive interpretation

  4. Represent different domains (politics, social issues, etc.)

Current variables analyzed:

Survey Year Mapping (Version 10)

To implement the two-year cadence, we map actual survey years to nominal years:

Model Interpretation

Regularized Cell-Means Estimator

This model is best understood as a regularized cell-means estimator over cohort × period. Where data are dense, estimates are effectively descriptive; where data are sparse or missing, RW2 smoothing provides principled interpolation.

Key interpretive framing:

Visualization Strategy

Recommended plots:

  1. Heatmap of E[p_{c,t}]: Full latent surface showing cohort × year support

  2. Cohort trajectories: Aggregated to decade-of-birth bins for readability

  3. Coverage/exposure map: Sample sizes by cohort × year to show sparsity
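Plot 2 in the list above bins single-year cohorts by decade of birth. One way this aggregation might look is sketched below, assuming the estimated surface and the cell sample sizes are available as arrays. The function name `aggregate_to_decades` and the choice to weight by cell sample size are assumptions, not taken from the project code.

```python
import numpy as np

def aggregate_to_decades(cohorts, p_surface, n_cells):
    """Average a cohort x year surface into decade-of-birth rows.

    cohorts:   birth years, shape (C,)
    p_surface: estimated probabilities, shape (C, T)
    n_cells:   cell sample sizes used as weights, shape (C, T)
    """
    cohorts = np.asarray(cohorts)
    p_surface = np.asarray(p_surface, dtype=float)
    n_cells = np.asarray(n_cells, dtype=float)
    decades = (cohorts // 10) * 10
    labels = np.unique(decades)
    out = np.full((labels.size, p_surface.shape[1]), np.nan)
    for i, d in enumerate(labels):
        rows = decades == d
        w = n_cells[rows].sum(axis=0)
        num = (p_surface[rows] * n_cells[rows]).sum(axis=0)
        # Leave NaN where a decade has no respondents in a given year.
        np.divide(num, w, out=out[i], where=w > 0)
    return labels, out

# Toy inputs (assumed for illustration only).
cohorts = [1948, 1949, 1950, 1951]
p_surface = [[0.2, 0.4], [0.4, 0.4], [0.6, 0.5], [0.8, 0.7]]
n_cells = [[10, 0], [30, 0], [20, 10], [20, 30]]
labels, decade_means = aggregate_to_decades(cohorts, p_surface, n_cells)
print(labels)
print(decade_means)
```

Leaving unsupported decade-year combinations as NaN keeps the trajectory plot from drawing lines through regions with no data.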

Results and Validation

Posterior Predictive Checks

The model exhibits well-calibrated uncertainty and appropriate shrinkage:

Boundary Cohort Handling

For sparsely observed boundary cohorts:

Convergence Diagnostics

Version 10 diagnostics (for cappun variable):

Batch Processing Results Summary

We ran Version 10 on 15 GSS variables to assess model performance across different substantive domains. The following tables summarize key results:

Table 1: Data Characteristics and Convergence Diagnostics

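| Variable | Observations | Cells | Max R-hat | Min ESS Bulk | Min ESS Tail | Divergences |
|---|---|---|---|---|---|---|
| cappun | 61,471 | 1,654 | 1.01 | 174 | 314 | 0 |
| homosex | 43,635 | 1,543 | 1.03 | 190 | 531 | 0 |
| racopen | 33,773 | 1,157 | 1.62 | 7 | 11 | 1,000 |
| fepol | 36,026 | 1,364 | 1.03 | 128 | 317 | 0 |
| abany | 40,685 | 1,430 | 1.03 | 128 | 447 | 0 |
| premarsx | 44,491 | 1,534 | 1.01 | 216 | 395 | 0 |
| prayer | 35,261 | 1,388 | 1.01 | 272 | 382 | 0 |
| natfare | 39,749 | 1,486 | 1.02 | 188 | 447 | 0 |
| grass | 38,884 | 1,412 | 1.03 | 178 | 477 | 0 |
| natenvir | 39,730 | 1,482 | 1.04 | 119 | 315 | 0 |
| divlaw | 38,837 | 1,395 | 1.01 | 172 | 237 | 0 |
| letdie1 | 35,993 | 1,326 | 1.01 | 180 | 449 | 0 |
| gunlaw | 49,029 | 1,610 | 1.08 | 41 | 74 | 0 |
| sexeduc | 41,885 | 1,487 | 1.06 | 115 | 234 | 0 |
| pornlaw | 45,120 | 1,499 | 1.01 | 249 | 247 | 0 |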

Note: racopen shows severe convergence issues (R-hat 1.62, 25% divergences).

Table 2: Posterior Estimates for Smoothing Scales

| Variable | σ_c (mean ± sd) | σ_t (mean ± sd) | Sampling Time (min) |
|---|---|---|---|
| cappun | 0.010 ± 0.003 | 0.233 ± 0.045 | 0.0* |
| homosex | 0.060 ± 0.016 | 0.305 ± 0.056 | 0.0* |
| racopen | 0.105 ± 0.153 | 2.191 ± 1.231 | 0.0* |
| fepol | 0.013 ± 0.005 | 2.360 ± 0.245 | 1.2 |
| abany | 0.017 ± 0.005 | 0.146 ± 0.036 | 1.3 |
| premarsx | 0.020 ± 0.006 | 2.932 ± 0.278 | 1.7 |
| prayer | 0.028 ± 0.007 | 0.108 ± 0.046 | 1.5 |
| natfare | 0.010 ± 0.003 | 0.295 ± 0.049 | 1.1 |
| grass | 0.029 ± 0.008 | 3.267 ± 0.310 | 1.8 |
| natenvir | 0.021 ± 0.006 | 0.239 ± 0.043 | 1.5 |
| divlaw | 0.012 ± 0.004 | 1.296 ± 0.167 | 1.2 |
| letdie1 | 0.012 ± 0.004 | 0.069 ± 0.031 | 0.8 |
| gunlaw | 0.002 ± 0.001 | 0.114 ± 0.036 | 0.7 |
| sexeduc | 0.019 ± 0.007 | 1.400 ± 0.225 | 1.3 |
| pornlaw | 0.047 ± 0.012 | 0.093 ± 0.047 | 2.7 |

Note: * indicates results loaded from cache (sampling time not recorded).

Key Findings:

  1. Convergence: Fourteen of 15 variables show acceptable convergence (max R-hat between 1.01 and 1.08). One variable (racopen) shows severe convergence problems:

    • racopen: R-hat 1.62, ESS 7-11, 1,000/4,000 divergences (25%)

    • Analysis indicates this is due to 5 missing nominal years in the period grid (21 observed out of 26), creating poor sampling geometry

    • gunlaw (R-hat 1.08) and sexeduc (R-hat 1.06) show marginal convergence that may benefit from longer sampling

  2. Effective Sample Size: Min ESS bulk ranges from 7 (racopen) to 272 (prayer). In Table 1, 13 of 15 variables have min ESS bulk above 100, but none reaches the ideal threshold of 400. The lowest ESS values are associated with convergence problems (racopen, gunlaw).

  3. Smoothing Scales - Cohort Effects (σ_c):

    • Consistently small across all variables (0.002-0.105)

    • Indicates smooth cohort trajectories across all substantive domains

    • racopen shows elevated uncertainty (0.105 ± 0.153) due to convergence issues

  4. Smoothing Scales - Period Effects (σ_t):

    • Shows substantial variation (0.069-3.267), reflecting different temporal dynamics:

      • Low period variation (σ_t from 0.07 to 0.11): letdie1 (0.069), gunlaw (0.114), prayer (0.108), pornlaw (0.093) - relatively stable over time

      • Moderate period variation (σ_t from 0.15 to 1.40): cappun (0.233), abany (0.146), natfare (0.295), natenvir (0.239), homosex (0.305), divlaw (1.296), sexeduc (1.400) - moderate temporal change

      • High period variation (σ_t above 2.0): fepol (2.360), premarsx (2.932), grass (3.267), racopen (2.191) - substantial temporal shifts

    • Variables with high period variation tend to be social issues that have undergone major attitude shifts over the survey period

  5. Sampling Efficiency: Sampling times range from 0.7 to 2.7 minutes per variable for full runs, with most completing in 1-2 minutes. The model demonstrates good computational efficiency across all variables.

  6. Divergences: 14 of 15 variables show zero divergences, indicating good sampling geometry and appropriate prior specification. Only racopen shows significant divergences (25%), consistent with its convergence problems.
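The convergence screening described in these findings can be reproduced from Table 1 with a few lines of Python. The R-hat cutoff of 1.05 is an assumed threshold chosen for illustration; the diagnostics are copied from the table, and the total of 4,000 draws comes from the racopen bullet above.

```python
# (max R-hat, divergences) per variable, copied from Table 1.
diagnostics = {
    "cappun": (1.01, 0), "homosex": (1.03, 0), "racopen": (1.62, 1000),
    "fepol": (1.03, 0), "abany": (1.03, 0), "premarsx": (1.01, 0),
    "prayer": (1.01, 0), "natfare": (1.02, 0), "grass": (1.03, 0),
    "natenvir": (1.04, 0), "divlaw": (1.01, 0), "letdie1": (1.01, 0),
    "gunlaw": (1.08, 0), "sexeduc": (1.06, 0), "pornlaw": (1.01, 0),
}

RHAT_THRESHOLD = 1.05   # assumed cutoff, not specified in the report
N_DRAWS = 4000          # total draws, as stated for racopen

def flag_variables(diag, rhat_threshold=RHAT_THRESHOLD):
    """Return variables with max R-hat above the cutoff or any divergences."""
    return [
        name
        for name, (rhat, divergences) in diag.items()
        if rhat > rhat_threshold or divergences > 0
    ]

flagged = flag_variables(diagnostics)
racopen_divergence_rate = diagnostics["racopen"][1] / N_DRAWS

print(flagged)
print(racopen_divergence_rate)
```

Under this cutoff the flagged set is racopen, gunlaw, and sexeduc, the same three variables singled out in the convergence finding.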

Future Directions

Potential Extensions

  1. Ordered Logistic Model: For multi-category responses (e.g., 4-point agree/disagree scales)

  2. Age Component: Explicit age effects if needed (currently absorbed in cohort-period structure)

  3. 2D Lexis Surface: Tensor spline approach if strong cohort-period interactions found

  4. Survey Weights: weighted_mean (n_eff) is implemented; bootstrap resampling available for sensitivity analysis

Current Limitations

  1. Binary Outcomes: Most variables are binarized, losing ordering information

  2. No Age Component: Age-related variation is reflected in cell probabilities but not separately parameterized

  3. Additive Structure: Cohort and period effects are additive; interactions not currently modeled

Conclusion

The hierarchical Bayesian cohort-period model with RW2 priors provides a principled, defensible replacement for ad hoc smoothing approaches. The model:

Version 10 (two-year cadence) is the recommended configuration, offering better convergence and structural alignment than the yearly grid approach while maintaining similar predictive accuracy.