Are You Normal? Hint: No.#


# Install empiricaldist if we don't already have it

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist
# download utils.py

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
        
download("https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/notebooks/utils.py")

What does it mean to be normal? And what does it mean to be weird? I think there are two factors that underlie our intuition for these ideas:

  • “Normal” and “weird” are related to the idea of average. If by some measurement you are close to the average, you are normal; if you are far from the average, you are weird.

  • “Normal” and “weird” are also related to the idea of rarity. If some ability or characteristic of yours is common, it is normal; if it’s rare, it’s weird.

Intuitively, most people think that these things go together; that is, we expect measurements close to the average to be common, and measurements far from average to be rare.

For many things this intuition is valid. For example, the average height of adults in the United States is about 170 cm. Most people are close to this average: about 64% of adults are within 10 cm of it, and 93% are within 20 cm. And few people are far from average: only 1% of the population is shorter than 145 cm or taller than 195 cm.

The numbers in the previous paragraph come from the Behavioral Risk Factor Surveillance System (BRFSS). Here is the notebook I used to download and clean the data.

I generated a sample from the dataset, which we can download like this:

download("https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/data/brfss_sample.hdf")

But what is true when we consider a single characteristic turns out to be partly false when we consider a few characteristics at once, and spectacularly false when we consider more than a few. In fact, when we consider the many ways people are different, we find that

  • People near the average are rare or non-existent,

  • Everyone is far from average, and

  • Everyone is roughly the same distance from average.

At least in a mathematical sense, no one is normal, everyone is weird, and everyone is the same amount of weird.

Present… Arms#

Here’s the ANSUR data, originally downloaded from The OPEN Design Lab.

download("https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/data/ANSURIIFEMALEPublic.csv")
download("https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/data/ANSURIIMALEPublic.csv")

Since heights are in mm, I converted to cm.
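The loading and conversion cell is hidden in this rendering; here is a minimal sketch of it, assuming the height column is named stature (consistent with the measurement names that appear later in this notebook).

import pandas as pd

# Load the ANSUR II files; latin-1 handles non-ASCII characters that
# appear in some columns (an assumption about these files)
male = pd.read_csv("ANSURIIMALEPublic.csv", encoding="latin-1")
female = pd.read_csv("ANSURIIFEMALEPublic.csv", encoding="latin-1")

# heights are recorded in mm; convert to cm
male_height = male["stature"] / 10
female_height = female["stature"] / 10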

The following function plots histograms using plt.stairs.
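That cell is also hidden; here is a sketch of such a function, assuming it bins the data with np.histogram and draws the outline with plt.stairs.

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(values, bins=30, **options):
    """Plot a histogram as a step outline."""
    counts, edges = np.histogram(values, bins=bins)
    plt.stairs(counts, edges, **options)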

Here are the histograms of male and female heights.


The following cell shows a way to calculate a Gaussian curve – for clarity I used base 10 (instead of e) and no parameters.
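Here is a sketch of that computation; the exact grid of x values is a guess, but base 10 makes the curve easy to check by hand: it is 1 at x=0, 1/10 at x=1, and 1/10,000 at x=2.

import numpy as np
import matplotlib.pyplot as plt

# a bare-bones bell curve: 10 raised to the power of minus x squared
xs = np.linspace(-3, 3, 101)
ys = 10 ** -(xs ** 2)
plt.plot(xs, ys)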


The following function plots a Gaussian curve with the same mean and std as the data.

The scale parameter is arbitrary – I chose it to be roughly the same height as the histogram, which is also arbitrary. Avoiding this arbitrariness is one reason I prefer to use CDFs to compare the distribution of data with a model.
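A sketch of such a function, assuming it uses scipy.stats.norm; the scale argument is the arbitrary vertical scaling just described.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def plot_model(values, scale=1, **options):
    """Plot a Gaussian curve with the same mean and std as values."""
    mu, sigma = np.mean(values), np.std(values)
    xs = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 101)
    ys = norm.pdf(xs, mu, sigma) * scale
    plt.plot(xs, ys, **options)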


Why?#

The following function generates simulated values that are the sum of a number of random factors, shifted and scaled so they have the same mean and std as the given data.
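A sketch of such a function; the number and distribution of the random factors are assumptions (uniform factors are enough to produce a roughly Gaussian sum).

import numpy as np

def simulate_sum(data, n_factors=10):
    """Generate len(data) values, each the sum of n_factors random
    factors, shifted and scaled to match the mean and std of data."""
    sample = np.random.random((len(data), n_factors)).sum(axis=1)
    standardized = (sample - sample.mean()) / sample.std()
    return standardized * np.std(data) + np.mean(data)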

The result shows that values generated this way can resemble the actual data.


Comparing Distributions#

Here are the examples in the book.

And here’s a figure that shows the CDFs and the highlighted examples.


The following figure shows the CDFs of the data compared to the CDF of a Gaussian model chosen to fit the data.

How Gaussian Is It?#

Here are the names of the measurements.

To find the measurements that fit a Gaussian model the best and the worst, I use the following function, which computes the maximum vertical distance between the CDF of the data and the model.
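This maximum vertical distance is the Kolmogorov-Smirnov statistic. Here is a sketch of such a function, using Cdf from empiricaldist and a Gaussian model from scipy.stats; the hidden cell may differ in the details.

import numpy as np
from scipy.stats import norm
from empiricaldist import Cdf

def max_distance(values):
    """Max vertical distance between the data's CDF and a
    Gaussian CDF with the same mean and std."""
    cdf = Cdf.from_seq(values)
    model = norm.cdf(cdf.qs, np.mean(values), np.std(values))
    return np.max(np.abs(cdf.ps - model))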

The following function computes this distance for all of the measurements.

Now we can make a list of results for all measurements, male and female.
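A sketch of that step, assuming measurements is a list of the column names and male and female are the DataFrames loaded earlier (those names are assumptions).

def compute_distances(df, columns, label):
    """Compute the max distance for each measurement column."""
    return [(max_distance(df[col].dropna()), col, label)
            for col in columns]

# sort by distance, smallest first
results = sorted(compute_distances(male, measurements, "male") +
                 compute_distances(female, measurements, "female"))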

Here are the measurements where the distance between the data and the model is smallest.

And here are the measurements where it’s biggest.

Most of the best-fitting measurements are male, and most of the worst-fitting are female. That’s because the sample size is smaller for the female participants, which means there is more deviation from the model due to chance.

Here’s the measurement that fits the Gaussian model the best.


The measurement with the largest distance is interpupillarybreadth. But based on the figure below, that appears to be an artifact caused by a single anomalous measurement. In retrospect, I could have computed distances in a way that would not be thrown off by this anomaly.


What I reported in the book is radialestylionlength, which is second on the list and seems to be legitimately the measurement that is the worst fit for the model.


The Myth of the “Average Man”#

Here are the measurements from ANSUR that are the closest to the measurements in Daniels’s report.

The following table shows the names of these measurements, the mean and standard deviation of the values (in cm), the low and high ends of the range considered “approximately average”, and the percentage of survey participants who fall in the range.

Measurement             Mean    Std Dev   Low     High    % in range
Stature (height)        175.6     6.9     173.6   177.7      23.2
Chest Circumference     105.9     8.7     103.2   108.5      22.9
Sleeve Length            89.6     4.0      88.4    90.8      23.1
Crotch Height            84.6     4.6      83.2    86.0      22.1
Vertical Trunk Circ.    166.5     9.0     163.8   169.2      24.2
Hip Breadth Sitting      37.9     3.0      37.0    38.8      24.8
Neck Circumference       39.8     2.6      39.0    40.5      25.2
Waist Circumference      94.1    11.2      90.7    97.4      22.1
Thigh Circumference      62.5     5.8      60.8    64.3      24.9
Crotch Length            35.6     2.9      34.7    36.5      22.1

Here are the measurements where there are the biggest differences between the Daniels dataset and the ANSUR dataset (you can ignore the two very large differences, which are the result of using measurements that are defined differently).

The other differences suggest that members of the Army and Marines today are bigger than members of the Air Force were in 1950.

The following cells replicate the analysis in Daniels, computing the number of people who are close to the average in each measurement, considered in succession.
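A sketch of that analysis; judging from the table above, the “approximately average” range is within 0.3 standard deviations of the mean, so that is the default here. With k=0.4 the same function reproduces the more generous range used for the women below.

def count_survivors(df, columns, k=0.3):
    """Apply each measurement as a hurdle in succession and report
    how many people stay within k std of the mean on all of them."""
    selected = df
    for col in columns:
        mu, sigma = df[col].mean(), df[col].std()
        in_range = ((selected[col] >= mu - k * sigma) &
                    (selected[col] <= mu + k * sigma))
        selected = selected[in_range]
        print(col, len(selected))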

The following table shows the number of people who make it past each of the “hurdles”.

The same is true if you design for the average woman. The ANSUR dataset includes fewer women than men: 1986 compared to 4086. Because the sample is smaller, we can use a more generous definition of “approximately average”, including anyone within 0.4 standard deviations of the mean.

Even so, we find only 2 women who make it past the first eight hurdles, 1 who makes it past nine, and none who make it past all ten.

The Big Five#

The Big Five data is originally from the Open-Source Psychometrics Project.

Here is the notebook I used to download and clean the data.

I did some preliminary cleaning and put the result in an HDF file.

The survey consists of 50 questions, with 10 questions intended to measure each of the five personality traits. People respond on a five-point scale from “Strongly Disagree” to “Strongly Agree”. I scored the responses like this:

  • “Strongly agree” scores 2 points,

  • “Agree” scores 1 point,

  • “Neutral” scores 0 points,

  • “Disagree” scores -1 point, and

  • “Strongly disagree” scores -2 points.

For some questions, the scale is reversed; for example, if someone strongly agrees that they are “quiet around strangers,” that counts as -2 points on the extroversion score.
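A sketch of the scoring, assuming the raw responses are coded 1 through 5, from “Strongly Disagree” to “Strongly Agree” (an assumption about the data file).

def score(response, reverse=False):
    """Map a response coded 1-5 to a score from -2 to 2,
    flipping the sign for reverse-scored questions."""
    points = response - 3
    return -points if reverse else points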

Since there are 10 questions for each trait, the maximum score is 20 and the minimum score is -20. For each of the five traits, the following figure shows the distributions of total scores for more than 800,000 respondents.


The figures on the left show histograms; the figures on the right show CDFs. In both figures, the shaded line is a Gaussian distribution I chose to fit the data. The Gaussian model fits the first three distributions well (extroversion, emotional stability, and conscientiousness) except in the extreme tails.

Let’s see what happens if we apply Daniels’s analysis to the Big Five data. The following table shows the mean and standard deviation of the five scores, the range of values we’ll consider “approximately average”, and the percentage of the sample that falls in that range.

Trait                  Mean   Std Dev   Low    High   % in range
Extroversion           -0.4     9.1     -3.1    2.3      23.4
Emotional stability    -0.7     8.6     -3.2    1.9      20.9
Conscientiousness       3.7     7.4      1.4    5.9      20.2
Agreeableness           7.7     7.3      5.5    9.9      21.1
Openness                8.5     5.2      7.0   10.1      28.3

For each trait, the “average” range contains 20-28% of the population. Now if we treat each trait as a hurdle and select people who are close to average on each one, the following table shows the results.

Trait                  Counts    Percentages
Extroversion           204077       23.4
Emotional stability     46988        5.4
Conscientiousness       10976        1.3
Agreeableness            2981        0.3
Openness                  926        0.1

The first column shows the number of people who make it past each hurdle; the second column shows the percentages.

Of the 873,173 people we started with, about 204,000 are close to the average in extroversion. Of those, about 47,000 are close to the average in emotional stability. And so on, until we find 926 who are close to average on all five traits, which is barely one person in a thousand.

We Are All Equally Weird#

Using the Big Five data again, I counted the number of traits where each respondent falls outside the range we defined as “approximately average”. We can think of the result as a kind of “weirdness score”, where 5 means they are far from average on all five traits, and 0 means they are far from average on none.
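A sketch of that computation, assuming scores is a DataFrame with one column per trait and using the same 0.3 standard deviation range as above.

def weirdness(scores, k=0.3):
    """For each respondent, count the traits that fall outside
    the range within k std of the mean."""
    total = 0
    for col in scores.columns:
        mu, sigma = scores[col].mean(), scores[col].std()
        outside = ((scores[col] < mu - k * sigma) |
                   (scores[col] > mu + k * sigma))
        total = total + outside.astype(int)
    return total  # a Series: one weirdness score per respondent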

The following figure shows the distribution of these scores for the roughly 800,000 people who completed the Big Five survey.


As we’ve already seen, very few people are close to average on all five traits. Almost everyone is weird in two or more ways, and the majority (68%) are weird in four or five ways!

The distribution of weirdness is similar with physical traits. Using the 93 measurements in the ANSUR dataset, we can count the number of ways each participant deviates from average. The following figure shows the distribution of these counts for the male ANSUR participants.


Nearly everyone in this dataset is “weird” in at least 40 ways, and 90% of them are weird in at least 57 ways. With enough measurements, being weird is normal.

Now I’ll use the ANSUR measurements to compute all possible ratios of two measurements. With 93 measurements and 4278 ratios, there are a total of 4371 ways to be weird. The following figure shows the distribution of weirdness scores for the male participants.
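A sketch of the ratio computation, assuming male is the DataFrame loaded earlier and measurements lists the 93 column names (both names are assumptions).

from itertools import combinations
import pandas as pd

# all pairwise ratios: 93 choose 2 = 4278 new columns
ratios = pd.DataFrame({f"{a}/{b}": male[a] / male[b]
                       for a, b in combinations(measurements, 2)})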


With these measurements, all participants fall in a relatively narrow range of weirdness. The most “normal” participant deviates from average in 2446 ways; the weirdest in 4038 ways.

Copyright 2022 Allen Downey

The code in this notebook and utils.py is under the MIT license.