{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Probability" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "Think Bayes, Second Edition\n", "\n", "Copyright 2020 Allen B. Downey\n", "\n", "License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The foundation of Bayesian statistics is Bayes's Theorem, and the foundation of Bayes's Theorem is conditional probability. \n", "\n", "In this chapter, we'll start with conditional probability, derive Bayes's Theorem, and demonstrate it using a real dataset. In the next chapter, we'll use Bayes's Theorem to solve problems related to conditional probability. In the chapters that follow, we'll make the transition from Bayes's Theorem to Bayesian statistics, and I'll explain the difference." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linda the Banker\n", "\n", "To introduce conditional probability, I'll use an example from a [famous experiment by Tversky and Kahneman](https://en.wikipedia.org/wiki/Conjunction_fallacy), who posed the following question:\n", "\n", "> Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?\n", "> 1. Linda is a bank teller.\n", "> 2. Linda is a bank teller and is active in the feminist movement.\n", "\n", "Many people choose the second answer, presumably because it seems more consistent with the description. It seems uncharacteristic if Linda is *just* a bank teller; it seems more consistent if she is also a feminist." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But the second answer cannot be \"more probable\", as the question asks. 
Suppose we find 1000 people who fit Linda's description and 10 of them work as bank tellers. How many of them are also feminists? At most, all 10 of them are; in that case, the two options are *equally* probable. If fewer than 10 are, the second option is *less* probable. But there is no way the second option can be *more* probable.\n", "\n", "If you were inclined to choose the second option, you are in good company. The biologist [Stephen J. Gould wrote](https://doi.org/10.1080/09332480.1989.10554932):\n", "\n", "> I am particularly fond of this example because I know that the [second] statement is least probable, yet a little [homunculus](https://en.wikipedia.org/wiki/Homunculus_argument) in my head continues to jump up and down, shouting at me, \"but she can't just be a bank teller; read the description.\"\n", "\n", "If the little person in your head is still unhappy, maybe this chapter will help." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Probability\n", "\n", "At this point I should provide a definition of \"probability\", but that [turns out to be surprisingly difficult](https://en.wikipedia.org/wiki/Probability_interpretations). To avoid getting stuck before we start, we will use a simple definition for now and refine it later: A **probability** is a fraction of a finite set.\n", "\n", "For example, if we survey 1000 people, and 20 of them are bank tellers, the fraction that work as bank tellers is 0.02 or 2\\%. If we choose a person from this population at random, the probability that they are a bank teller is 2\\%.\n", "By \"at random\" I mean that every person in the dataset has the same chance of being chosen.\n", "\n", "With this definition and an appropriate dataset, we can compute probabilities by counting.\n", "To demonstrate, I'll use data from the [General Social Survey](http://gss.norc.org/) (GSS). 
" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "The following cell downloads the data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2021-04-16T19:35:08.628591Z", "iopub.status.busy": "2021-04-16T19:35:08.627766Z", "iopub.status.idle": "2021-04-16T19:35:08.630508Z", "shell.execute_reply": "2021-04-16T19:35:08.629988Z" }, "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Download the data file if it is not already present\n", "\n", "from os.path import basename, exists\n", "\n", "def download(url):\n", "    filename = basename(url)\n", "    if not exists(filename):\n", "        from urllib.request import urlretrieve\n", "        local, _ = urlretrieve(url, filename)\n", "        print('Downloaded ' + local)\n", "    \n", "download('https://github.com/AllenDowney/ThinkBayes2/raw/master/data/gss_bayes.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'll use Pandas to read the data and store it in a `DataFrame`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2021-04-16T19:35:08.634488Z", "iopub.status.busy": "2021-04-16T19:35:08.633720Z", "iopub.status.idle": "2021-04-16T19:35:09.081049Z", "shell.execute_reply": "2021-04-16T19:35:09.081454Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "|   | caseid | year | age | sex | polviews | partyid | indus10 |\n",
"|---|---|---|---|---|---|---|---|\n",
"| 0 | 1 | 1974 | 21.0 | 1 | 4.0 | 2.0 | 4970.0 |\n",
"| 1 | 2 | 1974 | 41.0 | 1 | 5.0 | 0.0 | 9160.0 |\n",
"| 2 | 5 | 1974 | 58.0 | 2 | 6.0 | 1.0 | 2670.0 |\n",
"| 3 | 6 | 1974 | 30.0 | 1 | 5.0 | 4.0 | 6870.0 |\n",
"| 4 | 7 | 1974 | 48.0 | 1 | 5.0 | 4.0 | 7860.0 |\n", "