%load_ext autoreload
%autoreload 2

The Long Tail of Disaster#

Click here to run this notebook on Colab.

# Install empiricaldist if we don't already have it

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist
# download utils.py

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
        
download("https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/notebooks/utils.py")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from utils import decorate, set_pyplot_params

set_pyplot_params()

# Set the random seed so we get the same results every time
np.random.seed(17)
from empiricaldist import Surv
from utils import plot_error_bounds


def plot_two(surv, surv_model, n, hatch=None, title=""):
    """Plots tail distributions on linear-y and log-y scales
    
    surv: Surv
    surv_model: Surv
    n: sample size passed to plot_error_bounds
    hatch: string passed to fill_between
    title: string title
    
    Uses global variables model_label, ylabel, xlabel, xticks, yticks, xlabels, ylabels
    """
    fig, axs = plt.subplots(2, 1, figsize=(6, 5), sharex=True)

    plt.sca(axs[0])
    surv_model.plot(color="gray", alpha=0.4, label=model_label)
    if n < 1000:
        plot_error_bounds(Surv(surv_model), n, color="gray")
    surv.plot(color="C1", ls="--", label="Data")
    decorate(ylabel=ylabel, title=title)

    plt.sca(axs[1])
    surv_model.plot(color="gray", alpha=0.4, label=model_label)
    plot_error_bounds(Surv(surv_model), n, color="gray", hatch=hatch)
    surv.plot(color="C2", ls="--", label="Data")
    decorate(xlabel=xlabel, ylabel=ylabel, yscale="log")
    plt.xticks(xticks, xlabels)
    plt.yticks(yticks, ylabels)

    plt.tight_layout()
    return axs

You would think we’d be better prepared for disaster. But events like Hurricane Katrina in 2005, which caused catastrophic flooding in New Orleans, and Hurricane Maria in 2017, which caused damage in Puerto Rico that has still not been repaired, show that large-scale disaster response is often inadequate. Even wealthy countries – with large government agencies that respond to emergencies and well-funded organizations that provide disaster relief – have been caught unprepared time and again.

There are many reasons for these failures, but one of them is that rare, large events are fundamentally hard to comprehend. Because they are rare, it is hard to get the data we need to estimate precisely how rare. And because they are large, they challenge our ability to imagine quantities that are orders of magnitude bigger than what we experience in ordinary life.

My goal in this chapter is to present the tools we need to comprehend the small probabilities of large events so that, maybe, we will be better prepared next time.

The Distribution of Disaster#

Natural and human-caused disasters are highly variable. The worst disasters, in terms of both lives lost and property destroyed, are thousands of times bigger than smaller, more common disasters.

That fact might not be surprising, but there is a pattern to these costs that is less obvious. One way to see this pattern is to plot the magnitudes of the biggest disasters on a logarithmic scale. To demonstrate, I’ll use the estimated costs from a list of 125 disasters on Wikipedia.

# I archived a copy of https://en.wikipedia.org/wiki/List_of_disasters_by_cost
# on June 3, 2022

DATA_PATH = "https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/data/"

filename = "List_of_disasters_by_cost.html"
download(DATA_PATH + filename)
# If you want a more recent list, try this, but it might not work if the
# page has been reorganized.

# download("https://en.wikipedia.org/wiki/List_of_disasters_by_cost")
tables = pd.read_html(filename)
disasters = tables[0]
print(len(disasters))
disasters.head()
125
Cost ($billions) Fatalities Event Type Year Nation
Unnamed: 0_level_1 (2021)[5] Fatalities Event Type Year Nation
0 [2] NaN 4000 - 200000[6] Chernobyl disaster [1] Contamination (Radioactive) 1986 Soviet Union (, )
1 $360[7][8][4] $423.7 (2021) 19747 2011 Tōhoku earthquake and tsunami Earthquake, Tsunami, Contamination (Radioactive) 2011 Japan
2 $197[9] $344.7 (2021) NaN Great Hanshin earthquake Earthquake 1995 Japan
3 $148[10] $179.7 (2021) 87587 2008 Sichuan earthquake Earthquake 2008 China
4 $125[11] $167.4 (2021) NaN Hurricane Katrina Tropical cyclone 2005 United States

Most of the disasters on the list are natural, including 56 tropical cyclones and other windstorms, 16 earthquakes, 8 wildfires, 8 floods, and 6 tornadoes.

disasters["Type", "Type"].value_counts()
(Type, Type)
Tropical cyclone                                    49
Earthquake                                          16
Wildfire                                             8
Flood                                                8
European windstorm                                   7
Tornado                                              6
Space flight accident                                3
Drought                                              3
Hailstorm                                            3
Structure failure                                    2
Contamination (Oil)                                  2
Terror attack                                        2
Volcano                                              2
Winter storm                                         2
Explosion                                            2
Contamination (Radioactive)                          1
Contamination (Radiation)                            1
Maritime disaster                                    1
Severe Thunderstorm                                  1
Drought, Wildfire                                    1
Severe storm                                         1
Earthquake, Tsunami, Contamination (Radioactive)     1
Earthquake, Tsunami                                  1
Winter storm, power outage                           1
Fire                                                 1
Name: count, dtype: int64

Some of the costs are reported as a range, so I compute the midpoints.

costs = disasters["Cost ($billions)", "(2021)[5]"].str.split()
costs[0] = 774.9
costs[11] = (72.2 + 120.4) / 2
costs[13] = (56.9 + 64.4) / 2
costs[25] = (32.2 + 199.1) / 2
costs[35] = (27.1 + 36.1) / 2
costs[44] = (18.3 + 29.8) / 2
costs[49] = (10.9 + 16.4) / 2
costs[62] = (9.8 + 17.0) / 2
costs[111] = (1.8 + 2.8) / 2
costs[119] = (4.1 + 5.6) / 2
disasters[costs.isna()]
Cost ($billions) Fatalities Event Type Year Nation
Unnamed: 0_level_1 (2021)[5] Fatalities Event Type Year Nation

Remove the dollar signs and convert to float.

xs = []
for cost in costs:
    try:
        x = float(cost[0].strip("$"))
    except:
        x = cost
    assert type(x) is float
    xs.append(x)

xs.sort(reverse=True)

Get the magnitudes and ranks.

mags = np.log10(xs)
n = len(mags)
ranks = np.arange(1, n + 1)

The following figure shows a “rank-size plot” of the data: the x-axis shows the ranks of the disasters from 1 to 125, where 1 is the most costly; the y-axis shows their costs in billions of U.S. dollars.

plt.plot(ranks, xs)
decorate(
    xlabel="Rank (1 to 125)",
    ylabel="Cost (billions of 2021 dollars)",
    xscale="log",
    yscale="log",
    title="Rank-size plot for cost of disasters",
)
xticks = [1, 2, 5, 10, 20, 50, 100]
yticks = [1, 10, 100, 1000]
plt.xticks(xticks, xticks)
plt.yticks(yticks, yticks)
plt.savefig("longtail1.png", dpi=600)

With both axes on a log scale, the rank-size plot is a nearly straight line, at least for the top 100 disasters. And the slope of this line is close to -1, which implies that each time we double the rank, we cut the cost in half. That is, compared to the costliest disaster, the cost of the second-worst is one half; the cost of the fourth-worst is one quarter; the cost of the eighth-worst is one eighth, and so on.
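To spell out that reasoning (notation mine): on log-log axes, a straight line with slope -1 means

\[ \log_{10} x_r = c - \log_{10} r, \quad \text{that is,} \quad x_r \propto \frac{1}{r}, \]

where \(x_r\) is the cost of the disaster at rank \(r\), so doubling the rank halves the cost.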

index = np.array([1, 2, 4, 8, 16, 32, 64]) - 1
steps = 10 ** mags[index]
steps /= steps[0]
1 / steps
array([ 1.        ,  1.82888836,  4.31218698,  6.70038911, 12.77658697,
       31.62857143, 74.50961538])

To visualize disaster sizes, I will show the “tail distribution”, which is the complement of the cumulative distribution function (CDF). The following figure shows the tail distribution of the 125 costliest disasters, on a log scale, along with a lognormal model.
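The tail distributions in this chapter are computed with make_surv from utils.py. Here is a minimal sketch of the idea, not the actual implementation:

# Minimal sketch of an empirical tail (survival) function, assuming distinct
# values; make_surv (from utils.py) plays this role and may differ in details.
import numpy as np
import pandas as pd

def empirical_tail(values):
    """Return a Series mapping each observed value x to Prob(X >= x)."""
    qs = np.sort(np.asarray(values))
    n = len(qs)
    ps = 1 - np.arange(n) / n   # fraction of observations >= each value
    return pd.Series(ps, index=qs)

Evaluated at the largest magnitude, this gives 1/n, which matches the last row of the survival table shown below (0.008 = 1/125).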

from utils import make_surv, fit_truncated_normal

surv = make_surv(mags)
mu, sigma = fit_truncated_normal(surv)
[1.07816053 0.57668442]
[1.07816055 0.57668442]
[1.07816053 0.57668444]
[1.04133203 0.54825645]
[1.04133204 0.54825645]
[1.04133203 0.54825647]
[1.04293263 0.54534355]
[1.04293265 0.54534355]
[1.04293263 0.54534356]
[1.04292536 0.54517372]
[1.04292537 0.54517372]
[1.04292536 0.54517373]
df = pd.DataFrame(dict(surv=surv))
df.index.name = 'magnitude'
df.reset_index()
magnitude surv
0 -0.221849 1.000
1 -0.154902 0.992
2 -0.096910 0.984
3 0.000000 0.976
4 0.079181 0.960
... ... ...
103 2.223755 0.040
104 2.254548 0.032
105 2.537441 0.024
106 2.627058 0.016
107 2.889246 0.008

108 rows × 2 columns

df.reset_index().to_csv('disaster_magnitude.csv', index=False)
!head disaster_magnitude.csv
magnitude,surv
-0.22184874961635637,1.0
-0.1549019599857432,0.9920000000000007
-0.09691001300805639,0.9840000000000007
0.0,0.9760000000000006
0.07918124604762482,0.9600000000000006
0.11394335230683678,0.9520000000000006
0.146128035678238,0.9440000000000006
0.20411998265592482,0.9360000000000006
0.2787536009528289,0.9280000000000006
from utils import truncated_normal_sf

low, high = mags.min(), mags.max() * 0.99
qs = np.linspace(low, high, 2000)
surv_model = truncated_normal_sf(qs, mu, sigma)
ylabel = "Percentage with cost ≥ $x$"
xlabel = "Cost (billions of 2021 dollars)"

xlabels = [1, 10, 100, 1000]
xticks = np.log10(xlabels)

yticks = [1, 0.1, 0.01]
ylabels = ["100%", "10%", "1%"]
yticks = np.linspace(1, 0, 5)
ylabels = ["100%", "75%", "50%", "25%", "0%"]
yticks
array([1.  , 0.75, 0.5 , 0.25, 0.  ])
surv_linear = make_surv(xs)
surv_linear.plot(color="C1", ls="--")

plt.yticks(yticks, ylabels)
decorate(
    xlabel=xlabel, ylabel=ylabel, title="Tail distribution of disaster costs"
)
plt.savefig("longtail1a.png", dpi=600)
surv.plot(color="C1", ls="--")

plt.xticks(xticks, xlabels)
plt.yticks(yticks, ylabels)
decorate(
    xlabel=xlabel, ylabel=ylabel, title="Tail distribution of disaster costs, log scale"
)
plt.savefig("longtail1b.png", dpi=600)
surv_model.plot(color="gray", alpha=0.4, label="Lognormal model")
plot_error_bounds(surv_model, n, color="gray")
surv.plot(color="C1", ls="--", label="Data")

plt.xticks(xticks, xlabels)
plt.yticks(yticks, ylabels)
decorate(
    xlabel=xlabel, ylabel=ylabel, title="Tail distribution of disaster costs, log scale"
)
plt.savefig("longtail2.png", dpi=600)
surv([0, 1, 2])
array([0.976, 0.544, 0.072])

The dashed line shows the fraction of disasters that exceed each magnitude of cost. For example, 98% of the disasters exceed a cost of $1 billion, 54% exceed $10 billion, and 7% exceed $100 billion.

The gray line shows a lognormal model I chose to fit the data; the shaded area shows how much variation we expect with a sample size of 125. Most of the data fall within the bounds of the shaded area, which means that they are consistent with the lognormal model.
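The shaded area comes from plot_error_bounds in utils.py. As a rough sketch, assuming the bounds are produced by drawing many samples of size n from the model and taking percentiles of the resulting empirical tail curves (the actual implementation may differ):

# Hypothetical sketch of simulation-based error bounds; plot_error_bounds in
# utils.py may work differently.
import numpy as np

def simulate_tail_bounds(surv_model, n, iters=200, low=5, high=95):
    """Percentile bounds of empirical tail curves for samples of size n.

    surv_model: tail probabilities indexed by quantity (a pandas Series,
    or an empiricaldist Surv, evaluated on a dense grid)
    """
    qs = np.asarray(surv_model.index)    # grid of quantities
    ps = np.asarray(surv_model.values)   # tail probabilities, decreasing in qs
    curves = []
    for _ in range(iters):
        # inverse-transform sampling: S^{-1}(U) has survival function S
        u = np.random.random(n)
        sample = np.interp(u, ps[::-1], qs[::-1])
        # empirical Prob(X >= q) at each grid point
        counts_below = np.searchsorted(np.sort(sample), qs, side="left")
        curves.append(1 - counts_below / n)
    return np.percentile(curves, [low, high], axis=0)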

However, if you look closely, there are more large-cost disasters than we expect in a lognormal distribution. To see this part of the distribution more clearly, we can plot the y-axis on a log scale; the following figure shows what that looks like.

x = np.log10(500)
y1 = surv(x)
y2 = surv_model(x)
y1 * 1e3, y2 * 1e3, y1 / y2
(15.999999999999897, 1.2081898734673209, 13.24295158515303)
# The Surv object doesn't interpolate, so y1 ends up too high in this example

from scipy.interpolate import interp1d

interp = interp1d(surv.index, surv.values, fill_value="extrapolate")
y1 = interp(x)
plt.plot([x, x], [y1, y2], ls=":", color="gray")

surv_model.plot(color="gray", alpha=0.4, label="Lognormal model")
plot_error_bounds(surv_model, n, color="gray")
surv.plot(color="C2", ls="--", label="Data")
plot_error_bounds(surv_model, n, color="gray")
plt.xticks(xticks, xlabels)
decorate(
    xlabel=xlabel,
    ylabel=ylabel,
    title="Tail distribution of disaster costs, log-log scale",
    yscale="log",
)

yticks = [1, 0.1, 0.01, 0.001]
ylabels = ["100%", "10%", "1%", "0.1%"]
plt.yticks(yticks, ylabels)
plt.savefig("longtail3.png", dpi=600)

Again, the shaded area shows how much variation we expect in a sample of this size from a lognormal model.

Now we can see more clearly where the model diverges from the data and by how much. For example, the vertical dotted line shows the discrepancy at $500 billion. According to the model, only 1 out of 1000 disasters should exceed this cost, but the actual rate is 16 per 1000.

To make better predictions, we need a model that accurately estimates the probability of large disasters. In the literature of long-tailed distributions, there are several to choose from; the one I found that best matches the data is Student’s \(t\)-distribution.

The shape of the \(t\)-distribution is a bell curve similar to a Gaussian, but the tails extend farther to the right and left. It has two parameters that determine the center point and width of the curve, like the Gaussian distribution, plus a third parameter that controls the thickness of the tails, that is, what fraction of the values extend far from the mean.
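For reference (notation mine), the density of the location-scale \(t\)-distribution is, up to a normalizing constant,

\[ f(x) \propto \left[ 1 + \frac{1}{\nu} \left( \frac{x - \mu}{s} \right)^2 \right]^{-(\nu + 1)/2}, \]

where \(\mu\) is the center, \(s\) is the scale, and \(\nu\) (the “degrees of freedom”) controls the thickness of the tails; smaller \(\nu\) means thicker tails, and as \(\nu \to \infty\) the curve approaches a Gaussian.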

from scipy.stats import norm
from scipy.stats import t as t_dist

xs = np.linspace(-5, 5, 201)
ys1 = norm.pdf(xs, 0, 1)

plt.plot(xs, ys1, label='Standard normal')

df = 10
ys2 = t_dist.pdf(xs, df, 0, 1)
plt.plot(xs, ys2, label='Student t with df=10')

plt.xlabel('x')
plt.ylabel('PDF')
plt.legend()
plt.tight_layout()
plt.savefig("longtail3a.png", dpi=600)
from scipy.stats import norm
from scipy.stats import t as t_dist

xs = np.linspace(-5, 5)
ys1 = norm.sf(xs, 0, 1)

plt.plot(xs, ys1, label='Standard normal')

ys2 = t_dist.sf(xs, df, 0, 1)
plt.plot(xs, ys2, label='Student t with df=10')

plt.xlabel('x')
plt.ylabel('Prob magnitude $>$ x')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.tight_layout()
plt.savefig("longtail3b.png", dpi=600)

Since I am using a \(t\)-distribution to fit the logarithms of the values, I’ll call the result a log-\(t\) model. The following figure shows the distribution of costs again, along with a log-\(t\) model I chose to fit the data. In the top figure, the y-axis shows the fraction of disasters on a linear scale; in the bottom figure, the y-axis shows the same fractions on a log scale.

from utils import minimize_df

df = minimize_df(5, surv, [(0, 20)])
from utils import fit_truncated_t

mu, sigma = fit_truncated_t(df, surv)
df, mu, sigma
(array([3.52824501]), 1.0165307353889603, 0.488752041667769)
from utils import truncated_t_sf

low, high = surv.qs.min(), surv.qs.max()
qs = np.linspace(low, high, 2000)
surv_model2 = truncated_t_sf(qs, df, mu, sigma)
model_label = "Log-$t$ model"
ylabel = "Fraction with cost ≥ x"
yticks = [1, 0.1, 0.01]
ylabels = yticks
axs = plot_two(
    surv,
    surv_model2,
    n,
    hatch="/",
    title="Tail distribution of disaster costs with log-$t$ model",
)
plt.savefig("longtail4.png", dpi=600)

In the top figure, we see that the data fall within the shaded area over the entire range of the distribution. In the bottom figure, we see that the model fits the data well even in the extreme tail.

The agreement between the model and the data provides a hint about why the distribution might have this shape. Mathematically, Student’s \(t\) distribution is a mixture of Gaussian distributions with different standard deviations.
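In symbols (a standard identity, not spelled out in the text): if each Gaussian’s variance is drawn from an inverse-gamma distribution,

\[ \sigma^2 \sim \mathrm{InvGamma}\!\left(\tfrac{\nu}{2}, \tfrac{\nu}{2}\right), \qquad X \mid \sigma \sim \mathcal{N}(\mu, \sigma^2), \]

then the marginal distribution of \(X\) is Student’s \(t\) with \(\nu\) degrees of freedom, location \(\mu\), and scale 1. The simulation below uses a slightly different parameterization of the inverse-gamma distribution.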

The following figure was not included in the book, but it shows that if we generate values from Gaussian distributions with different standard deviations (drawn from an inverse gamma distribution), the values follow a t distribution.

from scipy.stats import invgamma

np.random.seed(17)

a = df
sigma = np.sqrt(invgamma(a).rvs(size=125))
sigma.mean(), sigma.std()
(0.6007941329583358, 0.1717044227431617)
from scipy.stats import norm

mu = 1.04
sample = norm(mu, sigma).rvs()
surv_sample = make_surv(sample[sample > -0.4])
model_label = "Mix of lognormal"
axs = plot_two(
    surv,
    surv_sample,
    n,
    title="Tail distribution of distaster costs with lognormal mixture",
)
decorate(ylim=[5e-3, 1.1])
plt.savefig("longtail5.png", dpi=600)

Earthquakes#

To describe the magnitudes of earthquakes, I downloaded data from the Southern California Earthquake Data Center. Their archive goes all the way back to 1932, but the sensors they use and their coverage have changed over that time. For consistency, I selected data from January 1981 to April 2022, which includes records of 791,329 earthquakes.

For information on catalog completeness and data sources, see http://www.data.scec.org/about/data_avail.html.

# This version of quake.csv includes more recent data than the version in the book,
# so the results are a little different.

filename = "quake.csv"
download(DATA_PATH + filename)
quake_df = pd.read_csv(filename, low_memory=False)
quake_df.shape
(803451, 4)
quake_df.head()
#YYY MM DD MAG
0 1981 1 1 2.3
1 1981 1 1 2.3
2 1981 1 1 2.4
3 1981 1 1 1.6
4 1981 1 1 1.9
quake_df.tail()
#YYY MM DD MAG
803446 2022 12 31 0.5
803447 2022 12 31 1.0
803448 2022 12 31 0.9
803449 2022 12 31 0.6
803450 2022 12 31 0.6
year = quake_df["#YYY"] == "1989"
year.sum()
0
quake_df.sort_values(by="MAG").tail(10)
#YYY MM DD MAG
102346 1987 11 24 6.2
163645 1992 6 28 6.3
674787 2019 7 4 6.4
408783 2003 12 22 6.5
102647 1987 11 24 6.6
224869 1994 1 17 6.7
335296 1999 10 16 7.1
678320 2019 7 6 7.1
492576 2010 4 4 7.2
163589 1992 6 28 7.3
mags = quake_df["MAG"]
mags.describe()
count    803451.000000
mean          1.388616
std           0.658252
min           0.000000
25%           0.900000
50%           1.300000
75%           1.800000
max           7.300000
Name: MAG, dtype: float64
(mags < 2).mean()
0.8159564180018446
quake_df.groupby("#YYY")["MAG"].count().iloc[1:-1].describe()
count       40.000000
mean     19397.050000
std      11049.695313
min      10092.000000
25%      13467.500000
50%      16186.000000
75%      20463.000000
max      63571.000000
Name: MAG, dtype: float64
n = mags.count()
n
803451
surv = make_surv(mags)
mu, sigma = fit_truncated_normal(surv)
[1.48860875 0.65822228]
[1.48860877 0.65822228]
[1.48860875 0.6582223 ]
[1.37983937 0.66707222]
[1.37983939 0.66707222]
[1.37983937 0.66707224]
[1.38099766 0.65403304]
[1.38099768 0.65403304]
[1.38099766 0.65403305]
[1.38080268 0.65402889]
[1.3808027  0.65402889]
[1.38080268 0.65402891]
low, high = mags.min(), mags.max() * 0.99
qs = np.linspace(low, high, 2000)
surv_model = truncated_normal_sf(qs, mu, sigma)

The following figure shows the distribution of moment magnitudes compared to a Gaussian model, which is a lognormal model in terms of energy.

x = 4
y1 = surv(x)
y2 = surv_model(x)
xlabel = "Magnitude (log energy)"
ylabel = "Fraction with mag ≥ $x$"
xticks = np.arange(8)
xlabels = xticks
yticks = [1, 0.1, 0.01, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]
ylabels = [1, 0.1, 0.01, 0.001, "$10^{-4}$", "$10^{-5}$", "$10^{-6}$", "$10^{-7}$"]
model_label = "Lognormal model"
cutoff = surv_model > 2e-7
axs = plot_two(
    surv,
    surv_model[cutoff],
    n,
    title="Earthquake magnitudes with lognormal model",
)
plt.plot([x, x], [y1, y2], ls=":", color="gray")

plt.tight_layout()
plt.savefig("longtail6.png", dpi=600)

If you only look at the top figure, you might think the lognormal model is good enough. But the probabilities for magnitudes greater than 3 are so small that differences between the model and the data are not visible.

The bottom figure, with the y-axis on a log scale, shows the tail more clearly. Here we can see that the lognormal model drops off more quickly than the data. According to the model, the fraction of earthquakes with magnitude 4 or more should be about 33 per million. But the actual fraction of earthquakes this big is about 1800 per million, more than 50 times higher.

For larger earthquakes, the discrepancy is even bigger. According to the model, fewer than one earthquake in a million should exceed magnitude 5; in reality, about 170 per million do. If your job is to predict large earthquakes, and you underestimate their frequency by a factor of 170, you are in trouble.

The following figures show the same distribution compared to a log-\(t\) model.

y1 * 1e6, y2 * 1e6, y1 / y2
(1784.8008154824963, 31.897268299459753, 55.9546604030893)
x = 5
surv(x) * 1e6, surv_model(x) * 1e7
(168.02518137406247, 0.15958924021795134)
x = 7
surv(x) * 1e6, surv_model(x) * 1e18

# 6.318484473557436, 4.618502751454663
(6.223154865724416, 4.406280129716829)
df = minimize_df(10, surv, [(0, 20)])
mu, sigma = fit_truncated_t(df, surv)
df, mu, sigma
(array([9.07638598]), 1.3634555603669711, 0.6375531964799466)
from scipy.stats import t as t_dist

y7, y8, y9 = t_dist(df, mu, sigma).sf([7, 8, 9])
quakes_per_year = quake_df.groupby("#YYY")["MAG"].count().mean()
quakes_per_year
19129.785714285714
y522 = 1 / 522 / quakes_per_year
y522
1.0014272197675095e-07
low, high = surv.qs.min(), surv.qs.max()
qs = np.linspace(low, 1.1 * high, 2000)
surv_model2 = truncated_t_sf(qs, df, mu, sigma)
model_label = "Log-$t$ model"
xticks = np.arange(9)
xlabels = xticks
axs = plot_two(
    surv,
    surv_model2,
    n,
    hatch="/",
    title="Earthquake magnitudes with log-$t$ model",
)
# markers showing the predicted probability of a magnitude 8 earthquake,
# referenced in the text below (square: log-t model; circle: UCERF3 forecast)
plt.plot(8, y8, "s", color="C0", alpha=0.6)
plt.plot(8, y522, "o", color="C1", alpha=0.6)

plt.tight_layout()
plt.savefig("longtail7.png", dpi=600)

The top figure shows that the log-\(t\) model fits the distribution well for magnitudes less than 3. The bottom figure shows that it also fits the tail of the distribution.

If we extrapolate beyond the data, we can use this model to compute the probability of earthquakes larger than the ones we have seen so far. For example, the square marker shows the predicted probability of an earthquake with magnitude 8, which is about 1.2 per million.

y8 * 1e6
1.197711963360669
1 / quakes_per_year / y8
43.645302435809846

The number of earthquakes in this dataset is about 19,000 per year; at that rate we should expect an earthquake that exceeds magnitude 8 every 43 years, on average. To see how credible that expectation is, let’s compare it to predictions from sources more knowledgeable than me.
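To spell out that arithmetic, using the rate and exceedance probability computed above:

\[ T \approx \frac{1}{(\text{quakes per year}) \times P(M \ge 8)} \approx \frac{1}{19{,}000 \times 1.2 \times 10^{-6}} \approx 43 \text{ years}. \]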

In 2015, the U.S. Geological Survey (USGS) published the third revision of the Uniform California Earthquake Rupture Forecast (UCERF3). It predicts that we should expect earthquakes in Southern California that exceed magnitude 8 at a rate of one per 522 years. The circle marker in the figure shows the corresponding probability.

Their forecast is compatible with mine in the sense that it falls within the shaded area that represents the uncertainty of the model. In other words, my model concedes that the predicted probability of a magnitude 8 earthquake could be off by a factor of 10, based on the data we have. Nevertheless, if you own a building in California, or provide insurance for one, a factor of 10 is a big difference. So you might wonder which model to believe, theirs or mine.

To find out, let’s see how their predictions from 2015 have turned out so far. For comparison, I ran the log-\(t\) model again using only data from prior to 2015. The following figure shows their predictions, mine, and the actual number of earthquakes that exceeded each magnitude in the 7.3 years between January 2015 and May 2022.

# https://pubs.usgs.gov/fs/2015/3009/pdf/fs2015-3009.pdf
prefix = quake_df["#YYY"].astype(int) < 2015
quakes_per_year = quake_df.loc[prefix].groupby("#YYY")["MAG"].count().mean()
quakes_per_year
17524.264705882353
mags = quake_df.loc[prefix, "MAG"]
surv = make_surv(mags)
df = minimize_df(10, surv, [(0, 20)])
mu, sigma = fit_truncated_t(df, surv)
df, mu, sigma
(array([8.41547484]), 1.4816444224709593, 0.6026839581707539)
# convert UCERF predictions to quakes per 7.33 years

pred_years = 7.33
ucerf = pred_years / pd.Series([0.24, 2.3, 12, 25, 87, 522], [5, 6, 6.7, 7, 7.5, 8])
ucerf
5.0    30.541667
6.0     3.186957
6.7     0.610833
7.0     0.293200
7.5     0.084253
8.0     0.014042
dtype: float64
# convert log-t predictions to quakes per 7.33 years

a = t_dist(df, mu, sigma).sf(ucerf.index) * quakes_per_year * pred_years
logt = pd.Series(a, ucerf.index)
logt
5.0    20.567019
6.0     3.414845
6.7     1.152932
7.0     0.750355
7.5     0.382366
8.0     0.204125
dtype: float64
# get the actual number of quakes since 2015

suffix = quake_df["#YYY"].astype(int) >= 2015
suffix.sum(), quakes_per_year * pred_years
(207626, 128452.86029411765)
mags = quake_df.loc[suffix, "MAG"]
actual = pd.Series(index=[5, 5.3, 5.7, 6, 6.3, 6.7, 7], dtype=float)
for thresh in actual.index:
    actual[thresh] = (mags > thresh).sum()
actual
5.0    10.0
5.3     7.0
5.7     3.0
6.0     2.0
6.3     2.0
6.7     1.0
7.0     1.0
dtype: float64
x = actual[:7].index
y = actual[:7].values
plt.bar(x, y, width=0.2, color="gray", alpha=0.3, label="actual")
logt[:7].plot(style="s:", alpha=0.6, label="log-$t$")
ucerf[:7].plot(style="o--", alpha=0.6, label="UCERF3")

decorate(xlabel=xlabel, ylabel="Number of quakes", yscale="log")
plt.xticks(actual[:7].index)
yticks = [20, 10, 5, 2, 1, 0.5]
ylabels = yticks
plt.yticks(yticks, yticks)

plt.tight_layout()
plt.savefig("longtail8.png", dpi=600)

Starting on the left, the UCERF model predicted that there would be 30 earthquakes in this interval with magnitude 5.0 or more. The log-\(t\) model predicted 20. In fact, there were 10. So there were fewer large earthquakes than either model predicted.

Near the middle of the figure, both models predicted 3 earthquakes with magnitude 6.0 or more. In fact, there were two, so both models were close.

On the right, the UCERF model expects 0.3 earthquakes with magnitude 7.0 or more in 7.3 years. The log-\(t\) model expects 0.8 earthquakes of this size in the same interval. In fact, there was one: the main shock of the 2019 Ridgecrest earthquake sequence measured 7.1.

# https://en.wikipedia.org/wiki/2019_Ridgecrest_earthquakes
1 / (ucerf / 7.33)
5.0      0.24
6.0      2.30
6.7     12.00
7.0     25.00
7.5     87.00
8.0    522.00
dtype: float64
1 / (logt / 7.33)
5.0     0.356396
6.0     2.146510
6.7     6.357705
7.0     9.768712
7.5    19.170102
8.0    35.909361
dtype: float64

The discrepancy between the two models increases for larger earthquakes. The UCERF model estimates 87 years between earthquakes that exceed 7.5; the log-\(t\) model estimates only 19. The UCERF model estimates 522 years between earthquakes that exceed 8.0; the log-\(t\) model estimates only 36.

Solar Flares#

The Space Weather Prediction Center (SWPC) monitors and predicts solar activity and its effects on Earth. Since 1974 it has operated the Geostationary Operational Environmental Satellite system (GOES). Several of the satellites in this system carry sensors that measure solar flares. Data from these sensors, going back to 1975, is available for download.

Since 1997, these datasets include “integrated flux” – the kind of term science fiction writers love – which measures the total energy from a solar flare that would pass through a given area. The magnitude of this flux is one way to quantify its potential for impact on Earth. This dataset includes measurements from more than 36,000 solar flares. The following figure shows the distribution of their magnitudes compared to a lognormal model.

# Kurtzgesagt https://www.youtube.com/watch?v=oHHSSJDJ4oo
filename = "flares.csv"
download(DATA_PATH + filename)
flares_df = pd.read_csv(filename, header=None)
flares = flares_df[0]
flares.shape
(36560,)
mags = np.log10(flares)
mags.describe()
count    36560.000000
mean        -3.058657
std          0.691503
min         -5.022276
25%         -3.537602
50%         -3.080922
75%         -2.619789
max          0.414973
Name: 0, dtype: float64
n = mags.count()
n
36560
surv = make_surv(mags)
mu, sigma = fit_truncated_normal(surv)
[-3.04218419  0.69158557]
[-3.04218423  0.69158557]
[-3.04218419  0.69158559]
[-3.0738114   0.68338351]
[-3.07381145  0.68338351]
[-3.0738114   0.68338353]
[-3.07353797  0.68244264]
[-3.07353802  0.68244264]
[-3.07353797  0.68244265]
low, high = mags.min(), mags.max() * 0.99
qs = np.linspace(low, high, 2000)
surv_model = truncated_normal_sf(qs, mu, sigma)
xlabel = "Integrated flux (Joules / square meter)"
ylabel = "Fraction with flux ≥ $x$"
xticks = np.arange(-5, 1)
xlabels = ["$10^{-5}$", "$10^{-4}$", 0.001, 0.01, 0.1, 1]
yticks = [1, 0.1, 0.01, 1e-3, 1e-4, 1e-5, 1e-6]
ylabels = [1, 0.1, 0.01, 0.001, "$10^{-4}$", "$10^{-5}$", "$10^{-6}$"]
flares.max(), flares.min(), flares.max() / flares.min()
(2.6, 9.5e-06, 273684.2105263158)
cutoff = surv_model > 1e-6
x = 0
y1 = surv(x)
y2 = surv_model(x)
y1 * 1e6, y2 * 1e6, y1 / y2
(218.81838074390188, 3.3991818848706328, 64.37383704527177)
model_label = "Lognormal Model"
axs = plot_two(
    surv, surv_model[cutoff], n, 
    title="Tail distribution of solar flare flux with lognormal model"
)
lines = plt.plot([x, x], [y1, y2], ls=":", color="gray")

plt.tight_layout()
plt.savefig("longtail9.png", dpi=600)

The x-axis shows integrated flux on a log scale; the largest flare observed during this period measured 2.6 Joules per square meter on September 7, 2005. By itself, that number might not mean much, but it is more than 270,000 times bigger than the smallest flare in the dataset. So the thing to appreciate here is the great range of magnitudes.

Looking at the top figure, it seems like the lognormal model fits the data well. But for fluxes greater than 0.1 Joules per square meter, the probabilities are so small they are indistinguishable.

Looking at the bottom figure, we can see that for fluxes in this range, the lognormal model underestimates the probabilities. For example, according to the model, the fraction of flares with flux greater than 1 should be about 3 per million; in the dataset, it is about 200 per million. The dotted line in the figure shows this difference.

If we are only interested in flares with flux less than 0.01, the lognormal model might be good enough. But for predicting the effect of space weather on Earth, it is the largest flares we care about, and for those the lognormal model might be dangerously inaccurate.

Let’s see if the log-\(t\) model does any better. The following figures show the distribution of fluxes again, along with a log-\(t\) distribution.

df = minimize_df(10, surv, [(0, 20)])
mu, sigma = fit_truncated_t(df, surv)
df, mu, sigma
(array([12.38820488]), -3.0803784847994766, 0.6632850329593064)
low, high = surv.qs.min(), surv.qs.max()
qs = np.linspace(low, 3 * high, 2000)
surv_model2 = truncated_t_sf(qs, df, mu, sigma)
cutoff2 = surv_model2.qs <= 1.1 * high
model_label = "Log-$t$ model"
axs = plot_two(
    surv,
    surv_model2[cutoff2],
    n,
    hatch="/",
    title="Tail distribution of solar flare flux with log-$t$ model",
)

plt.tight_layout()
plt.savefig("longtail10.png", dpi=600)
scale = 10
low, high = surv.qs.min(), surv.qs.max()
qs = np.linspace(low, scale * high, 2000)
surv_model2 = truncated_t_sf(qs, df, mu, sigma)
cutoff2 = surv_model2.qs <= scale * high

xticks = np.arange(-5, 5)
xlabels = ["$10^{-5}$", "$10^{-4}$", 0.001, 0.01, 0.1, 1, 10, 100, 1000, "$10^{5}$"]
yticks = [1, 0.1, 0.01, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]
ylabels = [1, 0.1, 0.01, 0.001, "$10^{-4}$", "$10^{-5}$", "$10^{-6}$", "$10^{-7}$"]
model_label = "Log-$t$ model"
axs = plot_two(
    surv,
    surv_model2[cutoff2],
    n,
    hatch="/",
    title="Tail distribution of solar flare flux with log-$t$ model",
)

plt.tight_layout()
plt.savefig("longtail10a.png", dpi=600)

The top figure shows that the log-\(t\) model fits the data well over the range of the distribution, except for some discrepancy among the smallest flares. Maybe we don’t detect small flares as reliably, or maybe there are just fewer of them than the model predicts.

The bottom figure shows that the model fits the tail of the distribution except possibly for the most extreme values. Above 1 Joule per meter squared, the actual curve drops off faster than the model, although it is still within the range of variation we expect.

(mags > 0).sum()
7

This example shows that the distribution of the size of solar flares is longer-tailed than a lognormal distribution, which suggests that the largest possible solar flares might be substantially larger than the ones we have seen so far.

If we use the model to extrapolate beyond the data – which is always an uncertain thing to do – we expect about 21 flares per million to exceed 10 Joules per square meter, 2.4 per million to exceed 100, and 0.36 per million to exceed 1000.

t_dist.sf([1, 2, 3], df, mu, sigma) * 1e6, 20e6 / n
(array([21.47023593,  2.41162628,  0.35574138]), 547.0459518599563)
t_dist.sf(5, df, mu, sigma) * 1e9, 1e9 / 36000
(array([14.61658962]), 27777.777777777777)

In the last 20 years we have observed about 36,000 flares, so it would take more than 500 years to observe a million. But in that time, we might see a solar flare 10 or 100 times bigger than the biggest we have seen so far.

Lunar Craters#

The largest crater on the near side of the moon, named Bailly, is 303 kilometers in diameter; the largest on the far side, the South Pole-Aitken basin, is roughly 2500 kilometers in diameter.

In addition to large, visible craters like these, there are innumerable smaller craters. The Lunar Crater Database catalogs nearly all of the ones larger than one kilometer, about 1.3 million in total. It is based on images taken by the Lunar Reconnaissance Orbiter, which NASA sent to the Moon in 2009.

# https://www.nytimes.com/2022/03/21/science/nasa-asteroid-strike.html

# https://en.wikipedia.org/wiki/Tunguska_event
# https://www.nasa.gov/centers/goddard/news/features/2010/biggest_crater.html
# https://en.wikipedia.org/wiki/Bailly_(crater)
# https://en.wikipedia.org/wiki/South_Pole%E2%80%93Aitken_basin
# https://astrogeology.usgs.gov/search/map/Moon/Research/Craters/lunar_crater_database_robbins_2018
# https://en.wikipedia.org/wiki/Lunar_Reconnaissance_Orbiter
# Robbins, S. J. (2018). A new global database of lunar impact craters >1–2 km: 1. Crater locations and sizes, comparisons with published databases, and global analysis. Journal of Geophysical Research: Planets, 124(4), pp. 871-892. https://doi.org/10.1029/2018JE005592
# the Moon's radius is about 1740 km, so its diameter is about 3480 km
1740 * 2
3480

Download the data from https://astrogeology.usgs.gov/search/map/Moon/Research/Craters/lunar_crater_database_robbins_2018

download("https://pdsimage2.wr.usgs.gov/Individual_Investigations/moon_lro.kaguya_multi_craterdatabase_robbins_2018/data/lunar_crater_database_robbins_2018.csv")
!ls -lh lunar_crater_database_robbins_2018.csv
-rw-rw-r-- 1 downey downey 228M Mar  2  2023 lunar_crater_database_robbins_2018.csv
filename = "lunar_crater_database_robbins_2018.csv"
craters = pd.read_csv(filename, nrows=None)
craters.head()
CRATER_ID LAT_CIRC_IMG LON_CIRC_IMG LAT_ELLI_IMG LON_ELLI_IMG DIAM_CIRC_IMG DIAM_CIRC_SD_IMG DIAM_ELLI_MAJOR_IMG DIAM_ELLI_MINOR_IMG DIAM_ELLI_ECCEN_IMG ... DIAM_ELLI_ANGLE_IMG LAT_ELLI_SD_IMG LON_ELLI_SD_IMG DIAM_ELLI_MAJOR_SD_IMG DIAM_ELLI_MINOR_SD_IMG DIAM_ELLI_ANGLE_SD_IMG DIAM_ELLI_ECCEN_SD_IMG DIAM_ELLI_ELLIP_SD_IMG ARC_IMG PTS_RIM_IMG
0 00-1-000000 -19.83040 264.7570 -19.89050 264.6650 940.960 21.31790 975.874 905.968 0.371666 ... 35.9919 0.007888 0.008424 0.636750 0.560417 0.373749 0.002085 0.000968 0.568712 8088
1 00-1-000001 44.77630 328.6020 44.40830 329.0460 249.840 5.99621 289.440 245.786 0.528111 ... 127.0030 0.011178 0.015101 1.052780 0.209035 0.357296 0.005100 0.004399 0.627328 2785
2 00-1-000002 57.08660 82.0995 56.90000 81.6464 599.778 21.57900 632.571 561.435 0.460721 ... 149.1620 0.008464 0.019515 0.776149 0.747352 0.374057 0.003095 0.002040 0.492373 5199
3 00-1-000003 1.96124 230.6220 1.95072 230.5880 558.762 14.18190 568.529 546.378 0.276416 ... 133.6910 0.007079 0.007839 0.526945 0.532872 1.262710 0.004496 0.001400 0.595221 4341
4 00-1-000004 -49.14960 266.3470 -49.18330 266.3530 654.332 17.50970 665.240 636.578 0.290365 ... 87.6468 0.008827 0.017733 0.568958 0.758631 1.383530 0.004626 0.001533 0.545924 5933

5 rows × 21 columns

diameters = craters["DIAM_CIRC_IMG"]
diameters.describe()
count    1.296796e+06
mean     2.436963e+00
std      5.519133e+00
min      1.000000e+00
25%      1.242710e+00
50%      1.606840e+00
75%      2.380860e+00
max      2.491870e+03
Name: DIAM_CIRC_IMG, dtype: float64
mags = np.log10(diameters)
mags.describe()
count    1.296796e+06
mean     2.730862e-01
std      2.483783e-01
min      0.000000e+00
25%      9.436979e-02
50%      2.059726e-01
75%      3.767339e-01
max      3.396525e+00
Name: DIAM_CIRC_IMG, dtype: float64
n = mags.count()
n
1296796
surv = make_surv(mags)
mu, sigma = fit_truncated_normal(surv)
[0.27308865 0.24837722]
[0.27308867 0.24837722]
[0.27308865 0.24837723]
[0.05295431 0.37367805]
[0.05295432 0.37367805]
[0.05295431 0.37367806]
[-0.25451594  0.46113896]
[-0.25451595  0.46113896]
[-0.25451594  0.46113898]
[-0.34244255  0.4572347 ]
[-0.34244256  0.4572347 ]
[-0.34244255  0.45723471]
[-0.2936525   0.44014652]
[-0.29365252  0.44014652]
[-0.2936525   0.44014653]
[-0.29545249  0.44042469]
[-0.2954525   0.44042469]
[-0.29545249  0.44042471]
[-0.29531495  0.44037601]
[-0.29531496  0.44037601]
[-0.29531495  0.44037602]
low, high = mags.min(), mags.max() * 0.99
qs = np.linspace(low, high, 2000)
surv_model = truncated_normal_sf(qs, mu, sigma)
xlabel = "Crater diameter (kilometers)"
ylabel = "Fraction with diameter ≥ $x$"
xlabels = [1, 3, 10, 100, 1000]
xticks = np.log10(xlabels)
yticks = [1, 0.1, 0.01, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]
ylabels = [1, 0.1, 0.01, 0.001, "$10^{-4}$", "$10^{-5}$", "$10^{-6}$", "$10^{-7}$"]

The following figure shows the distribution of their sizes on a log scale, compared to a lognormal model. The top figure shows the tail distribution with the y-axis on a linear scale; the bottom figure shows the same curve on a log scale.

x = 2
y1 = surv(x)
y2 = surv_model(x)
y1 * 1e6, y2 * 1e6
(242.90636307851742, 0.37897902114510834)
model_label = "Lognormal model"
cutoff = surv_model > 1e-7
axs = plot_two(
    surv, surv_model[cutoff], n, 
    title="Moon crater diameters with lognormal model"
)
plt.plot([x, x], [y1, y2], ls=":", color="gray")
plt.tight_layout()
plt.savefig("longtail11.png", dpi=600)

Looking at the top figure, we can see a discrepancy between the data and the model between 3 and 10 kilometers. Nevertheless, the model fits the logarithms of the diameters well, so we could conclude that the distribution of crater sizes is approximately lognormal.

However, looking at the bottom figure, we can see big differences between the data and the model in the tail. In the dataset, the fraction of craters as big as 100 kilometers is about 250 per million; according to the model, it would be less than 1 per million. The dotted line in the figure shows this difference.

Going farther into the tail, the fraction of craters as big as 1000 kilometers is about 3 per million; according to the model, it would be less than one per trillion. And the probability of a crater as big as the South Pole-Aitken basin is about 50 per quintillion.

If we are only interested in craters less than 10 kilometers in diameter, the lognormal model might be good enough. But for questions related to the biggest craters, it is way off.

The following figure shows the distribution of these crater sizes, compared to a log-\(t\) model.

y1 * 1e6, y2 * 1e6
(242.90636307851742, 0.37897902114510834)
x = np.log10(1000)
surv(x), surv_model(x)
(array(3.08452525e-06), array(1.46578142e-13))
# about 50 per quintillion (50 per 10^18)

norm(mu, sigma).sf(high) * 1e18
49.38523077882056
df = minimize_df(10, surv, [(0, 20)])
mu, sigma = fit_truncated_t(df, surv)
df, mu, sigma
(array([11.13359356]), -0.2019431517012043, 0.37469477145698354)
low, high = surv.qs.min(), surv.qs.max()
qs = np.linspace(low, 1.2 * high, 2000)
surv_model2 = truncated_t_sf(qs, df, mu, sigma)
cutoff2 = surv_model2.qs <= 1.03 * high
model_label = "Log-$t$ model"
axs = plot_two(
    surv,
    surv_model2[cutoff2],
    n,
    hatch="/",
    title="Moon crater diameters with log-t model",
)

plt.tight_layout()
plt.savefig("longtail12.png", dpi=600)

Since the dataset does not include craters smaller than one kilometer, I cut off the model at the same threshold. We can assume that there are many smaller craters, but with this dataset we can’t tell what the distribution of their sizes looks like.

The log-\(t\) model fits the distribution well, but it’s not perfect: there are more craters near 100 km than the model expects, and fewer craters larger than 1000 km. As usual, the world is under no obligation to follow simple rules, but this model does pretty well.

Asteroids#

The Jet Propulsion Laboratory (JPL) and NASA provide data related to asteroids, comets and other small bodies in our solar system. From their Small-Body Database, I selected asteroids in the “main asteroid belt” between the orbits of Mars and Jupiter.

There are more than one million asteroids in this dataset, about 136,000 with known diameter. The largest are Ceres (940 kilometers in diameter), Vesta (525 km), Pallas (513 km), and Hygiea (407 km). The smallest are less than one kilometer.

I downloaded the data using the Small-Body Database

  • Selecting Inner, outer, and main belt asteroids

  • Variables: IAU name, diameter, GM (mass times gravitational constant G)

filename = "sbdb_query_results2.csv"
download(DATA_PATH + filename)
asteroids = pd.read_csv(filename, low_memory=False)
asteroids.shape
(1138975, 3)

Fans of The Expanse will recognize the names of the largest asteroids in the Belt.

asteroids.dropna().sort_values(by="diameter").tail()
name diameter GM
693 Interamnia 306.313 5.000000
9 Hygiea 407.120 7.000000
1 Pallas 513.000 13.630000
3 Vesta 525.400 17.288245
0 Ceres 939.400 62.628400
asteroids.dropna().sort_values(by="diameter").head()
name diameter GM
241 Ida 32.000 0.00275
251 Mathilde 52.800 0.00689
21 Kalliope 167.536 0.49100
106 Camilla 210.370 0.74750
15 Psyche 226.000 1.53000
asteroids["diameter"].describe()
count    135915.000000
mean          5.265815
std           8.495350
min           0.347000
25%           2.783000
50%           3.940000
75%           5.651000
max         939.400000
Name: diameter, dtype: float64
mags = np.log10(asteroids["diameter"])
mags.describe()
count    135915.000000
mean          0.613943
std           0.257474
min          -0.459671
25%           0.444513
50%           0.595496
75%           0.752125
max           2.972851
Name: diameter, dtype: float64
n = mags.count()
n
135915
surv = make_surv(mags)
mu, sigma = fit_truncated_normal(surv)
[0.61406612 0.2574008 ]
[0.61406614 0.2574008 ]
[0.61406612 0.25740082]
[0.59649365 0.22729091]
[0.59649367 0.22729091]
[0.59649365 0.22729093]
[0.59765045 0.22862602]
[0.59765047 0.22862602]
[0.59765045 0.22862603]
[0.59766666 0.22865271]
[0.59766668 0.22865271]
[0.59766666 0.22865273]
from utils import truncated_normal_pmf

low, high = mags.min(), mags.max() * 0.99
qs = np.linspace(low, high, 2000)
pmf_model = truncated_normal_pmf(qs, mu, sigma)
surv_model = pmf_model.make_surv()
xlabel = "Asteroid diameter (kilometers)"
ylabel = "Fraction with diameter ≥ $x$"
xticks = [0, 1, 2, 3]
xlabels = [1, 10, 100, 1000]
x = 1.5
y1 = surv(x)
y2 = surv_model(x)
y1 * 1e6, y2 * 1e6
(8188.941617943904, 39.34458611676295)
cutoff = surv_model > 1e-6
yticks = [1, 0.1, 0.01, 1e-3, 1e-4, 1e-5, 1e-6]
ylabels = [1, 0.1, 0.01, "$10^{-3}$", "$10^{-4}$", "$10^{-5}$", "$10^{-6}$"]
model_label = "Lognormal model"
axs = plot_two(
    surv,
    surv_model[cutoff],
    n,
    title="Tail distribution of asteroid diameters with lognormal model",
)
plt.tight_layout()
plt.savefig("longtail13.png", dpi=600)

The following figure shows the distribution of asteroid sizes compared to a log-\(t\) model.

df = minimize_df(10, surv, [(0, 20)])
df = 6.5
mu, sigma = fit_truncated_t(df, surv)
df, mu, sigma
(6.5, 0.5953882516804466, 0.2150180564203059)
low, high = surv.qs.min(), surv.qs.max()
qs = np.linspace(low, 1.0 * high, 2000)
surv_model2 = truncated_t_sf(qs, df, mu, sigma)
cutoff2 = surv_model2.qs <= 1.0 * high
model_label = "Log-$t$ model"
axs = plot_two(
    surv,
    surv_model2[cutoff2],
    n,
    hatch="/",
    title="Tail distribution of asteroid diameters with log-t model",
)
plt.tight_layout()
plt.savefig("longtail14.png", dpi=600)

In the top figure, we see that the log-\(t\) model fits the data well in the middle of the range, except possibly near 1 kilometer. In the bottom, we see that the log-\(t\) model does not fit the tail of the distribution particularly well: there are more asteroids near 100 km than the model predicts. So the distribution of asteroid sizes does not strictly follow a log-\(t\) distribution. Nevertheless, we can use it to explain the distribution of crater sizes, as I’ll show in the next section.

Origins of Long-Tailed Distributions#

Empirically – that is, based on data rather than a physical model – the diameter of an impact crater depends on the diameter of the projectile that created it, raised to the power 0.78, and on the impact velocity raised to the power 0.44. It also depends on the density of the asteroid and the angle of impact.

Source: Melosh, Planetary Surface Processes.
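As a rough summary of that relationship (a paraphrase of the scaling law, not the exact formula from Melosh):

\[ D_{\text{crater}} \propto D_{\text{projectile}}^{\,0.78} \; v^{\,0.44} \; f(\rho, \theta), \]

where \(f\) captures the dependence on projectile density \(\rho\) and impact angle \(\theta\).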

As a simple model of this relationship, I’ll simulate the crater formation process by drawing asteroid diameters from the distribution in the previous section and drawing the other factors – density, velocity, and angle – from a lognormal distribution with parameters chosen to match the data. The following figure shows the results from this simulation along with the actual distribution of crater sizes.

diam_asteroid = asteroids["diameter"].dropna()
diam_crater = craters["DIAM_CIRC_IMG"].dropna()
diam_asteroid.count(), diam_crater.count()
(135915, 1296796)
surv_crater = make_surv(np.log10(diam_crater))
np.random.seed(17)
n = diam_crater.count()
resample_asteroid = diam_asteroid.sample(100 * n, replace=True)
# resample_asteroid = diam_asteroid
rvs = np.random.normal(size=resample_asteroid.count())
len(rvs)
129679600
def error_func_pred(params):
    """Error function for finding parameters for the asteroid problem.
    
    params: tuple of mu and sigma
    
    returns: array of differences in the survival functions
    """
    print(params)
    mu, sigma = params
    factors = np.power(10, rvs * sigma + mu)
    pred_crater = factors * resample_asteroid**0.78
    mags = np.log10(pred_crater)
    cutoff = mags >= surv_crater.qs.min()
    surv_pred = make_surv(mags[cutoff])
    # surv.plot()
    # surv_pred.plot(color='gray')

    ps = np.logspace(-1, -5, 20)
    # ps = np.linspace(0.1, 0.9, 20)
    qs = surv.inverse(ps)
    # print(qs)

    error = surv_crater(qs) - surv_pred(qs)
    return error
mu = -3.194
sigma = 0.883
params = mu, sigma
error_func_pred(params)
(-3.194, 0.883)
array([-3.16249439e-04, -2.73415007e-05,  1.20415497e-04,  2.96512650e-04,
        4.48439739e-04,  6.12564529e-04,  4.29226980e-04,  2.16708737e-04,
        1.21459936e-04,  6.07299681e-05,  3.32894198e-05,  2.09586479e-05,
        8.46778641e-06,  8.78063634e-06,  7.08561326e-06,  1.45431455e-07,
        1.04829884e-05,  1.38803636e-05,  1.38803636e-05,  4.62678787e-06])
from scipy.optimize import least_squares

def run_minimize():
    """Find the values of mu and sigma that solve the asteroid problem.
    """
    res = least_squares(error_func_pred, x0=params, xtol=1e-3, diff_step=0.01)
    assert res.success
    mu, sigma = res.x
    return mu, sigma


# mu, sigma = run_minimize()
factors = np.power(10, rvs * sigma + mu)
pred_crater = factors * resample_asteroid**0.78
mags = np.log10(pred_crater)
cutoff = mags >= surv_crater.qs.min()
surv_pred = make_surv(mags[cutoff])
surv_pred.head()
probs
diameter
0.000001 1.000000
0.000002 0.999995
0.000005 0.999990
xlabel = "Crater diameter (kilometers)"
ylabel = "Fraction with diameter ≥ $x$"

xticks = [0, 1, 2, 3]
xlabels = [1, 10, 100, 1000]

cutoff = surv_pred.qs >= 0
model_label = "Simulated crater sizes"

axs = plot_two(
    surv_crater,
    surv_pred,
    n,
    hatch="/",
    title="Moon crater diameters with simulation results",
)

plt.tight_layout()
plt.savefig("longtail15.png", dpi=600)

The simulation results are a good match for the data. This example suggests that the distribution of crater sizes can be explained by the relationship between the distributions of asteroid sizes and other factors like velocity and density.

Stock Market Crashes#

Like natural disasters, the distribution of magnitudes of stock market crashes is long-tailed. To demonstrate, I’ll use data from the MeasuringWorth Foundation, which has compiled the value of the Dow Jones Industrial Average at the end of each day from February 16, 1885 to the present, with adjustments at several points to make the values consistent. The series I collected ends on May 6, 2022, so it includes 37,512 days.

# "Citation: Samuel H. Williamson, 'Daily Closing Value of the Dow Jones Average, 1885 to Present,' MeasuringWorth, 2022. "

# https://www.measuringworth.com/datasets/DJA
filename = "DJA.csv"
download(DATA_PATH + filename)
djia = pd.read_csv(filename, skiprows=4)
djia.shape
(37512, 2)
djia.head()
Date DJIA
0 2/16/1885 30.9226
1 2/17/1885 31.3365
2 2/18/1885 31.4744
3 2/19/1885 31.6765
4 2/20/1885 31.4252
djia.tail()
Date DJIA
37507 5/2/2022 33061.50
37508 5/3/2022 33128.79
37509 5/4/2022 34061.06
37510 5/5/2022 32997.97
37511 5/6/2022 32899.37

In percentage terms, the largest single-day drop was 22.6%, during the Black Monday crash of 1987. The second largest drop was 12.9% on March 16, 2020, during the crash caused by the COVID-19 pandemic. Numbers three and four on the list were 12.8% and 11.7%, on consecutive days during the Wall Street Crash of 1929.

diff = np.log10(djia["DJIA"]).diff()
djia["change"] = (10**diff - 1) * 100
djia.sort_values(by="change").head()
Date DJIA change
28833 10/19/1987 1738.74 -22.610538
36978 3/16/2020 20188.52 -12.926547
13297 10/28/1929 260.64 -12.820684
13298 10/29/1929 230.07 -11.728821
36976 3/12/2020 21200.62 -9.988443

On the positive side, the largest single-day gain was on March 15, 1933, when the index gained 15.3%. There were other large gains in October 1929 and October 1931. The largest gain of the 21st century, so far, was on March 24, 2020, when the index gained 11.4% in expectation that the U.S. Congress would pass a large economic stimulus bill.

djia.sort_values(by="change").tail()
Date DJIA change
36984 3/24/2020 20704.9100 11.365038
13299 10/30/1929 258.4700 12.344069
13871 10/6/1931 99.3400 14.870490
14292 3/15/1933 62.1000 15.341753
0 2/16/1885 30.9226 NaN

But these large moves are unusual; on 95% of days, the change is less than 2 percentage points either way.

djia["change"].abs().describe()
count    37511.000000
mean         0.712317
std          0.799769
min          0.000000
25%          0.217753
50%          0.493605
75%          0.935523
max         22.610538
Name: change, dtype: float64
(djia["change"].abs() < 2).mean()
0.9453508210705908
(djia["change"] < -2).mean()
0.028897419492429088
all_mags = np.log10(djia["DJIA"]).diff()
all_mags.describe()
count    37511.000000
mean         0.000081
std          0.004660
min         -0.111318
25%         -0.001970
50%          0.000195
75%          0.002291
max          0.061987
Name: DJIA, dtype: float64
(all_mags > 0).mean(), (all_mags < 0).mean(), (all_mags == 0).mean()
(0.5231392621027938, 0.4713158455960759, 0.005518234165067178)
r = 10 ** all_mags.mean()
(r - 1) * 100
0.018582199949923606

As an aside, the index climbed on 52.3% of days, fell on 47.1% of days, and was unchanged from the previous close on the remaining 0.6% of days. The average increase was only 0.02% per day, but with about 250 trading days per year, that amounts to an increase of 4.8% annually. After 37,512 days, the total increase is more than 100,000%. So that’s a pretty good return on investment, if you made the investment in 1885.

r**250 - 1
0.04754694096007417
n = len(all_mags)
n, r**n - 1
(37512, 1063.1240848647278)
start, end = djia["DJIA"].iloc[[0, -1]]
start, end, end / start - 1
(30.9226, 32899.37, 1062.926383939255)
losses = all_mags < 0
losses.mean()
0.4713158455960759
mags = -all_mags[losses]
mags.describe()
count    1.768000e+04
mean     3.196865e-03
std      3.704198e-03
min      7.287896e-07
25%      9.450282e-04
50%      2.151678e-03
75%      4.170744e-03
max      1.113182e-01
Name: DJIA, dtype: float64
10 ** (-mags.max())
0.7738946206503647
n = mags.count()
n
17680
surv = make_surv(mags)
mu, sigma = fit_truncated_normal(surv)
mu, sigma
[0.00319705 0.00370412]
[0.00319706 0.00370412]
[0.00319705 0.00370414]
[-0.0013507   0.00550956]
[-0.00135072  0.00550956]
[-0.0013507   0.00550957]
[-0.00621391  0.00604884]
[-0.00621392  0.00604884]
[-0.00621391  0.00604886]
[-0.01126   0.007184]
[-0.01126002  0.007184  ]
[-0.01126     0.00718402]
[-0.01332923  0.00747085]
[-0.01332924  0.00747085]
[-0.01332923  0.00747086]
[-0.01298826  0.00737537]
[-0.01298827  0.00737537]
[-0.01298826  0.00737538]
[-0.01293662  0.00736386]
[-0.01293664  0.00736386]
[-0.01293662  0.00736387]
[-0.01293473  0.00736343]
[-0.01293475  0.00736343]
[-0.01293473  0.00736344]
(-0.012934732189136055, 0.007363429563516698)
low, high = surv.qs.min(), surv.qs.max()
qs = np.linspace(low, high, 2000)
surv_model = truncated_normal_sf(qs, mu, sigma)
xlabel = "One day percent change"
xlabels = ["no change", "-5%", "-10%", "-15%", "-20%"]
xticks = -np.log10([1, 0.95, 0.9, 0.85, 0.8])
xticks
array([-0.        ,  0.02227639,  0.04575749,  0.07058107,  0.09691001])
surv(xticks)[1] * 100
0.48076923076922906
cutoff = surv_model > 1e-5
yticks = [1, 0.1, 0.01, 1e-3, 1e-4, 1e-5]
ylabels = [1, 0.1, 0.01, 0.001, "$10^{-4}$", "$10^{-5}$"]

Since we are interested in stock market crashes, I selected the 17,680 days when the value of the index dropped.

model_label = "Lognormal model"
ylabel = "Fraction with drop ≥ $x$"

axs = plot_two(
    surv,
    surv_model[cutoff],
    n,
    title="Tail distribution of negative changes with lognormal model",
)

plt.tight_layout()
plt.savefig("longtail16.png", dpi=600)

In the top figure, the lognormal model seems to fit the distribution well, but again there are big differences hiding in the tail. In the bottom figure, we can see that the lognormal model does not fit the tail of the distribution at all.

The following figure shows the distribution of these negative percentage changes compared to a log-\(t\) model.

df = minimize_df(3, surv, [(1, 20)])
mu, sigma = fit_truncated_t(df, surv)
df, mu, sigma
(array([3.8945931]), -0.002607418984953457, 0.003426713548607919)
low, high = surv.qs.min(), surv.qs.max()
qs = np.linspace(low, 1.02 * high, 2000)
surv_model = truncated_t_sf(qs, df, mu, sigma)
cutoff = surv_model.qs <= high
model_label = "Log-$t$ model"
axs = plot_two(
    surv,
    surv_model,
    n,
    hatch="/",
    title="Tail distribution of stock market changes with log-$t$ model",
)
surv.iloc[-2:].plot(color="C0", style="o")

plt.tight_layout()
plt.savefig("longtail17.png", dpi=600)

The top figure shows that the log-\(t\) model fits the left side of the distribution well. The bottom figure shows that it fits the tail well, with the possible exception of the last data point, the 1987 crash, which falls just at the edge of variation we expect. According to the model, there is only a 5% chance that we would see a crash as big as this one after 17,680 down days.
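That 5% comes from the distribution of the maximum: if \(n\) daily drops are drawn independently from a model with CDF \(F\), the probability that at least one of them exceeds \(x\) is

\[ 1 - F(x)^n, \]

which is what the next cell computes with \(n = 17{,}680\).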

x = surv.qs.max()
y1 = surv(x)
y2 = surv_model(x)
n, y1, y2, y1 / y2
(17680, array(5.6561086e-05), array(1.28916302e-05), 4.387426949065176)
1 - t_dist(df, mu, sigma).cdf(x) ** n
array([0.0542351])

Black and Gray Swans#

If you have read The Black Swan, by Nassim Taleb, the examples in this chapter provide context for understanding black swans and their relatives, gray swans.

In Taleb’s vocabulary, a black swan is a large, impactful event that was considered extremely unlikely before it happened, based on a model of prior events. If the distribution of event sizes is actually long-tailed and the model is Gaussian, black swans will happen with some regularity.

For example, if we use a Gaussian distribution to model the magnitudes of earthquakes, we predict that the probability of a magnitude 7 earthquake is about 4 per quintillion (\(10^{18}\)). The actual rate in the dataset we looked at is about 6 per million, which is more likely by a factor of a trillion (\(10^{12}\)). When a magnitude 7 earthquake occurs, it would be a black swan if your predictions were based on the Gaussian model.

# x = 7
# surv(x)*1e6, surv_model(x)*1e18

# 6.318484473557436, 4.618502751454663

In a Long-Tailed World#

Imagine a world where the distribution of human height is not Gaussian or even lognormal; instead, it follows a long-tailed distribution like the various disasters in this chapter. Specifically, suppose it follows a log-\(t\) distribution with the same tail thickness as the distribution of disaster costs, but with location and scale chosen to match the actual distribution of human height.

filename = "brfss_sample.hdf"
download(DATA_PATH + filename)
brfss = pd.read_hdf(filename, "brfss")
height = brfss["HTM4"]
height.quantile([0.25, 0.75])
0.25    163.0
0.75    178.0
Name: HTM4, dtype: float64
mags = np.log10(height)
n = mags.count()
mags.describe()
count    375271.000000
mean          2.229956
std           0.028219
min           1.959041
25%           2.212188
50%           2.230449
75%           2.250420
max           2.369216
Name: HTM4, dtype: float64
surv = make_surv(mags)
df = minimize_df(5, surv, [(1, 1e6)])
df
array([999859.46341698])
df = 3.53  # same as distribution of disasters
mu, sigma = fit_truncated_t(df, surv)
df, mu, sigma
(3.53, 2.2293001672048596, 0.025915097488203958)
low, high = surv.qs.min(), surv.qs.max()
qs = np.linspace(low, high, 2000)
surv_model = truncated_t_sf(qs, df, mu, sigma)
model_label = "Log-$t$ model"
xlabel = ""
ylabel = ""
xticks = []
xlabels = xticks
axs = plot_two(surv, surv_model, n)
surv_model.plot(color="gray", alpha=0.4)
surv.plot(ls="--")
decorate(xlabel="Log Height", ylabel="Prob >= x")
dist = t_dist(df, mu, sigma)
ps = [0.9, 0.99, 0.999, 1 - 1 / 10000, 1 - 1e-6, 1 - 1 / 330e6, 1 - 1 / 7e9]
ps
[0.9, 0.99, 0.999, 0.9999, 0.999999, 0.999999996969697, 0.9999999998571428]
(height >= 277).mean() * 7e9
0.0

Out of 100 people, the tallest might be 215 cm.

If you see 1000 people, the tallest of them might be 277 cm.

Out of 10,000 people, the tallest would be more than 4 meters tall.

Out of a million people in Long-Tailed World, the tallest would be 59 meters tall. In a country the size of the United States, the tallest person would be 160,000 kilometers tall.

And in a world population of 7 billion, the tallest would be about 14 quadrillion kilometers tall, which is about 1500 light years.

# predicted heights converted from cm to m

qs = dist.ppf(ps)
series = pd.Series(ps, 10**qs / 100)
series
1.862427e+00    0.900000
2.157498e+00    0.990000
2.770801e+00    0.999000
4.416977e+00    0.999900
5.888934e+01    0.999999
1.575737e+08    1.000000
1.445579e+19    1.000000
dtype: float64
# height of the tallest person in light years

series.index[-1] / 9.46073e15
1527.9783031704246
# double-checking that the "tallest people" examples are on the model curve

transformed = 1 - pd.Series(ps, qs)
transformed
2.270079     1.000000e-01
2.333950     1.000000e-02
2.442605     1.000000e-03
2.645125     1.000000e-04
3.770037     1.000000e-06
10.197484    3.030303e-09
21.160042    1.428572e-10
dtype: float64
surv_model.plot(color="gray", alpha=0.4)
surv.plot(ls="--")
transformed.iloc[:3].plot()

decorate(yscale="log")

Probably Overthinking It

Copyright 2022 Allen Downey

The code in this notebook and utils.py is under the MIT license.