4. Scale-Free Networks
Code examples from Think Complexity, 2nd edition.
Copyright 2016 Allen Downey, MIT License
from os.path import basename, exists
def download(url):
filename = basename(url)
if not exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print('Downloaded ' + local)
download('https://github.com/AllenDowney/ThinkComplexity2/raw/master/notebooks/utils.py')
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import seaborn as sns
from utils import decorate, savefig
# Set the random seed so the notebook
# produces the same results every time.
np.random.seed(17)
# make a directory for figures
!mkdir -p figs
4.1. Facebook data
The following function reads a file with one edge per line, specified by two integer node IDs.
def read_graph(filename):
G = nx.Graph()
array = np.loadtxt(filename, dtype=int)
G.add_edges_from(array)
return G
We’ll read the Facebook data downloaded from SNAP.
download('https://snap.stanford.edu/data/facebook_combined.txt.gz')
fb = read_graph('facebook_combined.txt.gz')
n = len(fb)
m = len(fb.edges())
n, m
With larger graphs, it takes too long to compute clustering coefficients and path lengths, but we can estimate them by sampling. NetworkX provides a function in its approximation module that estimates the clustering coefficient:
from networkx.algorithms.approximation import average_clustering
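The approximation works by sampling random node triples, so the trials parameter trades precision for speed. For example, we could get a faster (but noisier) estimate like this:

# Fewer trials run faster but give a noisier estimate
# (the cells below use the default of 1000 trials).
average_clustering(fb, trials=100)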
And I’ve written a function that estimates the average path length.
def sample_path_lengths(G, nodes=None, trials=1000):
"""Choose random pairs of nodes and compute the path length between them.
G: Graph
nodes: list of nodes to choose from
trials: number of pairs to choose
returns: list of path lengths
"""
if nodes is None:
nodes = list(G)
else:
nodes = list(nodes)
pairs = np.random.choice(nodes, (trials, 2))
lengths = [nx.shortest_path_length(G, *pair)
for pair in pairs]
return lengths
def estimate_path_length(G, nodes=None, trials=1000):
return np.mean(sample_path_lengths(G, nodes, trials))
The average clustering coefficient is high.
C = average_clustering(fb)
C
The average path length is low.
L = estimate_path_length(fb)
L
4.2. WS Graph
Next I’ll construct a WS graph with the same number of nodes and average degree as the Facebook network. Since each edge contributes to the degree of two nodes, the average degree is 2*m/n:
n = len(fb)
m = len(fb.edges())
k = int(round(2*m/n))
k
With p=0 we get a ring lattice. The number of edges is a little bigger than in the dataset because we have to round k to an integer.
lattice = nx.watts_strogatz_graph(n, k, p=0)
len(lattice), len(lattice.edges())
The clustering coefficient is a little higher than in the dataset.
C, average_clustering(lattice)
And the path length is much higher.
L, estimate_path_length(lattice)
With p=1 we get a random graph.
random_graph = nx.watts_strogatz_graph(n, k, p=1)
The clustering coefficient is small.
C, average_clustering(random_graph)
And the path lengths are very small.
L, estimate_path_length(random_graph)
By trial and error, I found that p=0.05 yields a graph with about the right values for C and L.
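Here’s a sketch of that kind of search: sweep a few candidate values of p and compare the estimated C and L to the values from the data (each iteration builds and measures a graph the size of the Facebook network, so it takes a little while).

# Try a few rewiring probabilities and estimate C and L for each.
for p in [0.01, 0.03, 0.05, 0.1]:
    g = nx.watts_strogatz_graph(n, k, p, seed=15)
    print(p, average_clustering(g), estimate_path_length(g))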
ws = nx.watts_strogatz_graph(n, k, 0.05, seed=15)
The clustering coefficient is a little higher than in the data.
C, average_clustering(ws)
And the path length is a little lower.
L, estimate_path_length(ws)
So that seems good so far.
4.3. Degree
But let’s look at the degree distribution.
The following function returns a list of degrees, one for each node:
def degrees(G):
"""List of degrees for nodes in `G`.
G: Graph object
returns: list of int
"""
return [G.degree(u) for u in G]
The average degree in the WS model is about right.
np.mean(degrees(fb)), np.mean(degrees(ws))
But the standard deviation isn’t even close:
np.std(degrees(fb)), np.std(degrees(ws))
To see what’s going on, we need to look at the whole distribution.
I’ll start with a very small graph:
G = nx.Graph()
G.add_edge(1, 0)
G.add_edge(2, 0)
G.add_edge(3, 0)
nx.draw(G)
Here’s what the list of degrees looks like for this graph:
degrees(G)
To compute the degree distribution, I’ll use the Pmf class from empiricaldist.
try:
import empiricaldist
except ImportError:
!pip install empiricaldist
A Pmf object maps from each degree to the fraction of nodes with that degree.
from empiricaldist import Pmf
pmf = Pmf.from_seq(degrees(G))
pmf
75% of the nodes have degree 1; 25% have degree 3.
We can visualize the distribution as a histogram:
pmf.bar()
decorate(xlabel='Degree',
ylabel='Pmf')
And we can use the Pmf to compute mean and standard deviation:
pmf_fb = Pmf.from_seq(degrees(fb))
pmf_fb.mean(), pmf_fb.std()
pmf_ws = Pmf.from_seq(degrees(ws))
pmf_ws.mean(), pmf_ws.std()
We can also use the Pmf to look up the fraction of nodes with exactly 1 neighbor.
pmf_fb(1), pmf_ws(1)
Here’s what the degree distributions look like for the Facebook data and the WS model. They don’t resemble each other at all.
plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
pmf_fb.plot(label='Facebook', color='C0')
decorate(xlabel='Degree', ylabel='PMF')
plt.subplot(1,2,2)
pmf_ws.plot(label='WS graph', color='C1')
decorate(xlabel='Degree')
savefig('figs/chap04-1')
We can get a better view of the Facebook data by plotting the PMF on a log-log scale.
The result suggests that the degree distribution follows a power law, at least for values larger than 10 or so.
The log-log scale doesn’t help the WS graph.
plt.figure(figsize=(8,4))
options = dict(ls='', marker='.')
plt.subplot(1,2,1)
plt.plot([20, 1000], [5e-2, 2e-4], color='gray', linestyle='dashed')
pmf_fb.plot(label='Facebook', color='C0', **options)
decorate(xscale='log', yscale='log',
xlabel='Degree', ylabel='PMF')
plt.subplot(1,2,2)
pmf_ws.plot(label='WS graph', color='C1', **options)
decorate(xlim=[35, 55],
xscale='log', yscale='log',
xlabel='Degree')
savefig('figs/chap04-2')
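To make that observation a little more quantitative, we can fit a line to the tail of the Facebook PMF on a log-log scale; the slope of the fit approximates the power-law exponent. This is only a rough sketch, not a careful estimate.

# Fit a line to log10(PMF) versus log10(degree) for degrees above 10.
qs, ps = pmf_fb.qs, pmf_fb.ps
tail = (qs > 10) & (ps > 0)
slope, intercept = np.polyfit(np.log10(qs[tail]), np.log10(ps[tail]), 1)
slope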
The discrepancy between the actual degree distribution and the WS model is the motivation for the BA model.
4.4. BA model
Here’s a simplified version of the NetworkX function that generates BA graphs.
# modified version of the NetworkX implementation from
# https://github.com/networkx/networkx/blob/master/networkx/generators/random_graphs.py
import random
def barabasi_albert_graph(n, k, seed=None):
"""Constructs a BA graph.
n: number of nodes
k: number of edges for each new node
seed: random seed
"""
if seed is not None:
random.seed(seed)
G = nx.empty_graph(k)
targets = set(range(k))
repeated_nodes = []
for source in range(k, n):
G.add_edges_from(zip([source]*k, targets))
repeated_nodes.extend(targets)
repeated_nodes.extend([source] * k)
targets = _random_subset(repeated_nodes, k)
return G
And here’s the function that chooses a random subset of nodes without repetition. Because each node appears in repeated_nodes once for every edge it has, this selection is weighted by degree, which is what makes the attachment preferential.
def _random_subset(repeated_nodes, k):
"""Select a random subset of nodes without repeating.
repeated_nodes: list of nodes
k: size of set
returns: set of nodes
"""
targets = set()
while len(targets) < k:
x = random.choice(repeated_nodes)
targets.add(x)
return targets
I’ll generate a BA graph with the same number of nodes and edges as the Facebook data:
n = len(fb)
m = len(fb.edges())
k = int(round(m/n))
n, m, k
Providing a random seed means we’ll get the same graph every time.
ba = barabasi_albert_graph(n, k, seed=15)
The number of edges is pretty close to what we asked for.
len(ba), len(ba.edges()), len(ba.edges())/len(ba)
So the mean degree is about right.
np.mean(degrees(fb)), np.mean(degrees(ba))
The standard deviation of degree is pretty close, and much better than in the WS model.
np.std(degrees(fb)), np.std(degrees(ba))
Let’s take a look at the degree distribution.
pmf_ba = Pmf.from_seq(degrees(ba))
Looking at the PMFs on a linear scale, we see one difference: the BA model has no nodes with degree less than k, which is 22.
plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
pmf_fb.plot(label='Facebook', color='C0')
decorate(xlabel='Degree', ylabel='PMF')
plt.subplot(1,2,2)
pmf_ba.plot(label='BA graph', color='C2')
decorate(xlabel='Degree')
But if we look at the PMF on a log-log scale, the BA model looks pretty good for values bigger than about 20. And it seems to follow a power law.
plt.figure(figsize=(8,4))
options = dict(ls='', marker='.')
plt.subplot(1,2,1)
pmf_fb.plot(label='Facebook', color='C0', **options)
decorate(xlabel='Degree', ylabel='PMF',
xscale='log', yscale='log')
plt.subplot(1,2,2)
pmf_ba.plot(label='BA model', color='C2', **options)
decorate(xlabel='Degree',
xlim=[1, 1e4],
xscale='log', yscale='log')
savefig('figs/chap04-3')
The characteristic path length is even smaller in the model than in the data.
L, estimate_path_length(ba)
But the clustering coefficient isn’t even close.
C, average_clustering(ba)
In the BA model, the degree distribution is better than in the WS model, but the clustering coefficient is too low.
4.5. Cumulative distributions
Cumulative distributions are often a better way to visualize distributions like these. The following function computes the cumulative probability of x, that is, the total probability of all values less than or equal to x:
def cumulative_prob(pmf, x):
"""Computes the cumulative probability of `x`.
Total probability of all values <= x.
returns: float probability
"""
ps = [pmf[value] for value in pmf.qs if value<=x]
return np.sum(ps)
The total probability for all values up to and including 11 is 0.258, so the 25th percentile is about 11.
cumulative_prob(pmf_fb, 11)
The median degree is about 25.
cumulative_prob(pmf_fb, 25)
And the 75th percentile is about 57. That is, about 75% of users have 57 friends or fewer.
cumulative_prob(pmf_fb, 57)
empiricaldist provides Cdf, which computes cumulative distribution functions.
from empiricaldist import Cdf
Here are the degree CDFs for the Facebook data, the WS model, and the BA model.
cdf_fb = Cdf.from_seq(degrees(fb), name='Facebook')
cdf_ws = Cdf.from_seq(degrees(ws), name='WS model')
cdf_ba = Cdf.from_seq(degrees(ba), name='BA model')
If we plot them on a log-x scale, we get a sense of how well the models fit the central part of the distribution.
The WS model is hopeless. The BA model is ok for values above the median, but not very good for smaller values.
plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
cdf_fb.plot(color='C0')
cdf_ws.plot(color='C1', alpha=0.4)
decorate(xlabel='Degree', xscale='log',
ylabel='CDF')
plt.subplot(1,2,2)
cdf_fb.plot(color='C0', label='Facebook')
cdf_ba.plot(color='C2', alpha=0.4)
decorate(xlabel='Degree', xscale='log')
savefig('figs/chap04-4')
On a log-log scale, we plot the complementary CDF, which is 1 minus the CDF; if the tail follows a power law, it appears as a straight line on this scale. We see that the BA model fits the tail of the distribution reasonably well.
plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
(1 - cdf_fb).plot(color='C0')
(1 - cdf_ws).plot(color='C1', alpha=0.4)
decorate(xlabel='Degree', xscale='log',
ylabel='CCDF', yscale='log')
plt.subplot(1,2,2)
(1 - cdf_fb).plot(color='C0', label='Facebook')
(1 - cdf_ba).plot(color='C2', alpha=0.4)
decorate(xlabel='Degree', xscale='log',
yscale='log')
savefig('figs/chap04-5')
But there is certainly room for a model that does a better job of fitting the whole distribution.
4.6. Exercises
Exercise: Data files from the Barabasi and Albert paper are available from this web page. Their actor collaboration data is included in the repository for this book in a file named actor.dat.gz. The following function reads the file and builds the graph.
download('https://github.com/AllenDowney/ThinkComplexity2/raw/master/data/actor.dat.gz')
import gzip
def read_actor_network(filename, n=None):
"""Reads graph data from a file.
filename: string
n: int, number of lines to read (default is all)
"""
G = nx.Graph()
with gzip.open(filename) as f:
for i, line in enumerate(f):
nodes = [int(x) for x in line.split()]
G.add_edges_from(all_pairs(nodes))
if n and i >= n:
break
return G
def all_pairs(nodes):
"""Generates all pairs of nodes."""
for i, u in enumerate(nodes):
for j, v in enumerate(nodes):
if i < j:
yield u, v
Compute the number of actors in the graph and the number of edges.
Check whether this graph has the small world properties, high clustering and low path length.
Plot the PMF of degree on a log-log scale. Does it seem to follow a power law?
Also plot the CDF of degree on a log-x scale, to see the general shape of the distribution, and on a log-log scale, to see whether the tail follows a power law.
Note: The actor network is not connected, so you might want to use nx.connected_components to find connected subsets of the nodes.
# WARNING: if you run this with larger values of `n`, you
# might run out of memory, and Jupyter does not handle that well.
%time actors = read_actor_network('actor.dat.gz', n=10000)
len(actors)
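As a starting point, here’s a sketch that extracts the largest connected component; you can then reuse average_clustering and estimate_path_length on it.

# Find the largest connected component and make a subgraph from it.
largest_cc = max(nx.connected_components(actors), key=len)
component = actors.subgraph(largest_cc)
len(component), len(component.edges())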
Exercise: NetworkX provides a function called powerlaw_cluster_graph that implements the “Holme and Kim algorithm for growing graphs with powerlaw degree distribution and approximate average clustering”. Read the documentation of this function and see if you can use it to generate a graph that has the same number of nodes as the Facebook network, the same average degree, and the same clustering coefficient. How does the degree distribution in the model compare to the actual distribution?
# Again, here are the parameters of the Facebook data
n = len(fb)
m = len(fb.edges())
k = int(round(m / n))
n, m, k
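Here’s a sketch of one way to start: generate a Holme and Kim graph with k edges per new node and a triangle probability p, then compare its clustering coefficient and degree distribution to the data. The value of p below is only a guess; you will probably need to tune it.

# p is the probability of adding a triangle after each random edge;
# larger p gives higher clustering. This value is a guess.
hk = nx.powerlaw_cluster_graph(n, k, p=1.0, seed=15)
len(hk), len(hk.edges())

# Compare the clustering coefficient with the Facebook data.
C, average_clustering(hk)

# Compare the degree distributions on a log-log scale.
pmf_hk = Pmf.from_seq(degrees(hk))
pmf_fb.plot(label='Facebook', color='C0', ls='', marker='.')
pmf_hk.plot(label='HK model', color='C3', ls='', marker='.')
decorate(xlabel='Degree', ylabel='PMF', xscale='log', yscale='log')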