Getting to Philosophy#

The goal of this notebook is to develop a Web crawler that tests the “Getting to Philosophy” conjecture. As explained on this Wikipedia page:

Clicking on the first link in the main text of an English Wikipedia article, and then repeating the process for subsequent articles, usually leads to the Philosophy article. In February 2016, this was true for 97% of all articles in Wikipedia…

More specifically, the link can’t be in parentheses or italics, and it can’t be an external link, a link to the current page, or a link to a non-existent page.

We’ll use the urllib library to download Wikipedia pages and BeautifulSoup to parse HTML text and navigate the Document Object Model (DOM).

Before we start working with Wikipedia pages, let’s warm up with a minimal HTML document, which I’ve adapted from the BeautifulSoup documentation.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
(<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>),
<i><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</i>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

This document contains three links, but the first one is in parentheses and the second is in italics, so the third is the link we would follow to get to philosophy.

Here’s how we parse this document and make a BeautifulSoup object.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
type(soup)

bs4.BeautifulSoup

To iterate through the elements in the DOM, we can write our own implementation of depth first search, like this:

def iterative_DFS(root):
    stack = [root]
    
    while(stack):
        element = stack.pop()
        yield element

        children = getattr(element, "contents", [])
        stack.extend(reversed(children))

For example, we can iterate through the elements and print all NavigableString elements:

from bs4 import NavigableString

for element in iterative_DFS(soup):
    if isinstance(element, NavigableString):
        print(element.string, end='')

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
(Elsie),
Lacie and
Tillie;
and they lived at the bottom of a well.
...

But we can also use descendants, which does the same thing.

for element in soup.descendants:
    if isinstance(element, NavigableString):
        print(element.string, end='')

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
(Elsie),
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Checking for Parentheses#

One theory of software development suggests you should tackle the hardest problem first, because it will drive the design. Then you can figure out how to handle the easier problems.

For “Getting to Philosophy”, one of the harder problems is to figure out whether a link is in parentheses. If you have a link, you could work your way outward looking for enclosing parentheses, but in a tree, that could get complicated.

The alternative I chose is to iterate through the text in order, counting open and close parentheses, and yield links only if they are not enclosed.

from bs4 import Tag

def link_generator(root):
    paren_stack = []

    for element in root.descendants:
        if isinstance(element, NavigableString):
            for char in element.string:
                if char == '(':
                    paren_stack.append(char)
                if char == ')':
                    paren_stack.pop()

        if isinstance(element, Tag) and element.name == "a":
            if len(paren_stack) == 0:
                yield element

Now we can iterate through the links that are not in parentheses.

for link in link_generator(soup):
    print(link)

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Checking for Italics#

To see whether a link is in italics, we can:

If its parent is a Tag with name a, it’s in italics.
Otherwise we have to check the parent of the parent, and so on.
If we get to the root without finding an italics tag, it’s not in italics.

For example, here’s the first link from link_generator.

link = next(link_generator(soup))
link

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

Its parent is an italics tag.

parent = link.parent
isinstance(parent, Tag)

True

parent.name

'i'

Exercise: Write a function called in_italics that takes an element and returns True if it is in italics.

Then write a more general function called in_bad_element that takes an element and returns True if:

The element or one of its ancestors has a “bad” tag name, like i, or
The element or one of its ancestors is a div whose class attribute contains a “bad” class name.

We will need the general version of this function to exclude invalid links on Wikipedia pages.

Working with Wikipedia Pages#

Actual Wikipedia pages are more complicated that the simple example, so it will take some effort to understand their structure and make sure we select the right “first link”.

The following cell downloads the Wikipedia page on Python.

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
download(url)

Now we can parse it and make soup.

filename = basename(url)
fp = open(filename)
soup2 = BeautifulSoup(fp)

If you use a web browser to view this page, and use the Inspect Element tool to explore the structure, you’ll see that the body of the article is in a div element with the class name mw-body-content.

We can use find to get this element, and we’ll use it as the root for our searches.

root = soup2.find(class_='mw-body-content')

Exercise: Write a generator function called valid_link_generator that uses link_generator to find links that are not in parentheses; then it should filter out links that are not valid, including links that are in italics, links to external pages, etc.

Test your function with a few different pages until it reliably finds the “first link” that seems most consistent with the spirit of the rules.

`WikiFetcher`#

When you write a Web crawler, it is easy to download too many pages too fast, which might violate the terms of service for the server you are downloading from. To avoid that, we’ll use an object called WikiFetcher that does two things:

It encapsulates the code for downloading and parsing web pages.
It measures the time between requests and, if we don’t leave enough time between requests, it sleeps until a reasonable interval has elapsed. By default, the interval is one second.

Here’s the definition of WikiFetcher:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from time import time, sleep
    
class WikiFetcher:
    next_request_time = None
    min_interval = 1  # second

    def fetch_wikipedia(self, url):
        self.sleep_if_needed()
        fp = urlopen(url)
        soup = BeautifulSoup(fp, 'html.parser')
        return soup

    def sleep_if_needed(self):
        if self.next_request_time:
            sleep_time = self.next_request_time - time()    
            if sleep_time > 0:
                sleep(sleep_time)
        
        self.next_request_time = time() + self.min_interval

fetch_wikipedia takes a URL as a String and returns a BeautifulSoup object that represents the contents of the page.

sleep_if_needed checks the time since the last request and sleeps if the elapsed time is less than min_interval.

Here’s an example that demonstrates how it’s used:

wf = WikiFetcher()
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

print(time())
wf.fetch_wikipedia(url)
print(time())
wf.fetch_wikipedia(url)
print(time())

1640031013.2612915
1640031013.7938814
1640031014.7832372

If things have gone according to plan, the three timestamps should be no less than 1 second apart, which is consistent with the terms in Wikipedia’s robots.txt:

Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please.

Exercise: Now let’s pull it all together. Write a function called get_to_philosophy that takes as a parameter the URL of a Wikipedia page. It should:

Use the WikiFetcher object we just created to download and parse the page.
Traverse the resulting BeautifulSoup object to find the first valid link according to the spirit of the rules.
If the page has no links, or if the first link is a page we have already seen, the program should indicate failure and exit.
If the link matches the URL of the Wikipedia page on philosophy, the program should indicate success and exit.
Otherwise it should go back to Step 1 (although you might want to put a limit on the number of hops).

The program should build a list of the URLs it visits and display the results at the end (whether it succeeds or fails).

Since the links you find are relative, you might find the urljoin function helpful:

from urllib.parse import urljoin

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
relative_path = "/wiki/Interpreted_language"

urljoin(url, relative_path)

'https://en.wikipedia.org/wiki/Interpreted_language'

get_to_philosophy(url)

https://en.wikipedia.org/wiki/Python_(programming_language)
https://en.wikipedia.org/wiki/Interpreted_language
https://en.wikipedia.org/wiki/Computer_science
https://en.wikipedia.org/wiki/Computation
https://en.wikipedia.org/wiki/Calculation
https://en.wikipedia.org/wiki/Arithmetic
https://en.wikipedia.org/wiki/Mathematics
https://en.wikipedia.org/wiki/Epistemology
https://en.wikipedia.org/wiki/Outline_of_philosophy
Got there in 9 steps!

['https://en.wikipedia.org/wiki/Python_(programming_language)',
 'https://en.wikipedia.org/wiki/Interpreted_language',
 'https://en.wikipedia.org/wiki/Computer_science',
 'https://en.wikipedia.org/wiki/Computation',
 'https://en.wikipedia.org/wiki/Calculation',
 'https://en.wikipedia.org/wiki/Arithmetic',
 'https://en.wikipedia.org/wiki/Mathematics',
 'https://en.wikipedia.org/wiki/Epistemology',
 'https://en.wikipedia.org/wiki/Outline_of_philosophy']

Data Structures and Information Retrieval in Python

Getting to Philosophy

Contents

Getting to Philosophy#

Getting to Philosophy#

Checking for Parentheses#

Checking for Italics#

Working with Wikipedia Pages#

`WikiFetcher`#

Data Structures and Information Retrieval in Python

Getting to Philosophy

Contents

Getting to Philosophy#

Getting to Philosophy#

Checking for Parentheses#

Checking for Italics#

Working with Wikipedia Pages#

WikiFetcher#

`WikiFetcher`#