{ "cells": [ { "cell_type": "markdown", "id": "feaedff8", "metadata": {}, "source": [ "# Getting to Philosophy" ] }, { "cell_type": "markdown", "id": "0b883d41", "metadata": {}, "source": [ "[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/notebooks/philosophy.ipynb)" ] }, { "cell_type": "markdown", "id": "f6a4a79b", "metadata": {}, "source": [ "# Getting to Philosophy\n", "\n", "The goal of this notebook is to develop a Web crawler that tests the\n", "\"Getting to Philosophy\" conjecture. As explained on [this Wikipedia page](https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy):\n", "\n", "> Clicking on the first link in the main text of an English Wikipedia article, and then repeating the process for subsequent articles, usually leads to the Philosophy article. In February 2016, this was true for 97% of all articles in Wikipedia...\n", "\n", "More specifically, the link can't be in parentheses or italics, and it can't be an external link, a link to the current page, or a link to a non-existent page.\n", "\n", "We'll use the `urllib` library to download Wikipedia pages and BeautifulSoup to parse HTML text and navigate the Document Object Model (DOM)." ] }, { "cell_type": "markdown", "id": "e075e092", "metadata": {}, "source": [ "Before we start working with Wikipedia pages, let's warm up with a minimal HTML document, which I've adapted from the BeautifulSoup documentation." ] }, { "cell_type": "code", "execution_count": 1, "id": "d55b02f9", "metadata": {}, "outputs": [], "source": [ "html_doc = \"\"\"\n", "
<html><head><title>The Dormouse's story</title></head>\n", "<body>\n", "<p class=\"title\"><b>The Dormouse's story</b></p>\n", "\n", "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n", "(<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">Elsie</a>),\n", "<i><a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a></i> and\n", "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n", "and they lived at the bottom of a well.</p>\n", "\n", "<p class=\"story\">...</p>
\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "edf78b69", "metadata": {}, "source": [ "This document contains three links, but the first one is in parentheses and the second is in italics, so the third is the link we would follow to get to philosophy.\n", "\n", "Here's how we parse this document and make a `BeautifulSoup` object." ] }, { "cell_type": "code", "execution_count": 2, "id": "18daf934", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.BeautifulSoup" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(html_doc)\n", "type(soup)" ] }, { "cell_type": "markdown", "id": "1c74c8fa", "metadata": {}, "source": [ "To iterate through the elements in the DOM, we can write our own implementation of depth first search, like this:" ] }, { "cell_type": "code", "execution_count": 3, "id": "65f13165", "metadata": {}, "outputs": [], "source": [ "def iterative_DFS(root):\n", " stack = [root]\n", " \n", " while(stack):\n", " element = stack.pop()\n", " yield element\n", "\n", " children = getattr(element, \"contents\", [])\n", " stack.extend(reversed(children))" ] }, { "cell_type": "markdown", "id": "48850017", "metadata": {}, "source": [ "For example, we can iterate through the elements and print all `NavigableString` elements:" ] }, { "cell_type": "code", "execution_count": 4, "id": "53eafee6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Dormouse's story\n", "\n", "The Dormouse's story\n", "Once upon a time there were three little sisters; and their names were\n", "(Elsie),\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.\n", "...\n" ] } ], "source": [ "from bs4 import NavigableString\n", "\n", "for element in iterative_DFS(soup):\n", " if isinstance(element, NavigableString):\n", " print(element.string, end='')" ] }, { "cell_type": "markdown", "id": "20b7800a", "metadata": {}, "source": [ "But we can also use `descendants`, which does the same thing." ] }, { "cell_type": "code", "execution_count": 5, "id": "625f449d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Dormouse's story\n", "\n", "The Dormouse's story\n", "Once upon a time there were three little sisters; and their names were\n", "(Elsie),\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.\n", "...\n" ] } ], "source": [ "for element in soup.descendants:\n", " if isinstance(element, NavigableString):\n", " print(element.string, end='')" ] }, { "cell_type": "markdown", "id": "24608925", "metadata": {}, "source": [ "## Checking for Parentheses\n", "\n", "One theory of software development suggests you should tackle the hardest problem first, because it will drive the design. Then you can figure out how to handle the easier problems.\n", "\n", "For \"Getting to Philosophy\", one of the harder problems is to figure out whether a link is in parentheses.\n", "If you have a link, you could work your way outward looking for enclosing parentheses, but in a tree, that could get complicated.\n", "\n", "The alternative I chose is to iterate through the text in order, counting open and close parentheses, and yield links only if they are not enclosed." 
] }, { "cell_type": "code", "execution_count": 6, "id": "aa828190", "metadata": { "tags": [ "fill-in" ] }, "outputs": [], "source": [ "from bs4 import Tag\n", "\n", "def link_generator(root):\n", " paren_stack = []\n", "\n", " for element in root.descendants:\n", " if isinstance(element, NavigableString):\n", " for char in element.string:\n", " if char == '(':\n", " paren_stack.append(char)\n", " if char == ')':\n", " paren_stack.pop()\n", "\n", " if isinstance(element, Tag) and element.name == \"a\":\n", " if len(paren_stack) == 0:\n", " yield element" ] }, { "cell_type": "markdown", "id": "52b32aff", "metadata": {}, "source": [ "Now we can iterate through the links that are not in parentheses." ] }, { "cell_type": "code", "execution_count": 7, "id": "d78b1c21", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lacie\n", "Tillie\n" ] } ], "source": [ "for link in link_generator(soup):\n", " print(link)" ] }, { "cell_type": "markdown", "id": "1054af68", "metadata": {}, "source": [ "## Checking for Italics\n", "\n", "To see whether a link is in italics, we can:\n", "\n", "1) If its parent is a `Tag` with name `a`, it's in italics.\n", "\n", "2) Otherwise we have to check the parent of the parent, and so on.\n", "\n", "3) If we get to the root without finding an italics tag, it's not in italics." ] }, { "cell_type": "markdown", "id": "4db33e84", "metadata": {}, "source": [ "For example, here's the first link from `link_generator`." ] }, { "cell_type": "code", "execution_count": 8, "id": "d35a16e5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Lacie" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "link = next(link_generator(soup))\n", "link" ] }, { "cell_type": "markdown", "id": "6bb6ee09", "metadata": {}, "source": [ "Its parent is an italics tag." ] }, { "cell_type": "code", "execution_count": 9, "id": "600997fb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parent = link.parent\n", "isinstance(parent, Tag)" ] }, { "cell_type": "code", "execution_count": 10, "id": "f6875673", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'i'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parent.name" ] }, { "cell_type": "markdown", "id": "fd523372", "metadata": {}, "source": [ "**Exercise:** Write a function called `in_italics` that takes an element and returns `True` if it is in italics." 
] }, { "cell_type": "code", "execution_count": 11, "id": "af48fbbe", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def in_italic(element):\n", " if isinstance(element, BeautifulSoup):\n", " return False\n", " if isinstance(element, Tag) and element.name=='i':\n", " return True\n", " return in_italic(element.parent)" ] }, { "cell_type": "code", "execution_count": 12, "id": "93f8f3c5", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tillie\n" ] } ], "source": [ "# Solution\n", "\n", "for link in link_generator(soup):\n", " if not in_italic(link):\n", " print(link)" ] }, { "cell_type": "markdown", "id": "212f4c4c", "metadata": {}, "source": [ "Then write a more general function called `in_bad_element` that takes an element and returns `True` if:\n", "\n", "* The element or one of its ancestors has a \"bad\" tag name, like `i`, or\n", "\n", "* The element or one of its ancestors is a `div` whose `class` attribute contains a \"bad\" class name.\n", "\n", "We will need the general version of this function to exclude invalid links on Wikipedia pages." ] }, { "cell_type": "code", "execution_count": 13, "id": "19c4e346", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "bad_tags = ['i', 'table']\n", "\n", "bad_classes = {'image', 'hatnote', 'internal', \n", " 'thumb', 'mw-disambig'}\n", "\n", "def in_bad_element(element, ):\n", " if isinstance(element, BeautifulSoup):\n", " return False\n", " if isinstance(element, Tag):\n", " if element.name in bad_tags:\n", " return True\n", " if element.name == 'div':\n", " class_list = element.get('class', [])\n", " inter = bad_classes.intersection(class_list)\n", " if inter:\n", " return True\n", " \n", " return in_bad_element(element.parent)" ] }, { "cell_type": "markdown", "id": "a049eea0", "metadata": {}, "source": [ "## Working with Wikipedia Pages\n", "\n", "Actual Wikipedia pages are more complicated that the simple example, so it will take some effort to understand their structure and make sure we select the right \"first link\".\n", "\n", "The following cell downloads the Wikipedia page on Python." ] }, { "cell_type": "code", "execution_count": 14, "id": "fdb4f5e6", "metadata": {}, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", " local, _ = urlretrieve(url, filename)\n", " print('Downloaded ' + local)" ] }, { "cell_type": "code", "execution_count": 15, "id": "95309299", "metadata": {}, "outputs": [], "source": [ "url = \"https://en.wikipedia.org/wiki/Python_(programming_language)\"\n", "download(url)" ] }, { "cell_type": "markdown", "id": "330f5df2", "metadata": {}, "source": [ "Now we can parse it and make `soup`." ] }, { "cell_type": "code", "execution_count": 16, "id": "10bb4b22", "metadata": {}, "outputs": [], "source": [ "filename = basename(url)\n", "fp = open(filename)\n", "soup2 = BeautifulSoup(fp)" ] }, { "cell_type": "markdown", "id": "3ec384e4", "metadata": {}, "source": [ "If you use a web browser to view this page, and use the Inspect Element tool to explore the structure, you'll see that the body of the article is in a `div` element with the class name `mw-body-content`.\n", "\n", "We can use `find` to get this element, and we'll use it as the root for our searches." 
] }, { "cell_type": "code", "execution_count": 17, "id": "31215d3b", "metadata": { "tags": [ "fill-in" ] }, "outputs": [], "source": [ "root = soup2.find(class_='mw-body-content')" ] }, { "cell_type": "markdown", "id": "3227b5a6", "metadata": {}, "source": [ "**Exercise:** Write a generator function called `valid_link_generator` that uses `link_generator` to find links that are not in parentheses; then it should filter out links that are not valid, including links that are in italics, links to external pages, etc.\n", "\n", "Test your function with a few different pages until it reliably finds the \"first link\" that seems most consistent with the spirit of the rules." ] }, { "cell_type": "code", "execution_count": 18, "id": "2c2e4729", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def valid_link_generator(root):\n", " for link in link_generator(root):\n", " if in_bad_element(link):\n", " continue\n", " \n", " href = link.get(\"href\", '')\n", " if not href.startswith('/wiki'):\n", " continue\n", " \n", " yield link" ] }, { "cell_type": "code", "execution_count": 19, "id": "f9fd3f47", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "interpreted" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "it = valid_link_generator(root)\n", "link = next(it)\n", "link" ] }, { "cell_type": "markdown", "id": "b14f73b3", "metadata": {}, "source": [ "## `WikiFetcher`\n", "\n", "When you write a Web crawler, it is easy to download too many pages too\n", "fast, which might violate the terms of service for the server you are\n", "downloading from. To avoid that, we'll use an object called\n", "`WikiFetcher` that does two things:\n", "\n", "1. It encapsulates the code for downloading and parsing web pages.\n", "\n", "2. It measures the time between requests and, if we don't leave enough\n", " time between requests, it sleeps until a reasonable interval has\n", " elapsed. 
By default, the interval is one second.\n", "\n", "Here's the definition of `WikiFetcher`:" ] }, { "cell_type": "code", "execution_count": 20, "id": "1e6c5654", "metadata": {}, "outputs": [], "source": [ "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "from time import time, sleep\n", "\n", "class WikiFetcher:\n", "    next_request_time = None\n", "    min_interval = 1  # second\n", "\n", "    def fetch_wikipedia(self, url):\n", "        self.sleep_if_needed()\n", "        fp = urlopen(url)\n", "        soup = BeautifulSoup(fp, 'html.parser')\n", "        return soup\n", "\n", "    def sleep_if_needed(self):\n", "        if self.next_request_time:\n", "            sleep_time = self.next_request_time - time()\n", "            if sleep_time > 0:\n", "                sleep(sleep_time)\n", "\n", "        self.next_request_time = time() + self.min_interval" ] }, { "cell_type": "markdown", "id": "252b6528", "metadata": {}, "source": [ "`fetch_wikipedia` takes a URL as a\n", "string and returns a BeautifulSoup object that represents the contents of the page.\n", "\n", "`sleep_if_needed` checks the time since the last\n", "request and sleeps if the elapsed time is less than `min_interval`.\n", "\n", "Here's an example that demonstrates how it's used:" ] }, { "cell_type": "code", "execution_count": 21, "id": "f97c92ea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1640031013.2612915\n", "1640031013.7938814\n", "1640031014.7832372\n" ] } ], "source": [ "wf = WikiFetcher()\n", "url = \"https://en.wikipedia.org/wiki/Python_(programming_language)\"\n", "\n", "print(time())\n", "wf.fetch_wikipedia(url)\n", "print(time())\n", "wf.fetch_wikipedia(url)\n", "print(time())" ] }, { "cell_type": "markdown", "id": "5335b9a6", "metadata": {}, "source": [ "If things have gone according to plan, the interval between the first and last\n", "timestamps should be at least one second, because `WikiFetcher` doesn't start the\n", "second request until a full second after the first one started. That throttling is\n", "consistent with the terms in Wikipedia's [robots.txt](https://en.wikipedia.org/robots.txt):\n", "\n", "> Friendly, low-speed bots are welcome viewing article pages, but not\n", "dynamically-generated pages please." ] }, { "cell_type": "markdown", "id": "66332ab6", "metadata": {}, "source": [ "**Exercise:** Now let's pull it all together. Write a function called `get_to_philosophy` that takes as a parameter the URL of a Wikipedia page. It should:\n", "\n", "1. Use the `WikiFetcher` object we just created to download and parse the page.\n", "\n", "2. Traverse the resulting `BeautifulSoup` object to find the first valid link according to the spirit of the rules.\n", "\n", "3. If the page has no links, or if the first link is a page we have already seen, the program should indicate failure and exit.\n", "\n", "4. If the link matches the URL of the Wikipedia page on philosophy, the program should indicate success and exit.\n", "\n", "5. 
Otherwise it should go back to Step 1 (although you might want to put a limit on the number of hops).\n", "\n", "The program should build a list of the URLs it visits and display the\n", "results at the end (whether it succeeds or fails).\n", "\n", "Since the links you find are relative, you might find the `urljoin` function helpful:" ] }, { "cell_type": "code", "execution_count": 22, "id": "ab913376", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://en.wikipedia.org/wiki/Interpreted_language'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from urllib.parse import urljoin\n", "\n", "url = \"https://en.wikipedia.org/wiki/Python_(programming_language)\"\n", "relative_path = \"/wiki/Interpreted_language\"\n", "\n", "urljoin(url, relative_path)" ] }, { "cell_type": "code", "execution_count": 23, "id": "39848591", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "target = 'https://en.wikipedia.org/wiki/Philosophy'\n", "\n", "def get_to_philosophy(url):\n", " visited = []\n", " \n", " for i in range(20):\n", " if url == target:\n", " print(f'Got there in {i} steps!')\n", " return visited\n", " \n", " if url in visited:\n", " raise ValueError(f'URL already visited {url}')\n", " else:\n", " print(url)\n", " visited.append(url)\n", " \n", " soup = wf.fetch_wikipedia(url)\n", " root = soup.find(class_='mw-body-content')\n", " link = next(valid_link_generator(root))\n", " url = urljoin(url, link['href'])\n", " \n", " return visited" ] }, { "cell_type": "code", "execution_count": 24, "id": "147b670c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://en.wikipedia.org/wiki/Python_(programming_language)\n", "https://en.wikipedia.org/wiki/Interpreted_language\n", "https://en.wikipedia.org/wiki/Computer_science\n", "https://en.wikipedia.org/wiki/Computation\n", "https://en.wikipedia.org/wiki/Calculation\n", "https://en.wikipedia.org/wiki/Arithmetic\n", "https://en.wikipedia.org/wiki/Mathematics\n", "https://en.wikipedia.org/wiki/Epistemology\n", "https://en.wikipedia.org/wiki/Outline_of_philosophy\n", "Got there in 9 steps!\n" ] }, { "data": { "text/plain": [ "['https://en.wikipedia.org/wiki/Python_(programming_language)',\n", " 'https://en.wikipedia.org/wiki/Interpreted_language',\n", " 'https://en.wikipedia.org/wiki/Computer_science',\n", " 'https://en.wikipedia.org/wiki/Computation',\n", " 'https://en.wikipedia.org/wiki/Calculation',\n", " 'https://en.wikipedia.org/wiki/Arithmetic',\n", " 'https://en.wikipedia.org/wiki/Mathematics',\n", " 'https://en.wikipedia.org/wiki/Epistemology',\n", " 'https://en.wikipedia.org/wiki/Outline_of_philosophy']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_to_philosophy(url)" ] }, { "cell_type": "markdown", "id": "d8a3d580", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "*Data Structures and Information Retrieval in Python*\n", "\n", "Copyright 2021 Allen Downey\n", "\n", "License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", 
"version": "3.10.1" } }, "nbformat": 4, "nbformat_minor": 5 }