{ "cells": [ { "cell_type": "markdown", "id": "c905f415", "metadata": {}, "source": [ "# Depth First Search" ] }, { "cell_type": "markdown", "id": "42302f03", "metadata": {}, "source": [ "[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/notebooks/dfs.ipynb)" ] }, { "cell_type": "markdown", "id": "19dc5163", "metadata": {}, "source": [ "This notebook presents \"depth first search\" as a way to iterate through the nodes in a tree.\n", "This algorithm applies to any kind of tree, but since we need an example, we'll use BeautifulSoup, which is a Python module that reads HTML (and related languages) and builds a tree that represents the content." ] }, { "cell_type": "markdown", "id": "82af1897", "metadata": {}, "source": [ "## Using BeautifulSoup\n", "\n", "When you download a web page, the contents are written in HyperText Markup Language, aka HTML. \n", "For example, here is a minimal HTML document, which I borrowed from the [BeautifulSoup documentation](https://beautiful-soup-4.readthedocs.io). The text is from Lewis Carroll's [*Alice's Adventures in Wonderland*](https://www.gutenberg.org/files/11/11-h/11-h.htm)." ] }, { "cell_type": "code", "execution_count": 1, "id": "a777d368", "metadata": {}, "outputs": [], "source": [ "html_doc = \"\"\"\n", "
The Dormouse's story
\n", "\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.
\n", "\n", "...
\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "88220a73", "metadata": {}, "source": [ "Here's how we use BeautifulSoup to read it." ] }, { "cell_type": "code", "execution_count": 2, "id": "397bda6b", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "bs4.BeautifulSoup" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(html_doc)\n", "type(soup)" ] }, { "cell_type": "markdown", "id": "5313d704", "metadata": {}, "source": [ "The result is a `BeautifulSoup` object that represents the root of the tree. If we display the soup, it reproduces the HTML." ] }, { "cell_type": "code", "execution_count": 3, "id": "d9cd7d4a", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story
\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.
\n", "...
\n", "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup" ] }, { "cell_type": "markdown", "id": "6edc272a", "metadata": {}, "source": [ "`prettify` uses indentation to show the structure of the document." ] }, { "cell_type": "code", "execution_count": 4, "id": "179128f7", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", "\n", " \n", " The Dormouse's story\n", " \n", "
\n", "\n", " Once upon a time there were three little sisters; and their names were\n", " \n", " Elsie\n", " \n", " ,\n", " \n", " Lacie\n", " \n", " and\n", " \n", " Tillie\n", " \n", " ;\n", "and they lived at the bottom of a well.\n", "
\n", "\n", " ...\n", "
\n", " \n", "\n" ] } ], "source": [ "print(soup.prettify())" ] }, { "cell_type": "markdown", "id": "1bd74af3", "metadata": {}, "source": [ "The `BeautifulSoup` object has a property called `children` that returns an iterator of the objects it contains." ] }, { "cell_type": "code", "execution_count": 5, "id": "0141df22", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story
\n", "Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.
\n", "...
\n", " ]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.contents" ] }, { "cell_type": "markdown", "id": "26dfbfd3", "metadata": {}, "source": [ "The only child is an HTML element that contains the whole document.\n", "Let's get just this element:" ] }, { "cell_type": "code", "execution_count": 10, "id": "6d83d8f0", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story
\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.
\n", "...
\n", "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "element = soup.contents[0]\n", "element" ] }, { "cell_type": "markdown", "id": "68b7b64d", "metadata": {}, "source": [ "The type of the element is `Tag`:" ] }, { "cell_type": "code", "execution_count": 11, "id": "6aae480b", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "bs4.element.Tag" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(element)" ] }, { "cell_type": "markdown", "id": "2f9b9cad", "metadata": {}, "source": [ "And the name of the tag is `html`." ] }, { "cell_type": "code", "execution_count": 12, "id": "e11a78d8", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "'html'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "element.name" ] }, { "cell_type": "markdown", "id": "d2e65f4e", "metadata": {}, "source": [ "Now let's get the children of this top-level element:" ] }, { "cell_type": "code", "execution_count": 13, "id": "c5a6fd88", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story
\n", "Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.
\n", "...
\n", " ]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "children = element.contents\n", "children" ] }, { "cell_type": "markdown", "id": "60cd822d", "metadata": {}, "source": [ "There are three elements in this list, but it's hard to read because when you print an element, it prints all of the HTML.\n", "\n", "I'll use the following function to print elements in a simple form." ] }, { "cell_type": "code", "execution_count": 14, "id": "9860d35c", "metadata": {}, "outputs": [], "source": [ "from bs4 import Tag, NavigableString\n", "\n", "def print_element(element):\n", " if isinstance(element, Tag):\n", " print(f'{type(element).__name__}<{element.name}>')\n", " if isinstance(element, NavigableString):\n", " print(type(element).__name__)" ] }, { "cell_type": "code", "execution_count": 15, "id": "9cc4c70d", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tag\n" ] } ], "source": [ "print_element(element)" ] }, { "cell_type": "markdown", "id": "e4f38388", "metadata": {}, "source": [ "And the following function to print a list of elements." ] }, { "cell_type": "code", "execution_count": 16, "id": "6d73e622", "metadata": {}, "outputs": [], "source": [ "def print_element_list(element_list):\n", " print('[')\n", " for element in element_list:\n", " print_element(element)\n", " print(']')" ] }, { "cell_type": "code", "execution_count": 17, "id": "1f64d3bd", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\n", "Tag\n", "NavigableString\n", "Tag\n", "]\n" ] } ], "source": [ "print_element_list(element.contents)" ] }, { "cell_type": "markdown", "id": "d08fbd7f", "metadata": {}, "source": [ "Now let's try navigating the tree. I'll start with the first child of `element`." ] }, { "cell_type": "code", "execution_count": 18, "id": "c24c19c2", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tag\n" ] } ], "source": [ "child = element.contents[0]\n", "print_element(child)" ] }, { "cell_type": "markdown", "id": "5fb945ad", "metadata": {}, "source": [ "And print its children." ] }, { "cell_type": "code", "execution_count": 19, "id": "099c0497", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\n", "TagThe Dormouse's story
" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "find(soup, \"p\")" ] }, { "cell_type": "code", "execution_count": 32, "id": "4692e57d", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "Elsie" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "find(soup, \"a\")" ] }, { "cell_type": "code", "execution_count": 33, "id": "22c8e0a9", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "find(soup, \"not a tag\")" ] }, { "cell_type": "code", "execution_count": 34, "id": "29ba35e7", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def recursive_find(element, tag_name):\n", " if is_right_tag(element, tag_name):\n", " return element\n", "\n", " children = getattr(element, 'contents', [])\n", " for child in children:\n", " result = recursive_find(child, tag_name)\n", " if result:\n", " return result\n", " \n", " return None" ] }, { "cell_type": "code", "execution_count": 35, "id": "8b85a2cd", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story
" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "recursive_find(soup, 'p')" ] }, { "cell_type": "code", "execution_count": 36, "id": "3618a5b5", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "Elsie" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "recursive_find(soup, 'a')" ] }, { "cell_type": "markdown", "id": "b0b935a4", "metadata": {}, "source": [ "**Exercise:** Write a generator function similar to `PageElement.find_all` that takes a `PageElement` and a tag name and yields all tags with the given name. You can write it iteratively or recursively." ] }, { "cell_type": "code", "execution_count": 37, "id": "e7694762", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def find_all(root, tag_name):\n", " stack = [root]\n", " \n", " while(stack):\n", " element = stack.pop()\n", " if is_right_tag(element, tag_name):\n", " yield element\n", "\n", " children = getattr(element, 'contents', [])\n", " stack.extend(reversed(children))" ] }, { "cell_type": "code", "execution_count": 38, "id": "99c31d24", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "