The Dormouse's story

{ "cells": [ { "cell_type": "markdown", "id": "c905f415", "metadata": {}, "source": [ "# Depth First Search" ] }, { "cell_type": "markdown", "id": "42302f03", "metadata": {}, "source": [ "[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/notebooks/dfs.ipynb)" ] }, { "cell_type": "markdown", "id": "19dc5163", "metadata": {}, "source": [ "This notebook presents \"depth first search\" as a way to iterate through the nodes in a tree.\n", "This algorithm applies to any kind of tree, but since we need an example, we'll use BeautifulSoup, which is a Python module that reads HTML (and related languages) and builds a tree that represents the content." ] }, { "cell_type": "markdown", "id": "82af1897", "metadata": {}, "source": [ "## Using BeautifulSoup\n", "\n", "When you download a web page, the contents are written in HyperText Markup Language, aka HTML. \n", "For example, here is a minimal HTML document, which I borrowed from the [BeautifulSoup documentation](https://beautiful-soup-4.readthedocs.io). The text is from Lewis Carroll's [*Alice's Adventures in Wonderland*](https://www.gutenberg.org/files/11/11-h/11-h.htm)." ] }, { "cell_type": "code", "execution_count": 1, "id": "a777d368", "metadata": {}, "outputs": [], "source": [ "html_doc = \"\"\"\n", "The Dormouse's story\n", "\n", "

The Dormouse's story

\n", "\n", "

Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.

\n", "\n", "

...

\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "88220a73", "metadata": {}, "source": [ "Here's how we use BeautifulSoup to read it." ] }, { "cell_type": "code", "execution_count": 2, "id": "397bda6b", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "bs4.BeautifulSoup" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(html_doc)\n", "type(soup)" ] }, { "cell_type": "markdown", "id": "5313d704", "metadata": {}, "source": [ "The result is a `BeautifulSoup` object that represents the root of the tree. If we display the soup, it reproduces the HTML." ] }, { "cell_type": "code", "execution_count": 3, "id": "d9cd7d4a", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story\n", "\n", "

The Dormouse's story

\n", "

Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.

\n", "

...

\n", "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup" ] }, { "cell_type": "markdown", "id": "6edc272a", "metadata": {}, "source": [ "`prettify` uses indentation to show the structure of the document." ] }, { "cell_type": "code", "execution_count": 4, "id": "179128f7", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " The Dormouse's story\n", " \n", " \n", " \n", "

\n", " \n", " The Dormouse's story\n", " \n", "

\n", "

\n", " Once upon a time there were three little sisters; and their names were\n", " \n", " Elsie\n", " \n", " ,\n", " \n", " Lacie\n", " \n", " and\n", " \n", " Tillie\n", " \n", " ;\n", "and they lived at the bottom of a well.\n", "

\n", "

\n", " ...\n", "

\n", " \n", "\n" ] } ], "source": [ "print(soup.prettify())" ] }, { "cell_type": "markdown", "id": "1bd74af3", "metadata": {}, "source": [ "The `BeautifulSoup` object has a property called `children` that returns an iterator of the objects it contains." ] }, { "cell_type": "code", "execution_count": 5, "id": "0141df22", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.children" ] }, { "cell_type": "markdown", "id": "9ccafb05", "metadata": {}, "source": [ "We can use a for loop to iterate through them." ] }, { "cell_type": "code", "execution_count": 6, "id": "863d7d8c", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "for element in soup.children:\n", " print(type(element))" ] }, { "cell_type": "markdown", "id": "b6036c93", "metadata": {}, "source": [ "This soup contains only a single child, which is a `Tag`.\n", "\n", "`BeautifulSoup` also provides `contents`, which returns the children in the form of a list, which can be more convenient." ] }, { "cell_type": "code", "execution_count": 7, "id": "9bcb0f15", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story\n", " \n", "

The Dormouse's story

\n", "

Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.

\n", "

...

\n", " ]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.contents" ] }, { "cell_type": "markdown", "id": "26dfbfd3", "metadata": {}, "source": [ "The only child is an HTML element that contains the whole document.\n", "Let's get just this element:" ] }, { "cell_type": "code", "execution_count": 10, "id": "6d83d8f0", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story\n", "\n", "

The Dormouse's story

\n", "

Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.

\n", "

...

\n", "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "element = soup.contents[0]\n", "element" ] }, { "cell_type": "markdown", "id": "68b7b64d", "metadata": {}, "source": [ "The type of the element is `Tag`:" ] }, { "cell_type": "code", "execution_count": 11, "id": "6aae480b", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "bs4.element.Tag" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(element)" ] }, { "cell_type": "markdown", "id": "2f9b9cad", "metadata": {}, "source": [ "And the name of the tag is `html`." ] }, { "cell_type": "code", "execution_count": 12, "id": "e11a78d8", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "'html'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "element.name" ] }, { "cell_type": "markdown", "id": "d2e65f4e", "metadata": {}, "source": [ "Now let's get the children of this top-level element:" ] }, { "cell_type": "code", "execution_count": 13, "id": "c5a6fd88", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story,\n", " '\\n',\n", " \n", "

The Dormouse's story

\n", "

Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.

\n", "

...

\n", " ]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "children = element.contents\n", "children" ] }, { "cell_type": "markdown", "id": "60cd822d", "metadata": {}, "source": [ "There are three elements in this list, but it's hard to read because when you print an element, it prints all of the HTML.\n", "\n", "I'll use the following function to print elements in a simple form." ] }, { "cell_type": "code", "execution_count": 14, "id": "9860d35c", "metadata": {}, "outputs": [], "source": [ "from bs4 import Tag, NavigableString\n", "\n", "def print_element(element):\n", " if isinstance(element, Tag):\n", " print(f'{type(element).__name__}<{element.name}>')\n", " if isinstance(element, NavigableString):\n", " print(type(element).__name__)" ] }, { "cell_type": "code", "execution_count": 15, "id": "9cc4c70d", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tag\n" ] } ], "source": [ "print_element(element)" ] }, { "cell_type": "markdown", "id": "e4f38388", "metadata": {}, "source": [ "And the following function to print a list of elements." ] }, { "cell_type": "code", "execution_count": 16, "id": "6d73e622", "metadata": {}, "outputs": [], "source": [ "def print_element_list(element_list):\n", " print('[')\n", " for element in element_list:\n", " print_element(element)\n", " print(']')" ] }, { "cell_type": "code", "execution_count": 17, "id": "1f64d3bd", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\n", "Tag\n", "NavigableString\n", "Tag\n", "]\n" ] } ], "source": [ "print_element_list(element.contents)" ] }, { "cell_type": "markdown", "id": "d08fbd7f", "metadata": {}, "source": [ "Now let's try navigating the tree. I'll start with the first child of `element`." ] }, { "cell_type": "code", "execution_count": 18, "id": "c24c19c2", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tag\n" ] } ], "source": [ "child = element.contents[0]\n", "print_element(child)" ] }, { "cell_type": "markdown", "id": "5fb945ad", "metadata": {}, "source": [ "And print its children." ] }, { "cell_type": "code", "execution_count": 19, "id": "099c0497", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\n", "Tag\n", "]\n" ] } ], "source": [ "print_element_list(child.contents)" ] }, { "cell_type": "markdown", "id": "52e82d8c", "metadata": {}, "source": [ "Now let's get the first child of the first child." ] }, { "cell_type": "code", "execution_count": 20, "id": "cc81a5bf", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tag<title>\n" ] } ], "source": [ "grandchild = child.contents[0]\n", "print_element(grandchild)" ] }, { "cell_type": "code", "execution_count": 21, "id": "2896c8ad", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tag<title>\n" ] } ], "source": [ "grandchild = child.contents[0]\n", "print_element(grandchild)" ] }, { "cell_type": "markdown", "id": "0adf7398", "metadata": {}, "source": [ "And the first child of the first grandchild." ] }, { "cell_type": "code", "execution_count": 22, "id": "77d0f128", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NavigableString\n" ] } ], "source": [ "greatgrandchild = grandchild.contents[0]\n", "print_element(greatgrandchild)" ] }, { "cell_type": "code", "execution_count": 23, "id": "00e55c1c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AttributeError: 'NavigableString' object has no attribute 'contents'\n" ] } ], "source": [ "try:\n", " greatgrandchild.contents\n", "except AttributeError as e:\n", " print('AttributeError:', e)" ] }, { "cell_type": "code", "execution_count": 24, "id": "a7c6f553", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"The Dormouse's story\"" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "greatgrandchild" ] }, { "cell_type": "markdown", "id": "d7b74ee1", "metadata": {}, "source": [ "`NavigableString` has no children, so we've come to the end of the road.\n", "\n", "In order to continue, we would have to backtrack to the grandchild and select the second child.\n", "\n", "Which means we have to keep track of which elements we have seen, in order to pick up where we left off.\n", "\n", "That's what depth-first search does." ] }, { "cell_type": "markdown", "id": "a0565bab", "metadata": {}, "source": [ "## Depth-first search\n", "\n", "DFS starts at the root of the tree and selects the first child. If the\n", "child has children, it selects the first child again. When it gets to a\n", "node with no children, it backtracks, moving up the tree to the parent\n", "node, where it selects the next child if there is one; otherwise it\n", "backtracks again. When it has explored the last child of the root, it's\n", "done.\n", "\n", "There are two common ways to implement DFS, recursively and iteratively.\n", "The recursive implementation looks like this:" ] }, { "cell_type": "code", "execution_count": 25, "id": "6b3ebd50", "metadata": { "tags": [ "fill-in" ] }, "outputs": [], "source": [ "def recursive_DFS(element):\n", " if isinstance(element, NavigableString):\n", " print(element, end='')\n", " return\n", "\n", " for child in element.children:\n", " recursive_DFS(child)" ] }, { "cell_type": "code", "execution_count": 26, "id": "be3f2553", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Dormouse's story\n", "\n", "The Dormouse's story\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.\n", "...\n" ] } ], "source": [ "recursive_DFS(soup)" ] }, { "cell_type": "markdown", "id": "8bf2a76e", "metadata": {}, "source": [ "Here is an iterative version of DFS that uses a list to represent a stack of elements:" ] }, { "cell_type": "code", "execution_count": 27, "id": "8a2a2329", "metadata": { "tags": [ "fill-in" ] }, "outputs": [], "source": [ "def iterative_DFS(root):\n", " stack = [root]\n", " \n", " while(stack):\n", " element = stack.pop()\n", " if isinstance(element, NavigableString):\n", " print(element, end='')\n", " else:\n", " children = reversed(element.contents)\n", " stack.extend(children)" ] }, { "cell_type": "markdown", "id": "13fc8784", "metadata": {}, "source": [ "The parameter, `root`, is the root of the tree we want to traverse, so\n", "we start by creating the stack and pushing the root onto it.\n", "\n", "The loop continues until the stack is empty. Each time through, it pops\n", "a `PageElement` off the stack. If it gets a `NavigableString`, it prints the contents.\n", "Then it pushes the children onto the stack. In order to process the\n", "children in the right order, we have to push them onto the stack in\n", "reverse order.\n" ] }, { "cell_type": "code", "execution_count": 28, "id": "6c6dc39d", "metadata": { "tags": [ "fill-in" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Dormouse's story\n", "\n", "The Dormouse's story\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.\n", "...\n" ] } ], "source": [ "iterative_DFS(soup)" ] }, { "cell_type": "markdown", "id": "bbe16ef6", "metadata": {}, "source": [ "**Exercise:** Write a function similar to `PageElement.find` that takes a `PageElement` and a tag name and returns the first tag with the given name. You can write it iteratively or recursively.\n", "\n", "Here's how to check whether a `PageElement` is a `Tag`.\n", "\n", "```\n", "from bs4 import Tag\n", "isinstance(element, Tag)\n", "```" ] }, { "cell_type": "code", "execution_count": 29, "id": "53e04642", "metadata": {}, "outputs": [], "source": [ "def is_right_tag(element, tag_name):\n", " return (isinstance(element, Tag) and \n", " element.name == tag_name)" ] }, { "cell_type": "code", "execution_count": 30, "id": "395250e2", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def find(root, tag_name):\n", " stack = [root]\n", " \n", " while(stack):\n", " element = stack.pop()\n", " if is_right_tag(element, tag_name):\n", " return element\n", "\n", " children = getattr(element, 'contents', [])\n", " stack.extend(reversed(children))" ] }, { "cell_type": "code", "execution_count": 31, "id": "2129fc36", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "find(soup, \"p\")" ] }, { "cell_type": "code", "execution_count": 32, "id": "4692e57d", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "find(soup, \"a\")" ] }, { "cell_type": "code", "execution_count": 33, "id": "22c8e0a9", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "find(soup, \"not a tag\")" ] }, { "cell_type": "code", "execution_count": 34, "id": "29ba35e7", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def recursive_find(element, tag_name):\n", " if is_right_tag(element, tag_name):\n", " return element\n", "\n", " children = getattr(element, 'contents', [])\n", " for child in children:\n", " result = recursive_find(child, tag_name)\n", " if result:\n", " return result\n", " \n", " return None" ] }, { "cell_type": "code", "execution_count": 35, "id": "8b85a2cd", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "recursive_find(soup, 'p')" ] }, { "cell_type": "code", "execution_count": 36, "id": "3618a5b5", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "recursive_find(soup, 'a')" ] }, { "cell_type": "markdown", "id": "b0b935a4", "metadata": {}, "source": [ "**Exercise:** Write a generator function similar to `PageElement.find_all` that takes a `PageElement` and a tag name and yields all tags with the given name. You can write it iteratively or recursively." ] }, { "cell_type": "code", "execution_count": 37, "id": "e7694762", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def find_all(root, tag_name):\n", " stack = [root]\n", " \n", " while(stack):\n", " element = stack.pop()\n", " if is_right_tag(element, tag_name):\n", " yield element\n", "\n", " children = getattr(element, 'contents', [])\n", " stack.extend(reversed(children))" ] }, { "cell_type": "code", "execution_count": 38, "id": "99c31d24", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/plain": [ "<generator object find_all at 0x7f4cf810cac0>" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "it = find_all(soup, \"a\")\n", "it" ] }, { "cell_type": "code", "execution_count": 39, "id": "c2ca2c7e", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>\n", "<a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a>\n", "<a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n" ] } ], "source": [ "# Solution\n", "\n", "for tag in it:\n", " print(tag)" ] }, { "cell_type": "code", "execution_count": 40, "id": "689a7e73", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def recursive_find_all(element, tag_name):\n", " if is_right_tag(element, tag_name):\n", " yield element\n", "\n", " children = getattr(element, \"children\", [])\n", " for child in children:\n", " yield from recursive_find_all(child, tag_name)" ] }, { "cell_type": "code", "execution_count": 41, "id": "ed2e86d7", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>\n", "<a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a>\n", "<a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n" ] } ], "source": [ "# Solution\n", "\n", "it = recursive_find_all(soup, \"a\")\n", "\n", "for tag in it:\n", " print(tag)" ] }, { "cell_type": "markdown", "id": "00af821b", "metadata": {}, "source": [ "*Data Structures and Information Retrieval in Python*\n", "\n", "Copyright 2021 Allen Downey\n", "\n", "License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] }, { "cell_type": "code", "execution_count": null, "id": "b848aaaf", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" } }, "nbformat": 4, "nbformat_minor": 5 }