Data Structures and Information Retrieval in Python

  • Introduction
  • Algorithms
  • Analysis of Algorithms
  • Testing Order of Growth
  • Quiz 1
  • Generators and Iterators
  • Sets
  • Recursion
  • Quiz 2
  • Depth First Search
  • Search
  • Implementing Mapping Types
  • Quiz 3
  • Priority Queues and Heaps
  • Huffman Code
  • Getting to Philosophy
  • Quiz 4
  • Redis
  • Linked List
  • Indexer
  • Quiz 5
  • Deque
  • Graphs
  • Level Order Traversal
  • Breadth First Search
  • Quiz 6
  • Crawler
  • Merge Sort
  • FFT
  • Quiz 7
  • PageRank
Powered by Jupyter Book
  • Binder
  • .ipynb
Contents
  • Using BeautifulSoup
  • Depth-first search

Depth First Search

Contents

  • Using BeautifulSoup
  • Depth-first search

Depth First Search#

Click here to run this chapter on Colab

This notebook presents “depth first search” as a way to iterate through the nodes in a tree. This algorithm applies to any kind of tree, but since we need an example, we’ll use BeautifulSoup, which is a Python module that reads HTML (and related languages) and builds a tree that represents the content.

Using BeautifulSoup#

When you download a web page, the contents are written in HyperText Markup Language, aka HTML. For example, here is a minimal HTML document, which I borrowed from the BeautifulSoup documentation. The text is from Lewis Carroll’s Alice’s Adventures in Wonderland.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Here’s how we use BeautifulSoup to read it.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
type(soup)
bs4.BeautifulSoup

The result is a BeautifulSoup object that represents the root of the tree. If we display the soup, it reproduces the HTML.

soup
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

prettify uses indentation to show the structure of the document.

print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

The BeautifulSoup object has a property called children that returns an iterator of the objects it contains.

soup.children
<list_iterator at 0x7f4cf816b850>

We can use a for loop to iterate through them.

for element in soup.children:
    print(type(element))
<class 'bs4.element.Tag'>

This soup contains only a single child, which is a Tag.

BeautifulSoup also provides contents, which returns the children in the form of a list, which can be more convenient.

soup.contents
[<html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 </body></html>]

The only child is an HTML element that contains the whole document. Let’s get just this element:

element = soup.contents[0]
element
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

The type of the element is Tag:

type(element)
bs4.element.Tag

And the name of the tag is html.

element.name
'html'

Now let’s get the children of this top-level element:

children = element.contents
children
[<head><title>The Dormouse's story</title></head>,
 '\n',
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 </body>]

There are three elements in this list, but it’s hard to read because when you print an element, it prints all of the HTML.

I’ll use the following function to print elements in a simple form.

from bs4 import Tag, NavigableString

def print_element(element):
    if isinstance(element, Tag):
        print(f'{type(element).__name__}<{element.name}>')
    if isinstance(element, NavigableString):
        print(type(element).__name__)
print_element(element)
Tag<html>

And the following function to print a list of elements.

def print_element_list(element_list):
    print('[')
    for element in element_list:
        print_element(element)
    print(']')
print_element_list(element.contents)
[
Tag<head>
NavigableString
Tag<body>
]

Now let’s try navigating the tree. I’ll start with the first child of element.

child = element.contents[0]
print_element(child)
Tag<head>

And print its children.

print_element_list(child.contents)
[
Tag<title>
]

Now let’s get the first child of the first child.

grandchild = child.contents[0]
print_element(grandchild)
Tag<title>
grandchild = child.contents[0]
print_element(grandchild)
Tag<title>

And the first child of the first grandchild.

greatgrandchild = grandchild.contents[0]
print_element(greatgrandchild)
NavigableString
try:
    greatgrandchild.contents
except AttributeError as e:
    print('AttributeError:', e)
AttributeError: 'NavigableString' object has no attribute 'contents'
greatgrandchild
"The Dormouse's story"

NavigableString has no children, so we’ve come to the end of the road.

In order to continue, we would have to backtrack to the grandchild and select the second child.

Which means we have to keep track of which elements we have seen, in order to pick up where we left off.

That’s what depth-first search does.

Depth-first search#

DFS starts at the root of the tree and selects the first child. If the child has children, it selects the first child again. When it gets to a node with no children, it backtracks, moving up the tree to the parent node, where it selects the next child if there is one; otherwise it backtracks again. When it has explored the last child of the root, it’s done.

There are two common ways to implement DFS, recursively and iteratively. The recursive implementation looks like this:

def recursive_DFS(element):
    if isinstance(element, NavigableString):
        print(element, end='')
        return

    for child in element.children:
        recursive_DFS(child)
recursive_DFS(soup)
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Here is an iterative version of DFS that uses a list to represent a stack of elements:

def iterative_DFS(root):
    stack = [root]
    
    while(stack):
        element = stack.pop()
        if isinstance(element, NavigableString):
            print(element, end='')
        else:
            children = reversed(element.contents)
            stack.extend(children)

The parameter, root, is the root of the tree we want to traverse, so we start by creating the stack and pushing the root onto it.

The loop continues until the stack is empty. Each time through, it pops a PageElement off the stack. If it gets a NavigableString, it prints the contents. Then it pushes the children onto the stack. In order to process the children in the right order, we have to push them onto the stack in reverse order.

iterative_DFS(soup)
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Exercise: Write a function similar to PageElement.find that takes a PageElement and a tag name and returns the first tag with the given name. You can write it iteratively or recursively.

Here’s how to check whether a PageElement is a Tag.

from bs4 import Tag
isinstance(element, Tag)
def is_right_tag(element, tag_name):
    return (isinstance(element, Tag) and 
            element.name == tag_name)

Exercise: Write a generator function similar to PageElement.find_all that takes a PageElement and a tag name and yields all tags with the given name. You can write it iteratively or recursively.

Data Structures and Information Retrieval in Python

Copyright 2021 Allen Downey

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

previous

Quiz 2

next

Search

By Allen B. Downey
© Copyright 2022.