Introduction

Data Structures and Information Retrieval in Python is an introduction to data structures and algorithms using a web search engine as a motivating example. It is based in part on Think Data Structures, which uses Java.

The elements of the search engine are:

  • The Crawler, which downloads web pages and follows links to other pages,

  • The Indexer, which builds a map from each search term to the pages where it appears, and

  • The Retriever, which looks up search terms and finds relevant, high-quality pages.

The index is stored in Redis, which is a data store that provides structures like sets, lists, and hashmaps. The book presents each data structure first in Python, then in Redis, which should help readers see which features are essential and which are implementation details.

As I did with Think Bayes, I wrote this book entirely in Jupyter notebooks, and used JupyterBook to translate them to HTML. The notebooks run on Colab, which is a service provided by Google that runs notebooks in a browser. So you can read the book, run the code, and work on exercises without installing anything.

This material is a work in progress, so your feedback is welcome. The best way to provide that feedback is to click here and create an issue in this GitHub repository.

Overview slides

The notebooks

Click on the links below to run the notebooks on Colab.

  • Algorithms: Day One activity checking for anagrams and finding anagram sets.
    Notebook

  • Analysis: Introduction to the analysis of algorithms and Big O notation.
    Slides
    Notebook

  • Timing: Checking asymptotic behavior by measuring run time.
    Notebook

  • Quiz 1
    Notebook

  • Generator functions: Separate the iteration from the program logic
    Slides
    Notebook

  • Set: Using Python sets to cheat at word games.
    Notebook

  • Recursion: Practice recursive functions.
    Notebook

  • Quiz 2
    Notebook

  • Depth First Search: Tree traversal in BeautifulSoup.
    Slides
    Notebook

  • Search: Linear search, bisection, and binary search trees.
    Slides
    Notebook

  • Hashmap: How the greatest of all data structures works.
    Slides
    Notebook

  • Quiz 3
    Notebook

  • Heap: It’s an array, it’s a tree, it’s a PriorityQueue!
    Slides
    Notebook

  • Huffman Code: Use the structures we’ve learned to make an optimal prefix code.
    Slides
    Notebook

  • Getting to Philosophy: Follow Wikipedia links until you get to Philosophy.
    Slides
    Notebook

  • Quiz 4
    Notebook

  • Redis: Introduction to the Redis data store.
    Notebook

  • Linked List: Trees before lists? Strange but true. Slides Notebook

  • Indexer: Make a map of the internet for fast lookups.
    Notebook

  • Quiz 5
    Notebook

  • Deque: Like a linked list, but more so.
    Notebook

  • Graphs: Representing graphs with NetworkX.
    Notebook

  • Level Order Search: Use the os module to traverse a file system.
    Slides
    Notebook

  • Breadth-First Search: The foundation of graph algorithms.
    Slides
    Notebook

  • Quiz 6
    Notebook

  • Crawler: Follow links and breadth-first search the internet.
    Notebook

  • Mergesort: Divide, conquer, and merge in linearithmic time.
    Slides
    Notebook

  • Fast Fourier Transform: It’s like mergesort, but with complex numbers.
    Slides
    Notebook

  • Quiz 7 Notebook

  • PageRank: Random walks, adjacency matrices, and eigenvectors!
    Slides
    Notebook

Copyright 2021 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)