Introduction#

Data Structures and Information Retrieval in Python is an introduction to data structures and algorithms using a web search engine as a motivating example. It is based in part on Think Data Structures, which uses Java.

The elements of the search engine are:

The Crawler, which downloads web pages and follows links to other pages,
The Indexer, which builds a map from each search term to the pages where it appears, and
The Retriever, which looks up search terms and finds relevant, high-quality pages.

The index is stored in Redis, which is a data store that provides structures like sets, lists, and hashmaps. The book presents each data structure first in Python, then in Redis, which should help readers see which features are essential and which are implementation details.

As I did with Think Bayes, I wrote this book entirely in Jupyter notebooks, and used JupyterBook to translate them to HTML. The notebooks run on Colab, which is a service provided by Google that runs notebooks in a browser. So you can read the book, run the code, and work on exercises without installing anything.

This material is a work in progress, so your feedback is welcome. The best way to provide that feedback is to click here and create an issue in this GitHub repository.

Overview slides

The notebooks#

Click on the links below to run the notebooks on Colab.

Algorithms: Day One activity checking for anagrams and finding anagram sets.
Notebook
Analysis: Introduction to the analysis of algorithms and Big O notation.
Slides
Notebook
Timing: Checking asymptotic behavior by measuring run time.
Notebook
Quiz 1
Notebook
Generator functions: Separate the iteration from the program logic
Slides
Notebook
Set: Using Python sets to cheat at word games.
Notebook
Recursion: Practice recursive functions.
Notebook
Quiz 2
Notebook
Depth First Search: Tree traversal in BeautifulSoup.
Slides
Notebook
Search: Linear search, bisection, and binary search trees.
Slides
Notebook
Hashmap: How the greatest of all data structures works.
Slides
Notebook
Quiz 3
Notebook
Heap: It’s an array, it’s a tree, it’s a PriorityQueue!
Slides
Notebook
Huffman Code: Use the structures we’ve learned to make an optimal prefix code.
Slides
Notebook
Getting to Philosophy: Follow Wikipedia links until you get to Philosophy.
Slides
Notebook
Quiz 4
Notebook
Redis: Introduction to the Redis data store.
Notebook
Linked List: Trees before lists? Strange but true.
Slides
Notebook
Indexer: Make a map of the internet for fast lookups.
Notebook
Quiz 5
Notebook
Deque: Like a linked list, but more so.
Notebook
Graphs: Representing graphs with NetworkX.
Notebook
Level Order Search: Use the os module to traverse a file system.
Slides
Notebook
Breadth-First Search: The foundation of graph algorithms.
Slides
Notebook
Quiz 6
Notebook
Crawler: Follow links and breadth-first search the internet.
Notebook
Mergesort: Divide, conquer, and merge in linearithmic time.
Slides
Notebook
Fast Fourier Transform: It’s like mergesort, but with complex numbers.
Slides
Notebook
Quiz 7 Notebook
PageRank: Random walks, adjacency matrices, and eigenvectors!
Slides
Notebook

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Data Structures and Information Retrieval in Python

Introduction

Contents

Introduction#

The notebooks#