Introduction
Contents
Introduction#
Data Structures and Information Retrieval in Python is an introduction to data structures and algorithms using a web search engine as a motivating example. It is based in part on Think Data Structures, which uses Java.
The elements of the search engine are:
The Crawler, which downloads web pages and follows links to other pages,
The Indexer, which builds a map from each search term to the pages where it appears, and
The Retriever, which looks up search terms and finds relevant, high-quality pages.
The index is stored in Redis, which is a data store that provides structures like sets, lists, and hashmaps. The book presents each data structure first in Python, then in Redis, which should help readers see which features are essential and which are implementation details.
As I did with Think Bayes, I wrote this book entirely in Jupyter notebooks, and used JupyterBook to translate them to HTML. The notebooks run on Colab, which is a service provided by Google that runs notebooks in a browser. So you can read the book, run the code, and work on exercises without installing anything.
This material is a work in progress, so your feedback is welcome. The best way to provide that feedback is to click here and create an issue in this GitHub repository.
The notebooks#
Click on the links below to run the notebooks on Colab.
Algorithms: Day One activity checking for anagrams and finding anagram sets.
NotebookAnalysis: Introduction to the analysis of algorithms and Big O notation.
Slides
NotebookTiming: Checking asymptotic behavior by measuring run time.
NotebookQuiz 1
NotebookGenerator functions: Separate the iteration from the program logic
Slides
NotebookSet: Using Python sets to cheat at word games.
NotebookRecursion: Practice recursive functions.
NotebookQuiz 2
NotebookDepth First Search: Tree traversal in BeautifulSoup.
Slides
NotebookSearch: Linear search, bisection, and binary search trees.
Slides
NotebookHashmap: How the greatest of all data structures works.
Slides
NotebookQuiz 3
NotebookHeap: It’s an array, it’s a tree, it’s a PriorityQueue!
Slides
NotebookHuffman Code: Use the structures we’ve learned to make an optimal prefix code.
Slides
NotebookGetting to Philosophy: Follow Wikipedia links until you get to Philosophy.
Slides
NotebookQuiz 4
NotebookRedis: Introduction to the Redis data store.
NotebookLinked List: Trees before lists? Strange but true.
Slides
NotebookIndexer: Make a map of the internet for fast lookups.
NotebookQuiz 5
NotebookDeque: Like a linked list, but more so.
NotebookGraphs: Representing graphs with NetworkX.
NotebookLevel Order Search: Use the
os
module to traverse a file system.
Slides
NotebookBreadth-First Search: The foundation of graph algorithms.
Slides
NotebookQuiz 6
NotebookCrawler: Follow links and breadth-first search the internet.
NotebookMergesort: Divide, conquer, and merge in linearithmic time.
Slides
NotebookFast Fourier Transform: It’s like mergesort, but with complex numbers.
Slides
NotebookQuiz 7 Notebook
PageRank: Random walks, adjacency matrices, and eigenvectors!
Slides
Notebook
Copyright 2021 Allen B. Downey
License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)