The Dormouse's story

{ "cells": [ { "cell_type": "markdown", "id": "17fb04da", "metadata": {}, "source": [ "# Indexer" ] }, { "cell_type": "markdown", "id": "ab29bae1", "metadata": {}, "source": [ "[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/notebooks/indexer.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "id": "17c28d70", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", " local, _ = urlretrieve(url, filename)\n", " print('Downloaded ' + local)\n", " return filename" ] }, { "cell_type": "markdown", "id": "457822ea", "metadata": {}, "source": [ "[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/notebooks/indexer.ipynb)" ] }, { "cell_type": "markdown", "id": "e5cbc301", "metadata": {}, "source": [ "## Indexing the web\n", "\n", "In the context of web search, an index is a data structure that makes it possible to look up a search term and find the pages where that term appears. In addition, we would like to know how many times the search term appears on each page, which will help identify the pages most relevant to the term.\n", "\n", "For example, if a user submits the search terms \"Python\" and \"programming\", we would look up both search terms and get two sets of\n", "pages. Pages with the word \"Python\" would include pages about the species of snake and pages about the programming language. Pages\n", "with the word \"programming\" would include pages about different\n", "programming languages, as well as other uses of the word. By selecting\n", "pages with both terms, we hope to eliminate irrelevant pages and find\n", "the ones about Python programming." ] }, { "cell_type": "markdown", "id": "3da98834", "metadata": {}, "source": [ "In order to make an index, we'll need to iterate through the words in a document and count them.\n", "So that's where we'll start.\n", "\n", "Here's a minimal HTML document we have seen before, borrowed from the BeautifulSoup documentation." ] }, { "cell_type": "code", "execution_count": 2, "id": "32366201", "metadata": {}, "outputs": [], "source": [ "html_doc = \"\"\"\n", "The Dormouse's story\n", "\n", "

The Dormouse's story

\n", "\n", "

Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.

\n", "\n", "

...

\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "52fafdff", "metadata": {}, "source": [ "We can use `BeautifulSoup` to parse the text and make a DOM." ] }, { "cell_type": "code", "execution_count": 3, "id": "ca8689bc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.BeautifulSoup" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(html_doc)\n", "type(soup)" ] }, { "cell_type": "markdown", "id": "d55df8e4", "metadata": {}, "source": [ "The following is a generator function that iterates the elements of the DOM, finds the `NavigableString` objects, iterates through the words, and yields them one at a time.\n", "\n", "From each word, it removes the characters identified by the `string` module as whitespace or punctuation. " ] }, { "cell_type": "code", "execution_count": 4, "id": "e7b75e21", "metadata": {}, "outputs": [], "source": [ "from bs4 import NavigableString\n", "from string import whitespace, punctuation\n", "\n", "def iterate_words(soup):\n", " for element in soup.descendants:\n", " if isinstance(element, NavigableString):\n", " for word in element.string.split():\n", " word = word.strip(whitespace + punctuation)\n", " if word:\n", " yield word.lower()" ] }, { "cell_type": "markdown", "id": "d8e01b05", "metadata": {}, "source": [ "We can loop through the words like this:" ] }, { "cell_type": "code", "execution_count": 5, "id": "afd6efc8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "the\n", "dormouse's\n", "story\n", "the\n", "dormouse's\n", "story\n", "once\n", "upon\n", "a\n", "time\n", "there\n", "were\n", "three\n", "little\n", "sisters\n", "and\n", "their\n", "names\n", "were\n", "elsie\n", "lacie\n", "and\n", "tillie\n", "and\n", "they\n", "lived\n", "at\n", "the\n", "bottom\n", "of\n", "a\n", "well\n" ] } ], "source": [ "for word in iterate_words(soup):\n", " print(word)" ] }, { "cell_type": "markdown", "id": "3aa08a02", "metadata": {}, "source": [ "And count them like this." ] }, { "cell_type": "code", "execution_count": 6, "id": "319748e3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'the': 3,\n", " \"dormouse's\": 2,\n", " 'story': 2,\n", " 'once': 1,\n", " 'upon': 1,\n", " 'a': 2,\n", " 'time': 1,\n", " 'there': 1,\n", " 'were': 2,\n", " 'three': 1,\n", " 'little': 1,\n", " 'sisters': 1,\n", " 'and': 3,\n", " 'their': 1,\n", " 'names': 1,\n", " 'elsie': 1,\n", " 'lacie': 1,\n", " 'tillie': 1,\n", " 'they': 1,\n", " 'lived': 1,\n", " 'at': 1,\n", " 'bottom': 1,\n", " 'of': 1,\n", " 'well': 1})" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "\n", "counter = Counter(iterate_words(soup))\n", "counter" ] }, { "cell_type": "markdown", "id": "d4870c03", "metadata": {}, "source": [ "## Parsing Wikipedia\n", "\n", "Now let's do the same thing with the text of a Wikipedia page:" ] }, { "cell_type": "code", "execution_count": 7, "id": "a3d44c97", "metadata": {}, "outputs": [], "source": [ "url = \"https://en.wikipedia.org/wiki/Python_(programming_language)\"\n", "filename = download(url)" ] }, { "cell_type": "code", "execution_count": 8, "id": "98856065", "metadata": {}, "outputs": [], "source": [ "fp = open(filename)\n", "soup2 = BeautifulSoup(fp)" ] }, { "cell_type": "code", "execution_count": 9, "id": "48b223c6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 3),\n", " ('and', 3),\n", " (\"dormouse's\", 2),\n", " ('story', 2),\n", " ('a', 2),\n", " ('were', 2),\n", " ('once', 1),\n", " ('upon', 1),\n", " ('time', 1),\n", " ('there', 1)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counter = Counter(iterate_words(soup2))\n", "counter.most_common(10)" ] }, { "cell_type": "markdown", "id": "ccb436fa", "metadata": {}, "source": [ "As you might expect, the word \"python\" is one of the most common words on the Wikipedia page about Python.\n", "The word \"programming\" didn't make the top 10, but it also appears many times." ] }, { "cell_type": "code", "execution_count": 10, "id": "d28e1117", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counter['programming']" ] }, { "cell_type": "markdown", "id": "5c39a795", "metadata": {}, "source": [ "However, there are a number of common words, like \"the\" and \"from\" that also appear many times.\n", "Later, we'll come back and think about how to distinguish the words that really indicate what the page is about from the common words that appear on every page.\n", "\n", "But first, let's think about making an index." ] }, { "cell_type": "markdown", "id": "fbfd17ff", "metadata": {}, "source": [ "## Indexing\n", "\n", "An index is a map from a search word, like \"python\", to a collection of pages that contain the word.\n", "The collection should also indicate how many times the word appears on each page.\n", "\n", "We want the index to be persistent, so we'll store it on Redis.\n", "\n", "So what Redis type should we use?\n", "There are several options, but one reasonable choice is a hash for each word, where the fields are pages (represented by URL) and the values are counts.\n", "\n", "To manage the size of the index, we won't list a page for a given search word unless it appears at least three times." ] }, { "cell_type": "markdown", "id": "4e49495c", "metadata": {}, "source": [ "Let's get Redis started." ] }, { "cell_type": "code", "execution_count": 11, "id": "1cbe8a49", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "340987:C 20 Dec 2021 15:08:08.771 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo\r\n", "340987:C 20 Dec 2021 15:08:08.771 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=340987, just started\r\n", "340987:C 20 Dec 2021 15:08:08.771 # Configuration loaded\r\n" ] } ], "source": [ "import sys\n", "\n", "IN_COLAB = 'google.colab' in sys.modules\n", "\n", "if IN_COLAB:\n", " !pip install redis-server\n", " !/usr/local/lib/python*/dist-packages/redis_server/bin/redis-server --daemonize yes\n", "else:\n", " !redis-server --daemonize yes" ] }, { "cell_type": "markdown", "id": "aa23066c", "metadata": {}, "source": [ "And make sure the Redis client is installed." ] }, { "cell_type": "code", "execution_count": 12, "id": "68fecf23", "metadata": {}, "outputs": [], "source": [ "try:\n", " import redis\n", "except ImportError:\n", " !pip install redis" ] }, { "cell_type": "markdown", "id": "99acc3db", "metadata": {}, "source": [ "And let's make a `Redis` object that creates the connection to the Redis database." ] }, { "cell_type": "code", "execution_count": 13, "id": "5e35a4cd", "metadata": {}, "outputs": [], "source": [ "import redis\n", "\n", "r = redis.Redis()" ] }, { "cell_type": "markdown", "id": "74cd2e6d", "metadata": {}, "source": [ "If you have a Redis database running on a different machine, you can create a `Redis` object using the URL of the database, like this\n", "\n", "```\n", "url = 'redis://redistogo:example@dory.redistogo.com:10534/'\n", "r = redis.Redis.from_url(url)\n", "```" ] }, { "cell_type": "markdown", "id": "bc012027", "metadata": {}, "source": [ "**Exercise:** Write a function called `redis_index` that takes a URL and indexes it. It should download the web page with the given URL, iterate through the words, and make a `Counter` that maps from words to their frequencies.\n", "\n", "Then it should iterate through the words and add field-value pairs to Redis hashes.\n", "\n", "* The keys for the hashes should have the prefix `Index:`; for example, the key for the word `python` should be `Index:python`.\n", "\n", "* The fields in the hashes should be URLs. \n", "\n", "* The values in the hashes should be word counts.\n", "\n", "Use your function to index at least these two pages:" ] }, { "cell_type": "code", "execution_count": 14, "id": "8e9197eb", "metadata": {}, "outputs": [], "source": [ "url1 = 'https://en.wikipedia.org/wiki/Python_(programming_language)'\n", "url2 = 'https://en.wikipedia.org/wiki/Python_(genus)'" ] }, { "cell_type": "code", "execution_count": 15, "id": "cae0f2ea", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "def redis_index(url):\n", " filename = download(url)\n", " soup = BeautifulSoup(open(filename))\n", " counter = Counter(iterate_words(soup))\n", " for word, count in counter.items():\n", " if count >= 3:\n", " key = f'Index:{word}'\n", " # print(key, count)\n", " r.hset(key, url, count)" ] }, { "cell_type": "code", "execution_count": 16, "id": "a8c478d2", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "redis_index(url1)" ] }, { "cell_type": "code", "execution_count": 17, "id": "36dc8ba4", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "redis_index(url2)" ] }, { "cell_type": "markdown", "id": "cf6bb330", "metadata": {}, "source": [ "Use `hscan_iter` to iterate the field-values pairs in the index for the word `python`.\n", "Print the URLs of the pages where this word appears and the number of times it appears on each page." ] }, { "cell_type": "code", "execution_count": 18, "id": "f160da90", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'https://en.wikipedia.org/wiki/Reflective_programming' b'3'\n", "b'https://en.wikipedia.org/wiki/Functional_programming' b'10'\n", "b'https://en.wikipedia.org/wiki/Object-oriented_programming' b'6'\n", "b'https://en.wikipedia.org/wiki/Multi-paradigm_programming_language' b'3'\n", "b'https://en.wikipedia.org/wiki/Python_(genus)' b'60'\n", "b'https://en.wikipedia.org/wiki/Programming_paradigm' b'3'\n", "b'https://en.wikipedia.org/wiki/Python_(programming_language)' b'429'\n" ] } ], "source": [ "# Solution\n", "\n", "key = f'Index:python'\n", "\n", "for page, count in r.hscan_iter(key):\n", " print(page, count)" ] }, { "cell_type": "markdown", "id": "366c1c12", "metadata": {}, "source": [ "## Shutdown\n", "\n", "If you are running this notebook on your own computer, you can use the following command to shut down the Redis server.\n", "\n", "If you are running on Colab, it's not really necessary: the Redis server will get shut down when the Colab runtime shuts down (and everything stored in it will disappear)." ] }, { "cell_type": "code", "execution_count": 19, "id": "a2b82180", "metadata": {}, "outputs": [], "source": [ "!killall redis-server" ] }, { "cell_type": "markdown", "id": "3045db25", "metadata": {}, "source": [ "## RedisToGo\n", "\n", "[RedisToGo](https://redistogo.com) is a hosting service that provides remote Redis databases.\n", "They offer a free plan that includes a small database that is perfect for testing our indexer.\n", "\n", "If you sign up and go to your list of instances, you should find a URL that looks like this:\n", "\n", "```\n", "redis://redistogo:digitsandnumbers@dory.redistogo.com:10534/\n", "```\n", "\n", "If you pass this url to `Redis.from_url`, as described above, you should be able to connect to your database on RedisToGo and run your exercise solution again.\n", "\n", "And if you come back later and read the index, your data should still be there!" ] }, { "cell_type": "markdown", "id": "d6ac25cf", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "*Data Structures and Information Retrieval in Python*\n", "\n", "Copyright 2021 Allen Downey\n", "\n", "License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.1" } }, "nbformat": 4, "nbformat_minor": 5 }