Implementing Mapping Types#

Click here to run this chapter on Colab

Analysis of search algorithms#

A search is an algorithm that takes a collection and a target item and determines whether the target is in the collection, often returning the index of the target.

The simplest search algorithm is a “linear search”, which traverses the items of the collection in order, stopping if it finds the target. In the worst case it has to traverse the entire collection, so the run time is linear.

The in operator for sequences uses a linear search; so do string methods like find and count.

If the elements of the sequence are in order, you can use a bisection search, which is \(O(\log n)\). Bisection search is similar to the algorithm you might use to look a word up in a dictionary (a paper dictionary, not the data structure). Instead of starting at the beginning and checking each item in order, you start with the item in the middle and check whether the word you are looking for comes before or after. If it comes before, then you search the first half of the sequence. Otherwise you search the second half. Either way, you cut the number of remaining items in half.

If the sequence has 1,000,000 items, it will take about 20 steps to find the word or conclude that it’s not there. So that’s about 50,000 times faster than a linear search.

Bisection search can be much faster than linear search, but it requires the sequence to be in order, which might require extra work.

There is another data structure, called a hashtable that is even faster—it can do a search in constant time—and it doesn’t require the items to be sorted. Python dictionaries are implemented using hashtables, which is why most dictionary operations, including the in operator, are constant time.

LinearMap#

To explain how hashtables work and why their performance is so good, I start with a simple implementation of a map and gradually improve it until it’s a hashtable.

I use Python to demonstrate these implementations, but in real life you wouldn’t write code like this in Python; you would just use a dictionary! So this notebook, you have to imagine that dictionaries don’t exist and you want to implement a data structure that maps from keys to values.

The operations we’ll implement are:

  • add(k, v): Add a new item that maps from key k to value v. With a Python dictionary, d, this operation is written d[k] = v.

  • get(k): Look up and return the value that corresponds to key k. With a Python dictionary, d, this operation is written d[k] or d.get(k).

For now, I assume that each key only appears once.

Here’s a simple implementation of this interface using a list of tuples, where each tuple is a key-value pair.

items = []

key = 'a'
value = 1
items.append((key, value))

items.append(('b', 2))

target = 'b'
for k, v in items:
    if k == target:
        print(v)
2
items = []

def add(k, v):
    items.append((k, v))

add('a', 1)
add('b', 2)

def get(target):
    for key, val in items:
        if key == target:
            print(val)

get('b')
2
class LinearMap:
    def __init__(self):
        self.items = []
        
lmap = LinearMap()
lmap.items
[]
class LinearMap:

    def __init__(self):
        self.items = []

    def add(self, k, v):
        self.items.append((k, v))

    def get(self, target):
        for k, v in self.items:
            if k == target:
                return v
        raise KeyError(f'{target} not found')

__init__ creates a new map with an empty list of items, so that’s constant time.

add appends a key-value tuple to the list of items, which takes constant time.

get uses a for loop to search the list: if it finds the target key it returns the corresponding value; otherwise it raises a KeyError. So get is linear.

Let’s try out this implementation.

import string

lmap = LinearMap()

for i, c in enumerate(string.ascii_lowercase):
    lmap.add(c, i)

lmap.get('x')
23

An alternative is to keep the list sorted by key. Then get could use a bisection search, which is \(O(\log n)\). But inserting a new item in the middle of a list is linear, so this might not be the best option.

We could also use a binary search tree, which can implement add and get in log time, but that’s still not as good as constant time, so let’s move on.

BetterMap#

One way to improve LinearMap is to break the list of key-value pairs into smaller lists. Here’s an implementation called BetterMap, which is a list of 100 LinearMaps. As we’ll see in a second, the order of growth for get is still linear, but BetterMap is a step on the path toward hashtables:

class BetterMap:

    def __init__(self, n=100):
        self.maps = [LinearMap() for i in range(100)]

    def find_map(self, k):
        index = hash(k) % len(self.maps)
        return self.maps[index]

    def add(self, k, v):
        m = self.find_map(k)
        m.add(k, v)

    def get(self, k):
        m = self.find_map(k)
        return m.get(k)

__init__ makes a list of LinearMap objects.

find_map is used by add and get to figure out which map to put the new item in, or which map to search.

find_map uses the built-in function hash, which takes almost any Python object and returns an integer. A limitation of this implementation is that it only works with hashable keys. Mutable types like lists and dictionaries are unhashable.

Hashable objects that are considered equivalent return the same hash value, but the converse is not necessarily true: two objects with different values can return the same hash value.

find_map uses the modulus operator to wrap the hash values into the range from 0 to len(self.maps), so the result is a legal index into the list. Of course, this means that many different hash values will wrap onto the same index. But if the hash function spreads things out pretty evenly (which is what hash functions are designed to do), then we expect \(n/100\) items per LinearMap.

Let’s try it out:

bmap = BetterMap()

for i, c in enumerate(string.ascii_lowercase):
    bmap.add(c, i)

bmap.get('x')
23
for lmap in bmap.maps:
    print(len(lmap.items))
0
0
0
0
0
1
0
0
0
1
1
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
1
1
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
2
1
0
1
0
0
0
0
0
1
2
0
0
0
0
1
1
2
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
1

Since the run time of LinearMap.get is proportional to the number of items, we expect BetterMap to be about 100 times faster than LinearMap. The order of growth is still linear, but the leading coefficient is smaller.

Hash Functions#

BetterMap.find_map uses the built-in function hash, which takes any hashable object and returns an integer:

hash(object)

Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).

from fractions import Fraction

hash(2), hash(2.0), hash(2 + 0j), hash(Fraction(4, 2))
(2, 2, 2, 2)
t = 2,
hash(t)
6909455589863252355
try:
    hash([2])
except TypeError as e:
    print(e)
unhashable type: 'list'
hash('2')
133118177212449111

HashMap#

Here (finally) is the crucial idea that makes hashtables fast: if you can keep the maximum length of the LinearMaps bounded, LinearMap.get is constant time. All you have to do is keep track of the number of items and when the number of items per LinearMap exceeds a threshold, resize the hashtable by adding more LinearMaps.

Here is an implementation of a hashtable:

class HashMap:

    def __init__(self):
        self.bmap = BetterMap(2)
        self.num = 0

    def get(self, k):
        return self.bmap.get(k)

    def add(self, k, v):
        if self.num == len(self.bmap.maps):
            self.resize()

        self.bmap.add(k, v)
        self.num += 1

    def resize(self):
        new_bmap = BetterMap(len(self.bmap.maps) * 2)

        for m in self.bmap.maps:
            for k, v in m.items:
                new_bmap.add(k, v)

        self.bmap = new_bmap

__init__ creates a BetterMap and initializes num, which keeps track of the number of items.

get just invokes BetterMap.get, which uses find_map to figure out which LinearMap to search.

The real work happens in add, which checks the number of items and the size of the BetterMap: if they are equal, the average number of items per LinearMap is 1, so it calls resize.

resize makes a new BetterMap, twice as big as the previous one, and then “rehashes” the items from the old map to the new.

hmap = HashMap()

for i, c in enumerate(string.ascii_lowercase):
    hmap.add(c, i)

hmap.get('x')
23

Rehashing is necessary because changing the number of LinearMap objects changes the denominator of the modulus operator in find_map. That means that some objects that used to hash into the same LinearMap will get split up (which is what we wanted, right?).

Rehashing is linear, so resize is linear, which might seem bad, since I promised that add would be constant time. But remember that we don’t have to resize every time, so add is usually constant time and only occasionally linear. The total amount of work to run add \(n\) times is proportional to \(n\), so the average time of each add is constant time!

To see how this works, think about starting with an empty HashTable and adding a sequence of items. We start with 2 LinearMap objects, so the first 2 adds are fast (no resizing required). Let’s say that they take one unit of work each. The next add requires a resize, so we have to rehash the first two items (let’s call that 2 more units of work) and then add the third item (one more unit). Adding the next item costs 1 unit, so the total so far is 6 units of work for 4 items.

The next add costs 5 units, but the next three are only one unit each, so the total is 14 units for the first 8 adds.

The next add costs 9 units, but then we can add 7 more before the next resize, so the total is 30 units for the first 16 adds.

After 32 adds, the total cost is 62 units, and I hope you are starting to see a pattern. After \(n\) adds, where \(n\) is a power of two, the total cost is \(2n-2\) units, so the average work per add is a little less than 2 units. When \(n\) is a power of two, that’s the best case; for other values of \(n\) the average work is a little higher, but that’s not important. The important thing is that it is \(O(1)\).

The following figure shows how this works graphically. Each block represents a unit of work. The columns show the total work for each add in order from left to right: the first two adds cost 1 unit each, the third costs 3 units, etc.

The extra work of rehashing appears as a sequence of increasingly tall towers with increasing space between them. Now if you knock over the towers, spreading the cost of resizing over all adds, you can see graphically that the total cost after \(n\) adds is \(2n - 2\).

An important feature of this algorithm is that when we resize the HashTable it grows geometrically; that is, we multiply the size by a constant. If you increase the size arithmetically—adding a fixed number each time—the average time per add is linear.

Run Time#

For the implementation of a dictionary, a good hash function is one that spreads out the values so the number of items in each of the LinearMap objects is about the same.

In the worst case, if the hash function returns the same value for all objects, they would all be in the same LinearMap, and the get operation would be linear.

Hash functions can be expensive to compute, especially if the keys are large objects (like long strings, for example). So dictionaries are “fast” because the operations are constant time, but they can be “slow” because the leading constant is relatively high.

If the number of items in the dictionary is small, other implementations might be faster.

Exercise: What are the orders of growth for these two functions? Which one is faster when the words are 11 letters long?

from collections import Counter

def is_anagram2(word1, word2):
    return Counter(word1) == Counter(word2)
def is_anagram3(word1, word2):
    return sorted(word1) == sorted(word2)
%timeit is_anagram2('tachymetric', 'mccarthyite')
4.84 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit is_anagram3('tachymetric', 'mccarthyite')
758 ns ± 13.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Data Structures and Information Retrieval in Python

Copyright 2021 Allen Downey

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International