Sets#

Click here to run this chapter on Colab

Set operators and methods#

The following example is based on Luciano Ramalho’s talk, Set Practice: Learning from Python’s set type.

def fibonacci(stop):
    a, b = 0, 1
    while a < stop:
        yield a
        a, b = b, a + b
f = {n for n in fibonacci(10)}
f
{0, 1, 2, 3, 5, 8}
def primes(stop):
    m = {}
    q = 2
    while q < stop:
        if q not in m:
            yield q 
            m[q*q] = [q]
        else:
            for p in m[q]:
                m.setdefault(p+q, []).append(p)
            del m[q]
        q += 1
p = {n for n in primes(10)}
p
{2, 3, 5, 7}

Checking membership is constant time.

8 in f
True
8 in p
False

Intersection is like AND: it returns elements in f AND in p.

f & p
{2, 3, 5}

Union is like OR: it returns elements in f OR in p.

f | p
{0, 1, 2, 3, 5, 7, 8}

Symmetric difference is like XOR: elements from f OR p but not both.

f ^ p
{0, 1, 7, 8}

Here are the Fibonacci numbers that are not prime.

f - p
{0, 1, 8}

And the primes that are not Fibonacci numbers.

p - f
{7}

The comparison operators check for subset and superset relationships.

The Fibonacci numbers are not a superset of the primes.

f >= p
False

And the primes are not a superset of the Fibonacci numbers.

p >= f
False

In that sense, sets are not like numbers: they are only partially ordered.

f is a superset of {1, 2, 3}

f >= {1, 2, 3}
True
p >= {1, 2, 3}
False

Sets provide methods as well as operators. Why?

For one thing, the argument you pass to a method can be any iterable, not just a set.

try:
    f >= [1, 2, 3]
except TypeError as e:
    print(e)
'>=' not supported between instances of 'set' and 'list'
f.issuperset([1,2,3])
True

Methods also accept more than one argument:

f.union([1,2,3], (3,4,5), {5,6,7}, {7:'a', 8:'b'})
{0, 1, 2, 3, 4, 5, 6, 7, 8}

If you don’t have a set to start with, you can use an empty set.

set().union([1,2,3], (3,4,5), {5,6,7}, {7:'a', 8:'b'})
{1, 2, 3, 4, 5, 6, 7, 8}

One small syntax nuisance: {1, 2, 3} is a set, but {} is an empty dictionary.

Spelling Bee#

The New York Times Spelling Bee is a daily puzzle where the goal is to spell as many words as possible using only the given set of seven letters. For example, in a recent Spelling Bee, the available letters were dehiklo, so you could spell “like” and “hold”.

You can use each of the letters more than once, so “hook” and “deed” would be allowed, too.

To make it a little more interesting, one of the letters is special and must be included in every word. In this example, the special letter is o, so “hood” would be allowed, but not “like”.

Each word you find scores points depending on it’s length, which must be at least four letters. A word that uses all of the letters is called a “pangram” and scores extra points.

We’ll use this puzzle to explore the use of Python sets.

Suppose we’re given a word and we would like to know whether it can be spelled using only a given set of letters. The following function solves this problem using string operations.

def uses_only(word, letters):
    for letter in word:
        if letter not in letters:
            return False
    return True

If we find any letters in word that are not in the list of letters, we can return False immediately. If we get through the word without finding any unavailable letters, we can return True.

Let’s try it out with some examples. In a recent Spelling Bee, the available letters were dehiklo. Let’s see what we can spell with them.

letters = "dehiklo"
uses_only('lode', letters)
True
uses_only('implode', letters)
False

Exercise: It is possible to implement uses_only more concisely using set operations rather than list operations. Read the documentation of the set class and rewrite uses_only using sets.

uses_only('lode', letters)
True
uses_only('implode', letters)
False

Word list#

The following function downloads a list of about 100,000 words in American English.

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
download('https://github.com/AllenDowney/DSIRP/raw/main/american-english')

The file contains one word per line, so we can read the file and split it into a list of words like this:

word_list = open('american-english').read().split()
len(word_list)
102401

Exercise: Write a loop that iterates through this word list and prints only words

  • With at least four letters,

  • That can be spelled using only the letters dehiklo, and

  • That include the letter o.

Exercise: Now let’s check for pangrams. Write a function called uses_all that takes two strings and returns True if the first string uses all of the letters in the second string. Think about how to express this computation using set operations.

Test your function with at least one case that returns True and one that returns False.

Exercise: Modify the previous loop to search the word list for pangrams using uses_only and uses_all.

Or, as a bonus, write a function called uses_all_and_only that checks both conditions using a single set operation.

Leftovers#

So far we’ve been writing Boolean functions that test specific conditions, but if they return False, they don’t explain why. As an alternative to uses_only, we could write a function called bad_letters that takes a word and a set of letters, and returns a new string that contains the letters in words that are not available. Let’s call it bad_letters.

def bad_letters(word, letters):
    return set(word) - set(letters)

Now if we run this function with an illegal word, it will tell us which letters in the word are not available.

bad_letters('oilfield', letters)
{'f'}

Exercise: Write a function called unused_letters that takes a word and a set of letters and returns the subset of the letters that are not used in word.

Exercise: Write a function called no_duplicates that takes a string and returns True if each letter appears only once.

Data Structures and Information Retrieval in Python

Copyright 2021 Allen Downey

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International