4. Loops and Files
This chapter presents loops, which are used to perform repeated computation, and files, which are used to store data. As an example, we will download the famous book War and Peace and write a loop that reads the book and counts the words. This example presents some new computational tools – it is also an introduction to working with textual data.
4.1. Loops
One of the most important elements of computation is repetition, and the most common way to perform repetitive computations is a for loop.
As a simple example, suppose we want to display the elements of a tuple. Here’s a tuple of three integers:
t = (1, 2, 3)
And here’s a for loop that prints the elements.
for x in t:
    print(x)
1
2
3
The first line of the loop is a header that specifies the tuple, t, and a variable name, x. The tuple must already exist, but if x does not, the loop will create it. Note that the header ends with a colon (:).
Inside the loop is a print statement, which displays the value of x. So here’s what happens:
When the loop starts, it gets the first element of t, which is 1, and assigns it to x. It executes the print statement, which displays the value 1.
Then it gets the second element of t, which is 2, and displays it.
Then it gets the third element of t, which is 3, and displays it.
After printing the last element of the tuple, the loop ends. We can also loop through the letters in a string:
word = 'Data'
for letter in word:
    print(letter)
D
a
t
a
When the loop begins, word already exists, but letter does not.
Again, the loop creates letter and assigns values to it.
The variable created by the loop is called the loop variable.
You can give it any name you like; in this example, I chose letter to remind me what kind of value it contains.
After the loop ends, the loop variable contains the last value.
letter
'a'
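As a small sketch of the same behavior with a different sequence (the names seasons and season are made up for illustration and are not part of the chapter's example), the following loop runs over a tuple of strings, and the loop variable still holds the last element after the loop ends.

```python
# Hypothetical example data, not from the chapter.
seasons = ('winter', 'spring', 'summer', 'fall')

for season in seasons:
    # Each time through the loop, season is assigned the next element.
    print(season)

# After the loop ends, the loop variable keeps the last value it was assigned.
print(season)
```

The last print statement runs once, after the loop, because it is not indented.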
Exercise: Create a list called sequence with four elements of any type.
Write a for loop that prints the elements.
Call the loop variable element.
4.2. Counting with Loops
Inside a loop, it is common to use a variable to count the number of times something happens. We’ve already seen that you can create a variable and give it a value, like this:
count = 0
count
0
If you assign a different value to the same variable, the new value replaces the old one.
count = 1
count
1
You can increase the value of a variable by reading the old value, adding 1, and assigning the result back to the original variable.
count = count + 1
count
2
Increasing the value of a variable is called incrementing and decreasing the value is called decrementing. These operations are so common that there are special operators for them.
count += 1
count
3
In this example, the += operator reads the value of count, adds 1, and assigns the result back to count.
Python also provides -= and other update operators like *= and /=.
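Here is a brief sketch of the other update operators, using a made-up variable named total. Note that /= performs ordinary division, so the result is a floating-point number even when it divides evenly.

```python
total = 10    # hypothetical starting value

total -= 4    # same as total = total - 4, so total is 6
total *= 3    # same as total = total * 3, so total is 18
total /= 2    # same as total = total / 2, so total is 9.0

print(total)
```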
Exercise: The following is a number trick from the website Learn With Math Games:
Finding Someone’s Age
Ask the person to multiply the first number of their age by 5.
Tell them to add 3.
Now tell them to double this figure.
Finally, have the person add the second number of their age to the figure and have them tell you the answer.
Deduct 6 and you will have their age.
Test this algorithm using your age.
Use a single variable and update it using += and other update operators.
The original game is at https://www.learn-with-math-games.com/math-number-tricks.html
4.3. Files
Now that we know how to count, let’s see how to read words from a file. As an example, we’ll read a file that contains the text of Tolstoy’s famous novel, War and Peace. We can download it from Project Gutenberg, which is a repository of free books. Instructions are in the notebook for this chapter.
In order to read the contents of the file, you have to open it, which you can do with the open function.
fp = open('2600-0.txt')
fp
<_io.TextIOWrapper name='2600-0.txt' mode='r' encoding='UTF-8'>
The result is a TextIOWrapper, which is a type of file pointer.
It contains the name of the file, the mode (which is r for “reading”), and the encoding (which is UTF-8; UTF stands for “Unicode Transformation Format”).
A file pointer is like a bookmark – it keeps track of which parts of the file you have read.
If you use a file pointer in a for loop, it loops through the lines in the file.
So we can count the number of lines like this:
fp = open('2600-0.txt')
count = 0
for line in fp:
    count += 1
And then display the result.
count
66050
There are about 66,000 lines in this file.
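If you want to try the line-counting loop without downloading the book, the following sketch writes a tiny stand-in file first. The file name sample.txt and its three lines are placeholders, not the real text. It also opens the file in a with statement, which the chapter does not use; this is a common way to make sure the file is closed when the loop is done.

```python
import os
import tempfile

# Write a small stand-in file so the sketch runs without the download.
# The three lines of text are placeholders, not the real book.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write('first line\nsecond line\nthird line\n')

# Count the lines; the with statement closes the file when the block ends.
count = 0
with open(path, encoding='utf-8') as fp:
    for line in fp:
        count += 1

print(count)
```

The counting loop itself is the same as the one above; only the way the file is opened and closed differs.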
4.4. if Statements
if statements are used to check whether a condition is true and, depending on the result, perform different computations.
A condition is an expression whose value is either True or False.
For example, the following expression compares the final value of count to a number:
count > 60000
True
For War and Peace, the result is True.
We can use this condition in an if statement to display a message, or not, depending on the result.
if count > 60000:
    print('Long book!')
Long book!
The first line specifies the condition we’re checking for.
Like the header of a for statement, the first line of an if statement has to end with a colon.
If the condition is true, the indented statement runs; otherwise, it doesn’t.
In the previous example, the condition is true, so the print statement runs.
In the following example, the condition is false, so the print statement doesn’t run.
if count < 1000:
    print('Short book!')
We can put an if statement inside a for loop.
The following example only prints a line from the book when count is 0.
The other lines are read, but not displayed.
fp = open('2600-0.txt')
count = 0
for line in fp:
    if count == 0:
        print(line)
    count += 1
The Project Gutenberg EBook of War and Peace, by Leo Tolstoy
Notice that we use == to compare values and check if they are equal, not =, which is used in assignment statements. Also, notice the indentation in this example:
Statements inside the for loop are indented.
The statement inside the if statement is indented.
The statement count += 1 is outdented from the previous line, so it ends the if statement. But it is still inside the for loop.
It is legal in Python to use spaces or tabs for indentation, but the most common convention is to use four spaces, never tabs.
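To see why the indentation of the increment matters, here is a sketch that runs the loop both ways. It uses a short list of strings as a hypothetical stand-in for the lines of a file, and the name count_inside is made up for the second version.

```python
lines = ['one\n', 'two\n', 'three\n']  # hypothetical stand-in for lines of a file

# Version 1: the increment is inside the for loop but outside the if statement,
# so it runs once for every line.
count = 0
for line in lines:
    if count == 0:
        print(line)
    count += 1

# Version 2: the increment is inside the if statement,
# so it runs only while the condition is true.
count_inside = 0
for line in lines:
    if count_inside == 0:
        print(line)
        count_inside += 1

print(count, count_inside)
```

Both versions print only the first line, but only the first version ends up with the number of lines in count.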
4.5. The break Statement
If we display the final value of count, we see that the loop reads the entire file, but only prints one line:
count
66050
We can avoid reading the whole file by using a break statement, like this:
fp = open('2600-0.txt')
count = 0
for line in fp:
    print(line)
    count += 1
    if count == 1:
        break
The Project Gutenberg EBook of War and Peace, by Leo Tolstoy
The break statement ends the loop immediately, skipping the rest of the file, as we can confirm by checking the final value of count.
count
1
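The break statement works the same way with any sequence, not just files. As a sketch with made-up data (the names readings, value, and checked are hypothetical), the following loop stops at the first value greater than 10 and never looks at the elements after it.

```python
readings = (3, 8, 12, 5, 20)   # hypothetical data

checked = 0
for value in readings:
    checked += 1
    if value > 10:
        print('First value above 10:', value)
        break

# Only the first three elements were examined before the loop ended.
print(checked)
```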
Exercise: Write a loop that prints the first 5 lines of the file and then breaks out of the loop.
4.6. Whitespace
If we run the loop again and display the final value of line, we see the special sequence \n at the end.
fp = open('2600-0.txt')
count = 0
for line in fp:
    count += 1
    if count == 1:
        break
line
'The Project Gutenberg EBook of War and Peace, by Leo Tolstoy\n'
This sequence represents a single character, called a newline, that puts vertical space between lines.
If we use a print statement to display line, we don’t see the special sequence, but we do see extra space after the line.
print(line)
The Project Gutenberg EBook of War and Peace, by Leo Tolstoy
In other strings, you might see the sequence \t, which represents a tab character.
When you print a tab character, it adds enough space to make the next character appear in a column that is a multiple of 8.
print('|       ' * 6)
print('a\tbc\tdef\tghij\tklmno\tpqrstu')
|       |       |       |       |       |
a       bc      def     ghij    klmno   pqrstu
Newline characters, tabs, and spaces are called whitespace because when they are printed they leave white space on the page (assuming that the background color is white).
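One way to make whitespace visible is the built-in repr function, which shows escape sequences instead of the space they produce. The sketch below also uses the string method expandtabs, which replaces each tab with spaces up to the next multiple of 8, so the effect of a tab does not depend on how your screen displays it. Neither of these appears elsewhere in this chapter, and the string text is a made-up example.

```python
text = 'a\tbc\tdef\n'   # hypothetical string containing tabs and a newline

# repr shows the escape sequences instead of the whitespace they produce.
print(repr(text))

# Each escape sequence is one character: 1 + 1 + 2 + 1 + 3 + 1 = 9.
print(len(text))

# expandtabs replaces each tab with spaces up to the next multiple of 8.
expanded = text.expandtabs(8)
print(repr(expanded))
```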
4.7. Counting Words
So far we’ve counted the lines in a file – now let’s count the words.
To split a line into words, we can use a function called split that takes a string and returns a list of words.
To be more precise, split doesn’t actually know what a word is; it just splits the line wherever there’s a space or other whitespace character.
line.split()
['The',
'Project',
'Gutenberg',
'EBook',
'of',
'War',
'and',
'Peace,',
'by',
'Leo',
'Tolstoy']
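A short sketch shows what splitting on whitespace means in practice. The string messy is a made-up example with leading spaces, repeated spaces, a tab, and a newline. When split is called with no arguments, it treats any run of whitespace as a single separator, and punctuation stays attached to the neighboring word.

```python
messy = '  War   and\tPeace,\n'   # hypothetical string with mixed whitespace

words = messy.split()
print(words)
print(len(words))
```

The comma remains part of 'Peace,', just as it does in the output above.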
Notice that the syntax for split is different from other functions we have seen. Normally when we call a function, we name the function and provide values in parentheses. So you might have expected to write split(line). Sadly, that doesn’t work.
%%expect NameError
split(line)
NameError: name 'split' is not defined
The problem is that the split function belongs to the string line.
In a sense, the function is attached to the string, so we can only refer to it using the string and the dot operator, which is the period between line and split.
For historical reasons, functions like this are called methods.
Now that we can split a line into a list of words, we can use len to get the number of words in each list, and increment count accordingly.
fp = open('2600-0.txt')
count = 0
for line in fp:
    count += len(line.split())
count
566316
By this count, there are more than half a million words in War and Peace.
Actually, there aren’t quite that many, because the file we got from Project Gutenberg has some introductory material before the text and some license information at the end.
To mark the beginning and end of the text, the file includes special lines that begin with '***'.
We can identify these lines with the startswith method, which checks whether a string begins with a particular sequence of characters.
line = '*** START OF THIS PROJECT GUTENBERG EBOOK WAR AND PEACE ***'
line.startswith('***')
True
To skip the front matter, we can use a loop to read lines until it finds the first line that starts with this sequence.
Then we can use a second loop to read lines and count words until it finds the second line that starts with this sequence.
fp = open('2600-0.txt')
for line in fp:
    if line.startswith('***'):
        print(line)
        break
count = 0
for line in fp:
    if line.startswith('***'):
        print(line)
        break
    count += len(line.split())
*** START OF THIS PROJECT GUTENBERG EBOOK WAR AND PEACE ***
*** END OF THIS PROJECT GUTENBERG EBOOK WAR AND PEACE ***
When the second loop exits, count contains the number of words in the text.
count
563299
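The two-loop pattern works because the file pointer remembers its position: the second loop picks up at the line after the one where the first loop stopped. The following sketch shows this with a miniature made-up file. It uses io.StringIO from the standard library, which the chapter does not use, to make a string behave like an open file; the marker lines and text are placeholders.

```python
import io

# Hypothetical miniature file; StringIO lets us loop through it like an open file.
sample = io.StringIO(
    'front matter\n'
    '*** START ***\n'
    'Well, Prince,\n'
    'so Genoa and Lucca\n'
    '*** END ***\n'
    'license text\n'
)

# First loop: skip lines up to and including the start marker.
for line in sample:
    if line.startswith('***'):
        break

# Second loop: resumes at the line after the start marker
# and stops at the end marker without counting it.
count = 0
for line in sample:
    if line.startswith('***'):
        break
    count += len(line.split())

print(count)
```

Only the two lines between the markers are counted, so the front matter and the license text do not contribute to count.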
4.8. Summary
This chapter presents loops, if statements, and the break statement.
It also introduces tools for working with letters and words, and a simple kind of textual analysis, word counting.
In the next chapter we’ll continue this example, counting the number of unique words in a text and the number of times each word appears. And we’ll see another way to represent a collection of values, a Python dictionary.