Variables and values
Data Science is the use of data to answer questions and guide decision making. For example, a topic of current debate is whether we should raise the minimum wage in the United States. Some economists think that raising the minimum wage would raise families out of poverty; others think it would cause more unemployment. But economic theory can only take us so far. At some point, we need data.
A successful data science project requires three elements:
A question: For example, what is the relationship between the minimum wage and unemployment?
Data: To answer this question, the best data would be results from a well designed experiment. But if we can’t get ideal data, we have to work with what we can get.
Methods: With the right data, simple methods are often enough to find answers and present them clearly. But sometimes we need more specialized tools.
In an ideal world, we would pose a question, find data, and choose the appropriate methods. More often, the process is iterative. We might start with one question, get stuck, and pivot to a different question. Or we might explore a new dataset and discover the questions it can answer. Or we might start with a tool and look for problems it can solve.
Most data science projects require flexibility and persistence.
The goal of this book is to give you the tools you need to execute a data science project from beginning to end, including these steps:
Choosing questions, data, and methods that go together.
Finding data or collecting it yourself.
Cleaning and validating data.
Exploring datasets, visualizing distributions and relationships between variables.
Modeling data and generating predictions.
Designing data visualizations that tell a compelling story.
Communicating results effectively.
We’ll start with basic programming concepts and work our way toward data science tools.
I won’t assume that you already know about programming, statistics, or data science. When I use a term, I try to define it immediately, and when I use a programming feature, I try to explain it clearly.
This book is in the form of Jupyter notebooks. Jupyter is a software development tool you can run in a web browser, so you don’t have to install any software. A Jupyter notebook is a document that contains text, Python code, and results. So you can read it like a book, but you can also modify the code, run it, develop new programs, and test them.
The notebooks contain exercises where you can practice what you learn. I encourage you to do the exercises as you go along.
The topics in this chapter are:
Using Jupyter to write and run Python code.
Basic programming features in Python: variables and values.
Translating formulas from math notation to Python.
Along the way, we’ll review a couple of math topics I assume you have seen before: logarithms and algebra.
Numbers
Python provides tools for working with many kinds of data, including numbers, words, dates, times, and locations (latitude and longitude).
Let’s start with numbers. Python can work with several types of numbers, but the two most common are:
int, which represents integer values like 3, and
float, which represents numbers that have a fraction part, like 3.14159.
Most often, we use int to represent counts and float to represent measurements.
Here’s an example of an int and a float:
3
3
3.14159
3.14159
float is short for “floating-point”, which is the name for the way these numbers are stored.
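If you are not sure which type Python has chosen for a number, one way to check is the built-in type function. This chapter does not otherwise use type or print, so treat the following as a small illustrative sketch:

```python
# type() is a built-in function that reports the type of a value.
print(type(3))        # <class 'int'>
print(type(3.14159))  # <class 'float'>
print(type(3.0))      # <class 'float'>: the decimal point makes it a float
```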
Exercise: Create a code cell below this one and type in the following number: 1.2345e3
Then run the cell. The output should be 1234.5
The e in 1.2345e3 stands for “exponent”. This way of writing numbers is a version of scientific notation that means \(1.2345 \times 10^{3}\). If you are not familiar with scientific notation, you might want to read about it before continuing.
Arithmetic
Python provides operators that perform arithmetic. The operators that perform addition and subtraction are + and -:
3 + 2 - 1
4
The operators that perform multiplication and division are * and /:
2 * 3
6
2 / 3
0.6666666666666666
And the operator for exponentiation is **:
2**3
8
Unlike math notation, Python does not allow “implicit multiplication”. For example, in math notation, if you write \(3 (2 + 1)\), that’s understood to be the same as \(3 \times (2+ 1)\). Python does not allow that notation.
Try running this code to see what error you get.
3 (2 + 1)
In this example, the error message is not very helpful, which is why I am warning you now. If you want to multiply, you have to use the * operator.
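Written with an explicit *, the expression from the math example runs without error. A minimal sketch:

```python
# Explicit multiplication: the * operator is required between 3 and (2 + 1).
print(3 * (2 + 1))  # 9
```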
The arithmetic operators follow the rules of precedence you might have learned as “PEMDAS”:
Parentheses before
Exponentiation before
Multiplication and division before
Addition and subtraction
So in this expression:
1 + 2 * 3
7
The multiplication happens first. If that’s not what you want, you can use parentheses to make the order of operations explicit:
(1 + 2) * 3
9
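The same rules apply to exponentiation, which comes before multiplication and also before a leading minus sign. The numbers in this sketch are arbitrary and chosen only to make the difference visible:

```python
# Exponentiation binds more tightly than multiplication.
print(2 * 3**2)    # 18, because 3**2 is computed first
print((2 * 3)**2)  # 36, because the parentheses force 2 * 3 first

# Exponentiation also binds more tightly than a leading minus sign.
print(-3**2)       # -9, read as -(3**2)
print((-3)**2)     # 9
```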
Exercise: Write a Python expression that raises 1+2 to the power 3*4. The answer should be 531441.
Math functions
Python provides functions that compute all the usual mathematical functions, like sin and cos, exp and log.
However, they are not part of Python itself; they are in a library, which is a collection of functions that supplement the Python language.
Actually, there are several libraries that provide math functions; the one we’ll use is called NumPy, which stands for “Numerical Python”, and is pronounced “num’ pie”.
Before you can use a library, you have to “import” it. Here’s how we import NumPy:
import numpy as np
It is conventional to import numpy as np, which means we can refer to it by the short name np rather than the longer name numpy.
Names like this are case-sensitive, which means that numpy is not the same as NumPy. So even though the name of the library is NumPy, when we import it we have to call it numpy.
Assuming the import succeeds, we can use np to read the value pi, which is an approximation of the mathematical constant \(\pi\).
np.pi
3.141592653589793
The result is a float with 16 digits. As you might know, we can’t represent \(\pi\) with a finite number of digits, so this result is only approximate.
numpy provides log, which computes the natural logarithm:
np.log(100)
4.605170185988092
It also provides exp, which raises the constant e to a power:
np.exp(1)
2.718281828459045
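Because log is the natural logarithm, np.log(100) is not 2. NumPy also has functions for other bases. This chapter does not use log10 or log2, so the following is only a sketch for comparison:

```python
import numpy as np

print(np.log(100))    # natural logarithm (base e), about 4.605
print(np.log10(100))  # base-10 logarithm, close to 2 because 10**2 is 100
print(np.log2(8))     # base-2 logarithm, close to 3 because 2**3 is 8
```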
Exercise: Use these functions to confirm the mathematical identity \(\log(e^x) = x\), which should be true for any value of \(x\).
With floating-point values, this identity should work for values of \(x\) between -700 and 700. What happens when you try it with larger and smaller values?
As this example shows, floating-point numbers are finite approximations, which means they don’t always behave like math.
As another example, let’s see what happens when you add up 0.1 three times:
0.1 + 0.1 + 0.1
0.30000000000000004
The result is close to 0.3, but not exact.
We’ll see other examples of floating-point approximation later, and learn some ways to deal with it.
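As a preview, one common way to cope with this is to compare floating-point values within a small tolerance instead of testing for exact equality. The sketch below uses the NumPy function isclose and the == comparison operator. Neither has been introduced yet, and later chapters may take a different approach:

```python
import numpy as np

# Exact comparison fails because the sum is off in the last digit.
print(0.1 + 0.1 + 0.1 == 0.3)            # False

# np.isclose compares within a small tolerance instead.
print(np.isclose(0.1 + 0.1 + 0.1, 0.3))  # True
```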
Variables
A variable is a name that refers to a value.
The following statement assigns the value 5 to a variable named x:
x = 5
The variable we just created has the name x and the value 5.
If a variable name appears at the end of a cell, Jupyter displays its value.
x
5
If we use x as part of an arithmetic operation, it represents the value 5:
x + 1
6
x**2
25
We can also use x with numpy functions:
np.exp(x)
148.4131591025766
Notice that the result from exp is a float, even though the value of x is an int.
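To see this for yourself, you can check the types directly. This sketch uses the built-in functions type and isinstance, which the chapter has not covered. The name result is only for illustration:

```python
import numpy as np

x = 5
result = np.exp(x)

print(type(x))                    # <class 'int'>
print(type(result))               # <class 'numpy.float64'>
print(isinstance(result, float))  # True: numpy.float64 is a kind of float
```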
Exercise: If you have not programmed before, one of the things you have to get used to is that programming languages are picky about details. Natural languages, like English, and semi-formal languages, like math notation, are more forgiving.
As an example, in math notation, parentheses and square brackets mean the same thing. You can write
\(\sin (\omega t)\)
or
\(\sin [\omega t]\)
Either one is fine. And you can leave out the parentheses altogether, as long as the meaning is clear:
\(\sin \omega t\)
In Python, every character counts. For example, the following are all different:
np.exp(x)
np.Exp(x)
np.exp[x]
np.exp x
While you are learning, I encourage you to make mistakes on purpose to see what goes wrong. Read the error messages carefully. Sometimes they are helpful and tell you exactly what’s wrong. Other times they can be misleading. But if you have seen the message before, you might remember some likely causes.
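If you would like to see the three incorrect versions fail without stopping the notebook, one possible approach is to catch each error and print its message. The try and except statements, the compile function, and the list named caught all go beyond this chapter, so read this as a hedged sketch rather than something you are expected to know yet:

```python
import numpy as np

x = 5
caught = []  # names of the errors we trigger, in order

# Wrong capitalization: the numpy module has no attribute called Exp.
try:
    np.Exp(x)
except AttributeError as err:
    caught.append("AttributeError")
    print("AttributeError:", err)

# Square brackets instead of parentheses: np.exp cannot be indexed.
try:
    np.exp[x]
except TypeError as err:
    caught.append("TypeError")
    print("TypeError:", err)

# Missing parentheses is a syntax error. Python rejects such code before
# running any of it, so we hand the text to compile() to trigger the error.
try:
    compile("np.exp x", "<example>", "eval")
except SyntaxError as err:
    caught.append("SyntaxError")
    print("SyntaxError:", err.msg)
```

The exact wording of each message can vary between versions of Python and NumPy, which is one more reason to read them carefully.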
Exercise: Search the NumPy documentation to find the function that computes square roots, and use it to compute a floating-point approximation of the golden ratio:
\(\phi = \frac{1 + \sqrt{5}}{2}\)
Hint: The result should be close to 1.618.
Calculation with variables
Now let’s use variables to solve a problem involving mathematical calculation. Suppose we have the following formula for computing compound interest from Wikipedia:
“The total accumulated value, including the principal sum \(P\) plus compounded interest \(I\), is given by the formula:
\(V=P\left(1+{\frac {r}{n}}\right)^{nt}\)
where:
\(P\) is the original principal sum
\(V\) is the total accumulated value
\(r\) is the nominal annual interest rate
\(n\) is the compounding frequency
\(t\) is the overall length of time the interest is applied (expressed using the same time units as \(r\), usually years).”
“Suppose a principal amount of $1,500 is deposited in a bank paying an annual interest rate of 4.3%, compounded quarterly. Then the balance after 6 years is found by using the formula above, with
P = 1500
r = 0.043
n = 4
t = 6
We can compute the total accumulated value by translating the mathematical formula into Python syntax:
P * (1 + r/n)**(n*t)
1938.8368221341054
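When a formula has several parts, it can help to compute it in steps and give each intermediate value a name. The sketch below repeats the same calculation. The names rate_per_period, num_periods, and growth_factor are my own and do not come from the Wikipedia formula:

```python
P = 1500   # principal, in dollars
r = 0.043  # nominal annual interest rate
n = 4      # compounding periods per year (quarterly)
t = 6      # length of time, in years

rate_per_period = r / n  # interest rate applied each quarter
num_periods = n * t      # total number of compounding periods
growth_factor = (1 + rate_per_period)**num_periods

V = P * growth_factor
print(V)  # should match the one-line version: about 1938.84
```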
Exercise: Continuing the example from Wikipedia:
“Suppose the same amount of $1,500 is compounded biennially”, so n = 1/2.
What would the total value be after 6 years? Hint: we expect the answer to be a bit less than the previous answer.
Exercise: If interest is compounded continuously, the value after time \(t\) is given by the formula:
\(V=P~e^{rt}\)
Translate this equation into Python and use it to compute the value of the investment in the previous example with continuous compounding. Hint: we expect the answer to be a bit more than the previous answers.
The point of this exercise is to practice using variables. But it is also a reminder about logarithms, which we will use extensively.
A little more Jupyter
Here are a few tips on using Jupyter to compute and display values.
Generally, if there is a single expression in a cell, Jupyter computes the value of the expression and displays the result.
For example, we’ve already seen how to display the value of np.pi:
np.pi
3.141592653589793
Here’s a more complex example with functions, operators, and numbers:
1 / np.sqrt(2 * np.pi) * np.exp(-3**2 / 2)
0.0044318484119380075
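To see how Python reads that expression, it can help to split it into named pieces. In this sketch the names exponent, coefficient, and value are hypothetical. Note that -3**2 means -(3**2), because exponentiation comes before the minus sign:

```python
import numpy as np

exponent = -3**2 / 2                  # -(3**2) / 2, which is -4.5
coefficient = 1 / np.sqrt(2 * np.pi)  # about 0.3989
value = coefficient * np.exp(exponent)

print(exponent)
print(value)  # about 0.00443, the same result as the one-line expression
```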
If you put more than one expression in a cell, Jupyter computes them all, but it only displays the result from the last:
1
2 + 3
np.exp(1)
(1 + np.sqrt(5)) / 2
1.618033988749895
If you want to display more than one value, you can separate them with commas:
1, 2 + 3, np.exp(1), (1 + np.sqrt(5)) / 2
(1, 5, 2.718281828459045, 1.618033988749895)
That result is actually a tuple, which you will learn about in the next chapter.
Here’s one last Jupyter tip: when you assign a value to a variable, Jupyter does not display the value:
phi = (1 + np.sqrt(5)) / 2
So it is idiomatic to assign a value to a variable and immediately display the result:
phi = (1 + np.sqrt(5)) / 2
phi
1.618033988749895
Exercise: Display the value of \(\phi\) and its inverse, \(1/\phi\), on a single line.