Data Science From Scratch, 2nd Edition

Data Science from Scratch

SECOND EDITION
First Principles with Python

Joel Grus
Data Science from Scratch

by Joel Grus
Copyright © 2019 Joel Grus. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (http://oreilly.com/safari). For more information,
contact our corporate/institutional sales department: 800-998-9938 or [email protected].
Editor: Michele Cronin
Production Editor: Justin Billing
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
April 2015: First Edition
May 2019: Second Edition
Revision History for the Early Release
2018-12-10: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491901427 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science from Scratch,
the cover image of a Rock Ptarmigan, and related trade dress are trademarks of O’Reilly Media,
Inc.
While the publisher and the author have used good faith efforts to ensure that the information
and instructions contained in this work are accurate, the publisher and the author disclaim all
responsibility for errors or omissions, including without limitation responsibility for damages
resulting from the use of or reliance on this work. Use of the information and instructions
contained in this work is at your own risk. If any code samples or other technology this work
contains or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such licenses
and/or rights.
9781492041139
[LSI]
Chapter 1. Introduction
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
—Arthur Conan Doyle
The Ascendance of Data

We live in a world that’s drowning in data. Websites track every user’s every click. Your
smartphone is building up a record of your location and speed every second of every day.
“Quantified selfers” wear pedometers-on-steroids that are ever recording their heart rates,
movement habits, diet, and sleep patterns. Smart cars collect driving habits, smart homes collect
living habits, and smart marketers collect purchasing habits. The Internet itself represents a huge
graph of knowledge that contains (among other things) an enormous cross-referenced
encyclopedia; domain-specific databases about movies, music, sports results, pinball machines,
memes, and cocktails; and too many government statistics (some of them nearly true!) from too
many governments to wrap your head around.
Buried in these data are answers to countless questions that no one’s ever thought to ask. In this
book, we’ll learn how to find them.
What Is Data Science?

There’s a joke that says a data scientist is someone who knows more statistics than a computer
scientist and more computer science than a statistician. (I didn’t say it was a good joke.) In fact,
some data scientists are—for all practical purposes—statisticians, while others are pretty much
indistinguishable from software engineers. Some are machine-learning experts, while others
couldn’t machine-learn their way out of kindergarten. Some are PhDs with impressive
publication records, while others have never read an academic paper (shame on them, though).
In short, pretty much no matter how you define data science, you’ll find practitioners for whom
the definition is totally, absolutely wrong.
Nonetheless, we won’t let that stop us from trying. We’ll say that a data scientist is someone
who extracts insights from messy data. Today’s world is full of people trying to turn data into
insight.
For instance, the dating site OkCupid asks its members to answer thousands of questions in
order to find the most appropriate matches for them. But it also analyzes these results to figure
out innocuous-sounding questions you can ask someone to find out how likely someone is to
sleep with you on the first date.
Facebook asks you to list your hometown and your current location, ostensibly to make it easier
for your friends to find and connect with you. But it also analyzes these locations to identify
global migration patterns and where the fanbases of different football teams live.
As a large retailer, Target tracks your purchases and interactions, both online and in-store. And
it uses the data to predictively model which of its customers are pregnant, to better market
baby-related purchases to them.
In 2012, the Obama campaign employed dozens of data scientists who data-mined and
experimented their way to identifying voters who needed extra attention, choosing optimal
donor-specific fundraising appeals and programs, and focusing get-out-the-vote efforts where
they were most likely to be useful. And in 2016 the Trump campaign tested a staggering variety
of online ads and analyzed the data to find what worked and what didn’t.
Now, before you start feeling too jaded: some data scientists also occasionally use their skills
for good—using data to make government more effective, to help the homeless, and to improve
public health. But it certainly won’t hurt your career if you like figuring out the best way to get
people to click on advertisements.
Motivating Hypothetical: DataSciencester

Congratulations! You’ve just been hired to lead the data science efforts at DataSciencester, the
social network for data scientists.
NOTE
When I wrote the first edition of this book, I thought that “a social network for data
scientists” was a fun, silly hypothetical. Since then people have actually created
social networks for data scientists, and have raised much more money from venture
capitalists than I made from my book. Most likely there is a valuable lesson here
about silly data science hypotheticals and/or book publishing.
Despite being for data scientists, DataSciencester has never actually invested in building its own
data science practice. (In fairness, DataSciencester has never really invested in building its
product either.) That will be your job! Throughout the book, we’ll be learning about data
science concepts by solving problems that you encounter at work. Sometimes we’ll look at data
explicitly supplied by users, sometimes we’ll look at data generated through their interactions
with the site, and sometimes we’ll even look at data from experiments that we’ll design.
And because DataSciencester has a strong “not-invented-here” mentality, we’ll be building our
own tools from scratch. At the end, you’ll have a pretty solid understanding of the fundamentals
of data science. And you’ll be ready to apply your skills at a company with a less shaky
premise, or to any other problems that happen to interest you.
Welcome aboard, and good luck! (You’re allowed to wear jeans on Fridays, and the bathroom is
down the hall on the right.)
Finding Key Connectors
It’s your first day on the job at DataSciencester, and the VP of Networking is full of questions
about your users. Until now he’s had no one to ask, so he’s very excited to have you aboard.
In particular, he wants you to identify who the “key connectors” are among data scientists. To
this end, he gives you a dump of the entire DataSciencester network. (In real life, people don’t
typically hand you the data you need. Chapter 9 is devoted to getting data.)
What does this data dump look like? It consists of a list of users, each represented by a dict
that contains for each user his or her id (which is a number) and name (which, in one of the
great cosmic coincidences, rhymes with the user’s id):
users = [
    { "id": 0, "name": "Hero" },
    { "id": 1, "name": "Dunn" },
    { "id": 2, "name": "Sue" },
    { "id": 3, "name": "Chi" },
    { "id": 4, "name": "Thor" },
    { "id": 5, "name": "Clive" },
    { "id": 6, "name": "Hicks" },
    { "id": 7, "name": "Devin" },
    { "id": 8, "name": "Kate" },
    { "id": 9, "name": "Klein" }
]
He also gives you the “friendship” data, represented as a list of pairs of IDs:
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
(4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
For example, the tuple (0, 1) indicates that the data scientist with id 0 (Hero) and the data
scientist with id 1 (Dunn) are friends. The network is illustrated in Figure 1-1.
Figure 1-1. The DataSciencester network
Having friendships represented as a list of pairs is not the easiest way to work with them. To
find all the friendships for user 1, I have to iterate over every pair looking for pairs containing 1.
If I had a lot of pairs, this would take a long time.
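To make that cost concrete, here is a minimal sketch of the pair-scanning lookup just described. The helper name friends_of_slow is hypothetical (it is not part of the book's code), and it has to walk the whole list of pairs on every call:

```python
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

def friends_of_slow(user_id):
    """Hypothetical helper: find a user's friends by scanning every pair."""
    friends = []
    for i, j in friendship_pairs:
        if i == user_id:
            friends.append(j)      # user is the first element of the pair
        elif j == user_id:
            friends.append(i)      # user is the second element of the pair
    return friends

print(friends_of_slow(1))   # [0, 2, 3]
```

Every lookup touches all twelve pairs here, which is harmless at this size but adds up quickly on a larger network.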
Instead, let’s create a dict where the keys are user ids and the values are lists of friend ids.
(Looking things up in a dict is very fast.)
NOTE
Don’t get too hung up on the details of the code right now. In Chapter 2, we’ll take
you through a crash course in Python. For now just try to get the general flavor of
what we’re doing.
We’ll still have to look at every pair to create the dict, but we only have to do that once, and
we’ll get cheap lookups after that.
# Initialize the dict with an empty list for each user id:
friendships = {user["id"]: [] for user in users}

# And loop over the friendship pairs to populate it:
for i, j in friendship_pairs:
    friendships[i].append(j)  # add j as a friend of user i
    friendships[j].append(i)  # add i as a friend of user j
Now that we have the friendships in a dict, we can easily ask questions of our graph, like
“what’s the average number of connections?”
First we find the total number of connections, by summing up the lengths of all the friends
lists:
def number_of_friends(user):
    """How many friends does _user_ have?"""
    user_id = user["id"]
    friend_ids = friendships[user_id]
    return len(friend_ids)

total_connections = sum(number_of_friends(user)
                        for user in users)        # 24
And then we just divide by the number of users:
num_users = len(users)                            # length of the users list
avg_connections = total_connections / num_users   # 24 / 10 == 2.4
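As a sanity check on that 24, note that each friendship pair adds one entry to two different friend lists, so the total should be exactly twice the number of pairs. A minimal sketch, assuming the same twelve pairs and ten users as above:

```python
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
num_users = 10

# Each pair (i, j) is counted once for i and once for j.
total_connections = 2 * len(friendship_pairs)     # 2 * 12 == 24
avg_connections = total_connections / num_users   # 24 / 10 == 2.4

print(total_connections, avg_connections)         # 24 2.4
```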
It’s also easy to find the most connected people—they’re the people who have the largest
number of friends.
Since there aren’t very many users, we can simply sort them from “most friends” to “least
friends”:
# Create a list (user_id, number_of_friends).
num_friends_by_id = [(user["id"], number_of_friends(user))
                     for user in users]
sorted(num_friends_by_id,                               # Sort the list
       key=lambda id_and_friends: id_and_friends[1],    # by num_friends
       reverse=True)                                    # largest to smallest
# Each pair is (user_id, num_friends):
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3),
#  (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
One way to think of what we’ve done is as a way of identifying people who are somehow
central to the network. In fact, what we’ve just computed is the network metric degree centrality
(Figure 1-2).
Figure 1-2. The DataSciencester network sized by degree
This has the virtue of being pretty easy to calculate, but it doesn’t always give the results you’d
want or expect. For example, in the DataSciencester network Thor (id 4) only has two
connections while Dunn (id 1) has three. Yet looking at the network it intuitively seems like
Thor should be more central. In Chapter 21, we’ll investigate networks in more detail, and we’ll
look at more complex notions of centrality that may or may not accord better with our intuition.
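One rough way to put numbers behind that intuition about Thor is to ask what happens to the network when a single user is removed. The sketch below is only an illustration, not one of the centrality metrics from Chapter 21, and the helper name is_connected_without is hypothetical. It uses a breadth-first search to check whether the remaining users can all still reach one another:

```python
from collections import deque

friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
user_ids = range(10)

def is_connected_without(removed_id):
    """Hypothetical check: can the remaining users still all reach each other?"""
    remaining = [u for u in user_ids if u != removed_id]
    neighbors = {u: [] for u in remaining}
    for i, j in friendship_pairs:
        if i != removed_id and j != removed_id:
            neighbors[i].append(j)
            neighbors[j].append(i)

    # Breadth-first search starting from an arbitrary remaining user.
    seen = {remaining[0]}
    queue = deque([remaining[0]])
    while queue:
        current = queue.popleft()
        for neighbor in neighbors[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(remaining)

print(is_connected_without(1))   # True: without Dunn, everyone is still reachable
print(is_connected_without(4))   # False: without Thor, the network splits in two
```

Dunn has more friends, but only Thor's removal cuts the network apart, which is the kind of structure that degree centrality cannot see.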
Data Scientists You May Know
While you’re still filling out new-hire paperwork, the VP of Fraternization comes by your desk.
She wants to encourage more connections among your members, and she asks you to design a
“Data Scientists You May Know” suggester.
Your first instinct is to suggest that a user might know the friends of their friends. These are
easy to compute: for each of a user’s friends, iterate over that person’s friends, and collect all
the results:
def foaf_ids_bad(user):
    """foaf is short for "friend of a friend" """
    return [foaf_id
            for friend_id in friendships[user["id"]]
            for foaf_id in friendships[friend_id]]
When we call this on users[0] (Hero), it produces:
[0, 2, 3, 0, 1, 3]
It includes user 0 (twice), since Hero is indeed friends with both of his friends. It includes users
1 and 2, although they are both friends with Hero already. And it includes user 3 twice, as Chi is
reachable through two different friends:
print(friendships[0]) # [1, 2]
print(friendships[1]) # [0, 2, 3]
print(friendships[2]) # [0, 1, 3]
Knowing that people are friends-of-friends in multiple ways seems like interesting information,
so maybe instead we should produce a count of mutual friends. And we should probably
exclude people already known to the user:
from collections import Counter                   # not loaded by default

def friends_of_friends(user):
    user_id = user["id"]
    return Counter(
        foaf_id
        for friend_id in friendships[user_id]     # For each of my friends,
        for foaf_id in friendships[friend_id]     # find their friends
        if foaf_id != user_id                     # who aren't me
        and foaf_id not in friendships[user_id]   # and aren't my friends.
    )

print(friends_of_friends(users[3]))               # Counter({0: 2, 5: 1})
This correctly tells Chi (id 3) that she has two mutual friends with Hero (id 0) but only one
mutual friend with Clive (id 5).
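To turn those counts into an actual suggestion list, one option is to rank the candidates with Counter's most_common method. This is a minimal sketch under that assumption. The helper name suggest_friends and its how_many parameter are hypothetical, and the friendships dict is rebuilt here so the example runs on its own:

```python
from collections import Counter

friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

friendships = {user_id: [] for user_id in range(10)}
for i, j in friendship_pairs:
    friendships[i].append(j)
    friendships[j].append(i)

def suggest_friends(user_id, how_many=3):
    """Hypothetical helper: friends-of-friends ranked by mutual-friend count."""
    counts = Counter(
        foaf_id
        for friend_id in friendships[user_id]
        for foaf_id in friendships[friend_id]
        if foaf_id != user_id and foaf_id not in friendships[user_id]
    )
    return counts.most_common(how_many)   # highest counts first

print(suggest_friends(3))   # [(0, 2), (5, 1)]
```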
As a data scientist, you know that you also might enjoy meeting users with similar interests.
(This is a good example of the “substantive expertise” aspect of data science.) After asking
around, you manage to get your hands on this data, as a list of pairs (user_id,
interest):
interests = [
    (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
    (0, "Spark"), (0, "Storm"), (0, "Cassandra"),
    (1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
    (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
    (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
    (3, "statistics"), (3, "regression"), (3, "probability"),
    (4, "machine learning"), (4, "regression"), (4, "decision trees"),
    (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
    (5, "Haskell"), (5, "programming languages"), (6, "statistics"),
    (6, "probability"), (6, "mathematics"), (6, "theory"),
    (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
    (7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
    (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
    (9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
For example, Hero (id 0) has no friends in common with Klein (id 9), but they share interests
in Java and big data.
It’s easy to build a function that finds users with a certain interest:
def data_scientists_who_like(target_interest):
    """Find the ids of all users who like the target interest."""
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]
This works, but it has to examine the whole list of interests for every search. If we have a lot of
users and interests (or if we just want to do a lot of searches), we’re probably better off building
an index from interests to users:
from collections import defaultdict

# Keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)

for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)
And another from users to interests:
# Keys are user_ids, values are lists of interests for that user_id.
interests_by_user_id = defaultdict(list)

for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)
Now it’s easy to find who has the most interests in common with a given user:
1. Iterate over the user’s interests.
2. For each interest, iterate over the other users with that interest.
3. Keep count of how many times we see each other user.
def most_common_interests_with(user):
    return Counter(
        interested_user_id
        for interest in interests_by_user_id[user["id"]]
        for interested_user_id in user_ids_by_interest[interest]
        if interested_user_id != user["id"]
    )
We could then use this to build a richer “Data Scientists You Should Know” feature based on a
combination of mutual friends and mutual interests. We’ll explore these kinds of applications in
Chapter 22.
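As a rough idea of what such a combination could look like, here is a minimal sketch that simply adds the two counts together. The function name combined_scores and the equal weighting are assumptions made for illustration, not the approach taken later in the book. The two Counters are hard-coded to the values the functions above produce for Chi (id 3), so the example runs on its own:

```python
from collections import Counter

# What friends_of_friends and most_common_interests_with return for Chi (id 3),
# written out by hand so this sketch is self-contained.
mutual_friends = Counter({0: 2, 5: 1})
mutual_interests = Counter({5: 2, 6: 2, 2: 1, 4: 1})
already_friends = [1, 2, 4]                      # friendships[3]

def combined_scores(mutual_friends, mutual_interests, already_friends):
    """Hypothetical scoring: mutual friends plus mutual interests,
    skipping people the user already knows."""
    totals = mutual_friends + mutual_interests   # Counter addition sums the counts
    return Counter({user_id: score
                    for user_id, score in totals.items()
                    if user_id not in already_friends})

scores = combined_scores(mutual_friends, mutual_interests, already_friends)
print(scores.most_common(1))   # [(5, 3)]
```

Under this scoring Clive (id 5) moves ahead of Hero (id 0), because two shared interests outweigh the extra mutual friend. A real feature would need to decide how much each signal should count.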
Salaries and Experience
Right as you’re about to head to lunch, the VP of Public Relations asks if you can provide some
fun facts about how much data scientists earn. Salary data is of course sensitive, but he manages
to provide you an anonymous data set containing each user’s salary (in dollars) and tenure
as a data scientist (in years):
salaries_and_tenures = [(83000, 8.7), (88000, 8.1),
(48000, 0.7), (76000, 6),
(69000, 6.5), (76000, 7.5),
(60000, 2.5), (83000, 10),
(48000, 1.9), (63000, 4.2)]
The natural first step is to plot the data (which we’ll see how to do in Chapter 3). You can see
the results in Figure 1-3.
Figure 1-3. Salary by years of experience
It seems pretty clear that people with more experience tend to earn more. How can you turn this
into a fun fact? Your first idea is to look at the average salary for each tenure:
# Keys are years, values are lists of the salaries for each tenure.
salary_by_tenure = defaultdict(list)

for salary, tenure in salaries_and_tenures:
    salary_by_tenure[tenure].append(salary)

# Keys are years, each value is average salary for that tenure.
average_salary_by_tenure = {
    tenure: sum(salaries) / len(salaries)
    for tenure, salaries in salary_by_tenure.items()
}
This turns out to be not particularly useful, as none of the users have the same tenure, which
means we’re just reporting the individual users’ salaries:
{0.7: 48000.0,
1.9: 48000.0,
2.5: 60000.0,
4.2: 63000.0,
6: 76000.0,
6.5: 69000.0,
7.5: 76000.0,
8.1: 88000.0,
8.7: 83000.0,
10: 83000.0}
It might be more helpful to bucket the tenures:
def tenure_bucket(tenure):
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    else:
        return "more than five"
Then we can group together the salaries corresponding to each bucket:
# Keys are tenure buckets, values are lists of salaries for that bucket.
salary_by_tenure_bucket = defaultdict(list)

for salary, tenure in salaries_and_tenures:
    bucket = tenure_bucket(tenure)
    salary_by_tenure_bucket[bucket].append(salary)
And finally compute the average salary for each group:
# Keys are tenure buckets, values are average salary for that bucket.
average_salary_by_bucket = {
    tenure_bucket: sum(salaries) / len(salaries)
    for tenure_bucket, salaries in salary_by_tenure_bucket.items()
}
which is more interesting:
{'between two and five': 61500.0,
'less than two': 48000.0,
'more than five': 79166.66666666667}
And you have your soundbite: “Data scientists with more than five years’ experience earn 65%
more than data scientists with little or no experience!”
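The 65% figure comes straight from the bucket averages: divide the "more than five" average by the "less than two" average and look at the excess over 1. A minimal sketch of that arithmetic, with the averages copied from the output above:

```python
average_salary_by_bucket = {
    "less than two": 48000.0,
    "between two and five": 61500.0,
    "more than five": 79166.66666666667,
}

ratio = (average_salary_by_bucket["more than five"]
         / average_salary_by_bucket["less than two"])   # about 1.649
percent_more = (ratio - 1) * 100                        # about 64.9

print(f"{percent_more:.0f}%")    # 65%
```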
But we chose the buckets in a pretty arbitrary way. What we’d really like is to make some sort
of statement about the salary effect—on average—of having an additional year of experience. In
addition to making for a snappier fun fact, this allows us to make predictions about salaries that
we don’t know. We’ll explore this idea in Chapter 14.
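As a preview of what "the salary effect of an additional year" might look like, here is a minimal sketch that fits a straight line through the ten data points using ordinary least squares. This is one standard technique, offered as an assumption about how to approach the question rather than as the book's own code, and the function name least_squares_line is hypothetical:

```python
salaries_and_tenures = [(83000, 8.7), (88000, 8.1),
                        (48000, 0.7), (76000, 6),
                        (69000, 6.5), (76000, 7.5),
                        (60000, 2.5), (83000, 10),
                        (48000, 1.9), (63000, 4.2)]

def least_squares_line(pairs):
    """Fit salary = alpha + beta * tenure by ordinary least squares."""
    n = len(pairs)
    mean_salary = sum(salary for salary, _ in pairs) / n
    mean_tenure = sum(tenure for _, tenure in pairs) / n
    # beta is the covariance of tenure and salary over the variance of tenure
    # (the 1/n factors cancel, so plain sums are enough).
    covariance = sum((tenure - mean_tenure) * (salary - mean_salary)
                     for salary, tenure in pairs)
    variance = sum((tenure - mean_tenure) ** 2 for _, tenure in pairs)
    beta = covariance / variance
    alpha = mean_salary - beta * mean_tenure
    return alpha, beta

alpha, beta = least_squares_line(salaries_and_tenures)
print(f"each extra year is associated with about ${beta:,.0f} more in salary")

def predict_salary(tenure):
    """Predicted salary for a tenure we have not observed."""
    return alpha + beta * tenure
```

With only ten points the fitted numbers should not be taken seriously, but the slope beta is exactly the kind of "per additional year" statement the buckets could not give us.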
Paid Accounts
When you get back to your desk, the VP of Revenue is waiting for you. She wants to better
understand which users pay for accounts and which don’t. (She knows their names, but that’s
not particularly actionable information.)
You notice that there seems to be a correspondence between years of experience and paid
accounts:
0.7 paid
1.9 unpaid
2.5 paid
4.2 unpaid
6 unpaid
6.5 unpaid
7.5 unpaid
8.1 unpaid
8.7 paid
10 paid
Users with very few and very many years of experience tend to pay; users with average amounts
of experience don’t.
Accordingly, if you wanted to create a model—though this is definitely not enough data to base
a model on—you might try to predict “paid” for users with very few and very many years of
experience, and “unpaid” for users with middling amounts of experience:
def predict_paid_or_unpaid(years_experience):
    if years_experience < 3.0:
        return "paid"
    elif years_experience < 8.5:
        return "unpaid"
    else:
        return "paid"
Of course, we totally eyeballed the cutoffs.
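To see how well those eyeballed cutoffs fit the ten users listed above, we can run the model over the table and count the matches. A minimal sketch follows. The list name experience_and_status is hypothetical, and the function is repeated so the example runs on its own:

```python
experience_and_status = [(0.7, "paid"), (1.9, "unpaid"), (2.5, "paid"),
                         (4.2, "unpaid"), (6, "unpaid"), (6.5, "unpaid"),
                         (7.5, "unpaid"), (8.1, "unpaid"), (8.7, "paid"),
                         (10, "paid")]

def predict_paid_or_unpaid(years_experience):
    if years_experience < 3.0:
        return "paid"
    elif years_experience < 8.5:
        return "unpaid"
    else:
        return "paid"

# True counts as 1 when summed, so this counts the correct predictions.
num_correct = sum(predict_paid_or_unpaid(years) == status
                  for years, status in experience_and_status)

print(num_correct, "of", len(experience_and_status))   # 9 of 10
```

The one miss is the unpaid user with 1.9 years of experience. Scoring a model on the same data used to choose its cutoffs is flattering, so this says little about how it would do on new users.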
With more data (and more mathematics), we could build a model predicting the likelihood that a
user would pay, based on his years of experience. We’ll investigate this sort of problem in
Chapter 16.
Topics of Interest
As you’re wrapping up your first day, the VP of Content Strategy asks you for data about what
topics users are most interested in, so that she can plan out her blog calendar accordingly. You
already have the raw data from the friend-suggester project:
interests = [
    (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
    (0, "Spark"), (0, "Storm"), (0, "Cassandra"),
    (1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
    (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
    (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
    (3, "statistics"), (3, "regression"), (3, "probability"),
    (4, "machine learning"), (4, "regression"), (4, "decision trees"),
    (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
    (5, "Haskell"), (5, "programming languages"), (6, "statistics"),
    (6, "probability"), (6, "mathematics"), (6, "theory"),
    (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
    (7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
    (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
    (9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
One simple (if not particularly exciting) way to find the most popular interests is simply to
count the words:
1. Lowercase each interest (since different users may or may not capitalize their interests).
2. Split it into words.
3. Count the results.
In code:
words_and_counts = Counter(word
                           for user, interest in interests
                           for word in interest.lower().split())
This makes it easy to list out the words that occur more than once:
for word, count in words_and_counts.most_common():
    if count > 1:
        print(word, count)
which gives the results you’d expect (unless you expect “scikit-learn” to get split into two
words, in which case it doesn’t give the results you expect):
learning 3
java 3
python 3
big 3
data 3
hbase 2
regression 2
cassandra 2
statistics 2
probability 2
hadoop 2
networks 2
machine 2
neural 2
scikit-learn 2
r 2
We’ll look at more sophisticated ways to extract topics from data in Chapter 20.
Onward
It’s been a successful first day! Exhausted, you slip out of the building before anyone else can
ask you for anything else. Get a good night’s rest, because tomorrow is new employee
orientation. (Yes, you went through a full day of work before new employee orientation. Take it
up with HR.)
Chapter 2. A Crash Course in Python
People are still crazy about Python after twenty-five years, which I find hard to believe.
—Michael Palin
All new employees at DataSciencester are required to go through new employee orientation, the
most interesting part of which is a crash course in Python.
This is not a comprehensive Python tutorial but instead is intended to highlight the parts of the
language that will be most important to us (some of which are often not the focus of Python
tutorials). If you have never used Python before, you probably want to supplement this with
some sort of beginner tutorial.
Getting Python
Example 2-1.
As instructions about how to install things can change, while printed books cannot, up-to-date
instructions on how to install Python can be found in the book’s GitHub repo.
If the ones printed here don’t work for you, check those.
You can download Python from python.org. But if you don’t already have Python, I recommend
instead installing the Anaconda distribution, which already includes most of the libraries that
you need to do data science.
When I wrote the first version of Data Science from Scratch, Python 2.7 was still the preferred
version of most data scientists. Accordingly, the book was based on Python 2.7.
In the last several years, however, pretty much everyone who counts has migrated to Python 3.
Recent versions of Python have many features that make it easier to write clean code, and we’ll
be taking ample advantage of features that are only available in Python 3.6 or later. This means
that you should get Python 3.6 or later.
If you don’t get Anaconda, make sure to install pip, which is a Python package manager that
allows you to easily install third-party packages (some of which we’ll need). It’s also worth
getting IPython, which is a much nicer Python shell to work with.
(If you installed Anaconda then it should have come with pip and IPython.)
Just run:
pip install ipython
and then search the Internet for solutions to whatever cryptic error messages that causes.
The Zen of Python

Python has a somewhat Zen description of its design principles, which you can also find inside
the Python interpreter itself by typing import this.
One of the most discussed of these is:
There should be one—and preferably only one—obvious way to do it.
Code written in accordance with this “obvious” way (which may not be obvious at all to a
newcomer) is often described as “Pythonic.” Although this is not a book about Python, we will
occasionally contrast Pythonic and non-Pythonic ways of accomplishing the same things, and
we will generally favor Pythonic solutions to our problems.
Several others touch on aesthetics,
Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex.
and represent ideals that we will strive for in our code.
Whitespace Formatting
Many languages use curly braces to delimit blocks of code. Python uses indentation:
for i in [1, 2, 3, 4, 5]:
    print(i)                    # first line in "for i" block
    for j in [1, 2, 3, 4, 5]:
        print(j)                # first line in "for j" block
        print(i + j)            # last line in "for j" block
    print(i)                    # last line in "for i" block
print("done looping")
This makes Python code very readable, but it also means that you have to be very careful with
your formatting.
Example 2-2.
Programmers will often argue over whether to use tabs or spaces for indentation. For many
languages it doesn’t matter that much; however, Python considers tabs and spaces different
indentation and will not be able to run your code if you mix the two. My personal preference is
that you should always use spaces, never tabs. (If you write code in an editor you can configure
it so that the “tab” key just inserts spaces.)
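To see the tabs-versus-spaces problem without wrecking a real file, we can hand Python a small piece of source code as a string and ask it to compile it. This is a minimal sketch, assuming Python 3. One line of the block is indented with a tab and the next with eight spaces, and the filename passed to compile is just a made-up label:

```python
source = (
    "if True:\n"
    "\tx = 1\n"          # this line is indented with one tab
    "        y = 2\n"    # this line is indented with eight spaces
)

try:
    compile(source, "<tabs-and-spaces>", "exec")
    outcome = "compiled"
except TabError as error:
    # TabError is a subclass of IndentationError.
    outcome = f"TabError: {error}"

print(outcome)
```

The two lines may look identically indented in an editor, which is exactly why the error is so confusing when you hit it by accident.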
Whitespace is ignored inside parentheses and brackets, which can be helpful for long-winded
computations:
long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 +
                           13 + 14 + 15 + 16 + 17 + 18 + 19 + 20)
and for making code easier to read:
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
easier_to_read_list_of_lists = [[1, 2, 3],
                                [4, 5, 6],
                                [7, 8, 9]]
You can also use a backslash to indicate that a statement continues onto the next line, although
we’ll rarely do this:
two_plus_three = 2 + \
3
One consequence of whitespace formatting is that it can be hard to copy and paste code into the
Python shell. For example, if you tried to paste the code:
for i in [1, 2, 3, 4, 5]:

    # notice the blank line
    print(i)
into the ordinary Python shell, you would receive the complaint
IndentationError: expected an indented block
because the interpreter thinks the blank line signals the end of the for loop’s block.
IPython has a magic function %paste, which correctly pastes whatever is on your clipboard,
whitespace and all. This alone is a good reason to use IPython.
Modules
Certain features of Python are not loaded by default. These include both features included as
part of the language as well as third-party features that you download yourself. In order to use
these features, you’ll need to import the modules that contain them.
One approach is to simply import the module itself:
import re
my_regex = re.compile("[0-9]+", re.I)
Here re is the module containing functions and constants for working with regular expressions.
After this type of import you can only access those functions by prefixing them with re..
If you already had a different re in your code you could use an alias:
import re as regex
my_regex = regex.compile("[0-9]+", regex.I)
You might also do this if your module has an unwieldy name or if you’re going to be typing it a
lot. For example, when visualizing data with matplotlib, a standard convention is:
import matplotlib.pyplot as plt
plt.plot(...)
If you need a few specific values from a module, you can import them explicitly and use them
without qualification:
from collections import defaultdict, Counter
lookup = defaultdict(int)
my_counter = Counter()
If you were a bad person, you could import the entire contents of a module into your namespace,
which might inadvertently overwrite variables you’ve already defined:
match = 10
from re import * # uh oh, re has a match function
print(match) # "<function match at 0x10281e6a8>"
However, since you are not a bad person, you won’t ever do this.
Functions
A function is a rule for taking zero or more inputs and returning a corresponding output. In
Python, we typically define functions using def:
def double(x):
    """
    This is where you put an optional docstring that explains what the
    function does. For example, this function multiplies its input by 2.
    """
    return x * 2
Python functions are first-class, which means that we can assign them to variables and pass
them into functions just like any other arguments:
def apply_to_one(f):
    """Calls the function f with 1 as its argument"""
    return f(1)

my_double = double             # refers to the previously defined function
x = apply_to_one(my_double)    # equals 2
It is also easy to create short anonymous functions, or lambdas:
y = apply_to_one(lambda x: x + 4) # equals 5
You can assign lambdas to variables, although most people will tell you that you should just use
def instead:
another_double = lambda x: 2 * x      # don't do this

def another_double(x):
    """do this instead"""
    return 2 * x
Function parameters can also be given default arguments, which only need to be specified when
you want a value other than the default:
def my_print(message = "my default message"):
    print(message)

my_print("hello")   # prints 'hello'
my_print()          # prints 'my default message'
It is sometimes useful to specify arguments by name:
def full_name(first = "What's-his-name", last = "Something"):
    return first + " " + last

full_name("Joel", "Grus")     # "Joel Grus"
full_name("Joel")             # "Joel Something"
full_name(last="Grus")        # "What's-his-name Grus"
We will be creating many, many functions.
Strings
Strings can be delimited by single or double quotation marks (but the quotes have to match):
single_quoted_string = 'data science'
double_quoted_string = "data science"
Python uses backslashes to encode special characters. For example:
tab_string = "\t" # represents the tab character
len(tab_string) # is 1
If you want backslashes as backslashes (which you might in Windows directory names or in
regular expressions), you can create raw strings using r"":
not_tab_string = r"\t" # represents the characters '\' and 't'
len(not_tab_string) # is 2
You can create multiline strings using triple-[double-]quotes:
multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""
A new feature in Python 3.6 is the “f-string”, which provides a simple way to substitute values
into strings. For example, if we had first name and last name given separately
first_name = "Joel"
last_name = "Grus"
we might want to combine them into a full name. There is a variety of ways to construct such a
full_name string.
full_name1 = first_name + " " + last_name # string addition
full_name2 = "{0} {1}".format(first_name, last_name) # string.format
But the f-string way is much less unwieldy:
full_name3 = f"{first_name} {last_name}"
and we’ll prefer it throughout the book.
Exceptions
When something goes wrong, Python raises an exception. Unhandled, these will cause your
program to crash. You can handle them using try and except:
try:
    print(0 / 0)
except ZeroDivisionError:
    print("cannot divide by zero")
Although in many languages exceptions are considered bad, in Python there is no shame in
using them to make your code cleaner, and we will sometimes do so.
Lists
Probably the most fundamental data structure in Python is the list. A list is simply an ordered
collection. (It is similar to what in other languages might be called an array, but with some
added functionality.)
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [integer_list, heterogeneous_list, []]
list_length = len(integer_list) # equals 3
list_sum = sum(integer_list) # equals 6
You can get or set the nth element of a list with square brackets:
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
zero = x[0] # equals 0, lists are 0-indexed
one = x[1] # equals 1
nine = x[-1] # equals 9, 'Pythonic' for last element
eight = x[-2] # equals 8, 'Pythonic' for next-to-last element
x[0] = -1 # now x is [-1, 1, 2, 3, ..., 9]
You can also use square brackets to “slice” lists. The slice i:j means all elements from i
(inclusive) to j (not inclusive). If you leave off the start of the slice, you’ll slice from the
beginning of the list, and if you leave off the end of the slice, you’ll slice until the end of the list:
first_three = x[:3] # [-1, 1, 2]
three_to_end = x[3:] # [3, 4, ..., 9]
one_to_four = x[1:5] # [1, 2, 3, 4]
last_three = x[-3:] # [7, 8, 9]
without_first_and_last = x[1:-1] # [1, 2, ..., 8]
copy_of_x = x[:] # [-1, 1, 2, ..., 9]
You can similarly slice strings and other “sequential” types.
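As a small illustration of that last point, the same slice syntax can be applied to a string and to a tuple. The variable names here are made up for the example:

```python
# Slices on strings and tuples follow the same i:j rules as lists.
word = "datascience"
first_four = word[:4]       # "data"
last_seven = word[-7:]      # "science"

point = (10, 20, 30, 40)
middle = point[1:-1]        # (20, 30): a slice of a tuple is a tuple
```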
Python has an in operator to check for list membership:
1 in [1, 2, 3] # True
0 in [1, 2, 3] # False
This check involves examining the elements of the list one at a time, which means that you
probably shouldn’t use it unless you know your list is pretty small (or unless you don’t care how
long the check takes).
It is easy to concatenate lists together. If you want to modify a list in place, you can use
extend to add items from another collection.
x = [1, 2, 3]
x.extend([4, 5, 6]) # x is now [1, 2, 3, 4, 5, 6]
If you don’t want to modify x you can use list addition:
x = [1, 2, 3]
y = x + [4, 5, 6] # y is [1, 2, 3, 4, 5, 6]; x is unchanged
More frequently we will append to lists one item at a time:
x = [1, 2, 3]
x.append(0) # x is now [1, 2, 3, 0]
y = x[-1] # equals 0
z = len(x) # equals 4
It is often convenient to unpack lists when you know how many elements they contain:
x, y = [1, 2] # now x is 1, y is 2
although you will get a ValueError if you don’t have the same number of elements on both
sides.
A common idiom is to use an underscore for a value you’re going to throw away:
_, y = [1, 2] # now y == 2, didn't care about the first element
Tuples
Tuples are lists’ immutable cousins. Pretty much anything you can do to a list that doesn’t
involve modifying it, you can do to a tuple. You specify a tuple by using parentheses (or
nothing) instead of square brackets:
my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4
my_list[1] = 3 # my_list is now [1, 3]
try:
    my_tuple[1] = 3
except TypeError:
    print("cannot modify a tuple")
Tuples are a convenient way to return multiple values from functions:
def sum_and_product(x, y):
    return (x + y), (x * y)
sp = sum_and_product(2, 3) # sp is (5, 6)
s, p = sum_and_product(5, 10) # s is 15, p is 50
Tuples (and lists) can also be used for multiple assignment:
x, y = 1, 2 # now x is 1, y is 2
x, y = y, x # Pythonic way to swap variables; now x is 2, y is 1
Dictionaries
Another fundamental data structure is a dictionary, which associates values with keys and allows
you to quickly retrieve the value corresponding to a given key:
empty_dict = {} # Pythonic
empty_dict2 = dict() # less Pythonic
grades = {"Joel": 80, "Tim": 95} # dictionary literal
You can look up the value for a key using square brackets:
joels_grade = grades["Joel"] # equals 80
But you’ll get a KeyError if you ask for a key that’s not in the dictionary:
try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no grade for Kate!")
You can check for the existence of a key using in:
joel_has_grade = "Joel" in grades # True
kate_has_grade = "Kate" in grades # False
This membership check is fast even for large dictionaries.
Dictionaries have a get method that returns a default value (instead of raising an exception)
when you look up a key that’s not in the dictionary:
joels_grade = grades.get("Joel", 0) # equals 80
kates_grade = grades.get("Kate", 0) # equals 0
no_ones_grade = grades.get("No One") # default default is None
You can assign key-value pairs using the same square brackets:
grades["Tim"] = 99 # replaces the old value
grades["Kate"] = 100 # adds a third entry
num_students = len(grades) # equals 3
As you saw in the introduction, you can use dictionaries to represent structured data:
tweet = {
    "user" : "joelgrus",
    "text" : "Data Science is Awesome",
    "retweet_count" : 100,
    "hashtags" : ["#data", "#science", "#datascience", "#awesome", "#yolo"]
}
although we’ll soon see a better approach.
Besides looking for specific keys we can look at all of them:
tweet_keys = tweet.keys() # iterable for the keys
tweet_values = tweet.values() # iterable for the values
tweet_items = tweet.items() # iterable for the (key, value) tuples
"user" in tweet_keys # True, but not Pythonic
"user" in tweet # Pythonic way of checking for keys
"joelgrus" in tweet_values # True (slow but the only way to check)
Dictionary keys must be “hashable”; in particular, you cannot use lists as keys. If you need a
multipart key, you should probably use a tuple or figure out a way to turn the key into a
string.
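A minimal sketch of the tuple-as-key idea follows. The grid-cell counts are invented purely for illustration:

```python
# Hypothetical example: counts keyed by a (row, column) tuple.
cell_counts = {}
cell_counts[(0, 1)] = 3
cell_counts[(2, 5)] = 7

# A list cannot be used the same way, because lists are not hashable.
try:
    cell_counts[[0, 1]] = 3
except TypeError:
    list_key_rejected = True
else:
    list_key_rejected = False
```

The tuple works because it is immutable, so its hash cannot change while it sits in the dictionary.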
defaultdict
Imagine that you’re trying to count the words in a document. An obvious approach is to create a
dictionary in which the keys are words and the values are counts. As you check each word, you
can increment its count if it’s already in the dictionary and add it to the dictionary if it’s not:
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
You could also use the “forgiveness is better than permission” approach and just handle the
exception from trying to look up a missing key:
word_counts = {}
for word in document:
    try:
        word_counts[word] += 1
    except KeyError:
        word_counts[word] = 1
A third approach is to use get, which behaves gracefully for missing keys:
word_counts = {}
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1
Every one of these is slightly unwieldy, which is why defaultdict is useful. A
defaultdict is like a regular dictionary, except that when you try to look up a key it doesn’t
contain, it first adds a value for it using a zero-argument function you provided when you
created it. In order to use defaultdicts, you have to import them from collections:
from collections import defaultdict
word_counts = defaultdict(int) # int() produces 0
for word in document:
    word_counts[word] += 1
They can also be useful with list or dict or even your own functions:
dd_list = defaultdict(list) # list() produces an empty list
dd_list[2].append(1) # now dd_list contains {2: [1]}
dd_dict = defaultdict(dict) # dict() produces an empty dict
dd_dict["Joel"]["City"] = "Seattle" # {"Joel" : {"City": "Seattle"}}
dd_pair = defaultdict(lambda: [0, 0])
dd_pair[2][1] = 1 # now dd_pair contains {2: [0, 1]}
These will be useful when we’re using dictionaries to “collect” results by some key and don’t
want to have to check every time to see if the key exists yet.
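To make the “collect results by some key” pattern concrete, here is a short sketch that groups words by their first letter. The word list is made up for the example:

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Each missing key starts out as an empty list, so we can append right away.
words_by_letter = defaultdict(list)
for word in words:
    words_by_letter[word[0]].append(word)
```

Without defaultdict, the loop body would first need to check whether `word[0]` was already a key.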
Counter
A Counter turns a sequence of values into a defaultdict(int)-like object mapping keys
to counts.
from collections import Counter
c = Counter([0, 1, 2, 0]) # c is (basically) {0: 2, 1: 1, 2: 1}
This gives us a very simple way to solve our word_counts problem:
# recall, document is a list of words
word_counts = Counter(document)
A Counter instance has a most_common method that is frequently useful:
# print the 10 most common words and their counts
for word, count in word_counts.most_common(10):
    print(word, count)
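Here is the same idea on a tiny, made-up document, so the results can be checked by eye:

```python
from collections import Counter

# A hypothetical document, already split into words.
document = ["data", "science", "data", "from", "scratch", "data", "science"]
word_counts = Counter(document)

top_two = word_counts.most_common(2)   # [("data", 3), ("science", 2)]
missing = word_counts["python"]        # 0: looking up an unseen word is not an error
```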
Sets
Another useful data structure is set, which represents a collection of distinct elements. You can
define a set by listing its elements between curly braces:
primes_below_10 = {2, 3, 5, 7}
However, that doesn’t work for empty sets, as {} already means “empty dict”. In that case
you’ll need to use set() itself:
s = set()
s.add(1) # s is now {1}
s.add(2) # s is now {1, 2}
s.add(2) # s is still {1, 2}
x = len(s) # equals 2
y = 2 in s # equals True
z = 3 in s # equals False
We’ll use sets for two main reasons. The first is that in is a very fast operation on sets. If we
have a large collection of items that we want to use for a membership test, a set is more
appropriate than a list:
stopwords_list = ["a", "an", "at"] + hundreds_of_other_words + ["yet", "you"]
"zip" in stopwords_list # False, but have to check every element
stopwords_set = set(stopwords_list)
"zip" in stopwords_set # very fast to check
The second reason is to find the distinct items in a collection:
item_list = [1, 2, 3, 1, 2, 3]
num_items = len(item_list) # 6
item_set = set(item_list) # {1, 2, 3}
num_distinct_items = len(item_set) # 3
distinct_item_list = list(item_set) # [1, 2, 3]
We’ll use sets much less frequently than dicts and lists.
Control Flow
As in most programming languages, you can perform an action conditionally using if:
if 1 > 2:
    message = "if only 1 were greater than two..."
elif 1 > 3:
    message = "elif stands for 'else if'"
else:
    message = "when all else fails use else (if you want to)"
You can also write a ternary if-then-else on one line, which we will do occasionally:
parity = "even" if x % 2 == 0 else "odd"
Python has a while loop:
x = 0
while x < 10:
    print(f"{x} is less than 10")
    x += 1
although more often we’ll use for and in:
# range(10) is the numbers 0, 1, ..., 9
for x in range(10):
    print(f"{x} is less than 10")
If you need more complex logic, you can use continue and break:
for x in range(10):
    if x == 3:
        continue # go immediately to the next iteration
    if x == 5:
        break # quit the loop entirely
    print(x)
This will print 0, 1, 2, and 4.
Truthiness
Booleans in Python work as in most other languages, except that they’re capitalized:
one_is_less_than_two = 1 < 2 # equals True
true_equals_false = True == False # equals False
Python uses the value None to indicate a nonexistent value. It is similar to other languages’
null:
x = None
assert x == None, "this is not the Pythonic way to check for None"
assert x is None, "this is the Pythonic way to check for None"
Python lets you use any value where it expects a Boolean. The following are all “Falsy”:
False
None
[] (an empty list)
{} (an empty dict)
""
set()
0
0.0
Pretty much anything else gets treated as True. This allows you to easily use if statements to
test for empty lists or empty strings or empty dictionaries or so on. It also sometimes causes
tricky bugs if you’re not expecting this behavior:
s = some_function_that_returns_a_string()
if s:
    first_char = s[0]
else:
    first_char = ""
A shorter (but possibly more confusing) way of doing the same is:
first_char = s and s[0]
since and returns its second value when the first is “truthy,” the first value when it’s not.
Similarly, if x is either a number or possibly None:
safe_x = x or 0
is definitely a number, although
safe_x = x if x is not None else 0
is possibly more readable.
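The difference between the two forms shows up when the fallback is not itself falsy. The sketch below uses two hypothetical helper functions (neither is from the book) to contrast `or` with an explicit `is not None` check:

```python
def first_char_of(s):
    """Returns s itself when s is empty (falsy), otherwise its first character."""
    return s and s[0]

def or_default(x, default=1):
    """Falls back to default for *any* falsy x, including 0."""
    return x or default

def none_default(x, default=1):
    """Falls back to default only when x is None."""
    return x if x is not None else default

surprising = or_default(0)    # 1, because 0 is falsy
expected = none_default(0)    # 0, because 0 is not None
```

This is one reason the explicit form can be the safer choice when 0 is a legitimate value.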
Python has an all function, which takes an iterable and returns True precisely when every
element is truthy, and an any function, which returns True when at least one element is truthy:
all([True, 1, {3}]) # True, all are truthy
all([True, 1, {}]) # False, {} is falsy
any([True, 1, {}]) # True, True is truthy
all([]) # True, no falsy elements in the list
any([]) # False, no truthy elements in the list
Sorting
Every Python list has a sort method that sorts it in place. If you don’t want to mess up your
list, you can use the sorted function, which returns a new list:
x = [4, 1, 2, 3]
y = sorted(x) # y is [1, 2, 3, 4], x is unchanged
x.sort() # now x is [1, 2, 3, 4]
By default, sort (and sorted) sort a list from smallest to largest based on naively comparing
the elements to one another.
If you want elements sorted from largest to smallest, you can specify a reverse=True
parameter. And instead of comparing the elements themselves, you can compare the results of a
function that you specify with key:
# sort the list by absolute value from largest to smallest
x = sorted([-4, 1, -2, 3], key=abs, reverse=True) # is [-4, 3, -2, 1]
# sort the words and counts from highest count to lowest
wc = sorted(word_counts.items(),
            key=lambda word_and_count: word_and_count[1],
            reverse=True)
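A key function can also return a tuple, which gives a way to break ties. The sketch below uses made-up counts and sorts by count from highest to lowest, then alphabetically among words with equal counts:

```python
# Hypothetical word counts for illustration.
word_counts = {"data": 3, "science": 2, "scratch": 1, "from": 1}

# Negating the count sorts it in descending order; the word breaks ties ascending.
by_count_then_word = sorted(word_counts.items(),
                            key=lambda word_and_count: (-word_and_count[1],
                                                        word_and_count[0]))
```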
List Comprehensions
Frequently, you’ll want to transform a list into another list, by choosing only certain elements,
or by transforming elements, or both. The Pythonic way of doing this is list comprehensions:
even_numbers = [x for x in range(5) if x % 2 == 0] # [0, 2, 4]
squares = [x * x for x in range(5)] # [0, 1, 4, 9, 16]
even_squares = [x * x for x in even_numbers] # [0, 4, 16]
You can similarly turn lists into dictionaries or sets:
square_dict = {x: x * x for x in range(5)} # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
square_set = {x * x for x in [1, -1]} # {1}
If you don’t need the value from the list, it’s common to use an underscore as the variable:
zeros = [0 for _ in even_numbers] # has the same length as even_numbers
A list comprehension can include multiple fors:
pairs = [(x, y)
         for x in range(10)
         for y in range(10)] # 100 pairs (0,0) (0,1) ... (9,8), (9,9)
and later fors can use the results of earlier ones:
increasing_pairs = [(x, y) # only pairs with x < y,
                    for x in range(10) # range(lo, hi) equals
                    for y in range(x + 1, 10)] # [lo, lo + 1, ..., hi - 1]
We will use list comprehensions a lot.
Automated Testing and assert

As data scientists, we’ll be writing a lot of code. How can we be confident our code is correct?
One way is with types (see below), but another way is with automated tests.
There are elaborate frameworks for writing and running tests, but in this book we’ll restrict
ourselves to using assert statements, which will cause your code to raise an
AssertionError if your specified condition is not truthy.
assert 1 + 1 == 2
assert 1 + 1 == 2, "1 + 1 should equal 2 but didn't"
As you can see in the second case, you can optionally add a message to be printed if the
assertion fails.
It’s not particularly interesting to assert that 1 + 1 = 2. What’s more interesting is to assert that
functions you write are doing what you expect them to.
def smallest_item(xs):
    return min(xs)
assert smallest_item([10, 20, 5, 40]) == 5
assert smallest_item([1, 0, -1, 2]) == -1
Throughout the book we’ll be using assert in this way. It is a good practice, and I strongly
encourage you to make liberal use of it in your own code. (If you look at the book’s code on
GitHub, you will see that it contains many, many more assert statements than are printed in
the book. This helps me be confident that the code I’ve written for you is correct.)
Another less common use is to assert things about inputs to functions:
def smallest_item(xs):
    assert xs, "empty list has no smallest item"
    return min(xs)
But we’ll rarely do this.
Object-Oriented Programming
Like many languages, Python allows you to define classes that encapsulate data and the
functions that operate on them. We’ll use them sometimes to make our code cleaner and
simpler. It’s probably simplest to explain them by constructing a heavily annotated example.
Here we’ll construct a class representing a “counting clicker”, the sort that is used at the door to
track how many people have shown up for the “advanced topics in data science” meetup.
It maintains a count, can be clicked to increment the count, allows you to read_count,
and can be reset back to zero. (In real life one of these rolls over from 9999 to 0000, but we
won’t bother with that.)
To define a class, you use the class keyword and a PascalCase name:
class CountingClicker:
    """A class can/should have a docstring, just like a function"""
A class contains zero or more member functions. By convention, each takes a first parameter
self that refers to the particular class instance.
Normally, a class has a constructor, named __init__. It takes whatever parameters you need to
construct an instance of your class and does whatever setup you need:
    def __init__(self, count = 0):
        self.count = count
Although the constructor has a funny name, we construct instances of the clicker using just the
class name:
clicker1 = CountingClicker() # initialized to 0
clicker2 = CountingClicker(100) # starts with count=100
clicker3 = CountingClicker(count=100) # more explicit way of doing the same
Notice that the __init__ method name starts and ends with double underscores. These “magic”
methods are sometimes called “dunder” methods (double-UNDERscore, get it?) and represent
“special” behaviors. Another such method is __repr__, which produces the string representation of
a class instance:
    def __repr__(self):
        return f"CountingClicker(count={self.count})"
And finally we need to implement the “public API” of our class:
    def click(self, num_times = 1):
        """Click the clicker some number of times."""
        self.count += num_times
    def read(self):
        return self.count
    def reset(self):
        self.count = 0
NOTE
Methods whose names start with an underscore are, by convention, considered “private”, and
users of the class are not supposed to call them directly. However, Python will not stop users
from calling them.
Having defined it, let’s use assert to write some test cases for our clicker.
clicker = CountingClicker()
assert clicker.read() == 0, "clicker should start with count 0"
clicker.click()
clicker.click()
assert clicker.read() == 2, "after two clicks, clicker should have count 2"
clicker.reset()
assert clicker.read() == 0, "after reset, clicker should be back to 0"
Writing tests like this helps us be confident that our code is working the way it’s designed to,
and that it continues to do so whenever we make changes to it.
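Because the class was introduced piece by piece, it may help to see the fragments assembled into one runnable definition. This sketch only collects the pieces shown in this section and adds a check of __repr__:

```python
class CountingClicker:
    """A class can/should have a docstring, just like a function"""

    def __init__(self, count = 0):
        self.count = count

    def __repr__(self):
        return f"CountingClicker(count={self.count})"

    def click(self, num_times = 1):
        """Click the clicker some number of times."""
        self.count += num_times

    def read(self):
        return self.count

    def reset(self):
        self.count = 0

clicker = CountingClicker()
clicker.click(3)
description = repr(clicker)   # "CountingClicker(count=3)"
```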
Iterables and Generators

One nice thing about a list is that you can retrieve specific elements by their indexes. But you
don’t always need this! A list of a billion numbers takes up a lot of memory. If you only want
the elements one at a time, there’s no good reason to keep them all around. If you only end up
needing the first several elements, generating all billion is hugely wasteful.
Often all we need is to iterate over the collection using for and in. In this case we can create
generators, which can be iterated over just like lists but generate their values lazily on demand.
One way to create generators is with functions and the yield operator:
def generate_range(n):
    i = 0
    while i < n:
        yield i # every call to yield produces a value of the generator
        i += 1
The following loop will consume the yielded values one at a time until none are left:
for i in generate_range(10):
    print(f"i: {i}")
(In fact, range is itself lazy, so there’s no point in doing this.)
With a generator, you can even create an infinite sequence:
def natural_numbers():
    """returns 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1
although you probably shouldn’t iterate over it without using some kind of break logic.
TIP
The flip side of laziness is that you can only iterate through a generator once. If you
need to iterate through something multiple times, you’ll need to either recreate the
generator each time or use a list. If generating the values is expensive, that might be
a good reason to use a list instead.
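The one-pass behavior is easy to see in a short sketch, reusing the generate_range function defined earlier:

```python
def generate_range(n):
    i = 0
    while i < n:
        yield i
        i += 1

gen = generate_range(3)
first_pass = list(gen)     # [0, 1, 2]
second_pass = list(gen)    # []: the generator is already exhausted

# Calling the function again creates a fresh generator.
fresh_pass = list(generate_range(3))
```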
A second way to create generators is by using for comprehensions wrapped in parentheses:
evens_below_20 = (i for i in generate_range(20) if i % 2 == 0)
Such a “generator comprehension” doesn’t do any work until you iterate over it (using for or
next). We can use this to build up elaborate data processing pipelines:
# None of these computations *does* anything until we iterate
data = natural_numbers()
evens = (x for x in data if x % 2 == 0)
even_squares = (x ** 2 for x in evens)
even_squares_ending_in_six = (x for x in even_squares if x % 10 == 6)
# and so on
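One way to pull a few values out of such a pipeline without looping forever is itertools.islice from the standard library. It is not used in the text above, so treat this as an illustrative sketch rather than the book's approach:

```python
from itertools import islice

def natural_numbers():
    """returns 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1

data = natural_numbers()
evens = (x for x in data if x % 2 == 0)
even_squares = (x ** 2 for x in evens)
even_squares_ending_in_six = (x for x in even_squares if x % 10 == 6)

# Only now does any computation happen, and only as much as is needed.
first_three = list(islice(even_squares_ending_in_six, 3))   # [16, 36, 196]
```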
Not infrequently when we’re iterating over a list or a generator we’ll want not just the values but
also their indices. For this common case Python provides an enumerate function, which turns
values into pairs (index, value):
names = ["Alice", "Bob", "Charlie", "Debbie"]
# not Pythonic
for i in range(len(names)):
    print(f"name {i} is {names[i]}")
# also not Pythonic
i = 0
for name in names:
    print(f"name {i} is {names[i]}")
    i += 1
# Pythonic
for i, name in enumerate(names):
    print(f"name {i} is {name}")
We’ll use this a lot.
Randomness
As we learn data science, we will frequently need to generate random numbers, which we can
do with the random module:
import random
random.seed(10) # this ensures we get the same results every time
four_uniform_randoms = [random.random() for _ in range(4)]
# [0.5714025946899135, # random.random() produces numbers
# 0.4288890546751146, # uniformly between 0 and 1
# 0.5780913011344704, # it's the random function we'll use
# 0.20609823213950174] # most often
The random module actually produces pseudorandom (that is, deterministic) numbers based on
an internal state that you can set with random.seed if you want to get reproducible results:
random.seed(10) # set the seed to 10
print(random.random()) # 0.57140259469
random.seed(10) # reset the seed to 10
print(random.random()) # 0.57140259469 again
We’ll sometimes use random.randrange, which takes either 1 or 2 arguments and returns
an element chosen randomly from the corresponding range():
random.randrange(10) # choose randomly from range(10) = [0, 1, ..., 9]
random.randrange(3, 6) # choose randomly from range(3, 6) = [3, 4, 5]
There are a few more methods that we’ll sometimes find convenient. random.shuffle
randomly reorders the elements of a list:
up_to_ten = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
random.shuffle(up_to_ten)
print(up_to_ten)
# [7, 2, 6, 8, 9, 4, 10, 1, 3, 5] (your results will probably be different)
If you need to randomly pick one element from a list you can use random.choice:
my_best_friend = random.choice(["Alice", "Bob", "Charlie"]) # "Bob" for me
And if you need to randomly choose a sample of elements without replacement (i.e., with no
duplicates), you can use random.sample:
lottery_numbers = range(60)
winning_numbers = random.sample(lottery_numbers, 6) # [16, 36, 10, 6, 25, 9]
To choose a sample of elements with replacement (i.e., allowing duplicates), you can just make
multiple calls to random.choice:
four_with_replacement = [random.choice(range(10)) for _ in range(4)]
print(four_with_replacement) # [9, 4, 4, 2]
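As an aside, the standard library (Python 3.6 and later) also provides random.choices, which draws several elements with replacement in a single call. The text above does not use it, so this is only an alternative sketch. It also shows that reseeding makes the draw repeatable:

```python
import random

random.seed(0)
first_draw = random.choices(range(10), k=4)    # four values, duplicates allowed

random.seed(0)                                 # same seed, same internal state
second_draw = random.choices(range(10), k=4)   # identical to first_draw
```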
Regular Expressions
Regular expressions provide a way of searching text. They are incredibly useful but also fairly
complicated, so much so that there are entire books written about them. We will explain their
details the few times we encounter them; here are a few examples of how to use them in Python:
import re
re_examples = [ # all of these are true, because
    not re.match("a", "cat"), # 'cat' doesn't start with 'a'
    re.search("a", "cat"), # 'cat' has an 'a' in it
    not re.search("c", "dog"), # 'dog' doesn't have a 'c' in it
    3 == len(re.split("[ab]", "carbs")), # split on a or b to ['c', 'r', 's']
    "R-D-" == re.sub("[0-9]", "-", "R2D2") # replace digits with dashes
]
assert all(re_examples), "all the regex examples should be True"
One important thing to note is that re.match checks whether the beginning of a string matches a
regular expression; while re.search checks whether any part of a string matches a regular
expression. At some point you will mix these two up and it will cause you grief.
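A quick side-by-side may help keep the two apart. The sample string is arbitrary:

```python
import re

text = "concatenate"

at_start = re.match("cat", text)      # None: the string does not *begin* with "cat"
anywhere = re.search("cat", text)     # a match object: "cat" appears at index 3
prefix = re.match("con", text)        # a match object: the string begins with "con"

found_at = anywhere.start()           # 3
```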
The official documentation goes into much more detail.
partial
When passing functions around, sometimes we’ll want to partially apply (or curry) functions to
create new functions. As a simple example, imagine that we have a function of two variables:
def exp(base, power):
    return base ** power
and we want to use it to create a function of one variable two_to_the whose input is a
power and whose output is the result of exp(2, power).
We can, of course, do this with def, but this can sometimes get unwieldy:
def two_to_the(power):
    return exp(2, power)
A different approach is to use functools.partial:
from functools import partial
two_to_the = partial(exp, 2) # is now a function of one variable
assert two_to_the(3) == 8, "2 ** 3 should equal 8"
You can also use partial to fill in later arguments if you specify their names:
square_of = partial(exp, power=2)
assert square_of(3) == 9, "3 ** 2 should equal 9"
It starts to get messy if you curry arguments in the middle of the function, so we’ll try to avoid
doing that.
NOTE
The first edition of this book also introduced the Python functions map, reduce, and filter
at this point. On my journey toward enlightenment I have realized that these functions are best
avoided, and their uses in the book have been replaced with list comprehensions and for loops.
zip and Argument Unpacking

Often we will need to zip two or more iterables together. zip transforms multiple iterables
into a single iterable of tuples of corresponding elements:
list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]
# zip is lazy, so you have to do something like the following
[pair for pair in zip(list1, list2)] # is [('a', 1), ('b', 2), ('c', 3)]
If the lists are different lengths, zip stops as soon as the shortest list ends.
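For instance, with lists of unequal length (made up for this example), the leftover element is silently dropped:

```python
letters = ['a', 'b', 'c']
numbers = [1, 2]

# 'c' has no partner, so it does not appear in the result.
zipped = list(zip(letters, numbers))   # [('a', 1), ('b', 2)]
```

No error is raised, so a length mismatch can go unnoticed unless you check for it.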
You can also “unzip” a list using a strange trick:
pairs = [('a', 1), ('b', 2), ('c', 3)]
letters, numbers = zip(*pairs)
The asterisk performs argument unpacking, which uses the elements of pairs as individual
arguments to zip. It ends up the same as if you’d called:
letters, numbers = zip(('a', 1), ('b', 2), ('c', 3))
You can use argument unpacking with any function:
def add(a, b): return a + b
add(1, 2) # returns 3
try:
    add([1, 2])
except TypeError:
    print("add expects two inputs")
add(*[1, 2]) # returns 3
It is rare that we’ll find this useful, but when we do it’s a neat trick.
args and kwargs

Let’s say we want to create a higher-order function that takes as input some function f and
returns a new function that for any input returns twice the value of f:
def doubler(f):
    # Here we define a new function that keeps a reference to f
    def g(x):
        return 2 * f(x)
    # And return that new function.
    return g
This works in some cases:
def f1(x):
    return x + 1
g = doubler(f1)
assert g(3) == 8, "(3 + 1) * 2 should equal 8"
assert g(-1) == 0, "(-1 + 1) * 2 should equal 0"
However, it doesn’t work with functions that take more than a single argument:
def f2(x, y):
    return x + y
g = doubler(f2)
try:
    g(1, 2)
except TypeError:
    print("as defined, g only takes one argument")
What we need is a way to specify a function that takes arbitrary arguments. We can do this with
argument unpacking and a little bit of magic:
def magic(*args, **kwargs):
    print("unnamed args:", args)
    print("keyword args:", kwargs)
magic(1, 2, key="word", key2="word2")
# prints
# unnamed args: (1, 2)
# keyword args: {'key': 'word', 'key2': 'word2'}
That is, when we define a function like this, args is a tuple of its unnamed arguments and
kwargs is a dict of its named arguments. It works the other way too, if you want to use a
list (or tuple) and dict to supply arguments to a function:
def other_way_magic(x, y, z):
    return x + y + z
x_y_list = [1, 2]
z_dict = {"z": 3}
assert other_way_magic(*x_y_list, **z_dict) == 6, "1 + 2 + 3 should be 6"
You could do all sorts of strange tricks with this; we will only use it to produce higher-order
functions whose inputs can accept arbitrary arguments:
def doubler_correct(f):
    """works no matter what kind of inputs f expects"""
    def g(*args, **kwargs):
        """whatever arguments g is supplied, pass them through to f"""
        return 2 * f(*args, **kwargs)
    return g
g = doubler_correct(f2)
assert g(1, 2) == 6, "doubler should work now"
As a general rule, your code will be more correct and more readable if you are explicit about
what sorts of arguments your functions require; accordingly, we will use args and kwargs
only when we have no other option.
Type Hints
Python is a dynamically typed language. That means that in general it doesn’t care about the
types of objects we use, as long as we use them in valid ways:
def add(a, b):
    return a + b
assert add(10, 5) == 15, "+ is valid for numbers"
assert add([1, 2], [3]) == [1, 2, 3], "+ is valid for lists"
assert add("hi ", "there") == "hi there", "+ is valid for strings"
try:
    add(10, "five")
except TypeError:
    print("cannot add an int to a string")
Whereas in a statically typed language your functions and objects would have specific types:
def add(a: int, b: int) -> int:
    return a + b
add(10, 5) # you'd like this to be OK
add("hi ", "there") # you'd like this to be not OK
In fact, recent versions of Python do (sort of) have this functionality. The above version of add
with the int type annotations is valid Python 3.6!
However, the type annotations don’t actually do anything. You can still use the annotated add
function to add strings, and the call to add(10, "five") would still raise a TypeError
only at runtime.
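A short sketch makes the point: the annotated function happily accepts strings, and the hints are merely stored on the function object, where tools can read them:

```python
def add(a: int, b: int) -> int:
    return a + b

# Nothing checks the annotations at runtime, so this call succeeds.
greeting = add("hi ", "there")     # "hi there"

# The hints are simply recorded in the function's __annotations__ dict.
hints = add.__annotations__        # {'a': int, 'b': int, 'return': int}
```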
That said, there are still (at least) four good reasons to use type hints in your Python code:
1. There are external tools (the most popular is mypy) that will read your code, inspect the
type hints, and let you know about type errors before you ever run your code. For
example, if you ran mypy over a file containing the most recent snippet, it would warn you:
error: Argument 1 to "add" has incompatible type "str"; expected "int"
Like assert testing, this is a good way to find mistakes in your code before you ever run it.
The narrative in the book will not involve such a type checker; however, behind the scenes I
will be running one to make sure the book’s code is correct.
2. Types are an important form of documentation. This is doubly true in a book that is using
code to teach you theoretical and mathematical concepts. Compare the following two
function stubs:
def dot_product(x, y): ...
# we have not yet defined Vector, but imagine we had
def dot_product(x: Vector, y: Vector) -> float: ...
I find the second one exceedingly more informative; hopefully you do too. (At this point I have
gotten so used to type hinting that I now find untyped Python difficult to read.)
3. Having to think about the types in your code forces you to design cleaner functions and
interfaces.
from typing import Union
def secretly_ugly_function(value, operation): ...
def ugly_function(value: int, operation: Union[str, int, float, bool]) -> int:
    ...
Here we have a function whose operation parameter is allowed to be a string, or an int, or a
float, or a bool. It is highly likely that this function is fragile and difficult to use, but this
becomes far more clear when the types are made explicit. Doing so, then, will force us to design
in a less clunky way, for which our users will thank us.
4. Using types allows your editor to help you with things like autocomplete and to get angry
at type errors.
Sometimes people insist that type hints may be valuable on large projects but are not worth the
time for small ones. However, since type hints take almost no additional time to type and allow
your editor to save you time, I maintain that type hints actually allow you to write code more
quickly, even for small projects.
For all these reasons, all of the code in the remainder of the book will use type annotations.
How To Write Type Annotations

As we’ve seen, for built-in types like int and bool and float, you just use the type itself as
the type hint. What if you had (say) a list?
def total(xs: list) -> float:
    return sum(xs)
This isn’t wrong, but the type is not restrictive enough. It’s clear we really want xs to be a list
of floats, not (say) a list of strings.
The typing module provides a number of parameterized types that we can use to do just this:
from typing import List # note capital L
def total(xs: List[float]) -> float:
    return sum(xs)
Up until now we’ve only specified type hints for function parameters and return types. For
variables themselves it’s usually obvious what the type is:
# This is how to type hint variables when you define them.
x: int = 5 # this is fine but unnecessary, it's "obvious" x is an int
However, sometimes it’s not obvious:
values = [] # what's my type?
best_so_far = None # what's my type?
In such cases we will supply inline type hints:
from typing import List, Optional
values: List[int] = []
best_so_far: Optional[float] = None  # allowed to be either a float or None
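To see why the Optional hint earns its keep, here is a minimal sketch. The function name best_score is hypothetical (it is not defined anywhere in this book); it starts from None and only holds a float once it has seen a value:

```python
from typing import List, Optional

def best_score(scores: List[float]) -> Optional[float]:
    """Returns the largest score, or None if there are no scores."""
    best_so_far: Optional[float] = None   # nothing seen yet
    for score in scores:
        if best_so_far is None or score > best_so_far:
            best_so_far = score
    return best_so_far

assert best_score([]) is None
assert best_score([0.5, 0.9, 0.7]) == 0.9
```

The return type tells callers up front that they have to handle the None case.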
The typing module contains many other types, only a few of which we’ll ever use:
# the type hints in this snippet are all unnecessary
from typing import Dict, Iterable, Tuple
# keys are strings, values are ints
counts: Dict[str, int] = {'data': 1, 'science': 2}
# lists and generators are both iterable
lazy = True  # assumed flag, defined here only so the snippet runs
if lazy:
    evens: Iterable[int] = (x for x in range(10) if x % 2 == 0)
else:
    evens = [0, 2, 4, 6, 8]
# tuples specify a type for each element
triple: Tuple[int, float, int] = (10, 2.3, 5)
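Where these hints tend to pay off is in function signatures. As a small sketch (count_evens is a made-up name, used only for illustration), a parameter typed as Iterable[int] accepts a list and a generator equally well:

```python
from typing import Iterable

def count_evens(xs: Iterable[int]) -> int:
    """Counts the even numbers in any iterable of ints."""
    return sum(1 for x in xs if x % 2 == 0)

assert count_evens([0, 1, 2, 3, 4]) == 3        # a list works
assert count_evens(x for x in range(10)) == 5   # and so does a generator
```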
Finally, since Python has first-class functions, we need a type to represent those as well. Here’s
a pretty contrived example:
from typing import Callable
# The type hint says that repeater is a function that takes
# two arguments, a string and an int, and returns a string.
def twice(repeater: Callable[[str, int], str], s: str) -> str:
    return repeater(s, 2)
def comma_repeater(s: str, n: int) -> str:
    n_copies = [s for _ in range(n)]
    return ', '.join(n_copies)
assert twice(comma_repeater, "type hints") == "type hints, type hints"
As type hints are just Python objects, we can assign them to variables to make them easier to
refer to:
Number = int
Numbers = List[Number]
def total(xs: Numbers) -> Number:
    return sum(xs)
By the time you get to the end of the book, you’ll be quite familiar with Python type hints, and I
hope you’ll use them in your code.
Welcome to DataSciencester!
This concludes new-employee orientation. Oh, and also, try not to embezzle anything.
For Further Exploration

There is no shortage of Python tutorials in the world. The official one is not a bad place to
start.
The official IPython tutorial will help you get started with IPython, if you decide to use it.
Please use it.
The Mypy documentation will tell you more than you ever wanted to know about Python
type annotations and type checking.
Chapter 3. Visualizing Data
I believe that visualization is one of the most powerful means of achieving personal goals.
—Harvey Mackay
A fundamental part of the data scientist’s toolkit is data visualization. Although it is very easy to
create visualizations, it’s much harder to produce good ones.
There are two primary uses for data visualization:
To explore data
To communicate data
In this chapter, we will concentrate on building the skills that you’ll need to start exploring your
own data and to produce the visualizations we’ll be using throughout the rest of the book. Like
most of our chapter topics, data visualization is a rich field of study that deserves its own book.
Nonetheless, we’ll try to give you a sense of what makes for a good visualization and what
doesn’t.
matplotlib
A wide variety of tools exists for visualizing data. We will be using the matplotlib library,
which is widely used (although sort of showing its age). If you are interested in producing
elaborate interactive visualizations for the Web, it is likely not the right choice, but for simple
bar charts, line charts, and scatterplots, it works pretty well.
matplotlib is not part of the core Python library (although if you installed the Anaconda
distribution, it should have been included). If you don’t have it you will need to install it from
the command line using
pip install matplotlib
We will be using the matplotlib.pyplot module. In its simplest use, pyplot maintains
an internal state in which you build up a visualization step by step. Once you’re done, you can
save it (with savefig()) or display it (with show()).
For example, making simple plots (like Figure 3-1) is pretty simple:
from matplotlib import pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()
Figure 3-1. A simple line chart
Making plots that look publication-quality good is more complicated and beyond the scope of
this chapter. There are many ways you can customize your charts with (for example) axis labels,
line styles, and point markers. Rather than attempt a comprehensive treatment of these options,
we’ll just use (and call attention to) some of them in our examples.
NOTE
Although we won’t be using much of this functionality, matplotlib is capable of
producing complicated plots within plots, sophisticated formatting, and interactive
visualizations. Check out its documentation if you want to go deeper than we do in
this book.
Bar Charts
A bar chart is a good choice when you want to show how some quantity varies among some
discrete set of items. For instance, Figure 3-2 shows how many Academy Awards were won by
each of a variety of movies:
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]
# plot bars with x-coordinates [0, 1, 2, 3, 4], heights [num_oscars]
plt.bar(range(len(movies)), num_oscars)
plt.title("My Favorite Movies")     # add a title
plt.ylabel("# of Academy Awards")   # label the y-axis
# label x-axis with movie names at bar centers
plt.xticks(range(len(movies)), movies)
plt.show()
Figure 3-2. A simple bar chart
A bar chart can also be a good choice for plotting histograms of bucketed numeric values, in
order to visually explore how the values are distributed, as in Figure 3-3:
from collections import Counter
grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
# Bucket grades by decile, but put 100 in with the 90s
decile = lambda grade: min(grade // 10 * 10, 90)
histogram = Counter(decile(grade) for grade in grades)
plt.bar([x + 5 for x in histogram.keys()],  # shift bars right by 5
        histogram.values(),                 # give each bar its correct height
        10,                                 # give each bar a width of 10
        edgecolor=(0, 0, 0))                # black edges for each bar
plt.axis([-5, 105, 0, 5])                   # x-axis from -5 to 105,
                                            # y-axis from 0 to 5
plt.xticks([10 * i for i in range(11)])     # x-axis labels at 0, 10, ..., 100
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()
Figure 3-3. Using a bar chart for a histogram
The third argument to plt.bar specifies the bar width. Here we chose a width of 10, to fill the
entire decile. We also shifted the bars right by 5, so that (for example) the “10” bar (which
corresponds to the decile 10–20) would have its center at 15 and hence occupy the correct range.
We also added a black edge to each bar to make them visually distinct.
The call to plt.axis indicates that we want the x-axis to range from -5 to 105 (just to leave a
little space on the left and right), and that the y-axis should range from 0 to 5. And the call to
plt.xticks puts x-axis labels at 0, 10, 20, …, 100.
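If you want to convince yourself of what the bars will look like before plotting anything, you can inspect the bucketed counts directly. This is just a sketch of the same bucketing with no plotting involved; I've written decile as a regular function here, which is a stylistic choice rather than anything the chart requires:

```python
from collections import Counter

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]

def decile(grade: int) -> int:
    """Rounds down to the nearest 10, but puts 100 in with the 90s."""
    return min(grade // 10 * 10, 90)

histogram = Counter(decile(grade) for grade in grades)

# bucket -> number of students in that bucket
assert histogram == {0: 2, 60: 1, 70: 3, 80: 4, 90: 3}

# each bar is centered 5 units to the right of its bucket's lower edge
centers = sorted(x + 5 for x in histogram)
assert centers == [5, 65, 75, 85, 95]
```

The tallest bar has height 4, which is why a y-axis running from 0 to 5 leaves a little headroom.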
Be judicious when using plt.axis(). When creating bar charts it is considered especially
bad form for your y-axis not to start at 0, since this is an easy way to mislead people (Figure I-1):
mentions = [500, 505]
years = [2017, 2018]
plt.bar(years, mentions, 0.8)
plt.xticks(years)
plt.ylabel("# of times I heard someone say 'data science'")
# if you don't do this, matplotlib will label the x-axis 0, 1
# and then add a +2.013e3 off in the corner (bad matplotlib!)
plt.ticklabel_format(useOffset=False)
# misleading y-axis only shows the part above 500
plt.axis([2016.5, 2018.5, 499, 506])
plt.title("Look at the 'Huge' Increase!")
plt.show()
Figure I-1. A chart with a misleading y-axis
In Figure I-2, we use more sensible axes, and it looks far less impressive:
plt.axis([2016.5, 2018.5, 0, 550])
plt.title("Not So Huge Anymore")
plt.show()
Figure I-2. The same chart with a nonmisleading y-axis
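A little arithmetic shows how much the truncated axis exaggerates. The helper visible_height_ratio below is a hypothetical name I'm introducing for illustration; it compares how tall the two bars appear when the y-axis starts at a given value:

```python
mentions = [500, 505]

# the actual relative increase from 2017 to 2018
relative_increase = (mentions[1] - mentions[0]) / mentions[0]
assert relative_increase == 0.01          # a 1% increase

def visible_height_ratio(y_min: float) -> float:
    """How many times taller the second bar looks when the y-axis starts at y_min."""
    return (mentions[1] - y_min) / (mentions[0] - y_min)

assert visible_height_ratio(499) == 6.0   # axis starts at 499: looks 6x taller
assert visible_height_ratio(0) == 1.01    # axis starts at 0: looks 1% taller
```

A 1% change drawn as a sixfold difference in bar height is exactly the kind of thing that misleads people.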
Line Charts
As we saw already, we can make line charts using plt.plot(). These are a good choice for
showing trends, as illustrated in Figure I-3:
variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]
# we can make multiple calls to plt.plot
# to show multiple series on the same chart
plt.plot(xs, variance, 'g-', label='variance')        # green solid line
plt.plot(xs, bias_squared, 'r-.', label='bias^2')     # red dot-dashed line
plt.plot(xs, total_error, 'b:', label='total error')  # blue dotted line
# because we've assigned labels to each series
# we can get a legend for free
# loc=9 means "top center"
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.xticks([])
plt.title("The Bias-Variance Tradeoff")
plt.show()
Figure I-3. Several line charts with a legend
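The shape of the total error curve is easy to check numerically. As a quick sketch (best_complexity is just a name I'm using here), we can find the position along the x-axis where the total error bottoms out:

```python
variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]

assert total_error == [257, 130, 68, 40, 32, 40, 68, 130, 257]

# index (i.e., position on the x-axis) at which the total error is smallest
best_complexity = min(range(len(total_error)), key=lambda i: total_error[i])

assert best_complexity == 4
assert total_error[best_complexity] == 32
```

The minimum sits in the middle, where variance and squared bias are equal.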
Scatterplots
A scatterplot is the right choice for visualizing the relationship between two paired sets of data.
For example, Figure I-4 illustrates the relationship between the number of friends your users
have and the number of minutes they spend on the site every day:
friends = [ 70,  65,  72,  63,  71,  64,  60,  64,  67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
plt.scatter(friends, minutes)
# label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),  # put the label with its point
                 xytext=(5, -5),                   # but slightly offset
                 textcoords='offset points')
plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()
Figure I-4. A scatterplot of friends and time on the site
If you’re scattering comparable variables, you might get a misleading picture if you let
matplotlib choose the scale, as in Figure I-5:
test_1_grades = [ 99, 90, 85, 97, 80]
test_2_grades = [100, 85, 60, 90, 70]
plt.scatter(test_1_grades, test_2_grades)
plt.title("Axes Aren't Comparable")
plt.xlabel("test 1 grade")
plt.ylabel("test 2 grade")
plt.show()
Figure I-5. A scatterplot with uncomparable axes
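To see why the default scaling can mislead, compare how far each set of grades actually spreads. This is a rough sketch; spread is a hypothetical helper, and "max minus min" is only the simplest possible measure of variation:

```python
from typing import List

test_1_grades = [ 99, 90, 85, 97, 80]
test_2_grades = [100, 85, 60, 90, 70]

def spread(xs: List[float]) -> float:
    """Difference between the largest and smallest values."""
    return max(xs) - min(xs)

assert spread(test_1_grades) == 19   # test 1 covers 19 points
assert spread(test_2_grades) == 40   # test 2 covers 40 points
```

When each axis is scaled to fit its own data, a 19-point range and a 40-point range each fill roughly the full length of their axis, so the two tests look about equally variable even though they aren't.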
If we include a call to plt.axis("equal"), the plot (Figure I-6) more accurately shows
that most of the variation occurs on test 2.
Figure I-6. The same scatterplot with equal axes
That’s enough to get you started doing visualization. We’ll learn much more about visualization
throughout the book.
For Further Exploration

seaborn is built on top of matplotlib and allows you to easily produce prettier (and
more complex) visualizations.
D3.js is a JavaScript library for producing sophisticated interactive visualizations for the
web. Although it is not in Python, it is both trendy and widely used, and it is well worth
your while to be familiar with it.
Bokeh is a newer library that brings D3-style visualizations into Python.
Chapter 4. Linear Algebra
Is there anything more useless or less useful than Algebra?
—Billy Connolly
Linear algebra is the branch of mathematics that deals with vector spaces. Although I can’t hope
to teach you linear algebra in a brief chapter, it underpins a large number of data science
concepts and techniques, which means I owe it to you to at least try. What we learn in this
chapter we’ll use heavily throughout the rest of the book.
Vectors
Abstractly, vectors are objects that can be added together (to form new vectors) and that can be
multiplied by scalars (i.e., numbers), also to form new vectors.
Concretely (for us), vectors are points in some finite-dimensional space. Although you might
not think of your data as vectors, they are a good way to represent numeric data.
For example, if you have the heights, weights, and ages of a large number of people, you can
treat your data as three-dimensional vectors (height, weight, age). If you’re teaching a
class with four exams, you can treat student grades as four-dimensional vectors (exam1,
exam2, exam3, exam4).
The simplest from-scratch approach is to represent vectors as lists of numbers. A list of three
numbers corresponds to a vector in three-dimensional space, and vice versa.
We’ll accomplish this with a type alias:
from typing import List
Vector = List[float]
height_weight_age = [70, # inches,
170, # pounds,
40 ] # years
grades = [95, # exam1
80, # exam2
75, # exam3
62 ] # exam4
We’ll also want to perform arithmetic on vectors. Because Python lists aren’t vectors (and hence
provide no facilities for vector arithmetic), we’ll need to build these arithmetic tools ourselves.
So let’s start with that.
To begin with, we’ll frequently need to add two vectors. Vectors add componentwise. This
means that if two vectors v and w are the same length, their sum is just the vector whose first
element is v[0] + w[0], whose second element is v[1] + w[1], and so on. (If they’re not
the same length, then we’re not allowed to add them.)
For example, adding the vectors [1, 2] and [2, 1] results in [1 + 2, 2 + 1] or [3,
3], as shown in Figure 4-1.
Figure 4-1. Adding two vectors
We can easily implement this by zipping the vectors together and using a list comprehension to
add the corresponding elements:
def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]
Similarly, to subtract two vectors we just subtract corresponding elements:
def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]
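A couple of quick checks make the behavior concrete. This sketch repeats the definitions so that it runs on its own; the example vectors are arbitrary:

```python
from typing import List

Vector = List[float]

def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

assert add([1, 2, 3], [4, 5, 6]) == [5, 7, 9]
assert subtract([5, 7, 9], [4, 5, 6]) == [1, 2, 3]
```

Notice that subtract undoes add, which is a handy sanity check whenever you write a pair of inverse operations.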
We’ll also sometimes want to componentwise sum a list of vectors. That is, create a new vector
whose first element is the sum of all the first elements, whose second element is the sum of all
the second elements, and so on.
from typing import List
def vector_sum(vectors: List[Vector]) -> Vector:
    """Sums all corresponding elements"""
    assert vectors, "vectors cannot be empty"
    result = vectors[0]                 # start with the first vector
    for vector in vectors[1:]:          # then loop over the others
        result = add(result, vector)    # and add them to the result
    return result
We’ll also need to be able to multiply a vector by a scalar, which we do simply by multiplying
each element of the vector by that number:
def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c"""
    return [c * v_i for v_i in v]
This allows us to compute the componentwise means of a list of (samesized) vectors:
def vector_mean(vectors: List[Vector]) -> Vector:
    """Computes the element-wise average"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))
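Here is how the last three functions fit together on a tiny example. The sketch restates the definitions so it is self-contained, and I've picked four vectors so that the division by n is exact in floating point:

```python
from typing import List

Vector = List[float]

def add(v: Vector, w: Vector) -> Vector:
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_sum(vectors: List[Vector]) -> Vector:
    """Sums all corresponding elements"""
    assert vectors, "vectors cannot be empty"
    result = vectors[0]
    for vector in vectors[1:]:
        result = add(result, vector)
    return result

def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c"""
    return [c * v_i for v_i in v]

def vector_mean(vectors: List[Vector]) -> Vector:
    """Computes the element-wise average"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

vectors = [[1, 2], [3, 4], [5, 6], [7, 8]]

assert vector_sum(vectors) == [16, 20]
assert scalar_multiply(2, [1, 2, 3]) == [2, 4, 6]
assert vector_mean(vectors) == [4, 5]
```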
A less obvious tool is the dot product. The dot product of two vectors is the sum of their
componentwise products:
def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))
If w has magnitude 1, the dot product measures how far the vector v extends in the w direction.
For example, if w = [1, 0] then dot(v, w) is just the first component of v. Another way
of saying this is that it’s the length of the vector you’d get if you projected v onto w (Figure 4-2).
Figure 4-2. The dot product as vector projection
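A few concrete dot products may help here. In this sketch the vector [3, 4] is an arbitrary choice, and the two unit vectors point along the coordinate axes, so each dot product just picks out one component:

```python
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

v = [3, 4]

assert dot(v, [1, 0]) == 3   # how far v extends in the first direction
assert dot(v, [0, 1]) == 4   # how far v extends in the second direction

# the general case: 1 * 4 + 2 * 5 + 3 * 6
assert dot([1, 2, 3], [4, 5, 6]) == 32
```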
Using this, it’s easy to compute a vector’s sum of squares:
def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)
Which we can use to compute its magnitude (or length):
import math
def magnitude(v: Vector) -> float:
    """Returns the magnitude (or length) of v"""
    return math.sqrt(sum_of_squares(v))  # math.sqrt is square root function
We now have all the pieces we need to compute the distance between two vectors, defined as:
$\sqrt{(v_1 - w_1)^2 + \dots + (v_n - w_n)^2}$
def squared_distance(v: Vector, w: Vector) -> float:
    """Computes (v_1 - w_1) ** 2 + ... + (v_n - w_n) ** 2"""
    return sum_of_squares(subtract(v, w))
def distance(v: Vector, w: Vector) -> float:
    """Computes the distance between v and w"""
    return math.sqrt(squared_distance(v, w))
Which is possibly clearer if we write it as (the equivalent):
def distance(v: Vector, w: Vector) -> float:
    return magnitude(subtract(v, w))
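The familiar 3-4-5 right triangle makes a convenient check of all of these at once. This is a self-contained sketch that restates the functions above; the points [1, 1] and [4, 5] are chosen only because they differ by [3, 4]:

```python
import math
from typing import List

Vector = List[float]

def subtract(v: Vector, w: Vector) -> Vector:
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def dot(v: Vector, w: Vector) -> float:
    assert len(v) == len(w), "vectors must be same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v: Vector) -> float:
    return dot(v, v)

def magnitude(v: Vector) -> float:
    return math.sqrt(sum_of_squares(v))

def squared_distance(v: Vector, w: Vector) -> float:
    return sum_of_squares(subtract(v, w))

def distance(v: Vector, w: Vector) -> float:
    return magnitude(subtract(v, w))

assert sum_of_squares([1, 2, 3]) == 14        # 1 + 4 + 9
assert magnitude([3, 4]) == 5                 # sqrt(9 + 16)
assert squared_distance([1, 1], [4, 5]) == 25
assert distance([1, 1], [4, 5]) == 5
```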
That should be plenty to get us started. We’ll be using these functions heavily throughout the
book.
NOTE
Using lists as vectors is great for exposition but terrible for performance.
In production code, you would want to use the NumPy library, which includes a
high-performance array class with all sorts of arithmetic operations included.
Matrices
A matrix is a two-dimensional collection of numbers. We will represent matrices as lists of
lists, with each inner list having the same size and representing a row of the matrix. If A is a
matrix, then A[i][j] is the element in the ith row and the jth column. Per mathematical
convention, we will typically use capital letters to represent matrices. For example:
# Another type alias
Matrix = List[List[float]]
A = [[1, 2, 3], # A has 2 rows and 3 columns
[4, 5, 6]]
B = [[1, 2], # B has 3 rows and 2 columns
[3, 4],
[5, 6]]
NOTE
In mathematics, you would usually name the first row of the matrix “row 1” and the
first column “column 1.” Because we’re representing matrices with Python lists,
which are zero-indexed, we’ll call the first row of a matrix “row 0” and the first
column “column 0.”
Given this list-of-lists representation, the matrix A has len(A) rows and len(A[0])
columns, which we consider its shape:
from typing import Tuple
def shape(A: Matrix) -> Tuple[int, int]:
    """Returns (# of rows of A, # of columns of A)"""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0   # number of elements in first row
    return num_rows, num_cols
If a matrix has n rows and k columns, we will refer to it as an n × k matrix. We can (and
sometimes will) think of each row of an n × k matrix as a vector of length k, and each column
as a vector of length n:
def get_row(A: Matrix, i: int) -> Vector:
    """Returns the i-th row of A (as a Vector)"""
    return A[i]             # A[i] is already the i-th row
def get_column(A: Matrix, j: int) -> Vector:
    """Returns the j-th column of A (as a Vector)"""
    return [A_i[j]          # j-th element of row A_i
            for A_i in A]   # for each row A_i
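Using the 2 × 3 matrix A from earlier, a few assertions show what these helpers return. The sketch restates the definitions so that it runs on its own:

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]

def shape(A: Matrix) -> Tuple[int, int]:
    """Returns (# of rows of A, # of columns of A)"""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0
    return num_rows, num_cols

def get_row(A: Matrix, i: int) -> Vector:
    """Returns the i-th row of A (as a Vector)"""
    return A[i]

def get_column(A: Matrix, j: int) -> Vector:
    """Returns the j-th column of A (as a Vector)"""
    return [A_i[j] for A_i in A]

A = [[1, 2, 3],
     [4, 5, 6]]

assert shape(A) == (2, 3)            # 2 rows, 3 columns
assert shape([]) == (0, 0)           # an empty matrix has no rows or columns
assert get_row(A, 0) == [1, 2, 3]    # a vector of length k = 3
assert get_column(A, 1) == [2, 5]    # a vector of length n = 2
```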
We’ll also want to be able to create a matrix given its shape and a function for generating its
elements. We can do this using a nested list comprehension:
from typing import Callable
def make_matrix(num_rows: int,
                num_cols: int,
                entry_fn: Callable[[int, int], float]) -> Matrix:
    """
    Returns a num_rows x num_cols matrix
    whose (i,j)-th entry is entry_fn(i, j)
    """
    return [[entry_fn(i, j)             # given i, create a list
             for j in range(num_cols)]  #   [entry_fn(i, 0), ... ]
            for i in range(num_rows)]   # create one list for each i
Given this function, you could make a 5 × 5 identity matrix (with 1s on the diagonal and 0s
elsewhere) with:
def identity_matrix(n: int) -> Matrix:
    """Returns the n x n identity matrix"""
    return make_matrix(n, n, lambda i, j: 1 if i == j else 0)
assert identity_matrix(5) == [[1, 0, 0, 0, 0],
                              [0, 1, 0, 0, 0],
                              [0, 0, 1, 0, 0],
                              [0, 0, 0, 1, 0],
                              [0, 0, 0, 0, 1]]
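Any function of the row and column index works as entry_fn. As another small sketch (the multiplication-table example is my own, not one the rest of the book relies on), here is a 3 × 4 matrix whose (i, j)-th entry is i * j:

```python
from typing import Callable, List

Matrix = List[List[float]]

def make_matrix(num_rows: int,
                num_cols: int,
                entry_fn: Callable[[int, int], float]) -> Matrix:
    """Returns a num_rows x num_cols matrix whose (i,j)-th entry is entry_fn(i, j)"""
    return [[entry_fn(i, j) for j in range(num_cols)]
            for i in range(num_rows)]

times_table = make_matrix(3, 4, lambda i, j: i * j)

assert times_table == [[0, 0, 0, 0],
                       [0, 1, 2, 3],
                       [0, 2, 4, 6]]
```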
Matrices will be important to us for several reasons.
First, we can use a matrix to represent a data set consisting of multiple vectors, simply by
considering each vector as a row of the matrix. For example, if you had the heights, weights,
and ages of 1,000 people you could put them in a matrix:
data = [[70, 170, 40],
[65, 120, 26],
[77, 250, 19],
# ....
]
Second, as we’ll see later, we can use an n × k matrix to represent a linear function that maps
k-dimensional vectors to n-dimensional vectors. Several of our techniques and concepts will
involve such functions.
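To give a rough idea of what that means, here is a minimal sketch. The function matrix_times_vector is a hypothetical name introduced only for this illustration; it applies an n × k matrix to a k-dimensional vector by taking the dot product of each row with the vector, which yields an n-dimensional vector:

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def matrix_times_vector(A: Matrix, v: Vector) -> Vector:
    """Applies the linear function represented by A to v (hypothetical helper)."""
    assert all(len(row) == len(v) for row in A), "each row must be as long as v"
    return [dot(row, v) for row in A]

A = [[1, 2, 3],     # a 2 x 3 matrix maps
     [4, 5, 6]]     # 3-dimensional vectors to 2-dimensional vectors

assert matrix_times_vector(A, [1, 0, 1]) == [4, 10]
```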
Third, matrices can be used to represent binary relationships. In Chapter 1, we represented the
edges of a network as a collection of pairs (i, j). An alternative representation would be to
create a matrix A such that A[i][j] is 1 if nodes i and j are connected and 0 otherwise.
Recall that before we had:
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
(4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
We could also represent this as:
# user 0 1 2 3 4 5 6 7 8 9
#
friend_matrix = [[0, 1, 1, 0, 0, 0, 0, 0, 0, 0], # user 0
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0], # user 1
[1, 1, 0, 1, 0, 0, 0, 0, 0, 0], # user 2
[0, 1, 1, 0, 1, 0, 0, 0, 0, 0], # user 3
[0, 0, 0, 1, 0, 1, 0, 0, 0, 0], # user 4
[0, 0, 0, 0, 1, 0, 1, 1, 0, 0], # user 5
[0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 6
[0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 7
[0, 0, 0, 0, 0, 0, 1, 1, 0, 1], # user 8
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]] # user 9
If there are very few connections, this is a much less efficient representation, since you end
up having to store a lot of zeroes. However, with the matrix representation it is much quicker to
check whether two nodes are connected—you just have to do a matrix lookup instead of
(potentially) inspecting every edge:
assert friend_matrix[0][2] == 1, "0 and 2 are friends"
assert friend_matrix[0][8] == 0, "0 and 8 are not friends"
Similarly, to find the connections a node has, you only need to inspect the column (or the row)
corresponding to that node:
# only need to look at one row
friends_of_five = [i
                   for i, is_friend in enumerate(friend_matrix[5])
                   if is_friend]
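Rather than typing the matrix out by hand, you could build it from the list of pairs. The following is a sketch under the assumption that friendships are symmetric; adjacency_matrix is a hypothetical helper name, not something defined elsewhere in the book:

```python
from typing import List, Tuple

Matrix = List[List[float]]

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

def adjacency_matrix(edges: List[Tuple[int, int]], num_nodes: int) -> Matrix:
    """Builds a num_nodes x num_nodes matrix of 0s and 1s from a list of edges."""
    matrix: Matrix = [[0 for _ in range(num_nodes)] for _ in range(num_nodes)]
    for i, j in edges:
        matrix[i][j] = 1    # friendship goes both ways,
        matrix[j][i] = 1    # so the matrix is symmetric
    return matrix

friend_matrix = adjacency_matrix(friendships, 10)

assert friend_matrix[0][2] == 1, "0 and 2 are friends"
assert friend_matrix[0][8] == 0, "0 and 8 are not friends"

friends_of_five = [i
                   for i, is_friend in enumerate(friend_matrix[5])
                   if is_friend]
assert friends_of_five == [4, 6, 7]
```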
With a small graph you could just add a list of connections to each node object to speed up this
process; but for a large, evolving graph that would probably be too expensive and difficult to
maintain.
We’ll revisit matrices throughout the book.
For Further Exploration

Linear algebra is widely used by data scientists (frequently implicitly, and not infrequently
by people who don’t understand it). It wouldn’t be a bad idea to read a textbook. You can
find several freely available online:
Linear Algebra, from UC Davis
Linear Algebra, from Saint Michael’s College
If you are feeling adventurous, Linear Algebra Done Wrong is a more advanced
introduction
All of the machinery we built here you get for free if you use NumPy. (You get a lot more
too, including much better performance.)
