Kirill Dubovikov
BIRMINGHAM - MUMBAI
Managing Data Science
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, without the prior written permission of the publisher, except in the case of brief quotations
embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented.
However, the information contained in this book is sold without warranty, either express or implied. Neither the
author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged
to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products
mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy
of this information.
ISBN 978-1-83882-632-1
www.packt.com
To my family, who believed in me and encouraged me to write this book. To my wife and
children, who provided great support and care throughout the writing process. To my friends
and fellow editors, who pushed me forward and helped to make the book better.
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Why subscribe?
Spend less time learning and more time coding with practical eBooks and Videos
from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.packt.com and as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
[email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters, and receive exclusive discounts and offers on Packt books and
eBooks.
Contributors
About the author
Kirill is a happy father of two boys. He loves learning all things new, reading books, and writing articles for top Medium publications.
About the reviewers
Phuong Vo. T.H is a data science manager at FPT Telecom in Vietnam. She graduated with an MSc degree in computer science from Soongsil University, Korea. She has experience in
analyzing user behavior and building recommendation or prediction systems for business
optimization. She loves to read machine learning and mathematics-related algorithm books,
as well as data analysis articles.
Jean-Pierre Sleiman works for BNP Paribas as a project owner in charge of the AI strategy
for the retail activities of the Group. His work ranges from delivering data & analytics use
cases for the business to co-designing the Group's target operating model for AI. Jean-
Pierre has been involved with data analytics & AI projects for more than 3 years, working
on several initiatives, such as crafting data strategies and managing the implementation of
data products & services within multidisciplinary teams. He was also a lecturer at HEC Paris and École Polytechnique for the MSc in Data Science for Business, where he co-created the course content and taught the Data science project management and Business case for data projects lessons.
Table of Contents
Kanban
Scrum
Choosing a methodology that suits your project
Creating disruptive innovation
Providing a tested solution
Developing a custom project for a customer
Estimating data science projects
Learning to make time and cost estimates
Discovering the goals of the estimation process
Summary
Chapter 9: Common Pitfalls of Data Science Projects
Avoiding the common risks of data science projects
Approaching research projects
Dealing with prototypes and MVP projects
Case study – creating an MVP in a consulting company
Mitigating risks in production-oriented data science systems
Case study – bringing a sales forecasting system into production
Summary
Chapter 10: Creating Products and Improving Reusability
Thinking of projects as products
Determining the stage of your project
Case study – building a service desk routing system
Improving reusability
Seeking and building products
Privacy concerns
Summary
Preface
Data science and machine learning can transform any organization and open new
opportunities. Any data science project is a unique mix of research, software engineering,
and business expertise. A substantial managerial effort is needed to guide the solution from
prototype development to production. Traditional management approaches often fail because they were designed with different conditions and requirements in mind. This book presents a proven approach to
data science project management, with tips and best practices to guide you along the way.
With the help of this book, you will learn about the practical applications of data science
and AI and will be able to incorporate them into your solutions. You will go through the
data science project life cycle, explore the common pitfalls encountered at each step, and
learn how to avoid them. Any data science project requires a balanced skillful team, and
this book will present helpful advice for hiring and growing a skillful data science team for
your organization. The book also shows you how you can efficiently manage and improve
your data science projects through the use of DevOps.
By the end of the book, you will have the practical knowledge required to tackle the challenges you face on a daily basis and an understanding of the variety of data science solutions available.
Chapter 2, Testing Your Models, explains how to distinguish good solutions from bad ones
with the help of model testing. This chapter will also look at different types of metrics by
using mathematical functions that evaluate the quality of predictions.
Chapter 3, Understanding AI, looks into the inner workings of data science. Some of the
main concepts behind machine learning and deep learning will be explored as well. This
chapter will also give a brief introduction to data science.
Chapter 4, An Ideal Data Science Team, explains how to build and sustain a data science
team that is capable of delivering complex cross-functional projects. This chapter also gives
us an understanding of the importance of software engineering and sourcing help from
software development teams.
Chapter 5, Conducting Data Science Interviews, covers how to conduct an efficient data
science interview. This chapter also looks into the importance of setting goals before
starting the interview process.
Chapter 6, Building Your Data Science Team, develops guidelines for building data science
teams. You will learn the three key aspects of building a successful team and the role of a
leader in a data science team.
Chapter 7, Managing Innovation, explores innovations and how to manage them. We will
find out how to identify projects and problems that have real value behind them.
Chapter 8, Managing Data Science Projects, explores the data science project life cycle that
allows you to structure and plan tasks for your team. We will also look into what
distinguishes analytical projects from software engineering projects.
Chapter 9, Common Pitfalls of Data Science Projects, looks closely at the common pitfalls of data science projects. This chapter explores the mistakes that increase the risks associated with your projects and shows how to mitigate those issues one by one, following the data science project life cycle.
Chapter 10, Creating Products and Improving Reusability, looks at how to grow data science
products and improve your internal team performance by using reusable technology. We
will also look at strategies for improving the reusability of your projects and explore the
conditions that allow the building of standalone products from your experience.
Chapter 11, Implementing ModelOps, will explore how ModelOps is related to DevOps and
the main steps involved in the ModelOps pipeline. This chapter also looks at the strategies
for managing code, versioning data, and sharing project environments between team
members.
Chapter 12, Building Your Technology Stack, looks at how to build and manage the data
science technology stack. This chapter also discusses the differences between core and
project-specific technology stacks and examines an analytical approach for comparing
different technologies.
Chapter 13, Conclusion, provides a list of books that help you advance your knowledge in
the domain of data science.
Expertise in the main concepts of software development, project management, and DevOps
will be helpful for the reader since this text draws parallels with these approaches.
An ideal management method for data science projects is yet to be found; in fact, it never will be. A silver bullet solution to all the problems of every business in each situation does
not exist. Instead of creating a rigid and complex set of managerial processes, this book
gives recipes and practical advice. I hope that it will serve you as a good guide that will
make your journey to the world of data science as seamless as possible.
Conventions used
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames,
file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an
example: "They have also used the --gitlab flag for the pyscaffold command so that
they will have a ready-to-use GitLab CI/CD template when they need it."
Bold: Indicates a new term, an important word, or words that you see on screen. For
example, words in menus or dialog boxes appear in the text like this. Here is an example:
"A running instance of a Docker image is called a Docker Container."
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book
title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you have found a mistake in this book, we would be grateful if you would
report this to us. Please visit www.packtpub.com/support/errata, selecting your book,
clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we
would be grateful if you would provide us with the location address or website name.
Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in
and you are interested in either writing or contributing to a book, please visit
authors.packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on
the site that you purchased it from? Potential readers can then see and use your unbiased
opinion to make purchase decisions, we at Packt can understand what you think about our
products, and our authors can see your feedback on their book. Thank you!
Section 1: What is Data Science?
Before diving into the management issues of building systems around machine learning
algorithms, we need to explore the topic of data science itself. What are the main concepts
behind data science and machine learning? How do you build and test a model? What are
the common pitfalls in this process? What kinds of models are there? What tasks can we
solve using machine learning?
Data science has become popular, and many business people and technical professionals
have an increasing interest in understanding data science and applying it to solve their
problems. People often form their first opinions about data science from the information
that they collect in the background: news sites, social networks, and so on. Unfortunately, most of those sources mislead rather than give a realistic picture of data
science and machine learning.
What You Can Do with Data Science Chapter 1
Instead of explaining, the media portrays data science as the ultimate magical tool that easily solves all our problems. The technological singularity is near; a universal income economy is around the corner. Well, that would be true only if machines could learn and think like humans. In fact, we are far from creating general-purpose, self-learning, and self-improving algorithms.
This chapter explores the current possibilities and modern applications of the main tools of data science: machine learning and deep learning. We will cover the following topics:
Defining AI
Introduction to machine learning
Introduction to deep learning
Deep learning use case
Introduction to causal inference
Defining AI
Media and news use AI as a substitute buzzword for any technology related to data
analysis. In fact, AI is a sub-field of computer science and mathematics. It all started in the
1950s, when several researchers started asking whether computers can learn, think, and
reason. 70 years later, we still do not know the answer. However, we have made significant
progress in a specific kind of AI that solves thoroughly specified narrow tasks: weak AI.
Science fiction novels tell about machines that can reason and think like humans. In
scientific language, they are described as strong AI. Strong AI can think like a human, and
its intellectual abilities may be much more advanced. The creation of strong AI remains the
main long-term dream of the scientific community. However, practical applications are all
about weak AI. While strong AI tries to solve the problem of general intelligence, weak AI
is focused on solving one narrow cognition task, such as vision, speech, or
listening. Examples of weak AI tasks are diverse: speech recognition, image classification,
and customer churn prediction. Weak AI plays an important role in our lives, changing the
way we work, think, and live. We can find successful applications of weak AI in every area
of our lives. Medicine, robotics, marketing, logistics, art, and music all benefit from recent
advances in weak AI.
The following definitions cover all the technical topics that will be discussed in this book:
AI is a general scientific field that covers everything related to weak and strong
AI. We won't focus much on AI, since most practical applications come from its
subfields, which we define and discuss through the rest of Section 1: What is Data
Science?
Machine learning is a subfield of AI that studies algorithms that can adapt their
behavior based on incoming data without explicit instructions from a
programmer.
Deep learning is a subfield of machine learning that studies a specific kind of
machine learning model called deep neural networks.
Data science is a multidisciplinary field that uses a set of tools to extract
knowledge from data and support decision making. Machine learning and deep
learning are among the main tools of data science.
The ultimate goal of data science is to solve problems by extracting knowledge from data
and giving support for complex decisions. The first part of solving a problem is getting a
good understanding of its domain. You need to understand the insurance business before
using data science for risk analysis. You need to know the details of the goods
manufacturing process before designing an automated quality assurance process. First, you
understand the domain. Then, you find a problem. If you skip this part, you have a good
chance of solving the wrong problem.
After coming up with a good problem definition, you seek a solution. Suppose that you have created a model that solves a task. A machine learning model in a vacuum is rarely interesting to anyone, so on its own it is not useful. To make it useful, we need to wrap our models
into something that can be seen and acted upon. In other words, we need to create software
around models. Data science always comes hand-in-hand with creating software systems.
Any machine learning model needs software. Without software, models would just lie in
computer memory, not helping anyone.
So, data science is never only about science. Business knowledge and software
development are also important. Without them, no solution would be complete.
Thanks to the internet and global computerization, we create vast amounts of data daily.
Ever-increasing volumes of data allow us to automate human labor.
Sadly, for each use case that improves our lives, we can easily find two that make them
worse. To give you a disturbing example, let's look at China. The Chinese government is
experimenting with a new social credit system. It uses surveillance cameras to track the
daily lives of its citizens on a grand scale. Computer vision systems can recognize and log
every action that you make while commuting to work, waiting in lines at a government
office, or going home after a party. A special social score is then calculated based on your
monitored actions. This score affects the lives of real people. In particular, public transport
fees can change depending on your score; low scores can prohibit you from interviewing
for a range of government jobs.
On the other hand, this same technology can be used to help people. For example, it can be
used to track criminals in large crowds. The way you apply this new technology can bring
the world closer to George Orwell's 1984, or make it a safer place. The general public must
be more conscious of these choices, as they might have lasting effects on their lives.
Another disturbing use of machine learning involved hiring: one business deployed a hiring algorithm based on machine learning, only to discover months later that the algorithm had introduced bias against women. It is becoming clear that we do not give the right amount
of attention to the ethics of data science. While companies such as Google create internal
ethics boards, there is still no governmental control over the unethical use of modern
technology. Before such programs arrive, I strongly encourage you to consider the ethical
implications of using data science. We all want a better world to live in. Our future, and the
future of our children, depends on small decisions we make each day.
Insufficient understanding of the technical side of data science can lead to serious problems
in your projects. You can start a project only to discover that you cannot solve the task at
all. Even worse, you can find out that nothing works as intended only after deployment.
Depending on your use case, it can affect real people. Understanding the main principles
behind data science will rid you of many technical risks that predetermine a project's fate
before it has even started.
Machine learning is a scientific field that studies algorithms that can learn to perform tasks
without specific instructions, relying on patterns discovered in data. For example, we can
use algorithms to predict the likelihood of having a disease or assess the risk of failure in
complex manufacturing equipment. Every machine learning algorithm follows a simple
formula. In the following diagram, you can see a high-level decision process that is based
on a machine learning algorithm. Each machine learning model consumes data to produce
information that can support human decisions or fully automate them:
We will now explore the meaning of each block in more detail in the next section.
Another option is to create an algorithm that provides you with an insight. You can use this
insight as part of a decision-making process. This way, the outputs of many machine
learning algorithms can take part in complex decision making. To illustrate this, we'll look
at a warehouse security surveillance system. It monitors all surveillance cameras and
identifies employees from the video feed. If the system does not recognize a person as an
employee, it raises an alert. This setup uses two machine learning models: face detection
and face recognition. At first, the face detection model searches for faces in each frame of
the video. Next, the face recognition model identifies a person as an employee by searching
the face database. Each model does not solve the employee identification task alone. Yet
each model provides an insight that is a part of the decision-making process.
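The two-model setup described above can be sketched as a simple pipeline. The detection and recognition functions below are hypothetical stand-ins for real models, and the employee list is invented for illustration:

```python
# A sketch of the surveillance pipeline: face detection feeds face
# recognition, and the combined insights drive the alert decision.
# Both "models" here are hypothetical stand-ins for real ones.
EMPLOYEES = {"alice", "bob"}

def detect_faces(frame):
    # Stand-in for a face detection model: returns face crops.
    return frame.get("faces", [])

def recognize_face(face):
    # Stand-in for a face recognition model: returns an identity
    # matched against the face database.
    return face.get("identity")

def check_frame(frame):
    # Raise an alert for every detected face that is not an employee.
    alerts = []
    for face in detect_faces(frame):
        identity = recognize_face(face)
        if identity not in EMPLOYEES:
            alerts.append(identity)
    return alerts

frame = {"faces": [{"identity": "alice"}, {"identity": "mallory"}]}
print(check_frame(frame))  # ['mallory']
```

Neither function solves employee identification alone; the decision emerges from chaining their outputs.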
We can divide the entire world's data into two categories: structured and unstructured.
Most data is unstructured. Images, audio recordings, documents, books, and articles all
represent unstructured data. Unstructured data is a natural byproduct of our lives
nowadays. Smartphones and social networks facilitate the creation of endless data streams.
Nowadays, it takes very little to snap a photo or make a video. Analyzing unstructured data is so difficult that practical solutions did not emerge until around 2010.
Structured data is hard to gather and maintain, but it is the easiest to analyze. The reason is
that we often collect it for this exact purpose. Structured data is typically stored inside
computer databases and files. In digital advertising, ad networks apply huge effort to
collect as much data as possible. Data is gold for advertising companies. They collect your
browsing history, link clicks, time spent on site pages, and many other features for each
user. Vast amounts of data allow the creation of accurate click probability models that can
personalize your ad browsing experience. Personalization increases click probabilities,
which increases advertisers' profits.
To increase their data supply, modern enterprises build their business processes in a way
that generates as much structured data as possible. For example, banks record every
financial transaction you make. This information is necessary to manage accounts, but they
also use it as the main data source for credit scoring models. Those models use customers'
financial profiles to estimate their credit default risk probability.
The difficulty of analyzing unstructured data comes from its high dimensionality. To
explain this, let's take a data table with two columns of numbers: x and y. We can say that
this data has two dimensions.
If we plot these points and fit a line through them, we can make an accurate guess of what the value of y will be for a given value of x by looking at the corresponding point on the line. For example, for x = 10, the line gives y = 8, which closely matches the real data points.
Now, let's shift to a photo taken with an iPhone. This image will have a resolution of 4,032 x 3,024 pixels, and each pixel will have three color channels: red, green, and blue. Representing this image as numbers gives us over 12 million pixels, each with three values, or more than 36 million numbers per photo. In other words, our data has a dimensionality in the tens of millions.
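A quick back-of-the-envelope calculation makes the gap between the two-column table and the photo concrete:

```python
# Comparing the dimensionality of a small table with that of a photo.
table_dimensions = 2  # a table with columns x and y

width, height, channels = 4032, 3024, 3  # a typical iPhone photo
pixels = width * height
numbers = pixels * channels

print(pixels)   # 12192768 pixels, over 12 million
print(numbers)  # 36578304 numbers, since each pixel has 3 color values
```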
In the following screenshot, you can see an image represented as a set of numbers:
The least squares algorithm invented by Gauss is the basic form of linear
regression. The general form of the linear model was introduced much
later.
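For a single-feature line, the least squares fit mentioned in the note above can be computed in closed form. The sample data below is invented for illustration:

```python
# Fitting a line y = a*x + b by ordinary least squares, the method
# referred to in the note above. The sample data is illustrative.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0, 2, 4, 6, 8, 10]
ys = [0.2, 1.8, 3.1, 5.0, 6.4, 8.1]
a, b = fit_line(xs, ys)  # a is close to 0.79, b is close to 0.16
```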
Sometimes, we are not interested in predicting the future or finding evidence. We want
machines to recognize complex patterns in data, such as objects in an image, or to analyze
the sentiment of a text message. This group of tasks is called recognition.
Machine learning covers many types and flavors of models. But why do we need such a variety of algorithms? The reason lies in a theorem called the no free lunch
theorem. It states that there is no best model that will consistently give you the best results
on each task for every dataset. Each algorithm has its own benefits and pitfalls. It may work
flawlessly on one task, but fail miserably at another. One of the most important goals of a
data scientist is to find the best model to solve the problem at hand.
For instance, let's look at fashion retail warehouse demand forecasting. A retailer sells a
fixed set of clothes at their stores. Before an item makes it to the shelves, it must be bought
from the manufacturer and transferred to a warehouse. Let's assume that their logistics
cycle takes two weeks. How do we know the best quantity of each item to order? There is a
good chance that we have item sales data for each store. We can use it to create a predictive
model that estimates average customer demand for each item in the catalog of our
warehouse over the next two weeks. That is, we forecast an average number of to-be-
bought items over the next two weeks. The simplest model would be to take an average
demand for each item from the last two weeks as an estimate of average future demand.
Simple statistical models like this are frequently used in real retail stores, so this is not an oversimplification. To be more general, let's call the thing we want to forecast the target
variable. In the previous case, the target variable is the demand. To build a forecasting
model, we will use two previous weeks of historical data to calculate arithmetic averages
for each item. We then use those averages as estimates for the future values. In a way,
historical data was used to teach our model about how it should predict the target
variable. When the model learns to perform a given task using input/output examples, we
call the process supervised learning. It would be an overstatement to call our average calculator a supervised learning algorithm, but nothing stops us from doing so (at least technically).
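The "average calculator" described above fits in a few lines. The daily sales figures here are invented for illustration:

```python
# The simplest demand forecast: the average daily demand for an item
# over the last two weeks, used as the estimate for the next two.
# The daily sales figures are invented for illustration.
def forecast_average_demand(daily_sales, window=14):
    recent = daily_sales[-window:]    # last two weeks of history
    return sum(recent) / len(recent)  # arithmetic average

daily_sales = [3, 5, 4, 6, 2, 7, 5, 4, 6, 5, 3, 8, 6, 5]
estimate = forecast_average_demand(daily_sales)
print(round(estimate, 2))  # 4.93
```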
Supervised learning is not limited to simple models. In general, the more complex your
model is, the more data it requires to train. The more training data you have, the better
your model will be. To illustrate, we will look into the world of information security. Our
imaginary client is a large telecoms company. Over the years, they have experienced many
security breaches in their network. Thankfully, specialists have recorded and thoroughly
investigated all fraudulent activity on the network. Security experts labeled each fraudulent
activity in network logs. Having lots of data with labeled training examples, we can train a
supervised learning model to distinguish between normal and fraudulent activity. This
model will recognize suspicious behavior from vast amounts of incoming data. However,
this will only be the case if experts labeled the data correctly. Our model won't correct them
if they didn't. This principle is called garbage in, garbage out. Your model can only be as
good as your data.
Both the retail and security examples use supervised learning, but let's look at the target
variables more closely. The forecasting algorithm used demand as the target variable.
Demand is a continuous number ranging from 0 to infinity. On the other hand, the security
model has a fixed number of outcomes.
Network activity is either normal or fraudulent. We call the first type of target variable continuous, and the second type categorical. The target variable type strongly
indicates which kind of task we can solve. Prediction tasks with continuous target
variables are called regression problems. And when the total number of outcomes is
limited, we say that we solve a classification problem. Classification models assign data
points to categories, while regression models estimate quantities.
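A toy sketch makes the distinction concrete; the network activity features and thresholds below are invented for illustration:

```python
# Regression estimates a quantity; classification assigns a category.
# The features and thresholds here are invented for illustration.
def risk_score(failed_logins, megabytes_sent):
    # Regression-style output: a continuous quantity.
    return 0.7 * failed_logins + 0.01 * megabytes_sent

def classify_activity(failed_logins, megabytes_sent):
    # Classification-style output: one of a limited set of outcomes.
    if risk_score(failed_logins, megabytes_sent) > 5.0:
        return "fraudulent"
    return "normal"

print(classify_activity(10, 200))  # fraudulent
print(classify_activity(0, 200))   # normal
```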
You can consider yourself lucky if you have found a good labeled dataset. You're even
luckier if the dataset is large, contains no missing labels, and is a good match for an end-to-
end solution to a problem. The labels we use for supervised learning are a scarce resource.
A total absence of labels is a lot more common than fully labeled datasets. This means that
often, we cannot use supervised learning. But the absence of labels does not mean that we
are doomed. One solution is to label data by hand. You can assign this task to your
employees if the data cannot be shared outside of the company. Otherwise, a much simpler
and faster solution is to use crowdsourcing services such as Amazon Mechanical Turk.
There, you can outsource data labeling to a large number of people, paying a small fee for
each data point.
While it's convenient, labeling data is not always affordable, and may be impossible. A
learning process where the target variable is missing or can be derived from the data itself
is called unsupervised learning. While supervised learning implies that the data was
labeled, unsupervised learning removes this limitation, allowing an algorithm to learn from
data without guidance.
For example, the marketing department may want to discover new segments of customers
with similar buying habits. Those insights can be used to tailor marketing campaigns and
increase revenue in each segment. One way to discover hidden structures in your dataset is
to use a clustering algorithm. Many clustering algorithms can work with unlabeled data.
This characteristic makes them particularly interesting.
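A minimal one-dimensional k-means sketch (k = 2) shows the idea of clustering without labels; the monthly spend figures used as the single feature are invented:

```python
# A tiny 1-D k-means (k = 2): group customers by monthly spend
# without any labels. The spend figures are invented.
def kmeans_two_groups(points, iterations=10):
    c1, c2 = min(points), max(points)  # initial cluster centers
    for _ in range(iterations):
        group1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        group2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1 = sum(group1) / len(group1)  # move centers to group means
        c2 = sum(group2) / len(group2)
    return group1, group2

spend = [10, 12, 11, 90, 95, 99]
low, high = kmeans_two_groups(spend)
print(low, high)  # [10, 12, 11] [90, 95, 99]
```

No label ever told the algorithm which customer belongs to which segment; the structure was discovered from the data alone.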
Sometimes, labels hide inside the raw data. Look at the task of music creation. Suppose we
want to create an algorithm that composes new music. We can use supervised learning
without explicit labeling in this case. The next note in a sequence serves as a great label for
this task. Starting with a single note, the model predicts the next note. Taking the previous
two, it outputs the third. This way, we can add as many new notes as we need.
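Extracting such "hidden" labels from a melody takes only a few lines; the note names are illustrative:

```python
# Labels hiding in raw data: for next-note prediction, every note
# serves as the label for the notes preceding it. The melody is invented.
def make_training_pairs(notes, context=2):
    pairs = []
    for i in range(context, len(notes)):
        pairs.append((notes[i - context:i], notes[i]))  # (input, label)
    return pairs

melody = ["C", "E", "G", "E", "C"]
print(make_training_pairs(melody))
# [(['C', 'E'], 'G'), (['E', 'G'], 'E'), (['G', 'E'], 'C')]
```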
Now, let's see whether we can apply machine learning to games. If we take a single game,
we may label some data and use supervised learning. But scaling this approach for all
games is not possible in practice. On a typical gaming console, you use the same controller
to play different games. Try to recall when you played for the first time in your life. I
suppose it was Mario. It is likely that you were unsure of what to do. You might have tried
pressing a few buttons and looked at a jumping character. Piece by piece, you must have
figured out the rules of the game and started playing. I wouldn't be surprised if you felt
confident playing and could finish the first levels after a few hours of experience.
Using the knowledge we have gained so far, can we design a machine learning algorithm
that will learn how to play games? It might be tempting to use supervised learning, but
think first. You had no training data when you took the controller for the first time in your
life. But can we create algorithms that would figure out game rules by themselves?
It is easy to write a good bot for a specific computer game if we know the rules in advance.
Almost all modern computer games have rule-based AIs or non-player characters that can
interact with the player. Some games even have such advanced AI that all gameplay builds
around this feature. If you are interested, look at Alien: Isolation, released in 2014. The
biggest limitation of those algorithms is that they are game-specific. They cannot learn and
act based on experience.
This was the case until 2015, when deep learning researchers discovered a way to train
machine learning models to play Atari games as humans do: look at the screen and perform
actions by using a game controller. The only difference was that the algorithm was not
using physical eyes or hands to play the game. It received each frame of the game through
RAM and acted via a virtual controller. Most importantly, the model received the current
game score in each incoming frame. At the start, the model performed random actions.
Some actions led to a score increase that was received as positive feedback. Over time, the
model learned input/output or frame/action pairs that corresponded to higher scores. The
results were stunning. After 75 hours of training, the model played Breakout at an amateur
level. It wasn't provided with any prior knowledge of the game. All it saw were raw pixels.
Sometime later, the algorithm had learned how to play Breakout better than humans. The
exact same model can be trained to play different games. The learning framework that was
used to train such models is called reinforcement learning. In reinforcement learning, an
agent (player) learns a policy (a specific way of performing actions based on incoming
inputs) that maximizes the reward (game score) in an unknown environment (a computer
game).
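The agent/policy/reward/environment loop can be sketched in a few lines. The toy corridor environment, the Q-table, and all hyperparameters below are invented for illustration; real systems such as the Atari player use deep neural networks instead of a table, but the learning loop has the same shape:

```python
import random

# A toy sketch of the reinforcement learning loop: an agent in a 5-cell
# corridor (a made-up environment) earns a reward of 1 for reaching the
# rightmost cell, and learns a table of action values (Q-learning).
random.seed(0)

N_STATES = 5
ACTIONS = [1, -1]                        # move right / move left
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # The policy: usually take the action with the highest learned value,
        # but explore a random action with probability epsilon.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Q-learning update: nudge the value toward reward + discounted future value.
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

# The learned greedy policy should move right from every non-terminal state.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

Game-playing agents such as the Atari player replace the table with a deep network that maps raw screen pixels to action values, but the interaction loop is identical.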
Of course, there are limitations. Remember, there are no free lunches in the machine
learning restaurant. While this algorithm performed well on a large set of games, it failed
completely on others. In particular, a game called Montezuma's Revenge stymied every
model until 2018. The problem with this game is that you need to perform a specific series
of actions over a long time before getting even a small reward signal.
Another large victory for reinforcement learning was, of course, the game of Go. The
number of possible positions in Go is larger than the total number of atoms in our universe,
making this game very hard to tackle with computers. Computer scientists defeated the best
humans at chess in 1997. For Go, it took them another 18 years. If you are interested, Google
filmed a documentary about AlphaGo, an algorithm that won at Go against world
champions.
Reinforcement learning works well when you can completely simulate your environment;
that is, when you know the rules of the game in advance. But if you already know the rules,
why learn them? This chicken and egg problem makes applying reinforcement learning
tricky. Still, reinforcement learning can be used to solve
real-world problems. A team of scientists used reinforcement learning to find the optimal
parameters for a rocket engine. This was possible thanks to complete physical models of the
inner workings of the engine. They used these models to create a simulation where a
reinforcement learning algorithm changed the parameters and design of the engine to find
the optimal setup.
The answer is simple: hardware. Training big neural networks uses a great amount of
computation power. But not any computation power will suffice. It turns out that neural
networks do a lot of matrix operations under the hood. Strangely, rendering computer
graphics also involves many matrix operations, so many, in fact, that each computer has a
dedicated circuit inside: a GPU. Nvidia knew of the scientific need for fast matrix
operations, so they developed a special programming framework called CUDA. CUDA
allows you to harness the power of your GPU not only for computer graphics, but for
general computing tasks as well. GPUs can perform insane amounts of parallel matrix
operations: modern graphics cards have thousands of cores, so you can perform thousands
of floating-point operations in parallel, and each individual operation is fast as well. GPUs
specialize in solving this one specific class of tasks much faster than general-purpose CPUs.
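To see the "matrix operations under the hood", note that one dense layer of a neural network is a single matrix multiplication plus a bias and an activation. The sketch below uses NumPy on the CPU with made-up layer sizes; GPU frameworks run the very same computation, spread across thousands of cores:

```python
import numpy as np

# One dense neural network layer, written as the matrix math it really is.
# All sizes are made up: 64 examples with 784 features feeding 128 neurons.
rng = np.random.default_rng(42)

batch = rng.normal(size=(64, 784))   # input examples
W = rng.normal(size=(784, 128))      # layer weights
b = np.zeros(128)                    # one bias per neuron

# The product below performs 64 * 128 * 784 multiply-adds, and every output
# element is independent of the others -- exactly the kind of workload that
# a GPU's thousands of cores can execute in parallel.
hidden = np.maximum(0, batch @ W + b)   # affine transform + ReLU activation

print(hidden.shape)   # (64, 128)
```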
All this meant that scientists could train larger neural networks. The art and science of
training large neural networks is called deep learning. The origin of the word deep in the
name comes from the specific structure of neural networks that allows them to be efficient
and accurate in their predictions. We will look more into the internals of neural networks in
Chapter 2, Testing Your Models.
Deep learning is fantastic at solving tasks with unstructured datasets. To illustrate this, let's
look at a machine learning competition called ImageNet. It contains over 14 million images,
classified into 22,000 distinct categories. To solve ImageNet, an algorithm should learn to
identify an object in the photo. While human performance on this task is around 95%
accuracy, the best neural network model surpassed this level in 2015.
Traditional machine learning algorithms are not very good at working with unstructured
data. However, they are equally important because the gap in performance between
traditional machine learning models and deep learning models is not so big in the domain
of structured datasets. Most winners of data science competitions on structured datasets do
not use deep neural networks. They use simpler models because they showed better results
on structured datasets. Those models train faster, do not need specialized hardware, and
use less computational power.
Simpler models are not necessarily easier to apply. Deep learning can
show good results on structured data, too. Still, it is likely that you will
spend more time, money, and computing power in a real-world scenario.
The simplicity refers to the model's internal structure and the number of
parameters that change during training.
To differentiate other machine learning algorithms from deep learning, professionals often
refer to deep learning as a field that studies neural networks, and machine learning is used
for every other model. Drawing a hard line between machine learning and deep learning in
this way is incorrect, but for the lack of a better term, the community has agreed on these
ambiguous definitions.
From this section's title, you can probably guess it all changed with deep learning. Deep
learning models can perform many more text processing tasks without the need to
explicitly state complex grammatical rules and parsers. Deep learning models took the NLP
world by storm. They can perform a wide range of NLP tasks with much better quality than
previous generation NLP models. Deep neural networks translate text to another language
with near-human accuracy. They are also quite accurate at doing part-of-speech tagging.
Modern NLP models can also answer questions about a given text. For example, a question
answering model that has processed this section could respond as follows:

How many cores do modern desktop CPUs have? (less than 10)
When did neural networks originate? (1980s)
Text summarization seeks to extract the main points from the source. If we feed the first
few paragraphs of this section into a text summarization model, we will get the following
results:
Now it is time to draw a line between machine learning and deep learning. In truth,
we can't do this, because deep learning is a sub-field of machine learning. Formally,
deep learning studies a specific set of models called neural networks. The theory
behind neural networks originated in the 1980s. You can try out a sample text
summarization model online at http://textsummarization.net/.
Text classification is also used to solve automated email and document tagging. We can use
neural networks to process large chunks of emails and to assign appropriate tags to them.
The pinnacle of practical NLP is the creation of dialog systems, or chatbots. Chatbots can be
used to automate common scenarios at IT support departments and call centers. However,
creating a bot that can reliably and consistently solve its task is not easy. Clients tend to
communicate with bots in rather unexpected ways, so you will have a lot of corner cases to
cover. NLP research is not yet at the point of providing an end-to-end conversational model
that can solve this task.
General chatbots use several models to understand user requests and prepare an answer:
NLP models can also generate texts of arbitrary length based on initial input. State-of-the-
art models output results that are arguably indistinguishable from human-written texts.
You can look at the model output samples in an OpenAI blog post about
the GPT-2 model: https://openai.com/blog/better-language-models/
#sample1.
While the results are very compelling, we have yet to find useful and practical applications
for text generation models. Writers and content creators could potentially benefit from
these models by using them to expand a list of key points into a coherent text, but these
models currently lack the means to control their output, making them difficult to use in
practice.
In 2015, a deep neural network surpassed human performance on ImageNet. Since then,
many computer vision algorithms have been rendered obsolete. Deep learning allows us
not only to classify images, but also to do object detection and instance segmentation.
The following two pictures help to describe the difference between object detection and
instance segmentation:
The preceding photo shows us that object detection models recognize objects and place
bounding boxes around them.
Thus, the main practical uses of deep learning in computer vision are essentially the same
tasks at different levels of resolution:
Another interesting direction in computer vision is generative models. While the models
we have already examined perform recognition tasks, generative models change images, or
even create entirely new ones. Style transfer models can change the stylistic appearance of
an image to look more like another image. This kind of model can be used to transform
photos into paintings that look and feel like the work of an artist, as follows:
We can also use generative adversarial networks (GANs) to perform conditional image
generation. The word conditional means that we can specify some parameters for the
generator. In particular, we can specify the type of object or texture being generated. For
example, Nvidia's landscape generator software can transform a simple color-coded image,
where specific colors represent soil, sky, water, and other objects, into realistic-looking
photos.
Up-to-date pricing is very important for large internet retailers. In situations where your
competitor lowers the price of a popular product, late reaction leads to large profit losses. If
you know the correct market price distributions for your product catalog, you can always
remain a step ahead of your competitors. To create such a distribution for a single product,
you first need to find this product description on a competitor's site. While automated
collection of product descriptions is easy, product matching is the hard part.
Once we have a large volume of unstructured text, we need to extract product attributes
from it. To do this, we first need to tell whether two descriptions refer to the same product.
Suppose that we have collected a large dataset of similar product descriptions. If we shuffle
all pairs in our data, we will get another dataset of non-similar product descriptions. Using
lots of examples of similar and dissimilar product descriptions, we can train an NLP model
that can identify similar product descriptions. We may also think about comparing photos
on the retailer's sites to find similar products. To do this, we can apply computer vision
models to do the matching. Even with those two models, the total matching accuracy will
probably be insufficient for our requirements. Another way to boost it is to extract product
attributes from textual descriptions. We may train word-tagging models or develop a set of
matching rules to do this task. Matching accuracy will increase, along with the diversity
and descriptiveness of the data sources we use.
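As a minimal sketch of the first matching step, the similar/dissimilar decision can be approximated by comparing word-count vectors of two descriptions with cosine similarity. The function and product strings below are hypothetical; a production system would use a trained NLP model rather than raw token counts:

```python
import math
from collections import Counter

# A hypothetical first cut at product matching: represent each description as
# a vector of word counts and compare the vectors with cosine similarity.
def cosine_similarity(text_a: str, text_b: str) -> float:
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

same = cosine_similarity(
    "apple iphone 12 smartphone 64gb black",
    "iphone 12 black 64gb apple smartphone",
)
different = cosine_similarity(
    "apple iphone 12 smartphone 64gb black",
    "samsung washing machine 7kg white",
)
print(round(same, 2), round(different, 2))   # identical word sets score 1.0
```

A trained model would go further, handling synonyms, typos, and attribute mentions that simple token counts miss.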
Prediction models:
When will our sales double?
What is the probability of this client buying a certain product?
Causal inference models:
Was this cancer treatment effective? Or is the effect apparent only
because of the complex relationships between variables in the
data?
Was the new version of our recommendation model better than the
old one? If so, by how much does it increase our sales compared
to the old model?
What factors cause some books to be bestsellers?
What factors cause Alzheimer's disease?
There is a mantra amongst statisticians: correlation does not imply causation. The fact that
some variables change together or have similar values does not mean that they are
connected in some logical way. There are many great examples of absurd correlations in
real-world data at http://www.tylervigen.com/spurious-correlations.
Seeking explanations and measuring effects is important when data-driven decisions affect
people's lives. Specifically, when a new drug is invented, we need to check that it really
works. To do this, we need to collect data and measure the statistical effect of using the new
drug versus not using the new drug.
Causal inference approaches this question in a rather ambitious way: the simplest way to
measure the effect of some treatment would be to split our universe into two parts. In the
first part, we do not apply the treatment and act as if the drug was never discovered. In the
second part, we apply the new treatment. Unfortunately, the creation of new universes is far
beyond the capabilities of statisticians. But they have come up with a reasoning framework
that allows you to design experiments and collect data as if it came from two separate,
independent universes.
1. First, we will randomly sample a test group of people from the entire globe.
2. Then, each person will be assigned the new drug or a sugar pill (placebo) with
50% probability.
3. After a while, we can measure the treatment's effect in the two groups.
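The three steps can be simulated in a few lines. All numbers below (average recovery time, the size of the true treatment effect) are invented purely to illustrate how randomized assignment lets the difference in group means estimate the effect:

```python
import random
import statistics

# A toy simulation of a randomized trial. The drug's true effect (2 days
# faster recovery) and the baseline (10 days) are made-up numbers.
random.seed(7)

def recovery_days(got_drug: bool) -> float:
    base = random.gauss(10, 2)              # natural variation between people
    return base - (2 if got_drug else 0)    # assumed true treatment effect

# Steps 1-2: sample people and assign drug or placebo with 50% probability.
treatment, control = [], []
for _ in range(10_000):
    if random.random() < 0.5:
        treatment.append(recovery_days(got_drug=True))
    else:
        control.append(recovery_days(got_drug=False))

# Step 3: measure the effect as the difference in group means. Randomization
# ensures the groups differ, on average, only in the treatment itself.
effect = statistics.mean(control) - statistics.mean(treatment)
print(round(effect, 1))   # close to the true effect of 2 days
```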
Studies like this can be very complex to execute. Imagine selecting a random sample of the
entire world's population each time you need to test a new drug. Moreover, you cannot
transport all individuals to a single place because a sudden change of environment or
climate may affect treatment results. This experiment may also be considered unethical,
especially if the treatment is associated with a risk of death. Causal inference allows you to
construct more complex experimental designs that, under certain conditions, are still
equivalent to a randomized experiment. This way, you can create an ethical and realistic
study that retains statistical rigor.
The framework of applying a treatment to a test group is very general. It is not limited to
medical studies and can measure and explain the effects of any changes. The question—is
using a machine learning model better than not using it at all? If so, how much benefit does it give?
often occupies the minds of data scientists. Thanks to causal inference, we can find a
solution. In this question, you can substitute a machine learning model for the treatment.
The only way to measure the real effect is to check both approaches on real users. While
being difficult to conduct in a physical world, purely randomized experiments are easy on
the internet. If you use machine learning models at a large internet company with a large
number of customers, designing a randomized experiment can seem easy. You can
randomly assign two different versions of your software to each user and wait until a
sufficiently large sample of data has been collected.
However, you should be wary of many things that can distort results:
Confounders: hidden biases in the data. Customer lifestyle, social factors,
or environmental exposure can affect your seemingly random sample of users.
Selection bias: a flaw in test group selection. For example, random
selection of test participants from a single region could affect the study.
Measurement errors: erroneous or inconsistent data collection can lead to
misleading results.
Summary
In this chapter, we have explored the practical applications of AI, data science, machine
learning, deep learning, and causal inference. We have defined machine learning as a field
that studies algorithms that use data to support decisions and give insights without specific
instructions. There are three main machine learning methodologies: supervised,
unsupervised, and reinforcement learning. In practice, the most common types of task we
solve using machine learning are regression and classification. Next, we described deep
learning as a subset of machine learning devoted to studying neural network algorithms.
The main application domains of deep learning are computer vision and NLP. We have
also touched on the important topic of causal inference: the field that studies a set of
methods for discovering causal relationships in data. You now know a lot about general
data science capabilities. But can machine learning models successfully solve your specific
set of problems?
In the next chapter, we will learn to distinguish good solutions from bad ones by
performing model testing.
2
Testing Your Models
It is easy to come up with a seemingly perfect machine learning model if you do not use a
good testing methodology. That seemingly perfect model will fail the moment you deploy it.
Testing the model's performance is not an easy task, but it is an essential part of every data
science project. Without proper testing, you can't be sure whether your models will work as
expected, and you can't choose the best approach to solve the task at hand.
This chapter will explore various approaches for model testing and look at different types
of metrics, using mathematical functions that evaluate the quality of predictions. We will
also go through a set of methods for testing classifier models.
The difference between the real value and the model's approximation makes up the model
error:

Error = Real value - Predicted value
For regression problems, we can measure the error in quantities that the model predicts.
For example, if we predict house prices using a machine learning model and get a
prediction of $300,000 for a house with a real price of $350,000, we can say that the error is
$350,000 - $300,000 = $50,000.
For classification problems in the simplest setting, we can measure the error as 0 for a
correct guess, and 1 for a wrong answer. For example, for a cat/dog recognizer, we give an
error of 1 if the model predicts that there is a cat in a dog photo, and 0 if it gives a correct
answer.
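Both error measurements can be written as tiny helper functions (the function names are mine, not a standard API):

```python
# The two error measurements above, as hypothetical helper functions.

def regression_error(real: float, predicted: float) -> float:
    """Error measured in the quantity the model predicts (dollars here)."""
    return real - predicted

def classification_error(real_label: str, predicted_label: str) -> int:
    """0 for a correct answer, 1 for a wrong one."""
    return 0 if real_label == predicted_label else 1

print(regression_error(350_000, 300_000))   # 50000, as in the house example
print(classification_error("dog", "cat"))   # 1: the model mistook a dog for a cat
```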
Decomposing errors
You won't find a machine learning model that perfectly solves your problems without
making even a single mistake, no matter how small. Since every model makes mistakes, it is
critical to understand their nature. Suppose that our model makes a prediction and we
know the real value. If this prediction is incorrect, then there is some difference between the
prediction and the true value:

Error = Reducible error + Irreducible error

One part of this error comes from imperfections in our data: the irreducible error. The other
part comes from imperfections in our model: the reducible, or modeling, error. No matter
how complex our model is, it can only reduce the modeling error. Irreducible error is out of
our control, hence its name.
Not all reducible errors are the same. We can decompose reducible errors further. For
example, look at the following diagram:
[ 33 ]
Testing Your Models Chapter 2
The red center of each target represents our goal (the real value), and the blue shots
represent the model's predictions. In the first target, the model's aim is off: all predictions
are close together, but they are far away from the center. This kind of error is called bias.
The simpler our model is, the more bias it will have. For a simple model, the bias
component can become prevailing:
In the preceding plot, we try to model a complex relationship between variables with a
simple line. This kind of model has a high bias.
In the second target, all predictions are clustered around the true value, but the spread is
too high. This kind of error is called variance. It comes from the model's sensitivity to
fluctuations in the data: if the model has high variance, randomness in measurements can
lead to very different predictions.
So far, we have decomposed the model error into three components:

Error = Bias² + Variance + Irreducible error
It is not a coincidence that bias and variance sit close together in the same formula. There is
a relationship between them. Predictive models show a property called the bias-variance
tradeoff—the more biased a model, the lower the variance component of the error. And in
reverse, the more variance it has, the lower its bias will be. This important fact will be a
game-changer for building ensemble models, which we will explore in Chapter
3, Understanding AI.
Typically, models that impose some kind of structure on the data have a high bias (they
assume the data conforms to certain laws). Biased models will work well as long as the data
does not contradict the underlying logic of the model. To give you an example of such a
model, think of a simple line. For example, we will predict a housing price as a linear
function of its size in square feet:
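A sketch of such a linear model, with made-up coefficients:

```python
# A hypothetical linear price model: price = slope * square_feet + intercept.
SLOPE, INTERCEPT = 200.0, 50_000.0       # invented $/sq ft and base price

def predict_price(square_feet: float) -> float:
    return SLOPE * square_feet + INTERCEPT

# A tiny change in the input produces a proportionally tiny change in the
# prediction: this is what low variance looks like.
print(predict_price(1500.0))                                    # 350000.0
print(round(predict_price(1500.1) - predict_price(1500.0), 2))  # 20.0
```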
Notice that if we change the square footage by a little, say 0.1, then the prediction won't
change by much. Thus, this model has low variance. When the model is sensitive to
changes in its input, its variance will outgrow the bias. The variance component grows as
your model increases in complexity and the total number of parameters grows. In the
following plot, you can see how two different models fit the same dataset. The first simple
model has low variance, and the second complex model has high variance:
In the preceding plot, slight changes in X can lead to large fluctuations in Y. Models with
high variance are flexible and assume that the data is much less structured.
Understanding overfitting
The bias-variance trade-off goes hand in hand with a very important problem in machine
learning called overfitting. If your model is too simple, it will cause large errors. If it is too
complex, it will memorize the training data and act like a database. Suppose that our
housing dataset contains some lucky deals where houses sold at a low price because of
circumstances not captured in the data. An overfitted model will memorize those examples
too closely and predict incorrect prices on unseen data.
Now, having understood the error decomposition, can we use it as a stepping stone to
design a model-testing pipeline?
We need to determine how to measure model error in such a way that it will correspond to
the real model performance on unseen data. The answer comes from the question itself. We
will split all the available data into two sets: a training set and a test set, as shown in the
following screenshot:
We will use data in the training set to train our model. The test set acts as unseen data and
you should not use the test set in the training process. When the model's training is
finished, you can feed the test data into your model. Now you can calculate errors for all
predictions. The model did not use the test data during training, so the test set error
represents the model error on unseen data. The drawback to this approach is that you take
a significant amount of data, usually up to 30%, to use for testing. This means less training
data and lower model quality. There is also a caveat – if you use your test set too much,
error metrics will start to lie. For example, suppose that you did the following:
1. Trained a model
2. Measured the error on the test data
3. Changed your model to improve the metrics
4. Repeated steps 1-3 ten times
5. Deployed the model to production
It is likely that the quality of your model will be much lower than expected. Why did this
happen? Let's look more closely at step 3. You looked at a score and changed your model or
data processing code several consecutive times. In fact, you did several learning iterations
by hand. By repeatedly improving the test score, you indirectly disclosed information about
the test data to your model. When the metric values measured on a test set deviate from the
metrics measured on the real data, we say that the test data has leaked into our model. Data
leaks are notoriously hard to detect before they cause damage. To avoid them, you should
always be mindful of the possibility of a leak, think critically, and follow best practices.
We can use a separate piece of data to fight test set leakage. Data scientists use validation
sets to tune model parameters and compare different models before choosing the best one.
Then, the test data is used only as a final check that informs you about model quality on
unseen data. After you have measured the test metric scores, the only decision left to make
is whether the model will proceed to testing in a real-world scenario.
In the following screenshot, you can see an example of a train/validation/test split of the
dataset:
Unfortunately, the following two problems persist when we use this approach:
The information about our test set might still leak into our solution after many
iterations. Test-set leakage does not disappear completely when you use the
validation set, it just becomes slower. To overcome this, change your test data
from time to time. Ideally, make a new test set for every model-deployment
cycle.
You might overfit your validation data quickly, because of the train-measure-
change feedback cycle for tuning your models.
To prevent overfitting, you can randomly select train and validation sets from your data for
each experiment. Randomly shuffle all available data, then select random train and
validation datasets by splitting the data into three parts according to proportions you have
chosen.
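The shuffle-and-split procedure can be sketched as follows; the 70/15/15 proportions are just an example:

```python
import random

# A sketch of shuffling the data and cutting it into train / validation / test
# parts. The 70% / 15% / 15% proportions are made up for illustration.
random.seed(0)

data = list(range(1000))      # stand-in for 1,000 labeled examples
random.shuffle(data)          # shuffle so each split is a random sample

n_train = int(0.70 * len(data))
n_val = int(0.15 * len(data))

train = data[:n_train]
validation = data[n_train:n_train + n_val]
test = data[n_train + n_val:]

print(len(train), len(validation), len(test))   # 700 150 150
```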
There is no general rule for how much training, validation, and testing data you should use.
Often, more training data means a more accurate model, but it means that you will have
less data to assess the model's performance. The typical split for medium-sized datasets (up
to 100,000 data points) is to use 60-80% of the data to train the model and use the rest for
validation.
The situation changes for large datasets. If you have a dataset with 10,000,000 rows, using
30% for testing would comprise 3,000,000 rows. It is likely that this amount would be
overkill. Increasing test and validation set sizes yields diminishing returns. For some
problems, you will get good results with 100,000 examples for testing, which would
amount to a 1% test size. The more data you have, the lower the proportion you should
use for testing.
Often, there is too little data. In those situations, taking from 30%-40% data for testing and
validation might severely decrease the model's accuracy. You can apply a technique called
cross-validation in data-scarce situations. With cross-validation, there's no need to create a
separate validation or test set. Cross-validation proceeds in the following way:
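In outline: split the data into k folds, train on k - 1 of them, validate on the held-out fold, rotate the held-out fold k times, and average the scores. A minimal sketch (the scoring function is a dummy stand-in for training a real model):

```python
# A sketch of k-fold cross-validation (here k = 5): every data point is used
# for validation exactly once, and for training in the other k - 1 rounds.

def k_fold_indices(n_samples: int, k: int):
    fold_size = n_samples // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation

def train_and_score(train_idx, val_idx) -> float:
    # A hypothetical stand-in: a real implementation would fit a model on
    # train_idx and return its metric on val_idx.
    return 1.0 - len(val_idx) / (len(train_idx) + len(val_idx))

scores = [train_and_score(tr, va) for tr, va in k_fold_indices(100, k=5)]
print(sum(scores) / len(scores))   # the cross-validated estimate: the average score
```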
Another important aspect related to model testing is how to split the data. A slight error in
your splitting logic can mean all your testing efforts were in vain. Splitting is easy, if all
observations in your dataset are independent. Then you can use random data splits. But
what if we are solving the stock-price prediction problem? When our data rows are tied to
time, we can't look at them as independent values. Today's stock prices depend on their
past values. If this wasn't true, the prices would randomly jump from $0 to $1,000. In this
situation, suppose we have two years' worth of stock data, from January 2017 to December
2018. If we use random splits, it is possible that our model will be trained on data from
September 2018 and tested on data from February 2017. This makes no sense: stock prices
from the future would leak into the training set. We must always think about causal
relationships and dependencies between our observations and be sure to check whether
our validation procedure is correct.
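For time-ordered data, a sketch of the correct approach is to split by a cutoff date, so the model trains only on the past (the dates and cutoff below are made up to match the stock example):

```python
from datetime import date, timedelta

# A sketch of a time-aware split: train only on the past, test only on the
# future -- never randomly shuffle time-ordered observations.
start = date(2017, 1, 1)
observations = [start + timedelta(days=i) for i in range(730)]  # two years of daily prices
cutoff = date(2018, 7, 1)

train = [day for day in observations if day < cutoff]
test = [day for day in observations if day >= cutoff]

# Every training day precedes every test day, so the model never "sees the future".
print(max(train) < min(test))   # True
```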
Next, we will learn about metrics, which are formulas we can use to summarize validation
and test errors. Metrics will allow us to compare different models and choose the best
candidates for production use.
In particular, for regression problems, the most common metric is the root mean square
error, or RMSE:

RMSE = sqrt( (1/N) × Σ (actualᵢ - predictedᵢ)² )

Another popular way to measure regression errors is the mean absolute error (MAE):

MAE = (1/N) × Σ |actualᵢ - predictedᵢ|
Note that MAE is very similar to RMSE. Compared to MAE, RMSE squares each error and
takes the square root of their mean, instead of averaging absolute values. While MAE and
RMSE may seem nearly identical, there are some technical differences between them. Data
scientists choose the best metric for a problem by knowing their trade-offs and
shortcomings. You don't need to learn them all, but I would like to highlight one difference
to give you a general feel for the thought process. RMSE penalizes large errors more than
MAE. This property comes from the fact that RMSE uses squared errors, while MAE uses
absolute values. To illustrate, an error of 4 contributes 4 to MAE, but under RMSE it turns
into 16 before the square root is taken.
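Here are both metrics computed on a handful of made-up errors, showing how the single error of 4 dominates RMSE but not MAE:

```python
import math

# MAE and RMSE on invented errors: three small errors and one large one.
errors = [1, 1, 1, 4]   # absolute errors of four predictions

mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(mae)    # 1.75  -- the error of 4 contributes 4 to the sum
print(rmse)   # ~2.18 -- the same error contributes 16 before the square root
```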
For classification problems, the metric-calculation process is more involved. Let's imagine
that we are building a binary classifier that estimates the probability of a person having
pneumonia. To calculate how accurate the model is, we may just divide the total of correct
answers by the number of rows in the dataset:
Here, is the amount of correct predictions, and is the total number of predictions.
Accuracy is simple to understand and calculate, but it has a major flaw. Let's assume the
average probability of having pneumonia is 0.001%. That is, one person out of 100,000 has
the illness. If you had collected data on 200,000 people, it is feasible that your dataset would
contain only two positive cases. Imagine you have asked a data scientist to build a machine
learning model that estimates pneumonia probability based on a patient's data. You have
said that you would only accept an accuracy of no less than 99.9%. Suppose that someone
created a dummy algorithm that always outputs zeros.
This model has no real value, but its accuracy on our data will be very high, as it will make
only two errors:

Accuracy = 199,998 / 200,000 = 99.999%

The problem is that accuracy considers only the global fraction of correct answers. When
one class outnumbers the others, accuracy outputs misleading values.
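The dummy model's misleading score is easy to reproduce on the made-up dataset from the example above:

```python
# The dummy "always healthy" model on the invented dataset:
# 200,000 patients, only 2 of whom actually have pneumonia.
labels = [1] * 2 + [0] * 199_998        # 1 = has pneumonia
predictions = [0] * 200_000             # the dummy model always answers "healthy"

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

print(accuracy)   # 0.99999 -- impressive-looking, yet the model found no ill patient
```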
After looking at this table, we can see that the dummy model won't be helpful to anyone. It
failed to identify the two people with the condition as positive. We call those errors False
Negatives (FN). The model also correctly identified all patients with no pneumonia, or
True Negatives (TN), but it failed to diagnose the ill patients correctly.
Now, suppose that your team has built a real model and got the following results:
                                        Model prediction:    Model prediction:
                                        Has pneumonia        Does not have pneumonia
Real outcome: Has pneumonia                    2                        0
Real outcome: Does not have pneumonia         30                  199,968
This model correctly identified two cases, making two True Positive (TP) predictions. This
is a clear improvement over the previous iteration. However, the model also identified 30
people as having pneumonia, while they were not ill in reality. We call such an error a
False Positive (FP) prediction. Is having 30 false positives a significant disadvantage? That
depends on how physicians will use the model. If everyone flagged by the model is
automatically prescribed heavy medication with side effects, false positives can be critical.
Next, your team has done another experiment and created a new model:
                                        Model prediction:    Model prediction:
                                        Has pneumonia        Does not have pneumonia
Real outcome: Has pneumonia                    0                        2
Real outcome: Does not have pneumonia      100,000                 99,998
Does this model perform better? It would have missed both patients that needed therapy
and assigned 100,000 healthy people to a treatment group, making physicians do
unnecessary work. In truth, you can make the final decision only after presenting results to
the people who will use the model. They may have a different opinion on what is best.
Ideally, you should define this at the first stages of the project by creating a model-testing
methodology document in collaboration with experts in the field.
You will face binary classification problems everywhere; thus, a good understanding of the terminology is important.
You can see all the new concepts summed up in the following table:

                                     Model prediction:    Model prediction:
                                     1 (positive case)    0 (negative case)
Real outcome: 1 (positive case)             TP                   FN
Real outcome: 0 (negative case)             FP                   TN
It is crucial to note that you can control the number of false positive and
false negative responses of a single model. Classifiers output the
probability of a data point belonging to a class; that is, the model
prediction is a number between 0 and 1. You can decide whether a
prediction belongs to the positive or negative class by comparing it with a
threshold. For example, if the threshold is 0.5, then any model prediction
greater than 0.5 will belong to class 1, and to class 0 otherwise.
By changing the threshold, you can change the proportions between the
cells in the confusion table. By choosing a large threshold, like 0.9, the
volume of false positive responses will decrease, but false negative
responses will increase. Threshold selection is essential for binary
classification problems. Some environments, such as digital advertising,
will be more forgiving to false positives, while in others, such as
healthcare or insurance, you may find them unacceptable.
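The effect of moving the threshold can be sketched in a few lines of Python (the probabilities and labels below are invented for illustration):

```python
# Hypothetical predicted probabilities and true labels.
probs  = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

def confusion_counts(probs, labels, threshold):
    """Apply a threshold to probabilities and count TP, FP, FN, TN."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    return tp, fp, fn, tn

# Raising the threshold trades false positives for false negatives.
print(confusion_counts(probs, labels, 0.5))  # (3, 1, 1, 3)
print(confusion_counts(probs, labels, 0.9))  # (1, 0, 3, 4): fewer FP, more FN
```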
Confusion tables provide deep insights into classification problems but require your
attention and time. This can be limiting when you want to run numerous experiments and
compare many models. To simplify the process, statisticians and data scientists have
designed many metrics that sum up classifier performance without suffering from the
problems that accuracy does. First, let's examine some ways to summarize the confusion
table's rows and columns. From there, we will explore how to condense it into a single
statistic.
In the following table, you can see two new metrics that summarize the different kinds of
errors, precision and recall:

                                     Model prediction:    Model prediction:
                                     1 (positive case)    0 (negative case)    Combined metric
Real outcome: 1 (positive case)      True Positive        False Negative       Recall = TP / (TP + FN)
Real outcome: 0 (negative case)      False Positive       True Negative
Combined metric                      Precision = TP / (TP + FP)

Recall, which summarizes the first row, is also called the True Positive Rate (TPR); precision summarizes the first column.
Precision measures the proportion of the cases that your model labeled as positive that are
positive in reality. If your model predicted 10 positive cases and 2 of those predictions
turned out to be negative in reality, then its precision would be 0.8. A precision of 0.8
means that when our model predicts a positive case, it is correct 80 times out of 100.
Recall measures the proportion of all truly positive cases that the model managed to
identify. If, out of 10 positive cases, the model had correctly predicted 8 (8 true positives)
and missed the other 2 (2 false negatives), then its recall would also be 0.8.
For binary classification, precision and recall reduce the number of metrics we must work
with to two. This is better, but not ideal. We can sum up everything into a single number by
using a metric called the F1 score. You can calculate F1 using the following formula:

F1 = 2 × (precision × recall) / (precision + recall)

F1 is 1 for a perfect classifier and 0 for the worst classifier. Because it considers both
precision and recall, it does not suffer from the same problem as accuracy and is a better
default metric for classification problems.
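These definitions translate directly into Python; here is a sketch using the counts from the examples above:

```python
def precision(tp, fp):
    # Of everything the model flagged as positive, how much was right?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all truly positive cases, how many did the model find?
    return tp / (tp + fn)

def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# 10 positive predictions, 2 of them wrong; 10 actual positives, 2 missed.
p = precision(tp=8, fp=2)      # 0.8
r = recall(tp=8, fn=2)         # 0.8
print(p, r, f1_score(p, r))    # 0.8 0.8 0.8
```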
[ 45 ]
Testing Your Models Chapter 2
To illustrate this, let's take a trained model and generate predictions for the test dataset. If
we calculate class assignments by taking lots of different thresholds, and then calculate the
precision, recall, and F1 score for each of those assignments, we could depict each precision
and recall value in a single plot:
The preceding plot was made using the yellowbrick library, which
contains many useful visualization tools for model selection and
interpretation. You can see the capabilities of this library here: https://
www.scikit-yb.org/en/latest/index.html.
In the preceding plot, you can see the precision (blue), recall (green), and F1 (red) values for
each threshold between 0 and 1. Based on this plot, we can see that 0.5, which is the default
in many machine learning libraries, might not be the best choice, and that something like
0.45 would yield better metric values.
Another useful concept shown in the plot is the queue rate (depicted in magenta), which
shows the proportion of instances in the test dataset labeled as positive. For the 0.45
threshold (identified in the plot as a dashed line), you can see that the queue rate is 0.4. This
means that approximately 40% of all cases will be labeled as fraudulent. Depending on the
business process in which the model will be used, positive cases might need to be further
investigated by humans. In some cases, manual checking takes a lot of time or resources,
and it may be acceptable to misclassify a few positive instances in exchange for a much
lower queue rate. In such cases, you might want to choose models with lower queue rates
even if their other metrics are lower.
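The sweep behind such a plot (drawn in the book with yellowbrick's visualizers) can be sketched in plain Python; the probabilities and labels below are invented for illustration:

```python
# For each candidate threshold, compute precision, recall, F1, and the
# queue rate (the share of cases flagged positive at that threshold).
probs  = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1,   1,   1,   0,   1,   0,   0,   0,   0,   0]

def sweep(probs, labels, thresholds):
    rows = []
    n = len(labels)
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        queue_rate = sum(preds) / n
        rows.append((t, prec, rec, f1, queue_rate))
    return rows

for t, prec, rec, f1, q in sweep(probs, labels, [0.3, 0.5, 0.7]):
    print(f"t={t:.2f} precision={prec:.2f} recall={rec:.2f} "
          f"f1={f1:.2f} queue_rate={q:.2f}")
```

Plotting these four series against the threshold reproduces the kind of chart shown above.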
The difference between ROC AUC and PR AUC arises from the ways in which those
metrics are calculated, but we will omit the technical details here for the sake of brevity.
Calculating AUC metrics is a bit more involved than the other metrics
presented in this chapter. For more information, check out https://www.
chioka.in/differences-between-roc-auc-and-pr-auc/ and https://
en.wikipedia.org/wiki/Receiver_operating_characteristic.
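As a sketch of the idea behind ROC AUC, it can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting as half):

```python
def roc_auc(y_true, scores):
    """ROC AUC via its rank interpretation: the probability that a
    randomly chosen positive example is scored higher than a randomly
    chosen negative one (ties count as half a win)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A classifier that ranks every positive above every negative has AUC 1.
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
```

In practice, you would use a library implementation such as scikit-learn's `roc_auc_score`; this quadratic version is only for building intuition.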
There is no single rule for choosing the right balance for the precision, recall, F1, and queue
rate. Those values should be thoughtfully investigated with respect to the business process.
Relying solely on technical metrics for model selection can result in a disaster, as models
that are the best for your customers are not always the most accurate models. In some cases,
high precision might be more important than recall, while for others, the queue rate will be
most important. At this point, we need to introduce another kind of metric that will act as a
bridge between technical metrics and business requirements: business metrics.
This concludes our introduction to technical metrics. There are many more ways to test
classification and regression models, each with pros and cons. Enumerating and describing
them all would take a book in itself and is unnecessary because we have already achieved
our goal. Armed with the new concepts from this chapter, you now understand the general
flow of evaluating a machine learning model before testing it in real-world conditions. You
can now use offline model testing to check a model's quality before deploying it. Next, we
will explore online testing to complete your understanding of model quality assessment.
The only way to be certain about model performance in the near future is to perform a live
test. Depending on the environment, such a test may introduce significant risks. For
example, models that assess airplane engine quality or patient health would be unsuitable
for real-world testing before we become confident in their performance.
When the time for a live test comes, you will want to minimize risks while making
statistically valid conclusions. Thankfully, there is a statistical framework for that purpose
known as hypothesis testing. When performing a hypothesis test, you check the validity of
some idea (hypothesis) by collecting data and executing a statistical test. Imagine you need
to check whether your new advertising model increases revenues from the ad service. To
do this, you randomly split all your clients into two groups: one group uses the old
advertising algorithm, while the others see ads recommended by a new algorithm. After
you have collected a sufficient volume of data, you compare two groups and measure
differences between them. Why do we need to bother with statistics, you may ask?
Because we can answer the following questions only with the help of stats:
How should I sort (sample) individuals into each group? Can my sampling
process distort results of the test?
What is the minimum number of clients in each group? Can random fluctuations
in the data affect my measurements?
How long should I run the test for to get a confident answer?
What formula should I use to compare results in each group?
The experiment setup for a hypothesis test splits test targets into two groups on purpose.
We can try to use a single group instead. For instance, we can take one set of measurements
with the old model. After the first part of the experiment is finished, we can deploy the new
algorithm and measure its effect. Then, we compare two measurements made one after
another. What could go wrong? In fact, the results we get wouldn't mean anything. Many
things could have changed in between our measurements, such as the following:
User preferences
General user mood
Popularity of our service
Average user profile
Any other attribute of users or businesses
All these hidden effects could affect our measurements in unpredictable ways, which is
why we need two groups: test and control. We must select these groups in such a way that
the only difference between them is our hypothesis. It should be present in the test group
and missing from the control group. To illustrate, in medical trials, control groups are the
ones who get the placebo. Suppose we want to test the positive effect of a new painkiller.
Here are some examples of bad test setups:
The easiest way to create groups is random selection. Truly random selection may be hard
to do in the real world, but is easy if you deal with internet services. There, you may just
randomly decide which version of your algorithm to use for each active user. Be sure to
always design experiment setups with an experienced statistician or data scientist, as
correct tests are notoriously hard to execute, especially in offline settings.
Statistical tests check the validity of a null hypothesis: the assumption that the results you
got arose by chance. The opposing statement is called the alternative hypothesis. For
instance, here is the hypothesis set for our ad model test:
Null hypothesis: The new model does not affect the ad service revenue.
Alternative hypothesis: The new model affects the ad service revenue.
Typically, a statistical test measures the probability of observing results at least as extreme
as yours if the null hypothesis were true. If that probability is low, we reject the null
hypothesis in favor of the alternative; otherwise, we fail to reject the null hypothesis. If,
according to a statistical test, this probability were 5%, we would say that we reject the null
hypothesis at a 95% confidence level. Note that this does not mean the model affects the ad
service revenue with a 95% probability; it means that results like ours would be unlikely if
the model had no effect. The significance level for rejecting the null hypothesis depends on
the level of risk you want to take. For an ad model, a 95% confidence level may be enough,
while no less than 99% may be required for a model that assesses patient health conditions.
The most typical hypothesis test is comparing two means. If we use this test in our ad
model example, we would measure average revenues with and without the new ranking
algorithm. We may accept or reject the null hypothesis using a test statistic when the
experiment is finished.
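The comparison of two means described above is typically done with a two-sample t-test. Here is a minimal sketch with simulated revenue data; the group sizes, means, and the uplift are invented for illustration:

```python
import random
from scipy import stats

random.seed(42)

# Simulated daily ad revenue for the control (old model) and test
# (new model) groups. The 5-unit uplift is invented for illustration.
control = [random.gauss(100.0, 10.0) for _ in range(1000)]
test = [random.gauss(105.0, 10.0) for _ in range(1000)]

# Null hypothesis: the two groups have the same mean revenue.
res = stats.ttest_ind(control, test)
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.2g}")

if res.pvalue < 0.05:
    print("Reject the null hypothesis at the 95% confidence level.")
else:
    print("Fail to reject the null hypothesis.")
```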
The amount of data you need to collect for conducting a hypothesis test depends on several
factors:
Confidence level: The more statistical confidence you need, the more data is
required to support the evidence.
Statistical power: This measures the probability of detecting a significant
difference, if one exists. The more statistical power your test has, the lower the
chance of false negative responses.
Hypothesized difference and population variance: If your data has large
variance, you need to collect more data to detect a significant difference. If the
difference between the two means is smaller than population variance, you
would need even more data.
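As a rough sketch of how these factors trade off, here is the classic approximation for the per-group sample size when comparing two means (assuming normally distributed data and equal group sizes; the statistician on your team may well use a different formula):

```python
import math
from scipy.stats import norm

def sample_size_per_group(alpha, power, sigma, delta):
    """Approximate n per group for a two-sided, two-sample comparison
    of means: n = 2 * ((z_{alpha/2} + z_{beta}) * sigma / delta)^2."""
    z_alpha = norm.ppf(1 - alpha / 2)   # confidence level term
    z_beta = norm.ppf(power)            # statistical power term
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Detecting a 0.5-unit difference with sigma = 1 at 95% confidence
# and 80% power requires roughly 63 subjects per group.
print(sample_size_per_group(alpha=0.05, power=0.8, sigma=1.0, delta=0.5))
```

Note how halving the detectable difference roughly quadruples the required sample size.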
You can see how different test parameters determine their data hunger in the following
table:
While powerful, hypothesis tests have limitations: you need to wait until the experiment
ends before you can apply its results. If your model is bad, you won't be able to reduce
damage without compromising the test procedure. Another limitation is that you can test
only one model at a time with a single hypothesis test.
In situations where you can trade off statistical rigor for speed and risk-aversion, there is an
alternative approach called Multi-Armed Bandits (MABs). To understand how MABs
work, imagine yourself inside a casino with lots of slot machines. You know that some of
those machines yield better returns than others. Your task is to find the best slot machine
with a minimal number of trials. Thus, you try different (multi) arms of slot machines
(bandits) to maximize your reward. You can extend this situation to testing multiple ad
models: for each user, you must find a model that is most likely to increase your ad
revenue.
The most popular MAB algorithm is called the epsilon-greedy bandit. Despite the name, the
inner workings of the method are simple: with probability epsilon, choose a random action
(exploration); otherwise, choose the action with the best average reward observed so far
(exploitation). After each action, record the reward and update that action's average.
Frequently, we start with a large value of epsilon and decrease it over time. In this
way, MAB explores lots of random choices at the start and exploits the most profitable
actions toward the end. The exploration frequency gradually diminishes, approaching zero.
When you first launch a MAB, it collects rewards from random actions. As time passes, you
will see that the average rewards for all choices converge to their true values. The major
benefit of MABs is that they change their behavior in real time. While someone else is still
waiting for hypothesis test results, a MAB gives you a continuously updating picture while
converging to the best choice.
Bandits are one of the most basic reinforcement learning algorithms. Despite their
simplicity, they can provide good results.
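A minimal epsilon-greedy sketch follows. The three arms and their Bernoulli payout rates are invented; in the ad example, an "arm" would be one of the candidate models. For simplicity, epsilon is kept constant here rather than decayed:

```python
import random

random.seed(0)

true_payout = [0.2, 0.5, 0.8]        # hidden Bernoulli reward rates
estimates = [0.0] * 3                # running average reward per arm
counts = [0] * 3
epsilon = 0.1                        # exploration probability

for _ in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)                        # explore
    else:
        arm = max(range(3), key=lambda i: estimates[i])  # exploit
    reward = 1.0 if random.random() < true_payout[arm] else 0.0
    counts[arm] += 1
    # Incremental update of the running average reward for this arm.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(counts)                          # the best arm attracts most pulls
print([round(e, 2) for e in estimates])
```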
We now have two new testing approaches to use. How do we choose between them?
Unfortunately, there is no simple answer. Hypothesis tests and MABs pose different
constraints on data, sampling processes, and experiment conditions. It is better to consult
an experienced statistician or a data scientist before deciding. Mathematical constraints are
not the only things that affect the choice; the environment is also important. MABs are easy
to apply in situations where you can test different choices on random individuals from the
whole population. This may be very convenient when testing models for a large online
retailer, but is impossible for clinical trials, where you would be better off applying
hypothesis testing. Let's see a rule of thumb for choosing between MABs and hypothesis tests:
MABs are better suited to environments where you need to test many
alternatives with a limited set of resources. You trade off statistical rigor for
efficiency when using MABs. MABs can take a lot of time to converge, gradually
improving over time.
You should apply hypothesis tests if you have only one alternative, if your trial
involves great risks, or if you need a statistically-rigorous answer. Hypothesis
tests take a fixed amount of time and resources to complete, but impose larger
risks than MABs.
While testing models in an online setting is extremely important for making sure that your
offline test results stay true after the deployment stage, there is still a danger zone that we
have not covered. Abrupt and unexpected changes in data can severely affect or even break
the deployed model, so it is also important to monitor incoming data quality.
To overcome this, your system needs to monitor all incoming data and check its quality on
the fly. Here are some typical checks:
Summary
In this chapter, we answered a very important question: what does it mean for a model to
work correctly? We explored the nature of errors and studied metrics that can quantify and
measure model errors. We drew a line between offline and online model testing and
defined testing procedures for both types. We can perform offline model testing using
train/validation/test data splits and cross-validation. For online testing, we can choose
between hypothesis tests and MABs.
In the next chapter, we will look into the inner workings of data science. We will dive into
the main concepts behind machine learning and deep learning, giving an intuitive
understanding of how machines learn.
3
Understanding AI
You now have a good understanding of what data science can do and how we can check
whether it works. We have covered the main domains of data science, including machine
learning and deep learning, but still, the inner workings of the algorithms are difficult to
discern through the fog. In this chapter, we will look at algorithms. You will get an intuitive
understanding of how the learning process is defined using mathematics and statistics.
Deep neural networks won't be so mystical anymore, and common machine learning jargon
will not scare you but provide understanding and ideas to complete the ever-growing list
of potential projects.
You are not the only one who will benefit from reading this chapter. Your new knowledge
will streamline communication with colleagues, making meetings short and purposeful and
teamwork more efficient. We will start at the heart of every machine learning problem:
defining the learning process. To do this, we will start with the two subjects that lie at the
root of data science: mathematics and statistics.
If you work in the business sector, I bet that you hear the word optimization several times a
day. To optimize something means to make it more efficient, cut costs, increase revenues,
and minimize risks. Optimization involves taking a number of actions, measuring results,
and deciding whether you have ended up in a better place.
For example, to optimize your daily route to work, you can minimize the total time you
spend driving from home to the office. Let's suppose that in your case the only thing that
matters is time. Thus, optimization means minimization of the time. You may try different
options such as using another road or going by public transport instead of driving your car.
To choose the best, you will evaluate all routes using the same quantity, that is, the total
time to make it from home to the office.
To get a better idea of defining optimization problems, let's consider another example. Our
friend Jonathan was tired of his day-to-day job at a bank, so he started a rabbit farm. It
turns out that rabbits breed fast. To start, he bought four rabbits, and after a while, he had
16. A month later, there were 256 of them. All of those new rabbits caused additional
expenses. Jonathan was selling rabbits at a lower rate than the rate at which they bred.
Jonathan's smart farmer friend Aron was impressed with his rabbit production rates, so he
proposed to buy all excess rabbits for a discounted price. Now, Jonathan needs to find out
how many rabbits to sell to Aron so that he can keep within the following boundaries:
As you can see, we have defined another optimization problem, and the rabbit farm has
started to remind Jonathan of the bank job he left. This optimization task is quite different,
though; it looks harder. In the first problem, we tried to minimize commuting time. In this
problem, we need to seek the minimum number of rabbits to sell without violating the
other conditions. We call problems like these constrained optimization. Additional constraints
allow us to model more realistic scenarios in complex environments. To name a few
applications, constrained optimization can solve planning, budgeting, and routing problems.
In the end, Jonathan was disappointed with his rabbit farm and sold it to Aron. He then
continued his search for a perfect occupation that wouldn't end up resembling his banking job.
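Since the text does not list Jonathan's exact constraints, here is a hypothetical version of his problem solved as a linear program with SciPy; the budget and feeding-cost figures are invented:

```python
from scipy.optimize import linprog

# Hypothetical setup: Jonathan has 256 rabbits, each remaining rabbit
# costs $2/month to feed, and his feed budget is $400/month. He wants
# to sell as few rabbits (s) as possible:
#   minimize   s
#   subject to 2 * (256 - s) <= 400   ->   -2s <= -112
#              0 <= s <= 256
result = linprog(c=[1.0], A_ub=[[-2.0]], b_ub=[-112.0], bounds=[(0, 256)])
print(round(result.x[0]))  # he must sell at least 56 rabbits
```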
There is one place where profits and losses stop driving you mad; that is the mathematics
department at a technical university. To get a position there, you are asked to pass an exam.
The first task is to find the minimum of the function f(x) = x^2.
While examining the plot, you notice that this function takes values no less than 0, so the
answer is obviously 0. The next question looks like the previous one but with a twist: find
the minimum of the function f(x) = x^2 + a, where a is an arbitrary number. To solve it, you
draw a bunch of plots and find out that the minimum value is always a.
The last question takes it to the extreme. It says that you won't be given a formula for f(x),
but you can go to your teacher and ask for the value of f(x) at any point x, as many times as
you want. It is impossible to draw a plot. In the other tasks, the minimum was always the
lowest point of the plot. How can we find that point without looking at the plot? To tackle
this problem, we will first imagine that we have a plot of this function.
First, we will draw a line between two arbitrary points of the function, as seen in the
following plot:
We will call the distance between those points d. If we make d smaller and smaller, the two
points will become so close that they will visually converge to a single point:
The line in the preceding plot is called a tangent. It has a very convenient property: its
slope can help us find the minimum or maximum of a function. If the line is flat,
then we have found either a minimum or a maximum of the function. If all nearby points are
lower, then it should be a maximum. If all nearby points are higher, then this is a
minimum.
The following plot shows a function (drawn in blue) and its maximum, along with the
tangent (drawn in orange):
Drawing a bunch of lines and points becomes mundane after a while. Thankfully, there is a
simple way to compute the slope of the line between the points x and x + d. Thinking of the
rise and run of a right triangle, you will quickly find the answer:

slope = (f(x + d) − f(x)) / d

We can easily find the slope using this formula.
You can solve the final task by using gradient descent. You choose a starting point, x0,
ask for the value of f(x0), and calculate the slope using the small number d. By looking at
the slope, you can decide whether your next pick, x1, should be greater or less than x0.
When the slope becomes zero, you can test whether your current value of x is a
minimum or a maximum by looking at several nearby values. If every nearby value is less
than f(x), then x is a maximum. Otherwise, it is a minimum.
If we start gradient descent at point A, we will end up in the global minimum. But if we start
at point B, we will be stuck in a local minimum. When you use gradient descent, you can
never actually know whether you are in a local or global minimum. One way to check is to
repeat the descent from various starting points that are far away from each other. Another
way to avoid local minima is to increase the step size. But be careful; if the step size is too
large, you will just jump over the minima again and again, never reaching your true goal,
the global minimum.
Like in machine learning, there are many mathematical optimization algorithms with
different trade-offs. Gradient descent is one of the simplest and easiest to get started with.
Despite being simple, gradient descent is commonly used to train machine learning models.
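The procedure above can be sketched in a few lines of Python. The function f below is a stand-in with a known minimum at x = 2 (the exam function itself is not given), and the slope is estimated with the finite-difference formula:

```python
def f(x):
    # A stand-in function with a single minimum at x = 2.
    return (x - 2) ** 2 + 1

def slope(f, x, d=1e-6):
    # Finite-difference estimate of the tangent slope at x.
    return (f(x + d) - f(x)) / d

x = 10.0            # starting point
step = 0.1          # step size; too large and we jump over the minimum
for _ in range(200):
    x -= step * slope(f, x)

print(round(x, 3))  # converges close to the true minimum at x = 2
```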
Now that you have a good grip on the main principles of mathematical optimization in your
tool belt, we can explore the field of statistics: the grandfather of machine learning.
To start, let's look at why we need probabilities when we already are equipped with such
great mathematical tooling. We use calculus to work with functions on an infinitesimal
scale and to measure how they change. We developed algebra to solve equations, and we
have dozens of other areas of mathematics that help us to tackle almost any kind of hard
problem we can think of. We even came up with category theory that provides a universal
language for all mathematics that almost no one can understand (Haskell programmers
included).
The difficult part is that we all live in a chaotic universe where things can't be measured
exactly. When we study real-world processes we want to learn about many random events
that distort our experiments. Uncertainty is everywhere, and we must tame and use it for
our needs. That is when probability theory and statistics come into play. Probabilities allow
us to quantify and measure uncertain events so we can make sound decisions. Daniel
Kahneman showed in his widely known book Thinking, Fast and Slow that our intuition is
notoriously bad at solving statistical problems. Probabilistic thinking helps us to avoid
biases and act rationally.
Frequentist probabilities
Imagine that a stranger suggested you play a game: he gives you a coin. You toss it. If it
comes up heads, you get $100. If it is tails, you lose $75. Before playing a game, you will
surely want to check whether or not it is fair. If the coin is biased toward tails, you can lose
money pretty quickly. How can we approach this? Let's conduct an experiment, wherein
we will record 1 if heads come up and 0 if we see tails. The fun part is that we will need to
make 1,000 tosses to be sure that our calculations are right. Imagine we got the following
results: 600 heads (1s) and 400 tails (0s). If we then count how frequently heads or tails came
up in the past, we will get 60% and 40%, respectively. We can interpret those frequencies as
the probabilities of the coin coming up heads or tails. We call this the frequentist view of
probability. It turns out that our coin is actually biased toward heads. The expected value
of this game can be calculated by multiplying the probabilities by their values and summing
everything up (the second value in the following formula is negative because the $75 is a
potential loss, not a gain):

Expected value = 0.6 × $100 + 0.4 × (−$75) = $60 − $30 = $30

The more you play, the more you earn. Even after several consecutive unlucky throws, you
can be sure that the returns will average out soon. Thus, a frequentist probability measures
the proportion of some event among all possible events.
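Both the frequency estimate and the expected value are easy to verify with a quick simulation; here is a sketch with the 60/40 coin from the example:

```python
import random

random.seed(1)

# Simulate 1,000 tosses of a coin biased 60/40 toward heads.
tosses = [1 if random.random() < 0.6 else 0 for _ in range(1000)]
heads_freq = sum(tosses) / len(tosses)

# Expected value of the game: win $100 on heads, lose $75 on tails.
expected_value = 0.6 * 100 + 0.4 * (-75)

print(f"observed heads frequency: {heads_freq:.2f}")
print(f"expected value per toss: ${expected_value:.0f}")
```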
Conditional probabilities
It is handy to know the probability of an event occurring given that some other event has
occurred. We write the conditional probability of event A given event B as P(A | B). Take
rain, for example:
In the following diagram, you can see the probabilities of different events occurring
together:
From this Euler diagram, we can see that P(Rain | Thunder) = 1, meaning that there is always
rain when we hear thunder (yes, it is not exactly true, but we'll take this as true for the sake
of simplicity).
What about P(Rain | Sunny)? Visually, this probability is small, but how can we formulate it
mathematically to do the exact calculations? Conditional probability is defined as follows:

P(Rain | Sunny) = P(Rain and Sunny) / P(Sunny)

In words, we divide the joint probability of both Rain and Sunny by the probability of
sunny weather.
But why does this formula work? First, let's rename the events for the first and second tosses
as A and B to remove the notational clutter, and consider the joint probability of both tosses
we have seen so far, P(A and B). Now multiply and divide it by P(A) (nothing changes, as the
term can be canceled out), and recall the definition of conditional probability:

P(A and B) = P(A) × P(A and B) / P(A) = P(A) × P(B | A)

If we read this expression from right to left and note that the second toss does not depend on
the first, so that P(B | A) = P(B), we find that P(A and B) = P(A) × P(B). Basically,
this means that A is independent of B! The same argument goes for P(A | B).
In this example, the prior value is . If we do not know the real prior value, we
can substitute an estimate that is based on our experience to make an approximate
calculation. This is the beauty of Bayes' theorem. You can calculate complex conditional
probabilities with simple components.
Bayes' theorem has immense value and a vast area of application. The Bayesian theory even
has its own branch of statistics and inference methods. Many people think that the Bayesian
view is a lot closer to how we humans understand uncertainties, in particular, how prior
experience affects decisions we make.
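As a small numeric sketch of Bayes' theorem (all the probabilities below are invented for illustration):

```python
def bayes(prior_a, likelihood_b_given_a, prob_b):
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / prob_b

# Invented example: 1% of days are stormy (the prior), thunder is heard
# on 90% of stormy days, and on 2% of all days overall.
posterior = bayes(prior_a=0.01, likelihood_b_given_a=0.9, prob_b=0.02)
print(f"P(storm | thunder) = {posterior:.2f}")  # 0.45
```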
Distributions
Probabilities work with sets of outcomes or events. Many problems we describe with
probabilities share common properties. In the following plot, you can see the bell curve:
The bell curve, or Gaussian distribution, is centered around the most probable set of
outcomes, and the tails on both ends represent the least likely outcomes. Because of its
mathematical properties, the bell curve appears everywhere in our world. Measure the
height of lots of random people, and you will see a bell curve; look at the height of all grass
blades in your lawn, and you will see it again. Calculate the probability of people in your
city having a certain income, and here it is again.
The Gaussian distribution is one of the most common distributions, but there are many
more. A probability distribution is a mathematical law that tells us the probabilities of
different possible outcomes of events formulated as a mathematical function.
For our biased coin, we can write X ~ Bernoulli(0.6). This means that the random variable X
is distributed according to a Bernoulli law with the probability of success (heads) equal to 0.6.
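Sampling from these distributions takes a line each in Python; here is a sketch (the 170 cm mean and 10 cm standard deviation for heights are assumptions for illustration):

```python
import random

random.seed(7)

# Bernoulli(0.6): 1 ("heads") with probability 0.6, else 0.
bernoulli_sample = [1 if random.random() < 0.6 else 0 for _ in range(10_000)]

# Gaussian (the bell curve) with mean 170 and standard deviation 10,
# for example, heights in centimeters.
gaussian_sample = [random.gauss(170, 10) for _ in range(10_000)]

print(sum(bernoulli_sample) / len(bernoulli_sample))   # close to 0.6
print(sum(gaussian_sample) / len(gaussian_sample))     # close to 170
```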
Apply statistics
Statistics allows us to reason about different properties of a population without collecting a
full dataset covering every member of that population. The process of selecting a random
subset of data from the true population is called sampling. A statistic is any function that
summarizes the data using values from the sample. The ubiquitous statistic used by
almost everyone on a daily basis is the sample mean, or arithmetic average:

mean = (x1 + x2 + … + xn) / n
We collected a sample of four heights for each day, a total of 16 heights. Your statistician
friend Fred told you on Friday that he had already collected a sample of 2,000 people and
the average height in the area was about 170 cm.
To investigate, we can look at how your sample average changed with each new data point:
Notice that on day 2, the average value was unexpectedly high. It may just have happened
that we stumbled upon four tall people. The random fluctuations in the data are called
variance.
Sample variance summarizes our data, so we can consider it another statistic. The larger
the variance is, the larger the sample you need to collect for your calculated average to be
accurate, that is, close to the real value. This phenomenon has a name: the law of large
numbers. The more measurements you make, the better your estimate will be.
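The law of large numbers is easy to demonstrate with simulated heights; here is a sketch reusing the 170 cm average from the example (the 10 cm standard deviation is an assumption):

```python
import random

random.seed(3)

def sample_mean(n):
    """Average of n simulated heights drawn from N(170, 10)."""
    heights = [random.gauss(170, 10) for _ in range(n)]
    return sum(heights) / n

# Small samples fluctuate; large samples settle near the true mean of 170.
for n in (4, 100, 10_000):
    print(n, round(sample_mean(n), 2))
```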
Statistical modeling
Statistics is more than simply calculating summary numbers. One of the most interesting
aspects of statistics is modeling. Statistical modeling studies mathematical models that
make a set of statistical assumptions about data. To make this clearer, let's return to our
weather example. We have collected a dataset with random variables that describe the
current weather:
Using this data, we want to infer which variables are related to rain. To do this, we will
build a statistical model. Besides the previous data, we have recorded a binary rain variable
that takes the value 1 if it rained, and 0 otherwise.
You may find it strange thinking about rain in terms of probability. What does it mean to
say that last Wednesday, the probability of rain was 45%? Last Wednesday is a past date, so
we can examine the data and check whether there was rain. The trick is to understand that
in our dataset, there are many days similar to Wednesday. Let's suppose that we have
collected the following values:
In this example, Monday and Wednesday are very similar, but their rain outcomes are
different. In a sufficiently large dataset, we could find two rows that match exactly but have
different outcomes. Why is this happening? First, our dataset does not include all the
possible variables that can describe rain. It is impossible to collect such a dataset, so we
make an assumption that our data is related to rain, but does not describe it fully.
Measurement errors, the randomness of events, and incomplete data make rain
probabilistic. You may wonder whether rain is probabilistic in nature, or whether every period of rain is predetermined. To check whether rain events are deterministic, we must collect a daily
snapshot of the complete state of the universe, which is impossible. Statistics and
probability theory help us to understand our world even if we have imperfect information.
For example, imagine that we have 10 days similar to last Wednesday in our dataset. By
similar, I mean that all variables we have collected differ only by a small amount. Out of
those 10 days, 8 were rainy and 2 were sunny. We may say that on a day like last Wednesday, there is an 80% probability of rain. That is the most accurate answer we can
give using this data.
Having assumptions about data in place, we can proceed to modeling. We can make
another assumption that there exists some mathematical model M that uses the data to
estimate rain probability. That is, model M uses data d to learn the relationship between the
data and rain probability. The model will infer this relationship by assigning rain
probabilities that are closest to real outcomes in our dataset.
The main goal of model M is not to make accurate predictions, but to find and explain
relationships. This is where we can draw a line between statistics and machine learning.
Machine learning seeks to find accurate predictive models, while statistics uses models to
find explanations and interpretations. Goals differ, but the underlying concepts that allow
models to learn from data are the same. Now, we can finally uncover how this model
M can learn from data. We will disentangle the magic, leaving a straightforward
understanding of the mathematics behind machine learning.
In practical cases, we are interested in teaching algorithms to perform all kinds of specific
tasks faster, better, and more reliably than we can ourselves. For now, we will focus on
prediction and recognition tasks. Thus, we want to build algorithms that are able to
recognize patterns and predict future outcomes. The following table shows some examples
of recognition and prediction tasks:
The first idea may be to approach learning as humans do, and provide explanations and
examples through speech, images, and sets of examples. Unfortunately, while learning in
this way, we perform many complex cognitive tasks, such as listening, writing, and
speaking. A computer algorithm by itself cannot collect new experiences the way we do.
What if, instead, we take a simplified model of our world in the form of digital data? For
example, predicting customer churn for Acme Co could be done only using data about
customer purchases and product ratings. The more complete the dataset is, the
more accurate the model of customer churn is likely to be.
Let's look at another example. We will build a machine learning project cost estimator. This
model will use the attributes of a project to calculate the cost estimate. Suppose that we
have collected the following data attributes for each project in our company:
Is a large company (Integer): indicates whether the customer has a large number of employees; 1 = the customer's company has more than 100 employees, 0 = 100 or fewer employees
Total project cost (Integer): the total cost in USD, from 0 to ∞
The example dataset containing these attributes is provided in the following table:
The simplest model we can imagine is a so-called linear model. It sums data attributes
multiplied by variable coefficients to calculate the project cost estimate:
In this simplified scenario, we do not know the real values of cost variables. However, we
can use statistical methods and estimate them from data. Let's start with a random set of
parameters:
If we use the parameters for every project we have in our dataset, we will get the following
results:
Total project 4 cost = 50,000 + 115 × 100 + 40,000 × 3 + 50,000 × 1 + 50,000 × 1 = 281,500
You have probably noticed that the values differ from the real project costs in our dataset. This means that if we use this model on any real project, our estimates will be erroneous.
We can measure this error in multiple ways, but let's stick with one of the most popular
choices:
There are many ways to quantify prediction errors. All of them introduce
different trade-offs and limitations. Error measure selection is one of the
most important technical aspects of building a machine learning model.
For the overall error, let's take the arithmetic average of the squared errors of all projects and then take the square root of it. The number that we have calculated is called the root mean squared error (RMSE).
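As a sketch, with made-up predicted and actual project costs, RMSE is computed by squaring each error, averaging, and taking the square root:

```python
import math

# Hypothetical predicted vs. actual project costs in USD (illustrative values).
predicted = [281_500, 190_000, 305_000]
actual = [250_000, 200_000, 300_000]

# Square each individual error, average them, then take the square root.
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))  # roughly 19,300 USD
```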
It turns out that we can use optimization algorithms to tweak our cost parameters so that
RMSE will be minimized. In other words, we can find the best fit for cost parameters that
minimize the error for all rows in the dataset. We call this procedure maximum likelihood estimation (MLE).
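The optimization loop can be sketched for a single coefficient. All the data below is made up, generated so that the true coefficient is 115, and plain gradient descent on the squared error recovers it:

```python
# Toy data: project hours and costs, constructed so that cost = 115 * hours.
hours = [100, 150, 200]
costs = [11_500, 17_250, 23_000]

w = 0.0            # start from an arbitrary guess for the coefficient
lr = 0.00001       # learning rate (step size)
for _ in range(2000):
    # Gradient of the mean squared error with respect to w
    grad = sum(2 * (w * h - c) * h for h, c in zip(hours, costs)) / len(hours)
    w -= lr * grad  # step downhill, reducing the error
```

After enough iterations, `w` converges to the value that minimizes the error, which for this toy data is 115.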
Let's focus on another example to get a more general approach. Imagine that we have
started a coffee subscription service. A customer chooses her favorite flavors of coffee in
our mobile app and fills in an address and payment information. After that, our courier
delivers a hot cup of coffee every morning. There is a feedback system built in the app. We
promote seasonal offerings and discounts to clients via push notifications. There was a big
growth in subscribers last year: almost 2,000 people are already using the service and 100
more are subscribing each month. However, our customer churn percentage is growing
disturbingly fast. Marketing offers do not seem to make a big difference. To solve this
problem, we have decided to build a machine learning model that predicts customer churn
in advance. Knowing that a customer will churn, we can tailor an individual offer that will
turn them into an active user again.
This time, we will be more rigorous and abstract in our definitions. We will define a model
M that takes customer data X and historical churn outcomes Y. We will call Y the target
variable.
A table that describes every variable in this way is called a data dictionary. We can use it to understand the data coming
in and out of the model, without looking into the code or databases. Every data science
project must have an up-to-date data dictionary. More complete examples of data
dictionaries will be shown later in this book.
Given the customer description x, the model outputs churn probability ŷ. A hat over y means that ŷ is not the real churn probability, but only an estimate that can contain errors. This value will not be strictly zero or one. Instead, the model will output a probability between 0% and 100%. For example, for some customer x, we got a ŷ of 76%. We may interpret this value like this: based on historical data, the expectancy of this customer to churn is 76%. Or, out of 100 customers like customer x, 76 will churn.
A machine learning model must have some variable parameters that can change to better
match churn outcomes. Now that we have used formulas, we can't go on without
introducing at least one Greek letter. All the parameters of our model will be represented
by θ.
Let's sum up the components of our model:
Historical customer data X and churn outcomes Y, which we will call the training dataset
Machine learning algorithm M that accepts a customer description x and outputs churn probability ŷ
Model parameters θ that can be tuned using MLE
We will estimate the parameters θ̂ for our model using MLE on the training dataset (X, Y). I have placed a hat on top of θ to indicate that, in theory, there may be a best parameter set θ, but in reality we have limited data. Thus, the best parameter set we can get is only an estimate θ̂ of the true θ, and it may contain errors.
Now, we can finally use our model to make predictions about customer churn probability ŷ:

ŷ = M(x, θ̂)
Note that we have not explicitly defined the kind of machine learning model M we use. We
have defined an abstract framework for learning from data that does not depend on the
specific data or concrete algorithms it uses. This is the beauty of mathematics that opens up
limitless practical applications. With this abstract framework, we can come up with many
models M that have different trade-offs and capabilities. This is how machines learn.
There are many different types of machine learning models and many
estimation methods. Linear regression and MLE are among the simplest
examples that show the underlying principles that lie beneath many
machine learning models. A theorem called the no free lunch theorem says that there is no single model that will give you the best results for every task on every dataset. Our machine learning framework is abstract,
but this does not mean it can yield a perfect algorithm. Some models are
best for one task, but are terrible for another. One model may classify
images better than humans do, but it will fail at credit scoring. The
process of choosing the best model for a given task requires deep
knowledge of several disciplines, such as machine learning, statistics, and
software engineering. It depends on many factors, such as statistical data
properties, the type of task we are trying to solve, business constraints,
and risks. That is why only a professional data scientist can handle the selection and training of machine learning models. This process has many
intricacies and explaining all of them is beyond the scope of this book. An
interested reader may refer to the book list at the end of this book. There
you can find free books that explain the technical side of machine learning
in great depth.
We will now dive into the main intuitions behind popular machine learning algorithms,
leaving out technical details for the sake of seeing the wood for the trees.
The model-building process typically goes through the following stages:
Exploratory data analysis: In this stage, a data scientist uses a set of statistical and visualization techniques to gain a better understanding of the data.
Data preparation: In this part, a data scientist transforms data into a format
suitable for applying a machine learning algorithm.
Data preprocessing: Here, we clean prepared data and transform it so that a
machine learning algorithm can properly use every part of the data.
Modeling: In this part, a data scientist trains machine learning models.
Testing: In this stage, we evaluate the model using a set of metrics that measure
its performance.
This process repeats many times before we achieve sufficiently good results. You can apply
the life cycle to train many kinds of machine learning models, which we will explore next.
Linear models
The most basic type of machine learning model is the linear model. We have already seen a detailed example of a linear model in the previous section. Predictions of a linear model can be interpreted by looking at its coefficients: the greater a coefficient, the larger its contribution to the final prediction. While simple, these models are often not the most
accurate. Linear models are fast and computationally efficient, which makes them valuable
in settings with lots of data and limited computational resources.
Linear models are fast, efficient, simple, and interpretable. They can solve both
classification and regression problems.
Decision trees

A decision tree is a model built by an algorithm known as classification and regression trees (CART). To make a prediction, the algorithm starts from the top of the tree and makes consecutive choices based on values in the data. For binary classification, at the bottom of the tree you will get the proportion of positive cases among similar customers.
But how does CART choose columns and values for splits? We will explore the general
logic of CART on binary classification:
1. At first, it takes a single column and divides the data into two parts for each
value of this column.
2. Then it computes a proportion of positive cases for each split.
3. Step 1 and step 2 are repeated for each column in the data.
4. We rank each split by how well it divides the data. If a split divided the dataset perfectly, then all positive cases would have values lower than the threshold, and all negative cases would be on the other side. To illustrate, if Age > 25 were a perfect split, then all customers younger than 25 would have credit defaults and all customers older than 25 would have perfect credit histories.
5. According to step 4, the best split is chosen for the current level of the tree. The
dataset is divided into two parts according to the split value.
6. Steps 1 to 5 are repeated for each new dataset part.
7. The procedure continues until the algorithm meets a stopping criterion. For
example, we can stop building a tree by looking at the depth of the decision tree
or the minimum number of data points available for the next split.
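The split search in steps 1 to 5 can be sketched on toy data. The ages, default labels, and the simple purity score below are my own illustration; real CART implementations use measures such as Gini impurity:

```python
# Toy data: customer ages and whether they defaulted (1 = default).
ages = [22, 24, 31, 45, 52]
default = [1, 1, 0, 0, 0]

def purity(threshold):
    # Split the data into "age <= threshold" and "age > threshold" sides
    left = [d for a, d in zip(ages, default) if a <= threshold]
    right = [d for a, d in zip(ages, default) if a > threshold]

    # Score each side by the size-weighted share of its majority class
    def side(labels):
        if not labels:
            return 0.0
        p = sum(labels) / len(labels)
        return max(p, 1 - p) * len(labels)

    return (side(left) + side(right)) / len(ages)

# Rank candidate thresholds and pick the best split
best = max(ages[:-1], key=purity)  # the Age <= 24 split separates the classes perfectly
```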
We can also apply CART to regression problems, although the algorithm would be slightly
more complicated. CART is simple and interpretable, but it produces very weak models
that are rarely applied in practice. However, the properties of the algorithm and implementation tricks allow us to turn their weakest point into their main strength. We will
learn how to exploit these properties in the following section.
Ensemble models
Suppose that you own a retail store franchise. Business is growing and you are ready to
build a new store. The question is, where should you build it? Selecting a location is extremely important, as the choice is permanent and defines the local customer base that will visit your shop.
There are at least three ways to approach this decision:

1. Decide yourself.
2. Ask for the advice of the most competent employee.
3. Ask the opinion of many slightly less experienced employees.
Options 1 and 2 rely on one or two people to make the decision. Option 3 draws on the opinions of several experts. Statistically, option 3 is likely to yield a better
decision. Even world-class experts can make a mistake. Several professionals, sharing
information between each other, are much more likely to succeed. That is the reason why
living in big communities and working in large organizations leads to great results.
In machine learning, this principle works too. Many models can contribute to making a
single decision in an ensemble. Model ensembles tend to be more accurate than single
models, including the most advanced ones. Be wary, though; you need to build many
models to create an ensemble. A large number of models increases computational resource requirements quite rapidly, creating a trade-off between prediction accuracy and speed.
Tree-based ensembles
A particularly useful model for ensembles is a decision tree. There is an entire class of
machine learning models devoted to different ways of creating tree ensembles. This type of
model is the most frequent winner of Kaggle competitions on structured data, so it is
important to understand how it works.
Trees are good candidates to build ensembles because they have high variance. Because of
the randomness in a tree-building algorithm, every decision tree differs from the previous
one, even if the dataset does not change. Each time we build a decision tree, we may come
up with something that's different from before. Thus, each tree will make different errors.
Recall the following diagram:
It turns out that decision trees have extremely low bias and high variance. Imagine that
many different trees make hundreds of predictions for each individual, creating an
ensemble. What would happen if we average all predictions? We will be a lot closer to the
real answer. When used in an ensemble, decision trees can handle complex datasets with
high prediction accuracy.
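Averaging an ensemble's predictions is a one-liner; the three "trees" below are just hard-coded prediction lists for illustration:

```python
# Each inner list holds one model's churn-probability predictions
# for the same three customers (made-up values).
tree_predictions = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.5],
    [0.8, 0.3, 0.7],
]

# The ensemble prediction is the per-customer average across all trees
ensemble = [sum(preds) / len(preds) for preds in zip(*tree_predictions)]
```

Because each tree makes different errors, the averaged predictions are typically closer to the truth than any single tree's.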
In the following diagram, you can see how multiple trees can create an ensemble:
If you are working with structured data, be sure to try decision tree ensembles before
jumping into other areas of machine learning, including deep learning. Nine times out of
ten, the results will satisfy both you and your customer. Media often overlooks the value of
this algorithm. Praise for ensemble models is rare, yet they are arguably the most commonly used algorithm family for solving practical machine learning problems.
Be sure to give tree ensembles a chance.
Clustering models
Another useful application of machine learning is clustering. In contrast to other machine
learning problems we have studied in this section, clustering is an unsupervised learning
problem. This means that clustering algorithms can work with unlabeled data. To illustrate,
let's look at a task that's central to marketing departments—customer segmentation.
Coming up with marketing offers for each individual client may be impossible. For
example, if you own a large retail store network, you want to apply different discounts at
stores depending on customer interests to boost sales. To do this, marketing departments
create customer segments and tailor marketing campaigns to each specific segment.
In the following diagram, you can see six customers assigned to two different segments:
We may automate customer segmentation by taking all purchase histories for all customers
and applying a clustering algorithm to group similar customers together. The algorithm
will assign each customer to a single segment, allowing you to further analyze those
segments. While exploring data inside each segment, you may find interesting patterns that
will give insights for new marketing offers targeted at this specific segment of customers.
We can apply clustering algorithms to data in an ad hoc manner because they don't require
prior labeling. However, the situation can get complicated, as many of the algorithms suffer
from the curse of dimensionality and can't work with many columns in the data.
The most popular clustering algorithm is K-means. In its simplest form, the algorithm has
only one parameter: the number of clusters to find in the data. K-means approaches
clustering from a geometrical standpoint. Imagine each data row as a point in space. For
us, this idea is easy to picture for datasets with two or three columns, but it works well beyond three dimensions. Having laid out our dataset in a geometric space, we can now see that some
points will be closer to each other. K-means finds center points around which other points
cluster.
The algorithm repeats the following steps until the cluster centers stop moving:

1. It takes the current cluster centers (for the first iteration, it takes random points).
2. It goes through all data rows and assigns them to the closest cluster's center
point.
3. It updates cluster centers by averaging the locations of all points from step 2.
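The three steps can be sketched for a single iteration on made-up one-dimensional data:

```python
# Toy 1-D dataset and two initial cluster centers (illustrative values).
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers = [0.0, 5.0]

# Step 2: assign every point to the nearest cluster center
clusters = [[], []]
for p in points:
    nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
    clusters[nearest].append(p)

# Step 3: move each center to the average of its assigned points
centers = [sum(c) / len(c) for c in clusters]
```

After this single iteration the centers move to 1.5 and 10.0, right in the middle of the two obvious groups; repeating the loop changes nothing more, so the algorithm has converged.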
At this point, we conclude the introduction to machine learning. While there are many
more machine learning algorithms to study, describing them is beyond the scope of this
book. I am sure you will find that knowledge about regression, decision trees, ensemble
models, and clustering covers a surprisingly large portion of practical applications and will
serve you well. Now we are ready to move on to deep learning.
Deep learning
To demystify neural networks, we will start with the basic building blocks: artificial
neurons. An artificial neuron is nothing more than two mathematical functions. The first
takes a bunch of numbers as input and combines them by using its internal state—weights.
The second, an activation function, takes the output of the first and applies special
transformations. The activation function tells us how active this neuron is to a particular
input combination. In the following diagram, you can see how an artificial neuron converts
the input to output:
In the following diagram, we can see the plot of the most popular activation function:
If the output is less than 0, the function will output 0. If it is greater, it will echo its input.
Simple, isn't it? Try to come up with the name for this function. I know, naming is hard.
Names should be simple, while conveying deep insights about the core concept of the thing
you are naming. Of course, mathematicians knew this and, as we have witnessed many
times before, came up with a perfect and crystal-clear name—rectified linear unit (ReLU).
An interesting fact is that ReLU does not conform to basic requirements for an activation
function, but still gives better results than other alternatives. Other activation functions may
be better in specific situations, but none of them beat ReLU in being a sensible default.
Another important activation function you need to know about is sigmoid. You can see it in
the following screenshot:
Before ReLU came to the throne, sigmoid was a popular choice as an activation function.
While its value as an activation function has faded, the sigmoid is still important for
another reason. It often comes up in binary classification problems. If you look at the plot
closely, you will find it can map any number to a range between 0 and 1. This property
makes sigmoid useful when modeling binary classification problems.
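Both activation functions are one-liners; this sketch simply restates their definitions in code:

```python
import math

def relu(x):
    # Outputs 0 for negative inputs and echoes positive inputs
    return max(0.0, x)

def sigmoid(x):
    # Maps any real number into the (0, 1) range
    return 1 / (1 + math.exp(-x))
```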
Now, we know the main components of an artificial neuron: weights and activation
function. To make a neuron work, we need to take its input and combine it with neuron
weights. To do this, we can recall linear regression. Linear regression models combine data
attributes by multiplying each attribute by a weight and then summing them up. Then, apply an activation function, and you will get an artificial neuron. If our data row had two columns named a and b, the neuron would have two weights, wa and wb. A formula for a neuron with ReLU activation is shown as follows:

output = max(0, wa × a + wb × b + w0)

Note that w0 is a special weight, called a bias, that is not tied to any input.
Or, to give you a more concrete example of an actual calculation, take a look at the
following:
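As a concrete sketch, with weights I have chosen purely for illustration, the calculation looks like this:

```python
# A single artificial neuron with ReLU activation.
# The weights w_a, w_b and bias w_0 are made-up illustrative values.
def neuron(a, b, w_a=0.5, w_b=-0.25, w_0=1.0):
    combined = w_a * a + w_b * b + w_0  # linear combination plus the bias
    return max(0.0, combined)           # ReLU activation
```

For inputs a = 2 and b = 3, the combination is 0.5 × 2 − 0.25 × 3 + 1 = 1.25, and ReLU passes it through unchanged; a strongly negative combination would be clipped to 0.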
We have decided to use a neural network to solve this churn prediction problem. Our
network will comprise multiple layers of neurons. Neurons in each layer will be connected
to neurons in the next layer:
That is a lot of arrows, isn't it? A connection between two neurons means that a neuron will
pass its output to the next neuron. If a neuron receives multiple inputs, they are all
summed up. This type of network is called a fully connected neural network (FCNN).
We see how we can make predictions using neural networks, but how can we learn what
predictions to make? If you look closer, a neural network is nothing more than a large
function with lots of weights. The model prediction is determined by using weights and
information incoming through the neuron inputs. Thus, to have an accurate neural
network, you must set the right weights. We already know that we can use mathematical
optimization and statistics to minimize prediction error by changing the parameters of a
function. A neural network is nothing more than a large and complex mathematical
function with variable weights. Therefore, we can use MLE and gradient descent to do the
optimization. I will give the formal names of each stage in bold, followed by intuitive
explanations of each stage:
Backpropagation is the most widely used learning algorithm for training neural networks.
It takes a prediction error and calculates how much we should change each weight in the
network to make predictions closer to the ground truth. The name backpropagation comes
from the specific way the algorithm updates the weights: it starts from the last layer,
propagating changes to every neuron until it reaches the network input. When inputs go
through the network to calculate an output prediction, we call it a forward pass. When we
change the weights by propagating the error, we call it a backward pass.
Nowadays, there are many different types of building blocks that you can use to compose a
neural network. Some specific neuron types work better with image data, while others can
utilize the sequential nature of text. Many specialized layers were invented to improve the
speed of training and fight overfitting. A specific composition of layers in a neural network
devoted to solving a specific task is called a neural network architecture. All neural
network architectures, no matter how complex and deep, still conform to basic laws of
backpropagation. Next, we will explore the domain-specific applications of deep learning.
Computer vision
At first, we transform an image into three tables of numbers, one each for the red, green, and blue channels of its pixels. If we try to use a plain FCNN as before, we will see
unimpressive results. Deep neural networks shine at computer vision tasks because of a
specific neuron type called a convolutional filter or a convolution. Convolutional neural
networks (CNNs) were invented by the French machine learning researcher Yann LeCun. In
CNNs, a single neuron can look at a small patch of an image, say, 16x16 pixels, instead of
taking the entire set of pixels as an input. This neuron can go through each 16x16 region of
an image, detecting some feature of an image it had learned through backpropagation.
Then, this neuron can pass information to further layers. In the following illustration, you
can see a single convolutional neuron going through small image patches and trying to
detect a fur-like pattern:
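The sliding-window idea can be sketched in one dimension. The signal and the two learned weights below are made up; real convolutions slide a small 2-D grid of weights over image pixels in exactly the same way:

```python
# A toy 1-D "image" and a tiny learned filter that fires on a "1, 1" pattern.
signal = [0, 0, 1, 1, 0, 0, 1, 1, 0]
kernel = [1, 1]

# Slide the same two weights over every position of the signal
responses = []
for i in range(len(signal) - len(kernel) + 1):
    patch = signal[i:i + len(kernel)]
    responses.append(sum(p * k for p, k in zip(patch, kernel)))
```

The response peaks wherever the pattern occurs, and the same two weights cover the whole signal, which is exactly why convolutions are so parameter-efficient.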
The remarkable achievement of CNNs is that a single neuron can reuse a small number of
weights and still cover the entire image. This feature makes CNNs a lot faster and lighter
than regular neural networks. The idea was put into practice only in the 2010s, when a CNN surpassed all other computer vision methods in the ImageNet competition, where an algorithm had to learn to classify photos into 1,000 possible categories. The
development of CNNs took so long because we lacked the computational capabilities to
train deep neural networks with a large number of parameters on big datasets. To achieve
good accuracy, CNNs require significant amounts of data. For example, the ImageNet
competition includes 1,200,000 training images.
In their first layers, CNNs tend to detect simple patterns, such as edges and contours in an image. As the layer depth increases, convolutional filters become more complex, detecting
features such as eyes, noses, and so on.
Many neurons learn to recognize simple patterns that are useful for any computer vision
task. This observation leads us to a very important idea: a neural network that was trained
to perform one task well can be retrained to perform another task. Moreover, you will need
much less training data for the second task, as the network has already learned many useful
features from the previous training dataset. In particular, if you want to train a CNN
classifier for two classes from scratch, you will need to label tens of thousands of images to
reach a good performance level. However, if you use a network that was pretrained on
ImageNet, you will probably get good results with only 1,000 to 2,000 images. This
approach is called transfer learning. Transfer learning is not limited to computer vision
tasks. In recent years, researchers made significant progress in using it for other domains:
natural language processing, reinforcement learning, and sound processing.
Now that you have an understanding of how deep CNNs work, we will proceed to the
language domain, where deep learning has changed everything.
Natural language processing

As with computer vision, deep learning took the NLP world by storm. Deep-learning-based
NLP algorithms successfully perform near-human-level translation between different
languages, can measure the emotional sentiment of a text, can learn to retrieve information
from a text, and can generate answers to free-form questions. Another great benefit of deep
learning is a unified approach. A single part-of-speech tagging model architecture will
work for French, English, Russian, German, and other languages. You will need training
data for all those languages, but the underlying model will be the same. With deep
learning, we need not try to hardcode the rules of our ambiguous language. While many
tasks, such as long-form writing and human-level dialogue, remain unconquered by deep
learning, NLP algorithms are a great help in business and daily life.
For NLP deep learning, everything began with an idea: the meaning of a word is defined by
its neighbors. That is, to learn a language and the meaning of words, all you need is to
understand the context for each word in the text. This idea may seem to be too simple to be
true. To check its validity, we can create a neural network that will predict a word by
receiving surrounding words as an input. To create a training dataset, we may use any text
in any language.
If we take a context window of two words, then we can generate the following training
samples for this sentence:
And so on…
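The sample generation can be sketched like this; the sentence is my own example:

```python
# Generate (context, target) training samples with a window of two words.
sentence = "we can generate the following training samples for this sentence".split()
window = 2

samples = []
for i, target in enumerate(sentence):
    # Up to two words before and two words after the target form its context
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    samples.append((context, target))
```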
Next, we need to come up with a way to convert all words to numbers, because neural
networks only understand numbers. One approach would be to take all unique words in a
text and assign each of them a number:
Following → 0
Training → 1
Samples → 2
For → 3
This → 4
Then, we represent each word by a set of weights inside a neural network. In particular, we
start with two random numbers between 0 and 1 for each word.
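Both steps can be sketched together: assign each unique word an identifier and initialize a random two-element vector for it:

```python
import random

random.seed(0)  # fixed seed so the random initialization is repeatable
words = ["following", "training", "samples", "for", "this"]

# Identifier for each unique word: following -> 0, training -> 1, ...
word_to_id = {word: i for i, word in enumerate(words)}

# Two random numbers between 0 and 1 for each word: the initial word vector
embeddings = {word: [random.random(), random.random()] for word in words}
```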
Now we have a way to convert every word in the text into a pair of numbers. We will take
all numbers that we have generated as weights for our neural network. It will get four
words as input, convert them into eight numbers, and use them to predict the identifier for
the word in the middle.
For example, for the training sample Following, Training, For, This → Samples:
Input: 0, 1, 3, 4
Output:
2 → Samples
We call each pair of numbers associated with a word a word vector. Our neural network will output a vector of probabilities between zero and one. The length of this vector will match the total number of unique words in our dataset. The position with the largest probability then represents the word that is the most likely completion of our input according to the model.
In this setup, we can apply the backpropagation algorithm to adjust word vectors until the
model matches the right words to their contexts. In our example, you can imagine that each word
lives on a coordinate grid. The elements of the word vector could represent X and Y
coordinates. If you think about words in this geometric fashion, you may conclude that you
can add or subtract word vectors to get another word vector. In the real world, such vectors
contain not two but 100 to 300 elements, but the intuition remains the same. After many
training iterations, you will see remarkable results. For example, if you take the vector for King, subtract the vector for Man, and add the vector for Woman, you will get a vector close to the vector for the word Queen. By learning to put words into their surrounding contexts, the model learns how different words relate to each other.
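The classic King − Man + Woman ≈ Queen relationship can be illustrated with toy two-element vectors; all the numbers below are invented so that the arithmetic works out:

```python
# Toy 2-D word vectors (made-up values chosen for illustration).
vectors = {
    "king": [0.9, 0.8],
    "man": [0.5, 0.2],
    "woman": [0.5, 0.9],
    "queen": [0.9, 1.5],
}

# Element-wise vector arithmetic: king - man + woman
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
# result lands (almost) exactly on the vector for "queen"
```

Real word embeddings have 100 to 300 elements rather than two, but the arithmetic is the same.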
The model we have built is called Word2Vec. We can train Word2Vec models in two ways: by predicting a word from its surrounding context (the continuous bag-of-words approach), or by predicting the surrounding context from a single word (the skip-gram approach). The two approaches do not differ in anything except the model's input and output specifications.
Word vectors are also referred to as word embeddings. Embeddings contain much more
information about words than simple numeric identifiers, and NLP models can use them to
achieve better accuracy. For example, you can train a sentiment classification model by
following these steps:
1. Create a training dataset that contains user reviews and their sentiment, labeled 0
as negative and 1 as positive.
2. Embed the user reviews into sets of word vectors.
3. Train a deep learning classifier using this dataset.
To train a classifier that uses word embeddings, you can use a CNN. In a CNN, each
neuron progressively scans the input text in windows of words. Convolutional neurons
learn weights that combine word vectors of nearby words into more compact
representations that are used by the output layer to estimate sentence sentiment.
You can see how a single convolutional neuron works on a single sentence in the following
screenshot:
A recurrent neural network (RNN) takes a different approach: each neuron has a
connection to itself. This self-connection allows a neuron to cycle through its input, pulling
its internal state through each iteration:
The preceding screenshot depicts a single recurrent neuron as it unfolds. With each new
word, a recurrent neuron changes its previous state. When the final word is processed, it
returns its internal state as the output. This is the most basic recurrent architecture. Neural
networks that are used in practice have a more complex internal structure, but the idea of
recurrent connections holds. When speaking about recurrent networks, you will probably
hear about long short-term memory networks (LSTMs). While they differ in the details,
the flow of thought is the same for both RNNs and LSTMs.
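The unfolding of a single recurrent neuron can be sketched as follows; the weights and inputs are illustrative numbers, not trained values:

```python
import numpy as np

# Illustrative weights for a single recurrent neuron (not trained values).
input_weight = 0.6
recurrent_weight = 0.4   # the self-connection
state = 0.0              # initial internal state

# Each number stands in for one word of the input sequence.
sequence = [0.5, -0.2, 0.9]

# With each new word, the neuron combines the input with its previous
# state; the state after the final word is the neuron's output.
for x in sequence:
    state = np.tanh(input_weight * x + recurrent_weight * state)

output = state
```

The same two weights are reused at every step; only the internal state changes as the sequence unfolds.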
Summary
In this chapter, we uncovered the inner workings of machine learning and deep learning.
We learned the main concepts of mathematical optimization and statistics. We connected
them to machine learning and, finally, learned how machines learn and how we can use
optimization algorithms to define learning. Lastly, we covered popular machine learning
and deep learning algorithms, including linear regression, tree ensembles, CNNs, word
embeddings, and recurrent neural networks. This chapter concludes our introduction to
data science.
In the next chapter, we will learn how to build and sustain a data science team capable of
delivering complex cross-functional projects.
2
Section 2: Building and
Sustaining a Team
Data science is an innovation for most organizations. However, every innovation requires
deep and careful thinking – not all ideas are equally good, and not all of them have the
necessary resources for implementation. This chapter helps you to identify the best ideas
and strip them down to a minimum viable product.
Another important consideration is how successfully you can sell your idea to all
stakeholders who might benefit from it.
Merging a knowledge of modern data analysis algorithms and business domain expertise is
a necessary step for every project. This chapter outlines the importance of using a scientific
approach in business. It helps you to answer the following questions: how can you find an
efficient data science application for your business? What are business metrics and
technical metrics and how should we define them? How can we introduce project goals that
align well with our business?
When you work with a team, everyone can specialize in performing several closely-related
types of work. To explore different specialties, let's look at an example of how houses are
built. Building a roof requires one set of skills, while setting up electricity is completely
different. As a manager, you need to have a basic understanding of all specialties, so as to
understand all components necessary for completing a task. In the What is Data Science?
section, you were introduced to the central concepts of data science. We will now use this
knowledge to derive roles and specialties in a data science team. In this chapter, we will
define, explore, and understand different team roles as well as each role's key skills and
responsibilities. We will also look at two case studies.
An average data science team will include a business analyst, a system analyst, a data
scientist, a data engineer, and a data science team manager. More complex projects may
also benefit from the participation of a software architect and backend/frontend
development teams.
Project stakeholders: Represent people who are interested in the project; in other
words, your customers. They generate and prioritize high-level requirements and
goals for the project.
Project users: People who will use the solution you are building. They should be
involved in the requirements-specification process to present a practical view on
the system's usability.
Business analysts: The main business expert of the team. They help to shape
business requirements and help data scientists to understand the details about
the problem domain. They define business requirements in the form of
a business requirements document (BRD), or stories, and may act as a product
owner in agile teams.
System analysts: They define, shape, and maintain software and integration
requirements. They create a software requirements document (SRD). In simple
projects or Proof of Concepts (PoC), this role can be handled by other team
members.
Data analysts: Analysis in data science projects often requires building complex
database queries and visualizing data. Data analysts can support other team
members by creating data marts and interactive dashboards, and by deriving insights
from data.
Data scientists: They create models, perform statistical analysis, and handle
other tasks related to data science. For most projects, it will be sufficient to select
and apply existing algorithms. An expert who specializes in applying existing
algorithms to solve practical problems is called a machine or deep learning
engineer. However, some projects may ask for research and the creation of new
state-of-the-art models. For these tasks, a machine or deep learning researcher
will be a better fit. For those readers with computer science backgrounds, we can
loosely describe the difference between a machine learning engineer and research
scientist as the difference between a software engineer and a computer scientist.
An Ideal Data Science Team Chapter 4
Data engineers: They handle all data preparation and data processing. In simple
projects, data scientists with data engineering skills can handle this role.
However, do not underestimate the importance of data engineers in projects with
serious data processing requirements. Big data technology stacks are very
complicated to set up and work with on a large scale, and there is no one better
to handle this task than data engineers.
Data science team manager: They coordinate all tasks for the data team, plan
activities, and control deadlines.
Software teams should handle all additional requirements for building mobile,
web, and desktop applications. Depending on their scale, the software
development can be handled by a single developer, a single team, or even several
teams.
In large projects that comprise multiple systems, you may need the help of a
software architect.
Let's now see the general flow of how roles can work together to build the final solution:
In the next section, we will look at how to assemble teams and assign project roles to
experts depending on project complexity.
Before starting a project, you need an idea that will allow your client to achieve their goals
and simplify their life. In business, you will look to improve key business processes within
a company. Sometimes, the idea is already worked out, and you may start directly from
implementation, but more often, your team will be the driver of the process. So, our ideal
expert must be able to come up with an idea of a data science project that will provide
value for the client.
Next, we will study two project examples to look at how simple projects can be handled
with small teams or even one cross-functional expert, while larger projects need more
diverse teams, where each team member handles one or two specific roles.
One of the main activities of this department is to detect and prevent credit card fraud.
They do this by using a rule-based, fraud detection system. This system looks over all credit
card transactions happening in the bank and checks whether any series of transactions
should be considered fraudulent. Each check is hardcoded and predetermined. They have
heard that ML brings benefits over traditional, rule-based, fraud detection systems. So, they
have asked Mary to implement a fraud detection model as a plugin for their existing
system. Mary has inquired about the datasets and operators of the current fraud detection
system, and the department confirmed that they will provide all necessary data from the
system itself. The only thing they need is a working model, and a simple software
integration. The staff were already familiar with common classification metrics, so they
were advised to use F1-score with k-fold cross-validation.
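As a quick reminder of the metric the staff suggested, the F1-score is the harmonic mean of precision and recall. A minimal sketch with made-up labels follows; in practice, Mary would compute it per cross-validation fold with a library such as scikit-learn:

```python
def f1_score(y_true, y_pred):
    # True positives, false positives, and false negatives.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Made-up fraud labels: 1 means fraudulent, 0 means legitimate.
actual    = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 1]
score = f1_score(actual, predicted)
```

With these labels, precision and recall are both 0.75, so the F1-score is 0.75 as well.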
In this project setup, project stakeholders have already finished the business analysis stage.
They have an idea and a success criterion. Mary has clean and easy access to the data
source from a single system, and stakeholders can define a task in terms of a classification
problem. They have also defined a clear way to test the results. The software integration
requirements are also simple. Thus, the role flow of the project is simplified to just a few
steps, which are all performed by a single data scientist role:
As a result, Mary has laid out the following steps to complete the tasks:
This project looks simple and well-defined. Based on this description, we can believe that
Mary can handle it by herself, with little supervision.
The retail company has over 10,000 stores across the country, and ensuring the proper level
of service quality for all stores is becoming difficult. Each store has a fixed number of staff
members working full-time. However, each store has a different number of visitors. This
number depends on the geographical location of the store, holidays, and possibly many
more unknown factors. As a result, some stores are overcrowded and some experience a
low number of visits. Jonathan's idea is to change the staff employment strategy so that
customer satisfaction will rise, and the employee burden will even out.
Instead of hiring a fixed store team, he suggests making it elastic. He wants to create a
special mobile application that will allow stores to adjust their staff list with great speed. In
this application, a store manager can create a task for a worker. The task can be as short as
one hour or as long as one year. A pool of workers will see all vacant slots in nearby stores.
If a worker with the necessary skills sees a task that interests them, they can accept it and go
to the store. An algorithm will prepare task suggestions for the store manager, who will
issue work items to the pool. This algorithm will use multiple machine learning models to
recommend tasks. One model would forecast the expected customer demand for each store.
The second model, a computer vision algorithm, will measure the length of the lines in the
store. This way, each store will use the right amount of workforce, changing it based on
demand levels. With the new app, store employees will be able to plan vacations and ask
for a temporary replacement. Calculations show that this model will keep 50,000 workers
with an average load of 40 hours per week for each worker, and a pool of 10,000 part-time
workers with an average load of 15 hours per week. Management considered this model to
be more economically viable and agreed to perform a test of the system in one store. If the
test is successful, they will continue expanding the new policy.
The next thing they ask Jonathan is to come up with an implementation plan. He now
needs to decompose the project into a series of tasks that are necessary to deploy the system
in one store. Jonathan prepared the following decomposition:
1. Collect and document initial requirements for the mobile app, forecasting, and
computer vision models.
2. Collect and document non-functional requirements and Service Level
Agreements for the system.
3. Decide on necessary hardware resources for development, testing, and
production.
4. Find data sources with training data.
5. Create data adapters for exporting data from source systems.
6. Create a software system architecture. Choose the technology stack.
7. Create development, test, and production environments.
8. Develop a forecasting model (full ML project life cycle).
9. Develop a line-length recognition model (full ML project life cycle).
Each point in this plan could be broken down further into 10-20 additional tasks. A full task
decomposition for this project could easily include 200 points, but we will stop at this level,
as it is sufficient for discussion purposes.
In reality, plans constantly change, so Jonathan has also decided to use a software
development project management framework, such as SCRUM, to manage deadlines,
requirement changes, and stakeholder expectations.
Let's now see which areas of expertise Jonathan needs to be proficient in to complete this
project. He needs to be adept in the following areas:
Retail business
Software project management
Requirements-collection
Software and hardware architecture
Data engineering
Data science
Machine learning
Deep learning and computer vision
Mobile application development
Backend software development
Software integration
Software testing
Development operations
Can Jonathan possess all those skills on a level necessary to deliver a working software
system? We can be sure that a few such people exist, but it is far more likely that you will
meet an expert who knows several closely related domains. In particular, a data engineer
is likely to be good with databases, database administration, and software integration.
Many data scientists are also good with backend software development. Some are
proficient in creating data visualization dashboards. In terms of the role flow of the project,
the full diagram can be used for this project.
In an average data science project, expect to fill in the following team roles:
Data scientist
Machine or deep learning engineer
Machine or deep learning researcher
Data engineer
Data analyst
Systems analyst
Business analyst
Backend software developer
Frontend software developer
Technical team leader
Project manager
Depending on the project's complexity, you can merge some of these roles. The bigger your
project scale, the more diverse and large your team will be. Next, we will look at the key
skills and responsibilities of data science team roles.
An ideal data scientist is often described as a mix of the following three things: domain
expertise, software engineering, and data science skills.
It is important to understand why each area of expertise is present in this list. We will start
with the domain expertise. You are very unlikely to find a world-class business expert who
can also train state-of-the-art deep learning models. The good thing is that you don't need
to search for unicorns. A data scientist needs domain expertise mainly to understand data.
It is also handy when working with requirements and stakeholder expectations. A basic to
medium understanding of insurance business can help data scientists throughout the
process of building a model.
For example, let's look at how business expertise can help to build models for an insurance
company:
Speaking with insurance experts in the same language can help to discover their
pain and desires, so the data scientist can adjust their goals to achieve better
results.
Insurance expertise will help in exploring and understanding the raw data from
the company's databases.
Insurance expertise can help in data preprocessing. It aids data scientists in
assembling datasets for machine learning models.
Thus, domain expertise is not the prime skill for a data scientist, but a basic to medium
level of understanding can improve results by a large margin.
The next component, software engineering expertise, is often overlooked. Algorithms are
often tied to the most critical processes of an organization, hence, the availability and
stability requirements for these algorithms are high. Data scientists never create an abstract
model. Even in research organizations, data scientists write code. And the closer you are to
real-world applications, the more good code matters. Good programming skills allow data
scientists to write well-architected, well-documented, and reusable code other team
members can understand. Software engineering expertise enables them to build systems
that do not fail under high loads and can scale from simple proof-of-concept tests to
organization-wide deployments.
And last, but not least, are the data science skills. They are key tools without which no one
can build a working model. However, data science is a very wide field. Be sure to
understand what algorithms you will need to use, and understand what kind of expert you
need. A computer vision expert won't be the best fit for a movie recommender project. If
you hire a data scientist, it does not automatically mean that they are proficient in all areas
of machine learning and mathematical statistics.
Data processing at scale demands a distinct skill set; hence, a separate role of a data
engineer emerges. The key areas of knowledge for data
engineers are as follows:
Software engineering
Big data engineering: This includes distributed data processing frameworks,
data streaming technologies, and various orchestration frameworks. Data
engineers also need to be proficient with the main software architecture patterns
related to data processing.
Database management and data warehousing: Relational, NoSQL, and in-
memory databases.
Software engineering skills are very important for data engineers. Data transformation code
frequently suffers from bad design choices. Following the best practices of software design
will ensure that all data processing jobs will be modular, reusable, and easily readable.
Working with big data involves creating highly parallel code, so a knowledge of distributed
processing systems is also essential. Major software platforms, such as the Hadoop
ecosystem, Apache Spark, and Apache Kafka, all require a good understanding of their
internals to write performant and stable code.
Summary
In this chapter, we first covered what a data scientist is. Then, we looked at two examples
that showed us whether a data scientist can work in isolation or needs a team. Next, we
looked at the various skills and qualities that a data scientist, a data engineer, and a data
science manager need to have. We also briefly explored when we need to ask for help from
the development team.
Finally, we defined the key domains of a data science team, which are analysis, data, and
software. In those domains, we defined project roles that will allow you to create a balanced
and powerful team. We may solve simple projects with small teams where team members
share responsibilities for different roles. But when the project's complexity grows, your
organizational structure will need to scale with it. We also noted the importance of
following the best practices of software engineering and sourcing help from software
development teams.
Knowing that defining responsibilities and expectations is critical for delivering complex
data science projects, we will use those ideas in the next chapter, where we will explore
how to create effective hiring procedures for data science teams.
5
Conducting Data Science
Interviews
In this chapter, you will learn how to conduct an efficient data science interview. We will
explore how a hiring process can be tied to real working experience so that the candidates
you seek will clearly understand what kind of job they will need to do. The advice in this
chapter will help you to get rid of unnecessary interviewing rituals and make the hiring
process more efficient.
How long has it been since you went through a technical interview? Was it a pleasant
experience? And if you got through it, can you judge the relevance of your interview
questions to your day-to-day work? I bet that the experience was awful. The common
complaints are that all interviews are stressful, and the questions are not relevant. Technical
interviews are deeply flawed. We imagine that we are using a rational way of thought to
pass candidates through a sieve, selecting only the stellar experts who will do the job. In
fact, interviews are biased. There is almost no correlation between success in an interview
and success at the actual job. I would not be surprised if random candidate selection
showed better results than the hiring processes of many companies.
I ask you to be open-minded, critical, and honest with yourself while reading this chapter.
This text will not give you secret knowledge that will improve your interviews. It will give
you tools. If you invest in them, you'll get back more straightforward, simple, and
enjoyable interviews that will find the right people for the right job.
The primary focus for a candidate will be on applying different techniques (data
mining/statistical analysis/building prediction systems/recommendation systems) using
large company datasets, applying machine learning models, and testing the effectiveness of
different actions. The candidate must have strong technical expertise and be able to use a wide set of
actions. The candidate must have strong technical expertise and be able to use a wide set of
tools for data mining/data analysis methods. The candidate must be able to build and
implement mathematical models, algorithms, and simulations. We seek candidates with the
following skillset:
Work with business cases: seek opportunities and leverage company data to
create business solutions.
Mine and analyze data from company databases to drive optimization and
improvement of product development and sales techniques.
Conducting Data Science Interviews Chapter 5
Assess the effectiveness and accuracy of new data sources and data gathering.
Extending the company's data with third-party sources of information when
needed.
Use predictive modeling to increase revenue generation, ad targeting, and other
business outcomes.
Analyze business cases and identify data sources (internal/external) and data
mining/analysis methods to use.
Develop a normalization engine to execute cleansing and deduplication of raw
data through an extract, transform, load (ETL) process for data sources.
Create, train, and test predictive models to solve defined business cases.
Develop algorithms to apply to data sets.
Design data structure models for collected data.
Facilitate the build of a solution from proof of concept (POC) to production.
Work with business owners to gather additional information about business
cases.
Work with data that is generated by the core business.
Be ready to work in an agile style (daily stand-ups, sprint planning, sprint
reviews, and retrospectives).
Work in an environment that adapts quickly to creative change using agile
principles.
Actively work with different development groups inside the organization.
Be ready to adapt to a new tool/library/technology/platform.
Excellent understanding of machine learning techniques and algorithms, such
as k-nearest neighbors (kNNs), Naive Bayes, support vector machines (SVM),
decision trees, clustering, and artificial neural networks.
Strong understanding of mathematical statistics (such as distributions, statistical testing,
regression, and so on).
Experience creating and using advanced machine learning algorithms and
statistical methods such as regression, simulation, scenario analysis, modeling,
clustering, decision trees, neural networks, and so on.
Proficiency in using query languages such as SQL.
Experience working with and creating data architectures, data models, data
warehouses/data lakes.
Ability to work with minimal supervision.
Strong data analytical skills.
If you think about it, this is an ideal candidate. A one-man army, a unicorn. Do such people
exist? Yes, but they are extremely rare. Does every company need people like this to
complete their projects and achieve their goals? I will answer with a bold no. Instead of
seeking the best fit for their goals, they seek the best fit for all possible goals. Being general
and all-encompassing, this job search will be long, tiresome, and stressful. After reading
this long list of requirements, the candidate won't be able to figure out what their job will
be like; every item is vague and hard to understand. A good job description should give a
clear listing of the following aspects:
When the job description is vague and convoluted, testing all required skills will surely
require several interview sessions, which will make the process even more tiring. Adequate
assessment of multiple candidates based on this list is close to impossible. After a long
search, the company will just hire someone randomly. After this, the troubles won't end.
The greatest problem of this job description is a lack of direction. The employer is not sure
about the candidate's functions and responsibilities. This problem is deeper than it seems to
be; the job description is actually only a symptom, not the illness itself. For the new
employee, the obscurity and vagueness will not end on the first day on the new job. If the
employer did not fully figure out what the new employee should do, the workflow itself
will become an undetermined mess. Most of the interviews and hiring processes are a
rationalization of a suboptimal, stressful, and biased candidate selection process. The good
thing is that we can fix these problems. The only things you will need are honesty, a clear
definition of your goals, and a bit of preparation.
From the employer's side, the sole purpose of the interview is to fill the position with a
capable candidate. From the side of the employee, the purpose of the interview is to find a
good team, an interesting project, a reliable company to work with, and satisfactory
compensation. We often forget that the interview is a dialogue, not a mere test of skill.
Unfortunately, there is no clear and ubiquitous definition of the main interview goal. That
is because each goal is unique for each position. You should define this goal based on the
detailed understanding of the job you are interviewing for. If you are looking for a data
scientist, define rigorously what you expect them to do. What tasks will the new team
member solve? Think deeply about what the first working day in this position will look
like. What are the core responsibilities? What skills are useful but not mandatory? Based on
that, you can start drafting a set of desired skills for the candidates. Be specific. If you
demand knowledge of SQL, then be sure to have a definite reason for this requirement.
If your interview goal is to find an ideal, top 0.1 percentile, world-class expert, then think
again. The primary goal is to find an able person who can do the job. It might be that this
person must be the best expert in the world. If so, this requirement should be explained and
justified. To create this understanding, you need to describe what the job will be like. After
you decide that the description is finished, rewrite it with the candidate in mind. It should
be short, easy to read, and give a complete idea of what the job will be like and what the
expectations are. The ambiguity of job descriptions often comes from a lack of this
understanding. We just need someone, their job description says. Yours should say We know
exactly what we need, and you can rely on us. If you can't, do you really need to post a new
job?
A clear, honest job description gives you the following benefits:
You will understand what kind of expert you should search for.
Candidates looking at your job description will know exactly what you expect of
them. It will be easier to judge whether their experience is relevant or not.
Clear goal definitions will allow you to design purposeful and insightful
interview questions.
If your requirements list looks too long, try to simplify it. Maybe you were overzealous and
some skills you have listed are not vital. Or perhaps you could consider hiring two people
with different backgrounds? If you are posting a job description with a wide cross-
functional set of requirements, be sure to have a reason behind it.
If your job description is complex, then you will find the following is true:
The candidate search will take longer since you are demanding more skills and
experience.
The candidates will expect a higher salary.
The interview process will be long and could extend to several sessions.
You will have fewer options to choose from.
All those restrictions should have good supporting arguments. If you have none, consider
simplifying.
Having a good goal is not all you need; the execution matters too. We will now look at the
interview process from the candidate's standpoint to discuss the topics of bias and
empathy.
If you have successfully defined honest requirements for the candidate, the next step is to
think about the interview experience as a whole. To see a new teammate in the best light,
the interview should look like a real working process, not like a graduate examination.
Compare this with a typical hiring pipeline:
You go through a resume filter. The best people have PhDs and at least 5 years of
(questionable) experience in major tech companies.
You finish some kind of online test for your technical skills.
You submit your GitHub repositories and portfolio projects for review.
Then, you go through an online screening interview. You answer a series of
technical questions related to a multitude of technologies and scientific areas that
might be useful in your future work.
Next, you find yourself in a room with a whiteboard, where you answer a set of
complicated and very specific problems on software development, software
architecture, machine learning, combinatorial optimization, category theory, and
other important topics that are used to select the best from the best.
Multiple people, each an expert in their field, will test you for different skills. It is
likely that the interviewers will test you with tricky problems from their
experience, each of which took them days to tackle.
Finally, a company executive will see if you are a good cultural fit and go along
with the company values and goals.
The only things left to add are a handful of DNA tests, IQ score filtering, a lie detector test,
and two checks for artistic and chess-playing abilities.
If you have clearly defined the position requirements, you will already see that most of
those steps are unnecessary. More so, they can bias your interview process toward people
who will be a bad fit for your job.
This process is nothing like your real working experience.
If you could design an interview that is as close to real working experience as possible, all
futile parts of the interview can go away, including prior work experience and education
filters.
The interview process must respect the candidate's emotional condition and treat them as a
human. It should be humanistic. The moment you start doing this, the potential candidate
lists will grow and the interviews will be shorter and more effective.
At last, remember to be human. Be nice to people you are talking with. Create a pleasant
experience and you will get much better results. This can be hard if you are interviewing
people for the whole day, so it is better to limit yourself to one or two interviews per day at
the most.
You may think such interviews will take an unreasonable amount of your time and looking
at every candidate is impossible. In fact, it is the opposite. The key to designing humanistic
interviews is to shift the focus from your organization to the candidate, and we will explore
ways to do this in the next section.
A frequent source of questions is the interviewer's own reading. You may
have read many intricate and interesting books and blog posts. If the knowledge you
gained from those materials, no matter how intellectually valuable, is unrelated to the
candidate's job, please keep it out of the interview. Imagine a machine learning engineer
interview where the candidate gets a question about the mathematical details of the
reparameterization trick in variational autoencoders when their job will be about delivering
pretrained classifiers to production as scalable services.
The previous case might seem unrealistic, but it happened to a software developer I
know. He was interviewing for a distributed systems developer position.
After some initial technical questions, the interviewer started asking about the subtleties of
native code integration in Java applications:
Developer: But I know little about native code integration. And, as you said, this is not relevant.
Could we move on?
Interviewer asks the question.
Make sure that every question you ask is:
Related to the problems that the candidate will solve in their job
As hard or as simple as the candidate's daily working experience
The biggest limitation of any interview question is its time span. We expect a typical
interview problem to be solved in less than 10 minutes, but the real working environment
has no such time constraint. To narrow the gap, you can create a test assignment.
Ideally, this assignment should be implementable in 2 to 4 hours.
1. Take a core task that is performed by similar roles (or directly by teammates with
the same title) in your company.
2. Then, describe this task in one paragraph. You want your task description to be
SMART (which stands for specific, measurable, achievable, relevant, and time-
bound).
3. Try to estimate how long it will take to implement. Since you are basing it on a
real working experience, it could take days or weeks to finish the task. Only very
dedicated candidates will be ready to take on an assignment with such a time
span. You can shorten its length by decomposing it into several tasks and taking
only the most important ones as a part of the assignment. You can transform all
additional tasks into questions that you will ask while examining the test
assignment results.
4. If you have the necessary skills, complete the assignment yourself, or ask for the
help of your teammates. Note that if you cannot do the assignment yourself, you
will need the help of your teammates during the interview.
Once you have drafted the assignment, check it against the following questions:
Is it relevant?
Can you finish it within the expected time frame?
Does it test all the necessary skills for the position you are interviewing for?
Can you create additional questions about the assignment to structure the
interview around it?
It may take you a few iterations before you create a good test assignment. When you are
done, you will need to decide where the candidates will implement the assignment.
Depending on your requirements, the candidate may implement it remotely or you may
ask them to sit in your office. If the assignment takes a significant amount of time to
implement and you can't strip it down further, consider paying candidates for their time.
Remember: hiring candidates who are not fit for the job will cost you a lot more.
For example, we can create a test assignment for a junior data scientist. Imagine that our
company provides predictive analytics services for remote education platforms. Those
companies need to estimate each student's propensity to successfully finish a course.
Each client collects different data about their students. Although the model-building
process is fairly standardized, custom data preprocessing and feature engineering
are required to create classifiers of satisfactory quality for each client. Our company
promises that each client will get a model with an F1 score of at least 0.85.
Junior data scientists follow a standard procedure to create and deploy a classification
model. The process is well documented and explained. It comprises several stages, such as
the following:
1. Get a training dataset provided by a data engineering team from the company's
data portal web app.
2. Create a web-based data analysis and modeling environment using the
company's self-service portal.
3. Upload data into the environment and create a holdout test set with at least 10%
of the dataset.
4. Perform exploratory data analysis (EDA) and create a baseline classifier using
the gradient boosting algorithm. Use 10-fold cross-validation for measuring the
baseline F1 score.
5. Create new features from customer data and tune model parameters to improve
the F1 score.
6. If you can reach an F1 validation score of no less than 0.85, measure the test
score.
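The procedure above can be sketched with common open source tooling. This is only an illustration: scikit-learn stands in for whatever internal frameworks the company provides, and the dataset is synthetic, since the real customer data is not available here. (Recall that the F1 score is the harmonic mean of precision and recall.)

```python
# Hypothetical sketch of steps 3-6: holdout split, baseline gradient
# boosting classifier, 10-fold cross-validated F1, then a final test score.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in for the training dataset from the data portal (step 1).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Step 3: hold out at least 10% of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

# Step 4: baseline classifier with a 10-fold cross-validated F1 score.
model = GradientBoostingClassifier(random_state=42)
baseline_f1 = cross_val_score(
    model, X_train, y_train, cv=10, scoring="f1"
).mean()
print(f"Baseline 10-fold CV F1: {baseline_f1:.3f}")

# Step 6: only if the validation F1 reaches 0.85, fit on the full
# training set and measure the test score once.
if baseline_f1 >= 0.85:
    model.fit(X_train, y_train)
    test_f1 = f1_score(y_test, model.predict(X_test))
    print(f"Test F1: {test_f1:.3f}")
```

A candidate following the process would then iterate on step 5, engineering new features and tuning parameters until the cross-validated F1 reaches the 0.85 threshold.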
The most essential skills we want to test for this position are the following:
Ability to work with frameworks and tools our company uses for building
classifiers
Ability to build binary classifiers
Ability to perform EDA and feature engineering on datasets similar to those we
face in our daily work
Ability to follow the process and report results
Understanding of the reasoning behind the company's classifier building process
To create a test assignment for a junior data scientist, we may create a dataset similar to
ones from our customers. The candidate's model could achieve the necessary F1 score if the
candidate creates several useful features from the raw data. The ideal environment for the
candidate would be our company's office, but the remote option is also acceptable. We have
offered both options to each candidate.
Our junior data scientists solve similar tasks in 2 to 3 hours, spending most of the time on
step 5. We have decided to artificially simplify the task so that it is solvable in under 1.5
hours. Learning how to use all of the company's internal tooling, including the self-service
portal, generally takes up to 4 hours, so we have also decided to use widely known open
source tools for the test assignment. Most candidates will already be familiar with
these tools, and all of our internal systems are compatible with open source alternatives, so
this would be a viable simplification.
We give you a CSV file with customer data. We ask you to build a binary classifier for a
student's propensity to finish a course. The description of each column in the data file is
below: [… the description is missing for brevity]. You can use the following steps to build
your model:
1. Perform EDA and create a baseline classifier using the gradient boosting
algorithm. Use 10-fold cross-validation for measuring the baseline F1 score.
2. Create new features from customer data and tune model parameters to improve
the F1 score.
3. If you can reach an F1 validation score of no less than 0.85, measure the test score.
4. Report results to us.
The expected time frame for this task is 1.5 hours, but you may ask for extensions. Please
use the following tools to create the solution: […].
We have also created several additional questions to check the candidate's understanding
of our process.
Remember that a technical assessment is not the only thing you want to check in the
interview. Tell the candidate about your workflow and what your team does. Try to figure
out whether the things you do will interest them enough. Highly motivated candidates
who will flawlessly integrate into your workflow are often the best option even if their
technical skills are inferior to those of other candidates.
To conclude, I will leave you with a simple rule for creating interview questions: ask about
activities that are relevant to the daily work of the position you are hiring for. A question
should be highly abstract, theoretical, or complex only if the candidate will need to solve
such problems alone, on a daily basis (and if so, explain why). Make interviews realistic
and relevant. Ask for what you are looking for, no more and no less.
General guidance
No matter what position you are interviewing for, you should always check for the
following abilities of a candidate:
Information search or Google skills: This might seem funny, but the ability to
search for information is an extremely valuable skill for any technical expert. You
never know what kind of algorithms, frameworks, or bugs you will encounter
during a project. The ability to quickly find existing solutions and reuse the
experience of the global data science community can save you spectacular
amounts of time. If the candidate struggles to answer a technical question but can
look up the solution and understand it, consider marking this answer as at least
partially correct.
Communication skills: When emulating a real workflow, take notice of the
candidate's communication ability: how they understand the task definition,
ask for additional clarifications, and report status. It is a good idea to tell the
candidate beforehand that you will also be testing this set of skills and to
provide a short set of instructions. For example, you can allow and encourage
intermediate communication during a take-home assignment. Many take-homes
are finished in an isolated environment that is nowhere near how we really
collaborate on projects. Encouraging communication will give you a glimpse of
a real working experience with the candidate, so make it an integral part of the
test task.
Software engineering skills: Every data science project role except the project
manager needs coding skills. The better software the candidate can build, the less
work will be required to integrate their piece into the whole system.
If the candidate does not know something, it's OK. People get nervous, forget
details, or lack relevant experience. Look at the interview as a two-sided
process, where you are trying to work through the questions together with the
candidate. Real work does not look like an examination, does it? Plus, in most
cases, the person who is able to quickly work through the problem together with
the team even if they don't know the solution will be valuable.
Do not be a victim of the curse of knowledge. The curse of knowledge is a cognitive
bias in which a person with certain knowledge assumes that others have all the
necessary background to understand them well, or at least overestimates that
background. Do not expect the candidate to fully understand and easily solve a
problem that your team solved a few days ago. Deep knowledge of the specifics
of the datasets you work with and the subtleties of the problems your team
solves daily is simply inaccessible to the candidate. Be conscious of the curse of
knowledge and simplify your questions to match the candidate's background.
What kind of data scientist are you looking for? A research data scientist will be
able to implement state-of-the-art models and push the boundaries of the
discipline, but a machine learning engineer will be more proficient in quick
application of existing best practices while maintaining good software design.
Many companies need the second type of data scientist but are trying to hire the
first one. Do not make this mistake and be honest about your needs as a
company.
Ask about the overall workflow of building a model, and specifically about
delivering the model into production. Pay attention to the offline and online
testing strategies that the candidate chooses to evaluate the model. Many experts
have a good understanding of the inner workings of the algorithms but lack a
coherent end-to-end approach for testing and delivering a model.
Ask about common problems in the datasets you are working with. If your team
constantly faces the problem of imbalanced data, be sure to check whether the
candidate has an instinct for solving problems of this kind.
If you work in a specific business domain, do not forget to ask about domain
expertise.
Start from the specific technologies the data engineer will work with. There is a
multitude of different databases and data processing technologies, so you should
have a good understanding of what technology stack you are looking for because
nobody knows everything. A big data engineer will back you up on taming the
big data zoo of technologies and writing complex Apache Spark jobs. A backend
developer with extensive experience in relational databases and data
integration will help you to process datasets from traditional data warehouses
and will easily integrate your solution with other systems.
Summary
In this chapter, we have looked at the importance of setting goals before starting the
interview process. We have explained why asking complex theoretical questions without a
clear purpose is harmful not only to the candidate but also to the interviewer's company.
We have explored the flaws of modern interviews and gained tools for making them
purposeful and meaningful. We have seen what separates a good job description from a
bad one. Also, we have defined a framework for creating interview questions from real
tasks that you and your teammates solve on a daily basis.
In the next chapter, we will develop guidelines for building data science teams.
6
Building Your Data Science Team
In this chapter, we will learn about the three key aspects of building a successful team and
explain the role of a leader in a data science team. We will learn how to build and sustain a
balanced team that's capable of delivering end-to-end solutions. Then, we will set out a
plan for improvement and add several useful management skills to our arsenal:
leadership by example, active listening, empathy and emotional quotient, trust, and
delegation. We will also consider how advocating a growth mindset can be a major part of
the team development process.
But what makes a balanced team? Each team member is unique in their soft and hard skill
sets. Some people are universal jacks of all trades, while others are highly specialized in a
single area of expertise. Also, we all have vast emotional differences. In a balanced team,
everyone complements and augments each other. Each team member knows at what point
they can participate to yield the best results. Balanced teams are self-organizing. They can
efficiently plan and act without direct interventions from the team leader. The leader unites
the team and makes it coherent, but the actual work process of a balanced team looks
highly decentralized. The leader frequently takes part in teamwork and helps everyone
reach a common goal. The team members have good relationships and want to advance
themselves to new heights. Another feature of balanced teams is that they do not have a
single point of failure. If someone is ill or has decided to move on, it is not a big deal; the
core team keeps functioning.
Unbalanced teams can be hard to distinguish at first glance. An unbalanced team can
function properly and even be a top performer, but efficient unbalanced teams are
fragile. People do not stick around long in an unbalanced team because it is hard to do
so. Unbalanced teams take many forms, so they are hard to define. Here are two examples
of an unbalanced team:
In the first case, the team leader is the best expert in the team. They do most of
the important work, while other team members work on the simplest delegated
tasks. The team leader decomposes the task into many simple subtasks and
assigns them to their team. No one sees the whole picture but them. The
unwillingness to share important parts of the work and to involve the team in the
task's decomposition process creates a strong imbalance. The team is very fragile
and will not function properly without its leader. Employees in this team burn
out quickly because they do not see the result of their work. Most of the tasks are
boring and repetitive, so they seek challenges in another team, project, or
company. This is a case of expertise imbalance in a team.
In the second case, the team is building a very large project. The development
has been going on for 3 years. The team leader and several data scientists who
were working on the project from the start developed strong relationships and a
team culture. However, not all of the team members stayed with them for the
whole 3 years, so the team leader found several new experts eager to take on the
project. However, these new team members found it hard to assimilate into the
team. The core team is reluctant to communicate with the newcomers because of
the cultural gap; it is a lot easier to interact with familiar people than with new
colleagues. Thus, a cultural imbalance has emerged.
Imbalance in teams is very hard to fight once it has taken root. Each specific case
needs careful examination and a plan to bring the team to a more balanced state. The best
way to fight imbalance is to avoid it in the first place.
Just like when you're hiring staff to build a balanced team, you should start from your
goal. Without a goal, there is no reason for action, no balance, and no imbalance. For
example, your goal could be delivering solutions for problems using data science with a short
time to market. From this goal, the main characteristics of your team will arise. To
successfully apply data science, you will need experts in the field. To do this with a short
time to market, you will need good internal processes so that everyone works smoothly and
reliably. The team will need to improve over time, and it will need to grow and last; this is
the need for balance.
State your goals clearly and ensure all team members understand them so that everyone is
on the same page about the definition of success. It would be impossible to train machine
learning algorithms with no metrics and loss functions. Without feedback, humans stop
developing too. When there is no notion of good and bad, we stagnate, and goals help us
define these contrasts.
Another important component of each team is its roles. Roles define sets of activities that
team members can perform. You may think you do not need roles because your team is
agile and cross-functional, so everyone can and should be able to do everything. This may
be crucial for small teams to survive while being adroit at what they do. But being cross-
functional does not contradict having clearly defined roles. In fact, cross-functionality and
size are different dimensions of a team, with the latter affecting the former through the
number of roles each team member is performing.
For example, our data science team should handle everything from the definition of
business requirements to the delivery of the final solution. We have loosely defined the
project delivery process and assigned each stage to a role:
Suppose that this team is limited to only three members, including the team leader. We can
combine several roles into one to fulfill the team size constraint, while simultaneously
creating goals for individual team members. Recall that forming goals for each position is
the foundation of a good hiring process, and creating a team goal, defining roles, and
mapping them to positions is a prerequisite for that.
For our example, there are many options, but the most realistic would be the following:
Creating individual goals that are aligned with team goals can be done through the
following set of transformations:
The more roles your team has, and the smaller its size, the vaguer the boundaries between
the team members will be, and the harder it will be to find teammates to fill in those roles.
Defining roles helps you to be realistic regarding the amount of work and the budget it will
take you to create a good team, so never leave this step out. Mapping roles to positions is
equally helpful if your team is large and the responsibilities of every employee are clearly
isolated or if your team is small and responsibilities are less segregated.
By following this process, you can create a feedback system that will optimize not only
global team goals but also individual goals, providing everyone with an understanding of
what their part in the big picture is. When designing deep neural networks, you need to
create a loss function and make sure that the gradient passes through each layer. When
building teams, you need to make sure that feedback passes through each part of your
team, starting from yourself.
Understanding roles, goals, and size constraints will also help you grow and develop your
team. Over time, your team will grow, but if you maintain the core goal-role-position
framework, adding new responsibilities and team members will be straightforward; as the
team grows, you will assign fewer roles to each position. Goal-position-role maintenance and
revision is also necessary to keep your team balanced. Changes in organizational structure
or in the main goals of a team always bring a risk of creating an imbalance, so be careful
when introducing changes and plan ahead.
The concepts of team balance and a feedback system are simple but powerful tools that will
help you build a balanced team and scale it in the right way.
The team leader should make sure that information propagates through the team
effortlessly, and that all communication takes place as needed. If we take the internet as an
analogy, the team leader should provide a fast and reliable communication network for all
the servers.
Another important activity for a team leader is expanding the team constantly and in a
balanced manner. All team members should have sufficient motivation, be challenged, and
have a clear understanding of how they can grow according to their individual goals.
Most importantly, a team leader should set goals and achieve those goals together with
their team. The team leader should unify everyone's efforts into a single focal point and
make sure that focal point is the right one.
But how can we keep everyone's efforts focused? If you were to ask me about the
single most important aspect of leadership, I would say it is leadership by example. The
popular image of a successful leader rarely exists in the real world. Many imagine
that all leaders are highly skilled communicators and negotiators who can persuade anyone
to do anything with the power of words. Of course, communication skills are important,
but not the manipulative kind. Skilled manipulators may have an edge in some
situations, but each action has consequences, and manipulation is always destructive in
the long run.
Leading by example
The simplest and most effective leadership advice was likely given to you in childhood: if
you want to have good relationships with others, take the first step. Want people to trust
you? Build trust incrementally by trusting other people first. Want your team to be
motivated? Be involved in the workflow and give everyone a helping hand; show them
progress. Management literature calls this principle leadership by example. Walk your talk,
and everyone will follow the example.
Leadership by example has several benefits:
It does not require you to learn and apply complex theoretical methodologies
It builds relationships and trust in your team
It ensures that the team leader is involved in the workflow of every role in the
team
It helps everyone grow professionally, including yourself
The only downside of leadership by example is that it can be hard. It requires you to be
deeply involved in your team—emotionally and intellectually—and this can be debilitating.
It is important to avoid exhaustion because your team will feel it instantly. If you burn out,
so will everyone on your team. Thankfully, there is a simple way to avoid leadership
burnout. Make sure that you have time during your workday that allows you to take a rest
from team leadership activities. Maybe it is a small research project that will help your team
members or maybe it is a prototype you can handle yourself.
Another caveat is that leadership by example can be hard to let go of. Many aspiring team
leaders take too much work on themselves. The leader is often the most experienced
professional on the team, and delegating work may be emotionally hard, especially when
you know that you could do it yourself. But a team cannot function without task
delegation, and teams whose leaders hoard work fall apart quickly. Imbalance appears
when the team leader cannot delegate tasks properly, and from there it builds upon itself.
A simple and useful tool for choosing a leadership style for task delegation is the
situational leadership model. It measures an employee's readiness by two variables: commitment and
competence levels. Commitment stands for the employee's level of motivation. People with
high commitment levels find their work interesting and are ready to try hard until the task
is completed. A low commitment level, on the other hand, indicates that the person is not
very interested in completing a task, finds it boring or repetitive, or is close to burning out.
The other side of the coin is competence, which indicates the employee's ability to handle
tasks by themselves. Low competence indicates that the person has no or low prior
experience in the area, regarding the task they need to complete. A high competency level
indicates that the employee has all the necessary knowledge and experience to complete the
task. Situational leadership helps you keep motivation levels stable while ensuring that the
employee gets all the necessary instructions so that their competence can grow without
harming the task's resulting quality.
The following example shows how we can apply the situational leadership technique by
determining an employee's competence and commitment before delegating a task:
1. Imagine that you've got a new team member, Jane. She does not have much
experience and is taking a junior position in your team. Her motivation is high,
and she is eager to learn new things. Here, the commitment level is high and the
competence level is low, so we must take a directive leadership style. Obviously,
she cannot solve complex tasks alone. It is better to take something simple, split
the task into as many subtasks as you can, thoroughly describe everything, and
check the correctness of every step. Employees from this category will take up a
lot of your time, so be aware of your own resources. A team of 10 high-
commitment-low-competency people can easily siphon all of your time.
2. Over some time, Jane has gained some skills, but her motivation has started to
fall recently. This is the perfect time to coach. At this stage, you want to loosen
control a bit to allow your teammate to make commitments.
3. Jane's competency levels have grown further, and you can tell that she has
somewhere between a moderate to a high level of technical skill compared to
others in your team. It is important to catch this transition because Jane's
motivation is at risk again. To keep it up, you will want to change your
leadership style to one that's supportive and focuses less on particular
assignments. You will also want to share decision-making tasks with her. Jane's
performance will probably be inconsistent at this stage, so provide any support
she needs and help her transition into the last stage.
4. At this stage, Jane is already a senior expert. She is highly skilled, can make good
decisions, and can provide a consistent output of good quality. This is the stage
when you can delegate tasks. Your main goals at this stage are to set high-level
goals, monitor progress, and take part in high-level decision-making.
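The four stages above can be compressed into a coarse lookup. The style names (directing, coaching, supporting, delegating) follow the common situational leadership vocabulary, and the binary low/high levels are a simplification for illustration; real commitment and competence fall on a spectrum.

```python
# A deliberately coarse sketch: map an employee's commitment and
# competence ("low" or "high") to one of the four leadership styles
# described above. Real readiness is a spectrum, not a binary.
def leadership_style(commitment: str, competence: str) -> str:
    styles = {
        ("high", "low"): "directing",    # stage 1: motivated but inexperienced
        ("low", "low"): "coaching",      # stage 2: skills growing, motivation dipping
        ("low", "high"): "supporting",   # stage 3: capable but inconsistent
        ("high", "high"): "delegating",  # stage 4: senior and self-sufficient
    }
    return styles[(commitment, competence)]

# A new hire like Jane: high commitment, low competence.
print(leadership_style("high", "low"))  # prints "directing"
```

In practice, you would reassess both variables for every new task, since readiness is task-specific rather than a fixed property of the person.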
Try to memorize this. The SMART criteria are a convenient checklist that will help your team
understand you better. Nonetheless, no matter how detailed your task description is, talk it
through with the assignee. Our language can be very ambiguous, and our thoughts are
often unclear, even to ourselves. Just go over to your colleague after composing a task
description and ask: How did you understand this? Can we talk this through for 1-2 minutes? Or
talk through each task at your iteration planning meetings if you are using agile
methodologies. These conversations are an essential part of a task's assignment. Even a
subtle difference in understanding can give you unexpected and completely incorrect
results.
Spending several minutes discussing the task at hand will help you make sure of the
following:
Your understanding of the task and its goals is correct and realistic
Your task description is sufficient
You and the assignee are in sync regarding what needs to be done
The assignee understands what the result should look like
The deadlines are realistic
Developing empathy
Basic concepts such as leadership by example, situational leadership, and the SMART
criteria will help you structure and measure your work as a team leader. However, there is
another crucial component, without which your team can fall apart, even with the perfect
execution of formal leadership functions. This component is empathy. Being empathetic
means understanding your own emotions as well as other people's. Our reactions are
mostly irrational, and we often confuse our own feelings. For example, it is easy to
confuse anger with fear; we can act angrily when in reality we are just afraid.
Understanding your own emotions and learning to recognize others will help you
understand people and see subtleties and motives behind their actions. Empathy helps us
find the logic behind irrational actions so that we can react properly. It may be easy to
answer with anger in regard to aggressive behavior unless you can understand the motives
behind those actions. If you can see those motives, acting angrily may seem foolish in the
end. Empathy is the ultimate conflict-resolution tool; it helps build trust and coherency in
the team. It also plays an important part in negotiations, which we will talk about later.
Being empathetic is not a purely innate ability. Of course, some people are naturally more
empathetic than others, but this does not mean that empathy can't be acquired. It is a skill,
and like any other skill, it can be mastered if you work on it. To increase your empathy,
start with yourself. Be aware of your own emotions throughout the day. If you encounter a
complex situation, think through your actions and emotions. Try to understand which of
your responses were adequate and which were not to see what caused them and what you
felt at that moment. Sometimes, you will find mismatches between your feelings and your
actions. Learn from them. Over time, you will start to understand emotions better and
better. First, learn to understand yourself, and then you will be able to read others'
emotions much better than before.
Another important tool that will help you build empathy and understand others is active
listening. We listen to others each day, but do we extract all the available information from
their speech? In fact, we leave most of it behind. When listening, we are often distracted by
our own thoughts. We do not pay attention to the speaker and are eager to talk.
You may have heard that you can achieve a lot more by letting others speak. Imagine
yourself in a conflict situation. Your beloved teammate comes to the office at 3 P.M. without
notice for the third day in a row. When you ask them what's wrong, they harshly ask you to
go away. You may immediately want to react in the same aggressive way. However, in this
situation, it is much better to understand the reason behind such behavior. Unknown
factors can clarify such seemingly illogical behavior and show your teammate's situation in
a different light.
Building Your Data Science Team Chapter 6
Simply staring and hoping that the information will flow to you is overly optimistic. If the
person in front of you is not in the mood to talk, this may make the situation worse. To get
around this, you should listen actively. Active listening involves a set of techniques that
will help others speak more openly.
Pay full attention to the speaker: We often drift into our own thoughts when
others speak for long periods of time. The speaker will immediately notice that
your head is in the clouds, and it will negatively affect their urge to share
information with you.
Show that you are paying attention: Give feedback using uh-huhs, saying yes, by
nodding, and by smiling. The other person shouldn't feel like they are talking to
a concrete wall.
Paraphrase and mirror: Mirroring is when you repeat the last two or three words
of the speaker's last phrase. Paraphrasing is when you repeat or sum up the last
phrase in different words; for example, I hear that you want to…. This may seem
awkward at first, but believe me, mirroring works. These are the most important
tools for active listening; they cultivate the flow of information between the
speaker and the listener. After learning about paraphrasing and mirroring, I
noticed that good speakers use them constantly, without doing so consciously.
There is no reason for you not to employ these tools.
Do not judge, ever: Openly judging others will trigger their defense mechanism
and break the flow of information.
Verbalize emotions to understand what others feel: Empathy is a hard skill to
master, but often you can take a handy shortcut: verbalize the other person's
feelings. I feel that you are upset, am I right? If you are right, the other person will
confirm that, and if you are wrong, you will get a clue.
Active listening is simple on paper but may be tricky in the real world. As with any skill,
you can practice it to perfection. Follow this simple advice in your daily conversations and
you will notice the effects.
A great tool for learning is holding retrospectives and internal meetups. Retrospectives are
project team meetings where everyone on the team answers three simple questions: What
went well? What went wrong? What should we improve?
Retrospectives allow you and your team members to get feedback and improve on it. They
are also an effective project management tool that will improve your team's output. On a
larger scale, everyone on the team should have a chance to share an experience with others,
and to get useful feedback not only from their manager but from others in the team too.
The key to personal and team growth is continuous learning. If you want to motivate your
team members to learn and grow, you must give them the following:
The concept of continuous learning is a tool that will help your teammates make
learning into a habit.
Not everyone has personal time to learn, and even if they do, not everyone is
eager to spend it on learning right from the start. Learning and research activities
at work may be a good way to develop the habit.
A personal learning plan will help everyone structure and track their goals.
A performance review process covering project, learning, and other goals will help
feedback flow within the team.
We will now look at each point separately, starting with continuous learning. The main
idea behind it is that it does not matter whether you read a whole book in one day or a
single informative tweet: all learning counts. You should aim to learn something new each
day—even a small amount. In the end, you will not only move a bit closer to the goal of
learning something but also develop a set of habits that will allow you to gain more and
more knowledge without even noticing it. After a while, continuous learning will be a part
of your life: you will develop a natural hunger for new, useful information, like a scheduler
that optimizes your free time in the background, squeezing the spare minutes you spend
commuting or waiting in store lines into progress toward something that really matters to
you.
The trick to continuous learning is to work toward your learning goals in small batches. If
you are in the mood, you are likely to continue and do more work. If not,
you will still have learned something new and moved on.
At first, this can be a way of achieving a single goal. Then, after your first
accomplishments, you'll realize that this way of thinking lets you learn more: apply it
to a new area of expertise, skill, or topic of interest. A few months later, it will become a
part of your life, helping you get better every day.
Make sure that the learning plan template is easily accessible; place it on your company
wiki or another online source. The learning plan template should contain all the skills,
books, and online courses that you and your team consider useful for gaining new
competencies. This page should be frequently updated with links to new material that your
team finds. Split all the material into categories.
In particular, for a data science learning plan template, you can consider the following
categories:
Mathematics:
Mathematical analysis and optimization
Linear algebra
Mathematical statistics
Data science:
Machine learning
Deep learning
Data engineering:
Relational databases
Data warehousing
Big data
Business intelligence
Software engineering:
Software architecture and patterns
Code quality
All of those topics should contain links to books, online courses, and blog posts with
commentaries about why this material is good and what the most useful parts of it are. It is
even better to split each topic further into levels; for example, beginner, intermediate, and
advanced. That way, anyone on your team can track their approximate level of competency
in each area. This learning plan template can be used in a performance review
to create and adjust personal learning plans. We will discuss this topic in more detail later
in this chapter.
After creating the learning plan template, you can talk with everyone on your team to
discover their professional goals. Ask them which areas they want to grow in and
understand what interests them personally. Ask them to go through your learning plan
template and write their approximate competency level in each topic and help them if they
can't do this alone. With this information, you will understand where each team member is
now and where they want to go. You can then create a personal learning plan. I find simple
task-tracking systems such as Trello to be a good tool for tracking learning goals—create
shared boards for your teammates and create cards in meetings. As with any task, a
learning plan is better if it's defined in accordance with the SMART criteria. In particular,
make sure it is realistic (reading 100 books a year may be a realistic goal for some people,
but not for others) and time-bound. Each learning plan should have a deadline. At this
deadline, you and your teammate should meet again and adjust the learning plan: decide
on the next goal, create new cards, and delete obsolete goals. In your second meeting, you
may discover that many learning plans have been unfinished. This means that you will
need to adjust the amount of learning material for the next period. What matters is that
progress never stops, not how many learning cards have been completed.
Learning plans are worthless without planned time for research and education. Planned
time can be personal, but you should not assume that your team will educate itself
only in its free time. Give them opportunities to learn during work hours too. It is common to
have free gaps between projects or when waiting for someone to finish parts of the work
you depend on. On rare occasions, some of your teammates could have several days of
downtime, waiting for new tasks. Use this time wisely. You should communicate to your
team that if they do not have any assigned tasks, they should learn according to their
personal plan after reporting to you. In some cases, company policy can be against learning
in the workplace. Try to communicate that learning is a long-term investment that will
allow your team to perform better.
In all honesty, I haven't seen a successful implementation of an individual KPI yet. KPIs
work best for business processes, while personal KPIs are very one-dimensional and
unrealistic. If you tie an individual's success to a single number, they will either end up
fighting with the system or exploiting it. A better but more time-consuming way to track
performance and set goals is by using performance reviews. A performance review is
typically carried out once or twice a year and consists of the following steps:
1. Ask your teammate to write down everything that they have done since the last
performance review, such as completed projects, major achievements at work,
open source contributions, books, and so on. Let's assume that the teammate's
name is Matt.
2. Send this description to the colleagues who worked with Matt. You
should include as many people as you can: project managers, product owners,
and team members. Ask them to answer the following questions:
What have you liked about working with Matt?
What could be improved? What parts did you not like?
3. If you have worked with Matt as a manager on some projects, prepare answers to
those questions yourself.
4. Anonymize all the reviews and prepare a single performance review report.
After the report is done, you should share it with Matt and set up a meeting. It is a good
idea to share the report before the meeting so that Matt can ponder the feedback for a while
and derive conclusions. In the meeting, discuss each point and make sure that you are both
on the same page regarding the feedback. Decide on the most important parts that need to
be improved. If Matt has achieved new heights that you set in the previous performance
review, you should be considering a promotion or a bonus. It is important to give equal
attention to both positive and negative feedback. Try to make performance reviews stand
out; they should be memorable. Create nice-looking review templates,
with good typography and a healthy dose of memes, and print the reviews
on nice paper so that people can keep them. A performance review should create
positive memories, especially if the feedback leans toward the negative. Give your
team the motivation to improve, and do not forget to collect reviews of yourself. As a
manager, you can make many mistakes without even knowing about them. Getting
feedback from your team is as important for you as it is for them.
It is important to lead by example if you want to foster a growth mindset in your team.
If you want anyone to adopt the principles of continuous learning, learning plans, and
educational goals, do this yourself. Demonstrate that this works. That way, you will kill
two birds with one stone: you will vastly improve yourself, and you will show your team
how the ideas you speak about really work.
Now, let's look at how to apply the advice that's been provided in this chapter by looking at
a case study.
The first thing Robert did was research the scenarios his department should handle. He
discovered that the understanding of data science was still vague in the company, and
while some managers expected the department to build machine learning models and seek
new data science use cases in the company, others wanted him to build data dashboards
and reports. To create a balanced team, Robert first documented two team goals and
confirmed that his views correctly summed up what the company's management wanted
from the new team:
Data stewardship: Creating data marts, reports, and dashboards from the
company's data warehouses based on incoming requests from management
Data science: Searching for advanced analytics use cases, implementing
prototypes, and defending project ideas
Next, Robert thought about team size limits and asked about the approximate volume of
requests they would get every month for each goal. For the data stewardship part,
management already had a big backlog formed, and they needed all the tasks to be
completed as soon as possible. As for the data science part, management did not have any
preexisting requests, but they wanted the team to find and suggest new data science use
cases in the next three months. Management noted that they were ready to invest in the
data science goal by creating a team of no more than four experts.
Based on these goals, Robert defined the following roles:
Data team leader: To supervise two data teams and coordinate projects at a
higher level. The workload in this role is not expected to be high during the first
year of the team's existence.
Data stewardship team:
Team leader/project manager: The goal definition clearly required
someone to manage incoming requests and plan the work.
Systems analyst: Someone to document visualization and
reporting requirements that come from the stakeholders.
Data analyst: To implement data reports and dashboards based on
the requirements defined by the systems analyst.
Data science team:
Team leader/project manager: Should be able to manage R&D
processes in a business environment.
Business analyst: To interview company management and search
for potential data science use cases.
Systems analyst: To search for data and provide requirements for
the use cases provided by the business analyst.
Next, Robert thought about team size constraints and created the following positions from
the preceding roles:
Next, Robert defined a workflow for each position to create accurate job descriptions and
interview scenarios, as described in Chapter 5, Conducting a Data Science Interview. To
foster a growth mindset, he defined learning plan templates for each team role based
on his experience. He focused on roles instead of positions because some combined
positions may split when the team grows, and role-based growth tracks will allow those
splits to occur naturally. Robert also created a quick team leader guide where he stated the
standards for task delegation based on situational leadership and the SMART criteria and
set up a performance review schedule for each team.
Summary
In this chapter, we explored the complex topic of building and sustaining a team. First, we
dived into the concept of team balance and looked at why it is crucial for a team's survival
in the long term. Then, we learned about several leadership styles, including the situational
leadership model and leadership by example. We then saw how we can create good task
descriptions using the SMART criteria. We also mentioned empathy and active listening as
the most important soft skills of a leader. We also explored the concepts of continuous
learning and a growth mindset. Finally, we saw how a performance review process can
help learning habits develop in your team.
This chapter concludes the Building and Sustaining a Team section of this book;
congratulations on finishing it! The next section of this book will cover the topic of
managing data science projects.
Section 3: Managing Various
Data Science Projects
Once you and your team start to build the project, you may realize that it requires effective
management strategies. Can you use traditional software development methodologies for
data science projects? What caveats do you have to look for? What processes should you
use to manage development iterations? How can you sustain a balance between research
and implementation? How should you deal with situations of extreme uncertainty?
For most companies, data science and machine learning belong in the area of innovation.
Unlike software development, these disciplines are terra incognita (unknown territory) for
your customers and business stakeholders. If you approach data science projects like any
other project, you will face many unexpected problems. The domain of data science needs
to be handled differently. When we step into the area of innovations, the definitions
of best and good change.
In this chapter, we will explore innovation management and cover the following topics:
Understanding innovations
Why do big organizations fail so often?
Exploring innovation management
Balancing sales, marketing, team leadership, and technology
Managing innovations in a big company
Managing innovations in a start-up company
Finding project ideas
Managing Innovation Chapter 7
Understanding innovations
Before we discuss failures in innovations, we need to understand what innovation is. The
dictionary definition of innovation states that it is something new or something that's
introduced in a different manner; an introduction of new things or methods. Every
innovation is temporary. Edison's light bulb was once innovative, as were cars. Successful
innovations invade our daily lives like a tempest, change them forever, and then become
mundane. Failed innovations fade and become forgotten.
History shows that the creation of a technology does not guarantee success. Automobiles
first appeared in 1886, but cars only really started changing our lives 27 years later, when
the Ford Motor Company built the first moving assembly line in 1913. If you define
innovation in terms of an inventor's success, the invention of the first car may be a huge
breakthrough. But from a business perspective, we may only consider an innovation
successful if it creates a new market with a stable income flow. Technical innovation
advances the field, while business innovation creates some new, previously unavailable
value. This value helps new markets emerge and new businesses rise.
New technologies are raw and unpolished, so they seem unattractive to new markets.
Professor Clayton Christensen uses small hard disk drives (HDDs) as an example of this.
In the era of large HDDs, everyone used them as the industry standard. Smaller models
were fun in theory but unusable in practice since, at the time, existing computers
supported only the larger HDD form factor. Larger HDDs also had more capacity, speed, and
reliability. When the market of personal computers emerged, small HDDs instantly became
more attractive. Power consumption and size were the parameters that PC owners cared
about. New revenue streams funded the development of small HDDs. After some time,
smaller models surpassed their larger counterparts in terms of capacity, speed, and
reliability. By then, it was already too late for the old generation of HDDs. Large companies
that had seemed stable five years earlier lost their market to the new, superior technology.
Starting innovative projects in large companies is easy, but finishing them is troublesome.
Innovations appear as nonsense according to a company's short to mid-term goals since
they bring in no revenue and require large investments. However, bringing innovations to
large organizations is not impossible. We will explore how we can grow ideas into the
businesses of small and large organizations later in this chapter.
Game of markets
Data science is still a young field with great potential. Many experts state that even if we
were to suspend all new research, merely integrating the technologies that have emerged
in the last 10 years would take us two to three decades. However, there
is a large gap between research projects and widely applicable technologies. Like HDDs, a
research project needs to grow its features in order to outrun its current competitors.
Data science uses research in machine learning, deep learning, and statistics to deliver
software products that solve real-world problems. To become usable and widely applicable,
software needs large time investments. The time of software engineers, data scientists, and
other technical experts costs money. To earn money from your software, you need to solve
people's problems. Your products need to bring something valuable to people. When this
value is large enough to sustain a constant demand, you create a new market. This market
is ruled by the laws of supply and demand: people and companies will exchange money for
your products and services.
In their infancy, big organizations created their own markets or took part in existing ones.
Markets can age and die just as we do, making room for new ones to emerge. If companies in
a dying market do not transition to an emerging market, they cease to exist.
Big companies rarely operate quickly. Long operational cycles and heavyweight
organizational processes arise naturally with a company's scale. Without those crucial
instruments, the company may fail to operate at all. Nonetheless, innovations do not tolerate
long cycles. The cost of failure increases with the time span of an experiment, which makes
innovations extremely costly for large businesses.
For organizations, data science is still an innovation, and most market niches haven't been
developed or discovered yet. While innovations are hard, they are not impossible. In the
upcoming sections of this chapter, we will look at how we can take control of new
technologies by using innovation management in large and small companies.
To understand the main ideas behind innovation management, we should understand the
two main types of innovation: sustaining and disruptive.
Each sustaining innovation increases the gap between competitors slightly, while
disruptive innovations change the way we live and work, thus creating new markets and
making old ones obsolete. To a lay observer, disruptive innovations
may appear all of a sudden, but in reality, they are not a flash of insight. The Innovator's
Dilemma contains many examples of disruptive innovations that grew in small markets,
where features of innovative products were more appealing for very specific use cases. A
stable source of income allows an innovative technology to grow to its potential, while
companies that provide old-generation products do not see it as direct competition in their
market. By the time the new product disrupts the market, it is already too late.
The innovation cycle consists of the following stages:
1. Search: In this phase, the company searches for an idea to experiment with.
2. Build: Here, the company builds prototypes, conducts experiments, and
measures results.
3. Scale: At this stage, the build phase gives us a promising direction to work on, so
the company scales out. Prototypes are transformed into market-ready products
and start to generate income.
4. Expand: Every innovation saturates its market. At this stage, the company
expands its product to new markets to keep the growth rate intact.
5. Sustain: At this stage, the innovation becomes a stable business. Some companies
can live for decades in the sustain stage. However, this comes with risks. Your
current market position is not a permanent, guaranteed source of income; it
can quickly saturate without long-term investment. Each minute of inactivity
opens up new opportunities for your competitors to create disruptive
innovations that can throw you out of business. The sustain phase is a flag
signaling that the company needs to return to the Search stage.
Another important aspect of every disruptive innovation is the chasm. In Crossing the
Chasm, Geoffrey Moore presents his famous market adoption curve.
New technology can quickly find a seemingly stable source of income in the early market,
which comprises companies that are willing to pay for risky, untested, experimental,
but promising solutions. Many organizations assume that the transition from the early
market to the mainstream market will be smooth: the early market will produce enough
income to fund the features that the mainstream market values. However,
this is not the case in practice. The gap between early and mainstream markets is much
larger than it seems. When the early market saturates, the company enters the chasm. Its
offering is not mature and developed enough for the mainstream market, where customers
are used to feature-rich and stable products. So, the company is forced to live on a
diminished income stream while its product requires large investments to build
the features requested by the mainstream market.
To find a way out of the chasm, the company must focus its efforts on a very specific area of
application where its product's best features are more important than its current flaws. The
goal is to find a tiny but reliable niche and concentrate on perfecting it. This creates a small
market that will allow you to grow momentum through success stories, functionality,
customer references, and new income streams. Consistent success in a small market will
give you the resources and time necessary to scale out to more mature markets.
In the next section, we will look at a case study of a fictional company called MedVision.
The MedVision experts have done market research and identified four potential ideas
to work on.
All of these ideas look promising from a commercial perspective and have a clear market
base, that is, doctors and patients who wish to provide/get more accurate and less
expensive medical diagnostics.
During the build phase, the MedVision team created several prototypes and showed them
to potential buyers. They also deepened their knowledge and researched the domain
of each idea. The build phase revealed unforeseen problems that the team could not
identify during the search phase, as follows:
Cancer detection suffers from bad data quality. Many datasets contain biased
images that need manual cleaning. During the prototyping stage, the MedVision
models got 0.97 F1 classification scores on a holdout test set but failed tests on
newly gained data. The dataset that MedVision had used for model training
was biased: the X-ray images of cancer patients had been made using a scanner that
left marks on the scans. The models learned to classify cancer using the simplest
strategy: recognizing the markings. This led to a large model bias. Another source of
bias was hidden in the timestamps present on all X-rays. It turned out
that patients with cancer were more likely to get scans on Tuesdays and Fridays
because of the hospital's scheduling policy. The MedVision team was now
prepared to handle those issues when solving cancer classification problems for a
real customer. The team decided that the issue of bias is solvable and shouldn't
be a deal-breaker for this idea.
Data from wearable devices proved to be a good predictor of heart attacks
and injuries from falls. However, there were big problems with data
availability. Only 1 out of 10 hospitals in MedVision's network reported that they
had wearable device data, and only two planned to offer wearable devices to
their patients in the next 5 years. MedVision decided to take this idea off the list
because of the poor adoption of wearable devices in medicine.
Using a patient's medical history to diagnose illness turned out to be a very
challenging task because of the difficulties with data collection. Researching the
data of three hospitals revealed that 30% of patients' medical histories existed only in
paper format. The forms were standardized but mostly filled in by hand. To
test their idea, MedVision's team needed to build some reliable handwriting
recognition technology, and then structure and error-correct large amounts of
textual information. This process required large time investments and two
additional experts in handwriting recognition and natural language
processing (NLP). During the data research process, news appeared about the failure of a
big company working in this direction. MedVision decided not to invest further in this
idea, agreeing that the risks were too high to continue with the prototype-building
phase.
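The evaluation trap MedVision hit with the cancer classifier (the first item in the list above) is common enough to sketch. The following toy experiment is an illustrative reconstruction, not MedVision's actual code: the dataset, the "scanner mark" feature, and all numbers are invented. It assumes NumPy and scikit-learn are available.

```python
# Hypothetical sketch: a spurious feature (for example, a scanner artifact
# correlated with the label) inflates the holdout score, which then collapses
# on data from a new source where the artifact is absent.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

def make_data(n, leak):
    """Two weak genuine features plus one 'scanner mark' feature.
    If leak=True, the mark almost perfectly tracks the cancer label."""
    y = rng.integers(0, 2, n)
    genuine = rng.normal(loc=y[:, None] * 0.3, scale=1.0, size=(n, 2))
    if leak:
        mark = (y + (rng.random(n) < 0.02)) % 2  # mark ~ label, 2% noise
    else:
        mark = rng.integers(0, 2, n)             # mark is pure noise
    return np.column_stack([genuine, mark]), y

X_train, y_train = make_data(2000, leak=True)  # biased training hospital
X_hold, y_hold = make_data(500, leak=True)     # holdout from the same source
X_new, y_new = make_data(500, leak=False)      # scans from a new hospital

model = LogisticRegression().fit(X_train, y_train)
f1_holdout = f1_score(y_hold, model.predict(X_hold))
f1_new = f1_score(y_new, model.predict(X_new))

print(f"holdout F1:    {f1_holdout:.2f}")  # high: the model learned the mark
print(f"new-source F1: {f1_new:.2f}")      # much lower: the shortcut is gone
```

Testing on data gathered from a genuinely new source (a different hospital, scanner, or time period), as MedVision did, is what exposes this kind of bias; a random holdout from the same biased source cannot.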
MedVision's team decided to proceed with their cancer detection prototype. They
identified two hospitals that were interested in their solution: Oakland Hospital and the
Cancer Prevention Center. They also found a medical research organization that might
provide additional funding if the first tests with real patients succeeded.
After 1 year, MedVision's system was successfully functioning at Oakland Hospital. The
second project was canceled midway through because the Cancer Prevention Center
wanted exclusive ownership of the developed product, which significantly limited
MedVision's ability to scale and sell their product.
The first working deployment of the cancer detection system allowed MedVision and
Oakland Hospital to publish their success story in major news sources. This allowed
MedVision's team to start its expansion. They extended their sales department and
started to attract new customers for system trials. A myriad of new feature requests
began to bombard MedVision's team, so they hired a product manager to research the
market as well as catalog, prioritize, and manage the delivery of new features. Several
other competitors entered the market following MedVision's success, so the team switched
to delivering sustaining innovations to keep their market leadership.
Funds were growing, so after several years of successful operation, MedVision's team
developed a new product for their customer base. They started making transformative
changes to the company by searching for new ideas, hoping to find suitable ground
on which to start a second disruptive innovation cycle.
Next, we will look at the possible ways of integrating innovations into existing businesses.
Integrating innovations
Innovations require experimentation. New approaches fail often, but failure does not
signify the end of the line. Failure just means that you have to distill the experience, learn
from your mistakes, and try again as fast as you can. If you try to do this in an established
company, its processes and best practices may make you look insane: your actions may
appear destructive and contradictory to the company's values and
goals. This happens because most businesses are oriented toward sustaining innovations.
Existing values, goals, and KPIs react to disruptive innovations like an immune system
would react to a virus—with defensive actions. This is the reason why most research and
development (R&D) departments live a hard corporate life. They pursue goals that are
incompatible with the values of the company they work for. To ease the integration of
disruptive innovations into a large company, you can introduce an isolated business
entity inside it in one of the following ways:
Acquisition: Buy companies that have already managed to cross the chasm. The
main challenge of this approach is to integrate the acquired company. Both
technological and organizational integration need to be done carefully so that they do
not adversely affect the newly acquired products and teams.
Separate entity: Create a branch company that will receive financing from the
head company. The main challenges of this approach are large costs, risk
management, and resisting the urge to have complete control. Honestly evaluate
your company's potential in order to create and organize a new business with a
set of goals that are aimed at the long-term development of disruptive
technology.
Separate department: This is the hardest way to create the innovations team. The
main challenge here is creating a new set of goals and values for the existing
company.
Thinking about data science projects from the perspective of innovations is helpful in many
ways. Machine learning and deep learning can create sustaining and disruptive
innovations. Innovation management will give you a broader context and let your main
strategy succeed in the market. Organizations that do not integrate innovations will be
outrun by their competitors. It is easy to blame technology for not being mature enough, or
experts for not being as good as the competitor's rockstars. While doing this, many
companies forget to consider the validity of their strategy for forming markets around new
technologies. Data science solutions need a stable organizational foundation and a plan of
action for income generation. Innovation management helps us take care of both
components.
As you have seen, managing innovations involves dealing with lots of failures. Failures are
not one-dimensional, and to avoid them, you will need to find a balance between different
areas of expertise. The next section of this chapter will explore three main ways of thinking
that will help you manage risks in data science projects.
The number of areas a data scientist can spend their time on is bewildering.
We can simplify these activities by grouping each of them into one of four common areas:
Team building: This topic includes activities associated with your team:
Performance reviews
Coaching
Supporting the growth mindset and working on self-development
plans for your employees
Sales: Every item that helps you sell more of your product or services goes under
this topic:
Working on presentations
Practicing talks
Creating written material
Expanding your network and checking your connections for
potential B2B applications for your technology
Marketing: Every item that is related to markets and pivoting goes here:
Exploring the current offerings of your competitors
Discovering possible pivots for your technology
Seeking new use cases and refining the current ones
Technical leadership: Every item that concerns tools, architecture, and the
quality of your technical solutions goes under this topic
Managing in different contexts during your workday is not only useful for emerging data
science teams. Changing the setting can open up new perspectives and give you ideas
about how you can proceed with the development of your technology, team, and business.
We will briefly cover four key areas of management and relate them to the specifics of data
science.
Technical leadership, along with team building, is a key activity for any data science
manager. You need to make sure that your projects are implemented with the right tools
and that they have been designed to achieve your customer's goals. On the other hand, you
need to care about your team's motivation and long-term improvement. Advocating a
growth mindset and building a balanced team will serve as a strong backbone when it
comes to building better technology.
Your team will need a constant flow of tasks to work on. Even if your company has a
dedicated sales department, you'll need to sell your ideas to them first. If data science is
relatively new in your company, salespeople will be familiar with the existing offerings of
the company, but they won't know how to sell data science solutions to customers. Either
you need to know how to do it yourself, or you need to speak the language of sales to coach
the sales department on how to approach data science solutions. Another case where you
will need these skills is internal sales. For example, this skill set is necessary when the data
science department is optimizing internal business processes. This is a frequent case in
banking and internet companies. In this situation, a data science manager will need to find
ideas, make them attractive and clear so that they can be understood, identify stakeholders,
and close an internal sale to attract budgets for your project. Without sales, even the best
teams and the most disruptive technology will find themselves without any work to do.
Looking at your duties from a salesperson's perspective will open up a surprising number of
opportunities, so don't miss them.
If the sales part of your job works as intended, you will start to see many opportunities, and
each will need significant time investments from your team. Some opportunities will be
realistic and match the technology and expertise of your team, while others will be much less
attractive. It's time to put on your marketing hat so that you can navigate through this sea
of opportunities. Thinking about your work in terms of markets, demands, and products
will allow you to choose the best direction to work toward. With each finished project, your
team's skills and your technology will be drawn closer to one set of opportunities and
further from the others. Marketing is all about searching in the most promising way and
growing in the direction where your technology will meet the market's demands, thus
creating new value.
Looking at your work from different perspectives is both illuminating and beneficial to
your career and company. We do not live in a one-dimensional world, so it is important to
look further than our main set of responsibilities. From now on, try to see the four hats of
team management, technical leadership, sales, and marketing lying at your desk.
Remember to change them from time to time so that you and your team have all the
necessary ingredients for finding, growing, and delivering innovations.
Next, we will look at a case study regarding the innovation management process in a retail
business.
In the current process, each store has two to five goods availability managers who
manually monitor each shelf in the store using an app on their smartphones. The app
shows up-to-date goods locations on each shelf, but the manager needs to check manually
whether the goods are present. After the check, the manager takes note of all the missing
products and goes to the store's warehouse to get the goods and place them on the shelves.
First, Carl asked the company's management to introduce him to the person who is
responsible for current on-shelf availability across all stores. It turned out that one of the
executive directors, who manages logistics, delivery, and the on-shelf availability of goods,
owns this process. Carl prepared a detailed presentation that described the results of the
pilot project. He explained that he needed the director's help in preparing the
requirements for the system so that it could be integrated into the existing business process,
since the director has this organizational capability. The current business process was well-
documented, and each step was assigned to a specific business role. To create a migration
plan, they interviewed the experts who executed the target process. Carl showed the current
prototype to the experts and documented their ideas on how to improve the technology so
that it could be smoothly integrated into the workflow. These new features formed the backlog
for the upcoming implementation project. Along with new features and ideas, changes
to the business process came up naturally. Some of them were conflicting and
poorly defined, so Carl assigned a business analyst to the project to formalize the migration
plan into a coherent document.
After this, Carl presented the results to the director. The integration of the new system
significantly changed the existing process. Previously, the store managers who monitored
the on-shelf availability of goods were manually going through the store to monitor the
availability of the goods, but now, this task can be fully automated using portable cameras
near each shelf. A manager will receive task notifications through the mobile application
and each task will contain precise information about the goods' storage location, quantity,
and shelf location. The application will batch tasks together and create optimal routes
through the store's storage. Since all the store's availability managers already have
smartphones, the integration will be simple: the users will need to update the app on their
phone and familiarize themselves with the new features.
The key idea behind the integration strategy that was proposed by Carl was to test the
system in a single store first. The test would be considered successful if the store could
handle its current workload using only half of the availability management personnel. This
way, the project would have measurable success criteria. If the first test succeeds, then the
system will undergo tests in a larger number of stores. This rollout will be repeated until
the new process has been migrated to all the stores and becomes the to-be version of the process.
It's clear that each store will need fewer availability managers after the system integration.
To keep the process transparent and well-explained, the director designed a special
training program for the personnel who won't be needed for this job once the system is online.
Other departments are working hard to find new people, so redistributing the workforce is
a very appealing option. Carl's team also created an in-depth on-site training course and in-
app tutorials that describe how the users can work with the new app.
The innovation life cycle is much faster than the general project life cycle in large
organizations. It also costs more in the short term because of the fast experimentation rates
and increased risks of failure. This makes innovations tricky to work with in a large
company. Existing short-to-mid term values and goals will reject innovations, just like an
immune system would reject a virus. It is important to provide organizational and financial
support to the R&D and innovation departments in established businesses. Innovation
teams work best when they are deployed as a company inside a company. While the main
business may work efficiently with an organizational structure of many
hierarchical departments and management chains, innovations work best when the
organizational structure is flat and there are only one or two steps between an employee
and a top manager.
In the next section, we will look at how we can manage innovations in a start-up company.
The concept of the chasm is important for start-ups. The main goal of a start-up is to find a
new market where it can grow and provide value that did not exist before. Matching
your idea and your technology to market demands involves a lot of quantifiable
experimentation. It is important to use the measurable approach advocated in The Lean
Startup and Crossing the Chasm.
The Lean Startup method recommends that you test every idea and every feature on the
market early. Conducting experiments and getting feedback is crucial for start-ups because
each choice you make limits your future development. The sooner your product receives
market feedback, the sooner you can change its course before it's too late. A company with
good R&D practices and big budgets can test ideas in computer vision, NLP, and
reinforcement learning for long-term payoffs. They can even create entire branch
organizations so that they can test ideas. On the other hand, a computer vision start-up will
have a much harder time pivoting to NLP because it is expected to turn a profit much sooner.
Start-ups need to be lean, fast, and agile to survive. Looking at your idea through the
marketing lens is a crucial skill for a small aspiring company. Even the best idea won't
survive long enough without proper economic development. Because of the large gap
between the expectations of early adopters and the early majority, a start-up needs to have
a narrow focus, seeking to satisfy the needs of a niche customer base. In this chapter's
example, MedVision (see the Case study – following the innovation cycle at MedVision section)
has implemented a lung cancer classifier. They measured the market demand for
several alternatives and identified cancer classification as the most promising and realistic
use case. In the short term, they plan to have several successful deployments to attract
income and new customers. In the long term, they plan to collect market feedback to
expand their system's functionality to more medical equipment types and expand their
disease identification list. The key idea is to direct your product toward market satisfaction
by conducting experiments and collecting measurable feedback.
In the next section, we will explore different approaches to finding ideas that can fuel data
science projects.
On rare occasions, a problem may be presented to you on a silver platter. If that's not the
case, you will need to dive deeper into the company's business. Start with their key
revenue-generating processes. Educate yourself about how they work. Then, mark all the
steps as manual, partially automated, or fully automated. Two key areas for the
improvement of an existing business process are the automation of manual steps and the
improvement of automated steps.
For manual steps, research why they are still performed that way. Manual labor can
generate lots of digitalized data that you can use to train your models. Form validation,
form filling, and customer support are frequent cases where you can replace repetitive
manual labor with a data science solution.
For automated and partially automated steps, research the IT systems that implement those
steps. Look at the data sources: databases, data warehouses, and file storage. Then, look at
where you can deploy a machine learning model that can make this step better. Scoring,
risk management, and algorithmic marketing algorithms are examples of this.
The best course of action when analyzing existing business processes is to find a step that is
as follows:
Generates data: This or any previous step of the process should constantly
generate digital data that can be used by a model.
Is business-critical: The improvement should provide direct business value to
your customer. Increasing the value is always a strong selling point.
Has potential applications for data science: You can use data to build an
algorithm that provides insights or automates decision-making processes.
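As an illustration, the three criteria above can be expressed as a simple filter over candidate process steps. This is only a sketch; the step names and flags are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ProcessStep:
    """A candidate step of an existing business process."""
    name: str
    generates_data: bool      # this or a previous step produces digital data
    business_critical: bool   # improving it yields direct business value
    ds_applicable: bool       # data could drive insights or automate decisions

def promising_steps(steps):
    """Keep only the steps that satisfy all three criteria."""
    return [s for s in steps
            if s.generates_data and s.business_critical and s.ds_applicable]

steps = [
    ProcessStep("manual form validation", True, True, True),
    ProcessStep("office supply ordering", False, False, False),
]
print([s.name for s in promising_steps(steps)])  # ['manual form validation']
```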
Using existing business processes to find promising ideas for data science projects is a solid
approach when you are working in a B2B environment. It allows you to deepen your
domain expertise and find the most valuable improvements.
1. First, you should create a data source map of the company you are working with.
Sometimes, this information will be readily available for you to use. If it isn't,
then your team should do some research and recover the data architecture of an
organization. Using this data map, you can identify the largest and most
complete data sources.
2. Then, sort all the data sources by the amount of data and data quality and look
over possible data merges between different databases. The goal is to create
several starting points for further research.
3. Finally, look deeper into the data sources you have selected. Seek datasets that
can be used to build supervised or unsupervised models. At this stage, you
should not build any models; just write down all your ideas—the more the
better. After jotting them down, see if any of your ideas can create new business
value.
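Step 2 above might be sketched as follows. The sources and scores are invented, and a real assessment of data quality would be far more nuanced:

```python
# A sketch of ranking data sources by size and quality to pick
# starting points for further research.
data_sources = [
    {"name": "crm_db", "rows": 2_000_000, "quality": 0.9},
    {"name": "legacy_files", "rows": 500_000, "quality": 0.4},
    {"name": "web_logs", "rows": 50_000_000, "quality": 0.7},
]

# Rank by a simple combined score of volume and quality.
ranked = sorted(data_sources,
                key=lambda s: s["rows"] * s["quality"],
                reverse=True)
for source in ranked:
    print(source["name"])  # best candidates first
```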
This way of finding ideas is risky, time-consuming, and very creative. The chances of
succeeding are lower than with the previous approach, but if you find an idea that works
bottom-up from the data, you are much more likely to stumble upon a
unique data science use case that gives you a competitive advantage and provides new
business value. The first approach is much more suited to finding sustaining
innovations, while the second approach can be used to mine for disruptive, transformative ideas.
Once in a while, some people come up with truly unique ideas. There may not be any data,
nor a business—just a captivating idea. Transformative ideas that disrupt the industry often
fuse several disciplines that have no close relationship at first glance.
If there was a reliable formula for creating such ideas, they wouldn't be rare and surprising
anymore. The only way to increase your chances of discovering new ideas is to acquire lots
of knowledge and experience in different areas.
If you think that you have come up with a promising concept, do not rush it. Write down
your idea and start researching it:
Look for similar products that are already on the market. Chances are, you will
find several at least vaguely similar products.
Collect and classify a competitor's product features.
Write down the core business model behind each product.
Rank all the collected features into several categories:
Best feature: No one does this better; it's a distinctive feature of the
product. This feature is often the defining purchase factor.
Core feature: The majority of competitors have this feature
implemented fairly well.
Basic feature: This is implemented in a basic way. Lots of other
competitors have a more complete implementation of this feature.
Lacking feature: The product lacks a feature.
Start fleshing out your idea by describing the product's features. Create a profile
for your product and think of the feature that distinguishes it from others.
Think about your product's business model. How would it relate to other
products on the market?
If you can't find any competitors on the market, or you can see that your product could
easily throw competitors out of the market, then you have a chance of creating a disruptive
technology. Remember that pushing disruptive innovations into the market requires careful
planning, focus, and a strategy for crossing the chasm.
Next, we will look at a case study that's all about finding ideas and managing a data science
project in an insurance company.
Rick already knew that the main income stream for the company was selling insurance
policies, and the main losses were in insurance claims. Insur Inc. provides health, real
estate, and automobile insurance programs. The company's largest product is its
automobile insurance, which contributes to 75% of its overall revenue. Rick took a closer
look at the processes in the automobile insurance department.
First, he looked at the process for new insurance policies, which are shown in the following
diagram:
Front Office System: This is an insurance consultant UI that's used at all Insur
Inc. offices. This system does not store much data and acts as a lightweight UI
with limited functionality compared to the Back Office System.
Back Office System: This manages all the data related to insurance contracts.
Back office workers use this system to validate and process new contracts.
Scoring & Tariff System: This is an auxiliary module for the Back Office System
that takes insurance contract data and applies a set of rules to find the best
matching tariff for this client.
Insurance Policy Registry: This is a central data warehouse that stores all the
data related to insurance contracts.
Rick wanted to research the profit loss cycle, but this process only included the profit part
of the cycle. The losses are managed by another process, that is, insurance claim
management:
The IT systems involved in this process are almost the same as those used for insurance
contracts, except for the Claim Processing system. Dedicated managers use this software to
validate claims and assign on-site investigations of damaged vehicles. This system
also identifies fraudulent claims. The fraud detection process is mostly manual; however,
some simple rule-based detectors are also implemented in the system.
Rick decided to improve the Scoring & Tariff System. Predicting potential claims
depending on a customer's data would improve risk identification and contract pricing.
Riskier customers would receive more expensive tariffs. Rick assembled a team and
started prototyping the new scoring algorithm using data from the Insurance Policy
Registry, Back Office System, and Scoring & Tariff System. The company was planning
to implement a binary classifier that would estimate the probability of a valid claim for each
new client. The business people at the car insurance department were already familiar with
the concept of false-positive and false-negative errors, so they used confusion tables to
compare the performance of the current and new scoring algorithms. The business department
refused to use more complex metrics such as the F1 score and suggested calculating the
increase in profits directly from the confusion tables.
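Hypothetically, the profit calculation the business department asked for could look like the following sketch. All monetary values, claim volumes, and the churn assumption are invented for illustration:

```python
# Estimate profit impact directly from a confusion table for a scoring
# policy that prices risky clients higher.
def profit_from_confusion(tp, fp, fn, tn,
                          avg_claim_cost=2000.0, avg_premium=500.0):
    """tp: risky clients correctly priced up
    fp: safe clients wrongly priced up (some may leave)
    fn: risky clients priced normally (claims paid at a loss)
    tn: safe clients priced normally (normal premium income)
    """
    churn_rate = 0.3  # assumed share of overpriced safe clients who leave
    income = (tp + tn + fp * (1 - churn_rate)) * avg_premium
    losses = fn * avg_claim_cost
    return income - losses

current = profit_from_confusion(tp=50, fp=200, fn=150, tn=9600)
candidate = profit_from_confusion(tp=120, fp=260, fn=80, tn=9540)
print(candidate - current)  # positive -> the new model earns more
```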
After three weeks of prototyping, Rick's team created a classifier that was superior to the
current rule-based algorithm of the Scoring & Tariff System. They sent their predictions to
the business department for evaluation. After 30 minutes, the business department asked
for explanations for each score. It turned out that the current scoring system generated a
detailed quarterly report and sent it to a regulatory organization that checked the validity
of the scoring model that was used at the company. The model should be fully interpretable
and understandable by humans. Rick's team used a black-box model with a large number
of parameters to achieve high accuracy, so their model failed this requirement.
Interpretable models such as logistic regression are more suitable here, but their
performance was worse than that of the current system, making them unsuitable for deployment.
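The trade-off can be illustrated on synthetic data, not the company's, with scikit-learn as an assumed stack:

```python
# Contrast an interpretable model with a black-box one; in Rick's case
# the regulator requires the former.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

interpretable = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
black_box = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Each coefficient of the linear model can be explained to a regulator;
# the boosted ensemble often scores higher but is hard to interpret.
print("coefficients:", interpretable.coef_[0][:3])
print("logreg accuracy:", interpretable.score(X_te, y_te))
print("boosting accuracy:", black_box.score(X_te, y_te))
```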
Rick realized that he and his team lost three weeks of project time. This was bad news for
his boss, and he was asked to not make the same mistake twice.
This time, Rick researched and documented all the conflicting requirements. He deepened
his domain expertise by asking his colleagues from the car insurance department about
every detail of the business process and documented it for clarity. While researching, he
validated that the fraud detection process was far more suitable for migration to a machine
learning model:
The main problem was that clients needed more detailed reasoning behind a claim
rejection than, Our model thought that your insurance claim is fraudulent, so we rejected it. So,
Rick suggested that the company use the algorithm only for fraud detection. The idea
was to create a machine learning-based filter that would assess the probability of a claim
being fraudulent. If this probability is high enough, it triggers the current fraud
investigation process. Only 0.5% of all claims were fraudulent, so an accurate model
should decrease the number of manual checks significantly.
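A minimal sketch of this routing logic follows; the threshold of 0.8 is a placeholder to be tuned on real data:

```python
# The model only triggers the existing manual investigation process;
# it never rejects claims by itself.
FRAUD_THRESHOLD = 0.8  # assumed cut-off, to be calibrated

def route_claim(claim_id, fraud_probability):
    """Route a claim to manual fraud investigation or normal processing."""
    if fraud_probability >= FRAUD_THRESHOLD:
        return (claim_id, "manual_fraud_investigation")
    return (claim_id, "standard_processing")

print(route_claim(1, 0.95))  # triggers the current investigation process
print(route_claim(2, 0.02))  # goes straight to standard processing
```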
Rick prepared a solution architecture that shows how the new fraud detection system
will be implemented and how it will be integrated into the existing process. He met with all
the business stakeholders to validate his idea. Everyone agreed with the approach. The
evaluation strategy they agreed on was the same as for the scoring model. After that, the next
prototyping stage began. The model was proven to be successful, and Rick was asked to
implement the full fraud detection module that would be integrated into the existing
business process.
Summary
In this chapter, we have explored innovations and learned about how to manage them.
Good innovation management requires us to carefully plan our own activity and use
different viewpoints such as sales, marketing, and technical leadership to match our ideas
with market requirements.
We have also looked at three case studies, each related to a different aspect of
innovation management. The first one, following the innovation cycle at MedVision, showed us
how an innovation cycle can be applied in a real-world scenario. The second one was
about bringing data science to a retail business, where we explored innovation management in
large organizations. The last one, finding data science project ideas in an insurance
company, demonstrated the importance of using a structured approach for finding data
science project ideas.
In the next chapter, we will explore the data science project life cycle, which allows you to
structure and plan tasks for your team.
8
Managing Data Science Projects
In the previous chapter, we looked at innovation management. We developed recipes that
can help find ideas for data science projects and matched them with their market demand.
In this chapter, we will cover the non-technical side of data science project management by
looking at how data science projects stand out from general software development projects.
We'll look at common reasons for their failure and develop an approach that will lower the
risks of data science projects. We will conclude this chapter by diving into the art and
science of project estimates.
In this chapter, we will look at how we can manage projects from start to end by covering
the following topics:
To see the difference between a traditional software system and a system with predictive
algorithms, let's look at the common causes of failure for data science projects:
[ 173 ]
Managing Data Science Projects Chapter 8
In data science projects, changes come from every direction. They spread internally from
the project's team and externally from the business' customers. Metrics should always
confirm progress. Getting one step closer to your goal may take tens or even hundreds of
failed experiments, which makes fast iterations a must.
The typical iteration length of an Agile project can stretch from 2 weeks to 1 month. The
project team fixes the iteration scope for this duration and delivers it under a strict timeline.
In a data science project, an experiment's result in the middle of the sprint can affect the
sprint's goals and make working on other planned tasks less important due to the new
discovery.
Management must provide a safety net for common issues and problems. Methodologies
that come from the software engineering domain can give you a solid foundation here, but
they do not provide any tools that we can use to manage research and govern data.
If you develop systems that use machine learning under the hood, it is necessary to take
care of the following:
Requirements for validation and alignment: You need to detect and manage
requirement changes from external (customers) and internal (research team)
sources.
Data governance: Your project will need data governance standards, which
should be rigorously applied to each piece of code that works with data. Ideally,
each row of data going through your pipeline should be tracked back to its data
source. All incoming and outgoing datasets, including intermediate reports,
should be tracked and documented.
Next, we will look at common stages that arise in data science projects. We will tie those
stages into a process that's comprised of the project life cycle so that we can see the whole
picture behind data science projects.
Business understanding
During this stage, you apply your domain expertise and research the business side of the
project. You define the business requirements and confirm that their implementation would
make the lives of your customers better. You should also define and document all the
relevant business metrics that will allow you to measure and report on results in a way that
is understandable by the business side. The output of this stage should be a business
requirements specification that has been reviewed, edited, and agreed upon by the project
stakeholders.
Data understanding
During this stage, you research the data architecture of the organization you are working
with. You document data sources, their owners, and the technologies they use. You should
not document every available data source unless you want to mine project ideas from data
(see Chapter 7, Managing Innovation). Focus on data that's useful for the project.
After finding this data, perform an exploratory data analysis (EDA) and research the data
thoroughly. Look at any anomalies and unusual artifacts in the data. Study the reasons
behind their occurrence and document ways to handle them. For example, if the dataset has
a lot of empty values, you should have a plan for how to deal with them, and your actions
should not distort the data in an undesirable way.
You should also look at ideas regarding feature engineering during the EDA stage. Perform
statistical analysis on the data and try to find causal relationships that will help to solve the
task at hand.
The outputs of this stage are as follows:
Data source dictionary: This document briefly describes all the data sources that
are relevant to your project.
An EDA report that shows the conclusions of your data analysis: This
document should describe the approach that you will use to solve the task at
hand and the strategies for handling errors that you have found in the data. You
should include facts that may interest your customer.
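As a sketch of how the missing-value part of an EDA might look in pandas (the columns and values are invented):

```python
import numpy as np
import pandas as pd

# A toy dataset standing in for a real data source; 'income' has gaps.
df = pd.DataFrame({
    "age": [25, 37, 52, 41, 29],
    "income": [40000, np.nan, 85000, np.nan, 52000],
})

# Quantify missing values before deciding how to handle them.
print(df.isna().mean())  # share of empty values per column

# Document the chosen strategy; median imputation is one option, and it
# should be checked for whether it distorts the distribution.
df["income"] = df["income"].fillna(df["income"].median())
print(df["income"].isna().sum())  # 0 after imputation
```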
Data preparation
This stage is where we start working with data. The data preparation stage involves taking
raw data and changing it into a useful format. You read data from its sources and prepare it
so that you can use the data to reach the project's goal. If you are solving a task based on
structured data and plan to use machine learning, you will need to perform feature
engineering. The previous stage should give you insights into the quirks of the data that
you can fix during the data preparation stage. This stage's output is one or more
reproducible data preparation jobs and a dataset that you can use to build and test models.
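A reproducible data preparation job can be as simple as a pure function from raw data to a model-ready dataset, for example (the column names and transformations are hypothetical):

```python
import numpy as np
import pandas as pd

def prepare_dataset(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn raw rows into a model-ready dataset; pure, so reruns match."""
    df = raw.copy()                                    # no in-place mutation
    df["age"] = df["age"].clip(lower=18, upper=100)    # cap obvious errors
    df["income_log"] = np.log1p(df["income"])          # engineered feature
    return df.dropna()

raw = pd.DataFrame({"age": [17, 45, 230], "income": [30000, 50000, 70000]})
prepared = prepare_dataset(raw)
print(prepared["age"].tolist())  # [18, 45, 100]
```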
To make this process less error-prone and easier to monitor, you should care about data
provenance and versioning. Every dataset should be able to be traced back to its source.
Take care to save all the data files, whether they're intermediate or raw. Log
the inputs and outputs of every data transformation job in your code. Data processing bugs
are notoriously hard to spot unless you have complete control of your data streams.
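One lightweight way to approach such logging, sketched here with pandas; the fingerprint scheme is an assumption, not a prescribed tool:

```python
import hashlib
import logging
from functools import wraps

import pandas as pd

logging.basicConfig(level=logging.INFO)

def dataset_fingerprint(df: pd.DataFrame) -> str:
    """A short hash identifying the exact contents of a dataset."""
    return hashlib.sha256(
        pd.util.hash_pandas_object(df, index=True).values.tobytes()
    ).hexdigest()[:12]

def logged_job(func):
    """Log input/output shapes and fingerprints of a transformation job."""
    @wraps(func)
    def wrapper(df):
        logging.info("%s in: shape=%s hash=%s",
                     func.__name__, df.shape, dataset_fingerprint(df))
        result = func(df)
        logging.info("%s out: shape=%s hash=%s",
                     func.__name__, result.shape, dataset_fingerprint(result))
        return result
    return wrapper

@logged_job
def drop_duplicates(df):
    return df.drop_duplicates()

out = drop_duplicates(pd.DataFrame({"a": [1, 1, 2]}))
```

With this in place, every job's input and output can be matched against saved files, which makes data processing bugs much easier to trace.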
Another important point to make is reusability. Code your data processing jobs well. It is
tempting to create a large pile of tangled code lines in a single file and let them do their job.
Doing this will increase your technical debt. The code will work for a while, and then it will
fail without notice. Over time, you may also want to add additional features to the code. If
it is badly written, you will spend an unexpectedly large amount of time making fixes and
debugging.
To ensure that you have robust data processing code, use the following checklist during the
code review:
Modeling
This topic was covered in the What is Data Science? section of this book. In this stage, we
apply our knowledge of data science, machine learning, and deep learning to solve the task
at hand. This is done in the following stages:
1. First, we determine the task type, that is, supervised (classification and
regression), unsupervised (clustering and document topic modeling), or
reinforcement learning.
2. Then, we prepare a list of algorithms that are suitable for solving the task.
3. Next, we come up with a model validation and testing approach.
4. Finally, we optimize the parameters of the candidate models and select the best one.
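The four steps above can be sketched with scikit-learn on a toy dataset. The candidate models and parameter grids here are illustrative assumptions, not recommendations from the book:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Step 1: determine the task type. Here: supervised binary classification.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 2: prepare a list of suitable algorithms.
candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "forest": RandomForestClassifier(random_state=0),
}

# Step 3: validation approach: cross-validation on the training split,
# with the held-out test split reserved for the final check.
scores = {name: cross_val_score(m, X_train, y_train, cv=5).mean()
          for name, m in candidates.items()}

# Step 4: optimize the parameters of the best candidate and select the final model.
best_name = max(scores, key=scores.get)
param_grid = ({"C": [0.1, 1, 10]} if best_name == "logreg"
              else {"n_estimators": [100, 300]})
grid = GridSearchCV(candidates[best_name], param_grid, cv=5)
grid.fit(X_train, y_train)
print(best_name, round(grid.score(X_test, y_test), 3))
```

In a real project, the candidate list would come from your knowledge of the domain and the data, and the validation scheme would respect the data's structure (for example, time-based splits for temporal data).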
Evaluation
While not strictly separate from the modeling and deployment steps, this stage deserves to
stand on its own. At this stage, you must test technical and business metrics and check the
model's individual predictions. Look at the biggest errors the model made
on the test set and think about the changes that you can make to your data, features, or
models that can fix those errors. This is also a good way to spot data processing bugs.
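Inspecting the biggest test-set errors can be as simple as ranking predictions by absolute error. A sketch on a toy regression dataset:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
errors = np.abs(model.predict(X_test) - y_test)

# Rank test examples by absolute error and inspect the worst ones.
# Patterns among them often point to missing features or data bugs.
worst = np.argsort(errors)[::-1][:5]
for i in worst:
    print(f"sample {i}: true={y_test[i]:.0f}, abs_error={errors[i]:.1f}")
```

For classification, the same idea applies to the most confidently wrong predictions (high predicted probability for the incorrect class).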
Your project should have two evaluation strategies: online and offline. Online evaluation
takes care of tracking all the metrics for the already deployed model, while offline
evaluation is used to decide which model will make it to the deployment stage.
Typical data science projects contain hundreds of experiments with different models, data,
and parameters. Each experiment generates new data in the form of metrics, code
parameters, and notes. Use a specialized experiment tracking tool to decide on the success
or failure of a particular experiment. These tools can automatically collect all the logs,
metrics, and artifacts of the experiment to ensure their reproducibility and to ease searching
the experiment results. If you don't want to or can't use a special tool, a spreadsheet can be
a good substitute, although you will need to spend more time working on it. Having a
complete log of all the experiments and decisions you've made regarding modeling and
data preprocessing will help you compare different experiments and make conclusions
about their results.
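If you go with the spreadsheet-style substitute, a minimal CSV-based experiment log might look like this. The field names, file name, and example metrics are illustrative assumptions:

```python
import csv
from datetime import datetime
from pathlib import Path

LOG = Path("experiments.csv")
FIELDS = ["timestamp", "experiment", "model", "params", "metric", "notes"]

if LOG.exists():  # start with a fresh log for this demo
    LOG.unlink()

def log_experiment(experiment, model, params, metric, notes=""):
    """Append one experiment record; write the header on first use."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "experiment": experiment,
            "model": model,
            "params": str(params),
            "metric": metric,
            "notes": notes,
        })

log_experiment("baseline", "gradient_boosting", {"n_estimators": 100}, 0.83)
log_experiment("target_encoding", "gradient_boosting", {"n_estimators": 100}, 0.86,
               notes="categorical target encoding")

# Sort the log by metric to find the best experiment so far.
rows = list(csv.DictReader(LOG.open()))
best = max(rows, key=lambda r: float(r["metric"]))
print(best["experiment"])  # → target_encoding
```

A dedicated experiment tracking tool automates exactly this kind of bookkeeping, and also captures code versions and output artifacts.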
If you need to know about the technical details of model testing, please refer to Chapter
2, Testing Your Models.
The modeling and evaluation stages are closely related and are often repeated several times
in successive iterations before the final stage is reached.
Deployment
At the deployment stage, you publish your best model for your end users and examine the
results. At this stage, complexities are often overlooked. Production code has a separate set
of strict requirements and service level agreements (SLAs) that your model needs to meet.
We can separate those requirements into two categories: functional and nonfunctional.
Functional requirements define your service's features, while nonfunctional requirements
define your SLAs.
Some examples of the functional requirements for your model service are as follows:
Request/response format
Capability for model versioning
A UI for tracking deployments and request statistics
Nonfunctional requirements define the quality of service and availability of your service,
and some of them are as follows:
The requirements for model deployment are similar for different projects, so this part of the
process is subject to reusability. Instead of repeating the same work for each project, you
can develop your own model-serving framework or use an existing one.
Another important point to remember at the deployment stage is evaluation. Evaluation does not
end at the previous stage; you should track all of the model's metrics online. Your
system may trigger alerts or compensatory actions, such as model retraining, if the online
metrics drop below a certain threshold. A/B testing and multi-armed bandits are also part
of the deployment process and can be supported as features of your model server.
Now, you should be familiar with the common stages of each data science project. Let's see
how we can execute each stage with a proper management approach.
Waterfall
The most intuitive way to manage a project is to approach it like you're building a house.
The steps for this are as follows:
Software project managers know that a single waterfall won't solve their issues, so they
compose many smaller waterfalls into sequential iterations. This stage of evolution is called
iterative and incremental software development. The iterative project is comprised of
several phases that are managed in a waterfall fashion. The length of a single iteration is
measured in months. At the end of each phase, the development team shows intermediate
results to the end user for the purpose of collecting feedback. This feedback is used to
jumpstart the next iteration. With each cycle, the understanding of the desired result
evolves until it satisfies the customer's needs.
Agile
The iterative approach is still too heavy for most software projects. They suffer from
changes that pile up in a mountain of technical requirements. In 2001, some of the brightest
heads of the software development world created an Agile manifesto (https://
agilemanifesto.org), which described a new management approach in four simple points:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the latter part of each point, we value the former more.
Today, we associate agile with Kanban and Scrum. These methodologies take somewhere
between 50 and 500 pages to explain. Nonetheless, at its core, Agile is simple. A project
guided by the Agile manifesto alone can easily go astray, and many did. If you leave out
the last sentence of the manifesto, you may end up with a project that has no plan or
specification, which inevitably ends in an uncontrollable mess. People needed more direct
guidance for managing software projects. This is why Kanban and Scrum were invented.
Kanban
First, let's cover Kanban. The best metaphor for explaining Kanban is a conveyor belt.
Imagine that all of your tasks go through a fixed number of stages before they're finished.
The concrete definition of those stages is up to you; for example, they could be as follows:
1. Backlog (a buffer where all incoming tasks are collected before they're processed)
2. Requirements specification
3. Development
4. Code review
5. Testing
6. Deployment
Each stage should have a limit regarding the tasks that can be done in parallel. Kanban
purposefully limits the number of simultaneous tasks to increase throughput. If the team
becomes blocked because there are too many tasks sitting in the deployment stage, then all
the team members who are capable of making shipments of the end product to production
should stop doing their tasks and switch to getting rid of the bottleneck. Once the problem
has been resolved, the team may decide to work on other tasks according to their priority.
Because of this, Kanban favors cross-functional teams where everyone can help push tasks
through each stage. Kanban does not remove the concept of roles, but it states that no
matter the role, each team member should be able to help in dealing with a bottleneck.
Kanban focuses on completing a single task from start to finish as fast as possible. The main
metrics that we can use to measure the effectiveness of Kanban projects are as follows:
Lead time: The time it takes for a task to move from the backlog to being
completed on average.
Cycle time: The time it takes for a task to move from the starting stage to being
completed on average. In our example, the cycle time would be measured between the
requirements specification and deployment stages.
Throughput: The average number of tasks you can get done during a time
interval, that is, a day, a week, or a month.
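All three metrics can be computed directly from task transition timestamps. A sketch with hypothetical task records:

```python
from datetime import datetime

# Hypothetical task records with a timestamp for each Kanban transition:
# when the task entered the backlog, when work started, and when it was done.
tasks = [
    {"backlog": "2019-05-01", "started": "2019-05-03", "done": "2019-05-07"},
    {"backlog": "2019-05-02", "started": "2019-05-02", "done": "2019-05-09"},
    {"backlog": "2019-05-04", "started": "2019-05-06", "done": "2019-05-10"},
]

def days(a, b):
    return (datetime.fromisoformat(b) - datetime.fromisoformat(a)).days

# Lead time: backlog -> done; cycle time: work started -> done.
lead_time = sum(days(t["backlog"], t["done"]) for t in tasks) / len(tasks)
cycle_time = sum(days(t["started"], t["done"]) for t in tasks) / len(tasks)

# Throughput: tasks completed per unit of time (here, per week).
span_days = days(min(t["backlog"] for t in tasks), max(t["done"] for t in tasks))
throughput_per_week = len(tasks) / (span_days / 7)

print(round(lead_time, 1), round(cycle_time, 1), round(throughput_per_week, 1))
# → 6.3 5.0 2.3
```

Most Kanban board tools record these transitions automatically, so in practice you would export the timestamps rather than type them in.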
In general, you don't create a fixed project plan with fixed deadlines when using Kanban.
Also, you don't have to estimate each task individually since the metrics will take care of
that. Measure your team's throughput for several weeks so that you have an idea of how
much you will be able to deliver in the near future.
Kanban's powers are also its limitations, some of which are as follows:
Kanban works best when the amount of work in each of your tasks is the same. If
some tasks are taking significantly longer than the others, your metrics will stop
being useful.
If you don't want to work as a cross-functional team, your throughput will suffer
from bottlenecks, which makes using Kanban worthless.
Kanban does not give you the tools you need to manage deadlines, project scope,
and budgets. The single thing it takes care of is optimizing throughput.
Kanban is a great software management approach for projects with repetitive tasks. It can
also be partially applied to parts of your project where it makes sense to use it. Here are
some examples of projects where using Kanban is a good idea:
In a software support project, where you should take care of deployment and
fixing frequent issues.
If your data science project has a dedicated team that performs experiments with
machine learning models, using Kanban will help increase their throughput.
Projects where you need to create lots of similar things, such as hundreds of web
forms, data mappings, or identical machine learning models.
Scrum
Another popular management methodology from the agile family is Scrum. The main idea
behind Scrum is the sprint. The sprint is a set of tasks with a fixed deadline and duration.
Typical sprint durations are one week, two weeks, and one month. Explaining the entirety
of Scrum would take another book, so we will only present the basics here. The core elements of the Scrum process are as follows:
1. Backlog grooming
2. Sprint planning
3. Sprint execution
4. Retrospective
Akin to other agile methodologies, all the tasks go into the project backlog. The project
backlog needs periodic grooming: all of the obsolete tasks should be deleted; the rest of the
tasks need to be ordered by priority.
The main component of Scrum is the sprint. A sprint is an iteration with a fixed deadline
and a defined goal. The typical length of a sprint is 2 weeks. It always starts with a sprint
planning meeting, where team members observe the project backlog and take tasks into a
sprint. Each task is estimated in abstract story points. The goal of estimating in story points
rather than hours is to make estimations relative rather than absolute. For example, we can
consider a task with one story point trivial and a task with two story points as being
slightly harder but still easy to complete. Four to six story points would indicate a normal
task. Another system of story point estimates suggests using powers of two, from 2 up to 128, as task
estimates. On the first sprint, the estimations are fairly approximate. On the second sprint,
you compare the new tasks with the previous ones and see how many story points the task
is worth. After four sprints, you can see how many story points your team can complete on
average. You can also calculate an approximate hour equivalent of a single story point,
although this should only be used as a reference and not as a substitute for story points
during the sprint planning process.
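The velocity bookkeeping described above can be sketched as follows. The sprint data and the hour conversion are hypothetical, and the hour equivalent is for reference only, as the text cautions:

```python
# Hypothetical story points completed in each of the first four sprints.
completed = [[2, 4, 8, 2], [4, 4, 2, 8, 2], [8, 8, 4], [2, 2, 4, 8, 4]]

velocities = [sum(sprint) for sprint in completed]    # points per sprint
average_velocity = sum(velocities) / len(velocities)  # use for capacity planning

# A rough hour-equivalent of one story point, for reference only:
# assume one developer working a two-week sprint of 8-hour days.
sprint_hours = 10 * 8
hours_per_point = sprint_hours / average_velocity

print(average_velocity, round(hours_per_point, 1))  # → 19.0 4.2
```

After a few sprints, the average velocity stabilizes and tells you how many story points the team can realistically commit to in the next sprint.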
During planning, each team member estimates the task on their own, and then their
estimations are compared. This helps ensure everyone understands the task definition in
the same way. Differences in estimates signify that the task needs a clearer explanation and
evaluation in terms of the SMART criteria.
The sprint starts after the planning phase. When you start working on the sprint, it will be
locked. You cannot change the scope you defined during the planning phase. The primary
focus of the team is to complete all the tasks in the sprint before it ends. This strategy allows
you to achieve your planned goals while being robust to changes. The main strength of
Scrum is also its main weakness. If your customer comes into work in the middle of the
week with an extremely important task that needs to be done ASAP, you should do your best to
convince them that the team will deliver this during the next sprint. Scope locking is an
essential mechanism that makes sprints work. If you depart from this rule often, Scrum will
become an obstacle rather than a beneficial and effective management approach.
In practice, scope locking can cause problems, especially if you are working in a B2B
environment. For situations where you have no options and are forced to change sprint
scope, you have two options:
Trade tasks: You can remove a task from the sprint and add a new one.
Start a new sprint: You can stop the current sprint and plan a new one.
Using these options frequently makes Scrum ineffective. Try to negotiate a fixed sprint
scope with your customers and show them that it brings benefits such as planned delivery
while leaving space for requirement changes.
A good strategy you can use to avoid unexpected scope changes is to ask your customers to
take part in backlog grooming and sprint planning. Scrum experts suggest that you should
assign a special product owner role for this task. The product owner should decide on task
priorities, sprint goals, and negotiate all the conflicting requirements with project
stakeholders.
Scrum came directly from the software development world, so it has fewer limitations than
Kanban. The price lies in its complexity: Scrum is not an easy methodology, and it will
create management overhead. Each team member should understand Scrum if you want it
to work. In complex projects, you may need to give someone the dedicated role of Scrum
master. This is someone who will take care of applying the methodology to one or several
of your projects.
In the next section, we will look at choosing a methodology according to the needs of your
project.
Choosing a project methodology goes hand in hand with estimating data science projects.
In the next section, we will define the goals of the estimation process and learn how to
make estimates.
The largest problem with estimations is that they fail. Our plans are inaccurate, incomplete,
and often irrelevant to how the real work will be done. Even experienced software
developers struggle to estimate the total number of hours that it will take to do a single task
unless they have done it multiple times already.
Research shows that humans are bad at absolute estimates. Our brains are simply not
suited to making accurate mental models of complex multilayered projects. For example, if
we were to ask a bunch of strangers about the height of the nearest building, the majority
would fail to give you the correct answer. However, if you told them the height of several
buildings, their estimates would be much more accurate. This is true not only for building
height estimates but for all kinds of estimation.
To use relative estimates in data science projects, you need two things: relevant samples
and a good estimation process. We can think of relative estimates as simple statistical
estimators that average the length of all previous relevant tasks. To create such an
estimator, we first need to collect a dataset. If you follow a waterfall process, then to get one
new data point in your estimation dataset, you need to complete a whole project from start
to end. You may need to fail at estimation in many projects before you get good at estimating
one particular type of project.
The trick is to get down to the individual task levels. Scrum suggests that you use relative
story points instead of absolute hours for this reason. First, your estimates become relative
at the task level, then at the sprint level, and finally at the project level. If you have no prior
experience that will help you to make relative estimates, the only absolute estimate you
should make is one for the first sprint of the project. From here, you should use the
previous tasks as a basis for making new estimations.
You don't have to use Scrum to benefit from relative estimations. Scrum provides one way
to make them work, but it may not be ideal for your circumstances. If that is the case, you
can adapt any management methodology for relative estimation.
We can look at estimates from two perspectives. The first one will be
familiar to project managers and team leaders who are mostly concerned
with project delivery: the implementation perspective. The main goal of
estimates in this example is to provide correct expectations regarding how
much time and money will be required to build the solution.
Let's start with the outline. The outline is a hierarchical list of high-level tasks that you will
need to complete. The simplest outline for a waterfall project may look like this:
Using a pure waterfall approach is risky, so we will split the project into several stages, with each
going through several successive steps. Depending on the complexity of the stage, you will
need to make one or several iterations of the same stage to complete it.
try to create an outline for every two-week sprint, but this is unrealistic because of the ever-
changing nature of data science projects.
You should start by defining the rough steps and then detail them. If you have created
similar projects before, you can look at their outlines so that you know where to start.
Collecting and using data from other projects can serve as a primary source of relative
estimates, so do not underestimate your prior experience.
You will have difficulty decomposing some tasks. Being unsure about an approach to a task
is a red flag signaling that you need to communicate with your customers and figure this
out. If you do not know how many systems you will need to integrate with, you will have a hard
time decomposing all the stages and tasks related to integration. In this case, you need to call
the customer and quickly discover the necessary information. However, you may have
hundreds of such questions during the estimation process, and tracking new information
will quickly become ineffective. It is a good idea to prepare a numbered list of questions
first. Answers can change during the estimation process, so each question should be
assigned a date. Ideally, those questions should be shared in a format that makes
collaboration easy.
When your outline is detailed enough, it is time to design a software architecture proposal.
This is a crucial step because matching outlines with customer requirements is not always
economically viable or even possible from a technological standpoint. You should have at
least a rough idea of what technologies you will use, how they will integrate with the rest
of the customer's system, and how your solution should be deployed. If there are any
crucial nonfunctional requirements, such as 24/7 availability, the software architect should
also think about how to implement them in terms of technology and system design.
Drafting a high-level architecture vision will help explain this outline. Do not hesitate to
change the outline if you think that's necessary. Software design is a complex task that an
experienced engineer should handle, so if you do not have deep expertise in designing software
solutions, ask for help from someone on your team, or even better, make software design a
collaborative effort.
After you've completed the outline and have a software architecture vision, you can start
estimating the project. I recommend using simple statistical estimation procedures such as
the program evaluation and review technique (PERT). PERT asks you to provide three estimates for each task:
Optimistic estimate: The time you plan to spend on the task if everything goes
well; minor technical problems and requirements issues may arise.
Most likely estimate: The most realistic estimate you can give for the task.
Pessimistic estimate: The time that's required to finish the task if problems arise.
This includes additional risks for dealing with experiments that have gone
wrong, complex debugging sessions, and having long debates with the customer.
Then, you can calculate a simple weighted average to get the final estimate:
Calculating the standard deviation is also useful when it comes to making confidence
intervals:
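The conventional PERT formulas are E = (O + 4M + P) / 6 for the weighted average and SD = (P - O) / 6 for the standard deviation. A sketch with hypothetical task estimates:

```python
def pert(optimistic, most_likely, pessimistic):
    """PERT three-point estimate: weighted mean and standard deviation."""
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    std_dev = (pessimistic - optimistic) / 6
    return expected, std_dev

# Hypothetical task: 4h if all goes well, 8h most likely, 24h if problems arise.
e, sd = pert(4, 8, 24)

# A rough ~95% confidence interval around the estimate.
low, high = e - 2 * sd, e + 2 * sd

print(round(e, 1), round(sd, 1), (round(low, 1), round(high, 1)))
# → 10.0 3.3 (3.3, 16.7)
```

Summing the expected values of individual tasks gives a project-level estimate, and the per-task variances can be combined to produce a project-level confidence interval.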
Use data from any finished projects as a base for relative estimation. The more external
resources you use for estimation, the more accurate and risk-averse your estimates will
become.
Since we are ineluctably bad at estimation, all of the estimates are only a rough idea of your
current view of the project implementation plan. Project outlines and estimates should
constantly change and be adapted to the current situation. You should periodically check
whether the original plan should be changed and updated. If so, convey this to the
customer and work through the necessity of the changes. It may be that the customer added
several new features to your backlog, thinking that they were present in the original scope.
If this is not the case, negotiate a scope expansion, followed by a budget and deadline
extension. If these features are not critical enough, advise the customer to remove them
from the backlog. With each completed task on a particular type of project, your experience
will grow. As a result, you will be able to anticipate more of the customer's needs and
include them in the base plan, which will make estimates more accurate. Store all of the
versions of the project estimations so that you can track all scope changes effortlessly.
The project architecture vision should also be robust to changes. The more customized
your solution is, the less likely it is that you will create an ideal architecture vision that will
survive all scope changes. Plan ahead and include several variation points in the parts of
your solution that are the most likely to change. A variation point is a software component
(or a set of software components) that is expected to change from the start. A plugin
architecture and microservices with fixed contracts are examples of variation points that
allow for easy extension or substitution.
Having a fine-grained view of how to achieve the end goal is useful. In contrast, relying on
the exact calculations of how long it will take you, or using extremely detailed outlines, is
not. Use estimates wisely; they will help you align your implementation plans with
customer demands.
Summary
In this chapter, we looked at how to manage data science projects. We explored how
analytical projects differ from software engineering projects and studied the data science
project life cycle. We looked at how we can choose a project management methodology that
suits our needs and uncovered practical guidelines for estimating data science projects, and
also discussed the limitations of long-term plans. No matter how good your plans and
estimates are, data science projects have many inherent risks that can become the failing
points of your projects.
In the next chapter, we will look at common pitfalls of data science projects.
9
Common Pitfalls of Data Science Projects
In this chapter, we will explore the common pitfalls of data science projects: mistakes that
are easy to commit and that increase the risks your projects may encounter. Knowing how
to deal with them is important for the success of your projects.
Different types of data science solutions offer many tempting shortcuts in project execution
that can lead to undesired difficulties in the later stages of the project. We will examine and
mitigate those issues one by one while following the data science project life cycle.
To make sure that the project goal is defined correctly, you may use the following checklist:
You have a quantifiable business metric that can be calculated from the input
data and the algorithm's output.
The business understands the most important technical metrics that you use.
The task is defined in data science terminology. You know if you are solving
classification, regression, or another kind of task and have an idea of how to
approach it technically.
You understand the details of the (business) process and domain of the problem.
All of the data sources that are required for the start of the project exist and can
be accessed by the implementation team.
Another important issue is the documentation. In data science projects, experiment results
can often change the course of the project. It is important to log all the experiments, along
with the decisions and conclusions that you make. Fixing any incoming changes when it
comes to data and requirements is also necessary. When you have this information, you
will be able to see the whole line of thought that makes your solution work as it does.
Being aware of the common risks of data science projects will help you recover from major
mistakes, but the devil is in the detail. In the next section, we will cover the different
management approaches to research projects, prototypes, and end-to-end production
systems and cover their specifics.
The first pitfall of research projects is the absence of scope. Every research project must
have a clear scope, otherwise it won't be possible to finish it. It is also important to fix any
external constraints for a research project. Your research budgets will grow as the scope's
size increases, so limited budgets may also affect the depth and length of research that you
will be able to do.
Common Pitfalls of Data Science Projects Chapter 9
When you have an idea of your research capacity, you can start filling in the experiment
backlog with your team. Each entry in a backlog should contain an idea that may advance
your model's quality or help you reach your desired functionality. Each experiment should
be defined in accordance with the SMART criteria.
For example, an initial experiment backlog for a binary classification problem may look like
this:
1. Perform exploratory data analysis (EDA) and familiarize yourself with the data.
2. Create a baseline model using gradient boosting and basic preprocessing.
3. Test the encoding for categorical variables: Hash encoding.
4. Test the encoding for categorical variables: Target encoding.
5. Feature engineering: Measure the effect of date features.
6. Feature engineering: Aggregate the features over day windows.
You should always prioritize experiments by expected returns. Some experiments may take
a lot of time to complete in full, but less time for initial tests. If you and your team are
unsure about priority, it is better to go wide rather than deep. Perform quick initial quality
checks for all of the experiments in your current research iteration to determine the
approximate time investment for each experiment. For every experiment, record the following:
Input data
Date and time of the experiment
Exact code version that produced the experiment's results
Output files
Model parameters and model type
Metrics
You can use a simple shared document for this purpose or invest in integrating a
specialized experiment tracking framework with a user interface (UI) into your project.
Experiment tracking frameworks pay for themselves in projects that have lots of
experimentation.
Finally, care about making each experiment reproducible and document the conclusions
you drew from the results. To check for reproducibility, use the following criteria for each
individual experiment:
Input data is easily accessible and can be discovered by anyone on the team.
The experiment code can be run on input data without errors.
You don't need to enter undocumented configuration parameters to run the
experiment. All of the configuration variables are fixed in an experiment
configuration.
The experiment code contains documentation and is easily readable.
The experiment output is consistent.
The experiment output contains metrics that you can use for comparison with
other experiments.
Conclusions from experiment results are present in the documentation,
comments, or output files.
Successful research projects can help you find an answer to a complex question or push the
boundaries of scientific fields and practical disciplines. Research projects are often confused
with prototypes, though prototypes generally pursue different goals. In the next section,
we will look at the properties of prototyping projects, along with specific management
strategies for these types of projects.
Focusing on core features does not mean that your prototype should not have a pretty UI or
stunning data visualizations. If those are the main strengths of your future product, by all
means include them. To identify the core features of your product, you should think in
terms of markets and processes.
Ask yourself the following questions to check whether a particular feature should be
included in the MVP:
Then, look at what features are crucial for reaching the desired goal, what is
complementary, and what is unrelated to the goal. After that, look at your competitors and
think about what will distinguish your MVP from them. If you have no distinguishing
features or you don't solve a different problem, you will have a hard time competing with
an already developed and possibly highly adopted competitive product.
It is beneficial to decompose your competitor's products into feature lists and perform the
same analysis on them to see what supplementary features you should implement to be on
par with the market. Doing a detailed feature analysis of competitor products is more about
product development than creating an MVP, so we won't go any further into this topic.
With the information you will have acquired from feature analysis, you should be able to
rank all the features and select a few core ones for your MVP. Keep in mind that you may
change the priority of some features if you know that they are extremely important for
success.
Another important aspect of delivering MVPs is to cut corners. If you have a lot of
prototypes, having project templates, internal rapid prototyping tools, and document
templates will save you a great deal of time and money. If you feel that your team spends
more time creating MVPs using particular types of algorithms or technologies, think about
creating tools that will automate most of the work. It is crucial to remember that an MVP
should be just good enough for your customers and users to see its benefits. As Voltaire
said, "The best is the enemy of the good."
For a more complete review of creating prototypes that can be transformed into real
projects, please go to Chapter 7, Managing Innovation.
In the next section, we will have a brief overview of how we can manage an MVP in a
consulting company environment.
Using these answers, Mark has created the core feature list for the MVP:
Mark knows that the manufacturing plant director, Anthony, who is the key decision-maker,
appreciates systems with a slick and intuitive UI. Also, Mark is sure that preparing
a model quality report is essential for comparing the efficiency of as-is and to-be quality
control processes.
After the customer approved the scope, Mark decided to use Scrum as a management
methodology and focused on delivering the first working version as fast as possible. To get
up to speed, Mark's team, who were already experienced in applying computer vision
algorithms, used the internal software libraries they had developed for rapid prototyping.
A report template meant that they didn't have to spend a lot of time writing documents,
so they could focus on MVP development.
This example concludes our overview of prototypes. Now, we are ready to dive into the
topic of assessing and mitigating risks in data science projects.
The first major risk is the inability to sustain a constant change stream. Data science projects
involve scope changes, and you should be able to work with them without letting the
project fall apart. Scrum gives you the basic tools you need for change management by
freezing the development scope over the course of a sprint. However, for any tool to
work, your customer should understand and follow the required processes, along with
your team.
Another issue is that implementing a given change may cause a lot of unexpected bugs.
Data science projects often lack automated testing procedures. Without constant testing of
existing functionality, one simple change can ripple into several bugs, and many more
bugs go unnoticed and get passed into production. It is also important to implement
online testing modules, since quality assurance does not end at the development phase.
Models can degrade in performance over time, and your system should monitor for abrupt
changes in business and technical metrics.
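To make the last point concrete, a monitoring module can compare a recent window of a tracked metric against the preceding baseline and raise an alert on an abrupt drop. The following is only a minimal sketch; the function name, window size, and the daily accuracy series are illustrative assumptions, not part of any specific monitoring product:

```python
from statistics import mean

def detect_degradation(history, window=7, tolerance=0.05):
    """Flag an abrupt metric drop by comparing the mean of the most
    recent `window` observations against the preceding baseline."""
    if len(history) < 2 * window:
        return False  # not enough observations to compare
    recent = mean(history[-window:])
    baseline = mean(history[-2 * window:-window])
    return recent < baseline * (1 - tolerance)

# Daily model accuracy: stable for a week, then an abrupt drop.
accuracy = [0.91, 0.90, 0.92, 0.91, 0.90, 0.91, 0.92,
            0.84, 0.82, 0.83, 0.81, 0.82, 0.83, 0.82]
print(detect_degradation(accuracy))  # True: the drop exceeds the tolerance
```

The same check applies equally to business metrics such as forecast error; in practice, an alert would trigger an investigation or retraining rather than a print statement.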
If your team has not planned for production in advance, you will face many complex
engineering issues related to nonfunctional requirements such as the availability,
scalability, and reliability of your system. To avoid this, care about system design and
software architecture from the start of the project.
Even if everything has gone well technically, the final result may still bewilder your
customers. If key stakeholders do not see the benefit of your system, look for mistakes in
your goal definitions. It is often the case that project goals change mid-project, along with
the customer's views of what's best for their business. To avoid this major risk, you should
constantly check that the task you are working on is important and that the solution
method is correct.
In the following table, we have enumerated the common risks and their solutions to sum
up this chapter:
Stage: Research
Problem: The research has no scope and plan of action.
Solution: Plan ahead using the research backlog. Prioritize entries in the research backlog and periodically check whether there are any obsolete entries that should be removed. Assess your expectations of each experiment by performing quick tests, if possible.

Stage: MVP
Problem: The prototype does not show how to solve the user's problem.
Solution: Think about every prototype as an MVP that solves your customers' problems. Define your scope by taking into account the minimum amount of functionality required to solve your customers' problems.

Stage: MVP
Problem: The MVP includes too many unnecessary features that take time to develop.
Solution: Use feature analysis to define the MVP scope.

Stage: MVP
Problem: The MVP takes a lot of time to develop.
Solution: If your team makes a lot of MVPs, think about creating rapid prototyping frameworks and project templates to speed up the development process.

Stage: Project development
Problem: The customer is constantly pushing the team to make urgent scope changes.
Solution: Advocate for the use of an Agile development methodology for your project. Track project scope changes to show how they affect project deadlines.

Stage: Project development
Problem: The customer does not see how the system solves their problem.
Solution: Constantly review your project goals and make sure that your way of solving the problem has been confirmed by the customer.

Stage: Project development
Problem: New changes introduce a lot of bugs.
Solution: Write automated tests.

Stage: Production deployment
Problem: The model's quality is degraded in production and the system has no tools to solve this problem.
Solution: Develop an online testing module to track metrics in production. Validate incoming data. Periodically retrain your models on new data.

Stage: Production deployment
Problem: The system is not suitable for production usage.
Solution: Fix functional and nonfunctional requirements for your system. Prepare an architecture vision that provides a production-ready system design.
Let's move on to the next case study that will demonstrate how you can detect and control
these common risks.
Jane: What about forming a more defined goal? I know it can be hard without
knowing about the technical details of AI and machine learning, so let me help you.
Our company provides sales forecasting solutions for logistics companies like yours.
The typical use case is to integrate our system into your ERP so that your personnel
won't waste time making forecasts by hand. Does this sound interesting to you?
Jane: OK. Our system makes demand forecasts for each item in your warehouse. We
can estimate the expected number of items that you will need to ship by a given date
in the future. To do this, we need access to your ERP database. I bet your employees
already do this, but the process generally takes a lot of time and the results are not as
accurate as they could be, especially if your product catalog contains hundreds or
thousands of items.
Max: Yes, the process takes a lot of time indeed. We even struggle to add new items to
the catalog since this will require us to hire twice as many people in the future.
Jane: If what I've offered sounds interesting to you, I'd be happy to assist you in
making a small MVP project that will show you how beneficial our system will be to
your business. However, we'll need a bit of assistance from your side. To show you the
positive effect of using the system, we'll need to make sure that the goal is stated in
numbers that can be understood by the company's management. Let's set up a meeting
so I can show you how we can quantify the project results and what support we'll
need from your IT specialists.
After this discussion, Jane had a meeting with Max, where she showed him a common
approach to evaluating the project's performance from the business side. She documented
all the decisions in a small business requirements document and confirmed the results of
the discussion by showing it to Max. Jane also discussed the success criteria for the MVP.
Max agreed that production deployment was not necessary at this stage of the project and
that they could make a decision about production integration by looking at the offline
testing results, which Jane's team would present at a meeting. This way, she covered all
three major risks in the common group.
Since this was an MVP project, she also considered risks in the MVP group. Jane asked Max
how they could present the system to end users and company management so that they
could understand the benefits of such a system. They settled on integrating the system
directly into the company's ERP solution as this was the most cost-efficient way for both
sides. Max had lots of ideas regarding useful features they could include in the MVP, along
with visualization dashboards and system administration tools. Jane noticed that those
features did not add to the core value of the system and would be better deferred to the
implementation stage.
After creating a clear goal definition and shaping the MVP constraints, Jane's team
proceeded to the implementation stage. To close the risks in the project
development group, Jane coordinated with the customer about project management
methodology. Based on her previous experience, Jane decided to use Scrum for this MVP.
She explained the importance of fixed-scope iterations to the customer and made sure that
everyone agreed on how changes would be incorporated into sprints during the planning
phase. She also shared the project's backlog with Max using a software project management
system so that he could add new tasks and prioritize them with the team, acting as the
product owner. Jane made sure that Max would have the time to participate in the sprint
planning meetings so that the project wouldn't go astray. The general software development
practices in Jane's start-up had already closed the risks related to code quality, delivery
pipeline automation, and automated testing, so she didn't need to worry about them.
Summary
In this chapter, we have explored the common pitfalls of data science projects, as well as
how to manage research projects with tools such as experiment backlog and experiment
tracking. We have also seen how prototypes differ from research projects and looked at
how to manage prototype development from the standpoint of an MVP. Those techniques
were then summarized in a case study that concerned MVP development in a consulting
company. Finally, we enumerated and systematized the major risks and their solutions for
research, prototype, and production systems.
In the next chapter, we will look at how to grow data science products and improve
internal team performance by using reusable technology.
10
Creating Products and Improving Reusability
In this chapter, we will conclude the Managing Data Science Projects section by looking at
how we can help products grow and improve reusability. This chapter focuses on teams
who work to deliver custom solutions for customers. You will find the content in this
chapter helpful if your team helps internal business stakeholders at the company or
external customers who want to buy expertise, services, and software solutions. The
benefits of product thinking and reusability are underappreciated when it comes to the
consultation stage, whereas they become more obvious if your team is developing a
product that is focused on a market niche.
As with software products, you can decompose your activity into service features. Your
team will be good at some aspects of your service, which will set your department or
company apart from your competitors, but you will likely find other aspects lagging
behind organizations that focus on that particular type of service. For example, a data
science consulting company may shine at creating custom models, while its user interfaces
(UIs) will be worse than those of a specialized UI development company. These tradeoffs
are justified by market demands: companies that want to buy custom model development
services rarely need the same team to deliver a high-end UI.
Thinking about services as products opens up new possibilities for improvement. You can
analyze your best service feature and think about the improvements that you can introduce
to make your service better and more profitable, thereby opening up new possibilities for
continuous service quality improvement.
In most cases, the best team delivers more functionality in less time and at a lower cost
while being more customizable and better in quality than products that are optimized
toward a broader market. As a service provider, you can do one thing that no product is
capable of: creating solutions that are highly specialized for solving a specific problem for
your customer. The question is, how can we make custom solutions easier to implement
while not losing the key benefit of providing a custom solution to a customer's problem?
The first step of thinking about projects in terms of products is to consider the project's
stage. In the next section, we will consider the concept of reusability in relation to different
project types and stages.
Products
Custom solutions
Creating Products and Improving Reusability Chapter 10
While products are reusable by nature, custom solutions often aren't. However, custom
solutions can be built from reusable components without losing the qualities of made-to-
order software. To grow these internal components and improve reusability, you should
take care of them through each stage of the project:
Minimum viable product (MVP): Think about the results of your previous
projects that you can reuse while having a minimal investment of time. Even if it
looks like building this functionality from scratch will be easier, creating a
reusable component can save you a lot more time over a longer time span.
Development: Think about what reusable components you can include in the
system that you are building. Decide if any new functionality that you are
building can be transformed into a reusable component.
Production: Stabilize and integrate any existing technology.
The next big step in improving reusability is managing research. The question that arises is:
how do we decide when a project needs a research phase? First, let's look at what kinds of
research we can do in data science projects:
Research is hard to plan and estimate, but it is necessary for finishing data science projects
successfully. It opens up new opportunities, provides insights, and changes processes.
Finding the right balance between research and implementation is crucial for the project's
success. To manage research efficiently, you can split your project into two logical
subprojects: research and solution development.
You can integrate research phases into the solution development project in several ways, as
follows:
To conclude, let's look at a way in which we can integrate research with implementation.
The team has noticed that No category issues are the most difficult and time-consuming for
the support team to fix. This is because the free-form description often lacked the
information that was requested by the templates, and so the support engineer needed to
ask for a lot of additional information to determine the issue category and fill in the
template form by himself. The idea was to use historical data to classify the incoming No
category issues and ask the user to fill in the information that's required by the template
before sending the issue to the support department.
Building a text classifier that was capable of handling an extensive list of issue
categories (more than 1,000)
Creating an issue classification system that would reliably operate on a 24/7
schedule and integrate this system with the company's support portal
Lucas thought that the project actually consisted of two subprojects: research and system
implementation. The research project had an unbounded time span and required the data
scientists on the team to build various text classifiers and assess their quality. The main goal
of the system implementation project was to build a highly available model server and
change the company's support portal code so that they could integrate the model into the
issue creation workflow.
Lucas created two project teams, one for each subproject, and set up separate task backlogs
for them:
The research team: The text classification task was not standard and did not have
a direct solution because most pre-built text classification models can only handle
a very limited number of classes and this project required the model to be stable
with 1,000 classes. The research team had a team leader who managed and
prioritized the research backlog so that the team would focus only on the most
cost-effective and promising ideas. Everyone understood that this research could
not have clear deadlines, so the team decided to use the artificial constraint of the
general project deadline. The goal was to train the best possible text classifier,
given the time constraints and the company's data.
The implementation team: They were focused on building a software solution. It
consisted of a machine learning engineer, backend software developer, UI
software developer, a system analyst, and a team leader. Lucas decided that it
would be best to use Scrum for this part of the project. The team discussed the
deadlines, set up a fixed number of sprints, and made sure that they would be able to
complete all the work within the time frame. The task was clear from a
technical standpoint and didn't require extensive research or prototyping. The
team had previously built model servers, so they decided to reuse their existing
technology and add a few missing features.
The last question that remained was how they could integrate new research results without
breaking the system. This integration could easily become the project's bottleneck because
the research team constantly switched data preprocessing methods and machine learning
frameworks in search of the best result for the project. Lucas decided that the research
team would need to provide their models as a software library with a fixed interface,
designed and documented together with the implementation team. This interface would
establish the communication contract between the research and implementation teams so
that new model versions could be rolled out by simply updating the research team's
library version.
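Such a contract can be as small as a single abstract class. The sketch below is hypothetical (the `IssueClassifier` name and the keyword baseline are invented for illustration), but it shows how any new model version can be swapped in behind the same `predict` signature:

```python
from abc import ABC, abstractmethod

class IssueClassifier(ABC):
    """Fixed interface agreed between the research and
    implementation teams."""

    @abstractmethod
    def predict(self, texts):
        """Return one category label per input issue text."""

class KeywordBaseline(IssueClassifier):
    """Toy stand-in; a real library release would wrap a trained
    text classification model behind the same interface."""

    def predict(self, texts):
        return ["billing" if "invoice" in t.lower() else "no category"
                for t in texts]

model = KeywordBaseline()
print(model.predict(["Wrong invoice amount", "App crashes on start"]))
# → ['billing', 'no category']
```

As long as each library release honors this interface, the implementation team can upgrade the model by bumping a version number, as described above.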
After you have decided on the best approach for integrating research, along with the
solution development process, and thought about which reusable components you can
benefit from, you can move on and see how your team can gradually improve the
reusability of the technology that you deliver.
Improving reusability
In a custom project development setting, improving reusability means developing and
reusing internal components to build better solutions faster. Look at which parts of your
work are repeated across all of your projects. For many companies, it's model deployment
and serving. For others, it's building dashboards on top of the model.
Use open source projects as a starting point. In many fields, the best tools are provided by
commercial companies. Thankfully, the data science community is a very open-minded
group, and the best tools you can find are open source. Of course, there are great
commercial products too, but you can build production-grade systems with state-of-the-art
models using open solutions. They will give you a very solid foundation to build upon.
See if you can use those tools to decrease the total amount of time your team spends on
similar repetitive activities over different projects. If you find that no open source solution
solves your problem, then you have two options: build a reusable software component
from scratch or buy a product. But how can you choose between buying a prebuilt solution
and building one yourself? Prebuilt products are best suited for solving the common
problems of data science teams. On the contrary, building custom internal reusable
solutions should be left for the parts of the process that are unique to your team and
company. To find a solution to your problem, you should look to open source solutions
first, then to the product market, and lastly consider building a new technology internally.
Keep in mind that reusable components require special care. As your toolchain grows, your
team will need to spend more time supporting it. Reusable tools necessitate designing APIs
and workflows that will stand the test of time. Maintaining reusable tools that are used in
multiple projects is not an easy task, but it will allow your team to build upon itself, making
each successive project more efficient than its predecessor.
To build or to buy?
Often, your team will have this dilemma. You can build something
yourself, use an open source solution, or buy a commercial product. This
decision is never easy and has lasting consequences. Building something
by yourself from scratch will require significant investment. You can cut
corners by using an open source solution, but you may easily run into
constraints or bugs that will force your team to dive into the development
of this open source component. Trying to fix bugs or add features to large
amounts of existing code that's been written by people from all around the
world will require significant engineering efforts and will make switching
to a different technology harder. Commercial products are the least
customizable of the three, so your team must devote time to deciding
which product matches your needs best.
Reusable components can make your team a lot more efficient, without making your
solutions generic, so that each client will have a customized service. Sometimes, you will
spot an opportunity that will allow your team to make a jump into a totally different
business model by turning the highly demanded custom solution into a product. In the next
section, we will see how we can seek and build products.
Another sign that you should make products out of your projects is a high demand for
some type of custom solution that you built. If every client asks you to build a customer
support chatbot and you have already built tens of them, why not make a product to make
everyone's lives easier?
If your company has no prior experience of delivering new products to the market, take
this task seriously and be ready to transition into a new world. If you want to avoid
mistakes, take the time to read the literature on product management and consider
hiring or consulting product managers who have the relevant experience. Product
development differs from making a custom solution. In fact, it is an entirely different
business, with processes and values that may seem bizarre to a consulting project manager.
Finally, remember that building products is risky and expensive. The product team will
need to spend a substantial amount of effort converting any internal code base into a
market-ready product. If the idea has no real market demand, this work will be wasted.
Good product ideas are very attractive from a business perspective. However, if you
provide services for different customers, you should always think about privacy before
turning your projects into products for wide markets.
Privacy concerns
Privacy is of paramount importance for every data science team, especially for teams that
are trusted by other businesses. When you're considering building a new product based on
your internal code base or data, always check that everything you plan to use can be sold to
other customers. If this is not the case, you may end up building the product from scratch,
which will greatly increase the development costs. Ask your company's legal team whether
all the related NDAs and contracts allow for the planned use cases.
You also need to consider licensing your internal reusable components so that you can
legally reuse them across different projects. This book cannot offer legal advice, so do not
hesitate to consult a lawyer if you plan to reuse software components and build products.
Summary
In this chapter, we have looked at the benefits of product thinking in a custom project
development environment. We studied why reusability matters and how we can build and
integrate reusable software components at each stage of the data science project. We also
went over the topic of finding the right balance between research and implementation.
Finally, we looked at strategies for improving the reusability of our projects and explored
the conditions that allow us to build standalone products based on our experience.
In the next section of this book, we will look at how we can build a development
infrastructure and choose a technology stack that will ease the development and delivery of
data science projects. We will start by looking at ModelOps, which is a set of practices for
automating model delivery pipelines.
4
Section 4: Creating a Development Infrastructure
In the final section, we aim to create a product that helps real people. Often, this means
creating a stable piece of software that operates in production under a heavy load. Any data
science product is also a software product. Teams can use existing best practices from
software development to circumvent numerous caveats. This part of the book offers an
overview of the must-have practices and tools required to build and deploy data science
solutions.
11
Implementing ModelOps
This chapter will cover the following topics:
Understanding ModelOps
Looking into DevOps
Managing code versions and quality
Storing data along with code
Managing environments
Tracking experiments
The importance of automated testing
Continuous model training
A power pack for your projects
Understanding ModelOps
ModelOps is a set of practices for automating a common set of operations that arise in data
science projects, which include the following:
Version control
Experiment tracking
Testing
Deployment
Without ModelOps, teams are forced to waste time on those repetitive tasks. Each task in
itself is fairly easy to handle, but a project can suffer from mistakes in those steps.
ModelOps helps us to create project delivery pipelines that work like a precise conveyor
belt with automated testing procedures that try to catch coding errors.
Let's examine a set of activities that comprise the software development pipeline:
1. Performing checks for errors, typos, bad coding habits, and formatting mistakes.
2. Building the code for one or several target platforms. Many applications should
work on different operating systems.
3. Running a set of tests that check that the code works as intended, according to
the requirements.
4. Packaging the code.
5. Deploying packaged software.
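The steps above can be sketched as a minimal pipeline runner. This is an illustration only: the stage commands are assumptions about a typical Python project, and a real CI/CD server would define them in its own configuration format:

```python
import subprocess
import sys

def run_pipeline(steps):
    """Run named commands in order, stopping at the first failure,
    just as a CI server aborts a broken build."""
    for name, cmd in steps:
        print(f"[{name}] running: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            print(f"[{name}] failed, stopping the pipeline")
            return False
    return True

# Hypothetical stages mirroring the list above.
pipeline = [
    ("lint", [sys.executable, "-m", "flake8", "src"]),
    ("test", [sys.executable, "-m", "pytest", "tests"]),
    ("package", [sys.executable, "-m", "build"]),
]
```

Calling `run_pipeline(pipeline)` executes the stages in order; in a CI/CD setup, the server runs an equivalent sequence on every new code change.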
Continuous integration and continuous deployment (CI/CD) holds that all of those steps
can and should be automated and run as frequently as possible. Smaller updates that are
thoroughly tested are more reliable, and if anything goes wrong, such an update is much
easier to revert. Before CI/CD, the throughput of software engineers who manually
executed software delivery pipelines limited the speed of the deployment cycle.
Now, highly customizable CI/CD servers rid us of manual labor, and completely automate
all necessary activities. They run on top of a source code version control system, and
monitor for new code changes. Once a new code change is present, a CI/CD server can
launch the delivery pipeline. To implement CI/CD, you need to spend time writing
automated tests and defining delivery pipelines, but after that, the pipeline just works,
every time you need it.
Implementing ModelOps Chapter 11
DevOps took the software development world by storm, producing many technologies that
make software engineers more productive. Like any technology ecosystem, it requires an
expert to devote time to learning and integrating all the tools. Over time, CI/CD servers
became more complicated and feature-rich, and many companies felt the need to have a
full-time expert capable of managing delivery pipelines for their projects. Thus, they came
up with the role of DevOps engineer.
Many tools from the DevOps world are becoming much easier to use, requiring only a
couple of clicks in a user interface. Some CI/CD solutions such as GitLab aid you in creating
simple CI/CD pipelines automatically.
Many benefits of CI/CD infrastructure apply to data science projects; however, many areas
remain uncovered. In the next sections of this chapter, we will look at how data science
projects can use CI/CD infrastructure, and what tools you can use to make the automation
of data science project delivery more complete.
A typical software development toolchain includes the following components:
Version control—Git
Code collaboration platform—GitHub, GitLab
Automated testing framework—dependent on the implementation language
CI/CD server—Jenkins, Travis CI, Circle CI, or GitLab CI
All of these technologies miss several core features that are critical for data science projects:
Data management—tools for solving the issue of storing and versioning large
amounts of data files
Experiment tracking—tools for tracking experiment results
Automated testing—tools and methods for testing data-heavy applications
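As a minimal illustration of the experiment tracking gap, each run's parameters and metrics can be appended to a shared log. The helper below is a toy sketch with invented names; dedicated trackers (MLflow, for example) implement the same idea with search and a UI:

```python
import json
import time
from pathlib import Path

def log_experiment(params, metrics, log_dir="experiments"):
    """Append one experiment record to a JSON Lines log file."""
    Path(log_dir).mkdir(exist_ok=True)
    record = {"timestamp": time.time(),
              "params": params,
              "metrics": metrics}
    with open(Path(log_dir) / "log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

log_experiment({"model": "logreg", "C": 1.0}, {"accuracy": 0.87})
```

Even this simple log answers the question every team eventually asks: which parameters produced our best model, and when did we run that experiment?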
Before covering solutions to the preceding issues, we will familiarize ourselves with the
data science delivery pipeline.
1. Each model pipeline starts with the Raw data, which is stored in some kind of
data source.
2. Then, data scientists perform exploratory data analysis (EDA) and create EDA
Reports to deepen the understanding of the dataset and discover possible issues
with the data.
3. The Data processing pipeline transforms raw data into an intermediate format
that is more suitable for creating datasets for training, validating, and testing
models.
4. The Model dataset pipeline creates ready-to-use datasets for training and testing
models.
5. The Model training pipeline uses prepared datasets to train models, assess their
quality by performing offline testing, and generate Model quality reports that
contain detailed information about model testing results.
6. At the end of the pipeline, you get the final artifact—a Trained model that is
saved on a hard disk or a database.
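As a toy sketch, the stages above can be expressed as a chain of functions, each producing the input of the next. Every name and body here is an illustrative stand-in rather than a real pipeline framework:

```python
def load_raw_data():
    # Stage 1: raw data from some data source.
    return [{"text": " Wrong invoice ", "label": "billing"},
            {"text": "App crashes", "label": "technical"}]

def process(raw):
    # Stage 3: data processing pipeline; clean and normalize records.
    return [{**r, "text": r["text"].strip().lower()} for r in raw]

def make_datasets(records):
    # Stage 4: model dataset pipeline; split into train and test sets.
    split = max(1, int(len(records) * 0.8))
    return records[:split], records[split:]

def train_and_evaluate(train, test):
    # Stage 5: train a model and assess it offline.
    model = {"classes": sorted({r["label"] for r in train})}
    report = {"test_size": len(test)}
    return model, report

train, test = make_datasets(process(load_raw_data()))
model, report = train_and_evaluate(train, test)
print(model, report)
```

Each function boundary is a natural place to persist an intermediate artifact (the processed data, the datasets, the model quality report), which is what makes the pipeline reproducible.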
Now, we are ready to discuss implementation strategies and example tools for ModelOps.
Many teams treat Git as synonymous with one of the popular platform offerings, but it is
sometimes useful to know that those platforms are not your only option. Sometimes, you
have no internet access or no ability to install additional software on server machines, but
you still want the benefits of storing code in a shared repository. You can still use Git in
those restricted environments: it has a useful feature called file remotes that allows you to
push your code almost anywhere.
For example, you can use a USB stick or a shared folder as a remote repository:
git clone --bare /project/location/my-code /remote-location/my-code
# copy your code history from a local Git repo
git remote add usb file:///remote/location/my-code
# add your remote as a file location
git push usb master
# push the code
# Done! Other developers can set up your remote and pull updates:
git remote add usb file:///remote/location/my-code
# add your remote as a file location
git pull usb master
# pull the code
By changing the file:/// path to an ssh:// path, you can also push code to remote SSH
machines on your local network.
Most data science projects are written in Python, where static code analysis and code build
systems are not as widespread as in other programming languages. These tools allow you
to groom code automatically and check it for critical errors and possible bugs each time you
try to build a project. Python has such tools too—take a look at pre-commit
(https://pre-commit.com).
Having covered the main recommendations for handling code, let's now see how we can
achieve the same results for data, which is an integral part of any data science project.
This distinguishes data science projects from software projects and creates a need for
managing data versions along with the code: Data Version Control (DVC). In general,
different software versions can be reconstructed by using the source code alone, but for
data science projects this is not sufficient. Let's see what problems arise when you try to
track datasets using Git.
Structuring pipelines and intermediate results is an interesting topic that deserves special
attention. The pipeline structure of your project determines what intermediate results are
available for use. Each intermediate result creates a branching point, from where several
other pipelines can start. This creates the flexibility of reusing intermediate results, but at
the cost of storage and time. Projects with lots of intermediate steps can consume a lot of
disk space and will take more time to calculate, as disk input/output takes a lot of time.
Be aware that model training pipelines and production pipelines should be different. A
model training pipeline might have a lot of intermediate steps for research flexibility, but a
production pipeline should be highly optimized for performance and reliability. Only the
intermediate steps that are strictly necessary for the finalized production pipeline should
be kept.
Storing data files is necessary for reproducing results but is not sufficient for understanding
them. You can save yourself a lot of time by documenting data descriptions, along with all
reports that contain summaries and conclusions that your team draws from data. If you
can, store those documents in a simple textual format so that they can be easily tracked in
your version control system along with the corresponding code.
You can use the following folder structure to store the data in your projects:
Project root:
    Data:
        Raw: raw data from your customer
        Interim: intermediate data generated by the processing pipeline
        Preprocessed: model datasets or output files
    Reports: project reports for EDA, model quality, and so on
    References: data dictionaries and data source documentation
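The layout above can be bootstrapped with a short script. The following is a minimal sketch (folder names mirror the structure described above; adjust them to your project's conventions):

```python
import tempfile
from pathlib import Path

# Folders mirroring the layout described above.
FOLDERS = [
    "data/raw",           # raw data from your customer
    "data/interim",       # intermediate data generated by the pipeline
    "data/preprocessed",  # model datasets or output files
    "reports",            # EDA and model quality reports
    "references",         # data dictionaries and data source documentation
]

def create_project_layout(root: str) -> None:
    """Create the data folder layout under the given project root."""
    for folder in FOLDERS:
        Path(root, folder).mkdir(parents=True, exist_ok=True)

# Example: bootstrap the layout in a temporary directory.
project_root = tempfile.mkdtemp()
create_project_layout(project_root)
```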
You can see a simple example of a diff file in the following screenshot:
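For reference, a unified diff for a one-line change looks like the following (the file name and contents are hypothetical):

```diff
--- a/train.py
+++ b/train.py
@@ -1,4 +1,4 @@
 import pandas as pd
 
-df = pd.read_csv("data/raw.csv")
+df = pd.read_csv("data/raw/customers.csv")
 model = train_model(df)
```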
The highlighted lines marked with + represent added lines, while highlighted lines marked
with – stand for deleted lines. Adding large binary or text files in Git is considered bad
practice because it results in massive redundant diff computations, which makes
repositories slow to work with and large in size.
diff files solve a very specific problem: they allow developers to browse, discuss, and
switch between sets of changes. diff is a line-based format targeted at text files. In
contrast, even a small change in a binary data file results in a completely different file,
so Git will generate a massive diff for each small data modification.
In general, you needn't browse or discuss changes to a data file in a line-based format, so
calculating and storing diff files for each new data version is unnecessary: it is much
simpler to store the entire data file each time it changes.
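This whole-file strategy can be sketched as content-addressed storage, where each version of a data file is stored under a key derived from a hash of its contents. The following is a simplified illustration of the idea only, not how DVC or any other tool is actually implemented:

```python
import hashlib
import shutil
from pathlib import Path

def store_version(data_file: str, storage_dir: str) -> str:
    """Copy a data file into content-addressed storage and return its key.

    Identical file contents map to the same key, so unchanged files are
    stored only once, and no diff computation is involved.
    """
    digest = hashlib.md5(Path(data_file).read_bytes()).hexdigest()
    target = Path(storage_dir) / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copy(data_file, target)
    return digest
```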
Now that we are equipped with the tools to handle data storage and versioning, let's look
at how to manage Python environments so that your team won't waste time with package
conflicts on a server.
Managing environments
Data science projects depend on a lot of open source libraries and tools for doing data
analysis. Many of those tools are constantly updated with new features, which sometimes
break APIs. It is important to pin all dependencies in a shareable format so that every
team member can use the same versions and builds of the libraries.
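With conda, for example, such a shareable format is an environment.yml file that any team member can recreate with conda env create -f environment.yml. A hypothetical example (package names and versions are illustrative):

```yaml
name: churn-prediction
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.7
  - pandas=0.25.1
  - scikit-learn=0.21.3
  - pip:
      - mlflow==1.4.0
```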
The Python ecosystem has multiple environment management tools that take care of
different problems. Tools overlap in their use cases and are often confusing to choose from,
so we will cover each briefly:
Consider using conda for managing data science project environments if you are not doing
so already. It will not only solve your environment management problems but also save
time by speeding up the computation. The following plot shows the performance difference
between using the TensorFlow libraries installed by pip and conda (you can find the
original article by following this link: https://www.anaconda.com/tensorflow-in-
anaconda/):
Next, we will cover the topic of experiment tracking. Experiments are a natural part of
every data science project. A single project might contain the results of hundreds or even
thousands of experiments. It is important to keep records so that you can draw correct
conclusions from experiment results.
Tracking experiments
Experimentation lies at the core of data science. Data scientists perform many experiments
to find the best approach to solving the task at hand. In general, experiments exist in sets
that are tied to data processing pipeline steps.
For example, your project may comprise the following experiment sets:
Each experiment can affect the results of other experiments, so it is crucial to be able to
reproduce each experiment in isolation. It is also important to track all results so your team
can compare pipeline variants and choose the best one for your project according to the
metric values.
A simple spreadsheet file with links to data files and code versions can be used to track all
experiments, but reproducing experiments will require lots of manual work and is not
guaranteed to work as expected. Although tracking experiments in a file requires manual
work, the approach has its benefits: it is very easy to start with and easy to version. For
example, you can store the experiment results in a simple CSV file, versioned in
Git along with your code, with columns such as the following:
Experiment date
Code version (Git commit hash)
Model name
Model parameters
Training dataset size
Training dataset link
Validation dataset size (fold number for cross-validation)
Validation dataset link (none for cross-validation)
Test dataset size
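A minimal sketch of such a CSV log using only the Python standard library; the column list is trimmed for brevity, and the metric column is an assumption (track whichever quality metrics your project uses):

```python
import csv
from pathlib import Path

# Trimmed set of columns from the list above; "metric" is an assumption.
FIELDS = [
    "date", "code_version", "model_name", "model_parameters",
    "train_size", "train_link", "validation_size", "test_size", "metric",
]

def log_experiment(log_file: str, record: dict) -> None:
    """Append one experiment record to a CSV log, writing a header if needed."""
    path = Path(log_file)
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(record)
```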
Files are easy to work with if you have a moderate number of experiments, but if your
project uses multiple models, and each requires a large amount of experimentation, using
files becomes cumbersome. If a team of data scientists performs simultaneous experiments,
tracking files from each team member will require manual merges, and data scientists are
better off spending time on carrying out more experiments rather than merging other
teammates' results. Special frameworks for tracking experiment results exist for more
complex research projects. These tools integrate into the model training pipeline and allow
you to automatically track experiment results in a shared database so that each team
member can focus on experimentation, while all bookkeeping happens automatically.
Those tools present a rich user interface for searching experiment results, browsing metric
plots, and even storing and downloading experiment artifacts. Another benefit of using
experiment tracking tools is that they track a lot of technical information that might come
in handy but is too tedious to collect by hand: server resources, server hostnames, script paths,
and even the environment variables present during the experiment run.
The data science community uses three major open source solutions that allow the tracking
of experiment results. These tools pack much more functionality than experiment tracking,
and we will briefly cover each:
DVC: This is a toolkit for data versioning and pipeline management. It also
provides basic file-based experiment tracking functionality, but it is subpar in
terms of usability compared to MLflow and Sacred. The power of DVC lies in
experiment management: it allows you to create fully versioned and
reproducible model training pipelines. With DVC, each team member is able to
pull code, data, and pipelines from a server and reproduce results with a single
command. DVC has a rather steep learning curve but is worth learning, as it
solves many technical problems that arise in collaboration on data science
projects. If your metric tracking requirements are simple, you can rely on DVC's
built-in solution, but if you need something richer and more visual, combine DVC
with MLflow or Sacred tracking; those tools are not mutually exclusive.
Now you should have a complete understanding of what tools can be used to track code,
data, and experiments as a single entity in your project. Next, we will cover the topic of
automated testing in data science projects.
2. Integration tests check that every component of the system works together as
intended. In a data science system, an integration test might check that all of the
steps of the model testing pipeline finish successfully and provide the desired
result.
3. Unit tests check individual classes and functions. In a data science project, a unit
test can check the correctness of a single method in a data processing pipeline
step.
Given how technologically developed the world of software testing is, can data science
projects benefit from automated tests? The main difference between testing data science
code and testing other software is its reliance on data. A fixed set of test data that is
generated before a test run is sufficient for most software projects. In data science projects, the situation is
different. For a complete test of the model training pipeline, you may need gigabytes of
data. Some pipelines may run for hours or even days and require distributed computation
clusters, so testing them becomes impractical. For this reason, many data science projects
avoid automated testing. Thus, they suffer from unexpected bugs, ripple effects, and slow
change integration cycles.
Despite the difficulties, the benefits of automated testing are too great to ignore. Ignoring
tests turns out to be much more costly than building them. This is true for data science
projects and software projects. The benefits of automated testing grow with project size,
complexity, and team size. If you lead a complex data science project, consider automating
testing as a mandatory requirement for your project.
Let's look at how we can approach testing data science projects. End-to-end testing for
model training pipelines might be impractical, but what about testing individual pipeline
steps? Apart from the model training code, each data science project will have some
business logic code and data processing code. Most of this code can be abstracted away
from distributed computation frameworks in isolated classes and functions that are easy to
test.
If you architect a project's code base with tests in mind from the start, it will be much easier
to automate testing. Software architects and lead engineers on your team should take the
testability of the code as one of the main acceptance criteria for code reviews. If the code is
properly encapsulated and abstracted, testing becomes easier.
In particular, let's take the model training pipeline. If we separate it into a series of steps
with clearly defined interfaces, we can then test data preprocessing code separately from
model training code. And if data preprocessing takes a lot of time and requires expensive
computation resources, you can at least test individual parts of the pipeline. Even basic
function-level tests (unit tests) can save you a lot of time, and it is much easier to transition
to full end-to-end tests from the basis of unit tests.
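For example, a small, pure data cleaning function with a clear interface can be unit tested in isolation from the rest of the pipeline. Both the function and the pytest-style test below are hypothetical:

```python
import math

def fill_missing_ages(ages, default=0.0):
    """Replace missing (None/NaN) age values with a default.

    A small, pure function like this is trivial to unit test in
    isolation from the rest of the data processing pipeline.
    """
    return [
        default if a is None or (isinstance(a, float) and math.isnan(a)) else a
        for a in ages
    ]

def test_fill_missing_ages():
    assert fill_missing_ages([25, None, 40.0]) == [25, 0.0, 40.0]
    assert fill_missing_ages([float("nan")], default=30.0) == [30.0]
    assert fill_missing_ages([]) == []
```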
To benefit from automated testing in your projects, start from the following guidelines:
We have seen how to manage, test, and maintain code quality in data science projects. Next,
let's look at how we can package code for deployment.
Packaging code
When deploying Python code for data science projects, you have several options:
Regular Python scripts: You just deploy a bunch of Python scripts to the server
and run them. This is the simplest form of deployment, but it requires a lot of
manual preparation: you need to install all required packages, fill in
configuration files, and so on. While those actions can be automated by using
tools such as Ansible (https://www.ansible.com/), it's not recommended to use
this form of deployment for anything but the simplest projects with no long-term
maintainability goals.
Now, we are ready to integrate all tools for dealing with code, data, experiments,
environments, testing, packaging, and deployment into a single coherent process for
delivering machine learning models.
1. CI:
1. Perform static code analysis.
2. Launch automated tests.
2. Continuous model learning:
1. Fetch new data.
2. Generate EDA reports.
3. Launch data quality tests.
4. Perform data processing and create a training dataset.
5. Train a new model.
6. Test the model's quality.
7. Record experiment results in an experiment log.
3. CD:
1. Package the new model version.
2. Package the source code.
3. Publish the model and code to the target server.
4. Launch a new version of the system.
CI/CD servers can automate all parts of the preceding pipeline. CI/CD steps should be easy
to handle, as they are what CI/CD servers were created for. Continuous model learning
should not be hard either, as long as you structure your pipeline so that it can be launched
automatically from the command line. Tools such as DVC can aid you in creating
reproducible pipelines, which makes them an attractive solution for the continuous model
learning pipeline.
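As a sketch, the CI stage of the preceding pipeline could look like this on a GitLab CI/CD server (a hypothetical .gitlab-ci.yml; stage names, job names, and commands are illustrative):

```yaml
stages:
  - static-analysis
  - test

static-analysis:
  stage: static-analysis
  script:
    - flake8 src/

unit-tests:
  stage: test
  script:
    - pytest tests/
```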
Now, let's look at how we can build a ModelOps pipeline in a data science project.
This example contains some technical details. If you are unfamiliar with
the technologies mentioned in this case study, you may want to follow
links to get a better understanding of the details.
Each piece of this equipment is unique in its own way because it operates under different
conditions at each of MannCo's plants. This meant that Oliver's team would need to
constantly adapt and retrain separate models for different plants. Let's look at how they
solved this task by building a ModelOps pipeline.
There were several data scientists on the team, so they needed a tool for sharing the code
with each other. The customer requested that, for security purposes, all code should be
stored in local company servers, and not in the cloud. Oliver decided to use GitLab
(https://about.gitlab.com/), as it was a general practice in the company. In terms of the
overall code management process, Oliver suggested using GitFlow (https://
danielkummer.github.io/git-flow-cheatsheet/). It provided a common set of rules for
creating new features, releases, and hotfixes for every team member.
Oliver knew that a reliable project structure would help his team to properly organize code,
notebooks, data, and documentation, so he advised his team to use PyScaffold (https://
pyscaffold.readthedocs.io/) along with the plugin for data science projects (https://
github.com/pyscaffold/pyscaffoldext-dsproject). PyScaffold allowed them to bootstrap
a project template that ensured a uniform way to store and version data science projects.
PyScaffold already provided the environment.yml file, which defined a template
Anaconda (https://www.anaconda.com/distribution/) environment, so the team did not
forget to lock the package dependencies in a versioned file from the start of the project.
Oliver also decided to use DVC (https://dvc.org/) to version datasets using the
company's internal SFTP server. They also used a --gitlab flag for the pyscaffold
command so that they would have a ready-to-use GitLab CI/CD template when they
needed it.
The project structure looked like this (taken from the pyscaffold-dsproject
documentation):
├── AUTHORS.rst <- List of developers and maintainers.
├── CHANGELOG.rst <- Changelog to keep track of new features and fixes.
├── LICENSE.txt <- License as chosen on the command-line.
├── README.md <- The top-level README for developers.
├── configs <- Directory for configurations of model & application.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
├── docs <- Directory for Sphinx documentation in rst or md.
├── environment.yaml <- The conda environment file for reproducibility.
├── models <- Trained and serialized models, model predictions,
│ or model summaries.
├── notebooks <- Jupyter notebooks. Naming convention is a number (for
│ ordering), the creator's initials and a description,
│ e.g. `1.0-fw-initial-data-exploration`.
├── references <- Data dictionaries, manuals, and all other materials.
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated plots and figures for reports.
├── scripts <- Analysis and production scripts which import the
The project team quickly discovered that they would need to perform and compare many
experiments to build models for different manufacturing plants. They evaluated DVC's
metric tracking capabilities, which allow tracking all metrics using a simple text file
versioned in Git. While the feature was convenient for simple projects, it would be hard to use in a
project with multiple datasets and models. In the end, they decided to use a more advanced
metric tracker—MLflow (https://mlflow.org). It provided a convenient UI for browsing
experiment results and allowed using a shared database so that every team member would
be able to quickly share their results with the team. MLflow was installed and configured as
a regular Python package, so it easily integrated into the existing technology stack of the
project:
The team also decided to leverage DVC pipelines to make each experiment easily
reproducible. The team liked to prototype models using Jupyter notebooks, so they decided
to use papermill (https://papermill.readthedocs.io/en/latest/) to work with
notebooks as if they were a set of parameterized Python scripts. Papermill allows executing
Jupyter notebooks from the command line without starting Jupyter's web interface. The
team found the functionality very convenient to use along with the DVC pipelines, but the
command line for running a single notebook became too long:
dvc run -d ../data/interim/ \
    -o ../data/interim/01_generate_dataset \
    -o ../reports/01_generate_dataset.ipynb \
    papermill --progress-bar --log-output \
    --cwd ../notebooks \
    ../notebooks/01_generate_dataset.ipynb \
    ../reports/01_generate_dataset.ipynb
To solve this problem, they wrote a Bash script to integrate DVC with papermill so that the
team members could create reproducible experiments with less typing in the terminal:
#!/bin/bash
set -eu

if [ $# -eq 0 ]; then
    echo "Use:"
    echo "./dvc-run-notebook [data subdirectory] [notebook name] -d [your DVC dependencies]"
    echo "This script executes DVC on a notebook using papermill. Before running, create ../data/[data subdirectory] if it does not exist, and do not forget to specify your dependencies as the last arguments"
    echo "Example:"
    echo "./dvc-run-notebook interim ../notebooks/02_generate_dataset.ipynb -d ../data/interim/"
    exit 1
fi

NB_NAME=$(basename -- "$2")

CMD="dvc run ${*:3} -o ../data/$1 -o ../reports/$NB_NAME \
papermill --progress-bar --log-output --cwd ../notebooks \
../notebooks/$NB_NAME ../reports/$NB_NAME"

echo "Executing the following DVC command:"
echo $CMD
$CMD
When using several open source ModelOps tools in a single project, your
team might need to spend some time integrating them together. Be
prepared, and plan accordingly.
Over time, some parts of the code started to duplicate inside the notebooks. The PyScaffold
template provides a way to solve this problem by encapsulating repeated code in the
project's package directory—src. This way, the project team could quickly share code
between notebooks. To install the project's package locally, they simply used the following
command from the project's root directory:
pip install -e .
Closer to the project release date, all stable code was migrated to the project's src and
scripts directories. The scripts directory contained a single entry point script for
training a new model version that was output into the models directory, which was
tracked by DVC.
To be sure that new changes did not break anything important, the team wrote a set of
automated tests using pytest (https://docs.pytest.org/) for the stable code base. The
tests also checked model quality on a special test dataset created by the team. Oliver
modified a GitLab CI/CD template that was generated by PyScaffold so that tests would be
run on each new commit pushed to the Git repository.
The customer requested a simple model API, so the team decided to use an MLflow server
(https://mlflow.org/docs/latest/models.html), as MLflow was already integrated into
the project. To further automate the deployment and packaging process, the team decided
to use Docker along with GitLab CI/CD. To do this, they followed GitLab's guide for
building Docker images (https://docs.gitlab.com/ee/ci/docker/using_docker_build.
html).
The overall ModelOps process for the project contained the following steps:
Please note that, in real projects, you may want to split the deployment
into several stages for at least two different environments: staging and
production.
As a conclusion to this chapter, we'll look at a list of tools that you can use to build
ModelOps pipelines.
Summary
In this chapter, we covered ModelOps – a set of practices for automating a common set of
operations that arise in data science projects. We explored how ModelOps relates to
DevOps and described major steps in the ModelOps pipeline. We looked at strategies for
managing code, versioning data, and sharing project environments between team members.
We also examined the importance of experiment tracking and automated testing for data
science projects. As a conclusion, we outlined the full CI/CD pipeline with continuous
model training and explored a set of tools that can be used to build such pipelines.
In the next chapter, we will look at how to build and manage a data science technology
stack.
12
Building Your Technology Stack
Technology choices have lasting consequences. A project's technology stack determines the
functional and nonfunctional capabilities of your system, so it is critical to make thoughtful
choices. The bidirectional link between technologies and requirements opens up an
analytical approach for choosing between different technologies by matching their features
against the project's needs. In this chapter, we will see how we can use software design
practices to form project-specific technology stacks and see what technologies should
constitute the core technology stack that's shared among all of your projects. We will also
explore an approach that compares different technologies so that you can make a rational
choice between apparently similar options.
Shaping a technology stack goes hand in hand with designing software architecture, so the
engineers on your team should start by drafting a system design that will meet everyone's
requirements. Software architecture is a wide and deeply technical topic, so we won't
discuss it in depth in this chapter. Instead, we will present an overview of the necessary
steps when it comes to choosing the best technologies for reaching specific goals. Let's get
started:
As an example, let's build a technology stack for a customer churn prediction system. Our
customer has defined the following list of requirements for the churn prediction
module. We'll omit the technical details so that we can focus on the overall process:
1. Process the data from the marketing database. This contains customer
information. The size of the dataset is under 5 GB of raw data.
2. Train the churn prediction model on a weekly basis, every Monday. We should
consider customers that have made no purchases for a month as churned.
3. Notify the team about the potential customer churn by using remote API calls.
The service should be available on business days.
The team has decomposed customer requirements into the following list:
1. Functional requirements (FR):
FR1: The system must be integrated with the customer's marketing
database.
FR2: The system must provide a customer churn prediction model. We
can consider customers that have made no purchases for a month as
churned.
FR3: The system must call a remote API to notify the marketing
department about potential customer churn.
FR4: The model should be executed every Monday.
2. Nonfunctional requirements (NFR):
NFR1: The system should handle processing 5 GB of data every week.
NFR2: The API should be available on business days.
Based on this requirements list, the team has come up with the following system design,
which has been drawn in ArchiMate 2.1 notation using the Archi software (https://www.
archimatetool.com/):
The preceding diagram consists of two levels: the software level and the infrastructure
level.
The software level describes the relationships between different components and services:
Raw customer data: Represents the raw data that's used by the churn prediction
system.
Model training pipeline: Represents a set of data processing and model training
steps, grouped as a software component.
Model prediction pipeline: Represents the component that's responsible for
taking trained models to make churn predictions and generate customer churn
alerts.
Scheduler: Orchestrates the work of other components by running pipelines
based on a schedule.
Customer churn alerts: Notification events about potential customer churn.
Marketing CRM system: The customer's CRM system, which has already been
deployed and used in production.
Notification API: A service that allows us to create notifications about customers
inside the CRM system.
The infrastructure level, which describes the physical allocation of software to specific
hardware resources, has three components:
The preceding software architecture diagram omits many technical details for the sake of
simplicity since our goal is to demonstrate how to make technology choices. The core idea
of the following text is not to deepen your knowledge of the vast area of software
architecture, but to give you a general feel for the process where requirements flow into
sound technology choices. Understanding this process can help you guide your team of
experts toward an efficient technology stack choice. The project did not seem very
complicated to the team, and the task was fairly standard for them. The team has decided
to use a general set of cross-cutting technologies that they use for every project as a
corporate standard:
NFR1: Perform load tests and determine the minimum necessary server
configuration for model training. From the team's previous experience, a virtual
server with 8 CPU, 1 GB RAM, and 15 GB HDD should be sufficient, so they set
this configuration as a baseline for tests.
NFR2: The team has decided to request more detailed information on API usage
and ask how many requests they can expect per day. It turns out that the API
will be executed up to 10 times a day, so a single API server should suffice for
this availability requirement.
All of the team members will spend time learning a new language, practically
destroying all productivity gains from using the technology.
The team's conclusions may be over-optimistic because of their lack of experience
with the new technology.
Those two risks abate if your team pursues a growth mindset and gains new knowledge
continuously, but they never vanish completely.
The core expertise in your team puts limits on what technologies you can use in projects. If
you want more options, it is important to develop a team technology stack separately.
Continuous internal research processes should keep your core technology stack up to date.
Project-specific technology stacks can be formed by adapting the core technology stack for
your project's requirements.
We can view the internal technology research process as a separate long-running project,
which is best managed by using Kanban:
The core technology stack should not include overly specific technologies that are
applicable to only a small set of projects. Its sole purpose is to give your
team a technological basis that covers the majority of requirements that arise in your
projects. The more widely focused your team is, the more general the core technology stack
should be. If the entire team is building a specific product, then project-specific and core
technology stacks start to merge into a single entity.
The process of adapting the core technology stack into a new project is as follows:
Using these processes, you can turn often chaotic and desire-driven technology choices into
a consistent and logical series of steps that lead to meaningful decisions. However, even the
most deliberate requirement decompositions may leave your team wondering which
technology to choose because there are many intersections and alternatives between
different frameworks, libraries, and platforms. In the next section, we will explore how we
can choose from a variety of technologies.
X versus Y disputes often happen inside project teams, and they are among the most
unproductive activities engineers can spend their time on. If you transition from X versus Y debates
to searching for technologies that fit your specific needs (which are clearly stated, classified,
and documented), you will get far more useful results in less time. For data scientists and
software engineers, choosing the most modern or fashionable technologies is the closest
thing to playing Russian roulette. Let's explore how we can make thoughtful decisions
about technology stacks.
To make meaningful technology choices, you need to make the process more systematic.
First, you need to derive a set of comparison criteria that will allow you to benchmark
different technologies and provide a template for research activities. These criteria should
test different dimensions or groups of requirements that make the technology useful in
your specific case.
Let's examine the specifics of comparing different technologies by looking at a case study.
To choose between these products, the team has created the following comparison criteria:
ID: This can be used as a short identifier in the technology comparison table
(provided next).
Requirement definition: This should describe the capability of interest.
Justification: This should provide the motivation behind the requirement.
Score: This shows the relative importance of the requirement and will be used to
summarize each requirement category.
Priority: This indicates the necessity of each requirement and will provide
additional scores for each technology.
Pmdarima (https://www.alkaline-ml.com/pmdarima/)
statsmodels.tsa (https://www.statsmodels.org/stable/tsa.html)
Prophet (https://github.com/facebook/prophet)
LightGBM (https://lightgbm.readthedocs.io/en/latest/) and tsfresh
(https://tsfresh.readthedocs.io/en/latest/)
The team has also come up with the following comparison table:
Framework              D1(M) D2(I) D3(S) D4(S) D5(I) F1(M) F2(S) F3(M) F4(I) P1(I) P2(I)
pmdarima                 3     2     0     2     3     3     0     3     2     0     0
statsmodels.tsa          3     0     2     0     3     0     1     3     2     2     0
Prophet                  3     0     2     2     3     3     0     3     2     0     2
LightGBM and tsfresh     3     2     2     0     0     0     0     3     0     2     0
This table can be further summarized to give you the following final results:
If you wish, you can use weighted averages to summarize each framework's score into a
single number. However, remember that collapsing a complex decision into a single
number can hide important trade-offs.
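The weighted-average summary mentioned above can be computed as follows. The weights here are an assumption: the M/S/I priorities from the table are mapped to 3/2/1, and the scores are taken from the comparison table:

```python
def weighted_score(scores, weights):
    """Summarize per-criterion scores into a single weighted average."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Hypothetical weights for the table's priority row: M -> 3, S -> 2, I -> 1.
weights = [3, 1, 2, 2, 1, 3, 2, 3, 1, 1, 1]

# Scores from the comparison table (criteria D1..P2).
frameworks = {
    "pmdarima":             [3, 2, 0, 2, 3, 3, 0, 3, 2, 0, 0],
    "statsmodels.tsa":      [3, 0, 2, 0, 3, 0, 1, 3, 2, 2, 0],
    "Prophet":              [3, 0, 2, 2, 3, 3, 0, 3, 2, 0, 2],
    "LightGBM and tsfresh": [3, 2, 2, 0, 0, 0, 0, 3, 0, 2, 0],
}

for name, scores in frameworks.items():
    print(f"{name}: {weighted_score(scores, weights):.2f}")
```

With these illustrative weights, Prophet receives the highest weighted score, which is consistent with the conclusion above.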
These research results show that Prophet comes up as a better choice according to the initial
requirements. However, the results do not mean that Prophet is the best choice for every
application. Technology choices are inherently opinionated and context-dependent, since
no technology can be the best fit for every project. For example, the ranking could be entirely different if the
team took the desired average metric value into consideration. In this setting, other
frameworks could have won because they provided more accurate models.
Summary
In this chapter, we have seen how we can choose a technology based on its requirements
rather than the hype surrounding it. We also explored how to structure requirements and
derive the main elements of a technology stack from software architecture. Finally, we
discussed the difference between core and project-specific technology stacks and examined
an analytical approach to comparing different technologies.
The next chapter will conclude this book; congratulations on finishing it!
13
Conclusion
First of all, thank you for reading this book! I hope that the material was helpful in
presenting the overall management approach to data science projects. Data science
management is a multidimensional topic that requires a manager to show technical,
organizational, and strategic proficiency so that they can execute a data science strategy
inside any organization.
First, the data science manager needs to understand what data science can do, as well as its
technical limitations. Without an understanding of the basic concepts, it is extremely easy to
misunderstand your colleagues or over-promise project results to the customer.
The What is Data Science section of this book described the basics of machine learning and
deep learning, including explanations behind the mathematical concepts that comprise
machine learning algorithms. We explored the world of technology and business metrics,
defined mathematical optimization, and looked into statistics and probability. These
concepts helped us define how machines learn using maximum likelihood estimation. In
particular, we can explain almost every machine learning algorithm using this
mathematical framework. We have also seen that there is no silver bullet in the machine
learning world thanks to the no free lunch theorem. This theorem states that no machine
learning algorithm will be better than random choice if you measure its accuracy over all
possible datasets and tasks.
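As a quick refresher of the maximum likelihood idea mentioned above (a minimal sketch, not code from the book): for a coin with unknown bias p, the likelihood of observing h heads and t tails is p^h * (1 - p)^t, and the value of p that maximizes it is simply h / (h + t). We can confirm this numerically:

```python
import math

def log_likelihood(p, heads, tails):
    """Log-likelihood of a coin with bias p, given observed heads/tails counts."""
    return heads * math.log(p) + tails * math.log(1 - p)

heads, tails = 7, 3

# Grid-search the bias value that maximizes the likelihood of the observed data.
candidates = [i / 1000 for i in range(1, 1000)]
mle = max(candidates, key=lambda p: log_likelihood(p, heads, tails))

print(mle)  # matches the analytical MLE: heads / (heads + tails) = 0.7
```

The same principle, maximizing the likelihood of the observed data over model parameters, underlies the fitting procedures of most machine learning algorithms.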
After that, our journey continued toward the Building and sustaining a team section, where
we learned about balancing long-term and short-term team building goals. We talked about
the flaws of over-complicated technical interviews and how to avoid them. The main idea
here was to look at what you need and then search for people with specific skill sets, rather
than a rockstar data scientist who can do anything. We also discussed how to conduct data
science interviews using real-world experience. Then, we explored the concept of team
balance, which forms the long-term stability of teams that allows them to grow and
develop. We also touched on the important topic of the growth mindset. By following this
practice yourself, you become an example to your team.
In the Managing data science projects section, we focused on project management, starting
from a bird's-eye strategic view on innovations and marketing. We drew connections
between innovations and data science and provided high-level strategies for managing
innovations in different kinds of organizations. We then looked at data science project
management at a smaller scale and examined the processes we can use to manage
individual projects. Next, we defined the data science project life cycle based on CRISP-DM
and looked over common mistakes that customers, managers, and implementation teams
make when delivering data science solutions. We also touched on the topic of software
component reusability and stated that creating reusable components will ease the
development of your projects and increase your team's business potential.
To continue your study of the topics covered in this book, the following books and courses are recommended:
Machine learning:
An Introduction to Statistical Learning, by Gareth James, Daniela
Witten, Trevor Hastie, and Robert Tibshirani
(http://faculty.marshall.usc.edu/gareth-james/), along with a MOOC
(https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about).
The code in this book is in R, but it is still worth reading even if
you are only interested in coding in Python. The main benefit of this
book is its presentation of the theory, which is independent of your
favorite programming language.
Introduction to Machine Learning with Python, by Andreas C. Müller
and Sarah Guido
(https://www.amazon.com/Introduction-Machine-Learning-Python-Scientists/dp/1449369413).
Deep learning:
FastAI course, by Jeremy Howard (http://fast.ai)
Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
(https://www.amazon.com/Deep-Learning-Adaptive-Computation-Machine/dp/0262035618/)
Software architecture:
Designing Data-Intensive Applications, by Martin Kleppmann
(https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)
Documenting Software Architectures: Views and Beyond, by Paul
Clements, Felix Bachmann, Len Bass, David Garlan, James Ivers,
Reed Little, Paulo Merson, Robert Nord, and Judith Stafford
(https://www.amazon.com/Documenting-Software-Architectures-Views-Beyond/dp/0321552687/)
Software engineering:
Fluent Python, by Luciano Ramalho
(https://www.amazon.com/Fluent-Python-Concise-Effective-Programming/dp/1491946008)
The Pragmatic Programmer, by Andrew Hunt and David Thomas
(https://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X)
Innovation management:
The Innovator's Dilemma, by Clayton M. Christensen
(https://www.amazon.com/Innovators-Dilemma-Revolutionary-Change-Business/dp/0062060244)
Crossing the Chasm, by Geoffrey Moore
(https://www.amazon.com/Crossing-Chasm-3rd-Disruptive-Mainstream/dp/0062292986/)
The Lean Startup, by Eric Ries
(https://www.amazon.com/Lean-Startup-Entrepreneurs-Continuous-Innovation/dp/B005MM7HY8)
Project management:
The Deadline, by Tom DeMarco
(https://www.amazon.com/Deadline-Novel-About-Project-Management-ebook/dp/B006MN4RAS/)
Peopleware: Productive Projects and Teams, by Tom DeMarco and
Tim Lister (https://www.amazon.com/gp/product/B00DY5A8X2/)
Emotional intelligence:
Emotional Intelligence, by Daniel Goleman
(https://www.amazon.com/Emotional-Intelligence-Matter-More-Than/dp/055338371X)
Never Split the Difference, by Chris Voss
(https://www.amazon.com/Never-Split-Difference-Negotiating-Depended/dp/0062407805)
Summary
Congratulations on finishing this book! I hope that it helped you deepen your knowledge of
data science and systemize management techniques for data science projects. Please
don't leave what you have read about the theory of data science behind. To master the
concepts we have covered in this book, you should apply them to your daily work. I also
strongly advise you to acquire a growth mindset, if you haven't already. You will be
amazed by how far you can go by learning something new each day, even if it's the smallest
bit of knowledge. No book can fully cover every aspect of data science management, but
you can continue this journey by referring to one of the books mentioned in this chapter or
discovering your own path while mastering data science.
Other Books You May Enjoy
If you enjoyed this book, you may be interested in these other books by Packt:
Index
A
activation function 84
Agile 182
Agile manifesto
  URL 182
alternative hypothesis 50
Ansible
  URL 231, 240
ArchiMate
  URL 244
architecture views
  component view 242
  deployment view 242
  infrastructure view 242
  other views 242
area under precision-recall curve (PR AUC) 47
artificial intelligence (AI)
  defining 8
Atlassian Bitbucket
  URL 240
automated testing
  significance 229, 230, 231

B
balanced teams 127, 128
Bayesian view
  on probability 64
bell curve 65
bias 34
bias-variance tradeoff 35
big data engineering 108
big organizations
  failures, reasons 149, 150
business analyst 99, 101
business metrics
  applying 48
business processes
  ideas, finding 165, 166
business requirements document (BRD) 99

C
causal inference 28, 29, 30
Circle CI
  URL 239
Classification and regression tree (CART) 77
clustering 81
code quality
  managing 221
code versions
  managing 221
code
  packaging 231, 232
computer vision 24, 25, 26, 27, 88
Conda
  URL 226
conditional probabilities 62, 63
confusion tables 44
constrained optimization 56
continuous bag of words (CBOW) 94
Continuous Integration/Continuous Deployment (CI/CD) 218, 242
continuous model training 232, 233
convolutional filter 89
Convolutional neural networks (CNNs) 89, 90
core technologies
  versus project-specific technologies 246, 247
cross-validation 39
CUDA 21
customer relationship management (CRM) 173

D
data analyst 99, 101
data engineer
  about 100, 101
  interviewing 125
  key skills 108
data hunger 15
data samples
  statistics, calculating from 66, 67
data science delivery pipeline 220
data science department
  creating 143, 144, 145
data science management
  approaches 174, 175
  specialties, balancing 158, 159, 160
data science manager
  key skills 109
data science project ideas
  finding, in insurance company 167, 168, 170, 171
data science project infrastructure, needs
  exploring 219
data science project, life cycle
  about 175
  business understanding 176
  data preparation 177
  data preparation, optimizing 178
  data understanding 177
  deployment stage 180
  evaluation stage 179
  modeling stage 179
data science projects
  common risks, avoiding 195, 196
  cost estimates, making 190, 191, 192, 193
  estimating 188, 189
  failure, reasons 173
  research 209
  time estimates, making 190, 191, 192, 193
  tools 238, 239, 240
data science roles, interviewing
  about 123
  general guidance 124
data science team manager 100
data science team roles
  about 101
  business analyst 99
  data analysts 99
  data engineers 100
  data science team manager 100
  data scientists 99
  defining 98
  exploring 101
  project stakeholders 99
  project users 99
  responsibilities 101
  software teams 100
  system analysts 99
data science
  about 106
  bringing, to retail business 162, 163
  defining 9, 10
  influence 10, 11
  limitations 11
data scientists
  about 99, 101
  interviewing 125
  key skills 106, 107
Data Version Control (DVC)
  about 222, 229, 245
  URL 239
data warehousing 108
data
  storing 224, 225
  storing, with code 222
  tracking 223
  versioning 223
database management 108
deep learning engineer 99
deep learning
  about 20, 21
  computer vision, exploring 26, 27
  exploring 82, 83
  natural language processing (NLP) 22, 23
  references 253
  use case 27
dependent events 63
development team
  help, obtaining from 109
DevOps 218
disruptive innovations 152
distributions 65
Docker
  URL 232, 240
domain expertise 106
dsproject extension
  reference link 238

E
emotional intelligence
  references 254
empathy
  developing 136, 137
empirical probability distribution 65
ensemble models
  about 78
  tree-based ensembles 79, 80
environments
  managing 225, 226
epsilon 52
estimation process, goals
  discovering 194
experiments
  tracking 227, 229
exploratory data analysis (EDA) 121, 177, 197
extract transform load (ETL) 113

F
F1-score
  using 45
False Negatives (FN) 42
False Positive (FP) 43
file remotes 221
Flake8
  URL 238
frequentist probabilities 62
fully connected neural network (FCNN) 87
functional requirements (FR)
  about 243
  implementing 245

G
garbage in, garbage out principle 17
Gaussian distribution 65
Generative Adversarial Networks (GANs) 26
Git 221
GitHub
  URL 240
GitLab CI/CD
  URL 239
GitLab
  URL 240
GitLFS
  URL 239
GPT-2 model
  reference link 23
gradient boosting algorithm 121
gradient boosting machine (GBM) 123
gradient descent 59, 60
growth mindset, facilitating
  about 137
  continuous learning, applying for personal growth 138, 139
  employees, helping to grow with performance reviews 142, 143
  opportunities, giving for learning 140, 141, 142
  team expertise, growing as whole 138

H
hard disk drives (HDDs) 150
Hypothesis
  URL 239

I
image classification 25
ImageNet 21
imbalanced classes 45, 46, 47, 48
implementation team 211
independent events 63
innovation management
  exploring 151, 153
  references 254
innovations
  about 149
  integrating 157
  managing, in big company 161
  managing, in startup company 164
instance segmentation
  about 25
  reference link 25
interview process
  purpose, discovering 115, 116
interviews
  ethics 117, 118
  good interviews, designing 119
  values 117, 118

J
Jenkins
  URL 239

K
K-means 81
k-nearest neighbors (kNNs) 113
Kanban 183, 184, 185
key performance indicator (KPI) 48, 142
Kubernetes
  URL 240

L
leadership by example 132
leadership
  about 131
  downside 132
leave-one-out cross-validation 40
LightGBM
  reference link 250
linear models 76
Linux containers 232
long short term memory networks (LSTMs) 96

M
machine learning engineer 99
machine learning model
  about 74
  building 73
  data 13, 14, 15
  decisions 12
  insights 12, 13
  stages 76
machine learning project cost estimator
  building 70
machine learning, in retail company
  case study 103, 104, 105
machine learning
  about 11, 12, 69, 102
  anatomy 16
  applying, to prevent fraud in banks 102, 103
  exploring 75
  goals 76
  origins 15
  references 253
  tasks 16, 17, 18, 19, 20
markets
  creating 151
mathematical optimization 55, 57, 58, 59, 60
maximum likelihood estimation (MLE) 61
mean absolute error (MAE) 41
MedVision
  innovation cycle 154, 155, 156
MLflow Projects
  URL 239
MLflow
  about 228
  URL 239
model errors
  about 32, 33
  bias 34
  decomposing 33, 35, 36
  variance 35
ModelOps 217
ModelOps, building for predictive maintenance system
  case study 233, 235, 236, 237
Multi-Armed Bandits (MABs) 52
MVP projects
  dealing with 199
MVP
  creating, in consulting company 200, 201
MyPy
  URL 238

N
natural language processing (NLP) 22, 23, 91, 92, 155
neural networks
  building 86, 87, 88
no free lunch theorem 16, 75
non-sensible absurd correlations, examples
  reference link 28
nonfunctional requirements (NFR)
  about 243
  implementing 246
null hypothesis 50

O
object detection 25
offline model testing 32
online data testing 53, 54
online model testing
  about 49, 50, 51, 52, 53
  alternative hypothesis 50
  null hypothesis 50
  risks 49
optimization 55
overfitting 36, 37, 38, 39, 40

P
people management 131
pipenv
  URL 226
pmdarima
  reference link 250
pre-commit
  URL 238
precision 45
probability distribution 65
production-oriented data science systems
  risks, mitigating 201, 202
products
  building 213
  privacy 214
  seeking 213
  versus tools 248
program evaluation and review technique (PERT) 192
project ideas
  finding 165
  finding, in business processes 165
  finding, in data 166, 167
project management methodology
  Agile 182
  custom project, developing for customer 188
  disruptive innovation, creating 187
  Kanban 183, 184, 185
  Scrum 185, 186
  selecting 181, 187
  tested solution, providing 188
  waterfall 181, 182
project management
  references 254
project stakeholders 99
project users 99
project-specific technologies
  versus core technologies 246, 247
project
  as products 207, 208
  stage, determining 208, 209
Proof of Concepts (PoC) 99
Prophet
  reference link 250
prototypes
  dealing with 199
pyenv
  URL 225
PyScaffold
  about 232
  URL 238
pytest
  URL 239

R
random variable 66
recall 45
receiver operating characteristic curve (ROC AUC)
  about 47
  reference link 48
rectified linear unit (ReLU) 84
recurrent neural networks (RNNs) 95
Relational Database Management System (RDBMS) 108
research and development (R&D) 157
research phases
  integrating, into solution development project 210
research projects
  approaching 196, 197
  pitfalls, avoiding 198
research team 211
reusability
  improving 212, 213
risks
  mitigating, in production-oriented data science systems 201, 202
roles 129
root mean square error (RMSE) 41, 72

S
Sacred
  about 228
  URL 239
sales forecasting system
  bringing, into production 204, 205
sample variance 67
Scrum 185, 186, 187
service desk routing system
  building 210, 211, 212
service level agreements (SLAs) 180
situational leadership
  using 132, 134
social credit system 10
software architect 100
software architecture
  references 254
software engineering 106, 108
software requirements document (SRD) 99
software team 100, 101
source version control (SVC) 221
specific measurable achievable relevant time-bound (SMART) 120, 135, 247
sprint 186
statistical modeling 68, 69
statistics
  about 61
  calculating, from data samples 66, 67
statsmodels.tsa
  reference link 250
strong AI 8
supervised learning 17
support vector machines (SVM) 113
system analyst 99, 101

T
team Zen
  achieving 127, 129, 130
technical interviews
  flaws 112, 113, 114, 115
technical metrics
  using 41, 42, 43, 44, 45
technology stack
  about 241
  elements, defining 243, 244, 245
test assignments
  designing 119, 120, 121, 123
text summarization model
  reference link 22
time series forecasting framework
  for logistics company 248, 249, 250, 251
tools
  versus products 248
Travis CI
  URL 239
tree-based ensembles 79, 80
True Negatives (TN) 42
True Positive (TP) 43
tsfresh
  reference link 250

U
unbalanced teams
  about 128
  examples 128
user interface (UI) 197

V
variance 35
virtualenv
  URL 225

W
waterfall 181, 182
weak AI 8
weight 85
wemake
  URL 238
word optimization 56