Data Preparation for Machine Learning (Sample)

Data Cleaning, Feature Selection, and Data Transforms in Python

Jason Brownlee
Disclaimer
The information contained within this eBook is strictly for educational purposes. If you wish to apply
ideas contained in this eBook, you are taking full responsibility for your actions.
The author has made every effort to ensure that the information within this book was accurate at
the time of publication. The author does not assume, and hereby disclaims, any liability to any
party for any loss, damage, or disruption caused by errors or omissions, whether such errors or
omissions result from accident, negligence, or any other cause.
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic or
mechanical, recording or by any information storage and retrieval system, without written permission
from the author.
Acknowledgements
Special thanks to my copy editor Sarah Martin and my technical editors Michael Sanderson, Arun
Koshy, Andrei Cheremskoy, and John Halfyard.
Copyright
Edition: v1.2
Contents

Copyright
Contents
Preface

I Introduction

II Data Transforms

1 How to Scale Numerical Data
1.1 Tutorial Overview
1.2 The Scale of Your Data Matters
1.3 Numerical Data Scaling Methods
1.4 Diabetes Dataset
1.5 MinMaxScaler Transform
1.6 StandardScaler Transform
1.7 Common Questions
1.8 Further Reading
1.9 Summary
Preface
Data preparation may be the most important part of a machine learning project. It is the
most time-consuming part, although it seems to be the least discussed topic. Data preparation,
sometimes referred to as data preprocessing, is the act of transforming raw data into a form
that is appropriate for modeling. Machine learning algorithms require input data to be numbers,
and most algorithm implementations maintain this expectation. As such, if your data contains
data types and values that are not numbers, such as labels, you will need to change the data
into numbers. Further, specific machine learning algorithms have expectations regarding the
data types, scale, probability distribution, and relationships between input variables, and you
may need to change the data to meet these expectations.
The philosophy of data preparation is to discover how to best expose the unknown underlying
structure of the problem to the learning algorithms. This often requires an iterative path of
experimentation through a suite of different data preparation techniques in order to discover
what works well or best. The vast majority of the machine learning algorithms you may use on
a project are years to decades old. The implementation and application of the algorithms are
well understood. So much so that they are routine, with amazing fully featured open-source
machine learning libraries like scikit-learn in Python. The thing that is different from project to
project is the data. You may be the first person (ever!) to use a specific dataset as the basis for
a predictive modeling project. As such, the preparation of the data in order to best present the
underlying problem to the learning algorithms is the primary task of any modern machine learning
project.
The challenge of data preparation is that each dataset is unique and different. Datasets
differ in the number of variables (tens, hundreds, thousands, or more), the types of the variables
(numeric, nominal, ordinal, boolean), the scale of the variables, the drift in the values over
time, and more. This makes discussing data preparation a challenge: either specific case studies
are used, or the focus is put on general methods that can be used across projects, and neither
approach alone covers the topic well. I wrote this book to address the lack of solid
advice on data preparation for predictive modeling machine learning projects. I structured the
book around the main data preparation activities and designed the tutorials around the most
important and widely used data preparation techniques, with a focus on how to use them in the
general case so that you can directly copy and paste the code examples into your own projects
and get started.
Data preparation is important to machine learning, and I believe that if it is taught at the
right level for practitioners, it can be a fascinating, fun, directly applicable, and immeasurably
useful toolbox of techniques. I hope that you agree.
Jason Brownlee
2021
Part I
Introduction
Welcome
Welcome to Data Preparation for Machine Learning. Data preparation is the process of
transforming raw data into a form that is more appropriate for modeling. It may be the most
important, most time-consuming, and yet least discussed area of a predictive modeling machine
learning project. Data preparation is relatively straightforward in principle, although there
is a suite of high-level classes of techniques, each with a range of different algorithms, and
each appropriate for a specific situation with their own hyperparameters, tips, and tricks. I
designed this book to teach you the techniques for data preparation step-by-step with concrete
and executable examples in Python.
This guide was written in the top-down and results-first machine learning style that you’re
used to from Machine Learning Mastery.
In this book, you will learn:

- The importance of data preparation for predictive modeling machine learning projects.
- How to prepare data in a way that avoids data leakage, and in turn, incorrect model evaluation.
- How to identify and handle problems with messy data, such as outliers and missing values.
- How to identify and remove irrelevant and redundant input variables with feature selection methods.
- How to know which feature selection method to choose based on the data types of the variables.
- How to scale the range of input variables using normalization and standardization techniques.
- How to transform a dataset with different variable types and how to transform target variables.
- How to project variables into a lower-dimensional space that captures the salient data relationships.
This book is not a substitute for an undergraduate course in data preparation (if such courses
exist) or a textbook for such a course, although it could complement such materials. For a good
list of top papers, textbooks, and other resources on data preparation, see the Further Reading
section at the end of each tutorial.
This book is intended to be read and used, not to sit idle. I recommend picking a schedule and
sticking to it. The tutorials are divided into six parts; they are:

- Part 1: Foundation. Discover the importance of data preparation, tour data preparation techniques, and discover the best practices to use in order to avoid data leakage.
- Part 2: Data Cleaning. Discover how to transform messy data into clean data by identifying outliers and identifying and handling missing values with statistical and modeling techniques.
- Part 3: Feature Selection. Discover statistical and modeling techniques for feature selection and feature importance and how to choose the technique to use for different variable types.
- Part 4: Data Transforms. Discover how to transform variable types and variable probability distributions with a suite of standard data transform algorithms.
- Part 5: Advanced Transforms. Discover how to handle some of the trickier aspects of data transforms, such as handling multiple variable types at once, transforming targets, and saving transforms after you choose a final model.
- Part 6: Dimensionality Reduction. Discover how to project variables into a lower-dimensional space that captures the salient data relationships.
Each part targets a specific learning outcome, and so does each tutorial within each part.
This acts as a filter to ensure you are only focused on the things you need to know to get to a
specific result and do not get bogged down in the math or near-infinite number of digressions.
The tutorials were not designed to teach you everything there is to know about each of the
methods. They were designed to give you an understanding of how they work, how to use them,
and how to interpret the results the fastest way I know how: to learn by doing.
- Algorithms were demonstrated on synthetic and small standard datasets to give you the context and confidence to bring the techniques to your own projects.
- Model configurations used were discovered through trial and error and are skillful, but not optimized. This leaves the door open for you to explore new and possibly better configurations.
- Code examples are complete and standalone. The code for each lesson will run as-is, with no code from prior lessons or third parties needed beyond the installation of the required packages.
A complete working example is presented with each tutorial for you to inspect and copy-paste.
All source code is also provided with the book and I would recommend running the provided
files whenever possible to avoid any copy-paste issues. The provided code was developed in a
text editor and is intended to be run on the command line. No special IDE or notebooks are
required. If you are using a more advanced development environment and are having trouble,
try running the example from the command line instead.
Machine learning algorithms are stochastic. This means that they can make different predictions
each time the same model configuration is trained on the same training data. On top of that,
each experimental problem in this book is based on generating stochastic predictions. As a
result, you will not get exactly the same sample output presented in this book. This is by
design: I want you to get used to the stochastic nature of the machine learning algorithms.
If this bothers you, please note:
- You can re-run a given example a few times and your results should be close to the values reported.
- You can make the output consistent by fixing the random number seed (see the sketch after this list).
- You can develop a robust estimate of the skill of a model by fitting and evaluating it multiple times and taking the average of the final skill score (highly recommended).
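For example, here is a minimal sketch of fixing the random number seeds that govern this variability; the seed value of 1 is arbitrary, and the make_classification() test problem stands in for whatever data you are working with:

# sketch: fix random number seeds for repeatable output (illustrative)
from numpy.random import seed
from sklearn.datasets import make_classification
# seed the global NumPy random number generator
seed(1)
# scikit-learn functions and estimators also accept an explicit random_state
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
print(X.shape, y.shape)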
All code examples were tested on a POSIX-compatible machine with Python 3. All code
examples will run on modest and modern computer hardware. I am only human, and there
may be a bug in the sample code. If you discover a bug, please let me know so I can fix it and
correct the book (and you can request a free update at any time).
Help with a technique? If you need help with the technical aspects of a specific
operation or technique, see the Further Reading section at the end of each tutorial.
Help with APIs? If you need help with using a Python library, see the list of resources
in the Further Reading section at the end of each lesson, and also see Appendix A.
Help with your workstation? If you need help setting up your environment, I would
recommend using Anaconda and following my tutorial in Appendix B.
Next
Are you ready? Let’s dive in!
This is Just a Sample
Jason Brownlee
Part II
Data Transforms
Chapter 1

How to Scale Numerical Data
Many machine learning algorithms perform better when numerical input variables are scaled
to a standard range. This includes algorithms that use a weighted sum of the input, like
linear regression, and algorithms that use distance measures, like k-nearest neighbors. The two
most popular techniques for scaling numerical data prior to modeling are normalization and
standardization. Normalization scales each input variable separately to the range 0-1, which is
the range for floating-point values where we have the most precision. Standardization scales
each input variable separately by subtracting the mean (called centering) and dividing by the
standard deviation to shift the distribution to have a mean of zero and a standard deviation of
one. In this tutorial, you will discover how to use scaler transforms to standardize and normalize
numerical input variables for classification and regression. After completing this tutorial, you
will know:

- Data scaling is a recommended pre-processing step when working with many machine learning algorithms.
- Data scaling can be achieved by normalizing or standardizing real-valued input and output variables.
- How to apply standardization and normalization to improve the performance of predictive modeling algorithms.
Let’s get started.
1.2 The Scale of Your Data Matters
Attributes are often normalized to lie in a fixed range - usually from zero to one-
by dividing all values by the maximum value encountered or by subtracting the
minimum value and dividing by the range between the maximum and minimum
values.
— Page 61, Data Mining: Practical Machine Learning Tools and Techniques, 2016.
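As a quick worked example of the quoted formula, consider a variable with an observed minimum of 10 and maximum of 30 (made-up values); a value of 18 normalizes to (18 - 10) / (30 - 10) = 0.4:

# sketch: normalization by hand, as described in the quote above (made-up values)
x_min, x_max = 10.0, 30.0
x = 18.0
# subtract the minimum and divide by the range
x_norm = (x - x_min) / (x_max - x_min)
print(x_norm)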
1.3 Numerical Data Scaling Methods

Good practice when applying a scaling transform involves three steps; they are:

1. Fit the scaler using available training data. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the fit() function.
2. Apply the scale to training data. This means you can use the normalized data to train your model. This is done by calling the transform() function.
3. Apply the scale to data going forward. This means you can prepare new data in the future on which you want to make predictions. A sketch of this workflow is shown below.
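Here is a minimal sketch of this three-step workflow using scikit-learn's MinMaxScaler; the tiny train and test arrays are illustrative only:

# sketch: the fit/transform scaling workflow (illustrative data)
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
# contrived train and test sets with two variables on very different scales
X_train = asarray([[100.0, 0.001], [8.0, 0.05], [50.0, 0.005]])
X_test = asarray([[30.0, 0.02]])
scaler = MinMaxScaler()
# 1. fit the scaler on the training data only
scaler.fit(X_train)
# 2. apply the scale to the training data
X_train_scaled = scaler.transform(X_train)
# 3. apply the same scale to new data going forward
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled)
print(X_test_scaled)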
The default scale for the MinMaxScaler is to rescale variables into the range [0,1], although
a preferred scale can be specified via the feature_range argument as a tuple containing the
min and the max for all variables.
...
# create scaler
scaler = MinMaxScaler(feature_range=(0,1))
Now that we are familiar with normalization, let’s take a closer look at standardization.
Another [...] technique is to calculate the statistical mean and standard deviation of
the attribute values, subtract the mean from each value, and divide the result by
the standard deviation. This process is called standardizing a statistical variable
and results in a set of values whose mean is zero and standard deviation is one.
— Page 61, Data Mining: Practical Machine Learning Tools and Techniques, 2016.
Standardization requires that you know or are able to accurately estimate the mean and
standard deviation of observable values. You may be able to estimate these values from your
training data, not the entire dataset.
... it is emphasized that the statistics required for the transformation (e.g., the
mean) are estimated from the training set and are applied to all data sets (e.g., the
test set or new samples).
Subtracting the mean from the data is called centering, whereas dividing by the standard
deviation is called scaling. As such, the method is sometimes called center scaling.
The most straightforward and common data transformation is to center scale the
predictor variables. To center a predictor variable, the average predictor value is
subtracted from all the values. As a result of centering, the predictor has a zero
mean. Similarly, to scale the data, each value of the predictor variable is divided
by its standard deviation. Scaling the data coerce the values to have a common
standard deviation of one.
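A minimal sketch of center scaling by hand on made-up values confirms that the result has a mean of zero and a standard deviation of one:

# sketch: standardization (center scaling) by hand (made-up values)
from numpy import asarray
data = asarray([1.0, 2.0, 3.0, 4.0, 5.0])
# subtract the mean (centering), then divide by the standard deviation (scaling)
standardized = (data - data.mean()) / data.std()
print(standardized.mean(), standardized.std())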
1.4 Diabetes Dataset

...
[ 0.96062565 0.65304778]
[-1.16286263 1.44302493]]
Listing 1.7: Example output from summarizing the variables from the diabetes dataset.
Finally, a histogram is created for each input variable. The plots confirm the differing scale
of the input variables and show the differing shapes of the variable distributions.
Figure 1.1: Histogram Plots of Input Variables for the Diabetes Binary Classification Dataset.
Next, let’s fit and evaluate a machine learning model on the raw dataset. We will use
a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated
stratified k-fold cross-validation. The complete example is listed below.
# evaluate knn on the raw diabetes dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
data = dataset.values
# separate into input and output columns
X, y = data[:, :-1], data[:, -1]
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define and configure the model
model = KNeighborsClassifier()
# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report model performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Note: Your specific results may vary given the stochastic nature of the learning algorithm, the
evaluation procedure, or differences in numerical precision. Consider running the example a few
times and comparing the average performance.
In this case, we can see that the model achieved a mean classification accuracy of about
71.7 percent, showing that it has skill (better than the 65 percent baseline) and is in the
ballpark of good performance (77 percent).
Accuracy: 0.717 (0.040)
Listing 1.9: Example output from evaluating model performance on the diabetes dataset.
Next, let’s explore a scaling transform of the dataset.
1.5 MinMaxScaler Transform

Listing 1.12: Example output from summarizing the variables from the diabetes dataset after a
MinMaxScaler transform.
Histogram plots of the variables are created, although the distributions don't look much
different from their original distributions seen in the previous section. We can confirm that the
minimum and maximum values are zero and one respectively, as we expected.
Figure 1.2: Histogram Plots of MinMaxScaler Transformed Input Variables for the Diabetes
Dataset.
Next, let's evaluate the same KNN model as the previous section, but in this case, on a
MinMaxScaler transform of the dataset. The scaler is applied within a Pipeline so that it is fit
on the training folds only during cross-validation, avoiding data leakage. The complete example
is listed below.
# evaluate knn on the diabetes dataset with minmax scaler transform
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
data = dataset.values
# separate into input and output columns
X, y = data[:, :-1], data[:, -1]
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the pipeline
trans = MinMaxScaler()
model = KNeighborsClassifier()
pipeline = Pipeline(steps=[('t', trans), ('m', model)])
# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report pipeline performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Note: Your specific results may vary given the stochastic nature of the learning algorithm, the
evaluation procedure, or differences in numerical precision. Consider running the example a few
times and comparing the average performance.
Listing 1.14: Example output from evaluating model performance after a MinMaxScaler
transform.
Next, let’s explore the effect of standardizing the input variables.
1.6 StandardScaler Transform

Listing 1.17: Example output from summarizing the variables from the diabetes dataset after a
StandardScaler transform.
Histogram plots of the variables are created, although the distributions don't look much
different from their original distributions seen in the previous section, other than their scale
on the x-axis. We can see that each distribution is now centered on zero, which is more obvious
for some variables than others.
Figure 1.3: Histogram Plots of StandardScaler Transformed Input Variables for the Diabetes
Dataset.
Next, let’s evaluate the same KNN model as the previous section, but in this case, on a
StandardScaler transform of the dataset. The complete example is listed below.
# evaluate knn on the diabetes dataset with standard scaler transform
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
data = dataset.values
# separate into input and output columns
X, y = data[:, :-1], data[:, -1]
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the pipeline
trans = StandardScaler()
model = KNeighborsClassifier()
pipeline = Pipeline(steps=[('t', trans), ('m', model)])
# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report pipeline performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Listing 1.19: Example output from evaluating model performance after a StandardScaler
transform.
1.8 Further Reading

1.8.1 Books
Neural Networks for Pattern Recognition, 1995.
https://amzn.to/2S8qdwt
1.8.2 APIs
sklearn.preprocessing.MinMaxScaler API.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
sklearn.preprocessing.StandardScaler API.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
1.8.3 Articles
Should I normalize/standardize/rescale the data? Neural Nets FAQ.
ftp://ftp.sas.com/pub/neural/FAQ2.html#A_std
1.9 Summary
In this tutorial, you discovered how to use scaler transforms to standardize and normalize
numerical input variables for classification and regression. Specifically, you learned:
- Data scaling is a recommended pre-processing step when working with many machine learning algorithms.
- Data scaling can be achieved by normalizing or standardizing real-valued input and output variables.
- How to apply standardization and normalization to improve the performance of predictive modeling algorithms.
1.9.1 Next
In the next section, we will explore how we can use a robust form of data scaling that is less
sensitive to outliers.
This is Just a Sample
Jason Brownlee