
MACHINE LEARNING ACCELERATOR

Tabular Data
Learning Outcomes
• Fundamental understanding of data preprocessing, commonly used
machine learning (ML) algorithms, and model evaluation
• Gain familiarity with standard ML tools and data preprocessing libraries,
including methods for detecting and remedying overfitting
• Be comfortable talking with scientist partners
• Learn about valuable and useful ML resources
Course Overview
Lecture 1
• Introduction to ML
• Model Evaluation
   - Train-Validation-Test
   - Overfitting
• Exploratory Data Analysis
• K Nearest Neighbors (KNN)

Lecture 2
• Feature Engineering
• Tree-based Models
   - Decision Tree
   - Random Forest
• Hyperparameter Tuning
• AWS AI/ML Services

Lecture 3
• Optimization
• Regression Models
• Regularization
• Boosting
• Neural Networks
• AutoML


Machine Learning Resources

• Notebooks to build, train, and deploy machine learning models quickly
• D2L: An interactive deep learning book with code, math, and discussions
• Your classmates: collaborate, share ideas, connect, network, learn
Introduction to
Machine Learning
Machine Learning Applications
Business/ML Problem | Description | Example
Ranking | Helping users find the most relevant thing | Ranking algorithm within Amazon Search
Recommendation | Giving users the thing they may be most interested in | Recommendations across the website (Amazon’s Choice)
Classification | Figuring out what kind of thing something is | Product classification for our catalog (High-Low Dress, Straight Dress, Striped Skirt, Graphic Shirt)
Regression | Predicting a numerical value of a thing | Predicting sales for specific ASINs (Seasonality | Out of stock | Promotions)
Clustering | Putting similar things together | Close-matching for near-duplicates
Anomaly Detection | Finding uncommon things | Fruit freshness, before and after (Good, Damage, Serious Damage, Decay)
What is Machine Learning?
What is Data Science?
Wikipedia describes Data Science as:
“a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.”
https://en.wikipedia.org/wiki/Data_science
[Figure: Venn diagram relating Data Science to Machine Learning, Computer Science, Mathematics, Data Processing, Statistical Research, and Domain Expertise]
What is Machine Learning?
“Programming computers to learn from experience should
eventually eliminate the need for much of this detailed
programming effort”
Arthur Samuel (1959) – Computer Scientist
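• Classical programming: Data + Rules → Classical Programming (rules, if/else, etc.) → Answers
• Machine learning: Data + Answers → ML Algorithms → Trained ML Models (Rules)
• Prediction: New Similar Data → Trained ML Models (Rules) → Answers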

Why ML? Why now?
Data
• larger amounts of data, easy to produce, collect and store
Compute
• powerful processing units, hardware acceleration, parallelization
Algorithms
• ML frameworks, libraries, improved and more efficient techniques



Machine Learning Lifecycle
[Figure: ML lifecycle flowchart]
Business Problem → ML Problem Formulation → Data Processing → Algorithm/Model → Meets Business Goal?
• Data Processing: Data Collection, Data Preprocessing, Data Visualization, Data Augmentation, Feature Engineering, etc.
• Model: Training, Tuning, Evaluation
• Yes: Deployment → Answer, with New Data/Re-training feeding back into the cycle
• No: loop back and iterate
Some Important ML Terms

ML | Statistics/Math/other | Simply Put
Label/Target/y | Dependent/Response/Output Variable | The thing you’re trying to predict
Feature/x | Independent/Explanatory/Input Variable | Data that help you make predictions
Feature Engineering | Transformation | Reshaping data to get more value
1d, 2d, … nd | Dimensionality | Number of features
Model Parameters | Weights | A set of numbers embedded in a model that can predict the labels
Model Training | Optimization | Finding the ‘best’ set of model parameters
Supervised & Unsupervised
Learning
Supervised vs. Unsupervised Learning
[Table: the Business/ML problems (Ranking, Recommendation, Classification, Regression, Clustering, Anomaly Detection) grouped into Supervised Learning and Unsupervised Learning]
• Supervised Learning: Data is provided with the correct labels. The model learns by looking at these examples.
   - Problem types: Regression (Quantity), Classification (Category)
   - Algorithms: Linear, Logistic, KNN, Trees, SVM, Neural Net
• Unsupervised Learning: Data is provided without labels. The model finds patterns in data.
   - Algorithms: K-Means, PCA, Collaborative Filtering
Supervised Learning: Regression
Data is provided with the correct labels, and the model learns by looking at these examples. Regression predicts a quantity.
[Figure: scatter plot of Price vs. SqFootage]
Price (Label) | Bedrooms (Feature) | SqFootage (Feature) | Age (Feature)
280.000 | 3 | 3292 | 14
210.030 | 2 | 2465 | 6
… | … | … | …
Supervised Learning: Classification
Data is provided with the correct labels, and the model learns by looking at these examples. Classification predicts a category: Class 1 = star, Class 0 = not star.
[Figure: scatter plot of Feature 2 vs. Feature 1]
Star (Label) | Points (Feature) | Edges (Feature) | Size (Feature)
1 | 5 | 10< | 750
0 | 0 | >9 | 150
… | … | … | …
Unsupervised Learning: Clustering
Data is provided without labels, and the model finds patterns in data (K-Means, PCA, Collaborative Filtering).
[Figure: scatter plot of Feature 2 vs. Feature 1]
Age (Feature) | Music (Feature) | Books (Feature)
21 | Classical | Practical Magic
47 | Jazz | The Great Gatsby
… | … | …
Sample ML Problem
Food Delivery Problem
• John loves to order his food online for home and work.
• He wants to predict beforehand whether his order will be on time or late.
• He logged his previous orders.

BadWeather | RushHour | MilesFromRestaurant | UrbanAddress | Late
10 | 1 | 5 | 1 | 0
78 | 0 | 7 | 0 | 1
14 | 1 | 2 | 1 | 0
58 | 1 | 4.2 | 1 | 1
82 | 0 | 7.8 | 0 | 0
45 | … | … | … | …

Two classes: 1/late and 0/on time


Food Delivery Problem
K Nearest Neighbors (KNN) predicts new data points based on K similar records from a dataset.
What class does ? belong to?
Look at the K closest data points:
• Choose K = 3
• Calculate the distances from ? to all data points
• Find the K nearest neighbors
• Pick the majority class
[Figure: scatter plot of Miles from Restaurant vs. Bad Weather, showing Late and On time orders and the new point ?]
Food Delivery Problem Hands-on
• Let’s use John’s food delivery example, and train a K Nearest Neighbors model to predict the new data point.
[Figure: scatter plot of Miles from Restaurant vs. Bad Weather, showing Late and On time orders and the new point ?]

MLA-TAB-Lecture1-Sample-Model.ipynb
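As a rough sketch of what such a hands-on step could look like (this is not the notebook's code), the snippet below fits a scikit-learn KNN classifier on the five logged orders from the table, using only the two plotted features. The new order `[50, 4.0]` is a made-up point, and the features are left unscaled to keep the sketch short.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Logged orders from the table: [BadWeather, MilesFromRestaurant] -> Late
X = np.array([[10, 5.0], [78, 7.0], [14, 2.0], [58, 4.2], [82, 7.8]])
y = np.array([0, 1, 0, 1, 0])  # 1 = late, 0 = on time

# K = 3: each prediction is a majority vote among the 3 closest orders
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Hypothetical new order (assumption, not from the slides)
new_order = np.array([[50, 4.0]])
prediction = knn.predict(new_order)[0]
print("late" if prediction == 1 else "on time")
```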
Model Evaluation
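Regression Metrics
Notation:
• $y_i$: Data values
• $\hat{y}_i$: Predicted values
• $\bar{y}$: Mean value of data values, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
• $n$: Number of data records

Metrics | Equations
Mean Squared Error (MSE) | $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Root Mean Squared Error (RMSE) | $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
Mean Absolute Error (MAE) | $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$
R Squared ($R^2$) | $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$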
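A minimal sketch of these four regression metrics, computed with NumPy from their standard definitions. The data values and predictions are made up for illustration.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # made-up data values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # made-up predicted values

errors = y_true - y_pred
mse = np.mean(errors ** 2)                 # Mean Squared Error
rmse = np.sqrt(mse)                        # Root Mean Squared Error
mae = np.mean(np.abs(errors))              # Mean Absolute Error
# R Squared: 1 - (residual sum of squares / total sum of squares)
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, mae, r2)
```

`sklearn.metrics` offers `mean_squared_error`, `mean_absolute_error`, and `r2_score` for the same quantities.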
Classification Metrics
Confusion matrix (‘Positive’ = star, ‘Negative’ = not star):
True State | Predicted Positive | Predicted Negative
Positive | True Positive: 18 | False Negative: 3
Negative | False Positive: 1 | True Negative: 15

• True Positive: Predicted ‘Positive’ when the actual is ‘Positive’
• False Positive: Predicted ‘Positive’ when the actual is ‘Negative’
• False Negative: Predicted ‘Negative’ when the actual is ‘Positive’
• True Negative: Predicted ‘Negative’ when the actual is ‘Negative’
Classification Metrics: Accuracy
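Accuracy: The percent (ratio) of cases classified correctly.
$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$

True State | Predicted Positive | Predicted Negative
Positive | True Positive: 18 | False Negative: 3
Negative | False Positive: 1 | True Negative: 15

For this matrix, Accuracy = (18 + 15) / (18 + 3 + 1 + 15) = 33/37 ≈ 0.89.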
Classification Metrics: Accuracy
High Accuracy Paradox: Accuracy is misleading when dealing with imbalanced datasets, where there are few True Positives (the ‘rare’ class) and many True Negatives (the ‘dominant’ class). Accuracy stays high even when there are few True Positives.

True State | Predicted Positive | Predicted Negative
Positive | True Positive: 2 | False Negative: 8
Negative | False Positive: 2 | True Negative: 88

For this matrix, Accuracy = (2 + 88) / 100 = 0.90, although only 2 of the 10 actual positives are found.
Classification Metrics: Precision
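Precision: Accuracy of a predicted positive outcome.
$\mathrm{Precision} = \frac{TP}{TP + FP}$

True State | Predicted Positive | Predicted Negative
Positive | True Positive: 2 | False Negative: 8
Negative | False Positive: 2 | True Negative: 88

For this matrix, Precision = 2 / (2 + 2) = 0.5.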
Classification Metrics: Recall
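Recall: Measures the model’s ability to predict a positive outcome.
$\mathrm{Recall} = \frac{TP}{TP + FN}$

True State | Predicted Positive | Predicted Negative
Positive | True Positive: 2 | False Negative: 8
Negative | False Positive: 2 | True Negative: 88

For this matrix, Recall = 2 / (2 + 8) = 0.2.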
Classification Metrics: F1 Score

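F1 Score: A combined metric, the harmonic mean of Precision and Recall.
$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
• Low when one or both of Precision and Recall are low
• High when both Precision and Recall are high
[Figure: F1 Score plotted over Precision and Recall]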
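The imbalanced confusion matrix used on these slides (TP = 2, FN = 8, FP = 2, TN = 88) makes a compact worked example. The sketch below computes all four classification metrics from their standard definitions; the variable names are illustrative.

```python
# Counts taken from the imbalanced confusion matrix on the slides
tp, fn, fp, tn = 2, 8, 2, 88

accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of all cases classified correctly
precision = tp / (tp + fp)                  # how often a predicted positive is right
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.3f}")
```

Accuracy looks strong here while recall and F1 expose how poorly the rare class is handled.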
Train – Validation – Test
Datasets
Training – Validation – Test Sets
The original dataset is split into three sets:
• Training Set: used to train the model
• Validation Set: used for unbiased evaluation of the model
• Test Set: used for final evaluation of the model

Learning uses the training and validation sets: train, tune and validate the model (multiple times!). Testing uses the test set: (final) test the model.

The test set is not available to the model for learning; it is only used to ensure that the model generalizes well on new “unseen” data.
It is good practice to shuffle the dataset before splitting it into training, validation, and test sets, to avoid bias in the resulting sets.
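One possible way to produce the three sets with scikit-learn is to call `train_test_split` twice, shuffling each time. The 60/20/20 proportions and the toy arrays below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 records, 2 features, binary label (illustrative only)
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split off the test set (20%), shuffling before the split
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

# Then split the remainder into training (60% overall) and validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, shuffle=True, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```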
K-fold Cross Validation ( K = 5)
K-fold Cross-Validation (CV) is a validation technique to see how well a trained model generalizes to an independent validation set.

Use K different holdout samples to validate the model, each time training with the remaining samples:
• Split the training dataset into K independent folds.
• Repeat the following K times:
   - Set aside the k-th fold of the data for validation.
   - Train the model on the other folds, the training set.
   - Test the model on the validation set.
• Average or combine the model performance metrics.

Training set folds for K = 5:
Iteration | 1st fold | 2nd fold | 3rd fold | 4th fold | 5th fold
1 | Training | Training | Training | Training | Validation
2 | Training | Training | Training | Validation | Training
3 | Training | Training | Validation | Training | Training
4 | Training | Validation | Training | Training | Training
5 | Validation | Training | Training | Training | Training
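The procedure above can be sketched with scikit-learn's `cross_val_score`, which handles the fold splitting, training, and validation loop. The synthetic dataset and the choice of a KNN classifier are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic training data (illustrative only)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# cv=5: each of the 5 folds is held out once while the other 4 train the model
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)

# Average the per-fold validation accuracies
mean_score = scores.mean()
print(scores, mean_score)
```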
Underfitting & Overfitting
Model Evaluation: Underfitting
Underfitting: Model is not good enough to describe the relationship
between the input data (x1, x2) and output y: {Class 1, Class 2}.

• Model is too simple to capture important patterns of training data.
• Model will perform poorly on training and validation (and/or test).
[Figure: Class 1 and Class 2 points plotted over x1 and x2, with an underfit decision boundary]
Model Evaluation: Overfitting
Overfitting: Model memorizes or imitates training data, and fails to
generalize well on new “unseen” data (test data).

• Model is too complex.
• Model picks up the noise instead of the underlying relationship.
• Model will perform well on training, but poorly on validation (and/or test).
[Figure: Class 1 and Class 2 points plotted over x1 and x2, with an overfit decision boundary]
Model Evaluation: Good Fit
Appropriate fitting: Model captures the general relationship between the
input data (x1, x2) and output y: {Class 1, Class 2}.

• Model is not too simple and not too complex.
• Model picks up the underlying relationship rather than the noise in the training data.
• Model will perform well enough on training and validation (and/or test).
[Figure: Class 1 and Class 2 points plotted over x1 and x2, with a well-fit decision boundary]
Overfitting Hands-on
• Let’s use John’s food delivery example one more time.
• We train a K Nearest Neighbors model and analyze overfitting.
[Figure: scatter plot of Miles from Restaurant vs. Bad Weather, showing Late and On time orders and the new point ?]

MLA-TAB-Lecture1-Sample-Model.ipynb
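As an illustration of how overfitting can show up with KNN (a sketch on synthetic, deliberately noisy data, not the notebook's code), compare training and validation accuracy for a very small K and a larger one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 20% randomly flipped labels to simulate noise (assumption)
X, y = make_classification(n_samples=300, n_features=4, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for k in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, validation accuracy) for each K
    results[k] = (model.score(X_train, y_train), model.score(X_val, y_val))
    print(f"K={k}: train={results[k][0]:.2f}, validation={results[k][1]:.2f}")
```

With K = 1 every training point is its own nearest neighbor, so the model memorizes the training set; a gap between training and validation accuracy is the sign of overfitting to look for.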
Exploratory Data Analysis (EDA)
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to analyzing a dataset and capturing its main characteristics.

• Collect or aggregate data
• Perform initial investigations to discover patterns, spot anomalies, test hypotheses and check assumptions
   - Summary statistics
   - Graphical/visual representations (histograms, plots)
• Process data to produce meaningful information
Descriptive Statistics
Overall statistics - df.head(), df.shape, df.info()
• Number of instances (i.e. number of rows)
• Number of features (i.e. number of columns)
Univariate statistics (single feature)
• Statistics for numerical features (mean, variance, histogram) - df.describe(), plt.hist(df[feature])
• Statistics for categorical features (histograms, mode, most/least frequent values, percentage, number of unique values)
   - Histogram of values - df[feature].value_counts() or seaborn’s histplot() (the older distplot() is deprecated)
• Target statistics
   - Class distribution - df[target].value_counts() or np.bincount(y)
Multivariate statistics (more than one feature)
• Correlations - df.plot.scatter(feature1, feature2), df[[feature1, feature2]].corr()
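A small sketch of these descriptive-statistics calls on a made-up DataFrame. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical dataset: one numerical feature, one categorical feature, one target
df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 30.0, 11.0],
    "category": ["book", "toy", "book", "book", "toy"],
    "label": [1, 0, 1, 1, 0],
})

n_rows, n_cols = df.shape                          # number of instances and features
numeric_summary = df.describe()                    # count, mean, std, min, quartiles, max
category_counts = df["category"].value_counts()    # frequency of each category
class_distribution = df["label"].value_counts()    # target class balance

print(n_rows, n_cols)
print(numeric_summary)
print(category_counts)
print(class_distribution)
```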
Univariate Statistics: Histograms
Numerical features:
   import matplotlib.pyplot as plt
   df[num_feature].plot.hist(bins=7)
   plt.show()

Categorical features:
   import matplotlib.pyplot as plt
   df[cat_feature].value_counts().plot.bar()
   plt.show()
Correlations: Scatterplot
Correlations: How strongly pairs of features are related.
   df.plot.scatter(feature1, feature2)
   plt.show()
Scatterplot matrices visualize feature-target and feature-feature pairwise relationships.
Correlations: Correlation Matrix
Correlations: How strongly pairs of features are related.
   cols = [feature1, feature2]
   df[cols].corr()
Correlation matrices measure the linear dependence between features; they are easier to read and can be displayed as heat-maps.

Correlation values are between -1 and 1: -1 means perfect negative correlation, 1 means perfect positive correlation, and 0 means there is no linear relationship between the two variables.
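To make the range of correlation values concrete, the sketch below builds a synthetic DataFrame whose columns are, by construction, perfectly positively correlated, perfectly negatively correlated, and unrelated to a base feature. All column names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

df = pd.DataFrame({
    "x": x,
    "pos": 2 * x + 1,                 # exact increasing linear function of x
    "neg": -3 * x,                    # exact decreasing linear function of x
    "noise": rng.normal(size=200),    # generated independently of x
})

# Pairwise Pearson correlation between all columns
corr = df.corr()
print(corr.round(2))
```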
Correlations
Highly correlated (positive or negative) features might degrade the performance of some ML models, such as linear and logistic regression models.
• Select one of the correlated features and discard the other(s).
• Other ML models, like decision trees, are mostly immune to this problem.

In contrast, features that are highly correlated (positively or negatively) with the target might improve the performance of linear and logistic regression models.
Imbalanced Datasets
Class Imbalance
• Number of samples per class is not equally distributed.
• The ML model may not work well for the infrequent classes.
Examples:
• Fraud Detection
• Anomaly Detection
• Medical Diagnosis
[Figure: bar chart of star rating count in millions (0.5 to 2.5) by star rating (1 to 5)]
Amazon review dataset: The number of 5 star reviews almost equals the total of the other 4 types of star reviews combined.
Class Imbalance
How to address class imbalance problems?

• Down-sampling: Reduce the size of the dominant or frequent class(es).
• Up-sampling: Increase the size of the rare or small class(es).
• Data generation: Create new records, similar but not identical.
• Sample weights: For a model that uses a cost function, assign higher weights to rare class(es) and lower weights to dominant class(es).
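As one possible illustration of up-sampling, the sketch below uses `sklearn.utils.resample` to draw rare-class records with replacement until both classes have the same size. The toy 10/90 dataset is an assumption.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced dataset: 10 rare-class records (1) and 90 dominant-class records (0)
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)

X_rare, X_dom = X[y == 1], X[y == 0]

# Up-sample the rare class with replacement to match the dominant class size
X_rare_up = resample(X_rare, replace=True, n_samples=len(X_dom), random_state=0)

X_balanced = np.vstack([X_dom, X_rare_up])
y_balanced = np.array([0] * len(X_dom) + [1] * len(X_rare_up))
print(np.bincount(y_balanced))
```

Resampling should be applied to the training set only, so that validation and test sets keep the original class distribution.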
Missing Data
Handling Missing Data
Drop rows and/or columns with missing values: Remove those rows and/or columns from the dataset.
   - Fewer training samples and/or fewer features can lead to overfitting/underfitting.
Impute (fill in) the missing values:
• Average imputation on missing numerical values: Replace with the average value in the column - df['col'].fillna(df['col'].mean())
• Common point imputation on missing categorical values: Replace with the most common value for that column - df['col'].fillna(df['col'].mode()[0])
• Placeholder: Assign a common value to the missing data locations
• Advanced imputation: Predict missing values from complete samples using ML techniques. For example, AWS Datawig uses neural networks to predict missing values in tabular data: https://github.com/awslabs/datawig
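A short sketch of average and common point imputation with pandas on a made-up DataFrame. Note that `mode()` returns a Series, so its first entry is selected before filling.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing numerical and one missing categorical value
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "color": ["red", None, "red"],
})

# Average imputation: fill numerical gaps with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Common point imputation: fill categorical gaps with the most frequent value
df["color"] = df["color"].fillna(df["color"].mode()[0])

print(df)
```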
SimpleImputer in sklearn
SimpleImputer: sklearn imputation transformer for completing missing values - .fit(), .transform()
SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None)
• numerical data:
   - strategy = 'mean': replace missing values using the mean along each column
   - strategy = 'median': replace missing values using the median along each column
• numerical or categorical data:
   - strategy = 'most_frequent': replace missing values using the most frequent value along each column
   - strategy = 'constant': replace missing values with fill_value
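A minimal usage sketch of `SimpleImputer` on made-up arrays: the column means are learned from the training data with `.fit_transform()` and then reused on new data with `.transform()`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and test data with missing entries
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Learn the column means from the training data and fill its gaps
X_train_filled = imputer.fit_transform(X_train)

# Reuse the training means on the test data (no re-fitting)
X_test_filled = imputer.transform(X_test)

print(X_train_filled)
print(X_test_filled)
```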
EDA Hands-on
• Let’s read our review dataset and analyze it.

MLA-TAB-Lecture1-EDA.ipynb
K Nearest Neighbors (KNN)
K Nearest Neighbors (KNN) predicts new data points based on K similar records from a dataset.
What class does ? belong to?
Look at the K closest data points:
• Choose K = 3
• Calculate the distances from ? to all data points
• Find the K nearest neighbors
• Pick the majority class
[Figure: data points from two classes with the new point ? and its K = 3 nearest neighbors]
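The steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration assuming Euclidean distance; the function name `knn_predict` and the toy points are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Calculate the (Euclidean) distances from x_new to all data points
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Find the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Pick the majority class among them
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: three points of class 0 and two points of class 1
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 5.5]])
y_train = np.array([0, 0, 0, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.2, 1.3]))
pred_b = knn_predict(X_train, y_train, np.array([6.5, 6.0]))
print(pred_a, pred_b)
```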
KNN in sklearn
KNeighborsClassifier: sklearn classifier implementing the K Nearest
Neighbors vote - .fit(), .predict()

KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)

• n_neighbors: How to choose K?


• metric: How to calculate distances?

The full interface is larger.


KNN: Choosing K
How to choose K?
• Use a validation set to select an appropriate value for K.
[Figure: the predicted class of ? for K = 1, K = 3, K = 5, and K = 7]
KNN: Choosing K
What is the effect of K on the model?
• Low K (like K=1): predictions based on only one data sample could
be greatly impacted by noise in the data (outliers, mislabeling)

• Large K: more robust to noise, but the nearest “neighborhood” can


get too inclusive, breaking the “locality”, and a class with only a few
samples in it will always be outvoted by the other classes

• Rule of thumb in selecting K: $K = \sqrt{N}$, where $N$ is the number of data samples
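A sketch of selecting K with a validation set, on synthetic data. The candidate values of K are arbitrary, and the square-root rule of thumb computed at the end is an assumption about the rule the slide refers to, included only as a starting point.

```python
import math

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, split into training and validation sets (illustrative only)
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Score each candidate K on the validation set
val_scores = {}
for k in (1, 3, 5, 7, 9, 11, 15, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = model.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)

# Assumed rule of thumb: K close to the square root of the number of training samples
rule_of_thumb_k = round(math.sqrt(len(X_train)))
print(best_k, rule_of_thumb_k)
```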
KNN: Choosing the Distance Metric
Data samples are considered similar to each other if they are close to each other, as determined by a specific distance metric.
How to choose the distance metric?
• Real-valued features:
   - Similar types: p = 2, Euclidean
   - Mixed types (lengths, ages, salaries): p = 1, Manhattan (taxi-cab)
• Binary-valued features: Hamming
   - number of positions where the values of two vectors are different
• Boolean-valued features: Jaccard
   - ratio of shared values to the total number of values of two vectors
• High dimensional feature space: cosine similarity
   - the angle between two vectors (similarity irrespective of their sizes)
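For reference, the metrics listed above can be written out with NumPy. These are minimal sketches following the common textbook definitions, with illustrative function names; Jaccard is shown as a similarity (shared over total), matching the description above.

```python
import numpy as np


def euclidean(a, b):
    # Minkowski distance with p = 2
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan(a, b):
    # Minkowski distance with p = 1 (taxi-cab)
    return float(np.sum(np.abs(a - b)))


def hamming(a, b):
    # Number of positions where the two vectors differ
    return int(np.sum(a != b))


def jaccard_similarity(a, b):
    # Shared "on" positions divided by positions that are "on" in either vector
    a, b = a.astype(bool), b.astype(bool)
    union = np.sum(a | b)
    return float(np.sum(a & b) / union) if union else 1.0


def cosine_similarity(a, b):
    # Cosine of the angle between the vectors, independent of their lengths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


p, q = np.array([0.0, 0.0]), np.array([3.0, 4.0])
u, v = np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])
print(euclidean(p, q), manhattan(p, q), hamming(u, v), jaccard_similarity(u, v))
```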
KNN: Curse of Dimensionality
With too many features, KNN becomes computationally expensive and
difficult to solve.

[Figure: data closeness and data sparsity around the query point ? as the number of features grows]
KNN: Feature Scaling
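Features should be on the same scale when using KNN.
[Figure: the same query point ? and two neighbors before and after scaling, with K = 1]
• Unscaled features (axes from 0 to 100 and from 0 to 10): d1 = 5 and d2 = 2, so ? is closer to the neighbor at distance d2.
• Scaled features (both axes from 0 to 1): d1 = 0.05 and d2 = 0.2, so ? is closer to the neighbor at distance d1.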
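The effect in the figure can be reproduced numerically. In the sketch below the point coordinates are assumptions chosen to match the distances shown (d1 = 5 and d2 = 2 before scaling), and scaling simply divides each feature by its assumed maximum.

```python
import numpy as np

# Assumed coordinates: feature 1 ranges over 0..100, feature 2 over 0..10
query = np.array([85.0, 5.0])
neighbors = np.array([
    [80.0, 5.0],   # neighbor A: differs by d1 = 5 along feature 1
    [85.0, 3.0],   # neighbor B: differs by d2 = 2 along feature 2
])


def nearest(points, q):
    """Index of the single nearest neighbor (K = 1) by Euclidean distance."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))


unscaled_nearest = nearest(neighbors, query)

# Scale both features to the 0..1 range (minimum assumed to be 0)
feature_max = np.array([100.0, 10.0])
scaled_nearest = nearest(neighbors / feature_max, query / feature_max)

print(unscaled_nearest, scaled_nearest)
```

Before scaling, neighbor B wins only because feature 1 has a larger numeric range; after scaling, neighbor A is the nearest one.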
Feature Scaling
Feature Scaling
Motivation: Many algorithms are sensitive to features being on different
scales, like metric-based algorithms (KNN, K Means) and gradient
descent-based algorithms (regression, neural networks)
Note: tree-based algorithms (decision trees, random forests) do not have this
issue

Solution: Bring features to the same scale


Common choices (both linear):
• Mean/variance standardization
• MinMax scaling
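For reference, the two scalings are usually written as follows, where $\mu$ and $\sigma$ are the mean and standard deviation of the feature and $x_{\min}$, $x_{\max}$ are its minimum and maximum (the symbol names are chosen here for illustration):

```latex
x_{\text{std}} = \frac{x - \mu}{\sigma}
\qquad\qquad
x_{\text{minmax}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

Both are linear maps of $x$, so they change the scale of a feature without changing the ordering of its values.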
Standardization in sklearn
StandardScaler: sklearn scaler, scaling values to be centered around mean 0 with standard deviation 1 - .fit(), .transform()

Transform:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Scale one feature (a single column) to mean 0 and standard deviation 1
stdsc = StandardScaler()
raw_data = np.array([[-3.4], [4.5], [50], [24], [3.4], [1.6]])
scaled_data = stdsc.fit_transform(raw_data)
print(scaled_data.reshape(1, -1))
MinMax Scaling in sklearn
MinMaxScaler: sklearn scaler, scaling values so that the minimum value is 0 and the maximum value is 1 - .fit(), .transform()

Transform:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Scale one feature (a single column) to the range [0, 1]
minmaxsc = MinMaxScaler()
raw_data = np.array([[-3.4], [4.5], [50], [24], [3.4], [1.6]])
scaled_data = minmaxsc.fit_transform(raw_data)
print(scaled_data.reshape(1, -1))
Pipeline (sklearn)
Pipeline in sklearn
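Pipeline: sklearn sequential data transforms with a final estimator (prevents data leakage) - .fit(), .predict()

Pipeline(steps, verbose=False)

from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Each step is a (name, transformer) pair; the last step is the estimator
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('clf', KNeighborsClassifier(n_neighbors=3))
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

[Figure: pipeline.fit(X_train, y_train) calls .fit_transform on the imputer, .fit_transform on the scaler, and .fit on the classifier; pipeline.predict(X_test) calls .transform on the imputer, .transform on the scaler, and .predict on the classifier]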
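A self-contained version of this pipeline, run end to end on synthetic data, might look like the sketch below. The dataset, the injected missing values, and the split proportions are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with some values knocked out to exercise the imputer
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X[::10, 0] = np.nan
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])

# Imputation means and scaling ranges are learned from the training data only
pipeline.fit(X_train, y_train)

# The same fitted transforms are applied to the test data before predicting
predictions = pipeline.predict(X_test)
test_accuracy = pipeline.score(X_test, y_test)
print(test_accuracy)
```

Because the transformers are fitted inside `pipeline.fit`, no statistics from the test set leak into preprocessing.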
Putting it all together
• In this notebook, we work with our review dataset to predict the target field.
• The notebook covers the following tasks:
   - Exploratory Data Analysis
   - Splitting dataset into training and test sets
   - Numerical features processing
   - Train a K Nearest Neighbors Classifier
   - Check the performance metrics on test set

MLA-TAB-Lecture1-KNN.ipynb
Looking Ahead: Lecture 2
Feature Engineering
Encoding: Handling categorical data
Text Vectorization: Handling text data
Tree Models and Ensemble Learning: More advanced ML models
Hyperparameter Tuning: Tune ML models for best performance
AWS SageMaker: Amazon ML tool that helps you build, train and deploy
ML models on AWS
