MLA TAB Lecture1
Tabular Data
Learning Outcomes
• Fundamental understanding of data preprocessing, commonly used
machine learning (ML) algorithms, and model evaluation
• Gain familiarity with standard ML tools and data preprocessing libraries,
including methods for detecting and remedying overfitting
• Be comfortable talking with scientist partners
• Learn about valuable and useful ML resources
Course Overview
Lecture 1 Lecture 2 Lecture 3
(Venn diagram: Machine Learning + Domain Expertise → Data Science)
https://en.wikipedia.org/wiki/Data_science
What is Machine Learning?
“Programming computers to learn from experience should
eventually eliminate the need for much of this detailed
programming effort”
Arthur Samuel (1959) – Computer Scientist
Classical Programming: Data + Rules (if/else, etc.) → Answers
Machine Learning: Data + Answers → ML Algorithm → Trained ML Model (Rules)
Some Important ML Terms
Unsupervised Learning — data is provided without labels:
• Clustering: putting similar things together
• Anomaly Detection: finding uncommon things
Supervised vs. Unsupervised Learning
Supervised Learning — data is provided with the correct labels; the model learns by looking at these examples:
• Regression (predict a quantity) and Classification (predict a category)
• Common algorithms: Linear, Logistic, Trees, KNN, SVM, Neural Net
Unsupervised Learning — data is provided without labels; the model finds patterns in the data:
• Common algorithms: K-Means, PCA, Collaborative Filtering
Supervised Learning: Regression
Supervised Learning — data is provided with the correct labels, and the model learns by looking at these examples. Regression predicts a quantity, e.g. Price as a function of SqFootage. Common algorithms: Linear, Logistic, Trees, KNN, SVM.

Example data (first column is the label, the rest are features):

Label      Features
280.000    3    3292    14
210.030    2    2465    6
…          …    …       …
Supervised Learning: Classification
Supervised Learning — data is provided with the correct labels, and the model learns by looking at these examples. Classification predicts a category, e.g. Class 1 = star, Class 0 = not star, separated in the Feature 1 vs. Feature 2 plane. Common algorithms: Linear, Logistic, Trees, KNN, SVM.

Example data (first column is the label, the rest are features):

Label    Features
1        5    10<    750
0        0    >9     150
…        …    …      …
Unsupervised Learning: Clustering
Unsupervised Learning — data is provided without labels; the model finds patterns in the data (clusters in the Feature 1 vs. Feature 2 plane). Common algorithms: K-Means, PCA, Collaborative Filtering.

Example data (features only, no label):

Age    Music        Books
21     Classical    Practical Magic
47     Jazz         The Great Gatsby
…      …            …
Sample ML Problem
Food Delivery Problem
• John loves to order his food online for home and work.
• He wants to predict beforehand whether his order will be on time or late.
• He logged his previous orders:

BadWeather    RushHour    MilesFromRestaurant    UrbanAddress    Late
10            1           5                      1               0
78            0           7                      0               1
14            1           2                      1               0
58            1           4.2                    1               1
82            0           7.8                    0               0
45            …           …                      …               …
Food Delivery Problem
K Nearest Neighbors (KNN) predicts new data points based on K similar records from a dataset.
(Scatter plot of Late vs. On time orders over the Bad Weather feature. What class does '?' belong to? Look at the K closest data points.)
Food Delivery Problem Hands-on
• Let’s use John’s food delivery example and train a K Nearest Neighbors model to predict whether an order will be on time or late (scatter plot: Late vs. On time over the Bad Weather feature).
MLA-TAB-Lecture1-Sample-Model.ipynb
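The hands-on above lives in the notebook; as a minimal standalone sketch (not the course notebook), a KNN classifier can be fit on toy data shaped like John's order log. The values below are illustrative, not John's actual data:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy order log; columns mirror the slide's table
df = pd.DataFrame({
    "BadWeather": [10, 78, 14, 58, 82],
    "RushHour": [1, 0, 1, 1, 0],
    "MilesFromRestaurant": [5.0, 7.0, 2.0, 4.2, 7.8],
    "UrbanAddress": [1, 0, 1, 1, 0],
    "Late": [0, 1, 0, 1, 0],
})

X, y = df.drop(columns="Late"), df["Late"]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict(X)  # predictions on the training data itself
```

In practice the features should be scaled first (see the Feature Scaling section) and the model evaluated on held-out data.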
Model Evaluation
Regression Metrics
Metrics and equations ($y_i$: data values, $\hat{y}_i$: predicted values, $\bar{y}$: mean of the data values, $n$: number of data records):

Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

Root Mean Squared Error (RMSE): $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$

R Squared (R2): $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
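These metrics are available directly in scikit-learn; a small sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # data values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # predicted values

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors = 0.875
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot
```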
Classification Metrics
Confusion matrix (rows: true state, columns: prediction):

              Predicted Positive    Predicted Negative
Positive      18 (TP)               3 (FN)
Negative      2 (FP)                8 (TN)

True Positive: predicted ‘Positive’ when the actual is ‘Positive’
False Positive: predicted ‘Positive’ when the actual is ‘Negative’
False Negative: predicted ‘Negative’ when the actual is ‘Positive’
True Negative: predicted ‘Negative’ when the actual is ‘Negative’

When Positives are the ‘rare’ class and Negatives are many, plain accuracy can be misleading; the F1 Score (the harmonic mean of precision and recall) is more informative.
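A sketch reproducing the confusion matrix above (TP=18, FN=3, FP=2, TN=8) with scikit-learn:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1] * 21 + [0] * 10                       # 21 actual positives, 10 actual negatives
y_pred = [1] * 18 + [0] * 3 + [1] * 2 + [0] * 8    # TP=18, FN=3, FP=2, TN=8

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])  # rows/cols ordered Positive, Negative
precision = precision_score(y_true, y_pred)           # TP / (TP + FP) = 18/20
recall = recall_score(y_true, y_pred)                 # TP / (TP + FN) = 18/21
f1 = f1_score(y_true, y_pred)                         # 2PR / (P + R) = 36/41
```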
The test set is not available to the model for learning; it is only used to
ensure that the model generalizes well to new “unseen” data.
Training – Validation – Test Sets
Split the original dataset into three parts:
• Training Set — used to train the model
• Validation Set — used for unbiased evaluation while you train, tune, and validate the model (multiple times!)
• Test Set — used for the final evaluation of the model
K-fold cross-validation: use K different holdout samples to validate the model, each time training with the remaining samples:
• Split the training dataset into K independent folds.
• Repeat the following K times:
  - Set aside the Kth fold of the data for validation.
  - Train the model on the other folds, the training set.
  - Test the model on the validation set.
• Average or combine the model performance metrics.
(Illustration: each row shows one split, with a different fold held out for validation.)
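A minimal sketch of both ideas, assuming scikit-learn and synthetic data: hold out a test set with `train_test_split`, then cross-validate on the training portion with `cross_val_score`:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic dataset: 100 samples, 4 features, label from a simple rule
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 20% as the final test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training set: one accuracy score per fold
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5)
mean_score = scores.mean()  # average the fold metrics
```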
Underfitting & Overfitting
Model Evaluation: Underfitting
Underfitting: Model is not good enough to describe the relationship
between the input data (x1, x2) and output y: {Class 1, Class 2}.
Model Evaluation: Overfitting
Overfitting: Model memorizes or imitates training data, and fails to
generalize well on new “unseen” data (test data).
Model Evaluation: Good Fit
Appropriate fitting: Model captures the general relationship between the
input data (x1, x2) and output y: {Class 1, Class 2}.
Overfitting Hands-on
• Let’s use John’s food delivery example one more time.
• We train a K Nearest Neighbors model and examine overfitting (scatter plot: Late vs. On time over the Bad Weather feature).
MLA-TAB-Lecture1-Sample-Model.ipynb
Exploratory Data Analysis (EDA)
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to analyzing a dataset
and capturing its main characteristics.
df[num_feature].plot.hist(bins=7); plt.show()          # histogram of a numerical feature
df[cat_feature].value_counts().plot.bar(); plt.show()  # bar plot of a categorical feature
Correlations: Scatterplot
Correlations: How strongly pairs of features are related.
df.plot.scatter(feature1, feature2); plt.show()
Scatterplot matrices visualize feature-target and feature-feature pairwise relationships.
Correlations: Correlation Matrix
Correlations: How strongly pairs of features are related.
cols = [feature1, feature2]
df[cols].corr()
Correlation matrices measure the linear dependence between features; they are easier to read and can be shown as heat-maps.
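A self-contained sketch of `DataFrame.corr()` on made-up delivery-style data (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "miles": [5, 7, 2, 4.2, 7.8],      # distance from restaurant
    "minutes": [22, 31, 10, 19, 33],   # delivery time
})

corr = df.corr()  # Pearson correlation by default; values in [-1, 1]
# corr is a 2x2 symmetric matrix with 1.0 on the diagonal
```

Here longer distances clearly go with longer delivery times, so the off-diagonal entry is strongly positive.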
Handling imbalanced classes:
• Downsample: reduce the size of the dominant or frequent class(es).
• Upsample: increase the size of the rare or small class(es).
• Data generation: create new records, similar but not identical.
• Class weights: for a model that uses a cost function, assign higher weights to rare class(es) and lower weights to dominant class(es).
Missing Data
Handling Missing Data
Drop rows and/or columns with missing values: remove those rows
and/or columns from the dataset.
Fewer training samples and/or fewer features can lead to overfitting/underfitting.
Impute (fill in) the missing values:
• Average imputation on missing numerical values: replace with the average
value in the column - df['col'] = df['col'].fillna(df['col'].mean())
• Common point imputation on missing categorical values: replace with the
most common value for that column - df['col'] = df['col'].fillna(df['col'].mode()[0])
• Placeholder: assign a common value to missing data locations
• Advanced imputation: predict missing values from complete samples using
ML techniques. For example, AWS DataWig uses neural networks to predict
missing values in tabular data: https://github.com/awslabs/datawig
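The two pandas imputation snippets above, run end-to-end on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col": [1.0, np.nan, 3.0],   # numerical column with one missing value
    "cat": ["a", None, "a"],     # categorical column with one missing value
})

# Average imputation: NaN -> mean of [1.0, 3.0] = 2.0
df["col"] = df["col"].fillna(df["col"].mean())

# Common point imputation: .mode() returns a Series, so take its first entry
df["cat"] = df["cat"].fillna(df["cat"].mode()[0])
```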
SimpleImputer in sklearn
SimpleImputer: sklearn imputation transformer for completing missing
values - .fit(), .transform()
SimpleImputer(missing_values=nan, strategy='mean', fill_value=None)
• Numerical data:
strategy='mean': replace missing values using the mean along each column
strategy='median': replace missing values using the median along each column
• Numerical or categorical data:
strategy='most_frequent': replace missing values using the most frequent value along
each column
strategy='constant': replace missing values with fill_value
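A minimal sketch of the fit/transform workflow: fit the imputer's statistics on training data, then apply them:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Column 0 has a missing value; its observed mean is (1 + 7) / 2 = 4.0
X_train = np.array([[1.0, 2.0],
                    [np.nan, 4.0],
                    [7.0, 6.0]])

imp = SimpleImputer(strategy="mean")
X_filled = imp.fit_transform(X_train)  # learns column means, fills NaN with them
# New data would be filled with the SAME training means via imp.transform(X_new)
```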
EDA Hands-on
• Let’s read our review dataset and analyze it.
MLA-TAB-Lecture1-EDA.ipynb
K Nearest Neighbors (KNN)
K Nearest Neighbors (KNN)
K Nearest Neighbors (KNN) predicts new data points based on K
similar records from a dataset.
(Diagram: two classes of points and an unlabeled point '?'.)
What class does '?' belong to?
Look at the K closest data points:
• Choose K = 3
• Calculate the distances from '?' to the other data points, and take the majority class among the K nearest.
In sklearn, the default distance is Minkowski with p=2 (Euclidean): KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
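The Minkowski distance mentioned above can be written out directly; with p=2 it is the familiar Euclidean distance, with p=1 the Manhattan distance:

```python
import numpy as np

def minkowski(a, b, p=2):
    # Minkowski distance: (sum |a_i - b_i|^p)^(1/p)
    # p=2 -> Euclidean, p=1 -> Manhattan
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d2 = minkowski(a, b, p=2)  # Euclidean: sqrt(3^2 + 4^2) = 5.0
d1 = minkowski(a, b, p=1)  # Manhattan: 3 + 4 = 7.0
```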
KNN: Choosing K
What is the effect of K on the model?
• Low K (like K=1): predictions based on only one data sample can be greatly impacted by noise in the data (outliers, mislabeling).
• High K: predictions depend on data closeness and data sparsity; if K is too large, distant points influence the prediction and the model can underfit.
KNN: Feature Scaling
Features should be on the same scale when using KNN.
(Illustration, K = 1: with unscaled features, d1 = 5 and d2 = 2, so '?' is closest to the d2 neighbor. After scaling, d1 = 0.05 and d2 = 0.2 — d2 shrank much less than d1 — so '?' is now closest to the d1 neighbor. Scaling changed which neighbor is nearest.)
Feature Scaling
Feature Scaling
Motivation: many algorithms are sensitive to features being on different
scales, e.g. metric-based algorithms (KNN, K-Means) and gradient-descent-based
algorithms (regression, neural networks).
Note: tree-based algorithms (decision trees, random forests) do not have this
issue.

Min-max scaling transform: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
Standardization transform: $x' = \frac{x - \mu}{\sigma}$ ($\mu$: feature mean, $\sigma$: feature standard deviation)
Pipeline(steps, verbose=False)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ...
])
pipeline.fit(X_train, y_train)   # runs .fit_transform on each step over the training data
pipeline.predict(X_test)         # runs .transform (not re-fit) on the test data
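A complete runnable sketch of such a pipeline, with a KNN classifier as the (assumed) final step and tiny made-up data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 20], [np.nan, 40], [3.0, 60], [4.0, 80]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[2.0, 50]])

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # fill NaN with training means
    ("scaler", MinMaxScaler()),                   # scale features to [0, 1]
    ("knn", KNeighborsClassifier(n_neighbors=3)), # final estimator
])

pipeline.fit(X_train, y_train)   # fit_transform each step on the training data
pred = pipeline.predict(X_test)  # transform (not re-fit) the test data, then predict
```

Bundling preprocessing and model in one pipeline guarantees the test data is transformed with statistics learned only from the training data.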
MLA-TAB-Lecture1-KNN.ipynb
Looking Ahead: Lecture 2
Feature Engineering
Encoding: Handling categorical data
Text Vectorization: Handling text data
Tree Models and Ensemble Learning: More advanced ML models
Hyperparameter Tuning: Tune ML models for best performance
AWS SageMaker: Amazon ML tool that helps you build, train and deploy
ML models on AWS