
Introduction to k-Nearest Neighbors: Simplified (with implementation in Python)
Introduction
In four years of my career in analytics, more than 80% of the models I have built have been classification
models and just 15-20% regression models. These ratios are more or less
generalizable throughout the industry. The reason for this bias towards classification models
is that most analytical problems involve making a decision. For instance: will a customer
attrite or not, should we target customer X for digital campaigns, does a customer have
high potential or not, and so on. These analyses are more insightful and link directly to an
implementation roadmap. In this article, we will talk about another widely used
classification technique called K-nearest neighbors (KNN). Our focus will be primarily on
how the algorithm works and how the input parameter affects the
output/prediction.

Table of Contents
 When do we use KNN algorithm?
 How does the KNN algorithm work?
 How do we choose the factor K?
 Breaking it Down – Pseudo Code of KNN
 Implementation in Python from scratch
 Comparing our model with scikit-learn

When do we use KNN algorithm?


KNN can be used for both classification and regression predictive problems. However, it
is more widely used in classification problems in the industry. To evaluate any technique
we generally look at 3 important aspects:

1. Ease of interpreting the output

2. Calculation time

3. Predictive Power

Let us take a few examples to place KNN on this scale:


The KNN algorithm fares well across all parameters of consideration. It is commonly used for its ease
of interpretation and low calculation time.

How does the KNN algorithm work?


Let’s take a simple case to understand this algorithm. Following is a spread of red circles
(RC) and green squares (GS):

You intend to find out the class of the blue star (BS). BS can either be RC or GS and
nothing else. The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote
from. Let’s say K = 3. Hence, we will now make a circle with BS as the center, just big enough
to enclose only three data points on the plane. Refer to the following diagram for more
details:
The three closest points to BS are all RC. Hence, with a good confidence level, we can say
that BS should belong to the class RC. Here, the choice became very obvious as all
three votes from the closest neighbors went to RC. The choice of the parameter K is very
crucial in this algorithm. Next we will understand the factors to be considered to
conclude the best K.

How do we choose the factor K?


First let us try to understand what exactly K influences in the algorithm. If we see the
last example, given that all 6 training observations remain constant, with a given K
value we can make boundaries of each class. These boundaries will segregate RC from
GS. In the same way, let’s try to see the effect of the value of “K” on the class boundaries.
Following are the different boundaries separating the two classes for different values of
K.
If you watch carefully, you can see that the boundary becomes smoother with increasing
value of K. With K increasing to infinity it finally becomes all blue or all red, depending on
the total majority. The training error rate and the validation error rate are two parameters
we need to assess for different K values. Following is the curve for the training error rate
with varying value of K:

As you can see, the error rate at K=1 is always zero for the training sample. This is
because the closest point to any training data point is itself. Hence the prediction is
always accurate with K=1. If the validation error curve had been similar, our choice
of K would have been 1. Following is the validation error curve with varying value of K:
This makes the story clearer. At K=1, we were overfitting the boundaries. Hence, the
error rate initially decreases and reaches a minimum. After the minimum point, it then
increases with increasing K. To get the optimal value of K, you can segregate the training
and validation sets from the initial dataset. Now plot the validation error curve to get the
optimal value of K. This value of K should be used for all predictions.
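A minimal sketch of this procedure (illustrative only; scikit-learn and the Iris data are assumptions, not part of the original walkthrough): trace the cross-validated error over a range of K and pick the K with the lowest error.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k_values = range(1, 26)
val_errors = []
for k in k_values:
    # 5-fold cross-validated accuracy; validation error = 1 - accuracy
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    val_errors.append(1 - acc)

best_k = list(k_values)[int(np.argmin(val_errors))]
print("K with the lowest validation error:", best_k)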

Breaking it Down – Pseudo Code of KNN


We can implement a KNN model by following the below steps:

1. Load the data


2. Initialise the value of k
3. For getting the predicted class, iterate from 1 to total number of training data
points
1. Calculate the distance between the test data and each row of the training
data. Here we will use Euclidean distance as our distance metric since it’s
the most popular method. Other metrics that can be used are
Chebyshev, cosine, etc. (a short example follows this list).
2. Sort the calculated distances in ascending order based on distance values
3. Get top k rows from the sorted array
4. Get the most frequent class of these rows
5. Return the predicted class
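Before the full implementation, here is a minimal sketch of the distance metrics mentioned in step 3.1 (SciPy is an assumption here; the from-scratch code below uses only Euclidean distance):

import numpy as np
from scipy.spatial import distance

a = np.array([7.2, 3.6, 5.1, 2.5])
b = np.array([6.7, 3.3, 5.7, 2.1])

print(distance.euclidean(a, b))   # straight-line distance
print(distance.chebyshev(a, b))   # largest coordinate-wise difference
print(distance.cosine(a, b))      # 1 - cosine similarity of the two vectors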

Implementation in Python from scratch


We will be using the popular Iris dataset for building our KNN model. You can download
it from here.

# Importing libraries
import pandas as pd

import numpy as np

import math

import operator

#### Start of STEP 1

# Importing data

data = pd.read_csv("iris.csv")

#### End of STEP 1

data.head()

# Defining a function which calculates euclidean distance between two data points
def euclideanDistance(data1, data2, length):
    distance = 0
    for x in range(length):
        distance += np.square(data1[x] - data2[x])
    return np.sqrt(distance)

# Defining our KNN model
def knn(trainingSet, testInstance, k):

    distances = {}
    length = testInstance.shape[1]

    #### Start of STEP 3
    # Calculating euclidean distance between each row of training data and test data
    for x in range(len(trainingSet)):
        #### Start of STEP 3.1
        dist = euclideanDistance(testInstance, trainingSet.iloc[x], length)
        distances[x] = dist[0]
        #### End of STEP 3.1

    #### Start of STEP 3.2
    # Sorting them on the basis of distance
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
    #### End of STEP 3.2

    neighbors = []

    #### Start of STEP 3.3
    # Extracting top k neighbors
    for x in range(k):
        neighbors.append(sorted_d[x][0])
    #### End of STEP 3.3

    classVotes = {}

    #### Start of STEP 3.4
    # Calculating the most freq class in the neighbors
    for x in range(len(neighbors)):
        response = trainingSet.iloc[neighbors[x]][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    #### End of STEP 3.4

    #### Start of STEP 3.5
    # Sorting the votes in descending order and returning the top class
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return(sortedVotes[0][0], neighbors)
    #### End of STEP 3.5

# Creating a dummy testset

testSet = [[7.2, 3.6, 5.1, 2.5]]

test = pd.DataFrame(testSet)

#### Start of STEP 2

# Setting number of neighbors = 1

k = 1

#### End of STEP 2

# Running KNN model

result,neigh = knn(data, test, k)


# Predicted class

print(result)

-> Iris-virginica

# Nearest neighbor

print(neigh)

-> [141]

Now we will try to alter the k values, and see how the prediction changes.

# Setting number of neighbors = 3

k = 3

# Running KNN model

result,neigh = knn(data, test, k)

# Predicted class

print(result)

-> Iris-virginica

# 3 nearest neighbors

print(neigh)

-> [141, 139, 120]


# Setting number of neighbors = 5

k = 5

# Running KNN model

result,neigh = knn(data, test, k)

# Predicted class

print(result)

-> Iris-virginica

# 5 nearest neighbors

print(neigh)

-> [141, 139, 120, 145, 144]

Comparing our model with scikit-learn

from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)

neigh.fit(data.iloc[:,0:4], data['Name'])
# Predicted class

print(neigh.predict(test))

-> ['Iris-virginica']

# 3 nearest neighbors

print(neigh.kneighbors(test)[1])

-> [[141 139 120]]

We can see that both the models predicted the same class (‘Iris-virginica’) and the same
nearest neighbors ( [141 139 120] ). Hence we can conclude that our model runs as
expected.

End Notes
The KNN algorithm is one of the simplest classification algorithms. Even with such simplicity, it
can give highly competitive results. The KNN algorithm can also be used for regression
problems. The only difference from the methodology discussed will be using the average of the
nearest neighbors rather than a vote from the nearest neighbors. KNN can be coded in a
single line in R. I am yet to explore how we can use the KNN algorithm in SAS.
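As a minimal sketch of that regression variant (illustrative only; not part of the original article's code), scikit-learn's KNeighborsRegressor averages the target values of the k nearest neighbors:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: one feature, continuous target
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.0, 2.0, 2.9, 4.1, 5.0])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)

# The prediction is the average target of the 2 closest training points
print(reg.predict([[3.4]]))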

Did you find the article useful? Have you used any other machine learning tool recently?
Do you plan to use KNN in any of your business problems? If yes, share with us how
you plan to go about it.
What is Principal Component Analysis ?
In simple words, principal component analysis is a method of extracting important
variables (in the form of components) from a large set of variables available in a data set. It
extracts a low dimensional set of features from a high dimensional data set with the motive
of capturing as much information as possible. With fewer variables, visualization also
becomes much more meaningful. PCA is more useful when dealing with data of 3 or more
dimensions.

It is always performed on a symmetric correlation or covariance matrix. This means the
matrix should be numeric and have standardized data.

Let’s understand it using an example:

Let’s say we have a data set of dimension 300 (n) × 50 (p). n represents the number of
observations and p represents the number of predictors. Since we have a large p = 50,
there can be p(p-1)/2 scatter plots, i.e. more than 1000 plots possible to analyze the
variable relationships. Wouldn’t it be a tedious job to perform exploratory analysis on this
data?

In this case, it would be a lucid approach to select a subset of p (p << 50) predictors
which captures as much information as possible, followed by plotting the observations in the
resultant low-dimensional space.

The image below shows the transformation of high dimensional data (3 dimensions) to
low dimensional data (2 dimensions) using PCA. Not to forget, each resultant dimension
is a linear combination of the p features.

Source: nlpca
What are principal components ?
A principal component is a normalized linear combination of the original predictors in a
data set. In the image above, PC1 and PC2 are the principal components. Let’s say we
have a set of predictors X¹, X², ..., Xp.

The principal component can be written as:

Z¹ = Φ¹¹X¹ + Φ²¹X² + Φ³¹X³ + .... +Φp¹Xp

where,

 Z¹ is the first principal component

 Φ¹ = (Φ¹¹, Φ²¹, ..., Φp¹) is the loading vector of the first principal
component. The loadings are constrained so that their sum of squares equals 1, because
loadings of large magnitude could otherwise inflate the variance arbitrarily. The loading vector also
defines the direction of the principal component (Z¹) along which the data varies the
most. It results in a line in p-dimensional space which is closest to
the n observations, where closeness is measured by average squared Euclidean
distance.
 X¹, ..., Xp are normalized predictors. Normalized predictors have mean equal to
zero and standard deviation equal to one.

Therefore,

The first principal component is a linear combination of the original predictor variables which
captures the maximum variance in the data set. It determines the direction of highest
variability in the data. The larger the variability captured by the first component, the larger the
information captured by it. No other component can have variability higher
than the first principal component.

The first principal component results in a line which is closest to the data, i.e. it minimizes
the sum of squared distances between the data points and the line.

Similarly, we can compute the second principal component.

The second principal component (Z²) is also a linear combination of the original predictors
which captures the remaining variance in the data set and is uncorrelated with Z¹. In
other words, the correlation between the first and second components should be zero. It can
be represented as:

Z² = Φ¹²X¹ + Φ²²X² + Φ³²X³ + .... + Φp²Xp

If the two components are uncorrelated, their directions should be orthogonal (image
below). This image is based on simulated data with 2 predictors. Notice the direction of
the components; as expected, they are orthogonal. This suggests the correlation between
these components is zero.
All succeeding principal components follow a similar concept, i.e. they capture the remaining
variation without being correlated with the previous components. In general, for n ×
p dimensional data, min(n-1, p) principal components can be constructed.
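A minimal NumPy sketch of these definitions (illustrative, with synthetic data): the loading vectors are unit-length eigenvectors of the correlation matrix of the normalized predictors, and the component scores Z¹, Z², ... are the projections of the data onto them.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                    # toy data: n = 300 observations, p = 5 predictors

# Normalize predictors: mean 0, standard deviation 1
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Loading vectors = eigenvectors of the correlation matrix, sorted by explained variance
eig_vals, eig_vecs = np.linalg.eigh(np.corrcoef(Xs, rowvar=False))
order = np.argsort(eig_vals)[::-1]
loadings = eig_vecs[:, order]

print(np.allclose(np.linalg.norm(loadings, axis=0), 1.0))   # each loading vector has unit length

# Component scores: Z[:, 0] is Z1, Z[:, 1] is Z2, ...
Z = Xs @ loadings
print(round(float(np.corrcoef(Z[:, 0], Z[:, 1])[0, 1]), 6)) # ~0: Z1 and Z2 are uncorrelated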

The directions of these components are identified in an unsupervised way, i.e. the
response variable (Y) is not used to determine the component directions. Therefore, it
is an unsupervised approach.

Note: Partial least square (PLS) is a supervised alternative to PCA. PLS assigns higher
weight to variables which are strongly related to response variable to determine principal
components.

Why is normalization of variables necessary ?


The principal components are computed on the normalized versions of the original predictors.
This is because the original predictors may have different scales. For example, imagine
a data set with variables measured in units such as gallons, kilometers, light years, etc. The
scales of variance in these variables will obviously differ widely.

Performing PCA on un-normalized variables will lead to insanely large loadings for
variables with high variance. In turn, this will lead to the dependence of a principal
component on the variable with high variance. This is undesirable.

As shown in the image below, PCA was run on a data set twice (with unscaled and scaled
predictors). This data set has ~40 variables. You can see that the first principal component is
dominated by the variable Item_MRP, and the second principal component is dominated by the
variable Item_Weight. This domination prevails due to the high variance associated
with those variables. When the variables are scaled, we get a much better representation of the
variables in 2D space.
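A small sketch of this effect (illustrative only, using synthetic data rather than the Big Mart set): when one column has a much larger scale, the first unscaled component is dominated by that column, while scaling spreads the loadings more evenly.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two comparable features plus one feature on a huge scale (like an unscaled price)
X = np.column_stack([rng.normal(0, 1, 500),
                     rng.normal(0, 2, 500),
                     rng.normal(0, 1000, 500)])

pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print(np.round(pca_raw.components_[0], 3))      # PC1 is dominated by the high-variance column
print(np.round(pca_scaled.components_[0], 3))   # loadings are far more balanced after scaling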
Implement PCA in R & Python (with interpretation)
How many principal components should we choose? I could dive deep into theory, but it would be
better to answer this question practically.

For this demonstration, I’ll be using the data set from Big Mart Prediction Challenge III.

Remember, PCA can be applied only on numerical data. Therefore, if the data has
categorical variables they must be converted to numerical. Also, make sure you have
done the basic data cleaning prior to implementing this technique. Let’s quickly finish
with initial data loading and cleaning steps:

#directory path
> path <- ".../Data/Big_Mart_Sales"

#set working directory


> setwd(path)

#load train and test file


> train <- read.csv("train_Big.csv")
> test <- read.csv("test_Big.csv")

#add a column
> test$Item_Outlet_Sales <- 1

#combine the data set


> combi <- rbind(train, test)

#impute missing values with median
> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)

#impute 0 with median
> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility), combi$Item_Visibility)

#find mode and impute


> table(combi$Outlet_Size, combi$Outlet_Type)
> levels(combi$Outlet_Size)[1] <- "Other"

Till here, we’ve imputed missing values. Now we are left with removing the dependent
(response) variable and other identifier variables (if any). As we said above, we are
practicing an unsupervised learning technique, hence the response variable must be
removed.

#remove the dependent and identifier variables


> my_data <- subset(combi, select = -c(Item_Outlet_Sales,
Item_Identifier, Outlet_Identifier))

Let’s check the available variables (a.k.a. predictors) in the data set.

#check available variables


> colnames(my_data)

Since PCA works on numeric variables, let’s see if we have any variable other than
numeric.

#check variable class


> str(my_data)

'data.frame': 14204 obs. of 9 variables:


$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0.054 0.054 ...
$ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002
2007 ...
$ Outlet_Size : Factor w/ 4 levels "Other","High",..: 3 3 3 1 2 3 2 3 1 1 ...
$ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3
2 2 ...
$ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...

Sadly, 6 out of 9 variables are categorical in nature. We have some additional work to do
now. We’ll convert these categorical variables into numeric using one hot encoding.

#load library
> library(dummies)

#create a dummy data frame


> new_my_data <- dummy.data.frame(my_data, names =
c("Item_Fat_Content","Item_Type",
"Outlet_Establishment_Year","Outlet_Size",
"Outlet_Location_Type","Outlet_Type"))
To check if we now have a data set of integer values, simply write:

#check the data set


> str(new_my_data)

And, we now have all the numerical values. Let’s divide the data into test and train.

#divide the new data


> pca.train <- new_my_data[1:nrow(train),]
> pca.test <- new_my_data[-(1:nrow(train)),]

We can now go ahead with PCA.

The base R function prcomp() is used to perform PCA. By default, it centers the variables
to have mean equal to zero. With the parameter scale. = T, we normalize the variables to
have standard deviation equal to one.

#principal component analysis


> prin_comp <- prcomp(pca.train, scale. = T)
> names(prin_comp)
[1] "sdev" "rotation" "center" "scale" "x"

The prcomp() function results in 5 useful measures:

1. center and scale refer to the respective means and standard deviations of the variables
that are used for normalization prior to implementing PCA

#outputs the mean of variables


prin_comp$center

#outputs the standard deviation of variables


prin_comp$scale

2. The rotation measure provides the principal component loading. Each column of
rotation matrix contains the principal component loading vector. This is the most
important measure we should be interested in.

> prin_comp$rotation

This returns 44 principal component loadings. Is that correct? Absolutely. In a data set,
the maximum number of principal component loadings is min(n-1, p). Let’s
look at the first 4 principal components and the first 5 rows.

> prin_comp$rotation[1:5,1:4]
PC1 PC2 PC3 PC4
Item_Weight 0.0054429225 -0.001285666 0.011246194 0.011887106
Item_Fat_ContentLF -0.0021983314 0.003768557 -0.009790094 -0.016789483
Item_Fat_Contentlow fat -0.0019042710 0.001866905 -0.003066415 -0.018396143
Item_Fat_ContentLow Fat 0.0027936467 -0.002234328 0.028309811 0.056822747
Item_Fat_Contentreg 0.0002936319 0.001120931 0.009033254 -0.001026615
3. In order to compute the principal component score vectors, we don’t need to multiply
the loadings with the data. Rather, the matrix x holds the principal component score vectors,
with dimension 8523 × 44.

> dim(prin_comp$x)
[1] 8523 44

Let’s plot the resultant principal components.

> biplot(prin_comp, scale = 0)

The parameter scale = 0 ensures that the arrows are scaled to represent the loadings. To
make inferences from the image above, focus on the extreme ends (top, bottom, left, right) of
this graph.

We infer that the first principal component corresponds to a measure of
Outlet_TypeSupermarket and Outlet_Establishment_Year 2007. Similarly, it can be said that
the second component corresponds to a measure of Outlet_Location_TypeTier1 and
Outlet_Sizeother. For the exact measure of a variable in a component, you should look at the
rotation matrix (above) again.
4. The prcomp() function also provides the facility to compute standard deviation of each
principal component. sdev refers to the standard deviation of principal components.

#compute standard deviation of each principal component


> std_dev <- prin_comp$sdev

#compute variance
> pr_var <- std_dev^2

#check variance of first 10 components


> pr_var[1:10]
[1] 4.563615 3.217702 2.744726 2.541091 2.198152 2.015320 1.932076 1.256831
[9] 1.203791 1.168101

We aim to find the components which explain the maximum variance. This is because
we want to retain as much information as possible using these components. So, the higher
the explained variance, the higher the information contained in those components.

To compute the proportion of variance explained by each component, we simply divide
each component's variance by the sum of total variance. This results in:

#proportion of variance explained


> prop_varex <- pr_var/sum(pr_var)
> prop_varex[1:20]
[1] 0.10371853 0.07312958 0.06238014 0.05775207 0.04995800 0.04580274
[7] 0.04391081 0.02856433 0.02735888 0.02654774 0.02559876 0.02556797
[13] 0.02549516 0.02508831 0.02493932 0.02490938 0.02468313 0.02446016
[19] 0.02390367 0.02371118

This shows that the first principal component explains 10.3% of the variance, the second component
explains 7.3%, the third explains 6.2%, and so on. So, how do
we decide how many components to select for the modeling stage?

The answer to this question is provided by a scree plot. A scree plot is used to assess the
components or factors which explain most of the variability in the data. It represents
values in descending order.

#scree plot
> plot(prop_varex, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")
The plot above shows that ~30 components explain around 98.4% of the variance in the data
set. In other words, using PCA we have reduced 44 predictors to 30 without
compromising on explained variance. This is the power of PCA. Let’s do a confirmation
check by plotting a cumulative variance plot. This will give us a clear picture of the number
of components.

#cumulative scree plot


> plot(cumsum(prop_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")
This plot shows that 30 components result in a variance close to ~98%. Therefore, in this
case, we’ll select the number of components as 30 [PC1 to PC30] and proceed to the
modeling stage. This completes the steps to implement PCA on train data. For modeling,
we’ll use these 30 components as predictor variables and follow the normal procedures.

Predictive Modeling with PCA Components


After we’ve calculated the principal components on the training set, let’s now understand the
process of predicting on test data using these components. The process is simple. Just
like we obtained PCA components for the training set, we’ll obtain the components for the test
set (using the same transformation). Finally, we train the model.

But there are a few important points to understand:

1. We should not combine the train and test sets to obtain PCA components of the whole
data at once, because this would violate the assumption of
generalization since test data would get ‘leaked’ into the training set. In other
words, the test data set would no longer remain ‘unseen’. Eventually, this will
hammer down the generalization capability of the model.
2. We should not perform PCA on the test and train data sets separately, because the
resultant vectors from the train and test PCAs will have different directions (due to
unequal variance). Due to this, we’ll end up comparing data registered on
different axes. Therefore, the resulting vectors from the train and test data should
have the same axes.

So, what should we do?

We should do exactly the same transformation to the test set as we did to the training set,
including the centering and scaling. Let’s do it in R:

#add a training set with principal components


> train.data <- data.frame(Item_Outlet_Sales = train$Item_Outlet_Sales,
prin_comp$x)

#we are interested in first 30 PCAs


> train.data <- train.data[,1:31]

#run a decision tree


> install.packages("rpart")
> library(rpart)
> rpart.model <- rpart(Item_Outlet_Sales ~ .,data = train.data, method = "anova")
> rpart.model

#transform test into PCA


> test.data <- predict(prin_comp, newdata = pca.test)
> test.data <- as.data.frame(test.data)

#select the first 30 components


> test.data <- test.data[,1:30]
#make prediction on test data
> rpart.prediction <- predict(rpart.model, test.data)

#For fun, finally check your score of leaderboard


> sample <- read.csv("SampleSubmission_TmnO39y.csv")
> final.sub <- data.frame(Item_Identifier = sample$Item_Identifier,
Outlet_Identifier = sample$Outlet_Identifier, Item_Outlet_Sales = rpart.prediction)
> write.csv(final.sub, "pca.csv",row.names = F)

That’s the complete modeling process after PCA extraction. I’m sure you wouldn’t be
happy with your leaderboard rank after you upload the solution. Try using random forest!

For Python users: To implement PCA in Python, simply import PCA from the sklearn library.
The interpretation remains the same as explained for R users above. Of course, the result is the
same as derived using R. The data set used for Python is a cleaned version where
missing values have been imputed, and categorical variables have been converted into numeric.
The modeling process remains the same, as explained for R users above.

import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
%matplotlib inline

#Load data set


data = pd.read_csv('Big_Mart_PCA.csv')

#convert it to numpy arrays


X=data.values

#Scaling the values


X = scale(X)

pca = PCA(n_components=44)

pca.fit(X)

#The amount of variance that each PC explains


var= pca.explained_variance_ratio_

#Cumulative Variance explains


var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

print(var1)
[ 10.37 17.68 23.92 29.7 34.7 39.28 43.67 46.53 49.27
51.92 54.48 57.04 59.59 62.1 64.59 67.08 69.55 72.
74.39 76.76 79.1 81.44 83.77 86.06 88.33 90.59 92.7
94.76 96.78 98.44 100.01 100.01 100.01 100.01 100.01 100.01
100.01 100.01 100.01 100.01 100.01 100.01 100.01 100.01]
plt.plot(var1)

#Looking at above plot I'm taking 30 variables


pca = PCA(n_components=30)
pca.fit(X)
X1=pca.fit_transform(X)

print(X1)

For more information on PCA in python, visit scikit learn documentation.
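As discussed in the points above, in a train/test workflow the PCA should be fitted on the training data only and the same transformation reused on the test data. A minimal scikit-learn sketch (with placeholder arrays standing in for the real train and test matrices):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data; replace with your own numeric train/test matrices
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 44))
X_test = rng.normal(size=(50, 44))

# Fit centering, scaling and rotation on the training data only
pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=30))
train_components = pca_pipe.fit_transform(X_train)

# Re-use the same fitted transformation for the test data
test_components = pca_pipe.transform(X_test)
print(train_components.shape, test_components.shape)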

Points to Remember
1. PCA is used to overcome feature redundancy in a data set.
2. These features are low dimensional in nature.
3. These features, a.k.a. components, are a result of normalized linear
combinations of the original predictor variables.
4. These components aim to capture as much information as possible with high
explained variance.
5. The first component has the highest variance, followed by the second, third and so on.
6. The components must be uncorrelated (remember the orthogonal directions?). See
above.
7. Normalizing the data becomes extremely important when the predictors are
measured in different units.
8. PCA works best on data sets having 3 or more dimensions, because with higher
dimensions it becomes increasingly difficult to make interpretations from the
resultant cloud of data.
9. PCA is applied on a data set with numeric variables.
10. PCA is a tool which helps to produce better visualizations of high dimensional
data.
End Notes
This brings me to the end of this tutorial. Without delving deep into mathematics, I’ve
tried to make you familiar with the most important concepts required to use this technique.
It’s simple, but needs special attention while deciding the number of
components. Practically, we should strive to retain only the first few k components.

The idea behind PCA is to construct some principal components (Z << Xp) which
satisfactorily explain most of the variability in the data, as well as the relationship with the
response variable.

Understanding Support Vector Machine algorithm from examples (along with code)
Introduction
Mastering machine learning algorithms isn’t a myth at all. Most beginners start by
learning regression. It is simple to learn and use, but does that solve our purpose? Of
course not! Because you can do so much more than just regression!

Think of machine learning algorithms as an armory packed with axes, swords, blades,
bows, daggers, etc. You have various tools, but you ought to learn to use them at the right
time. As an analogy, think of ‘Regression’ as a sword capable of slicing and dicing
data efficiently, but incapable of dealing with highly complex data. On the
contrary, ‘Support Vector Machines’ is like a sharp knife – it works on smaller datasets,
but on them, it can be much stronger and more powerful in building models.

By now, I hope you’ve mastered Random Forest, Naive Bayes and Ensemble Modeling. If not, I’d
suggest you take a few minutes and read about them as well. In this article, I shall guide you
from the basics to advanced knowledge of a crucial machine learning algorithm, support vector
machines.

Table of Contents
1. What is Support Vector Machine?
2. How does it work?
3. How to implement SVM in Python and R?
4. How to tune Parameters of SVM?
5. Pros and Cons associated with SVM

What is Support Vector Machine?


“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can
be used for both classification and regression challenges. However, it is mostly
used in classification problems. In this algorithm, we plot each data item as a point in n-
dimensional space (where n is the number of features you have), with the value of each
feature being the value of a particular coordinate. Then, we perform classification by
finding the hyper-plane that differentiates the two classes very well (look at the below
snapshot).

Support Vectors are simply the coordinates of individual observations. The Support Vector
Machine is a frontier which best segregates the two classes (hyper-plane/line).

You can look at the definition of support vectors and a few examples of their working here.
How does it work?
Above, we got accustomed to the process of segregating the two classes with a hyper-
plane. Now the burning question is “How can we identify the right hyper-plane?”. Don’t
worry, it’s not as hard as you think!

Let’s understand:

 Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes
(A, B and C). Now, identify the right hyper-plane to classify star and circle.

You need to remember a thumb rule to identify the right hyper-plane: “Select the hyper-plane which
segregates the two classes better”. In this scenario, hyper-plane “B”
has excellently performed this job.
 Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes
(A, B and C) and all are segregating the classes well. Now, how can we identify
the right hyper-plane?

Here, maximizing the distance between the nearest data point (of either class) and the hyper-plane
will help us to decide the right hyper-plane. This distance is called the Margin. Let’s look at
the below snapshot:
Above, you can see that the margin for hyper-plane C is high as compared to
both A and B. Hence, we name the right hyper-plane as C. Another compelling
reason for selecting the hyper-plane with higher margin is robustness. If we
select a hyper-plane having a low margin, then there is a high chance of
misclassification.

 Identify the right hyper-plane (Scenario-3): Hint: Use the rules discussed in the
previous section to identify the right hyper-plane.

Some of you may have selected hyper-plane B as it has a higher margin compared to A. But here is
the catch: SVM selects the hyper-plane which classifies the classes accurately prior to maximizing
the margin. Here, hyper-plane B has a classification error and A has classified all points correctly.
Therefore, the right hyper-plane is A.

 Can we classify two classes (Scenario-4)?: Below, I am unable to segregate
the two classes using a straight line, as one of the stars lies in the territory of the
other (circle) class as an outlier.
As I have already mentioned, the one star at the other end is like an outlier for the star class.
SVM has a feature to ignore outliers and find the hyper-plane that has maximum margin.
Hence, we can say that SVM is robust to outliers.

 Find the hyper-plane to segregate two classes (Scenario-5): In the scenario
below, we can’t have a linear hyper-plane between the two classes, so how does
SVM classify these two classes? Till now, we have only looked at the linear
hyper-plane.
 SVM can solve this problem. Easily! It solves this problem by
introducing an additional feature. Here, we will add a new feature z = x^2 + y^2. Now,
let’s plot the data points on the x and z axes:

In the above plot, the points to consider are:

o All values of z will always be positive because z is the squared sum of
both x and y.
o In the original plot, red circles appear close to the origin of the x and y axes,
leading to lower values of z, while the stars are relatively far away from the origin,
resulting in higher values of z.

With this feature, it is easy to have a linear hyper-plane between these two classes in SVM. But
another burning question which arises is: do we need to add this feature
manually to obtain a hyper-plane? No, SVM has a technique called the kernel trick.
Kernels are functions which take a low dimensional input space and transform it to
a higher dimensional space, i.e. they convert a non-separable problem into a separable
problem. They are mostly useful in non-linear separation problems. Simply put, the kernel
does some extremely complex data transformations, then finds out the process to separate
the data based on the labels or outputs you’ve defined.
When we look at the hyper-plane in the original input space it looks like a circle:
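A small sketch of this idea (illustrative only, using synthetic concentric-circle data): adding the manual feature z = x^2 + y^2 makes the classes linearly separable, and an RBF kernel achieves a similar effect without constructing the feature by hand.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged as concentric circles: not linearly separable in (x, y)
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# A linear SVM on the raw features struggles
print(SVC(kernel='linear').fit(X, y).score(X, y))

# Manually adding z = x^2 + y^2 makes a linear hyper-plane sufficient
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X_lifted = np.hstack([X, z])
print(SVC(kernel='linear').fit(X_lifted, y).score(X_lifted, y))

# The RBF kernel trick performs a similar lift implicitly
print(SVC(kernel='rbf').fit(X, y).score(X, y))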

Now, let’s look at the methods to apply SVM algorithm in a data science challenge.

How to implement SVM in Python and R?


In Python, scikit-learn is a widely used library for implementing machine learning
algorithms. SVM is also available in the scikit-learn library and follows the same structure
(import library, object creation, fitting model and prediction). Let’s look at the code below:

#Import Library
from sklearn import svm

#Assumed you have, X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset

# Create SVM classification object
model = svm.SVC(kernel='linear', C=1, gamma=1)

# there are various options associated with it, like changing kernel, gamma and C value. Will discuss more about it in the next section. Train the model using the training sets and check the score
model.fit(X, y)
model.score(X, y)

#Predict Output
predicted = model.predict(x_test)

The e1071 package in R is used to create Support Vector Machines with ease. It has
helper functions as well as code for the Naive Bayes Classifier. The creation of a support
vector machine in R and Python follows similar approaches; let’s take a look now at the
following code:

#Import Library

require(e1071) #Contains the SVM

Train <- read.csv(file.choose())

Test <- read.csv(file.choose())

# there are various options associated with SVM training; like changing kernel, gamma and C value.

# create model
model <- svm(Target~Predictor1+Predictor2+Predictor3, data=Train, kernel='linear', gamma=0.2, cost=100)

#Predict Output

preds <- predict(model,Test)

table(preds)

How to tune Parameters of SVM?


Tuning the parameter values of machine learning algorithms effectively improves model
performance. Let’s look at the list of parameters available with SVM.

sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, random_state=None)

I am going to discuss some important parameters having a higher impact on model
performance: “kernel”, “gamma” and “C”.

kernel: We have already discussed it. Here, we have various options available
for the kernel, like “linear”, “rbf”, ”poly” and others (the default value is “rbf”). “rbf” and
“poly” are useful for non-linear hyper-planes. Let’s look at an example where we’ve used a
linear kernel on two features of the iris data set to classify their class.

Example: Have linear kernel

import numpy as np

import matplotlib.pyplot as plt


from sklearn import svm, datasets

# import some data to play with

iris = datasets.load_iris()

X = iris.data[:, :2] # we only take the first two features. We could avoid this ugly slicing by using a two-dim dataset

y = iris.target

# we create an instance of SVM and fit our data. We do not scale our

# data since we want to plot the support vectors

C = 1.0 # SVM regularization parameter

svc = svm.SVC(kernel='linear', C=1).fit(X, y)  # gamma is not used by the linear kernel

# create a mesh to plot in

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

h = (x_max - x_min) / 100  # step size in the mesh

xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))


plt.subplot(1, 1, 1)

Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)

plt.xlabel('Sepal length')

plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())

plt.title('SVC with linear kernel')

plt.show()

Example: Have rbf kernel

Change the kernel type to rbf in the below line and look at the impact.
svc = svm.SVC(kernel='rbf', C=1, gamma='auto').fit(X, y)

I would suggest you go for a linear kernel if you have a large number of features (>1000),
because it is more likely that the data is linearly separable in a high dimensional space.
You can also use RBF, but do not forget to cross-validate its parameters to avoid over-fitting.

gamma: Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. The higher the value of gamma, the
harder the model will try to fit the training data exactly, which can hurt generalization and
cause over-fitting.

Example: Let’s see the difference if we use different gamma values, such as a low value ('auto'), 10 or 100.

svc = svm.SVC(kernel='rbf', C=1, gamma='auto').fit(X, y)


C: Penalty parameter C of the error term. It controls the trade-off between a smooth
decision boundary and classifying the training points correctly.

We should always look at the cross-validation score to find an effective combination of
these parameters and avoid over-fitting.
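A minimal sketch of such a cross-validated search (the parameter grid below is an illustrative assumption, not a recommendation from the article):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Search over C and gamma with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # combination with the best cross-validation score
print(search.best_score_)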

In R, SVMs can be tuned in a similar fashion as they are in Python. Mentioned below are
the respective parameters for e1071 package:

 The kernel parameter can be tuned to take “Linear”,”Poly”,”rbf” etc.


 The gamma value can be tuned by setting the “Gamma” parameter.
 The C value in Python is tuned by the “Cost” parameter in R.

Pros and Cons associated with SVM


 Pros:
o It works really well with a clear margin of separation.
o It is effective in high dimensional spaces.
o It is effective in cases where the number of dimensions is greater than the
number of samples.
o It uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.
 Cons:
o It doesn’t perform well when we have a large data set because the required
training time is higher.
o It also doesn’t perform very well when the data set has more noise, i.e. the
target classes are overlapping.
o SVM doesn’t directly provide probability estimates; these are calculated
using an expensive five-fold cross-validation (see the related SVC method of the
Python scikit-learn library).

Practice Problem
Find the right additional feature to have a hyper-plane for segregating the classes in the below
snapshot:
Answer with the variable name in the comments section below. I shall then reveal the
answer.

End Notes

A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)

Introduction
Tree based learning algorithms are considered to be among the best and most used
supervised learning methods. Tree based methods empower predictive models with high
accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear
relationships quite well. They are adaptable to solving any kind of problem at hand
(classification or regression).

Methods like decision trees, random forest and gradient boosting are being popularly used
in all kinds of data science problems. Hence, for every analyst (freshers included), it’s
important to learn these algorithms and use them for modeling.
This tutorial is meant to help beginners learn tree based modeling from scratch. After the
successful completion of this tutorial, one is expected to become proficient at using tree
based algorithms and building predictive models.

Note: This tutorial requires no prior knowledge of machine learning. However,


elementary knowledge of R or Python will be helpful. To get started you can follow full
tutorial in R and full tutorial in Python.

Table of Contents
1. What is a Decision Tree? How does it work?
2. Regression Trees vs Classification Trees
3. How does a tree decide where to split?
4. What are the key parameters of model building and how can we avoid
over-fitting in decision trees?
5. Are tree based models better than linear models?
6. Working with Decision Trees in R and Python
7. What are the ensemble methods of tree based models?
8. What is Bagging? How does it work?
9. What is Random Forest? How does it work?
10. What is Boosting? How does it work?
11. Which is more powerful: GBM or Xgboost?
12. Working with GBM in R and Python
13. Working with Xgboost in R and Python
14. Where to Practice?

1. What is a Decision Tree ? How does it work ?


A decision tree is a type of supervised learning algorithm (having a pre-defined target
variable) that is mostly used in classification problems. It works for both categorical and
continuous input and output variables. In this technique, we split the population or
sample into two or more homogeneous sets (or sub-populations) based on the most
significant splitter / differentiator among the input variables.
Example:-

Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl),
Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, I
want to create a model to predict who will play cricket during the leisure period. In this
problem, we need to segregate students who play cricket in their leisure time based on the
most significant input variable among all three.

This is where the decision tree helps: it will segregate the students based on all values of the
three variables and identify the variable which creates the best homogeneous sets of
students (which are heterogeneous to each other). In the snapshot below, you can see
that the variable Gender is able to identify the best homogeneous sets compared to the other
two variables.

As mentioned above, the decision tree identifies the most significant variable and the value of it
that gives the best homogeneous sets of population. Now the question which arises is, how
does it identify the variable and the split? To do this, the decision tree uses various
algorithms, which we shall discuss in the following section.
Types of Decision Trees
The type of decision tree is based on the type of target variable we have. It can be of two
types:

1. Categorical Variable Decision Tree: A decision tree which has a categorical target
variable is called a categorical variable decision tree. Example:- In the above
scenario of the student problem, the target variable was “Student will play
cricket or not”, i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree which has a continuous target
variable is called a continuous variable decision tree.

Example:- Let’s say we have a problem to predict whether a customer will pay his
renewal premium with an insurance company (yes/no). Here we know that the income of the
customer is a significant variable, but the insurance company does not have income details
for all customers. Now, as we know this is an important variable, we can build a
decision tree to predict customer income based on occupation, product and various
other variables. In this case, we are predicting values for a continuous variable.

Important Terminology related to Decision Trees


Let’s look at the basic terminology used with decision trees:

1. Root Node: It represents the entire population or sample, and this further gets
divided into two or more homogeneous sets.
2. Splitting: It is the process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a
decision node.

4. Leaf/ Terminal Node: Nodes that do not split are called leaf or terminal nodes.
5. Pruning: When we remove sub-nodes of a decision node, this process is called
pruning. You can say it is the opposite of splitting.
6. Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: A node which is divided into sub-nodes is called the parent
node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
These are the terms commonly used for decision trees. As we know, every algorithm
has advantages and disadvantages; below are the important factors which one should
know.

Advantages
1. Easy to Understand: Decision tree output is very easy to understand even for
people from a non-analytical background. It does not require any statistical
knowledge to read and interpret. Its graphical representation is very
intuitive and users can easily relate it to their hypothesis.
2. Useful in Data exploration: A decision tree is one of the fastest ways to identify the
most significant variables and the relation between two or more variables. With the
help of decision trees, we can create new variables / features that have better
power to predict the target variable. You can refer to the article (Trick to enhance power of
regression model) for one such trick. It can also be used in the data exploration
stage. For example, when we are working on a problem where we have information
available in hundreds of variables, a decision tree will help to identify the most
significant variables.
3. Less data cleaning required: It requires less data cleaning compared to
some other modeling techniques. It is not influenced by outliers and missing
values to a fair degree.
4. Data type is not a constraint: It can handle both numerical and categorical
variables.
5. Non Parametric Method: The decision tree is considered to be a non-parametric
method. This means that decision trees have no assumptions about the space
distribution and the classifier structure.

Disadvantages
1. Over fitting: Over fitting is one of the most practical difficulties for decision tree
models. This problem gets solved by setting constraints on model parameters
and by pruning (discussed in detail below).
2. Not fit for continuous variables: While working with continuous numerical
variables, the decision tree loses information when it categorizes variables into
different categories.
2. Regression Trees vs Classification Trees
We all know that the terminal nodes (or leaves) lie at the bottom of the decision tree.
This means that decision trees are typically drawn upside down, such that leaves are at
the bottom and roots are at the top (shown below).

Both types of tree work almost similarly to each other. Let’s look at the primary differences
and similarities between classification and regression trees (a short sketch follows this list):

1. Regression trees are used when the dependent variable is continuous. Classification
trees are used when the dependent variable is categorical.
2. In case of a regression tree, the value obtained by a terminal node in the training
data is the mean response of the observations falling in that region. Thus, if an unseen
data observation falls in that region, we’ll make its prediction with the mean value.
3. In case of a classification tree, the value (class) obtained by a terminal node in the
training data is the mode of the observations falling in that region. Thus, if an unseen
data observation falls in that region, we’ll make its prediction with the mode value.
4. Both types of tree divide the predictor space (independent variables) into distinct and
non-overlapping regions. For the sake of simplicity, you can think of these
regions as high dimensional boxes.
5. Both types of tree follow a top-down greedy approach known as recursive binary
splitting. We call it ‘top-down’ because it begins from the top of the tree, when all
the observations are available in a single region, and successively splits the
predictor space into two new branches down the tree. It is known as ‘greedy’
because the algorithm cares (looks for the best variable available) only about the
current split, and not about future splits which would lead to a better tree.
6. This splitting process is continued until a user defined stopping criterion is reached.
For example: we can tell the algorithm to stop once the number of
observations per node becomes less than 50.
7. In both cases, the splitting process results in fully grown trees until the
stopping criterion is reached. But the fully grown tree is likely to overfit the data,
leading to poor accuracy on unseen data. This brings us to ‘pruning’. Pruning is one of
the techniques used to tackle overfitting. We’ll learn more about it in a following section.
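A minimal scikit-learn sketch of points 2 and 3 above (toy data, for illustration only): the regression tree predicts the mean response of a region, while the classification tree predicts its mode.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])

# Regression tree: the leaf prediction is the mean response of the region
y_cont = np.array([1.0, 1.2, 0.8, 9.0, 10.0, 11.0])
reg = DecisionTreeRegressor(max_depth=1).fit(X, y_cont)
print(reg.predict([[2.0]]))   # ~1.0, the mean of {1.0, 1.2, 0.8}

# Classification tree: the leaf prediction is the mode (most frequent class) of the region
y_cat = np.array([0, 0, 1, 1, 1, 1])
clf = DecisionTreeClassifier(max_depth=1).fit(X, y_cat)
print(clf.predict([[2.0]]))   # 0, the most frequent class in the left region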

3. How does a tree decide where to split?


The decision of making strategic splits heavily affects a tree’s accuracy. The decision
criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes.
The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other
words, we can say that the purity of the node increases with respect to the target variable.
The decision tree splits the nodes on all available variables and then selects the split which
results in the most homogeneous sub-nodes.

The algorithm selection also depends on the type of target variable. Let’s look at the four
most commonly used algorithms in decision trees:

Gini Index
The Gini index says that if we select two items from a population at random, then they must be of the
same class, and the probability of this is 1 if the population is pure.

1. It works with a categorical target variable “Success” or “Failure”.
2. It performs only binary splits.
3. The higher the value of Gini, the higher the homogeneity.
4. CART (Classification and Regression Tree) uses the Gini method to create binary
splits.

Steps to Calculate Gini for a split

1. Calculate Gini for the sub-nodes, using the formula: sum of the squares of the probabilities of
success and failure (p^2+q^2).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.

Example: – Referring to the example used above, we want to segregate the students
based on the target variable (playing cricket or not). In the snapshot below, we split the
population using the two input variables Gender and Class. Now, I want to identify which
split produces more homogeneous sub-nodes using the Gini index (a short calculation in Python
follows the worked example below).
Split on Gender:

1. Calculate, Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8)=0.68


2. Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35)=0.55
3. Calculate weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59

Similar for Split on Class:

1. Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57)=0.51


2. Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44)=0.51
3. Calculate weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51

Above, you can see that the Gini score for the split on Gender is higher than that for the split on
Class; hence, the node split will take place on Gender.
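A minimal sketch of this calculation (following the article's p^2 + q^2 convention, where a higher score means a purer split):

def gini_node(p_success):
    # Article's convention: p^2 + q^2 (equals 1 for a pure node)
    q = 1 - p_success
    return p_success ** 2 + q ** 2

def weighted_gini(groups):
    # groups: list of (node_size, proportion_playing_cricket) pairs
    total = sum(size for size, _ in groups)
    return sum(size / total * gini_node(p) for size, p in groups)

# Split on Gender: Female (10 students, 2 play), Male (20 students, 13 play)
print(round(weighted_gini([(10, 2 / 10), (20, 13 / 20)]), 2))   # 0.59

# Split on Class: IX (14 students, 6 play), X (16 students, 9 play)
print(round(weighted_gini([(14, 6 / 14), (16, 9 / 16)]), 2))    # 0.51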

Chi-Square
It is an algorithm to find the statistical significance of the differences between
sub-nodes and the parent node. We measure it by the sum of squares of the
standardized differences between the observed and expected frequencies of the target variable.

1. It works with a categorical target variable “Success” or “Failure”.
2. It can perform two or more splits.
3. The higher the value of Chi-Square, the higher the statistical significance of the differences
between the sub-node and the parent node.
4. The Chi-Square of each node is calculated using the formula:
5. Chi-square = ((Actual – Expected)^2 / Expected)^1/2
6. It generates a tree called CHAID (Chi-square Automatic Interaction Detector).

Steps to Calculate Chi-square for a split:

1. Calculate the Chi-square for each individual node by calculating the deviation for both Success
and Failure.
2. Calculate the Chi-square of the split as the sum of all the Chi-square values for Success and Failure
of each node of the split.

Example: Let’s work with the example above that we used to calculate Gini (a short calculation in
Python follows the tables below).
Split on Gender:

1. First we populate the node Female: the actual values for “Play
Cricket” and “Not Play Cricket” are 2 and 8 respectively.
2. Calculate the expected values for “Play Cricket” and “Not Play Cricket”: here they
would be 5 each, because the parent node has a probability of 50% and we
apply the same probability to the Female count (10).
3. Calculate the deviations using the formula Actual – Expected. For “Play
Cricket” it is (2 – 5 = -3) and for “Not Play Cricket” it is (8 – 5 = 3).
4. Calculate the Chi-square of the node for “Play Cricket” and “Not Play Cricket” using
the formula ((Actual – Expected)^2 / Expected)^1/2. You can refer to the
table below for the calculation.
5. Follow similar steps to calculate the Chi-square value for the Male node.
6. Now add all the Chi-square values to calculate the Chi-square for the split on Gender.

Split on Class:

Perform similar calculation steps for the split on Class and you will come up with the table
below.

Above, you can see that Chi-square also identifies the Gender split as more significant
compared to Class.
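A minimal sketch of the same calculation (using the article's per-cell formula and the counts from the student example; the exact totals depend on rounding, but Gender comes out clearly higher):

import math

def chi_square_split(nodes, parent_success_rate=0.5):
    # nodes: list of (successes, failures) per sub-node
    total_chi = 0.0
    for success, failure in nodes:
        n = success + failure
        expected_success = n * parent_success_rate
        expected_failure = n * (1 - parent_success_rate)
        # Article's per-cell formula: sqrt((Actual - Expected)^2 / Expected)
        total_chi += math.sqrt((success - expected_success) ** 2 / expected_success)
        total_chi += math.sqrt((failure - expected_failure) ** 2 / expected_failure)
    return total_chi

# Gender split: Female (2 play, 8 don't), Male (13 play, 7 don't)
print(round(chi_square_split([(2, 8), (13, 7)]), 2))   # ~4.58
# Class split: IX (6 play, 8 don't), X (9 play, 7 don't)
print(round(chi_square_split([(6, 8), (9, 7)]), 2))    # ~1.46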

Information Gain:
Look at the image below and think which node can be described easily. I am sure your
answer is C, because it requires less information as all its values are similar. On the other
hand, B requires more information to describe it, and A requires the maximum
information. In other words, we can say that C is a pure node, B is less impure and A is
more impure.
Now we can draw a conclusion: a less impure node requires less information to
describe it, and a more impure node requires more information. Information theory has a
measure to define this degree of disorganization in a system, known as Entropy. If the
sample is completely homogeneous, the entropy is zero, and if the sample is
equally divided (50% – 50%), it has an entropy of one.

Entropy can be calculated using the formula:

Entropy = -p log2 p - q log2 q

Here p and q are the probabilities of success and failure respectively in that node. Entropy is
also used with categorical target variables. The algorithm chooses the split which has the lowest
entropy compared to the parent node and other splits. The lesser the entropy, the better it is.

Steps to calculate entropy for a split:

1. Calculate the entropy of the parent node.
2. Calculate the entropy of each individual node of the split and calculate the weighted average
of all sub-nodes available in the split.

Example: Let’s use this method to identify the best split for the student example (a short
calculation in Python follows these steps).

1. Entropy for the parent node = -(15/30) log2 (15/30) – (15/30) log2 (15/30) = 1. Here 1
shows that it is an impure node.
2. Entropy for the Female node = -(2/10) log2 (2/10) – (8/10) log2 (8/10) = 0.72 and for the
Male node, -(13/20) log2 (13/20) – (7/20) log2 (7/20) = 0.93.
3. Entropy for the split on Gender = weighted entropy of the sub-nodes = (10/30)*0.72 +
(20/30)*0.93 = 0.86.
4. Entropy for the Class IX node = -(6/14) log2 (6/14) – (8/14) log2 (8/14) = 0.99 and for the
Class X node, -(9/16) log2 (9/16) – (7/16) log2 (7/16) = 0.99.
5. Entropy for the split on Class = (14/30)*0.99 + (16/30)*0.99 = 0.99.

Above, you can see that the entropy for the split on Gender is the lowest among all, so the tree
will split on Gender. We can derive the information gain from entropy as 1 - Entropy.
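A minimal sketch of these steps (reproducing the numbers above):

import math

def entropy(p_success):
    # Entropy = -p*log2(p) - q*log2(q); 0 for a pure node, 1 for a 50/50 node
    if p_success in (0, 1):
        return 0.0
    q = 1 - p_success
    return -p_success * math.log2(p_success) - q * math.log2(q)

def weighted_entropy(groups):
    # groups: list of (node_size, proportion_of_success) pairs
    total = sum(size for size, _ in groups)
    return sum(size / total * entropy(p) for size, p in groups)

print(round(entropy(15 / 30), 2))                                   # parent node: 1.0
print(round(weighted_entropy([(10, 2 / 10), (20, 13 / 20)]), 2))    # split on Gender: 0.86
print(round(weighted_entropy([(14, 6 / 14), (16, 9 / 16)]), 2))     # split on Class: 0.99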
Reduction in Variance
Till now, we have discussed algorithms for categorical target variables. Reduction in
variance is an algorithm used for continuous target variables (regression problems). This
algorithm uses the standard formula of variance to choose the best split. The split with the
lower variance is selected as the criterion to split the population:

Variance = Σ (X - X̄)² / n

Above, X̄ is the mean of the values, X is an actual value and n is the number of values.

Steps to calculate Variance:

1. Calculate the variance for each node.
2. Calculate the variance for each split as the weighted average of each node's variance.

Example:- Let’s assign the numerical value 1 for playing cricket and 0 for not playing cricket.
Now follow the steps to identify the right split (a short calculation in Python follows these steps):

1. Variance for the Root node: here the mean value is (15*1 + 15*0)/30 = 0.5 and we have
15 ones and 15 zeros. The variance would be ((1-0.5)^2+(1-0.5)^2+….15 times+(0-
0.5)^2+(0-0.5)^2+…15 times) / 30, which can be written as (15*(1-0.5)^2+15*(0-
0.5)^2) / 30 = 0.25.
2. Mean of the Female node = (2*1+8*0)/10 = 0.2 and Variance = (2*(1-0.2)^2+8*(0-
0.2)^2) / 10 = 0.16.
3. Mean of the Male node = (13*1+7*0)/20 = 0.65 and Variance = (13*(1-0.65)^2+7*(0-
0.65)^2) / 20 = 0.23.
4. Variance for the split on Gender = weighted variance of the sub-nodes = (10/30)*0.16 +
(20/30)*0.23 = 0.21.
5. Mean of the Class IX node = (6*1+8*0)/14 = 0.43 and Variance = (6*(1-0.43)^2+8*(0-
0.43)^2) / 14 = 0.24.
6. Mean of the Class X node = (9*1+7*0)/16 = 0.56 and Variance = (9*(1-0.56)^2+7*(0-
0.56)^2) / 16 = 0.25.
7. Variance for the split on Class = (14/30)*0.24 + (16/30)*0.25 = 0.25.

Above, you can see that the Gender split has a lower variance compared to the parent node, so the
split would take place on the Gender variable.
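A minimal sketch of these steps using NumPy (1 = plays cricket, 0 = does not):

import numpy as np

# 1 = plays cricket, 0 = does not; 30 students in total
root = np.array([1] * 15 + [0] * 15)
female = np.array([1] * 2 + [0] * 8)
male = np.array([1] * 13 + [0] * 7)
class_ix = np.array([1] * 6 + [0] * 8)
class_x = np.array([1] * 9 + [0] * 7)

def weighted_variance(nodes):
    # Weighted average of node variances, weights = node sizes
    total = sum(len(n) for n in nodes)
    return sum(len(n) / total * n.var() for n in nodes)

print(root.var())                              # 0.25 (parent node)
print(weighted_variance([female, male]))       # ~0.205, matching the ~0.21 above
print(weighted_variance([class_ix, class_x]))  # ~0.246, matching the ~0.25 above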

So far, we have learnt the basics of decision trees and the decision-making process involved in choosing the best splits while building a tree model. As I said, decision trees can be applied to both regression and classification problems. Let's understand these aspects in detail.
4. What are the key parameters of tree modeling and
how can we avoid over-fitting in decision trees?
Overfitting is one of the key challenges faced while modeling decision trees. If no limit is set on the size of a decision tree, it will give you 100% accuracy on the training set because, in the worst case, it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in 2 ways:

1. Setting constraints on tree size


2. Tree pruning

Let's discuss both of these briefly.

Setting Constraints on Tree Size


This can be done by using the various parameters which define a tree. First, let's look at the general structure of a decision tree:

The parameters used for defining a tree are explained below. The descriptions are tool-agnostic: it is important to understand the role each parameter plays in tree modeling, and equivalents are available in both R and Python (a short scikit-learn sketch follows the list).

1. Minimum samples for a node split


o Defines the minimum number of samples (or observations) which are
required in a node to be considered for splitting.
o Used to control over-fitting. Higher values prevent a model from learning
relations which might be highly specific to the particular sample selected
for a tree.
o Excessively high values can lead to under-fitting; hence, it should be tuned using CV.
2. Minimum samples for a terminal node (leaf)
o Defines the minimum samples (or observations) required in a terminal
node or leaf.
o Used to control over-fitting similar to min_samples_split.
o Generally lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in majority will be very small.
3. Maximum depth of tree (vertical depth)
o The maximum depth of a tree.
o Used to control over-fitting as higher depth will allow model to learn
relations very specific to a particular sample.
o Should be tuned using CV.
4. Maximum number of terminal nodes
o The maximum number of terminal nodes or leaves in a tree.
o Can be defined in place of max_depth. Since binary trees are created, a
depth of ‘n’ would produce a maximum of 2^n leaves.
5. Maximum features to consider for split
o The number of features to consider while searching for a best split. These
will be randomly selected.
o As a thumb rule, the square root of the total number of features works well, but values up to 30-40% of the total number of features are worth checking.
o Higher values can lead to over-fitting, but this varies from case to case.
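As a quick illustration, here is a minimal scikit-learn sketch showing how these constraints map onto DecisionTreeClassifier parameters (the numeric values are placeholders for illustration, not recommendations):

from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the constraints listed above
model = DecisionTreeClassifier(
    min_samples_split=50,   # minimum samples for a node split
    min_samples_leaf=20,    # minimum samples for a terminal node (leaf)
    max_depth=8,            # maximum depth of the tree
    max_leaf_nodes=64,      # maximum number of terminal nodes
    max_features='sqrt'     # features to consider for each split
)
# model.fit(X_train, y_train) once the training data is ready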

Tree Pruning
As discussed earlier, the technique of setting constraints is a greedy approach. In other words, it will check for the best split instantaneously and move forward until one of the specified stopping conditions is reached. Let's consider the following case when you're driving:

There are 2 lanes:

1. A lane with cars moving at 80km/h


2. A lane with trucks moving at 30km/h

At this instant, you are the yellow car and you have 2 choices:

1. Take a left and overtake the other 2 cars quickly


2. Keep moving in the present lane
Let's analyze these choices. With the former choice, you'll immediately overtake the car ahead, end up behind the truck and start moving at 30 km/h, looking for an opportunity to move back right. All the cars originally behind you move ahead in the meanwhile. This would be the optimum, greedy choice if your objective is to maximize the distance covered in, say, the next 10 seconds. With the latter choice, you sail through at the same speed, cross the trucks and then overtake, depending on the situation ahead.

This is exactly the difference between a normal decision tree and pruning. A decision tree with constraints won't see the truck ahead and will adopt the greedy approach of taking a left. On the other hand, if we use pruning, we in effect look a few steps ahead and make a choice.

So we know pruning is better. But how do we implement it in a decision tree? The idea is simple.

1. We first make the decision tree to a large depth.


2. Then we start at the bottom and remove leaves which give us negative returns when compared from the top.
3. Suppose a split gives us a gain of, say, -10 (a loss of 10) and the next split on it gives us a gain of 20. A simple decision tree will stop at the first split, but with pruning we see that the overall gain is +10 and keep both splits.

Note that sklearn's decision tree classifier did not support pruning when this was written (recent versions add cost-complexity pruning via the ccp_alpha parameter). Advanced packages like xgboost have adopted tree pruning in their implementation. The rpart library in R also provides a prune function. Good for R users!
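For scikit-learn users on a recent version, here is a minimal sketch of cost-complexity (post-)pruning via ccp_alpha; the alpha value shown is purely illustrative and should be chosen by cross-validation:

from sklearn.tree import DecisionTreeClassifier

# Grow a deep tree first, then prune it back by penalizing tree complexity:
# a larger ccp_alpha produces a smaller, more aggressively pruned tree
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
# pruned_tree.fit(X_train, y_train)
# DecisionTreeClassifier().cost_complexity_pruning_path(X_train, y_train) lists candidate alphas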

5. Are tree based models better than linear models?


“If I can use logistic regression for classification problems and linear regression for
regression problems, why is there a need to use trees”? Many of us have this question.
And, this is a valid one too.

Actually, you can use any algorithm. It is dependent on the type of problem you are
solving. Let’s look at some key factors which will help you to decide which algorithm to
use:

1. If the relationship between the dependent & independent variables is well approximated by a linear model, linear regression will outperform a tree-based model.
2. If there is a high non-linearity & complex relationship between dependent &
independent variables, a tree model will outperform a classical regression
method.
3. If you need to build a model which is easy to explain to people, a decision tree
model will always do better than a linear model. Decision tree models are even
simpler to interpret than linear regression!
6. Working with Decision Trees in R and Python
For both R and Python users, decision trees are quite easy to implement. Let's quickly look at the set of codes which can get you started with this algorithm. For ease of use, I've shared standard codes where you'll need to replace your data set name and variables to get started.

For R users, there are multiple packages available to implement decision trees, such as ctree, rpart, tree, etc.

> library(rpart)

> x <- cbind(x_train,y_train)

# grow tree

> fit <- rpart(y_train ~ ., data = x,method="class")

> summary(fit)

#Predict Output

> predicted= predict(fit,x_test)

In the code above:

 y_train – represents dependent variable.


 x_train – represents independent variable
 x – represents training data.

For Python users, below is the code:

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree

#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset

# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini') # for classification, here you can change the algorithm as gini or entropy (information gain), by default it is gini

# model = tree.DecisionTreeRegressor() for regression

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Predict Output
predicted = model.predict(x_test)
7. What are ensemble methods in tree based
modeling ?
The literal meaning of the word 'ensemble' is group. Ensemble methods involve a group of predictive models working together to achieve better accuracy and model stability. Ensemble methods are known to impart a supreme boost to tree-based models.

Like every other model, a tree based model also suffers from the plague of bias and
variance. Bias means, ‘how much on an average are the predicted values different from
the actual value.’ Variance means, ‘how different will the predictions of the model be at
the same point if different samples are taken from the same population’.

If you build a small tree, you will get a model with low variance and high bias. How do you manage to balance the trade-off between bias and variance?

Normally, as you increase the complexity of your model, you will see a reduction in
prediction error due to lower bias in the model. As you continue to make your model
more complex, you end up over-fitting your model and your model will start suffering
from high variance.

A champion model should maintain a balance between these two types of errors. This is
known as the trade-off management of bias-variance errors. Ensemble learning is one
way to execute this trade off analysis.

Some
of the commonly used ensemble methods include: Bagging, Boosting and Stacking. In
this tutorial, we’ll focus on Bagging and Boosting in detail.
8. What is Bagging? How does it work?
Bagging is a technique used to reduce the variance of our predictions by combining the
result of multiple classifiers modeled on different sub-samples of the same data set. The
following figure will make it clearer:

The steps followed in bagging are:

1. Create Multiple DataSets:


o Sampling is done with replacement on the original data and new datasets
are formed.
o The new data sets can have a fraction of the columns as well as rows,
which are generally hyper-parameters in a bagging model
o Taking row and column fractions less than 1 helps in making robust
models, less prone to overfitting
2. Build Multiple Classifiers:
o Classifiers are built on each data set.
o Generally the same classifier is modeled on each data set and predictions
are made.
3. Combine Classifiers:
o The predictions of all the classifiers are combined using a mean, median
or mode value depending on the problem at hand.
o The combined values are generally more robust than a single model.

Note that the number of models built here is not a hyper-parameter to worry about in the usual sense: a higher number of models is always at least as good as a lower number (it may simply give similar performance). It can be shown theoretically that, under some assumptions, the variance of the combined predictions is reduced to 1/n of the original variance (n: number of classifiers).
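A minimal scikit-learn sketch of the bagging procedure described above (the sampling fractions and number of models are illustrative; the default base estimator is a decision tree):

from sklearn.ensemble import BaggingClassifier

# Each model is fit on a bootstrap sample of rows and a random subset of columns,
# and the predictions of all models are combined (majority vote for classification)
bagging = BaggingClassifier(
    n_estimators=100,   # number of models to combine
    max_samples=0.8,    # fraction of rows sampled per model
    max_features=0.8,   # fraction of columns considered per model
    bootstrap=True      # sample rows with replacement
)
# bagging.fit(X_train, y_train); predicted = bagging.predict(x_test)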

There are various implementations of bagging models. Random forest is one of them
and we’ll discuss it next.
9. What is Random Forest ? How does it work?
Random Forest is considered to be a panacea of all data science problems. On a funny
note, when you can’t think of any algorithm (irrespective of situation), use random forest!

Random Forest is a versatile machine learning method capable of performing both


regression and classification tasks. It also handles dimensionality reduction, missing values, outlier values and other essential steps of data exploration, and does a fairly good job. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.

How does it work?


In Random Forest, we grow multiple trees as opposed to a single tree in CART model
(see comparison between CART and Random Forest here, part1 and part2). To classify
a new object based on attributes, each tree gives a classification and we say the tree
“votes” for that class. The forest chooses the classification having the most votes (over
all the trees in the forest) and in case of regression, it takes the average of outputs by
different trees.

It works in the following manner. Each tree is planted & grown as follows:
1. Assume the number of cases in the training set is N. A sample of these N cases is taken at random, but with replacement. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m<M is specified such that at each node,
m variables are selected at random out of the M. The best split on these m is
used to split the node. The value of m is held constant while we grow the forest.
3. Each tree is grown to the largest extent possible and there is no pruning.
4. Predict new data by aggregating the predictions of the ntree trees (i.e., majority
votes for classification, average for regression).

To understand more in detail about this algorithm using a case study, please read
this article “Introduction to Random forest – Simplified“.

Advantages of Random Forest


 This algorithm can solve both types of problems, i.e. classification and regression, and does a decent estimation on both fronts.
 One of the benefits of Random Forest which excites me most is its power to handle large data sets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is considered one of the dimensionality reduction methods. Further, the model outputs variable importance, which can be a very handy feature (on some random data set).

 It has an effective method for estimating missing data and maintains accuracy
when a large proportion of the data are missing.
 It has methods for balancing errors in data sets where classes are imbalanced.
 The capabilities of the above can be extended to unlabeled data, leading to
unsupervised clustering, data views and outlier detection.
 Random Forest involves sampling of the input data with replacement, called bootstrap sampling. Roughly one third of the data is not used for training a given tree and can be used for testing; these are called the out-of-bag samples. The error estimated on these out-of-bag samples is known as the out-of-bag error. Studies of out-of-bag error estimates give evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set (see the short sketch after this list).
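A minimal scikit-learn sketch of using the out-of-bag estimate (the number of trees is illustrative):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True)  # track out-of-bag predictions
# rf.fit(X_train, y_train)
# rf.oob_score_ gives the accuracy estimated on the out-of-bag samples, with no separate test set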

Disadvantages of Random Forest


 It surely does a good job at classification, but not as good a job for regression problems, as it does not give precise continuous predictions. In the case of regression, it doesn't predict beyond the range seen in the training data, and it may over-fit data sets that are particularly noisy.
 Random Forest can feel like a black-box approach for statistical modelers – you have very little control over what the model does. At best, you can try different parameters and random seeds!

Python & R implementation


Random forests have well-known implementations in R packages and in Python's scikit-learn. Let's look at the code for fitting a random forest model in Python and R below:
Python

#Import Library
from sklearn.ensemble import RandomForestClassifier # use RandomForestRegressor for regression problems

#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset

# Create Random Forest object
model = RandomForestClassifier(n_estimators=1000)

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted = model.predict(x_test)

R Code

> library(randomForest)

> x <- cbind(x_train,y_train)

# Fitting model

> fit <- randomForest(Species ~ ., x,ntree=500)


> summary(fit)

#Predict Output

> predicted= predict(fit,x_test)

10. What is Boosting ? How does it work?


Definition: The term 'Boosting' refers to a family of algorithms which convert weak learners into strong learners.

Let’s understand this definition in detail by solving a problem of spam email identification:

How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify 'spam' and 'not spam' emails using the following criteria. If:

1. Email has only one image file (promotional image), It’s a SPAM
2. Email has only link(s), It’s a SPAM
3. Email body consist of sentence like “You won a prize money of $ xxxxxx”, It’s a
SPAM
4. Email from our official domain “Analyticsvidhya.com” , Not a SPAM
5. Email from known source, Not a SPAM

Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But, do
you think these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email into 'spam' or 'not spam'. Therefore, these rules are called weak learners.

To convert weak learners into a strong learner, we'll combine the prediction of each weak learner using methods like:

 Using average/ weighted average


 Considering the prediction with the higher vote

For example: above, we have defined 5 weak learners. Out of these 5, 3 vote 'SPAM' and 2 vote 'Not a SPAM'. In this case, by default, we'll consider the email as SPAM because we have the higher vote (3) for 'SPAM'.
How does it work?
Now we know that boosting combines weak learners, a.k.a. base learners, to form a strong rule. An immediate question which should pop up in your mind is, 'How does boosting identify weak rules?'

To find a weak rule, we apply a base learning (ML) algorithm with a different distribution each time. Every time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.

Here’s another question which might haunt you, ‘How do we choose different distribution
for each round?’

For choosing the right distribution, here are the following steps:

Step 1: The base learner takes all the distributions and assigns equal weight or attention to each observation.

Step 2: If there is any prediction error caused by the first base learning algorithm, we pay higher attention to the observations having prediction errors. Then, we apply the next base learning algorithm.

Step 3: Iterate Step 2 till the limit of the base learning algorithm is reached or a sufficiently high accuracy is achieved.

Finally, it combines the outputs from the weak learners and creates a strong learner which eventually improves the prediction power of the model. Boosting pays more attention to examples which are mis-classified or have higher errors under the preceding weak rules (a minimal AdaBoost sketch of this reweighting idea follows).
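The classic implementation of this reweighting scheme is AdaBoost; here is a minimal scikit-learn sketch, included only to make the idea concrete (the tutorial itself focuses on GBM and XGBoost below):

from sklearn.ensemble import AdaBoostClassifier

# Each stage fits a weak learner on a reweighted version of the data,
# where previously mis-classified observations receive higher weight
booster = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
# booster.fit(X_train, y_train); predicted = booster.predict(x_test)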
There are many boosting algorithms which impart additional boost to model’s accuracy.
In this tutorial, we’ll learn about the two most commonly used algorithms i.e. Gradient
Boosting (GBM) and XGboost.

11. Which is more powerful: GBM or Xgboost?


I've always admired the boosting capabilities of the xgboost algorithm. At times, I've found that it provides better results than the GBM implementation, but at times you might find that the gains are just marginal. When I explored more about its performance and the science behind its high accuracy, I discovered many advantages of XGBoost over GBM:

1. Regularization:
o The standard GBM implementation has no regularization, unlike XGBoost; this regularization helps XGBoost reduce overfitting.
o In fact, XGBoost is also known as a 'regularized boosting' technique.
2. Parallel Processing:
o XGBoost implements parallel processing and is blazingly faster as
compared to GBM.
o But hang on, we know that boosting is a sequential process, so how can it be parallelized? Each tree can be built only after the previous one, but the construction of an individual tree (the search over candidate splits) can be parallelized across cores. Check this link out to explore further.
o XGBoost also supports implementation on Hadoop.
3. High Flexibility
o XGBoost allow users to define custom optimization objectives and
evaluation criteria.
o This adds a whole new dimension to the model and there is no limit to
what we can do.
4. Handling Missing Values
o XGBoost has an in-built routine to handle missing values.
o User is required to supply a different value than other observations and
pass that as a parameter. XGBoost tries different things as it encounters a
missing value on each node and learns which path to take for missing
values in future.
5. Tree Pruning:
o A GBM would stop splitting a node when it encounters a negative loss in
the split. Thus it is more of a greedy algorithm.
o XGBoost, on the other hand, makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.
o Another advantage is that sometimes a split of negative loss say -2 may
be followed by a split of positive loss +10. GBM would stop as it
encounters -2. But XGBoost will go deeper and it will see a combined
effect of +8 of the split and keep both.
6. Built-in Cross-Validation
o XGBoost allows the user to run a cross-validation at each iteration of the boosting process, making it easy to get the exact optimum number of boosting iterations in a single run.
o This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.
7. Continue on Existing Model
o The user can start training an XGBoost model from the last iteration of a previous run. This can be a significant advantage in certain specific applications.
o The sklearn implementation of GBM also has this feature, so the two are even on this point.

12. Working with GBM in R and Python


Before we start working, let’s quickly understand the important parameters and the
working of this algorithm. This will be helpful for both R and Python users. Below is the
overall pseudo-code of GBM algorithm for 2 classes:
1. Initialize the outcome

2. Iterate from 1 to total number of trees

2.1 Update the weights for targets based on previous run (higher for the ones mis-classified)

2.2 Fit the model on selected subsample of data

2.3 Make predictions on the full set of observations

2.4 Update the output with current results taking into account the learning rate

3. Return the final output.

This is an extremely simplified (probably naive) explanation of GBM's working. But it will help beginners understand the algorithm.

Lets consider the important GBM parameters used to improve model performance in
Python:

1. learning_rate
o This determines the impact of each tree on the final outcome (step 2.4).
GBM works by starting with an initial estimate which is updated using the
output of each tree. The learning parameter controls the magnitude of this
change in the estimates.
o Lower values are generally preferred as they make the model robust to the specific characteristics of each tree, thus allowing it to generalize well.
o Lower values, however, require a higher number of trees to model all the relations and are computationally more expensive.
2. n_estimators
o The number of sequential trees to be modeled (step 2)
o Though GBM is fairly robust to a higher number of trees, it can still overfit at some point. Hence, this should be tuned using CV for a particular learning rate.
3. subsample
o The fraction of observations to be selected for each tree. Selection is done
by random sampling.
o Values slightly less than 1 make the model robust by reducing the
variance.
o Typical values ~0.8 generally work fine but can be fine-tuned further.
Apart from these, there are certain miscellaneous parameters which affect overall
functionality:

1. loss
o It refers to the loss function to be minimized in each split.
o It can have various values for classification and regression case.
Generally the default values work fine. Other values should be chosen
only if you understand their impact on the model.
2. init
o This affects initialization of the output.
o This can be used if we have made another model whose outcome is to be used as the initial estimates for GBM.
3. random_state
o The random number seed so that same random numbers are generated
every time.
o This is important for parameter tuning. If we don’t fix the random number,
then we’ll have different outcomes for subsequent runs on the same
parameters and it becomes difficult to compare models.
o It can potentially result in overfitting to a particular random sample
selected. We can try running models for different random samples, which
is computationally expensive and generally not used.
4. verbose
o The type of output to be printed when the model fits. The different values
can be:
 0: no output generated (default)
 1: output generated for trees in certain intervals
 >1: output generated for all trees
5. warm_start
o This parameter has an interesting application and can help a lot if used judiciously.
o Using this, we can fit additional trees on the previous fits of a model. It can save a lot of time, and you should explore this option for advanced applications.
6. presort
o Select whether to presort data for faster splits.
o It makes the selection automatically by default but it can be changed if
needed.

I know it's a long list of parameters, but I have simplified it for you in an excel file which you can download from this GitHub repository.

For R users, using the caret package, there are 4 main tuning parameters:

1. n.trees – The number of iterations, i.e. the number of trees to grow.
2. interaction.depth – The complexity of the tree, i.e. the total number of splits to perform on a tree (starting from a single node).
3. shrinkage – The learning rate. This is similar to learning_rate in Python (shown above).
4. n.minobsinnode – The minimum number of training samples required in a node to perform a split.
GBM in R (with cross validation)
I’ve shared the standard codes in R and Python. At your end, you’ll be required to
change the value of dependent variable and data set name used in the codes below.
Considering the ease of implementing GBM in R, one can easily perform tasks like cross
validation and grid search with this package.

> library(caret)

> fitControl <- trainControl(method = "cv", number = 10)  # 10-fold CV

> tune_Grid <- expand.grid(interaction.depth = 2,
                           n.trees = 500,
                           shrinkage = 0.1,
                           n.minobsinnode = 10)

> set.seed(825)

> fit <- train(y_train ~ ., data = train,
               method = "gbm",
               trControl = fitControl,
               verbose = FALSE,
               tuneGrid = tune_Grid)

> predicted = predict(fit, test, type = "prob")[,2]

GBM in Python

#import libraries

from sklearn.ensemble import GradientBoostingClassifier #For Classification

from sklearn.ensemble import GradientBoostingRegressor #For Regression

#use GBM function

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)

clf.fit(X_train, y_train)
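Since the parameters above should be tuned with cross-validation, here is a minimal sketch of a grid search over a few of them (the grid values are illustrative, not recommendations):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'learning_rate': [0.05, 0.1],
    'n_estimators': [100, 300],
    'subsample': [0.8, 1.0],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5, scoring='accuracy')
# search.fit(X_train, y_train)
# search.best_params_, search.best_score_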

13. Working with XGBoost in R and Python


XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. Its support for parallel computing makes it at least 10 times faster than older gradient boosting implementations. It supports various objective functions, including regression, classification and ranking.

R Tutorial: For R users, this is a complete tutorial on XGboost which explains the
parameters along with codes in R. Check Tutorial.

Python Tutorial: For Python users, this is a comprehensive tutorial on XGBoost, good to
get you started. Check Tutorial.
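The linked tutorials cover XGBoost in depth; as a quick taste, here is a minimal sketch using the xgboost package's scikit-learn wrapper (assuming the package is installed; the parameter values are illustrative):

from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4)
# model.fit(X_train, y_train)
# predicted = model.predict(x_test)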
14. Where to practice ?
Practice is the one true method of mastering any concept. Hence, you need to start practicing if you wish to master these algorithms.

By now, you've gained significant knowledge of tree-based models along with their practical implementation. It's time that you start working on them. Here are open practice problems where you can participate and check your live rankings on the leaderboard:

For Regression: Big Mart Sales Prediction

For Classification: Loan Prediction

End Notes
Tree-based algorithms are important for every data scientist to learn. In fact, tree models are known to provide some of the best model performance in the whole family of machine learning algorithms. In this tutorial, we covered everything up to GBM and XGBoost, and with this we come to the end of this tutorial.

We discussed tree-based modeling from scratch. We learnt the importance of decision trees and how that simple concept is used in boosting algorithms. For a better understanding, I would suggest you keep practicing these algorithms hands-on. Also, do keep note of the parameters associated with boosting algorithms. I hope this tutorial has enriched you with complete knowledge of tree-based modeling.
Ultimate guide to deal with Text Data
(using Python) – for Data Scientists &
Engineers
Introduction
One of the biggest breakthroughs required for achieving any level of artificial intelligence
is to have machines which can process text data. Thankfully, the amount of text data
being generated in this universe has exploded exponentially in the last few years.

It has become imperative for an organization to have a structure in place to mine


actionable insights from the text being generated. From social media analytics to risk
management and cybercrime protection, dealing with text data has never been more
important.

In this article we will discuss different feature extraction methods, starting with some
basic techniques which will lead into advanced Natural Language Processing techniques.
We will also learn about pre-processing of the text data in order to extract better features
from clean data.

By the end of this article, you will be able to perform text operations by yourself. Let’s get
started!

Table of Contents:
1. Basic feature extraction using text data
o Number of words
o Number of characters
o Average word length
o Number of stopwords
o Number of special characters
o Number of numerics
o Number of uppercase words
2. Basic Text Pre-processing of text data
o Lower casing
o Punctuation removal
o Stopwords removal
o Frequent words removal
o Rare words removal
o Spelling correction
o Tokenization
o Stemming
o Lemmatization
3. Advanced Text Processing
o N-grams
o Term Frequency
o Inverse Document Frequency
o Term Frequency-Inverse Document Frequency (TF-IDF)
o Bag of Words
o Sentiment Analysis
o Word Embedding

1. Basic Feature Extraction


We can use text data to extract a number of features even if we don’t have sufficient
knowledge of Natural Language Processing. So let’s discuss some of them in this
section.

Before starting, let’s quickly read the training file from the dataset in order to perform
different tasks on it. In the entire article, we will use the twitter sentiment dataset from
the datahack platform.

train = pd.read_csv('train_E6oV3lV.csv')

Note that here we are only working with textual data, but we can also use the below
methods when numerical features are also present along with the text.

1.1 Number of Words


One of the most basic features we can extract is the number of words in each tweet. The basic intuition behind this is that, generally, negative sentiments contain fewer words than positive ones.

To do this, we simply use the split function in python:

train['word_count'] = train['tweet'].apply(lambda x: len(str(x).split(" ")))

train[['tweet','word_count']].head()
1.2 Number of characters
This feature is also based on the previous feature intuition. Here, we calculate the
number of characters in each tweet. This is done by calculating the length of the tweet.

train['char_count'] = train['tweet'].str.len() ## this also includes spaces

train[['tweet','char_count']].head()

Note that the calculation will also include the number of spaces, which you can remove,
if required.

1.3 Average Word Length


We will also extract another feature which will calculate the average word length of each
tweet. This can also potentially help us in improving our model.
Here, we simply take the sum of the lengths of all the words and divide it by the number of words in the tweet:

def avg_word(sentence):
    words = sentence.split()
    return (sum(len(word) for word in words) / len(words))

train['avg_word'] = train['tweet'].apply(lambda x: avg_word(x))

train[['tweet','avg_word']].head()

1.4 Number of stopwords


Generally, while solving an NLP problem, the first thing we do is to remove the
stopwords. But sometimes calculating the number of stopwords can also give us some
extra information which we might have been losing before.

Here, we have imported stopwords from NLTK, which is a basic NLP library in python.

from nltk.corpus import stopwords

stop = stopwords.words('english')
train['stopwords'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))

train[['tweet','stopwords']].head()

1.5 Number of special characters


One more interesting feature which we can extract from a tweet is calculating the
number of hashtags or mentions present in it. This also helps in extracting extra
information from our text data.

Here, we make use of the ‘starts with’ function because hashtags (or mentions) always
appear at the beginning of a word.

train['hastags'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))

train[['tweet','hastags']].head()
1.6 Number of numerics
Just like we calculated the number of words, we can also calculate the number of
numerics which are present in the tweets. It does not have a lot of use in our example,
but this is still a useful feature that should be run while doing similar exercises. For
example,

train['numerics'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))

train[['tweet','numerics']].head()

1.7 Number of Uppercase words


Anger or rage is quite often expressed by writing in UPPERCASE words which makes
this a necessary operation to identify those words.
train['upper'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))

train[['tweet','upper']].head()

2. Basic Pre-processing
So far, we have learned how to extract basic features from text data. Before diving into
text and feature extraction, our first step should be cleaning the data in order to obtain
better features. We will achieve this by doing some of the basic pre-processing steps on
our training data.

So, let’s get into it.

2.1 Lower case


The first pre-processing step which we will do is transform our tweets into lower case.
This avoids having multiple copies of the same words. For example, while calculating the
word count, ‘Analytics’ and ‘analytics’ will be taken as different words.

train['tweet'] = train['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))

train['tweet'].head()
2.2 Removing Punctuation
The next step is to remove punctuation, as it doesn’t add any extra information while
treating text data. Therefore removing all instances of it will help us reduce the size of
the training data.

train['tweet'] = train['tweet'].str.replace('[^\w\s]','')

train['tweet'].head()

As you can see in the above output, all the punctuation, including ‘#’ and ‘@’, has been
removed from the training data.

2.3 Removal of Stop Words


As we discussed earlier, stop words (or commonly occurring words) should be removed
from the text data. For this purpose, we can either create a list of stopwords ourselves or
we can use predefined libraries.

from nltk.corpus import stopwords


stop = stopwords.words('english')

train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

train['tweet'].head()

2.4 Common word removal


Previously, we just removed commonly occurring words in a general sense. We can also remove words that occur frequently in our particular text data. First, let's check the 10 most frequently occurring words in our text data, and then take a call on whether to remove or retain them.

freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]

freq

> user 17473

love 2647

ð 2511

day 2199

â 1797
happy 1663

amp 1582

im 1139

u 1136

time 1110

dtype: int64

Now, let's remove these words, as their presence will not be of any use in the classification of our text data.

freq = list(freq.index)

train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))

train['tweet'].head()

2.5 Rare words removal


Similarly, just as we removed the most common words, this time let's remove rarely occurring words from the text. Because they're so rare, the association between them and other words is dominated by noise. You can also replace rare words with a more general form, which will then have higher counts.

freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]

freq

> tvperfect 1

oau 1

850am 1

semangatpagi 1

kindestbravest 1

moodyah 1

downhill 1

loreal 1

ohwhatcoulditbe 1

maannnn 1

dtype: int64

freq = list(freq.index)

train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

All these pre-processing steps are essential and help us in reducing our vocabulary
clutter so that the features produced in the end are more effective.

2.6 Spelling correction


We've all seen tweets with a plethora of spelling mistakes. Our timelines are often filled with hastily sent tweets that are barely legible at times.

In that regard, spelling correction is a useful pre-processing step because this also will
help us in reducing multiple copies of words. For example, “Analytics” and “analytcs” will
be treated as different words even if they are used in the same sense.

To achieve this we will use the textblob library. If you are not familiar with it, you can
check my previous article on ‘NLP for beginners using textblob’.

from textblob import TextBlob

train['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

Note that it will actually take a lot of time to make these corrections. Therefore, just for
the purposes of learning, I have shown this technique by applying it on only the first 5
rows. Moreover, we cannot always expect it to be accurate so some care should be
taken before applying it.
We should also keep in mind that words are often used in their abbreviated form. For
instance, ‘your’ is used as ‘ur’. We should treat this before the spelling correction step,
otherwise these words might be transformed into any other word like the one shown
below:

2.7 Tokenization
Tokenization refers to dividing the text into a sequence of words or sentences. In our
example, we have used the textblob library to first transform our tweets into a blob and
then converted them into a series of words.

TextBlob(train['tweet'][1]).words

> WordList(['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', '

wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'])

2.8 Stemming
Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc. by a simple rule-based approach. For this purpose, we will use PorterStemmer from the NLTK library.

from nltk.stem import PorterStemmer

st = PorterStemmer()

train['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
0 father dysfunct selfish drag kid dysfunct run

1 thank lyft credit cant use caus dont offer whe...

2 bihday majesti

3 model take urð ðððð ððð

4 factsguid societi motiv

Name: tweet, dtype: object

In the above output, dysfunctional has been transformed into dysfunct, among other
changes.

2.9 Lemmatization
Lemmatization is a more effective option than stemming because it converts the word into its root word, rather than just stripping suffixes. It makes use of the vocabulary and does a morphological analysis to obtain the root word. Therefore, we usually prefer using lemmatization over stemming.

from textblob import Word

train['tweet'] = train['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

train['tweet'].head()

0 father dysfunctional selfish drag kid dysfunct...

1 thanks lyft credit cant use cause dont offer w...


2 bihday majesty

3 model take urð ðððð ððð

4 factsguide society motivation

Name: tweet, dtype: object

3. Advanced Text Processing


Up to this point, we have done all the basic pre-processing steps in order to clean our
data. Now, we can finally move on to extracting features using NLP techniques.

3.1 N-grams
N-grams are combinations of multiple words used together. N-grams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on can also be used.

Unigrams do not usually contain as much information as compared to bigrams and


trigrams. The basic principle behind n-grams is that they capture the language structure,
like what letter or word is likely to follow the given one. The longer the n-gram (the higher
the n), the more context you have to work with. Optimum length really depends on the
application – if your n-grams are too short, you may fail to capture important differences.
On the other hand, if they are too long, you may fail to capture the “general knowledge”
and only stick to particular cases.

So, let’s quickly extract bigrams from our tweets using the ngrams function of the
textblob library.

TextBlob(train['tweet'][0]).ngrams(2)

> [WordList(['user', 'when']),

WordList(['when', 'a']),
WordList(['a', 'father']),

WordList(['father', 'is']),

WordList(['is', 'dysfunctional']),

WordList(['dysfunctional', 'and']),

WordList(['and', 'is']),

WordList(['is', 'so']),

WordList(['so', 'selfish']),

WordList(['selfish', 'he']),

WordList(['he', 'drags']),

WordList(['drags', 'his']),

WordList(['his', 'kids']),

WordList(['kids', 'into']),

WordList(['into', 'his']),

WordList(['his', 'dysfunction']),

WordList(['dysfunction', 'run'])]
3.2 Term frequency
Term frequency is simply the ratio of the count of a word present in a sentence, to the
length of the sentence.

Therefore, we can generalize term frequency as:

TF = (Number of times term T appears in the particular row) / (Number of terms in that row)

To understand more about Term Frequency, have a look at this article.

Below, I have tried to show you the term frequency table of a tweet.

tf1 = (train['tweet'][1:2]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0).reset_index()

tf1.columns = ['words','tf']

tf1



3.3 Inverse Document Frequency
The intuition behind inverse document frequency (IDF) is that a word is not of much use
to us if it’s appearing in all the documents.

Therefore, the IDF of each word is the log of the ratio of the total number of rows to the
number of rows in which that word is present.

IDF = log(N/n), where, N is the total number of rows and n is the number of rows in
which the word was present.

So, let’s calculate IDF for the same tweets for which we calculated the term frequency.

for i,word in enumerate(tf1['words']):
    tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['tweet'].str.contains(word)])))

tf1
The higher the value of IDF, the rarer and more distinctive the word.

3.4 Term Frequency – Inverse Document Frequency (TF-IDF)


TF-IDF is the multiplication of the TF and IDF which we calculated above.

tf1['tfidf'] = tf1['tf'] * tf1['idf']

tf1
We can see that the TF-IDF has penalized words like ‘don’t’, ‘can’t’, and ‘use’ because
they are commonly occurring words. However, it has given a high weight to
“disappointed” since that will be very useful in determining the sentiment of the tweet.

We don’t have to calculate TF and IDF every time beforehand and then multiply it to
obtain TF-IDF. Instead, sklearn has a separate function to directly obtain it:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', stop_words='english', ngram_range=(1,1))

train_vect = tfidf.fit_transform(train['tweet'])
train_vect

<31962x1000 sparse matrix of type '<class 'numpy.float64'>'

with 114033 stored elements in Compressed Sparse Row format>

We can also perform basic pre-processing steps like lower-casing and removal of
stopwords, if we haven’t done them earlier.

3.5 Bag of Words


Bag of Words (BoW) refers to a representation of text which describes the presence of words within the text data. The intuition behind this is that two similar text fields will contain similar kinds of words, and will therefore have a similar bag of words. Further, from the text alone we can learn something about the meaning of the document.

For implementation, sklearn provides a separate function for it as shown below:

from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1), analyzer="word")

train_bow = bow.fit_transform(train['tweet'])
train_bow

> <31962x1000 sparse matrix of type '<class 'numpy.int64'>'

with 128380 stored elements in Compressed Sparse Row format>

To gain a better understanding of this, you can refer to this article.


3.6 Sentiment Analysis
If you recall, our problem was to detect the sentiment of the tweet. So, before applying any ML/DL models, let's check the sentiment of the first few tweets using the textblob library; this sentiment score can itself be used as a feature later on.

train['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)

0 (-0.3, 0.5354166666666667)

1 (0.2, 0.2)

2 (0.0, 0.0)

3 (0.0, 0.0)

4 (0.0, 0.0)

Name: tweet, dtype: object

Above, you can see that it returns a tuple representing the polarity and subjectivity of each tweet. Here, we only extract polarity, as it indicates the sentiment: a value nearer to 1 means a positive sentiment and a value nearer to -1 means a negative sentiment. This can also work as a feature for building a machine learning model.

train['sentiment'] = train['tweet'].apply(lambda x: TextBlob(x).sentiment[0] )

train[['tweet','sentiment']].head()
3.7 Word Embeddings
Word Embedding is the representation of text in the form of vectors. The underlying idea here is that similar words will have a small distance between their vectors.

Word2Vec models require a lot of text, so either we can train it on our training data or we
can use the pre-trained word vectors developed by Google, Wiki, etc.

Here, we will use pre-trained word vectors which can be downloaded from the glove website. There are vectors of different dimensions (50, 100, 200, 300) trained on wiki data. For this example, I have downloaded the 100-dimensional version of the model.

You can refer an article here to understand different form of word embeddings.

The first step here is to convert it into the word2vec format.

from gensim.scripts.glove2word2vec import glove2word2vec

glove_input_file = 'glove.6B.100d.txt'

word2vec_output_file = 'glove.6B.100d.txt.word2vec'

glove2word2vec(glove_input_file, word2vec_output_file)

>(400000, 100)
Now, we can load the above word2vec file as a model.

from gensim.models import KeyedVectors # load the Stanford GloVe model

filename = 'glove.6B.100d.txt.word2vec'

model = KeyedVectors.load_word2vec_format(filename, binary=False)

Let's say our tweet contains text saying 'go away'. We can easily obtain its word vectors using the above model:

model['go']

model['away']
We then take the average to represent the string ‘go away’ in the form of vectors having
100 dimensions.

(model['go'] + model['away'])/2

> array([-0.091342 , 0.22340401, 0.58855999, -0.61476499, -0.0838365 ,

0.53869998, -0.43531001, 0.349125 , 0.16330799, -0.28222999,

0.53547001, 0.52797496, 0.096812 , 0.2879 , -0.0533385 ,

-0.37232 , 0.022637 , 0.574705 , -0.55327499, 0.385575 ,

0.56533498, 0.80540502, 0.2579965 , 0.0088565 , 0.1674905 ,

0.25543001, -0.57103503, -0.59925997, 0.42258501, -0.42896 ,

-0.389065 , 0.19631 , -0.00933 , 0.127285 , -0.0487465 ,

0.38143501, -0.22540998, 0.021299 , -0.1827915 , -0.16490501,

-0.47944498, -0.431528 , -0.20091 , -0.55664998, -0.32982001,


-0.088548 , -0.28038502, 0.219725 , 0.090537 , -0.67012 ,

0.0883085 , -0.19332001, 0.0465725 , 1.160815 , 0.0691255 ,

-2.47895002, -0.33706999, 0.083195 , 1.86185002, 0.283465 ,

-0.13080999, 0.92779499, -0.37028 , 0.18854649, 0.66197997,

0.50517499, 0.37748498, 0.1322995 , -0.380375 , -0.025135 ,

-0.1636765 , -0.45638999, -0.047815 , -0.87393999, 0.0264145 ,

0.0117645 , -0.42741501, -0.31048 , -0.317725 , -0.02326 ,

0.525635 , 0.05760051, -0.69786 , -0.1213325 , -1.27069998,

-0.225355 , -0.1018815 , 0.18575001, -0.30943 , -0.211059 ,

-0.27954501, -0.16002001, 0.100371 , -0.05461 , -0.71834505,

-0.39292499, 0.12075999, 0.61991 , 0.58214498, 0.20161 ], dtype=float32)

We have converted the entire string into a vector which can now be used as a feature in
any modelling technique.

End Notes
I hope that you now have a basic understanding of how to deal with text data in predictive modeling. These methods will help you extract more information, which in turn will help you build better models.
I would recommend practising these methods by applying them in machine
learning/deep learning competitions. You can also start with the Twitter sentiment
problem we covered in this article (the dataset is available on the datahack platform of
AV).

Quick Guide to Build a


Recommendation Engine in Python
Introduction
This could help you in building your first project!
Be it a fresher or an experienced professional in data science, doing voluntary projects always adds to one's candidature. My sole reason behind writing this article is to get you started with recommendation systems so that you can build one. If you struggle to find open data, write to me in the comments.

Recommendation engines are nothing but an automated form of a “shop counter guy”. You ask him for a product. Not only does he show you that product, but also related ones which you could buy. Shop counter guys are well trained in cross-selling and up-selling, and so are our recommendation engines.

The ability of these engines to recommend personalized content, based on past


behavior is incredible. It brings customer delight and gives them a reason to keep
returning to the website.

In this post, I will cover the fundamentals of creating a recommendation system


using GraphLab in Python. We will get some intuition into how recommendations work and create a basic popularity model and a collaborative filtering model.

Topics Covered
1. Type of Recommendation Engines
2. The MovieLens DataSet
3. A simple popularity model
4. A Collaborative Filtering Model
5. Evaluating Recommendation Engines

Before moving forward, I would like to extend my sincere gratitude to


the Coursera’s Machine Learning Specialization by University of Washington. This
course has been instrumental in my understanding of the concepts and this post is an
illustration of my learnings from the same.
1. Type of Recommendation Engines
Before taking a look at the different types of recommendation engines, lets take a step
back and see if we can make some intuitive recommendations. Consider the following
cases:

Case 1: Recommend the most popular items


A simple approach could be to recommend the items which are liked by the largest number of users. This is a blazingly fast and dirty approach and thus has a major drawback. The thing is, there is no personalization involved with this approach.

Basically, the most popular items would be the same for each user, since popularity is defined on the entire user pool. So everybody will see the same results. It sounds like a website recommending you to buy a microwave just because it's been liked by other users, without caring whether you are even interested in buying one or not.

Surprisingly, such an approach still works in places like news portals. Whenever you log in to, say, bbcnews, you'll see a column of “Popular News” which is subdivided into sections, and the most read articles of each section are displayed. This approach can work in this case because:

 There is division by section so user can look at the section of his interest.
 At a time there are only a few hot topics and there is a high chance that a user
wants to read the news which is being read by most others

Case 2: Using a classifier to make recommendation


We already know lots of classification algorithms. Let’s see how we can use the same
technique to make recommendations. Classifiers are parametric solutions so we just
need to define some parameters (features) of the user and the item. The outcome can
be 1 if the user likes it or 0 otherwise. This might work out in some cases because of the following advantages:

 Incorporates personalization
 It can work even if the user’s past history is short or not available

But has some major drawbacks as well because of which it is not used much in practice:

 The features might actually not be available or even if they are, they may not be
sufficient to make a good classifier
 As the number of users and items grow, making a good classifier will become
exponentially difficult
Case 3: Recommendation Algorithms
Now let's come to the special class of algorithms which are tailor-made for solving the recommendation problem. There are typically two types of algorithms – Content Based and Collaborative Filtering. You should refer to our previous article to get a complete sense of how they work. I'll give a short recap here.

1. Content based algorithms:


o Idea: If you like an item then you will also like a “similar” item
o Based on similarity of the items being recommended
o It generally works well when it's easy to determine the context/properties of each item, for instance when we are recommending the same kind of item, like a movie or song recommendation.
2. Collaborative filtering algorithms:
o Idea: If a person A likes items 1, 2, 3 and B likes 2, 3, 4, then they have similar interests and A should like item 4 and B should like item 1.
o This algorithm is entirely based on past behavior and not on the context. This makes it one of the most commonly used algorithms, as it is not dependent on any additional information.
o For instance: product recommendations by e-commerce player like
Amazon and merchant recommendations by banks like American Express.
o Further, there are several types of collaborative filtering algorithms :
1. User-User Collaborative filtering: Here we find look-alike customers (based on similarity) and offer products which the first customer's look-alikes have chosen in the past. This algorithm is very effective but takes a lot of time and resources, since it requires computing information for every customer pair. Therefore, for platforms with a large user base, this algorithm is hard to implement without a very strong, parallelizable system.
2. Item-Item Collaborative filtering: It is quite similar to the previous algorithm, but instead of finding customer look-alikes, we try to find item look-alikes. Once we have the item look-alike matrix, we can easily recommend alike items to a customer who has purchased any item from the store. This algorithm is far less resource-consuming than user-user collaborative filtering. Hence, for a new customer the algorithm takes far less time than user-user collaborative filtering, as we don't need all the similarity scores between customers. And with a fixed number of products, the product-product look-alike matrix is fixed over time.
3. Other simpler algorithms: There are other approaches, like market basket analysis, which generally have lower predictive power than the algorithms described above.

2. The MovieLens DataSet


We will be using the MovieLens dataset for this purpose. It has been collected by the
GroupLens Research Project at the University of Minnesota. MovieLens 100K dataset
can be downloaded from here. It consists of:
 100,000 ratings (1-5) from 943 users on 1682 movies.
 Each user has rated at least 20 movies.
 Simple demographic info for the users (age, gender, occupation, zip)
 Genre information of movies

Let's load this data into Python. There are many files in the ml-100k.zip file which we can use. Let's load the three most important files to get a sense of the data. I also recommend you read the readme document, which gives a lot of information about the different files.

import pandas as pd

# pass in column names for each CSV and read them using pandas.

# Column names available in the readme file

#Reading users file:

u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols,

encoding='latin-1')

#Reading ratings file:

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols,


encoding='latin-1')

#Reading items file:

i_cols = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

items = pd.read_csv('ml-100k/u.item', sep='|', names=i_cols, encoding='latin-1')

Now let's take a peek at the content of each file to understand them better.

 Users

print users.shape

users.head()
This reconfirms that there are 943 users and that we have 5 features for each, namely their unique ID, age, gender, occupation and the zip code they are living in.

 Ratings

print ratings.shape

ratings.head()
This confirms that there are 100K ratings for different user and movie combinations. Also
notice that each rating has a timestamp associated with it.

 Items

print items.shape

items.head()

This dataset contains attributes of the 1682 movies. There are 24 columns, of which the last 19 specify the genres of a particular movie: a value of 1 denotes that the movie belongs to that genre, and 0 otherwise.

Now we have to divide the ratings dataset into train and test data for building models.
Luckily, GroupLens provides pre-divided data wherein the test data has 10 ratings for
each user, i.e. 9430 rows in total. Let's load that:

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_base = pd.read_csv('ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')

ratings_base.shape, ratings_test.shape

Output: ((90570, 4), (9430, 4))

Since we'll be using GraphLab, let's convert these into SFrames.

import graphlab

train_data = graphlab.SFrame(ratings_base)

test_data = graphlab.SFrame(ratings_test)

We can use this data for training and testing; with this we have gathered all the available
data. Note that here we have user behaviour as well as attributes of the users and
movies, so we can build content-based as well as collaborative filtering algorithms.

3. A Simple Popularity Model


Let's start by making a popularity-based model, i.e. one where all the users get the
same recommendations based on the most popular choices. We'll use the GraphLab
recommender function popularity_recommender for this.

We can train a recommendation model as:

popularity_model = graphlab.popularity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating')

Arguments:

 train_data: the SFrame which contains the required data


 user_id: the column name which represents each user ID
 item_id: the column name which represents each item to be recommended
 target: the column name representing scores/ratings given by the user

Let's use this model to make the top 5 recommendations for the first 5 users and see what
comes out:
#Get recommendations for first 5 users and print them

#users = range(1,6) specifies user ID of first 5 users

#k=5 specifies top 5 recommendations to be given

popularity_recomm = popularity_model.recommend(users=range(1,6),k=5)

popularity_recomm.print_rows(num_rows=25)
Did you notice something? The recommendations for all users are the same –
1500, 1201, 1189, 1122, 814 in the same order. This can be verified by checking the
movies with the highest mean rating in our ratings_base dataset:
ratings_base.groupby(by='movie_id')['rating'].mean().sort_values(ascending=False).head(20)

This confirms that all the recommended movies have an average rating of 5, i.e. all the
users who watched the movie gave a top rating. Thus we can see that our popularity
system works as expected. But is it good enough? We'll analyze it in detail later.
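As a small aside, it is also worth looking at how many ratings these top movies actually received, since a movie rated 5 by a single user can top a raw mean; a minimal sketch:

# mean rating alongside the number of ratings for each movie
movie_stats = ratings_base.groupby('movie_id')['rating'].agg(['mean', 'count'])
print(movie_stats.sort_values('mean', ascending=False).head(10))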

4. A Collaborative Filtering Model


Let's start by understanding the basics of a collaborative filtering algorithm. The core
idea works in 2 steps:

1. Find similar items by using a similarity metric


2. For a user, recommend the items most similar to the items (s)he already likes

To give you a high-level overview, this is done by making an item-item matrix in which
we keep a record of pairs of items that were rated together.
In this case, an item is a movie. Once we have the matrix, we use it to determine the
best recommendations for a user based on the movies he or she has already rated. Note that
there are a few more things to take care of in an actual implementation, which would require
deeper mathematical introspection; I'll skip that for now.

I would just like to mention that there are 3 types of item similarity metrics supported by
GraphLab (a small sketch computing all three follows the list). These are:

1. Jaccard Similarity:
o Similarity is based on the number of users who have rated both items A and B
divided by the number of users who have rated either A or B
o It is typically used where we don't have a numeric rating but just a boolean
value, like a product being bought or an ad being clicked
2. Cosine Similarity:
o Similarity is the cosine of the angle between the rating vectors of
items A and B
o The closer the vectors, the smaller the angle and the larger the cosine
3. Pearson Similarity
o Similarity is the Pearson correlation coefficient between the two vectors.
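For intuition, here is a minimal, hypothetical sketch of the three metrics for two item rating vectors (a 0 means the user has not rated that item); this is only illustrative and is not how GraphLab computes them internally.

import numpy as np

a = np.array([5, 3, 0, 4, 0])   # ratings of item A by five users
b = np.array([4, 0, 0, 5, 1])   # ratings of item B by the same users

# Jaccard: users who rated both items / users who rated either item
rated_a, rated_b = a > 0, b > 0
jaccard = (rated_a & rated_b).sum() / float((rated_a | rated_b).sum())

# Cosine: cosine of the angle between the two rating vectors
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pearson: correlation coefficient between the two rating vectors
pearson = np.corrcoef(a, b)[0, 1]

print(jaccard, cosine, pearson)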

Let's create a model based on item similarity as follows:

#Train Model

item_sim_model = graphlab.item_similarity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating', similarity_type='pearson')

#Make Recommendations:

item_sim_recomm = item_sim_model.recommend(users=range(1,6),k=5)

item_sim_recomm.print_rows(num_rows=25)
Here we can see that the recommendations are different for each user. So,
personalization exists. But how good is this model? We need some means of evaluating
a recommendation engine. Let's focus on that in the next section.
5. Evaluating Recommendation Engines
For evaluating recommendation engines, we can use the concept of precision-recall.
You must be familiar with this from classification problems, and the idea here is very similar. Let
me define the two in terms of recommendations.

 Recall:
o What ratio of the items that a user likes were actually recommended?
o If a user likes, say, 5 items and the recommendation engine decided to show 3 of
them, then the recall is 0.6
 Precision:
o Out of all the recommended items, how many did the user actually like?
o If 5 items were recommended to the user, out of which he liked, say, 4 of
them, then precision is 0.8

Now if we think about recall, how can we maximize it? If we simply recommend all the
items, they will definitely cover the items which the user likes. So we have 100% recall!
But think about precision for a second. If we recommend, say, 1000 items and the user likes
only 10 of them, then precision is 1%. This is really low. Our aim is to maximize
both precision and recall.

An ideal recommender system is one which recommends only the items which the user
likes. So in this case precision = recall = 1. This is an optimal recommender, and we should
try to get as close to it as possible.
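To make these definitions concrete, here is a tiny, hypothetical sketch of precision and recall for one user's top-k recommendations (GraphLab computes these for you in the comparison below):

def precision_recall(recommended, liked):
    # recommended: list of recommended item ids; liked: set of items the user likes
    hits = len(set(recommended) & set(liked))
    return hits / float(len(recommended)), hits / float(len(liked))

# the user likes 5 items, we recommend 5, and 3 of them are hits
print(precision_recall([10, 20, 30, 40, 50], {10, 30, 50, 60, 70}))
# -> (0.6, 0.6)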

Let's compare both the models we have built so far based on their precision-recall
characteristics:

model_performance = graphlab.compare(test_data, [popularity_model, item_sim_model])

graphlab.show_comparison(model_performance, [popularity_model, item_sim_model])
Here we can make 2 very quick observations:

1. The item similarity model is definitely better than the popularity model (by at least
10x)
2. On an absolute level, even the item similarity model performs poorly. It is far
from being a useful recommendation system.

There is a lot of scope for improvement here, but I leave it up to you to figure out how to
improve this further. I would like to give a couple of tips:

1. Try leveraging the additional context information which we have


2. Explore more sophisticated algorithms like matrix factorization

In the end, I would like to mention that along with GraphLab, you can also use some
other open source Python packages like the following:

 Crab.
 Surprise
 Python Recsys
 MRec

End Notes
In this article, we traversed through the process of making a basic recommendation
engine in Python using GraphLab. We started by understanding the fundamentals of
recommendations. Then we went on to load the MovieLens 100K dataset for the
purpose of experimentation.

Subsequently we made a first model as a simple popularity model in which the most
popular movies were recommended for each user. Since this lacked personalization, we
made another model based on collaborative filtering and observed the impact of
personalization.

Finally, we discussed precision-recall as evaluation metrics for recommendation systems,
and on comparison found the collaborative filtering model to be more than 10x better
than the popularity model.
Getting Started with Audio Data
Analysis using Deep Learning (with
case study)
Introduction
When you get started with data science, you start simple. You go through simple
projects like the Loan Prediction problem or Big Mart Sales Prediction. These problems have
structured data arranged neatly in a tabular format. In other words, you are spoon-fed
the hardest part of the data science pipeline.

The datasets in real life are much more complex.

You first have to understand it, collect it from various sources and arrange it in a format
which is ready for processing. This is even more difficult when the data is in an
unstructured format such as images or audio. This is because you have to
represent image/audio data in a standard way for it to be useful for analysis.

The abundance of unstructured data


Interestingly, unstructured data represents a huge under-exploited opportunity. It is closer
to how we communicate and interact as humans. It also contains a lot of useful and
powerful information. For example, when a person speaks, you get not only what he or she
says but also the emotions of the person from the voice.

Also, the body language of the person can tell you many more things, because actions
speak louder than words! So in short, unstructured data is complex, but processing it can
reap big rewards.

In this article, I intend to give an overview of audio / voice processing with a case study
so that you get a hands-on introduction to solving audio processing problems.

Let’s get on with it!

Table of Contents
 What do you mean by Audio data?
o Applications of Audio Processing
 Data Handling in Audio domain
 Let’s solve the UrbanSound challenge!
 Intermission: Our first submission
 Let’s solve the challenge! Part 2: Building better models
 Future Steps to explore

What do you mean by Audio data?


Directly or indirectly, you are always in contact with audio. Your brain is continuously
processing and understanding audio data and giving you information about the
environment. A simple example is the conversations you have with people every
day, where your speech is discerned by the other person to carry the discussion forward. Even
when you think you are in a quiet environment, you still catch much more subtle
sounds, like the rustling of leaves or the splatter of rain. This is the extent of your
connection with audio.

So can you somehow catch this audio floating all around you and do something
constructive with it? Yes, of course! There are devices which help you capture these sounds
and represent them in a computer-readable format. Examples of these formats are:

 wav (Waveform Audio File) format


 mp3 (MPEG-1 Audio Layer 3) format
 WMA (Windows Media Audio) format

If you think about what audio looks like, it is nothing but a wave-like form of
data, where the amplitude of the audio changes with respect to time. This can be pictorially
represented as follows.

Applications of Audio Processing


We discussed that audio data can be useful for analysis. But what are the
potential applications of audio processing? Here I list a few of them:

 Indexing music collections according to their audio features.


 Recommending music for radio channels
 Similarity search for audio files (aka Shazam)
 Speech processing and synthesis – generating artificial voice for conversational
agents

Here’s an exercise for you; can you think of an application of audio processing that can
potentially help thousands of lives?

Data Handling in Audio domain


As with all unstructured data formats, audio data has a couple of preprocessing steps
which have to be followed before it can be presented for analysis. We will cover these in detail
in a later article; here we will get an intuition on why this is done.

The first step is to load the data into a machine-understandable format. For this,
we simply take values of the signal at fixed time steps. For example, in a 2 second audio
file, we could extract a value every half second. This is called sampling of the audio data, and the
rate at which it is sampled is called the sampling rate.

Another way of representing audio data is to convert it into a different domain of
representation, namely the frequency domain. When we sample audio data in the time domain, we
need many more data points to represent the whole signal, and the sampling rate
should be as high as possible.

On the other hand, if we represent audio data in the frequency domain, much less
computational space is required. To get an intuition, take a look at the image below.
Here, we separate one audio signal into 3 different pure signals, which can now be
represented as three unique values in the frequency domain.
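A minimal sketch of this idea with numpy; the signal, tone frequencies and sampling rate below are made up purely for illustration:

import numpy as np

sr = 1000                                 # assumed sampling rate in Hz
t = np.arange(0, 1, 1.0 / sr)             # one second of time steps
# one "audio" signal composed of three pure tones at 50, 120 and 300 Hz
signal = (np.sin(2 * np.pi * 50 * t)
          + 0.5 * np.sin(2 * np.pi * 120 * t)
          + 0.2 * np.sin(2 * np.pi * 300 * t))

spectrum = np.abs(np.fft.rfft(signal))    # magnitude of each frequency component
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
# the three dominant peaks correspond to the three pure signals
print(freqs[np.argsort(spectrum)[-3:]])   # roughly [300., 120., 50.]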

There are a few more ways in which audio data can be represented, for example using
MFCCs (Mel-Frequency Cepstral Coefficients; we will cover these in a later article). These are
nothing but different ways to represent the data.

The next step is to extract features from these audio representations, so that our
algorithm can work on them and perform the task it is designed for. Here's a
visual representation of the categories of audio features that can be extracted.
After extracting these features, they are sent to the machine learning model for further
analysis.
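For example, a few commonly used features can be pulled out with librosa; the sketch below is illustrative and the file path is a placeholder:

import numpy as np
import librosa

y, sr = librosa.load('some_audio_file.wav')               # placeholder path

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)       # cepstral features
zcr = librosa.feature.zero_crossing_rate(y)               # time-domain feature
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # frequency-domain feature

# summarise each feature over time, e.g. by its mean, to get a fixed-length vector
feature_vector = np.hstack([mfccs.mean(axis=1), zcr.mean(), centroid.mean()])
print(feature_vector.shape)   # (42,)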

Let’s solve the UrbanSound challenge!


Let us get a better practical overview with a real-life project, the Urban Sound challenge.
This practice problem is meant to introduce you to audio processing in a typical
classification scenario.

The dataset contains 8732 sound excerpts (<=4s) of urban sounds from 10 classes,
namely:

 air conditioner,
 car horn,
 children playing,
 dog bark,
 drilling,
 engine idling,
 gun shot,
 jackhammer,
 siren, and
 street music

Here's a sound excerpt from the dataset. Can you guess which class it belongs to?


To play this in a Jupyter notebook, you can simply follow along with the code below.

import IPython.display as ipd

ipd.Audio('../data/Train/2022.wav')

Now let us load this audio into our notebook as a numpy array. For this, we will use the librosa
library in Python. To install librosa, just type this in the command line:

pip install librosa

Now we can run the following code to load the data

data, sampling_rate = librosa.load('../data/Train/2022.wav')

When you load the data, it gives you two objects: a numpy array of the audio file and the
corresponding sampling rate at which it was extracted. Now to represent this as a
waveform (which it originally is), use the following code:

%pylab inline

import os
import pandas as pd
import librosa
import librosa.display   # needed for librosa.display.waveplot below
import glob

plt.figure(figsize=(12, 4))
librosa.display.waveplot(data, sr=sampling_rate)
The output comes out as follows

Let us now visually inspect our data and see if we can find patterns in the data

Class: jackhammer

Class: drilling
Class: dog_barking

We can see that it may be difficult to differentiate between jackhammer and drilling, but it
is still easy to discern between dog_barking and drilling. To see more such examples,
you can use this code

import random

# assumes `train` is the training metadata (a CSV with ID and Class columns)
# and `data_dir` points to the folder that contains the Train/ directory
i = random.choice(train.index)
audio_name = train.ID[i]
path = os.path.join(data_dir, 'Train', str(audio_name) + '.wav')
print('Class: ', train.Class[i])

x, sr = librosa.load(path)
plt.figure(figsize=(12, 4))
librosa.display.waveplot(x, sr=sr)
Intermission: Our first submission
We will take an approach similar to the one we used for the Age Detection problem: look at the
class distribution and simply predict the most frequent class for all test cases.

Let us see the distributions for this problem.

train.Class.value_counts(normalize=True)   # shown as fractions of the total

Out[10]:
jackhammer          0.122907
engine_idling       0.114811
siren               0.111684
dog_bark            0.110396
air_conditioner     0.110396
children_playing    0.110396
street_music        0.110396
drilling            0.110396
car_horn            0.056302
gun_shot            0.042318

We see that the jackhammer class has more examples than any other class. So let us create
our first submission with this idea.

test = pd.read_csv('../data/test.csv')
test['Class'] = 'jackhammer'
test.to_csv('sub01.csv', index=False)

This seems like a good benchmark for any challenge, but for this problem it is a bit unfair,
because the dataset is not very imbalanced.

Let's solve the challenge! Part 2: Building better models
Now let us see how we can leverage the concepts we learned above. We will follow
these steps to solve the problem:

Step 1: Load audio files


Step 2: Extract features from audio
Step 3: Convert the data to pass it in our deep learning model
Step 4: Run a deep learning model and get results

Below is the code for how I implemented these steps.

Step 1 and 2 combined: Load audio files and extract features

import numpy as np   # used below to average the MFCCs over time

def parser(row):
    # function to load files and extract features
    file_name = os.path.join(os.path.abspath(data_dir), 'Train', str(row.ID) + '.wav')
    # handle exception to check if there isn't a file which is corrupted
    try:
        # here kaiser_fast is a technique used for faster extraction
        X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        # we extract the mfcc feature from the data
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    except Exception as e:
        print("Error encountered while parsing file: ", file_name)
        return None, None
    feature = mfccs
    label = row.Class
    return [feature, label]

temp = train.apply(parser, axis=1)
# collect the [feature, label] pairs into a DataFrame with named columns
temp = pd.DataFrame(temp.tolist(), columns=['feature', 'label'])

Step 3: Convert the data to pass it in our deep learning model

from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils   # used below for one-hot encoding the labels

X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())

lb = LabelEncoder()
y = np_utils.to_categorical(lb.fit_transform(y))

Step 4: Run a deep learning model and get results

import numpy as np

from keras.models import Sequential

from keras.layers import Dense, Dropout, Activation, Flatten

from keras.layers import Convolution2D, MaxPooling2D

from keras.optimizers import Adam

from keras.utils import np_utils

from sklearn import metrics

num_labels = y.shape[1]

filter_size = 2

# build model

model = Sequential()

model.add(Dense(256, input_shape=(40,)))

model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))

model.add(Activation('relu'))

model.add(Dropout(0.5))

model.add(Dense(num_labels))

model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')

Now let us train our model

model.fit(X, y, batch_size=32, epochs=5, validation_data=(val_x, val_y))
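Note that val_x and val_y are not created anywhere in the code above. A minimal sketch of one way to produce them with scikit-learn is below; the 80/20 hold-out split is an assumption, though it matches the 5435 / 1359 sample counts in the log that follows. X and y are reassigned to the training portion so that the model.fit call above works as written.

from sklearn.model_selection import train_test_split

# hold out 20% of the examples for validation (assumed split)
X, val_x, y, val_y = train_test_split(X, y, test_size=0.2, random_state=42)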

This is the result I got on training for 5 epochs

Train on 5435 samples, validate on 1359 samples
Epoch 1/10
5435/5435 [==============================] - 2s - loss: 12.0145 - acc: 0.1799 - val_loss: 8.3553 - val_acc: 0.2958
Epoch 2/10
5435/5435 [==============================] - 0s - loss: 7.6847 - acc: 0.2925 - val_loss: 2.1265 - val_acc: 0.5026
Epoch 3/10
5435/5435 [==============================] - 0s - loss: 2.5338 - acc: 0.3553 - val_loss: 1.7296 - val_acc: 0.5033
Epoch 4/10
5435/5435 [==============================] - 0s - loss: 1.8101 - acc: 0.4039 - val_loss: 1.4127 - val_acc: 0.6144
Epoch 5/10
5435/5435 [==============================] - 0s - loss: 1.5522 - acc: 0.4822 - val_loss: 1.2489 - val_acc: 0.6637

Seems OK, but the score can obviously be increased. (PS: I could get an accuracy of 80%
on my validation dataset.) Now it's your turn: can you improve on this score? If you do,
let me know in the comments below!

Future steps to explore


Now that we have seen a simple application, we can ideate a few more methods which can
help us improve our score:

1. We applied a simple neural network model to the problem. Our immediate next
step should be to understand where the model fails and why. By doing this, we
want to conceptualise our understanding of the algorithm's failures so that the
next time we build a model, it does not make the same mistakes.
2. We can build more powerful models than our “better model”, such as
convolutional neural networks or recurrent neural networks. These models have
been proven to solve such problems with greater ease.
3. We touched upon the concept of data augmentation, but we did not apply it here.
You could try it to see if it works for this problem (see the sketch after this list).
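As a starting point for the third idea, here is a minimal sketch of simple audio augmentations with librosa; the stretch rate, pitch steps and noise level are illustrative choices, not tuned values:

import numpy as np
import librosa

def augment(y, sr):
    # return a few perturbed copies of a clip; each keeps the original label
    stretched = librosa.effects.time_stretch(y, rate=1.2)       # play 20% faster
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # shift up 2 semitones
    noisy = y + 0.005 * np.random.randn(len(y))                 # add light noise
    return [stretched, shifted, noisy]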

End Notes
In this article, I have given a brief overview of audio processing with a case study on the
UrbanSound challenge. I have also shown the steps you perform when dealing with
audio data in Python with the librosa package. With this “shastra” in your hand, I hope you
can try out your own algorithms on the Urban Sound challenge, or try solving your own audio
problems from daily life.
