ML Notes
Table of Contents
When do we use KNN algorithm?
How does the KNN algorithm work?
How do we choose the factor K?
Breaking it Down – Pseudo Code of KNN
Implementation in Python from scratch
Comparing our model with scikit-learn
2. Calculation time
3. Predictive Power
You intend to find out the class of the blue star (BS). BS can be either RC or GS and
nothing else. The “K” in the KNN algorithm is the number of nearest neighbors we wish to
take a vote from. Let’s say K = 3. Hence, we will now draw a circle with BS as its center,
just big enough to enclose only three data points on the plane. Refer to the following
diagram for more details:
The three closest points to BS are all RC. Hence, with a good confidence level we can say
that BS should belong to the class RC. Here, the choice became very obvious as all
three votes from the closest neighbors went to RC. The choice of the parameter K is very
crucial in this algorithm. Next we will understand which factors should be considered to
settle on the best K.
As you can see, the error rate at K=1 is always zero for the training sample. This is
because the closest point to any training data point is itself. Hence the prediction is
always accurate with K=1. If the validation error curve had been similar, our choice
of K would have been 1. Following is the validation error curve with varying value of K:
This makes the story clearer. At K=1, we were overfitting the boundaries. Hence, the
validation error rate initially decreases and reaches a minimum. After the minimum, it
increases with increasing K. To get the optimal value of K, you can split the initial
dataset into training and validation sets. Now plot the validation error curve to get the
optimal value of K. This value of K should be used for all predictions.
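The procedure described above can be sketched in a few lines. This is a minimal illustration on made-up two-class data, not the data behind the curves in the article; the helper name knn_predict, the blob parameters and the 140/60 split are all assumptions chosen for the example.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k):
    """Predict a class for each query row by majority vote of its k nearest training rows."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    d2 = ((query_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1, kind="stable")[:, :k]
    votes = train_y[nearest]                       # labels of the k neighbours
    return np.array([np.bincount(row).argmax() for row in votes])

# Hypothetical two-class data: two overlapping Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Split the initial dataset into training and validation rows
order = rng.permutation(len(X))
train_idx, valid_idx = order[:140], order[140:]
X_train, y_train = X[train_idx], y[train_idx]
X_valid, y_valid = X[valid_idx], y[valid_idx]

train_error, valid_error = {}, {}
for k in range(1, 32, 2):                          # odd K avoids two-class ties
    train_error[k] = np.mean(knn_predict(X_train, y_train, X_train, k) != y_train)
    valid_error[k] = np.mean(knn_predict(X_train, y_train, X_valid, k) != y_valid)

best_k = min(valid_error, key=valid_error.get)
print("training error at K=1:", train_error[1])
print("K with lowest validation error:", best_k)
```

The training error at K=1 comes out as zero because every training point is its own nearest neighbor, which is why the training curve alone cannot be used to pick K.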
# Importing libraries
import pandas as pd
import numpy as np
import math
import operator
# Importing data
data = pd.read_csv("iris.csv")
data.head()
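# Defining a function which calculates euclidean distance between two data points
def euclideanDistance(data1, data2, length):
    distance = 0
    for x in range(length):
        # data1 is the one-row test DataFrame, data2 is one row of the training set
        distance += np.square(data1[x] - data2.iloc[x])
    return np.sqrt(distance)

# Defining our KNN model
def knn(trainingSet, testInstance, k):
    distances = {}
    length = testInstance.shape[1]
    # Calculating euclidean distance between each row of training data and test data
    for x in range(len(trainingSet)):
        #### Start of STEP 3.1
        dist = euclideanDistance(testInstance, trainingSet.iloc[x], length)
        distances[x] = dist[0]
    # Sorting the training rows by distance, nearest first
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
    neighbors = []
    # Extracting the k nearest neighbors
    for x in range(k):
        neighbors.append(sorted_d[x][0])
    # Counting the class votes among the neighbors (class label is the last column)
    classVotes = {}
    for x in range(len(neighbors)):
        response = trainingSet.iloc[neighbors[x]].iloc[-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # Returning the class with the most votes, plus the neighbor row indices
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return(sortedVotes[0][0], neighbors)

# testSet (one row of feature values, as a list of lists) is assumed to be defined already
test = pd.DataFrame(testSet)
# Setting number of neighbors = 1
k = 1
# Running the KNN model
result, neigh = knn(data, test, k)
# Predicted class
print(result)
-> Iris-virginica
# Nearest neighbor
print(neigh)
-> [141]
Now we will try to alter the k values, and see how the prediction changes.
k = 3
result, neigh = knn(data, test, k)
# Predicted class
print(result)
# 3 nearest neighbors
print(neigh)
k = 5
result, neigh = knn(data, test, k)
# Predicted class
print(result)
# 5 nearest neighbors
print(neigh)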
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(data.iloc[:,0:4], data['Name'])
# Predicted class
print(neigh.predict(test))
-> ['Iris-virginica']
# 3 nearest neighbors
print(neigh.kneighbors(test)[1])
We can see that both the models predicted the same class (‘Iris-virginica’) and the same
nearest neighbors ( [141 139 120] ). Hence we can conclude that our model runs as
expected.
End Notes
The KNN algorithm is one of the simplest classification algorithms. Even with such
simplicity, it can give highly competitive results. The KNN algorithm can also be used for
regression problems. The only difference from the discussed methodology is that the
prediction is the average of the nearest neighbors’ target values rather than a vote among
them. KNN can be coded in a single line in R. I am yet to explore how we can use the KNN
algorithm in SAS.
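The regression variant mentioned above can be sketched as follows. The function name knn_regress and the four training rows are made up for illustration; the sketch only shows the "average instead of vote" idea and is not a tuned implementation.

```python
import numpy as np

def knn_regress(train_X, train_y, query, k):
    """Predict a numeric target as the mean target of the k nearest training rows."""
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances, kind="stable")[:k]
    return train_y[nearest].mean()

# Hypothetical one-feature training data
train_X = np.array([[1.0], [2.0], [3.0], [10.0]])
train_y = np.array([10.0, 20.0, 30.0, 100.0])

# The three nearest rows to x = 2.1 are x = 2, 3 and 1, so the prediction is
# the mean of their targets (20, 30 and 10)
prediction = knn_regress(train_X, train_y, np.array([2.1]), k=3)
print(prediction)
```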
What is Principal Component Analysis ?
In simple words, principal component analysis is a method of extracting important
variables (in form of components) from a large set of variables available in a data set. It
extracts low dimensional set of features from a high dimensional data set with a motive
to capture as much information as possible. With fewer variables, visualization also
becomes much more meaningful. PCA is more useful when dealing with 3 or higher
dimensional data.
Let’s say we have a data set of dimension 300 (n) × 50 (p). n represents the number of
observations and p represents the number of predictors. Since we have a large p = 50,
there can be p(p-1)/2 scatter plots, i.e. 1225 plots, to analyze the
variable relationships. Wouldn’t it be a tedious job to perform exploratory analysis on this
data?
In this case, it would be a lucid approach to select a small subset of predictors (far
fewer than 50) which captures as much information as possible, and then plot the
observations in the resultant low dimensional space.
The image below shows the transformation of high dimensional data (3 dimensions) to
low dimensional data (2 dimensions) using PCA. Not to forget, each resultant dimension
is a linear combination of the p features.
Source: nlpca
What are principal components ?
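A principal component is a normalized linear combination of the original predictors in a
data set. In the image above, PC1 and PC2 are the principal components. Let’s say we
have a set of predictors X_1, X_2, ..., X_p. The first principal component can be written as:
Z_1 = φ_11 X_1 + φ_21 X_2 + ... + φ_p1 X_p
where φ_11, φ_21, ..., φ_p1 are the loadings of the first principal component, normalized
so that the sum of their squares equals 1.
Therefore, the first principal component is the normalized linear combination of the
original predictors which captures the maximum variance in the data set.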
The first principal component results in a line which is closest to the data, i.e. it
minimizes the sum of squared distances between the data points and the line.
If two components are uncorrelated, their directions should be orthogonal (image
below). This image is based on simulated data with 2 predictors. Notice the direction of
the components; as expected, they are orthogonal. This suggests the correlation between
these components is zero.
All succeeding principal components follow a similar concept, i.e. they capture the
remaining variation without being correlated with the previous components. In general,
for n × p dimensional data, min(n-1, p) principal components can be constructed.
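These properties (unit-length loading vectors, orthogonal directions, uncorrelated components, decreasing variance) can be checked numerically. Below is a minimal NumPy sketch on simulated data with 2 predictors; it uses the SVD of the centered data rather than any particular library's PCA routine, and the data-generating numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 200 observations of 2 correlated predictors
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Center the predictors, then take the SVD; rows of Vt are the loading vectors
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                  # column j holds the loadings of component j+1
scores = Xc @ loadings           # principal component scores

explained_variance = S ** 2 / (len(X) - 1)
print("loading vector norms:", np.linalg.norm(loadings, axis=0))
print("dot product of PC1 and PC2 directions:", loadings[:, 0] @ loadings[:, 1])
print("correlation between PC1 and PC2 scores:", np.corrcoef(scores.T)[0, 1])
print("variance explained by each component:", explained_variance)
```

Up to floating point error, the norms are 1, the dot product and the correlation are 0, and the first component carries the larger variance.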
The directions of these components are identified in an unsupervised way i.e. the
response variable(Y) is not used to determine the component direction. Therefore, it
is an unsupervised approach.
Note: Partial least square (PLS) is a supervised alternative to PCA. PLS assigns higher
weight to variables which are strongly related to response variable to determine principal
components.
Performing PCA on un-normalized variables will lead to insanely large loadings for
variables with high variance. In turn, this will lead to dependence of a principal
component on the variable with high variance. This is undesirable.
As shown in image below, PCA was run on a data set twice (with unscaled and scaled
predictors). This data set has ~40 variables. You can see, first principal component is
dominated by a variable Item_MRP. And, second principal component is dominated by a
variable Item_Weight. This domination prevails due to high value of variance associated
with a variable. When the variables are scaled, we get a much better representation of
variables in 2D space.
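The effect of scaling can be reproduced on a small synthetic example. The three variables below are made up (they only borrow the flavour of a price, a weight and a visibility column) and the helper first_loading is a hypothetical name; the point is just that the variable with the largest variance dominates PC1 until the predictors are standardized.

```python
import numpy as np

def first_loading(X):
    """Return the PC1 loading vector of X (columns are centered first)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)
# Hypothetical, independent predictors on very different scales
price = rng.normal(140.0, 60.0, 500)       # large variance
weight = rng.normal(12.0, 4.0, 500)        # moderate variance
visibility = rng.normal(0.07, 0.05, 500)   # tiny variance
X = np.column_stack([price, weight, visibility])

unscaled = first_loading(X)
scaled = first_loading((X - X.mean(axis=0)) / X.std(axis=0))

print("PC1 loadings, unscaled:", np.round(unscaled, 3))
print("PC1 loadings, scaled:  ", np.round(scaled, 3))
```

On the unscaled data the first loading vector points almost entirely along the high-variance column; after scaling, no column wins merely because of its units.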
Implement PCA in R & Python (with interpretation)
How many principal components should we choose? I could dive deep into theory, but it
would be better to answer this question practically.
For this demonstration, I’ll be using the data set from Big Mart Prediction Challenge III.
Remember, PCA can be applied only on numerical data. Therefore, if the data has
categorical variables they must be converted to numerical. Also, make sure you have
done the basic data cleaning prior to implementing this technique. Let’s quickly finish
with initial data loading and cleaning steps:
#directory path
> path <- ".../Data/Big_Mart_Sales"
#add a column
> test$Item_Outlet_Sales <- 1
Till here, we’ve imputed missing values. Now we are left with removing the dependent
(response) variable and other identifier variables( if any). As we said above, we are
practicing an unsupervised learning technique, hence response variable must be
removed.
Let’s check the available variables ( a.k.a predictors) in the data set.
Since PCA works on numeric variables, let’s see if we have any variable other than
numeric.
Sadly, 6 out of 9 variables are categorical in nature. We have some additional work to do
now. We’ll convert these categorical variables into numeric using one hot encoding.
#load library
> library(dummies)
And, we now have all the numerical values. Let’s divide the data into test and train.
The base R function prcomp() is used to perform PCA. By default, it centers the variable
to have mean equals to zero. With parameter scale. = T, we normalize the variables to
have standard deviation equals to 1.
1. center and scale refers to respective mean and standard deviation of the variables
that are used for normalization prior to implementing PCA
2. The rotation measure provides the principal component loading. Each column of
rotation matrix contains the principal component loading vector. This is the most
important measure we should be interested in.
> prin_comp$rotation
This returns 44 principal component loadings. Is that correct? Absolutely. In a data set,
the maximum number of principal component loadings is min(n-1, p). Let’s
look at the first 4 principal components and first 5 rows.
> prin_comp$rotation[1:5,1:4]
                                  PC1          PC2          PC3          PC4
Item_Weight              0.0054429225 -0.001285666  0.011246194  0.011887106
Item_Fat_ContentLF      -0.0021983314  0.003768557 -0.009790094 -0.016789483
Item_Fat_Contentlow fat -0.0019042710  0.001866905 -0.003066415 -0.018396143
Item_Fat_ContentLow Fat  0.0027936467 -0.002234328  0.028309811  0.056822747
Item_Fat_Contentreg      0.0002936319  0.001120931  0.009033254 -0.001026615
3. In order to compute the principal component score vector, we don’t need to multiply
the loading with data. Rather, the matrix x has the principal component score vectors in
a 8523 × 44 dimension.
> dim(prin_comp$x)
[1] 8523 44
The parameter scale = 0 ensures that arrows are scaled to represent the loadings. To
make inference from image above, focus on the extreme ends (top, bottom, left, right) of
this graph.
#compute variance
> pr_var <- std_dev^2
We aim to find the components which explain the maximum variance. This is because,
we want to retain as much information as possible using these components. So, higher
is the explained variance, higher will be the information contained in those components.
This shows that first principal component explains 10.3% variance. Second component
explains 7.3% variance. Third component explains 6.2% variance and so on. So, how do
we decide how many components should we select for modeling stage ?
The answer to this question is provided by a scree plot. A scree plot is used to assess
which components or factors explain most of the variability in the data. It represents
values in descending order.
#scree plot
> plot(prop_varex, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")
The plot above shows that ~30 components explain around 98.4% of the variance in the data
set. In other words, using PCA we have reduced 44 predictors to 30 without
compromising on explained variance. This is the power of PCA. Let’s do a confirmation
check by plotting a cumulative variance plot. This will give us a clear picture of the
number of components.
1. We should not combine the train and test set to obtain PCA components of whole
data at once. Because, this would violate the entire assumption of
generalization since test data would get ‘leaked’ into the training set. In other
words, the test data set would no longer remain ‘unseen’. Eventually, this will
hammer down the generalization capability of the model.
2. We should not perform PCA on test and train data sets separately. Because, the
resultant vectors from train and test PCAs will have different directions ( due to
unequal variance). Due to this, we’ll end up comparing data registered on
different axes. Therefore, the resulting vectors from train and test data should
have same axes.
We should do exactly the same transformation to the test set as we did to training set,
including the center and scaling feature. Let’s do it in R:
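The same idea can be sketched in Python with NumPy. This is an illustrative stand-in on random data, not the article's Big Mart code: the array shapes, the column scales and the choice of 3 retained components are assumptions. The center, scale and rotation are learned from the training set only and then reused unchanged on the test set.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical numeric data: 100 training rows and 40 test rows, 5 predictors
column_scales = np.array([1.0, 5.0, 10.0, 0.5, 2.0])
train = rng.normal(size=(100, 5)) * column_scales
test = rng.normal(size=(40, 5)) * column_scales

# Learn centering, scaling and loadings from the training set only
center = train.mean(axis=0)
scale = train.std(axis=0, ddof=1)
train_scaled = (train - center) / scale
_, _, Vt = np.linalg.svd(train_scaled, full_matrices=False)
rotation = Vt.T[:, :3]                   # keep the first 3 components

# Apply exactly the same transformation to both sets
train_scores = train_scaled @ rotation
test_scores = ((test - center) / scale) @ rotation

print(train_scores.shape, test_scores.shape)
```

Because the test rows are projected onto the training axes, both score matrices live in the same coordinate system and a model fitted on train_scores can be applied to test_scores.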
That’s the complete modeling process after PCA extraction. I’m sure you wouldn’t be
happy with your leaderboard rank after you upload the solution. Try using random forest!
For Python Users: To implement PCA in Python, simply import PCA from the sklearn library.
The interpretation remains the same as explained for R users above. Of course, the result
is the same as that derived using R. The data set used for Python is a cleaned version where
missing values have been imputed, and categorical variables are converted into numeric.
The modeling process remains the same as explained for R users above.
import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
%matplotlib inline
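# X is assumed to hold the cleaned, numeric data set as a NumPy array, scaled with scale(X)
pca = PCA(n_components=44)
pca.fit(X)
# Cumulative percentage of variance explained by the components
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4) * 100)
print(var1)
[ 10.37 17.68 23.92 29.7 34.7 39.28 43.67 46.53 49.27
51.92 54.48 57.04 59.59 62.1 64.59 67.08 69.55 72.
74.39 76.76 79.1 81.44 83.77 86.06 88.33 90.59 92.7
94.76 96.78 98.44 100.01 100.01 100.01 100.01 100.01 100.01
100.01 100.01 100.01 100.01 100.01 100.01 100.01 100.01]
plt.plot(var1)
# Looking at the cumulative variance above, keep the first 30 components
pca = PCA(n_components=30)
X1 = pca.fit_transform(X)
print(X1)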
Points to Remember
1. PCA is used to overcome features redundancy in a data set.
2. These features are low dimensional in nature.
3. These features a.k.a components are a resultant of normalized linear
combination of original predictor variables.
4. These components aim to capture as much information as possible with high
explained variance.
5. The first component has the highest variance followed by second, third and so on.
6. The components must be uncorrelated (remember orthogonal direction ? ). See
above.
7. Normalizing data becomes extremely important when the predictors are
measured in different units.
8. PCA works best on data set having 3 or higher dimensions. Because, with higher
dimensions, it becomes increasingly difficult to make interpretations from the
resultant cloud of data.
9. PCA is applied on a data set with numeric variables.
10. PCA is a tool which helps to produce better visualizations of high dimensional
data.
End Notes
This brings me to the end of this tutorial. Without delving deep into mathematics, I’ve
tried to make you familiar with the most important concepts required to use this technique.
It’s simple but needs special attention while deciding the number of
components. Practically, we should strive to retain only the first few k components.
The idea behind PCA is to construct a small number of principal components (far fewer than
the p original predictors) which satisfactorily explain most of the variability in the
data, as well as the relationship with the response variable.
Think of machine learning algorithms as an armory packed with axes, swords, blades,
bows, daggers etc. You have various tools, but you ought to learn to use them at the right
time. As an analogy, think of ‘Regression’ as a sword capable of slicing and dicing
data efficiently, but incapable of dealing with highly complex data. On the
contrary, ‘Support Vector Machines’ is like a sharp knife – it works on smaller datasets,
but on them, it can be much stronger and more powerful in building models.
Table of Contents
1. What is Support Vector Machine?
2. How does it work?
3. How to implement SVM in Python and R?
4. How to tune Parameters of SVM?
5. Pros and Cons associated with SVM
Support Vectors are simply the co-ordinates of individual observation. Support Vector
Machine is a frontier which best segregates the two classes (hyper-plane/ line).
You can look at definition of support vectors and a few examples of its working here.
How does it work?
Above, we got accustomed to the process of segregating the two classes with a hyper-
plane. Now the burning question is “How can we identify the right hyper-plane?”. Don’t
worry, it’s not as hard as you think!
Let’s understand:
In SVM, it is easy to have a linear hyper-plane between these two classes. But
another burning question arises: do we need to add this feature
manually to have a hyper-plane? No, SVM has a technique called the kernel trick.
Kernels are functions which take a low dimensional input space and transform it to
a higher dimensional space, i.e. they convert a non-separable problem into a separable
problem. They are mostly useful in non-linear
separation problems. Simply put, a kernel does some extremely complex data
transformations, then finds out the process to separate the data based on the
labels or outputs you’ve defined.
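A small numerical sketch of both ideas follows. It assumes, purely for illustration, that the hand-made extra feature is the squared distance from the origin, z = x^2 + y^2, and that the two classes sit inside and outside a circle; the data, the threshold 2.5 and the function phi are all made up. The second half shows the kernel trick on a degree-2 polynomial kernel, where the kernel value equals the inner product in the higher dimensional space without ever building that space.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: class 0 inside radius 1, class 1 in a ring between radius 2 and 3
angles = rng.uniform(0.0, 2.0 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
x = radii * np.cos(angles)
y = radii * np.sin(angles)
labels = np.array([0] * 100 + [1] * 100)

# Hand-made extra feature (an assumption for this sketch): squared distance from the origin
z = x ** 2 + y ** 2

# In (x, y, z) space the flat plane z = 2.5 separates the classes
predicted = (z > 2.5).astype(int)
accuracy = np.mean(predicted == labels)
print("accuracy of the plane z = 2.5:", accuracy)

def phi(p):
    """Explicit degree-2 feature map for a 2-D point."""
    return np.array([p[0] ** 2, np.sqrt(2.0) * p[0] * p[1], p[1] ** 2])

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
explicit = phi(a) @ phi(b)     # inner product computed in the 3-D feature space
implicit = (a @ b) ** 2        # same value from the kernel, using only 2-D inputs
print(explicit, implicit)
```

The plane in the lifted space corresponds to a circle in the original (x, y) plane, and the kernel gives the SVM access to such lifted spaces without computing the new features by hand.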
When we look at the hyper-plane in original input space it looks like a circle:
Now, let’s look at the methods to apply SVM algorithm in a data science challenge.
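#Import Library
from sklearn import svm
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create SVM classification object
model = svm.SVC(kernel='linear', C=1, gamma=1)
# There are various options associated with it, like changing kernel, gamma and C value. Will discuss more
# about it in next section. Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)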
The e1071 package in R is used to create Support Vector Machines with ease. It has
helper functions as well as code for the Naive Bayes Classifier. The creation of a support
vector machine in R and Python follow similar approaches, let’s take a look now at the
following code:
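#Import Library
library(e1071) # contains svm()
# Train and Test are assumed to be data frames that are already loaded
# there are various options associated with SVM training; like changing kernel, gamma and cost
# create model
model <- svm(Target~Predictor1+Predictor2+Predictor3,data=Train,kernel='linear',gamma=0.2,cost=100)
#Predict Output
preds <- predict(model,Test)
table(preds)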
e, max_iter=-1, random_state=None)
I am going to discuss some important parameters that have a higher impact on model
performance: “kernel”, “gamma” and “C”.
kernel: We have already discussed it. Here, we have various options available
for the kernel, like “linear”, “rbf”, “poly” and others (the default value is “rbf”). Here “rbf” and
“poly” are useful for a non-linear hyper-plane. Let’s look at an example where we’ve used a
linear kernel on two features of the iris data set to classify their class.
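import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data[:, :2]  # use only the first two features (sepal length, sepal width)
y = iris.target
# we create an instance of SVM and fit our data. We do not scale our data here.
svc = svm.SVC(kernel='linear', C=1, gamma='auto').fit(X, y)
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min)/100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.show()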
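Change the kernel type to rbf in the line below and look at the impact. (Recent scikit-learn
versions reject gamma=0; gamma='auto' gives the 1/n_features value that gamma=0 used to mean.)
svc = svm.SVC(kernel='rbf', C=1, gamma='auto').fit(X, y)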
I would suggest you go for a linear kernel if you have a large number of features (>1000)
because it is more likely that the data is linearly separable in a high dimensional space.
Also, you can use RBF, but do not forget to cross validate its parameters to avoid
over-fitting.
gamma: Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. The higher the value of gamma,
the harder the model tries to fit the training data set exactly, which increases the
generalization error and causes an over-fitting problem.
Example: Let’s look at the difference when we use different gamma values like 0, 10 or 100.
We should always look at the cross validation score to find an effective combination of
these parameters and avoid over-fitting.
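A minimal sketch of that check, assuming a recent scikit-learn and the first two iris features (sepal length and width). The smallest gamma is set to 0.1 here because current scikit-learn versions do not accept gamma=0; the 5-fold split is likewise a choice made for this example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target     # sepal length and sepal width only

results = {}
for gamma in [0.1, 10, 100]:
    model = svm.SVC(kernel='rbf', C=1, gamma=gamma)
    # Mean accuracy over 5 cross validation folds
    cv_accuracy = cross_val_score(model, X, y, cv=5).mean()
    # Accuracy on the same rows the model was trained on
    train_accuracy = model.fit(X, y).score(X, y)
    results[gamma] = (train_accuracy, cv_accuracy)
    print(f"gamma={gamma}: train accuracy={train_accuracy:.3f}, cv accuracy={cv_accuracy:.3f}")
```

A gap that widens between training accuracy and cross validation accuracy as gamma grows is the over-fitting signal to look for.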
In R, SVMs can be tuned in a similar fashion as they are in Python. Mentioned below are
the respective parameters for e1071 package:
Practice Problem
Find right additional feature to have a hyper-plane for segregating the classes in below
snapshot:
Answer with the variable name in the comments section below. I shall then reveal the
answer.
End Notes
Introduction
Tree based learning algorithms are considered to be among the best and most widely used
supervised learning methods. Tree based methods empower predictive models with high
accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear
relationships quite well. They are adaptable to any kind of problem at hand
(classification or regression).
Methods like decision trees, random forest, gradient boosting are being popularly used
in all kinds of data science problems. Hence, for every analyst (fresher also), it’s
important to learn these algorithms and use them for modeling.
This tutorial is meant to help beginners learn tree based modeling from scratch. After the
successful completion of this tutorial, one is expected to become proficient at using tree
based algorithms and build predictive models.
Table of Contents
1. What is a Decision Tree? How does it work?
2. Regression Trees vs Classification Trees
3. How does a tree decide where to split?
4. What are the key parameters of model building and how can we avoid
over-fitting in decision trees?
5. Are tree based models better than linear models?
6. Working with Decision Trees in R and Python
7. What are the ensemble methods of trees based model?
8. What is Bagging? How does it work?
9. What is Random Forest ? How does it work?
10. What is Boosting ? How does it work?
11. Which is more powerful: GBM or Xgboost?
12. Working with GBM in R and Python
13. Working with Xgboost in R and Python
14. Where to Practice ?
Let’s say we have a sample of 30 students with three variables: Gender (Boy/ Girl),
Class (IX/ X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, I
want to create a model to predict who will play cricket during the leisure period. In this
problem, we need to segregate students who play cricket in their leisure time based on
the most significant input variable among all three.
This is where a decision tree helps. It will segregate the students based on all values of
the three variables and identify the variable which creates the best homogeneous sets of
students (which are heterogeneous to each other). In the snapshot below, you can see
that the variable Gender is able to identify the best homogeneous sets compared to the other
two variables.
As mentioned above, a decision tree identifies the most significant variable and its value
that gives the best homogeneous sets of population. Now the question which arises is, how
does it identify the variable and the split? To do this, a decision tree uses various
algorithms, which we shall discuss in the following section.
Types of Decision Trees
The type of decision tree is based on the type of target variable we have. It can be of two
types:
1. Categorical Variable Decision Tree: A decision tree which has a categorical target
variable is called a categorical variable decision tree. Example: In the above
scenario of the student problem, the target variable was “Student will play
cricket or not”, i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree which has a continuous target
variable is called a continuous variable decision tree.
Example: Let’s say we have a problem to predict whether a customer will pay his
renewal premium with an insurance company (yes/ no). Here we know that the income of the
customer is a significant variable, but the insurance company does not have income details
for all customers. Now, as we know this is an important variable, we can build a
decision tree to predict customer income based on occupation, product and various
other variables. In this case, we are predicting values for a continuous variable.
1. Root Node: It represents entire population or sample and this further gets
divided into two or more homogeneous sets.
2. Splitting: It is a process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, then it is called
decision node.
4. Leaf/ Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
5. Pruning: When we remove sub-nodes of a decision node, this process is called
pruning. You can say it is the opposite process of splitting.
6. Branch / Sub-Tree: A sub section of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: A node which is divided into sub-nodes is called the parent
node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
These are the terms commonly used for decision trees. As we know that every algorithm
has advantages and disadvantages, below are the important factors which one should
know.
Advantages
1. Easy to Understand: Decision tree output is very easy to understand even for
people from non-analytical background. It does not require any statistical
knowledge to read and interpret them. Its graphical representation is very
intuitive and users can easily relate their hypothesis.
2. Useful in Data exploration: A decision tree is one of the fastest ways to identify
the most significant variables and the relation between two or more variables. With the
help of decision trees, we can create new variables / features that have better
power to predict the target variable. You can refer to the article (Trick to enhance power of
regression model) for one such trick. It can also be used in the data exploration
stage. For example, when we are working on a problem where we have information
available in hundreds of variables, a decision tree will help to identify the most
significant variables.
3. Less data cleaning required: It requires less data cleaning compared to
some other modeling techniques. It is not influenced by outliers and missing
values to a fair degree.
4. Data type is not a constraint: It can handle both numerical and categorical
variables.
5. Non Parametric Method: Decision tree is considered to be a non-parametric
method. This means that decision trees have no assumptions about the space
distribution and the classifier structure.
Disadvantages
1. Over fitting: Over fitting is one of the most practical difficulties for decision tree
models. This problem gets solved by setting constraints on model parameters
and pruning (discussed in detail below).
2. Not fit for continuous variables: While working with continuous numerical
variables, a decision tree loses information when it categorizes variables into
different categories.
2. Regression Trees vs Classification Trees
We all know that the terminal nodes (or leaves) lie at the bottom of the decision tree.
This means that decision trees are typically drawn upside down such that leaves are at
the bottom & roots are at the top (shown below).
Both the trees work almost similarly to each other. Let’s look at the primary differences &
similarities between classification and regression trees:
Decision trees use multiple algorithms to decide to split a node in two or more sub-nodes.
The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other
words, we can say that purity of the node increases with respect to the target variable.
Decision tree splits the nodes on all available variables and then selects the split which
results in most homogeneous sub-nodes.
The algorithm selection is also based on type of target variables. Let’s look at the four
most commonly used algorithms in decision tree:
Gini Index
Gini index says that if we select two items from a population at random, they must be of
the same class, and the probability of this is 1 if the population is pure.
1. Calculate Gini for sub-nodes, using the formula: sum of squares of the probabilities of
success and failure (p^2+q^2).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.
Example: – Referring to example used above, where we want to segregate the students
based on target variable ( playing cricket or not ). In the snapshot below, we split the
population using two input variables Gender and Class. Now, I want to identify which
split is producing more homogeneous sub-nodes using Gini index.
Split on Gender:
Above, you can see that Gini score for Split on Gender is higher than Split on
Class, hence, the node split will take place on Gender.
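The two steps can be reproduced with a few lines of Python. The node counts are the ones used throughout the student example (Female: 2 of 10 play, Male: 13 of 20, Class IX: 6 of 14, Class X: 9 of 16); the function names are hypothetical.

```python
def gini_node(play, total):
    """Gini score of a node as defined above: p^2 + q^2."""
    p = play / total
    q = 1 - p
    return p ** 2 + q ** 2

def gini_split(nodes):
    """Weighted Gini score of a split; nodes is a list of (play, total) pairs."""
    n = sum(total for _, total in nodes)
    return sum(total / n * gini_node(play, total) for play, total in nodes)

# (students who play cricket, students in the node)
gender_split = [(2, 10), (13, 20)]   # Female, Male
class_split = [(6, 14), (9, 16)]     # Class IX, Class X

gini_gender = gini_split(gender_split)
gini_class = gini_split(class_split)
print(round(gini_gender, 2), round(gini_class, 2))   # 0.59 0.51
```

With this definition a higher score means purer sub-nodes, so the Gender split (0.59) is preferred over the Class split (0.51).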
Chi-Square
It is an algorithm to find out the statistical significance of the differences between
sub-nodes and the parent node. We measure it by the sum of squares of
standardized differences between observed and expected frequencies of the target variable.
1. Calculate Chi-square for an individual node by calculating the deviation for both
Success and Failure.
2. Calculate Chi-square of the split using the sum of all Chi-square values of Success and
Failure of each node of the split.
Example: Let’s work with above example that we have used to calculate Gini.
Split on Gender:
1. First, for the node Female, populate the actual values for “Play
Cricket” and “Not Play Cricket”; here these are 2 and 8 respectively.
2. Calculate the expected values for “Play Cricket” and “Not Play Cricket”; here it
would be 5 for both because the parent node has a probability of 50% and we have
applied the same probability to the Female count (10).
3. Calculate the deviations using the formula Actual - Expected. It is (2 - 5 = -3) for
“Play Cricket” and (8 - 5 = 3) for “Not Play Cricket”.
4. Calculate the Chi-square of the node for “Play Cricket” and “Not Play Cricket” using
the formula ((Actual - Expected)^2 / Expected)^(1/2). You can refer to the
table below for the calculation.
5. Follow similar steps for calculating the Chi-square value for the Male node.
6. Now add all Chi-square values to calculate the Chi-square for the split on Gender.
Split on Class:
Perform similar steps of calculation for split on Class and you will come up with below
table.
Above, you can see that Chi-square also identifies the Gender split as more significant
compared to Class.
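The steps above can be sketched in Python using the formula exactly as it is written in step 4 (the square-root form used in this tutorial, which is not the same as the textbook Pearson chi-square statistic). The node counts come from the student example; the function names are hypothetical.

```python
import math

def chi_node(actual_play, total, parent_rate=0.5):
    """Chi value of one node: sum over both outcomes of sqrt((actual - expected)^2 / expected)."""
    actual = [actual_play, total - actual_play]
    expected = [total * parent_rate, total * (1 - parent_rate)]
    return sum(math.sqrt((a - e) ** 2 / e) for a, e in zip(actual, expected))

def chi_split(nodes):
    """Chi value of a split: the sum over all of its nodes."""
    return sum(chi_node(play, total) for play, total in nodes)

# (students who play cricket, students in the node)
chi_gender = chi_split([(2, 10), (13, 20)])   # Female, Male
chi_class = chi_split([(6, 14), (9, 16)])     # Class IX, Class X
print(round(chi_gender, 2), round(chi_class, 2))   # 4.58 1.46
```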
Information Gain:
Look at the image below and think which node can be described easily. I am sure, your
answer is C because it requires less information as all values are similar. On the other
hand, B requires more information to describe it and A requires the maximum
information. In other words, we can say that C is a Pure node, B is less Impure and A is
more impure.
Now, we can conclude that a less impure node requires less information to
describe it, and a more impure node requires more information. Information theory defines a
measure of this degree of disorganization in a system, known as Entropy. If the
sample is completely homogeneous, then the entropy is zero, and if the sample is
equally divided (50% – 50%), it has an entropy of one.
Entropy can be calculated using the formula: Entropy = -p log2(p) - q log2(q)
Here p and q are the probabilities of success and failure respectively in that node. Entropy is
also used with a categorical target variable. It chooses the split which has the lowest entropy
compared to the parent node and other splits. The lesser the entropy, the better it is.
Example: Let’s use this method to identify best split for student example.
1. Entropy for parent node = -(15/30) log2 (15/30) – (15/30) log2 (15/30) = 1. Here 1
shows that it is a impure node.
2. Entropy for Female node = -(2/10) log2 (2/10) – (8/10) log2 (8/10) = 0.72 and for
male node, -(13/20) log2 (13/20) – (7/20) log2 (7/20) = 0.93
3. Entropy for split Gender = Weighted entropy of sub-nodes = (10/30)*0.72 +
(20/30)*0.93 = 0.86
4. Entropy for Class IX node, -(6/14) log2 (6/14) – (8/14) log2 (8/14) = 0.99 and for
Class X node, -(9/16) log2 (9/16) – (7/16) log2 (7/16) = 0.99.
5. Entropy for split Class = (14/30)*0.99 + (16/30)*0.99 = 0.99
Above, you can see that the entropy for the split on Gender is the lowest among all, so the tree
will split on Gender. We can derive information gain from entropy as 1 - Entropy (here 1 is
the entropy of the parent node).
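The entropy figures in this example can be checked with a short Python sketch. The counts are those of the student example; the function names are hypothetical, and information gain is computed as parent entropy minus the weighted entropy of the split.

```python
import math

def entropy(play, total):
    """Entropy of a node: -p*log2(p) - q*log2(q), treating 0*log2(0) as 0."""
    p = play / total
    q = 1 - p
    return -sum(v * math.log2(v) for v in (p, q) if v > 0)

def split_entropy(nodes):
    """Weighted entropy of a split; nodes is a list of (play, total) pairs."""
    n = sum(total for _, total in nodes)
    return sum(total / n * entropy(play, total) for play, total in nodes)

parent = entropy(15, 30)
gender = split_entropy([(2, 10), (13, 20)])   # Female, Male
klass = split_entropy([(6, 14), (9, 16)])     # Class IX, Class X

print(round(parent, 2), round(gender, 2), round(klass, 2))
print("information gain:", round(parent - gender, 2), round(parent - klass, 2))
```

The Gender split has the lower weighted entropy and therefore the higher information gain.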
Reduction in Variance
Till now, we have discussed the algorithms for a categorical target variable. Reduction in
variance is an algorithm used for continuous target variables (regression problems). This
algorithm uses the standard formula of variance to choose the best split. The split with
the lowest variance is selected as the criterion to split the population:
Variance = Σ(X - X̄)^2 / n, where X̄ is the mean of the values, X is an actual value and n is
the number of values.
Example:- Let’s assign numerical value 1 for play cricket and 0 for not playing cricket.
Now follow the steps to identify the right split:
1. Variance for Root node, here mean value is (15*1 + 15*0)/30 = 0.5 and we have
15 one and 15 zero. Now variance would be ((1-0.5)^2+(1-0.5)^2+….15 times+(0-
0.5)^2+(0-0.5)^2+…15 times) / 30, this can be written as (15*(1-0.5)^2+15*(0-
0.5)^2) / 30 = 0.25
2. Mean of Female node = (2*1+8*0)/10=0.2 and Variance = (2*(1-0.2)^2+8*(0-
0.2)^2) / 10 = 0.16
3. Mean of Male Node = (13*1+7*0)/20=0.65 and Variance = (13*(1-0.65)^2+7*(0-
0.65)^2) / 20 = 0.23
4. Variance for Split Gender = Weighted Variance of Sub-nodes = (10/30)*0.16 +
(20/30) *0.23 = 0.21
5. Mean of Class IX node = (6*1+8*0)/14=0.43 and Variance = (6*(1-0.43)^2+8*(0-
0.43)^2) / 14= 0.24
6. Mean of Class X node = (9*1+7*0)/16=0.56 and Variance = (9*(1-0.56)^2+7*(0-
0.56)^2) / 16 = 0.25
7. Variance for Split Class = (14/30)*0.24 + (16/30) *0.25 = 0.25
Above, you can see that the Gender split has lower variance compared to the parent node and to
the Class split, so the split would take place on the Gender variable.
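The same steps can be sketched in Python. The helper names `variance` and `split_variance` are made up for this illustration; the 0/1 values encode "does not play" and "plays cricket" as in the example. Because the code does not round intermediate values, the Gender result is 0.205 rather than the 0.21 obtained above from rounded sub-node variances.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_variance(children):
    """Variance of a split: size-weighted average of the sub-node variances."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * variance(child) for child in children)

root = [1] * 15 + [0] * 15
female, male = [1] * 2 + [0] * 8, [1] * 13 + [0] * 7
class_ix, class_x = [1] * 6 + [0] * 8, [1] * 9 + [0] * 7

root_var = variance(root)
gender_var = split_variance([female, male])
class_var = split_variance([class_ix, class_x])

print(f"{root_var:.3f} {gender_var:.3f} {class_var:.3f}")
```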
So far, we have learnt the basics of decision trees and the decision making process
involved in choosing the best splits when building a tree model. As I said, a decision tree can
be applied to both regression and classification problems. Let’s understand these aspects
in detail.
4. What are the key parameters of tree modeling and
how can we avoid over-fitting in decision trees?
Overfitting is one of the key challenges faced while modeling decision trees. If no limit is
set on a decision tree, it will give you 100% accuracy on the training set because in
the worst case it will end up making 1 leaf for each observation. Thus, preventing
overfitting is pivotal while modeling a decision tree and it can be done in 2 ways: setting
constraints on tree size and tree pruning.
The parameters used for defining a tree are further explained below. The parameters
described below are irrespective of tool. It is important to understand the role of
parameters used in tree modeling. These parameters are available in R & Python.
Tree Pruning
As discussed earlier, the technique of setting constraints is a greedy approach. In other
words, it will check for the best split instantaneously and move forward until one of the
specified stopping conditions is reached. Let’s consider the following case when you’re
driving:
At this instant, you are the yellow car and you have 2 choices:
This is exactly the difference between a normal decision tree & pruning. A decision tree
with constraints won’t see the truck ahead and will adopt a greedy approach by taking a left.
On the other hand, if we use pruning, we in effect look a few steps ahead and make a
choice.
So we know pruning is better. But how do we implement it in a decision tree? The idea is
simple.
Note that sklearn’s decision tree classifier did not support pruning when this was written;
releases from 0.22 onwards offer cost-complexity pruning through the ccp_alpha parameter.
Advanced packages like xgboost have adopted tree pruning in their implementation, and the
library rpart in R provides a function to prune. Good for R users!
Actually, you can use any algorithm. It is dependent on the type of problem you are
solving. Let’s look at some key factors which will help you to decide which algorithm to
use:
For R users, there are multiple packages available to implement decision tree such as
ctree, rpart, tree etc.
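> library(rpart)
# assumes x_train, y_train and x_test already exist (same names as the Python code below)
> x <- cbind(x_train, y_train)
# grow tree
> fit <- rpart(y_train ~ ., data = x, method = "class")
> summary(fit)
#Predict Output
> predicted <- predict(fit, x_test)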
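#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have, X (predictor) and y (target) for training data set and x_test (predictor)
#of test_dataset
# Create tree object; criterion can be 'gini' (default) or 'entropy' (information gain)
model = tree.DecisionTreeClassifier(criterion='gini')
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)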
7. What are ensemble methods in tree based
modeling ?
The literal meaning of the word ‘ensemble’ is group. Ensemble methods combine a group of
predictive models to achieve better accuracy and model stability. Ensemble methods
are known to give a substantial boost to tree based models.
Like every other model, a tree based model also suffers from the plague of bias and
variance. Bias means, ‘how much on an average are the predicted values different from
the actual value.’ Variance means, ‘how different will the predictions of the model be at
the same point if different samples are taken from the same population’.
You build a small tree and you will get a model with low variance and high bias. How do
you manage to balance the trade off between bias and variance ?
Normally, as you increase the complexity of your model, you will see a reduction in
prediction error due to lower bias in the model. As you continue to make your model
more complex, you end up over-fitting your model and your model will start suffering
from high variance.
A champion model should maintain a balance between these two types of errors. This is
known as the trade-off management of bias-variance errors. Ensemble learning is one
way to execute this trade off analysis.
Some of the commonly used ensemble methods include: Bagging, Boosting and Stacking. In
this tutorial, we’ll focus on Bagging and Boosting in detail.
8. What is Bagging? How does it work?
Bagging is a technique used to reduce the variance of our predictions by combining the
result of multiple classifiers modeled on different sub-samples of the same data set. The
following figure will make it clearer:
Note that here the number of models built is not a hyper-parameter to be tuned. A higher
number of models generally performs better than, or about as well as, a lower number. It can be
theoretically shown that the variance of the combined predictions is reduced to 1/n (n:
number of classifiers) of the original variance, under some assumptions (for example, that
the models’ errors are uncorrelated).
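The 1/n claim can be illustrated with a small simulation. This sketch makes a strong simplifying assumption that real bagged trees do not satisfy exactly: each model's prediction is the true value plus independent noise of unit variance. The function names are hypothetical.

```python
import random
import statistics

random.seed(0)

def noisy_prediction():
    # One hypothetical model: true value 0 plus independent Gaussian noise
    return random.gauss(0.0, 1.0)

def bagged_prediction(n_models):
    # Bagging-style aggregation: average the predictions of n_models models
    return statistics.fmean(noisy_prediction() for _ in range(n_models))

trials = 20000
var_single = statistics.pvariance([noisy_prediction() for _ in range(trials)])
var_bagged = statistics.pvariance([bagged_prediction(10) for _ in range(trials)])

print(f"single model: {var_single:.3f}, average of 10 models: {var_bagged:.3f}")
```

With 10 models the variance of the averaged prediction comes out close to one tenth of the single-model variance. When the models' errors are correlated, as they are for trees trained on overlapping bootstrap samples, the reduction is smaller.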
There are various implementations of bagging models. Random forest is one of them
and we’ll discuss it next.
9. What is Random Forest ? How does it work?
Random Forest is often described as a panacea for all data science problems. On a funny
note, when you can’t think of any algorithm (irrespective of situation), use random forest!
It works in the following manner. Each tree is planted & grown as follows:
1. Assume number of cases in the training set is N. Then, sample of these N cases
is taken at random but with replacement. This sample will be the training set for
growing the tree.
2. If there are M input variables, a number m<M is specified such that at each node,
m variables are selected at random out of the M. The best split on these m is
used to split the node. The value of m is held constant while we grow the forest.
3. Each tree is grown to the largest extent possible and there is no pruning.
4. Predict new data by aggregating the predictions of the ntree trees (i.e., majority
votes for classification, average for regression).
To understand more in detail about this algorithm using a case study, please read
this article “Introduction to Random forest – Simplified“.
It has an effective method for estimating missing data and maintains accuracy
when a large proportion of the data are missing.
It has methods for balancing errors in data sets where classes are imbalanced.
The capabilities of the above can be extended to unlabeled data, leading to
unsupervised clustering, data views and outlier detection.
Random Forest involves sampling of the input data with replacement, called
bootstrap sampling. Here about one third of the data is not used for training a given tree
and can be used for testing. These are called the out of bag samples. Error estimated on
these out of bag samples is known as out of bag error. Studies of out of bag error
estimates give evidence that the out-of-bag estimate is as accurate
as using a test set of the same size as the training set. Therefore, using the
out-of-bag error estimate removes the need for a set aside test set.
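The "about one third" figure follows from sampling N cases with replacement: each case is missed with probability (1 - 1/N)^N, which approaches 1/e, roughly 0.368. A minimal sketch that checks this empirically (the sample size and seed are arbitrary choices):

```python
import random

random.seed(42)

N = 10000
# Bootstrap sample: draw N row indices with replacement
bootstrap = [random.randrange(N) for _ in range(N)]

# Out-of-bag rows are the ones never drawn for this tree
out_of_bag = set(range(N)) - set(bootstrap)
oob_fraction = len(out_of_bag) / N

print(f"out-of-bag fraction: {oob_fraction:.3f}")
```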
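#Import Library
from sklearn.ensemble import RandomForestClassifier #use RandomForestRegressor for regression problem
#Assumed you have, X (predictor) and y (target) for training data set and x_test (predictor)
#of test_dataset
# Create Random Forest object
model= RandomForestClassifier(n_estimators=1000)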
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
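R Code
> library(randomForest)
# assumes x_train, y_train (a factor) and x_test already exist
> x <- cbind(x_train, y_train)
# Fitting model
> fit <- randomForest(y_train ~ ., data = x)
#Predict Output
> predicted <- predict(fit, x_test)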
Let’s understand this definition in detail by solving a problem of spam email identification:
How would you classify an email as SPAM or not? Like everyone else, our initial
approach would be to identify ‘spam’ and ‘not spam’ emails using following criteria. If:
1. Email has only one image file (promotional image), It’s a SPAM
2. Email has only link(s), It’s a SPAM
3. Email body consist of sentence like “You won a prize money of $ xxxxxx”, It’s a
SPAM
4. Email from our official domain “Analyticsvidhya.com” , Not a SPAM
5. Email from known source, Not a SPAM
Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But do
you think these rules individually are strong enough to successfully classify an email? No.
Individually, these rules are not powerful enough to classify an email into ‘spam’ or ‘not
spam’. Therefore, these rules are called weak learners.
To convert weak learners into a strong learner, we’ll combine the prediction of each weak
learner using methods like:
For example: Above, we have defined 5 weak learners. Out of these 5, 3 vote
‘SPAM’ and 2 vote ‘Not a SPAM’. In this case, by default, we’ll consider the
email as SPAM because we have the higher number of votes (3) for ‘SPAM’.
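The voting step in this example can be sketched as follows. The function name `majority_vote` and the label strings are illustrative assumptions; the five votes mirror the 3-versus-2 outcome described above.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label predicted by the largest number of weak learners."""
    return Counter(votes).most_common(1)[0][0]

# One vote per weak rule, as in the example: 3 say SPAM, 2 say NOT SPAM
votes = ["SPAM", "SPAM", "SPAM", "NOT SPAM", "NOT SPAM"]
decision = majority_vote(votes)

print(decision)
```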
How does it work?
Now we know that boosting combines weak learners, a.k.a. base learners, to form a strong
rule. An immediate question which should pop into your mind is, ‘How does boosting identify
weak rules?‘
To find a weak rule, we apply base learning (ML) algorithms with a different distribution.
Each time a base learning algorithm is applied, it generates a new weak prediction rule.
This is an iterative process. After many iterations, the boosting algorithm combines these
weak rules into a single strong prediction rule.
Here’s another question which might haunt you, ‘How do we choose different distribution
for each round?’
For choosing the right distribution, here are the steps:
Step 1: The base learner takes all the distributions and assigns equal weight or attention
to each observation.
Step 2: If there is any prediction error caused by the first base learning algorithm, then we
pay higher attention to the observations having prediction error. Then, we apply the next
base learning algorithm.
Step 3: Iterate Step 2 till the limit of the base learning algorithm is reached or higher
accuracy is achieved.
Finally, it combines the outputs from the weak learners and creates a strong learner which
eventually improves the prediction power of the model. Boosting pays higher focus on
examples which are mis-classified or have higher errors by preceding weak rules.
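The three steps can be sketched with an AdaBoost-style reweighting scheme. Note the assumptions: the text does not name a specific algorithm, so AdaBoost is swapped in here as one concrete instance; the weak learner is a one-threshold "decision stump"; and the toy data set and all function names are made up for illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    # Weak rule: predict `polarity` below the threshold, the opposite above it
    return polarity if x < threshold else -polarity

def best_stump(X, y, w):
    # Pick the stump with the lowest weighted error under the current weights
    best = None
    for threshold in [x + 0.5 for x in [min(X) - 1] + X]:
        for polarity in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(xi, threshold, polarity) != yi)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def boost(X, y, rounds):
    n = len(X)
    w = [1.0 / n] * n                      # Step 1: equal weight per observation
    ensemble = []
    for _ in range(rounds):                # Step 3: repeat Step 2
        err, threshold, polarity = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # say of this weak rule
        ensemble.append((alpha, threshold, polarity))
        # Step 2: raise the weight of mis-classified observations, lower the rest
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, threshold, polarity))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble, w

def predict(ensemble, x):
    # Strong rule: weighted vote of all weak rules
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble, w = boost(X, y, rounds=3)
accuracy = sum(predict(ensemble, xi) == yi for xi, yi in zip(X, y)) / len(X)
print(accuracy)
```

No single stump can separate this toy data (the best one misclassifies 3 of 10 points), but the weighted vote of three stumps classifies all ten points correctly.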
There are many boosting algorithms which impart additional boost to model’s accuracy.
In this tutorial, we’ll learn about the two most commonly used algorithms i.e. Gradient
Boosting (GBM) and XGboost.
1. Regularization:
o Standard GBM implementation has no regularization, whereas XGBoost does, and this
regularization helps to reduce overfitting.
o In fact, XGBoost is also known as a ‘regularized boosting‘ technique.
2. Parallel Processing:
o XGBoost implements parallel processing and is blazingly faster as
compared to GBM.
o But hang on, we know that boosting is sequential process so how can it be
parallelized? We know that each tree can be built only after the previous
one, so what stops us from making a tree using all cores? I hope
you get where I’m coming from. Check this link out to explore further.
o XGBoost also supports implementation on Hadoop.
3. High Flexibility
o XGBoost allow users to define custom optimization objectives and
evaluation criteria.
o This adds a whole new dimension to the model and there is no limit to
what we can do.
4. Handling Missing Values
o XGBoost has an in-built routine to handle missing values.
o User is required to supply a different value than other observations and
pass that as a parameter. XGBoost tries different things as it encounters a
missing value on each node and learns which path to take for missing
values in future.
5. Tree Pruning:
o A GBM would stop splitting a node when it encounters a negative loss in
the split. Thus it is more of a greedy algorithm.
o XGBoost, on the other hand, makes splits up to the max_depth specified
and then starts pruning the tree backwards, removing splits beyond
which there is no positive gain.
o Another advantage is that sometimes a split of negative loss, say -2, may
be followed by a split of positive loss +10. GBM would stop as it
encounters -2. But XGBoost will go deeper, see the combined
effect of +8 of the two splits, and keep both.
6. Built-in Cross-Validation
o XGBoost allows user to run a cross-validation at each iteration of the
boosting process and thus it is easy to get the exact optimum number of
boosting iterations in a single run.
o This is unlike GBM where we have to run a grid-search and only a limited
number of values can be tested.
7. Continue on Existing Model
o User can start training an XGBoost model from its last iteration of previous
run. This can be of significant advantage in certain specific applications.
o GBM implementation of sklearn also has this feature so they are even on
this point.
2.1 Update the weights for targets based on previous run (higher for the ones
mis-classified)
2.4 Update the output with current results taking into account the learning rate
This is an extremely simplified (probably naive) explanation of how GBM works. But it will
help every beginner to understand this algorithm.
Let’s consider the important GBM parameters used to improve model performance in
Python:
1. learning_rate
o This determines the impact of each tree on the final outcome (step 2.4).
GBM works by starting with an initial estimate which is updated using the
output of each tree. The learning parameter controls the magnitude of this
change in the estimates.
o Lower values are generally preferred as they make the model robust to
the specific characteristics of each tree, thus allowing it to generalize well.
o Lower values would require a higher number of trees to model all the
relations and will be computationally expensive.
2. n_estimators
o The number of sequential trees to be modeled (step 2)
o Though GBM is fairly robust at a higher number of trees, it can still overfit
at some point. Hence, this should be tuned using CV for a particular learning
rate.
3. subsample
o The fraction of observations to be selected for each tree. Selection is done
by random sampling.
o Values slightly less than 1 make the model robust by reducing the
variance.
o Typical values ~0.8 generally work fine but can be fine-tuned further.
Apart from these, there are certain miscellaneous parameters which affect overall
functionality:
1. loss
o It refers to the loss function to be minimized in each split.
o It can have various values for classification and regression case.
Generally the default values work fine. Other values should be chosen
only if you understand their impact on the model.
2. init
o This affects initialization of the output.
o This can be used if we have made another model whose outcome is to be
used as the initial estimates for GBM.
3. random_state
o The random number seed so that same random numbers are generated
every time.
o This is important for parameter tuning. If we don’t fix the random number,
then we’ll have different outcomes for subsequent runs on the same
parameters and it becomes difficult to compare models.
o It can potentially result in overfitting to a particular random sample
selected. We can try running models for different random samples, which
is computationally expensive and generally not used.
4. verbose
o The type of output to be printed when the model fits. The different values
can be:
0: no output generated (default)
1: output generated for trees in certain intervals
>1: output generated for all trees
5. warm_start
o This parameter has an interesting application and can help a lot if used
judiciously.
o Using this, we can fit additional trees on previous fits of a model. It can
save a lot of time and you should explore this option for advanced
applications.
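6. presort
o Select whether to presort data for faster splits.
o It makes the selection automatically by default but it can be changed if
needed. (Recent scikit-learn releases have removed this parameter: it was deprecated in
version 0.22 and dropped in 0.24.)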
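As a rough illustration of how these parameters fit together, the sketch below trains scikit-learn's GradientBoostingClassifier on a synthetic data set and then uses warm_start to add trees to the existing fit. The data set, the split and every parameter value are arbitrary assumptions chosen for the example, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only so that the example is self-contained
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(
    learning_rate=0.1,   # impact of each tree on the final outcome
    n_estimators=100,    # number of sequential trees
    subsample=0.8,       # fraction of observations sampled for each tree
    random_state=0,      # fixed seed so that runs are comparable
    verbose=0,           # no progress output
    warm_start=True,     # allow later fits to add trees to this one
)
model.fit(X_train, y_train)
trees_first_fit = model.n_estimators_

# With warm_start=True, raising n_estimators and refitting keeps the
# first 100 trees and only grows the 50 additional ones
model.set_params(n_estimators=150)
model.fit(X_train, y_train)
trees_second_fit = model.n_estimators_

accuracy = model.score(X_test, y_test)
print(trees_first_fit, trees_second_fit)
```

In practice n_estimators would be tuned with cross-validation for the chosen learning_rate, as noted above.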
I know it’s a long list of parameters but I have simplified it for you in an Excel file which
you can download from this GitHub repository.
For R users, using the caret package, there are 4 main tuning parameters:
1. n.trees – It refers to the number of iterations, i.e. the number of trees to
grow
2. interaction.depth – It determines the complexity of the tree i.e. total number of
splits it has to perform on a tree (starting from a single node)
3. shrinkage – It refers to the learning rate. This is similar to learning_rate in python
(shown above).
4. n.minobsinnode – It refers to minimum number of training samples required in a
node to perform splitting
GBM in R (with cross validation)
I’ve shared the standard codes in R and Python. At your end, you’ll be required to
change the value of dependent variable and data set name used in the codes below.
Considering the ease of implementing GBM in R, one can easily perform tasks like cross
validation and grid search with this package.
> library(caret)
n.trees = 500,
shrinkage = 0.1,
n.minobsinnode = 10)
> set.seed(825)
method = "gbm",
trControl = fitControl,
verbose = FALSE,
tuneGrid = gbmGrid)
> predicted= predict(fit,test,type= "prob")[,2]
GBM in Python
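#import libraries
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()  # default settings assumed; tune the parameters described above
clf.fit(X_train, y_train)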
R Tutorial: For R users, this is a complete tutorial on XGboost which explains the
parameters along with codes in R. Check Tutorial.
Python Tutorial: For Python users, this is a comprehensive tutorial on XGBoost, good to
get you started. Check Tutorial.
14. Where to practice ?
Practice is the one and true method of mastering any concept. Hence, you need to start
practicing if you wish to master these algorithms.
Till here, you’ve gained significant knowledge on tree based models along with their
practical implementation. It’s time that you start working on them. Here are open practice
problems where you can participate and check your live rankings on leaderboard:
End Notes
Tree based algorithms are important for every data scientist to learn. In fact, tree models
are known to be among the best performing models in the whole family of machine
learning algorithms. In this tutorial, we learnt until GBM and XGBoost. And with this, we
come to the end of this tutorial.
We discussed tree based modeling from scratch. We learnt the importance of
decision trees and how that simple concept is used in boosting algorithms. For
better understanding, I would suggest you continue practicing these algorithms.
Also, do keep note of the parameters associated with boosting algorithms.
I’m hoping that this tutorial has enriched you with thorough knowledge of tree based
modeling.
Ultimate guide to deal with Text Data
(using Python) – for Data Scientists &
Engineers
Introduction
One of the biggest breakthroughs required for achieving any level of artificial intelligence
is to have machines which can process text data. Thankfully, the amount of text data
being generated in this universe has exploded exponentially in the last few years.
In this article we will discuss different feature extraction methods, starting with some
basic techniques which will lead into advanced Natural Language Processing techniques.
We will also learn about pre-processing of the text data in order to extract better features
from clean data.
By the end of this article, you will be able to perform text operations by yourself. Let’s get
started!
Table of Contents:
1. Basic feature extraction using text data
o Number of words
o Number of characters
o Average word length
o Number of stopwords
o Number of special characters
o Number of numerics
o Number of uppercase words
2. Basic Pre-processing of text data
o Lower casing
o Punctuation removal
o Stopwords removal
o Frequent words removal
o Rare words removal
o Spelling correction
o Tokenization
o Stemming
o Lemmatization
3. Advanced Text Processing
o N-grams
o Term Frequency
o Inverse Document Frequency
o Term Frequency-Inverse Document Frequency (TF-IDF)
o Bag of Words
o Sentiment Analysis
o Word Embedding
Before starting, let’s quickly read the training file from the dataset in order to perform
different tasks on it. In the entire article, we will use the twitter sentiment dataset from
the datahack platform.
train = pd.read_csv('train_E6oV3lV.csv')
Note that here we are only working with textual data, but we can also use the below
methods when numerical features are also present along with the text.
train[['tweet','word_count']].head()
1.2 Number of characters
This feature is also based on the previous feature intuition. Here, we calculate the
number of characters in each tweet. This is done by calculating the length of the tweet.
train['char_count'] = train['tweet'].str.len()  # length of each tweet, spaces included
train[['tweet','char_count']].head()
Note that the calculation will also include the number of spaces, which you can remove,
if required.
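def avg_word(sentence):
    words = sentence.split()
    # average word length = total characters in the words / number of words
    return sum(len(word) for word in words) / len(words)
train['avg_word'] = train['tweet'].apply(avg_word)
train[['tweet','avg_word']].head()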
Here, we have imported stopwords from NLTK, which is a basic NLP library in python.
from nltk.corpus import stopwords
stop = stopwords.words('english')
train['stopwords'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))
train[['tweet','stopwords']].head()
Here, we make use of the ‘starts with’ function because hashtags (or mentions) always
appear at the beginning of a word.
train['hastags'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
train[['tweet','hastags']].head()
1.6 Number of numerics
Just like we calculated the number of words, we can also calculate the number of
numerics which are present in the tweets. It does not have a lot of use in our example,
but this is still a useful feature that should be run while doing similar exercises. For
example,
train['numerics'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
train[['tweet','numerics']].head()
train['upper'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
train[['tweet','upper']].head()
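The basic features of this section can also be computed for a single string without pandas. The sketch below is only an illustration: the function name `basic_features` and the sample tweet are made up, and the stopword count is left out because it needs the NLTK word list.

```python
def basic_features(tweet):
    """Basic counts for one tweet, mirroring the pandas columns above."""
    words = tweet.split()
    return {
        "word_count": len(words),
        "char_count": len(tweet),  # includes spaces
        "avg_word": sum(len(w) for w in words) / len(words) if words else 0.0,
        "hashtags": sum(1 for w in words if w.startswith("#")),
        "numerics": sum(1 for w in words if w.isdigit()),
        "upper": sum(1 for w in words if w.isupper()),
    }

sample = "@user THANKS for #lyft credit, i can't use it 2 times #disappointed"
features = basic_features(sample)
print(features)
```

The same function could be applied to a whole column with `train['tweet'].apply(basic_features)`, assuming `train` is the DataFrame loaded earlier.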
2. Basic Pre-processing
So far, we have learned how to extract basic features from text data. Before diving into
text and feature extraction, our first step should be cleaning the data in order to obtain
better features. We will achieve this by doing some of the basic pre-processing steps on
our training data.
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
train['tweet'].head()
2.2 Removing Punctuation
The next step is to remove punctuation, as it doesn’t add any extra information while
treating text data. Therefore removing all instances of it will help us reduce the size of
the training data.
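# regex=True is required in recent pandas versions, where str.replace is literal by default
train['tweet'] = train['tweet'].str.replace(r'[^\w\s]', '', regex=True)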
train['tweet'].head()
As you can see in the above output, all the punctuation, including ‘#’ and ‘@’, has been
removed from the training data.
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
train['tweet'].head()
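The three cleaning steps so far (lower casing, punctuation removal and stopword removal) can be chained in one small function. This is a sketch under stated assumptions: `STOP` is a tiny hand-written stopword set used only for illustration, not NLTK's English list, and the sample sentence is made up to resemble the first tweet.

```python
import re

# Tiny illustrative stopword set (an assumption, not NLTK's list)
STOP = {"a", "is", "and", "so", "he", "his", "into", "when"}

def preprocess(text):
    text = text.lower()                      # 2.1 lower casing
    text = re.sub(r"[^\w\s]", "", text)      # 2.2 punctuation removal
    return " ".join(w for w in text.split() if w not in STOP)  # 2.3 stopwords

raw = "When a father is dysfunctional and is so selfish, he drags his kids into his dysfunction. #run"
cleaned = preprocess(raw)
print(cleaned)
```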
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq
love 2647
ð 2511
day 2199
â 1797
happy 1663
amp 1582
im 1139
u 1136
time 1110
dtype: int64
Now, let’s remove these words as their presence will not be of any use in the classification
of our text data.
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]
freq
tvperfect 1
oau 1
850am 1
semangatpagi 1
kindestbravest 1
moodyah 1
downhill 1
loreal 1
ohwhatcoulditbe 1
maannnn 1
dtype: int64
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()
All these pre-processing steps are essential and help us in reducing our vocabulary
clutter so that the features produced in the end are more effective.
In that regard, spelling correction is a useful pre-processing step because this also will
help us in reducing multiple copies of words. For example, “Analytics” and “analytcs” will
be treated as different words even if they are used in the same sense.
To achieve this we will use the textblob library. If you are not familiar with it, you can
check my previous article on ‘NLP for beginners using textblob’.
from textblob import TextBlob
train['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))
Note that it will actually take a lot of time to make these corrections. Therefore, just for
the purposes of learning, I have shown this technique by applying it on only the first 5
rows. Moreover, we cannot always expect it to be accurate so some care should be
taken before applying it.
We should also keep in mind that words are often used in their abbreviated form. For
instance, ‘your’ is used as ‘ur’. We should treat this before the spelling correction step,
otherwise these words might be transformed into any other word like the one shown
below:
2.7 Tokenization
Tokenization refers to dividing the text into a sequence of words or sentences. In our
example, we have used the textblob library to first transform our tweets into a blob and
then converted them into a series of words.
TextBlob(train['tweet'][1]).words
> WordList(['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', '
2.8 Stemming
Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc. by a simple rule-based
approach. For this purpose, we will use PorterStemmer from the NLTK library.
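from nltk.stem import PorterStemmer
st = PorterStemmer()
train['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))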
0 father dysfunct selfish drag kid dysfunct run
2 bihday majesti
In the above output, dysfunctional has been transformed into dysfunct, among other
changes.
2.9 Lemmatization
Lemmatization is often a more effective option than stemming because it converts the word
into its root word, rather than just stripping the suffixes. It makes use of the vocabulary
and does a morphological analysis to obtain the root word. Therefore, we usually prefer
using lemmatization over stemming.
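from textblob import Word
train['tweet'] = train['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))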
train['tweet'].head()
3.1 N-grams
N-grams are the combination of multiple words used together. Ngrams with N=1 are
called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on can also be used.
So, let’s quickly extract bigrams from our tweets using the ngrams function of the
textblob library.
TextBlob(train['tweet'][0]).ngrams(2)
[WordList(['when', 'a']),
WordList(['a', 'father']),
WordList(['father', 'is']),
WordList(['is', 'dysfunctional']),
WordList(['dysfunctional', 'and']),
WordList(['and', 'is']),
WordList(['is', 'so']),
WordList(['so', 'selfish']),
WordList(['selfish', 'he']),
WordList(['he', 'drags']),
WordList(['drags', 'his']),
WordList(['his', 'kids']),
WordList(['kids', 'into']),
WordList(['into', 'his']),
WordList(['his', 'dysfunction']),
WordList(['dysfunction', 'run'])]
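For readers without textblob installed, the same sliding-window idea can be written in a few lines of plain Python. The helper name `ngrams` is an illustrative choice here, not the textblob method.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, in order."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = "when a father is dysfunctional".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, and none at all when it has fewer than n tokens.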
3.2 Term frequency
Term frequency is simply the ratio of the count of a word present in a sentence, to the
length of the sentence.
Below, I have tried to show you the term frequency table of a tweet.
tf1 = (train['tweet'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
tf1
Therefore, the IDF of each word is the log of the ratio of the total number of rows to the
number of rows in which that word is present.
IDF = log(N/n), where, N is the total number of rows and n is the number of rows in
which the word was present.
So, let’s calculate IDF for the same tweets for which we calculated the term frequency.
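import numpy as np
for i, word in enumerate(tf1['words']):
    tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['tweet'].str.contains(word)])))
tf1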
The higher the value of IDF, the more unique the word is.
tf1['tfidf'] = tf1['tf'] * tf1['idf']  # TF-IDF is the product of TF and IDF
tf1
We can see that the TF-IDF has penalized words like ‘don’t’, ‘can’t’, and ‘use’ because
they are commonly occurring words. However, it has given a high weight to
“disappointed” since that will be very useful in determining the sentiment of the tweet.
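To make the arithmetic concrete, here is a small hand-rolled version of TF, IDF and TF-IDF on three made-up documents. All names and the toy corpus are assumptions for illustration. Unlike the `str.contains` call above, which matches substrings, this sketch counts a document only when it contains the word as a whole token.

```python
import math
from collections import Counter

# Made-up, already pre-processed "tweets"
docs = [
    "thanks for lyft credit cant use",
    "cant use the offer disappointed",
    "love the day",
]

def tf(doc):
    """Term frequency: count of each word divided by the document length."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def idf(word, docs):
    """IDF = log(N / n): N documents in total, n of them containing the word."""
    n = sum(1 for doc in docs if word in doc.split())
    return math.log(len(docs) / n)

tfidf = {word: value * idf(word, docs) for word, value in tf(docs[1]).items()}
for word, score in sorted(tfidf.items(), key=lambda item: -item[1]):
    print(f"{word:15s}{score:.3f}")
```

Words that occur in several documents ('cant', 'use', 'the') receive a lower score than 'disappointed', which occurs in only one.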
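We don’t have to calculate TF and IDF every time beforehand and then multiply them to
obtain TF-IDF. Instead, sklearn has a separate function to directly obtain it:
from sklearn.feature_extraction.text import TfidfVectorizer
# only the arguments still visible in the source are set; the rest use defaults
tfidf = TfidfVectorizer(stop_words= 'english',ngram_range=(1,1))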
train_vect = tfidf.fit_transform(train['tweet'])
train_vect
We can also perform basic pre-processing steps like lower-casing and removal of
stopwords, if we haven’t done them earlier.
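from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(analyzer = "word")  # other arguments left at their defaults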
train_bow = bow.fit_transform(train['tweet'])
train_bow
train['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)
0 (-0.3, 0.5354166666666667)
1 (0.2, 0.2)
2 (0.0, 0.0)
3 (0.0, 0.0)
4 (0.0, 0.0)
Above, you can see that it returns a tuple representing the polarity and subjectivity of each
tweet. Here, we only extract polarity as it indicates the sentiment: values nearer to 1
mean a positive sentiment and values nearer to -1 mean a negative sentiment. This
can also work as a feature for building a machine learning model.
train['sentiment'] = train['tweet'].apply(lambda x: TextBlob(x).sentiment[0])
train[['tweet','sentiment']].head()
3.7 Word Embeddings
Word Embedding is the representation of text in the form of vectors. The underlying idea
here is that similar words will have a minimum distance between their vectors.
Word2Vec models require a lot of text, so either we can train it on our training data or we
can use the pre-trained word vectors developed by Google, Wiki, etc.
Here, we will use pre-trained word vectors which can be downloaded from
the glove website. There are different dimensions (50,100, 200, 300) vectors trained on
wiki data. For this example, I have downloaded the 100-dimensional version of the
model.
You can refer to an article here to understand different forms of word embeddings.
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
>(400000, 100)
Now, we can load the above word2vec file as a model.
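from gensim.models import KeyedVectors
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
Let’s say our tweet contains a text saying ‘go away’. We can easily obtain its word
vectors using the above model: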
model['go']
model['away']
We then take the average to represent the string ‘go away’ in the form of vectors having
100 dimensions.
(model['go'] + model['away'])/2
We have converted the entire string into a vector which can now be used as a feature in
any modelling technique.
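The averaging step generalizes to any tweet. The sketch below uses made-up 3-dimensional vectors in place of the 100-dimensional GloVe vectors, and the helper `average_vector` is hypothetical; with the gensim model loaded above, the lookup table would be the model itself.

```python
# Made-up 3-dimensional "embeddings"; real GloVe vectors have 100 dimensions
vectors = {
    "go": [0.1, 0.4, -0.2],
    "away": [0.3, 0.0, 0.6],
}

def average_vector(text, vectors):
    """Average the vectors of the known words in a text, dimension by dimension."""
    dim = len(next(iter(vectors.values())))
    known = [word for word in text.split() if word in vectors]
    if not known:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(vectors[word][i] for word in known) / len(known) for i in range(dim)]

tweet_vector = average_vector("go away", vectors)
print(tweet_vector)
```

Skipping unknown words and returning a zero vector for an empty match are design choices of this sketch; other fallbacks are possible.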
End Notes
I hope that now you have a basic understanding of how to deal with text data in
predictive modeling. These methods will help in extracting more information which in
return will help you in building better models.
I would recommend practising these methods by applying them in machine
learning/deep learning competitions. You can also start with the Twitter sentiment
problem we covered in this article (the dataset is available on the datahack platform of
AV).
Recommendation engines are nothing but an automated form of a “shop counter guy”.
You ask him for a product. Not only does he show that product, but also the related ones
which you could buy. Such salespeople are well trained in cross selling and up selling. So are
our recommendation engines.
Topics Covered
1. Type of Recommendation Engines
2. The MovieLens DataSet
3. A simple popularity model
4. A Collaborative Filtering Model
5. Evaluating Recommendation Engines
Basically the most popular items would be same for each user since popularity is defined
on the entire user pool. So everybody will see the same results. It sounds like, ‘a website
recommends you to buy microwave just because it’s been liked by other users and
doesn’t care if you are even interested in buying or not’.
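A popularity model of this kind needs very little code. The sketch below ranks items by mean rating over the whole user pool; the rating triples and function names are made up for illustration and this is not the graphlab model used later.

```python
from collections import defaultdict

# Hypothetical (user_id, movie_id, rating) triples
ratings = [
    (1, "A", 5), (2, "A", 4), (3, "A", 5),
    (1, "B", 3), (2, "B", 2),
    (3, "C", 5),
    (1, "D", 4), (2, "D", 4), (3, "D", 4),
]

def popularity_scores(ratings):
    """Mean rating of each item over all users."""
    totals = defaultdict(lambda: [0, 0])  # item -> [sum of ratings, count]
    for _, item, rating in ratings:
        totals[item][0] += rating
        totals[item][1] += 1
    return {item: total / count for item, (total, count) in totals.items()}

def recommend(user_id, ratings, k=2):
    # user_id is deliberately ignored: popularity is the same for everyone
    scores = popularity_scores(ratings)
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]

print(recommend(1, ratings), recommend(2, ratings))
```

Every user gets the same list. Note also that item "C" ranks first on the strength of a single 5-star rating, a known weakness of ranking by mean rating alone.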
Surprisingly, such approach still works in places like news portals. Whenever you login
to say bbcnews, you’ll see a column of “Popular News” which is subdivided into sections
and the most read articles of each sections are displayed. This approach can work in
this case because:
There is division by section so user can look at the section of his interest.
At a time there are only a few hot topics and there is a high chance that a user
wants to read the news which is being read by most others
Incorporates personalization
It can work even if the user’s past history is short or not available
But has some major drawbacks as well because of which it is not used much in practice:
The features might actually not be available or even if they are, they may not be
sufficient to make a good classifier
As the number of users and items grow, making a good classifier will become
exponentially difficult
Case 3: Recommendation Algorithms
Now let’s come to the special class of algorithms which are tailor-made for solving the
recommendation problem. There are typically two types of algorithms – Content Based
and Collaborative Filtering. You should refer to our previous article to get a complete
sense of how they work. I’ll give a short recap here.
Let’s load this data into Python. There are many files in the ml-100k.zip file which we
can use. Let’s load the three most important files to get a sense of the data. I also
recommend reading the readme document, which gives a lot of information about the
different files.
import pandas as pd
# pass in column names for each CSV and read them using pandas.
encoding='latin-1')
i_cols = ['movie id', 'movie title' ,'release date','video release date', 'IMDb UR
r', 'Western']
encoding='latin-1')
Now let’s take a peek into the content of each file to understand them better.
Users
print(users.shape)
users.head()
This reconfirms that there are 943 users and we have 5 features for each, namely their
unique ID, age, gender, occupation and the zip code they are living in.
Ratings
print(ratings.shape)
ratings.head()
This confirms that there are 100K ratings for different user and movie combinations. Also
notice that each rating has a timestamp associated with it.
Items
print(items.shape)
items.head()
This dataset contains attributes of the 1682 movies. There are 24 columns, of which
the last 19 specify the genre of a particular movie: a value of 1 denotes that the
movie belongs to that genre, and 0 otherwise.
Now we have to divide the ratings data set into test and train data for making models.
Luckily, GroupLens provides pre-divided data wherein the test data has 10 ratings for
each user, i.e. 9430 rows in total. Let’s load that:
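# Assumptions: the ua.base/ua.test split and the 'ml-100k/' folder
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_base = pd.read_csv('ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')
ratings_base.shape, ratings_test.shape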
import graphlab
train_data = graphlab.SFrame(ratings_base)
test_data = graphlab.SFrame(ratings_test)
We can use this data for training and testing. We have now gathered all the data
available. Note that here we have user behaviour as well as attributes of the users and
movies. So we can make content based as well as collaborative filtering algorithms.
Arguments:
Let’s use this model to make top 5 recommendations for the first 5 users and see what
comes out:
#Get recommendations for first 5 users and print them
popularity_recomm = popularity_model.recommend(users=range(1,6),k=5)
popularity_recomm.print_rows(num_rows=25)
Did you notice something? The recommendations for all users are the same –
1500,1201,1189,1122,814 in the same order. This can be verified by checking the
movies with the highest mean ratings in our ratings_base data set:
ratings_base.groupby(by='movie_id')['rating'].mean().sort_values(ascending=False).head(20)
This confirms that all the recommended movies have an average rating of 5, i.e. all the
users who watched the movie gave a top rating. Thus we can see that our popularity
system works as expected. But is it good enough? We’ll analyze it in detail later.
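The idea behind a popularity recommender can be sketched in a few lines of pandas. This is only an illustration of ranking movies by mean rating, not GraphLab's implementation; the toy ratings table and the helper name popularity_recommend are assumptions made for the example.

```python
import pandas as pd

# Toy ratings in the same layout as ratings_base (values are made up).
ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3, 3, 3],
    'movie_id': [10, 20, 10, 30, 20, 30, 40],
    'rating':   [4, 2, 5, 3, 1, 4, 5],
})

def popularity_recommend(ratings, k=2):
    """Return the k movie ids with the highest mean rating (hypothetical helper)."""
    mean_rating = ratings.groupby('movie_id')['rating'].mean()
    return mean_rating.sort_values(ascending=False).head(k).index.tolist()

top_movies = popularity_recommend(ratings, k=2)
# Every user receives the same list, which is why personalization is missing.
recommendations = {user: top_movies for user in ratings['user_id'].unique()}
print(recommendations)
```

In this toy table, movie 40 tops the list on the strength of a single 5-star rating, which hints at why a plain mean can be a fragile measure of popularity.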
To give you a high level overview, this is done by making an item-item matrix in which
we keep a record of the pairs of items which were rated together.
In this case, an item is a movie. Once we have the matrix, we use it to determine the
best recommendations for a user based on the movies he has already rated. Note that
there are a few more things to take care of in an actual implementation which would require
deeper mathematical introspection, which I’ll skip for now.
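As a rough sketch of the item-item idea, the snippet below counts how many users rated each pair of movies together and then scores unseen movies for a user by those counts. The toy data and the function name recommend are assumptions for illustration; a real implementation would normalise the counts with a similarity metric.

```python
from collections import defaultdict
from itertools import combinations

# Toy data: the set of movies each user has rated (made up for illustration).
rated = {
    'u1': {'A', 'B', 'C'},
    'u2': {'A', 'B'},
    'u3': {'B', 'C'},
}

# co_counts[(x, y)] = number of users who rated both x and y.
co_counts = defaultdict(int)
for movies in rated.values():
    for x, y in combinations(sorted(movies), 2):
        co_counts[(x, y)] += 1
        co_counts[(y, x)] += 1

def recommend(user, k=1):
    """Score unseen movies by how often they co-occur with the user's movies."""
    scores = defaultdict(int)
    for seen in rated[user]:
        for (x, y), count in list(co_counts.items()):
            if x == seen and y not in rated[user]:
                scores[y] += count
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda m: (-scores[m], m))[:k]

print(co_counts[('A', 'B')])
print(recommend('u2'))
```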
I would just like to mention that there are 3 types of item similarity metrics supported by
graphlab. These are:
1. Jaccard Similarity:
o Similarity is based on the number of users who have rated both item A and B
divided by the number of users who have rated either A or B
o It is typically used where we don’t have a numeric rating but just a boolean
value like a product being bought or an ad being clicked
2. Cosine Similarity:
o Similarity is the cosine of the angle between the item
vectors of A and B
o The closer the vectors, the smaller the angle and the larger the cosine
3. Pearson Similarity:
o Similarity is the Pearson coefficient between the two vectors.
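The three metrics can be written out in plain Python as follows. This is a minimal sketch on made-up ratings; choices such as treating missing ratings as 0 for the cosine and restricting Pearson to common users are assumptions here, and GraphLab's exact definitions may differ.

```python
import math

# Toy ratings: user -> rating for two items A and B (made-up values).
ratings_a = {'u1': 5, 'u2': 3, 'u3': 4, 'u4': 1}
ratings_b = {'u2': 2, 'u3': 5, 'u4': 1, 'u5': 4}

def jaccard(a, b):
    """Users who rated both items divided by users who rated either."""
    users_a, users_b = set(a), set(b)
    return len(users_a & users_b) / len(users_a | users_b)

def cosine(a, b):
    """Cosine of the angle between rating vectors; missing ratings count as 0."""
    users = set(a) | set(b)
    dot = sum(a.get(u, 0) * b.get(u, 0) for u in users)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation over the users who rated both items."""
    common = sorted(set(a) & set(b))
    xs = [a[u] for u in common]
    ys = [b[u] for u in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print(jaccard(ratings_a, ratings_b))
print(cosine(ratings_a, ratings_b))
print(pearson(ratings_a, ratings_b))
```

Jaccard ignores the rating values entirely, which is why it suits boolean data, while cosine and Pearson both use the numeric ratings.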
#Train Model
#Make Recommendations:
item_sim_recomm = item_sim_model.recommend(users=range(1,6),k=5)
item_sim_recomm.print_rows(num_rows=25)
Here we can see that the recommendations are different for each user. So,
personalization exists. But how good is this model? We need some means of evaluating
a recommendation engine. Let’s focus on that in the next section.
5. Evaluating Recommendation Engines
For evaluating recommendation engines, we can use the concept of precision-recall.
You must be familiar with this in terms of classification and the idea is very similar. Let
me define them in terms of recommendations.
Recall:
o What ratio of the items that a user likes were actually recommended?
o If a user likes, say, 5 items and the recommendation engine decided to show 3 of
them, then the recall is 0.6
Precision:
o Out of all the recommended items, how many did the user actually like?
o If 5 items were recommended to the user, out of which he liked, say, 4 of
them, then the precision is 0.8
Now if we think about recall, how can we maximize it? If we simply recommend all the
items, they will definitely cover the items which the user likes. So we have 100% recall!
But think about precision for a second. If we recommend, say, 1000 items and the user likes
only, say, 10 of them, then the precision is 1%. This is really low. Our aim is to maximize
both precision and recall.
An ideal recommender system is one which only recommends the items which the user
likes. So in this case precision=recall=1. This is an optimal recommender and we should
try and get as close to it as possible.
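The two definitions can be checked with a small helper. The function name precision_recall and the toy item sets are assumptions for illustration; the first call mirrors the recall example with 5 liked items, and the second mirrors recommending a whole 1000-item catalogue.

```python
def precision_recall(recommended, liked):
    """Return (precision, recall) for one user's recommendation list."""
    recommended, liked = set(recommended), set(liked)
    hits = len(recommended & liked)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(liked) if liked else 0.0
    return precision, recall

# The user likes 5 items and 3 of them are shown.
liked = {'a', 'b', 'c', 'd', 'e'}
few = precision_recall(['a', 'b', 'c'], liked)
print(few)

# Recommending the whole catalogue: 1000 items, of which the user likes 10.
everything = precision_recall(range(1000), set(range(10)))
print(everything)
```

Recommending everything drives recall to 1 but precision to 10/1000, which is the trade-off described above.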
Let’s compare both the models we have built till now based on precision-recall
characteristics:
model_performance = graphlab.compare(test_data, [popularity_model, item_sim_model])
graphlab.show_comparison(model_performance,[popularity_model, item_sim_model])
Here we can make 2 very quick observations:
1. The item similarity model is definitely better than the popularity model (by at least
10x)
2. On an absolute level, even the item similarity model appears to have a poor
performance. It is far from being a useful recommendation system.
There is a lot of scope for improvement here. But I leave it up to you to figure out how to
improve this further. I would like to give a couple of tips:
In the end, I would like to mention that along with GraphLab, you can also use some
other open source Python packages like the following:
Crab
Surprise
Python Recsys
MRec
End Notes
In this article, we traversed through the process of making a basic recommendation
engine in Python using GraphLab. We started by understanding the fundamentals of
recommendations. Then we went on to load the MovieLens 100K data set for the
purpose of experimentation.
Subsequently we made a first model as a simple popularity model in which the most
popular movies were recommended for each user. Since this lacked personalization, we
made another model based on collaborative filtering and observed the impact of
personalization.
You first have to understand it, collect it from various sources and arrange it in a format
which is ready for processing. This is even more difficult when the data is in an
unstructured format such as image or audio. This is so because you would have to
represent image/audio data in a standard way for it to be useful for analysis.
Also the body language of the person can show you many more features about a person,
because actions speak louder than words! So in short, unstructured data is complex but
processing it can reap easy rewards.
In this article, I intend to cover an overview of audio / voice processing with a case study
so that you would get a hands-on introduction to solving audio processing problems.
Table of Contents
What do you mean by Audio data?
o Applications of Audio Processing
Data Handling in Audio domain
Let’s solve the UrbanSound challenge!
Intermission: Our first submission
Let’s solve the challenge! Part 2: Building better models
Future Steps to explore
So can you somehow catch this audio floating all around you to do something
constructive? Yes, of course! There are devices built which help you catch these sounds
and represent them in a computer-readable format. Examples of these formats are
If you give a thought to what audio looks like, it is nothing but a wave-like format of
data, where the amplitude of the audio changes with respect to time. This can be pictorially
represented as follows.
Here’s an exercise for you; can you think of an application of audio processing that can
potentially help thousands of lives?
The first step is to actually load the data into a machine-understandable format. For this,
we simply take values at fixed time steps. For example, in a 2-second audio
file, we extract values every half a second. This is called sampling of audio data, and the
rate at which it is sampled is called the sampling rate.
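The 2-second example can be reproduced with NumPy. This is a minimal sketch: the sine wave is a made-up stand-in for a real recording, and real audio is sampled far more densely (thousands of values per second) than the two values per second used here.

```python
import numpy as np

duration = 2.0       # seconds, as in the example above
sampling_rate = 2    # samples per second, i.e. one value every half second

# Time stamps at which the signal is read.
times = np.arange(int(duration * sampling_rate)) / sampling_rate

# Hypothetical "audio": a 0.25 Hz sine wave standing in for a real recording.
samples = np.sin(2 * np.pi * 0.25 * times)

print(times)
print(samples)
```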
Another way of representing audio data is by converting it into a different domain of data
representation, namely the frequency domain. When we sample audio data in the time domain, we
require many more data points to represent the whole signal, and the sampling rate
should be as high as possible.
On the other hand, if we represent audio data in the frequency domain, much less
computational space is required. To get an intuition, take a look at the image below.
Here, we separate one audio signal into 3 different pure signals, which can now be
represented as three unique values in the frequency domain.
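The same decomposition can be tried numerically with a Fourier transform. In this sketch the three tone frequencies, their amplitudes and the sampling rate are made-up values; the point is only that 1000 time-domain samples collapse to three dominant frequency values.

```python
import numpy as np

sampling_rate = 1000                       # samples per second (assumed)
t = np.arange(sampling_rate) / sampling_rate   # one second of time stamps

# Mix of three pure tones; the frequencies and amplitudes are made up.
freqs = [50, 120, 300]
amps = [1.0, 0.5, 0.25]
signal = sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs, amps))

# Real FFT: one coefficient per frequency bin, scaled to tone amplitude.
spectrum = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)
bin_freqs = np.fft.rfftfreq(len(signal), d=1.0 / sampling_rate)

# The three largest peaks recover the three tones.
peaks = np.sort(bin_freqs[np.argsort(spectrum)[-3:]])
print(peaks)
```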
There are a few more ways in which audio data can be represented, for example, using
MFCs (Mel-Frequency Cepstrums; we will cover this in a later article). These are
nothing but different ways to represent the data.
Now the next step is to extract features from these audio representations, so that our
algorithm can work on these features and perform the task it is designed for. Here’s a
visual representation of the categories of audio features that can be extracted.
After extracting these features, they are then sent to the machine learning model for further
analysis.
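To make feature extraction concrete, here is a small sketch that turns each clip into a two-number feature vector. Zero-crossing rate and RMS energy are chosen here only as simple illustrative features (an assumption, not necessarily the categories shown in the figure), and the two synthetic tones stand in for real clips.

```python
import numpy as np

def zero_crossing_rate(samples):
    """Fraction of consecutive sample pairs whose sign differs."""
    signs = np.sign(samples)
    return np.mean(signs[:-1] != signs[1:])

def rms_energy(samples):
    """Root-mean-square amplitude of the clip."""
    return np.sqrt(np.mean(samples ** 2))

sampling_rate = 1000
t = np.arange(sampling_rate) / sampling_rate
low = np.sin(2 * np.pi * 5 * t + 0.1)           # 5 Hz tone
high = 0.5 * np.sin(2 * np.pi * 50 * t + 0.1)   # quieter 50 Hz tone

# One row per clip, one column per feature: this is what a model would consume.
features = np.array([[zero_crossing_rate(x), rms_energy(x)] for x in (low, high)])
print(features)
```

The higher-pitched tone crosses zero more often and the quieter tone has lower energy, so even these two numbers separate the clips.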
The dataset contains 8732 sound excerpts (<=4s) of urban sounds from 10 classes,
namely:
air conditioner,
car horn,
children playing,
dog bark,
drilling,
engine idling,
gun shot,
jackhammer,
siren, and
street music
Here’s a sound excerpt from the dataset. Can you guess which class it belongs to?
To play this in the Jupyter notebook, you can simply follow along with the code.
import IPython.display as ipd
ipd.Audio('../data/Train/2022.wav')
Now let us load this audio in our notebook as a numpy array. For this, we will use the librosa
library in Python. To install librosa, just type this in the command line:
pip install librosa
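When you load the data, it gives you two objects: a numpy array of an audio file and the
corresponding sampling rate by which it was extracted. Now to represent this as a
waveform (which it originally is), use the following code:
%pylab inline
import os
import glob
import pandas as pd
import librosa
import librosa.display  # the display submodule must be imported explicitly

# load the clip played above: returns the samples and the sampling rate
data, sampling_rate = librosa.load('../data/Train/2022.wav')

plt.figure(figsize=(12, 4))
# waveplot is named waveshow in newer librosa releases
librosa.display.waveplot(data, sr=sampling_rate)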
The output comes out as follows
Let us now visually inspect our data and see if we can find patterns in the data
Class: jackhammer
Class: drilling
Class: dog_barking
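We can see that it may be difficult to differentiate between jackhammer and drilling, but it
is still easy to discern between dog_barking and drilling. To see more such examples,
you can use this code:
import random
# assumes `train` is the training table with ID and Class columns
i = random.choice(train.index)
audio_name = train.ID[i]
print('Class:', train.Class[i])
x, sr = librosa.load('../data/Train/' + str(audio_name) + '.wav')
plt.figure(figsize=(12, 4))
librosa.display.waveplot(x, sr=sr)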
Intermission: Our first submission
We will take a similar approach as we did for the Age detection problem: look at the class
distribution and simply predict the most frequent class for all test cases.
train.Class.value_counts(normalize=True)
Out[10]:
jackhammer          0.122907
engine_idling       0.114811
siren               0.111684
dog_bark            0.110396
air_conditioner     0.110396
children_playing    0.110396
street_music        0.110396
drilling            0.110396
car_horn            0.056302
gun_shot            0.042318
We see that jackhammer class has more values than any other class. So let us create
our first submission with this idea.
test = pd.read_csv('../data/test.csv')
test['Class'] = 'jackhammer'
test.to_csv('sub01.csv', index=False)
This seems like a good idea as a benchmark for any challenge, but for this problem, it
seems a bit unfair. This is so because the dataset is not very imbalanced.
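The benchmark can be sketched as a tiny function that predicts the most frequent training label everywhere and measures how often that is right. The helper name majority_baseline and the toy label lists are assumptions for illustration; with the real class shares shown above, such a baseline would be right only about 12% of the time if the test set follows the same distribution.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test item."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    predictions = [majority] * len(test_labels)
    correct = sum(p == t for p, t in zip(predictions, test_labels))
    return majority, correct / len(test_labels)

# Toy labels (made up), only mildly imbalanced like the class shares above.
train_labels = ['jackhammer'] * 12 + ['siren'] * 11 + ['dog_bark'] * 11 + ['gun_shot'] * 4
test_labels = ['jackhammer'] * 3 + ['siren'] * 3 + ['dog_bark'] * 3 + ['gun_shot'] * 1

majority, accuracy = majority_baseline(train_labels, test_labels)
print(majority, accuracy)
```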
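import numpy as np
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.utils import np_utils

def parser(row):
    # load one training clip; the path pattern is assumed from '../data/Train/2022.wav'
    file_name = os.path.join('../data/Train', str(row.ID) + '.wav')
    try:
        X, sample_rate = librosa.load(file_name)
        # assumed feature: 40 MFCCs averaged over time, matching input_shape=(40,) below
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    except Exception as e:
        print('Error encountered while parsing file:', file_name)
        return None, None
    feature = mfccs
    label = row.Class
    return [feature, label]

# build a two-column table of features and labels (assumes `train` is the training table)
temp = pd.DataFrame([parser(row) for _, row in train.iterrows()], columns=['feature', 'label'])

X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())
# encode class names as integers, then as one-hot vectors
lb = LabelEncoder()
y = np_utils.to_categorical(lb.fit_transform(y))

num_labels = y.shape[1]
filter_size = 2
# build model
model = Sequential()
model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_labels))
model.add(Activation('softmax'))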
Seems ok, but the score can obviously be increased. (PS: I could get an accuracy of 80%
on my validation dataset). Now it’s your turn: can you improve on this score? If you do,
let me know in the comments below!
1. We applied a simple neural network model to the problem. Our immediate next
step should be to understand where the model fails and why. By this, we
want to build an understanding of the failures of the algorithm so that the
next time we build a model, it does not make the same mistakes.
2. We can build more efficient models than our “better models”, such as
convolutional neural networks or recurrent neural networks. These models have
been proven to solve such problems with greater ease.
3. We touched on the concept of data augmentation, but we did not apply it here.
You could try it to see if it works for the problem.
End Notes
In this article, I have given a brief overview of audio processing with a case study on the
UrbanSound challenge. I have also shown the steps you perform when dealing with
audio data in Python with the librosa package. With this “shastra” in your hands, I hope you
will try your own algorithms in the UrbanSound challenge, or try solving your own audio
problems in daily life.