
Problem 2: Text Mining

A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch
to the VC sharks.

You will ONLY use “Description” column for the initial text mining exercise.

1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.

First, we load the dataset into Python as a data frame and inspect it to see what the data looks
like. The aim is to understand, from the Description column, what distinguishes the pitches that
secured a deal.

The Deal column is the dependent variable: it records whether the entrepreneur secured a deal with
their pitch. We therefore separate the dependent variable (Deal) and the column to be text mined
(Description) from the rest of the columns.

Before that, we check whether any column consists mostly of null values (more than 90%), so that
such columns can be removed from the data frame. None are found.

We then create a new data frame (data) containing only the Deal and Description columns from the
original data frame.
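
The null check and column selection described above could look roughly like the following. This is
a minimal sketch only: the small data frame, the column names deal and description, and the extra
episode column are stand-ins, since the report does not show the actual file or its exact headers.

```python
import pandas as pd

# Hypothetical stand-in for the Shark Tank data (all names are assumptions).
df = pd.DataFrame({
    "deal": [True, False, True],
    "description": [
        "Easy snack for children",
        "Premium water bottle",
        "Online design tool",
    ],
    "episode": [1, 1, 2],
})

# Fraction of null values per column; keep columns with at most 90% nulls.
null_fraction = df.isnull().mean()
df = df.loc[:, null_fraction <= 0.90]

# Keep only the dependent variable and the text column.
data = df[["deal", "description"]].copy()
print(data.shape)
```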

2. Create two corpora, one with those who secured a Deal, the other with those who did not
secure a deal.

First, we convert the Deal column to the string data type and then create two data frames: one for
those who secured a deal and one for those who did not.

We then group the data by the Deal value and append each group to its own data frame (rows with
Deal = True go into the secured data frame, and rows with Deal = False go into the not-secured
data frame).

Secured Data Frame:

Not Secured Data Frame:


We also remove the Deal column from both corpora, as only the Description column is needed for the
text analysis.
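
One possible way to build the two corpora is sketched below. It uses a boolean mask rather than the
string conversion and grouping described above, so it is an alternative route to the same result and
not necessarily the code used for this report. The data frame contents and column names are
illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the two-column data frame (names are assumptions).
data = pd.DataFrame({
    "deal": [True, False, True, False],
    "description": [
        "Easy snack for children",
        "Premium water bottle",
        "Online design tool",
        "Device that helps you sleep",
    ],
})

# Boolean-mask split; selecting only "description" drops the deal column.
secured = data.loc[data["deal"], ["description"]].reset_index(drop=True)
not_secured = data.loc[~data["deal"], ["description"]].reset_index(drop=True)

print(len(secured), len(not_secured))
```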

Deals which are secured:

Deals which are not secured:


3. The following exercise is to be done for both the corpora:

a) Find the number of characters for both the corpora.

The number of characters in each corpus is given below.

Number of characters in the corpus where a deal was secured = 50302

Number of characters in the corpus where a deal was not secured = 34899
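
A character count of this kind can be computed as sketched below. The report does not state whether
spaces and punctuation were counted or whether the descriptions were joined first, so this sketch
simply assumes the count is the sum of the lengths of the individual descriptions, spaces included.
The sample texts are made up.

```python
def corpus_char_count(descriptions):
    """Total number of characters across all descriptions (spaces included)."""
    return sum(len(text) for text in descriptions)


# Made-up sample corpus, not the Shark Tank data.
sample_corpus = ["Easy snack", "Design tool"]
print(corpus_char_count(sample_corpus))  # 10 + 11 = 21
```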

b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and
‘company’ are to be removed)

Words mentioned in the question are also added to the stop words list.
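
A minimal sketch of the stop word removal is shown below. The base stop word set here is a tiny
illustrative list and is an assumption; the actual analysis presumably started from a full English
stop word list supplied by a text-mining library. The custom words are the ones named in the
question.

```python
import re

# Tiny illustrative base list (assumption); a real run would use a full list.
BASE_STOP_WORDS = {"a", "an", "and", "the", "to", "of", "for", "is", "in", "that", "it", "with"}

# Extra words that the question asks to remove.
CUSTOM_STOP_WORDS = {"also", "made", "makes", "like", "this", "even", "company"}

STOP_WORDS = BASE_STOP_WORDS | CUSTOM_STOP_WORDS


def clean_tokens(text):
    """Lower-case the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


cleaned = clean_tokens("This company also makes a toy that is easy to use")
print(cleaned)  # ['toy', 'easy', 'use']
```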

Cleaned words after removing the stop words (secured deal):


Cleaned words after removing the stop words (non-secured deals):

c) What were the top 3 most frequently occurring words in both corpora (after removing stop
words)?

Before cleaning the stop words:

The top 3 most frequently occurring words in the positive-deal corpus are ‘and’ (352), ‘the’ (249)
and ‘to’ (212).

The top 3 most frequently occurring words in the negative-deal corpus are ‘the’ (228), ‘and’ (206)
and ‘a’ (155).

After cleaning the stop words:

The most frequently occurring words in the positive-deal corpus are ‘products’ (18), followed by
‘easy’, ‘children’ and ‘make’, which are tied at 16 each.

The most frequently occurring words in the negative-deal corpus are ‘make’ (16), followed by
‘product’, ‘system’ and ‘online’, which are tied at 12 each.
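
Because several words share the same count, a plain "top 3" cut-off would drop some of the tied
words. The sketch below shows one way to count words and keep every word tied with the third
place. The helper name top_n_with_ties and the sample tokens are illustrative assumptions.

```python
from collections import Counter


def top_n_with_ties(tokens, n=3):
    """Return the n most common (word, count) pairs, plus any words tied with the n-th."""
    counts = Counter(tokens).most_common()
    if len(counts) <= n:
        return counts
    cutoff = counts[n - 1][1]
    return [(word, count) for word, count in counts if count >= cutoff]


# Made-up tokens: one clear leader and a three-way tie behind it.
sample_tokens = ["make"] * 3 + ["easy"] * 2 + ["toy"] * 2 + ["kids"] * 2 + ["app"]
top_words = top_n_with_ties(sample_tokens, n=3)
print(top_words)
```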

Before cleaning:

After cleaning:

d) Plot the Word Cloud for both the corpora.

Positive corpus:

Negative corpus:

4. Refer to both the word clouds. What do you infer?

The 'Secured a deal' word cloud contains words such as 'one', 'design', 'free', 'children', 'offer',
'easy', 'online' and 'use'. These suggest that pitches aimed at children, that included offers or a
free sample/product, that were easy to use, that had a good design and that were unique in their
creativity are more likely to secure a deal. In particular, products intended for children's use
appear to secure a deal in most cases.

The 'Did not secure a deal' word cloud contains words such as 'one', 'designed', 'help', 'device',
'bottle', 'premium', 'use' and 'service'. These suggest that pitches with a mediocre design, that
were less suited to solving a problem, that involved water bottles, that carried a premium price tag
or that offered less usability are less likely to secure a deal. The prominence of 'service' also
suggests that the service side of these products may not have been strong enough to secure a deal.

It is also observed that words such as 'one', 'designed', 'system' and 'use' carry a high weight in
both word clouds. This indicates either that these words were not defining factors in whether a
deal was made, or that they were used in a different context in the descriptions of each group.
5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less
likely to secure a deal based on your analysis?

The word 'device' is hard to find in the 'secured a deal' word cloud, while it is easily spotted in
the 'not secured a deal' word cloud.

This indicates that 'device' occurred more frequently in the descriptions of pitches that were
rejected. On the basis of the word clouds, the statement in the question therefore appears to be
true: entrepreneurs who introduced devices are less likely to secure a deal, although the word
clouds alone do not reveal the underlying reasons.
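
The visual impression from the word clouds can be cross-checked numerically by comparing the share
of descriptions that mention 'device' in each corpus. The sketch below is only an illustration of
that check: the helper name and the sample descriptions are assumptions, and the numbers it prints
say nothing about the real data.

```python
import re


def share_mentioning(descriptions, word):
    """Fraction of descriptions that contain `word` as a whole token."""
    if not descriptions:
        return 0.0
    word = word.lower()
    hits = sum(
        1 for text in descriptions
        if word in re.findall(r"[a-z']+", text.lower())
    )
    return hits / len(descriptions)


# Made-up corpora for illustration only.
secured_sample = ["Easy snack for children", "Online design tool"]
not_secured_sample = ["A device that tracks sleep", "Premium water bottle", "Device for pets"]

print(share_mentioning(secured_sample, "device"))
print(share_mentioning(not_secured_sample, "device"))
```

Note that this exact-token match would not count the plural 'devices'; whether singular and plural
forms should be merged is a choice the analyst has to make.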

Problem 1: Machine Learning Models

You work for an office transport company. You are in discussions with ABC Consulting Company about
providing transport for their employees. For this purpose, you are tasked with understanding how
the employees of ABC Consulting currently prefer to commute (between home and office). Based on
parameters such as age, salary and work experience given in the data set ‘Transport.csv’, you are
required to predict the preferred mode of transport. The project requires you to build several
Machine Learning models and compare them so that a final model can be chosen.

Data Dictionary

Age : Age of the Employee in Years

Gender: Gender of the Employee

Engineer : For Engineer =1 , Non Engineer =0

MBA : For MBA =1 , Non MBA =0

Work Exp : Experience in years

Salary : Salary in Lakhs per Annum

Distance : Distance in Kms from Home to Office

license : If Employee has Driving Licence = 1, If not = 0

Transport : Mode of Transport

The objective is to build various Machine Learning models on this data set and based on the
accuracy metrics decide which model is to be finalised for finally predicting the mode of transport
chosen by the employee.

1. Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and
missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.

We load the required packages, set the working directory and load the file.

The dataset has 444 rows and 9 features. We begin the exploratory data analysis by looking at the
first 5 rows.

Comparing the data with the data dictionary, we can see that two of the features (Gender,
Transport) are of object data type, while the other features are numerical.

We convert these two categorical variables into numerical variables by encoding them.
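
The encoding step could be done with explicit mappings, as sketched below. The Transport codes follow
the convention used later in this report (Private Transport = 0, Public Transport = 1). The Gender
codes and the exact label strings in the file are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows; the label strings are assumptions about Transport.csv.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Transport": ["Public Transport", "Private Transport", "Public Transport"],
})

# Explicit mappings keep the meaning of each code visible.
gender_codes = {"Female": 0, "Male": 1}  # assumed coding
transport_codes = {"Private Transport": 0, "Public Transport": 1}

df["Gender"] = df["Gender"].map(gender_codes)
df["Transport"] = df["Transport"].map(transport_codes)

print(df.dtypes)
```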


Descriptive Analysis:

We can see that the dataset has more male employees than female employees, and that most employees
travel by Public Transport. The mean age of the employees is 27, and the mean distance from home
to office is 11 km.

Missing Value and Outlier Treatment:

There are no missing values in the dataset. There are some outliers in the Salary, Distance and
Work Exp variables, but we do not treat them in this exercise: they are not extreme values that
would affect data quality, only values slightly above the upper limit, which is acceptable.
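
The "upper limit" mentioned above is presumably the box-plot whisker, that is, the third quartile
plus 1.5 times the interquartile range. Under that assumption, outliers can be flagged as sketched
below. The salary values are made up.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up salaries (in lakhs per annum) with one clearly high value.
salary = pd.Series([8, 9, 10, 11, 12, 13, 14, 45])
mask = iqr_outlier_mask(salary)
print(salary[mask].tolist())  # [45]
```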

Univariate Analysis:

The univariate analysis of each feature shows that Age and Distance appear to follow an
approximately normal distribution, as per their histograms. The Licence variable is related to
Transport, which means that the people who travel by Private transport are the people who do not
hold a licence.

Bivariate Analysis:

Employees who use Public transport travel shorter distances than employees who use Private
transport. Employees who have to travel farther tend to prefer Private transport.

Younger employees prefer to travel by Public transport, while older employees travel by Private
transport, as per the data.

Employees who earn less travel by Public transport, and employees who earn more travel by Private
transport, as per the data.
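
Comparisons of this kind can be summarised with group means per mode of transport, as sketched
below. The rows are invented for illustration and do not reproduce the figures in this report.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "Transport": ["Public", "Public", "Private", "Private"],
    "Age": [24, 26, 33, 35],
    "Salary": [8.0, 10.0, 22.0, 30.0],
    "Distance": [8.0, 10.0, 15.0, 17.0],
})

# Mean of each numeric feature for each mode of transport.
group_means = df.groupby("Transport")[["Age", "Salary", "Distance"]].mean()
print(group_means)
```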

Checking correlations:

There is high correlation between Age, Salary and Work Exp: as age increases, work experience and
in turn salary increase. There is also some correlation between the Salary and licence variables.
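
A correlation check of this kind can be run with a pairwise Pearson correlation matrix, as in the
sketch below. The six rows are invented so that the three variables rise together; they are not
taken from Transport.csv.

```python
import pandas as pd

# Invented rows in which age, experience and salary rise together.
df = pd.DataFrame({
    "Age": [22, 25, 28, 31, 35, 40],
    "Work Exp": [0, 2, 5, 8, 12, 18],
    "Salary": [6, 8, 11, 14, 20, 35],
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()
print(corr.round(2))
```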
2. Split the data into train and test in the ratio 70:30. Is scaling necessary or not?

Scaling is generally performed when one variable in the dataset has a much larger magnitude than
the others, which can distort the model. In this data set no variable has a much larger magnitude
than the rest; most values lie in the range 0 to 100. We therefore do not scale the data in this
exercise.

After the exploratory data analysis, we separate the data into independent and dependent variables
and then perform the train-test split in the ratio 70:30.
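
A 70:30 split along these lines can be sketched with scikit-learn's train_test_split. The data frame
below is a made-up stand-in, and the random_state value and the use of stratification are
assumptions, since the report does not state them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in with 20 rows and a balanced binary target.
df = pd.DataFrame({
    "Age": list(range(22, 42)),
    "Distance": [5 + 0.5 * i for i in range(20)],
    "Transport": [0, 1] * 10,
})

# Independent variables (X) and dependent variable (y).
X = df.drop(columns="Transport")
y = df["Transport"]

# 70:30 split; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```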

3. Build the following models on the 70% training data and check the performance of these models
on the Training as well as the 30% Test data using the various inferences from the Confusion Matrix
and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for
optimum performance: a. Logistic Regression Model b. Linear Discriminant Analysis c. Decision
Tree Classifier – CART model d. Naïve Bayes Model e. KNN Model f. Random Forest Model g.
Boosting Classifier Model using Gradient boost.
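
The overall workflow for the seven models can be sketched as a single loop with scikit-learn. The
data here is synthetic (generated with make_classification to have 444 rows and 8 predictors), and
the hyperparameters and random_state values are assumptions, so the numbers it prints will not
match the results reported below.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for Transport.csv: 444 rows, 8 predictors, binary target.
X, y = make_classification(n_samples=444, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Untuned models with mostly default settings (assumed, not the report's settings).
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(random_state=1),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    results[name] = {
        "train_acc": accuracy_score(y_train, model.predict(X_train)),
        "test_acc": accuracy_score(y_test, test_pred),
        # AUC uses the predicted probability of class 1.
        "test_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
        "confusion": confusion_matrix(y_test, test_pred),
    }
    print(f"{name}: train={results[name]['train_acc']:.2f} "
          f"test={results[name]['test_acc']:.2f} auc={results[name]['test_auc']:.2f}")
```

The ROC curve itself can be drawn from the same predicted probabilities using
sklearn.metrics.roc_curve together with a plotting library.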

a. Logistic Regression Model:

Model Accuracy on Train Data: 80%

Model Accuracy on Test Data: 81%

Confusion Matrix and Classification Report:

AUC-ROC Curve:

b. Linear Discriminant Analysis:

Model Accuracy on Train Data: 80%

Model Accuracy on Test Data: 82%

Confusion Matrix and Classification Report:


AUC-ROC Curve:

c. Decision Tree Classifier – CART Model

Model Accuracy on Train Data: 100%

Model Accuracy on Test Data: 82%

Confusion Matrix and Classification Report:


AUC-ROC Curve:

d. Naïve Bayes Model

Model Accuracy on Train Data: 79%

Model Accuracy on Test Data: 79%

Confusion Matrix and Classification Report:


AUC-ROC Curve:

e. KNN Model

Model Accuracy on Train Data: 80%

Model Accuracy on Test Data: 77%

Confusion Matrix and Classification Report:


AUC-ROC Curve:

f. Random Forest Model

Model Accuracy on Train Data: 85%

Model Accuracy on Test Data: 80%

Confusion Matrix and Classification Report:


AUC-ROC Curve:

g. Boosting Classifier Model using Gradient boost.

Model Accuracy on Train Data: 96%

Model Accuracy on Test Data: 76%

Confusion Matrix and Classification Report:


AUC-ROC Curve:
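
Where a model scores much higher on the training data than on the test data (for example the CART
model above, with 100% against 82%), tuning can be done with a cross-validated grid search. The
sketch below shows the idea for a decision tree on synthetic data. The parameter grid, the number
of folds and the random_state values are assumptions, not the settings used for this report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the transport data.
X, y = make_classification(n_samples=444, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Small illustrative grid that limits how far the tree can grow.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_
print(search.best_params_)
print(round(best_tree.score(X_train, y_train), 2), round(best_tree.score(X_test, y_test), 2))
```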

4. Which model performs the best?

The models that performed best on each metric are listed below:

Accuracy Scores – Linear Discriminant Analysis

ROC AUC Scores – Logistic Regression

Recall value of Private Transport (0’s) – Logistic Regression and Linear Discriminant Analysis

Recall value of Public Transport (1’s) – Tuned Random Forest Classifier

F1 score of Private Transport (0’s) – Linear Discriminant Analysis

F1 score of Public Transport (1’s) – Linear Discriminant Analysis

The Linear Discriminant Analysis model performs well across all the performance metrics, followed
by the Logistic Regression model.

Based on the above results, we select the Linear Discriminant Analysis (LDA) model as the best
model.
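
The accuracy, recall and F1 values compared above all derive from the confusion matrix. The sketch
below shows the arithmetic for a 2x2 matrix with Private Transport as class 0 and Public Transport
as class 1. The counts are hypothetical and are not the results of any model in this report.

```python
def class_metrics(matrix, positive):
    """Precision, recall and F1 for one class of a 2x2 confusion matrix.

    matrix[i][j] is the number of rows with actual class i predicted as class j.
    """
    negative = 1 - positive
    tp = matrix[positive][positive]
    fn = matrix[positive][negative]
    fp = matrix[negative][positive]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(matrix):
    """Share of all rows that lie on the diagonal of the matrix."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical counts: rows are actual classes, columns are predicted classes.
cm = [[30, 10],
      [14, 80]]

print(round(accuracy(cm), 3))
print([round(v, 3) for v in class_metrics(cm, positive=0)])
print([round(v, 3) for v in class_metrics(cm, positive=1)])
```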

5. What are your business insights?

Some of the business insights from EDA as well as model building are:

• Employees who live closer to the office prefer Public Transport, whereas employees who live
farther away prefer Private Transport.
• Employees with more experience, and in turn a higher salary, prefer to travel by their
own/private transport rather than public transport.
• There is a high positive correlation between the Age, Work Experience and Salary variables,
which in turn are related to the target variable (Transport).
• Male employees tend to have a licence, while female employees seem to use Public Transport, as
they do not have a licence.
• The LDA model provided the best results in predicting the output variable (Transport), with an
accuracy of 83%, followed by the Logistic Regression model.
