Nirupam Agarwal Business Report-ML

BUSINESS REPORT ON
MACHINE LEARNING
NAME: NIRUPAM AGARWAL
BATCH- MAY, 2021
TABLE OF CONTENTS
List of Figures
Problem 1
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
BIVARIATE ANALYSIS
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
Problem 2
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the
top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)
List of Figures
Figure 1 Head and tail of the dataset
Figure 2 Info of the dataset
Figure 3 Summary statistics of the dataset
Figure 4 Summary of categorical Variables
Figure 5 Missing values in the dataset
Figure 6 Dropped column
Figure 7 Object count
Figure 8 Univariate analysis of age, economic.cond.national, economic.cond.household, Blair.
Figure 9 Univariate analysis of Hague, Europe, and Political knowledge.
Figure 10 Skewness of the above attributes
Figure 11 Details of categorical attribute ‘vote’
Figure 12 Details of categorical attribute ‘Gender’
Figure 13 Pair plot of the dataset
Figure 14 Correlation of the dataset
Figure 15 Heat map
Figure 16 Encoded dataset
Figure 17 Encoding of categorical attributes.
Figure 18 LDA Train dataset performance
Figure 19 LDA Test dataset performance
Figure 20 LR Train dataset performance
Figure 21 LR Test dataset performance
Figure 22 Naïve Bayes Train dataset performance
Figure 23 Naïve Bayes Test dataset performance
Figure 24 KNN Train dataset performance
Figure 25 KNN Test dataset performance
Figure 26 Bagging Train dataset performance
Figure 27 Bagging Test dataset performance
Figure 28 Ada boosting Train dataset performance
Figure 29 Ada boosting Test dataset performance
Figure 30 Gradient Boosting Train dataset performance
Figure 31 Gradient Boosting Test dataset performance
Figure 32 Stopwords
Figure 33 Words
Figure 34 Word Cloud
Figure 35 Words
Figure 37 Words
Problem 1
Problem statement:
You are hired by one of the leading news channels CNBE who wants to analyze recent elections.
This survey was conducted on 1525 voters with 9 variables. You have to build a model, to predict
which party a voter will vote for on the basis of the given information, to create an exit poll that
will help in predicting overall win and seats covered by a particular party.
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the
null values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate
Analysis.
First, import all the necessary library packages and load the dataset into a pandas DataFrame.
EDA approach:
1) Descriptive Analytics
2) Data pre-processing
3) Data visualization
4) Data preparation
1. Import the required libraries. Check the head and tail of the dataset. Find the shape and info
of the dataset.
Figure 1 Head and tail of the dataset
2. Shape of the dataset: The number of rows and columns are: (1525, 10)
3. Info and Summary of the dataset:
Figure 2 Info of the dataset
The info shows 10 columns (the 9 survey variables plus the index column Unnamed: 0), all with
non-null values. The attributes Unnamed: 0, age, economic.cond.national, economic.cond.household,
Blair, Hague, Europe and political.knowledge are integer data types.
The attributes vote and gender are object data types. There are no missing values. The index runs
from 0 to 1524, for a total of 1525 entries.
Figure 3 Summary statistics of the dataset
The count is 1525 for all the attributes.
The mean and median differ slightly, so the distributions are not exactly normal.
The gap between the 75th percentile and the maximum value is small, so no attribute has an extreme
upper range.
Summary statistics of categorical variables:
Figure 4 Summary of categorical Variables

For the attribute vote: there are 1525 non-null values and 2 unique values. The most frequent value
is Labour, occurring 1063 times.
For the attribute gender: there are 1525 non-null values and 2 unique values. The most frequent
value is female, occurring 812 times.
4. Duplicate rows: There are 8 duplicate rows in the dataset.
5. Missing values in the dataset:
Figure 5 Missing values in the dataset
The dataset consists of no missing values.
Dropped the column Unnamed: 0 as it holds no significance.
Figure 6 Dropped column
6. Unique counts of all objects:
Figure 7 Object count
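The checks in steps 1 to 6 can be sketched with pandas as below. This is a minimal illustration, not the report's actual notebook: the five rows are made-up values and only the column names follow the report. In this sketch duplicates are counted after the index-like column is dropped, which is an assumption about the order of operations.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are invented, only the
# column names follow the report. The real frame would be loaded from file.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4, 5],
    "vote": ["Labour", "Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 35, 43, 36],
    "economic.cond.national": [3, 4, 4, 3, 4],
    "economic.cond.household": [3, 4, 4, 3, 4],
    "Blair": [4, 4, 5, 4, 4],
    "Hague": [1, 4, 2, 1, 4],
    "Europe": [2, 5, 3, 2, 5],
    "political.knowledge": [2, 2, 2, 2, 2],
    "gender": ["female", "male", "male", "female", "male"],
})

print(df.head(), df.tail(), sep="\n")      # first and last rows
print(df.shape)                             # (rows, columns)
df.info()                                   # dtypes and non-null counts
print(df.describe())                        # summary of numeric columns
print(df[["vote", "gender"]].describe())    # count, unique, top, freq

missing = df.isnull().sum()                 # missing values per column

# Drop the index-like column, then count fully repeated rows.
df = df.drop(columns="Unnamed: 0")
n_duplicates = int(df.duplicated().sum())

# Frequency of each category in the string columns.
for col in ["vote", "gender"]:
    print(df[col].value_counts())
```

Counting duplicates before dropping Unnamed: 0 would report none whenever that column is unique per row, so the order of the two steps changes the result.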
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers.
UNIVARIATE ANALYSIS
For each numeric attribute we plot a histogram to see the distribution and a boxplot to check for
outliers (using histplot and boxplot respectively).
1) Attributes
The age attribute has a skewness of 0.144, i.e. it is more or less symmetric.
There are no outliers in the age attribute’s boxplot.
The economic.cond.national attribute has a skewness of -0.24, i.e. it is more or less symmetric.
There are outliers in this attribute’s boxplot.
The economic.cond.household attribute has a skewness of -0.14, i.e. it is more or less symmetric.
There are outliers in this attribute’s boxplot.
The Blair attribute has a skewness of -0.535, i.e. it is moderately left-skewed.
There are outliers in this attribute’s boxplot.
The Hague attribute has a skewness of 0.15, i.e. it is more or less symmetric.
There are no outliers in this attribute’s boxplot.
The Europe attribute has a skewness of -0.135, i.e. it is more or less symmetric.
The political.knowledge attribute has a skewness of -0.42, i.e. it is slightly left-skewed.
Figure 8 Univariate analysis of age, economic.cond.national, economic.cond.household, Blair.
Figure 9 Univariate analysis of Hague, Europe, and Political knowledge.
Figure 10 Skewness of the above attributes
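The skewness values and the boxplot outlier check above can be reproduced numerically. The sketch below assumes the usual boxplot rule (points beyond 1.5 times the interquartile range from the quartiles) and uses a made-up series of ratings, not the survey data.

```python
import pandas as pd

# Made-up ratings on a 1-5 scale; not the survey data.
ratings = pd.Series([1, 3, 3, 3, 4, 4, 4, 4, 5, 5], name="economic.cond.national")

skewness = ratings.skew()   # sample skewness; a negative value means a longer left tail

# Boxplot whiskers reach 1.5 * IQR beyond the quartiles; anything outside is an outlier.
q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ratings[(ratings < lower) | (ratings > upper)]

print(f"skewness={skewness:.3f}, bounds=({lower}, {upper})")
print(outliers)
```

On ordinal ratings with few distinct levels, this rule can flag a legitimate low or high rating as an outlier, so flagged values are worth inspecting before removing them.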
The details of categorical variables
1) Vote
Figure 11 Details of categorical attribute ‘vote’
The vote attribute takes two values, Labour and Conservative. Labour is the most frequent with
1063 records, followed by Conservative with 462.
We can use one-hot encoding to convert this attribute to numeric form for modelling.
2) Gender
Figure 12 Details of categorical attribute ‘Gender’
The gender attribute takes two values, female and male. Female is the most frequent with 812
records, followed by male with 713.
We can use one-hot encoding to convert this attribute to numeric form for modelling.
BIVARIATE ANALYSIS
Plot the pair plot of the dataset.
1. The main diagonal shows the histogram of each attribute (univariate analysis).
We can see several positive correlations in the off-diagonal plots.
Figure 13 Pair plot of the dataset
The main diagonal shows the histogram of each attribute. Several histograms have multiple peaks,
which suggests clusters in the dataset.
Correlation:
Figure 14 Correlation of the dataset
Using a subplot to see the correlation:
Figure 15 Heat map
1. The heat map indicates that no attributes are highly correlated. economic.cond.household and
economic.cond.national are slightly correlated.
2. Blair and economic.cond.national are slightly correlated. Europe and Hague show slight
correlation too.
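The numbers behind a heat map are the pairwise correlation matrix. The sketch below computes it with pandas and ranks the attribute pairs by strength. The data is synthetic and purely illustrative; only the column names are borrowed from the report.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic numeric columns: the second is built from the first so that
# the two are correlated; age is drawn independently.
x = rng.integers(1, 6, size=200)
df_num = pd.DataFrame({
    "economic.cond.national": x,
    "economic.cond.household": np.clip(x + rng.integers(-2, 3, size=200), 1, 5),
    "age": rng.integers(24, 94, size=200),
})

corr = df_num.corr()   # Pearson correlation matrix

# Keep each pair once (strict upper triangle) and rank by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Ranking the pairs this way makes it easy to confirm which relationships the heat map highlights and how weak the remaining ones are.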
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30).
Encode the dataset and replace the numeric columns with their scaled (normalized) values. Scaling
puts the numeric attributes on a common scale, which matters for distance-based models such as KNN.
Figure 16 Encoded dataset
The variables ‘vote’ and ‘gender’ hold string values, so they are converted into numeric values
(encoding).
Figure 17 Encoding of categorical attributes.
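A minimal sketch of the encoding, the 70:30 split and the scaling step follows. The data is a toy frame with invented values, the 0/1 mapping is an assumption, and fitting the scaler on the training part only is a design choice of this sketch rather than something the report states.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame with made-up values; only the column names follow the report.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "age": list(range(30, 50)),
    "Europe": [1, 11, 5, 7] * 5,
    "gender": ["female", "male", "male", "female"] * 5,
})

# Encode the two string columns as 0/1 codes (this mapping is an assumption).
df["vote"] = df["vote"].map({"Conservative": 0, "Labour": 1})
df["gender"] = df["gender"].map({"female": 0, "male": 1})

X = df.drop(columns="vote").astype(float)
y = df["vote"]

# 70:30 split, stratified so both parts keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Fit the scaler on the training part only, then apply it to both parts.
num_cols = ["age", "Europe"]
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train.shape, X_test.shape)
```

Fitting the scaler before the split would let test-set statistics leak into training, which is why the sketch scales after splitting.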
1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis).
1. LDA
Performance metrics on the train dataset
Figure 18 LDA Train dataset performance
Performance metrics on the test dataset
Figure 19 LDA Test dataset performance
The accuracy on the train dataset is 83%.
The accuracy on the test dataset is 81%.
The train and test accuracies are close, so the model performs consistently on both and shows no
sign of overfitting.
2. Logistic Regression
Figure 20 LR Train dataset performance
Figure 21 LR Test dataset performance
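Fitting both linear models and comparing train and test accuracy can be sketched as follows. The data comes from scikit-learn's make_classification as a stand-in for the encoded survey, and the hyperparameters are defaults apart from max_iter, so the printed scores will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded survey (8 predictors).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns mean accuracy for classifiers.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.2f}, test={scores[name][1]:.2f}")
```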

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
Naïve Bayes
Figure 22 Naïve Bayes Train dataset performance
Figure 23 Naïve Bayes Test dataset performance

KNN
Figure 24 KNN Train dataset performance
Figure 25 KNN Test dataset performance
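A sketch of the two models in this section is given below, again on synthetic stand-in data. GaussianNB and k = 5 are assumptions, since the report does not name the Naïve Bayes variant or the number of neighbours. KNN is wrapped in a pipeline with a scaler because it relies on distances.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the encoded survey.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    # Scaling inside the pipeline keeps every feature on a comparable distance scale.
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.2f}, test={scores[name][1]:.2f}")
```

A large gap between the train and test accuracy of KNN usually means k is too small; trying several values of k and comparing test accuracy is a common way to choose it.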

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
Bagging
Figure 26 Bagging Train dataset performance
Figure 27 Bagging Test dataset performance

Ada Boosting
Figure 28 Ada boosting Train dataset performance
Figure 29 Ada boosting Test dataset performance

Gradient Boosting
Figure 30 Gradient Boosting Train dataset performance
Figure 31 Gradient Boosting Test dataset performance
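The three ensemble models can be sketched as below, with a Random Forest as the base learner for bagging as the question asks. The data is synthetic and the estimator counts are illustrative assumptions, not the tuned values behind the figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded survey.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    # The base learner is passed positionally, as its keyword name differs
    # between scikit-learn versions.
    "Bagging": BaggingClassifier(
        RandomForestClassifier(n_estimators=20, random_state=1),
        n_estimators=10,
        random_state=1,
    ),
    "Ada Boosting": AdaBoostClassifier(n_estimators=50, random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.2f}, test={scores[name][1]:.2f}")
```

Ensembles of deep trees often reach very high train accuracy, so the train and test gap is the number to watch when judging overfitting.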

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized.
Ordinal encoding has improved the model results, but the Linear Regression result has not improved
much. In this kernel, the data is evaluated by its features in order to predict the diamond
price. Before predicting the price, exploratory data analysis was carried out and the categorical
features were encoded. To predict the price, a Linear Regression model, a Decision Tree Regressor,
a Random Forest Regressor and KNN are compared. Among them, the Decision Tree was the most
successful at predicting the diamond price.
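For the classification models of Problem 1, the metrics named in the question (accuracy, confusion matrix, ROC curve and ROC_AUC score) can be collected with a small helper. This is a sketch on synthetic data, with a logistic regression as the example model; the helper name evaluate is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded survey.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def evaluate(model, X, y):
    """Return accuracy, confusion matrix, ROC AUC and ROC curve points."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]   # probability of the positive class
    fpr, tpr, _ = roc_curve(y, proba)
    return {
        "accuracy": accuracy_score(y, pred),
        "confusion_matrix": confusion_matrix(y, pred),
        "roc_auc": roc_auc_score(y, proba),
        "roc_curve": (fpr, tpr),
    }


report_train = evaluate(model, X_train, y_train)
report_test = evaluate(model, X_test, y_test)
print(report_test["accuracy"], report_test["roc_auc"])
print(report_test["confusion_matrix"])
```

Calling the same helper on every fitted model gives a like-for-like table of train and test metrics from which the final model can be chosen.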
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
(Hint: use .words(), .raw(), .sents() for extracting counts)
2.1 Find the number of characters, words, and sentences for the mentioned documents.
1. No. of words in Roosevelt file is: 1360
2. No. of words in Nixon file is: 1819
3. No. of words in Kennedy file is: 1390
• President Franklin D. Roosevelt’s speech has 7571 characters (including spaces) and 1360 words.
• President John F. Kennedy’s speech has 7618 characters (including spaces) and 1390 words.
• President Richard Nixon’s speech has 9991 characters (including spaces) and 1819 words.
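The three counts can be sketched on any raw text as below. To stay self-contained, this example uses simple regular-expression rules on a short made-up sample instead of the nltk corpus, so its counting rules are an assumption and will differ from nltk's tokenizers (for instance, nltk's .words() also returns punctuation tokens). The helper name text_counts is hypothetical.

```python
import re


def text_counts(raw: str) -> dict:
    """Count characters, words and sentences using simple regex rules."""
    words = re.findall(r"[A-Za-z']+", raw)                        # runs of letters and apostrophes
    sentences = [s for s in re.split(r"[.!?]+", raw) if s.strip()]  # split on end punctuation
    return {
        "characters": len(raw),        # includes spaces and punctuation
        "words": len(words),
        "sentences": len(sentences),
    }


sample = "We face the future. We do not retreat! Do we?"
counts = text_counts(sample)
print(counts)

# With nltk and its inaugural corpus installed, the equivalent counts would be
# len(inaugural.raw(fileid)), len(inaugural.words(fileid)) and
# len(inaugural.sents(fileid)) for a fileid such as '1941-Roosevelt.txt'.
```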
2.2 Remove all the stopwords from all three speeches.
Figure 32 Stopwords
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords)
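Stopword removal and the top three words can be sketched with a counter as below. The small stopword set and the sample sentence are made up for illustration; with nltk one would normally use its English stopword list and the speech text instead. The helper name top_words is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; a stand-in for a full English stopword list.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "we", "is", "that", "it", "our", "not", "do"}


def top_words(raw: str, stopwords: set, n: int = 3) -> list:
    """Lowercase, tokenize, drop stopwords and return the n most frequent words."""
    tokens = re.findall(r"[a-z']+", raw.lower())
    kept = [t for t in tokens if t not in stopwords]
    return Counter(kept).most_common(n)


sample = (
    "The nation and the people of the nation know that freedom is the work "
    "of the people and of the nation."
)
top3 = top_words(sample, STOPWORDS)
print(top3)
```

Lowercasing before filtering matters, since stopword lists are usually lowercase and a capitalised "The" would otherwise survive the filter.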
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords)
Word plot for Roosevelt
Figure 33 Words
Figure 34 Word Cloud
Word plot for Kennedy
Figure 35 Words
Word plot for Nixon
Figure 37 Words
Problem 1: train dataset accuracy by model
LDA: 83%
Logistic Regression: 83%
Naïve Bayes: 83%
KNN: 86%
Bagging: 99%
Ada Boosting: 84%
Gradient Boosting: 88%
