
BUSINESS REPORT ON

MACHINE LEARNING

NAME: NIRUPAM AGARWAL

BATCH- MAY, 2021

TABLE OF CONTENTS

List of Figures
Problem 1
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
BIVARIATE ANALYSIS
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
Problem 2
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the
top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)

List of Figures

Figure 1 Head and tail of the dataset
Figure 2 Info of the dataset
Figure 3 Summary statistics of the dataset
Figure 4 Summary of categorical Variables
Figure 5 Missing values in the dataset
Figure 6 Dropped column
Figure 7 Object count
Figure 8 Univariate analysis of age, economic.cond.national, economic.cond.household, Blair.
Figure 9 Univariate analysis of Hague, Europe, and Political knowledge.
Figure 10 Skewness of the above attributes
Figure 11 Details of categorical attribute ‘vote’
Figure 12 Details of categorical attribute ‘Gender’
Figure 13 Pair plot of the dataset
Figure 14 Correlation of the dataset
Figure 15 Heat map
Figure 16 Encoded dataset
Figure 17 Encoding of categorical attributes.
Figure 18 LDA Train dataset performance
Figure 19 LDA Test dataset performance
Figure 20 LR Train dataset performance
Figure 21 LR Test dataset performance
Figure 22 Naïve Bayes Train dataset performance
Figure 23 Naïve Bayes Test dataset performance
Figure 24 KNN Train dataset performance
Figure 25 KNN Test dataset performance
Figure 26 Bagging Train dataset performance
Figure 27 Bagging Test dataset performance
Figure 28 Ada boosting Train dataset performance
Figure 29 Ada boosting Test dataset performance
Figure 30 Gradient Boosting Train dataset performance
Figure 31 Gradient Boosting Test dataset performance
Figure 32 Stopwords
Figure 33 Words
Figure 34 Word Cloud
Figure 35 Words
Figure 36 Word Cloud
Figure 37 Words
Figure 38 Word Cloud

Problem 1
Problem statement:
You are hired by one of the leading news channels CNBE who wants to analyze recent elections.
This survey was conducted on 1525 voters with 9 variables. You have to build a model, to predict
which party a voter will vote for on the basis of the given information, to create an exit poll that
will help in predicting overall win and seats covered by a particular party.

1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the
null values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate
Analysis.

First, all the necessary libraries were imported and the dataset was loaded into a pandas DataFrame.

EDA approach:
1) Descriptive Analytics
2) Data pre-processing
3) Data visualization
4) Data preparation

1. Import the required libraries. Check the head and tail of the dataset. Find the shape and info
of the dataset.

Figure 1 Head and tail of the dataset

2. Shape of the dataset: (1525, 10), i.e. 1525 rows and 10 columns.

3. Info and Summary of the dataset:

Figure 2 Info of the dataset

The info shows 10 columns, all with non-null values: the 9 survey variables and the index column
Unnamed: 0. The attributes Unnamed: 0, age, economic.cond.national, economic.cond.household, Blair,
Hague, Europe and political.knowledge are of integer data type.
The attributes vote and gender are of object data type. There are no missing values. The index runs
from 0 to 1524, and there are 1525 entries in total.

Figure 3 Summary statistics of the dataset

The count is 1525 for all the attributes.
The mean and median do not coincide exactly, so the distributions are not perfectly normal.
The gap between the 75th percentile and the maximum value is small for each attribute.

Summary statistics of categorical variables:

Figure 4 Summary of categorical Variables


For the attribute vote: there are 1525 non-null values and 2 unique values; the most frequent value
(top) is Labour, occurring 1063 times.
For the attribute gender: there are 1525 non-null values and 2 unique values; the most frequent value
(top) is female, occurring 812 times.

4. Duplicate rows: There are 8 duplicate rows in the dataset.

5. Missing values in the dataset:

Figure 5 Missing values in the dataset

The dataset consists of no missing values.

Dropped the column Unnamed: 0 as it holds no significance.

Figure 6 Dropped column

6. Unique counts of all objects:

Figure 7 Object count
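
The descriptive steps above can be sketched in pandas as follows. This is only an illustration on a small made-up frame that borrows a few column names from the survey; the file name in the comment is hypothetical, and the values are not the survey data.

```python
import pandas as pd

# In practice the survey would be loaded from disk, e.g.
#   df = pd.read_csv("election_data.csv")   # hypothetical file name
# A tiny made-up frame with similar columns stands in for it here.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4, 5],
    "vote": ["Labour", "Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 35, 43, 36],
    "Europe": [2, 5, 3, 2, 5],
    "gender": ["female", "male", "male", "female", "male"],
})

print(df.head())                           # first rows
print(df.tail())                           # last rows
print(df.shape)                            # (rows, columns)
df.info()                                  # dtypes and non-null counts
print(df.describe())                       # summary of numeric columns
print(df[["vote", "gender"]].describe())   # count, unique, top, freq

# Missing values per column
missing = df.isnull().sum()

# The running index column makes every row unique, so it is dropped
# before counting duplicate rows.
df = df.drop(columns="Unnamed: 0")
n_duplicates = int(df.duplicated().sum())

# Frequency of each category in an object column
vote_counts = df["vote"].value_counts()
print(n_duplicates, vote_counts, sep="\n")
```

Whether to drop the duplicate rows afterwards is a separate modelling decision that this sketch leaves open.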

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers.

UNIVARIATE ANALYSIS

For each numeric attribute we plot a histogram (histplot) to view the distribution and a boxplot
to check for outliers.

1) Attributes

The age attribute has a skewness of 0.144, i.e. its distribution is approximately symmetric.
There are no outliers in the boxplot of age.

The economic.cond.national attribute has a skewness of -0.24, i.e. it is approximately symmetric.
Its boxplot shows outliers.

The economic.cond.household attribute has a skewness of -0.14, i.e. it is approximately symmetric.
Its boxplot shows outliers.

The Blair attribute has a skewness of -0.535, i.e. it is moderately left-skewed.
Its boxplot shows outliers.

The Hague attribute has a skewness of 0.15, i.e. it is approximately symmetric.
There are no outliers in its boxplot.

The Europe attribute has a skewness of -0.135, i.e. it is approximately symmetric.
There are no outliers in its boxplot.

The Political knowledge attribute has a skewness of -0.42, i.e. it is slightly left-skewed.
There are no outliers in its boxplot.

Figure 8 Univariate analysis of age, economic.cond.national, economic.cond.household, Blair.

Figure 9 Univariate analysis of Hague, Europe, and Political knowledge.

Figure 10 Skewness of the above attributes
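
The skewness values and the boxplot outlier check can also be computed numerically. The sketch below assumes the usual boxplot convention (points beyond 1.5 times the interquartile range from the quartiles are outliers); the helper name iqr_outlier_count and the scores are made up for illustration.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series) -> int:
    """Count values outside the boxplot whiskers (1.5 * IQR rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up 1-to-5 scores, not the survey data.
scores = pd.Series([3, 3, 3, 3, 4, 3, 3, 2, 3, 3, 1, 5])

skewness = scores.skew()                 # sample skewness; 0 for a symmetric sample
n_outliers = iqr_outlier_count(scores)
print(skewness, n_outliers)
```

When an ordinal score is concentrated on one value, the interquartile range can be very small, so the rule may flag ordinary scores as outliers. This is worth keeping in mind when reading boxplots of rating-type attributes.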

The details of categorical variables

1) Vote

Figure 11 Details of categorical attribute ‘vote’

The vote attribute takes two values, Labour and Conservative, as shown. Labour occurs most often, with
1063 records, followed by Conservative with 462.
We can encode it numerically (for example with one-hot encoding) so that it can be used for modelling.

2) Gender

Figure 12 Details of categorical attribute ‘Gender’

The gender attribute takes two values, female and male, as shown. Female occurs most often, with 812
records, followed by male with 713.
We can encode it numerically (for example with one-hot encoding) so that it can be used for modelling.

BIVARIATE ANALYSIS
A pair plot of the dataset is drawn.
The plots on the main diagonal are histograms of the individual attributes (univariate analysis);
the off-diagonal plots show several positive correlations.

Figure 13 Pair plot of the dataset

The histograms on the main diagonal show multiple peaks for several attributes, which suggests that
there may be clusters in the dataset.

Correlation:

Figure 14 Correlation of the dataset

Plotting the correlation matrix as a heat map:

Figure 15 Heat map

1. The heat map indicates that no pair of attributes is highly correlated. economic.cond.household
and economic.cond.national are slightly correlated.
2. Blair and economic.cond.national are slightly correlated. Europe and Hague show slight
correlation too.
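
The correlation table behind a heat map can be inspected directly, which makes it easier to name the strongest pairs. The sketch below uses synthetic columns (a, b, c are placeholder names, not survey attributes) and lists each pair once, ordered by absolute correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic columns: "b" is built to track "a", "c" is independent noise.
a = rng.normal(size=n)
frame = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=n),
    "c": rng.normal(size=n),
})

corr = frame.corr()   # Pearson correlation matrix

# Keep each pair once (upper triangle without the diagonal),
# then sort the pairs by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```

The same pairs table applied to the survey columns would show which of the slight correlations mentioned above is the largest.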

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30).

The dataset is encoded, and the numeric columns are updated with their normalized (scaled)
values.

Figure 16 Encoded dataset

The variables ‘vote’ and ‘gender’ hold string values, so they are converted into numeric values
(encoding).

Figure 17 Encoding of categorical attributes.
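
One possible way to carry out the encoding, the 70:30 split and the scaling is sketched below. The rows are made up, and the use of integer category codes, StandardScaler, a stratified split and random_state=1 are assumptions for illustration, not necessarily the choices made in this report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up rows with the same kind of columns as the survey data.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "age": list(range(30, 50)),
    "Europe": [2, 9] * 10,
    "gender": ["female", "male", "male", "female"] * 5,
})

# Encode the two string columns as integer codes. Categories are sorted
# alphabetically: Conservative=0, Labour=1 and female=0, male=1.
for col in ["vote", "gender"]:
    df[col] = pd.Categorical(df[col]).codes

X = df.drop(columns="vote")
y = df["vote"]

# 70:30 split, stratified so both parts keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Fit the scaler on the train part only and apply it to both parts,
# so that no information from the test rows leaks into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Scaling matters most for distance-based models such as KNN; tree-based models are largely insensitive to it.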

1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis).

1. LDA
Performance Matrix on train data set

Figure 18 LDA Train dataset performance

Performance Matrix on test data set

Figure 19 LDA Test dataset performance

The accuracy on the train dataset is 83%.
The accuracy on the test dataset is 81%.
The model performs well on both train and test data.

2. Logistic Regression

Performance Matrix on train data set

Figure 20 LR Train dataset performance

Performance Matrix on test data set

Figure 21 LR Test dataset performance

The accuracy on the train dataset is 83%.
The accuracy on the test dataset is 82%.
The model performs well on both train and test data.
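
A minimal sketch of how the two models can be fitted and compared on train and test accuracy is given below. It runs on synthetic data from make_classification rather than the survey, and the hyperparameters (max_iter=1000, random_state=1) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the encoded survey features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For classifiers, .score() returns the mean accuracy.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.2f}, test={scores[name][1]:.2f}")
```

A small gap between the train and test figures, as reported above for both models, is the usual sign that a model generalizes rather than memorizes.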

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

Naïve Bayes
Performance Matrix on train data set

Figure 22 Naïve Bayes Train dataset performance

Performance Matrix on test data set

Figure 23 Naïve Bayes Test dataset performance

The accuracy on the train dataset is 83%.
The accuracy on the test dataset is 82%.
The model performs well on both train and test data.

KNN

Performance Matrix on train data set

Figure 24 KNN Train dataset performance

Performance Matrix on test data set

Figure 25 KNN Test dataset performance

The accuracy on the train dataset is 86%.
The accuracy on the test dataset is 82%.
The model performs well on both train and test data, with a slightly larger gap between the two
than the previous models.
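
The sketch below shows one way to fit the two models, again on synthetic data. The choice of GaussianNB, n_neighbors=5 and scaling inside a pipeline are assumptions made for illustration; the report does not state which settings were used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the encoded survey features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=2
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=2, stratify=y
)

models = {
    "Naive Bayes": GaussianNB(),
    # KNN is distance based, so the features are scaled inside the pipeline.
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.2f}, test={scores[name][1]:.2f}")
```

With KNN, a smaller number of neighbours tends to raise train accuracy faster than test accuracy, so the gap between the two is a useful guide when choosing n_neighbors.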

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.

Bagging

Performance Matrix on train data set

Figure 26 Bagging Train dataset performance

Performance Matrix on test data set

Figure 27 Bagging Test dataset performance

The accuracy on the train dataset is 99%.
The accuracy on the test dataset is 79%.
The 20-point gap between train and test accuracy indicates that the model overfits the train data.

Ada Boosting

Performance Matrix on train data set

Figure 28 Ada boosting Train dataset performance

Performance Matrix on test data set

Figure 29 Ada boosting Test dataset performance

The accuracy on the train dataset is 84%.
The accuracy on the test dataset is 81%.
The model performs well on both train and test data.
Gradient Boosting

Performance Matrix on train data set

Figure 30 Gradient Boosting Train dataset performance

Performance Matrix on test data set

Figure 31 Gradient Boosting Test dataset performance

The accuracy on the train dataset is 88%.
The accuracy on the test dataset is 88%.
The model performs well on both train and test data.
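
The three ensemble models can be set up as sketched below. The data is synthetic and the estimator counts and random_state values are illustrative assumptions; the train-minus-test gap is computed because it is the quickest check for the overfitting seen with Bagging.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the encoded survey features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=3
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=3, stratify=y
)

models = {
    # Bagging with a random forest as the base estimator, as the task asks.
    # The base estimator is passed positionally because its keyword name
    # differs between scikit-learn versions.
    "Bagging": BaggingClassifier(
        RandomForestClassifier(n_estimators=20, random_state=1),
        n_estimators=10,
        random_state=1,
    ),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # A large positive gap suggests the model is overfitting.
    results[name] = {"train": train_acc, "test": test_acc, "gap": train_acc - test_acc}
    print(f"{name}: train={train_acc:.2f}, test={test_acc:.2f}")
```

Limiting tree depth or raising the minimum samples per leaf in the base random forest is a common way to shrink such a gap.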

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized.

The train and test accuracies reported above are compared below.

Model                 Train accuracy   Test accuracy
LDA                   83%              81%
Logistic Regression   83%              82%
Naïve Bayes           83%              82%
KNN                   86%              82%
Bagging               99%              79%
Ada Boosting          84%              81%
Gradient Boosting     88%              88%

Gradient Boosting has the highest test accuracy (88%) and shows no gap between train and test
accuracy, so on accuracy it is the best of the models compared. Bagging has the largest gap (99%
against 79%), which indicates overfitting. The remaining models reach 81% to 82% on the test data.
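
The metrics named in this task (accuracy, confusion matrix, ROC curve and ROC_AUC score) can be obtained as sketched below for a single model. The data is synthetic and logistic regression is used only as an example; the same calls apply to any fitted classifier that offers predict_proba.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the encoded survey features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=4
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=4, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # predicted probability of class 1

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)        # rows: actual, columns: predicted
auc = roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)   # points of the ROC curve

print(f"accuracy={acc:.2f}, auc={auc:.2f}")
print(cm)
```

Plotting tpr against fpr gives the ROC curve; the AUC score summarizes it in one number, with 0.5 corresponding to random guessing and 1.0 to perfect ranking.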

Problem 2:

In this particular project, we are going to work on the inaugural corpora from the nltk in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
(Hint: use .words(), .raw(), .sents() for extracting counts)

2.1 Find the number of characters, words, and sentences for the mentioned documents.

1. No. of words in the Roosevelt file: 1360
2. No. of words in the Nixon file: 1819
3. No. of words in the Kennedy file: 1390

- President Franklin D. Roosevelt’s speech has 7571 characters (including spaces) and 1360
words.
- President John F. Kennedy’s speech has 7618 characters (including spaces) and 1390
words.
- President Richard Nixon’s speech has 9991 characters (including spaces) and 1819 words.
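
The three counts can be illustrated on any raw text. The sketch below uses only the standard library and a made-up sample sentence; the helper name text_counts and the regular expressions are assumptions. With nltk, the corpus methods named in the hint return the raw text, the word tokens and the sentences, and their tokenizer may treat punctuation differently from this simple version.

```python
import re

def text_counts(raw: str) -> dict:
    """Return character, word and sentence counts for a raw text."""
    words = re.findall(r"[A-Za-z0-9']+", raw)
    # Rough sentence split: break after ., ! or ? when whitespace follows.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]
    return {
        "characters": len(raw),   # includes spaces and punctuation
        "words": len(words),
        "sentences": len(sentences),
    }

# Made-up sample text, not taken from the speeches.
sample = "We are here today. We renew our promise! Will we keep it?"
counts = text_counts(sample)
print(counts)
```

Because the word count depends on the tokenizer, it helps to state which one was used when reporting figures like those above.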

2.2 Remove all the stopwords from all three speeches.

Figure 32 Stopwords

2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords)
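
Removing stopwords and finding the most frequent words can be sketched as below. The stopword set here is a small illustrative one and the sample text is made up; a fuller list, such as the English stopword list shipped with nltk, could be substituted. The helper name top_words is hypothetical.

```python
import re
from collections import Counter

# A small illustrative stopword set (an assumption, not the list used
# in the report).
STOPWORDS = {"the", "a", "of", "and", "to", "we", "our", "is", "in", "it", "for"}

def top_words(raw: str, n: int = 3) -> list:
    """Lowercase the text, drop stopwords and return the n most common words."""
    tokens = re.findall(r"[a-z']+", raw.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return Counter(kept).most_common(n)

# Made-up sample text, not taken from the speeches.
sample = (
    "The nation is the people, and the people of the nation know freedom. "
    "Freedom is the work of the people for the nation, and the nation endures."
)
top3 = top_words(sample)
print(top3)
```

Lowercasing before counting matters here: without it, "Freedom" and "freedom" would be counted as different words.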

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords)

Word plot for Roosevelt

Figure 33 Words

Figure 34 Word Cloud

Word plot for Kennedy

Figure 35 Words

Figure 36 Word Cloud

Word plot for Nixon

Figure 37 Words

Figure 38 Word Cloud
