
Machine Learning Projects

2/20/2022

Mayank Gupta
PGP-DSBA Online
Table of Contents

01. Problem 1: - Election Data
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?

02. Problem 2: - Inaugural Corpora
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words (after removing the stopwords).
2.4 Plot the word cloud of each of the speeches of the variable (after removing the stopwords).

List of Figures

S.No. Topic
1.2 Multivariate Analysis
1.2 Heatmap
2.4 Word cloud for 1941-Roosevelt.txt
2.4 Word cloud for 1961-Kennedy.txt
2.4 Word cloud for 1973-Nixon.txt

List of Tables

S.No. Topic
1.1 Statistical Description of Dataset
1.2 Model Summary Table

Executive Summary
This report brings together two projects based on different concepts of Machine Learning.
The first is based on numerical data analysis, whereas the second is based entirely on text
analytics. The main aim is to gain a better understanding of machine learning concepts and
their implementation. The two projects are independent of each other: each has its own
dataset and its own methods of analysis. The report also includes the detailed inferences and
insights obtained from analysing and modelling the data.
Problem 1: - This is based on the analysis of election data. I have assumed that I have been
hired by a media channel that wants me to analyse the data of a survey answered by 1525
people, whose answers were recorded in 9 variables. I have to build a model to predict which
party a voter will vote for on the basis of the given information, in order to create an exit
poll that will help in predicting the overall win and the seats covered by a particular party.
Problem 2: - In this project, I work on the inaugural corpora, which are extracted from
nltk in Python. For this project, I look at the following speeches of Presidents of the
United States of America:
• President Franklin D. Roosevelt in 1941
• President John F. Kennedy in 1961
• President Richard Nixon in 1973

Problem 1: - Election Data
Problem Statement
You are hired by one of the leading news channels CNBE who wants to analyze recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to build a
model, to predict which party a voter will vote for on the basis of the given information, to
create an exit poll that will help in predicting overall win and seats covered by a particular
party.
Dataset for Problem: Election_Data.xlsx
Summary of Dataset
There are a total of 9 variables, for which data has been collected from 1525 people. Some of
the respondents are male and some are female, and each has voted for either the Labour Party
or the Conservative Party.
• Party: - The two parties contesting the election, namely the Labour Party and the
Conservative Party.
• Age: - Age of the voter who took the survey conducted by the CNBE news channel.
• Gender: - Gender of the voter
• economic.cond.national
• economic.cond.household
• Blair
• Hague
• Europe
• political.knowledge

1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it.

Importing all of the relevant libraries

Checking for the data in the Dataset

Statistical Description of the Dataset

Looking for the null values in the Dataset

Finding datatypes of different variables

Finding total number of duplicate entries in the dataset

Finding Shape of the Dataset

Finding vote counts of each party
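The checks listed above can be sketched with pandas as follows. This is a minimal illustration, not the report's original code: the small DataFrame is a hypothetical stand-in for Election_Data.xlsx (which would normally be loaded with pd.read_excel), and the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; the real file would be read with
# pd.read_excel("Election_Data.xlsx").
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour", "Labour", "Labour"],
    "age": [43, 36, 35, 24, 24],
    "gender": ["female", "male", "male", "female", "female"],
    "Blair": [4, 4, 5, 2, 2],
})

print(df.head())          # first rows of the dataset
print(df.describe())      # descriptive statistics of the numeric columns
print(df.isnull().sum())  # number of null values per column
print(df.dtypes)          # datatype of each variable
print(df.shape)           # (number of rows, number of columns)

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# Number of votes recorded for each party.
vote_counts = df["vote"].value_counts()
print(vote_counts)
```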

Observed Inferences

• The mean age of the voters is 54.18 years, with a standard deviation of about
15.71 years.
• The age range in the sample is wide: the youngest voter is 24 years old and the
oldest is 93 years old.
• No entry in the dataset has null values.
• The dataset contains 8 duplicate entries.
• In the survey data, the Labour Party received 1063 votes and the Conservative
Party received 462 votes, which is less than half of the votes received by the
Labour Party.

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
Check for Outliers.

Univariate Analysis

Blair Boxplot

Political Knowledge Boxplot

Bivariate Analysis

Vote vs. economic_cond_national

Vote vs. economic_cond_household

Vote vs. Blair

Vote vs. Hague

Vote vs. Europe

Vote vs. political knowledge

Multivariate Analysis

Heatmap

Outlier Analysis
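One common way to check for outliers is the 1.5 × IQR rule that boxplot whiskers use. The sketch below assumes that rule; the report does not state which method was applied, and the iqr_outliers helper and the sample ages are hypothetical.

```python
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values that fall outside the 1.5 * IQR boxplot whiskers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical ages with one implausibly large value.
ages = pd.Series([30, 35, 40, 45, 50, 55, 60, 65, 70, 150], name="age")
outliers = iqr_outliers(ages)
print(outliers.tolist())
```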

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary
here or not? Data Split: Split the data into train and test (70:30).

Describing the dataset

Creating dummy variables by converting Party and Gender into integer values of 0 and 1.

Changing column names: vote_Labour to IsLabour_or_not and gender_male to IsMale_or_not
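The encoding and renaming described above could look like the following sketch. It assumes pandas get_dummies with drop_first=True, which is one way to obtain the vote_Labour and gender_male columns; the three-row DataFrame is a hypothetical stand-in for the election data.

```python
import pandas as pd

# Hypothetical stand-in for the election data.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "male"],
    "age": [43, 36, 35],
})

# drop_first=True keeps a single 0/1 column for each two-level variable.
encoded = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True, dtype=int)

# Rename the dummy columns as described in the text.
encoded = encoded.rename(columns={
    "vote_Labour": "IsLabour_or_not",
    "gender_male": "IsMale_or_not",
})
print(encoded)
```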

Age ranges from 24 to 93 years, whereas the remaining variables are on much smaller scales,
so the data needs scaling before further analysis. Without scaling, distance-based models
such as KNN would be dominated by age.

Data Split in 70:30
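A 70:30 split can be sketched with scikit-learn's train_test_split. The random feature matrix and target are hypothetical stand-ins for the encoded data, and random_state and stratify are assumptions added for reproducibility and class balance, not settings taken from the report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and 0/1 target standing in for the encoded data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# test_size=0.30 gives the 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```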

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).

Logistic Regression

Train Data

y_train_prob

Logistic Model Score of Train Data

AUC ROC curve for Logistic Regression Train

Test Data

y_test_prob

AUC ROC curve for Logistic Regression Test
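The logistic regression steps above (predicted probabilities, model score and ROC AUC for train and test) might be sketched as follows. The synthetic data from make_classification replaces the election data, and max_iter and random_state are illustrative assumptions rather than the report's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Probability of the positive class for each observation.
y_train_prob = model.predict_proba(X_train)[:, 1]
y_test_prob = model.predict_proba(X_test)[:, 1]

# Model score (accuracy) and area under the ROC curve.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
train_auc = roc_auc_score(y_train, y_train_prob)
test_auc = roc_auc_score(y_test, y_test_prob)
print(train_score, train_auc, test_score, test_auc)
```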

Linear Discriminant Analysis

y_train_predict

AUC ROC curve for LDA Train

y_test_predict

AUC ROC curve for LDA Test
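For LDA, the predictions and the points of the ROC curve can be obtained as in this sketch. It again uses synthetic data in place of the election data; the variable names follow the captions above, and everything else is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

y_train_predict = lda.predict(X_train)
y_test_predict = lda.predict(X_test)

# The ROC curve is built from the predicted probabilities, not the hard labels.
test_prob = lda.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, test_prob)
test_auc = roc_auc_score(y_test, test_prob)
print(test_auc)
```

Plotting fpr against tpr (for example with matplotlib) gives the ROC curve itself.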

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

KNN Model

Transforming dataset by applying zscore
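A z-score transformation can be sketched with scipy.stats.zscore applied column by column. The two-column DataFrame is hypothetical; after the transformation each column has mean 0 and (population) standard deviation 1.

```python
import pandas as pd
from scipy.stats import zscore

# Hypothetical features on very different scales.
X = pd.DataFrame({
    "age": [24, 43, 54, 70, 93],
    "Blair": [1, 4, 4, 2, 5],
})

# zscore subtracts the column mean and divides by the column standard deviation.
X_scaled = X.apply(zscore)
print(X_scaled)
```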

KNN Model Score of train data

Confusion Matrix and Classification Report of train data

AUC ROC Curve KNN Train

KNN model score of Test Data

Confusion Matrix and Classification Report of test data

AUC ROC Curve KNN Test
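The KNN score, confusion matrix and classification report could be produced as in the sketch below. It uses scikit-learn's default KNeighborsClassifier (5 neighbours) on synthetic data; the data and random_state are assumptions, not the report's values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the scaled election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(X_train, y_train)

test_score = knn.score(X_test, y_test)         # accuracy on the test set
y_test_predict = knn.predict(X_test)
cm = confusion_matrix(y_test, y_test_predict)  # rows: actual, columns: predicted
report = classification_report(y_test, y_test_predict)
print(test_score)
print(cm)
print(report)
```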

KNN Model with n=7

KNN Model Score, Confusion Matrix and Classification Report of train data

KNN Model Score, Confusion Matrix and Classification Report of test data

KNeighborsClassifier(n_neighbors=5)

KNN Model Score, Confusion Matrix and Classification Report of train data

KNN Model Score, Confusion Matrix and Classification Report of test data

ac_score

AUC ROC curve after n classifier for train data set

AUC ROC curve after n classifier for test data set

Number of Neighbours K vs. Misclassification Error
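The K versus misclassification error comparison can be sketched by fitting one KNN model per candidate K and recording 1 minus the test accuracy. The range of odd K values and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the scaled election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

k_values = list(range(1, 20, 2))  # odd values of K from 1 to 19
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Misclassification error = 1 - accuracy on the test set.
    errors.append(1 - knn.score(X_test, y_test))

# K with the lowest misclassification error (first one if there is a tie).
best_k = k_values[errors.index(min(errors))]
print(best_k)
```

Plotting k_values against errors gives the K versus misclassification error chart.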

Naive Bayes

Model Score, Confusion Matrix and Classification Report of train data

AUC ROC Curve for Train Data

Model Score, Confusion Matrix and Classification Report of test data

AUC ROC Curve for Test Data
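A Naive Bayes model can be evaluated in the same way. The sketch assumes the Gaussian variant (GaussianNB), which the report does not name explicitly, and uses synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for the election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

nb = GaussianNB().fit(X_train, y_train)

# Model score and ROC AUC on the train and test sets.
nb_scores = {
    "train_score": nb.score(X_train, y_train),
    "test_score": nb.score(X_test, y_test),
    "train_auc": roc_auc_score(y_train, nb.predict_proba(X_train)[:, 1]),
    "test_auc": roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1]),
}
print(nb_scores)
```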

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging),
and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and
Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model.

Bagging

Bagging Train

Model Score, Confusion Matrix and Classification Report of train data

AUC_ROC Curve Bagging Train

Bagging Test

Model Score, Confusion Matrix and Classification Report of test data

AUC_ROC Curve Bagging Test
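Bagging with a random forest as the base estimator might be set up as below. The numbers of estimators, random_state and the synthetic data are assumptions; comparing the train and test scores shows how far the model's training performance exceeds its performance on unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# The base estimator is passed positionally so the call works across
# scikit-learn versions that name this parameter differently.
base = RandomForestClassifier(n_estimators=20, random_state=1)
bagging = BaggingClassifier(base, n_estimators=10, random_state=1)
bagging.fit(X_train, y_train)

train_score = bagging.score(X_train, y_train)
test_score = bagging.score(X_test, y_test)
print(train_score, test_score, train_score - test_score)
```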

Boosting Method

Ada Boost

Model Score, Confusion Matrix and Classification Report of train data

AUC_ROC Curve Boosting Train

Gradient Boosting

Model Score, Confusion Matrix and Classification Report of train data

AUC_ROC Curve Boosting Train

ADA Boosting Test

Model Score, Confusion Matrix and Classification Report of test data

AUC_ROC Curve Boosting Test

Gradient Boosting Test

Gradient Boosting AUC_ROC Curve Test
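The two boosting models can be fitted and compared in one loop, as in this sketch. The hyperparameters, random_state and synthetic data are illustrative assumptions and not the tuned values behind the report's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the election data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "ADA Boost": AdaBoostClassifier(n_estimators=50, random_state=1),
    "Gradient Boost": GradientBoostingClassifier(random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_score": model.score(X_train, y_train),
        "test_score": model.score(X_test, y_test),
        "test_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    }
print(results)
```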

Final Model: Compare the models and write inference which model is best/optimized.

After comparing the accuracy, recall, model score, and AUC score of the different models on
the training and testing data, the KNN model with n = 7 is the best optimised. It has the
highest test score (0.835) with a test AUC score of 0.900, and its train score (0.853) is
close to its test score.

1.8 Based on these predictions, what are the insights?

Based on these predictions, the following results were obtained.

Method                         Train Data Score   Train AUC Score   Test Data Score   Test AUC Score
Logistic Regression            0.840              0.889             0.823             0.882
Linear Discriminant Analysis   0.837              0.889             0.819             0.884
KNN                            0.867              0.93              0.824             0.870
KNN (n=7)                      0.853              0.904             0.835             0.900
KNN (n=5)                      0.867              -                 0.824             -
Naïve Bayes                    0.833              0.886             0.825             0.885
Bagging                        0.999              1.000             0.797             0.877
Boosting (ADA Boost)           0.847              0.913             0.819             0.877
Boosting (Gradient Boost)      0.886              0.950             0.831             0.904

Model Summary Table

The following inferences can be drawn from the above data analysis.

• The data needed scaling to make the variables more uniform for the analysis.
• Outliers are present in some variables.
• Training and testing with the different methods gave similar results for most models,
which suggests that the data modelling, model tuning and scaling were done properly.
• Bagging showed a large difference between training and testing results (train score 0.999
against test score 0.797), which points to overfitting; the other models showed little or
no gap between training and testing results.
• The mean age of the voters is 54.18 years, with a standard deviation of about 15.71
years.
• The age range in the sample is wide: the youngest voter is 24 years old and the oldest
is 93 years old.
• No entry in the dataset has null values.
• The dataset contains 8 duplicate entries.
• In the survey data, the Labour Party received 1063 votes and the Conservative Party
received 462 votes, which is less than half of the votes received by the Labour Party.

Problem 2: - Inaugural Corpora
Problem Statement

In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States of
America:

• President Franklin D. Roosevelt in 1941
• President John F. Kennedy in 1961
• President Richard Nixon in 1973

Code Snippet to extract the three speeches:

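"

import nltk
from nltk.corpus import inaugural

nltk.download('inaugural')  # download the inaugural corpus (needed once)

inaugural.fileids()  # list the speeches available in the corpus

# Raw text of the three speeches
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')

"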

2.1 Find the number of characters, words, and sentences for the mentioned
documents.
Importing all of the relevant txt files

1941-Roosevelt.txt

Total number of words

Total number of sentences

Total number of characters

1961-Kennedy.txt

Total number of words

Total number of sentences

Total number of characters

1973-Nixon.txt

Total number of words

Total number of sentences

Total number of characters
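The three counts can be sketched for any raw text as below. The text_counts helper and the sample sentence are hypothetical; in the report the raw text would come from inaugural.raw(...), and the simple regular expressions used here may count words and sentences slightly differently from nltk's own tokenizers.

```python
import re

def text_counts(raw: str) -> dict:
    """Count characters, words and sentences in a raw text string."""
    words = re.findall(r"[A-Za-z']+", raw)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]
    return {
        "characters": len(raw),
        "words": len(words),
        "sentences": len(sentences),
    }

# Hypothetical sample text, not taken from any of the speeches.
sample = "The nation is strong. We move forward together!"
counts = text_counts(sample)
print(counts)
```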

2.2 Remove all the stopwords from all three speeches.
Importing libraries for removing stopwords

1941-Roosevelt.txt

1961-Kennedy.txt

1973-Nixon.txt
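Stopword removal can be sketched as lowercasing the text, splitting it into word tokens, and dropping every token found in a stopword list. The small STOPWORDS set below is a hypothetical stand-in for a full list such as the English stopwords shipped with nltk, and the sample sentence is invented.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"the", "is", "and", "we", "of", "a", "to", "in"}

def remove_stopwords(raw: str, stopwords: set) -> list:
    """Lowercase the text, tokenize it, and drop the stopwords."""
    tokens = re.findall(r"[a-z']+", raw.lower())
    return [token for token in tokens if token not in stopwords]

sample = "The nation is strong and we move forward together"
cleaned = remove_stopwords(sample, STOPWORDS)
print(cleaned)
```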

2.3 Which word occurs the most number of times in his inaugural address for
each president? Mention the top three words. (after removing the stopwords)
1941-Roosevelt.txt

Most frequent word

Top 3 Words

1961-Kennedy.txt

Most frequent word

Top 3 Words

1973-Nixon.txt

Most frequent word

Top 3 Words
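Once the stopwords are removed, the most frequent word and the top three words can be found with collections.Counter. The token list below is invented for illustration and does not reproduce the result for any of the three speeches.

```python
from collections import Counter

# Hypothetical tokens left after stopword removal.
tokens = ["peace", "world", "new", "peace", "world",
          "peace", "new", "world", "peace", "home"]

word_counts = Counter(tokens)
most_frequent = word_counts.most_common(1)[0]  # (word, count) of the top word
top_three = word_counts.most_common(3)         # three most frequent words
print(most_frequent)
print(top_three)
```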

2.4 Plot the word cloud of each of the speeches of the variable. (after removing
the stopwords)
1941-Roosevelt.txt

1961-Kennedy.txt

1973-Nixon.txt
