
PROJECT REPORT - Machine Learning

By Akshaya
Table of Contents

Problem 1:
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent elections.
This survey was conducted on 1525 voters with 9 variables. You have to build a model to
predict which party a voter will vote for on the basis of the given information, to create an
exit poll that will help in predicting the overall win and the seats covered by a particular party.

1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?
Problem 2:
In this particular project, we are going to work on the inaugural corpus from NLTK in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords) –
3 Marks [ refer to the End-to-End Case Study done in the Mentored Learning Session ]
Code Snippet to extract the three speeches:
"
import nltk
nltk.download('inaugural')

from nltk.corpus import inaugural
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')
"

Problem No.1
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to build a
model to predict which party a voter will vote for on the basis of the given information, to
create an exit poll that will help in predicting the overall win and the seats covered by a
particular party.

1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.

Let us read the dataset, inspect the first few rows with head(), and understand the problem.

The 'Unnamed' column is just a serial index and is of no use at this point, so let us drop it and re-read the dataset.

Next, let us look at the shape of the dataset. The survey data has 1525 rows and, after dropping the index column, 9 columns, matching the problem statement (1525 voters, 9 variables).

Let us check for any null values in the dataset.

There are no null values present. Let us also check for duplicates: there are 8 duplicate rows, which add no information, so we drop them. Next, let us look at the basic info and the statistical summary of the dataset. Only two variables (vote and gender) hold string values; the rest are numerical.

No variable has missing or non-numeric entries, and for almost every variable the mean and median are nearly equal, suggesting roughly symmetric distributions. Let us check the distribution of the observations using histograms.
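The steps above can be reproduced with a short sketch along these lines (the file name 'Election_Data.xlsx' and the exact index column name 'Unnamed: 0' are assumptions, not stated in the report):

import pandas as pd

df = pd.read_excel('Election_Data.xlsx')   # assumed file name
df = df.drop(columns=['Unnamed: 0'])       # drop the unused serial-index column
print(df.shape)                            # rows and columns
print(df.isnull().sum())                   # null value check
print(df.duplicated().sum())               # duplicate check (8 found above)
df = df.drop_duplicates()
df.info()                                  # dtypes and non-null counts
print(df.describe())                       # statistical summary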

Let us check for the skewness in the dataset to understand the distribution.

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.

Almost every variable has skewness close to zero. Let us check for outliers using boxplots.
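A sketch of how the skewness check and the boxplots might be produced (the grid layout is illustrative):

import matplotlib.pyplot as plt

print(df.skew(numeric_only=True))          # skewness of each numeric column
df.plot(kind='box', subplots=True, layout=(3, 3), figsize=(12, 8))
plt.tight_layout()
plt.show()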

Only two variables have outliers; the rest have none. We will treat these outliers before modelling. Let us also check the distributions of the variables.

Looking at the distributions of the two variables with outliers, both have almost the same distribution curve.

Let us look at the distributions of the variables with outliers in more detail by plotting their histograms and boxplots side by side.

We will remove these outliers later. Now let us break down all the variables by gender.

Orange represents the male distribution and blue the female distribution; most of the observations come from male respondents. Let us also check for class imbalance in the dataset, using the vote variable to count the Labour and Conservative voters.

Now let us remove the outliers from the dataset using the IQR method. The figure below shows the boxplots after the outlier treatment.
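The IQR treatment can be sketched as follows (clipping values to the whisker limits is one common choice; the report does not state whether rows were instead dropped):

def treat_outliers_iqr(frame, col):
    q1, q3 = frame[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    frame[col] = frame[col].clip(lower, upper)   # cap values at the whisker limits

for col in df.select_dtypes(include='number').columns:
    treat_outliers_iqr(df, col)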

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test (70:30).

For this problem, vote is our target variable, and we need performance metrics for both the Labour and Conservative classes. Most people voted Labour; only about 30% voted Conservative. To predict the voters using this target variable, we have to split the dataset into training and test sets. Most importantly, we have to encode the categorical variables: let us convert the object dtypes into numerical codes for modelling.

After encoding, vote has two codes: 0 represents Conservative and 1 represents Labour. The gender variable is likewise assigned codes 0 and 1. Checking the basic info once again confirms that all variables have been converted to numerical values. Now our dataset is ready to be split into training and test sets.
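A minimal sketch of the encoding step, assuming the two object columns are named 'vote' and 'gender' (with alphabetical category ordering, the resulting codes match those described above):

for col in ['vote', 'gender']:             # assumed column names for the two object variables
    df[col] = df[col].astype('category').cat.codes
df.info()                                  # all columns should now be numeric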

Scaling is necessary only for models based on a distance rule. In this problem, we therefore perform scaling only for the KNN model; logistic regression, LDA and naïve Bayes are not affected by scaling.

Let us split the dataset into 70% training and 30% test. Here vote is the target variable and the rest are the independent variables.
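The split itself, as a sketch (the random_state value is an assumption):

from sklearn.model_selection import train_test_split

X = df.drop(columns=['vote'])              # independent variables
y = df['vote']                             # target variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)  # random_state is an assumption
print(X_train.shape, X_test.shape)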

The output above shows the training observations and, below it, the test observations.

The sizes of the training and test sets are:

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

model = LogisticRegression(max_iter=1000)
model1 = LinearDiscriminantAnalysis()
model2 = GaussianNB()

We will bring in the KNN model later, after scaling. Having instantiated the models from the library, let us fit the training data to each and check the model accuracy on both the training and test sets.
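Fitting and scoring could look like this sketch (naïve Bayes is fitted in the next section):

for name, m in [('Logistic regression', model), ('LDA', model1)]:
    m.fit(X_train, y_train)
    print(name, 'train:', m.score(X_train, y_train),
          'test:', m.score(X_test, y_test))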

From the accuracies above, logistic regression looks slightly better than LDA, with a small edge in training accuracy; both models have the same test accuracy. Neither logistic regression nor LDA requires scaling, since they are not affected by scaled values.

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

Let us import the naïve Bayes and KNN models from the sklearn library. For the KNN model specifically, we scale the data before splitting it into training and test sets; for naïve Bayes we leave the data unscaled. We then fit the training data to each model and check the predictions.
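A sketch of the scaled KNN pipeline next to the unscaled naïve Bayes (k=5 is sklearn's default and matches the value discussed later; the choice of StandardScaler is an assumption; model2 is the GaussianNB instance defined earlier):

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

scaler = StandardScaler().fit(X_train)      # fit the scaler on training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)
print('KNN train:', knn.score(X_train_s, y_train),
      'test:', knn.score(X_test_s, y_test))

model2.fit(X_train, y_train)                # naïve Bayes on unscaled data
print('NB train:', model2.score(X_train, y_train),
      'test:', model2.score(X_test, y_test))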

From the accuracies above, the KNN model has the highest training accuracy of all the models, and every model shows some overfitting. We will treat a model with less than a 10% gap between training and test accuracy as acceptable. Let us therefore generate predictions with all the models and compare their precision, recall, F1-score and accuracy, using the F1-score to compare model efficiency.

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.

Let us use GridSearchCV to tune the models. For bagging, we will use a random forest as the base estimator of the bagging classifier, and for boosting we will apply the gradient boosting and AdaBoost classifiers. Naïve Bayes does not have many hyperparameters, so it is difficult to tune; we will tune all the other base models over different combinations of parameters.
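As one example, a GridSearchCV sketch for the KNN model, using the scaled features from the previous step (the parameter grid here is illustrative, not the one used in the report):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': list(range(3, 15)),
              'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_s, y_train)
print(grid.best_params_)
print(grid.best_estimator_)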

Logistic regression TUNED

After running GridSearchCV over every permutation and combination of parameters in the grid, we obtained the best estimator for logistic regression.

LDA TUNED

After feeding different parameter combinations into GridSearchCV, we obtained the best parameters for LDA, shown below.

KNN TUNED

From the cross-validation over the parameter combinations, we obtained the parameters below as the best for the KNN model.

We will keep the base-model parameters for naïve Bayes, since it does not have many parameters to tune.

BAGGING

For bagging, let us import RandomForestClassifier from sklearn.ensemble, set its parameters, and fit the model. Before that, let us compute the out-of-bag (OOB) score to estimate the amount of error in the predictions.

The outputs above show the OOB score for each random state; random state 5 gives the best score, so we choose it for better predictions. After setting the parameters of the random forest, we fit the training data and perform the bagging operation, with the random forest used as the base estimator of the bagging classifier.
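A sketch of the bagging setup described above (the n_estimators values are illustrative; in sklearn versions before 1.2 the estimator= argument is called base_estimator=):

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=5)
rfc.fit(X_train, y_train)
print('OOB score:', rfc.oob_score_)         # out-of-bag estimate of accuracy

bgcl = BaggingClassifier(estimator=rfc, n_estimators=50, random_state=1)
bgcl.fit(X_train, y_train)
print('train:', bgcl.score(X_train, y_train),
      'test:', bgcl.score(X_test, y_test))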

BOOSTING

We are using two boosting techniques to find the best model: gradient boosting and AdaBoost. The two classifiers are instantiated as follows:

from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier

gbcl = GradientBoostingClassifier(n_estimators=50, random_state=1)
abcl = AdaBoostClassifier(n_estimators=10, random_state=1)

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model.
Final Model: Compare the models and write inference which model is best/optimized.

Let us evaluate the training and test accuracy, precision, recall and F1-score for each model.
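Each report below was generated with a pattern like this sketch (shown here for the logistic regression model):

from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))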

Logistic regression – training

Logistic regression (tuned) – training

Logistic regression – testing

Logistic regression (tuned) – testing

There is not much difference between the tuned and base logistic regression models. We obtained an F1-score of 87% for Labour voters and 69% for Conservative voters.

LINEAR DISCRIMINANT ANALYSIS

LDA TRAINING

LDA TRAINING TUNED

LDA TESTING

LDA TESTING TUNED

The base and tuned models show no difference. The LDA models give an F1-score of 87% for Labour voters and 70% for Conservative voters; Conservative voters are predicted slightly more accurately by LDA than by logistic regression.

NAÏVE BAYES MODEL

Naïve Bayes model – training

Naïve Bayes model – testing

From the naïve Bayes test classification report, we obtained a good F1-score for Labour voters: a 1% improvement over the earlier models, with a 70% F1-score for Conservative voters. So far this model looks good, since there is only a 2% gap between the training and test accuracies, i.e. naïve Bayes has less of an overfitting issue. To an extent, the precision and recall rates are also quite good. Let us check the KNN model's performance next using the same metrics.

KNN MODEL

KNN MODEL – TRAINING

KNN MODEL TUNED – TRAINING

Once tuned, the training accuracy of the KNN model increases.

KNN MODEL TESTING

KNN MODEL TUNED TESTING

After tuning, the F1-score for Labour voters on the test report increases by 1%. This model looks the strongest so far, with better F1-score, precision, recall and accuracy than the other models. We will compare everything in detail in a table shortly; before that, let us also check the bagging and boosting classifier reports.

BAGGING

Before applying the bagging classifier, we fitted a random forest on the training data and then used it as the base estimator of the bagging classifier. After fitting on the training set and evaluating on the test set, we obtained the following confusion matrices:

BAGGING TRAINING

MODEL ACCURACY = 0.9736098020735156

BAGGING TESTING

MODEL ACCURACY = 0.8157894736842105

After applying the bagging classifier, we obtained 97% training accuracy with good F1-scores, but on the test set the accuracy drops to only 82%, and the F1-scores for both the Conservative and Labour classes fall noticeably compared with the other models. There is a 15% gap between the training and test accuracies of the bagging classifier.

BOOSTING

Training model accuracy = 0.8397737983034873

Testing model accuracy = 0.8179824561403509

For the boosting classifier, the training accuracy is higher than the test accuracy, so both the bagging and boosting models overfit, and both reach about the same test accuracy. Their overall performance is similar, but the boosting F1-score for Conservative voters is about 2% higher than bagging's, so we can conclude that boosting is the better of the two in this case.

ROC AND AUC CURVE
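Each ROC curve below follows a pattern along these lines (a sketch using predicted probabilities for the positive class, shown for the logistic regression model):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

probs = model.predict_proba(X_test)[:, 1]   # probability of the Labour class (code 1)
fpr, tpr, _ = roc_curve(y_test, probs)
print('AUC:', roc_auc_score(y_test, probs))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')    # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()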

1.) LOGISTIC REGRESSION

Training

Training tuned

Testing

Testing tuned

2.) LDA

LDA TRAINING

LDA TUNED TRAINING

LDA TESTING

LDA TUNED TESTING

3.) NAÏVE BAYES

TRAINING

TESTING

4.) KNN

KNN TRAINING

KNN TUNED TRAINING

KNN TESTING

KNN TUNED TESTING

5.) BAGGING

Training

Testing

6.) BOOSTING (Gradient Boosting)

Training

Testing

MODEL ACCURACY COMPARISON

All the models overfit to some degree and have similar accuracies. From the metrics above, the KNN model with k=5 clearly has the best accuracy of all the models. Tuning reduced the overfitting to an extent, whereas the bagging model increased training accuracy but performed poorly on the test side. For the KNN model, tuning raised the training accuracy by 1% while the test accuracy stayed the same. Taking a gap of less than 10% between training and test accuracy as the criterion for a good model, we can conclude that KNN is the right model for predicting voters from this dataset.

MODEL F1-SCORE COMPARISON

If we look closely at the F1-scores for the Conservative and Labour classes, the KNN model with k=5 performs best of all the models, even though there is a class imbalance in the target variable: we obtain an F1-score of about 70% for Conservative voters and 88% for Labour voters. After tuning, the KNN model's F1-score for Labour voters increased by 1%.

After evaluating different numbers of nearest neighbours, we found that k=5 performs better than k=6.

The figure above shows the misclassification error against the number of neighbours k. The error is stable between k=5 and k=6, and at k=5 it is at its lowest (about 0.18), which is why this model beats the other k values. The remaining models are also reasonable, but their accuracies are lower than the KNN model's, and while the bagging and boosting classifiers perform very well on the training data, their performance falls away on the test side. Using the KNN model with k=5, we can predict whether each voter will vote Labour or Conservative.
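The misclassification-error curve can be reproduced with a sketch like this (reusing the scaled features from the KNN step; the range of k is illustrative):

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

ks = range(1, 20)
errors = []
for k in ks:
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    errors.append(1 - knn_k.score(X_test_s, y_test))   # misclassification error

plt.plot(ks, errors, marker='o')
plt.xlabel('Number of neighbours k')
plt.ylabel('Misclassification error')
plt.show()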

1.8 Based on these predictions, what are the insights?

From the confusion matrix, the true positive, false positive, false negative and true negative counts can be extracted, and from these the accuracy, precision, recall and F1-score are calculated. The model's performance metrics before fine-tuning are listed below:

Train data:

True Positives: 239, False Positives: 46, False Negatives: 68, True Negatives: 708, AUC: 95.1%, Accuracy: 89%, Precision: 91%, F1-score: 93%, Recall: 94%.

Test data:

True Positives: 103, False Positives: 35, False Negatives: 50, True Negatives: 268, AUC: 89.9%, Accuracy: 81%, Precision: 84%, F1-score: 86%, Recall: 88%.

Clearly, our model performs better on the training set than on the test set. The FPR tells us what proportion of the negative class is incorrectly classified; here we have a high TNR and a low FPR, which is desirable for classifying the negative class. Both Type I errors (false positives) and Type II errors (false negatives) are low, indicating high sensitivity/recall, precision, specificity and F1-score. F1-score, recall, precision and AUC are all better on the training data.

The model can be considered a good one. The better technique between bagging and boosting depends on the available data, the simulation, and the circumstances at the time. In this case we would consider boosting the better technique, since the bagging model overfits the training data far more strongly. Boosting significantly reduces an estimate's variance during the combination procedure, thereby increasing accuracy, so the combined results are more stable than the individual ones. The boosting technique generated a unified model with lower error because it concentrates on optimizing the strengths of, and reducing the shortcomings in, a single model.

Problem No.2
In this particular project, we are going to work on the inaugural corpus from NLTK in
Python. We will be looking at the following speeches of the Presidents of the United States
of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned documents.

We assign the three text files to a variable called 'x' and convert them to lists. We then load these texts into a new DataFrame, which we call 'y', so that the remaining tasks are easier to perform: the speech from each file is converted to list format and placed in the text column of the 'y' DataFrame.
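A sketch of the counting step, assuming the inaugural corpus has already been downloaded as in the snippet above (the use of NLTK's sentence tokenizer is an assumption):

import nltk
nltk.download('punkt')
from nltk.corpus import inaugural

for f in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    raw = inaugural.raw(f)
    print(f, '| characters:', len(raw),
          '| words:', len(raw.split()),
          '| sentences:', len(nltk.sent_tokenize(raw)))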

Roosevelt speech

Number of words:

Character count:

Average words:

Nixon speech

Number of words:

Character count:

Average words:

Kennedy speech

(As above, the Kennedy text file is loaded into the DataFrame y, with the list of speech values inserted into the text column.)

Number of words:

Character count:

Average words:

All three together:

2.2 Remove all the stopwords from all three speeches.

The Roosevelt speech contains around 632 stopwords, the Nixon speech around 899, and the Kennedy speech around 618. We have to remove all the stopwords and punctuation from each speech to find its most frequent words. Before removing the stopwords, we first convert the text to lower case and split it into words; special characters, numbers and punctuation are then removed. Finally, we import the stopword list from the nltk corpus library and use it to filter out the stopwords.
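A sketch of the stopword-removal step described above (the regex-based cleaning is one reasonable choice, not necessarily the exact one used):

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import inaugural, stopwords

stop_words = set(stopwords.words('english'))

def clean(raw):
    raw = raw.lower()                       # lowercase first
    raw = re.sub(r'[^a-z\s]', ' ', raw)     # strip punctuation, digits, special chars
    return [w for w in raw.split() if w not in stop_words]

roosevelt_words = clean(inaugural.raw('1941-Roosevelt.txt'))
nixon_words = clean(inaugural.raw('1973-Nixon.txt'))
kennedy_words = clean(inaugural.raw('1961-Kennedy.txt'))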

Roosevelt speech

Nixon speech

Kennedy speech

2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (after removing the stopwords)

Roosevelt speech

Roosevelt's top three words are 'nation', 'know' and 'democracy'.

Nixon speech

Nixon's three most frequent words are 'us', 'let' and 'peace'.

Kennedy speech

Kennedy's top three words are 'let', 'us' and 'world'. Across the three speeches, 'us' is the one word used frequently by all three presidents.
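The top-three counts can be obtained with a sketch like this, reusing the cleaned word lists from the earlier sketch:

from collections import Counter

for name, words in [('Roosevelt', roosevelt_words),
                    ('Nixon', nixon_words),
                    ('Kennedy', kennedy_words)]:
    print(name, Counter(words).most_common(3))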

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)
Code Snippet to extract the three speeches:
"
import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')
"

We used the wordcloud library in Python to visualise the most repeated words in each president's speech. Each president had his own perspective, but almost all of them tried to highlight unity: 'us' and 'world' are words they used frequently to stress the strength of unity.
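A sketch of the word cloud generation, assuming the third-party wordcloud package is installed and reusing the cleaned word lists from above:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(background_color='white').generate(' '.join(roosevelt_words))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()                                  # repeat with nixon_words and kennedy_words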
Roosevelt

Nixon

Kennedy

