ML Project Report

Problem 1:
Problem Statement:
You are hired by one of the leading news channels, CNBE, which wants to analyze recent elections. A survey was
conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on
the basis of the given information, in order to create an exit poll that will help predict the overall win and the seats
covered by a particular party.
1.1 Read the dataset. Describe the data briefly. Interpret the inferences for each. Initial steps
like head(), .info(), data types, etc. Null value check, summary stats, and skewness must be
discussed.
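The initial checks listed above can be sketched as follows. This is a minimal illustration, not the report's actual code: the tiny DataFrame, its column names (`vote`, `age`, `gender`) and the party labels are hypothetical stand-ins for the 1525-row, 9-variable survey file, which is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; names and values are assumptions.
df = pd.DataFrame({
    "vote": ["party_a", "party_b", "party_a", "party_a", "party_b", "party_a"],
    "age": [43, 36, 35, 24, 41, 47],
    "gender": ["female", "male", "male", "female", "male", "male"],
})

print(df.head())                         # first few rows
df.info()                                # column dtypes and non-null counts
null_counts = df.isnull().sum()          # null value check per column
summary = df.describe(include="all")     # summary statistics for all columns
skewness = df.skew(numeric_only=True)    # skewness of the numeric columns only

print(null_counts)
print(summary)
print(skewness)
```

With the real data, the same calls would be applied to the DataFrame loaded from the survey file.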
1.2 Perform EDA. Perform Uni-variate and Bivariate Analysis. Do exploratory data analysis.
Check for Outliers.
- Checking for Outliers:
- UniVariate Analysis:
- Bi-Variate and Multi-Variate Analysis:
- Checking for Correlations:
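One common way to run the outlier and correlation checks is the 1.5 x IQR rule together with a pairwise correlation matrix. The sketch below assumes that rule (the report does not state which outlier criterion was used) and works on a small hypothetical DataFrame with made-up column names.

```python
import pandas as pd

# Hypothetical numeric columns standing in for the survey variables.
df = pd.DataFrame({
    "age": [24, 35, 36, 41, 43, 47, 95],
    "score": [1, 2, 2, 3, 3, 4, 5],
})

def iqr_outliers(series):
    """Return the values lying outside the 1.5 * IQR whiskers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

outliers = iqr_outliers(df["age"])   # univariate outlier check
corr = df.corr()                     # pairwise Pearson correlations

print(outliers)
print(corr)
```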
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30).
1. We will not be scaling the variables.
- Encoding the Target Variable to get the values in 0’s and 1’s:
- Train – Test Split:

- Splitting the variables into train set and test set:
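The encoding and the 70:30 split could look like the sketch below. The DataFrame, its column names and the party labels are hypothetical; the `random_state` and the use of `stratify` are assumptions made for reproducibility, not settings taken from the report. No scaling step is applied, in line with the decision stated above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the survey data.
df = pd.DataFrame({
    "vote": ["party_a", "party_b"] * 10,
    "gender": ["female", "male", "male", "female"] * 5,
    "age": list(range(30, 50)),
})

# Encode the target as 0/1 (1 = party_a in this illustration).
df["vote"] = (df["vote"] == "party_a").astype(int)

# One-hot encode the remaining string columns, dropping the first level.
X = pd.get_dummies(df.drop(columns="vote"), drop_first=True)
y = df["vote"]

# 70:30 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```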
1.4 Apply Logistic Regression and LDA (linear discriminant analysis)
- Logistic Regression Model:

- Linear Discriminant Analysis:
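A minimal sketch of fitting both models with scikit-learn is given below. It runs on synthetic data from `make_classification` rather than the survey data, and the hyperparameters (`max_iter=1000`, default LDA settings) are assumptions, not the values used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the encoded survey.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) to compare for over-fitting.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, scores[name])
```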
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
- KNN Model:
- Gaussian Naïve Bayes Model:
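The two models can be fitted in the same way. The sketch below uses synthetic data and an assumed `n_neighbors=5`; the value of k used in the report is not stated. Comparing train and test accuracy side by side is what lets the results be interpreted for stability.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data standing in for the encoded survey.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),  # k=5 is an assumption
    "Gaussian Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, scores[name])
```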
1.6 Apply Model Tuning, Bagging (Random Forest should be applied for Bagging), and
Boosting
- Linear Logistic Regression with Grid Search:

- Linear Discriminant Analysis with Grid Search:
- KNN Model with Grid Search:
- Gaussian Naïve Bayes Model with Grid Search:
- Bagging:
- Bagging with Grid Search:
- Boosting:
- Boosting with Grid Search:
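The tuning, bagging and boosting steps above might be sketched as follows. The data is synthetic, and the parameter grid, the number of estimators and the 3-fold cross-validation are illustrative assumptions; the grids actually searched in the report are not given.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data standing in for the encoded survey.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Bagging with a Random Forest as the base estimator (passed positionally).
bagging = BaggingClassifier(
    RandomForestClassifier(n_estimators=20, random_state=1),
    n_estimators=5,
    random_state=1,
)
bagging.fit(X_train, y_train)
bagging_test_acc = bagging.score(X_test, y_test)

# Gradient Boosting tuned with a small, illustrative grid.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid={"n_estimators": [25, 50], "max_depth": [2, 3]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
boosting_test_acc = grid.best_estimator_.score(X_test, y_test)

print("Bagging test accuracy:", bagging_test_acc)
print("Best boosting params:", grid.best_params_)
print("Boosting test accuracy:", boosting_test_acc)
```

The same `GridSearchCV` wrapper applies to the other models listed, with a grid suited to each estimator.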
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model.
Final Model: Compare the models and write inference which model is best/optimized
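The metrics named above can be computed for any fitted classifier as in this sketch. It uses Gaussian Naive Bayes on synthetic data purely as an example; the numbers it prints are not the report's results. The `fpr` and `tpr` arrays returned by `roc_curve` are the points one would plot to draw the ROC curve.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary-classification data standing in for the encoded survey.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = GaussianNB().fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]    # probability of class 1

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)         # rows: actual, cols: predicted
auc = roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)  # points of the ROC curve

print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
print("ROC AUC:", auc)
print(classification_report(y_test, y_pred))  # per-class recall, precision, F1
```

Repeating the same calls on `X_train` gives the train-set figures needed to judge whether a model is stable across both sets.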
- Model Comparison without Grid Search:
1. The worst performing models are KNN and Bagging, as their recall values for class 1 are very low on the test set.
2. The best model is the Naive Bayes model, as its recall values and accuracy are stable across both the train and
test sets.
- Model Comparison after Grid Search:
1. The worst performing models are LDA and LLR (Linear Logistic Regression).
2. The best model is the Naive Bayes model.
1.8 Based on these predictions, what are the insights?

- The worst performing models are LDA and LLR.
- Bagging and Gradient Boosting, which performed well on the train set, performed worst on the test set.
- Naive Bayes and KNN showed some stability, but the difference between their train and test scores led us to
apply SMOTE, which increased the recall scores of both.
- The overall best model for this prediction is the Naive Bayes model, both with and without Grid Search.
- Naive Bayes with SMOTE and KNN with SMOTE give better accuracy, recall and F1 scores on both the training
and testing sets.
Problem 2: TEXT MINING

Problem Statement:
In this project, we work on the inaugural corpus from nltk in Python. We look at the following speeches of the
Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
1.1 Find the number of characters, words, and sentences for the mentioned documents
- The documents taken from the inaugural corpus are speeches by three former presidents of the United States:
Pres. Roosevelt, Pres. Kennedy and Pres. Nixon.
- Word Count:
- Characters Count:
- Sentence Count:
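The three counts can be approximated with the standard library alone, as sketched below. This is an assumption-laden stand-in: the report works on the nltk inaugural corpus, whose tokenizers will generally give different counts from these simple regular expressions, and the sample text is a made-up placeholder rather than an excerpt from any speech.

```python
import re

def text_counts(text):
    """Count characters, words and sentences with simple regex rules."""
    words = re.findall(r"[A-Za-z']+", text)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
    }

# Placeholder text; with nltk the raw speech text would be passed instead.
sample = "We meet again. The nation is strong! Will we keep it so?"
counts = text_counts(sample)
print(counts)
```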
1.2 Remove all the stop-words from all three speeches.
- Lowercase conversion:
- Punctuation and Special Characters:
- Stemming:
- Stop words:
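A minimal sketch of the cleaning steps listed above follows. The `STOP_WORDS` set is a small hypothetical list used only for illustration (the report presumably relies on a full stop-word list such as the one shipped with nltk), the sample text is a placeholder, and stemming is left out to keep the example dependency-free.

```python
import re

# Small illustrative stop-word list; an assumption, not the list used in the report.
STOP_WORDS = {"the", "is", "we", "it", "so", "will", "a", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, strip punctuation/special characters, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = preprocess("We meet again. The nation is strong! Will we keep it so?")
print(tokens)
```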
1.3 Which word occurs the most number of times in the inaugural address of each president?
Mention the top three words (after removing the stop-words).
- All three speeches combined word frequency:
- President Roosevelt:
- President Kennedy:
- President Nixon:
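Once a speech has been reduced to a list of cleaned tokens, the top three words can be read off with `collections.Counter`. The token list below is hypothetical and its counts are illustrative only; they are not the frequencies found in the speeches.

```python
from collections import Counter

# Hypothetical cleaned tokens (stop words already removed).
tokens = ["us", "nation", "america", "us", "nation",
          "us", "peace", "us", "nation", "america"]

word_freq = Counter(tokens)
top_three = word_freq.most_common(3)   # list of (word, count), highest first
print(top_three)
```

Running the same two lines per president, and once on the concatenated token lists, gives the per-speech and combined frequencies.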
1.4 Plot the word cloud of each of the speeches of the variable. (after removing the stop-words)
- President Roosevelt:
- President Kennedy:
- President Nixon:
1.5 Inference:
- Pres. Nixon's speech is the longest of the three, with a total word count of 1769 across 51 sentences.
- Across all three speeches, "us" is the most frequently used word, followed by "nation" and "america".
