
Problem 1

1. Read the dataset. Compute the descriptive statistics and check for null values. Write an inference
on the results.

Head of the data is as follows

Information about the data type of various columns is given below

A description of the dataset is given below. A few interpretations are as follows:

1. The mean and the median (50%) differ considerably for Blair, Hague, Europe and political
knowledge.
2. The count is the same for all columns, so there are no missing values.

We check for null values to confirm this interpretation.

There are 8 duplicate rows in the dataset. However, they cannot be treated as true duplicates because
the age differs across these records, so we will not remove these rows.
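
The checks described above can be sketched with pandas as follows. This is a minimal illustration and not the report's own code: the DataFrame is a made-up stand-in that uses some of the column names mentioned in the report, and the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the survey data. The column names follow those
# mentioned in the report; the values are made up for illustration.
df = pd.DataFrame({
    "age": [43, 36, 35, 24, 43, 36],
    "Blair": [4, 4, 5, 2, 4, 4],
    "Hague": [1, 4, 2, 1, 1, 4],
    "Europe": [2, 5, 3, 4, 2, 5],
})

print(df.head())          # head of the data
df.info()                 # data type and non-null count of each column
summary = df.describe()   # count, mean, std and quartiles (50% is the median)
print(summary)

null_counts = df.isnull().sum()             # missing values per column
n_duplicates = int(df.duplicated().sum())   # rows identical to an earlier row
print(null_counts)
print("Duplicate rows:", n_duplicates)
```

Comparing the `mean` row of `summary` with its `50%` row is one way to spot the mean versus median differences noted in the interpretation.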

2. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Checking for outliers
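
The report does not state which rule was used to flag outliers. A common choice, assumed here, is the box plot rule that flags values more than 1.5 times the interquartile range beyond the quartiles. The function name `iqr_outliers` and the sample ages are hypothetical.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR whiskers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up ages; 150 is an obvious outlier.
ages = [24, 35, 41, 47, 53, 58, 66, 150]
outliers = iqr_outliers(ages)
print(outliers)
```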
Data Preparation

1. Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split
the data into train and test (70:30).
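
A minimal sketch of the encoding and the 70:30 split, assuming pandas and scikit-learn. The column names `vote` and `gender` are assumed names for the string columns, the rows are made up, and the `random_state` is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows; "vote" and "gender" are assumed names for the string columns.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour", "Labour", "Conservative",
             "Labour", "Conservative", "Labour", "Labour", "Conservative"],
    "gender": ["female", "male", "male", "female", "male",
               "female", "female", "male", "female", "male"],
    "age": [43, 36, 35, 24, 41, 47, 57, 77, 39, 70],
})

# Replace each string column with integer category codes
# (categories are numbered in alphabetical order).
for col in ["vote", "gender"]:
    df[col] = pd.Categorical(df[col]).codes

X = df.drop(columns="vote")   # predictors
y = df["vote"]                # target

# 70:30 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```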

Modelling

1. Apply Logistic Regression and LDA (linear discriminant analysis).


Logistic Regression

Model accuracy score for training data – 83.97%

AUC for training data – 88.9%

Model accuracy score for test data – 82.31%

AUC for training data – 88.9%

Confusion Matrix for training data


Confusion matrix for test data
Linear Discriminant Analysis
AUC for the Training Data: 88.9%

AUC for the Test Data: 88.4%
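
The sketch below shows one way to obtain accuracy, AUC and the confusion matrix for logistic regression and LDA with scikit-learn. It runs on synthetic data from `make_classification`, not on the report's dataset, so its numbers will not match those above. The model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded survey features.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}
splits = [("train", X_train, y_train), ("test", X_test, y_test)]

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, X_part, y_part in splits:
        pred = model.predict(X_part)
        prob = model.predict_proba(X_part)[:, 1]   # probability of class 1
        results[(name, split)] = {
            "accuracy": accuracy_score(y_part, pred),
            "auc": roc_auc_score(y_part, prob),
            "confusion": confusion_matrix(y_part, pred),
        }
        print(name, split, results[(name, split)])
```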


2. Apply KNN Model and Naïve Bayes Model. Interpret the results.

KNN Model

AUC for the Training Data: 91.1%

AUC for the Test Data: 86.1%


KNN_train_precision: 0.87

KNN_train_recall: 0.88

KNN_train_f1: 0.88

KNN_test_precision: 0.86

KNN_test_recall: 0.87

KNN_test_f1: 0.86
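
For reference, precision, recall and F1 all derive from the confusion matrix counts. The following is an illustrative pure-Python sketch with made-up labels; `binary_metrics` is a hypothetical helper, not a function from the report.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion counts, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        # Rows are actual classes, columns are predicted classes.
        "confusion": [[tn, fp], [fn, tp]],
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```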

Naïve Bayes Model

AUC for the Training Data: 88.6%

AUC for the Test Data: 88.5%
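
A minimal sketch of fitting KNN and Gaussian Naïve Bayes and comparing training and test AUC, assuming scikit-learn and synthetic data. The number of neighbours and the scaling step are assumptions; the report does not say which settings were used.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the encoded survey features.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    # KNN is distance-based, so features are standardised first. This is an
    # assumption; the report does not say whether scaling was applied.
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": GaussianNB(),
}

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc[name] = {
        "train": roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
        "test": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    }
    print(name, auc[name])
```

A training AUC well above the test AUC, as reported for KNN, is the pattern to watch for in this output.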


3. Model Tuning, Bagging (Random Forest should be applied for Bagging) and Boosting.

Random Forest

AUC for the Training Data: 99.7%

AUC for the Test Data: 89.6%
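
As a hedged sketch of tuning, bagging and boosting, the code below tunes a random forest with a small grid search and fits a gradient boosting model for comparison. The parameter grid, the choice of gradient boosting as the boosting method, and the synthetic data are all assumptions; the report does not give these details.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data standing in for the encoded survey features.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Model tuning: pick the tree depth by 3-fold cross-validated AUC.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=1),
    param_grid={"max_depth": [3, 5, None]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X_train, y_train)
best_rf = grid.best_estimator_   # refit on the full training set

# Boosting: gradient boosting is used here as one possible choice.
boost = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

auc = {}
for name, model in [("Random Forest", best_rf), ("Gradient Boosting", boost)]:
    auc[name] = {
        "train": roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
        "test": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    }
    print(name, auc[name])
print("Best parameters:", grid.best_params_)
```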


4. Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized.

Based on the results of the various models, we can infer that Random Forest is the best-optimized
model, as it has the highest area under the curve (AUC) on both the training and the test data.

Particulars            Training Data    Test Data
Logistic Regression    83.9%            82.37%
LDA                    88.9%            88.4%
KNN                    91.1%            86.1%
Naïve Bayes            88.6%            88.5%
Random Forest          99.7%            89.6%

Inference

1. Based on these predictions, what are the insights?

Accuracy on Test data is 99% and on Train data is 80%.

AUC is greater than 89% for both.

Recall and precision are low and are the same on both datasets.

The model results on the training and test sets are similar, indicating no underfitting or
overfitting issues.

Problem 2
1. Find the number of characters, words and sentences for the mentioned documents.

For 1941 Roosevelt speech

The number of sentences in the text is: 68

The number of words in the text is: 1,360

The number of characters in the text is: 7,571

For 1961 Kennedy speech

The number of sentences in the text is: 52

The number of words in the text is: 1,390

The number of characters in the text is: 7,618

For 1973 Nixon speech

The number of sentences in the text is: 68

The number of words in the text is: 1,819

The number of characters in the text is: 9,991
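
The report does not say which tokenizer produced these counts, and the numbers depend on that choice. The sketch below uses simple regular expressions as one possible approach; `text_stats` and the sample text are hypothetical and are not taken from the speeches.

```python
import re

def text_stats(text):
    """Count characters, words and sentences with simple regular expressions."""
    # Split after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # A word is a run of letters, optionally containing apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "characters": len(text),   # includes spaces and punctuation
        "words": len(words),
        "sentences": len(sentences),
    }

# Made-up sample text for illustration.
sample = "Let us begin. Ask what you can do! Is this a new era?"
stats = text_stats(sample)
print(stats)
```

A different tokenizer, for example one that counts punctuation marks as words, would give different word totals for the same text.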

2. Remove all the stopwords from all the three speeches.

3. Which word occurs most often in each president's inaugural address?
Mention the top three words (after removing the stopwords).
Top three words for Roosevelt are: nation, know, spirit

Top three words for Kennedy are: let, us, world

Top three words for Nixon are: us, let, America
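
Stopword removal followed by a frequency count can be sketched as below. The stopword set here is a tiny illustrative one and the sample text is made up; a real analysis would use a full stopword list, and the results depend on which list is chosen.

```python
import re
from collections import Counter

# A tiny illustrative stopword set (an assumption, not the list used in the report).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "we", "our", "for", "that"}

def top_words(text, n=3):
    """Return the n most frequent words after lowercasing and stopword removal."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return Counter(kept).most_common(n)

# Made-up sample text for illustration.
sample = ("We know the nation. We know the spirit of the nation. "
          "The nation and the spirit: we know the nation.")
top3 = top_words(sample)
print(top3)
```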

4. Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)

Word cloud for Roosevelt

Word cloud for Kennedy after cleaning


Word cloud for Nixon after cleaning
