Palash Bhai - Machine Learning Assignment
1. Read the dataset. Do the descriptive statistics and do null value condition check. Write an inference
on it.
A description of the data set is given below. A few interpretations are as follows:
1. The mean and the median (the 50% row) differ noticeably for Blair, Hague, Europe and
Political knowledge
2. The count is the same for all columns, which suggests there are no missing values
Checking for null values to confirm our interpretation
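The descriptive statistics and the null check described above can be sketched with pandas as follows. This is only an illustration: the data frame below is made up, and the column name political_knowledge is an assumed spelling, not necessarily the one in the actual file.

```python
import pandas as pd

# Made-up stand-in for the survey data; the real file is not reproduced here.
# Column names follow the report; "political_knowledge" is an assumed spelling.
df = pd.DataFrame({
    "age": [43, 36, 35, 24, 41],
    "Blair": [4, 4, 5, 2, 1],
    "Hague": [1, 4, 2, 1, 1],
    "Europe": [2, 5, 3, 4, 6],
    "political_knowledge": [2, 2, 2, 0, 2],
})

# Descriptive statistics: count, mean, std, min, quartiles and max.
# The "50%" row is the median, which can be compared with the mean.
summary = df.describe()

# Number of missing values per column; all zeros confirms there are no nulls.
null_counts = df.isnull().sum()

print(summary)
print(null_counts)
```

If every entry of null_counts is zero, the equal counts in the describe() output and the explicit null check agree.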
There are 8 duplicate rows in the dataset. However, they cannot be inferred to be repeated
records, as the age differs across the various duplicate items, so we will not eliminate these rows.
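A duplicate count of this kind can be obtained with pandas. The sketch below uses made-up rows, and the gender column is an assumption added only for illustration.

```python
import pandas as pd

# Made-up rows; rows 0 and 2 are identical in every column.
df = pd.DataFrame({
    "age": [43, 36, 43, 24],
    "gender": ["female", "male", "female", "male"],  # assumed column
    "Blair": [4, 4, 4, 2],
})

# duplicated() marks each row that repeats an earlier, fully identical row.
n_duplicates = int(df.duplicated().sum())

# keep=False returns every member of a duplicate group, which helps when
# deciding whether the rows should be dropped or kept.
duplicate_rows = df[df.duplicated(keep=False)]

print(n_duplicates)
print(duplicate_rows)
```

Inspecting duplicate_rows before calling drop_duplicates() makes the decision to keep or remove the rows explicit.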
2. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Checking for outliers
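One common way to check for outliers is the 1.5 x IQR rule, which is also what a standard box plot draws. The report does not state which rule was used, so the rule and the values below are assumptions for illustration.

```python
import pandas as pd

# Made-up ages; 93 is deliberately extreme.
ages = pd.Series([24, 35, 36, 41, 43, 44, 47, 52, 93], name="age")

# Interquartile range and the usual 1.5 * IQR fences.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

Applying the same steps to each numeric column shows which variables, if any, need outlier treatment.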
Data Preparation
1. Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split
the data into train and test (70:30).
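The encoding and the 70:30 split can be sketched as below. The data, the vote target column and its labels are made up for illustration. On scaling: KNN is distance based, so when features sit on different ranges (for example age versus a 1 to 11 rating), scaling is generally advisable for that model, while tree-based models are largely insensitive to it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data; "vote" as the target and its labels are assumptions.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 5,
    "gender": ["female", "male", "male", "female", "male"] * 2,
    "age": [43, 36, 35, 24, 41, 47, 57, 77, 39, 70],
    "Europe": [2, 5, 3, 4, 6, 4, 11, 1, 11, 11],
})

# Encode the string columns as integer codes.
for col in ["vote", "gender"]:
    df[col] = pd.Categorical(df[col]).codes

X = df.drop(columns="vote")
y = df["vote"]

# 70:30 split; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Fit the scaler on the training data only, then apply it to both parts,
# so that no information from the test set leaks into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```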
Modelling
KNN Model
KNN_train_recall: 0.88
KNN_train_f1: 0.88
KNN_test_precision: 0.86
KNN_test_recall: 0.87
KNN_test_f1: 0.86
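Metrics of this kind can be computed with scikit-learn as sketched below. The data here is synthetic, and n_neighbors=5 and the default (binary) averaging are assumptions, so the numbers will not match the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the real data set.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# n_neighbors=5 is the library default, not necessarily the value used in the report.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

def report(model, features, labels):
    """Return precision, recall and F1 for the model on the given data."""
    pred = model.predict(features)
    return {
        "precision": precision_score(labels, pred),
        "recall": recall_score(labels, pred),
        "f1": f1_score(labels, pred),
    }

knn_train = report(knn, X_train, y_train)
knn_test = report(knn, X_test, y_test)
print(knn_train)
print(knn_test)
```

Comparing knn_train with knn_test side by side is what supports a statement about underfitting or overfitting.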
Based on the results of the various models, we can infer that Random Forest is the best-performing
model, as it has the highest Area under the Curve (AUC).
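A comparison of models by AUC can be sketched as follows. The data is synthetic and the hyperparameters are assumptions, so which model wins here says nothing about the actual results of the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the real data set.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hyperparameters are illustrative assumptions.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

test_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC needs scores, so use the predicted probability of the positive class.
    proba = model.predict_proba(X_test)[:, 1]
    test_auc[name] = roc_auc_score(y_test, proba)

# Model with the highest test AUC.
best_model = max(test_auc, key=test_auc.get)
print(test_auc)
print(best_model)
```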
Inference
The model results on the training and test sets are similar, indicating no underfitting or
overfitting issues.
Problem 2
1. Find the number of characters, words and sentences for the mentioned documents.
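The counts can be approximated with plain Python as sketched below. The sample text is made up, and the regular expressions are simple assumptions; a tokenizer from an NLP library would give slightly different numbers on real speeches.

```python
import re

# Made-up sample text, not taken from any of the speeches.
text = "We meet today. The nation asks for action, and action now! Will we answer?"

# Characters: every character, including spaces and punctuation.
n_chars = len(text)

# Words: runs of letters and apostrophes.
words = re.findall(r"[A-Za-z']+", text)

# Sentences: split after ., ! or ? when followed by whitespace.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(n_chars, len(words), len(sentences))
```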
3. Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (after removing the stopwords)
Top three words for Roosevelt are: nation, know, spirit
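The top three words after stopword removal can be found as sketched below. The stopword set is a tiny illustrative one and the sample sentence is made up, so both are assumptions; a full analysis would use a complete stopword list and the actual speech text.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; a real analysis would use a fuller list.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "we", "is", "it", "that", "our", "for"}

def top_words(text, n=3):
    """Return the n most frequent words in text after removing stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return [word for word, _ in Counter(kept).most_common(n)]

# Made-up sentence, not a quotation from any speech.
sample = "We know the nation, and the nation is the spirit that we know in our nation."
print(top_words(sample))
```

Calling top_words once per speech gives the top three words for each president.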
4. Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)