NIrupam Agarwal Business Report-ML
MACHINE LEARNING
TABLE OF CONTENTS
List of Figures
Problem 1
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
BIVARIATE ANALYSIS
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
Problem 2:
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)
List of Figures
Problem 1
Problem statement:
You are hired by one of the leading news channels CNBE who wants to analyze recent elections.
This survey was conducted on 1525 voters with 9 variables. You have to build a model, to predict
which party a voter will vote for on the basis of the given information, to create an exit poll that
will help in predicting overall win and seats covered by a particular party.
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the
null values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate
Analysis.
First, we import all the necessary library packages and load the dataset into a pandas DataFrame.
EDA approach:
1) Descriptive Analytics
2) Data pre-processing
3) Data visualization
4) Data preparation
1. Import the required libraries. Check the head and tail of the dataset. Find the shape and info
of the dataset.
2. Shape of the dataset: the number of rows and columns is (1525, 10).
3. Info and Summary of the dataset:
The info contains 9 attributes with non-null values. The attributes unnamed, age,
economic.cond.national, economic.cond.household, Blair, Hague, Europe and political.knowledge
are integer data types.
The attributes vote and gender are object data types. There are no missing values. The index runs
from 0 to 1524, for a total of 1525 entries.
4. Duplicate rows: There are 8 duplicate rows in the dataset.
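The inspection steps above can be sketched roughly as follows. This is only an illustration, not the code used for the report: the small `sample` frame is a made-up stand-in for the survey file (the report does not give its name), and `summarize` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; the real dataset has 1525 rows.
sample = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 43, 55],
    "gender": ["female", "male", "female", "male"],
})

def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic checks described above into one dictionary."""
    return {
        "shape": df.shape,                          # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),  # data type per column
        "nulls": int(df.isnull().sum().sum()),      # total missing values
        "duplicates": int(df.duplicated().sum()),   # fully repeated rows
    }

report = summarize(sample)
print(sample.head())
print(report)

# One common follow-up (the report does not state whether it was done):
# drop repeated rows, keeping the first occurrence of each.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
```

On the real data, the same helper would be expected to report the shape (1525, 10) and the 8 duplicate rows mentioned above.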
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers.
UNIVARIATE ANALYSIS
For each attribute we plot the distribution with a histplot and check for outliers with a
boxplot.
1) Attributes
The age attribute has a skewness of 0.144, i.e. its distribution is roughly symmetric.
There are no outliers in the age attribute’s boxplot.
The economic.cond.national attribute has a skewness of -0.24, i.e. its distribution is roughly symmetric.
Its boxplot shows outliers.
The Blair attribute has a skewness of -0.535, i.e. its distribution is moderately left-skewed.
There are outliers in this attribute’s boxplot.
The Hague attribute has a skewness of 0.15, i.e. its distribution is roughly symmetric.
There are no outliers in this attribute’s boxplot.
The Europe attribute has a skewness of -0.135, i.e. its distribution is roughly symmetric.
There are no outliers in this attribute’s boxplot.
The political.knowledge attribute has a skewness of -0.42, i.e. its distribution is slightly left-skewed.
There are no outliers in this attribute’s boxplot.
Figure 9 Univariate analysis of Hague, Europe, and Political knowledge.
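The skewness values and the outlier checks quoted above can be reproduced numerically along these lines. This is a sketch under assumptions: the series below are made-up values, `skew_and_outliers` is a hypothetical helper, and outliers are counted with the usual boxplot rule (points more than 1.5 times the interquartile range beyond the quartiles).

```python
import pandas as pd

def skew_and_outliers(series: pd.Series) -> tuple:
    """Return (sample skewness, number of points outside the boxplot whiskers)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((series < lower) | (series > upper)).sum())
    return float(series.skew()), n_outliers

# Made-up values for illustration only.
ages = pd.Series([24, 35, 41, 47, 53, 58, 66, 74, 93])
ratings = pd.Series([3, 3, 3, 3, 3, 3, 3, 1])

age_skew, age_outliers = skew_and_outliers(ages)
rating_skew, rating_outliers = skew_and_outliers(ratings)
print(age_skew, age_outliers)
print(rating_skew, rating_outliers)
```

Applied column by column to the survey data, the first value corresponds to the skewness figures listed above and the second to whether the boxplot shows outliers.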
1) Vote
The vote attribute takes two values, Labour and Conservative, as shown. Labour occurs most often,
with 1063 entries, followed by Conservative with 462.
We can use one-hot encoding to convert this attribute to numeric form for modelling.
2) Gender
The gender attribute takes two values, female and male, as shown. Female occurs most often, with
812 entries, followed by male with 713.
We can use one-hot encoding to convert this attribute to numeric form for modelling.
BIVARIATE ANALYSIS
Plot the pair plot of the dataset.
1. The main diagonal shows the histogram of each attribute (univariate analysis).
We can see various positive correlations.
Figure 13 Pair plot of the dataset
The main diagonal is the histogram of each attribute. Several histograms have multiple peaks, which
suggests that the dataset may contain clusters.
Correlation:
Using a subplot to see the correlation:
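The correlation values behind such a plot can be computed directly. The following is a minimal sketch, assuming Pearson correlation; the rows are made up and only the column names follow the report's attributes.

```python
from itertools import combinations

import pandas as pd

# Made-up numeric columns for illustration; names follow the report's attributes.
df = pd.DataFrame({
    "age": [25, 40, 55, 70, 35],
    "Blair": [4, 4, 2, 1, 5],
    "Hague": [1, 2, 4, 5, 2],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr(numeric_only=True)
print(corr.round(2))

# Collect each unordered pair once and find the strongest relationship.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
strongest = max(pairs, key=lambda p: abs(pairs[p]))
print(strongest, round(pairs[strongest], 2))
```

The `corr` frame is the kind of input that a heatmap (for example seaborn's `heatmap` with `annot=True`) would display.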
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30).
We encode the dataset and replace the numeric columns with their normalized (scaled) values.
The variables ‘vote’ and ‘gender’ hold string values, so we convert them into numeric values
(encoding).
Figure 17 Encoding of categorical attributes.
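The encoding, the 70:30 split and the scaling step can be sketched as below. The rows are made up, and the 0/1 mapping, `random_state` and the choice of `StandardScaler` are assumptions for illustration, not necessarily what the report used. Scaling matters mainly for distance-based models such as KNN, because age spans a much wider range than the rating attributes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration; column names follow the report.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "gender": ["female", "male", "male", "female"] * 5,
    "age": list(range(30, 50)),
    "Europe": [3, 9, 5, 11, 2, 8, 6, 10, 4, 7] * 2,
})

# Binary encoding: map each two-level string column to 0/1.
df["vote"] = (df["vote"] == "Labour").astype(int)    # assumed: Labour -> 1
df["gender"] = (df["gender"] == "male").astype(int)  # assumed: male -> 1

X = df.drop(columns="vote")
y = df["vote"]

# 70:30 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Fit the scaler on the training part only, then reuse it on the test part.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training rows alone avoids leaking information from the test rows into the model.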
1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis).
1. LDA
Performance Matrix on train data set
2. Logistic Regression
Performance Matrix on test data set
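Fitting the two models and reading off train and test performance might look like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the encoded survey, and the settings (`max_iter`, `random_state`) are assumptions, so the numbers it prints are not the report's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the encoded survey (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, test_pred),
        "test_confusion": confusion_matrix(y_test, test_pred),
    }
    print(name, results[name])
```

A large gap between train and test accuracy would point to overfitting; similar values suggest the model generalizes.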
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
Naïve Bayes
Performance Matrix on train data set
KNN
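A rough sketch of both models follows. The report does not state which Naïve Bayes variant or which number of neighbours was used, so `GaussianNB` and `n_neighbors=5` are assumptions, and the data is synthetic. KNN is wrapped in a pipeline with a scaler because it relies on distances.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the encoded survey (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    # Scaling inside the pipeline is fitted on the training data only.
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns mean accuracy for classifiers.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, scores[name])
```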
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
Bagging
Performance Matrix on test data set
AdaBoost
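Bagging with a Random Forest base estimator and boosting with AdaBoost could be set up along these lines. The data is synthetic and every hyperparameter below is an assumed value chosen to keep the example small, not a tuned setting from the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the encoded survey (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Bagging: several Random Forests, each fitted on a bootstrap sample.
bagging = BaggingClassifier(
    RandomForestClassifier(n_estimators=20, random_state=1),  # base estimator
    n_estimators=10,
    random_state=1,
)

# Boosting: weak learners added one after another, reweighting hard cases.
boosting = AdaBoostClassifier(n_estimators=50, random_state=1)

scores = {}
for name, model in {"Bagging": bagging, "AdaBoost": boosting}.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, scores[name])
```

Tuning would typically wrap either model in a grid search over values such as `n_estimators`.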
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized.
Ordinal encoding has improved the model result, but the Linear Regression result hasn’t improved
much. In this kernel, the data is evaluated by means of its features in order to predict the diamond
price. Before predicting the price, exploratory data analysis was carried out and the categorical
features were encoded. To predict the price, Linear Regression, Decision Tree Regressor, Random
Forest Regressor and KNN models are compared. Among them, the Decision Tree has been the most
successful at predicting the diamond price.
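The checks named in the heading of 1.7 (accuracy, confusion matrix, ROC curve and ROC_AUC score) can be computed for any fitted classifier roughly as follows. `evaluate` is a hypothetical helper, the data is synthetic, and Logistic Regression is used only as an example model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

def evaluate(model, X, y) -> dict:
    """Compute the four performance checks for one model on one dataset."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y, proba)     # points of the ROC curve
    return {
        "accuracy": accuracy_score(y, pred),
        "confusion_matrix": confusion_matrix(y, pred),
        "roc_auc": roc_auc_score(y, proba),
        "roc_curve": (fpr, tpr),
    }

# Synthetic two-class data standing in for the encoded survey (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_report = evaluate(model, X_train, y_train)
test_report = evaluate(model, X_test, y_test)
print(train_report["accuracy"], train_report["roc_auc"])
print(test_report["accuracy"], test_report["roc_auc"])
```

Running the same helper on every model, for both train and test sets, gives a like-for-like table from which the final model can be chosen.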
Problem 2:
In this project, we work with the inaugural corpus from NLTK in Python.
We look at the following speeches of Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
(Hint: use .words(), .raw(), .sents() for extracting counts)
2.1 Find the number of characters, words, and sentences for the mentioned documents.
President Franklin D. Roosevelt’s speech has 7571 characters (including spaces) and 1360
words.
President John F. Kennedy’s speech has 7618 characters (including spaces) and 1390
words.
President Richard Nixon’s speech has 9991 characters (including spaces) and 1819 words.
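The three counts are simply the lengths of the raw text, the word list and the sentence list. The sketch below shows this on a tiny made-up text; `speech_counts` is a hypothetical helper, and the hand-written token lists only imitate the shape of what a corpus reader returns.

```python
def speech_counts(raw: str, words: list, sents: list) -> dict:
    """Count characters (including spaces), word tokens and sentences."""
    return {
        "characters": len(raw),
        "words": len(words),
        "sentences": len(sents),
    }

# Made-up text, tokenized by hand for illustration.
raw = "We meet today. We begin again."
words = ["We", "meet", "today", ".", "We", "begin", "again", "."]
sents = [["We", "meet", "today", "."], ["We", "begin", "again", "."]]

counts = speech_counts(raw, words, sents)
print(counts)
```

With NLTK installed and the inaugural corpus downloaded, the three arguments would come from `inaugural.raw(fileid)`, `inaugural.words(fileid)` and `inaugural.sents(fileid)`; the file ids for these speeches are assumed to be `1941-Roosevelt.txt`, `1961-Kennedy.txt` and `1973-Nixon.txt`. Note that a word list tokenized this way also contains punctuation tokens, so the word count depends on whether those are filtered out.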
Figure 32 Stopwords
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords)
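Removing stopwords and finding the most frequent words can be sketched with the standard library alone. The stopword set and the sample text here are made up for illustration, and `top_words` is a hypothetical helper; the report's own analysis would use a full English stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); a real list is far longer.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "we", "our", "that", "it"}

def top_words(text: str, n: int = 3) -> list:
    """Lower-case the text, drop stopwords, and return the n most common words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return Counter(kept).most_common(n)

sample = (
    "The nation is strong and the nation is free. "
    "Freedom of the nation is our duty; freedom is our hope."
)
top = top_words(sample)
print(top)
```

With NLTK, the stopword set could instead be built from `nltk.corpus.stopwords.words("english")`, and the same function applied to the raw text of each speech.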
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords)
Figure 33 Words
Word plot for Kennedy
Figure 35 Words
Figure 36 Word Cloud
Figure 37 Words
Figure 38 Word Cloud