
Predictive

Modelling

Name: Sweta Kumari


PGP-DSBA Online
July’ 21
Date: 06/03/2022

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Table of Contents

List of Figures
Data Dictionary
Problem 1: Linear Regression
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
    Checking the top 5 records
    Checking the bottom 5 records
    Checking the shape of the data frame
    Checking the information of the data frame
    Checking for missing values
    Checking the summary of the data frame
    Check for duplicates
    Unique values for categorical variables
    Univariate analysis
    Skewness
    Bivariate analysis
    Correlation matrix
1.2 Impute null values if present; also check for values equal to zero. Do they have any meaning, or do we need to change or drop them? Do you think scaling is necessary in this case?
    Impute null values
    StandardScaler scaling
    Box plot before scaling
    Box plot on scaled data
    Correlation matrix on scaled data
1.3 Encode the data (having string values) for modelling. Data split: Split the data into train and test (70:30). Apply linear regression. Performance metrics: Check the performance of predictions on train and test sets using R-square and RMSE.
    Split X and y into training and test set in 70:30 ratio
    Variance Inflation Factor
    Statsmodels
    Iteration 2
1.4 Inference: Based on these predictions, what are the business insights and recommendations?
Problem 2: Logistic Regression and LDA

2.1 Data ingestion: Read the dataset. Do the descriptive statistics and the null value condition check, and write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
    Checking the top 5 records
    Checking the bottom 5 records
    Checking the shape of the data frame
    Checking the information of the data frame
    Checking for missing values
    Check for duplicates
    Unique values for categorical variables
    Univariate analysis
    Categorical variables
    Skewness
    Multivariate analysis
    Correlation matrix
    Outliers
2.2 Do not scale the data. Encode the data (having string values) for modelling. Data split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
    Splitting data into training and test set
    Model 1 - Applying GridSearchCV for Logistic Regression
    Prediction on the training set
    Getting the probabilities on the test set
2.3 Performance metrics: Check the performance of predictions on train and test sets using accuracy and the confusion matrix; plot the ROC curve and get the ROC_AUC score for each model. Final model: Compare both models and write an inference on which model is best/optimized.
    Classification report / confusion matrix / accuracy on the training data
    AUC and ROC for the training data
    Classification report / confusion matrix / accuracy on the test data
    AUC and ROC for the test data
    LDA model
    Training data and test data confusion matrix comparison
    Training data and test data classification report comparison
    Training data probability prediction and score for train and test model
    AUC and ROC for the training and test data
    How to change the cut-off values for maximum accuracy?
    Predicting the classes on the test data

    Classification report of the default cut-off test data
    Classification report of the custom cut-off test data
2.4 Inference: Based on these predictions, what are the insights and recommendations? Please explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.

List of Figures:

Fig 1: Univariate analysis
Fig 2: Pair plot
Fig 3: Correlation heat map
Fig 4: Investment data with outliers, before scaling
Fig 5: Investment data without outliers, after scaling
Fig 6: Correlation heat map (after outlier treatment)
Fig 7: Sales graph before scaling
Fig 8: Sales graph after scaling
Fig 9: Univariate analysis (Problem 2)
Fig 10: Pair plot (Problem 2)
Fig 11: Car crash data with outliers, before scaling
Fig 12: Car crash data without outliers, after scaling
Fig 13: Confusion matrix for train data for LR
Fig 14: AUC and ROC for train data for LR
Fig 15: Confusion matrix for test data for LR
Fig 16: AUC and ROC for test data for LR
Fig 17: LDA test and train confusion matrix
Fig 18: AUC and ROC for test and train data of LDA
Fig 19: Cut-off maximum accuracy confusion matrix
Fig 20: Predicting the classes on the test data

Data Dictionary:

Data Dictionary for Firm_level_data


1. Sales: Sales (in millions of dollars).
2. Capital: Net stock of property, plant, and equipment.
3. Patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. Employment: Employment (in 1000s).
6. Sp500: Membership of firms in the S&P 500 index. The S&P 500 is a stock market index that measures the
stock performance of 500 large companies listed on stock exchanges in the United States.
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a physical asset's
market value and its replacement value.
8. Value: Stock market value.
9. Institutions: Proportion of stock owned by institutions.

Data Dictionary for Car Crash


1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for varying
sampling probabilities. (The inverse probability weighting estimator can be used to demonstrate
causality when the researcher cannot conduct a controlled experiment but has observed data to
model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
nodeploy and unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags
deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4: killed;
5: unknown, 6: prior death
15. caseid: character, created by pasting together the population sampling unit, the case number,
and the vehicle number. Within each year, use this to uniquely identify the vehicle.

Problem 1: Linear Regression

Problem Statement:
You are a part of an investment firm and your work is to do research on these 759 firms. You are
provided with a dataset containing the sales and other attributes of these 759 firms. Predict the
sales of these firms on the basis of the details given in the dataset, so as to help your company
invest consciously. Also, provide them with the 5 attributes that are most important.

Ques 1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.

Ans: We have imported all the necessary libraries and checked their versions. We will read the file
and analyse it.

I have created a data frame df for the file Firm_level_data.csv.

Now let's read the file.


Checking the top 5 records:

We find one column, Unnamed: 0, which does not add any value to the analysis, so we drop it and
check the top 5 records again.
Checking bottom 5 records:

Information about the data:

Shape of the dataset:

Data type of dataset:

Exploratory Data Analysis: EDA is an approach to analysing data using both non-visual and visual
techniques. It follows a structured approach involving thorough analysis of the data to understand
the current business situation.

Describing the dataset:

Checking Null Value:

Checking duplicate row:


Number of duplicate rows = 0

Checking unique values in dataset:

Univariate and Bivariate Analysis
1) Dist. plot and box plot of sales

2) Dist. plot and box plot of Capital

3) Dist. plot and box plot of patents:

4) Dist. plot and box plot of randd

5) Dist. Plot and box plot of employment

6) Dist. plot and box plot of tobinq

7) Dist. plot and box plot of Value

8) Dist. plot and box plot of institutions

Fig-1

Checking skewness of the dataset:

Pair plot of the dataset

Fig -2

Correlation Heat map of the data

Fig-3

Ques 1.2) Impute null values if present. Also check for values equal to zero; do they have any meaning, or do we need to change or drop them? Do you think scaling is necessary in this case?
Ans: We will check for null values in the dataset to see whether any values are missing.

From the above table we found that a total of 21 values are missing in the tobinq attribute.
We will impute the missing values with the median so that no missing value is left in the dataset for analysis.
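The median imputation can be sketched as below, assuming tobinq is the only column with missing entries (21 of them in the actual data); a tiny stand-in column is used here.

```python
import numpy as np
import pandas as pd

# Sketch of the imputation step on a tiny stand-in for the tobinq column.
df = pd.DataFrame({"tobinq": [1.2, np.nan, 3.4, np.nan, 2.0]})

median_tobinq = df["tobinq"].median()
df["tobinq"] = df["tobinq"].fillna(median_tobinq)

print(df["tobinq"].isnull().sum())   # no missing values remain
```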

After imputing the missing values, we found no values missing in the dataset.
Scaling: When the data is scaled, all variables come to a comparable level; scaling is important
when the variables differ in volume and magnitude. Data normalization is the method used to
standardize the range of the features. Since the ranges of values may vary widely, it becomes a
necessary preprocessing step for many machine learning algorithms. Although the data here is in a
reasonable state, StandardScaler as well as the z-score have been used to scale it.

I have used StandardScaler to scale the data.
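The scaling step can be illustrated as follows: StandardScaler applies the z-score, so each column ends up with mean 0 and standard deviation 1 (the matrix below is invented for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# StandardScaler applies the z-score column by column.
X = np.array([[100.0, 1.0],
              [200.0, 2.0],
              [300.0, 3.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))   # approximately 0 for every column
print(X_scaled.std(axis=0))    # 1 for every column
```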

Result after scaling the dataset:

Let us plot the graph of the dataset with outliers before scaling the data.

Fig-4

From the above graph we can see that many outliers are present in the dataset.

We need to treat the outliers and plot the graph again.
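One common way to treat outliers (a sketch, not necessarily the exact method used here) is to cap values outside the 1.5 * IQR whiskers of the box plot:

```python
import pandas as pd

# Cap values outside the 1.5 * IQR whiskers of the box plot.
def cap_outliers(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

s = pd.Series([1.0, 2.0, 2.5, 3.0, 100.0])   # 100 is an obvious outlier
capped = cap_outliers(s)
print(capped.max())   # pulled down to the upper whisker (4.5 here)
```

Applied column by column, this removes the extreme points seen in the box plots without dropping any rows.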

Below is the graph of the investment dataset without outliers, after scaling.

Fig5

Now let’s plot the graph of Correlation Heat map of the data after treating the outliers.

Fig-6

Observations:
1) We found that we need to impute the dataset, as missing values are present in the tobinq attribute.
2) We have imputed the missing values using the median of tobinq.
3) Scaling the dataset is necessary in order to bring all the attributes to one level.
4) We have scaled the dataset using StandardScaler.
5) We also found many outliers in the dataset, and they have been treated.

Ques 1.3) Encode the data (having string values) for modelling. Data split: Split the data into
train and test (70:30). Apply linear regression. Performance metrics: Check the performance of
predictions on the train and test sets using R-square and RMSE.
Ans: We have split the given dataset into train and test data; for model building we followed
the steps given below:
Step -1 : Separate X & Y
Step -2 : Train test split – X_train , X_test, Y_train, Y_test.
Step -3 : Model is introduced.
Step -4: model.fit(X_train, Y_train)
Step -5: model.predict(X_train) & (X_test)
Step -6: Validation
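The six steps above can be sketched as follows on synthetic data; in the report, X holds the firm attributes and y is the Sales column.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data for the firm attributes and Sales.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                      # Step 1: separate X and y
y = X @ np.array([0.3, 0.4, 0.2]) + rng.normal(scale=0.01, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)          # Step 2: 70:30 split

model = LinearRegression()                         # Step 3: model introduced
model.fit(X_train, y_train)                        # Step 4: fit on train
pred_train = model.predict(X_train)                # Step 5: predict
pred_test = model.predict(X_test)
print(model.coef_, model.intercept_)               # Step 6: inspect / validate
```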

We have applied Linear Regression on the given dataset.

Coefficients of different attributes of dataset is given below :


The coefficient for capital is 0.3016193095094924
The coefficient for patents is -0.04631497668451081
The coefficient for randd is 0.15074938156370654
The coefficient for employment is 0.4039466739213144
The coefficient for tobinq is -0.015840756793696984
The coefficient for value is 0.21343210254731274
The coefficient for institutions is 0.0015491273809130615

Intercept of our model is :


The intercept for our model is 0.01533730879391812

R square on training data: 0.9351872713101734


R square on testing data : 0.9273849351952674

RMSE on Training data: 0.04551705796613301


RMSE on Testing data: 0.04486679281231977
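The R-square and RMSE figures reported above can be computed as follows; the arrays below are hypothetical stand-ins for the actual model predictions.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"R square: {r2:.4f}, RMSE: {rmse:.4f}")
```

In the report these calls are made twice, once with the train predictions and once with the test predictions.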

Variance Inflation Factor (VIF) before scaling the dataset:
capital ---> 13.210120175427297
patents ---> 10.923471116505148
randd ---> 13.859560752076456
employment ---> 8.639388123955953
tobinq ---> 1.5045082851876506
value ---> 8.057752748316684
institutions ---> 1.2012687061404628
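The values above are Variance Inflation Factors (VIF). A hand-rolled sketch of the computation: VIF for predictor j is 1/(1 - R_j^2), where R_j^2 comes from regressing column j on the remaining predictors (statsmodels' `variance_inflation_factor` computes the same quantity); the data below is synthetic.

```python
import numpy as np

# VIF for column j: regress it on the other columns and use 1/(1 - R^2).
def vif(X: np.ndarray, j: int) -> float:
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Two nearly collinear columns produce a large VIF, as with capital
# and randd in the output above.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])
print([round(vif(X, j), 2) for j in range(3)])
```

Values above roughly 10 (capital, patents, randd above) are the usual warning sign of strong multicollinearity.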

Stats Model - Apply Linear Regression

Graph is plotted before scaling the dataset:

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Graph of sales before scaling the dataset.

Fig7
It is always good practice to scale all the dimensions using z-scores or some other method, to
address the problem of different scales.

Coefficients of the different attributes after scaling the dataset using the z-score are given below:
The coefficient for capital is 0.25546427187105647
The coefficient for patents is -0.02692349037369152
The coefficient for randd is 0.053592321134503934
The coefficient for employment is 0.4344708559702377
The coefficient for tobinq is -0.04559176069854309
The coefficient for value is 0.29673076835504136
The coefficient for institutions is 0.008683246589433539

Intercept of the model after scaling it :


The intercept for our model is 1.4953578678144901e-16

Model Score is : 0.9284628141931344

MSE : 0.26746436362040027

Variance Inflation Factor (VIF) after scaling the dataset (VIF is unaffected by linear scaling, so the values match those computed before scaling):


capital ---> 13.210120175427297
patents ---> 10.923471116505148
randd ---> 13.859560752076456
employment ---> 8.639388123955953
tobinq ---> 1.5045082851876506
value ---> 8.057752748316684
institutions ---> 1.2012687061404628

Graph of sales after scaling the dataset.

Fig 8

Stats Model - Apply Linear Regression

Graph is plotted after scaling the dataset:

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Ques 1.4) Inference: Based on these predictions, what are the business insights and
recommendations?
Ans:
Business Insight:

1) Final Linear Equation is:


Sales= (0.02) * Intercept + (-0.05) * patents + (0.4) * employment + (0.15) * randd + (-0.02) *
tobinq + (0.21) * value + (0.0) * institutions + (0.3) * capital

2) With a one-unit increase in employment, sales increase by 0.4 units, keeping all the other
predictors constant.

3) When capital increases by 1 unit, sales increase by 0.3 units, keeping all other predictors
constant.

4) There are also some negative coefficients, for instance for patents and tobinq. This implies
that when an attribute with a negative coefficient increases, sales decrease, keeping all the
other predictors constant.

Recommendations:
1) The investment criteria of any new investor are mainly based on the capital invested in the
company by the promoters, and investors favour firms where the capital investment is good.
2) Investment criteria also depend on whether the company has good sales, which is shown by the
scatter plot.

3) To generate capital, the company should have a combination of attributes such as value,
employment, sales and patents.

4) The highest contributing attribute is employment, followed by capital.

5) The company should provide more employment, since the model indicates that an increase in
hiring is strongly associated with an increase in sales.

6) As sales increase, any new investor will be highly interested in the firm.

Problem 2: Logistic Regression and Linear Discriminant Analysis


You are hired by the Government to do an analysis of car crashes. You are provided with details of
car crashes, among which some people survived and some didn't. You have to help the government
predict whether a person will survive or not on the basis of the information given in the dataset,
so as to provide insights that will help the government make stronger laws for car manufacturers
to ensure safety measures. Also, find out the important factors on the basis of which you made
your predictions.

Ques 2.1) Data ingestion: Read the dataset. Do the descriptive statistics and the null value
condition check, and write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.

Ans: We have imported all the necessary libraries and checked their versions. We will read the
file and analyse it.

I have created a data frame df for the file Car_Crash.csv.

Now let’s check top 5 records:

Bottom 5 records:

We will drop the unwanted attribute Unnamed: 0 as it does not add any value to the dataset.

Let's check the top 5 records after dropping the unnamed attribute from the dataset.

Bottom 5 records after dropping the unnamed attribute from the dataset:

Information about the dataset:

Shape of the dataset:


(11217, 15)

Checking the null values in the dataset:

Checking for duplicate records:


Number of duplicate rows = 0

Checking unique values in the dataset:

74:58:1 6
73:110:1 6

78:2:1 6
74:74:2 6
73:100:2 7
Name: caseid, Length: 6488, dtype: int64

Describing the data:

We will impute the missing data with the median value and check that no missing values are left
in the dataset.

We will replace some values for further evaluation of the data.

Univariate and Bivariate Analysis:
Dist. and box plots of the different attributes of the dataset:

Fig-9 -Problem 2

Pair plot of the dataset:

Fig 10-Problem 2 -Pair plot
Let's plot the graph for the car crash dataset with outliers, before scaling.

Fig 11

Let's plot the graph for the car crash dataset without outliers, after scaling.

Fig 12
Inferences:
1) The shape of the dataset is (11217, 15).
2) The Unnamed: 0 attribute of the dataset needs to be dropped, as it does not add value.
3) There are some missing values in the dataset, which are imputed using the median.
4) There are 3 attributes of data type float64, 4 attributes of data type int64 and 8 attributes
of data type object.
5) There are no duplicate records in the dataset.
6) Outliers are present in the dataset and need to be treated.
7) Some values in the dataset need to be re-coded for further analysis.

Ques 2.2) Encode the data (having string values) for modelling. Data split: Split the data into
train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
Ques 2.3) Performance metrics: Check the performance of predictions on the train and test sets
using accuracy and the confusion matrix; plot the ROC curve and get the ROC_AUC score for each
model. Compare both models and write inferences on which model is best/optimized.
Ans: We have split the data into train and test (70:30) and applied Logistic Regression and LDA
(linear discriminant analysis) on the given dataset, taking Survived as the target variable.

TRAIN AND TEST SPLIT

- The train and test method is used to build a model and measure its accuracy.
- Here the dataset has been split in the ratio 70:30: 70% for training and 30% for testing.
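The encode-then-split step can be sketched as below; the rows are invented stand-ins for the car-crash records, with Survived as the target variable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny invented stand-in rows for the car-crash records.
df = pd.DataFrame({
    "Survived": ["survived", "not_survived", "survived", "not_survived"],
    "seatbelt": ["belted", "none", "belted", "none"],
    "ageOFocc": [30, 45, 22, 60],
})
# Label-encode every string-valued (object) column before modelling.
for col in df.select_dtypes(include="object").columns:
    df[col] = pd.Categorical(df[col]).codes

X = df.drop(columns=["Survived"])
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)   # 70:30 split
```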

Logistic Regression Model:
Logistic regression is a statistical model that, in its basic form, uses a logistic function to
model a binary dependent variable, although many more complex extensions exist. In regression
analysis, logistic regression (or logit regression) estimates the parameters of a logistic model
(a form of binary regression). It is used to understand the relationship between the dependent
variable and one or more independent variables by estimating probabilities with the logistic
function. This type of analysis can help predict the likelihood of an event happening or a choice
being made.
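The "Applying GridSearchCV for Logistic Regression" step can be sketched as follows; the parameter grid is an illustrative guess, not the grid actually used in the report, and `make_classification` stands in for the crash data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded crash data.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Search over an illustrative grid of regularization strengths.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=3, scoring="accuracy")
grid.fit(X, y)

best = grid.best_estimator_
train_pred = best.predict(X)               # predictions on the training set
test_proba = best.predict_proba(X)[:, 1]   # class-1 probabilities
```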

Getting the probabilities on the test set:

Confusion Matrix on training data set:

Fig 13
Accuracy - Training Data 0.9698127627053879

AUC and ROC for the training data:


AUC: 0.987

Fig 14
Confusion matrix on the test data

Fig 15
Accuracy - Test Data 0.9705882352941176

AUC and ROC for the testing data:


AUC: 0.985

Fig 16
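The accuracy, confusion-matrix and AUC figures above can be produced with a few metric calls; the labels and probabilities below are invented stand-ins for the model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical labels and predicted class-1 probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0])
proba = np.array([0.1, 0.4, 0.8, 0.9, 0.6, 0.2])
y_pred = (proba >= 0.5).astype(int)        # default 0.5 cut-off

print(confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, proba))   # AUC uses probabilities, not classes
```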

Output for precision, recall and f1 score for training dataset:


lr_train_precision 0.97
lr_train_recall 0.99
lr_train_f1 0.98
Output for precision, recall and f1 score for test dataset:
lr_test_precision 0.98
lr_test_recall 0.99
lr_test_f1 0.98

LDA Model:
Linear Discriminant Analysis (LDA) is a classification technique that is also widely used for
dimensionality reduction as a pre-processing step in machine learning and pattern-classification
applications; here it is applied as a classifier.
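The LDA fit can be sketched as below; `make_classification` again stands in for the crash data, and LDA is used directly as a classifier, as in the report.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for the encoded crash data.
X, y = make_classification(n_samples=200, n_features=4, random_state=1)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
proba = lda.predict_proba(X)[:, 1]   # probabilities used for the ROC curve
print("Training accuracy:", lda.score(X, y))
```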

Confusion Matrix for LDA:


array([[1180, 0],
[ 0, 9960]], dtype=int64)

Plotting confusion matrix for test and train dataset:

Fig 17

Classification report for LDA model is :

AUC and ROC graph for test and train dataset:
AUC for the Training Data: 1.000
AUC for the Test Data: 0.904

Fig 18

Precision, recall and F1 score for the training dataset:


lda_train_precision 1.0
lda_train_recall 1.0
lda_train_f1 1.0

Precision, recall and F1 score for the testing dataset:


lda_test_precision 0.93
lda_test_recall 0.96
lda_test_f1 0.94

Comparison between LDA and Logistic Regression on the training and testing datasets:

Changing the cut-off values for maximum accuracy:

Checking both the training and test datasets for both models, LDA and Logistic Regression.
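The cut-off search can be sketched as a simple sweep over candidate thresholds, keeping the one with the highest accuracy; the probabilities below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels and predicted class-1 probabilities.
y_true = np.array([0, 1, 1, 1, 0, 1, 1, 1])
proba = np.array([0.45, 0.55, 0.9, 0.6, 0.3, 0.52, 0.8, 0.7])

# Sweep candidate cut-offs and keep the most accurate one.
best_cut, best_acc = 0.5, 0.0
for cut in np.arange(0.1, 0.9, 0.05):
    acc = accuracy_score(y_true, (proba >= cut).astype(int))
    if acc > best_acc:
        best_cut, best_acc = cut, acc

y_pred = (proba >= best_cut).astype(int)
print("Best cut-off:", best_cut, "Accuracy:", best_acc,
      "F1:", f1_score(y_true, y_pred))
```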

1)Accuracy Score 0.9954
F1 Score 0.9974

2) Accuracy Score 0.9958


F1 Score 0.9976

3) Accuracy Score 0.9955
F1 Score 0.9975

4) Accuracy Score 0.9953


F1 Score 0.9973

Fig 19

Predicting the classes on the test data

Fig 20

Ques 2.4) Inference: Based on these predictions, what are the insights and recommendations?
Ans:
Insights:

 The accuracy of the logistic regression model on the training data and the testing data is
almost the same, i.e., 97%.
 Similarly, the AUC of logistic regression for the training data and the testing data is also
similar.
 The other confusion-matrix metrics of logistic regression are likewise similar across train and
test, so the model does not appear to be overfitted.
 We applied GridSearchCV to tune the hyperparameters of the model, after which the F1 score on
both the training and test data was 98%.
 In the case of LDA, the training AUC (1.000) is clearly higher than the test AUC (0.904), and
the training confusion matrix is perfect; this indicates that the LDA model is overfitted.
 Logistic regression gives better recall and precision than LDA on the test data (0.99 recall
and 0.98 precision versus 0.96 and 0.93).
 Hence, the logistic regression model can be considered for further upgrading.

Recommendations: Overall, we can conclude that the logistic regression model is best suited for
this dataset, given its accuracy and its consistent performance on train and test data, whereas
the Linear Discriminant Analysis model shows signs of overfitting.

