PM - ExtendedProject - Business Report

_______________________
PREDICTIVE MODELLING
PROJECT BUSINESS REPORT
_______________________
DSBA
PREDICTIVE MODELLING BUSINESS REPORT

Table of Contents:
List Of Figures
List Of Tables
Data Description

1.1. EDA
Univariate Analysis
Bivariate Analysis
Outlier Detection & Treatment
Null Value Treatment
1.2. Encoding
Train-Test Split
Model Building
Logistic Regression Model
1.3. Model Evaluation & Performance

1.4. Business Insights & Recommendations
List of Tables
Table 1: Data Description Dataset1
Table 2: Data Summary
Table 3: Encoded data
Table 4: Data description
Table 5: Encoded data
Table 6: Classification report - Linear regression model 1 - Train
Table 7: Classification report - Linear Regression model 1 - Test
Table 6: Classification report - Logistic regression model 1 - Train
Table 7: Classification report - Logistic Regression model 1 - Test
Table 8: Classification report - Optimized Logistic Regression model - Train
Table 9: Classification report - Optimized Logistic Regression model - Test
Table 10: Classification report - LDA model - Train
Table 11: Classification report - LDA model - Test
Table 26: Classification report - Performance Metrics of Prediction Train
Table 26: Classification report - Performance Metrics of Prediction Test

List of Figures:
Figure 1: Univariate Analysis
Figure 2: Univariate Analysis
Figure 3: Bivariate analysis
Figure 4: Multivariate analysis
Figure 7: Pairplot
Figure 8: Correlation Heatmap
Figure 9: Boxplot for outlier detection
Figure 35: ROC Curve - Optimized Logistic Regression model - Train
Figure 36: ROC Curve - Optimized Logistic Regression model - Test
Figure 37: ROC Curve - LDA model - Train
Figure 38: ROC Curve - LDA model - Test
Figure 41: Confusion matrices of all models (Train data)
Figure 42: Confusion matrices of all models (Test data)

Problem 1: Linear Regression
You are a part of an investment firm and your work is to do research about these 759 firms. You are provided
with the dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on
the basis of the details given in the dataset so as to help your company in investing consciously. Also, provide
them with 5 attributes that are most important.
Questions for Problem 1:
1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data
types, shape, and EDA). Perform Univariate and Bivariate Analysis. (8 marks)
1.2) Impute null values if present? Do you think scaling is necessary in this case? (8 marks)
1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (30:70).
Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets
using R-square, RMSE. (8 marks)
1.4) Inference: Based on these predictions, what are the business insights and recommendations? (6 marks)
Data Dictionary for Firm_level_data:
1. sales: Sales (in millions of dollars).

2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of firms in the S&P 500 index. S&P is a stock market index that measures the stock
performance of 500 large companies listed on stock exchanges in the United States
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a physical asset's market value
and its replacement value.
8. value: Stock market value.
9. institutions: Proportion of stock owned by institutions.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
data types, shape, and EDA). Perform Univariate and Bivariate Analysis. (8 marks)
EDA
The data is imported, and the following are the observations:
• The data has 759 rows and 10 columns: 1 object column, 7 float columns and 2 integer columns.
Data type of data features:

• The next step is to check the details of each column: the number of entries and the data type of every variable.
From the above, we can infer that there are 10 columns with 759 entries each. All the variables are int or
float except sp500, which is an object.
• Next, we check whether any of the variables have null values in the given dataset.
Unnamed: 0 0
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 21
value 0
institutions 0
dtype: int64
From the above output, no variable except 'tobinq' has null values. As the number of null values in
'tobinq' is small (21 of 759 rows), we replace them with the column mean. After this step, no null
values remain in the dataset.
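The null check and mean imputation described above can be sketched as follows. This is a minimal illustration on a small made-up frame, not the report's notebook; only the column names sales and tobinq come from the data dictionary.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the firm-level data (values are made up for illustration).
df = pd.DataFrame({
    "sales": [826.99, 407.75, 8407.85, 451.00],
    "tobinq": [11.05, np.nan, 5.60, np.nan],
})

# Count missing values per column, as in the output above.
null_counts = df.isnull().sum()

# Replace the missing tobinq values with the mean of the observed values.
tobinq_mean = df["tobinq"].mean()
df["tobinq"] = df["tobinq"].fillna(tobinq_mean)

print(null_counts)
print(df["tobinq"].isnull().sum())
```

Mean imputation keeps the column mean unchanged, which is reasonable when only a small share of rows is missing.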
• Number of duplicate values = 0

(759,19)
There are no duplicate rows in the dataset provided.
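A duplicate check of this kind can be sketched as below, assuming the data is held in a pandas DataFrame. The frame here is a made-up example, so its shape does not match the report's.

```python
import pandas as pd

# Made-up rows; in the report this would be the firm-level dataset.
df = pd.DataFrame({
    "sales": [10.0, 20.0, 30.0],
    "capital": [1.0, 2.0, 3.0],
})

# duplicated() flags every row that repeats an earlier row in full.
n_duplicates = int(df.duplicated().sum())

print("Number of duplicate values =", n_duplicates)
print(df.shape)
```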
• Now, we need to describe the dataset:

• Univariate Analysis:
• Sales:
• Capital:

• Patents:
• Randd:
• Employment:

• Tobinq:
• Tobinq has many outliers, which need to be treated. Most values range between 1 and 3.
• Value:

• Institutions:

• Institutions has no outliers. The values range from 20 to 60.

• Checking for correlation between the variables:
Sales and capital are the most strongly correlated pair of variables, so 'capital' is a key
variable to take into account when predicting sales.
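A correlation check against the target can be sketched as follows. The values are made up and only the column names follow the data dictionary, so the ranking printed here illustrates the method rather than the report's result.

```python
import pandas as pd

# Made-up values; only the column names follow the data dictionary.
df = pd.DataFrame({
    "sales": [100.0, 250.0, 400.0, 900.0, 1500.0],
    "capital": [40.0, 110.0, 150.0, 420.0, 700.0],
    "tobinq": [2.1, 0.9, 3.0, 1.2, 1.8],
    "sp500": ["no", "no", "yes", "yes", "yes"],
})

# Correlation is defined for numeric columns only, so drop the object column.
corr = df.select_dtypes(include="number").corr()

# Rank the other variables by the strength of their correlation with sales.
sales_corr = corr["sales"].drop("sales").sort_values(key=abs, ascending=False)
print(sales_corr)
```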

• Multivariate Analysis:

1.2) Impute null values if present? Do you think scaling is necessary in this case? (8 marks)
All the null values present in the dataset have been imputed. Scaling is necessary to bring
variables measured in different units onto the same scale.
Scaling is required in our dataset as well. We first treated the outliers present in the dataset and then
standardized the variables with StandardScaler.
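Standardization with scikit-learn's StandardScaler, which the report names, can be sketched like this. The two feature columns hold made-up values on deliberately different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (made-up values).
X = np.array([
    [826.99, 0.35],
    [407.75, 1.20],
    [8407.85, 5.60],
    [451.00, 0.80],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and unit (population) standard deviation.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

In a train/test workflow the scaler is usually fit on the training rows only and then applied to the test rows, so that no information from the test set leaks into the model.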
Unnamed: 0 0
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 0
value 0
institutions 0
dtype: int64
Null values treated.

Outlier Treatment:
Before:

After Treatment:
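The report does not state which treatment was applied. One common choice, shown here purely as an assumption, is to cap values that fall outside the 1.5 x IQR whiskers of the boxplot. The helper name cap_outliers_iqr and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.clip(lower=lower, upper=upper)

# Made-up values with one extreme observation.
tobinq = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 20.0])
treated = cap_outliers_iqr(tobinq)
print(treated.tolist())
```

Capping keeps every row in the dataset, which matters for a sample of only 759 firms. Dropping the outlying rows instead would be another option.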

1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (30:70).
Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets
using R-square, RMSE. (8 marks)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 759 entries, 0 to 758
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 759 non-null int64
1 sales 759 non-null float64
2 capital 759 non-null float64
3 patents 759 non-null float64
4 randd 759 non-null float64
5 employment 759 non-null float64
6 tobinq 759 non-null float64
7 value 759 non-null float64
8 institutions 759 non-null float64
dtypes: float64(8), int64(1)
Encoded the data
We have encoded the data (the string-valued column) for modelling and split the data into train and test
sets (70:30).
To do this, we separate the predictors X from the target Y (sales) and split them into Xtrain, Xtest, Ytrain and Ytest.
LinearRegression()
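The encoding, 70:30 split and model fit can be sketched as below. This is a sketch under stated assumptions: the data is synthetic, only the column names follow the data dictionary, and the random_state value is an arbitrary choice, not one taken from the report.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; only the column names follow the data dictionary.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "capital": rng.uniform(10, 1000, n),
    "employment": rng.uniform(1, 50, n),
    "sp500": rng.choice(["yes", "no"], n),
})
df["sales"] = 0.4 * df["capital"] + 80 * df["employment"] + rng.normal(0, 50, n)

# One-hot encode the string column; drop_first leaves a single sp500_yes dummy.
df_encoded = pd.get_dummies(df, columns=["sp500"], drop_first=True, dtype=int)

X = df_encoded.drop(columns="sales")
y = df_encoded["sales"]

# 70:30 train/test split; random_state only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression()
model.fit(X_train, y_train)

for name, coef in zip(X_train.columns, model.coef_):
    print(f"The coefficient for {name} is {coef}")
```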

The coefficient for Unnamed: 0 is -0.024779222165983567
The coefficient for capital is 0.40647679657880864
The coefficient for patents is -4.6311880265917305
The coefficient for randd is 0.6363608311856803
The coefficient for employment is 78.63925895707519
The coefficient for tobinq is -39.926809277913165
The coefficient for value is 0.2445585106292266
The coefficient for institutions is 0.2248864748953271
The coefficient for sp500_yes is 165.33489561068234

VIF values:
Unnamed: 0 2.981358
capital 8.520474
patents 4.290258
randd 4.699553
employment 8.954333
tobinq 3.067115
value 10.422137
institutions 4.699461
sp500_yes 4.256051
dtype: float64
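A variance inflation factor measures how well one predictor is explained by the others: VIF_j = 1 / (1 - R_j^2). The sketch below computes it from the inverse correlation matrix of the predictors. The function name vif_table and the synthetic data are assumptions, and the numbers may differ from the values above depending on how the report's notebook computed them.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse
    correlation matrix of the predictors."""
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=X.columns)

rng = np.random.default_rng(0)
capital = rng.normal(size=200)
X = pd.DataFrame({
    "capital": capital,
    # 'value' is built to be strongly related to 'capital'.
    "value": capital + rng.normal(scale=0.3, size=200),
    # 'institutions' is generated independently of the other two.
    "institutions": rng.normal(size=200),
})

vif = vif_table(X)
print(vif)
```

A VIF close to 1 means the predictor is nearly independent of the others. Values above roughly 5 to 10 are commonly read as a sign of multicollinearity.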

Actual Values Fitted Values Residuals
0 1947.224100 658.298481 1288.925619
1 60.327997 325.336096 -265.008099
2 1065.748032 746.579787 319.168245
3 1193.647768 2522.941248 -1329.293480
4 164.135025 516.629610 -352.494585

The performance metrics are as follows:
R-square on training data is 83.15%

RMSE on training data is 6%
RMSE on testing data is 5.19%
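For reference, the two metrics can be computed as in the sketch below. The function names and the sample values are made up for illustration and are not the report's predictions.

```python
import numpy as np

def r_squared(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred) -> float:
    """Root of the mean squared difference between actual and fitted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up actual and fitted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

print("R-square:", r_squared(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```

R-square is a unitless share of explained variance, while RMSE is expressed in the units of the target variable.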
1.4) Inference: Based on these predictions, what are the business insights and recommendations? (6 marks)
Before making a new investment, we need to check that the capital invested in a firm is adequate, since the
scatterplot shows that capital is reflected in sales. The important variables for predicting sales are value,
employment and patents. The most important attributes are employment and patents.

Problem 2: Logistic Regression and Linear Discriminant Analysis
You are hired by the Government to do an analysis of car crashes. You are provided details of car crashes,
among which some people survived and some didn't. You have to help the government in predicting whether a
person will survive or not on the basis of the information given in the data set so as to provide insights that will
help the government to make stronger laws for car manufacturers to ensure safety measures. Also, find out the
important factors on the basis of which you made your predictions.
Data Dictionary for Car_Crash
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, are designed to account for varying sampling
probabilities. (The inverse probability weighting estimator can be used to demonstrate causality when the
researcher cannot conduct a controlled experiment but has observed data to model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels of none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy, nodeploy and
unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4: killed; 5: unknown,
6: prior death
15. caseid: character, created by pasting together the population sampling unit, the case number, and the
vehicle number. Within each year, use this to uniquely identify the vehicle.
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do a null value condition check, and write
an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. (8 marks)
We first import all the necessary libraries for the analysis and check the head of the data.
Description:

From the above, we can infer that there are 15 columns in total with 11217 entries. The first column is
unnamed. The data types are integer, float and object.
Next, we check the null values in the dataset.
No variable except 'injSeverity' has null values.

Multivariate Analysis:

The above shows the collinearity between the variables.
2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30).
Apply Logistic Regression and LDA (linear discriminant analysis). (8 marks)
We have encoded the data (the string-valued columns) for modelling.
Data split: we have split the data into train and test sets (70:30), taking 'Survived' as the target
variable.
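The encoding, split and model fitting for this problem can be sketched as follows. The data is synthetic: the column names follow the data dictionary, but the values, the survival rule and the random_state are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; values and relationships are made up.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "ageOFocc": rng.integers(16, 90, n),
    "frontal": rng.integers(0, 2, n),
    "seatbelt": rng.choice(["none", "belted"], n),
    "airbag": rng.choice(["none", "airbag"], n),
})
# Made-up rule: older, unbelted occupants are less likely to survive.
risk = 0.03 * (df["ageOFocc"] - 50) + 1.5 * (df["seatbelt"] == "none") - 2.0
p_not_survived = 1 / (1 + np.exp(-risk))
df["Survived"] = np.where(
    rng.random(n) < p_not_survived, "not_survived", "Survived"
)

# Encode the target as 1 = survived, 0 = not survived, and one-hot encode
# the remaining string columns.
y = (df["Survived"] == "Survived").astype(int)
X = pd.get_dummies(df.drop(columns="Survived"), drop_first=True, dtype=int)

# 70:30 train/test split; random_state only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

print("Logistic Regression test accuracy:", log_reg.score(X_test, y_test))
print("LDA test accuracy:", lda.score(X_test, y_test))
```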

LDA:

2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve, and get the ROC_AUC score for each model. Compare both the models and
write inferences, about which model is best/optimized. (8 marks)
Cut-off   F1 Score   Accuracy Score
0.1       0.9726     0.9496
0.2       0.9735
0.3       0.9748
0.4       0.9757
0.5       0.9766
0.6       0.9771
0.7       0.9777
0.8       0.9777
0.9       0.9728
Confusion Matrix (one for each cut-off)
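A cut-off sweep of this kind can be sketched as below. The helper name metrics_at_cutoff, the labels and the predicted probabilities are all made up. In practice the probabilities would come from a fitted model's predicted probability of the positive class.

```python
import numpy as np

def metrics_at_cutoff(y_true, proba, cutoff):
    """Accuracy, F1 and confusion matrix when predicting 1 if proba > cutoff."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(proba) > cutoff).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # Rows are actual classes, columns are predicted classes.
    confusion = np.array([[tn, fp], [fn, tp]])
    return accuracy, f1, confusion

# Made-up labels and predicted probabilities of the positive class.
y_true = [1, 1, 1, 1, 0, 1, 0, 1, 1, 0]
proba = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.92, 0.88, 0.15]

for cutoff in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    accuracy, f1, confusion = metrics_at_cutoff(y_true, proba, cutoff)
    print(f"{cutoff:.1f}  Accuracy Score {accuracy:.4f}  F1 Score {f1:.4f}")
```

Raising the cut-off trades false positives for false negatives, so the preferred value depends on which error is costlier for the problem at hand.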

2.4) Inference: Based on these predictions, what are the insights and recommendations? (6 marks)
Different Model Parameters

Model Name Accuracy Recall Precision F1 score AUC
Train Test Train Test Train Test Train Test Train Test
Cut off 96% 95% 1.00 1.00 96% 95% 98% 97%
Logistic Regression 96% 98% 97% 98%
LDA Model 96% 96% 1.00 1.00 96% 96% 98% 98% 97 97.2
• From all the inferences above, we see that the models have very similar performance.
• The accuracy on both the training and testing data is about 96%. The confusion matrices show the same similarity.
We can conclude that the logistic regression model is the better choice for this prediction.
Conclusion:
• There is no under-fitting or over-fitting in any of the tuned models.
