PM - ExtendedProject - Business Report
PREDICTIVE MODELLING
PROJECT BUSINESS REPORT
_______________________
DSBA
1.2. Encoding
Train-Test Split
Model Building
Logistic Regression Model
List of Tables
Table 1: Data Description Dataset 1
Figure 7: Pairplot
1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data
types, shape, and EDA). Perform Univariate and Bivariate Analysis. (8 marks)
1.2) Impute null values if present. Do you think scaling is necessary in this case? (8 marks)
1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (30:70).
Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets
using R-square, RMSE. (8 marks)
1.4) Inference: Based on these predictions, what are the business insights and recommendations? (6 marks)
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
data types, shape, and EDA). Perform Univariate and Bivariate Analysis. (8 marks)
EDA
The data is imported, and the following are the observations:
The dataset has 759 rows and 10 columns. Seven variables are of float type, two are of int type, and one (sp500) is of object type; all variables other than sp500 are numeric.
Next, we check whether any of the variables contain null values.
Unnamed: 0 0
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 21
value 0
institutions 0
dtype: int64
From the output above, 'tobinq' is the only variable with null values (21). Since the count is small, we impute these values with the column mean; after imputation, no null values remain.
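As a minimal sketch of the null check and mean imputation (using a small illustrative in-memory frame, since the report does not give the actual file path or data):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the firm dataset; 'tobinq' carries a few
# missing values, mirroring the 21 nulls observed in the report.
df = pd.DataFrame({
    "sales":  [100.0, 250.0, 75.0, 310.0],
    "tobinq": [1.2, np.nan, 0.8, np.nan],
})

print(df.isnull().sum())  # only 'tobinq' is non-zero

# Replace missing 'tobinq' values with the column mean, as in the report.
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].mean())

print(df["tobinq"].tolist())  # [1.2, 1.0, 0.8, 1.0]
```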
Capital, randd and employment:
Boxplots of these variables show many outliers, which need to be treated; the bulk of the values lies between 1 and 3.
Value:
There are no outliers in this variable; its values range from 20 to 60.
Sales and capital are the most strongly related pair, so capital can be used to predict sales.
All null values present in the dataset have been imputed. Scaling is necessary to bring variables measured on different scales onto a common scale, and it is required for our dataset. After treating the outliers, we standardized the variables with StandardScaler.
Unnamed: 0 0
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 0
value 0
institutions 0
dtype: int64
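The standardization step described above can be sketched as follows (toy numeric data standing in for the report's continuous columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: two columns on very different scales.
X = np.array([[100.0, 5.0],
              [250.0, 9.0],
              [ 75.0, 2.0],
              [310.0, 8.0]])

# StandardScaler rescales each column to mean 0 and unit variance,
# making variables with different measurement units comparable.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0).round(6))  # approximately [0. 0.]
print(X_scaled.std(axis=0).round(6))   # [1. 1.]
```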
We encoded the string-valued columns for modelling and split the data into train and test sets (70:30). The split separates the predictors (X) from the target (y), giving X_train, X_test, y_train and y_test.
LinearRegression()
Unnamed: 0 2.981358
capital 8.520474
patents 4.290258
randd 4.699553
employment 8.954333
tobinq 3.067115
value 10.422137
institutions 4.699461
sp500_yes 4.256051
dtype: float64
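The encoding, 70:30 split, linear regression fit, and R²/RMSE checks described above can be sketched as follows. The data here is synthetic; the column names ('capital', 'sp500', 'sales') follow the report, and one-hot encoding yields the 'sp500_yes' column seen in the output above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 'sp500' is the lone string column, as in the report.
df = pd.DataFrame({
    "capital": rng.normal(100, 20, 200),
    "sp500":   rng.choice(["yes", "no"], 200),
})
df["sales"] = 3.0 * df["capital"] + rng.normal(0, 5, 200)

# One-hot encode the string column (produces 'sp500_yes').
df = pd.get_dummies(df, columns=["sp500"], drop_first=True)

X = df.drop(columns="sales")
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)  # 70:30 split

model = LinearRegression().fit(X_train, y_train)

scores = {}
for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(Xs)
    r2 = r2_score(ys, pred)
    rmse = np.sqrt(mean_squared_error(ys, pred))
    scores[name] = (r2, rmse)
    print(f"{name}: R2={r2:.3f}  RMSE={rmse:.2f}")
```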
1.4) Inference: Based on these predictions, what are the business insights and recommendations? (6 marks)
Before committing new capital, we should check whether the capital already invested is performing well, which the scatterplot reflects. The important variables are value, employment, sales and patents; of these, employment and patents matter most.
You are hired by the Government to do an analysis of car crashes. You are provided details of car crashes,
among which some people survived and some didn't. You have to help the government in predicting whether a
person will survive or not on the basis of the information given in the data set so as to provide insights that will
help the government to make stronger laws for car manufacturers to ensure safety measures. Also, find out the
important factors on the basis of which you made your predictions.
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, are designed to account for varying sampling
probabilities. (The inverse probability weighting estimator can be used to demonstrate causality when the
researcher cannot conduct a controlled experiment but has observed data to model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels of none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy, nodeploy and
unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4: killed; 5: unknown,
6: prior death
15. caseid: character, created by pasting together the population sampling unit, the case number, and the
vehicle number. Within each year, use this to uniquely identify the vehicle.
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do a null value condition check, and write
an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. (8 marks)
We import all the necessary libraries, load the dataset, and inspect its head.
Description:
'injSeverity' is the only variable with null values; all other variables are complete.
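A sketch of the descriptive statistics and null check, plus one possible treatment for injSeverity (the report does not state how the nulls were handled, so the median imputation here is an assumption; the frame is a small illustrative stand-in for the crash data):

```python
import numpy as np
import pandas as pd

# Small illustrative frame mirroring the crash data: only 'injSeverity'
# carries missing values, as observed in the report.
df = pd.DataFrame({
    "Survived":    ["survived", "not_survived", "survived", "survived"],
    "ageOFocc":    [26, 54, 33, 41],
    "injSeverity": [0.0, 4.0, np.nan, 1.0],
})

print(df.describe())      # descriptive statistics for the numeric columns
print(df.isnull().sum())  # only injSeverity is non-zero

# Assumed treatment: impute missing severity with the column median.
df["injSeverity"] = df["injSeverity"].fillna(df["injSeverity"].median())
print(df["injSeverity"].tolist())  # [0.0, 4.0, 1.0, 1.0]
```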
2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30).
Apply Logistic Regression and LDA (linear discriminant analysis). (8 marks)
Data split: we split the data into train and test sets (70:30), taking 'Survived' as the target variable.
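A sketch of fitting both classifiers on a 70:30 split (synthetic data standing in for the encoded crash dataset):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary target standing in for 'Survived'.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)  # 70:30 split

accs = {}
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("LDA", LinearDiscriminantAnalysis())]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    accs[name] = accuracy_score(y_test, pred)
    print(name, "test accuracy:", round(accs[name], 3))
    print(confusion_matrix(y_test, pred))
```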
[Figure: confusion matrices for the Logistic Regression and LDA models on the train and test sets]
From the inferences above, all the models show broadly similar performance. Both training and testing accuracy are 96%, and the confusion matrices are likewise similar. We conclude that logistic regression is the better model for this prediction task.
Conclusion:
• There is no under-fitting or over-fitting in any of the tuned models.