
Predictive Modelling

Project Report
DSBA

Alok Kumar
CONTENTS
PROBLEM 1
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis, Multivariate Analysis
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Check for the possibility of creating new features if required. Also check for outliers and duplicates if present
1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30). Apply Linear regression using scikit-learn. Perform checks for significant variables using appropriate methods from statsmodels. Create multiple models and check the performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one with appropriate reasoning
1.4 Inference: Based on these predictions, what are the business insights and recommendations? Please explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present
PROBLEM 2
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, check for duplicates and outliers and write an inference on it. Perform Univariate and Bivariate Analysis and Multivariate Analysis
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis) and CART
Logistic Regression
LDA
CART
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare both the models and write inference on which model is best/optimized
2.4 Inference: Based on these predictions, what are the insights and recommendations? Please explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present
As a budding data scientist, you set out to build a linear model to predict 'usr' (the
portion of time (%) that CPUs run in user mode) and to find out how each attribute
affects the system's time in 'usr' mode, using a list of system attributes.
Dataset for Problem 1: compactiv.xlsx
DATA DICTIONARY:
-----------------------
System measures used:
lread - Reads (transfers per second) between system memory and user memory
lwrite - Writes (transfers per second) between system memory and user memory
scall - Number of system calls of all types per second
sread - Number of system read calls per second
swrite - Number of system write calls per second
fork - Number of system fork calls per second
exec - Number of system exec calls per second
rchar - Number of characters transferred per second by system read calls
wchar - Number of characters transferred per second by system write calls
pgout - Number of page-out requests per second
ppgout - Number of pages paged out per second
pgfree - Number of pages per second placed on the free list
pgscan - Number of pages checked per second for whether they can be freed
atch - Number of page attaches (satisfying a page fault by reclaiming a page in memory) per
second
pgin - Number of page-in requests per second
ppgin - Number of pages paged in per second
pflt - Number of page faults caused by protection errors (copy-on-writes)
vflt - Number of page faults caused by address translation
runqsz - Process run queue size (the number of kernel threads in memory that are waiting
for a CPU to run.
Typically this value should be less than 2; consistently higher values mean that the system
might be CPU-bound.)
freemem - Number of memory pages available to user processes
freeswap - Number of disk blocks available for page swapping
------------------------
usr - Portion of time (%) that CPUs run in user mode
Problem 1.1
Read the data and do exploratory data analysis. Describe the data briefly.
(Check the Data types, shape, EDA, 5 point summary). Perform Univariate,
Bivariate Analysis, Multivariate Analysis.
Univariate analysis using box plot
Insights:

1. The data consists of both categorical and numerical values.

2. There are a total of 8192 rows and 22 columns in the dataset. Of the 22 columns, 1 is of
object type, 8 are of integer type and the remaining 13 are of float type.

3. 'usr' is the target variable and all others are predictor variables.

4. Looking at the fields in the univariate analysis, we see there are outliers that need to be treated.

5. Bivariate and multivariate analysis indicates a strong positive correlation between the
target variable usr and the predictor variables freemem and freeswap.

6. We also notice that there are no duplicate records in the given data.
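The EDA checks above (shape, dtypes, five-point summary, duplicates) can be sketched as below. The small frame is a made-up stand-in for compactiv.xlsx, which would be read with pd.read_excel in practice; the column names follow the data dictionary, but the values are illustrative only.

```python
import pandas as pd

# Tiny stand-in for compactiv.xlsx; in practice: df = pd.read_excel("compactiv.xlsx")
df = pd.DataFrame({
    "lread": [1, 0, 15, 2, 3],
    "freemem": [4670, 7278, 702, 7248, 633],
    "runqsz": ["CPU_Bound", "Not_CPU_Bound", "CPU_Bound", "Not_CPU_Bound", "CPU_Bound"],
    "usr": [90, 88, 85, 94, 90],
})

print(df.shape)               # rows x columns
print(df.dtypes)              # object vs numeric split
print(df.describe().T)        # five-point summary for numeric fields
print(df.duplicated().sum())  # duplicate-record count
```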

1.2 Impute null values if present, also check for the values which are equal
to zero. Do they have any meaning or do we need to change them or drop
them? Check for the possibility of creating new features if required. Also
check for outliers and duplicates if present.
Before null value treatment
After null value treatment
Before outlier treatment

After outlier treatment

Insights:

1. Observed null values in 2 fields, rchar and wchar.

2. We imputed the null values with the median of the respective field.

3. Most of the continuous fields had outliers, and we have treated them using the IQR approach.

4. In this case it is not necessary to scale the data, as we'll get an equivalent solution whether or
not we apply a linear scaling. For example, to find the best parameter values of a linear
regression model there is a closed-form solution, called the Normal Equation. If our implementation
makes use of that equation, there is no stepwise optimization process, so feature scaling is not
necessary.

5. Removing the records with 0 values is not a necessity, as they are unlikely to have an impact on the
model building.
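A minimal sketch of the median imputation and IQR outlier treatment described above, using a toy frame with artificial nulls and one extreme value in place of the real rchar/wchar columns:

```python
import numpy as np
import pandas as pd

# Toy columns with artificial nulls and one extreme value
df = pd.DataFrame({"rchar": [100.0, np.nan, 250.0, 90.0, 5000.0],
                   "wchar": [80.0, 120.0, np.nan, 95.0, 110.0]})

# Median imputation for the fields that had nulls
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].median())

# IQR treatment: cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
for col in df.columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(df)  # no nulls remain; the 5000.0 reading is capped at the upper whisker
```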
Problem 1.3
Encode the data (having string values) for Modelling. Split the data into train
and test (70:30). Apply Linear regression using scikit-learn. Perform checks
for significant variables using appropriate methods from statsmodels. Create
multiple models and check the performance of Predictions on Train and Test
sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select
the best one with appropriate reasoning.
Variance Inflation factor

Insights:

R-squared is always between 0 and 100%:

• 0% indicates that the model explains none of the variability of the response data around
its mean.
• 100% indicates that the model explains all of the variability of the response data around its
mean. In general, the higher the R-squared, the better the model fits the data.

In this case, the R-squared values for test and train are 0.76 and 0.79 respectively, which indicates that
more than 75% of the observed variance can be explained by the model's inputs.
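The modelling loop described above can be sketched as follows: a 70:30 split, a scikit-learn LinearRegression fit, and Rsquare, RMSE and adjusted Rsquare computed on both partitions. The data here is synthetic, standing in for the prepared compactiv features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared compactiv features and 'usr' target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 90 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
lr = LinearRegression().fit(X_train, y_train)

for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = lr.predict(Xs)
    r2 = r2_score(ys, pred)
    rmse = np.sqrt(mean_squared_error(ys, pred))
    n, k = Xs.shape
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusted R-square
    print(f"{name}: R2={r2:.3f} RMSE={rmse:.3f} AdjR2={adj_r2:.3f}")
```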
1.4 Inference: Based on these predictions, what are the business insights
and recommendations?

• When lwrite increases by 1 unit, usr increases by 0.05 units, keeping all other predictors
constant.
• When atch increases by 1 unit, usr increases by 0.63 units, keeping all other predictors
constant.
• When pgout decreases by 1 unit, usr decreases by 0.37 units, keeping all other
predictors constant.

Problem 2

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, check for duplicates and outliers and write an
inference on it. Perform Univariate and Bivariate Analysis and
Multivariate Analysis.
Insights:

1. The data consists of both categorical and numerical values.

2. There are a total of 1473 rows and 10 columns in the dataset. Of the 10 columns, 7 are of object
type, 1 is of integer type and the remaining 2 are of float type.

3. 'contraceptive used' is the target variable and all others are predictor variables.

4. Looking at the fields in the univariate analysis, we see outliers are present only in the field number
of children.

5. Looking at the boxplot between the target variable contraceptive method used and
no_of_children_born, we see that no_of_children_born is higher in the cases where
contraception is used.

6. Bivariate and multivariate analysis indicates a strong positive correlation between the
fields wife_age and no_of_children_born.

7. We also notice that there are 80 duplicate records in the given dataset, and they have been removed.

8. The null values identified have been imputed with the mean.
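A sketch of the duplicate removal and mean imputation described in points 7 and 8, on a toy frame standing in for the contraceptive dataset (one repeated row and one artificial null):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the contraceptive data: one duplicate row, one null
df = pd.DataFrame({
    "wife_age": [24.0, 24.0, np.nan, 43.0],
    "no_of_children_born": [3, 3, 1, 5],
})

df = df.drop_duplicates()                                      # removes the repeated row
df["wife_age"] = df["wife_age"].fillna(df["wife_age"].mean())  # mean imputation
print(df)
```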

2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis) and CART.
Data set after label encoding

Data set after dummy encoding

Results of grid search CV for logistic regression


Coefficients for linear discriminant analysis
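The encoding and model-fitting steps above can be sketched as follows. The frame, column names and the grid of C values are illustrative assumptions, not the report's actual settings:

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Illustrative frame standing in for the contraceptive dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "wife_age": rng.integers(16, 49, 300),
    "wife_education": rng.choice(["Primary", "Secondary", "Tertiary"], 300),
    "contraceptive_used": rng.choice(["No", "Yes"], 300),
})

# Label-encode the target; dummy-encode the string predictors (no scaling)
y = (df["contraceptive_used"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns="contraceptive_used"), drop_first=True)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Grid search over regularisation strength for logistic regression
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X_tr, y_tr)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

print(grid.best_params_)
print(lda.coef_)  # one coefficient per encoded feature
```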

2.3 Performance Metrics: Check the performance of Predictions on Train
and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model. Final Model: Compare both the models
and write inference on which model is best/optimized.
AUC and ROC for test data
AUC and ROC curve for training and test data
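The metric computations behind the curves above can be sketched as follows: accuracy, confusion matrix and ROC_AUC on both partitions, plus the fpr/tpr pairs a ROC plot is drawn from. The classification data is synthetic (make_classification), standing in for the encoded contraceptive features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the encoded contraceptive features
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for name, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    proba = clf.predict_proba(Xs)[:, 1]
    pred = clf.predict(Xs)
    acc = accuracy_score(ys, pred)
    auc = roc_auc_score(ys, proba)
    print(f"{name}: accuracy={acc:.3f} AUC={auc:.3f}")
    print(confusion_matrix(ys, pred))

# fpr/tpr pairs are what the ROC curve (e.g. in matplotlib) is drawn from
fpr, tpr, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])
```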

2.4 Inference: Based on these predictions, what are the insights and
recommendations?

Insights from Logistic regression:

For predicting ‘Contraceptive_method_used’ is “No” (Label 0):

Precision (65%) – 65% of the families predicted to not be using contraception are actually not
using it.

Recall (50%) – Of all the families not using contraception, 50% have been predicted
correctly.

For predicting ‘Contraceptive_method_used’ is “Yes” (Label 1):

Precision (70%) – 70% of the families predicted to be using contraception are actually
using it.

Recall (79%) – Of all the families using contraception, 79% have been predicted
correctly.

Overall accuracy of the model – 63% of total predictions are correct.

Accuracy, AUC, precision and recall for the test data are almost in line with those for the training
data. This indicates that no overfitting or underfitting has happened, and overall the model is a good
model for classification.

Insights from LDA:

For predicting ‘Contraceptive_method_used’ is “No” (Label 0):

Precision (67%) – 67% of the families predicted to not be using contraception are actually not
using it.

Recall (68%) – Of all the families not using contraception, 68% have been predicted
correctly.

For predicting ‘Contraceptive_method_used’ is “Yes” (Label 1):

Precision (65%) – 65% of the families predicted to be using contraception are actually
using it.

Recall (79%) – Of all the families using contraception, 79% have been predicted
correctly.

Overall accuracy of the model – 63% of total predictions are correct.

Accuracy, AUC, precision and recall for the test data are almost in line with those for the training
data. This indicates that no overfitting or underfitting has happened, and overall the model is a good
model for classification.
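The precision and recall figures quoted above follow directly from the confusion matrix; a tiny worked example with hypothetical labels (1 = “Yes”, 0 = “No”) shows the arithmetic:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = "Yes" (uses contraception), 0 = "No"
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 1])

# Precision (label 1): of the 6 families predicted "Yes", 4 truly are -> 4/6
# Recall (label 1): of the 5 families that truly are "Yes", 4 were caught -> 4/5
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
```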

CONCLUSION:

Both Logistic regression and LDA give similar results; hence, for the given dataset, both models are suitable.
