
PREDICTIVE MODELLING

PROBLEM 1: LINEAR REGRESSION

The comp-activ database is a collection of computer systems activity measures.


The data was collected from a Sun SPARCstation 20/712 with 128 MB of memory running
in a multi-user university department. Users would typically be doing a large variety of tasks,
ranging from accessing the internet and editing files to running very CPU-bound programs.

As a budding data scientist, you set out to build a linear regression model to predict
'usr' (the portion of time (%) that CPUs run in user mode) and to find out how each attribute
in a list of system attributes affects the time the system spends in 'usr' mode.

SUMMARY

 The given dataset contains data collected from a Sun SPARCstation 20/712 with 128 MB of memory running in a multi-user university department.
 A model needs to be built to predict 'usr' and to check how each attribute in a list of system attributes affects the time the system spends in 'usr' mode.

The following process was carried out to build the model:

 Importing the required libraries for linear regression.
 Reading the Excel file (a minimal sketch of these steps is given below).
 As per the summary, the Excel file contains data from a Sun SPARCstation 20/712 with 128 MB of memory running in a multi-user university department.
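
A minimal sketch of these steps; the file name 'compactiv.xlsx' is an assumption for illustration, and the libraries listed are the ones used throughout this problem:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Read the dataset (file name assumed) and take a first look.
df = pd.read_excel('compactiv.xlsx')
print(df.head())
print(df.shape)   # expected: (8192, 22)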

Dataset sample

lread lwrite scall sread swrite fork exec rchar wchar pgout ... pgscan atch pgin ppgin pflt vflt runqsz freemem freeswap usr

0 1 0 2147 79 68 0.2 0.2 40671.0 53995.0 0.0 ... 0.0 0.0 1.6 2.6 16.00 26.40 CPU_Bound 4670 1730946 95

1 0 0 170 18 21 0.2 0.2 448.0 8385.0 0.0 ... 0.0 0.0 0.0 0.0 15.63 16.83 Not_CPU_Bound 7278 1869002 97

2 15 3 2162 159 119 2.0 2.4 NaN 31950.0 0.0 ... 0.0 1.2 6.0 9.4 150.20 220.20 Not_CPU_Bound 702 1021237 87

3 0 0 160 12 16 0.2 0.2 NaN 8670.0 0.0 ... 0.0 0.0 0.2 0.2 15.60 16.80 Not_CPU_Bound 7248 1863704 98

4 5 1 330 39 38 0.4 0.4 NaN 12185.0 0.0 ... 0.0 0.0 1.0 1.2 37.80 47.60 Not_CPU_Bound 633 1760253 90

5 rows × 22 columns

 The dataset contains 8,192 rows and 22 columns (shape).


EDA
0 lread 8192 non-null int64
1 lwrite 8192 non-null int64
2 scall 8192 non-null int64
3 sread 8192 non-null int64
4 swrite 8192 non-null int64
5 fork 8192 non-null float64
6 exec 8192 non-null float64
7 rchar 8088 non-null float64
8 wchar 8177 non-null float64
9 pgout 8192 non-null float64
10 ppgout 8192 non-null float64
11 pgfree 8192 non-null float64
12 pgscan 8192 non-null float64
13 atch 8192 non-null float64
14 pgin 8192 non-null float64
15 ppgin 8192 non-null float64
16 pflt 8192 non-null float64
17 vflt 8192 non-null float64
18 runqsz 8192 non-null object
19 freemem 8192 non-null int64
20 freeswap 8192 non-null int64
21 usr 8192 non-null int64

 The dataset contains one object column, named 'runqsz'.


 The dataset contains 13 float columns, 8 integer columns and 1 object column.
dtypes: float64(13), int64(8), object(1)
CHECKING DUPLICATE AND NULL VALUES:

lread 0
lwrite 0
scall 0
sread 0
swrite 0
fork 0
exec 0
rchar 104
wchar 15
pgout 0
ppgout 0
pgfree 0
pgscan 0
atch 0
pgin 0
ppgin 0
pflt 0
vflt 0
runqsz 0
freemem 0
freeswap 0
usr 0
dtype: int64
 There are no duplicated columns, and the dataset has no duplicate rows either.
 The dataset has no null values except in the 'rchar' (104) and 'wchar' (15) columns.
 Let us treat these null values with a for loop, replacing them with the column medians, as sketched below.
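
A minimal sketch of this treatment; the looped columns are the two with nulls in the listing above:

# Replace nulls in each affected column with that column's median.
for col in ['rchar', 'wchar']:
    df[col] = df[col].fillna(df[col].median())

print(df.isnull().sum())   # every count should now be 0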
lread 0
lwrite 0
scall 0
sread 0
swrite 0
fork 0
exec 0
rchar 0
wchar 0
pgout 0
ppgout 0
pgfree 0
pgscan 0
atch 0
pgin 0
ppgin 0
pflt 0
vflt 0
runqsz 0
freemem 0
freeswap 0
usr 0

 After the treatment, the dataset has no null values, and the rest of the data is undisturbed.
 Linear regression is sensitive to null values, so this treatment is necessary.

ENCODING:

 A linear regression model requires numerical values only, but the dataset has one object variable, so we encode it as a numerical variable.
 The object column in the dataset is 'runqsz'.

Now we convert the column to numerical form by encoding its two levels; in the regression output below this appears as the dummy variable 'runqsz_Not_CPU_Bound' (1 for Not_CPU_Bound, 0 for CPU_Bound), as sketched below.
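
A sketch of the encoding, assuming pandas get_dummies (which produces exactly the 'runqsz_Not_CPU_Bound' dummy column that appears in the regression output below; a plain label mapping would also work, but would not match that column name):

# One-hot encode 'runqsz'; drop_first leaves a single dummy column,
# runqsz_Not_CPU_Bound (1 = Not_CPU_Bound, 0 = CPU_Bound).
df = pd.get_dummies(df, columns=['runqsz'], drop_first=True)
print(df['runqsz_Not_CPU_Bound'].value_counts())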

OUTLIERS:
Every column has outliers. Linear regression is sensitive to outliers, but in my opinion outlier treatment is not a good idea here, because each observation is a unique measurement in its own right.
Treating the outliers would alter the original values of the data and could lead to wrong predictions, so we proceed with the outliers in place.

In every column, '0' plays an important role, as it accounts for much of the large range in the data.
If we treated the zeros (as we did the null values), the data itself would change; since the real measurements can genuinely be 0, we proceed with them as they are.
PAIRPLOT

A pairplot shows the relationships between variables as scatterplots and the distribution of each variable as a histogram (a one-line sketch follows).
Because the dataset contains a large number of columns, the pair plot looks a little cluttered.
From the plot we can see that some column pairs have a positive correlation, some have no correlation, and some have a negative correlation.
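
A one-line sketch of the plot, assuming the imports shown earlier:

# Scatterplots for every pair of columns, histograms on the diagonal.
sns.pairplot(df)
plt.show()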
Now let us split the data and build a model.

TEST AND TRAIN SPLIT:

Let us create the X and y data with the 'usr' column as the target variable: X holds every column except the target, and y holds only the target variable.
 Using statsmodels.api (as sm) to add an intercept ('const') column to X.
 Using sklearn's train_test_split to split the data into train and test sets, as sketched below.
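
A sketch of the split; the 70/30 ratio matches the 5,734 training observations reported in the OLS summary below, and the random_state is an assumption for reproducibility:

X = df.drop('usr', axis=1)
y = df['usr']

# Add an explicit intercept column named 'const' for statsmodels.
X = sm.add_constant(X)

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.head())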
The first five rows of X_train are as follows,

const lread lwrite scall sread swrite fork exec rchar \


694 1.0 1 1 1345 223 192 0.6 0.6 198703.0
5535 1.0 1 1 1429 87 67 0.2 0.2 7163.0
4244 1.0 49 71 3273 225 180 0.6 0.4 83246.0
2472 1.0 13 8 4349 300 191 2.8 3.0 96009.0
7052 1.0 17 23 225 13 13 0.4 1.6 17132.0

wchar ... pgfree pgscan atch pgin ppgin pflt vflt \


694 293578.0 ... 23.40 56.4 2.60 3.80 7.40 28.20 56.60
5535 24842.0 ... 0.00 0.0 0.00 1.60 1.60 15.77 30.74
4244 53705.0 ... 7.19 0.0 2.79 3.99 4.59 59.88 74.05
2472 70467.0 ... 0.00 0.0 0.00 2.80 3.20 129.00 236.80
7052 12514.0 ... 0.00 0.0 0.00 0.00 0.00 19.80 23.80

freemem freeswap runqsz_Not_CPU_Bound


694 121 1375446 0
5535 1476 1021541 1
4244 82 18 0
2472 772 993909 0
7052 4179 1821682 1

[5 rows x 22 columns]
The first five rows of X_test are as follows,
const lread lwrite scall sread swrite fork exec rchar \
3894 1.0 27 39 1252 53 118 0.2 0.2 26592.0
4276 1.0 1 0 996 85 55 0.4 0.4 16667.0
3414 1.0 9 7 1530 247 135 0.4 0.4 14513.0
4165 1.0 32 4 3243 182 140 5.2 5.6 337517.0
7385 1.0 16 3 5017 259 249 2.8 1.4 73537.0

wchar ... pgfree pgscan atch pgin ppgin pflt vflt \


3894 54394.0 ... 0.0 0.0 0.0 0.4 0.6 19.44 20.04
4276 36431.0 ... 0.0 0.0 0.0 1.0 1.4 35.53 52.10
3414 61905.0 ... 30.4 24.2 10.4 14.8 18.4 26.80 186.20
4165 94832.0 ... 1.0 0.0 1.4 4.6 7.0 250.60 420.20
7385 237547.0 ... 0.0 0.0 0.0 5.6 5.8 142.80 276.20

freemem freeswap runqsz_Not_CPU_Bound


3894 7762 1875466 1
4276 2979 1010114 1
3414 89 11 0
4165 1300 1535309 0
7385 2114 988600 0

[5 rows x 22 columns]

With the train and test data split, we can proceed to create the linear model. To create the OLS model, we can use OLS from the statsmodels.api package and fit it with the X_train and y_train data, as sketched below.
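
A minimal sketch of the fit; the report mentions '.ols' from the statsmodels API, and the equivalent array-based call is used here:

# Ordinary least squares on the training data.
ols_model = sm.OLS(y_train, X_train).fit()
print(ols_model.summary())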
The summary of the fitted linear regression is as follows,
OLS Regression Results
==============================================================================
Dep. Variable:                    usr   R-squared:                       0.643
Model:                            OLS   Adj. R-squared:                  0.642
Method:                 Least Squares   F-statistic:                     489.6
Date:                Mon, 05 Dec 2022   Prob (F-statistic):               0.00
Time:                        16:53:24   Log-Likelihood:                -21788.
No. Observations:                5734   AIC:                         4.362e+04
Df Residuals:                    5712   BIC:                         4.377e+04
Df Model:                          21
Covariance Type:            nonrobust
========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    44.6380      0.746     59.831      0.000      43.175      46.101
lread                    -0.0199      0.003     -6.214      0.000      -0.026      -0.014
lwrite                    0.0048      0.006      0.795      0.427      -0.007       0.017
scall                     0.0010      0.000      7.451      0.000       0.001       0.001
sread                    -0.0005      0.002     -0.257      0.797      -0.004       0.003
swrite                   -0.0020      0.002     -1.018      0.309      -0.006       0.002
fork                     -1.7222      0.244     -7.052      0.000      -2.201      -1.244
exec                     -0.0896      0.048     -1.879      0.060      -0.183       0.004
rchar                 -4.062e-06   8.29e-07     -4.898      0.000   -5.69e-06   -2.44e-06
wchar                 -1.164e-05   1.28e-06     -9.118      0.000   -1.41e-05   -9.14e-06
pgout                    -0.1739      0.064     -2.717      0.007      -0.299      -0.048
ppgout                    0.0989      0.037      2.701      0.007       0.027       0.171
pgfree                   -0.0703      0.020     -3.508      0.000      -0.110      -0.031
pgscan                    0.0086      0.006      1.362      0.173      -0.004       0.021
atch                     -0.0786      0.027     -2.949      0.003      -0.131      -0.026
pgin                      0.0913      0.029      3.103      0.002       0.034       0.149
ppgin                    -0.0594      0.019     -3.128      0.002      -0.097      -0.022
pflt                     -0.0415      0.004     -9.697      0.000      -0.050      -0.033
vflt                      0.0223      0.003      6.665      0.000       0.016       0.029
freemem                  -0.0016   7.53e-05    -21.489      0.000      -0.002      -0.001
freeswap               3.219e-05   4.54e-07     70.985      0.000    3.13e-05    3.31e-05
runqsz_Not_CPU_Bound      7.7908      0.303     25.693      0.000       7.196       8.385
==============================================================================
Omnibus:                     1507.319   Durbin-Watson:                   2.057
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4768.238
Skew:                          -1.333   Prob(JB):                         0.00
Kurtosis:                       6.585   Cond. No.                     7.48e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.48e+06. This might indicate that there are strong multicollinearity or other numerical problems.

 The R-squared value tells us that the model can explain 64.3% of the variance in the training set.
 The adjusted R-squared is close to the R-squared, at 64.2%.
Let's build another model.

CREATING NEW MODEL PROCEDURE

The same procedure is followed to create new variables from the same dataset. We split the data into X_train and y_train, with 'usr' (y) as our target variable.
We create a model with sklearn's LinearRegression and fit it with the X_train and y_train data. Intercepting the data as before, the intercept of the created model is 44.63.
The R-squared of the training data is 64%, the same as the previous one: 64% of the variance in 'usr' is explained by the model for the train set.
The regression model's test and train scores (R-squared) are 63.12 and 64.24, and the reported RMSE of the train and test data is 51.59 and 51.97. A sketch of the fit and these metrics is given below, followed by the OLS summary of this model.
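
A sketch of this second model and its reported metrics, reusing the split above (sklearn fits its own intercept, so the statsmodels 'const' column is dropped first):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Xtr = X_train.drop(columns='const', errors='ignore')
Xte = X_test.drop(columns='const', errors='ignore')

lr = LinearRegression().fit(Xtr, y_train)
print(lr.intercept_)            # ~44.63 per the report

print(lr.score(Xtr, y_train))   # R-squared on the train set, ~0.64
print(lr.score(Xte, y_test))    # R-squared on the test set

# Root-mean-squared error on train and test.
print(np.sqrt(mean_squared_error(y_train, lr.predict(Xtr))))
print(np.sqrt(mean_squared_error(y_test, lr.predict(Xte))))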
OLS Regression Results

Dep. Variable: usr R-squared: 0.643

Model: OLS Adj. R-squared: 0.642

Method: Least Squares F-statistic: 489.6

Date: Mon, 05 Dec 2022 Prob (F-statistic): 0.00

Time: 16:53:24 Log-Likelihood: -21788.

No. Observations: 5734 AIC: 4.362e+04

Df Residuals: 5712 BIC: 4.377e+04

Df Model: 21

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

const 44.6380 0.746 59.831 0.000 43.175 46.101

lread -0.0199 0.003 -6.214 0.000 -0.026 -0.014

lwrite 0.0048 0.006 0.795 0.427 -0.007 0.017

scall 0.0010 0.000 7.451 0.000 0.001 0.001

sread -0.0005 0.002 -0.257 0.797 -0.004 0.003

swrite -0.0020 0.002 -1.018 0.309 -0.006 0.002

fork -1.7222 0.244 -7.052 0.000 -2.201 -1.244

exec -0.0896 0.048 -1.879 0.060 -0.183 0.004

rchar -4.062e-06 8.29e-07 -4.898 0.000 -5.69e-06 -2.44e-06

wchar -1.164e-05 1.28e-06 -9.118 0.000 -1.41e-05 -9.14e-06

pgout -0.1739 0.064 -2.717 0.007 -0.299 -0.048


ppgout 0.0989 0.037 2.701 0.007 0.027 0.171

pgfree -0.0703 0.020 -3.508 0.000 -0.110 -0.031

pgscan 0.0086 0.006 1.362 0.173 -0.004 0.021

atch -0.0786 0.027 -2.949 0.003 -0.131 -0.026

pgin 0.0913 0.029 3.103 0.002 0.034 0.149

ppgin -0.0594 0.019 -3.128 0.002 -0.097 -0.022

pflt -0.0415 0.004 -9.697 0.000 -0.050 -0.033

vflt 0.0223 0.003 6.665 0.000 0.016 0.029

freemem -0.0016 7.53e-05 -21.489 0.000 -0.002 -0.001

freeswap 3.219e-05 4.54e-07 70.985 0.000 3.13e-05 3.31e-05

runqsz_Not_CPU_Bound 7.7908 0.303 25.693 0.000 7.196 8.385

Omnibus: 1507.319 Durbin-Watson: 2.057

Prob(Omnibus): 0.000 Jarque-Bera (JB): 4768.238

Skew: -1.333 Prob(JB): 0.00

Kurtosis: 6.585 Cond. No. 7.48e+06

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.48e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

As we can see, the newly generated model is the same as the previous model; the R-squared and adjusted R-squared values match.
From these data we can also see a strong correlation between y_test and the model's predictions.

CONCLUSION:

The final linear equation for the given data is,

usr = (44.64) * const + (-0.02) * lread + (0.0) * lwrite + (0.0) * scall + (-0.0) * sread + (-0.0) * swrite + (-1.72) * fork + (-0.09) * exec + (-0.0) * rchar + (-0.0) * wchar + (-0.17) * pgout + (0.1) * ppgout + (-0.07) * pgfree + (0.01) * pgscan + (-0.08) * atch + (0.09) * pgin + (-0.06) * ppgin + (-0.04) * pflt + (0.02) * vflt + (-0.0) * freemem + (0.0) * freeswap + (7.79) * runqsz_Not_CPU_Bound

 When the number of virtual-memory faults ('vflt') increases, 'usr' increases by about 0.02 per unit; most of the other coefficients are negative.
 Many negative coefficients are present in the linear equation.
 Except for 'vflt', 'runqsz' and a handful of small positive terms, the coefficients imply that 'usr' decreases as the corresponding attribute increases.
 Overall, the model is not good enough to predict future data, as it depends heavily on outliers.
 Even keeping the '0' values in the data, the linear regression model is sensitive to outliers; if we tried to remove these zeros, the information in the data would change.
PROBLEM 2: LOGISTIC REGRESSION, LDA AND CART

You are a statistician at the Republic of Indonesia Ministry of Health, and you are provided with data on 1,473 females collected from a Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know if they were at the time of the survey.

The problem is to predict whether or not they use a contraceptive method of choice, based on their demographic and socio-economic characteristics.

SUMMARY

 The given dataset contains data on 1,473 females collected from a Contraceptive Prevalence Survey; the women were either not pregnant or unsure whether they were pregnant at the time of the survey.
 The model needs to predict whether or not they use a contraceptive method of choice, based on their demographic and socio-economic characteristics.

The following process was carried out to build the models:

 Importing the required libraries for logistic regression, LDA and CART.
 Reading the Excel file.
 As per the summary, the Excel file contains data on 1,473 females collected from a Contraceptive Prevalence Survey.

Dataset sample

dtypes: float64(2), int64(1), object(7)


The given dataset contains 1,473 rows and 10 columns. Checking with info(), we get 7 object variables, 1 integer variable and 2 float variables.

Using describe(include='all'), we find many NaN entries in the summary statistics, mostly because only 3 variables are numeric.

EDA

CHECKING DUPLICATE AND NULL VALUES:

In the given dataset, 80 rows show up as duplicates. But these may be genuinely valid data, such as different persons having the same qualifications or characteristics, i.e., similar records occurring naturally in the given dataset. Since there is no unique respondent identifier, the displayed duplicated rows can still represent information from different persons.
In Python, entries with identical values are considered duplicates even when they carry valid information, so we do not need to treat or drop these rows. They coincide mainly in husband education, wife religion, standard of living and media exposure.
Checking the null values, we find that the variables 'Wife_age' and 'No_of_children_born' contain nulls.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1473 entries, 0 to 1472
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Wife_age 1402 non-null float64
1 Wife_ education 1473 non-null object
2 Husband_education 1473 non-null object
3 No_of_children_born 1452 non-null float64
4 Wife_religion 1473 non-null object
5 Wife_Working 1473 non-null object
6 Husband_Occupation 1473 non-null int64
7 Standard_of_living_index 1473 non-null object
8 Media_exposure 1473 non-null object
9 Contraceptive_method_used 1473 non-null object
dtypes: float64(2), int64(1), object(7)
memory usage: 115.2+ KB

Now let us treat the null values only for the wife's age. At the time of the survey, the dataset is not certain whether the married women were pregnant or whether children had been born, so we cannot sensibly predict 'No_of_children_born'; but we can impute the wife's age using the mean or median.
We replace the null values of 'Wife_age' with the mean (using a for loop), and we drop the rows where 'No_of_children_born' is not known, since they are few; a sketch of both steps follows.
We now get a dataset with 0 null values; as some rows were dropped, we are left with 1,452 rows.
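
A minimal sketch of this treatment; the file name 'contraceptive_survey.xlsx' and the dataframe name 'cdf' are assumptions for illustration:

import pandas as pd

cdf = pd.read_excel('contraceptive_survey.xlsx')

# Impute Wife_age with the mean (the report does this with a for loop;
# a direct fillna is equivalent for a single column).
cdf['Wife_age'] = cdf['Wife_age'].fillna(cdf['Wife_age'].mean())

# Drop the few rows where No_of_children_born is unknown.
cdf = cdf.dropna(subset=['No_of_children_born'])

print(cdf.isnull().sum())   # all zero
print(cdf.shape)            # expected: (1452, 10)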
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1452 entries, 0 to 1472
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Wife_age 1452 non-null float64
1 Wife_ education 1452 non-null object
2 Husband_education 1452 non-null object
3 No_of_children_born 1452 non-null float64
4 Wife_religion 1452 non-null object
5 Wife_Working 1452 non-null object
6 Husband_Occupation 1452 non-null int64
7 Standard_of_living_index 1452 non-null object
8 Media_exposure 1452 non-null object
9 Contraceptive_method_used 1452 non-null object
dtypes: float64(2), int64(1), object(7)
memory usage: 124.8+ KB
Before we move on to the plots involving the object variables, we convert them to numeric labels using their unique category codes (sketched below), because the models built with logistic regression, LDA and CART will not work on object variables.
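
A sketch of the conversion using pandas category codes:

# Convert every object column to numeric labels via its unique categories.
for col in cdf.select_dtypes(include='object').columns:
    cdf[col] = pd.Categorical(cdf[col]).codes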

As the plots show, some variables have outliers. But we are not going to treat those outliers, as this information may help the model's predictions. For example, the value counts:
Wife_ education
Tertiary 570
Secondary 405
Primary 327
Uneducated 150
Name: Wife_ education, dtype: int64

Husband_education
Tertiary 889
Secondary 346
Primary 173
Uneducated 44
Name: Husband_education, dtype: int64

Wife_religion
Scientology 1235
Non-Scientology 217
Name: Wife_religion, dtype: int64

Wife_Working
No 1089
Yes 363
Name: Wife_Working, dtype: int64
The pair plot shows the relationships between all the variables. From these variables, we need to predict whether or not the women use the contraceptive method.
Logistic Regression
Train and Test Split:
Let us create the X and y data with the 'Contraceptive_method_used' column as the target variable: X holds every column except the target, and y holds only the target variable.
 Before we proceed, we need to import the required libraries (or check that they are available). In the encoding for 'Contraceptive_method_used', 1 means Yes and 0 means No.
 As we have already label-encoded the object variables, there is no need to use LabelEncoder from the sklearn library.
 The encoding stands in for creating dummy variables.
 Now the train set and the test set are split using sklearn, and a logistic regression model is fitted to the training data (a sketch follows the proportions below).
 The proportion of 1s and 0s, i.e. (women using the contraceptive method: Yes / No), is as follows,
1 0.566804
0 0.433196
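
A sketch of the split and the fit; the 70/30 ratio matches the supports reported below (1,016 train and 436 test rows), and the random_state is an assumption:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = cdf.drop('Contraceptive_method_used', axis=1)
y = cdf['Contraceptive_method_used']   # 1 = Yes, 0 = No after encoding

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Solver and iteration cap as stated in the report.
logit = LogisticRegression(solver='newton-cg', max_iter=1000)
logit.fit(X_train, y_train)

# Class-membership probabilities for the test rows.
pred_prob = pd.DataFrame(logit.predict_proba(X_test))
print(pred_prob.head())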

Now we fit the logistic regression model using 'newton-cg' as the solver and 1000 as max_iter (the maximum number of iterations); we then get the model's predicted-probability dataframe,

          0         1
0  0.363066  0.636934
1  0.342266  0.657734
2  0.471333  0.528667
3  0.338485  0.661515
4  0.309265  0.690735

 In the dataframe above we can see that class '1' gets the higher probability (up to 69%), and the model accuracy is 66.73% (0.6673).

AUC AND ROC CURVE


Now we can plot the ROC curve of the model and get separate curves and AUC scores for the train dataset and the test dataset, as sketched below.
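
A sketch of the ROC/AUC computation for the train set; the test set is handled identically:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

probs = logit.predict_proba(X_train)[:, 1]    # P(class 1) on the train set
fpr, tpr, _ = roc_curve(y_train, probs)
print('AUC:', roc_auc_score(y_train, probs))  # ~0.671 per the report

plt.plot(fpr, tpr, label='train')
plt.plot([0, 1], [0, 1], linestyle='--')      # the dotted baseline
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()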

AUC curve for train data:

 In this curve, if the plot falls below the dotted line, the model is considered worse than random. Even though the curve is not perfect, it is acceptable; the AUC (area under the curve) of the train-data model is 67.10%.

AUC curve for test data:


 This curve is similar to the train-data curve, varying slightly in the initial region.
 The curve is acceptable, as it lies above the dotted line.
 The area under the curve is the same as for the train data, 67.10%.

Comparing the train-data AUC with the test-data AUC, both curves are largely similar with only minor variation, and both AUC values are 67.10%. Let's move on to the confusion matrix,

Confusion matrix for train data:

Checking the confusion matrix of the train data, we get 182 correct predictions for class 0 (true negatives) and 496 correct predictions for class 1 (true positives).

array([[182, 247],
[ 91, 496]], dtype=int64)

This plot shows the relationship between the true labels and the predicted labels (0s and 1s).
The classification report is as follows,
precision recall f1-score support

0 0.67 0.42 0.52 429


1 0.67 0.84 0.75 587

accuracy 0.67 1016


macro avg 0.67 0.63 0.63 1016
weighted avg 0.67 0.67 0.65 1016

For Contraceptive_method_used (Label 0 ):

 Precision (67%) – Of all the married women predicted as not using a contraceptive method, 67% are actually not using one.
 Recall (42%) – Of all the married women not using a contraceptive method, 42% have been predicted correctly.

For Contraceptive_method_used (Label 1 ):

 Precision (67%) – Of all the married women predicted as using a contraceptive method, 67% are actually using one.
 Recall (84%) – Of all the married women actually using a contraceptive method, 84% have been predicted correctly.
And the accuracy is 67%, above the 50% baseline, so the model is good.
Confusion matrix for test data:

Checking the confusion matrix of the test data, we get 75 correct predictions for class 0 (true negatives) and 208 correct predictions for class 1 (true positives).

array([[ 75, 125],


[ 28, 208]], dtype=int64)
This plot shows the relationship between the true labels and the predicted labels (0s and 1s).
The classification report is as follows,
precision recall f1-score support

0 0.73 0.38 0.50 200


1 0.62 0.88 0.73 236

accuracy 0.65 436


macro avg 0.68 0.63 0.61 436
weighted avg 0.67 0.65 0.62 436
For Contraceptive_method_used (Label 0 ):

 Precision (73%) – Of all the married women predicted as not using a contraceptive method, 73% are actually not using one.
 Recall (38%) – Of all the married women not using a contraceptive method, 38% have been predicted correctly.

For Contraceptive_method_used (Label 1 ):

 Precision (62%) – Of all the married women predicted as using a contraceptive method, 62% are actually using one.
 Recall (88%) – Of all the married women actually using a contraceptive method, 88% have been predicted correctly.
And the accuracy is 65%, above the 50% baseline, so on the test data the model performs about as well as on the training data.

Grid search:
We use GridSearchCV from sklearn to search for the best model (a sketch follows). The process is otherwise the same as above, and we get,
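
A sketch of the search; the report does not state which grid was used, so the parameter grid below is an assumption for illustration:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'solver': ['newton-cg', 'lbfgs'],
              'C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, cv=5, scoring='f1')
grid.fit(X_train, y_train)
print(grid.best_params_)
best_model = grid.best_estimator_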
For Training data

precision recall f1-score support

0 0.67 0.42 0.52 429


1 0.67 0.85 0.75 587

accuracy 0.67 1016


macro avg 0.67 0.64 0.63 1016
weighted avg 0.67 0.67 0.65 1016
As with the previous method, we get similar values: 182 correct class-0 predictions and 499 correct class-1 predictions. The accuracy is 67%, not much different from the previous method.

For test data:


precision recall f1-score support

0 0.74 0.38 0.50 200


1 0.63 0.89 0.73 236

accuracy 0.65 436


macro avg 0.68 0.63 0.62 436
weighted avg 0.68 0.65 0.63 436

As with the previous method, we get similar values: 76 correct class-0 predictions and 209 correct class-1 predictions. The accuracy is 65%, not much different from the previous method.
Conclusion:
 Overall accuracy of the model – 67% of total predictions are correct.
 Accuracy, AUC, precision and recall for the test data are almost in line with the training data. This indicates that no overfitting or underfitting has happened, and overall the model is a good model for classification.

LDA
Train and Test Split:
The procedure for splitting the train and test data is the same as for logistic regression above.
We need to import LDA (LinearDiscriminantAnalysis) from the sklearn library, as sketched below; the results are as follows,
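
A minimal sketch, reusing the train/test split from the logistic regression section:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

print(classification_report(y_train, lda.predict(X_train)))
print(classification_report(y_test, lda.predict(X_test)))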
Classification Report of the training data:

precision recall f1-score support

0 0.67 0.42 0.52 429


1 0.67 0.85 0.75 587

accuracy 0.67 1016


macro avg 0.67 0.63 0.63 1016
weighted avg 0.67 0.67 0.65 1016

Classification Report of the test data:

precision recall f1-score support

0 0.73 0.37 0.49 200


1 0.62 0.88 0.73 236

accuracy 0.65 436


macro avg 0.67 0.63 0.61 436
weighted avg 0.67 0.65 0.62 436

There is a slight difference between the training and test reports, but it is acceptable: the accuracy on the train data is 67% and the accuracy on the test data is 65%.

AUC curve for test and train data:


This plot shows the ROC curves of both the training data and the test data. The train-data curve is smooth, while the test-data curve differs slightly. The AUC for the train data is 67.0%, and the AUC for the test data is 67.4%.

 The model accuracy on the training set as well as the test set is about 67%, only somewhat above the share of the majority class; this model is affected by a class-imbalance problem. Since we only have 1,473 observations, rebuilding the same LDA model with a larger number of data points could yield an even better model.
 Sweeping custom cut-offs (sketched below), we see that 0.4 and 0.5 give better accuracy than the other values, and the 0.4 cut-off gives the best f1-score; so we take 0.4 as the cut-off to get the optimum f1 score.
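
A sketch of the cut-off sweep, assuming the fitted LDA model from above and scoring on the test set:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

probs = lda.predict_proba(X_test)[:, 1]

# Score each candidate cut-off from 0.1 to 0.9.
for cutoff in np.arange(0.1, 1.0, 0.1):
    pred = (probs > cutoff).astype(int)
    print(round(cutoff, 1),
          'accuracy:', round(accuracy_score(y_test, pred), 3),
          'f1:', round(f1_score(y_test, pred), 3))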

CART
In CART we can use the dataset with the outliers included, as CART is not sensitive to outliers.
Train and Test Split:
The same procedure as for logistic regression and LDA above: the train and test data need to be split, and before that the necessary libraries need to be imported.
In CART, the decision tree is the most important part,
Decision tree:
 Fit the training data to a decision tree, and export the tree text to a new Word document saved in the project folder (a sketch follows this list).
 Now we can copy and paste that code into http://webgraphviz.com/ to view the decision tree: delete the existing code there and paste ours.
 The tree will be a little messy, as the data contains a vast amount of information, so we reduce the maximum leaves, the maximum depth of the tree and the minimum sample size.
 Here 'gini', the decision-tree classifier's split criterion, plays an important role. We create a new Word document with the tree reduced to branches of 30, leaf size of 10 and depth of 7, and save the document in the project folder.
 Now the decision tree looks better than before.
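
A sketch of the regularized tree and the export step; mapping "branches 30, leaf 10, depth 7" to min_samples_split=30, min_samples_leaf=10 and max_depth=7 is an assumption:

from sklearn.tree import DecisionTreeClassifier, export_graphviz

dtree = DecisionTreeClassifier(criterion='gini', max_depth=7,
                               min_samples_leaf=10, min_samples_split=30,
                               random_state=1)
dtree.fit(X_train, y_train)

# Dump the tree in DOT format for pasting into http://webgraphviz.com/.
with open('cart_tree.dot', 'w') as f:
    export_graphviz(dtree, out_file=f, feature_names=list(X_train.columns))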

Now let us check the feature importance, where feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable (a code sketch follows the table below).
Imp
Wife_age 0.408296
No_of_children_born 0.366101
Media_exposure 0.075275
Wife_ education 0.073313
Husband_education 0.053250
Husband_Occupation 0.010100
Standard_of_living_index 0.008617
Wife_Working 0.005049
Wife_religion 0.000000
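
A sketch of how these scores are read off the fitted tree:

import pandas as pd

# Importance scores from the fitted CART model, highest first.
imp = pd.DataFrame(dtree.feature_importances_,
                   index=X_train.columns, columns=['Imp'])
print(imp.sort_values('Imp', ascending=False))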

As we can see, 'Wife_age' carries the most importance, so we can tentatively infer that whether the contraceptive method is used depends strongly on the age of the woman.
AUC PLOT
As the AUC curve bends high toward the top-left, the model is good; the AUC value for the train data is 83.9%.

Here the test plot is not quite as smooth, but across its range it keeps the same bowed shape; its AUC value for the test data is 72.9%.
Let us move on to the confusion matrix,
FOR TRAIN DATA,
array([[282, 159],
[ 71, 504]], dtype=int64)

precision recall f1-score support


0 0.80 0.64 0.71 441
1 0.76 0.88 0.81 575

accuracy 0.77 1016


macro avg 0.78 0.76 0.76 1016
weighted avg 0.78 0.77 0.77 1016

Checking the confusion matrix of the train data, we get 282 correct predictions for class 0 (true negatives) and 504 correct predictions for class 1 (true positives).

For Contraceptive_method_used (Label 0 ):

 Precision (80%) – Of all the married women predicted as not using a contraceptive method, 80% are actually not using one.
 Recall (64%) – Of all the married women not using a contraceptive method, 64% have been predicted correctly.

For Contraceptive_method_used (Label 1 ):

 Precision (76%) – Of all the married women predicted as using a contraceptive method, 76% are actually using one.
 Recall (88%) – Of all the married women actually using a contraceptive method, 88% have been predicted correctly.

And the accuracy is 77.3%, well above the 50% baseline, so this model is also good (better than the logistic regression and the LDA).

FOR TEST DATA


array([[106, 82],
[ 50, 198]], dtype=int64)

precision recall f1-score support

0 0.68 0.56 0.62 188


1 0.71 0.80 0.75 248

accuracy 0.70 436


macro avg 0.69 0.68 0.68 436
weighted avg 0.70 0.70 0.69 436

Checking the confusion matrix of the test data, we get 106 correct predictions for class 0 (true negatives) and 198 correct predictions for class 1 (true positives).

For Contraceptive_method_used (Label 0 ):


 Precision (68%) – Of all the married women predicted as not using a contraceptive method, 68% are actually not using one.
 Recall (56%) – Of all the married women not using a contraceptive method, 56% have been predicted correctly.

For Contraceptive_method_used (Label 1 ):

 Precision (71%) – Of all the married women predicted as using a contraceptive method, 71% are actually using one.
 Recall (80%) – Of all the married women actually using a contraceptive method, 80% have been predicted correctly.

And the accuracy is 69.7%, well above the 50% baseline, so this model is also good (better than the logistic regression and the LDA).

CONCLUSION

 From the above models, in every model the encoded label '1' (contraceptive method used) is predicted more often, and the accuracy and f1-scores of the models also favour label '1'.
 We cannot conclude with certainty whether any individual uses a contraceptive method, but we can predict that a married woman in this data is more likely to use one, and the final predictions show the same.
