
LOGISTIC REGRESSION AND LDA

PREPARED BY
MURALIDHARAN N



You are hired by a tour and travel agency which deals in selling holiday
packages. You are provided details of 872 employees of a company. Among
these employees, some opted for the package and some didn't. You have to help
the company in predicting whether an employee will opt for the package or not
on the basis of the information given in the data set. Also, find out the important
factors on the basis of which the company will focus on particular employees to
sell their packages.

Data Dictionary:

Variable Name        Description
Holiday_Package      Opted for holiday package (yes/no)
Salary               Employee salary
age                  Age in years
edu                  Years of formal education
no_young_children    Number of young children (younger than 7 years)
no_older_children    Number of older children
foreign              Foreigner (yes/no)

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, write an inference on it. Perform Univariate
and Bivariate Analysis. Do exploratory data analysis.

Load all the libraries needed for model building.


Next, read the head and tail of the dataset to check that the data has been
loaded properly.

HEAD OF THE DATA

TAIL OF THE DATA

SHAPE OF THE DATA (872, 8)


INFO

No null values in the dataset.
The dataset contains integer and object columns.

DATA DESCRIBE

The numeric columns contain integer and continuous values.
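The inspection steps in this section (head, tail, shape, info, null check, describe) can be sketched roughly as follows. This is a minimal sketch, not the code used in the report: the five rows are made-up stand-ins, and the file name in the comment is hypothetical.

```python
import io

import pandas as pd

# Made-up stand-in rows so the sketch runs on its own. In the real project
# the frame would come from something like pd.read_csv("Holiday_Package.csv"),
# where the file name is an assumption.
csv_text = """Holiday_Package,Salary,age,educ,no_young_children,no_older_children,foreign
no,48000,30,8,1,1,no
yes,37000,45,8,0,1,no
no,58000,46,9,0,0,no
yes,66000,31,11,2,0,yes
no,61000,57,12,0,2,no
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())                 # first rows
print(df.tail())                 # last rows
print(df.shape)                  # (rows, columns)
df.info()                        # column dtypes and non-null counts
null_counts = df.isnull().sum()  # number of nulls per column
print(null_counts)
print(df.describe())             # summary statistics of the numeric columns
```

A null count of zero in every column corresponds to the "no null values" inference above.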


Holiday_Package is our target variable.
Salary, age, educ, number of young children, number of older children and
whether the employee is a foreigner are the attributes we have to examine to
help the company predict whether a person will opt for the holiday package
or not.

Unique values in the categorical data


HOLLIDAY_PACKAGE: 2
Yes 401
No 471
Name: Holliday Package, dtype: int64

FOREIGN : 2
Yes 216
No 656
Name: foreign, dtype: int64

Percentage of target :

This split indicates that about 46% of employees (401 of 872) opted for the holiday package.
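The class split can be checked directly from the counts reported above. The sketch below rebuilds the target column from those counts (401 yes, 471 no); the column name is taken from the data dictionary.

```python
import pandas as pd

# Rebuild the target column from the counts reported in this section.
target = pd.Series(["yes"] * 401 + ["no"] * 471, name="Holiday_Package")

# normalize=True returns class shares instead of raw counts.
share = target.value_counts(normalize=True) * 100
print(share.round(1))
```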

CATEGORICAL UNIVARIATE ANALYSIS

FOREIGN

HOLIDAY PACKAGE

HOLIDAY PACKAGE VS SALARY

We can see that employees with a salary below 150000 have always opted for the
holiday package.

HOLIDAY PACKAGE VS AGE

HOLIDAY PACKAGE VS EDUC



HOLIDAY PACKAGE VS YOUNG CHILDREN

HOLIDAY PACKAGE VS OLDER CHILDREN



AGE VS SALARY VS HOLIDAY PACKAGE

Employees aged 50 to 60 seem not to take the holiday package, whereas
employees aged 30 to 50 with a salary below 50000 have opted for it more
often.

EDUC VS SALARY VS HOLIDAY PACKAGE



YOUNG CHILDREN VS AGE VS HOLIDAY PACKAGE



OLDER CHILDREN VS AGE VS HOLIDAY_PACKAGE



BIVARIATE ANALYSIS
DATA DISTRIBUTION

There is no strong correlation between the variables, and the data appears to
be roughly normally distributed. The distributions do not differ much between
the two holiday package classes; I do not see two clearly separated
distributions in the data.

No multicollinearity in the data.
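One common way to check for multicollinearity is to look at the pairwise correlation matrix of the numeric columns. The sketch below assumes that approach, which the report does not state explicitly. The data is synthetic, and the 0.8 threshold is an assumed rule of thumb.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 872

# Synthetic, independent columns used only to make the sketch runnable.
df = pd.DataFrame({
    "Salary": rng.normal(47000, 20000, n),
    "age": rng.integers(20, 63, n),
    "educ": rng.integers(1, 22, n),
    "no_young_children": rng.integers(0, 4, n),
    "no_older_children": rng.integers(0, 7, n),
})

corr = df.corr()  # pairwise Pearson correlations

# Flag column pairs whose absolute correlation exceeds the assumed threshold.
threshold = 0.8
high_pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(corr.round(2))
print("Highly correlated pairs:", high_pairs)
```

An empty list of flagged pairs would support the conclusion that no pair of predictors is strongly collinear.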


TREATING OUTLIERS
BEFORE OUTLIER TREATMENT
We have outliers in the dataset. As LDA relies on class means and covariances,
which are sensitive to extreme values, treating outliers should help the model
perform better.

AFTER OUTLIER TREATMENT

No outliers remain in the data; all outliers have been treated.
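The report does not say which outlier treatment was used. A frequent choice is capping at the interquartile-range (IQR) whiskers, sketched below under that assumption. The helper name `cap_outliers_iqr` and the salary values are hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up salaries with one extreme value at the end.
salary = pd.Series(
    [30000.0, 35000.0, 40000.0, 45000.0, 50000.0, 55000.0, 60000.0, 236000.0]
)
capped = cap_outliers_iqr(salary)
print(capped.tolist())
```

Capping keeps every row, which matters with only 872 observations; dropping the rows instead would be another option.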

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis).
ENCODING CATEGORICAL VARIABLE

Encoding converts the string columns to numeric codes, which the logistic
regression model requires as input.
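A minimal sketch of the encoding step, assuming categorical codes are used for the two yes/no columns. The report does not show which encoder was applied, and the rows are made up.

```python
import pandas as pd

# Made-up rows with the two string columns from the data dictionary.
df = pd.DataFrame({
    "Holiday_Package": ["no", "yes", "no", "yes"],
    "foreign": ["no", "no", "yes", "yes"],
    "age": [30, 45, 46, 31],
})

# Replace each string column with integer category codes.
# Categories are sorted alphabetically, so "no" becomes 0 and "yes" becomes 1.
for col in ["Holiday_Package", "foreign"]:
    df[col] = pd.Categorical(df[col]).codes

print(df)
```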

GRID SEARCH METHOD:


Grid search is used with logistic regression to find the optimal solver and
its parameters.

Grid search selects the liblinear solver, which is suitable for small
datasets.
The tolerance and penalty have also been found using grid search.
Predicting on the training data:
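The 70:30 split and the grid search over solver, penalty and tolerance might look like the sketch below. The data is synthetic, and the grid values, `cv=3` and the F1 scoring are assumptions rather than the settings used in the report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)
n = 872

# Synthetic features and target, used only to make the sketch runnable.
X = rng.normal(size=(n, 6))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

# 70:30 split, stratified so both parts keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Each dict pairs solvers only with penalties they support.
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "tol": [1e-4, 1e-5]},
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "tol": [1e-4, 1e-5]},
]
grid = GridSearchCV(
    LogisticRegression(max_iter=10000), param_grid, cv=3, scoring="f1"
)
grid.fit(X_train, y_train)

print(grid.best_params_)
best_model = grid.best_estimator_
train_pred = best_model.predict(X_train)  # predictions on the training data
```

Splitting the grid into two dictionaries avoids invalid solver and penalty combinations, which would otherwise fail during fitting.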

CONFUSION MATRIX TRAIN DATA

CONFUSION MATRIX FOR TEST DATA



ACCURACY

AUC, ROC CURVE FOR TRAIN DATA

AUC, ROC CURVE FOR TEST DATA
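The confusion matrix, accuracy and AUC for the train and test data can be computed along these lines. This is a sketch on synthetic data with a plain liblinear model, so the numbers it prints are not the report's results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic features and target, used only to make the sketch runnable.
X = rng.normal(size=(872, 6))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=872) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LogisticRegression(solver="liblinear").fit(X_train, y_train)

results = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)              # hard class labels
    prob = model.predict_proba(X_part)[:, 1]  # probability of class 1
    results[name] = {
        "confusion_matrix": confusion_matrix(y_part, pred),
        "accuracy": accuracy_score(y_part, pred),
        "auc": roc_auc_score(y_part, prob),
    }
    print(name, results[name])
```

Note that the AUC is computed from predicted probabilities, not from the hard labels. Comparing the train and test rows side by side is a quick check for overfitting.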



LDA

PREDICTING THE VARIABLE

MODEL SCORE

CLASSIFICATION REPORT TRAIN DATA



MODEL SCORE

CLASSIFICATION REPORT TEST DATA
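Fitting LDA and producing the model score and classification report for both splits can be sketched as below. The data is synthetic and the default LDA settings are an assumption; the report does not list its parameters.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic features and target, used only to make the sketch runnable.
X = rng.normal(size=(872, 6))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=872) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
train_score = lda.score(X_train, y_train)
test_score = lda.score(X_test, y_test)
print("Train score:", round(train_score, 3))
print(classification_report(y_train, lda.predict(X_train)))
print("Test score:", round(test_score, 3))
print(classification_report(y_test, lda.predict(X_test)))
```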

CHANGING THE CUT-OFF VALUE TO FIND THE OPTIMAL VALUE
THAT GIVES BETTER ACCURACY AND F1 SCORE
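The cut-off search can be sketched as a loop over candidate thresholds applied to the LDA class probabilities. The 0.1 to 0.9 grid and the choice to rank by F1 score are assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(4)

# Synthetic features and target, used only to make the sketch runnable.
X = rng.normal(size=(872, 6))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=872) > 0).astype(int)

lda = LinearDiscriminantAnalysis().fit(X, y)
prob_yes = lda.predict_proba(X)[:, 1]  # probability of class 1

cutoffs = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
scores = []
for cutoff in cutoffs:
    # Predict class 1 whenever its probability exceeds the cut-off.
    pred = (prob_yes > cutoff).astype(int)
    acc = accuracy_score(y, pred)
    f1 = f1_score(y, pred, zero_division=0)
    scores.append((cutoff, acc, f1))
    print(f"cut-off {cutoff:.1f}: accuracy {acc:.3f}, F1 {f1:.3f}")

# Pick the cut-off with the highest F1 score.
best_cutoff, best_acc, best_f1 = max(scores, key=lambda s: s[2])
print("Best cut-off by F1:", best_cutoff)
```

The default cut-off of 0.5 is one of the candidates, so the selected value can never score worse on F1 than the default on the same data.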

AUC AND ROC CURVE



Comparing the two models, we find that their results are the same, but LDA
works better when the target variable is categorical.

2.4 Inference: Based on these predictions, what are the insights and
recommendations?
Please explain and summarise the various steps performed in this project.
There should be proper business interpretation and actionable insights
present.

We had a business problem in which we needed to predict whether an employee
would opt for a holiday package or not. For this problem we made predictions
using both logistic regression and linear discriminant analysis, and both gave
the same results.
The EDA indicates that people aged above 50 are not much interested in holiday
packages, whereas people aged 30 to 50 generally opt for them.
Among employees aged 30 to 50, those with a salary below 50000 have opted for
the holiday package more often.

The important factors deciding the predictions are salary, age and educ.

Recommendations
1. To improve uptake among employees aged above 50, we can offer packages to
religious destinations.
2. For people earning more than 150000, we can offer vacation holiday
packages.
3. For employees with a larger number of older children, we can offer
packages to holiday vacation destinations.

THE END
