Cart-Rf-ANN: Prepared by Muralidharan N

CART-RF-ANN
An insurance firm providing tour insurance is facing a higher claim frequency. The
management decides to collect data from the past few years. You are assigned the task of
building a model that predicts the claim status and of providing recommendations to management.
Use CART, RF and ANN, and compare the models' performance on the train and test sets.
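As a rough illustration of the comparison asked for above, the sketch below fits a decision tree (CART), a random forest and a small neural network, then reports train and test accuracy for each. It runs on synthetic data from `make_classification`, not on the insurance dataset, and every hyperparameter (tree depth, number of trees, hidden layer size) is an assumption chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded insurance data (NOT the real dataset).
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=5, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Hyperparameters are illustrative assumptions, not tuned values.
models = {
    "CART": DecisionTreeClassifier(max_depth=4, random_state=1),
    "RF": RandomForestClassifier(n_estimators=100, max_depth=6, random_state=1),
    # The neural network is scaled first because it is sensitive to feature scale.
    "ANN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=1),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train accuracy={scores[name][0]:.3f}, "
          f"test accuracy={scores[name][1]:.3f}")
```

A large gap between the train and test accuracy of a model is the usual sign of overfitting, which is why both sets are reported side by side.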
Data Dictionary
1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration)
7. Destination of the tour (Destination)
8. Amount of sales of tour insurance policies (Sales)
9. The commission received for tour insurance firm (Commission)
10. Age of insured (Age)
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null
value condition check, write an inference on it.
Reading the dataset:
The data was read successfully.

The shape of the dataset is (3000, 10)
The info function shows that the dataset has object, integer and float columns, so the
object columns have to be converted to numeric values before modelling.
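A minimal sketch of this check, assuming the data sits in a pandas DataFrame. The frame below is a made-up three-row stand-in, and the column names are assumptions based on the data dictionary.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up and only mimic the column types.
df = pd.DataFrame({
    "Age": [48, 36, 39],
    "Agency_Code": ["C2B", "EPX", "CWT"],
    "Commission": [0.70, 0.00, 5.94],
    "Claimed": ["No", "No", "Yes"],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column

# Columns that are not numeric are the ones that need encoding.
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```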
There are no missing values in the dataset.
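The null-value check can be sketched as a per-column count of missing entries. The frame below is illustrative only, and one value is deliberately left missing to show what a non-zero count looks like; on the insurance data every count is zero.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one deliberately missing Age value.
df = pd.DataFrame({
    "Age": [48, 36, np.nan],
    "Sales": [2.51, 20.0, 9.9],
    "Type": ["Airlines", "Travel Agency", "Airlines"],
})

missing_per_column = df.isnull().sum()  # number of missing values in each column
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing values:", total_missing)
```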
Summary of the dataset:

There are 4 numeric variables and 6 categorical variables.
Agency code EPX is the most frequent, with a frequency of 1365.
The most preferred type is Travel Agency.
The most used channel is Online.
The Customised Plan is the plan most sought by customers.
ASIA is the destination most sought by customers.
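Observations of this kind are typically read off the `top` and `freq` rows of a descriptive summary. A minimal sketch with pandas, using a made-up four-row frame and assumed column names:

```python
import pandas as pd

# Made-up rows; only the mechanics of the summary matter here.
df = pd.DataFrame({
    "Agency_Code": ["EPX", "EPX", "C2B", "CWT"],
    "Sales": [20.0, 9.9, 2.51, 26.0],
})

# include="all" adds unique/top/freq rows for the categorical columns.
summary = df.describe(include="all")
print(summary)

top_agency = summary.loc["top", "Agency_Code"]  # most frequent category
top_freq = summary.loc["freq", "Agency_Code"]   # how often it occurs
```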
We will look further at the distribution of the data in the univariate and bivariate analysis.
Checking for duplicates in the dataset:
As there is no unique identifier, I am not dropping the duplicates; they may be different customers' data.
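A sketch of how duplicate rows can be counted and inspected without dropping them. The frame is illustrative and the column names are assumptions.

```python
import pandas as pd

# Illustrative frame in which the first two rows are identical.
df = pd.DataFrame({
    "Age": [36, 36, 48],
    "Agency_Code": ["EPX", "EPX", "C2B"],
    "Sales": [20.0, 20.0, 2.51],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(df.duplicated().sum())
print("Duplicate rows:", n_duplicates)

# keep=False shows all copies, which helps when judging whether they are
# genuine repeats or different customers with the same values.
print(df[df.duplicated(keep=False)])
```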
Outliers exist in almost all the numeric variables.
We can treat the outliers at the random forest classification stage.
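One common way to flag and treat outliers is the 1.5 × IQR rule that box plot whiskers use. The sketch below applies it to a made-up series; capping at the whisker limits is an assumed treatment, not necessarily the one used in this analysis.

```python
import pandas as pd

# Made-up values with one obvious extreme point.
sales = pd.Series([10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 300.0], name="Sales")

q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whisker limits are flagged as outliers.
outliers = sales[(sales < lower) | (sales > upper)]
print(outliers)

# One possible treatment: cap the values at the whisker limits.
capped = sales.clip(lower=lower, upper=upper)
```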
AGENCY_CODE: 4
JZI 239
CWT 472
C2B 924
EPX 1365
TYPE: 2
Airlines 1163
Travel Agency 1837
CLAIMED: 2
Yes 924
No 2076
CHANNEL: 2
Offline 46
Online 2954
PRODUCT NAME: 5
Gold Plan 109
Silver Plan 427
Bronze Plan 650
Cancellation Plan 678
Customised Plan 1136
DESTINATION: 3
EUROPE 215
Americas 320
ASIA 2465
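A listing like the one above can be produced by looping over the categorical columns and printing the number of unique levels followed by the count of each level. A sketch on a made-up frame with assumed column names:

```python
import pandas as pd

# Made-up rows; the real dataset has six categorical columns.
df = pd.DataFrame({
    "Type": ["Airlines", "Travel Agency", "Travel Agency"],
    "Channel": ["Online", "Online", "Offline"],
})

counts = {}
for col in df.columns:
    # Sort ascending so the most frequent level is printed last.
    counts[col] = df[col].value_counts().sort_values()
    print(f"{col.upper()}: {df[col].nunique()}")
    print(counts[col].to_string())
```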
Univariate / Bivariate analysis
The box plot of the age variable shows outliers.

Age is positively skewed, with a skewness of 1.149713.
The dist plot shows the data distributed from 20 to 80.
The majority of the distribution lies in the range of 30 to 40.
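A skewness figure like the one quoted above can be computed with the pandas `skew` method. The values below are made up, so the result will not match the reported number; a positive result indicates a longer right tail.

```python
import pandas as pd

# Made-up ages with a long right tail.
age = pd.Series([25, 28, 30, 32, 35, 36, 38, 40, 55, 80], name="Age")

skewness = age.skew()  # sample skewness; > 0 means positively skewed
print(round(skewness, 3))
```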
The box plot of the commission variable shows outliers.

The box plot of the duration variable shows outliers.

The box plot of the sales variable shows outliers.

Categorical Variables
Agency Code
The distribution of the agency code shows EPX with the maximum frequency.
The box plot shows the split of sales across the agency codes, with the claimed column as hue.
It seems that C2B has more claims than the other agencies.
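A visual impression like this can be cross-checked with a table of claim counts and claim rates per agency. The sketch below uses `pd.crosstab` on made-up rows, so the numbers say nothing about the actual agencies.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Agency_Code": ["C2B", "C2B", "EPX", "EPX", "EPX", "JZI"],
    "Claimed": ["Yes", "Yes", "No", "No", "Yes", "No"],
})

# Raw counts of claimed / not claimed per agency.
claim_counts = pd.crosstab(df["Agency_Code"], df["Claimed"])

# Row-normalised version: the share of policies claimed within each agency.
claim_rate = pd.crosstab(df["Agency_Code"], df["Claimed"], normalize="index")

print(claim_counts)
print(claim_rate.round(2))
```

Comparing rates rather than raw counts matters here because the agencies differ widely in size.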
The box plot shows the split of sales across the types, with the claimed column as hue.
The Airlines type appears to have more claims.
The majority of customers have used the online channel; very few have used the offline channel.
The box plot shows the split of sales across the channels, with the claimed column as hue.
The Customised Plan seems to be the plan customers like most, compared with all the other plans.
1.2 Encode the data (having string values) for Modelling. Split the data into train
and test (70:30). Apply Linear regression using scikit learn. Perform checks for
significant variables using appropriate method from statsmodel. Create multiple
models and check the performance of Predictions on Train and Test sets using
Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning.
ENCODING THE STRING VALUES

GET DUMMIES
Dummies have been encoded.

A linear regression model does not accept categorical values, so we have encoded the
categorical values as integers before fitting.
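A minimal sketch of dummy encoding with `pd.get_dummies`. The frame and the column names are illustrative assumptions, and `dtype=int` is used so the dummy columns come out as 0/1 integers.

```python
import pandas as pd

# Made-up rows with two categorical columns and one numeric column.
df = pd.DataFrame({
    "Type": ["Airlines", "Travel Agency", "Airlines"],
    "Channel": ["Online", "Online", "Offline"],
    "Sales": [2.51, 20.0, 9.9],
})

# Each level of each listed column becomes its own 0/1 column;
# numeric columns are passed through unchanged.
encoded = pd.get_dummies(df, columns=["Type", "Channel"], dtype=int)
print(encoded)
```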
Train/Test split and Linear Regression model:
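The 70:30 split can be sketched with scikit-learn's `train_test_split`. The frame below is made up, the target column name `Claimed` is an assumption, and `random_state=1` is an arbitrary seed chosen only for reproducibility. The sketch covers the split only, not the model fit.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, two predictors and a 0/1 target.
df = pd.DataFrame({
    "Age": list(range(20, 40)),
    "Sales": [float(10 + i) for i in range(20)],
    "Claimed": [0, 1] * 10,
})

X = df.drop(columns="Claimed")  # predictors
y = df["Claimed"]               # target

# test_size=0.30 keeps 70% of the rows for training and 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```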

Spending is positively skewed - 2.381148

The dist plot shows the distribution of data from 0 to 300
