
CART-RF-ANN

PREPARED BY
MURALIDHARAN N

CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The
management decides to collect data from the past few years. You are assigned the task to
make a model which predicts the claim status and provide recommendations to management.
Use CART, RF & ANN and compare the models' performances in train and test sets.
Data Dictionary
1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration)
7. Destination of the tour (Destination)
8. Amount of sales of tour insurance policies (Sales)
9. The commission received for tour insurance firm (Commission)
10. Age of insured (Age)
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null
value condition check, write an inference on it.

Reading the dataset,

The data has been read successfully.


The shape of the dataset is (3000, 10)
The info() output shows that the dataset contains object, integer and float columns, so the
object columns must be converted to numeric values before modelling.
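
As a minimal sketch of this check, assuming pandas is used: the toy rows and the column names below are placeholders for illustration, not the real 3000-row dataset. It shows how the shape and the per-column dtypes can be inspected and how the non-numeric columns that need encoding can be listed.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the real dataset,
# and the column names are assumed.
df = pd.DataFrame({
    "Age": [30, 41, 52],
    "Agency_Code": ["C2B", "EPX", "CWT"],
    "Type": ["Airlines", "Travel Agency", "Travel Agency"],
    "Claimed": ["No", "No", "Yes"],
    "Commission": [1.5, 0.0, 6.0],
    "Sales": [12.0, 20.0, 35.0],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # one dtype per column

# Columns that are not numeric are the ones that need encoding later.
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()
numeric_cols = df.select_dtypes(include="number").columns.tolist()
```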

There are no missing values in the dataset.
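
A null check of this kind can be sketched as follows, assuming pandas. The toy frame deliberately contains one missing value so the output is not all zeros; on the real dataset every count would be 0.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustration only).
df = pd.DataFrame({"Age": [30, np.nan, 45], "Sales": [10.0, 20.0, 30.0]})

missing_per_column = df.isnull().sum()        # count of nulls in each column
total_missing = int(missing_per_column.sum()) # count of nulls in the whole frame
print(missing_per_column)
```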

Summary of the dataset,



We have 4 numeric variables and 6 categorical variables.

Agency code EPX is the most frequent, with 1365 records.

The most common type is Travel Agency.

Almost all policies are sold through the Online channel.

The Customised Plan is the most popular product.

ASIA is the most popular destination.

We will look further at the distributions in the univariate and bivariate analysis.

Checking for duplicates in the dataset,

As there is no unique identifier, I am not dropping the duplicates; they may be records of different customers.
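
A duplicate check can be sketched like this, assuming pandas. The toy rows are illustrative only; the point is that `duplicated()` flags rows identical to an earlier row without dropping anything.

```python
import pandas as pd

# Toy rows for illustration only: the second row repeats the first.
df = pd.DataFrame({
    "Age": [30, 30, 45],
    "Sales": [10.0, 10.0, 30.0],
})

n_duplicates = int(df.duplicated().sum())       # rows identical to an earlier row
duplicate_rows = df[df.duplicated(keep=False)]  # every copy, kept for inspection
print(n_duplicates)
```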

Outliers exist in almost all the numeric variables.

The outliers can be dealt with at the random forest classification stage.
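
One common way to flag such outliers numerically is the 1.5 × IQR rule that box plots conventionally use for their whiskers. The sketch below assumes that rule and uses toy values; the helper name `iqr_bounds` is hypothetical and not part of the original analysis.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values for illustration only, with one extreme point.
sales = pd.Series([10, 12, 14, 15, 16, 18, 20, 300])

lower, upper = iqr_bounds(sales)
outliers = sales[(sales < lower) | (sales > upper)]
print(outliers.tolist())
```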

AGENCY_CODE: 4
JZI 239
CWT 472
C2B 924
EPX 1365

TYPE: 2
Airlines 1163
Travel Agency 1837

CLAIMED: 2
Yes 924
No 2076

CHANNEL: 2
Offline 46
Online 2954

PRODUCT NAME: 5
Gold Plan 109
Silver Plan 427
Bronze Plan 650
Cancellation Plan 678
Customised Plan 1136

DESTINATION: 3
EUROPE 215
Americas 320
ASIA 2465
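
The counts above can be turned into shares to see how balanced the target is. The sketch below uses the CLAIMED counts reported above and assumes pandas; on a loaded DataFrame the same counts would typically come from `value_counts()` on the target column.

```python
import pandas as pd

# Counts copied from the CLAIMED frequency table above.
claimed_counts = pd.Series({"Yes": 924, "No": 2076})

# Share of each class among all records.
claimed_share = claimed_counts / claimed_counts.sum()
print(claimed_share.round(3))
```

Roughly 31% of the records are claims, so the two classes are unequal in size, which is worth keeping in mind when reading accuracy figures.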

Univariate / Bivariate analysis

The box plot of the age variable shows outliers.


Age is positively skewed (skewness 1.149713).
The dist plot shows values ranging from 20 to 80.
Most of the distribution lies in the range of 30 to 40.

The box plot of the commission variable shows outliers.


Commission is positively skewed (skewness 3.148858).
The dist plot shows values ranging from 0 to 30.

The box plot of the duration variable shows outliers.


Duration is positively skewed (skewness 13.784681).
The dist plot shows values ranging from 0 to 100.

The box plot of the sales variable shows outliers.


Sales is positively skewed (skewness 2.381148).

The dist plot shows values ranging from 0 to 300.
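
Skewness figures like the ones quoted for the numeric variables can be computed with pandas' `Series.skew()`. The sketch below uses toy values with a long right tail, not the real data, so the number it prints will not match the figures above.

```python
import pandas as pd

# Toy values with a long right tail (illustration only, not the real data).
duration = pd.Series([5, 7, 8, 10, 12, 15, 20, 200])

skewness = duration.skew()  # a value above 0 indicates a longer right tail
print(skewness)

# In a right-skewed sample the mean is pulled above the median.
print(duration.mean(), duration.median())
```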

Categorical Variables
Agency Code

The distribution of agency codes shows EPX with the highest frequency.

The box plot shows sales split by agency code, with the Claimed column as hue.
C2B appears to have more claims than the other agencies.
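
A visual impression like this can be cross-checked with a table of claim rates per agency. The sketch below assumes pandas and the column names `Agency_Code` and `Claimed`; the toy rows are invented for illustration and say nothing about the real rates.

```python
import pandas as pd

# Toy rows for illustration only; they do not reproduce the real dataset.
df = pd.DataFrame({
    "Agency_Code": ["C2B", "C2B", "C2B", "EPX", "EPX", "EPX", "EPX"],
    "Claimed":     ["Yes", "Yes", "No",  "No",  "No",  "Yes", "No"],
})

# Row-normalised crosstab: each row sums to 1, giving the claim rate per agency.
claim_rate = pd.crosstab(df["Agency_Code"], df["Claimed"], normalize="index")
print(claim_rate)
```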

The box plot shows sales split by type, with the Claimed column as hue.
The Airlines type appears to have more claims.

Most customers used the online channel; very few used the offline channel.

The box plot shows sales split by channel, with the Claimed column as hue.

The Customised Plan appears to be the most popular plan among customers compared with all other plans.

1.2 Encode the data (having string values) for modelling. Split the data into train
and test (70:30). Apply linear regression using scikit-learn. Perform checks for
significant variables using an appropriate method from statsmodels. Create multiple
models and check the performance of predictions on the train and test sets using
R-squared, RMSE and adjusted R-squared. Compare these models and select the best one
with appropriate reasoning.

ENCODING THE STRING VALUES


GET DUMMIES


Dummies have been encoded.


A linear regression model cannot take categorical string values directly, so we have encoded the
categorical values as integers.
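
Dummy encoding of this kind can be sketched with `pd.get_dummies`. The toy frame and its column names are illustrative assumptions; `drop_first=True` is one common choice for linear regression because it avoids perfectly collinear dummy columns, and it is not confirmed to be the setting used here.

```python
import pandas as pd

# Toy frame for illustration only.
df = pd.DataFrame({
    "Channel": ["Online", "Offline", "Online"],
    "Sales": [10.0, 20.0, 30.0],
})

# One 0/1 column per category level; the first level (Offline) is dropped.
encoded = pd.get_dummies(df, columns=["Channel"], drop_first=True, dtype=int)
print(encoded)
```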

Train/Test split and Linear Regression model:
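
The original output for this step is not available here. As a hedged sketch of what a 70:30 split, a scikit-learn linear regression and the three metrics could look like, the example below uses synthetic data; the data, the `random_state` values and the helper name `report` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: y is linear in two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

def report(X_part, y_part):
    """Return (R-squared, RMSE, adjusted R-squared) for one data partition."""
    pred = model.predict(X_part)
    r2 = r2_score(y_part, pred)
    rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
    n, p = X_part.shape  # n observations, p predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, rmse, adj_r2

train_scores = report(X_train, y_train)
test_scores = report(X_test, y_test)
print(train_scores, test_scores)
```

Comparing the train and test tuples side by side is a quick way to spot overfitting: a large gap between train and test R-squared suggests the model does not generalise well.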

