Cart-Rf-ANN: Prepared by Muralidharan N
CART-RF-ANN
An insurance firm providing tour insurance is facing a higher claim frequency.
Management has decided to collect data from the past few years. You are assigned the task of
building a model that predicts the claim status and of providing recommendations to management.
Use CART, RF and ANN, and compare the models' performance on the train and test sets.
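The comparison asked for above can be sketched as follows. This is only a minimal illustration under stated assumptions: the data is synthetic (generated with make_classification), the hyperparameters are arbitrary, and accuracy is used as the single metric; none of these come from the actual insurance data or from the final models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the real insurance records are not used here.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# 70:30 split, stratified so both sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Hyperparameters below are illustrative assumptions, not tuned values.
models = {
    "CART": DecisionTreeClassifier(max_depth=4, random_state=1),
    "RF": RandomForestClassifier(n_estimators=100, max_depth=4, random_state=1),
    # The ANN is scaled first because MLPs are sensitive to feature scale.
    "ANN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=1),
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
    }

for name, scores in results.items():
    print(f"{name}: train={scores['train_accuracy']:.3f}, "
          f"test={scores['test_accuracy']:.3f}")
```

A large gap between the train and test scores of a model would suggest overfitting, which is why both sets are reported side by side.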
Data Dictionary
1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration)
7. Destination of the tour (Destination)
8. Amount of sales of tour insurance policies (Sales)
9. The commission received for tour insurance firm (Commission)
10. Age of insured (Age)
2.1 Data Ingestion: Read the dataset. Compute the descriptive statistics, check
for null values, and write an inference on the results.
The Channel is almost entirely Online (2954 of 3000 records).
We will look further at the distribution of the dataset in the univariate and bivariate analysis.
As there is no unique identifier, I am not dropping the duplicates; they may be data from different customers.
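The ingestion checks described above (descriptive statistics, null values, duplicates) can be sketched with pandas. The small frame below is a made-up stand-in, and the column names are assumptions based on the data dictionary; with the real data the frame would come from reading the dataset file instead.

```python
import pandas as pd

# Toy stand-in rows (assumption): not taken from the real dataset.
df = pd.DataFrame({
    "Age": [36, 39, 36, 48, 36],
    "Agency_Code": ["C2B", "EPX", "C2B", "CWT", "C2B"],
    "Type": ["Airlines", "Travel Agency", "Airlines", "Travel Agency", "Airlines"],
    "Claimed": ["No", "No", "No", "Yes", "No"],
    "Sales": [2.5, 20.0, 2.5, 9.9, 2.5],
})

# Descriptive statistics for numeric and string columns alike.
summary = df.describe(include="all").T
print(summary)

# Null-value check: number of missing entries per column.
null_counts = df.isnull().sum()
print(null_counts)

# Rows that exactly repeat an earlier row. Without a unique customer
# identifier these cannot be told apart from genuine repeat records.
duplicate_rows = int(df.duplicated().sum())
print("Duplicate rows:", duplicate_rows)
```

Counting the duplicates without dropping them keeps the decision visible: the number is reported, but the rows stay in the data.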
AGENCY_CODE: 4
JZI 239
CWT 472
C2B 924
EPX 1365
TYPE: 2
Airlines 1163
Travel Agency 1837
CLAIMED: 2
Yes 924
No 2076
CHANNEL: 2
Offline 46
Online 2954
PRODUCT NAME: 5
Gold Plan 109
Silver Plan 427
Bronze Plan 650
Cancellation Plan 678
Customised Plan 1136
DESTINATION: 3
EUROPE 215
Americas 320
ASIA 2465
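A listing like the one above (number of levels per column, then the frequency of each level) can be produced with value_counts. This is a minimal sketch on made-up rows; the column names and the choice of which columns count as categorical are assumptions.

```python
import pandas as pd

# Toy stand-in rows (assumption): not taken from the real dataset.
df = pd.DataFrame({
    "Agency_Code": ["EPX", "C2B", "EPX", "CWT", "EPX"],
    "Channel": ["Online", "Online", "Online", "Offline", "Online"],
})

# Columns treated as categorical are listed explicitly (assumption).
categorical_columns = ["Agency_Code", "Channel"]

level_counts = {}
for column in categorical_columns:
    counts = df[column].value_counts().sort_values()  # ascending frequency
    level_counts[column] = counts.to_dict()
    print(f"{column.upper()}: {df[column].nunique()}")
    print(counts.to_string())
```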
Categorical Variables
Agency Code
The distribution of the agency code shows EPX with the highest frequency (1365 records).
The box plot shows the spread of sales across the agency codes, with the claimed
column as the hue.
C2B appears to have more claims than the other agencies.
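A reading taken from a plot can be cross-checked numerically with a cross-tabulation of agency code against claim status. The sketch below uses made-up rows purely to show the mechanics; the column names are assumptions, and the resulting numbers say nothing about the real agencies.

```python
import pandas as pd

# Toy stand-in rows (assumption): values are invented for illustration only.
df = pd.DataFrame({
    "Agency_Code": ["C2B", "C2B", "C2B", "EPX", "EPX", "EPX", "CWT", "JZI"],
    "Claimed": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
})

# Raw counts of claimed / not claimed per agency.
claim_table = pd.crosstab(df["Agency_Code"], df["Claimed"])
print(claim_table)

# Share of policies claimed within each agency (rows sum to 1).
claim_rate = pd.crosstab(
    df["Agency_Code"], df["Claimed"], normalize="index"
)["Yes"]
print(claim_rate.sort_values(ascending=False))
```

Comparing rates rather than raw counts matters here because the agencies differ a lot in size, so a large agency can show many claims while still having a low claim rate.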
The box plot shows the spread of sales across the types, with the claimed column
as the hue. The Airlines type appears to have more claims.
The majority of customers used the Online channel; very few used the Offline channel.
The box plot shows the spread of sales across the channels, with the claimed
column as the hue.
The Customised Plan appears to be the most popular plan among customers compared with all other plans.
1.2 Encode the data (having string values) for modelling. Split the data into train
and test sets (70:30). Apply linear regression using scikit-learn. Check for
significant variables using an appropriate method from statsmodels. Create multiple
models and check the performance of predictions on the train and test sets using
R-squared, RMSE and adjusted R-squared. Compare these models and select the best one
with appropriate reasoning.