
Michel Denuit
Donatien Hainaut
Julien Trufin

Effective Statistical Learning Methods for Actuaries II
Tree-Based Methods and Extensions
Springer Actuarial

Springer Actuarial Lecture Notes

Editors-in-Chief
Hansjoerg Albrecher, University of Lausanne, Lausanne, Switzerland
Michael Sherris, UNSW, Sydney, NSW, Australia

Series Editors
Daniel Bauer, University of Wisconsin-Madison, Madison, WI, USA
Stéphane Loisel, ISFA, Université Lyon 1, Lyon, France
Alexander J. McNeil, University of York, York, UK
Antoon Pelsser, Maastricht University, Maastricht, The Netherlands
Ermanno Pitacco, Università di Trieste, Trieste, Italy
Gordon Willmot, University of Waterloo, Waterloo, ON, Canada
Hailiang Yang, The University of Hong Kong, Hong Kong, Hong Kong
This subseries of Springer Actuarial includes books with the character of lecture
notes. Typically these are research monographs on new, cutting-edge developments
in actuarial science; sometimes they may be a glimpse of a new field of research
activity, or presentations of a new angle in a more classical field.
In the established tradition of Lecture Notes, the timeliness of a manuscript can
be more important than its form, which may be informal, preliminary or tentative.

More information about this subseries at http://www.springer.com/series/15682


Michel Denuit
Donatien Hainaut
Julien Trufin

Effective Statistical Learning Methods for Actuaries II
Tree-Based Methods and Extensions
Michel Denuit
Institut de Statistique, Biostatistique et Sciences Actuarielles (ISBA)
Université Catholique Louvain
Louvain-la-Neuve, Belgium

Donatien Hainaut
Institut de Statistique, Biostatistique et Sciences Actuarielles (ISBA)
Université Catholique Louvain
Louvain-la-Neuve, Belgium

Julien Trufin
Département de Mathématiques
Université Libre de Bruxelles
Brussels, Belgium

Springer Actuarial
ISSN 2523-3262    ISSN 2523-3270 (electronic)
Springer Actuarial Lecture Notes
ISSN 2523-3289    ISSN 2523-3297 (electronic)
ISBN 978-3-030-57555-7 ISBN 978-3-030-57556-4 (eBook)
https://doi.org/10.1007/978-3-030-57556-4

Mathematics Subject Classification: 62P05, 62-XX, 68-XX, 62M45

© Springer Nature Switzerland AG 2020


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

The present material is written for students enrolled in actuarial master programs
and practicing actuaries, who would like to gain a better understanding of insurance
data analytics. It is built in three volumes, starting from the celebrated Generalized
Linear Models, or GLMs, and continuing with tree-based methods and neural
networks.
This second volume summarizes the state of the art in using regression trees and
their various combinations, such as random forests and boosting trees. It
also goes through tools for assessing the predictive accuracy of
regression models. Throughout this book, we alternate between methodological
aspects and numerical illustrations or case studies to demonstrate practical appli-
cations of the proposed techniques. The R statistical software has been found
convenient to perform the analyses throughout this book. It is a free language and
environment for statistical computing and graphics. In addition to our own R code,
we have benefited from many R packages contributed by the members of the very
active community of R-users. The open-source statistical software R is freely
available from https://www.r-project.org/.
The technical requirements to understand the material are kept at a reasonable
level so that this text is meant for a broad readership. We refrain from proving all
results but rather favor an intuitive approach with supportive numerical illustrations,
providing the reader with relevant references where all justifications can be found,
as well as more advanced material. These references are gathered in a dedicated
section at the end of each chapter.
The three authors are professors of actuarial mathematics at the universities of
Brussels and Louvain-la-Neuve, Belgium. Together, they accumulate decades of
teaching experience related to the topics treated in the three books, in Belgium and
throughout Europe and Canada. They are also scientific directors at Detralytics, a
consulting office based in Brussels.
Within Detralytics as well as on behalf of actuarial associations, the authors have
had the opportunity to teach the material contained in the three volumes of
“Effective Statistical Learning Methods for Actuaries” to various audiences of
practitioners. The feedback received from the participants in these short courses
greatly helped to improve the exposition of the topic. Throughout their contacts
with the industry, the authors also implemented these techniques in a variety of
consulting and R&D projects. This makes the three volumes of “Effective Statistical
Learning Methods for Actuaries” the ideal support for teaching students and CPD
events for professionals.

Louvain-la-Neuve, Belgium Michel Denuit


Louvain-la-Neuve, Belgium Donatien Hainaut
Brussels, Belgium Julien Trufin
September 2020
Contents

1 Introduction
  1.1 The Risk Classification Problem
    1.1.1 Insurance Risk Diversification
    1.1.2 Why Classifying Risks?
    1.1.3 The Need for Regression Models
    1.1.4 Observable Versus Hidden Risk Factors
    1.1.5 Insurance Ratemaking Versus Loss Prediction
  1.2 Insurance Data
    1.2.1 Claim Data
    1.2.2 Frequency-Severity Decomposition
    1.2.3 Observational Data
    1.2.4 Format of the Data
    1.2.5 Data Quality Issues
  1.3 Exponential Dispersion (ED) Distributions
    1.3.1 Frequency and Severity Distributions
    1.3.2 From Normal to ED Distributions
    1.3.3 Some ED Distributions
    1.3.4 Mean and Variance
    1.3.5 Weights
    1.3.6 Exposure-to-Risk
  1.4 Maximum Likelihood Estimation
    1.4.1 Likelihood-Based Statistical Inference
    1.4.2 Maximum-Likelihood Estimator
    1.4.3 Derivation of the Maximum-Likelihood Estimate
    1.4.4 Properties of the Maximum-Likelihood Estimators
    1.4.5 Examples
  1.5 Deviance
  1.6 Actuarial Pricing and Tree-Based Methods
  1.7 Bibliographic Notes and Further Reading
  References

2 Performance Evaluation
  2.1 Introduction
  2.2 Generalization Error
    2.2.1 Definition
    2.2.2 Loss Function
    2.2.3 Estimates
    2.2.4 Decomposition
  2.3 Expected Generalization Error
    2.3.1 Squared Error Loss
    2.3.2 Poisson Deviance Loss
    2.3.3 Gamma Deviance Loss
    2.3.4 Bias and Variance
  2.4 (Expected) Generalization Error for Randomized Training Procedures
  2.5 Bibliographic Notes and Further Reading
  References

3 Regression Trees
  3.1 Introduction
  3.2 Binary Regression Trees
    3.2.1 Selection of the Splits
    3.2.2 The Prediction in Each Terminal Node
    3.2.3 The Rule to Determine When a Node Is Terminal
    3.2.4 Examples
  3.3 Right Sized Trees
    3.3.1 Minimal Cost-Complexity Pruning
    3.3.2 Choice of the Best Pruned Tree
  3.4 Measure of Performance
  3.5 Relative Importance of Features
    3.5.1 Example 1
    3.5.2 Example 2
    3.5.3 Effect of Correlated Features
  3.6 Interactions
  3.7 Limitations of Trees
    3.7.1 Model Instability
    3.7.2 Lack of Smoothness
  3.8 Bibliographic Notes and Further Reading
  References

4 Bagging Trees and Random Forests
  4.1 Introduction
  4.2 Bootstrap
  4.3 Bagging Trees
    4.3.1 Bias
    4.3.2 Variance
    4.3.3 Expected Generalization Error
  4.4 Random Forests
  4.5 Out-of-Bag Estimate
  4.6 Interpretability
    4.6.1 Relative Importances
    4.6.2 Partial Dependence Plots
  4.7 Example
  4.8 Bibliographic Notes and Further Reading
  References

5 Boosting Trees
  5.1 Introduction
  5.2 Forward Stagewise Additive Modeling
  5.3 Boosting Trees
    5.3.1 Algorithm
    5.3.2 Particular Cases
    5.3.3 Size of the Trees
  5.4 Gradient Boosting Trees
    5.4.1 Numerical Optimization
    5.4.2 Steepest Descent
    5.4.3 Algorithm
    5.4.4 Particular Cases
  5.5 Boosting Versus Gradient Boosting
  5.6 Regularization and Randomness
    5.6.1 Shrinkage
    5.6.2 Randomness
  5.7 Interpretability
    5.7.1 Relative Importances
    5.7.2 Partial Dependence Plots
    5.7.3 Friedman’s H-Statistics
  5.8 Example
  5.9 Bibliographic Notes and Further Reading
  References

6 Other Measures for Model Comparison
  6.1 Introduction
  6.2 Measures of Association
    6.2.1 Context
    6.2.2 Probability of Concordance
    6.2.3 Kendall’s Tau
    6.2.4 Spearman’s Rho
    6.2.5 Numerical Example
  6.3 Measuring Lift
    6.3.1 Motivation
    6.3.2 Predictors Characteristics
    6.3.3 Convex Order
    6.3.4 Concentration Curve
    6.3.5 Assessing the Performances of a Given Predictor
    6.3.6 Comparison of the Performances of Two Predictors
    6.3.7 Ordered Lorenz Curve
    6.3.8 Numerical Illustration
    6.3.9 Case Study
  6.4 Bibliographic Notes and Further Reading
  References
Chapter 1
Introduction

1.1 The Risk Classification Problem

1.1.1 Insurance Risk Diversification

Insurance companies cover risks (that is, random financial losses) by collecting pre-
miums. Premiums are generally paid in advance (hence their name). The pure pre-
mium is the amount collected by the insurance company, to be re-distributed as
benefits among policyholders and third parties in execution of the contract, without
loss nor profit. Under the conditions of validity of the law of large numbers, the
pure premium is the expected amount of compensation to be paid by the insurer
(sometimes discounted to policy issue in case of long-term liabilities).
The pure premiums are just re-distributed among policyholders to pay for their
respective claims, without loss nor profit on average. Hence, they cannot be consid-
ered as insurance prices because loadings must be added to face operating costs, in
order to ensure solvency, to cover general expenses, to pay commissions to interme-
diaries, to generate profit for stockholders, not to mention the taxes imposed by the
local authorities.

1.1.2 Why Classifying Risks?

In practice, most portfolios are heterogeneous: they mix individuals with different
risk levels. Some policyholders tend to report claims more often or to report more
expensive claims, on average. In a heterogeneous portfolio with a uniform price
list, the financial result of the insurance company depends on the composition of the
portfolio.
The modification in the composition of the portfolio may generate losses for the
insurer charging a uniform premium to different risk profiles, when competitors
distinguish premiums according to these profiles. Policyholders who are over-priced
by the insurance tariff leave the insurer to enjoy premium discounts offered by the
competitors whereas those who appear to have been under-priced remain with the
insurer. This change in the portfolio composition generates systematic losses for the
insurer applying the uniform tariff. This phenomenon is known as adverse selection,
as policyholders are supposed to select the insurance provider offering them the best
premium.
This (partly) explains why so many factors are used by insurance companies:
insurance companies have to use a rating structure matching the premiums to the
risks as closely as the rating structures of their competitors do. If they do not, they become
exposed to the risk of losing the policyholders who are currently over-priced accord-
ing to their tariff, breaking the equilibrium between expected losses and collected
premiums. This is one of the reasons why the technical price list must be as accurate
as possible: it is only in this way that the insurer is able to manage its portfolio
effectively, by knowing which profiles are over-priced and which ones subsidize the
others. In other words, the insurer knows the value of each policy in the portfolio.

1.1.3 The Need for Regression Models

Considering that risk profiles differ inside insurance portfolios, it theoretically suf-
fices to subdivide the entire portfolio into homogeneous risk classes, i.e. groups of
policyholders sharing the same risk factors, and to determine an amount of pure pre-
mium specific to each risk class. However, if the data are subdivided into risk classes
determined by many factors, actuaries often deal with sparsely populated groups
of contracts. Therefore, simple averages become useless and regression models are
needed.
Regression models predict a response variable from a function of risk factors and
parameters. This approach is also referred to as supervised learning. By connecting
the different risk profiles, a regression analysis can deal with highly segmented
problems resulting from the massive amount of information about the policyholders
that has now become available to the insurers.

1.1.4 Observable Versus Hidden Risk Factors

Some risk factors can easily be observed, such as the policyholder’s age, gender,
marital status or occupation, the type and use of the car, or the place of residence
for instance. Others can be observed, but only at some effort or cost. This
is typically the case with behavioral characteristics reflected in telematics data or
information gathered in external databases that can be accessed by the insurer for a
fee paid to the provider. But besides these observable factors, there always remain
risk factors unknown to the insurer. In motor insurance for instance, these hidden
risk factors typically include temper and skills, aggressiveness behind the wheel,
respect of the highway code or swiftness of reflexes (even if telematics data now
help insurers to figure out these behavioral traits, but only after contract inception).
Henceforth, we denote as X the random vector gathering the observable risk
factors used by the insurer. Notice that those risk factors are not necessarily in causal
relationship with the response Y. As a consequence, some components of X could
become irrelevant if the hidden risk factors influencing the risk (in addition to X) in
causal relationship with the response Y, denoted X^+, were available.

1.1.5 Insurance Ratemaking Versus Loss Prediction

Consider a response Y and a set of features X_1, . . . , X_p gathered in the vector X.
Features are considered here as random variables so that they are denoted by capital
letters. This means that we are working with a generic policyholder, taken at random
from (and thus representative of) the portfolio under consideration. When it comes
to pricing a specific contract, we work conditionally to the realized value of X, that
is, given X = x. The set of all possible features is called the feature space and is
denoted $\mathcal{X}$.
The dependence structure inside the random vector (Y, X 1 , . . . , X p ) is exploited
to extract the information contained in X about Y . In actuarial pricing, the aim is to
evaluate the pure premium as accurately as possible. This means that the target is the
conditional expectation μ(X) = E[Y |X] of the response Y (claim number or claim
amount) given the available information X. Henceforth, μ(X) is referred to as the
true (pure) premium.
Notice that the function x → μ(x) = E[Y |X = x] is generally unknown to the
actuary, and may exhibit a complex behavior in x. This is why this function is
approximated by a (working, or actual) premium x → μ̂(x) with a relatively simple
structure compared to the unknown regression function x → μ(x).

1.2 Insurance Data

1.2.1 Claim Data

Because of adverse selection, most actuarial studies are based on insurance-specific
data, generally consisting of claims data. Dealing with claim data means that only lim-
ited information is available about events that actually occurred. Analyzing insurance
data, the actuary draws conclusions about the number of claims filed by policyholders
subject to a specific ratemaking mechanism (bonus-malus rules or deductibles, for
instance), not about the actual number of accidents. The conclusions of the actuarial
analysis are valid only if the existing rules are kept unchanged. The effect of an
extension of coverage (decreasing the amount of deductibles, for instance) is
extremely difficult to assess.
Also, some policyholders may not report their claims immediately, for various
reasons (for instance because they were not aware of the occurrence of the insured
event), impacting on the available data. Because of late reporting, the observed num-
ber of claims may be smaller than the actual number of claims for recent observation
periods. Once reported, claims require some time to be settled. This is especially the
case in tort systems, for liability claims. This means that it may take years before the
final claim cost is known to the insurer.
The information recorded in the database generally gathers one or several calendar
years. The data are as seen from the date of extraction (6 months after the end of
the observation period, say). Hence, most of the “small” claims are settled and their
final cost is known. However, for the large claims, actuaries can only work with
incurred losses (payments made plus reserve, the latter representing a forecast of the
final cost still to be paid according to the evaluation made by the claim manager).
Incurred losses are routinely used in practice but a better approach would consist in
recognizing that actuaries only have partial knowledge about the claim amount.

1.2.2 Frequency-Severity Decomposition

1.2.2.1 Claim Numbers

Even if the actuary wants to model the total claim amount Y generated by a policy of
the portfolio over one period (typically, one year), this random variable is generally
not the modeling target. Indeed, modeling Y does not allow to study the effect of
per-claim deductibles nor bonus-malus rules, for instance. Rather, the total claim
amount Y is decomposed into
$$Y = \sum_{k=1}^{N} C_k$$

where

N = number of claims
C_k = cost (or severity) of the kth claim
C_1, C_2, . . . identically distributed

all these random variables being independent. By convention, the empty sum is zero,
that is,
N = 0 ⇒ Y = 0.
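To fix ideas, the decomposition can be illustrated with a minimal R sketch (all parameter values below are purely illustrative): the number of claims N is drawn from a Poisson distribution and the severities C_k from a Gamma distribution, two models discussed later in this chapter.

# Illustrative frequency-severity simulation (hypothetical parameters)
set.seed(1)
lambda <- 0.1    # expected annual claim frequency
alpha  <- 2      # Gamma shape for claim severities
tau    <- 0.002  # Gamma rate, so that the mean severity is alpha / tau = 1000

simulate_Y <- function() {
  N <- rpois(1, lambda)                        # number of claims
  if (N == 0) return(0)                        # empty sum convention: N = 0 => Y = 0
  sum(rgamma(N, shape = alpha, rate = tau))    # Y = C_1 + ... + C_N
}

Y <- replicate(10000, simulate_Y())
mean(Y)   # close to lambda * alpha / tau = 100, the pure premium of this fictitious policy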
The frequency component of Y refers to the number N of claims filed by each
policyholder during one period. Considering the number of claims reported by a
policyholder in Property and Casualty insurance, the Poisson model is often used as
a starting point.
Generally, the different components of the yearly insurance losses Y are modeled
separately. Costs may be of different magnitudes, depending on the type of the claim:
standard, or attritional claims, with moderate costs versus large claims with much
higher costs. If large claims may occur then the mix of these two types of claims is
explicitly recognized by

$$C_k = \begin{cases} \text{large claim cost,} & \text{with probability } p,\\ \text{attritional claim cost,} & \text{with probability } 1-p. \end{cases}$$
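The following R sketch shows one possible way to draw claim costs from such a mixture; the attritional and large-claim distributions, as well as the probability p, are hypothetical choices made for illustration only.

# Hypothetical two-component severity: attritional versus large claims
set.seed(1)
p <- 0.01                                    # probability that a claim is a large one

r_severity <- function(n) {
  large <- runif(n) < p
  cost  <- numeric(n)
  cost[!large] <- rgamma(sum(!large), shape = 2, rate = 0.002)   # attritional claims
  cost[large]  <- rlnorm(sum(large), meanlog = 12, sdlog = 1)    # large claims
  cost
}

summary(r_severity(10000))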

1.2.2.2 Claim Amounts

Having individual costs for each claim, the actuary often wishes to model their respec-
tive amounts (also called claim sizes or claim severities in the actuarial literature).
Prior to the analysis, the actuary first needs to exclude possible large claims, keeping
only the standard, or attritional ones.
Overall, the modeling of claim amounts is more difficult than claim frequencies.
There are several reasons for that. First and foremost, claims sometimes need several
years to be settled as explained before. Only estimates of the final cost appear in
the insurer’s records until the claim is closed. Moreover, the statistics available to fit
a model for claim severities are much more scarce, since generally only 5–10% of
the policies in the portfolio produced claims. Finally, the unexplained heterogeneity
is sometimes more pronounced for costs than for frequencies. The cost of a traffic
accident for instance is indeed for the most part beyond the control of a policyholder
since the payments of the insurance company are determined by third-party charac-
teristics. The degree of care exercised by a driver mostly influences the number of
accidents, but in a much lesser way the cost of these accidents.

1.2.3 Observational Data

Statistical analyzes are conducted with data either from experimental or from obser-
vational studies. In the former case, random assignment of individual units (humans
or animals, for instance) to the experimental treatments plays a fundamental role to
draw conclusions about causal relationships (to demonstrate the usefulness of a new
drug, for instance). This is however not the case with insurance data, which consist
of observations recorded on past contracts issued by the insurer.
As an example, let us consider motor insurance. The policyholders covered by a
given insurance company are generally not a random sample from the entire pop-
ulation of drivers in the country. Each company targets a specific segment of this
population (with advertisement campaigns or specific product design, for instance)
and attracts particular profiles. This may be due to consumers’ perception of insurer’s
products, sales channels (brokers, agents or direct), not to mention the selection oper-
ated by the insurer, screening the applicants before accepting to cover their risks.
In insurance studies, we consider that the portfolio is representative of future
policyholders, those who will stay insured by the company or later join the portfolio.
The assumption that new policyholders conform with the profiles already in the
portfolio needs to be carefully assessed as any change in coverage conditions or in
competitors’ price lists may attract new profiles with different risk levels (even though
they are identical with respect to X, they may differ in X^+, due to adverse selection
against the insurer).
The actuary has always to keep in mind the important difference existing between
causal relationships and mere correlations existing among the risk factors and the
number of claims or their severity. Such correlations may have been produced by a
causal relationship, but could also result from confounding effects. Therefore, the
actuary has always to keep in mind that it is generally not possible to disentangle
• a true effect of a risk factor
• from an apparent effect resulting from correlation with hidden characteristics
on the basis of observational data. Also, the effect estimated from portfolio statistics
is the dominant one: different stories may apply to different policyholders whereas
they are all averaged in the estimates obtained by the actuary.
Notice that correlation with hidden risk factors may even reverse the influence of
an available risk factor on the response. This is the case for instance when the feature
is negatively correlated with the response but positively correlated with a hidden
characteristic, the latter being positively related to the response. The actuary may
then observe a positive relationship between this feature and the response, even though
the true correlation is negative.

1.2.4 Format of the Data

The data required to perform analyses carried out in this book generally consist of
linked policy and claims information at the individual risk level. The appropriate
definition of individual risk level varies according to the line of business and the type
of study. For instance, an individual risk generally corresponds to a vehicle in motor
insurance or to a building in fire insurance.
The database must contain one record for each period of time during which a
policy was exposed to the risk of filing a claim, and during which all risk factors
remained unchanged. A new record must be created each time risk factors change,
with the previous exposure curtailed at the point of amendment. The policy number
then allows the actuary to track the experience of the individual risks over time. Policy
cancellations and new business also result in the exposure period to be curtailed. For
each record, the database registers policy characteristics together with the number of
claims and the total incurred losses. In addition to this policy file, there is a claim file
recording all the information about each claim, separately (the link between the two
files being made using the policy number). This second file also contains specific
features about each claim, such as the presence of bodily injuries, the number of
victims, and so on. This second file is interesting to build predictive models for the
cost of claims based on the information about the circumstances of each insured
event. This allows the insurer to better assess incurred losses.
The information available to perform risk classification is summarized into a set
of features x_ij, j = 1, . . . , p, available for each policy i. These features may have
different formats:
• categorical (such as gender, with two levels, male and female);
• integer-valued, or discrete (such as the number of vehicles in the household);
• continuous (such as policyholder’s age).
Categorical covariates may be ordered (when the levels can be ordered in a mean-
ingful way, such as education level) or not (when the levels cannot be ranked, think
for instance of marital status, with levels single, married, cohabiting, divorced, or
widow, say).
Notice that continuous features are generally available to a finite precision so that
they are actually discrete variables with a large number of numerical values.
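As an illustration, a toy policy file with this structure can be set up in R as follows (all variable names and values are hypothetical):

# Toy policy file: one record per period with unchanged risk factors
policies <- data.frame(
  policy_id  = c(1, 1, 2, 3),
  exposure   = c(0.50, 0.50, 1.00, 0.25),                      # exposure-to-risk, in years
  gender     = factor(c("male", "male", "female", "male")),    # categorical feature
  n_vehicles = c(1, 1, 2, 1),                                  # discrete feature
  age        = c(43, 44, 36, 59),                              # continuous feature
  n_claims   = c(0, 1, 0, 0),                                  # number of claims in the period
  incurred   = c(0, 1250, 0, 0)                                # total incurred losses
)
str(policies)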

1.2.5 Data Quality Issues

As in most actuarial textbooks, we assume here that the available data are reliable
and accurate. This assumption hides a time-consuming step in every actuarial study,
during which data are gathered, checked for consistency, cleaned if needed and some-
times connected to external data bases to increase the volume of information. Setting
up the database often takes the most time and does not look very rewarding. Data
preparation is however of crucial importance because, as the saying goes, “garbage
in, garbage out”: there is no hope to get a reliable technical price list from a database
suffering many limitations.
Once data have been gathered, it is important to spend enough time on exploratory
data analysis. This part of the analysis aims at discovering which features seem
to influence the response, as well as subsets of strongly correlated features. This
traditional, seemingly old-fashioned view may well conflict with the modern data
science approach, where practitioners are sometimes tempted to put all the features
in a black-box model without taking the time to even know what they mean. But we
firmly believe that such a blind strategy can sometimes lead to disastrous conclusions
in insurance pricing so that we strongly advise to dedicate enough time to discover
the kind of information recorded in the database under study.
1.3 Exponential Dispersion (ED) Distributions

1.3.1 Frequency and Severity Distributions

Regression models aim to analyze the relationship between a variable whose outcome
needs to be predicted and one or more potential explanatory variables. The variable
of interest is called the response and is denoted as Y . Insurance analysts typically
encounter non-Normal responses such as the number of claims or the claim severities.
Actuaries then often select the distribution of the response from the exponential
dispersion (or ED) family.
Claim numbers are modeled by means of non-negative integer-valued random
variables (often called counting random variables). Such random variables are
described by their probability mass function: given a counting random variable Y
valued in the set {0, 1, 2, . . .} of non-negative integers, its probability mass function
pY is defined as
y → pY (y) = P[Y = y], y = 0, 1, 2, . . .

and we set pY to zero otherwise. The support S of Y is defined as the set of all values
y such that pY (y) > 0. Expectation and variance are then respectively given by

$$\mathrm{E}[Y] = \sum_{y=0}^{\infty} y\, p_Y(y) \qquad \text{and} \qquad \mathrm{Var}[Y] = \sum_{y=0}^{\infty} \bigl(y - \mathrm{E}[Y]\bigr)^2 p_Y(y).$$

Claim amounts are modeled by non-negative continuous random variables pos-
sessing a probability density function. Precisely, the probability density function f_Y
of such a random variable Y is defined as

$$y \mapsto f_Y(y) = \frac{\mathrm{d}}{\mathrm{d}y}\, \mathrm{P}[Y \le y], \qquad y \in (-\infty, \infty).$$

In this case,

$$\mathrm{P}[Y \approx y] = \mathrm{P}\left[y - \frac{\epsilon}{2} \le Y \le y + \frac{\epsilon}{2}\right] \approx \epsilon\, f_Y(y)$$

for sufficiently small ε > 0, so that f_Y also indicates the region where Y is most
likely to fall. In particular, f Y = 0 where Y cannot assume its values. The support S
of Y is then defined as the set of all values y such that f Y (y) > 0. Expectation and
variance are then respectively given by
$$\mathrm{E}[Y] = \int_{-\infty}^{\infty} y\, f_Y(y)\,\mathrm{d}y \qquad \text{and} \qquad \mathrm{Var}[Y] = \int_{-\infty}^{\infty} \bigl(y - \mathrm{E}[Y]\bigr)^2 f_Y(y)\,\mathrm{d}y.$$
1.3.2 From Normal to ED Distributions

The oldest distribution for errors in a regression setting is certainly the Normal
distribution, also called Gaussian, or Gauss–Laplace distribution after its inventors.
The family of ED distributions in fact extends the nice structure of this probability
law to more general errors.

1.3.2.1 Normal Distribution

Recall that a response Y valued in S = (−∞, ∞) is Normally distributed with param-
eters μ ∈ (−∞, ∞) and σ² > 0, denoted as Y ∼ Nor(μ, σ²), if its probability den-
sity function f_Y is

$$f_Y(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}(y-\mu)^2\right), \qquad y \in (-\infty, \infty). \tag{1.3.1}$$

Considering (1.3.1), we see that Normally distributed responses can take any real
value, positive or negative as f Y > 0 over the whole real line (−∞, ∞).
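The bell-shaped curves shown in Fig. 1.1 below are easily reproduced in R; the sketch uses the same mean and standard deviations as the figure.

# Normal densities with mean 10 and standard deviations 3, 5 and 10
y <- seq(-20, 40, length.out = 500)
plot(y, dnorm(y, mean = 10, sd = 3), type = "l", lty = 1, xlab = "y", ylab = "f(y)")
lines(y, dnorm(y, mean = 10, sd = 5), lty = 2)    # broken line
lines(y, dnorm(y, mean = 10, sd = 10), lty = 3)   # dotted line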
Figure 1.1 displays the probability density function (1.3.1) for different parameter
values. The Nor (μ, σ 2 ) probability density function appears to be a symmetric bell-
shaped curve centered at μ, with σ 2 controlling the spread of the distribution. The
probability density function f Y being symmetric with respect to μ, positive or nega-
tive deviations from the mean μ have the same probability to occur. To be effective,
any analysis based on the Normal distribution requires that the probability density
function of the data has a shape similar to one of those visible in Fig. 1.1, which is
rarely the case in insurance applications.
[Fig. 1.1: Probability density functions of Nor(10, 3²) in continuous line, Nor(10, 5²) in broken line, and Nor(10, 10²) in dotted line]

Notice that the Normal distribution enjoys the convenient convolution stability
property, meaning that the sum of independent, Normally distributed random vari-
ables remains Normally distributed.

1.3.2.2 ED Distributions

The Nor(μ, σ²) probability density function can be rewritten in order to be extended
to a larger class of probability distributions sharing some convenient properties: the
ED family. The idea is as follows. The parameter of interest in insurance pricing is the
mean μ involved in pure premium calculations. This is why we isolate components of
the Normal probability density function where μ appears. This is done by expanding
the square appearing inside the exponential function in (1.3.1), which gives



$$f_Y(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}\bigl(y^2 - 2y\mu + \mu^2\bigr)\right) = \exp\left(\frac{y\mu - \frac{\mu^2}{2}}{\sigma^2}\right) \frac{\exp\left(-\frac{y^2}{2\sigma^2}\right)}{\sigma\sqrt{2\pi}}. \tag{1.3.2}$$

The second factor appearing in (1.3.2) does not involve μ so that the important
component is the first one. We see that it has a very simple form, being the exponential
(hence the vocable “exponential” in ED) of a ratio with the variance σ 2 , i.e. the
dispersion parameter, appearing in the denominator. The numerator appears to be
the difference between the product of the response y and the canonical Normal mean
parameter μ with a function of μ, only. Notice that the derivative of this second term
μ²/2
is just the mean μ. Such a decomposition allows us to define the whole ED class
of distributions as follows.
Definition 1.3.1 Consider a response Y valued in a subset S of the real line
(−∞, ∞). Its distribution is said to belong to the ED family if Y obeys a proba-
bility mass function pY or a probability density function f Y of the form


$$p_Y(y) \;\text{or}\; f_Y(y) = \exp\left(\frac{y\theta - a(\theta)}{\phi/\nu}\right) c(y, \phi/\nu), \qquad y \in S, \tag{1.3.3}$$

where
θ = real-valued location parameter, called the canonical parameter


φ = positive scale parameter, called the dispersion parameter
ν = known positive constant, called the weight
a(·) = monotonic convex function of θ, called the cumulant function
c(·) = positive normalizing function.

In the majority of actuarial applications, the weight corresponds to some volume
measure, hence the notation ν.
The parameters θ and φ are essentially location and scale indicators, extending the
mean value μ and variance σ 2 to the whole family of ED distributions. Considering
(1.3.2), we see that it is indeed of the form (1.3.3) with

$$\theta = \mu, \qquad a(\theta) = \frac{\mu^2}{2} = \frac{\theta^2}{2}, \qquad \phi = \sigma^2, \qquad \nu = 1, \qquad c(y, \phi) = \frac{\exp\left(-\frac{y^2}{2\sigma^2}\right)}{\sigma\sqrt{2\pi}}.$$

Remark 1.3.2 Sometimes, (1.3.3) is replaced with the more general form


$$\exp\left(\frac{y\theta - a(\theta)}{b(\phi, \nu)}\right) c(y, \phi, \nu).$$

However, the particular case where φ and ν are combined into φ/ν, i.e.


$$b(\phi, \nu) = \frac{\phi}{\nu} \qquad \text{and} \qquad c(y, \phi, \nu) = c\left(y, \frac{\phi}{\nu}\right)$$

appears to be enough for actuarial applications.

1.3.3 Some ED Distributions

The ED family is convenient for non-life insurance ratemaking. In particular, we
show in the following that the Poisson and Gamma distributions, often used by
actuaries for modeling claim counts and claim severities, belong to this family. A
detailed review of the ED family can be found in Denuit et al. (2019). Thereafter, we
only describe in detail the Poisson and Gamma distributions.

1.3.3.1 Poisson Distribution

A Poisson-distributed response Y takes its values in S = {0, 1, 2, . . .} and has probability mass function

$$p_Y(y) = \exp(-\lambda)\,\frac{\lambda^y}{y!}, \qquad y = 0, 1, 2, \ldots. \tag{1.3.4}$$

Having a counting random variable Y, we denote as Y ∼ Poi(λ) the fact that Y is
Poisson distributed with parameter λ. The parameter λ is often called the rate, in
relation to the Poisson process (see below).
The mean and variance of Y ∼ Poi(λ) are given by

E[Y ] = λ and Var[Y ] = λ. (1.3.5)

Considering (1.3.5), we see that both the mean and variance of the Poisson distribu-
tion are equal to λ, a phenomenon termed as equidispersion. The skewness coefficient
of the Poisson distribution is
$$\gamma[Y] = \frac{1}{\sqrt{\lambda}}.$$

As λ increases, the Poisson distribution thus becomes more symmetric and is even-
tually well approximated by a Normal distribution, the approximation turning out
to be quite good for λ > 20. But if Y ∼ Poi(λ) then √Y converges much faster to
the Nor(√λ, 1/4) distribution. Hence, the square root transformation was often recom-
mended as a variance stabilizing transformation for count data at a time when classical
methods assuming Normality (and constant variance) were employed.
The shape of the Poisson probability mass function is displayed in the graphs of
Fig. 1.2. For small values of λ, we see that the Poi(λ) probability mass function is
highly asymmetric. When λ increases, it becomes more symmetric and ultimately
looks like the Normal bell curve.
The Poisson distribution enjoys the convenient convolution stability property, i.e.

$$\left.\begin{array}{l} Y_1 \sim \mathrm{Poi}(\lambda_1) \\ Y_2 \sim \mathrm{Poi}(\lambda_2) \\ Y_1 \text{ and } Y_2 \text{ independent} \end{array}\right\} \;\Rightarrow\; Y_1 + Y_2 \sim \mathrm{Poi}(\lambda_1 + \lambda_2). \tag{1.3.6}$$

This property is useful because sometimes the actuary has only access to aggregated
data. Assuming that individual data is Poisson distributed, then so is the summed
count and Poisson modeling still applies.
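Property (1.3.6) is easy to check numerically in R; the rates below are arbitrary and serve only as an illustration.

# Convolution check: the sum of independent Poisson counts is again Poisson
lambda1 <- 0.7; lambda2 <- 1.3
k <- 0:10
conv <- sapply(k, function(s) sum(dpois(0:s, lambda1) * dpois(s:0, lambda2)))
max(abs(conv - dpois(k, lambda1 + lambda2)))   # numerically zero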
[Fig. 1.2: Probability mass functions of Poi(λ) with λ = 0.05, 0.5, 5, 10 (from upper left to lower right)]

In order to establish that the Poisson distribution belongs to the ED family, let us
write the Poi(λ) probability mass function (1.3.4) as follows:

$$p_Y(y) = \exp(-\lambda)\,\frac{\lambda^y}{y!} = \exp\bigl(y \ln \lambda - \lambda\bigr)\,\frac{1}{y!}$$

where we recognize the probability mass function (1.3.3) with


$$\theta = \ln \lambda, \qquad a(\theta) = \lambda = \exp(\theta), \qquad \phi = 1, \qquad \nu = 1, \qquad c(y, \phi) = \frac{1}{y!}.$$

Thus, the Poisson distribution indeed belongs to the ED family.

1.3.3.2 Gamma Distribution

The Gamma distribution is right-skewed, with a sharp peak and a long tail to the right.
These characteristics are often visible on empirical distributions of claim amounts.
This makes the Gamma distribution a natural candidate for modeling accident benefits
paid by the insurer.
Precisely, a random variable Y valued in S = (0, ∞) is distributed according to
the Gamma distribution with parameters α > 0 and τ > 0, which will henceforth be
denoted as Y ∼ Gam(α, τ ), if its probability density function is given by

$$f_Y(y) = \frac{y^{\alpha-1} \tau^{\alpha} \exp(-\tau y)}{\Gamma(\alpha)}, \qquad y > 0, \tag{1.3.7}$$

where

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\, \mathrm{d}x.$$

The parameter α is often called the shape of the Gamma distribution whereas τ is
referred to as the scale parameter.
The mean and the variance of Y ∼ Gam(α, τ ) are respectively given by

$$\mathrm{E}[Y] = \frac{\alpha}{\tau} \qquad \text{and} \qquad \mathrm{Var}[Y] = \frac{\alpha}{\tau^2} = \frac{1}{\alpha}\bigl(\mathrm{E}[Y]\bigr)^2. \tag{1.3.8}$$
We thus see that the variance is a quadratic function of the mean. The Gamma
distribution is useful for modeling a positive, continuous response when the variance
grows with the mean but where the coefficient of variation

$$\mathrm{CV}[Y] = \frac{\sqrt{\mathrm{Var}[Y]}}{\mathrm{E}[Y]} = \frac{1}{\sqrt{\alpha}}$$

stays constant. As their names suggest, the scale parameter in the Gamma family
influences the spread (and incidentally, the location) but not the shape of the dis-
tribution, while the shape parameter controls the skewness of the distribution. For
Y ∼ Gam(α, τ ), we have
$$\gamma[Y] = \frac{2}{\sqrt{\alpha}}$$

so that the Gamma distribution is positively skewed. As the shape parameter gets
larger, the distribution grows more symmetric.

[Fig. 1.3: Probability density functions of Gam(α, τ) with α = τ ∈ {1, 2, 4}]
Figure 1.3 displays Gamma probability density functions for different parameter
values. Here, we fix the mean μ = α/τ to 1 and we take τ ∈ {1, 2, 4} so that the variance
is equal to 1, 0.5, and 0.25. Unlike the Normal distribution (whose probability density
function resembles a bell shape centered at μ whatever the variance σ 2 ), the shape of
the Gamma probability density function changes with the parameter α. For α ≤ 1,
the probability density function has a maximum at the origin whereas for α > 1 it is
unimodal but skewed. The skewness decreases as α increases.
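The curves of Fig. 1.3 can be reproduced with a few lines of R; as in the figure, the mean is kept equal to 1 while α = τ varies.

# Gamma densities with unit mean: shape = rate = 1, 2, 4
y <- seq(0.001, 5, length.out = 500)
plot(y, dgamma(y, shape = 1, rate = 1), type = "l", lty = 1,
     xlab = "y", ylab = "f(y)", ylim = c(0, 1))
lines(y, dgamma(y, shape = 2, rate = 2), lty = 2)
lines(y, dgamma(y, shape = 4, rate = 4), lty = 3)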
Gamma distributions enjoy the convenient convolution stability property for fixed
scale parameter τ . Specifically,

$$\left.\begin{array}{l} Y_1 \sim \mathrm{Gam}(\alpha_1, \tau) \\ Y_2 \sim \mathrm{Gam}(\alpha_2, \tau) \\ Y_1 \text{ and } Y_2 \text{ independent} \end{array}\right\} \;\Rightarrow\; Y_1 + Y_2 \sim \mathrm{Gam}(\alpha_1 + \alpha_2, \tau). \tag{1.3.9}$$

When α = 1, the Gamma distribution reduces to the Negative Exponential one
with probability density function

$$f_Y(y) = \tau \exp(-\tau y), \qquad y > 0.$$


In this case, we write Y ∼ Ex p(τ ). This distribution enjoys the remarkable memo-
ryless property:

$$\mathrm{P}[Y > s + t \mid Y > s] = \frac{\exp(-\tau(s+t))}{\exp(-\tau s)} = \exp(-\tau t) = \mathrm{P}[Y > t].$$

When α is a positive integer, the corresponding Gamma distribution function is
then given by

$$F(y) = 1 - \exp(-y\tau) \sum_{j=0}^{\alpha-1} \frac{(y\tau)^j}{j!}, \qquad y \ge 0.$$

This particular case is referred to as the Erlang distribution. The Erlang distribution
corresponds to the distribution of a sum

$$Y = Z_1 + Z_2 + \cdots + Z_\alpha$$

of independent random variables Z_1, Z_2, . . . , Z_α with common Exp(τ) distribution,
by virtue of (1.3.9). Hence, when α = 1 the Erlang distribution reduces to the Neg-
ative Exponential one.
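A quick simulation in R (with arbitrary parameters) illustrates this representation of the Erlang distribution as a sum of independent Negative Exponential variables.

# Sum of alpha independent Exp(tau) variables compared with the Gam(alpha, tau) distribution
set.seed(1)
alpha <- 3; tau <- 2
sums <- replicate(10000, sum(rexp(alpha, rate = tau)))
ks.test(sums, "pgamma", shape = alpha, rate = tau)   # no evidence against the Gamma fit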
Let us now establish that the Gamma distribution belongs to the ED family. To
this end, we let the mean parameter μ = α/τ enter the expression of the probability
density function and rewrite the Gam(α, τ ) probability density function (1.3.7) as
follows:
$$\begin{aligned}
f_Y(y) &= \frac{\tau^\alpha}{\Gamma(\alpha)}\, y^{\alpha-1} \exp(-\tau y)\\
&= \frac{\alpha^\alpha}{\Gamma(\alpha)}\, \mu^{-\alpha} y^{\alpha-1} \exp\left(-y\,\frac{\alpha}{\mu}\right) \qquad \text{with } \mu = \frac{\alpha}{\tau} \;\Leftrightarrow\; \tau = \frac{\alpha}{\mu}\\
&= \exp\left(\alpha\left(-\frac{y}{\mu} - \ln \mu\right)\right) \frac{\alpha^\alpha}{\Gamma(\alpha)}\, y^{\alpha-1}
\end{aligned}$$

which is indeed of the form (1.3.3) with

$$\theta = -\frac{1}{\mu}, \qquad a(\theta) = \ln \mu = -\ln(-\theta), \qquad \phi = \frac{1}{\alpha}, \qquad c(y, \phi) = \frac{\alpha^\alpha}{\Gamma(\alpha)}\, y^{\alpha-1}.$$

The Gamma distribution thus belongs to the ED family.


Table 1.1 Examples of ED distributions (with ν = 1)

Distribution    θ                a(θ)               φ      μ = E[Y]     Var[Y]
Ber(q)          ln(q/(1−q))      ln(1+exp(θ))       1      q            μ(1−μ)
Bin(m, q)       ln(q/(1−q))      m ln(1+exp(θ))     1      mq           μ(1−μ/m)
Geo(q)          ln(1−q)          −ln(1−exp(θ))      1      (1−q)/q      μ(1+μ)
Pas(m, q)       ln(1−q)          −m ln(1−exp(θ))    1      m(1−q)/q     μ(1+μ/m)
Poi(μ)          ln μ             exp(θ)             1      μ            μ
Nor(μ, σ²)      μ                θ²/2               σ²     μ            φ
Exp(μ)          −1/μ             −ln(−θ)            1      μ            μ²
Gam(μ, α)       −1/μ             −ln(−θ)            1/α    μ            φμ²
IGau(μ, α)      −1/(2μ²)         −√(−2θ)            1/α    μ            φμ³

1.3.3.3 Some Other ED Distributions

In addition to the Normal, Poisson and Gamma distributions, Table 1.1 gives an
overview of some other ED distributions that appear to be useful in the analysis of
insurance data, namely the Bernoulli distribution Ber (q) with parameter q ∈ (0, 1);
the Binomial distribution Bin(m, q) with parameters m ∈ N+ = {1, 2, . . .} and
q ∈ (0, 1); the Geometric distribution Geo(q) with parameter q ∈ (0, 1); the Pas-
cal distribution Pas(m, q) with parameters m ∈ N+ and q ∈ (0, 1); the Negative
Exponential distribution Ex p(μ) with parameter μ > 0 and the Inverse Gaussian dis-
tribution IGau(μ, α) with parameters μ > 0 and α > 0. For each distribution, we list
the canonical parameter θ, the cumulant function a(·), and the dispersion parameter
φ entering the general definition (1.3.3) of ED probability mass or probability density
function. We also give the first two moments.

1.3.4 Mean and Variance

In the Normal case, the derivative of a(θ) = θ²/2 is θ = μ so that we recover the
mean response from a′. This turns out to be a property generally valid for all ED
distributions.
Property 1.3.3 If the response Y has probability density/mass function of the form
(1.3.3) then
$$\mathrm{E}[Y] = a'(\theta).$$
The mean response Y then corresponds to the first derivative of the function a(·)
involved in (1.3.3). The next result shows that the variance is proportional to the
second derivative of a(·).

Property 1.3.4 If the response Y has probability density/mass function of the form
(1.3.3) then

$$\mathrm{Var}[Y] = \frac{\phi}{\nu}\, a''(\theta).$$
Notice that increasing the weight thus decreases the variance whereas the variance
increases linearly in the dispersion parameter φ. The impact of θ on the variance is
given by the factor
$$a''(\theta) = \frac{\mathrm{d}}{\mathrm{d}\theta}\, \mu(\theta)$$
expressing how a change in the canonical parameter θ modifies the expected response.
In the Normal case, a″(θ) = 1 and the variance is just constantly equal to φ/ν =
σ²/ν, not depending on θ. In this case, the mean response does not influence its
variance. For the other members of the ED family, a″ is not constant and a change
in θ modifies the variance.
An absence of relation between the mean and the variance is only possible for
real-valued responses (such as Normally distributed ones, where the variance σ 2 does
not depend on the mean μ). Indeed, if Y is non-negative (i.e. Y ≥ 0) then intuitively
the variance of Y tends to zero as the mean of Y tends to zero. That is, the variance
is a function of the mean for non-negative responses. The relationship between the
mean and variance of an ED distribution is indicated by the variance function V (·).
The variance function V (·) is formally defined as

$$V(\mu) = \frac{\mathrm{d}^2}{\mathrm{d}\theta^2}\, a(\theta) = \frac{\mathrm{d}}{\mathrm{d}\theta}\, \mu(\theta).$$
The variance function thus corresponds to the variation in the mean response μ(θ)
viewed as a function of the canonical parameter θ. In the Normal case, μ(θ) = θ and
V (μ) = 1. The other ED distributions have non-constant variance functions. Again,
we see that the cumulant function a(·) determines the distributional properties in the
ED family.
The variance of the response can thus be written as

$$\mathrm{Var}[Y] = \frac{\phi}{\nu}\, V(\mu).$$
It is important to keep in mind that the variance function is not the variance of the
response, but the function of the mean entering this variance (to be multiplied by
φ/ν). The variance function is regarded as a function of the mean μ, even if it appears
as a function of θ; this is possible by inverting the relationship between θ and μ as
we know from Property 1.3.3 that μ = E[Y] = a′(θ). The convexity of a(·) ensures
that the mean function a′ is increasing so that its inverse is well defined. Hence, we
can express the canonical parameter in terms of the mean response μ by the relation

$$\theta = (a')^{-1}(\mu).$$

Table 1.2 Variance functions for some selected ED distributions

Distribution    Variance function V(μ)
Ber(q)          μ(1−μ)
Geo(q)          μ(1+μ)
Poi(μ)          μ
Nor(μ, σ²)      1
Gam(μ, α)       μ²
IGau(μ, α)      μ³

The variance functions corresponding to the usual ED distributions are listed in
Table 1.2. In the Poisson case for instance,

$$V(\mu) = \frac{\mathrm{d}^2}{\mathrm{d}\theta^2}\, \exp(\theta) = \exp(\theta) = \mu$$

while in the Gamma case,

$$V(\mu) = \frac{\mathrm{d}^2}{\mathrm{d}\theta^2}\bigl(-\ln(-\theta)\bigr) = \frac{1}{\theta^2} = \mu^2.$$

Notice that


$$V(\mu) = \mu^{\xi} \qquad \text{with } \xi = \begin{cases} 0 & \text{for the Normal distribution}\\ 1 & \text{for the Poisson distribution}\\ 2 & \text{for the Gamma distribution}\\ 3 & \text{for the Inverse Gaussian distribution.} \end{cases}$$

These members of the ED family thus have power variance functions. The whole
family of ED distributions with power variance functions is referred to as the Tweedie
class.
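These power variance functions are easily checked by simulation; the following R sketch (with an arbitrary mean) verifies that the Poisson and Gamma variances behave as μ and φμ², respectively.

# Empirical check of V(mu) = mu (Poisson) and V(mu) = mu^2 (Gamma)
set.seed(1)
mu <- 4
y_pois <- rpois(1e5, lambda = mu)
c(var(y_pois), mu)                                        # variance close to mu

alpha <- 5                                                # dispersion phi = 1 / alpha
y_gam <- rgamma(1e5, shape = alpha, rate = alpha / mu)    # Gamma responses with mean mu
c(var(y_gam), mu^2 / alpha)                               # variance close to phi * mu^2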
1.3.5 Weights

Averaging independent and identically distributed ED responses does not modify
their distribution, just the value of the weight.
Property 1.3.5 If Y1 , Y2 , . . . , Yn are independent with common probability mass/
density function of the form (1.3.3), then their average

$$\overline{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$$

has probability mass/density function




$$\exp\left(\frac{y\theta - a(\theta)}{\phi/(n\nu)}\right) c\bigl(y, \phi/(n\nu)\bigr).$$

The distribution for Y is thus the same as for each Yi except that the weight ν is
replaced with nν.
Averaging observations is thus accounted for by modifying the weights in the ED
family.
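A small simulation in R (illustrative values) shows the effect on the variance: averaging n independent Poisson responses leaves the mean unchanged but divides the variance by n, which is exactly what replacing ν by nν in (1.3.3) predicts.

# Variance of the average of n i.i.d. Poisson(lambda) responses is lambda / n
set.seed(1)
lambda <- 2; n <- 10
ybar <- replicate(10000, mean(rpois(n, lambda)))
c(mean(ybar), var(ybar), lambda / n)   # mean close to 2, variance close to 0.2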
In actuarial studies, responses can be ratios with the aggregate exposure or even
premiums in the denominator (in case loss ratios are analyzed). The numerator may
correspond to individual data, or to grouped data aggregated over a set of homo-
geneous policies. This means that the size of the group has to be accounted for as
the response ratios will tend to be far more volatile in low-volume cells than in
high-volume ones. Actuaries generally consider that a large-volume cell is the result
of summing smaller independent cells, leading to response variance proportional to
the inverse of the volume measure. This implies that weights vary according to the
business volume measure.
The next result extends Property 1.3.5 to weighted averages of ED responses.
Property 1.3.6 Consider independent responses Y1 , . . . , Yn obeying ED distribu-
tions (1.3.3) with common mean μ, dispersion parameter φ and specific weights νi .
Define the total weight
$$\nu_\bullet = \sum_{i=1}^{n} \nu_i.$$

Then, the weighted average


$$\frac{1}{\nu_\bullet} \sum_{i=1}^{n} \nu_i Y_i$$

still follows an ED distribution (1.3.3) with mean μ, dispersion parameter φ and
weight ν•.
1.3.6 Exposure-to-Risk

The number of observed events generally depends on a size variable that determines
the number of opportunities for the event to occur. This size variable is often the time
as the number of claims obviously depends on the length of the coverage period.
However, some other choices are possible, such as distance traveled in motor insur-
ance, for instance.
The Poisson process setting is useful when the actuary wants to analyze claims
experience from policyholders who have been observed during periods of unequal
lengths. Assume that the claims occur according to a Poisson process with rate λ. In
this setting, claims occur randomly and independently in time. Denoting as T1 , T2 , . . .
the times between two consecutive events, this means that these random variables are
independent and obey the Ex p(λ) distribution, the only one enjoying the memoryless
property. Hence, the kth claim occurs at time


$$\sum_{j=1}^{k} T_j \sim \mathrm{Gam}(k, \lambda)$$

where we recognize the Erlang distribution.


Consider a policyholder covered by the company for a period of length e, that is,
the policyholder exposes the insurer to the risk of recording a claim during e time
units. Then, the number Y of reported claims is such that

P[Y ≥ k] = P[ Σ_{j=1}^k T_j ≤ e ] = 1 − Σ_{j=0}^{k−1} exp(−λe) (λe)^j / j!,

so that it has probability mass function

P[Y = k] = P[Y ≥ k] − P[Y ≥ k + 1] = exp(−λe) (λe)^k / k!,   k = 0, 1, . . . ,

that is, Y ∼ Poi(λe).


In actuarial studies, the length e of the observation period is generally referred to
as the exposure-to-risk (hence the letter e). It allows the analyst to account for the
fact that some policies are observed for the whole period whereas others just entered
the portfolio or left it soon after the beginning of the observation period. We see that
the exposure-to-risk e simply multiplies the annual expected claim frequency λ in
the Poisson process model.
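
As an illustration of this result, the following sketch (Python; the numerical values and the function name are ours, purely for illustration) simulates the claim process through its exponential inter-arrival times and checks that the resulting number of claims over an exposure e has mean λe.

import numpy as np

rng = np.random.default_rng(0)
lam, e = 0.10, 0.5          # assumed annual claim frequency and exposure-to-risk

def claim_count(lam, e, rng):
    # Count the claims of a Poisson process with rate lam over a period of length e
    # by accumulating Exp(lam) waiting times until they exceed e.
    t, k = rng.exponential(1.0 / lam), 0
    while t <= e:
        k += 1
        t += rng.exponential(1.0 / lam)
    return k

counts = np.array([claim_count(lam, e, rng) for _ in range(100_000)])
print(counts.mean())        # close to lam * e = 0.05, in line with Y ~ Poi(lam * e)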

1.4 Maximum Likelihood Estimation

1.4.1 Likelihood-Based Statistical Inference

Assuming that the responses are random variables with unknown distribution depend-
ing on one (or several) parameter(s) θ, the actuary must draw conclusions about the
unknown parameter θ based on available data. Such conclusions are thus subject to
sampling errors: another dataset from the same population would have inevitably
produced different results. To perform premium calculations, the actuary needs to
select a value of θ hopefully close to the true parameter value. Such a value is called
an estimate (or pointwise estimation) of the unknown parameter θ. The estimate is
distinguished from the model parameters by a hat, which means that an estimate of
θ is denoted by  θ. This distinction is necessary since it is generally impossible to
estimate the true parameter θ without error. Thus, θ̂ ≠ θ in general. The estimator is
itself a random variable as it varies from sample to sample (in a repeated sampling
setting, that is, drawing random samples from a given population and computing
the estimated value, again and again). Formally, an estimator  θ is a function of the
observations Y1, Y2, . . . , Yn, that is,

θ̂ = θ̂(Y1, Y2, . . . , Yn).

The corresponding estimate is θ̂(y1, y2, . . . , yn), computed from the realizations of
the responses for a particular sample of observations y1, y2, . . . , yn.
In this Sect. 1.4, we do not deal with more complex data structures such as values of
responses accompanied by possible explanatory variables, which amounts to considering
the case where the features do not bring any information about the response.
fundamental assumption here is that the value of the response is random and thus
considered as the realization of a random variable with unknown distribution.
It seems reasonable to require that a good estimate of the unknown parameter θ
would be the value of the parameter that maximizes the chance of getting the data that
have been recorded in the database (in which the actuary trusts). Maximum likeli-
hood estimation is a method to determine the likely values of the parameters having
produced the available responses in a given probability model. Broadly speaking,
the parameter values are found such that they maximize the chances that the model
produced the data that were actually observed.

1.4.2 Maximum-Likelihood Estimator

The parameter value that makes the observed y1 , . . . , yn the most probable is
called the maximum-likelihood estimate. Formally, the likelihood function LF(θ) is
defined as the joint probability mass/density function of the observations. In our case,
for independent observations Y1 , . . . , Yn obeying the same ED distribution (1.3.3)

with unit weights, the likelihood function is given by

LF(θ) = Π_{i=1}^n exp( (yi θ − a(θ)) / φ ) c(yi, φ).

The likelihood function LF(θ) can be interpreted as the probability or chance of
obtaining the actual observations y1, . . . , yn under the parameter θ. It is important
to remember that the likelihood is always defined for a given set of observed values
y1, . . . , yn. In case repeated sampling properties are discussed, the numerical values
yi are replaced with unobserved random variables Yi.
The maximum-likelihood estimator is the value  θ which maximizes the likelihood
function. Equivalently,  θ maximizes the log-likelihood function

L(θ) = ln LF(θ)

which is given by

L(θ) = Σ_{i=1}^n [ (yi θ − a(θ)) / φ + ln c(yi, φ) ]
     = n (ȳ θ − a(θ)) / φ + Σ_{i=1}^n ln c(yi, φ)

where ȳ is the sample mean, i.e. the arithmetic average

ȳ = (1/n) Σ_{i=1}^n yi

of the available observations y1 , y2 , . . . , yn .


In risk classification, the parameter of interest is the mean μ of the response
entering the pure premium calculation. Working with ED distributions, this means
that the parameter of interest is θ (that is, the analyst is primarily interested in the
mean value of the response), so that φ is called a nuisance parameter.

1.4.3 Derivation of the Maximum-Likelihood Estimate

The desired θ̂ can easily be obtained by solving the likelihood equation

d/dθ L(θ) = 0.

This gives

0 = d/dθ L(θ)
  = d/dθ [ n (ȳ θ − a(θ)) / φ ]
  = n (ȳ − a′(θ)) / φ

so that the maximum-likelihood estimate of θ is the unique root of the equation

ȳ = a′(θ) ⇔ θ̂ = (a′)⁻¹(ȳ).

The solution indeed corresponds to a maximum as the second derivative

d²/dθ² L(θ) = −n a′′(θ)/φ = −(n/φ²) Var[Y1]

is always negative. We see that the individual observations y1, . . . , yn are not needed
to compute θ̂ as long as the analyst knows ȳ (which thus summarizes all the informa-
tion contained in the observations about the canonical parameter). Also, we notice
that the nuisance parameter φ does not show up in the estimation of θ.
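
For instance, in the Poisson case a′(θ) = exp(θ), so that θ̂ = ln ȳ and the fitted mean is a′(θ̂) = ȳ; in the Gamma case a′(θ) = −1/θ, so that θ̂ = −1/ȳ and the fitted mean is again ȳ.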

1.4.4 Properties of the Maximum-Likelihood Estimators

Properties of estimators are generally inherited from a hypothetical repeated sam-
pling procedure. Assume that we are allowed to repeatedly draw samples from a
given population. Averaging the estimates θ̂1, θ̂2, . . . obtained for each of the sam-
ples corresponds to the mathematical expectation as

E[θ̂] = lim_{k→∞} (1/k) Σ_{j=1}^k θ̂_j

by the law of large numbers. In this setting, Var[θ̂] measures the stability of the
estimates across these samples.
We now briefly discuss some relevant properties of the maximum-likelihood esti-
mators.

1.4.4.1 Consistency

Let us denote as θ̂n the maximum-likelihood estimator obtained with a sample of
size n, that is, with n observations Y1, Y2, . . . , Yn. The maximum-likelihood estimator
θ̂n = θ̂(Y1, Y2, . . . , Yn) is consistent for the parameter θ, namely that

lim_{n→∞} P[ |θ̂n − θ| < ε ] = 1 for all ε > 0.

In our case, in the ED family setting, consistency of the maximum-likelihood esti-
mator is a direct consequence of the law of large numbers. Indeed, as Ȳ is known
to converge to μ with probability 1, it turns out that θ̂ = (a′)⁻¹(Ȳ) converges to
(a′)⁻¹(μ) = θ.
Consistency is thus an asymptotic property which ensures that, as the sample size
gets large, the maximum-likelihood estimator is increasingly likely to fall within a
small region around the true value of the parameter. This expresses the idea that  θn
converges to the true value θ of the parameter as the sample size becomes infinitely
large.

1.4.4.2 Invariance

If h is a one-to-one function then h(θ̂) is the maximum-likelihood estimate for h(θ),
that is,

ĥ(θ) = h(θ̂)

when maximum-likelihood is used for estimation. This ensures that for every distri-
bution in the ED family, the maximum-likelihood estimate fulfills

μ̂ = ȳ.    (1.4.1)

The maximum likelihood estimator μ̂ is thus unbiased for μ since

E[μ̂] = E[Ȳ] = (1/n) Σ_{i=1}^n E[Yi] = μ.

1.4.4.3 Asymptotic Unbiasedness

The maximum-likelihood estimator is asymptotically unbiased, that is,

lim_{n→∞} E[θ̂n] = θ.

The expectation E[θ̂n] approaches θ as the sample size increases.

1.4.4.4 Minimum Variance

In the class of all estimators, for large samples, the maximum-likelihood estimator
θ̂ has the minimum variance and is therefore the most accurate estimator possible.
We see that many attractive properties of the maximum-likelihood estimation
principle hold in large samples. As actuaries generally deal with massive amounts of
data, this makes this estimation procedure particularly attractive to conduct insurance
studies.

1.4.5 Examples

1.4.5.1 Poisson Distribution

Assume that the responses Y1 , . . . , Yn are independent and Poisson distributed with
Yi ∼ Poi(λei ) for all i = 1, . . . , n, where ei is the exposure-to-risk for observation
i and λ is the annual expected claim frequency that is common to all observations.
We thus have μi = λei for all i = 1, . . . , n so that θi = ln μi = ln ei + ln λ and
a(θi) = exp(θi) = λ ei. In this setting, the log-likelihood function writes

L(λ) = Σ_{i=1}^n [ (Yi θi − a(θi)) / φ + ln c(Yi, φ) ]
     = Σ_{i=1}^n ( Yi θi − exp(θi) + ln c(Yi, φ) )
     = Σ_{i=1}^n ( Yi ln ei + Yi ln λ − λ ei + ln c(Yi, φ) ).

Thus, the maximum likelihood estimator λ̂ for λ, solution of

d/dλ L(λ) = 0,

is given by

λ̂ = Σ_{i=1}^n Yi / Σ_{i=1}^n ei.

We have

E[λ̂] = E[ Σ_{i=1}^n Yi / Σ_{i=1}^n ei ] = (1 / Σ_{i=1}^n ei) Σ_{i=1}^n E[Yi] = λ,

so that this estimator is unbiased for λ. Furthermore, its variance is given by

Var[λ̂] = Var[ Σ_{i=1}^n Yi / Σ_{i=1}^n ei ] = (1 / (Σ_{i=1}^n ei)²) Σ_{i=1}^n Var[Yi] = λ / Σ_{i=1}^n ei,

which converges to 0 as the denominator goes to infinity. In particular, since λ̂ is
unbiased and

lim_{n→∞} Var[λ̂] = 0,

the maximum likelihood estimator λ̂ is consistent for λ.

1.4.5.2 Gamma Distribution

Assume that the responses Y1 , . . . , Yn are independent and Gamma distributed with
Yi ∼ Gam(μ, ανi ) for all i = 1, . . . , n, where νi is the weight for observation i.
The log-likelihood function writes

L(μ) = Σ_{i=1}^n [ (Yi θ − a(θ)) / (φ/νi) + ln c(Yi, φ/νi) ]
     = Σ_{i=1}^n [ (−Yi/μ − ln μ) / (φ/νi) + ln c(Yi, φ/νi) ].

Taking the derivative of L(μ) with respect to μ gives

d/dμ L(μ) = (1/(φμ)) [ (Σ_{i=1}^n νi Yi) / μ − Σ_{i=1}^n νi ].

Hence, the maximum likelihood estimator μ̂ for μ, solution of

d/dμ L(μ) = 0,

is given by

μ̂ = Σ_{i=1}^n νi Yi / Σ_{i=1}^n νi.

This estimator is obviously unbiased for μ and

Var[μ̂] = (1 / (Σ_{i=1}^n νi)²) Σ_{i=1}^n νi² Var[Yi] = (μ²/α) / Σ_{i=1}^n νi,

so that Var[μ̂] converges to 0 as Σ_{i=1}^n νi becomes infinitely large. The maximum
likelihood estimator μ̂ is then consistent for μ.

1.5 Deviance

In the absence of relation between features and the response, we fit a common mean
μ to all observations, such that

μ̂i = ȳ for i = 1, 2, . . . , n,

as we have seen in Sect. 1.4. This model is called the null model and corresponds
to the case where the features do not bring any information about the response. In
the null model, the data are represented entirely as random variations around the
common mean μ. If the null model applies then the data are homogeneous and there
is no reason to charge different premium amounts to subgroups of policyholders.
The null model represents one extreme where the data are purely random. Another
extreme is the full model which represents the data as being entirely systematic. The
model estimate μ̂i is just the corresponding observation yi, that is,

μ̂i = yi for i = 1, 2, . . . , n.


Thus, each fitted value is equal to the observation and the full model fits perfectly.
However, this model does not extract any structure from the data, but merely repeats
the available observations without condensing them.
The deviance, or residual deviance of a regression model μ̂ is defined as

D(μ̂) = 2φ ( L_full − L(μ̂) )
     = 2 Σ_{i=1}^n νi ( yi (θ̃i − θ̂i) − a(θ̃i) + a(θ̂i) ),    (1.5.1)

where L_full is the log-likelihood of the full model based on the considered ED dis-
tribution and with θ̃i = (a′)⁻¹(yi) and θ̂i = (a′)⁻¹(μ̂i), where μ̂i = μ̂(x i). As the
full model gives the highest attainable log-likelihood with the ED distribution under
consideration, the difference between the log-likelihood L full of the full model and
the log-likelihood L( μ) of the regression model under interest is always positive. The
deviance is a measure of distance between a model and the observed data defined
by means of the saturated model. It quantifies the variations in the data that are not
explained by the model under consideration. A too large value of D( μ) indicates that
the model μ under consideration does not satisfactorily fit the actual data. The larger
the deviance, the larger the differences between the actual data and the fitted values.
The deviance of the null model is called the null deviance.

Table 1.3 Deviance associated to regression models based on some members of the ED family of
distributions

Distribution         Deviance
Binomial             2 Σ_{i=1}^n [ yi ln(yi/μ̂i) + (ni − yi) ln((ni − yi)/(ni − μ̂i)) ], where μ̂i = ni q̂i
Poisson              2 Σ_{i=1}^n [ yi ln(yi/μ̂i) − (yi − μ̂i) ], where y ln y = 0 if y = 0
Normal               Σ_{i=1}^n (yi − μ̂i)²
Gamma                2 Σ_{i=1}^n [ −ln(yi/μ̂i) + (yi − μ̂i)/μ̂i ]
Inverse Gaussian     Σ_{i=1}^n (yi − μ̂i)² / (μ̂i² yi)

Table 1.3 displays the deviance associated to regression models based on some
members of the ED family of distributions. Notice that in that table, y ln y is taken
to be 0 when y = 0 (its limit as y → 0).
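
For later use, the Poisson and Gamma rows of Table 1.3 translate directly into code. The sketch below (Python; the helper names are ours) returns the corresponding deviances for vectors of observations and fitted means; weights νi, when present, can be introduced by multiplying the individual terms as in (1.5.1).

import numpy as np

def poisson_deviance(y, mu):
    # 2 * sum( y * ln(y/mu) - (y - mu) ), with the convention y * ln(y) = 0 when y = 0
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

def gamma_deviance(y, mu):
    # 2 * sum( -ln(y/mu) + (y - mu)/mu )
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    return 2.0 * np.sum(-np.log(y / mu) + (y - mu) / mu)

print(poisson_deviance([0, 1, 2], [0.5, 0.8, 1.2]))
print(gamma_deviance([1200.0, 800.0], [1000.0, 1000.0]))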

1.6 Actuarial Pricing and Tree-Based Methods

In actuarial pricing, the aim is to evaluate the pure premium as accurately as possible.
The target is the conditional expectation μ(X) = E[Y |X] of the response Y (claim
number or claim amount for instance) given the available information X. The function
x → μ(x) = E[Y |X = x] is generally unknown to the actuary and is approximated
by a working premium x →  μ(x). The goal is to produce the most accurate function

μ(x).
Lack of accuracy for μ̂(x) is defined by the generalization error

Err(μ̂) = E[L(Y, μ̂(X))],

where L(., .) is a function measuring the discrepancy between its two arguments,
called loss function, and the expected value is over the joint distribution of (Y, X).
We aim to find a function μ(x) of the features minimizing the generalization error.
Notice that the loss function L(., .) should not be confused with the log-likelihood
L(.).
Let
L = {(y1 , x 1 ), (y2 , x 2 ), . . . , (yn , x n )} (1.6.1)

be the set of observations available to the insurer, called learning set. The learning
set is often partitioned into a training set

D = {(yi , x i ); i ∈ I}

and a validation set

D̄ = {(yi, x i); i ∈ Ī},

with I ⊂ {1, . . . , n} labelling the observations in D and Ī = {1, . . . , n}\I labelling
the remaining observations of L. The training set is then used to fit the model and
the validation set to assess the predictive accuracy of the model.
An approximation  μ(x) to μ(x) is obtained by applying a training procedure to
the training set. Common training procedures are to restrict μ(x) to be a member of
a parametrized class of functions. More specifically, a supervised training procedure
assumes that μ belongs to a family of candidate models of restricted structure. Model
selection is then defined as finding the best model within this family on the basis of
the training set. Of course, the true model is usually not a member of the family under
consideration, depending on the restricted structure imposed to the candidate models.
However, when these restrictions are flexible enough, some candidate models could
be sufficiently close to the true model.
In the GLM setting discussed in Denuit et al. (2019), μ(x) is assumed to be of
the form
μ(x) = g −1 (score(x)) , (1.6.2)

where g is the link function and

score(x) = β0 + Σ_{j=1}^p βj xj.    (1.6.3)

Estimates β̂0, β̂1, . . . , β̂p for parameters β0, β1, . . . , βp are then obtained from the
training set by maximum likelihood, so that we get

μ̂(x) = g⁻¹( ŝcore(x) )
      = g⁻¹( β̂0 + Σ_{j=1}^p β̂j xj ).    (1.6.4)

It is interesting to notice that the maximum likelihood estimates β̂0, β̂1, . . . , β̂p also
minimize the deviance

D^train(μ̂) = 2 Σ_{i∈I} νi ( yi ( (a′)⁻¹(yi) − (a′)⁻¹(μ̂i) ) − a((a′)⁻¹(yi)) + a((a′)⁻¹(μ̂i)) )    (1.6.5)

computed on the training set, also called in-sample deviance. Note that in this book,
μ) and D train (
both notations D( μ) are used to designate the in-sample deviance. The
appropriate choice for the loss function in the GLM framework is thus given by
L(yi, μ̂i) = 2νi ( yi ( (a′)⁻¹(yi) − (a′)⁻¹(μ̂i) ) − a((a′)⁻¹(yi)) + a((a′)⁻¹(μ̂i)) ),    (1.6.6)

which makes the training sample estimate of the generalization error

Êrr^train(μ̂) = (1/|I|) Σ_{i∈I} L(yi, μ̂i)

correspond to the in-sample deviance (up to the factor 1/|I|), that is,

Êrr^train(μ̂) = D^train(μ̂) / |I|,    (1.6.7)

where |I| denotes the number of elements of I. In the following, we extend the
choice (1.6.6) for the loss function to the tree-based methods studied in this book.
The GLM training procedure thus amounts to estimating scores structured as in (1.6.4)
by minimizing the corresponding in-sample deviances (1.6.7) and to selecting the best
model among the GLMs under consideration. Note that selecting the best model
among different GLMs on the basis of the deviance computed on the training set will
favor the most complex models. For this reason, the Akaike Information Criterion (AIC)
is preferred over the in-sample deviance, because it accounts for model complexity
through a penalty term.
GLMs on the basis of AIC amounts to account for the optimism in the deviance
computed on the training set.
In this second volume, we work with models whose scores are linear combinations
of regression trees, that is,

g(μ(x)) = score(x) = Σ_{m=1}^M βm Tm(x),

where M is the number of trees producing the score and Tm (x) is the prediction
obtained from regression tree Tm , m = 1, . . . , M. The parameters β1 , . . . , β M specify
the linear combination used to produce the score. The training procedures studied in
this book differ in the way the regression trees are fitted from the training set and in
the linear combination used to produce the ensemble.
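
Schematically, all models studied in this volume therefore produce predictions of the following form, where the trees T1, . . . , TM, the weights β1, . . . , βM and the link g are those delivered by the training procedure (the function below is a generic sketch with names of our own choosing).

def ensemble_prediction(x, trees, betas, g_inverse):
    # score(x) = sum_m beta_m * T_m(x); the prediction is g^{-1}(score(x))
    score = sum(beta * tree(x) for beta, tree in zip(betas, trees))
    return g_inverse(score)

With M = 1 and β1 = 1 (Chap. 3) or β1 = . . . = βM = 1/M (Chap. 4), g is the identity; for boosting (Chap. 5), g is an appropriate link function.
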
In Chap. 3, we consider single regression trees for the score. Specifically, we work
with M = 1 and β1 = 1. In this setting, the identity link function is appropriate and
implicitly chosen, so that we assume

μ(x) = score(x) = T1 (x).

The estimated score will be in the range of the response, since the prediction in a terminal
node of the tree will be computed as the (weighted) average of the responses in that
node. Note that, contrarily to GLMs for instance for which the form of the score is

strongly constrained, the score is here left unspecified and estimated from the training
set. Theoretically, any function (here the true model) can be approximated as close as
possible by a piecewise constant function (here a regression tree). However, in prac-
tice, the training procedure limits the level of accuracy of a regression tree. A large
tree might overfit the training set, while a small tree might not capture the important
structure of the true model. The size of the tree is thus an important parameter in the
training procedure because it controls the score’s complexity. Selecting the optimal
size of the tree will also be part of the training procedure.
Remark 1.6.1 Because of the high flexibility of the score, a large regression tree
is prone to overfit the training set. However, it turns out that the training sample
estimate of the generalization error will favor larger trees. To combat this issue, part
of the observations of the training set can be used only to fit trees of different
sizes, and the remaining part of the observations of the training set to estimate the
generalization error of these trees in order to select the best one. Using the K-fold
cross validation estimate of the generalization error, as discussed in the next chapter,
is also an alternative to avoid this issue.
In Chap. 4, we work with regression models assumed to be the average of M
regression trees, each built on a bootstrap sample of the training set. The goal is to
reduce the variance of the predictions obtained from a single tree (slight changes in
the training set can drastically change the structure of the tree fitted on it). That is,
we suppose β1 = β2 = . . . = βM = 1/M, so that we consider models of the form

μ(x) = score(x) = (1/M) Σ_{m=1}^M Tm(x),

where T1 , . . . , TM are regression trees that will be estimated on different bootstrap


samples of the training set. This training procedure is called bagging trees. The
number of trees M and the size of the trees will also be determined during the
training procedure. A modification of bagging trees, called random forests, is also
studied in Chap. 4. The latter training procedure differs from bagging trees in the
way the trees are produced from the bootstrap samples of the training set. Note that
in Chap. 4, we use the notation B for the number of trees instead of M, B referring
to the number of bootstrap samples produced from the training set.
In Chap. 5, we focus on scores structured as score(x) = Σ_{m=1}^M Tm(x), where
T1 , . . . , TM are relatively small regression trees, called weak learners. Specifically,
we suppose β1 = . . . = β M = 1, and the regression trees in the expansion will be
sequentially fitted on random subsamples of the training set. This training procedure
is called boosting trees. Contrarily to bagging trees and random forests, boosting trees
is an iterative training procedure, meaning that the production of the mth regression
tree in the ensemble will depend on the previous m − 1 trees already fitted. Note
that, because of the iterative way the constituent trees are produced, the identity link
function may this time lead to predictions that are not in the range of the expected
response. Hence, in Chap. 5, we suppose that


g(μ(x)) = score(x) = Σ_{m=1}^M Tm(x),

where g is an appropriate link function mapping the score to the range of the response
and T1 , . . . , TM are relatively small regression trees that will be sequentially fitted
on random subsamples of the training set. Note that the number of trees M and the
size of the trees will also be selected during the training procedure.

1.7 Bibliographic Notes and Further Reading

This chapter closely follows the book of Denuit et al. (2019). Precisely, we summarize
the first three chapters of Denuit et al. (2019) with the notions useful for this second
volume. We refer the reader to the first three chapters by Denuit et al. (2019) for
more details, as well as for an extensive overview of the literature. Section 1.6 gives
an overview of the methods used throughout this book. We refer the reader to the
bibliographic notes of the next chapters for more details about the corresponding
literature.

References

Denuit M, Hainaut D, Trufin J (2019) Effective statistical learning methods for actuaries I: GLMs
and extensions. Springer actuarial lecture notes
Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat
29(5):1189–1232
Chapter 2
Performance Evaluation

2.1 Introduction

In actuarial pricing, the objective is to evaluate the pure premium as accurately as pos-
sible. The target is thus the conditional expectation μ(X) = E[Y |X] of the response
Y (claim number or claim amount for instance) given the available information X.
The function x → μ(x) = E[Y |X = x] is generally unknown to the actuary and
is approximated by a working premium x →  μ(x). The goal is to produce the most
accurate function  μ(x). Lack of accuracy for  μ(x) is defined by the generalization
error. Producing a model  μ whose predictions are as good as possible can be stated as
finding a model which minimizes its generalization error. In this chapter, we describe
the generalization error used throughout this book for model selection and model
assessment.

2.2 Generalization Error

2.2.1 Definition

We denote by
L = {(y1 , x 1 ), (y2 , x 2 ), . . . , (yn , x n )} (2.2.1)

the set of observations available to the insurer. This dataset is called the learning set.
We aim to find a model  μ built on the learning set L (or only on a part of L, called
training set, as discussed thereafter) which best approximates the true model μ.
Lack of accuracy for μ is defined by the generalization error. The generalization
error, also known as expected prediction error, of  μ is defined as follows:


Definition 2.2.1 The generalization error of the model μ̂ is

Err(μ̂) = E[L(Y, μ̂(X))],    (2.2.2)

where L(., .) is a function measuring the discrepancy between its two arguments,
called loss function.

The goal is thus to find a function of the covariates which predicts at best the
response, that is, which minimizes the generalization error. The model performance
is evaluated according to the generalization error which depends on a predefined
loss function. The choice of an appropriate loss function in our ED family setting is
discussed thereafter.
Notice that the expectation in (2.2.2) is taken over all possible data, that is, with
respect to the probability distribution of the random vector (Y, X) assumed to be
independent of the data used to fit the model.

2.2.2 Loss Function

A simple estimate of the generalization error is given by

Êrr(μ̂) = (1/n) Σ_{i=1}^n L(yi, μ̂i)    (2.2.3)

with μi = μ(x i ). In the ED family setting, the appropriate choice for the loss function
is related to the deviance. It suffices to observe that the regression model  μ maxi-
mizing the log-likelihood function L( μ) also minimizes the corresponding deviance
D(μ̂). Specifically, since the deviance D(μ̂) can be expressed as

D(μ̂) = 2 Σ_{i=1}^n νi ( yi ( (a′)⁻¹(yi) − (a′)⁻¹(μ̂i) ) − a((a′)⁻¹(yi)) + a((a′)⁻¹(μ̂i)) ),    (2.2.4)
we see from (2.2.3) that the appropriate loss function in our ED family setting is
given by

L(yi, μ̂i) = 2νi ( yi ( (a′)⁻¹(yi) − (a′)⁻¹(μ̂i) ) − a((a′)⁻¹(yi)) + a((a′)⁻¹(μ̂i)) ).    (2.2.5)
The constant 2 is not necessary and is there to make the loss function match the
deviance. Throughout this book, we use the choice (2.2.5) for the loss function.

2.2.3 Estimates

The performance of a model is evaluated throughout the generalization error Err ( μ).
In practice, we usually do not know the probability distribution from which the
observations are drawn, making the direct evaluation of the generalization error
Err (μ) not feasible. Hence, the set of observations available to the insurer often
constitutes the only data on which the model needs to be fitted and its generalization
error estimated.

2.2.3.1 Training Sample Estimate

The learning set


L = {(y1 , x 1 ), (y2 , x 2 ), . . . , (yn , x n )} (2.2.6)

constitutes the only data available to the insurer. When the whole learning set is used
to fit the model 
μ, the generalization error Err (
μ) can only be estimated on the same
data as the ones used to build the model, that is,

Êrr^train(μ̂) = (1/n) Σ_{i=1}^n L(yi, μ̂(x i)).    (2.2.7)

This estimate is called the training sample estimate and has been introduced in (2.2.3).
In our setting, we thus have

Êrr^train(μ̂) = D(μ̂) / n.    (2.2.8)
Typically, the training sample estimate (2.2.7) will be less than the true generalization
error, because the same data is being used to fit the model and assess its error. A
model typically adapts to the data used to train it, and hence the training sample
estimate will be an overly optimistic estimate of the generalization error. This is
particularly true for tree-based models because of their high flexibility to adapt to
the training set: the resulting models are too closely fitted to the training set, which
is called overfitting.

2.2.3.2 Validation Sample Estimate

The training sample estimate (2.2.7) directly evaluates the accuracy of the model on
the dataset used to build the model. While the training sample estimate is useful to
fit the model, the resulting estimate for the generalization error is likely to be very
optimistic since the model is precisely built to reduce it. This is of course an issue

when we aim to assess the predictive performance of the model, namely its accuracy
on new data.
As actuaries generally deal with massive amounts of data, a better approach is
to divide the learning set L into two disjoint sets D and D̄, called training set and
validation set, and to use the training set for fitting the model and the validation set for
estimating the generalization error of the model. The learning set is thus partitioned
into a training set
D = {(yi , x i ); i ∈ I}

and a validation set

D̄ = {(yi, x i); i ∈ Ī},

with I ⊂ {1, . . . , n} labelling the observations in D considered for fitting the model
and Ī = {1, . . . , n}\I labelling the remaining observations of L used to assess the
predictive accuracy of the model. The validation sample estimate of the generalization
error of the model μ̂ that has been built on the training set D is then given by

Êrr^val(μ̂) = (1/|Ī|) Σ_{i∈Ī} L(yi, μ̂(x i)) = D^val(μ̂) / |Ī|    (2.2.9)

while the training sample estimate (2.2.7) now writes

Êrr^train(μ̂) = (1/|I|) Σ_{i∈I} L(yi, μ̂(x i)) = D^train(μ̂) / |I|,    (2.2.10)

where we denote by D train (μ) the deviance computed from the observations of the
training set (also called in-sample deviance) and D val (μ) the deviance computed
from the observations of the validation set (also called out-of-sample deviance). As
a rule-of-thumb, the training set usually represents 80% of the learning set and the
validation set the remaining 20%. Of course, this allocation depends on the problem
under consideration. In any case, the splitting of the learning set must be done in a
way that observations in the training set can be considered independent from those
in the validation set and drawn from the same population. Usually, this is guaranteed
by drawing both sets at random from the learning set.
Training and validation sets should be as homogeneous as possible. Creating
those two sets by taking simple random samples, as mentioned above, is usually
sufficient to guarantee similar data sets. However, in some cases, the distribution
of the response can be quite different between the training and validation sets. For
instance, consider the annual number of claims in MTPL insurance. Typically, the

vast majority of the policyholders makes no claim over the year (say 95%). Some
policyholders experience one claim (say 4%) while only a few of them have more
than one claim (say 1% with two claims). In such a situation, because the proportions
of policyholders with one or two claims are small compared to the proportion of
policyholders with no claim, the distribution of the response can be very different
between the training and validation sets.
To address this potential issue, random sampling can be applied within subgroups,
a subgroup being a set of observations with the same response. In our example, we
would thus have three subgroups: a first one made of the observations with no claim
(95% of the observations), a second one composed of observations with one claim
(4% of the observations) and a third one with remaining observations (1% of the
observations). Applying the randomization within these subgroups is called stratified
random sampling.
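
A minimal sketch of such a stratified split is given below (Python; the 80/20 allocation and the names are ours, for illustration): indices are drawn at random within each response level, so that training and validation sets keep roughly the same distribution of claim counts.

import numpy as np

def stratified_split(y, train_frac=0.8, seed=0):
    # Return index arrays (I_train, I_valid), sampling at random within each
    # subgroup of observations sharing the same response value.
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts = []
    for level in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == level))
        train_parts.append(idx[: int(round(train_frac * len(idx)))])
    I_train = np.sort(np.concatenate(train_parts))
    I_valid = np.setdiff1d(np.arange(len(y)), I_train)
    return I_train, I_valid

I_train, I_valid = stratified_split(np.array([0] * 95 + [1] * 4 + [2] * 1))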

2.2.3.3 K-Fold Cross Validation Estimate


The validation sample estimate Êrr^val(μ̂) is an unbiased estimate of Err(μ̂). It is
obtained from a dataset independent of the training set used to fit the model. As
such, it constitutes a more reliable estimate of the generalization error to assess the
predictive performance of the model.
However, using a validation set reduces the size of the set used to build the model.
In actuarial science, it is generally not an issue as we are often in data-rich situations.
But sometimes, when the size of the learning set is too small, the K-fold cross
validation estimate is preferred over the validation sample estimate.
The K-fold cross validation works as follows. We randomly partition the learn-
ing set L into K roughly equal-size disjoint subsets L1 , . . . , L K and we label by
Ik ⊂ {1, . . . , n} the observations in Lk for all k = 1, . . . , K . For each subset Lk ,
k = 1, . . . , K, we fit the model on the set of observations L\Lk, that we denote
μ̂_{L\Lk}, and we estimate its generalization error on the remaining data Lk as

Êrr^val(μ̂_{L\Lk}) = (1/|Ik|) Σ_{i∈Ik} L(yi, μ̂_{L\Lk}(x i)).    (2.2.11)

For the model μ̂_{L\Lk}, the set of observations L\Lk plays the role of training set while
Lk that of validation set. The generalization error can then be estimated as the weighted
average of the estimates Êrr^val(μ̂_{L\Lk}) given in (2.2.11), that is,

Êrr^CV(μ̂) = Σ_{k=1}^K (|Ik|/n) Êrr^val(μ̂_{L\Lk})
           = (1/n) Σ_{k=1}^K Σ_{i∈Ik} L(yi, μ̂_{L\Lk}(x i)).    (2.2.12)

The idea behind the K-fold cross validation estimate Êrr^CV(μ̂) is that each model
μ̂_{L\Lk} should be close enough to the model μ̂ fitted on the whole learning set L.
Therefore, the estimates Êrr^val(μ̂_{L\Lk}) given in (2.2.11), that are unbiased estimates
of Err(μ̂_{L\Lk}), should also be close enough to Err(μ̂).
Contrary to the validation sample estimate, the K-fold cross validation estimate
uses every observation (yi , x i ) in the learning set L for estimating the generalization
error. Typically, K is fixed to 10, which appears to be a value that produces stable
and reliable estimates. However, the K-fold cross validation estimate is more com-
putationally intensive since it requires to fit K models, while only one model needs
to be fitted for computing its validation sample counterpart.
Note that the K partitions can be chosen in a way that makes the subsets
L1 , . . . , L K balanced with respect to the response. Applying stratified random sam-
pling as discussed in Sect. 2.2.3.2 produces folds that have similar distributions for
the response.
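
A possible implementation of the K-fold cross validation estimate (2.2.12) is sketched below (Python; fit and loss are user-supplied functions with names of our own: fit returns a prediction function trained on the data it receives, and loss returns the individual losses L(yi, μ̂i) of (2.2.5)).

import numpy as np

def cv_estimate(X, y, fit, loss, K=10, seed=0):
    # fit(X_train, y_train) must return a function mu_hat such that mu_hat(X) gives predictions;
    # loss(y_i, mu_i) must return the individual losses L(y_i, mu_i).
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)
    total = 0.0
    for idx_k in folds:                               # idx_k labels the observations of L_k
        mask = np.ones(n, dtype=bool)
        mask[idx_k] = False
        mu_hat = fit(X[mask], y[mask])                # model trained on L \ L_k
        total += np.sum(loss(y[idx_k], mu_hat(X[idx_k])))
    return total / n                                  # estimate (2.2.12)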

2.2.3.4 Model Selection and Model Assessment

Model selection consists in choosing the best model among different models pro-
duced by a training procedure, say the final model, while model assessment consists
in assessing the generalization error of this final model. In practice, we do both model
selection and model assessment.
Model assessment should be performed on data that are kept out of the entire
training procedure (which includes the fit of the different models together with model
selection). Ideally, the training set D should be entirely dedicated to the training
procedure and the validation set D̄ to model assessment.
As part of the training procedure, model selection is thus based on observations
from the training set. To guarantee an unbiased estimate of the generalization error
for each model under consideration during model selection, a possibility is to divide
in its turn the training set into two parts: a part used to fit the models and another
(sometimes called test set) to estimate the corresponding generalization errors. Of
course, this approach supposes a data-rich situation. If there are insufficient observations
in the training set to make this split, another
possibility for model selection consists in relying on K-fold cross validation estimates
as described above, using the training set D instead of the entire learning set L.

2.2.4 Decomposition

According to Definition 2.2.1, the generalization error Err(μ̂) of a model μ̂ is given
by

Err(μ̂) = E[L(Y, μ̂(X))].    (2.2.13)

In the same way, the generalization error of μ̂ can be defined for a fixed value X = x
as

Err(μ̂(x)) = E[L(Y, μ̂(X)) | X = x].    (2.2.14)

Notice that averaging the local errors Err(μ̂(x)) enables us to recover the generalization
error Err(μ̂), that is,

Err(μ̂) = E[Err(μ̂(X))].    (2.2.15)

2.2.4.1 Squared Error Loss

Consider that the loss function is the squared error loss. In our ED family setting, it
amounts to assuming that the responses are normally distributed. The generalization
error of model μ̂ at X = x becomes

Err(μ̂(x)) = E[ (Y − μ̂(x))² | X = x ]
 = E[ (Y − μ(x) + μ(x) − μ̂(x))² | X = x ]
 = E[ (Y − μ(x))² | X = x ] + E[ (μ(x) − μ̂(x))² | X = x ]
   + 2 E[ (Y − μ(x)) (μ(x) − μ̂(x)) | X = x ]
 = E[ (Y − μ(x))² | X = x ] + E[ (μ(x) − μ̂(x))² | X = x ]

since

E[ (Y − μ(x)) (μ(x) − μ̂(x)) | X = x ]
 = (μ(x) − μ̂(x)) E[ Y − μ(x) | X = x ]
 = (μ(x) − μ̂(x)) ( E[Y | X = x] − μ(x) )
 = 0

by definition of μ(x) = E[Y | X = x]. So, it comes

Err(μ̂(x)) = Err(μ(x)) + (μ(x) − μ̂(x))².    (2.2.16)

By (2.2.15), the generalization error Err(μ̂) thus writes

Err(μ̂) = Err(μ) + E[ (μ(X) − μ̂(X))² ].    (2.2.17)

The generalization error of μ̂ can be expressed as the sum of two terms, the first
one corresponding to the generalization error of the true model μ and the second one
representing the estimation error, that is, the discrepancy of 
μ from the true model

μ. The further our model is from the true one, the larger the generalization error. The
generalization error of the true model is called the residual error and is irreducible.
Indeed, we have

Err(μ̂) ≥ Err(μ),

which means that the smallest generalization error coincides with the one associated
to the true model.

2.2.4.2 Poisson Deviance Loss

Consider that the loss function is the Poisson deviance. This choice is appropriate
when the responses are assumed to be Poisson distributed, as when examining the
number of claims for instance. The generalization error of model 
μ at X = x is then
given by
 
Y 
μ(x)) = 2E Y ln
Err ( − (Y − μ(x)) X = x

μ(x)
 
Y 
= 2E Y ln − (Y − μ(x))X = x
μ(x)

μ(x) 

+2E  μ(x) − μ(x) − Y ln X = x
μ(x)
 
Y 
= 2E Y ln − (Y − μ(x))X = x
μ(x)


μ(x)
+2 (
μ(x) − μ(x)) − 2E [Y |X = x] ln .
μ(x)

Replacing E[Y | X = x] by μ(x), we get

Err(μ̂(x)) = Err(μ(x)) + 2 (μ̂(x) − μ(x)) − 2 μ(x) ln(μ̂(x)/μ(x))
          = Err(μ(x)) + 2 μ(x) ( μ̂(x)/μ(x) − 1 − ln(μ̂(x)/μ(x)) ).    (2.2.18)

The generalization error Err(μ̂) thus writes

Err(μ̂) = Err(μ) + 2 E[ μ(X) ( μ̂(X)/μ(X) − 1 − ln(μ̂(X)/μ(X)) ) ].    (2.2.19)

As for the squared error loss, the generalization error of 


μ can be decomposed
as the sum of the generalization error of the true model and an estimation error
E[E P (
μ(X))], where


E^P(μ̂(x)) = 2 μ(x) ( μ̂(x)/μ(x) − 1 − ln(μ̂(x)/μ(x)) ).

Notice that E^P(μ̂(x)) is always positive because y → y − 1 − ln y is positive on
R+, so that we have

Err(μ̂) ≥ Err(μ).

2.2.4.3 Gamma Deviance Loss

Consider the Gamma deviance loss. This choice is often made when we study claim
severities for instance. The generalization error of model μ̂ at X = x is then given
by

Err(μ̂(x)) = 2E[ −ln(Y/μ̂(x)) + Y/μ̂(x) − 1 | X = x ]
 = 2E[ −ln(Y/μ(x)) + Y/μ(x) − 1 | X = x ]
   + 2E[ −ln(Y/μ̂(x)) + ln(Y/μ(x)) + Y/μ̂(x) − Y/μ(x) | X = x ]
 = Err(μ(x)) + 2E[ ln(μ̂(x)/μ(x)) + Y (μ(x) − μ̂(x)) / (μ̂(x) μ(x)) | X = x ]
 = Err(μ(x)) + 2 ( ln(μ̂(x)/μ(x)) + E[Y | X = x] (μ(x) − μ̂(x)) / (μ̂(x) μ(x)) )
 = Err(μ(x)) + 2 ( ln(μ̂(x)/μ(x)) + (μ(x) − μ̂(x)) / μ̂(x) )
 = Err(μ(x)) + 2 ( μ(x)/μ̂(x) − 1 − ln(μ(x)/μ̂(x)) )    (2.2.20)

since E[Y | X = x] = μ(x). The generalization error Err(μ̂) thus writes as the sum
of the generalization error of the true model and an estimation error E[E^G(μ̂(X))],
where

E^G(μ̂(x)) = 2 ( μ(x)/μ̂(x) − 1 − ln(μ(x)/μ̂(x)) ),

that is,

Err(μ̂) = Err(μ) + E[E^G(μ̂(X))].    (2.2.21)

Note that E^G(μ̂(x)) is always positive since we have already noticed in the Poisson
case that y → y − 1 − ln y is positive on R+, so that we have

Err(μ̂) ≥ Err(μ).

2.3 Expected Generalization Error

The model  μ under consideration is estimated on the training set D so that it depends
on D. To make explicit the dependence on the training set, we use from now on both
notations μ̂ and μ̂D for the model under interest. We assume for the time being that there
is only one model which corresponds to a given training set, that is, we consider
training procedures that are said to be deterministic. Training procedures that can
produce different models for a fixed training set are discussed in Sect. 2.4.
The generalization error Err ( μD ) is evaluated conditional on the training set. That
is, the model  μD under study is first fitted on the training set D before computing the
expectation over all possible observations independently from the training set D. In
that sense, the generalization error Err ( μD ) gives an idea of the general accuracy of
the training procedure for the particular training set D. In order to study the general
behavior of our training procedure, and not only its behavior for a specific training
set, it is interesting to evaluate the training procedure on different training sets of the
same size.
The training set D is itself a random variable sampled from a distribution usually
unknown in practice, so that the generalization error Err ( μD ) is in its turn a random
variable. In order to study the general performance of the training procedure, it is
then of interest to take the average of the generalization error Err ( μD ) over D, that
is, to work with the expected generalization error ED [Err ( μD )] over the models
learned from all possible training sets and produced with the training procedure under
investigation.
The expected generalization error is thus given by

ED[Err(μ̂D)] = ED[ EX[Err(μ̂D(X))] ],    (2.3.1)

which can also be expressed as

ED[Err(μ̂D)] = EX[ ED[Err(μ̂D(X))] ].    (2.3.2)

We can first determine the expected local error ED[Err(μ̂D(X))] in order to get
the expected generalization error.

2.3.1 Squared Error Loss

When the loss function is the squared error loss, we know from Eq. (2.2.16) that the
generalization error at X = x writes

Err(μ̂D(x)) = Err(μ(x)) + (μ(x) − μ̂D(x))².    (2.3.3)

The true model μ is independent of the training set, so is the generalization error
Err(μ(x)). The expected generalization error of μ̂ at X = x is then given by

ED[Err(μ̂D(x))] = Err(μ(x)) + ED[ (μ(x) − μ̂D(x))² ].

The first term is the local generalization error of the true model while the second
term is the expected estimation error at X = x, which can be re-expressed as

ED[ (μ(x) − μ̂D(x))² ]
 = ED[ (μ(x) − ED[μ̂D(x)] + ED[μ̂D(x)] − μ̂D(x))² ]
 = ED[ (μ(x) − ED[μ̂D(x)])² ] + ED[ (ED[μ̂D(x)] − μ̂D(x))² ]
   + 2 ED[ (μ(x) − ED[μ̂D(x)]) (ED[μ̂D(x)] − μ̂D(x)) ]
 = ED[ (μ(x) − ED[μ̂D(x)])² ] + ED[ (ED[μ̂D(x)] − μ̂D(x))² ]
 = (μ(x) − ED[μ̂D(x)])² + ED[ (ED[μ̂D(x)] − μ̂D(x))² ]

since

ED[ (μ(x) − ED[μ̂D(x)]) (ED[μ̂D(x)] − μ̂D(x)) ]
 = (μ(x) − ED[μ̂D(x)]) ED[ ED[μ̂D(x)] − μ̂D(x) ]
 = (μ(x) − ED[μ̂D(x)]) ( ED[μ̂D(x)] − ED[μ̂D(x)] )
 = 0.    (2.3.4)

Therefore, the expected generalization error at X = x is given by

ED[Err(μ̂D(x))] = Err(μ(x)) + (μ(x) − ED[μ̂D(x)])²
                 + ED[ (ED[μ̂D(x)] − μ̂D(x))² ].    (2.3.5)

This is the bias-variance decomposition of the expected generalization error.


The first term in (2.3.5) is the local generalization error of the true model, that is,
the residual error. The residual error is independent of the training procedure and the
training set, which provides in any case a lower bound for the expected generalization
error. Notice that in practice, the computation of this lower bound is often unfeasible
since the true model is usually unknown. The second term measures the discrepancy
between the average estimate ED [ μD (x)] and the value of the true model μ(x), and
corresponds to the bias term. The third term measures the variability of the estimate
μD (x) over the models trained from all possible training sets, and corresponds to

the variance term.
From (2.3.5), the expected generalization error writes

ED[Err(μ̂D)] = Err(μ) + EX[ (μ(X) − ED[μ̂D(X)])² ]
              + EX[ ED[ (ED[μ̂D(X)] − μ̂D(X))² ] ].    (2.3.6)

2.3.2 Poisson Deviance Loss

In the case of the Poisson deviance loss, we know from Eq. (2.2.18) that the local
generalization error writes

Err(μ̂D(x)) = Err(μ(x)) + E^P(μ̂D(x))    (2.3.7)

where

E^P(μ̂D(x)) = 2 μ(x) ( μ̂D(x)/μ(x) − 1 − ln(μ̂D(x)/μ(x)) ).    (2.3.8)

Because the true model μ is independent of the training set, the expected general-
ization error ED[Err(μ̂D(x))] can be expressed as

ED[Err(μ̂D(x))] = Err(μ(x)) + ED[E^P(μ̂D(x))]    (2.3.9)

with

ED[E^P(μ̂D(x))] = 2 μ(x) ( ED[μ̂D(x)/μ(x)] − 1 − ED[ln(μ̂D(x)/μ(x))] ).    (2.3.10)
Locally, the expected generalization error is equal to the generalization error of
the true model plus the expected estimation error, which can be attributed to the bias
and the estimation fluctuation. Notice that the expected estimation error ED[E^P(μ̂D(x))]
is positive since we have seen that the estimation error E^P(μ̂D(x)) is always positive.
The generalization error of the true model is again a theoretical lower bound for the
expected generalization error.
From (2.3.9) and (2.3.10), the expected generalization error writes

ED[Err(μ̂D)] = Err(μ) + 2 EX[ μ(X) ( ED[μ̂D(X)/μ(X)] − 1 − ED[ln(μ̂D(X)/μ(X))] ) ].    (2.3.11)

2.3.3 Gamma Deviance Loss

In the Gamma case, Eq. (2.2.20) tells us that the local generalization error is given
by

Err(μ̂D(x)) = Err(μ(x)) + E^G(μ̂D(x)),    (2.3.12)

where

E^G(μ̂D(x)) = 2 ( μ(x)/μ̂D(x) − 1 − ln(μ(x)/μ̂D(x)) ).    (2.3.13)

Hence, the expected generalization error ED[Err(μ̂D(x))] can be expressed as

ED[Err(μ̂D(x))] = Err(μ(x)) + ED[E^G(μ̂D(x))]    (2.3.14)

with

ED[E^G(μ̂D(x))] = 2 ( ED[μ(x)/μ̂D(x)] − 1 − ED[ln(μ(x)/μ̂D(x))] ).    (2.3.15)

Locally, the expected generalization error is equal to the generalization error of
the true model plus the expected estimation error, which can be attributed to the bias
and the estimation fluctuation. Notice that the expected estimation error ED[E^G(μ̂D(x))]
is positive since we have seen that the estimation error E^G(μ̂D(x)) is always positive.
The generalization error of the true model is a theoretical lower bound for the
expected generalization error.
bound for the expected generalization error.
From (2.3.14), the expected generalization error writes
  
μ(X) μ(X)
μD )] = Err (μ) + 2E X ED
ED [Err ( − 1 − ED ln .

μD (X) 
μD (X)
(2.3.16)

2.3.4 Bias and Variance

In order to minimise the expected generalization error, it might appear desirable to
sacrifice a bit on the bias provided we can reduce to a large extent the variability
of the prediction over the models trained from all possible training sets. The bias-
variance decomposition of the expected generalization error is used for justifying the
performances of ensemble learning techniques studied in Chap. 4.

2.4 (Expected) Generalization Error for Randomized Training Procedures

A training procedure which always produces the same model  μD for a given training
set D (and fixed values for the tuning parameters) is said to be deterministic. This is
the case for instance of regression trees studied in Chap. 3.
There also exist randomized training procedures that can produce different models
for a fixed training set (and fixed values for the tuning parameters), such as random
forests and boosting trees discussed in Chaps. 4 and 5. In order to account for the
randomness of the training procedure, we introduce a random vector Θ which is
assumed to fully capture the randomness of the training procedure. The model μ̂
resulting from the randomized training procedure depends on the training set D and
also on the random vector Θ, so that we use both notations μ̂ and μ̂_{D,Θ} for the model
under consideration.
The generalization error Err(μ̂_{D,Θ}) is thus evaluated conditional on the training
set D and the random vector Θ. The expected generalization error, which aims to
assess the general accuracy of the training procedure, is now obtained by taking the
average of the generalization error Err(μ̂_{D,Θ}) over D and Θ. Expression (2.3.1)
becomes

E_{D,Θ}[Err(μ̂_{D,Θ})] = E_{D,Θ}[ EX[Err(μ̂_{D,Θ}(X))] ],    (2.4.1)

which can also be expressed as

E_{D,Θ}[Err(μ̂_{D,Θ})] = EX[ E_{D,Θ}[Err(μ̂_{D,Θ}(X))] ].    (2.4.2)

Again, we can first determine the expected local error E_{D,Θ}[Err(μ̂_{D,Θ}(X))] in
order to get the expected generalization error.
Taking into account the additional source of randomness in the training procedure,
expressions (2.3.5), (2.3.9) and (2.3.14) become respectively

E_{D,Θ}[Err(μ̂_{D,Θ}(x))] = Err(μ(x)) + (μ(x) − E_{D,Θ}[μ̂_{D,Θ}(x)])²
                           + E_{D,Θ}[ (E_{D,Θ}[μ̂_{D,Θ}(x)] − μ̂_{D,Θ}(x))² ],    (2.4.3)

E_{D,Θ}[Err(μ̂_{D,Θ}(x))] = Err(μ(x)) + E_{D,Θ}[E^P(μ̂_{D,Θ}(x))]    (2.4.4)

with

E_{D,Θ}[E^P(μ̂_{D,Θ}(x))] = 2 μ(x) ( E_{D,Θ}[μ̂_{D,Θ}(x)/μ(x)] − 1 − E_{D,Θ}[ln(μ̂_{D,Θ}(x)/μ(x))] ),    (2.4.5)

and

E_{D,Θ}[Err(μ̂_{D,Θ}(x))] = Err(μ(x)) + E_{D,Θ}[E^G(μ̂_{D,Θ}(x))]    (2.4.6)

with

E_{D,Θ}[E^G(μ̂_{D,Θ}(x))] = 2 ( E_{D,Θ}[μ(x)/μ̂_{D,Θ}(x)] − 1 − E_{D,Θ}[ln(μ(x)/μ̂_{D,Θ}(x))] ).    (2.4.7)

2.5 Bibliographic Notes and Further Reading

This chapter is mainly inspired from Louppe (2014) and the book of Hastie et al.
(2009). We also find inspiration from Wüthrich and Buser (2019) for the choice of
the loss function in our ED family setting as well as for the decomposition of the
generalization error in the Poisson case.

References

Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer series in statistics
Louppe G (2014) Understanding random forests: from theory to practice. arXiv:14077502
Wüthrich MV, Buser C (2019) Data analytics for non-life insurance pricing. Lecture notes
Chapter 3
Regression Trees

3.1 Introduction

In this chapter, we present the regression trees introduced by Breiman et al. (1984).
Regression trees are at the core of this second volume. They are the building blocks of
the ensemble techniques described in Chaps. 4 and 5. We closely follow the seminal
book of Breiman et al. (1984). The presentation is also mainly inspired from Hastie
et al. (2009) and Wüthrich and Buser (2019).

3.2 Binary Regression Trees

A regression tree partitions the feature space χ into disjoint subsets {χt }t∈T , where
T is a set of indexes. On each subset χt , the prediction  ct of the response on that part
of the feature space is assumed to be constant. The resulting predictions  μ(x) can
be written as

μ̂(x) = Σ_{t∈T} ĉt I[x ∈ χt].    (3.2.1)
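
In code, a regression tree of the form (3.2.1) is simply a collection of pairs (subset, constant); the short sketch below (Python, with names of our own) evaluates such a piecewise constant predictor.

def tree_predict(x, leaves):
    # leaves: list of (indicator, c_hat) pairs, where indicator(x) is True exactly
    # when x belongs to the corresponding terminal-node subset chi_t.
    for indicator, c_hat in leaves:
        if indicator(x):
            return c_hat
    raise ValueError("x does not belong to any terminal node")

# Example with a single feature split at 35 (two terminal nodes):
leaves = [(lambda x: x[0] <= 35, 0.08), (lambda x: x[0] > 35, 0.12)]
print(tree_predict([28], leaves))   # 0.08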

Binary regression trees recursively partition the feature space by a sequence of


binary splits until a stopping rule is applied, as illustrated in Fig. 3.1.
The node t0 is called the root or root node of the tree and represents the feature
space χ itself, also denoted χt0 to make explicit the link with the root node t0 .
The feature space χt0 is first split into two disjoint subspaces χt1 and χt2 such that
χt1 ∪ χt2 = χt0 . The subspaces χt1 and χt2 correspond to nodes t1 and t2 , respectively.
The node t0 is said to be the parent of nodes t1 and t2 or equivalently nodes t1 and t2
are said to be the children of t0 . More precisely, t1 is called the left child of t0 while
t2 is called the right child.
Then, subspaces χt1 and χt2 are in turn split into two disjoint subspaces. For
instance, subspace χt1 is partitioned into disjoint subsets χt3 and χt4 such that


Fig. 3.1 Example of a binary regression tree. Circles indicate non-terminal nodes and rectangle
boxes represent terminal nodes

χt3 ∪ χt4 = χt1 . The node t1 is the parent of nodes t3 and t4 that correspond to sub-
spaces χt3 and χt4 , while t3 and t4 are the left and right children of t1 , respectively.
The node t4 does not have children. Such a node is called terminal node or leaf of
the tree, and is represented by a rectangle box. Non-terminal nodes are indicated by
circles.
This process is continued until all nodes are designated terminals. One says that the
tree stops growing. In Fig. 3.1, the terminal nodes are t4 , t7 , t9 , t10 , t12 , t13 , t14 , t15 and
t16 . The corresponding subspaces are disjoint and form together a partition of the
feature space χ , namely χt4 ∪ χt7 ∪ χt9 ∪ χt10 ∪ χt12 ∪ χt13 ∪ χt14 ∪ χt15 ∪ χt16 = χ .
The set of indexes T is then given by {t4 , t7 , t9 , t10 , t12 , t13 , t14 , t15 , t16 }.
In each terminal node t ∈ T , the prediction of the response is denoted  ct , that is,

μ(x) =  ct for x ∈ χt .
Notice that each node t is split into a left child node t L and a right child node t R .
In a more general case, we could consider multiway splits, resulting in more than
two children nodes for t. Figure 3.2 represents a binary split with a multiway split
resulting in four children nodes. However, the problem with multiway splits is that
they partition the data too quickly, leaving insufficient observations at the next level
down. Also, since multiway splits can be achieved by a series of binary splits, the
latter are often preferred. Henceforth, we work with binary regression trees.

Fig. 3.2 Binary split versus multiway split with four children nodes



Several elements need to be discussed in order to determine μ̂(x):
1. The selection of the splits;
2. The decision to declare a node terminal;
3. The predictions in the terminal nodes.
Remark 3.2.1 Regression functions of the form (3.2.1) are piecewise constant func-
tions. It is in general impossible to find the best piecewise constant regression function
minimizing the generalization error. Regression trees are produced by following a
greedy strategy.

3.2.1 Selection of the Splits

At each node t, the selection of the optimal split st requires to define a candidate set
St of possible binary splits and a goodness of split criterion to pick the best one.

3.2.1.1 Candidate Set of Splits

Let us assume that node t is composed of observations with k distinct values for x.
The number of partitions of χt into two non-empty disjoint subsets χtL and χt R is
then given by 2^(k−1) − 1. Because of the exponential growth of the number of binary
partitions with k, the strategy which consists in trying all partitions and taking the
best one turns out to be unrealistic since often computationally intractable.
For this reason, the number of possible splits is restricted. Specifically, only stan-
dardized binary splits are usually considered. A standardized binary split is charac-
terized as follows:
1. Depends on the value of only one single feature;
2. For an ordered feature x j , only allows questions of the form x j ≤ c, where c is
a constant;
3. For a categorical variable x j , only allows questions of the form x j ∈ C, where
C is a subset of possible categories of the variable x j .
An ordered feature x j taking q different values x j1, . . . , x jq at node t generates
q − 1 standardized binary splits at that node. The split questions are x j ≤ ci, i =
1, . . . , q − 1, where the constants ci are taken halfway between consecutive values
x ji and x j,i+1, that is ci = (x ji + x j,i+1)/2.
For a categorical feature x j with q different values at node t, the number of
standardized binary splits generated at that node is 2^(q−1) − 1. Let us notice that
trivial standardized binary splits are excluded, meaning splits of χt that generate
a subset (either χtL or χtR) with no observation. Therefore, at each node t, every
possible standardized binary split is tested and the best split is selected by means of
a goodness of split criterion. In the feature space, working with standardized binary
splits amounts to considering only splits that are perpendicular to the coordinate axes.

Fig. 3.3 Example of a tree with two features 18 ≤ x1 ≤ 100 and 0 ≤ x2 ≤ 20 (root split at
x1 ≤ 35, node t1 split at x2 ≤ 8, node t2 split at x1 ≤ 50)

Fig. 3.4 Partition into rectangles of the feature space corresponding to the tree depicted in Fig. 3.3

Hence, the resulting regression trees recursively partition the feature space into hyper
rectangles.
For example, let p = 2 with 18 ≤ x1 ≤ 100 and 0 ≤ x2 ≤ 20 and suppose that
the resulting tree is the one described in Fig. 3.3. The feature space is then partitioned
into the rectangles depicted in Fig. 3.4.

3.2.1.2 Goodness of Split

At a node t, the candidate set St is henceforth restricted to non-trivial standardized


binary splits. In order to determine the best possible split st ∈ St for node t, a goodness
of split criterion is needed.
In the ED family setting, the natural candidate is the deviance. That is, denoting
by Dχt(ĉt) the deviance on χt, the optimal split st solves

min_{s∈St} [ Dχ_{tL(s)}(ĉ_{tL(s)}) + Dχ_{tR(s)}(ĉ_{tR(s)}) ],    (3.2.2)

where tL(s) and tR(s) are the left and right children nodes of t resulting from split s
and ĉ_{tL(s)} and ĉ_{tR(s)} are the corresponding predictions. The optimal split st then leads to

children nodes tL(st) and tR(st) that we also denote tL and tR. Notice that solving (3.2.2)
amounts to finding s ∈ St that maximizes the decrease of the deviance at node t, namely

max_{s∈St} [ Dχt(ĉt) − ( Dχ_{tL(s)}(ĉ_{tL(s)}) + Dχ_{tR(s)}(ĉ_{tR(s)}) ) ].    (3.2.3)

A categorical feature xj with q categories xj1, . . . , xjq at node t yields 2^{q−1} − 1
possible splits. Hence, when q is large, computing (3.2.2) for every possible split
can be very time consuming. In order to speed up the procedure, each category xjl can
be replaced by the prediction $\hat{c}_t(x_{jl})$ computed on χt ∩ {xj = xjl}. This procedure
results in an ordered version of the categorical feature xj, reducing the number of
possible splits to q − 1. Notice that when the response records a number of
claims, the number of observations in a given node can be too small to observe
enough claims for some categories of the feature, especially in low frequency cases,
so that the resulting predictions may not provide a relevant ordering of the categories.
This potential issue is addressed in Sect. 3.2.2.2.
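To fix ideas, the ordering trick can be sketched in a few lines of R. The sketch below is illustrative only: the data frame node_data, with columns y (claim count), e (exposure-to-risk) and xj (the categorical feature), is a hypothetical representation of the observations falling in node t, not an object produced by rpart.

# Order the categories of a categorical feature by their observed claim
# frequency in the node, so that only the q - 1 ordered splits remain candidates.
order_categories <- function(node_data) {
  # prediction (observed claim frequency) per category: sum(y) / sum(e)
  freq <- tapply(node_data$y, node_data$xj, sum) /
          tapply(node_data$e, node_data$xj, sum)
  sorted <- names(sort(freq))
  # candidate splits: the first l categories versus the others, l = 1, ..., q - 1
  lapply(seq_len(length(sorted) - 1), function(l) sorted[seq_len(l)])
}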

Remark 3.2.2 A tree starts from the whole feature space χ and is grown by iteratively
dividing the subsets of χ into smaller subsets. This procedure consists in
dividing each node t using the optimal split st that locally maximizes the decrease of
the deviance. This greedy strategy could be improved by assessing the goodness of a
split while also looking at splits located deeper in the tree. However, such an approach is more
time consuming and does not seem to significantly improve the model performance.

3.2.2 The Prediction in Each Terminal Node

3.2.2.1 Maximum-Likelihood Estimator

For every terminal node t ∈ T, we need to compute the prediction $\hat{c}_t$. The deviance,
equivalently denoted $D(\hat{\mu})$ and $D((\hat{c}_t)_{t\in T})$ in the following, is given by
$$
D(\hat{\mu}) = \sum_{t \in T} D_{\chi_t}(\hat{c}_t).
$$
Minimizing the deviance $D(\hat{\mu})$ with respect to $(\hat{c}_t)_{t\in T}$ is thus equivalent to
minimizing each deviance $D_{\chi_t}(\hat{c}_t)$ with respect to $\hat{c}_t$. Therefore, the best estimate $\hat{c}_t$ for
the prediction is the maximum-likelihood estimate on χt.

3.2.2.2 Bayesian Estimator

Let the response Y be the number of claims. Given X = x, Y is assumed to be Poisson
distributed with expected claim frequency eμ(x) > 0, where e is the exposure-to-risk.

Let D = {(yi, xi); i ∈ I} be the observations available to train our model. For
a node t, we denote by wt the corresponding volume defined by
$$
w_t = \sum_{i:\, x_i \in \chi_t} e_i,
$$
and by ct the corresponding expected claim frequency defined by
$$
c_t = \frac{1}{w_t} \sum_{i:\, x_i \in \chi_t} e_i \mu(x_i) > 0.
$$
If the maximum-likelihood estimate $\hat{c}_t = 0$, then we get a solution which makes no
sense in practice. That is why we replace the maximum-likelihood estimate $\hat{c}_t$ on χt
by a Bayesian estimator $\hat{c}_t^{\text{Bayes}}$.
To this end, we assume that, conditionally on the random effect Θ ∼ Gam(γ, γ), the responses Yi
(i ∈ I) are independent and distributed as
$$
Y_i \,|\, \Theta \sim \text{Poi}\big(e_i \Theta \mu(x_i)\big).
$$
The random variable Θ introduces uncertainty in the expected claim frequency. Given
Θ, and denoting $Z_t = \sum_{i:\, x_i \in \chi_t} Y_i$, we then have
$$
Z_t \,|\, \Theta \sim \text{Poi}(c_t \Theta w_t),
$$
so that we obtain
$$
E[c_t \Theta \,|\, Z_t] = c_t E[\Theta \,|\, Z_t] = \alpha_t \hat{c}_t + (1 - \alpha_t) c_t
$$
with maximum-likelihood estimate $\hat{c}_t$ and credibility weight
$$
\alpha_t = \frac{w_t c_t}{\gamma + w_t c_t}.
$$

For more details about credibility theory and the latter calculations, we refer the
interested reader to Bühlmann and Gisler (2005).
Therefore, given values for ct and γ, we can compute $E[c_t \Theta | Z_t]$, which is always
positive, contrary to $\hat{c}_t$ which can be zero. However, ct is obviously not known in
practice. It could be replaced by $\hat{c}_t$ or $E[c_t \Theta | Z_t]$. The maximum-likelihood estimator
$\hat{c}_t$ does not solve our original problem. Turning to $E[c_t \Theta | Z_t]$, we can compute it
recursively as
$$
\hat{c}_t^{\text{Bayes},k} = \widehat{E}[c_t \Theta \,|\, Z_t] = \hat{\alpha}_t\, \hat{c}_t + (1 - \hat{\alpha}_t)\, \hat{c}_t^{\text{Bayes},k-1},
$$

with estimated credibility weights
$$
\hat{\alpha}_t = \frac{w_t\, \hat{c}_t^{\text{Bayes},k-1}}{\gamma + w_t\, \hat{c}_t^{\text{Bayes},k-1}}.
$$

This recursive procedure can be initialized by
$$
\hat{c}_t^{\text{Bayes},0} = \hat{c}_0 = \frac{\sum_{i\in I} Y_i}{\sum_{i\in I} e_i},
$$
which is the maximum-likelihood estimator for the expected claim frequency without
making any distinction between individuals. The remaining parameter γ, which enters
the computation of the estimated credibility weights, still needs to be selected and is
chosen externally.
Note that the R command rpart used in this chapter for the examples simplifies
the recursive approach described above. It rather considers the estimator
$$
\hat{c}_t^{\text{rpart}} = \hat{\alpha}_t^{\text{rpart}}\, \hat{c}_t + (1 - \hat{\alpha}_t^{\text{rpart}})\, \hat{c}_0
$$
for $E[c_t \Theta | Z_t]$, with estimated credibility weight
$$
\hat{\alpha}_t^{\text{rpart}} = \frac{w_t\, \hat{c}_0}{\gamma + w_t\, \hat{c}_0},
$$
which is a special case of the Bühlmann–Straub model.
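For concreteness, both estimators can be sketched in a few lines of R. This is only an illustration of the formulas above, not of the rpart internals; the argument names (z_t for the total claim count in the node, w_t for its volume, c0_hat for the portfolio-level frequency and gamma for the prior parameter) are ours.

# Recursive Bayesian estimator of the expected claim frequency in a node.
c_bayes <- function(z_t, w_t, c0_hat, gamma, tol = 1e-10, max_iter = 100) {
  c_hat <- z_t / w_t       # maximum-likelihood estimate in the node
  c_k   <- c0_hat          # initialization: portfolio-level frequency
  for (k in seq_len(max_iter)) {
    alpha <- w_t * c_k / (gamma + w_t * c_k)
    c_new <- alpha * c_hat + (1 - alpha) * c_k
    if (abs(c_new - c_k) < tol) break
    c_k <- c_new
  }
  c_k
}

# One-step estimator in the spirit of the rpart simplification above.
c_onestep <- function(z_t, w_t, c0_hat, gamma) {
  alpha <- w_t * c0_hat / (gamma + w_t * c0_hat)
  alpha * (z_t / w_t) + (1 - alpha) * c0_hat
}

The one-step version shrinks the node estimate towards the portfolio-level frequency c0_hat, so that a node without any observed claim (z_t = 0) still receives a positive prediction, which is precisely the motivation for replacing the maximum-likelihood estimate.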

Remark 3.2.3 Considering the same Θ in each terminal node introduces dependence
between the leaves. One way to remedy this undesirable effect is to consider
as many (independent) random effects as there are leaves in the tree. Notice that this
latter approach requires the knowledge of the tree structure.

3.2.3 The Rule to Determine When a Node Is Terminal

A node t is inevitably terminal when χt can no longer be split. This is the case
when the observations in node t share the same values for all the features, i.e. when
x i = x j for all (yi , x i ) and (y j , x j ) such that x i , x j ∈ χt . In such a case, splitting
node t would generate a subset with no observation. A node t is also necessarily
terminal when it contains observations with the same value for the response, i.e.
when yi = y j for all (yi , x i ) and (y j , x j ) such that x i , x j ∈ χt . In particular, this is
the case when node t contains only one observation.
Those stopping criteria are inherent to the recursive partitioning procedure. Such
inevitable terminal nodes lead to the biggest possible regression tree that we can

Fig. 3.5 Example of a tree with a maximal depth of two

grow on the training set. However, such a tree is likely to capture noise in the training
set and to cause overfitting.
In order to reduce the size of the tree and hence to prevent overfitting, these
stopping criteria that are inherent to the recursive partitioning procedure are com-
plemented with several rules. Three stopping rules that are commonly used can be
formulated as follows:
– A node t is declared terminal when it contains less than a fixed number of obser-
vations.
– A node t is declared terminal if at least one of its children nodes t L and t R that
results from the optimal split st contains less than a fixed number of observations.
– A node t is declared terminal when its depth is equal to a fixed maximal depth.
Notice that the depth of a node t is equal to d if it belongs to generation d + 1. For
instance, in Fig. 3.5, t0 has a depth of zero, t1 and t2 have a depth of one and terminal
nodes t3 , t4 , t5 and t6 have a depth of two.
Another common stopping rule consists of setting a threshold and declaring a
node t terminal if the decrease in deviance obtained by splitting node t with
the optimal split st is less than this fixed threshold. Recall that splitting node t into
children nodes tL and tR results in a decrease of the deviance given by
$$
\Delta D_{\chi_t} = D_{\chi_t}(\hat{c}_t) - \left( D_{\chi_{t_L}}(\hat{c}_{t_L}) + D_{\chi_{t_R}}(\hat{c}_{t_R}) \right). \qquad (3.2.4)
$$

It is interesting to notice that this deviance reduction $\Delta D_{\chi_t}$ is always positive. Indeed,
we have
$$
D_{\chi_t}(\hat{c}_t) = D_{\chi_{t_L}}(\hat{c}_t) + D_{\chi_{t_R}}(\hat{c}_t) \geq D_{\chi_{t_L}}(\hat{c}_{t_L}) + D_{\chi_{t_R}}(\hat{c}_{t_R})
$$
since the maximum-likelihood estimates $\hat{c}_{t_L}$ and $\hat{c}_{t_R}$ minimize the deviances
$D_{\chi_{t_L}}(\cdot)$ and $D_{\chi_{t_R}}(\cdot)$, respectively. As a consequence, if, for a partition $\{\chi_t\}_{t\in T}$ of χ, we
consider an additional standardized binary split st of χt for a given t ∈ T, yielding
the new set of indexes $T' = T \setminus \{t\} \cup \{t_L, t_R\}$ for the terminal nodes, we necessarily
have
$$
D\left((\hat{c}_t)_{t\in T}\right) \geq D\left((\hat{c}_t)_{t\in T'}\right).
$$

So, an additional standardized binary split always decreases the deviance.

While the stopping rules presented above may give good results in practice, the
strategy of stopping the growth of the tree early is in general unsatisfactory. For
instance, the last rule produces too large a tree if the threshold is set too low. Increasing
the threshold may lead to too small a tree, and a node t with a small deviance reduction
ΔDχt may have children nodes tL and tR with larger decreases in the deviance. Hence,
by declaring t terminal, the good splits at nodes tL and tR would never be used.
That is why it is preferable to prune the tree instead of stopping its growth early.
Pruning a tree consists in fully developing the tree and then pruning it upward
until the optimal tree is found. This is discussed in Sect. 3.3.

3.2.4 Examples

The examples in this chapter are done with the R package rpart, which stands for
recursive partitioning.

3.2.4.1 Simulated Dataset

We consider an example in car insurance. Four features X = (X1, X2, X3, X4) are
supposed to be available, namely
– X1 = Gender: policyholder’s gender (female or male);
– X2 = Age: policyholder’s age (integer values from 18 to 65);
– X3 = Split: whether the policyholder splits its annual premium or not (yes or no);
– X4 = Sport: whether the policyholder’s car is a sports car or not (yes or no).
The variables X 1 , X 2 , X 3 and X 4 are assumed to be independent and distributed as
follows:

P[X1 = female] = P[X1 = male] = 0.5;
P[X2 = 18] = P[X2 = 19] = . . . = P[X2 = 65] = 1/48;
P[X3 = yes] = P[X3 = no] = 0.5;
P[X4 = yes] = P[X4 = no] = 0.5.

The values taken by a feature are thus equiprobable.


The response Y is supposed to be the number of claims. Given X = x, Y is
assumed to be Poisson distributed with expected claim frequency given by

μ(x) = 0.1 × (1 + 0.1I [x1 = male])


× (1 + 0.4I [18 ≤ x2 < 30] + 0.2I [30 ≤ x2 < 45])
× (1 + 0.15I [x4 = yes]) .
60 3 Regression Trees

Table 3.1 Ten first observations of the simulated dataset


Y X 1 (Gender) X 2 (Age) X 3 (Split) X 4 (Sport)
1 0 Male 46 Yes Yes
2 0 Male 57 No No
3 0 Female 34 No Yes
4 0 Female 27 Yes No
5 0 Male 42 Yes Yes
6 0 Female 27 No Yes
7 0 Female 55 Yes Yes
8 0 Female 23 No Yes
9 0 Female 33 No No
10 2 Male 36 No Yes

Being a male increases the expected claim frequency by 10%, drivers between 18
and 29 (resp. 30 and 44) years old have expected claim frequencies 40% (resp.
20%) larger than policyholders aged 45 or more, splitting the premium does
not influence the expected claim frequency, while driving a sports car increases the
expected claim frequency by 15%.
In this example, the true model μ(x) is known and we can simulate realizations
of the random vector (Y, X). Specifically, we generate n = 500 000 independent
realizations of (Y, X), that is, we consider a learning set made of 500 000 observations
(y1 , x 1 ), (y2 , x 2 ), . . . , (y500 000 , x 500 000 ). An observation represents a policy that has
been observed during a whole year. In Table 3.1, we provide the ten first observations
of the learning set. While the nine first policies made no claim over the past year,
the tenth policyholder, who is a 36 years old man with a sports car and paying his
premium annually, experienced two claims.
In this simulated dataset, the proportion of males is approximately 50%, so are
the proportions of sports cars and policyholders splitting their premiums. For each
age 18,19,...,65, there are between 10 188 and 10 739 policyholders.
We now aim to estimate the expected claim frequency μ(x). To this end, we fit
several trees on our simulated dataset with Poisson deviance as loss function. Here,
we do not divide the learning set into a training set and a validation set, so that the
whole learning set is used to train the models. The R command used is

> tree <- rpart(Y ~ Gender + Age + Split + Sport,
                data = dataset,
                method = "poisson",
                control = rpart.control())

where data specifies the training set used to build the tree, method refers to the
optimisation criterion applied at each split, here the Poisson deviance, and control
allows one to control the size of the tree.
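The stopping rules of Sect. 3.2.3 translate into arguments of rpart.control. The values below are purely illustrative and assume that the package rpart has been loaded with library(rpart); minsplit and minbucket bound the number of observations in a node and in its children, maxdepth bounds the depth of the tree and cp sets the minimal relative decrease of the deviance required to accept a split.

ctrl <- rpart.control(minsplit = 5000,   # do not try to split smaller nodes
                      minbucket = 2500,  # no child node below this size
                      maxdepth = 3,      # maximal depth of the tree
                      cp = 0,            # no penalty on the deviance decrease
                      xval = 10)         # number of cross-validation folds
tree <- rpart(Y ~ Gender + Age + Split + Sport,
              data = dataset, method = "poisson", control = ctrl)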

Fig. 3.6 Tree with a maximum depth equal to three as stopping rule

A first tree with a maximum depth of three as stopping rule is built. A node t is
then terminal when its depth is equal to three, meaning that it belongs to the fourth
generation of nodes. The resulting tree is depicted in Fig. 3.6 and presented in more
detail below:
n= 500000

node), split, n, deviance, yval
      * denotes terminal node

 1) root 500000 279043.30 0.1317560
   2) Age>=44.5 218846 111779.70 0.1134548
     4) Sport=no 109659 53386.74 0.1044621
       8) Gender=female 54866 26084.89 0.1004673 *
       9) Gender=male 54793 27285.05 0.1084660 *
     5) Sport=yes 109187 58236.07 0.1224878
      10) Gender=female 54545 28257.29 0.1164381 *
      11) Gender=male 54642 29946.18 0.1285280 *
   3) Age< 44.5 281154 166261.40 0.1460015
     6) Age>=29.5 156219 88703.96 0.1355531
      12) Sport=no 77851 42515.21 0.1267940 *
      13) Sport=yes 78368 46100.83 0.1442541 *
     7) Age< 29.5 124935 77295.79 0.1590651
      14) Sport=no 62475 37323.97 0.1493856 *
      15) Sport=yes 62460 39898.18 0.1687435 *

We start with n=500 000 observations in the training set. The nodes are numbered
with the variable node). The node 1) is the root node, node 2) corresponds
to its left child and node 3) to its right child, and so on, such that node k) corresponds
to node tk−1 in our notations. For each node, the variable split specifies
the split criterion applied, n the number of observations in the node, deviance
the deviance at that node and yval the prediction (i.e. the estimate of the expected
claim frequency) in that node. Also, * denotes terminal nodes, which are in this case
nodes 8) to 15).
In particular, terminal node 14), which corresponds to node t13, is obtained by
using feature x4 (Sport) with answer no. It contains $n_{t_{13}} = 62\,475$ observations, the
deviance is $D_{\chi_{t_{13}}}(\hat{c}_{t_{13}}) = 37\,323.97$ and the corresponding estimated expected claim
frequency is given by $\hat{c}_{t_{13}} = 0.1493856$. Its parent node is t6 and the decrease of the
deviance resulting from the split of t6 into nodes t13 and t14 is
$$
\Delta D_{\chi_{t_6}} = D_{\chi_{t_6}}(\hat{c}_{t_6}) - \left( D_{\chi_{t_{13}}}(\hat{c}_{t_{13}}) + D_{\chi_{t_{14}}}(\hat{c}_{t_{14}}) \right)
= 77295.79 - (37323.97 + 39898.18) = 73.64.
$$
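Such decreases can also be recovered programmatically from the fitted object. The sketch below relies on the fact that rpart stores the deviance of each node in tree$frame$dev and uses the node numbers as row names of tree$frame (the children of node k being numbered 2k and 2k + 1); the function name is ours.

# Decrease of the deviance obtained by splitting a given non-terminal node.
deviance_decrease <- function(tree, node) {
  fr  <- tree$frame
  ids <- as.integer(rownames(fr))
  dev <- function(k) fr$dev[ids == k]
  dev(node) - (dev(2 * node) + dev(2 * node + 1))
}

deviance_decrease(tree, 7)   # split of node 7) into nodes 14) and 15)

Applied to the tree above, this should reproduce the decrease of 73.64 computed for node 7).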

In Fig. 3.6, each rectangle represents a node of the tree. In each node, one can
see the estimated expected claim frequency, the number of claims, the number of
observations as well as the proportion of the training set in the node. As an example,
in node 3) (or t2), we observe the estimated expected claim frequency $\hat{c}_{t_2} \approx 0.15$,
the number of claims $\sum_{i:\, x_i \in \chi_{t_2}} y_i \approx 41\,000$ and the number of observations
$n_{t_2} \approx 281\,000$, which corresponds to approximately 56% of the training set.
non-terminal node t, we find the question that characterizes the split st . In case the
answer is yes, one moves to the left-child node t L while when the answer is no,
one moves to the right-child node t R . Each node is topped by a small rectangle
box containing the number of the node which corresponds to the variable node).
Terminal nodes are nodes 8) to 15) and belong to the fourth generation of nodes
as requested by the stopping rule. Finally, the darker the gray of a node, the higher
the estimated expected claim frequency in that node.
The first split of the tree is defined by the question Age ≥ 44.5 (and not by
Age ≥ 44 as suggested in Fig. 3.6). For feature Age, the set of possible questions
is x2 ≤ ck with constants $c_k = \frac{18 + 19 + 2(k-1)}{2}$, k = 1, . . . , 47. The best split for the root
node is thus x2 ≤ 44.5. The feature Age is indeed the one that influences the
expected claim frequency the most, up to a difference of 40% (resp. 20%) for policyholders
with 18 ≤ Age < 30 (resp. 30 ≤ Age < 45) compared to those aged 45 or more.
The left child node is node 2) and comprises policyholders with Age ≥ 45. In
that node, the feature Age does not influence the expected claim frequency, while
features Sport and Gender can lead to expected claim frequencies that differ by
15% and 10%, respectively. The feature Sport is then naturally selected to define the
best split at that node. The two resulting nodes are 4) and 5), in which only the
feature Gender still influences the expected claim frequency, hence the choice of
Gender to perform both splits leading to terminal nodes.

Fig. 3.7 Tree with a maximum depth equal to four as stopping rule

The right child node of the root is node 3). In that node, the feature Age is
again the preferred feature, as it can still produce a difference of 16.67% = 1.4/1.2 − 1
in expected claim frequencies. The children nodes 6) and 7) are then in turn split
with the feature Sport since, in both nodes, it yields differences in expected claim
frequencies of 15% while the feature Gender yields differences of 10%. Notice
that the feature Age is no longer relevant in these two nodes. The resulting nodes,
namely 12), 13), 14) and 15), are terminal since they all belong to the fourth
generation of nodes.
The tree that has been built with a maximum depth of three as stopping rule is
not deep enough. Indeed, we notice that nodes 12), 13), 14) and 15) should still
be split with the feature Gender since males have expected claim frequencies 10%
higher than females. So, in this example, a maximum depth of three is too restrictive.
A tree with a maximum depth of four as stopping rule is then fitted. The resulting
tree is shown in Fig. 3.7. The first four generations of nodes are obviously the same
as in our previous tree depicted in Fig. 3.6. The only difference lies in the addition
of a fifth generation of nodes, which makes it possible to split nodes 12), 13), 14) and 15)
with the feature Gender, as desired. However, while nodes 8), 9), 10) and 11)
were terminal nodes in our previous tree, they now become non-terminal nodes as
they do not belong to the fifth generation. Therefore, they are all split in order to
meet our stopping rule. For instance, node 8) is split with the feature Split, while we

Table 3.2 Decrease of the deviance ΔDχt for each node t

Node k)    ΔD_{χ_{t_{k−1}}}
1) 279 043.30−(111779.70+166261.40) = 1002.20
2) 111779.70−(53386.74+58236.07) = 156.89
3) 166261.40−(88703.96+77295.79) = 261.65
4) 53386.74−(26084.89+27285.05) = 16.80
5) 58236.07−(28257.29+29946.18) = 32.60
6) 88703.96−(42515.21+46100.83) = 87.92
7) 77295.79−(37323.97+39898.18) = 73.64
8) 26084.89−(12962.76+13118.61) = 3.52
9) 27285.05−(1255.95+26027.61) = 1.49
10) 28257.29−(7964.89+20291.27) = 1.13
11) 29946.18−(21333.37+8612.48) = 0.33
12) 42515.21−(20701.11+21790.59) = 23.51
13) 46100.83−(22350.27+23718.03) = 32.53
14) 37323.97−(18211.52+19090.83) = 21.62
15) 39898.18−(19474.59+20396.94) = 26.65

know that this feature does not influence the expected claim frequency. Therefore,
while we improve our model on the right-hand side of the tree, i.e. node 3) and its
children, we start to overfit our dataset on the left-hand side of the tree, i.e. node 2)
and its descendants.
In this example, specifying the stopping rule only with respect to the maximum
depth does not lead to satisfactory results. Instead, we could combine the rule of a
maximum depth equal to four with a requirement of a minimum decrease of the
deviance. In Table 3.2, we show the decrease of the deviance ΔDχt observed at each
node t. As expected, we notice that nodes 8), 9), 10) and 11) have the four
smallest values for ΔDχt.
If we select a threshold somewhere between 3.52 and 16.80 for the minimum
decrease of the deviance allowed to split a node, which are the values of ΔDχt at
nodes 8) and 4), respectively, we get the optimal tree presented in Fig. 3.8, in which
nodes 16) to 23) have now disappeared compared to our previous tree shown in Fig. 3.7.
Hence, nodes 8) to 11) are now terminal nodes, as desired. Finally, we then get
twelve terminal nodes.
In Table 3.3, we show the terminal nodes with their corresponding expected claim
frequencies μ(x) as well as their estimates $\hat{\mu}(x)$.

Fig. 3.8 Optimal tree

Table 3.3 Terminal nodes with their corresponding expected claim frequencies μ(x) and estimated expected claim frequencies μ̂(x)

Node k)    x1 (Gender)    x2 (Age)    x4 (Sport)    μ(x)    μ̂(x)
Node 8) Female x2 ≥ 44.5 No 0.1000 0.1005
Node 9) Male x2 ≥ 44.5 No 0.1100 0.1085
Node 10) Female x2 ≥ 44.5 Yes 0.1150 0.1164
Node 11) Male x2 ≥ 44.5 Yes 0.1265 0.1285
Node 24) Female 29.5 ≤ x2 < 44.5 No 0.1200 0.1206
Node 25) Male 29.5 ≤ x2 < 44.5 No 0.1320 0.1330
Node 26) Female 29.5 ≤ x2 < 44.5 Yes 0.1380 0.1365
Node 27) Male 29.5 ≤ x2 < 44.5 Yes 0.1518 0.1520
Node 28) Female x2 < 29.5 No 0.1400 0.1422
Node 29) Male x2 < 29.5 No 0.1540 0.1566
Node 30) Female x2 < 29.5 Yes 0.1610 0.1603
Node 31) Male x2 < 29.5 Yes 0.1771 0.1772
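The comparison of the last two columns of Table 3.3 can be reproduced for any profile with the predict function, which for a Poisson tree returns the estimated claim frequency of the terminal node the profile falls into. The sketch below is illustrative: tree is assumed to be the fitted rpart object (for instance the optimal tree of Fig. 3.8) and the factor levels in new_obs are assumed to match those used in the training data.

# Fitted versus true expected claim frequencies for two illustrative profiles.
new_obs <- data.frame(Gender = c("female", "male"),
                      Age    = c(50, 25),
                      Split  = c("no", "yes"),
                      Sport  = c("no", "yes"))
mu_true <- 0.1 * (1 + 0.1 * (new_obs$Gender == "male")) *
  (1 + 0.4 * (new_obs$Age < 30) + 0.2 * (new_obs$Age >= 30 & new_obs$Age < 45)) *
  (1 + 0.15 * (new_obs$Sport == "yes"))
mu_hat <- predict(tree, newdata = new_obs)
cbind(new_obs, mu_true, mu_hat)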

3.2.4.2 Real Dataset

In this second example, we consider a motor third-party liability insurance portfolio


of an insurance company operating in the EU that has been observed during one
year. The portfolio comprises n = 160 944 insurance policies. For each policy i,
i = 1, . . . , n, we have the numbers of claims yi filed by the policyholder, the
exposure-to-risk ei ≤ 1 expressed in year, which is the duration of observation for
policy i, and eight features x i = (xi1 , . . . , xi8 ), namely
– xi1 = AgePh: policyholder’s age;
– xi2 = AgeCar: age of the car;
– xi3 = Fuel: fuel of the car, with two categories (gas or diesel);
– xi4 = Split: splitting of the premium, with four categories (annually, semi-annually,
quarterly or monthly);
– xi5 = Cover: extent of the coverage, with three categories (from compulsory third-
party liability cover to comprehensive);
– xi6 = Gender: policyholder’s gender, with two categories (female or male);
– xi7 = Use: use of the car, with two categories (private or professional);
– xi8 = PowerCat: the engine’s power, with five categories.
Table 3.4 provides the ten first observations of the dataset.
In Fig. 3.9, we give the number of policies with respect to the exposure-to-risk
expressed in months. As we can see, the majority of the policies has been observed
during the whole year of observation. Also, Fig. 3.10 displays the exposure-to-risk by
category/value for each of the eight features. Finally, Table 3.5 shows the observed
numbers of claims with their corresponding exposures-to-risk.
The observations are assumed to be independent and given X = x and the
exposure-to-risk e, the response Y is assumed to be Poisson distributed with expected
claim frequency eμ(x). So, μ(x i ) represents the expected annual claim frequency
for policyholder i.
We fit a tree with Poisson deviance as loss function on the whole dataset, meaning
that we do not isolate some observations to form a validation set. As stopping rule,
we require a minimum number of observations in terminal nodes equal to 5000.
The resulting tree is represented in Fig. 3.11. With such a stopping rule, we notice
that terminal nodes do not necessarily have the same depth. For instance, terminal
node 6) is part of the third generation of nodes while terminal node 39) is part of
the sixth one.
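A possible call for this fit is sketched below; the exposure-to-risk enters through the two-column response accepted by rpart's Poisson method (observation time first, number of events second), and the variable names are those of Table 3.4. The data frame name portfolio and the exact control settings are assumptions on our part, not the settings actually used to produce the reported tree.

library(rpart)
# Poisson tree with exposure-to-risk: response = cbind(time, events).
fit <- rpart(cbind(ExpoR, Y) ~ AgePh + AgeCar + Fuel + Split + Cover +
               Gender + Use + PowerCat,
             data = portfolio, method = "poisson",
             control = rpart.control(minbucket = 5000, cp = 0, xval = 10))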
Requiring at least 5000 observations within each terminal node is particularly
relevant in this example as it guarantees a reasonable accuracy for the predictions $\hat{c}_t$,
t ∈ T. Indeed, at portfolio level, the estimated expected annual claim frequency is
$\hat{\mu} = 13.9\%$ and the average exposure-to-risk is e = 0.89. One can then estimate a
confidence interval of two standard deviations for $\hat{\mu}$, given by
Table 3.4 Ten first observations of the dataset, where ExpoR designates the exposure-to-risk

Y AgePh AgeCar Fuel Split Cover Gender Use PowerCat ExpoR
1 1 50 12 Gasoline Monthly TPL.Only Male Private C2 1.000
2 0 64 3 Gasoline Yearly Limited.MD Female Private C2 1.000
3 0 60 10 Diesel Yearly TPL.Only Male Private C2 1.000
4 0 77 15 Gasoline Yearly TPL.Only Male Private C2 1.000
5 1 28 7 Gasoline Half-yearly TPL.Only Female Private C2 0.047
6 0 26 12 Gasoline Quarterly TPL.Only Male Private C2 1.000
7 1 26 8 Gasoline Half-yearly Comprehensive Male Private C2 1.000
8 0 58 14 Gasoline Quarterly TPL.Only Female Private C2 0.403
9 0 59 6 Gasoline Half-yearly Limited.MD Female Private C1 1.000
10 0 57 10 Gasoline Half-yearly Limited.MD Female Private C1 1.000

Fig. 3.9 Number of policies with respect to the exposure-to-risk expressed in months

Fig. 3.10 Categories/values of the predictors and their corresponding exposures-to-risk



Table 3.5 Descriptive statistics for the number of claims


Number of claims Exposure-to-risk
0 126 499.7
1 15 160.4
2 1424.9
3 145.4
4 14.3
5 1.4
≥6 0

Fig. 3.11 Tree with at least 5000 observations in terminal nodes

  

$$
\left[ \hat{\mu} - 2\sqrt{\frac{\hat{\mu}}{e \times 5000}},\ \hat{\mu} + 2\sqrt{\frac{\hat{\mu}}{e \times 5000}} \right]
= \left[ 13.9\% - 2\sqrt{\frac{13.9\%}{0.89 \times 5000}},\ 13.9\% + 2\sqrt{\frac{13.9\%}{0.89 \times 5000}} \right]
= [12.8\%, 15.0\%].
$$

Roughly speaking, one sees that we obtain a precision of 1% in the average annual
claim frequency $\hat{\mu}$, which can be considered as satisfactory here. Notice that if
we had selected 1000 for the minimum number of observations in the terminal
nodes, we would have obtained a precision of 2.5% in the average annual claim
frequency $\hat{\mu}$.
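The computation behind these two figures is easily reproduced. The small helper below simply evaluates the half-width 2·sqrt(μ̂/(e × n_min)) used above; the function name is ours.

# Half-width of a two-standard-deviation interval for the claim frequency
# estimated on n_min policies with average exposure-to-risk e_bar.
precision <- function(mu_hat, e_bar, n_min) 2 * sqrt(mu_hat / (e_bar * n_min))

precision(0.139, 0.89, 5000)   # about 0.011, i.e. a precision of roughly 1%
precision(0.139, 0.89, 1000)   # about 0.025, i.e. roughly 2.5%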

3.3 Right Sized Trees

We have seen several rules to declare a node t terminal. These rules have in common
that they stop the growth of the tree early. Another way to find the right sized tree
consists in fully developing the tree and then pruning it.
Henceforth, to ease the presentation, we denote a regression tree by T . A tree T
is defined by a set of splits together with the order in which they are used and the
predictions in the terminal nodes. When needed to make explicit the link with tree
T , we use the notation T(T ) for the corresponding set of indexes T of the terminal
nodes. In addition, we mean by |T | the number of terminal nodes of tree T , that is
|T(T ) |.
In order to define the pruning process, one needs to specify the notion of a tree
branch. A branch T(t) is the part of T that is composed of node t and all its descendant
nodes. Pruning a branch T(t) of a tree T means deleting from T all descendant nodes
of t. The resulting tree is denoted T − T(t). One says that T − T(t) is a pruned subtree
of T. For instance, in Fig. 3.12, we represent a tree T, its branch T(t2) and the subtree
T − T(t2) obtained from T by pruning the branch T(t2). More generally, a tree T′ that
is obtained from T by successively pruning branches is called a pruned subtree of
T, or simply a subtree of T, and is denoted T′ ⪯ T.
The first step of the pruning process is to grow the largest possible tree Tmax
by letting the splitting procedure continue until all terminal nodes contain either
observations with identical values for the features (i.e. terminal nodes t ∈ T where
xi = xj for all (yi, xi) and (yj, xj) such that xi, xj ∈ χt) or observations
with the same value for the response (i.e. terminal nodes t ∈ T where yi = yj for all
(yi, xi) and (yj, xj) such that xi, xj ∈ χt).
Notice that the initial tree Tinit can be smaller than Tmax . Indeed, let us assume that
the pruning process starting with the largest tree Tmax produces the subtree Tprune .
Then, the pruning process will always lead to the same subtree Tprune if we start with
any subtree Tinit of Tmax such that Tprune is a subtree of Tinit .
We thus start the pruning process with a sufficiently large tree Tinit . Then, the idea
of pruning Tinit consists in constructing a sequence of smaller and smaller trees

Tinit , T|Tinit |−1 , T|Tinit |−2 , . . . , T1 ,

where Tk is a subtree of Tinit with k terminal nodes, k = 1, . . . , |Tinit | − 1. In partic-


ular, T1 is only composed of the root node t0 of Tinit .

Fig. 3.12 Example of a branch T(t2) for a tree T as well as the resulting subtree T − T(t2) when pruning T(t2)

Let us denote by C(Tinit, k) the class of all subtrees of Tinit having k terminal
nodes. An intuitive procedure to produce the sequence of trees Tinit, T|Tinit|−1, . . . , T1
is to select, for every k = 1, . . . , |Tinit| − 1, the tree Tk which minimizes the deviance
$D\left((\hat{c}_t)_{t\in T_{(T)}}\right)$ among all subtrees T of Tinit with k terminal nodes, that is
$$
D\left((\hat{c}_t)_{t\in T_{(T_k)}}\right) = \min_{T \in C(T_{\text{init}}, k)} D\left((\hat{c}_t)_{t\in T_{(T)}}\right).
$$

Thus, for every k = 1, . . . , |Tinit | − 1, Tk is the best subtree of Tinit with k terminal
nodes according to the deviance loss function.
It is natural to use the deviance in comparing subtrees with the same number
of terminal nodes. However, the deviance is not helpful for comparing the subtrees
Tinit, T|Tinit|−1, T|Tinit|−2, . . . , T1. Indeed, as noticed in Sect. 3.2.3, if Tk is a subtree of
T′k+1 ∈ C(Tinit, k + 1), we necessarily have

 
$$
D\left((\hat{c}_t)_{t\in T_{(T_k)}}\right) \geq D\left((\hat{c}_t)_{t\in T_{(T'_{k+1})}}\right) \geq D\left((\hat{c}_t)_{t\in T_{(T_{k+1})}}\right).
$$

Therefore, the selection of the best subtree Tprune among the sequence

Tinit , T|Tinit |−1 , T|Tinit |−2 , . . . , T1

that is based on the deviance, or equivalently on the training sample estimate of the
generalization error, will always lead to the largest tree Tinit .
That is why the generalization errors of the pruned subtrees

Tinit , T|Tinit |−1 , T|Tinit |−2 , . . . , T1

should be estimated on a validation set in order to determine Tprune . The choice of


Tprune is also often done by cross-validation, as we will see in the following.
Such an intuitive procedure for constructing the sequence of trees

Tinit , T|Tinit |−1 , T|Tinit |−2 , . . . , T1

has some drawbacks. One of them is to produce subtrees of Tinit that are not nested,
meaning that subtree Tk is not necessarily a subtree of Tk+1 . Hence, a node t of Tinit
can reappear in tree Tk while it was cut off in tree Tk+1 .
That is why the minimal cost-complexity pruning presented hereafter is usually
preferred.

3.3.1 Minimal Cost-Complexity Pruning

We define the cost-complexity measure of a tree T as
$$
R_\alpha(T) = D\left((\hat{c}_t)_{t\in T_{(T)}}\right) + \alpha |T|,
$$
where the parameter α is a positive real number. The number of terminal nodes |T|
is called the complexity of the tree T. Thus, the cost-complexity measure Rα(T)
is a combination of the deviance $D\left((\hat{c}_t)_{t\in T_{(T)}}\right)$ and a penalty for the complexity of
the tree, α|T|. The parameter α can be interpreted as the increase in the penalty for
having one more terminal node.
When we increase by one the number of terminal nodes of a tree T by splitting
one of its terminal nodes t into two children nodes tL and tR, then we know that the
deviance of the resulting tree T′ is smaller than the deviance of the original tree T,
that is
$$
D\left((\hat{c}_t)_{t\in T_{(T')}}\right) \leq D\left((\hat{c}_t)_{t\in T_{(T)}}\right).
$$

The deviance will always favor the more complex tree T′ over T.
By introducing a penalty for the complexity of the tree, the cost-complexity measure
may now prefer the original tree T over the more complex one T′. Indeed, the
cost-complexity measure of T′ can be written as
$$
R_\alpha(T') = D\left((\hat{c}_t)_{t\in T_{(T')}}\right) + \alpha |T'|
            = D\left((\hat{c}_t)_{t\in T_{(T')}}\right) + \alpha (|T| + 1)
            = R_\alpha(T) + D\left((\hat{c}_t)_{t\in T_{(T')}}\right) - D\left((\hat{c}_t)_{t\in T_{(T)}}\right) + \alpha.
$$
Hence, depending on the value of α, the more complex tree T′ may have a higher
cost-complexity measure than T. We have Rα(T′) ≥ Rα(T) if and only if
$$
\alpha \geq D\left((\hat{c}_t)_{t\in T_{(T)}}\right) - D\left((\hat{c}_t)_{t\in T_{(T')}}\right). \qquad (3.3.1)
$$

In words, Rα(T′) ≥ Rα(T) if and only if the deviance reduction that we get by
producing tree T′ is smaller than the increase in the penalty for having one more
terminal node. In such cases, the deviance reduction is not sufficient to compensate
the resulting increase in the penalty.
Therefore, using the cost-complexity measure for model selection may now lead
to choosing the simpler tree T over T′. If the value of α is such that condition (3.3.1)
is fulfilled, then tree T will be preferred over T′. Otherwise, in case the value of
α does not satisfy condition (3.3.1), the more complex tree T′ will be preferred,
meaning that the deviance reduction is higher than the corresponding increase in the
penalty. In particular, this is the case when α = 0, the cost-complexity measure then
coinciding with the deviance.
Let Tinit be the large tree that is to be pruned to the right sized tree Tprune. Of
course, Tinit may correspond to the largest possible tree Tmax. For a fixed value of α,
we can now define T(α) as the subtree of Tinit that minimizes the cost-complexity
measure Rα(T), namely
$$
R_\alpha(T(\alpha)) = \min_{T \preceq T_{\text{init}}} R_\alpha(T). \qquad (3.3.2)
$$
Hence, at this value of α, there is no subtree of Tinit with lower cost-complexity
measure than T(α).
When the penalty α per terminal node is small, the penalty α|T | for having a
large tree is small as well so that T (α) will be large. For instance, if α = 0, Rα (T )
coincides with the deviance such that the largest tree Tinit minimizes Rα (T ). As the
parameter α increases, the penalty for having a large tree increases and the subtrees
T (α) will have less and less terminal nodes. Finally, for sufficiently large values of
α, the subtree T (α) will consist of the root node only.
Because there can be more than one subtree of Tinit minimizing Rα(T), we complement
condition (3.3.2) with the following one:
$$
\text{If } R_\alpha(T) = R_\alpha(T(\alpha)), \text{ then } T(\alpha) \preceq T. \qquad (3.3.3)
$$



This additional condition says that if there is a tie, namely more than one subtree of
Tinit minimizing Rα (T ), then we select for T (α) the smallest tree, that is the one that
is a subtree of all others satisfying (3.3.2). The resulting subtrees T (α) are called the
smallest minimizing subtrees.
It is obvious that for every value of α there is at least one subtree of Tinit that
minimizes Rα (T ) since there are only finitely many pruned subtrees of Tinit . However,
it is not clear whether the additional condition (3.3.3) can be met for every value of
α. Indeed, this says that we cannot have two subtrees that minimize Rα (T ) such that
neither is a subtree of the other. This is guaranteed by the next proposition.

Proposition 3.3.1 For every value of α, there exists a smallest minimizing subtree
T (α).

Proof We refer the interested reader to Sect. 10.2 (Theorem 10.7) in Breiman et al.
(1984). 

Thus, for every value of α ≥ 0, there exists a unique subtree T(α) of Tinit that
minimizes Rα(T) and which satisfies T(α) ⪯ T for all subtrees T minimizing Rα(T).
The large tree Tinit has only a finite number of subtrees. Hence, even if α goes
from zero to infinity in a continuous way, the set of the smallest minimizing subtrees
{T (α)}α≥0 only contains a finite number of subtrees of Tinit .
Let α = 0. We start from Tinit and we find any pair of terminal nodes with a
common parent node t such that the branch $T_{\text{init}}^{(t)}$ can be pruned without increasing
the cost-complexity measure. We continue until we can no longer find such a pair, in
order to obtain a subtree of Tinit with the same cost-complexity measure as Tinit for
α = 0. We define α0 = 0 and we denote by Tα0 the resulting subtree of Tinit.
When α increases, it may become optimal to prune the branch $T_{\alpha_0}^{(t)}$ for a certain
node t of Tα0, meaning that the smaller tree $T_{\alpha_0} - T_{\alpha_0}^{(t)}$ becomes better than Tα0. This
will be the case once α is high enough to have
$$
R_\alpha(T_{\alpha_0}) \geq R_\alpha\left(T_{\alpha_0} - T_{\alpha_0}^{(t)}\right).
$$

The deviance of Tα0 can be written as
$$
D\left((\hat{c}_s)_{s\in T_{(T_{\alpha_0})}}\right) = \sum_{s\in T_{(T_{\alpha_0})}} D_{\chi_s}(\hat{c}_s)
= \sum_{s\in T_{(T_{\alpha_0}-T_{\alpha_0}^{(t)})}} D_{\chi_s}(\hat{c}_s) + \sum_{s\in T_{(T_{\alpha_0}^{(t)})}} D_{\chi_s}(\hat{c}_s) - D\left((\hat{c}_s)_{s\in\{t\}}\right)
= D\left((\hat{c}_s)_{s\in T_{(T_{\alpha_0}-T_{\alpha_0}^{(t)})}}\right) + D\left((\hat{c}_s)_{s\in T_{(T_{\alpha_0}^{(t)})}}\right) - D\left((\hat{c}_s)_{s\in\{t\}}\right).
$$
Furthermore, we have
$$
|T_{\alpha_0}| = |T_{\alpha_0} - T_{\alpha_0}^{(t)}| + |T_{\alpha_0}^{(t)}| - 1.
$$

The cost-complexity measure Rα(Tα0) can then be rewritten as
$$
R_\alpha(T_{\alpha_0}) = R_\alpha\left(T_{\alpha_0} - T_{\alpha_0}^{(t)}\right) + D\left((\hat{c}_s)_{s\in T_{(T_{\alpha_0}^{(t)})}}\right) + \alpha |T_{\alpha_0}^{(t)}| - D\left((\hat{c}_s)_{s\in\{t\}}\right) - \alpha. \qquad (3.3.4)
$$
Thus, we have $R_\alpha(T_{\alpha_0}) \geq R_\alpha\left(T_{\alpha_0} - T_{\alpha_0}^{(t)}\right)$ if and only if
$$
D\left((\hat{c}_s)_{s\in T_{(T_{\alpha_0}^{(t)})}}\right) + \alpha |T_{\alpha_0}^{(t)}| \geq D\left((\hat{c}_s)_{s\in\{t\}}\right) + \alpha. \qquad (3.3.5)
$$
The left-hand side of (3.3.5) is the cost-complexity measure of the branch $T_{\alpha_0}^{(t)}$
while the right-hand side is the cost-complexity measure of the node t. Therefore, it
becomes optimal to cut the branch $T_{\alpha_0}^{(t)}$ once its cost-complexity measure $R_\alpha\left(T_{\alpha_0}^{(t)}\right)$
becomes higher than the cost-complexity measure $R_\alpha(t)$ of its root node t. This
happens for values of α satisfying
$$
\alpha \geq \frac{D\left((\hat{c}_s)_{s\in\{t\}}\right) - D\left((\hat{c}_s)_{s\in T_{(T_{\alpha_0}^{(t)})}}\right)}{|T_{\alpha_0}^{(t)}| - 1}. \qquad (3.3.6)
$$

We denote by $\alpha_1^{(t)}$ the right-hand side of (3.3.6). For each non-terminal node t of Tα0
we can compute $\alpha_1^{(t)}$. For a tree T, let us denote by $\widetilde{T}_{(T)}$ the set of its non-terminal
nodes. Then, we define the weakest links of Tα0 as the non-terminal nodes t for which
$\alpha_1^{(t)}$ is the smallest and we denote by α1 this minimum value, i.e.
$$
\alpha_1 = \min_{t \in \widetilde{T}_{(T_{\alpha_0})}} \alpha_1^{(t)}.
$$

Cutting any branch of Tα0 is not optimal as long as α < α1 . However, once the
parameter α reaches the value α1 , it becomes preferable to prune Tα0 at its weakest
links. The resulting tree is then denoted Tα1 . Notice that it is appropriate to prune the
tree Tα0 by exploring the nodes in top-down order. In such a way, we avoid cutting
in a node t that will disappear later on when pruning Tα0 .
Now, we repeat the same process for Tα1. Namely, for a non-terminal node t of
Tα1, it will be preferable to cut the branch $T_{\alpha_1}^{(t)}$ when
$$
\alpha \geq \frac{D\left((\hat{c}_s)_{s\in\{t\}}\right) - D\left((\hat{c}_s)_{s\in T_{(T_{\alpha_1}^{(t)})}}\right)}{|T_{\alpha_1}^{(t)}| - 1}. \qquad (3.3.7)
$$

We denote by $\alpha_2^{(t)}$ the right-hand side of (3.3.7) and we define
$$
\alpha_2 = \min_{t \in \widetilde{T}_{(T_{\alpha_1})}} \alpha_2^{(t)}.
$$

The non-terminal nodes t of Tα1 for which α2(t) = α2 are called the weakest links of
Tα1 and it becomes better to cut in these nodes once α reaches the value α2 in order
to produce Tα2 .
Then we continue the same process for Tα2 , and so on until we reach the root node
{t0 }. Finally, we come up with the sequence of trees

Tα0 , Tα1 , Tα2 , . . . , Tακ = {t0 }.

In the sequence, we can obtain the next tree by pruning the current one.
The next proposition makes the link between this sequence of trees and the smallest
minimizing subtrees T (α).

Proposition 3.3.2 We have 0 = α0 < α1 < . . . < ακ < ακ+1 = ∞. Furthermore,


for all k = 0, 1, . . . , κ, we have

T (α) = Tαk for all α ∈ [αk , αk+1 ).

Proof We refer the interested reader to Sect. 10.2 in Breiman et al. (1984). 

This result is important since it gives the instructions to find the smallest minimizing
subtrees T(α). It suffices to apply the recursive pruning steps described
above. This recursive procedure is an efficient algorithm, any tree in the sequence
being obtained by pruning the previous one. Hence, this algorithm only requires
considering a small fraction of the total possible subtrees of Tinit.
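The first step of this algorithm, namely the computation of the critical values α^(t) of (3.3.6) for each non-terminal node, can be sketched directly from an rpart object. The sketch below is ours and only illustrates one pruning step (iterating it would produce the whole sequence); it uses the fact that rpart stores the node deviances in tree$frame$dev, marks terminal nodes with "<leaf>" in tree$frame$var, and numbers the children of node k as 2k and 2k + 1.

# Critical alpha value of each non-terminal node (weakest links = smallest values).
weakest_link_alphas <- function(tree) {
  fr  <- tree$frame
  ids <- as.integer(rownames(fr))
  # number of leaves and total leaf deviance of the branch rooted at node k
  branch <- function(k) {
    if (fr$var[ids == k] == "<leaf>")
      return(list(nleaves = 1, dev = fr$dev[ids == k]))
    l <- branch(2 * k)
    r <- branch(2 * k + 1)
    list(nleaves = l$nleaves + r$nleaves, dev = l$dev + r$dev)
  }
  internal <- ids[fr$var != "<leaf>"]
  alphas <- sapply(internal, function(k) {
    b <- branch(k)
    (fr$dev[ids == k] - b$dev) / (b$nleaves - 1)
  })
  sort(setNames(alphas, internal))
}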
The cost-complexity measure Rα(T) is given by
$$
R_\alpha(T) = D\left((\hat{c}_t)_{t\in T_{(T)}}\right) + \alpha |T|,
$$
where the parameter α is called the regularization parameter. The parameter α has
the same unit as the deviance. It is often convenient to normalize the regularization
parameter by the deviance of the root tree. We define the cost-complexity parameter
as
$$
cp = \frac{\alpha}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)},
$$
where $D\left((\hat{c}_t)_{t\in\{t_0\}}\right)$ is the deviance of the root tree. The cost-complexity measure
Rα(T) can then be rewritten as
$$
R_\alpha(T) = D\left((\hat{c}_t)_{t\in T_{(T)}}\right) + \alpha |T| = D\left((\hat{c}_t)_{t\in T_{(T)}}\right) + cp\, D\left((\hat{c}_t)_{t\in\{t_0\}}\right) |T|. \qquad (3.3.8)
$$

The sequence αk, k = 0, . . . , κ, defines the sequence cpk, k = 0, . . . , κ, for the
cost-complexity parameters, with
$$
cp_k = \frac{\alpha_k}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)}.
$$

3.3.1.1 Example

Consider the simulated dataset presented in Sect. 3.2.4.1. We use as Tinit the tree
depicted in Fig. 3.7, where a maximum depth of four has been used as stopping rule.
The right sized tree Tprune is shown in Fig. 3.8 and is a subtree of Tinit. The initial tree
Tinit is therefore large enough to start the pruning process.
Table 3.2 presents the decrease of the deviance ΔDχt at each non-terminal node
t of the initial tree Tinit. The smallest decrease of the deviance is observed for node
11), also denoted t10. Here, node t10 is the weakest node of Tinit. It becomes optimal
to cut the branch $T_{\text{init}}^{(t_{10})}$ for values of α satisfying
$$
\alpha \geq \frac{D\left((\hat{c}_s)_{s\in\{t_{10}\}}\right) - D\left((\hat{c}_s)_{s\in T_{(T_{\text{init}}^{(t_{10})})}}\right)}{|T_{\text{init}}^{(t_{10})}| - 1}
= \Delta D_{\chi_{t_{10}}} = 0.3254482.
$$

Therefore, we get α1 = 0.3254482 and hence, since $D\left((\hat{c}_t)_{t\in\{t_0\}}\right) = 279\,043.30$,
$$
cp_1 = \frac{\alpha_1}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)} = \frac{0.3254482}{279\,043.30} = 1.1663 \times 10^{-6}.
$$

Tree Tα1 is depicted in Fig. 3.13. Terminal nodes 22) and 23) of Tinit disappeared
in Tα1 and node 11) becomes a terminal node.
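In rpart, the same subtree can be obtained with the prune function by supplying any cp value lying strictly between cp1 and the next threshold cp2 computed below; the object name tree_init (the maximum-depth-four tree used as Tinit) is an assumption on our part.

# Pruning T_init at a cp value between cp_1 and cp_2 yields the tree T_alpha1.
tree_alpha1 <- prune(tree_init, cp = 1.2e-06)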
The weakest node of Tα1 is node 10), or t9, with a decrease of the deviance
ΔDχt9 = 1.13. We cut the branch $T_{\alpha_1}^{(t_9)}$ once α satisfies
$$
\alpha \geq \frac{D\left((\hat{c}_s)_{s\in\{t_9\}}\right) - D\left((\hat{c}_s)_{s\in T_{(T_{\alpha_1}^{(t_9)})}}\right)}{|T_{\alpha_1}^{(t_9)}| - 1}
= \Delta D_{\chi_{t_9}} = 1.135009.
$$
We then have α2 = 1.135009 and



Fig. 3.13 Tree Tα1

$$
cp_2 = \frac{\alpha_2}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)} = \frac{1.135009}{279\,043.30} = 4.0676 \times 10^{-6}.
$$

Tree Tα2 is shown in Fig. 3.14, where node 10) is now terminal.
In tree Tα2, the smallest decrease of the deviance is at node 9). We have
$$
\alpha_3 = \frac{D\left((\hat{c}_s)_{s\in\{t_8\}}\right) - D\left((\hat{c}_s)_{s\in T_{(T_{\alpha_2}^{(t_8)})}}\right)}{|T_{\alpha_2}^{(t_8)}| - 1}
= \Delta D_{\chi_{t_8}} = 1.495393
$$
and
$$
cp_3 = \frac{\alpha_3}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)} = \frac{1.495393}{279\,043.30} = 5.3590 \times 10^{-6}.
$$

Tree Tα3 is shown in Fig. 3.15, where node 9) becomes a terminal node.

The smallest decrease of the deviance in Tα3 is at node 8). We have

Fig. 3.14 Tree Tα2

 

$$
\alpha_4 = \frac{D\left((\hat{c}_s)_{s\in\{t_7\}}\right) - D\left((\hat{c}_s)_{s\in T_{(T_{\alpha_3}^{(t_7)})}}\right)}{|T_{\alpha_3}^{(t_7)}| - 1}
= \Delta D_{\chi_{t_7}} = 3.515387
$$
and
$$
cp_4 = \frac{\alpha_4}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)} = \frac{3.515387}{279\,043.30} = 1.2598 \times 10^{-5}.
$$

Tree Tα4 is presented in Fig. 3.16, node 8) being now terminal.

Tree Tα4 corresponds to the right sized tree Tprune shown in Fig. 3.8. We then stop
the pruning process here.
Of course, in situations of practical relevance, the right sized tree is usually not
known. The pruning process is then stopped once we reach the root node t0.

Fig. 3.15 Tree Tα3

3.3.2 Choice of the Best Pruned Tree

Once the sequence of trees Tα0, Tα1, Tα2, . . . , Tακ = {t0} has been built, we need to
select the best pruned tree. One way to proceed is to rely on cross-validation. We set
$$
\tilde{\alpha}_k = \sqrt{\alpha_k \alpha_{k+1}}, \qquad k = 0, 1, \ldots, \kappa,
$$
where α̃k is considered as a typical value for [αk, αk+1) and hence as the value
corresponding to Tαk. The parameter α̃k corresponds to the geometric mean of αk
and αk+1. Notice that α̃0 = 0 and α̃κ = ∞ since α0 = 0 and ακ+1 = ∞.
In K-fold cross-validation, the training set D is partitioned into K subsets
D1, D2, . . . , DK of roughly equal size and we label by Ij ⊂ I the observations in Dj
for all j = 1, . . . , K. The jth training set is defined as D\Dj, j = 1, . . . , K, so that
it contains the observations of the training set D that are not in Dj. For each training
set D\Dj, we build the corresponding sequence of smallest minimizing subtrees
$T^{(j)}(\tilde{\alpha}_0), T^{(j)}(\tilde{\alpha}_1), \ldots, T^{(j)}(\tilde{\alpha}_\kappa)$, starting from a sufficiently large tree $T_{\text{init}}^{(j)}$. Each
tree $T^{(j)}(\tilde{\alpha}_k)$ of the sequence provides a partition $\{\chi_t\}_{t\in T_{(T^{(j)}(\tilde{\alpha}_k))}}$ of the feature space
χ and predictions $(\hat{c}_t)_{t\in T_{(T^{(j)}(\tilde{\alpha}_k))}}$ on that partition, computed on the training set D\Dj.
Since the observations in D j have not been used to build the trees

0.13
66e+3 / 500e+3
100%
yes Age >= 44 no

0.15
41e+3 / 281e+3
56%

2
Age >= 30

0.11
25e+3 / 219e+3
44%

Sport = no
5 7

0.12 0.16
13e+3 / 109e+3 20e+3 / 125e+3
22% 25%

4
Gender = female 6
Sport = no

0.1 0.14
11e+3 / 110e+3 21e+3 / 156e+3
22% 31%

Gender = female Sport = no


13 15

0.14 0.17
11e+3 / 78e+3 11e+3 / 62e+3
16% 12%

12
Gender = female 14
Gender = female
0.13 0.15
9871 / 78e+3 9333 / 62e+3
16% 12%

Gender = female Gender = female

9 11 25 27 29 31

0.11 0.13 0.13 0.15 0.16 0.18


5943 / 55e+3 7023 / 55e+3 5168 / 39e+3 5963 / 39e+3 4882 / 31e+3 5527 / 31e+3
11% 11% 8% 8% 6% 6%

8 10 24 26 28 30

0.1 0.12 0.12 0.14 0.14 0.16


5512 / 55e+3 6351 / 55e+3 4703 / 39e+3 5342 / 39e+3 4451 / 31e+3 5013 / 31e+3
11% 11% 8% 8% 6% 6%

Fig. 3.16 Tree Tα4

$T^{(j)}(\tilde{\alpha}_0), T^{(j)}(\tilde{\alpha}_1), \ldots, T^{(j)}(\tilde{\alpha}_\kappa)$,
Dj can play the role of validation set for those trees. If we denote by $\hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}$ the
model produced by tree $T^{(j)}(\tilde{\alpha}_k)$, that is
$$
\hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}(x) = \sum_{t\in T_{(T^{(j)}(\tilde{\alpha}_k))}} \hat{c}_t \, I[x \in \chi_t],
$$
then an estimate of the generalization error of $\hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}$ on Dj is given by
$$
\widehat{\text{Err}}^{\text{val}}\left(\hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}\right) = \frac{1}{|I_j|} \sum_{i\in I_j} L\left(y_i, \hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}(x_i)\right).
$$

So, the K −fold cross-validation estimate of the generalization error for the regular-
ization parameter α̃k is given by

$$
\widehat{\text{Err}}^{\text{CV}}(\tilde{\alpha}_k) = \sum_{j=1}^{K} \frac{|I_j|}{|I|}\, \widehat{\text{Err}}^{\text{val}}\left(\hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}\right)
= \frac{1}{|I|} \sum_{j=1}^{K} \sum_{i\in I_j} L\left(y_i, \hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}(x_i)\right).
$$

According to the minimum cross-validation error principle, the right sized tree
Tprune is then selected as the tree $T_{\alpha_{k^*}}$ of the sequence Tα0, Tα1, Tα2, . . . , Tακ such that
$$
\widehat{\text{Err}}^{\text{CV}}(\tilde{\alpha}_{k^*}) = \min_{k\in\{0,1,\ldots,\kappa\}} \widehat{\text{Err}}^{\text{CV}}(\tilde{\alpha}_k).
$$
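With rpart, these cross-validation errors are computed automatically when the tree is grown (the number of folds is set by the xval argument of rpart.control) and stored in the cptable component of the fitted object, so that the selection of the best pruned tree can be automated. A sketch, assuming a sufficiently large fitted tree tree_init:

# Prune at the cp value whose cross-validated error is minimal.
cp_table <- tree_init$cptable        # columns CP, nsplit, rel error, xerror, xstd
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
tree_prune <- prune(tree_init, cp = best_cp)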

3.3.2.1 Example 1

Consider the simulated dataset of Sect. 3.2.4.1. Starting with the same initial tree Tinit
as in the example of Sect. 3.3.1.1, we have

α1 = 0.3254482, α2 = 1.135009, α3 = 1.495393, α4 = 3.515387, . . .

so that
α̃1 = 0.6077719, α̃2 = 1.302799, α̃3 = 2.29279, . . .

The values of cpk and hence of αk , k = 1, . . . , κ, can be obtained from the R


command printcp() so that we get
CP nsplit rel error xerror xstd
1 3.5920e−03 0 1.00000 1.00000 0.0023034
2 9.3746e−04 1 0.99641 0.99642 0.0022958
3 5.6213e−04 2 0.99547 0.99549 0.0022950
4 3.1509e−04 3 0.99491 0.99493 0.0022945
5 2.6391e−04 4 0.99459 0.99470 0.0022944
6 1.1682e−04 5 0.99433 0.99437 0.0022934
7 1.1658e−04 6 0.99421 0.99436 0.0022936
8 9.5494e−05 7 0.99410 0.99424 0.0022936
9 8.4238e−05 8 0.99400 0.99413 0.0022938
10 7.7462e−05 9 0.99392 0.99406 0.0022936
11 6.0203e−05 10 0.99384 0.99396 0.0022936
12 1.2598e−05 11 0.99378 0.99386 0.0022932
13 5.3590e−06 12 0.99377 0.99392 0.0022936
14 4.0676e−06 13 0.99376 0.99393 0.0022936
15 1.1663e−06 14 0.99376 0.99395 0.0022937
16 0.0000e+00 15 0.99376 0.99395 0.0022937

The column CP contains the value of the complexity parameter cpk , k = 0, 1, . . . , κ.


From bottom to top, we have cp0 = 0, cp1 = 1.1663 10−6 , cp2 = 4.0676 10−6 ,
cp3 = 5.3590 10−6 , cp4 = 1.2598 10−5 and so on up to cp15 = 3.5920 10−3 so that
κ = 15. Of course, the values for cp1 up to cp4 coincide with those computed in exam-

ple of Sect. 3.3.1.1.


The value of αk is obtained by multiplying cpk by the deviance
of the root tree D ( ct )t∈{t0 } = 279 043.30.
The second column nsplit gives the number of splits of the corresponding tree
Tαk. One sees that Tα0 = Tinit has 15 splits, Tα1 has 14 splits and so on down to the
root tree Tα15 = {t0} which has 0 splits.
The third column rel error provides the ratio between the deviance of Tαk
and the deviance of the root tree {t0}, namely
$$
\text{rel error}_k = \frac{D\left((\hat{c}_t)_{t\in T_{(T_{\alpha_k})}}\right)}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)}.
$$

This relative error is a training sample estimate. It starts from the smallest value
rel error0 = 0.99376 and increases up to rel error15 = 1 as the trees become
smaller and smaller down to the root tree. Obviously, choosing the optimal tree based
on the relative error would always favor the largest tree.
Turning to the fourth column xerror, it provides the relative 10-fold cross-validation
error estimate for parameter α̃k, namely
$$
\text{xerror}_k = \frac{|I|\, \widehat{\text{Err}}^{\text{CV}}(\tilde{\alpha}_k)}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)},
$$

where, in this example, |I| = n = 500 000. For instance, we have xerror0 =
0.99395, xerror1 = 0.99395 and so on up to xerror15 = 1. Hence, using the
minimum cross-validation error principle, the right-sized tree turns out to be Tα4
with a relative 10-fold cross-validation error equal to 0.99386. The dataset being
simulated, we know that Tα4 shown in Fig. 3.16 indeed coincides with the best pos-
sible tree.
The last column xstd will be commented on in the next example.

3.3.2.2 Example 2

Consider the real dataset described in Sect. 3.2.4.2. The tree depicted in Fig. 3.11
with 17 terminal nodes, that has been obtained by requiring a minimum number of
observations in terminal nodes equal to 5000, is used as Tinit . We get the following
results:
CP nsplit rel error xerror xstd
1 9.3945e−03 0 1.00000 1.00003 0.0048524
2 4.0574e−03 1 0.99061 0.99115 0.0047885
3 2.1243e−03 2 0.98655 0.98719 0.0047662
4 1.4023e−03 3 0.98442 0.98524 0.0047565
5 5.1564e−04 4 0.98302 0.98376 0.0047441
6 4.5896e−04 5 0.98251 0.98344 0.0047427
7 4.1895e−04 6 0.98205 0.98312 0.0047437

8 2.3144e−04 7 0.98163 0.98254 0.0047440


9 1.9344e−04 8 0.98140 0.98244 0.0047428
10 1.6076e−04 9 0.98120 0.98233 0.0047405
11 1.5651e−04 10 0.98104 0.98225 0.0047400
12 1.4674e−04 11 0.98089 0.98225 0.0047400
13 1.0746e−04 12 0.98074 0.98213 0.0047390
14 8.6425e−05 13 0.98063 0.98200 0.0047385
15 8.6149e−05 14 0.98055 0.98189 0.0047374
16 5.3416e−05 15 0.98046 0.98197 0.0047391
17 0.0000e+00 16 0.98041 0.98204 0.0047396

The sequence of trees Tα0 , Tα1 , . . . , Tακ is composed of 17 trees, namely κ = 16.
The minimum 10-fold cross-validation error is equal to 0.98189 and corresponds to
tree Tα2 with a complexity parameter cp2 = 8.6149 10−5 and 14 splits. Tα2 is shown
in Fig. 3.17. So, it is the tree with the minimum 10-fold cross-validation error when
requiring at least 5000 observations in the terminal nodes. Notice that Tα2 with 15
terminal nodes is relatively big compared to Tinit with 17 terminal nodes. The reason
is that any terminal node of the initial tree must contain at least 5000 observations
so that the size of Tinit is limited.
For any α̃k , k = 0, 1, . . . , κ, the standard deviation estimate

Fig. 3.17 Tree with minimum cross-validation error when requiring at least 5000 observations in the terminal nodes

Fig. 3.18 Relative cross-validation error xerror_k together with the relative standard error xstd_k

 CV 1/2

V ar Err (α̃k )

can be estimated empirically over the K estimates of the generalization error. The
last column xstd provides an estimate of the relative standard error of the cross-
validation error, namely
 CV 1/2
|I| 
V ar  Err (α̃k )
xstdk = ,
D (ct )t∈{t0 }

where, in this example, |I| = n = 160 944. Figure 3.18 shows the relative cross-
validation error xerrork together with the relative standard error xstdk for each
tree Tαk . From right to left, we start with Tα0 which corresponds to a complexity
parameter equal to 0 and 16 splits to end with the root node tree Tα16 with 0 split.
As we can see, the tree of the sequence with only 3 splits, namely Tα13 , is within one
standard deviation (SD) of the tree Tα2 with the minimum cross-validation error. The
1-SD rule consists in selecting the smallest tree that is within one standard deviation
of the tree that minimizes the cross-validation error. This rule recognizes that there
is some uncertainty in the estimate of the cross-validation error and chooses the
simplest tree whose accuracy is still judged acceptable. Hence, according to the 1-
SD rule, the tree selected is Tαk ∗∗ where k ∗∗ is the maximum k ∈ {k ∗ , k ∗ + 1, . . . , κ}
satisfying
\widehat{\mathrm{Err}}^{\,\mathrm{CV}}(\tilde{\alpha}_k) \le \widehat{\mathrm{Err}}^{\,\mathrm{CV}}(\tilde{\alpha}_{k^*}) + \left(\widehat{\mathrm{Var}}\big[\widehat{\mathrm{Err}}^{\,\mathrm{CV}}(\tilde{\alpha}_{k^*})\big]\right)^{1/2}.

Fig. 3.19 Tree Tα_{k**} = Tα_13

In our case, Tα_{k**} = Tα13, which is depicted in Fig. 3.19. Compared to tree Tα_{k*} = Tα2, which minimizes the cross-validation error, Tα13 is much simpler. The number of terminal nodes decreases by 11, going from 15 in Tα2 to 4 in Tα13.
As a result, the minimum cross-validation principle selects Tα2 while the 1-SD
rule chooses Tα13 . Both trees can be compared on a validation set by computing their
respective generalization errors.

3.3.2.3 Example 3

Consider once again the real dataset described in Sect. 3.2.4.2. This time, a stratified random split of the available dataset L is performed to get a training set D and a validation set D̄. In this example, the training set D is composed of 80% of the observations of the learning set L and the validation set D̄ of the remaining 20% of the observations.
The initial tree Tinit is grown on the training set by requiring a minimum number
of observations in terminal nodes equal to 4000 = 80% × 5000. The resulting tree
Tinit is shown in Fig. 3.20 and differs from the previous initial tree obtained on the whole dataset L at only one split near the bottom. Conducting the pruning process,
we get the following results that are also illustrated in Fig. 3.21:

CP nsplit rel error xerror xstd


1 9.1497e−03 0 1.00000 1.00001 0.0054547
2 3.7595e−03 1 0.99085 0.99160 0.0053964
3 2.2473e−03 2 0.98709 0.98853 0.0053797
4 1.5067e−03 3 0.98484 0.98607 0.0053688
5 6.0567e−04 4 0.98334 0.98417 0.0053489
6 4.6174e−04 5 0.98273 0.98362 0.0053447
7 4.5173e−04 6 0.98227 0.98337 0.0053502
8 2.1482e−04 7 0.98182 0.98301 0.0053520
9 2.0725e−04 9 0.98139 0.98304 0.0053529
10 1.6066e−04 10 0.98118 0.98297 0.0053534
11 1.5325e−04 11 0.98102 0.98293 0.0053548
12 1.1692e−04 12 0.98087 0.98294 0.0053558
13 9.0587e−05 13 0.98075 0.98302 0.0053580
14 8.4270e−05 15 0.98057 0.98292 0.0053562
15 0.0000e+00 16 0.98048 0.98285 0.0053555

The tree that minimizes the cross-validation error is the initial tree Tα0, which means that requiring at least 4000 observations in the terminal nodes already protects against overfitting. The 1-SD rule selects the tree Tα11 with only 3 splits, depicted in Fig. 3.22.
Fig. 3.20 Initial tree when requiring at least 4000 observations in the terminal nodes

Fig. 3.21 Relative cross-validation error xerror_k together with the relative standard error xstd_k

Fig. 3.22 Tree Tα_{k**} = Tα_11

In order to compare both trees Tα_{k*} = Tα0 and Tα_{k**} = Tα11, we estimate their respective generalization errors on the validation set D̄. Remember that the validation sample estimate of the generalization error for a tree T fitted on the training set is given by

\widehat{\mathrm{Err}}^{\,\mathrm{val}}(\hat{\mu}_T) = \frac{1}{|\overline{\mathcal{I}}|} \sum_{i \in \overline{\mathcal{I}}} L\big(y_i, \hat{\mu}_T(x_i)\, e_i\big),

where \hat{\mu}_T corresponds to the model induced by tree T. We get

\widehat{\mathrm{Err}}^{\,\mathrm{val}}\big(\hat{\mu}_{T_{\alpha_{k^*}}}\big) = 0.5452772

and

\widehat{\mathrm{Err}}^{\,\mathrm{val}}\big(\hat{\mu}_{T_{\alpha_{k^{**}}}}\big) = 0.5464333.

The tree Tαk ∗ that minimizes the cross-validation error is also the one that minimizes
the generalization error estimated on the validation set. Hence, Tαk ∗ has the best
predictive accuracy compared to Tαk ∗∗ and is thus judged as the best tree.
The difference between both generalization error estimates \widehat{\mathrm{Err}}^{\,\mathrm{val}}\big(\hat{\mu}_{T_{\alpha_{k^*}}}\big) and \widehat{\mathrm{Err}}^{\,\mathrm{val}}\big(\hat{\mu}_{T_{\alpha_{k^{**}}}}\big) is around 10^{-3}. Such a difference appears to be significant in this

context. The validation sample estimate of the generalization error for the root node
tree {t0 } is given by
\widehat{\mathrm{Err}}^{\,\mathrm{val}}\big(\hat{\mu}_{\{t_0\}}\big) = 0.54963.

The generalization error estimate only decreases by 0.0043528 from the root node
tree {t0 }, that is the null model, to the optimal one Tαk ∗ .
The generalization error of a model μ̂ can be decomposed as the sum of the generalization error of the true model μ and a positive estimation error. The generalization error of the true model μ is irreducible. Provided that the generalization error of the true model μ is large compared to the estimation error of the null model, a small decrease of the generalization error can actually mean a significant improvement.

3.4 Measure of Performance

The generalization error Err(μ̂) measures the performance of the model μ̂. Specifically, its validation sample estimate Êrr^val(μ̂) makes it possible to assess the predictive accuracy of μ̂. However, as noticed in the previous section, there are some situations
where the validation sample estimate only slightly reacts to a model change while
such a small variation could actually reveal a significant improvement in terms of
model accuracy.

Consider the simulated dataset of Sect. 3.2.4.1. Because the true model μ is known
in this example, we can estimate the generalization error of the true model on the
whole dataset, that is

Err (μ) = 0.5546299. (3.4.1)

Let μ̂_null be the model obtained by averaging the true expected claim frequencies over the observations. We get μ̂_null = 0.1312089, such that its generalization error estimated on the whole dataset is

\widehat{\mathrm{Err}}(\hat{\mu}_{\mathrm{null}}) = 0.5580889. \qquad (3.4.2)

The difference between both error estimates Êrr(μ) and Êrr(μ̂_null) is 0.003459. We observe that the improvement we get in terms of generalization error by using the true model μ instead of the null model μ̂_null is only of the order of 10^{-3}. A slight decrease of the generalization error can thus actually mean a real improvement in terms of model accuracy.

3.5 Relative Importance of Features

In insurance, the features are not all equally important for the response. Often, only a few of them have a substantial influence on the response. Assessing the relative importances of the features to the response can thus be useful for the analyst.
For a tree T , the relative importance of feature x j is the total reduction of deviance
obtained by using this feature throughout the tree. The overall objective is to minimize
the deviance. A feature that contributes a lot to this reduction will be more important
than another one with a small or no deviance reduction. Specifically, denoting by
T(T ) (x j ) the set of all non-terminal nodes of tree T for which x j was selected as
the splitting feature, the relative importance of feature x j is the sum of the deviance
reductions D_{χ_t} over the non-terminal nodes t ∈ 𝒯^{(T)}(x_j), that is,

I(x_j) = \sum_{t \in \mathcal{T}^{(T)}(x_j)} D_{\chi_t}. \qquad (3.5.1)

The features can be ordered with respect to their relative importances. The feature
with the largest relative importance is the most important one, the feature with the
second largest relative importance is the second most important one, and so on up to
the least important feature. The most important features are those that appear higher
in the tree or several times in the tree.
Note that the relative importances are relative measures, so that they can be nor-
malized to improve their readability. It is customary to assign the largest a value
of 100 and then scale the others accordingly. Another way is to scale the relative
importances such that their sum equals 100, so that any relative importance can be
interpreted as the percentage contribution to the overall model.

3.5.1 Example 1

As an illustration, let us compute the relative importances in the example of


Sect. 3.2.4.1. Table 3.6 shows the deviance reduction Dχt for any non-terminal
node t of the optimal tree depicted in Fig. 3.8. We then get

I(Age) = 1002.20 + 261.65 = 1263.85
I(Sport) = 156.89 + 87.92 + 73.64 = 318.45
I(Gender) = 16.80 + 32.60 + 23.51 + 32.53 + 21.62 + 26.65 = 153.71
I(Split) = 0.

The sum of the relative importances is given by

I(Age) + I(Sport) + I(Gender) + I(Split) = 1736.01.

Normalizing the relative importances such that their sum equals 100, we get

Table 3.6 Decrease of the deviance D_{χ_t} for any non-terminal node t of the optimal tree
Node k)   Splitting feature   D_{χ_{t_{k−1}}}
1) Age 279 043.30−(111779.70+166261.40) = 1002.20
2) Sport 111779.70−(53386.74+58236.07) = 156.89
3) Age 166261.40−(88703.96+77295.79) = 261.65
4) Gender 53386.74−(26084.89+27285.05) = 16.80
5) Gender 58236.07−(28257.29+29946.18) = 32.60
6) Sport 88703.96−(42515.21+46100.83) = 87.92
7) Sport 77295.79−(37323.97+39898.18) = 73.64
12) Gender 42515.21−(20701.11+21790.59) = 23.51
13) Gender 46100.83−(22350.27+23718.03) = 32.53
14) Gender 37323.97−(18211.52+19090.83) = 21.62
15) Gender 39898.18−(19474.59+20396.94) = 26.65

I(Age) = 1263.85/17.3601 = 72.80
I(Sport) = 318.45/17.3601 = 18.34
I(Gender) = 153.71/17.3601 = 8.85
I(Split) = 0.

The feature Age is the most important one, followed by Sport and Gender, which is in line with our expectations. Notice that the variable Split has an importance equal to 0 as it is not used in the tree.

3.5.2 Example 2

Consider the example of Sect. 3.2.4.2. The relative importances of the features related to the tree depicted in Fig. 3.17 are shown in Fig. 3.23. One sees that the relative importance of the features AgePh and Split represents approximately 90% of the total importance, while the relative importance of the last five features (namely Cover, AgeCar, Gender, PowerCat and Use) is less than 5%.


Fig. 3.23 Relative importances of the features related to the tree depicted in Fig. 3.17

3.5.3 Effect of Correlated Features

Most of the time, the features X j are correlated in observational studies. When the
correlation between some features becomes large, trees may run into trouble, as
illustrated next. Such high correlations mean that the same information is encoded
in several features.
Consider the example of Sect. 3.2.4.1 and assume that variables X 1 (Gender) and
X 3 (Split) are correlated. In order to generate the announced correlation, we suppose
that the distribution of females inside the portfolio differs according to the split of
the premium. Specifically,

P[X1 = female | X3 = yes] = 1 − P[X1 = male | X3 = yes] = ρ

with ρ ∈ [0.5, 1]. Since P[X1 = female] = 0.5 and P[X3 = yes] = 0.5, we necessarily have

P[X1 = female | X3 = no] = 1 − P[X1 = male | X3 = no] = 1 − ρ.

The correlation between X 1 and X 3 increases with ρ. In particular, the case ρ = 0.5
corresponds to the independent case while both features X 1 and X 3 are perfectly
correlated when ρ = 1.
We consider different values for ρ, from 0.5 to 1 in steps of 0.1. For each value of ρ, we generate a training set with 500 000 observations on which we build a tree minimizing the 10-fold cross-validation error. The corresponding relative importances are depicted in Fig. 3.24. One sees that the importance of the variable Split increases with ρ, starting from 0 when ρ = 0.5 and completely replacing the variable Gender when ρ = 1. Also, one observes that the more important the variable Split becomes, the less important the variable Gender is.
The variable Split should not be used to explain the expected claim frequency,
which is the case when Split and Gender are independent. However, introducing a
correlation between Split and Gender leads to a transfer of importance from Gender
to Split. As a result, the variable Split seems to be useful to explain the expected claim
frequency while the variable Gender appears to be less important than it should be.
This effect becomes even more pronounced as the correlation increases, the variable
Split bringing more and more information about Gender. In the extreme case where
Split and Gender are perfectly dependent, using Split instead of Gender always yields
the same deviance reduction, so that both variables become fully equivalent. Note that in such a situation, the variable Split was automatically selected over Gender so that Gender has no importance in Fig. 3.24 for ρ = 1.


Fig. 3.24 Relative importances of the features for different values of ρ: ρ = 0.5 (top left), ρ = 0.6
(top right), ρ = 0.7 (middle left), ρ = 0.8 (middle right), ρ = 0.9 (bottom left) and ρ = 1 (bottom
right)

For a set of features, here Split and Gender, all the importances might be small, yet we cannot delete them all. This little example shows that the relative importances could easily lead to wrong conclusions if they are not used carefully by the analyst.

3.6 Interactions

Interaction arises when the effect of a particular feature depends on the value of another. An example in motor insurance is given by driver's age and gender: often, young female drivers cause on average fewer claims than young male ones, whereas this gender difference disappears (and sometimes even reverses) at older ages. Hence, the effect of age depends on gender. Regression trees automatically account for interactions.
Let us revisit the example of Sect. 3.2.4.1 for which we now assume the following
expected annual claim frequencies:

μ(x) = 0.1 × (1 + 0.1I [x1 = male])


× (1 + (0.4 + 0.2I [x1 = male]) I [18 ≤ x2 < 30] + 0.2I [30 ≤ x2 < 45])
× (1 + 0.15I [x4 = yes]) . (3.6.1)

This time, the effect of the age on the expected claim frequency depends on the
policyholder’s gender. Young male drivers are indeed more risky than young female
drivers in our example, young meaning here younger than 30 years old. Equivalently,
the effect of the gender on the expected claim frequency depends on the policyholder’s
age. Indeed, a man and a woman both with 18 ≤ Age < 30 and with the same value
for the feature Sport have expected claim frequencies that differ by (1.1 × 1.6)/1.4 − 1 = 25.71%, while a man and a woman both with Age ≥ 30 and with the same value for
the feature Sport have expected claim frequencies that only differ by 10%. One says
that features Gender and Age interact.
The optimal regression tree is shown in Fig. 3.25. By nature, the structure of
a regression tree makes it possible to account for potential interactions between features. In Fig. 3.25, the root node is split with the rule Age ≥ 30. Hence, when the feature Gender is used for a split on the left-hand side of the tree (node 2) and its children), it only applies to policyholders with Age ≥ 30, while when it appears on the right-hand side of the tree (node 3) and its children), it only applies to policyholders with 18 ≤ Age < 30.
By construction, the impact of the feature Gender can then be different for categories
18 ≤ Age < 30 and Age ≥ 30. The structure of a tree, which is a succession of
binary splits, can thus easily reveal existing interactions between features.

Remark 3.6.1 Interaction has nothing to do with correlation. For instance, consider
a motor insurance portfolio with the same age structure for males and females (so that
the risk factors age and gender are mutually independent). If young male drivers are
more dangerous compared to young female drivers whereas this ranking disappears
or reverses at older ages, then age and gender interact despite being independent.


Fig. 3.25 Optimal tree

3.7 Limitations of Trees

3.7.1 Model Instability

One issue with trees is their high variance. There is a high variability of the prediction μ̂_D(x) over the models trained from all possible training sets. The main reason is the hierarchical nature of the procedure: a change in one of the top splits is propagated to all the subsequent splits. A way to remedy that problem is to rely on bagging, which averages many trees. This has the effect of reducing the variance. But the price to be paid for stabilizing the prediction is to deal with less comprehensible models, so that we lose in terms of model interpretability. Instead of relying on one tree, bagging works with many trees. Ensemble learning techniques such as bagging will be studied in Chap. 4.
Consider the example of Sect. 3.2.4.1. The true expected claim frequency only
takes twelve values, so that the complexity of the true model is small. We first
generate 10 training sets with 5000 observations. On each training set, we fit a tree
with only one split. Figure 3.26 shows the resulting trees. As we can see, even in this
simple example, the feature used to make the split of the root node is not always the

Fig. 3.26 Trees with only one split built on different training sets of size 5000

same. Furthermore, the value of the split when using the feature Age differs from
one tree to another.
Increasing the size of the training set should decrease the variability of the trees
with respect to the training set. In Figs. 3.27 and 3.28, we depict trees built on training
sets with 50 000 and 500 000 observations, respectively. We notice that trees fitted
on training sets with 50 000 observations always use the feature Age for the first
split. Only the value of the split varies, but not too much. Finally, with training sets
made of 500 000 observations, we observe that all the trees have the same structure,
with Age ≥ 44 as the split of the root node. In this case, the trees only differ on the
corresponding predictions for Age < 44 and Age ≥ 44.
The variability of the models with respect to the training set can be measured by
the variance

E_X\big[E_D\big[(E_D[\hat{\mu}_D(X)] - \hat{\mu}_D(X))^2\big]\big], \qquad (3.7.1)

which has been introduced in Chap. 2. In Table 3.7, we calculate the variance (3.7.1)
by Monte-Carlo simulation for training sets of sizes 5000, 50 000 and 500 000. As
expected, the variance decreases as the number of observations in the training set
increases.
Even in this simple example, where the true model only takes twelve values
and where we only have three important features (one of which being continuous),
training sets with 50 000 observations can still lead to trees with different splits for
the root node. Larger trees can even be more impacted due to the hierarchical nature
of the procedure. A change in the split for the root node can be propagated to the
subsequent nodes, which may lead to trees with very different structures.
In practice, the complexity of the true model is usually much larger than in this
example. Typically, more features influence the true expected claim frequency and
the impact of a continuous feature such as the age is often more complicated than
being summarized with three categories. In this setting, let us assume that the true
expected claim frequency is actually

\mu(x) = 0.1 \times (1 + 0.1\, I[x_1 = \text{male}]) \times \left(1 + \frac{1}{\sqrt{x_2 - 17}}\right) \times (1 + 0.15\, I[x_4 = \text{yes}]). \qquad (3.7.2)

This time, the impact of the age on the expected claim frequency is more complex than
in the previous example, and is depicted in Fig. 3.29. The expected claim frequency
smoothly decreases with the age, young drivers being more risky. Even with training
sets made of 500 000 observations, the resulting trees can still have different splits
for the root node, as shown in Fig. 3.30. We get the following variance

E_X\big[E_D\big[(E_D[\hat{\mu}_D(X)] - \hat{\mu}_D(X))^2\big]\big] = 0.03374507 \times 10^{-3}
Fig. 3.27 Trees with only one split built on different training sets of size 50 000
Fig. 3.28 Trees with only one split built on different training sets of size 500 000

Table 3.7 Estimation of (3.7.1) by Monte-Carlo simulation for training sets of sizes 5000, 50 000
and 500 000
  
Size of the training sets   E_X[E_D[(E_D[μ̂_D(X)] − μ̂_D(X))²]]
5000       0.206928850 × 10⁻³
50 000     0.061018808 × 10⁻³
500 000    0.002361467 × 10⁻³

Fig. 3.29 Impact of the age on the expected claim frequency

by Monte-Carlo simulation, which is, as expected, larger than the variance in Table 3.7
for training sets with 500 000 observations.
In insurance, the complexity of the true model can be large so that training sets of
reasonable size for an insurance portfolio often lead to unstable estimators μ̂_D with respect to D. In addition, correlated features can further increase model instability.

3.7.2 Lack of Smoothness

When the true model μ(x) is smooth, trees are unlikely to capture all the nuances
of μ(x). One says that regression trees suffer from a lack of smoothness. Ensemble
techniques described in Chaps. 4 and 5 will make it possible to address this issue.
As an illustration, we consider the example of Sect. 3.2.4.1 with expected claim
frequencies given by (3.7.2). The expected claim frequency μ(x) smoothly decreases
with the feature Age. We simulate a training set with 500 000 observations and we

Fig. 3.30 Trees with only one split built on different training sets of size 500 000 for the true
expected claim frequency given by (3.7.2)
Fig. 3.31 Tree minimizing the 10-fold cross validation error

fit a tree which minimizes the 10-fold cross validation error, depicted in Fig. 3.31.
Notice that the feature Split is not used by this tree, as desired.
The corresponding model μ̂ is not satisfactory. Figure 3.32 illustrates both μ̂ and μ as functions of policyholder's age for fixed values of the variables Gender and Sport. One sees that μ̂ does not reproduce the smooth decreasing behavior of the true model μ with respect to the age.
Typically, small trees suffer from a lack of smoothness because of their limited number of terminal nodes, whereas large trees, which could overcome this limitation thanks to their larger number of leaves, tend to overfit the data. Trees with the highest predictive accuracy cannot be too large, so that their limited number of possible predicted outcomes prevents obtaining models with smooth behaviors with respect to some features.

Fig. 3.32 Fitted model μ̂ (on the left) and true model μ (on the right) as functions of policyholder's age. From top to bottom: (Gender = male, Sport = yes), (Gender = male, Sport = no), (Gender = female, Sport = yes) and (Gender = female, Sport = no)

3.8 Bibliographic Notes and Further Reading

The decision trees are first due to Morgan and Sonquist (1963), who suggested a
method called automatic interaction detector (AID) in the context of survey data.
Several improvements were then proposed by Sonquist (1970), Messenger and

Mandell (1972), Gillo (1972) and Sonquist et al. (1974). The most important con-
tributors to modern methodological principles of decision trees are however Breiman
(1978a, b), Friedman (1977, 1979) and Quinlan (1979, 1986) who proposed very sim-
ilar algorithms for the construction of decision trees. The seminal work of Breiman
et al. (1984), complemented by the work of Quinlan (1993), made it possible to create a simple and consistent methodological framework for decision trees, the classification and regression tree (CART) techniques, which facilitated the diffusion of tree-based models towards a large audience. Our presentation is mainly inspired by Breiman et al. (1984), Hastie et al. (2009) and Wüthrich and Buser (2019), the latter reference adapting tree-based methods to model claim frequencies. Also, Louppe (2014) provides a good overview of the literature, from which this section is greatly inspired.

References

Breiman L (1978a) Parsimonious binary classification trees. Preliminary report. Technology Service
Corporation, Santa Monica, Calif
Breiman L (1978b) Description of chlorine tree development and use. Technical report, Technology
Service Corporation, Santa Monica, CA
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees.
Wadsworth statistics/probability series
Bühlmann H, Gisler A (2005) A course in credibility theory and its applications. Springer, Berlin
Denuit M, Hainaut D, Trufin J (2019) Effective statistical learning methods for actuaries I: GLMs
and extensions. Springer actuarial lecture notes
Feelders A (2019) Classification trees. Lecture notes
Friedman JH (1977) A recursive partitioning decision rule for nonparametric classification. IEEE
Trans Comput 100(4):404–408
Friedman JH (1979) A tree-structured approach to nonparametric multiple regression. Smoothing
techniques for curve estimation. Springer, Berlin, pp 5–22
Geurts P (2002) Contributions to decision tree induction: bias/variance tradeoff and time series
classification. PhD thesis
Gillo M (1972) Maid: a honeywell 600 program for an automatised survey analysis. Behav Sci
17:251–252
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Data mining, infer-
ence, and prediction, 2nd edn. Springer series in statistics
Louppe G (2014) Understanding random forests: from theory to practice. arXiv:14077502
Messenger R, Mandell L (1972) A modal search technique for predictive nominal scale multivariate
analysis. J Am Stat Assoc 67(340):768–772
Morgan JN, Sonquist JA (1963) Problems in the analysis of survey data, and a proposal. J Am Stat
Assoc 58(302):415–434
Quinlan JR (1979) Discovering rules by induction from large collections of examples. Expert
systems in the micro electronic age. Edinburgh University Press, Edinburgh
Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
Quinlan JR (1993) C4.5: programs for machine learning, vol 1. Morgan Kaufmann, Burlington
Sonquist JA (1970) Multivariate model building: the validation of a search strategy. Survey Research
Center, University of Michigan
Sonquist JA, Baker EL, Morgan JN (1974) Searching for structure: an approach to analysis of
substantial bodies of micro-data and documentation for a computer program. Survey Research
Center, University of Michigan Ann Arbor, MI
Wüthrich MV, Buser C (2019) Data analytics for non-life insurance pricing. Lecture notes
Chapter 4
Bagging Trees and Random Forests

4.1 Introduction

Two ensemble methods are considered in this chapter, namely bagging trees and
random forests. One issue with regression trees is their high variance: there is a high variability of the prediction μ̂_D(x) over the trees trained from all possible training sets D. Bagging trees and random forests aim to reduce the variance without too
much altering bias.
Ensemble methods are relevant tools to reduce the expected generalization error
of a model by driving down the variance of the model without increasing too much
the bias. The principle of ensemble methods (based on randomization) consists in
introducing random perturbations into the training procedure in order to get different
models from a single training set D and combining them to obtain the estimate of
the ensemble.
Let us start with the average prediction E_D[μ̂_D(x)]. It has the same bias as μ̂_D(x) since

E_D[\hat{\mu}_D(x)] = E_D\big[E_D[\hat{\mu}_D(x)]\big], \qquad (4.1.1)

and zero variance, that is,

\mathrm{Var}_D\big[E_D[\hat{\mu}_D(x)]\big] = 0. \qquad (4.1.2)

Hence, finding a training procedure that produces a good approximation of the aver-
age model in order to stabilize model predictions seems to be a good strategy.
If we assume that we can draw as many training sets as we want, so that we have B
training sets D1 , D2 , . . . , D B available, then an approximation of the average model
can be obtained by averaging the regression trees built on these training sets, that is,

\widehat{E_D[\hat{\mu}_D(x)]} = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D_b}(x). \qquad (4.1.3)


In such a case, the average of the estimate (4.1.3) with respect to the training sets D_1, . . . , D_B is the average prediction E_D[μ̂_D(x)], that is,

E_{D_1,\ldots,D_B}\left[\frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D_b}(x)\right] = \frac{1}{B}\sum_{b=1}^{B} E_{D_b}\big[\hat{\mu}_{D_b}(x)\big] = E_D[\hat{\mu}_D(x)], \qquad (4.1.4)

while the variance of (4.1.3) with respect to D_1, . . . , D_B is given by

\mathrm{Var}_{D_1,\ldots,D_B}\left[\frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D_b}(x)\right] = \frac{1}{B^2}\,\mathrm{Var}_{D_1,\ldots,D_B}\left[\sum_{b=1}^{B} \hat{\mu}_{D_b}(x)\right] = \frac{1}{B^2}\sum_{b=1}^{B} \mathrm{Var}_{D_b}\big[\hat{\mu}_{D_b}(x)\big] = \frac{\mathrm{Var}_D[\hat{\mu}_D(x)]}{B} \qquad (4.1.5)
since predictions μD1 (x), . . . , 
μD B (x) are independent and identically distributed.
So, averaging over B estimates fitted on different training sets leaves the bias
unchanged compared to each individual estimate while it divides the variance by
B. The estimate (4.1.3) is then less variable than each individual one.
In practice, the probability distribution from which the observations of the training
set are drawn is usually not known so that there is only one training set available. In
this context, the bootstrap approach, used both in bagging trees and random forests,
appears to be particularly useful.

4.2 Bootstrap

Suppose we have independent random variables Y1 , Y2 , . . . , Yn with common dis-


tribution function F that is unknown and that we are interested in using them to
estimate some quantity θ(F) associated with F. An estimator

\hat{\theta} = g(Y_1, Y_2, \ldots, Y_n)

is available for θ(F). The distributional properties of θ̂ in terms of the variables


Y1 , Y2 , . . . , Yn cannot be determined since the distribution function F is not known.
The idea of bootstrap is to estimate F.
The empirical counterpart to F is defined as

\widehat{F}_n(x) = \frac{\#\{Y_i \text{ such that } Y_i \le x\}}{n} = \frac{1}{n}\sum_{i=1}^{n} I[Y_i \le x].

Thus, the empirical distribution function F̂_n puts an equal probability 1/n on each of the observed data points Y_1, . . . , Y_n. The idea behind the non-parametric bootstrap is to simulate sets of independent random variables

Y_1^{(*b)}, Y_2^{(*b)}, \ldots, Y_n^{(*b)}

obeying the distribution function F̂_n, b = 1, 2, . . . , B. This can be done by simulating U_i ∼ Uni(0, 1) and setting

Y_i^{(*b)} = y_I \quad \text{with} \quad I = [nU_i] + 1.

Then, for each b = 1, . . . , B, we calculate

\hat{\theta}^{(*b)} = g\big(Y_1^{(*b)}, Y_2^{(*b)}, \ldots, Y_n^{(*b)}\big),

so that the corresponding bootstrap distribution of θ̂ is given by

F^{*}_{\hat{\theta}}(x) = \frac{1}{B}\sum_{b=1}^{B} I\big[\hat{\theta}^{(*b)} \le x\big].

4.3 Bagging Trees

Bagging is one of the first ensemble methods proposed in the literature. Consider
a model fitted to our training set D, obtaining the prediction  μD (x) at point x.
Bootstrap aggregation or bagging averages this prediction over a set of bootstrap
samples in order to reduce its variance.
The probability distribution of the random vector (Y, X) is usually not known.
This latter distribution is then approximated by its empirical version which puts an
1
equal probability |I| on each of the observations {(yi , x i ); i ∈ I} of the training set
D. Hence, instead of simulating B training sets D1 , D2 , . . . , D B from the probability
distribution of (Y, X), which is not possible in practice, the idea of bagging is rather
to simulate B bootstrap samples D∗1 , D∗2 , . . . , D∗B of the training set D from its
empirical counterpart. Specifically, a bootstrap sample of D is obtained by simulating
independently |I| observations from the empirical distribution of (Y, X) defined
above. A bootstrap sample is thus a random sample of D taken with replacement
which has the same size as D. Notice that, on average, 63.2% of the observations of
the training set are represented at least once in a bootstrap sample. Indeed,
1 - \left(\frac{|\mathcal{I}|-1}{|\mathcal{I}|}\right)^{|\mathcal{I}|}, \qquad (4.3.1)

Table 4.1 Probability in (4.3.1) with respect to |I|

|I|          1 − ((|I| − 1)/|I|)^{|I|}
10 0.651322
100 0.633968
1000 0.632305
10 000 0.632139
100 000 0.632122

which is computed in Table 4.1 for different values of |I|, is the probability that a
given observation of the training set is represented at least once. One can see that (4.3.1) quickly approaches its limiting value 1 − e^{−1} ≈ 63.2%.
Let D^{*1}, D^{*2}, . . . , D^{*B} be B bootstrap samples of the training set D. For each D^{*b}, b = 1, . . . , B, we fit our model, giving prediction μ̂_{D,Θ_b}(x) = μ̂_{D^{*b}}(x). The bagging prediction is then defined by

\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x), \qquad (4.3.2)

where Θ = (Θ_1, . . . , Θ_B). The random vectors Θ_1, . . . , Θ_B fully capture the randomness of the training procedure. For bagging, Θ_1, . . . , Θ_B are independent and identically distributed, Θ_b being a vector of |I| integers randomly and uniformly drawn in I. Each component of Θ_b indexes one observation of the training set selected in D^{*b}. In this book, bagging is applied to unpruned regression trees. This provides the following algorithm:

Algorithm 4.1: Bagging trees

For b = 1 to B do
1. Generate a bootstrap sample D^{*b} of D.
2. Fit an unpruned tree on D^{*b}, which gives prediction μ̂_{D,Θ_b}(x).
End for
Output: μ̂^{bag}_{D,Θ}(x) = (1/B) Σ_{b=1}^{B} μ̂_{D,Θ_b}(x).

As mentioned previously, two main drawbacks of regression trees are that they
produce piece-wise constant estimates and that they are rather unstable under a small
change in the observations of the training set. The construction of an ensemble of
trees produces more stable and smoothed estimates under averaging.

4.3.1 Bias

For bagging, the bias is the same as the one of the individual sampled models. Indeed,

\mathrm{Bias}(x) = \mu(x) - E_{D,\Theta}\big[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big]
= \mu(x) - E_{D,\Theta_1,\ldots,\Theta_B}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right]
= \mu(x) - \frac{1}{B}\sum_{b=1}^{B} E_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big]
= \mu(x) - E_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big] \qquad (4.3.3)

since predictions μ̂_{D,Θ_1}(x), . . . , μ̂_{D,Θ_B}(x) are identically distributed.
However, the bias of μ̂_{D,Θ_b}(x) is typically greater in absolute terms than the bias of μ̂_D(x) fitted on D since the reduced sample D^{*b} imposes restrictions. The
improvements in the estimation obtained by bagging will be a consequence of vari-
ance reduction.
Notice that trees are ideal candidates for bagging. They can handle complex
interaction structures in the data and they have relatively low bias if grown sufficiently
deep. Because they are noisy, they will greatly benefit from the averaging.

4.3.2 Variance
The variance of μ̂^{bag}_{D,Θ}(x) can be written as

\mathrm{Var}_{D,\Theta}\big[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big]
= \mathrm{Var}_{D,\Theta_1,\ldots,\Theta_B}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right]
= \frac{1}{B^2}\,\mathrm{Var}_{D,\Theta_1,\ldots,\Theta_B}\left[\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right]
= \frac{1}{B^2}\left(\mathrm{Var}_D\left[E_{\Theta_1,\ldots,\Theta_B}\left[\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\,\Big|\,D\right]\right] + E_D\left[\mathrm{Var}_{\Theta_1,\ldots,\Theta_B}\left[\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\,\Big|\,D\right]\right]\right)
= \mathrm{Var}_D\big[E_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big] + \frac{1}{B}\,E_D\big[\mathrm{Var}_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big] \qquad (4.3.4)

since, conditionally on D, predictions μ̂_{D,Θ_1}(x), . . . , μ̂_{D,Θ_B}(x) are independent and identically distributed. The second term is the within-D variance, a result of the randomization due to the bootstrap sampling. The first term is the sampling variance of the bagging ensemble, a result of the sampling variability of D itself. As the number of aggregated estimates gets arbitrarily large, i.e. as B → ∞, the variance of μ̂^{bag}_{D,Θ}(x) reduces to Var_D[E_{Θ_b}[μ̂_{D,Θ_b}(x) | D]].
From (4.3.4) and

\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big] = \mathrm{Var}_D\big[E_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big] + E_D\big[\mathrm{Var}_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big], \qquad (4.3.5)

we see that

\mathrm{Var}_{D,\Theta}\big[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big] \le \mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big]. \qquad (4.3.6)

The variance of the bagging prediction μ̂^{bag}_{D,Θ}(x) is smaller than the variance of an individual prediction μ̂_{D,Θ_b}(x). Actually, we learn from (4.3.4) and (4.3.5) that the variance reduction is given by

\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big] - \mathrm{Var}_{D,\Theta}\big[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big] = \frac{B-1}{B}\,E_D\big[\mathrm{Var}_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big], \qquad (4.3.7)

which increases as B increases and tends to E_D[Var_{Θ_b}[μ̂_{D,Θ_b}(x) | D]] when B → ∞.
Let us introduce the correlation coefficient ρ(x) between any pair of predictions
used in the averaging which are built on the same training set but fitted on two different
bootstrap samples. Using the definition of Pearson's correlation coefficient, we get

\rho(x) = \frac{\mathrm{Cov}_{D,\Theta_b,\Theta_{b'}}\big[\hat{\mu}_{D,\Theta_b}(x), \hat{\mu}_{D,\Theta_{b'}}(x)\big]}{\sqrt{\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big]\,\mathrm{Var}_{D,\Theta_{b'}}\big[\hat{\mu}_{D,\Theta_{b'}}(x)\big]}} = \frac{\mathrm{Cov}_{D,\Theta_b,\Theta_{b'}}\big[\hat{\mu}_{D,\Theta_b}(x), \hat{\mu}_{D,\Theta_{b'}}(x)\big]}{\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big]} \qquad (4.3.8)

as μ̂_{D,Θ_b}(x) and μ̂_{D,Θ_{b'}}(x) are identically distributed. By the law of total covariance, the numerator in (4.3.8) can be rewritten as

\mathrm{Cov}_{D,\Theta_b,\Theta_{b'}}\big[\hat{\mu}_{D,\Theta_b}(x), \hat{\mu}_{D,\Theta_{b'}}(x)\big]
= E_D\big[\mathrm{Cov}_{\Theta_b,\Theta_{b'}}[\hat{\mu}_{D,\Theta_b}(x), \hat{\mu}_{D,\Theta_{b'}}(x)\,|\,D]\big] + \mathrm{Cov}_D\big[E_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D],\, E_{\Theta_{b'}}[\hat{\mu}_{D,\Theta_{b'}}(x)\,|\,D]\big]
= \mathrm{Var}_D\big[E_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big] \qquad (4.3.9)

since, conditionally on D, estimates μ̂_{D,Θ_b}(x) and μ̂_{D,Θ_{b'}}(x) are independent and identically distributed. Hence, combining (4.3.5) and (4.3.9), the correlation coefficient in (4.3.8) becomes
  
\rho(x) = \frac{\mathrm{Var}_D\big[E_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big]}{\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big]} \qquad (4.3.10)
= \frac{\mathrm{Var}_D\big[E_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big]}{\mathrm{Var}_D\big[E_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big] + E_D\big[\mathrm{Var}_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big]}. \qquad (4.3.11)

The correlation coefficient ρ(x) measures the correlation between a pair of pre-
dictions in the ensemble induced by repeatedly making training sample draws D
from the population and then drawing a pair of bootstrap samples from D.
When ρ(x) is close to 1, the predictions are highly correlated, suggesting that
the randomization due to the bootstrap sampling has no significant effect on the
predictions. On the contrary, when ρ(x) is close to 0, the predictions are uncorrelated,
suggesting that the randomization due to the bootstrap sampling has a strong impact
on the predictions.
One sees that ρ(x) is the ratio between the variance due to the training set and the
total variance. The total variance is the sum of the variance due to the training set and
the variance due to randomization induced by the bootstrap samples. A correlation
coefficient close to 1 and hence correlated predictions means that the total variance
is mostly driven by the training set. On the contrary, a correlation coefficient close
to 0 and hence de-correlated predictions means that the total variance is mostly due
to the randomization induced by the bootstrap samples.
Alternatively, the variance of μ̂^{bag}_{D,Θ}(x) given in (4.3.4) can be re-expressed in terms of the correlation coefficient. Indeed, from (4.3.10) and (4.3.11), we have

\mathrm{Var}_D\big[E_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big] = \rho(x)\,\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big] \qquad (4.3.12)

and

E_D\big[\mathrm{Var}_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big] = (1 - \rho(x))\,\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big], \qquad (4.3.13)

such that (4.3.4) can be rewritten as

\mathrm{Var}_{D,\Theta}\big[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big] = \mathrm{Var}_D\big[E_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big] + \frac{1}{B}\,E_D\big[\mathrm{Var}_{\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)\,|\,D]\big]
= \rho(x)\,\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big] + \frac{1 - \rho(x)}{B}\,\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big]. \qquad (4.3.14)

As B increases, the second term disappears, but the first term remains. Hence, when
ρ(x) < 1, one sees that the variance of the ensemble is strictly smaller than the
variance of an individual model. Let us mention that assuming ρ(x) < 1 amounts to supposing that the randomization due to the bootstrap sampling influences the individual predictions.

Notice that the random perturbation introduced by the bootstrap sampling induces a higher variance for an individual prediction μ̂_{D,Θ_b}(x) than for μ̂_D(x), so that

\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big] \ge \mathrm{Var}_D\big[\hat{\mu}_D(x)\big]. \qquad (4.3.15)

Therefore, bagging averages models with higher variances. Nevertheless, the bagging prediction μ̂^{bag}_{D,Θ}(x) generally has a smaller variance than μ̂_D(x). This comes from the fact that, typically, the correlation coefficient ρ(x) in (4.3.14) compensates for the variance increase Var_{D,Θ_b}[μ̂_{D,Θ_b}(x)] − Var_D[μ̂_D(x)], so that the combined effect of ρ(x) < 1 and Var_{D,Θ_b}[μ̂_{D,Θ_b}(x)] ≥ Var_D[μ̂_D(x)] often leads to a variance reduction

\mathrm{Var}_D\big[\hat{\mu}_D(x)\big] - \rho(x)\,\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big] \qquad (4.3.16)

that is positive. Because of their high variance, regression trees very likely benefit
from the averaging procedure.

4.3.3 Expected Generalization Error

For some loss functions, such as the squared error and Poisson deviance losses, we can show that the expected generalization error for the bagging prediction μ̂^{bag}_{D,Θ}(x) is smaller than the expected generalization error for an individual prediction μ̂_{D,Θ_b}(x), that is,

E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] \le E_{D,\Theta_b}\big[\mathrm{Err}\big(\hat{\mu}_{D,\Theta_b}(x)\big)\big]. \qquad (4.3.17)

However, while it is typically the case with bagging trees, we cannot highlight situations where the estimate μ̂^{bag}_{D,Θ}(x) always performs better than μ̂_D(x) in the sense of the expected generalization error, even for the squared error and Poisson deviance losses.

4.3.3.1 Squared Error Loss


For the squared error loss, from (2.4.3), the expected generalization error for μ̂^{bag}_{D,Θ}(x) is given by

E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] = \mathrm{Err}(\mu(x)) + \big(\mu(x) - E_{D,\Theta}\big[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big]\big)^2 + \mathrm{Var}_{D,\Theta}\big[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big]. \qquad (4.3.18)

From (4.3.3) and (4.3.6), one observes that the bias remains unchanged while the variance decreases compared to the individual prediction μ̂_{D,Θ_b}(x), so that we get


E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big]
= \mathrm{Err}(\mu(x)) + \big(\mu(x) - E_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big]\big)^2 + \mathrm{Var}_{D,\Theta}\big[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big]
\le \mathrm{Err}(\mu(x)) + \big(\mu(x) - E_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big]\big)^2 + \mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big]
= E_{D,\Theta_b}\big[\mathrm{Err}\big(\hat{\mu}_{D,\Theta_b}(x)\big)\big]. \qquad (4.3.19)

For every value of X, the expected generalization error of the ensemble is smaller
than the expected generalization error of an individual model.
Taking the average of (4.3.19) over X leads to

   
bag
ED, Err 
μD, ≤ ED,b Err 
μD,b . (4.3.20)

4.3.3.2 Poisson Deviance Loss

For the Poisson deviance loss, from (2.4.4) and (2.4.5), the expected generalization error for μ̂^{bag}_{D,Θ}(x) is given by

E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] = \mathrm{Err}(\mu(x)) + E_{D,\Theta}\big[E^P\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] \qquad (4.3.21)

with

E_{D,\Theta}\big[E^P\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] = 2\mu(x)\left(E_{D,\Theta}\left[\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right] - 1 - E_{D,\Theta}\left[\ln\left(\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right)\right]\right). \qquad (4.3.22)
We have

E_{D,\Theta}\left[\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right] = E_{D,\Theta_b}\left[\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right], \qquad (4.3.23)

so that (4.3.22) can be expressed as

E_{D,\Theta}\big[E^P\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big]
= 2\mu(x)\left(E_{D,\Theta_b}\left[\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right] - 1 - E_{D,\Theta_b}\left[\ln\left(\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]\right)
\;\; - 2\mu(x)\left(E_{D,\Theta}\left[\ln\left(\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right)\right] - E_{D,\Theta_b}\left[\ln\left(\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]\right)
= E_{D,\Theta_b}\big[E^P\big(\hat{\mu}_{D,\Theta_b}(x)\big)\big]
\;\; - 2\mu(x)\left(E_{D,\Theta}\big[\ln \hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big] - E_{D,\Theta_b}\big[\ln \hat{\mu}_{D,\Theta_b}(x)\big]\right). \qquad (4.3.24)

Jensen's inequality implies

E_{D,\Theta}\big[\ln \hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big] - E_{D,\Theta_b}\big[\ln \hat{\mu}_{D,\Theta_b}(x)\big]
= E_{D,\Theta_1,\ldots,\Theta_B}\left[\ln\left(\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right)\right] - E_{D,\Theta_b}\big[\ln \hat{\mu}_{D,\Theta_b}(x)\big]
\ge E_{D,\Theta_1,\ldots,\Theta_B}\left[\frac{1}{B}\sum_{b=1}^{B}\ln \hat{\mu}_{D,\Theta_b}(x)\right] - E_{D,\Theta_b}\big[\ln \hat{\mu}_{D,\Theta_b}(x)\big]
= 0, \qquad (4.3.25)

so that combining (4.3.24) and (4.3.25) leads to

E_{D,\Theta}\big[E^P\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] \le E_{D,\Theta_b}\big[E^P\big(\hat{\mu}_{D,\Theta_b}(x)\big)\big] \qquad (4.3.26)

and hence

E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] \le E_{D,\Theta_b}\big[\mathrm{Err}\big(\hat{\mu}_{D,\Theta_b}(x)\big)\big]. \qquad (4.3.27)

For every value of X, the expected generalization error of the ensemble is smaller
than the expected generalization error of an individual model.
Taking the average of (4.3.27) over X leads to

E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}\big)\big] \le E_{D,\Theta_b}\big[\mathrm{Err}\big(\hat{\mu}_{D,\Theta_b}\big)\big]. \qquad (4.3.28)

Example Consider the example of Sect. 3.7.2. We simulate training sets D made of 100 000 observations and validation sets D̄ of the same size. For each simulated training set D, we build the corresponding tree μ̂_D with maxdepth = 5, which corresponds to a reasonable size in this context (see Fig. 3.31), and we estimate its generalization error on a validation set D̄. Also, we generate bootstrap samples D^{*1}, D^{*2}, . . . of D and we produce the corresponding trees μ̂_{D^{*1}}, μ̂_{D^{*2}}, . . . with maxdepth = 5. We estimate their generalization errors on a validation set D̄, together with the generalization errors of the corresponding bagging models. Note that in this example, we use the R package rpart to build the different trees described above.
Figure 4.1 displays estimates of the expected generalization errors for μ̂_D, μ̂_{D^{*b}} = μ̂_{D,Θ_b} and μ̂^{bag}_{D,Θ} for B = 1, 2, . . . , 10, obtained by Monte-Carlo simulations. As expected, we notice that

E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}\big)\big] \le E_{D,\Theta_b}\big[\mathrm{Err}\big(\hat{\mu}_{D,\Theta_b}\big)\big].

For B ≥ 2, bagging trees outperforms individual sample trees.

Fig. 4.1 E_{D,Θ}[Err(μ̂^{bag}_{D,Θ})] with respect to the number of trees B, together with E_{D,Θ_b}[Err(μ̂_{D,Θ_b})] (dotted line) and E_D[Err(μ̂_D)] (solid line)

Also, we note that

E_D\big[\mathrm{Err}(\hat{\mu}_D)\big] \le E_{D,\Theta_b}\big[\mathrm{Err}\big(\hat{\mu}_{D,\Theta_b}\big)\big],

showing that the restriction imposed by the reduced sample D∗b does not allow to
build trees as predictive as trees built on the entire training set D. Finally, from B = 4,
we note that

E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}\big)\big] \le E_D\big[\mathrm{Err}(\hat{\mu}_D)\big],

meaning that for B ≥ 4, bagging trees also outperforms single trees built on the
entire training set.

4.3.3.3 Gamma Deviance Loss

Consider the Gamma deviance loss. From (2.4.6) and (2.4.7), the expected generalization error for μ̂^{bag}_{D,Θ}(x) is given by

E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] = \mathrm{Err}(\mu(x)) + E_{D,\Theta}\big[E^G\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] \qquad (4.3.29)

with

E_{D,\Theta}\big[E^G\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] = 2\left(E_{D,\Theta}\left[\frac{\mu(x)}{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}\right] - 1 - E_{D,\Theta}\left[\ln\left(\frac{\mu(x)}{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}\right)\right]\right). \qquad (4.3.30)
Since we have

E_{D,\Theta}\big[E^G\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big]
= E_{D,\Theta_b}\big[E^G\big(\hat{\mu}_{D,\Theta_b}(x)\big)\big]
+ 2\left(E_{D,\Theta}\left[\frac{\mu(x)}{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}\right] - E_{D,\Theta_b}\left[\frac{\mu(x)}{\hat{\mu}_{D,\Theta_b}(x)}\right]\right)
+ 2\left(E_{D,\Theta_b}\left[\ln\left(\frac{\mu(x)}{\hat{\mu}_{D,\Theta_b}(x)}\right)\right] - E_{D,\Theta}\left[\ln\left(\frac{\mu(x)}{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}\right)\right]\right)
= E_{D,\Theta_b}\big[E^G\big(\hat{\mu}_{D,\Theta_b}(x)\big)\big]
+ 2\,E_{D,\Theta}\left[\frac{\mu(x)}{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)} - \ln\left(\frac{\mu(x)}{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}\right)\right]
- 2\,E_{D,\Theta_b}\left[\frac{\mu(x)}{\hat{\mu}_{D,\Theta_b}(x)} - \ln\left(\frac{\mu(x)}{\hat{\mu}_{D,\Theta_b}(x)}\right)\right], \qquad (4.3.31)

we see that

E_{D,\Theta}\big[E^G\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] \le E_{D,\Theta_b}\big[E^G\big(\hat{\mu}_{D,\Theta_b}(x)\big)\big]

if and only if

E_{D,\Theta}\left[\frac{\mu(x)}{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)} + \ln\left(\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right)\right] \le E_{D,\Theta_b}\left[\frac{\mu(x)}{\hat{\mu}_{D,\Theta_b}(x)} + \ln\left(\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]. \qquad (4.3.32)
The latter inequality is fulfilled when the individual sample trees satisfy

\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)} \le 2, \qquad (4.3.33)

which, in turn, guarantees that μ̂^{bag}_{D,Θ}(x)/μ(x) ≤ 2. Indeed, the function φ : x > 0 → 1/x + ln x is convex for x ≤ 2, so that Jensen's inequality implies
E_{D,\Theta}\left[\frac{\mu(x)}{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)} + \ln\left(\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right)\right] = E_{D,\Theta}\left[\phi\left(\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right)\right]
= E_{D,\Theta_1,\ldots,\Theta_B}\left[\phi\left(\frac{1}{B}\sum_{b=1}^{B}\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]
\le E_{D,\Theta_1,\ldots,\Theta_B}\left[\frac{1}{B}\sum_{b=1}^{B}\phi\left(\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]
= E_{D,\Theta_b}\left[\phi\left(\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]
= E_{D,\Theta_b}\left[\frac{\mu(x)}{\hat{\mu}_{D,\Theta_b}(x)} + \ln\left(\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]

provided that condition (4.3.33) holds.


If inequality (4.3.33) is satisfied, we thus have

E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big)\big] \le E_{D,\Theta_b}\big[\mathrm{Err}\big(\hat{\mu}_{D,\Theta_b}(x)\big)\big],

and so

E_{D,\Theta}\big[\mathrm{Err}\big(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}\big)\big] \le E_{D,\Theta_b}\big[\mathrm{Err}\big(\hat{\mu}_{D,\Theta_b}\big)\big]. \qquad (4.3.34)

Note that condition (4.3.33) means that the individual sample prediction μ̂_{D,Θ_b}(x) should not be too far from the true prediction μ(x).

4.4 Random Forests

Bagging is a technique used for reducing the variance of a prediction. Typically, it


works well for high variance and low-bias procedures, such as regression trees. The
procedure called random forests is a modification of bagging trees. It produces a
collection of trees that are more de-correlated than in the bagging procedure, and
averages them.
Recall from (4.3.14) that the variance of the bagging prediction can be expressed
as
\mathrm{Var}_{D,\Theta}\big[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\big] = \rho(x)\,\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big] + \frac{1 - \rho(x)}{B}\,\mathrm{Var}_{D,\Theta_b}\big[\hat{\mu}_{D,\Theta_b}(x)\big]. \qquad (4.4.1)
As B increases, the second term disappears while the first one remains. The cor-
relation coefficient ρ(x) in the first term limits the effect of averaging. The idea
of random forests is to improve the bagging procedure by reducing the correlation coefficient ρ(x) without increasing the variance Var_{D,Θ_b}[μ̂_{D,Θ_b}(x)] too much.

Reducing correlation among trees can be achieved by adding randomness to the training procedure. The difficulty lies in modifying the training procedure without affecting too much both the bias and the variance of the individual trees.
Random forests combine bagging with random feature selection at each node. Specifically, when growing a tree on a bootstrap sample, m (≤ p) features are selected at random before each split and used as candidates for splitting. The random forest prediction writes

\hat{\mu}^{\mathrm{rf}}_{D,\Theta}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x),

where μ̂_{D,Θ_b}(x) denotes the prediction at point x for the bth random forest tree. The random vectors Θ_1, . . . , Θ_B capture not only the randomness of the bootstrap sampling, as for bagging, but also the additional randomness of the training procedure due to the random selection of m features before each split. This provides the following algorithm:

Algorithm 4.2: Random forests

For b = 1 to B do
1. Generate a bootstrap sample D∗b of D.
2. Fit a tree on D∗b .
For each node t do
(2.1) Select m (≤ p) features at random from the p original features.
(2.2) Pick the best feature among the m.
(2.3) Split the node into two daughter nodes.
End for
This gives prediction μ̂_{D,Θ_b}(x) (use typical tree stopping criteria, but do not prune).
End for
Output: μ̂^{rf}_{D,Θ}(x) = (1/B) Σ_{b=1}^{B} μ̂_{D,Θ_b}(x).

As soon as m < p, random forests differs from bagging trees since the optimal
split can be missed if it is not among the m features selected. Typical value of m is

p/3 . However, the best value for m depends on the problem under consideration
and is treated as a tuning parameter. Decreasing m reduces the correlation between
any pair of trees while it increases the variance of the individual trees.
Notice that random forests is more computationally efficient on a tree-by-tree basis
than bagging since the training procedure only needs to assess a part of the original
features at each split. However, compared to bagging, random forests usually require
more trees.
4.4 Random Forests 121

Remark 4.4.1 Obviously, results obtained in Sects. 4.3.1, 4.3.2 and 4.3.3 also hold
for random forests. The only difference relies in the meaning of random vectors
1 , . . . , B . For bagging, those vectors express the randomization due to the boot-
strap sampling, while for random forests, they also account for the randomness due
to the feature selection at each node.

4.5 Out-of-Bag Estimate

Bagging trees and random forests aggregate trees built on bootstrap samples D∗1 , . . . ,
D∗B . For each observation (yi , x i ) of the training set D, an out-of-bag prediction
can be constructed by averaging only trees corresponding to bootstrap samples D∗b
in which (yi , x i ) does not appear. The out-of-bag prediction for observation (yi , x i )
is thus given by

1 
B
 

μoob
D, (x i ) = B    / D∗b .
μD,b (x i )I (yi , x i ) ∈ (4.5.1)
/ D∗b
b=1 I (yi , x i ) ∈ b=1

The generalization error of 


μD, can be estimated by


oob 1 
Err (
μD, ) = L(yi , 
μoob
D, (x i )), (4.5.2)
|I| i∈I

which is called the out-of-bag estimate of the generalization error.


The out-of bag estimate of the generalization error is almost identical to the |I|-
fold cross-validation estimate. However, the out-of-bag estimate does not require to
fit new trees, so that bagging trees and random forests can be fit in one sequence.
We stop adding new trees when the out-of-bag estimate of the generalization error
stabilizes.

4.6 Interpretability

A bagged model is less interpretable than a model that is not bagged. Bagging trees
and random forests are no longer a tree. However, there exist tools that enable to
better understand model outcomes.
122 4 Bagging Trees and Random Forests

4.6.1 Relative Importances

As for a single regression tree, the relative importance of a feature can be computed
for an ensemble by combining relative importances from the bootstrap trees. For the
bth tree in the ensemble, denoted Tb , the relative importance of feature x j is the sum
of the deviance reductions Dχt over the non-terminal nodes t ∈ T!(Tb ) (x j ) (i.e. the
non-terminal nodes t of Tb for which x j was selected as the splitting feature), that is,

Ib (x j ) = Dχt . (4.6.1)
! T (x j )
t∈T( b)

For the ensemble, the relative importance of feature x j is obtained by averaging the
relative importances of x j over the collection of trees, namely

1 
B
I(x j ) = Ib (x j ). (4.6.2)
B b=1

For convenience, the relative importances are often normalized so that their sum
equals to 100. Any individual number can then be interpreted as the percentage
contribution to the overall model. Sometimes, the relative importances are expressed
as a percent of the maximum relative importance.
An alternative to compute variable importances for bagging trees and random
forests is based on out-of-bag observations. Some observations (yi , x i ) of the train-
ing set D do not appear in bootstrap sample D∗b . They are called the out-of-bag
observations for the bth tree. Because they were not used to fit that specific tree,
these observations enable to assess the predictive accuracy of 
μD,b , that is,

1 

Err (
μD,b ) = L(yi , 
μD,b (x i )), (4.6.3)
∗b
|I\I | ∗b
i∈I\I

where I ∗b labels the observations in D∗b . The categories of feature x j are then ran-
domly permuted in the out-of-bag observations, so that we get perturbed observations
), i ∈ I\I ∗b , and the predictive accuracy of 
perm(j)
(yi , x i μD,b is again computed as

perm(j) 1 

Err (
μD,b ) = L(yi , 
perm(j)
μD,b (x i )). (4.6.4)
|I\I ∗b | ∗b
i∈I\I

The decrease in predictive accuracy due to this permuting is averaged over all trees
and is used as a measure of importance for feature x j in the ensemble, that is

1 
perm(j)
B
I(x j ) = Err μD,b ) − 
( Err (
μD,b ) . (4.6.5)
B b=1
4.6 Interpretability 123

These importances can be normalized to improve their readability. A feature will be


important if randomly permuting its values decreases the model accuracy. In such a
case, it means that the model relies on the feature for the prediction.
Notice that the relative importance I(x j ) obtained by permutation in (4.6.5) does
not measure the effect on estimate of the absence of x j . Rather, it measures the effect
of neutralizing x j , much like setting a coefficient to zero in GLMs for instance.

4.6.2 Partial Dependence Plots

Visualizing the value of  μ(x) as a function of x enables to understand its dependence


on the joint values of the features. Such visualization is however limited to small
values of p. For p = 1,  μ(x) can be easily plotted, as a graph of the values of  μ(x)
against each possible value of x for single real-valued variable x, and as a bar-plot
for categorical variable x, each bar corresponding to one value of the categorical
variable. For p = 2, graphical renderings of  μ(x) is still possible. Functions of two
real-valued variables can be represented by means of contour or perspective mesh
plots. Functions of a categorical variable and another variable, categorical or real,
can be pictured by a sequence of plots, each one depicting the dependence of  μ(x)
on the second variable, conditioned on the values of the first variable.
For p > 2, representing  μ(x) as a function of its arguments is more difficult. An
alternative consists in visualizing a collection of plots, each one showing the partial
dependence of  μ(x) on selected small subsets of the features.
Consider the subvector x S of l < p of the features x = (x1 , x2 , . . . , x p ), indexed
by S ⊂ {1, 2, . . . , p}. Let x S̄ be the complement subvector such that

x S ∪ x S̄ = x.

In principle, 
μ(x) depends on features in x S and x S̄ , so that (by rearranging the
order of the features if needed) we can write

 μ(x S , x S̄ ).
μ(x) = 

Hence, if one conditions on specific values for the features in x S̄ , then 


μ(x) can be
seen as a function of the features in x S . One way to define the partial dependence of

μ(x) on x S is given by
 
μS (x S ) = E X S̄ 
 μ(x S , X S̄ ) . (4.6.6)

This average function can be used as a description of the effect of the selected subset
x S on  μ(x) when the features in x S do not have strong interactions with those in
x S̄ .
For instance, in the particular case where the dependence of 
μ(x) on x S is additive
124 4 Bagging Trees and Random Forests


μ(x) = f S (x S ) + f S̄ (x S̄ ), (4.6.7)

so that there is no interactions between features in x S and x S̄ , the partial dependence


function (4.6.6) becomes
 
μS (x S ) = f S (x S ) + E X S̄ f S̄ (X S̄ ) .
 (4.6.8)

Hence, (4.6.6) produces f S (x S ) up to an additive constant. In this case, one sees that
(4.6.6) provides a complete description of the way  μ(x) varies on the subset x S .
The partial dependence function  μS (x S ) can be estimated from the training set
by
1 

μ(x S , x i S̄ ), (4.6.9)
|I| i∈I

where {x i S̄ , i ∈ I} are the values of X S̄ in the training set.


Notice that  μS (x S ) represents the effect of x S on 
μ(x) after accounting for the
average effects of the other features x S̄ on μ(x). Hence, it is different than computing
the conditional expectation

μS (x S ) = E X [
! μ(X)|X S = x S ] . (4.6.10)

Indeed, this latter expression captures the effect of x S on  μ(x) ignoring the effects
of x S̄ . Both expressions (4.6.6) and (4.6.10) are actually equivalent only when X S
and X S̄ are independent.
For instance, in the specific case (4.6.7) where  μ(x) is additive, !μS (x S ) can be
written as  
μS (x S ) = f S (x S ) + E X S̄ f S̄ (X S̄ )|X S = x S .
! (4.6.11)

One sees that ! μS (x S ) does not produce f S (x S ) up to a constant, as it was the


μS (x S ). This time, the behavior of !
case for  μS (x S ) depends on the dependence
structure between X S and X S̄ . In the case where the dependence of  μ(x) on x S is
multiplicative, that is
μ(x) = f S (x S ) · f S̄ (x S̄ ),
 (4.6.12)

so that the features in x S interact with those in x S̄ , (4.6.6) becomes


 
μS (x S ) = f S (x S ) · E X S̄ f S̄ (X S̄ )
 (4.6.13)

while (4.6.10) is given by


 
μS (x S ) = f S (x S ) · E X S̄ f S̄ (X S̄ )|X S = x S .
! (4.6.14)
4.6 Interpretability 125

In this other example, one sees that  μS (x S ) produces f S (x S ) up to a multiplicative


constant factor, so that its form does not depend neither on the dependence structure
between X S and X S̄ , while ! μS (x S ) does well.

4.7 Example

Consider the real dataset described in Sect. 3.2.4.2. We use the same training set D
and validation set D than in the example of Sect. 3.3.2.3, so that the estimates for the
generalization error will be comparable.
We fit random forests with B = 2000 trees on D by means of the R pack-
age rfCountData. More precisely, we use the R command rfPoisson(),
which stands for random forest Poisson, producing random forests with Poisson
deviance as loss function. The number of trees B = 2000 is set arbitrarily. However,
we will see that B = 2000 is large enough (i.e. adding trees will not improve the
predictive accuracy of the random forests under investigation).
The other parameters we need to fine-tune are
– the number m of features tried at each split (mtry);
– the size of the trees, controlled here by the minimum number of observations
(nodesize) required in terminal nodes.
To this end, we try different values for mtry and nodesize and we split the
training set D into five disjoint and stratified subsets D1 , D2 , . . . , D5 of equal size.
Specifically, we consider all possible values for mtry (from 1 to 8) together with
four values for nodesize: 500, 1000, 5000 and 10 000. Note that the training
set D contains 128 755 observations, so that values of 500 or 1000 for nodesize
amounts to require at least 0.39% and 0.78% of the observations in the final nodes
of the individual trees, respectively, which allows for rather large trees. Then, for
each value of (mtry, nodesize), we compute the 5-fold cross-validation estimate
of the generalization error from subsets D1 , D2 , . . . , D5 . The results are depicted in
Fig. 4.2. We can see that the minimum 5-fold cross-validation estimate corresponds
to mtry = 3 and nodesize = 1000. Notice that for any value of nodesize, it is
never optimal to use all the features at each split (i.e. mtry = 8). Introducing a ran-
dom feature selection at each node therefore improves the predictive accuracy of the
ensemble. Moreover, as expected, limiting too much the size of the trees (here with
nodesize = 5000, 10 000) turns out to be counterproductive. The predictive per-
formances for trees with nodesize = 1000 are already satisfying and comparable
to the ones obtained with even smaller trees.
In Fig. 4.3, we show the out-of-bag estimate of the generalization error for random
forests with mtry = 3 and nodesize = 1000 with respect to the number of trees
B. We observe that B = 2000 is more than enough. The out-of-bag estimate is already
stabilized from B = 500. Notice that adding more trees beyond B = 500 does not
decrease the predictive accuracy of the random forest.
126 4 Bagging Trees and Random Forests

Cross−validation results
0.550

Nodesize 500 Nodesize 1 000 Nodesize 5 000 Nodesize 10 000


0.545
Deviance
0.540
0.535
0.530

8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1
mtry

Fig. 4.2 5-fold cross-validation estimates of the generalization error for mtry = 1, 2, . . . , 8 and
nodesize = 500, 1000, 5000, 10 000


We denote by  μrfD, the random forest fitted on the entire training set D with B =
500, mtry = 3 and nodesize = 1000. The relative importances of the features

for 
μrfD, are depicted in Fig. 4.4. The most important feature is AgePh followed by,
in descending order, Split, Fuel, AgeCar, Cover, Gender, PowerCat and Use. Notice
that this ranking of the features in terms of importance is almost identical to the one
shown in Fig. 3.23 and obtained from the tree depicted in Fig. 3.17, only the order of
features AgeCar and Cover is reversed (their importances are very similar here).

Figure 4.5 represents the partial dependence plots of the features for  μrfD, . Specif-
ically, one sees that the partial dependence plot for policyholder’s age is relatively
smooth. This is more realistic than the impact of policyholder’s age deduced from
regression tree  μTαk ∗ represented in Fig. 3.20, tree that is reproduced in Fig. 4.6 where
we have circled the nodes using AgePh for splitting. Indeed, with only six circled
nodes,  μTαk ∗ cannot reflect a smooth behavior of the expected claim frequency with
respect to AgePh. While random forests enable to capture nuances of the response,
regression trees suffer from a lack of smoothness.

Finally, the validation sample estimate of the generalization error of  μrfD, is given
by
val
rf∗

Err 
μD, = 0.5440970.

Compared to the validation sample estimate of the generalization error of 


μTαk ∗ , that
is
4.7 Example 127

0.5500 0.5500

0.5475 0.5475

0.5450 0.5450

0.5425 0.5425

0.5400 0.5400

0 500 1000 1500 2000 0 100 200 300 400 500


Number of trees Number of trees

Fig. 4.3 Out-of-bag estimate of the generalization error for random forests with mtry = 3 and
nodesize = 1000 with respect to the number of trees B

AgePh

Split

Fuel

AgeCar

Cover

Gender

PowerCat

Use

0.000 0.004 0.008 0.012


%IncLossFunction


Fig. 4.4 Relative importances of the features for 
μrf
D , obtained by permutation
128 4 Bagging Trees and Random Forests

0.225
0.15
0.12

0.200
0.10

0.10
0.08
0.175

0.150 0.05
0.05 0.04

0.125

0.00 0.00 0.00


0.100
20 30 40 50 60 70 80 90 Half−Yearly Monthly Quarterly Yearly Diesel Gasoline ComprehensiveLimited.MD TPL.Only
AgePh Split Fuel Cover

0.125
0.12 0.12

0.100
0.14

0.08 0.08
0.075

0.13 0.050

0.04 0.04

0.025

0.12 0.00 0.00 0.000

0 5 10 15 20 Female Male C1 C2 C3 C4 C5 Private Professional


AgeCar Gender PowerCat Use


Fig. 4.5 Partial dependence plots for 
μrf
D ,

val


Err 
μTαk ∗ = 0.5452772


μrfD, improves by 1.1802 10−3 the predictive accuracy of the single
one sees that 
tree 
μTαk ∗ .

Remark 4.7.1 It is worth noticing that the selection procedure of the optimal tuning
parameters may depend on the initial choice of the training set D and on the folds
used to compute cross-validation estimates of the generalization error. This latter
point can be mitigated by increasing the number of folds.

4.8 Bibliographic Notes and Further Reading

Bagging is one of the first ensemble methods proposed by Breiman (1996), who
showed that aggregating multiple versions of an estimator into an ensemble improves
the model accuracy. Several authors added randomness into the training procedure.
Dietterich and Kong (1995) introduced the idea of random split selection. They
proposed to select at each node the best split among the twenty best ones. Amit et al.
(1997) rather proposed to choose the best split over a random subset of the features,
and Amit and Geman (1997) also defined a large number of geometric features. Ho
(1998) investigated the idea of building a decision forest whose trees are produced
4.8 Bibliographic Notes and Further Reading 129

0.14
16e+3 / 129e+3
100%
yes AgePh >= 32 no

0.21
3844 / 22e+3
17%
Split = Yearly
0.13
12e+3 / 107e+3
83%
Split = Half−Yearly,Yearly
0.17 0.24
3018 / 23e+3 2528 / 13e+3
18% 10%
AgePh >= 58 AgePh >= 26
0.12
9060 / 84e+3
65%
AgePh >= 50
0.13 0.18
4809 / 40e+3 2596 / 19e+3
31% 15%
Split = Yearly AgeCar < 6.5
0.1
4251 / 45e+3
35%

Fuel = Gasoline
0.15
2183 / 16e+3
13%

Fuel = Gasoline
0.095 0.12
3002 / 34e+3 2626 / 23e+3
26% 18%
AgePh >= 58 Cover = Comprehensive,Limited.MD
0.13
1708 / 14e+3
11%

Gender = Male
0.09
1982 / 24e+3
18%

Gender = Male

0.087
1558 / 19e+3
15%
AgePh < 74

0.084
1167 / 15e+3
11%
AgeCar >= 5.5

0.091 0.1 0.12 0.13 0.13 0.13 0.19 0.21


437 / 5105 424 / 4391 1249 / 11e+3 1141 / 9982 1301 / 11e+3 422 / 4023 1368 / 9316 1391 / 7909
4% 3% 8% 8% 8% 3% 7% 6%

0.081 0.095 0.11 0.11 0.15 0.16 0.17 0.17 0.27


730 / 9650 391 / 4363 1020 / 10e+3 918 / 8865 567 / 4282 882 / 5858 1228 / 9452 1316 / 8797 1137 / 4989
7% 3% 8% 7% 3% 5% 7% 7% 4%

Fig. 4.6 Optimal tree 


μTαk ∗ built in Chap. 3 on D. The circled nodes use AgePh for splitting

on random subsets of the features, each tree being constructed on a random subset of
the features drawn once (prior the construction of the tree). Breiman (2000) studied
the addition of noise to the response in order to perturb tree structure. From these
works emerged the random forests algorithm discussed in Breiman (2001).
Several authors applied random forests to insurance pricing, such as Wüthrich
and Buser (2019) who adapted tree-based methods to model claim frequencies or
Henckaert et al. (2020) who worked with random forests and boosted trees to develop
full tariff plans built from both the frequency and severity of claims.
Note that the bias-variance decomposition of the generalization error discussed
in this chapter is due to Geman et al. (1992). Also, Sects. 4.3.3.3 and 4.6.2 are
largely inspired by Denuit and Trufin (2020) and by Sect. 8.2 of Friedman (2001),
respectively.
Finally, the presentation is mainly inspired by Breiman (2001), Hastie et al. (2009),
Wüthrich and Buser (2019) and Louppe (2014).
130 4 Bagging Trees and Random Forests

References

Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput
9(7):1545–1588
Amit Y, Geman D, Wilder K (1997) Joint induction of shape features and tree classifiers. IEEE
Trans Pattern Anal Mach Intell 19(11):1300–1305
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Breiman L (2000) Randomizing outputs to increase prediction accuracy. Mach Learn 40:229–242.
ISSN 0885-6125
Breiman L (2001) Random forests. Mach Learn 45:5–32
Denuit M, Trufin J (2020) Generalization error for Tweedie models: decomposition and bagging
models. Working paper
Dietterich TG, Kong EB (1995) Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University
Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat
29(5):1189–1232
Geman S, Bienenstock E, Doursat R (1992) Neural networks and the bias/variance dilemma. Neural
comput 4(1):1–58
Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning. Data Mining,
Inference, and Prediction, 2nd edn. Springer Series in Statistics
Henckaerts R, Côté M-P, Antonio K, Verbelen R (2020) Boosting insights in insurance tariff
plans with tree-based machine learning methods. North Am Actuar J. https://doi.org/10.1080/
10920277.2020.1745656
Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern
Anal Mach Intell 13:340–354
Louppe G (2014) Understanding random forests: from theory to practice. arXiv:14077502
Wüthrich MV, Buser C (2019) Data analytics for non-life insurance pricing. Lecture notes
Chapter 5
Boosting Trees

5.1 Introduction

Bagging trees and random forests base their predictions on an ensemble of trees.
In this chapter, we consider another training procedure based on an ensemble of
trees, called boosting trees. However, the way the trees are produced and combined
differ between random forests (and so bagging trees) and boosting trees. In random
forests, the trees are created independently of each other and contribute equally to
the ensemble. Moreover, the constituent trees can be quite large, even fully grown.
In boosting, however, the trees are typically small, dependent on previous trees and
contribute unequally to the ensemble. Both training procedures are thus different,
but they produce competitive predictive performance. Note that the trees in random
forests can be created simultaneously since they are independent of each other, so
that computational time for random forests is in general smaller than for boosting.

5.2 Forward Stagewise Additive Modeling

Ensemble techniques assume structural models of the form


M
g(μ(x)) = score(x) = βm T (x; am ), (5.2.1)
m=1

where βm , m = 1, 2, . . . , M, are the expansion coefficients, and T (x; am ), m =


1, 2, . . . , M, are usually simple functions of the features x, characterized by param-
eters am . Estimating a score of the form (5.2.1) by minimizing the corresponding
training sample estimate of the generalized error

© Springer Nature Switzerland AG 2020 131


M. Denuit et al., Effective Statistical Learning Methods for Actuaries II,
Springer Actuarial, https://doi.org/10.1007/978-3-030-57556-4_5
132 5 Boosting Trees
  
 
M
−1
min L yi , g βm T (x i ; am ) (5.2.2)
{βm ,am }1M
i∈I m=1

is in general infeasible. It requires computationally intensive numerical optimization


techniques.
One way to overcome this problem is to approximate the solution to (5.2.2) by
using a greedy forward stagewise approach. Such an approach consists in sequentially
fitting a single function and adding it to the expansion of prior fitted terms. Each
fitted term is not readjusted as new terms are added into the expansion, contrarily to
a stepwise approach where previous terms are each time readjusted when a new one
is added. Specifically, we start by computing
     
β1 ,
a1 = argmin L yi , g −1 score
 0 (x i ) + β1 T (x i ; a1 ) (5.2.3)
{β1 ,a1 } i∈I

where s
core0 (x) is an initial guess. Then, at each iteration m ≥ 2, we solve the
subproblem
     
m ,
β am = argmin L yi , g −1 score
 m−1 (x i ) + βm T (x i ; am ) (5.2.4)
{βm ,am } i∈I

with
 m−1 (x) = score
score m−1 T (x i ;
 m−2 (x) + β am−1 ).

This leads to the following algorithm.

Algorithm 5.1: Forward Stagewise Additive Modeling.

1. Initialize s
core0 (x) to be a constant. For instance:

s
core0 (x) = argmin L(yi , g −1 (β)).
β i∈I

2. For m = 1 to M do
2.1 Compute
     
m ,
β am = argmin L yi , g −1 score
 m−1 (x i ) + βm T (x i ; am ) .
{βm ,am } i∈I
(5.2.5)
 m (x) = score
2.2 Update score m T (x i ;
 m−1 (x) + β am ).
End for  
μD (x) = g −1 score
3. Output:   M (x) .
5.2 Forward Stagewise Additive Modeling 133

Considering the squared-error loss together with the identity link function, step
2.1 in Algorithm 5.1 simplifies to
   2
m ,
β am = argmin yi − score
 m−1 (x i ) + βm T (x i ; am )
{βm ,am } i∈I

= argmin (rim + βm T (x i ; am ))2 ,
{βm ,am } i∈I

where rim is the residual of the model after m − 1 iterations on the ith observation.
One sees that the term βm T (x; am ) actually fits the residuals obtained after m − 1
iterations.
The forward stagewise additive modeling described in Algorithm 5.1 is also called
boosting. Boosting is thus an iterative method based on the idea that combining many
simple functions should result in a powerful one. In a boosting context, the simple
functions T (x; am ) are called weak learners or base learners.
There is a large variety of weak learners available for boosting models. For
instance, commonly used weak learners are wavelets, multivariate adaptive regres-
sion splines, smoothing splines, classification trees and regression trees or neural
networks.
Although each weak learner has advantages and disadvantages, trees are the most
commonly accepted weak learners in ensemble techniques such as boosting. The
nature of trees corresponds well with the concept of weak learner. At each itera-
tion, adding a small tree will slightly improve the current predictive accuracy of the
ensemble.
In this second volume, we use trees as weak learners. Boosting using trees as weak
learners is then called boosting trees. As already noticed, the procedure underlying
boosting trees is completely different from bagging trees and random forests.

5.3 Boosting Trees

Henceforth, we use regression trees as weak learners. That is, we consider weak
learners T (x; am ) of the form
 
T (x; am ) = ctm I x ∈ χ(m)
t , (5.3.1)
t∈Tm



where χ(m)
t is the partition of the feature space χ induced by the regression tree
t∈Tm
T (x; am ) and {ctm }t∈Tm the corresponding predictions for the score. For regression
trees, the “parameters” am represent the splitting variables and their split values as
well as the corresponding predictions in the terminal nodes, that is,


am = ctm , χ(m)
t .
t∈Tm
134 5 Boosting Trees

5.3.1 Algorithm

Step 2.2 in Algorithm 5.1 becomes

 m (x) = score
score m T (x i ;
 m−1 (x) + β am )
 
 m−1 (x) + β
= score m ctm I x ∈ χ(m)
 t ,
t∈Tm

so that it can be alternatively expressed as


 
 m (x) = score
score  m−1 (x) + γtm I x ∈ χ(m)
 t (5.3.2)
t∈Tm

 
with  m
γtm = β ctm . Hence, one sees that if βm ,
am is solution to (5.2.5) with


am = 
 ctm , χ(m)
t ,
t∈Tm

 
then 1, 
bm with

 γtm , χ(m)
bm =  t
t∈Tm

is also solution to (5.2.5). In the following, without loss of generality, we consider


solutions to (5.2.5) of the form (1, am ), which yields the following algorithm.

Algorithm 5.2: Boosting Trees.

1. Initialize s
core0 (x) to be a constant. For instance:

s
core0 (x) = argmin L(yi , g −1 (β)).
β i∈I

2. For m = 1 to M do
2.1 Fit a regression tree T (x;
am ) with
   

am = argmin L yi , g −1 score
 m−1 (x i ) + T (x i ; am ) . (5.3.3)
am
i∈I

 m (x) = score
2.2 Update score  m−1 (x) + T (x;
am ).
End for  
−1
3. Output: 
μboost
D (x) = g  M (x) .
score
5.3 Boosting Trees 135

Notice that since


 
am ) =
T (x; ctm I x ∈ χ(m)
 t , (5.3.4)
t∈Tm

step 2.2 in Algorithm


5.2 can be viewed as adding |Tm | separate base learners
ctm I x ∈ χ(m)
 t , t ∈ Tm .

Remark 5.3.1 In Algorithm 5.2, we initialize s core0 (x) as a constant. However, if


there is an existing model for the score, obtained by any training procedure, then
it can also be used as a starting score s core0 (x). For instance, in actuarial pricing,
practitioners traditionally use GLMs, so that they can rely on the score fitted with a
GLM procedure as a starting score in Algorithm 5.2. Notice that boosting can serve
as an efficient tool for model back-testing, the boosting steps being the corrections
needing to be done according to the boosting procedure for the initial model.
Because the structure of regression trees is particularly powerful to account for
interactions between features, a combination of a GLM score for s core0 (x) account-
ing for only the main effects of the features with subsequent boosting trees can be of
particular interest. Indeed, the boosting trees will bring corrections to the main effects
already modeled by the GLM score, if needed, but will also model the interactions
between features.

5.3.2 Particular Cases

5.3.2.1 Squared Error Loss

For the squared-error loss with the identity link function (which is the canonical link
function for the Normal distribution), we have seen that (5.3.3) simplifies to
 2
am = argmin
 yi − score
 m−1 (x i ) + T (x i ; am )
am
i∈I

= argmin (r̃mi + T (x i ; am ))2
am
i∈I

= argmin L (r̃mi , T (x i ; am )) .
am
i∈I

Hence, at iteration m, T (x i ; am ) is simply the best regression tree fitting the current
residuals r̃mi = yi − score
 m−1 (x i ). Finding the solution to (5.2.5) is thus no harder
than for a single tree. It amounts to fit a regression tree on the working training set

D(m) = {(r̃mi , x i ), i ∈ I} .
136 5 Boosting Trees

5.3.2.2 Poisson Deviance Loss

Consider (5.3.3) with the Poisson deviance loss and the log-link function (which
is the canonical link function for the Poisson distribution). In actuarial studies, this
choice is often made to model the number of claims, so that we also account for
the observation period e referred to as the exposure-to-risk. In such a case, one
observation of the training set can be described by the claims count yi , the features
x i and the exposure-to-risk ei , so that we have

D = {(yi , x i , ei ), i ∈ I} .

In this context, (5.3.3) becomes


   

am = argmin L yi , ei exp score
 m−1 (x i ) + T (x i ; am )
am
i∈I
    
= argmin L yi , ei exp score
 m−1 (x i ) exp (T (x i ; am )) ,
am
i∈I

which can be expressed as




am = argmin L (yi , emi exp (T (x i ; am )))
am
i∈I

with  
emi = ei exp score
 m−1 (x i )

for i ∈ I. At iteration m, emi is a constant and can be regarded as a working exposure-


to-risk applied to observation i.
Again, solving (5.3.3) is thus no harder than for a single tree. This is equivalent
to building a single tree with the Poisson deviance loss and the log-link function on
the working training set

D(m) = {(yi , x i , emi ) , i ∈ I} .

Example
Consider the example of Sect. 3.2.4.1. The optimal tree is shown in Fig. 3.8, denoted
here  D . Next to the training set D made of 500 000 observations, we create a
μtree
validation set D containing a sufficiently large number of observations to get stable
results for validation sample estimates, here 1000 000 observations. The validation
sample estimate of the generalization error of μtree
D is then given by

val  tree 

Err 
μD = 0.5525081. (5.3.5)
5.3 Boosting Trees 137

As we have seen, the optimal tree  μtree


D produces the desired partition of the feature
space. The difference with the true model μ comes from the predictions in the terminal
nodes. We get
val

Err (μ) = 0.5524827. (5.3.6)

Let us now build Poisson boosting


  
trees μboost
D with the log-link function on D, so
that we have 
μD (x) = exp score
boost
 M (x) . First, we initialize the algorithm with the
optimal constant score

yi
s
core0 (x) = ln i∈I = −2.030834.
i∈I ei

Note that in this simulated example, we have ei = 1 for all i ∈ I.


Then, we use the R command rpart to produce each of the M constituent trees.
We control the size of the trees with the variable maxdepth=D. Note that specifying
the number of terminal nodes J of a tree is not possible with rpart.
To start, we consider trees with D = 1, that is, we consider trees with only J = 2
terminal nodes. For the first iteration, the working training set is

D(1) = {(yi , x i , e1i ) , i ∈ I} ,

where

e1i = ei exp(
score0 (x i ))
= exp(−2.030834).

The first tree fitted on D(1) is

a1 ) = −0.1439303 I [x2 ≥ 45] + 0.0993644 I [x2 < 45] .


T (x;

The single split is x2 ≥ 45 and the predictions in the terminal nodes are −0.1439303
for x2 ≥ 45 and 0.0993644 for x2 < 45. We then get

core1 (x) = s
s core0 (x) + T (x;
a1 )
= −2.030834 − 0.1439303 I [x2 ≥ 45] + 0.0993644 I [x2 < 45] .

The working training set at iteration m = 2 is

D(2) = {(yi , x i , e2i ) , i ∈ I} ,

with
 
e2i = ei exp s
core1 (x i )
= exp (−2.030834 − 0.1439303 I [xi2 ≥ 45] + 0.0993644 I [xi2 < 45]) .
138 5 Boosting Trees

The second tree fitted on D(2) is


 
a2 ) = −0.06514170 I [x4 = no] + 0.06126193 I x4 = yes .
T (x;

The single split is made with x4 and the predictions in the terminal nodes are
−0.06514170 for x4 = no and 0.06126193 for x4 = yes. We get

core2 (x) = s
s core1 (x) + T (x;
a2 )
= −2.030834
−0.1439303 I [x2 ≥ 45] + 0.0993644 I [x2 < 45]
 
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes .

The working training set at iteration m = 3 is

D(3) = {(yi , x i , e3i ) , i ∈ I} ,

with
 
e3i = ei exp s
core2 (x i )
= exp (−2.030834 − 0.1439303 I [xi2 ≥ 45] + 0.0993644 I [xi2 < 45])
  
exp −0.06514170 I [xi4 = no] + 0.06126193 I xi4 = yes .

The third tree fitted on D(3) is

a3 ) = −0.03341700 I [x2 ≥ 30] + 0.08237676 I [x2 < 30] .


T (x;

The single split is x2 ≥ 30 and the predictions in the terminal nodes are −0.03341700
for x2 ≥ 30 and 0.08237676 for x2 < 30. We get

3 (x) = score
score 2 (x) + T (x;
a3 )
= −2.030834
−0.1439303 I [x2 ≥ 45] + 0.0993644 I [x2 < 45]
 
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes
−0.03341700 I [x2 ≥ 30] + 0.08237676 I [x2 < 30]
= −2.030834
+(0.0993644 + 0.08237676) I [x2 < 30]
+(0.0993644 − 0.03341700) I [30 ≤ x2 < 45]
−(0.1439303 + 0.03341700) I [x2 ≥ 45]
 
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes
= −2.030834
+0.1817412 I [x2 < 30] + 0.0659474 I [30 ≤ x2 < 45] − 0.1773473 I [x2 ≥ 45]
 
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes .
5.3 Boosting Trees 139

The working training set at iteration m = 4 is

D(4) = {(yi , x i , e4i ) , i ∈ I} ,

with
 
e4i = ei exp score
3 (x i )
= exp (−2.030834)
exp (0.1817412 I [xi2 < 30] + 0.0659474 I [30 ≤ xi2 < 45] − 0.1773473 I [xi2 ≥ 45])
  
exp −0.06514170 I [xi4 = no] + 0.06126193 I xi4 = yes .

The fourth tree fitted on D(4) is

a4 ) = −0.05309399 I [x1 = f emale] + 0.05029175 I [x1 = male] .


T (x;

The single split is made with x1 and the predictions in the terminal nodes are
−0.05309399 for x1 = f emale and 0.05029175 for x1 = male. We get

4 (x) = score
score 3 (x) + T (x;
a4 )
= −2.030834
+0.1817412 I [x2 < 30] + 0.0659474 I [30 ≤ x2 < 45] − 0.1773473 I [x2 ≥ 45]
 
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes
−0.05309399 I [x1 = f emale] + 0.05029175 I [x1 = male] .

The working training set at iteration m = 5 is

D(5) = {(yi , x i , e5i ) , i ∈ I} ,

with
 
e5i = ei exp score
4 (x i )
= exp (−2.030834)
exp (0.1817412 I [xi2 < 30] + 0.0659474 I [30 ≤ xi2 < 45] − 0.1773473 I [xi2 ≥ 45])
  
exp −0.06514170 I [xi4 = no] + 0.06126193 I xi4 = yes
exp (−0.05309399 I [xi1 = f emale] + 0.05029175 I [xi1 = male]) .

The fifth tree fitted on D(5) is

a5 ) = −0.01979230 I [x2 < 45] + 0.03326232 I [x2 ≥ 45] .


T (x;

The single split is x2 ≥ 45 and the predictions in the terminal nodes are −0.01979230
for x2 < 45 and 0.03326232 for x2 ≥ 45. We get
140 5 Boosting Trees

5 (x) = score
score 4 (x) + T (x;
a5 )
= −2.030834
−0.05309399 I [x1 = f emale] + 0.05029175 I [x1 = male]
+0.1817412 I [x2 < 30] + 0.0659474 I [30 ≤ x2 < 45] − 0.1773473 I [x2 ≥ 45]
 
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes
−0.01979230 I [x2 < 45] + 0.03326232 I [x2 ≥ 45]
= −2.030834
−0.05309399 I [x1 = f emale] + 0.05029175 I [x1 = male]
+(0.1817412 − 0.01979230) I [x2 < 30] + (0.0659474 − 0.01979230) I [30 ≤ x2 < 45]
+(0.03326232 − 0.1773473) I [x2 ≥ 45]
 
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes
= −2.030834
−0.05309399 I [x1 = f emale] + 0.05029175 I [x1 = male]
+0.1619489 I [x2 < 30] + 0.0461551 I [30 ≤ x2 < 45] − 0.144085 I [x2 ≥ 45]
 
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes

The working training set at iteration m = 6 is

D(6) = {(yi , x i , e6i ) , i ∈ I} ,

with
 
e6i = ei exp score
5 (x i )
= exp (−2.030834)
exp (−0.05309399 I [xi1 = f emale] + 0.05029175 I [xi1 = male])
exp (0.1619489 I [xi2 < 30] + 0.0461551 I [30 ≤ xi2 < 45] − 0.144085 I [xi2 ≥ 45])
  
exp −0.06514170 I [xi4 = no] + 0.06126193 I xi4 = yes .

The sixth tree fitted on D(6) is

a6 ) = −0.009135574 I [x2 ≥ 30] + 0.019476688 I [x2 < 30] .


T (x;

The single split is x2 ≥ 30 and the predictions in the terminal nodes are −0.00913557
for x2 ≥ 30 and 0.019476688 for x2 < 30. We get

6 (x) = score
score 5 (x) + T (x;
a6 )
= −2.030834
−0.05309399 I [x1 = f emale] + 0.05029175 I [x1 = male]
+0.1619489 I [x2 < 30] + 0.0461551 I [30 ≤ x2 < 45] − 0.144085 I [x2 ≥ 45]
 
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes
−0.009135574 I [x2 ≥ 30] + 0.019476688 I [x2 < 30]
= −2.030834
5.3 Boosting Trees 141

−0.05309399 I [x1 = f emale] + 0.05029175 I [x1 = male]


+0.1814256 I [x2 < 30] + 0.03701953 I [30 ≤ x2 < 45] − 0.1532206 I [x2 ≥ 45]
 
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes .

Hence, for M = 6, i.e. after 6 iterations, we have


μboost
D (x) = exp (−2.030834)
exp (−0.05309399 I [x1 = f emale] + 0.05029175 I [x1 = male])
exp (0.1814256 I [x2 < 30] + 0.03701953 I [30 ≤ x2 < 45] − 0.1532206 I [x2 ≥ 45])
  
exp −0.06514170 I [x4 = no] + 0.06126193 I x4 = yes .

We see that  μboost


D with M = 6 partitions the feature space into 12 parts, as desired.
In Table 5.1, we show the 12 risk classes with their corresponding expected claim
frequencies μ(x) and estimated expected claim frequencies  D (x) and 
μtree μboost
D (x).
If we continue to increase the number of iterations, we overfit the training set.
val  
In Fig. 5.1, we provide the validation sample estimate  Err 
μboost
D with respect to
val val  
the number of trees M, together with  Err (μ) and  Err μD . We see that the
tree
val  
boosting model minimizing  Err 
μboost
D contains M = 6 trees, and its predictive
accuracy is similar to  μtree
D . Note that when M ≤ 5, the optimal tree  μtree
D performs
better than μboost
D while from M = 7, we start to overfit the training set. For instance,
the seventh tree is

a7 ) = −0.01147564 I [x2 < 39] + 0.01109997 I [x2 ≥ 39] .


T (x;

Table 5.1 Risk classes with their corresponding expected claim frequencies μ(x) and estimated
expected claim frequencies 
μtree
D (x) and μboost
D (x) for M = 6
x
x1 (Gender) x2 (Age) x4 (Sport) μ(x) 
μtree
D (x) 
μboost
D (x)
(for M=6)
Female x2 ≥ 45 No 0.1000 0.1005 0.1000
Male x2 ≥ 45 No 0.1100 0.1085 0.1109
Female x2 ≥ 45 Yes 0.1150 0.1164 0.1135
Male x2 ≥ 45 Yes 0.1265 0.1285 0.1259
Female 30 ≤ x2 < 45 No 0.1200 0.1206 0.1210
Male 30 ≤ x2 < 45 No 0.1320 0.1330 0.1342
Female 30 ≤ x2 < 45 Yes 0.1380 0.1365 0.1373
Male 30 ≤ x2 < 45 Yes 0.1518 0.1520 0.1522
Female x2 < 30 No 0.1400 0.1422 0.1399
Male x2 < 30 No 0.1540 0.1566 0.1550
Female x2 < 30 Yes 0.1610 0.1603 0.1586
Male x2 < 30 Yes 0.1771 0.1772 0.1759
142 5 Boosting Trees

Generalization error (validation sample) 0.5540

0.5535

0.5530

0.5525

5 10 15 20
Boosting iterations

val  boost 
Fig. 5.1 Validation sample estimate 
Err 
μD with respect to the number of iterations for
val val  tree 
trees with two terminal nodes, together with 
Err (μ) (dotted line) and 
Err 
μD (solid line)

5.3.2.3 Gamma Deviance Loss

Consider the Gamma deviance loss. Using the log-link function, (5.3.3) is then given
by
   

am = argmin L yi , exp score
 m−1 (x i ) + T (x i ; am )
am
i∈I
 
 yi
= argmin − 2 ln  
am  m−1 (x i ) + T (x i ; am )
exp score
i∈I
 
yi
+2   −1 ,
 m−1 (x i ) + T (x i ; am )
exp score

so that we get


r̃mi

r̃mi


am = argmin − 2 ln +2 −1
am exp (T (x i ; am )) exp (T (x i ; am ))
i∈I

= argmin L (r̃mi , exp (T (x i ; am ))) (5.3.7)
am
i∈I
5.3 Boosting Trees 143

with yi
r̃mi =  
 m−1 (x i )
exp score

for i ∈ I. Therefore, (5.3.3) simplifies to (5.3.7), so that finding the solution to (5.3.3)
amounts to obtain the regression tree with the Gamma deviance loss and the log-link
function that best predicts the working responses r̃mi . Hence, solving (5.3.3) amounts
to build the best tree on the working training set

D(m) = {(r̃mi , x i ), i ∈ I} ,

using the Gamma deviance loss and the log-link function.


In practice, the choice of the log-link function is often made for the convenience
of having a multiplicative model. If we rather choose the canonical link function for
the Gamma distribution, that is, if we choose g(x) = −1 x
and hence g −1 (x) = −1
x
,
then (5.3.3) becomes
 
−1

am = argmin L yi ,
am
i∈I
 m−1 (x i ) + T (x i ; am )
score
   
= argmin − 2 ln −yi score
 m−1 (x i ) + T (x i ; am )
am
i∈I

   
−2 yi score
 m−1 (x i ) + T (x i ; am ) + 1 . (5.3.8)

5.3.3 Size of the Trees

Boosting trees have two important tuning parameters, that are the size of the trees
and the number of trees M. The size of the trees can be specified in different ways,
such as with the number of terminal nodes J or with the depth of the tree D.
In the boosting context, the size of the trees is controlled by the interaction depth
ID. Each subsequent split can be seen as a higher-level of interaction with the pre-
vious split features. Setting ID = 1 produces single-split regression trees, so that no
interactions are allowed. Only the main effects of the features can be captured by
the score. With ID = 2, two-way interactions are also permitted, and for ID = 3,
three-way interactions are also allowed, and so on. Thus, the value of ID reflects the
level of interactions permitted in the score. Note that ID corresponds to the number
of splits in the trees. Obviously, we have ID = J − 1.
In practice, the level of interactions required is often unknown, so that ID is a
tuning parameter that is set by considering different values and selecting the one
that minimizes the generalization error estimated on a validation set or by cross-
validation. In practice, ID = 1 will be often insufficient, while ID > 10 is very
unlikely.
144 5 Boosting Trees

Note that in the simulated example discussed in Sect. 5.3.2.2, trees with ID = 1
are big enough to get satisfying results since the true score does not contain interaction
effects.

5.3.3.1 Example 1

Consider the simulated example of Sect. 3.6. In this example, using the log-link
function, the true score is given by

score(x) = ln(0.1) + ln(1.1) I [x1 = male]


+ ln(1.4)I [18 ≤ x2 < 30] + ln(1.2) I [30 ≤ x2 < 45]
+ ln(1.15) I [x4 = yes] + ln (1.6/1.4) I [x1 = male, 18 ≤ x2 < 30] .

The first terms are functions of only one single feature while the last term is a two-
variable functions, producing a second-order interaction between X 1 and X 2 .
We generate 1000 000 additional observations to produce a validation set. The
validation sample estimate of the generalization error of 
μtree
D , which is the optimal
tree displayed in Fig. 3.25, is then given by
val  tree 

Err 
μD = 0.5595743. (5.3.9)

The optimal tree 


μtree
D produces the correct partition of the feature space and differs
from the true model μ by the predictions in the terminal nodes. We have
val

Err (μ) = 0.5595497. (5.3.10)

Let us fit boosting trees using the Poisson deviance loss and the log-link function.
We follow the procedure described in the example of Sect. 5.3.2.2. First, we consider
trees with only one split (ID = 1). Fig. 5.2 provides the validation sample estimate
val   val

Err 
μboost with respect to the number of trees M, together with  Err (μ) and
D
val  
 
μtree
Err D . We observe that the boosting models  μboost
D have higher validation
sample estimates of the generalization error than the optimal tree  μtree
D , whatever the
number of iterations M. Contrarily to the optimal tree, the boosting models with
ID = 1 cannot account for the second-order interaction. We should consider trees
with ID = 2. However, as already mentioned, specifying the number of terminal
nodes J of a tree, or equivalently the interaction depth ID, is not possible with
rpart.
To overcome this issue, we could rely on the R gbm package for instance, that
enables to specify the interaction depth. This package implements the gradient boost-
ing approach as described in Sect. 5.4, which is an approximation of boosting for any
differentiable loss function. In Sect. 5.4.4.2, we illustrate the use of the gbm package
on this particular example with ID = 1, 2, 3, 4.
5.3 Boosting Trees 145

0.5615
Generalization error (validation sample)

0.5610

0.5605

0.5600

0.5595
5 10 15 20
Boosting iterations

val  boost 
Fig. 5.2 Validation sample estimate 
Err 
μ
with respect to the number of iterations for
D
val val  tree 
Err (μ) (dotted line) and 
trees with two terminal nodes (J = 2), together with  Err 
μD
(solid line)

5.3.3.2 Example 2

Consider the simulated example of Sect. 3.7.2. We generate 1000 000 additional
observations to produce a validation set. The validation sample estimate of the gen-
eralization error of 
μtree
D , which is the tree minimizing the 10-fold cross validation
error and shown in Fig. 3.31, is then given by
val  tree 

Err 
μD = 0.5803903. (5.3.11)

Contrarily to the previous example and the example in Sect. 5.3.2.2, a single tree
(here 
μtree
D ) cannot reproduce the desired partition of the feature space. We have seen
in Sect. 3.7.2 that 
μtree
D suffers from a lack of smoothness. The validation sample
estimate of the generalization error of the true model μ is
val

Err (μ) = 0.5802196, (5.3.12)
146 5 Boosting Trees

0.5815
Generalization error (validation sample)

0.5810

0.5805

0 25 50 75 100
Boosting iterations

val  boost 
Fig. 5.3 Validation sample estimate 
Err 
μD with respect to the number of iterations for
val val  tree 
trees with two terminal nodes (J = 2), together with 
Err (μ) (dotted line) and 
Err 
μD
(solid line)

so that the room for improvement for the validation sample estimate of the general-
ization error is
val  tree  val

Err μD − 
 Err (μ) = 0.0001706516.

We follow the same procedure than in example of Sect. 5.3.2.2, namely we use
the Poisson deviance loss with the log-link function and we consider trees with only
one split (ID = 1) since there is no interaction effects in the true model. In Fig. 5.3,
val  
we provide the validation sample estimate  Err 
μboost
D with respect to the number
val val  

of trees M, together with Err (μ) and Err  
μD . We see that the error estimate
tree
val   val  

Err 
μD
boost
becomes smaller than  Err 
μD from M = 8 and stabilizes around
tree
val  
M = 25. The smallest error  Err 
μboost
D corresponds to M = 49 and is given by

val  boost 

Err 
μD = 0.5802692.

val  
In Fig. 5.4, we show the validation sample estimate 
Err μboost
D with respect
to the number of trees M for constituent trees with depth D = 2, 3. For D = 2, the
5.3 Boosting Trees 147

Generalization error (validation sample) 0.5810

0.5808

0.5806

0.5804

0.5802
0 25 50 75 100
Boosting iterations

0.5806
Generalization error (validation sample)

0.5805

0.5804

0.5803

0.5802
0 25 50 75 100
Boosting iterations

val  boost 
Fig. 5.4 Validation sample estimate 
Err 
μ
with respect to the number of iterations for
D
val val  tree 
Err (μ) (dotted line) and 
trees with D = 2 (top) and D = 3 (bottom), together with  Err 
μD
(solid line)
148 5 Boosting Trees

val  boost 
smallest error 
Err 
μD corresponds to M = 6 and is given by

val  boost 

Err 
μD = 0.5802779,

val  
while for D = 3, the smallest error Err 
μboost
D corresponds to M = 4 and is given
by
val  boost 
Err 
μD = 0.5803006.

One sees that increasing the depth of the trees does not enable to improve the pre-
dictive accuracy of the boosting models. These models incur unnecessary variance
leading to higher validation sample estimates of the generalization error. Denoting
val  
by M ∗ the number of trees minimizing Err μboost
D , we observe that the predictive
accuracy of the model significantly degrades when M > M ∗ . Increasing the size of
the trees enables the boosting model to fit the training set too well when M > M ∗ .

5.4 Gradient Boosting Trees

Simple fast algorithms do not always exist for solving (5.3.3). Depending on the
choice of the loss function and the link function, the solution to (5.3.3) can be
difficult to obtain.
For any differentiable loss function, this difficulty can be overcome by analogy
to numerical optimization. The solution to (5.3.3) can be approximated by using a
two-step procedure, as explained next.
To ease the presentation of this section, we equivalently use the notation {(y1∗ , x ∗1 ),

. . . , (y|I| , x ∗|I| )} for the observations of the training set, that is,

D = {(yi , x i ), i ∈ I}
= {(y1∗ , x ∗1 ), . . . , (y|I|

, x ∗|I| )}. (5.4.1)

5.4.1 Numerical Optimization

The primary objective is to minimize the training sample estimate of the generalized
error   
L(score(x)) = L yi , g −1 (score(x i ))) (5.4.2)
i∈I

with respect to score(x), where score(x) is assumed to be a sum of trees. In other


words, we are interested into a function score :  → R minimizing (5.4.2) under the
constraint that it is a sum of trees.
5.4 Gradient Boosting Trees 149

We observe that (5.4.2) only specifies this function in the values x i , i ∈ I. Hence,
forgetting that we work with constrained functions, we actually try to find optimal
parameters

η = (η1 , . . . , η|I| )
= (score(x ∗1 ), . . . , score(x ∗|I| )) (5.4.3)

minimizing
|I|
  
L(η) = L yi∗ , g −1 (ηi ) . (5.4.4)
i=1

The saturated model is of course one of the solution to (5.4.4), namely ηi = g(yi∗ ) for
all i = 1, . . . , |I|, but this solution typically leads to overfitting. Note that restricting
the score to be the sum of a limited number of relatively small regression trees will
prevent from overfitting.
Our problem can be viewed as the numerical optimization


η = argmin L(η). (5.4.5)
η

Numerical optimization methods often express the solution to (5.4.5) as


M

η= bm , bm ∈ R|I| , (5.4.6)
m=0

where b0 is an initial guess and b1 , . . . , b M are successive increments, also called


steps or boosts, each based on the preceding steps. The way each step bm is computed
depends on the optimization method considered.

5.4.2 Steepest Descent

One of the simplest and frequently used numerical minimization methods is the
steepest descent. The steepest descent defines step bm as

bm = −ρm gm ,

where ρm is a scalar and


⎛   ⎞
  ∗
∂ L(y1∗ , g −1 (η1 )) ∂ L(y|I| , g −1 (η|I| ))
gm = ⎝ ,..., ⎠
∂η1 η1 =
ηm−1,1 ∂η|I|
η|I| =
ηm−1,|I|
(5.4.7)
150 5 Boosting Trees

is the gradient of L(η) evaluated at 


η m−1 = ( ηm−1,|I| ) given by
ηm−1,1 , . . . , 

η m−1 = b0 + b1 + . . . + bm−1 .
 (5.4.8)

The negative gradient −gm gives the local direction along with L(η) decreases the
η m−1 . The step length ρm is then found by
most rapidly at η = 

ρm = argmin L(
η m−1 − ρgm ), (5.4.9)
ρ>0

which provides the update


ηm = 
 η m−1 − ρm gm . (5.4.10)

5.4.3 Algorithm

The steepest descent is the best strategy to minimize L(η). Unfortunately, the gradient
gm is only defined at the feature values x i , i ∈ I, so that it cannot be generalized to
other feature values whereas we would like a function defined on the entire feature
space χ. Furthermore, we want to prevent from overfitting.
A way to solve these issues is to constrain the step directions to be members of
a class of functions. Specifically, we reinstate the temporarily forgotten constrain
that is to work with regression trees. Hence, we approximate the direction −gm by a
regression tree T (x;am ) producing
 
gm = T (x ∗1 ;
− am ), . . . , T (x ∗|I| ;
am )

that is as close as possible to the negative gradient, i.e. the most parallel to −gm .
A common choice to measure the closeness between constrained candidates
T (x; am ) for the negative gradient and the unconstrained gradient −gm = −(gm1 , . . . ,
gm|I| ) is to use the squared error, so that we get

|I|
  2
am = argmin
 −gmi − T (x i∗ ; am ) . (5.4.11)
am
i=1

am ) is thus fitted to the negative gradient −gm by least squares.


The tree T (x;
The optimal step size  ρm is then found by

|I|
  

ρm = argmin L(yi∗ , g −1 score
 m−1 (x i∗ ) + ρm T (x i∗ ;
am ) ), (5.4.12)
ρm >0 i=1

where
 m−1 (x) = score
score  m−2 (x) + 
ρm−1 T (x;
am−1 )
5.4 Gradient Boosting Trees 151

 0 (x) is the constant function


and score

|I|

core0 (x) = argmin
s L(yi∗ , g −1 (ρ0 )).
ρ0
i=1

am ) can be written as
As already noticed, since T (x;
 
am ) =
T (x; ctm I x ∈ χ(m)
 t ,
t∈Tm

the update
 m (x) = score
score  m−1 (x) + 
ρm T (x;
am )

ctm I x ∈ χ(m)
can be viewed as adding |Tm | separate base learners  t , t ∈ Tm , to
 m−1 (x). Instead of using one coefficient for T (x;
score am ), the current procedure
 can

(m)
be improved by using the optimal coefficients for each base learner ctm I x ∈ χt ,
t ∈ Tm . Thus, we replace the optimal coefficient  ρm solution to (5.4.12) by |Tm |
coefficients 
ρtm , t ∈ Tm , that are solution to
⎛ ⎛ ⎞⎞
|I |
  
(m) ⎠⎠
{
ρtm }t∈Tm = argmin L ⎝ yi∗ , g −1 ⎝score
 m−1 (x i∗ ) + ctm I x i∗ ∈ χt
ρtm .
{ρtm }t∈Tm i=1 t∈Tm
(5.4.13)
Subspaces {χ(m)
t } t∈Tm of the feature space χ are disjoint, so that (5.4.13) amounts to
solve
   
ρtm = argmin
 L yi∗ , g −1 score
 m−1 (x i∗ ) + ρtm ctm (5.4.14)
ρtm
i:x i∗ ∈χ(m)
t

for each t ∈ Tm . Coefficients  ρtm , t ∈ Tm , are thus fitted separately. Here, the least-
squares coefficients  ctm , t ∈ Tm , are simply multiplicative constants. Ignoring these
latter coefficients, (5.4.14) reduces to
   

γtm = argmin L yi∗ , g −1 score
 m−1 (x i∗ ) + γtm , (5.4.15)
γtm
i:x i∗ ∈χ(m)
t

γtm is the best update for the score in subspace χ(m)


meaning that  t given the current
 m−1 (x).
approximation score
152 5 Boosting Trees

This leads to the following algorithm.

Algorithm 5.3: Gradient Boosting Trees.

1. Initialize s
core0 (x) to be a constant. For instance:

|I|

core0 (x) = argmin
s L(yi∗ , g −1 (ρ0 )). (5.4.16)
ρ0
i=1

2. For m = 1 to M do
2.1 For i = 1, . . . , |I|, compute
 
∂ L(yi∗ , g −1 (η))
rmi =− . (5.4.17)
∂η scorem−1 (x i∗ )
η=

2.2 Fit a regression tree T (x;


am ) to the working responses rmi by least squares,
i.e.
 I
 2
am = argmin
 rmi − T (x i∗ ; am ) , (5.4.18)
am
i=1

giving the partition {χ(m)


t , t ∈ Tm } of the feature space.
2.3 For t ∈ Tm , compute
   

γtm = argmin L yi∗ , g −1 score
 m−1 (x i∗ ) + γtm . (5.4.19)
γtm
i:x i∗ ∈χ(m)
t


(m)
 m (x) = score
2.4 Update score  m−1 (x) + 
γ
t∈Tm tm I x ∈ χ t .

End for  
(x) = g −1 score
grad boost
3. Output: 
μD  M (x) .

Step 2.2 in Algorithm 5.3 determines the partition {χ(m) t , t ∈ Tm } of the feature
space for the mth iteration. Then, given the partition {χ(m)
t , t ∈ Tm }, step 2.3 estimates
the predictions in the terminal nodes t ∈ Tm at the mth iteration based on the current
 m−1 (x).
score score
Instead of fitting directly one tree as in step 2.1 in Algorithm 5.2, we first fit a
tree on working responses by least squares (step 2.2 in Algorithm 5.3) to get the
structure of the tree and then we compute the predictions in the final nodes (step 2.3
in Algorithm 5.3).
5.4 Gradient Boosting Trees 153

5.4.4 Particular Cases

5.4.4.1 Squared Error Loss

For the squared-error loss with the identity link function, i.e.
 
L y, g −1 (score(x)) = (y − score(x))2 ,

we directly get
|I|
yi∗
core0 (x) =
s i=1
|I|

for (5.4.16) and the working responses (5.4.17) become

 
∂(yi∗ − η)2
rmi = −
∂η scorem−1 (x i∗ )
η=
 
=2 yi∗  m−1 (x i∗ ) .
− score (5.4.20)

Therefore, the working responses rmi are just the ordinary residuals, as in Sect. 5.3.2.1,
so that gradient boosting trees is here equivalent to boosting trees. The predictions
(5.4.19) are given by
 
i:x i∗ ∈χ(m) yi∗ − score
 m−1 (x i∗ )
γtm =
 t
. (5.4.21)
card{i : x i∗ ∈ χ(m)
t }

5.4.4.2 Poisson Deviance Loss

Consider the Poisson deviance loss together with the log-link function, and the train-
ing set

D = {(yi , x i , ei ), i ∈ I}
= {(y1∗ , x ∗1 , e1∗ ), . . . , (y|I|

, x ∗|I| , e|I|

)}. (5.4.22)

Since
  
 −1
 y e exp(score(x))
L y, eg (score(x)) = 2y ln −1+ ,
e exp(score(x)) y

(5.4.16) and (5.4.17) become


154 5 Boosting Trees
 |I| 

i=1 yi
s
core0 (x) = ln |I| ∗
i=1 ei

and

 y∗  ei∗ exp(η)

∂ ln e∗ exp(η)
i
−1+ yi∗
∗⎣ ⎦
rmi = −2yi i

∂η
scorem−1 (x i∗ )
η=
 
=2 yi∗ − ei∗  m−1 (x i∗ ))
exp(score , (5.4.23)

respectively. One sees that (5.4.23) can be rewritten as


 
rmi = 2 yi∗ − emi

, (5.4.24)


with emi = ei∗ exp(score
 m−1 (x i∗ )). At iteration m, step 2.2 in Algorithm 5.3 amounts
to fit a regression tree T (x;am ) on the residuals (5.4.24). The predictions (5.4.19)
are given by  
i:x i∗ ∈χ(m) yi∗
γtm = ln t
∗ . (5.4.25)
i:x ∗ ∈χ(m)
t
emi
i

As we have seen, boosting trees is fully manageable in this case since fitting the
mth tree in the sequence is no harder than fitting a single regression tree. Thus, gra-
dient boosting trees is a bit artificial here, as noted in Wüthrich and Buser (2019). At
each iteration, the structure of the new tree is obtained by least squares while the Pois-
son deviance loss can be used without any problem. Hence, gradient boosting trees
introduces an extra step which is artificial, leading to an unnecessary approximation.
Example
Consider the example in Sect. 5.3.3.1. We use the R package gbm to build the gradient
grad boost
boosting models  μD with the Poisson deviance loss and the log-link function.
Note that the command gbm enables to specify the interaction depth of the constituent
val  grad boost 
trees.
Figure 5.5 displays the validation sample estimate Err 
μD with respect
val val
the number of trees M for ID = 1, 2, 3, 4, together with 
to tree Err (μ) and  Err

μD .
The gradient boosting models with ID = 1 produce results similar to those depicted in Fig. 5.2. As expected, because of the second-order interaction between $X_1$ and $X_2$, ID = 2 is the optimal choice in this example. The gradient boosting model $\widehat{\mu}_{\mathcal{D}}^{\text{grad boost}}$ with ID = 2 and $M = 6$ produces the lowest validation sample estimate of the generalization error, which is similar to that of the single optimal tree $\widehat{\mu}_{\mathcal{D}}^{\text{tree}}$. Note that gradient boosting models with ID = 3, 4 have lower validation sample estimates of the generalization error as long as $M \leq 5$. This is due to the fact that models with ID = 3, 4 learn faster than those with ID = 2.

[Fig. 5.5 Validation sample estimate $\widehat{\mathrm{Err}}^{\mathrm{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{grad boost}}\right)$ with respect to the number of iterations for trees with ID = 1 (top left), ID = 2 (top right), ID = 3 (bottom left), ID = 4 (bottom right), together with $\widehat{\mathrm{Err}}^{\mathrm{val}}(\mu)$ (dotted line) and $\widehat{\mathrm{Err}}^{\mathrm{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right)$ (solid line)]

From $M = 6$, gradient boosting models with ID = 2 outperform models with ID = 3, 4. The learning capacity of trees with ID = 3, 4 is too strong, so that they reduce the training sample estimate of the generalization error too quickly.

5.4.4.3 Gamma Deviance Loss

For the Gamma deviance loss with the log-link function, i.e.
$$L\left(y, g^{-1}(\mathrm{score}(\boldsymbol{x}))\right) = -2\ln\left(\frac{y}{\exp(\mathrm{score}(\boldsymbol{x}))}\right) + 2\left(\frac{y}{\exp(\mathrm{score}(\boldsymbol{x}))} - 1\right), \tag{5.4.26}$$
(5.4.16) and (5.4.17) become
$$\widehat{\mathrm{score}}_0(\boldsymbol{x}) = \ln\left(\frac{\sum_{i=1}^{|\mathcal{I}|} y_i^*}{|\mathcal{I}|}\right)$$
and

$$r_{mi} = 2\left[\frac{\partial}{\partial \eta}\left(\ln\left(\frac{y_i^*}{\exp(\eta)}\right) - \left(\frac{y_i^*}{\exp(\eta)} - 1\right)\right)\right]_{\eta = \widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i^*)} = 2\left(\frac{y_i^*}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i^*)\right)} - 1\right), \tag{5.4.27}$$

respectively. The latter expression for $r_{mi}$ can be expressed as
$$r_{mi} = 2\left(\tilde{r}_{mi} - 1\right) \tag{5.4.28}$$
with
$$\tilde{r}_{mi} = \frac{y_i^*}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i^*)\right)}. \tag{5.4.29}$$
The predictions (5.4.19) are given by
$$\widehat{\gamma}_{tm} = \ln\left(\frac{\sum_{i:\boldsymbol{x}_i^* \in \chi_t^{(m)}} \frac{y_i^*}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i^*)\right)}}{\mathrm{card}\{i : \boldsymbol{x}_i^* \in \chi_t^{(m)}\}}\right). \tag{5.4.30}$$

Again, as for the Poisson deviance loss with the log-link function, we have seen that boosting trees for the Gamma deviance loss is easy to implement with the log-link function, so that gradient boosting trees is also a bit artificial here.
If we consider the canonical link function $g(x) = -\frac{1}{x}$, i.e.
$$L\left(y, g^{-1}(\mathrm{score}(\boldsymbol{x}))\right) = -2\ln\left(-y\,\mathrm{score}(\boldsymbol{x})\right) - 2\left(y\,\mathrm{score}(\boldsymbol{x}) + 1\right), \tag{5.4.31}$$

we get for (5.4.16) and (5.4.17)
$$\widehat{\mathrm{score}}_0(\boldsymbol{x}) = \frac{-|\mathcal{I}|}{\sum_{i=1}^{|\mathcal{I}|} y_i^*}$$
and
$$r_{mi} = 2\left[\frac{\partial}{\partial \eta}\left(\ln\left(-y_i^* \eta\right) + y_i^* \eta + 1\right)\right]_{\eta = \widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i^*)} = 2\left(\frac{1}{\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i^*)} + y_i^*\right), \tag{5.4.32}$$
respectively. The predictions (5.4.19) satisfy
$$\sum_{i:\boldsymbol{x}_i^* \in \chi_t^{(m)}} \frac{-1}{\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i^*) + \widehat{\gamma}_{tm}} = \sum_{i:\boldsymbol{x}_i^* \in \chi_t^{(m)}} y_i^*. \tag{5.4.33}$$

5.5 Boosting Versus Gradient Boosting

In practice, actuaries often use distributions that belong to the Tweedie class together with the log-link function to approximate $\mu(\boldsymbol{x})$. The log-link function is chosen mainly because of the multiplicative structure it produces for the resulting model. The Tweedie class comprises the members of the ED family having power variance functions $V(\mu) = \mu^{\xi}$ for some $\xi$.
Specifically, the Tweedie class includes continuous distributions such as the Nor-
mal, Gamma and Inverse Gaussian distributions. It also includes the Poisson and
compound Poisson-Gamma distributions. Compound Poisson-Gamma distributions
can be used for modeling annual claim amounts, having positive probability at zero
and a continuous distribution on the positive real numbers. In practice, annual claim
amounts are often decomposed into claim numbers and claim severities and sepa-
rate analyses of these quantities are conducted. Typically, the Poisson distribution is
used for modeling claim counts and the Gamma or Inverse Gaussian distributions
for claim severities.
The following table gives a list of all Tweedie distributions.

Range of ξ     Type                   Name
ξ < 0          Continuous             -
ξ = 0          Continuous             Normal
0 < ξ < 1      Non-existing           -
ξ = 1          Discrete               Poisson
1 < ξ < 2      Mixed, non-negative    Compound Poisson-Gamma
ξ = 2          Continuous, positive   Gamma
2 < ξ < 3      Continuous, positive   -
ξ = 3          Continuous, positive   Inverse Gaussian
ξ > 3          Continuous, positive   -

Negative values of $\xi$ give continuous distributions on the whole real axis. For $0 < \xi < 1$, no ED member exists. Only the cases $\xi \geq 1$ are thus interesting for application in insurance. The corresponding deviance loss function is
$$L(y, \widehat{\mu}) = \begin{cases} (y - \widehat{\mu})^2 & \text{for } \xi = 0,\\ 2\left(y \ln \frac{y}{\widehat{\mu}} - (y - \widehat{\mu})\right) & \text{for } \xi = 1,\\ 2\left(-\ln \frac{y}{\widehat{\mu}} + \frac{y}{\widehat{\mu}} - 1\right) & \text{for } \xi = 2,\\ 2\left(\frac{\max(y,0)^{2-\xi}}{(1-\xi)(2-\xi)} - \frac{y \widehat{\mu}^{1-\xi}}{1-\xi} + \frac{\widehat{\mu}^{2-\xi}}{2-\xi}\right) & \text{else.} \end{cases} \tag{5.5.1}$$

Notice that we consider non-negative responses in our applications, so that we could write $\max(y, 0) = y$. However, for the sake of completeness, we keep the notation $\max(y, 0)$. The Poisson and Gamma distributions with the log-link function give rise to simple boosting algorithms, as discussed previously. In the Gamma case, we have seen that
$$L\left(y_i, \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right) = L\left(\tilde{r}_{mi}, \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right) \tag{5.5.2}$$

with
$$\tilde{r}_{mi} = \frac{y_i}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)},$$
while in the Poisson case, we have noticed that
$$L\left(y_i, e_i \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right) = L\left(y_i, e_{mi} \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right) \tag{5.5.3}$$
with
$$e_{mi} = e_i \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right).$$

For a Poisson response $Y$, we know that the actuary is allowed to work either with the observed claim count $Y$ or with the observed claim rate $\widetilde{Y} = Y/e$ provided the weight $\nu = e$ enters the analysis. The distribution of the claim rate $\widetilde{Y}$ still belongs to the Tweedie class and is called the Poisson rate distribution. See Property 2.5.1 in Denuit, Hainaut and Trufin (2019) for more details. This is reflected by the fact that the Poisson deviance loss satisfies
$$L\left(y_i, e_i \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right) = \nu_i L\left(\tilde{y}_i, \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right) \tag{5.5.4}$$
with $\nu_i = e_i$ and $\tilde{y}_i = y_i / e_i$. Working with the claim rates $\tilde{y}_i$ then yields

$$\begin{aligned} &\nu_i L\left(\tilde{y}_i, \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right)\\ &= \nu_i\, 2\left(\tilde{y}_i \ln\left(\frac{\tilde{y}_i}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)}\right) - \tilde{y}_i + \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right)\\ &= \nu_i \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right) 2\left(\frac{\tilde{y}_i}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)} \ln\left(\frac{\tilde{y}_i}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right) \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)}\right) - \frac{\tilde{y}_i}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)} + \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right)\\ &= \nu_i \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right) L\left(\frac{\tilde{y}_i}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)}, \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right)\\ &= \nu_{mi} L\left(\tilde{r}_{mi}, \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right) \end{aligned} \tag{5.5.5}$$
with
$$\nu_{mi} = \nu_i \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)$$
and
$$\tilde{r}_{mi} = \frac{\tilde{y}_i}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)}.$$

The $m$th iteration of the boosting procedure thus reduces to building a single tree on the working training set
$$\mathcal{D}^{(m)} = \{(\nu_{mi}, \tilde{r}_{mi}, \boldsymbol{x}_i), i \in \mathcal{I}\},$$
using the Poisson deviance loss and the log-link function. The weights are updated at each iteration together with the responses, which are assumed to follow Poisson rate distributions.
While the weights are updated differently in the Poisson (rate) and Gamma cases (they remain constant through the boosting procedure in the Gamma case), it is interesting to notice that the working responses at the $m$th iteration are in both cases the original ones divided by the current predictions $\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)$.
The next result shows that the latter observation extends to any member of the Tweedie class with the log-link function. Actually, any member of the Tweedie class with the log-link function gives rise to a simple boosting algorithm.

Proposition 5.5.1 Consider the deviance loss function (5.5.1). Then, (5.3.3) with the log-link function, that is
$$\widehat{\boldsymbol{a}}_m = \underset{\boldsymbol{a}_m}{\operatorname{argmin}} \sum_{i \in \mathcal{I}} \nu_i L\left(y_i, \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right),$$
can be rewritten as
$$\widehat{\boldsymbol{a}}_m = \underset{\boldsymbol{a}_m}{\operatorname{argmin}} \sum_{i \in \mathcal{I}} \nu_{mi} L\left(\tilde{r}_{mi}, \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right) \tag{5.5.6}$$
with
$$\nu_{mi} = \nu_i \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)^{2-\xi}$$
and
$$\tilde{r}_{mi} = \frac{y_i}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)}.$$

Proof The cases $\xi = 1$ and $\xi = 2$ have already been discussed. Turning to the Normal case $\xi = 0$, we have
$$\begin{aligned} \nu_i L\left(y_i, \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right) &= \nu_i \left[y_i - \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right]^2\\ &= \nu_i \left[y_i - \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right) \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right]^2\\ &= \nu_i \exp\left(2\,\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right) \left[\frac{y_i}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)} - \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right]^2\\ &= \nu_i \exp\left(2\,\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right) L\left(\frac{y_i}{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)}, \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right)\\ &= \nu_{mi} L\left(\tilde{r}_{mi}, \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right). \end{aligned}$$
Finally, when $\xi \notin \{0, 1, 2\}$, it comes
$$\begin{aligned} &\nu_i L\left(y_i, \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right)\\ &= 2\nu_i \left(\frac{\max(y_i, 0)^{2-\xi}}{(1-\xi)(2-\xi)} - \frac{y_i \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)^{1-\xi}}{1-\xi} + \frac{\exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i) + T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)^{2-\xi}}{2-\xi}\right)\\ &= 2\nu_i \exp\left(\widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}_i)\right)^{2-\xi} \left(\frac{\max(\tilde{r}_{mi}, 0)^{2-\xi}}{(1-\xi)(2-\xi)} - \frac{\tilde{r}_{mi} \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)^{1-\xi}}{1-\xi} + \frac{\exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)^{2-\xi}}{2-\xi}\right)\\ &= \nu_{mi} L\left(\tilde{r}_{mi}, \exp\left(T(\boldsymbol{x}_i; \boldsymbol{a}_m)\right)\right), \end{aligned}$$
which completes the proof. □

This result shows that when we work with the log-link function and a response that belongs to the Tweedie class (and so with a loss function of the form (5.5.1)), solving (5.3.3) amounts to building a single regression tree on the working training set
$$\mathcal{D}^{(m)} = \{(\nu_{mi}, \tilde{r}_{mi}, \boldsymbol{x}_i), i \in \mathcal{I}\}.$$
Therefore, in these cases, which are the most relevant for insurance applications, boosting trees should be preferred to gradient boosting trees: the latter procedure introduces an extra step which is unnecessary on the one hand, and which leads to an approximation that can easily be avoided with boosting trees on the other hand.
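The following short R sketch illustrates the working-set update of Proposition 5.5.1. The vectors y, nu and score are hypothetical placeholders holding the responses, the weights and the current scores on the training set.

```r
# Working training set D^(m) of Proposition 5.5.1 for a Tweedie deviance
# loss with power xi and log-link; y, nu and score are hypothetical vectors.
xi <- 1.5                            # e.g. a compound Poisson-Gamma loss
nu_m <- nu * exp(score)^(2 - xi)     # updated weights  nu_mi
r_m  <- y / exp(score)               # working responses r_mi
# The mth boosting step then fits a single regression tree to the working
# training set (nu_m, r_m, x) under the same Tweedie deviance loss.
```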

5.6 Regularization and Randomness

5.6.1 Shrinkage

Boosting (and gradient boosting) models are susceptible to overfitting. They employ the greedy strategy of selecting the optimal weak learner at each step. Such a strategy produces an optimal solution at each stage of the training procedure. However, it does not find the optimal global solution and often fits the training set too closely when $M$ is large: after a certain number of iterations, further reducing the training sample estimate of the generalization error starts to increase the true generalization error.
Regularization methods aim to prevent such overfitting by constraining the train-
ing procedure. Controlling the value of M is a natural regularization strategy. For a
certain size of the constituent trees, that can be specified with ID, there is an opti-
mal number of trees M ∗ minimizing the generalization error. In practice, M ∗ can be
estimated as the value of M that minimizes the validation sample estimate (or the
cross validation estimate) of the generalization error.
Another regularization strategy consists in adding only a fraction of the prediction
produced by the new tree to the current one. This fraction is often referred to as the
learning rate or shrinkage factor and takes its values between 0 and 1. That is, line
2.2 in Algorithm 5.2 and line 2.4 in Algorithm 5.3 are replaced by

$$\text{Update } \widehat{\mathrm{score}}_m(\boldsymbol{x}) = \widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}) + \tau\, T(\boldsymbol{x}; \widehat{\boldsymbol{a}}_m) \tag{5.6.1}$$
and
$$\text{Update } \widehat{\mathrm{score}}_m(\boldsymbol{x}) = \widehat{\mathrm{score}}_{m-1}(\boldsymbol{x}) + \tau \sum_{t \in T_m} \widehat{\gamma}_{tm}\, I\left[\boldsymbol{x} \in \chi_t^{(m)}\right], \tag{5.6.2}$$

respectively, where $0 < \tau \leq 1$ is the shrinkage parameter. The shrinkage parameter is another parameter to fine-tune. Small values for $\tau$ work best, but result in larger computation time since more iterations are necessary. The optimal number of iterations $M^*$ and $\tau$ are closely related: smaller values of $\tau$ lead to larger values of $M^*$. Empirically, small values for the shrinkage parameter ($\tau < 0.1$) yield dramatic improvements for regression estimation. Because the constituent trees are small trees built with no pruning, many iterations are generally computationally manageable.

5.6.2 Randomness

The bootstrap sampling procedure in bagging offers a reduction in variance for bagging models. Stochastic (gradient) boosting exploits the same device to improve both the computational time and the predictive accuracy of (gradient) boosting. The boosting algorithm is updated with a random sampling scheme: at each iteration, a random sample of the training set is taken without replacement before building the next tree on that subsample. The fraction of the training set used at each iteration to produce the random samples is known as the bagging fraction, denoted $\alpha$, and becomes another parameter to fine-tune. A typical value for $\alpha$ is 0.5.
The boosting and gradient boosting predictions then write
$$\widehat{\mu}_{\mathcal{D},\Xi}^{\text{boost}}(\boldsymbol{x}) = g^{-1}\left(\widehat{\mathrm{score}}_M(\boldsymbol{x})\right)$$
and
$$\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}}(\boldsymbol{x}) = g^{-1}\left(\widehat{\mathrm{score}}_M(\boldsymbol{x})\right),$$
respectively, where
$$\Xi = (\Xi_1, \ldots, \Xi_M),$$
with $\Xi_m$ expressing the randomization due to the random sampling at iteration $m$.
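In the gbm package, the shrinkage parameter τ and the bagging fraction α correspond to the arguments shrinkage and bag.fraction. The sketch below, again with a hypothetical data frame train, combines both regularization devices.

```r
library(gbm)

# Stochastic gradient boosting with shrinkage tau = 0.01 and bagging
# fraction alpha = 0.5 (sketch; `train` and its columns are hypothetical).
fit <- gbm(y ~ x1 + x2 + offset(log(e)), data = train,
           distribution = "poisson",
           n.trees = 4000, interaction.depth = 2,
           shrinkage = 0.01,      # tau in (5.6.1)-(5.6.2)
           bag.fraction = 0.5)    # alpha: subsample drawn at each iteration
```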



5.7 Interpretability

As for bagging trees and random forests, boosting trees are less interpretable than a single tree. Nevertheless, we can also rely on tools such as relative importances and partial dependences to better understand model outcomes. Moreover, Friedman's H-statistics, introduced in Friedman and Popescu (2008), make it possible to know which features are involved in interactions with other features, the identities of the other features with which they interact, as well as the order and strength of the respective interaction effects. Note that Friedman's H-statistics can be computed for any regression model, including bagging trees and random forests.

5.7.1 Relative Importances

For constituent tree $m$, the relative importance of feature $x_j$, denoted $I_m(x_j)$, is the sum of the deviance reductions over the non-terminal nodes of the $m$th tree for which $x_j$ was selected as the splitting feature. The relative importance of feature $x_j$ is then obtained by summing the relative importances of $x_j$ over the collection of trees, that is,
$$I(x_j) = \sum_{m=1}^{M} I_m(x_j). \tag{5.7.1}$$
To improve their readability, the relative importances are often normalized so that their sum equals 100. Any individual number can then be interpreted as the percentage contribution to the overall model. Sometimes, the relative importances are expressed as a percentage of the maximum relative importance.
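For a model fitted with the gbm package, these normalized relative importances can be retrieved as in the sketch below, where fit denotes a hypothetical fitted gbm object.

```r
# Relative importances, normalized to sum to 100, for a hypothetical
# fitted gbm object `fit`:
ri <- summary(fit, n.trees = fit$n.trees, plotit = FALSE)
print(ri)  # one row per feature: variable name and relative influence
```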

5.7.2 Partial Dependence Plots

Also, the partial dependence of $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{boost}}(\boldsymbol{x})$ or $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}}(\boldsymbol{x})$ on selected small subsets of the features helps the analyst to improve their understanding of the model. We refer the reader to Sect. 4.6.2 for more details about partial dependence plots.
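With the gbm package, univariate and bivariate partial dependence plots can be obtained as sketched below for a hypothetical fitted object fit.

```r
# Partial dependence plots for a hypothetical gbm object `fit`:
plot(fit, i.var = "x1", n.trees = fit$n.trees)           # single feature
plot(fit, i.var = c("x1", "x2"), n.trees = fit$n.trees)  # pair of features
```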

5.7.3 Friedman’s H-Statistics

Tree-based models are praised for their ability to account for interaction effects between features. Knowing which features are involved in interactions with other features, the identities of the other features with which they interact, as well as the order and strength of the respective interaction effects provides useful information to the analyst.
Consider two features $x_j$ and $x_k$. If these two variables do not interact, then the function $\widehat{\mu}(\boldsymbol{x})$ can be written as
$$\widehat{\mu}(\boldsymbol{x}) = f_{\setminus j}(\boldsymbol{x}_{\setminus j}) + f_{\setminus k}(\boldsymbol{x}_{\setminus k}), \tag{5.7.2}$$
where $\boldsymbol{x}_{\setminus j}$ and $\boldsymbol{x}_{\setminus k}$ represent all the features except $x_j$ and $x_k$, respectively. If a given feature $x_j$ interacts with none of the other features, then $\widehat{\mu}(\boldsymbol{x})$ can be expressed as
$$\widehat{\mu}(\boldsymbol{x}) = f_{\setminus j}(\boldsymbol{x}_{\setminus j}) + f_j(x_j). \tag{5.7.3}$$

Considering a third feature $x_l$, $\widehat{\mu}(\boldsymbol{x})$ can be written as
$$\widehat{\mu}(\boldsymbol{x}) = f_{\setminus j}(\boldsymbol{x}_{\setminus j}) + f_{\setminus k}(\boldsymbol{x}_{\setminus k}) + f_{\setminus l}(\boldsymbol{x}_{\setminus l}) \tag{5.7.4}$$
if there is no three-variable interaction between $x_j$, $x_k$ and $x_l$, where $\boldsymbol{x}_{\setminus l}$ represents all the features except $x_l$. In the same way, similar expressions can be defined for the absence of higher order interaction effects.
Recall that the partial dependence of $\widehat{\mu}(\boldsymbol{x})$ on the subvector $\boldsymbol{x}_S$ of $\boldsymbol{x}$ is defined by
$$\widehat{\mu}_S(\boldsymbol{x}_S) = \mathrm{E}_{\boldsymbol{X}_{\bar{S}}}\left[\widehat{\mu}(\boldsymbol{x}_S, \boldsymbol{X}_{\bar{S}})\right] \tag{5.7.5}$$
and can be estimated from the training set by
$$\widehat{\mu}_S(\boldsymbol{x}_S) = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \widehat{\mu}(\boldsymbol{x}_S, \boldsymbol{x}_{i\bar{S}}), \tag{5.7.6}$$

where $\{\boldsymbol{x}_{i\bar{S}}, i \in \mathcal{I}\}$ are the values of $\boldsymbol{X}_{\bar{S}}$ in the training set. In this section, we consider that all partial dependence functions and the regression model $\widehat{\mu}(\boldsymbol{x})$ are centered to have a mean of zero.
If there is no interaction between $x_j$ and $x_k$, then, from (5.7.2), the partial dependence of $\widehat{\mu}(\boldsymbol{x})$ on $\boldsymbol{x}_S = (x_j, x_k)$ can be decomposed into the sum of the respective partial dependences on each feature separately, that is,
$$\begin{aligned} \widehat{\mu}_{j,k}(x_j, x_k) &= \mathrm{E}_{\boldsymbol{X}_{\setminus j,k}}\left[f_{\setminus j}(x_k, \boldsymbol{X}_{\setminus j,k})\right] + \mathrm{E}_{\boldsymbol{X}_{\setminus j,k}}\left[f_{\setminus k}(x_j, \boldsymbol{X}_{\setminus j,k})\right]\\ &= \left(\widehat{\mu}_k(x_k) - \mathrm{E}_{\boldsymbol{X}_{\setminus k}}\left[f_{\setminus k}(\boldsymbol{X}_{\setminus k})\right]\right) + \left(\widehat{\mu}_j(x_j) - \mathrm{E}_{\boldsymbol{X}_{\setminus j}}\left[f_{\setminus j}(\boldsymbol{X}_{\setminus j})\right]\right)\\ &= \widehat{\mu}_k(x_k) + \widehat{\mu}_j(x_j) - \mathrm{E}_{\boldsymbol{X}}[\widehat{\mu}(\boldsymbol{X})]\\ &= \widehat{\mu}_k(x_k) + \widehat{\mu}_j(x_j). \end{aligned} \tag{5.7.7}$$

Moreover, if a given feature $x_j$ interacts with none of the other features, then, from (5.7.3) and since $\mathrm{E}_{\boldsymbol{X}}[\widehat{\mu}(\boldsymbol{X})] = \mathrm{E}_{\boldsymbol{X}_{\setminus j}}\left[f_{\setminus j}(\boldsymbol{X}_{\setminus j})\right] + \mathrm{E}_{X_j}\left[f_j(X_j)\right] = 0$, we have
$$\begin{aligned} \widehat{\mu}(\boldsymbol{x}) &= \left(\widehat{\mu}_{\setminus j}(\boldsymbol{x}_{\setminus j}) - \mathrm{E}_{\boldsymbol{X}_{\setminus j}}\left[f_{\setminus j}(\boldsymbol{X}_{\setminus j})\right]\right) + \left(\widehat{\mu}_j(x_j) - \mathrm{E}_{X_j}\left[f_j(X_j)\right]\right)\\ &= \widehat{\mu}_{\setminus j}(\boldsymbol{x}_{\setminus j}) + \widehat{\mu}_j(x_j), \end{aligned} \tag{5.7.8}$$

where $\widehat{\mu}_{\setminus j}(\boldsymbol{x}_{\setminus j})$ is the partial dependence of $\widehat{\mu}(\boldsymbol{x})$ on all features except $x_j$.
If there is no three-variable interaction between $x_j$, $x_k$ and $x_l$, then, from (5.7.4), we have
$$\begin{aligned} \widehat{\mu}_{j,k,l}(x_j, x_k, x_l) &= \mathrm{E}_{\boldsymbol{X}_{\setminus j,k,l}}\left[f_{\setminus j}(x_k, x_l, \boldsymbol{X}_{\setminus j,k,l})\right] + \mathrm{E}_{\boldsymbol{X}_{\setminus j,k,l}}\left[f_{\setminus k}(x_j, x_l, \boldsymbol{X}_{\setminus j,k,l})\right]\\ &\quad + \mathrm{E}_{\boldsymbol{X}_{\setminus j,k,l}}\left[f_{\setminus l}(x_j, x_k, \boldsymbol{X}_{\setminus j,k,l})\right]\\ &= \widehat{\mu}_{k,l}(x_k, x_l) + \widehat{\mu}_{j,l}(x_j, x_l) + \widehat{\mu}_{j,k}(x_j, x_k) - \widehat{\mu}_j(x_j) - \widehat{\mu}_k(x_k) - \widehat{\mu}_l(x_l) \end{aligned} \tag{5.7.9}$$
since
$$\widehat{\mu}_j(x_j) = \mathrm{E}_{\boldsymbol{X}_{\setminus j}}\left[f_{\setminus j}(\boldsymbol{X}_{\setminus j})\right] + \mathrm{E}_{\boldsymbol{X}_{\setminus j,k}}\left[f_{\setminus k}(x_j, \boldsymbol{X}_{\setminus j,k})\right] + \mathrm{E}_{\boldsymbol{X}_{\setminus j,l}}\left[f_{\setminus l}(x_j, \boldsymbol{X}_{\setminus j,l})\right]$$
and
$$\widehat{\mu}_{j,k}(x_j, x_k) = \mathrm{E}_{\boldsymbol{X}_{\setminus j,k}}\left[f_{\setminus j}(x_k, \boldsymbol{X}_{\setminus j,k})\right] + \mathrm{E}_{\boldsymbol{X}_{\setminus j,k}}\left[f_{\setminus k}(x_j, \boldsymbol{X}_{\setminus j,k})\right] + \mathrm{E}_{\boldsymbol{X}_{\setminus j,k,l}}\left[f_{\setminus l}(x_j, x_k, \boldsymbol{X}_{\setminus j,k,l})\right]$$
with
$$\mathrm{E}_{\boldsymbol{X}_{\setminus j}}\left[f_{\setminus j}(\boldsymbol{X}_{\setminus j})\right] + \mathrm{E}_{\boldsymbol{X}_{\setminus k}}\left[f_{\setminus k}(\boldsymbol{X}_{\setminus k})\right] + \mathrm{E}_{\boldsymbol{X}_{\setminus l}}\left[f_{\setminus l}(\boldsymbol{X}_{\setminus l})\right] = \mathrm{E}_{\boldsymbol{X}}[\widehat{\mu}(\boldsymbol{X})] = 0.$$

Similar expressions can be obtained for the absence of higher order interactions.
Expressions (5.7.7), (5.7.8) and (5.7.9) can be used to test for the presence of interaction effects. Specifically, to test a potential interaction between two given features $x_j$ and $x_k$, we can rely on the statistic
$$H_{j,k}^2 = \frac{\sum_{i \in \mathcal{I}} \left(\widehat{\mu}_{j,k}(x_{ji}, x_{ki}) - \widehat{\mu}_j(x_{ji}) - \widehat{\mu}_k(x_{ki})\right)^2}{\sum_{i \in \mathcal{I}} \widehat{\mu}_{j,k}^2(x_{ji}, x_{ki})}, \tag{5.7.10}$$
where $x_{ji}$ indicates that the estimated partial dependence function is evaluated at the observed value of $x_j$ for policyholder $i$. Considering that all partial dependence functions are centered at zero, the numerator measures the variance of the interaction and the denominator measures the total variance, so that the statistic (5.7.10) measures the interaction strength as the amount of variance explained by the interaction. A value of 0 for $H_{j,k}^2$ means that there is no interaction between $x_j$ and $x_k$, while a value of 1 means that the effect of $x_j$ and $x_k$ on the model is exclusively explained by the interaction.
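For gbm models, this pairwise statistic can be estimated with the interact.gbm function, as in the following sketch for a hypothetical fitted object fit and training data train.

```r
# Friedman's H-statistic for the pair (x1, x2), both names hypothetical:
H_12 <- interact.gbm(fit, data = train, i.var = c("x1", "x2"),
                     n.trees = fit$n.trees)
```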
In the same way, for a given feature $x_j$, we can use the statistic
$$H_j^2 = \frac{\sum_{i \in \mathcal{I}} \left(\widehat{\mu}(\boldsymbol{x}_i) - \widehat{\mu}_j(x_{ji}) - \widehat{\mu}_{\setminus j}(\boldsymbol{x}_{i \setminus j})\right)^2}{\sum_{i \in \mathcal{I}} \widehat{\mu}^2(\boldsymbol{x}_i)} \tag{5.7.11}$$
to test whether $x_j$ interacts with any other variable. The statistic (5.7.11) differs from zero to the extent that $x_j$ interacts with one or more other features.
In the case where $x_j$ interacts with more than one other feature, say with at least $x_k$ and $x_l$, it is interesting to determine whether these interactions represent separate two-way interactions between $(x_j, x_k)$ and $(x_j, x_l)$ only, or whether there is an additional three-way interaction between $(x_j, x_k, x_l)$. This alternative can be tested by means of the statistic
$$H_{j,k,l}^2 = \frac{\sum_{i \in \mathcal{I}} \left(\widehat{\mu}_{j,k,l}(x_{ji}, x_{ki}, x_{li}) - \widehat{\mu}_{j,k}(x_{ji}, x_{ki}) - \widehat{\mu}_{j,l}(x_{ji}, x_{li}) - \widehat{\mu}_{k,l}(x_{ki}, x_{li}) + \widehat{\mu}_j(x_{ji}) + \widehat{\mu}_k(x_{ki}) + \widehat{\mu}_l(x_{li})\right)^2}{\sum_{i \in \mathcal{I}} \widehat{\mu}_{j,k,l}^2(x_{ji}, x_{ki}, x_{li})}, \tag{5.7.12}$$

which measures the fraction of the variance of $\widehat{\mu}_{j,k,l}(x_j, x_k, x_l)$ not explained by the lower order interaction effects among these features. Similarly, additional statistics for higher order interactions can be built, if needed.

5.8 Example

Consider the real dataset described in Sect. 3.2.4.2. We use the same training set $\mathcal{D}$ and validation set as in the examples of Sects. 3.3.2.3 and 4.7.
We fit gradient boosting trees on D with the Poisson deviance loss and the log-link
function by means of the R package gbm. The parameters we need to fine-tune are
• the number of trees M;
• the size of the trees ID;
• the bagging fraction α;
• the shrinkage parameter τ .
To this end, we consider different values for the tuning parameters ID, α and τ ,
namely ID = 1, 2, 3, 4, 5, 6, α = 1, 0.75, 0.5 and τ = 1, 0.1, 0.01, and we split the
training set D into five disjoint and stratified subsets D1 , D2 , . . . , D5 of equal size.
Then, for each value of (ID, α, τ ), depicted in Table 5.2, we compute the 5-fold
cross-validation estimates of the generalization error (from subsets D1 , D2 , . . . , D5 )
for models including up to 4000 trees, and we select the optimal number of trees as the
number of trees corresponding to the model with the smallest 5-fold cross-validation
estimate.
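The cross-validation just described can also be carried out within gbm itself via the cv.folds argument, as sketched below for one combination of the tuning parameters (the data frame train and its columns are hypothetical; note that gbm's built-in folds are not stratified, so this only approximates the procedure used here).

```r
library(gbm)

# 5-fold cross-validation for one value of (ID, alpha, tau):
fit <- gbm(y ~ x1 + x2 + offset(log(e)), data = train,
           distribution = "poisson",
           n.trees = 4000, interaction.depth = 4,
           shrinkage = 0.01, bag.fraction = 0.5,
           cv.folds = 5)
M_opt <- gbm.perf(fit, method = "cv")  # optimal number of trees
```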
In Fig. 5.6, we show the optimal numbers of trees obtained for the different values
of (ID, α, τ ) under consideration. We notice that considering 4000 trees was more
than enough in this example, even for ID = 1 and τ = 0.01. Also, one sees that we
Table 5.2 Values for (ID, α, τ)

Model    ID             α      τ
1-06     1,2,3,4,5,6    1.00   1.00
07-12    1,2,3,4,5,6    0.75   1.00
13-18    1,2,3,4,5,6    0.50   1.00
19-24    1,2,3,4,5,6    1.00   0.10
25-30    1,2,3,4,5,6    0.75   0.10
31-36    1,2,3,4,5,6    0.50   0.10
37-42    1,2,3,4,5,6    1.00   0.01
43-48    1,2,3,4,5,6    0.75   0.01
49-54    1,2,3,4,5,6    0.50   0.01

[Fig. 5.6 Optimal numbers of trees for the values of (ID, α, τ) summarized in Table 5.2]

need more trees when the shrinkage parameter decreases or when the size of the trees decreases, highlighting the interplay between these tuning parameters.
Figure 5.7 displays the 5-fold cross-validation estimates of the generalization error for the models built with their corresponding optimal numbers of trees. One sees that the introduction of a shrinkage parameter makes it possible to improve the predictive accuracy of the boosting procedure. Also, results with τ = 0.01 are slightly better than the ones obtained with τ = 0.1.

[Fig. 5.7 5-fold cross-validation estimates of the generalization error for the best models corresponding to the values of (ID, α, τ) summarized in Table 5.2]

To a lesser extent, adding randomness into the training procedure appears to be relevant in this example. For instance, one sees that for
any given value of ID, the model with (τ = 0.01, α = 0.5) performs slightly better
than the model with (τ = 0.01, α = 0.75), which, in turn, performs slightly better
than the model with (τ = 0.01, α = 1). Finally, if we look at models with τ = 0.01,
i.e. models 37 to 54, we observe that the interaction depths minimizing the 5-fold
cross-validation estimate of the generalization error are ID = 2 for α = 1, ID = 3
for α = 0.75 and ID = 4 for α = 0.5, so that the best value for ID ranges from 2 to
4. Based on the results depicted in Fig. 5.7, we decide to select $M = 986$, $\mathrm{ID} = 4$, $\alpha = 0.5$ and $\tau = 0.01$ as the optimal tuning parameters, which correspond to the values minimizing the 5-fold cross-validation estimate (i.e. model 52). We denote by $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}*}$ the gradient boosting model fitted on the entire training set with these optimal parameters.
Remark 5.8.1 Remark 4.7.1 about the instability related to the selection procedure of the optimal tuning parameters still holds for (gradient) boosting trees. As an illustration, Table 5.3 provides, for each iteration $j$ of the 5-fold cross-validation, the number of trees for $\tau = 0.01$ and $\alpha = 0.5$ minimizing the validation-sample estimate of the generalization error computed on $\mathcal{D}_j$ for models fitted on $\mathcal{D} \setminus \mathcal{D}_j$, together with the corresponding out-of-sample estimate of the generalization error. We can see that the optimal tuning parameters for ID and $M$ are unstable over the
Table 5.3 Number of trees for τ = 0.01 and α = 0.5 minimizing the validation-sample estimate
of the generalization error computed on D j for the model fitted on D\D j together with the corre-
sponding out-of-sample estimate of the generalization error
ID Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5
1 1841 (0.5323761) 1847 (0.5336023) 2895 (0.5424494) 2356 (0.5368788) 3475 (0.5417317)
2 3194 (0.5314581) 1291 (0.5331031) 2422 (0.5422430) 1740 (0.5360356) 1814 (0.5412181)
3 1422 (0.5314159) 1004 (0.5329624) 1413 (0.5421080) 1401 (0.5359842) 1642 (0.5413473)
4 860 (0.5314600) 933 (0.5328013) 919 (0.5422055) 962 (0.5358395) 1315 (0.5415447)
5 907 (0.5316125) 615 (0.5328922) 779 (0.5424979) 987 (0.5356169) 702 (0.5416647)
6 619 (0.5314849) 738 (0.5327782) 652 (0.5424877) 992 (0.5360672) 707 (0.5418545)
[Fig. 5.8 Relative importances of the features for $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}*}$]

iterations, getting $(M = 1422, \mathrm{ID} = 3)$ for iteration 1, $(M = 738, \mathrm{ID} = 6)$ for iteration 2, $(M = 1413, \mathrm{ID} = 3)$ for iteration 3, $(M = 987, \mathrm{ID} = 5)$ for iteration 4 and $(M = 1814, \mathrm{ID} = 2)$ for iteration 5.
The relative importances of the features for $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}*}$ are depicted in Fig. 5.8. The most important feature is AgePh followed by, in descending order, Split, AgeCar, PowerCat, Fuel, Gender, Cover and Use. Notice that compared to Fig. 4.4, AgePh, Split, Gender and Use keep the same ranking in terms of importance.
Figures 5.9 and 5.10 represent the partial dependence plots of the features and the H-statistics $H_{j,k}^2$ for $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}*}$, respectively. Here we focus on two-way interactions.
[Fig. 5.9 Partial dependence plots for $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}*}$ (on the score scale)]

[Fig. 5.10 H-statistic $H_{j,k}^2$ for $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}*}$]
[Fig. 5.11 Effect of the policyholder's age on the score for males (left-hand side) and females (right-hand side)]

One observes from Fig. 5.10 that the three strongest interactions are found between AgeCar and Cover, between AgeCar and PowerCat and between AgePh and Gender, this latter interaction being well-known in MTPL insurance. The H-statistic $H_{j,k}^2$ informs us about the strength of the interaction between features $x_j$ and $x_k$ but does not give any clue on how the effect behaves. For instance, in Fig. 5.11, we show the effect (on the score scale) of the policyholder's age for males on the left-hand side and for females on the right-hand side. We observe that for young policyholders, males are on average more risky drivers compared to females, whereas at older ages female drivers are perceived as more risky than males.
Finally, the validation sample estimate of the generalization error of $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}*}$ (computed on the validation set) is given by
$$\widehat{\mathrm{Err}}^{\mathrm{val}}\left(\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}*}\right) = 0.5431231.$$
Compared to the validation sample estimates of the generalization error of $\widehat{\mu}_{T_{\alpha_k}^*}$ and $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{rf}*}$ (also computed on the validation set) fitted in Sects. 3.3.2.3 and 4.7, that are
$$\widehat{\mathrm{Err}}^{\mathrm{val}}\left(\widehat{\mu}_{T_{\alpha_k}^*}\right) = 0.5452772$$
and
$$\widehat{\mathrm{Err}}^{\mathrm{val}}\left(\widehat{\mu}_{\mathcal{D},\Xi}^{\text{rf}*}\right) = 0.5440970,$$
one sees that $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{grad boost}*}$ improves by $2.1541 \times 10^{-3}$ the predictive accuracy of the single tree $\widehat{\mu}_{T_{\alpha_k}^*}$ and by $0.9739 \times 10^{-3}$ the predictive accuracy of the random forest $\widehat{\mu}_{\mathcal{D},\Xi}^{\text{rf}*}$.

5.9 Bibliographic Notes and Further Reading

Boosting was originally designed for classification problems. Valiant (1984) and
Kearns and Valiant (1989) introduced the concept of combining weak classifiers into
a strong classifier. These works influenced Schapire, who developed the first simple
boosting procedure (Schapire 1990). The performance of the simple boosting algo-
rithm of Schapire was improved by Freund (1995). Freund and Schapire collaborated
to produce the AdaBoost algorithm (Freund and Schapire 1996a, 1997). To sup-
port their algorithms, Freund and Schapire (1996a) and Schapire and Singer (1999)
derived some upper bounds on the generalization error. Other theories attempting
to explain boosting come from game theory (Freund and Schapire 1996b; Breiman 1998, 1999) and Vapnik-Chervonenkis theory (Schapire et al. 1998). In particular,
Breiman (1998) explained the algorithm as a gradient descent approach with numer-
ical optimization and statistical estimation. In practice, the AdaBoost algorithm was
shown to be a powerful prediction tool, far beyond the expectations implied by the
bounds and the theoretical developments.
Friedman et al. (2000) made the link between the AdaBoost algorithm and the
statistical concepts of loss functions, additive modeling and logistic regression. They
showed that boosting can be viewed as a forward stagewise additive model that mini-
mizes exponential loss. Friedman (2001) proposed a boosting method called Gradient
Boosting Machine for regression and classification problems, which combines weak
learners. Bühlmann and Hothorn (2007) adopted penalty splines, linear regressors,
and trees in various scenarios. Ridgeway (2007) used only trees as the base learners.
Gradient boosting machine with neural networks can be found in Denuit et al. (2019).
Friedman and Popescu (2008) presented techniques to identify the variables that
are involved in interactions with other variables, the strength and degree of those
interactions, as well as the identities of the other variables with which they inter-
act. Tree-based models are known for their ability to account for interaction effects
between features, as illustrated in Buchner et al. (2017) and Schiltz et al. (2018).
Several authors applied boosting and gradient boosting to insurance pricing. Guel-
man (2012) proposed gradient boosted trees for predicting auto insurance loss. Liu
et al. (2014) treated the claim frequency prediction problem by using multi class
AdaBoost trees. Wüthrich and Buser (2019) adapted tree-based methods to model
claim frequencies. Yang et al. (2018) predicted insurance premiums by applying a gradient boosted tree algorithm to Tweedie models. Lee and Lin (2018) introduced Delta Boosting Machine as a new member of the boosting family with application to general insurance. Pesantez-Narvaez et al. (2019) employed XGBoost to predict the occurrence of claims using telematics data. Henckaerts et al. (2020) worked with random forests and boosted trees to develop full tariff plans built from both the frequency and severity of claims.
We mainly based our presentation on Hastie et al. (2009), Friedman (2001) and Friedman and Popescu (2008). Sect. 5.5 is inspired by Denuit et al. (2020). Hastie et al. (2009) and Kuhn and Johnson (2013) provide a good overview of the existing literature.

References

Breiman L (1998) Arcing classifiers (with discussion). Ann Stat 26(3):801–849


Breiman L (1999) Prediction games and arcing algorithms. Neural Comput 11(7):1493–1517
Buchner F, Wasem J, Schillo S (2017) Regression trees identify relevant interactions: can this
improve the predictive performance of risk adjustment? Health Econ 26(1):74–85
Bühlmann P, Hothorn T (2007) Boosting algorithms: regularization, prediction and model fitting.
Stat Sci 22(4):477–505
Denuit M, Hainaut D, Trufin J (2019) Effective statistical learning methods for actuaries III: neural
networks and extensions. Springer actuarial lecture notes
Denuit M, Hainaut D, Trufin J (2020) Boosting versus gradient boosting for Tweedie models with
log-link function. Working paper
Freund Y (1995) Boosting a weak learning algorithm by majority. Inf Comput 121(2):256–285
Freund Y, Schapire R (1996a) Experiments with a new boosting algorithm. In: Machine learning:
proceedings of the thirteenth international conference, Morgan Kauffman, San Francisco, pp
148–156
Freund Y, Schapire R (1996b) Game theory, on-line prediction and boosting. In: Proceedings of
the ninth annual conference on computational learning theory, Desenzano del Garda, Italy, pp
325–332
Freund Y, Schapire R (1997) A decision-theoretic generalization of on-line learning and an appli-
cation to boosting. J Comput Syst Sci 55(1):119–139
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting
(with discussion and a rejoinder by the authors). Ann Stat 28(2):337–407
Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat
29(5):1189–1232
Friedman J, Popescu B (2008) Predictive learning via rule ensembles. Ann Appl Stat 2(3):916–954
Guelman L (2012) Gradient boosting trees for auto insurance loss cost modeling and prediction.
Expert Syst Appl 39(3):3659–3667
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Data mining, infer-
ence, and prediction. Springer series in statistics, 2nd edn
Henckaerts R, Côté M-P, Antonio K, Verbelen R (2020) Boosting insights in insurance tariff plans with tree-based machine learning methods. N Am Actuar J. https://doi.org/10.1080/10920277.2020.1745656
Kearns M, Valiant LG (1989) Cryptographic limitations on learning Boolean formulae and finite
automata. In: Proceedings of the twenty-first annual ACM symposium on theory of computing,
Seattle, pp 433–444
Kuhn M, Johnson K (2013) Applied predictive modeling. Springer, New York
Lee SCK, Lin S (2018) Delta boosting machine with application to general insurance. N Am Actuar J 22(3):405-425

Liu Y, Wang B, Lv S (2014) Using multi-class AdaBoost tree for prediction frequency of auto
insurance. J Appl Financ Bank 4(5):45–53
Pesantez-Narvaez J, Guillen M, Alcañiz M (2019) Predicting motor insurance claims using telem-
atics data XGBoost versus logistic regression. Risks 7(2):1–16
Ridgeway G (2007) Generalized boosted models: a guide to the GBM package. Update 1(1)
Schapire R (1990) The strength of weak learnability. Mach Learn 5:197–227
Schapire R, Freund Y, Bartlett P, Lee W (1998) Boosting the margin: a new explanation for the
effectiveness of voting methods. Ann Stat 26(5):1651–1686
Schapire R, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions.
Mach Learn 37:297–336
Schiltz F, Masci C, Agasisti T, Horn D (2018) Using regression tree ensembles to model interaction
effects: a graphical approach. Appl Econ 50(58):6341–6354
Valiant LG (1984) A theory of the learnable. Commun ACM 27(11):1134–1142
Wüthrich MV, Buser C (2019) Data analytics for non-life insurance pricing. Lecture notes
Yang Y, Qian W, Zou H (2018) Insurance premium prediction via gradient tree-boosted Tweedie
compound Poisson models. J Bus Econ Stat 36(3):456–470
Chapter 6
Other Measures for Model Comparison

6.1 Introduction

Actuarial pricing models are generally calibrated so that they minimize the generalization error computed with an appropriate loss function, and model selection is based on this generalization error. Regression models are then evaluated by assessing their generalization error with the same loss function, which is done by comparing their generalization errors computed on a validation set.
Model selection and model assessment are thus based on the same objective function (the deviance in our ED family setting), which provides consistency in the approach. However, in our ED family setting, using the deviance as a tool for model assessment has some drawbacks. As we have seen throughout the different examples of the previous chapters, the deviance only slightly reacts to a model improvement. Moreover, a decrease of a certain amount of the deviance is difficult for the analyst to interpret. There is a need for additional measures to assess pricing models, notably more economic criteria. In this chapter, we describe complementary measures frequently used by practitioners for model assessment.

Remark 6.1.1 Training a model with an objective function (say objective function 1) and assessing it with another one (say objective function 2) is not without criticism. If the ultimate goal of the analyst is to minimize the second objective function, then one could wonder why the model is not directly trained using this second objective function as well.


6.2 Measures of Association

6.2.1 Context

The merits of a regression model can be assessed using the pairs $(Y, \widehat{\mu}(\boldsymbol{X}))$. Besides using validation sample estimates for the generalization error to assess the predictive power of a model $\widehat{\mu}$, measures of association for the pairs $(Y, \widehat{\mu}(\boldsymbol{X}))$ are also frequently used by practitioners. In insurance, the most popular ones are Kendall's tau and Spearman's rho, which are based on concordance probabilities.
Kendall's tau and Spearman's rho, defined in the following, are efficient tools for measuring the strength of dependence between continuous outcomes. They can be expressed in terms of the corresponding copula only and are thus independent of the marginal distributions. When they are applied to discrete variables, they are no longer distribution-free, so that their ranges are restricted to sub-intervals of $[-1, 1]$. This makes their interpretation more difficult: relatively small values for Kendall's tau and Spearman's rho may in fact strongly support the fitted model if their maximal possible values are small as well. In the case where the response variable $Y$ is discrete, such as the number of claims, correlation indices are thus often restricted to a sub-interval of $[-1, 1]$. That is why positive values of Kendall's tau and Spearman's rho for the pairs $(Y, \widehat{\mu}(\boldsymbol{X}))$ must be compared to their highest attainable values and not to 1.
Notice that, even for discrete responses, predictors $\widehat{\mu}(\boldsymbol{X})$ are generally continuous random variables. This is the case when there is at least one continuous feature comprised in the available information $\boldsymbol{X}$ (so that the score is continuous) and the function $\widehat{\mu}$ is a continuously increasing function of the score. Of course, predictors $\widehat{\mu}(\boldsymbol{X})$ can still be discrete when all the features are discrete or when they are piecewise constant predictors (such as a single tree, for instance). However, this is unlikely to be the case since actuarial pricing is nowadays based on more sophisticated models than piecewise constant predictors (trees being combined into random forests, for instance) and uses more and more features. Even though predictors $\widehat{\mu}(\boldsymbol{X})$ are generally continuous, we also consider the discrete case for $\widehat{\mu}(\boldsymbol{X})$. For ease of exposition, we mean by the random variable $Z$ a predictor $\widehat{\mu}(\boldsymbol{X})$ and we denote by $p_k$ the probabilities $\mathrm{P}[Y = k]$, $k \in \mathbb{N}$. When $Z$ is discrete, we assume it is valued in $\{z_1, z_2, \ldots, z_m\}$ with $z_1 < z_2 < \ldots < z_m$, and we define
$$j_k = \min\{j \in \{1, 2, \ldots, m\} : F_Z(z_j) \geq F_Y(k)\}, \quad k \in \mathbb{N}.$$
This section aims to derive the best possible upper bounds for Kendall's tau and Spearman's rho when the response takes its values in $\mathbb{N} = \{0, 1, 2, \ldots\}$.

6.2.2 Probability of Concordance

6.2.2.1 Definition

Consider independent copies $(Y_1, Z_1)$ and $(Y_2, Z_2)$ of $(Y, Z)$. Then, $(Y_1, Z_1)$ and $(Y_2, Z_2)$ are said to be concordant if $(Y_1 - Y_2)(Z_1 - Z_2) > 0$ holds true, whereas they are said to be discordant when $(Y_1 - Y_2)(Z_1 - Z_2) < 0$.
Tied pairs (that is, pairs of observations that have equal values of $Y$ or $Z$) may occur in practice. Specifically, the probability that a tie occurs is given by
$$\mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_2) = 0] = \begin{cases} \mathrm{P}[Y_1 = Y_2 \text{ or } Z_1 = Z_2] & \text{if } Z \text{ is discrete,}\\ \mathrm{P}[Y_1 = Y_2] & \text{if } Z \text{ is continuous.} \end{cases}$$

The concordance probabilities can be expressed as follows.

Proposition 6.2.1 If $H$ denotes the joint distribution function of the pair $(Y, Z)$, then
$$\begin{aligned} \mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_2) > 0] &= 2\mathrm{E}[H(Y, Z)] - \mathrm{P}[Y_1 = Y_2] - \mathrm{P}[Z_1 = Z_2]\\ &= 2\sum_{k=0}^{\infty} \sum_{l=1}^{\infty} \mathrm{P}[Y_1 = k, Y_2 = k + l, Z_1 < Z_2]. \end{aligned}$$

Proof As $Z_1$ and $Z_2$ are independent and identically distributed, we have
$$\begin{aligned} \mathrm{P}[Z_1 \leq Z_2] &= \mathrm{P}[Z_1 < Z_2] + \mathrm{P}[Z_1 = Z_2]\\ &= \mathrm{P}[Z_1 > Z_2] + \mathrm{P}[Z_1 = Z_2]\\ &= \frac{1 - \mathrm{P}[Z_1 = Z_2]}{2} + \mathrm{P}[Z_1 = Z_2]\\ &= \frac{1 + \mathrm{P}[Z_1 = Z_2]}{2}. \end{aligned}$$
This allows us to write
$$\begin{aligned} \mathrm{P}[Y_1 \leq Y_2, Z_1 \leq Z_2] &= 1 - \mathrm{P}[Y_1 > Y_2] - \mathrm{P}[Z_1 > Z_2] + \mathrm{P}[Y_1 < Y_2, Z_1 < Z_2]\\ &= \mathrm{P}[Y_1 < Y_2, Z_1 < Z_2] + \frac{\mathrm{P}[Y_1 = Y_2]}{2} + \frac{\mathrm{P}[Z_1 = Z_2]}{2}. \end{aligned}$$
The concordance probability can finally be expressed as
$$\begin{aligned} \mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_2) > 0] &= 2\mathrm{P}[Y_1 < Y_2, Z_1 < Z_2]\\ &= 2\mathrm{P}[Y_1 \leq Y_2, Z_1 \leq Z_2] - \mathrm{P}[Y_1 = Y_2] - \mathrm{P}[Z_1 = Z_2]\\ &= 2\mathrm{E}[H(Y, Z)] - \mathrm{P}[Y_1 = Y_2] - \mathrm{P}[Z_1 = Z_2] \end{aligned}$$

as announced. Now, since
$$\mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_2) > 0] = 2\mathrm{P}[Y_1 < Y_2, Z_1 < Z_2],$$
it suffices to notice that
$$\mathrm{P}[Y_1 < Y_2, Z_1 < Z_2] = \sum_{k=0}^{\infty} \sum_{l=1}^{\infty} \mathrm{P}[Y_1 = k, Y_2 = k + l, Z_1 < Z_2],$$
which gives the second announced equality and ends the proof. □

Proposition 6.2.1 shows that concordance probabilities get higher when we replace the joint distribution function $H$ with $H_+$ whose graph lies everywhere above $H$, provided the marginals are kept unchanged. A natural candidate for $H_+$ is the Fréchet-Höffding upper bound $H^u$ defined as
$$H^u(y, z) = \min\{F_Y(y), F_Z(z)\}.$$

Proposition 6.2.2 We have
$$\mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_2) > 0] \leq \mathrm{P}[(Y_1^u - Y_2^u)(Z_1^u - Z_2^u) > 0] \tag{6.2.1}$$
where $(Y_1^u, Z_1^u)$ and $(Y_2^u, Z_2^u)$ are independent copies of the random pair $(Y^u, Z^u)$ obeying the Fréchet-Höffding upper bound $H^u$, i.e.
$$Z^u = F_Z^{-1}(U) \quad \text{and} \quad Y^u = \sum_{k=0}^{\infty} k\, I[F_Y(k-1) \leq U < F_Y(k)] \tag{6.2.2}$$
with $U$ being uniformly distributed over the unit interval $[0, 1]$.

Proof The joint distribution function of the random pair $(Y, Z)$ satisfies
$$H(y, z) \leq \min\{F_Y(y), F_Z(z)\} \quad \text{for all } y \text{ and } z.$$
This ensures that
$$\mathrm{E}[H(Y, Z)] \leq \mathrm{E}[\min\{F_Y(Y), F_Z(Z)\}]$$
holds true. Now, the inequality $\mathrm{E}[g(Y, Z)] \leq \mathrm{E}[g(Y^u, Z^u)]$ is known to be valid for every supermodular function $g$ (see e.g. Denuit et al. 2005, Sect. 6.2.4). As every joint distribution function is supermodular, we also have
$$\mathrm{E}[\min\{F_Y(Y), F_Z(Z)\}] \leq \mathrm{E}[\min\{F_Y(Y^u), F_Z(Z^u)\}],$$
so that
$$\mathrm{E}[H(Y, Z)] \leq \mathrm{E}[\min\{F_Y(Y^u), F_Z(Z^u)\}]$$
is true. Hence, as
$$\mathrm{P}[Y^u \leq y, Z^u \leq z] = \min\{F_Y(y), F_Z(z)\},$$
we have the announced result by Proposition 6.2.1. □

6.2.2.2 Upper Bounds

Based on Proposition 6.2.2, we can establish upper bounds for concordance proba-
bilities.

Proposition 6.2.3 If $Z$ is continuous, then
$$\mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_2) > 0] \leq 2\mathrm{E}[F_Y(Y-)]. \tag{6.2.3}$$

Proof By (6.2.1), since $\mathrm{P}[(Y_1^u - Y_2^u)(Z_1^u - Z_2^u) > 0] = 2\mathrm{P}[Y_1^u < Y_2^u, Z_1^u < Z_2^u]$, it suffices to show that $\mathrm{P}[Y_1^u < Y_2^u, Z_1^u < Z_2^u] = \mathrm{E}[F_Y(Y-)]$ with
$$Z_i^u = F_Z^{-1}(U_i) \quad \text{and} \quad Y_i^u = \sum_{k=0}^{\infty} k\, I[F_Y(k-1) \leq U_i < F_Y(k)]$$
for $i = 1, 2$, where $U_1$ and $U_2$ are independent random variables, uniformly distributed over the unit interval $[0, 1]$. We get
$$\begin{aligned} \mathrm{P}[Y_1^u < Y_2^u, Z_1^u < Z_2^u] &= \sum_{k=0}^{\infty} \mathrm{P}[Y_1^u = k, Y_2^u > k, Z_1^u < Z_2^u]\\ &= \sum_{k=0}^{\infty} \mathrm{P}[F_Y(k-1) \leq U_1 < F_Y(k), F_Y(k) \leq U_2, U_1 < U_2]\\ &= \sum_{k=0}^{\infty} \mathrm{P}[F_Y(k-1) \leq U_1 < F_Y(k), F_Y(k) \leq U_2]\\ &= \sum_{k=0}^{\infty} \left(F_Y(k) - F_Y(k-1)\right) \bar{F}_Y(k)\\ &= \sum_{k=0}^{\infty} p_k \bar{F}_Y(k)\\ &= \mathrm{E}[\bar{F}_Y(Y)], \end{aligned}$$
which ends the proof since $\mathrm{E}[F_Y(Y)] + \mathrm{E}[F_Y(Y-)] = 1$. □



Proposition 6.2.4 If $Z$ is discrete, then
$$\mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_2) > 0] \leq 2\mathrm{E}[F_Y(Y-)] - 2\sum_{k=0}^{\infty} \left(F_Y(k) - \max\{F_Y(k-1), F_Z(z_{j_k - 1})\}\right)\left(F_Z(z_{j_k}) - F_Y(k)\right).$$

Proof From (6.2.1), it amounts to showing that
$$\mathrm{P}[Y_1^u < Y_2^u, Z_1^u < Z_2^u] = \mathrm{E}[F_Y(Y-)] - \sum_{k=0}^{\infty} \left(F_Y(k) - \max\{F_Y(k-1), F_Z(z_{j_k - 1})\}\right)\left(F_Z(z_{j_k}) - F_Y(k)\right)$$
with
$$Z_i^u = \sum_{j=1}^{m} z_j\, I\left[F_Z(z_{j-1}) \leq U_i < F_Z(z_j)\right] \quad \text{and} \quad Y_i^u = \sum_{k=0}^{\infty} k\, I\left[F_Y(k-1) \leq U_i < F_Y(k)\right]$$
for $i = 1, 2$, where $U_1$ and $U_2$ are independent random variables, uniformly distributed over the unit interval $[0, 1]$, and $z_0 < z_1$ is such that $F_Z(z_0) = 0$. We then have


P[Y1u < Y2u , Z 1u < Z 2u ] = P[Y1u = k, Y2u > k, Z 1u < Z 2u ]
k=0


= P[Z 1u < Z 2u |Y1u = k, Y2u > k]P[Y1u = k, Y2u > k]
k=0


= P[Z 1u < Z 2u |(U1 , U2 ) ∈ Ak ] pk F̄Y (k)
k=0


= 1 − P[Z 1u = Z 2u |(U1 , U2 ) ∈ Ak ] pk F̄Y (k)
k=0

where
Ak = [FY (k − 1), FY (k)[×[FY (k), 1], k ∈ N.

Define

B j = [FZ (z j−1 ), FZ (z j )[×[FZ (z j−1 ), FZ (z j )[, j ∈ {1, . . . , m}.

Then,
6.2 Measures of Association 181


m
P[Z 1u = Z 2u |(U1 , U2 ) ∈ Ak ] = P[(U1 , U2 ) ∈ B j |(U1 , U2 ) ∈ Ak ]
j=1

1 
m
= P[(U1 , U2 ) ∈ Ak ∩ B j ]
pk F̄Y (k) j=1

1  m
= αk, j βk, j
pk F̄Y (k) j=1

with  
αk, j = min{FY (k), FZ (z j )} − max{FY (k − 1), FZ (z j−1 )} +

and  
βk, j = FZ (z j ) − max{FY (k), FZ (z j−1 )} + ,

where, for any real number r , we let r+ denote the positive part of r ; that is, r+ = r
if r ≥ 0 and r+ = 0 if r < 0.
For j < jk , we get βk, j = 0 since FY (k) ≥ FZ (z j ) ≥ FZ (z j−1 ). Also, for j >
jk , we have αk, j = 0 since FZ (z j ) ≥ FZ (z j−1 ) ≥ FY (k) ≥ FY (k − 1). Now, in the
remaining case j = jk , it comes FZ (z jk ) ≥ FY (k) ≥ FZ (z jk −1 ) and hence

αk, jk = FY (k) − max{FY (k − 1), FZ (z jk −1 )} and βk, jk = FZ (z jk ) − FY (k).

Finally, we then get

P[Y1u < Y2u , Z 1u < Z 2u ]


∞  
= E[ F̄Y (Y )] − FY (k) − max{FY (k − 1), FZ (z jk −1 )} FZ (z jk ) − FY (k) ,
k=0

which completes the proof. 

6.2.3 Kendall’s Tau

6.2.3.1 Definition

Kendall’s tau is a widely used measure of dependence between Y and Z = 


μ(X),
defined as

τ [Y, Z ] = P[(Y1 − Y2 )(Z 1 − Z 2 ) > 0] − P[(Y1 − Y2 )(Z 1 − Z 2 ) < 0].


182 6 Other Measures for Model Comparison

With continuous random variables, Kendall’s tau is completely determined by the


copula and unrelated to the marginal distributions. This is no more true in general
for random variables that are not necessarily continuous (see for instance Nešlehová
2007).
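For finite samples, the definition above can be evaluated directly, as in the following R sketch for hypothetical vectors y and z of equal length (an O(n²) computation, so suitable for moderate sample sizes only).

```r
# Empirical version of tau[Y, Z] as defined above; tied pairs contribute zero.
kendall_tau <- function(y, z) {
  s <- sign(outer(y, y, "-") * outer(z, z, "-"))
  mean(s[upper.tri(s)])  # P[concordant] - P[discordant] over all pairs
}
```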

6.2.3.2 Upper Bounds

In the following, we derive the best possible upper bounds on Kendall’s tau for
discrete responses Y . We start with the case of a continuous Z .

Proposition 6.2.5 If $Z$ is continuous, then
$$\tau[Y, Z] \leq 2\mathrm{E}[F_Y(Y-)]. \tag{6.2.4}$$

Proof If the random pair $(Y, Z)$ obeys the upper Fréchet-Höffding bound, i.e. under (6.2.2), we get
$$\begin{aligned} \mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_2) < 0] &= 2\mathrm{P}[Y_1 > Y_2, Z_1 < Z_2]\\ &= 2\sum_{k=0}^{\infty} \mathrm{P}[Y_1 > k, Y_2 = k, Z_1 < Z_2]\\ &= 2\sum_{k=0}^{\infty} \mathrm{P}[F_Y(k) \leq U_1, F_Y(k-1) \leq U_2 < F_Y(k), U_1 < U_2]\\ &= 0. \end{aligned}$$
The maximal value of Kendall's tau corresponds to a random pair distributed according to the Fréchet-Höffding upper bound because this distribution simultaneously maximizes the concordance probability and leads to zero discordance probability. This ends the proof. □

We now turn to the case where $Z$ is discrete.

Proposition 6.2.6 If $Z$ is discrete, then
$$\tau[Y, Z] \leq 2\mathrm{E}[F_Y(Y-)] - 2\sum_{k=0}^{\infty} \left(F_Y(k) - \max\{F_Y(k-1), F_Z(z_{j_k - 1})\}\right)\left(F_Z(z_{j_k}) - F_Y(k)\right). \tag{6.2.5}$$

Proof Similarly to the continuous case for $Z$, when the random pair $(Y, Z)$ obeys the upper Fréchet-Höffding bound, we obviously have $\mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_2) < 0] = 0$, so that we directly get the desired result. □

We notice that the latter upper bound (6.2.5) is smaller than the upper bound (6.2.4) obtained when $Z$ is continuous. Let us define
$$k^* = \min\{k \in \mathbb{N} : F_Y(k) > 0\}.$$
The difference between the upper bounds (6.2.4) and (6.2.5) is then given by
$$\begin{aligned} &2\sum_{k=0}^{\infty} \left(F_Y(k) - \max\{F_Y(k-1), F_Z(z_{j_k - 1})\}\right)\left(F_Z(z_{j_k}) - F_Y(k)\right)\\ &= 2\sum_{k=k^*}^{\infty} \left(F_Y(k) - \max\{F_Y(k-1), F_Z(z_{j_k - 1})\}\right)\left(F_Z(z_{j_k}) - F_Y(k)\right)\\ &= 2\left(F_Y(k^*) - F_Z(z_{j_{k^*} - 1})\right)\left(F_Z(z_{j_{k^*}}) - F_Y(k^*)\right)\\ &\quad + 2\sum_{k=k^*+1}^{\infty} \left(F_Y(k) - \max\{F_Y(k-1), F_Z(z_{j_k - 1})\}\right)\left(F_Z(z_{j_k}) - F_Y(k)\right). \end{aligned}$$
One sees that if $Z$ is such that $F_Z(z_{m-1}) < F_Y(k^*)$, then this difference becomes
$$\begin{aligned} &2\left(F_Y(k^*) - F_Z(z_{m-1})\right)\left(1 - F_Y(k^*)\right) + 2\sum_{k=k^*+1}^{\infty} \left(F_Y(k) - F_Y(k-1)\right)\left(1 - F_Y(k)\right)\\ &= -2F_Z(z_{m-1})\left(1 - F_Y(k^*)\right) + 2\sum_{k=k^*}^{\infty} \left(F_Y(k) - F_Y(k-1)\right)\left(1 - F_Y(k)\right)\\ &= -2F_Z(z_{m-1})\left(1 - F_Y(k^*)\right) + 2\mathrm{E}[F_Y(Y-)], \end{aligned}$$
which is larger as $F_Z(z_{m-1})$ decreases. By letting $F_Z(z_{m-1}) \to 0$, the difference even tends to its maximum $2\mathrm{E}[F_Y(Y-)]$, such that the upper bound (6.2.5) tends to zero as well.

6.2.4 Spearman’s Rho

6.2.4.1 Definition

Consider independent copies $(Y_1, Z_1)$, $(Y_2, Z_2)$ and $(Y_3, Z_3)$ of $(Y, Z)$. It is well known that the population version of Spearman's rho is defined in terms of the probability of concordance minus the probability of discordance for the random pairs $(Y_1, Z_1)$ and $(Y_2, Z_3)$, that is,
$$\rho[Y, Z] = 3\left(\mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_3) > 0] - \mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_3) < 0]\right).$$

Proposition 6.2.7 $\rho[Y, Z]$ can be expressed in terms of the joint distribution $H$ of $(Y, Z)$ as
$$\rho[Y, Z] = 3\left(\mathrm{E}[H(Y^*-, Z^*)] + \mathrm{E}[H(Y^*, Z^*-)] + \mathrm{E}[H(Y^*-, Z^*-)] + \mathrm{E}[H(Y^*, Z^*)] - 1\right), \tag{6.2.6}$$
where the random variables $Y^*$ and $Z^*$ are independent and distributed as $Y$ and $Z$, respectively.

Proof We refer the reader to Mesfioui and Tajar (2005) for a formal proof. □

With continuous random variables, Spearman's rho is entirely described by the copula and unrelated to the marginal distributions. This is no longer true in general for random variables that are not necessarily continuous (see for instance Nešlehová 2007). Tied pairs may occur. Specifically, for the random pairs $(Y_1, Z_1)$ and $(Y_2, Z_3)$, the probability that a tie occurs is given by
$$\mathrm{P}[(Y_1 - Y_2)(Z_1 - Z_3) = 0] = \begin{cases} \mathrm{P}[Y_1 = Y_2 \text{ or } Z_1 = Z_3] & \text{when } Z \text{ is discrete,}\\ \mathrm{P}[Y_1 = Y_2] & \text{when } Z \text{ is continuous.} \end{cases}$$

6.2.4.2 Upper Bounds

Upper bounds on Spearman's rho can be obtained by replacing the joint distribution function $H$ in (6.2.6) with the Fréchet-Höffding upper bound
$$H^u(y, z) = \min\{F_Y(y), F_Z(z)\}.$$
Spearman's rho $\rho[Y, Z]$ is then bounded by
$$\begin{aligned} \rho_{\max} &= 3\left(\mathrm{E}[\min\{F_Y(Y^*-), F_Z(Z^*)\}] + \mathrm{E}[\min\{F_Y(Y^*), F_Z(Z^*-)\}]\right)\\ &\quad + 3\left(\mathrm{E}[\min\{F_Y(Y^*-), F_Z(Z^*-)\}] + \mathrm{E}[\min\{F_Y(Y^*), F_Z(Z^*)\}] - 1\right). \end{aligned} \tag{6.2.7}$$

For a continuous $Z$, upper bound (6.2.7) can be particularized as follows.

Proposition 6.2.8 If $Z$ is continuous, then
$$\rho[Y, Z] \leq 3\left(1 - \mathrm{E}\left[F_Y^2(Y)\right] - \mathrm{E}\left[F_Y^2(Y-)\right]\right). \tag{6.2.8}$$

Proof Since $Z$ is continuous, $\rho_{\max}$ can be expressed as
$$\rho_{\max} = 3\left(2\mathrm{E}[\min\{F_Y(Y), U\}] + 2\mathrm{E}[\min\{F_Y(Y-), U\}] - 1\right) \tag{6.2.9}$$
where $U$ is a random variable uniformly distributed over the unit interval $[0, 1]$ and independent of $Y$. We get
$$\begin{aligned} \mathrm{E}[\min\{F_Y(Y-), U\}] &= \mathrm{E}\left[\mathrm{E}[\min\{F_Y(Y-), U\} \,|\, Y]\right]\\ &= \mathrm{E}\left[\int_0^{F_Y(Y-)} u\, \mathrm{d}u + \int_{F_Y(Y-)}^1 F_Y(Y-)\, \mathrm{d}u\right]\\ &= \frac{1}{2}\mathrm{E}\left[2F_Y(Y-) - F_Y^2(Y-)\right]. \end{aligned} \tag{6.2.10}$$
Similarly, we have
$$\mathrm{E}[\min\{F_Y(Y), U\}] = \frac{1}{2}\mathrm{E}\left[2F_Y(Y) - F_Y^2(Y)\right]. \tag{6.2.11}$$
Inserting (6.2.10) and (6.2.11) in (6.2.9), we get the desired result. □

Turning to the case of a discrete $Z$, we define
$$\psi(z_j) = \begin{cases} \min\{k \in \mathbb{N} : F_Y(k) > F_Z(z_j)\} & \text{for } j \in \{1, 2, \ldots, m-1\},\\ +\infty & \text{for } j = m. \end{cases}$$
Using the notation $\phi(k)$ for $j_k$, $k \in \mathbb{N}$, we then get the following result.
Proposition 6.2.9 If $Z$ is discrete, then
$$\begin{aligned} \rho[Y, Z] \leq\ & 9 - 3\left(\mathrm{E}\left[F_Y(Y)\left(F_Z(z_{\phi(Y)}) + F_Z(z_{\phi(Y)}-)\right)\right] + \mathrm{E}\left[F_Y(Y-)\left(F_Z(z_{\phi(Y-)}) + F_Z(z_{\phi(Y-)}-)\right)\right]\right)\\ &- 3\left(\mathrm{E}\left[F_Z(Z)\left(F_Y(\psi(Z)) + F_Y(\psi(Z)-)\right)\right] + \mathrm{E}\left[F_Z(Z-)\left(F_Y(\psi(Z-)) + F_Y(\psi(Z-)-)\right)\right]\right). \end{aligned} \tag{6.2.12}$$

Proof The best upper bound for $\rho[Y, Z]$ is given by (6.2.7), namely
$$\begin{aligned} \rho_{\max} &= 3\left(\mathrm{E}[\min\{F_Y(Y^*-), F_Z(Z^*)\}] + \mathrm{E}[\min\{F_Y(Y^*), F_Z(Z^*-)\}]\right)\\ &\quad + 3\left(\mathrm{E}[\min\{F_Y(Y^*-), F_Z(Z^*-)\}] + \mathrm{E}[\min\{F_Y(Y^*), F_Z(Z^*)\}] - 1\right). \end{aligned} \tag{6.2.13}$$
First, we note that
$$\begin{aligned} \min\{F_Y(Y^*), F_Z(Z^*)\} &= F_Y(Y^*)\, I[F_Z(Z^*) \geq F_Y(Y^*)] + F_Z(Z^*)\, I[F_Z(Z^*) < F_Y(Y^*)]\\ &= F_Y(Y^*)\, I[Z^* \geq z_{\phi(Y^*)}] + F_Z(Z^*)\, I[Y^* \geq \psi(Z^*)]. \end{aligned}$$
Hence, we get
$$\begin{aligned} \mathrm{E}[\min\{F_Y(Y^*), F_Z(Z^*)\}] &= \mathrm{E}\left[\mathrm{E}\left[F_Y(Y^*)\, I[Z^* \geq z_{\phi(Y^*)}] \,\middle|\, Y^*\right]\right] + \mathrm{E}\left[\mathrm{E}\left[F_Z(Z^*)\, I[Y^* \geq \psi(Z^*)] \,\middle|\, Z^*\right]\right]\\ &= \mathrm{E}\left[F_Y(Y^*)\left(1 - F_Z(z_{\phi(Y^*)}-)\right)\right] + \mathrm{E}\left[F_Z(Z^*)\left(1 - F_Y(\psi(Z^*)-)\right)\right]. \end{aligned} \tag{6.2.14}$$
Also, we have
$$\begin{aligned} \min\{F_Y(Y^*-), F_Z(Z^*)\} &= F_Y(Y^*-)\, I[F_Z(Z^*) \geq F_Y(Y^*-)] + F_Z(Z^*)\, I[F_Z(Z^*) < F_Y(Y^*-)]\\ &= F_Y(Y^*-)\, I[Z^* \geq z_{\phi(Y^*-)}] + F_Z(Z^*)\, I[Y^* \geq \psi(Z^*) + 1], \end{aligned}$$
which leads to
$$\mathrm{E}[\min\{F_Y(Y^*-), F_Z(Z^*)\}] = \mathrm{E}\left[F_Y(Y^*-)\left(1 - F_Z(z_{\phi(Y^*-)}-)\right)\right] + \mathrm{E}\left[F_Z(Z^*)\left(1 - F_Y(\psi(Z^*))\right)\right]. \tag{6.2.15}$$
Similarly, we get
$$\mathrm{E}[\min\{F_Y(Y^*), F_Z(Z^*-)\}] = \mathrm{E}\left[F_Y(Y^*)\left(1 - F_Z(z_{\phi(Y^*)})\right)\right] + \mathrm{E}\left[F_Z(Z^*-)\left(1 - F_Y(\psi(Z^*-)-)\right)\right] \tag{6.2.16}$$
and
$$\mathrm{E}[\min\{F_Y(Y^*-), F_Z(Z^*-)\}] = \mathrm{E}\left[F_Y(Y^*-)\left(1 - F_Z(z_{\phi(Y^*-)})\right)\right] + \mathrm{E}\left[F_Z(Z^*-)\left(1 - F_Y(\psi(Z^*-))\right)\right]. \tag{6.2.17}$$
Finally, the announced upper bound for $\rho[Y, Z]$ is obtained by inserting (6.2.14), (6.2.15), (6.2.16) and (6.2.17) in (6.2.13). □

6.2.5 Numerical Example

Let us illustrate the computation of the upper bounds (6.2.4), (6.2.5), (6.2.8) and (6.2.12) in a situation of practical relevance. To this end, we consider the motor third-party liability insurance portfolio introduced in Sect. 3.2.4.2. Specifically, we restrict our example to the 124 524 insurance policies of the portfolio that have been observed during the whole year. Figure 6.1 displays for each feature the number of policies by category/value, and Table 6.1 shows the observed numbers of claims.
Let $n = 124\,524$ be the number of policies considered in this example and let us denote by $Y_i$ the number of claims of policy $i$ ($i = 1, \ldots, n$). The probabilities $p_k$ can be directly estimated using the empirical proportions
$$\widehat{p}_k = \frac{\sum_{i=1}^{n} I[Y_i = k]}{n}, \quad k \in \mathbb{N}.$$
Hence, based on the observations for the number of claims summarized in Table 6.1, we get the empirical proportions depicted in Table 6.2.
[Fig. 6.1 The categories/values of the explanatory variables and their corresponding numbers of policies]

Table 6.1 Descriptive statistics for the number of claims

Number of claims    Number of policies
0                   110 231
1                   13 024
2                   1 149
3                   115
4                   5
≥5                  0

Table 6.2 Empirical proportions $\widehat{p}_k$ displayed as percentages

k               0          1          2         3         4         ≥5
$\widehat{p}_k$  88.52189   10.45903   0.92271   0.09235   0.00402   0
Expectations $\mathrm{E}[F_Y(Y-)]$, $\mathrm{E}[F_Y^2(Y)]$ and $\mathrm{E}[F_Y^2(Y-)]$ can be estimated using the empirical proportions $\widehat{p}_k$:
$$\widehat{\mathrm{E}}[F_Y(Y-)] = \sum_{k=1}^{\infty} \widehat{p}_k \widehat{F}_Y(k-1) = \sum_{k=1}^{\infty} \widehat{p}_k \sum_{l=0}^{k-1} \widehat{p}_l,$$
$$\widehat{\mathrm{E}}[F_Y^2(Y)] = \sum_{k=0}^{\infty} \widehat{p}_k \left(\sum_{l=0}^{k} \widehat{p}_l\right)^2,$$
$$\widehat{\mathrm{E}}[F_Y^2(Y-)] = \sum_{k=1}^{\infty} \widehat{p}_k \left(\sum_{l=0}^{k-1} \widehat{p}_l\right)^2.$$

With the $\widehat{p}_k$ displayed in Table 6.2, we get
$$\widehat{\mathrm{E}}[F_Y(Y-)] = 0.1026812,$$
$$\widehat{\mathrm{E}}[F_Y^2(Y)] = 0.8063110,$$
$$\widehat{\mathrm{E}}[F_Y^2(Y-)] = 0.0919602.$$
Of course, other estimators can be considered, exploiting the regression model structure (whereas the proposed one is purely nonparametric).
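These nonparametric estimates are straightforward to compute, as in the following R sketch based on a hypothetical vector y holding the observed claim counts.

```r
# Empirical estimates of E[F_Y(Y-)], E[F_Y(Y)^2] and E[F_Y(Y-)^2]:
p   <- prop.table(table(factor(y, levels = 0:max(y))))
FY  <- cumsum(p)              # F_Y(k), k = 0, 1, ...
FYm <- c(0, head(FY, -1))     # F_Y(k-) = F_Y(k - 1)
E_FYm  <- sum(p * FYm)        # estimate of E[F_Y(Y-)]
E_FY2  <- sum(p * FY^2)       # estimate of E[F_Y(Y)^2]
E_FYm2 <- sum(p * FYm^2)      # estimate of E[F_Y(Y-)^2]
# Estimated upper bounds (6.2.4) on Kendall's tau and (6.2.8) on Spearman's rho:
tau_max <- 2 * E_FYm
rho_max <- 3 * (1 - E_FY2 - E_FYm2)
```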
Therefore, when $Z$ is continuous, which is the case when AgePh and AgeCar are treated as continuous variables using GAMs for instance, the upper bounds (6.2.4) on Kendall's tau and (6.2.8) on Spearman's rho can be estimated by
$$2\widehat{\mathrm{E}}[F_Y(Y-)] = 0.2053624$$
and
$$3\left(1 - \widehat{\mathrm{E}}\left[F_Y^2(Y)\right] - \widehat{\mathrm{E}}\left[F_Y^2(Y-)\right]\right) = 0.3051865,$$
respectively. The actuarial analyst wishing to assess the predictive performances of the regression model for claim counts using Kendall's tau or Spearman's rho must therefore compare the obtained value to the upper bound 20.5% for Kendall's tau and 30.5% for Spearman's rho, and not to 1 (which cannot be attained with the data under consideration).
In case Z is discrete, we need the distribution function of Z in order to compute
upper bounds (6.2.5) and (6.2.12). Consider that the number of claims Y is predicted
using a regression tree. For the illustration, we fit several regression trees, controlling
the size of the trees by the minimum number of observations required in the terminal
nodes. The results are shown in Tables 6.3 and 6.4. The corresponding regression
trees and distribution functions of Z are displayed in Figs. 6.2, 6.3 and 6.4.
From Table 6.3, we notice that the upper bound (6.2.5) increases with m to ulti-
mately tend to the upper bound (6.2.4) derived in the continuous case. Also, the
values for Kendall’s tau are rather small (around 2%), as it is generally the case with
Table 6.3 Upper bound (6.2.5), Kendall's tau and its normalized version Kendall's tau/Upper bound, as well as the corresponding m (i.e. number of values taken by Z) for trees whose sizes are controlled by the minimum number of observations (Min. nb of obs.) required in the terminal nodes

Min. nb of obs.   m    Kendall's tau   Upper bound   Kendall's tau/Upper bound
35 000            2    0.0167340       0.1645063     0.1017228
30 000/25 000     3    0.0227923       0.1695944     0.1343933
20 000            4    0.0257349       0.1893385     0.1359203
15 000            5    0.0281253       0.1941500     0.1448638
10 000            7    0.0287892       0.1941500     0.1482834
5000              14   0.0313762       0.2000066     0.1568756
1000              66   0.0355663       0.2050653     0.1734387

Table 6.4 Upper bound (6.2.12), Spearman's rho and its normalized version Spearman's rho/Upper bound, as well as the corresponding m (i.e. number of values taken by Z) for trees whose sizes are controlled by the minimum number of observations (Min. nb of obs.) required in the terminal nodes

Min. nb of obs.   m    Spearman's rho   Upper bound   Spearman's rho/Upper bound
35 000            2    0.0582118        0.2467594     0.2359052
30 000/25 000     3    0.0665856        0.2543916     0.2617447
20 000            4    0.0723529        0.2840077     0.2547567
15 000            5    0.0784423        0.2912251     0.2693530
10 000            7    0.0790575        0.2912251     0.2714653
5000              14   0.0854499        0.2992722     0.2855257
1000              66   0.0965594        0.3050679     0.3165177

claim frequency data. However, they must not be compared to 1 (which cannot be attained) but to the upper bounds ranging from 16% to 20% (depending on m). This may lead the analyst to a different conclusion than the one deduced from the normalized version obtained by dividing Kendall's tau by its corresponding upper bound.
Notice that the values for Kendall's tau depicted in Table 6.3 are training sample estimates, so that they are overly optimistic and favor larger regression trees (i.e. larger m). Hence, validation sample estimates for Kendall's tau should be smaller than the already small values presented in Table 6.3, showing even more the need to compare values of Kendall's tau with the highest attainable values and not 1.

Remark 6.2.10 In Table 6.3, training sample estimates for Kendall’s tau more than
double from the simplest model (m = 2) to the most complex one (m = 66) while
training sample estimates for its normalized version only increase by 70.5%. This
is due to the fact that the values of the upper bound also increase with m, like the
training sample estimates for Kendall’s tau. In a way, one can say that training sample
estimates for normalized Kendall’s tau penalize the model’s complexity.
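The normalization itself is a one-liner once the bound is available. A minimal R sketch, assuming hypothetical vectors `y` (observed counts) and `z` (tree predictions) together with a scalar `tau_bound` holding the upper bound computed for the model at hand:

```r
# Minimal R sketch (hypothetical inputs): normalize an empirical Kendall's tau
# by its attainable upper bound, e.g. one of the values listed in Table 6.3.
tau_hat  <- cor(y, z, method = "kendall")  # Kendall's tau between Y and Z
tau_norm <- tau_hat / tau_bound            # normalized version, comparable across models
```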

Fig. 6.2 a Regression tree (on the left) and distribution function of Z (on the right) when the minimum number of observations per final node is set to 35 000. b Regression tree (on the left) and distribution function of Z (on the right) when the minimum number of observations per final node is set to 30 000 (same results when the minimum number of observations per node is set to 25 000). c Regression tree (on the left) and distribution function of Z (on the right) when the minimum number of observations per final node is set to 20 000

Fig. 6.3 a Regression tree (on the left) and distribution function of Z (on the right) when the minimum number of observations per final node is set to 15 000. b Regression tree (on the left) and distribution function of Z (on the right) when the minimum number of observations per final node is set to 10 000. c Regression tree (on the left) and distribution function of Z (on the right) when the minimum number of observations per final node is set to 5 000
Fig. 6.4 Regression tree (on the left) and cdf of Z (on the right) when the minimum number of observations per final node is set to 1 000

Considering the values displayed in Table 6.4, we see that the values for Spearman's rho lie between 5% and 10%, which is also rather small compared with 1. Again, this may lead the analyst to a conclusion different from the one deduced from the normalized version, whose values range between 23% and 32%. As for Kendall's tau, the upper bound increases with the number of values m taken by Z and finally converges to the upper bound obtained when Z is continuous. Also, the values for Spearman's rho shown in Table 6.4 are training sample estimates, so that they are overly optimistic and favor larger regression trees.

6.3 Measuring Lift

6.3.1 Motivation

We need actuarial measures of quality, especially for comparing different pricing models. In insurance applications, economic criteria must also be used to decide whether a model is worth implementing. A very accurate, and thus costly to maintain, pricing model offering limited gains compared with the existing, simpler one will certainly be abandoned. The notion of lift proves to be useful in that respect.
Actuaries now resort to the concept of lift to evaluate a given model, or to compare competing models. Lift measures the model's ability to prevent adverse selection. Precisely, it quantifies the model's ability to charge each insured an actuarially fair rate, thereby minimizing the potential for losing business attracted by competitors using finer price lists.
Measuring lift is an important component of model validation: once a predictive model has been built, it is essential to determine its performance in predicting the true premium μ(X) given the available features X. Notice that the response Y itself does not play a direct role in the determination of the premium (beyond the definition of μ(X) and the calibration of the supervised regression model delivering the prices μ̂), as departures Y − μ(X) cancel out when averaged over a sufficiently large portfolio (this is the very essence of insurance). The premium μ̂(X) has to be as close as possible to the true premium μ(X). The very aim of ratemaking is not to predict the actual losses Y but to create accurate estimates of μ(X), which is unobserved. Goodness-of-lift must be measured with the validation set, not with the training set (otherwise, a model over-fitting the training data may appear to provide a high lift).
Let μ̂_1 and μ̂_2 be two predictors based on the information contained in X. Both are candidate premiums and attempt to predict the true premium μ(X). There are many methods to obtain such predictors, ranging from classical GLMs to neural networks. Lift charts sort the data based on the ratio R = μ̂_2/μ̂_1 to compare the two predictors. This ratio is called the relativity. The procedure is as follows. First, calculate the ratio R for each observation in the validation set and sort the data according to it. Second, bucket the data into equally populated classes. If μ̂_2 is the new model, to be compared with the current one μ̂_1, then the superiority of μ̂_2 over μ̂_1 is demonstrated by plotting the loss ratios corresponding to the old model μ̂_1. If the buckets with low relativities have lower loss ratios then we have lift, that is, if the loss ratios exhibit an increasing trend then μ̂_2 is preferred over μ̂_1. A minimal sketch of this procedure is given below.

6.3.2 Predictors Characteristics

In actuarial pricing, the actuary aims to predict the technical premium μ(X). To this end, a predictor μ̂(X) is built from the available information X. To ease the exposition, we assume that all predictors μ̂(X) under consideration, as well as the conditional expectation μ(X), are continuous random variables admitting probability density functions. This is generally the case when there is at least one continuous feature comprised in the available information X and the function μ̂ is a continuously increasing function of a real score. However, this may rule out predictions based on discrete features only, as well as piecewise constant predictors, e.g., a single tree, since in those cases μ̂(X) takes only a limited number of values. Now, as actuarial pricing is nowadays based on more sophisticated models (trees being combined into random forests, for instance), this assumption does not really restrict the generality of the approach. Notice that the response Y may be discrete (such as the number of claims, for instance) as the continuity only concerns μ(X) and μ̂(X).
The predictor is also supposed to be correct on average, that is,

$$E[\widehat{\mu}(X)] = E[\mu(X)] = E[Y], \tag{6.3.1}$$

and both the response Y and the predictor μ̂ are assumed to be non-negative.
We denote as

$$F_{\widehat{\mu}}(t) = P[\widehat{\mu}(X) \leq t], \quad t \geq 0,$$

the distribution function of μ̂(X), as f_μ̂ the corresponding probability density function,

$$F_{\widehat{\mu}}(t) = \int_0^t f_{\widehat{\mu}}(s)\,\mathrm{d}s, \quad t \geq 0,$$

and as F_μ̂⁻¹ the associated quantile function (or Value-at-Risk) defined as the generalized inverse of F_μ̂, i.e.

$$F_{\widehat{\mu}}^{-1}(\alpha) = \inf\{t \mid F_{\widehat{\mu}}(t) \geq \alpha\} \quad \text{for a probability level } \alpha.$$

Since the predictor is continuous, we have

$$F_{\widehat{\mu}}\left(F_{\widehat{\mu}}^{-1}(\alpha)\right) = \alpha \quad \text{for all probability levels } \alpha.$$

To evaluate the performances of a predictor μ̂, the following aspects appear to be relevant:
• the variability of μ̂, as a more variable μ̂ induces larger premium differentials, i.e. more lift;
• the ability of μ̂ to match the true premium for increasing risk profiles.
The first objective can be formalized with the help of the convex order, which can be characterized by means of Lorenz curves. The convex order is often used in applied probability to compare the variability inherent to probability distributions. The second objective is assessed by means of concentration curves.

6.3.3 Convex Order

Clearly, the more μ̂(X) is dispersed, the more information it contains about the true premium. The constant predictor μ̂(X) = E[Y], the least dispersed one, does not bring any information about the relative riskiness of the different policies. Thus, comparing the underlying variability appears to be important in the problem under study.
The convex order is an effective probabilistic tool to assess the dispersion of random variables, beyond simple indicators such as standard deviations.

Definition 6.3.1 Consider two non-negative random variables Z_1 and Z_2. Then, Z_1 is said to be smaller than Z_2 in the convex order, henceforth denoted as Z_1 ⪯_cx Z_2, if

$$E[Z_1] = E[Z_2] \quad \text{and} \quad E[(Z_1 - t)_+] \leq E[(Z_2 - t)_+] \text{ for all } t \geq 0,$$

where (z − t)_+ denotes the positive part of z − t, i.e. (z − t)_+ = max{z − t, 0}.
This means that the stop-loss premiums for Z_2 dominate the corresponding stop-loss premiums for Z_1 for all deductible levels t. The name convex order comes from

the fact that Z_1 ⪯_cx Z_2 ⇔ E[g(Z_1)] ≤ E[g(Z_2)] for all the convex functions g for which the expectations exist. Moreover,

$$Z_1 \preceq_{cx} Z_2 \Rightarrow \mathrm{Var}[Z_1] \leq \mathrm{Var}[Z_2]. \tag{6.3.2}$$

This explains why ⪯_cx is a variability order: it only applies to random variables with the same expected value and compares the dispersion of these variables. The convex order is a more sophisticated comparison than only focusing on the variances, yet (6.3.2) indicates that it agrees with this approach. Henceforth, we can interpret Z_1 ⪯_cx Z_2 as "Z_2 is more variable than Z_1", keeping in mind that the variability in question extends beyond the simple comparison of standard deviations. For a detailed presentation of the convex order, we refer the reader to Denuit et al. (2005).
From the identity

$$E[(Z - t)_+] - E[(t - Z)_+] = E[Z] - t,$$

we find

$$Z_1 \preceq_{cx} Z_2 \Leftrightarrow \begin{cases} E[Z_1] = E[Z_2], \\ E[(t - Z_1)_+] \leq E[(t - Z_2)_+] \text{ for all } t \geq 0. \end{cases} \tag{6.3.3}$$

Note that partial integration leads to

$$E[(Z - t)_+] = \int_t^{\infty} \big(1 - F_Z(z)\big)\,\mathrm{d}z \quad \text{and} \quad E[(t - Z)_+] = \int_{-\infty}^t F_Z(z)\,\mathrm{d}z. \tag{6.3.4}$$

Recall that for any random variables Z_1 and Z_2 with equal mean,

$$Z_1 \preceq_{cx} Z_2 \Leftrightarrow \int_{\alpha}^1 F_{Z_1}^{-1}(u)\,\mathrm{d}u \leq \int_{\alpha}^1 F_{Z_2}^{-1}(u)\,\mathrm{d}u \quad \text{for all } 0 < \alpha < 1.$$

As $E[Z_k] = \int_0^1 F_{Z_k}^{-1}(u)\,\mathrm{d}u$, we also have

$$\int_0^{\alpha} F_{Z_1}^{-1}(u)\,\mathrm{d}u = E[Z_1] - \int_{\alpha}^1 F_{Z_1}^{-1}(u)\,\mathrm{d}u \geq E[Z_2] - \int_{\alpha}^1 F_{Z_2}^{-1}(u)\,\mathrm{d}u = \int_0^{\alpha} F_{Z_2}^{-1}(u)\,\mathrm{d}u$$

when Z_1 ⪯_cx Z_2. Notice that we also have

$$\int_0^{\alpha} F_{Z_k}^{-1}(u)\,\mathrm{d}u = \int_0^{F_{Z_k}^{-1}(\alpha)} z f_{Z_k}(z)\,\mathrm{d}z = E\left[Z_k\, I[Z_k \leq F_{Z_k}^{-1}(\alpha)]\right]$$

and

$$\frac{1}{\alpha} \int_0^{\alpha} F_{Z_k}^{-1}(u)\,\mathrm{d}u = E\left[Z_k \,\middle|\, Z_k \leq F_{Z_k}^{-1}(\alpha)\right].$$

An important characterization of the convex order is by construction on the same probability space using conditional expectations. Precisely, the random variables Z_1 and Z_2 satisfy Z_1 ⪯_cx Z_2 if, and only if, there exist two random variables Ẑ_1 and Ẑ_2, defined on the same probability space, such that Ẑ_1 is distributed as Z_1, Ẑ_2 is distributed as Z_2, and (Ẑ_1, Ẑ_2) is a martingale, that is, E[Ẑ_2 | Ẑ_1] = Ẑ_1 holds almost surely. This directly shows that increasing the number of features is beneficial as X_1 ⊆ X_2 ⇒ μ(X_1) ⪯_cx μ(X_2). Switching from X_1 to the richer information X_2 thus produces more dispersed premiums μ.
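The stop-loss characterization in Definition 6.3.1 suggests a simple empirical diagnostic. The following R sketch, assuming hypothetical prediction vectors `mu1` and `mu2` with approximately equal means, compares sample stop-loss premiums on a grid of deductibles; it is only an exploratory check, not a formal test of the convex order:

```r
# Minimal R sketch (hypothetical inputs): sample stop-loss premiums
# E[(Z - t)_+] on a grid of deductibles t.
stop_loss <- function(z, t) mean(pmax(z - t, 0))
t_grid <- quantile(c(mu1, mu2), probs = seq(0.05, 0.95, by = 0.05))
sl1 <- sapply(t_grid, function(t) stop_loss(mu1, t))
sl2 <- sapply(t_grid, function(t) stop_loss(mu2, t))
all(sl2 <= sl1)  # TRUE on the grid is consistent with mu2 <=_cx mu1
```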

6.3.4 Concentration Curve

6.3.4.1 Definition

Given a binary response Y ∈ {0, 1}, where

$$Y = \begin{cases} 1 & \text{if there has been at least one claim filed against the insurer} \\ 0 & \text{otherwise,} \end{cases}$$

Gourieroux and Jasiak (2007, Chap. 4) defined several performance measures for a predictor μ̂(X) of the mean response E[Y|X] = P[Y = 1|X]. These performance measures for the predictor μ̂ for a binary response Y were based on the two curves

$$\alpha \mapsto \frac{P\big[Y = 1 \mid \widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)\big]}{P[Y = 1]} = \frac{E\big[Y \mid \widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)\big]}{E[Y]}$$

and

$$\alpha \mapsto P\big[\widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha) \mid Y = 1\big].$$

The former curve gives the proportion of policies reporting at least one claim although their associated prediction was low (precisely, among the 100α% smallest ones). The latter corresponds to the proportion of policies with small predictions among those with at least one claim reported. Notice that

$$P[\widehat{\mu}(X) \leq t \mid Y = 1] = \frac{P[Y = 1 \mid \widehat{\mu}(X) \leq t]}{P[Y = 1]}\, P[\widehat{\mu}(X) \leq t]$$

so that, evaluating at t = F_μ̂⁻¹(α) and multiplying the first curve by α, we obtain the second one.
Considering that the identities

$$\frac{P[Y = 1 \mid \widehat{\mu}(X) \leq t]}{P[Y = 1]} = \frac{E[Y \mid \widehat{\mu}(X) \leq t]}{E[Y]}$$

and

$$P[\widehat{\mu}(X) \leq t \mid Y = 1] = \frac{E\big[Y\, I[\widehat{\mu}(X) \leq t]\big]}{E[Y]}$$

are both valid for binary responses, this suggests basing the performance measure of the predictor on the concentration curve of the response with respect to the predictor, whose definition is recalled next.
Definition 6.3.2 The concentration curve of the response Y with respect to the predictor μ̂ based on the information contained in the vector X is defined as

$$CC[Y, \widehat{\mu}(X); \alpha] = \frac{E\big[Y\, I[\widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)]\big]}{E[Y]}$$

for a probability level α.


In words, the concentration curve CC[Y, μ̂(X); α] measures the proportion of the total losses Y attributable to the sub-portfolio gathering the given percentage α of policies with the lowest predictions. For an exhaustive review of the properties of the absolute concentration curve, we refer the interested reader to the book by Yitzhaki and Schechtman (2013).
Property 6.3.3 The concentration curve of the response Y with respect to the pre-
dictor 
μ based on the information contained in the vector X can be equivalently
rewritten as
 Fμ−1 (α)
1 

CC[Y, 
μ(X); α] = E Y 
μ(X) = t fμ (t)dt (6.3.5)
E[Y ] 0


E Y μ(X) ≤ Fμ−1 (α)


= ×α (6.3.6)
E[Y ]

= CC μ(X),  μ(X); α (6.3.7)


Cov Y, I[ μ(X) ≤ Fμ−1 (α)]


= +α (6.3.8)
E[Y ]

for every probability level α.


Proof Let f_{(Y,μ̂)} denote the joint probability density function of the pair (Y, μ̂(X)). Then,

$$E\big[Y \mid \widehat{\mu}(X) = t\big] = \int_0^{\infty} y\, \frac{f_{(Y,\widehat{\mu})}(y, t)}{f_{\widehat{\mu}}(t)}\,\mathrm{d}y$$

so that

$$\int_0^{F_{\widehat{\mu}}^{-1}(\alpha)} E\big[Y \mid \widehat{\mu}(X) = t\big] f_{\widehat{\mu}}(t)\,\mathrm{d}t = \int_0^{F_{\widehat{\mu}}^{-1}(\alpha)} \int_0^{\infty} y\, f_{(Y,\widehat{\mu})}(y, t)\,\mathrm{d}y\,\mathrm{d}t = E\Big[Y\, I[\widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)]\Big].$$

This establishes (6.3.5). To get (6.3.6), it suffices to note that

$$P[Y \leq y \mid \widehat{\mu}(X) \leq t] = \frac{P[Y \leq y, \widehat{\mu}(X) \leq t]}{P[\widehat{\mu}(X) \leq t]} = \frac{1}{P[\widehat{\mu}(X) \leq t]} \int_0^y \int_0^t f_{(Y,\widehat{\mu})}(z, s)\,\mathrm{d}z\,\mathrm{d}s$$

so that

$$E\big[Y \mid \widehat{\mu}(X) \leq t\big] = \int_0^{\infty} y\,\mathrm{d}P[Y \leq y \mid \widehat{\mu}(X) \leq t] = \frac{1}{P[\widehat{\mu}(X) \leq t]} \int_0^{\infty} \int_0^t y\, f_{(Y,\widehat{\mu})}(y, s)\,\mathrm{d}s\,\mathrm{d}y = \frac{E\big[Y\, I[\widehat{\mu}(X) \leq t]\big]}{P[\widehat{\mu}(X) \leq t]}.$$

Then, (6.3.7) follows from the fact that E[Y I[μ̂(X) ≤ t] | X] = μ(X) I[μ̂(X) ≤ t], whence

$$E\big[Y\, I[\widehat{\mu}(X) \leq t]\big] = E\Big[E\big[Y\, I[\widehat{\mu}(X) \leq t] \mid X\big]\Big] = E\big[\mu(X)\, I[\widehat{\mu}(X) \leq t]\big].$$

Finally, (6.3.8) is easily obtained from the definition of the concentration curve as

$$\mathrm{Cov}\Big[Y, I[\widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)]\Big] = E\Big[Y\, I[\widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)]\Big] - \alpha E[Y].$$

This ends the proof. □

Formula (6.3.7) shows that we can equivalently replace the response Y with the
pure premium μ(X) in the concentration curve. This property is of utmost importance
as the actuary is interested in the pure premium, which is unknown, but is allowed
to replace it with the actual response values in the evaluation of the performances of
the predictor under consideration.

6.3.4.2 Positive Dependence Structures

If the predictor brings a lot of information about the technical premium, or equiva-
lently about the response, this means that these random variables are strongly cor-
related. The shape of the concentration curve of the response with respect to its
predictor thus depends on the kind of positive relationship between the response and
the predictor. This is why we recall the definition of the following positive depen-
dence notions.

Definition 6.3.4 Consider two non-negative random variables Z_1 and Z_2.
(i) The random variable Z_1 is stochastically increasing in Z_2 if

z_2 ↦ P[Z_1 ≤ z_1 | Z_2 = z_2] is non-increasing in z_2,

for all z_1 ≥ 0.
(ii) The random variable Z_1 is positively regression dependent on Z_2 if

z_2 ↦ E[Z_1 | Z_2 = z_2] is non-decreasing in z_2.

(iii) The random variable Z_1 is left-tail decreasing in Z_2 if

z_2 ↦ P[Z_1 ≤ z_1 | Z_2 ≤ z_2] is non-increasing in z_2,

for all z_1 ≥ 0.
(iv) The random variable Z_1 is positively left-tail expectation dependent on Z_2 if

z_2 ↦ E[Z_1 | Z_2 ≤ z_2] is non-decreasing in z_2.

(v) The random variable Z_1 is positively expectation dependent on Z_2 if

E[Z_1] ≥ E[Z_1 | Z_2 ≤ z_2]

for all z_2 ≥ 0.

Let us give some intuition for these different concepts. Considering stochastic increasingness, we see that the condition defining this positive dependence notion expresses the fact that the probability that Z_1 is small, in the sense that Z_1 falls below the threshold z_1, decreases as Z_2 gets larger. Intuitively speaking, Z_1 thus tends to become larger when Z_2 increases. Positive regression dependence is a weaker concept as

$$E[Z_1 \mid Z_2 = z_2] = \int_0^{\infty} P[Z_1 > z_1 \mid Z_2 = z_2]\,\mathrm{d}z_1$$

is obviously increasing when Z_1 is stochastically increasing in Z_2. Positive regression dependence ensures that, on average, Z_1 gets larger when Z_2 increases. The next three concepts are defined by conditionings of the form "Z_2 ≤ z_2" instead of "Z_2 = z_2". Intuitively speaking, "Z_2 ≤ z_2" means that Z_2 is small. Left-tail decreasingness and left-tail expectation dependence are the counterparts of stochastic increasingness and positive regression dependence for these alternative conditionings. The last concept, called positive expectation dependence, expresses the fact that the knowledge that Z_2 is small, i.e. that Z_2 lies below the threshold z_2, decreases Z_1 on average compared with the situation where there is no information about Z_2.

6.3.4.3 Properties of the Concentration Curve

Monotonicity
The concentration curve is based on the function

$$t \mapsto \frac{E\big[\mu(X)\, I[\widehat{\mu}(X) \leq t]\big]}{E[Y]}$$

evaluated at quantiles of μ̂(X). This function is a distribution function, starting from (0, 0) to reach (1, 1), being non-decreasing and right-continuous. Therefore, α ↦ CC[μ(X), μ̂(X); α] is non-decreasing and satisfies

$$\lim_{\alpha \to 0} CC[\mu(X), \widehat{\mu}(X); \alpha] = 0 \quad \text{and} \quad \lim_{\alpha \to 1} CC[\mu(X), \widehat{\mu}(X); \alpha] = 1.$$

Line of Independence
In the particular case where the predictor brings no information about the response, in the sense that Y and μ̂(X) are mutually independent, the concentration curve is the 45-degree line, often referred to as the line of independence in the literature. Formally, if Y and μ̂(X) are independent, then

$$CC[\mu(X), \widehat{\mu}(X); \alpha] = \frac{E[Y]\, P\big[\widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)\big]}{E[Y]} = \alpha.$$

Let us now study the position of the concentration curve with respect to the 45-degree line. If μ̂ brings a lot of information about the true premium μ(X), this means that these random variables are strongly related and the concentration curve should be far from the line of independence. Furthermore, the shape of the concentration curve depends on the kind of relationship between μ(X) and μ̂(X). The next result shows that under weak positive dependence, every concentration curve lies below the independence line.
Property 6.3.5 If μ(X) is positively expectation dependent on μ̂(X), that is, if the inequality

$$E[\mu(X)] \geq E\big[\mu(X) \mid \widehat{\mu}(X) \leq t\big]$$

holds for all t, then the concentration curve lies below the 45-degree line, i.e.

$$CC[\mu(X), \widehat{\mu}(X); \alpha] \leq \alpha \quad \text{for all probability levels } \alpha.$$

Proof It suffices to write

$$\frac{E\big[\mu(X)\, I[\widehat{\mu}(X) \leq t]\big]}{E[Y]} = \frac{P[\widehat{\mu}(X) \leq t]\; E\big[\mu(X) \mid \widehat{\mu}(X) \leq t\big]}{E[Y]} \leq P[\widehat{\mu}(X) \leq t].$$

The announced result then follows by replacing t with F_μ̂⁻¹(α). □

Convexity
The next result states a positive dependence condition under which the concentration curve is convex. Again, the shape of the curve depends on the kind of relationship existing between the response and the predictor.

Property 6.3.6 The concentration curve α ↦ CC[μ(X), μ̂(X); α] is convex if, and only if, μ(X) is positively regression dependent on μ̂(X), that is, if the function

$$t \mapsto E\big[\mu(X) \mid \widehat{\mu}(X) = t\big] \tag{6.3.9}$$

is non-decreasing.
Proof Let us start from the representation

$$E\big[\mu(X)\, I[\widehat{\mu}(X) \leq t]\big] = \int_0^t \int_0^{\infty} u\, f_{(\mu,\widehat{\mu})}(u, s)\,\mathrm{d}u\,\mathrm{d}s,$$

where f_{(μ,μ̂)} denotes the joint probability density function of the pair (μ(X), μ̂(X)). Then, the first derivative of the concentration curve is given by

$$\frac{\mathrm{d}}{\mathrm{d}\alpha} CC[\mu(X), \widehat{\mu}(X); \alpha] = \int_0^{\infty} u\, f_{(\mu,\widehat{\mu})}\big(u, F_{\widehat{\mu}}^{-1}(\alpha)\big)\,\mathrm{d}u\; \frac{1}{f_{\widehat{\mu}}\big(F_{\widehat{\mu}}^{-1}(\alpha)\big)} = \int_0^{\infty} u\, f_{\mu|\widehat{\mu}}\big(u \mid F_{\widehat{\mu}}^{-1}(\alpha)\big)\,\mathrm{d}u$$

where f_{μ|μ̂}(·|t) denotes the conditional probability density function of the true premium μ(X), given μ̂(X) = t. Hence,

$$\frac{\mathrm{d}}{\mathrm{d}\alpha} CC[\mu(X), \widehat{\mu}(X); \alpha] = E\big[\mu(X) \mid \widehat{\mu}(X) = F_{\widehat{\mu}}^{-1}(\alpha)\big].$$

We thus see that this derivative is non-decreasing (and hence its primitive is convex) if, and only if, μ(X) is positively regression dependent on the score μ̂(X), as announced. This ends the proof. □
The convexity of the concentration curve ensures that the increments of the function

$$CC[Y, \widehat{\mu}(X); \alpha + \epsilon] - CC[Y, \widehat{\mu}(X); \alpha] = \frac{E\Big[Y\, I\big[F_{\widehat{\mu}}^{-1}(\alpha) < \widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha + \epsilon)\big]\Big]}{E[Y]}$$

are non-decreasing in α, for every positive ε such that α + ε ≤ 1. This means that the sub-portfolios created by isolating a proportion ε of policies with predictions comprised between F_μ̂⁻¹(α) and F_μ̂⁻¹(α + ε) bring an increasing share of the losses, on average, as α increases.

This property is related to lift charts as described in Tevet (2013). To draw such graphs, the data set is sorted based on the values of μ̂(X). The data are then bucketed into equally populated classes based on quantiles. Within each bucket, the average predicted loss is calculated with the help of the predictor μ̂, as well as the actual loss cost Y. The average predicted and average actual loss costs are then graphed for each class.
To assess the reasonableness of μ̂, the analyst checks whether the actual loss costs monotonically increase as we move to higher buckets (by definition, this will be the case for the predicted loss costs). Property 6.3.6 precisely identifies the condition required to observe an increasing trend in such a lift chart.
The monotonicity condition imposed on t ↦ E[μ(X) | μ̂(X) = t] appears to be reasonable. However, this condition is not necessarily fulfilled for a given feature X_j, i.e. t ↦ E[μ(X) | X_j = t] is not necessarily non-decreasing. For instance, considering as response the number of claims filed against the insurer in a motor third-party liability cover, the impact of the policyholder's age on the expected claim frequency often exhibits a U-shape, which invalidates the monotonicity of the conditional expectation. Re-arranging the values of the feature is nevertheless possible, as explained in Shaked et al. (2012). But such a procedure makes the analysis less transparent. This is why we condition here on μ̂(X), which maps the vector of features to the prediction and induces a total order among risk profiles.
As positive regression dependence implies positive expectation dependence, the
condition of Property 6.3.6 ensures that the concentration curve is non-decreasing
and convex, starting from (0, 0) to end at (1, 1) with a graph everywhere below the
45-degree line.

6.3.4.4 Estimation

Assuming the observations (Y_i, X_i), i = 1, …, n, to be independent and identically distributed, the concentration curve CC[μ(X), μ̂(X); α] can be estimated as follows:

$$\widehat{CC}\big[\mu(X), \widehat{\mu}(X); \alpha\big] = \widehat{CC}\big[Y, \widehat{\mu}(X); \alpha\big] = \frac{1}{n \bar{Y}} \sum_{i \mid \widehat{\mu}(X_i) \leq \widehat{F}_{\widehat{\mu}}^{-1}(\alpha)} Y_i = \frac{\sum_{i \mid \widehat{\mu}(X_i) \leq \widehat{F}_{\widehat{\mu}}^{-1}(\alpha)} Y_i}{\sum_{i=1}^n Y_i},$$

where

$$\bar{Y} = \frac{\sum_{i=1}^n Y_i}{n}.$$

Here, F̂_μ̂ denotes the empirical distribution function of the resulting predictions, i.e.

$$\widehat{F}_{\widehat{\mu}}(t) = \frac{1}{n} \sum_{i=1}^n I[\widehat{\mu}(X_i) \leq t].$$

The empirical counterpart ĈC to the population concentration curve CC can be interpreted as follows: it is the ratio of the total loss produced by those policies with predictor μ̂ below its empirical quantile at level α to the aggregate loss of the entire portfolio. This means that ĈC expresses this total sub-portfolio loss in relative terms, as a percentage of the aggregate loss at the entire portfolio level.
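This estimator translates directly into R. The following minimal sketch assumes hypothetical vectors `y` of responses and `mu_hat` of predictions on the same policies, and plots the empirical concentration and Lorenz curves side by side:

```r
# Minimal R sketch (hypothetical inputs): empirical concentration and
# Lorenz curves based on the estimator above.
cc_hat <- function(y, mu_hat, alpha) {
  t_alpha <- quantile(mu_hat, probs = alpha, type = 1)  # empirical quantile
  sum(y[mu_hat <= t_alpha]) / sum(y)
}
lc_hat <- function(mu_hat, alpha) cc_hat(mu_hat, mu_hat, alpha)
alphas <- seq(0, 1, by = 0.01)
plot(alphas, sapply(alphas, function(a) cc_hat(y, mu_hat, a)), type = "l",
     xlab = "alpha", ylab = "share of total losses / premiums")
lines(alphas, sapply(alphas, function(a) lc_hat(mu_hat, a)), lty = 2)
```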

6.3.5 Assessing the Performances of a Given Predictor

6.3.5.1 From Premiums to Ranks

Notice that

$$\widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha) \Leftrightarrow F_{\widehat{\mu}}\big(\widehat{\mu}(X)\big) \leq \alpha.$$

This means that it is enough to consider the ranking induced by the predictor, that is, we are free to replace every predictor μ̂(X) with the corresponding rank

$$M = F_{\widehat{\mu}}\big(\widehat{\mu}(X)\big)$$

obeying the unit uniform distribution. The intuitive meaning of M is as follows: M is the rank of a policyholder, once all contracts have been ordered according to their corresponding premiums (in ascending order).
Gourieroux and Jasiak (2007) were interested in credit scoring, that is, in the initial selection of applicants based on their propensity to reimburse their loan. Hence, the actual values of the predictor do not matter, only the rank they induce and the threshold defining acceptance or rejection of the application. In insurance pricing, however, the actual values μ̂(X) are also important and this information is captured by the Lorenz curve.

6.3.5.2 Lorenz Curve

The Lorenz curves are used in economics to measure the inequality of incomes. Intuitively speaking, the more variable the incomes in a given population, the less egalitarian it is. Lorenz curves are thus intimately related to the convex order. The Lorenz curve LC associated to the predictor μ̂(X) is defined by

$$LC[\widehat{\mu}(X); \alpha] = \frac{1}{E[\widehat{\mu}(X)]} \int_0^{\alpha} F_{\widehat{\mu}}^{-1}(u)\,\mathrm{d}u = \frac{E\big[\widehat{\mu}(X)\, I[\widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)]\big]}{E[\widehat{\mu}(X)]}$$

for a given probability level α.
The Lorenz curve is based on the function

$$t \mapsto \frac{E\big[\widehat{\mu}(X)\, I[\widehat{\mu}(X) \leq t]\big]}{E[\widehat{\mu}(X)]}$$

that can be seen as a distribution function. Hence, α ↦ LC[μ̂(X); α] is non-decreasing, starting from (0, 0) to reach (1, 1). Clearly,

$$LC[\widehat{\mu}(X); \alpha] = CC[\widehat{\mu}(X), \widehat{\mu}(X); \alpha]$$

so that every Lorenz curve is convex by virtue of Property 6.3.6.


Assuming the observations (Yi , X i ), i = 1, . . . , n, to be independent and identi-
cally distributed, the empirical version of the Lorenz curve is obtained as

  −1 (α) 
μ(X i )≤ F
i| μ(X i )
 
LC μ(X); α = n μ .
i=1  μ(X i )

 is the percentage of the total premium income corresponding to the


In words, LC
100α% smaller premiums when the latter are computed using a predictor  μ. If the
n n
global balance i=1 
μ(X i ) = i=1 Yi holds true (which is the case with GLMs as
 also expresses this
long as canonical link functions are used, for instance) then LC
proportion with respect to the aggregate loss at the entire portfolio level.
Remark 6.3.7 Considering LC[μ̂(X); α], the 45-degree line now refers to another limit case, called the line of equality. Assume that the predictor is constant, that is, μ̂(X) = E[Y]. This may be due to the fact that none of the features contained in X is related to the response Y (hence the link with the line of independence). Then, we have

$$\widehat{\mu}(X) = E[Y] \Rightarrow LC[\widehat{\mu}(X); \alpha] = \alpha$$

so that this particular case corresponds to the 45-degree line. Notice that we must consider this limit case with some care because the constant predictor is not a continuous random variable (which invalidates some of the formulas derived earlier).

6.3.5.3 Canonical Predictors

If an actuary has access to the true premium μ(X) based on the information contained in X, then there is no need to distinguish CC from LC. This is because if μ̂(X) = μ(X) then

$$LC[\widehat{\mu}(X); \alpha] = CC[\mu(X), \widehat{\mu}(X); \alpha]$$

for all probability levels α.
In other words, the two performance curves reduce to the Lorenz curve of μ(X). This means that the sub-portfolio corresponding to μ̂(X) ≤ F_μ̂⁻¹(α) is in equilibrium, as the premium matches the conditional expectation on average. A large difference between the two performance curves thus suggests that the predictor under consideration poorly approximates the true technical premium.
This explains why many empirical studies only use the Lorenz curve of μ̂(X) to evaluate the performance of a predictive model. There is a confusion between the predictor μ̂ and the conditional expectation, whereas it is very likely that μ̂(X) only approximates μ(X), being different from it in reality. Because of this difference, we can resort to the pair of curves CC[μ(X), μ̂(X); α] and LC[μ̂(X); α] to evaluate the performance of a pricing model.

Remark 6.3.8 Many empirical studies also use Gini coefficients based on the Lorenz curve of the predictor. The reason for using this indicator is the following.
The Gini mean difference is one possible measure of variability, defined for a non-negative continuous random variable Z as

$$\mathrm{Gini}[Z] = E|Z_1 - Z_2| = E\big[\max\{Z_1, Z_2\}\big] - E\big[\min\{Z_1, Z_2\}\big]$$

where Z_1 and Z_2 are independent and distributed as Z. It represents the average absolute difference between two observations distributed as Z. This is closely related to the variance, which can equivalently be expressed as

$$\mathrm{Var}[Z] = \frac{1}{2} E\big[(Z_1 - Z_2)^2\big].$$

If Z is continuous then it can be shown that

$$\mathrm{Gini}[Z] = 4\, \mathrm{Cov}\big[Z, F_Z(Z)\big].$$

Thus, the Gini mean difference measures the association between a random variable and its rank. In other words, considering a sequence of observations ranked in ascending order, the Gini mean difference quantifies the relationship between the actual value of Z and its position in the sequence. Clearly, the more variability in Z, the larger its actual value when its rank is high (i.e. it appears among the largest observations), whereas a lower Gini mean difference indicates that the observations are more concentrated around their central value.
This formula relates the Gini mean difference to the Lorenz curve. Starting from the lower absolute deviation of Z_1, defined as E[(t − Z_1) I[Z_1 ≤ t]], let us replace t with Z_2 to get

$$E\big[(Z_2 - Z_1)\, I[Z_1 \leq Z_2]\big] = \frac{1}{2} E|Z_1 - Z_2| = \frac{1}{2}\mathrm{Gini}[Z] = 2\, \mathrm{Cov}\big[Z, F_Z(Z)\big].$$

Now, the area between the identity line and the Lorenz curve for Z is $\frac{1}{E[Z]}\mathrm{Cov}[Z, F_Z(Z)]$. Therefore, the higher the Gini mean difference, the further the Lorenz curve of the predictor lies from the 45-degree line (i.e. from the Lorenz curve of the uninformative predictor constantly equal to E[Y]). This is why candidate premiums with a larger Gini mean difference tend to be preferred.
Notice that the Gini coefficient is the Gini mean difference divided by twice the mean. It is also known as the concentration ratio and represents the area between the 45-degree line and the actual Lorenz curve divided by the area between the 45-degree line and the Lorenz curve that yields the maximal value that this index can have.

6.3.5.4 Measuring Goodness-of-Lift

The performance of a predictor μ̂(X) can thus be assessed by means of the respective positions of the two curves

$$\alpha \mapsto LC[\widehat{\mu}(X); \alpha] \quad \text{and} \quad \alpha \mapsto CC[\mu(X), \widehat{\mu}(X); \alpha].$$

The first one represents the share of premiums collected from the 100α% of policies from the portfolio with the lowest μ̂(X) values. The second one gives the corresponding share of the true premium that should have been collected from this sub-portfolio. As the total expected incomes of μ̂ and μ match the total expected loss by (6.3.1), both ratios are directly comparable: the percentage obtained with μ̂(X) can be set against the true one corresponding to μ(X). As actuaries, we would like the graph of the concentration curve to be as close as possible to the graph of the Lorenz curve. In other words, the smaller the area between the two curves the better.
The Gini mean difference measures the area between the 45-degree line and the Lorenz curve. As explained earlier, an alternative is to consider both the Lorenz curve of the predictor and the concentration curve of the true premium with respect to the predictor. More precisely, defining a distance between the two curves CC[μ(X), μ̂(X); α] and LC[μ̂(X); α] would be relevant, knowing that they coincide when μ̂(X) = μ(X). This is why the area between the concentration curve and the Lorenz curve turns out to be another good candidate for assessing the performance of a given predictor. This area between the curves, ABC in short, is given by

$$\begin{aligned}
\mathrm{ABC}[\widehat{\mu}(X)] &= \int_0^1 \Big( CC[Y, \widehat{\mu}(X); \alpha] - LC[\widehat{\mu}(X); \alpha] \Big)\,\mathrm{d}\alpha \\
&= \frac{1}{E[\widehat{\mu}(X)]} \int_0^1 \Big( E\big[Y\, I[M \leq \alpha]\big] - E\big[\widehat{\mu}(X)\, I[M \leq \alpha]\big] \Big)\,\mathrm{d}\alpha \\
&= \frac{1}{E[\widehat{\mu}(X)]} \int_0^1 \int_0^{\infty} \Big( P[\widehat{\mu}(X) \leq y, M \leq \alpha] - P[Y \leq y, M \leq \alpha] \Big)\,\mathrm{d}y\,\mathrm{d}\alpha \\
&= \frac{1}{E[\widehat{\mu}(X)]} \Big( \mathrm{Cov}\big[\widehat{\mu}(X), M\big] - \mathrm{Cov}\big[Y, M\big] \Big) \qquad (6.3.10)
\end{aligned}$$

where we recognize the difference between the Gini mean difference of the predictor μ̂(X) (up to the factor 4) and the Gini covariance of the response Y and the predictor μ̂(X). Let us notice that (6.3.10) can be rewritten as

$$\mathrm{ABC}[\widehat{\mu}(X)] = \frac{1}{E[\widehat{\mu}(X)]}\, \mathrm{Cov}\big[\widehat{\mu}(X) - Y, M\big].$$

Hence, if we think of μ̂(X) − Y as the profit associated with a policy, then ABC may be interpreted as being proportional to the covariance between profits and the rank of premiums. This interpretation is similar to the one given in Frees et al. (2013) for the Gini index.
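Expression (6.3.10) makes the ABC metric easy to estimate on a validation set, as in the following R sketch (hypothetical vectors `y` and `mu_hat`; empirical ranks stand in for M):

```r
# Minimal R sketch (hypothetical inputs): ABC via the covariance form (6.3.10).
abc_hat <- function(y, mu_hat) {
  m <- rank(mu_hat, ties.method = "average") / length(mu_hat)  # ranks for M
  cov_pop <- function(a, b) mean(a * b) - mean(a) * mean(b)    # population-style covariance
  (cov_pop(mu_hat, m) - cov_pop(y, m)) / mean(mu_hat)
}
```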

Remark 6.3.9 In the case where μ(X) and μ̂(X) are comonotonic, we have CC[μ(X), μ̂(X); α] = LC[μ̂(X); α] for all α ∈ (0, 1) if and only if μ(X) = μ̂(X). Indeed, in such a case

$$\mu(X) = F_{\mu}^{-1}(M)$$

where F_μ⁻¹ is the quantile function associated to the distribution function F_μ of μ(X). Therefore, since E[μ(X)] = E[μ̂(X)], we get CC[μ(X), μ̂(X); α] = LC[μ̂(X); α] for all α ∈ (0, 1) if and only if

$$E\big[\mu(X)\, I[\mu(X) \leq F_{\mu}^{-1}(\alpha)]\big] = E\big[\widehat{\mu}(X)\, I[\widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)]\big]$$
$$\Leftrightarrow E\big[F_{\mu}^{-1}(M)\, I[F_{\mu}^{-1}(M) \leq F_{\mu}^{-1}(\alpha)]\big] = E\big[F_{\widehat{\mu}}^{-1}(M)\, I[F_{\widehat{\mu}}^{-1}(M) \leq F_{\widehat{\mu}}^{-1}(\alpha)]\big]$$
$$\Leftrightarrow \int_0^{\alpha} F_{\mu}^{-1}(u)\,\mathrm{d}u = \int_0^{\alpha} F_{\widehat{\mu}}^{-1}(u)\,\mathrm{d}u. \tag{6.3.11}$$

Thus, since (6.3.11) must be fulfilled for all α ∈ (0, 1), it follows that F_μ⁻¹(u) = F_μ̂⁻¹(u) for all u ∈ (0, 1) and so μ(X) = μ̂(X).

6.3.5.5 Assessing the Performances of Low-Risk Portfolios

Alternatively, we can also create a sub-portfolio gathering all the policies such that μ̂(X) ≤ F_μ̂⁻¹(α) and only consider this one, in isolation. The result of each such portfolio is described by the performance curve

$$\Big( E\big[\widehat{\mu}(X) \mid \widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)\big],\; E\big[\mu(X) \mid \widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)\big] \Big).$$

Notice that we are allowed to replace the true premium μ(X) with the actual loss Y. The first component corresponds to the conditional lower-tail expectation of the predictor, defined as


$$\mathrm{CLTE}[\widehat{\mu}(X); \alpha] = E\big[\widehat{\mu}(X) \mid \widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)\big].$$

The second component looks like the conditional lower-tail expectation of the true premium

$$\mathrm{CLTE}[\mu(X); \alpha] = E\big[\mu(X) \mid \mu(X) \leq F_{\mu}^{-1}(\alpha)\big]$$

except that the condition involves the predictor and not the true premium.
Clearly,

$$\alpha \mapsto \mathrm{CLTE}[\widehat{\mu}(X); \alpha]$$

is non-decreasing. But this is not necessarily the case for α ↦ E[Y | μ̂(X) ≤ F_μ̂⁻¹(α)] unless μ(X) is positively left-tail expectation dependent on μ̂(X). Positive expectation dependence allows us to constrain the shape of this curve. If μ(X) and μ̂(X) are positively expectation dependent then

$$\lim_{t \to \infty} E\big[\mu(X) \mid \widehat{\mu}(X) \leq t\big] = E[\mu(X)] \geq E\big[\mu(X) \mid \widehat{\mu}(X) \leq t\big].$$

The next result gives the condition ensuring the monotonicity of α ↦ E[μ(X) | μ̂(X) ≤ F_μ̂⁻¹(α)].
Property 6.3.10 If μ(X) is positively left-tail expectation dependent on μ̂(X) then

$$\alpha \mapsto E\big[\mu(X) \mid \widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)\big]$$

is non-decreasing. A sufficient condition is provided by left-tail decreasingness.

Proof We see from its very definition that if μ(X) is positively left-tail expectation dependent on μ̂(X) then t ↦ E[μ(X) | μ̂(X) ≤ t] is non-decreasing. Furthermore, as, for s ≤ t,

$$E\big[\mu(X) \mid \widehat{\mu}(X) \leq t\big] - E\big[\mu(X) \mid \widehat{\mu}(X) \leq s\big] = \int_0^{\infty} \Big( P\big[\mu(X) \leq u \mid \widehat{\mu}(X) \leq s\big] - P\big[\mu(X) \leq u \mid \widehat{\mu}(X) \leq t\big] \Big)\,\mathrm{d}u,$$

a sufficient condition for positive left-tail expectation dependence is left-tail decreasingness. This dependence notion is stronger than positive quadrant dependence but weaker than stochastic increasingness. □
The next result establishes a lower bound for the second component.

Property 6.3.11 For any predictor μ̂, we have

$$\mathrm{CLTE}[\mu(X); \alpha] \leq E\big[\mu(X) \mid \widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)\big] \quad \text{for all } \alpha.$$

Proof Let us first establish that E[μ(X)|A] ≥ E[μ(X)|μ(X) ≤ u] for any random event A such that P[μ(X) ≤ u] = P[A]. Denoting as Ā the complement of A, this comes from

$$\begin{aligned}
E[\mu(X) \mid \mu(X) \leq u] &= u + E[\mu(X) - u \mid \mu(X) \leq u, A]\,P[A \mid \mu(X) \leq u] \\
&\quad + E[\mu(X) - u \mid \mu(X) \leq u, \bar{A}]\,P[\bar{A} \mid \mu(X) \leq u] \\
&\leq u + E[\mu(X) - u \mid \mu(X) \leq u, A]\,P[A \mid \mu(X) \leq u] \\
&= u + E[\mu(X) - u \mid \mu(X) \leq u, A]\,P[\mu(X) \leq u \mid A] \\
&\leq u + E[\mu(X) - u \mid \mu(X) \leq u, A]\,P[\mu(X) \leq u \mid A] \\
&\quad + E[\mu(X) - u \mid \mu(X) > u, A]\,P[\mu(X) > u \mid A] \\
&= E[\mu(X) \mid A],
\end{aligned}$$

where the middle equality uses P[A | μ(X) ≤ u] = P[μ(X) ≤ u | A], a consequence of P[A] = P[μ(X) ≤ u]. The announced result then follows by taking {μ̂(X) ≤ F_μ̂⁻¹(α)} for A and u = F_μ⁻¹(α). □


As μ(X) ⪯_cx Y and μ̂(X) aims to predict μ(X), it is reasonable to assume that μ̂(X) ⪯_cx Y also holds true. Then, we know that the inequality

$$\mathrm{CLTE}[Y; \alpha] \leq \mathrm{CLTE}[\widehat{\mu}(X); \alpha] \quad \text{holds for all } \alpha,$$

i.e. that we have

$$E\big[\mu(X) \mid \mu(X) \leq F_{\mu}^{-1}(\alpha)\big] \leq E\big[\widehat{\mu}(X) \mid \widehat{\mu}(X) \leq F_{\widehat{\mu}}^{-1}(\alpha)\big] \quad \text{for all } \alpha$$

since

$$\mathrm{CLTE}[\mu(X); \alpha] = \mathrm{CLTE}[Y; \alpha].$$

Property 6.3.11 indicates that E[μ(X) | μ̂(X) ≤ F_μ̂⁻¹(α)] lies above CLTE[μ(X); α] and thus may cross CLTE[μ̂(X); α]. The corresponding sub-portfolio is therefore in disequilibrium, with true premiums above the actual premiums, on average.

Remark 6.3.12 If the information is so rich that μ(X) and μ̂(X) are comonotonic, i.e.

$$\mu(X) = F_{\mu}^{-1}(M),$$

then we have equality in Property 6.3.11.

6.3.6 Comparison of the Performances of Two Predictors

Consider two predictors μ̂_1 and μ̂_2. Both attempt to predict the unknown pure premium μ(X). These predictors may differ in their functional form (μ̂_1 instead of μ̂_2) and/or in the information (X_1 instead of X_2) on which they are based. There are two important aspects when evaluating the performances of two predictors. First, their respective variability, i.e. their ability to identify different risk profiles. Second, their correlation with μ(X), i.e. the amount of information they bring about the true premium.

6.3.6.1 Variability

The convex order appears to be the appropriate tool to measure the degree of lift induced by a predictor. This probabilistic tool indeed assesses the differentiation between the cheapest and costliest risk profiles identified by the model. In that respect, replacing the predictor with a more variable one, in the sense of the convex order, appears to be a promising strategy.

Property 6.3.13 If μ̂_2(X_2) ⪯_cx μ̂_1(X_1) holds then
(i) min[μ̂_1(X_1)] ≤ min[μ̂_2(X_2)] and max[μ̂_1(X_1)] ≥ max[μ̂_2(X_2)], so that μ̂_1(X_1) has a wider range than μ̂_2(X_2);
(ii) for any α < β,

$$E\big[\widehat{\mu}_1(X_1) \mid \widehat{\mu}_1(X_1) > F_{\widehat{\mu}_1}^{-1}(\beta)\big] - E\big[\widehat{\mu}_1(X_1) \mid \widehat{\mu}_1(X_1) \leq F_{\widehat{\mu}_1}^{-1}(\alpha)\big]$$
$$\geq E\big[\widehat{\mu}_2(X_2) \mid \widehat{\mu}_2(X_2) > F_{\widehat{\mu}_2}^{-1}(\beta)\big] - E\big[\widehat{\mu}_2(X_2) \mid \widehat{\mu}_2(X_2) \leq F_{\widehat{\mu}_2}^{-1}(\alpha)\big].$$

Proof The proof of (i) is by contradiction. Suppose, for example, that max[μ̂_1(X_1)] < max[μ̂_2(X_2)]. Let t be such that max[μ̂_1(X_1)] < t < max[μ̂_2(X_2)]. Then

$$E\big[(\widehat{\mu}_1(X_1) - t)_+\big] = 0 < E\big[(\widehat{\mu}_2(X_2) - t)_+\big],$$

in contradiction to μ̂_2(X_2) ⪯_cx μ̂_1(X_1). Therefore we must have max[μ̂_1(X_1)] ≥ max[μ̂_2(X_2)]. Similarly, it can be shown that min[μ̂_1(X_1)] ≤ min[μ̂_2(X_2)].
Considering (ii), we know that μ̂_2(X_2) ⪯_cx μ̂_1(X_1) holds if, and only if,

$$E\big[\widehat{\mu}_1(X_1) \mid \widehat{\mu}_1(X_1) > F_{\widehat{\mu}_1}^{-1}(\beta)\big] \geq E\big[\widehat{\mu}_2(X_2) \mid \widehat{\mu}_2(X_2) > F_{\widehat{\mu}_2}^{-1}(\beta)\big] \quad \text{for all probability levels } \beta$$

and

$$E\big[\widehat{\mu}_1(X_1) \mid \widehat{\mu}_1(X_1) \leq F_{\widehat{\mu}_1}^{-1}(\alpha)\big] \leq E\big[\widehat{\mu}_2(X_2) \mid \widehat{\mu}_2(X_2) \leq F_{\widehat{\mu}_2}^{-1}(\alpha)\big] \quad \text{for all probability levels } \alpha.$$

The announced result then follows by combining these two inequalities. □

6.3.6.2 More Positive Expectation Dependence

The respective distribution functions of the two predictors μ̂_1 and μ̂_2 are denoted as F_μ̂1 and F_μ̂2. Both F_μ̂k are assumed to be continuous and strictly increasing, k = 1, 2. Define the scores M_1 = F_μ̂1(μ̂_1) and M_2 = F_μ̂2(μ̂_2), which are both uniformly distributed over the unit interval [0, 1].
The more M_k is correlated with Y, the more information the corresponding predictor μ̂_k contains. More informative predictors thus lead to greater variability of

the conditional expectation E[Y | M]. This is formally stated in the next result, established by Muliere and Petrone (1992) in their study of dependence orderings based on generalized Lorenz curves.

Property 6.3.14 Assume that the functions α ↦ E[Y | M_k = α] are continuous and strictly increasing for k ∈ {1, 2}. Then

$$E[Y \mid M_2] \preceq_{cx} E[Y \mid M_1] \Leftrightarrow E[Y \mid M_1 \geq \alpha] \geq E[Y \mid M_2 \geq \alpha] \text{ for all } \alpha.$$

Here, E[Y | M_k] measures how well the rank M_k induced by the predictor μ̂_k explains the response Y. If all the ranks are equal, i.e. μ̂_k(X) = E[Y], then E[Y | M_k] = E[Y] and the predictor does not bring any information about the response.
The next result shows that under the assumptions of Property 6.3.14 the mean square error of prediction (MSEP) is smaller with M_1 than with M_2.

Property 6.3.15 Under the conditions of Property 6.3.14, we have

$$E\Big[\big(Y - E[Y \mid M_1]\big)^2\Big] \leq E\Big[\big(Y - E[Y \mid M_2]\big)^2\Big],$$

that is, Y is closer to E[Y | M_1] in the L²-norm.

Proof The announced result is a direct consequence of the convex order inequality E[Y | M_2] ⪯_cx E[Y | M_1] since

$$\mathrm{Var}[Y] = E\big[\mathrm{Var}[Y \mid M_i]\big] + \mathrm{Var}\big[E[Y \mid M_i]\big] \quad \text{and} \quad E\big[\mathrm{Var}[Y \mid M_i]\big] = E\Big[\big(Y - E[Y \mid M_i]\big)^2\Big]$$

hold for i = 1, 2, so that

$$E[Y \mid M_2] \preceq_{cx} E[Y \mid M_1] \Rightarrow \mathrm{Var}\big[E[Y \mid M_2]\big] \leq \mathrm{Var}\big[E[Y \mid M_1]\big].$$

This ends the proof. □

6.3.6.3 Discriminatory Power

The performance and selection curves are useful to evaluate the value of a given predictor μ̂(X). These curves are also helpful to compare the performances of different scores. Following Gourieroux and Jasiak (2007, Definition 4.5), we adopt the following comparison rule.

Definition 6.3.16 The predictor μ̂_1(X_1) is more discriminatory than the predictor μ̂_2(X_2) for a response Y if, and only if, the inequalities

$$E\big[\widehat{\mu}_1(X_1) \mid \widehat{\mu}_1(X_1) \leq F_{\widehat{\mu}_1}^{-1}(\alpha)\big] \leq E\big[\widehat{\mu}_2(X_2) \mid \widehat{\mu}_2(X_2) \leq F_{\widehat{\mu}_2}^{-1}(\alpha)\big]$$

and

$$E\big[Y \mid \widehat{\mu}_1(X_1) \leq F_{\widehat{\mu}_1}^{-1}(\alpha)\big] \leq E\big[Y \mid \widehat{\mu}_2(X_2) \leq F_{\widehat{\mu}_2}^{-1}(\alpha)\big]$$

both hold for all probability levels α.

The first condition is fulfilled if, and only if, μ̂_2(X_2) ⪯_cx μ̂_1(X_1). The second condition can be rewritten as

$$E\big[Y \mid \widehat{\mu}_1(X_1) \leq F_{\widehat{\mu}_1}^{-1}(\alpha)\big] \leq E\big[Y \mid \widehat{\mu}_2(X_2) \leq F_{\widehat{\mu}_2}^{-1}(\alpha)\big] \Leftrightarrow E[Y \mid M_1 \leq \alpha] \leq E[Y \mid M_2 \leq \alpha].$$

This amounts to requiring that Y is more positively expectation dependent on M_1 than on M_2, in the sense that the reduction in the expectation resulting from the knowledge that M_k ≤ α is larger for M_1 than for M_2. This can be seen from the inequality

$$E[Y] - E[Y \mid M_1 \leq \alpha] \geq E[Y] - E[Y \mid M_2 \leq \alpha].$$

This is equivalent to the corresponding condition appearing in Property 6.3.14, as the identity

$$E[Y] = \alpha\, E[Y \mid M_k \leq \alpha] + (1 - \alpha)\, E[Y \mid M_k > \alpha]$$

holds for k ∈ {1, 2}.


The second condition in Definition 6.3.16 can be expressed in terms of the concordance order. Two random variables are said to be concordant if they tend to be large together or small together. The concordance order expresses the idea that large and small values tend to be more often associated under the distribution that dominates the other one.

Definition 6.3.17 Let us consider two random couples (Z_1, Z_2) and (V_1, V_2) with the same marginal distributions, i.e. P[Z_k ≤ t] = P[V_k ≤ t] = F_k(t) for k ∈ {1, 2}. If

$$P[Z_1 \leq t_1, Z_2 \leq t_2] \leq P[V_1 \leq t_1, V_2 \leq t_2] \quad \text{for all } t_1 \text{ and } t_2, \tag{6.3.12}$$

or, equivalently, if

$$P[Z_1 > t_1, Z_2 > t_2] \leq P[V_1 > t_1, V_2 > t_2] \quad \text{for all } t_1 \text{ and } t_2, \tag{6.3.13}$$

then (Z_1, Z_2) is said to be less concordant than (V_1, V_2). This is henceforth denoted as (Z_1, Z_2) ⪯_conc (V_1, V_2).

The intuitive meaning of a ranking with respect to ⪯_conc is clear from Definition 6.3.17. Indeed, P[Z_1 ≤ t_1, Z_2 ≤ t_2] and P[V_1 ≤ t_1, V_2 ≤ t_2] read as "Z_1 and Z_2 are both small" and "V_1 and V_2 are both small", respectively (small meaning that Z_1, resp. V_1, is smaller than the threshold t_1 and Z_2, resp. V_2, is smaller than the threshold t_2). So, (6.3.12) means that when (Z_1, Z_2) ⪯_conc (V_1, V_2) holds, the probability that V_1 and V_2

are both small is larger than the corresponding probability for Z_1 and Z_2. Similarly, from (6.3.13), (Z_1, Z_2) ⪯_conc (V_1, V_2) also ensures that the probability that Z_1 and Z_2 are both large is smaller than the corresponding probability for V_1 and V_2. This corresponds to the intuitive content of "(V_1, V_2) being more positively dependent than (Z_1, Z_2)".
We have the following result in terms of covariances.

Proposition 6.3.18 For random couples (Z_1, Z_2) and (V_1, V_2) with the same marginal distributions, we have

$$(Z_1, Z_2) \preceq_{conc} (V_1, V_2) \Leftrightarrow \mathrm{Cov}[g_1(Z_1), g_2(Z_2)] \leq \mathrm{Cov}[g_1(V_1), g_2(V_2)]$$

for all non-decreasing functions g_1 and g_2, provided the expectations exist.

Proposition 6.3.18 shows that when ⪯_conc holds, the correlations between g_1(Z_1) and g_2(Z_2) are smaller than those between g_1(V_1) and g_2(V_2) for all increasing functions g_1 and g_2. Furthermore, Pearson's correlation coefficient as well as Kendall's and Spearman's rank correlation coefficients all agree with a ranking in the ⪯_conc-sense. This reinforces the intuitive meaning of ⪯_conc as a tool to compare the strength of the dependence.
The next result gives a sufficient condition in terms of the concordance order.

Property 6.3.19 If

$$\widehat{\mu}_2(X_2) \preceq_{cx} \widehat{\mu}_1(X_1) \quad \text{and} \quad (Y, M_2) \preceq_{conc} (Y, M_1)$$

then μ̂_1(X_1) is more discriminatory than the predictor μ̂_2(X_2) for the response Y.

Proof The result follows from

$$E[Y \mid M_2 \leq \alpha] - E[Y \mid M_1 \leq \alpha] = \frac{1}{\alpha} \int_0^{\infty} \Big( P[Y \leq y, M_1 \leq \alpha] - P[Y \leq y, M_2 \leq \alpha] \Big)\,\mathrm{d}y,$$

which is indeed positive if (Y, M_1) is more concordant than (Y, M_2). □

Thus, we see that μ̂_1(X_1) is more discriminatory than μ̂_2(X_2) for the response Y if μ̂_1(X_1) is simultaneously more variable (in the sense of the convex order) and more correlated (in the sense of positive expectation dependence or the stronger concordance order) with the response Y than μ̂_2(X_2).
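The conditional means appearing in Definition 6.3.16 are easy to compare empirically. A minimal R sketch, assuming hypothetical validation-set vectors `y`, `mu1` and `mu2`:

```r
# Minimal R sketch (hypothetical inputs): compare E[Y | M_k <= alpha] for two
# predictors, in the spirit of Definition 6.3.16.
cond_mean_low <- function(y, mu, alpha) {
  mean(y[mu <= quantile(mu, probs = alpha, type = 1)])
}
alphas <- seq(0.05, 0.95, by = 0.05)
c1 <- sapply(alphas, function(a) cond_mean_low(y, mu1, a))
c2 <- sapply(alphas, function(a) cond_mean_low(y, mu2, a))
all(c1 <= c2)  # consistent with mu1 being more discriminatory than mu2
```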

6.3.6.4 Integrated Concentration and Lorenz Curves

The preference relation proposed in Definition 6.3.16 only forms a partial ranking.
Two predictors might well be incomparable because their respective concentration
or Lorenz curves intersect: one predictor is better for low risks, and worse for high
risks, for example. In such a case, we can base the comparison on the integral of

the concentration curves. This amounts to considering the integrated concentration curve defined as

$$\begin{aligned}
\mathrm{ICC}[\mu(X), \widehat{\mu}(X); \alpha] &= \int_0^{\alpha} CC[\mu(X), \widehat{\mu}(X); \xi]\,\mathrm{d}\xi \\
&= \int_0^{\alpha} \frac{E\big[\mu(X)\, I[M \leq \xi]\big]}{E[Y]}\,\mathrm{d}\xi \\
&= \frac{E\big[\mu(X)\,(\alpha - M)_+\big]}{E[Y]} \\
&= \frac{\mathrm{Cov}\big[\mu(X), (\alpha - M)_+\big]}{E[Y]} + E\big[(\alpha - M)_+\big].
\end{aligned}$$

The first term is driven by the correlation between the response and the predictor whereas the second one is just a constant, as M is unit uniformly distributed:

$$E\big[(\alpha - M)_+\big] = \int_0^{\alpha} (\alpha - \xi)\,\mathrm{d}\xi = \frac{\alpha^2}{2}.$$

The integral of the concentration curve over the whole interval [0, 1] is denoted as ICC, i.e.

$$\mathrm{ICC} = \mathrm{ICC}[\mu(X), \widehat{\mu}(X); 1] = \frac{\mathrm{Cov}\big[\mu(X), 1 - M\big]}{E[Y]} + \frac{1}{2} = \frac{1}{2} - \frac{\mathrm{Cov}\big[\mu(X), M\big]}{E[Y]}.$$

Again, as

$$E\big[Y(\alpha - M)_+\big] = E\Big[E\big[Y(\alpha - M)_+ \mid X\big]\Big] = E\big[\mu(X)(\alpha - M)_+\big],$$

we are allowed to replace μ(X) with Y in the definition of the integrated concentration curve. This means that we can use it to measure the performance of μ̂(X) in predicting the unknown pure premium μ(X).
Let us now provide an intuitive interpretation for ICC. We still consider the 100α% of policies with the smallest μ̂ values, as (α − M)_+ = 0 for α ≤ M. Now, ICC is based on the covariance between μ(X) and (α − M)_+. The idea is that the smaller M with respect to α (i.e., the larger (α − M)_+), the smaller the true premium should be. Hence, a positive relationship between μ̂(X) and μ(X) translates into a negative covariance between μ(X) and (α − M)_+. And the more negative the covariance term entering the decomposition of ICC, the better the corresponding candidate premium.

Proceeding in a similar way with the Lorenz curve, we define the integrated Lorenz curve as

$$\begin{aligned}
\mathrm{ILC}[\widehat{\mu}(X); \alpha] &= \int_0^{\alpha} \frac{E\big[\widehat{\mu}(X)\, I[M \leq \xi]\big]}{E[\widehat{\mu}(X)]}\,\mathrm{d}\xi \\
&= \frac{E\big[\widehat{\mu}(X)(\alpha - M)_+\big]}{E[\widehat{\mu}(X)]} \\
&= \frac{\mathrm{Cov}\big[\widehat{\mu}(X), (\alpha - M)_+\big]}{E[\widehat{\mu}(X)]} + E\big[(\alpha - M)_+\big].
\end{aligned}$$

The integral of the Lorenz curve over the whole interval [0, 1] is denoted as ILC. The smaller the ILC metric, the better the corresponding candidate premium. Similarly, the smaller the ICC metric, the better. Now, if we consider both metrics simultaneously, then one should prefer a predictor with smaller ICC and ILC metrics, or equivalently with smaller ICC and ABC values since

$$\mathrm{ABC}[\widehat{\mu}(X)] = \mathrm{ICC}[\mu(X), \widehat{\mu}(X); 1] - \mathrm{ILC}[\widehat{\mu}(X); 1].$$

It is worth noticing that one predictor can be better for ICC and worse for ABC, for instance.
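The covariance representations above make these integrated metrics cheap to estimate. A minimal R sketch, with hypothetical vectors `y` and `mu_hat` and empirical ranks in place of M:

```r
# Minimal R sketch (hypothetical inputs): ICC, ILC and ABC via the covariance
# representations above.
m <- rank(mu_hat, ties.method = "average") / length(mu_hat)
cov_pop <- function(a, b) mean(a * b) - mean(a) * mean(b)
icc <- 0.5 - cov_pop(y, m) / mean(mu_hat)       # Y replaces mu(X); balance assumed
ilc <- 0.5 - cov_pop(mu_hat, m) / mean(mu_hat)
abc <- icc - ilc                                 # consistent with (6.3.10)
```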

6.3.7 Ordered Lorenz Curve

Let μ̂_1(X) and μ̂_2(X) be two predictors for a response Y. In ratemaking, these predictors target the true technical premium μ(X). We can imagine that μ̂_1 is the current predictor and that we consider replacing it with μ̂_2 provided the latter's performances are better. In order to compare these two predictors, let us define the relativity as the ratio of the new to the old predictor, that is,

$$R = \frac{\widehat{\mu}_2(X)}{\widehat{\mu}_1(X)}.$$

If R is less than 1, this means that the risk profile X is overpriced with the current predictor. This profile is thus at risk of adverse selection: a competitor using the predictor μ̂_2 could offer a better rate to such policyholders, who could then leave the portfolio.
As before, both predictors are supposed to be balanced, i.e.

$$E[\widehat{\mu}_1(X)] = E[\widehat{\mu}_2(X)] = E[Y],$$

and to ease the explanation, we assume that μ̂_1(X), μ̂_2(X) and μ(X) are all continuous.

Following Frees et al. (2013), we define the ordered Lorenz curve as the set of points (CC[μ̂_1(X), R(X); α], CC[Y, R(X); α])

$$= \left( \frac{E\big[\widehat{\mu}_1(X)\, I[R(X) \leq F_R^{-1}(\alpha)]\big]}{E[\widehat{\mu}_1(X)]},\; \frac{E\big[Y\, I[R(X) \leq F_R^{-1}(\alpha)]\big]}{E[Y]} \right),$$

where F_R denotes the distribution function of the relativity R, and F_R⁻¹ the associated quantile function. Notice that

$$\begin{aligned}
CC[Y, R(X); \alpha] &= \frac{E\big[Y\, I[R(X) \leq F_R^{-1}(\alpha)]\big]}{E[Y]} \\
&= \frac{E\Big[E\big[Y\, I[R(X) \leq F_R^{-1}(\alpha)] \mid X\big]\Big]}{E[Y]} \\
&= \frac{E\big[\mu(X)\, I[R(X) \leq F_R^{-1}(\alpha)]\big]}{E[Y]} = CC\big[\mu(X), R(X); \alpha\big]
\end{aligned}$$

so that we are allowed to replace the true premium μ(X) with the actual loss Y.
Both functions

$$s \mapsto \frac{E\big[\widehat{\mu}_1(X)\, I[R(X) \leq s]\big]}{E[\widehat{\mu}_1(X)]} \quad \text{and} \quad s \mapsto \frac{E\big[Y\, I[R(X) \leq s]\big]}{E[Y]}$$

can be interpreted as distribution functions. They give the proportion of total current premiums μ̂_1(X) and the proportion of total losses Y (or true premiums μ(X)) in the sub-portfolio determined by the condition R(X) ≤ s. The approach is thus based on adverse selection against the insurer. Assume that a competitor attracts all profiles X that are overpriced under the current price list μ̂_1, i.e. those such that R(X) ≤ s for some s small enough. More precisely, in a sub-portfolio gathering all risk profiles such that R(X) ≤ F_R⁻¹(α), i.e. the 100α% of policies with the smallest relativities, we record the proportion

$$t_1 = \frac{E\big[Y\, I[R(X) \leq F_R^{-1}(\alpha)]\big]}{E[Y]}$$

of total losses, for a proportion

$$t_2 = \frac{E\big[\widehat{\mu}_1(X)\, I[R(X) \leq F_R^{-1}(\alpha)]\big]}{E[\widehat{\mu}_1(X)]}$$

of premium income. Considering the point (t_2, t_1) of the ordered Lorenz curve, corresponding to the particular α, its meaning is as follows. By forming a portfolio with

all policyholders whose relativities R(X) are less than F_R⁻¹(α), i.e. all policies for which the new premium μ̂_2(X) is smaller than F_R⁻¹(α) times the old one μ̂_1(X), the corresponding premium income is t_2 and the corresponding losses are t_1, on average. If t_1 < t_2 then this is a profitable portfolio, one well worth retaining.
These graphical procedures can be supplemented with single numbers. Two coef-
ficients have been proposed in the literature to measure the goodness-of-lift: the
Value-of-Lift by Meyers and Cummings (2009) and the Gini index advocated by
Frees et al. (2011).
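The empirical ordered Lorenz curve is obtained by sorting policies on their relativities and accumulating premiums and losses. A minimal R sketch, assuming hypothetical vectors `y` (losses), `mu1` (current premiums) and `mu2` (challenger premiums):

```r
# Minimal R sketch (hypothetical inputs): empirical ordered Lorenz curve.
r <- mu2 / mu1                              # relativities
o <- order(r)                               # sort policies by relativity
prem_share <- cumsum(mu1[o]) / sum(mu1)     # x-axis: premium distribution
loss_share <- cumsum(y[o]) / sum(y)         # y-axis: loss distribution
plot(prem_share, loss_share, type = "l",
     xlab = "share of current premiums", ylab = "share of losses")
abline(0, 1, lty = 3)                       # 45-degree line
# Points below the 45-degree line flag sub-portfolios (low relativities)
# where losses lag premiums, i.e. profitable business worth retaining.
```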

6.3.8 Numerical Illustration

6.3.8.1 Assumptions

In this section, μ̂(X) is assumed to follow a Gamma distribution with unit mean μ = 1 and variance σ², henceforth denoted as Gam(μ, σ²). Such predictors are known to be ordered in the ⪯cx-sense with σ.
Also, in this section, we consider two distributions for the true premium μ(X),
namely
• a Gamma distribution with unit mean and variance σY² (such true premiums are known to be ordered in the ⪯cx-sense with σY²);
• a LogNormal distribution with unit mean, i.e. ln μ(X) is Normally distributed with mean −σY²/2 and variance σY², which is henceforth denoted as LNor(−σY²/2, σY).
Notice that condition (6.3.1) is fulfilled in both cases since we have E[Y] = E[μ(X)] = E[μ̂(X)] = 1. Also, it is worth noticing that the response may be discrete (such as the number of claims, for instance); the continuity assumption only concerns μ(X) and μ̂(X).
In addition to the parameters σ and σY governing the variability of the predictor
and the true premium, respectively, we consider different dependence structures
for the random vector (μ(X), μ̂(X)). Specifically, we consider Frank and Clayton copulas, two copulas that are monotonically ⪯conc-increasing with their parameter. Recall from Denuit et al. (2005) that the Clayton copula is given by

Cθ(u, v) = (u^{-θ} + v^{-θ} − 1)^{-1/θ},  θ > 0,

whereas the Frank copula is given by

Cθ(u, v) = −(1/θ) ln( 1 + (exp(−θu) − 1)(exp(−θv) − 1) / (exp(−θ) − 1) ),  θ ≠ 0.

For positive values of θ in Frank's case, these two copulas express positive dependence. The parameter θ can be interpreted as a measure of the strength of the dependence between μ(X) and μ̂(X). In order to make the dependence parameter more palatable,

we rather use the corresponding Kendall's tau. For the Clayton copula, Kendall's tau is simply given by θ/(θ + 2). For the Frank copula, Kendall's tau, which also increases with θ, can only be expressed in terms of a Debye function of the first kind.

6.3.8.2 Variability

The dependence structure between μ(X) and μ̂(X) is assumed to be fixed and modeled by means of the Clayton copula with Kendall's tau equal to 0.5. The predictor μ̂(X) is supposed to be Gamma distributed with mean and variance both equal to 1. In addition, the true premium μ(X) is supposed to be Gamma distributed with unit mean and variance σY². We aim to assess the impact of σY² on ABC values. To this end, we consider three values for σY², namely 0.5, 1 and 2. The results are summarized in the following table and illustrated in Fig. 6.5.

Line type      μ̂(X)         μ(X)          Copula C             ABC
medium dash    Gam(1, 1)    Gam(1, 2)     Clayton(τ = 0.5)     6.33%
short dash     Gam(1, 1)    Gam(1, 1)     Clayton(τ = 0.5)     9.66%
dotted         Gam(1, 1)    Gam(1, 0.5)   Clayton(τ = 0.5)    13.08%

We observe that the concentration curves are non-crossing as a result of the convex order among the different distributions of μ(X). Furthermore, the smaller the variance of μ(X), the further away the concentration curve lies from the Lorenz curve, so that the ABC value decreases as the variance of μ(X) increases. Notice that in this example, there is no need to complement ABC values with the ICC metrics, the Lorenz curve being the same in the three cases considered.
This example highlights the fact that when we have identically distributed predictors that perform similarly in terms of dependence with the true premium, the ABC metric will favor the case where the true premium is the most variable (in the convex order sense). Similarly, for a given true premium and predictors performing the same way in terms of dependence with the true premium, the ABC metric will favor the predictor that is the least variable in terms of the convex order.
The situation where μ(X) ⪯cx μ̂(X) may be due to overfitting. This can happen when the predictor μ̂(X) integrates random noise. Indeed, assume that only the first q features X1, . . . , Xq, q < p, matter and that Xq+1, . . . , Xp are independent, zero-mean random variables, independent of X1, . . . , Xq. Then, the true score β0 + Σ_{j=1}^{q} βjXj is dominated by β0 + Σ_{j=1}^{p} βjXj in the convex sense. On the contrary, the situation where μ̂(X) ⪯cx μ(X) may be due to underfitting, which can be the case, for instance, when

μ(X) = E[Y | X1, . . . , Xq, Xq+1, . . . , Xp]

and

μ̂(X) = E[Y | X1, . . . , Xq].


Fig. 6.5 Lorenz curve and several concentration curves for different variances of μ(X)

Indeed, we have seen that increasing the number of features produces more dispersed premiums.

6.3.8.3 Dependence

Let us consider fixed distributions for μ(X) and μ̂(X), namely μ(X) ∼ Gam(1, 1) and μ̂(X) ∼ Gam(1, 1). We aim to assess the impact of the strength of the dependence between μ(X) and μ̂(X) on the ABC value. For that purpose, we consider the Clayton copula with three values for Kendall's tau. The results are summarized in the following table and illustrated in Fig. 6.6:

Line type      μ̂(X)         μ(X)         C                     ABC
medium dash    Gam(1, 1)    Gam(1, 1)    Clayton(τ = 0.75)     3.46%
short dash     Gam(1, 1)    Gam(1, 1)    Clayton(τ = 0.50)     9.66%
dotted         Gam(1, 1)    Gam(1, 1)    Clayton(τ = 0.25)    17.04%

The weaker the dependence, the further away the concentration curve lies from the Lorenz curve. This can be explained as follows. With the Clayton copula, increasing Kendall's tau results in a random pair (μ(X), μ̂(X)) that is larger in the sense of ⪯conc. Therefore, from Property 6.3.19, we know that the concentration curve gets lower, and thus closer to the Lorenz curve.
We observe that the ABC value decreases with Kendall's tau, which is not surprising since increasing Kendall's tau means that μ̂(X) becomes more informative about the true premium μ(X).

Fig. 6.6 Lorenz curve and several concentration curves for different values of Kendall’s tau

Notice that there is no need here to consider ICC values. Indeed, the Lorenz curve not being impacted by the dependence between μ(X) and μ̂(X), the ABC and ICC metrics behave the same way.

6.3.8.4 Distribution

Again, we suppose that the predictor μ̂(X) is Gamma distributed with mean and variance both equal to 1, and the dependence structure between μ(X) and μ̂(X) is assumed to be fixed and modeled by means of the Clayton copula with Kendall's tau equal to 0.5. The true premium μ(X) is assumed to be Gamma distributed, and this time, we also consider the LogNormal distribution for μ(X). Specifically, the
following table summarizes the three cases considered here:

Line type      μ̂(X)         μ(X)                                    C                    ABC
medium dash    Gam(1, 1)    LNor(−1.25² ln 2 / 2, 1.25 √(ln 2))     Clayton(τ = 0.5)     9.20%
short dash     Gam(1, 1)    LNor(−ln 2 / 2, √(ln 2))                Clayton(τ = 0.5)    11.58%
dotted         Gam(1, 1)    Gam(1, 1)                               Clayton(τ = 0.5)     9.66%

In Fig. 6.7, we can see the corresponding concentration and Lorenz curves. In
both cases where the variance of μ(X) is equal to 1, one sees that the LogNormal
concentration curve (short dash) lies further away from the Lorenz curve than the
Gamma one (dotted). One observes that the ABC value favors the case where the

Fig. 6.7 Lorenz curve and several concentration curves for different distributions of μ(X)

distributions of μ̂(X) and μ(X) are similar, the dependence structure being the same in both cases.
In the LogNormal case (short dash), increasing σY by a factor of 1.25 (medium dash) yields concentration curves crossing around the point 0.34. While in the previous examples the concentration curves were always ordered, we see that the use of different distributions can lead to crossing concentration curves.
Notice that in the latter case (medium dash), the ABC value is the smallest one,
which is not surprising in light of Sect. 6.3.8.2 since it corresponds to the case where
the variance of μ(X) is the largest one.
Again, there is no need to complement ABC values with the ICC metrics since
the Lorenz curve remains the same across the three cases considered here.

6.3.8.5 Crossing Copulas

Similarly to the previous example, the use of different copulas can also lead to
crossing concentration curves. Let us consider the Clayton copula C1 and the Frank
copula C2 as in Example 2.3 of Denuit and Mesfioui (2013). In such a case, one can
show that there exists a function f such that C1 (u, v) − C2 (u, v) ≤ 0 if v ≤ f (u)
and C1 (u, v) − C2 (u, v) ≥ 0 if v ≥ f (u), so that these two copulas are not ordered
according to the concordance order. We consider the two following cases:

Fig. 6.8 Lorenz curve and two concentration curves for different copulas

Line type     μ̂(X)         μ(X)         C                    ABC
short dash    Gam(1, 1)    Gam(1, 1)    Frank(τ = 0.5)       7.79%
dotted        Gam(1, 1)    Gam(1, 1)    Clayton(τ = 0.5)     9.66%

The corresponding concentration and Lorenz curves are depicted in Fig. 6.8. We observe that the concentration curves cross around the point 0.35, which is an intuitive result. Indeed, the Clayton copula exhibits stronger dependence in the lower quadrant than the Frank copula. Also, since the overall dependence is equal in both cases, the opposite holds in the upper quadrant. The stronger the dependence, the closer the concentration curve is to the Lorenz curve. This is why the concentration curve obtained under the Clayton copula lies closer to the Lorenz curve for small values and further away for large values.

6.3.8.6 Non-regression Dependent Copula Impact

Finally, we can consider a copula that does not exhibit positive quadrant dependence but only positive expectation dependence. To this end, we can proceed as
in Egozcue et al. (2011) by mixing two copulas expressing quadrant dependence
of opposite signs. For instance, considering the Frechet–Hoeffding upper and lower
bound copulas, we can use

C(u, v) = (1 − θ) min{u, v} + θ max{0, u + v − 1} (6.3.14)



Fig. 6.9 Lorenz and concentration curves for a non-regression dependent copula

as in Example 2.1 of Denuit and Mesfioui (2017). We know from Egozcue et al. (2011) that this mixture expresses positive expectation dependence if, and only if, θ ≤ 1/2. Alternatively, the Frechet–Hoeffding lower bound copula may be replaced with another copula expressing negative quadrant dependence (such as the Farlie–Gumbel–Morgenstern, or FGM, copula with negative dependence parameter, for instance).
Considering the following setup

Line type    μ̂(X)         μ(X)         C                        ABC
dotted       Gam(1, 1)    Gam(1, 1)    (6.3.14) with θ = 0.8    10%

we see in Fig. 6.9 that the above mixture copula indeed leads to a non-convex concentration curve.

6.3.9 Case Study

We end this chapter by considering a French motor third-party liability insurance portfolio available in the CASdatasets package in R. Specifically, we investigate
the dataset freMTPL2freq which contains 678 013 observations of the number of
claims (response Y ) together with nine features (X = (X 1 , . . . , X 9 )). The features
correspond to several characteristics of the policyholder (age, density of inhabitants
in the home city, region, area and bonus-malus) and the car (power, age, brand and
fuel type). We refer the reader to Noll et al. (2018) for a broad description of the
dataset.

Fig. 6.10 In- and out-of-sample errors for models under consideration

In this section, we aim to compare some of the models investigated in Noll et al. (2018) by using ABC and ICC metrics discussed in this chapter. More specifically, we consider the following models of Noll et al. (2018) for the predictors μ̂k(X):
• glm1—Poisson GLM with a log-link function and all explanatory variables;
• glm3—same as glm1 but without area and region variables;
• pbm1—boosted SBS (Standardized Binary Splits) tree (depth = 1, iterations = 30);
• pbm3—boosted SBS tree (depth = 3, iterations = 50);
• pbm3.s2—boosted SBS tree (depth = 3, iterations = 50, shrinkage = 0.5);
• glm1.pbm3—boosted SBS tree starting from glm1 fit (depth = 3, iterations = 50);
• nn—shallow neural network (one hidden layer with 20 neurons).
Models’ implementation details can be found in Noll et al. (2018). We refer to Denuit
et al. (2019a) for details on neural networks.
The dataset is partitioned into a training set of 610 000 observations and a valida-
tion set comprising the remaining observations.
Figure 6.10 shows the training sample estimate of the generalization error (in-
sample error) and the validation sample estimate of the generalization error (out-of-
sample error) for the models under study together with bootstrapped 95% confidence
intervals. The bounds are derived for in- and out-of-sample errors individually, so
only vertical and horizontal distances are meaningful. In particular, the oval shape
is due to spline smoothing through the points (in-sample error, out-of-sample error):
{(lower, observed), (observed,higher), (higher,observed), (observed,lower)}. Over-
all, in-sample error and out-of-sample error classify the models in a similar way,
except the boosted tree model (pbm3) and its shrunken version. For the latter mod-
Fig. 6.11 Empirical concentration curves for models glm1 and nn

els, introducing a shrinkage factor increases the in-sample error while it reduces
the out-of-sample error. This is not surprising as the introduction of a shrinkage
factor aims to avoid overfitting issues. We also note that the boosted GLM model
(glm1.pbm3) improves substantially over the original GLM model (glm1). However,
it does not outperform the boosted SBS tree (pbm3). The latter observation indicates
that the fixed structural form imposed on the expected claim frequency by the GLM
model does not provide any additional explanatory insights compared to the boosted
SBS tree. Finally, the optimal model with respect to the out-of-sample error metric
is the boosted tree model with a shrinkage factor (pbm3.s2).
Looking at the bootstrapped confidence intervals, all the models except the ones
based on (pbm3) are nicely separated. It also seems that boosted methods yield more
variable results than GLMs or the neural network model (for out-of-sample error).
Let us now turn to the goodness-of-lift metrics discussed in this chapter. In the following, we use the empirical versions of the concentration curve CC and the Lorenz curve LC computed on the validation set in order to get ABC and ICC values.
In case the number of observations is insufficient, a smoothed version of the empirical concentration curve could be used instead. Here, the size of the validation set is judged as sufficient to simply rely on the empirical concentration curve, depicted in Fig. 6.11 for two of the considered models. The remaining models are close to these two models, forming two groups of curves for α larger than 0.15. The higher group of curves is related to models glm1, glm3 and pbm1, which are also the three worst models according to the out-of-sample errors.

Fig. 6.12 Estimated ABC and ICC values for models under consideration

ABC and ICC values are displayed in Fig. 6.12, again with the same visualization of bootstrapped confidence intervals. We notice that the ICC metric classifies the models in the same way as the out-of-sample error metric, except for models pbm3 and pbm3.s2. While both metrics agree that these last two models are the best ones, pbm3.s2 outperforms pbm3 according to the out-of-sample error, while with ICC it is the other way around.
Regarding the ABC values, we observe that a model with a low ICC can have either a low or a large ABC. For instance, glm1.pbm3, which is one of the best models according to ICC, has the highest ABC, while pbm3.s2 has both a low ICC and a low ABC. If we compare glm1.pbm3 and pbm3.s2, which have similar degrees of lift according to ICC, we notice that the ABC metric favors pbm3.s2, which is less variable than glm1.pbm3; this is in line with Sect. 6.3.8.2. In the same way, while pbm3 and pbm3.s2 have similar ICC values, pbm3.s2 outperforms pbm3 according to ABC, pbm3.s2 being less variable than pbm3 (since both models have the same number of trees while pbm3.s2 uses a shrinkage parameter). Finally, the optimal model with respect to ABC is pbm3.s2.
To end the case study, we display in Fig. 6.13 ICC and ABC as functions of α (i.e. integrating over the interval [0, α] instead of the whole interval [0, 1]). We present only the curves for models pbm1 and glm1.pbm3 as the remaining curves look fairly similar. One sees that glm1.pbm3 always has a lower ICC, while the ABC curves cross at around the 91% quantile. Hence, from that quantile onwards, one can say that pbm1 outperforms glm1.pbm3 according to the ABC metric.

Fig. 6.13 Estimated ICC and ABC values as functions of α for models pbm1 and glm1.pbm3

6.4 Bibliographic Notes and Further Reading

Denuit et al. (2019b) considered binary responses and derived the set of attainable values for concordance-based association measures, so that the closeness to the best-possible fit can be properly assessed. Denuit et al. (2019c) and Mesfioui et al. (2020) obtained the best-possible upper bounds for Kendall's tau and Spearman's rho when the response is a discrete random variable. Section 6.2 is largely inspired by these two papers.
Several testing procedures have been proposed in the literature to detect dependence relations. For positive quadrant dependence, we refer the reader to Denuit and Scaillet (2004) and Scaillet (2005). Zhu et al. (2016) investigated hypothesis tests for first-degree and higher-degree expectation dependence. Testing procedures for the convex order have been proposed in economics (see e.g., Barrett and Donald 2003). Section 6.3 is strongly inspired by Denuit et al. (2019d), in which concentration curves and Lorenz curves are shown to provide actuaries with effective tools to evaluate whether a premium is appropriate or to compare two competing alternatives.

References

Barrett GF, Donald SG (2003) Consistent tests for stochastic dominance. Econometrica 71(1):71–
104
Denuit M, Dhaene J, Goovaerts MJ, Kaas R (2005) Actuarial theory for dependent risks: measures,
orders and models. Wiley, New York
Denuit M, Mesfioui M (2013) A sufficient condition of crossing-type for the bivariate orthant convex
order. Stat Probab Lett 83(1):157–162
Denuit M, Mesfioui M (2017) Preserving the Rothschild-Stiglitz type increase in risk with back-
ground risk: a characterization. Insur: Math Econ 72:1–5
Denuit M, Hainaut D, Trufin J (2019a) Effective statistical learning methods for actuaries III: neural
networks and extensions. Springer Actuarial Lecture Notes
Denuit M, Mesfioui M, Trufin J (2019b) Bounds on concordance-based validation statistics in
regression models for binary responses. Methodol Comput Appl Probab 21(2):491–509
Denuit M, Mesfioui M, Trufin J (2019c) Concordance-based predictive measures in regression
models for discrete responses. Scand Actuar J 10:824–836
Denuit M, Scaillet O (2004) Nonparametric tests for positive quadrant dependence. J Financ Econ 2(3):422–450
Denuit M, Sznajder D, Trufin J (2019d) Model selection based on Lorenz and concentration curves,
Gini indices and convex order. Insur: Math Econ 89:128–139
Egozcue M, Garcia L-F, Wong W-K, Zitikis R (2011) Grüss-type bounds for covariances and the
notion of quadrant dependence in expectation. Cent Eur J Math 9(6):1288–1297
Frees E, Meyers G, Cummings A (2011) Summarizing insurance scores using a Gini index. J Amer
Stat Asso 106(495):1085–1098
Frees EW, Meyers G, Cummings AD (2013) Insurance ratemaking and a Gini index. J Risk Insur
81(2):335–366
Gourieroux C, Jasiak J (2007) The econometrics of individual risk: credit, insurance, and marketing.
Princeton University Press, Princeton
Mesfioui M, Tajar A (2005) On the properties of some nonparametric concordance measures in the
discrete case. Nonparametric Stat 17(5):541–554
Mesfioui M, Trufin J, Zuyderhoff P (2020) Bounds on Spearman’s rho when at least one random
variable is discrete. Working paper
Meyers G, Cummings AD (2009) "Goodness of fit" versus "goodness of lift". Actuar Rev 36(3):16–17
Muliere P, Petrone S (1992) Generalized Lorenz curve and monotone dependence orderings. Metron
50:19–38
Nešlehová J (2007) On rank correlation measures for non-continuous random variables. J Multivar
Anal 98(3):544–567
Noll A, Salzmann R, Wüthrich M (2018) Case study: French motor third-party liability claims.
Available at SSRN: https://ssrn.com/abstract=3164764
Scaillet O (2005) A Kolmogorov-Smirnov type test for Positive Quadrant Dependence. Can J Stat
33(3):415–427
Shaked M, Sordo MA, Suarez-Llorens A (2012) Global dependence stochastic orders. Methodol
Comput Appl Probab 14(3):617–648
Tevet D (2013) Exploring model lift: is your model worth implementing. Actuar Rev 40(2):10–13
Yitzhaki S, Schechtman E (2013) The gini methodology: a primer on statistical methodology.
Springer, Berlin
Zhu X, Guo X, Lin L, Zhu L (2016) Testing for positive expectation dependence. Ann Inst Stat
Math 68:135–153
