Michel Denuit
Donatien Hainaut
Julien Trufin
Effective Statistical Learning Methods for Actuaries II
Tree-Based Methods and Extensions
Springer Actuarial
Editors-in-Chief
Hansjoerg Albrecher, University of Lausanne, Lausanne, Switzerland
Michael Sherris, UNSW, Sydney, NSW, Australia
Series Editors
Daniel Bauer, University of Wisconsin-Madison, Madison, WI, USA
Stéphane Loisel, ISFA, Université Lyon 1, Lyon, France
Alexander J. McNeil, University of York, York, UK
Antoon Pelsser, Maastricht University, Maastricht, The Netherlands
Ermanno Pitacco, Università di Trieste, Trieste, Italy
Gordon Willmot, University of Waterloo, Waterloo, ON, Canada
Hailiang Yang, The University of Hong Kong, Hong Kong, Hong Kong
This subseries of Springer Actuarial includes books with the character of lecture
notes. Typically these are research monographs on new, cutting-edge developments
in actuarial science; sometimes they may be a glimpse of a new field of research
activity, or presentations of a new angle in a more classical field.
In the established tradition of Lecture Notes, the timeliness of a manuscript can
be more important than its form, which may be informal, preliminary or tentative.
Michel Denuit
Institut de Statistique, Biostatistique et Sciences Actuarielles (ISBA)
Université Catholique de Louvain
Louvain-la-Neuve, Belgium

Donatien Hainaut
Institut de Statistique, Biostatistique et Sciences Actuarielles (ISBA)
Université Catholique de Louvain
Louvain-la-Neuve, Belgium

Julien Trufin
Département de Mathématiques
Université Libre de Bruxelles
Brussels, Belgium
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The present material is written for students enrolled in actuarial master programs
and practicing actuaries, who would like to gain a better understanding of insurance
data analytics. It is built in three volumes, starting from the celebrated Generalized
Linear Models (GLMs) and continuing with tree-based methods and neural
networks.
This second volume summarizes the state of the art using regression trees and
their various combinations such as random forests and boosting trees. It also
presents tools for assessing the predictive accuracy of regression models.
Throughout this book, we alternate between methodological
aspects and numerical illustrations or case studies to demonstrate practical appli-
cations of the proposed techniques. The R statistical software has been found
convenient to perform the analyses throughout this book. It is a free language and
environment for statistical computing and graphics. In addition to our own R code,
we have benefited from many R packages contributed by the members of the very
active community of R-users. The open-source statistical software R is freely
available from https://www.r-project.org/.
The technical requirements to understand the material are kept at a reasonable
level so that this text is meant for a broad readership. We refrain from proving all
results but rather favor an intuitive approach with supportive numerical illustrations,
providing the reader with relevant references where all justifications can be found,
as well as more advanced material. These references are gathered in a dedicated
section at the end of each chapter.
The three authors are professors of actuarial mathematics at the universities of
Brussels and Louvain-la-Neuve, Belgium. Together, they accumulate decades of
teaching experience related to the topics treated in the three books, in Belgium and
throughout Europe and Canada. They are also scientific directors at Detralytics, a
consulting office based in Brussels.
Within Detralytics as well as on behalf of actuarial associations, the authors have
had the opportunity to teach the material contained in the three volumes of
“Effective Statistical Learning Methods for Actuaries” to various audiences of
practitioners. The feedback received from the participants in these short courses
greatly helped to improve the exposition of the topic. Throughout their contacts
with the industry, the authors also implemented these techniques in a variety of
consulting and R&D projects. This makes the three volumes of “Effective Statistical
Learning Methods for Actuaries” the ideal support for teaching students and CPD
events for professionals.
Contents

1 Introduction
  1.1 The Risk Classification Problem
    1.1.1 Insurance Risk Diversification
    1.1.2 Why Classifying Risks?
    1.1.3 The Need for Regression Models
    1.1.4 Observable Versus Hidden Risk Factors
    1.1.5 Insurance Ratemaking Versus Loss Prediction
  1.2 Insurance Data
    1.2.1 Claim Data
    1.2.2 Frequency-Severity Decomposition
    1.2.3 Observational Data
    1.2.4 Format of the Data
    1.2.5 Data Quality Issues
  1.3 Exponential Dispersion (ED) Distributions
    1.3.1 Frequency and Severity Distributions
    1.3.2 From Normal to ED Distributions
    1.3.3 Some ED Distributions
    1.3.4 Mean and Variance
    1.3.5 Weights
    1.3.6 Exposure-to-Risk
  1.4 Maximum Likelihood Estimation
    1.4.1 Likelihood-Based Statistical Inference
    1.4.2 Maximum-Likelihood Estimator
    1.4.3 Derivation of the Maximum-Likelihood Estimate
    1.4.4 Properties of the Maximum-Likelihood Estimators
    1.4.5 Examples
  1.5 Deviance
Chapter 1
Introduction
Insurance companies cover risks (that is, random financial losses) by collecting pre-
miums. Premiums are generally paid in advance (hence their name). The pure pre-
mium is the amount collected by the insurance company, to be re-distributed as
benefits among policyholders and third parties in execution of the contract, without
loss nor profit. Under the conditions of validity of the law of large numbers, the
pure premium is the expected amount of compensation to be paid by the insurer
(sometimes discounted to policy issue in case of long-term liabilities).
The pure premiums are just re-distributed among policyholders to pay for their
respective claims, without loss nor profit on average. Hence, they cannot be consid-
ered as insurance prices because loadings must be added to face operating costs, in
order to ensure solvency, to cover general expenses, to pay commissions to interme-
diaries, to generate profit for stockholders, not to mention the taxes imposed by the
local authorities.
In practice, most portfolios are heterogeneous: they mix individuals with different
risk levels. Some policyholders tend to report claims more often or to report more
expensive claims, on average. In a heterogeneous portfolio with a uniform price
list, the financial result of the insurance company depends on the composition of the
portfolio.
The modification in the composition of the portfolio may generate losses for the
insurer charging a uniform premium to different risk profiles, when competitors
attract the better risk profiles by offering them lower premiums, leaving the insurer
with the worst risks only.
Considering that risk profiles differ inside insurance portfolios, it theoretically suf-
fices to subdivide the entire portfolio into homogeneous risk classes, i.e. groups of
policyholders sharing the same risk factors, and to determine an amount of pure pre-
mium specific to each risk class. However, if the data are subdivided into risk classes
determined by many factors, actuaries often deal with sparsely populated groups
of contracts. Therefore, simple averages become useless and regression models are
needed.
Regression models predict a response variable from a function of risk factors and
parameters. This approach is also referred to as supervised learning. By connecting
the different risk profiles, a regression analysis can deal with highly segmented
problems resulting from the massive amount of information about the policyholders
that has now become available to the insurers.
Some risk factors can easily be observed, such as the policyholder’s age, gender,
marital status or occupation, the type and use of the car, or the place of residence
for instance. Other ones can be observed but subject to some effort or cost. This
is typically the case with behavioral characteristics reflected in telematics data or
information gathered in external databases that can be accessed by the insurer for a
fee paid to the provider. But besides these observable factors, there always remain
risk factors unknown to the insurer. In motor insurance for instance, these hidden
risk factors typically include temper and skills, aggressiveness behind the wheel,
respect of the highway code or swiftness of reflexes (even if telematics data now
help insurers to figure out these behavioral traits, but only after contract inception).
Henceforth, we denote as X the random vector gathering the observable risk
factors used by the insurer. Notice that those risk factors are not necessarily in causal
relationship with the response Y. As a consequence, some components of X could
become irrelevant if the hidden risk factors influencing the risk (in addition to X), in
causal relationship with the response Y and denoted X+, were available.
Even if the actuary wants to model the total claim amount Y generated by a policy of
the portfolio over one period (typically, one year), this random variable is generally
not the modeling target. Indeed, modeling Y does not allow one to study the effect of
per-claim deductibles or bonus-malus rules, for instance. Rather, the total claim
amount Y is decomposed into
Y = ∑_{k=1}^{N} C_k
where
N = number of claims
Ck = cost (or severity) of the kth claim
C1 , C2 , . . . identically distributed
all these random variables being independent. By convention, the empty sum is zero,
that is,
N = 0 ⇒ Y = 0.
Having individual costs for each claim, the actuary often wishes to model their respec-
tive amounts (also called claim sizes or claim severities in the actuarial literature).
Prior to the analysis, the actuary first needs to exclude possible large claims, keeping
only the standard, or attritional ones.
Overall, the modeling of claim amounts is more difficult than claim frequencies.
There are several reasons for that. First and foremost, claims sometimes need several
years to be settled as explained before. Only estimates of the final cost appear in
the insurer’s records until the claim is closed. Moreover, the statistics available to fit
a model for claim severities are much more scarce, since generally only 5–10% of
the policies in the portfolio produced claims. Finally, the unexplained heterogeneity
is sometimes more pronounced for costs than for frequencies. The cost of a traffic
accident for instance is indeed for the most part beyond the control of a policyholder
since the payments of the insurance company are determined by third-party charac-
teristics. The degree of care exercised by a driver mostly influences the number of
accidents, but in a much lesser way the cost of these accidents.
Statistical analyses are conducted with data either from experimental or from obser-
vational studies. In the former case, random assignment of individual units (humans
or animals, for instance) to the experimental treatments plays a fundamental role to
draw conclusions about causal relationships (to demonstrate the usefulness of a new
drug, for instance). This is however not the case with insurance data, which consist
of observations recorded on past contracts issued by the insurer.
As an example, let us consider motor insurance. The policyholders covered by a
given insurance company are generally not a random sample from the entire pop-
ulation of drivers in the country. Each company targets a specific segment of this
population.
The data required to perform analyses carried out in this book generally consist of
linked policy and claims information at the individual risk level. The appropriate
definition of individual risk level varies according to the line of business and the type
of study. For instance, an individual risk generally corresponds to a vehicle in motor
insurance or to a building in fire insurance.
The database must contain one record for each period of time during which a
policy was exposed to the risk of filing a claim, and during which all risk factors
remained unchanged. A new record must be created each time risk factors change,
with the previous exposure curtailed at the point of amendment. The policy number
then allows the actuary to track the experience of the individual risks over time. Policy
cancellations and new business also result in the exposure period being curtailed. For
each record, the database registers policy characteristics together with the number of
claims and the total incurred losses. In addition to this policy file, there is a claim file
recording all the information about each claim, separately (the link between the two
files being made using the policy number). This second file also contains specific
features about each claim, such as the presence of bodily injuries, the number of
victims, and so on. This second file is interesting to build predictive models for the
cost of claims based on the information about the circumstances of each insured
event. This allows the insurer to better assess incurred losses.
The information available to perform risk classification is summarized into a set
of features xi j , j = 1, . . . , p, available for each policy i. These features may have
different formats:
• categorical (such as gender, with two levels, male and female);
• integer-valued, or discrete (such as the number of vehicles in the household);
• continuous (such as policyholder’s age).
Categorical covariates may be ordered (when the levels can be ordered in a mean-
ingful way, such as education level) or not (when the levels cannot be ranked, think
for instance of marital status, with levels single, married, cohabiting, divorced, or
widowed, say).
Notice that continuous features are generally available to a finite precision so that
they are actually discrete variables with a large number of numerical values.
As in most actuarial textbooks, we assume here that the available data are reliable
and accurate. This assumption hides a time-consuming step in every actuarial study,
during which data are gathered, checked for consistency, cleaned if needed and some-
times connected to external data bases to increase the volume of information. Setting
up the database often takes the most time and does not look very rewarding. Data
preparation is however of crucial importance because, as the saying goes, “garbage
in, garbage out”: there is no hope to get a reliable technical price list from a database
suffering many limitations.
Once data have been gathered, it is important to spend enough time on exploratory
data analysis. This part of the analysis aims at discovering which features seem
to influence the response, as well as subsets of strongly correlated features. This
traditional, seemingly old-fashioned view may well conflict with the modern data
science approach, where practitioners are sometimes tempted to put all the features
in a black-box model without taking the time to even know what they mean. But we
firmly believe that such a blind strategy can sometimes lead to disastrous conclusions
in insurance pricing, so that we strongly advise dedicating enough time to discover
the kind of information recorded in the database under study.
Regression models aim to analyze the relationship between a variable whose outcome
needs to be predicted and one or more potential explanatory variables. The variable
of interest is called the response and is denoted as Y . Insurance analysts typically
encounter non-Normal responses such as the number of claims or the claim severities.
Actuaries then often select the distribution of the response from the exponential
dispersion (or ED) family.
Claim numbers are modeled by means of non-negative integer-valued random
variables (often called counting random variables). Such random variables are
described by their probability mass function: given a counting random variable Y
valued in the set {0, 1, 2, . . .} of non-negative integers, its probability mass function
pY is defined as
y → pY (y) = P[Y = y], y = 0, 1, 2, . . .
and we set pY to zero otherwise. The support S of Y is defined as the set of all values
y such that p_Y(y) > 0. Expectation and variance are then respectively given by

E[Y] = ∑_{y=0}^{∞} y p_Y(y)   and   Var[Y] = ∑_{y=0}^{∞} (y − E[Y])² p_Y(y).
Claim severities, on the other hand, are modeled by means of continuous random
variables, described by their probability density function f_Y. In this case,

P[Y ≈ y] = P[y − ε/2 ≤ Y ≤ y + ε/2] ≈ ε f_Y(y)

for sufficiently small ε > 0, so that f_Y also indicates the region where Y is most
likely to fall. In particular, f_Y = 0 where Y cannot assume its values. The support S
of Y is then defined as the set of all values y such that f_Y(y) > 0. Expectation and
variance are then respectively given by

E[Y] = ∫_{−∞}^{∞} y f_Y(y) dy   and   Var[Y] = ∫_{−∞}^{∞} (y − E[Y])² f_Y(y) dy.
The oldest distribution for errors in a regression setting is certainly the Normal
distribution, also called Gaussian, or Gauss–Laplace distribution after its inventors.
The family of ED distributions in fact extends the nice structure of this probability
law to more general errors. Recall that a response Y obeys the Normal distribution
with mean μ and variance σ², which is denoted as Y ∼ Nor(μ, σ²), when its
probability density function is given by

f_Y(y) = (1/(σ√(2π))) exp(−(y − μ)²/(2σ²)),   −∞ < y < ∞.   (1.3.1)

Considering (1.3.1), we see that Normally distributed responses can take any real
value, positive or negative, as f_Y > 0 over the whole real line (−∞, ∞).
Figure 1.1 displays the probability density function (1.3.1) for different parameter
values. The Nor (μ, σ 2 ) probability density function appears to be a symmetric bell-
shaped curve centered at μ, with σ 2 controlling the spread of the distribution. The
probability density function f Y being symmetric with respect to μ, positive or nega-
tive deviations from the mean μ have the same probability to occur. To be effective,
any analysis based on the Normal distribution requires that the probability density
function of the data has a shape similar to one of those visible in Fig. 1.1, which is
rarely the case in insurance applications.
Notice that the Normal distribution enjoys the convenient convolution stability
property, meaning that the sum of independent, Normally distributed random vari-
ables remain Normally distributed.
1.3.2.2 ED Distributions
The Nor (μ, σ 2 ) probability density function can be rewritten in order to be extended
to a larger class of probability distributions sharing some convenient properties: the
ED family. The idea is as follows. The parameter of interest in insurance pricing is the
mean μ involved in pure premium calculations. This is why we isolate components of
the Normal probability density function where μ appears. This is done by expanding
the square appearing inside the exponential function in (1.3.1), which gives
f_Y(y) = (1/(σ√(2π))) exp(−(y² − 2yμ + μ²)/(2σ²))
       = exp((yμ − μ²/2)/σ²) · exp(−y²/(2σ²))/(σ√(2π)).   (1.3.2)

Fig. 1.1 Probability density functions of Nor(10, 3²) in continuous line, Nor(10, 5²) in broken line, and Nor(10, 10²) in dotted line
The second factor appearing in (1.3.2) does not involve μ so that the important
component is the first one. We see that it has a very simple form, being the exponential
(hence the vocable “exponential” in ED) of a ratio with the variance σ 2 , i.e. the
dispersion parameter, appearing in the denominator. The numerator appears to be
the difference between the product yμ of the response y and the canonical Normal mean
parameter μ and a function of μ only. Notice that the derivative of this second term,
μ²/2, with respect to μ is just the mean μ. Such a decomposition allows us to define
the whole ED class of distributions as follows.
Definition 1.3.1 Consider a response Y valued in a subset S of the real line
(−∞, ∞). Its distribution is said to belong to the ED family if Y obeys a proba-
bility mass function pY or a probability density function f Y of the form
p_Y(y) or f_Y(y) = exp((yθ − a(θ))/(φ/ν)) c(y, φ/ν),   y ∈ S,   (1.3.3)
where θ is the canonical parameter, a(·) the cumulant function, φ the dispersion
parameter, ν a known weight, and c(·,·) a normalizing function not depending on θ.

The Normal distribution Nor(μ, σ²) belongs to the ED family, with

θ = μ
a(θ) = μ²/2 = θ²/2
φ = σ²
ν = 1
c(y, φ) = exp(−y²/(2σ²))/(σ√(2π)).
Remark 1.3.2 Sometimes, (1.3.3) is replaced with the more general form
exp((yθ − a(θ))/b(φ, ν)) c(y, φ, ν).

However, the particular case where φ and ν are combined into φ/ν, i.e. where

b(φ, ν) = φ/ν   and   c(y, φ, ν) = c(y, φ/ν),

is general enough for our purposes and is the form adopted throughout this book.
A counting random variable Y is Poisson distributed with parameter λ > 0, which is
denoted as Y ∼ Poi(λ), when its probability mass function is given by

p_Y(y) = exp(−λ) λ^y / y!,   y = 0, 1, 2, . . . .   (1.3.4)
Considering (1.3.5), we see that both the mean and variance of the Poisson distribu-
tion are equal to λ, a phenomenon termed as equidispersion. The skewness coefficient
of the Poisson distribution is
γ[Y] = 1/√λ.
As λ increases, the Poisson distribution thus becomes more symmetric and is even-
tually well approximated by a Normal distribution, the approximation turning out
to be quite good for λ > 20. But if Y ∼ Poi(λ) then √Y converges much faster to
the Nor(√λ, 1/4) distribution. Hence, the square root transformation was often recom-
mended as a variance stabilizing transformation for count data at a time when classical
methods assuming Normality (and constant variance) were employed.
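A quick simulation in R (the software used throughout this book) illustrates this variance-stabilizing effect; the snippet below is a minimal sketch on simulated counts, not one of the book's case studies.

```r
# Numerical check: for Y ~ Poi(lambda), var(sqrt(Y)) approaches 1/4 as lambda grows,
# while var(Y) keeps increasing with lambda (equidispersion).
set.seed(1)
for (lambda in c(5, 20, 100)) {
  y <- rpois(1e5, lambda)
  cat("lambda =", lambda,
      " var(Y) =", round(var(y), 3),
      " var(sqrt(Y)) =", round(var(sqrt(y)), 3), "\n")
}
```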
The shape of the Poisson probability mass function is displayed in the graphs of
Fig. 1.2. For small values of λ, we see that the Poi(λ) probability mass function is
highly asymmetric. When λ increases, it becomes more symmetric and ultimately
looks like the Normal bell curve.
The Poisson distribution enjoys the convenient convolution stability property, i.e.
Y1 ∼ Poi(λ1), Y2 ∼ Poi(λ2), Y1 and Y2 independent  ⇒  Y1 + Y2 ∼ Poi(λ1 + λ2).   (1.3.6)
This property is useful because sometimes the actuary has only access to aggregated
data. Assuming that individual data is Poisson distributed, then so is the summed
count and Poisson modeling still applies.
In order to establish that the Poisson distribution belongs to the ED family, let us
write the Poi(λ) probability mass function (1.3.4) as follows:
Fig. 1.2 Probability mass functions of Poi(λ) with λ = 0.05, 0.5, 5, 10 (from upper left to lower right)
p_Y(y) = exp(−λ) λ^y / y!
       = exp(y ln λ − λ) (1/y!),

which is indeed of the form (1.3.3) with

θ = ln λ
a(θ) = λ = exp(θ)
φ = 1
ν = 1
c(y, φ) = 1/y!.
The Gamma distribution is right-skewed, with a sharp peak and a long tail to the right.
These characteristics are often visible on empirical distributions of claim amounts.
This makes the Gamma distribution a natural candidate for modeling accident benefits
paid by the insurer.
Precisely, a random variable Y valued in S = (0, ∞) is distributed according to
the Gamma distribution with parameters α > 0 and τ > 0, which will henceforth be
denoted as Y ∼ Gam(α, τ ), if its probability density function is given by
f_Y(y) = y^{α−1} τ^α exp(−τy)/Γ(α),   y > 0,   (1.3.7)

where

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx.
The parameter α is often called the shape of the Gamma distribution whereas τ is
referred to as the scale parameter.
The mean and the variance of Y ∼ Gam(α, τ ) are respectively given by
E[Y] = α/τ   and   Var[Y] = α/τ² = (1/α)(E[Y])².   (1.3.8)
We thus see that the variance is a quadratic function of the mean. The Gamma
distribution is useful for modeling a positive, continuous response when the variance
grows with the mean but where the coefficient of variation
CV[Y] = √Var[Y] / E[Y] = 1/√α
stays constant. As their names suggest, the scale parameter in the Gamma family
influences the spread (and incidentally, the location) but not the shape of the dis-
tribution, while the shape parameter controls the skewness of the distribution. For
Y ∼ Gam(α, τ ), we have
γ[Y] = 2/√α

so that the Gamma distribution is positively skewed. As the shape parameter gets
larger, the distribution grows more symmetric.

Fig. 1.3 Gamma probability density functions with mean μ = 1 and τ = 1, 2, 4
Figure 1.3 displays Gamma probability density functions for different parameter
values. Here, we fix the mean μ = α/τ to 1 and we take τ ∈ {1, 2, 4} so that the variance
is equal to 1, 0.5, and 0.25. Unlike the Normal distribution (whose probability density
function resembles a bell shape centered at μ whatever the variance σ 2 ), the shape of
the Gamma probability density function changes with the parameter α. For α ≤ 1,
the probability density function has a maximum at the origin whereas for α > 1 it is
unimodal but skewed. The skewness decreases as α increases.
Gamma distributions enjoy the convenient convolution stability property for fixed
scale parameter τ . Specifically,
Y1 ∼ Gam(α1, τ), Y2 ∼ Gam(α2, τ), Y1 and Y2 independent  ⇒  Y1 + Y2 ∼ Gam(α1 + α2, τ).   (1.3.9)
The particular case α = 1 corresponds to the Negative Exponential distribution with
parameter τ; in this case, we write Y ∼ Exp(τ). This distribution enjoys the remarkable
memoryless property:

P[Y > s + t | Y > s] = exp(−τ(s + t))/exp(−τs) = exp(−τt) = P[Y > t].
The particular case where the shape parameter α is a positive integer is referred to as
the Erlang distribution. The Erlang distribution corresponds to the distribution of a sum

Y = Z1 + Z2 + . . . + Zα

of α independent Negative Exponential random variables Z1, . . . , Zα with common
parameter τ.

The Gamma distribution belongs to the ED family. Writing μ = α/τ for the mean, we
have

θ = −1/μ
a(θ) = ln μ = −ln(−θ)
φ = 1/α
c(y, φ) = α^α y^{α−1}/Γ(α).
In addition to the Normal, Poisson and Gamma distributions, Table 1.1 gives an
overview of some other ED distributions that appear to be useful in the analysis of
insurance data, namely the Bernoulli distribution Ber (q) with parameter q ∈ (0, 1);
the Binomial distribution Bin(m, q) with parameters m ∈ N+ = {1, 2, . . .} and
q ∈ (0, 1); the Geometric distribution Geo(q) with parameter q ∈ (0, 1); the Pas-
cal distribution Pas(m, q) with parameters m ∈ N+ and q ∈ (0, 1); the Negative
Exponential distribution Ex p(μ) with parameter μ > 0 and the Inverse Gaussian dis-
tribution IGau(μ, α) with parameters μ > 0 and α > 0. For each distribution, we list
the canonical parameter θ, the cumulant function a(·), and the dispersion parameter
φ entering the general definition (1.3.3) of ED probability mass or probability density
function. We also give the two first moments.
1.3.4 Mean and Variance

In the Normal case, we have just seen that the mean response is recovered from the
derivative of the cumulant function a(·). This turns out to be a property generally valid
for all ED distributions.
Property 1.3.3 If the response Y has probability density/mass function of the form
(1.3.3) then
E[Y] = a′(θ).
The mean response then corresponds to the first derivative of the function a(·)
involved in (1.3.3). The next result shows that the variance is proportional to the
second derivative of a(·).
Property 1.3.4 If the response Y has probability density/mass function of the form
(1.3.3) then
Var[Y] = (φ/ν) a″(θ).
ν
Notice that increasing the weight thus decreases the variance whereas the variance
increases linearly in the dispersion parameter φ. The impact of θ on the variance is
given by the factor
a″(θ) = dμ(θ)/dθ

expressing how a change in the canonical parameter θ modifies the expected response.
In the Normal case, a″(θ) = 1 and the variance is just constantly equal to φ/ν =
σ²/ν, not depending on θ. In this case, the mean response does not influence its
variance. For the other members of the ED family, a″ is not constant and a change
in θ modifies the variance.
An absence of relation between the mean and the variance is only possible for
real-valued responses (such as Normally distributed ones, where the variance σ 2 does
not depend on the mean μ). Indeed, if Y is non-negative (i.e. Y ≥ 0) then intuitively
the variance of Y tends to zero as the mean of Y tends to zero. That is, the variance
is a function of the mean for non-negative responses. The relationship between the
mean and variance of an ED distribution is indicated by the variance function V (·).
The variance function V (·) is formally defined as
V(μ) = d²a(θ)/dθ² = dμ(θ)/dθ.
The variance function thus corresponds to the variation in the mean response μ(θ)
viewed as a function of the canonical parameter θ. In the Normal case, μ(θ) = θ and
V (μ) = 1. The other ED distributions have non-constant variance functions. Again,
we see that the cumulant function a(·) determines the distributional properties in the
ED family.
The variance of the response can thus be written as
φ
Var[Y ] = V (μ).
ν
It is important to keep in mind that the variance function is not the variance of the
response, but the function of the mean entering this variance (to be multiplied by
φ/ν). The variance function is regarded as a function of the mean μ, even if it appears
as a function of θ; this is possible by inverting the relationship between θ and μ as
we know from Property 1.3.3 that μ = E[Y] = a′(θ). The convexity of a(·) ensures
that the mean function a′ is increasing so that its inverse is well defined. Hence, we
can express the canonical parameter in terms of the mean response μ by the relation

θ = (a′)^{−1}(μ).
In the Poisson case, we get

V(μ) = d²/dθ² exp(θ) = exp(θ) = μ,

while in the Gamma case,

V(μ) = d²/dθ² (−ln(−θ)) = 1/θ² = μ².
Notice that V(μ) = μ^ξ with

ξ = 0 for the Normal distribution
ξ = 1 for the Poisson distribution
ξ = 2 for the Gamma distribution
ξ = 3 for the Inverse Gaussian distribution.
These members of the ED family thus have power variance functions. The whole
family of ED distributions with power variance functions is referred to as the Tweedie
class.
1.3.5 Weights
Consider independent responses Y1, . . . , Yn obeying the same ED distribution (1.3.3)
with weight ν, and denote their average as

Ȳ = (1/n) ∑_{i=1}^{n} Yi.

The distribution of Ȳ is then the same as for each Yi except that the weight ν is
replaced with nν (Property 1.3.5).
Averaging observations is thus accounted for by modifying the weights in the ED
family.
In actuarial studies, responses can be ratios with the aggregate exposure or even
premiums in the denominator (in case loss ratios are analyzed). The numerator may
correspond to individual data, or to grouped data aggregated over a set of homo-
geneous policies. This means that the size of the group has to be accounted for as
the response ratios will tend to be far more volatile in low-volume cells than in
high-volume ones. Actuaries generally consider that a large-volume cell is the result
of summing smaller independent cells, leading to response variance proportional to
the inverse of the volume measure. This implies that weights vary according to the
business volume measure.
The next result extends Property 1.3.5 to weighted averages of ED responses.
Property 1.3.6 Consider independent responses Y1 , . . . , Yn obeying ED distribu-
tions (1.3.3) with common mean μ, dispersion parameter φ and specific weights νi .
Define the total weight
ν• = ∑_{i=1}^{n} νi.

Then the weighted average

Ȳ = (1/ν•) ∑_{i=1}^{n} νi Yi

obeys an ED distribution of the same form, with mean μ, dispersion parameter φ and
weight ν•.
1.3.6 Exposure-to-Risk
The number of observed events generally depends on a size variable that determines
the number of opportunities for the event to occur. This size variable is often the time
as the number of claims obviously depends on the length of the coverage period.
However, some other choices are possible, such as distance traveled in motor insur-
ance, for instance.
The Poisson process setting is useful when the actuary wants to analyze claims
experience from policyholders who have been observed during periods of unequal
lengths. Assume that the claims occur according to a Poisson process with rate λ. In
this setting, claims occur randomly and independently in time. Denoting as T1 , T2 , . . .
the times between two consecutive events, this means that these random variables are
independent and obey the Ex p(λ) distribution, the only one enjoying the memoryless
property. Hence, the kth claim occurs at time
T1 + T2 + . . . + Tk = ∑_{j=1}^{k} Tj ∼ Gam(k, λ).

Denote as Y the number of claims recorded over an exposure-to-risk e. The event
{Y ≥ k} occurs precisely when the kth claim arrives before time e, so that

P[Y = k] = P[Y ≥ k] − P[Y ≥ k + 1] = exp(−λe) (λe)^k / k!,   k = 0, 1, . . . ,

that is, Y obeys the Poi(λe) distribution: the exposure-to-risk e simply multiplies the
expected claim frequency λ.
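The link between the Poisson process and the Poi(λe) count can be checked by a small simulation; the R sketch below uses hypothetical values λ = 0.1 and e = 0.5 chosen for illustration only.

```r
# Claims arrive according to a Poisson process with rate lambda; counting the
# arrivals over an exposure e reproduces a Poi(lambda * e) distribution.
set.seed(1)
lambda <- 0.1; e <- 0.5
sim_count <- function() {
  t <- 0; k <- 0
  repeat {
    t <- t + rexp(1, rate = lambda)   # Exp(lambda) waiting time between claims
    if (t > e) return(k)
    k <- k + 1
  }
}
counts <- replicate(1e4, sim_count())
c(mean(counts), lambda * e)   # empirical mean close to the Poisson mean 0.05
```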
Assuming that the responses are random variables with unknown distribution depend-
ing on one (or several) parameter(s) θ, the actuary must draw conclusions about the
unknown parameter θ based on available data. Such conclusions are thus subject to
sampling errors: another dataset from the same population would have inevitably
produced different results. To perform premium calculations, the actuary needs to
select a value of θ hopefully close to the true parameter value. Such a value is called
an estimate (or pointwise estimation) of the unknown parameter θ. The estimate is
distinguished from the model parameters by a hat, which means that an estimate of
θ is denoted by θ̂. This distinction is necessary since it is generally impossible to
estimate the true parameter θ without error. Thus, θ̂ ≠ θ in general. The estimator is
itself a random variable as it varies from sample to sample (in a repeated sampling
setting, that is, drawing random samples from a given population and computing
the estimated value, again and again). Formally, an estimator θ̂ is a function of the
observations Y1, Y2, . . . , Yn, that is,

θ̂ = θ̂(Y1, Y2, . . . , Yn).
The parameter value that makes the observed y1 , . . . , yn the most probable is
called the maximum-likelihood estimate. Formally, the likelihood function LF(θ) is
defined as the joint probability mass/density function of the observations. In our case,
for independent observations Y1, . . . , Yn obeying the same ED distribution (1.3.3), it
writes

LF(θ) = ∏_{i=1}^{n} exp((yi θ − a(θ))/φ) c(yi, φ).
It is more convenient to work with the log-likelihood function L(θ) = ln LF(θ),
which is given by

L(θ) = ∑_{i=1}^{n} [(yi θ − a(θ))/φ + ln c(yi, φ)]
     = n(ȳθ − a(θ))/φ + ∑_{i=1}^{n} ln c(yi, φ),

where

ȳ = (1/n) ∑_{i=1}^{n} yi

denotes the average of the observed responses.
The desired θ̂ can easily be obtained by solving the likelihood equation

dL(θ)/dθ = 0.

This gives
0 = dL(θ)/dθ = (d/dθ)[n(ȳθ − a(θ))/φ] = n(ȳ − a′(θ))/φ

⇔ ȳ = a′(θ̂) ⇔ θ̂ = (a′)^{−1}(ȳ).
This is indeed a maximum since

d²L(θ)/dθ² = −n a″(θ)/φ = −(n/φ²) Var[Y1]

is always negative. We see that the individual observations y1, . . . , yn are not needed
to compute θ̂ as long as the analyst knows ȳ (which thus summarizes all the informa-
tion contained in the observations about the canonical parameter). Also, we notice
that the nuisance parameter φ does not show up in the estimation of θ.
In a repeated sampling setting, denoting as θ̂_j the estimate obtained from the jth
sample, we have

E[θ̂] = lim_{k→∞} (1/k) ∑_{j=1}^{k} θ̂_j

by the law of large numbers. In this setting, Var[θ̂] measures the stability of the
estimates across these samples.
We now briefly discuss some relevant properties of the maximum-likelihood esti-
mators.
1.4.4.1 Consistency
The maximum-likelihood estimator θ̂n = θ̂(Y1, Y2, . . . , Yn) is consistent for the
parameter θ, namely that

lim_{n→∞} P[|θ̂n − θ| < ε] = 1 for all ε > 0.
1.4.4.2 Invariance
The maximum-likelihood estimate of a function of the parameter is obtained by
applying this function to θ̂ when maximum-likelihood is used for estimation. This
ensures that for every distribution in the ED family, the maximum-likelihood estimate
of the mean fulfills

μ̂ = ȳ.   (1.4.1)
In particular, μ̂ = Ȳ is an unbiased estimator of μ since

E[μ̂] = E[Ȳ] = (1/n) ∑_{i=1}^{n} E[Yi] = μ.

In general, however, maximum-likelihood estimators are only asymptotically unbiased,
in the sense that

lim_{n→∞} E[θ̂n] = θ.
In the class of all estimators, for large samples, the maximum-likelihood estimator
θ̂ has the minimum variance and is therefore the most accurate estimator possible.
We see that many attractive properties of the maximum-likelihood estimation
principle hold in large samples. As actuaries generally deal with massive amounts of
data, this makes this estimation procedure particularly attractive to conduct insurance
studies.
1.4.5 Examples
Assume that the responses Y1 , . . . , Yn are independent and Poisson distributed with
Yi ∼ Poi(λei ) for all i = 1, . . . , n, where ei is the exposure-to-risk for observation
i and λ is the annual expected claim frequency that is common to all observations.
We thus have μi = λei for all i = 1, . . . , n so that θi = ln μi = ln ei + ln λ and
a(θi ) = exp(θi ) = λei . In this setting, the log-likelihood function writes
L(λ) = ∑_{i=1}^{n} [(Yi θi − a(θi))/φ + ln c(Yi, φ)]
     = ∑_{i=1}^{n} (Yi θi − exp(θi) + ln c(Yi, φ))
     = ∑_{i=1}^{n} (Yi ln ei + Yi ln λ − λei + ln c(Yi, φ)).
The maximum-likelihood estimator λ̂, solution of the likelihood equation

dL(λ)/dλ = 0,

is given by

λ̂ = ∑_{i=1}^{n} Yi / ∑_{i=1}^{n} ei.
We have

E[λ̂] = E[∑_{i=1}^{n} Yi / ∑_{i=1}^{n} ei] = (1/∑_{i=1}^{n} ei) ∑_{i=1}^{n} E[Yi] = λ,

so that λ̂ is an unbiased estimator of λ.
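In R, this closed-form estimate coincides with the one returned by an intercept-only Poisson GLM with the log of the exposure as offset; the sketch below uses simulated data with a hypothetical true frequency λ = 0.1.

```r
# Maximum-likelihood estimation of a common claim frequency with unequal exposures.
set.seed(1)
e <- runif(1000, 0.2, 1)          # exposures-to-risk
y <- rpois(1000, 0.1 * e)         # simulated claim counts, true lambda = 0.1
lambda_hat <- sum(y) / sum(e)     # closed-form maximum-likelihood estimate
fit <- glm(y ~ 1, family = poisson(link = "log"), offset = log(e))
c(lambda_hat, exp(coef(fit)))     # both estimates coincide
```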
Assume that the responses Y1 , . . . , Yn are independent and Gamma distributed with
Yi ∼ Gam(μ, ανi ) for all i = 1, . . . , n, where νi is the weight for observation i.
The log-likelihood function writes
L(μ) = ∑_{i=1}^{n} [(Yi θ − a(θ))/(φ/νi) + ln c(Yi, φ/νi)]
     = ∑_{i=1}^{n} [(−Yi/μ − ln μ)/(φ/νi) + ln c(Yi, φ/νi)].
The maximum-likelihood estimator μ̂, solution of the likelihood equation

dL(μ)/dμ = 0,

is given by

μ̂ = ∑_{i=1}^{n} νi Yi / ∑_{i=1}^{n} νi.
We have

Var[μ̂] = (1/(∑_{i=1}^{n} νi)²) ∑_{i=1}^{n} νi² Var[Yi] = (μ²/α) / ∑_{i=1}^{n} νi,

so that Var[μ̂] converges to 0 as ∑_{i=1}^{n} νi becomes infinitely large. The maximum-
likelihood estimator μ̂ is then consistent for μ.
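The weighted-mean estimate can likewise be recovered from an intercept-only Gamma GLM with the νi passed as weights; the following sketch relies on simulated severities with an assumed mean of 100 and α = 2.

```r
# Maximum-likelihood estimation of a common mean severity with weights nu_i.
set.seed(1)
nu <- sample(1:5, 500, replace = TRUE)                  # weights (e.g. claim counts)
y  <- rgamma(500, shape = 2 * nu, rate = 2 * nu / 100)  # average costs, mean 100
mu_hat <- sum(nu * y) / sum(nu)                         # closed-form ML estimate
fit <- glm(y ~ 1, family = Gamma(link = "log"), weights = nu)
c(mu_hat, exp(coef(fit)))                               # both estimates coincide
```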
1.5 Deviance
In the absence of relation between features and the response, we fit a common mean
μ to all observations, such that

μ̂i = ȳ for i = 1, 2, . . . , n,
as we have seen in Sect. 1.4. This model is called the null model and corresponds
to the case where the features do not bring any information about the response. In
the null model, the data are represented entirely as random variations around the
common mean μ. If the null model applies then the data are homogeneous and there
is no reason to charge different premium amounts to subgroups of policyholders.
The null model represents one extreme where the data are purely random. Another
extreme is the full model which represents the data as being entirely systematic. The
model estimate μ̂i is just the corresponding observation yi, that is,

μ̂i = yi for i = 1, 2, . . . , n.
Thus, each fitted value is equal to the observation and the full model fits perfectly.
However, this model does not extract any structure from the data, but merely repeats
the available observations without condensing them.
The deviance, or residual deviance, of a regression model μ̂ is defined as

D(μ̂) = 2φ (L_full − L(μ̂))
     = 2 ∑_{i=1}^{n} νi [yi (θ̃i − θ̂i) − a(θ̃i) + a(θ̂i)],   (1.5.1)

where L_full is the log-likelihood of the full model based on the considered ED dis-
tribution, θ̃i = (a′)^{−1}(yi) and θ̂i = (a′)^{−1}(μ̂i), with μ̂i = μ̂(xi). As the
full model gives the highest attainable log-likelihood with the ED distribution under
consideration, the difference between the log-likelihood L_full of the full model and
the log-likelihood L(μ̂) of the regression model of interest is always positive. The
deviance is a measure of distance between a model and the observed data defined
by means of the saturated model. It quantifies the variations in the data that are not
explained by the model under consideration. A too large value of D(μ̂) indicates that
the model μ̂ under consideration does not satisfactorily fit the actual data. The larger
the deviance, the larger the differences between the actual data and the fitted values.
The deviance of the null model is called the null deviance.
Table 1.3 Deviance associated to regression models based on some members of the ED family of distributions

Distribution | Deviance
Binomial | 2 ∑_{i=1}^{n} [yi ln(yi/μ̂i) + (ni − yi) ln((ni − yi)/(ni − μ̂i))], where μ̂i = ni q̂i
Poisson | 2 ∑_{i=1}^{n} [yi ln(yi/μ̂i) − (yi − μ̂i)], where y ln y = 0 if y = 0
Normal | ∑_{i=1}^{n} (yi − μ̂i)²
Gamma | 2 ∑_{i=1}^{n} [−ln(yi/μ̂i) + (yi − μ̂i)/μ̂i]
Inverse Gaussian | ∑_{i=1}^{n} (yi − μ̂i)²/(μ̂i² yi)
Table 1.3 displays the deviance associated to regression models based on some
members of the ED family of distributions. Notice that in that table, y ln y is taken
to be 0 when y = 0 (its limit as y → 0).
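As a sanity check, the Poisson deviance of Table 1.3 can be computed by hand and compared with the value reported by glm in R; the sketch below uses simulated data and is not one of the book's case studies.

```r
# Poisson deviance computed "by hand" versus the deviance reported by glm().
set.seed(1)
x <- rnorm(200)
y <- rpois(200, exp(0.5 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)
dev_by_hand <- 2 * sum(ifelse(y == 0, 0, y * log(y / mu_hat)) - (y - mu_hat))
c(dev_by_hand, deviance(fit))   # identical up to numerical precision
```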
In actuarial pricing, the aim is to evaluate the pure premium as accurately as possible.
The target is the conditional expectation μ(X) = E[Y |X] of the response Y (claim
number or claim amount for instance) given the available information X. The function
x → μ(x) = E[Y |X = x] is generally unknown to the actuary and is approximated
by a working premium x → μ̂(x). The goal is to produce the most accurate function
μ̂(x). Lack of accuracy for μ̂(x) is defined by the generalization error

Err(μ̂) = E[L(Y, μ̂(X))],

where L(·,·) is a function measuring the discrepancy between its two arguments,
called loss function, and the expected value is over the joint distribution of (Y, X).
We aim to find a function μ̂(x) of the features minimizing the generalization error.
Notice that the loss function L(·,·) should not be confused with the log-likelihood
L(·).
Let
L = {(y1 , x 1 ), (y2 , x 2 ), . . . , (yn , x n )} (1.6.1)
be the set of observations available to the insurer, called learning set. The learning
set is often partitioned into a training set

D = {(yi, xi); i ∈ I}

and a validation set, as detailed in the next chapter. For GLMs, considered in the first
volume, the score is a linear combination of the features, that is,

score(x) = β0 + ∑_{j=1}^{p} βj xj.   (1.6.3)
Estimates β̂0, β̂1, . . . , β̂p for the parameters β0, β1, . . . , βp are then obtained from the
training set by maximum likelihood, so that we get

μ̂(x) = g^{−1}(β̂0 + ∑_{j=1}^{p} β̂j xj),   (1.6.4)

where g denotes the link function.
With this choice, the training sample estimate of the generalization error,

Êrr^train(μ̂) = (1/|I|) ∑_{i∈I} L(yi, μ̂i),

corresponds to the in-sample deviance (up to the factor 1/|I|), that is,

Êrr^train(μ̂) = D^train(μ̂)/|I|,   (1.6.7)

where |I| denotes the number of elements of I. In the following, we extend the
choice (1.6.6) for the loss function to the tree-based methods studied in this book.
The GLM training procedure thus amounts to estimating scores structured as in (1.6.4)
by minimizing the corresponding in-sample deviances (1.6.7) and to selecting the best
model among the GLMs under consideration. Note that selecting the best model
among different GLMs on the basis of the deviance computed on the training set will
favor the most complex models. For that purpose, the Akaike Information Criterion (AIC)
is preferred over the in-sample deviance, because it accounts for model
complexity through a penalty term. As mentioned in Denuit et al. (2019), comparing different
GLMs on the basis of AIC amounts to accounting for the optimism in the deviance
computed on the training set.
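The following R sketch, based on simulated data with a deliberately useless feature, illustrates why the in-sample deviance alone favors the larger model while AIC does not.

```r
# In-sample deviance always decreases when a feature is added; AIC penalizes it.
set.seed(1)
n <- 5000
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rpois(n, exp(-2 + 0.3 * x1))          # x2 plays no role in the true model
fit1 <- glm(y ~ x1, family = poisson)
fit2 <- glm(y ~ x1 + x2, family = poisson)
c(deviance(fit1), deviance(fit2))   # the larger model has the smaller deviance
c(AIC(fit1), AIC(fit2))             # AIC accounts for the extra parameter
```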
In this second volume, we work with models whose scores are linear combinations
of regression trees, that is,
g(μ(x)) = score(x) = ∑_{m=1}^{M} βm Tm(x),
where M is the number of trees producing the score and Tm (x) is the prediction
obtained from regression tree Tm , m = 1, . . . , M. The parameters β1 , . . . , β M specify
the linear combination used to produce the score. The training procedures studied in
this book differ in the way the regression trees are fitted from the training set and in
the linear combination used to produce the ensemble.
In Chap. 3, we consider single regression trees for the score. Specifically, we work
with M = 1 and β1 = 1. In this setting, the identity link function is appropriate and
implicitly chosen, so that we assume

μ(x) = score(x) = T1(x).

The estimated score will be in the range of the response: the prediction in a terminal
node of the tree will be computed as the (weighted) average of the responses in that
node. Note that, contrary to GLMs for instance, for which the form of the score is
strongly constrained, the score is here left unspecified and estimated from the training
set. Theoretically, any function (here the true model) can be approximated arbitrarily
closely by a piecewise constant function (here a regression tree). However, in prac-
tice, the training procedure limits the level of accuracy of a regression tree. A large
tree might overfit the training set, while a small tree might not capture the important
structure of the true model. The size of the tree is thus an important parameter in the
training procedure because it controls the score’s complexity. Selecting the optimal
size of the tree will also be part of the training procedure.
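In R, single Poisson regression trees can be grown with the rpart package, where the complexity parameter cp governs the size of the tree; the sketch below uses simulated policies and hypothetical feature names (age, power) for illustration only.

```r
# Growing and pruning a Poisson regression tree with rpart (toy data).
library(rpart)
set.seed(1)
n <- 10000
age   <- sample(18:90, n, replace = TRUE)
power <- sample(1:10, n, replace = TRUE)
expo  <- runif(n, 0.1, 1)
nclaims <- rpois(n, expo * 0.05 * ifelse(age < 30, 2, 1))
tree <- rpart(cbind(expo, nclaims) ~ age + power, method = "poisson",
              control = rpart.control(cp = 1e-4, minbucket = 100))  # large tree
printcp(tree)                          # complexity table with cross-validated errors
small_tree <- prune(tree, cp = 0.01)   # pruning back controls the tree size
```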
Remark 1.6.1 Because of the high flexibility of the score, a large regression tree
is prone to overfit the training set. However, it turns out that the training sample
estimate of the generalization error will favor larger trees. To combat this issue, a
part of the observations of the training set can be used only to fit trees of different
sizes, with the remaining part of the observations of the training set used to estimate the
generalization error of these trees in order to select the best one. Using the K-fold
cross validation estimate of the generalization error, as discussed in the next chapter,
is also an alternative to avoid this issue.
In Chap. 4, we work with regression models assumed to be the average of M
regression trees, each built on a bootstrap sample of the training set. The goal is to
reduce the variance of the predictions obtained from a single tree (slight changes in
the training set can drastically change the structure of the tree fitted on it). That is,
we suppose β1 = β2 = . . . = βM = 1/M, so that we consider models of the form

μ(x) = score(x) = (1/M) ∑_{m=1}^{M} Tm(x),
In the subsequent chapters, devoted to boosting, the score is instead built as a sum of
trees, that is,

g(μ(x)) = score(x) = ∑_{m=1}^{M} Tm(x),
where g is an appropriate link function mapping the score to the range of the response
and T1 , . . . , TM are relatively small regression trees that will be sequentially fitted
on random subsamples of the training set. Note that the number of trees M and the
size of the trees will also be selected during the training procedure.
This chapter closely follows the book of Denuit et al. (2019). Precisely, we summarize
the first three chapters of Denuit et al. (2019) with the notions useful for this second
volume. We refer the reader to the first three chapters of Denuit et al. (2019) for
more details, as well as for an extensive overview of the literature. Section 1.6 gives
an overview of the methods used throughout this book. We refer the reader to the
bibliographic notes of the next chapters for more details about the corresponding
literature.
References
Denuit M, Hainaut D, Trufin J (2019) Effective statistical learning methods for actuaries I: GLMs
and extensions. Springer actuarial lecture notes
Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat
29(5):1189–1232
Chapter 2
Performance Evaluation
2.1 Introduction
In actuarial pricing, the objective is to evaluate the pure premium as accurately as pos-
sible. The target is thus the conditional expectation μ(X) = E[Y |X] of the response
Y (claim number or claim amount for instance) given the available information X.
The function x → μ(x) = E[Y |X = x] is generally unknown to the actuary and
is approximated by a working premium x → μ̂(x). The goal is to produce the most
accurate function μ̂(x). Lack of accuracy for μ̂(x) is defined by the generalization
error. Producing a model μ̂ whose predictions are as good as possible can be stated as
finding a model which minimizes its generalization error. In this chapter, we describe
the generalization error used throughout this book for model selection and model
assessment.
2.2 Generalization Error

2.2.1 Definition
We denote by
L = {(y1 , x 1 ), (y2 , x 2 ), . . . , (yn , x n )} (2.2.1)
the set of observations available to the insurer. This dataset is called the learning set.
We aim to find a model μ̂ built on the learning set L (or only on a part of L, called
training set, as discussed thereafter) which best approximates the true model μ.
Lack of accuracy for μ̂ is defined by the generalization error. The generalization
error, also known as expected prediction error, of μ̂ is defined as follows:

Err(μ̂) = E[L(Y, μ̂(X))],   (2.2.2)
where L(., .) is a function measuring the discrepancy between its two arguments,
called loss function.
The goal is thus to find a function of the covariates which best predicts the
response, that is, which minimizes the generalization error. The model performance
is evaluated according to the generalization error which depends on a predefined
loss function. The choice of an appropriate loss function in our ED family setting is
discussed thereafter.
Notice that the expectation in (2.2.2) is taken over all possible data, that is, with
respect to the probability distribution of the random vector (Y, X) assumed to be
independent of the data used to fit the model.
In practice, the generalization error is estimated from the available observations by
its empirical counterpart

Êrr(μ̂) = (1/n) ∑_{i=1}^{n} L(yi, μ̂i)   (2.2.3)

with μ̂i = μ̂(xi). In the ED family setting, the appropriate choice for the loss function
is related to the deviance. It suffices to observe that the regression model μ̂ maxi-
mizing the log-likelihood function L(μ̂) also minimizes the corresponding deviance
D(μ̂). Specifically, since the deviance D(μ̂) can be expressed as

D(μ̂) = 2 ∑_{i=1}^{n} νi [yi ((a′)^{−1}(yi) − (a′)^{−1}(μ̂i)) − a((a′)^{−1}(yi)) + a((a′)^{−1}(μ̂i))],   (2.2.4)

we see from (2.2.3) that the appropriate loss function in our ED family setting is
given by

L(yi, μ̂i) = 2νi [yi ((a′)^{−1}(yi) − (a′)^{−1}(μ̂i)) − a((a′)^{−1}(yi)) + a((a′)^{−1}(μ̂i))].   (2.2.5)
The constant 2 is not necessary and is there to make the loss function match the
deviance. Throughout this book, we use the choice (2.2.5) for the loss function.
2.2.3 Estimates
The performance of a model is evaluated through the generalization error Err(μ̂).
In practice, we usually do not know the probability distribution from which the
observations are drawn, making the direct evaluation of the generalization error
Err(μ̂) not feasible. Hence, the set of observations available to the insurer often
constitutes the only data on which the model needs to be fitted and its generalization
error estimated.
Suppose that the learning set L constitutes the only data available to the insurer.
When the whole learning set is used to fit the model μ̂, the generalization error
Err(μ̂) can only be estimated on the same data as the ones used to build the model,
that is,

Êrr^train(μ̂) = (1/n) ∑_{i=1}^{n} L(yi, μ̂(xi)).   (2.2.7)
This estimate is called the training sample estimate and has been introduced in (2.2.3).
In our setting, we thus have
Êrr^train(μ̂) = D(μ̂)/n.   (2.2.8)
Typically, the training sample estimate (2.2.7) will be less than the true generalization
error, because the same data is being used to fit the model and assess its error. A
model typically adapts to the data used to train it, and hence the training sample
estimate will be an overly optimistic estimate of the generalization error. This is
particularly true for tree-based models because of their high flexibility to adapt to
the training set: the resulting models are too closely fitted to the training set, which
is called overfitting.
The training sample estimate (2.2.7) directly evaluates the accuracy of the model on
the dataset used to build the model. While the training sample estimate is useful to
fit the model, the resulting estimate for the generalization error is likely to be very
optimistic since the model is precisely built to reduce it. This is of course an issue
when we aim to assess the predictive performance of the model, namely its accuracy
on new data.
As actuaries generally deal with massive amounts of data, a better approach is
to divide the learning set L into two disjoint sets D and D̄, called training set and
validation set, and to use the training set for fitting the model and the validation set for
estimating the generalization error of the model. The learning set is thus partitioned
into a training set

D = {(yi, xi); i ∈ I}

with I ⊂ {1, . . . , n} labelling the observations in D considered for fitting the model
and Ī = {1, . . . , n}\I labelling the remaining observations of L used to assess the
predictive accuracy of the model. The validation sample estimate of the generalization
error of the model μ̂ that has been built on the training set D is then given by
Êrr^val(μ̂) = (1/|Ī|) ∑_{i∈Ī} L(yi, μ̂(xi)) = D^val(μ̂)/|Ī|,   (2.2.9)

while its training sample counterpart becomes

Êrr^train(μ̂) = (1/|I|) ∑_{i∈I} L(yi, μ̂(xi)) = D^train(μ̂)/|I|,   (2.2.10)
where we denote by D^train(μ̂) the deviance computed from the observations of the
training set (also called in-sample deviance) and by D^val(μ̂) the deviance computed
from the observations of the validation set (also called out-of-sample deviance). As
a rule-of-thumb, the training set usually represents 80% of the learning set and the
validation set the remaining 20%. Of course, this allocation depends on the problem
under consideration. In any case, the splitting of the learning set must be done in a
way that observations in the training set can be considered independent from those
in the validation set and drawn from the same population. Usually, this is guaranteed
by drawing both sets at random from the learning set.
Training and validation sets should be as homogeneous as possible. Creating
those two sets by taking simple random samples, as mentioned above, is usually
sufficient to guarantee similar data sets. However, in some cases, the distribution
of the response can be quite different between the training and validation sets. For
instance, consider the annual number of claims in MTPL insurance. Typically, the
vast majority of the policyholders makes no claim over the year (say 95%). Some
policyholders experience one claim (say 4%) while only a few of them have more
than one claim (say 1% with two claims). In such a situation, because the proportions
of policyholders with one or two claims are small compared to the proportion of
policyholders with no claim, the distribution of the response can be very different
between the training and validation sets.
To address this potential issue, random sampling can be applied within subgroups,
a subgroup being a set of observations with the same response. In our example, we
would thus have three subgroups: a first one made of the observations with no claim
(95% of the observations), a second one composed of observations with one claim
(4% of the observations) and a third one with remaining observations (1% of the
observations). Applying the randomization within these subgroups is called stratified
random sampling.
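A minimal R sketch of such a stratified 80/20 split is given below, assuming the 95/4/1 claim-count proportions mentioned above.

```r
# Stratified random sampling: 80% of each response subgroup goes to the training set.
set.seed(1)
n <- 10000
nclaims <- sample(0:2, n, replace = TRUE, prob = c(0.95, 0.04, 0.01))
train_id <- unlist(lapply(split(seq_len(n), nclaims),
                          function(idx) sample(idx, size = round(0.8 * length(idx)))))
valid_id <- setdiff(seq_len(n), train_id)
prop.table(table(nclaims[train_id]))   # close to the overall 95/4/1 proportions
prop.table(table(nclaims[valid_id]))
```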
Another possibility is K-fold cross validation. The learning set L is randomly
partitioned into K (roughly) equal-sized subsets L1, . . . , LK, with Ik labelling the
observations in Lk. For each k, a model μ̂_{L\Lk} is fitted on L\Lk, and its
generalization error is estimated on the held-out fold Lk as

Êrr^val(μ̂_{L\Lk}) = (1/|Ik|) ∑_{i∈Ik} L(yi, μ̂_{L\Lk}(xi)).   (2.2.11)
For the model μ̂_{L\Lk}, the set of observations L\Lk plays the role of training set while
Lk plays the role of validation set. The generalization error can then be estimated as
the weighted average of the estimates Êrr^val(μ̂_{L\Lk}) given in (2.2.11), that is,
Êrr^CV(μ̂) = ∑_{k=1}^{K} (|Ik|/n) Êrr^val(μ̂_{L\Lk})
           = (1/n) ∑_{k=1}^{K} ∑_{i∈Ik} L(yi, μ̂_{L\Lk}(xi)).   (2.2.12)
The idea behind the K-fold cross validation estimate Êrr^CV(μ̂) is that each model
μ̂_{L\Lk} should be close enough to the model μ̂ fitted on the whole learning set L.
Therefore, the estimates Êrr^val(μ̂_{L\Lk}) given in (2.2.11), which are unbiased estimates
of Err(μ̂_{L\Lk}), should also be close enough to Err(μ̂).
Contrary to the validation sample estimate, the K-fold cross validation estimate
uses every observation (yi , x i ) in the learning set L for estimating the generalization
error. Typically, K is fixed to 10, which appears to be a value that produces stable
and reliable estimates. However, the K-fold cross validation estimate is more com-
putationally intensive since it requires fitting K models, while only one model needs
to be fitted for computing its validation sample counterpart.
Note that the K partitions can be chosen in a way that makes the subsets
L1 , . . . , L K balanced with respect to the response. Applying stratified random sam-
pling as discussed in Sect. 2.2.3.2 produces folds that have similar distributions for
the response.
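The following R sketch implements (2.2.12) for a simple Poisson GLM on simulated data; it is only meant to make the mechanics of K-fold cross validation explicit.

```r
# 10-fold cross validation estimate of the generalization error, Poisson deviance loss.
set.seed(1)
n <- 5000; K <- 10
dat <- data.frame(x = rnorm(n))
dat$y <- rpois(n, exp(-2 + 0.3 * dat$x))
fold <- sample(rep(1:K, length.out = n))
pois_dev <- function(y, mu) 2 * (ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
cv_losses <- unlist(lapply(1:K, function(k) {
  fit <- glm(y ~ x, family = poisson, data = dat[fold != k, ])
  mu  <- predict(fit, newdata = dat[fold == k, ], type = "response")
  pois_dev(dat$y[fold == k], mu)
}))
mean(cv_losses)   # K-fold estimate of the generalization error, as in (2.2.12)
```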
Model selection consists in choosing the best model among different models pro-
duced by a training procedure, say the final model, while model assessment consists
in assessing the generalization error of this final model. In practice, we do both model
selection and model assessment.
Model assessment should be performed on data that are kept out of the entire
training procedure (which includes the fit of the different models together with model
selection). Ideally, the training set D should be entirely dedicated to the training
procedure and the validation set D to model assessment.
As part of the training procedure, model selection is thus based on observations
from the training set. To guarantee an unbiased estimate of the generalization error
for each model under consideration during model selection, a possibility is to divide
in its turn the training set into two parts: a part used to fit the models and another
(sometimes called test set) to estimate the corresponding generalization errors. Of
course, this approach supposes a data-rich situation. If there are insufficient observations
in the training set to make this split, another
possibility for model selection consists in relying on K-fold cross validation estimates
as described above, using the training set D instead of the entire learning set L.
2.2.4 Decomposition
Consider that the loss function is the squared error loss. In our ED family setting, this amounts to assuming that the responses are normally distributed. The generalization error of model $\hat{\mu}$ at X = x becomes

$$\begin{aligned}
\mathrm{Err}(\hat{\mu}(\boldsymbol{x})) &= E\left[(Y-\hat{\mu}(\boldsymbol{x}))^2 \mid \boldsymbol{X}=\boldsymbol{x}\right]\\
&= E\left[(Y-\mu(\boldsymbol{x})+\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x}))^2 \mid \boldsymbol{X}=\boldsymbol{x}\right]\\
&= E\left[(Y-\mu(\boldsymbol{x}))^2 \mid \boldsymbol{X}=\boldsymbol{x}\right] + E\left[(\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x}))^2 \mid \boldsymbol{X}=\boldsymbol{x}\right]\\
&\quad + 2\,E\left[(Y-\mu(\boldsymbol{x}))(\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x})) \mid \boldsymbol{X}=\boldsymbol{x}\right]\\
&= E\left[(Y-\mu(\boldsymbol{x}))^2 \mid \boldsymbol{X}=\boldsymbol{x}\right] + E\left[(\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x}))^2 \mid \boldsymbol{X}=\boldsymbol{x}\right]
\end{aligned}$$

since

$$\begin{aligned}
E\left[(Y-\mu(\boldsymbol{x}))(\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x})) \mid \boldsymbol{X}=\boldsymbol{x}\right]
&= (\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x}))\,E\left[Y-\mu(\boldsymbol{x}) \mid \boldsymbol{X}=\boldsymbol{x}\right]\\
&= (\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x}))\left(E[Y\mid\boldsymbol{X}=\boldsymbol{x}]-\mu(\boldsymbol{x})\right)\\
&= 0.
\end{aligned}$$

The first term is the generalization error $\mathrm{Err}(\mu(\boldsymbol{x}))$ of the true model at X = x, while the second term is the estimation error of model $\hat{\mu}$ at X = x, so that

$$\mathrm{Err}(\hat{\mu}(\boldsymbol{x})) = \mathrm{Err}(\mu(\boldsymbol{x})) + E\left[(\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x}))^2 \mid \boldsymbol{X}=\boldsymbol{x}\right]. \qquad(2.2.16)$$

Averaging over X, the generalization error of $\hat{\mu}$ thus writes

$$\mathrm{Err}(\hat{\mu}) = \mathrm{Err}(\mu) + E\left[(\mu(\boldsymbol{X})-\hat{\mu}(\boldsymbol{X}))^2\right]. \qquad(2.2.17)$$

The further our model $\hat{\mu}$ is from the true one μ, the larger the generalization error. The generalization error of the true model is called the residual error and is irreducible. Indeed, we have

$$\mathrm{Err}(\hat{\mu}) \ge \mathrm{Err}(\mu),$$

which means that the smallest generalization error coincides with the one associated with the true model.
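The decomposition (2.2.17) is easy to check numerically. The following small simulation is ours (the true regression function, the misestimated model and the sample size are arbitrary choices for illustration, not taken from the book).

# Monte Carlo check of Err(mu_hat) = Err(mu) + E[(mu(X) - mu_hat(X))^2]
# under the squared error loss with normally distributed responses.
set.seed(42)
n  <- 1e6
x  <- runif(n)                        # a single feature, for simplicity
mu <- 1 + 2 * x                       # true regression function mu(x)
y  <- rnorm(n, mean = mu, sd = 1)     # normally distributed responses
mu_hat <- function(x) 1.2 + 1.8 * x   # some fixed (mis)estimated model

err_model <- mean((y - mu_hat(x))^2)  # Err(mu_hat)
err_true  <- mean((y - mu)^2)         # Err(mu), the residual error (about 1)
est_error <- mean((mu - mu_hat(x))^2) # estimation error (about 0.013)
err_model; err_true + est_error       # both sides of (2.2.17) nearly coincide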
Consider that the loss function is the Poisson deviance. This choice is appropriate when the responses are assumed to be Poisson distributed, as when examining the number of claims for instance. The generalization error of model $\hat{\mu}$ at X = x is then given by

$$\begin{aligned}
\mathrm{Err}(\hat{\mu}(\boldsymbol{x})) &= 2\,E\left[Y\ln\frac{Y}{\hat{\mu}(\boldsymbol{x})} - (Y-\hat{\mu}(\boldsymbol{x})) \,\middle|\, \boldsymbol{X}=\boldsymbol{x}\right]\\
&= 2\,E\left[Y\ln\frac{Y}{\mu(\boldsymbol{x})} - (Y-\mu(\boldsymbol{x})) \,\middle|\, \boldsymbol{X}=\boldsymbol{x}\right]
+ 2\,E\left[\hat{\mu}(\boldsymbol{x}) - \mu(\boldsymbol{x}) - Y\ln\frac{\hat{\mu}(\boldsymbol{x})}{\mu(\boldsymbol{x})} \,\middle|\, \boldsymbol{X}=\boldsymbol{x}\right]\\
&= 2\,E\left[Y\ln\frac{Y}{\mu(\boldsymbol{x})} - (Y-\mu(\boldsymbol{x})) \,\middle|\, \boldsymbol{X}=\boldsymbol{x}\right]
+ 2\,(\hat{\mu}(\boldsymbol{x}) - \mu(\boldsymbol{x})) - 2\,E[Y\mid\boldsymbol{X}=\boldsymbol{x}]\,\ln\frac{\hat{\mu}(\boldsymbol{x})}{\mu(\boldsymbol{x})}.
\end{aligned}$$

Since $E[Y\mid\boldsymbol{X}=\boldsymbol{x}] = \mu(\boldsymbol{x})$, this can be rewritten as

$$\mathrm{Err}(\hat{\mu}(\boldsymbol{x})) = \mathrm{Err}(\mu(\boldsymbol{x})) + \mathcal{E}^{P}(\hat{\mu}(\boldsymbol{x})), \qquad(2.2.18)$$

where the estimation error is

$$\mathcal{E}^{P}(\hat{\mu}(\boldsymbol{x})) = 2\mu(\boldsymbol{x})\left(\frac{\hat{\mu}(\boldsymbol{x})}{\mu(\boldsymbol{x})} - 1 - \ln\frac{\hat{\mu}(\boldsymbol{x})}{\mu(\boldsymbol{x})}\right).$$

Notice that $\mathcal{E}^{P}(\hat{\mu}(\boldsymbol{x}))$ is always positive because $y \mapsto y - 1 - \ln y$ is positive on $\mathbb{R}_+$, so that we have

$$\mathrm{Err}(\hat{\mu}) \ge \mathrm{Err}(\mu).$$
Consider the Gamma deviance loss. This choice is often made when we study claim severities for instance. The generalization error of model $\hat{\mu}$ at X = x is then given by

$$\begin{aligned}
\mathrm{Err}(\hat{\mu}(\boldsymbol{x})) &= 2\,E\left[-\ln\frac{Y}{\hat{\mu}(\boldsymbol{x})} + \frac{Y}{\hat{\mu}(\boldsymbol{x})} - 1 \,\middle|\, \boldsymbol{X}=\boldsymbol{x}\right]\\
&= 2\,E\left[-\ln\frac{Y}{\mu(\boldsymbol{x})} + \frac{Y}{\mu(\boldsymbol{x})} - 1 \,\middle|\, \boldsymbol{X}=\boldsymbol{x}\right]
+ 2\,E\left[-\ln\frac{Y}{\hat{\mu}(\boldsymbol{x})} + \ln\frac{Y}{\mu(\boldsymbol{x})} + \frac{Y}{\hat{\mu}(\boldsymbol{x})} - \frac{Y}{\mu(\boldsymbol{x})} \,\middle|\, \boldsymbol{X}=\boldsymbol{x}\right]\\
&= \mathrm{Err}(\mu(\boldsymbol{x})) + 2\,E\left[\ln\frac{\hat{\mu}(\boldsymbol{x})}{\mu(\boldsymbol{x})} + Y\,\frac{\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x})}{\mu(\boldsymbol{x})\hat{\mu}(\boldsymbol{x})} \,\middle|\, \boldsymbol{X}=\boldsymbol{x}\right]\\
&= \mathrm{Err}(\mu(\boldsymbol{x})) + 2\left(\ln\frac{\hat{\mu}(\boldsymbol{x})}{\mu(\boldsymbol{x})} + E[Y\mid\boldsymbol{X}=\boldsymbol{x}]\,\frac{\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x})}{\mu(\boldsymbol{x})\hat{\mu}(\boldsymbol{x})}\right)\\
&= \mathrm{Err}(\mu(\boldsymbol{x})) + 2\left(\ln\frac{\hat{\mu}(\boldsymbol{x})}{\mu(\boldsymbol{x})} + \frac{\mu(\boldsymbol{x})-\hat{\mu}(\boldsymbol{x})}{\hat{\mu}(\boldsymbol{x})}\right)\\
&= \mathrm{Err}(\mu(\boldsymbol{x})) + 2\left(\frac{\mu(\boldsymbol{x})}{\hat{\mu}(\boldsymbol{x})} - 1 - \ln\frac{\mu(\boldsymbol{x})}{\hat{\mu}(\boldsymbol{x})}\right) \qquad(2.2.20)
\end{aligned}$$

since $E[Y\mid\boldsymbol{X}=\boldsymbol{x}] = \mu(\boldsymbol{x})$. The generalization error $\mathrm{Err}(\hat{\mu})$ thus writes as the sum of the generalization error of the true model and an estimation error $E[\mathcal{E}^{G}(\hat{\mu}(\boldsymbol{X}))]$, where

$$\mathcal{E}^{G}(\hat{\mu}(\boldsymbol{x})) = 2\left(\frac{\mu(\boldsymbol{x})}{\hat{\mu}(\boldsymbol{x})} - 1 - \ln\frac{\mu(\boldsymbol{x})}{\hat{\mu}(\boldsymbol{x})}\right),$$

that is,

$$\mathrm{Err}(\hat{\mu}) = \mathrm{Err}(\mu) + E[\mathcal{E}^{G}(\hat{\mu}(\boldsymbol{X}))]. \qquad(2.2.21)$$

Note that $\mathcal{E}^{G}(\hat{\mu}(\boldsymbol{x}))$ is always positive since we have already noticed in the Poisson case that $y \mapsto y - 1 - \ln y$ is positive on $\mathbb{R}_+$, so that we have

$$\mathrm{Err}(\hat{\mu}) \ge \mathrm{Err}(\mu).$$
2.3 Expected Generalization Error
The model $\hat{\mu}$ under consideration is estimated on the training set D so that it depends on D. To make the dependence on the training set explicit, we use from now on both notations $\hat{\mu}$ and $\hat{\mu}_{\mathcal{D}}$ for the model under interest. We assume for the time being that a given training set corresponds to only one model, that is, we consider training procedures that are said to be deterministic. Training procedures that can produce different models for a fixed training set are discussed in Sect. 2.4.
The generalization error Err ( μD ) is evaluated conditional on the training set. That
is, the model μD under study is first fitted on the training set D before computing the
expectation over all possible observations independently from the training set D. In
that sense, the generalization error Err ( μD ) gives an idea of the general accuracy of
the training procedure for the particular training set D. In order to study the general
behavior of our training procedure, and not only its behavior for a specific training
set, it is interesting to evaluate the training procedure on different training sets of the
same size.
The training set D is itself a random variable sampled from a distribution usually
unknown in practice, so that the generalization error Err ( μD ) is in its turn a random
variable. In order to study the general performance of the training procedure, it is
then of interest to take the average of the generalization error Err ( μD ) over D, that
is, to work with the expected generalization error ED [Err ( μD )] over the models
learned from all possible training sets and produced with the training procedure under
investigation.
The expected generalization error is thus given by

$$E_{\mathcal{D}}\left[\mathrm{Err}(\hat{\mu}_{\mathcal{D}})\right] = E_{\mathcal{D}}\left[E_{\boldsymbol{X}}\left[\mathrm{Err}(\hat{\mu}_{\mathcal{D}}(\boldsymbol{X}))\right]\right], \qquad(2.3.1)$$

or, interchanging the order of the expectations,

$$E_{\mathcal{D}}\left[\mathrm{Err}(\hat{\mu}_{\mathcal{D}})\right] = E_{\boldsymbol{X}}\left[E_{\mathcal{D}}\left[\mathrm{Err}(\hat{\mu}_{\mathcal{D}}(\boldsymbol{X}))\right]\right]. \qquad(2.3.2)$$
When the loss function is the squared error loss, we know from Eq. (2.2.16) that the generalization error at X = x writes

$$\mathrm{Err}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right) = \mathrm{Err}(\mu(\boldsymbol{x})) + \left(\mu(\boldsymbol{x}) - \hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)^2.$$

The true model μ is independent of the training set, and so is the generalization error $\mathrm{Err}(\mu(\boldsymbol{x}))$. The expected generalization error of $\hat{\mu}$ at X = x is then given by

$$E_{\mathcal{D}}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)\right] = \mathrm{Err}(\mu(\boldsymbol{x})) + E_{\mathcal{D}}\left[\left(\mu(\boldsymbol{x}) - \hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)^2\right].$$

The first term is the local generalization error of the true model while the second term is the expected estimation error at X = x, which can be re-expressed as

$$\begin{aligned}
E_{\mathcal{D}}\left[\left(\mu(\boldsymbol{x}) - \hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)^2\right]
&= E_{\mathcal{D}}\left[\left(\mu(\boldsymbol{x}) - E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})] + E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})] - \hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)^2\right]\\
&= E_{\mathcal{D}}\left[\left(\mu(\boldsymbol{x}) - E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})]\right)^2\right] + E_{\mathcal{D}}\left[\left(E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})] - \hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)^2\right]\\
&\quad + 2\,E_{\mathcal{D}}\left[\left(\mu(\boldsymbol{x}) - E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})]\right)\left(E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})] - \hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)\right]\\
&= \left(\mu(\boldsymbol{x}) - E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})]\right)^2 + E_{\mathcal{D}}\left[\left(E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})] - \hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)^2\right]
\end{aligned}$$

since

$$\begin{aligned}
E_{\mathcal{D}}\left[\left(\mu(\boldsymbol{x}) - E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})]\right)\left(E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})] - \hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)\right]
&= \left(\mu(\boldsymbol{x}) - E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})]\right) E_{\mathcal{D}}\left[E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})] - \hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right]\\
&= \left(\mu(\boldsymbol{x}) - E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})]\right)\left(E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})] - E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})]\right)\\
&= 0. \qquad(2.3.4)
\end{aligned}$$

Therefore, the expected generalization error at X = x writes

$$E_{\mathcal{D}}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)\right] = \mathrm{Err}(\mu(\boldsymbol{x})) + \left(\mu(\boldsymbol{x}) - E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})]\right)^2 + E_{\mathcal{D}}\left[\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x}) - E_{\mathcal{D}}[\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})]\right)^2\right]. \qquad(2.3.5)$$
In the case of the Poisson deviance loss, we know from Eq. (2.2.18) that the local generalization error writes

$$\mathrm{Err}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right) = \mathrm{Err}(\mu(\boldsymbol{x})) + \mathcal{E}^{P}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right),$$

where

$$\mathcal{E}^{P}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right) = 2\mu(\boldsymbol{x})\left(\frac{\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})}{\mu(\boldsymbol{x})} - 1 - \ln\frac{\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})}{\mu(\boldsymbol{x})}\right). \qquad(2.3.8)$$

Because the true model μ is independent of the training set, the expected generalization error $E_{\mathcal{D}}[\mathrm{Err}(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x}))]$ can be expressed as

$$E_{\mathcal{D}}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)\right] = \mathrm{Err}(\mu(\boldsymbol{x})) + E_{\mathcal{D}}\left[\mathcal{E}^{P}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)\right], \qquad(2.3.9)$$

with

$$E_{\mathcal{D}}\left[\mathcal{E}^{P}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)\right] = 2\mu(\boldsymbol{x})\left(E_{\mathcal{D}}\left[\frac{\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})}{\mu(\boldsymbol{x})}\right] - 1 - E_{\mathcal{D}}\left[\ln\frac{\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})}{\mu(\boldsymbol{x})}\right]\right). \qquad(2.3.10)$$

Locally, the expected generalization error is thus equal to the generalization error of the true model plus the expected estimation error, which can be attributed to the bias and the estimation fluctuation. Notice that the expected estimation error $E_{\mathcal{D}}[\mathcal{E}^{P}(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x}))]$ is positive since we have seen that the estimation error $\mathcal{E}^{P}(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x}))$ is always positive. The generalization error of the true model is again a theoretical lower bound for the expected generalization error.

From (2.3.9) and (2.3.10), the expected generalization error writes

$$E_{\mathcal{D}}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D}}\right)\right] = \mathrm{Err}(\mu) + 2\,E_{\boldsymbol{X}}\left[\mu(\boldsymbol{X})\left(E_{\mathcal{D}}\left[\frac{\hat{\mu}_{\mathcal{D}}(\boldsymbol{X})}{\mu(\boldsymbol{X})}\right] - 1 - E_{\mathcal{D}}\left[\ln\frac{\hat{\mu}_{\mathcal{D}}(\boldsymbol{X})}{\mu(\boldsymbol{X})}\right]\right)\right]. \qquad(2.3.11)$$
In the Gamma case, Eq. (2.2.20) tells us that the local generalization error is given by

$$\mathrm{Err}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right) = \mathrm{Err}(\mu(\boldsymbol{x})) + \mathcal{E}^{G}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right), \qquad(2.3.12)$$

where

$$\mathcal{E}^{G}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right) = 2\left(\frac{\mu(\boldsymbol{x})}{\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})} - 1 - \ln\frac{\mu(\boldsymbol{x})}{\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})}\right). \qquad(2.3.13)$$

Because the true model is independent of the training set, the expected generalization error can again be expressed as

$$E_{\mathcal{D}}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)\right] = \mathrm{Err}(\mu(\boldsymbol{x})) + E_{\mathcal{D}}\left[\mathcal{E}^{G}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)\right], \qquad(2.3.14)$$

with

$$E_{\mathcal{D}}\left[\mathcal{E}^{G}\left(\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})\right)\right] = 2\left(E_{\mathcal{D}}\left[\frac{\mu(\boldsymbol{x})}{\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})}\right] - 1 - E_{\mathcal{D}}\left[\ln\frac{\mu(\boldsymbol{x})}{\hat{\mu}_{\mathcal{D}}(\boldsymbol{x})}\right]\right). \qquad(2.3.15)$$
A training procedure which always produces the same model μD for a given training
set D (and fixed values for the tuning parameters) is said to be deterministic. This is
the case for instance of regression trees studied in Chap. 3.
There also exist randomized training procedures that can produce different models
for a fixed training set (and fixed values for the tuning parameters), such as random
forests and boosting trees discussed in Chaps. 4 and 5. In order to account for the
randomness of the training procedure, we introduce a random vector, denoted Θ hereafter, which is assumed to fully capture the randomness of the training procedure. The model $\hat{\mu}$ resulting from the randomized training procedure depends on the training set D and on Θ, and is denoted $\hat{\mu}_{\mathcal{D},\Theta}$.
The expected generalization error then writes

$$E_{\mathcal{D},\Theta}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D},\Theta}\right)\right] = E_{\boldsymbol{X}}\left[E_{\mathcal{D},\Theta}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{X})\right)\right]\right]. \qquad(2.4.2)$$

Again, we can first determine the expected local error $E_{\mathcal{D},\Theta}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{X})\right)\right]$ in order to get the expected generalization error.
Taking into account the additional source of randomness in the training procedure, expressions (2.3.5), (2.3.9) and (2.3.14) become respectively

$$E_{\mathcal{D},\Theta}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})\right)\right] = \mathrm{Err}(\mu(\boldsymbol{x})) + \left(\mu(\boldsymbol{x}) - E_{\mathcal{D},\Theta}\left[\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})\right]\right)^2 + E_{\mathcal{D},\Theta}\left[\left(\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x}) - E_{\mathcal{D},\Theta}\left[\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})\right]\right)^2\right], \qquad(2.4.3)$$

$$E_{\mathcal{D},\Theta}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})\right)\right] = \mathrm{Err}(\mu(\boldsymbol{x})) + E_{\mathcal{D},\Theta}\left[\mathcal{E}^{P}\left(\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})\right)\right], \qquad(2.4.4)$$

with

$$E_{\mathcal{D},\Theta}\left[\mathcal{E}^{P}\left(\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})\right)\right] = 2\mu(\boldsymbol{x})\left(E_{\mathcal{D},\Theta}\left[\frac{\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})}{\mu(\boldsymbol{x})}\right] - 1 - E_{\mathcal{D},\Theta}\left[\ln\frac{\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})}{\mu(\boldsymbol{x})}\right]\right), \qquad(2.4.5)$$

and

$$E_{\mathcal{D},\Theta}\left[\mathrm{Err}\left(\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})\right)\right] = \mathrm{Err}(\mu(\boldsymbol{x})) + E_{\mathcal{D},\Theta}\left[\mathcal{E}^{G}\left(\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})\right)\right], \qquad(2.4.6)$$

with

$$E_{\mathcal{D},\Theta}\left[\mathcal{E}^{G}\left(\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})\right)\right] = 2\left(E_{\mathcal{D},\Theta}\left[\frac{\mu(\boldsymbol{x})}{\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})}\right] - 1 - E_{\mathcal{D},\Theta}\left[\ln\frac{\mu(\boldsymbol{x})}{\hat{\mu}_{\mathcal{D},\Theta}(\boldsymbol{x})}\right]\right). \qquad(2.4.7)$$
2.5 Bibliographic Notes and Further Reading
This chapter is mainly inspired from Louppe (2014) and the book of Hastie et al.
(2009). We also find inspiration from Wüthrich and Buser (2019) for the choice of
the loss function in our ED family setting as well as for the decomposition of the
generalization error in the Poisson case.
References
Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer Series in Statistics.
Louppe, G. (2014). Understanding random forests: From theory to practice. arXiv:1407.7502.
Wüthrich, M. V., Buser, C. (2019). Data analytics for non-life insurance pricing. Lecture notes.
Chapter 3
Regression Trees
3.1 Introduction
In this chapter, we present the regression trees introduced by Breiman et al. (1984).
Regression trees are at the core of this second volume. They are the building blocks of
the ensemble techniques described in Chaps. 4 and 5. We closely follow the seminal
book of Breiman et al. (1984). The presentation is also mainly inspired from Hastie
et al. (2009) and Wüthrich and Buser (2019).
A regression tree partitions the feature space χ into disjoint subsets $\{\chi_t\}_{t\in\mathcal{T}}$, where $\mathcal{T}$ is a set of indexes. On each subset $\chi_t$, the prediction $c_t$ of the response on that part of the feature space is assumed to be constant. The resulting prediction $\hat{\mu}(\boldsymbol{x})$ can be written as

$$\hat{\mu}(\boldsymbol{x}) = \sum_{t\in\mathcal{T}} c_t\, \mathrm{I}[\boldsymbol{x}\in\chi_t]. \qquad(3.2.1)$$
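As a small illustration (ours, with an arbitrary one-dimensional partition and arbitrary constants), the piecewise-constant prediction (3.2.1) can be coded as follows:

# prediction function of a toy "tree" partitioning ages into three subsets
predict_tree <- function(x) {
  # partition: chi_1 = [18, 30), chi_2 = [30, 45), chi_3 = [45, 66)
  # constants c_t chosen arbitrarily for illustration
  c_t <- c(0.16, 0.14, 0.11)
  ifelse(x < 30, c_t[1], ifelse(x < 45, c_t[2], c_t[3]))
}
predict_tree(c(22, 37, 60))   # returns 0.16 0.14 0.11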
χt3 ∪ χt4 = χt1 . The node t1 is the parent of nodes t3 and t4 that correspond to sub-
spaces χt3 and χt4 , while t3 and t4 are the left and right children of t1 , respectively.
The node t4 does not have children. Such a node is called terminal node or leaf of
the tree, and is represented by a rectangle box. Non-terminal nodes are indicated by
circles.
This process is continued until all nodes are designated terminals. One says that the
tree stops growing. In Fig. 3.1, the terminal nodes are t4 , t7 , t9 , t10 , t12 , t13 , t14 , t15 and
t16 . The corresponding subspaces are disjoint and form together a partition of the
feature space χ , namely χt4 ∪ χt7 ∪ χt9 ∪ χt10 ∪ χt12 ∪ χt13 ∪ χt14 ∪ χt15 ∪ χt16 = χ .
The set of indexes T is then given by {t4 , t7 , t9 , t10 , t12 , t13 , t14 , t15 , t16 }.
In each terminal node t ∈ T , the prediction of the response is denoted ct , that is,
μ(x) = ct for x ∈ χt .
Notice that each node t is split into a left child node t L and a right child node t R .
In a more general case, we could consider multiway splits, resulting in more than
two children nodes for t. Figure 3.2 compares a binary split with a multiway split
resulting in four children nodes. However, the problem with multiway splits is that
they partition the data too quickly, leaving insufficient observations at the next level
down. Also, since multiway splits can be achieved by a series of binary splits, the
latter are often preferred. Henceforth, we work with binary regression trees.
At each node t, the selection of the optimal split $s_t$ requires defining a candidate set $\mathcal{S}_t$ of possible binary splits and a goodness of split criterion to pick the best one. Let us assume that node t is composed of observations with k distinct values for x. The number of partitions of $\chi_t$ into two non-empty disjoint subsets $\chi_{t_L}$ and $\chi_{t_R}$ is then given by $2^{k-1} - 1$. Because of the exponential growth of the number of binary partitions with k, the strategy consisting in trying all partitions and keeping the best one is unrealistic, being often computationally intractable.
For this reason, the number of possible splits is restricted. Specifically, only stan-
dardized binary splits are usually considered. A standardized binary split is charac-
terized as follows:
1. It depends on the value of one single feature;
2. For an ordered feature $x_j$, it only allows questions of the form $x_j \le c$, where c is a constant;
3. For a categorical feature $x_j$, it only allows questions of the form $x_j \in C$, where C is a subset of the possible categories of $x_j$.
An ordered feature $x_j$ taking q different values $x_{j1}, \ldots, x_{jq}$ at node t generates $q - 1$ standardized binary splits at that node. The split questions are $x_j \le c_i$, $i = 1, \ldots, q-1$, where the constants $c_i$ are taken halfway between consecutive values $x_{ji}$ and $x_{j,i+1}$, that is $c_i = \frac{x_{ji} + x_{j,i+1}}{2}$.
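In R, these candidate split points are obtained in one line; the values below are ours, for illustration only.

x_j <- c(18, 21, 27, 35, 42)                       # distinct values at the node
c_i <- (head(sort(x_j), -1) + tail(sort(x_j), -1)) / 2
c_i                                                # 19.5 24.0 31.0 38.5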
For a categorical feature $x_j$ with q different values at node t, the number of standardized binary splits generated at that node is $2^{q-1} - 1$. Let us notice that trivial standardized binary splits are excluded, meaning splits of $\chi_t$ that generate a subset (either $\chi_{t_L}$ or $\chi_{t_R}$) with no observation. Therefore, at each node t, every possible standardized binary split is tested and the best split is selected by means of a goodness of split criterion. In the feature space, working with standardized binary splits amounts to considering only splits that are perpendicular to the coordinate axes.
Hence, the resulting regression trees recursively partition the feature space into hyperrectangles.
For example, let p = 2 with 18 ≤ x1 ≤ 100 and 0 ≤ x2 ≤ 20 and suppose that
the resulting tree is the one described in Fig. 3.3. The feature space is then partitioned
into the rectangles depicted in Fig. 3.4.
where $t_L(s)$ and $t_R(s)$ are the left and right children nodes of t resulting from split s, and $\hat{c}_{t_L(s)}$ and $\hat{c}_{t_R(s)}$ are the corresponding predictions. The optimal split $s_t$ then leads to children nodes $t_L(s_t)$ and $t_R(s_t)$, that we also denote $t_L$ and $t_R$. Notice that solving (3.2.2) amounts to finding $s \in \mathcal{S}_t$ that maximizes the decrease of the deviance at node t, namely

$$\max_{s\in\mathcal{S}_t}\; D_{\chi_t}(\hat{c}_t) - \left( D_{\chi_{t_L(s)}}\big(\hat{c}_{t_L(s)}\big) + D_{\chi_{t_R(s)}}\big(\hat{c}_{t_R(s)}\big) \right). \qquad(3.2.3)$$
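The following sketch (ours, not the book's) carries out the exhaustive search (3.2.3) for one ordered feature with the Poisson deviance as loss; all function and variable names, as well as the simulated data, are our own illustrative assumptions.

poisson_dev <- function(y, mu) {
  # Poisson deviance of a node with constant prediction mu
  sum(2 * (ifelse(y == 0, 0, y * log(y / mu)) - (y - mu)))
}
best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2          # candidate questions x <= c
  dev_node <- poisson_dev(y, mean(y))                 # deviance before splitting
  gains <- sapply(cuts, function(thr) {
    left <- y[x <= thr]; right <- y[x > thr]
    dev_node - (poisson_dev(left, mean(left)) + poisson_dev(right, mean(right)))
  })
  list(cut = cuts[which.max(gains)], gain = max(gains))
}
# Example: claim counts against policyholder age
set.seed(1)
age <- sample(18:65, 2000, replace = TRUE)
y   <- rpois(2000, ifelse(age < 45, 0.15, 0.10))
best_split(age, y)   # the selected cut should typically lie close to 44.5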
Remark 3.2.2 A tree starts from the whole feature space χ and is grown by iter-
atively dividing the subsets of χ into smaller subsets. This procedure consists in
dividing each node t using the optimal split st that locally maximizes the decrease of
the deviance. This greedy strategy could be improved by assessing the goodness of a split by also looking at splits deeper in the tree. However, such an approach is more time consuming and does not seem to significantly improve the model performance.
For every terminal node $t \in \mathcal{T}$, we need to compute the prediction $\hat{c}_t$. The deviance, equivalently denoted $D(\hat{\mu})$ and $D((\hat{c}_t)_{t\in\mathcal{T}})$ in the following, is given by

$$D(\hat{\mu}) = \sum_{t\in\mathcal{T}} D_{\chi_t}(\hat{c}_t).$$
Let $\mathcal{D} = \{(y_i, \boldsymbol{x}_i);\, i \in \mathcal{I}\}$ be the observations available to train our model. For a node t, we denote by $w_t$ the corresponding volume defined by

$$w_t = \sum_{i:\,\boldsymbol{x}_i\in\chi_t} e_i,$$

that is, the total exposure-to-risk in node t, and by

$$c_t = \frac{1}{w_t}\sum_{i:\,\boldsymbol{x}_i\in\chi_t} e_i\,\mu(\boldsymbol{x}_i) > 0$$

the corresponding exposure-weighted average of the true expected claim frequencies. The total number of claims $Z_t$ in node t is such that

$$Z_t \mid \cdots \;\sim\; \mathcal{P}oi(c_t w_t),$$

so that it comes
For more details about credibility theory and the latter calculations, we refer the interested reader to Bühlmann and Gisler (2005).

Therefore, given values for $c_t$ and γ, we can compute $E[c_t \mid Z_t]$, which is always positive, contrarily to $\hat{c}_t$ that can be zero. However, $c_t$ is obviously not known in practice. It could be replaced by $\hat{c}_t$ or $E[c_t \mid Z_t]$. The maximum-likelihood estimator $\hat{c}_t$ does not solve our original problem. Turning to $E[c_t \mid Z_t]$, we can compute it recursively as

$$\hat{c}_t^{\,\mathrm{Bayes},k} = \hat{\alpha}_t\, \hat{c}_t + (1-\hat{\alpha}_t)\, \hat{c}_t^{\,\mathrm{Bayes},k-1},$$
with starting value

$$\hat{c}_t^{\,\mathrm{Bayes},0} = \hat{c}_0 = \frac{\sum_{i\in\mathcal{I}} Y_i}{\sum_{i\in\mathcal{I}} e_i},$$

which is the maximum-likelihood estimator for the expected claim frequency without making any distinction between individuals. The remaining parameter γ, which enters the computation of the estimated credibility weights, still needs to be selected and is chosen externally.
Note that the R command rpart used in this chapter for the examples simplifies the recursive approach described above. It rather considers the estimator

$$\hat{c}_t^{\,\mathrm{rpart}} = \hat{\alpha}_t^{\,\mathrm{rpart}}\, \hat{c}_t + (1-\hat{\alpha}_t^{\,\mathrm{rpart}})\, \hat{c}_0$$

with

$$\hat{\alpha}_t^{\,\mathrm{rpart}} = \frac{w_t\, \hat{c}_0}{\gamma + w_t\, \hat{c}_0}.$$
Remark 3.2.3 Considering the same random effect in each terminal node introduces dependence between the leaves. One way to remedy this undesirable effect is to consider as many (independent) random effects as there are leaves in the tree. Notice that this latter approach requires the knowledge of the tree structure.
A node t is inevitably terminal when χt can no longer be split. This is the case
when the observations in node t share the same values for all the features, i.e. when
x i = x j for all (yi , x i ) and (y j , x j ) such that x i , x j ∈ χt . In such a case, splitting
node t would generate a subset with no observation. A node t is also necessarily
terminal when it contains observations with the same value for the response, i.e.
when yi = y j for all (yi , x i ) and (y j , x j ) such that x i , x j ∈ χt . In particular, this is
the case when node t contains only one observation.
Those stopping criteria are inherent to the recursive partitioning procedure. Such
inevitable terminal nodes lead to the biggest possible regression tree that we can
grow on the training set. However such a tree is likely to capture noise in the training
set and to cause overfitting.
In order to reduce the size of the tree and hence to prevent overfitting, these
stopping criteria that are inherent to the recursive partitioning procedure are com-
plemented with several rules. Three stopping rules that are commonly used can be
formulated as follows:
– A node t is declared terminal when it contains less than a fixed number of obser-
vations.
– A node t is declared terminal if at least one of its children nodes t L and t R that
results from the optimal split st contains less than a fixed number of observations.
– A node t is declared terminal when its depth is equal to a fixed maximal depth.
Notice that the depth of a node t is equal to d if it belongs to generation d + 1. For
instance, in Fig. 3.5, t0 has a depth of zero, t1 and t2 have a depth of one and terminal
nodes t3 , t4 , t5 and t6 have a depth of two.
Another common stopping rule consists of setting a threshold and deciding that a node t is terminal if the decrease in deviance obtained by splitting node t with the optimal split $s_t$ is less than this fixed threshold. Recall that splitting node t into children nodes $t_L$ and $t_R$ results in a decrease of the deviance given by

$$\Delta D_{\chi_t} = D_{\chi_t}(\hat{c}_t) - \left( D_{\chi_{t_L}}(\hat{c}_{t_L}) + D_{\chi_{t_R}}(\hat{c}_{t_R}) \right). \qquad(3.2.4)$$

It is interesting to notice that this deviance reduction $\Delta D_{\chi_t}$ is always positive. Indeed, we have

$$\begin{aligned}
D_{\chi_t}(\hat{c}_t) &= D_{\chi_{t_L}}(\hat{c}_t) + D_{\chi_{t_R}}(\hat{c}_t)\\
&\ge D_{\chi_{t_L}}(\hat{c}_{t_L}) + D_{\chi_{t_R}}(\hat{c}_{t_R})
\end{aligned}$$

since the maximum-likelihood estimates $\hat{c}_{t_L}$ and $\hat{c}_{t_R}$ minimize the deviances $D_{\chi_{t_L}}(c)$ and $D_{\chi_{t_R}}(c)$, respectively. As a consequence, if for a partition $\{\chi_t\}_{t\in\mathcal{T}}$ of χ we consider an additional standardized binary split $s_t$ of $\chi_t$ for a given $t \in \mathcal{T}$, yielding the new set of indexes $\mathcal{T}' = \mathcal{T}\backslash\{t\} \cup \{t_L, t_R\}$ for the terminal nodes, we necessarily have

$$D\left((\hat{c}_t)_{t\in\mathcal{T}}\right) \ge D\left((\hat{c}_t)_{t\in\mathcal{T}'}\right).$$
3.2.4 Examples
The examples in this chapter are done with the R package rpart, which stands for
recursive partitioning.
Being a male increases the expected claim frequency by 10%, drivers between 18 and 29 (resp. 30 and 44) years old have expected claim frequencies 40% (resp. 20%) larger than policyholders older than 45 years old, splitting the premium does not influence the expected claim frequency, while driving a sports car increases the expected claim frequency by 15%.
In this example, the true model μ(x) is known and we can simulate realizations
of the random vector (Y, X). Specifically, we generate n = 500 000 independent
realizations of (Y, X), that is, we consider a learning set made of 500 000 observations
(y1 , x 1 ), (y2 , x 2 ), . . . , (y500 000 , x 500 000 ). An observation represents a policy that has
been observed during a whole year. In Table 3.1, we provide the ten first observations
of the learning set. While the nine first policies made no claim over the past year,
the tenth policyholder, who is a 36 years old man with a sports car and paying his
premium annually, experienced two claims.
In this simulated dataset, the proportion of males is approximately 50%, as are the proportions of sports cars and of policyholders splitting their premiums. For each age 18, 19, . . . , 65, there are between 10 188 and 10 739 policyholders.
We now aim to estimate the expected claim frequency μ(x). To this end, we fit several trees on our simulated dataset with the Poisson deviance as loss function. Here, we do not divide the learning set into a training set and a validation set, so that the whole learning set is used to train the models. The R command used is

> tree <- rpart(Y ~ Gender + Age + Split + Sport,
                data = dataset,
                method = "poisson",
                control = rpart.control())

where data specifies the training set used to build the tree, method refers to the optimisation criterion applied at each split, here the Poisson deviance, and control allows one to control the size of the tree.
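The stopping rules discussed above map onto arguments of rpart.control. As a purely illustrative sketch (the argument values below are ours, not the book's), a tree with a maximum depth of three could be requested along the following lines, assuming library(rpart) is loaded and dataset contains the simulated observations:

> tree <- rpart(Y ~ Gender + Age + Split + Sport,
                data = dataset, method = "poisson",
                control = rpart.control(maxdepth = 3,   # maximum depth of any node
                                        minbucket = 1,  # min. observations in a terminal node
                                        cp = 0,         # disable the deviance-decrease threshold
                                        xval = 0))      # no built-in cross-validation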
Fig. 3.6 Tree with a maximum depth equal to three as stopping rule
A first tree with a maximum depth of three as stopping rule is built. A node t is
then terminal when its depth is equal to three, meaning that it belongs to the fourth
generation of nodes. The resulting tree is depicted in Fig. 3.6 and presented in more
details below:
> n= 500000
We start with n=500 000 observations in the training set. The nodes are num-
bered with the variable node). The node 1) is the root node, node 2) corresponds
to its left-child and node 3) to its right-child, and so on, such that node k) corre-
sponds to node tk−1 in our notations. For each node, the variable split specifies
the split criterion applied, n the number of observations in the node, deviance
the deviance at that node and yval the prediction (i.e. the estimate of the expected
claim frequency) in that node. Also, * denotes terminal nodes, that are in this case
nodes 8) to 15).
In particular, terminal node 14), which corresponds to node $t_{13}$, is obtained by using feature $x_4$ (Sport) with answer no. It contains $n_{t_{13}} = 62\,475$ observations, the deviance is $D_{\chi_{t_{13}}}(\hat{c}_{t_{13}}) = 37\,323.97$ and the corresponding estimated expected claim frequency is given by $\hat{c}_{t_{13}} = 0.1493856$. Its parent node is $t_6$ and the decrease of the deviance resulting from the split of $t_6$ into nodes $t_{13}$ and $t_{14}$ is

$$\begin{aligned}
\Delta D_{\chi_{t_6}} &= D_{\chi_{t_6}}(\hat{c}_{t_6}) - \left( D_{\chi_{t_{13}}}(\hat{c}_{t_{13}}) + D_{\chi_{t_{14}}}(\hat{c}_{t_{14}}) \right)\\
&= 77\,295.79 - (37\,323.97 + 39\,898.18)\\
&= 73.64.
\end{aligned}$$
In Fig. 3.6, each rectangle represents a node of the tree. In each node, one can see the estimated expected claim frequency, the number of claims, the number of observations as well as the proportion of the training set in the node. As an example, in node 3) (or $t_2$), we observe the estimated expected claim frequency $\hat{c}_{t_2} \approx 0.15$, the number of claims $\sum_{i:\,\boldsymbol{x}_i\in\chi_{t_2}} y_i \approx 41\,000$ and the number of observations $n_{t_2} \approx 281\,000$, which corresponds to approximately 56% of the training set. Below each non-terminal node t, we find the question that characterizes the split $s_t$. In case the answer is yes, one moves to the left-child node $t_L$, while when the answer is no, one moves to the right-child node $t_R$. Each node is topped by a small rectangle box containing the number of the node, which corresponds to the variable node). Terminal nodes are nodes 8) to 15) and belong to the fourth generation of nodes, as requested by the stopping rule. Finally, the darker the gray of a node, the higher the estimated expected claim frequency in that node.
The first split of the tree is defined by the question Age ≥ 44.5 (and not by Age ≥ 44 as suggested in Fig. 3.6). For feature Age, the set of possible questions is $x_2 \le c_k$ with constants $c_k = \frac{18 + 19 + 2(k-1)}{2}$, $k = 1, \ldots, 47$. The best split for the root node is thus $x_2 \le 44.5$. The feature Age is indeed the one that influences the expected claim frequency the most, up to a difference of 40% (resp. 20%) for policyholders with 18 ≤ Age < 30 (resp. 30 ≤ Age < 45) compared to those older than 45.
The left child node is node 2) and comprises policyholders with Age ≥ 45. In that node, the feature Age does not influence the expected claim frequency, while features Sport and Gender can lead to expected claim frequencies that differ by 15% and 10%, respectively. The feature Sport is then naturally selected to define the best split at that node. The two resulting nodes are 4) and 5), in which only the feature Gender still influences the expected claim frequency, hence the choice of Gender to perform both splits leading to terminal nodes.
Fig. 3.7 Tree with a maximum depth equal to four as stopping rule
The right child node of the root is node 3). In that node, the feature Age is again the preferred feature as it can still produce a difference of $16.67\% = \frac{1.4}{1.2} - 1$ in expected claim frequencies. The children nodes 6) and 7) are then in turn split with the feature Sport since, in both nodes, it yields differences in expected claim frequencies of 15% while the feature Gender yields differences of 10%. Notice that the feature Age is no longer relevant in these two nodes. The resulting nodes, namely 12), 13), 14) and 15), are terminal since they all belong to the fourth generation of nodes.
The tree that has been built with a maximum depth of three as stopping rule is not deep enough. Indeed, we notice that nodes 12), 13), 14) and 15) should still be split with the feature Gender since males have expected claim frequencies 10% higher than females. So, in this example, a maximum depth of three is too restrictive.

A tree with a maximum depth of four as stopping rule is then fitted. The resulting tree is shown in Fig. 3.7. The first four generations of nodes are obviously the same as in our previous tree depicted in Fig. 3.6. The only difference lies in the addition of a fifth generation of nodes, which enables splitting nodes 12), 13), 14) and 15) with the feature Gender, as desired. However, while nodes 8), 9), 10) and 11) were terminal nodes in our previous tree, they now become non-terminal nodes as they do not belong to the fifth generation. Therefore, they are all split in order to meet our stopping rule. For instance, node 8) is split with the feature Split, while we know that this feature does not influence the expected claim frequency. Therefore, while we improve our model on the right-hand side of the tree, i.e. node 3) and its children, we start to overfit our dataset on the left-hand side of the tree, i.e. node 2) and its descendants.
In this example, specifying the stopping rule only with respect to the maximum depth does not lead to satisfactory results. Instead, we could combine the rule of a maximum depth equal to four with a requirement of a minimum decrease of the deviance. In Table 3.2, we show the decrease of the deviance $\Delta D_{\chi_t}$ observed at each node t. As expected, we notice that nodes 8), 9), 10) and 11) have the four smallest values for $\Delta D_{\chi_t}$.

If we select a threshold somewhere between 3.52 and 16.80 for the minimum decrease of the deviance allowed to split a node, these being the values of $\Delta D_{\chi_t}$ at nodes 8) and 4), respectively, we get the optimal tree presented in Fig. 3.8, in which nodes 16) to 23) have disappeared compared to our previous tree shown in Fig. 3.7. Hence, nodes 8) to 11) are now terminal nodes, as desired. Finally, we then get twelve terminal nodes.

In Table 3.3, we show the terminal nodes with their corresponding expected claim frequencies μ(x) as well as their estimates $\hat{\mu}(\boldsymbol{x})$.
Table 3.3 Terminal nodes with their corresponding expected claim frequencies μ(x) and estimated expected claim frequencies μ̂(x)

Node k)   | Gender (x1) | Age (x2)          | Sport (x4) | μ(x)   | μ̂(x)
Node 8)   | Female      | x2 ≥ 44.5         | No         | 0.1000 | 0.1005
Node 9)   | Male        | x2 ≥ 44.5         | No         | 0.1100 | 0.1085
Node 10)  | Female      | x2 ≥ 44.5         | Yes        | 0.1150 | 0.1164
Node 11)  | Male        | x2 ≥ 44.5         | Yes        | 0.1265 | 0.1285
Node 24)  | Female      | 29.5 ≤ x2 < 44.5  | No         | 0.1200 | 0.1206
Node 25)  | Male        | 29.5 ≤ x2 < 44.5  | No         | 0.1320 | 0.1330
Node 26)  | Female      | 29.5 ≤ x2 < 44.5  | Yes        | 0.1380 | 0.1365
Node 27)  | Male        | 29.5 ≤ x2 < 44.5  | Yes        | 0.1518 | 0.1520
Node 28)  | Female      | x2 < 29.5         | No         | 0.1400 | 0.1422
Node 29)  | Male        | x2 < 29.5         | No         | 0.1540 | 0.1566
Node 30)  | Female      | x2 < 29.5         | Yes        | 0.1610 | 0.1603
Node 31)  | Male        | x2 < 29.5         | Yes        | 0.1771 | 0.1772
Fig. 3.9 Number of policies with respect to the exposure-to-risk expressed in months
Fig. 3.10 Exposure-to-risk with respect to the features Gender, Fuel, Use, Cover, Split, PowerCat, AgeCar and AgePh
Fig. 3.11 Tree obtained on the real dataset when requiring at least 5000 observations in the terminal nodes
With at least 5000 observations in each terminal node, an average exposure-to-risk of 0.89 and an average annual claim frequency $\hat{\mu} = 13.9\%$, a rough confidence interval for the annual claim frequency in a terminal node is

$$\left[\hat{\mu} - 2\sqrt{\frac{\hat{\mu}}{\bar{e}\times 5000}},\; \hat{\mu} + 2\sqrt{\frac{\hat{\mu}}{\bar{e}\times 5000}}\right]
= \left[13.9\% - 2\sqrt{\frac{13.9\%}{0.89\times 5000}},\; 13.9\% + 2\sqrt{\frac{13.9\%}{0.89\times 5000}}\right]
= [12.8\%,\, 15.0\%].$$
Roughly speaking, one sees that we obtain a precision of 1% in the average annual claim frequency $\hat{\mu}$, which can be considered as satisfactory here. Notice that if we had selected 1000 for the minimum number of observations in the terminal nodes, we would have obtained a precision of 2.5% in the average annual claim frequency $\hat{\mu}$.
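As a quick numerical check of the figures above (our own computation, with the values read from the display):

mu_hat <- 0.139; e_bar <- 0.89
mu_hat + c(-2, 2) * sqrt(mu_hat / (e_bar * 5000))   # about 0.128 and 0.150
2 * sqrt(mu_hat / (e_bar * 1000))                   # about 0.025, i.e. 2.5%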
We have seen several rules for declaring a node t terminal. These rules have in common that they stop the growth of the tree early. Another way to find the right sized tree consists in fully developing the tree and then pruning it.

Henceforth, to ease the presentation, we denote a regression tree by T. A tree T is defined by a set of splits together with the order in which they are used and the predictions in the terminal nodes. When we need to make the link with tree T explicit, we use the notation $\mathcal{T}_{(T)}$ for the corresponding set of indexes $\mathcal{T}$ of the terminal nodes. In addition, we mean by |T| the number of terminal nodes of tree T, that is $|\mathcal{T}_{(T)}|$.
In order to define the pruning process, one needs to specify the notion of a tree branch. A branch $T^{(t)}$ is the part of T that is composed of node t and all its descendant nodes. Pruning a branch $T^{(t)}$ of a tree T means deleting from T all descendant nodes of t. The resulting tree is denoted $T - T^{(t)}$. One says that $T - T^{(t)}$ is a pruned subtree of T. For instance, in Fig. 3.12, we represent a tree T, its branch $T^{(t_2)}$ and the subtree $T - T^{(t_2)}$ obtained from T by pruning the branch $T^{(t_2)}$. More generally, a tree T′ that is obtained from T by successively pruning branches is called a pruned subtree of T, or simply a subtree of T, and is denoted $T' \preceq T$.
The first step of the pruning process is to grow the largest possible tree $T_{\max}$ by letting the splitting procedure continue until all terminal nodes contain either observations with identical values for the features (i.e. terminal nodes $t \in \mathcal{T}$ where $\boldsymbol{x}_i = \boldsymbol{x}_j$ for all $(y_i, \boldsymbol{x}_i)$ and $(y_j, \boldsymbol{x}_j)$ such that $\boldsymbol{x}_i, \boldsymbol{x}_j \in \chi_t$) or observations with the same value for the response (i.e. terminal nodes $t \in \mathcal{T}$ where $y_i = y_j$ for all $(y_i, \boldsymbol{x}_i)$ and $(y_j, \boldsymbol{x}_j)$ such that $\boldsymbol{x}_i, \boldsymbol{x}_j \in \chi_t$).
Notice that the initial tree Tinit can be smaller than Tmax . Indeed, let us assume that
the pruning process starting with the largest tree Tmax produces the subtree Tprune .
Then, the pruning process will always lead to the same subtree Tprune if we start with
any subtree Tinit of Tmax such that Tprune is a subtree of Tinit .
We thus start the pruning process with a sufficiently large tree Tinit . Then, the idea
of pruning Tinit consists in constructing a sequence of smaller and smaller trees
(a) Tree T. (b) Branch $T^{(t_2)}$. (c) Subtree $T - T^{(t_2)}$.
Fig. 3.12 Example of a branch T (t2 ) for a tree T as well as the resulting subtree T − T (t2 ) when
pruning T (t2 )
Let us denote by $C(T_{\mathrm{init}}, k)$ the class of all subtrees of $T_{\mathrm{init}}$ having k terminal nodes. An intuitive procedure to produce the sequence of trees $T_{\mathrm{init}}, T_{|T_{\mathrm{init}}|-1}, \ldots, T_1$ is to select, for every $k = 1, \ldots, |T_{\mathrm{init}}| - 1$, the tree $T_k$ which minimizes the deviance $D\big((\hat{c}_t)_{t\in\mathcal{T}_{(T)}}\big)$ among all subtrees T of $T_{\mathrm{init}}$ with k terminal nodes, that is

$$D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T_k)}}\right) = \min_{T\in C(T_{\mathrm{init}},k)} D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T)}}\right).$$

Thus, for every $k = 1, \ldots, |T_{\mathrm{init}}| - 1$, $T_k$ is the best subtree of $T_{\mathrm{init}}$ with k terminal nodes according to the deviance loss function.
It is natural to use the deviance in comparing subtrees with the same number of terminal nodes. However, the deviance is not helpful for comparing the subtrees $T_{\mathrm{init}}, T_{|T_{\mathrm{init}}|-1}, T_{|T_{\mathrm{init}}|-2}, \ldots, T_1$ with each other. Indeed, as noticed in Sect. 3.2.3, since $T_k$ is a subtree of some tree $\widetilde{T}_{k+1} \in C(T_{\mathrm{init}}, k+1)$, we necessarily have

$$D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T_k)}}\right) \ge D\left((\hat{c}_t)_{t\in\mathcal{T}_{(\widetilde{T}_{k+1})}}\right) \ge D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T_{k+1})}}\right).$$
Therefore, the selection of the best subtree $T_{\mathrm{prune}}$ among the sequence $T_{\mathrm{init}}, T_{|T_{\mathrm{init}}|-1}, \ldots, T_1$ that is based on the deviance, or equivalently on the training sample estimate of the generalization error, will always lead to the largest tree $T_{\mathrm{init}}$. That is why the generalization errors of the pruned subtrees should rather be estimated on observations that were not used to grow them.

This intuitive procedure also has some drawbacks. One of them is to produce subtrees of $T_{\mathrm{init}}$ that are not nested, meaning that subtree $T_k$ is not necessarily a subtree of $T_{k+1}$. Hence, a node t of $T_{\mathrm{init}}$ can reappear in tree $T_k$ while it was cut off in tree $T_{k+1}$.
That is why the minimal cost-complexity pruning presented hereafter is usually
preferred.
For a tree T, the cost-complexity measure is defined as

$$R_\alpha(T) = D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T)}}\right) + \alpha|T|,$$

where the parameter α is a positive real number. The number of terminal nodes |T| is called the complexity of the tree T. Thus, the cost-complexity measure $R_\alpha(T)$ is a combination of the deviance $D\big((\hat{c}_t)_{t\in\mathcal{T}_{(T)}}\big)$ and a penalty for the complexity of the tree, α|T|. The parameter α can be interpreted as the increase in the penalty for having one more terminal node.
When we increase by one the number of terminal nodes of a tree T by splitting one of its terminal nodes t into two children nodes $t_L$ and $t_R$, then we know that the deviance of the resulting tree T′ is smaller than the deviance of the original tree T, that is,

$$D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T')}}\right) \le D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T)}}\right).$$

The deviance will always favor the more complex tree T′ over T. By introducing a penalty for the complexity of the tree, the cost-complexity measure may now prefer the original tree T over the more complex one T′. Indeed, the cost-complexity measure of T′ can be written as

$$\begin{aligned}
R_\alpha(T') &= D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T')}}\right) + \alpha|T'|\\
&= D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T')}}\right) + \alpha(|T| + 1)\\
&= R_\alpha(T) + D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T')}}\right) - D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T)}}\right) + \alpha.
\end{aligned}$$

Hence, depending on the value of α, the more complex tree T′ may have a higher cost-complexity measure than T. We have $R_\alpha(T') \ge R_\alpha(T)$ if and only if

$$\alpha \ge D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T)}}\right) - D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T')}}\right). \qquad(3.3.1)$$
This additional condition says that if there is a tie, namely more than one subtree of
Tinit minimizing Rα (T ), then we select for T (α) the smallest tree, that is the one that
is a subtree of all others satisfying (3.3.2). The resulting subtrees T (α) are called the
smallest minimizing subtrees.
It is obvious that for every value of α there is at least one subtree of Tinit that
minimizes Rα (T ) since there are only finitely many pruned subtrees of Tinit . However,
it is not clear whether the additional condition (3.3.3) can be met for every value of
α. Indeed, this says that we cannot have two subtrees that minimize Rα (T ) such that
neither is a subtree of the other. This is guaranteed by the next proposition.
Proposition 3.3.1 For every value of α, there exists a smallest minimizing subtree
T (α).
Proof We refer the interested reader to Sect. 10.2 (Theorem 10.7) in Breiman et al.
(1984).
Thus, for every value of α ≥ 0, there exists a unique subtree T(α) of $T_{\mathrm{init}}$ that minimizes $R_\alpha(T)$ and which satisfies $T(\alpha) \preceq T$ for all subtrees T minimizing $R_\alpha(T)$.
The large tree Tinit has only a finite number of subtrees. Hence, even if α goes
from zero to infinity in a continuous way, the set of the smallest minimizing subtrees
{T (α)}α≥0 only contains a finite number of subtrees of Tinit .
Let α = 0. We start from $T_{\mathrm{init}}$ and we find any pair of terminal nodes with a common parent node t such that the branch $T^{(t)}_{\mathrm{init}}$ can be pruned without increasing the cost-complexity measure. We continue until we can no longer find such a pair, in order to obtain a subtree of $T_{\mathrm{init}}$ with the same cost-complexity measure as $T_{\mathrm{init}}$ for α = 0. We define $\alpha_0 = 0$ and we denote by $T_{\alpha_0}$ the resulting subtree of $T_{\mathrm{init}}$.

When α increases, it may become optimal to prune the branch $T^{(t)}_{\alpha_0}$ for a certain node t of $T_{\alpha_0}$, meaning that the smaller tree $T_{\alpha_0} - T^{(t)}_{\alpha_0}$ becomes better than $T_{\alpha_0}$. This will be the case once α is high enough to have

$$R_\alpha\left(T_{\alpha_0}\right) \ge R_\alpha\left(T_{\alpha_0} - T^{(t)}_{\alpha_0}\right).$$
Notice that the deviance of $T_{\alpha_0}$ can be decomposed as

$$D\left((\hat{c}_s)_{s\in\mathcal{T}_{(T_{\alpha_0})}}\right) = D\left((\hat{c}_s)_{s\in\mathcal{T}_{(T_{\alpha_0}-T^{(t)}_{\alpha_0})}}\right) + D\left((\hat{c}_s)_{s\in\mathcal{T}_{(T^{(t)}_{\alpha_0})}}\right) - D\left((\hat{c}_s)_{s\in\{t\}}\right).$$

Furthermore, we have $R_\alpha(T_{\alpha_0}) \ge R_\alpha\big(T_{\alpha_0} - T^{(t)}_{\alpha_0}\big)$ if and only if

$$D\left((\hat{c}_s)_{s\in\mathcal{T}_{(T^{(t)}_{\alpha_0})}}\right) + \alpha\,\big|T^{(t)}_{\alpha_0}\big| \;\ge\; D\left((\hat{c}_s)_{s\in\{t\}}\right) + \alpha. \qquad(3.3.5)$$
The left-hand side of (3.3.5) is the cost-complexity measure of the branch $T^{(t)}_{\alpha_0}$ while the right-hand side is the cost-complexity measure of the node t. Therefore, it becomes optimal to cut the branch $T^{(t)}_{\alpha_0}$ once its cost-complexity measure $R_\alpha\big(T^{(t)}_{\alpha_0}\big)$ becomes higher than the cost-complexity measure $R_\alpha(\{t\})$ of its root node t. This happens for values of α satisfying

$$\alpha \ge \frac{D\left((\hat{c}_s)_{s\in\{t\}}\right) - D\left((\hat{c}_s)_{s\in\mathcal{T}_{(T^{(t)}_{\alpha_0})}}\right)}{\big|T^{(t)}_{\alpha_0}\big| - 1}. \qquad(3.3.6)$$
We denote by $\alpha_1^{(t)}$ the right-hand side of (3.3.6). For each non-terminal node t of $T_{\alpha_0}$ we can compute $\alpha_1^{(t)}$. For a tree T, let us denote by $\widetilde{\mathcal{T}}_{(T)}$ the set of its non-terminal nodes. Then, we define the weakest links of $T_{\alpha_0}$ as the non-terminal nodes t for which $\alpha_1^{(t)}$ is the smallest, and we denote by $\alpha_1$ this minimum value, i.e.

$$\alpha_1 = \min_{t\in\widetilde{\mathcal{T}}_{(T_{\alpha_0})}} \alpha_1^{(t)}.$$

Cutting any branch of $T_{\alpha_0}$ is not optimal as long as $\alpha < \alpha_1$. However, once the parameter α reaches the value $\alpha_1$, it becomes preferable to prune $T_{\alpha_0}$ at its weakest links. The resulting tree is then denoted $T_{\alpha_1}$. Notice that it is appropriate to prune the tree $T_{\alpha_0}$ by exploring the nodes in top-down order. In this way, we avoid cutting at a node t that will disappear later on when pruning $T_{\alpha_0}$.
Now, we repeat the same process for $T_{\alpha_1}$. Namely, for a non-terminal node t of $T_{\alpha_1}$, it will be preferable to cut the branch $T^{(t)}_{\alpha_1}$ when

$$\alpha \ge \frac{D\left((\hat{c}_s)_{s\in\{t\}}\right) - D\left((\hat{c}_s)_{s\in\mathcal{T}_{(T^{(t)}_{\alpha_1})}}\right)}{\big|T^{(t)}_{\alpha_1}\big| - 1}. \qquad(3.3.7)$$
Denoting by $\alpha_2^{(t)}$ the right-hand side of (3.3.7), we set

$$\alpha_2 = \min_{t\in\widetilde{\mathcal{T}}_{(T_{\alpha_1})}} \alpha_2^{(t)}.$$
The non-terminal nodes t of Tα1 for which α2(t) = α2 are called the weakest links of
Tα1 and it becomes better to cut in these nodes once α reaches the value α2 in order
to produce Tα2 .
Then we continue the same process for $T_{\alpha_2}$, and so on until we reach the root node $\{t_0\}$. Finally, we come up with the sequence of trees $T_{\alpha_0}, T_{\alpha_1}, T_{\alpha_2}, \ldots, T_{\alpha_\kappa} = \{t_0\}$. In the sequence, we can obtain the next tree by pruning the current one.
The next proposition makes the link between this sequence of trees and the smallest
minimizing subtrees T (α).
Proof We refer the interested reader to Sect. 10.2 in Breiman et al. (1984).
This result is important since it gives the instructions to find the smallest min-
imizing subtrees T (α). It suffices to apply the recursive pruning steps described
above. This recursive procedure is an efficient algorithm, any tree in the sequence
being obtained by pruning the previous one. Hence, this algorithm only requires to
consider a small fraction of the total possible subtrees of Tinit .
The cost-complexity measure $R_\alpha(T)$ is given by

$$R_\alpha(T) = D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T)}}\right) + \alpha|T|,$$

where the parameter α is called the regularization parameter. The parameter α has the same unit as the deviance. It is often convenient to normalize the regularization parameter by the deviance of the root tree. We define the cost-complexity parameter as

$$cp = \frac{\alpha}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)},$$

where $D\big((\hat{c}_t)_{t\in\{t_0\}}\big)$ is the deviance of the root tree. The cost-complexity measure $R_\alpha(T)$ can then be rewritten as

$$R_\alpha(T) = D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T)}}\right) + \alpha|T| = D\left((\hat{c}_t)_{t\in\mathcal{T}_{(T)}}\right) + cp\; D\left((\hat{c}_t)_{t\in\{t_0\}}\right)|T|. \qquad(3.3.8)$$
3.3.1.1 Example
Consider the simulated dataset presented in Sect. 3.2.4.1. We use as $T_{\mathrm{init}}$ the tree depicted in Fig. 3.7, where a maximum depth of four has been used as stopping rule. The right sized tree $T_{\mathrm{prune}}$ is shown in Fig. 3.8 and is a subtree of $T_{\mathrm{init}}$. The initial tree $T_{\mathrm{init}}$ is then large enough to start the pruning process.

Table 3.2 presents the decrease of the deviance $\Delta D_{\chi_t}$ at each non-terminal node t of the initial tree $T_{\mathrm{init}}$. The smallest decrease of the deviance is observed for node 11), also denoted $t_{10}$. Here, node $t_{10}$ is the weakest node of $T_{\mathrm{init}}$. It becomes optimal to cut the branch $T^{(t_{10})}_{\mathrm{init}}$ for values of α satisfying

$$\alpha \ge \frac{D\left((\hat{c}_s)_{s\in\{t_{10}\}}\right) - D\left((\hat{c}_s)_{s\in\mathcal{T}_{(T^{(t_{10})}_{\mathrm{init}})}}\right)}{\big|T^{(t_{10})}_{\mathrm{init}}\big| - 1} = \Delta D_{\chi_{t_{10}}} = 0.3254482.$$

Therefore, we get $\alpha_1 = 0.3254482$ and hence, since $D\big((\hat{c}_t)_{t\in\{t_0\}}\big) = 279\,043.30$,

$$cp_1 = \frac{\alpha_1}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)} = \frac{0.3254482}{279\,043.30} = 1.1663\cdot 10^{-6}.$$
Tree $T_{\alpha_1}$ is depicted in Fig. 3.13. Terminal nodes 22) and 23) of $T_{\mathrm{init}}$ disappeared in $T_{\alpha_1}$ and node 11) becomes a terminal node.

The weakest node of $T_{\alpha_1}$ is node 10), or $t_9$, with a decrease of the deviance $\Delta D_{\chi_{t_9}} = 1.13$. We cut the branch $T^{(t_9)}_{\alpha_1}$ once α satisfies

$$\alpha \ge \frac{D\left((\hat{c}_s)_{s\in\{t_9\}}\right) - D\left((\hat{c}_s)_{s\in\mathcal{T}_{(T^{(t_9)}_{\alpha_1})}}\right)}{\big|T^{(t_9)}_{\alpha_1}\big| - 1} = \Delta D_{\chi_{t_9}} = 1.135009.$$
Therefore, we get $\alpha_2 = 1.135009$ and hence

$$cp_2 = \frac{\alpha_2}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)} = \frac{1.135009}{279\,043.30} = 4.0676\cdot 10^{-6}.$$

Tree $T_{\alpha_2}$ is shown in Fig. 3.14, where node 10) is now terminal.
In tree $T_{\alpha_2}$, the smallest decrease of the deviance is at node 9). We have

$$\alpha_3 = \frac{D\left((\hat{c}_s)_{s\in\{t_8\}}\right) - D\left((\hat{c}_s)_{s\in\mathcal{T}_{(T^{(t_8)}_{\alpha_2})}}\right)}{\big|T^{(t_8)}_{\alpha_2}\big| - 1} = \Delta D_{\chi_{t_8}} = 1.495393$$

and

$$cp_3 = \frac{\alpha_3}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)} = \frac{1.495393}{279\,043.30} = 5.3590\cdot 10^{-6}.$$
Continuing the process, the weakest node of $T_{\alpha_3}$ is node 8), also denoted $t_7$, and we have

$$\alpha_4 = \frac{D\left((\hat{c}_s)_{s\in\{t_7\}}\right) - D\left((\hat{c}_s)_{s\in\mathcal{T}_{(T^{(t_7)}_{\alpha_3})}}\right)}{\big|T^{(t_7)}_{\alpha_3}\big| - 1} = \Delta D_{\chi_{t_7}} = 3.515387$$

and

$$cp_4 = \frac{\alpha_4}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)} = \frac{3.515387}{279\,043.30} = 1.2598\cdot 10^{-5}.$$
Once the sequence of trees $T_{\alpha_0}, T_{\alpha_1}, T_{\alpha_2}, \ldots, T_{\alpha_\kappa} = \{t_0\}$ has been built, we need to select the best pruned tree. One way to proceed is to rely on cross-validation. We set

$$\tilde{\alpha}_k = \sqrt{\alpha_k\,\alpha_{k+1}}, \qquad k = 0, 1, \ldots, \kappa,$$

where $\tilde{\alpha}_k$ is considered as a typical value for $[\alpha_k, \alpha_{k+1})$ and hence as the value corresponding to $T_{\alpha_k}$. The parameter $\tilde{\alpha}_k$ is the geometric mean of $\alpha_k$ and $\alpha_{k+1}$. Notice that $\tilde{\alpha}_0 = 0$ and $\tilde{\alpha}_\kappa = \infty$ since $\alpha_0 = 0$ and $\alpha_{\kappa+1} = \infty$.
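For instance, with the values of $\alpha_0, \ldots, \alpha_4$ obtained in Sect. 3.3.1.1, the geometric means can be computed in one line (a quick check of ours):

alpha <- c(0, 0.3254482, 1.135009, 1.495393, 3.515387)
sqrt(alpha[-length(alpha)] * alpha[-1])   # about 0.000, 0.608, 1.303, 2.293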
In K-fold cross-validation, the training set D is partitioned into K subsets $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_K$ of roughly equal size and we label by $\mathcal{I}_j \subset \mathcal{I}$ the observations in $\mathcal{D}_j$ for all $j = 1, \ldots, K$. The jth training set is defined as $\mathcal{D}\backslash\mathcal{D}_j$, $j = 1, \ldots, K$, so that it contains the observations of the training set D that are not in $\mathcal{D}_j$. For each training set $\mathcal{D}\backslash\mathcal{D}_j$, we build the corresponding sequence of smallest minimizing subtrees $T^{(j)}(\tilde{\alpha}_0), T^{(j)}(\tilde{\alpha}_1), \ldots, T^{(j)}(\tilde{\alpha}_\kappa)$, starting from a sufficiently large tree $T^{(j)}_{\mathrm{init}}$. Each tree $T^{(j)}(\tilde{\alpha}_k)$ of the sequence provides a partition $\{\chi_t\}_{t\in\mathcal{T}_{(T^{(j)}(\tilde{\alpha}_k))}}$ of the feature space χ and predictions $(\hat{c}_t)_{t\in\mathcal{T}_{(T^{(j)}(\tilde{\alpha}_k))}}$ on that partition computed on the training set $\mathcal{D}\backslash\mathcal{D}_j$. Since the observations in $\mathcal{D}_j$ have not been used to build the trees,
$\mathcal{D}_j$ can play the role of validation set for those trees. If we denote by $\hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}$ the model produced by tree $T^{(j)}(\tilde{\alpha}_k)$, that is

$$\hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}(\boldsymbol{x}) = \sum_{t\in\mathcal{T}_{(T^{(j)}(\tilde{\alpha}_k))}} \hat{c}_t\, \mathrm{I}[\boldsymbol{x}\in\chi_t],$$

then an estimate of the generalization error of $\hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}$ on $\mathcal{D}_j$ is given by

$$\widehat{\mathrm{Err}}^{\mathrm{val}}\left(\hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}\right) = \frac{1}{|\mathcal{I}_j|}\sum_{i\in\mathcal{I}_j} L\left(y_i,\, \hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}(\boldsymbol{x}_i)\right).$$

So, the K-fold cross-validation estimate of the generalization error for the regularization parameter $\tilde{\alpha}_k$ is given by
$$\widehat{\mathrm{Err}}^{\mathrm{CV}}(\tilde{\alpha}_k) = \sum_{j=1}^{K} \frac{|\mathcal{I}_j|}{|\mathcal{I}|}\,\widehat{\mathrm{Err}}^{\mathrm{val}}\left(\hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}\right) = \frac{1}{|\mathcal{I}|}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_j} L\left(y_i,\, \hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}(\boldsymbol{x}_i)\right).$$
According to the minimum cross-validation error principle, the right sized tree $T_{\mathrm{prune}}$ is then selected as the tree $T_{\alpha_{k^*}}$ of the sequence $T_{\alpha_0}, T_{\alpha_1}, T_{\alpha_2}, \ldots, T_{\alpha_\kappa}$ such that

$$\widehat{\mathrm{Err}}^{\mathrm{CV}}(\tilde{\alpha}_{k^*}) = \min_{k\in\{0,1,\ldots,\kappa\}} \widehat{\mathrm{Err}}^{\mathrm{CV}}(\tilde{\alpha}_k).$$
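In practice, these quantities need not be computed by hand: rpart performs the cross-validation automatically (argument xval of rpart.control) and reports the cp table. A possible workflow is sketched below, under the assumption that a large tree has been grown beforehand and stored in an object we call tree_init (the object name is ours):

> printcp(tree_init)    # table with columns CP, nsplit, rel error, xerror, xstd
> cp_star <- tree_init$cptable[which.min(tree_init$cptable[, "xerror"]), "CP"]
> tree_prune <- prune(tree_init, cp = cp_star)   # right sized tree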
3.3.2.1 Example 1
Consider the simulated dataset of Sect. 3.2.4.1. Starting with the same initial tree $T_{\mathrm{init}}$ as in the example of Sect. 3.3.1.1, we have $\alpha_1 = 0.3254482$, $\alpha_2 = 1.135009$, $\alpha_3 = 1.495393$, $\alpha_4 = 3.515387, \ldots$, so that

$$\tilde{\alpha}_1 = 0.6077719, \quad \tilde{\alpha}_2 = 1.302799, \quad \tilde{\alpha}_3 = 2.29279, \ldots$$
This relative error is a training sample estimate. It starts from the smallest value rel error$_0$ = 0.99376 and increases up to rel error$_{15}$ = 1 as the trees become smaller and smaller, down to the root tree. Obviously, choosing the optimal tree based on the relative error would always favor the largest tree.

Turning to the fourth column xerror, it provides the relative 10-fold cross-validation error estimate for parameter $\tilde{\alpha}_k$, namely

$$\mathtt{xerror}_k = \frac{|\mathcal{I}|\;\widehat{\mathrm{Err}}^{\mathrm{CV}}(\tilde{\alpha}_k)}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)},$$

where, in this example, $|\mathcal{I}| = n = 500\,000$. For instance, we have xerror$_0$ = 0.99395, xerror$_1$ = 0.99395 and so on up to xerror$_{15}$ = 1. Hence, using the minimum cross-validation error principle, the right-sized tree turns out to be $T_{\alpha_4}$, with a relative 10-fold cross-validation error equal to 0.99386. The dataset being simulated, we know that $T_{\alpha_4}$, shown in Fig. 3.16, indeed coincides with the best possible tree.

The last column xstd will be commented on in the next example.
3.3.2.2 Example 2
Consider the real dataset described in Sect. 3.2.4.2. The tree depicted in Fig. 3.11
with 17 terminal nodes, that has been obtained by requiring a minimum number of
observations in terminal nodes equal to 5000, is used as Tinit . We get the following
results:
CP nsplit rel error xerror xstd
1 9.3945e−03 0 1.00000 1.00003 0.0048524
2 4.0574e−03 1 0.99061 0.99115 0.0047885
3 2.1243e−03 2 0.98655 0.98719 0.0047662
4 1.4023e−03 3 0.98442 0.98524 0.0047565
5 5.1564e−04 4 0.98302 0.98376 0.0047441
6 4.5896e−04 5 0.98251 0.98344 0.0047427
7 4.1895e−04 6 0.98205 0.98312 0.0047437
The sequence of trees Tα0 , Tα1 , . . . , Tακ is composed of 17 trees, namely κ = 16.
The minimum 10-fold cross-validation error is equal to 0.98189 and corresponds to
tree $T_{\alpha_2}$ with a complexity parameter $cp_2 = 8.6149\cdot 10^{-5}$ and 14 splits. $T_{\alpha_2}$ is shown
in Fig. 3.17. So, it is the tree with the minimum 10-fold cross-validation error when
requiring at least 5000 observations in the terminal nodes. Notice that Tα2 with 15
terminal nodes is relatively big compared to Tinit with 17 terminal nodes. The reason
is that any terminal node of the initial tree must contain at least 5000 observations
so that the size of Tinit is limited.
For any α̃k , k = 0, 1, . . . , κ, the standard deviation estimate
Fig. 3.17 Tree with minimum cross-validation error when requiring at least 5000 observations in
the terminal nodes
Fig. 3.18 Relative cross-validation error xerrork together with the relative standard error xstdk
$$\left[\mathbb{V}\mathrm{ar}\left(\widehat{\mathrm{Err}}^{\mathrm{CV}}(\tilde{\alpha}_k)\right)\right]^{1/2}$$

can be estimated empirically over the K estimates of the generalization error. The last column xstd provides an estimate of the relative standard error of the cross-validation error, namely

$$\mathtt{xstd}_k = \frac{|\mathcal{I}|\left[\widehat{\mathbb{V}\mathrm{ar}}\left(\widehat{\mathrm{Err}}^{\mathrm{CV}}(\tilde{\alpha}_k)\right)\right]^{1/2}}{D\left((\hat{c}_t)_{t\in\{t_0\}}\right)},$$
where, in this example, $|\mathcal{I}| = n = 160\,944$. Figure 3.18 shows the relative cross-validation error xerror$_k$ together with the relative standard error xstd$_k$ for each tree $T_{\alpha_k}$. From right to left, we start with $T_{\alpha_0}$, which corresponds to a complexity parameter equal to 0 and 16 splits, and end with the root node tree $T_{\alpha_{16}}$ with 0 splits. As we can see, the tree of the sequence with only 3 splits, namely $T_{\alpha_{13}}$, is within one standard deviation (SD) of the tree $T_{\alpha_2}$ with the minimum cross-validation error. The 1-SD rule consists in selecting the smallest tree that is within one standard deviation of the tree that minimizes the cross-validation error. This rule recognizes that there is some uncertainty in the estimate of the cross-validation error and chooses the simplest tree whose accuracy is still judged acceptable. Hence, according to the 1-SD rule, the tree selected is $T_{\alpha_{k^{**}}}$ where $k^{**}$ is the maximum $k \in \{k^*, k^*+1, \ldots, \kappa\}$ satisfying
$$\widehat{\mathrm{Err}}^{\mathrm{CV}}(\tilde{\alpha}_k) \le \widehat{\mathrm{Err}}^{\mathrm{CV}}(\tilde{\alpha}_{k^*}) + \left[\widehat{\mathbb{V}\mathrm{ar}}\left(\widehat{\mathrm{Err}}^{\mathrm{CV}}(\tilde{\alpha}_{k^*})\right)\right]^{1/2}.$$
In our case, $T_{\alpha_{k^{**}}} = T_{\alpha_{13}}$, which is depicted in Fig. 3.19. Compared to tree $T_{\alpha_{k^*}} = T_{\alpha_2}$, which minimizes the cross-validation error, $T_{\alpha_{13}}$ is much simpler. The number of terminal nodes decreases by 11, going from 15 in $T_{\alpha_2}$ to 4 in $T_{\alpha_{13}}$.
As a result, the minimum cross-validation principle selects Tα2 while the 1-SD
rule chooses Tα13 . Both trees can be compared on a validation set by computing their
respective generalization errors.
3.3.2.3 Example 3
Consider once again the real dataset described in Sect. 3.2.4.2. This time, a stratified random split of the available dataset L is done to get a training set D and a validation set. In this example, the training set D is assumed to be composed of 80% of the observations of the learning set L and the validation set of the remaining 20% of the observations.

The initial tree $T_{\mathrm{init}}$ is grown on the training set by requiring a minimum number of observations in terminal nodes equal to 4000 = 80% × 5000. The resulting tree $T_{\mathrm{init}}$ is shown in Fig. 3.20 and differs from the previous initial tree obtained on the whole dataset L at only one split at the bottom. Conducting the pruning process, we get the following results, which are also illustrated in Fig. 3.21:
The tree that minimizes the cross-validation error is the initial tree $T_{\alpha_0}$, which means that requiring at least 4000 observations in the terminal nodes already prevents overfitting. The 1-SD rule selects the tree $T_{\alpha_{11}}$ with only 3 splits, depicted in Fig. 3.22.

In order to compare both trees $T_{\alpha_{k^*}} = T_{\alpha_0}$ and $T_{\alpha_{k^{**}}} = T_{\alpha_{11}}$, we estimate their respective generalization errors on the validation set. Remember that the validation
Fig. 3.20 Initial tree when requiring at least 4000 observations in the terminal nodes
Fig. 3.21 Relative cross-validation error xerrork together with the relative standard error xstdk
sample estimate of the generalization error for a tree T fitted on the training set is given by

$$\widehat{\mathrm{Err}}^{\mathrm{val}}(\hat{\mu}_T) = \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} L\big(y_i,\, \hat{\mu}_T(\boldsymbol{x}_i)\, e_i\big),$$

where $\hat{\mu}_T$ corresponds to the model induced by tree T and the sum runs over the observations of the validation set. We get

$$\widehat{\mathrm{Err}}^{\mathrm{val}}\big(\hat{\mu}_{T_{\alpha_{k^*}}}\big) = 0.5452772$$

and

$$\widehat{\mathrm{Err}}^{\mathrm{val}}\big(\hat{\mu}_{T_{\alpha_{k^{**}}}}\big) = 0.5464333.$$

The tree $T_{\alpha_{k^*}}$ that minimizes the cross-validation error is also the one that minimizes the generalization error estimated on the validation set. Hence, $T_{\alpha_{k^*}}$ has the best predictive accuracy compared to $T_{\alpha_{k^{**}}}$ and is thus judged as the best tree.
The difference between both generalization error estimates $\widehat{\mathrm{Err}}^{\mathrm{val}}\big(\hat{\mu}_{T_{\alpha_{k^*}}}\big)$ and $\widehat{\mathrm{Err}}^{\mathrm{val}}\big(\hat{\mu}_{T_{\alpha_{k^{**}}}}\big)$ is around $10^{-3}$. Such a difference appears to be significant in this context. The validation sample estimate of the generalization error for the root node tree $\{t_0\}$ is given by

$$\widehat{\mathrm{Err}}^{\mathrm{val}}\big(\hat{\mu}_{\{t_0\}}\big) = 0.54963.$$

The generalization error estimate only decreases by 0.0043528 from the root node tree $\{t_0\}$, that is, the null model, to the optimal one $T_{\alpha_{k^*}}$.
The generalization error of a model μ can be decomposed as the sum of the gen-
eralization error of the true model μ and an estimation error that is positive. The
generalization error of the true model μ is irreducible. Provided that the generaliza-
tion error of the true model μ is large compared to the estimation error of the null
model, a small decrease of the generalization error can actually mean a significant
improvement.
The generalization error $\mathrm{Err}(\hat{\mu})$ measures the performance of the model $\hat{\mu}$. Specifically, its validation sample estimate $\widehat{\mathrm{Err}}^{\mathrm{val}}(\hat{\mu})$ allows one to assess the predictive accuracy of $\hat{\mu}$. However, as noticed in the previous section, there are some situations where the validation sample estimate only slightly reacts to a model change, while such a small variation could actually reveal a significant improvement in terms of model accuracy.
Consider the simulated dataset of Sect. 3.2.4.1. Because the true model μ is known in this example, we can estimate the generalization error of the true model on the whole dataset, that is

$$\mathrm{Err}(\mu) = 0.5546299. \qquad(3.4.1)$$

Let $\hat{\mu}_{\mathrm{null}}$ be the model obtained by averaging the true expected claim frequencies over the observations. We get $\hat{\mu}_{\mathrm{null}} = 0.1312089$, such that its generalization error estimated on the whole dataset is

$$\mathrm{Err}(\hat{\mu}_{\mathrm{null}}) = 0.5580889. \qquad(3.4.2)$$

The difference between both error estimates $\mathrm{Err}(\mu)$ and $\mathrm{Err}(\hat{\mu}_{\mathrm{null}})$ is 0.003459. We observe that the improvement we get in terms of generalization error by using the true model μ instead of the null model $\hat{\mu}_{\mathrm{null}}$ is only of the order of $10^{-3}$. A slight decrease of the generalization error can actually mean a real improvement in terms of model accuracy.
3.5 Relative Importance of Features

In insurance, the features are not all equally important for the response. Often, only a few of them have a substantial influence on the response. Assessing the relative importances of the features to the response can thus be useful for the analyst.
For a tree T, the relative importance of feature $x_j$ is the total reduction of deviance obtained by using this feature throughout the tree. The overall objective is to minimize the deviance. A feature that contributes a lot to this reduction will be more important than another one with a small or no deviance reduction. Specifically, denoting by $\widetilde{\mathcal{T}}_{(T)}(x_j)$ the set of all non-terminal nodes of tree T for which $x_j$ was selected as the splitting feature, the relative importance of feature $x_j$ is the sum of the deviance reductions $\Delta D_{\chi_t}$ over the non-terminal nodes $t \in \widetilde{\mathcal{T}}_{(T)}(x_j)$, that is,

$$\mathcal{I}(x_j) = \sum_{t\in\widetilde{\mathcal{T}}_{(T)}(x_j)} \Delta D_{\chi_t}. \qquad(3.5.1)$$
The features can be ordered with respect to their relative importances. The feature
with the largest relative importance is the most important one, the feature with the
second largest relative importance is the second most important one, and so on up to
the least important feature. The most important features are those that appear higher
in the tree or several times in the tree.
Note that the relative importances are relative measures, so that they can be normalized to improve their readability. It is customary to assign the largest relative importance a value of 100 and then scale the others accordingly. Another way is to scale the relative importances such that their sum equals 100, so that any relative importance can be interpreted as the percentage contribution to the overall model.
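In R, the deviance reductions summed in (3.5.1) are available through the variable.importance component of a fitted rpart object (note that rpart also credits surrogate splits, so the numbers may differ slightly from a strict application of (3.5.1)). A minimal sketch of both normalizations, for a previously fitted tree fit, could be:

# Relative importances of the features for a fitted rpart tree 'fit'
imp <- fit$variable.importance

# Normalization 1: importances summing to 100 (percentage contributions)
rel_imp_sum <- 100 * imp / sum(imp)

# Normalization 2: largest importance set to 100
rel_imp_max <- 100 * imp / max(imp)

round(sort(rel_imp_sum, decreasing = TRUE), 2)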
3.5.1 Example 1
Normalizing the relative importances such that their sum equals 100, we get

Table 3.6 Decrease of the deviance D_{χ_t} for any non-terminal node t of the optimal tree

Node k)   Splitting feature   D_{χ_{t_k}}
1)        Age                 279 043.30 − (111 779.70 + 166 261.40) = 1002.20
2)        Sport               111 779.70 − (53 386.74 + 58 236.07) = 156.89
3)        Age                 166 261.40 − (88 703.96 + 77 295.79) = 261.65
4)        Gender              53 386.74 − (26 084.89 + 27 285.05) = 16.80
5)        Gender              58 236.07 − (28 257.29 + 29 946.18) = 32.60
6)        Sport               88 703.96 − (42 515.21 + 46 100.83) = 87.92
7)        Sport               77 295.79 − (37 323.97 + 39 898.18) = 73.64
12)       Gender              42 515.21 − (20 701.11 + 21 790.59) = 23.51
13)       Gender              46 100.83 − (22 350.27 + 23 718.03) = 32.53
14)       Gender              37 323.97 − (18 211.52 + 19 090.83) = 21.62
15)       Gender              39 898.18 − (19 474.59 + 20 396.94) = 26.65
The feature Age is the most important one, followed by Sport and Gender, which is in line with our expectations. Notice that the variable Split has an importance equal to 0 as it is not used in the tree.
3.5.2 Example 2
Consider the example of Sect. 3.2.4.2. The relative importances of features related
to the tree depicted in Fig. 3.17 are shown in Fig. 3.23. One sees that the relative
importance of features AgePh and Split represents approximately 90% of the total
importance while the relative importance of the last five features (that are Cover,
AgeCar, Gender, PowerCat and Use) is less than 5%.
Fig. 3.23 Relative importances of the features related to the tree depicted in Fig. 3.17
Most of the time, the features X_j are correlated in observational studies. When the correlation between some features becomes large, trees may run into trouble, as illustrated next. Such high correlations mean that the same information is encoded in several features.

Consider the example of Sect. 3.2.4.1 and assume that variables X_1 (Gender) and X_3 (Split) are correlated. In order to generate the announced correlation, we suppose that the distribution of females inside the portfolio differs according to the split of the premium. Specifically,

with ρ ∈ [0.5, 1]. Since P[X_1 = female] = 0.5 and P[X_3 = yes] = 0.5, we necessarily have

The correlation between X_1 and X_3 increases with ρ. In particular, the case ρ = 0.5 corresponds to the independent case while both features X_1 and X_3 are perfectly correlated when ρ = 1.
We consider different values for ρ, from 0.5 to 1 by steps of 0.1. For any value of ρ, we generate a training set with 500 000 observations on which we build a tree minimizing the 10-fold cross-validation error. The corresponding relative importances are depicted in Fig. 3.24. One sees that the importance of the variable Split increases with ρ, starting from 0 when ρ = 0.5 and completely replacing the variable Gender when ρ = 1. Also, one observes that the more important the variable Split becomes, the less important the variable Gender.
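The exact dependence scheme linking Gender and Split is not reproduced above; one simple construction consistent with the stated marginals, taken here as an assumption, sets P[X_1 = female | X_3 = yes] = ρ and P[X_1 = female | X_3 = no] = 1 − ρ. A small R sketch generating such correlated features would then read:

set.seed(1)
n   <- 500000
rho <- 0.8    # coupling parameter, rho in [0.5, 1]

# P[Split = yes] = 0.5
split  <- sample(c("yes", "no"), n, replace = TRUE)
# assumed conditional law: P[Gender = female | Split = yes] = rho,
#                          P[Gender = female | Split = no]  = 1 - rho
p_fem  <- ifelse(split == "yes", rho, 1 - rho)
gender <- ifelse(runif(n) < p_fem, "female", "male")

mean(gender == "female")    # close to 0.5, as required
table(gender, split) / n    # association strengthens as rho increases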
The variable Split should not be used to explain the expected claim frequency, which is the case when Split and Gender are independent. However, introducing a correlation between Split and Gender leads to a transfer of importance from Gender to Split. As a result, the variable Split seems to be useful to explain the expected claim frequency while the variable Gender appears to be less important than it should be. This effect becomes even more pronounced as the correlation increases, the variable Split bringing more and more information about Gender. In the extreme case where Split and Gender are perfectly dependent, using Split instead of Gender always yields the same deviance reduction, so that both variables become fully equivalent. Note that in such a situation, the variable Split was automatically selected over Gender, so that Gender has no importance in Fig. 3.24 for ρ = 1.
Fig. 3.24 Relative importances of the features for different values of ρ: ρ = 0.5 (top left), ρ = 0.6 (top right), ρ = 0.7 (middle left), ρ = 0.8 (middle right), ρ = 0.9 (bottom left) and ρ = 1 (bottom right)
For a set of features, here Split and Gender, we might have all the importances
be small, yet we cannot delete them all. This little example shows that the relative
importances could easily lead to wrong conclusions if they are not used carefully by
the analyst.
3.6 Interactions
Interaction arises when the effect of a particular feature depends on the value of another one. An example in motor insurance is given by driver's age and gender: often, young female drivers cause on average fewer claims than young male ones, whereas this gender difference disappears (and sometimes even reverses) at older ages. Hence, the effect of age depends on gender. Regression trees automatically account for interactions.
Let us revisit the example of Sect. 3.2.4.1 for which we now assume the following
expected annual claim frequencies:
This time, the effect of the age on the expected claim frequency depends on the policyholder's gender. Young male drivers are indeed more risky than young female drivers in our example, young meaning here younger than 30 years old. Equivalently, the effect of the gender on the expected claim frequency depends on the policyholder's age. Indeed, a man and a woman both with 18 ≤ Age < 30 and with the same value for the feature Sport have expected claim frequencies that differ by (1.1 × 1.6)/1.4 − 1 = 25.71%, while a man and a woman both with Age ≥ 30 and with the same value for the feature Sport have expected claim frequencies that only differ by 10%. One says that features Gender and Age interact.
The optimal regression tree is shown in Fig. 3.25. By nature, the structure of a regression tree makes it possible to account for potential interactions between features. In Fig. 3.25, the root node is split with the rule Age ≥ 30. Hence, once the feature Gender is used for a split on the left-hand side of the tree (node 2) and its children), it only applies to policyholders with Age ≥ 30, while when it appears on the right-hand side of the tree (node 3) and its children), it only applies to policyholders with 18 ≤ Age < 30. By construction, the impact of the feature Gender can then be different for the categories 18 ≤ Age < 30 and Age ≥ 30. The structure of a tree, which is a succession of binary splits, can thus easily reveal existing interactions between features.
Remark 3.6.1 Interaction has nothing to do with correlation. For instance, consider
a motor insurance portfolio with the same age structure for males and females (so that
the risk factors age and gender are mutually independent). If young male drivers are
more dangerous compared to young female drivers whereas this ranking disappears
or reverses at older ages, then age and gender interact despite being independent.
[Fig. 3.25: optimal regression tree for this example (node details omitted)]
3.7 Limitations of Trees

One issue with trees is their high variance. There is a high variability of the prediction μ̂_D(x) over the models trained from all possible training sets. The main reason is the hierarchical nature of the procedure: a change in one of the top splits is propagated to all the subsequent splits. A way to remedy that problem is to rely on bagging, which averages many trees. This has the effect of reducing the variance. But the price to be paid for stabilizing the prediction is to deal with less comprehensible models, so that we lose in terms of model interpretability. Instead of relying on one tree, bagging works with many trees. Ensemble learning techniques such as bagging will be studied in Chap. 4.
Consider the example of Sect. 3.2.4.1. The true expected claim frequency only
takes twelve values, so that the complexity of the true model is small. We first
generate 10 training sets with 5000 observations. On each training set, we fit a tree
with only one split. Figure 3.26 shows the resulting trees. As we can see, even in this
simple example, the feature used to make the split of the root node is not always the
Fig. 3.26 Trees with only one split built on different training sets of size 5000
same. Furthermore, the value of the split when using the feature Age differs from
one tree to another.
Increasing the size of the training set should decrease the variability of the trees
with respect to the training set. In Figs. 3.27 and 3.28, we depict trees built on training
sets with 50 000 and 500 000 observations, respectively. We notice that trees fitted
on training sets with 50 000 observations always use the feature Age for the first
split. Only the value of the split varies, but not too much. Finally, with training sets
made of 500 000 observations, we observe that all the trees have the same structure,
with Age ≥ 44 as the split of the root node. In this case, the trees only differ on the
corresponding predictions for Age < 44 and Age ≥ 44.
The variability of the models with respect to the training set can be measured by the variance

E_X[ E_D[ (E_D[μ̂_D(X)] − μ̂_D(X))² ] ],  (3.7.1)

which has been introduced in Chap. 2. In Table 3.7, we calculate the variance (3.7.1) by Monte-Carlo simulation for training sets of sizes 5000, 50 000 and 500 000. As expected, the variance decreases as the number of observations in the training set increases.
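The variance (3.7.1) can be approximated by Monte-Carlo simulation along the following lines; this is only a sketch, where simulate_portfolio() stands for the (hypothetical) data-generating mechanism of Sect. 3.2.4.1 and the tree-building settings of the book's experiment are not reproduced.

library(rpart)

# Monte-Carlo approximation of (3.7.1): the variance of the fitted tree over
# independently drawn training sets, averaged over the feature values.
# 'simulate_portfolio(n)' is a placeholder returning a data frame with columns
# nclaims, expo, Gender, Age, Sport and Split.
mc_variance <- function(n_train, n_rep = 50, newdata) {
  preds <- sapply(seq_len(n_rep), function(r) {
    train <- simulate_portfolio(n_train)
    fit <- rpart(cbind(expo, nclaims) ~ Gender + Age + Sport + Split,
                 data = train, method = "poisson")
    predict(fit, newdata = newdata)     # one column of predictions per training set
  })
  mean(apply(preds, 1, var))            # average over newdata of Var_D[mu_D(X)]
}

# 'newdata' should be a large sample of feature values drawn from the same
# portfolio, so that the outer average approximates the expectation over X.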
Even in this simple example, where the true model only takes twelve values and where we only have three important features (one of which is continuous), training sets with 50 000 observations can still lead to trees with different splits for the root node. Larger trees can even be more impacted due to the hierarchical nature of the procedure. A change in the split for the root node can be propagated to the subsequent nodes, which may lead to trees with very different structures.

In practice, the complexity of the true model is usually much larger than in this example. Typically, more features influence the true expected claim frequency and the impact of a continuous feature such as the age is often more complicated than being summarized with three categories. In this setting, let us assume that the true expected claim frequency is actually
This time, the impact of the age on the expected claim frequency is more complex than
in the previous example, and is depicted in Fig. 3.29. The expected claim frequency
smoothly decreases with the age, young drivers being more risky. Even with training
sets made of 500 000 observations, the resulting trees can still have different splits
for the root node, as shown in Fig. 3.30. We get the following variance
E_X[ E_D[ (E_D[μ̂_D(X)] − μ̂_D(X))² ] ] = 0.03374507 × 10^{-3}
Fig. 3.27 Trees with only one split built on different training sets of size 50 000
Fig. 3.28 Trees with only one split built on different training sets of size 500 000
Table 3.7 Estimation of (3.7.1) by Monte-Carlo simulation for training sets of sizes 5000, 50 000 and 500 000

Size of the training sets   E_X[ E_D[ (E_D[μ̂_D(X)] − μ̂_D(X))² ] ]
5000                        0.206928850 × 10^{-3}
50 000                      0.061018808 × 10^{-3}
500 000                     0.002361467 × 10^{-3}
[Fig. 3.29: plot of 1 + 1/(Age − 17)^0.5 against Age, for Age between 20 and 60]
by Monte-Carlo simulation, which is, as expected, larger than the variance in Table 3.7
for training sets with 500 000 observations.
In insurance, the complexity of the true model can be large, so that training sets of reasonable size for an insurance portfolio often lead to estimators μ̂_D that are unstable with respect to D. In addition, correlated features can further increase model instability.

When the true model μ(x) is smooth, trees are unlikely to capture all the nuances of μ(x). One says that regression trees suffer from a lack of smoothness. Ensemble techniques described in Chaps. 4 and 5 will make it possible to address this issue.
As an illustration, we consider the example of Sect. 3.2.4.1 with expected claim
frequencies given by (3.7.2). The expected claim frequency μ(x) smoothly decreases
with the feature Age. We simulate a training set with 500 000 observations and we
Fig. 3.30 Trees with only one split built on different training sets of size 500 000 for the true expected claim frequency given by (3.7.2)
[Fig. 3.31: tree minimizing the 10-fold cross-validation error, fitted on a training set of 500 000 observations (node details omitted)]
fit a tree which minimizes the 10-fold cross validation error, depicted in Fig. 3.31.
Notice that the feature Split is not used by this tree, as desired.
The corresponding model μ̂ is not satisfactory. Figure 3.32 illustrates both μ̂ and μ as functions of policyholder's age for fixed values of the variables Gender and Sport. One sees that μ̂ does not reproduce the smooth decreasing behavior of the true model μ with respect to the age.
Typically, small trees suffer from a lack of smoothness because of their limited number of terminal nodes, whereas large trees, which could overcome this limitation thanks to their larger number of leaves, tend to overfit the data. Trees with the highest predictive accuracy cannot be too large, so that their limited number of possible predicted outcomes prevents obtaining models with smooth behaviors with respect to some features.
Fig. 3.32 Model μ̂ (on the left) and true model μ (on the right) as functions of policyholder's age. From top to bottom: (Gender = male, Sport = yes), (Gender = male, Sport = no), (Gender = female, Sport = yes) and (Gender = female, Sport = no)

3.8 Bibliographic Notes and Further Reading

Decision trees are first due to Morgan and Sonquist (1963), who suggested a method called automatic interaction detector (AID) in the context of survey data. Several improvements were then proposed by Sonquist (1970), Messenger and
Mandell (1972), Gillo (1972) and Sonquist et al. (1974). The most important contributors to the modern methodological principles of decision trees are however Breiman (1978a, b), Friedman (1977, 1979) and Quinlan (1979, 1986), who proposed very similar algorithms for the construction of decision trees. The seminal work of Breiman et al. (1984), complemented by the work of Quinlan (1993), established a simple and consistent methodological framework for decision trees, the classification and regression tree (CART) techniques, which facilitated the diffusion of tree-based models towards a large audience. Our presentation is mainly inspired by Breiman et al. (1984), Hastie et al. (2009) and Wüthrich and Buser (2019), the latter reference adapting tree-based methods to model claim frequencies. Also, Louppe (2014) provides a good overview of the literature, from which this section is greatly inspired.
References
Breiman L (1978a) Parsimonious binary classification trees. Preliminary report. Technology Service
Corporation, Santa Monica, Calif
Breiman L (1978b) Description of chlorine tree development and use. Technical report, Technology
Service Corporation, Santa Monica, CA
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees.
Wadsworth statistics/probability series
Bühlmann H, Gisler A (2005) A course in credibility theory and its applications. Springer, Berlin
Denuit M, Hainaut D, Trufin J (2019) Effective statistical learning methods for actuaries I: GLMs
and extensions. Springer actuarial lecture notes
Feelders A (2019) Classification trees. Lecture notes
Friedman JH (1977) A recursive partitioning decision rule for nonparametric classification. IEEE
Trans Comput 100(4):404–408
Friedman JH (1979) A tree-structured approach to nonparametric multiple regression. Smoothing
techniques for curve estimation. Springer, Berlin, pp 5–22
Geurts P (2002) Contributions to decision tree induction: bias/variance tradeoff and time series
classification. PhD thesis
Gillo M (1972) Maid: a honeywell 600 program for an automatised survey analysis. Behav Sci
17:251–252
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Data mining, inference, and prediction, 2nd edn. Springer series in statistics
Louppe G (2014) Understanding random forests: from theory to practice. arXiv:14077502
Messenger R, Mandell L (1972) A modal search technique for predictive nominal scale multivariate
analysis. J Am Stat Assoc 67(340):768–772
Morgan JN, Sonquist JA (1963) Problems in the analysis of survey data, and a proposal. J Am Stat
Assoc 58(302):415–434
Quinlan JR (1979) Discovering rules by induction from large collections of examples. Expert
systems in the micro electronic age. Edinburgh University Press, Edinburgh
Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
Quinlan JR (1993) C4.5: programs for machine learning, vol 1. Morgan Kaufmann, Burlington
Sonquist JA (1970) Multivariate model building: the validation of a search strategy. Survey Research
Center, University of Michigan
Sonquist JA, Baker EL, Morgan JN (1974) Searching for structure: an approach to analysis of
substantial bodies of micro-data and documentation for a computer program. Survey Research
Center, University of Michigan Ann Arbor, MI
Wüthrich MV, Buser C (2019) Data analytics for non-life insurance pricing. Lecture notes
Chapter 4
Bagging Trees and Random Forests
4.1 Introduction
Two ensemble methods are considered in this chapter, namely bagging trees and random forests. One issue with regression trees is their high variance: there is a high variability of the prediction μ̂_D(x) over the trees trained from all possible training sets D. Bagging trees and random forests aim to reduce the variance without altering the bias too much.
Ensemble methods are relevant tools to reduce the expected generalization error
of a model by driving down the variance of the model without increasing too much
the bias. The principle of ensemble methods (based on randomization) consists in
introducing random perturbations into the training procedure in order to get different
models from a single training set D and combining them to obtain the estimate of
the ensemble.
Let us start with the average prediction E_D[μ̂_D(x)]. It has the same bias as μ̂_D(x) since

E_D[μ̂_D(x)] = E_D[ E_D[μ̂_D(x)] ],  (4.1.1)

while its variance is zero,

Var_D[ E_D[μ̂_D(x)] ] = 0.  (4.1.2)
Hence, finding a training procedure that produces a good approximation of the aver-
age model in order to stabilize model predictions seems to be a good strategy.
If we assume that we can draw as many training sets as we want, so that we have B training sets D_1, D_2, ..., D_B available, then an approximation of the average model can be obtained by averaging the regression trees built on these training sets, that is,

E_D[μ̂_D(x)] ≈ (1/B) Σ_{b=1}^{B} μ̂_{D_b}(x).  (4.1.3)

In such a case, the average of the estimate (4.1.3) with respect to the training sets D_1, ..., D_B is the average prediction E_D[μ̂_D(x)], that is,

E_{D_1,...,D_B}[ (1/B) Σ_{b=1}^{B} μ̂_{D_b}(x) ] = (1/B) Σ_{b=1}^{B} E_{D_b}[ μ̂_{D_b}(x) ]
                                            = E_D[μ̂_D(x)],  (4.1.4)

while its variance is given by

Var_{D_1,...,D_B}[ (1/B) Σ_{b=1}^{B} μ̂_{D_b}(x) ] = (1/B²) Σ_{b=1}^{B} Var_{D_b}[ μ̂_{D_b}(x) ]
                                               = Var_D[μ̂_D(x)] / B  (4.1.5)

since the predictions μ̂_{D_1}(x), ..., μ̂_{D_B}(x) are independent and identically distributed.
So, averaging over B estimates fitted on different training sets leaves the bias
unchanged compared to each individual estimate while it divides the variance by
B. The estimate (4.1.3) is then less variable than each individual one.
In practice, the probability distribution from which the observations of the training
set are drawn is usually not known so that there is only one training set available. In
this context, the bootstrap approach, used both in bagging trees and random forests,
appears to be particularly useful.
4.2 Bootstrap
θ̂ = g(Y_1, Y_2, ..., Y_n)

F̂_n(x) = #{Y_i such that Y_i ≤ x} / n = (1/n) Σ_{i=1}^{n} I[Y_i ≤ x].
F̂*_θ(x) = (1/B) Σ_{b=1}^{B} I[ θ̂^{(*b)} ≤ x ].
4.3 Bagging Trees

Bagging is one of the first ensemble methods proposed in the literature. Consider a model fitted to our training set D, giving the prediction μ̂_D(x) at point x. Bootstrap aggregation, or bagging, averages this prediction over a set of bootstrap samples in order to reduce its variance.
The probability distribution of the random vector (Y, X) is usually not known. This latter distribution is then approximated by its empirical version, which puts an equal probability 1/|I| on each of the observations {(y_i, x_i); i ∈ I} of the training set D. Hence, instead of simulating B training sets D_1, D_2, ..., D_B from the probability distribution of (Y, X), which is not possible in practice, the idea of bagging is rather to simulate B bootstrap samples D*_1, D*_2, ..., D*_B of the training set D from its empirical counterpart. Specifically, a bootstrap sample of D is obtained by simulating independently |I| observations from the empirical distribution of (Y, X) defined above. A bootstrap sample is thus a random sample of D taken with replacement which has the same size as D. Notice that, on average, 63.2% of the observations of the training set are represented at least once in a bootstrap sample. Indeed,

1 − ((|I| − 1)/|I|)^{|I|},  (4.3.1)
which is computed in Table 4.1 for different values of |I|, is the probability that a given observation of the training set is represented at least once. One can see that (4.3.1) quickly approaches the limiting value of 63.2%.
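The probability (4.3.1) is straightforward to evaluate; the following one-liner in R shows how quickly it approaches 1 − e^{−1} ≈ 0.632 as the size of the training set grows.

n <- c(10, 50, 100, 1000, 10000, 100000)   # training set sizes |I|
p_at_least_once <- 1 - ((n - 1) / n)^n     # probability (4.3.1)
round(p_at_least_once, 4)                  # tends to 1 - exp(-1) = 0.6321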
Let D*_1, D*_2, ..., D*_B be B bootstrap samples of the training set D. For each D*_b, b = 1, ..., B, we fit our model, giving the prediction μ̂_{D,Θ_b}(x) = μ̂_{D*_b}(x). The bagging prediction is then defined by

μ̂^bag_{D,Θ}(x) = (1/B) Σ_{b=1}^{B} μ̂_{D,Θ_b}(x),  (4.3.2)

For b = 1 to B do
  1. Generate a bootstrap sample D*_b of D.
  2. Fit an unpruned tree on D*_b, which gives the prediction μ̂_{D,Θ_b}(x).
End for
Output: μ̂^bag_{D,Θ}(x) = (1/B) Σ_{b=1}^{B} μ̂_{D,Θ_b}(x).
As mentioned previously, two main drawbacks of regression trees are that they
produce piece-wise constant estimates and that they are rather unstable under a small
change in the observations of the training set. The construction of an ensemble of
trees produces more stable and smoothed estimates under averaging.
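A minimal R sketch of the bagging algorithm above, with unpruned Poisson rpart trees as individual models, could look as follows; the column names (nclaims, expo) and the list of features are illustrative.

library(rpart)

# Bagging Poisson regression trees: B trees fitted on bootstrap samples of the
# training set, the predictions being averaged over the ensemble (4.3.2).
bag_trees <- function(train, newdata, B = 100) {
  preds <- replicate(B, {
    boot <- train[sample(nrow(train), replace = TRUE), ]   # bootstrap sample of D
    fit  <- rpart(cbind(expo, nclaims) ~ Gender + Age + Sport + Split,
                  data = boot, method = "poisson",
                  control = rpart.control(cp = 0, minbucket = 100))  # large, unpruned trees
    predict(fit, newdata = newdata)       # expected claim frequencies
  })
  rowMeans(preds)                         # bagging prediction
}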
4.3.1 Bias

For bagging, the bias is the same as the one of the individual sampled models. Indeed,

Bias(x) = μ(x) − E_{D,Θ}[ μ̂^bag_{D,Θ}(x) ]
        = μ(x) − E_{D,Θ_1,...,Θ_B}[ (1/B) Σ_{b=1}^{B} μ̂_{D,Θ_b}(x) ]
        = μ(x) − (1/B) Σ_{b=1}^{B} E_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ]
        = μ(x) − E_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ].  (4.3.3)
4.3.2 Variance

The variance of μ̂^bag_{D,Θ}(x) can be written as

Var_{D,Θ}[ μ̂^bag_{D,Θ}(x) ] = Var_{D,Θ_1,...,Θ_B}[ (1/B) Σ_{b=1}^{B} μ̂_{D,Θ_b}(x) ]
  = (1/B²) Var_{D,Θ_1,...,Θ_B}[ Σ_{b=1}^{B} μ̂_{D,Θ_b}(x) ]
  = (1/B²) ( Var_D[ E_{Θ_1,...,Θ_B}[ Σ_{b=1}^{B} μ̂_{D,Θ_b}(x) | D ] ] + E_D[ Var_{Θ_1,...,Θ_B}[ Σ_{b=1}^{B} μ̂_{D,Θ_b}(x) | D ] ] )
  = Var_D[ E_{Θ_b}[ μ̂_{D,Θ_b}(x) | D ] ] + (1/B) E_D[ Var_{Θ_b}[ μ̂_{D,Θ_b}(x) | D ] ].  (4.3.4)
The variance of the bagging prediction μ̂^bag_{D,Θ}(x) is smaller than the variance of an individual prediction μ̂_{D,Θ_b}(x). Actually, we learn from (4.3.4) and (4.3.5) that the variance reduction is given by

Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ] − Var_{D,Θ}[ μ̂^bag_{D,Θ}(x) ] = ((B − 1)/B) E_D[ Var_{Θ_b}[ μ̂_{D,Θ_b}(x) | D ] ],  (4.3.7)

which increases as B increases and tends to E_D[ Var_{Θ_b}[ μ̂_{D,Θ_b}(x) | D ] ] when B → ∞.
Let us introduce the correlation coefficient ρ(x) between any pair of predictions used in the averaging, which are built on the same training set but fitted on two different bootstrap samples. Using the definition of Pearson's correlation coefficient, we get

ρ(x) = Cov_{D,Θ_b,Θ_{b'}}[ μ̂_{D,Θ_b}(x), μ̂_{D,Θ_{b'}}(x) ] / ( Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ]^{1/2} Var_{D,Θ_{b'}}[ μ̂_{D,Θ_{b'}}(x) ]^{1/2} )
     = Cov_{D,Θ_b,Θ_{b'}}[ μ̂_{D,Θ_b}(x), μ̂_{D,Θ_{b'}}(x) ] / Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ]  (4.3.8)

since, conditionally to D, the estimates μ̂_{D,Θ_b}(x) and μ̂_{D,Θ_{b'}}(x) are independent and identically distributed. Hence, combining (4.3.5) and (4.3.9), the correlation coefficient in (4.3.8) becomes
ρ(x) = Var_D[ E_{Θ_b}[ μ̂_{D,Θ_b}(x) | D ] ] / Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ]  (4.3.10)
     = Var_D[ E_{Θ_b}[ μ̂_{D,Θ_b}(x) | D ] ] / ( Var_D[ E_{Θ_b}[ μ̂_{D,Θ_b}(x) | D ] ] + E_D[ Var_{Θ_b}[ μ̂_{D,Θ_b}(x) | D ] ] ).  (4.3.11)
The correlation coefficient ρ(x) measures the correlation between a pair of pre-
dictions in the ensemble induced by repeatedly making training sample draws D
from the population and then drawing a pair of bootstrap samples from D.
When ρ(x) is close to 1, the predictions are highly correlated, suggesting that
the randomization due to the bootstrap sampling has no significant effect on the
predictions. On the contrary, when ρ(x) is close to 0, the predictions are uncorrelated,
suggesting that the randomization due to the bootstrap sampling has a strong impact
on the predictions.
One sees that ρ(x) is the ratio between the variance due to the training set and the
total variance. The total variance is the sum of the variance due to the training set and
the variance due to randomization induced by the bootstrap samples. A correlation
coefficient close to 1 and hence correlated predictions means that the total variance
is mostly driven by the training set. On the contrary, a correlation coefficient close
to 0 and hence de-correlated predictions means that the total variance is mostly due
to the randomization induced by the bootstrap samples.
Alternatively, the variance of μ̂^bag_{D,Θ}(x) given in (4.3.4) can be re-expressed in terms of the correlation coefficient. Indeed, from (4.3.10) and (4.3.11), we have

Var_D[ E_{Θ_b}[ μ̂_{D,Θ_b}(x) | D ] ] = ρ(x) Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ]  (4.3.12)

and

E_D[ Var_{Θ_b}[ μ̂_{D,Θ_b}(x) | D ] ] = (1 − ρ(x)) Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ],  (4.3.13)

so that (4.3.4) can be rewritten as

Var_{D,Θ}[ μ̂^bag_{D,Θ}(x) ] = ρ(x) Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ] + ((1 − ρ(x))/B) Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ].  (4.3.14)
As B increases, the second term disappears, but the first term remains. Hence, when ρ(x) < 1, one sees that the variance of the ensemble is strictly smaller than the variance of an individual model. Let us mention that assuming ρ(x) < 1 amounts to supposing that the randomization due to the bootstrap sampling influences the individual predictions.
Notice that the random perturbation introduced by the bootstrap sampling induces a higher variance for an individual prediction μ̂_{D,Θ_b}(x) than for μ̂_D(x), so that

Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ] ≥ Var_D[ μ̂_D(x) ].  (4.3.15)
Therefore, bagging averages models with higher variances. Nevertheless, the bagging prediction μ̂^bag_{D,Θ}(x) generally has a smaller variance than μ̂_D(x). This comes from the fact that, typically, the correlation coefficient ρ(x) in (4.3.14) compensates for the variance increase Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ] − Var_D[ μ̂_D(x) ], so that the combined effect of ρ(x) < 1 and Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ] ≥ Var_D[ μ̂_D(x) ] often leads to a variance reduction

Var_D[ μ̂_D(x) ] − ρ(x) Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ]  (4.3.16)

that is positive. Because of their high variance, regression trees very likely benefit from the averaging procedure.
For some loss functions, such as the squared error and Poisson deviance losses, we can show that the expected generalization error for the bagging prediction μ̂^bag_{D,Θ}(x) is smaller than the expected generalization error for an individual prediction μ̂_{D,Θ_b}(x), that is,

E_{D,Θ}[ Err(μ̂^bag_{D,Θ}(x)) ] ≤ E_{D,Θ_b}[ Err(μ̂_{D,Θ_b}(x)) ].  (4.3.17)

However, while it is typically the case with bagging trees, there is no guarantee that the estimate μ̂^bag_{D,Θ}(x) always performs better than μ̂_D(x) in the sense of the expected generalization error, even for the squared error and Poisson deviance losses.
From (4.3.3) and (4.3.6), one observes that the bias remains unchanged while the variance decreases compared to the individual prediction μ̂_{D,Θ_b}(x), so that we get

E_{D,Θ}[ Err(μ̂^bag_{D,Θ}(x)) ] = Err(μ(x)) + ( μ(x) − E_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ] )² + Var_{D,Θ}[ μ̂^bag_{D,Θ}(x) ]
  ≤ Err(μ(x)) + ( μ(x) − E_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ] )² + Var_{D,Θ_b}[ μ̂_{D,Θ_b}(x) ]
  = E_{D,Θ_b}[ Err(μ̂_{D,Θ_b}(x)) ].  (4.3.19)

For every value of X, the expected generalization error of the ensemble is smaller than the expected generalization error of an individual model. Taking the average of (4.3.19) over X leads to

E_{D,Θ}[ Err(μ̂^bag_{D,Θ}) ] ≤ E_{D,Θ_b}[ Err(μ̂_{D,Θ_b}) ].  (4.3.20)
For the Poisson deviance loss, from (2.4.4) and (2.4.5), the expected generalization error for μ̂^bag_{D,Θ}(x) is given by

E_{D,Θ}[ Err(μ̂^bag_{D,Θ}(x)) ] = Err(μ(x)) + E_{D,Θ}[ E^P(μ̂^bag_{D,Θ}(x)) ]  (4.3.21)

with

E_{D,Θ}[ E^P(μ̂^bag_{D,Θ}(x)) ] = 2 μ(x) ( E_{D,Θ}[ μ̂^bag_{D,Θ}(x)/μ(x) ] − 1 − E_{D,Θ}[ ln( μ̂^bag_{D,Θ}(x)/μ(x) ) ] ).  (4.3.22)

We have

E_{D,Θ}[ μ̂^bag_{D,Θ}(x)/μ(x) ] = E_{D,Θ_b}[ μ̂_{D,Θ_b}(x)/μ(x) ],  (4.3.23)

and hence

E_{D,Θ}[ Err(μ̂^bag_{D,Θ}(x)) ] ≤ E_{D,Θ_b}[ Err(μ̂_{D,Θ_b}(x)) ].  (4.3.27)

For every value of X, the expected generalization error of the ensemble is smaller than the expected generalization error of an individual model. Taking the average of (4.3.27) over X leads to

E_{D,Θ}[ Err(μ̂^bag_{D,Θ}) ] ≤ E_{D,Θ_b}[ Err(μ̂_{D,Θ_b}) ].  (4.3.28)
Example. Consider the example of Sect. 3.7.2. We simulate training sets D made of 100 000 observations and validation sets of the same size. For each simulated training set D, we build the corresponding tree μ̂_D with maxdepth = 5, which corresponds to a reasonable size in this context (see Fig. 3.31), and we estimate its generalization error on a validation set. Also, we generate bootstrap samples D*_1, D*_2, ... of D and we produce the corresponding trees μ̂_{D*_1}, μ̂_{D*_2}, ... with maxdepth = 5. We estimate their generalization errors on a validation set, together with the generalization errors of the corresponding bagging models. Note that in this example, we use the R package rpart to build the different trees described above.

Figure 4.1 displays estimates of the expected generalization errors for μ̂_D, μ̂_{D*_b} = μ̂_{D,Θ_b} and μ̂^bag_{D,Θ} for B = 1, 2, ..., 10, obtained by Monte-Carlo simulations. As expected, we notice that

E_{D,Θ}[ Err(μ̂^bag_{D,Θ}) ] ≤ E_{D,Θ_b}[ Err(μ̂_{D,Θ_b}) ].

For B ≥ 2, bagging trees outperforms individual sample trees. Also, we note that
Fig. 4.1 E_{D,Θ}[ Err(μ̂^bag_{D,Θ}) ] with respect to the number of trees B, together with E_{D,Θ_b}[ Err(μ̂_{D,Θ_b}) ] (dotted line) and E_D[ Err(μ̂_D) ] (solid line)
E_D[ Err(μ̂_D) ] ≤ E_{D,Θ_b}[ Err(μ̂_{D,Θ_b}) ],

showing that the restriction imposed by the reduced sample D*_b does not allow building trees as predictive as trees built on the entire training set D. Finally, from B = 4, we note that

E_{D,Θ}[ Err(μ̂^bag_{D,Θ}) ] ≤ E_D[ Err(μ̂_D) ],

meaning that for B ≥ 4, bagging trees also outperforms single trees built on the entire training set.
Consider the Gamma deviance loss. From (2.4.6) and (2.4.7), the expected generalization error for μ̂^bag_{D,Θ}(x) is given by

E_{D,Θ}[ Err(μ̂^bag_{D,Θ}(x)) ] = Err(μ(x)) + E_{D,Θ}[ E^G(μ̂^bag_{D,Θ}(x)) ]  (4.3.29)

with

E_{D,Θ}[ E^G(μ̂^bag_{D,Θ}(x)) ] = 2 ( E_{D,Θ}[ μ(x)/μ̂^bag_{D,Θ}(x) ] − 1 − E_{D,Θ}[ ln( μ(x)/μ̂^bag_{D,Θ}(x) ) ] ).  (4.3.30)
Since we have

E_{D,Θ}[ E^G(μ̂^bag_{D,Θ}(x)) ] = E_{D,Θ_b}[ E^G(μ̂_{D,Θ_b}(x)) ]
  + 2 ( E_{D,Θ}[ μ(x)/μ̂^bag_{D,Θ}(x) ] − E_{D,Θ_b}[ μ(x)/μ̂_{D,Θ_b}(x) ] )
  + 2 ( E_{D,Θ_b}[ ln( μ(x)/μ̂_{D,Θ_b}(x) ) ] − E_{D,Θ}[ ln( μ(x)/μ̂^bag_{D,Θ}(x) ) ] )
  = E_{D,Θ_b}[ E^G(μ̂_{D,Θ_b}(x)) ]
  + 2 E_{D,Θ}[ μ(x)/μ̂^bag_{D,Θ}(x) − ln( μ(x)/μ̂^bag_{D,Θ}(x) ) ]
  − 2 E_{D,Θ_b}[ μ(x)/μ̂_{D,Θ_b}(x) − ln( μ(x)/μ̂_{D,Θ_b}(x) ) ],  (4.3.31)

we see that

E_{D,Θ}[ E^G(μ̂^bag_{D,Θ}(x)) ] ≤ E_{D,Θ_b}[ E^G(μ̂_{D,Θ_b}(x)) ]

if and only if

E_{D,Θ}[ μ(x)/μ̂^bag_{D,Θ}(x) + ln( μ̂^bag_{D,Θ}(x)/μ(x) ) ] ≤ E_{D,Θ_b}[ μ(x)/μ̂_{D,Θ_b}(x) + ln( μ̂_{D,Θ_b}(x)/μ(x) ) ].  (4.3.32)
The latter inequality is fulfilled when the individual sample trees satisfy

μ̂_{D,Θ_b}(x)/μ(x) ≤ 2,  (4.3.33)

which, in turn, guarantees that μ̂^bag_{D,Θ}(x)/μ(x) ≤ 2. Indeed, the function φ: x > 0 → 1/x + ln x is convex for x ≤ 2, so that Jensen's inequality implies
E_{D,Θ}[ μ(x)/μ̂^bag_{D,Θ}(x) + ln( μ̂^bag_{D,Θ}(x)/μ(x) ) ] = E_{D,Θ}[ φ( μ̂^bag_{D,Θ}(x)/μ(x) ) ]
  = E_{D,Θ_1,...,Θ_B}[ φ( (1/B) Σ_{b=1}^{B} μ̂_{D,Θ_b}(x)/μ(x) ) ]
  ≤ E_{D,Θ_1,...,Θ_B}[ (1/B) Σ_{b=1}^{B} φ( μ̂_{D,Θ_b}(x)/μ(x) ) ]
  = E_{D,Θ_b}[ φ( μ̂_{D,Θ_b}(x)/μ(x) ) ]
  = E_{D,Θ_b}[ μ(x)/μ̂_{D,Θ_b}(x) + ln( μ̂_{D,Θ_b}(x)/μ(x) ) ]

and so

E_{D,Θ}[ Err(μ̂^bag_{D,Θ}) ] ≤ E_{D,Θ_b}[ Err(μ̂_{D,Θ_b}) ].  (4.3.34)

Note that condition (4.3.33) means that the individual sample prediction μ̂_{D,Θ_b}(x) should not be too far from the true prediction μ(x).
4.4 Random Forests

The random forest prediction at point x is given by

μ̂^rf_{D,Θ}(x) = (1/B) Σ_{b=1}^{B} μ̂_{D,Θ_b}(x),

where μ̂_{D,Θ_b}(x) denotes the prediction at point x for the bth random forest tree. The random vectors Θ_1, ..., Θ_B capture not only the randomness of the bootstrap sampling, as for bagging, but also the additional randomness of the training procedure due to the random selection of m features before each split. This provides the following algorithm:

For b = 1 to B do
  1. Generate a bootstrap sample D*_b of D.
  2. Fit a tree on D*_b.
     For each node t do
       (2.1) Select m (≤ p) features at random from the p original features.
       (2.2) Pick the best feature among the m.
       (2.3) Split the node into two daughter nodes.
     End for
     This gives the prediction μ̂_{D,Θ_b}(x) (use typical tree stopping criteria, but do not prune).
End for
Output: μ̂^rf_{D,Θ}(x) = (1/B) Σ_{b=1}^{B} μ̂_{D,Θ_b}(x).
As soon as m < p, random forests differ from bagging trees since the optimal split can be missed if it is not among the m features selected. A typical value for m is p/3. However, the best value for m depends on the problem under consideration and is treated as a tuning parameter. Decreasing m reduces the correlation between any pair of trees while it increases the variance of the individual trees.

Notice that random forests are more computationally efficient on a tree-by-tree basis than bagging since the training procedure only needs to assess a part of the original features at each split. However, compared to bagging, random forests usually require more trees.
Remark 4.4.1 Obviously, the results obtained in Sects. 4.3.1, 4.3.2 and 4.3.3 also hold for random forests. The only difference lies in the meaning of the random vectors Θ_1, ..., Θ_B. For bagging, those vectors express the randomization due to the bootstrap sampling, while for random forests, they also account for the randomness due to the feature selection at each node.
Bagging trees and random forests aggregate trees built on bootstrap samples D*_1, ..., D*_B. For each observation (y_i, x_i) of the training set D, an out-of-bag prediction can be constructed by averaging only the trees corresponding to bootstrap samples D*_b in which (y_i, x_i) does not appear. The out-of-bag prediction for observation (y_i, x_i) is thus given by

μ̂^oob_{D,Θ}(x_i) = ( Σ_{b=1}^{B} μ̂_{D,Θ_b}(x_i) I[(y_i, x_i) ∉ D*_b] ) / ( Σ_{b=1}^{B} I[(y_i, x_i) ∉ D*_b] ).  (4.5.1)

The corresponding out-of-bag estimate of the generalization error is

Err^oob(μ̂_{D,Θ}) = (1/|I|) Σ_{i∈I} L(y_i, μ̂^oob_{D,Θ}(x_i)).  (4.5.2)
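The out-of-bag predictions (4.5.1) can be assembled in R as sketched below; the sketch follows the bagging example above (same illustrative column names) and simply keeps track, for each tree, of the observations left out of its bootstrap sample.

library(rpart)

# Bagging with out-of-bag (OOB) predictions: for each observation, only the
# trees whose bootstrap sample does not contain it are averaged.
bag_trees_oob <- function(train, B = 100) {
  n        <- nrow(train)
  pred_sum <- numeric(n)   # running sum of OOB predictions
  n_oob    <- numeric(n)   # number of trees for which observation i is out-of-bag
  for (b in seq_len(B)) {
    idx <- sample(n, replace = TRUE)
    oob <- setdiff(seq_len(n), idx)      # observations not drawn in D*_b
    fit <- rpart(cbind(expo, nclaims) ~ Gender + Age + Sport + Split,
                 data = train[idx, ], method = "poisson",
                 control = rpart.control(cp = 0, minbucket = 100))
    pred_sum[oob] <- pred_sum[oob] + predict(fit, newdata = train[oob, ])
    n_oob[oob]    <- n_oob[oob] + 1
  }
  pred_sum / pmax(n_oob, 1)              # OOB predictions (4.5.1)
}

# OOB estimate (4.5.2) of the generalization error, reusing poisson_deviance():
# oob_mu <- bag_trees_oob(train) * train$expo
# mean(poisson_deviance(train$nclaims, oob_mu))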
4.6 Interpretability
A bagged model is less interpretable than a model that is not bagged. Bagging trees and random forests are no longer a single tree. However, there exist tools that help to better understand the model outcomes.
As for a single regression tree, the relative importance of a feature can be computed for an ensemble by combining the relative importances from the bootstrap trees. For the bth tree in the ensemble, denoted T_b, the relative importance of feature x_j is the sum of the deviance reductions D_{χ_t} over the non-terminal nodes t ∈ T^(T_b)(x_j) (i.e. the non-terminal nodes t of T_b for which x_j was selected as the splitting feature), that is,

I_b(x_j) = Σ_{t ∈ T^(T_b)(x_j)} D_{χ_t}.  (4.6.1)

For the ensemble, the relative importance of feature x_j is obtained by averaging the relative importances of x_j over the collection of trees, namely

I(x_j) = (1/B) Σ_{b=1}^{B} I_b(x_j).  (4.6.2)
For convenience, the relative importances are often normalized so that their sum equals 100. Any individual number can then be interpreted as the percentage contribution to the overall model. Sometimes, the relative importances are expressed as a percentage of the maximum relative importance.
An alternative way to compute variable importances for bagging trees and random forests is based on out-of-bag observations. Some observations (y_i, x_i) of the training set D do not appear in the bootstrap sample D*_b. They are called the out-of-bag observations for the bth tree. Because they were not used to fit that specific tree, these observations make it possible to assess the predictive accuracy of μ̂_{D,Θ_b}, that is,

Err(μ̂_{D,Θ_b}) = (1/|I \ I*_b|) Σ_{i ∈ I \ I*_b} L(y_i, μ̂_{D,Θ_b}(x_i)),  (4.6.3)

where I*_b labels the observations in D*_b. The categories of feature x_j are then randomly permuted in the out-of-bag observations, so that we get perturbed observations (y_i, x_i^{perm(j)}), i ∈ I \ I*_b, and the predictive accuracy of μ̂_{D,Θ_b} is again computed as

Err(μ̂^{perm(j)}_{D,Θ_b}) = (1/|I \ I*_b|) Σ_{i ∈ I \ I*_b} L(y_i, μ̂^{perm(j)}_{D,Θ_b}(x_i)).  (4.6.4)

The decrease in predictive accuracy due to this permuting is averaged over all trees and is used as a measure of importance for feature x_j in the ensemble, that is,

I(x_j) = (1/B) Σ_{b=1}^{B} ( Err(μ̂^{perm(j)}_{D,Θ_b}) − Err(μ̂_{D,Θ_b}) ).  (4.6.5)
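A sketch of the permutation measure (4.6.5) in R is given below; it assumes a list of fitted trees together with the row indices of their out-of-bag observations, and reuses the poisson_deviance() helper introduced earlier (all names are illustrative).

# Permutation importance (4.6.5) of one feature for an ensemble of trees.
# 'trees' is a list of fitted rpart objects and 'oob_idx' a list giving, for
# each tree, the row indices of its out-of-bag observations in 'train'.
perm_importance <- function(trees, oob_idx, train, feature) {
  increases <- mapply(function(fit, oob) {
    d_oob <- train[oob, ]
    err   <- mean(poisson_deviance(d_oob$nclaims,
                                   predict(fit, d_oob) * d_oob$expo))
    d_perm <- d_oob
    d_perm[[feature]] <- sample(d_perm[[feature]])   # randomly permute the feature
    err_p <- mean(poisson_deviance(d_perm$nclaims,
                                   predict(fit, d_perm) * d_perm$expo))
    err_p - err                                      # decrease in predictive accuracy
  }, trees, oob_idx)
  mean(increases)
}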
x_S ∪ x_S̄ = x.

In principle, μ̂(x) depends on features in both x_S and x_S̄, so that (by rearranging the order of the features if needed) we can write

μ̂(x) = μ̂(x_S, x_S̄).

This average function can be used as a description of the effect of the selected subset x_S on μ̂(x) when the features in x_S do not have strong interactions with those in x_S̄.

For instance, in the particular case where the dependence of μ̂(x) on x_S is additive,
μ̂(x) = f_S(x_S) + f_S̄(x_S̄),  (4.6.7)

Hence, (4.6.6) produces f_S(x_S) up to an additive constant. In this case, one sees that (4.6.6) provides a complete description of the way μ̂(x) varies on the subset x_S.

The partial dependence function μ_S(x_S) can be estimated from the training set by

(1/|I|) Σ_{i∈I} μ̂(x_S, x_{i,S̄}),  (4.6.9)

μ̃_S(x_S) = E_X[ μ̂(X) | X_S = x_S ].  (4.6.10)

Indeed, this latter expression captures the effect of x_S on μ̂(x) ignoring the effects of x_S̄. Both expressions (4.6.6) and (4.6.10) are actually equivalent only when X_S and X_S̄ are independent. For instance, in the specific case (4.6.7) where μ̂(x) is additive, μ̃_S(x_S) can be written as

μ̃_S(x_S) = f_S(x_S) + E_{X_S̄}[ f_S̄(X_S̄) | X_S = x_S ].  (4.6.11)
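The estimate (4.6.9) replaces the expectation over the remaining features by an average over the training observations. A short R sketch for a single feature could read as follows, where predict_ensemble() stands for the prediction function of any fitted model (for instance, the average over the bagged trees):

# Partial dependence of a fitted model on one feature, following (4.6.9):
# the chosen feature is set to each grid value for all training observations,
# the other features are kept as observed, and the predictions are averaged.
partial_dependence <- function(predict_ensemble, train, feature, grid) {
  sapply(grid, function(v) {
    data_v <- train
    data_v[[feature]] <- v
    mean(predict_ensemble(data_v))
  })
}

# Example: effect of policyholder's age
# pd_age <- partial_dependence(predict_ensemble, train, "AgePh", 18:60)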
4.7 Example
Consider the real dataset described in Sect. 3.2.4.2. We use the same training set D and validation set as in the example of Sect. 3.3.2.3, so that the estimates of the generalization error will be comparable.
We fit random forests with B = 2000 trees on D by means of the R package rfCountData. More precisely, we use the R command rfPoisson(), which stands for random forest Poisson, producing random forests with the Poisson deviance as loss function. The number of trees B = 2000 is set arbitrarily. However, we will see that B = 2000 is large enough (i.e. adding trees will not improve the predictive accuracy of the random forests under investigation).
The other parameters we need to fine-tune are
– the number m of features tried at each split (mtry);
– the size of the trees, controlled here by the minimum number of observations
(nodesize) required in terminal nodes.
To this end, we try different values for mtry and nodesize and we split the training set D into five disjoint and stratified subsets D_1, D_2, ..., D_5 of equal size. Specifically, we consider all possible values for mtry (from 1 to 8) together with four values for nodesize: 500, 1000, 5000 and 10 000. Note that the training set D contains 128 755 observations, so that values of 500 or 1000 for nodesize amount to requiring at least 0.39% and 0.78% of the observations in the final nodes of the individual trees, respectively, which allows for rather large trees. Then, for each value of (mtry, nodesize), we compute the 5-fold cross-validation estimate of the generalization error from the subsets D_1, D_2, ..., D_5. The results are depicted in Fig. 4.2. We can see that the minimum 5-fold cross-validation estimate corresponds to mtry = 3 and nodesize = 1000. Notice that for any value of nodesize, it is never optimal to use all the features at each split (i.e. mtry = 8). Introducing a random feature selection at each node therefore improves the predictive accuracy of the ensemble. Moreover, as expected, limiting the size of the trees too much (here with nodesize = 5000 or 10 000) turns out to be counterproductive. The predictive performances for trees with nodesize = 1000 are already satisfying and comparable to the ones obtained with even smaller values of nodesize.
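The grid search described above can be organized as sketched below; the call to rfPoisson() and its argument names (as well as predictors and the fold labels) are assumptions to be checked against the rfCountData documentation, and the poisson_deviance() helper from earlier is reused.

library(rfCountData)

# 5-fold cross-validation estimate of the generalization error for one
# (mtry, nodesize) pair; 'folds' assigns each row of the training set to one of
# the five stratified subsets and 'predictors' lists the feature columns.
cv_error <- function(train, folds, predictors, mtry, nodesize, ntree = 2000) {
  errs <- sapply(1:5, function(k) {
    in_k <- folds == k
    fit <- rfPoisson(x = train[!in_k, predictors],         # interface assumed
                     offset = log(train$expo[!in_k]),
                     y = train$nclaims[!in_k],
                     ntree = ntree, mtry = mtry, nodesize = nodesize)
    mu <- predict(fit, newdata = train[in_k, predictors],  # expected counts (interface assumed)
                  offset = log(train$expo[in_k]))
    mean(poisson_deviance(train$nclaims[in_k], mu))
  })
  mean(errs)
}

# Grid of tuning parameters explored in the text:
# grid <- expand.grid(mtry = 1:8, nodesize = c(500, 1000, 5000, 10000))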
In Fig. 4.3, we show the out-of-bag estimate of the generalization error for random forests with mtry = 3 and nodesize = 1000 with respect to the number of trees B. We observe that B = 2000 is more than enough, the out-of-bag estimate being already stabilized from B = 500 onwards. Notice that adding more trees beyond B = 500 does not decrease the predictive accuracy of the random forest.
Fig. 4.2 5-fold cross-validation estimates of the generalization error for mtry = 1, 2, ..., 8 and nodesize = 500, 1000, 5000, 10 000
We denote by μ̂^rf*_{D,Θ} the random forest fitted on the entire training set D with B = 500, mtry = 3 and nodesize = 1000. The relative importances of the features for μ̂^rf*_{D,Θ} are depicted in Fig. 4.4. The most important feature is AgePh followed by, in descending order, Split, Fuel, AgeCar, Cover, Gender, PowerCat and Use. Notice that this ranking of the features in terms of importance is almost identical to the one shown in Fig. 3.23 and obtained from the tree depicted in Fig. 3.17; only the order of the features AgeCar and Cover is reversed (their importances are very similar here).
Figure 4.5 represents the partial dependence plots of the features for μ̂^rf*_{D,Θ}. Specifically, one sees that the partial dependence plot for policyholder's age is relatively smooth. This is more realistic than the impact of policyholder's age deduced from the regression tree μ̂_{T_{α_{k*}}} represented in Fig. 3.20, a tree that is reproduced in Fig. 4.6 where we have circled the nodes using AgePh for splitting. Indeed, with only six circled nodes, μ̂_{T_{α_{k*}}} cannot reflect a smooth behavior of the expected claim frequency with respect to AgePh. While random forests make it possible to capture nuances of the response, regression trees suffer from a lack of smoothness.
Finally, the validation sample estimate of the generalization error of μ̂^rf*_{D,Θ} is given by

Err^val(μ̂^rf*_{D,Θ}) = 0.5440970.
Fig. 4.3 Out-of-bag estimate of the generalization error for random forests with mtry = 3 and nodesize = 1000 with respect to the number of trees B
Fig. 4.4 Relative importances of the features for μ̂^rf*_{D,Θ}, obtained by permutation
Fig. 4.5 Partial dependence plots for μ̂^rf*_{D,Θ}
Compared with

Err^val(μ̂_{T_{α_{k*}}}) = 0.5452772,

one sees that μ̂^rf*_{D,Θ} improves the predictive accuracy of the single tree μ̂_{T_{α_{k*}}} by 1.1802 × 10^{-3}.
Remark 4.7.1 It is worth noticing that the selection procedure of the optimal tuning
parameters may depend on the initial choice of the training set D and on the folds
used to compute cross-validation estimates of the generalization error. This latter
point can be mitigated by increasing the number of folds.
4.8 Bibliographic Notes and Further Reading

Bagging is one of the first ensemble methods, proposed by Breiman (1996), who showed that aggregating multiple versions of an estimator into an ensemble improves model accuracy. Several authors then added randomness into the training procedure. Dietterich and Kong (1995) introduced the idea of random split selection: they proposed to select, at each node, the split at random among the twenty best ones. Amit et al. (1997) rather proposed to choose the best split over a random subset of the features, and Amit and Geman (1997) also defined a large number of geometric features. Ho (1998) investigated the idea of building a decision forest whose trees are produced
on random subsets of the features, each tree being constructed on a random subset of the features drawn once (prior to the construction of the tree). Breiman (2000) studied the addition of noise to the response in order to perturb the tree structure. From these works emerged the random forests algorithm discussed in Breiman (2001).
Several authors applied random forests to insurance pricing, such as Wüthrich and Buser (2019), who adapted tree-based methods to model claim frequencies, or Henckaerts et al. (2020), who worked with random forests and boosted trees to develop full tariff plans built from both the frequency and severity of claims.
Note that the bias-variance decomposition of the generalization error discussed
in this chapter is due to Geman et al. (1992). Also, Sects. 4.3.3.3 and 4.6.2 are
largely inspired by Denuit and Trufin (2020) and by Sect. 8.2 of Friedman (2001),
respectively.
Finally, the presentation is mainly inspired by Breiman (2001), Hastie et al. (2009),
Wüthrich and Buser (2019) and Louppe (2014).
References
Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput
9(7):1545–1588
Amit Y, Geman D, Wilder K (1997) Joint induction of shape features and tree classifiers. IEEE
Trans Pattern Anal Mach Intell 19(11):1300–1305
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Breiman L (2000) Randomizing outputs to increase prediction accuracy. Mach Learn 40:229–242.
ISSN 0885-6125
Breiman L (2001) Random forests. Mach Learn 45:5–32
Denuit M, Trufin J (2020) Generalization error for Tweedie models: decomposition and bagging
models. Working paper
Dietterich TG, Kong EB (1995) Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University
Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat
29(5):1189–1232
Geman S, Bienenstock E, Doursat R (1992) Neural networks and the bias/variance dilemma. Neural
comput 4(1):1–58
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Data mining, inference, and prediction, 2nd edn. Springer series in statistics
Henckaerts R, Côté M-P, Antonio K, Verbelen R (2020) Boosting insights in insurance tariff
plans with tree-based machine learning methods. North Am Actuar J. https://doi.org/10.1080/
10920277.2020.1745656
Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern
Anal Mach Intell 13:340–354
Louppe G (2014) Understanding random forests: from theory to practice. arXiv:14077502
Wüthrich MV, Buser C (2019) Data analytics for non-life insurance pricing. Lecture notes
Chapter 5
Boosting Trees
5.1 Introduction
Bagging trees and random forests base their predictions on an ensemble of trees. In this chapter, we consider another training procedure based on an ensemble of trees, called boosting trees. However, the way the trees are produced and combined differs between random forests (and thus bagging trees) and boosting trees. In random forests, the trees are created independently of each other and contribute equally to the ensemble. Moreover, the constituent trees can be quite large, even fully grown. In boosting, however, the trees are typically small, dependent on the previous trees and contribute unequally to the ensemble. Both training procedures are thus different, but they produce competitive predictive performances. Note that the trees in random forests can be created simultaneously since they are independent of each other, so that the computational time for random forests is in general smaller than for boosting.
5.2 Forward Stagewise Additive Modeling

Forward stagewise additive modeling considers predictors of the form

g(μ(x)) = score(x) = Σ_{m=1}^{M} β_m T(x; a_m),  (5.2.1)

where ŝcore_0(x) is an initial guess. Then, at each iteration m ≥ 2, we solve the subproblem

(β̂_m, â_m) = argmin_{β_m, a_m} Σ_{i∈I} L( y_i, g^{-1}( ŝcore_{m-1}(x_i) + β_m T(x_i; a_m) ) )  (5.2.4)

with

ŝcore_{m-1}(x) = ŝcore_{m-2}(x) + β̂_{m-1} T(x; â_{m-1}).
1. Initialize ŝcore_0(x) to be a constant. For instance:
   ŝcore_0(x) = argmin_β Σ_{i∈I} L(y_i, g^{-1}(β)).
2. For m = 1 to M do
   2.1 Compute
       (β̂_m, â_m) = argmin_{β_m, a_m} Σ_{i∈I} L( y_i, g^{-1}( ŝcore_{m-1}(x_i) + β_m T(x_i; a_m) ) ).  (5.2.5)
   2.2 Update ŝcore_m(x) = ŝcore_{m-1}(x) + β̂_m T(x; â_m).
   End for
3. Output: μ̂_D(x) = g^{-1}( ŝcore_M(x) ).
Considering the squared-error loss together with the identity link function, step 2.1 in Algorithm 5.1 simplifies to

(β̂_m, â_m) = argmin_{β_m, a_m} Σ_{i∈I} ( y_i − ( ŝcore_{m-1}(x_i) + β_m T(x_i; a_m) ) )²
            = argmin_{β_m, a_m} Σ_{i∈I} ( r_{im} − β_m T(x_i; a_m) )²,

where r_{im} = y_i − ŝcore_{m-1}(x_i) is the residual of the model after m − 1 iterations on the ith observation. One sees that the term β_m T(x; a_m) actually fits the residuals obtained after m − 1 iterations.
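A compact R sketch of this residual-fitting scheme, with depth-one trees as weak learners and illustrative column names (y, x1, x2, x3), is given below.

library(rpart)

# Forward stagewise boosting with the squared-error loss: at each iteration, a
# small regression tree is fitted to the current residuals and added to the score.
boost_ls <- function(train, newdata, M = 100, depth = 1) {
  score_train <- rep(mean(train$y), nrow(train))    # initial guess: overall mean
  score_new   <- rep(mean(train$y), nrow(newdata))
  for (m in seq_len(M)) {
    train$res <- train$y - score_train              # residuals after m - 1 iterations
    fit <- rpart(res ~ x1 + x2 + x3, data = train, method = "anova",
                 control = rpart.control(maxdepth = depth, cp = 0, xval = 0))
    score_train <- score_train + predict(fit, train)
    score_new   <- score_new + predict(fit, newdata)
  }
  score_new    # with the identity link, the fitted values equal the score
}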
The forward stagewise additive modeling described in Algorithm 5.1 is also called
boosting. Boosting is thus an iterative method based on the idea that combining many
simple functions should result in a powerful one. In a boosting context, the simple
functions T (x; am ) are called weak learners or base learners.
There is a large variety of weak learners available for boosting models. For instance, commonly used weak learners are wavelets, multivariate adaptive regression splines, smoothing splines, classification and regression trees, or neural networks.
Although each weak learner has advantages and disadvantages, trees are the most
commonly accepted weak learners in ensemble techniques such as boosting. The
nature of trees corresponds well with the concept of weak learner. At each itera-
tion, adding a small tree will slightly improve the current predictive accuracy of the
ensemble.
In this second volume, we use trees as weak learners. Boosting using trees as weak
learners is then called boosting trees. As already noticed, the procedure underlying
boosting trees is completely different from bagging trees and random forests.
5.3 Boosting Trees

Henceforth, we use regression trees as weak learners. That is, we consider weak learners T(x; a_m) of the form

T(x; a_m) = Σ_{t∈T_m} c_{tm} I[ x ∈ χ_t^{(m)} ],  (5.3.1)

where {χ_t^{(m)}}_{t∈T_m} is the partition of the feature space χ induced by the regression tree T(x; a_m) and {c_{tm}}_{t∈T_m} are the corresponding predictions for the score. For regression trees, the "parameters" a_m represent the splitting variables and their split values as well as the corresponding predictions in the terminal nodes, that is,

a_m = { c_{tm}, χ_t^{(m)} }_{t∈T_m}.
5.3.1 Algorithm
ŝcore_m(x) = ŝcore_{m-1}(x) + β̂_m T(x; â_m)
           = ŝcore_{m-1}(x) + β̂_m Σ_{t∈T_m} ĉ_{tm} I[ x ∈ χ_t^{(m)} ]
           = ŝcore_{m-1}(x) + Σ_{t∈T_m} γ̂_{tm} I[ x ∈ χ_t^{(m)} ],

with γ̂_{tm} = β̂_m ĉ_{tm}. Hence, one sees that if (β̂_m, â_m) is a solution to (5.2.5) with

â_m = { ĉ_{tm}, χ_t^{(m)} }_{t∈T_m},

then (1, b̂_m) with

b̂_m = { γ̂_{tm}, χ_t^{(m)} }_{t∈T_m}

is also a solution to (5.2.5).
1. Initialize ŝcore_0(x) to be a constant. For instance:
   ŝcore_0(x) = argmin_β Σ_{i∈I} L(y_i, g^{-1}(β)).
2. For m = 1 to M do
   2.1 Fit a regression tree T(x; â_m) with
       â_m = argmin_{a_m} Σ_{i∈I} L( y_i, g^{-1}( ŝcore_{m-1}(x_i) + T(x_i; a_m) ) ).  (5.3.3)
   2.2 Update ŝcore_m(x) = ŝcore_{m-1}(x) + T(x; â_m).
   End for
3. Output: μ̂^boost_D(x) = g^{-1}( ŝcore_M(x) ).
For the squared-error loss with the identity link function (which is the canonical link function for the Normal distribution), we have seen that (5.3.3) simplifies to

â_m = argmin_{a_m} Σ_{i∈I} ( y_i − ( ŝcore_{m-1}(x_i) + T(x_i; a_m) ) )²
    = argmin_{a_m} Σ_{i∈I} ( r̃_{mi} − T(x_i; a_m) )²
    = argmin_{a_m} Σ_{i∈I} L( r̃_{mi}, T(x_i; a_m) ).

Hence, at iteration m, T(x; â_m) is simply the best regression tree fitting the current residuals r̃_{mi} = y_i − ŝcore_{m-1}(x_i). Finding the solution to (5.2.5) is thus no harder than for a single tree. It amounts to fitting a regression tree on the working training set

D^{(m)} = { (r̃_{mi}, x_i), i ∈ I }.
Consider (5.3.3) with the Poisson deviance loss and the log-link function (which is the canonical link function for the Poisson distribution). In actuarial studies, this choice is often made to model the number of claims, so that we also account for the observation period e, referred to as the exposure-to-risk. In such a case, one observation of the training set can be described by the claims count y_i, the features x_i and the exposure-to-risk e_i, so that we have

D = { (y_i, x_i, e_i), i ∈ I }.

Solving (5.3.3) at iteration m then amounts to fitting a Poisson regression tree to the claim counts y_i with the working exposures

e_{mi} = e_i exp( ŝcore_{m-1}(x_i) ).
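In R, each such step can be carried out with rpart's poisson method by feeding the working exposures e_mi as the time component of the response; the sketch below uses depth-one trees and illustrative column names, and is only one possible implementation of the update above.

library(rpart)

# Boosting Poisson trees under the log link: at iteration m, a small tree is
# fitted to the claim counts with working exposures e_mi = e_i * exp(score_{m-1});
# the logarithm of the fitted leaf rates is added to the current score.
boost_poisson <- function(train, M = 6, depth = 1) {
  score <- rep(log(sum(train$nclaims) / sum(train$expo)), nrow(train))  # score_0
  trees <- vector("list", M)
  for (m in seq_len(M)) {
    train$e_m <- train$expo * exp(score)             # working exposure at iteration m
    fit <- rpart(cbind(e_m, nclaims) ~ Gender + Age + Sport + Split,
                 data = train, method = "poisson",
                 control = rpart.control(maxdepth = depth, cp = 0, xval = 0))
    # note: rpart slightly shrinks the leaf rates towards the overall rate by default
    score <- score + log(predict(fit, train))        # add the log of the leaf rate
    trees[[m]] <- fit
  }
  list(score = score, trees = trees)                 # fitted model: mu_hat(x) = exp(score)
}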
Example. Consider the example of Sect. 3.2.4.1. The optimal tree is shown in Fig. 3.8, and is denoted here μ̂^tree_D. Next to the training set D made of 500 000 observations, we create a validation set containing a sufficiently large number of observations to get stable results for validation sample estimates, here 1 000 000 observations. The validation sample estimate of the generalization error of μ̂^tree_D is then given by

Err^val(μ̂^tree_D) = 0.5525081.  (5.3.5)
where

e_{1i} = e_i exp( ŝcore_0(x_i) ) = exp(−2.030834).

The single split is x_2 ≥ 45 and the predictions in the terminal nodes are −0.1439303 for x_2 ≥ 45 and 0.0993644 for x_2 < 45. We then get

ŝcore_1(x) = ŝcore_0(x) + T(x; â_1)
           = −2.030834 − 0.1439303 I[x_2 ≥ 45] + 0.0993644 I[x_2 < 45].

At the second iteration, the tree is fitted with the working exposures

e_{2i} = e_i exp( ŝcore_1(x_i) )
       = exp( −2.030834 − 0.1439303 I[x_{i2} ≥ 45] + 0.0993644 I[x_{i2} < 45] ).
The single split is made with x_4 and the predictions in the terminal nodes are −0.06514170 for x_4 = no and 0.06126193 for x_4 = yes. We get

ŝcore_2(x) = ŝcore_1(x) + T(x; â_2)
           = −2.030834
             − 0.1439303 I[x_2 ≥ 45] + 0.0993644 I[x_2 < 45]
             − 0.06514170 I[x_4 = no] + 0.06126193 I[x_4 = yes].

At the third iteration, the tree is fitted with the working exposures

e_{3i} = e_i exp( ŝcore_2(x_i) )
       = exp( −2.030834 − 0.1439303 I[x_{i2} ≥ 45] + 0.0993644 I[x_{i2} < 45] )
         exp( −0.06514170 I[x_{i4} = no] + 0.06126193 I[x_{i4} = yes] ).
The single split is x2 ≥ 30 and the predictions in the terminal nodes are −0.03341700
for x2 ≥ 30 and 0.08237676 for x2 < 30. We get
3 (x) = score
score 2 (x) + T (x;
a3 )
= −2.030834
−0.1439303 I [x2 ≥ 45] + 0.0993644 I [x2 < 45]
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes
−0.03341700 I [x2 ≥ 30] + 0.08237676 I [x2 < 30]
= −2.030834
+(0.0993644 + 0.08237676) I [x2 < 30]
+(0.0993644 − 0.03341700) I [30 ≤ x2 < 45]
−(0.1439303 + 0.03341700) I [x2 ≥ 45]
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes
= −2.030834
+0.1817412 I [x2 < 30] + 0.0659474 I [30 ≤ x2 < 45] − 0.1773473 I [x2 ≥ 45]
−0.06514170 I [x4 = no] + 0.06126193 I x4 = yes .
with
$$
e_{4i} = e_i \exp\left(\widehat{score}_3(x_i)\right)
= \exp(-2.030834)\,
\exp\left(0.1817412\, I[x_{i2} < 30] + 0.0659474\, I[30 \leq x_{i2} < 45] - 0.1773473\, I[x_{i2} \geq 45]\right)
\exp\left(-0.06514170\, I[x_{i4} = \text{no}] + 0.06126193\, I[x_{i4} = \text{yes}]\right).
$$
The single split is made with $x_1$ and the predictions in the terminal nodes are $-0.05309399$ for $x_1 = \text{female}$ and $0.05029175$ for $x_1 = \text{male}$. We get
$$
\widehat{score}_4(x) = \widehat{score}_3(x) + T(x; \widehat{a}_4)
= -2.030834 + 0.1817412\, I[x_2 < 30] + 0.0659474\, I[30 \leq x_2 < 45] - 0.1773473\, I[x_2 \geq 45]
- 0.06514170\, I[x_4 = \text{no}] + 0.06126193\, I[x_4 = \text{yes}]
- 0.05309399\, I[x_1 = \text{female}] + 0.05029175\, I[x_1 = \text{male}],
$$
with
$$
e_{5i} = e_i \exp\left(\widehat{score}_4(x_i)\right)
= \exp(-2.030834)\,
\exp\left(0.1817412\, I[x_{i2} < 30] + 0.0659474\, I[30 \leq x_{i2} < 45] - 0.1773473\, I[x_{i2} \geq 45]\right)
\exp\left(-0.06514170\, I[x_{i4} = \text{no}] + 0.06126193\, I[x_{i4} = \text{yes}]\right)
\exp\left(-0.05309399\, I[x_{i1} = \text{female}] + 0.05029175\, I[x_{i1} = \text{male}]\right).
$$
The single split is $x_2 \geq 45$ and the predictions in the terminal nodes are $-0.01979230$ for $x_2 < 45$ and $0.03326232$ for $x_2 \geq 45$. We get
$$
\widehat{score}_5(x) = \widehat{score}_4(x) + T(x; \widehat{a}_5)
$$
$$
= -2.030834 - 0.05309399\, I[x_1 = \text{female}] + 0.05029175\, I[x_1 = \text{male}]
+ 0.1817412\, I[x_2 < 30] + 0.0659474\, I[30 \leq x_2 < 45] - 0.1773473\, I[x_2 \geq 45]
- 0.06514170\, I[x_4 = \text{no}] + 0.06126193\, I[x_4 = \text{yes}]
- 0.01979230\, I[x_2 < 45] + 0.03326232\, I[x_2 \geq 45]
$$
$$
= -2.030834 - 0.05309399\, I[x_1 = \text{female}] + 0.05029175\, I[x_1 = \text{male}]
+ (0.1817412 - 0.01979230)\, I[x_2 < 30] + (0.0659474 - 0.01979230)\, I[30 \leq x_2 < 45]
+ (0.03326232 - 0.1773473)\, I[x_2 \geq 45]
- 0.06514170\, I[x_4 = \text{no}] + 0.06126193\, I[x_4 = \text{yes}]
$$
$$
= -2.030834 - 0.05309399\, I[x_1 = \text{female}] + 0.05029175\, I[x_1 = \text{male}]
+ 0.1619489\, I[x_2 < 30] + 0.0461551\, I[30 \leq x_2 < 45] - 0.144085\, I[x_2 \geq 45]
- 0.06514170\, I[x_4 = \text{no}] + 0.06126193\, I[x_4 = \text{yes}],
$$
with
$$
e_{6i} = e_i \exp\left(\widehat{score}_5(x_i)\right)
= \exp(-2.030834)\,
\exp\left(-0.05309399\, I[x_{i1} = \text{female}] + 0.05029175\, I[x_{i1} = \text{male}]\right)
\exp\left(0.1619489\, I[x_{i2} < 30] + 0.0461551\, I[30 \leq x_{i2} < 45] - 0.144085\, I[x_{i2} \geq 45]\right)
\exp\left(-0.06514170\, I[x_{i4} = \text{no}] + 0.06126193\, I[x_{i4} = \text{yes}]\right).
$$
The single split is $x_2 \geq 30$ and the predictions in the terminal nodes are $-0.009135574$ for $x_2 \geq 30$ and $0.019476688$ for $x_2 < 30$. We get
$$
\widehat{score}_6(x) = \widehat{score}_5(x) + T(x; \widehat{a}_6)
$$
$$
= -2.030834 - 0.05309399\, I[x_1 = \text{female}] + 0.05029175\, I[x_1 = \text{male}]
+ 0.1619489\, I[x_2 < 30] + 0.0461551\, I[30 \leq x_2 < 45] - 0.144085\, I[x_2 \geq 45]
- 0.06514170\, I[x_4 = \text{no}] + 0.06126193\, I[x_4 = \text{yes}]
- 0.009135574\, I[x_2 \geq 30] + 0.019476688\, I[x_2 < 30]
$$
$$
= -2.030834 - 0.05309399\, I[x_1 = \text{female}] + 0.05029175\, I[x_1 = \text{male}]
+ 0.1814256\, I[x_2 < 30] + 0.03701953\, I[30 \leq x_2 < 45] - 0.1532206\, I[x_2 \geq 45]
- 0.06514170\, I[x_4 = \text{no}] + 0.06126193\, I[x_4 = \text{yes}],
$$
so that the boosting model after M = 6 iterations is
$$
\widehat{\mu}_{\mathcal{D}}^{\text{boost}}(x) = \exp(-2.030834)\,
\exp\left(-0.05309399\, I[x_1 = \text{female}] + 0.05029175\, I[x_1 = \text{male}]\right)
\exp\left(0.1814256\, I[x_2 < 30] + 0.03701953\, I[30 \leq x_2 < 45] - 0.1532206\, I[x_2 \geq 45]\right)
\exp\left(-0.06514170\, I[x_4 = \text{no}] + 0.06126193\, I[x_4 = \text{yes}]\right).
$$
Table 5.1 Risk classes with their corresponding expected claim frequencies μ(x) and estimated expected claim frequencies $\widehat{\mu}_{\mathcal{D}}^{\text{tree}}(x)$ and $\widehat{\mu}_{\mathcal{D}}^{\text{boost}}(x)$ for M = 6

x1 (Gender) | x2 (Age)      | x4 (Sport) | μ(x)   | μ̂tree(x) | μ̂boost(x) (M = 6)
Female      | x2 ≥ 45       | No         | 0.1000 | 0.1005   | 0.1000
Male        | x2 ≥ 45       | No         | 0.1100 | 0.1085   | 0.1109
Female      | x2 ≥ 45       | Yes        | 0.1150 | 0.1164   | 0.1135
Male        | x2 ≥ 45       | Yes        | 0.1265 | 0.1285   | 0.1259
Female      | 30 ≤ x2 < 45  | No         | 0.1200 | 0.1206   | 0.1210
Male        | 30 ≤ x2 < 45  | No         | 0.1320 | 0.1330   | 0.1342
Female      | 30 ≤ x2 < 45  | Yes        | 0.1380 | 0.1365   | 0.1373
Male        | 30 ≤ x2 < 45  | Yes        | 0.1518 | 0.1520   | 0.1522
Female      | x2 < 30       | No         | 0.1400 | 0.1422   | 0.1399
Male        | x2 < 30       | No         | 0.1540 | 0.1566   | 0.1550
Female      | x2 < 30       | Yes        | 0.1610 | 0.1603   | 0.1586
Male        | x2 < 30       | Yes        | 0.1771 | 0.1772   | 0.1759
Fig. 5.1 Validation sample estimate $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right)$ with respect to the number of iterations for trees with two terminal nodes, together with $\widehat{\operatorname{Err}}^{\text{val}}(\mu)$ (dotted line) and $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right)$ (solid line)
Consider the Gamma deviance loss. Using the log-link function, (5.3.3) is then given by
$$
\widehat{a}_m = \underset{a_m}{\operatorname{argmin}} \sum_{i \in \mathcal{I}} L\left(y_i, \exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)\right)
= \underset{a_m}{\operatorname{argmin}} \sum_{i \in \mathcal{I}} \left\{ -2 \ln \frac{y_i}{\exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)} + 2\left(\frac{y_i}{\exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)} - 1\right) \right\},
$$
so that we get
$$
\widehat{a}_m = \underset{a_m}{\operatorname{argmin}} \sum_{i \in \mathcal{I}} \left\{ -2 \ln \frac{\widetilde{r}_{mi}}{\exp\left(T(x_i; a_m)\right)} + 2\left(\frac{\widetilde{r}_{mi}}{\exp\left(T(x_i; a_m)\right)} - 1\right) \right\}
= \underset{a_m}{\operatorname{argmin}} \sum_{i \in \mathcal{I}} L\left(\widetilde{r}_{mi}, \exp\left(T(x_i; a_m)\right)\right) \qquad (5.3.7)
$$
with
$$
\widetilde{r}_{mi} = \frac{y_i}{\exp\left(\widehat{score}_{m-1}(x_i)\right)}
$$
for $i \in \mathcal{I}$. Therefore, (5.3.3) simplifies to (5.3.7), so that finding the solution to (5.3.3) amounts to obtaining the regression tree with the Gamma deviance loss and the log-link function that best predicts the working responses $\widetilde{r}_{mi}$. Hence, solving (5.3.3) amounts to building the best tree on the working training set
$$
\mathcal{D}^{(m)} = \{(\widetilde{r}_{mi}, x_i),\ i \in \mathcal{I}\}.
$$
Boosting trees have two important tuning parameters, namely the size of the constituent trees and the number of trees M. The size of the trees can be specified in different ways, such as with the number of terminal nodes J or with the depth of the tree D.

In the boosting context, the size of the trees is controlled by the interaction depth ID. Each subsequent split can be seen as a higher-level interaction with the previously split features. Setting ID = 1 produces single-split regression trees, so that no interactions are allowed; only the main effects of the features can be captured by the score. With ID = 2, two-way interactions are also permitted, with ID = 3 three-way interactions are allowed, and so on. Thus, the value of ID reflects the level of interactions permitted in the score. Note that ID corresponds to the number of splits in the trees, so that ID = J − 1.

In practice, the level of interactions required is often unknown, so that ID is a tuning parameter that is set by considering different values and selecting the one that minimizes the generalization error estimated on a validation set or by cross-validation. In practice, ID = 1 will often be insufficient, while ID > 10 is very unlikely to be needed.
Note that in the simulated example discussed in Sect. 5.3.2.2, trees with ID = 1 are big enough to get satisfactory results since the true score does not contain interaction effects.
5.3.3.1 Example 1
Consider the simulated example of Sect. 3.6. In this example, using the log-link function, the true score is given by a sum in which the first terms are functions of a single feature each, while the last term is a two-variable function, producing a second-order interaction between $X_1$ and $X_2$.

We generate 1 000 000 additional observations to produce a validation set. The validation sample estimate of the generalization error of $\widehat{\mu}_{\mathcal{D}}^{\text{tree}}$, which is the optimal tree displayed in Fig. 3.25, is then given by
$$
\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right) = 0.5595743. \qquad (5.3.9)
$$
Let us fit boosting trees using the Poisson deviance loss and the log-link function. We follow the procedure described in the example of Sect. 5.3.2.2. First, we consider trees with only one split (ID = 1). Figure 5.2 provides the validation sample estimate $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right)$ with respect to the number of trees M, together with $\widehat{\operatorname{Err}}^{\text{val}}(\mu)$ and $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right)$. We observe that the boosting models $\widehat{\mu}_{\mathcal{D}}^{\text{boost}}$ have higher validation sample estimates of the generalization error than the optimal tree $\widehat{\mu}_{\mathcal{D}}^{\text{tree}}$, whatever the number of iterations M. Contrary to the optimal tree, the boosting models with ID = 1 cannot account for the second-order interaction. We should therefore consider trees with ID = 2. However, as already mentioned, specifying the number of terminal nodes J of a tree, or equivalently the interaction depth ID, is not possible with rpart.

To overcome this issue, we could rely on the R gbm package for instance, which enables the user to specify the interaction depth. This package implements the gradient boosting approach described in Sect. 5.4, which is an approximation of boosting for any differentiable loss function. In Sect. 5.4.4.2, we illustrate the use of the gbm package on this particular example with ID = 1, 2, 3, 4; a sketch of such a call is given below.
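The call could look as follows (a minimal sketch; the data frame D, the claim count nclaims, the exposure expo and the feature names X1, ..., X4 are hypothetical, and the offset term is the way we believe gbm handles exposures for the Poisson distribution).

```r
library(gbm)

# Gradient boosting with two-way interactions allowed (ID = 2)
fit_id2 <- gbm(nclaims ~ offset(log(expo)) + X1 + X2 + X3 + X4,
               distribution = "poisson",
               data = D,
               n.trees = 20,            # M
               interaction.depth = 2,   # ID
               shrinkage = 1,           # no shrinkage, as in Algorithm 5.2
               bag.fraction = 1)        # no subsampling
```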
Fig. 5.2 Validation sample estimate $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right)$ with respect to the number of iterations for trees with two terminal nodes (J = 2), together with $\widehat{\operatorname{Err}}^{\text{val}}(\mu)$ (dotted line) and $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right)$ (solid line)
5.3.3.2 Example 2
Consider the simulated example of Sect. 3.7.2. We generate 1 000 000 additional observations to produce a validation set. The validation sample estimate of the generalization error of $\widehat{\mu}_{\mathcal{D}}^{\text{tree}}$, which is the tree minimizing the 10-fold cross-validation error and shown in Fig. 3.31, is then given by
$$
\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right) = 0.5803903. \qquad (5.3.11)
$$
Contrary to the previous example and the example in Sect. 5.3.2.2, a single tree (here $\widehat{\mu}_{\mathcal{D}}^{\text{tree}}$) cannot reproduce the desired partition of the feature space. We have seen in Sect. 3.7.2 that $\widehat{\mu}_{\mathcal{D}}^{\text{tree}}$ suffers from a lack of smoothness. The validation sample estimate of the generalization error of the true model μ is
$$
\widehat{\operatorname{Err}}^{\text{val}}(\mu) = 0.5802196, \qquad (5.3.12)
$$
Fig. 5.3 Validation sample estimate $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right)$ with respect to the number of iterations for trees with two terminal nodes (J = 2), together with $\widehat{\operatorname{Err}}^{\text{val}}(\mu)$ (dotted line) and $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right)$ (solid line)
so that the room for improvement for the validation sample estimate of the generalization error is
$$
\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right) - \widehat{\operatorname{Err}}^{\text{val}}(\mu) = 0.0001706516.
$$
We follow the same procedure as in the example of Sect. 5.3.2.2, namely we use the Poisson deviance loss with the log-link function and we consider trees with only one split (ID = 1) since there are no interaction effects in the true model. In Fig. 5.3, we provide the validation sample estimate $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right)$ with respect to the number of trees M, together with $\widehat{\operatorname{Err}}^{\text{val}}(\mu)$ and $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right)$. We see that the error estimate $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right)$ becomes smaller than $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right)$ from M = 8 and stabilizes around M = 25. The smallest error corresponds to M = 49 and is given by
$$
\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right) = 0.5802692.
$$
In Fig. 5.4, we show the validation sample estimate $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right)$ with respect to the number of trees M for constituent trees with depth D = 2, 3.
Fig. 5.4 Validation sample estimate $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right)$ with respect to the number of iterations for trees with D = 2 (top) and D = 3 (bottom), together with $\widehat{\operatorname{Err}}^{\text{val}}(\mu)$ (dotted line) and $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right)$ (solid line)
For D = 2, the smallest error $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right)$ corresponds to M = 6 and is given by
$$
\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right) = 0.5802779,
$$
while for D = 3, the smallest error corresponds to M = 4 and is given by
$$
\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right) = 0.5803006.
$$
One sees that increasing the depth of the trees does not improve the predictive accuracy of the boosting models. These models incur unnecessary variance, leading to higher validation sample estimates of the generalization error. Denoting by M* the number of trees minimizing $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{boost}}\right)$, we observe that the predictive accuracy of the model significantly degrades when M > M*: increasing the size of the trees enables the boosting model to fit the training set too closely when M > M*.
5.4 Gradient Boosting Trees

Simple fast algorithms do not always exist for solving (5.3.3). Depending on the choice of the loss function and the link function, the solution to (5.3.3) can be difficult to obtain. For any differentiable loss function, this difficulty can be overcome by analogy with numerical optimization. The solution to (5.3.3) can then be approximated by a two-step procedure, as explained next.
To ease the presentation of this section, we equivalently use the notation $\{(y_1^*, x_1^*), \ldots, (y_{|\mathcal{I}|}^*, x_{|\mathcal{I}|}^*)\}$ for the observations of the training set, that is,
$$
\mathcal{D} = \{(y_i, x_i),\ i \in \mathcal{I}\} = \{(y_1^*, x_1^*), \ldots, (y_{|\mathcal{I}|}^*, x_{|\mathcal{I}|}^*)\}. \qquad (5.4.1)
$$
The primary objective is to minimize the training sample estimate of the generalization error
$$
L(score(x)) = \sum_{i \in \mathcal{I}} L\left(y_i, g^{-1}(score(x_i))\right). \qquad (5.4.2)
$$
We observe that (5.4.2) only involves the score at the values $x_i$, $i \in \mathcal{I}$. Hence, forgetting that we work with constrained functions, we actually try to find optimal parameters
$$
\eta = (\eta_1, \ldots, \eta_{|\mathcal{I}|}) = \left(score(x_1^*), \ldots, score(x_{|\mathcal{I}|}^*)\right) \qquad (5.4.3)
$$
minimizing
$$
L(\eta) = \sum_{i=1}^{|\mathcal{I}|} L\left(y_i^*, g^{-1}(\eta_i)\right). \qquad (5.4.4)
$$
The saturated model is of course one of the solutions to (5.4.4), namely $\eta_i = g(y_i^*)$ for all $i = 1, \ldots, |\mathcal{I}|$, but this solution typically leads to overfitting. Note that restricting the score to be the sum of a limited number of relatively small regression trees will prevent overfitting.

Our problem can be viewed as the numerical optimization
$$
\widehat{\eta} = \underset{\eta}{\operatorname{argmin}}\ L(\eta). \qquad (5.4.5)
$$
Numerical optimization methods solve (5.4.5) by expressing the solution as a sum of successive steps
$$
\widehat{\eta} = \sum_{m=0}^{M} b_m, \qquad b_m \in \mathbb{R}^{|\mathcal{I}|}, \qquad (5.4.6)
$$
One of the simplest and most frequently used numerical minimization methods is the steepest descent. The steepest descent defines the step $b_m$ as
$$
b_m = -\rho_m\, g_m,
$$
where $g_m$ is the gradient of $L(\eta)$ evaluated at $\eta = \widehat{\eta}_{m-1}$, with
$$
\widehat{\eta}_{m-1} = b_0 + b_1 + \ldots + b_{m-1}. \qquad (5.4.8)
$$
The negative gradient $-g_m$ gives the local direction along which $L(\eta)$ decreases the most rapidly at $\eta = \widehat{\eta}_{m-1}$. The step length $\rho_m$ is then found by the line search
$$
\rho_m = \underset{\rho > 0}{\operatorname{argmin}}\ L\left(\widehat{\eta}_{m-1} - \rho\, g_m\right), \qquad (5.4.9)
$$
5.4.3 Algorithm
Steepest descent would be an effective strategy to minimize $L(\eta)$. Unfortunately, the gradient $g_m$ is only defined at the feature values $x_i$, $i \in \mathcal{I}$, so that it cannot be generalized to other feature values, whereas we would like a function defined on the entire feature space $\chi$. Furthermore, we want to prevent overfitting.

A way to solve these issues is to constrain the step directions to be members of a class of functions. Specifically, we reinstate the temporarily forgotten constraint, namely to work with regression trees. Hence, we approximate the direction $-g_m$ by a regression tree $T(x; \widehat{a}_m)$ producing
$$
-\widehat{g}_m = \left(T(x_1^*; \widehat{a}_m), \ldots, T(x_{|\mathcal{I}|}^*; \widehat{a}_m)\right)
$$
that is as close as possible to the negative gradient, i.e. the most parallel to $-g_m$.

A common choice to measure the closeness between constrained candidates $T(x; a_m)$ for the negative gradient and the unconstrained gradient $-g_m = -(g_{m1}, \ldots, g_{m|\mathcal{I}|})$ is the squared error, so that we get
$$
\widehat{a}_m = \underset{a_m}{\operatorname{argmin}} \sum_{i=1}^{|\mathcal{I}|} \left(-g_{mi} - T(x_i^*; a_m)\right)^2. \qquad (5.4.11)
$$
The step length is then obtained from the line search
$$
\widehat{\rho}_m = \underset{\rho_m > 0}{\operatorname{argmin}} \sum_{i=1}^{|\mathcal{I}|} L\left(y_i^*, g^{-1}\left(\widehat{score}_{m-1}(x_i^*) + \rho_m T(x_i^*; \widehat{a}_m)\right)\right), \qquad (5.4.12)
$$
where
$$
\widehat{score}_{m-1}(x) = \widehat{score}_{m-2}(x) + \widehat{\rho}_{m-1} T(x; \widehat{a}_{m-1}).
$$
$$
\widehat{score}_0(x) = \underset{\rho_0}{\operatorname{argmin}} \sum_{i=1}^{|\mathcal{I}|} L\left(y_i^*, g^{-1}(\rho_0)\right).
$$
As already noticed, since $T(x; \widehat{a}_m)$ can be written as
$$
T(x; \widehat{a}_m) = \sum_{t \in T_m} \widehat{c}_{tm}\, I\left[x \in \chi_t^{(m)}\right],
$$
the update
$$
\widehat{score}_m(x) = \widehat{score}_{m-1}(x) + \widehat{\rho}_m T(x; \widehat{a}_m)
$$
can be viewed as adding $|T_m|$ separate base learners $\widehat{c}_{tm}\, I[x \in \chi_t^{(m)}]$, $t \in T_m$, to $\widehat{score}_{m-1}(x)$. Instead of using one coefficient for $T(x; \widehat{a}_m)$, the current procedure can be improved by using one optimal coefficient for each base learner $\widehat{c}_{tm}\, I[x \in \chi_t^{(m)}]$, $t \in T_m$. Thus, we replace the optimal coefficient $\widehat{\rho}_m$ solution to (5.4.12) by $|T_m|$ coefficients $\widehat{\rho}_{tm}$, $t \in T_m$, that are solution to
$$
\{\widehat{\rho}_{tm}\}_{t \in T_m} = \underset{\{\rho_{tm}\}_{t \in T_m}}{\operatorname{argmin}} \sum_{i=1}^{|\mathcal{I}|} L\left(y_i^*, g^{-1}\left(\widehat{score}_{m-1}(x_i^*) + \sum_{t \in T_m} \rho_{tm}\, \widehat{c}_{tm}\, I\left[x_i^* \in \chi_t^{(m)}\right]\right)\right). \qquad (5.4.13)
$$
(5.4.13)
Subspaces {χ(m)
t } t∈Tm of the feature space χ are disjoint, so that (5.4.13) amounts to
solve
ρtm = argmin
L yi∗ , g −1 score
m−1 (x i∗ ) + ρtm ctm (5.4.14)
ρtm
i:x i∗ ∈χ(m)
t
for each t ∈ Tm . Coefficients ρtm , t ∈ Tm , are thus fitted separately. Here, the least-
squares coefficients ctm , t ∈ Tm , are simply multiplicative constants. Ignoring these
latter coefficients, (5.4.14) reduces to
γtm = argmin L yi∗ , g −1 score
m−1 (x i∗ ) + γtm , (5.4.15)
γtm
i:x i∗ ∈χ(m)
t
Algorithm 5.3 (Gradient boosting trees)

1. Initialize $\widehat{score}_0(x)$ to be a constant. For instance:
$$
\widehat{score}_0(x) = \underset{\rho_0}{\operatorname{argmin}} \sum_{i=1}^{|\mathcal{I}|} L\left(y_i^*, g^{-1}(\rho_0)\right). \qquad (5.4.16)
$$
2. For m = 1 to M do
   2.1 For $i = 1, \ldots, |\mathcal{I}|$, compute the working responses
$$
r_{mi} = -\left[\frac{\partial L\left(y_i^*, g^{-1}(\eta)\right)}{\partial \eta}\right]_{\eta = \widehat{score}_{m-1}(x_i^*)}. \qquad (5.4.17)
$$
   2.2 Fit a regression tree to the working responses $r_{mi}$ by least squares, producing the partition $\{\chi_t^{(m)},\ t \in T_m\}$ of the feature space.
   2.3 For each terminal node $t \in T_m$, compute the prediction
$$
\widehat{\gamma}_{tm} = \underset{\gamma_{tm}}{\operatorname{argmin}} \sum_{i:\, x_i^* \in \chi_t^{(m)}} L\left(y_i^*, g^{-1}\left(\widehat{score}_{m-1}(x_i^*) + \gamma_{tm}\right)\right). \qquad (5.4.19)
$$
   2.4 Update $\widehat{score}_m(x) = \widehat{score}_{m-1}(x) + \sum_{t \in T_m} \widehat{\gamma}_{tm}\, I\left[x \in \chi_t^{(m)}\right]$.
   End for
3. Output: $\widehat{\mu}_{\mathcal{D}}^{\text{grad boost}}(x) = g^{-1}\left(\widehat{score}_M(x)\right)$.
Step 2.2 in Algorithm 5.3 determines the partition $\{\chi_t^{(m)},\ t \in T_m\}$ of the feature space for the mth iteration. Then, given this partition, step 2.3 estimates the predictions in the terminal nodes $t \in T_m$ at the mth iteration based on the current score $\widehat{score}_{m-1}(x)$. Instead of directly fitting one tree as in step 2.1 of Algorithm 5.2, we first fit a tree on the working responses by least squares (step 2.2 in Algorithm 5.3) to get the structure of the tree, and then we compute the predictions in the final nodes (step 2.3 in Algorithm 5.3).
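This two-step structure can be sketched in R as follows. The helper functions working_response() and node_prediction() are hypothetical placeholders encapsulating steps 2.1 and 2.3 for the chosen loss (a Poisson example is given further below), and tree$where identifies the terminal node of each training observation.

```r
library(rpart)

# Skeleton of Algorithm 5.3 (hypothetical data frame D with features x1, x2)
gradient_boost <- function(D, M, init, working_response, node_prediction) {
  score <- rep(init, nrow(D))
  for (m in seq_len(M)) {
    D$r_m <- working_response(D, score)                       # step 2.1: working responses
    tree <- rpart(r_m ~ x1 + x2, data = D, method = "anova",  # step 2.2: least-squares tree
                  control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
    gamma <- node_prediction(D, score, tree$where)             # step 2.3: node predictions
    score <- score + gamma[as.character(tree$where)]           # step 2.4: update the score
  }
  score
}
```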
For the squared-error loss with the identity link function, i.e.
$$
L\left(y, g^{-1}(score(x))\right) = (y - score(x))^2,
$$
we directly get
$$
\widehat{score}_0(x) = \frac{\sum_{i=1}^{|\mathcal{I}|} y_i^*}{|\mathcal{I}|}
$$
and
$$
r_{mi} = -\left[\frac{\partial (y_i^* - \eta)^2}{\partial \eta}\right]_{\eta = \widehat{score}_{m-1}(x_i^*)}
= 2\left(y_i^* - \widehat{score}_{m-1}(x_i^*)\right). \qquad (5.4.20)
$$
Therefore, the working responses $r_{mi}$ are just the ordinary residuals (up to the immaterial factor 2), as in Sect. 5.3.2.1, so that gradient boosting trees is here equivalent to boosting trees. The predictions (5.4.19) are given by
$$
\widehat{\gamma}_{tm} = \frac{\sum_{i:\, x_i^* \in \chi_t^{(m)}} \left(y_i^* - \widehat{score}_{m-1}(x_i^*)\right)}{\operatorname{card}\{i :\, x_i^* \in \chi_t^{(m)}\}}. \qquad (5.4.21)
$$
Consider the Poisson deviance loss together with the log-link function, and the training set
$$
\mathcal{D} = \{(y_i, x_i, e_i),\ i \in \mathcal{I}\} = \{(y_1^*, x_1^*, e_1^*), \ldots, (y_{|\mathcal{I}|}^*, x_{|\mathcal{I}|}^*, e_{|\mathcal{I}|}^*)\}. \qquad (5.4.22)
$$
Since
$$
L\left(y, e\, g^{-1}(score(x))\right) = 2y\left(\ln \frac{y}{e \exp(score(x))} - 1 + \frac{e \exp(score(x))}{y}\right),
$$
we get
$$
r_{mi} = -2y_i^* \left[\frac{\partial}{\partial \eta}\left(\ln \frac{y_i^*}{e_i^* \exp(\eta)} - 1 + \frac{e_i^* \exp(\eta)}{y_i^*}\right)\right]_{\eta = \widehat{score}_{m-1}(x_i^*)}
= 2\left(y_i^* - e_i^* \exp\left(\widehat{score}_{m-1}(x_i^*)\right)\right) \qquad (5.4.23)
$$
$$
= 2\left(y_i^* - e_{mi}^*\right), \qquad (5.4.24)
$$
with $e_{mi}^* = e_i^* \exp\left(\widehat{score}_{m-1}(x_i^*)\right)$. At iteration m, step 2.2 in Algorithm 5.3 amounts to fitting a regression tree $T(x; \widehat{a}_m)$ on the residuals (5.4.24). The predictions (5.4.19) are given by
$$
\widehat{\gamma}_{tm} = \ln \frac{\sum_{i:\, x_i^* \in \chi_t^{(m)}} y_i^*}{\sum_{i:\, x_i^* \in \chi_t^{(m)}} e_{mi}^*}. \qquad (5.4.25)
$$
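For this Poisson case, the hypothetical helpers of the earlier skeleton would read as follows, implementing the working responses (5.4.24) and the terminal-node predictions (5.4.25) (column names nclaims and expo are again assumptions).

```r
# Working responses (5.4.24): twice the difference between counts and updated exposures
working_response <- function(D, score) 2 * (D$nclaims - D$expo * exp(score))

# Terminal-node predictions (5.4.25): log-ratio of observed counts to updated exposures
node_prediction <- function(D, score, leaf) {
  e_m <- D$expo * exp(score)
  log(tapply(D$nclaims, leaf, sum) / tapply(e_m, leaf, sum))
}
```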
As we have seen, boosting trees is fully manageable in this case since fitting the mth tree in the sequence is no harder than fitting a single regression tree. Thus, gradient boosting trees is a bit artificial here, as noted in Wüthrich and Buser (2019): at each iteration, the structure of the new tree is obtained by least squares, whereas the Poisson deviance loss could be used directly without any problem. Hence, gradient boosting trees introduces an extra, artificial step, leading to an unnecessary approximation.
Example
Consider the example in Sect. 5.3.3.1. We use the R package gbm to build the gradient boosting models $\widehat{\mu}_{\mathcal{D}}^{\text{grad boost}}$ with the Poisson deviance loss and the log-link function. Note that the gbm command enables the user to specify the interaction depth of the constituent trees. Figure 5.5 displays the validation sample estimate $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{grad boost}}\right)$ with respect to the number of trees M for ID = 1, 2, 3, 4, together with $\widehat{\operatorname{Err}}^{\text{val}}(\mu)$ and $\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}}^{\text{tree}}\right)$.

The gradient boosting models with ID = 1 produce results similar to those depicted in Fig. 5.2. As expected, because of the second-order interaction between $X_1$ and $X_2$, ID = 2 is the optimal choice in this example. The gradient boosting model with ID = 2 and M = 6 produces the lowest validation sample estimate of the generalization error, which is similar to that of the single optimal tree $\widehat{\mu}_{\mathcal{D}}^{\text{tree}}$. Note that gradient boosting models with ID = 3, 4 have lower validation sample estimates of the generalization error as long as M ≤ 5. This is due to the fact that models with ID = 3, 4 learn faster than those with ID = 2; from M = 6 onwards, the models with ID = 2 perform best.
Fig. 5.5 Validation sample estimate of the generalization error with respect to the number of boosting iterations for trees with ID = 1, 2, 3, 4
For the Gamma deviance loss with the log-link function, i.e.
$$
L\left(y, g^{-1}(score(x))\right) = -2 \ln \frac{y}{\exp(score(x))} + 2\left(\frac{y}{\exp(score(x))} - 1\right), \qquad (5.4.26)
$$
(5.4.16) and (5.4.17) become
$$
\widehat{score}_0(x) = \ln \frac{\sum_{i=1}^{|\mathcal{I}|} y_i^*}{|\mathcal{I}|}
$$
and
$$
r_{mi} = 2\left[\frac{\partial}{\partial \eta}\left(\ln \frac{y_i^*}{\exp(\eta)} - \left(\frac{y_i^*}{\exp(\eta)} - 1\right)\right)\right]_{\eta = \widehat{score}_{m-1}(x_i^*)}
= 2\left(\frac{y_i^*}{\exp\left(\widehat{score}_{m-1}(x_i^*)\right)} - 1\right), \qquad (5.4.27)
$$
with
$$
\widetilde{r}_{mi} = \frac{y_i^*}{\exp\left(\widehat{score}_{m-1}(x_i^*)\right)}. \qquad (5.4.29)
$$
Again, as for the Poisson deviance loss with the log-link function, we have seen that boosting trees for the Gamma deviance loss is easy to implement with the log-link function, so that gradient boosting trees is also a bit artificial here.
If we consider the canonical link function $g(x) = -1/x$, i.e.
$$
L\left(y, g^{-1}(score(x))\right) = -2 \ln\left(-y\, score(x)\right) - 2\left(y\, score(x) + 1\right), \qquad (5.4.31)
$$
(5.4.16) and (5.4.17) become
$$
\widehat{score}_0(x) = \frac{-|\mathcal{I}|}{\sum_{i=1}^{|\mathcal{I}|} y_i^*}
$$
and
$$
r_{mi} = 2\left[\frac{\partial}{\partial \eta}\left(\ln\left(-y_i^* \eta\right) + y_i^* \eta + 1\right)\right]_{\eta = \widehat{score}_{m-1}(x_i^*)}
= 2\left(\frac{1}{\widehat{score}_{m-1}(x_i^*)} + y_i^*\right), \qquad (5.4.32)
$$
5.5 Boosting Versus Gradient Boosting

In practice, actuaries often use distributions that belong to the Tweedie class together with the log-link function to approximate μ(x). The log-link function is chosen mainly because of the multiplicative structure it produces for the resulting model. The Tweedie class regroups the members of the ED family having power variance functions V(μ) = μ^ξ for some ξ.
Specifically, the Tweedie class includes continuous distributions such as the Nor-
mal, Gamma and Inverse Gaussian distributions. It also includes the Poisson and
compound Poisson-Gamma distributions. Compound Poisson-Gamma distributions
can be used for modeling annual claim amounts, having positive probability at zero
and a continuous distribution on the positive real numbers. In practice, annual claim
amounts are often decomposed into claim numbers and claim severities and sepa-
rate analyses of these quantities are conducted. Typically, the Poisson distribution is
used for modeling claim counts and the Gamma or Inverse Gaussian distributions
for claim severities.
The following table gives a list of all Tweedie distributions.

ξ         | Type                 | Name
ξ < 0     | Continuous           | -
ξ = 0     | Continuous           | Normal
0 < ξ < 1 | Non-existing         | -
ξ = 1     | Discrete             | Poisson
1 < ξ < 2 | Mixed, non-negative  | Compound Poisson-Gamma
ξ = 2     | Continuous, positive | Gamma
2 < ξ < 3 | Continuous, positive | -
ξ = 3     | Continuous, positive | Inverse Gaussian
ξ > 3     | Continuous, positive | -
Negative values of ξ give continuous distributions on the whole real axis. For 0 < ξ < 1, no ED member exists. Only the cases ξ ≥ 1 are thus interesting for application in insurance. The corresponding deviance loss function is
$$
L(y, \mu) =
\begin{cases}
(y - \mu)^2 & \text{for } \xi = 0\\
2\left(y \ln \frac{y}{\mu} - (y - \mu)\right) & \text{for } \xi = 1\\
2\left(-\ln \frac{y}{\mu} + \frac{y}{\mu} - 1\right) & \text{for } \xi = 2\\
2\left(\frac{\max(y, 0)^{2-\xi}}{(1-\xi)(2-\xi)} - \frac{y\, \mu^{1-\xi}}{1-\xi} + \frac{\mu^{2-\xi}}{2-\xi}\right) & \text{otherwise.}
\end{cases} \qquad (5.5.1)
$$
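For later reference, the deviance loss (5.5.1) can be coded as a small R function; the case ξ = 1 uses the usual convention that y ln(y/μ) equals 0 when y = 0.

```r
# Tweedie deviance loss (5.5.1) for response y, prediction mu and power xi
tweedie_deviance <- function(y, mu, xi) {
  if (xi == 0) return((y - mu)^2)
  if (xi == 1) return(2 * (ifelse(y > 0, y * log(y / mu), 0) - (y - mu)))
  if (xi == 2) return(2 * (-log(y / mu) + y / mu - 1))
  2 * (pmax(y, 0)^(2 - xi) / ((1 - xi) * (2 - xi))
       - y * mu^(1 - xi) / (1 - xi)
       + mu^(2 - xi) / (2 - xi))
}
```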
with
$$
\widetilde{r}_{mi} = \frac{y_i}{\exp\left(\widehat{score}_{m-1}(x_i)\right)}
$$
and
$$
e_{mi} = e_i \exp\left(\widehat{score}_{m-1}(x_i)\right).
$$
For a Poisson response Y, we know that the actuary is allowed to work either with the observed claim count Y or with the observed claim rate $\widetilde{Y} = Y/e$ provided the weight ν = e enters the analysis. The distribution of the claim rate $\widetilde{Y}$ still belongs to the Tweedie class and is called the Poisson rate distribution. See Property 2.5.1 in Denuit, Hainaut and Trufin (2019) for more details. This is reflected by the fact that the Poisson deviance loss satisfies
$$
L\left(y_i, e_i \exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)\right) = \nu_i\, L\left(\widetilde{y}_i, \exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)\right). \qquad (5.5.4)
$$
Indeed,
$$
\nu_i\, L\left(\widetilde{y}_i, \exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)\right)
= \nu_i\, 2\left(\widetilde{y}_i \ln \frac{\widetilde{y}_i}{\exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)} - \left(\widetilde{y}_i - \exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)\right)\right)
$$
$$
= \nu_i \exp\left(\widehat{score}_{m-1}(x_i)\right) 2\left(\frac{\widetilde{y}_i}{\exp\left(\widehat{score}_{m-1}(x_i)\right)} \ln \frac{\widetilde{y}_i / \exp\left(\widehat{score}_{m-1}(x_i)\right)}{\exp\left(T(x_i; a_m)\right)} - \left(\frac{\widetilde{y}_i}{\exp\left(\widehat{score}_{m-1}(x_i)\right)} - \exp\left(T(x_i; a_m)\right)\right)\right)
$$
$$
= \nu_i \exp\left(\widehat{score}_{m-1}(x_i)\right) L\left(\frac{\widetilde{y}_i}{\exp\left(\widehat{score}_{m-1}(x_i)\right)}, \exp\left(T(x_i; a_m)\right)\right)
= \nu_{mi}\, L\left(\widetilde{r}_{mi}, \exp\left(T(x_i; a_m)\right)\right) \qquad (5.5.5)
$$
with
$$
\nu_{mi} = \nu_i \exp\left(\widehat{score}_{m-1}(x_i)\right) \quad \text{and} \quad \widetilde{r}_{mi} = \frac{\widetilde{y}_i}{\exp\left(\widehat{score}_{m-1}(x_i)\right)}.
$$
The mth iteration of the boosting procedure thus reduces to building a single tree on the working training set
$$
\mathcal{D}^{(m)} = \{(\widetilde{r}_{mi}, x_i, \nu_{mi}),\ i \in \mathcal{I}\}
$$
using the Poisson deviance loss and the log-link function. The weights are updated at each iteration together with the responses, which are assumed to follow Poisson rate distributions.
While the weights are updated differently in the Poisson (rate) and Gamma cases (they remain constant throughout the boosting procedure in the Gamma case), it is interesting to notice that the working responses at the mth iteration are in both cases the original responses divided by the current predictions $\exp\left(\widehat{score}_{m-1}(x_i)\right)$.

The next result shows that the latter observation remains true for any member of the Tweedie class with the log-link function. Actually, any member of the Tweedie class with the log-link function gives rise to a simple boosting algorithm.
Proposition 5.5.1 Consider the deviance loss function (5.5.1). Then, (5.3.3) with the log-link function, that is,
$$
\widehat{a}_m = \underset{a_m}{\operatorname{argmin}} \sum_{i \in \mathcal{I}} \nu_i\, L\left(y_i, \exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)\right),
$$
can be rewritten as
$$
\widehat{a}_m = \underset{a_m}{\operatorname{argmin}} \sum_{i \in \mathcal{I}} \nu_{mi}\, L\left(\widetilde{r}_{mi}, \exp\left(T(x_i; a_m)\right)\right) \qquad (5.5.6)
$$
with
$$
\nu_{mi} = \nu_i \exp\left(\widehat{score}_{m-1}(x_i)\right)^{2-\xi} \quad \text{and} \quad \widetilde{r}_{mi} = \frac{y_i}{\exp\left(\widehat{score}_{m-1}(x_i)\right)}.
$$
Proof The cases ξ = 1 and ξ = 2 have already been discussed. Turning to the Normal case ξ = 0, we have
$$
\nu_i\, L\left(y_i, \exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)\right)
= \nu_i \left[y_i - \exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)\right]^2
= \nu_i \left[y_i - \exp\left(\widehat{score}_{m-1}(x_i)\right) \exp\left(T(x_i; a_m)\right)\right]^2
$$
$$
= \nu_i \exp\left(2\, \widehat{score}_{m-1}(x_i)\right) \left(\frac{y_i}{\exp\left(\widehat{score}_{m-1}(x_i)\right)} - \exp\left(T(x_i; a_m)\right)\right)^2
= \nu_i \exp\left(2\, \widehat{score}_{m-1}(x_i)\right) L\left(\frac{y_i}{\exp\left(\widehat{score}_{m-1}(x_i)\right)}, \exp\left(T(x_i; a_m)\right)\right)
= \nu_{mi}\, L\left(\widetilde{r}_{mi}, \exp\left(T(x_i; a_m)\right)\right).
$$
Finally, when ξ ∉ {0, 1, 2}, it comes
$$
\nu_i\, L\left(y_i, \exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)\right)
= 2\nu_i \left(\frac{\max(y_i, 0)^{2-\xi}}{(1-\xi)(2-\xi)} - \frac{y_i \exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)^{1-\xi}}{1-\xi} + \frac{\exp\left(\widehat{score}_{m-1}(x_i) + T(x_i; a_m)\right)^{2-\xi}}{2-\xi}\right)
$$
$$
= 2\nu_i \exp\left(\widehat{score}_{m-1}(x_i)\right)^{2-\xi} \left(\frac{\max(\widetilde{r}_{mi}, 0)^{2-\xi}}{(1-\xi)(2-\xi)} - \frac{\widetilde{r}_{mi} \exp\left(T(x_i; a_m)\right)^{1-\xi}}{1-\xi} + \frac{\exp\left(T(x_i; a_m)\right)^{2-\xi}}{2-\xi}\right)
$$
$$
= \nu_{mi}\, L\left(\widetilde{r}_{mi}, \exp\left(T(x_i; a_m)\right)\right),
$$
which ends the proof.
This result shows that when we work with the log-link function and a response that belongs to the Tweedie class (and so with a loss function of the form (5.5.1)), solving (5.3.3) amounts to building a single regression tree on the working training set
$$
\mathcal{D}^{(m)} = \{(\widetilde{r}_{mi}, x_i, \nu_{mi}),\ i \in \mathcal{I}\}.
$$
Therefore, in these cases, which are the most relevant for insurance applications, boosting trees should be preferred to gradient boosting trees: the latter procedure introduces an extra step which is unnecessary and leads to an approximation that can easily be avoided with boosting trees.
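In practice, Proposition 5.5.1 means that each iteration only requires recomputing the working responses and weights before fitting a single tree with the deviance (5.5.1). A minimal sketch of this update, with hypothetical columns y, nu, x1 and x2, is given below; the resulting working set is then passed to whichever tree-fitting routine handles the chosen Tweedie deviance with weights.

```r
# Working training set D^(m) implied by Proposition 5.5.1
make_working_set <- function(D, score, xi) {
  data.frame(r_m  = D$y / exp(score),              # working responses
             nu_m = D$nu * exp(score)^(2 - xi),    # updated weights
             D[, c("x1", "x2")])                   # features are left unchanged
}
```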
5.6 Regularization and Randomness

5.6.1 Shrinkage
Boosting (and gradient boosting) models are susceptible to overfitting. They employ the greedy strategy of selecting the optimal weak learner at each step. Such a strategy produces an optimal solution at each stage of the training procedure, but it does not find the optimal global solution and often fits the training set too closely when M is large: after a certain number of iterations, further reducing the training sample estimate of the generalization error actually starts to increase the generalization error itself.

Regularization methods aim to prevent such overfitting by constraining the training procedure. Controlling the value of M is a natural regularization strategy. For a given size of the constituent trees, which can be specified with ID, there is an optimal number of trees M* minimizing the generalization error. In practice, M* can be estimated as the value of M that minimizes the validation sample estimate (or the cross-validation estimate) of the generalization error.

Another regularization strategy consists in adding only a fraction of the prediction produced by the new tree to the current score. This fraction is often referred to as the learning rate or shrinkage factor and takes its values between 0 and 1. That is, line 2.2 in Algorithm 5.2 and line 2.4 in Algorithm 5.3 are replaced by
$$
\text{Update } \widehat{score}_m(x) = \widehat{score}_{m-1}(x) + \tau\, T(x; \widehat{a}_m) \qquad (5.6.1)
$$
and
$$
\text{Update } \widehat{score}_m(x) = \widehat{score}_{m-1}(x) + \tau \sum_{t \in T_m} \widehat{\gamma}_{tm}\, I\left[x \in \chi_t^{(m)}\right], \qquad (5.6.2)
$$
respectively, where τ denotes the learning rate.
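In the sketches given earlier in this chapter, this simply means multiplying each new contribution by τ before it is added to the score; for instance (tree, gamma and D refer to the hypothetical objects used in those sketches):

```r
tau <- 0.1                                          # learning rate, 0 < tau <= 1
# shrunken update (5.6.1) for boosting trees
score <- score + tau * predict(tree, newdata = D)
# shrunken update (5.6.2) for gradient boosting trees
score <- score + tau * gamma[as.character(tree$where)]
```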
5.6.2 Randomness
and
$$
\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{grad boost}}(x) = g^{-1}\left(\widehat{score}_M(x)\right),
$$
respectively, where $\widehat{\Theta} = (\widehat{\Theta}_1, \ldots, \widehat{\Theta}_M)$.
5.7 Interpretability
As for bagging trees and random forests, boosting trees are less interpretable than a single tree. Nevertheless, we can also rely on tools such as relative importances and partial dependences to better understand model outcomes. Moreover, Friedman's H-statistics, introduced in Friedman and Popescu (2008), make it possible to identify which features are involved in interactions with other features, the identities of the other features with which they interact, as well as the order and strength of the respective interaction effects. Note that Friedman's H-statistics can be computed for any regression model, including bagging trees and random forests.

To improve their readability, the relative importances are often normalized so that their sum equals 100. Any individual number can then be interpreted as the percentage contribution to the overall model. Sometimes, the relative importances are expressed as a percentage of the maximum relative importance.
Also, the partial dependence of $\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{boost}}(x)$ or $\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{grad boost}}(x)$ on selected small subsets of the features helps the analyst improve his or her understanding of the model. We refer the reader to Sect. 4.6.2 for more details about partial dependence plots.
Tree-based models are praised for their ability to account for interaction effects between features. Knowing which features are involved in interactions with other features, the identities of the other features with which they interact, as well as the order and strength of the respective interaction effects provides useful information to the analyst.
Consider two features $x_j$ and $x_k$. If these two variables do not interact, then the function $\widehat{\mu}(x)$ can be written as
$$
\widehat{\mu}(x) = f_{\setminus j}(x_{\setminus j}) + f_{\setminus k}(x_{\setminus k}), \qquad (5.7.2)
$$
where $x_{\setminus j}$ and $x_{\setminus k}$ represent all the features except $x_j$ and $x_k$, respectively. If a given feature $x_j$ interacts with none of the other features, then $\widehat{\mu}(x)$ can be expressed as
$$
\widehat{\mu}(x) = f_{\setminus j}(x_{\setminus j}) + f_j(x_j). \qquad (5.7.3)
$$
Similarly, if the features $x_j$, $x_k$ and $x_l$ do not participate in a joint three-variable interaction, then $\widehat{\mu}(x)$ can be written as
$$
\widehat{\mu}(x) = f_{\setminus j}(x_{\setminus j}) + f_{\setminus k}(x_{\setminus k}) + f_{\setminus l}(x_{\setminus l}). \qquad (5.7.4)
$$
For any subset $x_S$ of the features, the partial dependence of $\widehat{\mu}(x)$ on $x_S$ can be estimated by
$$
\widehat{\mu}_S(x_S) = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \widehat{\mu}(x_S, x_{i \bar{S}}), \qquad (5.7.6)
$$
where $\{x_{i \bar{S}},\ i \in \mathcal{I}\}$ are the values of $X_{\bar{S}}$ in the training set. In this section, we consider that all partial dependence functions and the regression model $\widehat{\mu}(x)$ are centered to have a mean of zero.
If there is no interaction between $x_j$ and $x_k$, then, from (5.7.2), the partial dependence of $\widehat{\mu}(x)$ on $x_S = (x_j, x_k)$ can be decomposed into the sum of the respective partial dependences on each feature separately, that is,
$$
\widehat{\mu}_{j,k}(x_j, x_k) = E_{X_{\setminus j,k}}\left[f_{\setminus j}(x_k, X_{\setminus j,k})\right] + E_{X_{\setminus j,k}}\left[f_{\setminus k}(x_j, X_{\setminus j,k})\right]
= \widehat{\mu}_k(x_k) - E_{X_{\setminus k}}\left[f_{\setminus k}(X_{\setminus k})\right] + \widehat{\mu}_j(x_j) - E_{X_{\setminus j}}\left[f_{\setminus j}(X_{\setminus j})\right]
= \widehat{\mu}_k(x_k) + \widehat{\mu}_j(x_j) - E_X\left[\widehat{\mu}(X)\right]
= \widehat{\mu}_k(x_k) + \widehat{\mu}_j(x_j). \qquad (5.7.7)
$$
since
$$
\widehat{\mu}_j(x_j) = E_{X_{\setminus j}}\left[f_{\setminus j}(X_{\setminus j})\right] + E_{X_{\setminus j,k}}\left[f_{\setminus k}(x_j, X_{\setminus j,k})\right] + E_{X_{\setminus j,l}}\left[f_{\setminus l}(x_j, X_{\setminus j,l})\right]
$$
and
$$
\widehat{\mu}_{j,k}(x_j, x_k) = E_{X_{\setminus j,k}}\left[f_{\setminus j}(x_k, X_{\setminus j,k})\right] + E_{X_{\setminus j,k}}\left[f_{\setminus k}(x_j, X_{\setminus j,k})\right] + E_{X_{\setminus j,k,l}}\left[f_{\setminus l}(x_j, x_k, X_{\setminus j,k,l})\right],
$$
with
$$
E_{X_{\setminus j}}\left[f_{\setminus j}(X_{\setminus j})\right] + E_{X_{\setminus k}}\left[f_{\setminus k}(X_{\setminus k})\right] + E_{X_{\setminus l}}\left[f_{\setminus l}(X_{\setminus l})\right] = E_X\left[\widehat{\mu}(X)\right] = 0.
$$
Similar expressions can be obtained for the absence of higher order interactions.
Expressions (5.7.7), (5.7.8) and (5.7.9) can be used to test for the presence of
interaction effects. Specifically, to test a potential interaction between two given
features x j and xk , we can rely on the statistic
2
i∈I
μ j,k (x ji , xki ) −
μ j (x ji ) −
μk (xki )
2
H j,k =
, (5.7.10)
2
i∈I μ j,k (x ji , x ki )
2
μ(x i ) −
μ j (x ji ) −
μ\ j (x i\ j )
H j2 = i∈I
(5.7.11)
i∈I μ2 (x i )
to test whether x j interacts with any other variable. The statistic (5.7.11) differs from
zero to the extent that x j interacts with one or more other features.
In the case where $x_j$ interacts with more than one other feature, say with at least $x_k$ and $x_l$, it is interesting to determine whether these interactions represent separate two-way interactions between $(x_j, x_k)$ and $(x_j, x_l)$ only, or whether there is an additional three-way interaction between $(x_j, x_k, x_l)$. This alternative can be tested by means of the statistic
$$
H_{j,k,l}^2 = \frac{\sum_{i \in \mathcal{I}} \left(\widehat{\mu}_{j,k,l}(x_{ji}, x_{ki}, x_{li}) - \widehat{\mu}_{j,k}(x_{ji}, x_{ki}) - \widehat{\mu}_{j,l}(x_{ji}, x_{li}) - \widehat{\mu}_{k,l}(x_{ki}, x_{li}) + \widehat{\mu}_j(x_{ji}) + \widehat{\mu}_k(x_{ki}) + \widehat{\mu}_l(x_{li})\right)^2}{\sum_{i \in \mathcal{I}} \widehat{\mu}_{j,k,l}^2(x_{ji}, x_{ki}, x_{li})}, \qquad (5.7.12)
$$
which measures the fraction of variance of $\widehat{\mu}_{j,k,l}(x_{ji}, x_{ki}, x_{li})$ not explained by the lower order interaction effects among these features. Similarly, additional statistics for higher order interactions can be built, if needed.
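The statistic (5.7.10) can be estimated for any fitted model by evaluating the partial dependence functions (5.7.6) at the training observations, as in the following R sketch. Here pred_fun is a user-supplied function returning predictions on the score scale, feature names are passed as character strings, and the double loop over observations makes the computation O(n²), so that a subsample of the training set is typically used.

```r
# Partial dependence (5.7.6) of a fitted model on the features in `vars`,
# evaluated at the rows of `grid` while averaging over the other features
partial_dep <- function(model, D, vars, grid, pred_fun) {
  sapply(seq_len(nrow(grid)), function(r) {
    Dtmp <- D
    for (v in vars) Dtmp[[v]] <- grid[[v]][r]
    mean(pred_fun(model, Dtmp))
  })
}

# Friedman's H-statistic (5.7.10) for the pair of features (j, k)
H2_jk <- function(model, D, j, k, pred_fun) {
  pd_jk <- partial_dep(model, D, c(j, k), D[, c(j, k)], pred_fun)
  pd_j  <- partial_dep(model, D, j, D[, j, drop = FALSE], pred_fun)
  pd_k  <- partial_dep(model, D, k, D[, k, drop = FALSE], pred_fun)
  pd_jk <- pd_jk - mean(pd_jk)                      # centre the partial dependences
  pd_j  <- pd_j - mean(pd_j)
  pd_k  <- pd_k - mean(pd_k)
  sum((pd_jk - pd_j - pd_k)^2) / sum(pd_jk^2)
}
```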
5.8 Example
Consider the real dataset described in Sect. 3.2.4.2. We use the same training set $\mathcal{D}$ and validation set $\overline{\mathcal{D}}$ as in the examples of Sects. 3.3.2.3 and 4.7.
We fit gradient boosting trees on D with the Poisson deviance loss and the log-link
function by means of the R package gbm. The parameters we need to fine-tune are
• the number of trees M;
• the size of the trees ID;
• the bagging fraction α;
• the shrinkage parameter τ .
To this end, we consider different values for the tuning parameters ID, α and τ ,
namely ID = 1, 2, 3, 4, 5, 6, α = 1, 0.75, 0.5 and τ = 1, 0.1, 0.01, and we split the
training set D into five disjoint and stratified subsets D1 , D2 , . . . , D5 of equal size.
Then, for each value of (ID, α, τ ), depicted in Table 5.2, we compute the 5-fold
cross-validation estimates of the generalization error (from subsets D1 , D2 , . . . , D5 )
for models including up to 4000 trees, and we select the optimal number of trees as the
number of trees corresponding to the model with the smallest 5-fold cross-validation
estimate.
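A sketch of this grid search with gbm could look as follows (nclaims and expo are hypothetical column names; the built-in cv.folds argument is used here as a stand-in for the stratified 5-fold splitting described above, and gbm.perf() returns the number of trees minimizing the cross-validation error).

```r
library(gbm)

grid <- expand.grid(ID = 1:6, alpha = c(1, 0.75, 0.5), tau = c(1, 0.1, 0.01))

results <- lapply(seq_len(nrow(grid)), function(r) {
  fit <- gbm(nclaims ~ offset(log(expo)) + AgePh + AgeCar + Fuel + Gender +
               Cover + Use + Split + PowerCat,
             distribution = "poisson", data = D,
             n.trees = 4000,
             interaction.depth = grid$ID[r],
             shrinkage = grid$tau[r],
             bag.fraction = grid$alpha[r],
             cv.folds = 5)
  M_opt <- gbm.perf(fit, method = "cv", plot.it = FALSE)
  c(M_opt = M_opt, cv_error = min(fit$cv.error))
})
```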
In Fig. 5.6, we show the optimal numbers of trees obtained for the different values of (ID, α, τ) under consideration. We notice that considering 4000 trees was more than enough in this example, even for ID = 1 and τ = 0.01.
Fig. 5.6 Optimal numbers of trees for the values of (ID, α, τ) summarized in Table 5.2
Also, one sees that more trees are needed when the shrinkage parameter decreases or when the size of the trees decreases, highlighting the interplay between these tuning parameters.
Figure 5.7 displays the 5-fold cross-validation estimates of the generalization error for the models built with their corresponding optimal number of trees. One sees that the introduction of a shrinkage parameter improves the predictive accuracy of the boosting procedure. Also, results with τ = 0.01 are slightly better than the ones obtained with τ = 0.1.
Fig. 5.7 5-fold cross-validation estimates of the generalization error for the best models corresponding to the values of (ID, α, τ) summarized in Table 5.2
To a lesser extent, adding randomness into the training procedure also appears to be relevant in this example. For instance, one sees that for any given value of ID, the model with (τ = 0.01, α = 0.5) performs slightly better than the model with (τ = 0.01, α = 0.75), which, in turn, performs slightly better than the model with (τ = 0.01, α = 1). Finally, if we look at models with τ = 0.01, i.e. models 37 to 54, we observe that the interaction depths minimizing the 5-fold cross-validation estimate of the generalization error are ID = 2 for α = 1, ID = 3 for α = 0.75 and ID = 4 for α = 0.5, so that the best value for ID ranges from 2 to 4. Based on the results depicted in Fig. 5.7, we select M = 986, ID = 4, α = 0.5 and τ = 0.01 as the optimal tuning parameters, which correspond to the values minimizing the 5-fold cross-validation estimate (i.e. model 52). We denote by $\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{grad boost}*}$ the gradient boosting model fitted on the entire training set with these optimal parameters.
Remark 5.8.1 Remark 4.7.1 about the instability related to the selection procedure of the optimal tuning parameters still holds for (gradient) boosting trees. As an illustration, Table 5.3 provides, for each iteration j of the 5-fold cross-validation, the number of trees for τ = 0.01 and α = 0.5 minimizing the validation-sample estimate of the generalization error computed on $\mathcal{D}_j$ for models fitted on $\mathcal{D} \setminus \mathcal{D}_j$, together with the corresponding out-of-sample estimate of the generalization error. We can see that the optimal tuning parameters for ID and M are unstable over the five cross-validation iterations.
Table 5.3 Number of trees for τ = 0.01 and α = 0.5 minimizing the validation-sample estimate of the generalization error computed on $\mathcal{D}_j$ for the model fitted on $\mathcal{D} \setminus \mathcal{D}_j$, together with the corresponding out-of-sample estimate of the generalization error (in parentheses)

ID | Iteration 1       | Iteration 2       | Iteration 3       | Iteration 4       | Iteration 5
1  | 1841 (0.5323761)  | 1847 (0.5336023)  | 2895 (0.5424494)  | 2356 (0.5368788)  | 3475 (0.5417317)
2  | 3194 (0.5314581)  | 1291 (0.5331031)  | 2422 (0.5422430)  | 1740 (0.5360356)  | 1814 (0.5412181)
3  | 1422 (0.5314159)  | 1004 (0.5329624)  | 1413 (0.5421080)  | 1401 (0.5359842)  | 1642 (0.5413473)
4  | 860 (0.5314600)   | 933 (0.5328013)   | 919 (0.5422055)   | 962 (0.5358395)   | 1315 (0.5415447)
5  | 907 (0.5316125)   | 615 (0.5328922)   | 779 (0.5424979)   | 987 (0.5356169)   | 702 (0.5416647)
6  | 619 (0.5314849)   | 738 (0.5327782)   | 652 (0.5424877)   | 992 (0.5360672)   | 707 (0.5418545)
Fig. 5.8 Relative importances of the features for $\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{grad boost}*}$
Fig. 5.9 Partial dependence plots for $\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{grad boost}*}$ (on the score scale)
Fig. 5.10 H-statistic $H_{j,k}^2$ for $\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{grad boost}*}$
Fig. 5.11 Effect of the policyholder's age on the score for males (left-hand side) and females (right-hand side)
One observes from Fig. 5.10 that the three strongest interactions are found between AgeCar and Cover, between AgeCar and PowerCat and between AgePh and Gender, this latter interaction being well-known in MTPL insurance. The H-statistic $H_{j,k}^2$ informs us about the strength of the interaction between features $x_j$ and $x_k$ but gives no clue about how the interaction behaves. For instance, in Fig. 5.11, we show the effect (on the score scale) of the policyholder's age for males on the left-hand side and for females on the right-hand side. We observe that for young policyholders, males are on average riskier drivers than females, whereas at older ages female drivers are perceived as riskier than males.
Finally, the validation sample estimate of the generalization error of $\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{grad boost}*}$ (computed on $\overline{\mathcal{D}}$) is given by
$$
\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{grad boost}*}\right) = 0.5431231.
$$
Comparing with the corresponding estimates obtained for the optimal single tree and for the optimal random forest, namely
$$
\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{T_{\alpha_{k^*}}}\right) = 0.5452772
$$
and
$$
\widehat{\operatorname{Err}}^{\text{val}}\left(\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{rf}*}\right) = 0.5440970,
$$
one sees that $\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{grad boost}*}$ improves the predictive accuracy of the single tree $\widehat{\mu}_{T_{\alpha_{k^*}}}$ by $2.1541 \times 10^{-3}$ and that of the random forest $\widehat{\mu}_{\mathcal{D}, \widehat{\Theta}}^{\text{rf}*}$ by $0.9739 \times 10^{-3}$.
Boosting was originally designed for classification problems. Valiant (1984) and
Kearns and Valiant (1989) introduced the concept of combining weak classifiers into
a strong classifier. These works influenced Schapire, who developed the first simple
boosting procedure (Schapire 1990). The performance of the simple boosting algo-
rithm of Schapire was improved by Freund (1995). Freund and Schapire collaborated
to produce the AdaBoost algorithm (Freund and Schapire 1996a, 1997). To sup-
port their algorithms, Freund and Schapire (1996a) and Schapire and Singer (1999)
derived some upper bounds on the generalization error. Other theories attempting
to explain boosting come from game theory (Freund and Schapire 1996b; Breiman 1998) and Vapnik-Chervonenkis theory (Schapire et al. 1998). In particular,
Breiman (1998) explained the algorithm as a gradient descent approach with numer-
ical optimization and statistical estimation. In practice, the AdaBoost algorithm was
shown to be a powerful prediction tool, far beyond the expectations implied by the
bounds and the theoretical developments.
Friedman et al. (2000) made the link between the AdaBoost algorithm and the
statistical concepts of loss functions, additive modeling and logistic regression. They
showed that boosting can be viewed as a forward stagewise additive model that mini-
mizes exponential loss. Friedman (2001) proposed a boosting method called Gradient
Boosting Machine for regression and classification problems, which combines weak
learners. Bühlmann and Hothorn (2007) adopted penalty splines, linear regressors,
and trees in various scenarios. Ridgeway (2007) uses only trees as the base learners.
Gradient boosting machine with neural networks can be found in Denuit et al. (2019).
Friedman and Popescu (2008) presented techniques to identify the variables that
are involved in interactions with other variables, the strength and degree of those
interactions, as well as the identities of the other variables with which they inter-
act. Tree-based models are known for their ability to account for interaction effects
between features, as illustrated in Buchner et al. (2017) and Schiltz et al. (2018).
Several authors applied boosting and gradient boosting to insurance pricing. Guelman (2012) proposed gradient boosted trees for predicting auto insurance loss. Liu et al. (2014) treated the claim frequency prediction problem by using multi-class AdaBoost trees. Wüthrich and Buser (2019) adapted tree-based methods to model claim frequencies. Yang et al. (2018) predicted insurance premiums by applying a gradient boosted tree algorithm to Tweedie models. Lee and Lin (2018) introduced the Delta Boosting Machine as a new member of the boosting family with application to general insurance. Pesantez-Narvaez et al. (2019) employed XGBoost to predict the occurrence of claims using telematics data. Henckaert et al. (2020) worked with random forests and boosted trees to develop full tariff plans built from both the frequency and severity of claims.
We mainly based our presentation on Hastie et al. (2009), Friedman (2000) and Friedman and Popescu (2008). Sect. 5.5 is inspired by Denuit et al. (2020). Hastie et al. (2009) and Kuhn and Johnson (2013) provide a good overview of the existing literature.
References
Liu Y, Wang B, Lv S (2014) Using multi-class AdaBoost tree for prediction frequency of auto
insurance. J Appl Financ Bank 4(5):45–53
Pesantez-Narvaez J, Guillen M, Alcañiz M (2019) Predicting motor insurance claims using telematics data: XGBoost versus logistic regression. Risks 7(2):1–16
Ridgeway G (2007) Generalized boosted models: a guide to the GBM package. Update 1(1):
Schapire R (1990) The strength of weak learnability. Mach Learn 5:197–227
Schapire R, Freund Y, Bartlett P, Lee W (1998) Boosting the margin: a new explanation for the
effectiveness of voting methods. Ann Stat 26(5):1651–1686
Schapire R, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions.
Mach Learn 37:297–336
Schiltz F, Masci C, Agasisti T, Horn D (2018) Using regression tree ensembles to model interaction
effects: a graphical approach. Appl Econ 50(58):6341–6354
Valiant LG (1984) A theory of the learnable. Commun ACM 27(11):1134–1142
Wüthrich MV, Buser C (2019) Data analytics for non-life insurance pricing. Lecture notes
Yang Y, Qian W, Zou H (2018) Insurance premium prediction via gradient tree-boosted Tweedie
compound Poisson models. J Bus Econ Stat 36(3):456–470
Chapter 6
Other Measures for Model Comparison
6.1 Introduction
Actuarial pricing models are generally calibrated so that they minimize the general-
ization error computed with an appropriate loss function. Model selection is based
on the generalization error. Regression models are then evaluated by assessing their
generalization error with the same loss function, which is done by comparing their
generalization errors computed on a validation set.
Model selection and model assessment are thus based on the same objective
function (the deviance in our ED family setting), which provides consistency in
the approach. However, in our ED family setting, using the deviance as a tool for
model assessment has some drawbacks. As we have seen throughout the different
examples in the previous chapters, the deviance only slightly reacts to a model improvement. Moreover, a decrease in the deviance of a given amount is difficult for the analyst to interpret. There is thus a need for additional measures to assess pricing
models, notably more economic criteria. In this chapter, we describe complementary
measures frequently used by practitioners for model assessment.
Remark 6.1.1 Training a model with an objective function (say objective function
1) and assessing it with another one (say objective function 2) is not without criticism.
If the ultimate goal of the analyst is to minimize the second objective function, then one could wonder why the model is not directly trained with this second objective function as well.
6.2 Measures of Association

6.2.1 Context
The merits of a regression model can be assessed using the pairs (Y, μ(X)). Besides
using validation sample estimates for the generalization error to assess the predic-
tive power of a model μ, measures of association for the pairs (Y, μ(X)) are also
frequently used by practitioners. In insurance, the most popular ones are Kendall’s
tau and Spearman’s rho, which are based on concordance probabilities.
Kendall’s tau and Spearman’s rho, defined in the following, are efficient tools for
measuring the strength of dependence between continuous outcomes. They can be
expressed in terms of the corresponding copula only and are thus independent of the
marginal distributions. When they are applied to discrete variables, they are no more
distribution-free so that their ranges are restricted to sub-intervals of [−1, 1]. This
makes their interpretation more difficult: relatively small values for Kendall’s tau
and Spearman’s rho may in fact strongly support the fitted model if their maximal
possible values are small as well. In case the response variable Y is discrete, such as
the number of claims, correlation indices are thus often restricted to a sub-interval
[−1, 1]. That is why positive values of Kendall’s tau and Spearman’s rho for the pairs
(Y, μ(X)) must be compared to their highest attainable values and not 1.
Notice that, even for discrete responses, predictors μ(X) are generally continu-
ous random variables. This is the case when there is at least one continuous feature
comprised in the available information X (so that the score is continuous) and the
function μ is a continuously increasing function of the score. Of course, predic-
tors μ(X) can still be discrete when all the features are discrete or when they are
piecewise constant predictors (such as a single tree, for instance). However, this is
unlikely to be the case since actuarial pricing is nowadays based on more sophisti-
cated models than piecewise constant predictors (trees being combined into random
forests, for instance) and uses more and more features. Even though predictors μ(X)
are generally continuous, we also consider the discrete case for μ(X). For ease of
exposition, we mean by the random variable Z a predictor μ(X) and we denote by
pk the probabilities P[Y = k], k ∈ N. When Z is discrete, we assume it is valued in
{z_1, z_2, …, z_m} with z_1 < z_2 < … < z_m, and we define j_k = min{j ∈ {1, …, m} : F_Z(z_j) ≥ F_Y(k)}, k ∈ ℕ.
This section aims to derive the best possible upper bounds for Kendall’s tau and
Spearman’s rho when the response takes its values in N = {0, 1, 2, . . .}.
6.2.2.1 Definition
Consider independent copies (Y₁, Z₁) and (Y₂, Z₂) of (Y, Z). Then, (Y₁, Z₁) and (Y₂, Z₂) are said to be concordant if (Y₁ − Y₂)(Z₁ − Z₂) > 0 holds true, whereas they are said to be discordant when (Y₁ − Y₂)(Z₁ − Z₂) < 0.

Tied pairs (that is, pairs of observations that have equal values of Y or Z) may occur in practice. Specifically, the probability that a tie occurs is given by
$$
P[(Y_1 - Y_2)(Z_1 - Z_2) = 0] =
\begin{cases}
P[Y_1 = Y_2 \text{ or } Z_1 = Z_2] & \text{if } Z \text{ is discrete}\\
P[Y_1 = Y_2] & \text{if } Z \text{ is continuous.}
\end{cases}
$$
Proposition 6.2.1 If H denotes the joint distribution function of the pair (Y, Z ),
then
which gives the second announced equality and ends the proof.
Proposition 6.2.1 shows that concordance probabilities get higher when we replace the joint distribution function H with H₊ whose graph lies everywhere above H, provided the marginals are kept unchanged. A natural candidate for H₊ is the Fréchet–Höffding upper bound Hᵘ defined as Hᵘ(y, z) = min{F_Y(y), F_Z(z)}.

Here, (Y₁ᵘ, Z₁ᵘ) and (Y₂ᵘ, Z₂ᵘ) denote independent copies of the random pair (Yᵘ, Zᵘ) obeying the Fréchet–Höffding upper bound Hᵘ, i.e.
$$
Z^u = F_Z^{-1}(U) \quad \text{and} \quad Y^u = \sum_{k=0}^{\infty} k\, I\left[F_Y(k-1) \leq U < F_Y(k)\right] \qquad (6.2.2)
$$
with U being uniformly distributed over the unit interval [0, 1].
Proof The joint distribution function of the random pair (Y, Z) satisfies
holds true. Now, the inequality E[g(Y, Z)] ≤ E[g(Yᵘ, Zᵘ)] is known to be valid for every supermodular function g (see e.g. Denuit et al. 2005, Sect. 6.2.4). As every joint distribution function is supermodular, we also have
so that
$$
E[H(Y, Z)] \leq E\left[\min\{F_Y(Y^u), F_Z(Z^u)\}\right]
$$
is true. Hence, as

Based on Proposition 6.2.2, we can establish upper bounds for concordance probabilities.
Proof By (6.2.1), since P[(Y₁ᵘ − Y₂ᵘ)(Z₁ᵘ − Z₂ᵘ) > 0] = 2P[Y₁ᵘ < Y₂ᵘ, Z₁ᵘ < Z₂ᵘ], it suffices to show that P[Y₁ᵘ < Y₂ᵘ, Z₁ᵘ < Z₂ᵘ] = E[F_Y(Y−)], with
$$
Z_i^u = F_Z^{-1}(U_i) \quad \text{and} \quad Y_i^u = \sum_{k=0}^{\infty} k\, I\left[F_Y(k-1) \leq U_i < F_Y(k)\right],
$$
where
$$
A_k = [F_Y(k-1), F_Y(k)[\ \times\ [F_Y(k), 1], \qquad k \in \mathbb{N}.
$$
Define
Then,
$$
P[Z_1^u = Z_2^u \mid (U_1, U_2) \in A_k] = \sum_{j=1}^{m} P[(U_1, U_2) \in B_j \mid (U_1, U_2) \in A_k]
= \frac{1}{p_k \bar{F}_Y(k)} \sum_{j=1}^{m} P[(U_1, U_2) \in A_k \cap B_j]
= \frac{1}{p_k \bar{F}_Y(k)} \sum_{j=1}^{m} \alpha_{k,j}\, \beta_{k,j}
$$
with
$$
\alpha_{k,j} = \left(\min\{F_Y(k), F_Z(z_j)\} - \max\{F_Y(k-1), F_Z(z_{j-1})\}\right)_+
$$
and
$$
\beta_{k,j} = \left(F_Z(z_j) - \max\{F_Y(k), F_Z(z_{j-1})\}\right)_+,
$$
where, for any real number r, we let r₊ denote the positive part of r; that is, r₊ = r if r ≥ 0 and r₊ = 0 if r < 0.

For j < j_k, we get β_{k,j} = 0 since F_Y(k) ≥ F_Z(z_j) ≥ F_Z(z_{j−1}). Also, for j > j_k, we have α_{k,j} = 0 since F_Z(z_j) ≥ F_Z(z_{j−1}) ≥ F_Y(k) ≥ F_Y(k−1). Now, in the remaining case j = j_k, we have F_Z(z_{j_k}) ≥ F_Y(k) ≥ F_Z(z_{j_k−1}) and hence
6.2.3.1 Definition
In the following, we derive the best possible upper bounds on Kendall’s tau for
discrete responses Y . We start with the case of a continuous Z .
Proof If the random pair (Y, Z ) obeys the upper Fréchet–Höffding bound, i.e. under
(6.2.2), we get
The maximal value of Kendall’s tau corresponds to a random pair distributed accord-
ing to the Fréchet–Höffding upper bound because this distribution simultaneously
maximizes the concordance probability and leads to zero discordance probability.
This ends the proof.
Proof Similarly to the continuous case for Z , when the random pair (Y, Z ) obeys the
upper Fréchet–Höffding bound, we obviously have P[(Y1 − Y2 )(Z 1 − Z 2 ) < 0] = 0,
so that we directly get the desired result.
We notice that the latter upper bound (6.2.5) is smaller than the upper bound (6.2.4) obtained when Z is continuous. Let us define

The difference between the upper bounds (6.2.4) and (6.2.5) is then given by
$$
2 \sum_{k=0}^{\infty} \left(F_Y(k) - \max\{F_Y(k-1), F_Z(z_{j_k - 1})\}\right)\left(F_Z(z_{j_k}) - F_Y(k)\right)
= 2 \sum_{k=k^*}^{\infty} \left(F_Y(k) - \max\{F_Y(k-1), F_Z(z_{j_k - 1})\}\right)\left(F_Z(z_{j_k}) - F_Y(k)\right)
$$
$$
= 2\left(F_Y(k^*) - F_Z(z_{j_{k^*} - 1})\right)\left(F_Z(z_{j_{k^*}}) - F_Y(k^*)\right)
+ 2 \sum_{k=k^*+1}^{\infty} \left(F_Y(k) - \max\{F_Y(k-1), F_Z(z_{j_k - 1})\}\right)\left(F_Z(z_{j_k}) - F_Y(k)\right).
$$
One sees that if Z is such that F_Z(z_{m−1}) < F_Y(k*), then this difference becomes
$$
2\left(F_Y(k^*) - F_Z(z_{m-1})\right)\left(1 - F_Y(k^*)\right) + 2 \sum_{k=k^*+1}^{\infty} \left(F_Y(k) - F_Y(k-1)\right)\left(1 - F_Y(k)\right)
$$
$$
= -2 F_Z(z_{m-1})\left(1 - F_Y(k^*)\right) + 2 \sum_{k=k^*}^{\infty} \left(F_Y(k) - F_Y(k-1)\right)\left(1 - F_Y(k)\right)
= -2 F_Z(z_{m-1})\left(1 - F_Y(k^*)\right) + 2 E[F_Y(Y-)],
$$
6.2.4.1 Definition
where the random variables Y* and Z* are independent and distributed as Y and Z, respectively.

Proof We refer the reader to Mesfioui and Tajar (2005) for a formal proof.

Upper bounds on Spearman's rho can be obtained by replacing the joint distribution function H in (6.2.6) with the Fréchet–Höffding upper bound
where U is a random variable uniformly distributed over the unit interval [0, 1] and independent of Y. We get
$$
E\left[\min\{F_Y(Y), U\}\right] = \frac{1}{2} E\left[2 F_Y(Y) - F_Y^2(Y)\right]. \qquad (6.2.11)
$$
Inserting (6.2.10) and (6.2.11) in (6.2.9), we get the desired result.
Using the notation φ(k) for j_k, k ∈ ℕ, we then get the following result.

Proposition 6.2.9 If Z is discrete, then
$$
\rho[Y, Z] \leq 9 - 3\left(E\left[F_Y(Y)\left(F_Z(z_{\varphi(Y)}) + F_Z(z_{\varphi(Y)} -)\right)\right] + E\left[F_Y(Y-)\left(F_Z(z_{\varphi(Y-)}) + F_Z(z_{\varphi(Y-)} -)\right)\right]\right)
- 3\left(E\left[F_Z(Z)\left(F_Y(\psi(Z)) + F_Y(\psi(Z)-)\right)\right] + E\left[F_Z(Z-)\left(F_Y(\psi(Z-)) + F_Y(\psi(Z-)-)\right)\right]\right). \qquad (6.2.12)
$$
Proof The best upper bound for ρ[Y, Z] is given by (6.2.7), namely
$$
\rho_{\max} = 3\left(E\left[\min\{F_Y(Y^*-), F_Z(Z^*)\}\right] + E\left[\min\{F_Y(Y^*), F_Z(Z^*-)\}\right]\right)
+ 3\left(E\left[\min\{F_Y(Y^*-), F_Z(Z^*-)\}\right] + E\left[\min\{F_Y(Y^*), F_Z(Z^*)\}\right] - 1\right). \qquad (6.2.13)
$$
Hence, we get
Also, we have
which leads to
Similarly, we get
and
Finally, the announced upper bound for ρ[Y, Z ] is obtained by inserting (6.2.14),
(6.2.15), (6.2.16) and (6.2.17) in (6.2.13).
Let us illustrate the computation of the upper bounds (6.2.4), (6.2.5), (6.2.8) and (6.2.12) in a situation of practical relevance. To this end, we consider the motor third-party liability insurance portfolio introduced in Sect. 3.2.4.2. Specifically, we restrict our example to the 124 524 insurance policies of the portfolio that have been observed during the whole year. Figure 6.1 displays for each feature the number of policies by category/value, and Table 6.1 shows the observed numbers of claims.

Let n = 124 524 be the number of policies considered in this example and let us denote by Y_i the number of claims of policy i (i = 1, …, n). The probabilities p_k can be directly estimated using the empirical proportions
$$
\widehat{p}_k = \frac{\sum_{i=1}^{n} I[Y_i = k]}{n}, \qquad k \in \mathbb{N}.
$$
Hence, based on the observations for the number of claims summarized in Table 6.1, we get the empirical proportions depicted in Table 6.2.
Fig. 6.1 The categories/values of the explanatory variables and their corresponding numbers of policies
Expectations E[F_Y(Y−)], E[F_Y²(Y)] and E[F_Y²(Y−)] can be estimated using the empirical proportions $\widehat{p}_k$, namely
$$
\widehat{E}[F_Y(Y-)] = \sum_{k=1}^{\infty} \widehat{p}_k \widehat{F}_Y(k-1) = \sum_{k=1}^{\infty} \widehat{p}_k \sum_{l=0}^{k-1} \widehat{p}_l,
$$
$$
\widehat{E}[F_Y^2(Y)] = \sum_{k=0}^{\infty} \widehat{p}_k \left(\sum_{l=0}^{k} \widehat{p}_l\right)^2,
$$
$$
\widehat{E}[F_Y^2(Y-)] = \sum_{k=1}^{\infty} \widehat{p}_k \left(\sum_{l=0}^{k-1} \widehat{p}_l\right)^2.
$$
With the $\widehat{p}_k$ displayed in Table 6.2, we get
$$
\widehat{E}[F_Y(Y-)] = 0.1026812, \qquad \widehat{E}[F_Y^2(Y)] = 0.8063110, \qquad \widehat{E}[F_Y^2(Y-)] = 0.0919602.
$$
Of course, other estimators can be considered, exploiting the regression model structure (whereas the proposed one is purely nonparametric).

Therefore, when Z is continuous, which is the case when AgePh and AgeCar are treated as continuous variables using GAMs for instance, the upper bounds (6.2.4) on Kendall's tau and (6.2.8) on Spearman's rho can be estimated by
$$
2\, \widehat{E}[F_Y(Y-)] = 0.2053624
$$
and
$$
3\left(1 - \widehat{E}[F_Y^2(Y)] - \widehat{E}[F_Y^2(Y-)]\right) = 0.3051865,
$$
respectively.
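These estimates are straightforward to reproduce in R from the vector of observed claim counts (called nclaims below, a hypothetical name):

```r
# Empirical proportions p_k and distribution function of Y
p_hat   <- table(factor(nclaims, levels = 0:max(nclaims))) / length(nclaims)
F_hat   <- cumsum(p_hat)                        # F_Y(k)
F_minus <- c(0, head(F_hat, -1))                # F_Y(k-1)

E_F_minus  <- sum(p_hat * F_minus)              # E[F_Y(Y-)]
E_F2       <- sum(p_hat * F_hat^2)              # E[F_Y^2(Y)]
E_F2_minus <- sum(p_hat * F_minus^2)            # E[F_Y^2(Y-)]

2 * E_F_minus                                   # upper bound (6.2.4) on Kendall's tau
3 * (1 - E_F2 - E_F2_minus)                     # upper bound (6.2.8) on Spearman's rho
```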
Table 6.4 Upper bound (6.2.12), Spearman's rho and its normalized version Spearman's rho/Upper bound, as well as the corresponding m (i.e. number of values taken by Z), for trees whose sizes are controlled by the minimum number of observations (Min. nb of obs.) required in the terminal nodes

Min. nb of obs. | m  | Spearman's rho | Upper bound | Spearman's rho / Upper bound
35 000          | 2  | 0.0582118      | 0.2467594   | 0.2359052
30 000/25 000   | 3  | 0.0665856      | 0.2543916   | 0.2617447
20 000          | 4  | 0.0723529      | 0.2840077   | 0.2547567
15 000          | 5  | 0.0784423      | 0.2912251   | 0.2693530
10 000          | 7  | 0.0790575      | 0.2912251   | 0.2714653
5000            | 14 | 0.0854499      | 0.2992722   | 0.2855257
1000            | 66 | 0.0965594      | 0.3050679   | 0.3165177
1000 66 0.0965594 0.3050679 0.3165177
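The training-sample estimates in Table 6.4 can be reproduced along the following lines. This is only a sketch: it assumes a data frame db with the claim counts Nclaim and the features of Sect. 3.2.4.2 (hypothetical column names), grows trees with the rpart package while controlling the minimum number of observations per terminal node through minbucket, and uses R's default tie handling for Spearman's rho, which may differ slightly from the tie-corrected version used in the text:

library(rpart)

spearman_by_size <- function(db, min_obs) {
  fit <- rpart(Nclaim ~ Gender + Fuel + Use + Cover + Split + PowerCat + AgeCar + AgePh,
               data = db, method = "poisson",
               control = rpart.control(minbucket = min_obs, cp = 0))
  z <- predict(fit, newdata = db)                    # fitted expected claim frequencies Z
  c(m   = length(unique(z)),                         # number of values taken by Z
    rho = cor(db$Nclaim, z, method = "spearman"))    # training-sample Spearman's rho
}

sapply(c(35000, 30000, 20000, 15000, 10000, 5000, 1000),
       function(size) spearman_by_size(db, size))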
claim frequency data. However, these values should not be compared to 1 (which cannot be attained) but to the upper bounds, which range from 16% to 20% (depending on m). This may lead the analyst to a different conclusion than the one deduced from the normalized version obtained by dividing Kendall's tau by its corresponding upper bound.
Notice that the values for Kendall's tau depicted in Table 6.3 are training sample estimates, so that they are overly optimistic and favor larger regression trees (i.e. larger m). Hence, validation sample estimates for Kendall's tau should be smaller than the already small values presented in Table 6.3, further underscoring the need to compare values of Kendall's tau with the highest attainable values rather than with 1.
Remark 6.2.10 In Table 6.3, training sample estimates for Kendall’s tau more than
double from the simplest model (m = 2) to the most complex one (m = 66) while
training sample estimates for its normalized version only increase by 70.5%. This
is due to the fact that the values of the upper bound also increase with m, like the
training sample estimates for Kendall’s tau. In a way, one can say that training sample
estimates for normalized Kendall’s tau penalize the model’s complexity.
Fig. 6.2 a Regression tree (on the left) and distribution function of Z (on the right) when the
minimum number of observations per final node is set to 35 000. b Regression tree (on the left) and
distribution function of Z (on the right) when the minimum number of observations per final node
is set to 30 000 (same results when minimum number of observations per node set to 25 000). c
Regression tree (on the left) and distribution function of Z (on the right) when the minimum number
of observations per final node is set to 20 000
Fig. 6.3 a Regression tree (on the left) and distribution function of Z (on the right) when the
minimum number of observations per final node is set to 15 000. b Regression tree (on the left) and
distribution function of Z (on the right) when the minimum number of observations per final node
is set to 10 000. c Regression tree (on the left) and distribution function of Z (on the right) when
the minimum number of observations per final node is set to 5 000
Fig. 6.4 Regression tree (on the left) and cdf of Z (on the right) when minimum number of
observations = 1000
Considering the values displayed in Table 6.4, we see that the values for Spear-
man’s rho are between 5% and 10%, which are also rather small compared to 1.
Again, this may lead the analyst to a different conclusion than the one deduced from
its normalized version, whose values range between 23% and 32%. As for Kendall’s
tau, the upper bound increases with the number of values m taken by Z to finally
converge to the upper bound we get when Z is continuous. Also, the values for
Spearman’s rho shown in Table 6.4 are training sample estimates, so that they are
overly optimistic and favor larger regression trees.
6.3 Measuring Lift

6.3.1 Motivation
μ(X) and the calibration of the supervised regression model delivering the prices μ̂(X)), as departures Y − μ(X) cancel out when averaged over a sufficiently large portfolio (this is the very essence of insurance). The premium μ̂(X) has to be as close as possible to the true premium μ(X). The very aim of ratemaking is not to predict the actual losses Y but to produce accurate estimates of μ(X), which is unobserved. Goodness-of-lift must be measured on the validation set, not on the training set (otherwise, a model over-fitting the training data may appear to provide a high lift).
Let μ̂_1 and μ̂_2 be two predictors based on the information contained in X. Both are candidate premiums and attempt to predict the true premium μ(X). There are many methods to obtain such predictors, ranging from the classical GLMs to neural networks. Lift charts sort the data based on the ratio R = μ̂_1/μ̂_2 to compare the two predictors. This ratio is called the relativity. The procedure is as follows. First, calculate the ratio R for each observation in the validation set and sort the data according to it. Second, bucket the data into equally populated classes. If μ̂_2 is the new model, to be compared with the current one μ̂_1, then the superiority of μ̂_2 over μ̂_1 is demonstrated by plotting the loss ratios corresponding to the old model μ̂_1. If the buckets with low relativities have lower loss ratios then we have lift, that is, if the loss ratios exhibit an increasing trend then μ̂_2 is preferred over μ̂_1.
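In R, this double lift chart can be sketched as follows; y, mu1 and mu2 are hypothetical vectors holding, for the validation set, the observed losses and the two candidate premiums, and the loss ratios are those of the current model μ̂_1:

# double lift chart: loss ratios of the current premium mu1 by relativity bucket
double_lift <- function(y, mu1, mu2, n_buckets = 10) {
  r      <- mu1 / mu2                               # relativity
  bucket <- cut(rank(r, ties.method = "first"),
                breaks = n_buckets, labels = FALSE) # equally populated classes
  tapply(y, bucket, sum) / tapply(mu1, bucket, sum) # loss ratio per bucket
}

# an increasing trend of these loss ratios across buckets indicates that mu2 improves on mu1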
In actuarial pricing, the actuary aims to predict the technical premium μ(X). To this end, a predictor μ̂(X) is built from the available information X. To ease the exposition, we assume that all predictors μ̂(X) under consideration, as well as the conditional expectation μ(X), are continuous random variables admitting probability density functions. This is generally the case when there is at least one continuous feature comprised in the available information X and the function μ̂ is a continuously increasing function of a real score. However, this may rule out predictions based on discrete features only, as well as piecewise constant predictors, e.g., a single tree, since in those cases μ̂(X) takes only a limited number of values. Now, as actuarial pricing is nowadays based on more sophisticated models (trees being combined into random forests, for instance), this assumption does not really restrict the generality of the approach. Notice that the response Y may be discrete (such as the number of claims, for instance) as the continuity only concerns μ(X) and μ̂(X).
The predictor is also supposed to be correct on average, that is,
and as F_μ̂^{-1} the associated quantile function (or Value-at-Risk), defined as the generalized inverse of F_μ̂, i.e.
where (z − t)_+ denotes the positive part of z − t, i.e. (z − t)_+ = max{z − t, 0}.
This means that the stop-loss premiums for Z_2 dominate the corresponding stop-loss premiums for Z_1 for all deductible levels t. The name convex order comes from
the fact that Z_1 ≼cx Z_2 ⇔ E[g(Z_1)] ≤ E[g(Z_2)] for all convex functions g for which the expectations exist. Moreover, since

E[(Z − t)_+] − E[(t − Z)_+] = E[Z] − t,

we find

Z_1 ≼cx Z_2 ⇔ E[Z_1] = E[Z_2] and E[(t − Z_1)_+] ≤ E[(t − Z_2)_+] for all t ≥ 0.    (6.3.3)
Recall that for any random variables Z_1 and Z_2 with equal mean,

Z_1 ≼cx Z_2 ⇔ ∫_α^1 F_{Z_1}^{-1}(u) du ≤ ∫_α^1 F_{Z_2}^{-1}(u) du for all 0 < α < 1.

As E[Z_k] = ∫_0^1 F_{Z_k}^{-1}(u) du, we also have

∫_0^α F_{Z_1}^{-1}(u) du = E[Z_1] − ∫_α^1 F_{Z_1}^{-1}(u) du
                        ≥ E[Z_2] − ∫_α^1 F_{Z_2}^{-1}(u) du
                        = ∫_0^α F_{Z_2}^{-1}(u) du
since, for k = 1, 2,

∫_0^α F_{Z_k}^{-1}(u) du = E[ Z_k I[Z_k ≤ F_{Z_k}^{-1}(α)] ].
6.3.4.1 Definition
Gourieroux and Jasiak (2007, Chap. 4) defined several performance measures for a predictor μ̂(X) of the mean response E[Y|X] = P[Y = 1|X]. These performance measures for the predictor μ̂ for a binary response Y were based on the two curves

α → P[Y = 1 | μ̂(X) ≤ F_μ̂^{-1}(α)]

and

α → P[μ̂(X) ≤ F_μ̂^{-1}(α) | Y = 1].
The former curve gives the proportion of policies reporting at least one claim despite their associated prediction being low (precisely, among the 100α% smallest ones). The latter corresponds to the proportion of policies with small predictions among those with at least one claim reported. Notice that
with at least one claim reported. Notice that
P[Y = 1|
μ(X) ≤ t]
μ(X) ≤ t|Y = 1] =
P[ μ(X) ≤ t]
P[
P[Y = 1]
so that multiplying the first curve with α, we obtain the second one.
Considering that the identities
P[Y = 1 | μ̂(X) ≤ t] / P[Y = 1] = E[Y | μ̂(X) ≤ t] / E[Y]

and

P[μ̂(X) ≤ t | Y = 1] = E[Y I[μ̂(X) ≤ t]] / E[Y]
are both valid for binary responses, this suggests basing the performance measure of the predictor on the concentration curve of the response with respect to the predictor, whose definition is recalled next.
Definition 6.3.2 The concentration curve of the response Y with respect to the predictor μ̂ based on the information contained in the vector X is defined as

CC[Y, μ̂(X); α] = (1/E[Y]) ∫_0^{F_μ̂^{-1}(α)} E[Y | μ̂(X) = t] f_μ̂(t) dt.    (6.3.5)
so that

E[Y | μ̂(X) ≤ t] = ∫_0^∞ y dP[Y ≤ y | μ̂(X) ≤ t]
               = (1/P[μ̂(X) ≤ t]) ∫_0^t ∫_0^∞ y f_{(Y,μ̂)}(y, s) dy ds
               = E[Y I[μ̂(X) ≤ t]] / P[μ̂(X) ≤ t].
Finally, (6.3.8) is easily obtained from the definition of the concentration curve as
Formula (6.3.7) shows that we can equivalently replace the response Y with the pure premium μ(X) in the concentration curve. This property is of utmost importance: the actuary is interested in the pure premium, which is unknown, but can replace it with the actual response values when evaluating the performances of the predictor under consideration.
If the predictor brings a lot of information about the technical premium, or equiva-
lently about the response, this means that these random variables are strongly cor-
related. The shape of the concentration curve of the response with respect to its
predictor thus depends on the kind of positive relationship between the response and
the predictor. This is why we recall the definition of the following positive depen-
dence notions.
z 2 → P[Z 1 ≤ z 1 |Z 2 = z 2 ] is non-increasing in z 2 ,
for all z 1 ≥ 0.
(ii) The random variable Z 1 is positively regression dependent on Z 2 if
z 2 → E[Z 1 |Z 2 = z 2 ] is non-decreasing in z 2 .
z 2 → P[Z 1 ≤ z 1 |Z 2 ≤ z 2 ] is non-increasing in z 2 ,
for all z 1 ≥ 0.
(iv) The random variable Z 1 is positively left-tail expectation dependent on Z 2 if
z 2 → E[Z 1 |Z 2 ≤ z 2 ] is non-decreasing in z 2 .
E[Z 1 ] ≥ E[Z 1 |Z 2 ≤ z 2 ]
for all z 2 ≥ 0.
Let us give some intuitive explanation about these different concepts. Considering
stochastic increasingness, we see that the condition defining this positive dependence
notion expresses the fact that the probability that Z 1 is small, in the sense that Z 1
falls below the threshold z 1 , decreases as Z 2 gets larger. Intuitively speaking, Z 1
thus tends to become larger when Z 2 increases. Positive regression dependence is a
weaker concept as

E[Z_1 | Z_2 = z_2] = ∫_0^∞ P[Z_1 > z_1 | Z_2 = z_2] dz_1.
Monotonicity
The concentration curve is based on the function

t → E[μ(X) I[μ̂(X) ≤ t]] / E[Y]

evaluated at quantiles of μ̂(X). This function is a distribution function, starting from (0, 0) to reach (1, 1), being non-decreasing and right-continuous. Therefore, α → CC[μ(X), μ̂(X); α] is non-decreasing and satisfies

lim_{α→0} CC[μ(X), μ̂(X); α] = 0 and lim_{α→1} CC[μ(X), μ̂(X); α] = 1.
Line of Independence
In the particular case where the predictor brings no information about the response,
in the sense that Y and
μ(X) are mutually independent, then the concentration curve
is the 45-degree line, often referred to as the line of independence in the literature.
Formally, if Y and μ(X) are independent, then
Let us now study the position of the concentration curve with respect to the 45-
degree line. If
μ brings a lot of information about the true premium μ(X), this means
that these random variables are strongly related and the concentration curve should
be far from the line of independence. Furthermore, the shape of the concentration
curve depends on the kind of relationship between μ(X) and μ(X). The next result
shows that under weak positive dependence, every concentration curve lies below
the independence line.
Property 6.3.5 If μ(X) is positively expectation dependent on μ̂(X), that is, if the inequality

E[μ(X) | μ̂(X) ≤ t] ≤ E[μ(X)]

holds for all t, then the concentration curve lies below the 45-degree line, i.e. CC[μ(X), μ̂(X); α] ≤ α for all probability levels α.
E[μ(X) I[μ̂(X) ≤ t]] / E[Y] = P[μ̂(X) ≤ t] E[μ(X) | μ̂(X) ≤ t] / E[Y]
                           ≤ P[μ̂(X) ≤ t].
Convexity
The next result states a positive dependence condition under which the concentration
curve is convex. Again, the shape of the curve depends on the kind of relationship
existing between the response and the predictor.
Property 6.3.6 The concentration curve α → CC[μ(X), μ̂(X); α] is convex if, and only if, μ(X) is positively regression dependent on μ̂(X), that is, if the function

t → E[μ(X) | μ̂(X) = t]    (6.3.9)

is non-decreasing.
Proof Let us start from the representation

E[μ(X) I[μ̂(X) ≤ t]] = ∫_0^t ∫_0^∞ u f_{(μ,μ̂)}(u, s) du ds,

where f_{(μ,μ̂)} denotes the joint probability density function of the pair (μ(X), μ̂(X)). Then, the first derivative of the selection curve is given by

(d/dα) CC[μ(X), μ̂(X); α] = ∫_0^∞ u f_{(μ,μ̂)}(u, F_μ̂^{-1}(α)) / f_μ̂(F_μ̂^{-1}(α)) du
                         = ∫_0^∞ u f_{μ|μ̂}(u | F_μ̂^{-1}(α)) du

where f_{μ|μ̂}(·|t) denotes the conditional probability density function of the true premium μ(X), given μ̂(X) = t. Hence,

(d/dα) CC[μ(X), μ̂(X); α] = E[μ(X) | μ̂(X) = F_μ̂^{-1}(α)].

We thus see that this derivative is increasing (and hence the primitive is convex) if, and only if, μ(X) is positively regression dependent on the score μ̂(X), as announced. This ends the proof.
The convexity of the concentration curve ensures that the increments

CC[Y, μ̂(X); α + ε] − CC[Y, μ̂(X); α] = E[Y I[F_μ̂^{-1}(α) < μ̂(X) ≤ F_μ̂^{-1}(α + ε)]] / E[Y]

are non-decreasing in α, for every positive ε such that α + ε ≤ 1. This means that the sub-portfolios created by isolating a proportion ε of policies with predictions comprised between F_μ̂^{-1}(α) and F_μ̂^{-1}(α + ε) bring an increasing share of the losses, on average, as α increases.
This property relates to the lift charts described in Tevet (2013). To draw such graphs, the data set is sorted based on the values of μ̂(X). The data are then
bucketed into equally populated classes based on quantiles. Within each bucket, the
average predicted loss is calculated with the help of the predictor μ as well as the
actual loss cost Y . The average predicted and average actual loss costs are then
graphed for each class.
To assess the reasonableness of μ, the analyst checks whether the actual loss costs
monotonically increase as we move to higher buckets (by definition, this will be the
case for the predicted loss costs). Property 6.3.6 precisely identifies the condition
required to observe an increasing trend in such a lift chart.
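Such a lift chart is easy to produce in R; the sketch below assumes hypothetical vectors y (actual loss costs) and mu_hat (predictions) for the validation set:

lift_chart <- function(y, mu_hat, n_buckets = 10) {
  bucket <- cut(rank(mu_hat, ties.method = "first"),
                breaks = n_buckets, labels = FALSE)   # equally populated classes
  data.frame(bucket    = 1:n_buckets,
             predicted = as.numeric(tapply(mu_hat, bucket, mean)),  # average predicted loss cost
             actual    = as.numeric(tapply(y, bucket, mean)))       # average actual loss cost
}

# plotting predicted and actual against bucket gives the lift chart; a monotone increase
# of the actual loss costs corresponds to the condition identified in Property 6.3.6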
The monotonicity condition imposed on t
→ E[μ(X)| μ(X) = t] appears to be
reasonable. However, this condition is not necessarily fulfilled for a given feature X j ,
i.e. t
→ E[μ(X)|X j = t] is not necessarily non-decreasing. For instance, consider-
ing as response, the number of claims filed against the insurer in a motor third-party
liability cover, the impact of policyholder’s age on the expected claim frequency often
exhibits a U-shape, which invalidates the monotonicity of the conditional expecta-
tion. Re-arranging the values of the feature is nevertheless possible, as explained
in Shaked et al. (2012). But such a procedure makes the analysis less transparent.
This is why we condition here on μ(X), which maps the vector of features to the
prediction and induces a total order among risk profiles.
As positive regression dependence implies positive expectation dependence, the
condition of Property 6.3.6 ensures that the concentration curve is non-decreasing
and convex, starting from (0, 0) to end at (1, 1) with a graph everywhere below the
45-degree line.
6.3.4.4 Estimation
The concentration curve is estimated empirically by

ĈC[Y, μ̂(X); α] = (1/(n Ȳ)) Σ_{i: μ̂(X_i) ≤ F̂_μ̂^{-1}(α)} Y_i = Σ_{i: μ̂(X_i) ≤ F̂_μ̂^{-1}(α)} Y_i / Σ_{i=1}^{n} Y_i,

where Ȳ = (1/n) Σ_{i=1}^{n} Y_i. Here, F̂_μ̂ denotes the empirical distribution function of the resulting predictions, i.e.

F̂_μ̂(t) = (1/n) Σ_{i=1}^{n} I[μ̂(X_i) ≤ t].

Notice that

μ̂(X) ≤ F_μ̂^{-1}(α) ⇔ F_μ̂(μ̂(X)) ≤ α.

This means that it is enough to consider the ranking induced by the predictor, that is, we are free to replace every predictor μ̂(X) with the corresponding rank

M = F_μ̂(μ̂(X))
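In R, this empirical concentration curve can be sketched as follows, with hypothetical vectors y and mu_hat holding the validation responses and predictions:

# empirical concentration curve of Y with respect to mu_hat,
# evaluated on the grid alpha = 1/n, 2/n, ..., 1
concentration_curve <- function(y, mu_hat) {
  ord   <- order(mu_hat)                # ranking induced by the predictor
  alpha <- seq_along(y) / length(y)
  cc    <- cumsum(y[ord]) / sum(y)      # share of losses of the 100*alpha% lowest predictions
  data.frame(alpha = alpha, cc = cc)
}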
Lorenz curves are used in economics to measure the inequality of incomes. Intuitively speaking, the more variable the incomes in a given population, the less egalitarian it is. Lorenz curves are thus intimately related to the convex order. The Lorenz curve LC associated to the predictor μ̂(X) is defined by
LC[μ̂(X); α] = (1/E[μ̂(X)]) ∫_0^α F_μ̂^{-1}(u) du.

The Lorenz curve is thus based on the function

t → E[μ̂(X) I[μ̂(X) ≤ t]] / E[μ̂(X)]

evaluated at quantiles of μ̂(X), so that

LC[μ̂(X); α] = CC[μ̂(X), μ̂(X); α].

For the uninformative predictor, constantly equal to E[Y], this particular case corresponds to the 45-degree line. Notice that we must consider this limit case with some care because the constant predictor is not a continuous random variable (which invalidates some of the formulas derived earlier).
If an actuary has access to the true premium μ(X) based on the information contained in X then there is no need to distinguish CC from LC. This is because if μ̂(X) = μ(X) then

LC[μ̂(X); α] = CC[μ(X), μ̂(X); α].
Remark 6.3.8 Many empirical studies also use Gini coefficients based on the Lorenz curve of the predictor. The reason for using this indicator is the following. The Gini mean difference is one possible measure of variability defined for a non-negative continuous random variable Z as

Gini[Z] = E[|Z_1 − Z_2|],

where Z_1 and Z_2 are independent copies of Z, in the same way as the variance can be written as

Var[Z] = (1/2) E[(Z_1 − Z_2)^2].

If Z is continuous then it can be shown that

Gini[Z] = 4 Cov[Z, F_Z(Z)].
Thus, Gini mean difference measures the association between a random variable and
its rank. In other words, considering a sequence of observations ranked in ascending
order, Gini mean difference quantifies the relationship between the actual value of Z
and its position in the sequence. Clearly, the more variability in Z , the larger its actual
value when its rank is high (i.e. it appears among the largest observations) whereas
a lower Gini mean difference indicates that the observations are more concentrated
around their central value.
This formula relates the Gini mean difference to the Lorenz curve. Starting from the lower absolute deviation of Z_1, defined as E[(t − Z_1) I[Z_1 ≤ t]], we replace t with Z_2 to obtain this relation.
Now, the area between the identity line and the Lorenz curve for Z is (1/E[Z]) Cov[Z, F_Z(Z)]. Therefore, the higher the Gini mean difference, the further away the Lorenz curve of the predictor lies from the 45-degree line (i.e. from the Lorenz curve of the uninformative predictor constantly equal to E[Y]). This is why candidate premiums with larger Gini mean difference tend to be preferred.
Notice that the Gini coefficient is the Gini mean difference divided by twice the
mean. It is also known as the concentration ratio and represents the area between the
45-degree line and the actual Lorenz curve divided by the area between the 45-degree
line and the Lorenz curve that yields the maximal value that this index can have.
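In R, the Gini mean difference and the Gini coefficient of a vector of predictions can be estimated from the covariance representation above (a sketch, ignoring small-sample corrections):

gini_mean_difference <- function(z) {
  n <- length(z)
  4 * cov(z, rank(z) / n)        # empirical version of 4 * Cov[Z, F_Z(Z)]
}

gini_coefficient <- function(z) {
  gini_mean_difference(z) / (2 * mean(z))   # Gini mean difference divided by twice the mean
}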
α → LC[μ̂(X); α] and α → CC[μ(X), μ̂(X); α].

The first one represents the share of premiums collected from the 100α% of policies from the portfolio with the lowest μ̂(X) values. The second one gives the corresponding share of the true premium that should have been collected from this sub-portfolio. As the total expected incomes of μ̂ and μ match the total expected loss by (6.3.1), both ratios are directly comparable since we can compare two percentages: the one obtained with μ̂(X) and the true one corresponding to μ(X). As actuaries, we would like the graph of the concentration curve to be as close as possible to the graph of the Lorenz curve. In other words, the smaller the area between the two curves, the better.
Gini mean difference measures the area between the 45-degree line and the Lorenz
curve. As explained earlier, an alternative is to consider both the Lorenz curve of the
predictor and the concentration curve of the true premium with respect to the predic-
tor. More precisely, defining a distance between the two curves CC[μ(X), μ(X); α]
and LC[ μ(X); α] would be relevant, knowing that they coincide when μ(X) = μ(X).
This is why the area between the concentration curve and the Lorenz curve turns out
to be another good candidate for assessing the performance of a given predictor. This
area between the curves, ABC in short, is given by
ABC[μ̂(X)] = ∫_0^1 ( CC[Y, μ̂(X); α] − LC[μ̂(X); α] ) dα
          = (1/E[μ̂(X)]) ∫_0^1 ( E[Y I[M ≤ α]] − E[μ̂(X) I[M ≤ α]] ) dα
          = (1/E[μ̂(X)]) ∫_0^1 ∫_0^∞ ( P[μ̂(X) ≤ y, M ≤ α] − P[Y ≤ y, M ≤ α] ) dy dα
          = (1/E[μ̂(X)]) ( Cov[μ̂(X), M] − Cov[Y, M] )    (6.3.10)
where we recognize the difference between the Gini mean difference of the predictor
μ(X) (up to the factor 4) and the Gini covariance of the response Y and the predictor
μ(X). Let us notice that (6.3.10) can be rewritten as
ABC[μ̂(X)] = (1/E[μ̂(X)]) Cov[μ̂(X) − Y, M].
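This covariance form makes the empirical computation straightforward; the following sketch uses hypothetical validation-set vectors y and mu_hat, the response Y standing in for the unobservable μ(X) as justified above:

abc_hat <- function(y, mu_hat) {
  n <- length(y)
  m <- rank(mu_hat, ties.method = "average") / n   # empirical rank M = F_mu_hat(mu_hat(X))
  cov(mu_hat - y, m) / mean(mu_hat)                # empirical version of (6.3.10)
}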
Remark 6.3.9 In the case where μ(X) and μ̂(X) are comonotonic, we have CC[μ(X), μ̂(X); α] = LC[μ̂(X); α] for all α ∈ (0, 1) if and only if μ(X) = μ̂(X). Indeed, in such a case

μ(X) = F_μ^{-1}(M)

where F_μ^{-1} is the quantile function associated to the distribution function F_μ of μ(X). Therefore, since E[μ(X)] = E[μ̂(X)], we get CC[μ(X), μ̂(X); α] = LC[μ̂(X); α] for all α ∈ (0, 1) if and only if

E[ F_μ^{-1}(M) I[F_μ̂^{-1}(M) ≤ F_μ̂^{-1}(α)] ] = E[ F_μ̂^{-1}(M) I[F_μ̂^{-1}(M) ≤ F_μ̂^{-1}(α)] ]
⇔ ∫_0^α F_μ^{-1}(u) du = ∫_0^α F_μ̂^{-1}(u) du.    (6.3.11)

Thus, since (6.3.11) must be fulfilled for all α ∈ (0, 1), it follows that F_μ^{-1}(u) = F_μ̂^{-1}(u) for all u ∈ (0, 1) and so μ(X) = μ̂(X).
Alternatively, we can also create a sub-portfolio gathering all the policies such that μ̂(X) ≤ F_μ̂^{-1}(α) and only consider this one, in isolation. The result of each portfolio is described by the performance curve

( E[μ̂(X) | μ̂(X) ≤ F_μ̂^{-1}(α)], E[μ(X) | μ̂(X) ≤ F_μ̂^{-1}(α)] ).

Notice that we are allowed to replace the true premium μ(X) with the actual loss Y. The first component corresponds to the conditional lower-tail expectation of the predictor, defined as

CLTE[μ̂(X); α] = E[μ̂(X) | μ̂(X) ≤ F_μ̂^{-1}(α)].
The second component looks like the conditional lower-tail expectation of the true premium, except that the condition involves the predictor and not the true premium. Clearly,

α → CLTE[μ̂(X); α]

is non-decreasing. But this is not necessarily the case for α → E[Y | μ̂(X) ≤ F_μ̂^{-1}(α)] unless μ(X) is positively lower-tail expectation dependent on μ̂(X). Positive expectation dependence allows us to constrain the shape of this curve. If μ(X) and μ̂(X) are positively expectation dependent then
The next result gives the condition ensuring the monotonicity of α → E[μ(X) | μ̂(X) ≤ F_μ̂^{-1}(α)].

Property 6.3.10 If μ(X) is positively lower-tail expectation dependent on μ̂(X) then

E[μ(X) | μ̂(X) ≤ t] − E[μ(X) | μ̂(X) ≤ s]
  = ∫_0^∞ ( P[μ(X) ≤ u | μ̂(X) ≤ s] − P[μ(X) ≤ u | μ̂(X) ≤ t] ) du
Proof Let us first establish that E[μ(X)|A] ≥ E[μ(X)|μ(X) ≤ u] for any random
event A such that P[μ(X) ≤ u] = P[A]. This comes from
CLTE[Y; α] ≤ CLTE[μ̂(X); α] holds for all α, that is,

E[μ(X) | μ(X) ≤ F_μ^{-1}(α)] ≤ E[μ̂(X) | μ̂(X) ≤ F_μ̂^{-1}(α)] for all α,

since CLTE[μ(X); α] = CLTE[Y; α].
Property 6.3.11 indicates that E μ(X)μ(X) ≤ Fμ−1 (α) lies above CLTE[μ(X); α]
and thus it may cross CLTE[ μ(X); α]. The corresponding sub-portfolio is therefore
in disequilibrium, with true premiums above the actual premiums, on average.
Consider two predictors μ1 and μ2 . Both attempt to predict the unknown pure pre-
mium μ(X). These predictors may differ in their functional form ( μ1 instead of
μ2 )
and/or in the information (X 1 instead of X 2 ) on which they are based. There are two
important aspects when evaluating the performances of two predictors. First, their
respective variability, i.e. their ability to identify different risk profiles. Second, their
correlation with μ(X), i.e., the amount of information they bring in about the true
premium.
6.3.6.1 Variability
The convex order appears to be the appropriate tool to measure the degree of lift
induced by a predictor. This probabilistic tool indeed assesses the differentiation
between the cheapest and costliest risk profiles identified by the model. In that respect,
replacing the predictor with a more variable one, based on the convex order, appears
to be a promising strategy.
Property 6.3.13 If μ̂_2(X_2) ≼cx μ̂_1(X_1) holds then
(i) min[μ̂_1(X_1)] ≤ min[μ̂_2(X_2)] and max[μ̂_1(X_1)] ≥ max[μ̂_2(X_2)], so that μ̂_1(X_1) has a wider range than μ̂_2(X_2);
(ii) for any α < β,
Proof The proof of (i) is by contradiction. Suppose, for example, that max[μ̂_1(X_1)] < max[μ̂_2(X_2)]. Let t be such that max[μ̂_1(X_1)] < t < max[μ̂_2(X_2)]. Then

in contradiction to μ̂_2(X_2) ≼cx μ̂_1(X_1). Therefore we must have max[μ̂_1(X_1)] ≥ max[μ̂_2(X_2)]. Similarly, it can be shown that min[μ̂_1(X_1)] ≤ min[μ̂_2(X_2)].
Considering (ii), we know that μ̂_2(X_2) ≼cx μ̂_1(X_1) is equivalent to the two conditions

E[μ̂_1(X_1) | μ̂_1(X_1) > F_{μ̂_1}^{-1}(β)] ≥ E[μ̂_2(X_2) | μ̂_2(X_2) > F_{μ̂_2}^{-1}(β)] for all probability levels β,

E[μ̂_1(X_1) | μ̂_1(X_1) ≤ F_{μ̂_1}^{-1}(α)] ≤ E[μ̂_2(X_2) | μ̂_2(X_2) ≤ F_{μ̂_2}^{-1}(α)] for all probability levels α.
The respective distribution functions of the two predictors μ1 and μ2 are denoted
as Fμ1 and Fμ2 . Both Fμk are assumed to be continuous and strictly increasing,
k = 1, 2. Define the scores M1 = Fμ1 ( μ1 ) and M2 = Fμ2 (
μ2 ) that are both uniformly
distributed over the unit interval [0, 1].
The more Mk is correlated to Y , the more information the corresponding pre-
dictor
μk contains. More informative predictors thus lead to greater variability of
the conditional expectation E[Y |M]. This is formally stated in the next result estab-
lished by Muliere and Petrone (1992) in their study of dependence orderings based
on generalized Lorenz curves.
Property 6.3.14 Assume that the functions α → E[Y | M_k = α] are continuous and strictly increasing for k ∈ {1, 2}. Then

E[Y | M_2] ≼cx E[Y | M_1] ⇔ E[Y | M_1 ≥ α] ≥ E[Y | M_2 ≥ α] for all α.

Here, E[Y | M_k] measures how the rank M_k induced by the predictor μ̂_k explains the response Y. If all the ranks are equal, i.e. μ̂_k(X) = E[Y], then E[Y | M_k] = E[Y] and the predictor does not bring any information about the response.
The next result shows that under the assumptions of Property 6.3.14 the mean
square error of prediction (MSEP) is smaller with M1 compared to M2 .
Property 6.3.15 Under the conditions of Property 6.3.14, we have

E[ (Y − E[Y | M_1])^2 ] ≤ E[ (Y − E[Y | M_2])^2 ],

since E[(Y − E[Y | M_k])^2] = Var[Y] − Var[E[Y | M_k]] and

E[Y | M_2] ≼cx E[Y | M_1] ⇒ Var[E[Y | M_2]] ≤ Var[E[Y | M_1]].
The performance and selection curves are useful to evaluate the value of a given
predictor
μ(X). These curves are also helpful to compare the performances of dif-
ferent scores. Following Gourieroux and Jasiak (2007, Definition 4.5), we adopt the
following comparison rule.
Definition 6.3.16 The predictor μ̂_1(X_1) is more discriminatory than the predictor μ̂_2(X_2) for a response Y if, and only if, the inequalities

E[μ̂_1(X_1) | μ̂_1(X_1) ≤ F_{μ̂_1}^{-1}(α)] ≤ E[μ̂_2(X_2) | μ̂_2(X_2) ≤ F_{μ̂_2}^{-1}(α)]

and

E[Y | μ̂_1(X_1) ≤ F_{μ̂_1}^{-1}(α)] ≤ E[Y | μ̂_2(X_2) ≤ F_{μ̂_2}^{-1}(α)]

hold for all probability levels α.
The first condition is fulfilled if, and only if, μ̂_2(X_2) ≼cx μ̂_1(X_1). The second condition can be rewritten as

E[Y | μ̂_1(X_1) ≤ F_{μ̂_1}^{-1}(α)] ≤ E[Y | μ̂_2(X_2) ≤ F_{μ̂_2}^{-1}(α)]

or, equivalently, if
P[Z 1 > t1 , Z 2 > t2 ] ≤ P[V1 > t1 , V2 > t2 ] for all t1 and t2 , (6.3.13)
The intuitive meaning of a ranking with respect to ≼conc is clear from Definition 6.3.17. Indeed, P[Z_1 ≤ t_1, Z_2 ≤ t_2] and P[V_1 ≤ t_1, V_2 ≤ t_2] read as "Z_1 and Z_2 are both small" and "V_1 and V_2 are both small", respectively (small meaning that Z_1, resp. V_1, is smaller than the threshold t_1 and Z_2, resp. V_2, is smaller than the threshold t_2). So, (6.3.12) means that when (Z_1, Z_2) ≼conc (V_1, V_2) holds, the probability that V_1 and V_2 are both small is larger than the corresponding probability for Z_1 and Z_2. Similarly, from (6.3.13), (Z_1, Z_2) ≼conc (V_1, V_2) also ensures that the probability that Z_1 and Z_2 are both large is smaller than the corresponding probability for V_1 and V_2. This corresponds to the intuitive content of "(V_1, V_2) being more positively dependent than (Z_1, Z_2)".
We have the following result in terms of covariances.

Proposition 6.3.18 For random couples (Z_1, Z_2) and (V_1, V_2) with the same marginal distributions, we have

(Z_1, Z_2) ≼conc (V_1, V_2) ⇔ Cov[g_1(Z_1), g_2(Z_2)] ≤ Cov[g_1(V_1), g_2(V_2)]

for all the non-decreasing functions g_1 and g_2, provided the expectations exist.

Proposition 6.3.18 shows that when ≼conc holds, the correlations between g_1(Z_1) and g_2(Z_2) are smaller than those between g_1(V_1) and g_2(V_2) for all increasing functions g_1 and g_2.
man’s rank correlation coefficients all agree with a ranking in the conc -sense. This
reinforces the intuitive meaning of conc as a tool to compare the strength of the
dependence.
The next result gives a sufficient condition in terms of the concordance order.
Property 6.3.19 If

μ̂_2(X_2) ≼cx μ̂_1(X_1) and (Y, M_2) ≼conc (Y, M_1)

then μ̂_1(X_1) is more discriminatory than μ̂_2(X_2) for the response Y.
The preference relation proposed in Definition 6.3.16 only forms a partial ranking.
Two predictors might well be incomparable because their respective concentration
or Lorenz curves intersect: one predictor is better for low risks, and worse for high
risks, for example. In such a case, we can base the comparison on the integral of
the concentration curve, that is, on

ICC[μ(X), μ̂(X); α] = ∫_0^α E[μ(X) I[M ≤ ξ]] / E[Y] dξ = E[μ(X)(α − M)_+] / E[Y]
                   = Cov[μ(X), (α − M)_+] / E[Y] + E[(α − M)_+].
The first term is driven by the correlation between the response and the predictor
whereas the second one is just a constant as M is unit uniformly distributed:
E[(α − M)_+] = ∫_0^α (α − ξ) dξ = α²/2.
The integral of the concentration curve over the whole interval [0, 1] is denoted ICC, i.e.

ICC = ICC[μ(X), μ̂(X); 1] = Cov[μ(X), 1 − M] / E[Y] + 1/2
    = 1/2 − Cov[μ(X), M] / E[Y].
Again, as

E[Y (α − M)_+] = E[μ(X) (α − M)_+],

we are allowed to replace μ(X) with Y in the definition of the integrated concentration curve. This means that we can use it to measure the performance of μ̂(X) in predicting the unknown pure premium μ(X).
Let us now provide an intuitive interpretation for ICC. We still consider the 100α%
of policies with the smallest μ values, as (α − M)+ = 0 for α ≤ M. Now, ICC is
based on the covariance between μ(X) and (α − M)+ . The idea is that, the smaller
M with respect to α (i.e., the larger (α − M)+ ) the smaller the true premium should
be. Hence, a positive relationship between μ(X) and μ(X) translates into a negative
covariance between μ(X) and (α − M)+ . And the more negative the covariance term
entering the decomposition of ICC, the better the corresponding candidate premium.
Proceeding in a similar way with the Lorenz curve, we define the integrated Lorenz curve as

ILC[μ̂(X); α] = ∫_0^α E[μ̂(X) I[M ≤ ξ]] / E[μ̂(X)] dξ = E[μ̂(X)(α − M)_+] / E[μ̂(X)]
             = Cov[μ̂(X), (α − M)_+] / E[μ̂(X)] + E[(α − M)_+].
The integral of the Lorenz curve over the whole interval [0, 1] is denoted ILC. The smaller the ILC metric, the better the corresponding candidate premium. Similarly, the smaller the ICC metric, the better. Now, if we consider both metrics simultaneously, then one should prefer a predictor with smaller ICC and ILC metrics, or equivalently with smaller ICC and ABC values since

ABC[μ̂(X)] = ICC[μ(X), μ̂(X); 1] − ILC[μ̂(X); 1].
It is worth noticing that one predictor can be better for ICC and worse for ABC, for
instance.
Let μ1 (X) and μ2 (X) be two predictors for a response Y . In ratemaking, these
predictors are for the true technical premium μ(X). We can imagine that μ1 is the
current predictor and that we consider replacing it with μ2 provided the latter’s
performances are better. In order to compare these two predictors, let us define the
relativity as the ratio of the new to the old predictor, that is,

R = μ̂_2(X) / μ̂_1(X).
If R is less than 1, this means that the risk profile X is overpriced with the current
predictor. This profile is thus at risk of adverse selection: a competitor using the
predictor μ2 could offer a better rate to such policyholders who could then leave the
portfolio.
As before, both predictors are supposed to be balanced, i.e.

E[μ̂_1(X)] = E[μ̂_2(X)] = E[Y],

and to ease the explanation, we assume that μ̂_1(X), μ̂_2(X) and μ(X) are all continuous.
Following Frees et al. (2013), we define the ordered Lorenz curve as the set of points

( CC[μ̂_1(X), R(X); α], CC[Y, R(X); α] ) = ( E[μ̂_1(X) I[R(X) ≤ F_R^{-1}(α)]] / E[μ̂_1(X)], E[Y I[R(X) ≤ F_R^{-1}(α)]] / E[Y] ),

where F_R denotes the distribution function of the relativity R, and F_R^{-1} the associated quantile function. Notice that

CC[Y, R(X); α] = CC[μ(X), R(X); α]

so that we are allowed to replace the true premium μ(X) with the actual loss Y.
Both functions

s → E[μ̂_1(X) I[R(X) ≤ s]] / E[μ̂_1(X)]

and

s → E[Y I[R(X) ≤ s]] / E[Y]
can be interpreted as distribution functions. They give the proportion of total current
premiums μ1 (X) and the proportion of total losses Y (or true premiums μ(X)) in the
sub-portfolio determined by the condition R(X) ≤ s. The approach is thus based on
adverse selection against the insurer. Assume that a competitor attracts all profiles X that are overpriced under the current price list μ̂_1, i.e. those such that R(X) ≤ s for some s small enough. More precisely, in a sub-portfolio gathering all risk profiles such that R(X) ≤ F_R^{-1}(α), i.e. the 100α% of policies with the smallest relativities, we record the proportion
of premium income. Considering the point (t_1, t_2) of the ordered Lorenz curve, corresponding to the particular α, its meaning is as follows. By forming a portfolio with all policyholders whose relativities R(X) are less than F_R^{-1}(α), i.e. all policies for which the new premium μ̂_2(X) is smaller than F_R^{-1}(α) times the old one μ̂_1(X), the corresponding premium income is t_1 and the corresponding losses are t_2, on average. If t_2 < t_1 then this is a profitable portfolio, one well worth retaining.
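The ordered Lorenz curve is again easy to compute empirically; mu1, mu2 and y are hypothetical vectors with the current premiums, the challenger premiums and the observed losses:

ordered_lorenz <- function(y, mu1, mu2) {
  r   <- mu2 / mu1                     # relativity of the new premium to the old one
  ord <- order(r)                      # policies sorted by increasing relativity
  data.frame(alpha         = seq_along(y) / length(y),
             premium_share = cumsum(mu1[ord]) / sum(mu1),   # share of current premium income
             loss_share    = cumsum(y[ord]) / sum(y))       # share of losses
}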
These graphical procedures can be supplemented with single numbers. Two coef-
ficients have been proposed in the literature to measure the goodness-of-lift: the
Value-of-Lift by Meyers and Cummings (2009) and the Gini index advocated by
Frees et al. (2011).
6.3.8.1 Assumptions
In this section, μ̂(X) is assumed to follow a Gamma distribution with unit mean μ = 1 and variance σ², henceforth denoted as Gam(μ, σ²). Such predictors are known to be ordered in the ≼cx-sense according to σ.
Also, in this section, we consider two distributions for the true premium μ(X),
namely
• a Gamma distribution with unit mean and variance σY2 (such true premiums are
known to be ordered in the cx -sense with σY2 );
• a LogNormal distribution with unit mean, i.e. ln μ(X) is Normally distributed with
mean −σY2 /2 and variance σY2 , which is henceforth denoted as LN or (−σY2 /2, σY ).
Notice that condition (6.3.1) is fulfilled in both cases since we have E[Y] = E[μ(X)] = E[μ̂(X)] = 1. Also, it is worth noticing that the response may be discrete (such as the number of claims, for instance), as the continuity assumption only concerns μ(X) and μ̂(X).
In addition to the parameters σ and σY governing the variability of the predictor
and the true premium, respectively, we consider different dependence structures
for the random vector (μ(X), μ(X)). Specifically, we consider Frank and Clayton
copulas, two copulas that are monotonically conc -increasing with their parameter.
Recall from Denuit et al. (2005) that the Clayton copula is given by
C_θ(u, v) = ( u^{−θ} + v^{−θ} − 1 )^{−1/θ},    θ > 0,
For positive values of θ in Frank’s case, these two copulas express positive depen-
dence. The parameter θ can be interpreted as a measure of strength of the dependence
between μ(X) and μ(X). In order to make the dependence parameter more palatable,
we rather use the corresponding Kendall's tau. For the Clayton copula, Kendall's tau is simply given by θ/(θ + 2). For the Frank copula, Kendall's tau, which also increases with θ, can only be expressed as a Debye function of the first kind.
6.3.8.2 Variability
The dependence structure between μ(X) and μ(X) is assumed to be fixed and mod-
eled by means of the Clayton copula with Kendall’s tau equal to 0.5. The predictor
μ(X) is supposed to be Gamma distributed with mean and variance both equal to 1.
In addition, the true premium μ(X) is supposed to be Gamma distributed with unit
mean and variance σY2 . We aim to assess the impact of σY2 on ABC values.
In that goal, we consider three values for σY2 , that are 0.5, 1 and 2. The results are
summarized in the following table and illustrated in Fig. 6.5.
Line type     μ̂(X)        μ(X)          Copula C             ABC
medium dash   Gam(1, 1)   Gam(1, 2)     Clayton(τ = 0.5)     6.33%
short dash    Gam(1, 1)   Gam(1, 1)     Clayton(τ = 0.5)     9.66%
dotted        Gam(1, 1)   Gam(1, 0.5)   Clayton(τ = 0.5)     13.08%
We observe that the concentration curves are non-crossing as a result of the convex order among the different distributions of μ(X). Furthermore, the smaller the variance of μ(X), the further away the concentration curve lies from the Lorenz curve, so that the ABC value decreases with the variance of μ(X). Notice that in this example, there is no need to complement ABC values with the ICC metrics, the Lorenz curve being the same in the three cases considered.
This example highlights the fact that when we have identically distributed predictors that perform similarly in terms of dependence with the true premium, the ABC metric will favor the case where the true premium is the most variable (in the convex order sense). Similarly, for a given true premium and predictors performing the same way in terms of dependence with the true premium, the ABC metric will favor the predictor that is the least variable in terms of the convex order.
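The ABC values reported in this section can be reproduced by Monte Carlo simulation. The following sketch, assuming the copula package is available and reusing the abc_hat function sketched earlier (with the simulated true premium playing the role of the response), covers the Gam(1, 2) case:

library(copula)

set.seed(1)
n     <- 1e5
tau   <- 0.5
theta <- 2 * tau / (1 - tau)                 # Clayton parameter, since tau = theta / (theta + 2)
u     <- rCopula(n, claytonCopula(theta))

sigma2_hat <- 1                              # variance of the predictor mu_hat(X)
sigma2_mu  <- 2                              # variance of the true premium mu(X)

# Gamma margins with unit mean: shape = rate = 1 / variance
mu_hat <- qgamma(u[, 1], shape = 1 / sigma2_hat, rate = 1 / sigma2_hat)
mu     <- qgamma(u[, 2], shape = 1 / sigma2_mu,  rate = 1 / sigma2_mu)

abc_hat(mu, mu_hat)                          # compare with the 6.33% reported above for Gam(1, 2)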
The situation where μ(X) ≼cx μ̂(X) may be due to overfitting. This can happen when the predictor μ̂(X) integrates random noise. Indeed, assume that only the first q features X_1, ..., X_q, q < p, matter and that X_{q+1}, ..., X_p are independent, zero-mean random variables, independent of X_1, ..., X_q. Then, the true score β_0 + Σ_{j=1}^{q} β_j X_j is dominated by β_0 + Σ_{j=1}^{p} β_j X_j in the convex sense. On the contrary, the situation where μ̂(X) ≼cx μ(X) may be due to underfitting, which can be the case, for instance, when

μ(X) = E[Y | X_1, ..., X_p]

and

μ̂(X) = E[Y | X_1, ..., X_q].
Fig. 6.5 Lorenz curve and several concentration curves for different variances of μ(X)
Indeed, we have seen that increasing the number of features produces more dispersed premiums.
6.3.8.3 Dependence
Line type     μ̂(X)        μ(X)        C                     ABC
medium dash   Gam(1, 1)   Gam(1, 1)   Clayton(τ = 0.75)     3.46%
short dash    Gam(1, 1)   Gam(1, 1)   Clayton(τ = 0.50)     9.66%
dotted        Gam(1, 1)   Gam(1, 1)   Clayton(τ = 0.25)     17.04%
The weaker the dependence the further away the concentration curve from the
Lorenz curve. This can be explained as follows. With the Clayton copula, increasing
Kendall’s tau results in a random pair (μ(X), μ(X)) larger in the sense of conc .
Therefore, from Property 6.3.19, we know that the concentration curve gets lower,
and thus closer to the Lorenz curve.
We observe that the ABC value decreases with Kendall’s tau, which is not sur-
prising since increasing Kendall’s tau means that
μ(X) becomes more informative
about the true premium μ(X).
Fig. 6.6 Lorenz curve and several concentration curves for different values of Kendall’s tau
Notice that there is no need here to consider ICC values. Indeed, since the Lorenz curve is not impacted by the dependence between μ(X) and μ̂(X), the ABC and ICC metrics behave the same way.
6.3.8.4 Distribution
Again, we suppose that the predictor μ(X) is Gamma distributed with mean and
variance both equal to 1, and the dependence structure between μ(X) and μ(X) is
assumed to be fixed and modeled by means of the Clayton copula with Kendall’s
tau equal to 0.5. The true premium μ(X) is assumed to be Gamma distributed, and
this time, we also consider the LogNormal distribution for μ(X). Specifically, the
following table summarizes the three cases considered here:
Line type     μ̂(X)        μ(X)                                    C                    ABC
medium dash   Gam(1, 1)   LNor(−(1.25 √(ln 2))²/2, 1.25 √(ln 2))  Clayton(τ = 0.5)     9.20%
short dash    Gam(1, 1)   LNor(−ln 2 / 2, √(ln 2))                Clayton(τ = 0.5)     11.58%
dotted        Gam(1, 1)   Gam(1, 1)                               Clayton(τ = 0.5)     9.66%
In Fig. 6.7, we can see the corresponding concentration and Lorenz curves. In
both cases where the variance of μ(X) is equal to 1, one sees that the LogNormal
concentration curve (short dash) lies further away from the Lorenz curve than the
Gamma one (dotted). One observes that the ABC value favors the case where the
Fig. 6.7 Lorenz curve and several concentration curves for different distributions of μ(X)
distributions of
μ(X) and μ(X) are similar, the dependence structure being the same
in both cases.
In the LogNormal case (short dash), increasing σ_Y by a factor of 1.25 (medium dash) yields concentration curves crossing around the point 0.34. While in the previous examples the concentration curves were always ordered, we see that the use of different distributions can lead to crossing concentration curves.
Notice that in the latter case (medium dash), the ABC value is the smallest one,
which is not surprising in light of Sect. 6.3.8.2 since it corresponds to the case where
the variance of μ(X) is the largest one.
Again, there is no need to complement ABC values with the ICC metrics since
the Lorenz curve remains the same across the three cases considered here.
Similarly to the previous example, the use of different copulas can also lead to
crossing concentration curves. Let us consider the Clayton copula C1 and the Frank
copula C2 as in Example 2.3 of Denuit and Mesfioui (2013). In such a case, one can
show that there exists a function f such that C1 (u, v) − C2 (u, v) ≤ 0 if v ≤ f (u)
and C1 (u, v) − C2 (u, v) ≥ 0 if v ≥ f (u), so that these two copulas are not ordered
according to the concordance order. We consider the two following cases:
Fig. 6.8 Lorenz curve and two concentration curves for different copulas
Line type    μ̂(X)        μ(X)        C                    ABC
short dash   Gam(1, 1)   Gam(1, 1)   Frank(τ = 0.5)       7.79%
dotted       Gam(1, 1)   Gam(1, 1)   Clayton(τ = 0.5)     9.66%
The corresponding concentration and Lorenz curves are depicted in Fig. 6.8. We
observe that the concentration curves cross around point 0.35, which is an intuitive
result. Indeed, Clayton copula has stronger dependence in the lower quadrant than
Frank copula. Also, since the overall dependence is equal in both cases, the opposite
holds in the upper quadrant. The stronger the dependence the closer the concentration
curve is to the Lorenz curve. This is why the Clayton copula lies closer to the Lorenz
curve for small values and further away for large values.
Finally, we can consider a copula that does not exhibit positive quadrant dependence but only positive expectation dependence. To this end, we can proceed as
in Egozcue et al. (2011) by mixing two copulas expressing quadrant dependence
of opposite signs. For instance, considering the Frechet–Hoeffding upper and lower
bound copulas, we can use
Fig. 6.9 Lorenz and concentration curves for a non-regression dependent copula
as in Example 2.1 of Denuit and Mesfioui (2017). We know from Egozcue et al. (2011) that this mixture expresses positive expectation dependence if, and only if, θ ≤ 1/2. Alternatively, the Frechet–Hoeffding lower bound copula may be replaced with another copula expressing negative quadrant dependence (such as the Farlie-Gumbel-Morgenstern, or FGM, copula with negative dependence parameter, for instance).
Considering the following setup
Line type   μ̂(X)        μ(X)        C                        ABC
dotted      Gam(1, 1)   Gam(1, 1)   (6.3.14) with θ = 0.8    10%
we see in Fig. 6.9 that the above mixture copula indeed leads to a non-convex concentration curve.
Fig. 6.10 In- and out-of-sample errors for models under consideration
In this section, we aim to compare some of the models investigated in Noll et al. (2018) by using the ABC and ICC metrics discussed in this chapter. More specifically, we consider the following models of Noll et al. (2018) for the predictors μ̂_k(X_k):
• glm1—Poisson GLM with a log-link function and all explanatory variables;
• glm3—same as glm1 but without area and region variables;
• pbm1—boosted SBS (Standardized Binary Splits) tree (depth = 1, iterations = 30);
• pbm3—boosted SBS tree (depth = 3, iterations = 50);
• pbm3.s2—boosted SBS tree (depth = 3, iterations = 50, shrinkage = 0.5);
• glm1.pbm3—boosted SBS tree starting from glm1 fit (depth = 3, iterations = 50);
• nn—shallow neural network (20 neurons with one hidden layer).
Models’ implementation details can be found in Noll et al. (2018). We refer to Denuit
et al. (2019a) for details on neural networks.
The dataset is partitioned into a training set of 610 000 observations and a valida-
tion set comprising the remaining observations.
Figure 6.10 shows the training sample estimate of the generalization error (in-
sample error) and the validation sample estimate of the generalization error (out-of-
sample error) for the models under study together with bootstrapped 95% confidence
intervals. The bounds are derived for in- and out-of-sample errors individually, so
only vertical and horizontal distances are meaningful. In particular, the oval shape
is due to spline smoothing through the points (in-sample error, out-of-sample error):
{(lower, observed), (observed,higher), (higher,observed), (observed,lower)}. Over-
all, in-sample error and out-of-sample error classify the models in a similar way,
except the boosted tree model (pbm3) and its shrunken version. For the latter models, introducing a shrinkage factor increases the in-sample error while it reduces
the out-of-sample error. This is not surprising as the introduction of a shrinkage
factor aims to avoid overfitting issues. We also note that the boosted GLM model
(glm1.pbm3) improves substantially over the original GLM model (glm1). However,
it does not outperform the boosted SBS tree (pbm3). The latter observation indicates that the fixed structural form imposed on the expected claim frequency by the GLM model does not provide any additional explanatory insight compared to the boosted
SBS tree. Finally, the optimal model with respect to the out-of-sample error metric
is the boosted tree model with a shrinkage factor (pbm3.s2).
Looking at the bootstrapped confidence intervals, all the models except the ones
based on (pbm3) are nicely separated. It also seems that boosted methods yield more
varying results than GLMs or the neural network model (for out-of-sample error).
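The bootstrapped intervals can be obtained along the following lines; this is a sketch assuming the out-of-sample error is measured by the average Poisson deviance, with hypothetical vectors y_val (observed claim counts) and mu_val (predicted expected claim frequencies) for the validation set:

poisson_deviance <- function(y, mu) {
  # average Poisson deviance, using the convention y * log(y / mu) = 0 when y = 0
  2 * mean(ifelse(y == 0, mu, y * log(y / mu) - (y - mu)))
}

set.seed(1)
boot_err <- replicate(1000, {
  idx <- sample(length(y_val), replace = TRUE)   # resample the validation set with replacement
  poisson_deviance(y_val[idx], mu_val[idx])
})
quantile(boot_err, c(0.025, 0.975))              # bootstrapped 95% confidence interval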
Let us now turn to the goodness-of-lift metrics discussed in this chapter. In the following, we use the empirical versions ĈC and L̂C of the concentration curve and the Lorenz curve computed on the validation set in order to get ABC and ICC values.
In case the number of observations is insufficient, a smoothed version of the empirical concentration curve ĈC could be used instead. Here, the size of the validation set is judged sufficient to simply rely on ĈC, depicted in Fig. 6.11 for two of the considered models. The remaining models are close to these two models, forming two groups of curves for α larger than 0.15. The higher group of curves is related to models glm1, glm3 and pbm1, which are also the three worst models according to the out-of-sample errors.
Fig. 6.12 Estimated ABC and ICC values for models under consideration
ABC and ICC values are displayed in Fig. 6.12, with the same visualization of the bootstrapped confidence intervals. We notice that the ICC metric classifies the models in the same way as the out-of-sample error metric does, except for models pbm3 and pbm3.s2. While both metrics agree that these two models are the best ones, pbm3.s2 outperforms pbm3 according to the out-of-sample error while with ICC it is the other way around.
Regarding the ABC values, we observe that a model with a low ICC can have either a low or a large ABC. For instance, glm1.pbm3, which is one of the best models according to ICC, has the highest ABC, while pbm3.s2 has both a low ICC and a low ABC. If we compare glm1.pbm3 and pbm3.s2, which have similar degrees of lift according to ICC, we notice that the ABC metric favors pbm3.s2, which is less variable than glm1.pbm3; this is in line with Sect. 6.3.8.2. In the same way, while pbm3 and pbm3.s2 have similar ICC, pbm3.s2 outperforms pbm3 according to ABC, pbm3.s2 being less variable than pbm3 (since both models have the same number of trees while pbm3.s2 uses a shrinkage parameter). Finally, the optimal model with respect to ABC is pbm3.s2.
To end the case study, we display in Fig. 6.13 ICC and ABC as functions of α (i.e. integrating over the interval [0, α] instead of the whole interval [0, 1]). We present only the curves for models pbm1 and glm1.pbm3 as the remaining curves look fairly similar. One sees that glm1.pbm3 always has a lower ICC while the ABC values cross at around the 91% quantile. Hence, from that quantile on, one can say that pbm1 outperforms glm1.pbm3 according to the ABC metric.
Fig. 6.13 Estimated ABC and ICC values for models under consideration

6.4 Bibliographic Notes and Further Reading
Denuit et al. (2019b) considered binary responses and derived the set of attainable values for concordance-based association measures so that the closeness to the best-possible fit can be properly assessed. Denuit et al. (2019c) and Mesfioui et al. (2020) obtained the best-possible upper bounds for Kendall's tau and Spearman's rho when the response is a discrete random variable. Section 6.2 is largely inspired by these two papers.
Several testing procedures have been proposed in the literature to detect depen-
dence relations. For positive quadrant dependence, we refer the reader to Denuit
and Scaillet (2004) and Scaillet (2005). Zhu et al. (2016) investigated hypothe-
sis tests for first-degree and higher-degree expectation dependence. Testing pro-
cedures for the convex order have been proposed in economics (see e.g., Barrett and
Donald 2003). Section 6.3 is strongly inspired from Denuit et al. (2019d), in which
concentration curves and Lorenz curves are shown to provide actuaries with effec-
tive tools to evaluate whether a premium is appropriate or to compare two competing
alternatives.
References
Barrett GF, Donald SG (2003) Consistent tests for stochastic dominance. Econometrica 71(1):71–
104
Denuit M, Dhaene J, Goovaerts MJ, Kaas R (2005) Actuarial theory for dependent risks: measures,
orders and models. Wiley, New York
Denuit M, Mesfioui M (2013) A sufficient condition of crossing-type for the bivariate orthant convex
order. Stat Probab Lett 83(1):157–162
Denuit M, Mesfioui M (2017) Preserving the Rothschild-Stiglitz type increase in risk with back-
ground risk: a characterization. Insur: Math Econ 72:1–5
Denuit M, Hainaut D, Trufin J (2019a) Effective statistical learning methods for actuaries III: neural
networks and extensions. Springer Actuarial Lecture Notes
Denuit M, Mesfioui M, Trufin J (2019b) Bounds on concordance-based validation statistics in
regression models for binary responses. Methodol Comput Appl Probab 21(2):491–509
Denuit M, Mesfioui M, Trufin J (2019c) Concordance-based predictive measures in regression
models for discrete responses. Scand Actuar J 10:824–836
Denuit M, Scaillet O (2004) Nonparametric tests for positive quadrant dependence. J Financ Econ 2(3):422–450
Denuit M, Sznajder D, Trufin J (2019d) Model selection based on Lorenz and concentration curves,
Gini indices and convex order. Insur: Math Econ 89:128–139
Egozcue M, Garcia L-F, Wong W-K, Zitikis R (2011) Grüss-type bounds for covariances and the
notion of quadrant dependence in expectation. Cent Eur J Math 9(6):1288–1297
Frees E, Meyers G, Cummings A (2011) Summarizing insurance scores using a Gini index. J Amer
Stat Asso 106(495):1085–1098
Frees EW, Meyers G, Cummings AD (2013) Insurance ratemaking and a Gini index. J Risk Insur
81(2):335–366
Gourieroux C, Jasiak J (2007) The econometrics of individual risk: credit, insurance, and marketing.
Princeton University Press, Princeton
Mesfioui M, Tajar A (2005) On the properties of some nonparametric concordance measures in the
discrete case. Nonparametric Stat 17(5):541–554
Mesfioui M, Trufin J, Zuyderhoff P (2020) Bounds on Spearman’s rho when at least one random
variable is discrete. Working paper
Meyers G, Cummings AD (2009) "Goodness of fit" vs. "goodness of lift". Actuar Rev 36(3):16–17
Muliere P, Petrone S (1992) Generalized Lorenz curve and monotone dependence orderings. Metron
50:19–38
Nešlehová J (2007) On rank correlation measures for non-continuous random variables. J Multivar
Anal 98(3):544–567
Noll A, Salzmann R, Wüthrich M (2018) Case study: French motor third-party liability claims.
Available at SSRN: https://ssrn.com/abstract=3164764
Scaillet O (2005) A Kolmogorov-Smirnov type test for Positive Quadrant Dependence. Can J Stat
33(3):415–427
Shaked M, Sordo MA, Suarez-Llorens A (2012) Global dependence stochastic orders. Methodol
Comput Appl Probab 14(3):617–648
Tevet D (2013) Exploring model lift: is your model worth implementing. Actuar Rev 40(2):10–13
Yitzhaki S, Schechtman E (2013) The Gini methodology: a primer on statistical methodology.
Springer, Berlin
Zhu X, Guo X, Lin L, Zhu L (2016) Testing for positive expectation dependence. Ann Inst Stat
Math 68:135–153