Chapter 2 Annotated, Part 2
Linear Regression
This chapter discusses linear regression from a predictive modelling perspective. Extensions
to penalized likelihood (ridge regression) and variable selection (LASSO) are also discussed.
The material in this chapter can be mostly found in Chapter 3 of Hastie et al. (2009) and
Chapter 1 of Wood (2017b).
2.1 Predictive Modelling
2.1.1 Setup
Given a set of inputs and outputs called a training set, prediction is the task of deriving a
function that maps new inputs to new outputs:
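In symbols, a sketch of this setup (the notation here is illustrative rather than taken verbatim from the notes): a training set is
$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n, \qquad x_i \in \mathbb{R}^p,\ y_i \in \mathbb{R},$$
and a learning algorithm maps $\mathcal{D}$ to a prediction rule $\hat f: \mathbb{R}^p \to \mathbb{R}$, so that a new input $x$ is predicted by $\hat f(x)$.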
The term predictive model is a colloquialism that usually describes either (a) a learning algorithm, or (b) a prediction rule.
This setup is intuitive, but is not specific enough to actually execute. Any function that
maps a set of inputs and outputs to a function from inputs to outputs could be called a
“learning algorithm”, just like how in statistics any function from the sample space to the
parameter space could be called an “estimator”. A framework for evaluating the quality of a
learning algorithm is required.
Learning algorithms are evaluated by evaluating the quality of the predictions they pro-
duce. If the predictions produced by a given algorithm are generally close to the correspond-
ing outputs, then a learning algorithm is “good”. To make this specific enough to implement,
definitions of “generally” and “close” are required. We say that a prediction algorithm is
good if it has small expected loss:
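In symbols (a sketch, with a generic loss function $L$): the expected loss of a prediction rule $f$ is
$$\mathbb{E}\big[L\big(Y, f(X)\big)\big],$$
which under square loss $L(y, \hat y) = (y - \hat y)^2$ becomes $\mathbb{E}\big[(Y - f(X))^2\big]$.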
If we replace “generally” by “on average” and “close” by “small loss”, we can implement
predictive modelling.
It is important to stress that we could use any loss function; the choice of square loss is
convenient for development, but other loss functions are useful in other contexts. There are
a number of motivations for and desirable properties of square loss:
(a) It is the negative log-likelihood in a normal model where we predict using the mean,
bringing a connection with classical statistics;
(b) It admits a tractable minimizer over the class of all functions (see below);
(c) It locally approximates any twice-differentiable loss function near a stationary point (see below).
On the last point, consider the Taylor expansion of a twice-differentiable, stationary loss
function:
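A sketch of the expansion (writing $\hat y$ for the prediction, and assuming the loss is minimized and stationary at $\hat y = y$, so that the first-order term vanishes):
$$L(y, \hat y) \approx L(y, y) + \frac{\partial L}{\partial \hat y}\Big|_{\hat y = y}(\hat y - y) + \frac{1}{2}\frac{\partial^2 L}{\partial \hat y^2}\Big|_{\hat y = y}(\hat y - y)^2 = \text{const} + \frac{1}{2}L''(y, y)\,(\hat y - y)^2,$$
so near its minimum any such loss behaves like a scaled version of square loss.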
While none of these properties are unique or necessary, they are nice. We will use square
loss for the remainder of the course.
This expectation is taken with respect to the joint distribution of a single (input, output)
pair. This minimization is over the space of all functions. Remarkably, with square loss it
is tractable:
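A sketch of the standard argument, conditioning on $X$:
$$\mathbb{E}\big[(Y - f(X))^2\big] = \mathbb{E}_X\Big[\mathbb{E}\big[(Y - f(X))^2 \mid X\big]\Big],$$
and minimizing the inner expectation pointwise in $x$ gives
$$f^*(x) = \mathbb{E}\big[Y \mid X = x\big].$$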
The use of square loss implies that predictions of an output at a given input should be
set equal to the conditional expectation of the outputs at that input.
The phrase “fitting the model to the data” is usually used to describe this whole process.
We now have some thinking to do. One option is to appeal to the law of large numbers/central
limit theorem:
1. From classical statistics, we know that the error in using the sample mean to approximate the population mean is proportional to the inverse square root of the number of data points in the sum. So we will need a lot of repeated inputs for this to work.
2. What happens if a new input doesn’t equal any previously-observed input? This model
cannot extrapolate, but it also cannot even predict for values within the range—but
not equal to any—of the inputs in the training set.
One option is to expand the range of data used in the prediction of the outputs to any
“nearby” inputs. This is what is done by the k-nearest neighbours prediction algorithm:
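A minimal Python sketch of k-nearest-neighbours prediction (the function name, the Euclidean metric, and the simulated data are assumptions for illustration):

```python
import numpy as np

def knn_predict(x_new, X_train, y_train, k=5):
    """Predict by averaging the outputs of the k training inputs closest to x_new."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every training input
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbours
    return y_train[nearest].mean()                    # average their outputs

# Tiny usage example with simulated data
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(200, 2))
y_train = X_train[:, 0] ** 2 + 0.1 * rng.standard_normal(200)
print(knn_predict(np.array([0.5, 0.0]), X_train, y_train, k=10))
```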
The k-nearest neighbours approach to prediction is smoother and more efficient (in the
variance sense) than simple averaging and can be applied to any input value. It also makes
almost no assumptions on the form of the distribution of the training data, which sounds
like a purely good thing. However, it only uses a small subset of the available training data
to make each prediction. As the dimension of the inputs grows, predictions will become less
precise: the number of training data needed to make predictions as precise as those for a
single-dimensional input is exponential in the input dimension.
Another way to put it is that k-nearest neighbours is a local method: only a small
number of training data close to the input are used to predict the output. If we are willing
to make strong assumptions about the form of the regression function, global methods can
out-perform local ones, if the assumptions hold. Consider a simple polynomial model for
the regression function, based on a Taylor expansion at the sample mean of the inputs:
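A sketch of the first-order version of this expansion (the notation is illustrative):
$$f(x) \approx f(\bar x) + \sum_{j=1}^p \frac{\partial f(\bar x)}{\partial x_j}(x_j - \bar x_j) = \beta_0 + \sum_{j=1}^p x_j\beta_j,$$
with the intercept and slopes absorbing the value of $f$ and its derivatives at $\bar x$; higher-order terms give polynomial models.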
This is the context in which we view the familiar linear regression model.
When the regression function is approximated well globally by a linear function, then
linear regression is an excellent method: easy to fit, provably lowest variance among linear unbiased methods (the Gauss-Markov theorem), and easy
to interpret. It also forms the basis of many of the more complicated models covered in
subsequent chapters.
2.2 Linear Regression via Least Squares
Linear regression is a parametric model. Fitting a linear regression model to data is
equivalent to estimating the regression parameter:
$$f(x) = x^T\beta, \qquad \beta \in \mathbb{R}^p,$$
so the function to be learned is determined by a vector in $\mathbb{R}^p$.
The least squares estimate minimizes $\|y - X\beta\|_2^2$; setting the gradient to zero gives the normal equations:
$$X^T(y - X\beta) = 0 \;\Longleftrightarrow\; X^TX\beta = X^Ty \;\Longrightarrow\; \hat\beta = (X^TX)^{-1}X^Ty.$$
Under a Gaussian model for the outputs, least squares is maximum likelihood:
$$Y \sim N(X\beta, \sigma^2 I_n),$$
$$\ell(\beta, \sigma^2) = \text{const} - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\|y - X\beta\|_2^2,$$
so maximizing $\ell(\beta)$ for fixed $\sigma^2$ is equivalent to minimizing $\|y - X\beta\|_2^2$, giving $\hat\beta = (X^TX)^{-1}X^Ty$.
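As a quick numerical illustration (a sketch with simulated data; the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.3 * rng.standard_normal(n)

# Solve the normal equations X^T X beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable: least squares via a factorization of X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)   # the two agree when X has full column rank
```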
This is interpreted as choosing the mean to be the closest vector to $y$ in $\mathbb{R}^n$ that can be written as a linear combination of the $p$ $n$-dimensional vectors $x_1, \dots, x_p$.
Prediction of new outputs also has a geometric flavour, as a projection onto the span of the inputs. This is simply because we use the mean output to predict (see above).
We can write the whole vector of predictions as a linear combination of the outputs:
$$\hat Y = X\hat\beta = X(X^TX)^{-1}X^TY = HY, \qquad H = X(X^TX)^{-1}X^T.$$
The "hat matrix" $H$ is an orthogonal projection:
$$H^2 = X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T = X(X^TX)^{-1}X^T = H, \qquad H^T = H.$$
Properties: the eigenvalues of $H$ are all 0 or 1.
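A quick numerical check of these properties (illustrative only, using a simulated design matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
H = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix H = X (X^T X)^{-1} X^T

print(np.allclose(H @ H, H))                # idempotent: H^2 = H
print(np.allclose(H, H.T))                  # symmetric
eigvals = np.sort(np.linalg.eigvalsh(H))
print(np.allclose(eigvals[-4:], 1), np.allclose(eigvals[:-4], 0))
# eigenvalues are 0 or 1; the number of ones equals the column rank of X
```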
With p = 1 and no intercept, this is written in terms of the inner product of the inputs
and outputs and the norm of the inputs:
$$\hat\beta = \frac{\langle x, y\rangle}{\|x\|^2} = \frac{x^Ty}{x^Tx}, \qquad \hat Y = \frac{x^Ty}{x^Tx}\,x.$$
Now add a new input, and suppose it is orthogonal to the previous input:
If the second input is orthogonal to the first input, then the least squares estimate of the
first regression coefficient doesn't change. Supposing now that all p inputs are mutually orthogonal,
we have
$$\hat\beta_j = \frac{x_j^Ty}{x_j^Tx_j}, \qquad j = 1, \dots, p,$$
so the multiple regression coefficients coincide with the univariate ones.
Of course, inputs will never be orthogonal outside of designed experiments (where they
are not random); in fact if their distribution is continuous then they are orthogonal with
probability zero. But, we can always orthogonalize them! The Gram-Schmidt procedure for
orthogonalization is as follows:
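A sketch of the successive-orthogonalization idea in Python (the function name and data are illustrative; this follows the usual description of regression by Gram-Schmidt as in Hastie et al. 2009, not code from the notes):

```python
import numpy as np

def gram_schmidt_last_coefficient(X, y):
    """Orthogonalize the columns of X in order, then regress y on the last residual.

    Returns the least squares coefficient of the last input, computed from its
    residual after removing the contributions of all preceding inputs.
    """
    n, p = X.shape
    Z = np.empty_like(X, dtype=float)
    for j in range(p):
        z = X[:, j].copy()
        for k in range(j):                       # remove the parts explained by earlier residuals
            z -= (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k]) * Z[:, k]
        Z[:, j] = z
    return (Z[:, -1] @ y) / (Z[:, -1] @ Z[:, -1])

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.standard_normal((100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.standard_normal(100)
beta_last = gram_schmidt_last_coefficient(X, y)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_last, beta_full[-1])                  # these agree
```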
obtained by regressing xj on the preceding residuals. But this is exactly the Gram-Schmidt procedure; indeed, if we let $Z = [z_1, \dots, z_p]$ denote the matrix of successive residuals and $\Gamma$ the upper-triangular matrix of regression coefficients produced along the way, then we have $X = Z\Gamma$, which after rescaling the columns of $Z$ to unit norm is the QR decomposition of $X$.
The singular value decomposition of $X \in \mathbb{R}^{n\times p}$ can be written
$$X = UDV^T,$$
where $U \in \mathbb{R}^{n\times p}$ has orthonormal columns ($U^TU = I_p$), $V \in \mathbb{R}^{p\times p}$ is orthogonal ($V^TV = VV^T = I_p$), and
$$D = \operatorname{diag}(d_1, \dots, d_p), \qquad d_1 \ge \dots \ge d_p \ge 0,$$
contains the singular values of $X$; some of them are zero exactly when $X$ is not of full column rank. The eigendecomposition of $X^TX$ is
$$X^TX = VD^2V^T.$$
Substituting the SVD into the least squares predictions gives
$$\hat Y = X(X^TX)^{-1}X^TY = UDV^T\,(VD^2V^T)^{-1}\,VDU^TY = UDV^TVD^{-2}V^TVDU^TY = UU^TY.$$
As we add more complicated inputs that are more highly correlated with each other, the
ratio of the largest singular value to the smallest (called the condition number) of X will
increase. Eventually, the smallest singular value will reach zero, indicating that two or more
inputs are linearly dependent, X is not full rank, X T X is not invertible, and β̂ cannot be
(uniquely) determined. This sounds bad. However, predictions can still be made! So there
is no problem?
The problem comes when considering the uncertainty in the predictions. The formula
for the point predictions assumed that X T X could be inverted. We have:
$$\hat Y = X\hat\beta = HY, \qquad \text{provided } (X^TX)^{-1} \text{ exists}.$$
Intuitively, it really shouldn’t be the case that adding in a single linearly dependent
predictor completely destroys a perfectly good predictive model. The problem is that all
inputs contribute equally to the predictions, precisely because of the lack of dependence
of the predictions on the singular values of X. We want directions in X space that are
(nearly) linearly dependent to have (nearly) no influence on the predictions. This can be
accomplished by modifying the spectral decomposition of X T X directly.
2.3 Ridge Regression
Recall that the least squares predictions are
$$\hat Y = X(X^TX)^{-1}X^TY = HY = UU^TY.$$
Ridge regression adds a small value to the diagonal of X T X:
$$H(\lambda)Y = X(X^TX + \lambda I_p)^{-1}X^TY = UD(D^2 + \lambda I_p)^{-1}DU^TY.$$
The predictions now depend on the singular values of X. Specifically, the predictions are
shrunk towards zero, with the amount of shrinkage decreasing in the singular values:
$$\hat Y_\lambda = \sum_{j=1}^p u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j^TY.$$
Inputs corresponding to large singular values have their influence on the predictions
reduced less, because this factor will be closer to 1. Inputs with small singular values have
smaller influence on the predictions. Inputs with singular values equal to zero have no
influence on the predictions.
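A numerical sketch of this shrinkage via the SVD (simulated data; the variable names and the choice of lambda are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.standard_normal((n, p))
X[:, 4] = X[:, 0] + 1e-6 * rng.standard_normal(n)   # a nearly collinear column
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + 0.2 * rng.standard_normal(n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(d) V^T
lam = 1.0

shrink = d**2 / (d**2 + lam)                         # factors d_j^2 / (d_j^2 + lambda)
yhat_ridge = U @ (shrink * (U.T @ y))                # ridge predictions
yhat_ols = U @ (U.T @ y)                             # ordinary least squares predictions

print(shrink)    # directions with small singular values are shrunk the most
```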
This problem is another solution to collinearity: by simply forbidding the size of the
regression coefficients to become too large, we limit the influence of linearly dependent vari-
ables on the predictions. This problem always has a solution for any t > 0, even if X is low
rank. To see this, consider the Lagrangian corresponding to this objective:
$$\hat\beta_t = \arg\min_{\beta:\ \|\beta\|_2^2 \le t} \|y - X\beta\|_2^2 \quad\Longleftrightarrow\quad \hat\beta_\lambda = \arg\min_\beta\ \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2,$$
where $\lambda$ and $t$ are inversely related. Setting the gradient of the Lagrangian to zero:
$$-2X^T(y - X\beta) + 2\lambda\beta = 0 \;\Longrightarrow\; (X^TX + \lambda I_p)\beta = X^Ty \;\Longrightarrow\; \hat\beta_\lambda = (X^TX + \lambda I_p)^{-1}X^Ty,$$
with ridge regression prediction
$$\hat Y_\lambda = X\hat\beta_\lambda = X(X^TX + \lambda I_p)^{-1}X^TY.$$
The solution to the constrained optimization problem is exactly the ridge regression
solution from Section 2.3.1.
To fit a ridge regression model to data for fixed λ > 0 is trivial given the ability to fit
the ordinary linear regression model (which is ridge regression with λ = 0). Estimating λ
from the data is necessary in practice. We will discuss this in Chapter 4.
2.4 Least Absolute Shrinkage and Selection Operator (LASSO)
The LASSO replaces the squared 2-norm penalty $\|\beta\|_2^2$ of ridge regression with the 1-norm $\|\beta\|_1 = \sum_{j=1}^p|\beta_j|$:
$$\hat\beta_\lambda = \arg\min_\beta\ \|y - X\beta\|_2^2 + \lambda\|\beta\|_1 = \arg\min_\beta\ \sum_{i=1}^n\Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^p|\beta_j|.$$
Changing from the 2-norm to the 1-norm has surprising consequences: the latter can set
coefficients to zero exactly. We can illustrate this using the following infamous geometric
argument:
[Figure: contours of the least squares objective in the $(\beta_1, \beta_2)$ plane, centred at the ordinary least squares solution, shown with the circular ridge constraint region $\|\beta\|_2^2 \le t$, which cannot meet a contour exactly at zero on any axis, and the diamond-shaped LASSO constraint region $\|\beta\|_1 \le t$, whose corners lie on the axes.]
The constant contours of the objective have larger values the farther they are from the
centre. The same is true of the constraint. Where they first intersect is the solution to the
constrained minimization problem.
Coordinate descent addresses the problem
$$\hat x = \arg\min_{x\in\mathbb{R}^p} f(x), \qquad f:\mathbb{R}^p\to\mathbb{R} \text{ convex}$$
(an optional general reference for numerical optimization is Nocedal and Wright). Rather than minimizing over all coordinates at once, it minimizes over one coordinate at a time, holding the others fixed at their current values:
$$x_j \leftarrow \arg\min_{x_j} f(x_1, \dots, x_{j-1}, x_j, x_{j+1}, \dots, x_p), \qquad j = 1, \dots, p,$$
and cycles this process
until convergence is attained. It is generally an inefficient convex optimization algorithm,
requiring p iterations to minimize a p-dimensional quadratic (as opposed to 1 for, say, New-
ton’s method). However, it turns out that for the LASSO, the sparsity structure in the
problem is exploited well by a clever implementation of coordinate descent.
For the LASSO, the objective as a function of a single coefficient $\beta_j$ (with the others held fixed) is
$$\sum_{i=1}^n\Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^p|\beta_j|.$$
It is continuous in $\beta_j \in \mathbb{R}$, differentiable for $\beta_j > 0$ and for $\beta_j < 0$, but not differentiable at $\beta_j = 0$. The partial derivative with respect to $\beta_j$ only is
$$\frac{\partial}{\partial\beta_j} = -2\sum_{i=1}^n x_{ij}\Big(y_i - \sum_{\ell=1}^p x_{i\ell}\beta_\ell\Big) + \begin{cases} \lambda & \text{if } \beta_j > 0,\\ -\lambda & \text{if } \beta_j < 0,\\ \text{undefined} & \text{if } \beta_j = 0. \end{cases}$$
Note that the partial residual $y_i - \sum_{\ell\neq j} x_{i\ell}\beta_\ell$ does not contain the current value of $\beta_j$. The solution for $\beta_j$, given the current values of $\beta_1, \dots, \beta_{j-1}, \beta_{j+1}, \dots, \beta_p$, is obtained by setting the partial derivative to zero. Assume $\beta_j > 0$, absorb constant factors into $\lambda$, and suppose the inputs are standardized so that $\sum_i x_{ij}^2 = 1$. Then
$$\sum_{i=1}^n x_{ij}y_i - \sum_{\ell \ne j}\Big(\sum_{i=1}^n x_{ij}x_{i\ell}\Big)\beta_\ell - \beta_j - \lambda = 0,$$
so we have
$$\hat\beta_j = \sum_{i=1}^n x_{ij}\Big(y_i - \sum_{\ell\ne j}x_{i\ell}\beta_\ell\Big) - \lambda,$$
which is consistent with the assumption only if the right-hand side is positive. Repeating the argument for $\beta_j < 0$ and combining the cases gives
$$\hat\beta_j = S\Big(\sum_{i=1}^n x_{ij}\big(y_i - \sum_{\ell\ne j}x_{i\ell}\beta_\ell\big),\ \lambda\Big),$$
where $S$ is the soft-thresholding operator,
$$S(z, \lambda) = \begin{cases} z - \lambda, & z > \lambda,\\ 0, & |z| \le \lambda,\\ z + \lambda, & z < -\lambda. \end{cases}$$
In this form, the update depends on the data only through inner products:
$$\hat\beta_j = S\Big(x_j^Ty - \sum_{\ell\ne j} x_j^Tx_\ell\,\hat\beta_\ell,\ \lambda\Big),$$
and only the terms with $\hat\beta_\ell \neq 0$ contribute to the sum. So the software precomputes all of the feature inner products once, and then sums only over variables for which $\hat\beta_\ell \neq 0$. Further, it can be shown directly from the definition of $S$ that $\hat\beta_j$ is set to zero whenever
$$\Big|x_j^Ty - \sum_{\ell\ne j}x_j^Tx_\ell\,\hat\beta_\ell\Big| \le \lambda.$$
Hence, it is known whether a β̂_j will change at each iteration based on quantities computed in advance of the iteration, speeding up the algorithm.
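A compact Python sketch of the coordinate descent just described (columns scaled to unit norm, matching the derivation above with constants absorbed into lambda; an illustration, not the glmnet implementation):

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lambda): shrink z towards zero by lambda, setting it to zero if |z| <= lambda."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """LASSO by coordinate descent using the soft-thresholding update derived above."""
    n, p = X.shape
    Xty = X.T @ y                    # feature-output inner products, computed once
    XtX = X.T @ X                    # feature-feature inner products, computed once
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            partial = Xty[j] - XtX[j] @ beta + XtX[j, j] * beta[j]   # remove beta_j's own contribution
            beta[j] = soft_threshold(partial, lam)
        # (a real implementation would check convergence and skip coefficients known to stay at zero)
    return beta

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 10))
X /= np.linalg.norm(X, axis=0)       # scale columns to unit norm
y = X[:, 0] * 5 - X[:, 1] * 3 + 0.1 * rng.standard_normal(200)
print(lasso_cd(X, y, lam=0.5))       # most coefficients come out exactly zero
```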
Finally, computing the solution path of the LASSO (a sequence of solutions as a function
of λ) can be done efficiently. The software starts with the smallest value of λ for which every
coefficient is zero; from the soft-thresholding update this is λ_max = max_j |x_j^T y|.
A decreasing sequence of λ values is then used. The computational trick is to use the
coefficients from the previous iteration as a starting value for the next iteration; in the opti-
mization literature this is called warm starts and it is a very popular technique in problems
of this type. Moreover, it is known that β̂ is a piecewise-linear function of λ: writing
$$\hat\beta_\lambda = \arg\min_\beta\ \|y - X\beta\|_2^2 + \lambda\|\beta\|_1,$$
the map $\lambda \mapsto \hat\beta_\lambda$ is piecewise linear.
(See Hastie et al. 2009, Exercise 3.27 (c).) This explains why the warm starts work so
well in this problem: the values of β̂_λ will be close if their values of λ are close.
In practice, the glmnet (Friedman et al., 2010) implementation is extremely fast for very
high-dimensional, sparse regression problems.
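A sketch of the warm-start idea using scikit-learn's Lasso (the notes point to glmnet in R; scikit-learn is substituted here purely for illustration, and its alpha is on a 1/(2n) scale rather than the λ above):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 50))
y = X[:, 0] * 3 - X[:, 1] * 2 + 0.1 * rng.standard_normal(200)

alpha_max = np.max(np.abs(X.T @ y)) / len(y)       # smallest alpha with an all-zero solution
alphas = alpha_max * np.logspace(0, -3, num=50)    # decreasing sequence

model = Lasso(alpha=alphas[0], warm_start=True, fit_intercept=False, max_iter=10_000)
path = []
for a in alphas:
    model.set_params(alpha=a)
    model.fit(X, y)                                # starts from the previous solution (warm start)
    path.append(model.coef_.copy())

print(sum(path[-1] != 0), "nonzero coefficients at the smallest alpha")
```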