
18.650 – Fundamentals of Statistics

7. Generalized linear models

Linear model

A linear model assumes

    Y | X = x ∼ N(µ(x), σ² Iₙ),

and¹

    IE(Y | X = x) = µ(x) = x^⊤β.

¹ Throughout we drop the boldface notation for vectors.

Components of a linear model

The two model components (that we are going to relax) are:

1. Random component: the response variable Y is continuous,
   and Y | X = x is Gaussian with mean µ(x).

2. Regression function: µ(x) = x^⊤β.

Kyphosis

The Kyphosis data consist of measurements on 81 children
following corrective spinal surgery. The binary response variable,
Y, indicates the presence or absence of a postoperative deformity.
The three covariates are:

- X⁽¹⁾: age of the child in months,
- X⁽²⁾: number of vertebrae involved in the operation, and
- X⁽³⁾: start of the range of the vertebrae involved.

Write X = (1, X⁽¹⁾, X⁽²⁾, X⁽³⁾)^⊤ ∈ IR⁴.

Kyphosis

- The response variable is binary, so there is no choice:
  Y | X = x is Bernoulli with expected value

      µ(x) = IE[Y | X = x] ∈ (0, 1).

- We cannot write

      µ(x) = x^⊤β,

  because the right-hand side ranges through all of IR.
- We need an invertible function f such that f(x^⊤β) ∈ (0, 1).

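A minimal numeric sketch of one such f (assuming NumPy; the inverse logit is one standard choice, not prescribed by the slide at this point):

```python
import numpy as np

def inv_logit(t):
    """Maps any real number t into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """Its inverse: maps (0, 1) back onto IR."""
    return np.log(p / (1.0 - p))

t = np.linspace(-10.0, 10.0, 5)
p = inv_logit(t)
print(p)                            # every value lies strictly in (0, 1)
print(np.allclose(logit(p), t))     # True: f is invertible
```
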
Generalization

A generalized linear model (GLM) generalizes normal linear
regression models in the following directions.

1. Random component:

       Y | X = x ∼ some distribution

   (e.g. Bernoulli, exponential, Poisson).

2. Regression function:

       g(µ(x)) = x^⊤β,

   where g is called the link function and µ(x) = IE(Y | X = x) is
   the mean of the response.

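As a hedged sketch of these two choices in software (assuming NumPy and statsmodels; the real Kyphosis data ship with R's rpart package, so the covariates and response below are simulated stand-ins):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 81
# Simulated stand-ins for the three kyphosis covariates
X = np.column_stack([
    rng.uniform(1, 200, n),    # age in months
    rng.integers(2, 8, n),     # number of vertebrae involved
    rng.integers(1, 18, n),    # start of the range
])
X = sm.add_constant(X)         # prepend the intercept column
beta_true = np.array([-2.0, 0.01, 0.4, -0.2])           # hypothetical coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))   # Bernoulli response

# Random component: Bernoulli; regression function: logit link
# (the default link of the Binomial family in statsmodels)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)
```
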
Predator/Prey

Consider the following model for the number of prey Y that a
predator (hawk) catches per day, given the number X of prey
(mice) in its hunting territory.

Random component: Y ≥ 0, and the variance of the capture rate is
known to be approximately equal to its expectation, so we propose
the following model:

    Y | X = x ∼ Poisson(µ(x)),

where µ(x) = IE[Y | X = x].

Regression function: we assume

    µ(x) = m x / (h + x),   for some unknown m, h > 0,

where:
- m is the maximum expected daily number of prey the predator can cope with;
- h is the number of prey such that µ(h) = m/2.

The regression function µ(x) for m = h = 10

[Figure: plot of µ(x) = 10x/(10 + x), increasing from 0 and saturating
toward the asymptote m = 10, with µ(10) = 5 = m/2.]

Example 2: Prey Capture Rate

Obviously µ(x) is not linear, but using the reciprocal link
g(x) = 1/x, the right-hand side can be made linear in the
parameters:

    g(µ(x)) = 1/µ(x) = (h + x)/(mx) = 1/m + (h/m)·(1/x) = β₀ + β₁·(1/x).

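A quick numeric check of this linearization (assuming NumPy; the values m = h = 10 are the slide's illustrative choice):

```python
import numpy as np

m, h = 10.0, 10.0
x = np.linspace(1.0, 50.0, 200)
mu = m * x / (h + x)                 # the regression function

# Reciprocal link: 1/mu(x) = 1/m + (h/m) * (1/x) = beta0 + beta1 * (1/x)
beta0, beta1 = 1.0 / m, h / m
print(np.allclose(1.0 / mu, beta0 + beta1 / x))   # True
```
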
Exponential Family

A family of distributions {IP_θ : θ ∈ Θ}, Θ ⊂ IR^k, is said to be a
k-parameter exponential family on IR^q if there exist real-valued
functions:

- η₁, η₂, …, η_k and B of θ,
- T₁, T₂, …, T_k, and h of y ∈ IR^q,

such that the density function (pmf or pdf) of IP_θ can be written as

    f_θ(y) = exp[ Σᵢ₌₁ᵏ ηᵢ(θ) Tᵢ(y) − B(θ) ] h(y).

Normal distribution example

- Consider Y ∼ N(µ, σ²), θ = (µ, σ²). The density is

      f_θ(y) = exp( (µ/σ²) y − (1/(2σ²)) y² − µ²/(2σ²) ) · 1/(√(2π) σ),

  which forms a two-parameter exponential family with

      η₁ = µ/σ²,   η₂ = −1/(2σ²),   T₁(y) = y,   T₂(y) = y²,
      B(θ) = µ²/(2σ²) + log(√(2π) σ),   h(y) = 1.

- When σ² is known, it becomes a one-parameter exponential
  family on IR:

      η = µ/σ²,   T(y) = y,   B(θ) = µ²/(2σ²),   h(y) = e^{−y²/(2σ²)} / (√(2π) σ).

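A numeric sanity check of the two-parameter form (assuming NumPy and SciPy; the values of µ and σ² are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.3, 2.0
y = np.linspace(-5.0, 5.0, 101)

# Exponential-family pieces for N(mu, sigma^2), with h(y) = 1
eta1, eta2 = mu / sigma2, -1.0 / (2.0 * sigma2)
B = mu**2 / (2.0 * sigma2) + np.log(np.sqrt(2.0 * np.pi * sigma2))
f = np.exp(eta1 * y + eta2 * y**2 - B)

print(np.allclose(f, norm.pdf(y, loc=mu, scale=np.sqrt(sigma2))))  # True
```
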
Examples of discrete distributions

The following distributions form discrete exponential families of
distributions with pmf:

- Bernoulli(p):  p^y (1 − p)^{1−y},  y ∈ {0, 1};
- Poisson(λ):  λ^y e^{−λ} / y!,  y = 0, 1, … .

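Writing both pmfs in the exponential-family form exp[η T(y) − B] h(y) and checking them numerically (assuming NumPy and SciPy):

```python
import numpy as np
from scipy.special import factorial
from scipy.stats import bernoulli, poisson

# Bernoulli(p): eta = log(p/(1-p)), T(y) = y, B = -log(1-p), h(y) = 1
p, y = 0.3, np.array([0, 1])
f_bern = np.exp(y * np.log(p / (1 - p)) + np.log(1 - p))
print(np.allclose(f_bern, bernoulli.pmf(y, p)))     # True

# Poisson(lam): eta = log(lam), T(y) = y, B = lam, h(y) = 1/y!
lam, y = 2.5, np.arange(6)
f_pois = np.exp(y * np.log(lam) - lam) / factorial(y)
print(np.allclose(f_pois, poisson.pmf(y, lam)))     # True
```
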
Examples of continuous distributions

The following distributions form continuous exponential families
of distributions with pdf:

- Gamma(a, b):  (1/(Γ(a) bᵃ)) y^{a−1} e^{−y/b};
  - above, a is the shape parameter and b the scale parameter;
  - reparametrizing with the mean parameter µ = ab gives

        (1/Γ(a)) (a/µ)ᵃ y^{a−1} e^{−a y/µ}.

- Inverse Gamma(α, β):  (β^α / Γ(α)) y^{−α−1} e^{−β/y}.

- Inverse Gaussian(µ, σ²):  √(σ²/(2π y³)) e^{−σ²(y−µ)²/(2µ² y)}.

Others: Chi-square, Beta, Binomial, Negative Binomial
distributions.

One-parameter canonical exponential family

- Canonical exponential family for k = 1, y ∈ IR:

      f_θ(y) = exp( (yθ − b(θ))/φ + c(y, φ) )

  for some known functions b(·) and c(·, ·).
- If φ is known, this is a one-parameter exponential family with
  θ being the canonical parameter.
- If φ is unknown, this may or may not be a two-parameter
  exponential family.
- φ is called the dispersion parameter.
- In this class, we always assume that φ is known.

Normal distribution example

- Consider the Normal density with known variance σ²:

      f_θ(y) = (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)}
             = exp{ (yµ − µ²/2)/σ² − (1/2)(y²/σ² + log(2πσ²)) }.

- Therefore θ = µ, φ = σ², b(θ) = θ²/2, and

      c(y, φ) = −(1/2)(y²/φ + log(2πφ)).

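A sketch checking that these pieces reproduce the Normal density (assuming NumPy and SciPy; µ and σ² are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def canonical_density(y, theta, phi, b, c):
    """One-parameter canonical exponential family density."""
    return np.exp((y * theta - b(theta)) / phi + c(y, phi))

# Normal with known variance: theta = mu, phi = sigma^2, b(theta) = theta^2/2
mu, sigma2 = 0.7, 1.5
b = lambda th: th**2 / 2.0
c = lambda y, ph: -0.5 * (y**2 / ph + np.log(2.0 * np.pi * ph))

y = np.linspace(-4.0, 4.0, 81)
print(np.allclose(canonical_density(y, mu, sigma2, b, c),
                  norm.pdf(y, loc=mu, scale=np.sqrt(sigma2))))   # True
```
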
Other distributions

Table 1: Exponential Family

              Normal                        Poisson        Bernoulli
Notation      N(µ, σ²)                      P(µ)           B(p)
Range of y    (−∞, +∞)                      {0, 1, 2, …}   {0, 1}
φ             σ²                            1              1
b(θ)          θ²/2                          e^θ            log(1 + e^θ)
c(y, φ)       −(1/2)(y²/φ + log(2πφ))       −log y!        0

Likelihood

Let ℓ(θ) = log f_θ(Y) denote the log-likelihood function.

The mean IE(Y) and the variance var(Y) can be derived from the
following identities.

- First identity:

      IE( ∂ℓ/∂θ ) = 0.

- Second identity:

      IE( ∂²ℓ/∂θ² ) + IE[ (∂ℓ/∂θ)² ] = 0.

Expected value

Note that

    ℓ(θ) = (Yθ − b(θ))/φ + c(Y; φ).

Therefore

    ∂ℓ/∂θ = (Y − b'(θ))/φ.

The first identity yields

    0 = IE( ∂ℓ/∂θ ) = (IE(Y) − b'(θ))/φ,

which leads to

    IE(Y) = b'(θ).

Variance

On the other hand we have

    ∂²ℓ/∂θ² + (∂ℓ/∂θ)² = −b''(θ)/φ + ( (Y − b'(θ))/φ )²,

and from the previous result,

    (Y − b'(θ))/φ = (Y − IE(Y))/φ.

Together with the second identity, this yields

    0 = −b''(θ)/φ + var(Y)/φ²,

which leads to

    var(Y) = φ b''(θ).

Example: Poisson distribution

Consider a Poisson likelihood,

    f(y) = (µ^y / y!) e^{−µ} = exp( y log µ − µ − log(y!) ).

Thus,

    θ = log µ,   b(θ) = e^θ,   φ = 1,   c(y, φ) = −log(y!),

so

    µ = e^θ,   b'(θ) = b''(θ) = e^θ = µ.

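A Monte Carlo check that IE(Y) = b'(θ) and var(Y) = φ b''(θ) in the Poisson case (assuming NumPy; the value µ = 3 is arbitrary):

```python
import numpy as np

# Poisson in canonical form: theta = log(mu), b(theta) = e^theta, phi = 1
mu = 3.0
theta = np.log(mu)
b_prime = np.exp(theta)     # b'(theta)  = e^theta = mu
b_second = np.exp(theta)    # b''(theta) = e^theta = mu

rng = np.random.default_rng(1)
y = rng.poisson(mu, size=200_000)
print(b_prime, y.mean())    # IE(Y) = b'(theta): both close to 3.0
print(b_second, y.var())    # var(Y) = phi * b''(theta): both close to 3.0
```
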
Link function

- β is the parameter of interest, and needs to appear somehow in
  the likelihood function to use maximum likelihood.
- A link function g relates the linear predictor X^⊤β to the mean
  parameter µ:

      X^⊤β = g(µ).

- g is required to be monotone increasing and differentiable, so that

      µ = g⁻¹(X^⊤β).

Examples of link functions

- For the LM, g(·) = identity.
- Poisson data. Suppose Y | X ∼ Poisson(µ(X)).
  - µ(X) > 0;
  - log(µ(X)) = X^⊤β;
  - in general, a link function for count data should map
    (0, +∞) to IR;
  - the log link is a natural one.
- Bernoulli/Binomial data.
  - 0 < µ < 1;
  - g should map (0, 1) to IR;
  - 3 choices:
    1. logit:  log( µ(X)/(1 − µ(X)) ) = X^⊤β;
    2. probit:  Φ⁻¹(µ(X)) = X^⊤β, where Φ(·) is the standard normal cdf;
    3. complementary log-log:  log( −log(1 − µ(X)) ) = X^⊤β.
  - The logit link is the natural choice.

Examples of link functions for Bernoulli response

[Figure: the two link functions plotted on (0, 1), with values ranging
over (−5, 5). In blue: g₁(x) = f₁⁻¹(x) = log( x/(1 − x) ) (logit link).
In red: g₂(x) = f₂⁻¹(x) = Φ⁻¹(x) (probit link).]

Examples of link functions for Bernoulli response

[Figure: the two inverse links plotted on (−5, 5), with values in (0, 1).
In blue: f₁(x) = eˣ/(1 + eˣ). In red: f₂(x) = Φ(x) (Gaussian CDF).]

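A sketch that reproduces both figures (assuming NumPy, SciPy, and matplotlib; the curves are fully determined by the formulas above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: the links g on (0, 1) -> IR
x = np.linspace(0.001, 0.999, 500)
ax1.plot(x, np.log(x / (1 - x)), "b", label="logit")
ax1.plot(x, norm.ppf(x), "r", label="probit")   # norm.ppf is the inverse CDF
ax1.set_ylim(-5, 5)
ax1.legend()
ax1.set_title("links g")

# Right: their inverses f = g^{-1} on IR -> (0, 1)
t = np.linspace(-5, 5, 500)
ax2.plot(t, np.exp(t) / (1 + np.exp(t)), "b", label="inverse logit")
ax2.plot(t, norm.cdf(t), "r", label="Gaussian CDF")
ax2.legend()
ax2.set_title("inverse links")

plt.show()
```
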
Canonical Link

- The function g that links the mean µ to the canonical parameter
  θ is called the canonical link:

      g(µ) = θ.

- Since µ = b'(θ), the canonical link is given by

      g(µ) = (b')⁻¹(µ).

- If φ > 0, the canonical link function is strictly increasing.
  Why? Because b''(θ) = var(Y)/φ > 0, so b' is strictly increasing,
  and hence so is its inverse (b')⁻¹.

Example: the Bernoulli distribution

- We can check that

      b(θ) = log(1 + e^θ).

- Hence we solve

      b'(θ) = e^θ/(1 + e^θ) = µ   ⟺   θ = log( µ/(1 − µ) ).

- The canonical link for the Bernoulli distribution is therefore
  the logit link.

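A sketch that recovers the canonical link by numerically inverting b' (assuming NumPy and SciPy; the bracket [−50, 50] is an arbitrary search interval):

```python
import numpy as np
from scipy.optimize import brentq

# Bernoulli: b(theta) = log(1 + e^theta), so b'(theta) = e^theta/(1 + e^theta)
b_prime = lambda th: np.exp(th) / (1.0 + np.exp(th))

def canonical_link(mu):
    """g(mu) = (b')^{-1}(mu), found by root-finding on b'(theta) - mu."""
    return brentq(lambda th: b_prime(th) - mu, -50.0, 50.0)

for mu in (0.1, 0.5, 0.9):
    print(canonical_link(mu), np.log(mu / (1 - mu)))   # the two columns agree
```
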
Other examples

            b(θ)             g(µ)
Normal      θ²/2             µ
Poisson     e^θ              log µ
Bernoulli   log(1 + e^θ)     log( µ/(1 − µ) )
Gamma       −log(−θ)         −1/µ

Model and notation

- Let (Xᵢ, Yᵢ) ∈ IR^p × IR, i = 1, …, n, be independent random
  pairs such that the conditional distribution of Yᵢ given
  Xᵢ = xᵢ has density in the canonical exponential family:

      f_{θᵢ}(yᵢ) = exp{ (yᵢθᵢ − b(θᵢ))/φ + c(yᵢ, φ) }.

- Y = (Y₁, …, Yₙ)^⊤, X = (X₁, …, Xₙ)^⊤.
- Here the mean µᵢ = IE[Yᵢ | Xᵢ] is related to the canonical
  parameter θᵢ via

      µᵢ = b'(θᵢ),

- and µᵢ depends linearly on the covariates through a link
  function g:

      g(µᵢ) = Xᵢ^⊤β.

Back to β

- Given a link function g, note the following relationship between
  β and θ:

      θᵢ = (b')⁻¹(µᵢ) = (b')⁻¹( g⁻¹(Xᵢ^⊤β) ) ≡ h(Xᵢ^⊤β),

  where h is defined as

      h = (b')⁻¹ ∘ g⁻¹ = (g ∘ b')⁻¹.

- Remark: if g is the canonical link function, then h is the
  identity.

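A quick numeric check of the remark in the Bernoulli case (assuming NumPy): if g is the logit link, then g ∘ b' is the identity, so h = (g ∘ b')⁻¹ is the identity too.

```python
import numpy as np

b_prime = lambda th: np.exp(th) / (1.0 + np.exp(th))   # mu = b'(theta)
g = lambda mu: np.log(mu / (1.0 - mu))                 # canonical (logit) link

t = np.linspace(-4.0, 4.0, 9)          # candidate values of x^T beta
print(np.allclose(g(b_prime(t)), t))   # True: g o b' = id, hence h = id
```
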
Log-likelihood

- The log-likelihood is given by

      ℓₙ(Y, X, β) = Σᵢ (Yᵢθᵢ − b(θᵢ))/φ
                  = Σᵢ (Yᵢ h(Xᵢ^⊤β) − b(h(Xᵢ^⊤β)))/φ,

  up to a constant term.
- Note that when we use the canonical link function, we obtain the
  simpler expression

      ℓₙ(Y, X, β) = Σᵢ (Yᵢ Xᵢ^⊤β − b(Xᵢ^⊤β))/φ.

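A sketch maximizing this canonical-link log-likelihood directly for Bernoulli data (assuming NumPy and SciPy; the design and coefficients are simulated, and φ = 1):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, -1.0, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def neg_log_lik(beta):
    """-l_n for Bernoulli with canonical link: b(theta) = log(1 + e^theta)."""
    eta = X @ beta
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

beta_hat = minimize(neg_log_lik, np.zeros(3), method="BFGS").x
print(beta_hat)     # close to beta_true; the optimum is unique (next slide)
```
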
Strict concavity

- The log-likelihood ℓₙ(β) is strictly concave under the canonical
  link when φ > 0. Why? Its Hessian is −Σᵢ b''(Xᵢ^⊤β) Xᵢ Xᵢ^⊤ / φ,
  which is negative definite since b'' > 0 (provided X has full
  column rank).
- As a consequence, the maximum likelihood estimator is unique.
- On the other hand, if another parameterization is used, the
  likelihood function may not be strictly concave, leading to
  several local maxima.

Concluding remarks

- Maximum likelihood for Bernoulli Y and the logit link is called
  logistic regression.
- In general, there is no closed form for the MLE and we have to
  use iterative numerical methods (e.g. Newton–Raphson or Fisher
  scoring).
- The asymptotic normality of the MLE also applies to GLMs.
