
Lecture 7

Machine Learning
Backpropagation for Logistic
Regression
Last Times:

• Machine learning, especially supervised learning


• Bias, variance, and overfitting
• Minimized an objective function, called error or cost or risk
• Gradient Descent, SGD on Empirical Risk
• We introduced the test set
Statement of the Learning
Problem

The sample must be representative of the population!

A: Empirical risk estimates in-sample risk.

B: Thus the out of sample risk is also small.
LLN: Expectations -> sample averages

Empirical Risk Minimization:

on training set (sample).
What we'd really like: population

i.e. out of sample RISK


• This is an average over our sampling distribution, if we had it
• What do we do?

Fit hypothesis $h = g_{\mathcal{D}}$, where $\mathcal{D}$ is our training sample.

Then we'd like

But:
Gradient Descent.

For a particular sample, we want:

LLN:

SGD takes gradient inside sum
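The contrast between the two update rules can be sketched in a few lines. This is a minimal illustration, assuming a squared-error loss and a linear hypothesis; the synthetic data and the names `empirical_risk`, `gradient_descent`, and `sgd` are assumptions made for the example, not part of the lecture. Full gradient descent uses the gradient of the whole average, while SGD steps along the gradient of a single term of the sum.

```python
import numpy as np

def empirical_risk(w, X, y):
    """Mean squared error of the linear hypothesis X @ w on the sample."""
    return np.mean((X @ w - y) ** 2)

def gradient_descent(X, y, lr=0.1, steps=500):
    """Full-batch gradient descent on the empirical risk."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient of the average
        w -= lr * grad
    return w

def sgd(X, y, lr=0.01, epochs=50, seed=0):
    """SGD: the gradient is taken inside the sum, one sample at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad_i = 2.0 * X[i] * (X[i] @ w - y[i])  # gradient of one term
            w -= lr * grad_i
    return w

# Synthetic sample (assumed for illustration): y = 1 + 2x + small noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=200)

w_gd = gradient_descent(X, y)
w_sgd = sgd(X, y)
print("GD :", w_gd, empirical_risk(w_gd, X, y))
print("SGD:", w_sgd, empirical_risk(w_sgd, X, y))
```

With a constant learning rate, SGD hovers near the minimizer rather than settling on it exactly, so its empirical risk is typically slightly above that of full gradient descent.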


Empirical Risk Minimization

• But we only have the in-sample risk
• Furthermore it's an empirical risk
• And it's not even a full-on empirical distribution, as N is usually
quite finite
UNDERFITTING (Bias)
vs OVERFITTING (Variance)
BALANCE THE COMPLEXITY
Is this still a test set?
Trouble:
• no discussion of the error bars on our error estimates
• "visually fitting" a hyperparameter value contaminates the test set.

The moment we use it in the learning process, it is not a test set.


Is in-sample
Approximating out-of-sample?
Hoeffding's inequality

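population fraction $\mu$, sample drawn with replacement, fraction $\nu$:

$$P(|\nu - \mu| > \epsilon) \le 2e^{-2\epsilon^2 N}$$

For hypothesis $h$, identify 1 with $h(x_i) \ne y_i$ at sample $x_i$. Then $\mu$, $\nu$ are population/sample error rates, i.e. the out-of-sample risk $R_{out}(h)$ and the in-sample risk $R_{in}(h)$. Then,

$$P(|R_{in}(h) - R_{out}(h)| > \epsilon) \le 2e^{-2\epsilon^2 N}$$

• Hoeffding inequality holds ONCE we have picked a hypothesis $h$,
as we need it to label the 1s and 0s.
• But over the training set we one by one pick all the models in the
hypothesis space $\mathcal{H}$
• best fit $g$ is among the $h$ in $\mathcal{H}$, so $g$ must be $h_1$ OR $h_2$ OR.... Say
effectively $M$ such choices:

$$P(|R_{in}(g) - R_{out}(g)| > \epsilon) \le \sum_{i=1}^{M} P(|R_{in}(h_i) - R_{out}(h_i)| > \epsilon) \le 2Me^{-2\epsilon^2 N}$$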
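Hoeffding, rephrased:

Now let $\delta = 2Me^{-2\epsilon^2 N}$.

Then, with probability at least $1 - \delta$, the out-of-sample risk $R_{out}$ and in-sample risk $R_{in}$ satisfy:

$$R_{out} \le R_{in} + \sqrt{\frac{1}{2N}\ln\left(\frac{2M}{\delta}\right)}$$

For finite effective hypothesis set size $M$, $R_{out} \to R_{in}$ as $N$ grows larger.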
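As a rough numerical illustration of the bound, the sketch below solves $\delta = 2Me^{-2\epsilon^2 N}$ for $\epsilon$, which is the standard finite-hypothesis form of Hoeffding's inequality. The function name `hoeffding_epsilon` and the default $\delta = 0.05$ are illustrative assumptions, not part of the lecture.

```python
import math

def hoeffding_epsilon(n_samples, n_hypotheses=1, delta=0.05):
    """Gap eps such that |R_in - R_out| <= eps holds with probability
    at least 1 - delta, obtained from delta = 2 M exp(-2 eps^2 N)."""
    return math.sqrt(math.log(2.0 * n_hypotheses / delta) / (2.0 * n_samples))

# The gap shrinks like 1/sqrt(N) and grows only logarithmically with M.
for n in (100, 1000, 10000):
    one = hoeffding_epsilon(n, n_hypotheses=1)
    many = hoeffding_epsilon(n, n_hypotheses=1000)
    print(f"N={n:6d}  M=1: {one:.4f}  M=1000: {many:.4f}")
```

The `n_hypotheses=1` column corresponds to a single fixed hypothesis; the larger value stands in for an effective hypothesis set size searched during fitting.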


Training vs Test

• training error approximates out-of-sample error slowly


• is test set just another sample like the training sample?
• key observation: test set is looking at only one hypothesis
because the fitting is already done on the training set. So $M = 1$
for this sample!
Training vs Test

• the test set does not have an optimistic bias like the training
set (that's why the larger effective $M$ factor)
• once you start fitting for things like hyperparameters on the test set, you can't
call it a test set any more, since we lose the tight guarantee.
• the test set has a cost: less data in the training set, which must thus
fit a less complex model.
VALIDATION
• train-test not enough as we fit for a hyperparameter on the
test set and contaminate it
• thus do train-validate-test
If we don't fit a hyperparameter

• first assume that the validation set is acting like a test set.
• validation risk or error is an unbiased estimate of the out of
sample risk.
• Hoeffding bound for a validation set is then identical to that of
the test set.
Usually we want to fit a hyperparameter

• we wrongly already attempted to do this on our previous test set.
• choose the hyperparameter combination with the lowest validation set risk.
• this validation risk has an optimistic bias, since the hyperparameters are effectively fit on the
validation set
• its Hoeffding bound must now take into account the grid size as
the effective size of the hypothesis space.
• this size from hyperparameters is typically smaller than
that from parameters.

Retrain on entire set!

• finally retrain on the entire train+validation set using the
appropriate hyperparameter combination.
• this works because training for a given hypothesis space with more data
typically reduces the risk even further.
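The train-validate-test workflow can be sketched as follows. This is a minimal example under stated assumptions: the data are synthetic, the hyperparameter is taken to be a polynomial degree, and the 60/20/20 split is an arbitrary choice; none of these come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): a cubic signal plus noise.
x = rng.uniform(-1, 1, size=300)
y = x**3 - 0.5 * x + 0.1 * rng.normal(size=300)

# Shuffle once, then split 60% train / 20% validation / 20% test.
idx = rng.permutation(len(x))
train, val, test = idx[:180], idx[180:240], idx[240:]

def risk(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

# Fit parameters on the training set for each hyperparameter value,
# and compare the candidates on the validation set only.
degrees = range(0, 9)
val_risk = {d: risk(np.polyfit(x[train], y[train], d), x[val], y[val])
            for d in degrees}
best_degree = min(val_risk, key=val_risk.get)

# Retrain on train + validation with the chosen hyperparameter,
# then touch the test set exactly once.
train_val = np.concatenate([train, val])
final_coeffs = np.polyfit(x[train_val], y[train_val], best_degree)
test_risk = risk(final_coeffs, x[test], y[test])
print(best_degree, test_risk)
```

Because `best_degree` was chosen by looking at the validation set, `val_risk[best_degree]` is optimistically biased; `test_risk` is the number to report.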
CROSS-VALIDATION
CROSS-VALIDATION is
• a resampling method
• robust to an outlier validation set
• allows for larger training sets
• allows for error estimates

Here we find .
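A minimal sketch of K-fold cross-validation, assuming synthetic data and a polynomial degree as the hyperparameter (the helper `k_fold_risk` and all data here are hypothetical, chosen only for illustration). Each fold serves once as the validation set; the mean over folds is the robust average, and the spread across folds gives a rough error bar.

```python
import numpy as np

def k_fold_risk(x, y, degree, k=5, seed=0):
    """Return the per-fold validation MSE of a degree-`degree` polynomial."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    risks = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        risks.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return np.array(risks)

# Synthetic data (assumed): a cubic signal plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = x**3 - 0.5 * x + 0.1 * rng.normal(size=200)

cv = {d: k_fold_risk(x, y, d) for d in range(0, 7)}
# (mean risk, standard error across folds) for each candidate degree.
summary = {d: (r.mean(), r.std(ddof=1) / np.sqrt(len(r))) for d, r in cv.items()}
best_degree = min(summary, key=lambda d: summary[d][0])
print(best_degree, summary[best_degree])
```

The standard error here treats folds as independent, which they are not exactly (training sets overlap), so it should be read as an approximate error bar.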
Cross Validation considerations

• view the validation process as one that estimates the out-of-sample risk directly, on the
validation set.
• Its critical use is in the model selection process.
• once you do that you can estimate the out-of-sample risk using the test set as
usual, but now you have also got the benefit of a robust average
and error bars.
• key subtlety: in the risk averaging process, you are actually
REGULARIZATION
Keep higher a-priori complexity and
impose a complexity penalty on risk instead, to choose a SUBSET of
the hypothesis space. We'll make the coefficients small:
REGULARIZATION

As we increase the regularization strength $\lambda$, coefficients go towards 0.

Lasso uses an $L_1$ penalty, which sets some coefficients to exactly 0.
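Both effects can be seen numerically. The sketch below assumes an $L_2$ (ridge) penalty of the form $\|Xw - y\|^2 + \lambda\|w\|^2$, for which a closed-form solution exists, and uses soft-thresholding as a stand-in for Lasso: that is the exact Lasso solution only for the objective $\frac{1}{2}\|Xw - y\|^2 + t\|w\|_1$ with orthonormal columns in $X$. The data and function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def soft_threshold(w, t):
    """Lasso solution for an orthonormal design: shrink by t, clip at 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# Synthetic regression problem (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, -2.0, 0.5, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Ridge: coefficients shrink smoothly towards 0 as lam grows.
for lam in (0.0, 1.0, 10.0, 100.0, 1000.0):
    w = ridge_fit(X, y, lam)
    print(lam, np.round(w, 3), round(float(np.linalg.norm(w)), 3))

# L1-style shrinkage: small coefficients become exactly 0.
print(soft_threshold(np.array([3.0, -2.0, 0.5]), 1.0))
```

Ridge coefficients get small but generally stay nonzero, whereas the soft-threshold step zeroes any coefficient whose magnitude is below `t`.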
MLE for Logistic Regression
• example of a Generalized Linear Model (GLM)
• "Squeeze" linear regression through a Sigmoid function
• this bounds the output to be a probability
• What is the sampling Distribution?
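Sigmoid function

$$h(z) = \frac{1}{1 + e^{-z}}$$

This function is plotted below:


import numpy as np
import matplotlib.pyplot as plt

# Sigmoid: maps any real z into the interval (0, 1)
h = lambda z: 1./(1+np.exp(-z))
zs=np.arange(-5,5,0.1)
plt.plot(zs, h(zs), alpha=0.5);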

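Identify: $z = \mathbf{w}\cdot\mathbf{x}$ and $h(\mathbf{w}\cdot\mathbf{x})$ with the
probability that the sample is a '1' ($y=1$).
Then, the conditional probabilities of $y=1$ or $y=0$ given a
particular sample's features $\mathbf{x}$ are:

$$P(y=1 \mid \mathbf{x}) = h(\mathbf{w}\cdot\mathbf{x}), \qquad P(y=0 \mid \mathbf{x}) = 1 - h(\mathbf{w}\cdot\mathbf{x})$$

These two can be written together as

$$P(y \mid \mathbf{x}, \mathbf{w}) = h(\mathbf{w}\cdot\mathbf{x})^{y} \left(1 - h(\mathbf{w}\cdot\mathbf{x})\right)^{1-y}$$

BERNOULLI!!
Multiplying over the samples we get:

$$P(y \mid \mathbf{x}, \mathbf{w}) = \prod_{i} h(\mathbf{w}\cdot\mathbf{x}_i)^{y_i} \left(1 - h(\mathbf{w}\cdot\mathbf{x}_i)\right)^{1-y_i}$$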

One way to think of a noisy $y$ is to imagine that our data was generated from a joint
probability distribution $p(x, y)$. Thus we need to model $y$ at a given
$x$, written as $p(y \mid x)$, and since $p(x)$ is also a probability
distribution, we have:

$$p(x, y) = p(y \mid x)\, p(x)$$

Indeed it's important to realize that a particular sample can be
thought of as a draw from some "true" probability distribution.

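Maximum likelihood estimation maximises the likelihood of the
sample $y$,

$$\mathcal{L} = P(y \mid \mathbf{x}, \mathbf{w})$$

Again, we can equivalently maximize

$$\ell = \log P(y \mid \mathbf{x}, \mathbf{w})$$

Thus

$$\ell = \sum_{i} \left[ y_i \log h(\mathbf{w}\cdot\mathbf{x}_i) + (1 - y_i) \log\left(1 - h(\mathbf{w}\cdot\mathbf{x}_i)\right) \right]$$

NLL

The negative of this log likelihood (NLL), also called cross-entropy:

$$NLL = -\sum_{i} \left[ y_i \log h(\mathbf{w}\cdot\mathbf{x}_i) + (1 - y_i) \log\left(1 - h(\mathbf{w}\cdot\mathbf{x}_i)\right) \right]$$

Gradient: $\nabla_{\mathbf{w}} NLL = \sum_i \mathbf{x}_i (p_i - y_i)$, with $p_i = h(\mathbf{w}\cdot\mathbf{x}_i)$

Hessian: $H = \sum_i p_i (1 - p_i)\, \mathbf{x}_i \mathbf{x}_i^T$, positive definite, so the NLL is convex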
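The standard expressions for the logistic-regression NLL, its gradient $\sum_i (p_i - y_i)\mathbf{x}_i$, and its Hessian $X^T \mathrm{diag}(p_i(1-p_i)) X$ can be checked numerically. The sketch below is an illustration on synthetic data; the function names and the data-generating weights are assumptions, not taken from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Negative log-likelihood (cross-entropy) of Bernoulli labels y."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def nll_grad(w, X, y):
    """Gradient of the NLL: sum_i (p_i - y_i) x_i."""
    return X.T @ (sigmoid(X @ w) - y)

def nll_hessian(w, X):
    """Hessian of the NLL: X^T diag(p_i (1 - p_i)) X."""
    p = sigmoid(X @ w)
    return X.T @ (X * (p * (1 - p))[:, None])

# Synthetic labelled sample (assumed): intercept plus two features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = (rng.uniform(size=50) < sigmoid(X @ np.array([0.5, 2.0, -1.0]))).astype(float)
w = rng.normal(size=3)

# Compare the analytic gradient with central finite differences.
eps = 1e-6
numeric = np.array([
    (nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
max_abs_diff = np.max(np.abs(numeric - nll_grad(w, X, y)))
min_eigenvalue = np.linalg.eigvalsh(nll_hessian(w, X)).min()
print(max_abs_diff, min_eigenvalue)
```

A tiny `max_abs_diff` confirms the gradient formula, and a positive `min_eigenvalue` is consistent with convexity at this `w`.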


Units based diagram
Softmax formulation

• Identify $p_i$ and $1 - p_i$ as two separate probabilities constrained
to add to 1. That is, $p_{1i} = p_i$ and $p_{2i} = 1 - p_i$.

• Can translate coefficients by a fixed amount without any change
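Both bullet points can be verified directly. This is a small sketch with arbitrary example scores `z1` and `z2` (assumed values): a two-class softmax gives two probabilities that sum to 1, shifting both scores by the same constant leaves them unchanged, and the first probability equals the sigmoid of the score difference.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max leaves the result unchanged."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z1, z2 = 1.3, -0.4  # two scores (arbitrary example values)
p = softmax(np.array([z1, z2]))
shifted = softmax(np.array([z1 + 5.0, z2 + 5.0]))

print(p, p.sum())            # two probabilities constrained to add to 1
print(shifted)               # same probabilities after the shift
print(sigmoid(z1 - z2))      # equals p[0]
```

The shift invariance is also why the `softmax` helper can subtract the maximum score for numerical stability.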


NLL and gradients for Softmax
Units diagram for Softmax
Rewrite NLL

where puts the first argument in the

numerator. Ditto for which is simply .


Units diagram Again
Equations, layer by layer
Reverse Mode Differentiation

Write as:
From Reverse Mode to Back Propagation

• Recursive Structure
• Always a vector times a Jacobian
• We add a "cost layer" to $z^4$. The derivative of this layer with
respect to $z^4$ will always be 1.
• We then propagate this derivative back.
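The "vector times a Jacobian" recursion can be written out by hand for a tiny network. This is a sketch under assumed choices (a linear layer, a `tanh` nonlinearity, and a sum-of-squares cost, none of which are from the lecture): the backward pass starts from the derivative of the cost with respect to itself, which is 1, and multiplies by one Jacobian per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(4, 3))

def forward(x):
    z1 = W @ x                       # linear layer
    z2 = np.tanh(z1)                 # elementwise nonlinearity
    cost = 0.5 * np.sum(z2 ** 2)     # scalar "cost layer"
    return z1, z2, cost

z1, z2, cost = forward(x)

# Reverse mode: start from d(cost)/d(cost) = 1 and repeatedly
# multiply a row vector by the Jacobian of the layer below.
v = np.array([1.0])
v = v @ z2[None, :]                  # Jacobian of cost w.r.t. z2, shape (1, 4)
v = v @ np.diag(1.0 - z2 ** 2)       # Jacobian of tanh, shape (4, 4)
grad_x = v @ W                       # Jacobian of W @ x is W, shape (4, 3)

# Check against central finite differences.
eps = 1e-6
numeric = np.array([
    (forward(x + eps * e)[2] - forward(x - eps * e)[2]) / (2 * eps)
    for e in np.eye(3)
])
print(grad_x, numeric)
```

In practice the diagonal Jacobian would not be materialised as a matrix; an elementwise product does the same job, which is how layer-wise `backward` methods are usually written.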
Layer Cake
Backpropagation

RULE1: FORWARD (.forward in pytorch)

RULE2: BACKWARD (.backward in pytorch)


or .
In particular:

RULE 3: PARAMETERS

(backward pass is thus also used to fill the variable.grad parts


of parameters in pytorch)
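The three rules can be mimicked with small hand-written layers. The sketch below is plain NumPy, not PyTorch, and the class names `Affine`, `Sigmoid`, and `BinaryNLL` are hypothetical: each layer has a `forward` that caches what it needs, a `backward` that turns the derivative with respect to its output into the derivative with respect to its input, and, for layers with parameters, gradient attributes filled during the backward pass.

```python
import numpy as np

class Affine:
    """z_out = W @ z_in + b, with parameter gradients stored after backward."""
    def __init__(self, n_in, n_out, rng):
        self.W = 0.1 * rng.normal(size=(n_out, n_in))
        self.b = np.zeros(n_out)

    def forward(self, z_in):
        self.z_in = z_in                              # cache input for backward
        return self.W @ z_in + self.b

    def backward(self, delta_out):
        self.grad_W = np.outer(delta_out, self.z_in)  # dC/dW (parameter rule)
        self.grad_b = delta_out                       # dC/db (parameter rule)
        return self.W.T @ delta_out                   # dC/dz_in (backward rule)

class Sigmoid:
    def forward(self, z_in):
        self.out = 1.0 / (1.0 + np.exp(-z_in))
        return self.out

    def backward(self, delta_out):
        return delta_out * self.out * (1.0 - self.out)

class BinaryNLL:
    """Cost layer: cross-entropy between probability p and label y."""
    def forward(self, p, y):
        self.p, self.y = p, y
        return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).sum())

    def backward(self):
        # Derivative of the cost w.r.t. p, seeded with dC/dC = 1.
        return (self.p - self.y) / (self.p * (1.0 - self.p))

rng = np.random.default_rng(0)
affine, squash, cost = Affine(3, 1, rng), Sigmoid(), BinaryNLL()
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])

loss = cost.forward(squash.forward(affine.forward(x)), y)   # forward pass
delta = affine.backward(squash.backward(cost.backward()))   # backward pass
print(loss, affine.grad_W, affine.grad_b)
```

For this logistic-regression stack, the backward pass reproduces the familiar gradient $(p - y)\,\mathbf{x}$ in `affine.grad_W`.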
THAT'S IT! Write your Own Layer
