Bayesian Note Part 1
Meisam Hejazinia
1/14/2013
DEA vs. Stochastic Frontier

Uncertainty

Common approaches to statistical analysis

Joint probability distribution for all observable and unobservable quantities in a problem.

The model should be consistent with the knowledge about the underlying problem and data collection.
quantities of ultimate interest

Evaluate the fit and interpret the implications of the resulting posterior distribution:

Does the model fit the data?

How sensitive are the results to the modeling assumptions in step 1?

Usually frequentists question the assumption of having the prior.

When you have a die you can come up with the prior, and then the question is whether it is fair or not. As a result it is not always difficult to know the prior.

A Markov chain is used for this purpose: you plug the priors in.

A Markov chain allows you to separate the probabilities and get the estimate more simply.

You should always be worried about whether your estimate is reasonable.

What does a confidence interval mean in the frequentist approach?

Common sense interpretations of statistical conclusions:

Bayesian (probability) interval for an unknown quantity of interest: directly implies a high probability of containing the unknown quantity.

Frequentist (confidence) interval: implies a sequence of similar inferences made in repeated practice.

Logit can be approximated by probit.

If you do not care about six degrees of freedom you can use probit, and you will get the logit estimate.

You should be able to justify the method you use.

In general, statistical inference attempts to draw conclusions about quantities that are not observed (estimands) based on numerical data. Estimands come in two flavors: a) potentially observable (e.g. future observations), and b) not directly observable (e.g. regression coefficients):

θ: population parameters (vector of unobservable quantities)
y: observed data
ȳ: potentially observable quantities
α, β, γ: parameters
x, w: observable/observed scalars and vectors
X, W: observable/observed matrices

When you can see and count something you have observed that thing.

You assume some things to be known, and you draw the unknown things out of those observables.

An example of an unobserved quantity is the coefficient of gravity; the drops, however, can be counted, so they are observable.

Random variables:

Outcome variables y are considered random: each observed value could have turned out different due to the sampling process and natural variation in the population.

The joint probability density p(y) is invariant to permutations of the indexes (exchangeability).
Usually data from an exchangeable distribution are modeled as i.i.d. given some unknown parameter vector θ.

Bayesian Inference

Conclusions about parameters are made in terms of probability statements:

$p(\theta \mid y)$

This probability is explicitly conditional on the observed outcomes, and implicitly conditional on any covariates.

$p(\theta, y) = p(\theta) \, p(y \mid \theta)$

$p(\theta \mid y) = \frac{p(\theta, y)}{p(y)} = \frac{p(\theta) \, p(y \mid \theta)}{p(y)}$

$p(y) = \sum_{\theta} p(\theta) \, p(y \mid \theta)$

Since $p(y)$ does not depend on θ and the posterior has to sum to one,

$p(\theta \mid y) \propto p(\theta) \, p(y \mid \theta)$

Bayesian Inference:

$p(\theta \mid y) \propto p(\theta) \, p(y \mid \theta)$

This is the technical essence of Bayesian inference:

1. develop the model $p(\theta, y)$, and
2. perform the necessary computation to summarize $p(\theta \mid y)$

You start with a normal distribution, or any other distribution, and you simulate many random numbers.

Markov chains and random number generation have nothing to do with Bayesian analysis as such, but you need to be able to do them in MATLAB.

Posterior odds:

For the next session provide a written description of the data that you are interested in. You should try to work on data that you like and understand to run your homework on.

A good book is the "Greenberg" book, "Introduction to Bayesian Statistics".

A uniform prior puts huge weight on the tails. Typically, if you have data on what the prior should be, try to use that as the prior rather than a uniform.

Hemophilia example

Males have X-Y chromosomes, and females have X-X chromosomes.

Hemophilia is inherited through the X chromosome.

Males are affected with just one "bad" chromosome, but females are affected only with two bad genes.

Consider a woman who has an affected brother and unaffected parents, which implies that her mother has one "bad" chromosome.

The unknown quantity of interest is whether the woman is herself a carrier of the "bad" gene (θ = 1) or not a carrier (θ = 0).

The prior distribution based on the presented information is:

$p(\theta = 1) = p(\theta = 0) = 0.5$

The model and likelihood:

Outcome variable $y_i = 0$ if son i is not affected and $y_i = 1$ if son i is affected.

The woman has two unaffected sons who are not identical twins:

$y_1 = 0, \; y_2 = 0$
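The posterior for this example can be worked out with Bayes' rule in a few lines. The following is a minimal Python sketch, not the instructor's code. It assumes (the notes do not state this) that a carrier mother passes the "bad" chromosome to each son independently with probability 0.5 and that a non-carrier mother cannot have an affected son. The function name `carrier_posterior` is illustrative.

```python
def carrier_posterior(prior_carrier, n_unaffected_sons, p_affected_if_carrier=0.5):
    """Posterior probability that the woman is a carrier (theta = 1)
    after observing n unaffected sons."""
    # Likelihood of the data under each value of theta.
    # Assumption: sons are conditionally independent given theta.
    lik_carrier = (1.0 - p_affected_if_carrier) ** n_unaffected_sons
    # Assumption: a non-carrier mother cannot have an affected son.
    lik_not_carrier = 1.0

    # Bayes' rule: p(theta | y) = p(theta) p(y | theta) / p(y)
    joint_carrier = prior_carrier * lik_carrier
    joint_not_carrier = (1.0 - prior_carrier) * lik_not_carrier
    return joint_carrier / (joint_carrier + joint_not_carrier)


posterior = carrier_posterior(prior_carrier=0.5, n_unaffected_sons=2)
print(round(posterior, 3))
```

Under these assumptions the calculation is 0.125 / (0.125 + 0.5) = 0.2, so two unaffected sons lower the carrier probability from the prior value of 0.5.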
$p(\theta \mid y) = \frac{p(\theta) \, p(y \mid \theta)}{p(y)}$

$p(\theta = 1 \mid y_1 = 0, y_2 = 0) = \frac{p(\theta = 1) \, p(y_1 = 0, y_2 = 0 \mid \theta = 1)}{p(y_1 = 0, y_2 = 0)}$

Transformation of variables:

Suppose $p(u)$ is the density of the vector u, and we need a transformation $v = f(u)$.

Discrete distribution $p(\cdot)$:

Density $q(v) = p(f^{-1}(v))$

Continuous distribution $p(\cdot)$:

Density $q(v) = |J| \, p(f^{-1}(v))$

where $|J|$ is the absolute value of the determinant of the Jacobian of the transformation $u = f^{-1}(v)$.

Example for the univariate normal distribution:

If $u \sim N(\mu, \sigma^2)$, what is the distribution of $v = \frac{u - \mu}{\sigma}$?

$J = \frac{\partial u}{\partial v} = \sigma$

Need to review the Jacobian. This is the kind of material that will be asked on the midterm.

$E(u) = \iint u \, p(u, v) \, du \, dv = \iint u \, p(u \mid v) \, du \, p(v) \, dv = \int E(u \mid v) \, p(v) \, dv = E(E(u \mid v))$

$E(\theta) = E(E(\theta \mid y))$

Both these approaches sound eminently reasonable, to the point that the differences between them sound subtle to the point of unimportance.

Frequentist: "The world is a certain way, but I don't know how it is. Further, I can't necessarily tell how the world is just by collecting data, because data are always finite and noisy. So I'll use statistics to line up the alternative possibilities, and see which ones the data more or less rule out." Frequentists calculate $P(D \mid H)$, treating the data as random and the hypothesis as fixed, where H is some "null" hypothesis.

A frequentist interprets the word probability as meaning the frequency with which something would happen in a lengthy series of trials.

A frequentist would say that the data are consistent with the hypothesis of life on Mars.

A Bayesian basically says: "I don't know how the world is. All I have to go on is finite data. So I'll use statistics to infer something from those data about how probable different possible states of the world are." Bayesians calculate $P(H \mid D)$.

The Bayesian interpretation of probability (though not the only one) is as a subjective degree of belief: the probability that you (personally) attach to a hypothesis is a measure of how strongly you (personally) believe that hypothesis.

A Bayesian would say "There's probably not life on Mars."

There are contexts in which Bayesian and frequentist statistics easily coexist.

1. Difference in philosophy
2. Ease of calculation
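Returning to the transformation-of-variables example in this part of the notes: the rule $q(v) = |J| \, p(f^{-1}(v))$ can be checked numerically for $v = (u - \mu)/\sigma$. The sketch below is only an illustration in Python. The values of `mu` and `sigma` are arbitrary and the function names are not from the notes.

```python
import math


def normal_pdf(u, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at u."""
    return math.exp(-0.5 * ((u - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def transformed_density(v, mu, sigma):
    """Density of v = (u - mu) / sigma obtained from q(v) = |J| p(f^{-1}(v))."""
    u = mu + sigma * v        # inverse transformation u = f^{-1}(v)
    jacobian = abs(sigma)     # |du/dv| = sigma
    return jacobian * normal_pdf(u, mu, sigma)


mu, sigma = 3.0, 2.0  # arbitrary illustrative values
for v in (-1.5, 0.0, 0.7):
    # The two printed densities should agree: v follows N(0, 1).
    print(v, transformed_density(v, mu, sigma), normal_pdf(v, 0.0, 1.0))
```

The factor σ from the Jacobian cancels the 1/σ in the normal density, which is why the result is the standard normal density.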
Seminar of Andrei @ UTD: Second session
Meisam Hejazinia
1/28/2013
Observed heterogeneity could be captured from the individual characteristics, or through covariates.

DEA (Data Envelopment Analysis) helps you check the efficiency frontier.

There are two kinds of effect: random effects and fixed effects. With a fixed effect, you would think of it as an intercept. It is important to capture not only the fixed effect but the random effect as well.

The Bayesian approach can merge them and do that simply, assuming that each individual has a different value of the variable. It would not be complicated; it would be simpler.

Heterogeneity in the Bayesian approach would be a random effect. In the frequentist approach, $\beta_i$ would differ across individuals according to a distribution. The data would be in panel form. You just do simulation. You would not have the problem of checking whether a maximum is local or global. The calculation may take overnight, but it would be simple.

It is not important what method of estimation you use. Random numbers for the simulation are generated by computer based on the distribution. We use MCMC in the Bayesian approach: we split the problem into separate pieces and use a Markov chain to see what we can draw, and the draws will converge.

In the Bayesian approach, after simulation you know the distribution of the random variable.

Informative prior distribution

Population interpretation: all possible parameter values

State of knowledge interpretation: use subjective knowledge about the parameters

The prior should include all plausible parameter values, but need not be concentrated around the true value.

Conjugate prior distribution

If F is a class of sampling distributions $p(y \mid \theta)$ and P is a class of prior distributions for θ, then the class P is conjugate for F if:

$p(\theta \mid y) \in P$ for all $p(\cdot \mid \theta) \in F$ and $p(\cdot) \in P$

Exponential family distribution

These are the only classes of distributions that have natural conjugate priors.

$p(y_i \mid \theta) = f(y_i) \, g(\theta) \, e^{\phi(\theta)^{T} u(y_i)}$

$p(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi} \, \sigma} e^{-\frac{(y - \mu)^2}{2\sigma^2}}$

Expanding the square, the exponent is $-\frac{1}{2\sigma^2} y^2 + \frac{2\mu}{2\sigma^2} y - \frac{\mu^2}{2\sigma^2}$, so with $u(y) = [y^2, \; y]^{T}$:

$\phi = \left[ -\frac{1}{2\sigma^2}, \; \frac{2\mu}{2\sigma^2} \right]^{T}$

$\frac{1}{\tau^2}$ is called the precision, which is the inverse of the variance.

Exponential family

Matrix algebra
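The exponential-family form of the normal density can be verified numerically. The sketch below is a Python illustration under one particular choice of factors, which is an assumption since the factorization is not unique: $f(y) = 1$, $g(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\mu^2/(2\sigma^2)}$, $\phi = [-1/(2\sigma^2), \; \mu/\sigma^2]$ and $u(y) = [y^2, \; y]$. The function names are illustrative.

```python
import math


def normal_pdf(y, mu, sigma2):
    """Normal density written in the usual form."""
    return math.exp(-((y - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def normal_pdf_expfam(y, mu, sigma2):
    """Same density written as f(y) * g(theta) * exp(phi(theta)^T u(y))."""
    f = 1.0  # f(y): no factor depends on y alone
    g = math.exp(-(mu ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
    phi = (-1.0 / (2.0 * sigma2), mu / sigma2)  # natural parameter
    u = (y ** 2, y)                              # sufficient statistic
    inner = phi[0] * u[0] + phi[1] * u[1]        # phi^T u
    return f * g * math.exp(inner)


for y in (-1.0, 0.5, 2.0):
    # Both columns should agree up to floating-point error.
    print(y, normal_pdf(y, 1.0, 4.0), normal_pdf_expfam(y, 1.0, 4.0))
```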
Seminar of Andrei @ UTD: Third session
Meisam Hejazinia
04/04/2013
Only when a closed-form solution is not available do you need to resort to simulation and Markov chains.

How many times you want to draw is a question of choosing the sample size.

Significance test, 95% interval: 2.5% on each edge of the distribution.

Markov chains should converge at some point.

The vectors of parameters should converge.

The full distribution will be separated into parts, and you want to make sure that all the conditional distributions converge.

If you can do it directly then the Markov chain is not needed.

Split it into chunks, and separate and condition them. Split the joint distribution into conditional distributions.

We can simulate from the full distribution directly. You draw from the conditionals and the marginals, and then you are able to simulate from it.

We can split $p(a, b, c \mid y) = p(a \mid y) \, p(b \mid a, y) \, p(c \mid a, b, y)$.

You can always create $p(a \mid b, c, y)$, $p(b \mid a, c, y)$, $p(c \mid a, b, y)$, and you can draw from these. You need to be careful about where the point is.

The states for the above would be the values of (a, b, c) at each of the times. You condition on the previous time's values of these states, and as you condition on them you get new values. When the values of the states no longer change systematically relative to the previous time's values, we say the chain has converged.

To calculate the likelihood we multiply everything.

$p(\mu, \sigma^2) = p(\mu) \, p(\sigma^2)$ with $p(\mu) \propto 1$ and $p(\sigma^2) \propto \sigma^{-2}$. Starting from $p(\ln \sigma^2) \propto 1$ you get $p(\sigma^2)$ by taking the Jacobian. This is not about the Jeffreys prior, since that would need the information matrix. Any function could be proportional to 1, so this is not a problem.

To simulate, you first draw µ, then draw σ, and then plug them in. As a result, you take the marginals first.

The main reason we calculate it this way is that we are not able to calculate the Gamma.

Any CDF, evaluated at its own random variable, is uniformly distributed between zero and one, so you apply the inverse CDF to a uniform random number to generate a random number with a specific distribution.

As long as you know the distribution, the method is what we do here.

Simulate the data, compare distributions.

To check whether the model and simulation worked, you abstract the model from the data, generate the outcome, and compare it with the real data.
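The inverse-CDF idea noted in this session (a CDF maps its variable to a uniform on zero to one, so the inverse CDF maps a uniform draw back to the target distribution) can be sketched as follows. This Python example uses the exponential distribution only because its inverse CDF has a closed form; the distribution, the rate value and the function name are illustrative choices, not from the notes.

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw from Exponential(rate) by inverting the CDF F(x) = 1 - exp(-rate * x)."""
    u = rng.random()                  # u ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate  # F^{-1}(u)


rng = random.Random(0)  # fixed seed so the run is reproducible
rate = 2.0
draws = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean)  # should be close to the theoretical mean 1 / rate
```

The same recipe works for any distribution whose inverse CDF can be evaluated, which is the sense in which knowing the distribution is enough to simulate from it.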
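You, as a researcher, must improve the model, and you should show that you captured a dynamic that had not been captured previously.

Derive the following:

$y_t = x_t \beta + \varepsilon_t, \quad t = 1, \ldots, n, \quad \varepsilon_t \sim N(0, \sigma^2)$

$p(\beta \mid \sigma^2, y)$

$p(\sigma^2, \beta \mid y)$

$\theta = (\sigma^2, \beta)^{T}$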
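One way to see how the conditional distributions for this regression are used is a small Gibbs sampler. The sketch below is a simplified Python illustration, not the derivation requested above. It assumes a single regressor, errors $\varepsilon_t \sim N(0, \sigma^2)$ and the prior $p(\beta, \sigma^2) \propto \sigma^{-2}$, under which $\beta \mid \sigma^2, y$ is normal around the OLS estimate and $\sigma^2 \mid \beta, y$ is inverse-gamma. The data are simulated and all names are hypothetical.

```python
import math
import random


def gibbs_regression(x, y, n_draws, rng, sigma2_init=1.0):
    """Gibbs sampler for y_t = x_t * beta + eps_t with one regressor.
    Assumed prior: p(beta, sigma2) proportional to 1 / sigma2."""
    n = len(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    beta_hat = sxy / sxx  # OLS estimate, the mean of beta | sigma2, y

    sigma2 = sigma2_init
    draws = []
    for _ in range(n_draws):
        # beta | sigma2, y ~ N(beta_hat, sigma2 / sum(x^2))
        beta = rng.gauss(beta_hat, math.sqrt(sigma2 / sxx))
        # sigma2 | beta, y ~ Inverse-Gamma(n / 2, SSR / 2), drawn as 1 / Gamma
        ssr = sum((yi - xi * beta) ** 2 for xi, yi in zip(x, y))
        precision = rng.gammavariate(n / 2.0, 2.0 / ssr)  # arguments: shape, scale
        sigma2 = 1.0 / precision
        draws.append((beta, sigma2))
    return draws


# Simulated data with known parameters (illustrative values only).
rng = random.Random(1)
true_beta, true_sigma = 2.0, 1.0
x = [rng.uniform(-2.0, 2.0) for _ in range(200)]
y = [true_beta * xi + rng.gauss(0.0, true_sigma) for xi in x]

draws = gibbs_regression(x, y, n_draws=3000, rng=rng)
kept = draws[500:]  # discard the first draws as burn-in
beta_mean = sum(b for b, _ in kept) / len(kept)
sigma2_mean = sum(s for _, s in kept) / len(kept)
print(beta_mean, sigma2_mean)
```

Each iteration conditions on the most recent value of the other parameter, which is the "condition on the previous values of the states" step described in this session.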
Seminar of Andrei @ UTD: Fourth session
Meisam Hejazinia
02/11/2013
In the Bayesian approach you are interested in the whole distribution and not just the estimate. This is why we worked on both β and $\sigma^2$, and tried to estimate both at the same time.

You can normalize it later to make sure that the area is one. Here you just try to find the distribution up to proportionality.

This is called the Gibbs sampler, and you will find out more about it when we talk about the Metropolis-Hastings algorithm.

You look at the relation and drop anything that cancels out, meaning anything that is not a function of the parameter of interest.

If the priors are not related through sigma squared it would be very difficult.

It is semi-conjugate since beta and sigma squared are independent. There are situations in which they are not independent, but unless you have a reason to think they are dependent, do not relate them.

You don't need to use this method mechanically.

The simplest way to simulate is to simulate directly.

Things can go wrong anyway; just make sure you don't give up.

Do not spend too much time on finding out about the prior.

When you tighten the range of the variance you may need to run many simulations to converge to your number.

Moments could be compared to check whether the result is the same or not.

Multicollinearity should be handled by the frequentist approach.

The first step is to take the parameters from the data, develop the model for random x and random y, and check whether the code works.

The second step is to take x from the data, simulate y based on your code, and then check how the real y compares with the simulated y.

Comparing simulated y and real y could be done through a histogram, or through checking the moments.

Informal report.

HW: Check this model on your own data.

What priors? How sensitive is the simulation to them?

Remove an assumption and check what the simulation looks like.

The frequentist approach integrates over beta and sigma; here it is simulation based.

Based on the output that we took from the code we take today.

Second homework

$H_0: p = 0.5$
$H_1: p \neq 0.5$

Frequentist:

Bayes factor:
Seminar of Andrei @ UTD: Fifth session
Meisam Hejazinia
02/18/2013
The midterm will be on the 18th of March.

OLS and the Bayesian approach should give the same estimate, so I need to check why mine is different.

With a small number of parameters, it would be a problem.

We will start with the probit model and then we will deal with the marginal likelihood.

Next time we will talk about Metropolis-Hastings, and Markov chains will be discussed there.

When we don't observe the full variable.

Utility is some function of the price.

You would need to ask everybody for their utility. The person using the product does not know what the utility is. The utility of buying something is greater than that of not buying.

If I decide to buy something, it means my utility of buying it is greater.

Two possibilities: buy or not.

Multinomial and multivariate models have the same kind of logic.

H is the link function for the probability of something happening; it could be anything. Usually logit is used, and here it is probit.

Logit has a fatter tail, and the normal distribution of probit has a thinner tail.

Multinomial: blue bus and red bus. There is a connection between them. Student t has a fatter tail. The parameters will be distinguished. Theoretically they may be different due to the shape.

Probit is much more complicated. IIA holds in logit. With a 15-variate choice, a reasonably large number, no computer can evaluate it.

With logit it is simple to write the likelihood and estimate it.

In the Bayesian approach the exact opposite is true. Logit does not have a conjugate family. Metropolis-Hastings would be used for the logit. Student t would be different, yet logit would be different.

For the multinomial problem you need to think differently. There are different ways to deal with mixed logit, but when you use probit there would not be a problem.

You try to get the best guess, to understand what the underlying utility is.

A job candidate can get the utility based on analysis.

The main point is to recover the complex structure.

Trick: connect to the data that you have, Y, which could be 0 or 1. When you take the difference it is like you get the result of a comparison.

We have 15 choices, and you can recover the number of choices. It is similar to what you have in the frequentist approach.

Introducing Z can help you to do the regression. Ultimately you don't know what Z is.

Homework: code it indirectly to find out how it will work. We will simulate from the normal distribution until we get a draw from the main (target) distribution.

It is a Markov chain, so it should always be conditional on all the current values.

For Bayesian analysis people usually use R.

Hypothesis testing: it is for model selection. What generated this data? You want to know whether the data came from this assumption or another.

The data are 115 out of 200, and the likelihood would be

exp(gammaln(201) − gammaln(116) − gammaln(86) + 115 ∗ log(p) + 85 ∗ log(1 − p))

for the calculation of the difference in the value of the likelihood.

If we set p to 0.5 and to 115/200 we get two different values, about 0.006 in the first case and about 0.057 in the second, which tells you that p is unlikely to really be 0.5.

We try to get a nonparametric model as we integrate them.

To calculate the Bayesian quantity you need to integrate over the beta, and you will get the following:

it would be $\Gamma(202) / (\Gamma(116) \, \Gamma(86))$, and you then convert to factorials to find the number.

Since we know that the integral of the beta density is one, we used the beta.

This integration will be equal to the ratio that we calculated, meaning 0.006/0.057; the first one does not have any parameter.

"Decisive" means we can reject the null. It shows strong confidence that this model is better.

Homework: calculate the marginal likelihood for your regression model.

We need to be careful about how we select the priors: when taking the ln we would have a problem, since it diverges for small values.

Homework: linear regression, binary probit, and another model; try to run them on your models.

Three people should present their findings.
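The likelihood expression from this session translates directly from `gammaln` to Python's `math.lgamma`. The sketch below evaluates it at p = 0.5 and p = 115/200, and also computes a marginal likelihood for the alternative. The uniform prior on p under $H_1$ is an assumption made here for illustration (it is what turns the integral into a beta function), and the function names are hypothetical.

```python
import math


def log_choose(successes, trials):
    """Log of the binomial coefficient, computed with log-gamma functions."""
    return (math.lgamma(trials + 1) - math.lgamma(successes + 1)
            - math.lgamma(trials - successes + 1))


def binomial_likelihood(p, successes=115, trials=200):
    """Binomial likelihood of the data at a fixed value of p."""
    return math.exp(log_choose(successes, trials)
                    + successes * math.log(p)
                    + (trials - successes) * math.log(1.0 - p))


def marginal_likelihood_uniform_prior(successes=115, trials=200):
    """Likelihood integrated over p on [0, 1] with a uniform prior (assumed).
    The integral of p^k (1 - p)^(n - k) is the beta function B(k + 1, n - k + 1)."""
    log_beta = (math.lgamma(successes + 1) + math.lgamma(trials - successes + 1)
                - math.lgamma(trials + 2))
    return math.exp(log_choose(successes, trials) + log_beta)


lik_null = binomial_likelihood(0.5)        # likelihood under H0: p = 0.5
lik_mle = binomial_likelihood(115 / 200)   # likelihood at the sample proportion
marginal_alt = marginal_likelihood_uniform_prior()
bayes_factor_01 = lik_null / marginal_alt  # evidence for H0 relative to H1
print(lik_null, lik_mle, marginal_alt, bayes_factor_01)
```

Working on the log scale and exponentiating only at the end avoids overflow in the factorials. The resulting Bayes factor depends on the prior chosen under the alternative, so a different prior would give a different number.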
Seminar of Andrei @ UTD: Sixth session
Meisam Hejazinia
02/25/2013
In the previous code we condition on z0, so you need to replace z1 with z0.

A burn-in period is used sometimes.

You need to check whether all the parameters have converged or not.

The burn-in period is there to make sure that the chain has converged.

You can look at the graph instead.

Today we will talk about Metropolis-Hastings.

The marginal likelihood tells us how to select the model.

In the Bayesian approach you would not be able to throw everything in; you should just think about whichever covariates make sense.

When you have many different variables, factor analysis could be a proper method.

Every source is unique, and you cannot use the same code for everything.

R has building blocks built in that you can use to do the updating.

The week after the spring break you need to provide the PowerPoint slides showing your data, what you are planning to do, and the model you will

Heterogeneity could be included in the model in the form of bimodal priors.

Poisson model for quantity, and the number of products bought in each category.

Multinomial choice.

Characteristics of the optimal pricing policy: when to mark down and by how much.

Generate multinomial choice models, and aggregate them. You need to make sure that the result is unique.

The characteristics of the product could give you what would be up for most of the time.

The optimal markdown policy for a specific group.

On March 25 you need to present your papers in the same way.

You should not choose the prior so that it becomes the only determining factor.

MCMC Simulation.

The paper of Greenberg and Chib. Read the paper.

Transition probability. Two states between which you can go.
The only thing that matters is the transition matrix.

$P(x, A)$ in this paper means $P(A \mid x)$.

$p(x, R^d) = 1$.

Metropolis-Hastings puts it in reverse. We know the distribution of the final result. The question is where to go so that the distribution finally converges.

Invariant distribution means converging.

The function being reversible means
$\pi(x) \, p(x, y) = \pi(y) \, p(y, x)$

It is like the Gibbs sampler in that you draw, but like the accept-reject algorithm in that you may remain at the current state. As a result it is something in between.

The likelihood would be logit here.

$p(y_t \mid \beta) = \left( \frac{e^{x_t \beta}}{1 + e^{x_t \beta}} \right)^{y_t} \left( \frac{1}{1 + e^{x_t \beta}} \right)^{1 - y_t}$

$p(\beta) \propto 1$

Up to an additive constant,

$\log p(\beta \mid y) = \log \prod_{t=1}^{n} \frac{(e^{x_t \beta})^{y_t}}{1 + e^{x_t \beta}} = \sum_{t=1}^{n} \left[ y_t \log(e^{x_t \beta}) - \log(1 + e^{x_t \beta}) \right]$

With probit you can use the Gibbs sampler; with logit you cannot.

Read today's paper and the new paper. Read them carefully, so that in case any question arises, we can resolve it carefully.
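The logit posterior above, with the flat prior $p(\beta) \propto 1$, can be sampled with a random-walk Metropolis-Hastings step. The following Python sketch is a simplified illustration, not the algorithm as presented in the paper. It assumes a single regressor, a symmetric normal proposal with a hand-picked step size, and simulated data. All names and values are hypothetical.

```python
import math
import random


def log_posterior(beta, x, y):
    """Log posterior (up to a constant) for a one-regressor logit with flat prior:
    sum over t of y_t * x_t * beta - log(1 + exp(x_t * beta))."""
    total = 0.0
    for xt, yt in zip(x, y):
        eta = xt * beta
        # Numerically stable evaluation of log(1 + exp(eta)).
        if eta > 0:
            log1pexp = eta + math.log1p(math.exp(-eta))
        else:
            log1pexp = math.log1p(math.exp(eta))
        total += yt * eta - log1pexp
    return total


def random_walk_metropolis(x, y, n_draws, step, rng, beta_init=0.0):
    """Random-walk Metropolis-Hastings with a symmetric normal proposal."""
    beta = beta_init
    current = log_posterior(beta, x, y)
    draws, accepted = [], 0
    for _ in range(n_draws):
        proposal = beta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, x, y)
        # Accept with probability min(1, pi(proposal) / pi(beta)).
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
            accepted += 1
        draws.append(beta)  # on rejection the chain stays at the current state
    return draws, accepted / n_draws


# Simulated data with a known coefficient (illustrative values only).
rng = random.Random(2)
true_beta = 1.0
x = [rng.uniform(-2.0, 2.0) for _ in range(300)]
y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-true_beta * xt)) else 0 for xt in x]

draws, acceptance_rate = random_walk_metropolis(x, y, n_draws=4000, step=0.3, rng=rng)
kept = draws[1000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean, acceptance_rate)
```

The rejection branch is what makes the sampler "something in between": a new value is drawn every iteration, as in Gibbs sampling, but the chain repeats its current state whenever the proposal is rejected.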